Accessing Data

Overview

The source climate projections produced for California’s Fifth Climate Change Assessment that underlie the Cal-Adapt: Analytics Engine are freely available to the public. The projections can be accessed through Amazon’s Registry of Open Data and are hosted on Amazon Web Services (AWS) as part of the Cal-Adapt: Analytics Engine data catalog, “Co-Produced Climate Data to Support California’s Resilience Investments.”

The Open Data program makes data publicly available to democratize access, enable the use of cloud-optimized datasets, techniques, and tools in cloud-native formats, and build communities that benefit from these shared datasets. Because the Analytics Engine is part of the Open Data program, all climate projections are already publicly accessible.

How can users access, visualize, and download climate data on the Analytics Engine?

There are many ways to access, visualize, and download data from the Analytics Engine (AE), depending on a user’s level of comfort with writing code:

Analytics Engine JupyterHub: Energy-sector users with an [Analytics Engine JupyterHub](/sign-in/) login can access co-produced, pre-developed Jupyter notebooks to analyze, subset, and utilize existing Analytics Engine data. The code and data can be customized for individual needs through the JupyterHub web interface, which provides private cloud storage through Amazon Web Services. Users can then export data in a variety of formats to cloud storage or a local machine. This solution requires an Analytics Engine login, though all of our notebooks are publicly available via our CAE-Notebooks GitHub repository.

Analytics Engine Data Catalog: Anyone can retrieve, subset, visualize, and export data from the Analytics Engine [Data Catalog](/data/catalog), which allows for easy programmatic access to the data by using functions from the ClimaKitAE library, the Analytics Engine’s open-source Python library. This solution does not require an Analytics Engine login but does require familiarity with Python code.

AWS Explorer Functions: Anyone can utilize existing AWS Explorer functions within the Analytics Engine S3 bucket to explore and download existing data through a web browser. Data can be accessed by navigating the file structure to identify and select data to download. This solution does not require an Analytics Engine login and does not require coding.

AWS Command Line Interface (CLI): If users wish to download large amounts of data at once, they can do this by installing the free AWS Command Line Interface (CLI) tool. The AWS CLI is an open-source tool that enables the use of the command-line shell to directly access and interact with the Analytics Engine S3 bucket. This tool simplifies the process of downloading all of the data within a given dataset. This solution does not require an Analytics Engine login but does require some familiarity with shell scripting.

Data Download Tool: Cal-Adapt is developing a new Data Download Tool (DDT) to interact with the Analytics Engine Data Catalog. The DDT allows users to quickly filter LOCA2 climate datasets and download data for multiple models at a time, with options for spatial and temporal aggregation. The tool is currently available in a beta version, and additional functionality will be added in the coming months. To provide feedback, or if you have any questions, please contact support@cal-adapt.org.

Step-by-step instructions for each of the above data access approaches can be found in the “Data access methods” section below.

Example applications of Analytics Engine data and notebooks

To see how Analytics Engine data and notebooks can be used to answer common adaptation planning questions, please see the Example Applications on the Applications / Notebooks webpage.

Data access methods

Analytics Engine Data Catalog

Anyone can retrieve, subset, visualize, and export data using functions from ClimaKitAE, our documented, open-source Python library. As part of building this package, we created an intake Data Catalog. The catalog allows easy programmatic access to the data used on the Cal-Adapt: Analytics Engine; specifically, it is an intake-ESM catalog, a format designed for climate data. The catalog can be opened in Python using the URL of the catalog file.

Connect to data using the Intake Python package:

  1. The first step is to install the following Python packages:
pip install intake-esm s3fs zarr netcdf4
  2. This should install the other needed dependencies in the Python environment. The next step is to open the catalog:
import intake
cat = intake.open_esm_datastore(
    'https://cadcat.s3.amazonaws.com/cae-collection.json'
)
  3. To see the unique attributes contained in the database, run:
cat.unique()

activity_id                                                 [LOCA2, WRF]
institution_id                                         [UCSD, CAE, UCLA]
source_id              [ACCESS-CM2, CESM2-LENS, CNRM-ESM2-1, EC-Earth...
experiment_id           [historical, ssp245, ssp370, ssp585, reanalysis]
member_id              [r1i1p1f1, r2i1p1f1, r3i1p1f1, r10i1p1f1, r4i1...
table_id                                          [day, mon, yrmax, 1hr]
variable_id            [hursmax, hursmin, huss, pr, rsds, tasmax, tas...
grid_label                                               [d03, d01, d02]
path                   [s3://cadcat/loca2/ucsd/access-cm2/historical/...
derived_variable_id                                                   []
dtype: object
  4. The output shows the variety of data stored in the catalog. The AWS S3 Explorer may be used to understand how the data are organized (WRF or LOCA2), or you can query the intake database directly:
cat_subset = cat.search(activity_id="LOCA2")
cat_subset.unique()

activity_id                                                      [LOCA2]
institution_id                                                    [UCSD]
source_id              [ACCESS-CM2, CESM2-LENS, CNRM-ESM2-1, EC-Earth...
experiment_id                       [historical, ssp245, ssp370, ssp585]
member_id              [r1i1p1f1, r2i1p1f1, r3i1p1f1, r10i1p1f1, r4i1...
table_id                                               [day, mon, yrmax]
variable_id            [hursmax, hursmin, huss, pr, rsds, tasmax, tas...
grid_label                                                         [d03]
path                   [s3://cadcat/loca2/ucsd/access-cm2/historical/...
derived_variable_id                                                   []

cat_subset.unique()['source_id']

['ACCESS-CM2', 'CESM2-LENS', 'CNRM-ESM2-1', 'EC-Earth3', 'EC-Earth3-Veg',
'FGOALS-g3', 'GFDL-ESM4', 'HadGEM3-GC31-LL', 'INM-CM5-0', 'IPSL-CM6A-LR',
'KACE-1-0-G', 'MIROC6', 'MPI-ESM1-2-HR', 'MRI-ESM2-0', 'TaiESM1']
  5. Further refinement of the catalog search supports finding a particular dataset using the available attributes (not all combinations are possible). Again, the AWS S3 Explorer is useful for figuring out which attributes are needed:
cat_1model = cat.search(
    activity_id="LOCA2",
    source_id="ACCESS-CM2",
    experiment_id="historical",
    member_id="r1i1p1f1",
    table_id="mon",
    variable_id="tasmax",
)
  6. This will narrow the catalog records to just one dataset. From there, these commands may be used to load the dataset:
dset_dict = cat_1model.to_dataset_dict(
    zarr_kwargs={'consolidated': True},
    storage_options={'anon': True},
)
ds = dset_dict['LOCA2.UCSD.ACCESS-CM2.historical.mon.d03']
  7. Notice that a dictionary of Xarray datasets is returned, so a single catalog query can return multiple datasets (for example, the historical period plus one or more SSPs, or all models). The Xarray dataset can be inspected directly in Python without downloading the data first, because the Zarr format is cloud-optimized and supports random access (the entire dataset does not need to be loaded); a brief visualization sketch follows at the end of this list. To save the dataset as a NetCDF file, run this command:
comp = dict(zlib=True, complevel=6)
compdict = {var: comp for var in ds.data_vars}
ds.to_netcdf('LOCA2_UCSD_ACCESS-CM2_historical_mon_d03.nc', encoding=compdict)
  8. This is an example of querying for all experiments for a particular model and saving them to local NetCDF files:
cat_subset = cat.search(
    activity_id="WRF",
    source_id="CNRM-ESM2-1",
    experiment_id=["historical", "ssp370"],
    member_id="r1i1p1f2",
    table_id="mon",
    grid_label="d02",
    variable_id="t2",
)
dset_dict = cat_subset.to_dataset_dict(
    zarr_kwargs={'consolidated': True},
    storage_options={'anon': True},
)
comp = dict(zlib=True, complevel=6)
for key in dset_dict:
    compdict = {var: comp for var in dset_dict[key].data_vars}
    dset_dict[key].to_netcdf(key.replace(".", "_") + ".nc", encoding=compdict)
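
Once a dataset has been loaded, it can be inspected and plotted directly from the Zarr store before deciding what to export. The following is a minimal sketch rather than an official Analytics Engine workflow; it assumes the ds dataset loaded in step 6, that matplotlib is installed, and that the store uses time, lat, and lon coordinates (typical of the LOCA2 data):

import matplotlib.pyplot as plt
# Print dimensions, coordinates, and variables without downloading any data
print(ds)
# Drop any length-1 dimensions (e.g., member_id) and map the first time step;
# only that single time slice is read from S3
da = ds['tasmax'].squeeze()
da.isel(time=0).plot()
plt.show()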

Connect to data directly in Xarray using S3 path:

The main purpose of the intake catalog is to provide the path to the desired data. If the path is already known, the intake step can be bypassed and users can connect to the data directly in Python.

This is a quick way to open an individual Zarr store with Xarray:

import xarray as xr
# Open the Zarr store lazily; data are read from S3 only when accessed
ds = xr.open_zarr(
    's3://cadcat/loca2/ucsd/access-cm2/historical/r1i1p1f1/day/tasmax/d03/',
    storage_options={'anon': True},
)
# Writing to NetCDF downloads the full daily dataset (see the subsetting sketch below)
ds.to_netcdf('access-cm2_historical_r1i1p1f1_day_tasmax_d03.nc')
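
Because the Zarr store is random access, a temporal or spatial subset can be selected before anything is written to disk, which keeps the exported file small. The following is a minimal sketch, assuming the ds dataset opened above and that the store uses time, lat, and lon coordinates with latitude in ascending order (typical of the LOCA2 data); the date and latitude bounds are illustrative only:

# Check the coordinate ranges first (LOCA2 longitudes may be stored as degrees east, 0-360)
print(float(ds['lat'].min()), float(ds['lat'].max()))
print(float(ds['lon'].min()), float(ds['lon'].max()))
# Illustrative subset: a 30-year window and a narrow latitude band
subset = ds['tasmax'].sel(
    time=slice('1981-01-01', '2010-12-31'),
    lat=slice(38.0, 39.5),
)
# Only the selected portion is read from S3 when the NetCDF file is written
subset.to_netcdf('access-cm2_historical_tasmax_subset.nc')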

For a complete list of S3 data paths, see the ESM catalog’s Zarr store CSV or our Data Catalog page on the Cal-Adapt: Analytics Engine website. For a more detailed walkthrough on using the intake catalog to access and download data, check out this Jupyter notebook.

AWS Explorer Functions

Anyone who doesn’t wish to use our Python library can visually navigate and download data via a web browser by utilizing the AWS Explorer functions within the Analytics Engine S3 bucket. You can identify data that you are interested in downloading by scrolling through the list of existing data files and selecting the ones that pertain to your application of interest.

Please refer to our forthcoming Guidelines on Using Climate Data section to help you decide which climate data to use for your application of interest. While we always recommend examining as many models and ensemble members as possible, the Analytics Engine also hosts a set of “General Use Projections”: five model runs that have been pre-selected to reasonably capture a range of future outcomes for a limited number of commonly used climate variables, scales, and SSP scenarios.

  1. Select the individual GCM to download from the list.
  2. Select the spatial resolution of interest.
  3. Select an ensemble member.
  4. Select the scenario of interest (historical, ssp245, ssp370, ssp585). To explore a range of scenarios and models, data are available for three SSPs:
    1. SSP2-4.5: a middle-of-the-road global emissions scenario
    2. SSP3-7.0: high global emissions scenario
    3. SSP5-8.5: very high global emissions scenario
  5. Select the variable of interest from the list.
  6. Select the time frame of interest (2015-2044, 2045-2074, and 2075-2100). At this point, the file size will be displayed for the selected data. Files average around 3GB.
  7. Select the file of interest and browse to a file location that has sufficient space available to download.

AWS Command Line Interface (CLI)

If users wish to download large amounts of data at once, they can do this by installing the free AWS Command Line Interface (CLI) tool. The AWS CLI is an open-source tool that enables use of the command-line shell to directly access and interact with the Analytics Engine S3 bucket. This tool simplifies the process of downloading all of the data within a dataset.

  1. Using the AWS CLI to list the contents of the S3 bucket can be overwhelming because objects are not stored in a true “directory” structure. The following command displays the variables available for one model (MIROC6) under the SSP3-7.0 scenario:
  2. aws s3 ls --no-sign-request s3://cadcat/loca2/aaa-ca-hybrid/MIROC6/0p0625deg/r1i1p1f1/ssp370/
  3. Given the amount of data available, listing broader prefixes will return a large number of directories. To avoid downloading data that isn’t needed, first determine which SSP, ensemble member, temporal frequency, and variable(s) are of interest.

    The following is a sample download command:

  4. aws s3 cp s3://cadcat/loca2/aaa-ca-hybrid/MIROC6/0p0625deg/r1i1p1f1/ssp370/tasmax loca2/aaa-ca-hybrid/MIROC6/0p0625deg/r1i1p1f1/ssp370/tasmax --no-sign-request --recursive
  5. This downloads the three NetCDF files containing the SSP3-7.0 LOCA2 tasmax data for this model, totaling about 8GB. The AWS command may be combined with bash scripting to download a variable for multiple models:
  6. #!/bin/bash
    # LOCA2 models (source_ids) to download
    models=("MIROC6" "GFDL-ESM4")
    # "${models[@]}" expands to every entry in the array, not just the first
    for m in "${models[@]}"
    do
        aws s3 cp s3://cadcat/loca2/aaa-ca-hybrid/$m/0p0625deg/r1i1p1f1/ssp370/tasmax loca2/aaa-ca-hybrid/$m/0p0625deg/r1i1p1f1/ssp370/tasmax --no-sign-request --recursive
    done
  7. To download a particular variable across all available models, use the --exclude and --include flags, adding the variable name with wildcards to match files based on that pattern:
  8. aws s3 cp s3://cadcat/loca2/aaa-ca-hybrid . --no-sign-request --recursive --exclude '*' --include '*tasmax*'
  9. These commands can be used to download single files, individual variables, or all data for a particular model. Provided sufficient local storage is available (the full LOCA2 dataset is approximately 12.7TB, or 12,700GB), the entire dataset may be downloaded using the following command:
  10. aws s3 cp s3://cadcat/loca2/aaa-ca-hybrid /my/local/path --no-sign-request --recursive
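
For users who prefer Python to shell commands, the same public bucket can also be browsed and downloaded with the s3fs package installed in the Data Catalog section above. The following is a minimal sketch rather than an official Analytics Engine workflow; the paths mirror the CLI examples above, and the local directory name is arbitrary:

import s3fs
# Anonymous (public) access to the cadcat bucket
fs = s3fs.S3FileSystem(anon=True)
# List the variables available for one model and scenario (mirrors the 'aws s3 ls' example)
print(fs.ls('cadcat/loca2/aaa-ca-hybrid/MIROC6/0p0625deg/r1i1p1f1/ssp370/'))
# Download one variable's files to a local directory (mirrors 'aws s3 cp --recursive')
fs.get(
    'cadcat/loca2/aaa-ca-hybrid/MIROC6/0p0625deg/r1i1p1f1/ssp370/tasmax/',
    'loca2_MIROC6_ssp370_tasmax/',
    recursive=True,
)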

Cal-Adapt Data Download Tool (DDT)

The beta version of the Cal-Adapt Data Download Tool (DDT) is available and under ongoing development. Data packages are available to customize and download LOCA2 data from the DDT. These data packages are pre-made groups or clusters of statistically downscaled LOCA2 data that are most commonly accessed on Cal-Adapt. Packages help make data selection and download more intuitive and user-friendly.

  1. Navigate to the beta Data Download Tool.
  2. Click the “Customize and Download” button under your data package of choice.
  3. Make custom selections in the “Review Your Data Package” sidebar that opens.
    1. Dataset
    2. Scenarios (SSPs)
    3. Models
    4. Variables
    5. Spatial Extent
    6. Range (Time)
    7. Data Format
  4. Click “Download your data” to create the data package and download it to a local machine.