Frequently Asked Questions

General

What is JupyterHub?

Figure: A simple schematic describing how Pangeo envisions a data-proximate science platform using Jupyter as the user interface.

JupyterHub gives users access to standardized computational environments and data resources through a webpage without having to install complex software. Our JupyterHub is maintained by the Analytics Engine team.

Who can use this service?

The Analytics Engine JupyterHub service is currently open to a select group of users invited to alpha test the Analytics Engine. If you would like to be notified when the Analytics Engine becomes available to the general public, or if you would like to join our group of early testers, please send us an email.

What is JupyterLab?

JupyterLab is a web-based interactive development environment for notebooks, code, and data. Its flexible interface allows users to configure and arrange workflows in data science, scientific computing, computational journalism, and machine learning.

What packages and libraries are available?

Our JupyterHub is built using the Pangeo ecosystem and ClimaKitAE, a climate data processing Python library developed by the Analytics Engine team.

Pangeo provides Python tools (e.g. xarray, Dask, JupyterLab) and cloud infrastructure that enable near-instantaneous access to, and fast processing of, the large climate and other geoscience datasets used in the geosciences.

ClimaKitAE provides an analytical toolkit for working with downscaled CMIP6 data. You can use ClimaKitAE for complex analyses such as selecting models for a specific metric or deriving threshold-based assessments. It also provides common functionality for working with climate datasets, such as temporal and spatial aggregation and downloading a timeseries for a weather station.
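
As a rough sketch of how ClimaKitAE is typically used in the guidance notebooks (the class and method names below follow those notebooks and may change between releases, so treat this as illustrative rather than definitive):

import climakitae as ck

app = ck.Application()   # central object used throughout the guidance notebooks
app.select()             # displays the selection panel (variable, scenario, resolution, etc.)
data = app.retrieve()    # loads the current selection as an xarray object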

What data is available?

Figure: Data coverage extents for the various resolutions of WRF data.

Downscaled climate projections in support of California’s Fifth Climate Change Assessment are available via two downscaling methods (dynamical and statistical), with coarse-resolution data covering western North America and the finest-scale (3 km) data covering the state of California (see the figure above). For more detailed information, visit our data catalog.

Can I use Analytics Engine within my own organization’s cloud?

You can run the Analytics Engine within your organization’s AWS cloud using our Docker container.

Can I request a dataset to add to Analytics Engine?

Users may request that datasets be added to the platform for use in the cloud computing environment; requests are approved on a case-by-case basis. Datasets are encouraged to comply with CF conventions (see Data Standards for more information). Requests for particular datasets should be sent via email.

Technical

When I run app.select(), nothing is displayed. What should I do?

If you see a [*] symbol next to the app.select() cell, the process is still running and the selection panel should appear shortly. If there is a number in the brackets rather than an asterisk, you may need to restart your kernel (click Kernel > Restart Kernel… from the toolbar). Restarting the kernel is a good first step when troubleshooting issues on the JupyterHub more generally, but note that you will need to re-run any computations afterwards.

What is dask?

Dask is a Python library for parallel computing, meaning large computing tasks are split up and run simultaneously on separate processors. Our tools automatically take advantage of this if you start up a Dask cluster at the beginning of your notebook (see the next question). For more information about the various applications of this library, see the Dask library documentation.

How can I start up a dask cluster?

from climakitae.cluster import Cluster

cluster = Cluster()              # create the Dask cluster
client = cluster.get_client()    # connect a client so your computations run on the cluster
cluster                          # display the cluster widget

The last line in this code snippet displays the cluster widget with a link to the Dask dashboard, which you can use to monitor the cluster (for more information about what the dashboard displays, refer to the Dask documentation on the subject). Before working with the cluster, you will need to specify the number of workers to use (see the next question).

How can I specify how many workers to use?

cluster.adapt(maximum=3)

We recommend starting with the command cluster.adapt(maximum=3), which allows the cluster to scale up to three workers as needed. After specifying the number of workers, you can run any code you would like on the cluster. Use more workers for high-resolution, calculation-heavy applications (for example, with hourly datasets you may want to use 6 or 8 workers) and fewer workers for low-resolution, simpler notebooks.
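
For heavier workloads you can raise the ceiling, and optionally keep a minimum number of workers running; the numbers below are just an example of a configuration you might use for hourly data:

cluster.adapt(minimum=2, maximum=8)   # scale between 2 and 8 workers depending on demand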

Do I need to do anything special to my code to take advantage of the dask cluster?

No. Xarray functions and the climate data on the Analytics Engine are already optimized to take advantage of any parallel computing resources that are present, so simply by starting a cluster on the Analytics Engine, our tools (and any of your xarray code) will use those parallel resources automatically.
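
As a minimal sketch (the Zarr store is the same example used in the download question below, and the time-mean reduction is purely illustrative), any lazy xarray computation will be spread across the cluster’s workers once the client from the earlier snippet exists:

import xarray as xr

ds = xr.open_zarr('s3://cadcat/wrf/ucla/cesm2/historical/1hr/lwdnb/d01', storage_options={'anon': True})
time_mean = ds.mean(dim='time')   # lazy: builds a Dask task graph
result = time_mean.compute()      # executes in parallel on the cluster workers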

How can I close down the cluster when I am done?

cluster.close()

Workers will come from a pool that is shared between users of the Analytics Engine, so please close the cluster you are using when you are done working with it.

I am comfortable coding in Python. How can I access the data catalog directly?

import intake
col = intake.open_esm_datastore('https://cadcat.s3.amazonaws.com/cae-collection.json')

See the intake-esm documentation for more on using an ESM catalog.
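
From there, a typical pattern is to filter the catalog with search() and load the matches as xarray datasets. A rough sketch follows; the column name variable_id and the value 't2' are assumptions here, so check col.df.columns for the actual catalog columns:

col.df.head()                          # inspect the catalog entries as a pandas DataFrame
subset = col.search(variable_id='t2')  # filter on a catalog column (name is an assumption)
dsets = subset.to_dataset_dict()       # dict of xarray Datasets; anonymous S3 access may need storage_options={'anon': True}, depending on your intake-esm version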

How can I download the climate data used in the Analytics Engine?

All the climate data used by the Analytics Engine is stored in a publicly accessible AWS S3 bucket. If you are familiar with Python, you can access the data directly using the xarray package, or through the intake package with our intake-ESM data catalog. Once opened as an xarray dataset, the data can be exported to a NetCDF file on your computer.

Here is a quick way to open an individual Zarr store with Xarray:

import xarray as xr

ds = xr.open_zarr('s3://cadcat/wrf/ucla/cesm2/historical/1hr/lwdnb/d01', storage_options={'anon': True})
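
From there, it is usually worth subsetting before exporting, since a full store can be very large. A minimal sketch (the variable name lwdnb and the time range are illustrative):

subset = ds['lwdnb'].sel(time=slice('2000-01-01', '2000-12-31'))   # pick one variable and a time slice
subset.to_netcdf('lwdnb_2000.nc')                                  # write a local NetCDF copy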

For a complete list of S3 data paths see the ESM catalog’s Zarr store CSV.

For a more detailed walkthrough on using the intake catalog to access and download data, check out this Jupyter notebook.

If you are looking for direct access to LOCA2 NetCDF files, you can interactively browse the S3 bucket to download individual .nc files by variable, or install AWS CLI tools and run the following command to download everything:

aws s3 sync --no-sign-request s3://cadcat/loca2/aaa-ca-hybrid /my/local/path

Please note that if you would like to download everything, you will need to change /my/local/path to a directory with enough space to store 12.7 TB of data.
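
If you only need part of the archive, it can help to list what is available first and then sync a narrower prefix (the subdirectory below is a placeholder, since the exact bucket layout is not reproduced here):

aws s3 ls --no-sign-request s3://cadcat/loca2/aaa-ca-hybrid/
aws s3 sync --no-sign-request s3://cadcat/loca2/aaa-ca-hybrid/some-subdirectory /my/local/path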