AE Documentation

Frequently Asked Questions

General

What is JupyterHub?

JupyterHub gives users access to standardized computational environments and data resources through a webpage without having to install complex software. This JupyterHub is maintained by the Analytics Engine team.

Who can use this service?

The Analytics Engine JupyterHub service is currently open to a select group of users invited to alpha test the Analytics Engine. If you would like to be notified when the Analytics Engine becomes available to the general public, or if you would like to join our group of early testers, please send us an email.

What is JupyterLab?

JupyterLab is a web-based interactive development environment for notebooks, code, and data. Its flexible interface allows users to configure and arrange workflows in data science, scientific computing, computational journalism, and machine learning.

What packages and libraries are available?

Our JupyterHub is built using the Pangeo ecosystem and ClimaKitAE, a climate data processing Python library developed by the Analytics Engine team.

Pangeo provides Python tools (e.g. xarray, Dask, JupyterLab) and cloud infrastructure that enable near-instantaneous access to, and fast processing of, large climate and other datasets used in the geosciences.

ClimaKitAE provides an analytical toolkit for working with downscaled CMIP6 data. You can use ClimaKitAE for complex analyses such as selecting models for a specific metric or deriving threshold-based assessments. It also provides common functionality for working with climate datasets, such as temporal and spatial aggregation and downloading time series for a weather station.

Can I use Analytics Engine within my own organization’s cloud?

You can run the Analytics Engine within your organization’s AWS cloud using our Docker container.

Can I request a dataset to add to Analytics Engine?

Users may request that datasets be added to the platform for use in the cloud computing environment; requests are approved on a case-by-case basis. Datasets should preferably comply with CF conventions (see Metadata Standards for more information). Requests for particular datasets should be sent via email.

Technical

When I run app.select(), nothing is displayed. What should I do?

If you see a symbol like [*] next to the app.select() cell, the process is still running and the select panel should appear shortly. If there is a number in the brackets instead of an asterisk, you may need to restart your kernel (click Kernel > Restart Kernel… from the toolbar). Restarting the kernel is a good first step in troubleshooting issues on the JupyterHub more generally, but note that you will need to re-run any computations afterwards.

What is Dask?

Dask is a Python library for parallel computing: complicated computing tasks are split up and computed simultaneously on separate processors. Our tools take advantage of this automatically if you start up a Dask cluster at the beginning of your notebook. For more information about the various applications of this library, see the Dask library documentation.
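
As a rough illustration of the idea (a minimal sketch, not part of the Analytics Engine tooling), the example below uses dask.delayed to split a small calculation into independent tasks. The function name square and the input values are made up for this example. Without a cluster, Dask runs the tasks with its local scheduler.

```python
import dask


@dask.delayed
def square(x):
    # Each call becomes one independent task in Dask's task graph
    return x * x


# Nothing is computed yet: these are lazy task objects
tasks = [square(i) for i in range(4)]

# Combine the independent results in one final task
total = dask.delayed(sum)(tasks)

# compute() executes the graph; independent tasks may run in parallel
result = total.compute()
print(result)
```

Because the four square tasks do not depend on each other, the scheduler is free to run them at the same time before summing the results.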

How can I start up a Dask cluster?

from dask_gateway import GatewayCluster
cluster = GatewayCluster()
client = cluster.get_client()
cluster

The last line in this code chunk will display the cluster widget with a link to the Dask dashboard, which you can use to monitor the cluster (for more information about what is displayed on the dashboard, please refer to the Dask documentation on the subject). Before working with the cluster, you will need to specify the number of workers to use (see next question below).

How can I specify how many workers to use?

cluster.scale(3)

We recommend starting with the command cluster.scale(3), which scales the cluster to three workers. After specifying the number of workers, you can run any code you like on the cluster. Use more workers for high-resolution, calculation-heavy applications (for example, with hourly datasets you may want to use 6 or 8 workers) and fewer workers for simpler notebooks that use low-resolution data.
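
If you switch between datasets often, the guidance above could be captured in a small helper. This is only a sketch: suggest_workers is a hypothetical function, and the "hourly" label and the worker counts it returns are assumptions based on the recommendation above, not an Analytics Engine API.

```python
def suggest_workers(time_resolution: str) -> int:
    """Return a starting worker count for a time resolution (hypothetical helper)."""
    # Hourly data is the heavier case mentioned above, so start larger.
    # The label and the number are assumptions; adjust them to your workload.
    if time_resolution.lower() == "hourly":
        return 6
    # Default starting point recommended above
    return 3


n_workers = suggest_workers("hourly")
# With a running cluster you would then call: cluster.scale(n_workers)
```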

Do I need to do anything special to my code to take advantage of the Dask cluster?

No. Xarray functions and the climate data on the Analytics Engine are already optimized to take advantage of parallel computing resources that are present, so simply by starting a cluster on the Analytics Engine our tools (and any of your xarray code) will utilize those parallel resources automatically.
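
To show what this looks like in ordinary xarray code, here is a minimal sketch using a small synthetic array in place of real Analytics Engine data. The variable name temperature, the array shape, and the chunk size are all made up for illustration, and the example assumes xarray, NumPy, and Dask are installed. The same code runs unchanged whether or not a cluster has been started.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a climate variable (time, lat, lon); not real data
rng = np.random.default_rng(0)
temperature = xr.DataArray(
    rng.normal(288.0, 5.0, size=(365, 10, 10)),
    dims=("time", "lat", "lon"),
    name="temperature",
)

# Chunking turns the data into a Dask-backed array split along time
chunked = temperature.chunk({"time": 73})

# Standard xarray call; with Dask-backed data this only builds a task graph
time_mean = chunked.mean(dim="time")

# compute() executes the graph using whichever Dask scheduler is active
result = time_mean.compute()
print(result.shape)
```

Note that nothing in the analysis step (the call to mean) refers to Dask directly; the parallelism comes from the data being chunked.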

How can I close down the cluster when I am done?

cluster.close()

Workers will come from a pool that is shared between users of the Analytics Engine, so please close the cluster you are using when you are done working with it.
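
One way to make sure the cluster is always closed, even if your code raises an error partway through, is to wrap it in a context manager. The sketch below is an assumption about how you might structure this, not an Analytics Engine feature: scaled_cluster is a hypothetical helper that accepts any callable returning an object with scale() and close() methods, such as GatewayCluster.

```python
from contextlib import contextmanager


@contextmanager
def scaled_cluster(make_cluster, n_workers=3):
    """Create a cluster, scale it, and always close it afterwards (hypothetical helper)."""
    cluster = make_cluster()
    try:
        cluster.scale(n_workers)
        yield cluster
    finally:
        # Runs on normal exit and on errors, returning workers to the shared pool
        cluster.close()


# Intended usage on the JupyterHub (not executed here):
# from dask_gateway import GatewayCluster
# with scaled_cluster(GatewayCluster, n_workers=3) as cluster:
#     client = cluster.get_client()
#     ...  # run your analysis
```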

I am confident with coding in Python. How can I access the data catalog directly?

import intake
col = intake.open_esm_datastore('https://cadcat.s3.amazonaws.com/cae-collection.json')

See the intake-esm documentation for more on using an ESM datastore.