AE Analytics
Methods
Fetching Global Warming Level data in the Analytics Engine
This section outlines the steps a user takes to access data on global warming levels using climakitAE. The tools in the AE platform can calculate global warming levels for dynamically downscaled (WRF) or statistically downscaled (LOCA) data using the same methods. For more information on how to select a dataset for an application of interest, refer to the Guidance section on the AE website titled Using Climate Data in Decision-Making.
The specific methodology and calculations underlying the generation of GWL data are covered in the next section.
California climate data on GWLs can be retrieved in two ways depending on the user’s preferred workflow:
1. Use the get_data() function in climakitae
2. Use a GUI that visually shows all available options
Option 1: Retrieve WLs directly using the get_data() function
This method uses a simple function from the data_interface module in climakitae to retrieve warming levels data directly with a Python function, no GUI required. This can be a good approach for a user who already knows what data they want and doesn't need to examine the full set of options. Basic usage is shown below; users are encouraged to work through the basic_data_access.ipynb notebook in the Analytics Engine for more details and for the additional options available in the get_data() function.
1. Import the function from the data_interface module:
from climakitae.core.data_interface import get_data
2. Optional: Print the function docstrings to see the possible function inputs. Note that some of these arguments require an input (variable, downscaling_method, resolution, and timescale), and some of the arguments are ignored for a warming level approach (time_slice and scenario).
print(get_data.__doc__)
3. Set the arguments to retrieve warming levels data:
get_data(
    variable = "Precipitation (total)",
    downscaling_method = "Dynamical",
    resolution = "45 km",
    timescale = "monthly",
    approach = "Warming Level",
    warming_level_window = 10,
    warming_level = [2.0, 2.5, 3.0, 4.0]
)
Make sure the argument approach is set to "Warming Level"; the function defaults to a time-based approach. The argument warming_level_window defaults to 15 (years on either side, i.e. a 30-year window), and the argument warming_level defaults to 2.0 (2 deg C).
Option 2: Retrieving WLs through the GUI
To visualize the available data options and minimize the amount of coding, a GUI-based approach is also provided. To display the selections GUI in a notebook, run the following lines of code (which are also listed in the interactive_data_access_and_viz.ipynb notebook):
import climakitae as ck
import climakitaegui as ckg
selections = ckg.Select()
selections.show()
After the last line is executed, the following panel pops up:

The two important sections that need to be defined before retrieving WL data are boxed in red. Similarly to the manual parameter definition above, the “Approach” must first be set to “Warming Level”, then the desired time window (“Years around GWL”) and desired warming levels can be selected (1.5 deg C - 4.0 deg C).
After making these selections, running data = selections.retrieve() in a following cell will load a warming levels data object. The sections below show how this data is shaped and how to interact with it.
Working with the WL data object
The warming levels object will look like the following:

The dimensions can be interpreted as follows:
- warming_level: the number of warming levels in the data object. Since we listed 2.0 and 2.5 warming levels in the examples above, we see that both 2.0 and 2.5 warming levels were retrieved in the resulting data object.
- time_delta: this is the number of time steps from the center warming level year in this object. Since we specified monthly data with a 30-year window, this results in 360 time steps (30 years x 12 months). The time steps are labeled with coordinate values of -180 to 179, with a time_delta= 0 indicating the year that the climate simulation reaches the specified global warming level.
- y,x: spatial dimensions for WRF data.
- simulation: the simulation names grabbed for this set of parameters, where the names are listed as: [Downscaling_Method]_[Model]_[Ensemble Member]_[Historical Data Used]_[SSP]
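Because the returned object is a standard xarray DataArray, these dimensions can be worked with using ordinary xarray operations. Below is a minimal sketch, assuming the object is named data (as retrieved above); note that depending on how the warming_level coordinate is stored, it may need to be selected by string rather than numeric value:

# Minimal sketch, assuming `data` is the warming levels object retrieved above
wl_2 = data.sel(warming_level=2.0)

# Average over the 30-year window and across simulations; skipna=True ignores the
# NaN placeholders for simulations that never reach this warming level
wl_2_mean = wl_2.mean(dim=["time_delta", "simulation"], skipna=True)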
There may be warnings that pop up when using warming level data that look like the following:

These are only printed to illustrate the limitations of the data object returned, as different warming levels may have a different number of simulations that reach that given warming level.
How California-focused Warming Level data is Calculated
This section outlines the methodology for calculating data on GWL in the Analytics Engine. These are the calculations that take place “under the hood” (i.e., in the climakitae code for AE) when you retrieve GWL data as described in the previous section. The methods described here follow the approach used by the IPCC AR6 report as closely as possible.
Calculating GWL on the Analytics Engine is a two-step process:
1. Generate the GWL lookup Tables for Models
For each global climate model (GCM) simulation in the CMIP6 archive, the average global temperature increase relative to pre-industrial conditions (1850-1900) is measured for each year. This time-series of global warming is smoothed with a 20-year running average and used to create a lookup table to determine when each simulation reaches a given global warming level.
2. Retrieve the Analytics Engine Model data at selected GWL years
Using the years that each global simulation reaches a specified warming level as determined by the GWL lookup table in step 1, the corresponding years of data are taken from each regionally downscaled climate simulation. A slice of the time series (typically 30 years) is taken from each simulation, centered on the year that simulation reaches the specified warming level. This data therefore represents the estimated regional climate impacts that will be felt at a given level of global warming.
A detailed description of how these two steps are implemented in the Analytics Engine is described below:
Step 1: Generate the GWL lookup tables for models
The GWL lookup table captures what year each global climate simulation reaches each warming level. This table is pre-generated in the climakitae repository, and is only updated if changes are made to the methodology or new warming levels are added to the platform. No actions are required by the user for this step, but the section is provided as technical documentation and outlines the method used to generate this table for transparency.
1. From the CMIP6 catalog, all CMIP6 models and their ensemble members are selected via the Pangeo CMIP6 CSV.
2. Global average surface air temperature for each ensemble member is then calculated as a spatially weighted average of the tas (or appropriate surface air temperature) variable using the formula below. This is essentially a weighted average of all grid cells around the world, which accounts for the fact that grid cells towards the poles cover less area than grid cells near the equator:
import numpy as np

# lat and lon hold the names of the latitude/longitude dimensions
weightlat = np.sqrt(np.cos(np.deg2rad(ensemble_mem[lat])))  # area weight per latitude
weightlat = weightlat / np.sum(weightlat)                    # normalize the weights
timeseries = (ensemble_mem * weightlat).sum(lat).mean(lon)   # global mean time series
3. Each time series is smoothed with a 20-year running average window. Then, the month/year that each time-series first exceeds a certain degree of warming (1.5, 2.0, 2.5, 3.0, 4.0) relative to the average temperature from a given reference period is computed and then saved per model into a lookup table of GWL, with the model as the index.
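For illustration only (not the exact climakitae code), the smoothing and crossing-level computation for one ensemble member might look like the following, where timeseries is the monthly global-mean series from part 2 as a pandas Series indexed by time, and reference is its 1850-1900 mean (both hypothetical names):

anomaly = timeseries - reference  # warming relative to the reference period

# 20-year centered running average (240 monthly values)
smoothed = anomaly.rolling(window=240, center=True).mean()

# First month at which the smoothed series exceeds each warming level
levels = [1.5, 2.0, 2.5, 3.0, 4.0]
crossing = {wl: smoothed[smoothed > wl].index.min() for wl in levels}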
4. The lookup tables are saved in the data directory of the climakitae repository:
a. gwl_1850-1900ref.csv uses the reference period 1850-1900, consistent with the IPCC warming level definitions. This reference period cannot be used to calculate anomalies (difference from historical), because the downscaled data only extends back to 1950.
b. gwl_1980-2010ref.csv uses the reference period 1980-2010, and is only used when calculating anomalies.
c. A 20-year running average window is used to determine the “crossing year” for each GWL by the center of the window. This ensures that any one particularly high value year does not skew the results when the overall average temperature trend has not yet reached the warming level.
5. Additionally, the GWL at each month is saved for each ensemble member from 1860-2090 into gwl_1850-1900ref_timeidx.csv and gwl_1980-2010ref_timeidx.csv. These act as translations between time and WLs.
a. These files have time as the index, whereas the files generated in part 3 above have the model as the index.
b. The time range of these files is 1860-2090 because the 20-year running average window clips the first and last 10 years.
6. The above steps are calculated for all CMIP6 models. The CESM2-LENS data is processed separately due to a slightly different data structure, but follows the same methodology.
Example of GWL lookup tables
WL-based indexing (Step 1, part 3): gwl_1850-1900ref.csv

Time-based indexing (Step 1, part 4): gwl_1850-1900ref_timeidx.csv

Step 2: Retrieval of the Analytics Engine model data at selected GWLs
When data on warming levels is retrieved using get_data() or the Select GUI, the following procedure is run:
For each warming level:
For each simulation within our AE WRF/LOCA2-Hybrid catalog (depending on downscaling method):
A. Find the centered year at which that simulation passes this global warming level, using the table generated in Step 1, part 3 of the Global Warming Level calculations (gwl_1850-1900ref.csv).
B. Slice the window (i.e. +/- 15 years) around the centered year from step A for the current simulation.
C. Filter for desired months and remove leap days.
D. Reset the time index so that all simulations can be stacked on top of each other.
a. Change timestamps to timedeltas with a centered_year coordinate.
b. i.e. a 30-year simulation with monthly frequency data from 2010-2040 is transformed into a dataset with time-deltas from -180 to 179 (360 months in 30 years), with an added centered_year coordinate of 2025.
c. The time dimension is now called time_delta because it represents the time distance from the central year.
E. Simulations that don’t reach this given warming level are set to NaN.
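A condensed sketch of steps A-E for a single simulation and warming level is shown below. The names sim_da (a monthly xarray DataArray with a time dimension) and center_year (looked up from the GWL table) are hypothetical, and this is not the exact climakitae implementation:

import numpy as np

window = 15  # years on either side of the centered year

# B. Slice the +/- 15 year window around the centered year for this simulation
start, end = f"{center_year - window}-01", f"{center_year + window - 1}-12"
sliced = sim_da.sel(time=slice(start, end))

# D. Replace timestamps with a time_delta index (-180 to 179 for monthly data)
# and record the centered year as a coordinate
n = sliced.sizes["time"]
sliced = sliced.assign_coords(time=np.arange(n) - n // 2).rename(time="time_delta")
sliced = sliced.assign_coords(centered_year=center_year)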
Now, you will be able to view your warming level data through the data object.

Calculating a Typical Meteorological Year on the Analytics Engine
This section outlines the methodology for generating a Typical Meteorological Year (or TMY) in the Analytics Engine. A TMY is one year of hourly data representing the median meteorological conditions for a location over a set amount of time, usually a 30-year period. A TMY is built from ten different weather variables to statistically assess the median conditions, and each month in the TMY is the most “typical” month. For example, the most “typical” January within the 30 year period could be from 2010, while the most “typical” February could be from 2022, and so on. The end result is an hourly profile for an entire year with each month spliced together from multiple input years. TMY data is widely used as a critical input for energy modeling, simulating solar energy conversion systems, and evaluating building standards and energy efficiency. The methods in the Analytics Engine for generating a TMY closely follow the NREL TMY version 3 method, where you can find more information here. The typical_meteorological_year_methodology notebook on the Analytics Engine demonstrates the full process.
The following workflow is how a TMY is calculated:
Step 1:
A user selects their location of interest. Calculating a TMY on the Analytics Engine is for point-based information, meaning that a user will first select a specific location, like a power plant or an airport weather station as their location of interest. Optionality for a grid location is forthcoming.
Step 2:
The input data for determining a “typical” month is retrieved for that location, which includes these variables:
- Mean air temperature
- Min air temperature
- Max air temperature
- Mean dew point temperature
- Min dew point temperature
- Max dew point temperature
- Mean wind speed
- Max wind speed
- Global irradiance
- Direct irradiance
For the TMY, a minimum of 15-20 years of daily data is required; on the Analytics Engine we use a 30-year period as our default. One of the benefits of the Analytics Engine TMY methodology is that a user can build a TMY for a historical, current, or future period, leveraging the high-resolution downscaled climate model projections! It is important to note that only 4 of the WRF downscaled models have all of the required variables to calculate a TMY, namely the two important solar variables; these 4 WRF models are also bias-adjusted. In the TMY process, we subset the data carefully to only include the relevant models. The last step in the data retrieval process is to ensure that all of the input data is in the local time zone for the location of interest. Because the input data is in UTC, the minimum temperature in hourly data can "appear" on the day before (i.e., midnight on Monday in UTC corresponds to 4 pm PST on Sunday!). Converting to the local time zone first is important to ensure that the daily minimum is on the correct day.
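As an illustration of this time-zone step (not the exact climakitae code), assuming hourly_da is a hypothetical hourly xarray DataArray with a naive UTC time coordinate and the location of interest is in the Pacific time zone:

import pandas as pd

# Interpret the naive timestamps as UTC, convert to local time, then drop the tz info
utc_times = pd.to_datetime(hourly_da.time.values).tz_localize("UTC")
local_times = utc_times.tz_convert("US/Pacific").tz_localize(None)

# Re-assign the time coordinate so daily statistics (e.g. the daily minimum
# temperature) fall on the correct local calendar day
hourly_da = hourly_da.assign_coords(time=local_times)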
Step 3:
Next, we calculate the long-term (30-year) climatological conditions for each variable. The TMY process specifically uses a cumulative distribution function (CDF), which essentially means that we are calculating the long-term baseline for each variable. We’ll use this to determine which month is closest to this baseline condition for all months.

Figure 1. An example of the long-term climatological conditions of daily max air temperature, for use in a TMY. This CDF represents the baseline conditions of January max air temperatures in 4 bias-adjusted WRF models at Los Angeles International Airport (LAX) from 1990-2020.
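For illustration, the empirical CDF underlying Figure 1 can be sketched as follows, where daily_max_temp is a hypothetical 1-D array of daily values pooled across all Januaries in the 30-year baseline period:

import numpy as np

# Sort the pooled daily values and assign each its empirical cumulative probability
values = np.sort(np.asarray(daily_max_temp))
cdf = np.arange(1, len(values) + 1) / len(values)
# values[i] has empirical cumulative probability cdf[i]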
Step 4:
We then calculate the cumulative distribution function for each variable for all months. At this point we also carefully remove specific months from consideration if they were during major volcanic eruptions like Pinatubo (June 1991 to December 1994), because volcanic aerosols have a major impact on solar variables.

Figure 2. An example of a candidate month’s daily max air temperature, for use in a TMY. This CDF represents the January 2015 conditions of max air temperatures in 4 bias-adjusted WRF models at Los Angeles International Airport (LAX). The TMY process identifies the closest candidate month to the long-term climatological conditions to pick a “typical” month. For example, we would look for the closest instance of the distribution in this Figure to that of the first figure.
Step 5:
The long-term climatological distribution is compared to the monthly distribution for each variable. To do this, we calculate a Finkelstein-Schafer (F-S) statistic, which measures the absolute difference between the climatological CDF and each candidate month's CDF, identifying the individual month that is closest to climatology.
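A minimal sketch of the F-S statistic is shown below, assuming baseline holds the pooled long-term daily values for one variable and candidate holds the daily values for a single candidate month (hypothetical names, not the exact climakitae implementation):

import numpy as np

def fs_statistic(baseline, candidate):
    """Mean absolute difference between the baseline and candidate empirical CDFs."""
    baseline_sorted = np.sort(baseline)
    candidate_sorted = np.sort(candidate)
    # Evaluate both empirical CDFs at the candidate month's daily values
    cdf_baseline = np.searchsorted(baseline_sorted, candidate_sorted, side="right") / len(baseline_sorted)
    cdf_candidate = np.searchsorted(candidate_sorted, candidate_sorted, side="right") / len(candidate_sorted)
    return np.mean(np.abs(cdf_baseline - cdf_candidate))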
Step 6:
The results from the F-S statistic are then weighted across the input variables according to the following scheme, which places higher weight on the solar variables.
- Mean air temperature: 2/20, or 10%
- Min air temperature: 1/20, or 5%
- Max air temperature: 1/20, or 5%
- Mean dew point temperature: 2/20, or 10%
- Min dew point temperature: 1/20, or 5%
- Max dew point temperature: 1/20, or 5%
- Mean wind speed: 1/20, or 5%
- Max wind speed: 1/20, or 5%
- Global irradiance: 5/20, or 25%
- Direct irradiance: 5/20, or 25%
Since the TMY methodology heavily weights the solar radiation input data, be aware that the final selection of "typical" months may not be typical for specific variables. In other words, what is selected as a typical June may not fully represent typical June temperatures.
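Under these assumptions, the weighted sum for a candidate month can be sketched as follows, where fs is a hypothetical dict mapping each variable name to its F-S statistic for that candidate month:

weights = {
    "Mean air temperature": 2 / 20,
    "Min air temperature": 1 / 20,
    "Max air temperature": 1 / 20,
    "Mean dew point temperature": 2 / 20,
    "Min dew point temperature": 1 / 20,
    "Max dew point temperature": 1 / 20,
    "Mean wind speed": 1 / 20,
    "Max wind speed": 1 / 20,
    "Global irradiance": 5 / 20,
    "Direct irradiance": 5 / 20,
}
weighted_sum = sum(weights[var] * fs[var] for var in weights)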
Step 7:
Once weighted, we select, for each month of the year, the candidate month with the lowest weighted sum, meaning that candidate is the closest or most "typical" relative to the long-term climatology for that specific month. On the Analytics Engine, we ensure that model data is kept intact. This means that the most typical month for all months is selected from the same model, not across models (e.g., not: January from MIROC6 and February from EC-Earth3). The end result of this process is that a TMY is generated four times, one for each model! This provides a great opportunity to do multi-model comparisons of TMYs in a physically consistent space.
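A minimal sketch of this selection step, assuming the weighted sums have been collected in a hypothetical pandas DataFrame named scores with columns "model", "month", "year", and "weighted_sum":

# For each model and calendar month, keep the candidate year with the lowest
# weighted sum, so every model yields its own complete TMY
best_idx = scores.groupby(["model", "month"])["weighted_sum"].idxmin()
typical_months = scores.loc[best_idx]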
Step 8:
Once the “typical” months are selected, we generate the full hourly profile by providing the standard meteorological information for a “TMY file”. Interestingly, the required TMY variables in a TMY file are not the same as the input variables in Steps 1-7! A TMY output file includes information on: air temperature, dewpoint temperature, relative humidity, global irradiance, direct irradiance, diffuse irradiance, downwelling radiation, wind speed and direction, and surface air pressure. On the Analytics Engine, the last step of the method is to retrieve all of these variables for the specific months determined by Step 7 for all 4 models. TMY output files have very specific formatting requirements, which we take care of for you in our “under the hood” code.

Figure 3. An example TMY hourly profile for Los Angeles International Airport (LAX) for the 1990-2020 period, from MPI-ESM1-2-HR.
Because a TMY represents average conditions, rather than extreme conditions, a TMY is not suited for designing systems to meet the worst-case conditions occurring at a location. The ability to calculate a customizable "extreme" TMY, or an "XMY", is in the works on the Analytics Engine!