Exploring Uncertainty in Climate Data

There are several kinds of scientific uncertainty that arise when working with long-term projections of future climates:

  1. Model Uncertainty, which illustrates the differences between different models (namely, how model physics, settings, or parameters can change the outcome)
  2. Internal Variability, which represents the variations inherent within the climate system itself
  3. Scenario Uncertainty, which arises from differences in outcomes between emissions trajectories.

This notebook explores Model Uncertainty in the Cal-Adapt: Analytics Engine by focusing on temperature trends across simulations. We also compare the suite of models currently available in the Cal-Adapt Data Catalog to the full set of CMIP6 models to illustrate the differences between our models and all available models.

Intended Application : As a user, I want to understand when taking a mean across models is appropriate for my research question , by learning about:
  1. The wide range of possibilities for the end of century across all available CMIP6 models
  2. What kinds of questions can be answered with the multi-model mean (qualitative), and which cannot (quantitative)
  3. How the Analytics Engine models compare to all available CMIP6 models

Terms used in this notebook:

  • Ensemble member: When a given model is run multiple times, we call the group of runs an ensemble. Each member of that ensemble represents a distinct realization featuring its own combination of model parameters.
  • Multi-model mean: The average response across all models.

Step 0: Setup and CMIP6 data processing

Import useful packages and libraries.

import climakitae as ck
import xarray as xr
import pandas as pd
import numpy as np
import cartopy.crs as ccrs
import holoviews as hv
from climakitae.util.utils import read_csv_file

import warnings

First, we will want to grab the regridded CMIP6 models that have both historical and SSP3-7.0 simulations by setting specific data options in order to assess air temperature trends for California. To do this, we will want to grab monthly near-surface air temperature data, and subset the global data for the state of California.

The next few cells process the data into a consistent format for our use.

from climakitae.explore.uncertainty import CmipOpt

# select data options
copt = CmipOpt()
copt.variable = 'tas' # near-surface air temperature
copt.area_subset = 'states'
copt.location = 'California' # location of interest to subset for
copt.area_average = False # ensures that we grab spatiotemporal data
copt.timescale = 'Amon' # monthly frequency

The next cell does the actual work of grabbing the data from the catalog and pre-processing it. Several things happen during pre-processing:

  • First we search the catalog for the data we want to use. We will compare the first ensemble member from each model in the CMIP6 archive that has both a historical and SSP3-7.0 simulation.
  • Then we organize each of the resulting available data, with our specific data selections, and pre-process it so that it subsets correctly for California.
  • Lastly, the datasets are merged together for ease of use throughout the rest of our analysis.

After running the following cell, you can examine the resulting Dataset that holds all of the CMIP6 models of interest.

from climakitae.explore.uncertainty import grab_multimodel_data

mdls_ds = grab_multimodel_data(copt)

Step 1: Assess CMIP6 multi-model spread

1a) Calculate useful metrics

Now that we have all of our desired models processed, the next step is to calculate several key metrics of interest using functions in climakitae:

  • cmip_annual calculates the annual average temperature in each model, collapsing from monthly data
  • calc_anom calculates the temperature difference (i.e., anomaly) from a historical baseline, of which we use 1850-1900 to place our results in a global warming levels context. For more information on warming levels, check out the warming_levels.ipynb notebook
  • cmip_mmm calculates the average response across models, otherwise known as the multi-model mean

These next cells will take 1-2 minutes to run - hang tight!

from climakitae.explore.uncertainty import cmip_annual, calc_anom, cmip_mmm

# calculate spatial data first for the entire cmip6 archive
xy_ds_yr = cmip_annual(mdls_ds).compute()
xy_ds_yr.tas.attrs['units'] = '°C' # set units to Celsius

xy_anom = calc_anom(xy_ds_yr, base_start=1850, base_end=1900)
xy_anom.tas.attrs['units'] = '°C' # set units to Celsius
xy_anom_mmm = cmip_mmm(xy_anom)

# calculate area-averaged timeseries
cmip_anom = xy_anom.groupby("year").mean(dim=["x","y"])
cmip_anom.tas.attrs['units'] = '°C' # set units to Celsius
cmip_anom_mmm = cmip_mmm(cmip_anom)
# calculate the difference and multi-model mean specifically for the 4 cal-adapt: analytics engine models
hist_start, hist_end, ssp_end = 1950, 2014, 2100 # historical start and end dates, future end date
cae_mdls_ls = ["FGOALS-g3", "EC-Earth3-Veg", "CESM2", "CNRM-ESM2-1"]
cae_mdls = cmip_anom.sel(simulation=cae_mdls_ls)
cae_anom = cae_mdls.sel(year=slice(hist_start, ssp_end))

# calculate the historical anomaly and multi-model mean
hist_anom = cmip_anom.sel(year=slice(hist_start, hist_end))
hist_anom_mmm = cmip_anom_mmm.sel(year=slice(hist_start, hist_end))

# calculate the future anomaly and multi-model mean
ssp_anom = cmip_anom.sel(year=slice(hist_end, ssp_end))
ssp_anom_mmm = cmip_anom_mmm.sel(year=slice(hist_end, ssp_end))

Next, we read in a reference table that provides several warming level options (1.5°C, 2.0°C, 3.0°C, 4.0°C) for the CMIP6 archive. We identify which model ensemble members are essential for our analysis and grab these specific simulations.

# read in global warming levels table
from climakitae.core.paths import gwl_1850_1900_file
gwl_times = read_csv_file(gwl_1850_1900_file, index_col=[0, 1, 2])

# grab the ensemble members specific to our needs here
sim_idx = []
scenario = 'ssp370'
for simulation in ssp_anom.simulation.values:
    if simulation in gwl_times.index:
        if simulation == 'CESM2':
            sim_idx.append((simulation, 'r11i1p1f1', scenario))
        elif simulation == 'CNRM-ESM2-1':
            sim_idx.append((simulation, 'r1i1p1f2', scenario))
            sim_idx.append((simulation, 'r1i1p1f1', scenario))

We need to identify where each individual model reaches a designated warming level. As the default, we select 3.0°C as our desired warming level to investigate. Depending on the warming level we also provide information as to whether any model does not reach the selected warming level by 2100. Play around with different values for the warming level to see how the resulting analyses change throughout the rest of this notebook!

# warming level
warm_level = 3.0

# identify the year that the selected warming level is reached for each ensemble member
xy_da_list = []
year_reached_by_sim = []
for i in sim_idx:
    year_warmlevel_reached = str(gwl_times[str(warm_level)].loc[i])[:4]
    if len(year_warmlevel_reached) != 4:
        print("{}°C warming level not reached for {}".format(warm_level, i[0]))
        year_reached_by_sim.append((i, int(year_warmlevel_reached)))
        xy_da_list.append(xy_anom.sel(year=int(year_warmlevel_reached), simulation=i[0]))

thresh_df = pd.DataFrame(
xy_by_warmlevel = xr.concat(xy_da_list, dim="simulation")

1b) Visualize the range in temperature amongst the CMIP6 archive

We also highlight the Cal-Adapt: Analytics Engine models in order to illustrate where these models fall within the larger CMIP6 model spread. Many things are happening in this figure. Let’s break it down:

  • The thin grey lines represent a single CMIP6 model in the historical (1950-2014) period. Their corresponding future (SSP 3-7.0) counterparts are illustrated by the thin orange lines for 2014-2100.
  • The thick black line represents the multi-model mean for the historical period
  • The thick red line represents the multi-model mean for the future period
  • The thin blue lines represent the currently downscaled models available in the Cal-Adapt: Analytics Engine
  • The dashed line at 0°C is also provided to help visualize the overall trend
  • The two vertical lines represent the earliest and latest occurrences that any model exceeds the selected warming level (here it is 3.0°C), if prior to 2100. Play around with different warming level values to see how it changes the response.

This cell will take a few minutes to run; it has a lot of information to display!

# figure set-up
h_color, ssp_color, cae_color = 'grey', 'orange', 'blue'
lw, alpha = 0.75, 0.25
ylab = hist_anom.tas.long_name + ' (' + hist_anom.tas.attrs['units'] + ')'

# all individual models
all_hist = hist_anom.hvplot.line(x="year", ylabel=ylab, by='simulation', line_width=lw, color=h_color, legend=False, alpha=alpha)
all_ssp = ssp_anom.hvplot.line(x="year", by="simulation", line_width=lw, color=ssp_color, legend=False, alpha=alpha)

# cal-adapt models
all_cae = cae_anom.hvplot.line(x="year", by="simulation", line_width=lw, color=cae_color, alpha=alpha*1.5, legend=False)

# multi-model means
mmm_hist = hist_anom_mmm.hvplot.line(x="year", line_width=lw*3, color='black')
mmm_ssp = ssp_anom_mmm.hvplot.line(x="year", line_width=lw*3, color='red',
                                                       title="CMIP6 mean surface temperature change in California")
# warming level boundaries
warmlevel_firstoccurence = hv.VLine(int(thresh_df['year_warming_level_reached'].min())).opts(color='black', line_width=lw)
warmlevel_lastoccurence = hv.VLine(int(thresh_df['year_warming_level_reached'].max())).opts(color='black', line_width=lw)
zero_line = hv.HLine(0.0).opts(color="black", line_width=0.5, line_dash="dashed")

all_hist * all_ssp * all_cae * mmm_hist * mmm_ssp * zero_line * warmlevel_firstoccurence * warmlevel_lastoccurence

Visualizing temperature trends in this way allows us to identify several key pieces of information:

  1. Temperatures in California are projected to be higher in the future than the historical period given this emissions scenario.
  2. The range of temperatures by 2100 is several degrees, between 2-8°C higher than 1850-1900. However, not all models are weighted equally by the IPCC in terms of their global mean temperature change. This is particularly important for considering our results within a warming levels context, as we do here.
  3. Several models project a much faster increase in temperature, while others increase at a slower rate. This will have an impact on California climate, with differences at the local scale.

Step 2: Illustrate spatial statistics across California

2a) Visualize the year that the warming level is reached

Next, let’s spatially visualize the differences between the CMIP6 model archive at a designated warming level. We will also identify how the Cal-Adapt: Analytics Engine models compares to the broader spread. In the next cell, we do some minor plot set-up and calculate the minimum and maximum of our data to ensure that each plot displays on the same range for ease of comparison.

# set up for plots
from climakitae.explore.uncertainty import compute_vmin_vmax
from climakitae.util.utils import read_ae_colormap
cmap = read_ae_colormap(cmap='ae_orange', cmap_hex=True)

from bokeh.models import HoverTool
hover = HoverTool(description='Custom Tooltip',
        tooltips=[('Longitude (deg E)', '@x'),
        ('Latitude (deg N)', '@y'),
        ('Air Temp (°C)', '@z')])

def make_hvplot(data, title, vmin, vmax, sopt, width=200, height=200):
    """Make single map"""
    _plot = hv.QuadMesh(
        (data['x'], data['y'], data)).opts(
        colorbar=True, cmap=cmap,
        symmetric=False, clim=(vmin,vmax),
        xaxis=None, yaxis=None,
        clabel="Air Temperature (°C)",
        width=width, height=height)
    return _plot

num_simulations = len(xy_by_warmlevel.simulation.values) # number of simulations

# compute 1% min and 99% max of all simulations
vmin_l, vmax_l = [], []
for sim in range(num_simulations):
    data = xy_by_warmlevel.isel(simulation=sim)
    vmin_i, vmax_i, sopt = compute_vmin_vmax(data.tas, data.tas)
vmin = np.nanmin(vmin_l)
vmax = np.nanmax(vmax_l)

In the next cell we visualize what each model simulates at the year the warming level is reached for California. In other words, the next figure will show how each model spatially represents the selected warming level and the differences across the state.

all_plots = make_hvplot(  # plot first simulation separate from the loop
        sopt=sopt, vmin=vmin, vmax=vmax,

for sim_i in range(1, num_simulations): # plot remaining simulations
    pl_i = make_hvplot(
        sopt=sopt, vmin=vmin, vmax=vmax,
    all_plots += pl_i

# additional aesthetic settings to tidy figure
all_plots.cols(5)  # organize columns
all_plots.opts(hv.opts.Layout(merge_tools=True))  # merge toolbar
all_plots.opts(toolbar="below")  # set toolbar location
all_plots.opts(title="Air Temperature at 2m: Anomalies for "
               + str(warm_level)
               + "°C Warming by Simulation")  # add title

Above we’ve plotted each CMIP6 model at our selected warming level (in this example, 3°C) for California. There are a few key features to take note of:

  1. The increase in temperature will be perceived differently across California. Some locations may have a smaller increase (for example, coastal regions in some models, but not all), while other locations will have a larger increase in temperature (for example, the Sierra Nevadas and northern California in some models, but not all).
  2. There is a substantial difference between each model in how temperatures are simulated at the year the warming level is reached.

It is important to note that no one model is “better” or “worse” than another model. In terms of understanding model uncertainty, inter-model variations and internal variation expressed between models represent a range of possibilities.

2b) Calculate and visualize statistics for a cross-model analysis

Next, let’s visualize the minimum/maximum/median/mean conditions across models. These statistics are calculated from the data observed in the figure above at each grid cell. For example, in the minimum map, each grid cell represents the model that had the lowest value for that grid cell and the process is repeated for each grid cell (i.e., a grid cell may be the minimum value at that location from one model, while the grid cell next to it may be the minimum from a different model for that location).

# compute stats
min_data = xy_by_warmlevel.tas.min(dim="simulation")
max_data = xy_by_warmlevel.tas.max(dim="simulation")
med_data = xy_by_warmlevel.tas.median(dim="simulation")
mean_data = xy_by_warmlevel.tas.mean(dim="simulation")

# set up plots
min_plot = make_hvplot(data=min_data, sopt=sopt, vmin=vmin, vmax=vmax, title="Minimum")
max_plot = make_hvplot(data=max_data, sopt=sopt, vmin=vmin, vmax=vmax,title="Maximum")
med_plot = make_hvplot(data=med_data, sopt=sopt, vmin=vmin, vmax=vmax, title="Median")
mean_plot = make_hvplot(data=mean_data, sopt=sopt, vmin=vmin, vmax=vmax, title="Mean")
all_plots = mean_plot + med_plot + min_plot + max_plot

# additional aesthetic settings to tidy figure
all_plots.opts(toolbar="below")  # set toolbar location
all_plots.opts(hv.opts.Layout(merge_tools=True))  # merge toolbar
all_plots.opts(title="Air Temperature at 2m: "
               + str(warm_level)
               + "°C Warming Across Models")  # add title

Illustrating cross-model statistics can tell us several things. Let’s break it down:

  1. For a 3°C warming level, across statistics, models simulate California warming in the future.
  2. For a 3°C warming level, there is approximately 1 degree °C difference in the mean conditions, and a 2.2-3.7 degree °C difference between the minimum and maximum simulated conditions. Differences also exist across the state, with some regions simulating more or less warming than other locations.

Step 3: Example applications

What might be most useful to know when addressing model uncertainty in climate data is to know when it is reasonable to use an average across models and when it is not appropriate (note this depends on the question being asked).

3a) When it is reasonable to use a multi-model mean

To assess when it is reasonable to use a multi-model mean, we’ll look at a hypothetical question that asks: “Will Southern California be warmer in the future than it was in the past?”

First, we’ll calculate a 10-year running average to display the trends over time, so that we can assess the long-term decadal trend in the historical period. Then, we’ll select a location of interest, focusing on Southern California. To compare the future and past, we’ll select 30 year comparison periods in each. For the past, we’ll focus on 1981-2010, which is a commonly used climatological baseline. For the future, we’ll look towards the end of the century and focus on 2071-2100.

# calculate running mean first, then calculate anomaly
xy_10yr_rolling = xy_ds_yr.rolling(year=10, center=True).mean()
xy_10yr_anom = calc_anom(xy_10yr_rolling, base_start=1850, base_end=1900)
# calculate the area average and the multi-model mean for S. California
def socal_area_average(ds):
    lower_lat, upper_lat = 33.0, 37.0
    socal_ts = ds.sel(y=slice(lower_lat, upper_lat)).mean(dim=["x","y"])
    return socal_ts

socal_hist = socal_area_average(xy_10yr_anom).sel(year=slice(1981,2010))
socal_hist.tas.attrs['units'] = '°C'
socal_hist_mmm = cmip_mmm(socal_hist)

socal_ssp = socal_area_average(xy_10yr_anom).sel(year=slice(2071,2100))
socal_ssp.tas.attrs['units'] = '°C'
socal_ssp_mmm = cmip_mmm(socal_ssp)

Now let’s plot the historical timeseries for 1981-2010 to understand how models have represented this period. We continue to use 1850-1900 to assess how conditions have changed in a warming level context.

# visualize historical model response and the multi-model mean
socal_hist_mdls = socal_hist.hvplot.line(x='year', by='simulation', line_width=lw, color=h_color, alpha=alpha, legend=False)
hist_mmm = socal_hist_mmm.hvplot.line(x='year', line_width=lw*3, color='black', legend=False,
                                     title='Historical S. California mean surface temperature change relative to 1850-1900')
socal_hist_mdls * hist_mmm

Notice throughout 1981-2010 that the multi-model mean increases by approximately 1°C. Now, let’s look at the future period to compare how conditions may be at the end of the century.

# visualize future model response and the multi-model mean
socal_ssp_mdls = socal_ssp.hvplot.line(x='year', by='simulation', line_width=lw, color=ssp_color, alpha=alpha, legend=False)
ssp_mmm = socal_ssp_mmm.hvplot.line(x='year', line_width=lw*3, color='r', legend=False,
                                   title='Future S. California mean surface temperature change relative to 1850-1900')
socal_ssp_mdls * ssp_mmm

Upon comparison to the historical period, we note that the multi-model mean increases between 3.5°C and 4.8°C for Southern California. Note, the future timeseries ends before 2100 because of the rolling average; this ensures we capture the overall long-term climatological trend.

Lastly, let’s establish the difference between the multi-model mean in the historical (1981-2010) and future (2071-2100) periods to answer our question.

historical = np.asarray(socal_hist_mmm.tas.values)
future = np.asarray(socal_ssp_mmm.tas.values)

print('The future multi-model mean is {:.2f}-{:.2f}°C above the historical multi-model mean'.format(np.nanmin(future - historical), np.nanmax(future - historical)))

Using the individual models, the multi-model mean, and this min-max range, we can see that the future period consistently is higher than the historical period for Southern California. Thus, we can definitively answer “yes, there will be a higher temperature in the future” in this example.

However, if we wanted to know how much warmer in the future , we must consider the full range and cannot use only the multi-model mean as we lose a lot of valuable information about the regional response if the full range is not considered.

3b) When it is not appropriate to use an average across models

Here, we’ll address a hypothetical example with the question: “I want to know the exact temperature it will be at my location in 2100” and discuss why this is not the right question to ask.

With more than a 4°C difference between models, it is therefore not appropriate to utilize the multi-model mean in this example. Each of these simulations represent a possible climate future, and there is too much variation to confidently say the multi-model mean represents the overall climate conditions. It is strongly recommended that the full range of climate possibilities be considered; a single value cannot represent this range accurately.

Summary and Key Points

This notebook has illustrated how to diagnose and consider the differences between multiple models in representing climate, otherwise known as model uncertainty. We discussed several important aspects of model uncertainty:

  1. The range of plausible future climates can be substantial, with a model simulating a very different response than another.
  2. The multi-model mean captures the average conditions, but provides no information towards the range of plausible outcomes.
  3. In addition to an annual time series, the differences between models throughout your area of interest is critical for understanding the spatial impact(s) of a plausible climate future.
  4. The multi-model mean is a useful tool, but should be employed carefully when answering a question. The multi-model mean can only report the average conditions. For many questions, this is an inappropriate metric to solely rely on.

Recommendation: If you are using GCM output as input into another model, we suggest running that model with output from all GCMs available.

Want to know more about different kinds of climate uncertainty in climate data? Check out the internal_variability.ipynb notebook too!