CLIMATE DATA SCIENCE

Xarray Recipes for Earth Scientists

Code snippets to help you analyze your data

Earth science data typically comes packaged in NetCDF files with labeled dimensions, which makes xarray a natural analysis tool. Xarray has powerful and versatile built-in methods, such as resample(), groupby(), and concat(). This package is integral to geospatial analysis, which is why it is the backbone of the Pangeo software stack. Xarray is available under the open source Apache License.

This article is a collection of snippets geared towards Earth scientists.

Table of Contents

0. Installation
  0.1 Tutorial dataset
1. Climatology and Anomalies
2. Downsampling: Monthly Average
  2.1 Monthly Max, Min, Median, etc.
  2.2 N-month Average
3. Upsampling: Daily Interpolation
4. Weighted Average
5. Moving Average
6. Ensemble Average
7. Assign New Variables or Coordinate
  7.1 Assign new variable
  7.2 Changing time coordinate
  7.3 Changing longitude coordinate
8. Select a Specific Location
9. Fill in Missing Values
  9.1 Fill NaN with value
  9.2 Replace with climatology
  9.3 Interpolate between points
  9.4 Forward/backward filling
10. Filter Data
11. Mask Data
12. Final Thoughts
  12.1 Split-apply-combine

0. Installation

The developers recommend using the conda package manager and the community-maintained conda-forge channel for installation.

0.1 Tutorial dataset

Once installed, try loading this tutorial dataset, which is two years of air temperature sampled 4 times daily.

This dataset can be used to explore the recipes in this article.
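A minimal sketch of getting a dataset to experiment with. The commented line shows the tutorial loader, which needs internet access and the optional pooch package. The rest builds an offline stand-in with the same layout (6-hourly, two years, a variable named air); the grid and the random values are assumptions invented for illustration, not real data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# With internet access, the tutorial data can be loaded directly:
# ds = xr.tutorial.open_dataset("air_temperature")

# Offline stand-in (assumption: made-up grid and values, same layout).
time = pd.date_range("2013-01-01", periods=2 * 365 * 4, freq=pd.Timedelta(hours=6))
lat = np.arange(75.0, 14.0, -2.5)   # 75N down to 15N
lon = np.arange(200.0, 331.0, 2.5)  # 200E to 330E
rng = np.random.default_rng(0)
air = 280.0 + 10.0 * rng.standard_normal((time.size, lat.size, lon.size))

ds = xr.Dataset(
    {"air": (("time", "lat", "lon"), air, {"units": "degK"})},
    coords={"time": time, "lat": lat, "lon": lon},
)
print(ds)
```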

1. Climatology and Anomalies

Climatology: A monthly climatology entails averaging all the Januaries in a time series, then all the Februaries, etc.
Anomalies: These are deviations from climatology or the difference between the original time series and climatology. For example, if the time series is monthly, then the January 2013 anomaly is the January 2013 value minus the January climatology.

groupby() collects all the like coordinate values. In this case, each group contains all the data for one calendar month, regardless of the year. We then calculate the mean across each group and combine the results back together.

The argument supplied to groupby() specifies what we want to group by. You can get the months via the .dt accessor, which is what we are doing in this example.

When calculating monthly anomalies, we first group by month and then subtract the climatology. For example, the January climatology is subtracted from each member in the January group. This is then repeated for every other month.
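The two steps can be sketched as follows. The dataset is a synthetic daily series invented for illustration (the variable name air mirrors the tutorial data), and "time.month" is the string shorthand for grouping by ds.time.dt.month.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: two years of daily temperature with a seasonal cycle.
time = pd.date_range("2013-01-01", "2014-12-31", freq="D")
seasonal = 15 + 10 * np.sin(2 * np.pi * time.dayofyear.to_numpy() / 365.25)
ds = xr.Dataset({"air": ("time", seasonal)}, coords={"time": time})

# Split by calendar month, average each group, combine: one value per month.
climatology = ds.groupby("time.month").mean("time")

# Subtract each month's climatology from every member of that month's group.
anomalies = ds.groupby("time.month") - climatology
```

The climatology has a month dimension of length 12, while the anomalies keep the original time axis.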

2. Downsampling: Monthly Average

Downsampling: Decreasing the frequency of the samples. For example, going from a daily time series to monthly.

To achieve this with xarray we use .resample(). The argument supplied specifies the temporal dimension (e.g. time) and the resample frequency (e.g. monthly). The sampling frequency string '1MS' means sample monthly, with each new time label placed at the start of the month (2000-Jan.-01, 2000-Feb.-01, etc.).

The sampling frequency (e.g. ‘1MS’) specifies how to resample the data. See offset aliases in Pandas documentation for other options.

To calculate the monthly average you have to attach .mean(). Removing this method will return a DatasetResample object. Always specify what you want to do with the samples.
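A minimal sketch of a monthly average, assuming a synthetic daily series (values are just a running day count, invented so the result is easy to check):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: two years of daily values 0, 1, 2, ...
time = pd.date_range("2013-01-01", "2014-12-31", freq="D")
ds = xr.Dataset(
    {"air": ("time", np.arange(time.size, dtype=float))},
    coords={"time": time},
)

# Resample along "time" to month-start labels, then average each month.
monthly = ds.resample(time="1MS").mean()
```

The result has 24 time steps, each labeled with the first day of its month.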

2.1 Monthly Max, Min, Median, etc.

There is a suite of other methods you can use with resample(), including: max, min, median, std, var, quantile, and sum.

Check out the resample documentation for more details.
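Swapping the reducer is a one-word change. A short sketch on the same kind of synthetic daily series (invented for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: two years of daily values 0, 1, 2, ...
time = pd.date_range("2013-01-01", "2014-12-31", freq="D")
ds = xr.Dataset(
    {"air": ("time", np.arange(time.size, dtype=float))},
    coords={"time": time},
)

# Same resampling, different statistic applied to each month.
monthly_max = ds.resample(time="1MS").max()
monthly_min = ds.resample(time="1MS").min()
monthly_std = ds.resample(time="1MS").std()
```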

2.2 N-month Average

Instead of averaging monthly, what if you wanted to average every 2 months? In this case, you would simply supply the number of months to sample by in the sampling frequency string:

Monthly average: ‘1MS’ (the 1 is optional)
2-month average: ‘2MS’
…
15-month average: ‘15MS’ (I don’t know why you would want to do this, but you can)
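For instance, a 2-month average could look like this, again on a synthetic daily series invented for illustration:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: two years of daily values 0, 1, 2, ...
time = pd.date_range("2013-01-01", "2014-12-31", freq="D")
ds = xr.Dataset(
    {"air": ("time", np.arange(time.size, dtype=float))},
    coords={"time": time},
)

# Each bin now spans two calendar months (Jan-Feb, Mar-Apr, ...).
two_monthly = ds.resample(time="2MS").mean()
```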

3. Upsampling: Daily Interpolation

Upsampling: Increasing the sampling frequency. For example, going from a monthly time series to daily.

Upsampling converts a “low” temporal resolution (e.g. monthly) to a “high” resolution (e.g. daily), here using linear interpolation. As with downsampling, the argument supplied to .resample() specifies the temporal dimension (e.g. time) and the resample frequency (e.g. daily). The sampling frequency string '1D' means sample daily.

We then attach .interpolate("linear") to linearly interpolate between the points.
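A minimal sketch of monthly-to-daily upsampling. The three monthly values are invented so that the interpolated result is simply the day count; as far as I know, this interpolation path relies on SciPy being installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: three month-start samples whose value is the
# number of days elapsed since 2013-01-01.
time = pd.to_datetime(["2013-01-01", "2013-02-01", "2013-03-01"])
ds = xr.Dataset({"air": ("time", np.array([0.0, 31.0, 59.0]))}, coords={"time": time})

# Resample to a daily axis and fill the new days by linear interpolation.
daily = ds.resample(time="1D").interpolate("linear")
```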

4. Weighted Average

Weighted average: Average of values which are scaled by their importance. Calculated as the sum of weights multiplied by the values divided by the sum of the weights.

When averaging geospatial data, it is often important to weight the data so small regions do not skew the result. For instance, if your data is on a uniform latitude-longitude grid, then grid cells near the poles occupy less area than those at low latitudes. If you average temperature over the entire globe, then you should weight the Arctic less than the tropics since it occupies less area.

In this example, I use a common approach and weight by the cosine of the latitude. This example is from the xarray webpage.

When calculating a weighted average with xarray, note that ds.weighted(weights) returns a “weighted” object (an instance of the DatasetWeighted class), not a result. In addition to a weighted mean, we can calculate a weighted standard deviation, weighted sum, etc. See the xarray documentation on weighted operations for the options.
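A sketch of cosine-latitude weighting. The global grid and the temperature field (warm tropics, cold poles) are assumptions invented for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: a 5-degree global grid, warm tropics and cold poles.
lat = np.arange(-87.5, 90.0, 5.0)
lon = np.arange(0.0, 360.0, 5.0)
field = 30.0 * np.cos(np.deg2rad(lat))[:, np.newaxis] * np.ones(lon.size)
ds = xr.Dataset({"air": (("lat", "lon"), field)}, coords={"lat": lat, "lon": lon})

# Weights proportional to grid-cell area on a regular lat-lon grid.
weights = np.cos(np.deg2rad(ds["lat"]))
weights.name = "weights"

weighted_mean = ds.weighted(weights).mean(("lat", "lon"))
unweighted_mean = ds.mean(("lat", "lon"))
```

Because the cold polar rows count for less, the weighted mean comes out warmer than the plain mean.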

Another approach is to weight by grid cell area; see the post “The correct way to average the globe: Why area-weighting your data is important” (towardsdatascience.com).

5. Moving Average

Moving average: Technique to smooth data by averaging within specific intervals. The interval is defined by a specific number of data points, termed the window length.

Rolling mean/running mean/moving average is a technique to smooth short-term fluctuations to enhance the signal-to-noise ratio. You use the .rolling() method and supply it the dimension to apply the mean over and the window length.

For example, a 30-day running mean over the air temperature tutorial dataset needs a window length of 120, since the data is sampled 4 times each day and there are therefore 120 samples every 30 days.

An optional argument you can supply to rolling() is center=True. This sets the labels at the center of the window instead of at the end, which is the default.
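A sketch of the 30-day running mean on a synthetic 6-hourly series (a simple ramp invented for illustration, standing in for the tutorial data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: 90 days of 6-hourly values 0, 1, 2, ...
time = pd.date_range("2013-01-01", periods=4 * 90, freq=pd.Timedelta(hours=6))
ds = xr.Dataset(
    {"air": ("time", np.arange(time.size, dtype=float))},
    coords={"time": time},
)

# 30 days x 4 samples per day = window of 120 samples.
smooth = ds.rolling(time=120).mean()

# Same window, but each result is labeled at the middle of its window.
smooth_centered = ds.rolling(time=120, center=True).mean()
```

Positions where a full window is not available are NaN, so the output keeps the original length.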

6. Ensemble Average

Ensemble Average: Averaging multiple estimates of the same quantity. For example, the collection of CMIP models is an ensemble.

If you have a dataset with multiple related variables you can concatenate them across a new dimension and then perform statistics.
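One way to sketch this, assuming a Dataset whose variables are three estimates of the same quantity. The names model_a, model_b, model_c, the dimension name member, and the constant values are all hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: three "models" with constant values 1, 2 and 3.
time = pd.date_range("2013-01-01", periods=12, freq="MS")
ds = xr.Dataset(
    {
        name: ("time", np.full(time.size, value))
        for name, value in [("model_a", 1.0), ("model_b", 2.0), ("model_c", 3.0)]
    },
    coords={"time": time},
)

# Stack the variables along a new, labeled "member" dimension.
names = list(ds.data_vars)
members = xr.concat(
    [ds[name].rename("air") for name in names],
    dim=pd.Index(names, name="member"),
)

# Statistics across the ensemble.
ensemble_mean = members.mean("member")
ensemble_std = members.std("member")
```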

I wrote about this topic previously, describing a different technique to achieve this with xarray. I now prefer the concatenation approach described here. The earlier post is “Pythonic Way to Perform Statistics Across Multiple Variables with Xarray: By first creating a categorical dimension in your Dataset” (towardsdatascience.com).

7. Assign New Variables or Coordinate

Xarray provides three “assign” methods:

.assign() → Assign new data variables to a Dataset
.assign_coords() → Assign new coordinates to a Dataset
.assign_attrs() → Assign new attributes to a Dataset

7.1 Assign new variable

To add new variables to a dataset you can use .assign(). Keep in mind there is no ‘inplace’ option with xarray, so you always have to explicitly assign the returned object to a variable.

In this example, I am changing the units from degrees Celsius to Kelvin and assigning it to a variable named temp_k. Units are very important. Remember the Mars Climate Orbiter incident?

The second line is optional and simply adds attributes to the new variable.
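A sketch of the two lines described above, assuming a small dataset with a Celsius variable named temp (the data and the attribute values are invented for illustration):

```python
import pandas as pd
import xarray as xr

# Synthetic stand-in: three days of temperature in degrees Celsius.
time = pd.date_range("2013-01-01", periods=3, freq="D")
ds = xr.Dataset(
    {"temp": ("time", [10.0, 15.0, 20.0], {"units": "degC"})},
    coords={"time": time},
)

# assign() returns a new Dataset, so bind the result to a name.
ds = ds.assign(temp_k=ds["temp"] + 273.15)

# Optional: describe the new variable.
ds["temp_k"].attrs = {"units": "K", "long_name": "air temperature"}
```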

7.2 Changing time coordinate

Here is an example of changing the time coordinate. This is useful when you want the time coordinate to start on the 15th of the month instead of the 1st.
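One possible sketch, assuming a monthly series labeled on the 1st of each month (synthetic data, invented for illustration). Shifting every label by 14 days moves it to the 15th.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: one year of monthly values labeled at month start.
time = pd.date_range("2013-01-01", periods=12, freq="MS")
ds = xr.Dataset({"air": ("time", np.arange(12.0))}, coords={"time": time})

# Build the new labels (1st + 14 days = 15th) and swap them in.
mid_month = pd.to_datetime(ds["time"].values) + pd.Timedelta(days=14)
ds = ds.assign_coords(time=mid_month)
```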

7.3 Changing longitude coordinate

There are two conventions for longitude:

“360 convention”: 0 to 360, with 0 at the prime meridian and values increasing eastward
“180 convention”: -180 to 180 (180°W to 180°E), centered on zero at the prime meridian

I made up the convention names. Please let me know if proper names exist.

Change from 180 to 360

Change from 360 to 180

Alternatively, you could use list comprehension to make this more intuitive.
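The conversions can be sketched with modulo arithmetic, followed by sortby() so the longitudes stay in increasing order. The four-point grid is invented for illustration, and the list comprehension at the end is one way to write the 360-to-180 step more explicitly.

```python
import xarray as xr

# Synthetic stand-in: four longitudes in the "180 convention".
ds_180 = xr.Dataset(
    {"air": ("lon", [1.0, 2.0, 3.0, 4.0])},
    coords={"lon": [-180.0, -90.0, 0.0, 90.0]},
)

# Change from 180 to 360: wrap negative longitudes into 0..360.
ds_360 = ds_180.assign_coords(lon=ds_180["lon"] % 360).sortby("lon")

# Change from 360 to 180: wrap values of 180 and above into -180..0.
ds_back = ds_360.assign_coords(lon=(ds_360["lon"] + 180) % 360 - 180).sortby("lon")

# Same 360-to-180 conversion written as a list comprehension.
lon_list = [lon - 360 if lon >= 180 else lon for lon in ds_360["lon"].values]
ds_alt = ds_360.assign_coords(lon=lon_list).sortby("lon")
```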

8. Select a Specific Location

Use sel() to select a specific latitude/longitude location. By default, sel() looks for exact matches in the dataset, but supplying method='nearest' tells xarray to find the closest match to your selection.
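A short sketch on a synthetic 2.5-degree grid (grid and values invented for illustration). The requested point does not lie on the grid, so method="nearest" snaps to the closest grid cell.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: a regional 2.5-degree grid.
lat = np.arange(15.0, 76.0, 2.5)
lon = np.arange(200.0, 331.0, 2.5)
field = lat[:, np.newaxis] + 0.0 * lon
ds = xr.Dataset({"air": (("lat", "lon"), field)}, coords={"lat": lat, "lon": lon})

# Requested location is off-grid; the nearest grid point is returned.
point = ds.sel(lat=40.3, lon=254.1, method="nearest")
```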

9. Fill in Missing Values

Often your data will have missing values you need to fill.

Unfortunately, there is no “one size fits all” solution to this problem.

Common approaches:

Fill missing values with some value
Fill missing values with climatology
Interpolate between points
Propagate values

9.1 Fill NaN with value

Using the fillna() method is a straightforward way to replace missing values with some value. For instance, fillna(0) replaces every NaN with 0, but the fill value could be anything; the choice depends on the situation.
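A minimal sketch with a three-point synthetic series (invented for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: a short series with one missing value.
time = pd.date_range("2013-01-01", periods=3, freq="D")
ds = xr.Dataset({"air": ("time", [1.0, np.nan, 3.0])}, coords={"time": time})

# Replace every NaN with 0.
filled = ds.fillna(0)
```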

9.2 Replace with climatology

To replace with climatological values you first have to group the data via groupby() and then use fillna() supplied with the climatological values.
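A sketch of this pattern on a synthetic monthly series, invented so that every January is 1, every February is 2, and so on, with one January removed:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: three years of monthly data, value = month number.
time = pd.date_range("2013-01-01", periods=36, freq="MS")
values = time.month.to_numpy().astype(float)
values[12] = np.nan  # knock out January 2014
da = xr.DataArray(values, coords={"time": time}, dims="time", name="air")

# Monthly climatology (NaNs are skipped by default when averaging).
climatology = da.groupby("time.month").mean("time")

# Fill each gap with the climatological value of its own month.
filled = da.groupby("time.month").fillna(climatology)
```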

9.3 Interpolate between points

Another approach is to interpolate across missing values using interpolate_na(). Here I am linearly interpolating across time.

See the interpolate_na() documentation for a complete list of interpolation methods.
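A minimal sketch on a synthetic daily series with two gaps (values invented for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: a daily ramp with two missing days.
time = pd.date_range("2013-01-01", periods=5, freq="D")
ds = xr.Dataset(
    {"air": ("time", [0.0, np.nan, 2.0, np.nan, 4.0])},
    coords={"time": time},
)

# Fill the gaps by interpolating linearly along the time coordinate.
filled = ds.interpolate_na(dim="time", method="linear")
```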

9.4 Forward/backward filling

ds.ffill('time')
ds.bfill('time')

Forward fill → ffill() propagates values forward
Backward fill → bfill() propagates values backward

These methods fill each NaN with the nearest non-NaN value along the supplied dimension: ffill() uses the last valid value before the gap, and bfill() uses the first valid value after it.

10. Filter Data

Select data based on a conditional expression. Only the data where the expression evaluates to True will be retained; everything else will be NaN.
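One way to do this is with where(). A sketch that keeps only positive values, on a synthetic five-point series invented for illustration:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: values from -2 to 2.
time = pd.date_range("2013-01-01", periods=5, freq="D")
ds = xr.Dataset({"air": ("time", [-2.0, -1.0, 0.0, 1.0, 2.0])}, coords={"time": time})

# Keep values where the condition holds; everything else becomes NaN.
positive = ds.where(ds["air"] > 0)
```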

11. Mask Data

Mask data based on a conditional expression. Mask values will be True if the supplied condition is True and False otherwise. This can be turned into a mask of 1s and 0s simply by multiplying by 1.
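A short sketch, using the same kind of synthetic five-point series (invented for illustration):

```python
import pandas as pd
import xarray as xr

# Synthetic stand-in: values from -2 to 2.
time = pd.date_range("2013-01-01", periods=5, freq="D")
ds = xr.Dataset({"air": ("time", [-2.0, -1.0, 0.0, 1.0, 2.0])}, coords={"time": time})

# Boolean mask: True where the condition holds, False elsewhere.
mask = ds["air"] > 0

# Integer mask of 1s and 0s.
mask_int = mask * 1
```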

12. Final Thoughts

I hope this post helped you with your work and illustrated the power of xarray. Keep in mind this post merely provides one person’s perspective. There may be more efficient or intuitive solutions to some of these problems. For that reason, I strongly encourage feedback and comments.

A common task I did not include in this post is regridding data. I encourage you to check out the awesome xESMF package.

Finally, the examples in this post can be adapted for other purposes. For example, calculating climatology is really an application of a split-apply-combine strategy.

12.1 Split-apply-combine

Split the data into groups,
Apply a function to each group
Combine all the groups back together.

When calculating climatologies the “apply” step is mean. However, this can be swapped out for max, min, median, std, var, quantile, or sum. A custom method can even be supplied by using map().
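As a sketch of the custom case, here is a hypothetical function (value_range, a name invented for this example) applied to each monthly group of a synthetic daily series:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: two years of daily values 0, 1, 2, ...
time = pd.date_range("2013-01-01", "2014-12-31", freq="D")
ds = xr.Dataset(
    {"air": ("time", np.arange(time.size, dtype=float))},
    coords={"time": time},
)


def value_range(group):
    """Custom 'apply' step: spread between the largest and smallest value."""
    return group.max("time") - group.min("time")


# Split by month, apply the custom function, combine the results.
monthly_range = ds.groupby("time.month").map(value_range)
```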

Thank you for reading and supporting Medium writers