This article provides a collection of code snippets and methods for Earth scientists analyzing data with the xarray library, with a focus on climate science applications.
Abstract
This article, "Xarray Recipes for Earth Scientists," is a practical guide to analyzing Earth science data, particularly data stored in NetCDF files. The guide emphasizes the utility of the xarray library, showcasing built-in methods such as resample(), groupby(), and concat(). Xarray is integral to geospatial analysis and is the backbone of the Pangeo software stack. The article outlines various data manipulation techniques, including climatology calculations, downsampling and upsampling methods, weighted averages, moving averages, ensemble averaging, and the assignment of new variables or coordinates. It also addresses common data processing tasks such as selecting specific locations, filling in missing values, filtering data, and masking data. The author encourages feedback and suggests exploring the xESMF package for regridding tasks, highlighting the versatility of xarray for a wide range of analyses in Earth science research.
Opinions
The author believes that xarray is the perfect tool for Earth science data analysis due to its ability to handle labeled dimensions efficiently.
The article conveys that weighting by grid cell area is important when averaging geospatial data to avoid skewing results due to differences in region sizes.
The author suggests that there is no one-size-fits-all solution for filling in missing data, and the approach should be tailored to the specific situation.
The author offers a personal perspective on data analysis techniques, acknowledges that there may be more efficient or intuitive solutions, and encourages community feedback.
The author emphasizes the importance of proper unit handling, referencing the Mars Climate Orbiter incident as a cautionary tale.
The author promotes the xESMF package as an excellent tool for regridding data, a task not covered in the post.
Earth science data typically comes packaged in NetCDF files with labeled dimensions, making xarray the perfect analysis tool. Xarray has some powerful and versatile built-in methods, such as resample(), groupby(), and concat(). This package is integral to geospatial analysis, which is why it is the backbone of the Pangeo software stack. Xarray is available under the open source Apache License.
This article is a collection of snippets geared towards Earth scientists.
Table of Contents
0. Installation
0.1 Tutorial dataset
1. Climatology and Anomalies
2. Downsampling: Monthly Average
2.1 Monthly Max, Min, Median, etc.
2.2 N-month Average
3. Upsampling: Daily Interpolation
4. Weighted Average
5. Moving Average
6. Ensemble Average
7. Assign New Variables or Coordinate
7.1 Assign new variable
7.1 Changing time coordinate
7.2 Changing longitude coordinate
8. Select a Specific Location
9. Fill in Missing Values
9.1 Fill NaN with value
9.2 Replace with climatology
9.3 Interpolate between points
9.4 Forward/backward filling
10. Filter Data
11. Mask Data
12. Final Thoughts
12.1 Split-apply-combine
0. Installation
The developers recommend using the conda package manager and the community-maintained conda-forge channel for installation.
0.1 Tutorial dataset
Once installed, try loading this tutorial dataset, which is two years of air temperature sampled 4 times daily.
This dataset can be used to explore the recipes in this article.
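Loading the tutorial data needs network access, so here is a minimal sketch that shows the loader call in a comment and builds an offline stand-in with the same layout. The helper name make_air_like and the random values are my own assumptions for illustration; only xr.tutorial.open_dataset("air_temperature") refers to the real tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_air_like(seed: int = 0) -> xr.Dataset:
    """Hypothetical helper: synthetic data laid out like the tutorial dataset."""
    rng = np.random.default_rng(seed)
    # Two years sampled 4 times daily (every 6 hours)
    time = pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")
    lat = np.arange(75.0, 14.0, -2.5)
    lon = np.arange(200.0, 331.0, 2.5)
    # Made-up temperatures in Kelvin, NOT real observations
    air = 280.0 + rng.normal(0.0, 5.0, size=(time.size, lat.size, lon.size))
    return xr.Dataset(
        {"air": (("time", "lat", "lon"), air)},
        coords={"time": time, "lat": lat, "lon": lon},
    )


# With network access, the real tutorial data would be loaded instead:
# ds = xr.tutorial.open_dataset("air_temperature")
ds = make_air_like()
print(ds)
```

The remaining sketches in this article use even smaller synthetic datasets so that the expected numbers are easy to check by hand.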
1. Climatology and Anomalies
Climatology: A monthly climatology entails averaging all the Januaries in a time series, then all the Februaries, etc.
Anomalies: These are deviations from climatology or the difference between the original time series and climatology. For example, if the time series is monthly, then the January 2013 anomaly is the January 2013 value minus the January climatology.
groupby() collects all the like coordinate values. In this case, each group contains all the data for one calendar month, regardless of the year. We then calculate the mean across each group, and the results are combined back together.
The argument supplied to groupby() specifies what we want to group by. You can get the months via the .dt accessor, or equivalently with the "time.month" shorthand.
When calculating monthly anomalies, we first group by month and then subtract the climatology. For example, the January climatology is subtracted from each member of the January group, and this is then repeated for every other month.
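The two steps can be sketched as follows. This is a minimal example on a made-up two-year monthly series (the variable name air and the values are assumptions), chosen so the climatology and anomalies are easy to verify by hand.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy monthly series: value = month number, plus 10 in the second year
time = pd.date_range("2013-01-01", periods=24, freq="MS")
values = np.asarray(time.month + 10.0 * (time.year - 2013), dtype=float)
ds = xr.Dataset({"air": ("time", values)}, coords={"time": time})

# Climatology: average all Januaries, all Februaries, etc.
climatology = ds.groupby("time.month").mean("time")

# Anomalies: subtract each month's climatology from the members of that month
anomalies = ds.groupby("time.month") - climatology
```

Here each month's climatology is the month number plus 5, so the 2013 anomalies are all -5 and the 2014 anomalies are all +5.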
2. Downsampling: Monthly Average
Downsampling: Decreasing the frequency of the samples. For example, going from a daily time series to monthly.
To achieve this with xarray we use .resample(). The argument supplied specifies the temporal dimension (e.g. time) and resample frequency (e.g. monthly). For example, the sampling frequency string '1MS' means sample monthly, with the new time vector labeled at the start of each month (2000-Jan.-01, 2000-Feb.-01, etc.).
The sampling frequency (e.g. '1MS') specifies how to resample the data. See offset aliases in the Pandas documentation for other options.
To calculate the monthly average you have to attach .mean(). Removing this method will return a DatasetResample object. Always specify what you want to do with the samples.
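As a minimal sketch, assuming a made-up daily series covering January and February 2013, monthly downsampling looks like this:

```python
import numpy as np
import pandas as pd
import xarray as xr

# 59 daily values: 31 days of January followed by 28 days of February 2013
time = pd.date_range("2013-01-01", periods=59, freq="D")
ds = xr.Dataset({"air": ("time", np.arange(59.0))}, coords={"time": time})

# Group samples by calendar month and average each group
monthly = ds.resample(time="1MS").mean()
```

The result has two time steps, labeled 2013-01-01 and 2013-02-01, holding the January mean (15.0) and the February mean (44.5).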
2.1 Monthly Max, Min, Median, etc.
There is a suite of other methods you can use with resample(), including: max, min, median, std, var, quantile, and sum.
2.2 N-month Average
Instead of averaging monthly, what if you wanted to average every 2 months? In this case, you would simply supply the number of months to sample by in the sampling frequency string:
Monthly average: ‘1MS’ (the 1 is optional)
2-month average: ‘2MS’
…
15-month average: ‘15MS’ (I don’t know why you would want to do this, but you can)
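A minimal sketch of the 2-month case, assuming a made-up six-month series with values 1 through 6:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Six monthly values, January through June 2013
time = pd.date_range("2013-01-01", periods=6, freq="MS")
ds = xr.Dataset({"air": ("time", np.arange(1.0, 7.0))}, coords={"time": time})

# Average in 2-month bins: Jan-Feb, Mar-Apr, May-Jun
two_monthly = ds.resample(time="2MS").mean()
```

Swapping .mean() for .max(), .min(), or .median() works the same way.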
3. Upsampling: Daily Interpolation
Upsampling: Increasing the sampling frequency. For example, going from a monthly time series to daily.
Upsampling converts a "low" temporal resolution (e.g. monthly) to a "high" resolution (e.g. daily), here using linear interpolation. As with downsampling, the argument supplied to .resample() specifies the temporal dimension (e.g. time) and resample frequency (e.g. daily). The sampling frequency string '1D' means sample daily.
We then attach .interpolate("linear") to linearly interpolate between the points.
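A minimal sketch, assuming three made-up monthly values chosen so the interpolated daily values are simply the day count since January 1st. Note that, as far as I know, this interpolation path relies on SciPy being installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Monthly values on Jan 1, Feb 1, and Mar 1 of 2013
time = pd.date_range("2013-01-01", periods=3, freq="MS")
ds = xr.Dataset({"air": ("time", [0.0, 31.0, 59.0])}, coords={"time": time})

# Upsample to daily resolution, filling the new steps by linear interpolation
daily = ds.resample(time="1D").interpolate("linear")
```

The result spans January 1st through March 1st, 60 daily steps in total.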
4. Weighted Average
Weighted average: Average of values which are scaled by their importance. Calculated as the sum of weights multiplied by the values divided by the sum of the weights.
When averaging geospatial data, it is often important to weight the data so small regions do not skew the result. For instance, if your data is on a uniform latitude-longitude grid, then grid cells near the poles occupy less area than those at low latitudes. If you average temperature over the entire globe, then you should weight the Arctic less than the tropics since it occupies less area.
A common approach, and the one I use here, is to weight by the cosine of the latitude. This approach follows an example on the xarray webpage.
When calculating a weighted average with xarray note that ds.weighted(weights) is an instance of the “weighted class.” In addition to a weighted mean, we can calculate a weighted standard deviation, weighted sum, etc. See options here.
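A minimal sketch of cosine-latitude weighting, assuming a tiny made-up grid with one row at the equator and one at 60°N:

```python
import numpy as np
import xarray as xr

# Two latitude rows: 10 everywhere at the equator, 20 everywhere at 60N
air = xr.DataArray(
    np.array([[10.0] * 4, [20.0] * 4]),
    dims=("lat", "lon"),
    coords={"lat": [0.0, 60.0], "lon": [0.0, 90.0, 180.0, 270.0]},
    name="air",
)

# Weights proportional to cos(latitude): 1.0 at the equator, 0.5 at 60N
weights = np.cos(np.deg2rad(air["lat"]))
weights.name = "weights"

weighted_mean = air.weighted(weights).mean(("lat", "lon"))
unweighted_mean = air.mean(("lat", "lon"))
```

The unweighted mean is 15, while the weighted mean is (10 × 1 + 20 × 0.5) / 1.5 ≈ 13.33, because the high-latitude row counts for half as much.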
Another approach is to weight by grid cell area.
5. Moving Average
Moving average: Technique to smooth data by averaging within specific intervals. The interval is defined by a specific number of data points, termed the window length.
Rolling mean/running mean/moving average is a technique to smooth short-term fluctuations to enhance the signal-to-noise ratio. You use the .rolling() method and supply it the dimension to apply the mean over and the window length.
For example, a 30-day running mean over the air-temperature tutorial dataset needs a window of 120 samples, since the data is sampled 4 times each day.
An optional argument you can supply to rolling() is center=True, which sets the labels at the center of the window instead of the end.
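A minimal sketch of the 30-day running mean, assuming 60 days of made-up 6-hourly data in place of the tutorial dataset:

```python
import numpy as np
import pandas as pd
import xarray as xr

# 60 days sampled 4 times daily = 240 samples
time = pd.date_range("2013-01-01", periods=240, freq="6h")
ds = xr.Dataset({"air": ("time", np.arange(240.0))}, coords={"time": time})

# 30 days * 4 samples per day = 120-sample window, labeled at the window center
window = 30 * 4
smoothed = ds.rolling(time=window, center=True).mean()
```

Positions without a full window are set to NaN, so only 240 - 120 + 1 = 121 time steps hold a value. Passing min_periods to rolling() relaxes this.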
6. Ensemble Average
Ensemble Average: Averaging multiple estimates of the same quantity. For example, the collection of CMIP models is an example of an ensemble.
If you have a dataset with multiple related variables you can concatenate them across a new dimension and then perform statistics.
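A minimal sketch of this idea, assuming a made-up dataset with three variables named model_a, model_b, and model_c (hypothetical names) that estimate the same quantity:

```python
import pandas as pd
import xarray as xr

time = pd.date_range("2013-01-01", periods=3, freq="MS")
ds = xr.Dataset(
    {
        "model_a": ("time", [1.0, 2.0, 3.0]),
        "model_b": ("time", [2.0, 3.0, 4.0]),
        "model_c": ("time", [3.0, 4.0, 5.0]),
    },
    coords={"time": time},
)

# Concatenate the variables along a new "member" dimension
members = ["model_a", "model_b", "model_c"]
stacked = xr.concat(
    [ds[name].rename("air") for name in members],
    dim=pd.Index(members, name="member"),
)

# Statistics across the ensemble
ensemble_mean = stacked.mean("member")
ensemble_std = stacked.std("member")
```

Renaming each variable to a shared name before concatenating is my own choice here; it keeps the result's name unambiguous.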
I previously wrote about this topic, describing a different technique to achieve this with xarray. I now prefer concatenating along a new dimension.
7. Assign New Variables or Coordinate
7.1 Assign new variable
To add new variables to a dataset you can use .assign(). Keep in mind there is no 'inplace' option with xarray, so you always have to explicitly assign the result to a variable.
For example, you can convert temperature from degrees Celsius to Kelvin and assign the result to a new variable named temp_k. Units are very important. Remember the Mars Climate Orbiter incident?
Adding attributes to the new variable is optional.
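A minimal sketch, assuming a made-up dataset whose Celsius variable is called temp_c (a hypothetical name):

```python
import xarray as xr

ds = xr.Dataset({"temp_c": ("time", [0.0, 25.0])})

# assign() returns a new dataset, so the result must be bound to a name
ds = ds.assign(temp_k=ds["temp_c"] + 273.15)
# Optional: describe the new variable with attributes
ds["temp_k"].attrs = {"units": "K", "long_name": "temperature"}
```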
7.1 Changing time coordinate
Here is an example of changing the time coordinate. This is useful when you want the time coordinate to start on the 15th of the month instead of the 1st.
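One way to sketch this, assuming made-up monthly data labeled on the 1st, is to add a 14-day offset to the existing coordinate with assign_coords():

```python
import numpy as np
import pandas as pd
import xarray as xr

# Monthly data labeled at the start of each month
time = pd.date_range("2013-01-01", periods=3, freq="MS")
ds = xr.Dataset({"air": ("time", [1.0, 2.0, 3.0])}, coords={"time": time})

# Shift every label from the 1st to the 15th of the month
shifted = ds.assign_coords(time=ds["time"] + np.timedelta64(14, "D"))
```

Only the labels change; the data values stay in the same order.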
7.2 Changing longitude coordinate
There are two conventions for longitude:
"360 convention": 0 to 360, with 0 at the prime meridian and values increasing eastward
"180 convention": -180 to 180 (180°W to 180°E), centered on zero at the prime meridian
I made up the convention names. Please let me know if proper names exist.
Change from 180 to 360
Change from 360 to 180
Alternatively, you could use a list comprehension to make this more intuitive.
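Both conversions, plus the list-comprehension variant, can be sketched with modular arithmetic. This assumes a made-up four-point longitude axis; the sortby() calls are my own addition to keep the longitudes monotonic after relabeling.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"air": ("lon", [1.0, 2.0, 3.0, 4.0])},
    coords={"lon": np.array([-180.0, -90.0, 0.0, 90.0])},
)

# Change from 180 to 360: negative longitudes wrap to 180..360
ds_360 = ds.assign_coords(lon=ds["lon"] % 360).sortby("lon")

# Change from 360 to 180: longitudes of 180 and above wrap to negative values
ds_180 = ds_360.assign_coords(
    lon=((ds_360["lon"] + 180) % 360) - 180
).sortby("lon")

# Same 180 -> 360 conversion written as a list comprehension
lon_360 = [x + 360 if x < 0 else x for x in ds["lon"].values]
ds_360_alt = ds.assign_coords(lon=lon_360).sortby("lon")
```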
8. Select a Specific Location
Use sel() to select a specific latitude/longitude location. By default, sel() looks for exact matches in the dataset, but supplying method='nearest' tells xarray to find the closest match to your selection.
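A minimal sketch, assuming a made-up 2 × 2 grid and a requested location that falls between grid points:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"air": (("lat", "lon"), np.array([[1.0, 2.0], [3.0, 4.0]]))},
    coords={"lat": [40.0, 42.5], "lon": [285.0, 287.5]},
)

# (41.2, 287.3) is not on the grid, so ask for the nearest grid point
point = ds.sel(lat=41.2, lon=287.3, method="nearest")
```

The selection snaps to lat=40.0 and lon=287.5. Without method="nearest", this call would raise a KeyError because no exact match exists.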
9. Fill in Missing Values
Often your data will have missing values you need to fill.
Unfortunately, there is not a "one size fits all" solution to this problem.
Common approaches:
Fill missing values with some value
Fill missing values with climatology
Interpolate between points
Propagate values
9.1 Fill NaN with value
Using the fillna() method is a straightforward way to replace missing values with some value. A fill value of 0 is one option, but it could be anything; the choice depends on the situation.
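A minimal sketch with a made-up series containing two gaps:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"air": ("time", [1.0, np.nan, 3.0, np.nan])})

# Replace every NaN with a constant fill value
filled = ds.fillna(0)
```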
9.2 Replace with climatology
To replace with climatological values you first have to group the data via groupby() and then use fillna() supplied with the climatological values.
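A minimal sketch, assuming a made-up two-year monthly series where each value equals its month number and March 2013 is missing:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2013-01-01", periods=24, freq="MS")
values = np.array(time.month, dtype=float)
values[2] = np.nan  # March 2013 is missing
ds = xr.Dataset({"air": ("time", values)}, coords={"time": time})

# Monthly climatology; NaNs are skipped when averaging
climatology = ds.groupby("time.month").mean("time")

# Fill each gap with the climatological value for its month
filled = ds.groupby("time.month").fillna(climatology)
```

The missing March is filled with 3.0, the mean of the remaining March value.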
9.3 Interpolate between points
Another approach is to interpolate across missing values using interpolate_na(). For example, you can linearly interpolate across time.
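A minimal sketch, assuming a made-up daily series with gaps in the interior:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2013-01-01", periods=6, freq="D")
ds = xr.Dataset(
    {"air": ("time", [0.0, np.nan, 2.0, np.nan, np.nan, 5.0])},
    coords={"time": time},
)

# Linearly interpolate across the gaps, using the time coordinate for spacing
filled = ds.interpolate_na(dim="time", method="linear")
```

Because the interpolation uses the coordinate values, unevenly spaced samples are handled correctly as well.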
9.4 Forward/backward filling
These methods fill NaN values with the nearest valid value along the supplied dimension: ffill() propagates the last valid value forward, and bfill() propagates the next valid value backward.
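A minimal sketch on a made-up series. As far as I know, these two methods rely on an optional dependency (bottleneck, or numbagg in recent xarray versions), so treat that as an assumption about your environment.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"air": ("time", [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])})

# Propagate the last valid value forward along time
forward = ds.ffill("time")

# Propagate the next valid value backward along time
backward = ds.bfill("time")
```

A leading NaN survives ffill() and a trailing NaN survives bfill(), since there is no valid value to propagate from.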
10. Filter Data
Select data based on a conditional expression. Only the data where the expression evaluates to True will be retained; everything else will be NaN.
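A minimal sketch using where(), assuming made-up temperatures and an arbitrary threshold of 280:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"air": ("time", [270.0, 285.0, 290.0, 275.0])})

# Keep values above 280; everything else becomes NaN
filtered = ds.where(ds["air"] > 280)
```

Passing drop=True to where() additionally removes coordinate labels where the condition is False everywhere.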
11. Mask Data
Mask data based on a conditional expression. Mask values will be True if the supplied condition is True and False otherwise. This can be turned into a mask of 1s and 0s simply by multiplying by 1.
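A minimal sketch, again assuming made-up temperatures and a threshold of 280:

```python
import xarray as xr

ds = xr.Dataset({"air": ("time", [270.0, 285.0, 290.0, 275.0])})

# Boolean mask: True where the condition holds, False otherwise
mask = ds["air"] > 280

# Multiply by 1 to turn True/False into 1/0
mask_int = mask * 1
```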
12. Final Thoughts
I hope this post helped you with your work and illustrated the power of xarray. Keep in mind this post merely provides one person's perspective. There may be more efficient or intuitive solutions to some of these problems. For that reason, I strongly encourage feedback and comments.
A common task I did not include in this post is regridding data. I encourage you to check out the awesome xESMF package.
Finally, the examples in this post can be adapted for other purposes. For example, calculating climatology is really an application of a split-apply-combine strategy.
12.1 Split-apply-combine
Split the data into groups.
Apply a function to each group.
Combine all the groups back together.
When calculating climatologies the "apply" step is mean. However, this can be swapped out for: max, min, median, std, var, quantile, or sum. A custom method can even be supplied by using map().
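A minimal sketch of a custom "apply" step, assuming a made-up two-year monthly series and a hypothetical function peak_to_peak that returns each group's range:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy monthly series: value = month number, plus 10 in the second year
time = pd.date_range("2013-01-01", periods=24, freq="MS")
values = np.asarray(time.month + 10.0 * (time.year - 2013), dtype=float)
ds = xr.Dataset({"air": ("time", values)}, coords={"time": time})


def peak_to_peak(group: xr.Dataset) -> xr.Dataset:
    """Hypothetical custom reduction: range (max minus min) of one group."""
    return group.max("time") - group.min("time")


# Split by month, apply the custom function, combine the results
monthly_range = ds.groupby("time.month").map(peak_to_peak)
```

Each calendar month has two members that differ by 10, so the result is 10 for all twelve months.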
Thank you for reading and supporting Medium writers