avatarIvo Bernardo

Summarize

Using Resample in Pandas

Learn how to work with the Pandas resample method, a cool way to work with time based data

Photo by Markus Spiske @Unsplash.com

Time based data is one of the most common data formats that you, as data scientist, have probably stumbled. Either in the format of historical features (for instance, customer data) or time series data, it’s pretty common that one has to deal with timestamp columns in data pipelines.

If you work as a data scientist in 2022, it’s mandatory that you know how to work with pandas, one of the most powerful Python libraries to wrangle data. pandasis famous for its regular updates and new methods that fit a lot of tasks regarding data wrangling and manipulation.

One of its most famous methods is the resampleone, that lets you handle time based data smoothly. In a nutshell, resample contains several features that help you tackle time based grouping and aggregation in a really smooth way, improving the speed and simplicity when working with datetime columns. In the past, you would probably have to construct complicated (or even buggy) functions, that would have a tendency to be hard to maintain. With resample that’s completely gone and you can play around with dates in pandas dataframes much easily.

In this post, we’ll explore some cool things about this method and how easy it is to apply it — particularly when we want to create time based features for our data science algorithms.

Reading the Data

First, I‘ll start by loading a time series data set that consists of Hourly Weather data between 2008 and 2021 in Gallipoli, Turkey — a dataset belonging to Meteoblue.

Let me start by loading this dataset into Python:

gallipoli_data = pd.read_csv(‘../Hourly Weather Data in Gallipoli (20082021).csv’)

Let’s look at the first 10 rows of the dataset:

Sneak Peek of the Gallipoli Dataset — Image by Author

To work with resample, our data must meet two conditions:

  • The column that we will be using to represent time must be a datetime data type.
  • That column must go into the index of the dataframe.

Starting by converting the DateTime column into a DateTime type:

gallipoli_data[‘DateTime’] = pd.to_datetime(gallipoli_data[‘DateTime’])

And passing our column into the index of the dataframe:

gallipoli_data.index = gallipoli_data[‘DateTime’]

Our dataframe is ready! Let’s also load the libraries needed to follow along this blog post:

import pandas as pd
import numpy as np

Having everything set up, let’s tackle our resample method!

Using resample

Let’s start with a simple use case: imagine that we would like to get the average temperature in Gallipoli for every 3 hour interval. Although simple, if we would try to do this in base Python or with custom pandas code, it would be a bit cumbersome as we would have to:

  • create some type of column that would increment every 3 hours;
  • use a group by on that column;

Although this is manageable with a couple of lines of code, a one liner would be better, right? And what if that one liner would also be able to generalize for other time frames (weeks, months, quarter or even minutes)? That’s exactly what resample is for! We can perform a quick time based aggregation on the data really fast:

gallipoli_data.Temperature.resample(rule = ‘3h’).mean()

Here’s the output of the code above:

Resample using 3h as Time Frame — Image by Author

How cool? The resample method contains a bunch of knobs we can turn such as the rule or the aggregator function we are using. Can you guess what we need to tweak if we want to check the maximum value of sunshine duration in an interval of 3 hours? We just switch mean() for max() !

gallipoli_data[‘Sunshine Duration’].resample(rule = ‘3h’).max()
Resample using 3h as Time Frame but using the Max function — Image by Author

To make this even more clear and peak under the hood of resample, let’s see the evolution of the Sunshine Duration for the 2nd of January in Gallipoli:

(
    gallipoli_data.loc[
        (gallipoli_data.DateTime >= '2008-01-02') 
        & 
        (gallipoli_data.DateTime <= '2008-01-03')
    ]
)['Sunshine Duration'].plot(color='darkorange')
Sunshine Duration through 02-Jan-2008 — Image by Author

Whatresample is doing is grouping data for every three hours by applying the following:

  • From 9 a.m. until 12, the maximum Sunshine Duration is 60.
  • From noon until 15 p.m., the maximum Sunshine Duration is 60.
  • From 15 until 18 p.m., the maximum Sunshine Duration is 38.

This is reflected on the table we’ve seen above — checking it again:

Resample using 3h as Time Frame using the Max function — Image by Author

Notice that the interval is closed on the left side. Meaning that the row that relates to 2008-01-02 09:00:00 aggregates the entire data from 09:00:00 a.m. until 11:59:59 a.m. If we want to change this behavior, we need to pass a new argument closedto the resample :

gallipoli_data['Sunshine Duration'].resample(rule = '3h', closed= 'right').max()
Resample using 3h as Time Frame using Max and Closed = ‘Right’— Image by Author

In this case, the same row 2008-01-02 09:00:00 aggregates all data from 09:00:01 a.m. until 12:00:00 a.m.

Tweaking the resample knobs is very easy. Another common argument that we may want to tweak is the time frame aggregation — for instance, aggregating our data by week or month. Is this possible? Let’s see, next.

Modifying the Rule Argument

With the ruleargument we’re not only restricted to hourly aggregations! This is one of the coolest features of the resamplefunction. For instance, if we would like a bi-weekly aggregation, we can:

gallipoli_data.Temperature.resample(rule = ‘2w’).mean()
Bi-Weekly Temperature aggregation — Image by Author

The number before the w defines the number of instances we want to group by and the w defines that we want a week time-frame.

Can you guess what’s the argument we need to pass to aggregate our data for every 4-months?

gallipoli_data.Temperature.resample(rule = ‘4m’).mean()
4 Months aggregation Example — Image by Author

In summary, rule controls the time interval we want to group by. Some of the most famous rule parameters you can use are:

  • s for seconds
  • t for minutes
  • h for hours
  • w for weeks
  • m for months
  • q for quarter

Filling the blanks

Another important feature of resampleis that it can help you create a more complete data frame by making sense of the gaps in the data. For instance, imagine we would have a hourly sales dataset from a shop:

date_range = pd.date_range('1/2/2022', periods=24, freq='H')
sales = np.random.randint(100, 400, size=24)
sales_data = pd.DataFrame(
    sales,
    index = date_range,
    columns = ['Sales']
)
Head of the sales_data dataframe — Image by Author

Let’s delete the row of the sales at 10 a.m. (imagining this was something that really happened and there were no sales in this store during this time):

sales_data = sales_data.drop(sales_data.iloc[10].name)
Head of the sales_data dataframe with no sales at 10 a.m.— Image by Author

If we apply a resampleto this dataframe, a new row with NaN will show up:

pd.DataFrame(
    sales_data.Sales.resample('h').mean()
)
Head of the sales_data dataframe with no sales at 10 a.m. after resample— Image by Author

This is extremely handy in a lot of time based scenarios. Normally, if we have a gap in our time series data, we want to have a row that represents that gap and resample gives us a quick way to achieve it. In the example above, we can even add a fillna(0) to make our scenario of “lack of sales” at 10 a.m. even more obvious:

Head of the sales_data dataframe with no sales at 10 a.m. after resample — Image by Author

Filling the gaps is a really common thing we may want to do when working with time based data. Not only when creating features, but also as a way to catch major bugs on our data pipelines.

Thank you for taking the time to read this post!

If you work with time based data and never usedresample before, I really recommend that you study this method thoroughly as it will save you a lot of headaches during your data wranling process.

Either when we’re building time based features or manipulating time-series data, using resample makes our code more efficient and clean. As we’ve discussed, this method is also handy to catch some unexpected gaps in time-based data so I hope you can use the knowledge in this post sometime in your future data pipelines!

The dataset used in this post is under CC BY 4.0 and belongs to Meteoblue.

Python
Resample
Data Science
Data Engineering
Tips And Tricks
Recommended from ReadMedium