Using Resample in Pandas
Learn how to work with the pandas resample method, a cool way to work with time-based data
Time-based data is one of the most common data formats that you, as a data scientist, have probably stumbled upon. Whether in the form of historical features (for instance, customer data) or time series data, it's pretty common to deal with timestamp columns in data pipelines.
If you work as a data scientist in 2022, it's mandatory that you know how to work with `pandas`, one of the most powerful Python libraries for wrangling data. `pandas` is famous for its regular updates and new methods that fit a lot of data wrangling and manipulation tasks.
One of its most famous methods is `resample`, which lets you handle time-based data smoothly. In a nutshell, `resample` contains several features that help you tackle time-based grouping and aggregation in a really smooth way, improving the speed and simplicity of working with datetime columns. In the past, you would probably have to build complicated (or even buggy) functions that tended to be hard to maintain. With `resample`, that's completely gone, and you can play around with dates in `pandas` dataframes much more easily.
In this post, we'll explore some cool things about this method and how easy it is to apply — particularly when we want to create time-based features for our data science algorithms.
Reading the Data
First, I'll start by loading a time series dataset that consists of Hourly Weather data between 2008 and 2021 in Gallipoli, Turkey — a dataset belonging to Meteoblue.
Let me start by loading this dataset into Python:
gallipoli_data = pd.read_csv('../Hourly Weather Data in Gallipoli (2008–2021).csv')
Let’s look at the first 10 rows of the dataset:
To work with `resample`, our data must meet two conditions:
- The column that we will be using to represent time must be a `datetime` data type.
- That column must be set as the index of the dataframe.
Let's start by converting the `DateTime` column into a datetime type:
gallipoli_data['DateTime'] = pd.to_datetime(gallipoli_data['DateTime'])
And passing our column into the index of the dataframe:
gallipoli_data.index = gallipoli_data[‘DateTime’]
Our dataframe is ready! Let's also load the libraries needed to follow along with this blog post:
import pandas as pd
import numpy as np
Having everything set up, let's tackle the `resample` method!
Using resample
Let's start with a simple use case: imagine that we would like to get the average temperature in Gallipoli for every 3-hour interval. Although simple, if we tried to do this in base Python or with custom `pandas` code, it would be a bit cumbersome, as we would have to:
- create some type of column that would increment every 3 hours;
- use a group by on that column;
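The manual approach sketched in the steps above might look something like this — a minimal sketch using synthetic hourly data (the column name `Temperature` mirrors the dataset, but the values are made up):

```python
import pandas as pd
import numpy as np

# Synthetic hourly data standing in for the Gallipoli dataset
idx = pd.date_range("2008-01-01", periods=12, freq="h")
df = pd.DataFrame({"Temperature": np.arange(12, dtype=float)}, index=idx)

# Step 1: build a bucket label that increments every 3 hours
bucket = df.index.floor("3h")

# Step 2: group by that label and aggregate
manual = df.groupby(bucket)["Temperature"].mean()
```

This works, but it is exactly the kind of boilerplate `resample` removes.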
Although this is manageable with a couple of lines of code, a one-liner would be better, right? And what if that one-liner could also generalize to other time frames (weeks, months, quarters, or even minutes)? That's exactly what `resample` is for! We can perform a quick time-based aggregation on the data really fast:
gallipoli_data.Temperature.resample(rule = '3h').mean()
Here’s the output of the code above:
How cool is that? The `resample` method contains a bunch of knobs we can turn, such as the rule or the aggregation function we are using. Can you guess what we need to tweak if we want to check the maximum value of sunshine duration in an interval of 3 hours? We just switch `mean()` for `max()`!
gallipoli_data[‘Sunshine Duration’].resample(rule = ‘3h’).max()
To make this even clearer and peek under the hood of `resample`, let's see the evolution of `Sunshine Duration` on the 2nd of January in Gallipoli:
(
    gallipoli_data.loc[
        (gallipoli_data.DateTime >= '2008-01-02')
        & (gallipoli_data.DateTime <= '2008-01-03')
    ]
)['Sunshine Duration'].plot(color='darkorange')
What `resample` is doing is grouping the data for every three hours and applying the aggregation:
- From 9 a.m. until noon, the maximum `Sunshine Duration` is 60.
- From noon until 3 p.m., the maximum `Sunshine Duration` is 60.
- From 3 p.m. until 6 p.m., the maximum `Sunshine Duration` is 38.
This is reflected in the table we've seen above — let's check it again:
Notice that the interval is closed on the left side, meaning that the row labeled `2008-01-02 09:00:00` aggregates all data from 09:00:00 a.m. until 11:59:59 a.m. If we want to change this behavior, we need to pass a new argument, `closed`, to `resample`:
gallipoli_data['Sunshine Duration'].resample(rule = '3h', closed = 'right').max()
In this case, the same row `2008-01-02 09:00:00` aggregates all data from 09:00:01 a.m. until 12:00:00 (noon).
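To see the difference side by side, here is a small sketch comparing the default `closed='left'` with `closed='right'` on a tiny synthetic series (the timestamps and values are made up for illustration):

```python
import pandas as pd

# Four hourly readings from 09:00 to 12:00
idx = pd.date_range("2008-01-02 09:00", periods=4, freq="h")
s = pd.Series([10, 20, 30, 40], index=idx)

# Default: each 3-hour bin includes its left edge, e.g. [09:00, 12:00)
left = s.resample("3h").max()

# closed='right': the bin labeled 09:00 covers (09:00, 12:00]
right = s.resample("3h", closed="right").max()
```

With `closed='left'`, the 09:00 row is the max of the 09:00–11:00 readings; with `closed='right'`, the same label covers the 10:00–12:00 readings instead.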
Tweaking the `resample` knobs is very easy. Another common argument we may want to tweak is the aggregation time frame — for instance, aggregating our data by week or month. Is this possible? Let's see, next.
Modifying the Rule Argument
With the `rule` argument we're not restricted to hourly aggregations! This is one of the coolest features of the `resample` method. For instance, if we would like a bi-weekly aggregation, we can use:
gallipoli_data.Temperature.resample(rule = '2w').mean()
The number before the `w` defines how many instances we want to group by, and the `w` defines that we want a weekly time frame.
Can you guess the argument we need to pass to aggregate our data for every 4 months?
gallipoli_data.Temperature.resample(rule = '4m').mean()
In summary, `rule` controls the time interval we want to group by. Some of the most common `rule` parameters you can use are:
- `s` for seconds
- `t` for minutes
- `h` for hours
- `w` for weeks
- `m` for months
- `q` for quarters
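A quick sketch trying a couple of these rule strings on synthetic minute-level data (`min` is the minute alias recommended in newer pandas versions, equivalent to `t`):

```python
import pandas as pd
import numpy as np

# Two hours of synthetic minute-level readings
idx = pd.date_range("2022-01-01", periods=120, freq="min")
s = pd.Series(np.arange(120.0), index=idx)

# 15-minute buckets -> 8 rows; hourly buckets -> 2 rows
by_15min = s.resample("15min").mean()
by_hour = s.resample("h").mean()
```

The same series collapses into as many rows as the rule dictates, with no extra bookkeeping on our side.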
Filling the blanks
Another important feature of `resample` is that it can help you build a more complete dataframe by making sense of the gaps in the data. For instance, imagine we had an hourly sales dataset from a shop:
date_range = pd.date_range('1/2/2022', periods=24, freq='H')
sales = np.random.randint(100, 400, size=24)
sales_data = pd.DataFrame(
    sales,
    index = date_range,
    columns = ['Sales']
)
Let's delete the row of the sales at 10 a.m. (imagining this really happened and there were no sales in this store during that hour):
sales_data = sales_data.drop(sales_data.iloc[10].name)
If we apply a `resample` to this dataframe, a new row with `NaN` will show up:
pd.DataFrame(
    sales_data.Sales.resample('h').mean()
)
This is extremely handy in a lot of time-based scenarios. Normally, if we have a gap in our time series data, we want a row that represents that gap, and `resample` gives us a quick way to achieve it. In the example above, we can even add a `fillna(0)` to make our scenario of "lack of sales" at 10 a.m. even more obvious:
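Putting the whole example together with the `fillna(0)` step — a self-contained sketch with synthetic sales numbers (the seeded generator is just for reproducibility):

```python
import pandas as pd
import numpy as np

# Synthetic hourly sales for one day
rng = np.random.default_rng(42)
date_range = pd.date_range("1/2/2022", periods=24, freq="h")
sales_data = pd.DataFrame(
    {"Sales": rng.integers(100, 400, size=24)}, index=date_range
)

# Simulate the missing 10 a.m. row
sales_data = sales_data.drop(sales_data.index[10])

# resample re-creates the missing hour as NaN; fillna(0) marks it as zero sales
filled = sales_data["Sales"].resample("h").mean().fillna(0)
```

The 10 a.m. row reappears in `filled` with a value of 0 instead of being silently absent.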
Filling the gaps is a really common thing we may want to do when working with time-based data — not only when creating features, but also as a way to catch major bugs in our data pipelines.
Thank you for taking the time to read this post!
If you work with time-based data and have never used `resample` before, I really recommend that you study this method thoroughly, as it will save you a lot of headaches during your data wrangling process.
Whether we're building time-based features or manipulating time-series data, using `resample` makes our code more efficient and clean. As we've discussed, this method is also handy for catching unexpected gaps in time-based data, so I hope you can use the knowledge in this post in your future data pipelines!
The dataset used in this post is under CC BY 4.0 and belongs to Meteoblue.