Soner Yıldırım


20 Points to Master Pandas Time Series Analysis

How to handle time series data.

Photo by Markus Winkler on Unsplash

There are many definitions of time series data, all of which express the same idea in different ways. A straightforward definition is that time series data consists of data points attached to sequential time stamps.

The sources of time series data are periodic measurements or observations. We observe time series data in many industries. Just to give a few examples:

  • Stock prices over time
  • Daily, weekly, monthly sales
  • Periodic measurements in a process
  • Power or gas consumption rates over time

In this post, I will list 20 points that will help you gain a comprehensive understanding of handling time series data with Pandas.

1. Different forms of time series data

Time series data can take the form of a specific date, a time duration, or a fixed, defined interval.

A timestamp can be the date of a day or a nanosecond on a given day, depending on the precision. For example, ‘2020-01-01 14:59:30’ is a second-based timestamp.

2. Time series data structures

Pandas provides flexible and efficient data structures to work with all kinds of time series data.

In addition to the three structures for specific dates (timestamps), time durations (timedeltas), and fixed intervals (periods), Pandas also supports the date offset concept, which is a relative time duration that respects calendar arithmetic.
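As a rough illustration of these structures, the sketch below creates one scalar of each kind and applies a date offset. The particular dates and durations are arbitrary values chosen for the example.

```python
import pandas as pd

# A specific point in time
ts = pd.Timestamp('2020-01-01 14:59:30')

# A time duration
td = pd.Timedelta(days=2, hours=3)

# A fixed interval: the whole month of January 2020
per = pd.Period('2020-01', freq='M')

# A date offset follows the calendar: one month after January 31
# lands on the last day of February (2020 is a leap year)
shifted = pd.Timestamp('2020-01-31') + pd.DateOffset(months=1)

print(ts + td)                        # timestamp plus duration
print(per.start_time, per.end_time)   # boundaries of the period
print(shifted)
```

Adding a Timedelta always moves by a fixed amount of time, whereas the DateOffset adapts to the length of the month.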

3. Creating a timestamp

The most basic time series data structure is the timestamp, which can be created using the to_datetime function or the Timestamp constructor.

import pandas as pd
pd.to_datetime('2020-9-13')
Timestamp('2020-09-13 00:00:00')
pd.Timestamp('2020-9-13')
Timestamp('2020-09-13 00:00:00')

4. Accessing the information held by a timestamp

We can get information about the day, month, and year stored in a timestamp.

a = pd.Timestamp('2020-9-13')
a.day_name()
'Sunday'
a.month_name()
'September'
a.day
13
a.month
9
a.year
2020

5. Accessing not-so-obvious information

Timestamp objects also hold information about date arithmetic. For instance, we can ask if the year is a leap year. Here is some of the more specific information we can access:

b = pd.Timestamp('2020-9-30')
b.is_month_end
True
b.is_leap_year
True
b.is_quarter_start
False
b.weekofyear
40

6. European style date

We can work with European-style dates (i.e. the day comes first) with the to_datetime function by setting the dayfirst parameter to True.

pd.to_datetime('10-9-2020', dayfirst=True)
Timestamp('2020-09-10 00:00:00')
pd.to_datetime('10-9-2020')
Timestamp('2020-10-09 00:00:00')

Note: If the first item is greater than 12, Pandas knows it cannot be a month.

pd.to_datetime('13-9-2020')
Timestamp('2020-09-13 00:00:00')
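Relying on Pandas to infer the order of day and month can be fragile. When the layout of the date strings is known in advance, one option is to state it explicitly. Here is a minimal sketch using the format parameter of to_datetime; the format strings are assumptions about how the input is laid out.

```python
import pandas as pd

# Day-first string parsed with an explicit format
d1 = pd.to_datetime('10-9-2020', format='%d-%m-%Y')

# The same string read as month-first
d2 = pd.to_datetime('10-9-2020', format='%m-%d-%Y')

print(d1)  # 10 September 2020
print(d2)  # 9 October 2020
```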

7. Converting a dataframe to time series data

The to_datetime function can convert a dataframe with appropriate columns to a time series. Consider the following dataframe:

pd.to_datetime(df)
0   2020-04-13 
1   2020-05-16 
2   2019-04-11 
dtype: datetime64[ns]
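The dataframe itself is not reproduced above. As a hypothetical reconstruction, a dataframe with columns named year, month and day (the names to_datetime looks for) would give the same output:

```python
import pandas as pd

# Hypothetical input: the column names year, month and day are what
# to_datetime uses to assemble the dates
df = pd.DataFrame({
    'year': [2020, 2020, 2019],
    'month': [4, 5, 4],
    'day': [13, 16, 11],
})

result = pd.to_datetime(df)
print(result)
```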

8. Beyond a timestamp

In real-life cases, we almost always work with sequential time series data rather than individual dates. Pandas makes it very simple to work with sequential time series data as well.

We can pass a list of dates to the to_datetime function.

pd.to_datetime(['2020-09-13', '2020-08-12', 
'2020-08-04', '2020-09-05'])
DatetimeIndex(['2020-09-13', '2020-08-12', '2020-08-04', '2020-09-05'], dtype='datetime64[ns]', freq=None)

The returned object is a DatetimeIndex.

There are more practical ways to create sequences of dates.

9. Creating a time series with to_datetime and to_timedelta

A DatetimeIndex can be created by adding a TimedeltaIndex to a timestamp.

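import numpy as np
pd.to_datetime('10-9-2020') + pd.to_timedelta(np.arange(5), 'D')

‘D’ is used for ‘day’ but there are many other options available. You can check the whole list in the to_timedelta documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html).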

10. The date_range function

It provides a more flexible way to create a DatetimeIndex.

pd.date_range(start='2020-01-10', periods=10, freq='M')

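The periods parameter specifies the number of items in the index. The freq parameter is the frequency, and ‘M’ indicates the last day of a month. (Pandas 2.2 and later prefer the alias ‘ME’ for month end and emit a deprecation warning for ‘M’ here.)

The date_range function is quite flexible in terms of the values accepted by the freq parameter.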

pd.date_range(start='2020-01-10', periods=10, freq='6D')

We have created an index with a frequency of 6 days.
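The date_range function also accepts an end date. The sketch below, with dates picked only for illustration, builds one index of business days between two dates and one index of evenly spaced points where the number of items is given instead of a frequency.

```python
import pandas as pd

# Business days (Monday to Friday) between two dates, both included
bdays = pd.date_range(start='2020-01-10', end='2020-01-17', freq='B')

# Ten evenly spaced points between two dates, with no freq given
even = pd.date_range(start='2020-01-01', end='2020-01-10', periods=10)

print(bdays)
print(even)
```

Of the four arguments start, end, periods and freq, three are enough to determine the index.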

11. The period_range function

It returns a PeriodIndex. The syntax is similar to the date_range function.

pd.period_range('2018', periods=10, freq='M')
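Each element of a PeriodIndex represents a whole interval rather than a single instant. A small sketch of what that means in practice, inspecting the first period and converting the index to timestamps:

```python
import pandas as pd

months = pd.period_range('2018', periods=10, freq='M')

# Each element is a Period that spans a full month
first = months[0]
print(first.start_time)  # first instant of January 2018
print(first.end_time)    # last instant of January 2018

# Convert to a DatetimeIndex holding the start of each period
stamps = months.to_timestamp()
print(stamps)
```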

12. The timedelta_range function

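It returns a TimedeltaIndex. Note that Pandas 2.2 and later prefer the lowercase alias ‘h’ for hours and emit a deprecation warning for ‘H’.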

pd.timedelta_range(start='0', periods=24, freq='H')
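The freq argument does not have to be a string alias. As a sketch, the same hourly range can be built by passing a Timedelta object, and the result can then be added to a timestamp (the date used here is an arbitrary example) to obtain an hourly DatetimeIndex.

```python
import pandas as pd

# 24 hourly steps, with the frequency given as a Timedelta object
hours = pd.timedelta_range(start='0 days', periods=24,
                           freq=pd.Timedelta(hours=1))

# Adding the deltas to a timestamp gives the hours of that day
day = pd.Timestamp('2020-01-01') + hours

print(hours[:3])
print(day[-1])
```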

13. Time zones

By default, time series objects of pandas do not have an assigned time zone.

dates = pd.date_range('2019-01-01','2019-01-10')
dates.tz is None
True

We can assign a time zone to these objects using the tz_localize method.

dates_lcz = dates.tz_localize('Europe/Berlin')
dates_lcz.tz
<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>
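Once an index is time zone aware, it can be converted to another zone with the tz_convert method. A brief sketch, assuming the time zone database is available on the system:

```python
import pandas as pd

dates = pd.date_range('2019-01-01', '2019-01-10')

# Attach a time zone to naive timestamps
berlin = dates.tz_localize('Europe/Berlin')

# Express the same instants in another zone
utc = berlin.tz_convert('UTC')

print(berlin[0])
print(utc[0])  # midnight in Berlin is 23:00 UTC on the previous day
```

tz_localize attaches a zone without changing the wall-clock time, while tz_convert keeps the instant and changes the wall-clock time.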

14. Create a time series with an assigned time zone

We can also create a time series object with a time zone using the tz keyword argument.

pd.date_range('2020-01-01', periods = 5, freq = 'D', tz='US/Eastern')

15. Offsets

Suppose we have a time series index and want to offset all the dates by a specific amount of time.

A = pd.date_range('2020-01-01', periods=10, freq='D')
A

Let’s add an offset of one week to this series.

A + pd.offsets.Week()
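Week is only one of the available offsets. The sketch below tries a few others on example dates chosen for illustration: a calendar month, the next business day, and the end of the month.

```python
import pandas as pd

A = pd.date_range('2020-01-01', periods=10, freq='D')

# Move every date one calendar month forward
plus_month = A + pd.DateOffset(months=1)

# The next business day after a Friday is the following Monday
friday = pd.Timestamp('2020-01-03')
next_bday = friday + pd.offsets.BDay(1)

# Roll a mid-month date forward to the end of its month
month_end = pd.Timestamp('2020-01-15') + pd.offsets.MonthEnd(1)

print(plus_month)
print(next_bday)
print(month_end)
```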

16. Shifting time series data

Time series data analysis may require shifting data points to make a comparison. The shift function shifts data in time.

A.shift(10, freq='M')

17. Shift vs tshift

  • shift: shifts the data
  • tshift: shifts the time index (deprecated in pandas 1.1 and removed in 2.0; shift with a freq argument gives the same result)

Let’s create a dataframe with a time series index and plot it to see the difference between shift and tshift.

dates = pd.date_range('2020-03-01', periods=30, freq='D')
values = np.random.randint(10, size=30)
df = pd.DataFrame({'values':values}, index=dates)
df.head()

Let’s plot the original time series along with the shifted and tshifted ones.

import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=3, figsize=(10,6), sharey=True)
plt.tight_layout(pad=4)
df.plot(ax=axs[0], legend=None)
df.shift(10).plot(ax=axs[1], legend=None)
# df.tshift(10) was removed in pandas 2.0; shift with freq moves the index instead
df.shift(10, freq='D').plot(ax=axs[2], legend=None)
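The difference is also easy to see without a plot. Here is a small sketch on a five-value series (values chosen for illustration), using shift with a freq argument to move the index:

```python
import pandas as pd

idx = pd.date_range('2020-03-01', periods=5, freq='D')
s = pd.Series([1, 2, 3, 4, 5], index=idx)

# Values move down two rows; the index stays and gaps become NaN
moved_values = s.shift(2)

# The index moves two days forward; the values stay attached
moved_index = s.shift(2, freq='D')

print(moved_values)
print(moved_index)
```

Shifting the values drops the last two observations and introduces missing values at the start, whereas shifting the index keeps all five observations.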

18. Resampling with the resample function

Another common operation with time series data is resampling. Depending on the task, we may need to resample data at a higher or lower frequency.

Resample creates groups (or bins) of a specified interval and lets you do aggregations on the groups.

Let’s create a Pandas series with 30 values and a time series index.

A = pd.date_range('2020-01-01', periods=30, freq='D')
values = np.random.randint(10, size=30)
S = pd.Series(values, index=A)

The following will return the averages of 3-day periods.

S.resample('3D').mean()
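Because the series above holds random values, the binning is easier to follow on predictable data. A minimal sketch with the values 0 to 8, so that each 3-day bin can be checked by hand:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=9, freq='D')
s = pd.Series(range(9), index=idx)

# Bins: Jan 1-3 -> (0, 1, 2), Jan 4-6 -> (3, 4, 5), Jan 7-9 -> (6, 7, 8)
means = s.resample('3D').mean()   # 1.0, 4.0, 7.0
sums = s.resample('3D').sum()     # 3, 12, 21

print(means)
print(sums)
```

Each bin is labelled with its first date, and any aggregation (mean, sum, max and so on) can follow the resample call.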

19. Asfreq function

In some cases, we may be interested in the values at certain frequencies. The asfreq function returns the values at the timestamps of the specified frequency, without any aggregation. For instance, we may only need the value at every third day (not a 3-day average) in the series we created in the previous step.

S.asfreq('3D')
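To make the contrast with resample concrete, here is a sketch on a predictable series with the values 0 to 8. asfreq simply picks the existing values at the new timestamps.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=9, freq='D')
s = pd.Series(range(9), index=idx)

# Keep only the values found at every third day: Jan 1, Jan 4, Jan 7
picked = s.asfreq('3D')

print(picked)  # values 0, 3, 6
```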

20. Rolling

Rolling is a very useful operation for time series data. Rolling means creating a window of a specified size and performing calculations on the data in this window which, of course, rolls through the data. The figure below explains the concept of rolling.

(Image by author)

It is worth noting that the calculation starts when the whole window is in the data. In other words, if the size of the window is three, the first aggregation is done at the third row, and the first two rows are NaN.

Let’s apply a 3-day rolling window to our series.

S.rolling(3).mean()[:10]
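The behaviour at the start of the series can be checked on a short, predictable series. The sketch below also uses the min_periods parameter of rolling, which allows the first windows to be computed from fewer observations.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=6, freq='D')
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Full windows only: the first two results are NaN
roll = s.rolling(3).mean()                          # NaN, NaN, 2, 3, 4, 5

# Accept partial windows at the start of the series
roll_partial = s.rolling(3, min_periods=1).mean()   # 1, 1.5, 2, 3, 4, 5

print(roll)
print(roll_partial)
```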

Conclusion

We have covered a comprehensive introduction to time series analysis with Pandas. It is worth noting that Pandas provides much more in terms of time series analysis.

The official documentation (https://pandas.pydata.org/docs/user_guide/timeseries.html) covers all the functions and methods for time series. It may seem overwhelming at first glance, but you will get comfortable with practice.

Thank you for reading. Please let me know if you have any feedback.
