Mahbub Alam



Formatting and visualizing time series data

Data wrangling and visualization with Pandas, Matplotlib and Seaborn

Photo by Ian Barsby on Unsplash

Well, it’s time for another installment of time series analysis. This time I’m focusing on two things: a) converting a normal dataframe into the right format for analysis; and b) making sense of that data through visualization.

The first objective is essential. Wrangling and cleaning up data is a big part of data science, and even more so in time series analysis. Even for basic analysis, it is easier to work with data that is in good shape. For advanced modeling such as forecasting, the right format is often mandatory so that the programming library can recognize the data as a time series object.

The rest of the article is divided into two parts: in the first part I’ll go through the usual ritual of formatting and exploratory data analysis, and in the second part I will focus on different ways to visualize time series data.

Let’s dive right in!

Part A: Data wrangling

Up front, I want to say what I am not covering in this section: renaming columns, subsetting data, changing data types (e.g. string to int) and treating missing values. To keep this article focused on time series formatting I will not cover them here, but if you are interested you can check out my previous article, A checklist for data wrangling.

i) Import libraries

As usual, I’m using pandas for data wrangling and I’ll go with matplotlib and seaborn for visualization.

# pandas for data wrangling
import pandas as pd
# seaborn and matplotlib for visualization
import seaborn as sns
import matplotlib.pyplot as plt

ii) Import data

For this exercise, I’ve downloaded an interesting dataset on monthly retail book sales (million US$) reported by book stores all across the US. The date range is between 1992 and 2018.

You can download the data from Census Bureau (census.gov) databases to follow along, but I’d encourage picking a different one that you are more familiar with.

So after some initial clean-up, I load in the data and examine a few things by calling the head() and info() methods.

# load in data
df = pd.read_csv("BookSales.csv")
# data structure
df.head()
df.info()

One key thing I’m looking at is that the time dimension is written in “mm/yyyy” format and is stored as a string/object (see output of the info() function).

iii) Creating a datetime object

In this part we want to achieve 3 objectives:

  • convert the “Period” column into a datetime object
  • set the new datetime column as the index of the dataframe
  • create additional “month” and “year” columns for ease of visualization

# convert the date/time column into a datetime object
df["Period"] = pd.to_datetime(df["Period"])
# set the new datetime column as the index
df = df.set_index("Period")
# create new columns from datetime index
df["year"] = df.index.year
df["month"] = df.index.month
# new dataframe
df.head()
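
Since the time dimension is stored as “mm/yyyy” strings, it can be safer to tell pandas the format explicitly rather than let it guess. The sketch below assumes a tiny made-up dataframe standing in for the book sales file; the numbers are illustrative only, and the explicit format string is my addition, not part of the original code.

```python
import pandas as pd

# Made-up stand-in for the real file: "Period" holds "mm/yyyy" strings.
df = pd.DataFrame({
    "Period": ["01/1992", "02/1992", "03/1992"],
    "Value": [100, 110, 120],
})

# An explicit format avoids any day/month guessing by the parser.
df["Period"] = pd.to_datetime(df["Period"], format="%m/%Y")
df = df.set_index("Period")

# The datetime index exposes the calendar components directly.
df["year"] = df.index.year
df["month"] = df.index.month

print(df.index.dtype)
print(df)
```

Each period is parsed to the first day of its month, which is why a lookup such as "2016-01-01" works later on.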

iv) Converting long-form to wide-form

Our current dataframe is in long format, meaning every time point (month-year) gets its own row. In wide format, years are in the rows, months are in the columns and each cell stores the corresponding value.

Long format data will have many more rows than wide format but that’s what data scientists prefer because it’s easy to work with “tidy” data in programming libraries.

x = df.groupby(["year", "month"])["Value"].mean()
df_wide = x.unstack()
df_wide.head()
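
To make the reshape concrete, here is a small sketch on made-up numbers (the frame and its values are hypothetical). It shows the groupby/unstack route used above, an equivalent pivot_table call, and stack() for going back to long form.

```python
import pandas as pd

# Made-up long-form data: one row per (year, month) pair.
long_df = pd.DataFrame({
    "year":  [2000, 2000, 2000, 2001, 2001, 2001],
    "month": [1, 2, 3, 1, 2, 3],
    "Value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Same approach as above: group, then move the inner index level into columns.
wide_a = long_df.groupby(["year", "month"])["Value"].mean().unstack()

# pivot_table groups and reshapes in one call (mean is its default aggregation).
wide_b = long_df.pivot_table(index="year", columns="month", values="Value")

# stack() reverses the reshape and returns a long-form Series.
back_to_long = wide_a.stack()

print(wide_a)
```

Which route to use is mostly a matter of taste; pivot_table reads a little more directly when the row, column and value roles are already known.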

v) Filtering by datetime index

Now that the dataframe has a datetime index, it's time to put this new-found strength into action! You can now query data based on any specific date you want.

# filter by date
df.loc["2016-01-01"]
>> Value    1428.0
year     2016.0
month       1.0
Name: 2016-01-01 00:00:00, dtype: float64

You can also filter data by date, month and year range. Filtering is as easy and intuitive as it gets.

# filter by a date range
df.loc["2016-01-01": "2016-12-01"]
# filter by month
df.loc["2008-01"]
# filter by month range
df.loc["2010-01": "2010-05"]
# filter by year
df.loc["2006"]
# filter by year range
df.loc["2011": "2012"]
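
The lookups above rely on pandas matching partial date strings against the datetime index. The following sketch, on a made-up monthly frame (the name demo and its values are assumptions for illustration), shows what each kind of lookup returns and an equivalent boolean filter.

```python
import pandas as pd

# Made-up monthly data covering 2010-2012 (36 rows), indexed by month start.
idx = pd.date_range("2010-01-01", periods=36, freq="MS")
demo = pd.DataFrame({"Value": range(36)}, index=idx)

one_day = demo.loc["2011-01-01"]            # exact timestamp: a single row (Series)
one_year = demo.loc["2011"]                 # partial string: all rows in 2011
half_year = demo.loc["2011-01":"2011-06"]   # slice: both endpoints are included

# The same half-year selection written as an explicit boolean mask.
masked = demo[(demo.index.year == 2011) & (demo.index.month <= 6)]

print(len(one_year), len(half_year))
```

Note that, unlike ordinary Python slices, label slices on a datetime index include the end point, so "2011-06" is part of the result.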

vi) Resampling

Resampling is a way to group data by time units: day, month, year etc. Below is an example of resampling by month (“M”). You can also use “A” for years and “D” for days as appropriate. (pandas 2.2 and later deprecate “M” and “A” in favor of “ME” and “YE”.)

# resampling by month
df["Value"].resample("M").mean()
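
Resampling is easiest to see when the data is finer-grained than the target unit. The sketch below uses a made-up daily series and aggregates it to months. I use the “MS” (month start) alias here as an assumption of convenience, since it is accepted across pandas versions; it labels each bin with the first day of the month rather than the last.

```python
import pandas as pd

# Made-up daily series: 1.0 on every day of January 2020, 2.0 on every day of February 2020.
idx = pd.date_range("2020-01-01", "2020-02-29", freq="D")
daily = pd.Series([1.0] * 31 + [2.0] * 29, index=idx)

# Each calendar month becomes one bin; the aggregation decides what the bin holds.
monthly_mean = daily.resample("MS").mean()
monthly_sum = daily.resample("MS").sum()

print(monthly_mean)
print(monthly_sum)
```

The choice between mean() and sum() depends on the quantity: totals such as sales usually call for a sum, while rates or levels call for a mean.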

vii) Moving average

Moving average is a powerful technique to smooth out variations in temporal trend and is done by taking an average of past observations. We’ll see how to visualize moving averages in the following section but here’s how the code works.

# rolling window/moving average of N past observations
df["Value"].rolling(window=6).mean()
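
A tiny series makes the window mechanics visible. In this sketch (made-up values; the min_periods and center variants are my additions for comparison) a window of 3 is used so the arithmetic can be checked by hand.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Trailing window: the first window-1 positions have too few values and become NaN.
trailing = s.rolling(window=3).mean()

# min_periods=1 averages whatever is available instead of returning NaN.
partial = s.rolling(window=3, min_periods=1).mean()

# center=True aligns each average with the middle of its window.
centered = s.rolling(window=3, center=True).mean()

print(pd.DataFrame({"trailing": trailing, "partial": partial, "centered": centered}))
```

With window=6 on monthly data, as in the article, the first five rows of the result are NaN for the same reason.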

Part B: Visualization

Time series data is best analyzed and understood through visualization. We can write all the code we want for resampling and moving averages and create as many new dataframes as we like, but in the end the results are hard to interpret until they are visualized.

Below is a sample of 8 different techniques for visualization. Is that all you need to know? The short answer is — if you know how to create and interpret these 8 visualizations, you are in pretty good shape!

I am not going to write complex code to create beautiful figures; instead, I’ll use the default seaborn parameters in one-liners. You can customize these figures however you wish based on your taste and sense of beauty!

Also as you will notice, I’m only using seaborn as the plotting library. There are all kinds of fancy libraries out there, but again, I’d keep things simple and sweet.

i) Simple plot

The simple plot is really simple: it just plots the values column against the time dimension.

# simple time series plot
sns.lineplot(data = df["Value"])

ii) Slicing

Sometimes you may want to zoom in on a specific date range.

# zooming in on specific date range
drange = df.loc["2011": "2015"]
sns.lineplot(data = drange["Value"])

iii) Resampling

We talked about resampling in the previous section, but now let’s see what resampled data looks like.

# plotting resampled data 
resampled = df["Value"].resample("A").mean() 
sns.lineplot(data = resampled)

iv) Moving average

A moving average is similar to resampling but more flexible: you can specify any number of past observations as the rolling window. Below is an example using a moving window of 6 past observations.

# plotting moving average (window = N past observations)
ma = df["Value"].rolling(window=6).mean() 
sns.lineplot(data = ma)

v) Boxplot

In any exploratory data analysis, boxplots are among the most useful statistical graphics for understanding both the central tendency and the distribution of data. They are similarly useful for time series data. Below is an example of monthly boxplots of values.

# boxplots by month
sns.boxplot(x = 'month', y='Value', data = df)

[I didn’t try it, but you should be able to draw violin plots with the same line of code. Go ahead and give it a try: just replace boxplot with violinplot!]
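
For anyone who wants to try the swap, here is a minimal sketch that draws both plot types side by side. It assumes a small made-up long-form frame (demo, three months by four observations) rather than the book sales data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up long-form data: 4 observations for each of 3 months.
demo = pd.DataFrame({
    "month": [1, 2, 3] * 4,
    "Value": [10, 20, 30, 12, 22, 33, 9, 18, 29, 11, 21, 31],
})

# Both functions accept the same x/y/data arguments.
fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sns.boxplot(x="month", y="Value", data=demo, ax=ax_box)
sns.violinplot(x="month", y="Value", data=demo, ax=ax_violin)
ax_box.set_title("boxplot")
ax_violin.set_title("violinplot")
```

The violin adds a smoothed density outline to the summary a boxplot gives, which tends to be more informative when each month has many observations.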

vi) Barplot

Bar plots are an old-school visualization technique and I don’t know if people use them too often in their projects. But I’m including this here so you know they exist!

# barplot
sns.barplot(x = 'month', y='Value', data = df)

vii) Time series with confidence intervals

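It is also possible to visualize a time series showing both the trend and confidence intervals. Because the dataframe holds twelve monthly values per year, plotting against “year” makes seaborn, by default, draw the yearly mean as a line with a shaded 95% confidence band around it.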

# plotting with confidence intervals 
sns.lineplot(x="year", y="Value", data=df)
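
To see roughly what the line and the band summarize, the yearly mean and an interval around it can be computed by hand. This sketch uses made-up numbers and a normal approximation (1.96 standard errors); seaborn's default band comes from bootstrapping, so the two should be similar but not identical.

```python
import pandas as pd

# Made-up long-form data: 4 observations per year.
demo = pd.DataFrame({
    "year":  [2000] * 4 + [2001] * 4,
    "Value": [10.0, 12.0, 14.0, 16.0, 20.0, 20.0, 20.0, 20.0],
})

# Mean and standard error of the mean for each year.
summary = demo.groupby("year")["Value"].agg(["mean", "sem"])

# Normal-approximation 95% interval (an approximation, not seaborn's bootstrap).
summary["lower"] = summary["mean"] - 1.96 * summary["sem"]
summary["upper"] = summary["mean"] + 1.96 * summary["sem"]

print(summary)
```

A year whose values barely vary, like 2001 here, gets a band of zero width, while a year with more spread gets a wider one.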

viii) Plotting wide-form data

Last but not least, recall the wide-form table we created at the beginning, with one column per month. The plot below shows the same view, a separate trend line for each month, although the code builds it from the long-form dataframe by mapping “month” to hue.

# plotting wideform data
sns.lineplot(x="year", y="Value", data=df, hue="month", palette="Dark2")

Final thought

The purpose of this article was to put in one place most of the data wrangling and visualization techniques you’d need as a beginner or intermediate time series analyst. First, we saw how to give an ordinary dataframe a datetime index and put it to use for filtering and visualizing. In the second part, we created 8 different plots using different techniques to visualize time series. The next step would be to pick a different dataset, reproduce the results and play with different plotting parameters.

I hope this was a useful post. If you have comments feel free to write them down below. You can follow me on Medium, Twitter or LinkedIn.
