avatarJoe T. Santhanavanich

Summary

This article is a tutorial on using Python's Pandas library for time series analysis, specifically applied to COVID-19 case data from Johns Hopkins University.

Abstract

The provided web content serves as an introductory guide to manipulating and analyzing time series data with Python's Pandas library, using the confirmed COVID-19 case dataset from Johns Hopkins CSSE. The tutorial covers essential steps such as data preparation, cleaning, and indexing with datetime objects. It also demonstrates how to perform data aggregation, resampling, and trend analysis, including calculating weekly percentage growth and rolling averages. The article aims to equip readers with the foundational skills needed to work with time series datasets in data science projects, emphasizing practical applications through real-world examples.

Opinions

  • The author believes that time series datasets are common in data science and that understanding how to manipulate and analyze them is crucial.
  • The tutorial is designed for both beginners and those looking to refresh their skills, suggesting that familiarity with Pandas and Python is beneficial for data analysis tasks.
  • The author emphasizes the importance of data cleaning and preparation, highlighting the need to aggregate and transpose data for meaningful analysis.
  • The use of real-world data from the COVID-19 pandemic underscores the relevance and immediacy of applying data science techniques to current global issues.
  • By providing code snippets and visual examples, the author conveys a hands-on approach to learning, encouraging readers to apply the techniques directly to the dataset.
  • The inclusion of various resampling methods and rolling window calculations indicates the author's preference for these techniques in identifying trends and patterns in time series data.
  • The concluding remarks express the author's hope that the article will be useful for readers' daily work or projects, inviting further engagement and discussion on the topic.

AN ESSENTIAL PYTHON GUIDE

COVID-19 Time Series Analysis with Pandas in Python

Manipulate and Analyze COVID-19 Time Series Data

illustration by Chaeyun Kim

In Data Sciences, the time series is one of the most daily common datasets. In this tutorial, I will show you a short introduction on how to use Pandas to manipulate and analyze the time series dataset with the confirmed COVID-19 case dataset from JHU CSSE.

Let’s Get Started

If you are very new to Pandas/Python, just download the latest version of Python and then you can install Pandas with pip in your console as below:

$ pip install pandas

Prepare Data and Import to Dataframe

Let’s begin with preparing your project folder and download a time-series csv file from Johns Hopkins CSSE.

Then, create a new Python file and loading thecsv file into the Pandas dataframe:

import pandas as pd
df = pd.read_csv(f'time_series_covid19_confirmed_global.csv')
print(df)
Dataframe Represents Confirmed COVID-19 Cases (Viewed by Spyder IDE)

Data Cleaning

Now, let’s do some data cleaning. You can see that this dataset has the time-series represents a number of cases in the level of Province/State only in some areas so let’s aggregate all data into the level of Country/Region using groupby to sum up the number of cases.

Before the data aggregation, we may drop some fields that we are not interested including province/State, Lat, Long using drop .

df = df.drop(columns=['Province/State','Lat', 'Long'])
df = df.groupby('Country/Region').agg('sum')
Confirmed COVID-19 Cases Sum by Country (Viewed by Spyder IDE)

Prepare Datetime index in a Dataframe

Now, we get the time series of daily confirmed cases for each country. But to use the time series analysis function, we would need to create a DateTime as the index column. So, let’s transpose the dataframe first.

df = df.T
Transposed Dataframe (Viewed by Spyder IDE)

From here, the index represents the date value. However, it is still a String type, so let’s change it to DateTime. To do this, we can use pd.to_datetime and pd.DatetimeIndex as follows:

df_time = pd.to_datetime(df.index)
datetime_index = pd.DatetimeIndex(df_time.values) 
df = df.set_index(datetime_index)
Dataframe with DateTime Index(Viewed by Spyder IDE)

Explore the Time-Series Dataframe

Now, our time-series dataframe is ready with the datetime in the index. You may select data from any specific time. For example:

> Select the confirmed COVID-19 case where the date is 15th

df[df.index.day == 15]
Selected data where the date is 15th (Viewed by Spyder IDE)

> Select the data where the month is April

df[df.index.month == 4] # April, all years
# or
df['2020-04'] # April, 2020
# or
df[(df.index.month == 4) & (df.index.year == 2020)] # April, 2020
Selected data where the month is April (Viewed by Spyder IDE)

> Select the data from April 1, 2020, to April 5, 2020

df['2020-04-01':'2020-04-05']
Selected data from April 1, 2020, to April 5, 2020 (Viewed by Spyder IDE)

> Select 6 Countries with the most confirmed COVID-19 cases

#Sort by confirmed cases on the latest date (index.value[-1])
df = df.sort_values(by=df.index.values[-1], axis=1,ascending=False)
#Select first 6 countries
df = df.iloc[:,0:6]
Selected data of 6 Countries with the most confirmed COVID-19 cases (Viewed by Spyder IDE)

Resampling Time-Series Dataframe

Now, let’s come to the fun part. We can easily resample the time-series to any time-frequency with resample method. You may resample the time-series dataframes or series to any time-frequency you like with:

df.resample(<offet frequency>).<aggregation-operation>

Here is a list of most used <offset frequency> aliases:

'nD'           ==> n days  
'nM'           ==> n months
'nW'           ==> n weeks
'nD'           ==> n days  
'nH'           ==> n hours  
'nT' or 'nmin' ==> n mins  
'nS'           ==> n seconds
'nL' or 'nms'  ==> n milliseconds
#or mix offset together
'nD.mH'        ==> n days, m hours

Here is a list of most used <aggregation-operation> :

# Summarize
df.resample(timeinterval).sum()
# Mean
df.resample(timeinterval).mean()
# Max
df.resample(timeinterval).max()
# Min
df.resample(timeinterval).min()
# First value
df.resample(timeinterval).first()
# Last value
df.resample(timeinterval).last()
# Standard deviation
df.resample(timeinterval).std()

Now, let’s try to use it with the real example in our COVID-19 dataset:

> Mean-resampled of confirmed COVID-19 cases in a weekly interval

#Resample the dataframe
df.resample('W').mean()
#You can plot the chart by adding .plot()
df.resample('W').mean().plot()
Mean-Resampled of confirmed COVID-19 Cases in weekly Interval (Viewed by Spyder IDE)

We can also plot any resampling result by using.plot(...) . For example:

df.resample('W').mean().plot(marker="v",figsize=(15,5))
Mean-Resampled of confirmed COVID-19 Cases in weekly Interval

Percentage Growth

Now, let’s find a weekly percentage growth of COVID-19 in each country. We can do it by simply adding .pct_change() to find the percentage change.

df.resample('W').mean().pct_change()
Percentage Change of confirmed COVID-19 cases in weekly Interval

In this case, the first two lines contain some of nan and infso we can just ignore these lines by using iloc[2:] to skip the lines. Then, we can use plot to see the plot of the percentage growth of the confirmed COVID-19 cases.

df.resample('W').mean().pct_change().iloc[2:].plot(marker="v",figsize=(15,5))
Percentage Change of Confirmed COVID-19 Cases in Weekly Interval

Identifying Trends in Time-Series Dataframe

There are many ways to analyze the trends of time series, the rollingis one of my most favorite tools. For each row or timestamp, it takes the rolling window calculations on both sides by:

df.rolling(<window size/ offset frequency>).<aggregation-operation>

For example, let’s plot the rolling average of confirmed COVID-19 cases by window size of 10 days.

df.rolling('10D').mean().plot(marker="v",figsize=(15,5))
10-Days Rolling Average of Confirmed COVID-19 Cases in Each Country

Conclusion

This article provides a basic guide on how to manipulate and analyze the time series data with an example of a confirmed COVID-19 cases dataset. Still, it only covers very basic parts and there are a lot more functionalities you can do with the time-series dataset. You can check a full guide on the official Pandas time-series webpage.

I hope you like this article and found it useful for your daily work or projects. Feel free to leave me a message if you have any questions or comments.

About me & Check out all my blog contents: Link

Be Safe and Healthy! 💪

Thank you for Reading. 📚

Python
Data Science
Covid-19
Pandas
Science
Recommended from ReadMedium