Time Series Forecasting in Python
Introduction to Time Series Forecasting in Python
Introduction
Time series data is data collected on the same subject over time, such as a country’s annual GDP, a company’s stock price, or your own heartbeat recorded every second. In fact, anything you can measure repeatedly at successive points in time is time series data.
The chart below shows the daily stock price of Tesla Inc. (Ticker Symbol: TSLA) for the past year as an example of time series data. The value in US$ is shown by the y-axis on the right-hand side (the last point on the chart, $701.91, represents the current stock price as of the writing of this article on April 12, 2021).

Cross-sectional data, on the other hand, refers to datasets that hold information at a single moment in time, such as customer information, product information, company information, and so on.
An example of a dataset that records America’s best-selling electric automobiles in the first half of 2020 may be seen below. Rather than monitoring the number of automobiles sold over time, the graphic below compares the sales of different cars such as Tesla, Chevy, and Nissan over the same time period.

The distinction between cross-sectional and time-series data is easy to spot because the analytic goals for the two kinds of dataset differ. In the first example, we were interested in tracking Tesla’s stock price through time; in the second, we wanted to compare several companies over the same time period, i.e. the first half of 2020.
A typical real-world dataset, on the other hand, is likely to be a hybrid. Consider a retailer such as Walmart, which sells thousands of products each day. Looking at sales by product on a single day, for example to find the best-selling item on Christmas Eve, is a cross-sectional analysis. In contrast, if you want to know the sales of a single item, such as the PS4, over a period of time (let’s say the last 5 years), you’ll need to do a time-series analysis.
In short, the analytic objectives for time-series and cross-sectional data are different, and a real-world dataset is likely to contain a mix of both.
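To make the distinction concrete, here is a minimal sketch using a small made-up sales table. The product names, dates, and unit counts are invented for illustration and are not real retailer data. Filtering on one date gives a cross-section, while filtering on one product gives a time series.

```python
import pandas as pd

# Hypothetical daily sales records: one row per (date, product) pair.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-23", "2020-12-23", "2020-12-24", "2020-12-24"]),
    "product": ["PS4", "TV", "PS4", "TV"],
    "units": [120, 80, 150, 95],
})

# Cross-sectional view: many products at a single point in time.
cross_section = sales[sales["date"] == pd.Timestamp("2020-12-24")]
best_seller = cross_section.loc[cross_section["units"].idxmax(), "product"]

# Time-series view: a single product tracked across dates.
ps4_series = sales[sales["product"] == "PS4"].set_index("date")["units"]

print(best_seller)
print(ps4_series)
```

The same table answers both kinds of question; only the axis you slice along changes.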
What is Time Series Forecasting?
Time series forecasting, as the name implies, means predicting future, unknown values. In the real world, though, it’s a little less glamorous than in sci-fi movies. It involves gathering historical data, preparing it for consumption by algorithms (the algorithm is just the math that goes on behind the scenes), and then predicting future values based on patterns learned from the historical data.
Why would corporations, or anybody else, be interested in estimating future values of a time series (GDP, monthly sales, inventory, unemployment, global temperatures, etc.)? Allow me to provide some business perspective:
- For planning and budgeting purposes, a business can be interested in estimating future sales at the SKU level.
- A small business can be interested in projecting revenues by location so that it can allocate the appropriate resources (more people during busy periods and vice versa).
- A software behemoth like Google could be interested in knowing the busiest hour of the day or day of the week so that server resources can be allocated appropriately.
- The health department may be interested in projecting the cumulative COVID immunization doses given so that it may determine when herd immunity is anticipated to kick in.
Time Series Forecasting Methods
Time series forecasting methods can broadly be grouped into the following categories:
- Classical / Statistical Models — Moving Averages, Exponential smoothing, ARIMA, SARIMA, TBATS
- Machine Learning — Linear Regression, XGBoost, Random Forest, or any ML model with reduction methods
- Deep Learning — RNN, LSTM, Transfer Learning
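As a rough illustration of the classical end of this list, the sketch below implements two of the simplest baselines, a moving-average forecast and simple exponential smoothing, in plain Python. The function names and the toy series are hypothetical and chosen only for this example; dedicated statistics libraries provide fuller implementations.

```python
def moving_average_forecast(values, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = values[-window:]
    return sum(recent) / len(recent)


def simple_exponential_smoothing(values, alpha):
    """Return the one-step-ahead forecast from simple exponential smoothing.

    The level starts at the first observation and is updated as
    level = alpha * observation + (1 - alpha) * level.
    """
    level = values[0]
    for observation in values[1:]:
        level = alpha * observation + (1 - alpha) * level
    return level


history = [10.0, 12.0, 14.0, 16.0]  # toy series, invented for illustration
ma_forecast = moving_average_forecast(history, window=2)
ses_forecast = simple_exponential_smoothing(history, alpha=0.5)
print(ma_forecast, ses_forecast)
```

Both baselines lag behind a steadily rising series, which is one reason trend-aware and seasonal models such as ARIMA and SARIMA exist.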
This tutorial focuses on forecasting time series using Machine Learning. I will use the Regression Module of PyCaret, an open-source, low-code machine learning library in Python. If you haven’t used PyCaret before, you can get started quickly here, although you don’t need any prior knowledge of PyCaret to follow along with this tutorial.
Dataset
For this tutorial, I have used the US airline passengers dataset. You can download the dataset from Kaggle. This dataset provides monthly totals of US airline passengers from 1949 to 1960.
# read csv file
import pandas as pd
data = pd.read_csv('AirPassengers.csv')
data['Date'] = pd.to_datetime(data['Date'])
data.head()

# create 12 month moving average
data['MA12'] = data['Passengers'].rolling(12).mean()
# plot the data and MA
import plotly.express as px
fig = px.line(data, x="Date", y=["Passengers", "MA12"], template = 'plotly_dark')
fig.show()

Because machine learning algorithms can’t deal with dates directly, let’s extract some basic properties from dates, such as month and year, and then remove the original date column from the dataset.
# extract month and year from dates
data['Month'] = [i.month for i in data['Date']]
data['Year'] = [i.year for i in data['Date']]
# create a sequence of numbers (numpy is needed for np.arange)
import numpy as np
data['Series'] = np.arange(1,len(data)+1)
# drop unnecessary columns and re-arrange
data.drop(['Date', 'MA12'], axis=1, inplace=True)
data = data[['Series', 'Year', 'Month', 'Passengers']]
# check the head of the dataset
data.head()
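As an alternative to the list comprehensions above, pandas exposes date parts through the `.dt` accessor. The sketch below uses a small stand-in frame (named `demo` here, with placeholder zeros instead of real passenger counts) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Stand-in for the tutorial's frame: six monthly dates and a placeholder target.
dates = pd.date_range(start="1949-01-01", periods=6, freq="MS")
demo = pd.DataFrame({"Date": dates, "Passengers": [0] * 6})  # zeros are placeholders

# Vectorised extraction of date parts via the .dt accessor.
demo["Month"] = demo["Date"].dt.month
demo["Year"] = demo["Date"].dt.year
demo["Series"] = np.arange(1, len(demo) + 1)

# Drop the raw date and re-arrange the columns as in the tutorial.
features = demo.drop(columns=["Date"])[["Series", "Year", "Month", "Passengers"]]
print(features)
```

The result has the same columns as the tutorial's frame; the accessor simply avoids a Python-level loop over the dates.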

# split data into train-test set
train = data[data['Year'] < 1960]
test = data[data['Year'] >= 1960]
# check shape
train.shape, test.shape
>>> ((132, 4), (12, 4))
Initialize Setup
Now it’s time to initialize the setup function, where we will explicitly pass the training data, test data, and cross-validation strategy using the fold_strategy parameter.
# import the regression module
from pycaret.regression import *
# initialize setup
s = setup(data = train, test_data = test, target = 'Passengers', fold_strategy = 'timeseries', numeric_features = ['Year', 'Series'], fold = 3, transform_target = True, session_id = 123)
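To see why a time-series fold strategy matters, the sketch below generates expanding-window splits in plain Python, in the style of scikit-learn's TimeSeriesSplit. It assumes that fold_strategy = 'timeseries' behaves in a comparable way; the helper name `expanding_window_splits` is hypothetical and is not part of PyCaret.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where training rows always precede test rows."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# 132 training rows and 3 folds, matching the setup call above.
splits = list(expanding_window_splits(n_samples=132, n_splits=3))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train rows 0-{train_idx[-1]}, test rows {test_idx[0]}-{test_idx[-1]}")
```

Unlike shuffled k-fold, every fold here validates on rows that come after its training rows, so the model is never scored on the past using information from the future.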
Model Training and Selection
Now let’s train different models and evaluate their average cross-validated performance.
best = compare_models(sort = 'MAE')

The best model based on cross-validated MAE is Least Angle Regression (MAE: 22.3). Let’s check the score on the test set.
prediction_holdout = predict_model(best);
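For reference, MAE (mean absolute error), the metric used to rank the models, is the average of the absolute differences between actual and predicted values. Here is a minimal sketch with toy numbers that are not taken from the airline dataset:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [100, 120, 140]     # toy values, invented for illustration
predicted = [110, 115, 150]
mae = mean_absolute_error(actual, predicted)
print(mae)
```

Because MAE is expressed in the units of the target, a cross-validated MAE of 22.3 on this dataset means the predictions are off by about 22.3 units of the Passengers column on average.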

# generate predictions on the original dataset
predictions = predict_model(best, data=data)
# add a date column in the dataset
predictions['Date'] = pd.date_range(start='1949-01-01', end = '1960-12-01', freq = 'MS')
# line plot
fig = px.line(predictions, x='Date', y=["Passengers", "Label"], template = 'plotly_dark')
# add a vertical rectangle for test-set separation
fig.add_vrect(x0="1960-01-01", x1="1960-12-01", fillcolor="grey", opacity=0.25, line_width=0)
fig.show()

The test period is shown by the grey shaded region at the end of the chart (i.e. 1960). Let’s now finalize the model by training the best model, which is Least Angle Regression, on the complete dataset (this time, including the test set).
final_best = finalize_model(best)
Create a future scoring dataset
Now that we’ve trained our model on the complete dataset (1949 to 1960), let’s forecast roughly four years into the future, from January 1961 through January 1965. To use our final model to make future predictions, we first need to create a dataset with the Month, Year, and Series columns for future dates.
future_dates = pd.date_range(start = '1961-01-01', end = '1965-01-01', freq = 'MS')
future_df = pd.DataFrame()
future_df['Month'] = [i.month for i in future_dates]
future_df['Year'] = [i.year for i in future_dates]
future_df['Series'] = np.arange(145, (145+len(future_dates)))
future_df.head()
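The starting value 145 in the Series column is simply the number of historical rows (144) plus one. The sketch below builds an equivalent frame, named `future_demo` here to avoid clashing with the tutorial's variable, and derives the offset from a variable instead of hard-coding it. It assumes 144 historical rows, which matches the train and test shapes shown earlier.

```python
import numpy as np
import pandas as pd

n_history = 144  # rows in the historical data: 132 train + 12 test
future_dates = pd.date_range(start="1961-01-01", end="1965-01-01", freq="MS")

# Month and Year come straight from the DatetimeIndex attributes;
# Series continues the running counter where the historical data stopped.
future_demo = pd.DataFrame({
    "Month": future_dates.month,
    "Year": future_dates.year,
    "Series": np.arange(n_history + 1, n_history + 1 + len(future_dates)),
})
print(len(future_demo), future_demo["Series"].iloc[0], future_demo["Series"].iloc[-1])
```

The date range contains 49 month starts (January 1961 through January 1965), so the counter runs from 145 to 193.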

predictions_future = predict_model(final_best, data=future_df)
predictions_future.head()

concat_df = pd.concat([data,predictions_future], axis=0)
concat_df_i = pd.date_range(start='1949-01-01', end = '1965-01-01', freq = 'MS')
concat_df.set_index(concat_df_i, inplace=True)
fig = px.line(concat_df, x=concat_df.index, y=["Passengers", "Label"], template = 'plotly_dark')
fig.show()

I hope you found this tutorial easy to follow. If you think you are ready for the next level, you can subscribe to my mailing list, as I will be writing a tutorial on Multivariate Time Series Forecasting soon.
Thank you for reading!
Author:
I write about data science, machine learning, and PyCaret. If you would like to be notified automatically, you can follow me on Medium, LinkedIn, and Twitter.