Moez Ali

Summary

This article provides a comprehensive guide on time series forecasting using Python, with a focus on machine learning models, particularly using the PyCaret library, and applies the methods to forecast US airline passenger numbers.

Abstract

The provided content delves into the concept of time series forecasting, distinguishing it from cross-sectional data analysis by emphasizing the temporal aspect of data collection. It explains the importance of forecasting future values in various business and scientific contexts, such as sales predictions and resource allocation. The tutorial demonstrates the use of the PyCaret regression module to perform time series forecasting on the classic US airline passengers dataset, covering steps from data preprocessing to model training, evaluation, and future prediction. The author guides the reader through creating a moving average, preparing the dataset for machine learning algorithms, splitting the data into training and test sets, initializing the regression setup, comparing different models to select the best performer based on cross-validated mean absolute error (MAE), and finalizing the model to make predictions, including forecasting future values. The tutorial concludes with a call to action for readers to follow the author's work for upcoming tutorials on multivariate time series forecasting.

Opinions

  • The author emphasizes the practical utility of time series forecasting in real-world applications.
  • There is a clear preference for the PyCaret library due to its low-code approach and ease of use for time series analysis.
  • The author suggests that a combination of classical statistical methods and machine learning can be effective for time series forecasting.
  • The tutorial implies that cross-validation is a crucial step in the model selection process, with a particular preference for time series cross-validation.
  • The author expresses enthusiasm for continued learning and improvement in the field of time series forecasting, indicating a commitment to sharing knowledge through future tutorials.

Time Series Forecasting in Python

Introduction to Time Series Forecasting in Python

Photo by KOBU Agency on Unsplash

Introduction

Time series data is data gathered on the same subject over time, such as a country’s annual GDP, a company’s stock price over time, or your own heartbeat recorded every second. In fact, anything that you can capture continuously at different time intervals is time series data.

The chart below shows the daily stock price of Tesla Inc. (Ticker Symbol: TSLA) for the past year as an example of time series data. The value in US$ is shown by the y-axis on the right-hand side (the last point on the chart, $701.91, represents the current stock price as of the writing of this article on April 12, 2021).

Example of Time Series Data — Tesla Inc. (ticker symbol: TSLA) daily stock price 1Y interval.

Cross-sectional data, on the other hand, refers to datasets that hold information at a single moment in time, such as customer information, product information, company information, and so on.

An example of a dataset that records America’s best-selling electric automobiles in the first half of 2020 may be seen below. Rather than monitoring the number of automobiles sold over time, the graphic below compares the sales of different cars such as Tesla, Chevy, and Nissan over the same time period.

Source: Forbes

The distinction between cross-sectional and time-series data is easy to spot because the analytic goals for the two kinds of dataset are very different. In the first example, we were interested in tracking Tesla’s stock price through time; in the second, we wanted to compare several companies over the same time period, i.e. the first half of 2020.

A typical real-world dataset, on the other hand, is likely to be a hybrid. Consider a retailer such as Walmart, which sells thousands of things each day. A cross-sectional study is when you look at sales by product on a certain day, for example, if you want to know what the best-selling item is on Christmas Eve. In contrast, if you want to know the sales of a single item, such as the PS4, over a period of time (let’s say the last 5 years), you’ll need to do a time-series analysis.

In short, the analytic objectives for time-series and cross-sectional data are different, and a real-world dataset is likely to have a mix of both.
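To make the two views concrete, here is a minimal pandas sketch. The `sales` table, its column names, and all the numbers are hypothetical and made up purely for illustration; they are not taken from any real retailer.

```python
import pandas as pd

# Hypothetical toy sales table: one row per (date, item) pair.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-23", "2020-12-23", "2020-12-24", "2020-12-24"]),
    "item": ["PS4", "TV", "PS4", "TV"],
    "units": [120, 80, 150, 95],
})

# Cross-sectional view: every item on one particular day.
christmas_eve = sales[sales["date"] == pd.Timestamp("2020-12-24")]
best_seller = christmas_eve.loc[christmas_eve["units"].idxmax(), "item"]

# Time-series view: one particular item across days.
ps4_over_time = sales[sales["item"] == "PS4"].set_index("date")["units"]

print(best_seller)
print(ps4_over_time)
```

The same table answers both questions; only the slice changes: fix the date and vary the item, or fix the item and vary the date.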

What is Time Series Forecasting?

Time series forecasting entails projecting future unknown values, as the name implies. In the real world, though, it’s a little less fascinating than in sci-fi movies. It entails gathering historical data, preparing it for consumption by algorithms (the algorithm is just the math that goes on behind the scenes), and then predicting future values based on patterns learned from the historical data.

Why would corporations, or anybody else, be interested in estimating future values of a time series (GDP, monthly sales, inventory, unemployment, global temperatures, etc.)? Allow me to provide you with some business perspective:

  • For planning and budgeting purposes, a business can be interested in estimating future sales at the SKU level.
  • A small business can be interested in projecting revenues by location so that it can allocate the appropriate resources (more people during busy periods and vice versa).
  • A software behemoth like Google could be interested in knowing the busiest hour of the day or day of the week so that server resources can be allocated appropriately.
  • The health department may be interested in projecting the cumulative COVID immunization doses given so that it may determine when herd immunity is anticipated to kick in.

Time Series Forecasting Methods

Time series forecasting methods fall broadly into the following categories:

  • Classical / Statistical Models — Moving Averages, Exponential smoothing, ARIMA, SARIMA, TBATS
  • Machine Learning — Linear Regression, XGBoost, Random Forest, or any ML model with reduction methods
  • Deep Learning — RNN, LSTM, Transfer Learning
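The "reduction" mentioned in the Machine Learning bullet means recasting the series as an ordinary tabular regression problem. One common way to do that, sketched below, is to use lagged values as features. Note that this is an illustration of the general idea, not the feature set used later in this tutorial (which relies on Month, Year, and a running index instead); the series values and the column names `lag_1` and `lag_2` are hypothetical.

```python
import pandas as pd

# Hypothetical short monthly series, used only for illustration.
y = pd.Series([112, 118, 132, 129, 121, 135], name="y")

# Reduction: each row pairs the value at time t with the two values before it.
supervised = pd.DataFrame({
    "lag_2": y.shift(2),
    "lag_1": y.shift(1),
    "y": y,
}).dropna().astype(int)  # the first two rows have no complete history

# Any tabular regressor can now be fit on X to predict target.
X, target = supervised[["lag_2", "lag_1"]], supervised["y"]
print(supervised)
```

Once the data is in this shape, the forecasting problem looks like any other regression task, which is why models such as Linear Regression or Random Forest can be applied to it.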

This tutorial focuses on forecasting time series using Machine Learning. For this tutorial, I will use the Regression Module of an open-source, low-code machine learning library in Python called PyCaret. If you haven’t used PyCaret before, you can get started quickly here. That said, you don’t need any prior knowledge of PyCaret to follow along with this tutorial.

Dataset

For this tutorial, I have used the US airline passengers dataset. You can download the dataset from Kaggle. This dataset provides monthly totals of US airline passengers from 1949 to 1960.

# read csv file
import pandas as pd
data = pd.read_csv('AirPassengers.csv')
data['Date'] = pd.to_datetime(data['Date'])
data.head()
Sample rows
# create 12 month moving average
data['MA12'] = data['Passengers'].rolling(12).mean()
# plot the data and MA
import plotly.express as px
fig = px.line(data, x="Date", y=["Passengers", "MA12"], template = 'plotly_dark')
fig.show()
US Airline Passenger Dataset Time Series Plot with Moving Average = 12
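One detail of the moving average worth knowing: with a window of 12, pandas cannot produce a value until 12 observations are available, so the first 11 entries of the result are missing (NaN). The small sketch below shows this on a hypothetical series of the numbers 1 to 14 rather than on the passenger data.

```python
import pandas as pd

# Hypothetical series 1..14 standing in for 14 months of data.
s = pd.Series(range(1, 15))

# Same call pattern as in the tutorial: a 12-period rolling mean.
ma12 = s.rolling(12).mean()

n_missing = int(ma12.isna().sum())  # rows without a full 12-period window
first_valid = ma12.iloc[11]         # mean of the first 12 values (1..12)

print(n_missing, first_valid)
```

This is why the moving-average line in the plot starts later than the raw series.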

Because machine learning algorithms can’t deal with dates directly, let’s extract some basic properties from dates, such as month and year, and then remove the original date column from the dataset.

# extract month and year from dates
data['Month'] = [i.month for i in data['Date']]
data['Year'] = [i.year for i in data['Date']]
# create a sequence of numbers (np.arange requires numpy)
import numpy as np
data['Series'] = np.arange(1,len(data)+1)
# drop unnecessary columns and re-arrange
data.drop(['Date', 'MA12'], axis=1, inplace=True)
data = data[['Series', 'Year', 'Month', 'Passengers']] 
# check the head of the dataset
data.head()
Sample rows after extracting features
# split data into train-test set
train = data[data['Year'] < 1960]
test = data[data['Year'] >= 1960]
# check shape
train.shape, test.shape
>>> ((132, 4), (12, 4))

Initialize Setup

Now it’s time to initialize the setup function, where we will explicitly pass the training data, test data, and cross-validation strategy using the fold_strategy parameter.

# import the regression module
from pycaret.regression import *
# initialize setup
s = setup(
    data = train,
    test_data = test,
    target = 'Passengers',
    fold_strategy = 'timeseries',
    numeric_features = ['Year', 'Series'],
    fold = 3,
    transform_target = True,
    session_id = 123,
)
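The point of a time series fold strategy is that each validation block must come after the data the model was trained on, so the model is never evaluated on the past. The sketch below illustrates that expanding-window idea in plain Python for 132 training rows and 3 folds. It assumes the splitting follows the same scheme as scikit-learn's `TimeSeriesSplit`; that is an assumption on my part, and the helper `expanding_window_folds` is hypothetical, not a PyCaret function.

```python
def expanding_window_folds(n_samples, n_folds):
    """Yield (train_indices, validation_indices); validation always follows train.

    Hypothetical helper for illustration only, not part of PyCaret.
    """
    block = n_samples // (n_folds + 1)            # size of each validation block
    first_val_start = n_samples - n_folds * block
    for start in range(first_val_start, n_samples, block):
        yield list(range(0, start)), list(range(start, start + block))


# 132 monthly rows (1949 to 1959) and 3 folds, as in the setup call above.
folds = list(expanding_window_folds(132, 3))
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in folds]
print(fold_sizes)
```

Under this scheme the training window grows from fold to fold while the validation block slides forward, in contrast to ordinary k-fold cross-validation, which shuffles past and future together.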

Model Training and Selection

Now let’s train different models and evaluate their average cross-validated performance.

best = compare_models(sort = 'MAE')
Results from the compare_models function
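For readers unfamiliar with the metric used for sorting: MAE is the average absolute difference between actual and predicted values, and the cross-validated figure is the average of the per-fold MAEs. A minimal sketch in plain Python follows; the actual and predicted numbers are hypothetical and are not output from the models in this tutorial.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted values for three validation folds.
fold_actual = [[100, 110], [120, 130], [140, 150]]
fold_predicted = [[90, 115], [125, 120], [150, 150]]

# One MAE per fold, then the average across folds.
fold_mae = [mean_absolute_error(a, p) for a, p in zip(fold_actual, fold_predicted)]
cv_mae = sum(fold_mae) / len(fold_mae)

print(fold_mae, cv_mae)
```

Because MAE is expressed in the same units as the target, a value of about 22 here would mean the predictions are off by roughly 22 passenger units per month on average.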

The best model based on cross-validated MAE is Least Angle Regression (MAE: 22.3). Let’s check the score on the test set.

prediction_holdout = predict_model(best);
Results from predict_model(best) function
# generate predictions on the original dataset
predictions = predict_model(best, data=data)
# add a date column in the dataset
predictions['Date'] = pd.date_range(start='1949-01-01', end = '1960-12-01', freq = 'MS')
# line plot
fig = px.line(predictions, x='Date', y=["Passengers", "Label"], template = 'plotly_dark')
# add a vertical rectangle for test-set separation
fig.add_vrect(x0="1960-01-01", x1="1960-12-01", fillcolor="grey", opacity=0.25, line_width=0)
fig.show()
Actual and Predicted US airline passengers (1949–1960)

The grey band at the end of the chart marks the test period (i.e. 1960). Let’s now finalize the model by training the best model, which is Least Angle Regression, on the complete dataset (this time, including the test set).

final_best = finalize_model(best)

Create a future scoring dataset

Now that we’ve trained our model on the complete dataset (1949 to 1960), let’s forecast roughly four years into the future, through 1964 (the date range used below also includes January 1965). To use our final model to make future predictions, we first need to create a dataset with the Month, Year, and Series columns for the future dates.

future_dates = pd.date_range(start = '1961-01-01', end = '1965-01-01', freq = 'MS')
future_df = pd.DataFrame()
future_df['Month'] = [i.month for i in future_dates]
future_df['Year'] = [i.year for i in future_dates]    
future_df['Series'] = np.arange(145, (145+len(future_dates)))
future_df.head()
Sample rows from future_df
predictions_future = predict_model(final_best, data=future_df)
predictions_future.head()
Sample rows from predictions_future
concat_df = pd.concat([data,predictions_future], axis=0)
concat_df_i = pd.date_range(start='1949-01-01', end = '1965-01-01', freq = 'MS')
concat_df.set_index(concat_df_i, inplace=True)
fig = px.line(concat_df, x=concat_df.index, y=["Passengers", "Label"], template = 'plotly_dark')
fig.show()
Actual (1949–1960) and Predicted (1961–1964) US airline passengers
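As a quick sanity check on the future scoring dataset built above, the sketch below counts how many monthly rows the date range produces and confirms that the Series index continues where the historical data (144 rows) left off. It reuses the same `pd.date_range` and `np.arange` calls as the tutorial; the variable names `n_future` and `series_ids` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Same date range as in the future scoring dataset: month starts, both ends inclusive.
future_dates = pd.date_range(start="1961-01-01", end="1965-01-01", freq="MS")
n_future = len(future_dates)

# The historical data has 144 rows, so the running index continues at 145.
series_ids = np.arange(145, 145 + n_future)

print(n_future, series_ids[0], series_ids[-1])
```

Because both endpoints are included, the range covers the 48 months of 1961 to 1964 plus January 1965, which is also why the combined index in the final plot runs through 1965-01-01.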

I hope you found this tutorial easy to follow. If you think you are ready for the next level, you can subscribe to my mailing list, as I will be writing a tutorial on Multivariate Time Series Forecasting soon.

Thank you for reading!

Author:

I write about data science, machine learning, and PyCaret. If you would like to be notified automatically, you can follow me on Medium, LinkedIn, and Twitter.
