avatarAmy @GrabNGoInfo

Summary

This article provides a comprehensive tutorial on enhancing multivariate time series forecasting using Facebook Prophet in Python, incorporating seasonality, holidays, special events, and additional features to improve model performance.

Abstract

The tutorial detailed on the GrabNGoInfo website guides readers through the process of building a robust time series forecasting model using Facebook Prophet in Python. It covers the steps to install necessary libraries, pull historical stock data, preprocess the data, and incorporate various factors such as seasonal trends, US holidays, and special events like the Super Bowl and COVID-19 outbreak. The article emphasizes the importance of including these factors to capture the complexity of time series data, particularly for stock prices. It also demonstrates how to add additional predictors, such as the Vanguard Total Stock Market Index Fund ETF (VTI) data, to enhance the prediction of Google's stock price (GOOG). The performance of different models is evaluated using metrics like Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), showing that adding seasonality and multivariate data significantly improves forecast accuracy.

Opinions

  • The author believes that considering seasonality, holidays, and special events is crucial for accurate time series forecasting, especially in the context of stock market predictions.
  • The article suggests that incorporating additional predictors, such as related stock indices, can lead to more precise forecasts.
  • There is an opinion that Prophet's built-in capability to handle seasonality and holidays simplifies the modeling process for time series data.
  • The tutorial implies that time series cross-validation, as mentioned in a previous tutorial by the author, is also an important aspect of model evaluation.
  • The author advocates for the use of Prophet's multivariate modeling capabilities, indicating that they

Multivariate Time Series Forecasting with Seasonality and Holiday Effect Using Prophet in Python

How the time series model performance is impacted by seasonalities, holidays, special events, and additional features

Photo by Brian Wangenheim on Unsplash

Do you want to build a time series model that incorporates seasonalities, holidays, special events, and other features? In this tutorial, we will talk about how to achieve this using Facebook Prophet in Python. After the tutorial, you will learn:

  • How to include seasonalities in time series prediction?
  • How to add standard holidays of a country?
  • How to add special events such as Super Bowl?
  • How to add additional predictors?
  • How the time series model performance is impacted by seasonalities, holidays, special events, and additional features?

Resources for this post:

Let’s get started!

Step 1: Install and Import Libraries

In the first step, we will install and import libraries.

yfinance is the python package for pulling stock data from Yahoo Finance. prophet is the package for the time series model. After installing yfinance and prophet, we will import them into the notebook.

We will also import pandas and numpy for data processing, seaborn and matplotlib for visualization, and mean_absolute_error and mean_absolute_percentage_error for the model performance evaluation.

# Install libraries
!pip install yfinance prophet
# Get time series data
import yfinance as yf
# Prophet model for time series forecast
from prophet import Prophet
# Data processing
import numpy as np
import pandas as pd
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Model performance evaluation
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

Step 2: Pull Data

The second step pulls stock data from Yahoo Finance API. We will pull 2 years of daily data from the beginning of 2020 to the end of 2021.

  • start_date = '2020-01-02' because January 1st is a holiday, and there is no stock data on holidays and weekends.
  • end_date = '2022-01-01' because yfinance excludes the end date, so we need to add one day to the last day of the data end date.
  • train_end_date = '2021-12-15' means that the splitting date for the training and testing dataset is December 15th of 2021.
# Data start date
start_date = '2020-01-02'
# Data end date. yfinance excludes the end date, so we need to add one day to the last day of data
end_date = '2022-01-01' 
# Date for splitting training and testing dataset
train_end_date = '2021-12-15'

We will download the closing prices for two tickers, GOOG and VTI. GOOG is the ticker for Google, and VTI is the ticker for the Vanguard Total Stock Market Index Fund ETF.

The goal of the time series model is to predict the closing price of Google stock. The closing price of VTI will be used as an additional predictor.

# Pull close data from Yahoo Finance for the list of tickers
ticker_list = ['GOOG', 'VTI']
data = yf.download(ticker_list, start=start_date, end=end_date)[['Close']]
# Take a look at the data
data.head()
Download stock price data using yfinance — Image from GrabNGoInfo.com

The downloaded dataset is in the pandas dataframe format. We can see that the output has two-level column names. Using droplevel, the first level column name Close is dropped.

# Drop one level of the column names
data.columns = data.columns.droplevel(0)
# Take a look at the data
data.head()
droplevel for downloaded stock price data using yfinance — Image from GrabNGoInfo.com

Using .info, we can see that the dataset has 505 records and there are no missing values.

# Information on the dataframe
data.info()

Output

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 505 entries, 2020-01-02 to 2021-12-31
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   GOOG    505 non-null    float64
 1   VTI     505 non-null    float64
dtypes: float64(2)
memory usage: 11.8 KB

Next, let’s visualize the closing prices of the two tickers using seaborn, and add the legend to the plot using matplotlib.

# Visualize data using seaborn
sns.set(rc={'figure.figsize':(12,8)})
sns.lineplot(x=data.index, y=data['GOOG'])
sns.lineplot(x=data.index, y=data['VTI'])
plt.legend(['Google', 'VTI'])

We can see that the price for Google increased a lot starting in late 2020, and almost doubled in late 2021.

Stock prices time series data — Image from GrabNGoInfo.com

Step 3: Data Processing

Step 3 transforms the dataset into a time series modeling dataset.

Prophet requires at least two columns as inputs: a ds column and a y column.

  • The ds column has the time information. Currently we have the date as the index, so we reset the index and rename date to ds.
  • The y column has the time series values. In this example, because we are predicting the Google closing price, the column name GOOG is changed to y.
  • There is no pre-defined name for the additional predictor in prophet, so we can keep the name VTI as is.
# Change variable names
data = data.reset_index()
data.columns = ['ds', 'y', 'VTI']
# Take a look at the data
data.head()
Prophet change variable names — Image from GrabNGoInfo.com

Next, let’s check the correlation between the Google stock closing price and the VTI closing price.

# Check correlation
data.corrwith(data["y"])

Output

y      1.000000
VTI    0.967116
dtype: float64

The value of 0.967 is close to 1, indicating that there is a high correlation between the two closing prices. Therefore, VTI closing price is a potential good predictor for the Google closing price.

Step 4: Train Test Split

In step 4, we will do the train test split. For time series data, usually a threshold date is chosen, then we set the dates before the threshold to be the training dataset and the dates after the threshold to be the testing dataset.

Based on the threshold date (train_end_date) we set before, there are 494 data points in the training dataset and 11 data points in the testing dataset.

# Train test split
train = data[data['ds'] <= train_end_date]
test = data[data['ds'] > train_end_date]
# Check the shape of the dataset
print(train.shape)
print(test.shape)

Output

(494, 3)
(11, 3)

Checking the minimum and maximum values for the train and test dataset separately gave us the starting and ending dates.

The training dataset has the dates from January 2nd, 2020 to December 15th, 2021, and the testing dataset has the dates from December 16th, 2021 to December 31, 2021.

# Check the start and end time of the training and testing dataset
print('The start time of the training dataset is ', train['ds'].min())
print('The end time of the training dataset is ', train['ds'].max())
print('The start time of the testing dataset is ', test['ds'].min())
print('The end time of the testing dataset is ', test['ds'].max())

Step 5: Baseline Model

In step 5, we will build a univariate baseline model using the default prophet hyperparameters, and fit the model using the training dataset.

# Use the default hyperparameters to initiate the Prophet model
model_baseline = Prophet()
# Fit the model on the training dataset
model_baseline.fit(train)

Prophet automatically fits daily, weekly, and yearly seasonalities if the time series is more than two cycles long.

The model information shows that the yearly seasonality and the daily seasonality are disabled.

  • The daily seasonality is disabled because we do not have sub-daily time series.
  • The yearly seasonality is disabled because we do not have two full years of data in the training dataset.
INFO:prophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
<prophet.forecaster.Prophet at 0x7fa2c4105610>

We will continue with the default values for the baseline model and force the yearly seasonality in the next model to see the impact of the yearly seasonality.

To make a forecast, we first need to create a future dataframe. periods=16 means that we would like to make predictions for the next 16 days.

If you remember that the testing dataset has 11 data points, you might wonder why we put 16 instead of 11 as the number of periods here. That is because there are 5 weekends and holidays from December 16th to December 31st in 2021. We need to add those days back in order to get the predictions till December 31st of 2021.

# Create the time range for the forecast
future_baseline = model_baseline.make_future_dataframe(periods=16)
# Make prediction
forecast_baseline = model_baseline.predict(future_baseline)
# Visualize the forecast
model_baseline.plot(forecast_baseline); # Add semi-colon to remove the duplicated chart

After making the prediction on the future dataframe, we can plot the result using .plot.

  • The black dots are the actual values.
  • The blue line is the prediction.
  • The blue shades are the uncertainty interval. The default value for the uncertainty interval is 80%, so we are using 80% here. The uncertainty interval is calculated based on the assumption that the average frequency and magnitude of trend changes in the future will be the same as the historical data. The historical data trend changes are projected forward to get the uncertainty intervals [1].
Prophet time series forecast using default hyperparameters — Image from GrabNGoInfo.com

In addition to the forecast plot, prophet also provides the components plot.

From the component plot chart, we can see that the Google stock closing price has an overall upward trend. The weekly seasonality shows that the price tends to be lower at the beginning of the week and higher at the end of the week.

# Visualize the forecast components
model_baseline.plot_components(forecast_baseline);
Prophet time series components using default hyperparameters — Image from GrabNGoInfo.com

Next, let’s check the model performance.

The forecast dataframe does not include the actual values, so we need to merge the forecast dataframe with the test dataframe to compare the actual values with the predicted values.

# Merge actual and predicted values
performance_baseline = pd.merge(test, forecast_baseline[['ds', 'yhat', 'yhat_lower', 'yhat_upper']][-16:], on='ds')
# Check MAE value
performance_baseline_MAE = mean_absolute_error(performance_baseline['y'], performance_baseline['yhat'])
print(f'The MAE for the baseline model is {performance_baseline_MAE}')
# Check MAPE value
performance_baseline_MAPE = mean_absolute_percentage_error(performance_baseline['y'], performance_baseline['yhat'])
print(f'The MAPE for the baseline model is {performance_baseline_MAPE}')

The mean absolute error (MAE) for the baseline model is $84, meaning that on average, the forecast is off by $84. Given the Google price of nearly $3000, the prediction is not bad.

The mean absolute percent error (MAPE) for the baseline model is 2.9%, meaning that on average, the forecast is off by 2.9% of the stock price.

The MAE for the baseline model is 84.12552420889001
The MAPE for the baseline model is 0.02904474676076865

Step 6: Add Seasonality to Baseline Model

The baseline model already gives us good estimations. Can we tune the model to make the estimations better? In step 6, we will force the model to consider the yearly seasonality.

When initiating the prophet model, the yearly_seasonality and weekly_seasonality are explicitly set to True, and then fit on the training data.

# Add seasonality
model_season = Prophet(yearly_seasonality=True, weekly_seasonality=True)
# Fit the model on the training dataset
model_season.fit(train)

The forecast plot is much better than the baseline model. The predictions are more aligned with the actual values, and the uncertainty interval is much narrower.

# Create the time range for the forecast
future_season = model_season.make_future_dataframe(periods=16)
# Make prediction
forecast_season = model_season.predict(future_season)
# Visualize the forecast
model_season.plot(forecast_season); # Add semi-colon to remove the duplicated chart
Prophet model forecast with seasonalities — Image from GrabNGoInfo.com

For the components plot, besides trend and weekly seasonality, there is a yearly seasonality chart, showing the seasonality over the year.

# Visualize the forecast components
model_season.plot_components(forecast_season);
Prophet model component with seasonalities — Image from GrabNGoInfo.com

The model performance is better than the baseline model.

  • MAE decreased to $60 from the baseline model’s $84.
  • MAPE decreased to 2% from the baseline model’s 2.9%.
# Merge actual and predicted values
performance_season = pd.merge(test, forecast_season[['ds', 'yhat', 'yhat_lower', 'yhat_upper']][-16:], on='ds')
# Check MAE value
performance_season_MAE = mean_absolute_error(performance_season['y'], performance_season['yhat'])
print(f'The MAE for the seasonality model is {performance_season_MAE}')
# Check MAPE value
performance_season_MAPE = mean_absolute_percentage_error(performance_season['y'], performance_season['yhat'])
print(f'The MAPE for the seasonality model is {performance_season_MAPE}')

Output

The MAE for the seasonality model is 60.64578181306512
The MAPE for the seasonality model is 0.020749685375986347

Step 7: Multivariate Model

In step 7, we added the VTI price as an additional predictor using the add_regressor function. standardize=False means the regressor will not be standardized.

VTI is chosen as a convenient example of illustrating the process of building a multivariate model. In practice, multiple features need to be created and evaluated as feature engineering and feature selection steps for the model.

# Add seasonality 
model_multivariate = Prophet(yearly_seasonality=True, weekly_seasonality=True)
# Add regressor
model_multivariate.add_regressor('VTI', standardize=False)
# Fit the model on the training dataset
model_multivariate.fit(train)

When making forecasts for the multivariate model, we need to make sure that the regressors have values for the forecast periods, so we used left join and appended VTI data to the future dataframe.

In the case that the forecast is for the future without the regressor data, separate models need to be built for the regressors to get the predictions for the future dates.

If there are missing values, we can use .fillna(method='ffill') to fill the missing value with the previous day's data.

# Create the time range for the forecast
future_multivariate = model_multivariate.make_future_dataframe(periods=16)
# Append the regressor values
future_multivariate = pd.merge(future_multivariate, data[['ds', 'VTI']], on='ds', how='inner')
# Fill the missing values with the previous value
future_multivariate = future_multivariate.fillna(method='ffill')
# Check the data
future_multivariate.tail(10)
Prophet multivariate model forecast data — Image from GrabNGoInfo.com

Compared with the model with no additional regressor, we can see that the model with a regressor has the predictions more aligned with the actual values.

# Make prediction
forecast_multivariate = model_multivariate.predict(future_multivariate)
# Visualize the forecast
model_multivariate.plot(forecast_multivariate); # Add semi-colon to remove the duplicated chart
Prophet multivariate model forecast — Image from GrabNGoInfo.com

The components plot has one additional chart for the additional regressor.

# Visualize the forecast components
model_multivariate.plot_components(forecast_multivariate);
Prophet multivariate model components — Image from GrabNGoInfo.com

The multivariate model performance is much better than the univariate model.

  • MAE decreased to $31 from the univariate seasonality model’s $60.
  • MAPE decreased to 1% from the univariate seasonality model’s 2%.
# Merge actual and predicted values
performance_multivariate = pd.merge(test, forecast_multivariate[['ds', 'yhat', 'yhat_lower', 'yhat_upper']][-16:], on='ds')
# Check MAE value
performance_multivariate_MAE = mean_absolute_error(performance_multivariate['y'], performance_multivariate['yhat'])
print(f'The MAE for the multivariate model is {performance_multivariate_MAE}')
# Check MAPE value
performance_multivariate_MAPE = mean_absolute_percentage_error(performance_multivariate['y'], performance_multivariate['yhat'])
print(f'The MAPE for the multivariate model is {performance_multivariate_MAPE}')

Output

The MAE for the multivariate model is 31.372078294947972
The MAPE for the multivariate model is 0.010728081175901991

Step 8: Model with Holiday Effect and Event Effect

Besides seasonalities and additional predictors, we can also incorporate holidays and special events in a prophet time series model. In step 8, we will add special events and US holidays to the multivariate model and see how it impacts the model predictions.

Firstly, let’s create two special events: COVID start and Super Bowl.

  • For the COVID start event, we set the date to be March 15th of 2020, then extend the event to 15 days before and 15 days after using lower_window and upper_window separately.
  • For the Super Bowl event, we set the date in 2020 and 2021 separately, and extend the event to 7 days before and 1 day after.

Then the two events are concatenated together to one dataframe called events.

# COVID time window
COVID = pd.DataFrame({
    'holiday': 'COVID',
    'ds': pd.to_datetime(['2020-03-15']),
    'lower_window': -15,
    'upper_window': 15,    
})
# Super Bowl time window
superbowl = pd.DataFrame({
    'holiday': 'superbowl',
    'ds': pd.to_datetime(['2020-02-02', '2021-02-07']),
    'lower_window': -7,
    'upper_window': 1,    
})
# Combine all events
events = pd.concat((COVID, superbowl))
# Take a look at the events data
events
Prophet time series special events — Image from GrabNGoInfo.com

When initiating the prophet model, the dataframe name events is passed to the hyperparameter holidays.

Prophet has built-in country-specific holidays. We add US holidays to the model using add_country_holidays and setting country_name='US'.

# Add holidays
model_holiday = Prophet(yearly_seasonality=True, weekly_seasonality=True, holidays=events)
# Add built-in country-specific holidays
model_holiday.add_country_holidays(country_name='US')
# Add regressor
model_holiday.add_regressor('VTI', standardize=False)
# Fit the model on the training dataset
model_holiday.fit(train)
# All the holidays and events
model_holiday.train_holiday_names

Using .train_holiday_names, we can see the list of holidays and events used for the modeling fitting.

0                           COVID
1                       superbowl
2                  New Year's Day
3      Martin Luther King Jr. Day
4           Washington's Birthday
5                    Memorial Day
6                Independence Day
7     Independence Day (Observed)
8                       Labor Day
9                    Columbus Day
10                   Veterans Day
11                   Thanksgiving
12                  Christmas Day
13      New Year's Day (Observed)
14       Christmas Day (Observed)

When doing a forecast, the holiday and special events effects are included.

# Create the time range for the forecast
future_holiday = model_holiday.make_future_dataframe(periods=16)
# Append the regressor values
future_holiday = pd.merge(future_holiday, data[['ds', 'VTI']], on='ds', how='inner')
# Fill the missing values with the previous value
future_holiday = future_holiday.fillna(method='ffill')
# Make prediction
forecast_holiday = model_holiday.predict(future_holiday)
# Visualize the forecast
model_holiday.plot(forecast_holiday); # Add semi-colon to remove the duplicated chart
Prophet time series forecast with holiday effects — Image from GrabNGoInfo.com

The components plot has one additional plot for the holidays including special events.

# Visualize the forecast components
model_holiday.plot_components(forecast_holiday);
Prophet time series components with holiday effects — Image from GrabNGoInfo.com

The added holiday effects did not improve the performance of the multivariate model.

  • MAE increased to $48 from the multivariate model’s $31.
  • MAPE increased to 1.6% from the multivariate model’s 1%.
# Merge actual and predicted values
performance_holiday = pd.merge(test, forecast_holiday[['ds', 'yhat', 'yhat_lower', 'yhat_upper']][-16:], on='ds')
# Check MAE value
performance_holiday_MAE = mean_absolute_error(performance_holiday['y'], performance_holiday['yhat'])
print(f'The MAE for the holiday/event model is {performance_holiday_MAE}')
# Check MAPE value
performance_holiday_MAPE = mean_absolute_percentage_error(performance_holiday['y'], performance_holiday['yhat'])
print(f'The MAPE for the holiday/event model is {performance_holiday_MAPE}')

Output

The MAE for the holiday/event model is 47.51193003230161
The MAPE for the holiday/event model is 0.016239759857200307

Summary

In this tutorial, you learned:

  • How to include seasonalities in time series prediction using Prophet?
  • How to add standard holidays of a country?
  • How to add special events such as Super Bowl?
  • How to add additional predictors?
  • How the time series model performance is impacted by seasonalities, holidays, special events, and additional features?

To learn about how to do time series cross-validation using prophet, please refer to my previous tutorial Time Series Forecasting Of Bitcoin Prices Using Prophet.

For more information about data science and machine learning, please check out my YouTube channel and Medium Page.

Recommended Tutorials

References

[1] Prophet Documentation

Multivariate Time Series
Facebook Prophet
Time Series Model
Machine Learning Python
Time Series Forecasting
Recommended from ReadMedium