The provided content offers an in-depth exploration of the SARIMA model, an extension of the ARIMA model that incorporates seasonality, and demonstrates its application in time series forecasting with a Python tutorial.
Abstract
The content delves into the Seasonal Autoregressive Integrated Moving Average (SARIMA) model, which is a refinement of the ARIMA model designed to handle seasonal data. It explains the theoretical underpinnings of SARIMA, detailing how it extends the ARIMA model with seasonal components, denoted as SARIMA(p, d, q)(P, D, Q)m. The article emphasizes the importance of stationarity in time series data for SARIMA modeling and discusses techniques for achieving it, such as differencing and transformations. It also guides readers through the process of selecting appropriate model orders using tools like the Augmented Dickey-Fuller test, autocorrelation function (ACF), and partial autocorrelation function (PACF). The practical implementation of SARIMA is illustrated with Python code, using data from Kaggle to forecast air passenger numbers. The tutorial covers data preprocessing, model fitting, and forecast visualization, showcasing the effectiveness of SARIMA in capturing trends and seasonality.
Opinions
The author highly recommends familiarity with ARIMA before diving into SARIMA, suggesting that a solid understanding of the former is crucial for grasping the latter.
The author expresses that stationarity is a key requirement for SARIMA, implying that without it, the model's effectiveness is compromised.
The use of Maximum Likelihood Estimation (MLE) is advocated for estimating SARIMA coefficients, indicating a preference for this method in the modeling process.
The author provides additional resources for readers to explore related topics, such as seasonality and stationarity, showing a commitment to comprehensive learning.
The Python implementation using the statsmodels package is presented as a straightforward approach to applying SARIMA, suggesting that the package is user-friendly for practitioners.
The author's enthusiasm for SARIMA is evident, as they conclude that the model is simple to apply and can effectively handle complex time series data with trends and seasonality.
What is SARIMA in Time Series Forecasting
A deep dive into the SARIMA model and its applications in time series analysis
In one of my previous posts we covered probably the most famous forecasting model, Autoregressive Integrated Moving Average, better known as ARIMA. However, one disadvantage of this model is that it has no awareness of seasonality. This is where the Seasonal Autoregressive Integrated Moving Average, or SARIMA, model comes in. In this post, we will take a deep dive into the theory and main ideas behind the SARIMA model and how to implement it in Python.
I highly recommend reading my previous article if you are not too familiar with ARIMA, as this article assumes quite a lot of prior knowledge of how the original ARIMA model works!
What Is SARIMA?
Overview
SARIMA is an extension of the regular ARIMA model that adds a seasonality component to the model. This allows us to better capture seasonal effects, which the regular ARIMA model cannot represent.
If you want to learn more about seasonality in time series, I highly recommend you read one of my previous posts:
The classic ARIMA model has three components: Autoregressive, Integrated (differencing), and Moving-Average. These are then linearly combined to form the model:
Equation generated by author in LaTeX.
Where:
y’: differenced time series, the number of differencing applied is noted as d
The model is often compactly written ARIMA(p, d, q) where p, d, and q refer to the order of autoregressors, differencing and moving-average components respectively.
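For reference, a common way to write the ARIMA(p, d, q) model, consistent with the description above, is sketched below. The symbols c (a constant), φ (autoregressive coefficients), θ (moving-average coefficients), and ε (forecast errors) are assumed notation here, not taken from the text.

```latex
y'_t = c
     + \sum_{n=1}^{p} \phi_n \, y'_{t-n}
     + \sum_{n=1}^{q} \theta_n \, \varepsilon_{t-n}
     + \varepsilon_t
```

The first sum is the autoregressive part (p lags of the differenced series) and the second is the moving-average part (q lags of the forecast errors).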
SARIMA adds a seasonality component to each factor of the ARIMA equation to produce SARIMA(p, d, q)(P, D, Q)m:
Equation generated by author in LaTeX.
Where:
y’: differenced time series, through both regular, d, and seasonal, D, differencing
P: number of seasonal auto-regressors
ω: coefficients of the seasonal autoregressive components
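As a sketch of what "adding a seasonal component to each factor" can look like, the additive form below extends the ARIMA equation with terms lagged by multiples of the seasonal period m. The symbols c, φ, θ, and ε are assumed notation as in the usual ARIMA equation, and η (the seasonal moving-average coefficients) is a name introduced here purely for illustration; only y', P, and ω come from the definitions above.

```latex
y'_t = c
     + \sum_{n=1}^{p} \phi_n \, y'_{t-n}
     + \sum_{n=1}^{q} \theta_n \, \varepsilon_{t-n}
     + \sum_{n=1}^{P} \omega_n \, y'_{t-mn}
     + \sum_{n=1}^{Q} \eta_n \, \varepsilon_{t-mn}
     + \varepsilon_t
```

Note that the textbook SARIMA model multiplies the seasonal and non-seasonal lag polynomials rather than adding them, so this additive form should be read as a simplified picture of the idea.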
Like the original ARIMA model, the SARIMA model needs stationary data to model and forecast the time series. A stationary time series does not exhibit any long-term trend or clear seasonality; its statistical properties, such as mean and variance, remain constant over time.
To produce a stationary time series we need to stabilize the mean and variance. The mean can be stabilized through differencing; the number of times differencing is applied is d, or D in the case of seasonal differencing. The variance can be stabilized through transformations such as the logarithm or the Box-Cox transform, which make the seasonal fluctuations a similar size in every season.
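The operations mentioned above can be written out explicitly. This is a sketch using standard definitions; the symbols w (the transformed series) and λ (the Box-Cox parameter) are assumed notation, and the Box-Cox transform requires strictly positive data.

```latex
\begin{align*}
\text{Regular differencing:}  \quad & y'_t = y_t - y_{t-1} \\
\text{Seasonal differencing:} \quad & y'_t = y_t - y_{t-m} \\
\text{Box-Cox transform:}     \quad & w_t =
  \begin{cases}
    \dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
    \ln y_t, & \lambda = 0
  \end{cases}
\end{align*}
```

Setting λ = 0 recovers the logarithmic transform, so the log is a special case of Box-Cox.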
If you want to learn more about stationarity, check out my previous blog posts about it here:
Once the time series is stationary, we need to deduce the best orders, (p, d, q) and (P, D, Q)m, for our model. The simplest ones to determine are the regular and seasonal differencing orders, d and D. These can be found using the Augmented Dickey-Fuller (ADF) statistical test, which assesses whether a time series is stationary or not.
The autoregressive and moving-average (forecast error) orders (p, q, P, Q) can be found by analyzing the partial autocorrelation function (PACF) and autocorrelation function (ACF) respectively. The idea behind this technique is to plot a correlogram of each function and see which lags are statistically significant. The significant lags indicate terms that have a substantial impact on the forecast.
These correlograms also allow us to observe the seasonal pattern, if any, as we may see peaks at multiples of a particular lag. For example, a SARIMA(0,0,0)(1,0,0)4 process will show exponential decay at the seasonal lags (4, 8, 12, and so on) of the ACF but a single significant spike at lag 4 in the PACF. If the data is indexed by quarter, this would be an example of yearly seasonality.
If this seems confusing at the moment, don’t worry. In the Python implementation later we will walk through this process!
The final step is to compute the coefficients for these orders. The most common method is Maximum Likelihood Estimation (MLE), which assumes a probability distribution for the errors, typically normal, and finds the coefficients that are most likely to have generated the observed data. As the time series is stationary and has constant statistical properties, we can treat it as coming from a fixed probability distribution, which allows us to use MLE. This is why stationarity is the key requirement for SARIMA.
Josh Starmer’s StatQuest gives a great explanation of MLE. Link here.
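To make the idea of MLE concrete, here is a minimal sketch for the simplest possible case: a single autoregressive coefficient with normally distributed errors. It is not how statsmodels fits a full SARIMA model; it only illustrates "pick the coefficient that makes the data most likely". The simulated data, the true coefficient 0.6, and the function name `conditional_log_likelihood` are all assumptions made for this example.

```python
import numpy as np

# Simulate an AR(1) series y_t = phi * y_{t-1} + eps_t with unit-variance noise.
rng = np.random.default_rng(1)
true_phi, n = 0.6, 1000
y = np.zeros(n)
for t in range(1, n):
    y[t] = true_phi * y[t - 1] + rng.normal()


def conditional_log_likelihood(phi, series, sigma=1.0):
    """Gaussian log-likelihood of an AR(1) model, conditional on the first value."""
    resid = series[1:] - phi * series[:-1]
    k = resid.size
    return -0.5 * k * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum(resid**2) / sigma**2


# Evaluate the log-likelihood on a grid of candidate coefficients and
# keep the candidate with the highest value.
grid = np.linspace(-0.99, 0.99, 199)
log_liks = np.array([conditional_log_likelihood(phi, y) for phi in grid])
phi_hat = grid[np.argmax(log_liks)]

print(f"true phi = {true_phi}, estimated phi = {phi_hat:.2f}")
```

A real fitting routine replaces the grid search with a numerical optimizer and estimates all coefficients, plus the error variance, at once.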
Python Tutorial
Data
Let’s begin by plotting the time series we want to forecast:
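A possible way to load and plot the data is sketched below. The file name `AirPassengers.csv` and the column names `Month` and `#Passengers` are assumptions about the Kaggle download, and the helper names `load_passengers` and `plot_series` are hypothetical; adjust them to match your copy of the data.

```python
import pandas as pd


def load_passengers(path="AirPassengers.csv"):
    """Read the CSV into a Series indexed by month.

    The file name and the column names are assumptions, not confirmed
    by the article.
    """
    df = pd.read_csv(path, parse_dates=["Month"], index_col="Month")
    return df["#Passengers"].astype(float)


def plot_series(series, title="Airline passengers"):
    """Draw a simple line plot of the series and return the figure."""
    import matplotlib.pyplot as plt  # imported here so loading works without it

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(series.index, series.values)
    ax.set_title(title)
    ax.set_xlabel("Date")
    ax.set_ylabel("Passengers")
    return fig
```

Calling `plot_series(load_passengers())` then draws the raw series, provided the CSV is in the working directory.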
There is an obvious trend and seasonality, so the data is not stationary: the mean and variance are changing over time. Therefore, we need to apply differencing and the Box-Cox transform to make our series stationary, as required by SARIMA:
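A minimal sketch of this preprocessing step, assuming `scipy.stats.boxcox` for the transform and pandas for the differencing. The helper name `make_stationary` is hypothetical, and the series used here is a synthetic stand-in with a trend and growing seasonal swings, not the passenger data. The sketch applies one regular and one seasonal difference, matching d = 1 and D = 1 in the model chosen later; the article's own preprocessing may differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox


def make_stationary(series, m=12):
    """Box-Cox transform, then one regular and one seasonal difference."""
    values, lam = boxcox(series.to_numpy(dtype=float))  # lam is fitted by MLE
    transformed = pd.Series(values, index=series.index)
    differenced = transformed.diff().diff(m).dropna()
    return transformed, differenced, lam


# Synthetic stand-in (an assumption): upward trend with multiplicative
# yearly seasonality and a little noise.
idx = pd.date_range("1949-01-01", periods=144, freq="MS")
t = np.arange(144)
rng = np.random.default_rng(0)
level = 100 + 2.5 * t
season = 1 + 0.2 * np.sin(2 * np.pi * t / 12)
noise = np.exp(rng.normal(0, 0.02, size=144))
series = pd.Series(level * season * noise, index=idx)

transformed, differenced, lam = make_stationary(series)
print(f"Box-Cox lambda: {lam:.3f}")
print(f"observations after differencing: {len(differenced)}")
```

The differenced series loses 1 + m observations at the start, which is why `dropna()` is needed before plotting or testing it.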
Plot generated by author in Python.
The data now looks sufficiently stationary.
Modelling
We will now use the ACF and PACF correlograms to deduce the orders for the autoregressive and moving-average components:
Plot generated by author in Python.
The blue region signifies where the lags are no longer statistically significant.
We already observed that our series has yearly seasonality, m=12, and the above plots confirm this, as we have large spikes at the 12th lag. The lags are also significant up to around the 10th lag in both plots. Overall this indicates that a SARIMA(10, 1, 10)(1, 1, 1)12 model should be suitable.
Now, let’s fit the model using the ARIMA class from statsmodels and generate the forecasts. Luckily, this class carries out differencing for us, so we only need to pass the Box-Cox transformed time series:
Analysis
Finally, we will plot the forecasts:
Plot generated by author in Python.
The SARIMA forecasts seem to have done quite well!
Summary and Further Thoughts
In this article, we have discussed an extension to the famous ARIMA forecasting model, SARIMA. This model adds seasonality components to the regular ARIMA model to enable the modeling of more complex time series. The SARIMA model is simple to apply in Python through the statsmodels package.
The full code used in this article can be found on my GitHub here: