Time Series Analysis Use Case: The Air Passengers Dataset
This article is part of the “Datascience with Python” series.
After explaining how to perform time series analysis in Python, I'll now guide you through a real use case.
We will use the Air Passengers Dataset, which is a widely used dataset in the field of time series analysis. The dataset contains monthly airline passenger numbers from 1949 to 1960 and has been used in various studies to develop forecasting models and analyze the trends and seasonality of the data.
Exploratory Data Analysis
Before applying any time series analysis techniques, it is important to perform an exploratory data analysis (EDA) to gain insights into the data and identify any trends, seasonality, or anomalies.
Data Description and Importing the Dataset
The Air Passengers Dataset consists of the number of passengers (in thousands) who traveled by air between 1949 and 1960. The dataset has 144 observations, with one observation for each month in the period. To perform the EDA, we first need to import the dataset:
import pandas as pd
import matplotlib.pyplot as plt
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
dataset = pd.read_csv(url, index_col='Month', parse_dates=True)
print(dataset.head())
Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
The pd.read_csv() function is used to read the CSV file containing the dataset from a URL, and the index_col and parse_dates arguments are used to set the index of the DataFrame as the 'Month' column and convert it to a datetime format. We can also use the head() function to print the first few rows of the dataset and verify that it has been imported correctly.
Then, we can visualize the data using matplotlib:
dataset.plot(figsize=(10,5))
plt.xlabel('Year')
plt.ylabel('Passengers (in thousands)')
plt.title('Air Passengers Dataset')
plt.show()

From the plot, we can observe that the data exhibits a clear upward trend with a noticeable seasonality pattern. The amplitude of the seasonal swings also grows over time, which suggests the variance increases with the level of the series.
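To make these patterns easier to see, we can overlay rolling statistics on the series; here is a minimal sketch using a 12-month window (chosen to match the monthly seasonality):
# Rolling statistics make the trend and the growing variance easier to see
rolling_mean = dataset['Passengers'].rolling(window=12).mean()
rolling_std = dataset['Passengers'].rolling(window=12).std()
plt.figure(figsize=(10,5))
plt.plot(dataset, label='Original')
plt.plot(rolling_mean, label='12-month rolling mean')
plt.plot(rolling_std, label='12-month rolling std')
plt.legend(loc='upper left')
plt.show()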
We can then decompose the time series into its components — trend, seasonality, and residuals. This can be done using the seasonal_decompose() function from the statsmodels library:
pip install statsmodels
from statsmodels.tsa.seasonal import seasonal_decompose
# Decomposing the time series into trend, seasonality, and residuals
decomposition = seasonal_decompose(dataset, model='additive')
# Plotting the decomposed components
trend = decomposition.trend
seasonality = decomposition.seasonal
residuals = decomposition.resid
plt.figure(figsize=(10,8))
plt.subplot(411)
plt.plot(dataset, label='Original')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonality, label='Seasonality')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residuals, label='Residuals')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

From the decomposition plot, we can observe that the trend is increasing linearly, and the seasonality is fairly consistent with a clear peak and trough every 12 months. The residuals also appear to be stationary and normally distributed, which is a good sign for model selection.
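Since the seasonal amplitude grows with the level of the series, it can also be worth comparing with a multiplicative decomposition, which seasonal_decompose() supports directly; a quick sketch:
# Multiplicative decomposition: the series is modeled as trend * seasonality * residuals
decomposition_mul = seasonal_decompose(dataset, model='multiplicative')
decomposition_mul.plot()
plt.show()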
Finally, we can calculate some summary statistics of the dataset, such as the mean, standard deviation, and autocorrelation:
from statsmodels.graphics.tsaplots import plot_acf
# Printing the summary statistics of the dataset
print("Summary Statistics:")
print(dataset.describe())
# Plotting the autocorrelation function (ACF) of the dataset
plot_acf(dataset, lags=50)
plt.xlabel('Lags')
plt.show()
Summary Statistics:
Passengers
count 144.000000
mean 280.298611
std 119.966317
min 104.000000
25% 180.000000
50% 265.500000
75% 360.500000
max 622.000000

The summary statistics show that the mean number of passengers is about 280.3 thousand, with a standard deviation of about 120 thousand. The ACF plot shows significant autocorrelation at lag 1 (and clear seasonal peaks around lags 12 and 24), indicating that the previous month’s passenger numbers are a good predictor of the current month’s numbers. This suggests that an autoregressive (AR) model may be appropriate for the dataset.
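To complement the ACF, we can also look at the partial autocorrelation function (PACF), which isolates the direct effect of each lag and is the usual tool for choosing the AR order p; a quick sketch:
from statsmodels.graphics.tsaplots import plot_pacf
# The PACF shows the correlation with each lag after removing the influence of shorter lags
plot_pacf(dataset, lags=30)
plt.xlabel('Lags')
plt.show()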
Modeling the Data
After performing the EDA, we can now move on to modeling the data. We will use the ARIMA (Autoregressive Integrated Moving Average) model, which is a popular time series forecasting model.
ARIMA models have three main parameters: p, d, and q. The p parameter refers to the autoregressive term, which measures the relationship between the current value and the past values. The d parameter refers to the integrated term, which measures the number of times the data needs to be differenced to achieve stationarity. The q parameter refers to the moving average term, which measures the relationship between the current value and the past forecast errors.
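Before letting a library choose these parameters automatically, we can build some intuition for d with a stationarity check; here is a minimal sketch using the Augmented Dickey-Fuller test from statsmodels:
from statsmodels.tsa.stattools import adfuller
# Null hypothesis of the ADF test: the series is non-stationary
result = adfuller(dataset['Passengers'])
print('ADF p-value:', result[1])
# If the p-value is above 0.05, we difference the series once and test again
diff = dataset['Passengers'].diff().dropna()
print('ADF p-value after differencing:', adfuller(diff)[1])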
We will use the auto_arima() function from the pmdarima library to automatically select the best parameters for our model. This function uses a stepwise approach to iteratively fit different ARIMA models and select the best one based on the Akaike Information Criterion (AIC).
pip install pmdarima
from pmdarima.arima import auto_arima
# Fitting the ARIMA model using auto_arima()
model = auto_arima(dataset, seasonal=True, m=12, trace=True)
print(model.summary())
We can see from the output that the auto_arima() function has tried multiple combinations of p, d, and q parameters and has selected the model with the lowest AIC value, which is an ARIMA(2,1,0)(0,1,0)[12] model.
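As a sanity check, we could fit the selected order directly with statsmodels and inspect its AIC; a quick sketch (the value may differ slightly from pmdarima's because the two libraries use different estimation defaults):
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Fitting the order selected by auto_arima directly in statsmodels
check = SARIMAX(dataset, order=(2, 1, 0), seasonal_order=(0, 1, 0, 12)).fit()
print('AIC:', check.aic)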
Next, we need to split our dataset into training and testing sets. We will use the first 80% of the data for training and the remaining 20% for testing.
train_size = int(len(dataset) * 0.8)
train, test = dataset[0:train_size], dataset[train_size:len(dataset)]
print("Train Shape:", train.shape)
print("Test Shape:", test.shape)
We can now fit the ARIMA model on the training set and make predictions on the test set.
# Fitting the ARIMA model on the training set
model.fit(train)
# Making predictions on the test set
predictions = model.predict(n_periods=len(test))
predictions = pd.DataFrame(predictions, index=test.index, columns=['Passengers'])
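Point forecasts alone hide the uncertainty of the model; pmdarima can also return confidence intervals through the return_conf_int argument of predict(). A quick sketch, assuming a reasonably recent pmdarima version:
import numpy as np
# Predictions with 95% confidence intervals (alpha defaults to 0.05)
preds, conf_int = model.predict(n_periods=len(test), return_conf_int=True)
plt.figure(figsize=(10,6))
plt.plot(test.index, np.asarray(preds), label='Predictions')
plt.fill_between(test.index, conf_int[:, 0], conf_int[:, 1], color='gray', alpha=0.2, label='95% interval')
plt.legend(loc='best')
plt.show()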
Finally, we can plot the actual values and the predicted values to see how well our model has performed.
plt.figure(figsize=(10,6))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(predictions, label='Predictions')
plt.legend(loc='best')
plt.show()

From the plot, we can see that our ARIMA model has done a good job of predicting the future values of the time series. The predicted values closely follow the actual values and capture the overall trend and seasonality of the data.
Evaluating and Fine-Tuning
After fitting the ARIMA model and making predictions on the test set, it is important to evaluate the performance of the model and fine-tune it if necessary.
One way to evaluate the performance of the model is to calculate the mean absolute error (MAE), which measures the average absolute difference between the predicted values and the actual values.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(test, predictions)
print("Mean Absolute Error (MAE):", mae)
The MAE value itself tells us that, on average, the predicted values are off by around 25 thousand passengers. However, it is also important to visualize the performance of the model using a plot of the residuals, which are the differences between the actual values and the predicted values.
residuals = test - predictions
plt.figure(figsize=(10,6))
plt.plot(residuals)
plt.title("Residuals Plot")
plt.show()

From the residuals plot, we can see that there are still some errors and a clear seasonal pattern in them. This means we are not capturing all the patterns in the data, and our model could benefit from fine-tuning.
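To make this diagnosis more formal, we could run a Ljung-Box test on the residuals; a quick sketch (the return_df argument assumes statsmodels 0.12 or later):
from statsmodels.stats.diagnostic import acorr_ljungbox
# Null hypothesis: the residuals are independently distributed; a small p-value
# at lag 12 would confirm the leftover seasonal autocorrelation
print(acorr_ljungbox(residuals['Passengers'], lags=[12], return_df=True))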
Here are some approaches that we could take to improve the model (although not all of them are applicable in our case):
- Adjusting the ARIMA model parameters or changing the model: We can adjust the p, d, and q parameters of the ARIMA model to better capture the patterns in the data. We can use techniques such as grid search or random search to find the optimal values of these parameters that minimize the residual errors. Alternatively, we can try other models such as SARIMAX or Prophet.
- Adding exogenous variables: If there are other variables that are known to have an impact on the time series, we can include them as exogenous variables in the ARIMA model. For example, in the case of air passenger traffic, we can include variables such as fuel prices, inflation rates, or economic indicators that are known to impact the number of air passengers.
- Using different modeling techniques: If the ARIMA model is not able to capture all the patterns in the data, we can try different modeling techniques such as seasonal decomposition, exponential smoothing, or neural networks to model the time series (see the sketch after this list).
- Increasing the amount of data: If the dataset is small, the model may not be able to capture all the patterns in the data. In this case, we can try to collect more data to improve the performance of the model.
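As a sketch of the "different modeling techniques" point, here is a minimal Holt-Winters exponential smoothing model; the additive trend and multiplicative seasonality are assumptions based on the patterns we observed during the EDA:
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Holt-Winters with additive trend and multiplicative seasonality
hw_model = ExponentialSmoothing(train, trend='add', seasonal='mul', seasonal_periods=12).fit()
hw_predictions = hw_model.forecast(len(test))
print("Holt-Winters MAE:", mean_absolute_error(test, hw_predictions))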
In our case, the most practical option is to test other models, so let's do that. We'll try the SARIMAX model:
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Fitting the SARIMAX model on the training set only, so the test period stays out of sample
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()
predictions = results.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
# Computing the residuals on the test set
residuals = []
for i in range(len(test)):
    actual = test.iloc[i, 0]
    predicted = predictions.iloc[i]
    residuals.append(actual - predicted)
plt.figure(figsize=(10,6))
plt.plot(residuals)
plt.title("Residuals Plot")
plt.show()

It looks better! The residuals seem centered around 0 and there’s no seasonality in them. But let’s try to do better. We’ll make our own implementation of the auto_arima function. It’s a bit complex, but it’s worth it:
from itertools import product
import warnings
warnings.filterwarnings("ignore")
import statsmodels.api as sm
# Creating a list of all possible combinations of p, d, and q values
pdq = list(product(range(3), range(2), range(3)))
# Creating a list of seasonal p, d, and q values
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(product(range(3), range(2), range(3)))]
# Performing grid search to find the best ARIMA parameters
best_mae = float("inf")
best_params = None
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            # Fitting the SARIMAX model with the current parameter combination
            mod = sm.tsa.statespace.SARIMAX(train, order=param, seasonal_order=param_seasonal,
                                            enforce_stationarity=False, enforce_invertibility=False)
            results = mod.fit()
            # Making predictions on the test set
            predictions = results.predict(start=len(train), end=len(train) + len(test) - 1, dynamic=False)
            # Calculating the mean absolute error (MAE)
            mae = mean_absolute_error(test, predictions)
            # Checking if the current model is better than the best model so far
            if mae < best_mae:
                best_mae = mae
                best_params = (param, param_seasonal)
        except Exception:
            continue
print("Best ARIMA parameters:", best_params)
print("Best MAE:", best_mae)
# Fitting the model with the best parameters
mod = sm.tsa.statespace.SARIMAX(train, order=best_params[0], seasonal_order=best_params[1], enforce_stationarity=False, enforce_invertibility=False)
results = mod.fit()
predictions = results.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
mae = mean_absolute_error(test, predictions)
# Plotting the actual vs predicted values
plt.figure(figsize=(10,6))
plt.plot(test, label="Actual")
plt.plot(predictions, label="Predicted")
plt.title("Actual vs Predicted Values with Fine-tuned ARIMA Model")
plt.legend()
plt.show()
# Calculate the new residuals
residuals = []
for i in range(len(test)):
    actual = test.iloc[i, 0]
    predicted = predictions.iloc[i]
    residuals.append(actual - predicted)
# Plotting the residuals
plt.figure(figsize=(10,6))
plt.plot(residuals)
plt.title("Residuals Plot")
plt.show()

Better! Now I don’t think we can do better than this, so let’s stop here. Our predicted values look like the following:

You can compare this plot with the first one to see the difference.
Final Note
As you can see, time series analysis is a bit different than regression or clustering analysis. It requires different libraries and algorithms, even if the overall approach is the same.
I hope this article was helpful; time series analysis is not so easy!