avatarAlexzap

Summary

The provided context outlines a comprehensive approach to forecasting NVIDIA's short-term stock price using advanced machine learning and statistical methods, including SHAP feature importance, lag features, optimized regression models, Auto-ARIMA, multi-seasonal FB Prophet tuning, and LSTM neural networks.

Abstract

The article delves into the application of Explainable AI, statistical modeling, and supervised machine learning algorithms to predict the future stock price of NVIDIA, a prominent technology company. It emphasizes the integration of SHapley Additive exPlanations (SHAP) for feature dominance analysis, the use of lag features in linear regression, and the implementation of 8 different supervised ML regressors with hyperparameter optimization. The study also explores the effectiveness of Auto-ARIMA for time series fitting, multi-seasonal Facebook Prophet forecasting with hyperparameter tuning, and the use of LSTM neural networks for prediction and validation quality control. The methodology is detailed, with a focus on enhancing the accuracy of NVDA price predictions to aid investors in making informed decisions and potentially maximizing profits.

Opinions

  • The article suggests that a combination of traditional statistical methods and advanced machine learning techniques can provide a more accurate and interpretable model for stock price forecasting.
  • There is an opinion that NVIDIA's stock split does not change the company's fundamentals but could make its shares more attractive to retail investors, reflecting a view on market psychology.
  • The use of SHAP values is presented as a valuable tool for understanding the contribution of each feature to the model's output, indicating a preference for interpretable machine learning.
  • The author appears to value the predictive power of time series models like ARIMA and Prophet, especially when combined with machine learning regressors.
  • There is a belief that the boom in AI has positioned NVIDIA as a key player in the technology sector, which is reflected in the company's stock performance and market perception.
  • The article conveys that despite the complexity of forecasting stock prices, the proposed methodology can provide insights and predictions that are useful for investment decisions.

NVIDIA Short-Term Stock Price Forecasting: SHAP Explainer, Lags, Optimized ML Regressors, Auto-ARIMA, Multi-Seasonal FB Prophet Tuning & LSTM

Photo by BoliviaInteligente on Unsplash
  • “Most forecasts don’t allow for alternative outcomes.” Howard Marks.
  • The key question is whether the integrated Explainable AI, Statistical Modeling, Supervised ML algorithms and model tuning can predict the future of NVIDIA’s AI with alternative outcomes.
  • This article is about NVIDIA (NASDAQ: NVDA) Short-Term Stock Price Forecasting [3] using SHAP 60+ Feature Engineering (FE) [1], Basic Linear Regression with Time Lags [2], 8 Supervised ML Regressors (Scikit-Learn) with Hyperparameter Optimization (HPO) [1,4], Auto-ARIMA time series fitting [1,5], Multi-Seasonal FB Prophet Forecasting HPO with anomalous dates [1], and end-to-end LSTM with Validation QC.

Motivation

  • NVIDIA is one of the ‘Magnificent Seven’ group of influential US mega stocks (along with Apple, Microsoft, Alphabet, Amazon, Tesla and Meta Platforms) [3].
  • The business has enjoyed meteoric success in recent years and, on 18 June 2024, leap-frogged both Microsoft and Apple to briefly become the world’s most valuable company.
  • Charlie Huggins, manager of the Wealth Club Quality Shares Portfolio, says: “The NVIDIA stock split doesn’t change the fundamentals, or value, of the business. But it could make the shares even more attractive to retail investors.”
  • Thanks to the boom in AI, NVIDIA is enjoying a purple patch. According to Victoria Scholar, head of investment at interactive investor, an online share trading service, the “AI stock market darling” was the most bought stock among the platform’s UK investors for the second month running in May 2024.

Challenge

  • It’s impossible to predict with certainty how a company’s share price will perform [3].
  • A stock’s behavior depends partly on the business’s internal performance, combined with wider macro-economic events affecting the markets in which it operates.

Objectives

  • Predict short-term NVDA stock price through analyzing the relationship between the stock price and various features.
  • Ultimately, the primary goal is to enhance the accuracy of NVDA price predictions, enabling investors to anticipate market movements and potentially maximize profits.
  • Including making informed investment decisions, averting losses, optimizing stockholder investments, and improving decision-making strategies.

Tools

The Machine Learning library SciKit-learn in Python comes with a load of features to simplify ML:

  • Supervised ML: Generalized linear models such as Linear regression, Decision Trees, SVM, Random Forest, Bayesian methods, etc.
  • Unsupervised ML: factoring, clustering, PCA, and unsupervised NN.
  • Cross-validation: The accuracy and validity of supervised models on unseen data can be checked with the help of SciKit-learn.
  • FB Prophet

Developed by Facebook (FB) and made an open-source contribution to the data science community, Prophet is a powerful forecasting tool available in both R and Python. It is fast, accurate, automated, and feature-rich.

For the time series forecasting shown below, we will be using FB Prophet. More information can be found here: https://facebook.github.io/prophet/

Pros: The model provides an interpretable decomposition of the data that is observed or forecasted into seasonal, trend, and holiday components. This is useful if you want to understand why you are seeing certain patterns on certain day.

Cons: Seasonality not explicitly modeled; lagged effect when forecasting; lacks support for multiple inputs.

ARIMA perform well, if data is not stationary then we can use differencing to make it stationary. If data exhibits clear trends and regular seasonality, ARIMA can be a good choice. ARIMA’s forecasting accuracy tends to reduce as the forecast horizon increases. So it is not good for long-term forecasts.

Read more here.

Using Keras and Tensorflow (TF) makes building NN much easier to build from scratch.

Scikit-Learn is primarily used for traditional ML tasks and algorithms.

TF is mainly utilized for deep learning problems, particularly in AI.

SHAP (SHapley Additive exPlanations) values are a way to explain the output of any machine learning model. It uses a game theoretic approach that measures each player’s contribution to the final outcome. In ML, each feature is assigned an importance value representing its contribution to the model’s output.

SHAP values show how each feature affects each final prediction, the significance of each feature compared to others, and the model’s reliance on the interaction between features.

8 Supervised ML Algorithms

  name                               model 
0 Linear Regression                  LinearRegression() 
1 KNeighbors Regressor               KNeighborsRegressor() 
2 Support Vector Machines            SVR() 
3 Linear SVR                         LinearSVR() 
4 Random Forest Regressor            RandomForestRegressor() 
5 Bagging Regressor                  BaggingRegressor() 
6 XGB Regressor                      XGBRegressor()
7 MLP Regressor                      MLPRegressor() 

Key ML Error Metrics

  • R2 score, RMSE, MAE, Explained Variance Score, and MAPE.

Milestones

  • Exploratory Data Analysis (EDA) of Input Stock Data
  • Basic Linear Regression with Time-Step & Lag Features
  • TSFRESH Feature Engineering [1]
  • Multi-Seasonal FB Prophet Prediction & HPO
  • Auto-ARIMA Model Fitting & SARIMAX QC Summary
  • SciKit-Learn Supervised ML Regression & HPO
  • SHAP Feature Dominance Analysis
  • TF LSTM Keras Prediction & Validation QC

Let’s get into the specifics of the proposed methodology with a final source code line-by-line explained.

Python Imports, Installations & Settings

  • Installing/importing necessary Python libraries and setting parameters
!pip install seaborn, yfinance, sklearn, tsfresh, prophet, lightgbm, xgboost, pmdarima, shap, statsmodels, plotly, xgboost

import random
import os
import numpy as np 
import pandas as pd 
import requests
#import pandas_datareader as web
import yfinance as yf

# Date
import datetime as dt
from datetime import date, timedelta, datetime

# EDA
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# FE
from tsfresh import extract_features, select_features, extract_relevant_features
from tsfresh.utilities.dataframe_functions import impute
from sklearn.inspection import permutation_importance

import shap

# Time Series Modelling
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA

# Metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# Modeling and preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR, LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
from prophet import Prophet
import xgboost as xgb
from xgboost import XGBRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor
import pmdarima as pm

import warnings
warnings.filterwarnings("ignore")

# Set random state
def fix_all_seeds(seed):
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

random_state = 42
fix_all_seeds(random_state)


# Set main parameters
cryptocurrency = 'NVDA' #define any stock
target = 'Close'
forecasting_days = 10 

# Set time interval of data for given cryptocurrency - the period of coronavirus in 2020-2021
date_start = dt.datetime(2020, 4, 1)
# date_end = dt.datetime.now()
date_end = dt.datetime(2024, 7, 13)   

Reading & Analyzing Input Stock Data

  • Using Yahoo Finance to fetch the daily NVIDIA historical data from 2020–04–01 to 2024–07–12
def get_data(cryptocurrency, date_start, date_end=None):
    # Get data for a given stock
    # date_end = None means that the date_end is the current day
    
    if date_end is None:
        date_end = dt.datetime.now()
  
    df =yf.download(cryptocurrency, date_start, date_end)
    
    return df

# Download data of the cryptocurrency via API
df = get_data(cryptocurrency, date_start, date_end)
  • Let’s drop ‘Adj Close’ and take a look at the general structure, descriptive statistics of the dataset and missing values if any
df = df.drop(columns = ["Adj Close"])
df.tail()

           Open       High       Low        Close      Volume
Date     
2024-07-08 127.489998 130.770004 127.040001 128.199997 237677300
2024-07-09 130.350006 133.820007 128.649994 131.380005 285366600
2024-07-10 134.029999 135.100006 132.419998 134.910004 248978600
2024-07-11 135.750000 136.149994 127.050003 127.400002 374782700
2024-07-12 128.259995 131.919998 127.220001 129.240005 252103100

df.shape
(1077, 5)

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1077 entries, 2020-04-01 to 2024-07-12
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Open    1077 non-null   float64
 1   High    1077 non-null   float64
 2   Low     1077 non-null   float64
 3   Close   1077 non-null   float64
 4   Volume  1077 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 50.5 KB

df_nan_count = df.isna().sum()

print(df_nan_count)
Open      0
High      0
Low       0
Close     0
Volume    0
dtype: int64
  • Creating the column ‘Time’ as follows
df['Time'] = np.arange(len(df.index))

Date
2020-01-02       0
2020-01-03       1
2020-01-06       2
2020-01-07       3
2020-01-08       4
              ... 
2024-07-08    1134
2024-07-09    1135
2024-07-10    1136
2024-07-11    1137
2024-07-12    1138
Name: Time, Length: 1139, dtype: int32
import pandas as pd
import plotly.io as pio
import plotly.graph_objects as go
pio.renderers.default='browser'

def get_candlestick_chart(df: pd.DataFrame):

    layout = go.Layout(
        title = 'NVDA Stock Price',
        xaxis = {'title': 'Date'},
        yaxis = {'title': 'Price'},
    ) 
    
    fig = go.Figure(
        layout=layout,
        data=[
            go.Candlestick(
                x = df['Date'],
                open = df['Open'], 
                high = df['High'],
                low = df['Low'],
                close = df['Close'],
                name = 'Candlestick chart'
            ),
        ]
    )
    
    fig.update_xaxes(rangebreaks = [{'bounds': ['sat', 'mon']}])
    
    return fig

if __name__ == '__main__':
    fig = get_candlestick_chart(df)
    fig.show()
NVDA Stock Price Candlesticks

Linear Regression with Time-Step & Lag Features

  • Let’s begin with Linear Regression (LR) using the most basic time-step feature ‘Time’ v.s. (aka time dummy)
#import matplotlib.pyplot as plt
#import seaborn as sns

plt.rc(
    "figure",
    autolayout=True,
    figsize=(11, 4),
    titlesize=18,
    titleweight='bold',
)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
%config InlineBackend.figure_format = 'retina'

fig, ax = plt.subplots()
ax.plot('Time', 'Close', data=df, color='0.75')
ax = sns.regplot(x='Time', y='Close', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_title('Linear Regression of NVDA Price vs Time Dummy');
plt.grid()
Linear Regression of NVDA Price vs Time Dummy
  • Clearly, LR does not fit the Close price well as Time>1000.
  • Let’s try LR with lag features [2].
  • Firstly, we compute Lag_10 for the 10-day forecast
df['Lag_10'] = df['Close'].shift(10)
df = df.reindex(columns=['Close', 'Lag_10'])

#df.head()

fig, ax = plt.subplots()
ax = sns.regplot(x='Lag_10', y='Close', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_aspect('equal')
ax.set_title('Lag-10 X-Plot of NVDA Price');
plt.grid()
Lag-10 X-Plot of NVDA Price
  • Training the LR model with the above feature and making predictions
from sklearn.linear_model import LinearRegression

# Training data
X = df.loc[:, ['Lag_10']].fillna(method='bfill')  # features
y = df.loc[:, 'Close']  # target

# Train the model
model = LinearRegression()
model.fit(X, y)

# Store the fitted values as a time series with the same time index as
# the training data
y_pred = pd.Series(model.predict(X), index=X.index)

y.plot(label='Close')
y_pred.plot(label='LR Lag_10')
plt.legend()
plt.title('NVDA Close Price vs LR Lag_10')
plt.grid()
NVDA Close Price vs LR Lag_10
  • Quantifying the quality of LR predictions
from sklearn.metrics import r2_score
r2_score(y, y_pred)
0.9806962531842807

from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y, y_pred)
3.62088252156672

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y, y_pred)
2.2952125909109022

from sklearn.metrics import explained_variance_score
explained_variance_score(y, y_pred)
0.9806962531842807

TSFRESH Feature Engineering

  • Performing Feature Engineering with TSFRESH [1]
def get_tsfresh_features(data):
    # Get statistic features using library TSFRESH 
    # Thanks to https://www.kaggle.com/code/vbmokin/btc-growth-forecasting-with-advanced-fe-for-ohlc
    
    data = data.reset_index(drop=False).reset_index(drop=False)
    
    # Extract features
    extracted_features = extract_features(data, column_id="Date", column_sort="Date")
    
    # Drop features with NaN
    extracted_features_clean = extracted_features.dropna(axis=1, how='all').reset_index(drop=True)
    
    # Drop features with constants
    cols_std_zero  = []
    for col in extracted_features_clean.columns:
        if extracted_features_clean[col].std()==0:
            cols_std_zero.append(col)
    extracted_features_clean = extracted_features_clean.drop(columns = cols_std_zero)

    extracted_features_clean['Date'] = data['Date']   # For the merging
    
    return extracted_features_clean

extracted_features_clean = get_tsfresh_features(df[['Close']])
#extracted_features_clean

extracted_features_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1077 entries, 0 to 1076
Data columns (total 51 columns):
 #   Column                                                         Non-Null Count  Dtype         
---  ------                                                         --------------  -----         
 0   index__sum_values                                              1077 non-null   float64       
 1   index__abs_energy                                              1077 non-null   float64       
 2   index__median                                                  1077 non-null   float64       
 3   index__mean                                                    1077 non-null   float64       
 4   index__root_mean_square                                        1077 non-null   float64       
 5   index__maximum                                                 1077 non-null   float64       
 6   index__absolute_maximum                                        1077 non-null   float64       
 7   index__minimum                                                 1077 non-null   float64       
 8   index__benford_correlation                                     1076 non-null   float64       
 9   index__quantile__q_0.1                                         1077 non-null   float64       
 10  index__quantile__q_0.2                                         1077 non-null   float64       
 11  index__quantile__q_0.3                                         1077 non-null   float64       
 12  index__quantile__q_0.4                                         1077 non-null   float64       
 13  index__quantile__q_0.6                                         1077 non-null   float64       
 14  index__quantile__q_0.7                                         1077 non-null   float64       
 15  index__quantile__q_0.8                                         1077 non-null   float64       
 16  index__quantile__q_0.9                                         1077 non-null   float64       
 17  index__cwt_coefficients__coeff_0__w_2__widths_(2, 5, 10, 20)   1077 non-null   float64       
 18  index__cwt_coefficients__coeff_0__w_5__widths_(2, 5, 10, 20)   1077 non-null   float64       
 19  index__cwt_coefficients__coeff_0__w_10__widths_(2, 5, 10, 20)  1077 non-null   float64       
 20  index__cwt_coefficients__coeff_0__w_20__widths_(2, 5, 10, 20)  1077 non-null   float64       
 21  index__fft_coefficient__attr_"real"__coeff_0                   1077 non-null   float64       
 22  index__fft_coefficient__attr_"abs"__coeff_0                    1077 non-null   float64       
 23  index__value_count__value_0                                    1077 non-null   float64       
 24  index__value_count__value_1                                    1077 non-null   float64       
 25  index__range_count__max_1__min_-1                              1077 non-null   float64       
 26  index__count_below__t_0                                        1077 non-null   float64       
 27  Close__sum_values                                              1077 non-null   float64       
 28  Close__abs_energy                                              1077 non-null   float64       
 29  Close__median                                                  1077 non-null   float64       
 30  Close__mean                                                    1077 non-null   float64       
 31  Close__root_mean_square                                        1077 non-null   float64       
 32  Close__maximum                                                 1077 non-null   float64       
 33  Close__absolute_maximum                                        1077 non-null   float64       
 34  Close__minimum                                                 1077 non-null   float64       
 35  Close__benford_correlation                                     1077 non-null   float64       
 36  Close__quantile__q_0.1                                         1077 non-null   float64       
 37  Close__quantile__q_0.2                                         1077 non-null   float64       
 38  Close__quantile__q_0.3                                         1077 non-null   float64       
 39  Close__quantile__q_0.4                                         1077 non-null   float64       
 40  Close__quantile__q_0.6                                         1077 non-null   float64       
 41  Close__quantile__q_0.7                                         1077 non-null   float64       
 42  Close__quantile__q_0.8                                         1077 non-null   float64       
 43  Close__quantile__q_0.9                                         1077 non-null   float64       
 44  Close__cwt_coefficients__coeff_0__w_2__widths_(2, 5, 10, 20)   1077 non-null   float64       
 45  Close__cwt_coefficients__coeff_0__w_5__widths_(2, 5, 10, 20)   1077 non-null   float64       
 46  Close__cwt_coefficients__coeff_0__w_10__widths_(2, 5, 10, 20)  1077 non-null   float64       
 47  Close__cwt_coefficients__coeff_0__w_20__widths_(2, 5, 10, 20)  1077 non-null   float64       
 48  Close__fft_coefficient__attr_"real"__coeff_0                   1077 non-null   float64       
 49  Close__fft_coefficient__attr_"abs"__coeff_0                    1077 non-null   float64       
 50  Date                                                           1077 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(50)
memory usage: 429.2 KB
  • Merging all features
df = pd.merge(df, extracted_features_clean, how='left', on='Date')

df.shape
(1077, 56)
  • Adding more features
def get_add_features(df_feat):
    # FE for data as row of DataFrame
    # Thanks to https://www.kaggle.com/code/vbmokin/g-research-crypto-forecasting-baseline-fe
    
    # Two new features from the competition tutorial
    df_feat['Upper_Shadow'] = df_feat['High'] - np.maximum(df_feat['Close'], df_feat['Open'])
    df_feat['Lower_Shadow'] = np.minimum(df_feat['Close'], df_feat['Open']) - df_feat['Low']
    
    # Thanks to https://www.kaggle.com/code1110/gresearch-simple-lgb-starter
    df_feat['lower_shadow'] = np.minimum(df_feat['Close'], df_feat['Open']) - df_feat['Low']
    df_feat['high2low'] = (df_feat['High'] / df_feat['Low']).replace([np.inf, -np.inf, np.nan], 0.)
    
    return df_feat

df = get_add_features(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1077 entries, 0 to 1076
Data columns (total 60 columns):
 #   Column                                                         Non-Null Count  Dtype         
---  ------                                                         --------------  -----         
 0   Date                                                           1077 non-null   datetime64[ns]
 1   Open                                                           1077 non-null   float64       
 2   High                                                           1077 non-null   float64       
 3   Low                                                            1077 non-null   float64       
 4   Close                                                          1077 non-null   float64       
 5   Volume                                                         1077 non-null   int64         
 6   index__sum_values                                              1077 non-null   float64       
 7   index__abs_energy                                              1077 non-null   float64       
 8   index__median                                                  1077 non-null   float64       
 9   index__mean                                                    1077 non-null   float64       
 10  index__root_mean_square                                        1077 non-null   float64       
 11  index__maximum                                                 1077 non-null   float64       
 12  index__absolute_maximum                                        1077 non-null   float64       
 13  index__minimum                                                 1077 non-null   float64       
 14  index__benford_correlation                                     1076 non-null   float64       
 15  index__quantile__q_0.1                                         1077 non-null   float64       
 16  index__quantile__q_0.2                                         1077 non-null   float64       
 17  index__quantile__q_0.3                                         1077 non-null   float64       
 18  index__quantile__q_0.4                                         1077 non-null   float64       
 19  index__quantile__q_0.6                                         1077 non-null   float64       
 20  index__quantile__q_0.7                                         1077 non-null   float64       
 21  index__quantile__q_0.8                                         1077 non-null   float64       
 22  index__quantile__q_0.9                                         1077 non-null   float64       
 23  index__cwt_coefficients__coeff_0__w_2__widths_(2, 5, 10, 20)   1077 non-null   float64       
 24  index__cwt_coefficients__coeff_0__w_5__widths_(2, 5, 10, 20)   1077 non-null   float64       
 25  index__cwt_coefficients__coeff_0__w_10__widths_(2, 5, 10, 20)  1077 non-null   float64       
 26  index__cwt_coefficients__coeff_0__w_20__widths_(2, 5, 10, 20)  1077 non-null   float64       
 27  index__fft_coefficient__attr_"real"__coeff_0                   1077 non-null   float64       
 28  index__fft_coefficient__attr_"abs"__coeff_0                    1077 non-null   float64       
 29  index__value_count__value_0                                    1077 non-null   float64       
 30  index__value_count__value_1                                    1077 non-null   float64       
 31  index__range_count__max_1__min_-1                              1077 non-null   float64       
 32  index__count_below__t_0                                        1077 non-null   float64       
 33  Close__sum_values                                              1077 non-null   float64       
 34  Close__abs_energy                                              1077 non-null   float64       
 35  Close__median                                                  1077 non-null   float64       
 36  Close__mean                                                    1077 non-null   float64       
 37  Close__root_mean_square                                        1077 non-null   float64       
 38  Close__maximum                                                 1077 non-null   float64       
 39  Close__absolute_maximum                                        1077 non-null   float64       
 40  Close__minimum                                                 1077 non-null   float64       
 41  Close__benford_correlation                                     1077 non-null   float64       
 42  Close__quantile__q_0.1                                         1077 non-null   float64       
 43  Close__quantile__q_0.2                                         1077 non-null   float64       
 44  Close__quantile__q_0.3                                         1077 non-null   float64       
 45  Close__quantile__q_0.4                                         1077 non-null   float64       
 46  Close__quantile__q_0.6                                         1077 non-null   float64       
 47  Close__quantile__q_0.7                                         1077 non-null   float64       
 48  Close__quantile__q_0.8                                         1077 non-null   float64       
 49  Close__quantile__q_0.9                                         1077 non-null   float64       
 50  Close__cwt_coefficients__coeff_0__w_2__widths_(2, 5, 10, 20)   1077 non-null   float64       
 51  Close__cwt_coefficients__coeff_0__w_5__widths_(2, 5, 10, 20)   1077 non-null   float64       
 52  Close__cwt_coefficients__coeff_0__w_10__widths_(2, 5, 10, 20)  1077 non-null   float64       
 53  Close__cwt_coefficients__coeff_0__w_20__widths_(2, 5, 10, 20)  1077 non-null   float64       
 54  Close__fft_coefficient__attr_"real"__coeff_0                   1077 non-null   float64       
 55  Close__fft_coefficient__attr_"abs"__coeff_0                    1077 non-null   float64       
 56  Upper_Shadow                                                   1077 non-null   float64       
 57  Lower_Shadow                                                   1077 non-null   float64       
 58  lower_shadow                                                   1077 non-null   float64       
 59  high2low                                                       1077 non-null   float64       
dtypes: datetime64[ns](1), float64(58), int64(1)
memory usage: 505.0 KB

FB Prophet Prediction

  • Preparing the DF for FB Prophet forecasting [1]
# Create a new dataframe with anomalous dates for Facebook Prophet
if is_anomalies:
    anomalous_dates = ['2020-09-03', '2021-02-12', '2021-04-13', '2021-12-02',
                       '2022-04-01', '2023-08-29', '2024-03-26', '2024-06-18']
    holidays_df = pd.DataFrame(columns = ['ds', 'lower_window', 'upper_window', 'prior_scale'])
    holidays_df['ds'] = anomalous_dates
    holidays_df['holiday'] = 'anomalous_dates'
    holidays_df['lower_window'] = 0
    holidays_df['upper_window'] = 0
    holidays_df['prior_scale'] = 10
    display(holidays_df)

ds lower_window upper_window prior_scale holiday
0 2020-09-03 0 0 10 anomalous_dates
1 2021-02-12 0 0 10 anomalous_dates
2 2021-04-13 0 0 10 anomalous_dates
3 2021-12-02 0 0 10 anomalous_dates
4 2022-04-01 0 0 10 anomalous_dates
5 2023-08-29 0 0 10 anomalous_dates
6 2024-03-26 0 0 10 anomalous_dates
7 2024-06-18 0 0 10 anomalous_dates

# Add a few more anomalous dates
if is_anomalies:
    anomalous_dates_diff = anomalous_dates.copy()
    anomalous_dates_diff.append('2022-08-08')
    anomalous_dates_diff.append('2022-12-01')
    anomalous_dates_diff.append('2023-11-21')
    print(anomalous_dates_diff)

['2020-09-03', '2021-02-12', '2021-04-13', '2021-12-02', '2022-04-01', '2023-08-29', '2024-03-26', '2024-06-18', '2022-08-08', '2022-12-01', '2023-11-21']

# Create a new feature in df for anomalous_dates_diff
if is_anomalies:
    df['Close_diff_anomalous'] = df['Date'].isin(anomalous_dates_diff).astype('int')    
    display(df)
    
    # Number of anomalous dates
    print(f"Number of anomalous dates - {df['Close_diff_anomalous'].sum()}")

Number of anomalous dates - 11

# Illustration of number transformations in the columns:
# "Close" -> "Close_diff" -> "Target" -> "Close_diff_pred" -> "Close_pred"
# Get target and the result of the forecasting
forecasting_days_example = 3
df_example = pd.DataFrame({'Close':[1, 2, 4, 8, 15, 25], 'Day': [0, 1, 2, 3, 4, 5]})
df_example['Close_diff'] = df_example['Close'].diff()
df_example['target'] = df_example['Close_diff'].shift(-forecasting_days_example)
df_example['target_pred'] = df_example['target'].copy()   # Ideal forecasting result
print(f'Simulation of the result of ideal forecasting the "target_pred" for {forecasting_days_example} days')
display(df_example[['Day', 'Close', 'Close_diff', 'target', 'target_pred']])

# Get inverse target
print('\nSimulation of the recovering predicted values "Close_pred" from the "target_pred"')
df_example['Close_diff_pred_shifted'] = df_example['target_pred'].shift(forecasting_days_example)

# Let's create an intermediate feature to make it easier to explain the transformation
temp_column_name = f'Close_diff_pred_shifted_with_Close'  # Intermediate feature for transformations 
df_example[temp_column_name] = df_example['Close_diff_pred_shifted'].copy()
df_example.loc[forecasting_days_example, temp_column_name] = df_example.loc[forecasting_days_example,'Close']
df_example['Close_pred'] = np.concatenate((df_example['Close'].tolist()[:forecasting_days_example], 
                                           np.cumsum(df_example[temp_column_name].values[forecasting_days_example:], dtype=float)))
df_example['Close_pred'] = df_example['Close_pred'].astype('int')
display(df_example[['Day', 'Close', 'Close_diff', 'target', 'target_pred', 'Close_diff_pred_shifted', temp_column_name, 'Close_pred']])

Simulation of the result of ideal forecasting the "target_pred" for 3 days
Day Close Close_diff target target_pred
0 0 1 NaN 4.0 4.0
1 1 2 1.0 7.0 7.0
2 2 4 2.0 10.0 10.0
3 3 8 4.0 NaN NaN
4 4 15 7.0 NaN NaN
5 5 25 10.0 NaN NaN

Simulation of the recovering predicted values "Close_pred" from the "target_pred"
Day Close Close_diff target target_pred Close_diff_pred_shifted Close_diff_pred_shifted_with_Close Close_pred
0 0 1 NaN 4.0 4.0 NaN NaN 1
1 1 2 1.0 7.0 7.0 NaN NaN 2
2 2 4 2.0 10.0 10.0 NaN NaN 4
3 3 8 4.0 NaN NaN 4.0 8.0 8
4 4 15 7.0 NaN NaN 7.0 7.0 15
5 5 25 10.0 NaN NaN 10.0 10.0 25

def cut_data(df, y, num_start, num_end):
    # Cutting dataframe df and array or list for [num_start, num_end-1]        
    df2 = df[num_start:(num_end+1)]
    y2 = y[num_start:(num_end+1)] if y is not None else None
    return df2, y2

def get_train_valid_test_ts(df, forecasting_days, target='Close'):
    # Get training, validation and test datasets with target for Time Series models
    
    # Data prepairing
    df = df.dropna(how="any").reset_index(drop=True)
    df = df[['Date', 'Close']]
    df.columns = ['ds', 'y']        
    y = None
   
    N = len(df)
    train, _ = cut_data(df, y, 0, N-2*forecasting_days-1)
    valid, _ = cut_data(df, y, N-2*forecasting_days, N-forecasting_days-1)
    test, _ = cut_data(df, y, N-forecasting_days, N)
    
    # Train+valid - for optimal model training
    train_valid = pd.concat([train, valid])

    print(f'Origin dataset has {len(df)} rows and {len(df.columns)} features')
    print(f'Get training dataset with {len(train)} rows')
    print(f'Get validation dataset with {len(valid)} rows')
    print(f'Get test dataset with {len(test)} rows')
    
    return train, valid, test, train_valid
  • FB Prophet model training, tuning and forecasting with multiple trends [1], seasonality, the above anomalous dates and in-built US holidays
def get_train_valid_test_mf(df, forecasting_days, target='target'):
    # Get training, validation and test datasets with target for multi-features ML models
    
    df = df.drop(columns = ['Date']).dropna(how="any").reset_index(drop=True)
    
    # Save and drop target        
    y = df.pop(target)

    # Get starting points for the recovering "Close" from "Close_diff_shigted"
    N = len(df)
    #print(f"Total - {N}, Valid start index = {N-forecasting_days-1}, Test start index = {N-1}")
    start_points = {'valid_start_point' : df.loc[N-forecasting_days-1, 'Close'],
                    'test_start_point' : df.loc[N-1, 'Close']}

    # Standartization data
    scaler = StandardScaler()
    df = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)
    
    
    train, ytrain = cut_data(df.copy(), y, 0, N-2*forecasting_days-1)
    valid, yvalid = cut_data(df.copy(), y, N-2*forecasting_days, N-forecasting_days-1)
    test, ytest = cut_data(df.copy(), y, N-forecasting_days, N)


    # Train+valid - for optimal model training
    train_valid = pd.concat([train, valid])
    y_train_valid = pd.concat([ytrain, yvalid])

    print(f'Origin dataset has {len(df)} rows and {len(df.columns)} features')
    print(f'Get training dataset with {len(train)} rows')
    print(f'Get validation dataset with {len(valid)} rows')
    print(f'Get test dataset with {len(test)} rows')
    
    return train, ytrain, valid, yvalid, test, ytest, train_valid, y_train_valid, start_points

#Model training and forecasting
def calc_metrics(type_score, list_true, list_pred):
    # Calculation score with type=type_score for list_true and list_pred 
    if type_score=='r2_score':
        score = r2_score(list_true, list_pred)
    elif type_score=='rmse':
        score = mean_squared_error(list_true, list_pred, squared=False)
    elif type_score=='mape':
        score = mean_absolute_percentage_error(list_true, list_pred)
    return score

def result_add_metrics(result, n, y_true, y_pred):
    # Calculate and add metrics to dataframe result[n,:]
    
    result.loc[n,'r2_score'] = calc_metrics('r2_score', y_true, y_pred)
    result.loc[n,'rmse'] = calc_metrics('rmse', y_true, y_pred)      # in coins
    result.loc[n,'mape'] = 100*calc_metrics('mape', y_true, y_pred)  # in %
    
    return result

# Results of all models
result = pd.DataFrame(columns = ['name_model', 'type_data', 'r2_score', 'rmse', 'mape', 'params', 'ypred'])

if is_Prophet:
    df2=df.copy()
    display(df2)

# Get datasets
if is_Prophet:
    train_ts, valid_ts, test_ts, train_valid_ts = get_train_valid_test_ts(df2.copy(), forecasting_days, target='Close')
    
    if not is_anomalies:
        holidays_df = None

from prophet.plot import add_changepoints_to_plot

def prophet_modeling(result, 
                     cryptocurrency, 
                     train, 
                     test, 
                     holidays_df, 
                     period_days,
                     fourier_order_seasonality,
                     forecasting_period,
                     name_model,
                     type_data):
    # Performs FB Prophet model training for given train dataset, holidays_df and seasonality_mode
    # Performs forecasting with period by this model, visualization and error estimation
    # df - dataframe with real data in the forecasting_period
    # can be such combinations of parameters: train=train, test=valid or train=train_valid, test=test
    # Save results into dataframe result
    
    # Build Prophet model with parameters and structure 
    model = Prophet(daily_seasonality=False, 
                    weekly_seasonality=True, 
                    yearly_seasonality=True, 
                    changepoint_range=1, 
                    changepoint_prior_scale = 0.5, 
                    holidays=holidays_df, 
                    seasonality_mode = 'additive'
                   )
    model.add_seasonality(name='seasonality', period=period_days, 
                          fourier_order=fourier_order_seasonality, 
                          mode = 'additive', prior_scale = 0.5)
    model.add_seasonality(name='monthly', period=30.5, fourier_order=5)
    model.add_country_holidays(country_name='US')
    # Training model for df
    model.fit(train)
    
    # Make a forecast
    future = model.make_future_dataframe(periods = forecasting_period)
    forecast = model.predict(future)
    
    # Draw plot of the values with forecasting data
    figure = model.plot(forecast, xlabel = 'Date', ylabel = f"{name_model} for {cryptocurrency}")
    a = add_changepoints_to_plot(figure.gca(), model, forecast)
    
    # Draw plot with the components (trend and seasonalities) of the forecasts
    figure_component = model.plot_components(forecast)
    
    # Ouput the prediction for the next time on forecasted_days
    #forecast[['yhat_lower', 'yhat', 'yhat_upper']] = forecast[['yhat_lower', 'yhat', 'yhat_upper']].round(1)
    #forecast[['ds', 'yhat_lower', 'yhat', 'yhat_upper']].tail(forecasting_period)
    
    # Forecasting data by the model
    ypred = forecast['yhat'][-forecasting_period:]
    #print(ypred)
    # Save results
    n = len(result)
    result.loc[n,'name_model'] = f"Prophet_{name_model}"
    result.loc[n,'type_data'] = type_data
    result.at[n,'params'] = [period_days]+[fourier_order_seasonality]
    result.at[n,'ypred'] = ypred
    #result = result_add_metrics(result, n, test['y'], y_pred)
    future1 = model.make_future_dataframe(periods=365)
    forecast1 = model.predict(future1)
    fig1 = model.plot(forecast1)

    return result, ypred


# Models tuning
if is_Prophet:
    for period_days in [4, 5, 7, 14]:
        for fourier_order_seasonality in [3, 12]:
            result, _ = prophet_modeling(result, 
                                         cryptocurrency, 
                                         train_ts, 
                                         valid_ts, 
                                         holidays_df, 
                                         period_days,
                                         fourier_order_seasonality,
                                         forecasting_days,
                                         f'{period_days}_days_{fourier_order_seasonality}_order',
                                         'valid')
  • Plotting the complete output of FB Prophet forecasting HPO
  • period_days=4 & fourier_order_seasonality=3
In-Sample FB Forecast period_days=4 & fourier_order_seasonality=3
FB Forecast trend, holidays & weekly/monthly/yearly seasonality period_days=4 & fourier_order_seasonality=3
FB Forecast seasonality period_days=4 & fourier_order_seasonality=3
Out-of-Sample FB Forecast period_days=4 & fourier_order_seasonality=3
  • period_days=4 & fourier_order_seasonality=12
In-Sample FB Forecast period_days=4 & fourier_order_seasonality=12
FB Forecast trend, holidays & weekly/monthly/yearly seasonality period_days=4 & fourier_order_seasonality=12
FB Forecast seasonality period_days=4 & fourier_order_seasonality=12
Out-of-Sample FB Forecast period_days=4 & fourier_order_seasonality=12
  • period_days=5 & fourier_order_seasonality=3
In-Sample FB Forecast period_days=5 & fourier_order_seasonality=3
FB Forecast trend, holidays & weekly/monthly/yearly seasonality period_days=5 & fourier_order_seasonality=3
FB Forecast seasonality period_days=5 & fourier_order_seasonality=3
Out-of-Sample FB Forecast period_days=5 & fourier_order_seasonality=3
  • period_days=5 & fourier_order_seasonality=12
In-Sample FB Forecast period_days=5 & fourier_order_seasonality=12
FB Forecast trend, holidays & weekly/monthly/yearly seasonality period_days=5 & fourier_order_seasonality=12
FB Forecast seasonality period_days=5 & fourier_order_seasonality=12
Out-of-Sample FB Forecast period_days=5 & fourier_order_seasonality=12
  • period_days=7 & fourier_order_seasonality=3
In-Sample FB Forecast period_days=7 & fourier_order_seasonality=3
FB Forecast trend, holidays & weekly/monthly/yearly seasonality period_days=7 & fourier_order_seasonality=3
FB Forecast seasonality period_days=7 & fourier_order_seasonality=3
Out-of-Sample FB Forecast period_days=7 & fourier_order_seasonality=3
  • period_days=7 & fourier_order_seasonality=12
In-Sample FB Forecast period_days=7 & fourier_order_seasonality=12
FB Forecast trend, holidays & weekly/monthly/yearly seasonality period_days=7 & fourier_order_seasonality=12
FB Forecast seasonality period_days=7 & fourier_order_seasonality=12
Out-of-Sample FB Forecast period_days=7 & fourier_order_seasonality=12
  • period_days=14 & fourier_order_seasonality=3
In-Sample FB Forecast period_days=14 & fourier_order_seasonality=3
FB Forecast trend, holidays & weekly/monthly/yearly seasonality period_days=14 & fourier_order_seasonality=3
FB Forecast seasonality period_days=14 & fourier_order_seasonality=3
Out-of-Sample FB Forecast period_days=14 & fourier_order_seasonality=3
  • period_days=14 & fourier_order_seasonality=12
In-Sample FB Forecast period_days=14 & fourier_order_seasonality=12
FB Forecast trend, holidays & weekly/monthly/yearly seasonality period_days=14 & fourier_order_seasonality=12
FB Forecast seasonality period_days=14 & fourier_order_seasonality=12
Out-of-Sample FB Forecast period_days=14 & fourier_order_seasonality=12
  • Performing cross-validation QC
res1=result['ypred']
print(res1[1])

1056    116.936342
1057    116.850275
1058    116.233301
1059    116.709060
1060    117.953234
1061    119.812505
1062    118.262082
1063    118.641367
1064    118.779563
1065    118.254753
Name: yhat, dtype: float64

print(valid_ts)

       ds           y
1056 2024-06-13  129.610001
1057 2024-06-14  131.880005
1058 2024-06-17  130.979996
1059 2024-06-18  135.580002
1060 2024-06-20  130.779999
1061 2024-06-21  126.570000
1062 2024-06-24  118.110001
1063 2024-06-25  126.089996
1064 2024-06-26  126.400002
1065 2024-06-27  123.989998

n=5
valid_ts['y'].plot(label='Validation')
res1[n].plot(label='Forecast')
plt.legend()
plt.xlabel('Time Sample')
plt.ylabel('Price')
plt.grid()

y_true=valid_ts['y']
y_pred=res1[5]
resr2 = calc_metrics('r2_score', y_true, y_pred)
resrmse = calc_metrics('rmse', y_true, y_pred)      # in coins
resmape = 100*calc_metrics('mape', y_true, y_pred)  # in %
print(resr2)
print(resrmse)
print(resmape)

-4.654220698172283
10.99633729407076
7.488504106783908
FB Prophet forecast vs validation data

Auto-ARIMA Model Fitting

  • This section is a demonstration on how to leverage Auto ARIMA functionality [1,5] using the pmdarima package to forecast the future NVDA stock price. The auto-ARIMA process seeks to identify the most optimal parameters for an ARIMA model, settling on a single fitted ARIMA model.
  • Fetching and splitting the dataset
#ARIMA
# Get datasets
if is_ARIMA:
    train_ts, valid_ts, test_ts, train_valid_ts = get_train_valid_test_ts(df2.copy(), forecasting_days, target='Close')

Origin dataset has 1076 rows and 2 features
Get training dataset with 1056 rows
Get validation dataset with 10 rows
Get test dataset with 10 rows
  • Calculating and plotting ACF/PACF for the original time series and first-, second-order time differences
if is_ARIMA:
    # ACF and PACF
    lag_num = 100
    acf_pacf_draw(train_ts['y'], lag_num, True, True, 'Original Series')
    acf_pacf_draw(train_ts['y'].diff().dropna(), lag_num, True, True, '1st Order Differencing')
    acf_pacf_draw(train_ts['y'].diff().diff().dropna(), lag_num, True, True, '2nd Order Differencing')
ACF/PACF for the original time series and first-, second-order time differences.
  • Implementing ARIMA model fitting and Auto-ARIMA HPO
def arima_fit(df, col, order=(1,1,1)):
    # ARIMA model fitting for series df[col]
    
    model = sm.tsa.arima.ARIMA(df[col].values.squeeze(), order=order)
    model = model.fit()
    return model

def get_residual_errors(model):
    # Calculation and drawing the plot residual errors for ARIMA model
    residuals = pd.DataFrame(model.resid)
    fig, ax = plt.subplots(1,2, figsize=(12,6))
    residuals.plot(title="Residuals", ax=ax[0])
    residuals.plot(kind='kde', title='Density', ax=ax[1])
    plt.show()

def arima_forecasting(result, model, params, name_model, df, type_data):
    # Data df (validation or test) forecasting on the num days by the model 
    # with params and save metrics to result 
    
    ypred = model.forecast(steps=len(df))
    
    n = len(result)
    result.loc[n,'name_model'] = name_model
    result.loc[n,'type_data'] = type_data
    result.at[n,'params'] = params
    result.at[n,'ypred'] = ypred
    #result = result_add_metrics(result, n, df['y'], y_pred)
    
    return result

if is_ARIMA:
    # Automatic tuning of the ARIMA model 
    model_auto = pm.auto_arima(train_ts['y'].values, 
                               start_p=4,        # start p
                               start_q=4,        # start q
                               test='adf',       # use adftest to find optimal 'd'
                               max_p=5, max_q=5, # maximum p and q
                               m=1,              # frequency of series (1 - No Seasonality)
                               d=None,           # let model determine 'd'
                               seasonal=False,   # No Seasonality
                               start_P=0,        
                               D=0, 
                               start_Q=0,
                               trace=True,
                               error_action='ignore',  
                               suppress_warnings=False, 
                               stepwise=True     # use the stepwise algorithm outlined in Hyndman and Khandakar (2008) 
                                                 # to identify the optimal model parameters. 
                                                 # The stepwise algorithm can be significantly faster than fitting all 
                                                 # hyper-parameter combinations and is less likely to over-fit the model
                              )

    print(model_auto.summary())

Performing stepwise search to minimize aic
 ARIMA(4,1,4)(0,0,0)[0] intercept   : AIC=inf, Time=1.19 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=3278.225, Time=0.01 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=3280.157, Time=0.04 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=3280.146, Time=0.04 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=3286.435, Time=0.01 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=3274.481, Time=0.25 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=3276.898, Time=0.17 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=3276.272, Time=0.30 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=3276.936, Time=0.06 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=3277.302, Time=0.05 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=3277.648, Time=0.34 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=3282.470, Time=0.14 sec

Best model:  ARIMA(1,1,1)(0,0,0)[0] intercept
Total fit time: 2.629 seconds
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                 1056
Model:               SARIMAX(1, 1, 1)   Log Likelihood               -1633.240
Date:                Wed, 17 Jul 2024   AIC                           3274.481
Time:                        14:06:03   BIC                           3294.326
Sample:                             0   HQIC                          3282.003
                               - 1056                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.2192      0.076      2.876      0.004       0.070       0.369
ar.L1         -0.9472      0.020    -46.224      0.000      -0.987      -0.907
ma.L1          0.9748      0.016     59.838      0.000       0.943       1.007
sigma2         1.2946      0.020     64.815      0.000       1.255       1.334
===================================================================================
Ljung-Box (L1) (Q):                   0.16   Jarque-Bera (JB):             16930.67
Prob(Q):                              0.69   Prob(JB):                         0.00
Heteroskedasticity (H):              23.15   Skew:                             1.79
Prob(H) (two-sided):                  0.00   Kurtosis:                        22.30
===================================================================================
  • Getting orders of the best ARIMA model
if is_ARIMA:
    # Get orders of the best model from AutoARIMA
    arima_orders_best = list(model_auto.get_params().get('order'))
    print(f"Optimal parameters are {arima_orders_best}")
    model_auto = arima_fit(train_ts, 'y', order=(arima_orders_best[0],arima_orders_best[1],arima_orders_best[2]))

Optimal parameters are [1, 1, 1]
  • Plotting the Auto-ARIMA diagnostics summary
if is_ARIMA:
    # Best model from AutoARIMA
    fig = model_auto.plot_diagnostics(figsize=(12,10))
    plt.show()
Auto-ARIMA diagnostics summary: standardized residuals for “y”, KDE & histogram, normal Q-Q plot, and correlogram.
  • Printing the final validation result
if is_ARIMA:
    # Valid forecasting and save result
    result = arima_forecasting(result, model_auto, arima_orders_best, 'ARIMA_auto', valid_ts, 'valid')

y_true=valid_ts['y']
y_pred=res1[0]
resr2 = calc_metrics('r2_score', y_true, y_pred)
resrmse = calc_metrics('rmse', y_true, y_pred)      # in coins
resmape = 100*calc_metrics('mape', y_true, y_pred)  # in %
print(resr2)
print(resrmse)
print(resmape)

-0.3756885596041244
5.424025684842693
3.472035737786662

SciKit-Learn Supervised ML Regression & HPO

  • Let’s train/test the popular supervised ML regression models and optimize their hyperparameters available in SciKit-Learn [1,4].
  • Hyperparameters are parameters that are not directly learnt within estimators. In Scikit-Learn they are passed as arguments to the constructor of the estimator classes [4].
  • Fetching and splitting the input dataset
if is_other_ML:
    df2 = get_target_mf(df2, forecasting_days, col='Close')
    train_mf, ytrain_mf, valid_mf, yvalid_mf, test_mf, ytest_mf, train_valid_mf, y_train_valid_mf, starting_point = \
                                    get_train_valid_test_mf(df2.copy(), forecasting_days, target='target')

Origin dataset has 242 rows and 60 features
Get training dataset with 222 rows
Get validation dataset with 10 rows
Get test dataset with 10 rows
  • Defining the 8 supervised ML models with search hyperparameters
if is_other_ML:
    # Set parameters of models
    models = pd.DataFrame(columns = ['name', 'model', 'param_grid'])

    # Linear Regression
    n = len(models)
    models.loc[n, 'name'] = 'Linear Regression'
    models.at[n, 'model'] = LinearRegression()
    models.at[n, 'param_grid'] = {'fit_intercept' : [True, False]}


    # KNeighbors Regressor
    n = len(models)
    models.loc[n, 'name'] = 'KNeighbors Regressor'
    models.at[n, 'model'] = KNeighborsRegressor()
    models.at[n, 'param_grid'] = {'n_neighbors': [3, 5, 10, 20, 30],
                                  'leaf_size': [10, 20, 30]
                                 }

    # Support Vector Machines
    n = len(models)
    models.loc[n, 'name'] = 'Support Vector Machines'
    models.at[n, 'model'] = SVR()
    models.at[n, 'param_grid'] = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
                                  'C': np.linspace(1, 15, 15),
                                  'tol': [1e-3, 1e-4]
                                 }

    # Linear SVC
    n = len(models)
    models.loc[n, 'name'] = 'Linear SVR'
    models.at[n, 'model'] = LinearSVR()
    models.at[n, 'param_grid'] = {'C': np.linspace(1, 15, 15)}


    # Random Forest Classifier
    n = len(models)
    models.loc[n, 'name'] = 'Random Forest Regressor'
    models.at[n, 'model'] = RandomForestRegressor()
    models.at[n, 'param_grid'] = {'n_estimators': [40, 50, 60, 80], 
                                  'min_samples_split': [30, 40, 50, 60], 
                                  'min_samples_leaf': [10, 12, 15, 20, 50],
                                  'max_features': [100], 
                                  'max_depth': [3, 4, 5, 6]                   
                                 }

    # Bagging Classifier
    n = len(models)
    models.loc[n, 'name'] = 'Bagging Regressor'
    models.at[n, 'model'] = BaggingRegressor()
    models.at[n, 'param_grid'] = {'max_features': np.linspace(0.05, 0.8, 1),
                                  'n_estimators': [3, 4, 5, 6],
                                  'warm_start' : [False]
                                 }

    # XGB Classifier
    n = len(models)
    models.loc[n, 'name'] = 'XGB Regressor'
    models.at[n, 'model'] = xgb.XGBRegressor()
    models.at[n, 'param_grid'] = {'n_estimators': [50, 70, 90], 
                                  'learning_rate': [0.01, 0.05, 0.1, 0.2],
                                  'max_depth': [3, 4, 5]
                                 }

    # MLP Classifier
    n = len(models)
    models.loc[n, 'name'] = 'MLP Regressor'
    models.at[n, 'model'] = MLPRegressor()
    models.at[n, 'param_grid'] = {'hidden_layer_sizes': [i for i in range(2,5)],
                                  'solver': ['lbfgs', 'sgd'],
                                  'learning_rate': ['adaptive'],
                                  'learning_rate_init': [0.001, 0.01],
                                  'max_iter': [1000]
                                 }
models
8 supervised ML models with search hyperparameters
  • Tuning the above ML models
def model_prediction(result, models, train_features, valid_features, train_labels, valid_labels):    
    # Models training and data prediction for all models from DataFrame models
    # Saving results for validation dataset into dataframe result
    
    def calc_add_score(res, n, type_score, list_true, list_pred, feature_end):
        # Calculation score with type=type_score for list_true and list_pred 
        # Adding score into res.loc[n,...]
        res.loc[i, type_score + feature_end] = calc_metrics(type_score, list_true, list_pred)
        return res
    
    # Results
    model_all = []

    for i in range(len(models)):
        # Training
        print(f"Tuning model '{models.loc[i, 'name']}'")
        model = GridSearchCV(models.at[i, 'model'], models.at[i, 'param_grid'])
        model.fit(train_features, train_labels)
        model_all.append(model)
        print(f"Best parameters: {model.best_params_}\n")
        
        # Prediction
        ypred = model.predict(valid_features)
        
        # Scoring and saving results into the main dataframe result
        n = len(result)
        result.loc[n,'name_model'] = f"{models.loc[i, 'name']}"
        result.loc[n,'type_data'] = "valid"
        result.at[n,'params'] = model.best_params_
        result.at[n,'ypred'] = ypred
        #result = result_add_metrics(result, n, valid_labels, valid_pred)
        
    return result, model_all

if is_other_ML:
    # Models tuning and the forecasting
    result, model_all = model_prediction(result, models, train_mf, valid_mf, ytrain_mf, yvalid_mf)

Tuning model 'Linear Regression'
Best parameters: {'fit_intercept': False}

Tuning model 'KNeighbors Regressor'
Best parameters: {'leaf_size': 10, 'n_neighbors': 30}

Tuning model 'Support Vector Machines'
Best parameters: {'C': 1.0, 'kernel': 'rbf', 'tol': 0.0001}

Tuning model 'Linear SVR'
Best parameters: {'C': 3.0}

Tuning model 'Random Forest Regressor'
Best parameters: {'max_depth': 6, 'max_features': 100, 'min_samples_leaf': 15, 'min_samples_split': 40, 'n_estimators': 40}

Tuning model 'Bagging Regressor'
Best parameters: {'max_features': 0.05, 'n_estimators': 6, 'warm_start': False}

Tuning model 'XGB Regressor'
Best parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 70}

Tuning model 'MLP Regressor'
Best parameters: {'hidden_layer_sizes': 4, 'learning_rate': 'adaptive', 'learning_rate_init': 0.01, 'max_iter': 1000, 'solver': 's
  • Validating the optimized ML models and computing their r2_score, rmse, and mape [1]
def recovery_prediction(y, starting_point):
    # Recovering prediction of multi-factors model for shifted col_diff to col in the dataframe df
    # y has type np.array
    # starting_point is dictionary with start values for the recovering data
    # Returns y (np.array) with recovering data
    
    return np.insert(y, 0, starting_point).cumsum()[1:]

def result_recover_and_metrics(result, df_ts, type_data, start_points):
    # Recovering prediction: from shifted_Close_diff to Close
    # Calculation metrics for recovering ypred forecasting for all models in result
    # ypred real is from df_ts['y']
    # start points value for the recovering is from dictionary start_points
    # type_data = 'valid' or 'test'

    for i in range(len(result)):
        if (result.loc[i, 'type_data']==type_data) and (result.loc[i, 'mape'] is np.nan):
            ypred = result.loc[i, 'ypred']

            # Recovering ypred for multi-factors models
            if not (str(result.loc[i, 'type_model']) in ['Prophet', 'ARIMA']):
                # Multi-factors model
                # Get start points value for the recovering
                start_point_value = start_points['valid_start_point'] if type_data=='valid' else start_points['test_start_point']
                # Recovering prediction
                ypred = recovery_prediction(ypred, start_point_value)            

            # Calculation metrics
            result = result_add_metrics(result, i, df_ts['y'], ypred)
    
    return result

# Dispay and save all results for validation dataset
if len(result) > 0:
    
    # Get type of each model
    result['type_model'] = result['name_model'].str.split('_').str[0]

    # Calculation metrics for recovering prediction ypred for validation dataset by all models 
    result = result_recover_and_metrics(result, valid_ts, 'valid', starting_point)
    display(result[['name_model', 'type_data', 'r2_score', 'rmse', 'mape']].sort_values(by=['type_data', 'mape', 'rmse'], ascending=True))
    
    # Save results
    num_models = len(result[result['type_data']=='valid']['name_model'].unique().tolist())
    print(f"Number of models built - {num_models}")
    result.to_csv(f'result_of_{num_models}_models_for_forecasting_days_{forecasting_days}.csv')
else: 
    print('There are no tuned models!')
HPO Output: 17 tuned models r2_score, rmse, and mape.
  • Optimized ML model training and forecasting [1]
def get_model_opt(name_model, params):
    # Model tuning for the name_model
    
    print(name_model)
    if name_model=='Linear Regression':
        model = LinearRegression(**params)
        
    elif name_model=='KNeighbors Regressor':
        model = KNeighborsRegressor(**params)
        
    elif name_model=='Support Vector Machines':
        model = SVR(**params)
        
    elif name_model=='Linear SVR':
        model = LinearSVR(**params)
        
    elif name_model=='Random Forest Regressor':
        model = RandomForestRegressor(**params)
        
    elif name_model=='Bagging Regressor':
        model = BaggingRegressor(**params)
    
    elif name_model=='MLP Regressor':
        model = MLPRegressor(**params)
        
    elif name_model=='XGB Regressor':
        model = xgb.XGBRegressor(**params)
        
    else: model = None
        
    return model

def get_params_optimal_model(result, main_metrics):
    # Get parameters of the optimal model from dataframe result by main_metrics

    # Set the data type to float (just in case)
    result[main_metrics] = result[main_metrics].astype('float')
    
    # Choose the optimal model
    opt_result = result[result['type_data']=='valid'].reset_index(drop=True)
    if main_metrics=='r2_score':
        opt_model = opt_result.nlargest(1, main_metrics)
    else:
        # 'mape' or 'rmse'
        opt_model = opt_result.nsmallest(1, main_metrics)
    display(opt_model[['name_model', 'r2_score', 'rmse', 'mape', 'params']])

    # Get parameters of the optimal model
    opt_name_model = opt_model['name_model'].tolist()[0]
    opt_type_model = opt_model['type_model'].tolist()[0]
    opt_params_model = opt_model['params'].tolist()[0]
    print(f'Optimal model by metrics "{main_metrics}" is "{opt_name_model}" with type "{opt_type_model}" parameters {opt_params_model}')
    
    return opt_name_model, opt_type_model, opt_params_model

def model_training_forecasting(result, df, y, test, ytest,  
                               name_model, type_model, params, type_test='1'):
    # Model training for df and y
    # Forecasting ypred
    # type_model = 'Prophet' or "ARIMA" or 'Other ML'
    # type_test = '1' (with find optimal parameters by GridSearchCV) 
    # type_test = '2' (with optimal parameters - without GridSearchCV)
    # return params and metrics in the dataframe result
    
    if type_model=='Prophet':    
        season_days_optimal = params[0]
        fourier_order_seasonality_optimal = params[1]
        model_opt = None
        _, ypred = prophet_modeling(result, 
                                    cryptocurrency, 
                                    df, 
                                    test, 
                                    holidays_df, 
                                    season_days_optimal,
                                    fourier_order_seasonality_optimal,
                                    forecasting_days,
                                    f'{type_model}_optimal',
                                    'test')        
    elif type_model=='ARIMA':
        season_days_optimal = params[0]
        fourier_order_seasonality_optimal = params[1]        
        model_opt = None
        
        # Training ARIMA optimal model for training+valid dataset
        df['y'] = y
        model_opt = arima_fit(df, 'y', order=(params[0],params[1],params[2]))        

        # Model diagnostics
        fig = model_opt.plot_diagnostics(figsize=(12,10))
        plt.show()

        # Plot residual errors
        get_residual_errors(model_opt)

        # Test forecasting and save result
        ypred = model_opt.forecast(steps=len(test))        

    else:
        # Other ML model
        # Training ML optimal model for training+valid dataset
        print(f"Tuning model '{name_model}'")
        models_opt_number = models[models['name']==name_model].index.tolist()[0]
        #print(f"Model - {models.at[models_opt_number,'model']} with parameters {params}")
        if type_test=='1':
            model_opt = GridSearchCV(models.at[models_opt_number,'model'], models.at[models_opt_number,'param_grid'])
        else:
            # type_test=='2'
            model_opt = get_model_opt(models.at[models_opt_number,'name'], params)
        model_opt.fit(df, y)
        
        # Forecasting
        ypred = model_opt.predict(test)

        
    # Scoring and saving results into the dataframe result
    n = len(result)-1
    result.loc[n,'name_model'] = f"{type_model}_optimal"
    result.loc[n,'type_data'] = "test"
    result.loc[n,'type_model'] = type_model
    result.at[n,'params'] = params
    result.at[n,'ypred'] = ypred
    #result = result_add_metrics(result, n, ytest, ypred)
    
    return result, model_opt, ypred

def get_optimal_model_and_forecasting(result, main_metrics, start_points):
    # Choosion the optimal model from dataframe result by main_metrics
    # Tuning optimal model for big dataset train+valid 
    # Test forecasting and drawing it
    # Returns the optimal model and it's name

    
    if len(result) > 0:
        # Get parameters of the optimal model from dataframe result by main_metrics
        opt_name_model, opt_type_model, opt_params_model = get_params_optimal_model(result, 
                                                                                    main_metrics)
        # Set datasets for the final tuning and testing by optimal model
        if (opt_type_model=='Prophet') or (opt_type_model=='ARIMA'):
            train_valid = train_valid_ts.copy()
            y_train_valid = train_valid_ts['y'].copy()
            test = test_ts.copy()
            ytest = test_ts['y'].copy()
            
        else:
            # Multi-factors ML models
            train_valid = train_valid_mf.copy()
            y_train_valid = y_train_valid_mf.copy()
            test = test_mf.copy()
            ytest = ytest_mf.copy()
    
        # Optimal model training for train+valid and test forecasting
        result, model_opt, ypred = model_training_forecasting(result, train_valid, y_train_valid,
                                                              test, ytest,
                                                              opt_name_model, opt_type_model, 
                                                              opt_params_model, '1')
        
        # Calculation metrics for recovering prediction ypred for test dataset by the optimal model
        result = result_recover_and_metrics(result, test_ts, 'test', start_points)
        
        # Drawing plot for prediction for the test data 
        if not ((opt_type_model=='Prophet') or (opt_type_model=='ARIMA')):
            # Recovery values "Close"
            ytest_plot = recovery_prediction(ytest.values, start_points['test_start_point'])
            ypred_plot = recovery_prediction(ypred, start_points['test_start_point'])
        else:
            ytest_plot = ytest.copy()
            ypred_plot = ypred.copy()
            
        # Drawing 
        plt.figure(figsize=(12,8))
        x = np.arange(len(ytest_plot))
        plt.scatter(x, ytest_plot, label = "Target test data", color = 'g', s=100)
        plt.scatter(x, ypred_plot, label = f"{opt_name_model} forecasting", color = 'r', s=50)
        plt.title(f'Forecasting of test data using the "{opt_name_model}" model, which is optimal for "{main_metrics}" metrics')
        plt.ylim(0)
        plt.legend(loc='lower right')
        plt.grid(True)
        
        return opt_name_model

# Get the optimal model by different metrics
if len(result) > 0:
    for valid_metrics in ['r2_score', 'rmse', 'mape']:
        get_optimal_model_and_forecasting(result, valid_metrics, starting_point) 

Best optimized ML models by 3 metrics: r2_score, rmse, and mape.
Forecasting of test data using the “Linear Regression” model, which is optimal for “r2_score” metrics
Forecasting of test data using the “Linear Regression” model, which is optimal for “rmse” metrics
Forecasting of test data using the “Linear SVR” model, which is optimal for “mape” metrics
  • Getting the best Linear Regression model
if is_other_ML:
    main_metrics = 'r2_score'
    if (len(result) > 0) and (len(models) > 0):
        result_nonTS = result[(result['type_model']!='Prophet') & (result['type_model']!='ARIMA')].reset_index(drop=True)
        opt_name_model2, opt_type_model2, opt_params_model2 = get_params_optimal_model(result_nonTS, 
                                                                                main_metrics)

        result, model_opt, ypred = model_training_forecasting(result, 
                                                              train_valid_mf, 
                                                              y_train_valid_mf,
                                                              test_mf, 
                                                              ytest_mf,
                                                              opt_name_model2, 
                                                              opt_type_model2, 
                                                              opt_params_model2, 
                                                              '2')
The best Linear Regression model by metrics r2_score

SHAP Feature Dominance Analysis

  • Let’s use the SHAP package to calculate feature importance [1, 6]. The most common way of understanding ML models is to examine the coefficients learned for each feature.
  • Plotting the feature importance diagrams with SHAP [1]
# All features names
if is_other_ML:
    coeff = pd.DataFrame(train_valid_mf.columns)
    coeff.columns = ['feature']

def add_fi_coeff(coeff, col, list_new_fi_coeff=None, df_new_fi_coeff=None):
    # Adds new importance of features as feature col
    # from list list_new_fi_coeff or dataframe df_new_fi_coeff
    # to the resulting dataframe coeff with feature names 
    # Missed importance values are replaced by zero
    
    if list_new_fi_coeff is not None:
        df_new_fi_coeff = coeff[['feature']].copy()
        df_new_fi_coeff["score"] = pd.Series(list_new_fi_coeff)
    
    if df_new_fi_coeff is not None:
        # Rename df_new_fi_coeff
        df_new_fi_coeff.columns = ['feature', 'score']   # to the plot drawing
        df_new_fi_coeff[col] = df_new_fi_coeff['score']  # to the merging and saving
        
        # Merging dataframes - coeff of all features with new_fi_coeff
        coeff = coeff.merge(df_new_fi_coeff[['feature', col]], on='feature', how='left').fillna(0)
        
        is_score = True
    else:
        print(f'Data is absent fol {col}')
        is_score = False
        coeff = None
    
    return coeff, df_new_fi_coeff, is_score

# Feature importance diagram with SHAP
if is_other_ML:
    if (len(result) > 0) and (len(models) > 0):
        print('Feature importance diagram with SHAP:')
        try:
            # Trees
            explainer = shap.TreeExplainer(model_opt)
            shap_values = explainer.shap_values(test_mf)
            shap.summary_plot(shap_values, test_mf, plot_type="bar", feature_names=coeff['feature'].tolist())
            shap.summary_plot(shap_values, test_mf)

            # Save permutation feature importance values
            coeff, _, is_SHAP_successfully = add_fi_coeff(coeff, 'shap_fi_score', shap_values)
        except: 
            try:
                # Other types of models
                explainer = shap.KernelExplainer(model_opt.predict, train_valid_mf)
                shap_values = explainer.shap_values(test_mf)

                # Plot drawing
                shap.summary_plot(shap_values, test_mf, plot_type="bar", feature_names=coeff['feature'].tolist())
                shap.summary_plot(shap_values, test_mf)

                # Get feature importance values from shap_values format
                # Thanks to https://stackoverflow.com/a/69523421/12301574
                shap_values_all = pd.DataFrame(shap_values, columns = test_mf.columns)
                vals = np.abs(shap_values_all.values).mean(0)
                shap_importance = pd.DataFrame(list(zip(test_mf.columns, vals)),
                                                  columns=['feature','score'])            

                # Saving feature importance values
                coeff, _, is_SHAP_successfully = add_fi_coeff(coeff, 'shap_fi_score', None, shap_importance)            

            except: 
                is_SHAP_successfully = False

        if not is_SHAP_successfully:
            print('Feature importance diagram for this optimal model is not supported in SHAP')
Feature importance diagram with SHAP: mean(|SHAP value|).
Feature importance diagram with SHAP: SHAP value.
  • Creating the Feature importance diagram as the Force plot with SHAP
Feature importance diagram as the Force plot with SHAP

LSTM Prediction & Validation QC

  • LSTMs are an improved version of recurrent neural networks (RNNs).
  • Here, we demonstrate how the LSTM algorithm [7] can be used to predict NVDA stock prices by training on historical data.
  • Importing the relevant Python libraries and reading the stock data
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.style.use("fivethirtyeight")
%matplotlib inline

# For reading stock data from yahoo
from pandas_datareader.data import DataReader

# For time stamps
from datetime import datetime
from math import sqrt
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

#ignore the warnings
import warnings
warnings.filterwarnings('ignore')

from keras.models import Sequential
from keras.layers import Dense, LSTM
import tensorflow as tf

#!pip install keras
#!pip install tensorflow

def plot_close_val(data_frame, column, stock):
    plt.figure(figsize=(12,6))
    plt.title(column + ' Price History for ' + stock )
    plt.plot(data_frame[column])
    plt.xlabel('Date', fontsize=18)
    plt.ylabel(column + ' Price USD ($) for ' + stock, fontsize=18)
    plt.show()

#Test the function
plot_close_val(df, 'Close', 'NVDA')
Close Price History for NVDA
  • Preparing the input data for LSTM model training
def build_training_dataset(input_ds):
    # Create a new dataframe with only the 'Close column 
    input_ds.reset_index()
    data = input_ds.filter(items=['Close'])
    # Convert the dataframe to a numpy array
    dataset = data.values
    # Get the number of rows to train the model on
    training_data_len = int(np.ceil( len(dataset) * .90 ))
    return data, dataset, training_data_len

#Test the function
training_data_df, training_dataset_np, training_data_len = build_training_dataset(df)
dataset=training_dataset_np
data=training_data_df

# Scale the data
from sklearn.preprocessing import MinMaxScaler
def scale_the_data(dataset):
    scaler = MinMaxScaler(feature_range=(0,1))
    scaled_data = scaler.fit_transform(dataset)
    return scaler, scaled_data

#Test the function
scaler, scaled_data = scale_the_data(training_dataset_np)

# Create the training data set 
# Create the scaled training data set
def split_train_dataset(training_data_len):
    train_data = scaled_data[0:int(training_data_len), :]
    # Split the data into x_train and y_train data sets
    x_train = []
    y_train = []
    for i in range(60, len(train_data)):
        x_train.append(train_data[i-60:i, 0])
        y_train.append(train_data[i, 0])
        if i<= 61:
            #print(x_train)
            #print(y_train)
            print('.')
            
    # Convert the x_train and y_train to numpy arrays 
    x_train, y_train = np.array(x_train), np.array(y_train)

    # Reshape the data
    x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
    # x_train.shape
    return x_train, y_train

#Test the function
x_train,y_train = split_train_dataset(training_data_len)
  • Building the LSTM model
def build_lstm_model(x_train,y_train):
    # Build the LSTM model
    model = Sequential()
    model.add(LSTM(128, return_sequences=True, input_shape= (x_train.shape[1], 1)))
    model.add(LSTM(64, return_sequences=False))
    model.add(Dense(25))
    model.add(Dense(1))

    # Compile the model
    model.compile(optimizer='adam', loss='mean_squared_error')
    #model.compile(optimizer='lion', loss='mean_squared_error')
    #model.compile(optimizer='adamw', loss='mean_squared_error')
    # Train the model
    model.fit(x_train, y_train, batch_size=2, epochs=10)
    return model 

#Test the function
lstm_model = build_lstm_model(x_train,y_train)

  • Making test data predictions with TEST_DATA_LENGTH = 100
def create_testing_data_set(model, scaler, training_data_len,test_data_len):
    # Create the testing data set
    # 
    test_data = scaled_data[training_data_len - test_data_len: , :]
    # Create the data sets x_test and y_test
    x_test = []
    y_test = dataset[training_data_len:, :]
    for i in range(test_data_len, len(test_data)):
        x_test.append(test_data[i-test_data_len:i, 0])
    
    # Convert the data to a numpy array
    x_test = np.array(x_test)

    # Reshape the data
    x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1 ))

    # Get the models predicted price values 
    predictions = model.predict(x_test)
    predictions = scaler.inverse_transform(predictions)

    # Get the root mean squared error (RMSE)
    rmse = np.sqrt(np.mean(((predictions - y_test) ** 2)))
    print(rmse)
    return (x_test, y_test, predictions, rmse)

#Test the function
TEST_DATA_LENGTH = 100
x_test,y_test, predictions, rmse = create_testing_data_set(lstm_model,scaler,training_data_len, TEST_DATA_LENGTH)

4.706425009397595
  • Comparing test data vs LSTM predictions
plt.figure(figsize=(12,6))
plt.plot(y_test,label='Test')
plt.plot(predictions,label='Prediction')
plt.xlabel('Forecast Day No')
plt.ylabel('Price')
plt.legend()
Test data vs LSTM predictions
  • Comparing validation data vs LSTM predictions
def plot_predictions(stock, data,training_data_len):
    #Plot the data
    train = data[:training_data_len]
    valid = data[training_data_len:]
    valid['Predictions'] = predictions
    # Visualize the data
    plt.figure(figsize=(14,6))
    title = stock + ' Model Forecast'
    ylabel = stock + ' Close Price USD ($)'
    plt.title(title)
    plt.xlabel('Date', fontsize=18)
    plt.ylabel(ylabel, fontsize=18)
    plt.plot(train['Close'])
    plt.plot(valid[['Close', 'Predictions']])
    plt.legend(['Training Data', 'Validated Data', 'Predicted Data'], loc='lower right')
    plt.show()
    return valid
    
#Test the function
valid = plot_predictions('NVDA',data,training_data_len)
Validation data vs LSTM predictions
  • Calculating the LSTM error metrics
from sklearn.metrics import r2_score
r2_score(y_test, predictions)

0.9375826947059053

from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y_test, predictions)

4.706425009397595


from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, predictions)

3.7365759733681365

from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_test, predictions)

0.03859417843438189

from sklearn.metrics import explained_variance_score
explained_variance_score(y_test, predictions)

0.9638450129795046
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape= (x_train.shape[1], 1)))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(25))
model.add(Dense(1))

    # Compile the model
model.compile(optimizer='adam', loss='mean_squared_error',metrics=['accuracy'])
    #model.compile(optimizer='lion', loss='mean_squared_error')
    #model.compile(optimizer='adamw', loss='mean_squared_error')
    # Train the model
history=model.fit(x_train, y_train, batch_size=2, epochs=10,validation_split=0.5)
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.show()
LSTM model loss vs epochs for both train and test data

Conclusions

  • In this article, we have applied the the concept of algorithmic trading to the NVIDIA Short-Term Stock Price Forecasting.
  • We’ve walked through the following topics:
  • Exploratory Data Analysis (EDA) of Input Stock Data
  • Basic Linear Regression with Time-Step & Lag Features
  • TSFRESH Feature Engineering [1]
  • Multi-Seasonal FB Prophet Prediction & HPO
  • Auto-ARIMA Model Fitting & SARIMAX QC Summary
  • SciKit-Learn Supervised ML Regression & HPO
  • SHAP Feature Dominance Analysis
  • TF LSTM Keras Prediction & Validation QC
  • We have shown that time series forecasting (predicting future values based on historical values) applies well to NVIDIA stock forecasting.
  • Since stock prices prediction is essentially a regression problem, the R2-Score, RMSE, MAE, and MAPE have been our key model evaluation metrics.
  • Using the LSTM algorithm [7] and the basic Linear Regression [2] of 10-day Lags, our results showed that the forecasting model has a high accuracy of >95% for most of the stock data used, demonstrating the appropriateness of the ML model and the test set data is used to evaluate the model’s performance.
  • Our ongoing efforts focus on addressing the following challenges:
  • Market Uncertainty: Financial markets can be highly unpredictable, making accurate predictions challenging.
  • External Factors: Models may not account for sudden geopolitical events or economic shifts.
  • Overfitting: Models may perform well on historical data but struggle with new, unforeseen patterns.

Acknowledgements

  • The main code of this notebook was written or compiled from many notebooks by Kaggle Grandmaster, Dr. Eng. Sc., Prof. Vitalii Mokin.

References

  1. Crypto — BTC : Analysis & Forecasting
  2. Linear Regression With Time Series
  3. NVIDIA Stock Price Prediction
  4. SciKit-Learn Supervised Learning
  5. Forecasting Time Series with Auto-Arima
  6. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values
  7. NASDAQ Stock Price Prediction using LSTM

Explore More

Contacts

Disclaimer

  • The following disclaimer clarifies that the information provided in this article is for educational use only and should not be considered financial or investment advice.
  • The information provided does not take into account your individual financial situation, objectives, or risk tolerance.
  • Any investment decisions or actions you undertake are solely your responsibility.
  • You should independently evaluate the suitability of any investment based on your financial objectives, risk tolerance, and investment timeframe.
  • It is recommended to seek advice from a certified financial professional who can provide personalized guidance tailored to your specific needs.
  • The tools, data, content, and information offered are impersonal and not customized to meet the investment needs of any individual. As such, the tools, data, content, and information are provided solely for informational and educational purposes only.
Python
Nvidia
Machine Learning
Forecasting
Statistics
Recommended from ReadMedium