Forecasting Stock Prices With XGBoost
Predicting stock prices with precision is a critical challenge in financial analytics. This article explores an advanced approach using the XGBoost algorithm to forecast next-day stock prices based on historical data. Our model is trained on three years of stock data, segmented into training (60%), development (20%), and test (20%) sets.
The core of our methodology lies in data normalization and adaptive scaling. Initially, we standardize the training set to a mean of 0 and variance of 1, applying this transformation to the development and test sets for consistency. We further refine the model by scaling the past N days’ data in the development set, ensuring predictions are based on appropriately normalized inputs.
The most significant evolution in our approach involves adaptive scaling for the development and test sets. Rather than applying a uniform scaling factor, we dynamically adjust the scaling based on the mean and variance of the preceding N days’ data. This ensures that our model remains sensitive to recent market trends and data variations, enhancing its predictive accuracy for future stock prices.
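To make the idea concrete, here is a minimal illustration of the two scaling schemes on a toy price series. The numbers and the window size below are made up purely for illustration and are not taken from the article's dataset:

import numpy as np
import pandas as pd

# Toy price series (illustrative values only)
prices = pd.Series([100.0, 101.5, 99.8, 102.3, 103.1, 102.7])

# Global scaling (used for the training set): one mean/std for the whole set
global_scaled = (prices - prices.mean()) / prices.std()

# Adaptive scaling (used for dev/test): each value is scaled by the
# mean/std of the *previous* N observations only
N = 3
roll_mean = prices.rolling(N, min_periods=1).mean().shift(1)
roll_std = prices.rolling(N, min_periods=1).std().shift(1)
adaptive_scaled = (prices - roll_mean) / roll_std

print(pd.DataFrame({"price": prices,
                    "global": global_scaled,
                    "adaptive": adaptive_scaled}))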
This article presents a detailed exploration of this sophisticated predictive model, demonstrating how machine learning can be leveraged for more accurate financial forecasting.
Let’s start coding:
import math
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import time
from datetime import date
from matplotlib import pyplot as plt
from pylab import rcParams
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm_notebook
from xgboost import XGBRegressor
%matplotlib inline
#### Input params ##################
stk_path = "./data/VTI.csv"
test_size = 0.2 # proportion of dataset to be used as test set
cv_size = 0.2 # proportion of dataset to be used as cross-validation set
N = 3 # for feature at day t, we use lags from t-1, t-2, ..., t-N as features
n_estimators = 100 # Number of boosted trees to fit. default = 100
max_depth = 3 # Maximum tree depth for base learners. default = 3
learning_rate = 0.1 # Boosting learning rate (xgb’s “eta”). default = 0.1
min_child_weight = 1 # Minimum sum of instance weight(hessian) needed in a child. default = 1
subsample = 1 # Subsample ratio of the training instance. default = 1
colsample_bytree = 1 # Subsample ratio of columns when constructing each tree. default = 1
colsample_bylevel = 1 # Subsample ratio of columns for each split, in each level. default = 1
gamma = 0 # Minimum loss reduction required to make a further partition on a leaf node of the tree. default=0
model_seed = 100
fontsize = 14
ticklabelsize = 14
####################################
This script sets up the XGBoost workflow. It imports the required libraries (math, matplotlib, numpy, pandas, seaborn, scikit-learn, and xgboost) and defines the input parameters: the dataset location, the proportions of the data reserved for the dev and test sets, the number of lags N used as features, and XGBoost settings such as the number of trees, maximum depth, learning rate, and minimum child weight. It also fixes a seed for reproducible results and sets the font sizes for the charts.
Common Functions
def get_mov_avg_std(df, col, N):
    """
    Given a dataframe, get mean and std dev at timestep t using values from t-1, t-2, ..., t-N.
    Inputs
        df     : dataframe. Can be of any length.
        col    : name of the column you want to calculate mean and std dev
        N      : get mean and std dev at timestep t using values from t-1, t-2, ..., t-N
    Outputs
        df_out : same as df but with additional columns containing the mean and std dev
    """
    mean_list = df[col].rolling(window=N, min_periods=1).mean()  # len(mean_list) = len(df)
    std_list = df[col].rolling(window=N, min_periods=1).std()    # first value will be NaN, because normalized by N-1
    # Shift forward by one timestep so the stats at t only use values up to t-1
    mean_list = np.concatenate((np.array([np.nan]), np.array(mean_list[:-1])))
    std_list = np.concatenate((np.array([np.nan]), np.array(std_list[:-1])))
    # Append mean_list and std_list to df
    df_out = df.copy()
    df_out[col + '_mean'] = mean_list
    df_out[col + '_std'] = std_list
    return df_out
def scale_row(row, feat_mean, feat_std):
    """
    Given a pandas series in row, scale it to have 0 mean and var 1 using feat_mean and feat_std
    Inputs
        row        : pandas series. Need to scale this.
        feat_mean  : mean
        feat_std   : standard deviation
    Outputs
        row_scaled : pandas series with same length as row, but scaled
    """
    # If feat_std = 0 (this happens if adj_close doesn't change over N days),
    # set it to a small number to avoid division by zero
    feat_std = 0.001 if feat_std == 0 else feat_std
    row_scaled = (row - feat_mean) / feat_std
    return row_scaled
def get_mape(y_true, y_pred):
    """
    Compute mean absolute percentage error (MAPE)
    """
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
def train_pred_eval_model(X_train_scaled,
                          y_train_scaled,
                          X_test_scaled,
                          y_test,
                          col_mean,
                          col_std,
                          seed=100,
                          n_estimators=100,
                          max_depth=3,
                          learning_rate=0.1,
                          min_child_weight=1,
                          subsample=1,
                          colsample_bytree=1,
                          colsample_bylevel=1,
                          gamma=0):
    '''
    Train model, do prediction, scale back to original range and do evaluation
    Use XGBoost here.
    Inputs
        X_train_scaled    : features for training. Scaled to have mean 0 and variance 1
        y_train_scaled    : target for training. Scaled to have mean 0 and variance 1
        X_test_scaled     : features for test. Each sample is scaled to mean 0 and variance 1
        y_test            : target for test. Actual values, not scaled.
        col_mean          : means used to scale each sample of X_test_scaled. Same length as X_test_scaled and y_test
        col_std           : standard deviations used to scale each sample of X_test_scaled. Same length as X_test_scaled and y_test
        seed              : model seed
        n_estimators      : number of boosted trees to fit
        max_depth         : maximum tree depth for base learners
        learning_rate     : boosting learning rate (xgb's "eta")
        min_child_weight  : minimum sum of instance weight (hessian) needed in a child
        subsample         : subsample ratio of the training instance
        colsample_bytree  : subsample ratio of columns when constructing each tree
        colsample_bylevel : subsample ratio of columns for each split, in each level
        gamma             : minimum loss reduction required to make a further partition on a leaf node
    Outputs
        rmse              : root mean square error of y_test and est
        mape              : mean absolute percentage error of y_test and est
        est               : predicted values. Same length as y_test
    '''
    model = XGBRegressor(seed=seed,  # use the seed passed in, not the global model_seed
                         n_estimators=n_estimators,
                         max_depth=max_depth,
                         learning_rate=learning_rate,
                         min_child_weight=min_child_weight,
                         subsample=subsample,
                         colsample_bytree=colsample_bytree,
                         colsample_bylevel=colsample_bylevel,
                         gamma=gamma)
    # Train the model
    model.fit(X_train_scaled, y_train_scaled)
    # Get predicted labels and scale back to original range
    est_scaled = model.predict(X_test_scaled)
    est = est_scaled * col_std + col_mean
    # Calculate RMSE and MAPE against the unscaled targets
    rmse = math.sqrt(mean_squared_error(y_test, est))
    mape = get_mape(y_test, est)
    return rmse, mape, est
This block defines the helper functions used throughout the rest of the article. get_mov_avg_std computes the rolling mean and standard deviation of a column over the previous N timesteps and appends them as new columns. scale_row scales a pandas Series using a supplied mean and standard deviation, substituting a small value when the standard deviation is zero to avoid division by zero. get_mape computes the mean absolute percentage error between actual and predicted values. Finally, train_pred_eval_model fits an XGBoost regressor on the scaled training data, predicts on the scaled test features, rescales the predictions back to the original range using the supplied per-sample means and standard deviations, and returns the RMSE, MAPE, and predictions.
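As a quick sanity check, the helpers can be exercised on a tiny toy dataframe. The column values below are made up purely for illustration:

# Tiny illustrative dataframe
toy = pd.DataFrame({'adj_close': [10.0, 11.0, 12.0, 13.0, 14.0]})

# Rolling mean/std of the *previous* N values, appended as new columns
toy = get_mov_avg_std(toy, 'adj_close', N=3)
print(toy)
#    adj_close  adj_close_mean  adj_close_std
# 0       10.0             NaN            NaN
# 1       11.0            10.0            NaN
# 2       12.0            10.5       0.707107
# 3       13.0            11.0       1.000000
# 4       14.0            12.0       1.000000

# Scale one row of lag features by that row's rolling statistics
row = pd.Series({'adj_close_lag_1': 13.0, 'adj_close_lag_2': 12.0, 'adj_close_lag_3': 11.0})
print(scale_row(row, feat_mean=12.0, feat_std=1.0))   # -> 1.0, 0.0, -1.0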
Load Data
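The pipeline expects a CSV at stk_path with Yahoo-Finance-style columns: Date, Open, High, Low, Close, Adj Close, and Volume. If you do not already have such a file, one possible way to produce it (not part of the original pipeline) is with the yfinance package; the date range below is illustrative, and depending on your yfinance version you may need to flatten a multi-level column header before saving:

# Optional helper: download VTI and save it in the expected CSV layout.
# Requires `pip install yfinance`; dates are illustrative.
import yfinance as yf

raw = yf.download("VTI", start="2015-11-01", end="2018-11-30",
                  auto_adjust=False)   # keep the 'Adj Close' column
raw = raw.reset_index()
# Some yfinance versions return a MultiIndex column header; flatten it if so.
if raw.columns.nlevels > 1:
    raw.columns = [c[0] if isinstance(c, tuple) else c for c in raw.columns]
raw.to_csv("./data/VTI.csv", index=False)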
df = pd.read_csv(stk_path, sep = ",")
# Convert Date column to datetime
df.loc[:, 'Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
# Change all column headings to be lower case, and remove spacing
df.columns = [str(x).lower().replace(' ', '_') for x in df.columns]
# Get month of each sample
df['month'] = df['date'].dt.month
# Sort by datetime
df.sort_values(by='date', inplace=True, ascending=True)
df.head()
The script loads a CSV file from stk_path into a pandas DataFrame, formats the Date column to datetime, and modifies all column names to lowercase replacing spaces with underscores. It also adds a month column using the dates and sorts the data by date. The first 5 rows are then shown.
# Plot adjusted close over time
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
ax = df.plot(x='date', y='adj_close', style='b-', grid=True)
ax.set_xlabel("date")
ax.set_ylabel("USD")
The code creates a 10x8 figure showing the stock's adjusted closing price over time as a blue line. It also includes a grid and labels the x-axis as date and the y-axis as USD for clarity.
Feature Engineering
# Get difference between high and low of each day
df['range_hl'] = df['high'] - df['low']
df.drop(['high', 'low'], axis=1, inplace=True)
# Get difference between open and close of each day
df['range_oc'] = df['open'] - df['close']
df.drop(['open', 'close'], axis=1, inplace=True)
df.head()
This code takes the dataframe df and does the following: 1. Creates a range_hl column with the difference between the high and low values. 2. Deletes the high and low columns. 3. Creates a range_oc column with the difference between the open and close values. 4. Deletes the open and close columns. 5. Shows the first few rows with .head(). The new range_hl and range_oc columns capture the intraday high-low range and the open-to-close move, making the dataframe leaner and more focused.
Now we create lagged copies of selected columns, going back up to N days, to use as features.
# Add a column 'order_day' to indicate the order of the rows by date
df['order_day'] = [x for x in list(range(len(df)))]
# merging_keys
merging_keys = ['order_day']
# List of columns that we will use to create lags
lag_cols = ['adj_close', 'range_hl', 'range_oc', 'volume']
lag_cols
The code starts by adding an order_day column to the dataframe, filled with a running index that records the order of the rows. A list named merging_keys holds the key used to join the shifted copies back onto the dataframe, and a list named lag_cols names the columns for which lag features will be created. Lastly, lag_cols is displayed to confirm which columns will be lagged.
shift_range = [x+1 for x in range(N)]
for shift in tqdm_notebook(shift_range):
    train_shift = df[merging_keys + lag_cols].copy()
    # E.g. order_day of 0 becomes 1, for shift = 1.
    # So when this is merged with order_day of 1 in df, this will represent lag of 1.
    train_shift['order_day'] = train_shift['order_day'] + shift
    foo = lambda x: '{}_lag_{}'.format(x, shift) if x in lag_cols else x
    train_shift = train_shift.rename(columns=foo)
    df = pd.merge(df, train_shift, on=merging_keys, how='left')  # .fillna(0)
    del train_shift
# Remove the first N rows which contain NaNs
df = df[N:]
df.head()
This code generates lagged features by shifting the order_day column by 1 to N days. For each shift value, it renames the shifted columns to indicate the lag and left-joins them back onto the original dataframe on order_day. The first N rows are then removed because they contain NaNs (there is no full history of N prior days for them), leaving the dataset with the lagged features included.
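The same lag columns can be built more directly with DataFrame.shift, which some readers may find easier to follow than the merge-on-order_day approach. The sketch below assumes it is run in place of the merge loop above, on the dataframe as it was before any lag columns were added (df_before_lags is a hypothetical name for that dataframe):

# Alternative: build the same lag columns with shift() instead of a self-merge
df_alt = df_before_lags.copy()   # hypothetical: the dataframe before the merge loop
for col in lag_cols:
    for shift in range(1, N + 1):
        df_alt['{}_lag_{}'.format(col, shift)] = df_alt[col].shift(shift)
df_alt = df_alt[N:]   # drop the first N rows, which contain NaNs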
Get mean and std dev at timestamp t using values from t-1, …, t-N
cols_list = [
"adj_close",
"range_hl",
"range_oc",
"volume"
]
for col in cols_list:
    df = get_mov_avg_std(df, col, N)
df.head()
This code lists four columns in cols_list and loops through each one. For every column, get_mov_avg_std appends the rolling mean and standard deviation computed from the previous N days. After processing all columns, the first few rows of the updated dataframe are displayed with .head(). In short, each column now carries the rolling statistics that will later be used for adaptive scaling.
Split Into Train, Dev And Test Set
# Get sizes of each of the datasets
num_cv = int(cv_size*len(df))
num_test = int(test_size*len(df))
num_train = len(df) - num_cv - num_test
print("num_train = " + str(num_train))
print("num_cv = " + str(num_cv))
print("num_test = " + str(num_test))
# Split into train, cv, and test
train = df[:num_train]
cv = df[num_train:num_train+num_cv]
train_cv = df[:num_train+num_cv]
test = df[num_train+num_cv:]
print("train.shape = " + str(train.shape))
print("cv.shape = " + str(cv.shape))
print("train_cv.shape = " + str(train_cv.shape))
print("test.shape = " + str(test.shape))
This code divides the dataset into training, dev, and test sets. It calculates the set sizes from the dataset's length, prints the number of samples in each, and then slices the data chronologically: train first, then dev, then test, along with a combined train+dev set (train_cv) for fitting the final model. Once split, it prints the dimensions of each set to verify the split.
Scale The Train, Dev And Test Set
cols_to_scale = [
"adj_close"
]
for i in range(1, N+1):
    cols_to_scale.append("adj_close_lag_"+str(i))
    cols_to_scale.append("range_hl_lag_"+str(i))
    cols_to_scale.append("range_oc_lag_"+str(i))
    cols_to_scale.append("volume_lag_"+str(i))
# Do scaling for train set
# Here we only scale the train dataset, and not the entire dataset to prevent information leak
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[cols_to_scale])
print("scaler.mean_ = " + str(scaler.mean_))
print("scaler.var_ = " + str(scaler.var_))
print("train_scaled.shape = " + str(train_scaled.shape))
# Convert the numpy array back into pandas dataframe
train_scaled = pd.DataFrame(train_scaled, columns=cols_to_scale)
train_scaled[['date', 'month']] = train.reset_index()[['date', 'month']]
print("train_scaled.shape = " + str(train_scaled.shape))
train_scaled.head()
This code builds a list named cols_to_scale with the columns to be standardized: adj_close plus every lag column up to lag N. A StandardScaler is then fit on the training data only, to avoid leaking information from the dev and test sets, and the transformed array is stored in train_scaled. The code prints the scaler's means and variances along with the array's shape, converts the array back into a dataframe with the proper column names, and copies the date and month columns back from train so the scaled features stay aligned with their dates.
# Do scaling for train+dev set
scaler_train_cv = StandardScaler()
train_cv_scaled = scaler_train_cv.fit_transform(train_cv[cols_to_scale])
print("scaler_train_cv.mean_ = " + str(scaler_train_cv.mean_))
print("scaler_train_cv.var_ = " + str(scaler_train_cv.var_))
print("train_cv_scaled.shape = " + str(train_cv_scaled.shape))
# Convert the numpy array back into pandas dataframe
train_cv_scaled = pd.DataFrame(train_cv_scaled, columns=cols_to_scale)
train_cv_scaled[['date', 'month']] = train_cv.reset_index()[['date', 'month']]
print("train_cv_scaled.shape = " + str(train_cv_scaled.shape))
train_cv_scaled.head()
This code scales the combined train+dev set with a StandardScaler. It creates a scaler called scaler_train_cv, fits it on cols_to_scale so each column has a mean of 0 and a standard deviation of 1, and prints the fitted means and variances along with the shape of the scaled data. The scaled numpy array is turned back into a pandas dataframe named train_cv_scaled with the proper column names, the original date and month columns are added back from train_cv, and the dataframe's shape and first few rows are displayed to confirm everything is intact.
# Do scaling for dev set
cv_scaled = cv[['date']]
for col in tqdm_notebook(cols_list):
    feat_list = [col + '_lag_' + str(shift) for shift in range(1, N+1)]
    temp = cv.apply(lambda row: scale_row(row[feat_list], row[col+'_mean'], row[col+'_std']), axis=1)
    cv_scaled = pd.concat([cv_scaled, temp], axis=1)
# Now the entire dev set is scaled
cv_scaled.head()
This code scales the development set. It starts with a new dataframe named cv_scaled containing only the date column. It then loops through each column in cols_list, builds the list of that column's lag features, and applies scale_row to every row so the lag features are scaled by that row's rolling mean and standard deviation (the _mean and _std columns). The scaled columns are concatenated onto cv_scaled and the first few rows are displayed. In short, each dev-set sample is normalized relative to its own recent history rather than to the training-set statistics.
# Do scaling for test set
test_scaled = test[['date']]
for col in tqdm_notebook(cols_list):
    feat_list = [col + '_lag_' + str(shift) for shift in range(1, N+1)]
    temp = test.apply(lambda row: scale_row(row[feat_list], row[col+'_mean'], row[col+'_std']), axis=1)
    test_scaled = pd.concat([test_scaled, temp], axis=1)
# Now the entire test set is scaled
test_scaled.head()
The code scales the test set the same way. It creates test_scaled with just the date column, loops through each column in cols_list, builds feat_list with that column's lag features, and scales each row with scale_row using the rolling mean and standard deviation of the previous N days (that row's _mean and _std values). The scaled values are concatenated onto test_scaled, and its first few rows are shown once all columns are processed.
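Row-wise apply with scale_row is easy to follow but slow on large frames; the same per-sample scaling can be done with vectorized pandas operations. This is a sketch under the same column-naming assumptions as above, not the approach used in the rest of the article:

# Vectorized equivalent of the apply/scale_row loop for the test set
test_scaled_alt = test[['date']].copy()
for col in cols_list:
    feat_list = [col + '_lag_' + str(shift) for shift in range(1, N + 1)]
    std = test[col + '_std'].replace(0, 0.001)   # avoid division by zero, as in scale_row
    scaled = test[feat_list].sub(test[col + '_mean'], axis=0).div(std, axis=0)
    test_scaled_alt = pd.concat([test_scaled_alt, scaled], axis=1)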
Split Into X And Y
features = []
for i in range(1, N+1):
    features.append("adj_close_lag_"+str(i))
    features.append("range_hl_lag_"+str(i))
    features.append("range_oc_lag_"+str(i))
    features.append("volume_lag_"+str(i))
target = "adj_close"
# Split into X and y
X_train = train[features]
y_train = train[target]
X_cv = cv[features]
y_cv = cv[target]
X_train_cv = train_cv[features]
y_train_cv = train_cv[target]
X_sample = test[features]
y_sample = test[target]
print("X_train.shape = " + str(X_train.shape))
print("y_train.shape = " + str(y_train.shape))
print("X_cv.shape = " + str(X_cv.shape))
print("y_cv.shape = " + str(y_cv.shape))
print("X_train_cv.shape = " + str(X_train_cv.shape))
print("y_train_cv.shape = " + str(y_train_cv.shape))
print("X_sample.shape = " + str(X_sample.shape))
print("y_sample.shape = " + str(y_sample.shape))
The code initializes an empty list named features and, in a loop over lags 1 to N, appends the four lag-column names for each lag. It designates adj_close as the target and then splits the unscaled train, dev, train+dev, and test dataframes into feature matrices (X_*) and target vectors (y_*). It concludes by printing the shape of each, so the sizes can be checked before modeling.
# Split into X and y
X_train_scaled = train_scaled[features]
y_train_scaled = train_scaled[target]
X_cv_scaled = cv_scaled[features]
X_train_cv_scaled = train_cv_scaled[features]
y_train_cv_scaled = train_cv_scaled[target]
X_sample_scaled = test_scaled[features]
print("X_train_scaled.shape = " + str(X_train_scaled.shape))
print("y_train_scaled.shape = " + str(y_train_scaled.shape))
print("X_cv_scaled.shape = " + str(X_cv_scaled.shape))
print("X_train_cv_scaled.shape = " + str(X_train_cv_scaled.shape))
print("y_train_cv_scaled.shape = " + str(y_train_cv_scaled.shape))
print("X_sample_scaled.shape = " + str(X_sample_scaled.shape))
The code does the same split for the scaled dataframes. The scaled training features and target go into X_train_scaled and y_train_scaled, the scaled dev features into X_cv_scaled, the scaled train+dev features and target into X_train_cv_scaled and y_train_cv_scaled, and the scaled test features into X_sample_scaled. Note that the dev and test targets are not scaled: predictions are made in scaled space and then rescaled back before being compared with the actual prices. Finally, it prints the shape of each set to verify the split.
EDA
# Plot adjusted close over time
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
ax = train.plot(x='date', y='adj_close', style='b-', grid=True)
ax = cv.plot(x='date', y='adj_close', style='y-', grid=True, ax=ax)
ax = test.plot(x='date', y='adj_close', style='g-', grid=True, ax=ax)
ax.legend(['train', 'dev', 'test'])
ax.set_xlabel("date")
ax.set_ylabel("USD")
ax.set_title("Without scaling")
The code adjusts the plot size to 10 by 8 inches, then plots the adjusted closing values for training, validation, and test datasets over time, each in a different color. A grid improves visibility, and a legend identifies the datasets. Date and USD are labeled on the x and y axes, with Without scaling as the title. The plot is designed to easily compare the datasets.
# Plot adjusted close over time
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
ax = train_scaled.plot(x='date', y='adj_close', style='b-', grid=True)
ax.legend(['train_scaled'])
ax.set_xlabel("date")
ax.set_ylabel("USD (scaled)")
ax.set_title("With scaling")
This code creates a graph, 10 inches wide and 8 inches tall, of the scaled adjusted close over time. It plots the train_scaled data as a solid blue line on a grid, labels the line as train_scaled, labels the x-axis as date and the y-axis as USD (scaled), and adds a title noting that the data is scaled. This makes it easy to see how the standardized adjusted close values change over time.
Train The Model With XGBoost
# Create the model
model = XGBRegressor(seed=model_seed,
n_estimators=n_estimators,
max_depth=max_depth,
learning_rate=learning_rate,
min_child_weight=min_child_weight,
subsample=subsample,
colsample_bytree=colsample_bytree,
colsample_bylevel=colsample_bylevel,
gamma=gamma)
# Train the regressor
model.fit(X_train_scaled, y_train_scaled)
The code sets up an XGBRegressor using the parameters defined earlier (model_seed, n_estimators, and so on) to control the model's learning process, then trains it on the scaled training features and target with the fit method. In short, the code prepares and trains the model that will be used for the predictions below.
Predict On Train Set
# Do prediction on train set
est_scaled = model.predict(X_train_scaled)
est = est_scaled * math.sqrt(scaler.var_[0]) + scaler.mean_[0]
# Calculate RMSE
print("RMSE on train set = %0.3f" % math.sqrt(mean_squared_error(y_train, est)))
# Calculate MAPE
print("MAPE on train set = %0.3f%%" % get_mape(y_train, est))
This code uses the trained model to predict on the scaled training features X_train_scaled. The predictions are then converted back to the original price scale using the scaler's stored mean and variance for adj_close (index 0, since adj_close is the first column in cols_to_scale). To measure accuracy, it computes the root mean square error (RMSE) and mean absolute percentage error (MAPE) between the actual values y_train and the rescaled predictions est. RMSE measures the average error magnitude in dollars, while get_mape expresses the average error as a percentage. These metrics describe the model's fit on the training data.
# Plot adjusted close over time
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
est_df = pd.DataFrame({'est': est,
'date': train['date']})
ax = train.plot(x='date', y='adj_close', style='b-', grid=True)
ax = cv.plot(x='date', y='adj_close', style='y-', grid=True, ax=ax)
ax = test.plot(x='date', y='adj_close', style='g-', grid=True, ax=ax)
ax = est_df.plot(x='date', y='est', style='r-', grid=True, ax=ax)
ax.legend(['train', 'dev', 'test', 'predictions'])
ax.set_xlabel("date")
ax.set_ylabel("USD")
ax.set_title('Without scaling')
This code plots adjusted closing prices over time using the matplotlib library. It sets the plot size to 10x8 inches, creates a dataframe est_df with estimated values and dates, and then plots the train, cv, test, and est_df data with unique colors and styles. It adds a legend to identify each dataset and labels the axes and title of the plot. Simply put, the code generates a clear and labeled graph displaying adjusted closing prices for different datasets.
Predict On Dev Set
# Do prediction on dev set
est_scaled = model.predict(X_cv_scaled)
cv['est_scaled'] = est_scaled
cv['est'] = cv['est_scaled'] * cv['adj_close_std'] + cv['adj_close_mean']
# Calculate RMSE
rmse_bef_tuning = math.sqrt(mean_squared_error(y_cv, cv['est']))
print("RMSE on dev set = %0.3f" % rmse_bef_tuning)
# Calculate MAPE
mape_bef_tuning = get_mape(y_cv, cv['est'])
print("MAPE on dev set = %0.3f%%" % mape_bef_tuning)
This code predicts on the scaled dev set and stores the result in a new column est_scaled. It then rescales these predictions to the original price range using each row's adj_close_mean and adj_close_std, the rolling statistics of the previous N days. It calculates and displays two accuracy measurements: the root mean square error (RMSE), which shows the average prediction error in dollars, and the mean absolute percentage error (MAPE), which expresses the error as a percentage of the actual values. These are the baseline numbers before any hyperparameter tuning.
# Plot adjusted close over time
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
est_df = pd.DataFrame({'est': cv['est'],
'y_cv': y_cv,
'date': cv['date']})
ax = train.plot(x='date', y='adj_close', style='b-', grid=True)
ax = cv.plot(x='date', y='adj_close', style='y-', grid=True, ax=ax)
ax = test.plot(x='date', y='adj_close', style='g-', grid=True, ax=ax)
ax = est_df.plot(x='date', y='est', style='r-', grid=True, ax=ax)
ax.legend(['train', 'dev', 'test', 'predictions'])
ax.set_xlabel("date")
ax.set_ylabel("USD")
This code produces a 10x8 plot for better visibility. It builds a small dataframe est_df holding the dev-set predictions (est), the actual values (y_cv), and the dates. It then plots the training data as a blue line with a grid, adds the dev and test data in yellow and green on the same axes, and overlays the predictions as a red line. A legend and axis labels finish the figure. The purpose is to compare predicted versus actual adjusted closing prices over time in one view.
# Plot adjusted close over time, for dev set only
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
ax = train.plot(x='date', y='adj_close', style='b-', grid=True)
ax = cv.plot(x='date', y='adj_close', style='y-', grid=True, ax=ax)
ax = test.plot(x='date', y='adj_close', style='g-', grid=True, ax=ax)
ax = est_df.plot(x='date', y='est', style='r-', grid=True, ax=ax)
ax.legend(['train', 'dev', 'test', 'predictions'])
ax.set_xlabel("date")
ax.set_ylabel("USD")
ax.set_xlim([date(2017, 8, 1), date(2018, 5, 31)])
ax.set_title("Zoom in to dev set")
The code sets up a plot with a size of 10 by 8 inches. It then plots the training data with a blue line and a grid, using date for the x-axis and adj_close for the y-axis. It adds the development set data with a yellow line and the test set data with a green line, both with grids and using the same axes. Predictions are plotted with a red line, with est on the y-axis. It includes a legend for each line, labels the x-axis as date and the y-axis as USD, defines the x-axis limits from August 1st, 2017 to May 31st, 2018, and adds a title.
The predictions capture the turns in direction, with a slight lag.
# View a list of the features and their importance scores
imp = list(zip(train[features], model.feature_importances_))
imp.sort(key=lambda tup: tup[1])
imp[-10:]
This code builds a list called imp by pairing each feature name with its importance score from the trained model. It then sorts the list by score in ascending order and displays the last ten entries, i.e. the ten most important features.
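To visualize the same information, the importance scores can be plotted as a horizontal bar chart. This is a small optional sketch reusing the imp list built above:

# Bar chart of the ten most important features (sketch)
imp_df = pd.DataFrame(imp[-10:], columns=['feature', 'importance'])
ax = imp_df.plot.barh(x='feature', y='importance', legend=False, grid=True)
ax.set_xlabel("importance score")
ax.set_title("Top 10 features by XGBoost importance")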
Final Model
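The final model is trained on the combined train+dev set with tuned hyperparameters (n_estimators_opt, max_depth_opt, and so on). The tuning step itself is not shown in this excerpt; the sketch below illustrates one way such values could be obtained with a small grid search scored on the dev set, using train_pred_eval_model. The candidate grids are purely illustrative, and the remaining *_opt values are simply set to the defaults so the final-model call runs:

# Illustrative grid search over two hyperparameters, scored by RMSE on the dev set.
# All other parameters are held at the defaults from the input-params block.
import itertools

best_rmse = float('inf')
n_estimators_opt, max_depth_opt = n_estimators, max_depth   # fall back to defaults
for n_est, depth in itertools.product([50, 100, 200], [2, 3, 5]):
    rmse_cv, _, _ = train_pred_eval_model(X_train_scaled, y_train_scaled,
                                          X_cv_scaled, y_cv,
                                          cv['adj_close_mean'], cv['adj_close_std'],
                                          seed=model_seed,
                                          n_estimators=n_est,
                                          max_depth=depth,
                                          learning_rate=learning_rate,
                                          min_child_weight=min_child_weight,
                                          subsample=subsample,
                                          colsample_bytree=colsample_bytree,
                                          colsample_bylevel=colsample_bylevel,
                                          gamma=gamma)
    if rmse_cv < best_rmse:
        best_rmse = rmse_cv
        n_estimators_opt, max_depth_opt = n_est, depth

# Keep the remaining *_opt values at their defaults (they could be tuned the same way)
learning_rate_opt = learning_rate
min_child_weight_opt = min_child_weight
subsample_opt = subsample
colsample_bytree_opt = colsample_bytree
colsample_bylevel_opt = colsample_bylevel
gamma_opt = gamma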
rmse, mape, est = train_pred_eval_model(X_train_cv_scaled,
                                        y_train_cv_scaled,
                                        X_sample_scaled,
                                        y_sample,
                                        test['adj_close_mean'],
                                        test['adj_close_std'],
                                        seed=model_seed,
                                        n_estimators=n_estimators_opt,
                                        max_depth=max_depth_opt,
                                        learning_rate=learning_rate_opt,
                                        min_child_weight=min_child_weight_opt,
                                        subsample=subsample_opt,
                                        colsample_bytree=colsample_bytree_opt,
                                        colsample_bylevel=colsample_bylevel_opt,
                                        gamma=gamma_opt)
# Calculate RMSE
print("RMSE on test set = %0.3f" % rmse)
# Calculate MAPE
print("MAPE on test set = %0.3f%%" % mape)
The first line calls train_pred_eval_model with the scaled train+dev data, the unscaled test targets, the rolling mean and std columns used to rescale the predictions, and the tuned hyperparameters (the *_opt values; a sketch of how such values might be obtained appears above). The function trains the model, predicts on the test set, and returns the error metrics and predictions. The code then prints the RMSE, showing how far the model's predictions are from the real test values on average, and the MAPE, which gives the average percentage error.
# Plot adjusted close over time
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
est_df = pd.DataFrame({'est': est,
'y_sample': y_sample,
'date': test['date']})
ax = train.plot(x='date', y='adj_close', style='b-', grid=True)
ax = cv.plot(x='date', y='adj_close', style='y-', grid=True, ax=ax)
ax = test.plot(x='date', y='adj_close', style='g-', grid=True, ax=ax)
ax = est_df.plot(x='date', y='est', style='r-', grid=True, ax=ax)
ax.legend(['train', 'dev', 'test', 'predictions'])
ax.set_xlabel("date")
ax.set_ylabel("USD")
This code makes a graph of the adjusted close price over time. It sets the figure size to 10 by 8 with rcParams and creates a dataframe with the test-set predictions (est), actual values (y_sample), and dates. It then plots the training data as a blue line, the dev data as a yellow line, and the test data as a green line on the same axes, with date on the x-axis and adj_close on the y-axis and grid lines for readability. The predictions are added as a red line, a legend explains each line, and the axes are labeled date and USD. The result is a chart comparing the actual train, dev, and test prices with the predicted values over time.
# Plot adjusted close over time, for test set only
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
ax = train.plot(x='date', y='adj_close', style='b-', grid=True)
ax = cv.plot(x='date', y='adj_close', style='y-', grid=True, ax=ax)
ax = test.plot(x='date', y='adj_close', style='g-', grid=True, ax=ax)
ax = est_df.plot(x='date', y='est', style='r-', grid=True, ax=ax)
ax.legend(['train', 'dev', 'test', 'predictions'])
ax.set_xlabel("date")
ax.set_ylabel("USD")
ax.set_xlim([date(2018, 4, 1), date(2018, 11, 30)])
ax.set_ylim([130, 155])
ax.set_title("Zoom in to test set")
This code creates a 10x8 chart showing the adjusted close over time, zoomed in on the test period. It plots the train, dev, test, and predicted series with the same colors, grids, and legend as before, labels the x-axis as date and the y-axis as USD, and restricts the view to April 1 through November 30, 2018 and 130 to 155 USD. The title, "Zoom in to test set", indicates the focus on the test data.
As with the dev set, the predictions capture the turns in direction with a slight lag.
# Plot adjusted close over time, only for test set
rcParams['figure.figsize'] = 10, 8 # width 10, height 8
matplotlib.rcParams.update({'font.size': 14})
ax = test.plot(x='date', y='adj_close', style='gx-', grid=True)
ax = est_df.plot(x='date', y='est', style='rx-', grid=True, ax=ax)
ax.legend(['test', 'predictions using xgboost'], loc='upper left')
ax.set_xlabel("date")
ax.set_ylabel("USD")
ax.set_xlim([date(2018, 4, 23), date(2018, 11, 23)])
ax.set_ylim([130, 155])
This code creates a final chart comparing actual and predicted prices on the test set. The figure is 10 by 8 inches with 14-point text. The actual adjusted close is plotted with green crosses (date on the x-axis, adj_close on the y-axis) and the XGBoost predictions with red crosses from est_df, both on a grid. The legend sits in the upper left corner, the axes are labeled date and USD, the x-axis is limited to April 23 through November 23, 2018, and the y-axis to 130 to 155 USD.