Nickolas Discolll

Summary

This content discusses a comprehensive guide to using machine learning for a pairs trading strategy, focusing on cluster-based stock analysis.

Abstract

The content begins by introducing the concept of using machine learning techniques to identify and capitalize on mean-reverting relationships in stock pairs. It then outlines the process of categorizing stocks into distinct clusters based on various factors, using algorithms like Principal Component Analysis (PCA) and DBSCAN for unsupervised learning. The guide aims to simplify financial data and reveal hidden correlations and patterns in the stock market. The content also includes code snippets for data analysis and machine learning tasks, using libraries such as matplotlib, sklearn, statsmodels, and quantopian.

Opinions

  • The use of machine learning offers an innovative edge in the financial trading realm, particularly in pairs trading strategies.
  • Categorizing stocks into distinct clusters based on various factors can help simplify financial data and reveal hidden correlations and patterns in the stock market.
  • Principal Component Analysis (PCA) and DBSCAN are effective algorithms for unsupervised learning in this context.
  • The guide provides a clear path to mastering pairs trading with machine learning, from initial data standardization to the final selection of stock pairs.
  • The code snippets provided can be used for data analysis and machine learning tasks, utilizing libraries such as matplotlib, sklearn, statsmodels, and quantopian.
  • The content suggests that stocks historically influenced by similar factors are likely to exhibit correlated behaviors in the future.
  • The process of selecting principal components is crucial in compressing the feature space and making the model more robust.

Harnessing Machine Learning for Pairs Trading Strategy

A Comprehensive Guide to Cluster-Based Stock Analysis

In the dynamic realm of financial trading, machine learning offers an innovative edge, particularly in the strategy of pairs trading. This guide delves into the intricacies of utilizing machine learning techniques to identify and capitalize on mean-reverting relationships in stock pairs. By blending pricing data with fundamental and industry-specific insights, the approach aims to uncover hidden correlations and patterns in the stock market.

The journey begins with a thorough analysis of stocks, categorizing them into distinct clusters based on a variety of factors. Leveraging algorithms like Principal Component Analysis (PCA) and DBSCAN for unsupervised learning, the methodology focuses on dimensionality reduction and sensible clustering. This process not only simplifies the vast financial data but also reveals promising stock pairs for further investigation. The guide will walk you through each step of this sophisticated process, from initial data standardization to the final selection of stock pairs, offering a clear path to mastering pairs trading with machine learning.

import matplotlib.pyplot as plt
import matplotlib.cm as cm

import numpy as np
import pandas as pd

from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

from statsmodels.tsa.stattools import coint

from scipy import stats

from quantopian.pipeline.data import morningstar
from quantopian.pipeline.filters.morningstar import Q500US, Q1500US, Q3000US
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline

This block imports the libraries used for data analysis and machine learning: matplotlib, scikit-learn (sklearn), statsmodels, and quantopian. The import matplotlib.pyplot as plt line brings in matplotlib's pyplot submodule for creating charts and plots, and import matplotlib.cm as cm brings in its colormap submodule. numpy and pandas are imported for data manipulation and analysis. From scikit-learn, KMeans and DBSCAN provide clustering, PCA and TSNE provide dimensionality reduction, and preprocessing together with StandardScaler provide feature scaling. From statsmodels, the coint function is imported for cointegration testing.

Finally, the quantopian lines import pipeline modules and filters from Quantopian, a platform for algorithmic trading and research. Together, the imported tools cover the rest of the workflow: clustering with KMeans and DBSCAN, dimensionality reduction with PCA, cointegration testing with coint from statsmodels, and data acquisition and screening through the Quantopian pipeline API.

Acquire Stock Info

study_date = "2016-12-31"

universe = Q1500US()

pipe = Pipeline(
    columns= {
        'Market Cap': morningstar.valuation.market_cap.latest.quantiles(5),
        'Industry': morningstar.asset_classification.morningstar_industry_group_code.latest,
        'Financial Health': morningstar.asset_classification.financial_health_grade.latest
    },
    screen=universe
)

res = run_pipeline(pipe, study_date, study_date)
res.index = res.index.droplevel(0)  # drop the single date from the multi-index

print res.shape
# print res.head()

This block runs on the Quantopian research platform. The first line assigns the study date, 2016-12-31, to study_date. The next line builds the trading universe with Q1500US(), a filter that selects a liquid set of roughly 1,500 US stocks. A Pipeline is then created that specifies the columns to compute for each stock: a market-cap quintile, the Morningstar industry group code, and the financial health grade. The screen parameter restricts the pipeline to the chosen universe. run_pipeline executes the pipeline with the same start and end date (the single study date), and the result is stored in res. The next line drops the date level from the multi-index, leaving only the securities and their associated data, and the DataFrame's shape is printed (printing the head is left commented out for inspection).

Remove Undesired Stocks

# remove stocks in Industry "Conglomerates"
res = res[res['Industry']!=31055]
print res.shape

This code removes stocks in the Conglomerates industry (industry code 31055) from the dataset stored in res. The boolean expression keeps only the rows whose Industry column does not equal 31055, and the filtered dataset is assigned back to res. Printing the shape afterwards shows how many rows and columns remain, letting the user verify the effect of the filter.

# replace the categorical data with numerical scores per the docs
res['Financial Health'] = res['Financial Health'].astype('object')
health_dict = {u'A': 0.1,
               u'B': 0.3,
               u'C': 0.7,
               u'D': 0.9,
               u'F': 1.0}
res = res.replace({'Financial Health': health_dict})

The code replaces the categorical Financial Health grades with numerical scores. First the column is converted to the object dtype with astype. A dictionary then maps each letter grade to a score: A to 0.1, B to 0.3, C to 0.7, D to 0.9, and F to 1.0. The replace method substitutes the corresponding score for each grade in the Financial Health column. Working with numerical values makes the subsequent analysis and scaling steps easier than working with categorical labels.

Define Time Horizon

pricing = get_pricing(
    symbols=res.index,
    fields='close_price',
    start_date=pd.Timestamp("2016-12-31"),
    end_date=pd.Timestamp("2017-12-31")
)

In this code, Quantopian's get_pricing function retrieves historical closing prices for the symbols in res's index. The start and end dates, 31 December 2016 and 31 December 2017, are passed as pandas Timestamp objects, and the result is stored in the pricing variable. In short, the code pulls a year of closing prices for the screened universe.

pricing.shape

The Python code will return a tuple containing the pricing data object’s dimensions. It indicates the size of the data set by showing the number of rows and columns.

#change price to pct change
returns = pricing.pct_change()
# returns.head()

The code converts prices to returns. The first line applies pct_change to the pricing data, which computes the percentage change between consecutive prices, and stores the result in a new variable called returns. The commented-out returns.head() line can be uncommented to inspect the first few rows of computed returns. This transformation is the standard way to analyze price changes over time.

returns.shape

The code reports the dimensions of the returns DataFrame. The .shape attribute (accessed without parentheses, since it is an attribute rather than a method) returns a tuple of (rows, columns). For example, if returns had 100 rows and 5 columns, returns.shape would return (100, 5). It is a quick sanity check on the size of the dataset.

# we can only work with stocks that have the full return series
returns = returns.iloc[1:,:].dropna(axis=1)
print returns.shape
# print returns.head()

This line uses .iloc to slice the returns DataFrame: 1: selects every row except the first (which is NaN because pct_change has no prior price to compare against), and : selects all columns. .dropna(axis=1) then drops any column that still contains missing values, so only stocks with a complete return series remain. The shape of the result is printed, and the commented-out .head() call can be used to inspect the first few rows and verify the data looks as expected.

Identifying Potential Stock Pairs

To explore potential mean-reverting relationships among stocks, we start by categorizing them into clusters based on pricing, fundamental, and industry/sector data.

Our hypothesis is that stocks historically influenced by similar factors are likely to exhibit correlated behaviors in the future.

We begin with Principal Component Analysis (PCA) to simplify the returns data, focusing on historical latent factors that influence each stock.

Next, we apply the DBSCAN unsupervised learning algorithm for clustering. The choice of DBSCAN is strategic because:

  • It excludes stocks that don’t align well with any cluster.
  • It eliminates the need to predefine the number of clusters.

This clustering process will help us identify viable pairs of stocks. Further validation of these pairs will be conducted in the subsequent phase.
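
As a preview of that subsequent phase, the sketch below shows the kind of check that the coint import above makes possible: an Engle-Granger cointegration test on one candidate pair. It is a hedged illustration only; the pair (simply the first two columns of pricing) and the 0.05 cutoff are assumptions rather than the author's actual selection rule, and in practice the test would be applied to pairs drawn from within each cluster.

# Hedged sketch of the later validation step, not the author's exact code.
# The pair (first two price columns) and the 0.05 cutoff are assumptions.
pair_prices = pricing.iloc[:, [0, 1]].dropna()   # keep aligned, non-missing rows
score, pvalue, crit_values = coint(pair_prices.iloc[:, 0],
                                   pair_prices.iloc[:, 1])
print("cointegration p-value: %.4f" % pvalue)
if pvalue < 0.05:
    print("candidate pair passes the cointegration screen")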

Principal Component Analysis To Reduce Dimension

# RESTART from here!!!
X = returns.ix[:,].values.T
X.shape

Here .ix selects all rows and columns of the returns DataFrame, .values converts the result to a NumPy array, and .T transposes it. In the resulting matrix X, each row corresponds to a stock and each column to a daily return observation, which is the orientation PCA needs when stocks are treated as samples and their return history as features. The final line reports the matrix's shape.

Standardize The Data

X_std = StandardScaler().fit_transform(X)

This code standardizes the NumPy array X with scikit-learn's StandardScaler, subtracting each feature's mean and dividing by its standard deviation. fit_transform fits the scaler to the data and applies the transformation in one step, returning a NumPy array of scaled values. Standardization puts all features on a comparable scale, which many machine learning algorithms, PCA included, require.

mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0]-1)
# print('Covariance matrix \n%s' %cov_mat)
cov_mat.shape

np.mean with axis=0 computes the mean of each column (each feature) of X_std and stores the result in mean_vec. The second line computes the covariance matrix: the mean vector is subtracted from X_std, the transpose of the centered matrix is multiplied by the centered matrix with .dot, and the result is divided by X_std.shape[0] - 1, the number of observations minus one, which gives the unbiased sample covariance. The matrix is stored in cov_mat, and its shape is checked on the third line to confirm the calculation produced a square matrix of the expected size.

# Next, we perform an eigendecomposition on the covariance matrix
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors', eig_vecs.shape)
print('Eigenvalues', eig_vals.shape)

Here the covariance matrix is recomputed with np.cov, called with the transpose of X_std as its argument, and stored in cov_mat. The eigendecomposition is then performed with np.linalg.eig, which takes the covariance matrix and returns two arrays: eig_vals, containing the eigenvalues, and eig_vecs, containing the corresponding eigenvectors.

Finally, the shapes of the eigenvector and eigenvalue arrays are printed. The eigenvectors represent the directions of greatest variation in the original dataset, while the eigenvalues measure how much variance each eigenvector explains. Together they are the building blocks for reducing the dataset to its most important dimensions.

Select Principal Components

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort()
eig_pairs.reverse()

The code pairs each eigenvalue with its eigenvector. The first line iterates over the eigenvalues and builds a list of tuples, each containing the absolute value of an eigenvalue and the corresponding eigenvector (the i-th column of eig_vecs). The list is then sorted and reversed, so the pairs end up ordered from the largest eigenvalue to the smallest. Ranking the eigenpairs this way is the standard preparation step for PCA, since the components with the largest eigenvalues carry the most variance.

Once the eigenpairs are organized, the crucial decision is determining the number of principal components to include in our new feature space. To aid this decision, we use the concept of “explained variance,” derived from the eigenvalues. Explained variance quantifies the amount of information (in terms of variance) that each principal component contributes.

tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

This code first sums all the eigenvalues in eig_vals. It then builds a list called var_exp holding the explained variance of each component as a percentage: each eigenvalue is divided by the total and multiplied by 100, with the eigenvalues sorted in descending order via reverse=True. Finally, numpy's cumsum computes the running total of these percentages and stores it in cum_var_exp, so each entry gives the cumulative share of variance explained by the components up to that point. This is the quantity used to decide how many principal components to keep.

We want to select the features that account for 90% of the variance.

cum_var_exp = cum_var_exp[cum_var_exp[:] < 50]

valid_cols = cum_var_exp.shape[0]
orig_cols = len(eig_vals)

print("orig_cols", orig_cols)
print("valid_cols", valid_cols)

The first line keeps only the elements of cum_var_exp that are below the 50% cumulative-variance cutoff, filtering out everything at or above it. valid_cols is then set to the number of remaining entries, which is the number of principal components to retain, while orig_cols records the original number of eigenvalues. The two print statements report the original and retained column counts so the effect of the cutoff can be checked.

In this process, we’re compressing the feature space from 250 dimensions down to a subspace of 167 dimensions. This is achieved by selecting the “top 167” eigenvectors, based on their eigenvalues, to form our d×k-dimensional eigenvector matrix W.

# init matrix_w
matrix_w = np.hstack(eig_pairs[0][1]).reshape(orig_cols,1)

# concatenate matrix_w
for i in range(valid_cols-1):
    temp_vec = np.hstack(eig_pairs[i+1][1]).reshape(orig_cols,1)
    matrix_w = np.hstack((matrix_w,temp_vec))

This code builds the projection matrix matrix_w using NumPy's horizontal stacking function, hstack. It is initialized with the first eigenvector from eig_pairs, reshaped into a column vector with orig_cols rows. The for loop then iterates over the remaining retained eigenvectors, reshapes each into a matching column vector, and appends it to matrix_w with hstack. The result is a matrix whose columns are the top eigenvectors, which is exactly the matrix used in principal component analysis (PCA) to project the data onto a lower-dimensional space while keeping the most important directions of variance.

Projection Onto the New Feature Space

  • X: initial matrix
  • X_std: matrix after standardization
  • X_PCA: matrix after PCA dimensionality reduction
  • X_TO_CLST: matrix prepared for clustering

X_PCA = X_std.dot(matrix_w)

print("shape of new features matrix: ", X_PCA.shape)

NumPy's dot performs a matrix multiplication between the standardized data matrix X_std and the projection matrix matrix_w, projecting the original data onto the selected principal components. The result is the new feature matrix X_PCA. The print statement then displays its shape, so the user can confirm how many principal components were kept and that the dimensions are correct before moving on.

We have used PCA to reduce the dimensionality. Now let's add some fundamental values as well to make the model more robust.

X_AFT_PCA = np.hstack(
    (X_PCA,
     res['Market Cap'][returns.columns].values[:, np.newaxis],
     res['Financial Health'][returns.columns].values[:, np.newaxis])
    )

print X_AFT_PCA.shape

The code builds X_AFT_PCA by horizontally stacking X_PCA with two extra columns using np.hstack. The first argument is X_PCA, the PCA-reduced features. The expression res['Market Cap'][returns.columns].values[:, np.newaxis] pulls the market-cap values from the res DataFrame for the stocks that survived the return filtering, and the Financial Health column is retrieved the same way. np.newaxis adds a second axis so each series becomes a column vector that can be stacked alongside X_PCA. The result contains all the PCA features plus two additional columns for market cap and financial health, and the print statement displays the array's shape.

DBSCAN Clustering to Find Pairs

Determining the Ideal Epsilon for DBSCAN

Our approach primarily incorporates strategies from a specific study. This involves calculating the average distances of each point to its k nearest neighbors in a data matrix, where k is defined by the user and aligns with MinPts.

We then arrange these k-distances in ascending order. The objective is to identify the “knee” of this ordered sequence, indicative of the optimal epsilon (eps) value.

This “knee” represents a point of significant change on the k-distance curve, marking a suitable threshold.

Following recommendations from the original DBSCAN publication, minPts is selected based on the data’s dimensionality, while eps is chosen based on the noticeable “elbow” in the k-distance graph.

from sklearn.neighbors import NearestNeighbors

# norm_data = MinMaxScaler()
# X1 = norm_data.fit_transform(Y)

X_TO_CLST = preprocessing.StandardScaler().fit_transform(X_AFT_PCA)

nbrs = NearestNeighbors(n_neighbors=10).fit(X_TO_CLST)
distances, indices = nbrs.kneighbors(X_TO_CLST)

This block prepares the data for a nearest-neighbor analysis with scikit-learn. The first line imports the NearestNeighbors class. The two commented-out lines show an optional MinMaxScaler normalization (producing X1) that is not used here; normalization of this kind can help nearest-neighbor calculations, but the active code relies on standardization instead.

X_TO_CLST is created by standardizing X_AFT_PCA with preprocessing.StandardScaler, which puts every feature on a comparable scale. A NearestNeighbors model with n_neighbors=10 is then fitted to the standardized data and stored in nbrs, and its kneighbors method returns, for each data point, the distances to its nearest neighbors and the indices of those neighbors.

In the output, indices holds the index of each neighbor and distances holds the corresponding distances from each point to its neighbors. These distances are the raw material for the k-distance curve built in the next block: visualizing how far each point is from its neighbors reveals how densely the data is packed, which is exactly what DBSCAN's eps parameter needs to capture.

distances = distances.mean(axis=1)
distances = np.sort(distances, axis=0)
print(distances)
plt.plot(distances)

The first line takes the mean along axis=1, so each point is reduced to its average distance to its k nearest neighbors, and the resulting array is assigned back to distances. np.sort then orders these average distances ascending, the sorted values are printed, and Matplotlib's plot draws the k-distance curve with the distance value on the y-axis and the point index on the x-axis. The "knee" of this curve is what guides the choice of eps.
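
The eps value used in the next block was presumably read off this plot by eye. As a hedged alternative (an assumption, not the author's method), a crude numerical estimate of the elbow can be taken from the point of maximum curvature of the sorted k-distance curve using a discrete second difference:

# Hedged sketch: crude elbow estimate from the sorted k-distance curve.
# The max-second-difference heuristic is an assumption, not the author's method.
second_diff = np.diff(distances, n=2)        # discrete curvature proxy
knee_idx = int(np.argmax(second_diff)) + 1   # +1 roughly realigns after differencing
print("suggested eps near the elbow: %.3f" % distances[knee_idx])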

clf = DBSCAN(eps=5, min_samples=2)
print clf

The code creates a DBSCAN model with eps=5 and min_samples=2. eps is the maximum distance between two points for them to be considered neighbors, and min_samples is the minimum number of points needed within that radius for a point to count as a core point of a cluster. The model is stored in clf, and the print statement displays its type and parameters.

clf.fit(X_TO_CLST)
labels = clf.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print "\nClusters discovered: %d" % n_clusters_

clustered = clf.labels_

This code fits the scikit-learn DBSCAN model to the data. In the first line, clf.fit(X_TO_CLST) runs the clustering on the feature matrix. The labels variable is then assigned clf.labels_, the predicted cluster label for each stock. n_clusters_ is computed as len(set(labels)) minus one if -1 appears among the labels, because DBSCAN uses the label -1 for noise points that could not be assigned to any cluster, and that noise "cluster" should not be counted.

The number of clusters discovered, n_clusters_, is printed to the console, and the predicted labels are also assigned to the variable clustered for later analysis and visualization. In short, the clustering groups similar stocks together while setting aside the outliers that do not fit any cluster.

# the initial dimensionality of the search was
ticker_count = len(returns.columns)
print "Total pairs possible in universe: %d " % (ticker_count*(ticker_count-1)/2)

The code sets ticker_count to the number of columns (tickers) in the returns dataset. The number of possible pairs is then ticker_count * (ticker_count - 1) / 2, the standard formula for choosing two tickers without regard to order, and the result is printed as a formatted string. This shows how large the search space would be if every pair in the universe had to be evaluated, which is the motivation for clustering first.

clustered_series = pd.Series(index=returns.columns, data=clustered.flatten())
clustered_series_all = pd.Series(index=returns.columns, data=clustered.flatten())
clustered_series = clustered_series[clustered_series != -1]

The first line creates a pandas Series called clustered_series whose index is the returns columns (the tickers) and whose values are the flattened cluster labels produced by DBSCAN. The second line builds clustered_series_all with the same index and values, preserving a copy that still includes the noise points. Finally, all entries equal to -1 are removed from clustered_series, so it contains only the stocks that were actually assigned to a cluster and is ready for further analysis and visualization.

CLUSTER_SIZE_LIMIT = 9999
counts = clustered_series.value_counts()
ticker_count_reduced = counts[(counts>1) & (counts<=CLUSTER_SIZE_LIMIT)]
print "Clusters formed: %d" % len(ticker_count_reduced)
print "Pairs to evaluate: %d" % (ticker_count_reduced*(ticker_count_reduced-1)).sum()

CLUSTER_SIZE_LIMIT is set to 9999 on the first line and acts as an upper bound on acceptable cluster sizes. The second line calls value_counts on clustered_series, which counts how many stocks fall into each cluster. The third line creates ticker_count_reduced by keeping only the clusters whose count is greater than 1 and at most CLUSTER_SIZE_LIMIT, filtering out singleton clusters (and any implausibly large ones).

The last two lines report the number of clusters formed and the number of pairs to evaluate. The cluster count is simply the length of ticker_count_reduced. The pair count multiplies each cluster size by itself minus one and sums the results, which counts the ordered pairs within each cluster. Because only pairs within the same cluster are considered, clustering dramatically shrinks the search space compared with the full universe of possible pairs computed earlier.
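
The plotting code below references X_tsne, a two-dimensional t-SNE embedding of the clustering features that is not produced by any of the snippets shown above. A minimal sketch using the TSNE class imported earlier might look like the following; the parameter values are assumptions rather than the original notebook's settings.

# Hedged sketch: embed the clustering features in 2-D for the scatter plots below.
# n_components and random_state are assumptions; the original settings are not shown.
X_tsne = TSNE(n_components=2, random_state=1337).fit_transform(X_TO_CLST)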

plt.figure(1, facecolor='white')
plt.clf()
plt.axis('off')

plt.scatter(
    X_tsne[(labels!=-1), 0],
    X_tsne[(labels!=-1), 1],
    s=100,
    alpha=0.85,
    c=labels[labels!=-1],
    cmap=cm.Paired
)

plt.scatter(
    X_tsne[(clustered_series_all==-1).values, 0],
    X_tsne[(clustered_series_all==-1).values, 1],
    s=100,
    alpha=0.05
)

plt.title('T-SNE of all Stocks with DBSCAN Clusters Noted');

The first line creates a new figure with a white background, the second clears the current figure, and the third turns the axes off. The scatter calls then plot the points. The first call plots the stocks that were assigned to a cluster (labels not equal to -1), using the first column of X_tsne for the x-values and the second column for the y-values; s sets the point size to 100, alpha sets the transparency to 0.85, and c colors the points by cluster label using the Paired colormap. The second call plots the unclustered stocks (where clustered_series_all equals -1) in a single color with alpha 0.05, so they appear as faint background dots. A title is added at the end. The result is a scatter plot of the t-SNE embedding with the DBSCAN clusters highlighted.

plt.barh(
    xrange(len(clustered_series.value_counts())),
    clustered_series.value_counts()
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster')
plt.ylabel('Cluster Number');

This code draws a horizontal bar plot with matplotlib.pyplot showing how many stocks fall into each cluster, with the horizontal axis representing the count and the vertical axis the cluster position. clustered_series holds the cluster assignment for each stock. The first argument to plt.barh is a range as long as the number of clusters, which gives the vertical positions of the bars, and the second argument is the value counts themselves, which set the bar lengths. The title and axis labels add context, and the trailing semicolon suppresses the default text output in a notebook. The result is a quick visual of the cluster size distribution.

# get the number of stocks in each cluster
counts = clustered_series.value_counts()

# let's visualize some clusters
cluster_vis_list = list(counts[(counts<20) & (counts>1)].index)[::-1]

# plot a handful of the smallest clusters
for clust in cluster_vis_list[0:min(len(cluster_vis_list), 3)]:
    tickers = list(clustered_series[clustered_series==clust].index)
    means = np.log(pricing[tickers].mean())
    data = np.log(pricing[tickers]).sub(means)
    data.plot(title='Stock Time Series for Cluster %d' % clust)

This code visualizes a few of the smaller clusters. The first line counts the number of stocks in each cluster. The second builds a list of clusters containing more than one but fewer than 20 stocks, reversed so the smaller clusters come first. The for loop then iterates over up to three of these clusters. Inside the loop, the tickers belonging to the current cluster are collected, the log of each ticker's mean price is computed, and that log mean is subtracted from the log of the price series to put the stocks on a comparable scale. Finally, the normalized time series for the cluster is plotted. Looking at these plots makes it easy to spot clusters whose members move together.

which_cluster = clustered_series.loc[symbols('JPM')]
clustered_series[clustered_series == which_cluster]

tickers = list(clustered_series[clustered_series==which_cluster].index)
means = np.log(pricing[tickers].mean())
data = np.log(pricing[tickers]).sub(means)
data.plot(legend=False, title="Stock Time Series for Cluster %d" % which_cluster);

This code looks up the cluster that contains JPMorgan. The .loc call on clustered_series returns the cluster label for the symbol JPM and stores it in which_cluster; the next expression shows every stock carrying that same label. The tickers in that cluster are then collected into a list via the Series index. As before, the log of each ticker's mean price is computed, subtracted from the log of its price series to normalize the data, and the normalized series for the whole cluster are plotted, with which_cluster used in the title. In short, the code picks a company of interest, finds its cluster mates, and plots their normalized time series so trends within the cluster can be compared visually.

Machine Learning
Deep Learning
Data Science
Finance