Sebastien M. Laignel


A Code-Free Clustering of the top 1500 Liquid US Stocks Universe

A First Step in Alpha Research

Published on 13th August 2023. Last updated on 6th October (review of content).

In the dynamic world of finance, traditional analytical methods, while still potent, often struggle with today’s vast and intricate datasets. Machine learning, with clustering techniques such as k-means, offers a promising solution to this challenge. In this article, we’ll harness OpenAI ChatGPT-4 to cluster a set of 1500 US stocks, all without a single line of code. This article represents an initial exploration within the framework I outlined in my preceding article.

Important Disclaimer: The insights offered here are strictly educational. The robustness of our model, especially concerning metric selection, hasn’t been cross-referenced with established academic research or extensively back-tested. This piece shouldn’t be interpreted as an investment guide or a definitive market analysis. I’m not equipped to provide financial counsel, and readers are urged to conduct comprehensive due diligence before any investment actions.

Step 1: Data Extraction from Finviz

As highlighted in a previous article, I’ve consistently underscored the analytical prowess of Finviz. Using its capabilities, I’ve drawn a detailed set of approximately 40 financial ratios from the top 1500 liquid stocks, each boasting an average trading volume above 1M.

Step 2: Data Formatting before ChatGPT-4 Upload

Before processing data through ChatGPT-4, it's imperative to structure it appropriately, preferably in a format like Excel or CSV. Extraneous columns should be removed, and headers should be descriptive and neatly labeled to facilitate subsequent steps.
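Although this exercise is code-free, the same tidy-up is easy to script ahead of the promised Python version. A minimal pandas sketch, where the column names are invented for illustration rather than the exact Finviz headers:

```python
import pandas as pd

# Hypothetical raw export with extraneous columns (names are assumptions,
# not the exact Finviz header set).
raw = pd.DataFrame({
    "No.": [1, 2],
    "Ticker": ["AAA", "BBB"],
    "P/E": [15.2, 32.1],
    "Chart": ["...", "..."],  # non-analytical column to discard
})

# Drop columns that carry no analytical value and tidy the headers.
clean = raw.drop(columns=["No.", "Chart"]).rename(columns={"P/E": "pe_ratio"})
clean.to_csv("universe_clean.csv", index=False)
```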

Step 3: Handling Missing Values

Incomplete data can skew clustering results. We’ll replace the missing values in the columns with the average from their counterparts within the same industry category. This will provide a more accurate fill for the missing data by considering the industry’s average values.
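In the Python version, this industry-wise imputation boils down to a pandas `groupby`/`transform`. A sketch on a toy frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Finviz export; names are illustrative.
df = pd.DataFrame({
    "Industry": ["Banks", "Banks", "Software", "Software"],
    "pe_ratio": [10.0, np.nan, 30.0, 34.0],
})

# Fill each missing value with the mean of its own industry group.
df["pe_ratio"] = df.groupby("Industry")["pe_ratio"].transform(
    lambda s: s.fillna(s.mean())
)
```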

Step 4: Data Normalization

K-means clustering is a distance-based algorithm, so feature scaling is crucial. We’ll normalize the data using the Min-Max scaling method, which scales the dataset such that all feature values are in the range [0, 1]. This scaling ensures that no particular feature dominates the clustering due to its scale or units.
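Min-Max scaling is simply x' = (x - min) / (max - min), applied per column. A pandas sketch on toy data:

```python
import pandas as pd

# Two metrics with very different scales and units.
df = pd.DataFrame({"beta": [0.5, 1.0, 1.5], "eps": [-2.0, 1.0, 4.0]})

# Min-Max scaling maps every column into [0, 1], so no single feature
# dominates the distance computation because of its units.
scaled = (df - df.min()) / (df.max() - df.min())
```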

Step 5: Selection of Metrics for Clustering

Selecting the right metrics for clustering is crucial as it determines the quality of the clusters. Given the vast number of metrics available, we can use a combination of domain knowledge and data-driven techniques to determine which metrics to use.

  1. Domain knowledge approach: Some metrics are more informative than others when determining the performance and financial health of a company. For example:
• P/E ratio: Indicates the price investors are willing to pay for each dollar of earnings.
• EPS (ttm): Earnings per share gives a snapshot of a company's profitability.
• Beta: Indicates the stock's volatility in comparison to the market.
• Gross Margin, Operating Margin, and Profit Margin: Give insight into a company's profitability at different stages of its operations.
• Performance (YTD): Year-to-date performance provides a snapshot of how the stock has performed recently.
  2. Data-driven approach: One could also use techniques like Principal Component Analysis (PCA) to reduce dimensionality or correlation matrices to ensure that the chosen metrics are not highly correlated.

We chose here a domain knowledge approach by selecting a subset of key financial metrics. These metrics give a well-rounded view of a company’s performance, valuation, and financial health. Let’s proceed with the mentioned metrics: P/E, EPS (ttm), Beta, Gross Margin, Operating Margin, Profit Margin, and Performance (YTD).

Step 6: 5-means Clustering

With our metrics in hand, we segmented the stocks into five groups by deploying the k-means clustering algorithm, emphasizing features such as the Price-to-Earnings (P/E) ratio and the Price-to-Book (P/B) ratio. Importantly, the sector of each firm was taken into account, ensuring relevant comparisons. The choice of five clusters, while initially arbitrary, strikes a balance between granularity and clarity; the number can be adjusted to the dataset's intricacies.
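For reference, here is what this step could look like in code. The actual run was performed inside ChatGPT-4, so the library choice (scikit-learn) and the synthetic stand-in data are my assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the 1500 stocks x 7 scaled metrics matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))

# 5-means clustering with several restarts for a stable result.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = km.fit_predict(X)
```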

Step 7: Evaluate the Reliability of the Clustering Using Silhouette Score

The silhouette score measures how close each point in one cluster is to the points in the neighboring clusters. Its values range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. The silhouette score for the sampled data is approximately 0.298. This value suggests that the clustering is reasonable, but there’s potential for improvement. Ideally, a silhouette score closer to 1 would indicate well-separated clusters.
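Computed in code, the score is one scikit-learn call on top of the fitted labels. A sketch on synthetic data (the 0.298 figure above comes from the actual dataset, not from this toy example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled metrics matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means better-separated clusters.
score = silhouette_score(X, labels)
```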

Step 8: Clusters Characterization

Let’s now analyze the 5 clusters to understand their internal properties, such as high P/E or low Profit Margin, and provide the number of stocks in each cluster. We’ll compute the mean of our selected metrics for each cluster to characterize them. When interpreting financial clusters, be wary of:
• Over-interpretation: Similar stocks in a cluster don’t guarantee uniformity across all facets.
• Temporal Stability: Financial metrics are dynamic. Yearly data can influence cluster formations.
• Feature Importance: Not all metrics hold equal weight. Correct feature selection and prioritization are vital.
• Outliers: Stocks with unique metrics can distort clusters. Spotting these outliers is essential.
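The per-cluster means and counts are a single `groupby` away. A sketch on synthetic data with invented metric names:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in: 100 stocks already assigned to 5 clusters.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cluster": rng.integers(0, 5, size=100),
    "pe_ratio": rng.normal(20, 5, size=100),
    "profit_margin": rng.normal(0.1, 0.05, size=100),
})

# Mean of each metric per cluster characterizes the cluster...
profile = df.groupby("cluster").mean()
# ...and the cluster sizes give the stock counts reported below.
sizes = df["cluster"].value_counts().sort_index()
```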

Here's a brief characterization of the 5 clusters based on the mean values of the selected metrics:

Cluster 0 (219 stocks):
• Moderate P/E ratio
• High gross margin and operating margin, suggesting good profitability
• Slight positive performance YTD

Cluster 1 (281 stocks):
• Extremely high P/E ratio, indicating potentially overvalued stocks
• Negative operating and profit margins, suggesting companies might be operating at a loss
• Positive performance YTD

Cluster 2 (481 stocks):
• High P/B ratio, which might suggest that these companies have significant assets or intangibles
• Moderate profitability with a positive performance YTD

Cluster 3 (516 stocks):
• Lower gross margin compared to other clusters, indicating these companies might operate in competitive industries with thin margins
• Slightly positive operating and profit margins
• Positive performance YTD

Cluster 4 (89 stocks):
• High P/E ratio, indicating potentially overvalued stocks
• Significantly negative operating and profit margins, suggesting these companies are operating at a considerable loss
• Positive performance YTD

Step 9: Interpret Characteristics of Each Cluster in Terms of a Recommendation

Cluster traits can guide stock recommendations. For example, stocks in a cluster with soaring valuations but declining profitability could be labeled 'Sell'. Yet translating quantitative measures into a micro-economic perspective is not straightforward; this is where the 'alpha' prowess of the investor may reside. The suggested recommendation split is as follows:

• Cluster 0: 'Buy' - These stocks have a good profitability profile and moderate valuations.
• Cluster 1: 'Sell' - Extremely high valuations combined with negative margins suggest these stocks might be overvalued.
• Cluster 2: 'Hold' - High assets or intangibles with moderate profitability. They might be good long-term holds but not immediate buys.
• Cluster 3: 'Hold' - Companies in competitive industries that are managing to stay profitable. Good for diversification but not immediate buys.
• Cluster 4: 'Strong sell' - High valuations with significant losses indicate high risk.
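In code, this step is just a lookup table from cluster id to label. A sketch (the mapping mirrors the split above; the example cluster assignments are invented):

```python
# Mapping from cluster id to recommendation, as derived above.
recommendation = {0: "Buy", 1: "Sell", 2: "Hold", 3: "Hold", 4: "Strong sell"}

# Hypothetical cluster assignments for four example stocks.
labels = [0, 4, 2, 1]
advice = [recommendation[c] for c in labels]
```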

Step 10: Executive Summary

At this stage, I took the 5 prominent members of each of the 2 extreme clusters and considered taking positions in them. In line with my initial disclaimer, I will not exhibit the output here. I can only say that in the 'Buy' group the top performer is up 36.84% this year and the worst performer is down 10.29%. In the 'Strong sell' group, the top performer is up 283.15% and the worst performer is down 43.14%. To summarize, in this exercise we transformed a detailed dataset of 1500 US stocks into five distinct clusters, each bearing its own investment advice. This clustering not only demystifies stock analysis but may also provide some valuable insights, or a baseline for further investment analysis.

Step 11: Critical Analysis of the Study

Finviz, while widely used, is just one of many available financial data sources. Its data might not capture the full nuances of every stock, and there is always a risk of inaccuracies or delays in updates.

The choice of metrics like P/E, Gross Margin, and Performance YTD, while standard, can be restrictive. Other potentially vital metrics could provide deeper insights or better clustering results. To do things properly, we would analyze which features capture most of the variance in the data, in order to determine the most prominent financial ratios. We would also reduce the dimensionality of the dataset by performing principal component analysis and display the clusters on a scatter plot. I reserve this kind of refinement for the Python-coded version of this article.

The decision to use five clusters (k = 5) is somewhat arbitrary. Different values of k might lead to more meaningful or actionable clusters. In fact, a quantitative method exists for determining which value of k best captures the structure of the dataset: the 'elbow' method.

Replacing missing values with industry averages is a double-edged sword: it retains the dataset's size and structure but might introduce biases. As we are relying on static financial data, no micro-economic dynamics can be captured in this analysis. The normalization method is not unique either, and a different choice could influence clustering outcomes.

While the silhouette score provides a measure of clustering quality, it is just one of many validation metrics. A more comprehensive validation approach might offer deeper insights into the robustness of the clusters. For the case considered, the value of 0.298 is quite modest, reflecting a medium predictive quality.
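The 'elbow' method mentioned above can be sketched in a few lines: fit k-means for a range of k, record the inertia (within-cluster sum of squares), and look for the k where the curve's decrease flattens. A scikit-learn sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled metrics matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))

# Inertia for k = 2..8; the 'elbow' is where adding clusters stops
# yielding a substantial drop in inertia.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(2, 9)
]
```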

Step 12: Set Directions for Future Works

To ensure data reliability and comprehensiveness, we could integrate data from multiple sources. Dynamic clustering, based on time-series analysis, could also be considered, but that would be a completely different approach. Other 'static' clustering methods exist, such as hierarchical clustering or DBSCAN. Principal component analysis, evoked above, could lead to feature engineering that highlights specific financial nuances, enhancing the clustering process. In fact, I did perform a PCA and achieved a new silhouette score of 0.348, a modest improvement. Outlier detection could handle stocks with extreme values that might otherwise skew the clusters. Cluster stability could be back-tested against historical data to confirm the robustness of the clustering model. My initial research in this area indicates that the clusters are not entirely stable: while their composition does not undergo drastic changes, there are occasional significant shifts. This instability does not invalidate the approach; it merely highlights the moderate quality of the clustering achieved so far. Once stocks are clustered at a satisfactory level of relevance, optimization techniques can balance a portfolio based on cluster insights, maximizing returns while hedging risks. Moreover, advanced ensemble methods could be explored to combine multiple clustering results, offering a more resilient and holistic view of market dynamics.
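The PCA-then-cluster refinement mentioned above can be sketched as a short pipeline. This is a toy reconstruction on synthetic data, not the run that produced the 0.348 score:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled metrics matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))

# Project onto the two components capturing the most variance,
# then re-cluster in the reduced space and re-score.
Z = PCA(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)
score = silhouette_score(Z, labels)
```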

Step 13: Python Coding

While our initial analysis established a foundation, Python will allow us to refine our model with its robust machine learning capabilities. We’ll use Python not only to enhance accuracy but also to rigorously explore the potential avenues identified earlier. Essentially, we’ve set the stage, and now with Python we can elevate our approach. Stay tuned for a dedicated follow-up article!

Step 14: Conclusion

Clustering of stocks is an emerging research topic. One way to better understand stock behaviors is by using daily log returns as a key feature. To make sure these clustering methods work well, it's important to test them with different data sets or use cross-validation. There's a growing trend of using algorithms to help manage assets. Financial markets are always changing due to factors like the economy, global events, or changes within specific industries. Regularly updating clustering models helps keep up with these changes, ensuring that investment strategies are still effective. Clustering offers a flexible tool for many in the finance world, from big investment firms to individual traders. By using clustering, they can get a clearer picture of the market, which might be harder to see with traditional methods.

