Use the elbow method to determine how many clusters unsupervised learning needs
Kaggle has very few competitions that involve clustering, so whenever one crops up, I jump at the chance to make predictions on the clusters.
Clustering is a machine learning technique used to group similar data points together based on their inherent characteristics or similarities. It is an unsupervised learning method, meaning that it does not require labelled data for training. Clustering algorithms aim to discover patterns or structures in the data by identifying groups or clusters of data points that are more similar to each other than to those in other clusters.
Here’s an overview of how clustering works:
- Data Representation: Clustering algorithms operate on a dataset represented by a set of data points or instances. Each data point typically has multiple features or attributes that describe its characteristics.
- Similarity Measurement: The first step in clustering is to define a similarity or distance measure that quantifies the similarity between data points. Common distance metrics include Euclidean distance, Manhattan distance, or cosine similarity. The choice of similarity measure depends on the nature of the data and the problem at hand.
- Cluster Initialization: The clustering algorithm initialises the process by assigning data points to an initial set of clusters. This initialization can be random or based on some heuristic or prior knowledge.
- Iterative Process: The algorithm iteratively updates the cluster assignments based on the similarity measure. It reassigns data points to the cluster that minimises the distance between the data point and the cluster centroid or prototype.
- Convergence: The iterative process continues until a convergence criterion is met. This criterion could be a maximum number of iterations, a small change in cluster assignments, or the stabilisation of cluster centroids.
- Cluster Evaluation: After convergence, the resulting clusters are evaluated and analysed to determine their quality and interpretability. Common evaluation metrics include cohesion (how close data points are within a cluster) and separation (how distinct clusters are from each other). Visualisations such as scatter plots or dendrograms can also aid in understanding the cluster structure.
- Cluster Interpretation: Once the clusters are obtained, they can be interpreted and analysed to gain insights into the underlying patterns or characteristics of the data. The clusters may correspond to distinct groups or classes, or they may reveal previously unknown patterns or relationships.
There are various clustering algorithms available, each with its own strengths and weaknesses. Some popular clustering algorithms include k-means, hierarchical clustering, DBSCAN, and Gaussian mixture models. The choice of algorithm depends on factors such as the nature of the data, the desired number of clusters, the computational requirements, and the specific problem domain.
Clustering finds applications in various domains, such as customer segmentation, image segmentation, document clustering, anomaly detection, and social network analysis. It helps in exploratory data analysis, pattern recognition, and understanding the underlying structure of the data, providing valuable insights for further analysis or decision-making.
In this competition, sklearn’s KMeans was used as the clustering mechanism. K-means clustering is an iterative algorithm that aims to partition a dataset into K clusters, where K is a predefined number. The algorithm assigns each data point to the cluster whose centroid (mean) is closest to it. It seeks to minimise the within-cluster sum of squared distances.
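As a minimal illustration of how KMeans is fitted (using a synthetic feature matrix as a stand-in, so the data below is not the competition’s):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a numeric feature matrix: 100 samples, 5 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# Partition the data into K=3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster assignment (0, 1 or 2) per sample

print(kmeans.cluster_centers_.shape)  # (3, 5): one centroid per cluster
print(kmeans.inertia_)                # within-cluster sum of squared distances
```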
In this blog post, the clusters will be based on heart disease, and the Kaggle competition can be found here:- https://www.kaggle.com/competitions/k-means-clustering-for-heart-disease-analysis
I wrote the program using Kaggle’s free online Jupyter Notebook, and it is stored in my Kaggle account.
Once the Jupyter Notebook was created, I imported the libraries I would need to execute the program, being:-
- Numpy to create numpy arrays and undertake numerical computations,
- Pandas to create dataframes and process data,
- Os to interact with the operating system and locate the necessary files,
- Sklearn to provide machine learning functionality,
- Matplotlib to visualise the data, and
- Seaborn to statistically visualise the data.
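The import block looked broadly like this (a sketch; the notebook’s exact imports may differ slightly):

```python
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import OrdinalEncoder
```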
After I imported the libraries, I used the os library to retrieve the files that would be used in the program.
I used pandas to read the csv files and convert them to dataframes, being data and submission:-
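A sketch of the discovery and loading steps, using the usual /kaggle/input layout; the file names below are placeholders rather than the competition’s actual file names:

```python
# Walk the Kaggle input directory to see which files are available.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Placeholder file names; the real paths come from the listing above.
data = pd.read_csv('/kaggle/input/k-means-clustering-for-heart-disease-analysis/data.csv')
submission = pd.read_csv('/kaggle/input/k-means-clustering-for-heart-disease-analysis/sample_submission.csv')
```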
There were 920 rows in data and 299 rows in submission, so I created a new column in data, exists, which was labelled with True or False, depending on whether the id number in data was also in submission:-
I dropped the rows in data that were not in the submission dataframe, and called this new dataframe heart_data:-
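These two steps might look like the following, assuming the shared key column in both dataframes is called id:

```python
# Flag each row of data according to whether its id also appears in submission.
data['exists'] = data['id'].isin(submission['id'])

# Keep only the rows that are present in the submission file.
heart_data = data[data['exists']].copy()
print(len(data), len(heart_data))   # 920 rows before filtering, 299 after
```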
I imputed the null values in the features in heart_data:-
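A sketch of one way to impute the nulls (the fill strategy used in the notebook may differ):

```python
# Fill missing object (string) values with the column mode and missing
# numeric values with the column median.
for col in heart_data.columns:
    if heart_data[col].isnull().any():
        if heart_data[col].dtype == object:
            heart_data[col] = heart_data[col].fillna(heart_data[col].mode()[0])
        else:
            heart_data[col] = heart_data[col].fillna(heart_data[col].median())
```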
I then dropped the column, id, from heart_data because the rows are automatically indexed in the dataframe. I also dropped the column, exists, because all of the values in it were True:-
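Dropping the two columns is a one-liner:

```python
# id duplicates the dataframe index and exists is constant, so drop both.
heart_data = heart_data.drop(columns=['id', 'exists'])
```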
I used sklearn’s OrdinalEncoder function to encode all of the object columns. Before I did so, however, I converted the object columns to type str:-
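The encoding step, roughly:

```python
# Cast the object columns to str, then replace each category with an integer code.
object_cols = heart_data.select_dtypes(include='object').columns
heart_data[object_cols] = heart_data[object_cols].astype(str)

encoder = OrdinalEncoder()
heart_data[object_cols] = encoder.fit_transform(heart_data[object_cols])
```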
I then used the elbow method to graph the within-cluster sum of squares for a range of cluster counts on the dataframe, heart_data. The cluster counts I tried were 2, 3, and 4, and I settled on 3 because it gave me the best score on Kaggle’s leaderboard:-
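The elbow plot is produced by fitting KMeans over a range of cluster counts and plotting the inertia (within-cluster sum of squares) for each; a sketch (the exact range of k tried in the notebook may have been narrower):

```python
# Fit KMeans for several values of k and record the inertia of each fit.
inertias = []
k_values = range(1, 7)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(heart_data)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow method')
plt.show()
```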
The diagram below is a depiction of the elbow, where the candidate cluster counts 2, 3, and 4 can be clearly seen. The trick is to select the cluster count at which the elbow appears in the graph:-
I then used sklearn’s KMeans to generate the labels for three clusters, the number of clusters having been decided with the elbow method above:-
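Fitting the final model with three clusters, roughly:

```python
# Fit the final model with the chosen k and keep the cluster label for each row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(heart_data)
```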
I used sklearn’s PCA (principal component analysis) to convert heart_data to a two feature dataframe, being pca_data. I used pca_data to plot a graph of the three clusters that had previously been created.
Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in machine learning to transform high-dimensional data into a lower-dimensional representation while preserving the most important information. It aims to find the directions (principal components) along which the data varies the most.
Scikit-learn (sklearn) provides an implementation of PCA in its PCA class, which offers various options to customise the analysis, such as specifying the number of components to retain or setting a threshold for the explained variance. By applying PCA in your machine learning pipeline, you can simplify data representation, enhance understanding, and potentially improve the performance of subsequent learning algorithms:-
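A sketch of the projection and plot (it assumes the kmeans model and cluster_labels from the previous step):

```python
# Project the encoded features onto the first two principal components.
pca = PCA(n_components=2)
pca_data = pd.DataFrame(pca.fit_transform(heart_data), columns=['PC1', 'PC2'])

# Project the fitted centroids into the same two-dimensional space.
centroids_2d = pca.transform(kmeans.cluster_centers_)

plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_data, x='PC1', y='PC2', hue=cluster_labels, palette='viridis')
plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], c='red', marker='X', s=200, label='centroids')
plt.legend()
plt.show()
```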
The diagram below is a plot of the three clusters with the centroids in the centre of the graph:-
I then prepared the submission by placing the predictions in the submission’s column, cluster. I converted the submission dataframe to a csv file, and this file is what was used for scoring:-
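The submission step, roughly (it assumes the rows of heart_data and submission are in the same id order):

```python
# Place the predicted cluster for each row into the submission's cluster column
# and write the csv file used for scoring.
submission['cluster'] = cluster_labels
submission.to_csv('submission.csv', index=False)
```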
When I submitted my predictions to Kaggle for scoring, I scored around the middle of the leaderboard.
Additionally, I did not standardise the data because I achieved a better score without standardisation. It is therefore important to check the score with and without standardisation and select the approach that works best.
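To run that comparison, a standardised variant can be slotted in before KMeans, along these lines:

```python
from sklearn.preprocessing import StandardScaler

# Variant with standardisation: scale each feature to zero mean and unit
# variance before clustering, then compare the leaderboard score of the two runs.
scaled = StandardScaler().fit_transform(heart_data)
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=42)
scaled_labels = kmeans_scaled.fit_predict(scaled)
```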
I have created a code review to accompany this blog post, and it can be viewed here:- https://youtu.be/vgAVreW-U2c