Serafeim Loukas, PhD

Summary

The provided content offers a comprehensive guide on Principal Component Analysis (PCA), detailing its purpose, implementation in Python using scikit-learn, and the interpretation of feature importance, along with practical examples using the Iris dataset.

Abstract

The article "PCA Clearly Explained - When, Why, How To Use It and Feature Importance: A Guide in Python" by Serafeim Loukas is a detailed tutorial on PCA, an unsupervised dimensionality reduction technique. It explains the linear PCA method, which transforms correlated variables into a smaller set of uncorrelated variables, maximizing the variance in the data. The guide discusses when and why to use PCA, such as in datasets with multicollinearity or high dimensions, and its utility in data denoising and compression. The core steps of PCA are outlined, including data normalization, eigendecomposition of the covariance matrix, and projection onto the reduced PCA space. The article also covers the maximum number of meaningful components, the relationship between eigenvalues and variance, and feature importance assessment through eigenvector magnitudes. A Python example using the Iris dataset illustrates PCA's application and the interpretation of principal components. Additionally, the author provides a custom function for creating a PCA biplot, a visual tool for understanding the relationships between features and principal components. The article concludes with an invitation for readers to follow the author, engage with the content, and explore further resources on data science and machine learning.

Opinions

  • The author emphasizes the importance of PCA in data preprocessing for machine learning tasks.
  • PCA is presented as a powerful method for reducing data complexity while retaining essential information.
  • The article suggests that PCA can be particularly useful for processing data with multicollinearity and high dimensionality.
  • The author provides a clear mathematical formulation and proof of the relationship between eigenvalues and the variance of the reduced space.
  • Feature importance in PCA is highlighted as a critical aspect for interpreting the results of the analysis.
  • The author's custom biplot function is offered as a valuable tool for visualizing PCA results, enhancing the understanding of feature contributions and correlations.
  • The author encourages reader interaction and support for their educational efforts in data science and machine learning.

Data Science, Machine Learning

PCA Clearly Explained - When, Why, How To Use It and Feature Importance: A Guide in Python

In this post, I explain what PCA is, when and why to use it, and how to implement it in Python using scikit-learn. I also explain how to obtain the feature importance after running PCA.

Handmade sketch made by the author.

1. Introduction & Background

Principal Component Analysis (PCA) is a well-known unsupervised dimensionality reduction technique that constructs relevant features/variables through linear (linear PCA) or non-linear (kernel PCA) combinations of the original variables (features). In this post, we will focus only on the widely used linear PCA method.

The construction of relevant features is achieved by linearly transforming correlated variables into a smaller number of uncorrelated variables. This is done by projecting (dot product) the original data onto the reduced PCA space using the eigenvectors of the covariance/correlation matrix, a.k.a. the principal components (PCs).

The resulting projected data are essentially linear combinations of the original data capturing most of the variance in the data (Jolliffe 2002).

In summary, PCA is an orthogonal transformation of the data into a set of uncorrelated variables living in the reduced PCA space, such that the first component explains the most variance in the data and each subsequent component explains less.
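As a quick check of these two properties (orthogonal directions, uncorrelated projected data), here is a minimal sketch. It assumes the Iris data and scikit-learn's PCA, the same tools used later in this post; the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Z-score the Iris features, then keep all 4 principal components
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA()
Z = pca.fit_transform(X)

# The principal directions are orthonormal: W W' = I
W = pca.components_  # shape [n_components, n_features]
is_orthonormal = np.allclose(W @ W.T, np.eye(W.shape[0]))

# The projected variables are uncorrelated: off-diagonal covariances ~ 0
cov_Z = np.cov(Z, rowvar=False)
off_diagonal = cov_Z - np.diag(np.diag(cov_Z))
is_uncorrelated = np.allclose(off_diagonal, 0.0, atol=1e-10)

# The variances are sorted in decreasing order (PC1 explains the most)
is_sorted = bool(np.all(np.diff(pca.explained_variance_) <= 0))

print(is_orthonormal, is_uncorrelated, is_sorted)
```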

2. When/Why to use PCA

  • PCA is particularly useful for processing data where multicollinearity exists between the features/variables.
  • PCA can be used when the dimensionality of the input features is high (e.g. a lot of variables).
  • PCA can also be used for denoising and data compression.
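To make the compression use case concrete, here is a small sketch, assuming the z-scored Iris data and scikit-learn's PCA with its inverse_transform method. It stores 2 numbers per sample instead of 4, maps them back to the original space, and measures how much information was lost.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)  # shape [150, 4]

# Compress: keep 2 numbers per sample instead of 4
pca = PCA(n_components=2)
X_compressed = pca.fit_transform(X)

# Decompress: map the 2D representation back to the original 4D space
X_restored = pca.inverse_transform(X_compressed)

# Mean squared reconstruction error per entry of X
mse = np.mean((X - X_restored) ** 2)

# Fraction of the variance that was discarded by dropping PC3 and PC4
discarded = 1.0 - pca.explained_variance_ratio_.sum()

print(f"reconstruction MSE: {mse:.4f}, discarded variance: {discarded:.4f}")
```

For z-scored data the two printed numbers agree, because every feature has unit variance: the reconstruction error is exactly the variance carried by the dropped components.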

3. Core of the PCA method

Let X be a matrix containing the original data with shape [n_samples, n_features] .

Briefly, the PCA analysis consists of the following steps:

  • First, the original input variables stored in X are z-scored so that each original variable (column of X) has zero mean and unit standard deviation.
  • Next, the covariance matrix Cx = (1/n) X'X is constructed and eigendecomposed (for z-scored data, the covariance matrix equals the correlation matrix, since the standard deviation of every feature is 1).
  • The eigenvalues are then sorted in decreasing order, which corresponds to decreasing variance in the data (the eigenvalues are equal to the variances of the projected data; I prove this in Section 6 and verify it using Python).
  • Finally, the projection (transformation) of the normalized data onto the reduced PCA space is obtained by multiplying (dot product) the normalized data by the leading eigenvectors of the covariance matrix, i.e. the PCs.
  • The new reduced PCA space maximizes the variance of the original data. To visualize the projected data together with the contribution of the original variables in a joint plot, we can use the biplot.
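The steps above can be sketched from scratch with NumPy alone. This is only an illustrative sketch, not the scikit-learn implementation: the synthetic data and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 features, two of them correlated
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.5, size=200)
c = rng.normal(size=200)
X_raw = np.column_stack([a, b, c])

# Step 1: z-score each column (zero mean, unit standard deviation)
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Step 2: covariance matrix Cx = (1/n) X'X and its eigendecomposition
n = X.shape[0]
Cx = (X.T @ X) / n
eigenvalues, eigenvectors = np.linalg.eigh(Cx)  # eigh returns ascending order

# Step 3: sort eigenvalues (and matching eigenvectors) in decreasing order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 4: project onto the k leading eigenvectors (the PCs)
k = 2
W = eigenvectors[:, :k]  # shape [n_features, k]
X_new = X @ W            # shape [n_samples, k]

# The variance of each projected column equals the corresponding eigenvalue
projected_variances = X_new.var(axis=0)  # uses 1/n, matching Cx above
print(eigenvalues[:k], projected_variances)
```

np.linalg.eigh is used instead of np.linalg.eig because the covariance matrix is symmetric, so the eigenvalues are guaranteed to be real.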

4. The maximum number of meaningful components

There is an upper bound on the number of meaningful components that can be extracted using PCA. This is related to the rank of the covariance/correlation matrix (Cx). For a data matrix X with shape [n_samples, n_features], the covariance/correlation matrix has shape [n_features, n_features] and its rank is at most min(n_samples, n_features) (once the data are mean-centered, at most min(n_samples - 1, n_features)).

Thus, we can have at most min(n_samples, n_features) meaningful PC components/dimensions, due to the maximum rank of the covariance/correlation matrix.
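A small numerical illustration of this bound, assuming a hypothetical random dataset with fewer samples than features (the sizes and names are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5, 10  # fewer samples than features
X_raw = rng.normal(size=(n_samples, n_features))

# Z-score the columns, as in Section 3
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The covariance matrix is [n_features, n_features] ...
Cx = (X.T @ X) / n_samples
rank = np.linalg.matrix_rank(Cx)

# ... but only a few of its eigenvalues are non-zero
eigenvalues = np.linalg.eigvalsh(Cx)[::-1]  # decreasing order
n_nonzero = int(np.sum(eigenvalues > 1e-10))

print(Cx.shape, rank, n_nonzero)
```

Although Cx is 10 x 10, the number of non-zero eigenvalues cannot exceed min(n_samples, n_features). Here it is n_samples - 1, because mean-centering removes one more degree of freedom.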

5. Python example using scikit-learn and the Iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
plt.style.use('ggplot')
# Load the data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Z-score the features
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
# The PCA model
pca = PCA(n_components=2) # estimate only 2 PCs
X_new = pca.fit_transform(X) # project the original data into the PCA space

Let’s plot the data before and after the PCA transform and also color code each point (sample) using the corresponding class of the flower (y).

fig, axes = plt.subplots(1,2)
axes[0].scatter(X[:,0], X[:,1], c=y)
axes[0].set_xlabel('x1')
axes[0].set_ylabel('x2')
axes[0].set_title('Before PCA')
axes[1].scatter(X_new[:,0], X_new[:,1], c=y)
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].set_title('After PCA')
plt.show()
PCA output of the above code.

We can see that in the PCA space, the variance is maximized along PC1 (which explains about 73% of the variance) and PC2 (about 23%). Together, they explain about 96% of the variance.

print(pca.explained_variance_ratio_)
# [0.72962445 0.22850762]
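A common follow-up question is how many PCs to keep. One possible approach, sketched below under the assumption of a 95% variance target (the threshold is arbitrary and task-dependent), is to fit PCA with all components and look at the cumulative explained variance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components to see the full variance spectrum
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of PCs whose cumulative explained variance reaches the target
target = 0.95  # assumed threshold, pick what suits your task
k = int(np.argmax(cumulative >= target)) + 1

print(cumulative, k)
```

scikit-learn can also do this selection itself: passing a float between 0 and 1 as n_components (e.g. PCA(n_components=0.95)) keeps the smallest number of components that reaches that fraction of explained variance.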

6. Proof of eigenvalues of original covariance matrix being equal to the variances of the reduced space

Mathematical formulation & proof

Assuming that the original input variables stored in X are z-scored so that each original variable (column of X) has zero mean and unit standard deviation, we have:

Latex code written by the author.
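For reference, here is a sketch of the derivation in LaTeX, written with the notation of Section 3. It is a reconstruction under stated assumptions and not necessarily the exact original formulation: W is assumed to be the matrix whose columns are the eigenvectors of Cx, and Z the data projected onto them.

```latex
\begin{aligned}
C_x &= \frac{1}{n} X^{\top} X = W \Lambda W^{\top}, \qquad W^{\top} W = I \\
Z &= X W \\
C_z &= \frac{1}{n} Z^{\top} Z
     = \frac{1}{n} W^{\top} X^{\top} X W
     = W^{\top} C_x W
     = W^{\top} W \Lambda W^{\top} W
     = \Lambda
\end{aligned}
```

Since Λ is diagonal, the covariance matrix of the projected data has the eigenvalues on its diagonal (the variances) and zeros elsewhere (no correlation between PCs).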

The Λ matrix above stores the eigenvalues of the covariance matrix of the original space/dataset.

Verify using Python

The same result can also be verified numerically by estimating the covariance matrix of the reduced space:

np.cov(X_new.T)
array([[2.93808505e+00, 4.83198016e-16],
       [4.83198016e-16, 9.20164904e-01]])

We observe that the values on the diagonal (the variances of the projected data) are equal to the eigenvalues of the covariance matrix stored in pca.explained_variance_, while the off-diagonal values are numerically zero, i.e. the PCs are uncorrelated:

pca.explained_variance_
array([2.93808505, 0.9201649 ])
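To close the loop, the eigenvalues can also be computed directly from the covariance matrix of the original z-scored data and compared with pca.explained_variance_. The sketch below assumes the same Iris setup as in Section 5. One detail worth noting: np.cov and scikit-learn both divide by n - 1, whereas the formula Cx = (1/n) X'X divides by n, so the two conventions differ by a factor of n / (n - 1).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)

# Eigenvalues of the covariance matrix of the original (z-scored) data.
# np.cov divides by n - 1 by default, which matches pca.explained_variance_.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # decreasing

# With the 1/n convention of Cx = (1/n) X'X the values shrink by (n - 1) / n
n = X.shape[0]
eigenvalues_1_over_n = np.linalg.eigvalsh(X.T @ X / n)[::-1]

print(eigenvalues[:2], pca.explained_variance_)
```

This also explains why the eigenvalues above sum to slightly more than 4 (the number of z-scored features): with the 1/n convention they sum to exactly 4.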

7. Feature importance

The importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude — higher importance).

Let’s find the most important features:

print(abs( pca.components_ ))
[[0.52106591 0.26934744 0.5804131 0.56485654]
 [0.37741762 0.92329566 0.02449161 0.06694199]]

Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row

[0.52106591 0.26934744 0.5804131 0.56485654]

we can conclude that features 1, 3 and 4 are the most important for PC1. Similarly, we can state that feature 2 and then feature 1 are the most important for PC2.

To sum up, we look at the absolute values of the components of the eigenvectors corresponding to the k largest eigenvalues. In sklearn, the components are sorted by explained variance. The larger these absolute values are, the more a specific feature contributes to that principal component.
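Reading the ranking off the printed array by eye gets tedious with many features. A minimal sketch that does it programmatically is shown below; it assumes the Iris setup from Section 5 and uses the feature names shipped with the dataset, while the ranking dictionary is just an illustrative helper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X)

loadings = np.abs(pca.components_)  # shape [n_components, n_features]

# For each PC, list the features from most to least important
ranking = {}
for i, row in enumerate(loadings):
    order = np.argsort(row)[::-1]  # indices sorted by decreasing magnitude
    ranking[f"PC{i + 1}"] = [iris.feature_names[j] for j in order]
    summary = ", ".join(f"{iris.feature_names[j]} ({row[j]:.3f})" for j in order)
    print(f"PC{i + 1}: {summary}")
```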

8. The biplot

The biplot is the best way to visualize everything in a single plot following a PCA: the projected data together with the contribution of each original feature.

There is an implementation in R, but there is no standard implementation in Python, so I decided to write my own function for that:

def biplot(score, coeff , y):
    '''
    Author: Serafeim Loukas, [email protected]
    Inputs:
       score: the projected data
       coeff: the eigenvectors (PCs)
       y: the class labels
    '''
    xs = score[:,0] # projection on PC1
    ys = score[:,1] # projection on PC2
    n = coeff.shape[0] # number of variables
    plt.figure(figsize=(10,8), dpi=100)
    classes = np.unique(y)
    colors = ['g','r','y']
    markers=['o','^','x']
    for s,l in enumerate(classes):
        plt.scatter(xs[y==l],ys[y==l], c = colors[s], marker=markers[s]) # color based on group
    for i in range(n):
        #plot as arrows the variable scores (each variable has a score for PC1 and one for PC2)
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'k', alpha = 0.9,linestyle = '-',linewidth = 1.5, overhang=0.2)
        plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'k', ha = 'center', va = 'center',fontsize=10)

    plt.xlabel("PC{}".format(1), size=14)
    plt.ylabel("PC{}".format(2), size=14)
    limx= int(xs.max()) + 1
    limy= int(ys.max()) + 1
    plt.xlim([-limx,limx])
    plt.ylim([-limy,limy])
    plt.grid()
    plt.tick_params(axis='both', which='both', labelsize=14)

Call the function (make sure to first run the initial blocks of code where we load the Iris data and perform the PCA):

import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault) # reset ggplot style
# Call the biplot function for only the first 2 PCs
biplot(X_new[:,0:2], np.transpose(pca.components_[0:2, :]), y)
plt.show()
The PCA biplot using my custom function.

We can again verify visually that a) the variance is maximized and b) features 1, 3 and 4 are the most important for PC1. Similarly, feature 2 and then feature 1 are the most important for PC2.

Furthermore, arrows (variables/features) that point in the same direction indicate a positive correlation between the variables they represent, whereas arrows heading in opposite directions indicate a contrast between the variables they represent.

Verify the above using code:

# Var 3 and Var 4 are extremely positively correlated
np.corrcoef(X[:,2], X[:,3])[1,0]
0.9628654314027957
# Var 2 and Var 3 are negatively correlated
np.corrcoef(X[:,1], X[:,2])[1,0]
-0.42844010433054014
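The link between arrow directions and correlations can also be checked numerically. The sketch below, which assumes the Iris setup from Section 5 (arrow_cosine is a hypothetical helper written for this example), compares the cosine of the angle between two biplot arrows with the actual correlation of the corresponding features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)

# Each row of `arrows` is one arrow of the biplot: its (PC1, PC2) coordinates
arrows = pca.components_.T  # shape [n_features, 2]

def arrow_cosine(i, j):
    """Cosine of the angle between the biplot arrows of features i and j (0-based)."""
    a, b = arrows[i], arrows[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Full [n_features, n_features] correlation matrix of the original features
corr = np.corrcoef(X, rowvar=False)

for i, j in [(2, 3), (1, 2)]:  # (Var 3, Var 4) and (Var 2, Var 3)
    print(f"Var{i + 1} vs Var{j + 1}: "
          f"cosine = {arrow_cosine(i, j):+.3f}, correlation = {corr[i, j]:+.3f}")
```

The cosine is only a rough visual proxy, since the biplot shows two PCs and not the full feature space, but for these two pairs its sign agrees with the sign of the correlation.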

That’s all folks! Hope you liked this article!

References

[1] Jolliffe, I. T. Principal component analysis. New York, NY: Springer, 2002.

[2] https://en.wikipedia.org/wiki/Principal_component_analysis

[3] https://stattrek.com/matrix-algebra/matrix-rank.aspx
