Renu Khandelwal

Summary

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of datasets while preserving as much information as possible, by transforming the original variables into a smaller set of uncorrelated variables called principal components.

Abstract

The article provides an in-depth explanation of Principal Component Analysis (PCA), a method employed in machine learning to decrease the number of input features in a dataset. PCA achieves dimensionality reduction by identifying the principal components that capture the most variance in the data, thus retaining the most significant information. The process involves standardizing the data, computing the covariance matrix, and performing eigen decomposition to obtain eigenvalues and eigenvectors. The eigenvectors corresponding to the largest eigenvalues are then selected to form a new feature space. This technique is particularly useful for visualizing high-dimensional data, improving computational efficiency, and mitigating the effects of multicollinearity. The article also demonstrates the application of PCA through a step-by-step example using Python and compares the results with those obtained using the sklearn library.

Opinions

  • PCA is recommended when it's challenging to identify irrelevant features for elimination and when less interpretable but independent features are acceptable.
  • It is emphasized that PCA is an unsupervised method, focusing solely on input features without considering the output variable.
  • The article suggests that PCA is sensitive to outliers, necessitating their removal before analysis.
  • The use of PCA is advocated for scenarios where visualization of multidimensional data is required, as it can reduce the number of dimensions to a manageable level for plotting.
  • The article posits that PCA contributes to cost-effective and faster model execution by reducing the feature space, which also requires less storage.
  • The importance of explained variance is highlighted as a criterion for determining the number of principal components to retain in the reduced feature set.

Principal Component Analysis (PCA)

In this article, we look at Principal Component Analysis (PCA), a technique used to reduce dimensionality when a dataset has too many input features. We will see what PCA is and how it works through a step-by-step example in Python.

Prerequisites: Machine Learning

When a dataset has many input features, a model is more likely to overfit. To shrink the input feature space, we can either drop features or extract new ones:

  • Drop irrelevant or redundant features, as they do not contribute to the accuracy of the prediction. When we drop such input variables, we lose the information stored in them.
  • Create new independent variables from the existing input variables. This way we do not lose the information in the variables. This is feature extraction.

Suppose we want to predict sales of a particular item for a retail store. The input features used for prediction are sales figures, retail changes of the item, inventory movement, store details, competitors' retail, customer demography, and customer information such as address and zip code.

We can drop certain variables, such as customer information, that do not contribute to predicting sales for retail stores. When we drop these variables, we lose the information available in them.

Principal Component Analysis (PCA) helps keep the critical information in a dataset without dropping features. It does this by creating new independent variables from the existing input variables.

What is Principal Component Analysis (PCA)?

Suppose we have a large dataset of correlated input variables and want to reduce them to a smaller feature space while still maintaining the critical information. We can do this with Principal Component Analysis (PCA).

PCA reduces the dimensionality of the data using feature extraction. It does this by constructing new variables that explain most of the variability in the dataset.

PCA removes the redundant information carried by correlated features. It creates new variables, the principal components, that are uncorrelated with each other. This takes care of the multicollinearity issue.
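To see the decorrelation concretely, here is a minimal sketch in NumPy. The two-feature dataset, the random seed, and the variable names are made up for illustration and are not part of the article's example.

```python
import numpy as np

# Hypothetical data: two strongly correlated features (illustrative only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Center the data and project it onto the eigenvectors of its covariance matrix.
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
scores = X_centered @ eigenvectors

# Correlation between the original features vs. between the new components.
corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]
print(f"before: {corr_before:.3f}, after: {corr_after:.3f}")
```

The original features are almost perfectly correlated, while the correlation between the two new components is zero up to floating-point error.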

PCA is an unsupervised technique. It only looks at the input features and does not take into account the output or the target variable.

PCA helps with visualization of data by reducing the dimensionality of the dataset. If we have 9 input features and plot every pair of features, the number of plots is 9 × (9 - 1) / 2 = 36. As the number of input features increases, the number of plots keeps growing. It becomes difficult to look at all of them, and not all of them will be informative. PCA reduces dimensionality by creating new variables, and with fewer input features the data is easier to visualize.

PCA also helps with faster, more cost-effective execution and lower storage requirements. This is again possible because of the reduced number of input features.
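The plot count is simply the number of distinct feature pairs. As a quick check, the helper below (a hypothetical name, used only for illustration) computes it for any number of features.

```python
from math import comb

def pairwise_plot_count(n_features: int) -> int:
    """Number of distinct feature pairs, i.e. n(n-1)/2 scatter plots."""
    return comb(n_features, 2)

print(pairwise_plot_count(9))   # 36
print(pairwise_plot_count(20))  # 190
```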

Summarizing the goal of PCA

The goal of PCA is to reduce the input feature dimension of the dataset from m to p, where p < m.

How does PCA work?

The first step in principal component analysis is to standardize the input features. Different input features may be measured in different units. Standardizing the input features puts them on the same scale.
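A minimal sketch of standardization with NumPy follows. The data is hypothetical: two features on very different scales. Each column is z-scored by subtracting its mean and dividing by its standard deviation, which is the same idea scikit-learn's StandardScaler implements.

```python
import numpy as np

# Hypothetical features on very different scales (illustrative only).
X = np.array([[1000.0, 0.5],
              [2000.0, 0.7],
              [3000.0, 0.9],
              [4000.0, 1.1]])

# z-score each column: subtract its mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

After this step, both columns contribute on an equal footing, so a feature measured in thousands no longer dominates the variance.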

After standardizing the data, we find the correlation or covariance between the different variables.

Correlation shows whether two variables are related and, if they are, whether the relationship is positive or negative.

We then find the eigenvectors and eigenvalues by performing eigendecomposition on the correlation (or covariance) matrix.
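Correlation and covariance can be used interchangeably here because, once the features are standardized, the covariance matrix equals the correlation matrix of the original data. The sketch below checks this on a small hypothetical dataset; it assumes the sample standard deviation (ddof=1) so that it matches the default of np.cov.

```python
import numpy as np

# Hypothetical data: 5 samples, 3 features on different scales.
X = np.array([[2.0, 50.0, 0.1],
              [4.0, 45.0, 0.4],
              [6.0, 30.0, 0.3],
              [8.0, 20.0, 0.8],
              [10.0, 10.0, 0.9]])

# Standardize with the sample standard deviation (ddof=1) to match np.cov.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_standardized = np.cov(X_std, rowvar=False)
correlation = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_of_standardized, correlation))  # True
```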

What are eigenvalues and eigenvectors, and why do we need them?

To reduce the dimensionality, we need to keep the directions in the data with maximum variability. The eigenvalue determines the magnitude of the variability along its eigenvector.

Let's briefly look at what eigenvectors and eigenvalues are and how they help us understand variability in the data.

Eigenvector

Consider a matrix A that applies a linear transformation to a vector x such that

y = Ax

An eigenvector x of A is a nonzero vector whose direction does not change under this transformation. Only its magnitude is scaled, by a factor λ, such that

Ax = λx

λ is the eigenvalue and x is the eigenvector.
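We can verify the relation Ax = λx numerically. The 2 by 2 matrix below is an arbitrary example chosen for illustration, not data from this article.

```python
import numpy as np

# A small symmetric matrix chosen for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column of `eigenvectors` is one eigenvector x; check that A x = lambda x.
for lam, x in zip(eigenvalues, eigenvectors.T):
    print(lam, np.allclose(A @ x, lam * x))
```

Note that np.linalg.eig returns the eigenvectors as columns, which is why the loop iterates over the transpose.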

Eigenvalue

The eigenvalue determines the magnitude of the variability. The largest eigenvalues signify the greatest variability. Eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, so we can drop them.

Next, we sort the eigenvalues in descending order and choose the p eigenvectors with the greatest eigenvalues, where p is less than the number of original input features m.
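The sorting and selection step can be sketched as follows. The diagonal matrix stands in for a covariance matrix and is only an assumption for illustration. The sketch uses np.linalg.eigh, which is intended for symmetric matrices and returns eigenvalues in ascending order, so we reverse the order ourselves.

```python
import numpy as np

# Illustrative symmetric matrix standing in for a covariance matrix.
C = np.array([[2.0, 0.0, 0.0],
              [0.0, 5.0, 0.0],
              [0.0, 0.0, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending eigenvalues

order = np.argsort(eigenvalues)[::-1]          # indices for descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]          # reorder columns to match

p = 2
top_vectors = eigenvectors[:, :p]              # the p leading eigenvectors
print(eigenvalues)        # [5. 2. 1.]
print(top_vectors.shape)  # (3, 2)
```

Reordering the eigenvector columns together with the eigenvalues matters: each eigenvalue must stay paired with its own eigenvector.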

Finally, we create a matrix Z from the selected p eigenvectors. Projecting the data onto Z gives our new, reduced feature space.

We have now transformed our original features into a reduced p-dimensional space while retaining most of the information in the dataset.

The next question is how many principal components to use for the new reduced feature space.

We use explained variance to determine how many principal components to keep.

Explained variance is computed from the eigenvalues. It is the measure of variance attributed to each principal component.
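As a rough sketch, the explained variance ratio of each component is its eigenvalue divided by the sum of all eigenvalues. The eigenvalues and the 90% threshold below are hypothetical values picked for illustration. scikit-learn exposes the same quantity as the explained_variance_ratio_ attribute of a fitted PCA object.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
eigenvalues = np.array([4.0, 1.5, 0.5])

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90
p = int(np.searchsorted(cumulative, threshold) + 1)

print(explained_ratio.round(3))
print(p)  # 2
```

Here the first component alone explains about two thirds of the variance, and the first two together pass the 90% mark, so p = 2.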

We can see from the plot below that the first two principal components are able to explain more than 90% of the information.

source: Introduction to Statistical Learning

The first principal component accounts for the most variance present in the original data.

The second principal component must be orthogonal to the first principal component. It captures variance in the data that the first principal component does not.
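The orthogonality can be checked directly, since the eigenvectors of a symmetric covariance matrix are mutually orthogonal. The dataset below is randomly generated with a fixed seed and is purely an assumption for the demonstration.

```python
import numpy as np

# Hypothetical 3-feature dataset (seeded so the run is repeatable).
rng = np.random.default_rng(42)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(50, 3)) @ mixing

covariance = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

first_pc = eigenvectors[:, -1]   # largest eigenvalue (eigh sorts ascending)
second_pc = eigenvectors[:, -2]  # second largest eigenvalue

# The dot product of orthogonal vectors is zero.
print(np.isclose(first_pc @ second_pc, 0.0))
# All eigenvectors together form an orthonormal set.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(3)))
```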

Summarizing steps for PCA

  • Standardize the dataset
  • Compute the covariance (or correlation) matrix and find its eigenvectors and eigenvalues
  • Sort the eigenvalues in descending order
  • Choose the p eigenvectors with the greatest eigenvalues, where p < number of features. p will be the dimension of the feature subspace
  • Create a matrix Z from the p selected eigenvectors
  • Transform the original features using matrix Z to obtain the reduced p-dimensional space
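The steps above can be put together in a short function. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name pca_sketch, the sample data, and the choice of np.linalg.eigh are ours, and the function assumes that no feature is constant (otherwise the standardization would divide by zero).

```python
import numpy as np

def pca_sketch(X, p):
    """Project X (samples x features) onto its p leading principal components."""
    X = np.asarray(X, dtype=float)
    # 1. Standardize each feature (assumes no feature is constant).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Covariance matrix of the standardized features.
    covariance = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition; eigh is meant for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # 4. Sort in descending order and keep the p leading eigenvectors.
    order = np.argsort(eigenvalues)[::-1][:p]
    Z = eigenvectors[:, order]
    # 5. Transform the data into the reduced p-dimensional space.
    return X_std @ Z, eigenvalues[order]

# Hypothetical data: 6 samples, 3 features.
X = np.array([[2.5, 2.4, 1.0], [0.5, 0.7, 3.1], [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.5], [3.1, 3.0, 0.4], [2.3, 2.7, 1.1]])
scores, top_eigenvalues = pca_sketch(X, p=2)
print(scores.shape)  # (6, 2)
```

A useful sanity check is that the variance of each column of the returned scores equals the corresponding eigenvalue.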

When do we use PCA?

  • When we need to reduce the input feature dimensionality but are unable to identify variables that can be eliminated
  • When we are comfortable with the input features being less interpretable. PCA creates new variables that are independent of each other but harder to interpret.
  • When we want all input variables to be independent of each other

One thing to consider when using PCA is that it is sensitive to outliers, so we need to remove outliers first.

Now we will take a simple array, run PCA step by step, perform PCA using sklearn, and compare the results.

We first import the required libraries and create an array A of dimension 4 by 3, so our array has three input features.
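A small, contrived illustration of this sensitivity follows. The points and the helper first_component are hypothetical. The clean data varies only along the first feature, yet a single extreme value in the second feature is enough to redirect the first principal component.

```python
import numpy as np

def first_component(X):
    """Direction of largest variance: eigenvector with the largest eigenvalue."""
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
    return eigenvectors[:, np.argmax(eigenvalues)]

# Hypothetical points that vary only along the first feature.
clean = np.array([[-2.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
# The same points plus a single extreme value in the second feature.
with_outlier = np.vstack([clean, [0.0, 50.0]])

print(np.abs(first_component(clean)))         # points along feature 1
print(np.abs(first_component(with_outlier)))  # points along feature 2
```

The absolute value is taken only because the sign of an eigenvector is arbitrary.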

import numpy as np
from numpy.linalg import eig

# 4 samples (rows) x 3 input features (columns)
A = np.array([[100, 1, 1075], [125, 2, 1900], [150, 1, 950], [91, 1, 1650]])
print(A)
output:[[ 100    1 1075]
        [ 125    2 1900]
        [ 150    1  950]
        [  91    1 1650]]

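We first calculate the mean of each column, mean_A. Note that the integer dtype truncates the means to whole numbers.

# np.int was removed in NumPy 1.24; the built-in int gives the same result
mean_A = np.mean(A.T, axis=1, dtype=int)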
print(mean_A)
output: array([ 116,    1, 1393])

We now center the columns by subtracting the column mean from each value. Centering shifts the values but does not change their scale.

center_A = A - mean_A
print(center_A)
output: array([[ -16,    0, -318],
               [   9,    1,  507],
               [  34,    0, -443],
               [ -25,    0,  257]])

Next, we find the covariance matrix of the centered array and then its eigenvalues (value) and eigenvectors (vector). The eigenvector with the largest eigenvalue explains the most variance in the data.

covariance = np.cov(center_A.T)
value, vector = eig(covariance)
print(value.astype(int))
output:[206898    630      0]

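The first two eigenvalues are by far the largest, so the first two principal components capture almost all of the variance. We therefore keep only the eigenvectors in columns 0 and 1.

# Keep the eigenvectors (columns) belonging to the two largest eigenvalues
vector_with_highest_eigenvalue = vector[:, [0, 1]]
# Project the centered data onto the selected eigenvectors
PCA_calc = vector_with_highest_eigenvalue.T.dot(center_A.T)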
print(PCA_calc.T)
output:[[ 317.63555359  -22.07788348]
        [-506.73564035   18.70361328]
        [ 443.56928479   25.52011189]
        [-257.43119689  -20.07930745]]

We now have the transformed new features.

Next, we use sklearn to compute the PCA and compare its values with those from the step-by-step approach.

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with two components and project A onto them
pca = PCA(n_components=2)
pca.fit(A)
PCA_value = pca.transform(A)
print(PCA_value)
output:[[-318.3760533   -22.59451704]
        [ 505.99514063   18.18697972]
        [-444.3097845    25.00347833]
        [ 256.69069718  -20.59594101]]

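The PCA values from sklearn are very close to the step-by-step calculation. The sign of the first component is flipped, which is expected because the sign of an eigenvector is arbitrary. The remaining small differences arise because mean_A was truncated to integers in the manual calculation, whereas sklearn centers the data with the exact means.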
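As a check on where the small gap comes from, the sketch below repeats the manual calculation with the exact floating-point column means instead of the integer means used earlier. It uses np.linalg.eigh with an explicit sort in place of eig, which is our substitution, and it compares against the sklearn values copied from the output printed above rather than calling sklearn again.

```python
import numpy as np

A = np.array([[100, 1, 1075], [125, 2, 1900], [150, 1, 950], [91, 1, 1650]])

# Center with the exact (floating-point) column means instead of truncated ones.
centered = A - A.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1][:2]      # two largest eigenvalues
manual = centered @ eigenvectors[:, order]

# sklearn output copied from the article's printed result.
sklearn_values = np.array([[-318.3760533, -22.59451704],
                           [505.99514063, 18.18697972],
                           [-444.3097845, 25.00347833],
                           [256.69069718, -20.59594101]])

# Eigenvector signs are arbitrary, so compare absolute values.
print(np.allclose(np.abs(manual), np.abs(sklearn_values), atol=1e-5))
```

With exact means, the manual projection agrees with sklearn up to the sign of each component.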

References:

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
