What are PCA loadings and how to effectively use Biplots?
A practical guide for getting the most out of Principal Component Analysis.

Principal Component Analysis (PCA) is one of the most well-known techniques for (big) data analysis. However, interpreting the variance in the low-dimensional space can remain challenging. Understanding the loadings and interpreting the biplot is a must for anyone who uses PCA. Here I will explain i) how to interpret the loadings for in-depth insights to (visually) explain the variance in your data, ii) how to select the most informative features, iii) how to create insightful plots, and iv) how to detect outliers. The theoretical background is backed by a practical hands-on guide for getting the most out of your data with the pca library.
Introduction
By the end of this blog, you will be able to (visually) explain the variance in your data, select the most informative features, and create insightful plots. We will go through the following topics:
- Feature Selection vs. Extraction.
- Dimension reduction using PCA.
- Explained variance, and the scree plot.
- Loadings and the Biplot.
- Extracting the most informative features.
- Outlier detection.
Gentle introduction to PCA.
The main purpose of PCA is to reduce the dimensionality of a dataset while minimizing information loss. In general, there are two ways to reduce dimensionality: Feature Selection and Feature Extraction. The latter is used, among others, in PCA, where a new set of dimensions or latent variables is constructed from a (linear) combination of the original features. In the case of feature selection, a subset of the original features is selected that should be informative for the task ahead. No matter which technique you choose, reducing dimensionality is an important step for several reasons, such as reducing complexity, improving run time, determining feature importance, visualizing class information, and last but not least avoiding the curse of dimensionality. The latter means that, for a given sample size, a classifier will degrade in performance rather than improve once the number of features exceeds a certain point (Figure 1). In most cases, a lower-dimensional space results in a more accurate mapping and compensates for the “loss” of information.
In the next section, I will explain when to choose feature selection and when to choose feature extraction, because each has its own use cases.

Feature selection.
Feature selection is necessary in a number of situations: 1. When the features are not numeric (e.g., strings). 2. When you need to extract meaningful features. 3. When measurements must stay intact (a transformation would turn them into a linear combination of measurements, and the unit would be lost). A disadvantage is that feature selection procedures require a search strategy and/or an objective function to evaluate and select the candidate features. For example, it may require a supervised approach with class information to perform a statistical test, or a cross-validation approach, to select the most informative features. Nevertheless, feature selection can also be done without class information, for example by selecting the top N features ranked by variance (higher is better).
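To make the unsupervised variant concrete, here is a minimal sketch of selecting the top N features by variance. The helper name `select_top_variance` and the toy data are my own assumptions for illustration and are not part of any library.

```python
import numpy as np

def select_top_variance(X, n_top):
    """Return the column indices of the n_top highest-variance features, highest first."""
    variances = X.var(axis=0, ddof=1)      # sample variance per feature
    order = np.argsort(variances)[::-1]    # indices sorted from high to low variance
    return order[:n_top]

rng = np.random.default_rng(0)
# Three hypothetical features with clearly different spreads.
X = np.column_stack([
    rng.normal(0, 1.0, 200),
    rng.normal(0, 10.0, 200),
    rng.normal(0, 0.1, 200),
])

selected = select_top_variance(X, n_top=2)
X_selected = X[:, selected]                # original measurements and units stay intact
```

Note that the selected columns are the untouched original features, which is exactly why this approach keeps the measurements and their units intact.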

Feature extraction.
Feature extraction approaches can reduce the number of dimensions and at the same time minimize the loss of information. To do this, we need a transformation function: y=f(x). In the case of PCA, the transformation is limited to a linear function, which we can rewrite as a set of weights that make up the transformation step: y=Wx, where W contains the weights, x are the input features, and y is the final transformed feature space. See below a schematic overview to demonstrate the transformation step together with the mathematical steps.
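As a small numerical illustration of y=Wx, the sketch below maps one sample with three features onto two new dimensions. The weights in `W` are hypothetical values that I picked by hand (each row has unit length and the rows are perpendicular, as they would be in PCA); they are not computed from data.

```python
import numpy as np

# Hypothetical weights: 2 new dimensions built from 3 original features.
W = np.array([
    [1 / np.sqrt(2), 1 / np.sqrt(2), 0.0],   # mixes feature 1 and feature 2 equally
    [0.0,            0.0,            1.0],   # passes feature 3 through
])
x = np.array([2.0, 4.0, 6.0])                # one sample with 3 features

y = W @ x                                    # transformed sample with 2 dimensions
print(y.shape)                               # (2,)
```

Each value in y is a weighted sum of the original features, which is what "linear combination" means in practice.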

A linear transformation with PCA also has some disadvantages. It makes features less interpretable, and sometimes even useless for follow-up in certain use cases. As an example, if potential cancer-related genes were discovered using a feature extraction technique, the result may only tell you that a gene was partially involved together with other genes. A follow-up in the laboratory, e.g., to partially knock out or activate genes, would not make sense.
How are dimensions reduced in PCA?
We can break down PCA into roughly four parts, which I will describe illustratively.
Part 1. Center data around the origin.
The first part is to center the data (illustrated in Figure 4), which can be done in four smaller steps. First, compute the average per feature (1 and 2), and from these the center of the data (3). We can now shift the data so that it is centered around the origin (4). Note that this transformation step does not change the relative distance between the points but only centers the data around the origin.
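The centering step can be sketched in a few lines. The four data points are made up for illustration.

```python
import numpy as np

# Four hypothetical samples with two features each.
X = np.array([
    [2.0, 1.0],
    [4.0, 3.0],
    [6.0, 2.0],
    [8.0, 6.0],
])

center = X.mean(axis=0)      # average per feature, i.e. the center of the data
X_centered = X - center      # shift the data so the center sits on the origin

# The distance between any two points is unchanged by the shift.
dist_before = np.linalg.norm(X[0] - X[1])
dist_after = np.linalg.norm(X_centered[0] - X_centered[1])
```

Here `center` equals (5, 3), and after the shift the mean of every feature is zero while `dist_before` and `dist_after` are identical.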

Part 2. Fit the line through origin and data points.
The next part is to fit a line through the origin and the data points (or samples). This can be done by 1. drawing a random line through the origin, 2. projecting the samples orthogonally onto the line, and then 3. rotating the line until the best fit is found by minimizing the distances from the samples to the line. However, it is more practical to maximize the distances from the projected data points to the origin, which leads to the same result: the distance of each sample to the origin is fixed, so (by the Pythagorean theorem) a smaller distance to the line implies a larger projected distance. The fit is computed using the sum of squared distances (SS), because squaring removes the sign of the distances, so points on opposite sides of the origin do not cancel each other out. At this point (Figure 5), we have fitted a line in the direction with the maximum variance.
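The "rotate until the SS is maximal" idea can be sketched as a brute-force search over candidate lines. This is only an illustration under my own assumptions (random toy data, a grid of 0.1 degree steps); in practice the direction is computed directly, for example from the eigenvectors of the covariance matrix, which I use here as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-feature data, stretched along feature 1, then centered.
X = rng.normal(size=(300, 2)) * np.array([3.0, 1.0])
X = X - X.mean(axis=0)

# Candidate lines through the origin, described by unit direction vectors.
angles = np.linspace(0.0, np.pi, 1801)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

projections = X @ directions.T            # signed distance of each projected point to the origin
ss = (projections ** 2).sum(axis=0)       # sum of squared distances per candidate line
best_direction = directions[np.argmax(ss)]

# Cross-check against the leading eigenvector of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigenvectors[:, np.argmax(eigenvalues)]
```

Up to the grid resolution and an arbitrary sign flip, `best_direction` and `pc1` point along the same line.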

Part 3. Computing the Principal Components and the loadings.
We determined the best-fitted line in the direction with maximum variation, which is now the 1st Principal Component or PC1. The next step is to compute the slope of PC1, which describes the contribution of each feature to PC1. In this example, we can visually observe that the data points are spread out more across feature 1 than feature 2 (Figure 6). The slope of the red line reflects our visual observation: for every 2 units we go across feature 1 (to the right), it goes down 1 unit along the axis of feature 2. In other words, to make PC1 (the red line), we need 2 parts of feature 1 and -1 part of feature 2. We can describe these “parts” as vectors b and c, which we can then use to compute vector a. By the Pythagorean theorem, vector a gets a length of √(2² + 1²) ≈ 2.24 (see Figure 6).
However, we need to standardize toward the so-called “unit vector”, which we get by dividing all vectors by a=2.24. Thus vector b=2/2.24=0.89, vector c=1/2.24=0.45 (or -0.45 when we keep the sign of the direction), and vector a=1 (aka the unit vector). This unit vector is what we call the eigenvector for this particular PC. In other words, the values of these vectors range between -1 and 1. If, for example, vector b had a value close to 1, it would mean that feature 1 contributes almost entirely to PC1.
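The scaling toward the unit vector is easy to verify numerically. The sketch below uses the "2 parts of feature 1 and -1 part of feature 2" from the example; the variable names are my own.

```python
import math

b, c = 2.0, -1.0             # 2 parts of feature 1, -1 part of feature 2
a = math.hypot(b, c)         # length of the vector along PC1: sqrt(2^2 + 1^2)

loading_f1 = b / a           # contribution of feature 1 to PC1
loading_f2 = c / a           # contribution of feature 2 to PC1

print(round(a, 2), round(loading_f1, 2), round(loading_f2, 2))  # 2.24 0.89 -0.45
```

After scaling, the vector (loading_f1, loading_f2) has length 1, which is why each individual value must lie between -1 and 1.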

The next step is to determine PC2, which is a line that goes through the origin and is perpendicular to PC1. In this example, there are only two features, but if there were more features, the third PC would be the best-fitting line through the origin that is perpendicular to both PC1 and PC2. As described before, the new latent variables, aka the PCs, are linear combinations of the initial features. The proportion of each feature that is used in a PC is named the coefficient.
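That all PCs are perpendicular to each other can be checked with a small sketch. I use an eigendecomposition of the covariance matrix on made-up data with three features here; this is one common way to obtain the PCs, not necessarily how a given library computes them internally.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data with three features of different spread, centered.
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 1.0])
X = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigenvalues)[::-1]      # sort PCs by decreasing variance
components = eigenvectors[:, order].T      # row i holds the coefficients of PC i+1

gram = components @ components.T           # pairwise dot products between the PCs
```

The matrix `gram` is (numerically) the identity matrix: every PC has unit length, and the dot product between any two different PCs is zero, i.e. they are perpendicular.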
Loadings
It is important to realize that the principal components are less interpretable and have no direct real-world meaning, since they are constructed as linear combinations of the initial variables. But we can analyze the loadings, which describe the importance of the original variables. From a numerical point of view, the loadings are equal to the coefficients of the variables, and they provide information about which variables contribute most to the components.
- Loadings range from -1 to 1.
- A high absolute value (towards 1 or -1) indicates that the variable strongly influences the component. Values close to 0 indicate that the variable has a weak influence on the component.
- The sign of a loading (+ or -) indicates whether a variable and a principal component are positively or negatively correlated.
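These rules of thumb can be applied mechanically. The sketch below ranks features by the absolute value of their loading on PC1 and reports the sign; the feature names and loading values are hypothetical and chosen only for illustration.

```python
import numpy as np

feature_names = ["f1", "f2", "f3"]               # hypothetical feature labels
pc1_loadings = np.array([0.89, -0.45, 0.05])     # hypothetical loadings on PC1

# Rank features by the strength of their influence, strongest first.
order = np.argsort(np.abs(pc1_loadings))[::-1]
ranked = [feature_names[i] for i in order]

for i in order:
    direction = "positive" if pc1_loadings[i] > 0 else "negative"
    print(f"{feature_names[i]}: loading={pc1_loadings[i]:+.2f} ({direction})")
```

In this toy case f1 dominates PC1, f2 contributes in the opposite direction, and f3 has almost no influence.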
Part 4. The transformation and explained variance.
We computed the PCs and can now rotate (or transform) the entire dataset such that the x-axis is the direction of the largest variance (aka PC1). Note that after the transformation step, the values of the original features are lost. Instead, each PC contains a proportion of the total variation, and with the explained variance we can describe how much variance each PC contains. To compute the variance of a PC, we divide its sum of squared distances (SS) by the number of data points minus one; dividing this value by the total variance over all PCs gives the proportion of explained variance.
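The rotation and the explained variance can be sketched as follows. The data are made up, and I again obtain the PCs from an eigendecomposition of the covariance matrix, which is an assumption on my side for the sake of a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 2-feature data, centered.
X = rng.normal(size=(250, 2)) * np.array([4.0, 1.0])
X = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order].T             # rows are PC1, PC2

scores = X @ components.T                         # the dataset rotated onto PC1 and PC2
ss = (scores ** 2).sum(axis=0)                    # sum of squared distances per PC
explained_variance = ss / (X.shape[0] - 1)        # variance captured by each PC
explained_ratio = explained_variance / explained_variance.sum()
```

The values in `explained_variance` match the eigenvalues of the covariance matrix, and `explained_ratio` sums to 1, with PC1 taking the largest share.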

Part 0. Standardization
Before we do parts 1 to 4, it is crucial to get the data in the right shape by standardization, and this should therefore be the very first part. Because we search for the direction with the largest variance, PCA is very sensitive to variables that have different value ranges and to the presence of outliers. If there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate over those with small ranges. I will demonstrate this in the next section. To prevent this, we need to standardize the range of the initial variables so that each variable contributes equally to the analysis. We can do this by subtracting the mean from each value of a variable and dividing the result by the standard deviation of that variable. After standardization, each feature has a mean of zero and a standard deviation of one. This is also named z-score standardization, for which Scikit-learn provides the StandardScaler(). Once the standardization is done, all the variables are on the same scale.
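A z-score standardization can be sketched with plain NumPy. The two toy features with very different ranges are my own assumption; the population standard deviation (NumPy's default) is used, which to my knowledge is also what StandardScaler uses.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two hypothetical features with very different value ranges.
X = np.column_stack([
    rng.integers(0, 100, 250),
    rng.integers(0, 5, 250),
]).astype(float)

# z-score: subtract the mean and divide by the standard deviation, per feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```

After this step both features have mean 0 and standard deviation 1, so neither can dominate the direction of maximum variance purely because of its scale.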
The PCA library.
A few words about the pca library that is used for the upcoming analysis. The pca library is designed to tackle a few challenges such as:
- Analyze different types of data. Besides the regular PCA, the library also includes sparse PCA for which the sparseness is controllable by the coefficient of the L1 penalty. And there is a truncated SVD that can efficiently handle sparse matrices as it does not center the data before computing the singular value decomposition.
- Computing and plotting the explained variance. After fitting the data, the explained variance can be plotted: the scree plot.
- Extraction of the best-performing features. The best-performing features are returned by the model.
- Insights into the loadings with the Biplot. The biplot gives more insight into the variation of the features and the separability of the classes in relation to the PCs.
- Outlier Detection. Outliers can be detected using two well-known methods: Hotelling-T2, and SPE-Dmodx.
- Removal of unwanted (technical) bias. Data can be normalized in such a manner that the (technical) bias is removed from the original data set.
What benefits does pca offer over other implementations?
- At the core of the pca library, the sklearn library is used to maximize compatibility and to ease integration in pipelines.
- Standardization is built-in functionality.
- Contains the most-wanted output and plots.
- Simple and intuitive.
- Open-source.
- Documentation page with many examples.
A practical example to understand the loadings.
Let’s start with a simple and intuitive example to demonstrate the loadings, the explained variance, and the extraction of the most important features.
First, we need to install the pca library.
pip install pca
Creating a Synthetic Dataset.
For demonstration purposes, I will create a synthetic dataset containing 8 features and 250 samples. Each feature will contain random integers, but with a different variance per feature. All features are independent of each other. Feature 1 will contain integers in the range [0, 100] (and thus has the largest variance), feature 2 will contain integers in the range [0, 50], feature 3 integers in the range [0, 25], and so on (see code block below). For the sake of example, I will not normalize the data, in order to demonstrate the principles. This dataset is ideal to 1. demonstrate the principles of PCA, 2. demonstrate the loadings and the explained variance, and 3. demonstrate the importance of standardization (or the lack of it). Before we continue, I want to repeat: when working with real-world datasets, it is advised to look carefully at your data and normalize accordingly to bring each feature to the same scale.
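A dataset along these lines could be generated as sketched below. Only the first three ranges are given in the text; the upper bounds for features 4 to 8, the seed, and the feature names are my own assumptions, so treat this as an approximation of the described dataset rather than the exact one.

```python
import numpy as np

rng = np.random.default_rng(0)                   # fixed seed, chosen arbitrarily
n_samples = 250
# Upper bounds per feature: the first three follow the text, the rest are assumed.
upper_bounds = [100, 50, 25, 10, 5, 4, 3, 2]

# Independent random integers in [0, high] for each feature.
X = np.column_stack([
    rng.integers(0, high, size=n_samples, endpoint=True) for high in upper_bounds
])
feature_names = [f"f{i + 1}" for i in range(X.shape[1])]

variances = X.var(axis=0, ddof=1)                # feature 1 has by far the largest variance
print(dict(zip(feature_names, variances.round(1))))
```

The data are deliberately left unstandardized, so the differences in variance between the features remain visible to PCA.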














