Christian L. Goueguel, PhD




Hotelling-T² based Variable Selection in Partial Least Squares (PLS)

This article explores Hotelling-T²-based variable selection in partial least squares (PLS) for modeling high-dimensional spectroscopic data.

Image by Ben Harritt

Background

One of the most common challenges in modeling spectroscopic data is selecting, out of a large number of variables (i.e., wavelengths), the subset that is associated with the response variable. Spectroscopic data typically contain many more variables than observations. In such a situation, selecting a smaller number of variables is crucial, especially if we want to speed up computation and improve the model's stability and interpretability. Variable selection methods are typically classified into two groups:

• Filter-based methods: the most relevant variables are selected as a preprocessing step, independently of the prediction model.
• Wrapper-based methods: variables are selected using the prediction model itself, i.e., the supervised learning step guides the selection.

Hence, any PLS-based variable selection is a wrapper method. Wrapper methods require a selection criterion computed from the fitted model and the data at hand.

Method

Let us consider a regression problem in which the relation between the response variable y (n × 1) and the predictor matrix X (n × p) is assumed to follow the linear model y = Xβ, where β (p × 1) is the vector of regression coefficients. Our dataset comprises n = 466 observations from various plant materials, and y corresponds to the concentration of calcium (Ca) in each plant. The matrix X contains our measured LIBS spectra, which include p = 7151 wavelength variables. Our objective is therefore to find a subset of the columns of X with satisfactory predictive power for the Ca content.
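As a point of reference for the sketches that follow, here is a minimal Python setup for this problem. The file names are hypothetical placeholders (the dataset is not public); only the array shapes matter.

```python
import numpy as np

# Hypothetical file names -- placeholders for the (non-public) LIBS dataset.
# X: n x p matrix of LIBS spectra (n = 466 plant samples, p = 7151 wavelengths)
# y: n-vector of calcium (Ca) concentrations
X = np.loadtxt("libs_spectra.csv", delimiter=",")      # shape (466, 7151)
y = np.loadtxt("ca_concentration.csv", delimiter=",")  # shape (466,)

n, p = X.shape
print(f"n = {n} observations, p = {p} wavelength variables")
```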

ROBPCA modeling

Let’s first perform robust principal component analysis (ROBPCA) to help visualize our data and detect any unusual structure or pattern. The obtained scores are shown in the scatterplot below, in which the ellipses represent the 95% and 99% confidence regions from Hotelling’s T². Most observations fall within the 95% confidence region, although some seem to cluster in the top-right corner of the scores scatterplot.

ROBPCA scores scatterplot.
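ROBPCA itself is not available in scikit-learn, so the sketch below uses classical PCA as a stand-in, simply to illustrate how a scores scatterplot with 95% and 99% Hotelling's T² ellipses can be drawn. The choice of two components and the plotting details are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f
from sklearn.decomposition import PCA

# Classical PCA as a stand-in for ROBPCA (illustration only).
pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # (n, 2) score matrix
n_obs, k = scores.shape

def t2_semi_axes(alpha):
    """Semi-axes of the Hotelling's T2 confidence ellipse in the score plane."""
    t2_lim = k * (n_obs - 1) / (n_obs - k) * f.ppf(1 - alpha, k, n_obs - k)
    return np.sqrt(t2_lim * scores.var(axis=0, ddof=1))

theta = np.linspace(0, 2 * np.pi, 200)
plt.scatter(scores[:, 0], scores[:, 1], s=10)
for alpha, style in [(0.05, "--"), (0.01, ":")]:
    rx, ry = t2_semi_axes(alpha)
    plt.plot(rx * np.cos(theta), ry * np.sin(theta), style,
             label=f"{100 * (1 - alpha):.0f}% ellipse")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```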

However, when looking more closely, for instance using the outlier map, we can see that ultimately there are only three observations that seem to pose a problem. We have two observations flagged as orthogonal outliers and only one as a bad leverage point. Some observations are flagged as good leverage points, whilst most are regular observations.

ROBPCA outlier map.
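The outlier map classifies observations by their score distance (distance within the model plane) and orthogonal distance (distance to the model plane). Below is a minimal sketch of these two quantities, again with classical PCA standing in for ROBPCA; the number of retained components (k = 4) and the 97.5% cutoffs follow common conventions for such maps and are assumptions here.

```python
import numpy as np
from scipy.stats import chi2, norm
from sklearn.decomposition import PCA

k = 4  # assumed number of retained components (not stated in the article)
pca = PCA(n_components=k).fit(X)
T = pca.transform(X)                      # scores, shape (n, k)
X_hat = pca.inverse_transform(T)          # rank-k reconstruction of X

# Score distance: Mahalanobis-type distance within the PCA subspace
sd = np.sqrt(np.sum(T**2 / pca.explained_variance_, axis=1))
sd_cut = np.sqrt(chi2.ppf(0.975, df=k))

# Orthogonal distance: distance from each observation to the PCA subspace
od = np.linalg.norm(X - X_hat, axis=1)
# Wilson-Hilferty-type cutoff on od**(2/3), a common convention for outlier maps
mu, sigma = np.mean(od ** (2 / 3)), np.std(od ** (2 / 3), ddof=1)
od_cut = (mu + sigma * norm.ppf(0.975)) ** (3 / 2)

regular  = (sd <= sd_cut) & (od <= od_cut)
orth_out = (sd <= sd_cut) & (od >  od_cut)  # orthogonal outliers
good_lev = (sd >  sd_cut) & (od <= od_cut)  # good leverage points
bad_lev  = (sd >  sd_cut) & (od >  od_cut)  # bad leverage points
```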

PLS modeling

It is worth mentioning that in our regression problem, ordinary least squares (OLS) fitting is not an option since n ≪ p. PLS resolves this by searching for a small set of so-called latent variables (LVs) that perform a simultaneous decomposition of X and y, with the constraint that these components explain as much as possible of the covariance between X and y. The figures below show the results obtained from the PLS model. We obtained an R² of 0.85 with an RMSE and MAE of 0.08 and 0.06, respectively, which correspond to a mean absolute percentage error (MAPE) of approximately 7%.

Observed vs. predicted plot (full dataset).
Residual plot (full dataset).
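A fit of this kind can be sketched with scikit-learn's PLSRegression, as below. The number of latent variables (10) is an assumption, since the article does not report it; in practice it would be chosen by cross-validation. Mean-centering only (scale=False) is used, as is common for spectra.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

n_lv = 10  # assumed number of latent variables; select by cross-validation in practice
pls = PLSRegression(n_components=n_lv, scale=False)  # mean-centering only
pls.fit(X, y)
y_pred = pls.predict(X).ravel()

r2 = r2_score(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))
mae = mean_absolute_error(y, y_pred)
mape = np.mean(np.abs((y - y_pred) / y)) * 100  # assumes strictly positive y
print(f"R2 = {r2:.2f}, RMSE = {rmse:.2f}, MAE = {mae:.2f}, MAPE = {mape:.1f}%")

# Standardized residuals for a residual plot
resid = y - y_pred
std_resid = resid / resid.std(ddof=1)
```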

Similar to the ROBPCA outlier map, the PLS residual plot flags three observations that exhibit large standardized residuals. Another way to check for outliers is to compute the Q residuals and Hotelling’s T² from the PLS model, then define a criterion for deciding whether an observation is an outlier. A high Q residual corresponds to an observation that is not well explained by the model, while a high Hotelling’s T² value corresponds to an observation that is far from the center of the regular observations (i.e., scores near 0). The results are plotted below.

Q residuals vs. Hotelling’s T² plot (full dataset).
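A sketch of how the Q residuals and Hotelling's T² can be computed from the fitted PLS model is given below, continuing from the fit above. The 95% levels and the percentile-based Q cutoff are common, but not universal, choices.

```python
import numpy as np
from scipy.stats import f

T = pls.transform(X)     # (n, A) X-scores of the training data
P = pls.x_loadings_      # (p, A) X-loadings
n_obs, A = T.shape

# Q residuals: squared norm of the part of X not captured by the A latent variables
E = (X - X.mean(axis=0)) - T @ P.T       # residual after the rank-A PLS decomposition
Q = np.sum(E**2, axis=1)

# Hotelling's T2 of each observation in the score space
t2 = np.sum(T**2 / T.var(axis=0, ddof=1), axis=1)
t2_cut = A * (n_obs - 1) / (n_obs - A) * f.ppf(0.95, A, n_obs - A)

# One simple (heuristic) choice of Q cutoff: a high percentile of the training values
q_cut = np.percentile(Q, 95)

flagged = (Q > q_cut) | (t2 > t2_cut)    # candidate outliers
```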

Hotelling-T² based variable selection

Let’s now perform variable selection from our PLS model, which is carried out by computing a T² statistic for each variable (for more details see Mehmood, 2016),

where W is the loading weight matrix and C is the covariance matrix. A variable is then selected according to the following criterion,

where A is the number of LVs in our PLS model, and 1-𝛼 is the confidence level (with 𝛼 equal to 0.05 or 0.01) for the F-distribution.
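Below is a minimal sketch of one way to implement this criterion from the PLS loading weights (pls.x_weights_ in scikit-learn): a Hotelling's T² value is computed for each variable's loading-weight vector relative to the mean and covariance of the weights over all variables, and compared against an F-based cutoff. The cutoff constants follow the standard Hotelling's T² limit, with the p variables playing the role of observations, and may differ slightly from the exact formulation in Mehmood (2016).

```python
import numpy as np
from scipy.stats import f

alpha = 0.05
W = pls.x_weights_                 # (p, A) loading weight matrix
p_var, A = W.shape

# Per-variable Hotelling's T2 on the loading-weight vectors
w_bar = W.mean(axis=0)
C = np.cov(W, rowvar=False)        # (A, A) covariance matrix of the weights
C_inv = np.linalg.inv(C)
diff = W - w_bar
T2 = np.einsum("ij,jk,ik->i", diff, C_inv, diff)

# F-based cutoff (standard Hotelling's T2 limit; constants may differ from Mehmood, 2016)
t2_cut = A * (p_var - 1) / (p_var - A) * f.ppf(1 - alpha, A, p_var - A)

selected = np.where(T2 > t2_cut)[0]
print(f"{selected.size} of {p_var} variables selected")
```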

Thus, from 7151 variables in our original dataset, only 217 were selected based on the aforementioned selection criterion. The observed vs. predicted plot is displayed below along with the model’s R² and RMSE.

Observed vs. predicted plot (selected variables).

In the results below, the three observations that were flagged as outliers have been removed from the dataset. The mean absolute percentage error drops to 6%.

Observed vs. predicted plot (selected variables, outliers removed).
Residual plot (selected variables, outliers removed).
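For completeness, here is a sketch of the final refit, keeping only the selected wavelengths and dropping the flagged observations, continuing from the sketches above. In the article exactly three observations are removed; here the flags from the Q/T² step are reused, which may mark a slightly different set.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score, mean_squared_error

keep = ~flagged                          # observations not flagged as outliers
X_sel = X[keep][:, selected]             # kept rows, selected wavelength columns
y_sel = y[keep]

pls_sel = PLSRegression(n_components=10, scale=False)  # assumed number of LVs
pls_sel.fit(X_sel, y_sel)
y_hat = pls_sel.predict(X_sel).ravel()

print(f"R2 = {r2_score(y_sel, y_hat):.2f}, "
      f"RMSE = {np.sqrt(mean_squared_error(y_sel, y_hat)):.2f}, "
      f"MAPE = {np.mean(np.abs((y_sel - y_hat) / y_sel)) * 100:.1f}%")
```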

Summary

In this article, we performed Hotelling-T²-based variable selection using partial least squares. We obtained a large reduction (about 97%) in the number of variables compared to the model built on the full dataset, with comparable predictive performance.

Data Science
Machine Learning
Chemometrics
Spectroscopy
Data Visualization