Summary

The web content provides an overview of feature selection techniques in machine learning, emphasizing heuristic, evolutionary, filter, and wrapper methods to improve model performance and efficiency.

Abstract

The article discusses various feature selection techniques used in machine learning to enhance model performance by identifying the most informative features from large datasets. It distinguishes between feature selection and feature extraction, highlighting the importance of reducing redundancy, improving efficiency, and mitigating overfitting. The author explains filter methods such as correlation filters, mutual information, and chi-square tests, which rank features based on their statistical relationship with the target variable. Wrapper methods like forward selection and backward selection are iterative techniques that add or remove features to optimize model accuracy. Embedded methods, including LASSO and RIDGE regression, incorporate built-in feature selection mechanisms. Additionally, the article explores the use of evolutionary algorithms, such as genetic algorithms and simulated annealing, for feature selection, noting their advantage in not requiring derivative information. The author concludes by sharing personal experience with feature selection on a loan prediction problem, demonstrating the effectiveness of these techniques in improving results.

Opinions

The author believes that feature selection is crucial for handling large datasets and improving model accuracy with less computational power.
There is a clear distinction made between feature selection and feature extraction, with the former being preferred for maintaining interpretability.
The author suggests that choosing features without a systematic approach is akin to selecting socks in the dark, emphasizing the complexity of the task.
Filter methods are presented as less accurate than wrapper methods but offer a trade-off between feature selection and model accuracy.
Wrapper methods, particularly forward and backward selection, are valued for their ability to iteratively refine the feature set based on model performance.
Embedded methods are praised for combining the benefits of filter and wrapper methods while being less time-consuming and less prone to overfitting.
Evolutionary algorithms are highlighted for their effectiveness in feature selection without the need for derivative information, allowing for sampling near the global optimum.
The author shares a positive personal experience with feature selection, indicating that it led to improved results in a practical machine learning problem.

Feature selection techniques for data

Heuristic and Evolutionary feature selection techniques

“Data Literacy includes the ability to read, work with, analyze, and argue with data”-(Jordan Morrow, Qlik)Data Literacy

The meaning of feature selection is to select the most informative feature from the dataset. When a dataset is huge, it's hard to build the model. The huge datasets take a lot of time and computational power to work and they exhaust the model. Feature selection is the method where we can select only the important or most contributing features to train with the expense of very less or no loss of accuracy. Lots of people misunderstood the concept of feature selection and feature extraction.

The basic difference between both is, in feature selection, you used a combination of features or a subset of features to get maximum performance/accuracy. whereas in feature extraction we create a whole new set of features from the existing one depending on the dataset’s or feature’s variance and other things.

Choosing features in the dark is a very complicated thing. I picked out two socks from my sock drawer this morning! It was still dark, but that shouldn’t matter, right? After all, they are the same size...THE SAME ?!? The Era of Big Data represents the END OF DEMOGRAPHICS (i.e., our models should no longer be based on and biased by a limited selection of attributes and features)

Uses of feature selection

To train the algorithm faster
Improve efficiency
Reduce redundancy
Reduce overfitting
Domain understanding
The simple interpretation of the model and data

We use two approaches

Heuristic Approach and Evaluation in ML Algorithm
Wrapper & filter methods

Filter methods

Filter methods use exact ranking information about features. According to ranking, features are arranged and there is no need to use machine learning algorithms repeatedly. Its accuracy is a little lesser than wrapper methods but still, you can set your trade-off for features and accuracy.

Filter methods: Correlation Filter

Correlation can be defined as a statistical relationship between two entities. In Machine learning, correlation is among two or more features or attributes to check how much they are related to each other.

If any of the two features are highly correlated or both carry the same information then one of them is redundant.

Correlation among features [Image by author]

In the table, descriptors 1 & 3 are highly correlated as well as descriptors 1 & 4 and descriptors 3 & 4. Remove the feature which has less variance. Here descriptor 4 carries the least variance and 3 carries the highest variance. 4th descriptor is removed. like this, we reduced the set of features.

Mutual Information

Mutual information (MI) measures the dependency between the input and the target variables. Mutual information should be always >= 0. MI = 0; i.e no relation between input variables and target variable that is input variable is independent of the target variable. Higher values mean higher dependency.

Information Gain

Information gain calculates the reduction in entropy from the dataset.

For the training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, i.e. minimizes entropy. Maximum information gain which splits the data into groups for effective classification. Evaluating the gain of each variable in the context of the target variable

Chi-square method for feature selection:

Compute chi-squared stats between each non-negative feature and class. The statistical test was applied to the groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution. Measures the discriminating power of the feature in classifying the data point into one of the possible classes Quantifies the lack of interdependence between a given feature and classification category. A higher value of χ2 implies higher information given by the feature.

Wrapper Method: Forward Selection

It is an iterative method that starts with having no feature in the model. In each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.

Here we believe in the individual accuracy of Each descriptor. Descriptor 4 has the highest accuracy taking its combination with descriptors. Descriptor 4, 1 has the highest accuracy taking its combination with descriptors. Descriptor 4, 1 & 2 has the highest accuracy taking their combination with descriptors.

This figure describes how the selection of features changes the accuracy of the model [Image by author]

Wrapper Method: Backward Selection

It starts with all the features and removes the least significant feature at each iteration which improves the performance of the model. We repeat this until no improvement is observed in the removal of features.

Accuracy is calculated by removing one feature at a time. When we removed X5, we get an improvement in accuracy.

Embedded Methods

Embedded methods combine the qualities of filter and wrapper methods. It is implemented by algorithms that have their own built-in feature selection methods. Feature selection in embedded methods has a deeper connection to learning algorithms and is part of the classification itself. Embedded methods are less time-consuming and are less prone to overfitting. E.g. LASSO regression and RIDGE regression which have inbuilt penalization functions to reduce overfitting.

Random Forest Feature Importance

Random forests provide two methods for feature selection:

● Mean decrease impurity: For classification trees measure is Gini impurity or information gain/entropy and for regression trees measure is variance. Training a tree, decreases in the weighted impurity are computed. For a forest, the impurity decrease from each feature is averaged and the features are ranked according to this measure.

● Mean decrease accuracy: For every tree grown in the forest, find OOB [out of bag] examples & count the number of votes cast for correct examples. Randomly permute values in features in OBB examples and put these cases down the tree. Subtract the number of votes for the correct class in the variable permuted OOB data from the number of votes for the correct class in the untouched OOB data. The average of this number of all trees in the forest is the raw importance score for variable

Evolutionary Algorithms for Feature Selection

Nature has always been a source of inspiration. It has sparked the development of numerous effective computational tools and algorithms during the past few decades for tackling challenging optimization problems. Evolutionary optimization algorithms are used for feature selection. Some examples are as follows

The biggest advantage of Evolutionary algorithms is they don't require derivative information to work with. We can Sample a subset of the solution near the global optimum.

I tried a simple filter feature selection method on the famous analytics vidya loan prediction problem with much less effort but eventually, the results got improved.

Results of comparison when we use feature selection and when we don’t [Image by author]

So like this, you can improve your results with the use of the same model that you have. Interestingly, feature selection currently drawing a lot of attention due to its effectiveness.

If you find this insightful

If you found this article insightful, follow me on Linkedin and medium. you can also subscribe to get notified when I publish articles. Let’s create a community! Thanks for your support!

If you want to support me :

As Your following and clapping is the most important thing but you can also support me by buying coffee. COFFEE.

You can also read my blogs related to

Optical character recognition [OCR]

Chatbot with the least number of lines

Federated learning

Cognitive science and AI

Mlearning.ai Submission Suggestions

How to become a writer on Mlearning.ai

medium.com