Satyam Kumar

Summary

This article discusses seven oversampling techniques for handling imbalanced data in machine learning models: Random Over Sampling, SMOTE, Borderline SMOTE, KMeans SMOTE, SVM SMOTE, ADASYN, and SMOTE-NC.

Abstract

The article begins by introducing the challenge of modeling imbalanced data in machine learning and the importance of class balance in target class labels. It then explains the problem of overfitting in random oversampling and introduces SMOTE as a solution. The article goes on to describe Borderline SMOTE, KMeans SMOTE, SVM SMOTE, ADASYN, and SMOTE-NC as variations of SMOTE that address specific issues in oversampling. The performance of each technique is evaluated using a churn modeling dataset from Kaggle, and the article concludes by suggesting that model performance can be improved by using a combination of oversampling and undersampling techniques.

Opinions

  • Modeling imbalanced data is a major challenge in machine learning.
  • Random oversampling can lead to overfitting.
  • SMOTE is a synthetic minority oversampling technique that creates new synthetic samples to balance the dataset.
  • Borderline SMOTE addresses the problem of bridges of minority class points within the region of majority class points.
  • KMeans SMOTE is an oversampling method that avoids the generation of noise and effectively overcomes imbalances between and within classes.
  • SVM SMOTE incorporates the SVM algorithm to identify misclassification points.
  • ADASYN creates synthetic data according to the data density, with more synthetic data created in regions of low density of minority class.
  • SMOTE-NC is a variation of SMOTE that can handle categorical features.
  • Model performance can be improved by using a combination of oversampling and undersampling techniques.

7 Over Sampling techniques to handle Imbalanced Data

Deep dive analysis of various oversampling techniques

Image by LTD EHU from Pixabay

Modeling imbalanced data is a major challenge when training a model. In classification problems, the balance of the target class labels plays an important role in modeling. When the classes are imbalanced, i.e. one class makes up only a small minority of the dataset, models tend to learn mostly the majority class, which results in biased predictions.

Some of the famous examples of imbalanced class problems are:

  1. Credit Card Fraud Detection
  2. Disease diagnosis
  3. Spam detection, and many more

The imbalance of the dataset needs to be handled before training a model. There are various techniques to handle class imbalance, such as oversampling, undersampling, or a combination of both. This article takes a deep dive into 7 oversampling techniques:

  1. Random Over Sampling
  2. SMOTE
  3. Borderline SMOTE
  4. KMeans SMOTE
  5. SVM SMOTE
  6. ADASYN
  7. SMOTE-NC

For the evaluation of different oversampling models, we are using the Churn modeling dataset from Kaggle.

Performance of the Logistic Regression model without using any oversampling or undersampling technique.

1. Random Over Sampling:

Random oversampling is the simplest oversampling technique for balancing an imbalanced dataset. It balances the data by replicating the minority class samples. This does not cause any loss of information, but a model trained on the result is prone to overfitting because the same information is copied.
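
The replication idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `random_oversample` and the toy data are hypothetical and are not taken from the article's own implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Copy randomly chosen rows of each smaller class until every class
    has as many rows as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # exact duplicate: no new information
            y_out.append(label)
    return X_out, y_out


# Toy data (made up): 4 majority rows (label 0), 2 minority rows (label 1).
X = [[5.0, 8.0], [6.0, 9.0], [5.5, 8.5], [6.2, 8.1], [1.0, 2.0], [1.5, 1.8]]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Every added row is an exact copy of an existing minority row, which is why the technique adds no new information.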

(Image by Author), Left: Scatter plot after Random Oversampling, Right: Performance of model after Random Oversampling

2. SMOTE:

Random oversampling is prone to overfitting because the minority class samples are simply replicated. This is where SMOTE comes into the picture. SMOTE stands for Synthetic Minority Oversampling Technique. It creates new synthetic samples to balance the dataset.

SMOTE works by using a k-nearest neighbors algorithm to create synthetic data. SMOTE creates samples with the following steps:

  • Identify the feature vector and its nearest neighbor
  • Compute the distance between the two sample points
  • Multiply the distance with a random number between 0 and 1.
  • Identify a new point on the line segment at the computed distance.
  • Repeat the process for identified feature vectors.
(Image by Author), Left: Scatter plot after SMOTE, Right: Performance of model after SMOTE
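
The SMOTE steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the imblearn implementation: the function name `smote`, the brute-force neighbor search, and the toy points are assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbors (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force k-nearest neighbors among the other minority points.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random number in [0, 1)
        # New point on the line segment between base and neighbor.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class (made up): the four corners of a unit square.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote(minority, n_new=5)
print(new_points)
```

Because each new point lies on a segment between two existing minority points, the synthetic samples stay inside the region the minority class already occupies.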

3. Borderline SMOTE:

When some minority points or outliers lie within the region of majority class points, SMOTE creates bridges of synthetic minority class points through that region. Borderline SMOTE addresses this problem.

In the Borderline SMOTE technique, only the minority examples near the borderline are oversampled. It classifies the minority class points into noise points, border points, and safe points. Noise points are minority class points whose nearest neighbors all belong to the majority class. Border points have both majority and minority class points among their neighbors, with majority points making up at least half of them. The remaining minority points are safe. The Borderline SMOTE algorithm creates synthetic points using only the border points and ignores the noise points.
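
The classification of minority points can be sketched as below. The thresholds follow the convention of the original Borderline-SMOTE paper as I understand it (all neighbors majority: noise; at least half but not all: border; otherwise: safe). The function name `classify_minority`, the value `m=3`, and the toy data are assumptions for illustration.

```python
import math


def classify_minority(X, y, minority_label, m=3):
    """Label each minority point as 'noise', 'border', or 'safe' from the
    classes of its m nearest neighbors in the whole dataset."""
    labels = {}
    for i, (x, lab) in enumerate(zip(X, y)):
        if lab != minority_label:
            continue
        neighbors = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(x, X[j]),
        )[:m]
        n_major = sum(1 for j in neighbors if y[j] != minority_label)
        if n_major == m:
            labels[i] = "noise"   # surrounded only by majority points
        elif n_major >= m / 2:
            labels[i] = "border"  # mixed neighborhood, mostly majority
        else:
            labels[i] = "safe"    # mostly minority neighbors
    return labels


# Toy data (made up). Label 0 = majority, label 1 = minority.
X = [
    [0, 0], [0, 1], [1, 0], [1, 1], [7, 10], [8, 9],  # majority
    [0.5, 0.5],                                       # minority inside the majority region
    [8, 10],                                          # minority near the class boundary
    [10, 10], [10, 11], [11, 10], [11, 11],           # tight minority cluster
]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
kinds = classify_minority(X, y, minority_label=1)
print(kinds)
```

Only the points labeled "border" would then be passed to the SMOTE interpolation step.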

(Image by Author), Left: Scatter plot after Borderline SMOTE, Right: Performance of model after Borderline SMOTE

4. KMeans SMOTE:

K-Means SMOTE is an oversampling method for class-imbalanced data. It aids classification by generating minority class samples in safe and crucial areas of the input space. The method avoids the generation of noise and effectively overcomes imbalances between and within classes.

K-Means SMOTE works in three steps:

  1. Cluster the entire data using the k-means clustering algorithm.
  2. Select clusters that have a high number of minority class samples
  3. Assign more synthetic samples to clusters where minority class samples are sparsely distributed.

Here each filtered cluster is oversampled using SMOTE.
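
The filtering and weighting steps can be sketched as follows, assuming the cluster assignments have already been produced by k-means. The sparsity measure used here (mean pairwise minority distance raised to the number of features, divided by the minority count) is my reading of the K-Means SMOTE approach and should be treated as an assumption, as should the function name `cluster_sampling_weights`, the `min_share` threshold, and the toy data.

```python
import math
from itertools import combinations


def cluster_sampling_weights(X, y, clusters, minority_label, min_share=0.5):
    """Return {cluster_id: share of synthetic samples} for clusters
    dominated by the minority class (hypothetical helper)."""
    n_features = len(X[0])
    sparsity = {}
    for c in sorted(set(clusters)):
        members = [(x, lab) for x, lab, cl in zip(X, y, clusters) if cl == c]
        minority = [x for x, lab in members if lab == minority_label]
        # Filter step: keep only clusters with a high share of minority points.
        if len(minority) < 2 or len(minority) / len(members) <= min_share:
            continue
        pairs = list(combinations(minority, 2))
        mean_dist = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
        # Sparse clusters (large spacing, few points) get a larger value.
        sparsity[c] = mean_dist ** n_features / len(minority)
    total = sum(sparsity.values())
    if total == 0:
        return {}
    return {c: s / total for c, s in sparsity.items()}


# Toy data (made up), with cluster ids assumed to come from k-means.
X = [[0, 0], [0, 1], [1, 0], [0.5, 0.5],          # cluster 0: dense minority
     [10, 10], [10, 14], [14, 10],                # cluster 1: sparse minority
     [20, 20], [21, 20], [20, 21], [21, 21]]      # cluster 2: mostly majority
y = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
weights = cluster_sampling_weights(X, y, clusters, minority_label=1)
print(weights)
```

Cluster 2 is filtered out because it is dominated by the majority class, and the sparse cluster 1 receives a larger share of the synthetic samples than the dense cluster 0.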

(Image by Author), Left: Scatter plot after KMeans SMOTE, Right: Performance of model after KMeans SMOTE

5. SVM SMOTE:

Another variation of Borderline-SMOTE is Borderline-SMOTE SVM, or we could just call it SVM-SMOTE. This technique incorporates the SVM algorithm to identify the misclassification points.

In SVM-SMOTE, the borderline area is approximated by the support vectors obtained after training an SVM classifier on the original training set. Synthetic data is then randomly created along the lines joining each minority class support vector with a number of its nearest neighbors.

(Image by Author), Left: Scatter plot after SVM SMOTE, Right: Performance of model after SVM SMOTE

6. Adaptive Synthetic Sampling — ADASYN:

Borderline SMOTE creates synthetic points using only the extreme observations, i.e. the border points, and ignores the rest of the minority class points. The ADASYN algorithm addresses this limitation by creating synthetic data according to the data density.

The amount of synthetic data generated is inversely proportional to the density of the minority class. In other words, comparatively more synthetic data is created in regions where the minority class is sparse than in regions where it is dense.
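
A minimal sketch of how such a density-based allocation could work: each minority point gets a share of the synthetic samples proportional to the fraction of majority points among its k nearest neighbors. The function name `adasyn_allocation`, the value `k=3`, and the toy data are assumptions for illustration, not the article's code.

```python
import math


def adasyn_allocation(X, y, minority_label, n_new, k=3):
    """Return {index of minority point: number of synthetic samples to
    generate around it} (hypothetical helper)."""
    ratios = {}
    for i, (x, lab) in enumerate(zip(X, y)):
        if lab != minority_label:
            continue
        neighbors = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(x, X[j]),
        )[:k]
        # Fraction of majority points in the neighborhood: high where
        # the minority class is sparse.
        ratios[i] = sum(1 for j in neighbors if y[j] != minority_label) / k
    total = sum(ratios.values())
    if total == 0:
        return {i: 0 for i in ratios}
    return {i: round(n_new * r / total) for i, r in ratios.items()}


# Toy data (made up). Label 0 = majority, label 1 = minority.
X = [
    [0, 0], [0, 1], [1, 0], [1, 1], [7, 10], [8, 9],  # majority
    [0.5, 0.5], [8, 10],                              # isolated minority points
    [10, 10], [10, 11], [11, 10], [11, 11],           # dense minority cluster
]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
allocation = adasyn_allocation(X, y, minority_label=1, n_new=10)
print(allocation)
```

In this toy example the dense minority cluster receives no synthetic samples, while the two isolated minority points receive all of them.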

(Image by Author), Left: Scatter plot after ADASYN, Right: Performance of model after ADASYN

7. SMOTE-NC:

The SMOTE oversampling technique is designed for datasets in which all features are continuous. For a dataset with categorical features, there is a variation of SMOTE called SMOTE-NC (Nominal and Continuous).

SMOTE can be applied to data with categorical features after one-hot encoding, but this may increase the dimensionality. Label encoding can also be used to convert categorical features to numerical ones, but interpolating between label codes can produce values that do not correspond to any real category. This is why SMOTE-NC is needed for mixed data. SMOTE-NC is used by indicating which features are categorical; it then resamples the categorical values instead of creating synthetic ones by interpolation.
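
The mixed handling can be sketched as follows: continuous features are interpolated as in SMOTE, while a categorical feature takes the most frequent value among the nearest neighbors. This is a simplified illustration that takes the neighbors as given (the real neighbor search for mixed data is more involved). The function name `smote_nc_sample` and the example rows are hypothetical.

```python
import random
from collections import Counter


def smote_nc_sample(base, neighbors, categorical_idx, rng):
    """Build one synthetic row from a minority row and its neighbors:
    interpolate continuous features, vote on categorical ones."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random number in [0, 1)
    new_row = []
    for i, value in enumerate(base):
        if i in categorical_idx:
            # Categorical: most frequent category among the neighbors.
            votes = Counter(n[i] for n in neighbors)
            new_row.append(votes.most_common(1)[0][0])
        else:
            # Continuous: point on the segment between base and neighbor.
            new_row.append(value + gap * (neighbor[i] - value))
    return new_row


# Made-up rows: two continuous features and one categorical feature.
base = [30.0, 50000.0, "France"]
neighbors = [
    [35.0, 60000.0, "Spain"],
    [40.0, 55000.0, "Spain"],
    [32.0, 52000.0, "France"],
]
row = smote_nc_sample(base, neighbors, categorical_idx={2}, rng=random.Random(0))
print(row)
```

The categorical column always holds a value that exists in the data, so no fractional or meaningless category codes are produced.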

(Image by Author), Left: Performance of model before SMOTE-NC, Right: Performance of model after SMOTE-NC

Implementation:

Conclusion:

Modeling an imbalanced dataset is a major challenge when training a model. The performance of the model can be improved by using the oversampling techniques discussed above. This article also discussed SMOTE-NC, a variation of SMOTE that can handle categorical features.

Model performance on an imbalanced dataset can also be improved by using undersampling techniques such as Random Undersampling and TomekLinks, or a combination of oversampling and undersampling techniques such as SMOTEENN and SMOTETomek.


Thank You for Reading
