Using Over-Sampling Techniques for Extremely Imbalanced Data

In the previous post “Using Under-Sampling Techniques for Extremely Imbalanced Data”, I described several under-sampling techniques to deal with extremely imbalanced data. In this post, I describe over-sampling techniques to attack the same issue.
I have written articles on a variety of data science topics. For ease of use, you can bookmark my summary post “Dataman Learning Paths — Build Your Skills, Drive Your Career” which lists the links to all articles.
Oversampling increases the weight of the minority class by replicating minority class examples. Because replication adds no new information, it raises the risk of over-fitting: the model becomes too specific to the training examples. Accuracy on the training set may well be high while performance on new data is worse.
(1) Random oversampling for the minority class
Random oversampling simply replicates minority class examples at random. It is known to increase the likelihood of overfitting. Random undersampling, by contrast, has a different major drawback: it can discard useful data.
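To make the idea concrete, here is a minimal sketch of random oversampling using only the Python standard library. The function name `random_oversample`, the toy data, and the choice to stop when the two classes are the same size are my assumptions for illustration. It also assumes a binary problem, so every label other than `minority_label` is counted as majority.

```python
import random
from collections import Counter

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    Hypothetical helper for illustration; assumes a binary classification problem.
    """
    rng = random.Random(seed)
    minority = [row for row, label in zip(X, y) if label == minority_label]
    n_majority = sum(1 for label in y if label != minority_label)
    n_extra = max(0, n_majority - len(minority))
    # Sample with replacement: the same minority row can be copied many times.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    X_res = list(X) + extra
    y_res = list(y) + [minority_label] * n_extra
    return X_res, y_res

# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X, y, minority_label=1)
print(Counter(y_res))
```

Every added row is an exact copy of an existing minority row, which is why this approach adds no new information and can encourage overfitting.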
(2) Synthetic Minority Oversampling Technique (SMOTE)
To avoid the over-fitting problem, Chawla et al. (2002) proposed the Synthetic Minority Over-sampling Technique (SMOTE). This method is considered a state-of-the-art technique and works well in various applications. It generates synthetic data based on the feature space similarities between existing minority instances. To create a synthetic instance, it finds the K-nearest neighbors of a minority instance, randomly selects one of them, and then linearly interpolates between the two to produce a new minority instance in the neighborhood.
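The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration, not a reference implementation: the function name `smote_sketch` and the toy points are assumptions, and the base instance is picked at random for each new point rather than looping over every minority instance.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Hypothetical helper for illustration; `minority` is a list of numeric rows.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # K nearest minority neighbors of the base point (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the line segment from base to neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5], [1.2, 1.8]]
new_points = smote_sketch(minority, n_new=4, k=3)
for point in new_points:
    print(point)
```

Unlike random oversampling, each new point lies somewhere between two existing minority instances instead of being an exact copy of one.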
(3) ADASYN: Adaptive Synthetic Sampling
Motivated by SMOTE, He et al. (2009) proposed the Adaptive Synthetic sampling (ADASYN) technique, which has received wide attention.
ADASYN generates samples of the minority class according to their density distributions: more synthetic data is generated for minority class samples that are harder to learn than for those that are easier to learn. It finds the K-nearest neighbors of each minority instance, then uses the ratio of majority to minority instances among those neighbors to decide how many new samples to generate around it. This adaptively shifts the decision boundary to focus on the samples that are difficult to learn.
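The adaptive part is the allocation: how many synthetic samples each minority instance receives. The sketch below computes only that allocation, under my reading of the method. The function name `adasyn_allocation`, the `beta` balance parameter, and the toy data are assumptions for illustration. The synthetic points themselves would then be created by interpolation, as in SMOTE.

```python
import math

def adasyn_allocation(X, y, minority_label, k=5, beta=1.0):
    """Return {row index: number of synthetic samples} for each minority row.

    Hypothetical helper for illustration; assumes a binary problem.
    """
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_min = len(minority_idx)
    n_maj = len(y) - n_min
    total_new = (n_maj - n_min) * beta  # total synthetic samples to create
    k = min(k, len(X) - 1)
    ratios = []
    for i in minority_idx:
        # K nearest neighbors among ALL other rows, majority and minority alike.
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        n_majority_neighbors = sum(1 for j in nearest if y[j] != minority_label)
        # A higher share of majority neighbors marks a harder-to-learn point.
        ratios.append(n_majority_neighbors / k)
    total = sum(ratios)
    if total == 0:
        return {i: 0 for i in minority_idx}
    # Rounding means the counts may not sum exactly to total_new.
    return {i: round(total_new * r / total) for i, r in zip(minority_idx, ratios)}

# Toy data: 8 majority rows, then 4 minority rows.
# Row 8 sits inside the majority cluster; rows 9 to 11 form their own cluster.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [2.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 2.0],
     [1.5, 0.5], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = [0] * 8 + [1] * 4

counts = adasyn_allocation(X, y, minority_label=1, k=3)
print(counts)
```

In this toy example the minority row surrounded by majority rows receives the largest share, while the rows in the well-separated minority cluster receive fewer.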
Below I demonstrate the three oversampling methods. The notebook is available via this GitHub link.








