Chris Kuo/Dr. Dataman

Summary

The article discusses over-sampling techniques for handling extremely imbalanced datasets, including random oversampling, SMOTE, and ADASYN, and provides practical examples and code snippets for implementation.

Abstract

The article "Using Over-Sampling Techniques for Extremely Imbalanced Data" by Chris Kuo/Dr. Dataman follows up on a previous post about under-sampling techniques by focusing on over-sampling methods to address class imbalance in datasets. The author explains that over-sampling increases the minority class examples, which can lead to overfitting but is useful for preventing the loss of potentially valuable data. The article details three main over-sampling methods: (1) random oversampling, which duplicates minority class instances randomly; (2) Synthetic Minority Oversampling Technique (SMOTE), which generates synthetic data based on feature space similarities to avoid overfitting; and (3) ADASYN, which adaptively generates samples according to the density distribution of the minority class, focusing on harder-to-learn instances. The article also provides a link to a GitHub notebook demonstrating these methods and discusses how to apply these techniques in H2O.ai for data scientists. Additionally, the article is part of a larger series on anomaly detection with Python, with references to other chapters and books by the author for further reading.

Opinions

  • The author, Chris Kuo/Dr. Dataman, emphasizes the importance of addressing class imbalance to improve model performance and prevent the loss of valuable data.
  • Random oversampling is acknowledged to be simple but potentially problematic due to the increased risk of overfitting.
  • SMOTE is presented as a state-of-the-art technique that effectively generates synthetic data to enhance the minority class representation without directly duplicating instances.
  • ADASYN is highlighted as an improvement over SMOTE, as it adaptively focuses on generating synthetic samples for minority class instances that are more difficult to classify.
  • The author provides practical resources, including code snippets and a GitHub notebook, to facilitate the application of these techniques by data scientists.
  • The integration of these sampling methods with H2O.ai is discussed, showcasing the ease of implementation within a popular data science platform.
  • The article is positioned within a broader context of a book series on anomaly detection, suggesting a comprehensive approach to the topic and encouraging readers to explore further chapters and related literature.

Using Over-Sampling Techniques for Extremely Imbalanced Data

In the previous post “Using Under-Sampling Techniques for Extremely Imbalanced Data”, I described several under-sampling techniques to deal with extremely imbalanced data. In this post, I describe over-sampling techniques to attack the same issue.

I have written articles on a variety of data science topics. For ease of use, you can bookmark my summary post “Dataman Learning Paths — Build Your Skills, Drive Your Career” which lists the links to all articles.

Oversampling increases the weight of the minority class by replicating minority class examples. It adds no new information, yet it raises the risk of over-fitting, which makes the model too specific to the training data. Accuracy on the training set may well be high while performance on new datasets is worse.

(1) Random oversampling for the minority class

Random oversampling simply replicates minority class examples at random. It is known to increase the likelihood of overfitting. Random undersampling, on the other hand, has the major drawback that it can discard useful data.
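To make the idea concrete, here is a minimal sketch of random oversampling in plain Python. The function name random_oversample and the toy data are my own assumptions for illustration, not code from the notebook. It copies randomly chosen rows of each smaller class until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all
    classes are as large as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_rs, y_rs = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_rs.append(rng.choice(rows))  # an exact copy, no new information
            y_rs.append(label)
    return X_rs, y_rs


# Toy data: two minority rows (label 1) and four majority rows (label 0).
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [1.0, 1.1], [1.1, 0.9], [0.8, 1.0]]
y = [1, 1, 0, 0, 0, 0]

X_rs, y_rs = random_oversample(X, y)
print(Counter(y_rs))
```

Because every added row is an exact duplicate, a flexible model can memorize those few points, which is where the overfitting risk comes from.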

(2) Synthetic Minority Oversampling Technique (SMOTE)

To avoid the over-fitting problem, Chawla et al. (2002) proposed the Synthetic Minority Over-sampling Technique (SMOTE). It is considered a state-of-the-art technique and works well in various applications. SMOTE generates synthetic data based on the feature-space similarities between existing minority instances. To create a synthetic instance, it finds the K nearest neighbors of a minority instance within the minority class, randomly selects one of them, and then linearly interpolates between the two to produce a new minority instance in the neighborhood.
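The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions (the name smote_sketch, Euclidean distance, and base points drawn at random instead of looping over every minority instance), not the reference implementation from the paper.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority points
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from x to nb, in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Each new point lies on the line segment between two real minority points, so it is similar to them without being an exact copy.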

(3) ADASYN: Adaptive Synthetic Sampling

Motivated by SMOTE, He et al. (2009) proposed the Adaptive Synthetic sampling (ADASYN) technique, which has received wide attention.

ADASYN generates samples of the minority class according to their density distribution. More synthetic data is generated for minority class samples that are harder to learn than for those that are easier to learn. For each minority instance, it finds the K nearest neighbors and uses the ratio of majority to minority instances among them to decide how many new samples to generate around that instance. This adaptively shifts the decision boundary to focus on the samples that are difficult to learn.
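The allocation step is what separates ADASYN from SMOTE, and it can be sketched as follows. The function name adasyn_allocation and the toy data are assumptions for illustration. The sketch only computes how many synthetic samples each minority point would receive; the samples themselves would then be created by interpolation, as in SMOTE.

```python
import math


def adasyn_allocation(X, y, minority_label, n_new, k=3):
    """Return the minority points and the number of synthetic samples each
    one should receive, proportional to the share of majority points among
    its k nearest neighbours (hypothetical helper)."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    ratios = []
    for x in minority:
        # k nearest neighbours of x among all other points, any class
        others = [(p, lab) for p, lab in zip(X, y) if p is not x]
        nearest = sorted(others, key=lambda t: math.dist(x, t[0]))[:k]
        n_major = sum(1 for _, lab in nearest if lab != minority_label)
        ratios.append(n_major / k)
    total = sum(ratios)
    if total == 0:
        # No minority point has a majority neighbour: spread evenly.
        ratios, total = [1.0] * len(minority), float(len(minority))
    # Rounding means the counts may not add up to n_new exactly in general.
    counts = [round(n_new * r / total) for r in ratios]
    return minority, counts


# Five minority points (label 1); the last two sit close to the majority.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.6, 0.6],
     [1.1, 1.0], [1.0, 1.1], [1.1, 1.1], [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

minority, counts = adasyn_allocation(X, y, minority_label=1, n_new=5)
for point, count in zip(minority, counts):
    print(point, count)
```

In this toy example the three minority points that are surrounded by other minority points receive nothing, while the two points near the majority cluster receive all of the synthetic samples.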

The notebook that demonstrates the three oversampling methods is available via this GitHub link.
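For readers who want to try the three methods side by side without opening the notebook, here is a sketch that uses the imbalanced-learn library. It is not the notebook's code. It assumes imbalanced-learn and scikit-learn are installed, and the helper names (oversample_all, class_counts, demo) as well as the synthetic dataset are my own choices. The imports sit inside the functions so that the file loads even when the libraries are missing.

```python
from collections import Counter


def class_counts(y):
    """Count the examples per class label."""
    return dict(Counter(int(label) for label in y))


def oversample_all(X, y, seed=42):
    """Apply the three over-samplers and return {name: (X_rs, y_rs)}."""
    # Assumes imbalanced-learn is installed.
    from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

    samplers = {
        "random": RandomOverSampler(random_state=seed),
        "smote": SMOTE(random_state=seed),
        "adasyn": ADASYN(random_state=seed),
    }
    return {name: s.fit_resample(X, y) for name, s in samplers.items()}


def demo():
    """Build an illustrative dataset with about 2% minority and resample it."""
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=5000, n_features=10,
                               weights=[0.98, 0.02], flip_y=0,
                               random_state=42)
    print("original", class_counts(y))
    for name, (X_rs, y_rs) in oversample_all(X, y).items():
        print(name, class_counts(y_rs))


# demo()  # uncomment to run; needs scikit-learn and imbalanced-learn
```

Each sampler returns the resampled features X_rs and labels y_rs. Random oversampling and SMOTE should balance the classes exactly, while ADASYN usually ends up only approximately balanced because of its adaptive allocation.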

Have you read the previous article “Using Under-Sampling Techniques for Extremely Imbalanced Data”? The two articles together can give you a comprehensive view of both the under-sampling and over-sampling techniques!

Additional Note when Applying in H2O

For data scientists who use H2O.ai, how do you apply the above sampling techniques? If you are not familiar with H2O, the post “My Lecture Notes on Random Forest, Gradient Boosting, Regularization, and H2O.ai” shows the H2O code snippets for various algorithms.

It is fairly easy to do so. From the sampling code, you get the sampled data X_rs and the corresponding labels y_rs. All you need to do is concatenate X_rs and y_rs into one data frame and then convert it to an H2O data frame as usual.
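The concatenation and conversion might look like the sketch below. It assumes that pandas and h2o are installed, that X_rs is a 2-D array or a list of rows, and that a local H2O cluster can be started. The column names x0, x1, ... and the label column name "target" are placeholders of mine.

```python
def feature_names(n_features):
    """Placeholder column names x0, x1, ... for the feature columns."""
    return [f"x{i}" for i in range(n_features)]


def to_h2o_frame(X_rs, y_rs, target="target"):
    """Combine the resampled features and labels, then hand them to H2O."""
    # Assumes pandas and h2o are installed.
    import h2o
    import pandas as pd

    df = pd.DataFrame(X_rs, columns=feature_names(len(X_rs[0])))
    df[target] = list(y_rs)

    h2o.init()  # start or connect to a local H2O cluster
    frame = h2o.H2OFrame(df)
    # Mark the label column as categorical so H2O treats the task as classification.
    frame[target] = frame[target].asfactor()
    return frame
```

The returned frame can then be passed as the training frame to an H2O estimator, with the label column named as the response.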

References

  • N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research (JAIR), 16:321–357, 2002.

This chapter is part of the book series “Handbook of Anomaly Detection with Python Outlier Detection.” For easy navigation to chapters, I list the chapters at the end.
