Amy @GrabNGoInfo

Summary

The web content outlines methods for improving imbalanced classification in neural network models using Keras, focusing on adjusting class weights to give more attention to the minority class.

Abstract

The article discusses the challenge of classifying imbalanced datasets using neural network models, where the minority class is often misrepresented. It introduces the concept of adjusting class weights within the cost function to enhance the model's sensitivity to the minority class. The author provides a step-by-step guide on creating an imbalanced dataset, building a baseline neural network model, and then improving it by incorporating balanced class weights using both automatic computation via sklearn and manual adjustment. The tutorial demonstrates significant improvement in minority class recall by applying these techniques, emphasizing the importance of class weight adjustment in imbalanced classification problems.

Opinions

  • The author believes that using balanced weights for imbalanced datasets is beneficial as it allows for model training directly on the imbalanced data without the need for oversampling or undersampling.
  • The article suggests that the class_weight option in Keras' fit method is a powerful tool for addressing class imbalance.
  • The author shows a clear preference for using the sklearn library to compute class weights, while also acknowledging the value of manually tuning these weights as a hyperparameter.
  • The author's opinion is that the significant improvement in minority class recall after applying balanced weights justifies the use of this technique in practice.
  • The article implies that the baseline model without class weight adjustments is inadequate for imbalanced classification, as it failed to predict any minority class data correctly.
  • The author encourages readers to explore further tutorials available on their YouTube Channel and website, indicating a commitment to educational content and community engagement.

Neural Network Model Balanced Weight For Imbalanced Classification In Keras

Adjusting the balanced weight for the cost function to give more attention to the minority class in a neural network model

When using a neural network model to classify imbalanced data, we can adjust the balanced weight for the cost function to give more attention to the minority class. Python’s Keras library has a built-in option called class_weight to help us achieve this quickly.

One benefit of using the balanced weight adjustment is that we can use the imbalanced data to build the model directly without oversampling or under-sampling before training the model. To learn about oversampling and under-sampling techniques, please check my previous posts here and here.

In this tutorial, we will go over the following topics:

  • Baseline neural network model for imbalanced classification
  • Calculate class weight using sklearn
  • Apply class weight on a neural network model
  • Apply manual class weight on a neural network model

Let’s get started!

Step 1: Import Libraries

# Synthetic dataset
from sklearn.datasets import make_classification
# Data processing
import pandas as pd
import numpy as np
from collections import Counter
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Model and performance
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from keras.layers import Dense
from keras.models import Sequential
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.utils import class_weight

Step 2: Create Imbalanced Dataset

Using make_classification from the sklearn library, we created two classes with a requested ratio between the majority class and the minority class of 0.995:0.005. Two informative features serve as predictors, and the dataset contains no redundant or repeated features.

# Create an imbalanced dataset
X, y = make_classification(n_samples=100000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.995, 0.005],
                           class_sep=0.5, random_state=0)
# Convert the data from numpy array to a pandas dataframe
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target': y})
# Check the target distribution
df['target'].value_counts(normalize = True)

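The output shows that we have about 1% of the data in the minority class and 99% in the majority class. The minority share is about twice the requested 0.5% because make_classification adds label noise by default (flip_y=0.01).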

0    0.9897
1    0.0103
Name: target, dtype: float64
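
The gap between the requested 0.5% and the observed 1% can be checked with some quick arithmetic. This is a back-of-the-envelope sketch, not part of the original tutorial code, and it assumes that the 1% of samples affected by the default flip_y=0.01 receive a label drawn uniformly from the two classes.

```python
# Expected minority share under make_classification's default label noise.
requested_minority = 0.005   # from weights=[0.995, 0.005]
flip_y = 0.01                # sklearn default: fraction of samples given a random label
n_classes = 2

# Samples that keep their label contribute the requested share;
# flipped samples land in the minority class with probability 1 / n_classes.
expected_minority = (1 - flip_y) * requested_minority + flip_y / n_classes
print(f"Expected minority share: {expected_minority:.5f}")
```

The result is close to the 0.0103 reported above, so the dataset is somewhat less imbalanced than the weights argument alone suggests.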

Step 3: Train Test Split

In this step, we split the dataset into 80% training data and 20% test data. Setting random_state ensures that we get the same train test split every time. The seed does not have to be 42; any integer works.

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Check the number of records
print('The number of records in the training dataset is', X_train.shape[0])
print('The number of records in the test dataset is', X_test.shape[0])
print(f"The training dataset has {sorted(Counter(y_train).items())[0][1]} records for the majority class and {sorted(Counter(y_train).items())[1][1]} records for the minority class.")

The train test split gives us 80,000 records for the training dataset and 20,000 for the test dataset. The training dataset has 79,183 data points from the majority class and 817 from the minority class.

The number of records in the training dataset is 80000
The number of records in the test dataset is 20000
The training dataset has 79183 records for the majority class and 817 records for the minority class.
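
Because the split above is not stratified, the minority share can differ slightly between the two datasets. The sketch below compares the shares using the counts reported in this tutorial (213 is the minority class support shown in the test set reports). The stratify=y option mentioned in the comment is an optional alternative, not something the original code uses.

```python
# Minority share in each split, using the counts reported in this tutorial.
train_minority, train_total = 817, 80000
test_minority, test_total = 213, 20000   # support of class 1 in the test reports

train_share = train_minority / train_total
test_share = test_minority / test_total
print(f"train: {train_share:.4%}, test: {test_share:.4%}")

# To keep the two shares equal, train_test_split accepts stratify=y:
# train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```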

Step 4: Baseline Neural Network Model

This step creates a neural network model on the imbalanced training dataset as the baseline model.

We created the neural network model with three Dense layers. Since we have two features, the input_dim is 2. The first two layers have two neurons each, and the output layer has one neuron. Note that in Keras the input itself is specified by input_dim, so both two-neuron layers act as hidden layers.

The activation function for the two hidden layers is 'relu', a popular activation function with good performance. The output activation function is 'sigmoid', which maps the output to a value between 0 and 1 and is used for binary classification.

# Train the neural network model using the imbalanced dataset
# Create model
nn_model=Sequential()
nn_model.add(Dense(2,input_dim=2,activation='relu'))
nn_model.add(Dense(2,activation='relu'))
nn_model.add(Dense(1,activation='sigmoid'))
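
To make the two activation functions and the size of this model concrete, here is a small pure-Python sketch. The helper names relu, sigmoid, and dense_params are illustrative and not part of Keras; the parameter count assumes each Dense layer uses a bias, which is the Keras default.

```python
import math

def relu(x):
    # ReLU passes positive values through and clips negatives to zero.
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid squashes any real number into (0, 1), read here as P(class 1).
    return 1.0 / (1.0 + math.exp(-x))

def dense_params(n_inputs, n_units):
    # A Dense layer has one weight per input-unit pair plus one bias per unit.
    return n_inputs * n_units + n_units

# Three Dense layers: 2 -> 2 -> 2 -> 1
total_params = dense_params(2, 2) + dense_params(2, 2) + dense_params(2, 1)
print(relu(-1.5), relu(2.0), sigmoid(0.0), total_params)
```

With only 15 trainable parameters, this is a very small network, which is worth keeping in mind when reading the results.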

We set the loss to be 'binary_crossentropy' when compiling the model because we are building a binary classification model. For a multi-class classification model, the loss is usually 'categorical_crossentropy', and for a linear regression model, the loss is usually 'mean_squared_error'.

The optimizer is responsible for changing the weights and the learning rate to reduce the loss. 'adam' is a widely used optimizer.

#Compile model
nn_model.compile(loss='binary_crossentropy',optimizer='adam')

After compiling the model, we fit the neural network model on the training dataset. Setting epochs to 50 means that the model goes through the training dataset 50 times. Setting batch_size to 100 means that each weight update uses 100 data points.

#Fit the model
nn_model.fit(X_train,y_train, epochs=50, batch_size=100)
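
The epochs and batch_size settings translate into a number of weight updates. The arithmetic below is a rough sketch based on the training counts from Step 3; it also shows how few minority records an average batch contains, assuming batches are drawn at random.

```python
import math

n_train = 80000      # training records from Step 3
n_minority = 817     # minority records in the training dataset
batch_size = 100
epochs = 50

updates_per_epoch = math.ceil(n_train / batch_size)  # last batch may be partial
total_updates = updates_per_epoch * epochs
minority_per_batch = batch_size * n_minority / n_train
print(updates_per_epoch, total_updates, round(minority_per_batch, 2))
```

On average a batch holds about one minority record, so an unweighted loss is dominated by the majority class at almost every update.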

Now let’s make predictions on the test dataset and check the model performance.

# Prediction
nn_model_prediction = nn_model.predict(X_test)
nn_model_classes =  [1 if i>0.5 else 0 for i in nn_model_prediction]
# Check the model performance
print(classification_report(y_test, nn_model_classes))

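We got a recall of 0 for the minority class, which means that the neural network model did not predict any minority data correctly. The 99% accuracy simply reflects the share of the majority class in the test dataset.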

               precision    recall  f1-score   support
           0       0.99      1.00      0.99     19787
           1       0.00      0.00      0.00       213
    accuracy                           0.99     20000
   macro avg       0.49      0.50      0.50     20000
weighted avg       0.98      0.99      0.98     20000
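
To see why accuracy is misleading here, consider a hypothetical classifier that predicts the majority class for every record. The sketch below uses the class counts from the support column above and is only an illustration, not the trained model itself.

```python
# Test-set class counts taken from the "support" column above.
n_majority, n_minority = 19787, 213

# A hypothetical classifier that predicts class 0 for every record.
true_negatives, false_positives = n_majority, 0
true_positives, false_negatives = 0, n_minority

accuracy = (true_positives + true_negatives) / (n_majority + n_minority)
minority_recall = true_positives / (true_positives + false_negatives)
majority_recall = true_negatives / (true_negatives + false_positives)
macro_recall = (minority_recall + majority_recall) / 2

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}, macro recall={macro_recall:.2f}")
```

This trivial classifier reproduces the accuracy, recall, and macro-average recall in the report, which is why minority class recall is the more informative metric for this problem.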

Let’s see if the balanced weight can help us.

Step 5: Calculate Class Weight Using Sklearn

sklearn has a built-in utility function compute_class_weight to calculate the class weights. With the 'balanced' option, each class weight is n_samples / (n_classes * class count), so the weights are inversely proportional to the class frequencies.

# Calculate weights using sklearn
# Recent sklearn versions require keyword arguments here
sklearn_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
sklearn_weights
array([ 0.50515894, 48.95960832])
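
The two numbers above can be reproduced by hand from the class counts. The sketch below is a minimal pure-Python version of the 'balanced' heuristic; balanced_class_weights is an illustrative helper name, not an sklearn function, and the label list is rebuilt from the training counts in Step 3.

```python
from collections import Counter

def balanced_class_weights(labels):
    # weight_c = n_samples / (n_classes * count_c)
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}

# Rebuild a label list with the training counts from Step 3.
labels = [0] * 79183 + [1] * 817
weights = balanced_class_weights(labels)
print(weights)
```

The minority weight is roughly 97 times the majority weight, which matches the ratio of 79,183 to 817 records.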

The computed weights from sklearn are in array format. We need to transform the array into a dictionary because the class_weight argument in Keras expects a dictionary that maps each class index to its weight.

# Transform array to dictionary
sklearn_weights = dict(enumerate(sklearn_weights))
sklearn_weights
{0: 0.5051589356301226, 1: 48.959608323133416}
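
When this dictionary is passed to Keras, each sample's loss is multiplied by the weight of its class. The sketch below illustrates the idea in pure Python; weighted_bce is an illustrative helper rather than a Keras function, and averaging over the batch is a simplifying assumption about the reduction.

```python
import math

def weighted_bce(y_true, y_prob, class_weight):
    # Per-sample binary cross-entropy scaled by the weight of the true class,
    # then averaged over the batch (assumed reduction for this sketch).
    eps = 1e-7
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * loss
    return total / len(y_true)

class_weight = {0: 0.5051589356301226, 1: 48.959608323133416}
# One confidently wrong minority sample vs. one confidently wrong majority sample.
missed_minority = weighted_bce([1], [0.1], class_weight)
false_alarm = weighted_bce([0], [0.9], class_weight)
ratio = missed_minority / false_alarm
print(round(ratio, 2))
```

With these weights, missing a minority record costs about 97 times as much as an equally confident false alarm on a majority record.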

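Step 6: Neural Network Model With Balanced Weight

In this step, we keep the loss, optimizer, epochs, and batch size the same as the baseline model and pass the dictionary of balanced weights from Step 5 to the class_weight argument when fitting the model. Note that the code below defines one two-neuron Dense layer instead of the two used in the baseline, so the architecture is not strictly identical.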

# Train the neural network model using the imbalanced dataset
# Create model
nn_model_balanced = Sequential()
nn_model_balanced.add(Dense(2,input_dim=2,activation='relu'))
nn_model_balanced.add(Dense(1,activation='sigmoid'))
#Compile model
nn_model_balanced.compile(loss='binary_crossentropy',optimizer='adam')
#Fit the model
nn_model_balanced.fit(X_train,y_train, epochs=50, batch_size=100, class_weight=sklearn_weights)
# Prediction
nn_model_balanced_prediction = nn_model_balanced.predict(X_test)
nn_model_balanced_classes = [1 if i>0.5 else 0 for i in nn_model_balanced_prediction]
# Check the model performance
print(classification_report(y_test, nn_model_balanced_classes))

We can see that the minority class recall increased from 0 to 56%, which is a significant improvement. The gain comes with a trade-off: majority class recall dropped from 100% to 62%, and minority class precision is only 2%. Note that your results may differ from mine because of the randomness in neural network training, but the difference should be small.

               precision    recall  f1-score   support
           0       0.99      0.62      0.76     19787
           1       0.02      0.56      0.03       213
    accuracy                           0.62     20000
   macro avg       0.50      0.59      0.40     20000
weighted avg       0.98      0.62      0.75     20000
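
The low minority precision follows directly from the class sizes. The sketch below recovers approximate confusion counts from the rounded recall values in the report, so the counts are estimates rather than exact values from the model.

```python
# Approximate confusion counts recovered from the rounded report above.
n_majority, n_minority = 19787, 213
minority_recall, majority_recall = 0.56, 0.62

true_positives = round(minority_recall * n_minority)          # minority records caught
false_positives = round((1 - majority_recall) * n_majority)   # majority records flagged as minority

precision = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, round(precision, 2))
```

Roughly 119 true positives are outnumbered by several thousand false positives, which is why precision for the minority class stays near 2%.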

Step 7: Manual Class Weight On Neural Network Model

Although the balanced weights are commonly calculated using the inverse proportion of class frequencies, we can also set our own class weights and tune them as a hyperparameter. For example, we can set the cost penalty ratio between the majority class and the minority class to 1:200.

manual_weights = {0: 1, 1: 200}
# Train the neural network model using the imbalanced dataset
# Create model
nn_model_mbalanced = Sequential()
nn_model_mbalanced.add(Dense(2,input_dim=2,activation='relu'))
nn_model_mbalanced.add(Dense(1,activation='sigmoid'))
#Compile model
nn_model_mbalanced.compile(loss='binary_crossentropy',optimizer='adam')
#Fit the model
nn_model_mbalanced.fit(X_train,y_train, epochs=50, batch_size=100, class_weight=manual_weights)
# Prediction
nn_model_mbalanced_prediction = nn_model_mbalanced.predict(X_test)
nn_model_mbalanced_classes = [1 if i>0.5 else 0 for i in nn_model_mbalanced_prediction]
# Check the model performance
print(classification_report(y_test, nn_model_mbalanced_classes))

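We are able to capture 98% of the minority class after increasing the cost penalty for the minority class. However, majority class recall falls to 8% and overall accuracy to 9%, so the weight should be tuned against the precision and recall that the application needs.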

               precision    recall  f1-score   support
           0       1.00      0.08      0.16     19787
           1       0.01      0.98      0.02       213
    accuracy                           0.09     20000
   macro avg       0.50      0.53      0.09     20000
weighted avg       0.99      0.09      0.15     20000
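
One way to reason about the weight ratio is to ask how it shifts the decision boundary. For an idealized model that minimizes the class-weighted cross-entropy exactly, the output crosses 0.5 when the true minority probability exceeds w0 / (w0 + w1). The sketch below works this out for both weight settings used in this tutorial; the helper names are illustrative, and the small network here will only approximate this behavior.

```python
def effective_threshold(w0, w1):
    # True minority probability above which an ideal model trained with
    # class weights (w0, w1) outputs a score greater than 0.5.
    return w0 / (w0 + w1)

def weighted_score(p, w0, w1):
    # Minimizer of the class-weighted cross-entropy when the true
    # minority probability is p.
    return w1 * p / (w1 * p + w0 * (1 - p))

balanced = effective_threshold(0.5051589356301226, 48.959608323133416)
manual = effective_threshold(1, 200)
print(f"balanced: {balanced:.4f}, manual 1:200: {manual:.4f}")
```

Under this idealization, the 1:200 ratio flags a record as minority once its minority probability exceeds about 0.5%, roughly half the threshold implied by the balanced weights, which is consistent with the higher recall and the much lower majority class recall seen above.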

Summary

In this tutorial, we built neural network models with and without balanced weights for imbalanced classification. Results show that the balanced weight significantly improved the model’s ability to capture the minority class, at the cost of more false positives on the majority class. Python’s sklearn library can compute the balanced weights from the class frequencies, but we can also set our own weights and tune them as a hyperparameter.

More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.
