Gaussian Mixture Model (GMM) for Anomaly Detection

Predict anomalies from a Gaussian Mixture Model (GMM) using a percentage threshold and a value threshold, and improve anomaly prediction performance

A Gaussian Mixture Model (GMM) is a probabilistic clustering model that assumes the data are generated from a mixture of a finite number of Gaussian distributions. Anomaly detection is the process of identifying unusual data points. A GMM detects outliers by flagging the data points that fall in low-density regions [1]. In this tutorial, we will use Python’s sklearn library to train a GMM and use it to detect outliers. You will learn:

How to train a Gaussian Mixture Model (GMM)?
How to predict anomalies from a Gaussian Mixture Model (GMM) using percentage threshold and value threshold separately?
How to visualize the anomaly prediction results?
How to improve the anomaly prediction performance?

Let’s get started!

Step 1: Import Libraries

First, let’s import the Python libraries. We need make_blobs for synthetic dataset creation, pandas and numpy for data processing, matplotlib and seaborn for visualization, and GaussianMixture for modeling.

# Synthetic dataset
from sklearn.datasets import make_blobs

# Data processing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model
from sklearn.mixture import GaussianMixture

Step 2: Create Modeling Dataset

In the 2nd step, we will use make_blobs from the sklearn library to create a synthetic dataset with outliers.

First, we create a dataset with 500 data points, 3 clusters, 2 features, and a cluster standard deviation of 1. random_state ensures the reproducibility of the dataset.
Then, we create an anomaly dataset with 20 records, 2 centers, and a cluster standard deviation of 10, so its points are spread much more widely than the normal clusters.
The outputs from make_blobs are arrays, so we convert them into pandas dataframes. Each dataframe has three columns: feature1, feature2, and anomaly_indicator. anomaly_indicator is the ground truth label, with the value 0 for the normal data points and 1 for the anomaly data points.
After that, the normal dataset and the anomaly dataset are combined into one dataframe.

# Create the normal data
X_normal, y_normal = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
df_normal = pd.DataFrame({'feature1': X_normal[:, 0], 'feature2': X_normal[:, 1], 'anomaly_indicator': 0})

# Create the anomaly data
X_anomaly, y_anomaly = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
df_anomaly = pd.DataFrame({'feature1': X_anomaly[:, 0], 'feature2': X_anomaly[:, 1], 'anomaly_indicator': 1})

# Combine the normal and the anomaly data, resetting the index so row labels are unique
df = pd.concat([df_normal, df_anomaly], ignore_index=True)

# Change figure size
plt.figure(figsize=(12, 8))

# Visualization (seaborn 0.12+ requires x and y as keyword arguments)
sns.scatterplot(data=df, x='feature1', y='feature2', hue='anomaly_indicator')

In the visualization, the orange dots are the outliers and the blue dots are the normal data points.

Anomaly Detection Dataset — GrabNGoInfo.com
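
Before modeling, it can help to confirm that the combined dataset looks the way the description says. The snippet below is a minimal sketch that rebuilds the data from this step and checks the record counts and the share of anomalies. The variable names counts and anomaly_share are only for illustration, and ignore_index=True is used here so the combined dataframe has unique row labels.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Rebuild the normal and anomaly data with the same settings as in Step 2
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
df_normal = pd.DataFrame({'feature1': X_normal[:, 0], 'feature2': X_normal[:, 1], 'anomaly_indicator': 0})
df_anomaly = pd.DataFrame({'feature1': X_anomaly[:, 0], 'feature2': X_anomaly[:, 1], 'anomaly_indicator': 1})
df = pd.concat([df_normal, df_anomaly], ignore_index=True)

# Number of normal (0) and anomaly (1) records
counts = df['anomaly_indicator'].value_counts()
print(counts)

# Share of records labeled as anomalies: 20 out of 520
anomaly_share = df['anomaly_indicator'].mean()
print(f'Anomaly share: {anomaly_share:.2%}')
```

The anomaly share is 20 / 520, or about 3.85%, which is close to the 4 percent used as the percentage threshold in Step 4.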

Step 3: Gaussian Mixture Model (GMM) Training

In the 3rd step, a Gaussian Mixture Model (GMM) is trained.

A GMM is an unsupervised model, so the modeling dataset contains the features only and the label anomaly_indicator is excluded.
After the modeling dataset is created, we initialize the GMM with n_components=3 and n_init=5. n_components=3 means that the model has 3 Gaussian components (clusters), and n_init=5 means that the model is fitted from 5 different initializations and the best result is kept. random_state makes the model reproducible.
Then, the GMM is fitted to the modeling dataset X, and fit_predict returns the predicted cluster label for each data point.

# Model dataset
X = df[df.columns.difference(['anomaly_indicator'])]

# GMM model
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)

# Fit and predict on the data
y_gmm = gmm.fit_predict(X)
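
After fitting, it is worth checking that the model converged and looking at what it learned. The following is a minimal sketch that rebuilds the data as a plain numpy array (an assumption made to keep the example self-contained; the tutorial itself uses a dataframe) and prints a few fitted attributes of GaussianMixture.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the features from Step 2 as one array (normal rows first, then anomalies)
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])

# Same model settings as in Step 3
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)
y_gmm = gmm.fit_predict(X)

# Whether the best of the 5 initializations reached convergence
print('Converged:', gmm.converged_)

# Mixing weight and mean of each Gaussian component
print('Mixture weights:', gmm.weights_.round(3))
print('Component means:')
print(gmm.means_.round(2))

# Number of data points assigned to each component
print('Cluster sizes:', np.bincount(y_gmm, minlength=3))
```

Note that the cluster labels in y_gmm are not anomaly predictions. Every point, including each outlier, is assigned to its most likely component, which is why the next steps work with the sample scores instead.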

Step 4: GMM Predict Anomalies Using Percentage Threshold

In step 4, we will use the results from the Gaussian Mixture Model (GMM) to predict anomalies based on a percentage threshold. This percentage threshold is usually obtained through historical data or business knowledge.

First, the score for each sample is obtained using the score_samples method. The score is the log-likelihood of the sample under the fitted model, so a lower score means the point lies in a lower-density region.
Second, the score is saved as a column in the pandas dataframe.
After that, we get the score value that corresponds to the percentage threshold. In this example, we assume 4 percent of the data are outliers, and the 4th percentile of the scores is -6.56.
Finally, a column is created based on the threshold. A data point is predicted to be an outlier if its score is less than the threshold.

# Get the score for each sample
score = gmm.score_samples(X)

# Save score as a column
df['score'] = score

# Get the score threshold for anomaly
pct_threshold = np.percentile(score, 4)

# Print the score threshold
print(f'The threshold of the score is {pct_threshold:.2f}')

# Label the anomalies (1 if the score is below the threshold, else 0)
df['anomaly_gmm_pct'] = (df['score'] < pct_threshold).astype(int)

Output

The threshold of the score is -6.56
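
To make the meaning of the score concrete, the sketch below recomputes it by hand. It assumes the default covariance_type='full' and uses scipy (not part of the original tutorial) to evaluate each Gaussian component. The score of a point is the log of the weighted sum of the component densities at that point.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the features and the model from Steps 2 and 3
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# log(weight_k) + log N(x | mean_k, cov_k) for every sample and component
weighted_log_prob = np.column_stack([
    np.log(w) + multivariate_normal(mean=m, cov=c).logpdf(X)
    for w, m, c in zip(gmm.weights_, gmm.means_, gmm.covariances_)
])

# Log of the mixture density: log-sum-exp over the components
manual_score = logsumexp(weighted_log_prob, axis=1)

# Compare with the scores returned by sklearn
print(np.allclose(manual_score, gmm.score_samples(X)))
```

Because the score is a log-density, a point far from all three components gets a very negative score, and that is what the threshold picks up.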

From the visualization, we can see that the prediction identified most of the outliers, but there are some false positives and false negatives near the clusters.

# Visualize the actual and predicted anomalies
fig, (ax0, ax1) = plt.subplots(1, 2, sharey=True, figsize=(20, 12))
# Ground truth
ax0.set_title('Ground Truth')
ax0.scatter(df['feature1'], df['feature2'], c=df['anomaly_indicator'], cmap='rainbow')
# GMM Predictions
ax1.set_title('GMM Predict Anomalies Using Percentage')
ax1.scatter(df['feature1'], df['feature2'], c=df['anomaly_gmm_pct'], cmap='rainbow')

GMM Predict Anomalies Using Percentage Threshold — GrabNGoInfo.com
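
The plot gives a visual impression of the errors. Since this synthetic dataset has ground truth labels, the false positives and false negatives can also be counted. The sketch below is one possible way to do that with sklearn.metrics, which the original tutorial does not use. The names y_true and y_pred are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.mixture import GaussianMixture

# Rebuild the features, the ground truth, and the model from Steps 2 and 3
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
y_true = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# Flag the lowest 4 percent of scores as anomalies, as in Step 4
score = gmm.score_samples(X)
pct_threshold = np.percentile(score, 4)
y_pred = (score < pct_threshold).astype(int)

# Rows are the actual classes, columns are the predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'True positives: {tp}, false positives: {fp}, false negatives: {fn}, true negatives: {tn}')

# Precision: share of flagged points that are real anomalies
# Recall: share of real anomalies that were flagged
print(f'Precision: {precision_score(y_true, y_pred):.2f}')
print(f'Recall: {recall_score(y_true, y_pred):.2f}')
```

In a real project the labels are usually not available up front, so this kind of check is only possible on a labeled sample or after cases have been reviewed.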

Step 5: GMM Predict Anomalies Using Value Threshold

In step 5, we will use the results from the Gaussian Mixture Model (GMM) to predict anomalies based on a value threshold. This value threshold is obtained by observing the distribution of the scores.

From the histogram of the scores, we can see that most of the scores are greater than -5.5, so we set -5.5 as the threshold value.

# Change figure size
plt.figure(figsize=(12, 8))

# Check score distribution
sns.histplot(df['score'], bins=100, alpha=0.8)

# Threshold value
plt.axvline(x=-5.5, color='orange')

GMM Predict Anomalies Using Value Threshold — GrabNGoInfo.com

After deciding the threshold score for anomaly predictions, a new column is created to save the prediction results.

# Get the score threshold for anomaly
value_threshold = -5.5

# Label the anomalies (1 if the score is below the threshold, else 0)
df['anomaly_gmm_value'] = (df['score'] < value_threshold).astype(int)

# Visualize the actual and predicted anomalies
fig, (ax0, ax1) = plt.subplots(1, 2, sharey=True, figsize=(20, 12))
# Ground truth
ax0.set_title('Ground Truth')
ax0.scatter(df['feature1'], df['feature2'], c=df['anomaly_indicator'], cmap='rainbow')
# GMM Predictions
ax1.set_title('GMM Predict Anomalies Using Value')
ax1.scatter(df['feature1'], df['feature2'], c=df['anomaly_gmm_value'], cmap='rainbow')

The visualization shows that most of the outliers are identified correctly, but there are some false positives near clusters.
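
The percentage threshold and the value threshold are two ways of cutting the same score, so the points flagged by the lower cut are always a subset of the points flagged by the higher cut. The following sketch, with illustrative variable names, shows how many points each rule flags and how the two sets overlap.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the features and the model from Steps 2 and 3
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)
score = gmm.score_samples(X)

# The two rules from Step 4 and Step 5
pct_flag = score < np.percentile(score, 4)
value_flag = score < -5.5

# How many points each rule flags, and the share implied by the value threshold
print(f'Flagged by percentage threshold: {pct_flag.sum()}')
print(f'Flagged by value threshold: {value_flag.sum()} ({value_flag.mean():.2%} of the data)')

# Points flagged by only one of the two rules
only_value = (value_flag & ~pct_flag).sum()
only_pct = (pct_flag & ~value_flag).sum()
print(f'Flagged only by value threshold: {only_value}')
print(f'Flagged only by percentage threshold: {only_pct}')
```

A value threshold stays fixed when new data arrives, while a percentage threshold always flags the same share of whatever data it is applied to.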

Step 6: GMM Anomaly Detection Optimization

In step 6, we will talk about two methods to improve anomaly detection performance.

The first method is to improve the GMM itself through hyperparameter tuning. Because the sample scores are calculated from the fitted model, a better model is likely to produce scores that separate anomalies more clearly.
The second method is to optimize the threshold based on the ground truth. Take fraud detection as an example: after the outliers are identified with a certain threshold, analysts can review the flagged cases and decide whether each one is truly fraud. Because a point is flagged when its score is below the threshold, we can lower the threshold if there are a lot of false positives and raise it if there are a lot of false negatives.
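
Both methods can be sketched in a few lines. The example below is one possible approach, not the only one. For the first method it assumes BIC as the selection criterion and searches over n_components and covariance_type. For the second it assumes labels are available and picks the percentage threshold with the highest F1 score. The search ranges and the choice of F1 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score
from sklearn.mixture import GaussianMixture

# Rebuild the features and the ground truth from Step 2
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
y_true = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]

# Method 1: hyperparameter tuning, keeping the model with the lowest BIC
best_bic, best_gmm, best_params = np.inf, None, None
for n_components in range(1, 7):
    for covariance_type in ['full', 'tied', 'diag', 'spherical']:
        candidate = GaussianMixture(n_components=n_components,
                                    covariance_type=covariance_type,
                                    n_init=5, random_state=42).fit(X)
        bic = candidate.bic(X)
        if bic < best_bic:
            best_bic, best_gmm, best_params = bic, candidate, (n_components, covariance_type)
print(f'Best parameters by BIC: {best_params}, BIC: {best_bic:.1f}')

# Method 2: sweep percentage thresholds and compare against the ground truth
score = best_gmm.score_samples(X)
results = []
for pct in range(1, 11):
    threshold = np.percentile(score, pct)
    y_pred = (score < threshold).astype(int)
    results.append((pct, threshold, f1_score(y_true, y_pred)))

# Keep the threshold with the highest F1 score
best_pct, best_threshold, best_f1 = max(results, key=lambda r: r[2])
print(f'Best percentage: {best_pct}, threshold: {best_threshold:.2f}, F1: {best_f1:.2f}')
```

BIC measures how well the model describes the data as a whole, not how well it separates anomalies, and a model with extra components may end up fitting the outliers themselves. It is worth checking the tuned model against labeled cases before replacing the original one.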

Summary

In this tutorial, we talked about how to use a Gaussian Mixture Model (GMM) to detect outliers. You learned:

How to train a Gaussian Mixture Model (GMM)?
How to predict anomalies from a Gaussian Mixture Model (GMM) using percentage threshold and value threshold separately?
How to visualize the anomaly prediction results?
How to improve the anomaly prediction performance?

More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.

References

[1] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, O'Reilly Media, 2019.
