Gaussian Mixture Model (GMM) for Anomaly Detection

Predict anomalies from a Gaussian Mixture Model (GMM) using percentage threshold and value threshold, and improve anomaly prediction performance

Gaussian Mixture Model (GMM) is a probabilistic clustering model that assumes the data are generated from a mixture of a finite number of Gaussian distributions. Anomaly detection is the process of identifying unusual data points. A Gaussian Mixture Model (GMM) detects outliers by flagging the data points that fall in low-density regions [1]. In this tutorial, we will use Python’s sklearn library to train a Gaussian Mixture Model (GMM) and use it to detect outliers. You will learn:

  • How to train a Gaussian Mixture Model (GMM)
  • How to predict anomalies from a Gaussian Mixture Model (GMM) using a percentage threshold and a value threshold separately
  • How to visualize the anomaly prediction results
  • How to improve the anomaly prediction performance
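
For reference, the low-density idea can be written down compactly. The notation below (K, π_k, μ_k, Σ_k, τ) is the standard textbook formulation of a Gaussian mixture and is introduced here for illustration; it does not come from the original post. The model density is a weighted sum of K Gaussian components, and a point is treated as an anomaly when its log-density falls below a chosen threshold τ:

```latex
\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
\]
\[
\text{flag } x \text{ as an anomaly if } \log p(x) < \tau
\]
```

The two thresholds discussed later in this tutorial are two ways of choosing τ: as a percentile of the scores, or as a fixed value read off the score distribution.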

Let’s get started!

Step 1: Import Library

Firstly, let’s import the Python libraries: make_blobs for synthetic dataset creation, pandas and numpy for data processing, matplotlib and seaborn for visualization, and GaussianMixture for modeling.

# Synthetic dataset
from sklearn.datasets import make_blobs
# Data processing
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Model
from sklearn.mixture import GaussianMixture

Step 2: Create Modeling Dataset

In the 2nd step, we will use make_blobs from the sklearn library to create a synthetic dataset with outliers.

  • Firstly, we create a normal dataset with 500 data points, 3 clusters, 2 features, and a cluster standard deviation of 1. random_state ensures the reproducibility of the dataset.
  • Then, we create an anomaly dataset with 20 records, 2 cluster centers, and a cluster standard deviation of 10, so the anomalies are spread widely across the feature space.
  • The outputs from make_blobs are in array format, so we convert them into pandas dataframes. Each dataframe has three columns: feature1, feature2, and anomaly_indicator. anomaly_indicator represents the ground truth of the anomalies, with the value 0 for the normal data points and 1 for the anomaly data points.
  • After that, the normal dataset and the anomaly dataset are combined, and the combined dataset is plotted.

# Create the normal data
X_normal, y_normal = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
df_normal = pd.DataFrame({'feature1': X_normal[:, 0], 'feature2': X_normal[:, 1], 'anomaly_indicator': 0})
# Create the anomaly data
X_anomaly, y_anomaly = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
df_anomaly = pd.DataFrame({'feature1': X_anomaly[:, 0], 'feature2': X_anomaly[:, 1], 'anomaly_indicator': 1})
# Combine the normal and the anomaly data
df = pd.concat([df_normal, df_anomaly], ignore_index=True)
# Change figure size
plt.figure(figsize=(12, 8))
# Visualization (x and y are passed by keyword, which recent seaborn versions require)
sns.scatterplot(data=df, x='feature1', y='feature2', hue='anomaly_indicator')

In the visualization, the orange dots are the outliers and the blue dots are the normal data points.

Anomaly Detection Dataset — GrabNGoInfo.com
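
Before modeling, it can help to confirm that the dataset has the size and label balance described above. The following is a minimal sketch that rebuilds the same data so it runs on its own; the names df_check, n_rows, n_anomalies, and anomaly_rate are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Rebuild the normal and anomaly data with the same settings as the tutorial
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)

df_check = pd.concat(
    [
        pd.DataFrame({'feature1': X_normal[:, 0], 'feature2': X_normal[:, 1], 'anomaly_indicator': 0}),
        pd.DataFrame({'feature1': X_anomaly[:, 0], 'feature2': X_anomaly[:, 1], 'anomaly_indicator': 1}),
    ],
    ignore_index=True,
)

# Count rows and labeled anomalies, then compute the labeled anomaly share
n_rows = len(df_check)
n_anomalies = int(df_check['anomaly_indicator'].sum())
anomaly_rate = n_anomalies / n_rows

print(df_check['anomaly_indicator'].value_counts())
print(f'Share of labeled anomalies: {anomaly_rate:.2%}')
```

With 20 anomalies among 520 records, the labeled anomaly share is a little under 4 percent, which is useful context for the percentage threshold used later.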

Step 3: Gaussian Mixture Model (GMM) Training

In the 3rd step, a Gaussian Mixture Model (GMM) is trained.

  • Gaussian Mixture Model (GMM) is an unsupervised model, so the modeling dataset contains the features only, and the label anomaly_indicator is excluded.
  • After the modeling dataset is created, we initialize the Gaussian Mixture Model (GMM) with n_components=3 and n_init=5. n_components=3 means that the model fits 3 Gaussian components (clusters), and n_init=5 means that the model is fitted from 5 different initializations and the best result is kept. random_state makes the model reproducible.
  • Then, fit_predict fits the Gaussian Mixture Model (GMM) on the modeling dataset X and returns the predicted cluster label for each data point.

# Model dataset
X = df[df.columns.difference(['anomaly_indicator'])]
# GMM model
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)
# Fit and predict on the data
y_gmm = gmm.fit_predict(X)
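
After fitting, it is worth looking at what the model learned. The sketch below refits the same model on a rebuilt copy of the data and prints the fitted attributes that scikit-learn exposes on GaussianMixture (converged_, weights_, means_, covariances_). The variable names labels and cluster_sizes are illustrative and not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the tutorial's features so this snippet runs on its own
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])

# Same model settings as in the tutorial
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)
labels = gmm.fit_predict(X)

# Inspect the fitted parameters
print('Converged:', gmm.converged_)
print('Mixing weights:', gmm.weights_.round(3))
print('Component means:\n', gmm.means_.round(2))
print('Covariance array shape:', gmm.covariances_.shape)

# Number of points assigned to each component
cluster_sizes = np.bincount(labels, minlength=3)
print('Points per component:', cluster_sizes)
```

Note that the cluster labels themselves are not used for anomaly detection in this tutorial; the anomaly predictions come from the per-sample scores in the next step.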

Step 4: GMM Predict Anomalies Using Percentage Threshold

In step 4, we will use the results from the Gaussian Mixture Model (GMM) to predict anomalies based on a percentage threshold. This percentage threshold is usually obtained through historical data or business knowledge.

  • Firstly, the score for each sample is obtained using the score_samples method. The score is the log-likelihood of the sample under the fitted model, so a lower score means the point lies in a lower-density region.
  • Secondly, the score is saved as a column in the pandas dataframe.
  • After that, we compute the score threshold that corresponds to the chosen percentage. In this example, we assume 4 percent of the data are outliers, and the 4th percentile of the scores is -6.56.
  • Finally, a column is created based on the threshold. A data point is predicted to be an outlier if its score is less than the threshold.

# Get the score for each sample
score = gmm.score_samples(X)
# Save score as a column
df['score'] = score
# Get the score threshold for anomaly
pct_threshold = np.percentile(score, 4)
# Print the score threshold
print(f'The threshold of the score is {pct_threshold:.2f}')
# Label the anomalies
df['anomaly_gmm_pct'] = (df['score'] < pct_threshold).astype(int)


The threshold of the score is -6.56
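
To make the meaning of the score concrete, the sketch below recomputes it by hand. It assumes the default covariance_type='full' and uses scipy (multivariate_normal and logsumexp), which is not part of the original tutorial. The names weighted_log_prob and manual_score are illustrative. The score of a point is the log of the weighted sum of the component densities at that point.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the tutorial's features and model so this snippet runs on its own
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# Score from scikit-learn: log-likelihood of each sample
score = gmm.score_samples(X)

# Manual version: log(weight_k) + log N(x | mean_k, cov_k) for each component,
# then a log-sum-exp over the components
weighted_log_prob = np.column_stack([
    np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
    for w, m, c in zip(gmm.weights_, gmm.means_, gmm.covariances_)
])
manual_score = logsumexp(weighted_log_prob, axis=1)

print('Manual score matches score_samples:', np.allclose(score, manual_score))
```

Because the score is a log-density, thresholding it at a low percentile is the same as flagging the points in the lowest-density regions of the fitted mixture.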

From the visualization, we can see that the prediction identified most of the outliers, but there are some false positives and false negatives near the clusters.

# Visualize the actual and predicted anomalies
fig, (ax0, ax1) = plt.subplots(1, 2, sharey=True, figsize=(20, 12))
# Ground truth
ax0.set_title('Ground Truth')
ax0.scatter(df['feature1'], df['feature2'], c=df['anomaly_indicator'], cmap='rainbow')
# GMM Predictions
ax1.set_title('GMM Predict Anomalies Using Percentage')
ax1.scatter(df['feature1'], df['feature2'], c=df['anomaly_gmm_pct'], cmap='rainbow')
GMM Predict Anomalies Using Percentage Threshold — GrabNGoInfo.com
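
The false positives and false negatives seen in the plot can also be counted. The following is a minimal sketch that uses confusion_matrix from sklearn.metrics, which is an addition to the original tutorial. The names y_true and y_pred are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture

# Rebuild the tutorial's features, labels, and model so this snippet runs on its own
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
y_true = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# Percentage threshold: flag the lowest 4 percent of scores
score = gmm.score_samples(X)
pct_threshold = np.percentile(score, 4)
y_pred = (score < pct_threshold).astype(int)

# Rows are the ground truth, columns are the predictions
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f'True positives: {tp}, false positives: {fp}, false negatives: {fn}, true negatives: {tn}')
```

A percentage threshold fixes the number of flagged points in advance, so any mismatch between the assumed percentage and the true anomaly share shows up directly as false positives or false negatives.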

Step 5: GMM Predict Anomalies Using Value Threshold

In step 5, we will use the results from the Gaussian Mixture Model (GMM) to predict anomalies based on a value threshold. This value threshold is obtained by observing the distribution of the scores.

From the visualization, we can see that most of the scores are greater than -5.5, so we set -5.5 as the threshold value.

# Change figure size
plt.figure(figsize=(12, 8))
# Check score distribution
sns.histplot(df['score'], bins=100, alpha=0.8)
# Threshold value
plt.axvline(x=-5.5, color='orange')
GMM Predict Anomalies Using Value Threshold — GrabNGoInfo.com
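
Reading a threshold off a histogram is a judgment call, so it can help to see how many points each candidate value would flag. The sketch below is one way to do that; the candidate values in the list are assumptions chosen around -5.5 for illustration, not values from the original post.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the tutorial's features and model so this snippet runs on its own
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)
score = gmm.score_samples(X)

# Hypothetical candidate thresholds around the value chosen in the tutorial
candidates = [-7.0, -6.5, -6.0, -5.5, -5.0]

# Share of data points that each candidate threshold would flag as anomalies
flagged_share = {t: float((score < t).mean()) for t in candidates}
for t, share in flagged_share.items():
    print(f'threshold {t:5.1f}: {share:.2%} of points flagged')
```

A higher threshold always flags at least as many points as a lower one, so this table shows how quickly the flagged share grows as the cut moves into the main body of the score distribution.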

After deciding the threshold score for anomaly predictions, a new column is created to save the prediction results.

# Get the score threshold for anomaly
value_threshold = -5.5
# Label the anomalies
df['anomaly_gmm_value'] = (df['score'] < value_threshold).astype(int)
# Visualize the actual and predicted anomalies
fig, (ax0, ax1) = plt.subplots(1, 2, sharey=True, figsize=(20, 12))
# Ground truth
ax0.set_title('Ground Truth')
ax0.scatter(df['feature1'], df['feature2'], c=df['anomaly_indicator'], cmap='rainbow')
# GMM Predictions
ax1.set_title('GMM Predict Anomalies Using Value')
ax1.scatter(df['feature1'], df['feature2'], c=df['anomaly_gmm_value'], cmap='rainbow')

The visualization shows that most of the outliers are identified correctly, but there are some false positives near clusters.

GMM Predict Anomalies Using Value Threshold — GrabNGoInfo.com
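
To compare the two thresholding rules with numbers instead of plots, precision, recall, and F1 can be computed against the ground truth. This is a minimal sketch using sklearn.metrics, which the original tutorial does not use. The dictionary rules and the dataframe results are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.mixture import GaussianMixture

# Rebuild the tutorial's features, labels, and model so this snippet runs on its own
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
y_true = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)
score = gmm.score_samples(X)

# The two thresholds used in the tutorial
rules = {
    'percentage (4%)': np.percentile(score, 4),
    'value (-5.5)': -5.5,
}

rows = []
for name, threshold in rules.items():
    y_pred = (score < threshold).astype(int)
    rows.append({
        'rule': name,
        'flagged': int(y_pred.sum()),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0),
    })

results = pd.DataFrame(rows)
print(results.round(3))
```

The value threshold is higher than the percentage threshold in this example, so it flags more points. That typically trades some precision for recall.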

Step 6: GMM Anomaly Detection Optimization

In step 6, we will talk about two methods to improve anomaly detection performance.

  • The first method is to improve the Gaussian Mixture Model (GMM) performance by hyperparameter tuning. Because the sample scores are calculated based on the Gaussian Mixture Model, a better model is likely to lead to better scores for anomaly detection.
  • The second method is to optimize the threshold based on the ground truth. Take fraud detection as an example. After the outliers are identified with a certain threshold, investigators can review the flagged cases and decide whether each one is a true fraud. Because a data point is flagged when its score is below the threshold, we can lower the threshold if there are a lot of false positives and increase it if there are a lot of false negatives.
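
Both methods can be sketched in a few lines. The choices below are assumptions made for illustration and are not prescribed by the original post: the hyperparameter search selects n_components and covariance_type by the lowest BIC, and the threshold search picks the percentile with the highest F1 score against the labels. BIC measures how well the density fits, so a lower BIC does not guarantee better anomaly detection, and a threshold tuned on labeled data should be validated on held-out data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score
from sklearn.mixture import GaussianMixture

# Rebuild the tutorial's features and labels so this snippet runs on its own
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
y_true = np.r_[np.zeros(500, dtype=int), np.ones(20, dtype=int)]

# Method 1: hyperparameter tuning, here by the lowest BIC (one possible criterion)
best_bic, best_params = np.inf, None
for covariance_type in ['full', 'tied', 'diag', 'spherical']:
    for n_components in range(1, 7):
        candidate = GaussianMixture(
            n_components=n_components,
            covariance_type=covariance_type,
            n_init=5,
            random_state=42,
        ).fit(X)
        bic = candidate.bic(X)
        if bic < best_bic:
            best_bic = bic
            best_params = {'n_components': n_components, 'covariance_type': covariance_type}
print('Lowest BIC:', round(best_bic, 1), 'with', best_params)

# Method 2: threshold tuning against the ground truth, using the tutorial's model
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)
score = gmm.score_samples(X)
percentiles = np.arange(1, 16)
thresholds = np.percentile(score, percentiles)
f1_values = [f1_score(y_true, (score < t).astype(int)) for t in thresholds]
best_idx = int(np.argmax(f1_values))
print(f'Best percentile: {percentiles[best_idx]}, '
      f'threshold: {thresholds[best_idx]:.2f}, F1: {f1_values[best_idx]:.3f}')
```

In practice the labels come from reviewed cases, so the threshold search can be rerun as more flagged cases are confirmed or rejected.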


In this tutorial, we talked about how to use Gaussian Mixture Model (GMM) to detect outliers. You learned:

  • How to train a Gaussian Mixture Model (GMM)
  • How to predict anomalies from a Gaussian Mixture Model (GMM) using a percentage threshold and a value threshold separately
  • How to visualize the anomaly prediction results
  • How to improve the anomaly prediction performance


