Gaussian Mixture Model (GMM) for Anomaly Detection
Predict anomalies from a Gaussian Mixture Model (GMM) using percentage threshold and value threshold, and improve anomaly prediction performance
Gaussian Mixture Model (GMM) is a probabilistic clustering model that assumes the data are generated from a mixture of a finite number of Gaussian distributions. Anomaly detection is the process of identifying unusual data points. Gaussian Mixture Model (GMM) detects outliers by identifying the data points in low-density regions [1]. In this tutorial, we will use Python’s sklearn library to implement a Gaussian Mixture Model (GMM) and use it to detect outliers. You will learn:
- How to train a Gaussian Mixture Model (GMM)?
- How to predict anomalies from a Gaussian Mixture Model (GMM) using percentage threshold and value threshold separately?
- How to visualize the anomaly prediction results?
- How to improve the anomaly prediction performance?
Let’s get started!
Step 1: Import Library
Firstly, let’s import the Python libraries. We need to import make_blobs for synthetic dataset creation, import pandas and numpy for data processing, import matplotlib and seaborn for visualization, and import GaussianMixture for modeling.
# Synthetic dataset
from sklearn.datasets import make_blobs
# Data processing
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Model
from sklearn.mixture import GaussianMixture
Step 2: Create Modeling Dataset
In the 2nd step, we will use make_blobs from the sklearn library to create a synthetic dataset with outliers.
- Firstly, we create a dataset with 500 data points, 3 clusters, 2 features, and a cluster standard deviation of 1. random_state ensures the reproducibility of the dataset.
- Then, an anomaly dataset with 20 records, 2 centers, and a cluster standard deviation of 10 is created.
- The outputs from make_blobs are in array format. We convert them into the pandas dataframe format. Each dataframe has three columns, feature1, feature2, and anomaly_indicator. anomaly_indicator represents the ground truth of the anomalies, and it has the value 0 for the normal data points and 1 for the anomaly data points.
- After that, the normal dataset and the anomaly dataset are combined.
# Create the normal data
X_normal, y_normal = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
df_normal = pd.DataFrame({'feature1': X_normal[:, 0], 'feature2': X_normal[:, 1], 'anomaly_indicator': 0})
# Create the anomaly data
X_anomaly, y_anomaly = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
df_anomaly = pd.DataFrame({'feature1': X_anomaly[:, 0], 'feature2': X_anomaly[:, 1], 'anomaly_indicator': 1})
# Combine the normal and the anomaly data
df = pd.concat([df_normal, df_anomaly], ignore_index=True)
# Change figure size
plt.figure(figsize=(12, 8))
# Visualization (recent seaborn versions require x and y as keyword arguments)
sns.scatterplot(x='feature1', y='feature2', hue='anomaly_indicator', data=df)
In the visualization, the orange dots are the outliers and the blue dots are the normal data points.
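As a quick sanity check on the dataset described above, the following sketch rebuilds it and counts the two classes. It reuses the same make_blobs parameters; the variable names counts and anomaly_share are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Recreate the normal and the anomaly data with the same parameters as above
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)

df_normal = pd.DataFrame({'feature1': X_normal[:, 0], 'feature2': X_normal[:, 1], 'anomaly_indicator': 0})
df_anomaly = pd.DataFrame({'feature1': X_anomaly[:, 0], 'feature2': X_anomaly[:, 1], 'anomaly_indicator': 1})
df = pd.concat([df_normal, df_anomaly], ignore_index=True)

# Number of records per class (0 = normal, 1 = anomaly)
counts = df['anomaly_indicator'].value_counts()
print(counts)

# Share of anomalies in the combined dataset (the mean of a 0/1 column)
anomaly_share = df['anomaly_indicator'].mean()
print(f'Share of anomalies: {anomaly_share:.2%}')
```

With 20 anomalies out of 520 records, the share of anomalies is about 3.85 percent, which is in line with the 4 percent assumption used in Step 4.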
Step 3: Gaussian Mixture Model (GMM) Training
In the 3rd step, a Gaussian Mixture Model (GMM) is trained.
- Gaussian Mixture Model (GMM) is an unsupervised model, and the modeling dataset has features only, so the label anomaly_indicator is excluded from the model.
- After the modeling dataset is created, we initiate the Gaussian Mixture Model (GMM) with n_components=3 and n_init=5. n_components=3 means that the model has 3 Gaussian components (clusters), and n_init=5 means that the model is fit from 5 different initializations and the best result is kept. random_state is for the model reproducibility.
- Then, the Gaussian Mixture Model (GMM) is fit on the modeling dataset X and used to predict the component label of each data point.
# Model dataset
X = df[df.columns.difference(['anomaly_indicator'])]
# GMM model
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)
# Fit and predict on the data
y_gmm = gmm.fit_predict(X)
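After fitting, it can be useful to look at what the model learned. The sketch below refits the same model on the same data and prints a few fitted attributes of sklearn's GaussianMixture (converged_, weights_, means_). It is only an illustrative check, not part of the original workflow.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Recreate the modeling dataset from Step 2 as a single array
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])

# Same model settings as in this step
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)
y_gmm = gmm.fit_predict(X)

# Whether the EM algorithm converged for the best initialization
print('Converged:', gmm.converged_)
# Mixing weight of each component (the weights sum to 1)
print('Component weights:', np.round(gmm.weights_, 3))
# Center of each component in the (feature1, feature2) space
print('Component means:\n', np.round(gmm.means_, 2))
# Number of data points assigned to each component
print('Points per component:', np.bincount(y_gmm, minlength=3))
```

Note that y_gmm contains component labels (0, 1, or 2), not anomaly labels. The anomaly predictions come from the scores in the next steps.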
Step 4: GMM Predict Anomalies Using Percentage Threshold
In step 4, we will use the results from the Gaussian Mixture Model (GMM) to predict anomalies based on a percentage threshold. This percentage threshold is usually obtained through historical data or business knowledge.
- Firstly, the score for each sample is obtained using the score_samples method. The score is the log-likelihood of the sample under the fitted model, so a lower score means the data point lies in a lower-density region.
- Secondly, the score is saved as a column in the pandas dataframe.
- After that, we get the score value for the percentage threshold set up for the anomaly detection. In this example, we assume 4 percent of the data are outliers, and the corresponding score for 4 percent is -6.56.
- Finally, a column is created based on the threshold. The data point is predicted to be an outlier if the score is less than the threshold.
# Get the score for each sample
score = gmm.score_samples(X)
# Save score as a column
df['score'] = score
# Get the score threshold for anomaly
pct_threshold = np.percentile(score, 4)
# Print the score threshold
print(f'The threshold of the score is {pct_threshold:.2f}')
# Label the anomalies
df['anomaly_gmm_pct'] = (df['score'] < pct_threshold).astype(int)
Output
The threshold of the score is -6.56
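To make the meaning of the score more concrete, the following sketch recomputes it by hand, assuming the default covariance_type='full'. For each sample it sums the weighted densities of the components and takes the log, using scipy's multivariate_normal and logsumexp. The names weighted_log_pdf and manual_score are introduced here for illustration only.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Recreate the modeling dataset from Step 2 as a single array
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])

gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# One column per component: log(weight_k) + log N(x | mean_k, covariance_k)
weighted_log_pdf = np.column_stack([
    np.log(weight) + multivariate_normal(mean=mean, cov=cov).logpdf(X)
    for weight, mean, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_)
])

# Log of the mixture density: log(sum_k weight_k * N(x | mean_k, covariance_k))
manual_score = logsumexp(weighted_log_pdf, axis=1)

# Compare the manual calculation with score_samples
print(np.allclose(manual_score, gmm.score_samples(X)))
```

Because the score is a log-density, data points far from every component get strongly negative scores, which is why low scores are treated as anomalies.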
From the visualization, we can see that the prediction identified most of the outliers, but there are some false positives and false negatives near the clusters.
# Visualize the actual and predicted anomalies
fig, (ax0, ax1)=plt.subplots(1,2, sharey=True, figsize=(20,12))
# Ground truth
ax0.set_title('Ground Truth')
ax0.scatter(df['feature1'], df['feature2'], c=df['anomaly_indicator'], cmap='rainbow')
# GMM Predictions
ax1.set_title('GMM Predict Anomalies Using Percentage')
ax1.scatter(df['feature1'], df['feature2'], c=df['anomaly_gmm_pct'], cmap='rainbow')
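The false positives and false negatives seen in the plots can also be counted. The sketch below is one possible way to do this with sklearn's confusion_matrix, which is not used in the original code. It rebuilds the data and the model so that it can run on its own.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture

# Recreate the dataset from Step 2 as arrays, with the ground truth labels
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
y_true = np.concatenate([np.zeros(500, dtype=int), np.ones(20, dtype=int)])

# Fit the model and apply the 4 percent threshold from this step
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)
score = gmm.score_samples(X)
pct_threshold = np.percentile(score, 4)
y_pred = (score < pct_threshold).astype(int)

# Rows are the actual classes and columns are the predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f'True negatives: {tn}, false positives: {fp}')
print(f'False negatives: {fn}, true positives: {tp}')
```

A false positive here is a normal data point flagged as an anomaly, and a false negative is an anomaly that was not flagged.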
Step 5: GMM Predict Anomalies Using Value Threshold
In step 5, we will use the results from the Gaussian Mixture Model (GMM) to predict anomalies based on a value threshold. This value threshold is obtained by observing the distribution of the scores.
From the visualization, we can see that most of the scores are greater than -5.5, so we set -5.5 as the threshold value.
# Change figure size
plt.figure(figsize=(12, 8))
# Check score distribution
sns.histplot(df['score'], bins=100, alpha=0.8)
# Threshold value
plt.axvline(x=-5.5, color='orange')
After deciding the threshold score for anomaly predictions, a new column is created to save the prediction results.
# Get the score threshold for anomaly
value_threshold = -5.5
# Label the anomalies
df['anomaly_gmm_value'] = (df['score'] < value_threshold).astype(int)
# Visualize the actual and predicted anomalies
fig, (ax0, ax1)=plt.subplots(1,2, sharey=True, figsize=(20,12))
# Ground truth
ax0.set_title('Ground Truth')
ax0.scatter(df['feature1'], df['feature2'], c=df['anomaly_indicator'], cmap='rainbow')
# GMM Predictions
ax1.set_title('GMM Predict Anomalies Using Value')
ax1.scatter(df['feature1'], df['feature2'], c=df['anomaly_gmm_value'], cmap='rainbow')
The visualization shows that most of the outliers are identified correctly, but there are some false positives near clusters.
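To compare the two thresholds with numbers instead of plots, one option is to compute precision, recall, and F1 against the ground truth. The sketch below assumes these metrics are a reasonable way to summarize the results. The helper evaluate_threshold is a hypothetical function written for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.mixture import GaussianMixture

# Recreate the dataset from Step 2 as arrays, with the ground truth labels
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
y_true = np.concatenate([np.zeros(500, dtype=int), np.ones(20, dtype=int)])

gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)
score = gmm.score_samples(X)


def evaluate_threshold(threshold):
    """Flag scores below the threshold and compare with the ground truth."""
    y_pred = (score < threshold).astype(int)
    return {
        'threshold': round(float(threshold), 2),
        'flagged': int(y_pred.sum()),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0),
    }


# One row per thresholding approach
comparison = pd.DataFrame(
    [evaluate_threshold(np.percentile(score, 4)), evaluate_threshold(-5.5)],
    index=['percentage threshold (4%)', 'value threshold (-5.5)'],
)
print(comparison.round(3))
```

A higher threshold flags more data points, which tends to raise recall and lower precision. This matches the extra false positives near the clusters when -5.5 is used.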
Step 6: GMM Anomaly Detection Optimization
In step 6, we will talk about two methods to improve anomaly detection performance.
- The first method is to improve the Gaussian Mixture Model (GMM) performance by hyperparameter tuning. Because the sample scores are calculated based on the Gaussian Mixture Model, a better model is likely to lead to better scores for anomaly detection.
- The second method is to optimize the threshold based on the ground truth. Take fraud detection as an example. After identifying the outliers based on a certain threshold, people can look into specific cases and decide whether each one is a true fraud. Because a data point is flagged when its score is below the threshold, we can lower the threshold if there are a lot of false positives, and increase the threshold if there are a lot of false negatives.
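The two methods can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive recipe. For the first method, it assumes a small grid over n_components and covariance_type and picks the model with the lowest BIC, which measures model fit rather than anomaly detection quality. For the second method, it assumes ground truth labels are available and picks the percentile threshold with the highest F1 score. The grid values and the names candidates, best, and sweep are chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score
from sklearn.mixture import GaussianMixture

# Recreate the dataset from Step 2 as arrays, with the ground truth labels
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = np.vstack([X_normal, X_anomaly])
y_true = np.concatenate([np.zeros(500, dtype=int), np.ones(20, dtype=int)])

# Method 1: hyperparameter tuning, here by keeping the model with the lowest BIC
candidates = []
for n_components in range(1, 7):
    for covariance_type in ['full', 'tied', 'diag', 'spherical']:
        model = GaussianMixture(n_components=n_components, covariance_type=covariance_type,
                                n_init=5, random_state=42).fit(X)
        candidates.append({'n_components': n_components, 'covariance_type': covariance_type,
                           'bic': model.bic(X), 'model': model})

best = min(candidates, key=lambda c: c['bic'])
print(f"Lowest BIC: n_components={best['n_components']}, covariance_type={best['covariance_type']}")

# Method 2: threshold tuning against the ground truth, using the selected model
score = best['model'].score_samples(X)
rows = []
for pct in np.linspace(1, 10, 19):
    threshold = np.percentile(score, pct)
    y_pred = (score < threshold).astype(int)
    rows.append({'pct': pct, 'threshold': threshold,
                 'f1': f1_score(y_true, y_pred, zero_division=0)})

sweep = pd.DataFrame(rows)
best_row = sweep.loc[sweep['f1'].idxmax()]
print(f"Best percentage threshold: {best_row['pct']:.1f}% (F1 = {best_row['f1']:.3f})")
```

Since the threshold is tuned and evaluated on the same labeled data, the resulting F1 score is optimistic. In practice, the labels would come from reviewed cases, and the chosen threshold should be checked on data that was not used for tuning.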
Summary
In this tutorial, we talked about how to use a Gaussian Mixture Model (GMM) to detect outliers. You learned:
- How to train a Gaussian Mixture Model (GMM)?
- How to predict anomalies from a Gaussian Mixture Model (GMM) using percentage threshold and value threshold separately?
- How to visualize the anomaly prediction results?
- How to improve the anomaly prediction performance?
More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.