Amy @GrabNGoInfo

Summary

The web content provides a comprehensive guide on using Gaussian Mixture Models (GMM) for anomaly detection, including model training, prediction using different thresholds, visualization of results, and methods for improving prediction performance.

Abstract

The article titled "Gaussian Mixture Model (GMM) for Anomaly Detection" outlines the process of employing GMMs to identify anomalies in datasets. It begins by explaining the GMM as a probabilistic clustering model and its application in anomaly detection by identifying data points in low-density regions. The tutorial uses Python's sklearn library to implement GMM for outlier detection, detailing steps such as importing necessary libraries, creating a synthetic dataset with outliers, training the GMM, and predicting anomalies using both percentage and value thresholds. The article also emphasizes the importance of visualizing results and suggests strategies for enhancing the model's anomaly prediction performance, such as hyperparameter tuning and optimizing thresholds based on ground truth. Additionally, the article provides resources such as video tutorials, Python code, and further reading on anomaly detection.

Opinions

  • The author suggests that a Gaussian Mixture Model (GMM) can effectively detect anomalies by identifying data points in low-density regions.
  • The article posits that setting an appropriate threshold for anomaly detection is crucial and can be determined through historical data or business knowledge.
  • It is implied that visualization is a key component in understanding and communicating the results of anomaly detection.
  • The author advocates for the optimization of the GMM through hyperparameter tuning to improve the model's performance in anomaly detection.
  • The article encourages the use of ground truth to refine the threshold for anomaly detection, indicating a belief in the iterative nature of model tuning.
  • The provision of a synthetic dataset for demonstration purposes suggests the author's view that practical examples are beneficial for learning complex concepts like GMMs.
  • By offering a variety of resources, including video tutorials and a Colab notebook, the author expresses a commitment to accessible education and hands-on learning.

Gaussian Mixture Model (GMM) for Anomaly Detection

Predict anomalies from a Gaussian Mixture Model (GMM) using percentage threshold and value threshold, and improve anomaly prediction performance

Photo by Randy Fath on Unsplash

Gaussian Mixture Model (GMM) is a probabilistic clustering model that assumes the data are generated from a mixture of a finite number of Gaussian distributions. Anomaly detection is the process of identifying unusual data points. A GMM detects outliers by identifying the data points in low-density regions [1]. In this tutorial, we will use Python’s sklearn library to implement a GMM and use it to detect outliers. You will learn:

  • How to train a Gaussian Mixture Model (GMM)
  • How to predict anomalies from a Gaussian Mixture Model (GMM) using a percentage threshold and a value threshold separately
  • How to visualize the anomaly prediction results
  • How to improve the anomaly prediction performance

Let’s get started!

Step 1: Import Library

First, let’s import the Python libraries. We need make_blobs for synthetic dataset creation, pandas and numpy for data processing, matplotlib and seaborn for visualization, and GaussianMixture for modeling.

# Synthetic dataset
from sklearn.datasets import make_blobs
# Data processing
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Model
from sklearn.mixture import GaussianMixture

Step 2: Create Modeling Dataset

In the 2nd step, we will use make_blobs from the sklearn library to create a synthetic dataset with outliers.

  • First, we create a dataset with 500 data points, 3 clusters, 2 features, and a cluster standard deviation of 1. random_state ensures the reproducibility of the dataset.
  • Then, we create an anomaly dataset with 20 records, 2 cluster centers, and a cluster standard deviation of 10.
  • The outputs from make_blobs are in array format, so we convert them into pandas dataframes. Each dataframe has three columns: feature1, feature2, and anomaly_indicator. anomaly_indicator represents the ground truth of the anomalies. It has the value 0 for the normal data points and 1 for the anomaly data points.
  • After that, the normal dataset and the anomaly dataset are combined.
# Create the normal data
X_normal, y_normal = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
df_normal = pd.DataFrame({'feature1': X_normal[:, 0], 'feature2': X_normal[:, 1], 'anomaly_indicator': 0})
# Create the anomaly data
X_anomaly, y_anomaly = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
df_anomaly = pd.DataFrame({'feature1': X_anomaly[:, 0], 'feature2': X_anomaly[:, 1], 'anomaly_indicator': 1})
# Combine the normal and the anomaly data (ignore_index=True avoids duplicate row labels)
df = pd.concat([df_normal, df_anomaly], ignore_index=True)
# Change figure size
plt.figure(figsize=(12, 8))
# Visualization (x and y are passed as keywords, which recent seaborn versions require)
sns.scatterplot(data=df, x='feature1', y='feature2', hue='anomaly_indicator')

In the visualization, the orange dots are the outliers and the blue dots are the normal data points.

Anomaly Detection Dataset — GrabNGoInfo.com
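
A quick numerical check can complement the scatter plot. The sketch below rebuilds the dataset with the same make_blobs settings and counts the rows in each group. The variable names counts and anomaly_share are illustrative and not part of the original tutorial.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Rebuild the normal and anomaly data with the tutorial's settings
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
df_normal = pd.DataFrame({'feature1': X_normal[:, 0], 'feature2': X_normal[:, 1], 'anomaly_indicator': 0})
df_anomaly = pd.DataFrame({'feature1': X_anomaly[:, 0], 'feature2': X_anomaly[:, 1], 'anomaly_indicator': 1})
df = pd.concat([df_normal, df_anomaly], ignore_index=True)

# Number of normal (0) and anomaly (1) rows
counts = df['anomaly_indicator'].value_counts().sort_index()
print(counts)

# Share of anomalies in the combined dataset
anomaly_share = df['anomaly_indicator'].mean()
print(f'Anomaly share: {anomaly_share:.2%}')

# Minimum and maximum of each feature per group
print(df.groupby('anomaly_indicator')[['feature1', 'feature2']].agg(['min', 'max']).round(2))
```

The anomaly share is 20 / 520, about 3.85 percent, which is close to the 4 percent assumed for the percentage threshold in Step 4.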

Step 3: Gaussian Mixture Model (GMM) Training

In the 3rd step, a Gaussian Mixture Model (GMM) is trained.

  • Gaussian Mixture Model (GMM) is an unsupervised model, and the modeling dataset has features only, so the label anomaly_indicator is excluded from the model.
  • After the modeling dataset is created, we initialize the Gaussian Mixture Model (GMM) with n_components=3 and n_init=5. n_components=3 means that the model has 3 Gaussian components (clusters), and n_init=5 means that the model is fit from 5 different initializations and the best result is kept. random_state makes the model reproducible.
  • Then, the Gaussian Mixture Model (GMM) is fit on the modeling dataset X, and fit_predict returns the predicted cluster label for each data point.
# Model dataset
X = df[df.columns.difference(['anomaly_indicator'])]
# GMM model
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)
# Fit and predict on the data
y_gmm = gmm.fit_predict(X)
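
After fitting, it can help to inspect what the model learned before using it for anomaly scoring. The following is a minimal sketch that rebuilds the same dataset and prints the fitted attributes converged_, weights_, and means_, plus the soft cluster assignments from predict_proba. The name proba is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the modeling dataset with the tutorial's settings
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = pd.DataFrame(np.vstack([X_normal, X_anomaly]), columns=['feature1', 'feature2'])

# Fit the GMM and get the hard cluster labels
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42)
y_gmm = gmm.fit_predict(X)

# Whether the EM algorithm converged for the best initialization
print('Converged:', gmm.converged_)
# Mixing weight of each Gaussian component (they sum to 1)
print('Weights:', gmm.weights_.round(3))
# Center of each Gaussian component
print('Means:')
print(gmm.means_.round(2))

# Soft assignments: probability of each component for each data point
proba = gmm.predict_proba(X)
print(proba[:3].round(3))
```

Each row of proba sums to 1, and the hard label in y_gmm corresponds to the component with the highest probability.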

Step 4: GMM Predict Anomalies Using Percentage Threshold

In step 4, we will use the results from the Gaussian Mixture Model (GMM) to predict anomalies based on a percentage threshold. This percentage threshold is usually obtained through historical data or business knowledge.

  • First, the score for each sample is obtained using the score_samples method. The score is the log-likelihood of the sample under the fitted model, so a lower score means the sample sits in a lower-density region.
  • Second, the score is saved as a column in the pandas dataframe.
  • After that, we get the score value that corresponds to the percentage threshold for anomaly detection. In this example, we assume 4 percent of the data are outliers, and the 4th percentile of the scores is -6.56.
  • Finally, a column is created based on the threshold. A data point is predicted to be an outlier if its score is less than the threshold.
# Get the score for each sample
score = gmm.score_samples(X)
# Save score as a column
df['score'] = score
# Get the score threshold for anomaly
pct_threshold = np.percentile(score, 4)
# Print the score threshold
print(f'The threshold of the score is {pct_threshold:.2f}')
# Label the anomalies
df['anomaly_gmm_pct'] = df['score'].apply(lambda x: 1 if x < pct_threshold else 0)

Output

The threshold of the score is -6.56
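
To make the meaning of the score concrete, the sketch below recomputes it by hand. It assumes the default covariance_type='full' and uses scipy (multivariate_normal and logsumexp), which the original tutorial does not import. The names weighted_log_pdf and manual_score are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the modeling dataset and fit the GMM with the tutorial's settings
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = pd.DataFrame(np.vstack([X_normal, X_anomaly]), columns=['feature1', 'feature2'])
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# Score from sklearn: log-likelihood of each sample under the mixture
score = gmm.score_samples(X)

# Manual version: log(weight_k) + log N(x | mean_k, cov_k) for each component k
weighted_log_pdf = np.column_stack([
    np.log(w) + multivariate_normal(mean=m, cov=c).logpdf(X.to_numpy())
    for w, m, c in zip(gmm.weights_, gmm.means_, gmm.covariances_)
])
# Log of the summed component densities, computed in a numerically stable way
manual_score = logsumexp(weighted_log_pdf, axis=1)

print('Scores match:', np.allclose(score, manual_score))
```

Because the score is a log density, points far from every component get very negative values, which is why a low percentile of the scores works as an anomaly threshold.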

From the visualization, we can see that the prediction identified most of the outliers, but there are some false positives and false negatives near the clusters.

# Visualize the actual and predicted anomalies
fig, (ax0, ax1)=plt.subplots(1,2, sharey=True, figsize=(20,12))
# Ground truth
ax0.set_title('Ground Truth')
ax0.scatter(df['feature1'], df['feature2'], c=df['anomaly_indicator'], cmap='rainbow')
# GMM Predictions
ax1.set_title('GMM Predict Anomalies Using Percentage')
ax1.scatter(df['feature1'], df['feature2'], c=df['anomaly_gmm_pct'], cmap='rainbow')
GMM Predict Anomalies Using Percentage Threshold — GrabNGoInfo.com
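
The visual impression of false positives and false negatives can be quantified with a cross-tabulation of the ground truth against the prediction. This is a sketch using pd.crosstab, which the original tutorial does not use. The name table is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the dataset and fit the GMM with the tutorial's settings
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
df = pd.DataFrame(np.vstack([X_normal, X_anomaly]), columns=['feature1', 'feature2'])
df['anomaly_indicator'] = [0] * 500 + [1] * 20
X = df[['feature1', 'feature2']]
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# Flag the samples whose score falls below the 4th percentile
score = gmm.score_samples(X)
pct_threshold = np.percentile(score, 4)
df['anomaly_gmm_pct'] = (score < pct_threshold).astype(int)

# Rows are the ground truth, columns are the GMM prediction
table = pd.crosstab(df['anomaly_indicator'], df['anomaly_gmm_pct'],
                    rownames=['actual'], colnames=['predicted'])
print(table)
```

In the table, the cell (actual=0, predicted=1) counts the false positives and the cell (actual=1, predicted=0) counts the false negatives.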

Step 5: GMM Predict Anomalies Using Value Threshold

In step 5, we will use the results from the Gaussian Mixture Model (GMM) to predict anomalies based on a value threshold. This value threshold is obtained by observing the distribution of the scores.

From the visualization, we can see that most of the scores are greater than -5.5, so we set -5.5 as the threshold value.

# Change figure size
plt.figure(figsize=(12, 8))
# Check score distribution
sns.histplot(df['score'], bins=100, alpha=0.8)
# Threshold value
plt.axvline(x=-5.5, color='orange')
GMM Predict Anomalies Using Value Threshold — GrabNGoInfo.com
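
The histogram can be read alongside a few numbers. The sketch below prints the lower quantiles of the scores and the share of samples that fall below -5.5. The quantile levels are an arbitrary choice for illustration, and the name share_below is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Rebuild the modeling dataset and fit the GMM with the tutorial's settings
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = pd.DataFrame(np.vstack([X_normal, X_anomaly]), columns=['feature1', 'feature2'])
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# Scores as a pandas Series so that quantile() is available
score = pd.Series(gmm.score_samples(X), name='score')

# Lower quantiles show where the left tail of the distribution starts
print(score.quantile([0.01, 0.02, 0.05, 0.10, 0.25, 0.50]).round(2))

# Share of samples that the value threshold would flag
value_threshold = -5.5
share_below = (score < value_threshold).mean()
print(f'Share of scores below {value_threshold}: {share_below:.2%}')
```

If the share flagged by a candidate threshold is far from the expected anomaly rate, that is a signal to revisit the threshold.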

After deciding the threshold score for anomaly predictions, a new column is created to save the prediction results.

# Get the score threshold for anomaly
value_threshold = -5.5
# Label the anomalies
df['anomaly_gmm_value'] = df['score'].apply(lambda x: 1 if x < value_threshold else 0)
# Visualize the actual and predicted anomalies
fig, (ax0, ax1)=plt.subplots(1,2, sharey=True, figsize=(20,12))
# Ground truth
ax0.set_title('Ground Truth')
ax0.scatter(df['feature1'], df['feature2'], c=df['anomaly_indicator'], cmap='rainbow')
# GMM Predictions
ax1.set_title('GMM Predict Anomalies Using Value')
ax1.scatter(df['feature1'], df['feature2'], c=df['anomaly_gmm_value'], cmap='rainbow')

The visualization shows that most of the outliers are identified correctly, but there are some false positives near clusters.

GMM Predict Anomalies Using Value Threshold — GrabNGoInfo.com
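
The two threshold rules can be compared side by side against the ground truth. This is a sketch using precision_score and recall_score from sklearn.metrics, which the original tutorial does not use. The names thresholds and comparison are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_score, recall_score
from sklearn.mixture import GaussianMixture

# Rebuild the dataset and fit the GMM with the tutorial's settings
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = pd.DataFrame(np.vstack([X_normal, X_anomaly]), columns=['feature1', 'feature2'])
y_true = np.array([0] * 500 + [1] * 20)
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)
score = gmm.score_samples(X)

# The two rules from Step 4 and Step 5
thresholds = {
    'percentage (4%)': np.percentile(score, 4),
    'value (-5.5)': -5.5,
}

# Evaluate each rule: a sample is flagged when its score is below the threshold
rows = []
for name, thr in thresholds.items():
    pred = (score < thr).astype(int)
    rows.append({
        'rule': name,
        'threshold': thr,
        'flagged': int(pred.sum()),
        'precision': precision_score(y_true, pred, zero_division=0),
        'recall': recall_score(y_true, pred, zero_division=0),
    })
comparison = pd.DataFrame(rows)
print(comparison.round(3))
```

Precision is the share of flagged points that are true anomalies, and recall is the share of true anomalies that were flagged. A higher threshold flags more points, which generally raises recall and lowers precision.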

Step 6: GMM Anomaly Detection Optimization

In step 6, we will talk about two methods to improve anomaly detection performance.

  • The first method is to improve the Gaussian Mixture Model (GMM) performance by hyperparameter tuning. Because the sample scores are calculated based on the Gaussian Mixture Model, a better model is likely to lead to better scores for anomaly detection.
  • The second method is to optimize the threshold based on the ground truth. Take fraud detection as an example: after identifying the outliers based on a certain threshold, investigators can review the flagged cases and decide whether each one is true fraud. Because a data point is flagged when its score falls below the threshold, lowering the threshold flags fewer points and reduces false positives, while raising the threshold flags more points and reduces false negatives.
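
Both methods can be sketched in a few lines. The example below is one possible approach, not the author's implementation. For hyperparameter tuning it assumes a small grid over n_components and covariance_type ranked by BIC, and for threshold optimization it assumes the F1 score as the selection criterion. The grid ranges and the names bic_table, candidates, and best_threshold are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score
from sklearn.mixture import GaussianMixture

# Rebuild the dataset with the tutorial's settings
X_normal, _ = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=1, random_state=42)
X_anomaly, _ = make_blobs(n_samples=20, centers=2, n_features=2, cluster_std=10, random_state=0)
X = pd.DataFrame(np.vstack([X_normal, X_anomaly]), columns=['feature1', 'feature2'])
y_true = np.array([0] * 500 + [1] * 20)

# Method 1: hyperparameter tuning, ranking candidate models by BIC (lower is better)
results = []
for cov_type in ['full', 'tied', 'diag', 'spherical']:
    for k in range(1, 7):
        model = GaussianMixture(n_components=k, covariance_type=cov_type,
                                n_init=5, random_state=42).fit(X)
        results.append({'covariance_type': cov_type, 'n_components': k, 'bic': model.bic(X)})
bic_table = pd.DataFrame(results).sort_values('bic').reset_index(drop=True)
print(bic_table.head())

# Method 2: threshold optimization with ground truth, using the tutorial's model
gmm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)
score = gmm.score_samples(X)

# Candidate thresholds: the 1st to 10.5th percentiles of the scores in 0.5 steps
percentiles = np.arange(1, 11, 0.5)
candidates = np.percentile(score, percentiles)

# F1 score of the rule "score < threshold" for each candidate
f1 = np.array([f1_score(y_true, (score < t).astype(int), zero_division=0) for t in candidates])
best_idx = int(f1.argmax())
best_threshold = candidates[best_idx]
print(f'Best percentile: {percentiles[best_idx]}, threshold: {best_threshold:.2f}, F1: {f1[best_idx]:.3f}')
```

Two caveats apply. BIC measures how well the mixture fits the data, not how well it separates anomalies, so the lowest-BIC model is a candidate to evaluate and not a guaranteed improvement. Choosing the threshold on the same labeled data that is used to measure performance gives an optimistic estimate, so a separate set of labeled cases is preferable when available.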

Summary

In this tutorial, we covered how to use a Gaussian Mixture Model (GMM) to detect outliers. You learned:

  • How to train a Gaussian Mixture Model (GMM)
  • How to predict anomalies from a Gaussian Mixture Model (GMM) using a percentage threshold and a value threshold separately
  • How to visualize the anomaly prediction results
  • How to improve the anomaly prediction performance

More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com.

References

[1] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, O'Reilly Media, 2019.
