Anomaly Detection using PyOD Python Library

A Comparison of Algorithms

Introduction

Anomaly detection is an important task in data analysis and machine learning, which involves identifying data points that are significantly different from the rest of the data. These data points, known as anomalies or outliers, can be caused by various factors such as errors in data collection, measurement errors, or rare events. Anomaly detection has applications in various fields, including finance, healthcare, cybersecurity, and manufacturing.

PyOD is a Python library for detecting anomalies in data. It provides a unified interface for various anomaly detection algorithms, making it easy to compare and evaluate their performance. In this article, I will compare the performance of five anomaly detection algorithms implemented in PyOD: K-Nearest Neighbors (KNN), Histogram-based Outlier Detection (HBOS), Angle-based Outlier Detection (ABOD), Cluster-based Local Outlier Factor (CBLOF), and Isolation Forest.

Dataset

The dataset represents seasonal KPIs and can be accessed here. It was selected because it contains labeled anomalies, allowing quantitative evaluation of anomaly detection algorithm performance.

The figure below demonstrates our 1D dataset.

Anomaly detection models

Five anomaly detection algorithms from the PyOD library were evaluated: K-Nearest Neighbors (KNN), Histogram-based Outlier Detection (HBOS), Angle-based Outlier Detection (ABOD), Cluster-based Local Outlier Factor (CBLOF), and Isolation Forest (IF). These algorithms were selected to provide a diverse range of anomaly detection approaches, including nearest neighbor, statistical, angle-based, cluster-based, and tree-based techniques.

K-Nearest Neighbors (KNN) is a simple and intuitive algorithm for anomaly detection. It scores each data point by the distance to its k nearest neighbors, and points with abnormally large distances are considered outliers. KNN captures local patterns well and is relatively easy to understand and implement.
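To make the idea concrete, here is a minimal NumPy sketch of distance-based scoring. It is not PyOD's implementation: the function name knn_outlier_scores and the sample values are made up for illustration, and it uses the distance to the k-th nearest neighbor as the score, which is only one common variant.

```python
import numpy as np

def knn_outlier_scores(x, k=3):
    """Score each point by the distance to its k-th nearest neighbour (illustrative helper)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    # Pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # Sort each row; column 0 is the distance of a point to itself (0)
    sorted_dists = np.sort(dists, axis=1)
    return sorted_dists[:, k]

# Made-up 1D values: five similar readings and one far away
values = np.array([1.0, 1.1, 0.9, 1.2, 1.05, 8.0])
scores = knn_outlier_scores(values, k=3)
print(scores)
print("Most anomalous index:", int(np.argmax(scores)))
```

The point with the largest score is the one whose neighbors are all far away.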

Histogram-based Outlier Detection (HBOS) is a fast and scalable algorithm that assumes the features are independent. It builds a histogram for each feature and scores a data point by how sparsely populated its bins are, so points that fall into rare bins receive high scores. Because each feature is processed separately, HBOS is efficient on large datasets and on data with many features.
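The histogram idea can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions (fixed-width bins, negative log density as the score), not the PyOD code; hbos_scores and the sample values are hypothetical.

```python
import numpy as np

def hbos_scores(x, n_bins=10):
    """Sum of negative log bin densities over all features (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)
    scores = np.zeros(x.shape[0])
    for j in range(x.shape[1]):
        density, edges = np.histogram(x[:, j], bins=n_bins, density=True)
        # Map each value to the index of the bin it falls into
        idx = np.clip(np.digitize(x[:, j], edges[1:-1]), 0, n_bins - 1)
        # Rare bins have low density and therefore a high score
        scores += -np.log(density[idx] + 1e-12)
    return scores

# Made-up values: eleven points between 0 and 1, plus one at 10
values = np.append(np.linspace(0.0, 1.0, 11), 10.0)
scores = hbos_scores(values, n_bins=5)
print("Most anomalous index:", int(np.argmax(scores)))
```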

Angle-based Outlier Detection (ABOD) scores a point by the variance of the angles it forms with pairs of other points. A point far away from the rest sees all other points within a narrow range of directions, so a low variance signals an outlier. ABOD was designed for high-dimensional data, where traditional distance-based methods might struggle.
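A brute-force sketch of the angle-based factor is shown below. It assumes the usual distance-weighted cosine formulation and evaluates every pair of points, which is only practical for tiny datasets; the function name abof and the 2D points are made up for illustration.

```python
import numpy as np
from itertools import combinations

def abof(points):
    """Variance of distance-weighted cosines per point; a LOW value suggests an outlier."""
    points = np.asarray(points, dtype=float)
    factors = np.empty(len(points))
    for i in range(len(points)):
        # Difference vectors from point i to every other point
        others = np.delete(points, i, axis=0) - points[i]
        weighted_cosines = [
            np.dot(b, c) / (np.dot(b, b) * np.dot(c, c))
            for b, c in combinations(others, 2)
        ]
        factors[i] = np.var(weighted_cosines)
    return factors

# Made-up 2D points: a small cluster and one distant point
points = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [10, 10]]
factors = abof(points)
print("Most anomalous index:", int(np.argmin(factors)))
```

Note that here the lowest value marks the outlier, the opposite of the distance-based scores.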

Cluster-based Local Outlier Factor (CBLOF) first clusters the data and separates the clusters into large and small ones. It then scores each data point by its distance to the nearest large cluster, so points in small clusters or far from any cluster centre receive high scores. CBLOF is best suited to datasets with distinct clusters.
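The following is a simplified 1D sketch of the cluster-based idea, not the PyOD implementation. It assumes a tiny hand-rolled k-means with fixed starting centres and a single alpha cut-off for "large" clusters; simple_kmeans, cblof_like_scores and the data are hypothetical.

```python
import numpy as np

def simple_kmeans(x, init_centres, n_iter=10):
    """Tiny 1D k-means with fixed initial centres (illustrative only)."""
    centres = np.array(init_centres, dtype=float)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        for c in range(len(centres)):
            if np.any(labels == c):
                centres[c] = x[labels == c].mean()
    return labels, centres

def cblof_like_scores(x, labels, centres, alpha=0.9):
    """Distance to own centre (large clusters) or to the nearest large centre (small clusters)."""
    sizes = np.bincount(labels, minlength=len(centres))
    order = np.argsort(-sizes)                      # clusters from largest to smallest
    coverage = np.cumsum(sizes[order]) / len(x)
    n_large = int(np.searchsorted(coverage, alpha) + 1)
    large = order[:n_large]                         # clusters covering at least alpha of the data
    dist_to_large = np.abs(x[:, None] - centres[large][None, :]).min(axis=1)
    dist_to_own = np.abs(x - centres[labels])
    return np.where(np.isin(labels, large), dist_to_own, dist_to_large)

# Made-up data: two clusters of five points and one isolated point
x = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8, 9.0])
labels, centres = simple_kmeans(x, init_centres=[1.0, 5.0, 9.0])
scores = cblof_like_scores(x, labels, centres)
print("Most anomalous index:", int(np.argmax(scores)))
```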

Isolation Forest (IF) is an ensemble-based algorithm designed for efficient outlier detection. It isolates points with random partitions: each tree in the ensemble repeatedly splits the data at random, and anomalies tend to be isolated after fewer splits than normal points. Isolation Forest is known for its effectiveness, especially in high-dimensional spaces, and is less sensitive to the shape of the data distribution.
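The isolation idea can be illustrated by counting how many random splits it takes to separate a point from the rest. This is a rough 1D sketch rather than a real Isolation Forest (no subsampling, no score normalisation); isolation_depth and the sample values are made up for illustration.

```python
import numpy as np

def isolation_depth(x, target_idx, rng, max_depth=50):
    """Number of random splits needed to isolate x[target_idx] (illustrative helper)."""
    idx = np.arange(len(x))
    depth = 0
    while len(idx) > 1 and depth < max_depth:
        lo, hi = x[idx].min(), x[idx].max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target point
        if x[target_idx] < split:
            idx = idx[x[idx] < split]
        else:
            idx = idx[x[idx] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# Made-up values: seven similar readings and one far away
values = np.array([1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 1.15, 8.0])
# Average depth over 200 random "trees" for every point
mean_depth = np.array([
    np.mean([isolation_depth(values, i, rng) for _ in range(200)])
    for i in range(len(values))
])
print("Most anomalous index:", int(np.argmin(mean_depth)))
```

The distant point is usually cut off by the very first split, so its average depth is the smallest.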

Anomaly detection process

The dataset is stored as a pandas DataFrame and includes labels indicating which data points are anomalies. This allows calculation of the anomaly proportion in the data, an important parameter when training and evaluating anomaly detection models.

import numpy as np

# Convert the labels from a pandas Series to a 1D array (1 = anomaly, 0 = normal)
y_true = df['label'].values

# Calculate the proportion of outliers: anomalies divided by ALL data points
true_proportion_outliers = np.mean(y_true == 1)

Then, I created a function to summarise the results of the trained models. It returns four values: percentage_outliers, average_outlier_score, median_outlier_score and std_outlier_score, that is, the percentage of data points classified as outliers and the mean, median and standard deviation of the outlier scores. Comparing these values across models and parameter settings shows how each model performed with the given hyper-parameters.

def calculate_outlier_stats(outlier_scores, threshold=0.0):
  """
  Calculates the percentage of outliers and the average, median, and standard deviation of the outlier scores.

  Parameters:
  outlier_scores (numpy array): The outlier scores for each data point.
  threshold (float): Scores above this value are counted as outliers. The default of 0
    keeps the original behaviour; for a fitted PyOD model, model.threshold_ is the
    cut-off used for its labels.

  Returns:
  A dictionary with the following keys:
  - percentage_outliers (float): The percentage of outliers in the dataset.
  - average_outlier_score (float): The average outlier score.
  - median_outlier_score (float): The median outlier score.
  - std_outlier_score (float): The standard deviation of the outlier scores.
  """
  # Calculate the percentage of scores above the threshold
  percentage_outliers = np.mean(outlier_scores > threshold) * 100

  # Calculate the average, median, and standard deviation of the outlier scores
  average_outlier_score = np.mean(outlier_scores)
  median_outlier_score = np.median(outlier_scores)
  std_outlier_score = np.std(outlier_scores)

  # Return the results as a dictionary
  return {
    "percentage_outliers": percentage_outliers,
    "average_outlier_score": average_outlier_score,
    "median_outlier_score": median_outlier_score,
    "std_outlier_score": std_outlier_score
  }

The next step was to standardise the data, and train the models using different hyper-parameters.
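For readers unfamiliar with standardisation, here is a small NumPy sketch of what it does: each feature is shifted to zero mean and scaled to unit standard deviation. The helper name standardise and the values are made up for illustration.

```python
import numpy as np

def standardise(values):
    """Shift each column to mean 0 and scale it to standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0)

# Made-up single-feature data with one large value
kpi = np.array([[10.0], [12.0], [11.0], [13.0], [50.0]])
z = standardise(kpi)
print(z.ravel())
```

Standardising does not remove the extreme value; it only expresses every point in units of standard deviations from the mean.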

The contamination parameter, which specifies the expected proportion of outliers in the data, is common to all the anomaly detection models tested. Since the labels are available here, the true outlier proportion was calculated and used directly. When the true proportion is unknown, it is recommended to tune contamination with cross-validation or another model selection technique, choosing the value that maximises performance on held-out data rather than relying on an arbitrary default.
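To see why contamination matters, the sketch below turns a set of outlier scores into labels. It assumes the cut-off is taken as the (1 - contamination) percentile of the training scores, which matches how PyOD describes its threshold_ attribute; the function name and the scores are made up for illustration.

```python
import numpy as np

def threshold_from_contamination(scores, contamination):
    """Cut-off such that roughly a `contamination` share of scores lies above it."""
    return np.percentile(scores, 100 * (1 - contamination))

# Made-up outlier scores: nine small values and one large one
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.12, 0.18, 0.22, 0.28, 5.0])
threshold = threshold_from_contamination(scores, contamination=0.1)
labels = (scores > threshold).astype(int)
print("Threshold:", threshold)
print("Flagged indices:", np.where(labels == 1)[0])
```

A contamination value that is too high flags normal points, while a value that is too low misses real anomalies, which is why guessing it is risky.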

Every model comes with its unique set of hyper-parameters tailored to its functioning. Below are the hyper-parameters used for our models.

K-Nearest Neighbors (KNN):

n_neighbors: Number of neighbors to consider when computing the nearest neighbors.

Histogram-based Outlier Detection (HBOS):

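n_bins: Number of bins to use in the histograms.

Angle-based Outlier Detection (ABOD):

n_neighbors: Number of neighbors to use when computing the angles.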

Cluster-based Local Outlier Factor (CBLOF):

n_clusters: Number of clusters to form.

Isolation Forest (IF):

n_estimators: The number of base estimators (trees) in the ensemble.

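from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN
from pyod.models.hbos import HBOS
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.iforest import IForest

# Standardise the data (data is expected to be 2D: n_samples x n_features)
scaler = StandardScaler()
data = scaler.fit_transform(data)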

# Train and predict with K-Nearest Neighbors (KNN)
knn_model = KNN(contamination=true_proportion_outliers, n_neighbors=100, metric='euclidean')
knn_model.fit(data)
knn_predictions = knn_model.predict(data)
# Get the outlier scores
outlier_scores_knn = knn_model.decision_scores_
# Get the outlier indices
outlier_indices_knn = np.where(outlier_scores_knn > knn_model.threshold_)[0]
# Print the outlier indices
# print(f"Outlier indices: {outlier_indices_knn}")
outlier_stats_knn = calculate_outlier_stats(outlier_scores_knn)
print(outlier_stats_knn)
print()

# Train and predict with Histogram-based Outlier Detection (HBOS)
hbos_model = HBOS(contamination=true_proportion_outliers, n_bins=800)
hbos_model.fit(data)
hbos_predictions = hbos_model.predict(data)
outlier_scores_hbos = hbos_model.decision_scores_
outlier_indices_hbos = np.where(outlier_scores_hbos > hbos_model.threshold_)[0]
outlier_stats_hbos = calculate_outlier_stats(outlier_scores_hbos)
print(outlier_stats_hbos)
print()

# Train and predict with Angle-based Outlier Detection (ABOD)
abod_model = ABOD(contamination=true_proportion_outliers, n_neighbors=10)
abod_model.fit(data)
abod_predictions = abod_model.predict(data)
outlier_scores_abod = abod_model.decision_scores_
outlier_indices_abod = np.where(outlier_scores_abod > abod_model.threshold_)[0]
outlier_stats_abod = calculate_outlier_stats(outlier_scores_abod)
print(outlier_stats_abod)
print()

# Train and predict with Cluster-based Local Outlier Factor (CBLOF)
cblof_model = CBLOF(contamination=true_proportion_outliers, n_clusters=50)
cblof_model.fit(data)
cblof_predictions = cblof_model.predict(data)
outlier_scores_cblof = cblof_model.decision_scores_
outlier_indices_cblof = np.where(outlier_scores_cblof > cblof_model.threshold_)[0]
outlier_stats_cblof = calculate_outlier_stats(outlier_scores_cblof)
print(outlier_stats_cblof)
print()

# Train and predict with Isolation Forest
if_model = IForest(contamination=true_proportion_outliers, n_estimators=700)
if_model.fit(data)
if_predictions = if_model.predict(data)
outlier_scores_if_model = if_model.decision_scores_
outlier_indices_if_model = np.where(outlier_scores_if_model > if_model.threshold_)[0]
outlier_stats_if_model = calculate_outlier_stats(outlier_scores_if_model)
print(outlier_stats_if_model)

Then, the outcomes were visualised in a subplot.

The top left plot shows the true anomalies present in the seasonal KPI dataset. The other five plots visualise the outliers identified by each anomaly detection model. There is significant overlap in the anomalies flagged by the HBOS, CBLOF, and Isolation Forest algorithms, and the KNN model yields quite similar results. ABOD, however, does not detect any outliers at all.

There are several potential reasons why ABOD failed to detect outliers: 1) the dataset is one-dimensional, so the angle between any two other points seen from a given point is either 0° or 180°, which leaves ABOD very little angular variation to work with, 2) the parameters of the ABOD algorithm may not be optimal for the data, or 3) the data may have been preprocessed or transformed in a way that hides the outliers. Overall, these five models do not reliably uncover outliers in this dataset.

The main goal of this article was to demonstrate a simple PyOD anomaly detection workflow rather than comprehensively benchmark performance. With just a few lines of code, a variety of detection algorithms can be applied and evaluated.

Evaluate performance

Since the labels of the anomalies are known, it is possible to evaluate the performance of the models. The true and predicted arrays must have the same length in order to use the sklearn metrics.

# True labels as a 1D array (1 = anomaly, 0 = normal)
y_true = df['label'].values

# Predicted labels with the same length as y_true:
# 1 where the outlier score exceeds the model's threshold, 0 elsewhere
y_pred_knn = (outlier_scores_knn > knn_model.threshold_).astype(int)
y_pred_hbos = (outlier_scores_hbos > hbos_model.threshold_).astype(int)
y_pred_abod = (outlier_scores_abod > abod_model.threshold_).astype(int)
y_pred_cblof = (outlier_scores_cblof > cblof_model.threshold_).astype(int)
y_pred_if_model = (outlier_scores_if_model > if_model.threshold_).astype(int)

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Predictions of each model, keyed by the name used in the printed report
predictions = {
  "KNN": y_pred_knn,
  "HBOS": y_pred_hbos,
  "ABOD": y_pred_abod,
  "CBLOF": y_pred_cblof,
  "IF_model": y_pred_if_model,
}

# Calculate and print the evaluation metrics for each model
for name, y_pred in predictions.items():
  print(f"{name}:")
  # zero_division=0 avoids a warning when a model flags no outliers at all
  print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")
  print(f"Recall: {recall_score(y_true, y_pred, zero_division=0):.3f}")
  print(f"F1-score: {f1_score(y_true, y_pred, zero_division=0):.3f}")
  print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
  print()

As expected, the results were not very promising. The table shows very low precision, recall and F1-score for all models, despite high accuracy scores.

Remember that accuracy is not the best metric to use with imbalanced datasets, which is the usual situation in outlier and anomaly detection. Because anomalies are rare, a model that labels every point as normal still achieves a high accuracy.
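A small worked example shows the effect. The numbers are made up: 1,000 points with 10 anomalies, and a "model" that predicts normal for everything.

```python
import numpy as np

# Made-up labels: 10 anomalies among 1,000 points (1% of the data)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A useless model that labels every point as normal
y_pred = np.zeros(1000, dtype=int)

true_positives = int(np.sum((y_true == 1) & (y_pred == 1)))
false_negatives = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = float(np.mean(y_true == y_pred))
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall: {recall:.3f}")
```

The accuracy is 99% even though not a single anomaly was found, which is why recall, precision and F1-score are more informative here.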

Conclusion

The PyOD library provides a straightforward, accessible interface for applying anomaly detection techniques. It implements a diverse selection of outlier algorithms that can be easily applied through just a few lines of Python code. This enables using PyOD for anomaly detection without extensive machine learning expertise.

This article underscores two key takeaways:

1. Choose an anomaly detection model that suits the characteristics of your data.
2. Use recall, precision, and F1-score for a more comprehensive performance evaluation; do not rely on accuracy alone.