Handbook of Anomaly Detection: with Python Outlier Detection — (2) HBOS

Consider multi-dimensional data, such as a data frame or an Excel spreadsheet. The columns are the dimensions, also called variables, and the rows are the observations. An observation has one value per variable. With all the observations, we can derive the count statistic for each variable, called the histogram. If there are N variables, there will be N histograms. If a value of an observation falls in the tail of a histogram, that value is an outlier. It is often the case that some values of an observation are outliers in terms of the corresponding variables while the others are normal. If many values of an observation are outliers, the observation is very likely to be an outlier.
With this intuition, the histogram of a variable can be used to define the univariate outlier score for that variable, so an observation has N univariate outlier scores. The technique assumes independence between variables, which lets it build each histogram and each univariate outlier score separately. The N univariate outlier scores of an observation are then summed to give the Histogram-based Outlier Score (HBOS). Although the independence assumption may sound strong, HBOS has proved effective in real-world cases.
(A) How Does the HBOS Work?
The HBOS constructs the histograms independently for all the N variables. The height of the bin is used to measure the “outlier-ness”. Most of the observations belong to the bins of high frequency, and outliers belong to the bins of low frequency. The univariate outlier score is defined as the inverse of the height of a bin.
The HBOS is formally defined as the sum of the logarithms of the univariate outlier scores over the N variables:

$$HBOS(p) = \sum_{i=1}^{N} \log\left(\frac{1}{hist_i(p)}\right)$$

In the above equation, hist_i(p) is the height of the bin of variable i to which observation p belongs, and 1/hist_i(p) is the univariate outlier score. This definition assigns a large value to an outlier.
If a variable is categorical, the histogram is the count by category. If a variable is numeric, it is first discretized into bins of equal width to derive the count statistic. The maximum height of each histogram is normalized to 1.0. This ensures that every variable carries equal weight when the univariate scores are summed to get the HBOS.
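To make the definition concrete, here is a minimal from-scratch sketch in plain Python for numeric variables only. It is an illustration under stated assumptions, not PyOD's implementation: the function name `hbos_scores`, the choice of `n_bins`, and the toy data are all made up for this example.

```python
import math


def hbos_scores(rows, n_bins=10):
    """Score each row with a from-scratch HBOS sketch (numeric columns only)."""
    n_vars = len(rows[0])
    scores = [0.0] * len(rows)
    for i in range(n_vars):
        column = [row[i] for row in rows]
        lo, hi = min(column), max(column)
        # Guard against a constant column, where every value shares one bin.
        width = (hi - lo) / n_bins or 1.0
        # Equal-width binning; the maximum value is clamped into the last bin.
        bins = [min(int((v - lo) / width), n_bins - 1) for v in column]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        # Normalize so the tallest bin of this variable has height 1.0.
        tallest = max(counts)
        heights = [c / tallest for c in counts]
        # Add log(1 / hist_i(p)) for every observation p.
        for p, b in enumerate(bins):
            scores[p] += math.log(1.0 / heights[b])
    return scores


# Toy data (made up): five similar observations and one that is far away.
rows = [[1.0, 10.0], [1.1, 10.2], [0.9, 9.8],
        [1.0, 10.1], [1.2, 9.9], [5.0, 20.0]]
scores = hbos_scores(rows, n_bins=5)
for row, score in zip(rows, scores):
    print(row, round(score, 3))
```

Observations that sit in the tallest bin of every variable contribute log(1/1.0) = 0 for each variable, so only values in sparse bins raise the score. The last observation sits in a sparse bin for both variables and therefore receives the largest score.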
(B) Distribution-Based Algorithms Can Be Fast
In the previous chapter, I mentioned that anomaly detection algorithms can be proximity-based, distribution-based, or ensemble-based. Distribution-based methods fit the data with probability distributions to get outlier scores. Their computational time is generally shorter than that of proximity-based or ensemble-based methods. A proximity-based method can be time-consuming because it needs to compute the distance between every pair of data points. For this reason, a distribution-based method such as HBOS is a good candidate to start a data science project with.
(C) Modeling Procedure
In this book, I apply the procedure in Figure (C), which helps you develop the model, assess the model performance, and present the model outcome. The three steps are (1) model development, (2) threshold determination, and (3) profiling the normal and abnormal groups.

In most cases, we do not have verified outliers with which to build a supervised model. Without known outliers, we do not even know the percentage of outliers in a population. The good news is that the outlier scores already measure how far an observation deviates from the normal data. If we plot the histogram of the outlier scores, we can spot the observations with unusually high scores and estimate the percentage of outliers. Therefore, in Step 1 we develop the model and assign outlier scores. In Step 2, we plot the outlier scores in a histogram, then choose a value, called the threshold, to separate normal observations from abnormal observations. The threshold also determines the size of the abnormal group.
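Step 2 can be sketched in a few lines of plain Python. The scores, the helper names `score_histogram` and `split_by_threshold`, and the threshold of 2.0 are hypothetical and chosen only for illustration. In practice the threshold comes from inspecting the histogram of your own model's scores.

```python
from collections import Counter


def score_histogram(scores, n_bins=10):
    """Count outlier scores in equal-width bins; returns (bin_start, count) pairs."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0
    counts = Counter(min(int((s - lo) / width), n_bins - 1) for s in scores)
    return [(lo + b * width, counts.get(b, 0)) for b in range(n_bins)]


def split_by_threshold(scores, threshold):
    """Label each observation: 1 if its score is at or above the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical outlier scores for 20 observations (made up for illustration).
scores = [0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 0.4, 0.1, 0.0, 0.2,
          0.3, 0.1, 0.2, 0.0, 0.1, 0.3, 0.2, 0.1, 3.1, 3.4]

# A text histogram: one '#' per observation in each bin.
hist = score_histogram(scores, n_bins=5)
for start, count in hist:
    print(f"{start:5.2f} | {'#' * count}")

# The histogram shows a gap between the bulk and the two highest scores,
# so a threshold inside that gap (2.0 here, an assumed choice) separates them.
labels = split_by_threshold(scores, threshold=2.0)
outlier_share = sum(labels) / len(labels)
print(f"abnormal group: {sum(labels)} of {len(labels)} ({outlier_share:.0%})")
```

Moving the threshold up or down changes how many observations land in the abnormal group, which is why the threshold and the group size are decided together.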
How do we assess the soundness of an unsupervised model? If a model can effectively identify outliers in the training data, those outliers should show the characteristics of outlier-ness: they should differ clearly from the normal data in terms of the model variables. In Step 3, we profile the normal and outlier groups to check the soundness of the model. Comparing the descriptive statistics (such as the means and standard deviations) of the variables between the two groups shows whether the model separates them in a meaningful way.
The descriptive statistics table is a reasonable way to evaluate whether a model is consistent with prior knowledge. If a variable is expected to be higher or lower in the outlier group but the result is counter-intuitive, you should investigate, modify, or drop the variable and build the model again. The final version of a model should deliver a descriptive statistics table that is consistent with prior knowledge.
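The profiling in Step 3 amounts to computing descriptive statistics per group. Below is a minimal sketch using only the standard library. The helper `profile_groups`, the variable names, and the toy data are hypothetical, and the sketch assumes both groups contain at least one observation.

```python
from statistics import mean, stdev


def profile_groups(rows, labels, names):
    """Return {group: {"count": n, "stats": {variable: (mean, std)}}} for groups 0 and 1.

    Assumes each group has at least one member.
    """
    profile = {}
    for group in (0, 1):
        members = [row for row, label in zip(rows, labels) if label == group]
        stats = {}
        for i, name in enumerate(names):
            values = [row[i] for row in members]
            # stdev needs at least two values; report 0.0 for a single member.
            spread = stdev(values) if len(values) > 1 else 0.0
            stats[name] = (mean(values), spread)
        profile[group] = {"count": len(members), "stats": stats}
    return profile


# Toy data (made up): labels come from the thresholding step, 1 = outlier.
rows = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0],
        [1.0, 10.0], [5.0, 20.0], [6.0, 22.0]]
labels = [0, 0, 0, 0, 1, 1]
profile = profile_groups(rows, labels, names=["x1", "x2"])

for group, info in profile.items():
    name = "outlier" if group == 1 else "normal"
    print(name, "count =", info["count"])
    for var, (avg, std) in info["stats"].items():
        print(f"  {var}: mean={avg:.2f} std={std:.2f}")
```

If the group means point in a direction that contradicts what you know about the data, that is the signal to revisit the variable before trusting the model.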
(C.1) Step 1 — Build your Model
Like before, I will use the utility function generate_data() of PyOD to generate mock data that contains outliers. To make the case more interesting, the data generation process (DGP) will create six variables. Although this mock dataset has the target variable Y, the unsupervised models only use the X variables. The Y variable is simply for validation. I set the percentage of outliers to 5% with “contamination=0.05”.
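To show the shape of such a dataset without depending on any library, here is a plain-Python stand-in. It is not PyOD's generate_data() and does not reproduce its output. The function `make_mock_data`, its distributions, and its defaults are assumptions made for illustration only.

```python
import random


def make_mock_data(n_samples=500, n_features=6, contamination=0.05, seed=123):
    """Hypothetical stand-in generator: returns (X, y) with y = 1 for outliers."""
    rng = random.Random(seed)
    n_outliers = round(n_samples * contamination)
    n_inliers = n_samples - n_outliers
    # Inliers: Gaussian values around an assumed center of 2.0.
    X = [[rng.gauss(2.0, 1.0) for _ in range(n_features)]
         for _ in range(n_inliers)]
    # Outliers: drawn uniformly from a much wider assumed range.
    X += [[rng.uniform(-10.0, 10.0) for _ in range(n_features)]
          for _ in range(n_outliers)]
    y = [0] * n_inliers + [1] * n_outliers
    return X, y


X_train, y_train = make_mock_data()
# Number of observations, number of variables, and share of labeled outliers.
print(len(X_train), len(X_train[0]), sum(y_train) / len(y_train))
```

Only X_train would be passed to an unsupervised model. The labels in y_train are kept aside to check afterwards how well the outlier scores line up with the known outliers.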















