Handbook of Anomaly Detection: With Python Outlier Detection — (8) KNN

(Revised on December 9, 2022)
The K-nearest neighbor algorithm, known as KNN or k-NN, is probably one of the most popular algorithms in machine learning. KNN is typically used as a supervised learning technique, where the target labels are provided. It can also be used simply to compute the distances to the k nearest neighbors. Because this second use does not involve a target variable, some online sources, such as the scikit-learn KNN documentation [1], call it unsupervised learning. The KNN in PyOD takes this second approach: it computes the distances to the k nearest neighbors and uses those distances to define the outlier scores.
At the beginning of the chapter, I will dedicate some space to clarifying how KNN can be used for supervised or unsupervised learning. Then I will explain how KNN defines the outlier score for anomaly detection.
(A) Use KNN as Unsupervised Learning Technique
The unsupervised KNN method computes the distance, typically the Euclidean distance, from each observation to the other observations. It has no model parameters to learn from a target variable; it simply computes the distances between neighbors. It follows these steps:
- Step 1: For each data point, calculate the distance to other data points.
- Step 2: Sort the data points from smallest to largest by the distance.
- Step 3: Pick the first K entries.
There are several choices to compute the distance between two data points. The most popular one is the Euclidean distance.
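Steps 1 to 3 can be sketched in plain Python. This is only a minimal illustration under simple assumptions (brute-force search, Euclidean distance), not how PyOD or scikit-learn implement the search. The function name `k_nearest_distances` and the toy points are hypothetical.

```python
import math

def k_nearest_distances(points, k):
    """For each point, return the sorted distances to its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        # Step 1: Euclidean distance from p to every other point.
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        # Step 2: sort the distances from smallest to largest.
        dists.sort()
        # Step 3: keep the first k entries.
        result.append(dists[:k])
    return result

# Hypothetical toy data: four points close together and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
neighbors = k_nearest_distances(points, k=2)
```

In this toy example, the isolated point (5, 5) ends up with much larger neighbor distances than the four clustered points, which is the property the outlier score relies on.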
(B) Use KNN as Supervised Learning Technique
The KNN algorithm is widely used as a classification algorithm in a supervised learning setting. It is used to predict the class of a new data point. It assumes that similar data points of the same class are usually near one another.
Figure (B) shows data points in a blue class and a red class. If there is a new data point, which class should it belong to? The algorithm calculates the distances from this data point to the other data points. Among the 5 nearest neighbors, it counts the number of blue and red points. In the graph, there are 4 red points and 1 blue point. The algorithm uses the majority voting rule to determine the class, so the new data point is assigned to the red class.

The procedure can be enumerated as follows. In addition to Steps 1 to 3 above, the supervised KNN performs Steps 4 and 5:
- Step 4: Among these K neighbors, count the number of data points in each class.
- Step 5: Assign the new data point to the majority class.
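The five steps can be put together in a short sketch. It assumes Euclidean distance and a simple majority vote; the function name `knn_classify` and the toy data are hypothetical, chosen so that the 5 nearest neighbors contain 4 red points and 1 blue point, as in the figure.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k):
    """Predict the class of new_point by majority vote among its k nearest neighbors."""
    # Steps 1-3: compute distances, sort by distance, keep the k nearest points.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], new_point),
    )
    k_labels = [label for _, label in by_distance[:k]]
    # Step 4: count the number of neighbors in each class.
    counts = Counter(k_labels)
    # Step 5: assign the majority class.
    return counts.most_common(1)[0][0]

# Hypothetical labeled data.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3),
                (3.0, 3.0), (3.2, 2.9), (1.4, 1.0)]
train_labels = ["red", "red", "red", "red", "blue", "blue", "blue"]

predicted = knn_classify(train_points, train_labels, (1.0, 1.0), k=5)
```

Using an odd K with two classes avoids ties. With an even K or more classes, a tie-breaking rule would be needed, and this sketch does not define one deliberately.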
(C) How Is the Anomaly Score Defined?
Since an outlier is a point that is distant from neighboring points, the outlier score is defined as the distance to its kth nearest neighbor. Each point will have an outlier score. Our job is to find those points with high outlier scores.
The KNN method in PyOD uses one of three distance measures as the outlier score: largest (default), mean, and median. “Largest” uses the largest of the distances to the k neighbors, which is the distance to the kth nearest neighbor. “Mean” and “median” use the average and the median of those distances, respectively.
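To make the three options concrete, here is a minimal pure-Python sketch of the scores. It is an illustration of the definitions above, not PyOD's code; the function name `knn_outlier_scores` and the toy points are hypothetical.

```python
import math
import statistics

def knn_outlier_scores(points, k, method="largest"):
    """Score each point by the distances to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Sorted distances from p to all other points; keep the k smallest.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists = dists[:k]
        if method == "largest":
            scores.append(k_dists[-1])  # distance to the kth nearest neighbor
        elif method == "mean":
            scores.append(statistics.mean(k_dists))
        elif method == "median":
            scores.append(statistics.median(k_dists))
        else:
            raise ValueError(f"unknown method: {method}")
    return scores

# Hypothetical toy data: the last point sits far from the others.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
largest = knn_outlier_scores(points, k=2, method="largest")
mean = knn_outlier_scores(points, k=2, method="mean")
median = knn_outlier_scores(points, k=2, method="median")
```

Whichever measure is used, the isolated point receives the highest score here. In PyOD itself, the choice is made through the `method` argument of the `KNN` class.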
(D) Modeling Procedure
I apply the following three-step modeling procedure for the model development, assessment, and interpretation of the results.
- Step 1: Model development
- Step 2: Threshold determination
- Step 3: Descriptive statistics of the normal and abnormal groups

Step 1 builds the model and produces outlier scores. In Step 2, we choose a threshold to separate the abnormal observations with high outlier scores from the normal observations. If prior knowledge suggests that anomalies should make up no more than 1% of the data, you can choose a threshold that flags approximately 1% of the observations as anomalies.
In Step 3, we profile the two groups using descriptive statistics (such as the means and standard deviations) by group. The descriptive statistics table is important for communicating the soundness of the model. If the mean of a feature in the abnormal group is expected to be higher than that of the normal group but the result shows the opposite, you should investigate, modify, or drop the feature. You should repeat Steps 1 to 3 until the descriptive statistics of all features are consistent with expectations.
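As a rough sketch of Step 2, a threshold for a target anomaly rate can be read off the ranked outlier scores. The helper name `threshold_for_rate` and the scores below are hypothetical; the scores simply stand in for the output of a fitted model.

```python
def threshold_for_rate(scores, anomaly_rate):
    """Return a cutoff so that roughly anomaly_rate of the scores are at or above it."""
    ranked = sorted(scores, reverse=True)
    # Number of observations to flag, with at least one.
    n_anomalies = max(1, round(len(ranked) * anomaly_rate))
    return ranked[n_anomalies - 1]

# 200 hypothetical outlier scores.
scores = [float(i) for i in range(1, 201)]

threshold = threshold_for_rate(scores, anomaly_rate=0.01)
# Flag observations at or above the threshold: 1 = abnormal, 0 = normal.
flags = [int(s >= threshold) for s in scores]
```

With tied scores at the cutoff, slightly more than the target share may be flagged, which is why the result is only approximately 1%.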
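Step 3 can be sketched as a small group-by summary. This is only an illustration in plain Python with hypothetical feature names and values. In practice a data-frame library would usually do this, and standard deviations could be added in the same way as the means.

```python
import statistics

def profile_by_group(rows, flags):
    """Return the count and per-feature means for the normal and abnormal groups."""
    groups = {"normal": [], "abnormal": []}
    for row, flag in zip(rows, flags):
        groups["abnormal" if flag == 1 else "normal"].append(row)
    summary = {}
    for label, members in groups.items():
        feature_names = list(members[0]) if members else []
        summary[label] = {
            "count": len(members),
            "means": {
                name: statistics.mean(r[name] for r in members)
                for name in feature_names
            },
        }
    return summary

# Hypothetical observations with two features, and flags from the threshold step.
rows = [
    {"x1": 1.0, "x2": 2.0},
    {"x1": 3.0, "x2": 2.0},
    {"x1": 10.0, "x2": 9.0},
]
flags = [0, 0, 1]

summary = profile_by_group(rows, flags)
```

Comparing `summary["normal"]["means"]` with `summary["abnormal"]["means"]` feature by feature is the consistency check described above.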
(D.1) Step 1: Build your Model
Let’s generate some data with outliers. I use the PyOD utility function generate_data() to generate a dataset in which ten percent of the observations are outliers. Note that although this mock dataset has the target variable Y, the unsupervised KNN model uses only the X variables. The Y variable is simply for validation.


















