UNSUPERVISED LEARNING
Understanding OPTICS and Implementation with Python
In this post, I briefly talk about how to understand OPTICS, an unsupervised learning method, and how to implement it in Python.
OPTICS stands for Ordering Points To Identify the Clustering Structure. It is a density-based unsupervised learning algorithm, developed by the same research group that developed DBSCAN. As discussed in my last post, DBSCAN has a major disadvantage: it struggles to identify clusters in data of varying density. OPTICS, however, doesn’t require the density to be consistent across the dataset.
One simple example to illustrate the advantage of OPTICS over DBSCAN is shown below.
In the example above, the constant distance parameter eps in DBSCAN can only regard points within eps of each other as neighbors, and it obviously misses the cluster on the bottom right of the figure (read this post for more detailed info about the parameters in DBSCAN). If we used a larger eps parameter, that cluster could be identified, but it would also potentially merge the two top clusters together. So, it’s very hard to predefine a perfect eps in DBSCAN; in OPTICS, however, we don’t need to worry about that.
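If you want to see this failure mode yourself, here is a minimal sketch (the synthetic data and the two eps values are my own illustrative assumptions, not taken from the figure above):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data: two tight clusters plus one much sparser cluster
X_dense, _ = make_blobs(centers=[[0, 0], [3, 3]], cluster_std=0.3,
                        n_samples=200, random_state=0)
X_sparse, _ = make_blobs(centers=[[9, -2]], cluster_std=1.5,
                         n_samples=100, random_state=0)
X = np.vstack([X_dense, X_sparse])

# A small eps fits the dense clusters but labels the sparse one as noise (-1);
# a large eps keeps the sparse cluster but risks merging the dense ones.
labels_small = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
labels_large = DBSCAN(eps=2.0, min_samples=5).fit_predict(X)
print(np.unique(labels_small), np.unique(labels_large))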
In this post, I will talk about how to understand this algorithm and how to implement it in Python. Hope the article is helpful.
Understanding OPTICS
As you may remember, a single distance cutoff is applied in DBSCAN to determine whether two data points are close to each other (neighbors) or not. But as shown in the example above, this may not be an ideal solution, because a “long” distance in one cluster could be a “short” distance in another.
Just like house prices across the nation: a $300k house is pretty expensive in the town where I live, but it is a cheap one in Manhattan. That’s why it’s important to keep different standards of prices in mind when we look at different groups (clusters) of houses (data points).
So, in OPTICS, how can we take the groups’ different “standards” of distance into consideration? One of the most straightforward ways is just to compare with the local environment (the context). Specifically, if we are interested in determining whether the distance from point A to point B is large or small, we can compare the distance (A to B) with all the pairwise distances in the local environment.
As we can see from the illustration plot above, the distance from A to B is not that large compared to all the pairwise distances within that local environment, right? So, we should regard A and B as neighbors (A is close to B in distance).
You may have already noticed several things that are not very clear in the previous steps to determine whether the distance from A to B is large or not. For example,
1. How do we define a “local area”, especially when there is no clear separation in the data as shown in the example?
2. How do we determine whether a local environment is worth investigating? What if there aren’t enough pairwise distances to compare with?
3. Is it really necessary to consider all pairwise distances in the area? For example, is it fair to compare the distance from A to B with that from B to D (especially when we already know that B and D seem to sit on opposite boundaries of the local area)?
If you do have the questions listed above, congratulations, because you are thinking in the right direction!
First, to define a “local area”, we need a center and a radius (same concepts in DBSCAN). For example, point A is the first point we want to visit and we want to look around A’s local area. We set A as the center and a predefined value, r, as the radius, and the local area is defined as the area within the circle (shown below).
In the example plot above, r is obviously a good choice of radius around point A because it perfectly covers a “local cluster” of data points. However, in reality, we don’t see the entire picture of the data, nor do we have such clear separations of “local clusters”. So, it’s usually good practice to set r to a relatively large value, in order to avoid staring at a very small region when you are standing at A.
Second, to determine whether an area is worth investigating, we set a minimum number of data points as the criterion. For example, a “local area” defined in the previous step is worth investigating only if it contains at least N data points (including the center itself).
In the plot above, we set N = 3, which means an area is only worth investigating if it contains at least 3 data points. So, the region around point P is not going to be investigated as a “local area” because it has zero neighbors, and we are not able to compare any pairwise distances.
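In code, these two rules boil down to a radius query plus a point count. Here is a minimal sketch (the helper names and the brute-force numpy search are my own, not part of the official algorithm):

import numpy as np

def neighbors_within(X, center_idx, r):
    # Indices of all points within radius r of X[center_idx] (center included)
    dists = np.linalg.norm(X - X[center_idx], axis=1)
    return np.where(dists <= r)[0]

def worth_investigating(X, center_idx, r, N=3):
    # A local area counts only if it contains at least N points, center included
    return len(neighbors_within(X, center_idx, r)) >= N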
Third, as you may have noticed, it is not necessary to calculate all the pairwise distances within the “local area” when we want to determine whether the distance from A to B is large or not. Let’s look at the pairwise distances in the local area again,
As shown in the left plot above, point A and point B are “adjacent” in the plot, but point B and point D are not. There’s no doubt that AB is shorter than BD, so it’s neither fair nor necessary to compare the distance from A to B with that from B to D. Is there a way for us to only consider those distances between “adjacent” points (as shown in the right plot above)?
Actually, reducing the comparisons from the left plot to the right plot above would be ideal, but in reality we have many more data points in much more complicated arrangements, and it’s hard for us to manually select the edges to compare with the distance from A to B.
Alternatively, can we solve the problem in an iterative way instead? For example, we visit the points in the local area one by one, and each time we only record the distances from the current point to its adjacent points and then move to the next point. After visiting all the points in the area, we have recorded all the adjacent distances. In this way, we don’t have to see the entire picture of distances in advance or manually select edges after computing all pairwise distances.
This is a good way, but can we further reduce the complexity?
Alright, I don’t want you to feel lost, so let me state the problem again. We want to compare the distance between A and B to all the other pairwise “adjacent” distances, in order to determine whether A and B are close to each other.
Do you still remember why we wanted to know whether the distance between A and B is large or small compared to the other pairwise distances? We want to know whether B belongs to the cluster of data points we are currently visiting (the cluster centered around point A). In other words, we want to use the distance metric to determine whether a point is reachable or not from a cluster of data points.
Reachable or unreachable. Uhmm… Can we assign a reachability score to each point? Would that be much simpler? A large reachability score means a point the cluster can hardly reach, and a small reachability score means a point that is easily reachable from inside the cluster.
Don’t worry if you feel a little lost here. Please follow me to go through the entire process, and I’m sure you will understand what I’m saying.
Here are the steps:
First, we visit the local area around point A. We calculate the distances from the center point to all the other points in the area, AB, AC, AD, AE, and AF, and record them. We order the recorded distances and regard them as the current reachability scores. We also need to push A to the final score list so that we don’t visit it again. As the initial point, A has no score assigned, so let’s keep it blank (or we can assign it a super large value).
Next, we visit the point with the smallest reachability score (point D), which is also the closest point to the current center point. We use this new point as the center point and calculate the distances from the current center point to the other points that have not been visited within its neighborhood.
As shown in the plot above, we only calculate the distance from D to C and that from D to E, because A has already been visited (no need to calculate it again). Then, we update the current reachability score list with a smaller score where available. Here, E has a new score of 1, which is smaller than that in the previous step (E = 1.4 in step 1). So, we update E’s score. But point C has a larger score than in the first step, which means C is more reachable from A than from D. So, we keep C’s score unchanged at this step. B and F are not investigated in this step, so their scores are not updated.
Remember we want to record the smallest reachability score of each point in order to claim whether it belongs to the current cluster or not, so we cannot use a large score for C when a smaller score is available. It’s unfair for point C if we exclude it from the cluster just because we didn’t use its best score.
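In code, this “keep the smallest score seen so far” rule is just a min() update. A tiny sketch (C’s step-1 score of 1.2 is made up; E’s numbers follow the walkthrough):

scores = {"C": 1.2, "E": 1.4}   # reachability scores recorded from A (illustrative values)
updates = {"C": 1.5, "E": 1.0}  # distances measured from the new center point D

# Keep the smallest reachability score ever observed for each point
for point, new_score in updates.items():
    scores[point] = min(scores.get(point, float("inf")), new_score)

print(scores)  # {'C': 1.2, 'E': 1.0} -- C keeps its old score, E's improves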
Third, we move to E because E has the smallest reachability score in the current list.
As shown in the plot above, we only need to calculate the distance from E to F, since A and D were already visited in previous steps. We get a slightly smaller reachability score for F (F = 1.9), so we update it in the current score list.
Next, it is point B that has the smallest reachability score, so we use it as the current center point in this step.
No reachability score is updated in step 4 because we cannot find smaller values. Then we move to step 5 with C as the center point.
And then we use point F as the center point and output F’s reachability score to the list.
So far, we have visited all six points in the current local area and obtained the final reachability score list. In this way, we didn’t miss any of the pairwise distances between “adjacent” points. In addition, we got an ordered list of scores that measure how reachable the points are from the cluster under investigation.
Is the procedure described above much simpler than calculating all pairwise distances in the area?
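To make the walkthrough concrete, here is a minimal sketch of the whole visiting loop in Python. The coordinates are made up so that the visiting order roughly reproduces the steps above, and for simplicity I ignore the radius r and the core approximation that comes later; this is my own illustration, not the official algorithm:

import numpy as np

# Made-up 2-D coordinates for the six points A-F
points = {"A": (0.0, 0.0), "B": (1.0, -1.0), "C": (-1.2, 0.8),
          "D": (0.5, 0.0), "E": (1.4, 0.4), "F": (3.2, 1.0)}

def dist(p, q):
    return float(np.hypot(p[0] - q[0], p[1] - q[1]))

center = "A"
visited, order = {center}, [center]
final_scores = {center: None}  # the initial point gets no score
scores = {}  # current best (smallest) reachability score per unvisited point

while len(visited) < len(points):
    # Update the scores of unvisited points from the current center,
    # keeping the smaller value whenever one is already recorded
    for name, xy in points.items():
        if name not in visited:
            scores[name] = min(scores.get(name, float("inf")),
                               dist(points[center], xy))
    # Move to the unvisited point with the smallest current score
    center = min(scores, key=scores.get)
    final_scores[center] = scores.pop(center)
    visited.add(center)
    order.append(center)

print(order)         # ['A', 'D', 'E', 'B', 'C', 'F']
print(final_scores)  # the final reachability score list, in visiting order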
Actually, we can apply this procedure to the entire dataset. For example, after visiting F as the center point, we have no other unvisited points in the local area. Then, the next center point could be any other data point in the dataset.
If we draw one of the possible entire paths connecting the center points in the order of visit, it could be something like the plot below.
You can imagine the path as a greedy snake that wants to eat all the balls (data points) in the graph by always biting the next closest ball (in the score list).
The single point on the top right has one “neighbor” within the predefined neighborhood of the middle cluster (with the radius r), so it was visited in the last step of the orange part. However, the point on the bottom left has no “neighbors” at all, so it was never investigated, because it doesn’t meet our criterion that an area is worth investigating only if it has at least N = 3 neighbors (including itself).
After doing all of this, we end up with a reachability score list. What can we do with it? Don’t forget that our goal is to determine whether a point belongs to a local cluster or not. For example, here is the score list we get,
If we plot the score as a bar chart, we can visualize them as below.
You may have already noticed that the three valleys in the barplot on the right correspond to the three clusters on the left. So, using the final reachability score list, we are able to detect the clusters even if they have different densities.
In addition, you can see that a denser cluster (the brown one on the top left of the left panel) corresponds to a deeper valley in the barplot (the valley circled in brown on the right panel), and a sparser cluster (the blue cluster on the bottom right of the left panel) corresponds to a relatively shallower valley (the leftmost circle on the barplot).
Remember that we defined a neighborhood worth visiting as an area with at least 3 neighbors. We can regard those three neighbors as the core of the neighborhood. Actually, we don’t care much about points located within the core, because they obviously belong to the cluster. So, typically, for all the points within the core, we set the core size as their reachability score.
For example, in the plot below, there are six points in the area centered at A with a radius of r. The core of the area is formed by A, D, and E, because D and E are the two closest points to A (N = 3). We know that D is within the core, so we don’t have to record the exact distance from A to D (a tiny value). We can simply assign the core size (the radius of the core) as the score of point D.
After approximating the scores of the points within the core by the core size, the smallest value we can get in the final reachability score list is the core size. Why do we do this? First, to avoid recording lots of uselessly tiny values; second, to make the valleys in the barplot look smoother (as shown below).
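Here is a small sketch of this approximation (the helper names are mine): the core size is the distance to the N-th closest point counting the center itself, and every recorded score is floored at that value.

import numpy as np

def core_size(X, i, N=3):
    # Distance from X[i] to its N-th closest point, counting X[i] itself
    dists = np.sort(np.linalg.norm(X - X[i], axis=1))  # dists[0] is 0 (the point itself)
    return dists[N - 1]

def reachability_score(X, i, j, N=3):
    # The true distance from X[i] to X[j], floored at the core size of X[i]
    return max(core_size(X, i, N), float(np.linalg.norm(X[i] - X[j])))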
I didn’t use any official OPTICS terminology, but the procedure described above is exactly OPTICS. If you followed all the way down here, you have already understood OPTICS.
Let me just pull out some descriptions from the textbooks and briefly illustrate how they are embedded in my explanation above.
The OPTICS Algorithm
Most textbooks start with this: “The basic idea of OPTICS is similar to DBSCAN, but it has two more concepts than DBSCAN”. Let me use some official words to describe these two concepts in OPTICS and show you what these complex terms really refer to in my plain-language description above.
In OPTICS each point is assigned a core distance that describes the distance to the MinPts-th closest point, and a reachability distance of another point o from a point p that is either the distance between o and p, or the core distance of p, whichever is bigger.
The core distance is just the core size as I described, which simply serves as the threshold to approximate the distances of points inside the core. The reachability distance is just the reachability score we record for each point.
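Written compactly, the textbook definition above reads, with dist(p, o) being the ordinary distance between the two points:

reachability-distance(o, p) = max(core-distance(p), dist(p, o))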
And here is the pseudo-code of the algorithm from textbooks.
function OPTICS(DB, eps, MinPts) is
    for each point p of DB do
        p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB do
        N = getNeighbors(p, eps)
        mark p as processed
        output p to the ordered list
        if core-distance(p, eps, MinPts) != UNDEFINED then
            Seeds = empty priority queue
            update(N, p, Seeds, eps, MinPts)
            for each next q in Seeds do
                N' = getNeighbors(q, eps)
                mark q as processed
                output q to the ordered list
                if core-distance(q, eps, MinPts) != UNDEFINED do
                    update(N', q, Seeds, eps, MinPts)
I know it’s hard to digest. So, feel free to ignore it; it’s exactly the same as the steps I walked through in the previous paragraphs. The final output of OPTICS here is the reachability distance list, which is exactly the reachability score list we have seen before.
Actually, OPTICS stops when the reachability distance barplot is generated. The final clustering step needs to be executed manually. That’s why, strictly speaking, OPTICS is NOT a clustering method, but a method to show the structure of the dataset.
The Implementation in Python
The implementation of OPTICS in Python is super easy,
from sklearn.cluster import OPTICS
optics_clustering = OPTICS(min_samples=3).fit(X)
If you want to know the final labels of the data points, use
optics_clustering.labels_
Easy, right?
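If you want to reproduce the reachability barplot from earlier, the fitted object exposes the visiting order and the reachability distances directly. A minimal sketch, with some made-up synthetic data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Three clusters of different densities (illustrative data)
X, _ = make_blobs(centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=[0.3, 0.8, 1.5],
                  n_samples=300, random_state=0)

optics_clustering = OPTICS(min_samples=3).fit(X)

# Reachability distances in visiting order: this is the "barplot" from earlier
reachability = optics_clustering.reachability_[optics_clustering.ordering_]
reachability[np.isinf(reachability)] = np.nan  # undefined scores (e.g., the starting point)
plt.bar(range(len(reachability)), reachability)
plt.ylabel("reachability distance")
plt.show()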
You may have noticed that there’s no parameter called eps in the OPTICS implementation code (scikit-learn calls its counterpart max_eps, and it defaults to infinity). That’s because this parameter doesn’t really affect the result as long as it’s not too small. Imagine it as the maximum field of vision when you are standing at point A in the very first step. Why? If you go over the post again, it’s easy to get the answer.
That’s it! If you enjoyed reading my post, please subscribe to my Medium account!
Cheers!