Are These Data ‘Normal’? Anomalies & Outliers In Machine Learning

A deep dive into isolation forests with Python

Calculus jokes are mostly derivative, trigonometry jokes are too graphic, algebra jokes are usually formulaic, and arithmetic jokes are pretty basic.

But the occasional statistics joke is an outlier.

Nobody wants outliers in their data — especially when they have come from the likes of false entries due to fat thumbs. A couple of zeros can throw off an algorithm and can destroy summary statistics.

So this is how you use machine learning to remove those pesky outliers.

What is normal?

Historically, the first step in anomaly detection has been to understand what’s “normal”, and then find examples of “not normal”. These “not normal” points are what we would classify as outliers: they don’t fit our expected distribution, even at the furthest ends of it.

Isolation Forest, or iForest, is an elegant and beautiful idea that doesn’t follow this approach. The premise behind it is simple: the original authors stated that outliers, or anomalies, are data points that are “few and different” from the rest of the population.

They went further, noting that data points that are “few and different” are more susceptible to a property they called “isolation”.

Seems logical.

How does it work?

The isolation forest algorithm works by isolating instances without relying on any distance measure. It exploits the two properties ascribed to anomalies earlier: that they are both few and different.

The algorithm makes use of binary trees, which are recursive structures in which each node has at most two children.

https://en.wikipedia.org/wiki/Binary_tree#/media/File:Full_binary.svg

Here’s the important bit.

Because of their susceptibility to being isolated, outliers are more likely to be located near the ROOT of the tree — it takes fewer partitions to isolate them.

Therefore, points with a shorter path length are likely to be outliers. The iForest algorithm builds an ensemble of iTrees (seem familiar?) and then averages each point's path length across those trees.
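To make the idea concrete, here is a minimal sketch of random partitioning in Python. It is not the authors' implementation: the function names (`path_length`, `mean_path_length`), the synthetic data and the planted outlier at (6, 6) are all assumptions made for illustration, and it skips the sub-sampling that a real iForest uses.

```python
import numpy as np


def path_length(X, target, rng, max_depth=100):
    """Count the random splits needed to isolate `target` (a row of X)."""
    current = X
    depth = 0
    while len(current) > 1 and depth < max_depth:
        # Only features that still vary can separate the remaining points.
        spans = current.max(axis=0) - current.min(axis=0)
        splittable = np.flatnonzero(spans > 0)
        if splittable.size == 0:  # only exact duplicates are left
            break
        feature = rng.choice(splittable)
        column = current[:, feature]
        split = rng.uniform(column.min(), column.max())
        # Keep only the side of the split that contains the target point.
        if target[feature] < split:
            current = current[column < split]
        else:
            current = current[column >= split]
        depth += 1
    return depth


def mean_path_length(X, index, rng, n_trees=200):
    """Average the path length of X[index] over many random trees."""
    return float(np.mean([path_length(X, X[index], rng) for _ in range(n_trees)]))


rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(255, 2))
X = np.vstack([cluster, [[6.0, 6.0]]])  # the last row is a planted outlier

outlier_index = len(X) - 1
inlier_index = int(np.argmin(np.linalg.norm(cluster, axis=1)))  # most central point

outlier_depth = mean_path_length(X, outlier_index, rng)
inlier_depth = mean_path_length(X, inlier_index, rng)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

The planted outlier should need noticeably fewer splits on average than the point in the middle of the cluster, which is exactly the signal iForest relies on.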

https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkdd11.pdf — the path length is the number of edges that have to be traversed.

https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkdd11.pdf

As you can see from the original authors' images above, the point our eye clearly tells us is an outlier takes far fewer partitions to isolate. Its average path length converges to less than half that of the point that is clearly not an outlier.
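For reference, the original paper turns the average path length into an anomaly score. The notation below follows the paper as I understand it: h(x) is the path length of point x, E(h(x)) is its average over the trees, n is the sub-sample size, and c(n) is a normalising term based on the harmonic number H.

```latex
s(x, n) = 2^{-\frac{E(h(x))}{c(n)}},
\qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n},
\qquad
H(i) \approx \ln(i) + 0.5772156649
```

A score close to 1 points to an anomaly, while a score well below 0.5 suggests a normal point. If every point scores around 0.5, the sample has no distinct anomalies.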

How does it compare to other anomaly detection algorithms?

Firstly, there are a lot of benefits to using iForest. Here are some examples.

iForest can exploit sub-sampling, so it has linear time complexity with a low constant and a small memory requirement.
It can deal with the effects of swamping and masking.
Works well on high dimensional problems.
Works well with irrelevant attributes.
Works without training sets that include anomalies (unsupervised).

But the main reason iForest is so good at what it does is that it was designed for this job. This isn’t the case for a lot of its counterparts. Algorithms such as one-class SVM and most clustering methods were designed for other purposes: they can be adapted to anomaly detection, but they were not created for it.

A full comparison of different anomaly detection algorithms can be seen in the scikit-learn documentation on outlier detection.

From this picture alone I know which one I fancy using.

https://scikit-learn.org/stable/modules/outlier_detection.html#isolation-forest

Here’s how it works in Python

It’s easy to run in Python: it takes only 30 or so lines to get the predictions.

All you have to do is import the packages and your data. Then, convert the data to a NumPy array before creating and fitting the model. You can then convert the predictions back to a pandas DataFrame and run the value_counts method on it, which will tell you how many outliers you have. On the dataset I used, the algorithm picked up 148 outliers, each of which it labels -1.

You can then merge this back onto your original dataframe and use your plotting package of choice to have a look at the outliers it has predicted.
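The workflow described above might look roughly like the sketch below. It is not the original script: the data is a synthetic stand-in for the dataset, and the column names (`x`, `y`, `outlier`) and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in data: a dense cluster plus ten far-away points.
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
planted = rng.uniform(low=8.0, high=10.0, size=(10, 2))
df = pd.DataFrame(np.vstack([inliers, planted]), columns=["x", "y"])

# Convert to a NumPy array, then fit the model and predict in one step.
X = df.to_numpy()
model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
labels = model.fit_predict(X)  # 1 means inlier, -1 means outlier

# Back to pandas: count the labels, then join them onto the original frame.
predictions = pd.DataFrame({"outlier": labels}, index=df.index)
print(predictions["outlier"].value_counts())

result = df.join(predictions)
flagged = result[result["outlier"] == -1]
print(flagged.head())
```

From here, `flagged` can be passed to whichever plotting package you prefer. The `contamination` parameter controls how many points are labelled as outliers, so the counts you see depend heavily on that choice.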

I hope this all makes sense and happy (outlier) hunting.

Cheers,

James
