Harsh Mishra

Summary

The blog post discusses Benford's Law, its applications in data validation and fraud detection, and its implementation on real-world datasets.

Abstract

The article "Benford’s law can be a game changer!!!" by Harsh Mishra delves into the intricacies of Benford's Law, a statistical phenomenon where the distribution of leading digits in many naturally occurring datasets is not uniform. The law suggests that the number 1 appears as the leading digit about 30% of the time, while 9 appears less than 5% of the time. Harsh Mishra illustrates the practical utility of this law by applying it to datasets from Kaggle, specifically examining the prices of cars and runs scored by batsmen in test cricket, and finds that the results align with Benford's Law's predictions. The post emphasizes the importance of data validation in data science projects, highlighting Benford's Law as a simple yet powerful tool to detect anomalies or potential fraud, as was controversially attempted in the analysis of the 2009 Iranian elections and the 2020 U.S. presidential election. The author encourages the audience to consider this law as a first step in validating numerical data and invites readers to engage with his work on GitHub, LinkedIn, and through his blog.

Opinions

  • The author, Harsh Mishra, expresses that Benford's Law is underutilized in data validation processes, despite its potential to identify fraudulent data.
  • Mishra suggests that reliance on trusted data sources without validation is a common oversight in data science, which Benford's Law can help address.
  • The author is enthusiastic about the ease of implementing Benford's Law and its effectiveness in analyzing diverse datasets, such as car prices and cricket statistics.
  • Mishra points out that Benford's Law is more reliable when applied to naturally occurring datasets rather than those with human interference or mathematical constraints.
  • The author criticizes the misapplication of Benford's Law in some election fraud analyses, emphasizing the need for proper understanding and application of the law.
  • Mishra advocates for the broader adoption of Benford's Law as an initial tool for numerical data validation, suggesting it could be a game-changer in the field.

Benford’s law can be a game changer!!!

Today in this blog I want to share some information about Benford’s law. We will see what it is and how it could be a game changer as we go ahead.

All the code written in this blog is on my GitHub: https://github.com/HarshMishra2002/benfords_law

Links to the datasets used:

https://www.kaggle.com/veeralakrishna/icc-test-cricket-runs

https://www.kaggle.com/harshmishraandheri/car-dataset

What is Benford’s Law?

First, let’s see how Wikipedia defines it.

So the law is simple. Take any numerical dataset and extract the first digit (from the left) of each value. 1 will be the most frequent leading digit, appearing about 30% of the time, and 9 the least frequent, appearing a little under 5% of the time.

Graphical representation of Benford’s law
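The percentages above come from the logarithmic formula P(d) = log10(1 + 1/d), where d is the leading digit from 1 to 9. Here is a minimal sketch that prints the expected share for each digit. The function name `benford_expected` is just something I picked for illustration.

```python
import math


def benford_expected():
    """Return {digit: probability} for leading digits 1-9 under Benford's law."""
    # P(d) = log10(1 + 1/d)
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


expected = benford_expected()
for digit, p in expected.items():
    # Print each digit with its expected share as a percentage
    print(f"{digit}: {p:.1%}")
```

With this formula, digit 1 comes out at about 30.1% and digit 9 at about 4.6%, and the nine probabilities add up to 1.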

The law is named after Frank Benford, who presented it in his 1938 paper “The Law of Anomalous Numbers”, although Simon Newcomb had already noticed the same pattern in 1881.

The link for that paper is: https://mdporter.github.io/SYS6018/other/(Benford)%20The%20Law%20of%20Anomalous%20Numbers.pdf

The law became even more interesting to me when I read this paragraph from his paper:

“The study of the items shows a distinct tendency for those of a random nature to agree better with the logarithmic law than those of a formal or mathematical nature. The best agreement was found in the arabic numbers (not spelled out) of consecutive front page news items of a newspaper. Dates were barred as not being variable, and the omission of spelled-out numbers restricted the counted digits to numbers 10 and over. The first 342 street addresses given in the current American Men of Science (Item R, Table IV) gave excellent agreement, and a complete count (except for dates and page numbers) of an issue of the Readers’ Digest was also in agreement. On the other hand, the greatest variations from the logarithmic relation were found in the first digits of mathematical tables from engineering handbooks, and in tabulations of such closely knit data as Molecular Weights, Specific Heats, Physical Constants and Atomic Weights.”

Let me put this in simple words: the law is more applicable and reliable when the data is naturally occurring, and less so when the numbers come from a mathematical formula or there is any kind of intentional human interference.

But what makes this law a game changer is its application. When I work on a data science project, I simply go to Kaggle, download the CSV file and start working on it.

I NEVER VALIDATE DATA.

There are two reasons for that. First, I trust the source of the data so much that I never ask whether the data is appropriate or whether some percentage of it is fraudulent. Second, there are not many tools available for this kind of check, and that is where Benford’s law comes in.

The thing which I love the most about this law is that it is very easy to implement.

When I learnt about it, I thought I should try applying this law to an actual dataset and see what the result is.

So I took a dataset of cars and applied the law to the price column, which holds the price of each car.

The result was a close match to Benford’s law.
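To give an idea of how easy the implementation is, here is a rough sketch of the first-digit count in plain Python. This is not necessarily the exact code from my repository. The helper names `leading_digit` and `first_digit_shares` and the sample prices are made up for illustration.

```python
import math
from collections import Counter


def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero/NaN/inf."""
    x = abs(float(value))
    if x == 0 or not math.isfinite(x):
        return None
    # Scientific notation puts the leading digit first, e.g. '4.500000000000000e+03'
    return int(f"{x:.15e}"[0])


def first_digit_shares(values):
    """Return {digit: share of values whose leading digit is that digit}."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("no usable values")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Hypothetical prices, only to show the call. With pandas you could pass a
# column instead, e.g. pandas.read_csv("cars.csv")["price"]
# (the file name and column name here are assumptions).
sample_prices = [4500, 12999, 1850, 23000, 1200, 9990]
shares = first_digit_shares(sample_prices)
print(shares)
```

The shares can then be plotted or compared side by side with the expected Benford percentages.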

I tried the same thing on another dataset. This time I had data on the runs scored by batsmen in Test cricket.

And once again I got the same result.
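Looking at the plot is one way to judge the match. If you want a number for it, one common option is a chi-square statistic against the Benford expectation. The sketch below is my own addition, not part of the original analysis, and the two sets of digit counts in it are invented purely to show how the statistic behaves.

```python
import math

# Expected share of each leading digit under Benford's law
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at the 5% level
CRITICAL_5PCT_DF8 = 15.507


def chi_square_vs_benford(observed_counts):
    """observed_counts: {digit: count} for digits 1-9. Returns the chi-square statistic."""
    n = sum(observed_counts.get(d, 0) for d in range(1, 10))
    if n == 0:
        raise ValueError("no observations")
    stat = 0.0
    for d, p in BENFORD.items():
        expected = n * p
        stat += (observed_counts.get(d, 0) - expected) ** 2 / expected
    return stat


# Invented counts: one set shaped like Benford's law, one perfectly uniform
benford_like = {1: 301, 2: 176, 3: 125, 4: 97, 5: 79, 6: 67, 7: 58, 8: 51, 9: 46}
uniform = {d: 111 for d in range(1, 10)}

print(chi_square_vs_benford(benford_like) < CRITICAL_5PCT_DF8)  # consistent with Benford
print(chi_square_vs_benford(uniform) > CRITICAL_5PCT_DF8)       # deviates from Benford
```

Keep in mind that with very large datasets a chi-square test will flag even tiny deviations, so treat the number as a hint and not as a verdict.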

In the 2009 Iranian elections, Benford’s law was presented as evidence of fraud. According to Mebane’s analysis, the second digits in the vote counts for President Mahmoud Ahmadinejad, the election winner, differed significantly from the expectations of Benford’s law, and ballot boxes with very few invalid ballots had a greater influence on the results, suggesting widespread ballot stuffing.

Benford’s law has also been used inappropriately to claim election fraud. In the 2020 United States presidential election, the first digits of Joe Biden’s vote counts in Chicago, Milwaukee, and other cities did not match Benford’s distribution. The mistake was applying the law to data that was tightly bounded in range, which violates the law’s assumption that the data spans several orders of magnitude.
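The range problem is easy to see with synthetic numbers. In the sketch below, `narrow` is a made-up set of counts that all sit between 300 and 699, and `wide` is the powers of 2, which span many orders of magnitude. Neither is real election data, they are only there to show the effect.

```python
from collections import Counter


def first_digit_shares(values):
    """Share of each leading digit 1-9 for a collection of positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


narrow = range(300, 700)                  # everything within one order of magnitude
wide = [2 ** n for n in range(1, 501)]    # spans roughly 150 orders of magnitude

narrow_shares = first_digit_shares(narrow)
wide_shares = first_digit_shares(wide)

print("narrow, leading 1:", narrow_shares[1])  # no value can start with 1
print("wide, leading 1:", wide_shares[1])      # close to the Benford share of about 0.30
```

The narrow data fails Benford’s law without any fraud involved, simply because no value in that range can start with 1, 2, 7, 8 or 9.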

So I feel that if you need to validate numerical data or detect fraud in numbers, this law should be the first tool you use.

I hope you guys learned something new and enjoyed this blog. If you liked it, then share it with your friends. Take care. Keep learning.

You can also reach me through my LinkedIn account: https://www.linkedin.com/in/harsh-mishra-4b79031b3/
