Finding outliers in a dataset using Python

In this article, we will use the z score and the interquartile range (IQR) to identify outliers using Python.

The Jupyter notebook is available at https://github.com/arshren/MachineLearning/blob/master/Identifying%20outliers.ipynb

What is an outlier?

An outlier is a data point that is distant from the other observations in a dataset: a point that lies outside the overall distribution of the data.

What are the criteria to identify an outlier?

  • A data point that falls more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile
  • A data point that falls more than 3 standard deviations from the mean. We can measure this with a z score: if the absolute z score exceeds a chosen threshold (3 in this article; 2 is sometimes used as a stricter cutoff), the point is treated as an outlier

What is the reason for an outlier to exist in a dataset?

An outlier could exist in a dataset due to:

  • Variability in the data
  • An experimental measurement error

What is the impact of an outlier?

An outlier can cause serious issues for statistical analysis. It can:

  • skew the data
  • have a significant impact on the mean
  • have a significant impact on the standard deviation
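As a rough illustration of this impact, the sketch below compares the mean, median, and standard deviation of a small sample with and without one extreme value. The sample values are made up for this example and are not the dataset used later in the article.

```python
import numpy as np

# Made-up sample (an assumption for illustration): five typical values plus one extreme value.
with_outlier = np.array([10, 12, 11, 13, 12, 100])
without_outlier = with_outlier[with_outlier != 100]

for label, values in [("with outlier", with_outlier), ("without outlier", without_outlier)]:
    # np.std uses the population standard deviation (ddof=0) by default.
    print(label, "mean:", values.mean(), "median:", np.median(values), "std:", values.std())
```

The mean and standard deviation shift sharply once the extreme value is included, while the median stays the same, which is why a single outlier can distort summaries built on the mean.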

How can we identify an outlier?

  • using scatter plots
  • using the z score
  • using the interquartile range (IQR)

Using Scatter Plot

A scatter plot shows at a glance whether a data point lies outside the overall distribution of the dataset.

[Figure: scatter plot to identify an outlier]
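A minimal sketch of how such a plot could be drawn, assuming matplotlib is installed. The values are illustrative rather than the article's data, and plotting each value against its list index is just one possible choice of axes.

```python
import matplotlib.pyplot as plt

# Illustrative values (an assumption, not the article's data): one point sits far from the rest.
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100, 12, 14, 13]

fig, ax = plt.subplots()
# Plot each value against its position in the list.
points = ax.scatter(list(range(len(values))), values)
ax.set_xlabel("Index")
ax.set_ylabel("Value")
ax.set_title("Scatter plot to identify an outlier")
# In a notebook the figure renders automatically; in a script, call plt.show().
```

The point with value 100 appears far above the band formed by the other values, which makes it easy to spot visually.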

Using Z score

Formula for Z score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ
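To make the formula concrete, here is a small worked sketch with NumPy. The sample is made up so the arithmetic is easy to check by hand (mean 5, standard deviation 2); it is not the article's dataset.

```python
import numpy as np

# Made-up sample chosen so the arithmetic is easy to verify by hand.
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mu = sample.mean()    # mean of the sample
sigma = sample.std()  # population standard deviation (NumPy's default, ddof=0)

# Apply z = (X - mu) / sigma to every observation at once.
z_scores = (sample - mu) / sigma
print(z_scores)
```

The observation 9 is two standard deviations above the mean, so its z score is 2, and the observation 2 is one and a half standard deviations below it, giving -1.5.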

We first import the libraries

import numpy as np
import pandas as pd

We will use a list here

dataset = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13, 12, 10, 10, 11, 12, 15, 12, 13, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

We write a function that takes numeric data as an input argument.

We find the mean and standard deviation of all the data points.

We then find the z score for each data point in the dataset. If the absolute value of the z score is greater than 3, we classify that point as an outlier: any point more than 3 standard deviations from the mean is treated as an outlier.

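import numpy as np

def detect_outlier(data_1, threshold=3):
    # Collect every value whose absolute z score exceeds the threshold
    outliers = []
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)  # population standard deviation (ddof=0)
    for y in data_1:
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > threshold:
            outliers.append(y)
    return outliers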

We now pass the dataset that we created earlier as an input argument to the detect_outlier function

outlier_datapoints = detect_outlier(dataset)
Output of outlier_datapoints: [100]
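The same check can also be written without an explicit loop. The sketch below is one possible vectorized variant using a NumPy boolean mask; the function name detect_outliers_zscore and its threshold parameter are hypothetical names introduced here, not part of the original notebook.

```python
import numpy as np

dataset = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13,
           12, 10, 10, 11, 12, 15, 12, 13, 12, 11, 14, 13, 15, 10, 15, 12,
           10, 14, 13, 15, 10]

def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute z score exceeds `threshold` (hypothetical helper)."""
    values = np.asarray(data, dtype=float)
    # Population standard deviation (ddof=0), matching np.std's default.
    z_scores = (values - values.mean()) / values.std()
    return values[np.abs(z_scores) > threshold]

print(detect_outliers_zscore(dataset))  # prints [100.]
```

Note that this sketch assumes the data is not constant: if every value is identical, the standard deviation is zero and the z score is undefined.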

Using IQR

The IQR tells us how spread out the middle half of the values is. It can be used to tell when a value is too far from the middle.

An outlier is a point that falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.

We will use the same dataset.

Steps:

  • Arrange the data in increasing order
  • Calculate the first quartile (q1) and third quartile (q3)
  • Find the interquartile range, iqr = q3 - q1
  • Find the lower bound, q1 - 1.5 * iqr
  • Find the upper bound, q3 + 1.5 * iqr
  • Anything that lies outside the lower and upper bounds is an outlier

First, sort the dataset
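A minimal sketch of this step, assuming Python's built-in sorted is acceptable here; the name sorted_dataset is introduced only for this example. np.percentile does not require sorted input, so sorting mainly helps when inspecting the values by eye.

```python
dataset = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13,
           12, 10, 10, 11, 12, 15, 12, 13, 12, 11, 14, 13, 15, 10, 15, 12,
           10, 14, 13, 15, 10]

# sorted() returns a new list in increasing order and leaves `dataset` unchanged.
sorted_dataset = sorted(dataset)
print(sorted_dataset)
```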


Find the first quartile and third quartile

q1, q3 = np.percentile(dataset, [25, 75])

q1 is 11 and q3 is 14

Find the IQR which is the difference between third and first quartile

iqr = q3 - q1

iqr is 3

Find the lower and upper bounds

lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)

lower_bound is 6.5 and upper_bound is 18.5, so any value below 6.5 or above 18.5 is an outlier.
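Putting the steps together, the sketch below wraps the IQR rule in a single function and applies the bounds to the dataset. The function name detect_outliers_iqr and the multiplier parameter k are hypothetical names chosen for this example; np.percentile is used with its default linear interpolation.

```python
import numpy as np

dataset = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13,
           12, 10, 10, 11, 12, 15, 12, 13, 12, 11, 14, 13, 15, 10, 15, 12,
           10, 14, 13, 15, 10]

def detect_outliers_iqr(data, k=1.5):
    """Return (outliers, lower_bound, upper_bound) using the k * IQR rule (hypothetical helper)."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Keep only the values that fall outside the two bounds.
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    return outliers, lower_bound, upper_bound

outliers, lower_bound, upper_bound = detect_outliers_iqr(dataset)
print(outliers)  # prints [100]
```

For this dataset both methods agree: the value 100 is the only point flagged by the z score rule and by the IQR rule.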

