Renu Khandelwal

Summary

The provided web content outlines methods for detecting outliers in datasets using Python, specifically through the calculation of Z scores and Interquartile Range (IQR).

Abstract

The article "Finding outliers in dataset using python" discusses the importance of identifying outliers in datasets, which are data points that deviate significantly from the rest. It explains that outliers can arise from variability in data or measurement errors and can skew statistical analyses by affecting the mean and standard deviation. The article presents two primary methods for detecting outliers: the Z score method, which identifies outliers as points lying beyond three standard deviations from the mean, and the IQR method, which defines outliers as those falling outside of 1.5 times the IQR from the first and third quartiles. The article includes a Python function using the Z score to detect outliers and demonstrates its application on a sample dataset. Additionally, it provides step-by-step instructions for using IQR to identify outliers, including sorting the data, calculating quartiles, and establishing bounds.

Opinions

  • The article conveys the opinion that outliers can have a significant and detrimental impact on data analysis, emphasizing their ability to skew results.
  • It suggests that the presence of outliers is a natural occurrence in datasets due to inherent variability or errors in data collection.
  • The author appears to advocate for the use of both Z scores and IQR as standard techniques for outlier detection, implying that these methods are reliable and effective.
  • The inclusion of a Jupyter notebook with the code for outlier detection indicates the author's preference for practical, code-based demonstrations of data analysis techniques.
  • The article seems to assume a level of familiarity with Python and data analysis concepts, targeting an audience with some technical background.

Finding outliers in dataset using python

In this article, we will use the z score and the interquartile range (IQR) to identify outliers using Python.

The Jupyter notebook is available at https://github.com/arshren/MachineLearning/blob/master/Identifying%20outliers.ipynb

What is an outlier?

An outlier is a data point that is distant from the other observations in a data set, that is, a point that lies outside the overall distribution of the dataset.

What are the criteria to identify an outlier?

  • A data point that falls more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile
  • A data point that falls more than 3 standard deviations from the mean. We can check this with a z score: if the absolute value of the z score is greater than 3, the point is an outlier

Why would an outlier exist in a dataset?

An outlier could exist in a dataset due to

  • Variability in the data
  • An experimental measurement error

What is the impact of an outlier?

An outlier can cause serious issues for statistical analysis. It can

  • skew the data
  • have a significant impact on the mean
  • have a significant impact on the standard deviation
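
To make this impact concrete, here is a small sketch that compares the mean and standard deviation of the list used later in this article with and without the value 100. The variable names are ours and are only for illustration.

```python
import numpy as np

# Same list that is used in the z score section of this article
dataset = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13,
           12, 10, 10, 11, 12, 15, 12, 13, 12, 11, 14, 13, 15, 10, 15, 12,
           10, 14, 13, 15, 10]

# Drop the single extreme value to see how much it moves the statistics
without_outlier = [x for x in dataset if x != 100]

mean_with = np.mean(dataset)
std_with = np.std(dataset)
mean_without = np.mean(without_outlier)
std_without = np.std(without_outlier)

# Mean: about 14.59 with the value 100, about 12.22 without it
print(f"mean: {mean_with:.2f} vs {mean_without:.2f}")
# Standard deviation: about 14.33 with the value 100, about 1.70 without it
print(f"std:  {std_with:.2f} vs {std_without:.2f}")
```

A single extreme value out of 37 points is enough to shift the mean by more than two units and to inflate the standard deviation several times over.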

How can we identify an outlier?

  • using scatter plots
  • using Z score
  • using the IQR interquartile range

Using Scatter Plot

We can look at a scatter plot, which shows us whether a data point lies outside the overall distribution of the dataset

Scatter plot to identify an outlier

Using Z score

Formula for Z score = (Observation - Mean) / Standard Deviation

z = (X - μ) / σ
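
As a quick worked example of the formula, the sketch below computes the z score of one observation in a tiny list. The numbers are made up for illustration and are not from the article's dataset; they were chosen so that the mean is 5 and the standard deviation is 2.

```python
import numpy as np

# Toy values chosen for illustration: mean is 5, population std is 2
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = np.mean(values)      # μ, the mean
sigma = np.std(values)    # σ, the population standard deviation (ddof=0)

x = 9                     # the observation we want to score
z = (x - mu) / sigma      # z = (X - μ) / σ

print(z)  # 2.0, meaning 9 lies two standard deviations above the mean
```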

We first import the libraries

import numpy as np
import pandas as pd

We will use a list here

dataset = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13, 12, 10, 10, 11, 12, 15, 12, 13, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

We write a function that takes numeric data as an input argument.

We find the mean and standard deviation of all the data points.

We find the z score for each data point in the dataset, and if the absolute value of the z score is greater than 3, then we classify that point as an outlier. Any point more than 3 standard deviations from the mean is an outlier.

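import numpy as np

def detect_outlier(data_1):
    # Create the list inside the function so repeated calls do not accumulate results
    outliers = []
    threshold = 3  # flag points more than 3 standard deviations from the mean
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)  # population standard deviation (ddof=0)

    for y in data_1:
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > threshold:
            outliers.append(y)
    return outliers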

We now pass the dataset that we created earlier as an input argument to the detect_outlier function

outlier_datapoints = detect_outlier(dataset)
print(outlier_datapoints)

The output is [100], the only point in the dataset that lies more than 3 standard deviations from the mean.
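
The same check can also be written without an explicit loop. The following is a minimal sketch using NumPy array operations; the function name detect_outliers_zscore and the threshold parameter are our own choices, not part of the original notebook.

```python
import numpy as np

def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute z score exceeds the threshold."""
    values = np.asarray(data, dtype=float)
    std = values.std()  # population standard deviation, like np.std above
    if std == 0:
        # All values are identical, so no point can be an outlier
        return []
    z_scores = (values - values.mean()) / std
    # Boolean mask keeps only the points beyond the threshold
    return values[np.abs(z_scores) > threshold].tolist()

dataset = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13,
           12, 10, 10, 11, 12, 15, 12, 13, 12, 11, 14, 13, 15, 10, 15, 12,
           10, 14, 13, 15, 10]

print(detect_outliers_zscore(dataset))  # [100.0]
```

Because the list is converted to a float array, the result is returned as 100.0 rather than 100.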

Using IQR

The IQR tells us how spread out the middle values are. It can be used to tell when a value is too far from the middle.

An outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.

We will use the same dataset

Steps:

  • Arrange the data in increasing order
  • Calculate the first quartile (q1) and the third quartile (q3)
  • Find the interquartile range (q3 - q1)
  • Find the lower bound, q1 - 1.5 * iqr
  • Find the upper bound, q3 + 1.5 * iqr
  • Anything that lies outside of the lower and upper bounds is an outlier

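First, sort the dataset. np.percentile does not require sorted input, so this step only makes the data easier to inspect.

dataset = sorted(dataset)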

Finding first quartile and third quartile

q1, q3 = np.percentile(dataset, [25, 75])

q1 is 11 and q3 is 14

Find the IQR which is the difference between third and first quartile

iqr = q3 - q1

iqr is 3

Find lower and upper bound

lower_bound = q1 -(1.5 * iqr) 
upper_bound = q3 +(1.5 * iqr) 

lower_bound is 6.5 and upper_bound is 18.5, so anything below 6.5 or above 18.5 is an outlier. In our dataset, only 100 falls outside these bounds.
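
The IQR steps stop at the bounds, so here is a minimal sketch that puts them together and actually selects the outliers. The function name detect_outliers_iqr and the parameter k are hypothetical names introduced for this example; k defaults to the 1.5 multiplier used above.

```python
import numpy as np

def detect_outliers_iqr(data, k=1.5):
    """Return (outliers, lower_bound, upper_bound) using the IQR rule."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Keep only the points that fall outside the two bounds
    mask = (values < lower_bound) | (values > upper_bound)
    return values[mask].tolist(), lower_bound, upper_bound

dataset = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 10, 100, 12, 14, 13,
           12, 10, 10, 11, 12, 15, 12, 13, 12, 11, 14, 13, 15, 10, 15, 12,
           10, 14, 13, 15, 10]

outliers, lower_bound, upper_bound = detect_outliers_iqr(dataset)
print(lower_bound, upper_bound)  # 6.5 18.5
print(outliers)                  # [100.0]
```

On this dataset both methods flag the same point, although in general the z score and IQR rules can disagree, since the z score itself is affected by the outliers it is trying to detect.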
