Aayushi Johari

The Why And How Of Exploratory Data Analysis In Python

Exploratory Data Analysis — Edureka

Data analysis is basically where you use statistics and probability to figure out trends in the data set. It helps you sort out the “real” trends from the statistical noise. What is “noise”? A large amount of data that doesn’t seem to mean anything at all. The following topics are covered as part of Exploratory Data Analysis in Python:

  • What is Exploratory Data Analysis In Python?
  • Need For Exploratory Data Analysis
  • What Are The Steps In Exploratory Data Analysis In Python?
  • The Tools Used In EDA

What Is Exploratory Data Analysis In Python?

Exploratory Data Analysis (EDA) is the first step in your data analysis process. The approach was promoted by John Tukey in the 1970s. In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. As the name suggests, it is the step in which we explore the data set.

For example, suppose you are planning a trip to location “X”. Before making a decision, you would:

  • Explore what the location offers (sights, waterfalls, trekking, beaches, restaurants) on Google, Instagram, Facebook, and other social websites.
  • Calculate whether it is in your budget or not.
  • Check how much time is needed to cover all the places.
  • Decide on the mode of travel.

Similarly, when you are trying to build a machine learning model you need to be pretty sure whether your data is making sense or not. The main aim of exploratory data analysis is to obtain confidence in your data to an extent where you’re ready to engage a machine learning algorithm.

Need For Exploratory Data Analysis

Exploratory Data Analysis is a crucial step before you jump to machine learning or modeling of your data. It tells you whether the selected features are good enough to model, whether all the features are required, and whether there are any correlations, based on which you can either go back to the data pre-processing step or move on to modeling.

Once Exploratory Data Analysis is complete and insights are drawn, the resulting features can be used for supervised and unsupervised machine learning modeling.

In every machine learning workflow, the last step is reporting, or providing the insights to the stakeholders. As a data scientist you can explain every bit of code, but you need to keep the audience in mind. By completing the Exploratory Data Analysis you will have many plots, heat maps, frequency distributions, graphs, and correlation matrices, along with the hypotheses, from which anyone can understand what your data is all about and what insights you got from exploring your data set.

There is a saying: “A picture is worth a thousand words”.

I want to modify it for data scientists as “A plot is worth a thousand rows”.

In our Trip Example, we do all the exploration of the selected place based on which we will get the confidence to plan the trip and even share with our friends the insights we got regarding the place so that they can also join.

What Are The Steps In Exploratory Data Analysis In Python?

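There are many steps for conducting exploratory data analysis. I want to discuss the few steps below using the Boston data set, which can be imported with from sklearn.datasets import load_boston. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs an older version.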

  • Description of data
  • Handling missing data
  • Handling outliers
  • Understanding relationships and new insights through plots

a) Description of data:

We need to know the different kinds of data and other statistics of our data before we can move on to the other steps. A good place to start is the describe() function in pandas. Applying describe() to a DataFrame generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

The result’s index will include count, mean, std, min, and max, as well as the lower, 50th, and upper percentiles. By default, the lower percentile is 25 and the upper percentile is 75. The 50th percentile is the same as the median.

Loading the Dataset:

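import pandas as pd
# requires scikit-learn < 1.2, where load_boston is still available
from sklearn.datasets import load_boston

boston = load_boston()
x = boston.data                  # feature matrix
y = boston.target                # target values
columns = boston.feature_names   # names of the feature columns

# build a DataFrame with named feature columns
boston_df = pd.DataFrame(boston.data, columns=columns)
# summary statistics for every column
boston_df.describe()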
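
Where load_boston is not available, the shape of the describe() output can still be checked on a small stand-in frame. The sketch below is only an illustration: the column names are borrowed from the Boston data set, but the values are made up for this example.

```python
import pandas as pd

# Hypothetical stand-in values, NOT the real Boston data.
toy_df = pd.DataFrame({
    "AGE": [65.2, 78.9, 61.1, 45.8, 54.2],
    "TAX": [296.0, 242.0, 242.0, 222.0, 222.0],
})

# One row per statistic, one column per feature.
summary = toy_df.describe()
print(summary)

# The 50% row is the median of each column.
print(summary.loc["50%", "AGE"], toy_df["AGE"].median())
```

The rows of the result follow the order described above: count, mean, std, min, the three percentiles, and max.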

b) Handling missing data:

Real-world data are rarely clean and homogeneous. Values can go missing during data extraction or collection for several reasons. Missing values need to be handled carefully because they reduce the quality of our performance metrics. They can also lead to wrong predictions or classifications and can introduce a high bias into the model being used. There are several options for handling missing values. However, the choice of what should be done is largely dependent on the nature of our data and the missing values. Below are some of the techniques:

  • Drop NULL or missing values
  • Fill Missing Values
  • Predict Missing values with an ML Algorithm

Drop NULL or missing values:

This is the fastest and easiest way to handle missing values. However, it is not generally advised. It works by deleting every observation in which any of the variables is missing, which reduces the sample size and can therefore reduce the quality of our model.

A null check on the Boston data set indicates that there are no null values in it.
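
A null check and a row drop of this kind can be sketched as follows. This is a minimal illustration on a made-up frame (the values and the two gaps are assumptions for the example, not Boston data), using the pandas methods isnull() and dropna().

```python
import numpy as np
import pandas as pd

# Hypothetical values with two gaps; NOT the real Boston data.
toy_df = pd.DataFrame({
    "AGE": [65.2, np.nan, 61.1, 45.8],
    "TAX": [296.0, 242.0, np.nan, 222.0],
})

# Count the missing entries in each column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value.
complete_rows = toy_df.dropna()
print(complete_rows.shape)
```

Two of the four rows are lost here even though only two cells are missing, which is why dropping rows is best kept for cases where few observations are affected.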

Fill Missing Values:

This is the most common method of handling missing values. Missing values are replaced with a summary statistic, like the mean, median or mode, of the feature the missing value belongs to. Let’s suppose we have a missing value of AGE in the Boston data set. It can then be filled with a chosen value such as 30.
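
A minimal sketch of such a fill, assuming a made-up AGE column with one gap (the values are illustrative only). It shows both the fixed value 30 from the text and a median-based fill using pandas fillna().

```python
import numpy as np
import pandas as pd

# Hypothetical AGE values with one gap; NOT the real Boston data.
toy_df = pd.DataFrame({"AGE": [65.2, np.nan, 61.1, 45.8]})

# Option 1: fill the gap with a fixed value, as in the text.
filled_constant = toy_df["AGE"].fillna(30)

# Option 2: fill the gap with the median of the observed values.
filled_median = toy_df["AGE"].fillna(toy_df["AGE"].median())

print(filled_constant.tolist())
print(filled_median.tolist())
```

The median is usually less sensitive to extreme values than the mean, which can matter when the feature also contains outliers.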

Predict Missing values with an ML Algorithm:

This is often one of the more accurate methods for handling missing data, at the cost of extra modeling work. Depending on the type of the data that is missing, one can use either a regression model (for numeric values) or a classification model (for categorical values) to predict the missing data.
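
As a rough sketch of the regression case, the example below fits a straight line with NumPy's polyfit on the rows where AGE is known and uses it to predict the missing AGE from DIS. The data are made up and deliberately follow an exact line, and a plain least-squares fit is used here only as a stand-in for whatever regression or classification model suits the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical data where AGE = 100 - 10 * DIS; NOT the real Boston data.
toy_df = pd.DataFrame({
    "DIS": [1.0, 2.0, 3.0, 4.0, 5.0],
    "AGE": [90.0, 80.0, np.nan, 60.0, 50.0],
})

missing = toy_df["AGE"].isnull()
known = toy_df[~missing]

# Fit AGE ~ slope * DIS + intercept on the complete rows only.
slope, intercept = np.polyfit(known["DIS"], known["AGE"], deg=1)

# Predict AGE for the rows where it is missing.
imputed_df = toy_df.copy()
imputed_df.loc[missing, "AGE"] = slope * toy_df.loc[missing, "DIS"] + intercept
print(imputed_df)
```

On real data the relationship will not be exact, so the quality of the imputed values depends on how well the chosen model predicts the feature.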

c) Handling outliers:

An outlier is a data point that lies apart from the rest of the data. Outliers can result from a mistake during data collection, or they can simply be an indication of variance in your data. Some of the methods for detecting and handling outliers are:

  • BoxPlot
  • Scatterplot
  • Z-score
  • IQR (Interquartile Range)

BoxPlot:

A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of the box to show the range of the data. Outlier points are those past the end of the whiskers. Boxplots show robust measures of location and spread as well as providing information about symmetry and outliers.

import seaborn as sns 
sns.boxplot(x=boston_df['DIS'])

Scatterplot:

A scatter plot is a diagram that uses Cartesian coordinates to display values for two variables for a set of data. The data are displayed as a collection of points, with the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. Points that lie far from the bulk of the data can be termed outliers.

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(boston_df['INDUS'] , boston_df['TAX'])
ax.set_xlabel('proportion of non-retail business acre per town')
ax.set_ylabel('full-value property-tax per $10000')
plt.show()

Z-score:

The Z-score is the signed number of standard deviations by which the value of an observation or data point lies above (positive) or below (negative) the mean of what is being observed or measured. While calculating the Z-score we center and re-scale the data and look for data points that are too far from zero. These data points are treated as outliers. In most cases a threshold of 3 is used, i.e. if the Z-score is greater than 3 or less than -3, that data point is identified as an outlier.

After dropping the rows whose absolute Z-score exceeds 3, the shape of the data set changes, which indicates that our dataset has some outliers.
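
The Z-score filter can be sketched with plain pandas as below. The frame is made up for illustration (nineteen ordinary TAX values and one extreme one), and the threshold of 3 follows the text.

```python
import pandas as pd

# Hypothetical values: 19 ordinary entries and one extreme; NOT real Boston data.
toy_df = pd.DataFrame({"TAX": [300.0] * 10 + [310.0] * 9 + [5000.0]})

# Center and re-scale each column (population standard deviation).
z_scores = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

# Keep only rows whose absolute Z-score is below 3 in every column.
keep = (z_scores.abs() < 3).all(axis=1)
filtered_df = toy_df[keep]

print(toy_df.shape, filtered_df.shape)
```

Note that a single extreme value also inflates the standard deviation, so in very small samples a Z-score can never reach 3 and this filter will not flag anything.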

IQR:

The interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles.

IQR = Q3 − Q1

Once we have the IQR, we can remove the outliers from our dataset by dropping the points that fall far outside the range between Q1 and Q3.

Understanding relationships and new insights through plots:

We can discover many relationships in our data by visualizing it. Let’s go through some techniques in order to see the insights.
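
A minimal sketch of an IQR-based filter follows. The cut-offs of 1.5 × IQR below Q1 and above Q3 are the conventional choice and are an assumption here, since the text does not state a multiplier. The DIS values are made up for the example.

```python
import pandas as pd

# Hypothetical values with one extreme entry; NOT the real Boston data.
toy_df = pd.DataFrame({"DIS": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0]})

q1 = toy_df.quantile(0.25)
q3 = toy_df.quantile(0.75)
iqr = q3 - q1  # IQR = Q3 - Q1, computed per column

# Conventional fences: 1.5 * IQR beyond the quartiles (assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is an outlier if any of its columns falls outside the fences.
is_outlier = ((toy_df < lower) | (toy_df > upper)).any(axis=1)
cleaned_df = toy_df[~is_outlier]

print(iqr["DIS"], cleaned_df.shape)
```

These are the same fences a box plot typically uses for its whiskers, so the points removed here are the ones a box plot would draw as individual dots.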

  • Histogram
  • HeatMaps

Histogram:

A histogram is a great tool for quickly assessing a probability distribution, and it is easy for almost any audience to interpret. Python offers a handful of different options for building and plotting histograms.
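
One of those options is Matplotlib's hist method, sketched below on a made-up AGE column. The values and the choice of five bins are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical AGE values; NOT the real Boston data.
toy_df = pd.DataFrame(
    {"AGE": [21.0, 35.5, 36.2, 45.8, 52.1, 54.2, 61.1, 65.2, 78.9, 94.3]}
)

fig, ax = plt.subplots(figsize=(8, 4))
# hist returns the count per bin and the bin edges along with the drawn bars.
counts, bin_edges, _ = ax.hist(toy_df["AGE"], bins=5)
ax.set_xlabel("AGE")
ax.set_ylabel("number of observations")

print(counts, bin_edges)
```

Calling plt.show() afterwards displays the figure in an interactive session. The number of bins changes how the distribution looks, so it is worth trying a few values.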

HeatMaps:

A heat map shows the distribution of a quantitative variable over all combinations of 2 categorical factors. If one of the 2 factors represents time, then the evolution of the variable can be easily viewed using the map. A gradient color scale is used to represent the values of the quantitative variable. In EDA, a heat map is commonly drawn over the correlation matrix of the features. The correlation between two random variables is a number that runs from -1 through 0 to +1, where these three values indicate a strong inverse relationship, no relationship, and a strong direct relationship, respectively.

The Tools Used In Exploratory Data Analysis

Plenty of tools, some of them open source, exist that automate the steps of predictive modeling, such as data cleaning and data visualization. Apart from programming, popular options include Excel, Tableau, QlikView, Weka and many more.
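
A correlation heat map can be sketched with pandas and Matplotlib as below. The three columns are made up so that TAX rises exactly with INDUS and DIS falls exactly as INDUS rises, which puts the extreme correlation values of +1 and -1 on display. The color map name and figure size are arbitrary choices for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical, perfectly linear columns; NOT the real Boston data.
toy_df = pd.DataFrame({
    "INDUS": [2.0, 4.0, 6.0, 8.0, 10.0],
    "TAX": [200.0, 250.0, 300.0, 350.0, 400.0],  # rises with INDUS
    "DIS": [9.0, 7.0, 5.0, 3.0, 1.0],            # falls as INDUS rises
})

# Pairwise Pearson correlation between all columns.
corr = toy_df.corr()
print(corr)

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color scale to the full correlation range of -1 to +1.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax, label="Pearson correlation")
```

Calling plt.show() afterwards displays the figure in an interactive session. Strongly colored off-diagonal cells point to feature pairs that move together and may be redundant for modeling.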

In programming, we can accomplish EDA using Python, R, or SAS. Some of the important packages in Python are:

  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Bokeh

Many data scientists are in a hurry to get to the machine learning stage, and some either skip the exploratory process entirely or do a very minimal job. This is a mistake with many implications, including generating inaccurate models, generating accurate models on the wrong data, not creating the right types of variables in data preparation, and using resources inefficiently because you realize only after generating models that the data is skewed, has outliers, has too many missing values, or contains inconsistent values.

In our trip example, without a prior exploration of the place you would face many problems with directions, cost, and travel, all of which exploration can reduce. The same applies to a machine learning problem and EDA. If you wish to check out more articles on the market’s most trending technologies like Artificial Intelligence, DevOps, Ethical Hacking, then you can refer to Edureka’s official site.

Do look out for other articles in this series which will explain the various other aspects of Python and Data Science.

1. Machine Learning Classifier in Python

2. Python Scikit-Learn Cheat Sheet

3. Machine Learning Tools

4. Python Libraries For Data Science And Machine Learning

5. Chatbot In Python

6. Python Collections

7. Python Modules

8. Python developer Skills

9. OOPs Interview Questions and Answers

10. Resume For A Python Developer

11. Web Scraping With Python

12. Snake Game With Python’s Turtle Module

13. Python Developer Salary

14. Principal Component Analysis

15. Python vs C++

16. Scrapy Tutorial

17. Python SciPy

18. Least Squares Regression Method

19. Jupyter Notebook Cheat Sheet

20. Python Basics

21. Python Pattern Programs

22. Generators in Python

23. Python Decorator

24. Python Spyder IDE

25. Mobile Applications Using Kivy In Python

26. Top 10 Best Books To Learn & Practice Python

27. Robot Framework With Python

28. Snake Game in Python using PyGame

29. Django Interview Questions and Answers

30. Top 10 Python Applications

31. Hash Tables and Hashmaps in Python

32. Python 3.8

33. Support Vector Machine

34. Python Tutorial

Originally published at https://www.edureka.co on July 29, 2019.

Data Science
Eda
Data Analysis
Data Analytics
Python