Summary

The website content discusses Exploratory Data Analysis (EDA) using Python, emphasizing the importance of EDA in data pre-processing and showcasing Python libraries such as Sweetviz and DTale for automated EDA and visualization.

Abstract

Exploratory Data Analysis (EDA) is a critical step in understanding and preparing data for statistical modeling or machine learning. The content outlines the benefits of EDA, which include cleaning data, understanding variable relationships, and performing various tests and visualizations. It highlights the efficiency of EDA by using Python libraries like Pandas-Profiling, Sweetviz, Autoviz, and DTale, which automate the process and provide high-density visualizations, comparative analysis, and detailed reports. The Sweetviz library is particularly noted for its ability to generate comprehensive reports that explore datasets and compare them, while DTale offers a user-friendly interface with custom filters, correlation charts, and code export features. The article also references a dataset named "Churn Prediction" available on Kaggle, which can be analyzed using these tools to perform an EDA.

Opinions

The author suggests that spending time on EDA and data visualization is important but can be optimized using auto-visualization tools.
The use of Sweetviz and DTale libraries is recommended for their ability to automate EDA tasks and generate insightful reports.
There is an emphasis on the importance of data quality checks, statistical tests, and quantitative tests as part of the EDA process.
The author implies that EDA is not just a mechanical process but requires an exploratory mindset, quoting Tukey's view on EDA as an attitude and a state of flexibility.

Exploratory Data Analysis (EDA) using Python

“The field of exploratory data analysis was established with Tukey’s 1977 now-classic book Exploratory Data Analysis [Tukey-1977].”

Exploratory Data Analysis (EDA) is an approach for extracting the information contained in data and summarizing its main characteristics. EDA involves looking at and describing the data set from different angles and then summarizing it.

Today, this data pre-processing step is essential before starting statistical modeling or machine learning, because it helps ensure that the data used is correct and effective.

Benefits of EDA:

1) It helps to clean the garbage out of the dataset.
2) It helps users to understand the relationships between the variables.

People spend a lot of time on exploratory data analysis and data visualization, but that time can be reduced with auto-visualization tools such as Pandas-Profiling, Sweetviz, Autoviz, and D-Tale.

The whole EDA process involves several steps, including data quality checks, statistical tests, quantitative tests, and visualizing the data with different plots.

Data quality check: This can be done with pandas functions such as describe() and info().
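As a minimal sketch of such a data quality check, the snippet below runs info() and describe() and then counts missing values and duplicated rows. The small DataFrame is made up for illustration and only stands in for the real dataset; its column names and values are assumptions.

```python
import pandas as pd

# Made-up sample standing in for the real dataset (illustrative values only).
df = pd.DataFrame({
    "Age": [42, 41, None, 39, 43],
    "Balance": [0.0, 83807.86, 159660.80, 0.0, 125510.82],
    "Geography": ["France", "Spain", "France", None, "Spain"],
    "Exited": [1, 0, 1, 0, 0],
})

# info() prints each column's dtype and non-null count.
df.info()

# describe() summarizes the numeric columns (count, mean, std, min, quartiles, max).
summary = df.describe()
print(summary)

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())
print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Columns with many missing values or an unexpected dtype are usually the first things to investigate before going further.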

Statistical test: Pearson correlation, Spearman correlation, Kendall's tau, etc. can be computed using a stats library such as SciPy's scipy.stats module.
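The three correlation measures named above can be sketched with scipy.stats as follows. This assumes SciPy is the stats library in question; the two columns and their values are made up for illustration.

```python
import pandas as pd
from scipy import stats

# Made-up numeric columns (illustrative values only).
df = pd.DataFrame({
    "Age": [25, 32, 41, 47, 53, 60],
    "Balance": [1000.0, 1500.0, 1400.0, 2600.0, 3000.0, 4200.0],
})

# Each function returns the statistic together with a p-value.
pearson_r, pearson_p = stats.pearsonr(df["Age"], df["Balance"])
spearman_rho, spearman_p = stats.spearmanr(df["Age"], df["Balance"])
kendall_tau, kendall_p = stats.kendalltau(df["Age"], df["Balance"])

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
print(f"Kendall tau  = {kendall_tau:.3f} (p = {kendall_p:.3f})")
```

Pearson measures linear association, while Spearman and Kendall work on ranks and so are less sensitive to outliers and non-linear but monotonic relationships.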

Quantitative test: Find the spread of numerical features and the counts of categorical features using the pandas library.
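One possible way to compute these quantities with pandas is sketched below. The choice of spread measures (range, standard deviation, interquartile range) is an assumption, and the data is made up for illustration.

```python
import pandas as pd

# Made-up sample (illustrative values only).
df = pd.DataFrame({
    "Balance": [0.0, 100.0, 200.0, 300.0, 400.0],
    "Geography": ["France", "Spain", "France", "Germany", "France"],
})

# Spread of the numerical features.
numeric = df.select_dtypes(include="number")
spread = pd.DataFrame({
    "min": numeric.min(),
    "max": numeric.max(),
    "range": numeric.max() - numeric.min(),
    "std": numeric.std(),
    "iqr": numeric.quantile(0.75) - numeric.quantile(0.25),
})
print(spread)

# Counts of a categorical feature.
geography_counts = df["Geography"].value_counts()
print(geography_counts)
```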

Visualization: Bar plots, histograms, pie charts, scatter plots, etc. are used.
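A minimal sketch of three of these plot types, assuming matplotlib as the plotting library (the article does not name one). The data is made up, and the figure is written to a temporary file so the script also runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample (illustrative values only).
df = pd.DataFrame({
    "Age": [25, 32, 41, 47, 53, 60],
    "Balance": [1000.0, 1500.0, 1400.0, 2600.0, 3000.0, 4200.0],
    "Geography": ["France", "Spain", "France", "Germany", "France", "Spain"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram of a numerical feature.
axes[0].hist(df["Age"], bins=5)
axes[0].set_title("Age (histogram)")

# Bar plot of category counts.
counts = df["Geography"].value_counts()
axes[1].bar(list(counts.index), list(counts.values))
axes[1].set_title("Geography (bar plot)")

# Scatter plot of two numerical features.
axes[2].scatter(df["Age"], df["Balance"])
axes[2].set_title("Age vs. Balance (scatter)")

fig.tight_layout()
out_path = os.path.join(tempfile.gettempdir(), "eda_plots.png")
fig.savefig(out_path)
plt.close(fig)
```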

But the libraries below can do all of these tasks automatically in just a few lines of code.

Sweetviz Library:

Sweetviz is an open-source Python auto-visualization library that generates a report exploring the data with the help of high-density plots. It not only automates EDA but can also be used to compare datasets and draw inferences from them. Two datasets can be compared by treating one as the training set and the other as the test set.

It generates a report containing:

An overview of the dataset, variable properties, categorical associations, numerical associations, and the largest, smallest, and most frequent values.

Dataset Name: Churn Prediction

Link: https://www.kaggle.com/shubh0799/churn-modelling

#Installing necessary packages

!pip install sweetviz

import pandas as pd

import sweetviz as sv

#EDA using Sweetviz

sweet_report = sv.analyze(pd.read_csv("/content/Churn_Modelling.csv"))

#Saving results to HTML file

sweet_report.show_html('sweet_report.html')
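The train/test comparison that Sweetviz supports could be sketched as follows. The helper names split_train_test and build_comparison_report are hypothetical, the sample data is made up, and Sweetviz is imported inside the function so that the split itself can run even where Sweetviz is not installed.

```python
import pandas as pd


def split_train_test(df, train_fraction=0.8, seed=0):
    """Randomly split a DataFrame into train and test parts (hypothetical helper)."""
    train = df.sample(frac=train_fraction, random_state=seed)
    test = df.drop(train.index)
    return train, test


def build_comparison_report(train, test, target=None):
    """Build a Sweetviz report comparing two DataFrames (hypothetical helper)."""
    import sweetviz as sv  # imported lazily; requires `pip install sweetviz`
    return sv.compare([train, "Train"], [test, "Test"], target_feat=target)


# Made-up sample standing in for the real dataset (illustrative values only).
sample = pd.DataFrame({"Age": range(20, 30), "Exited": [0, 1] * 5})
train, test = split_train_test(sample)
print(len(train), len(test))
```

With Sweetviz installed, calling build_comparison_report(train, test, target="Exited").show_html("compare_report.html") would write the comparison report to an HTML file; the target column name "Exited" is an assumption here.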

D-Tale Library:

D-Tale is an open-source Python auto-visualization library and one of the best tools for automated data visualization. It gives you a detailed EDA of the data and can export the code behind every plot or analysis in the report.

It generates a report containing:

An overview of the dataset; custom filters; correlations, charts, and heatmaps; highlighting of data types, missing values, and ranges; and code export.

Dataset Name: Churn Prediction

Link: https://www.kaggle.com/shubh0799/churn-modelling

#Installing necessary packages

!pip install dtale

import dtale

import pandas as pd

dtale.show(pd.read_csv("/content/Churn_Modelling.csv"))

“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as the things we believe might be there.”