Soner Yıldırım

Summary

The web content discusses using Pandas for data visualization in exploratory data analysis, focusing on the Melbourne housing dataset.

Abstract

The article titled "Data Visualization with Pandas" emphasizes the versatility of Pandas, a popular data analysis and manipulation library, in creating basic yet insightful visualizations for exploratory data analysis. It demonstrates how to import and manipulate a subset of the Melbourne housing dataset from Kaggle, filtering outliers and selecting specific columns for analysis. Through examples such as scatter plots, histograms, and boxplots, the article illustrates the ease with which Pandas can reveal correlations, distributions, and outliers within the dataset. It also shows how to create bar plots to compare average house prices across different regions in Melbourne. The author concludes that while Pandas is not primarily a data visualization library, it is highly practical for generating fundamental plots quickly and efficiently, which can be especially useful during the initial stages of data exploration.

Opinions

  • Pandas is highly useful for creating basic plots, particularly for exploratory data analysis.
  • Data visualization is a critical component of data analysis, providing a more intuitive understanding of data compared to raw numbers.
  • Pandas' plotting capabilities are sufficient for basic visualizations, potentially eliminating the need for additional visualization libraries in certain scenarios.
  • The Melbourne housing dataset is used to exemplify the practicality of Pandas in real-world data analysis tasks.
  • The article suggests that combining Pandas with more specialized data visualization libraries can be beneficial, depending on the complexity of the visualizations required.

Data Visualization with Pandas

It is more than just plain numbers

Photo by Markus Winkler on Unsplash

Pandas is arguably the most popular data analysis and manipulation library. It makes it extremely easy to manipulate data in tabular form. Together, the various functions of Pandas constitute a powerful and versatile data analysis tool.

Data visualization is an essential part of exploratory data analysis. It is more effective than plain numbers at providing an overview or summary of data. Data visualizations help us understand the underlying structure within a dataset or explore the relationships among variables.

Pandas is not a data visualization library, but it is capable of creating basic plots. If you are just creating plots for exploratory data analysis, Pandas can be highly useful and practical: you do not need an additional data visualization library for such tasks.

In this article, we will create several plots using only Pandas. Our goal is to explore the Melbourne housing dataset available on Kaggle.

Let’s start by importing libraries and reading the dataset into a dataframe.

import numpy as np
import pandas as pd
# Read only the columns needed for the analysis
df = pd.read_csv("/content/melb_data.csv", usecols = ['Price', 'Landsize','Distance','Type','Regionname'])
# Drop price outliers, then keep a random sample of 1000 rows
df = df[df.Price < 3_000_000].sample(n=1000).reset_index(drop=True)
df.head()
(image by author)

I have only read a small part of the original dataset. The usecols parameter of the read_csv function allows for reading only the given columns of the csv file. I have also filtered out the outliers with regard to price by keeping only the houses that cost less than 3 million. Finally, a random sample of 1000 observations (i.e. rows) is selected using the sample function.
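The same three steps can be tried without downloading the dataset. The following is only a minimal sketch: the four-row CSV text is made up for illustration and stands in for melb_data.csv, and the random_state argument is an addition of mine that makes the random sample reproducible.

```python
import io
import pandas as pd

# Made-up miniature CSV standing in for melb_data.csv (illustration only)
csv_text = """Suburb,Price,Landsize,Distance,Type,Regionname
A,900000,400,5.0,h,Northern Metropolitan
B,3500000,900,3.0,h,Southern Metropolitan
C,600000,0,8.0,u,Western Metropolitan
D,1200000,550,10.0,t,Eastern Metropolitan
"""

cols = ["Price", "Landsize", "Distance", "Type", "Regionname"]

# usecols skips every column that is not listed (here: Suburb)
small = pd.read_csv(io.StringIO(csv_text), usecols=cols)

# Keep only the rows priced below 3 million
small = small[small.Price < 3_000_000]

# random_state fixes the seed so the same rows are drawn on every run
sample = small.sample(n=2, random_state=42).reset_index(drop=True)
print(sample)
```

Without random_state, each run draws a different sample, so the plots in this article will look slightly different on your machine.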

We can start by creating a scatter plot of the price and distance columns. A scatter plot is mainly used to check the correlation between two continuous variables.

We can use the plot function of Pandas to create many different types of visualizations. The plot type is specified with the kind parameter.

df.plot(x='Distance', y='Price', kind='scatter',
        figsize=(10,6),
        title='House Prices in Melbourne')
(image by author)

We do not observe a strong correlation between the price and distance. However, there is a slight negative correlation for the houses with lower prices.
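A visual impression of correlation can be backed up with a number. The sketch below uses the corr function of a Pandas series, which computes the Pearson correlation coefficient by default. The data is synthetic: the relationship between price and distance is an assumption chosen to mimic a weak negative trend, not the Melbourne data itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (assumption): price drops a little with distance, plus noise
distance = rng.uniform(1, 40, size=500)
price = 1_200_000 - 8_000 * distance + rng.normal(0, 300_000, size=500)
toy = pd.DataFrame({"Distance": distance, "Price": price})

# Pearson correlation: -1 is perfectly negative, 0 is none, +1 is perfectly positive
r = toy["Distance"].corr(toy["Price"])
print(f"Correlation between distance and price: {r:.2f}")
```

On the real dataframe, the equivalent call would be df['Distance'].corr(df['Price']).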

Another commonly used plot type in exploratory data analysis is the histogram. It divides the value range of a continuous variable into discrete bins and counts the number of observations (i.e. rows) in each bin. Thus, we get a structured overview of the distribution of the variable.

The following code generates a histogram of the price column.

df['Price'].plot(kind='hist', figsize=(10,6), title='Distribution of House Prices in Melbourne')
(image by author)

Most of the houses cost between 500K and 1 million. As you may have noticed, we apply the plot function to a series (df['Price']). Depending on the plot type, we can use the plot function on either a dataframe or a series.
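To see what a histogram counts, the binning can be reproduced by hand. This is a small sketch with made-up prices. It uses NumPy's histogram function to split the value range into equal-width bins and count the values in each. The choice of 4 bins is arbitrary; the hist plot of Pandas uses 10 bins unless the bins parameter says otherwise.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration
prices = pd.Series(
    [450_000, 620_000, 780_000, 910_000, 990_000, 1_400_000, 2_100_000],
    name="Price",
)

# Split the range [min, max] into 4 equal-width bins and count values per bin
counts, edges = np.histogram(prices, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>12,.0f} to {right:>12,.0f}: {n} houses")
```

Each printed row corresponds to one bar of the histogram, and the count is the height of that bar.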

Boxplots can demonstrate the distribution of a variable. They show how values are spread out by means of quartiles and outliers. We can use the boxplot function of Pandas as follows.

df.boxplot(column='Price', by='Type', figsize=(10,6))
(image by author)

This boxplot shows the distribution of the house prices. The “by” parameter groups the data points by the given column. We pass the type column to the by parameter so we can see the distribution separately for each type.

Houses of type “h” are generally more expensive than the others. The outliers (i.e. extreme values) are represented with dots. The height of each box is proportional to how spread out the values are, so taller boxes indicate more variability.
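The numbers behind a boxplot can also be computed directly. The sketch below uses made-up prices and calculates the quartiles for each type. The box spans the first quartile to the third, and its height is the interquartile range (IQR). The 1.5 x IQR rule for flagging outliers is the default whisker setting in Matplotlib, which Pandas uses for drawing. The column names q1, q3, iqr, and the fence columns are my own labels.

```python
import pandas as pd

# Made-up prices for two house types (illustration only)
toy = pd.DataFrame({
    "Type": ["h"] * 6 + ["u"] * 5,
    "Price": [700_000, 900_000, 1_000_000, 1_100_000, 1_300_000, 2_900_000,
              350_000, 450_000, 500_000, 550_000, 650_000],
})

# Quartiles per type: one row per type, one column per quantile
q = toy.groupby("Type")["Price"].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ["q1", "median", "q3"]

# Box height is the interquartile range
q["iqr"] = q["q3"] - q["q1"]

# Points beyond 1.5 * IQR from the box are drawn as outlier dots
q["lower_fence"] = q["q1"] - 1.5 * q["iqr"]
q["upper_fence"] = q["q3"] + 1.5 * q["iqr"]
print(q)

# Look up each row's fences by its type and keep the rows outside them
lower = toy["Type"].map(q["lower_fence"])
upper = toy["Type"].map(q["upper_fence"])
outliers = toy[(toy["Price"] < lower) | (toy["Price"] > upper)]
print(outliers)
```

In this toy data, the "h" group has the larger IQR, which corresponds to the taller box, and its most expensive house falls outside the upper fence.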

One advantage of using Pandas for creating visualizations is that we can chain data analysis functions and plotting functions, which simplifies the task. For instance, the groupby and plot.bar functions can be used to create a bar plot of the average house prices in different regions.

We first group the prices by region name and calculate the average. Then, the plot.bar function is applied to the result.

(
    df[['Regionname', 'Price']]
    .groupby('Regionname')
    .mean()
    .sort_values(by='Price', ascending=False)
    .plot.bar(figsize=(10,6), rot=45,
              title='Average house prices in different regions')
)
(image by author)

The sort_values function can be used to sort the results in either ascending or descending order to make the plot look better. The most expensive houses are located in the southern metropolitan region.
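It can help to inspect the intermediate result of such a chain before plotting it. The sketch below runs the grouping and sorting steps on a few made-up rows. The prices are invented for illustration and do not reflect the real averages.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Regionname": ["Southern Metropolitan", "Southern Metropolitan",
                   "Northern Metropolitan", "Northern Metropolitan",
                   "Western Metropolitan"],
    "Price": [1_500_000, 1_300_000, 900_000, 700_000, 850_000],
})

# Average price per region, most expensive region first
avg_price = (
    toy.groupby("Regionname")["Price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_price)
```

The result is a series indexed by region name, so calling avg_price.plot.bar(rot=45) on it would draw one bar per region in the sorted order.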

Conclusion

We have seen how Pandas can be used as a data visualization tool. It is no match for dedicated data visualization libraries such as Seaborn and Matplotlib. However, Pandas offers a more practical way of creating basic plots during the exploratory data analysis process.

You can always use Pandas and a data visualization library together. There is no harm in that. But, in some cases, you can get the work done more easily and quickly with Pandas.

Thank you for reading. Please let me know if you have any feedback.
