# Data Visualization with Pandas

It is more than just plain numbers

Pandas is arguably the most popular data analysis and manipulation library for Python. It makes it easy to manipulate data in tabular form, and its many functions constitute a powerful and versatile data analysis tool.

Data visualization is an essential part of exploratory data analysis. It is more effective than plain numbers at providing an overview or summary of data. Data visualizations help us understand the underlying structure within a dataset or explore the relationships among variables.

Pandas is not a data visualization library, but it is capable of creating basic plots. If you are just creating plots for exploratory data analysis, Pandas is often practical enough that you do not have to call a separate data visualization library directly. (Pandas draws its plots with Matplotlib under the hood, so Matplotlib still needs to be installed.)

In this article, we will create several plots using only Pandas. Our goal is to explore the Melbourne housing dataset available on Kaggle.

Let’s start by importing libraries and reading the dataset into a dataframe.

```
import numpy as np
import pandas as pd

df = pd.read_csv(
    "/content/melb_data.csv",
    usecols=['Price', 'Landsize', 'Distance', 'Type', 'Regionname']
)

df = df[df.Price < 3_000_000].sample(n=1000).reset_index(drop=True)

df.head()
```

I have read only a small part of the original dataset. The usecols parameter of the read_csv function reads only the given columns of the CSV file. I have also filtered out the outliers with regard to price by keeping only the houses that cost less than 3 million. Finally, a random sample of 1000 observations (i.e. rows) is selected using the sample function.
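The three steps described above can be tried without the Kaggle file. The following is a minimal sketch that assumes a tiny made-up CSV (the rows and the Suburb column are placeholders, not real Melbourne data). The random_state argument is also an addition that is not in the original code. It only makes the random sample repeatable.

```python
import io

import pandas as pd

# Made-up CSV text standing in for melb_data.csv (values are placeholders).
csv_text = """Suburb,Price,Distance,Type
A,500000,2.5,h
B,3500000,4.0,h
C,750000,10.2,u
D,1200000,7.8,t
E,900000,13.0,h
"""

# usecols: only the listed columns are parsed, so 'Suburb' is skipped.
small = pd.read_csv(io.StringIO(csv_text), usecols=["Price", "Distance", "Type"])

# A boolean mask keeps the rows under the price cap (the 3.5M row is dropped).
small = small[small.Price < 3_000_000]

# random_state (an assumption, not in the article) makes the sample repeatable.
sample = small.sample(n=3, random_state=42).reset_index(drop=True)
print(sample)
```

Without random_state, each run draws a different sample, so the plots in this article will look slightly different every time the code is executed.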

We can start with creating a scatter plot of the price and distance columns. A scatter plot is mainly used to check the correlation between two continuous variables.

We can use the plot function of Pandas to create many different types of visualizations. The plot type is specified with the kind parameter.

```
df.plot(
    x='Distance', y='Price', kind='scatter',
    figsize=(10,6),
    title='House Prices in Melbourne'
)
```

We do not observe a strong correlation between the price and distance. However, there is a slight negative correlation for the houses with lower prices.
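A visual impression of correlation can be backed up with a number. The sketch below uses the corr method of a Pandas series, which returns the Pearson correlation coefficient by default. The values are made up for illustration and are not taken from the Melbourne dataset, and the 1 million threshold for "lower prices" is an arbitrary assumption.

```python
import pandas as pd

# Made-up values for illustration only, not the Melbourne data.
toy = pd.DataFrame({
    "Distance": [2.0, 5.0, 8.0, 12.0, 20.0],
    "Price": [1_400_000, 1_100_000, 900_000, 950_000, 600_000],
})

# Pearson correlation coefficient between the two columns (between -1 and 1).
r = toy["Distance"].corr(toy["Price"])

# The same measure restricted to cheaper houses (threshold chosen arbitrarily).
cheap = toy[toy.Price < 1_000_000]
r_low = cheap["Distance"].corr(cheap["Price"])

print(round(r, 2), round(r_low, 2))
```

A value close to 0 indicates little linear relationship, while values near -1 or 1 indicate a strong one.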

Another commonly used plot type in exploratory data analysis is the histogram. It divides the value range of a continuous variable into discrete bins and counts the number of observations (i.e. rows) in each bin. Thus, we get a structured overview of the distribution of the variable.

The following code generates a histogram of the price column.

`df['Price'].plot(kind='hist', figsize=(10,6), title='Distribution of House Prices in Melbourne')`

Most of the houses cost between 500K and 1 million. As you may have noticed, we apply the plot function to a series (df['Price']). Depending on the plot type, we can use the plot function with either a dataframe or a series.
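To see what a histogram counts, the binning can be reproduced as a table. This is a rough sketch with made-up prices. It uses the cut function of Pandas to split the value range into equal-width bins and then counts the rows in each one.

```python
import pandas as pd

# Made-up prices for illustration only.
prices = pd.Series(
    [400_000, 550_000, 620_000, 700_000, 880_000, 950_000, 1_300_000, 2_100_000]
)

# Split the value range into 4 equal-width bins and count the rows per bin.
counts = pd.cut(prices, bins=4).value_counts().sort_index()
print(counts)
```

The histogram plot accepts a bins parameter as well (10 bins by default), for example df['Price'].plot(kind='hist', bins=20), which is handy when the default bins hide too much detail.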

Boxplots can demonstrate the distribution of a variable. They show how values are spread out by means of quartiles and outliers. We can use the boxplot function of Pandas as follows.

`df.boxplot(column='Price', by='Type', figsize=(10,6))`

This boxplot represents the distribution of the house prices. The “by” parameter groups the data points by the given column. We pass the type column to the by parameter so we can see the distribution separately for each type.

The houses with “h” type are more expensive than others in general. The outliers (i.e. extreme values) are represented with dots. The height of each box is proportional to how much the values are spread out. Thus, taller boxes indicate more variance.
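The numbers behind the boxes can be computed directly. The sketch below uses made-up prices (in thousands) for two types and calculates the quartiles per group. The height of a box corresponds to the interquartile range (IQR), which is the distance between the 25th and 75th percentiles.

```python
import pandas as pd

# Made-up prices in thousands, for illustration only.
toy = pd.DataFrame({
    "Type": ["h", "h", "h", "h", "h", "u", "u", "u", "u", "u"],
    "Price": [600, 800, 1000, 1200, 1400, 400, 450, 500, 550, 600],
})

# 25th, 50th and 75th percentiles of the price for each type.
q = toy.groupby("Type")["Price"].quantile([0.25, 0.5, 0.75]).unstack()

# The interquartile range is the height of the box in a boxplot.
q["IQR"] = q[0.75] - q[0.25]
print(q)
```

Pandas delegates the drawing to Matplotlib, whose default setting extends the whiskers up to 1.5 times the IQR beyond the box. Points farther away than that are the ones drawn as individual dots.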

One advantage of using Pandas for creating visualizations is that we can chain data analysis functions and plotting functions, which simplifies the task. For instance, the groupby and plot.bar functions can be used to create a bar plot of the average house prices in different regions.

We first group the prices by region name and calculate the average. Then, the plot.bar function is applied to the result.

`df[['Regionname','Price']].groupby('Regionname').mean().sort_values(by='Price', ascending=False).plot.bar(figsize=(10,6), rot=45, title='Average house prices in different regions')`

The sort_values function can be used to sort the results in either ascending or descending order to make the plot look better. The most expensive houses are located in the southern metropolitan region.
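It can help to look at the intermediate result before it is plotted. The sketch below uses made-up region names and prices (in thousands) and stops the chain right before the plotting step, so the values that would become the bar heights are visible.

```python
import pandas as pd

# Made-up regions and prices in thousands, for illustration only.
toy = pd.DataFrame({
    "Regionname": ["North", "North", "South", "South", "East"],
    "Price": [700, 900, 1500, 1300, 1000],
})

# Group by region, average the prices, and sort from highest to lowest.
avg = (
    toy.groupby("Regionname")["Price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg)
```

Calling avg.plot.bar() on this result would draw one bar per region in the sorted order. Wrapping the chain in parentheses, as done here, is one way to keep a long chain readable across several lines.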

## Conclusion

We have seen how Pandas can be used as a data visualization tool. It is far from matching dedicated data visualization libraries such as Seaborn and Matplotlib. However, Pandas offers a more practical way of creating basic plots during the exploratory data analysis process.

You can always use Pandas and a data visualization library together. There is no harm in that. But, in some cases, you can get the work done more easily and quickly with Pandas.
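As one possible way of combining the two, the plot function accepts an ax parameter, so Pandas can draw onto a Matplotlib Axes that is then customized further. The following is a minimal sketch with made-up data. The axis label, the mean line, and the non-interactive Agg backend are illustrative choices, not part of the original analysis.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only (prices in millions).
toy = pd.DataFrame({
    "Distance": [2.0, 5.0, 8.0, 12.0],
    "Price": [1.4, 1.1, 0.9, 0.95],
})

# Create the figure with Matplotlib, then let Pandas draw onto its Axes.
fig, ax = plt.subplots(figsize=(10, 6))
returned = toy.plot(x="Distance", y="Price", kind="scatter", ax=ax)

# Continue customizing with plain Matplotlib calls.
ax.set_ylabel("Price (millions)")
ax.axhline(toy["Price"].mean(), linestyle="--")  # horizontal line at the mean price
```

The object returned by the plot function is the same Matplotlib Axes, so any Matplotlib method can be applied to a plot created with Pandas.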

Thank you for reading. Please let me know if you have any feedback.