Mastering catplot() in Seaborn: Categorical data visualization guide.
The goal of this article is to introduce you to the most common categorical plots using Seaborn’s catplot() function.
While doing exploratory or explanatory data analysis, you will have to choose from a wide range of plot types. Choosing one that depicts the relationships in your data accurately can be tricky.
If your data involves categorical variables, such as survey responses, categorical plots are your best tools for visualizing and comparing its features. Fortunately, the data visualization library Seaborn bundles several types of categorical plots into a single function: catplot().
The Seaborn library offers several advantages over other plotting libraries:
1. It is very easy to use and requires less code.
2. It works really well with `pandas` data structures, which is just what you need as a data scientist.
3. It is built on top of Matplotlib, another vast and deep data visualization library.
BTW, my golden rule for data visualization is “Do it in Seaborn if you can do it in Seaborn”.
SB’s documentation (I will be abbreviating from now on) states that the catplot() function covers 8 different types of categorical plots. In this guide, I will cover the three most common ones: count plots, bar plots, and box plots.
Overview
I. Introduction
II. Setup
III. Seaborn Count Plot
1. Changing the order of categories
IV. Seaborn Bar Plot
1. Confidence intervals in a bar plot
2. Changing the orientation in bar plots
V. Seaborn Box Plot
1. Overall understanding
2. Working with outliers
3. Working with whiskers
VI. Conclusion
You can get the sample data and the notebook of the article on this GitHub repo.
Setup
If you do not have SB installed yet, you can install it with pip, along with the other libraries we will be using:
pip install numpy pandas seaborn matplotlib
If you are wondering why we don’t alias Seaborn as sb like a normal person, that's because the conventional alias sns comes from the initials of a fictional character, Samuel Norman Seaborn, from the TV show "The West Wing". What can you say? (shrugs).
For the dataset, we will be using the classic diamonds dataset. It contains the price and quality data of 54000 diamonds, and it is a great dataset for data visualization. One version of the data comes bundled with Seaborn. You can list the other bundled datasets with the sns.get_dataset_names() function (there are many). But in this guide, we will be using the full version, which I downloaded from Kaggle.
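Here is a minimal setup sketch. The path to the Kaggle file is not shown in this article, so I am assuming a small synthetic stand-in with the same `cut` and `price` columns; the names `rng` and `cuts` and all generated values are made up for illustration and are not the real diamonds data.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the diamonds data (assumption, NOT the Kaggle file).
# With a network connection, sns.load_dataset("diamonds") fetches the
# version that ships with Seaborn instead.
rng = np.random.default_rng(0)
cuts = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
diamonds = pd.DataFrame({
    "cut": rng.choice(cuts, size=500),
    "price": rng.gamma(shape=2.0, scale=2000.0, size=500).round(),
})

print(diamonds.head())
```

With the real file, you would replace the stand-in with a pd.read_csv() call pointing at wherever you saved the download.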
Seaborn count plot
As the name suggests, a count plot displays the number of observations in each category of your variable. Throughout this article, we will be using the catplot() function and changing its kind parameter to create different plots. For the count plot, we set the kind parameter to count and feed in the data using the data parameter. Let's start by exploring the diamond cut quality.
We start off with the catplot() function and use the x argument to specify the axis on which we want to show the categories. You can use y instead to make the chart horizontal. The count plot automatically counts the number of values in each category and displays them on the y-axis.
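A count plot along these lines might look like the sketch below. It assumes a synthetic stand-in for the diamonds data (made-up values, only the `cut` and `price` column names match the real dataset) so that it runs without any download.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import seaborn as sns

# Synthetic stand-in for the diamonds data (assumption, not the real file)
rng = np.random.default_rng(0)
cuts = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
diamonds = pd.DataFrame({
    "cut": rng.choice(cuts, size=500),
    "price": rng.gamma(shape=2.0, scale=2000.0, size=500).round(),
})

# Vertical count plot: categories on the x-axis, counts on the y-axis
g = sns.catplot(x="cut", data=diamonds, kind="count")

# Horizontal version: pass the same column to y instead of x
g_horizontal = sns.catplot(y="cut", data=diamonds, kind="count")
```

catplot() returns a FacetGrid, so you can keep customizing the figure through `g` afterwards.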
Changing the order of categories
In our plot, the quality of the cut is from best to worst. But let’s reverse the order:
It is best to create a list of categories in the order you want and then pass it to the order parameter. This improves code readability.
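As a sketch of that idea, the snippet below builds the list first and then hands it to order. The data is again a synthetic stand-in (an assumption for illustration); `category_order` is the same name used in the box plot code later in this article, and I am assuming it runs from worst to best cut.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import seaborn as sns

# Synthetic stand-in for the diamonds data (assumption, not the real file)
rng = np.random.default_rng(0)
cuts = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
diamonds = pd.DataFrame({
    "cut": rng.choice(cuts, size=500),
    "price": rng.gamma(shape=2.0, scale=2000.0, size=500).round(),
})

# Desired order of the categories, from worst to best cut
category_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

g = sns.catplot(x="cut", data=diamonds, kind="count", order=category_order)

# Render once so the tick labels are populated, then read them back
g.ax.figure.canvas.draw()
labels = [tick.get_text() for tick in g.ax.get_xticklabels()]
print(labels)
```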
Seaborn bar plot
Another popular choice for plotting categorical data is a bar plot. In the count plot example, our plot only needed a single variable. In the bar plot, we often use one categorical variable and one quantitative. Let’s see how the prices of different diamond cuts compare to each other.
To create a bar plot, we pass the values for the x-axis and y-axis separately and set the kind parameter to bar:
The height of each bar represents the mean value in each category. In our plot, each bar shows the mean price of the diamonds in that category. You may be as surprised as I was to see that low-quality cuts also have significantly high prices. The lowest-quality diamonds are, on average, even more expensive than ideal diamonds. This surprising trend is worth exploring, but it is beyond the scope of this article.
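To make the "bar height equals the mean" point concrete, here is a small sketch that draws the bar plot and compares it with a plain pandas groupby. It assumes a synthetic stand-in for the diamonds data, so the surprising price pattern described above will not show up here; only the mechanics carry over.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import seaborn as sns

# Synthetic stand-in for the diamonds data (assumption, not the real file)
rng = np.random.default_rng(0)
cuts = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
diamonds = pd.DataFrame({
    "cut": rng.choice(cuts, size=500),
    "price": rng.gamma(shape=2.0, scale=2000.0, size=500).round(),
})

# One categorical and one quantitative variable; bars show the mean price
g = sns.catplot(x="cut", y="price", data=diamonds, kind="bar")

# The same numbers computed by hand with pandas
mean_prices = diamonds.groupby("cut")["price"].mean()
bar_heights = [patch.get_height() for patch in g.ax.patches]
print(mean_prices)
```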
Confidence Intervals in a Bar Plot
The black lines at the top of each bar represent 95% confidence intervals for the mean, which can be thought of as the uncertainty in our sample data. Simply put, each line spans the interval where you would expect the real mean price of all the diamonds (not just our 54000) in that category to fall. If you don’t know statistics, it is best to skip this part. You can turn off confidence intervals by setting the ci parameter to None (Seaborn 0.12 and later deprecate ci in favor of errorbar=None):
Seaborn box plot
Box plots can be a little difficult to understand at first, but they depict the distribution of data beautifully. It is best to start the explanation with an example. I am going to use one of the common built-in datasets in Seaborn:
This box plot shows the distribution of bill amounts in a sample restaurant per day. Let’s start by interpreting Thursday’s.
The edges of the blue box are the 25th and 75th percentiles of the distribution of all bills. This means that 75% of all the bills on Thursday were lower than 20 dollars, while 75% (counting from the top down) were higher than almost 13 dollars. The horizontal line in the box shows the median value of the distribution.
The dots above the whisker are called outliers. Outlier limits are calculated in three steps:
1. Find the interquartile range (IQR) by subtracting the 25th percentile from the 75th: IQR = 75% - 25%
2. Calculate the lower outlier limit by subtracting 1.5 times the IQR from the 25th percentile: 25% - 1.5*IQR
3. Calculate the upper outlier limit by adding 1.5 times the IQR to the 75th percentile: 75% + 1.5*IQR
Any values above the upper limit or below the lower limit become dots in a box plot.
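The three steps above can be sketched in a few lines of NumPy. The bill amounts are made-up numbers for illustration, not values from the restaurant dataset.

```python
import numpy as np

# Made-up bill amounts (illustrative only)
bills = np.array([8.0, 10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 22.0, 45.0])

# Step 1: interquartile range
q1, q3 = np.percentile(bills, [25, 75])
iqr = q3 - q1

# Steps 2 and 3: lower and upper outlier limits
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside the limits would be drawn as a dot
outliers = bills[(bills < lower_limit) | (bills > upper_limit)]
print(iqr, lower_limit, upper_limit, outliers)
```

Here only the 45-dollar bill falls outside the limits, so it is the single value that would show up as a dot.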
Now that you understand box plots a little better, let’s get back to shiny diamonds:
We create a box plot in the same way as any other plot. The key difference is that we set the kind parameter to box. This box plot shows the distribution of prices for each cut quality. As you can see, there are a lot of outliers in each category, and the distributions are highly skewed.
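A box plot call along these lines might look like the sketch below. It assumes a synthetic stand-in for the diamonds data (skewed made-up prices), and `category_order` is assumed to list the cuts from worst to best.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import seaborn as sns

# Synthetic stand-in for the diamonds data (assumption, not the real file)
rng = np.random.default_rng(0)
cuts = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
diamonds = pd.DataFrame({
    "cut": rng.choice(cuts, size=500),
    "price": rng.gamma(shape=2.0, scale=2000.0, size=500).round(),
})

category_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Same call as for the bar plot, only the kind changes
g = sns.catplot(x="cut", y="price", data=diamonds,
                kind="box", order=category_order)
```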
Box plots are very useful because they:
Show outliers, skewness, spread, and distribution in a single plot
Are great for comparing different groups
Outliers in Box Plot
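It is also possible to turn off the outliers in a box plot by setting the sym parameter to an empty string. Recent Seaborn versions may not accept sym; Matplotlib's showfliers=False is a more portable way to get the same result: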
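Here is a sketch of hiding the outlier dots. Note that it swaps in Matplotlib's showfliers=False instead of the sym argument mentioned above, on the assumption that the extra keyword is passed through to Matplotlib's box plot routine. The data is a synthetic stand-in, not the real diamonds file.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import seaborn as sns

# Synthetic stand-in for the diamonds data (assumption, not the real file)
rng = np.random.default_rng(0)
cuts = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
diamonds = pd.DataFrame({
    "cut": rng.choice(cuts, size=500),
    "price": rng.gamma(shape=2.0, scale=2000.0, size=500).round(),
})

category_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Default box plot, outlier dots included
g_with = sns.catplot(x="cut", y="price", data=diamonds,
                     kind="box", order=category_order)

# Same plot without the outlier dots
g_without = sns.catplot(x="cut", y="price", data=diamonds,
                        kind="box", order=category_order,
                        showfliers=False)
```

Hiding the dots only changes what is drawn; the boxes and whiskers are computed exactly as before.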
By default, the outliers in a box plot are calculated using the method I introduced earlier. However, you can change this by passing a different value to the whis parameter:
sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='box',
            order=category_order,
            whis=2);  # Use 2 times the IQR to calculate the outlier limits
Working With Whiskers
Using different percentiles:
sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='box',
            order=category_order,
            whis=[5, 95]);  # Whiskers show the 5th and 95th percentiles
Or make the whiskers show the minimum and maximum values:
sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='box',
            order=category_order,
            whis=[0, 100]);  # Min and max values in the distribution
Wrapping Up
We have covered the three most common categorical plots. I did not include how to create subplots using the catplot() function, even though that flexibility is one of its advantages. I recently wrote another article about a similar function, relplot(), which is used to plot relational variables. I discussed how to create subplots in detail there, and the same techniques can be applied here.