Mastering catplot() in Seaborn: Categorical data visualization guide.

If you can do it in Seaborn, do it in Seaborn, #2

Introduction

The goal of this article is to introduce you to the most common categorical plots using Seaborn’s catplot() function.

While doing exploratory or explanatory data analysis, you will have to choose from a wide range of plot types. Choosing one that depicts the relationships in your data accurately can be tricky.

If you are working with data that involves categorical variables, such as survey responses, categorical plots are your best tools for visualizing and comparing different features of the data. Fortunately, the data visualization library Seaborn wraps several types of categorical plots in a single function: catplot().

The Seaborn library offers several advantages over other plotting libraries:

1. It is very easy to use and requires less code.

2. It works really well with `pandas` data structures, which is just what you need as a data scientist.

3. It is built on top of Matplotlib, another vast and deep data visualization library.

BTW, my golden rule for data visualization is “If you can do it in Seaborn, do it in Seaborn”.

SB’s (I will be abbreviating from now on) documentation states that the catplot() function covers 8 different types of categorical plots. In this guide, I will cover the three most common: count plots, bar plots, and box plots.

Overview

  I. Introduction

 II. Setup

III. Seaborn Count Plot
         1. Changing the order of categories

 IV. Seaborn Bar Plot
         1. Confidence intervals in a bar plot
         2. Changing the orientation in bar plots

  V. Seaborn Box Plot
         1. Overall understanding
         2. Working with outliers
         3. Working with whiskers

 VI. Conclusion

You can get the sample data and the notebook of the article on this GitHub repo.

Setup

If you do not have SB installed already, you can install it with pip, along with the other libraries we will be using:

pip install numpy pandas seaborn matplotlib

Seaborn is conventionally imported as sns. If you are wondering why we don’t alias it as sb like a normal person, that's because sns comes from the initials of Samuel Norman Seaborn, a fictional character from the TV show "The West Wing". What can you say? (shrugs).

For the dataset, we will be using the classic diamonds dataset. It contains the price and quality data of almost 54,000 diamonds. It is a great dataset for data visualization. One version of the data comes pre-loaded in Seaborn. You can list the other built-in datasets with the sns.get_dataset_names() function (there are many). But in this guide, we will be using the full version, which I downloaded from Kaggle.

import pandas as pd
import seaborn as sns
# Load sample data (the first CSV column is used as the index)
diamonds = pd.read_csv('data/diamonds.csv', index_col=0)

Basic Exploration

diamonds.head()
diamonds.info()
diamonds.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 1 to 53940
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.5+ MB

diamonds.shape

(53940, 10)

Seaborn Count Plot

As the name suggests, a count plot displays the number of observations in each category of your variable. Throughout this article, we will be using the catplot() function and changing its kind parameter to create different plots. For the count plot, we set the kind parameter to count and feed in the data using the data parameter. Let's start by exploring the diamond cut quality.

sns.catplot(x='cut', data=diamonds, kind='count');

We start off with the catplot() function and use the x argument to specify the axis on which we want to show the categories. You can use y instead to make the chart horizontal. The count plot automatically counts the number of values in each category and displays them on the y-axis.
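To see what the count plot is computing, the same numbers can be reproduced with pandas. The sketch below uses a tiny made-up DataFrame (toy is a hypothetical stand-in for diamonds) so that it runs on its own:

```python
import pandas as pd

# Hypothetical toy data standing in for the diamonds DataFrame
toy = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Ideal", "Good", "Ideal", "Premium"]
})

# Number of rows per category: these are the bar heights in a count plot
cut_counts = toy["cut"].value_counts()
print(cut_counts)
```

Running diamonds['cut'].value_counts() on the real data should give the heights of the bars in the plot above.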

Changing the order of categories

By default, the categories appear in the order Seaborn encounters them in the data. Let’s arrange them explicitly, from the worst cut to the best:

category_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

sns.catplot(x='cut', data=diamonds, kind='count', order=category_order);

It is best to create a list of categories in the order you want and then pass it to order. This improves code readability.
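If you would rather not pass order to every call, one alternative is to store the column as an ordered pandas categorical; Seaborn follows the category order of categorical columns when order is not given. This is a minimal sketch on made-up data (toy is a hypothetical stand-in for diamonds), not something the original notebook does:

```python
import pandas as pd

category_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Hypothetical toy data standing in for the diamonds DataFrame
toy = pd.DataFrame(
    {"cut": ["Ideal", "Fair", "Premium", "Good", "Very Good", "Ideal"]}
)

# Store the column as an ordered categorical so the order travels with the data
toy["cut"] = pd.Categorical(toy["cut"], categories=category_order, ordered=True)

print(list(toy["cut"].cat.categories))

# Sorting now follows quality order rather than alphabetical order
sorted_cuts = toy["cut"].sort_values().tolist()
print(sorted_cuts)
```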

Seaborn Bar Plot

Another popular choice for plotting categorical data is a bar plot. In the count plot example, our plot only needed a single variable. In a bar plot, we usually use one categorical variable and one quantitative variable. Let’s see how the prices of different diamond cuts compare to each other.

To create a bar plot, we pass the x-axis and y-axis variables separately and set the kind parameter to bar:

sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='bar',
            order=category_order);

The height of each bar represents the mean value in each category. In our plot, each bar shows the mean price of the diamonds in each cut category. You may be as surprised as I was to see that low-quality cuts also have high prices. The lowest quality diamonds are, on average, even more expensive than ideal ones. This surprising trend is worth exploring, but it is beyond the scope of this article.
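The bar heights are plain group means, so they can be checked with a pandas groupby. A small sketch on made-up numbers (toy and its prices are hypothetical, not taken from the diamonds data):

```python
import pandas as pd

# Hypothetical toy data with the same column names as diamonds
toy = pd.DataFrame({
    "cut": ["Fair", "Fair", "Ideal", "Ideal", "Ideal"],
    "price": [4000, 5000, 3000, 3500, 4000],
})

# Mean price per cut: this is what each bar's height represents
mean_price = toy.groupby("cut")["price"].mean()
print(mean_price)
```

The same call on the real DataFrame, diamonds.groupby('cut')['price'].mean(), is a quick way to read off the exact values behind the bars.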

Confidence Intervals in a Bar Plot

The black lines at the top of each bar represent 95% confidence intervals for the mean, which can be thought of as the uncertainty in our sample data. Simply put, each line spans the interval where you would expect the real mean price of all diamonds (not just these 54,000) in each category to fall. If you don’t know statistics, best to skip this part. You can turn off confidence intervals by setting the ci parameter to None (in Seaborn 0.12 and later, ci is deprecated in favor of errorbar, so use errorbar=None there):

sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='bar',
            order=category_order,
            ci=None);
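For the curious, Seaborn estimates these intervals by bootstrapping: it resamples the data with replacement many times (1000 by default, via n_boot) and looks at the spread of the resampled means. The following is a rough sketch of that idea on made-up prices, not Seaborn's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical sample of prices (made up, right-skewed like real prices)
prices = rng.lognormal(mean=8.0, sigma=0.8, size=500)

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw a resample of the same size, with replacement
    resample = rng.choice(prices, size=prices.size, replace=True)
    boot_means[i] = resample.mean()

# The middle 95% of the bootstrap means gives the confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={prices.mean():.0f}, 95% CI=({ci_low:.0f}, {ci_high:.0f})")
```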

Changing the Orientation In Bar Plots

When you have lots of categories/bars, or long category names, it is a good idea to change the orientation. Just swap the x and y arguments:

sns.catplot(x='price',
            y='cut',
            data=diamonds,
            kind='bar',
            order=category_order,
            ci=None);

Seaborn Box Plot

Box plots are visuals that can be a little difficult to understand but depict the distribution of data very beautifully. It is best to start the explanation with an example of a box plot. I am going to use one of the common built-in datasets in Seaborn:

tips = sns.load_dataset('tips')

sns.catplot(x='day', y='total_bill', data=tips, kind='box');

Overall Understanding

This box plot shows the distribution of bill amounts in a sample restaurant per day. Let’s start by interpreting Thursday’s box.

The edges of the blue box are the 25th and 75th percentiles of the distribution of all bills. This means that 75% of all the bills on Thursday were lower than about 20 dollars, while 75% were higher than about 13 dollars, so the middle 50% of the bills falls inside the box. The horizontal line in the box shows the median value of the distribution.

The dots above the whisker are called outliers. The limits beyond which a point counts as an outlier are calculated in three steps:

1. Find the interquartile range (IQR) by subtracting the 25th percentile from the 75th: IQR = 75th - 25th
2. Calculate the lower outlier limit by subtracting 1.5 times the IQR from the 25th percentile: 25th - 1.5*IQR
3. Calculate the upper outlier limit by adding 1.5 times the IQR to the 75th percentile: 75th + 1.5*IQR

Any values above the upper limit or below the lower limit become dots in a box plot.
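The three steps above can be worked through with NumPy. The bill amounts here are made up for illustration and are not taken from the tips dataset:

```python
import numpy as np

# Hypothetical bill amounts, with one suspiciously large value
bills = np.array([10, 12, 13, 14, 15, 16, 18, 20, 45])

# Step 1: interquartile range
q1, q3 = np.percentile(bills, [25, 75])
iqr = q3 - q1

# Steps 2 and 3: lower and upper outlier limits
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits would be drawn as dots
outliers = bills[(bills < lower_limit) | (bills > upper_limit)]
print(q1, q3, iqr, lower_limit, upper_limit, outliers)
```

Here the box would span 13 to 18, the limits come out at 5.5 and 25.5, and only the 45-dollar bill is flagged as an outlier.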

Now that you understand box plots a little better, let’s get back to shiny diamonds:

sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='box',
            order=category_order);

We create a box plot in the same way as any other plot. The key difference is that we set the kind parameter to box. This box plot shows the distribution of prices for each cut quality. As you can see, there are a lot of outliers in each category, and the distributions are highly skewed.

Box plots are very useful because they:

Show outliers, skewness, spread, and distribution in a single plot
Are great for comparing different groups

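Working With Outliers

It is also possible to hide the outliers in a box plot by setting the sym parameter to an empty string. The sym parameter is a Matplotlib boxplot option that Seaborn passes through; if your Seaborn version rejects it, showfliers=False has the same effect: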

sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='box',
            order=category_order,
            sym='');

The outliers in a box plot are by default calculated using the method I introduced earlier. However, you can change the IQR multiplier by passing a different value to the whis parameter:

sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='box',
            order=category_order,
            whis=2);   # Use 2 times the IQR to set the outlier limits

Working With Whiskers

You can also make the whiskers span specific percentiles by passing a pair of values to whis:

sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='box',
            order=category_order,
            whis=[5, 95]); # Whiskers show 5th and 95th percentiles

Or make the whiskers extend to the minimum and maximum values:

sns.catplot(x='cut',
            y='price',
            data=diamonds,
            kind='box',
            order=category_order,
            whis=[0, 100]);   # Min and max values in distribution
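To make the effect of whis concrete, here is a small sketch of how the limits change with each setting. whisker_limits is a hypothetical helper written for this article, not a Seaborn or Matplotlib function, and the bill amounts are made up:

```python
import numpy as np

def whisker_limits(values, whis=1.5):
    """Hypothetical helper: return the (lower, upper) limits for a whis setting."""
    values = np.asarray(values, dtype=float)
    if np.isscalar(whis):
        # A single number is a multiplier of the IQR
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        return float(q1 - whis * iqr), float(q3 + whis * iqr)
    # A pair of numbers is read as lower and upper percentiles
    low, high = np.percentile(values, whis)
    return float(low), float(high)

bills = [10, 12, 13, 14, 15, 16, 18, 20, 45]

limits_iqr2 = whisker_limits(bills, whis=2)
limits_pct = whisker_limits(bills, whis=[5, 95])
limits_minmax = whisker_limits(bills, whis=[0, 100])
print(limits_iqr2, limits_pct, limits_minmax)
```

In the actual plot, each whisker is drawn out to the most extreme data point that still lies inside these limits, and anything beyond them is shown as an outlier.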

Wrapping Up

We have covered the three most common categorical plots. I did not include how to create subplots using the catplot() function, even though that flexibility is one of its main advantages. I recently wrote another article about a similar function, relplot(), which is used to plot relational variables. I discussed how to create subplots in detail there, and the same techniques can be applied here.
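As a small taste of the subplot feature, here is a minimal sketch that passes a second categorical column to the col parameter. The data is randomly generated (toy is hypothetical and only borrows column names from diamonds), and the Agg backend line is only there so the snippet runs without a display; you can drop it in a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(seed=0)

# Hypothetical toy data that borrows column names from diamonds
toy = pd.DataFrame({
    "cut": rng.choice(["Fair", "Ideal"], size=200),
    "color": rng.choice(["D", "J"], size=200),
    "price": rng.lognormal(mean=8.0, sigma=0.5, size=200),
})

# One box plot per value of 'color', laid out side by side
g = sns.catplot(x="cut", y="price", col="color", data=toy, kind="box")
print(g.axes.shape)
```

catplot() returns a FacetGrid, and its axes attribute holds one Matplotlib axes per subplot.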
