avatarWei-Meng Lee

Summary

The provided content is a comprehensive guide on using the pivot_table() function in Pandas for Exploratory Data Analysis (EDA), demonstrating its application through examples analyzing world population and Titanic datasets.

Abstract

The article on the undefined website delves into the utility of pivot tables within the Pandas library for performing Exploratory Data Analysis (EDA). It begins by explaining the concept of pivot tables, which are tools for summarizing and aggregating data to facilitate comparisons and trend analysis. The author then illustrates the use of pivot_table() in Pandas with practical examples, starting with an analysis of world population data sourced from the Gapminder project. The examples cover various scenarios, such as finding mean values for countries, analyzing GDP and life expectancies, and grouping data by year and continent. The article also explores the Titanic dataset, examining survival rates in relation to gender, class of service, and port of embarkation. Throughout the article, the author emphasizes the flexibility of the pivot_table() function, showcasing how it can be combined with other functions and custom aggregation to perform sophisticated data analyses. The article concludes with a summary of the pivot_table() function's capabilities and an invitation for readers to share their own tips and tricks.

Opinions

  • The author believes that the pivot_table() function in Pandas is a powerful tool for EDA, comparable to the functionality of pivot tables in Excel.
  • It is suggested that using pivot_table() can simplify the process of data aggregation and analysis, making it more efficient than performing manual groupings.
  • The author values the versatility of the pivot_table() function, as demonstrated by its ability to handle different types of data and aggregation functions.
  • There is an appreciation for the ability to write custom functions for use within pivot_table(), allowing for tailored analyses.
  • The author encourages community engagement and knowledge sharing, inviting readers to contribute their insights on using pivot_table() for data analysis.
  • The article promotes the idea that understanding how to effectively use pivot_table() can enhance one's data analysis skills and contribute to more insightful EDA.

Performing Exploratory Data Analysis (EDA) using Pivot Tables in Pandas

Learn how to use the pivot_table() function in Pandas to aggregate your data

Photo by Andreas Chu on Unsplash

Most excel users are very familiar with the concept of a pivot table. A pivot table is a table of grouped values that aggregates all the individual items of a much bigger table. A pivot table provides a summary of discrete categories, such as sum, averages, as well as various statistics of interests. A pivot table serves as a very useful tool for you to explore and analyze your data, and makes it easy for you to perform comparisons and view trends.

In this article, I will walk you through pivot tables in Pandas and how you can make use of it to perform EDA.

Example 1 — Analyzing World Population Data

For the first part, I shall make use of the world population dataset to answer some of the questions pertaining to world population.

Source of Data: The data source for this part is from https://github.com/resbaz/r-novice-gapminder-files/blob/master/data/gapminder-FiveYearData.csv.

Important Note from the website (https://github.com/resbaz/r-novice-gapminder-files): “This dataset is not endorsed by the people at Gapminder. In particular, we removed relatively incomplete portions of the original data to make it cleaner for our pedagogical purposes. If you require the unfiltered data you should obtain it from Gapminder. Also note that this repository is not currently actively maintained. However, if you would like to improve upon the data herein, we will view serious pull requests favourably.”

Loading the data

Let’s load the data into a Pandas DataFrame:

import pandas as pd
data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
df = pd.read_csv(data_url)
df

The dataframe contains various data of countries spread across five continents — Africa, Americas, Asia, Europe and Oceania. It contains data on population, life expectancy, and GDP per capita, from 1952 to 2007:

Image by author

Finding the mean values for each country

To start off, let’s find the mean of the various statistics for each country using the pivot_table() function:

pd.pivot_table(
    df,
    index='country',
    aggfunc='mean'
)

The index parameter specifies the index to use for the result of the function. The aggfunc parameter specifies the function to apply on the numeric columns of the dataframe. The following figure shows the result and how the various parameters dictate the outcome:

Image by author

If you are already familiar with the groupby() function in Pandas, you can think of the pivot_table() function as equal to groupby()+ agg()+ unstack().

The default aggfunc value is mean (a Pandas function) if you do not specify its value. You can also supply a NumPy function such as np.mean, or any function that returns an aggregated value (you will see this in a later section of this article):

import numpy as np
pd.pivot_table(
    df,
    index='country',
    aggfunc=np.mean
)

Finding the mean GDP and mean, max, and min of life expectancies

From the previous result you see that it does not really make sense to calculate the mean of the year column. Also, you might want to know the minimum, maximum, and average life expectancies of each country. To do so, you can specify a dictionary for the aggfunc parameter and indicate what function to apply to which column:

import numpy as np
pd.pivot_table(
    df,
    index='country',
    aggfunc={
        'gdpPercap':np.mean, 
        'lifeExp':[np.mean,np.max,np.min]
    }
)

The above code snippet produces the following output:

Image by author

Observe that since we did not specify the pop and year columns in the dictionary, they will no longer appear in the result.

Finding the mean values of each countries for each year

The index parameters also accepts a list of columns that will result in a multi-index dataframe result. For example, I want to know the mean GDP, life expectancies, and population of each county for each year from 1952 to 2007. I can do it like this:

pd.pivot_table(
    df,
    index=['country','year'],
    aggfunc='mean'
)

The result is a multi-index dataframe with country and year as the index, and the rest of the numeric fields as columns :

Image by author

Finding the mean values of each year for each country

How about we group the result based on year, followed by country? Simple, in this change the order of the list items in the index parameter:

pd.pivot_table(
    df,
    index=['year','country'],
    aggfunc='mean'
)

The result now is grouped by year:

Image by author

Finding the mean values for each continent

If you want to find the mean values for each continent, simply set the index parameter to continent:

pd.pivot_table(
    df,
    index='continent'
)

You will now see the following result:

Image by author

Finding the population of each country

If you want to know the mean population for each country from 1952 to 2007, set the index to country and values to pop:

pd.pivot_table(
    df, 
    index='country', 
    values='pop',
    aggfunc='mean'
)

The following shows the use of the values parameter:

Image by author

Finding the mean life expectancies for each continent

To find the mean life expectancies for each continent, set the index and values parameters as follows:

pd.pivot_table(
    df,               
    index='continent',
    values='lifeExp', 
    aggfunc='mean'
)

You will see the result as follows:

Image by author

What if you want to flip the columns and rows of the result? Easy, change the index parameter to columns:

pd.pivot_table(
    df,
    columns='continent',
    values='lifeExp',
    aggfunc='mean'
)

The following figure shows the result and the use of the various parameters:

Image by author

Finding the life expectancies of each country in the various continents

Next, we want to know the life expectancies of each country in each of the five continents. We could do this:

pd.pivot_table(
    df,
    index='country',
    columns='continent',
    values='lifeExp', 
    aggfunc='mean'
)

Now the life expectancies of each country will be displayed in the respective continent that the country belongs to:

Image by author

Notice the NaNs scattered in the result. If you do not want to see the NaNs, you can set the fill_value parameter to fill them with some values, such as 0s:

pd.pivot_table(
    df,
    index='country',
    columns='continent',
    values='lifeExp', 
    aggfunc='mean',
    fill_value=0
)

You should now see 0s instead of NaNs:

Image by author

Finding the mean life expectancies of each continent by year

Finally, let’s find the mean life expectancies of each continent and group them by year:

pd.pivot_table(
    df,  
    index='year',
    columns='continent',
    values='lifeExp',
    aggfunc='mean'
)

The figure below shows the result and the use of the various parameters:

Image by author

Example 2 — Analyzing Titanic Data

The next dataset that I am going to use is the classic Titanic Dataset.

Source of Data: The data source for this part is from https://www.kaggle.com/datasets/tedllh/titanic-train.

Licensing — Database Contents License (DbCL) v1.0 https://opendatacommons.org/licenses/dbcl/1-0/

Let’s load the CSV file into a Pandas DataFrame:

import pandas as pd
df = pd.read_csv('titanic_train.csv')
df

The dataframe has 891 rows and 12 columns:

Image by author

Finding the mean values for each gender

First, we are interested to see the various mean values for each gender:

pd.pivot_table(
    df,  
    index='Sex'
)
Image by author

For some of the columns, it does not make sense to see their aggregated values, for example — PassengerId, and Pclass. So let’s modify the aggfunc parameter so that I can get the following information:

  • Get the mean Age
  • Get the distribution of Pclass
  • Get the mode (the most frequent occurrence) of Survived

You can do this using the following code snippet:

from scipy import stats
pd.pivot_table(
    df,  
    index='Sex',
    aggfunc={
        'Age': 'mean',
        'Pclass':lambda x: dict(x.value_counts()),
        'Survived':scipy.stats.mode
    }
)

For the aggfunc argument, the value for each key must be a function that returns an aggregated result. So far all the functions that we have used are straight-forward — mean and sum (you can also use functions like min and max). If you want something more complicated you need to write your own function. For the Pclass column, I will write a lambda function to get the value counts of the Series passed to it and then convert it to a dictionary. For the Survived column I will make use of the mode function in the scipy module.

The result looks like this:

Image by author

You can see from the result above:

  • Majority of the female passengers survived while majority of the male passengers died
  • Most male and female passengers were in third class cabins
  • The mean age for female passengers is younger than that for males

Finding the distribution of genders from different port of embarkation

Next, we would also like to know how the distribution of genders is related to the port of embarkation:

df_sex_embarked = pd.pivot_table(
    df,
    index='Sex',
    columns='Embarked',
    values='Survived',
    aggfunc='count'
)
df_sex_embarked
Image by author

From the result, you can see that most people boarded the ship at Southampton (‘S’), regardless of gender.

Finding the distribution of for survivors for each genders from different port of embarkation

We would also like to know how each gender’s survivor numbers relate to the port of embarkation. This is simple as you just need to change the aggfunc from count to sum:

df_sex_embarked_survived = pd.pivot_table(
    df,
    index='Sex',
    columns='Embarked',
    values='Survived',
    aggfunc='sum'
)
df_sex_embarked_survived
Image by author

But the result doesn’t really give you much idea as to which port of embarkation has a higher survival rate. Here is a better way is to calculate the survival ratio of each gender at different port of embarkation:

df_sex_embarked_survived / df_sex_embarked * 100

The result now shows a much clearer picture:

Image by author

You can now observe that more people who boarded the Titanic at Cherbourg (‘C’) survived.

Finding the distribution of genders for each cabin class

We are also curious about the distribution of genders for each cabin class:

df_sex_pclass = pd.pivot_table(
    df,
    index='Pclass',
    columns='Sex',
    values='Survived',
    aggfunc='count'
)
df_sex_pclass

You can see that most passengers are in third class:

Image by author

And their survival based on classes:

df_sex_pclass_survived = pd.pivot_table(
    df,
    index='Pclass',
    columns='Sex',
    values='Survived',
    aggfunc='sum'
)
Image by author

Again, it would be better to calculate their survival ratio:

df_sex_pclass_survived / df_sex_pclass * 100

And interestingly (but not surprisingly), passengers who were in first class have a higher survival rate:

Image by author

Instead of taking the two different dataframes and dividing them, is there an easier way to calculate their survival rate? Of course, if you use a lambda function:

pd.pivot_table(
    df,
    index='Pclass',
    columns='Sex',
    values='Survived',
    aggfunc=lambda x: x.sum() / len(x)
)

In the above statement, x will be passed the values (for the Survived column) for females in class 1 (as a Series), followed by males in class 1, females in class 2, and so on.

Summary

I hope this article managed to shed some lights in the usefulness of the pivot_table() function. Instead of spending time performing groupbys, you can directly call the pivot_table() function and get the aggregated results. And instead of using built-in Pandas or NumPy functions, you can write your own custom functions and use it within the pivot_table() function. Do you have some nice tricks to share? Let me know in the comments!

Level Up Coding

Thanks for being a part of our community! Before you go:

🚀👉 Join the Level Up talent collective and find an amazing job

Eda
Pivot Table
Pandas
Dataframe
Exploratory Data Analysis
Recommended from ReadMedium