Get the Most from Pandas GroupBy
From basic examples to a practical exercise

Python’s pandas library includes many useful tools for interrogating and manipulating data, one of which is the powerful GroupBy function. This function enables grouping observations by various categories and aggregating them in numerous ways.
This may sound confusing at first, but this guide will walk through how to use the function and its various features. The walkthrough includes:
- An introduction to GroupBy.
- Applying GroupBy to Practice Datasets.
- Various GroupBy Techniques.
- Practical Exercise and Application.
Code and Data:
The data and Jupyter notebook with full Python code used in this walkthrough is available at the linked github page. Download or clone the repository to follow along. This guide uses synthetic data with fake names generated by the author for this article; the data is available at the linked github page.
The code requires the following libraries:
# Data Handling
import pandas as pd
import numpy as np
# Data visualization
import plotly.express as px
1.1. Getting Started — Data Load and GroupBy Basics
The first step is to load in a dataset:
# Load Data:
df = pd.read_csv('StudentData.csv')
df.head(3)
This gets the following dataframe with information about students who took a series of tests at school. It includes their age, three test scores, when they took their class, their average grade, letter grade, and whether or not they passed:

Pandas’ GroupBy allows splitting the dataframe into elements of interest and applying some sort of function to it. The easiest way to think about GroupBy is to formulate a question that the GroupBy operation solves. A simple starting point is to ask how many students passed the course:
df.groupby('CoursePass')['CoursePass'].count()
In the above line of code, GroupBy groups on the ‘CoursePass’ column and then performs a count function on the ‘CoursePass’ column. This returns a simple count:

This is fairly similar to the value_counts() operation, which performs as follows:
df.CoursePass.value_counts()

This is a good start, but GroupBy can do far more advanced operations. Let’s look at multiple columns and more complex questions about the student data.
1.2. GroupBy() with Multiple Columns
Suppose we want to know how many students passed in each class session (Afternoon, Evening, and Morning). The following GroupBy code quickly answers that:
df.groupby(['ClassSession', 'CoursePass'])['CoursePass'].count()
The above code groups by columns “ClassSession” and “CoursePass” while returning counts for each CoursePass status (“Yes” or “No”). The result is:

Note that ordering the columns within the parenthesis matters. Swapping the order of “ClassSession” and “CoursePass” results in the following:
df.groupby(['CoursePass', 'ClassSession'])['ClassSession'].count()

Before moving to even more advanced applications of GroupBy, let’s take a lok at how to clean up the output for an improved presentation.
1.3. Cleaning Up GroupBy’s Outputs
There are several ways to get a standard GroupBy output into a cleaner appearing dataframe format. Let’s take one of the examples above and add “.to_frame()” to the end of the code:
df.groupby(['ClassSession', 'CoursePass'])['CoursePass'].count().to_frame()

Notice this returns multilevel columns and no index. Also note that two columns are named “CoursePass.” To flatten the columns, adjust the names, and add an index, use the following code:
df.groupby(['ClassSession', 'CoursePass'])['CoursePass'].count().reset_index(name='Count')
This returns the following:

Adding reset_index() to the operation flattens the columns and adds an index; passing a name within reset_index() names the column being counted (in this case “CoursePass”) to “Count.” Failing to rename the column within reset_index() would return an error due to two columns being named “CoursePass.”
This can be taken a step further by hiding the index altogether:
df.groupby(['ClassSession', 'CoursePass'])['CoursePass'].count().reset_index(name='Count').style.hide_index()
The output is:

Next, let’s look at more advanced GroupBy functions.
1.4. Advanced GroupBy Capabilities
GroupBy can create an object of the selected columns. The below code creates a GroupBy object based on two columns of interest; applying the mean() function to the object while passing two other columns of interest returns the mean for those columns:
# Create a groupby object:
dfGroupby = df.groupby(['LetterGrade', 'CoursePass'])
dfGroupby[['Test1', 'Test2']].mean()
The result is:

Note how the output is grouped by the columns “LetterGrade” and “CoursePass” and returns the averages for the columns “Test1” and “Test2,” which were the two columns chosen for the mean() function from the initial GroupBy object.
This can be taken to an even more advanced level by using multiple aggregations. Consider the following code:
dfGroupby = df.groupby(['LetterGrade', 'CoursePass'])
dfGroupby[['Grade', 'Age']].agg(['mean', 'min', 'max']).reset_index()
This returns:

The above code answers several questions: sorting by LetterGrade and CoursePass, what is the average, min, and max Grade of each student and the average, min, and max Age of each student?
Creating a GroupBy object is not always necessary; the agg() function can allow the answering of fairly complex questions. Suppose we wanted to filter by CoursePass and ClassSession to find the average Age and Grade for each student; the following code provides the answers:
df.groupby(['CoursePass', 'ClassSession']).agg({'Age':'mean', 'Grade':'mean'})
The output is:

The aggregate function can also pass numpy operations:
data = df.groupby(['CoursePass', 'ClassSession'])['Grade'].agg([np.mean, np.max, np.min]).reset_index()
data = data.rename(columns={'mean':'AverageGrade',
'amax':'HighestGrade',
'amin':'LowestGrade'})
data.style.hide_index()
This yields:

The aggregate function can also directly rename columns:
df.groupby(['CoursePass', 'ClassSession'])\
.agg(AverageGrade=('Grade', 'mean'),
HighestGrade=('Grade', 'max'),
LowestGrade=('Grade', 'min'))
This returns:

The number of syntactical possibilities for GroupBy means there’s probably an elegant code solution to a complex question about a dataframe. Now that we’ve explored the basics, let’s look at some practical applications of how to apply GroupBy.
2. Exercise: Quickly Visualize Grade Counts
Requirement: Given student test scores for a course, visualize the total of each letter grade achieved by each class session on one chart.
Let’s start by taking a look at the data:
df.head()

The requirement is interested in letter grades and class session; that indicates the GroupBy will sort on the columns ClassSession and LetterGrade. The requirement also wants a total; let’s use the count() function and see what comes back:
session = df.groupby(['ClassSession', 'LetterGrade'])['LetterGrade'].count().reset_index(name='Count')
session = session.sort_values(by='LetterGrade', ascending = True)
session

The above code grouped the data by Class Session and Letter Grade while counting the number of occurrences of each letter grade in each grouping combination. The sort_values() code sorts by Letter Grade A through F. All that’s left to do is to visualize the data:
# Generate plot:
plot = px.histogram(session,
y='Count',
x='LetterGrade',
color='ClassSession')
plot.update_layout(
title={'text': "Grade Distribution\
<br><sup>Number of Grades by Type and Class Session</br>",
'xanchor': 'center',
'yanchor': 'top',
'x': 0.47},
xaxis_title='Letter Grade',
yaxis_title='Count')
plot.show()
This generates the following plot:

With just a few lines of code, the question is answered!
3. Conclusion.
The Pandas GroupBy function is a powerful tool that can quickly perform aggregations and mathematical operations on specific data groupings from a dataframe. Often, one execution of the function is all that’s necessary to answer a seemingly involved question. Mastering GroupBy’s use provides data scientists with an effective tool for data preparation and manipulation. Feel free to use the provided notebook and code at the linked GitHub page and experiment some more with GroupBy!