Ultimate Pandas Guide — Mastering the Groupby
Develop a deep understanding of “groupby” and “agg”
One of the most common exercises in data analysis is to split your data into groups and perform an aggregation.
For example, let’s say you have customer sales data with several different dimensions:
A natural question might be to ask — what is the sales total by state? Or by sex? Or by month?
In this post, I’ll walk through the ins and outs of the Pandas “groupby” to help you confidently answers these types of questions with Python.
A visual representation of “grouping” data
The easiest way to remember what a “groupby” does is to break it down into three steps: “split”, “apply”, and “combine”.
1.Split: This means to create separate groups based on a column in your data. For example, we can split our sales data into months.
2. Apply: This means that we perform a function on each of the groups. For example, we can sum the sales for each month.
3. Combine: This means that we return a new data table with each of the results from the “apply” stage.
The magic of the “groupby” is that it can help you do all of these steps in very compact piece of code.
Running a “groupby” in Pandas
In order to get sales by month, we can simply run the following:
sales_data.groupby('month').agg(sum)[['purchase_amount']]
Understanding the “split” step in Pandas
The first thing to call out is that when we run the code above, we are actually running two different functions — groupby and agg — where groupby addresses the“split” stage and agg addresses the “apply” stage.
For example, the following code actually does the work to split our data into “month” groups:
sales_data.groupby('month')
While this output is not particularly interesting, there are a handful of things we can do with this object before we even get to the aggregation stage:
- Check out the “groups” attribute:
grouped = sales_data.groupby('month')
grouped.groups
Notice that the “groups” attribute returns a dictionary, whose keys are the groups and whose values are the row indexes from each group.
2. Inspect an individual group using the “get group” method:
grouped.get_group('August')
3. Iterate through each group:
for name, group in grouped:
print(name, group)
Understanding the “agg” step in Pandas
Now let’s explore the “agg” function.
The simplest thing we can pass to “agg” is the name of the aggregation we would like to perform on each of the groups:
sales_data.groupby('month').agg(sum)
Note that this approach will return the sum of all available numerical columns in the DataFrame.
However, in this example, it doesn’t make any sense to return the sum of the “year” or “customer_id” columns.We can fix that by indexing with the list of columns we want to see at the end of our agg call:
sales_data.groupby('month').agg(sum)[['purchase_amount']]
We can also index with a single column (as opposed to list):
sales_data.groupby('month').agg(sum)['purchase_amount']
In this case, we get a Series object instead of a DataFrame. I tend to prefer working with DataFrames, so I typically go with the first approach.
Advanced “groupby” concepts
Now that we have the basics down, let’s go through a few of the more advanced things we can do.
Multiple aggregations
First, let’s say we want the total sales and the average sales by month. To accomplish this, we can pass a list of functions to “agg”:
sales_data.groupby(‘month’).agg([sum, np.mean])[‘purchase_amount’]
This is helpful, but now we are stuck with columns that are named after the aggregation functions (ie. sum and mean).
And this becomes even more of a hindrance when we want to return multiple aggregations for multiple columns:
sales_data.groupby(‘month’).agg([sum, np.mean])[[‘purchase_amount’, 'year']]
In this case, we’re stuck with a multi-index for our column names:
In order to resolve this, we can leverage the “NamedAgg” object that Pandas provides. The syntax here is a little different, but our output makes it very clear what’s going on here:
sales_data.groupby(“month”).agg(
total_sales=pd.NamedAgg(column=’purchase_amount’, aggfunc=sum),
avg_sales=pd.NamedAgg(column=’purchase_amount’, aggfunc=np.mean),
max_year=pd.NamedAgg(column=’year’, aggfunc=max))
This is a really useful mechanism for performing multiple functions on different columns, while maintaining control of the column names in your output.
We can also pass a dictionary to the agg function, but this does not provide us with the same flexibility to name our resulting columns:
sales_data.groupby(“month”).agg(
{‘purchase_amount’: [sum, np.mean],
‘year’: [max]}
)
Grouping on multiple columns
Another thing we might want to do is get the total sales by both month and state.
In order to group by multiple columns, we simply pass a list to our groupby function:
sales_data.groupby(["month", "state"]).agg(sum)[['purchase_amount']]
You’ll also notice that our “grouping keys” — month and state — have become our index. We can easily convert these to columns by running the “reset_index” function:
g = sales_data.groupby([“month”, “state”]).agg(sum)
g[[‘purchase_amount’].reset_index()
Closing thoughts
Once you have a solid intuition behind the “split-apply-combine” approach, running a “groupby” in Pandas is fairly straightforward.
Nonetheless, you may still run into issues with syntax when you’re first getting comfortable with the functions. If you’re facing errors, I recommend taking a more careful look at the data types that you’re passing.
For example, if you’re running something similar to the code below, make sure that you’re passing a list of functions to agg and that you’re placing a list of columns inside another set of brackets to do the column indexing:
sales_data.groupby(‘month’).agg([sum, np.mean])[[‘purchase_amount’, ‘year’]]
For a quick review on Pandas indexing, checking out my intuitive guide below. Happy coding!