avatarSkyler Dale

Summary

The web content provides a comprehensive guide on mastering the "groupby" and "agg" functionalities in Pandas to perform advanced data analysis in Python.

Abstract

The article titled "Ultimate Pandas Guide — Mastering the Groupby" offers an in-depth exploration of the "groupby" and "agg" operations in Pandas, which are essential for data analysts and data scientists to segment data and compute aggregations. It breaks down the "groupby" process into three steps: split, apply, and combine. The guide illustrates how to use these operations to answer questions about data, such as total sales by state or month, and demonstrates how to perform these actions with concise Python code. It also covers advanced concepts like multiple aggregations, renaming aggregated columns using "NamedAgg," and grouping by multiple columns. The article aims to build readers' confidence in using "groupby" to manipulate and analyze data effectively.

Opinions

  • The author emphasizes the importance of understanding the "split-apply-combine" approach for effective data analysis using Pandas.
  • The guide suggests that the "groupby" operation in Pandas is powerful yet straightforward, capable of handling complex data manipulation tasks compactly.
  • The author shows a preference for working with DataFrames over Series objects when dealing with the results of aggregations.
  • The article highlights the utility of the "NamedAgg" feature for maintaining clear and customizable column names when performing multiple aggregations on different columns.
  • The author recommends checking data types and ensuring correct syntax to avoid common errors when using "groupby" and "agg" functions.
  • The guide concludes by encouraging readers to review indexing in Pandas for a better understanding of data manipulation, linking to additional resources for further learning.
Photo by Laura Woodbury from Pexels

Ultimate Pandas Guide — Mastering the Groupby

Develop a deep understanding of “groupby” and “agg”

One of the most common exercises in data analysis is to split your data into groups and perform an aggregation.

For example, let’s say you have customer sales data with several different dimensions:

A natural question might be to ask — what is the sales total by state? Or by sex? Or by month?

In this post, I’ll walk through the ins and outs of the Pandas “groupby” to help you confidently answers these types of questions with Python.

A visual representation of “grouping” data

The easiest way to remember what a “groupby” does is to break it down into three steps: “split”, “apply”, and “combine”.

1.Split: This means to create separate groups based on a column in your data. For example, we can split our sales data into months.

2. Apply: This means that we perform a function on each of the groups. For example, we can sum the sales for each month.

3. Combine: This means that we return a new data table with each of the results from the “apply” stage.

The magic of the “groupby” is that it can help you do all of these steps in very compact piece of code.

Running a “groupby” in Pandas

In order to get sales by month, we can simply run the following:

sales_data.groupby('month').agg(sum)[['purchase_amount']]

Understanding the “split” step in Pandas

The first thing to call out is that when we run the code above, we are actually running two different functions — groupby and agg — where groupby addresses the“split” stage and agg addresses the “apply” stage.

For example, the following code actually does the work to split our data into “month” groups:

sales_data.groupby('month')

While this output is not particularly interesting, there are a handful of things we can do with this object before we even get to the aggregation stage:

  1. Check out the “groups” attribute:
grouped = sales_data.groupby('month')
grouped.groups

Notice that the “groups” attribute returns a dictionary, whose keys are the groups and whose values are the row indexes from each group.

2. Inspect an individual group using the “get group” method:

grouped.get_group('August')

3. Iterate through each group:

for name, group in grouped:
    print(name, group)

Understanding the “agg” step in Pandas

Now let’s explore the “agg” function.

The simplest thing we can pass to “agg” is the name of the aggregation we would like to perform on each of the groups:

sales_data.groupby('month').agg(sum)

Note that this approach will return the sum of all available numerical columns in the DataFrame.

However, in this example, it doesn’t make any sense to return the sum of the “year” or “customer_id” columns.We can fix that by indexing with the list of columns we want to see at the end of our agg call:

sales_data.groupby('month').agg(sum)[['purchase_amount']]

We can also index with a single column (as opposed to list):

sales_data.groupby('month').agg(sum)['purchase_amount']

In this case, we get a Series object instead of a DataFrame. I tend to prefer working with DataFrames, so I typically go with the first approach.

Advanced “groupby” concepts

Now that we have the basics down, let’s go through a few of the more advanced things we can do.

Multiple aggregations

First, let’s say we want the total sales and the average sales by month. To accomplish this, we can pass a list of functions to “agg”:

sales_data.groupby(‘month’).agg([sum, np.mean])[‘purchase_amount’]

This is helpful, but now we are stuck with columns that are named after the aggregation functions (ie. sum and mean).

And this becomes even more of a hindrance when we want to return multiple aggregations for multiple columns:

sales_data.groupby(‘month’).agg([sum, np.mean])[[‘purchase_amount’, 'year']]

In this case, we’re stuck with a multi-index for our column names:

In order to resolve this, we can leverage the “NamedAgg” object that Pandas provides. The syntax here is a little different, but our output makes it very clear what’s going on here:

sales_data.groupby(“month”).agg(
   total_sales=pd.NamedAgg(column=’purchase_amount’, aggfunc=sum),
   avg_sales=pd.NamedAgg(column=’purchase_amount’, aggfunc=np.mean),
   max_year=pd.NamedAgg(column=’year’, aggfunc=max))

This is a really useful mechanism for performing multiple functions on different columns, while maintaining control of the column names in your output.

We can also pass a dictionary to the agg function, but this does not provide us with the same flexibility to name our resulting columns:

sales_data.groupby(“month”).agg( 
        {‘purchase_amount’: [sum, np.mean],
        ‘year’: [max]}
)

Grouping on multiple columns

Another thing we might want to do is get the total sales by both month and state.

In order to group by multiple columns, we simply pass a list to our groupby function:

sales_data.groupby(["month", "state"]).agg(sum)[['purchase_amount']]

You’ll also notice that our “grouping keys” — month and state — have become our index. We can easily convert these to columns by running the “reset_index” function:

g = sales_data.groupby([“month”, “state”]).agg(sum) 
g[[‘purchase_amount’].reset_index()

Closing thoughts

Once you have a solid intuition behind the “split-apply-combine” approach, running a “groupby” in Pandas is fairly straightforward.

Nonetheless, you may still run into issues with syntax when you’re first getting comfortable with the functions. If you’re facing errors, I recommend taking a more careful look at the data types that you’re passing.

For example, if you’re running something similar to the code below, make sure that you’re passing a list of functions to agg and that you’re placing a list of columns inside another set of brackets to do the column indexing:

sales_data.groupby(‘month’).agg([sum, np.mean])[[‘purchase_amount’, ‘year’]]

For a quick review on Pandas indexing, checking out my intuitive guide below. Happy coding!

Data Science
Programming
Python
Pandas
Coding
Recommended from ReadMedium