3 Methods for Aggregating Data with Python Pandas
Pandas offers what best fits your needs
Pandas is a data analysis and manipulation library for Python and is one of the most popular ones out there. What I think its biggest strengths are ease-of-use and clean syntax.
Pandas is quite flexible in terms of how to perform the common operations so it almost always offers a solution that perfectly fits your needs.
In this article, we will go over the different methods for aggregating data with Pandas. You will see how Pandas offers a variety of ways to complete a specific task.
Note: This article is originally published on datasciencehowto.com.
Let’s start with creating a sample DataFrame filled with mock data.
import pandas as pd
import numpy as np
from random import shuffle
pg = ["A","B","C","D"] * 25
supplier = ["S1","S2"] * 50
shuffle(pg)
shuffle(supplier)
df = pd.DataFrame({
"product_code": np.arange(1000,1100),
"product_group": pg,
"supplier": supplier,
"price": np.round(np.random.rand(100) * 5, 2),
"sales_qty": np.random.randint(10, 200, size=100)
})
df.head()
We use NumPy to generate arrays with random numbers. The product group and supplier columns are created with Python lists. To increase the randomness, the shuffle module in the built-in random library is used for shuffling the lists.
1. Using the aggregate functions directly
The obvious method is to use the aggregate functions such as mean, median, min, and so on.
df.mean()
# Output
product_code 1049.5000
price 2.6519
sales_qty 103.5300
dtype: float64
In this method, Pandas only calculates the aggregated value for numeric columns. However, this feature will be deprecated in a future version and this operation will generate a type error. Thus, it is advised to select the columns of interest first and then do the aggregation.
df[["price","sales_qty"]].mean()
# Output
price 2.6519
sales_qty 103.5300
dtype: float64
Although the product code column is numeric, we cannot really talk about an average product code so we do not select it.
2. Multiple aggregations with apply
We sometimes need to calculate multiple aggregations in a single operation. It can be done by passing a list of functions to the apply function.
df["price"].apply(["mean","median"])
# Output
mean 2.6519
median 2.8550
Name: price, dtype: float64
In case of multiple aggregations on multiple columns, the apply function returns a DataFrame with function names in the index.
df[["price","sales_qty"]].apply(["mean","median"])
# Output
If you want to apply different aggregate functions to different columns, you can use a Python dictionary as follows:
df[["price","sales_qty"]].apply(
{"price":"mean","sales_qty":"median"}
)
# Output
price 2.6519
sales_qty 102.5000
dtype: float64
The values we see in the output are the mean of the price column and the median of the sales quantity column.
3. The agg function
Another way for calculating multiple aggregations at once is the agg function.
df[["price","sales_qty"]].agg(["mean","median"])
This line of code produces the same output as the one with the apply function above.
The agg function also allows for assigning customized names to the aggregated columns. Here is an example.
df[["price","sales_qty"]].agg(
avg_price = ("price","mean"),
max_price = ("price","max"),
median_sales_qty = ("sales_qty","median")
)
# Output
The column to be aggregated and the aggregate function are written in a tuple. The output is a DataFrame with only relevant values filled in.
The named aggregations come in handy when used with the groupby function. Let’s do an example.
df.groupby("product_group").agg(
avg_price = ("price","mean"),
max_price = ("price","max"),
product_count = ("product_code","count"),
avg_sales = ("sales_qty","mean")
)
# Output
Bonus: Describe
The describe function calculates the following aggregations and statistics:
- count
- mean
- std (standard deviation)
- min
- 25% (first quartile)
- 50% (second quartile or median)
- 75% (third quartile)
- max
When sorted from the lowest to highest, 25% of the values are below the first quartile, 50% of the values are below the second quartile, and so on.
The describe function can be applied to a column or a list of columns. We can also directly use it on the DataFrame.
df[["price","sales_qty"]].describe()
# Output
The describe function provides an overview of the numeric columns. It is quite useful for exploratory data analysis.
You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.
Thank you for reading. Please let me know if you have any feedback.