avatarSoner Yıldırım

Summary

The website content provides an overview of three methods for aggregating data using Python's Pandas library, emphasizing its flexibility and ease of use for data analysis tasks.

Abstract

The article "3 Methods for Aggregating Data with Python Pandas"

3 Methods for Aggregating Data with Python Pandas

Pandas offers what best fits your needs

Photo by Alp Duran on Unsplash

Pandas is a data analysis and manipulation library for Python and is one of the most popular ones out there. What I think its biggest strengths are ease-of-use and clean syntax.

Pandas is quite flexible in terms of how to perform the common operations so it almost always offers a solution that perfectly fits your needs.

In this article, we will go over the different methods for aggregating data with Pandas. You will see how Pandas offers a variety of ways to complete a specific task.

Note: This article is originally published on datasciencehowto.com.

Let’s start with creating a sample DataFrame filled with mock data.

import pandas as pd
import numpy as np
from random import shuffle
pg = ["A","B","C","D"] * 25
supplier = ["S1","S2"] * 50
shuffle(pg)
shuffle(supplier)
df = pd.DataFrame({
    "product_code": np.arange(1000,1100),
    "product_group": pg,
    "supplier": supplier,
    "price": np.round(np.random.rand(100) * 5, 2),
    "sales_qty": np.random.randint(10, 200, size=100)     
})
df.head()
The first 5 rows of df (image by author)

We use NumPy to generate arrays with random numbers. The product group and supplier columns are created with Python lists. To increase the randomness, the shuffle module in the built-in random library is used for shuffling the lists.

1. Using the aggregate functions directly

The obvious method is to use the aggregate functions such as mean, median, min, and so on.

df.mean()
# Output
product_code    1049.5000
price              2.6519
sales_qty        103.5300
dtype: float64

In this method, Pandas only calculates the aggregated value for numeric columns. However, this feature will be deprecated in a future version and this operation will generate a type error. Thus, it is advised to select the columns of interest first and then do the aggregation.

df[["price","sales_qty"]].mean()
# Output
price          2.6519
sales_qty    103.5300
dtype: float64

Although the product code column is numeric, we cannot really talk about an average product code so we do not select it.

2. Multiple aggregations with apply

We sometimes need to calculate multiple aggregations in a single operation. It can be done by passing a list of functions to the apply function.

df["price"].apply(["mean","median"])
# Output
mean      2.6519
median    2.8550
Name: price, dtype: float64

In case of multiple aggregations on multiple columns, the apply function returns a DataFrame with function names in the index.

df[["price","sales_qty"]].apply(["mean","median"])
# Output
(image by author)

If you want to apply different aggregate functions to different columns, you can use a Python dictionary as follows:

df[["price","sales_qty"]].apply(
   {"price":"mean","sales_qty":"median"}
)
# Output
price          2.6519
sales_qty    102.5000
dtype: float64

The values we see in the output are the mean of the price column and the median of the sales quantity column.

3. The agg function

Another way for calculating multiple aggregations at once is the agg function.

df[["price","sales_qty"]].agg(["mean","median"])

This line of code produces the same output as the one with the apply function above.

The agg function also allows for assigning customized names to the aggregated columns. Here is an example.

df[["price","sales_qty"]].agg(
    avg_price = ("price","mean"),
    max_price = ("price","max"),
    median_sales_qty = ("sales_qty","median")
)
# Output
(image by author)

The column to be aggregated and the aggregate function are written in a tuple. The output is a DataFrame with only relevant values filled in.

The named aggregations come in handy when used with the groupby function. Let’s do an example.

df.groupby("product_group").agg(
    avg_price = ("price","mean"),
    max_price = ("price","max"),
    product_count = ("product_code","count"),
    avg_sales = ("sales_qty","mean")
)
# Output
(image by author)

Bonus: Describe

The describe function calculates the following aggregations and statistics:

  • count
  • mean
  • std (standard deviation)
  • min
  • 25% (first quartile)
  • 50% (second quartile or median)
  • 75% (third quartile)
  • max

When sorted from the lowest to highest, 25% of the values are below the first quartile, 50% of the values are below the second quartile, and so on.

The describe function can be applied to a column or a list of columns. We can also directly use it on the DataFrame.

df[["price","sales_qty"]].describe()
# Output
(image by author)

The describe function provides an overview of the numeric columns. It is quite useful for exploratory data analysis.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.

Thank you for reading. Please let me know if you have any feedback.

Data Science
Artificial Intelligence
Machine Learning
Python
Python Programming
Recommended from ReadMedium