Pandas Basics — 2. Grouping, Aggregating, Analyzing
This article follows Pandas Basics — Part. 1. Today, we’ll focus on topics more related to data manipulation. We’ll explore advanced topics in Pandas such as merging and joining DataFrames, handling categorical data, etc…
Let’s get started!
Grouping and Aggregation
Grouping and aggregation are essential techniques in any data analysis pipeline, and Pandas provides several functions and methods to perform these operations with ease.
First, the groupbyfunction is one of the most powerful and flexible tools in Pandas. It allows you to split a DataFrame into groups based on one or more columns, and apply a function to each group.
For example, let’s say we have a dataset of sales data for a store, and we want to calculate the total sales for each store location. We can use the groupbyfunction to split the data by store location, and then apply the sum function to each group:
import pandas as pd
df = pd.read_csv('sales_data.csv')
# Group data by store location and sum sales
df_grouped = df.groupby('location').sum()
print(df_grouped)This will output a new DataFrame containing the total sales for each store location:
location New York 89000 San Francisco 80000 Toronto 70000 Name: sales, dtype: int64
Another useful function for grouping and aggregation is pivot_table. This function allows you to create a new DataFrame by pivoting the columns of an existing DataFrame.
For example, let’s say we have a dataset of sales data for a store, and we want to create a new table showing the total sales by store location and product category. We can use the pivot_tablefunction to do this:
import pandas as pd
df = pd.read_csv('sales_data.csv')
# Create a pivot table showing total sales by location and product category
df_pivot = df.pivot_table(index='location', columns='category', values='sales', aggfunc=sum)
print(df_pivot)This will output a new pivot table showing the total sales by location and product category:
category Furniture Office Supplies Technology
location
New York 35000.0 22000.0 32000.0
San Francisco 30000.0 25000.0 25000.0
Toronto 20000.0 10000.0 10000.0In addition to the built-in functions, you can also apply your own custom functions to groups of data in Pandas.
For example, let’s say we have a dataset of employee data, and we want to calculate the average salary by department. We can use the apply function to apply a custom function to each group:
import pandas as pd
df = pd.read_csv('employee_data.csv')
# Define a custom function to calculate the average salary
def avg_salary(group):
return group['salary'].mean()
# Group data by department and apply the custom function
df_grouped = df.groupby('department').apply(avg_salary)
print(df_grouped)This will output a new Series containing the average salary for each department:
department HR 50000.0 IT 60000.0 Marketing 75000.0 Sales 65000.0 dtype: float64
Time Series Analysis
Time series data is a type of data that is collected over a period of time, and is often used in fields such as finance, economics, and meteorology. In Pandas, time series data is typically stored in a DataFrame with a DatetimeIndex.
One of the key techniques for working with time series data in Pandas is resampling. Resampling is the process of downsampling or upsampling a time series to a different frequency. For example, you might want to resample a daily time series to a monthly frequency in order to compare the data at a higher level.
To resample a time series in Pandas, you can use the resample function, which is available on the DataFrame object. Here is an example of how to resample a daily time series to a monthly frequency:
import pandas as pd
df = pd.read_csv('time_series_data.csv', index_col='date', parse_dates=True)
# Resample to monthly frequency
df_monthly = df.resample('M').mean()
print(df_monthly)This will output a new time series with the data resampled to a monthly frequency. You can also use the resample function to apply other functions to the data, such as sum, min, or max.
Another useful technique for working with time series data is shifting. Shifting is the process of moving the data forward or backward in time by a certain number of periods. For example, you might want to shift a time series forward by one month in order to compare the data to the previous period.
To shift a time series in Pandas, you can use the shift function, which is available on the DataFrame object. Here is an example of how to shift a monthly time series forward by one month:
import pandas as pd
df = pd.read_csv('time_series_data.csv', index_col='date', parse_dates=True)
# Shift the data forward by one month
df_shifted = df.shift(periods=1, freq='M')
print(df_shifted)This will output a new time series with the data shifted forward by one month. You can also use the shift function to shift the data backward by specifying a negative value for the periods parameter.
Merging and Joining
Merging and joining DataFrames is a common task in data analysis, and Pandas provides a number of functions and methods to perform these operations with ease.
The main difference between merging and joining is the way in which the data is combined. In a merge, the data is combined based on common columns or keys. In a join, the data is combined based on the index of the DataFrames.
To merge two DataFrames in Pandas, you can use the mergefunction, which is available on the DataFrame object. Here is an example of how to merge two DataFrames based on a common column:
import pandas as pd
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
# Merge the DataFrames on the 'id' column
df_merged = pd.merge(df1, df2, on='id')
print(df_merged)This will output a new DataFrame containing the merged data from both DataFrames.
# Input
data1.csv:
id,value1,value2
1,10,20
2,15,25
3,20,30
data2.csv:
id,value3,value4
1,30,40
2,35,45
3,40,50
# Output
id,value1,value2,value3,value4
1,10,20,30,40
2,15,25,35,45
3,20,30,40,50You can also specify different merge methods, such as inner, outer, or left, to control how the data is combined.
To join two DataFrames in Pandas, you can use the joinfunction, which is available on the DataFrame object. Here is an example of how to join two DataFrames based on their index:
import pandas as pd
df1 = pd.read_csv('data1.csv', index_col='id')
df2 = pd.read_csv('data2.csv', index_col='id')
# Join the DataFrames
df_joined = df1.join(df2)
print(df_joined)This will output a new DataFrames containing the joined data from both DataFrames.
# Input
data1.csv:
id,value1,value2
1,10,20
2,15,25
3,20,30
data2.csv:
value3,value4,id
30,40,1
35,45,2
40,50,3
# Output
id,value1,value2,value3,value4
1,10,20,30,40
2,15,25,35,45
3,20,30,40,50You can also specify different join types, such as inner, outer, or left, to control how the data is combined.
Another advanced technique in Pandas is concatenation, which is the process of combining multiple DataFrames into a single DataFrames. To concatenate DataFrames in Pandas, you can use the concatfunction. Here is an example of how to concatenate two DataFrames vertically (i.e., adding rows to the DataFrame):
import pandas as pd
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
# Concatenate the DataFrames
df_concat = pd.concat([df1, df2])
print(df_concat)This will output a new DataFrame containing the concatenated data from both DataFrames.
# Input
data1.csv:
id,value1,value2
1,10,20
2,15,25
3,20,30
data2.csv:
id,value3,value4
4,30,40
5,35,45
6,40,50
# Output
id,value1,value2,value3,value4
1,10,20,NaN,NaN
2,15,25,NaN,NaN
3,20,30,NaN,NaN
4,NaN,NaN,30,40
5,NaN,NaN,35,45
6,NaN,NaN,40,50You can also specify different axis values, such as ‘columns’, to concatenate the DataFrames horizontally (i.e., adding columns to the DataFrame).
Pandas in Practice
Now that you have learned the basics of Pandas, it’s time to put your skills into practice.
One common task in data analysis is cleaning and preparing data for further analysis. In Pandas, you can use a number of functions and methods to handle missing data, remove duplicates, and transform the data into the desired format. For example, let’s say we have a dataset of customer data, and we want to clean and prepare the data for further analysis. Here is some code that demonstrates how to do this:
import pandas as pd
df = pd.read_csv('customer_data.csv')
# Remove rows with missing data
df.dropna(inplace=True)
# Remove duplicates
df.drop_duplicates(inplace=True)
# Transform the data
df['customer_id'] = df['customer_id'].astype(str)
df['signup_date'] = pd.to_datetime(df['signup_date'])
print(df)Another common task in data analysis is visualizing data. In Pandas, you can use the plot function, which is available on the DataFrame object, to create a variety of plots. You need Matplotlib to plot DataFrames.
For example, let’s say we have a dataset of sales data, and we want to visualize the data using a line plot. Here is some code that demonstrates how to do this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('sales_data.csv', index_col='date', parse_dates=True)
# Create a line plot
df.plot(kind='line')
# Show the plot
plt.show()This will create a line plot showing the sales data over time. You can also use the plot function to create other types of plots, such as bar plots, scatter plots, and pie charts.
Now it’s your turn to try out these techniques in practice. Here are some hands-on exercises for you to complete:
- Load a dataset of your choice and clean and prepare the data using the techniques demonstrated above.
- Visualize the data using the plot function and try out different plot types.
- Use the
groupbyandpivot_tablefunctions to analyze and summarize the data. - Try applying custom functions to groups of data using the apply function.
Final Note
Now you are ready to take it to the next level and learn more complex things, such as developing solutions for supervised, unsupervised learning, etc…
I’ll soon write a series to learn machine learning, so be sure to follow me to don’t miss it!
To explore more of my Python stories, click here! You can also access all my content by checking this page.
If you liked the story, don’t forget to clap, comment, and maybe follow me if you want to explore more of my content :)
You can also subscribe to me via email to be notified every time I publish a new story, just click here!
If you’re not subscribed to Medium yet and wish to support me or get access to all my stories, you can use my link:






