Summary

This context explains how to incorporate null values while grouping by columns in pandas, a popular Python library for data manipulation and analysis.

Abstract

In this context, the author discusses the default behavior of pandas' groupby function, which excludes null values from the results. The author then demonstrates how to incorporate null values in the resulting aggregations by explicitly specifying dropna=False when calling the groupby function. This feature is available for pandas versions ≥ 1.1. The author also provides an alternative workaround for older versions of pandas that do not support dropna in the groupby method.

Bullet points

The default behavior of pandas' groupby function excludes null values from the results.
To incorporate null values in the resulting aggregations, explicitly specify dropna=False when calling the groupby function.
This feature is available for pandas versions ≥ 1.1.
An alternative workaround for older versions of pandas is to fill the missing values with a particular placeholder value before performing the groupby operation.
Be careful when selecting the placeholder value to avoid collisions with pre-existing values in the dataframe.

How To Group By Columns With Missing Values in Pandas

Incorporating null values whilst grouping by in pandas columns

Performing aggregations with the use of group by clauses is probably something you come across in your day-to-day work. Pandas in Python is no exception to this since this is an operation you will definitely see in many different places of a repository utilising the library.

However, pandas’ default behaviour excludes empty/missing (aka null) values from the results. In today’s short tutorial we will be demonstrating this default behaviour as well as a way for incorporating missing values in the resulting aggregations.

First, let’s create an example

import numpy as np
import pandas as pd 

df = pd.DataFrame(
    [ 
        (1, 'B', 121, 10.1, True),
        (2, 'C', 145, 5.5, False),
        (3, 'A', 345, 4.5, False),
        (4, 'A', 112, np.nan, True),
        (5, 'C', 105, 2.1, False),
        (6, np.nan, 435, 7.8, True),
        (7, np.nan, 521, np.nan, True),
        (8, 'B', 322, 8.7, True),
        (9, 'C', 213, 5.8, True),
        (10, 'B', 718, 9.1, False),
    ],
    columns=['colA', 'colB', 'colC', 'colD', 'colE']
)


print(df)
   colA colB  colC  colD   colE
0     1    B   121  10.1   True
1     2    C   145   5.5  False
2     3    A   345   4.5  False
3     4    A   112   NaN   True
4     5    C   105   2.1  False
5     6  NaN   435   7.8   True
6     7  NaN   521   NaN   True
7     8    B   322   8.7   True
8     9    C   213   5.8   True
9    10    B   718   9.1  False

The default behaviour

Now let’s suppose that we want to compute the sum per value in colB. That’s easy enough and can be done with the following expression

df.groupby('colB')['colD'].sum()

that is going to return the following result:

>>> df.groupby('colB')['colD'].sum()
colB
A     4.5
B    27.9
C    13.4
Name: colD, dtype: float64

But as we can notice, the missing values (None) are missing from the output.

Incorporating null values in aggregations

There’s a chance you may still want to incorporate missing values in your aggregations. In order to do so, all you need to do is explicitly specify dropna=False when calling the groupby function — this value defaults to True. Note that this is possible for pandas versions ≥ 1.1.

df.groupby('colB', dropna=False)['colD'].sum()

And the resulting Series will also include the count for missing values:

>>> df.groupby('colB', dropna=False)['colD'].sum()
colB
A       4.5
B      27.9
C      13.4
NaN     7.8
Name: colD, dtype: float64

Obviously, the same concept applies for other aggregation types, such as count. For example,

>>> df.groupby('colB').count()
      colA  colC  colD  colE
colB                        
A        2     2     1     2
B        3     3     3     3
C        3     3     3     3

as opposed to

>>> df.groupby('colB', dropna=False).count()
      colA  colC  colD  colE
colB                        
A        2     2     1     2
B        3     3     3     3
C        3     3     3     3
NaN      2     2     1     2

Final Thoughts

With recent pandas versions you can now incorporate missing values when performing aggregations over pandas DataFrames. Note though that if you are running an older version that does not support dropna in groupby method, then (apart from upgrading to one of the most recent versions which is always a good practice!) you could also find some workarounds such as the one outlined below:

# Fill the missing values with a particular placeholder value
# Note though that you must be careful when selecting such 
# value in order to avoid any collisions with pre-existing values
# in the dataframe
# In the example below I'll use -1, but make sure to amend if needed
>>> df.fillna(-1).groupby('colB').sum()

Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You’ll also get full access to every story on Medium.

Join Medium with my referral link — Giorgos Myrianthous

As a Medium member, a portion of your membership fee goes to writers you read, and you get full access to every story…

gmyrianthous.medium.com

Related articles you may also like

How To Merge Pandas DataFrames

Performing left, right, inner and anti joins with pandas DataFrames

towardsdatascience.com

Tools For Data Engineers

The basics of a Data Engineering toolbox

towardsdatascience.com

A Visual Explanation of SQL Joins

Understanding SQL joins with Venn Diagrams and practical examples

towardsdatascience.com