avatarB. Chen

Summary

This article provides guidance on flattening MultiIndex columns and rows in Pandas DataFrames, offering six different techniques for simplifying hierarchical indices for more straightforward data analysis.

Abstract

The article "How to flatten MultiIndex Columns and Rows in Pandas" introduces six methods for converting a MultiIndex DataFrame into a single-level index structure, which is often more manageable for data manipulation and analysis. It starts by explaining the concept of MultiIndex and its common occurrence after operations like groupby(), melt(), pivot_table(), and stack(). The techniques include using get_level_values(), to_flat_index(), joining column labels, and various ways of flattening rows such as using reset_index() and joining row labels. The author illustrates each method with examples and screenshots, ensuring readers can follow along and understand the transformations. The article concludes by directing readers to a notebook for source code and encourages them to explore more tutorials from the author's GitHub repository.

Opinions

  • The author believes that working with a single-level index is often easier than dealing with a MultiIndex DataFrame.
  • The use of to_flat_index() is highlighted as a feature available from Pandas version 0.24.0 onwards, suggesting its relevance for users working with updated versions of Pandas.
  • Joining MultiIndex labels with a separator (e.g., an underscore) is presented as a way to create more human-readable column and row indices.
  • The article emphasizes the practicality of these methods in saving time during data analysis, implying that these techniques are valuable for data scientists and analysts.
  • By providing a link to the source code notebook and the GitHub repository, the author indicates a commitment to open-source learning and community contribution.
  • The author endorses an AI service, ZAI.chat, as a cost-effective alternative to ChatGPT Plus (GPT-4), suggesting its potential utility for readers interested in AI and machine learning tools.

How to flatten MultiIndex Columns and Rows in Pandas

6 Tricks to effectively flatten MultiIndex columns and rows in a Pandas DataFrame

Photo by Anastasiia Chepinska on Unsplash

A MultiIndex (also known as a hierarchical index) DataFrame allows you to have multiple columns acting as a row identifier and multiple rows acting as a header identifier.

Pandas MultiIndex DataFrame Example (Image by author)

MultIndex is very useful for doing sophisticated data analysis, especially for working with higher dimensional data. Pandas has various methods that can output a MultIndex DataFrame, for instance, groupby(), melt(), pivot_table(), stack() etc. However, sometimes it’s just easier to work with a single-level index in a DataFrame.

In this article, you’ll learn how to flatten MultiIndex columns and rows. This article is organized as follows:

  1. Flatten columns: use get_level_values()
  2. Flatten columns: use to_flat_index()
  3. Flatten columns: join column labels
  4. Flatten rows: flatten all levels
  5. Flatten rows: flatten a specific level
  6. Flatten rows: join row labels

Please check out the Notebook for source code. More tutorials are available from Github Repo.

1. MultiIndex columns: use get_level_values()

To start, let’s create a sample DataFrame and call groupby() to create a MultiIndex column:

df = pd.DataFrame({
    'name': ['Tom', 'James', 'Allan', 'Chris'],
    'year': ['2000', '2000', '2001', '2001'],
    'math': [67, 80, 75, 50],
    'star': [1, 2, 3, 4]
})
df_grouped = df.groupby('year').agg(
    { 'math': ['mean', 'sum'], 'star': 'sum'}
)
DataFrame with MultiIndex (Image by author)

Running df_grouped.columns , we can see that we have a MultiIndex column

df_grouped.columns
MultiIndex([('math', 'mean'),
            ('math',  'sum'),
            ('star',  'sum')],
           )

The easiest way to flatten the MultiIndex columns is to set the columns to a specific level, for instance, set the columns to the top level:

df_grouped.columns = df_grouped.columns.get_level_values(0)
flatten MultiIndex DataFrame (Image by author)

get_level_values(0) returns the top level and we assign the value to df_grouped.columns. This certainly does the job, but you may have already noticed that the result has 2 math columns.

2. Flatten columns: use to_flat_index()

As of Pandas version 0.24.0, the to_flat_index() converts a MultiIndex to an Index of Tuples containing the level values:

df_grouped.columns.to_flat_index()
# output
Index([('math', 'mean'), ('math', 'sum'), ('star', 'sum')], dtype='object')

By assigning the result to df_grouped.columns, the result will look like this:

df_grouped.columns = df_grouped.columns.to_flat_index()
flatten MultiIndex DataFrame (Image by author)

3. Flatten columns: join column labels

If you want a better human-readable single-level index, you can join the values in the MultiIndex:

['_'.join(col) for col in df_grouped.columns.values]
# output
['math_mean', 'math_sum', 'star_sum']

By assigning the result to df_grouped.columns, the result will look like this:

df_grouped.columns 
  = ['_'.join(col) for col in df_grouped.columns.values]
flatten MultiIndex DataFrame (Image by author)

4. Flatten rows: flatten all levels

Suppose we have the following DataFrame with MultiIndex rows:

multi_index = pd.MultiIndex.from_tuples([
  ('Oxford', 'A', '01-01-2022'), 
  ('Oxford', 'B', '01-01-2022'), 
  ('Oxford', 'A', '02-01-2022'),
  ('Oxford', 'B', '02-01-2022'),
  ('London', 'C', '01-01-2022'), 
  ('London', 'D', '01-01-2022'),
  ('London', 'C', '02-01-2022'),
  ('London', 'D', '02-01-2022')], 
  names=['Location','Store', 'Date']
)
data = {
  'Num_employee': [1,2,3,4,5,6,7,8],
  'Sales': [11,22,33,44,55,66,77,88]
}
df = pd.DataFrame(data, index=multi_index)
MultiIndex DataFrame (Image by author)

We can simply call reset_index() to flatten every level of the MultiIndex:

df.reset_index()
flatten MultiIndex DataFrame (Image by author)

5. Flatten rows: flatten a specific level

The first argument in reset_index() controls the level to be flattened. It defaults to None which flatten all levels. To flatten a level or a list of levels, we can pass a number/string or a list:

# Flatten 'Date' level
df.reset_index(2)
df.reset_index('Date')
# Flatten 'Store' and 'Date'
df.reset_index([1, 2])
df.reset_index(['Store', 'Date'])
flatten MultiIndex DataFrame (Image by author)

6. Flatten rows: join row labels

Similarly, we can join the values in the MultiIndex to create a better human-readable single index:

['_'.join(col) for col in df.index.values]
# output
['Oxford_A_01-01-2022',
 'Oxford_B_01-01-2022',
 'Oxford_A_02-01-2022',
 'Oxford_B_02-01-2022',
 'London_C_01-01-2022',
 'London_D_01-01-2022',
 'London_C_02-01-2022',
 'London_D_02-01-2022']

By assigning the result to df.columns, the result will look like this:

df.index = ['_'.join(col) for col in df.index.values]
flatten MultiIndex DataFrame (Image by author)

Conclusion

In this article, we have covered 6 use cases about flattening MultiIndex columns and rows in Pandas. I hope this article will help you to save time in analyzing data.

Thanks for reading. Please check out the Notebook for the source code and stay tuned if you are interested in the practical aspect of machine learning. More tutorials are available from the Github Repo.

Data Analysis
Pandas
Flatten
Multiindex
Data Science
Recommended from ReadMedium