Yufeng


PANDAS

TypeError: unhashable type: ‘list’! How to Drop Duplicates with Lists in Pandas

In this short post, I share several tricks for dealing with list values in a Pandas data frame. You may find them useful when you run into these problems.

Photo by Aziz Acharki on Unsplash

Pandas is one of the most widely used tools in data science. Most of the time, the cells of a data frame hold numerical or categorical values. Sometimes, however, we have to deal with cells that hold lists.

This post covers several issues that come up when working with lists in a Pandas data frame, along with their solutions. Hopefully it’s helpful.

Create a list per group and add back as a new column

Suppose you want to add a new column to an existing data frame, where each value in the new column is the list of values from the same group. Which function should follow groupby?

For example, you have a data frame like this,

toy data frame. (image by author)

and you want to create a new column named ‘list_value’, which combines the values that share a label into a list. The final output is like this,

toy data frame after adding column of list values. (image by author)

Specifically, for label ‘A’, there are three rows with the values ‘hello’, ‘coding’, and ‘!’, so the combined list should be [hello, coding, !]. The combined list is then added back to every row with label ‘A’. Groups ‘B’ and ‘C’ are handled the same way.
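To follow along, here is a minimal sketch of a toy data frame with this shape. The three values for label ‘A’ come from the description above; the rows for ‘B’ and ‘C’ are placeholder values chosen for illustration.

```python
import pandas as pd

# Rows for label 'A' follow the text; rows for 'B' and 'C' are
# placeholder values chosen for illustration.
df = pd.DataFrame(
    {
        "label": ["A", "A", "A", "B", "B", "C"],
        "value": ["hello", "coding", "!", "pandas", "rocks", "bye"],
    }
)

print(df)
```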

If it’s your first time doing this, you may want to run the following code to generate the new column,

df.groupby('label')['value'].transform(list)

However, you will find that it returns a Pandas Series identical to the ‘value’ column instead of lists of values,

Unexpected values by transform(list) (image by author)
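The behaviour can be checked with a small sketch. The toy frame below is an assumption (only the ‘A’ values come from the text), and the comparison simply confirms that each cell still holds a single string.

```python
import pandas as pd

# Toy data: 'A' rows follow the text, 'B' and 'C' rows are placeholders.
df = pd.DataFrame(
    {
        "label": ["A", "A", "A", "B", "B", "C"],
        "value": ["hello", "coding", "!", "pandas", "rocks", "bye"],
    }
)

result = df.groupby("label")["value"].transform(list)

# Compare the transformed column with the original column, element by element.
same_as_original = result.tolist() == df["value"].tolist()
print(result)
print(same_as_original)
```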

You may also try this,

df.groupby('label')['value'].agg(list)

You do get the lists as expected, but the result cannot be added back to the data frame directly because it has one row per group rather than one row per original row.

Unexpected values by agg(list) (image by author)
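A quick way to see the mismatch is to compare lengths. This is a sketch on an assumed toy frame (the ‘B’ and ‘C’ rows are placeholders); the aggregated Series is indexed by label, not by the original row index.

```python
import pandas as pd

# Toy data: 'A' rows follow the text, 'B' and 'C' rows are placeholders.
df = pd.DataFrame(
    {
        "label": ["A", "A", "A", "B", "B", "C"],
        "value": ["hello", "coding", "!", "pandas", "rocks", "bye"],
    }
)

per_group = df.groupby("label")["value"].agg(list)

print(per_group)                 # one list per label
print(len(per_group), len(df))   # number of groups vs. number of rows
```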

The reason is that the returned values are not expanded to the same length as the original data frame. We can use the following two tricks to solve the problem.

№1. transform + *len

The first solution is to expand the result of transform to the same length as the original data frame by repeating each group’s list once per row in the group.

df['list_value'] = df.groupby('label')['value'].transform(lambda x : [x.tolist()]*len(x))

Here, *len(x) is the important part of the solution. The result is like this,

Add transformed list values back to the data frame (image by author)

You may have noticed the square brackets around x.tolist(), which make the result a list of lists. They look odd, but they are needed: [x.tolist()] is a one-element list whose only element is the group’s list, so multiplying it by len(x) produces one copy of the group’s list per row. Without the brackets, x.tolist()*len(x) would repeat the individual values instead. transform expects a result with the same length as the group and treats a list of that length as one value per row, which is also why transform(list) simply returned the original column.
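The role of the outer brackets is easiest to see in plain Python first, followed by the full transform call. This is a sketch on an assumed toy frame (the ‘B’ and ‘C’ rows are placeholders), not a statement about Pandas internals.

```python
import pandas as pd

# Plain Python: multiplying a nested list repeats the inner list as a whole,
# while multiplying a flat list repeats its individual items.
nested = [["a", "b"]] * 2   # [['a', 'b'], ['a', 'b']]
flat = ["a", "b"] * 2       # ['a', 'b', 'a', 'b']

# Toy data: 'A' rows follow the text, 'B' and 'C' rows are placeholders.
df = pd.DataFrame(
    {
        "label": ["A", "A", "A", "B", "B", "C"],
        "value": ["hello", "coding", "!", "pandas", "rocks", "bye"],
    }
)

# One copy of the group's list for every row in the group.
df["list_value"] = df.groupby("label")["value"].transform(
    lambda x: [x.tolist()] * len(x)
)

print(df)
```

Note that within a group every row refers to the same list object, so modifying one of those lists in place would be visible in the other rows of that group.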

№2. map + agg

The second solution is to use map + agg.

df['list_value'] = df['label'].map(df.groupby('label')['value'].agg(list))

Remember that the agg function returns a Series with only three rows, one list per label, as shown above. To expand it to the length of the original data frame, we simply use each row’s label as the key and look up the corresponding list in the result of df.groupby('label')['value'].agg(list).

It gives us,

Add mapped list values back to the data frame (image by author)
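Splitting the one-liner into two steps makes the lookup explicit. The sketch below uses an assumed toy frame (the ‘B’ and ‘C’ rows are placeholders), and lists_by_label is a name introduced here for illustration.

```python
import pandas as pd

# Toy data: 'A' rows follow the text, 'B' and 'C' rows are placeholders.
df = pd.DataFrame(
    {
        "label": ["A", "A", "A", "B", "B", "C"],
        "value": ["hello", "coding", "!", "pandas", "rocks", "bye"],
    }
)

# Step 1: one list per label, indexed by label.
lists_by_label = df.groupby("label")["value"].agg(list)

# Step 2: for every row, look up the list that belongs to its label.
df["list_value"] = df["label"].map(lists_by_label)

print(df)
```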

The two solutions above are very useful when you need to create list values from groups in your analysis.

Always fail to drop duplicates

If you have columns made of list values, you cannot drop duplicates the way you would with numerical or categorical values. For example, in the data frame above, we want to drop duplicates based on two columns, label and list_value, to keep one unique row for each group. What would you do?

Some people will directly use the built-in function in Pandas, drop_duplicates(),

df.drop_duplicates()

However, you will get a TypeError like the one below,

TypeError: unhashable type: 'list'

That’s because lists are mutable and therefore not hashable, and drop_duplicates() needs to hash the values in each row. So, how do we solve the problem?
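The same limitation shows up in plain Python, which is a useful way to see where the error comes from. The sketch below is illustrative: the small frame is made up, and the flag names are introduced here only to record what happened.

```python
import pandas as pd

# A tuple of strings can be hashed.
tuple_hash = hash(("hello", "coding", "!"))

# A list cannot: hash() raises TypeError.
try:
    hash(["hello", "coding", "!"])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# A made-up frame with a list column hits the same error in drop_duplicates().
df = pd.DataFrame(
    {
        "label": ["A", "A"],
        "list_value": [["hello", "coding", "!"], ["hello", "coding", "!"]],
    }
)

try:
    df.drop_duplicates()
    raised_type_error = False
except TypeError as err:
    raised_type_error = True
    print(err)

print(list_is_hashable, raised_type_error)
```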

№1. to string.

The first solution is to convert everything to strings, record the index of the unique rows, and use it to subset the original data frame.

df[['label','list_value']].loc[df[['label','list_value']].astype(str).drop_duplicates().index]

It will return the following data frame,

drop_duplicates after transform to string (image by author)

The idea is simple: astype(str) converts the values into strings, and .index after .drop_duplicates() records the index of the unique rows. .loc then selects those rows from the original data frame based on the recorded index.
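Broken into steps, the one-liner can be sketched as follows. The toy frame is an assumption (the ‘B’ and ‘C’ rows are placeholders), the intermediate names are introduced for illustration, and the approach assumes the data frame has unique index labels, as the default index does.

```python
import pandas as pd

# Toy data: 'A' rows follow the text, 'B' and 'C' rows are placeholders.
df = pd.DataFrame(
    {
        "label": ["A", "A", "A", "B", "B", "C"],
        "value": ["hello", "coding", "!", "pandas", "rocks", "bye"],
    }
)
df["list_value"] = df["label"].map(df.groupby("label")["value"].agg(list))

cols = ["label", "list_value"]

# Step 1: a string copy of the two columns, which is hashable.
as_text = df[cols].astype(str)

# Step 2: index labels of the first occurrence of each unique row.
unique_index = as_text.drop_duplicates().index

# Step 3: select those rows from the original frame, so the lists are kept.
deduped = df.loc[unique_index, cols]

print(deduped)
```

Because only the index is taken from the string copy, the list_value column in the result still contains real lists rather than their string form.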

№2. list to tuple.

The second solution is to use tuples instead of lists in the data frame. Remember how we built the column of list values. This time we change list to tuple,

df['list_value'] = df['label'].map(df.groupby('label')['value'].agg(tuple))

It gives us,

Add tuple value to the new column (image by author)

After that, we drop the duplicates,

df[['label','list_value']].drop_duplicates()

We won’t receive any error this time, and here’s the result,

Remove duplicates using tuple value (image by author)
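Put together, the tuple approach can be sketched end to end. The toy frame is an assumption (the ‘B’ and ‘C’ rows are placeholders); the two Pandas calls are the ones shown above.

```python
import pandas as pd

# Toy data: 'A' rows follow the text, 'B' and 'C' rows are placeholders.
df = pd.DataFrame(
    {
        "label": ["A", "A", "A", "B", "B", "C"],
        "value": ["hello", "coding", "!", "pandas", "rocks", "bye"],
    }
)

# Build the grouped column with tuples, which are hashable.
df["list_value"] = df["label"].map(df.groupby("label")["value"].agg(tuple))

# drop_duplicates() now works directly on the two columns.
deduped = df[["label", "list_value"]].drop_duplicates()

print(deduped)
```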

Easy, right? These two solutions are very commonly used to tackle the problem of dropping duplicates on list values.

That’s it! Hope this article is helpful!

Cheers!

Photo by Lauren Richmond on Unsplash
