PANDAS
TypeError: unhashable type: 'list'! How to Drop Duplicates with Lists in Pandas
In this short post, I share several tricks for dealing with list values in a Pandas data frame. You may find them useful when you run into such problems.
Pandas is one of the most widely used tools in data science. Usually, people only work with numerical or categorical values in the cells of a data frame. Sometimes, however, we have to deal with values in list format.
In this post, I go through several potential issues with lists in a Pandas data frame, along with their solutions. Hopefully it's helpful.
Create a list per group and add back as a new column
If you want to add a new column to an existing data frame, where each value in the new column is the list of values from the same group, which function should you use after groupby?
For example, you have a data frame like this,

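If you want to follow along, here is a minimal sketch that builds a frame of the same shape. Only group 'A' is spelled out in this post, so the values for 'B' and 'C' below are placeholders of my own.
import pandas as pd
# Three groups: 'A' has three rows; the 'B' and 'C' values are made up.
df = pd.DataFrame({
    'label': ['A', 'A', 'A', 'B', 'B', 'C'],
    'value': ['hello', 'coding', '!', 'foo', 'bar', 'baz'],
})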
Now you want to create a new column named 'list_value' that combines the values from the same label into a list. The final output looks like this,

Specifically, for label 'A', there are three rows with the values 'hello', 'coding', and '!', so the combined list should be [hello, coding, !]. That combined list is then added back to every row of label 'A'. The same manipulation applies to groups 'B' and 'C'.
If it's your first time doing this, you might run the following code to generate the new column,
df.groupby('label')['value'].transform(list)
However, you will find that it returns a Pandas Series that is exactly the same as the 'value' column, rather than lists of values,

You may also try this,
df.groupby('label')['value'].agg(list)
It turns out that you get the lists as expected, but the result can't be added back to the data frame because its dimensions differ from those of the original data frame.

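To see the mismatch concretely, here is a quick check on the placeholder frame sketched earlier (the exact numbers will depend on your data),
# agg(list) collapses each group into a single entry:
agg_result = df.groupby('label')['value'].agg(list)
print(agg_result.shape)  # (3,) -- one combined list per label
print(len(df))           # 6   -- one row per original record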
The reason is that the returned values are not expanded to the same length as the original data frame. We can use the following two tricks to solve the problem.
№1. transform + *len
The first solution is to expand the result of transform to the same length as the original data frame by multiplying each group's list by the length of that group.
df['list_value'] = df.groupby('label')['value'].transform(lambda x: [x.tolist()]*len(x))
Here, *len(x) is the important part of the solution. The result is like this,

You may have noticed the square brackets around x.tolist(), which effectively turn the result into a list of lists. It looks odd, but it's what makes the trick work: [x.tolist()]*len(x) produces len(x) copies of the group's list, one per row, so transform treats each inner list as a single value instead of unpacking it back into individual elements. Perhaps a future version of Pandas will offer a cleaner way, but for now it's an effective way to construct list values as a new column.
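As a quick sanity check on the placeholder frame, every row of group 'A' should now hold the same full list,
# Assuming the transform above has been run:
print(df.loc[df['label'] == 'A', 'list_value'].tolist())
# [['hello', 'coding', '!'], ['hello', 'coding', '!'], ['hello', 'coding', '!']]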
№2. map + agg
The second solution is to use map + agg.
df['list_value'] = df['label'].map(df.groupby('label')['value'].agg(list))
Remember that the agg function produces a Series of three lists, one per label, as shown above. To expand it to the same length as the original data frame, we simply use each row's label as the key and look up the corresponding list in the result of df.groupby('label')['value'].agg(list).
It gives us,

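If the map step feels opaque, this is the lookup it performs under the hood (again on the placeholder frame),
# agg(list) builds a label -> list lookup table (a Series indexed by label),
# and map() replaces each entry of df['label'] with the matching list:
lookup = df.groupby('label')['value'].agg(list)
print(lookup.loc['A'])  # ['hello', 'coding', '!']
print(df['label'].map(lookup).head(3).tolist())
# [['hello', 'coding', '!'], ['hello', 'coding', '!'], ['hello', 'coding', '!']]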
The two solutions above are very useful when you need to create list values from groups in your analysis.
Always fail to drop duplicates
If you have columns made of list values, it's not possible to drop duplicates the way you would with numerical or categorical values. For example, in the data frame above, suppose we want to drop duplicates based on two columns, label and list_value, to keep one unique row per group. What would you do?
Some people will reach directly for the built-in Pandas function, drop_duplicates(),
df.drop_duplicates()
However, you will get a TypeError like the one below,
TypeError: unhashable type: 'list'
That's because list values are not hashable, which means Pandas can't compare the rows to find duplicates. So, how do we solve the problem?
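Before jumping into the fixes, a quick check in plain Python shows what Pandas is complaining about,
# Lists are mutable, so Python refuses to hash them:
try:
    hash(['hello', 'coding', '!'])
except TypeError as err:
    print(err)  # unhashable type: 'list'
# Tuples are immutable, so they hash just fine, which is exactly
# why the second fix below converts the lists to tuples:
print(hash(('hello', 'coding', '!')))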
№1. to string
The first solution is to convert everything to strings, record the indices of the unique rows, and then subset the original data frame.
df[['label','list_value']].loc[df[['label','list_value']].astype(str).drop_duplicates().index]
It will return the following data frame,

The idea is simple: astype(str) converts the values to strings, .index after .drop_duplicates() records the indices of the unique rows, and .loc then selects the rows based on those indices.
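If the one-liner above is hard to read, here is the same logic split into steps, purely as a readability sketch,
subset = df[['label', 'list_value']]
# Cast to strings so the rows become hashable and comparable:
unique_idx = subset.astype(str).drop_duplicates().index
# Then pull the original, still-list-valued rows back out by index:
deduped = subset.loc[unique_idx]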
№2. list to tuple
The second solution is to use tuples instead of lists in the data frame. Remember the way we built the column of list values; this time, we aggregate into tuples instead,
df['list_value'] = df['label'].map(df.groupby('label')['value'].agg(tuple))
It gives us,

After that, we drop duplicates,
df[['label','list_value']].drop_duplicates()
We won't receive any error this time, and here's the result,

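One small follow-up, in case downstream code expects lists rather than tuples: you can convert the column back after deduplicating. This is just a sketch, assuming the tuple-valued list_value column from above,
unique_df = df[['label', 'list_value']].drop_duplicates().reset_index(drop=True)
# Convert the tuples back to lists if the rest of your pipeline expects lists:
unique_df['list_value'] = unique_df['list_value'].apply(list)
print(unique_df)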
Easy, right? These two solutions are commonly used to tackle the problem of dropping duplicates on list values.
That’s it! Hope this article is helpful!
Cheers!