avatarRahul Agarwal

Summary

The provided content discusses advanced usage of apply and lambda functions in pandas to perform complex data manipulation tasks, including creating new columns, filtering dataframes, changing column types, and tracking progress during computation.

Abstract

The article delves into the power of apply and lambda functions within the pandas library, emphasizing their utility in handling intricate data operations. It illustrates how these functions can be applied to create custom columns based on complex logic, such as adjusting movie ratings according to genre-specific rules. The author also demonstrates advanced filtering techniques using apply to evaluate conditions that cannot be directly expressed with standard pandas operators. Additionally, the article covers the use of apply for type conversion of dataframe columns, offering a flexible alternative to the astype method, especially when dealing with non-standard formats. Finally, the article introduces the progress_apply function from the tqdm package, which enhances the standard apply function with a progress bar for long-running operations. The author advocates for the versatility and readability of apply and lambda functions, suggesting that mastery of these tools can significantly streamline data manipulation workflows in pandas.

Opinions

  • The author expresses a strong preference for using apply and lambda functions due to their flexibility and the ability to encapsulate complex logic in a readable manner.
  • They believe that sticking to certain pandas conventions, such as apply and lambda, has proven beneficial over the years, despite the library's continuous evolution and introduction of new functionalities.
  • The author finds apply to be more readable and better suited to their workflow compared to alternative methods like map or applymap.
  • They recommend the use of progress_apply from the tqdm package to provide visual feedback on the progress of time-consuming apply operations, indicating that this can save a significant amount of time.
  • The author suggests that proficiency with apply and lambda functions can alleviate the need to remember various pandas syntaxes and provides a link to a course for those interested in improving their intermediate-level Python skills.
  • The author encourages feedback and constructive criticism, inviting readers to engage with them on Medium and Twitter for further discussion on the topic.
Photo by chuttersnap on Unsplash

Apply and Lambda usage in pandas

Learn these to master Pandas

Pandas is a wonderful tool to have at your disposal.

I have been working with Pandas for years and it never ceases to amaze me with its new functionalities, shortcuts and multiple ways of doing a particular thing.

But I have realized that sticking to some of the conventions I have learned has served me well over the years.

apply and lambda are some of the best things I have learned to use with pandas.

I use apply and lambda anytime I get stuck while building a complex logic for a new column or filter.

And that happens a lot when the business comes to you with custom requests.

This post is about demonstrating the power of apply and lambda to you.

I will be using a data set of 1,000 popular movies on IMDB in the last 10 years. You can also follow along in the Kaggle Kernel.

Creating a Column

Complex columns

You can create a new column in many ways.

If you want a column that is a sum or difference of columns, you can pretty much use simple basic arithmetic. Here I get the average rating based on IMDB and Normalized Metascore.

df['AvgRating'] = (df['Rating'] + df['Metascore']/10)/2

But sometimes we may need to build complex logic around the creation of new columns.

To give you a convoluted example, let’s say that we want to build a custom movie score based on a variety of factors.

Say, If the movie is of the thriller genre, I want to add 1 to the IMDB rating subject to the condition that IMDB rating remains less than or equal to 10. And If a movie is a comedy I want to subtract 1 from the rating.

How do we do that?

Whenever I get a hold of such complex problems, I use apply/lambda. Let me first show you how I will do this.

def custom_rating(genre,rating):
    if 'Thriller' in genre:
        return min(10,rating+1)
    elif 'Comedy' in genre:
        return max(0,rating-1)
    else:
        return rating
        
df['CustomRating'] = df.apply(lambda x: custom_rating(x['Genre'],x['Rating']),axis=1)

The general structure is:

  • You define a function that will take the column values you want to play with to come up with your logic. Here the only two columns we end up using are genre and rating.
  • You use an apply function with lambda along the row with axis=1. The general syntax is:
df.apply(lambda x: func(x['col1'],x['col2']),axis=1)

You should be able to create pretty much any logic using apply/lambda since you just have to worry about the custom function.

Filtering a dataframe

Filtering….

Pandas make filtering and subsetting dataframes pretty easy. You can filter and subset dataframes using normal operators and &,|,~ operators.

# Single condition: dataframe with all movies rated greater than 8
df_gt_8 = df[df['Rating']>8]
# Multiple conditions: AND - dataframe with all movies rated greater than 8 and having more than 100000 votes
And_df = df[(df['Rating']>8) & (df['Votes']>100000)]
# Multiple conditions: OR - dataframe with all movies rated greater than 8 or having a metascore more than 90
Or_df = df[(df['Rating']>8) | (df['Metascore']>80)]
# Multiple conditions: NOT - dataframe with all emovies rated greater than 8 or having a metascore more than 90 have to be excluded
Not_df = df[~((df['Rating']>8) | (df['Metascore']>80))]

Pretty simple stuff.

But sometimes we may need to do complex filtering operations.

And sometimes we need to do some operations which we won’t be able to do using just the above format.

For instance: Let us say we want to filter those rows where the number of words in the movie title is greater than or equal to than 4.

How would you do it?

Trying the below will give you an error. Apparently, you cannot do anything as simple as split with a series.

new_df = df[len(df['Title'].split(" "))>=4]
-------------------------------------------
AttributeError: 'Series' object has no attribute 'split'

One way is to first create a column which contains no of words in the title using apply and then filter on that column.

#create a new column
df['num_words_title'] = df.apply(lambda x : len(x['Title'].split(" ")),axis=1)
#simple filter on new column
new_df = df[df['num_words_title']>=4]

And that is a perfectly fine way as long as you don’t have to create a lot of columns. But, I prefer this:

new_df = df[df.apply(lambda x : len(x['Title'].split(" "))>=4,axis=1)]

What I did here is that my apply function returns a boolean which can be used to filter.

Now once you understand that you just have to create a column of booleans to filter, you can use any function/logic in your apply statement to get however complex a logic you want to build.

Let us see another example. I will try to do something a little complex to just show the structure.

We want to find movies for which the revenue is less than the average revenue for that particular year?

year_revenue_dict = df.groupby(['Year']).agg({'Rev_M':np.mean}).to_dict()['Rev_M']
def bool_provider(revenue, year):
    return revenue<year_revenue_dict[year]
    
new_df = df[df.apply(lambda x : bool_provider(x['Rev_M'],x['Year']),axis=1)]

We have a function here which we can use to write any logic. That provides a lot of power for advanced filtering as long as we can play with simple variables.

Change Column Types

I even use apply to change the column types since I don’t want to remember the syntax for changing column type and also since it lets me do much more complex things.

The normal syntax to change column type is astype in Pandas. So if I had a column named price in my data in an str format. I could do this:

df['Price'] = newDf['Price'].astype('int')

But sometimes it won’t work as expected.

You might get the error: ValueError: invalid literal for long() with base 10: ‘13,000’. That is you cannot cast a string with “,” to an int. To do that we first have to get rid of the comma.

After facing this problem time and again, I have stopped using astype altogether now and just use apply to change column types.

df['Price'] = df.apply(lambda x: int(x['Price'].replace(',', '')),axis=1)

And lastly there is progress_apply

progress_apply is a single function that comes with tqdm package.

And this has saved me a lot of time.

Sometimes when you have got a lot of rows in your data, or you end up writing a pretty complex apply function, you will see that apply might take a lot of time.

I have seen apply taking hours when working with Spacy. In such cases, you might like to see the progress bar with apply.

You can use tqdm for that.

After the initial imports at the top of your notebook, just replace apply with progress_apply and everything remains the same.

from tqdm import tqdm, tqdm_notebook
tqdm_notebook().pandas()
df.progress_apply(lambda x: custom_rating_function(x['Genre'],x['Rating']),axis=1)

And you get progress bars.

Conclusion

apply and lambda functionality lets you take care of a lot of complex things while manipulating data.

I feel that I don’t have to worry about a lot of stuff while using Pandas since I can use apply well.

In this post, I tried to explain how it works. And there might be other ways to do whatever I have done above.

But I like to stick with apply/lambda in place of map/applymap because I find it more readable and well suited to my workflow.

If you want to learn more about Python 3, I would like to call out an excellent course on Learn Intermediate level Python from the University of Michigan. Do check it out.

I am going to be writing more of such posts in the future too. Let me know what you think about the series. Follow me up at Medium or Subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.

Data Science
Machine Learning
Programming
Towards Data Science
Python
Recommended from ReadMedium