Do You Use Apply in Pandas? There is a 600x Faster Way

By leveraging vectorization and data types, you can massively speed up complex computations in Pandas

I recently read yet another article showing you how to speed up the apply function in pandas. These articles will usually tell you to parallelize the apply function to make it 2 to 4 times faster.

Before I show you how to make it 600 times faster, let’s illustrate a use case using the vanilla apply().

Pandas Apply

Let’s imagine you have a pandas dataframe df and want to perform some operation on it.

I will use a dataframe with 1M rows and five columns (with integers ranging from 0 to 10; I am using a setup similar to this article):

df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)), columns=('a','b','c','d','e'))

I want to apply some logic based on ‘e’ that will generate a result based on the four other columns.

def func(a,b,c,d,e):
    if e == 10:
        return c*d
    elif (e < 10) and (e>=5):
        return c+d
    elif e < 5:
        return a+b

Let’s use pandas apply with this function.

df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)

We get a running time of around 11.8 seconds (over 10 runs, with a minimum running time of 11.7 seconds).
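For reference, here is a minimal sketch of how such timings could be collected with the standard library's timeit module. It assumes a much smaller dataframe (10,000 rows) and only three runs so that it finishes quickly; the helper name run_apply is hypothetical and not part of the original setup.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a smaller frame than the article's 1M rows, so the sketch runs fast.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 11, size=(10_000, 5)), columns=list("abcde"))


def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b


def run_apply():
    # Hypothetical wrapper around the row-wise apply call being timed.
    return df.apply(lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1)


# Three independent runs, one call per run (the article used ten runs).
timings = timeit.repeat(run_apply, number=1, repeat=3)
print(f"mean: {np.mean(timings):.3f} s, min: {min(timings):.3f} s")
```

Reporting both the mean and the minimum is useful because the minimum is less affected by whatever else your machine is doing in the background.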

Parallelize Pandas Apply with Swifter

You can easily parallelize this process by using swifter.

As swifter is not installed by default with anaconda, you will have to install it first.

conda install -c conda-forge swifter

We can now parallelize apply by calling swifter before apply:

import swifter
df['new'] = df.swifter.apply(lambda x : func(x['a'],x['b'],x['c'],x['d'],x['e']),axis=1)

On my MacBook Air (using an M1 CPU), I got an average running time of 6.71 seconds (over ten runs, with a minimum running time of 6.45 seconds). This is nearly twice as fast as our initial apply implementation.

Parallelization in Python is not a silver bullet: you can only expect slight improvements (if any).

Pandas Vectorization

The fastest way to work with pandas and NumPy is to vectorize your functions. On the other hand, running functions element by element along an array or a series using for loops, list comprehensions, or apply() is bad practice.

Let’s create a vector implementation of our previous function. As you can see, I am using two masks to identify relevant cases, then .loc to update the values. Moreover, the default case is assigned without using any mask.

df['new'] = df['c'] * df['d']  # default case: e == 10
mask = df['e'] < 10
df.loc[mask,'new'] = df['c'] + df['d']
mask = df['e'] < 5
df.loc[mask,'new'] = df['a'] + df['b']

Running time is now 0.035 seconds (with a minimum running time of 0.027 seconds). That’s nearly a 200x improvement compared to swifter!
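As an alternative way to write the same logic, here is a sketch using np.select, which evaluates a list of conditions in order and picks the first matching choice. This is not the approach timed above, just another vectorized formulation; the smaller 1,000-row dataframe and the new_select column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: a small frame, only to check that both formulations agree.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 11, size=(1_000, 5)), columns=list("abcde"))

# Mask-based version (same logic as the masks-and-.loc approach).
df["new"] = df["c"] * df["d"]  # default case: e == 10
mask = df["e"] < 10
df.loc[mask, "new"] = df["c"] + df["d"]
mask = df["e"] < 5
df.loc[mask, "new"] = df["a"] + df["b"]

# np.select version: conditions are checked in order, first match wins.
e = df["e"].to_numpy()
conditions = [e < 5, e < 10, e == 10]
choices = [
    (df["a"] + df["b"]).to_numpy(),
    (df["c"] + df["d"]).to_numpy(),
    (df["c"] * df["d"]).to_numpy(),
]
df["new_select"] = np.select(conditions, choices)

# Both columns should hold identical values.
print((df["new"] == df["new_select"]).all())
```

Because the conditions are evaluated in order, the e < 10 test only applies to rows that did not already match e < 5, which mirrors the way the second mask overwrites the first in the .loc version.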

Vectorization will offer you lightning-fast execution

Lighter Pandas DataFrames

You can speed up the execution even further by using another trick: making your pandas dataframes lighter by using more efficient data types.

Since we know that df only contains integers from 0 to 10, we can reduce the data type from 64 bits to 16 bits.

for col in ('a','b','c','d','e'):
    df[col] = df[col].astype(np.int16)

See how we reduced the size of our dataframe from 38MB to 9.5MB. Obviously, your computer will have an easier time dealing with a 4x smaller object.
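If you want to check the size reduction yourself, the following sketch measures it with DataFrame.memory_usage. It assumes all five columns start as 64-bit integers and are all converted to 16-bit; the to_mib helper is a hypothetical convenience function.

```python
import numpy as np
import pandas as pd

# Assumption: five int64 columns, as in the 1M-row setup.
df = pd.DataFrame(
    np.random.randint(0, 11, size=(1_000_000, 5), dtype=np.int64),
    columns=list("abcde"),
)


def to_mib(n_bytes):
    # Hypothetical helper: convert bytes to mebibytes.
    return n_bytes / 1024**2


before = df.memory_usage(index=False).sum()

for col in df.columns:
    df[col] = df[col].astype(np.int16)

after = df.memory_usage(index=False).sum()

print(f"before: {to_mib(before):.1f} MiB, after: {to_mib(after):.1f} MiB")
```

With 8 bytes per value before and 2 bytes after, the data shrinks by exactly a factor of four (the index is excluded here via index=False).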

The running time of our function now decreased to around 0.019 seconds, which is nearly twice as fast as using our initial dataframe (with np.int64).

You might not be lucky enough to have a dataset with only small integer numbers in real life. Nevertheless, you can try to speed up your process by using np.float32 instead of the usual np.float64 or by using pandas categories.
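Here is a small sketch of what that could look like. The price and country columns are hypothetical example data, not part of the dataframe used in this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical example data: one float column and one low-cardinality text column.
df = pd.DataFrame(
    {
        "price": rng.random(100_000) * 100,  # float64 by default
        "country": rng.choice(["FR", "DE", "US"], size=100_000),  # strings
    }
)

before = df.memory_usage(deep=True).sum()

# float32 halves the storage per value, at the cost of precision
# (roughly 7 significant decimal digits instead of about 15 to 16).
df["price"] = df["price"].astype(np.float32)

# A categorical column stores each distinct label once plus small integer codes.
df["country"] = df["country"].astype("category")

after = df.memory_usage(deep=True).sum()

print(f"before: {before / 1024**2:.2f} MiB, after: {after / 1024**2:.2f} MiB")
```

Categories pay off when a column has few distinct values compared to its length. Whether float32 is acceptable depends on how much precision your computation needs, so treat it as something to test rather than a default.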

Reduce the size of your dataframe by leveraging datatypes

NumPy Vectorization

The code above relies on pandas Series to perform checks and computations. A pandas Series is composed of a NumPy array (to store the data) plus some overhead information (such as the Series index and name).

We can directly access the NumPy arrays ‘behind’ the Series by using .values to make our vectorization slightly faster. This usually works quite well, except if you need to play with masks and specific columns, as in our example.

To show you the power of NumPy vectorization vs. pandas vectorization, let’s create another use case.

You want to compute the sum of columns a, b, c, and d and multiply it by e. Let’s also increase the dataframe’s size to 100M rows (instead of the initial 1M).

df = pd.DataFrame(np.random.randint(0, 11, size=(100000000, 5), dtype=np.int16), columns=('a','b','c','d','e'))

Our new dataframe takes around 900MB.

df['new'] = df[['a','b','c','d']].sum(axis=1) * df['e']

With this 100% pandas execution, the average running time (over 10 trials) is 2.92 seconds (minimum of 2.87 seconds).

df['new'] = df[['a','b','c','d']].values.sum(axis=1) * df['e'].values

Using .values, the running time decreased to 2.65 seconds (minimum of 2.62 seconds), a 10% reduction.
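To convince yourself that both versions compute the same thing, here is a minimal sketch on a much smaller dataframe (100,000 rows, an assumption to keep it quick). It uses .to_numpy(), which the pandas documentation recommends over .values; for plain numeric columns like these, both return a NumPy array.

```python
import numpy as np
import pandas as pd

# Assumption: 100,000 rows instead of 100M, only to compare results.
df = pd.DataFrame(
    np.random.randint(0, 11, size=(100_000, 5), dtype=np.int16),
    columns=list("abcde"),
)

# 100% pandas: the result is a Series that carries the dataframe's index.
pandas_result = df[["a", "b", "c", "d"]].sum(axis=1) * df["e"]

# NumPy: the result is a bare ndarray, with no index to align or carry around.
numpy_result = df[["a", "b", "c", "d"]].to_numpy().sum(axis=1) * df["e"].to_numpy()

print(type(pandas_result).__name__, type(numpy_result).__name__)
print(np.array_equal(pandas_result.to_numpy(), numpy_result))
```

The NumPy version skips index handling, which is a plausible source of the small gain. It also means you lose index alignment, so it is only safe when both operands come from the same dataframe in the same row order.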

NumPy Arrays can speed up the execution time further on massive datasets

Conclusion

We showed that by using pandas vectorization together with efficient data types, we could reduce the running time of the apply function by a factor of roughly 600 (without using anything other than pandas).

Apply: 11.8 seconds
Apply + Swifter: 6.71 seconds
Pandas vectorization: 0.035 seconds
Pandas vectorization + data types: 0.019 seconds
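The speedup factors follow directly from these four timings. As a quick sanity check, this sketch computes each one relative to the plain apply baseline (the dictionary simply restates the numbers listed above).

```python
# Running times in seconds, as reported in this article.
timings = {
    "Apply": 11.8,
    "Apply + Swifter": 6.71,
    "Pandas vectorization": 0.035,
    "Pandas vectorization + data types": 0.019,
}

baseline = timings["Apply"]

# Speedup of each approach relative to the vanilla apply().
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```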