Nicolas Vandeput

Summary

This article demonstrates how to speed up complex computations in Pandas by leveraging vectorization and data types, making them roughly 600 times faster than with the apply function.

Abstract

The article begins by discussing the use of the apply function in Pandas and its limitations in terms of speed. It then introduces the concept of vectorization, which is a more efficient way to perform operations on dataframes. The author demonstrates how to create a vectorized implementation of a function and compares its performance to the apply function, showing a significant improvement in speed. The article also discusses the use of more efficient data types to further reduce the size of dataframes and improve performance. Finally, the author touches on the use of NumPy arrays for vectorization and their benefits for massive datasets.

Bullet points

  • The apply function in Pandas can be slow for complex computations.
  • Vectorization is a more efficient way to perform operations on dataframes.
  • A vectorized implementation of a function can be created using masks and .loc.
  • Vectorization can result in a significant improvement in speed compared to the apply function.
  • Using more efficient data types can reduce the size of dataframes and improve performance.
  • NumPy arrays can be used for vectorization and offer benefits for massive datasets.

Do You Use Apply in Pandas? There is a 600x Faster Way

By leveraging vectorization and data types, you can massively speed up complex computations in Pandas

I recently read yet another article showing you how to speed up the apply function in pandas. These articles will usually tell you to parallelize the apply function to make it 2 to 4 times faster.

Before I show you how to make it 600 times faster, let’s illustrate a use case using the vanilla apply().

Pandas Apply

Let’s imagine you have a pandas dataframe df and want to perform some operation on it.

I will use a dataframe with 1M rows and five columns of integers ranging from 0 to 10 (I am using a setup similar to this article).

df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)), columns=('a','b','c','d','e'))

I want to apply a logic based on ‘e’ that will generate a result based on the four other columns.

def func(a,b,c,d,e):
    if e == 10:
        return c*d
    elif (e < 10) and (e>=5):
        return c+d
    elif e < 5:
        return a+b

Let’s use pandas apply with this function.

df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)

We get a running time of around 11.8 seconds (over 10 runs, with a minimum running time of 11.7 seconds).
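Timings like these could be reproduced with something like the following sketch, which relies on the standard library's timeit module. The 10,000-row frame, the fixed seed, and the helper name run_apply are assumptions chosen so that the example finishes quickly; the measurements quoted in this article use 1M rows.

```python
import timeit

import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b


# Assumption: a much smaller frame than the article's 1M rows, so the sketch runs quickly.
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(0, 11, size=(10_000, 5)), columns=list("abcde"))


def run_apply():
    # Hypothetical helper wrapping the row-by-row apply call being timed.
    return small.apply(
        lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1
    )


# Three repeats of a single call each; report both the mean and the minimum.
timings = timeit.repeat(run_apply, number=1, repeat=3)
print(f"mean: {np.mean(timings):.3f}s, min: {min(timings):.3f}s")
```

Reporting the minimum alongside the mean helps separate the cost of the code itself from noise caused by other processes on the machine.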

Parallelize Pandas Apply with Swifter

You can easily parallelize this process by using swifter.

As swifter is not installed by default with anaconda, you will have to install it first.

conda install -c conda-forge swifter

We can now parallelize apply by calling swifter before apply.

import swifter
df['new'] = df.swifter.apply(lambda x : func(x['a'],x['b'],x['c'],x['d'],x['e']),axis=1)

On my MacBook Air (using an M1 CPU), I got an average running time of 6.71 seconds (over ten runs, with a minimum running time of 6.45 seconds). This is nearly twice as fast as our initial apply implementation.

Parallelization in Python is not a silver bullet: you can only expect slight improvements (if any).

Pandas Vectorization

The fastest way to work with Pandas and NumPy is to vectorize your functions. Conversely, running functions element by element along an array or a series using for loops, list comprehensions, or apply() is bad practice.

Let’s create a vectorized implementation of our previous function. As you can see, I am using two masks to identify relevant cases, then .loc to update the values. Moreover, the default case is assigned without using any mask.

df['new'] = df['c'] * df['d']  # default case: e == 10
mask = df['e'] < 10
df.loc[mask,'new'] = df['c'] + df['d']
mask = df['e'] < 5
df.loc[mask,'new'] = df['a'] + df['b']

Running time is now 0.035 seconds (with a minimum running time of 0.027 seconds). That’s nearly a 200x improvement compared to swifter!
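Before trusting the speed-up, it is worth checking that the masked version returns exactly the same values as the row-by-row function. The sketch below does that on a small sample; the 1,000-row frame, the seed, and the wrapper name vectorized are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b


def vectorized(df):
    # Hypothetical wrapper around the mask-based code, working on a copy.
    out = df.copy()
    out["new"] = out["c"] * out["d"]  # default case: e == 10
    mask = out["e"] < 10
    out.loc[mask, "new"] = out["c"] + out["d"]
    mask = out["e"] < 5
    out.loc[mask, "new"] = out["a"] + out["b"]
    return out["new"]


# Assumption: a small sample is enough to compare the two implementations.
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.integers(0, 11, size=(1_000, 5)), columns=list("abcde"))

expected = sample.apply(
    lambda x: func(x["a"], x["b"], x["c"], x["d"], x["e"]), axis=1
)
matches = bool((vectorized(sample) == expected).all())
print("vectorized result matches apply:", matches)
```

The order of the masks matters: the broader condition (e < 10) is applied first and then overwritten by the narrower one (e < 5), which is what makes the three cases line up with the if/elif chain.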

Vectorization will offer you lightning-fast execution

Lighter Pandas DataFrames

You can speed up the execution even further by using another trick: making your pandas dataframes lighter by using more efficient data types.

Since we know that df only contains integers from 0 to 10, we can reduce the data type from 64 bits to 16 bits.

for col in ('a','b','c','d','e'):
    df[col] = df[col].astype(np.int16)

See how we reduced the size of our dataframe from 38MB to 9.5MB. Obviously, your computer will have an easier time dealing with a nearly 4x smaller object.
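The size reduction can be checked with DataFrame.memory_usage. The sketch below is one way to do it, assuming all five columns are converted and the index is left out of the count; the seed and variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 11, size=(1_000_000, 5), dtype=np.int64),
    columns=list("abcde"),
)

# Bytes used by the five data columns (index excluded).
before = df.memory_usage(index=False).sum()

# Convert every column to 16-bit integers in one call.
df = df.astype(np.int16)
after = df.memory_usage(index=False).sum()

print(f"{before / 2**20:.1f} MiB -> {after / 2**20:.1f} MiB")
# 38.1 MiB -> 9.5 MiB
```

Five columns of 1M values take 8 bytes per value as int64 and 2 bytes per value as int16, which is where the factor of four comes from.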

The running time of our function now decreased to around 0.019 seconds, which is nearly twice as fast as using our initial dataframe (with np.int64).

You might not be lucky enough to have a dataset with only small integer numbers in real life. Nevertheless, you can try to speed up your process by using np.float32 instead of the usual np.float64 or by using pandas categories.
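As a rough illustration of these two options, the sketch below builds a frame with a float column and a repetitive text column and compares memory before and after conversion. The column names price and region and their contents are hypothetical, not taken from the dataset used in this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: one float column and one column with few distinct labels.
df = pd.DataFrame(
    {
        "price": rng.random(n) * 100,  # float64 by default
        "region": rng.choice(["north", "south", "east", "west"], size=n),
    }
)

# deep=True also counts the memory held by the strings themselves.
before = df.memory_usage(deep=True, index=False)

df["price"] = df["price"].astype(np.float32)
df["region"] = df["region"].astype("category")

after = df.memory_usage(deep=True, index=False)
print(pd.DataFrame({"before": before, "after": after}))
```

Keep in mind that float32 carries only about seven significant decimal digits, so this trade is only safe when that precision is enough for your computation.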

Reduce the size of your dataframe by leveraging datatypes

NumPy Vectorization

The code above relies on pandas Series to perform checks and computations. A pandas Series is composed of a NumPy array (to store the data) plus some overhead information (such as the Series index and name).

We can directly access the NumPy Arrays ‘behind’ the Series by using .values to make our vectorization slightly faster. This usually works quite well, except if you need to play with masks and specific columns — as in our example.
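For the conditional example, one possible way to stay in NumPy is to replace the masks and .loc with nested np.where calls on the underlying arrays. This is a swapped-in technique rather than the approach used in this article, and the 1,000-row frame and seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 11, size=(1_000, 5)), columns=list("abcde"))

# Pull out the NumPy arrays behind each Series.
a, b, c, d, e = (df[col].values for col in "abcde")

# np.where(condition, value_if_true, value_if_false), nested for three cases.
df["new"] = np.where(
    e == 10,
    c * d,
    np.where(e >= 5, c + d, a + b),
)
```

Note that np.where evaluates all three candidate arrays in full before picking values, so it trades some extra arithmetic for the removal of index alignment and .loc assignments.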

To show you the power of NumPy vectorization vs. pandas vectorization, let’s create another use case.

You want to compute the sum of columns a, b, c, and d and multiply it by e. Let’s also increase the dataframe’s size to 100M rows (instead of the initial 1M).

df = pd.DataFrame(np.random.randint(0, 11, size=(100000000, 5), dtype=np.int16), columns=('a','b','c','d','e'))

Our new dataframe takes around 900MB.

df['new'] = df[['a','b','c','d']].sum(axis=1) * df['e']

With this 100% pandas execution, the average running time (over 10 trials) is 2.92 seconds (minimum of 2.87).

df['new'] = df[['a','b','c','d']].values.sum(axis=1) * df['e'].values

Using .values, the running time decreased to 2.65 seconds (minimum of 2.62 seconds), a reduction of roughly 9%.
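The two expressions should produce identical numbers; only the container differs (a pandas Series versus a plain NumPy array). A minimal check, assuming a 100,000-row frame instead of 100M rows so that it runs quickly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 11, size=(100_000, 5), dtype=np.int16),
    columns=list("abcde"),
)

# Pure pandas: the result is a Series carrying the dataframe's index.
pandas_result = df[["a", "b", "c", "d"]].sum(axis=1) * df["e"]

# NumPy arrays behind the Series: the result is a plain ndarray.
numpy_result = df[["a", "b", "c", "d"]].values.sum(axis=1) * df["e"].values

same = np.array_equal(pandas_result.to_numpy(), numpy_result)
print(type(pandas_result).__name__, type(numpy_result).__name__, same)
```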

NumPy Arrays can speed up the execution time further on massive datasets

Conclusion

We showed that by using pandas vectorization together with efficient data types, we could reduce the running time of the apply function by a factor of roughly 600 (without using anything other than pandas).

  • Apply: 11.8 seconds
  • Apply + Swifter: 6.71 seconds
  • Pandas vectorization: 0.035 seconds
  • Pandas vectorization + data types: 0.019 seconds