400x Faster Pandas Data Frame Iteration
Avoid using the iterrows() function
Data processing and data wrangling are important components of a data science model development pipeline. Data scientists are often said to spend about 80% of their time preparing datasets for modeling. Data wrangling and exploration on a large dataset can become tedious, leaving two options: wait a long time for the computations to finish, or shift to some form of parallel processing.
Pandas is one of the most popular Python libraries and has a vast API, but it scales poorly. For large datasets, iterating with loops can take a long time, sometimes hours, and even for small datasets, iterating over the data frame with standard loops is quite time-consuming.
In this article, we will discuss techniques to speed up iteration over large datasets.
Pandas Built-In Function: iterrows()
iterrows() is a built-in Pandas function that iterates over the rows of a data frame. For each row, it yields a pair: the row's index label and the row's data as a Series keyed by column name.
To compare benchmark times, I am using a dataset with 10 million records and 5 columns. The dataset has a string feature, 'name', whose values need to be stripped of leading and trailing spaces.
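To make the shape of those pairs concrete, here is a minimal sketch. The three-row data frame and its 'age' column are made up for illustration and are not the benchmark dataset.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
df = pd.DataFrame({"name": [" Ann ", "Bob  ", "  Cy"], "age": [31, 45, 27]})

pairs = list(df.iterrows())
first_index, first_row = pairs[0]

# Each item is an (index, Series) pair; the Series is keyed by column name.
print(first_index)        # index label of the first row
print(type(first_row))    # a pandas Series
print(first_row["name"])  # value of the 'name' column in that row
```

Building a new Series for every row is a large part of why this approach gets slow on big data frames.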
temp = []
for i, row in df.iterrows():
    name_new = row['name'].strip()
    temp.append(name_new)
The code snippet took nearly 1967 seconds to execute, which includes iterating over the data frame and calling strip on each 'name' value.
Using iterrows() is not recommended, not only because of its poor performance but also because it does not preserve dtypes across rows. You can use the itertuples() function instead, which preserves the types.
Now let's look at other techniques for iterating over the data frame and compare their execution times.
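The dtype difference is easy to see on a tiny example. The sketch below uses two made-up data frames (the 'qty' and 'price' columns are hypothetical and not part of the benchmark dataset): one to show the type change, one to show the same strip loop written with itertuples().

```python
import pandas as pd

# Hypothetical numeric frame: one integer column and one float column.
df_num = pd.DataFrame({"qty": [1, 2], "price": [1.5, 2.5]})

# iterrows() packs each row into a single Series, so the integer
# value is upcast to float to share one dtype with 'price'.
_, row = next(df_num.iterrows())
print(type(row["qty"]))

# itertuples() yields namedtuples, so each field keeps its own type.
tup = next(df_num.itertuples(index=False))
print(type(tup.qty))

# The same strip loop as above, written with itertuples().
df = pd.DataFrame({"name": [" Ann ", "Bob  "]})
temp = [r.name.strip() for r in df.itertuples(index=False)]
```

Fields are read as attributes (r.name) rather than by key, which only works when the column names are valid Python identifiers.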
Iteration by Index:
A data frame is a Pandas object with rows and columns. The rows and columns are indexed, so one can loop over the row positions to iterate through the rows.
temp = []
for idx in range(df.shape[0]):
    name_new = df['name'].iloc[idx].strip()
    temp.append(name_new)
It took nearly 223 seconds (approximately 9x faster than the iterrows() function) to iterate over the data frame and perform the strip operation.
Using to_dict():
You can iterate over the data much faster by first converting the Pandas data frame into plain Python dictionaries with the .to_dict() function. Calling it with 'records' returns a list containing one dictionary per row, and iterating over that list is much faster than using the iterrows() function.
df_dict = df.to_dict('records')
temp = []
for row in df_dict:
    name_new = row['name'].strip()
    temp.append(name_new)
Iterating over the dictionary form of the dataset takes about 25 seconds, which is 77x faster than the iterrows() function.
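For readers who have not used the 'records' orientation before, this small sketch shows what the converted object looks like. The two-row data frame and its 'age' column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
df = pd.DataFrame({"name": [" Ann ", "Bob  "], "age": [31, 45]})

records = df.to_dict("records")

# records is a plain Python list with one dict per row;
# each dict maps column names to that row's values.
print(records[0])

# From here on, the loop touches only built-in Python objects.
temp = [row["name"].strip() for row in records]
print(temp)
```

Keep in mind that the conversion copies the whole data frame into Python objects, so it costs extra memory on top of the original data.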
Using apply():
apply() is a built-in Pandas function that takes a function and applies it to each value of a Pandas Series. apply() still calls that function once per value, so it is not inherently fast, but it avoids the per-row overhead of the loops above and is a convenient way to transform data according to the required conditions.
temp = df['name'].apply(lambda x: x.strip())
The apply() function takes 4.60 seconds to execute, which is 427x faster than the iterrows() function.
The image at the start of this article summarizes these benchmark times, which were measured on a system with 8 cores and 32 GB of RAM.
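If you want to check the relative ordering on your own machine, a rough timing harness like the one below can help. It is only a sketch: the synthetic 5,000-row data frame, the make_frame helper, and the with_* function names are all made up here, and the absolute numbers will not match the 10-million-row results above.

```python
import time

import pandas as pd


def make_frame(n_rows=5_000):
    # Synthetic stand-in for the real dataset: padded strings in 'name'.
    return pd.DataFrame({"name": [f"  user{i}  " for i in range(n_rows)]})


def with_iterrows(df):
    return [row["name"].strip() for _, row in df.iterrows()]


def with_index(df):
    return [df["name"].iloc[i].strip() for i in range(df.shape[0])]


def with_to_dict(df):
    return [row["name"].strip() for row in df.to_dict("records")]


def with_apply(df):
    return df["name"].apply(lambda x: x.strip()).tolist()


def benchmark(df):
    # Time each approach once and keep its output for a sanity check.
    timings, results = {}, {}
    for func in (with_iterrows, with_index, with_to_dict, with_apply):
        start = time.perf_counter()
        results[func.__name__] = func(df)
        timings[func.__name__] = time.perf_counter() - start
    return timings, results


df = make_frame()
timings, results = benchmark(df)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

A single run per approach is noisy, so for anything more than a quick look it is worth repeating each measurement several times, for example with the standard timeit module.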
Conclusion:
In this article, we have discussed several techniques to iterate over a Pandas data frame and compared their execution times. The iterrows() function is recommended only in very specific cases. One can easily shift from iterrows() or the indexing approach to the dictionary-based iteration technique, which speeds up the workflow by 77x.
The apply() function is around 400x faster, but it has limited use, and one needs to make a lot of changes in the code to shift to this approach. I personally convert my data frame to a dictionary and then proceed with the iteration.
Thank You for Reading