Satyam Kumar

Summary

The article discusses efficient methods to iterate over large Pandas Data Frames, emphasizing the inefficiency of iterrows() and presenting faster alternatives such as itertuples(), index-based iteration, using to_dict(), and the apply() function.

Abstract

Data processing in data science often involves iterating over large datasets, which can be time-consuming with traditional methods like iterrows() in Pandas. The article highlights the significant performance drawbacks of iterrows() and suggests several more efficient techniques for data frame iteration. These include using itertuples() to preserve data types, iterating by index for a modest speed improvement, converting the data frame to a dictionary with to_dict() for a substantial increase in speed, and employing the apply() function for specific operations, which can be roughly 400x faster than iterrows(). The author provides benchmarks and recommends converting data frames to dictionaries for iteration, based on personal preference and efficiency.

Opinions

  • The author strongly advises against using iterrows() due to its poor performance with large datasets and its inability to preserve data types.
  • Iterating by index is seen as a slight improvement over iterrows(), but not the most efficient method.
  • Converting a data frame to a dictionary using to_dict() is highly recommended by the author for its significant speed advantage.
  • The apply() function is recognized for its speed and utility in specific cases, though it may require significant code changes to implement.
  • The author expresses a personal preference for the dictionary-based iteration technique over other methods discussed.
  • The article implies that the choice of iteration method can greatly impact the efficiency of data processing tasks in Pandas.

400x Faster Pandas Data Frame Iteration

Avoid using the iterrows() function

Image by Michal Jarmoluk from Pixabay

Data processing and data wrangling are important components of a data science model development pipeline. A data scientist spends 80% of their time preparing the dataset to make it fit for modeling. Data wrangling and exploration on a large dataset can become tedious, leaving one either to wait a long time for the computations to complete or to shift to some form of parallel processing.

Pandas is one of the most popular Python libraries and has a vast API, but it scales poorly. For large datasets, just iterating over the rows can take a long time, sometimes hours, and even for small datasets, iterating over the data frame with standard loops is quite time-consuming.

In this article, we will discuss techniques to speed up iteration over large datasets.

(Image by Author), Comparison of the time taken to iterate over the data frame

Pandas Built-In Function: iterrows()

iterrows() is a built-in Pandas function that iterates over the data frame row by row. For each row it yields a pair: the row's index label and the row's column values as a Series.

To compare benchmark timings, I am using a dataset with 10 million records and 5 columns. The dataset has a string feature ‘name’ whose values need to be stripped of surrounding spaces.
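To make the shape of what iterrows() yields concrete, here is a minimal sketch. The three-row frame and its column names are made up for illustration and are not the article's 10-million-row dataset.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
df = pd.DataFrame({"name": [" Ann ", "Bob  ", "  Cy"], "age": [31, 45, 27]})

# Each item produced by iterrows() is an (index label, row Series) pair.
pairs = list(df.iterrows())
first_index, first_row = pairs[0]

print(type(first_row).__name__)       # the row arrives as a pandas Series
print(first_index, repr(first_row["name"]))
```

Building a new Series for every row is a plausible reason for the slowdown reported in this section, although the article itself does not break the cost down.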

temp = []
for i, row in df.iterrows():
    # row is a Series; strip surrounding whitespace from its 'name' value
    name_new = row['name'].strip()
    temp.append(name_new)

The code snippet took nearly 1967 seconds to execute, which includes iterating over the data frame and stripping the ‘name’ values.

Using iterrows() is not recommended, not only because of its slow performance but also because it does not preserve dtypes across the rows. You can use the itertuples() function instead, which preserves the types.

Now let’s look at other techniques to iterate over the data frame and compare their running times.
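The dtype point can be seen on a small numeric frame, and the same strip loop can be written with itertuples(). This is a sketch under assumptions: both frames and the column names `visits` and `score` are hypothetical, and the article does not report a timing for itertuples().

```python
import pandas as pd

# Hypothetical numeric frame: one integer column and one float column.
numeric = pd.DataFrame({"visits": [1, 2], "score": [0.5, 1.5]})

# iterrows() packs each row into a single Series, so the integer is upcast to float.
_, row = next(numeric.iterrows())

# itertuples() yields namedtuples, so each field keeps its own column's type.
first = next(numeric.itertuples(index=False))

print(type(row["visits"]).__name__, type(first.visits).__name__)

# The strip loop from this article, rewritten with itertuples().
df = pd.DataFrame({"name": [" Ann ", "Bob  ", "  Cy"]})
temp = []
for row_tuple in df.itertuples(index=False):
    temp.append(row_tuple.name.strip())
```

Attribute access such as `row_tuple.name` only works when the column name is a valid Python identifier.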

Iteration by Index:

A data frame is a Pandas object with rows and columns. Both the rows and the columns are indexed, so one can loop over the row positions to iterate through the rows.

temp = []
for idx in range(df.shape[0]):
    # look up the 'name' value at integer position idx
    name_new = df['name'].iloc[idx].strip()
    temp.append(name_new)

It took nearly 223 seconds (approximately 9x faster than the iterrows() function) to iterate over the data frame and perform the strip operation.

Using to_dict():

You can iterate over the data frame and perform your operations much faster by simply converting your Pandas data frame into a dictionary. Use the .to_dict() function in Pandas for the conversion. Iterating over the resulting dictionary records is far faster than using the iterrows() function.

# with 'records', to_dict() returns a list with one dict per row
df_dict = df.to_dict('records')
temp = []
for row in df_dict:
    name_new = row['name'].strip()
    temp.append(name_new)

Iterating over the dictionary format of the dataset takes about 25 seconds, which is 77x faster than the iterrows() function.
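To show what the 'records' orientation actually produces, here is a small sketch on a hypothetical two-row frame. The list comprehension at the end is just a more compact way to write the same loop, not a separate technique from the article.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
df = pd.DataFrame({"name": [" Ann ", "Bob  "], "age": [31, 45]})

# 'records' gives a plain Python list holding one dict per row.
records = df.to_dict("records")
print(records)

# The same strip loop as a list comprehension over plain dicts.
temp = [row["name"].strip() for row in records]
print(temp)
```

Note that the conversion holds a second copy of the data as Python objects, which may matter for memory on very large frames.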

Using apply():

apply() is a built-in Pandas function that takes a function and applies it to each value of a Pandas Series. apply() is not inherently fast, but it is a big improvement over explicit row loops, and it is also useful for transforming or segregating data according to the required conditions.

temp = df['name'].apply(lambda x: x.strip())

The apply() function takes 4.60 seconds to execute, which is 427x faster than the iterrows() function.

In the image at the start of this article, you can compare the benchmark timings, which were measured on a system with 8 cores and 32GB of RAM.
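The function handed to apply() does not have to be a lambda. The sketch below uses a named helper, `strip_if_str`, which is a hypothetical name introduced here. It assumes the column may contain missing values, a case the article does not cover: calling `.strip()` directly on a missing value would raise an error.

```python
import pandas as pd

# Hypothetical frame with one missing value in the 'name' column.
df = pd.DataFrame({"name": [" Ann ", "Bob  ", None]})

def strip_if_str(value):
    """Hypothetical helper: strip strings, pass anything else through unchanged."""
    return value.strip() if isinstance(value, str) else value

# apply() calls the helper once for each value in the Series.
temp = df["name"].apply(strip_if_str)
print(temp.tolist())
```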
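A comparison like this can be reproduced on a small scale with a sketch along the following lines. The 2,000-row synthetic frame and the function names are assumptions made for illustration. Absolute timings will differ from the article's figures and depend on the machine and the Pandas version.

```python
import time

import pandas as pd

# Hypothetical small frame; the article's benchmark used 10 million rows.
df = pd.DataFrame({
    "name": [f"  user{i}  " for i in range(2000)],
    "value": list(range(2000)),
})

def with_iterrows(frame):
    return [row["name"].strip() for _, row in frame.iterrows()]

def with_index(frame):
    return [frame["name"].iloc[idx].strip() for idx in range(frame.shape[0])]

def with_to_dict(frame):
    return [row["name"].strip() for row in frame.to_dict("records")]

def with_apply(frame):
    return frame["name"].apply(lambda x: x.strip()).tolist()

results, timings = {}, {}
for func in (with_iterrows, with_index, with_to_dict, with_apply):
    start = time.perf_counter()
    results[func.__name__] = func(df)
    timings[func.__name__] = time.perf_counter() - start

# Print the wall-clock time of each approach on this machine.
for method, seconds in timings.items():
    print(f"{method}: {seconds:.4f} s")
```

Checking that every approach returns the same list before comparing timings guards against measuring code that does different work.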

Conclusion:

In this article, we have discussed several techniques to iterate over a Pandas data frame and compared their running times. Using the iterrows() function is recommended only in very specific cases. One can easily shift from iterrows() or the indexing approach to the dictionary-based iteration technique, which speeds up the workflow by 77x.

The apply() function is around 400x faster, but it has limited use, and one needs to change a lot of code to shift to this approach. I personally convert my data frame to a dictionary and then proceed with the iteration.
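The speedup factors quoted in this article follow from dividing the iterrows() time by each alternative's time. Here is a short sketch of that arithmetic using the reported timings. The dictionary figure is given only as "about 25" seconds, so its ratio comes out slightly above the quoted 77x.

```python
# Timings in seconds as reported in this article (the to_dict figure is approximate).
reported_seconds = {"iterrows": 1967, "index": 223, "to_dict": 25, "apply": 4.60}

baseline = reported_seconds["iterrows"]
speedups = {method: baseline / seconds for method, seconds in reported_seconds.items()}

for method, speedup in speedups.items():
    print(f"{method}: {speedup:.1f}x")
```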

Thank You for Reading
