avatarJeremy DiBattista

Summary

The article discusses the common misconceptions and inefficiencies in merging Pandas dataframes, emphasizing the performance differences between using merge and join functions and providing tips for optimizing these operations.

Abstract

The author of the article, after years of merging Pandas dataframes inefficiently, conducted a stress test to compare the performance of merge and join functions. The test involved creating a fake sales data dataframe and splitting it into two: an Orders table and a Customers table, which were then joined to simulate a common data manipulation task. The results revealed that using non-index columns for merging significantly slows down the process, and that specifying the join explicitly with merge or using join with the correct parameters can lead to a substantial performance increase. The article concludes with recommendations for writing optimal join statements in Pandas, highlighting the importance of using indexes and avoiding the "on" parameter in merge for better efficiency.

Opinions

  • The author believes that the difference between merge and join in Pandas is not well understood, leading to widespread inefficient usage.
  • It is the author's opinion that setting the index to the column used for joining can result in a significant speed improvement.
  • The author suggests that explicitly stating the columns or indexes to merge on, rather than using the "on" parameter, can yield the most efficient queries.
  • The article conveys the author's view that join is generally more efficient than merge when the join can be performed on an index.
  • The author recommends using merge when dealing with non-unique columns in both dataframes, indicating that merge remains a valuable tool when used correctly.
  • There is an opinion that the Pandas documentation does not adequately explain the nuances of using merge and join, which has contributed to suboptimal practices among data scientists.

I Have Been Merging Pandas Dataframes Completely Wrong

Why almost everyone is writing them inefficiently, and a few tips to fix it!

Photo by Caspar Camille Rubin on Unsplash

One of the more confusing Pandas concepts for a majority of data scientists is the difference between Pandas merge and Pandas join. Most people have told me personally “just use merge”, and there are very few resources online explaining which is definitively better. Because of this, I decided to put both of these functions under a stress test, and in my results… I realized I have been merging wrong for years.

Setting up the problem

To test this problem, I created a fake sales data dataframe with 2823 rows and 8 columns. Since I want to test merging, I will simply transform this single data frame into 2 data frames in a more condensed format. A format you would likely see when pulling from SQL tables. One of our tables will be Orders, indexed by our ORDERNUMBER field, and the other will be our Customers table, indexed by CUSTOMERNAME, and containing all of the metadata about the customer. Our goal will be to join these two tables to determine where each order should be shipped to. Their only overlapping column will be the CUSTOMERNAME field. The code below shows how I performed the split!

#Create the dataframe (random lists not shown, in github)
df = pd.DataFrame(index=range(2823),columns=["ORDERNUMBER", "QUANTITYORDERED", "ITEMNAME", "PRICEEACH", "CUSTOMERNAME", "CUSTOMERADDRESS"])
df["ORDERNUMBER"] = df.index
df["QUANTITYORDERED"] = np.random.randint(1, 6, df.shape[0])
df["ITEMNAME"] = np.random.choice(list(item_ordered), len(df))
df["PRICEEACH"] = np.random.randint(1, 10, df.shape[0])
df["CUSTOMERNAME"] = np.random.choice(list(customers), len(df))
df["CUSTOMERADDRESS"] = df['CUSTOMERNAME'].map(customers)
df.set_index("ORDERNUMBER")
#create the ORDERS table, selecting the first 4columns
df_orders = df.iloc[:, : 4]
#Create the CUSTOMERS table by selecting the last columns and indexing by unique customers
df_unique_customers = df.iloc[:, 4:]
df_unique_customers = df_unique_customers.groupby("CUSTOMERNAME").first()
#Create a duplicate of the customer dataframe, but indexed from 0 instead of by customer (for testing)
df_unique_customers_unindexed = df_unique_customers.reset_index()

You will notice I also created a duplicate of the Customer data frame that is 0 indexed instead of customer indexed for testing purposes. Here are samples of what our data frames look like! (You can also view the full notebook HERE)

ORDER Data Frame | Image by Author
CUSTOMER Data Frame | Image by Author

Merge vs Join

The difference — on the surface level — is not extremely significant beyond syntax. Join acts as a method to join two data frames, but it will exclusively work on the index of the right data frame. For the left data frame, either the left index or a column can be selected (we need to use a column in our example since the customer is not unique in our orders data frame).

Merge is essentially the same thing, but more flexible. Any column or index can be specified in either of the data frames to be used for the join. Essentially, as far as syntax is concerned, merge is just an easier-to-use join. But that does not necessarily hold true for their performance.

For this example, I want to perform a left join with our order data frame on the left and our customer data frame on the right. I was able to do this correctly in 1 way with Join, and in 4 ways with merge. If you are unfamiliar with the difference between different kinds of joins, here is a refresher! (Or if you are looking for a full course, try this)

To test the proficiency of these different joins, I will be running each join in a loop 10000 times to get the average cost of a join.

The 5 ways to perform a left Join/merge

First I will start with a few merges, because this is where it is very easy to have inefficient queries

The Wrong way to merge — Merging on non-index columns

  • df_combined = df_orders.merge(df_unique_customers_unindexed, how=’left’, on=”CUSTOMERNAME”)
Time for 10000 joins
18.28343653678894
average time per join
0.001828343653678894
  • df_combined = df_orders.merge(df_unique_customers_unindexed, how=’left’, left_on=”CUSTOMERNAME”, right_on=”CUSTOMERNAME”)
Time for 10000 joins
18.183201789855957
average time per join
0.0018183201789855957

Both of the above methods select the column CUSTOMERNAME from the each table. This is fairly simple and how I assume most people use merge, as well as how people lose a lot of speed on their merges. These are generally equivalent queries, with the bottom giving a pretty consistent 1–2% bump.

As a general rule of thumb, using a non-index column does cause a pretty significant speed hit, and we will fix this in the merges below.

The Inexplicably suboptimal way

  • df_combined = df_orders.merge(df_unique_customers, how=’left’, on=”CUSTOMERNAME”)
Time for 10000 joins
16.125476121902466
average time per join
0.0016125476121902466

By simply setting the index to the column we plan to join on(can be done using set_index(“CUSTOMERNAME”)), we see an immediate speed increase of 10–15% over a non-indexed.

However, because merge has to infer whether “on” is referring to a column or an index, this is still suboptimal, and something I, along with most people I am assuming, never knew! I have not seen this explained anywhere, and by avoiding this, you can create the optimal queries below!

The Fastest Ways

  • df_combined = df_orders.merge(df_unique_customers, how=’left’, left_on=”CUSTOMERNAME”, right_index=True)
Time for 10000 joins
14.555766582489014
average time per join
0.0014555766582489014
  • df_combined = df_orders.join(df_unique_customers, on=”CUSTOMERNAME”, how=”left”)
Time for 10000 joins
14.551718711853027
average time per join
0.0014551718711853028

Finally, we come to the proper, and fastest way to write a join. As it turns out, join always tends to perform well, and merge will perform almost exactly the same given the syntax is optimal. Writing the query perfectly grants a whopping 20+% increase over our unindexed queries and a surprising 10% bump over our properly indexed but poorly written query.

Conclusion

If you can write a join statement, I would recommend doing so as it is almost guaranteed to perform optimally.

However, that does not mean that merge is not useful. Merge can be used in cases where both the left and right columns are not unique, and therefore cannot be an index. A merge is also just as efficient as a join as long as:

  1. Merging is done on indexes if possible
  2. The “on” parameter is avoided, and instead, both columns to merge on are explicitly stated using the keywords left_on, left_index, right_on, and right_index (when applicable).

I hope this short article helped you speed up your joins in Pandas! Be sure to save/bookmark the story if you think you will need to come back for reference! If you enjoyed this article, feel free to follow me, and read more of what I write. I tend to deep-dive into a topic once per month and this normally involves data science tips, new packages, or whatever else in the ML space I feel needs better explained!

Pandas
Python
Data Science
Programming
Join
Recommended from ReadMedium