I Have Been Merging Pandas Dataframes Completely Wrong
Why almost everyone is writing them inefficiently, and a few tips to fix it!
One of the more confusing Pandas concepts for a majority of data scientists is the difference between Pandas merge and Pandas join. Most people have told me personally “just use merge”, and there are very few resources online explaining which is definitively better. Because of this, I decided to put both of these functions under a stress test, and in my results… I realized I have been merging wrong for years.
Setting up the problem
To test this problem, I created a fake sales data dataframe with 2823 rows and 8 columns. Since I want to test merging, I will simply transform this single data frame into 2 data frames in a more condensed format. A format you would likely see when pulling from SQL tables. One of our tables will be Orders, indexed by our ORDERNUMBER field, and the other will be our Customers table, indexed by CUSTOMERNAME, and containing all of the metadata about the customer. Our goal will be to join these two tables to determine where each order should be shipped to. Their only overlapping column will be the CUSTOMERNAME field. The code below shows how I performed the split!
#Create the dataframe (random lists not shown, in github)
df = pd.DataFrame(index=range(2823),columns=["ORDERNUMBER", "QUANTITYORDERED", "ITEMNAME", "PRICEEACH", "CUSTOMERNAME", "CUSTOMERADDRESS"])
df["ORDERNUMBER"] = df.index
df["QUANTITYORDERED"] = np.random.randint(1, 6, df.shape[0])
df["ITEMNAME"] = np.random.choice(list(item_ordered), len(df))
df["PRICEEACH"] = np.random.randint(1, 10, df.shape[0])
df["CUSTOMERNAME"] = np.random.choice(list(customers), len(df))
df["CUSTOMERADDRESS"] = df['CUSTOMERNAME'].map(customers)
df.set_index("ORDERNUMBER")
#create the ORDERS table, selecting the first 4columns
df_orders = df.iloc[:, : 4]
#Create the CUSTOMERS table by selecting the last columns and indexing by unique customers
df_unique_customers = df.iloc[:, 4:]
df_unique_customers = df_unique_customers.groupby("CUSTOMERNAME").first()
#Create a duplicate of the customer dataframe, but indexed from 0 instead of by customer (for testing)
df_unique_customers_unindexed = df_unique_customers.reset_index()
You will notice I also created a duplicate of the Customer data frame that is 0 indexed instead of customer indexed for testing purposes. Here are samples of what our data frames look like! (You can also view the full notebook HERE)


Merge vs Join
The difference — on the surface level — is not extremely significant beyond syntax. Join acts as a method to join two data frames, but it will exclusively work on the index of the right data frame. For the left data frame, either the left index or a column can be selected (we need to use a column in our example since the customer is not unique in our orders data frame).
Merge is essentially the same thing, but more flexible. Any column or index can be specified in either of the data frames to be used for the join. Essentially, as far as syntax is concerned, merge is just an easier-to-use join. But that does not necessarily hold true for their performance.
For this example, I want to perform a left join with our order data frame on the left and our customer data frame on the right. I was able to do this correctly in 1 way with Join, and in 4 ways with merge. If you are unfamiliar with the difference between different kinds of joins, here is a refresher! (Or if you are looking for a full course, try this)
To test the proficiency of these different joins, I will be running each join in a loop 10000 times to get the average cost of a join.
The 5 ways to perform a left Join/merge
First I will start with a few merges, because this is where it is very easy to have inefficient queries
The Wrong way to merge — Merging on non-index columns
- df_combined = df_orders.merge(df_unique_customers_unindexed, how=’left’, on=”CUSTOMERNAME”)
Time for 10000 joins
18.28343653678894
average time per join
0.001828343653678894
- df_combined = df_orders.merge(df_unique_customers_unindexed, how=’left’, left_on=”CUSTOMERNAME”, right_on=”CUSTOMERNAME”)
Time for 10000 joins
18.183201789855957
average time per join
0.0018183201789855957
Both of the above methods select the column CUSTOMERNAME from the each table. This is fairly simple and how I assume most people use merge, as well as how people lose a lot of speed on their merges. These are generally equivalent queries, with the bottom giving a pretty consistent 1–2% bump.
As a general rule of thumb, using a non-index column does cause a pretty significant speed hit, and we will fix this in the merges below.
The Inexplicably suboptimal way
- df_combined = df_orders.merge(df_unique_customers, how=’left’, on=”CUSTOMERNAME”)
Time for 10000 joins
16.125476121902466
average time per join
0.0016125476121902466
By simply setting the index to the column we plan to join on(can be done using set_index(“CUSTOMERNAME”)), we see an immediate speed increase of 10–15% over a non-indexed.
However, because merge has to infer whether “on” is referring to a column or an index, this is still suboptimal, and something I, along with most people I am assuming, never knew! I have not seen this explained anywhere, and by avoiding this, you can create the optimal queries below!
The Fastest Ways
- df_combined = df_orders.merge(df_unique_customers, how=’left’, left_on=”CUSTOMERNAME”, right_index=True)
Time for 10000 joins
14.555766582489014
average time per join
0.0014555766582489014
- df_combined = df_orders.join(df_unique_customers, on=”CUSTOMERNAME”, how=”left”)
Time for 10000 joins
14.551718711853027
average time per join
0.0014551718711853028
Finally, we come to the proper, and fastest way to write a join. As it turns out, join always tends to perform well, and merge will perform almost exactly the same given the syntax is optimal. Writing the query perfectly grants a whopping 20+% increase over our unindexed queries and a surprising 10% bump over our properly indexed but poorly written query.
Conclusion
If you can write a join statement, I would recommend doing so as it is almost guaranteed to perform optimally.
However, that does not mean that merge is not useful. Merge can be used in cases where both the left and right columns are not unique, and therefore cannot be an index. A merge is also just as efficient as a join as long as:
- Merging is done on indexes if possible
- The “on” parameter is avoided, and instead, both columns to merge on are explicitly stated using the keywords left_on, left_index, right_on, and right_index (when applicable).
I hope this short article helped you speed up your joins in Pandas! Be sure to save/bookmark the story if you think you will need to come back for reference! If you enjoyed this article, feel free to follow me, and read more of what I write. I tend to deep-dive into a topic once per month and this normally involves data science tips, new packages, or whatever else in the ML space I feel needs better explained!