A Simple Way to Compare Pandas DataFrames in Unit Tests
A single function for all your needs.

Testing is a fundamental part of software development. No software product should be deployed to production without a thorough testing process.
Data analytics products also require testing. Let’s say you’re building a forecasting engine. No one can or should rely on the output of your product unless you are testing it continuously.
The scope of the tests varies. There is no fixed rule for what you should or should not test. It all depends on the data and the product.
A large fraction of data-based products work with tabular data. Therefore, tabular data structures such as Pandas DataFrames or SQL tables are commonly used in the data science ecosystem.
A DataFrame is a two-dimensional data structure with labeled rows and columns. The testing procedure may require you to check if two DataFrames are equal. You can write a unit test for this task.
For two DataFrames, being equal might mean different things. For instance, the DataFrames shown in the image below are equal in terms of their shapes. Both have 4 rows and 3 columns.

You can simply check if two DataFrames have the same shape by using the shape attribute in a unit test.
def test_df_equal(df1, df2):
    assert df1.shape == df2.shape, "DataFrames have different shapes."
The test_df_equal function will raise an AssertionError with the given error message if the shapes of the DataFrames are not equal.
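To see what a shape-only check does and does not catch, here is a minimal sketch. The helper name `assert_same_shape` and the DataFrames `left`, `right`, and `shorter` are made up for illustration; they are not the ones from the image above.

```python
import pandas as pd


def assert_same_shape(df1, df2):
    # Compare only the (rows, columns) tuples of the two DataFrames.
    assert df1.shape == df2.shape, "DataFrames have different shapes."


# Hypothetical data: same shape, completely different values.
left = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "d"]})
right = pd.DataFrame({"col1": [9, 9, 9, 9], "col2": ["w", "x", "y", "z"]})
shorter = left.head(3)

# Passes silently even though every value differs.
assert_same_shape(left, right)

# Fails because the row counts differ (4 vs. 3).
try:
    assert_same_shape(left, shorter)
    shape_error = None
except AssertionError as err:
    shape_error = str(err)

print(shape_error)
```

The first call passing is the point: equal shapes say nothing about the values inside.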
A single function for all your needs
But tests are usually not that simple, and you may need to check different attributes such as the index, data types, and column names. Moreover, you may have to check all the values in the DataFrames to evaluate them as equal.
Thankfully, the assert_frame_equal function in the pandas.testing module can perform all these checks in a single call.
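For a sense of what checking these attributes by hand might look like, here is a rough sketch. The helper `assert_frames_match_manually` and the sample DataFrames `a` and `b` are hypothetical and only meant to show how many separate assertions pile up.

```python
import pandas as pd


def assert_frames_match_manually(df1, df2):
    # Each attribute needs its own comparison and its own message.
    assert df1.shape == df2.shape, "Shapes differ."
    assert list(df1.columns) == list(df2.columns), "Column names differ."
    assert df1.index.equals(df2.index), "Indexes differ."
    assert list(df1.dtypes) == list(df2.dtypes), "Data types differ."
    # DataFrame.equals compares the values element by element.
    assert df1.equals(df2), "Values differ."


# Hypothetical data: same values, but col1 is float in one frame.
a = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})
b = a.copy()
b["col1"] = b["col1"].astype("float")

try:
    assert_frames_match_manually(a, b)
    manual_error = None
except AssertionError as err:
    manual_error = str(err)

print(manual_error)
```

Every new requirement adds another line, and the messages say only that something differs, not where.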
Consider the DataFrames shown in the image below:

They have the same shape. All the values are the same except for the first value in col1. We can write a unit test using the assert_frame_equal function to check if these two DataFrames are equal as follows:
from pandas.testing import assert_frame_equal

def test_df_equal(df1, df2):
    # assert_frame_equal raises its own descriptive AssertionError on a mismatch
    assert_frame_equal(df1, df2)

test_df_equal(df1, df2)
# output
AssertionError: DataFrame.iloc[:, 0] (column name="col1") are different
DataFrame.iloc[:, 0] (column name="col1") values are different (25.0 %)
[index]: [0, 1, 2, 3]
[left]: [1, 2, 3, 4]
[right]: [10, 2, 3, 4]
It tells us which column has different values and what percentage of the values in that column differ.
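If you want to reproduce this mismatch yourself, a self-contained sketch could look like the following. The col1 values come from the [left] and [right] lines of the output above; the col2 and col3 values are assumptions, since the original DataFrames are only shown in an image.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# col1 follows the output above; col2 and col3 are assumed values.
df1 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": ["a", "b", "c", "d"],
        "col3": [1.2, 4.1, 2.34, 3.2],
    }
)
df2 = df1.copy()
df2.loc[0, "col1"] = 10  # change only the first value in col1

try:
    assert_frame_equal(df1, df2)
    report = None
except AssertionError as err:
    # Keep the message so it can be inspected or printed.
    report = str(err)

print(report)
```

The exact wording of the report may vary between pandas versions, but it should name the offending column.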
Let’s say the numeric values are the same but have different data types (e.g. int and float). The assert_frame_equal function can detect that as well:
import pandas as pd

# create two DataFrames with the same values
df1 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": ["a", "b", "c", "d"],
        "col3": [1.2, 4.1, 2.34, 3.2]
    }
)
df2 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": ["a", "b", "c", "d"],
        "col3": [1.2, 4.1, 2.34, 3.2]
    }
)
# change the data type of col1 in df1 to float
df1["col1"] = df1["col1"].astype("float")

def test_df_equal(df1, df2):
    assert_frame_equal(df1, df2)

test_df_equal(df1, df2)
# output
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="col1") are different
Attribute "dtype" are different
[left]: float64
[right]: int64
If you don’t want to evaluate “float” and “int” as different, you can disregard the data type difference by setting the check_dtype parameter to False.
def test_df_equal(df1, df2):
    assert_frame_equal(df1, df2, check_dtype=False)

test_df_equal(df1, df2)
The function returns None and raises no error when the two DataFrames are evaluated as equal.
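The contrast between the strict and the relaxed check can be sketched in one place. The variable names `df_float`, `df_int`, `strict_failed`, and `relaxed_result` are illustrative choices, not part of the pandas API.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df_int = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "d"]})
df_float = df_int.copy()
df_float["col1"] = df_float["col1"].astype("float")  # same numbers, float64

# Strict comparison: the dtype difference alone is enough to fail.
try:
    assert_frame_equal(df_float, df_int)
    strict_failed = False
except AssertionError:
    strict_failed = True

# Relaxed comparison: the dtype is ignored, the values are still compared.
relaxed_result = assert_frame_equal(df_float, df_int, check_dtype=False)

print(strict_failed, relaxed_result)
```

Note that check_dtype=False relaxes only the data type check; a genuinely different value would still raise an AssertionError.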
Not exactly the same but close enough
In some cases, you may want to evaluate very close numbers as being equal. Consider the floating point numbers shown below. They are very close and the difference between them can be negligible for your application.
3.41231 == 3.41235
# output
False
However, if you compare them based on strict equality, they are evaluated as not equal.
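The idea of comparing with a tolerance can be illustrated with plain Python before bringing DataFrames into it. This sketch uses math.isclose from the standard library; the tolerance values 1e-4 and 1e-6 are arbitrary choices for the demonstration.

```python
import math

a, b = 3.41231, 3.41235  # the two numbers differ by about 0.00004

# Exact comparison: any difference at all counts.
strict = a == b

# Absolute tolerance: equal if |a - b| is within abs_tol.
loose = math.isclose(a, b, abs_tol=1e-4)  # tolerance larger than the gap
tight = math.isclose(a, b, abs_tol=1e-6)  # tolerance smaller than the gap

print(strict, loose, tight)
```

Whether a gap of 0.00004 is negligible depends entirely on your application, which is why the tolerance is a parameter rather than a fixed constant.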
The atol (i.e. absolute tolerance) parameter of the assert_frame_equal function can be used to allow some tolerance when comparing values.
Let’s go over an example.
df1 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": ["a", "b", "c", "d"],
        "col3": [1.23, 4.14, 2.34, 3.26]
    }
)
df2 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": ["a", "b", "c", "d"],
        "col3": [1.25, 4.11, 2.34, 3.25]
    }
)
def test_df_equal(df1, df2):
    assert_frame_equal(df1, df2, atol=0.05)

test_df_equal(df1, df2)
The values in “col3” differ by up to 0.03. If we compare them without using the atol parameter, the assert_frame_equal function raises an assertion error. However, as shown in the example above, if we allow an absolute tolerance of 0.05, the DataFrames are considered equal and no assertion error is raised.
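To check the effect of the tolerance for yourself, the sketch below runs the same comparison with three settings. The helper `frames_equal` is a hypothetical convenience wrapper, and atol=0.01 is an extra value chosen to show a tolerance that is too small; it assumes a pandas version in which assert_frame_equal accepts atol.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": ["a", "b", "c", "d"],
        "col3": [1.23, 4.14, 2.34, 3.26],
    }
)
df2 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": ["a", "b", "c", "d"],
        "col3": [1.25, 4.11, 2.34, 3.25],
    }
)


def frames_equal(left, right, **kwargs):
    # Turn the raise-or-return-None behavior into a boolean.
    try:
        assert_frame_equal(left, right, **kwargs)
        return True
    except AssertionError:
        return False


default_ok = frames_equal(df1, df2)                # default tolerances
tolerant_ok = frames_equal(df1, df2, atol=0.05)    # largest gap is 0.03
too_tight_ok = frames_equal(df1, df2, atol=0.01)   # 0.03 exceeds 0.01

print(default_ok, tolerant_ok, too_tight_ok)
```

In a real test you would call assert_frame_equal directly so that the detailed error message is reported; the boolean wrapper is only here to make the three outcomes easy to compare.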
We learned about the assert_frame_equal function, which comes in handy if you are writing unit tests for your data processing or machine learning pipelines. It lets you check whether two DataFrames are equal under several different settings.
There are many other parameters that you can use to customize the equality check based on your needs. Check out the official documentation for all the parameters of the assert_frame_equal function.
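As one example of those other parameters, check_like=True tells assert_frame_equal to ignore the order of the index and columns. The sketch below is only an illustration with made-up data; the variable names `order_matters` and `like_result` are assumptions of this example.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})
right = left[["col2", "col1"]]  # same data, columns in a different order

# By default the column order is part of the comparison.
try:
    assert_frame_equal(left, right)
    order_matters = False
except AssertionError:
    order_matters = True

# With check_like=True the order of index and columns is ignored.
like_result = assert_frame_equal(left, right, check_like=True)

print(order_matters, like_result)
```

This can be useful when a pipeline step is free to reorder columns, as long as the labels and values still line up.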
Thank you for reading. Please let me know if you have any feedback.






