Free AI web copilot to create summaries, insights and extended knowledge, download it at here

3823

Abstract

, how='outer', suffixes=('_SourceA', '_SourceB'), indicator=True)

mismatches = reconciled[reconciled['_merge'] != 'both'] display(mismatches)</pre></div><figure id="063d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*GBj3bh86v3WS2Engeq_tMw.png"><figcaption>Image by Author</figcaption></figure>Solution B :Let’s take the same source_a, source_b and then get the mismatches.<div id="f371"><pre>pd.concat([source_a,source_b],axis=0).drop_duplicates(subset='Account_Number',keep=False)</pre></div><figure id="3f3d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*UEIRGrvXTha6J6lQbSPkRg.png"><figcaption>Image by Author</figcaption></figure>Solution C:Let’s take the same source_a, source_b and then get the mismatches.<div id="ed2f"><pre>import numpy as np diff = np.setxor1d(np.array(source_a.Account_Number), np.array(source_b.Account_Number)) pd.concat([source_a.loc[source_a['Account_Number'].isin(list(diff)),:],source_b.loc[source_b['Account_Number'].isin(list(diff)),:]])</pre></div><figure id="ae02"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*UEIRGrvXTha6J6lQbSPkRg.png"><figcaption>Image by Author</figcaption></figure><h1 id="aacb">Matches</h1>To reconcile values in matching records:<div id="dc96"><pre>value_discrepancies = reconciled[ (reconciled['_merge'] == 'both') & (reconciled['Amount_SourceA'] != reconciled['Amount_SourceB']) ] display(value_discrepancies)</pre></div><figure id="21c0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*OECoXzm1432tuiq1NFRqAQ.png"><figcaption>Image by Author</figcaption></figure><h1 id="015c">Reconciliation report</h1>Create a reconciliation report like below :<div id="4939"><pre>print(f"Total Transactions in Source A: {len(source_a)}") print(f"Total Transactions in Source B: {len(source_b)}") print(f"Mismatched Transactions: {len(mismatches)}") print(f"Value Discrepancies: {len(value_discrepancies)}")</pre></div><figure id="2d3b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*P543Xh5lucmt1Lf1Hlb4EQ.png"><figcaption>Image by Author</figcaption></figure>Example 2 : Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrameLet’s use the same dataframe and give some null values to the data<div id="a729"><pre>import pandas as pd

source_a <span cla

Options

ss="hljs-operator">= pd.DataFrame({ 'Transaction_ID': ['TXN001', 'TXN002', 'TXN003', 'TXN004'], 'Account_Number': [12345, 12346, 12347, 12348], 'Date': [np.nan, '2023-01-02', '2023-01-03', '2023-01-04'], 'Amount': [500.00, 1500.00, np.nan, 300.00] })

source_b = pd.DataFrame({ 'Transaction_ID': ['TXN005', 'TXN006', 'TXN007', 'TXN008'], 'Account_Number': [12345, 12346, 12349, 12347], 'Date': ['2023-01-01', '2023-01-02', np.nan, '2023-01-03'], 'Amount': [500.00, np.nan, 100.00, 300.00] })

display(source_a) display(source_b)source_a.combine_first(source_b,)</pre></div><figure id="d19c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*k9iV5w4ATQGrWwj-3PXjyg.png"><figcaption>Image by Author</figcaption></figure>We’ll use combine_first method to fill up the not null values.<div id="dd33"><pre>source_a.combine_first(source_b)</pre></div><figure id="c82c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*R3uLz6R3fd2aABPe0L2APQ.png"><figcaption>Image by Author</figcaption></figure>Find the code in Github Repo : <a href="https://github.com/dabhishek316/Data-Reconcilation-Best-Way-To-Find-Differences-Between-Datasets">Github Link</a><h1 id="6be5">Best Practices for Data Reconciliation</h1><ul><li>Use Unique Keys: Ensure datasets have unique identifiers (e.g., Transaction IDs).</li><li>Standardize Data: Align formats for dates, numbers, and text.</li><li>Automate the Process: Create reusable scripts to save time.</li><li>Validate Regularly: Periodic reconciliation minimizes errors.</li></ul><h1 id="ad10">Conclusion</h1>Data reconciliation is essential for maintaining the integrity of financial data. Python, with its powerful libraries, makes this process efficient and scalable. By following the steps outlined above, you can easily find and resolve discrepancies in your datasets.<blockquote id="816e">If you like my article give a clap and reach out to my socials for data engineering related : <a href="https://medium.com/@abhishekdas69597">Medium </a>| <a href="https://www.youtube.com/@adasdataengineer">Youtube</a> | <a href="https://www.instagram.com/adas_data_engineer/">Instagram</a> | <a href="https://x.com/dabhishek3160/">Twitter</a> | <a href="https://www.facebook.com/profile.php?id=61556527066455">Facebook</a> | <a href="https://www.linkedin.com/in/abhishek-das-9859a31a2/">Linked</a><a href="https://www.linkedin.com/in/abhishek-das-9859a31a2/">-In</a></blockquote></article></body>

Data Reconciliation in Python : Comparing Datasets for Accurate Insights

In the world of data science and analytics, ensuring consistency between datasets is a crucial task. Data reconciliation helps identify discrepancies, align datasets, and ensure data quality. This article demonstrates the best practices for data reconciliation using Python.

What is Data Reconciliation?

Data reconciliation involves comparing two or more datasets to:

Detect missing or mismatched records.
Ensure consistency between data sources.
Align data for reporting or analysis.

Let’s understand wit the help of an example with different solutions :

Example 1 : Two dataframes with reconciliation column key containing Matched/Mismatched elements, with reports

Assume that the transaction amount from the same account should be the same even though we obtain the data from two distinct sources.

Source_a and source_b are 2 different dataframes coming from 2 different sources.

import pandas as pd

source_a = pd.DataFrame({
    'Transaction_ID': ['TXN001', 'TXN002', 'TXN003', 'TXN004'],
    'Account_Number': [12345, 12346, 12347, 12348],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'Amount': [500.00, 1500.00, 750.00, 300.00]
})

source_b = pd.DataFrame({
    'Transaction_ID': ['TXN005', 'TXN006', 'TXN007', 'TXN008'],
    'Account_Number': [12345, 12346, 12349, 12347],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-03'],
    'Amount': [500.00, 1500.00, 100.00, 700.00]
})

display(source_a)
display(source_b)

Let’s Perform an outer join to combine both datasets and identify mismatches and matches, then create reconciliation report.

Mismatches

There are 2 solutions below :

Solution A :

reconciled = pd.merge(source_a, source_b, on='Account_Number', how='outer', suffixes=('_SourceA', '_SourceB'), indicator=True)

mismatches = reconciled[reconciled['_merge'] != 'both']
display(mismatches)

Solution B :

Let’s take the same source_a, source_b and then get the mismatches.

pd.concat([source_a,source_b],axis=0).drop_duplicates(subset='Account_Number',keep=False)

Solution C:

Let’s take the same source_a, source_b and then get the mismatches.

import numpy as np
diff = np.setxor1d(np.array(source_a.Account_Number), np.array(source_b.Account_Number))
pd.concat([source_a.loc[source_a['Account_Number'].isin(list(diff)),:],source_b.loc[source_b['Account_Number'].isin(list(diff)),:]])

Matches

To reconcile values in matching records:

value_discrepancies = reconciled[
    (reconciled['_merge'] == 'both') & 
    (reconciled['Amount_SourceA'] != reconciled['Amount_SourceB'])
]
display(value_discrepancies)

Reconciliation report

Create a reconciliation report like below :

print(f"Total Transactions in Source A: {len(source_a)}")
print(f"Total Transactions in Source B: {len(source_b)}")
print(f"Mismatched Transactions: {len(mismatches)}")
print(f"Value Discrepancies: {len(value_discrepancies)}")

Example 2 : Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame

Let’s use the same dataframe and give some null values to the data

import pandas as pd

source_a = pd.DataFrame({
    'Transaction_ID': ['TXN001', 'TXN002', 'TXN003', 'TXN004'],
    'Account_Number': [12345, 12346, 12347, 12348],
    'Date': [np.nan, '2023-01-02', '2023-01-03', '2023-01-04'],
    'Amount': [500.00, 1500.00, np.nan, 300.00]
})

source_b = pd.DataFrame({
    'Transaction_ID': ['TXN005', 'TXN006', 'TXN007', 'TXN008'],
    'Account_Number': [12345, 12346, 12349, 12347],
    'Date': ['2023-01-01', '2023-01-02', np.nan, '2023-01-03'],
    'Amount': [500.00, np.nan, 100.00, 300.00]
})

display(source_a)
display(source_b)source_a.combine_first(source_b,)

We’ll use combine_first method to fill up the not null values.

source_a.combine_first(source_b)

Find the code in Github Repo : Github Link

Best Practices for Data Reconciliation

Use Unique Keys: Ensure datasets have unique identifiers (e.g., Transaction IDs).
Standardize Data: Align formats for dates, numbers, and text.
Automate the Process: Create reusable scripts to save time.
Validate Regularly: Periodic reconciliation minimizes errors.

Conclusion

Data reconciliation is essential for maintaining the integrity of financial data. Python, with its powerful libraries, makes this process efficient and scalable. By following the steps outlined above, you can easily find and resolve discrepancies in your datasets.

If you like my article give a clap and reach out to my socials for data engineering related : Medium | Youtube | Instagram | Twitter | Facebook | Linked-In