avatarAbhishek Das

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

3823

Abstract

</span>, how=<span class="hljs-string">'outer'</span>, suffixes=(<span class="hljs-string">'_SourceA'</span>, <span class="hljs-string">'_SourceB'</span>), indicator=<span class="hljs-literal">True</span>)

mismatches = reconciled[reconciled[<span class="hljs-string">'_merge'</span>] != <span class="hljs-string">'both'</span>] display(mismatches)</pre></div><figure id="063d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*GBj3bh86v3WS2Engeq_tMw.png"><figcaption>Image by Author</figcaption></figure><p id="f822"><b>Solution B :</b></p><p id="6ad8">Let’s take the same source_a, source_b and then get the mismatches.</p><div id="f371"><pre>pd.concat([source_a,source_b],axis=<span class="hljs-number">0</span>).drop_duplicates(subset=<span class="hljs-string">'Account_Number'</span>,keep=<span class="hljs-literal">False</span>)</pre></div><figure id="3f3d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*UEIRGrvXTha6J6lQbSPkRg.png"><figcaption>Image by Author</figcaption></figure><p id="8d57"><b>Solution C:</b></p><p id="1089">Let’s take the same source_a, source_b and then get the mismatches.</p><div id="ed2f"><pre><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np diff = np.setxor1d(np.array(source_a.Account_Number), np.array(source_b.Account_Number)) pd.concat([source_a.loc[source_a[<span class="hljs-string">'Account_Number'</span>].isin(<span class="hljs-built_in">list</span>(diff)),:],source_b.loc[source_b[<span class="hljs-string">'Account_Number'</span>].isin(<span class="hljs-built_in">list</span>(diff)),:]])</pre></div><figure id="ae02"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*UEIRGrvXTha6J6lQbSPkRg.png"><figcaption>Image by Author</figcaption></figure><h1 id="aacb">Matches</h1><p id="f29e">To reconcile values in <b>matching </b>records:</p><div id="dc96"><pre>value_discrepancies = reconciled[ (reconciled[<span class="hljs-string">'_merge'</span>] == <span class="hljs-string">'both'</span>) & (reconciled[<span class="hljs-string">'Amount_SourceA'</span>] != reconciled[<span class="hljs-string">'Amount_SourceB'</span>]) ] display(value_discrepancies)</pre></div><figure id="21c0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*OECoXzm1432tuiq1NFRqAQ.png"><figcaption>Image by Author</figcaption></figure><h1 id="015c">Reconciliation report</h1><p id="638e">Create a <b>reconciliation report</b> like below :</p><div id="4939"><pre><span class="hljs-built_in">print</span>(<span class="hljs-string">f"Total Transactions in Source A: <span class="hljs-subst">{<span class="hljs-built_in">len</span>(source_a)}</span>"</span>) <span class="hljs-built_in">print</span>(<span class="hljs-string">f"Total Transactions in Source B: <span class="hljs-subst">{<span class="hljs-built_in">len</span>(source_b)}</span>"</span>) <span class="hljs-built_in">print</span>(<span class="hljs-string">f"Mismatched Transactions: <span class="hljs-subst">{<span class="hljs-built_in">len</span>(mismatches)}</span>"</span>) <span class="hljs-built_in">print</span>(<span class="hljs-string">f"Value Discrepancies: <span class="hljs-subst">{<span class="hljs-built_in">len</span>(value_discrepancies)}</span>"</span>)</pre></div><figure id="2d3b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*P543Xh5lucmt1Lf1Hlb4EQ.png"><figcaption>Image by Author</figcaption></figure><p id="5c3b"><b>Example 2 : Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame</b></p><p id="b674">Let’s use the same dataframe and give some null values to the data</p><div id="a729"><pre><span class="hljs-keyword">import</span> pandas as <span class="hljs-type">pd</span>

<span class="hljs-variable">source_a</span> <span cla

Options

ss="hljs-operator">=</span> pd.DataFrame({ <span class="hljs-string">'Transaction_ID'</span>: [<span class="hljs-string">'TXN001'</span>, <span class="hljs-string">'TXN002'</span>, <span class="hljs-string">'TXN003'</span>, <span class="hljs-string">'TXN004'</span>], <span class="hljs-string">'Account_Number'</span>: [<span class="hljs-number">12345</span>, <span class="hljs-number">12346</span>, <span class="hljs-number">12347</span>, <span class="hljs-number">12348</span>], <span class="hljs-string">'Date'</span>: [np.nan, <span class="hljs-string">'2023-01-02'</span>, <span class="hljs-string">'2023-01-03'</span>, <span class="hljs-string">'2023-01-04'</span>], <span class="hljs-string">'Amount'</span>: [<span class="hljs-number">500.00</span>, <span class="hljs-number">1500.00</span>, np.nan, <span class="hljs-number">300.00</span>] })

source_b = pd.DataFrame({ <span class="hljs-string">'Transaction_ID'</span>: [<span class="hljs-string">'TXN005'</span>, <span class="hljs-string">'TXN006'</span>, <span class="hljs-string">'TXN007'</span>, <span class="hljs-string">'TXN008'</span>], <span class="hljs-string">'Account_Number'</span>: [<span class="hljs-number">12345</span>, <span class="hljs-number">12346</span>, <span class="hljs-number">12349</span>, <span class="hljs-number">12347</span>], <span class="hljs-string">'Date'</span>: [<span class="hljs-string">'2023-01-01'</span>, <span class="hljs-string">'2023-01-02'</span>, np.nan, <span class="hljs-string">'2023-01-03'</span>], <span class="hljs-string">'Amount'</span>: [<span class="hljs-number">500.00</span>, np.nan, <span class="hljs-number">100.00</span>, <span class="hljs-number">300.00</span>] })

display(source_a) display(source_b)source_a.combine_first(source_b,)</pre></div><figure id="d19c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*k9iV5w4ATQGrWwj-3PXjyg.png"><figcaption>Image by Author</figcaption></figure><p id="816b">We’ll use <b>combine_first method </b>to fill up the not null values.</p><div id="dd33"><pre>source_a<span class="hljs-selector-class">.combine_first</span>(source_b)</pre></div><figure id="c82c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*R3uLz6R3fd2aABPe0L2APQ.png"><figcaption>Image by Author</figcaption></figure><p id="c4d5">Find the code in Github Repo : <a href="https://github.com/dabhishek316/Data-Reconcilation-Best-Way-To-Find-Differences-Between-Datasets">Github Link</a></p><h1 id="6be5">Best Practices for Data Reconciliation</h1><ul><li><b>Use Unique Keys</b>: Ensure datasets have unique identifiers (e.g., Transaction IDs).</li><li><b>Standardize Data</b>: Align formats for dates, numbers, and text.</li><li><b>Automate the Process</b>: Create reusable scripts to save time.</li><li><b>Validate Regularly</b>: Periodic reconciliation minimizes errors.</li></ul><h1 id="ad10">Conclusion</h1><p id="3354">Data reconciliation is essential for maintaining the integrity of financial data. Python, with its powerful libraries, makes this process efficient and scalable. By following the steps outlined above, you can easily find and resolve discrepancies in your datasets.</p><blockquote id="816e"><p>If you like my article give a clap and reach out to my socials for data engineering related : <a href="https://medium.com/@abhishekdas69597"><b>Medium </b></a><b>| <a href="https://www.youtube.com/@adasdataengineer">Youtube</a> | <a href="https://www.instagram.com/adas_data_engineer/">Instagram</a> | <a href="https://x.com/dabhishek3160/">Twitter</a> | <a href="https://www.facebook.com/profile.php?id=61556527066455">Facebook</a> | <a href="https://www.linkedin.com/in/abhishek-das-9859a31a2/">Linked</a></b><a href="https://www.linkedin.com/in/abhishek-das-9859a31a2/">-In</a></p></blockquote></article></body>

Data Reconciliation in Python : Comparing Datasets for Accurate Insights

Image by Author

In the world of data science and analytics, ensuring consistency between datasets is a crucial task. Data reconciliation helps identify discrepancies, align datasets, and ensure data quality. This article demonstrates the best practices for data reconciliation using Python.

What is Data Reconciliation?

Data reconciliation involves comparing two or more datasets to:

  • Detect missing or mismatched records.
  • Ensure consistency between data sources.
  • Align data for reporting or analysis.

Let’s understand wit the help of an example with different solutions :

Example 1 : Two dataframes with reconciliation column key containing Matched/Mismatched elements, with reports

Assume that the transaction amount from the same account should be the same even though we obtain the data from two distinct sources.

Source_a and source_b are 2 different dataframes coming from 2 different sources.

import pandas as pd

source_a = pd.DataFrame({
    'Transaction_ID': ['TXN001', 'TXN002', 'TXN003', 'TXN004'],
    'Account_Number': [12345, 12346, 12347, 12348],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'Amount': [500.00, 1500.00, 750.00, 300.00]
})

source_b = pd.DataFrame({
    'Transaction_ID': ['TXN005', 'TXN006', 'TXN007', 'TXN008'],
    'Account_Number': [12345, 12346, 12349, 12347],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-03'],
    'Amount': [500.00, 1500.00, 100.00, 700.00]
})

display(source_a)
display(source_b)
Image by Author

Let’s Perform an outer join to combine both datasets and identify mismatches and matches, then create reconciliation report.

Mismatches

There are 2 solutions below :

Solution A :

reconciled = pd.merge(source_a, source_b, on='Account_Number', how='outer', suffixes=('_SourceA', '_SourceB'), indicator=True)

mismatches = reconciled[reconciled['_merge'] != 'both']
display(mismatches)
Image by Author

Solution B :

Let’s take the same source_a, source_b and then get the mismatches.

pd.concat([source_a,source_b],axis=0).drop_duplicates(subset='Account_Number',keep=False)
Image by Author

Solution C:

Let’s take the same source_a, source_b and then get the mismatches.

import numpy as np
diff = np.setxor1d(np.array(source_a.Account_Number), np.array(source_b.Account_Number))
pd.concat([source_a.loc[source_a['Account_Number'].isin(list(diff)),:],source_b.loc[source_b['Account_Number'].isin(list(diff)),:]])
Image by Author

Matches

To reconcile values in matching records:

value_discrepancies = reconciled[
    (reconciled['_merge'] == 'both') & 
    (reconciled['Amount_SourceA'] != reconciled['Amount_SourceB'])
]
display(value_discrepancies)
Image by Author

Reconciliation report

Create a reconciliation report like below :

print(f"Total Transactions in Source A: {len(source_a)}")
print(f"Total Transactions in Source B: {len(source_b)}")
print(f"Mismatched Transactions: {len(mismatches)}")
print(f"Value Discrepancies: {len(value_discrepancies)}")
Image by Author

Example 2 : Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame

Let’s use the same dataframe and give some null values to the data

import pandas as pd

source_a = pd.DataFrame({
    'Transaction_ID': ['TXN001', 'TXN002', 'TXN003', 'TXN004'],
    'Account_Number': [12345, 12346, 12347, 12348],
    'Date': [np.nan, '2023-01-02', '2023-01-03', '2023-01-04'],
    'Amount': [500.00, 1500.00, np.nan, 300.00]
})

source_b = pd.DataFrame({
    'Transaction_ID': ['TXN005', 'TXN006', 'TXN007', 'TXN008'],
    'Account_Number': [12345, 12346, 12349, 12347],
    'Date': ['2023-01-01', '2023-01-02', np.nan, '2023-01-03'],
    'Amount': [500.00, np.nan, 100.00, 300.00]
})

display(source_a)
display(source_b)source_a.combine_first(source_b,)
Image by Author

We’ll use combine_first method to fill up the not null values.

source_a.combine_first(source_b)
Image by Author

Find the code in Github Repo : Github Link

Best Practices for Data Reconciliation

  • Use Unique Keys: Ensure datasets have unique identifiers (e.g., Transaction IDs).
  • Standardize Data: Align formats for dates, numbers, and text.
  • Automate the Process: Create reusable scripts to save time.
  • Validate Regularly: Periodic reconciliation minimizes errors.

Conclusion

Data reconciliation is essential for maintaining the integrity of financial data. Python, with its powerful libraries, makes this process efficient and scalable. By following the steps outlined above, you can easily find and resolve discrepancies in your datasets.

If you like my article give a clap and reach out to my socials for data engineering related : Medium | Youtube | Instagram | Twitter | Facebook | Linked-In

Data Engineer
Data Reconciliation
Python
Pandas
Dataframes
Recommended from ReadMedium