Ferry Djaja

Create a Receipt Parser Using OCR and a Large Language Model

In this tutorial, I will go through how I use OCR to capture data from receipts and then use a Large Language Model (LLM) to extract pertinent details such as the total amount, the date and time of the receipt, and additional relevant information.

To perform OCR, I will utilize the docTR tool from Mindee as outlined below.

To retrieve the information from the receipt, I will use Azure’s OpenAI capabilities.

Construct the OCR Output Data

Let’s begin by installing docTR and the necessary libraries on your machine. I will not go through the installation in detail, as you can find comprehensive instructions in the provided Git repository.

To check that the installation succeeded, run the code below with the provided receipt image in JPEG format.

import os
import json

# Let's pick the desired backend
# os.environ['USE_TF'] = '1'
os.environ['USE_TORCH'] = '1'

import matplotlib.pyplot as plt

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Read the file
doc = DocumentFile.from_images("receipt.jpg")
print(f"Number of pages: {len(doc)}")

If there is no error, you will get this output:

Number of pages: 1
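If the script fails at the import step instead, it can help to confirm which packages Python can actually locate before debugging further. The following is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical; `doctr` is the import name used in the code above, and `torch` is assumed to be the package behind the `USE_TORCH` backend.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# 'doctr' is imported by the test script; 'torch' is the assumed backend package.
for name in ("doctr", "torch"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

A "missing" line usually means the package was installed into a different Python environment than the one running the script.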

Let’s proceed with the instantiation of a pre-trained model.

# Instantiate a pretrained model
predictor = ocr_predictor(pretrained=True)

Run the predictor and export the result. Note that export() returns a Python dictionary with a JSON-like structure.

result = predictor(doc)

# JSON export
json_export = result.export()
print(json_export)

You will get this output:

{'pages': [{'page_idx': 0, 'dimensions': (600, 600), 'orientation': {'value': None, 'confidence': None}, 'language': {'value': None, 'confidence': None}, 'blocks': [{'geometry': ((0.2734375, 0.0), (0.6875, 0.1162109375)), 'lines': [{'geometry': ((0.33984375, 0.0), (0.6171875, 0.0234375)), 'words': [{'value': '#01-901', 'confidence': 0.9932250380516052, 'geometry': ((0.33984375, 0.001953125), (0.416015625, 0.0234375))}, {'value': 'SINGAPORE', 'confidence': 0.9812156558036804, 'geometry': ((0.4208984375, 0.001953125), (0.54296875, 0.01953125))}, {'value': '380011', 'confidence': 0.562835156917572, 'geometry': ((0.5458984375, 0.0), (0.6171875, 0.017578125))}]}, {'geometry': ((0.2734375, 0.017578125), (0.6875, 0.05078125)), 'words': [{'value': 'GST', 'confidence': 0.9999666213989258, 'geometry': ((0.2734375, 0.02734375), (0.3212890625, 0.0498046875))}, {'value': 'Reg:', 'confidence': 0.9997168183326721, 'geometry': ((0.322265625, 0.02734375), (0.3671875, 0.05078125))}, {'value': 'M2-0065333-5', 'confidence': 0.6861922740936279, 'geometry': ((0.3720703125, 0.0234375), (0.5087890625, 0.0439453125))}, {'value': 'UEN:', 'confidence': 0.9687079787254333, 'geometry': ((0.5087890625, 0.0205078125), (0.5625, 0.0419921875))}, {'value': '198304925E', 'confidence': 0.9952959418296814, 'geometry': ((0.56640625, 0.017578125), (0.6875, 0.0380859375))}]}, {'geometry': ((0.3603515625, 0.0439453125), (0.6015625, 0.0693359375)), 'words': [{'value': 'Phone', 'confidence': 0.9936328530311584, 'geometry': ((0.3603515625, 0.0498046875), (0.423828125, 0.068359375))}, {'value': ':', 'confidence': 0.9998807907104492, 'geometry': ((0.423828125, 0.0478515625), (0.4404296875, 0.0693359375))}, {'value': '67472780', 'confidence': 0.9968281388282776, 'geometry': ((0.4365234375, 0.0458984375), (0.5380859375, 0.0673828125))}, {'value': 'Fax:-', 'confidence': 0.9917964935302734, 'geometry': ((0.5380859375, 0.0439453125), (0.6015625, 0.0654296875))}]}, {'geometry': ((0.3720703125, 0.0703125), 
(0.5888671875, 0.095703125)), 'words': [{'value': 'Manager:', 'confidence': 0.6913022398948669, 'geometry': ((0.3720703125, 0.07421875), (0.4609375, 0.095703125))}, {'value': 'SIVAKUMAR', 'confidence': 0.9983320832252502, 'geometry': ((0.4658203125, 0.0703125), (0.5888671875, 0.0908203125))}]}, {'geometry': ((0.373046875, 0.09375), (0.5869140625, 0.1162109375)), 'words': [{'value': 'Contact', 'confidence': 0.992266833782196, 'geometry': ((0.373046875, 0.0966796875), (0.4482421875, 0.115234375))}, {'value': 'No.:', 'confidence': 0.9826020002365112, 'geometry': ((0.4482421875, 0.09375), (0.4912109375, 0.1162109375))}, {'value': '88008584', 'confidence': 0.8402541875839233, 'geometry': ((0.494140625, 0.09375), (0.5869140625, 0.1123046875))}]}], 'artefacts': []}, {'geometry': ((0.3046875, 0.134765625), (0.6611328125, 0.2314453125)), 'lines': [{'geometry': ((0.3056640625, 0.134765625), (0.6572265625, 0.16015625)), 'words': [{'value': 'Terminal:', 'confidence': 0.8031894564628601, 'geometry': ((0.3056640625, 0.142578125), (0.396484375, 0.16015625))}, {'value': 'BK0003', 'confidence': 0.8097429275512695, 'geometry': ((0.404296875, 0.140625), (0.4814453125, 0.1591796875))}, {'value': '13/02/2022', 'confidence': 0.8739034533500671, 'geometry': ((0.4892578125, 0.1376953125), (0.595703125, 0.158203125))}, {'value': '19:21', 'confidence': 0.9997132420539856, 'geometry': ((0.603515625, 0.134765625), (0.6572265625, 0.15625))}]}, {'geometry': ((0.3046875, 0.1630859375), (0.6611328125, 0.1904296875)), 'words': [{'value': 'ReceiptTaxInvoice', 'confidence': 0.4457036852836609, 'geometry': ((0.3046875, 0.166015625), (0.4892578125, 0.1904296875))}, {'value': 'BKA3500490695', 'confidence': 0.504152774810791, 'geometry': ((0.49609375, 0.1630859375), (0.6611328125, 0.18359375))}]}, {'geometry': ((0.3662109375, 0.1884765625), (0.59765625, 0.208984375)), 'words': [{'value': 'Quotation', 'confidence': 0.8169445991516113, 'geometry': ((0.3662109375, 0.1904296875), (0.458984375, 
0.208984375))}, {'value': 'No.', 'confidence': 0.9977673292160034, 'geometry': ((0.4609375, 0.1884765625), (0.498046875, 0.208984375))}, {'value': ':', 'confidence': 0.9996732473373413, 'geometry': ((0.4990234375, 0.189453125), (0.5126953125, 0.2080078125))}, {'value': 'S031362', 'confidence': 0.5456238985061646, 'geometry': ((0.51171875, 0.1884765625), (0.59765625, 0.20703125))}]}, {'geometry': ((0.34375, 0.208984375), (0.6220703125, 0.2314453125)), 'words': [{'value': 'Cashier:', 'confidence': 0.9858759045600891, 'geometry': ((0.34375, 0.212890625), (0.4228515625, 0.2314453125))}, {'value': 'HONG', 'confidence': 0.9993447661399841, 'geometry': ((0.43359375, 0.2099609375), (0.4990234375, 0.2314453125))}, {'value': 'THI', 'confidence': 0.9992380142211914, 'geometry': ((0.5, 0.2099609375), (0.537109375, 0.2294921875))}, {'value': 'BE', 'confidence': 0.9985008239746094, 'geometry': ((0.5390625, 0.208984375), (0.572265625, 0.2314453125))}, {'value': 'DAO', 'confidence': 0.9940517544746399, 'geometry': ((0.5732421875, 0.208984375), (0.6220703125, 0.228515625))}]}], 'artefacts': []}, {'geometry': ((0.2451171875, 0.24609375), (0.40234375, 0.26953125)), 'lines': [{'geometry': ((0.2451171875, 0.24609375), (0.40234375, 0.26953125)), 'words': [{'value': 'No', 'confidence': 0.9999253749847412, 'geometry': ((0.2451171875, 0.24609375), (0.2822265625, 0.26953125))}, {'value': 'Description', 'confidence': 0.9901004433631897, 'geometry': ((0.294921875, 0.248046875), (0.40234375, 0.26953125))}]}], 'artefacts': []}, {'geometry': ((0.564453125, 0.2421875), (0.7177734375, 0.26953125)), 'lines': [{'geometry': ((0.564453125, 0.2421875), (0.7177734375, 0.26953125)), 'words': [{'value': 'Qty', 'confidence': 0.9939969778060913, 'geometry': ((0.564453125, 0.2421875), (0.6064453125, 0.26953125))}, {'value': 'Amount', 'confidence': 0.9966546297073364, 'geometry': ((0.640625, 0.2431640625), (0.7177734375, 0.26171875))}]}], 'artefacts': []}, {'geometry': ((0.2578125, 0.2724609375), 
(0.5908203125, 0.298828125)), 'lines': [{'geometry': ((0.2578125, 0.2724609375), (0.5908203125, 0.298828125)), 'words': [{'value': '1.', 'confidence': 0.9985117316246033, 'geometry': ((0.2578125, 0.2744140625), (0.2919921875, 0.298828125))}, {'value': '#OTIS', 'confidence': 0.9894990921020508, 'geometry': ((0.2919921875, 0.275390625), (0.3642578125, 0.2978515625))}, {'value': 'BARISTA', 'confidence': 0.42725348472595215, 'geometry': ((0.3662109375, 0.2763671875), (0.458984375, 0.2939453125))}, {'value': 'OAT', 'confidence': 0.999354898929596, 'geometry': ((0.4609375, 0.2744140625), (0.5068359375, 0.2939453125))}, {'value': 'MILK', 'confidence': 0.9774147272109985, 'geometry': ((0.5087890625, 0.2724609375), (0.5634765625, 0.2939453125))}, {'value': '1L', 'confidence': 0.9945043325424194, 'geometry': ((0.5595703125, 0.2724609375), (0.5908203125, 0.29296875))}]}], 'artefacts': []}, {'geometry': ((0.2490234375, 0.30859375), (0.45703125, 0.40234375)), 'lines': [{'geometry': ((0.306640625, 0.30859375), (0.45703125, 0.326171875)), 'words': [{'value': '9421906089017', 'confidence': 0.9027230143547058, 'geometry': ((0.306640625, 0.30859375), (0.45703125, 0.326171875))}]}, {'geometry': ((0.3046875, 0.3330078125), (0.4208984375, 0.3544921875)), 'words': [{'value': '2', 'confidence': 0.9997554421424866, 'geometry': ((0.3046875, 0.3330078125), (0.322265625, 0.3544921875))}, {'value': 'for', 'confidence': 0.9995049238204956, 'geometry': ((0.3212890625, 0.3330078125), (0.3525390625, 0.3544921875))}, {'value': '$11.95', 'confidence': 0.9979997277259827, 'geometry': ((0.353515625, 0.333984375), (0.4208984375, 0.3525390625))}]}, {'geometry': ((0.2490234375, 0.3798828125), (0.3828125, 0.40234375)), 'words': [{'value': 'Total', 'confidence': 0.9654089212417603, 'geometry': ((0.2490234375, 0.3798828125), (0.302734375, 0.40234375))}, {'value': 'Amount', 'confidence': 0.9976258873939514, 'geometry': ((0.3056640625, 0.3818359375), (0.3828125, 0.400390625))}]}], 'artefacts': []}, 
{'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875)), 'lines': [{'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875)), 'words': [{'value': '4x6.95', 'confidence': 0.629564642906189, 'geometry': ((0.529296875, 0.3056640625), (0.607421875, 0.32421875))}]}], 'artefacts': []}, {'geometry': ((0.6513671875, 0.3017578125), (0.724609375, 0.47265625)), 'lines': [{'geometry': ((0.662109375, 0.3017578125), (0.7216796875, 0.3232421875)), 'words': [{'value': '27.80', 'confidence': 0.9991148114204407, 'geometry': ((0.662109375, 0.3017578125), (0.7216796875, 0.3232421875))}]}, {'geometry': ((0.66796875, 0.328125), (0.7216796875, 0.349609375)), 'words': [{'value': '-3.90', 'confidence': 0.9843301177024841, 'geometry': ((0.66796875, 0.328125), (0.7216796875, 0.349609375))}]}, {'geometry': ((0.6513671875, 0.375), (0.7216796875, 0.3974609375)), 'words': [{'value': '$23.90', 'confidence': 0.9994686245918274, 'geometry': ((0.6513671875, 0.375), (0.7216796875, 0.3974609375))}]}, {'geometry': ((0.65234375, 0.4111328125), (0.72265625, 0.4326171875)), 'words': [{'value': '$23.90', 'confidence': 0.9990628361701965, 'geometry': ((0.65234375, 0.4111328125), (0.72265625, 0.4326171875))}]}, {'geometry': ((0.666015625, 0.4501953125), (0.724609375, 0.47265625)), 'words': [{'value': '$0.00', 'confidence': 0.9990418553352356, 'geometry': ((0.666015625, 0.4501953125), (0.724609375, 0.47265625))}]}], 'artefacts': []}, {'geometry': ((0.248046875, 0.416015625), (0.4931640625, 0.560546875)), 'lines': [{'geometry': ((0.2509765625, 0.416015625), (0.4931640625, 0.4384765625)), 'words': [{'value': 'MASIERICOOID', 'confidence': 0.16996675729751587, 'geometry': ((0.2509765625, 0.416015625), (0.4931640625, 0.4384765625))}]}, {'geometry': ((0.2490234375, 0.455078125), (0.376953125, 0.48046875)), 'words': [{'value': 'Change', 'confidence': 0.9970219731330872, 'geometry': ((0.2490234375, 0.4560546875), (0.330078125, 0.48046875))}, {'value': 'Due', 'confidence': 0.9999706745147705, 
'geometry': ((0.33203125, 0.455078125), (0.376953125, 0.4775390625))}]}, {'geometry': ((0.2490234375, 0.48828125), (0.447265625, 0.51171875)), 'words': [{'value': 'Items', 'confidence': 0.9890830516815186, 'geometry': ((0.2490234375, 0.490234375), (0.306640625, 0.51171875))}, {'value': 'Purchased', 'confidence': 0.9993000030517578, 'geometry': ((0.310546875, 0.4892578125), (0.4189453125, 0.509765625))}, {'value': ':', 'confidence': 0.9981997013092041, 'geometry': ((0.419921875, 0.490234375), (0.43359375, 0.509765625))}, {'value': '4', 'confidence': 0.9994581341743469, 'geometry': ((0.4296875, 0.48828125), (0.447265625, 0.509765625))}]}, {'geometry': ((0.248046875, 0.53125), (0.3935546875, 0.560546875)), 'words': [{'value': '#Total', 'confidence': 0.9086952209472656, 'geometry': ((0.248046875, 0.53125), (0.322265625, 0.560546875))}, {'value': 'Saving', 'confidence': 0.9651548862457275, 'geometry': ((0.3232421875, 0.5341796875), (0.3935546875, 0.5595703125))}]}], 'artefacts': []}, {'geometry': ((0.4296875, 0.5322265625), (0.4970703125, 0.5546875)), 'lines': [{'geometry': ((0.4296875, 0.5322265625), (0.4970703125, 0.5546875)), 'words': [{'value': '-', 'confidence': 0.43670952320098877, 'geometry': ((0.4296875, 0.5361328125), (0.4453125, 0.55078125))}, {'value': '$3.90', 'confidence': 0.9483895301818848, 'geometry': ((0.4365234375, 0.5322265625), (0.4970703125, 0.5546875))}]}], 'artefacts': []}, {'geometry': ((0.2509765625, 0.564453125), (0.6005859375, 0.5908203125)), 'lines': [{'geometry': ((0.2509765625, 0.564453125), (0.6005859375, 0.5908203125)), 'words': [{'value': 'GST', 'confidence': 0.9998136162757874, 'geometry': ((0.2509765625, 0.568359375), (0.30078125, 0.5908203125))}, {'value': '%', 'confidence': 0.999920129776001, 'geometry': ((0.3017578125, 0.5673828125), (0.3271484375, 0.5908203125))}, {'value': 'Exclude', 'confidence': 0.8899426460266113, 'geometry': ((0.353515625, 0.568359375), (0.43359375, 0.5869140625))}, {'value': 'GST', 'confidence': 
0.9998469352722168, 'geometry': ((0.435546875, 0.5654296875), (0.4853515625, 0.587890625))}, {'value': 'GST', 'confidence': 0.998401939868927, 'geometry': ((0.5048828125, 0.564453125), (0.5546875, 0.5869140625))}, {'value': 'Amt', 'confidence': 0.850462794303894, 'geometry': ((0.5546875, 0.564453125), (0.6005859375, 0.5869140625))}]}], 'artefacts': []}, {'geometry': ((0.6533203125, 0.5654296875), (0.734375, 0.6171875)), 'lines': [{'geometry': ((0.6533203125, 0.5654296875), (0.7314453125, 0.583984375)), 'words': [{'value': 'Amount', 'confidence': 0.9848493337631226, 'geometry': ((0.6533203125, 0.5654296875), (0.7314453125, 0.583984375))}]}, {'geometry': ((0.6630859375, 0.5947265625), (0.734375, 0.6171875)), 'words': [{'value': '$23.90', 'confidence': 0.9978439807891846, 'geometry': ((0.6630859375, 0.5947265625), (0.734375, 0.6171875))}]}], 'artefacts': []}, {'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125)), 'lines': [{'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125)), 'words': [{'value': '7', 'confidence': 0.9998346567153931, 'geometry': ((0.2802734375, 0.599609375), (0.2978515625, 0.6220703125))}]}], 'artefacts': []}, {'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875)), 'lines': [{'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875)), 'words': [{'value': '$22.34', 'confidence': 0.9993184804916382, 'geometry': ((0.4140625, 0.5986328125), (0.484375, 0.6201171875))}]}], 'artefacts': []}, {'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625)), 'lines': [{'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625)), 'words': [{'value': '$1.56', 'confidence': 0.9944227337837219, 'geometry': ((0.541015625, 0.5966796875), (0.6005859375, 0.6181640625))}]}], 'artefacts': []}, {'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875)), 'lines': [{'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875)), 'words': [{'value': 'MASTER', 'confidence': 
0.8670670986175537, 'geometry': ((0.4404296875, 0.666015625), (0.5400390625, 0.6875))}]}], 'artefacts': []}, {'geometry': ((0.248046875, 0.701171875), (0.6865234375, 0.74609375)), 'lines': [{'geometry': ((0.248046875, 0.701171875), (0.642578125, 0.7216796875)), 'words': [{'value': 'DatelTime:', 'confidence': 0.8654562830924988, 'geometry': ((0.248046875, 0.7041015625), (0.337890625, 0.7216796875))}, {'value': '13022022192100', 'confidence': 0.6854404211044312, 'geometry': ((0.35546875, 0.7041015625), (0.525390625, 0.71875))}, {'value': '(Contactiess)', 'confidence': 0.5816012024879456, 'geometry': ((0.5361328125, 0.701171875), (0.642578125, 0.7216796875))}]}, {'geometry': ((0.248046875, 0.7255859375), (0.6865234375, 0.74609375)), 'words': [{'value': 'Mercid', 'confidence': 0.8570956587791443, 'geometry': ((0.248046875, 0.7275390625), (0.3134765625, 0.74609375))}, {'value': '000001050644651', 'confidence': 0.7285884022712708, 'geometry': ((0.34375, 0.7265625), (0.4970703125, 0.744140625))}, {'value': 'Terminal', 'confidence': 0.9665992259979248, 'geometry': ((0.5068359375, 0.7265625), (0.5830078125, 0.744140625))}, {'value': '-', 'confidence': 0.93905109167099, 'geometry': ((0.591796875, 0.728515625), (0.6015625, 0.7421875))}, {'value': '51523260', 'confidence': 0.9988250136375427, 'geometry': ((0.6015625, 0.7255859375), (0.6865234375, 0.7431640625))}]}], 'artefacts': []}, {'geometry': ((0.2490234375, 0.75), (0.4697265625, 0.79296875)), 'lines': [{'geometry': ((0.25, 0.75), (0.4111328125, 0.7724609375)), 'words': [{'value': 'Approval', 'confidence': 0.9914907813072205, 'geometry': ((0.25, 0.7509765625), (0.326171875, 0.7724609375))}, {'value': ':', 'confidence': 0.9201642274856567, 'geometry': ((0.3330078125, 0.7509765625), (0.3466796875, 0.7705078125))}, {'value': 'R69046', 'confidence': 0.9995259046554565, 'geometry': ((0.3427734375, 0.75), (0.4111328125, 0.7685546875))}]}, {'geometry': ((0.2490234375, 0.771484375), (0.4697265625, 0.79296875)), 'words': [{'value': 
'RefNo', 'confidence': 0.9922246932983398, 'geometry': ((0.2490234375, 0.771484375), (0.3125, 0.79296875))}, {'value': '000011076745', 'confidence': 0.9994035959243774, 'geometry': ((0.34375, 0.771484375), (0.4697265625, 0.7890625))}]}], 'artefacts': []}, {'geometry': ((0.5078125, 0.748046875), (0.576171875, 0.814453125)), 'lines': [{'geometry': ((0.5078125, 0.748046875), (0.5595703125, 0.7666015625)), 'words': [{'value': 'Batch', 'confidence': 0.9954745173454285, 'geometry': ((0.5078125, 0.748046875), (0.5595703125, 0.7666015625))}]}, {'geometry': ((0.5078125, 0.7685546875), (0.552734375, 0.7880859375)), 'words': [{'value': 'Card', 'confidence': 0.9997015595436096, 'geometry': ((0.5078125, 0.7685546875), (0.552734375, 0.7880859375))}]}, {'geometry': ((0.5078125, 0.7958984375), (0.576171875, 0.814453125)), 'words': [{'value': 'Amount', 'confidence': 0.9982516169548035, 'geometry': ((0.5078125, 0.7958984375), (0.576171875, 0.814453125))}]}], 'artefacts': []}, {'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375)), 'lines': [{'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375)), 'words': [{'value': '000435', 'confidence': 0.9871779680252075, 'geometry': ((0.6015625, 0.7470703125), (0.6669921875, 0.7646484375))}]}], 'artefacts': []}, {'geometry': ((0.65625, 0.7685546875), (0.7373046875, 0.857421875)), 'lines': [{'geometry': ((0.6728515625, 0.7685546875), (0.7138671875, 0.783203125)), 'words': [{'value': '1641', 'confidence': 0.9989182949066162, 'geometry': ((0.6728515625, 0.7685546875), (0.7138671875, 0.783203125))}]}, {'geometry': ((0.666015625, 0.7939453125), (0.732421875, 0.8125)), 'words': [{'value': '$23.90', 'confidence': 0.9973084926605225, 'geometry': ((0.666015625, 0.7939453125), (0.732421875, 0.8125))}]}, {'geometry': ((0.65625, 0.8330078125), (0.7373046875, 0.857421875)), 'words': [{'value': '$23.90', 'confidence': 0.9831066131591797, 'geometry': ((0.65625, 0.8330078125), (0.7373046875, 0.857421875))}]}], 'artefacts': []}, 
{'geometry': ((0.4208984375, 0.8369140625), (0.57421875, 0.9228515625)), 'lines': [{'geometry': ((0.4345703125, 0.8369140625), (0.560546875, 0.8603515625)), 'words': [{'value': 'Net', 'confidence': 0.9999843835830688, 'geometry': ((0.4345703125, 0.8369140625), (0.4765625, 0.8603515625))}, {'value': 'Amount', 'confidence': 0.9871867895126343, 'geometry': ((0.4775390625, 0.8369140625), (0.560546875, 0.8583984375))}]}, {'geometry': ((0.4208984375, 0.8984375), (0.57421875, 0.9228515625)), 'words': [{'value': 'APPROVED', 'confidence': 0.9999109506607056, 'geometry': ((0.4208984375, 0.8984375), (0.57421875, 0.9228515625))}]}], 'artefacts': []}]}]}
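The export is nested as pages, then blocks, then lines, then words. To make that structure easier to read, here is a small sketch that walks the hierarchy, joins the words of each line, and lists words recognized with low confidence. The function names and the 0.5 threshold are my own assumptions, and the sample is a trimmed excerpt of the output above with confidences rounded.

```python
def lines_of_text(export):
    """Join the words of every line, walking pages -> blocks -> lines -> words."""
    out = []
    for page in export["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                out.append(" ".join(word["value"] for word in line["words"]))
    return out


def low_confidence_words(export, threshold=0.5):
    """Return (value, confidence) pairs for words below the threshold (assumed 0.5)."""
    return [
        (word["value"], word["confidence"])
        for page in export["pages"]
        for block in page["blocks"]
        for line in block["lines"]
        for word in line["words"]
        if word["confidence"] < threshold
    ]


# Trimmed excerpt of the export above (geometry omitted, confidences rounded)
sample = {"pages": [{"blocks": [
    {"lines": [{"words": [
        {"value": "Terminal:", "confidence": 0.803},
        {"value": "BK0003", "confidence": 0.810},
        {"value": "13/02/2022", "confidence": 0.874},
        {"value": "19:21", "confidence": 0.9997},
    ]}]},
    {"lines": [{"words": [{"value": "MASIERICOOID", "confidence": 0.170}]}]},
]}]}

print(lines_of_text(sample))
print(low_confidence_words(sample))
```

On this excerpt, the misread word MASIERICOOID is the only one flagged, which matches its very low confidence in the full output.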

Let’s visualize the result with matplotlib by rendering a synthetic page from the recognized words.

synthetic_pages = result.synthesize()
plt.figure(figsize=(18, 16))  # Adjust the width and height as needed
plt.imshow(synthetic_pages[0]); plt.axis('off'); plt.show()
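The geometry values in the export are relative coordinates between 0 and 1. As a rough sketch, they can be scaled back to pixels using the page dimensions. This assumes that each geometry is ((xmin, ymin), (xmax, ymax)) and that dimensions is (height, width); the helper name `to_pixels` is hypothetical.

```python
def to_pixels(geometry, dimensions):
    """Convert a relative ((xmin, ymin), (xmax, ymax)) box to integer pixel coordinates."""
    height, width = dimensions  # assumed order: (height, width)
    (xmin, ymin), (xmax, ymax) = geometry
    return (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )


# Geometry of the first '$23.90' word in the output above, on the 600 x 600 page
box = to_pixels(((0.6513671875, 0.375), (0.7216796875, 0.3974609375)), (600, 600))
print(box)  # (391, 225, 433, 238)
```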

Next, we need to remove the irrelevant information from the JSON output, such as the dimensions, orientation, language, and the geometry associated with blocks and lines. My focus is solely on the data under words: the value and geometry, without the confidence, as highlighted in the box below.

The following code removes the irrelevant information from the JSON output.

# Define a function to remove fields recursively
def remove_fields(obj, fields):
    if isinstance(obj, list):
        for item in obj:
            remove_fields(item, fields)
    elif isinstance(obj, dict):
        for key in list(obj.keys()):
            if key in fields:
                del obj[key]
            else:
                remove_fields(obj[key], fields)

# Helper that removes 'geometry' at every level, including 'words'.
# It is not called below, because the word-level geometry must be kept.
def remove_geometry(data):
    if isinstance(data, list):
        for item in data:
            remove_geometry(item)
    elif isinstance(data, dict):
        if 'geometry' in data:
            del data['geometry']
        for key, value in data.items():
            remove_geometry(value)

# Fields to remove
fields_to_remove = ['confidence', 'page_idx', 'dimensions', 'orientation', 'language', 'artefacts']

# Remove the specified fields
remove_fields(json_export, fields_to_remove)

# Remove 'geometry' from 'blocks' and 'lines'
for page in json_export['pages']:
    for block in page['blocks']:
        if 'geometry' in block:
            del block['geometry']
        for line in block.get('lines', []):
            if 'geometry' in line:
                del line['geometry']

# Convert the modified data back to JSON
modified_json = json.dumps(json_export, separators=(',', ':'))

# Print the modified JSON
print(modified_json)
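The cleanup above edits json_export in place. As an alternative sketch, the same reduced structure can be built as a new dictionary, which leaves the original export untouched in case the confidence scores are needed later. The function name `slim_export` is hypothetical, and the block and line geometry in the sample are placeholder values; only the word entry is taken from the output above.

```python
import json


def slim_export(export):
    """Return a new dict that keeps only each word's value and geometry."""
    return {
        "pages": [
            {"blocks": [
                {"lines": [
                    {"words": [
                        {"value": w["value"], "geometry": w["geometry"]}
                        for w in line["words"]
                    ]}
                    for line in block["lines"]
                ]}
                for block in page["blocks"]
            ]}
            for page in export["pages"]
        ]
    }


sample = {"pages": [{
    "page_idx": 0,
    "dimensions": (600, 600),
    "blocks": [{
        "geometry": ((0.0, 0.0), (1.0, 1.0)),  # placeholder block box
        "lines": [{
            "geometry": ((0.0, 0.0), (1.0, 1.0)),  # placeholder line box
            "words": [{
                "value": "19:21",
                "confidence": 0.9997132420539856,
                "geometry": ((0.603515625, 0.134765625), (0.6572265625, 0.15625)),
            }],
        }],
        "artefacts": [],
    }],
}]}

compact = json.dumps(slim_export(sample), separators=(",", ":"))
print(compact)
```

Note that json.dumps serializes the geometry tuples as JSON arrays, so the text handed to the LLM contains nested lists rather than tuples.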

Subsequently, save the output to a file named OCR.txt.

# modified_json is already a JSON string, so no conversion is needed

# Write the JSON data to a file
with open("OCR.txt", "w") as file:
    file.write(modified_json)
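Before handing the file to the LLM, it may be worth confirming that what was written is still valid JSON. Here is a minimal sketch that writes a JSON string, reads it back, and counts the word entries. The helper name `write_and_verify` is hypothetical, and the sample string is a shortened stand-in for the real content, written to a temporary directory so it does not overwrite the actual OCR.txt.

```python
import json
import os
import tempfile


def write_and_verify(json_text, path):
    """Write JSON text to path, read it back, and return the number of word entries."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(json_text)
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # raises json.JSONDecodeError if the file is not valid JSON
    return sum(
        len(line["words"])
        for page in data["pages"]
        for block in page["blocks"]
        for line in block["lines"]
    )


# Shortened stand-in for the cleaned OCR output
sample_text = (
    '{"pages":[{"blocks":[{"lines":[{"words":'
    '[{"value":"19:21","geometry":[[0.6,0.13],[0.66,0.16]]}]}]}]}]}'
)
path = os.path.join(tempfile.mkdtemp(), "OCR.txt")
print(write_and_verify(sample_text, path))  # 1
```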

The resulting output will now appear as follows:

Now we are ready to provide this information to the LLM.

Input into the LLM

We will proceed by importing the LangChain libraries and entering the Azure OpenAI API key.

from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import RetrievalQA

import os

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = ""
os.environ["OPENAI_API_BASE"] = ""
os.environ["OPENAI_API_KEY"] = ""
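The empty strings above are placeholders for your own Azure OpenAI settings. As a small safeguard, the sketch below reports which of the four variables are still unset or blank before any request is made. The helper name `missing_settings` is hypothetical.

```python
import os

REQUIRED = (
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
)


def missing_settings(env=os.environ):
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]


# Example with only the API type filled in, as in the snippet above
print(missing_settings({"OPENAI_API_TYPE": "azure", "OPENAI_API_KEY": ""}))
```

Calling missing_settings() with no argument checks the real environment of the running process.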

We load the OCR.txt file, split its contents, and insert them into the FAISS database as vectors with OpenAI embeddings.

embedding_model = OpenAIEmbeddings(chunk_size=10)
OCR_Content = TextLoader('OCR.txt').load()
text_splitter = CharacterTextSplitter(chunk_overlap=100)
content = text_splitter.split_documents(OCR_Content)
faiss_db = FAISS.from_documents(content, embedding_model)
retriever = faiss_db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
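To illustrate what an overlap between chunks means, here is a simplified sliding-window sketch in plain Python. It is not LangChain's implementation: CharacterTextSplitter splits on a separator first and then merges the pieces, so its chunks will differ. The function name and the tiny sizes are assumptions chosen only to make the overlap visible.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap keeps text that straddles a chunk boundary available in both neighboring chunks, which matters for receipt data where a label and its amount may otherwise land in different chunks.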

We set the temperature to 0 and utilize the gpt-4 deployment. Additionally, we establish the prompt template.

Within the prompt, I explicitly stated:

Analyze the JSON receipt data provided and group “value” entries with similar “geometry” proximity under “words,” then summarize this information into one concise sentence.

llm = AzureChatOpenAI(
    temperature=0,
    deployment_name="gpt-4",
)

prompt_template = """

Task: Analyze the JSON receipt data provided and group "value" entries with similar "geometry" proximity under "words," then summarize this information into one concise sentence.
    
JSON Data:
{context}
    
User questions: 
{question}
       
Respond to the user in JSON format and include the key-value pairs:

"""
QA_PROMPT = PromptTemplate(
    template=prompt_template, input_variables=['context', 'question']
)
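The prompt asks the model to group values whose geometry is close together. To make that idea concrete, here is a deterministic sketch of one way to do such grouping: words whose vertical centers are close are treated as one row, and each row is read left to right. This is only an illustration, not what the LLM does internally. The function name and the 0.015 tolerance are assumptions; the word entries are taken from the OCR output earlier in the article.

```python
def group_into_rows(words, tolerance=0.015):
    """Group words whose vertical centers lie within tolerance of the row's first word."""
    def y_center(word):
        (_, ymin), (_, ymax) = word["geometry"]
        return (ymin + ymax) / 2

    rows = []
    for word in sorted(words, key=y_center):
        if rows and abs(y_center(word) - rows[-1]["y"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"y": y_center(word), "words": [word]})

    # Read each row left to right by xmin
    return [
        " ".join(w["value"] for w in sorted(row["words"], key=lambda w: w["geometry"][0][0]))
        for row in rows
    ]


words = [
    {"value": "Total", "geometry": ((0.2490234375, 0.3798828125), (0.302734375, 0.40234375))},
    {"value": "Amount", "geometry": ((0.3056640625, 0.3818359375), (0.3828125, 0.400390625))},
    {"value": "$23.90", "geometry": ((0.6513671875, 0.375), (0.7216796875, 0.3974609375))},
    {"value": "Change", "geometry": ((0.2490234375, 0.4560546875), (0.330078125, 0.48046875))},
    {"value": "Due", "geometry": ((0.33203125, 0.455078125), (0.376953125, 0.4775390625))},
    {"value": "$0.00", "geometry": ((0.666015625, 0.4501953125), (0.724609375, 0.47265625))},
]

print(group_into_rows(words))
# ['Total Amount $23.90', 'Change Due $0.00']
```

In the raw export, the labels and their amounts sit in different blocks, so this kind of geometric pairing is what lets a label such as Total Amount be matched with $23.90.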

We will use RetrievalQA with a specific question to extract information such as the amount, receipt number, date & time, and line items.

qa_chain = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=retriever, 
    chain_type_kwargs={"prompt": QA_PROMPT},
    verbose=True
)

question = """

Please extract the following details:
Amount, 
Receipt/Invoice number, 
Date & Time,
Line Items

"""

result = qa_chain({"query": question})
print(result["result"])
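Because the prompt asks for a JSON reply, the returned string can be parsed into a dictionary for downstream use. Models sometimes wrap the JSON in extra text, so the sketch below first tries a direct parse and then falls back to the outermost braces. The helper name `parse_llm_json` is hypothetical, and the reply string is a made-up example for illustration, not the actual model output.

```python
import json
import re


def parse_llm_json(text):
    """Parse a JSON object from an LLM reply, tolerating surrounding text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the span from the first '{' to the last '}'
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the reply")
    return json.loads(match.group(0))


# Made-up reply, used only to exercise the parser
reply = 'Here is the result: {"Amount": "$23.90", "Date & Time": "13/02/2022 19:21"} Hope this helps.'
parsed = parse_llm_json(reply)
print(parsed["Amount"])
```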

Here is the output:

It extracted the amount, receipt number, and receipt date and time accurately. Further tuning is needed to improve the output for the line items.
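One inexpensive way to sanity-check extracted amounts is to test whether the figures on the receipt agree with each other. The sketch below uses the values that appear in the OCR output earlier in the article and Decimal arithmetic to avoid floating-point rounding. The helper name `money` is hypothetical, and reading "4x6.95" as quantity times unit price is my interpretation of the receipt.

```python
from decimal import Decimal


def money(text):
    """Parse an OCR amount such as '$23.90' or '-3.90' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


line_amount = money("27.80")   # amount of the single line item
saving = money("-3.90")        # discount shown under the line item
total = money("$23.90")        # Total Amount
excl_gst = money("$22.34")     # Exclude GST
gst = money("$1.56")           # GST Amt

print(4 * money("6.95") == line_amount)  # True: quantity times unit price
print(line_amount + saving == total)     # True: line amount plus discount
print(excl_gst + gst == total)           # True: net amount plus GST
```

If any of these checks failed for a given receipt, that would be a signal to review the OCR values or the LLM's extraction before trusting the result.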

