Fine-Tuning LayoutLM v2 For Invoice Recognition
From annotation to training and inference
Image by Author: LayoutLMV2 for Invoice Recognition
Introduction
Since I wrote my last article, “Fine-Tuning Transformer Model for Invoice Recognition”, which used the LayoutLM transformer model for invoice recognition, Microsoft has released LayoutLMv2, a new transformer model that significantly outperforms the first LayoutLM. In this tutorial, I will demonstrate step by step how to fine-tune LayoutLMv2 on invoices, from data annotation to model training and inference.
Training and inference scripts are available on Google Colab.
Unlike the first LayoutLM version, LayoutLMv2 integrates the visual features together with the text and positional embeddings in the first input layer of the Transformer architecture, as shown below. This enables the model to learn cross-modality interactions among text, layout, and image in a single multi-modal framework. Here is a snippet from the abstract: “Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672)”.
For more information, please refer to the original paper.
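To make the idea of fusing the modalities in the first layer more concrete, here is a toy sketch in plain Python. It follows my reading of the paper: text token embeddings and visual token embeddings are concatenated into one sequence, and a layout (2D position) embedding is added to every element of that sequence. The function names, vector sizes, and values are made up for illustration; the real model uses learned embeddings and much larger dimensions.

```python
def add_vectors(u, v):
    """Element-wise sum of two equal-length vectors."""
    assert len(u) == len(v)
    return [a + b for a, b in zip(u, v)]


def first_layer_input(text_emb, visual_emb, layout_emb):
    """Build a toy first-layer input sequence.

    text_emb   : one vector per text token
    visual_emb : one vector per visual token (image region)
    layout_emb : one vector per element of the combined sequence
    """
    sequence = text_emb + visual_emb  # concatenate the two modalities
    assert len(sequence) == len(layout_emb)
    return [add_vectors(x, pos) for x, pos in zip(sequence, layout_emb)]


# Toy example: 2 text tokens, 1 visual token, hidden size 3 (all values hypothetical).
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visual = [[0.0, 0.0, 1.0]]
layout = [[0.5, 0.5, 0.5]] * 3

inputs = first_layer_input(text, visual, layout)
print(len(inputs))  # one vector per text or visual token
print(inputs[0])
```

Because both kinds of tokens sit in the same sequence from the very first layer, self-attention can relate a word to an image region directly, which is the cross-modality interaction described above.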
For this tutorial, we have annotated a total of 220 invoices using the UBIAI Text Annotation Tool. UBIAI OCR Annotation allows annotation directly on native PDFs, scanned documents, or PNG and JPG images, whether typed or handwritten. We have recently added support for over 20 languages, including Arabic and Hebrew.
Image by Author: UBIAI Multi-language OCR Annotation
Here is an excellent overview on how to use the tool to annotate PDFs and images:
In addition to the labeled text offsets and bounding boxes, we will need to export the image of each annotated document. This can be done easily with UBIAI since it exports all the annotations along with the images of each document in one ZIP file.
Image by Author: Annotated Data JSON Output
Data Pre-processing:
After exporting the ZIP file from UBIAI, we upload it to a Google Drive folder. We will use Google Colab for model training and inference.
The first step is to open a Google Colab notebook, connect your Google Drive, and install the transformers and detectron2 packages:
To simplify the data pre-processing and model training steps, we have created preprocess.py and train.py files that contain all the code required to launch the training. Clone the repository from GitHub:
! rm -r layoutlmv2_fine_tuning
! git clone -b main https://github.com/walidamamou/layoutlmV2.git
Next, we need to unzip our exported dataset and place all the files in a folder:
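If you prefer to do the extraction from Python rather than from a shell cell, a minimal sketch with the standard library could look like the following. The function name and the paths in the comment are hypothetical; adjust them to wherever you uploaded the UBIAI export.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, target_dir):
    """Extract every file of the exported ZIP into target_dir and return the file paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Return the extracted files in a stable order, ignoring directories.
    return sorted(p for p in target.rglob("*") if p.is_file())


# Example call (hypothetical paths, not taken from the notebook):
# files = extract_dataset("/content/drive/MyDrive/invoices_export.zip", "/content/data")
# print(len(files), "files extracted")
```

Checking the number of extracted files is a quick way to confirm that every annotated document came with both its annotation and its image.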
We are almost ready to launch the training. We just need to specify a few hyper-parameters to configure our model, along with the path for the model output. You can of course play around with these variables to get the best result. For this tutorial, we use a test size of 33%, a batch size of 4, a learning rate of 5e-5, and 50 epochs.
After the training is done, the precision, recall, and F1 score will be displayed as shown below. We obtain an F1 score of 0.75 and an accuracy of 0.96, which is decent given that only 220 invoices were annotated.
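To get a feel for what these values mean in practice, the sketch below works out the size of the split and the number of optimizer steps. It assumes the split is done per document and that the test share is rounded up, as scikit-learn's train_test_split does; train.py may handle this differently, and the TrainConfig class is only a hypothetical container for the four values.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hypothetical container for the hyper-parameters used in this tutorial."""
    test_size: float = 0.33
    batch_size: int = 4
    learning_rate: float = 5e-5
    num_epochs: int = 50


def training_budget(num_documents, cfg):
    """Estimate split sizes and step counts for a document-level split."""
    n_test = math.ceil(num_documents * cfg.test_size)  # test share rounded up
    n_train = num_documents - n_test
    steps_per_epoch = math.ceil(n_train / cfg.batch_size)
    return {
        "train_docs": n_train,
        "test_docs": n_test,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * cfg.num_epochs,
    }


budget = training_budget(220, TrainConfig())
print(budget)
```

Under these assumptions, the 220 invoices give 147 training and 73 test documents, which is a small training set. This is worth keeping in mind when reading the scores.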
Image by Author: LayoutLMV2 Scores
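The gap between the accuracy and the F1 score is typical for this kind of task: most tokens on an invoice carry no label, so predicting the background class correctly inflates accuracy, while F1 only rewards correctly labeled entities. The sketch below illustrates this with token-level, micro-averaged metrics on toy labels. It is not the metric code from train.py, which may score at the entity level instead, and the label sequences are invented.

```python
def token_scores(gold, pred, outside="O"):
    """Token-level accuracy plus micro-averaged precision, recall and F1 over entity labels."""
    assert len(gold) == len(pred)
    pairs = list(zip(gold, pred))
    correct = sum(g == p for g, p in pairs)
    tp = sum(g == p and g != outside for g, p in pairs)   # entity token labeled correctly
    fp = sum(p != outside and p != g for g, p in pairs)   # wrong entity label predicted
    fn = sum(g != outside and p != g for g, p in pairs)   # gold entity token missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(gold), "precision": precision, "recall": recall, "f1": f1}


# Toy labels (not from the invoice dataset): 36 background tokens and 4 entity tokens.
gold = ["O"] * 36 + ["SELLER", "DATE", "INVOICE_NUMBER", "TOTAL"]
pred = ["O"] * 35 + ["TTC_ID", "SELLER", "DATE", "INVOICE_NUMBER", "O"]

scores = token_scores(gold, pred)
print(scores)
```

With one false positive and one missed entity out of four, accuracy stays at 0.95 while F1 drops to 0.75, so F1 is the more informative number for judging entity extraction.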
Inference with LayoutLMv2:
We are now ready to test our newly trained model on a new, unseen invoice. For this step, we will use Google’s Tesseract to OCR the document and LayoutLMv2 to extract entities from the invoice.
Let’s install the pytesseract library:
## install tesseract OCR Engine
! sudo apt install tesseract-ocr
! sudo apt install libtesseract-dev
## install pytesseract, then click the "restart runtime" button in the cell output before moving forward in the notebook
! pip install pytesseract
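Between the OCR step and the model, the words and their pixel boxes have to be turned into the input LayoutLMv2 expects, which uses bounding boxes normalized to a 0-1000 scale. The sketch below shows that conversion under a few assumptions: the input dict has the shape returned by pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT), and the helper names are hypothetical rather than taken from the inference script. A hand-written sample is used so the code runs without Tesseract installed.

```python
def normalize_box(left, top, width, height, page_width, page_height):
    """Scale a pixel box to the 0-1000 range as [x0, y0, x1, y1]."""
    return [
        int(1000 * left / page_width),
        int(1000 * top / page_height),
        int(1000 * (left + width) / page_width),
        int(1000 * (top + height) / page_height),
    ]


def words_and_boxes(ocr, page_width, page_height):
    """Keep non-empty OCR words and pair each one with a normalized box."""
    words, boxes = [], []
    for text, left, top, width, height in zip(
        ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
    ):
        if text.strip():  # Tesseract also emits empty entries for blocks and lines
            words.append(text.strip())
            boxes.append(normalize_box(left, top, width, height, page_width, page_height))
    return words, boxes


# Hand-written sample in the same shape as the pytesseract output (values are invented).
sample = {
    "text": ["", "Invoice", "Total:", " ", "120.00"],
    "left": [0, 100, 100, 0, 300],
    "top": [0, 50, 400, 0, 400],
    "width": [1000, 200, 120, 0, 150],
    "height": [500, 40, 30, 0, 30],
}
words, boxes = words_and_boxes(sample, page_width=1000, page_height=500)
print(words)
print(boxes)
```

Normalizing by the page size makes the layout features independent of the scan resolution, so an invoice scanned at a different DPI produces the same box coordinates.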
Once the inference is done, you will find the predictions overlaid on the image, as well as a JSON file containing all the labels, text, and offsets, in the inference output folder (output2). Let’s look at the model prediction:
Image by Author: LayoutLMV2 predictions
Here is a sample of the JSON file:
Image by Author: JSON Output
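For downstream use, it is often handy to merge the per-word predictions into whole entities. Here is a minimal sketch of that post-processing step. The "text" and "label" keys, the label names, and the sample values are assumptions made for illustration; check them against the actual JSON file before reusing the code.

```python
import json


def group_entities(tokens, outside="O"):
    """Merge consecutive tokens that share a label into (label, text) pairs, skipping background."""
    entities = []
    previous_label = None
    for token in tokens:
        label, text = token["label"], token["text"]
        if label == outside:
            previous_label = None
            continue
        if label == previous_label:
            # Same label as the previous token: extend the last entity.
            last_label, last_text = entities[-1]
            entities[-1] = (last_label, last_text + " " + text)
        else:
            entities.append((label, text))
        previous_label = label
    return entities


# Hypothetical prediction content; the real file may use different keys and labels.
raw = json.dumps([
    {"text": "Invoice", "label": "O"},
    {"text": "INV-001", "label": "INVOICE_NUMBER"},
    {"text": "ACME", "label": "SELLER"},
    {"text": "Corp", "label": "SELLER"},
    {"text": "120.00", "label": "TOTAL"},
])
entities = group_entities(json.loads(raw))
print(entities)
```

This simple grouping treats any run of identical labels as one entity, so two adjacent entities of the same type would be merged. That is acceptable for a quick inspection of the predictions but may need refining for production use.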
The model was able to predict most of the entities such as Seller, Date, Invoice number and Total but erroneously predicted “unit price” as TTC_ID. This suggests we need to annotate more types of invoices so our model learns to generalize.
Conclusion
In conclusion, we have walked through a step-by-step tutorial on how to fine-tune LayoutLMv2 on invoices, from annotation to training and inference. The model can be fine-tuned on other semi-structured documents such as driver's licenses, contracts, government documents, financial documents, etc.
If you would like to try out UBIAI’s OCR annotation feature, simply sign up for free and start annotating.