avatarEivind Kjosbakken

Summary

This article guides readers through the process of enhancing the Donut model's performance on receipts by self-annotating data and fine-tuning the model with custom datasets.

Abstract

The article outlines a method for improving the Donut model's accuracy in processing receipts by using self-annotated data. It begins by introducing the Donut model and the CORD dataset, then proceeds to detail the use of the Sparrow annotation tool for creating a custom dataset. The process involves annotating receipts with bounding boxes and labels, converting the annotated data to the CORD dataset format, and finally, fine-tuning the Donut model with this new dataset. The author provides a step-by-step guide, including code snippets and folder structures, to assist readers in training the model with their annotated data. The article concludes by encouraging readers to experiment with a smaller number of samples to see significant improvements and invites them to explore other related articles.

Opinions

  • The author believes that fine-tuning the Donut model with custom annotated data can significantly improve its performance for specific tasks.
  • The use of the Sparrow annotation tool is recommended for its ease of use and adaptability, with the author suggesting specific changes made to the original tool for enhanced functionality.
  • The author emphasizes the importance of proper data format conversion to match the CORD dataset standards for successful fine-tuning.
  • It is suggested that a moderate number of annotated samples (50-100) can be sufficient to achieve a notable improvement in the model's performance.
  • The author provides additional resources for further learning on related topics, indicating a commitment to community education and continuous improvement in OCR tasks.

Empower Your Donut Model for Receipts with Self-Annotated Data

Introduction

In this article, you will learn how to fine-tune the Donut model with your own custom receipts data. Further fine-tuning the Donut model with image annotations for your specific need, can massively boost the performance of the model in the particular task. This article will use a Donut model already fine-tuned on the CORD dataset, annotate some receipts, and then use those annotations to further fine-tune the Donut model.

Create your own dataset and fine-tune your Donut model with custom data with this tutorial. Photo by Isabella Fischer on Unsplash

Story overview

  • Finding an annotation tool and annotating
  • Converting data to the correct format
  • Training with your annotated data

Finding an annotation tool and annotating

To create your own dataset, you have to have an annotation tool. Luckily, there are plenty of tools available online. For this tutorial, I will be using the Sparrow annotation tool from this GitHub repository. Note that this is forked from another GitHub repository, and with a few changes for my specific needs which are described later in the article. How to annotate is explained in the repository, but I will display the steps you have to do below, for simplicity.

  1. Clone the GitHub repository (you have to clone all folders, not just the sparrow-ui folder)
  2. Install all required packages (requirements.txt file can be found within each subfolder in the Sparrow repository, use the one from sparrow-ui)
  3. Run the Streamlit application by going into the sparrow-ui folder, and running: streamlit run main.py
  4. A window will now open in your browser, and you can then choose the data annotation page
  5. On the data annotation page, press browse files and choose an image you want to annotate (should be an image of a receipt) or use one of the existing images already in the GitHub repository.
  6. Make sure the assign labels box is unchecked, and mark an area in the picture.
  7. After selecting all areas you want to mark, check the Assign Labels box. Then press a box, and write the text of the box in the Value column, and the Row number in the Row ID column. Note the number of the Row ID only requires 2 things. Words and numbers of the same row must have the same Row ID, and the total needs to have the highest Row ID. Make sure to press the save button after adding values. To understand this fully, check out the image below.
Example of annotated receipt with Value and Row ID. Image taken from the Sparrow-ui Streamlit application.

8. Now check the Completed checkbox on the left side, and press export labels. You will receive the images with labels in the docs folder

The data annotation page with an uploaded image. Image taken from the Sparrow-ui Streamlit application.

For your information, I made 2 changes to the original Sparrow repository which I forked:

  • Allowing uploading of several images
  • Under Mapping, I changed the Label column to take any value and called it Row ID instead.

You do not have to do anything regarding the changes, but now you are aware of what has changed in the fork from the original Sparrow repository.

Converting data to the correct format

Now that you have images that are labeled with bounding boxes, the text in the bounding boxes, and the Row ID of the bounding boxes, we have everything we need to convert the data to the same format as the CORD dataset. Note that Sparrow has a folder called sparrow-data, which converts data to Donut format, though I did not find functionality there that fit my needs (as I wanted to specify the Row number).

Converting the data to the correct format can be confusing, as the CORD format consists of nested dictionaries. For this article, however, I have already prepared a finished script to convert your data which I store in a new file called “prepareData.py”. All you need to do is run the code below inside the sparrow-ui folder:

python prepareData.py

The prepareData.py file uses the images and labels you have annotated and converts them to a Donut-friendly format. Make sure you have not changed any folder names after exporting the data from the Sparrow annotation tool (I use all the standard paths)

The prepareData.py file will then create a folder with the name DATA_FOLDER_NAME (a variable you can change inside prepareData.py). Inside the created folder, it will create 3 folders (train, validation, test) that contain your data in a format that fits for fine-tuning the Donut model.

Training with your annotated data

If you have not yet figured out how to run the Donut model properly, you can check out my freeCodeCamp article here to learn more about that.

Now I assume you have the following folder structure, if “TEST_DATA” is the value of the DATA_FOLDER_NAME variable (note that the images does not necessarily have the same names):

The expected folder structure for the dataset is ready for fine-tuning with Donut. Note that I only have one sample within each folder (train, validation, test), and you should hopefully have more files. Image screenshot from my own folder structure.

You can then copy this folder over to wherever you have your Donut model (which you can clone from here):

The TEST_DATA folder was copied over to the Donut folder as you can see in the image. Image screenshot from my own folder structure

Now you have to make sure to change the Config file to use this data. So in your Config file (in the folder donut/config, filename that ends with .yaml), you change the dataset_name_or_paths variable to:

dataset_name_or_paths: ["TEST_DATA"]

Then you can run the training with:

python train.py --config config/train_cord_finetune_own_dataset.yaml

If your Config file is called train_cord_finetune_own_dataset.yaml

Conclusion

Congrats! You can now fine-tune the Donut model with your own annotated dataset. Even though annotating a dataset can take quite some time, you do not necessarily have to make too many samples to improve the model. I recommend trying to make 50–100 samples and running fine-tuning on that. Then see if you see a significant enough improvement. I hope you learned something useful in this article.

If you want to read some of my other articles, please check out:

In Plain English

Thank you for being a part of our community! Before you go:

Tech
Artificial Intelligence
Ocr
Python
Data Annotation Tools
Recommended from ReadMedium