Unveiling PDF Parsing: How to Extract Formulas from Scientific PDF Papers
This article is a supplement to Advanced RAG 02: Unveiling PDF Parsing.
Extracting formulas from scientific papers has always been a challenging task.
Several tools can recognize formulas in scientific papers, including:
- Nougat: Neural Optical Understanding for Academic Documents, an end-to-end trainable encoder-decoder transformer-based model for converting document pages to markup.
- grobid: Figure 2 shows that its formula-recognition performance is inferior to Nougat's.
- LaTeX-OCR: Figure 2 likewise shows performance inferior to Nougat's.
- Donut: the model architecture that Nougat is based on.
- Mathpix Snip: A paid tool.
In this article, we use the open-source Nougat framework; its architecture is shown in Figure 1:
For scientific papers, Nougat's formula-recognition accuracy is high, as shown in Figure 2:
As a demonstration, we use some formulas from page 5 of the paper “Attention Is All You Need” as shown in Figure 3.
The result obtained after executing the command `nougat YOUR_PDF_PATH -o YOUR_OUTPUT_DIR_PATH` is as follows:
...

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this. \[\mathrm{MultiHead}(Q,K,V) =\mathrm{Concat}(\mathrm{head}_{1},...,\mathrm{head}_{\mathrm{h}})W^ {O}\] \[\text{where }\mathrm{head}_{\mathrm{i}} =\mathrm{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})\] Where the projections are parameter matrices \(W_{i}^{Q}\in\mathbb{R}^{d_{\text{model}}\times d_{k}}\), \(W_{i}^{K}\in\mathbb{R}^{d_{\text{model}}\times d_{k}}\), \(W_{i}^{V}\in\mathbb{R}^{d_{\text{model}}\times d_{v}}\) and \(W^{O}\in\mathbb{R}^{hd_{v}\times d_{\text{model}}}\). In this work we employ \(h=8\) parallel attention layers, or heads. For each of these we use \(d_{k}=d_{v}=d_{\text{model}}/h=64\). Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

...

### Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between. \[\mathrm{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2} \tag{2}\] While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is \(d_{\text{model}}=512\), and the inner-layer has dimensionality \(d_{ff}=2048\).

...
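One way to sanity-check the extracted formulas is to reproduce them directly in code. Below is a minimal NumPy sketch of multi-head attention and the position-wise feed-forward network exactly as the extracted LaTeX defines them (\(d_{\text{model}}=512\), \(h=8\), \(d_k=d_v=64\), \(d_{ff}=2048\)); the random weights are placeholders for illustration, not the trained model's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h, d_ff = 512, 8, 2048
d_k = d_v = d_model // h  # 64, as in the extracted text

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_O

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2   (Eq. 2)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Placeholder weights with the shapes given in the extracted text
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(10, d_model))           # a toy sequence of 10 positions
y = multi_head(x, x, x, W_Q, W_K, W_V, W_O)  # self-attention: Q = K = V = x
z = ffn(y, W1, b1, W2, b2)
print(y.shape, z.shape)  # (10, 512) (10, 512)
```

Both outputs keep the shape `(positions, d_model)`, matching the paper's note that the reduced per-head dimension keeps the total cost similar to single-head attention at full dimensionality.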
The parsed result is a file in .mmd format. To render it, install the corresponding plugin in VS Code; the rendered result is shown in Figures 4 and 5.
It can be observed that the formulas are indeed parsed accurately. However, the number "3.3" is missing from the section title "3.3 Position-wise Feed-Forward Networks."
It is worth mentioning that Nougat also performs well on Table 1 of the "Attention Is All You Need" paper; this is because the table contains formulas.
Interested readers can try it out.
Conclusion
Overall, Nougat is an excellent formula extraction tool.
However, as an end-to-end tool (it requires no OCR-related input or modules; the network recognizes the text implicitly), it produces no intermediate results and appears to offer limited customization options.
In addition, Nougat generates text autoregressively, one forward pass per token, which makes generation relatively slow and increases the likelihood of hallucination and repetition.
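The cost of autoregressive decoding can be seen in a toy greedy loop: every output token requires its own sequential forward pass over everything generated so far, so an n-token page costs n passes that cannot be parallelized. The sketch below is purely illustrative (`toy_forward` is a stand-in, not Nougat's API):

```python
import numpy as np

VOCAB = 50  # toy vocabulary size

def toy_forward(tokens):
    # Stand-in for one full decoder forward pass; a real model's cost
    # grows with len(tokens), and passes cannot run in parallel.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=VOCAB)  # logits over the toy vocabulary

def greedy_decode(max_len=20, bos=1, eos=0):
    tokens = [bos]
    passes = 0
    for _ in range(max_len):
        logits = toy_forward(tokens)  # one sequential pass per token
        passes += 1
        nxt = int(np.argmax(logits))  # greedy choice; can loop and repeat
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens, passes

tokens, passes = greedy_decode()
print(len(tokens), passes)
```

Greedy argmax decoding also shows why repetition happens: once the model re-enters a previous state, it deterministically emits the same continuation again.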
Lastly, if you have any questions, please leave them in the comments section.