Unveiling PDF Parsing: How to Extract Formulas from Scientific PDF Papers
This article is a supplement to Advanced RAG 02: Unveiling PDF Parsing.
Extracting formulas from scientific papers has always been a challenging task.
Several tools can recognize formulas in scientific papers, including:
- Nougat: Neural Optical Understanding for Academic Documents, an end-to-end trainable encoder-decoder transformer-based model for converting document pages to markup.
- grobid: Figure 2 shows that its formula-recognition performance is inferior to Nougat's.
- LaTeX-OCR: Figure 2 likewise shows performance inferior to Nougat's.
- Donut: the model architecture that Nougat is based on.
- Mathpix Snip: A paid tool.
In this article, we use the open-source Nougat framework; its architecture is shown in Figure 1:
For scientific papers, Nougat's formula-recognition accuracy is high, as shown in Figure 2:
As a demonstration, we use some formulas from page 5 of the paper “Attention Is All You Need” as shown in Figure 3.
The result obtained after executing the command `nougat YOUR_PDF_PATH -o YOUR_OUTPUT_DIR_PATH` is as follows:
...

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this. \[\mathrm{MultiHead}(Q,K,V) =\mathrm{Concat}(\mathrm{head}_{1},...,\mathrm{head}_{\mathrm{h}})W^ {O}\] \[\text{where }\mathrm{head}_{\mathrm{i}} =\mathrm{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})\] Where the projections are parameter matrices \(W_{i}^{Q}\in\mathbb{R}^{d_{\text{model}}\times d_{k}}\), \(W_{i}^{K}\in\mathbb{R}^{d_{\text{model}}\times d_{k}}\), \(W_{i}^{V}\in\mathbb{R}^{d_{\text{model}}\times d_{v}}\) and \(W^{O}\in\mathbb{R}^{hd_{v}\times d_{\text{model}}}\). In this work we employ \(h=8\) parallel attention layers, or heads. For each of these we use \(d_{k}=d_{v}=d_{\text{model}}/h=64\). Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

...

### Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between. \[\mathrm{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2} \tag{2}\] While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is \(d_{\text{model}}=512\), and the inner-layer has dimensionality \(d_{ff}=2048\).

...
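One way to sanity-check the extracted formulas is to reproduce them directly in code. Below is a minimal NumPy sketch of multi-head attention and the position-wise feed-forward network exactly as the extracted LaTeX defines them (\(d_{\text{model}}=512\), \(h=8\), \(d_k=d_v=64\), \(d_{ff}=2048\)); the random weights are placeholders for illustration, not the trained model's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h, d_ff = 512, 8, 2048
d_k = d_v = d_model // h  # 64, as in the extracted text

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_O

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2   (Eq. 2)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Placeholder weights with the shapes given in the extracted text
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(10, d_model))           # a toy sequence of 10 positions
y = multi_head(x, x, x, W_Q, W_K, W_V, W_O)  # self-attention: Q = K = V = x
z = ffn(y, W1, b1, W2, b2)
print(y.shape, z.shape)  # (10, 512) (10, 512)
```

Both outputs keep the shape `(positions, d_model)`, matching the paper's note that the reduced per-head dimension keeps the total cost similar to single-head attention at full dimensionality.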
The parsed result is a file in .mmd format. To render it, install the corresponding plugin in VS Code; the rendered result is shown in Figures 4 and 5.
It can be observed that the formulas are indeed parsed accurately. However, the number "3.3" is missing from the section title "3.3 Position-wise Feed-Forward Networks."
It is worth mentioning that Nougat also performs well on Table 1 of the "Attention Is All You Need" paper; this is because the table contains formulas.
Interested readers can try it out.
Conclusion
Overall, Nougat is an excellent formula extraction tool.
However, as an end-to-end tool (it requires no OCR-related input or modules; the network recognizes the text implicitly), it produces no intermediate results and appears to offer limited customization options.
In addition, Nougat generates text autoregressively, one forward pass per token, which makes generation relatively slow and increases the likelihood of hallucination and repetition.
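The cost of autoregressive decoding can be seen in a toy greedy loop: every output token requires its own sequential forward pass over everything generated so far, so an n-token page costs n passes that cannot be parallelized. The sketch below is purely illustrative (`toy_forward` is a stand-in, not Nougat's API):

```python
import numpy as np

VOCAB = 50  # toy vocabulary size

def toy_forward(tokens):
    # Stand-in for one full decoder forward pass; a real model's cost
    # grows with len(tokens), and passes cannot run in parallel.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=VOCAB)  # logits over the toy vocabulary

def greedy_decode(max_len=20, bos=1, eos=0):
    tokens = [bos]
    passes = 0
    for _ in range(max_len):
        logits = toy_forward(tokens)  # one sequential pass per token
        passes += 1
        nxt = int(np.argmax(logits))  # greedy choice; can loop and repeat
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens, passes

tokens, passes = greedy_decode()
print(len(tokens), passes)
```

Greedy argmax decoding also shows why repetition happens: once the model re-enters a previous state, it deterministically emits the same continuation again.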
Lastly, if you have any questions, please leave them in the comments section.