Marco Reis

Summary

Apache Tika is a versatile Java library and server for extracting text and metadata and for performing OCR on various file formats. Its RESTful API makes it easy to integrate into diverse applications, and Docker containers make it scalable.

Abstract

Apache Tika is a robust content analysis tool that excels in extracting text and metadata from a wide range of file formats, including PDF, DOC, and PPT. It is widely used for indexing in search engines, content analysis in journalism, and can be integrated with translation and OCR services for enhanced functionality. Tika's capabilities extend beyond basic text extraction to include advanced features such as computer vision, machine learning, and natural language processing through integration with libraries like TensorFlow and NLTK. The Tika Server, accessible via a RESTful API, can be deployed using Docker, allowing for scalable solutions that can handle varying loads, with separate containers for OCR and non-OCR tasks to optimize performance. The server supports multiple languages for OCR and can identify the language of the text, making it a powerful tool for multilingual content processing.

Opinions

  • The article positions Apache Tika as an essential tool for content analysis and indexing, emphasizing its ease of use and continuous updates.
  • The integration of Tika with other libraries and APIs for advanced uses, such as machine learning and natural language processing, is seen as a significant enhancement to its core functionality.
  • The use of Docker for deploying Tika Server is highly recommended for its scalability and flexibility in managing different workloads, especially when OCR processing is required.
  • The article suggests that the choice of using separate Docker containers for OCR and non-OCR tasks is a strategic decision to maintain application performance, indicating a preference for this approach in high-demand scenarios.
  • The inclusion of multiple language support for OCR and language identification features is viewed as a critical aspect of Tika's versatility, particularly for applications dealing with multilingual content.

Text Extraction And OCR With Apache Tika

Credits: Unsplash.

Apache Tika is a library for extracting text from most file formats, including PDF, DOC, and PPT. Tika has a simplified interface that extracts the content, making it easy to operate the library. Its main uses are related to the indexing process in search engines, content analysis (journalism, for example), and even translation (using paid APIs).

The content analysis includes metadata extraction. Metadata is information that describes a resource, which in the case of Tika is a file. It is data about the data, such as creation date, language, format, permissions, subject, authors, title, and keywords. The more metadata available, the more accurate the analysis of the file’s content will be.

Written in Java, Tika is a popular library, easy to use, and continuously updated. That is why it is used in several other software projects, such as Apache Solr. Tika is FOSS (free and open-source software), and its API is extensible, so you can create your own customized functionality.

In addition to extracting content, Tika’s features include:

  • Tika Server: makes its resources available via a RESTful API, which is the subject of this article;
  • MIME type identification: follows the pattern type/subtype, for example image/png;
  • Metadata identification: for example, in a PDF the metadata includes pdf:PDFVersion, access_permission, language, dc:format, and Creation-Date (more details below);
  • Language identification: detects the language of the text;
  • Text translation: via the paid APIs Microsoft Translator Text, Google Cloud Translation, or Lingo24;
  • OCR: integrated with Tesseract OCR to extract content from images.

Advanced features

At the beginning of the project, Tika only did text extraction; in more recent versions, however, it has been integrated with other libraries for more advanced uses (not detailed in this text):

  • Computer vision: for example the generation of image captions;
  • Machine learning (ML): integration with ML tools, such as TensorFlow and Mahout;
  • Natural language processing (NLP): integration with NLP tools, such as OpenNLP and NLTK.

Tika Server

In this article, we will see how to use Tika Server and how to take advantage of its features through its RESTful API. The installation uses Docker.

The choice of Tika Server with Docker brings scalability to the solution. That is, for small loads, we can use only one or two Docker containers, increasing according to the demand.

In the examples, we will run two containers, one with OCR enabled and one without OCR. This design choice is important because OCR demands a lot of processing and can degrade application performance. In applications with high demand for OCR, it is essential to have a set of separate containers just to process the images. With that, the flow can be redirected to each container according to the need.

Installation via Docker

The Docker image for the examples is available in the apache/tika repository on Docker Hub. At the time of writing, the latest version is 1.24, which uses Java 11. We start by downloading version 1.24 (without OCR) and 1.24-full (with OCR).

docker pull apache/tika:1.24
docker pull apache/tika:1.24-full

To start the containers we use the following commands:

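# Server with OCR (Tesseract): host port 9998 -> container port 9998
docker run -it \
    --name tika-server-ocr \
    -d \
    -p 9998:9998 \
    apache/tika:1.24-full

# Server without OCR: host port 9997 -> container port 9998
docker run -it \
    --name tika-server \
    -d \
    -p 9997:9998 \
    apache/tika:1.24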

As seen, the server with OCR is on port 9998, and the server without OCR is on port 9997. With this decision, we can choose whether or not to OCR the images. We can check if the servers are running through the URLs http://localhost:9998/ and http://localhost:9997/.
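
As a minimal sketch of that routing decision, the helper below picks a server from the MIME type of the file: images go to the OCR container, and everything else goes to the plain one. The function name tika_base_url is hypothetical, and the ports assume the docker run mappings above (9998 for the OCR image, 9997 for the image without OCR).

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a Tika Server base URL from a MIME type.
# Assumes the port mappings from the docker run commands above:
#   9998 -> apache/tika:1.24-full (with OCR)
#   9997 -> apache/tika:1.24      (without OCR)
tika_base_url() {
    local mime_type="$1"
    case "$mime_type" in
        image/*) echo "http://localhost:9998" ;;  # images need Tesseract
        *)       echo "http://localhost:9997" ;;  # plain text extraction
    esac
}
```

With both containers running, a call such as curl -T test.png "$(tika_base_url image/png)/tika" would send the image to the OCR server. A real application would probably need a richer rule than this one.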

To see the server log we use the command docker logs tika-server-ocr and the result should be something like this, showing that the service is available on port 9998, in the case of the server with OCR:

May 02, 2020 3:38:37 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO  Starting Apache Tika 1.24 server
INFO  Setting the server's publish address to be http://0.0.0.0:9998/
INFO  Logging initialized @1831ms to org.eclipse.jetty.util.log.Slf4jLog
INFO  jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 363d5f2df3a8a28de40604320230664b9c793c16; jvm 11.0.6+10-post-Ubuntu-1ubuntu118.04.1
INFO  Started ServerConnector@7fc44dec{HTTP/1.1,[http/1.1]}{0.0.0.0:9998}
INFO  Started @1924ms
WARN  Empty contextPath
INFO  Started o.e.j.s.h.ContextHandler@5bda80bf{/,null,AVAILABLE}
INFO  Started Apache Tika server at http://0.0.0.0:9998/

Languages in Tesseract OCR

In the standard installation, the languages available in Tesseract are English (default), French, German, Italian, and Spanish. To add new ones, we need to access the container terminal through Docker and execute the following commands to install, for example, the Portuguese language. The correct choice of the text language allows greater precision in character recognition.

docker exec -it tika-server-ocr /bin/bash
apt-get update
apt-get install tesseract-ocr-por

HTTP codes

The HTTP codes used on the Tika Server are:

  • 200 OK: request completed successfully;
  • 204 No Content: request completed successfully with an empty result;
  • 422 Unprocessable Entity: unsupported MIME type, encrypted document, etc.;
  • 500 Error: error while processing the document.
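
A script that calls the server can translate these codes into readable messages. The sketch below is one way to do it; the function name tika_status_message and the message wording are hypothetical, and only the four codes listed above are handled explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: describe a Tika Server HTTP status code.
# Covers only the four codes listed in this article.
tika_status_message() {
    case "$1" in
        200) echo "ok" ;;
        204) echo "ok, empty result" ;;
        422) echo "unprocessable: unsupported MIME type or encrypted document" ;;
        500) echo "server error while processing the document" ;;
        *)   echo "unexpected status $1" ;;
    esac
}
```

The status code itself can be captured with curl's -w option, for example status=$(curl -s -o out.txt -w '%{http_code}' -T test.pdf http://localhost:9998/tika), and then passed to tika_status_message "$status".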

Metadata

The first example is extracting metadata from a PDF. We will use the Linux application curl and some test files in PDF, DOCX, ODT, TXT, PNG, and JPG format.

curl -T test.pdf http://localhost:9998/meta

Each file type has its own set of metadata. To show a specific metadata field, for example Content-Type, add its name to the URL:

curl -T test.pdf http://localhost:9998/meta/Content-Type

You can choose the format of the result. According to the documentation available at http://localhost:9998/, we can choose between text, CSV, and JSON. To show the output in plain text, use the URL:

curl -T test.pdf http://localhost:9998/meta/Content-Type --header "Accept: text/plain"

To get the result of the X-Parsed-By metadata in CSV:

curl -T test.pdf http://localhost:9998/meta/X-Parsed-By --header "Accept: text/csv"

Finally, to get the result of the Creation-Date metadata in JSON:

curl -T test.odt http://localhost:9998/meta/Creation-Date --header "Accept: application/json"
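
In a script, the three output formats can be selected by a short name instead of repeating the full header each time. The helper below is a sketch of that idea; the name meta_accept_header and the short names text, csv, and json are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a short format name to the Accept header
# used in the /meta examples above. Unknown names return status 1.
meta_accept_header() {
    case "$1" in
        text) echo "Accept: text/plain" ;;
        csv)  echo "Accept: text/csv" ;;
        json) echo "Accept: application/json" ;;
        *)    echo "unknown format: $1" >&2; return 1 ;;
    esac
}
```

For example, curl -T test.pdf http://localhost:9998/meta --header "$(meta_accept_header json)" would request all the metadata of the file in JSON.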

Text extraction

Text extraction is the main feature of Tika. To extract the contents of the file test.docx with the Tika server, use the URL:

curl -T test.docx http://localhost:9998/tika

In the previous case, Tika identifies the file type before selecting the appropriate parser. If you know the file type, you can state it in the Content-type header so that Tika selects the proper parser directly.

curl -T test.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
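
When many files are processed, the Content-type value can be derived from the file extension. The sketch below covers only the test file formats used in this article; the function name content_type_for is hypothetical, and trusting the extension is an assumption that may not hold for arbitrary uploads.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess a Content-type value from a file extension.
# Only the test file formats used in this article are covered, and the
# match is case-sensitive. Returns status 1 for anything else, so the
# caller can omit the header and let Tika detect the type itself.
content_type_for() {
    case "${1##*.}" in
        pdf)      echo "application/pdf" ;;
        docx)     echo "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ;;
        odt)      echo "application/vnd.oasis.opendocument.text" ;;
        txt)      echo "text/plain" ;;
        png)      echo "image/png" ;;
        jpg|jpeg) echo "image/jpeg" ;;
        *)        return 1 ;;
    esac
}
```

A caller could then run curl -T test.pdf http://localhost:9998/tika --header "Content-type: $(content_type_for test.pdf)" and fall back to the request without the header when the function fails.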

For large files, Tika supports multipart:

curl -F [email protected] http://localhost:9998/tika/form

MIME types

The top-level MIME types supported by Tika are: application, audio, chemical, image, message, model, multipart, text, video, and x-conference. The complete list is available at the URL http://localhost:9998/mime-types.

OCR

Tika integrates with Tesseract OCR to extract content from images. The simplest way to OCR a PNG file is:

curl -T test.png http://localhost:9998/tika

Again, if you know the file type, you can indicate it in the Content-type header. To change the OCR language, for example to Portuguese, use the X-Tika-OCRLanguage header:

curl -T test.jpg http://localhost:9998/tika \
    --header "Content-type: image/jpeg" \
    --header "X-Tika-OCRLanguage: por"

Identify language

To identify the language of the text of a file, we can use the URL:

curl -T test.odt http://localhost:9998/language/stream

To identify the language of a text string, the URL is:

curl -X PUT --data "yo no hablo español muy bien" http://localhost:9998/language/string
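
One possible use of the detected language is choosing the OCR language for related images. The sketch below assumes that the language endpoints answer with a two-letter code such as es, and maps it to the three-letter code that Tesseract expects. The function name tesseract_lang_for is hypothetical, and the table covers only the five default languages plus Portuguese.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a two-letter language code (assumed to be what
# the /language endpoints return, e.g. "es") to a Tesseract language code.
# Covers the five default languages plus Portuguese.
tesseract_lang_for() {
    case "$1" in
        en) echo "eng" ;;
        fr) echo "fra" ;;
        de) echo "deu" ;;
        it) echo "ita" ;;
        es) echo "spa" ;;
        pt) echo "por" ;;
        *)  echo "eng" ;;   # fall back to the Tesseract default
    esac
}
```

The result can be passed to the OCR request shown earlier, for example --header "X-Tika-OCRLanguage: $(tesseract_lang_for pt)".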

Conclusion

Apache Tika is a useful project, and Tika Server adds an extra layer of ease with the RESTful API. Thus, it is possible to access Tika’s functionalities from virtually any programming language through web services. With Docker, the solution can handle different workloads. In the example shown, two containers were created, one with OCR support and the other without it, so that the application can choose whether or not it is necessary to OCR the image.

Reference

https://cwiki.apache.org/confluence/display/TIKA/TikaServer

https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/

https://www.manning.com/books/tika-in-action
