Amazon Textract for Automated Text and Data Extraction

Extracting text and data from documents by hand is tedious and time-consuming. But what if machines could review documents and pull out the relevant information for us? Amazon Textract offers this kind of document analysis using machine learning.

In this tutorial, we will learn how to use Textract to extract text, tables, forms, and other structured data from our documents. By integrating Textract into our applications, we can automate downstream workflows that depend on processing paperwork.

We will walk through the steps of getting started with Textract hands-on. First, we will upload some sample documents that represent realistic use cases, whether they be scanned reports, forms submitted by customers, invoices, legal contracts or more obscure document types we want to analyze.

Next, we will set the required permissions through IAM roles so Textract has secure access to these source files stored in Amazon S3 buckets.

Then we can kick off Textract analysis jobs on these documents. We will see how to monitor the status of these jobs while Textract runs OCR in the background to detect text in images and identify tables, forms and other structures.

Finally, we will view the results in the console or via APIs. The output locates each piece of detected text on the page, even when the original formatting and layout are complex. Any tables, forms, and other data are organized into structured JSON for easy analysis.

By walking through end-to-end examples, we will gain practical experience of automating document processing with Amazon Textract before leveraging it in production systems. Let’s get started!

Prerequisites

Before starting, you need:

An AWS account
Basic knowledge of AWS services like S3 and IAM
Sample documents in common formats such as PDF, JPEG, or PNG to test Amazon Textract

Steps

1. Upload sample documents to S3

The asynchronous Textract jobs used in this tutorial analyze documents stored in Amazon S3 buckets. So the first step is to upload some sample documents that we want Textract to process.

Some best practices when uploading documents:

Supported formats — Amazon Textract supports common file formats like PDF, JPEG, PNG, and TIFF. It can even process scanned PDFs stored as images.
File size — Upload files less than 50 MB in size for optimal performance, and check the current Textract quotas, which are stricter for synchronous calls than for asynchronous jobs. For larger files, split them into multiple smaller documents.
Folder structure — You can organize your files in a nested folder structure inside your S3 bucket. Textract refers to each document by its bucket and object key, so you can reuse the same prefixes when saving extracted data to preserve the original hierarchy.
Document types — Upload a diverse set of sample docs like forms, receipts, tables, scanned text to test Textract’s capabilities.
Metadata — Assign descriptive names and add metadata like keywords, dates to help identify documents later.

Once ready, upload the documents to your S3 bucket using the S3 console, AWS CLI or an S3 SDK. Make sure to set permissions correctly so Textract can access the files.

As a best practice, create a separate test bucket and folder structure to evaluate Textract before processing production documents. This keeps test output apart from production data and makes it easier to validate.

In the S3 console, you can verify that the documents uploaded correctly. Now they are ready for Textract to extract text and data automatically.
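The upload checklist above can be sketched in Python. This is a minimal sketch, not a definitive implementation: the 50 MB ceiling is this tutorial's guideline rather than an official quota, the function names, the bucket argument and the textract-samples/ prefix are hypothetical, and boto3 is assumed to be installed wherever upload_document is actually called.

```python
from pathlib import Path

# Formats named in this tutorial. The 50 MB ceiling is the article's
# guideline, not an official Textract quota.
SUPPORTED_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png"}
MAX_BYTES = 50 * 1024 * 1024


def check_document(path, size_bytes=None):
    """Return a list of problems; an empty list means the file looks OK."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported format: {p.suffix or '(none)'}")
    # Read the size from disk unless the caller supplied one.
    size = p.stat().st_size if size_bytes is None else size_bytes
    if size >= MAX_BYTES:
        problems.append(f"file is {size} bytes; split it below {MAX_BYTES}")
    return problems


def upload_document(path, bucket, prefix="textract-samples/"):
    """Upload one checked file to S3 (bucket and prefix are hypothetical)."""
    import boto3  # imported lazily so check_document works without boto3

    problems = check_document(path)
    if problems:
        raise ValueError("; ".join(problems))
    key = prefix + Path(path).name
    boto3.client("s3").upload_file(str(path), bucket, key)
    return key
```

Running check_document over a folder before uploading catches unsupported or oversized files early, before any Textract job is started on them.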

2. Create an IAM role for Textract

Amazon Textract needs permissions to access the S3 buckets containing your documents. The best way to grant these permissions is by creating an IAM role.

When creating a role:

Select service — Choose “Textract” as the service that will use this role. This automatically adds base permissions.
Permission policy — Attach the “AmazonTextractFullAccess” policy, which allows calls to all Textract APIs. It is broad, so consider a narrower policy for production.
Trust policy — Configure the trust policy to allow the Textract service (textract.amazonaws.com) to assume this role.
S3 permissions — Add an additional inline policy with “s3:GetObject” and “s3:ListBucket” permissions. This allows read-only access to your S3 documents.
No credentials — An important benefit of using roles is that AWS handles temporary credentials behind the scenes, so there are no long-lived keys to store.

Once the role is created, copy the role ARN, which uniquely identifies it.

Later, when using the Textract console or APIs, specify this role ARN wherever a role is requested so Textract can temporarily assume it. In the asynchronous APIs the role ARN is passed in the job's notification settings (NotificationChannel), and the identity that starts the job also needs read access to the S3 objects.

Using roles ensures secure access without needing to directly manage any credentials. The permissions you grant are limited and can be revoked at any time.

Now Textract has the necessary role and policies configured to analyze source documents stored in S3 while keeping access locked down.
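The trust policy and the inline S3 policy described in this step can be written out as JSON documents. The sketch below builds them in Python; the bucket name is a hypothetical placeholder, and the policies only mirror the permissions listed above, so review them against your own security requirements before use.

```python
import json


def trust_policy():
    """Trust policy that lets the Textract service principal assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "textract.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def s3_read_policy(bucket):
    """Read-only access to one bucket: list it and fetch its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # GetObject applies to objects, so the resource ends in /*.
            {"Effect": "Allow", "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{bucket}/*"},
            # ListBucket applies to the bucket itself.
            {"Effect": "Allow", "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}"},
        ],
    }


# Hypothetical bucket name, used only for illustration.
print(json.dumps(s3_read_policy("my-textract-test-bucket"), indent=2))
```

With boto3, these documents could be passed as JSON strings to the IAM client's create_role (AssumeRolePolicyDocument) and put_role_policy (PolicyDocument) calls; the role and policy names are yours to choose.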

3. Analyze document with Textract

With documents uploaded and an IAM role created, we can now analyze the files using Amazon Textract.

When starting a Textract analysis job:

Select entry point — Choose the specific document in S3 that you want to process first out of your collection.
Input parameters — Specify input parameters like the document location in S3, the IAM role ARN created earlier, and job settings.
Feature types — Enable the output you want, such as forms and tables (the API's FeatureTypes values include TABLES and FORMS). Lines of text, and whether each word is printed or handwritten, are part of the text output.
Job settings — Configure advanced settings like multi-page analysis, access permissions and data security preferences.
Async operation — Textract jobs can take some time, so the job APIs run asynchronously. Your app receives a job ID to check status.
Monitor job — Check the job status periodically to see when it changes from in progress (IN_PROGRESS in the API) to succeeded (SUCCEEDED). Failures show up here as FAILED or PARTIAL_SUCCESS.
View output — Once complete, the console displays extracted text, tables, and structures overlaid on document images. You can also retrieve the entire raw JSON output.

By submitting jobs on a diverse set of sample documents, you can validate Textract’s ability to accurately extract text and data before integrating into production systems.

Monitoring job status, performance, and output quality across different document types and job parameters helps you catch and address issues early.
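The start-then-poll flow described above can be sketched as two small functions. This is a sketch under stated assumptions: client is expected to be a boto3 Textract client, the function names and polling defaults are hypothetical, and a production system would more likely react to an SNS completion notification than poll in a loop.

```python
import time


def start_analysis(client, bucket, key, feature_types=("TABLES", "FORMS")):
    """Start an asynchronous analysis job and return its job ID."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=list(feature_types),
    )
    return response["JobId"]


def wait_for_job(client, job_id, poll_seconds=5.0, max_polls=120):
    """Poll until the job leaves IN_PROGRESS, then return the final status."""
    for _ in range(max_polls):
        # MaxResults=1 keeps each status check small; results are fetched later.
        response = client.get_document_analysis(JobId=job_id, MaxResults=1)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status  # SUCCEEDED, FAILED, or PARTIAL_SUCCESS
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

A typical call would create the client with boto3.client("textract"), pass it to start_analysis along with your bucket and object key, and then hand the returned job ID to wait_for_job.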

4. View extracted text and data

Once an Amazon Textract job succeeds, you can view the results in the console output viewer. The output is also available via the API in JSON format.
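When reading results through the API, long documents come back in pages linked by a NextToken value. A minimal sketch of collecting every block from a finished analysis job follows; client is assumed to be a boto3 Textract client and collect_blocks is a hypothetical helper name.

```python
def collect_blocks(client, job_id):
    """Gather every Block from a finished analysis job, following NextToken."""
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks  # no token means this was the last page
        kwargs["NextToken"] = token
```

The returned list holds the raw block dictionaries, which is the form the JSON output takes before any further processing.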

In the Textract console:

Document viewer — See annotations and extracted text overlaid directly on the document image at the correct positions.

Raw OCR text — The text tab displays all readable text sequenced in proper reading order.

Structured data — View tables, lists, forms, key/values extracted and organized into structured JSON output.

Relationships — See the reading order, text hierarchies, and related elements like tables linked to their headers.

Filters — Filter to types like forms or handwritten text to analyze specific output.

Search — Search for words or patterns in the document text.

JSON output — The full raw JSON provides flexible structured access to every element extracted from the document: text, bounding boxes, confidence scores and relationships.
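To make the JSON structure concrete, here is a small sketch that pulls each detected line out of a list of blocks together with its confidence score and bounding box. The sample blocks are hand-written for illustration, not real Textract output, and the function name and threshold are hypothetical.

```python
def lines_with_confidence(blocks, min_confidence=0.0):
    """Return (text, confidence, bounding_box) for each LINE block, in order."""
    results = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence >= min_confidence:
            box = block.get("Geometry", {}).get("BoundingBox", {})
            results.append((block.get("Text", ""), confidence, box))
    return results


# Hand-written sample in the shape of Textract blocks (illustrative only).
sample_blocks = [
    {"BlockType": "PAGE", "Id": "p1"},
    {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 12345", "Confidence": 99.1,
     "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.1, "Width": 0.3, "Height": 0.02}}},
    {"BlockType": "LINE", "Id": "l2", "Text": "T0tal due", "Confidence": 61.4,
     "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.2, "Height": 0.02}}},
]

for text, confidence, box in lines_with_confidence(sample_blocks, min_confidence=90):
    print(f"{confidence:5.1f}  {text}  {box}")
```

Filtering on confidence like this is one way to route low-scoring lines to manual review instead of trusting them blindly.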

This output can be directly integrated into your own applications. Use cases like searching documents, mining data, and populating databases become much easier with the rich Textract output.

Analyzing the output on sample docs helps validate quality before running large workloads through production pipelines.
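Form fields take a little more work than plain lines, because keys and values are linked through block relationships. The sketch below shows one way to resolve them, assuming the blocks come from an analysis run with the FORMS feature; the helper names are hypothetical and the sample data is hand-written.

```python
def _text_of(block, by_id):
    """Join the text of a block's CHILD words and selection marks."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def key_value_pairs(blocks):
    """Map each form key's text to its value's text."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":  # link from the key block to its value block
                value_text = " ".join(_text_of(by_id[v], by_id) for v in rel["Ids"])
        pairs[_text_of(block, by_id)] = value_text
    return pairs


# Hand-written sample in the shape of FORMS output (illustrative only).
form_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "No:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "12345"},
]

print(key_value_pairs(form_blocks))
```

A dictionary like this is a convenient shape for populating a database row or comparing extracted fields against expected values on sample documents.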

AWS CLI Commands for Amazon Textract

Here are some AWS CLI commands to interact with Amazon Textract:

1. Create document text detection job

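This starts an asynchronous job to detect text in a document stored in an S3 bucket, using aws textract start-document-text-detection with a --document-location argument. The command returns a job ID.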

2. Check job status

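This checks whether the job has completed and whether it succeeded, by calling aws textract get-document-text-detection with the job ID and reading the JobStatus field in the response.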

3. Get job results

This saves the extracted text and analysis from a completed job to a file, by redirecting the output of the same get-document-text-detection call.

4. Start document analysis

This starts analysis to extract text, forms, tables, etc. by specifying a wider range of FeatureTypes, using aws textract start-document-analysis with --feature-types (for example TABLES and FORMS).

5. Create document analyzer

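The closest Textract feature to a reusable custom analyzer is an adapter, created with aws textract create-adapter. Adapters customize the results of Queries-based analysis and can be reused across multiple documents.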

The AWS CLI provides full access to configure and manage Textract jobs at scale for production workflows.
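As a rough sketch of what the text-detection and analysis commands above might look like, the Python helpers below assemble each one as an argument list and print it as a shell command line. The subcommands and flags are standard AWS CLI ones, but the bucket, object key and job ID are hypothetical placeholders, and the helper names are invented for this example.

```python
import json
import shlex


def s3_location(bucket, key):
    """JSON value for --document-location."""
    return json.dumps({"S3Object": {"Bucket": bucket, "Name": key}})


def start_text_detection_cmd(bucket, key):
    return ["aws", "textract", "start-document-text-detection",
            "--document-location", s3_location(bucket, key)]


def get_text_detection_cmd(job_id):
    # The same call serves for checking status and for fetching results.
    return ["aws", "textract", "get-document-text-detection", "--job-id", job_id]


def start_analysis_cmd(bucket, key, feature_types=("TABLES", "FORMS")):
    return ["aws", "textract", "start-document-analysis",
            "--document-location", s3_location(bucket, key),
            "--feature-types", *feature_types]


# Placeholder values, for illustration only.
commands = [
    start_text_detection_cmd("my-textract-test-bucket", "samples/invoice.pdf"),
    get_text_detection_cmd("JOB_ID_FROM_START_CALL"),
    start_analysis_cmd("my-textract-test-bucket", "samples/invoice.pdf"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be pasted into a shell as printed or passed to subprocess.run to execute it, assuming the AWS CLI is installed and configured; redirecting the output of the get call to a file saves the results.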

Conclusion

And that wraps up our walkthrough of extracting text and data from documents using Amazon Textract! We took some sample documents and used Textract’s advanced machine learning capabilities to automatically identify and digitize text, tables, forms, and other structures they contained.

First, we prepared our documents by uploading them into an Amazon S3 bucket. This provided a centralized repository that Textract could access for analysis.

We then configured an IAM role with permissions that allowed Textract to safely read these S3 documents while keeping access locked down. Using roles ensures secure authentication without needing to directly manage credentials.

With our source files ready, we started Textract jobs on these documents. We were able to monitor progress as Textract’s algorithms did the heavy lifting to process images and extract information in the background.

Finally, we viewed the completed results within the Textract console. We could validate how accurately it located text elements on the image and sequenced them in a readable order. Any tables, forms, keys, and values were organized into structured JSON for integration with other applications.

As we validated output across diverse sample documents, we could judge whether Textract delivers the automation and accuracy we need to digitize paperwork at scale. Our downstream data workflows can then ingest structured information that previously required tedious manual effort to collect.

So that concludes our tutorial on harnessing the power of Amazon Textract for document automation! We’re now ready to eliminate time wasted retyping forms or transcribing reports by letting Textract extract printed text, handwriting, and structured data for us. Textract can unlock a goldmine of information from files lying unused and accelerate our document-based processes.