Poonam Yadav

Summary

The web content outlines a step-by-step process for extracting specific data, namely names of individuals, from a PDF document using Python libraries such as PyPDF2, textract, and NLTK.

Abstract

The article titled "How to extract text from a PDF (NLP)" delves into the application of Natural Language Processing (NLP) for extracting text from PDFs. The author, with a growing interest in NLP, describes the use of Python libraries to accomplish the task of extracting names from a specific page of a PDF provided by the Municipal Corporation of Greater Mumbai. The process involves installing necessary Python packages, importing libraries, fetching the PDF from a URL, creating a PDF reader object, extracting text from the desired page, and finally, tokenizing the text to remove punctuation and stopwords. The code identifies names by looking for prefixes such as "Mr.", "Mrs.", or "Ms." and compiles them into a list. The author provides code snippets and screenshots for each step and concludes by expressing hope that the article will aid in understanding the text extraction process, also sharing a GitHub link to the complete code.

Opinions

  • The author expresses a keen interest in NLP and acknowledges its broad impact across various industries.
  • The task of text extraction from PDFs is presented as a starting point for those interested in learning NLP.
  • The author endorses the use of PyPDF2 for its versatile functionalities in handling PDF files.
  • Textract is highlighted for its core function of text extraction, and NLTK is recommended for its comprehensive text processing capabilities.
  • The author's approach suggests a preference for Python due to its rich ecosystem of libraries for NLP tasks.
  • By sharing the GitHub link, the author encourages further exploration and collaboration on the code provided.

How to extract text from a PDF (NLP)

Extracting specific text from a PDF file

Image source: https://cognitechx.com/wp-content/uploads/2020/04/107_agfuzc1tyxpllu5muc1kyxjrlwjsdwu-scaled-1.jpg

I have a growing interest in Natural Language Processing (NLP). It is a vast subject with interesting and far-reaching effects across industries. To begin, I started with a simple task: extracting text, or specific data, from a given document.

Let me take you through the entire process of how I approached it.

Requirement: Extract the names of individuals from the Municipal Corporation of Greater Mumbai listed on Page 2 of this PDF: http://www.udri.org/pdf/02%20working%20paper%201.pdf

Step 1: Installing the required Python packages.

Here we are using three packages: PyPDF2, textract, and nltk.

PyPDF2 is a Python library built as a PDF toolkit. It provides a variety of functions, such as extracting information from a PDF, splitting or merging documents page by page, cropping pages, and encrypting or decrypting PDF files. textract is a Python package for extracting text from documents. NLTK stands for Natural Language Toolkit. It is a platform for building Python programs that work with human language data, and it contains text-processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.

Install required packages
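A minimal sketch of this step, assuming the three packages are installed from PyPI under the names used above. The helper `missing_packages` is a hypothetical name I am using for illustration; it relies only on the standard library and simply reports which of the modules cannot be imported yet.

```python
# Install from a shell first (assumed PyPI names, matching the packages above):
#   pip install PyPDF2 textract nltk
import importlib.util

REQUIRED_MODULES = ("PyPDF2", "textract", "nltk")


def missing_packages(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Still to install:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Running the check before the import step gives a clearer message than an ImportError halfway through the script.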

Step 2: Importing the needed libraries

Importing libraries

Step 3: Next, we fetch the PDF from the given URL using urllib.request and save the file in wFile.

Access the PDF from the URL
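One way this download could look, as a sketch rather than the exact code from the screenshot. The function name `download_pdf` and the local file name are my own assumptions; only `urllib.request` and the URL come from the steps above.

```python
import shutil
import urllib.request

PDF_URL = "http://www.udri.org/pdf/02%20working%20paper%201.pdf"


def download_pdf(url, destination):
    """Stream the resource at `url` into the local file `destination`."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return destination


# Example call (needs network access; the file name is an assumption):
# wFile = download_pdf(PDF_URL, "working_paper.pdf")
```

Copying the response in chunks with `shutil.copyfileobj` avoids holding the whole PDF in memory, which matters little for a short paper but is a safe default.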

Step 4: In this step we create an object of the PdfFileReader class from the PyPDF2 module, passing it the PDF file object, to get a PDF reader object. In PyPDF2 3.0 and later, this class is named PdfReader.

Step 5: Here we use the getPage function to access the required page of the PDF, and extractText() to extract the text from that page. getPage counts pages from zero, so getPage(2) returns the third page in the file; the second page in the file is getPage(1).
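Steps 4 and 5 can be sketched together as below. This version assumes a recent PyPDF2 release, where the reader class is `PdfReader`, pages are reached through `reader.pages`, and text comes from `extract_text()`; these replace the older `PdfFileReader`, `getPage`, and `extractText` names. The helpers `page_index` and `extract_page_text` are hypothetical names of mine, added to make the zero-based page counting explicit.

```python
def page_index(page_number):
    """Convert a 1-based page number into the 0-based index PyPDF2 expects."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return page_number - 1


def extract_page_text(pdf_path, page_number):
    """Return the text of one page, counting pages from 1."""
    # Imported here so page_index() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        page = reader.pages[page_index(page_number)]
        # Extract while the file is still open.
        return page.extract_text()


# Example call (the file name is an assumption):
# text = extract_page_text("working_paper.pdf", 2)
```

If the page you want is labelled "2" in print but the file starts with a cover page, check which position it actually has in the file before choosing the number.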

Step 6: In the following piece of code we perform tokenization and remove punctuation and stopwords from the data.
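A sketch of how this cleaning step might be written, assuming NLTK's `word_tokenize` and its English stopword list. The function names `remove_noise` and `tokenize_and_clean` are hypothetical. The filtering is kept separate from NLTK so that it can be tried on any list of tokens.

```python
import string


def remove_noise(tokens, stop_words):
    """Drop stopwords and tokens made up entirely of punctuation."""
    punctuation = set(string.punctuation)
    stop_words = {word.lower() for word in stop_words}
    return [
        token
        for token in tokens
        if token.lower() not in stop_words
        and not all(char in punctuation for char in token)
    ]


def tokenize_and_clean(text):
    """Tokenize `text` with NLTK, then filter it with remove_noise()."""
    # Imported here so remove_noise() can be tried without NLTK installed.
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # One-time downloads. Newer NLTK releases look for "punkt_tab" where
    # older ones used "punkt", so both are requested here (an assumption
    # about which release is installed).
    for resource in ("punkt", "punkt_tab", "stopwords"):
        nltk.download(resource, quiet=True)

    return remove_noise(word_tokenize(text), stopwords.words("english"))
```

Only tokens that consist entirely of punctuation are dropped, so a token such as "Mr." keeps its trailing period and can still be matched in the next step.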

Step 7: All the names in the input PDF are prefixed by Mr., Mrs., or Ms. In our code, we use these prefixes to extract the full names. We use the enumerate function to walk through the tokens and save all the names in a list (name_list).
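The prefix lookup can be sketched as follows. How many words follow each prefix is not stated above, so `words_per_name=2` (a first name and a surname) is an assumption, as is the function name `extract_names`. The sample names in the usage lines are made up and do not come from the PDF.

```python
PREFIXES = {"Mr.", "Mrs.", "Ms."}


def extract_names(tokens, words_per_name=2):
    """Collect each prefix plus the words that follow it into name_list."""
    name_list = []
    for i, token in enumerate(tokens):
        if token not in PREFIXES:
            continue
        name_parts = []
        for word in tokens[i + 1 : i + 1 + words_per_name]:
            if word in PREFIXES:
                break  # the next person's prefix starts here
            name_parts.append(word)
        if name_parts:
            name_list.append(" ".join([token, *name_parts]))
    return name_list


sample_tokens = ["Mr.", "John", "Smith", "met", "Ms.", "Jane", "Doe"]
print(extract_names(sample_tokens))
```

Stopping at the next prefix keeps two adjacent entries from being merged when a name has fewer words than expected.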

Step 8: The final step is to print the names and close the file.
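A small sketch of this last step, assuming the names are already in a list and the PDF file object is still open. The function name `print_names` and its `pdf_file` parameter are hypothetical.

```python
def print_names(name_list, pdf_file=None):
    """Print one numbered name per line, then close the file if one is given."""
    for number, name in enumerate(name_list, start=1):
        print(f"{number}. {name}")
    if pdf_file is not None:
        pdf_file.close()
```

If the file was opened in a `with` block instead, it is closed automatically and the second argument can be left out.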

I hope this article helps you understand, to some extent, the process of extracting text from a PDF.

Here is a GitHub link for the code: https://github.com/poonam-ydv/NLP-

NLP
Machine Learning
AI
Data Science