How to extract text from a PDF (NLP)
Extracting specific text from a PDF file

I have a growing interest in natural language processing (NLP). It is a vast subject with interesting, far-reaching effects across industries. To begin, I picked a simple task: extracting text, or specific data, from a given document.
Let me take you through how I approached it.
Requirement: Extract the names of individuals from the Municipal Corporation of Greater Mumbai listed on page 2 of this PDF: http://www.udri.org/pdf/02%20working%20paper%201.pdf
Step 1: Installing the required Python packages.
Here we use three packages: PyPDF2, textract and nltk.
PyPDF2 is a Python library built as a PDF toolkit. It provides a variety of functions, such as extracting information from a PDF, splitting or merging documents page by page, cropping pages, and encrypting or decrypting PDF files. textract is a package for extracting text from many document formats. NLTK stands for Natural Language Toolkit. It is a platform for building Python programs that work with human language data, and it contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.
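
Before moving on, it can help to confirm that the three packages are actually importable. The snippet below is only a small sketch using the standard library; the helper name missing_packages and the constant REQUIRED_PACKAGES are made up for this illustration and are not part of the original code.

```python
import importlib.util

# Import names of the three third-party packages used in this article.
REQUIRED_PACKAGES = ("PyPDF2", "textract", "nltk")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the packages from `names` that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

If missing_packages() returns a non-empty list, running pip install PyPDF2 textract nltk in a terminal should install whatever is missing.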

Step 2: Importing the needed libraries

Step 3: Next we fetch the PDF from the given URL using urllib.request and store the downloaded file in wFile.
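
A minimal sketch of this download step could look like the following. The function name fetch_pdf is an assumption made for this example; wFile is the name mentioned above. The bytes are copied into an in-memory buffer because a PDF reader needs a stream it can seek in, which a raw HTTP response does not offer.

```python
import io
import urllib.request


def fetch_pdf(url):
    """Download a PDF and return its bytes in a seekable in-memory buffer."""
    # wFile is the open response; the with-block closes it automatically.
    with urllib.request.urlopen(url) as wFile:
        return io.BytesIO(wFile.read())
```

For the document used in this article, the call would be fetch_pdf("http://www.udri.org/pdf/02%20working%20paper%201.pdf"), assuming the URL is still reachable.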

Step 4: In this step we create an object of the PdfFileReader class from the PyPDF2 module, passing it the PDF file object, to get a PDF reader object.

Step 5: Here we use the getPage function to access the required page of the PDF, and extractText() to extract the text from that page. Note that getPage counts from zero: getPage(1) returns the second page of the file, while getPage(2) returns the third.
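
Steps 4 and 5 together can be sketched roughly as below. The helper name extract_page_text and its parameters are assumptions for this illustration. PdfFileReader, getPage and extractText are the names used in this article; to my knowledge, newer PyPDF2 releases expose the same functionality as PdfReader, pages and extract_text, so the sketch tries those first.

```python
def extract_page_text(pdf_stream, page_index):
    """Return the text of one page; page_index is zero-based."""
    # Imported here so the helper can be defined before PyPDF2 is installed.
    import PyPDF2

    if hasattr(PyPDF2, "PdfReader"):
        # Naming used by newer PyPDF2 releases.
        reader = PyPDF2.PdfReader(pdf_stream)
        return reader.pages[page_index].extract_text()

    # Naming used in this article (older PyPDF2 releases).
    reader = PyPDF2.PdfFileReader(pdf_stream)
    return reader.getPage(page_index).extractText()
```

Here pdf_stream is any seekable binary stream holding the PDF, for example a file opened in "rb" mode or an io.BytesIO buffer. The quality of the returned text depends on how the PDF was produced, so it is worth printing the result once and checking that it is the page you expect.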

Step 6: In this step we tokenize the text and remove the punctuation and stop words from the data.
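
One way to sketch this cleaning step is shown below. The function name clean_tokens and its optional tokenize and stop_words parameters are assumptions added for this example, so that the filtering can also be tried without NLTK. When they are left out, the sketch falls back to NLTK's word_tokenize and its English stop word list.

```python
import string


def clean_tokens(text, tokenize=None, stop_words=None):
    """Split text into tokens, then drop punctuation marks and stop words."""
    if tokenize is None or stop_words is None:
        # NLTK is imported lazily; its data files are downloaded on first use.
        import nltk
        from nltk.corpus import stopwords
        from nltk.tokenize import word_tokenize

        # Assumption: recent NLTK releases need "punkt_tab", older ones "punkt".
        for resource in ("punkt", "punkt_tab", "stopwords"):
            nltk.download(resource, quiet=True)
        if tokenize is None:
            tokenize = word_tokenize
        if stop_words is None:
            stop_words = set(stopwords.words("english"))

    punctuation = set(string.punctuation)
    cleaned = []
    for token in tokenize(text):
        # Skip tokens made only of punctuation, such as "," or "...".
        if all(char in punctuation for char in token):
            continue
        # Case-sensitive check, so capitalised words (likely names) are kept.
        if token in stop_words:
            continue
        cleaned.append(token)
    return cleaned
```

A token such as "Mr." survives the filter because it contains letters, which matters for the name lookup in the next step.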

Step 7: All the names in the input PDF are prefixed by Mr., Mrs. or Ms. We use these titles to locate the full names: with the help of the enumerate function, we walk through the tokens and save every name in a list (name_list).
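
The lookup described in this step might be sketched as follows. The function name extract_names and the words_per_name parameter are assumptions for this illustration, and so is the idea that each name consists of the two tokens following the title; adjust it if the names in your document are longer or shorter.

```python
def extract_names(tokens, titles=("Mr.", "Mrs.", "Ms."), words_per_name=2):
    """Collect 'title + following words' for every title found in tokens."""
    name_list = []
    for idx, token in enumerate(tokens):
        if token in titles:
            # Take the tokens right after the title as the name parts.
            following = tokens[idx + 1 : idx + 1 + words_per_name]
            if following:  # ignore a title that is the very last token
                name_list.append(" ".join([token, *following]))
    return name_list


# Made-up sample tokens, not taken from the actual PDF.
sample_tokens = ["Mr.", "Ajay", "Mehta", "Chairman", "Ms.", "Rina", "Shah"]
names = extract_names(sample_tokens)
```

With the sample tokens above, names holds one entry for each title found, each made of the title and the two tokens after it.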

Step 8: The final step is to print the names and close the file.


I hope this article helps you understand the process of text extraction to some extent.
Here is a GitHub link for the code: https://github.com/poonam-ydv/NLP-






