Marco Rodrigues

Summary

The provided content outlines a method for extracting email attachments, specifically PDF files, from Gmail using Python programming with the IMAP protocol.

Abstract

The article details a process for automating the extraction of PDF attachments from Gmail using Python's imaplib and email libraries. It emphasizes the value of email attachments as a source of data for various applications such as data analysis, automation, integration, and deep learning. The process involves configuring Gmail to enable IMAP access, creating an app password for secure script access, and writing Python functions to connect to the Gmail server, search for messages, and save attachments to a local directory. The article underscores the efficiency and versatility of this approach for handling email data and integrating it into other systems or workflows.

Opinions

  • The author suggests that email attachments are an underutilized resource that can provide significant value when extracted and analyzed.
  • Automating the extraction of attachments is presented as a time-saving measure that can streamline business processes and data analysis tasks.
  • The integration of extracted data into databases or deep learning models is seen as a way to enhance the functionality of applications and derive actionable insights from email data.
  • The use of Python for this task is advocated due to its strong ecosystem for data manipulation, visualization, and machine learning.
  • The article implies that with the right tools and scripts, users can creatively leverage their email inboxes as a data source, suggesting a proactive approach to email management.

How to Extract Attachments from Gmail with Python

Use IMAP protocol with Python to automate Gmail data extraction

Image generated with DreamStudio

Email has been around since the early days of the internet. Long before messaging apps, video calls and now the metaverse, email was, and remains, one of the main channels of communication. People rely on Microsoft Outlook, Gmail, Proton Mail and others for business, newsletters, personal letters, exchanging documents and much more.

The diversity of email usage ultimately makes it a valuable source of data. Extracting attachments from Gmail, for instance, can bring several benefits:

  • Data Analysis: If you have a subscription that regularly sends you PDF files, CSVs, TXTs, or other formats, you can extract and analyze them.
  • Automation: Let’s say you have several clients that send you Purchase Orders every day, and you don’t want to waste time downloading and grabbing the information from them manually. You can extract the data, apply processing steps and upload the cleaned data to a database.
  • Integration: Email can be used as a gateway of information for your application. For instance, files sent to your inbox can seamlessly be redirected to your platform’s database.
  • Deep Learning: Image, video and text files sent to your inbox can be automatically downloaded and used to feed Deep Learning models.

These examples are just a glimpse of what email attachments make possible. With Python, we can easily put the ideas above into practice: it offers several libraries to manipulate data, integrates well with data visualisation tools, is one of the most widely used languages for machine learning and deep learning, and provides the imaplib and email packages to extract data from Gmail.

1 — Configure your Google and Gmail account

To access the Gmail inbox through a Python script, we first need to do a simple configuration.

First, go to your Gmail account and click on Settings -> See all settings. Among the tabs, go to Forwarding and POP/IMAP and select Enable IMAP, as in the image below:

Enable IMAP in Gmail Settings

Finally, we need a password to connect through the Internet Message Access Protocol (IMAP). The fastest way is to go to this link, where you should be able to create an App password. Note that Google only offers App passwords on accounts with 2-Step Verification enabled.

Create an App password in the Google Account

When you click on Create, make sure to copy the password to a safe place.
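
One way to keep the App password out of the script itself is to read it from environment variables. The following is a minimal sketch, assuming the hypothetical variable names GMAIL_USER and GMAIL_APP_PASSWORD and a hypothetical helper load_gmail_credentials, none of which come from the original article:

```python
import os


def load_gmail_credentials(env=None):
    """Return (user_email, user_password) read from environment variables.

    GMAIL_USER and GMAIL_APP_PASSWORD are hypothetical names chosen for
    this sketch; use whatever names fit your setup.
    """
    # allow a plain dict to be passed in, which makes the helper easy to test
    env = os.environ if env is None else env
    user_email = env.get("GMAIL_USER")
    user_password = env.get("GMAIL_APP_PASSWORD")
    if not user_email or not user_password:
        raise RuntimeError("Set GMAIL_USER and GMAIL_APP_PASSWORD first")
    return user_email, user_password
```

The returned pair can then be passed as the email and password arguments of the connection function in the next step.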

2 — Get all messages from a Gmail folder

With Python’s imaplib library, we can connect to an email server, access email folders (such as the inbox, sent items, or custom folders), and perform various operations on email messages. Let’s start by importing the library and writing a function to extract the messages from a folder.

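import imaplib
from tqdm import tqdm  # third-party progress bar: pip install tqdm

def get_messsages_gmail(
        user_email,
        user_password,
        last_email=None,
        email_folder='INBOX',
        from_email="All"):
    """This function extracts the messages objects from a gmail account"""

    # connect to gmail over SSL
    gmail = imaplib.IMAP4_SSL("imap.gmail.com")

    # sign in with your address and the App password
    gmail.login(user_email, user_password)

    try:
        # select the folder
        gmail.select(email_folder)

        # search all messages, or only those from one sender
        if from_email == 'All':
            resp, items = gmail.search(None, 'ALL')
        else:
            resp, items = gmail.search(None, f'(FROM "{from_email}")')

        # items[0] is one bytes string of space-separated identifiers
        items = items[0].split()
        msgs = []
        # last_email is the slice end; None keeps every identifier
        # (the previous default of -1 silently dropped the last message)
        for num in tqdm(items[:last_email]):
            typ, message_parts = gmail.fetch(num, '(RFC822)')
            msgs.append(message_parts)
    finally:
        # always end the session, even if a command fails
        gmail.logout()

    return msgs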

This function takes user_email; user_password, which is the App password obtained in the previous step; last_email, which is used as the end of the slice over the message identifiers (a negative value leaves that many identifiers off the end of the list); email_folder; and from_email, where we can specify a sender’s email address or pass All to extract messages from all senders.

Then we create an imaplib.IMAP4_SSL instance, which encapsulates the connection to the IMAP4 server, in this case Gmail’s: imap.gmail.com.

We call the connection object gmail and use it to log in, then we select the folder (INBOX by default). The if condition runs the search, either over all senders or a specific one, and stores the result in the variables resp and items.

The items list holds a single bytes string of space-separated message identifiers (items[0]), which we break into a list of identifiers with .split().
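
To make the search step concrete without a live connection, here is a small sketch of the two pieces of data involved: the criterion string sent to the server and the response that comes back. The helper names build_search_criterion and parse_message_ids are hypothetical, and the sample response is made up for illustration:

```python
def build_search_criterion(from_email="All"):
    """Return the IMAP search criterion for all senders or a single one."""
    if from_email == "All":
        return "ALL"
    # quoting the address keeps it a single search argument
    return f'(FROM "{from_email}")'


def parse_message_ids(items):
    """Split the search response into a list of message identifiers."""
    # the search response is a list with one bytes string of
    # space-separated identifiers, e.g. [b'1 2 3']
    return items[0].split()


# made-up response, shaped like the data returned by a search
sample_items = [b"1 2 3 4"]
ids = parse_message_ids(sample_items)
print(build_search_criterion("orders@example.com"))
print(ids)       # the identifiers stay as bytes
print(ids[:-1])  # a slice end of -1 leaves out the last identifier
```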

Finally, we iterate over the item identifiers and ask the server to return the email messages in the RFC 822 format. We append all these messages to the msgs list and return it.
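
Each element appended to msgs is the data returned by fetch, which is typically a list containing a (header, raw message) tuple followed by a closing b')'. The sketch below pulls the raw bytes out of such a response. Here extract_raw_message is a hypothetical helper, and the sample response is hand-written rather than taken from a real server:

```python
def extract_raw_message(message_parts):
    """Return the raw RFC 822 bytes from one fetch response, or None."""
    for item in message_parts:
        # the message arrives as a (header, raw bytes) tuple; other
        # items, such as the closing b')', are plain bytes
        if isinstance(item, tuple):
            return item[1]
    return None


# hand-written response in the shape fetch(num, '(RFC822)') usually has
sample_response = [
    (b"1 (RFC822 {32}", b"Subject: hello\r\n\r\nJust a test.\r\n"),
    b")",
]
raw = extract_raw_message(sample_response)
print(raw.decode("utf-8"))
```

This shape is why the extraction function in the next step checks whether the first element of each response is a tuple before parsing it.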

3 — Extract PDF files from the messages list

Now we can write another function that takes the list generated by the previous one and extracts the PDF files from each message.

import email
import os

def get_pdf_attachments(msgs, data_folder):
    """This function extracts the pdf files"""
    # make sure the destination folder exists
    os.makedirs(data_folder, exist_ok=True)
    for msg_raw in msgs:
        # a fetch response starts with a (header, raw message bytes) tuple
        if not isinstance(msg_raw[0], tuple):
            continue
        # parse the raw bytes; no manual utf-8 decoding is needed
        msg = email.message_from_bytes(msg_raw[0][1])
        for part in msg.walk():
            # multipart parts are only containers for other parts
            if part.get_content_maintype() == 'multipart':
                continue
            # parts without a Content-Disposition header are skipped
            if part.get('Content-Disposition') is None:
                continue

            # get_filename() returns None when the part has no filename
            filename = part.get_filename()
            if not filename or not filename.lower().endswith('.pdf'):
                continue

            # keep only the base name so the file stays inside data_folder
            file_path = os.path.join(data_folder, os.path.basename(filename))

            try:
                # Save the file
                with open(file_path, 'wb') as file:
                    file.write(part.get_payload(decode=True))
            except OSError as error:
                print(error)

It takes only two arguments: the list of messages and the data folder where we want to save the files.

We iterate over the msgs list and, in each iteration, parse the message with email.message_from_bytes(), which builds a message object directly from the raw bytes returned by the server.

The msg.walk() method iterates over all the parts of the message, such as text, images and other attachments.

If the main content type is multipart, the continue statement skips the current part and moves on to the next one, since a multipart part is only a container for other parts. The same happens if the Content-Disposition header is missing.

Finally, we grab the PDF files by checking whether the name returned by part.get_filename() ends with ".pdf", ignoring case.

The .get_filename() method only returns the name of the file. To get the content, we call .get_payload(decode=True), which returns the decoded bytes that we write to the new file.
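
The part-walking logic can be tried without touching a mailbox. The sketch below builds a sample message in memory with EmailMessage from the standard library, parses it from bytes with message_from_bytes, and walks it with the same kind of checks described above. The helper names, addresses and attachment bytes are made up for illustration:

```python
from email import message_from_bytes
from email.message import EmailMessage


def build_sample_message():
    """Build a message with a text body and a fake PDF attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Purchase order"
    msg["From"] = "client@example.com"
    msg["To"] = "me@example.com"
    msg.set_content("Please find the order attached.")
    # the bytes are placeholder content, not a valid PDF
    msg.add_attachment(
        b"%PDF-1.4 placeholder",
        maintype="application",
        subtype="pdf",
        filename="order.PDF",
    )
    return msg.as_bytes()


def list_pdf_attachments(raw_message):
    """Return (filename, payload) pairs for the PDF parts of a message."""
    msg = message_from_bytes(raw_message)
    found = []
    for part in msg.walk():
        # skip containers and parts that are not marked with a disposition
        if part.get_content_maintype() == "multipart":
            continue
        if part.get("Content-Disposition") is None:
            continue
        filename = part.get_filename()
        if filename and filename.lower().endswith(".pdf"):
            found.append((filename, part.get_payload(decode=True)))
    return found


for name, payload in list_pdf_attachments(build_sample_message()):
    print(name, len(payload))
```

The text body is skipped because it carries no Content-Disposition header, so only the attachment is reported.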

Conclusion

The implementation is simple: with two functions we are ready to extract attachments from Gmail. In the example, we did it with PDFs, but it can be extended to other file formats, such as CSV, XLSX, TXT, PNG and so on. Now you have a script that can, for instance:

  • Save you hours of work by integrating it into your automation scripts.
  • Gather valuable data and derive insights from your newsletters using data analysis.
  • Download and upload data to your database.
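
Extending the script to other formats mostly means changing the filename check. A possible sketch, assuming a hypothetical helper has_allowed_extension and an example set of extensions:

```python
import os


def has_allowed_extension(filename, extensions=(".pdf", ".csv", ".xlsx")):
    """Return True when filename ends with one of the given extensions."""
    if not filename:
        # get_filename() returns None for parts without a filename
        return False
    # splitext isolates the extension; lower() makes the check case-insensitive
    ext = os.path.splitext(filename)[1].lower()
    return ext in extensions


for name in ["report.CSV", "photo.png", None]:
    print(name, has_allowed_extension(name))
```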

Be inventive, as email won’t be leaving us anytime soon.
