avatarZoltan Guba

Summary

The article provides a guide on how to read DOCX files with Python using standard library tools and BeautifulSoup without the need for external packages.

Abstract

This article demonstrates a method for reading DOCX files using Python's standard library, which is particularly useful in environments where installing additional packages is restricted. It explains that DOCX files are essentially XML files within a ZIP archive, and by utilizing Python's zipfile, xml.etree.ElementTree, and the third-party BeautifulSoup library, one can extract and parse the content of a DOCX file. The process involves unzipping the DOCX file, prettifying the extracted XML content for readability, and then parsing the XML to retrieve the text data using XPath expressions. The article emphasizes that understanding the XML structure and XPath is crucial for extracting specific parts of the document, such as text elements written in a particular font.

Opinions

  • The author believes that Python is well-suited for processing large amounts of data, including less conventional formats like DOCX files.
  • It is noted that while external libraries are often convenient, they may not always be feasible in corporate environments, highlighting the importance of being able to work with standard libraries.
  • The article suggests that once the logic of the XML structure within a DOCX file is understood, it is relatively straightforward to parse documents using tools from the standard library.
  • The author provides a personal opinion that the BeautifulSoup library is useful for making XML content more human-readable, which facilitates the understanding and extraction of specific data points.
  • There is an appreciation for the complexity of the XML structure in DOCX files, as evidenced by the detailed explanation of how to navigate and extract information from it.
  • The author encourages readers to support them with a coffee donation if they find the content useful, indicating a desire for appreciation and acknowledgment for the effort put into creating educational content.

How to Read DOCX Files With Python

Photo by geralt from pixabay

Processing large amounts of data is one of the use cases Python excels in. Sometimes however we don’t have the content we need in convenient csv or XLS files, databases or other user-friendly structures: we have to work with MS Word documents, DOCX for instance. How can we read Word documents?

We can use dedicated libraries to interact with this type of files, however sometimes this is not feasible: for instance, you are trying to solve such a problem on a corporate computer, and you can’t just pip install any package you wish.

Fortunately DOCX content is stored in XML files under the hood — even though digging in a bit and understanding the structure can be a bit time consuming, once you have the logic you can easily parse documents with tools from the standard library.

Please note: I will use the ZIP, XML and BeautifulSoup modules for this demonstration, however I am not going to go into details on how they work. In case you need a refresher I am going to link documentation pages for reference.

This is a super basic 1-pager Lorem ipsum document I am going to use for the article: even though this is not even near the complexity of some of the DOCX files you might have to work with, it can give you the general idea.

The sample docx file — screenshot by the author

zipfile module

The zipfile library was created to read and work with compressed files. Reading in our sample document is just a ZipFile object creation using the file itself as the argument.

import zipfile
doc_zip = zipfile.ZipFile(“Lorem ipsum.docx”)

As a result, we got our ZipFile object (by default in read mode: mode=’r’):

Screenshot by the author

This object now contains the constituting files making up for the docx document, all we need to do is read at least one of them to get the document’s content. We can list the names of all archive members in the object using the ZipFile.namelist() method:

doc_zip .namelist()
Screenshot by the author

A bunch of xml files are revealed under the hood of the docx archive — discovering all these might be enticing, however now I want to focus on the actual string content of my file: that I can do by accessing the ‘word/document.xml’ file by calling the read method on my ZipFile object:

doc_xml = doc_zip.read(‘word/document.xml’)

Now we have the content to parse, however we are not yet out of the woods:

Screenshot by the author

Prettify and Parse XML content

The returned xml document is far from human-friendly at this stage. We can find parts of the text we saw in the original document, but we need some tweaking to make it palatable.

Fortunately XML is perfectly structured to find the pieces we need, we just have to get the gist of the logic at hand. The BeautifulSoup library can do the necessary tidy-up so that we can find the logic behind the storage of our text:

from bs4 import BeautifulSoup
soup_xml = BeautifulSoup(doc_xml, “xml”)
pretty_xml = soup_xml.prettify()

This is now a (more or less) human readable hierarchical structure we can work with! Notice the complexity of the first line itself (“Lorem ipsum”): all the attributes describing exactly what should appear in front of you when opening the document:

Screenshot by the author

In order to fetch desired parts of the document we need to define the XPath of these text elements — the location they are sitting in the XML file.

XPath search

Similar to HTML XPath locations, XML paths define the parent-child relationship under which you would like to access a certain data point. At this stage our prettified XML is just a string: in order to traverse it we need it to be a proper XML object. Python’s XML module can do just that for us:

import xml.etree.ElementTree as ET
root = ET.fromstring(pretty_xml)

This is now a proper XML Element:

Screenshot by the author

Now we can use the find and findall methods to locate specific XML node(s) using their XPath. Locating the “body” element for instance looks like this:

namespace = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
body = root.find(‘w:body’, namespace)

Note the “namespace” variable/parameter in the above code. Namespaces are used to avoid confusing when mixing multiple XML documents — after all, the tag names and structure are completely up to the developer. For our purposes this has no particular significance.

Screenshot by the author

Note: since there is only one “body” element, in which all other child XML elements of the document are stored, using the findall method is not necessary. You can use it though, you will simply get back a list of Elements with length 1:

Screenshot by the author

Looking at the prettified XML we notice that all text blocs in the document are stored in “w:t” tags — if your goal is to get all text stored in the document, we just have to loop through these tags and get the text:

namespace = {‘w’: “http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
text_elements = root.findall(‘.//w:t’, namespace)
for t_element in text_elements:
    print(t_element.text)

Here we are, the text from our docx file is ready to go. The text extracted can be now manipulated further any way you want.

Screenshot by the author

The “.//w:t” XPath defined is searching for all “w:t” elements, no matter where they are sitting in the element tree.

If you would like to keep me caffeinated for creating more content like this please consider supporting me, with just a coffee.

Assume you would like to do something more sophisticated beside grabbing all text in a document, for instance reading only specific headers, names of chapters and so on. Since this sample document is rather simple there is not much difference in the location of the elements in the tree nor in their attributes — however the title (“Lorem ipsum”) is not written in the default font, instead in Comic Sans MS. This XML document stores the font data in a “w:rFonts” node in attribute “w:ascii” (not only in that actually, but that is the first):

Screenshot by the author

If I want to grab only the text elements written with this font I can do that: just need to write a bit longer XPath expression:

xpath = './/w:rFonts[@w:ascii="Comic Sans MS"]/../..//w:t'
comic_sans_elements = root.findall(xpath, namespace)
for element in comic_sans_elements:
    print(element.text)
Screenshot by the author

The XPath reads like this:

  1. find all “w:rFonts” nodes anywhere in the root where the “w:ascii” attributes equals to “Comic Sans MS”
  2. Step up two levels on the element tree
  3. Get all “w:t” nodes anywhere within the element located in step 2

Note that for these search criteria you need to know the structure of the document rather well so that you can make sure you get all elements you need.

Thank you for reading this post. Even though I have touched multiple libraries in order to reach our goal, this was not intended to be a ZIP, XML, or BeautifulSoup tutorial, this is why I was so generous with the assumptions that you knew these modules — if this is not the case please visit the linked documentation pages.

More content at PlainEnglish.io. Sign up for our free weekly newsletter. Follow us on Twitter, LinkedIn, YouTube, and Discord.

Python
Data
Automation
Document Management
Microsoft Office
Recommended from ReadMedium