Lars Nielsen

Summary

The web content outlines three methods for cleaning HTML text in preparation for NLP text pre-processing: using regular expressions, Beautiful Soup, and XML's ElementTree.

Abstract

The article provides a concise guide on how to remove HTML tags from text to prepare a corpus for NLP tasks. It emphasizes the importance of this pre-processing step when dealing with a large number of HTML files. The first method discussed is the use of regular expressions, which are powerful for string extraction and pattern matching within HTML tags. The second method introduces Beautiful Soup, a package designed for web scraping that can parse various DOM structures to extract text. Lastly, the article covers the itertext() method from XML's ElementTree module, which is effective for text extraction from XML-structured data but requires well-formed data to function correctly. The article is part of a larger narrative on understanding NLP, from TF-IDF to transformers.

Opinions

  • Regular expressions are highly regarded for their versatility in complex string operations, including HTML tag removal.
  • Beautiful Soup is recommended for its robust parsing capabilities and ease of use in extracting text from web pages.
  • The use of XML's ElementTree is noted to be sensitive to data structure correctness, implying that it may not be

3 ways to clean your HTML text for NLP text pre-processing

How to remove the HTML tags from your corpus for building your NLP data-set

Image source: Jackson So (unsplash.com)

This article is part of the supporting material for the story ‘Understanding NLP — from TF-IDF to transformers’.

Background

When you need to process a large number of HTML files in your corpus, cleaning the HTML is usually a necessary pre-processing step. Here are 3 ways to do it.

1. Using Regex

Regular expressions are a popular and powerful tool for complex string extraction. Widely used in data mining and string matching, regex can easily be employed to search for string patterns between HTML tags.

html_text = "<HTML><HEAD> This is HEAD <INSIDE> The is inside tag </INSIDE></HEAD> <BODY> This is BODY </BODY></HTML>"
#the html text you want to process

We will use the above text stored in the variable html_text and try to parse out the text elements, that is, the text that sits between the tags.

Using regular Expression to remove HTML tags
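
A minimal sketch of the regex approach, assuming a simple pattern that treats anything between "<" and ">" as a tag. The helper name strip_tags_regex and the whitespace clean-up step are illustrative choices, and a pattern this simple is not a full HTML parser (it does not handle comments, scripts, or a ">" inside an attribute value):

```python
import re

html_text = "<HTML><HEAD> This is HEAD <INSIDE> The is inside tag </INSIDE></HEAD> <BODY> This is BODY </BODY></HTML>"

# Assumed pattern: "<", then one or more characters that are not ">", then ">"
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags_regex(text):
    """Replace every tag with a space, then collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


print(strip_tags_regex(html_text))
```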

The output :

This is HEAD The is inside tag This is BODY

2. Using Beautiful Soup

Beautiful Soup is a Python package widely used to scrape text from web pages. Its parsing methods can handle many different DOM structures. Here we will use it to parse the HTML-formatted text.

Using Beautiful Soup to extract text from HTML tags
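
A minimal sketch of the Beautiful Soup approach, assuming the beautifulsoup4 package is installed and that Python's built-in "html.parser" backend is good enough for the input. The variable names are illustrative, and the sample string is the same one used in the regex section:

```python
from bs4 import BeautifulSoup

html_text = "<HTML><HEAD> This is HEAD <INSIDE> The is inside tag </INSIDE></HEAD> <BODY> This is BODY </BODY></HTML>"

# Parse the markup with the parser that ships with Python
soup = BeautifulSoup(html_text, "html.parser")

# get_text() gathers the text of every node; the separator keeps
# neighbouring pieces from running together
raw_text = soup.get_text(separator=" ")

# Collapse the leftover runs of whitespace into single spaces
clean_text = " ".join(raw_text.split())

print(clean_text)
```

Other parser backends (for example lxml) can be passed in place of "html.parser" if they are installed; they may treat unusual markup differently.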

3. Using XML’s ElementTree

The itertext() method in the ElementTree module of Python's standard xml package can be used to pull out all the text in a document, as long as the document conforms to an XML structure.

Note: This method is very sensitive to the correctness of the data's structure. If the data is not well-formed XML (for example, a tag is left unclosed), parsing fails with an error instead of returning text.

Using XML’s ElementTree for extracting text from HTML
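
A minimal sketch of the ElementTree approach, using the same sample string as in the regex section. The helper name extract_text_etree and the choice to return None on a parse failure are assumptions made for illustration; the second call shows the structural sensitivity described in the note above:

```python
import xml.etree.ElementTree as ET

html_text = "<HTML><HEAD> This is HEAD <INSIDE> The is inside tag </INSIDE></HEAD> <BODY> This is BODY </BODY></HTML>"


def extract_text_etree(markup):
    """Return the text of well-formed markup, or None if it is not valid XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Raised when the markup is not well-formed, e.g. an unclosed tag
        return None
    # itertext() yields the text and tail of every element in document order
    joined = "".join(root.itertext())
    return " ".join(joined.split())


print(extract_text_etree(html_text))

# <BR> is never closed, so this is valid-looking HTML but not well-formed XML
print(extract_text_etree("<P>unclosed <BR></P>"))
```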

This article is part of the supporting material for the story — ‘Understanding NLP — from TF-IDF to transformers’. For other pre-processing steps, please follow the above story.
