Ruslan Brilenkov

Analyzing COVID-19 Papers With Python — Part 1

Explaining Regular Expressions (regex) plus a cheat sheet of useful rules

Photo by Guido Hofmann on Unsplash

Introduction

The rising second wave of COVID-19 worries me. Luckily, I can work from home and stay mostly indoors during these turbulent times. However, many people around the world cannot do that, especially those who earn their living from day-to-day services.

The pandemic hits everyone, but it hits some harder than others. While I have the chance, I want to do more and do better.

In this situation, I ask myself “What can I do?” or “How can I help to fight the pandemic?”

Well, I can apply my knowledge of Python, programming, and data analysis to describe or create tools that others can use to get some insights into the coronavirus, for example, to help with the analysis of COVID-19 data from scientific papers.

Here, I am going to talk about how anyone can start analyzing a text using Python. This post is meant to encourage everyone to use Python in day-to-day life, not only for personal goals but also for real-world, large-scale issues such as the coronavirus pandemic.

In the framework of the current situation, I am going to show how we can get some useful information about COVID-19 from a scientific article automatically “without actually reading the paper”. Because

Time is money.

— Benjamin Franklin, 1748

In my opinion, this aphorism understates the case, as time is worth much more than money. If you have time, you can earn more money. But if you have money, you cannot buy even a single extra minute. So, let us try to find a way to save some of our time with Python.

Of course, it is much more useful to actually read and study scientific papers. However, there are several points I can mention:

  • there are just too many papers to read them all in a reasonable amount of time
  • there is so much information of different kinds that it becomes impractical to keep it all in our heads
  • in the hands of professionals, the algorithms can be customized to perform much better than manual work
  • and, of course, the knowledge of regular expressions might be quite handy in other circumstances as well

The Idea Behind This Project

What if we could simply define some criteria and let the algorithm “read” the papers for us, automating the monotonous, lengthy, or boring work with Python? Let us think in terms of collecting as much useful data as possible while doing as little as possible. We work on the algorithm now so that it works for us later.

As a result, we will accomplish a lot in a short amount of time.

Here, I will introduce you to regular expressions (regex), a useful language for pattern searching.

The Structure of This Tutorial

To describe this project better, it is reasonable to divide it into two parts. Here, in Part 1, I am going to briefly describe regular expressions, provide a cheat sheet for the common regex rules, and give a few examples of how to look for patterns in a text. In Part 2, I will show how we can apply this knowledge to a real-world problem by analyzing a scientific article on COVID-19.

Regular Expressions

Definition

A regular expression (also known as regex or regexp) is a powerful tool for methodically analyzing a given text (string). Basically, a regex is a sequence of characters that defines a search pattern (a combination of letters, numbers, and symbols). For example, string-searching algorithms use these patterns to find (and replace) matching pieces of text. Using a regex is like sifting through the text.

Photo by Sorin Gheorghita on Unsplash

Each character in a regex is either a regular character or a metacharacter. Regular characters have a literal meaning, such as a, b, c or 1, 2, 3. Metacharacters have a special meaning. For example, a dot . matches every character except a newline \n, and a star * matches zero or more repetitions of the preceding character.

A regex is a language of its own, much like SQL (sequel). That means the core syntax stays largely the same no matter what programming language we are using, although individual regex flavors differ in some details. We are going to use it within Python by importing a module called re.

A good thing about re is that it is part of the Python standard library, so you do not have to worry about downloading and installing anything. It is already there, ready to be used.

A Few Words About Kaggle

About half a year ago, Kaggle announced a competition on analyzing COVID-19 scientific articles. I hope, by the end of this project (Part 1 + Part 2), you will be able to participate in similar competitions.

By the way, there is a reason why I am not participating in Kaggle competitions: their policy. As Frederik Bussler mentioned in his Medium article on the top international alternatives to Kaggle:

“Members of the Kaggle community who are not United States Citizens or legal permanent residents at the time of entry are allowed to participate in the Competition but are not eligible to win prizes. If a team has one or more members who are not prize eligible, then the entire team is not prize eligible.”

The reason I am writing here is to share my knowledge and contribute to the community as best I can.

Even if my article helps only one researcher, scientist, or doctor to figure out something about the novel coronavirus disease using regular expressions, I will be happy. Equally, I will be really happy if it helps anyone who is looking for ways to search for patterns in text using Python.

The Regex Cheat Sheet

It is useful to have the following general rules and expressions somewhere nearby while working with regex.

Identifiers/Characters:

  • \w = any alphanumeric character or the underscore. For ASCII text, equivalent to [A-Za-z0-9_]
  • \W = anything but an alphanumeric character or underscore
  • \d = any digit. For ASCII text, equivalent to [0-9]
  • \D = anything but a digit
  • \s = any whitespace character, including a space, a tab, or a new line
  • \S = anything but a whitespace character
  • . = any character, except for a new line
  • \b = a word boundary, i.e., the zero-width position between a word character and a non-word character. Useful for matching whole words
  • \. = period. Must come with a backslash before because . normally means any character.
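To see these identifiers in action, here is a minimal sketch. The sample strings are made up for this illustration and do not come from any paper.

```python
import re

sample = "Room_42 opens at 9.30 am"  # made-up sample text

words = re.findall(r"\w+", sample)           # runs of word characters
digits = re.findall(r"\d+", sample)          # runs of digits
non_space = re.findall(r"\S+", sample)       # runs of non-whitespace characters
literal_dot = re.findall(r"\d\.\d", sample)  # digit, literal dot, digit

# \b keeps "am" from matching inside a longer word such as "mammal"
whole_word = re.findall(r"\bam\b", "am I a mammal? I am")

print(words)        # ['Room_42', 'opens', 'at', '9', '30', 'am']
print(digits)       # ['42', '9', '30']
print(non_space)    # ['Room_42', 'opens', 'at', '9.30', 'am']
print(literal_dot)  # ['9.3']
print(whole_word)   # ['am', 'am']
```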

Modifiers:

  • + = match 1 or more repetitions
  • ? = match 0 or 1 repetitions
  • * = match 0 or more repetitions
  • $ = matches at the end of the string
  • ^ = matches at the start of the string
  • | = either/or. For example, x|y matches either x or y
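A short sketch of each modifier, again on made-up strings chosen only to make the differences visible:

```python
import re

plus = re.findall(r"lo+l", "ll lol lool")          # one or more "o"
star = re.findall(r"lo*l", "ll lol lool")          # zero or more "o"
optional = re.findall(r"colou?r", "color colour")  # zero or one "u"
start = re.findall(r"^\w+", "first middle last")   # word at the start of the string
end = re.findall(r"\w+$", "first middle last")     # word at the end of the string
either = re.findall(r"cat|dog", "cat, dog, bird")  # one alternative or the other

print(plus)      # ['lol', 'lool']
print(star)      # ['ll', 'lol', 'lool']
print(optional)  # ['color', 'colour']
print(start)     # ['first']
print(end)       # ['last']
print(either)    # ['cat', 'dog']
```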

Brackets/Quantifiers:

  • [] = a set of characters (a “range”). Square brackets. For example, [a-z] matches any single lowercase letter a-z, and [2-7a-rA-Y] matches any single digit between 2 and 7, lowercase letter a-r, or uppercase letter A-Y
  • {x} = exactly x instances of the preceding character or pattern. Curly brackets. For instance, a{2} matches aa
  • {x,y} = matches the preceding character or pattern x to y times. Like the previous rule, but with a range of values. For instance, a{3,5} matches aaa, aaaa and aaaaa.
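The bracket and quantifier rules can be checked with a small sketch like the one below. The test strings are invented for the example, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# A character set matches ONE character from the set; + repeats it.
lowercase = re.findall(r"[a-z]+", "abc DEF ghi")

# Several ranges can share one pair of square brackets.
mixed = re.findall(r"[2-7a-rA-Y]", "1 5 9 b z Q Z")

# {x} asks for exactly x repetitions.
exactly_two = re.fullmatch(r"a{2}", "aa") is not None

# {x,y} asks for between x and y repetitions.
candidates = ["aa", "aaa", "aaaaa", "aaaaaa"]
three_to_five = [s for s in candidates if re.fullmatch(r"a{3,5}", s)]

print(lowercase)      # ['abc', 'ghi']
print(mixed)          # ['5', 'b', 'Q']
print(exactly_two)    # True
print(three_to_five)  # ['aaa', 'aaaaa']
```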

White Space Characters:

  • \n = new line
  • \s = any whitespace character (space, tab, new line)
  • \t = tab

Remember to escape these (special) metacharacters with a backslash when you want to match them literally:

. + * ^ $ ( ) [ ] { } | \

For example, to search for a dot, put a backslash in front of it: \. Otherwise, the dot will match any character except a new line.
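Here is a minimal sketch of what escaping changes. The sample sentence is made up. The last step uses re.escape from the standard library, which adds the backslashes for us when the text to search for comes from somewhere else.

```python
import re

price_text = "Total: 3.50 (approx.) or 3x50?"  # made-up sample text

# Unescaped dot: matches ANY character, so "3x50" is found as well.
unescaped = re.findall(r"3.50", price_text)

# Escaped dot: matches only a literal period.
escaped = re.findall(r"3\.50", price_text)

# re.escape puts a backslash in front of every special character.
literal_pattern = re.escape("3.50 (approx.)")
found = re.findall(literal_pattern, price_text)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']
print(found)      # ['3.50 (approx.)']
```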

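If you are interested in learning more about regex commands, you can visit this page (https://docs.oracle.com/cd/E20593_01/doc.560/e23601/app_regexp.htm) or this one (https://www.rexegg.com/regex-quickstart.html), for example.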

The More Specific the Pattern the Better

Usually, a more specific pattern is also a longer one. The reason it helps is that the most time-consuming part of a search is typically not a successful match between the pattern and the text, but the failed attempts. A loose pattern lets the regex engine go a long way before it discovers that there is no match, and all of that work is a waste of computational power and time.

Tip: The more specific (longer) the pattern, the better the performance of the regex search.

So, using something like .* (dot and star) at the beginning of the pattern is risky, because the star is greedy. It starts by grabbing as much text as it can (everything up to the end of the line) and then has to give characters back one by one until the rest of the pattern fits. That first step alone can add a large number of useless computations.

For example, suppose we know that each line starts with a date in the year-month-day format, such as 2020-09-29, followed by something else. Then we can specify this format directly as [12]\d{3}-[01]\d-[0-3]\d instead of just .*
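The greedy behavior is easy to see on a small, made-up string. This sketch only illustrates what gets matched; it does not measure speed.

```python
import re

line = "<b>bold</b> and <i>italic</i>"  # made-up sample text

# Greedy: .* runs to the end of the line, then backs up to the LAST ">".
greedy = re.findall(r"<.*>", line)

# Lazy: .*? stops at the first ">" it can reach.
lazy = re.findall(r"<.*?>", line)

# Specific: [^>]* cannot run past a ">", so no backing up is needed.
specific = re.findall(r"<[^>]*>", line)

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(specific)  # ['<b>', '</b>', '<i>', '</i>']
```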

The explanation of this piece of code is the following:

  • [12]\d{3} selects a 4-digit number that starts with either 1 or 2 (determined by the square brackets [12]), followed by exactly 3 digits (determined by \d{3}). In other words, it selects any year from 1000 to 2999
  • [01]\d selects a two-digit number whose first digit is either 0 or 1 (determined by the square brackets [01]) and whose second digit is any digit from 0 to 9 (determined by \d). This covers every month from 01 to 12, although it also lets through impossible values such as 00 or 19
  • [0-3]\d selects a two-digit number whose first digit is 0, 1, 2, or 3 and whose second digit is any digit (0 to 9). This covers every day from 01 to 31, although it also lets through impossible values such as 00 or 39.

In this case, the more specific version of a regex pattern ([12]\d{3}-[01]\d-[0-3]\d) will perform better/faster than the general pattern (.*).
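Since this pattern checks only the shape of a date, a second check can be useful when the value itself matters. The sketch below is one possible approach, not the only one: the helper name leading_date and the sample lines are made up for this illustration, and the validity check relies on datetime.strptime from the standard library.

```python
import re
from datetime import datetime

DATE = re.compile(r"[12]\d{3}-[01]\d-[0-3]\d")

def leading_date(line):
    """Return the date at the start of the line, or None (hypothetical helper)."""
    match = DATE.match(line)  # match() only looks at the start of the string
    if match is None:
        return None
    try:
        # strptime rejects shapes that fit the regex but are not real dates
        return datetime.strptime(match.group(), "%Y-%m-%d").date()
    except ValueError:
        return None

lines = [
    "2020-09-29 second wave report",
    "2020-19-39 fits the pattern but is not a date",
    "no date at the start",
]
for line in lines:
    print(leading_date(line))
```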

The Usage of Regex with Simple Examples

To use regular expressions in Python, we have to import the re module

import re

We are going to see several examples of the basic usage of regex. Let us define a string of text for our first example.

Example 1

text = "Isabella is 16 years old, and Jonas is 24 years old. \
Their grandfather, Andrew, is 95 years old, and Andrew's father, Arthur, is 120."

Suppose we know that the text contains the names and the ages of people. We will call the re.findall function, which has the form

re.findall(r"<search pattern>", <text>)

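Note that we put r before the quotes to make the pattern a raw string. In a raw string, Python leaves backslashes untouched, so sequences such as \d or \b reach the regex engine exactly as written.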
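A quick sketch of why the r prefix matters. The sentence is made up; the point is that Python itself reads "\b" in a normal string as a backspace character before the regex engine ever sees it.

```python
import re

sentence = "a cat sat"  # made-up sample text

# Raw string: the regex engine receives \b and treats it as a word boundary.
raw = re.findall(r"\bcat\b", sentence)

# Normal string: "\b" is a single backspace character, so nothing matches.
not_raw = re.findall("\bcat\b", sentence)

print(raw)       # ['cat']
print(not_raw)   # []
print(len(r"\b"), len("\b"))  # 2 1
```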

Retrieving the Ages of People

For ages, we will look for any number ranging from 1 to 3 digits

ages = re.findall(r"\d{1,3}",text)
print("\n{}".format(ages))

The output will be a list of strings: ['16', '24', '95', '120']
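One thing to keep in mind is that \d{1,3} also matches pieces of longer numbers. The sketch below uses a made-up sentence with a year in it to show the effect, and adds word boundaries as one possible way to avoid it.

```python
import re

sample = "Isabella is 16 years old, and Jonas is 24 years old. Born in 1996."

# Without boundaries, the year 1996 is cut into '199' and '6'.
loose = re.findall(r"\d{1,3}", sample)

# With \b on both sides, only standalone numbers of 1 to 3 digits match.
strict = re.findall(r"\b\d{1,3}\b", sample)

print(loose)   # ['16', '24', '199', '6']
print(strict)  # ['16', '24']
```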

Retrieving the Names of People

With the names, the situation is a bit more complicated, as we have one repeated name (Andrew and Andrew’s) and one word that starts a sentence (i.e., starts with a capital letter) but is not a name (Their). So, a simple line such as

names = re.findall(r"[A-Z][a-z]+", text)
print("\n{}".format(names))

returns the list ['Isabella', 'Jonas', 'Their', 'Andrew', 'Andrew', 'Arthur'], because it searches for any word starting with a capital letter followed by one or more lowercase letters (indicated by the + sign). We need a more specific pattern!

To remove the possessive forms (apostrophe + s), such as Andrew’s father, we can modify our search to exclude these words with either a negative lookahead or a negative lookbehind. For example, using a negative lookahead, the search pattern will be

names2 = re.findall(r'\b(?![A-Z][a-z]+\'s\b)[A-Z][a-z]+\b', text)
print("\n{}".format(names2))

that returns ['Isabella', 'Jonas', 'Their', 'Andrew', 'Arthur']. Great! That is much better, as we now skip every name in the possessive form.

I will explain this part of the code in short. The pattern (?![A-Z][a-z]+\'s\b) includes

  • (?! ... ) a negative lookahead that makes the match fail at the current position if the text that follows is a capitalized word immediately followed by an apostrophe + s
  • \'s an apostrophe and the letter s. The apostrophe is not a regex metacharacter; the backslash is there only because the pattern is written inside single quotes
  • \b a word boundary, i.e., the position between a word character and a non-word character, here the end of the word.

Tip: The difference between a negative lookahead and a negative lookbehind is the direction in which they look. A negative lookahead (?!...) checks that the text to the right of the current position does not match, while a negative lookbehind (?<!...) checks that the text to the left of it does not match. Neither of them consumes any characters.
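To make the two directions concrete, here is a small sketch on made-up sentences. The lookahead version is a slightly different way of writing the possessive filter, placed after the name instead of before it.

```python
import re

family = "Andrew's father, Arthur, met Andrew."  # made-up sample text
receipt = "Paid $20 for 3 tickets"               # made-up sample text

# Negative lookahead: a capitalized word NOT followed by 's.
not_possessive = re.findall(r"\b[A-Z][a-z]+\b(?!'s)", family)

# Negative lookbehind: a number NOT preceded by a dollar sign.
not_price = re.findall(r"(?<!\$)\b\d+\b", receipt)

print(not_possessive)  # ['Arthur', 'Andrew']
print(not_price)       # ['3']
```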

And, finally, to remove the words that are not names, we can cross-match the result with a list of known names and keep only the words that appear in both. For example, using set.intersection to cross-match these two collections

array_of_names = ["Isabella", "Jonas", "Andrew", "Arthur"] # etc.

filtered_names = set(names2).intersection(array_of_names)
print(filtered_names)

As a result, we get a set of names {'Andrew', 'Jonas', 'Arthur', 'Isabella'} (a set is unordered, so the order may differ). This might not be the best way of excluding those extra words, but it works in this case.

Dear Reader, if you have any suggestions or ideas on how to exclude the extra words (and keep only people’s names), please let me know. I will be happy to learn from you. For the complete code, please visit my GitHub page (https://github.com/RuslanBrilenkov/Regex_COVID-19_papers).
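One possible alternative, offered only as a sketch, is to lean on the context around the names. In this particular text every name is followed by "is" and an age, so a pattern with two capturing groups can pick up the name and the age together. This relies on that sentence structure and would not carry over to arbitrary text.

```python
import re

text = ("Isabella is 16 years old, and Jonas is 24 years old. "
        "Their grandfather, Andrew, is 95 years old, "
        "and Andrew's father, Arthur, is 120.")

# Group 1: a capitalized word; then an optional comma, " is ", and
# group 2: a number of 1 to 3 digits.
pairs = re.findall(r"\b([A-Z][a-z]+),? is (\d{1,3})\b", text)

# With two groups, findall returns (name, age) tuples.
ages_by_name = {name: int(age) for name, age in pairs}

print(pairs)
print(ages_by_name)
```

Neither "Their" nor the possessive "Andrew's" is followed by "is" and a number, so both drop out without a separate list of names.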

Example 2

Here, we will take a look at how to select a zip code. There has already been a good discussion on this topic, for example, this one (https://stackoverflow.com/questions/578406/what-is-the-ultimate-postal-code-and-zip-regex). To save you time, I show only a few examples to give you an idea of what is going on.

In short, zip/postal codes around the world do not follow a common pattern. They can vary from numbers only to a combination of numbers and letters, they differ in length, and they can contain separators such as spaces or hyphens. Here are several examples

"US": "\d{5}([ \-]\d{4})?"
"CA": "[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJ-NPRSTV-Z][ ]?\d[ABCEGHJ-NPRSTV-Z]\d"
"DE", "IT", "UA": "\d{5}"
"JP": "\d{3}-\d{4}"
"FR": "\d{2}[ ]?\d{3}"
"CH": "\d{4}"
"NL": "\d{4}[ ]?[A-Z]{2}"
"LI": "(948[5-9])|(949[0-7])"

For a complete list of postal codes, please follow the above-mentioned discussion or the address data page (http://i18napis.appspot.com/address), for example. To find the zip code in the text with Python, we follow the same process as in the first example:

# 1. import regex
import re
# 2. define or import a given text
example_text = "..."
# 3. search for the zip code patterns, e.g., US zip codes
# (?:...) is a non-capturing group, so findall returns the whole match
US_zip_codes = re.findall(r"\d{5}(?:[ \-]\d{4})?", example_text)
# 4. print the codes if you wish
print("\n{}".format(US_zip_codes))
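One detail of re.findall is worth a short sketch: when a pattern contains a capturing group ( ... ), findall returns the content of the group instead of the whole match. Writing the group as (?: ... ) makes it non-capturing and brings the full codes back. The sample sentence and the zip codes in it are made up for this illustration.

```python
import re

sample = "Offices: 10001, 94105-1234 and 60601 1234."  # made-up sample text

# Capturing group: findall returns only the optional "+4" part,
# and an empty string where that part is missing.
with_group = re.findall(r"\d{5}([ \-]\d{4})?", sample)

# Non-capturing group: findall returns the complete zip codes.
# The \b anchors keep the pattern from matching inside longer numbers.
without_group = re.findall(r"\b\d{5}(?:[ \-]\d{4})?\b", sample)

print(with_group)     # ['', '-1234', ' 1234']
print(without_group)  # ['10001', '94105-1234', '60601 1234']
```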

In Conclusion

Due to the rising second wave of the COVID-19 pandemic, I started asking myself more seriously how I can contribute to studying this virus. The idea of this project is to get some insights from a medical scientific paper using Python.

In this article, we discussed what regular expressions (regex) are and how we can use this language within Python to retrieve some useful textual information. We saw a cheat sheet of the most common regex rules and went through a few hands-on exercises. In Part 2, we will apply this knowledge to a real scientific paper.

I want to encourage everyone to use Python in day-to-day life. If you are interested in how to start recording your weight using Python to keep track of it, please take a look at my previous article (https://readmedium.com/track-your-weight-with-python-4bf0cae42ef3).

Thank you for reading my article! I hope you have learned something new.
