Regular Expressions in Python
Introduction
Regular expressions (regex) are a powerful tool for working with strings, and most programming languages support them. Covering all of their syntax and features would take a whole book. This post introduces the basic syntax that is most useful in daily programming and demonstrates how to use regex in Python.
Basic Syntax
A regex is not a complex algorithm or data structure. It is just a set of common rules, defined by people, that make matching, finding, and searching strings convenient.
For example, suppose we want to write a script that checks whether a string is a valid e-mail address. With regex, only two steps are needed:
- Design a regex, following the rules, that matches e-mail addresses.
- Check every e-mail address we receive against that regex.
Very easy, right?
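As a rough sketch of those two steps in Python: the pattern below is a deliberately simplified, hypothetical one (real e-mail address rules are far more permissive), and the names EMAIL_PATTERN and is_valid_email are made up for this illustration.

```python
import re

# Step 1: design a regex. This is a simplified, hypothetical pattern:
# some name characters, an @, then at least two dot-separated domain parts.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def is_valid_email(address):
    """Step 2: check an address against the regex (whole string must match)."""
    return EMAIL_PATTERN.fullmatch(address) is not None


print(is_valid_email("someone@example.com"))  # True
print(is_valid_email("not-an-email"))         # False
```

The syntax used inside the pattern is explained in the rules that follow.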
All right, let’s see what the rules are:
Firstly, to match a single character:
- If a character is given directly, it is an exact match.
- Use \d to match a digit.
- Use \w to match a letter, a digit, or an underscore.
- A . can match any character except a newline.
- Use \s to match a whitespace character (space, tab, newline, etc.).
Examples:
- 00\d can match '007', but can't match '00M'.
- \d\w\d can match '1A1', '2k9', etc.
- c.t can match 'cat', 'c!t', 'c9t', etc.
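These single-character rules can be checked quickly in Python. The snippet below is a minimal sketch; the helper name matches is made up for this illustration, and it uses re.fullmatch so that the whole string has to match the pattern.

```python
import re


def matches(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"00\d", "007"))    # True
print(matches(r"00\d", "00M"))    # False: M is not a digit
print(matches(r"\d\w\d", "1A1"))  # True
print(matches(r"\d\w\d", "2k9"))  # True
print(matches(r"c.t", "c!t"))     # True
print(matches(r"c.t", "c\nt"))    # False: . does not match a newline
```

The patterns are written as raw strings (r"...") so that Python does not interpret the backslashes itself.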
Secondly, to match a variable number of characters:
- Use * to match 0 or more repetitions of the preceding regex, as many repetitions as are possible.
- Use + to match 1 or more repetitions of the preceding regex.
- Use ? to match 0 or 1 repetitions of the preceding regex.
- Use {n} to match exactly n repetitions of the preceding regex.
- Use {n,m} to match between n and m repetitions (inclusive) of the preceding regex.
Examples:
- ab* can match 'a', 'ab', or 'a' followed by any number of 'b's.
- ab+ can match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
- ab? can match either 'a' or 'ab'.
- \d{5} can match five digits, such as '12345'.
- \d{2,8} can match 2 to 8 digits, such as '25', '999', '12345678'.
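The repetition rules can be tried out the same way. This is a small sketch; fully_matches is a hypothetical helper name, again built on re.fullmatch so the whole string must match.

```python
import re


def fully_matches(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


print(fully_matches(r"ab*", "a"))            # True: zero 'b's is fine
print(fully_matches(r"ab*", "abbb"))         # True
print(fully_matches(r"ab+", "a"))            # False: at least one 'b' needed
print(fully_matches(r"ab?", "ab"))           # True
print(fully_matches(r"ab?", "abb"))          # False: at most one 'b'
print(fully_matches(r"\d{5}", "12345"))      # True
print(fully_matches(r"\d{2,8}", "999"))      # True
print(fully_matches(r"\d{2,8}", "1"))        # False: fewer than 2 digits
```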
Looks easy, right? But don’t start celebrating yet 🙂. Let’s look at an interesting example:
If we use <.*> to match '<a> b <c>', it will match the entire string, not just '<a>'. This kind of behaviour can be surprising: quantifiers are greedy by default, which means they match as many characters as possible.
In situations like this, we can add ? after the quantifier to make it non-greedy, so that as few characters as possible are matched. The regex <.*?> will match only '<a>'.
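The difference between greedy and non-greedy matching is easy to see in a short sketch using Python's re module (the variable names are only for illustration):

```python
import re

text = "<a> b <c>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.*>", text).group()
print(greedy)      # <a> b <c>

# Non-greedy: .*? stops at the first '>' it can.
non_greedy = re.match(r"<.*?>", text).group()
print(non_greedy)  # <a>

# With the non-greedy form, findall picks out each tag separately.
tags = re.findall(r"<.*?>", text)
print(tags)        # ['<a>', '<c>']
```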
Thirdly, there are some more tools for more precise matching:
- Use [] to indicate a set of characters. For example, [0-9a-z] can match a digit between 0 and 9 or a lowercase letter between a and z.
- A|B can match A or B, so (P|p)ython can match 'Python' or 'python'.
- ^ indicates the beginning of the line. For example, ^\d indicates that the string must start with a digit.
- $ indicates the end of the line. For example, \d$ indicates that the string must end with a digit.
- Use \ to escape special characters (permitting us to match characters like '*', '?', and so forth).
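Here is a brief sketch of these tools in Python. The sample strings and variable names are invented for illustration; re.findall, re.fullmatch, and re.search are functions of the standard re module.

```python
import re

# [] defines a set of characters: digits and lowercase letters here.
in_class = re.findall(r"[0-9a-z]", "A1b_Z9")
print(in_class)  # ['1', 'b', '9']

# | separates alternatives; the group limits the choice to the first letter.
python_names = [s for s in ("Python", "python", "Jython")
                if re.fullmatch(r"(P|p)ython", s)]
print(python_names)  # ['Python', 'python']

# ^ and $ anchor the pattern; without them re.search looks anywhere.
starts_with_digit = re.search(r"^\d", "7 dwarfs") is not None
ends_with_digit = re.search(r"\d$", "Agent 007") is not None
no_leading_digit = re.search(r"^\d", "Agent 007") is not None
print(starts_with_digit, ends_with_digit, no_leading_digit)  # True True False

# \ turns a special character into a literal one.
stars = re.findall(r"\*", "2*3*4")
print(stars)  # ['*', '*']
```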
How to use regex in Python?
Python has a module re which makes it very convenient to use regex. Let’s see how it works:
- re.match()
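As a minimal sketch of re.match(): it tries to match the pattern only at the beginning of the string and returns a match object on success, or None otherwise. The sample strings and variable names below are made up for illustration.

```python
import re

# The string starts with three digits, so we get a match object back.
m = re.match(r"\d{3}", "007 is an agent")
if m:
    print(m.group())  # 007

# re.match only looks at the start of the string, so this returns None
# even though three digits appear later on.
no_match = re.match(r"\d{3}", "agent 007")
print(no_match)  # None
```

Because the result can be None, it is usually tested with if before calling group().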





