Bash Scripts — Part 9 — Regular Expressions

A gentle introduction to regular expressions in Bash

To process text fully in bash scripts with sed and awk, you need to understand regular expressions. Implementations of this most useful tool can be found almost everywhere, and all of them are built on the same ideas, yet working with them has its own peculiarities in each environment. Here we will talk about regular expressions that are suitable for use in Linux command-line scripts. This material is intended as an introduction for those who may not know at all what regular expressions are. So let’s start from the very beginning.

What regular expressions are

Many people, when they first see regular expressions, immediately think that they are looking at a meaningless jumble of characters. But this, of course, is far from the case. Take a look at this regex for example:

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Some experienced programmers would say that even an absolute beginner will immediately understand how it works and why you need it 🙂

Well, if you do not quite understand what’s going on here (I did not at first either), just keep on reading, and everything will fall into place.

A regular expression is a pattern that programs like sed or awk use to filter text. Patterns use regular ASCII characters that represent themselves, and so-called metacharacters, which play a special role, for example, allowing you to refer to certain groups of characters.

Types of regular expressions

Implementations of regular expressions can easily be found in various environments, for example in programming languages such as Java, Perl, and Python, and in Linux tools such as sed, awk, grep, and many others. However, they have certain quirks depending on the environment. These differences come from the so-called regex engines, which interpret the patterns.

Linux has two regular expression engines:

An engine that supports the POSIX Basic Regular Expression (BRE) standard.
An engine that supports the POSIX Extended Regular Expression (ERE) standard.

Most Linux utilities conform at least to the POSIX BRE standard, but some utilities (including sed) understand only a subset of the BRE standard. One of the reasons for this limitation is the desire to make such utilities as fast as possible at text processing.

The POSIX ERE standard is often implemented in programming languages. It allows you to use a lot of tools when developing regular expressions. For example, these can be special character sequences for frequently used patterns, such as searching for individual words or sets of numbers in the text. Awk supports the ERE standard.

There are many ways to write regular expressions, depending both on the opinion of the programmer and on the features of the engine for which they are created. It is not easy to write generic regular expressions that any engine can understand. Therefore, we will focus on the most commonly used regular expressions and look at how they are implemented for sed and awk.
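
The difference between the two engines is easy to see with grep, which uses BRE by default and ERE when run with -E. The sketch below is only an illustration; the helper names bre_match and ere_match are made up for this example and are not part of any tool.

```shell
# Hypothetical helpers: print "yes" if TEXT ($2) matches PATTERN ($1), otherwise "no".
# bre_match uses plain grep (BRE), ere_match uses grep -E (ERE).
bre_match() {
    if printf '%s\n' "$2" | grep -q -- "$1"; then echo yes; else echo no; fi
}
ere_match() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then echo yes; else echo no; fi
}

bre_match 'te+st' 'teest'   # no:  in BRE "+" is an ordinary character
bre_match 'te+st' 'te+st'   # yes: the literal text "te+st" is found
ere_match 'te+st' 'teest'   # yes: in ERE "+" means "one or more"
```

The same pattern text can therefore mean different things depending on which engine reads it, which is why it helps to know which standard a tool follows.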

POSIX BRE Regular Expressions

Perhaps the simplest BRE pattern is a regular expression that finds an exact sequence of characters in a text. This is what searching a string for such a pattern looks like in sed and awk:

~$ echo "This is a test" | sed -n '/test/p'
This is a test

~$ echo "This is a test" | awk '/test/{print $0}'
This is a test

You will notice that the search for a given pattern is performed without regard to the exact location of the text in the string. In addition, the number of occurrences does not matter. After the regular expression finds the given text anywhere in the string, the string is considered valid and is passed on for further processing.

When working with regular expressions, keep in mind that they are case-sensitive:

~$ echo "This is a test" | awk '/Test/{print $0}'
~$ echo "This is a test" | awk '/test/{print $0}'
This is a test

The first regular expression did not match, since the word “Test” starting with a capital letter does not occur in the text. The second, written to search for the lowercase word, found a matching string in the stream.
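
When case should not matter, one possible approach is to lowercase the line before matching. The sketch below assumes awk's standard tolower() function and grep's -i option; the helper name match_test_ci and the sample lines are invented for illustration.

```shell
# Hypothetical helper: print the input lines that contain "test" in any letter case.
# tolower() lowercases the whole line before it is compared with the pattern.
match_test_ci() {
    awk 'tolower($0) ~ /test/ {print $0}'
}

# Prints the first two lines and skips the third.
printf '%s\n' "This is a Test" "THIS IS A TEST" "nothing here" | match_test_ci

# grep offers the same behavior through its -i option.
printf '%s\n' "This is a Test" "nothing here" | grep -i 'test'
```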

In regular expressions, you can use not only letters but also spaces and numbers:

~$ echo "This is a test 2 again" | awk '/test 2/{print $0}'
This is a test 2 again

Spaces are treated by the regular expression engine as regular characters.

Special symbols

There are a few things to keep in mind when using different characters in regular expressions. Some characters, called special characters or metacharacters, need special handling when used in a pattern. Here they are:

.*[]^${}\+?|()

If one of them is needed in the pattern as a literal character, it must be escaped with a backslash (\).

For example, if you need to find a dollar sign in the text, it must be included in the pattern preceded by an escape character. Let’s say you have a file myfile with this text:

There is 10$ on my pocket

The dollar sign can be detected using a pattern like this:

~$ cat myfile
There is 10$ on my pocket

~$ awk '/\$/{print $0}' myfile
There is 10$ on my pocket

A backslash is also a special character, so if you want to use it in a pattern, you will need to escape it too. It looks like two backslashes:

~$ echo "\ is a special character" | awk '/\\/{print $0}'
\ is a special character

Although the forward slash is not included in the above list of special characters, trying to use it in a regular expression written for sed or awk will result in an error:

~$ echo "3 / 2" | awk '///{print $0}'
awk: cmd. line:1: ///{print $0}
awk: cmd. line:1:    ^ syntax error

If it is needed, it must also be escaped:

~$ echo "3 / 2" | awk '/\//{print $0}'
3 / 2
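
Escaping every slash gets hard to read when a pattern contains several of them. As a sketch of two alternatives: sed accepts a different delimiter, and awk can take the pattern as a string on the right-hand side of the ~ operator. The sample strings are invented for illustration.

```shell
# In a sed address, \# makes "#" the delimiter, so "/" needs no escaping.
echo "3 / 2" | sed -n '\#/#p'                      # prints: 3 / 2

# In the s command, the character right after "s" becomes the delimiter.
echo "/usr/local/bin" | sed 's#/usr/local#/opt#'   # prints: /opt/bin

# In awk, a string used with ~ is treated as a regular expression.
echo "3 / 2" | awk '$0 ~ "/" {print $0}'           # prints: 3 / 2
```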

Anchor symbols

There are two special characters that anchor a pattern to the beginning or end of a text string. The caret character (^) lets you describe sequences of characters that are at the beginning of text lines. If the pattern you are looking for is found elsewhere in the string, the regular expression will not respond to it. The use of this symbol looks like this:

~$ echo "welcome to mraevsky website" | awk '/^mraevsky/{print $0}'
~$ echo "mraevsky website" | awk '/^mraevsky/{print $0}'
mraevsky website

The ^ symbol is designed to search for a pattern at the beginning of a string, and the case of the characters is taken into account as well. Let's see how this affects the processing of a text file:

~$ cat myfile
this is a test
This is another test
And this is one more

~$ awk '/^this/{print $0}' myfile
this is a test

When using sed, if you place a caret anywhere other than at the beginning of a pattern, it will be treated like any other regular character:

~$ echo "This ^ is a test" | sed -n '/s ^/p'
This ^ is a test

We figured out how to find pieces of text at the beginning of a line. What if you want to find something at the end of a line?

The dollar sign ($), which is the anchor character for the end of the line, will help us with this:

~$ echo "This is a test" | awk '/test$/{print $0}'
This is a test

Both anchor characters can be used in the same pattern. Let’s process the file myfile, the contents of which are shown below, using the following regular expression:

~$ cat myfile
this is a test
This is another test
And this is one more

~$ awk '/^this is a test$/{print $0}' myfile
this is a test

As you can see, the template reacted only to a line that fully corresponds to the specified sequence of characters and their location.

Here’s how to filter out empty lines using anchor characters:

~$ awk '!/^$/{print $0}' myfile

In this pattern, I used a negation character, the exclamation mark (!). The pattern ^$ matches lines that contain nothing between the beginning and end of the line, and the exclamation mark makes awk print only the lines that do not match it.
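
Lines that look empty but hold only spaces are not matched by ^$. The sketch below, using an invented four-line sample, compares the pattern above with a variant that also allows spaces, and shows a sed command with the same effect as the first one.

```shell
# Hypothetical sample: a text line, an empty line, a line of three spaces, a text line.
blank_sample() { printf 'first\n\n   \nsecond\n'; }

blank_sample | awk '!/^$/{print $0}'     # drops only the truly empty line
blank_sample | awk '!/^ *$/{print $0}'   # also drops the line that holds only spaces
blank_sample | sed '/^$/d'               # sed: delete empty lines, same result as the first command
```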

Dot symbol

A period matches any single character except for a line feed. Let’s apply such a regular expression to the file myfile, the contents of which are given below:

~$ cat myfile
this is a test
This is another test
And this is one more
start with this

~$ awk '/.st/{print $0}' myfile
this is a test
This is another test

As you can see from the output, only the first two lines of the file match the pattern, since they contain the character sequence “st” preceded by one more character. The third line does not contain a suitable sequence, and in the fourth line “st” is present but sits at the very beginning of the line, with no character before it.

Character classes

The period matches any single character, but what if you need more flexibility to limit the set of characters you are looking for? In such a situation, you can use character classes.

Thanks to this approach, you can organize a search for any character from a given set. Square brackets are used to describe a character class — []:

~$ cat myfile
this is a test
This is another test
And this is one more
start with this

~$ awk '/[oi]th/{print $0}' myfile
This is another test
start with this

Here we are looking for the “th” character sequence preceded by the “o” or “i” character.

Classes come in handy when looking for words that can start with both uppercase and lowercase letters:

~$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'
this is a test

~$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'
This is a test

Character classes are not limited to letters. Other symbols can be used here. It is impossible to say in advance in what situation the classes will be needed — everything depends on the problem being solved.

The negation of character classes

Character classes can also be used to solve the opposite of the problem described above. Namely, instead of searching for characters included in the class, you can search for everything that is not included in it. To get this behavior, place a ^ sign in front of the list of class characters. It looks like this:

~$ cat myfile
this is a test
This is another test
And this is one more
start with this

~$ awk '/[^oi]th/{print $0}' myfile
And this is one more
start with this

In this case, sequences of characters “th” will be found, before which there is neither “o” nor “i”.
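
A negated class still has to match one character, which is easy to confuse with negating the whole pattern. The following sketch, with invented sample lines, contrasts the two.

```shell
# Hypothetical sample: digits only, letters only, and a mix of both.
digit_sample() { printf '%s\n' "12345" "abc" "abc123"; }

digit_sample | awk '/[^0-9]/{print $0}'   # prints abc and abc123: each contains a non-digit
digit_sample | awk '!/[0-9]/{print $0}'   # prints abc only: the one line without any digit
```

The first command asks "is there a character that is not a digit?", while the second asks "is there no digit at all?".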

Ranges of characters

In character classes, you can describe ranges of characters using a dash:

~$ cat myfile
this is a test
This is another test
And this is one more
start with this

~$ awk '/[e-p]st/{print $0}' myfile
this is a test
This is another test

In this example, the regular expression responds to the character sequence “st” preceded by any character alphabetically located between the characters “e” and “p”.

Ranges can also be created from numbers:

~$ echo "123" | awk '/[0-9][0-9][0-9]/'
123

~$ echo "12a" | awk '/[0-9][0-9][0-9]/'
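
Note that the pattern above only requires three digits somewhere in the line. If the whole line has to consist of exactly three digits, the pattern can be combined with the anchors covered earlier. A small sketch with made-up input:

```shell
echo "id1234" | awk '/[0-9][0-9][0-9]/{print $0}'     # prints id1234: three digits occur somewhere
echo "id1234" | awk '/^[0-9][0-9][0-9]$/{print $0}'   # prints nothing: the line is not exactly three digits
echo "123"    | awk '/^[0-9][0-9][0-9]$/{print $0}'   # prints 123
```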

The character class can contain several ranges:

~$ cat myfile
this is a test
This is another test
And this is one more
start with this

~$ awk '/[a-fm-z]st/{print $0}' myfile
this is a test
This is another test

This regex matches every line in which “st” is preceded by a character from the ranges a-f and m-z.

Special character classes

BRE has special character classes that you can use when writing regular expressions:

[[:alpha:]] - matches any alphabetic character, in upper or lower case.
[[:alnum:]] - matches any alphanumeric character, namely the characters in the ranges 0-9, A-Z, a-z.
[[:blank:]] - matches a space or a tab character.
[[:digit:]] - matches any numeric character from 0 to 9.
[[:upper:]] - matches uppercase alphabetic characters, A-Z.
[[:lower:]] - matches lowercase alphabetic characters, a-z.
[[:print:]] - matches any printable character.
[[:punct:]] - matches punctuation marks.
[[:space:]] - matches whitespace characters: a space, a tab, NL, FF, VT, CR.

You can use special classes in templates like this:

~$ echo "abc" | awk '/[[:alpha:]]/{print $0}'
abc

~$ echo "abc" | awk '/[[:digit:]]/{print $0}'

~$ echo "abc123" | awk '/[[:digit:]]/{print $0}'
abc123
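
These classes work in sed as well, where they are handy for cleaning up text. A short sketch on a made-up sample string:

```shell
# Remove every punctuation character.
echo "Hello,   World! 123" | sed 's/[[:punct:]]//g'                # prints: Hello   World 123

# Collapse each run of whitespace into a single space.
echo "Hello,   World! 123" | sed 's/[[:space:]][[:space:]]*/ /g'   # prints: Hello, World! 123

# A class can be negated too: remove everything that is not a digit.
echo "Hello,   World! 123" | sed 's/[^[:digit:]]//g'               # prints: 123
```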

Star symbol

If you place an asterisk after a character in the pattern, it means that the regular expression will work if the character appears in the string any number of times — including the situation when the character is absent in the string.

~$ echo "test" | awk '/tes*t/{print $0}'
test

~$ echo "tessst" | awk '/tes*t/{print $0}'
tessst

This wildcard character is usually used to work with words that often contain typos, or for words that can be spelled differently:

~$ echo "I like green color" | awk '/colou*r/{print $0}'
I like green color

~$ echo "I like green colour " | awk '/colou*r/{print $0}'
I like green colour

In this example, the same regexp reacts to both the word “color” and the word “colour”. This is because the character “u”, which is followed by an asterisk, can either be absent or appear several times in a row.

Another useful feature derived from the characteristics of the asterisk symbol is to combine it with a dot. This combination allows the regular expression to respond to any number of any characters:

~$ cat myfile
this is a test
This is another test
And this is one more
start with this

~$ awk '/this.*test/{print $0}' myfile
this is a test

The asterisk can also follow a character class. With a pattern such as /t[ae]*st/, the regular expression matches “tst”, “test”, “tast”, and “teest” alike, because an asterisk after a character class means that the string matches whether any number of “a” or “e” characters are found at that position or none at all.
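
A sketch of the asterisk applied to a character class; the pattern /t[ae]*st/ and the sample words are chosen for illustration only.

```shell
# Hypothetical demo: run several words through the same pattern.
star_demo() {
    for word in tst test tast teest tost; do
        echo "$word" | awk '/t[ae]*st/{print $0}'
    done
}

star_demo
# prints tst, test, tast and teest, one per line;
# "tost" is skipped because "o" is not in the class
```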

POSIX ERE Regular Expressions

The POSIX ERE patterns that some Linux utilities support may contain additional characters. As already stated, awk supports this standard. By default sed does not, although GNU sed switches to ERE when it is run with the -E option.

Here we will look at the most commonly used symbols in ERE patterns, which will come in handy when creating your own regular expressions.

❔ Question mark

The question mark indicates that the preceding character may appear once in the text or not at all. This character is one of the repetition metacharacters. Here are some examples:

~$ echo "tet" | awk '/tes?t/{print $0}'
tet

~$ echo "test" | awk '/tes?t/{print $0}'
test

~$ echo "tesst" | awk '/tes?t/{print $0}'

As you can see, in the third case the letter “s” occurs twice, so the regular expression does not respond to the word “tesst”.

The question mark can be used with character classes as well:

~$ echo "tst" | awk '/t[ae]?st/{print $0}'
tst

~$ echo "test" | awk '/t[ae]?st/{print $0}'
test

~$ echo "tast" | awk '/t[ae]?st/{print $0}'
tast

~$ echo "taest" | awk '/t[ae]?st/{print $0}'

~$ echo "teest" | awk '/t[ae]?st/{print $0}'

If there are no characters from the class in the string, or one of them occurs once, the regular expression matches. However, as soon as two such characters appear in the word, the system no longer finds a match for the pattern in the text.

➕ Plus symbol

The plus symbol in the pattern indicates that the regular expression will match if the preceding character occurs one or more times in the text. At the same time, such a construction will not react to the absence of the character:

~$ echo "test" | awk '/te+st/{print $0}'
test

~$ echo "teest" | awk '/te+st/{print $0}'
teest

~$ echo "tst" | awk '/te+st/{print $0}'

In this example, if there is no “e” in a word, the regular expression engine will not find a match in the text. The plus symbol also works with character classes, just like the asterisk and the question mark:

~$ echo "tst" | awk '/t[ae]+st/{print $0}'

~$ echo "test" | awk '/t[ae]+st/{print $0}'
test

~$ echo "teast" | awk '/t[ae]+st/{print $0}'
teast

~$ echo "teeast" | awk '/t[ae]+st/{print $0}'
teeast

In this case, if the string contains one or more characters from the class at that position, the text will be considered to match the pattern.

Curly brackets

The curly braces that you can use in ERE patterns are similar to the characters discussed above, but they allow you to more accurately specify the required number of occurrences of the preceding character. The limitation can be specified in two formats:

n — a number that specifies the exact number of occurrences to search for
n, m — two numbers, which are interpreted as follows: "at least n times, but not more than m".

Here are examples of the first option:

~$ echo "tst" | awk '/te{1}st/{print $0}'

~$ echo "test" | awk '/te{1}st/{print $0}'
test

In older versions of gawk, you had to use the --re-interval command-line switch for the program to recognize intervals in regular expressions, but newer versions do not need it.

Here are examples of the second option:

~$ echo "tst" | awk '/te{1,2}st/{print $0}'

~$ echo "test" | awk '/te{1,2}st/{print $0}'
test

~$ echo "teest" | awk '/te{1,2}st/{print $0}'
teest

~$ echo "teeest" | awk '/te{1,2}st/{print $0}'

Note: In this example, the character “e” must appear in the line 1 or 2 times, then the regular expression will respond to the text.

Curly braces can also be used with character classes. The principles you already know apply here:

~$ echo "tst" | awk  '/t[ae]{1,2}st/{print $0}'

~$ echo "test" | awk  '/t[ae]{1,2}st/{print $0}'
test

~$ echo "teest" | awk  '/t[ae]{1,2}st/{print $0}'
teest

~$ echo "teeast" | awk  '/t[ae]{1,2}st/{print $0}'

The template will react to text if it contains the character “a” or the character “e” once or twice.
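
Intervals are convenient for checking fixed-width fields. As a sketch, the hypothetical helper below uses grep -E (which also speaks ERE) to test whether a string has the shape YYYY-MM-DD. It checks only the number of digits, not whether the date exists.

```shell
# Hypothetical helper: exit status 0 if $1 looks like YYYY-MM-DD, non-zero otherwise.
is_iso_date() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
}

for value in 2024-01-15 24-1-15 2024-01-150; do
    if is_iso_date "$value"; then
        echo "$value: looks like a date"
    else
        echo "$value: does not look like a date"
    fi
done
```

Only the first value passes: the second has too few digits in each field, and the third has one digit too many at the end.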

Boolean “or” character

The vertical bar symbol (|) means a logical "or" in regular expressions. When processing a regular expression containing several fragments, separated by such a sign, the engine will consider the parsed text to be appropriate if it matches any of the fragments. Here's an example:

~$ echo "This is a test" | awk '/test|exam/{print $0}'
This is a test

~$ echo "This is an exam" | awk '/test|exam/{print $0}'
This is an exam

~$ echo "This is something else" | awk '/test|exam/{print $0}'

In this example, the regular expression is configured to search the text for the words “test” or “exam”. Please note that there should be no spaces between the pattern fragments and the | character separating them, because a space would become part of the pattern.

Grouping Regular Expression Fragments

Regular expression fragments can be grouped using parentheses. If you group a certain sequence of characters, the system will treat it as a single unit, just like an ordinary character. That is, for example, you can apply repetition metacharacters to it. This is how it looks:

~$ echo "John" | awk '/John(Doe)?/{print $0}'
John

~$ echo "JohnDoe" | awk '/John(Doe)?/{print $0}'
JohnDoe

In these examples, the word “Doe” is enclosed in parentheses, followed by a question mark. Recall that a question mark means “0 or 1 repetition”; as a result, the regular expression will respond to both the string “John” and the string “JohnDoe”.
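
Grouping also matters when the "or" character is combined with anchors, because without parentheses each anchor belongs only to the fragment next to it. A sketch with made-up input:

```shell
echo "testing" | awk '/^test|exam$/{print $0}'     # prints testing: read as "^test" or "exam$"
echo "testing" | awk '/^(test|exam)$/{print $0}'   # prints nothing: the whole line must be test or exam
echo "exam"    | awk '/^(test|exam)$/{print $0}'   # prints exam
```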

Practical examples

Now that we’ve covered the basics of regular expressions, it’s time to do something useful with them.

1. Counting the number of files

Let’s write a bash script that counts the files in the directories listed in the PATH environment variable. To do this, you will first need to generate a list of directory paths. Let's do it with sed, replacing colons with spaces:

~$ echo $PATH | sed 's/:/ /g'

The substitute command supports regular expressions as patterns for searching text. In this case everything is extremely simple: we are looking for a colon character, but nothing stops you from using something more elaborate here; it all depends on the specific task. Now you need to go through the resulting list in a loop and perform the actions necessary to count the number of files there. The general scheme of the script will be as follows:

mypath=$(echo "$PATH" | sed 's/:/ /g')
for directory in $mypath
do
    # actions to count the files in $directory go here
    :
done

Now let’s write the full text of the script, using the ls command to get information about the number of files in each of the directories:
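
The script itself is not reproduced at this point, so here is a minimal sketch of what it could look like. The function name count_path_files, the -d check for missing directories, and the use of wc -l are assumptions of this sketch, not the author's code; only the "directory - count" output format is taken from the sample run.

```shell
#!/bin/bash
# Hypothetical helper: print "directory - count" for every directory
# in a colon-separated, PATH-like string passed as $1.
count_path_files() {
    # Replace colons with spaces, as in the text, to get a list of directories.
    # This simple split breaks on directory names that contain spaces.
    mypath=$(echo "$1" | sed 's/:/ /g')
    for directory in $mypath
    do
        if [ -d "$directory" ]; then
            # ls prints one entry per line when its output goes to a pipe;
            # tr strips the padding that some wc implementations add.
            count=$(ls "$directory" | wc -l | tr -d ' ')
        else
            # Directories that do not exist are reported with a count of 0.
            count=0
        fi
        echo "$directory - $count"
    done
}

count_path_files "$PATH"
```

Wrapping the loop in a function is a design choice made here so that the same code can be tried on any colon-separated list, not only on PATH.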

When you run the script, it may turn out that some of the directories in PATH do not exist; however, this does not prevent it from counting the files in the existing directories.

~$ ./script.sh
/home/mraevsky/bin - 0
/home/mraevsky/.local/bin - 0
/usr/local/sbin - 0
/usr/local/bin - 8
/usr/sbin - 249
/usr/bin - 2001
/sbin - 295
/bin - 176

The main value of this example is that much more complex problems can be solved using the same approach. Which one exactly depends on your needs.

2. Checking email addresses

There are websites with huge collections of regular expressions that allow you to validate email addresses, phone numbers, and so on. However, it’s one thing to take a ready-made one, and quite another to create something yourself. Therefore, let’s write a regular expression to validate email addresses. Let’s start by analyzing the initial data. For example, here is a certain address:

username@hostname.com

The username part, username, can consist of alphanumeric and some other characters, namely a period, a dash, an underscore, and a plus sign. The username is followed by an @ sign.

Armed with this knowledge, let's start assembling the regular expression from its left side, which serves to validate the username. Here's what we got:

^([a-zA-Z0-9_\-\.\+]+)@

This regular expression can be read like this: “At the beginning of a line, there must be at least one character from the set specified in square brackets, followed by an @ sign.”

Now it is the turn of the hostname, hostname. The same rules apply as for the username, except that the plus sign is not allowed, so the pattern for it will look like this:

([a-zA-Z0-9_\-\.]+)

The top-level domain name is subject to special rules. It can contain only alphabetic characters, of which there must be at least two (for example, such domains often contain a country code) and no more than five. All this means that the pattern for checking the last part of the address will be like this:

\.([a-zA-Z]{2,5})$

You can read it like this: “First there must be a period, then from 2 to 5 alphabetic characters, and after that, the line ends”.

Having prepared templates for separate parts of the regular expression, let’s put them together:

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Now it remains only to test what happened:

~$ echo "name@host.com" | awk '/^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$/{print $0}'
name@host.com

~$ echo "name@host.com.us" | awk '/^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$/{print $0}'
name@host.com.us

The fact that the text supplied to awk is displayed on the screen means that the system recognized the email address in it.
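
Inside a script, the same check is easier to reuse when it is wrapped in a function. The sketch below is one possible way to do it, not the author's code: the name is_valid_email is hypothetical, grep -E stands in for awk, and the bracket expressions are written the POSIX way (no backslashes, with the dash last) while being intended to accept the same addresses.

```shell
# Hypothetical helper: exit status 0 if $1 looks like an email address.
# Pattern parts: username, "@", hostname, ".", top-level domain of 2 to 5 letters.
is_valid_email() {
    printf '%s\n' "$1" |
        grep -Eq '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'
}

for address in name@host.com name@host.com.us name@host not-an-email; do
    if is_valid_email "$address"; then
        echo "$address: valid"
    else
        echo "$address: invalid"
    fi
done
```

The first two addresses pass and the last two do not. Keep in mind that the {2,5} limit also rejects real top-level domains that are longer than five letters.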

Summary

If the regular expression for validating email addresses that you met at the very beginning of the article seemed completely incomprehensible then, we hope that now it no longer looks like a meaningless set of characters. If this is true, then this material has fulfilled its purpose. In fact, regular expressions are a topic you can keep studying all your life, but even the little that we have discussed can already help you write fairly advanced scripts that process text.

In this series of articles, we usually showed very simple examples of bash scripts that consisted of literally a few lines. Next time, let’s look at something bigger.
