Bash Scripts — Part 9 — Regular Expressions
A gentle introduction to regular expressions in Bash
In order to fully process texts in bash scripts using sed and awk, you need to understand regular expressions. Although implementations of this most useful tool can be found literally everywhere, all regular expressions are arranged in a similar way and based on the same ideas. However, working with them has certain peculiarities in different environments. Here we will talk about regular expressions that are suitable for use in Linux command line scripts. This material is intended to be an introduction to regular expressions for those who may not know at all what they are. So let’s start from the very beginning.

What regular expressions are
Many people, when they first see regular expressions, immediately think that they are in front of a meaningless jumble of characters. But this, of course, is far from the case. Take a look at this regex for example:
^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
In the opinion of some advanced programmers, even an absolute beginner will immediately understand how it works and why you need it 🙂
Well, if you do not quite understand what is going on here, just as I did not at first, keep on reading and everything will fall into place.
A regular expression is a pattern that programs like sed or awk use to filter text. Patterns consist of ordinary ASCII characters, which stand for themselves, and so-called metacharacters, which play a special role, for example, allowing you to refer to certain groups of characters.
Types of regular expressions
Implementations of regular expressions can be found in various environments: in programming languages such as Java, Perl, and Python, and in Linux tools such as sed, awk, grep, and many others. However, each implementation has certain quirks depending on the environment. These quirks depend on the so-called regex engines, which interpret the patterns.
Linux has two regular expression engines:
- An engine that supports the POSIX Basic Regular Expression (BRE) standard.
- An engine that supports the POSIX Extended Regular Expression (ERE) standard.
Most Linux utilities conform at least to the POSIX BRE standard, and some of them, sed included, use BRE syntax by default. One of the reasons for keeping to this simpler syntax is the desire to make such utilities as fast as possible at text processing.
The POSIX ERE standard is often implemented in programming languages. It allows you to use a lot of tools when developing regular expressions. For example, these can be special character sequences for frequently used patterns, such as searching for individual words or sets of numbers in the text. Awk supports the ERE standard.
There are many ways to develop regular expressions, depending both on the preferences of the programmer and on the features of the engine for which they are created. It is not easy to write generic regular expressions that any engine can understand. Therefore, we will focus on the most commonly used regular expressions and look at how they are implemented for sed and awk.
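The difference between the two standards is easiest to see with a metacharacter that exists in only one of them. The sketch below uses grep, which works in BRE mode by default and switches to ERE with the -E option. The helper name matches and the sample strings are purely illustrative. In BRE a plus sign is an ordinary character, while in ERE it means "one or more of the preceding character".

```shell
#!/bin/bash
# Hypothetical helper: prints "yes" if the text matches the pattern, else "no".
# Any extra arguments (such as -E) are passed to grep as options.
matches() {
    local text=$1 pattern=$2
    shift 2
    if printf '%s\n' "$text" | grep -q "$@" -e "$pattern"; then
        echo yes
    else
        echo no
    fi
}

matches 'a+b' 'a+b'       # BRE: "+" is a literal plus sign, so this prints yes
matches 'aab' 'a+b'       # BRE: "aab" has no literal "+", so this prints no
matches 'a+b' 'a+b' -E    # ERE: an "a" must directly precede "b", so this prints no
matches 'aab' 'a+b' -E    # ERE: "a+" matches "aa", so this prints yes
```

The same pattern text therefore gives opposite answers depending on the engine, which is why it matters which standard a given tool follows.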
POSIX BRE Regular Expressions
Perhaps the simplest BRE pattern is a regular expression that finds the exact occurrence of a sequence of characters in a text. Here is how such a search looks in sed and in awk:
~$ echo "This is a test" | sed -n '/test/p'
This is a test
~$ echo "This is a test" | awk '/test/{print $0}'
This is a test
You will notice that the search for a given pattern is performed without regard to the exact location of the text in the line. In addition, the number of occurrences does not matter. As soon as the regular expression finds the given text anywhere in the line, the line is considered a match and is passed on for further processing.
When working with regular expressions, keep in mind that they are case-sensitive:
~$ echo "This is a test" | awk '/Test/{print $0}'
~$ echo "This is a test" | awk '/test/{print $0}'
This is a test
The first regular expression did not match, since the word “Test” starting with a capital letter does not occur in the text. The second, written in lower case like the word in the text, found a matching line in the stream.
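If you do need a match regardless of case, one common approach (not part of the examples above) is to normalize the line before testing it, or to ask the tool to ignore case. A minimal sketch, assuming awk's standard tolower() function and grep's -i option:

```shell
# Convert each line to lower case before testing it against a lower-case pattern
echo "This is a Test" | awk 'tolower($0) ~ /test/ {print $0}'

# Ask grep to ignore case altogether
echo "This is a Test" | grep -i 'test'
```

Both commands print the line, even though the plain /test/ pattern would not match the capitalized word.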
In regular expressions, you can use not only letters but also spaces and numbers:
~$ echo "This is a test 2 again" | awk '/test 2/{print $0}'
This is a test 2 again
Spaces are treated by the regular expression engine as regular characters.
Special symbols
There are a few things to keep in mind when using different characters in regular expressions. Some special characters, or metacharacters, need special treatment when used in a pattern. Here they are:
.*[]^${}\+?|()
If you need one of them as an ordinary character in a pattern, escape it with a backslash: \.
For example, if you need to find a dollar sign in the text, include it in the pattern preceded by the escape character. Let's say you have a file myfile with this text:
There is 10$ on my pocket
The dollar sign can be detected using a pattern like this:
~$ cat myfile
There is 10$ on my pocket
~$ awk '/\$/{print $0}' myfile
There is 10$ on my pocket
The backslash is itself a special character, so if you want to use it in a pattern, you need to escape it too. The result looks like two backslashes:
~$ echo "\ is a special character" | awk '/\\/{print $0}'
\ is a special character
Although the forward slash is not included in the above list of special characters, sed and awk use it to mark the beginning and end of a pattern, so an unescaped slash inside the pattern results in an error:
~$ echo "3 / 2" | awk '///{print $0}'
awk: cmd. line:1: ///{print $0}
awk: cmd. line:1: ^ syntax error
If it is needed, it must also be escaped:
~$ echo "3 / 2" | awk '/\//{print $0}'
3 / 2
Anchor symbols
There are two special characters that anchor a pattern to the beginning or end of a line of text. The caret character, ^, lets you describe sequences of characters that are at the beginning of a line. If the pattern you are looking for is found elsewhere in the line, the regular expression will not respond to it. The use of this symbol looks like this:
~$ echo "welcome to mraevsky website" | awk '/^mraevsky/{print $0}'
~$ echo "mraevsky website" | awk '/^mraevsky/{print $0}'
mraevsky website
The ^ symbol is designed to search for a pattern at the beginning of a line, and case is taken into account as well. Let's see how this affects the processing of a text file:
~$ cat myfile
this is a test
This is another test
And this is one more
~$ awk '/^this/{print $0}' myfile
this is a test
When using sed, if you place a caret anywhere other than at the beginning of the pattern, it is treated like any other ordinary character:
~$ echo "This ^ is a test" | sed -n '/s ^/p'
This ^ is a test
We figured out how to find pieces of text at the beginning of a line. What if you want to find something at the end of a line?
The dollar sign, $, which is the anchor character for the end of the line, will help us with this:
~$ echo "This is a test" | awk '/test$/{print $0}'
This is a test
Both anchor characters can be used in the same pattern. Let's process the file myfile, whose contents are shown below, using the following regular expression:
~$ cat myfile
this is a test
This is another test
And this is one more
~$ awk '/^this is a test$/{print $0}' myfile
this is a test
As you can see, the pattern reacted only to a line that fully corresponds to the specified sequence of characters and their location.
Here's how to filter out empty lines using anchor characters:
~$ awk '!/^$/{print $0}' myfile
In this command I used the negation character, an exclamation mark: !. The pattern ^$ matches lines that contain nothing between the beginning and the end of the line, and the exclamation mark makes awk print only the lines that do not match it.
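Note that ^$ only catches lines that are completely empty. Lines that contain nothing but spaces or tabs still get through. The sketch below shows one way to drop those as well, assuming sed together with the [[:space:]] class and the asterisk, both of which are described later in this article. The sample input is made up.

```shell
# Sample input: a text line, an empty line, a line of spaces only, a text line
sample=$(printf 'first\n\n   \nlast')

# Only truly empty lines are removed; the spaces-only line survives
printf '%s\n' "$sample" | awk '!/^$/{print $0}'

# Empty lines and whitespace-only lines are both removed
printf '%s\n' "$sample" | sed '/^[[:space:]]*$/d'
```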
Dot symbol
A period matches any single character except for a line feed. Let's apply such a regular expression to the file myfile, the contents of which are given below:
~$ cat myfile
this is a test
This is another test
And this is one more
start with this
~$ awk '/.st/{print $0}' myfile
this is a test
This is another test
As you can see from the output, only the first two lines of the file match the pattern, since they contain the character sequence “st” preceded by one more character. The third line does not contain a suitable sequence, and in the fourth line it is present but sits at the very beginning of the line, with nothing before it.
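Because the dot matches any character, a pattern meant to find a literal period needs the escaping described earlier. A small sketch with made-up file names, using grep -c to count the matching lines:

```shell
names=$(printf 'report.txt\nreport_txt\nreporttxt')

# Unescaped dot: exactly one character of any kind between "report" and "txt"
printf '%s\n' "$names" | grep -c 'report.txt'     # counts 2 lines

# Escaped dot: only a real period is accepted
printf '%s\n' "$names" | grep -c 'report\.txt'    # counts 1 line
```

The third name is not matched in either case, because the dot always stands for exactly one character and never for "nothing".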
Character classes
The period matches any single character, but what if you need more flexibility to limit the set of characters you are looking for? In such a situation, you can use character classes.
Thanks to this approach, you can organize a search for any character from a given set. Square brackets are used to describe a character class — []:
~$ cat myfile
this is a test
This is another test
And this is one more
start with this
~$ awk '/[oi]th/{print $0}' myfile
This is another test
start with this
Here we are looking for the “th” character sequence preceded by the “o” or “i” character.
Classes come in handy when looking for words that can start with both uppercase and lowercase letters:
~$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'
this is a test
~$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'
This is a test
Character classes are not limited to letters. Other symbols can be used here as well. It is impossible to say in advance in what situation classes will be needed; everything depends on the problem being solved.
The negation of character classes
Character classes can also be used to solve the opposite of the problem described above. Namely, instead of searching for characters included in the class, you can search for everything that is not included in the class. To get this behavior, place a ^ sign in front of the list of class characters. It looks like this:
~$ cat myfile
this is a test
This is another test
And this is one more
start with this
~$ awk '/[^oi]th/{print $0}' myfile
And this is one more
start with this
In this case, the pattern finds sequences of characters “th” that are preceded by a character that is neither “o” nor “i”.
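Note that the caret now has two different meanings: outside the brackets it is the start-of-line anchor, while as the first character inside the brackets it negates the class. Both can appear in one pattern. The sketch below, with an invented configuration snippet as input, prints only the lines whose first character is not #:

```shell
# The first ^ anchors the match to the start of the line,
# the second ^ (inside the brackets) means "any character except #"
printf '# a comment\nport=8080\n# another comment\nhost=example\n' |
    awk '/^[^#]/{print $0}'
```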
Ranges of characters
In character classes, you can describe ranges of characters using a dash:
~$ cat myfile
this is a test
This is another test
And this is one more
start with this
~$ awk '/[e-p]st/{print $0}' myfile
this is a test
This is another test
In this example, the regular expression responds to the character sequence “st” preceded by any character alphabetically located between the characters “e” and “p”.
Ranges can also be created from numbers:
~$ echo "123" | awk '/[0-9][0-9][0-9]/'
123
~$ echo "12a" | awk '/[0-9][0-9][0-9]/'
A character class can contain several ranges:
~$ cat myfile
this is a test
This is another test
And this is one more
start with this
~$ awk '/[a-fm-z]st/{print $0}' myfile
this is a test
This is another test
This regex matches every line in which “st” is preceded by a character from the ranges a-f and m-z.
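One caveat about the three-digit pattern shown above: without anchors it only requires three consecutive digits somewhere in the line, so longer numbers match too. A sketch that combines ranges with the anchor characters to accept exactly three digits follows. The helper name is hypothetical, and grep is used here simply because its exit status is convenient to test.

```shell
# Hypothetical helper: reports whether the argument consists of exactly three digits
is_three_digits() {
    if printf '%s\n' "$1" | grep -q '^[0-9][0-9][0-9]$'; then
        echo yes
    else
        echo no
    fi
}

is_three_digits 123     # yes
is_three_digits 1234    # no: four digits do not fit between ^ and $
is_three_digits 12a     # no: "a" is not in the range 0-9
```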
Special character classes
BRE has special character classes that you can use when writing regular expressions:
- [[:alpha:]] - matches any alphabetic character written in upper or lower case.
- [[:alnum:]] - matches any alphanumeric character, namely the characters in the ranges 0-9, A-Z, a-z.
- [[:blank:]] - matches a space and a tab character.
- [[:digit:]] - matches any numeric character from 0 to 9.
- [[:upper:]] - matches uppercase alphabetic characters, A-Z.
- [[:lower:]] - matches lowercase alphabetic characters, a-z.
- [[:print:]] - matches any printable character.
- [[:punct:]] - matches punctuation marks.
- [[:space:]] - matches whitespace, in particular a space, a tab character, and the characters NL, FF, VT, CR.
You can use special classes in patterns like this:
~$ echo "abc" | awk '/[[:alpha:]]/{print $0}'
abc
~$ echo "abc" | awk '/[[:digit:]]/{print $0}'
~$ echo "abc123" | awk '/[[:digit:]]/{print $0}'
abc123
Star symbol
If you place an asterisk after a character in the pattern, it means that the regular expression will work if the character appears in the string any number of times — including the situation when the character is absent in the string.
~$ echo "test" | awk '/tes*t/{print $0}'
test
~$ echo "tessst" | awk '/tes*t/{print $0}'
tessst
The asterisk is usually used to work with words that often contain typos, or with words that can be spelled differently:
~$ echo "I like green color" | awk '/colou*r/{print $0}'
I like green color
~$ echo "I like green colour " | awk '/colou*r/{print $0}'
I like green colour
In this example, the same regexp reacts to both the word “color” and the word “colour”. This is because the character “u”, which is followed by an asterisk, can either be absent or appear several times in a row.
Another useful feature derived from the characteristics of the asterisk symbol is to combine it with a dot. This combination allows the regular expression to respond to any number of any characters:
~$ cat myfile
this is a test
This is another test
And this is one more
start with this
~$ awk '/this.*test/{print $0}' myfile
this is a test
The asterisk can also be applied to a character class. A pattern such as t[ae]*st matches whether any number of “a” or “e” characters are found at that position or none can be found at all.
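As a short sketch of the asterisk applied to a character class, the loop below runs a handful of made-up words through the pattern t[ae]*st. The word list is an assumption chosen for illustration only.

```shell
# Every word except "tost" has only "a"/"e" characters (or nothing)
# between the leading "t" and the final "st"
for word in tst test taest teeaast tost; do
    echo "$word" | awk '/t[ae]*st/{print $0}'
done
```

The first four words are printed; "tost" is not, because "o" is not a member of the class.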
POSIX ERE Regular Expressions
The POSIX ERE patterns that some Linux utilities support may contain additional characters. As already stated, awk supports this standard; sed uses BRE by default, and GNU sed switches to ERE only when it is started with the -E option.
Here we will look at the most commonly used symbols in ERE patterns, which will come in handy when creating your own regular expressions.
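In practice, GNU and BSD versions of sed accept ERE syntax when started with the -E option (GNU sed also accepts -r as a synonym). A minimal sketch comparing the default mode with -E, using the plus sign that is described below:

```shell
# Default (BRE) mode: "+" is taken literally, so nothing is printed
echo "teest" | sed -n '/te+st/p'

# ERE mode: "+" means "one or more of the preceding character"
echo "teest" | sed -n -E '/te+st/p'
```

Only the second command prints the line.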
❔ Question mark
The question mark indicates that the preceding character may appear once in the text or not at all. This character is one of the repetition metacharacters. Here are some examples:
~$ echo "tet" | awk '/tes?t/{print $0}'
tet
~$ echo "test" | awk '/tes?t/{print $0}'
test
~$ echo "tesst" | awk '/tes?t/{print $0}'
As you can see, in the third case the letter “s” occurs twice, so the regular expression does not respond to the word “tesst”.
The question mark can be used with character classes as well:
~$ echo "tst" | awk '/t[ae]?st/{print $0}'
tst
~$ echo "test" | awk '/t[ae]?st/{print $0}'
test
~$ echo "tast" | awk '/t[ae]?st/{print $0}'
tast
~$ echo "taest" | awk '/t[ae]?st/{print $0}'
~$ echo "teest" | awk '/t[ae]?st/{print $0}'
If there are no characters from the class in that position, or one of them occurs once, the regular expression matches. However, as soon as two such characters appear in the word, the engine no longer finds a match for the pattern in the text.
➕ Plus symbol
The plus symbol in a pattern indicates that the regular expression matches if the preceding character occurs one or more times in the text. Unlike the asterisk, this construction does not react to the absence of the character:
~$ echo "test" | awk '/te+st/{print $0}'
test
~$ echo "teest" | awk '/te+st/{print $0}'
teest
~$ echo "tst" | awk '/te+st/{print $0}'
In this example, if there is no “e” in a word, the regular expression engine will not find a match in the text. The plus symbol also works with character classes, just like the asterisk and the question mark:
~$ echo "tst" | awk '/t[ae]+st/{print $0}'
~$ echo "test" | awk '/t[ae]+st/{print $0}'
test
~$ echo "teast" | awk '/t[ae]+st/{print $0}'
teast
~$ echo "teeast" | awk '/t[ae]+st/{print $0}'
teeast
In this case, the text matches the pattern if one or more characters from the class appear between “t” and “st”.
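A typical practical use of the plus symbol is collapsing runs of a repeated character. This sketch assumes sed started with the -E option, so that ERE syntax is available, and squeezes every run of spaces into a single space:

```shell
# " +" matches one or more consecutive spaces; each run is replaced by one space
echo "too    many   spaces" | sed -E 's/ +/ /g'
```

With an asterisk instead of the plus, the pattern would also match the empty string between every pair of characters, which is rarely what is wanted in a substitution.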
Curly brackets
The curly braces that you can use in ERE patterns are similar to the characters discussed above, but they allow you to more accurately specify the required number of occurrences of the preceding character. The limitation can be specified in two formats:
- n: a number that specifies the exact number of occurrences to search for.
- n, m: two numbers, which are interpreted as follows: "at least n times, but not more than m".
Here are examples of the first option:
~$ echo "tst" | awk '/te{1}st/{print $0}'
~$ echo "test" | awk '/te{1}st/{print $0}'
test
In older versions of awk, you had to use the --re-interval command-line switch for the program to recognize intervals in regular expressions, but newer versions do not need it.
Here are examples of the second option:
~$ echo "tst" | awk '/te{1,2}st/{print $0}'
~$ echo "test" | awk '/te{1,2}st/{print $0}'
test
~$ echo "teest" | awk '/te{1,2}st/{print $0}'
teest
~$ echo "teeest" | awk '/te{1,2}st/{print $0}'
Note: In this example, the character “e” must appear in the line 1 or 2 times, then the regular expression will respond to the text.
Curly braces can also be used with character classes. The principles you already know apply here:
~$ echo "tst" | awk '/t[ae]{1,2}st/{print $0}'
~$ echo "test" | awk '/t[ae]{1,2}st/{print $0}'
test
~$ echo "teest" | awk '/t[ae]{1,2}st/{print $0}'
teest
~$ echo "teeast" | awk '/t[ae]{1,2}st/{print $0}'
The template will react to text if it contains the character “a” or the character “e” once or twice.
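Intervals are handy for checking the shape of simple fields. The sketch below tests whether a string looks like a time such as 9:05 or 12:30, meaning one or two digits, a colon, and exactly two digits. It uses grep -E rather than awk because, as noted above, not every awk recognizes intervals without extra switches. The helper name is hypothetical, and the pattern checks only the shape, not whether the numbers form a valid time.

```shell
# Hypothetical helper: does the argument look like H:MM or HH:MM?
looks_like_time() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]{1,2}:[0-9]{2}$'; then
        echo yes
    else
        echo no
    fi
}

looks_like_time 9:05      # yes
looks_like_time 12:30     # yes
looks_like_time 123:30    # no: three digits before the colon
looks_like_time 12:3      # no: only one digit after the colon
```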
Boolean “or” character
The vertical bar symbol, |, means a logical "or" in regular expressions. When processing a regular expression that contains several fragments separated by this sign, the engine considers the text a match if it matches any of the fragments. Here's an example:
~$ echo "This is a test" | awk '/test|exam/{print $0}'
This is a test
~$ echo "This is an exam" | awk '/test|exam/{print $0}'
This is an exam
~$ echo "This is something else" | awk '/test|exam/{print $0}'
In this example, the regular expression is configured to search the text for the words “test” or “exam”. Please note that there should be no spaces between the pattern fragments and the | character separating them, because a space would be treated as part of the fragment.
Grouping Regular Expression Fragments
Regular expression fragments can be grouped using parentheses. If you group a certain sequence of characters, the engine treats it as a single unit, just like an ordinary character. That means, for example, that you can apply repetition metacharacters to it. This is how it looks:
~$ echo "John" | awk '/John(Doe)?/{print $0}'
John
~$ echo "JohnDoe" | awk '/John(Doe)?/{print $0}'
JohnDoe
In these examples, the word “Doe” is enclosed in parentheses, followed by a question mark. Recall that a question mark means “0 or 1 repetition”; as a result, the regular expression responds to both the string “John” and the string “JohnDoe”.
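Grouping also limits the scope of the "or" character. Without parentheses, | splits the whole pattern in two; with them, only the grouped part alternates. A sketch with invented sample words:

```shell
# Whole-pattern alternation: lines containing "gra" OR "ey" (all four words)
printf 'gray\ngrey\nkey\ngrab\n' | awk '/gra|ey/{print $0}'

echo "---"

# Grouped alternation: "gr", then "a" or "e", then "y" (gray and grey only)
printf 'gray\ngrey\nkey\ngrab\n' | awk '/gr(a|e)y/{print $0}'
```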
Practical examples
Now that we’ve covered the basics of regular expressions, it’s time to do something useful with them.
1. Counting the number of files
Let’s write a bash script that counts the files in the directories that are written to the environment variable PATH. In order to do this, you will first need to generate a list of paths to directories. Let's do it with sed, replacing colons with spaces:
~$ echo $PATH | sed 's/:/ /g'
The substitution command supports regular expressions as patterns for searching text. In this case, everything is extremely simple: we are looking for a colon character, but nothing prevents you from using something more complex here; it all depends on the specific task. Now you need to go through the resulting list in a loop and perform the actions necessary to count the number of files there. The general scheme of the script will be as follows:
mypath=$(echo $PATH | sed 's/:/ /g')
for directory in $mypath
do
    : # placeholder: the commands that count files in $directory go here
done
Now let's write the full text of the script, using the ls command to get information about the number of files in each of the directories:
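The original listing is not reproduced here, so what follows is only a minimal sketch of how such a script might look, not the author's exact code. The function name count_files and the variable names are assumptions. The sketch also assumes that no directory in PATH contains spaces, because the list is split on whitespace.

```shell
#!/bin/bash
# Hypothetical helper: for each directory in a colon-separated list,
# print the directory name and the number of entries that ls reports in it.
count_files() {
    local mypath directory count
    # Replace colons with spaces so that the for loop can split the list
    mypath=$(echo "$1" | sed 's/:/ /g')
    for directory in $mypath
    do
        # Skip PATH entries that do not exist or are not directories
        if [ -d "$directory" ]; then
            # ls prints one entry per line when its output goes to a pipe
            count=$(ls "$directory" | wc -l)
            echo "$directory - $((count))"
        fi
    done
}

count_files "$PATH"
```

Wrapping the loop in a function that takes the list as an argument makes it easy to try the logic on a small test directory before running it against the real PATH.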






