Bash Scripts — Part 8 — Data Processing With AWK

Last time we talked about the stream editor sed and looked at many examples of text processing with its help. Sed is capable of solving many problems, but it also has limitations. Sometimes you need a better tool for processing data, something like a programming language. Awk is exactly such a tool.

The awk utility, or more accurately GNU awk, takes the processing of data streams to a higher level than sed. With awk we get a full programming language instead of a fairly modest set of commands issued to an editor. Using the awk programming language, you can do the following:

Declare variables for storing data.
Use arithmetic and string operators to work with data.
Use the structural elements and control constructs of the language, such as if-then statements and loops, to enable complex data processing algorithms.
Create formatted reports.

If we talk only about the ability to create formatted reports that are easy to read and analyze, then this turns out to be very useful when working with log files that can contain millions of records. But awk is much more than a reporting tool.

Features of calling awk

The awk calling scheme looks like this:

$ awk options program file

Awk treats the incoming data as a set of records, and each record is a collection of fields. In the simplest case, with the default settings and ordinary text whose lines end in line feed characters, a record is a line and a field is a word in that line.

Let’s take a look at the most commonly used awk command-line switches:

-F fs: specifies the separator character for fields in a record.
-f file: specifies the name of the file from which to read the awk script.
-v var=value: declares a variable and sets the initial value that awk will use for it.
-mf N: sets the maximum number of fields to process in the data file.
-mr N: sets the maximum record size in the data file.
-W keyword: sets the compatibility mode or warning level for awk.

The real power of awk lies in the part of the invocation marked above as program. This is the awk script, written by the programmer, that reads the data, processes it, and outputs the results.

Reading awk scripts from the command line

Awk scripts written directly on the command line are blocks of commands enclosed in curly braces. Because awk expects the script to be a single text string, you must also enclose it in single quotes:

$ awk '{print "Welcome to awk command tutorial"}'

Let’s run this command … And nothing will happen. The point is that when we called awk, we did not specify a data file. In a situation like this, awk waits for data to arrive from STDIN. Therefore, executing such a command does not lead to immediately observable effects, but this does not mean that awk is not working — it is waiting for input from STDIN.

Now, if you type something into the console and press Enter, awk will process the input using the script specified when it was run. Awk processes text from the input stream line by line; in this it is similar to sed. In our case, awk does nothing with the data: in response to each new line it receives, it only displays the text specified in the print command.

raevskym@DESKTOP-JNF3L6H:~$ awk '{print "Welcome to awk command tutorial"}'

Welcome to awk command tutorial

Welcome to awk command tutorial

Whatever we type, the result in this case will be the same: the text is printed. In order to terminate awk, you need to pass it the end-of-file character (EOF). This can be done using the keyboard shortcut CTRL + D.

It's no surprise if this first example wasn't particularly impressive. The most interesting part is still ahead.

Positional variables storing field data

One of the main features of awk is the ability to manipulate data in text files. It does this by automatically assigning a variable to each field in a line. By default awk assigns the following variables to the data fields it finds in a record:

$0 — represents an entire line of text (record).
$1 — first field.
$2 — second field.
$n — nth field.

Fields are separated from one another by a separator character. By default, this is whitespace, that is, spaces or tab characters.

Let's look at the use of these variables with a simple example. Namely, we will process a file that contains several lines (its contents are shown below) using the following command:
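As a small illustration of the default separator, here is a sketch that feeds awk a made-up line padded with extra spaces and a tab. The sample text is an assumption chosen only for demonstration. With the default settings, a run of spaces and tabs counts as one separator and leading whitespace is ignored, so the second field is still the second word.

```shell
# Made-up input: leading spaces, a tab, and repeated spaces between words.
out=$(printf '  alpha \t beta   gamma\n' | awk '{print $2}')
echo "$out"
```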

$ awk '{print $1}' myfile

So let’s try it:

raevskym@DESKTOP-JNF3L6H:~$ cat myfile

This is a test.
This is the second test.
This is the third test.
This is the fourth test.

raevskym@DESKTOP-JNF3L6H:~$ awk '{print $1}' myfile

This
This
This
This

The variable $1 is used here; it lets you access and display the first field of each line.

Sometimes files use something other than spaces or tabs as field separators. Above we mentioned the awk switch -F, which allows you to set the separator required to process a particular file:

raevskym@DESKTOP-JNF3L6H:~$ awk -F: '{print $1}' /etc/passwd

root
daemon
bin
sys
sync
man
lp
mail
news
www-data

This command prints the first field of each line in the file /etc/passwd. Since this file uses colons as delimiters, the colon is the character passed to awk after the -F switch.

Using multiple commands

Calling awk with a single processing command is a very limited approach. Awk allows you to process data using scripts made up of several commands. To pass awk several commands when invoking it from the console, separate them with semicolons:

$ echo "My name is Adam" | awk '{$4="Michael"; print $0}'

My name is Michael

In this example, the first command writes a new value to the variable $4, and the second displays the entire line.
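One side effect of assigning to a field is worth seeing in action. The sketch below reuses the command above on an input line with extra spaces between the words (the padded input is an assumption added for illustration). When a field is changed, awk rebuilds $0 by joining the fields with the output field separator, which is a single space by default, so the original spacing is not preserved.

```shell
# Assigning to $4 makes awk rebuild $0, joining the fields with
# the output field separator (a single space by default).
out=$(echo "My   name   is   Adam" | awk '{$4="Michael"; print $0}')
echo "$out"
```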

Reading awk script from file

Awk allows you to store scripts in files and reference them using the -f switch. Let's prepare a file named testfile in which we write the following:

{print $1 " has a home directory at " $6}

Let’s call awk with this file as the command source:

raevskym@DESKTOP-JNF3L6H:~$ cat testfile

{print $1 " has a home directory at " $6}

raevskym@DESKTOP-JNF3L6H:~$ awk -F: -f testfile /etc/passwd

root has a home directory at /root
daemon has a home directory at /usr/sbin
bin has a home directory at /bin
sys has a home directory at /dev
sync has a home directory at /bin
man has a home directory at /var/cache/man
lp has a home directory at /var/spool/lpd
mail has a home directory at /var/mail
news has a home directory at /var/spool/news
www-data has a home directory at /var/www

Here we print, from the file /etc/passwd, the names of the users, which end up in the variable $1, and their home directories, which end up in $6. Note that the script file is specified using the -f switch, and the field separator, a colon in our case, using the -F switch.

The script file can contain many commands, each written on its own line; there is no need to put a semicolon after each one. This is how it might look:

{
text = " has a home directory at "
print $1 text $6
}

Here we store the text used when printing each line of the processed file in a variable, and then use this variable in the print command. If you reproduce the previous example by writing this code to the file testfile, the output will be the same.
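To try this without touching /etc/passwd, here is a self-contained sketch. The temporary script file, the wording of the text variable, and the two sample records are all assumptions made up for the illustration; only the -F and -f switches come from the discussion above.

```shell
# Write a small awk script to a temporary file (hypothetical script and data).
script=$(mktemp)
cat > "$script" <<'EOF'
{
    text = " lives in "
    print $1 text $2
}
EOF

# Run the script against colon-separated sample input arriving on STDIN.
out=$(printf 'alice:/home/alice\nbob:/home/bob\n' | awk -F: -f "$script")
echo "$out"
rm -f "$script"
```

Keeping the script in a file like this makes longer programs easier to edit than a quoted one-liner.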

Executing commands before processing data

Sometimes you need to perform some action before the script starts processing records from the input stream. For example, you might create a report header or something similar.

To do this, you can use the keyword BEGIN. The commands that follow BEGIN will be executed before the data is processed. In its simplest form, it looks like this:

$ awk 'BEGIN {print "Hello World!"}'

Here’s a slightly more complex example:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN {print "The File Contents:"}
> {print $0}' myfile

The File Contents:
This is a test.
This is the second test.
This is the third test.
This is the fourth test.

First, awk executes the BEGIN block, and then it processes the data. Be careful with single quotes when using command-line constructs like this. Note that both the BEGIN block and the stream processing commands are a single string from awk's point of view. The first single quotation mark that delimits this string comes before BEGIN. The second comes after the closing curly brace of the data-processing command.

Executing commands after finishing data processing

The keyword END allows you to specify commands to be executed after data processing ends:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN {print "The File Contents:"}
> {print $0}
> END {print "End of File"}' myfile

The File Contents:
This is a test.
This is the second test.
This is the third test.
This is the fourth test.
End of File

After it finishes displaying the contents of the file, awk executes the commands in the END block. This is a useful feature; with its help you can, for example, create a report footer. Now let's write a script with the following content and save it in a file named myscript:

BEGIN {
print "The latest list of users and shells"
print " UserName \t HomePath"
print "-------- \t -------"
FS=":"
}
{
print $1 " \t " $6
}
END {
print "The end"
}

Here, in the BEGIN block, the header of the tabular report is created. In the same block we set the separator character. After the file has been processed, the END block tells us that the work is over.

Let's run the script:

raevskym@DESKTOP-JNF3L6H:~$ awk -f myscript  /etc/passwd

The latest list of users and shells
UserName    HomePath
--------    ---------
root        /root
daemon      /usr/sbin
bin         /bin
sys         /dev
sync        /bin
man         /var/cache/man
lp          /var/spool/lpd
mail        /var/mail
news        /var/spool/news
www-data    /var/www
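The same BEGIN, main block, END structure is also a natural fit for running totals. The following is a minimal sketch on made-up data (the item names and counts are assumptions): the main block accumulates a sum in a variable called total, and the END block prints it once all records have been read.

```shell
# Hypothetical two-column input: an item name and a count.
out=$(printf 'apples 3\npears 4\nplums 5\n' | awk '
BEGIN { print "Item Count" }            # header, printed before any input is read
{ total += $2; print $1, $2 }           # runs once per record
END { print "Total", total }            # footer, printed after the last record
')
echo "$out"
```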

Everything we talked about above is just a small part of awk’s capabilities. Let’s continue mastering this useful tool.

Built-in variables: customizing data processing

The awk utility uses built-in variables that allow you to customize how the data is processed and give access to both the data being processed and some information about it.

We have already discussed the positional variables $1, $2, $3, which let you retrieve field values, and we have worked with some other variables. In fact, there are quite a few of them. Some of the most commonly used are:

FIELDWIDTHS: a space-separated list of numbers specifying the exact width of each data field, including field separators.
FS: a variable you are already familiar with that sets the field separator character.
RS: a variable that sets the record separator character.
OFS: the field separator in awk script output.
ORS: the record separator in awk script output.

By default, the variable OFS is set to a space. It can be set as needed for output purposes:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{FS=":"; OFS="-"} {print $1,$6,$7}' /etc/passwd

root-/root-/bin/bash
daemon-/usr/sbin-/usr/sbin/nologin
bin-/bin-/usr/sbin/nologin
sys-/dev-/usr/sbin/nologin
sync-/bin-/bin/sync
man-/var/cache/man-/usr/sbin/nologin
lp-/var/spool/lpd-/usr/sbin/nologin
mail-/var/mail-/usr/sbin/nologin
news-/var/spool/news-/usr/sbin/nologin
www-data-/var/www-/usr/sbin/nologin
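It may help to see exactly when OFS is applied. The sketch below uses a made-up colon-separated line (an assumption for illustration) and shows three cases: values separated by commas in print are joined with OFS, values written side by side are simply concatenated, and assigning a field to itself makes awk rebuild the whole record using OFS.

```shell
line="root:x:0:0"

# Commas in print insert OFS between the values.
with_commas=$(echo "$line" | awk 'BEGIN{FS=":"; OFS="-"} {print $1, $3}')

# Without commas, the values are concatenated and OFS is not used.
concatenated=$(echo "$line" | awk 'BEGIN{FS=":"; OFS="-"} {print $1 $3}')

# Assigning a field to itself rebuilds $0 with OFS between all fields.
rebuilt=$(echo "$line" | awk 'BEGIN{FS=":"; OFS="-"} {$1=$1; print}')

echo "$with_commas"
echo "$concatenated"
echo "$rebuilt"
```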

The variable FIELDWIDTHS allows you to read records without using the field separator character.

In some cases, instead of using a field separator, data within records is arranged in columns of constant width. In such cases, you need to set FIELDWIDTHS so that its contents match the layout of the data.

With FIELDWIDTHS set, awk ignores FS and finds the data fields according to the widths given in FIELDWIDTHS.

Suppose you have a file named testfile containing the following data:

1235.9652147.91
927-8.365217.27
36257.8157492.5

It is known that the internal organization of this data follows the 3–5–2–5 pattern, that is, the first field is 3 characters wide, the second 5, and so on. Here is a script to parse such records:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{FIELDWIDTHS="3 5 2 5"}{print $1,$2,$3,$4}' testfile

123 5.965 21 47.91
927 -8.36 52 17.27
362 57.81 57 492.5

Let's see what the script displays. The data is parsed according to the value of FIELDWIDTHS; as a result, the digits and other characters in each line are split according to the specified field widths.

The variables RS and ORS control how records are read and written. By default, both RS and ORS are set to the line feed character. This means that awk treats each new line of text as a new record and prints each record on a new line.

It sometimes happens that fields in a data stream are spread across multiple lines. For example, suppose you have a file named addresses like this:
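FIELDWIDTHS is a GNU awk extension, so other awk implementations may not honor it. As a hedged alternative for such cases, the sketch below cuts the same 3-5-2-5 layout out of the first sample record with the standard substr function. The start positions 1, 4, 9, and 11 are derived by adding up the widths.

```shell
# Fixed-width parsing with substr(string, start, length);
# the start positions follow from the 3-5-2-5 column widths.
out=$(echo "1235.9652147.91" | awk '{
    print substr($0, 1, 3), substr($0, 4, 5), substr($0, 9, 2), substr($0, 11, 5)
}')
echo "$out"
```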

Person Name
123 High Street
(222) 466-1234

Another person
487 High Street
(523) 643-8754

If you try to read this data with FS and RS set to their defaults, awk will treat each new line as a separate record and split fields on spaces. This is not what we need in this case.

To solve this problem, set FS to the line feed character. This tells awk that each line in the data stream is a separate field.

In addition, in this example you need to set RS to an empty string. Note that in the file, the blocks of data about different people are separated by an empty line. As a result, awk will treat empty lines as record delimiters. Here's how to do it all:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{FS="\n"; RS=""} {print $1,$3}' addresses

Person Name (222) 466-1234
Another person (523) 643-8754

As you can see, thanks to these variable settings, awk treats the lines from the file as fields, and blank lines become the record delimiters.
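The same settings can be tried on data supplied through a pipe. This is a small sketch with invented names and addresses (assumptions for illustration only); it prints the record number from NR and the field count from NF to confirm that each blank-line-separated block is read as one record with one field per line.

```shell
# Two made-up address blocks separated by an empty line.
out=$(printf 'Ann\n1 Main St\n\nBob\n2 Side St\n' | awk '
BEGIN { FS = "\n"; RS = "" }
{ print NR ": " $1 " / " $2 " (" NF " fields)" }
')
echo "$out"
```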

Built-in Variables: Data and Environment Information

In addition to the built-in variables we discussed earlier, there are others that provide information about the data and the environment in which awk runs:

ARGC: the number of command-line arguments.
ARGV: an array with the command-line arguments.
ARGIND: the index in the ARGV array of the file currently being processed.
ENVIRON: an associative array with environment variables and their values.
ERRNO: the code of a system error that can occur when reading or closing input files.
FILENAME: the name of the input data file.
FNR: the number of the current record in the current data file.
IGNORECASE: if this variable is set to a nonzero value, case is ignored during processing.
NF: the total number of data fields in the current record.
NR: the total number of records processed.

The variables ARGC and ARGV allow you to work with command-line arguments. The script passed to awk does not end up in the ARGV array. Let's write a script like this:

$ awk 'BEGIN{print ARGC,ARGV[1]}' myfile

After running it, you can see that the total number of command-line arguments is 2, and that the name of the file being processed is stored at index 1 of the ARGV array. The array element at index 0 will in this case be "awk".

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{print ARGC,ARGV[1]}' myfile

2 myfile

The variable ENVIRON is an associative array of environment variables. Let's try it out:

raevskym@DESKTOP-JNF3L6H:~$ awk '
> BEGIN{
> print ENVIRON["HOME"]
> print ENVIRON["PATH"]
> }'

/home/raevskym
/home/raevskym/bin:/home/raevskym/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin

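The values of environment variables can also be used without ENVIRON, by passing them to awk with the -v switch. It is safer to quote the shell variable so that a value containing spaces is passed intact. For example:

raevskym@DESKTOP-JNF3L6H:~$ echo | awk -v home="$HOME" '{print "My home is " home}'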

My home is /home/raevskym

The variable NF allows you to access the last data field in a record without knowing its exact position:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{FS=":"; OFS=":"} {print $1,$NF}' /etc/passwd

root:/bin/bash
daemon:/usr/sbin/nologin
bin:/usr/sbin/nologin
sys:/usr/sbin/nologin
sync:/bin/sync
man:/usr/sbin/nologin
lp:/usr/sbin/nologin
mail:/usr/sbin/nologin
news:/usr/sbin/nologin
www-data:/usr/sbin/nologin

This variable contains the numeric index of the last data field in the record. You can refer to this field by placing the $ sign in front of NF.

The variables FNR and NR may seem similar, but they are different. FNR stores the number of records processed in the current file, while NR stores the total number of records processed. Let's look at a couple of examples, passing awk the same file twice:
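Because NF is an ordinary number, it can also be used in an expression after the $ sign. A brief sketch on a made-up line (the sample words are an assumption): $(NF-1) refers to the field just before the last one.

```shell
# NF is 4 here, so $NF is the fourth field and $(NF-1) is the third.
out=$(echo "one two three four" | awk '{print NF, $NF, $(NF-1)}')
echo "$out"
```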

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{FS=","}{print $1,"FNR="FNR}' myfile myfile

This is a test. FNR=1
This is the second test. FNR=2
This is the third test. FNR=3
This is the fourth test. FNR=4
This is a test. FNR=1
This is the second test. FNR=2
This is the third test. FNR=3
This is the fourth test. FNR=4

Passing the same file twice is equivalent to passing two different files. Note that FNR is reset at the start of processing each file.

Now let's take a look at how the variable NR behaves in a similar situation:

raevskym@DESKTOP-JNF3L6H:~$ awk '
> BEGIN {FS=","}
> {print $1,"FNR="FNR,"NR="NR}
> END{print "There were",NR,"records processed"}' myfile myfile

This is a test. FNR=1 NR=1
This is the second test. FNR=2 NR=2
This is the third test. FNR=3 NR=3
This is the fourth test. FNR=4 NR=4
This is a test. FNR=1 NR=5
This is the second test. FNR=2 NR=6
This is the third test. FNR=3 NR=7
This is the fourth test. FNR=4 NR=8
There were 8 records processed

As you can see, FNR is reset at the beginning of each file, as in the previous example, but NR keeps its value when awk moves on to the next file.
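The difference is easier to see with two distinct files and the FILENAME variable from the list above. The sketch below is only an illustration: the temporary directory and the file names first.txt and second.txt are assumptions created on the spot and removed afterwards.

```shell
# Create two small throwaway files (hypothetical names and contents).
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first.txt"
printf 'c\n' > "$dir/second.txt"

# Print the current file name, the per-file record number, and the overall record number.
out=$(cd "$dir" && awk '{print FILENAME, FNR, NR}' first.txt second.txt)
echo "$out"
rm -rf "$dir"
```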

User variables

Like any other programming language, awk allows the programmer to declare variables. Variable names can include letters, digits, and underscores, but they cannot start with a digit. You can declare a variable, assign a value to it, and use it in your code like this:

raevskym@DESKTOP-JNF3L6H:~$ awk '
> BEGIN{
> test="This is a test"
> print test
> }'

This is a test

Conditional operator

Awk supports the if-then-else conditional statement format that is standard in many programming languages. The one-line version of the statement is the keyword if followed by the expression to be tested, in parentheses, and then the command to be executed if the expression is true.

For example, suppose there is a file named testfile that holds one number per line.

Let’s write a script that outputs numbers from this file greater than 20:

raevskym@DESKTOP-JNF3L6H:~$ awk '{if ($1 > 20) print $1}' testfile

33
45

If you need to execute multiple statements in an if block, they must be enclosed in curly braces:

raevskym@DESKTOP-JNF3L6H:~$ awk '{
> if ($1 > 20)
> {
> x = $1 * 2
> print x
> }
> }' testfile

66
90

As mentioned, an awk conditional statement can contain an else block:

raevskym@DESKTOP-JNF3L6H:~$ awk '{
> if ($1 > 20)
> {
> x = $1 * 2
> print x
> } else
> {
> x = $1 / 2
> print x
> }}' testfile

An else branch can also be part of a one-line conditional statement, containing a single command. In this case, you need to put a semicolon after the if branch, immediately before else:

raevskym@DESKTOP-JNF3L6H:~$ awk '{if ($1 > 20) print $1 * 2; else print $1 / 2}' testfile
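Since the if-else examples above are shown without their output, here is a small self-contained sketch that can be run as is. The two input numbers are made up for illustration; the threshold of 20 matches the examples above.

```shell
# Classify each made-up number relative to the threshold 20.
out=$(printf '10\n33\n' | awk '{
    if ($1 > 20) {
        print $1, "is greater than 20"
    } else {
        print $1, "is 20 or less"
    }
}')
echo "$out"
```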

While loop

A while loop lets you iterate over a set of data, checking a condition that determines when the loop stops.

Here is the file testfile we want to loop through:

124 127 130
112 142 135
175 158 245

Let’s write a script like this:

raevskym@DESKTOP-JNF3L6H:~$ awk '{
> total = 0
> i = 1
> while (i < 4)
> {
> total += $i
> i++
> }
> avg = total / 3
> print "Average:",avg
> }' testfile

Average: 127
Average: 129.667
Average: 192.667

The while loop iterates over the fields of each record, accumulating their sum in the variable total and increasing the counter variable i by 1 on each iteration. When i reaches 4, the loop condition becomes false and the loop ends. The remaining commands are then executed: they calculate the average of the numeric fields of the current record and print the result.

You can use the break and continue commands in while loops. The first ends the loop early and moves on to the commands located after it. The second skips the rest of the current iteration and goes on to the next one.

This is how the break command works:

raevskym@DESKTOP-JNF3L6H:~$ awk '{
> total = 0
> i = 1
> while (i < 4)
> {
> total += $i
> if (i == 2)
> break
> i++
> }
> avg = total / 2
> print "The average of the first two elements is:",avg
> }' testfile

The average of the first two elements is: 125.5
The average of the first two elements is: 127
The average of the first two elements is: 166.5
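The continue command can be sketched in a similar way. The example below is an illustration rather than part of the original walkthrough: it feeds the same three records through a pipe and sums the first and third fields, skipping the second. Note that the counter is incremented before continue is reached; otherwise the loop would never advance past the skipped field.

```shell
out=$(printf '124 127 130\n112 142 135\n175 158 245\n' | awk '{
    total = 0
    i = 0
    while (i < 3)
    {
        i++                 # advance first, so that continue cannot loop forever
        if (i == 2)
            continue        # skip the second field
        total += $i
    }
    print "Sum without the second field:", total
}')
echo "$out"
```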

For loop

The for loop is used in many programming languages, and awk supports it as well. Let's solve the problem of calculating the average value of the numeric fields using a for loop:

raevskym@DESKTOP-JNF3L6H:~$ awk '{
> total = 0
> for (i = 1; i < 4; i++)
> {
> total += $i
> }
> avg = total / 3
> print "Average:",avg
> }' testfile

Average: 127
Average: 129.667
Average: 192.667

The initial value of the counter variable and the rule for changing it in each iteration, as well as the condition for terminating the loop, are specified at the beginning of the loop, in parentheses. As a result, unlike with a while loop, we do not need to increment the counter ourselves.
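The loop above hard-codes three fields per record. As a possible variation, the sketch below uses the built-in NF variable as the loop bound, so the same script would work for records with any number of fields. The data is the same three records, passed through a pipe; the check on NF is an added assumption that guards against dividing by zero on an empty line.

```shell
out=$(printf '124 127 130\n112 142 135\n175 158 245\n' | awk '{
    total = 0
    for (i = 1; i <= NF; i++)     # NF is the number of fields in this record
        total += $i
    if (NF > 0)                   # skip empty lines to avoid dividing by zero
        print "Average:", total / NF
}')
echo "$out"
```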

Formatted data output

The printf command in awk allows you to output formatted data. It lets you customize the appearance of the output by using templates that can contain text and format specifiers.

A format specifier is a special sequence that specifies the type of the output data and how it should be displayed. Awk uses format specifiers as placeholders showing where the values passed to printf are inserted.

The first specifier matches the first variable, the second specifier the second, and so on.

Formatting specifiers are written as follows:

%[modifier]control-letter

Here are some of them:

c: treats the number passed to it as an ASCII character code and outputs that character.
d: displays a decimal integer.
i: the same as d.
e: displays a number in exponential form.
f: displays a floating point number.
g: outputs a number either in exponential notation or in floating point format, whichever is shorter.
o: displays the octal representation of a number.
s: displays a text string.

Here’s how to format the output with printf:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{
> x = 100 * 100
> printf "The result is: %e\n", x
> }'

The result is: 1.000000e+04

Here, as an example, we print a number in exponential notation. This should be enough to convey the main idea behind working with printf.
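The modifier part of a specifier is what makes printf useful for aligned reports. Here is a brief sketch with made-up values: %-10s left-justifies a string in a field 10 characters wide, %5d right-justifies an integer in 5 characters, and %8.2f prints a floating point number in 8 characters with 2 digits after the decimal point.

```shell
# Width and precision modifiers; the label and numbers are made up.
out=$(awk 'BEGIN {
    printf "%-10s|%5d|%8.2f\n", "disk", 42, 3.14159
}')
echo "$out"
```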

Built-in math functions

When working with awk, built-in functions are available to the programmer. In particular, there are mathematical functions, string functions, and functions for working with time. For example, here is a list of math functions that you can use when developing awk scripts:

cos(x): cosine of x (x expressed in radians).
sin(x): sine of x.
exp(x): exponential function.
int(x): returns the integer part of the argument.
log(x): natural logarithm.
rand(): returns a random floating point number in the range 0 to 1.
sqrt(x): square root of x.

Here's how to use these functions:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{x=exp(5); print x}'

148.413
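A couple more of these functions in one short sketch, with arbitrary sample arguments. It shows that int drops the fractional part rather than rounding (for both positive and negative numbers), and that adding 0.5 before calling int is one common way to round a positive number to the nearest integer.

```shell
# int() truncates toward zero; int(x + 0.5) rounds a positive x.
out=$(awk 'BEGIN { print int(7.9), int(-7.9), sqrt(16), int(2.5 + 0.5) }')
echo "$out"
```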

String functions

Awk supports many string functions. They all work in more or less the same way. For example, here's the function toupper:

raevskym@DESKTOP-JNF3L6H:~$ awk 'BEGIN{x = "raevskym"; print toupper(x)}'

RAEVSKYM

This function converts the characters stored in the string variable passed to it to uppercase.
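A few other standard string functions follow the same calling pattern. The sketch below applies length, tolower, substr, and index to a made-up string; note that character positions in awk are counted from 1.

```shell
out=$(awk 'BEGIN {
    s = "Hello, awk"
    # length of s, lowercase copy, 3 characters starting at position 8,
    # and the position where "awk" first occurs
    print length(s), tolower(s), substr(s, 8, 3), index(s, "awk")
}')
echo "$out"
```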

Custom functions

You can create your own awk functions as needed. Such functions can be used in the same way as built-in ones:

raevskym@DESKTOP-JNF3L6H:~$ awk '
> function myprint()
> {
> printf "The user %s has home path at %s\n", $1,$6
> }
> BEGIN{FS=":"}
> {
> myprint()
> }' /etc/passwd

The user root has home path at /root
The user daemon has home path at /usr/sbin
The user bin has home path at /bin
The user sys has home path at /dev
The user sync has home path at /bin
The user man has home path at /var/cache/man
The user lp has home path at /var/spool/lpd
The user mail has home path at /var/mail
The user news has home path at /var/spool/news
The user www-data has home path at /var/www

The example uses the function myprint, which we define ourselves and which prints the data.
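Custom functions can also take parameters and return a value. The following sketch defines a hypothetical function named average (the name and the sample input are assumptions for illustration) and calls it with the first two fields of the record.

```shell
out=$(echo "124 127" | awk '
function average(a, b)
{
    return (a + b) / 2      # the value handed back to the caller
}
{
    print "Average:", average($1, $2)
}')
echo "$out"
```

Passing values in as parameters, instead of reading $1 and $6 directly inside the function as myprint does, makes a function easier to reuse on other data.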

Outcome

Today we have covered the basics of awk. It is a powerful data processing tool, comparable in scope to a standalone programming language.

You have probably noticed that much of what we covered is not hard to understand, and that knowing the basics is already enough to automate some tasks. If you want to dig deeper, turn to the documentation, for example The GNU Awk User's Guide. What is impressive about this guide is that it dates back to 1989 (the first version of awk, by the way, appeared in 1977). You now know enough about awk not to get lost in the official documentation and to explore it as closely as you like. Next time, by the way, we'll talk about regular expressions. Without them, it is impossible to do serious text processing in bash scripts using sed and awk.

Dear readers! Surely many of you use awk occasionally. How does it help you with your work?
