CHAPTER 5

DOING EVERYTHING ELSE WITH awk

This chapter introduces you to the awk command, which is a highly versatile utility for manipulating data and restructuring datasets. In fact, this utility is so versatile that entire books have been written about it. Awk is essentially an entire programming language in a single command, which reads standard input, writes standard output, and uses regular expressions and metacharacters in the same way other Unix commands do. This lets you plug it into pipelines and do almost anything, at the cost of adding complexity to a command string that may already be doing quite a lot. It is almost always worthwhile to add a comment when using awk; it is so versatile that it won’t be clear at a glance which of its many features you are using.

The first part of this chapter provides a very brief introduction to the awk command. You will learn about some built-in variables for awk, and also how to manipulate string variables using awk. Note that some of these string-related examples can also be handled using other bash commands.

The second part of this chapter shows you conditional logic, while loops, and for loops in awk in order to manipulate the rows and columns in datasets. This section also shows you how to delete lines and merge lines in datasets, and also how to print the contents of a file as a single line of text. You will see how to “join” lines and groups of lines in datasets.

The third section contains code samples that involve metacharacters (introduced in Chapter 1) and character sets in awk commands. You will also see how to use conditional logic in awk commands in order to determine whether or not to print a line of text.

The fourth section illustrates how to “split” a text string that uses the “.” character as a delimiter, followed by examples of using awk to perform numeric calculations (such as addition, subtraction, multiplication, and division) on files containing numeric data. This section also shows you various numeric functions that are available in awk, and also how to print text in a fixed set of columns.

The fifth section explains how to align columns in a dataset and also how to align and merge columns in a dataset. You will see how to delete columns, how to select a subset of columns from a dataset, and how to work with multiline records in datasets. This section contains some one-line awk commands that can be useful for manipulating the contents of datasets.

The final section of this chapter has a pair of use cases involving nested quotes and date formats in structured data sets.

The awk Command

The awk (Aho, Weinberger, and Kernighan) command has a C-like syntax, and you can use this utility to perform very complex operations on numbers and text strings.

As a side comment, there is also the gawk command, which is GNU awk, as well as the nawk command, which is “new” awk (neither command is discussed in this book). One advantage of nawk is that it allows you to externally set the value of an internal variable.

Built-In Variables That Control awk

The awk command provides variables that you can change from their default values in order to control how awk performs operations. Examples of such variables (and their default values) include: FS (" "), RS ("\n"), OFS (" "), ORS ("\n"), SUBSEP, and IGNORECASE. The variables FS and RS specify the field separator and record separator, whereas the variables OFS and ORS specify the output field separator and the output record separator, respectively.

You can think of the field separator as the counterpart of the delimiters/IFS that we used in other commands earlier. The record separator behaves in a way similar to how sed treats individual lines—for example, sed can match or delete a range of lines instead of matching or deleting something that matches a regular expression (and since the default awk record separator is the newline character, by default awk and sed have a similar ability to manipulate and reference lines in a text file).
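
As a quick illustration of changing the record separator, setting RS to an empty string puts awk into “paragraph mode,” in which blank lines separate records (this example assumes a file paragraphs.txt whose blocks of text are separated by blank lines):

awk 'BEGIN { RS = "" } { print "record", NR, "has", NF, "fields" }' paragraphs.txt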

As a simple example, you can print a blank line after each line of a file by changing the ORS, from the default of one newline to two newlines, as shown here:

cat columns.txt | awk 'BEGIN { ORS ="\n\n" } ; { print $0 }'

Other built-in variables include FILENAME (the name of the file that awk is currently reading), FNR (the current record number in the current file), NF (the number of fields in the current input record), and NR (the number of input records awk has processed since the beginning of the program’s execution).
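
For example, the following snippet (assuming two small text files file1.txt and file2.txt) prints these variables for every input line, which makes the difference between FNR (per-file) and NR (overall) easy to see:

awk '{ print FILENAME, FNR, NR, NF }' file1.txt file2.txt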

Consult the online documentation for additional information regarding these (and other) arguments for the awk command.

How Does the awk Command Work?

The awk command reads the input files one record at a time (by default, one record is one line). If a record matches a pattern, then an action is performed (otherwise no action is performed). If the search pattern is not given, then awk performs the given actions for each record of the input. The default behavior if no action is given is to print all the records that match the given pattern. Finally, empty braces without any action do nothing; that is, they will not perform the default printing operation. Note that multiple statements within an action should be delimited by semicolons or newlines.

In order to make the preceding paragraph more concrete, here are some simple examples involving text strings and the awk command (the results are displayed after each code snippet). The -F switch sets the field separator to whatever follows it, in this case a space. Switches often provide a shortcut to an action that would otherwise need a command inside a “BEGIN{}” block:

x="a b c d e"
echo $x |awk -F" " '{print $1}'
a
echo $x |awk -F" " '{print NF}'
5
echo $x |awk -F" " '{print $0}'
a b c d e
echo $x |awk -F" " '{print $3, $1}'
c a

Now let’s change the FS (field separator) to an empty string to calculate the length of a string, this time using the BEGIN{} syntax:

echo "abc" | awk 'BEGIN { FS = "" } ; { print NF }'
3

The following example illustrates several equivalent ways to specify test.txt as the input file for an awk command:

awk < test.txt '{ print $1 }'
awk '{ print $1 }' < test.txt
awk '{ print $1 }' test.txt

Yet another way is shown here (but as we’ve discussed earlier, it can be inefficient, so only do it if the cat command is adding value in some way):

cat test.txt | awk '{ print $1 }'

This simple example of four ways to do the same task should illustrate why commenting awk calls of any complexity is almost always a good idea. The next person to look at your code may not know/remember the syntax you are using.

Aligning Text with the printf Command

Since awk is a programming language inside a single command, it also has its own way of producing formatted output via the printf command.

Listing 5.1 displays the contents of columns2.txt and Listing 5.2 displays the contents of the shell script AlignColumns1.sh, which shows you how to align the columns in a text file.

LISTING 5.1 columns2.txt
one two
three four
one two three four
five six
one two three
four five
LISTING 5.2 AlignColumns1.sh
awk '
{
     # left-align $1 on a 10-char column
     # right-align $2 on a 10-char column
     # right-align $3 on a 10-char column
     # right-align $4 on a 10-char column
     printf("%-10s*%10s*%10s*%10s*\n", $1, $2, $3, $4)
}
' columns2.txt

Listing 5.2 contains a printf() statement that displays the first four fields of each row in the file columns2.txt, where each field is 10 characters wide.

The output from launching the code in Listing 5.2 is here:

one       *       two*          *          *
three     *      four*          *          *
one       *       two*     three*      four*
five      *       six*          *          *
one       *       two*     three*          *
four      *      five*          *          *

Keep in mind that printf is reasonably powerful and as such has its own syntax, which is beyond the scope of this chapter. A search online can find the manual pages and also discussions of “how to do X with printf().”
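
As an example, here is a small snippet that shows three common conversions (a left-aligned string, an integer, and a floating point number with a width and a precision):

awk 'BEGIN { printf("%-8s|%5d|%7.2f\n", "abc", 42, 3.14159) }'

The output is here:

abc     |   42|   3.14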

Conditional Logic and Control Statements

Like other programming languages, awk provides support for conditional logic (if/else) and control statements (for/while loops). awk is the only way to put conditional logic inside a piped command stream without creating, installing, and adding to the path a custom executable shell script. The following code block shows you how to use if/else logic:

echo "" | awk '
BEGIN { x = 10 }
{
  if (x % 2 == 0) {
      print "x is even"
  }
  else {
      print "x is odd"
  }
}
'

The preceding code block initializes the variable x with the value 10 and prints “x is even” if x is divisible by 2, otherwise it prints “x is odd”.

The while Statement

The following code block illustrates how to use a while loop in awk:

echo "" | awk '
{
  x = 0
  while(x < 4) {
    print "x:",x
    x = x + 1
  }
}
'

The preceding code block generates the following output:

x: 0
x: 1
x: 2
x: 3

The following code block illustrates how to use a do while loop in awk:

echo "" | awk '
{
  x = 0
  do {
    print "x:",x
    x = x + 1
  } while(x < 4)
}
'

The preceding code block generates the following output:

x: 0
x: 1
x: 2
x: 3

A for loop in awk

Listing 5.3 displays the contents of Loop.sh, which illustrates how to print a list of numbers in a loop. Note that “i++” is another way of writing “i=i+1” in awk (and most C-derived languages).

LISTING 5.3 Loop.sh
echo "" | awk '
BEGIN {}
{
   for(i=0; i<5; i++) {
     printf("%3d", i)
   }
}
END { print "\n" }
'

Listing 5.3 contains a for loop that prints numbers on the same line via the printf() statement. Notice that a newline is printed only in the END block of the code. The output from Listing 5.3 is here:

  0  1  2  3  4

A for loop with a break Statement

The following code block illustrates how to use a break statement in a for loop in awk:

echo "" | awk '
{
  for(x=1; x<4; x++) {
     print "x:",x
     if(x == 2) {
        break;
     }
  }
}
'

The preceding code block prints output only until the variable x has the value 2, after which the loop exits (because of the break inside the conditional logic). The following output is displayed:

x: 1
x: 2

The next and continue Statements

The following code snippet illustrates how to use next in pattern-action rules in awk:

awk '
/expression1/ { var1 = 5; next }
/expression2/ { var2 = 7; next }
/expression3/ { next }
# some other code block here
' somefile

When the current line matches expression1, then var1 is assigned the value 5 and awk reads the next input line: hence, expression2 and expression3 will not be tested. If expression1 does not match and expression2 does match, then var2 is assigned the value 7 and awk reads the next input line. If only expression3 matches, then awk again skips the remaining rules and processes the next input line. Note that the continue statement is valid only inside a for or while loop (as in the preceding sections); at the top level of a pattern-action rule, next is the statement that skips to the next input record (some older awk implementations accepted a top-level continue as a synonym for next, but that behavior is not portable).
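
Here is a minimal runnable sketch of next, using an inline input stream instead of a file:

printf "apple\nbanana\ncherry\n" | awk '
/banana/ { print "skipping:", $0; next }
{ print "processing:", $0 }'

The output is here:

processing: apple
skipping: banana
processing: cherry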

Deleting Alternate Lines in Datasets

Listing 5.4 displays the contents of linepairs.csv and Listing 5.5 displays the contents of deletelines.sh, which illustrates how to print alternating lines (the odd-numbered lines) from the dataset linepairs.csv.

LISTING 5.4 linepairs.csv
a,b,c,d
e,f,g,h
1,2,3,4
5,6,7,8
LISTING 5.5 deletelines.sh
inputfile="linepairs.csv"
outputfile="linepairsdeleted.csv"
awk ' NR%2 {printf "%s", $0; print ""; next}' < $inputfile > $outputfile

In Listing 5.5, the pattern NR%2 is nonzero (true) when the current record number NR is odd, in which case the current line is printed; the even-numbered lines match no rule and are therefore skipped (i.e., deleted). The output is redirected to the specified output file, the contents of which are here:

a,b,c,d
1,2,3,4
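
To keep the even-numbered lines instead, you can use a pattern with no action (the default action is to print the matching record):

awk 'NR%2 == 0' linepairs.csv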

A slightly more common task involves merging consecutive lines, which is the topic of the next section.

Merging Lines in Datasets

Listing 5.6 displays the contents of columns.txt and Listing 5.7 displays the contents of ColumnCount1.sh, which illustrates how to print the lines from the text file columns.txt that have exactly two columns.

LISTING 5.6 columns.txt
one two three
one two
one two three four
one
one three
one four
LISTING 5.7 ColumnCount1.sh
awk '
{
   if( NF == 2 ) { print $0 }
}
' columns.txt

Listing 5.7 is straightforward: if the current line contains exactly two fields, then the line is printed (all other lines are skipped). The output from launching the code in Listing 5.7 is here:

one two
one three
one four

If you want to display the lines that do not contain 2 columns, use the following code snippet:

if( NF != 2 ) { print $0 }

Printing File Contents as a Single Line

The contents of test4.txt are here (note the blank lines):

abc

def

abc

abc

The following code snippet illustrates how to print the contents of test4.txt as a single line:

awk '{printf("%s", $0)}' test4.txt

The output of the preceding code snippet is here. See if you can tell what is happening before reading the explanation in the next paragraph:

abcdefabcabc

Explanation: %s is the printf format conversion for a string, and because nothing follows it inside the quotes (in particular, no \n), nothing is printed between records. The default record separator for awk is \n (newline), and each input line arrives in $0 without it; since printf (unlike print) does not append a newline, the effect is to strip out all the newlines. The blank rows vanish entirely, as all they contain is the newline, so the blocks of actual text are merged together with nothing between them. Had we added a space between the %s and the ending quote, there would be a space between each text block, plus an extra space for each blank line.
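
To see the difference, add a space to the format string: each block of text is then followed by a space, and each blank line contributes an extra space:

awk '{printf("%s ", $0)}' test4.txt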

Notice how the following comment helps the comprehension of the code snippet:

# Merging all text into a single line by removing the newlines 
awk '{printf("%s", $0)}' test4.txt

Joining Groups of Lines in a Text File

Listing 5.8 displays the contents of digits.txt and Listing 5.9 displays the contents of digits.sh, which “joins” three consecutive lines of text in the file digits.txt.

LISTING 5.8 digits.txt
1
2
3
4
5
6
7
8
9
LISTING 5.9 digits.sh
awk -F" " '{
  printf("%d",$0)
  if(NR % 3 == 0) { printf("\n") }
}' digits.txt

Listing 5.9 prints three consecutive lines of text on the same line, after which a linefeed is printed. This has the effect of “joining” every three consecutive lines of text. The output from launching digits.sh is here:

123
456
789

Joining Alternate Lines in a Text File

Listing 5.10 displays the contents of columns2.txt and Listing 5.11 displays the contents of JoinLines.sh, which “joins” two consecutive lines of text in the file columns2.txt.

LISTING 5.10 columns2.txt
one two
three four
one two three four
five six
one two three
four five
LISTING 5.11 JoinLines.sh
awk '
{
     printf("%s",$0)
     if( $1 !~ /one/) { print " " }
}
' columns2.txt

The output from launching Listing 5.11 is here:

one two three four
one two three four five six
one two three four five

Notice that the code in Listing 5.11 depends on the presence of the string “one” as the first field in alternating lines of text—we are merging based on matching a simple pattern, instead of tying the merge to record numbers.

To merge each pair of lines instead of merging based on matching a pattern, use the modified code in Listing 5.12.

LISTING 5.12 JoinLines2.sh
awk '
BEGIN { count = 0 }
{
   printf("%s",$0)
   if( ++count % 2 == 0) { print " " }
}' columns2.txt

Yet another way to “join” consecutive lines is shown in Listing 5.13, where the input file and output file refer to files that you can populate with data. This is another example of an awk command that might be a puzzle if encountered in a program without a comment. It is doing exactly the same thing as Listing 5.12, but its purpose is less obvious because of the more compact syntax.

LISTING 5.13 JoinLines3.sh
inputfile="linepairs.csv"
outputfile="linepairsjoined.csv"
awk ' NR%2 {printf "%s,", $0; next;}1' < $inputfile > $outputfile

Matching with Metacharacters and Character Sets

If we can match a simple pattern, by now you probably expect that you can also match a regular expression, just as we did in grep and sed. Listing 5.14 displays the contents of Patterns1.sh, which uses metacharacters to match the beginning and the end of a line of text in the file columns2.txt.

LISTING 5.14 Patterns1.sh
awk '
   /^f/    { print $1 }
   /two $/ { print $1 }
' columns2.txt

The output from launching Listing 5.14 is here:

one
five
four

Listing 5.15 displays the contents of columns3.txt with lines that contain a different number of columns.

LISTING 5.15 columns3.txt
123 one two
456 three four
one two three four
five 123 six
one two three
four five

Listing 5.16 displays the contents of MatchAlpha1.sh, which matches lines that start with a numeric string, as well as lines whose first field is alphabetic and whose second field is numeric.

LISTING 5.16 MatchAlpha1.sh
awk '
{
   if( $0 ~ /^[0-9]/) { print $0 }
   if( $0 ~ /^[a-z]+ [0-9]/) { print $0 }
}
' columns3.txt

The output from Listing 5.16 is here:

123 one two
456 three four
five 123 six

Printing Lines Using Conditional Logic

Listing 5.17 displays the contents of products.txt, which contains three columns of information.

LISTING 5.17 products.txt
MobilePhone  400    new
Tablet       300    new
Tablet       300    used
MobilePhone  200    used
MobilePhone  100    used

The following code snippet prints the lines of text in products.txt whose second column is greater than 300:

awk '$2 > 300' products.txt

The output of the preceding code snippet is here:

MobilePhone 400 new

The following code snippet prints the lines of text in products.txt whose third column is the string “new”:

awk '($3 == "new")' products.txt

The output of the preceding code snippet is here:

MobilePhone   400      new
Tablet        300      new

The following code snippet prints the first and third columns of the lines of text in products.txt whose cost equals 300:

awk ' $2 == 300 { print $1, $3 }' products.txt

The output of the preceding code snippet is here:

Tablet new
Tablet used

The following code snippet prints the first and third columns of the lines of text in products.txt that start with the string Tablet:

awk '/^Tablet/ { print $1, $3 }' products.txt

The output of the preceding code snippet is here:

Tablet new
Tablet used

Splitting Filenames with awk

Listing 5.18 displays the contents of SplitFilename2.sh, which illustrates how to split a filename containing the “.” character in order to increment the numeric value of one of the components of the filename. Note that this code only works for a filename with exactly the expected syntax. It is possible to write more complex code to count the number of segments, or alternatively to just say “change the field right before the .zip,” which would only require that the filename match the final two sections (<anystructure>.number.zip).

LISTING 5.18 SplitFilename2.sh
echo "05.20.144q.az.1.zip" | awk -F"." '
{
  f5=$5 + 1
  printf("%s.%s.%s.%s.%s.%s",$1,$2,$3,$4,f5,$6)
}'

The output from Listing 5.18 is here:

05.20.144q.az.2.zip
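
As a sketch of the more flexible approach mentioned above, the following snippet assumes only that the filename ends in <number>.zip: it increments the next-to-last field, no matter how many fields precede it (assigning to a field causes awk to rebuild $0 using OFS):

echo "05.20.144q.az.1.zip" | awk 'BEGIN { FS = OFS = "." }
{ $(NF-1) += 1; print }'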

Working with Postfix Arithmetic Operators

Listing 5.19 displays the contents of mixednumbers.txt that contain postfix operators, which means numbers where the negative (and/or positive) sign appears at the end of a column value instead of at the beginning of the number.

LISTING 5.19 mixednumbers.txt
324.000-|10|983.000-
453.000-|30|298.000-
783.000-|20|347.000-

Listing 5.20 displays the contents of AddSubtract1.sh, which illustrates how to add the rows of numbers in Listing 5.19.

LISTING 5.20 AddSubtract1.sh
myFile="mixednumbers.txt"
awk -F"|" '
BEGIN { line = 0; total = 0 }
{
   split($1, arr, "-")
   f1 = arr[1]
   if($1 ~ /-/) { f1 = -f1 }
   line += f1

   split($2, arr, "-")
   f2 = arr[1]
   if($2 ~ /-/) { f2 = -f2 }
   line += f2

   split($3, arr, "-")
   f3 = arr[1]
   if($3 ~ /-/) { f3 = -f3 }
   line += f3

   printf("f1: %d f2: %d f3: %d line: %d\n",f1,f2,f3, line)
   total += line
   line = 0
}
END { print "Total: ",total }
' $myFile

The output from Listing 5.20 is here. See if you can work out what the code is doing before reading the explanation that follows:

f1: -324 f2: 10 f3: -983 line: -1297
f1: -453 f2: 30 f3: -298 line: -721
f1: -783 f2: 20 f3: -347 line: -1110
Total:  -3128

The code assumes we know the format of the file. For each field in a given record, the split function divides the field on the “-” character and stores the pieces in the array arr, so arr[1] holds the numeric portion (any trailing dash simply becomes an empty second element). The if statement then checks whether the original field contains a dash: if it does, the numeric variable is negated, otherwise it is left alone. Each value is then added to the running total for the line.

Numeric Functions in awk

The int(x) function returns the integer portion of a number. If the number is not already an integer, it falls between two integers. Of the two possible integers, the function will return the one closest to zero. This is different from a rounding function, which chooses the closer integer.

For example, int(3) is 3, int(3.9) is 3, int(-3.9) is -3, and int(-3) is -3 as well. An example of the int(x) function in an awk command is here:

awk 'BEGIN {
   print int(3.534);
   print int(4);
   print int(-5.223);
   print int(-5);
}'

The output is here:

3
4
-5
-5

The exp(x) function gives you the exponential of x, or reports an error if x is out of range. The range of values x can have depends on your machine’s floating point representation.

awk 'BEGIN{
   print exp(123434346);
   print exp(0);
   print exp(-12);
}'

The output is here:

inf
1
6.14421e-06

The log(x) function gives you the natural logarithm of x, if x is positive; otherwise, it reports an error (inf means infinity and nan in output means “not a number”).

awk 'BEGIN{
  print log(12);
  print log(0);
  print log(1);
  print log(-1);
}'

The output is here:

2.48491
-inf
0
nan

The sin(x) function gives you the sine of x and cos(x) gives you the cosine of x, with x in radians:

awk 'BEGIN {
   print cos(90);
   print cos(45);
}'

The output is here:

-0.448074
0.525322

The rand() function gives you a random number. The values of rand() are uniformly distributed between 0 and 1: the value can be zero, but it is never 1.

Often you want random integers instead. Here is a user-defined function you can use to obtain a random, nonnegative integer less than n:

function randint(n) {
     return int(n * rand())
}

The multiplication produces a random real number that is at least 0 and less than n. We then use int to truncate it to an integer between 0 and n - 1.

Here is an example where a similar function is used to produce random integers between 1 and n:

awk '
# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }
# Roll 3 six-sided dice and print total number of points.
{
      printf("%d points\n", roll(6)+roll(6)+roll(6))
}'

Note that rand starts generating numbers from the same point (or “seed”) each time awk is invoked. Hence, a program will produce the same results each time it is launched. If you want a program to do different things each time it is used, you must change the seed to a value that will be different in each run.

Use the srand(x) function to set the starting point, or seed, for generating random numbers to the value x. Each seed value leads to a particular sequence of “random” numbers. Thus, if you set the seed to the same value a second time, you will get the same sequence of “random” numbers again. If you omit the argument x, as in srand(), then the current date and time of day are used for a seed. This is how to obtain random numbers that are truly unpredictable. The return value of srand() is the previous seed. This makes it easy to keep track of the seeds for use in consistently reproducing sequences of random numbers.
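
For example, the following snippet seeds the generator from the current date and time, so each run prints a different random integer between 0 and 100:

awk 'BEGIN { srand(); print int(101 * rand()) }'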

The time() function (not in all versions of awk) returns the current time in seconds since January 1, 1970. The function ctime (not in all versions of awk) takes a numeric argument in seconds and returns a string representing the corresponding date, suitable for printing or further processing.
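
For example, if you are using gawk, the built-in systime() and strftime() functions provide similar functionality; here is a minimal sketch that prints the current time in a human-readable format:

gawk 'BEGIN { now = systime(); print strftime("%Y-%m-%d %H:%M:%S", now) }'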

The sqrt(x) function gives you the positive square root of x. It reports an error if x is negative. Thus, sqrt(4) is 2.

awk 'BEGIN{
   print sqrt(16);
   print sqrt(0);
   print sqrt(-12);
}'

The output is here:

4
0
nan

One-Line awk Commands

The code snippets in this section reference the text file short1.txt, which you can populate with any data of your choice.

The following code snippet prints each line preceded by the number of fields in each line:

awk '{print NF ":" $0}' short1.txt

Print the right-most field in each line:

awk '{print $NF}' short1.txt

Print the lines that contain more than 2 fields:

awk '{if(NF > 2) print }' short1.txt

Print the value of the right-most field if the current line contains more than 2 fields:

awk '{if(NF > 2) print $NF }' short1.txt

Remove leading and trailing whitespaces:

echo " a b c " | awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'

Print the first and third fields in reverse order for the lines that contain at least 3 fields:

awk '{if(NF > 2) print $3, $1}' short1.txt

Print the lines that contain the string one:

awk '{if(/one/) print }' *txt

As you can see from the preceding code snippets, it’s easy to extract information or subsets of rows and columns from text files using simple conditional logic and built-in variables in the awk command.

Useful Short awk Scripts

This section contains a set of short awk -based scripts for performing various operations. Some of these scripts can also be used in other shell scripts to perform more complex operations. Listing 5.21 displays the contents of the file data.txt, which is used in various code samples in this section.

LISTING 5.21 data.txt
this is line one that contains more than 40 characters
this is line two
this is line three that also contains more than 40 characters
four

this is line six and the preceding line is empty

line eight and the preceding line is also empty

The following code snippet prints every line that is longer than 40 characters:

awk 'length($0) > 40' data.txt

Now print the length of the longest line in data.txt:

awk '{ if (x < length()) x = length() }
END { print "maximum line length is " x }' < data.txt

If the input contains tabs, you can first pipe it through the expand utility to change tabs into spaces, so that the widths compared are actual right-margin columns.

Print every line that has at least one field:

awk 'NF > 0' data.txt

The preceding code snippet illustrates an easy way to delete blank lines from a file (or rather, to create a new file similar to the old file but from which the blank lines have been removed).

Print seven random numbers from 0 to 100, inclusive:

awk 'BEGIN { for (i = 1; i <= 7; i++)
print int(101 * rand()) }'

Count the lines in a file:

awk 'END { print NR }' < data.txt

Print the even-numbered lines in the data file:

awk 'NR % 2 == 0' data.txt

If you use the expression 'NR % 2 == 1' in the previous code snippet, the program would print the odd-numbered lines.

Insert a duplicate of every line in a text file:

awk '{print $0, "\n", $0}' < data.txt

Insert a duplicate of every line in a text file and also remove blank lines:

awk '{print $0, "\n", $0}' < data.txt | awk 'NF > 0'

Insert a blank line after every line in a text file:

awk '{print $0, "\n"}' < data.txt

Printing the Words in a Text String in awk

Listing 5.22 displays the contents of Fields2.sh, which illustrates how to print the words in a text string using the awk command.

LISTING 5.22 Fields2.sh
echo "a b c d e"| awk '
{
  for(i=1; i<=NF; i++) {
     print "Field ",i,":",$i
}
}
'

The output from Listing 5.22 is here:

Field  1 : a
Field  2 : b
Field  3 : c
Field  4 : d
Field  5 : e

Count Occurrences of a String in Specific Rows

Listing 5.23 and Listing 5.24 display the contents data1.csv and data2.csv, respectively, and Listing 5.25 displays the contents of checkrows.sh, which illustrates how to count the number of occurrences of the string “past” in column 3 in rows 2, 5, and 7.

LISTING 5.23 data1.csv
in,the,past,or,the,present
for,the,past,or,the,present
in,the,past,or,the,present
for,the,paste,or,the,future
in,the,past,or,the,present
completely,unrelated,line1
in,the,past,or,the,present
completely,unrelated,line2
LISTING 5.24 data2.csv
in,the,past,or,the,present
completely,unrelated,line1
for,the,past,or,the,present
completely,unrelated,line2
for,the,paste,or,the,future
in,the,past,or,the,present
in,the,past,or,the,present
completely,unrelated,line3
LISTING 5.25 checkrows.sh
files="`ls data*.csv| tr '\n' ' '`"
echo "List of files: $files"
awk -F"," '
( FNR==2 || FNR==5 || FNR==7 ) {
    if ( $3 ~ "past" ) { count++ }
}
END {
    printf "past: matched %d times (INEXACT) ", count
    printf "in field 3 in lines 2/5/7\n"
}' data*.csv

Listing 5.25 looks for occurrences of the string past in field 3 of lines 2, 5, and 7 because of the following code snippet:

( FNR==2 || FNR==5 || FNR==7 ) {
    if ( $3 ~ "past" ) { count++ }
}

If a match occurs, then the value of count is incremented. The END block reports the number of times that the string past was found in field 3 of lines 2, 5, and 7. Note that strings such as paste and pasted will match the string past. The output from Listing 5.25 is here:

List of files: data1.csv data2.csv
past: matched 5 times (INEXACT) in field 3 in lines 2/5/7

The shell script checkrows2.sh replaces the term $3 ~ "past" with the term $3 == "past" in checkrows.sh in order to check for exact matches, which produces the following output:

List of files: data1.csv data2.csv
past: matched 4 times (EXACT) in field 3 in lines 2/5/7
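
The key change in checkrows2.sh looks like this (the printf statement is also reworded to say EXACT):

( FNR==2 || FNR==5 || FNR==7 ) {
    if ( $3 == "past" ) { count++ }
}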

Printing a String in a Fixed Number of Columns

Listing 5.26 displays the contents of FixedFieldCount1.sh, which illustrates how to print the words of a text string in a fixed number of columns using the awk command.

LISTING 5.26 FixedFieldCount1.sh
echo "aa bb cc dd ee ff gg hh"| awk '
BEGIN { colCount = 3 }
{
  for(i=1; i<=NF; i++) {
     printf("%s ", $i)
     if(i % colCount == 0) {
        print " "
      }
  }
}
'

The output from Listing 5.26 is here:

aa bb cc
dd ee ff
gg hh

Printing a Dataset in a Fixed Number of Columns

Listing 5.27 displays the contents of VariableColumns.txt with lines of text that contain a different number of columns.

LISTING 5.27 VariableColumns.txt
this is line one
this is line number one
this is the third and final line

Listing 5.28 displays the contents of Fields3.sh, which illustrates how to reflow the words of a text file into a fixed number of columns using the awk command.

LISTING 5.28 Fields3.sh
awk '{printf("%s ", $0)}' VariableColumns.txt | awk '
BEGIN { columnCount = 3 }
{
  for(i=1; i<=NF; i++) {
     printf("%s ", $i)
     if( i % columnCount == 0 )
       print " "
  }
}
'

The output from Listing 5.28 is here:

this is line
one this is
line number one
this is the
third and final
line

Aligning Columns in Datasets

If you have read the preceding two examples, the code sample in this section is easy to understand: you will see how to realign columns of data that are correct in terms of their content, but have been placed in different rows (and therefore are misaligned). Listing 5.29 displays the contents of mixed-data.csv with misaligned data values. In addition, the first line and final line in Listing 5.29 are empty lines, which will be removed by the shell script in this section.

LISTING 5.29 mixed-data.csv
Sara, Jones, 1000, CA, Sally, Smith, 2000, IL,
Dave, Jones, 3000, FL, John, Jones,
4000, CA,
Dave, Jones, 5000, NY, Mike,
Jones, 6000, NY, Tony, Jones, 7000, WA

Listing 5.30 displays the contents of mixed-data.sh, which illustrates how to realign the dataset in Listing 5.29.

LISTING 5.30 mixed-data.sh
#-----------------------------------------
# 1) remove blank lines
# 2) remove line feeds
# 3) print a LF after every fourth field
# 4) remove trailing ',' from each row
#-----------------------------------------
inputfile="mixed-data.csv"
grep -v "^$" $inputfile |awk -F"," '{printf("%s",$0)}' | awk '
BEGIN { columnCount = 4 }
{
   for(i=1; i<=NF; i++) {
     printf("%s ", $i)
     if( i % columnCount == 0) { print "" }
   }
}' > temp-columns
# 4) remove trailing ',' from output:
cat temp-columns | sed 's/, $//' | sed 's/ $//' > $outputfile

Listing 5.30 starts with a grep command that removes blank lines, followed by an awk command that prints the rows of the dataset as a single line of text. The second awk command initializes the columnCount variable with the value 4 in the BEGIN block, followed by a for loop that iterates through the input fields. After four fields are printed on the same output line, a linefeed is printed, which has the effect of realigning the input dataset as an output dataset consisting of rows that have four fields. The output from Listing 5.30 is here:

Sara, Jones, 1000, CA
Sally, Smith, 2000, IL
Dave, Jones, 3000, FL
John, Jones, 4000, CA
Dave, Jones, 5000, NY
Mike, Jones, 6000, NY
Tony, Jones, 7000, WA

Aligning Columns and Multiple Rows in Datasets

The preceding section showed you how to realign a dataset so that each row contains the same number of columns and also represents a single data record. The code sample in this section illustrates how to realign columns of data that are correct in terms of their content, and also place two records in each line of the new dataset. Listing 5.31 displays the contents of mixed-data2.csv with misaligned data values, followed by Listing 5.32 that displays the contents of aligned-data2.csv with the correctly formatted dataset.

LISTING 5.31 mixed-data2.csv
Sara, Jones, 1000, CA, Sally, Smith, 2000, IL,
Dave, Jones, 3000, FL, John, Jones,
4000, CA,
Dave, Jones, 5000, NY, Mike,
Jones, 6000, NY, Tony, Jones, 7000, WA
LISTING 5.32 aligned-data2.csv
Sara, Jones, 1000, CA, Sally, Smith, 2000, IL
Dave, Jones, 3000, FL, John, Jones, 4000, CA
Dave, Jones, 5000, NY, Mike, Jones, 6000, NY
Tony, Jones, 7000, WA

Listing 5.33 displays the contents of mixed-data2.sh, which illustrates how to realign the dataset in Listing 5.31.

LISTING 5.33 mixed-data2.sh
#-----------------------------------------
# 1) remove blank lines
# 2) remove line feeds
# 3) print a LF after every 8 fields
# 4) remove trailing ',' from each row
#-----------------------------------------
inputfile="mixed-data2.txt"
outputfile="aligned-data2.txt"
grep -v "^$" $inputfile |awk -F"," '{printf("%s",$0)}' | awk '
BEGIN { columnCount = 4; rowCount = 2; currRow = 0 }
{
   for(i=1; i<=NF; i++) {
     printf("%s ", $i)
     if( i % columnCount == 0) { ++currRow }
     if(currRow > 0 && currRow % rowCount == 0) { currRow = 0; print "" }
   }
}' > temp-columns
# 4) remove trailing ',' from output:
cat temp-columns | sed 's/, $//' | sed 's/ $//' > $outputfile

Listing 5.33 is very similar to Listing 5.30. The key idea is to print a linefeed character after a pair of “normal” records have been processed, which is implemented via the rowCount and currRow logic in the second awk command of Listing 5.33.

Now you can generalize Listing 5.33 very easily by changing the initial value of the rowCount variable to any other positive integer, and the code will work correctly without any further modification. For example, if you initialize rowCount to the value 5, then every row in the new dataset (with the possible exception of the final output row) will contain 5 “normal” data records.

Removing a Column from a Text File

Listing 5.34 displays the contents of products.txt (shown earlier in this chapter), from which we will remove the first column.

LISTING 5.34 products.txt
MobilePhone  400    new
Tablet       300    new
Tablet       300    used
MobilePhone  200    used
MobilePhone  100    used

Listing 5.35 displays the contents of RemoveColumn.sh that removes the first column from a text file.

LISTING 5.35 RemoveColumn.sh
awk '{ for (i=2; i<=NF; i++) printf "%s ", $i; printf "\n"; }' products.txt

The loop is between 2 and NF, which iterates over all the fields except for the first field. In addition, printf explicitly adds newlines. The output of the preceding code snippet is here:

400 new
300 new
300 used
200 used
100 used
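
For comparison, if the fields were separated by single spaces, the cut command could produce a similar result, although awk handles runs of whitespace more gracefully:

cut -d" " -f2- products.txt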

Subsets of Columns of Even Rows in Datasets

An earlier section showed you how to align the rows of a dataset, and the code sample in this section illustrates how to extract a subset of the existing columns and a subset of the rows. Listing 5.36 displays the contents of sub-rows-cols.txt, the desired dataset, which contains two columns from every second row of the file aligned-data.txt.

LISTING 5.36 sub-rows-cols.txt
Sara, 1000
Dave, 3000
Dave, 5000
Tony, 7000

Listing 5.37 displays the contents of sub-rows-cols.sh, which illustrates how to generate the dataset in Listing 5.36. Most of the code is the same as Listing 5.33, with the new code in the final awk command.

LISTING 5.37 sub-rows-cols.sh
#-----------------------------------------
# 1) remove blank lines
# 2) remove line feeds
# 3) print a LF after every fourth field
# 4) remove trailing ',' from each row
#-----------------------------------------
inputfile="mixed-data.txt"
grep -v "^$" $inputfile |awk -F"," '{printf("%s",$0)}' | awk '
BEGIN { columnCount = 4 }
{
   for(i=1; i<=NF; i++) {
     printf("%s ", $i)
     if( i % columnCount == 0) { print "" }
   }
}' > temp-columns
# 4) remove trailing ',' from output:
cat temp-columns | sed 's/, $//' | sed 's/ $//' > temp-columns2
cat temp-columns2 | awk '
    BEGIN { rowCount = 2; currRow = 0 }
            {
       if(currRow % rowCount == 0) { print $1, $3 }
       ++currRow
    }' > temp-columns3
cat temp-columns3 | sed 's/,$//' | sed 's/ $//' > $outputfile

Listing 5.37 contains a new block of code that redirects the output of step #4 to a temporary file temp-columns2 whose contents are processed by another awk command in the last section of Listing 5.37. Notice that the awk command contains a BEGIN block that initializes the variables rowCount and currRow with the values 2 and 0, respectively.

The main block prints columns 1 and 3 of the current line if the current row number is even, and then the value of currRow is incremented. The output of this awk command is redirected to yet another temporary file that is the input to the final code snippet, which uses the cat command and two occurrences of the sed command in order to remove a trailing “,” and a trailing space, as shown here:

cat temp-columns3 | sed 's/,$//' | sed 's/ $//' > $outputfile

Keep in mind that there are other ways to perform the functionality in Listing 5.37, and the main purpose is to show you different techniques for combining various bash commands.

Counting Word Frequency in Datasets

Listing 5.38 displays the contents of WordCounts1.sh, which illustrates how to count the frequency of words in a file.

LISTING 5.38 WordCounts1.sh
awk '
# Print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
' columns2.txt

Listing 5.38 contains a block of code that processes the lines in columns2.txt. Each time that a word (of a line) is encountered, the code increments the number of occurrences of that word in the hash table freq. The END block contains a for loop that displays the number of occurrences of each word in columns2.txt.

The output from Listing 5.38 is here:

two    3
one    3
three  3
six    1
four   3
five   2
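
To list the most frequent words first, pipe the output into the sort command, sorting numerically in reverse on the count in the second column:

bash WordCounts1.sh | sort -k2,2nr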

Listing 5.39 displays the contents of WordCounts2.sh, which performs a case-insensitive word count.

LISTING 5.39 WordCounts2.sh
awk '
{
    # convert everything to lower case
    $0 = tolower($0)
    # remove punctuation
   #gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for(i=1; i<=NF; i++) {
        freq[$i]++
    }
}
END {
    for(word in freq) {
       printf "%s\t%d\n", word, freq[word]
    }
}
' columns4.txt

Listing 5.39 is almost identical to Listing 5.38, with the addition of the following code snippet that converts the text in each input line to lowercase letters, as shown here:

$0 = tolower($0)

Listing 5.40 displays the contents of columns4.txt.

LISTING 5.40 columns4.txt
123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
one two three
four five

The output from launching Listing 5.39 with columns4.txt is here:

456     1
two     3
one     3
three   3
six     1
123     2
four    3
five    2

Displaying Only “Pure” Words in a Dataset

For simplicity, let’s work with a text string and that way we can see the intermediate results as we work toward the solution. This example will be familiar from prior chapters, but now we see how awk does it.

Listing 5.41 displays the contents of onlywords.sh, which contains three awk commands for displaying the words, integers, and alphanumeric strings, respectively, in a text string.

LISTING 5.41 onlywords.sh
x="ghi abc Ghi 123 #def5 123z" echo "Only words:"
echo $x |tr -s ' ' '\n' | awk -F" " '
{
  if($0 ~ /^[a-zA-Z]+$/) { print $0 }
}
' | sort | uniq
echo
echo "Only integers:"
echo $x |tr -s ' ' '\n' | awk -F" " '
{
  if($0 ~ /^[0-9]+$/) { print $0 }
}
' | sort | uniq
echo
echo "Only alphanumeric words:"
echo $x |tr -s ' ' '\n' | awk -F" " '
{
  if($0 ~ /^[0-9a-zA-Z]+$/) { print $0 }
}
' | sort | uniq
echo

Listing 5.41 starts by initializing the variable x:

x="ghi abc Ghi 123 #def5 123z"

The next step is to split x into words:

echo $x |tr -s ' ' '\n'

The output is here:

ghi
abc
Ghi
123
#def5
123z

The third step is to invoke awk and check for words that match the regular expression ^[a-zA-Z]+, which matches any string consisting of one or more uppercase and/or lowercase letters (and nothing else):

if($0 ~ /^[a-zA-Z]+$/) { print $0 }

The output is here:

ghi
abc
Ghi

Finally, if you also want to sort the output and print only the unique words, redirect the output from the awk command to the sort command and the uniq command.

The second awk command uses the regular expression ^[0-9]+ to check for integers and the third awk command uses the regular expression ^[0-9a-zA-Z]+ to check for alphanumeric words. The output from launching Listing 5.41 is here:

Only words:
Ghi
abc
ghi
Only integers:
123
Only alphanumeric words:
123
123z
Ghi
abc
ghi

Now you can replace the variable x with a dataset in order to retrieve only alphabetic strings from that dataset.

Working with Multiline Records in awk

Listing 5.42 displays the contents of employees.txt and Listing 5.43 displays the contents of employees.sh, which illustrates how to concatenate related text lines in a file.

LISTING 5.42 employees.txt
Name:  Jane Edwards
EmpId: 12345
Address: 123 Main Street Chicago Illinois
Name:  John Smith
EmpId: 23456
Address: 432 Lombard Avenue SF California
LISTING 5.43 employees.sh
inputfile="employees.txt"
outputfile="employees2.txt"
awk '
{
  if($0 ~ /^Name:/) {
    x = substr($0,8) ","
    next
  }
  if($0 ~ /^EmpId:/) {
    #skip the EmpId data row
    #x = x substr($0,7) ","
    next
  }
  if($0 ~ /^Address:/) {
    x = x substr($0,9)
    print x
  }
}
' < $inputfile > $outputfile

The output from launching the code in Listing 5.43 is here:

Jane Edwards, 123 Main Street Chicago Illinois
John Smith, 432 Lombard Avenue SF California

Now that you have seen a plethora of awk code snippets and shell scripts that illustrate various types of tasks that you can perform on files and datasets, you are ready for some use cases. The next section (which is the first use case) shows you how to split rows whose quoted fields contain embedded commas, and the second use case shows you how to manipulate date strings.

A Simple Use Case

The code sample in this section shows you how to use the awk command in order to split the comma-separated fields in the rows of a dataset, where fields can contain nested quotes of arbitrary depth.

Listing 5.44 displays the contents of the file quotes3.csv, which contains a “,” delimiter and multiple quoted fields.

LISTING 5.44 quotes3.csv
field5,field4,field3,"field2,foo,bar",field1,field6,field7,"fieldZ"
fname1,"fname2,other,stuff",fname3,"fname4,foo,bar",fname5
"lname1,a,b","lname2,c,d","lname3,e,f","lname4,foo,bar",lname5
"Outer1 "Inner "Inner "Inner C" B" A" Outer1","XYZ1,c,d","XYZ2lname3,e,f"

Listing 5.45 displays the contents of the file delim1.sh, which illustrates how to split each row of quotes3.csv into fields, treating a quoted string that contains commas as a single field (the FPAT variable used here is a gawk extension).

LISTING 5.45 delim1.sh
#inputfile="quotes1.csv"
#inputfile="quotes2.csv"
inputfile="quotes3.csv"
grep -v "^$" $inputfile | awk '
{
   print "LINE #" NR ": " $0
   printf ("-------------------------\n")
   for (i = 0; ++i <= NF;)
     printf "field #%d : %s\n", i, $i
   printf ("\n")
}' FPAT='([^,]+)|(“[^”]+”)' < $inputfile

The output from launching the shell script in Listing 5.45 is here:

LINE #1: field5,field4,field3,"field2,foo,bar",field1,field6,field7,"fieldZ"
-------------------------
field #1 : field5
field #2 : field4
field #3 : field3
field #4 : "field2,foo,bar"
field #5 : field1
field #6 : field6
field #7 : field7
field #8 : "fieldZ"

LINE #2: fname1,"fname2,other,stuff",fname3,"fname4,foo,bar",fname5
-------------------------
field #1 : fname1
field #2 : "fname2,other,stuff"
field #3 : fname3
field #4 : "fname4,foo,bar"
field #5 : fname5

LINE #3: "lname1,a,b","lname2,c,d","lname3,e,f","lname4,foo,bar",lname5
-------------------------
field #1 : "lname1,a,b"
field #2 : "lname2,c,d"
field #3 : "lname3,e,f"
field #4 : "lname4,foo,bar"
field #5 : lname5

LINE #4: "Outer1 "Inner "Inner "Inner C" B" A" Outer1","XYZ1,c,d","XYZ2lname3,e,f"
-------------------------
field #1 : "Outer1 "Inner "Inner "Inner C" B" A" Outer1"
field #2 : "XYZ1,c,d"
field #3 : "XYZ2lname3,e,f"

As you can see, the task in this section is very easily solved via the awk command.

Another Use Case

The code sample in this section shows you how to use the awk command in order to reformat the date field in a dataset and change the order of the fields in the new dataset. For example, given the following input line in the original dataset:

Jane,Smith,20140805234658

The reformatted line in the output dataset has this format:

2014-08-05 23:46:58,Jane,Smith

Listing 5.46 displays the contents of the file dates2.csv, which contains a “,” delimiter and three fields.

LISTING 5.46 dates2.csv
Jane,Smith,20140805234658
Jack,Jones,20170805234652
Dave,Stone,20160805234655
John,Smith,20130805234646
Jean,Davis,20140805234649
Thad,Smith,20150805234637
Jack,Pruit,20160805234638

Listing 5.47 displays the contents of string2date2.sh, which converts the date field to a new format and shifts the new date to the first field.

LISTING 5.47 string2date2.sh
inputfile="dates2.csv"
outputfile="formatteddates2.csv"
rm -f $outputfile; touch $outputfile
for line in `cat $inputfile`
do
  fname=`echo $line |cut -d"," -f1`
  lname=`echo $line |cut -d"," -f2`
  date1=`echo $line |cut -d"," -f3`
  # convert to new date format
  newdate=`echo $date1 | awk '{ print substr($0,1,4)"-"substr($0,5,2)"-"substr($0,7,2)" "substr($0,9,2)":"substr($0,11,2)":"substr($0,13,2) }'`
  # append newly formatted row to output file
  echo "${newdate},${fname},${lname}" >> $outputfile
done

The contents of the new dataset are here:

2014-08-05 23:46:58,Jane,Smith
2017-08-05 23:46:52,Jack,Jones
2016-08-05 23:46:55,Dave,Stone
2013-08-05 23:46:46,John,Smith
2014-08-05 23:46:49,Jean,Davis
2015-08-05 23:46:37,Thad,Smith
2016-08-05 23:46:38,Jack,Pruit
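
As an aside, the same transformation can be performed in a single awk pass, avoiding the per-line cut commands; here is a minimal sketch that produces the same output:

awk -F"," '{
  d = $3
  printf("%s-%s-%s %s:%s:%s,%s,%s\n",
         substr(d,1,4), substr(d,5,2), substr(d,7,2),
         substr(d,9,2), substr(d,11,2), substr(d,13,2), $1, $2)
}' dates2.csv > formatteddates2.csv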

Summary

This chapter introduced the awk command, which is essentially an entire programming language packaged into a single Unix command.

We explored some of its built-in variables as well as conditional logic, while loops, and for loops in awk in order to manipulate the rows and columns in datasets. You then saw how to delete lines and merge lines in datasets, and also how to print the contents of a file as a single line of text. Next you learned how to use metacharacters and character sets in awk commands. You learned how to perform numeric calculations (such as addition, subtraction, multiplication, and division) in files containing numeric data, and also some numeric functions that are available in awk.

In addition, you saw how to align columns in a dataset, how to delete columns, how to select a subset of columns from a dataset, and how to work with multiline records in datasets. Finally, you saw a couple of simple use cases involving nested quotes and date formats in a structured dataset.

At this point you have all the tools necessary to do quite sophisticated data cleansing and processing, and it is strongly encouraged that you try to apply them on some task or problem of interest. The final step of the learning process is doing something real.

Real problems are what prompt thoughts like “I saw something similar once, I wonder if there is a way to . . .” or the even more common “how do I do XXX in language YYY?” You can’t ask those questions if you don’t have a sense of what is possible.

At this point there is one more thing to say: congratulations! You have completed a fast-paced yet dense book, and if you are a bash neophyte, the material will probably keep you busy for many hours. The examples in the chapters provide a solid foundation, and the Appendix contains additional examples and use cases to further illustrate how the Unix commands work together. The combined effect demonstrates that the universe of possibilities is larger than the examples in this book, and ultimately they will spark ideas in you. Good luck!