awk
This chapter introduces you to the awk
command, which is a highly versatile utility for manipulating data and restructuring datasets. In fact, this utility is so versatile that entire books have been written about it. Awk is essentially an entire programming language in a single command, which accepts standard input, produces standard output, and uses regular expressions and metacharacters in the same way other Unix commands do. This lets you plug it into other expressions and do almost anything, at the cost of adding complexity to a command string that may already be doing quite a lot. It is almost always worthwhile to add a comment when using awk;
it is so versatile that it won't be clear at a glance which of its many features you are using.
The first part of this chapter provides a very brief introduction of the awk
command. You will learn about some built-in variables for awk,
and also how to manipulate string variables using awk.
Note that some of these string-related examples can also be handled using other bash
commands.
The second part of this chapter shows you conditional logic, while
loops, and for
loops in awk
in order to manipulate the rows and columns in datasets. This section also shows you how to delete lines and merge lines in datasets, and also how to print the contents of a file as a single line of text. You will see how to “join” lines and groups of lines in datasets.
The third section contains code samples that involve metacharacters (introduced in Chapter 1) and character sets in awk
commands. You will also see how to use conditional logic in awk
commands in order to determine whether or not to print a line of text.
The fourth section illustrates how to “split” a text string that contains multiple “.” characters as a delimiter, followed by examples of awk
to perform numeric calculations (such as addition, subtraction, multiplication, and division) in files containing numeric data. This section also shows you various numeric functions that are available in awk,
and also how to print text in a fixed set of columns.
The fifth section explains how to align columns in a dataset and also how to align and merge columns in a dataset. You will see how to delete columns, how to select a subset of columns from a dataset, and how to work with multiline records in datasets. This section contains some one-line awk
commands that can be useful for manipulating the contents of datasets.
The final section of this chapter has a pair of use cases involving nested quotes and date formats in structured data sets.
The awk Command
The awk
(Aho, Weinberger, and Kernighan) command has a C-like syntax, and you can use this utility to perform very complex operations on numbers and text strings.
As a side comment, there is also the gawk
command which is GNU awk,
as well as the nawk
command which is “new” awk
(neither command is discussed in this book). One advantage of nawk
is that it allows you to set the value of an internal variable externally.
Built-In Variables That Control awk
The awk
command provides variables that you can change from their default values in order to control how awk
performs operations. Examples of such variables (and their default values) include FS (" "), RS ("\n"), OFS (" "), ORS ("\n"), SUBSEP, and IGNORECASE.
The variables FS
and RS
specify the field separator and record separator, whereas the variables OFS
and ORS
specify the output field separator and the output record separator, respectively.
You can think of the field separators as delimiters/IFS we used in other commands earlier. The record separators behave in a way similar to how sed
treats individual lines—for example sed
can match or delete a range of lines instead of matching or deleting something that matches a regular expression (and the default awk
record separator is the newline character, so by default awk
and sed
have similar ability to manipulate and reference lines in a text file).
As a simple example, you can print a blank line after each line of a file by changing the ORS,
from the default of one newline to two newlines, as shown here:
cat columns.txt | awk 'BEGIN { ORS ="\n\n" } ; { print $0 }'
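The output separators work the same way in reverse. Here is a minimal sketch (assuming the default FS): reassigning any field, even to itself, forces awk to rebuild the record using OFS, which converts space-separated input to comma-separated output:
# Reassigning a field ($1=$1) makes awk rebuild $0 with OFS:
echo "a b c" | awk 'BEGIN { OFS="," } { $1=$1; print }'
# output: a,b,c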
Other built-in variables include FILENAME
(the name of the file that awk
is currently reading), FNR
(the current record number in the current file), NF
(the number of fields in the current input record), and NR
(the number of input records awk
has processed since the beginning of the program’s execution).
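A minimal sketch of the difference between FNR and NR, assuming file1.txt and file2.txt are hypothetical two-line files:
# FNR restarts at 1 for each input file; NR keeps counting across files
awk '{ print FILENAME, FNR, NR }' file1.txt file2.txt
# file1.txt 1 1
# file1.txt 2 2
# file2.txt 1 3
# file2.txt 2 4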
Consult the online documentation for additional information regarding these (and other) arguments for the awk
command.
How Does the awk Command Work?
The awk
command reads the input files one record at a time (by default, one record is one line). If a record matches a pattern, then an action is performed (otherwise no action is performed). If the search pattern is not given, then awk
performs the given actions for each record of the input. The default behavior if no action is given is to print all the records that match the given pattern. Finally, empty braces without any action do nothing; that is, they will not perform the default printing operation. Note that each statement in actions should be delimited by semicolons.
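As a quick sketch of these rules (logfile.txt is a hypothetical input file):
awk '/error/' logfile.txt               # no action: prints matching lines (default)
awk '/error/ { print $0 }' logfile.txt  # equivalent, with an explicit action
awk '/error/ { }' logfile.txt           # empty action: prints nothing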
In order to make the preceding paragraph more concrete, here are some simple examples involving text strings and the awk
command (the results are displayed after each code snippet). The -F
switch sets the field separator to whatever follows it, in this case a space. Switches will often provide a shortcut to an action that normally needs a command inside a “BEGIN{}
” block:
x="a b c d e" echo $x |awk -F" " '{print $1}' a echo $x |awk -F" " '{print NF}' 5 echo $x |awk -F" " '{print $0}' a b c d e echo $x |awk -F" " '{print $3, $1}' c a
Now let's change the FS (field separator) to an empty string to calculate the length of a string, this time using the BEGIN{}
syntax:
echo "abc" | awk 'BEGIN { FS = "" } ; { print NF }' 3
The following example illustrates several equivalent ways to specify test.txt
as the input file for an awk
command:
awk < test.txt '{ print $1 }'
awk '{ print $1 }' < test.txt
awk '{ print $1 }' test.txt
Yet another way is shown here (but as we’ve discussed earlier, it can be inefficient, so only do it if the cat
command is adding value in some way):
cat test.txt | awk '{ print $1 }'
This simple example of four ways to do the same task should illustrate why commenting awk
calls of any complexity is almost always a good idea. The next person to look at your code may not know/remember the syntax you are using.
The printf Command
Since awk
is a programming language inside a single command, it also has its own way of producing formatted output via the printf
command.
Listing 5.1 displays the contents of columns2.txt
and Listing 5.2 displays the contents of the shell script AlignColumns1.sh,
which shows you how to align the columns in a text file.
one two
three four
one two three four
five six
one two three
four five
awk '
{
  # left-align $1 on a 10-char column
  # right-align $2 on a 10-char column
  # right-align $3 on a 10-char column
  # right-align $4 on a 10-char column
  printf("%-10s*%10s*%10s*%10s*\n", $1, $2, $3, $4)
}
' columns2.txt
Listing 5.2 contains a printf()
statement that displays the first four fields of each row in the file columns2.txt,
where each field is 10 characters wide.
The output from launching the code in Listing 5.2 is here:
one       *       two*          *          *
three     *      four*          *          *
one       *       two*     three*      four*
five      *       six*          *          *
one       *       two*     three*          *
four      *      five*          *          *
Keep in mind that printf
is reasonably powerful and as such has its own syntax, which is beyond the scope of this chapter. A search online can find the manual pages and also discussions of “how to do X with printf().”
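As a small taste of that syntax, the following sketch shows field widths, left-alignment, and floating-point precision in a single format string:
awk 'BEGIN { printf("%5d|%-8s|%6.2f\n", 42, "abc", 3.14159) }'
# output:    42|abc     |  3.14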
Like other programming languages, awk
provides support for conditional logic (if/else) and control statements (for/while loops). awk
is the only way to put conditional logic inside a piped command stream without creating, installing, and adding to the path a custom executable shell script. The following code block shows you how to use if/else logic:
echo "" | awk ' BEGIN { x = 10 } { if (x % 2 == 0) } print "x is even" } else } print "x is odd" } } '
The preceding code block initializes the variable x
with the value 10 and prints “x is even
” if x is divisible by 2, otherwise it prints “x is odd
”.
The while Statement
The following code block illustrates how to use a while
loop in awk:
echo "" | awk ' { x = 0 while(x < 4) { print "x:",x x = x + 1 } } '
The preceding code block generates the following output:
x: 0
x: 1
x: 2
x: 3
The following code block illustrates how to use a do while
loop in awk:
echo "" | awk ' { x = 0 do { print "x:",x x = x + 1 } while(x < 4) } '
The preceding code block generates the following output:
x: 0
x: 1
x: 2
x: 3
A for Loop in awk
Listing 5.3 displays the contents of Loop.sh
, which illustrates how to print a list of numbers in a loop. Note that "i++" is another way of writing "i=i+1" in awk (and most C-derived languages).
awk '
BEGIN {}
{
  for(i=0; i<5; i++) {
    printf("%3d", i)
  }
}
END { print "\n" }
'
Listing 5.3 contains a for
loop that prints numbers on the same line via the printf()
statement. Notice that a newline is printed only in the END
block of the code. The output from Listing 5.3 is here:
0 1 2 3 4
The break Statement
The following code block illustrates how to use a break
statement in a for
loop in awk
:
echo "" | awk ' { for(x=1; x<4; x++) { print "x:",x if(x == 2) { break; } } } '
The preceding code block prints output only until the variable x
has the value 2, after which the loop exits (because of the break inside the conditional logic). The following output is displayed:
x: 1
x: 2
The next and continue Statements
The following code snippet illustrates how to use next
in an awk script:
awk '
/expression1/ { var1 = 5; next }
/expression2/ { var2 = 7; next }
/expression3/ { next }
# some other code block here
' somefile
When the current line matches expression1,
then var1
is assigned the value 5 and awk
reads the next input line: hence, expression2
and expression3
will not be tested. If expression1
does not match and expression2
does match, then var2
is assigned the value 7 and then awk
will read the next input line. If only expression3
results in a positive match, then awk
skips the remaining block of code and processes the next input line. Note that continue itself is valid only inside a for or while loop, where it skips to the next iteration of that loop; at the top level of an awk script, next is the statement that skips to the next input line (which is why the third rule above uses next).
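Here is a runnable sketch of next: lines matching the first pattern are skipped entirely, so they never reach the second rule:
printf "a\n# skip me\nb\n" | awk '/^#/ { next } { print "kept:", $0 }'
# kept: a
# kept: b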
Listing 5.4 displays the contents of linepairs.csv
and Listing 5.5 displays the contents of deletelines.sh,
which illustrates how to print alternating lines from the dataset linepairs.csv.
a,b,c,d
e,f,g,h
1,2,3,4
5,6,7,8
inputfile="linepairs.csv" outputfile="linepairsdeleted.csv" awk ' NR%2 {printf "%s", $0; print ""; next}' < $inputfile > $outputfile
Listing 5.5 checks whether the current record number NR
is odd (the pattern NR%2 is true when the remainder is 1), in which case it prints the current line; the even-numbered lines are skipped. The output is redirected to the specified output file, the contents of which are here:
a,b,c,d
1,2,3,4
A slightly more common task involves merging consecutive lines, which is the topic of the next section.
Listing 5.6 displays the contents of columns.txt
and Listing 5.7 displays the contents of ColumnCount1.sh,
which illustrates how to print the lines from the text file columns.txt
that have exactly two columns.
one two three
one two
one two three four
one
one three
one four
awk '
{
  if( NF == 2 ) { print $0 }
}
' columns.txt
Listing 5.7 is straightforward: if the current line contains exactly two fields (NF == 2), then the line is printed (i.e., lines with any other number of fields are skipped). The output from launching the code in Listing 5.7 is here:
one two
one three
one four
If you want to display the lines that do not contain 2 columns, use the following code snippet:
if( NF != 2 ) { print $0 }
The contents of test4.txt
are here (note the blank lines):
abc
def

abc

abc
The following code snippet illustrates how to print the contents of test4.txt
as a single line:
awk '{printf("%s", $0)}' test4.txt
The output of the preceding code snippet is here. See if you can tell what is happening before reading the explanation in the next paragraph:
abcdefabcabc
Explanation: %s
is the printf format conversion for a string, and because there is no newline before the ending quote, printf writes nothing between records. The default record separator for awk
is \n (newline); each record arrives with its newline already removed, and printf never adds one back, so the effect is to strip out all the newlines. The blank rows vanish entirely, as all they contain is a newline, so the result is that any actual text is merged together with nothing in between. Had we added a space between the %s
and the ending quote, there would be a space between each character block, plus an extra space for each blank line.
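To see the difference, run the variant with a trailing space against the same test4.txt; each blank line contributes one extra space:
awk '{printf("%s ", $0)}' test4.txt
# output (note the doubled spaces where the blank lines were):
# abc def  abc  abc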
Notice how the following comment helps the comprehension of the code snippet:
# Merging all text into a single line by removing the newlines awk '{printf("%s", $0)}' test4.txt
Listing 5.8 displays the contents of digits.txt
and Listing 5.9 displays the contents of digits.sh,
which “joins” three consecutive lines of text in the file digits.txt.
1
2
3
4
5
6
7
8
9
awk -F" " '{ printf("%d",$0) if(NR % 3 == 0) { printf("\n") } }' digits.txt
Listing 5.9 prints three consecutive lines of text on the same line, after which a linefeed is printed. This has the effect of “joining” every three consecutive lines of text. The output from launching digits.sh
is here:
123
456
789
Listing 5.10 displays the contents of columns2.txt
and Listing 5.11 displays the contents of JoinLines.sh,
which “joins” two consecutive lines of text in the file columns2.txt.
one two
three four
one two three four
five six
one two three
four five
awk ' { printf("%s",$0) if( $1 !~ /one/) { print " " } } ' columns2.txt
The output from launching Listing 5.11 is here:
one two three four
one two three four five six
one two three four five
Notice that the code in Listing 5.11 depends on the presence of the string “one” as the first field in alternating lines of text—we are merging based on matching a simple pattern, instead of tying it to record combinations.
To merge each pair of lines instead of merging based on matching a pattern, use the modified code in Listing 5.12.
awk '
BEGIN { count = 0 }
{
  printf("%s",$0)
  if( ++count % 2 == 0) {
    print " "
  }
}
' columns2.txt
Yet another way to "join" consecutive lines is shown in Listing 5.13, where the input file and output file refer to files that you can populate with data. This is another example of an awk
command that might be a puzzle if encountered in a program without a comment. It is doing exactly the same thing as Listing 5.12, but its purpose is less obvious because of the more compact syntax: the pattern NR%2 selects the odd-numbered lines, whose action prints them without a newline, and the trailing 1 is a pattern that is always true with no action, so awk applies the default action (print) to the even-numbered lines, which supplies the newline.
inputfile="linepairs.csv" outputfile="linepairsjoined.csv" awk ' NR%2 {printf "%s,", $0; next;}1' < $inputfile > $outputfile
If we can match a simple pattern, by now you probably expect that you can also match a regular expression, just as we did in grep
and sed.
Listing 5.14 displays the contents of Patterns1.sh,
which uses metacharacters to match the beginning and the end of a line of text in the file columns2.txt.
awk '
/^f/    { print $1 }
/two $/ { print $1 }
' columns2.txt
The output from launching Listing 5.14 is here:
one
five
four
Listing 5.15 displays the contents of columns3.txt
with lines that contain a different number of columns.
123 one two
456 three four
one two three four
five 123 six
one two three four five
Listing 5.16 displays the contents of MatchAlpha1.sh,
which matches lines that start with a digit, as well as lines whose first field is alphabetic and whose second field is numeric.
awk '
{
  if( $0 ~ /^[0-9]/) { print $0 }
  if( $0 ~ /^[a-z]+ [0-9]/) { print $0 }
}
' columns3.txt
The output from Listing 5.16 is here:
123 one two
456 three four
five 123 six
Listing 5.17 displays the contents of products.txt,
which contains three columns of information.
MobilePhone 400 new
Tablet 300 new
Tablet 300 used
MobilePhone 200 used
MobilePhone 100 used
The following code snippet prints the lines of text in products.txt
whose second column is greater than 300:
awk '$2 > 300' products.txt
The output of the preceding code snippet is here:
MobilePhone 400 new
The following code snippet prints the lines of text in products.txt
whose third column equals "new":
awk '($3 == "new")' products.txt
The output of the preceding code snippet is here:
MobilePhone 400 new
Tablet 300 new
The following code snippet prints the first and third columns of the lines of text in products.txt
whose cost equals 300:
awk ' $2 == 300 { print $1, $3 }' products.txt
The output of the preceding code snippet is here:
Tablet new
Tablet used
The following code snippet prints the first and third columns of the lines of text in products.txt
that start with the string Tablet:
awk '/^Tablet/ { print $1, $3 }' products.txt
The output of the preceding code snippet is here:
Tablet new
Tablet used
Splitting Filenames with awk
Listing 5.18 displays the contents of SplitFilename2.sh,
which illustrates how to split a filename containing the “.” character in order to increment the numeric value of one of the components of the filename. Note that this code only works for a file name with exactly the expected syntax. It is possible to write more complex code to count the number of segments, or alternately to just say “change the field right before the .zip”, which would only require that the filename have a format matching the final two sections (<anystructure>.number.zip).
echo "05.20.144q.az.1.zip" | awk -F"." ' { f5=$5 + 1 printf("%s.%s.%s.%s.%s.%s",$1,$2,$3,$4,f5,$6) }'
The output from Listing 5.18 is here:
05.20.144q.az.2.zip
Listing 5.19 displays the contents of mixednumbers.txt
that contain postfix operators, which means numbers where the negative (and/or positive) sign appears at the end of a column value instead of at the beginning of the number.
324.000-|10|983.000-
453.000-|30|298.000-
783.000-|20|347.000-
Listing 5.20 displays the contents of AddSubtract1.sh,
which illustrates how to add the rows of numbers in Listing 5.19.
myFile="mixednumbers.txt" awk -F"|" ' BEGIN { line = 0; total = 0 } { split($1, arr, "-") f1 = arr[1] if($1 ~ /-/) { f1 = -f1 } line += f1 split($2, arr, "-") f2 = arr[1] if($2 ~ /-/) { f2 = -f2 } line += f2 split($3, arr, "-") f3 = arr[1] if($3 ~ /-/) { f3 = -f3 } line += f3 printf("f1: %d f2: %d f3: %d line: %d\n",f1,f2,f3, line) total += line line = 0 } END { print "Total: ",total } ' $myfile
The output from Listing 5.20 is here. See if you can work out what the code is doing before reading the explanation that follows:
f1: -324 f2: 10 f3: -983 line: -1297
f1: -453 f2: 30 f3: -298 line: -721
f1: -783 f2: 20 f3: -347 line: -1110
Total:  -3128
The code assumes we know the format of the file. For each field in a given record, the split
function divides the field on the "-" character and stores the pieces in the array arr, so arr[1] holds the numeric portion (split itself returns the number of pieces). The if
statement simply checks whether the original field contains a dash: if it does, the numeric variable is negated, otherwise it is left alone. The line total is then accumulated.
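An alternative sketch of the same idea: sub() returns the number of substitutions it performs, so removing a trailing dash and capturing the sign can be combined in one step:
echo "324.000-" | awk '{ s = sub(/-$/, "", $1) ? -1 : 1; print s * $1 }'
# output: -324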
Numeric Functions in awk
The int(x)
function returns the integer portion of a number. If the number is not already an integer, it falls between two integers. Of the two possible integers, the function will return the one closest to zero. This is different from a rounding function, which chooses the closer integer.
For example, int(3)
is 3, int(3.9)
is 3, int(-3.9)
is -3, and int(-3)
is -3
as well. An example of the int(x)
function in an awk
command is here:
awk 'BEGIN { print int(3.534); print int(4); print int(-5.223); print int(-5); }'
The output is here:
3
4
-5
-5
The exp(x)
function gives you the exponential of x
, or reports an error if x is out of range. The range of values x
can have depends on your machine’s floating point representation.
awk 'BEGIN{ print exp(123434346); print exp(0); print exp(-12); }'
The output is here:
inf
1
6.14421e-06
The log(x)
function gives you the natural logarithm of x,
if x
is positive; otherwise, it reports an error (inf
means infinity and nan
in output means “not a number”).
awk 'BEGIN{ print log(12); print log(0); print log(1); print log(-1); }'
The output is here:
2.48491
-inf
0
nan
The sin(x)
function gives you the sine of x
and cos(x)
gives you the cosine of x,
with x
in radians:
awk 'BEGIN { print cos(90); print cos(45); }'
The output is here:
-0.448074
0.525322
The rand()
function gives you a random number. The values of rand()
are uniformly distributed between 0 and 1: the value is never 0 and never 1.
Often you want random integers instead. Here is a user-defined function you can use to obtain a random, nonnegative integer less than n:
function randint(n) {
  return int(n * rand())
}
The product produces a random real number greater than 0 and less than n. We then make it an integer (using int) between 0 and n - 1.
Here is an example where a similar function is used to produce random integers between 1 and n:
awk '
# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }

# Roll 3 six-sided dice and print total number of points.
{
  printf("%d points\n", roll(6)+roll(6)+roll(6))
}'
Note that rand
starts generating numbers from the same point (or “seed”) each time awk
is invoked. Hence, a program will produce the same results each time it is launched. If you want a program to do different things each time it is used, you must change the seed to a value that will be different in each run.
Use the srand(x)
function to set the starting point, or seed, for generating random numbers to the value x.
Each seed value leads to a particular sequence of “random” numbers. Thus, if you set the seed to the same value a second time, you will get the same sequence of “random” numbers again. If you omit the argument x, as in srand()
, then the current date and time of day are used for a seed. This is how to obtain random numbers that are truly unpredictable. The return value of srand()
is the previous seed. This makes it easy to keep track of the seeds for use in consistently reproducing sequences of random numbers.
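A minimal sketch, seeding once from the time of day so that each run produces different values (two runs within the same second can still repeat, since the default seed has one-second granularity):
awk 'BEGIN { srand(); print rand(); print rand() }'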
The time()
function (not in all versions of awk
) returns the current time in seconds since January 1, 1970. The function ctime
(not in all versions of awk
) takes a numeric argument in seconds and returns a string representing the corresponding date, suitable for printing or further processing.
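In GNU awk, the corresponding built-ins are systime() and strftime(); here is a gawk-specific sketch (not portable to POSIX awk):
gawk 'BEGIN { now = systime(); print strftime("%Y-%m-%d %H:%M:%S", now) }'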
The sqrt(x)
function gives you the positive square root of x.
It reports an error if x is negative. Thus, sqrt(4)
is 2.
awk 'BEGIN{ print sqrt(16); print sqrt(0); print sqrt(-12); }'
The output is here:
4
0
nan
One-Line awk Commands
The code snippets in this section reference the text file short1.txt,
which you can populate with any data of your choice.
The following code snippet prints each line preceded by the number of fields in each line:
awk '{print NF ":" $0}' short1.txt
Print the right-most field in each line:
awk '{print $NF}' short1.txt
Print the lines that contain more than 2 fields:
awk '{if(NF > 2) print }' short1.txt
Print the value of the right-most field if the current line contains more than 2 fields:
awk '{if(NF > 2) print $NF }' short1.txt
Remove leading and trailing whitespaces:
echo " a b c " | awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
Print the first and third fields in reverse order for the lines that contain at least 3
fields:
awk '{if(NF > 2) print $3, $1}' short1.txt
Print the lines that contain the string one:
awk '{if(/one/) print }' *txt
As you can see from the preceding code snippets, it’s easy to extract information or subsets of rows and columns from text files using simple conditional logic and built-in variables in the awk
command.
Useful Short awk Scripts
This section contains a set of short awk
-based scripts for performing various operations. Some of these scripts can also be used in other shell scripts to perform more complex operations. Listing 5.21 displays the contents of the file data.txt
, which is used in various code samples in this section.
this is line one that contains more than 40 characters
this is line two
this is line three that also contains more than 40 characters
four

this is line six and the preceding line is empty

line eight and the preceding line is also empty
The following code snippet prints every line that is longer than 40 characters:
awk 'length($0) > 40' data.txt
Now print the length of the longest line in data.txt:
awk '{ if (x < length()) x = length() }
END { print "maximum line length is " x }' < data.txt
If the input is first processed by the expand utility to change tabs into spaces, the widths compared are the actual right-margin columns.
Print every line that has at least one field:
awk 'NF > 0' data.txt
The preceding code snippet illustrates an easy way to delete blank lines from a file (or rather, to create a new file similar to the old file but from which the blank lines have been removed).
Print seven random numbers from 0 to 100, inclusive:
awk 'BEGIN { for (i = 1; i <= 7; i++) print int(101 * rand()) }'
Count the lines in a file:
awk 'END { print NR }' < data.txt
Print the even-numbered lines in the data file:
awk 'NR % 2 == 0' data.txt
If you use the expression 'NR % 2 == 1'
in the previous code snippet, the program would print the odd-numbered lines.
Insert a duplicate of every line in a text file:
awk '{print $0, "\n", $0}' < data.txt
Insert a duplicate of every line in a text file and also remove blank lines:
awk '{print $0, "\n", $0}' < data.txt | awk 'NF > 0'
Insert a blank line after every line in a text file:
awk '{print $0, "\n"}' < data.txt
Listing 5.22 displays the contents of Fields2.sh,
which illustrates how to print the words in a text string using the awk
command.
echo "a b c d e"| awk ' { for(i=1; i<=NF; i++) { print "Field ",i,":",$i } } '
The output from Listing 5.22 is here:
Field  1 : a
Field  2 : b
Field  3 : c
Field  4 : d
Field  5 : e
Listing 5.23 and Listing 5.24 display the contents data1.csv
and data2.csv,
respectively, and Listing 5.25 displays the contents of checkrows.sh,
which illustrates how to count the number of occurrences of the string “past” in column 3 in rows 2, 5, and 7.
in,the,past,or,the,present
for,the,past,or,the,present
in,the,past,or,the,present
for,the,paste,or,the,future
in,the,past,or,the,present
completely,unrelated,line1
in,the,past,or,the,present
completely,unrelated,line2
in,the,past,or,the,present
completely,unrelated,line1
for,the,past,or,the,present
completely,unrelated,line2
for,the,paste,or,the,future
in,the,past,or,the,present
in,the,past,or,the,present
completely,unrelated,line3
files="`ls data*.csv| tr '\n' ' '`"
echo "List of files: $files"

awk -F"," '
( FNR==2 || FNR==5 || FNR==7 ) {
  if ( $3 ~ "past" ) { count++ }
}
END {
  printf "past: matched %d times (INEXACT) ", count
  printf "in field 3 in lines 2/5/7\n"
}' data*.csv
Listing 5.25 looks for occurrences of the string past
in field 3 of rows 2, 5, and 7 because of the following code snippet:
( FNR==2 || FNR==5 || FNR==7 ) { if ( $3 ~ "past" ) { count++ } }
If a match occurs, then the value of count is incremented. The END
block reports the number of times that the string past was found in field 3 of rows 2, 5, and 7. Note that strings such as paste
and pasted
will match the string past. The output from Listing 5.25 is here:
List of files: data1.csv data2.csv
past: matched 5 times (INEXACT) in field 3 in lines 2/5/7
The shell script checkrows2.sh
replaces the term $3 ~ "past"
with the term $3 == "past"
in checkrows.sh
in order to check for exact matches, which produces the following output:
List of files: data1.csv data2.csv
past: matched 4 times (EXACT) in field 3 in lines 2/5/7
Listing 5.26 displays the contents of FixedFieldCount1.sh,
which illustrates how to print the words of a text string in a fixed number of columns using the awk
command.
echo "aa bb cc dd ee ff gg hh"| awk ' BEGIN { colCount = 3 } { for(i=1; i<=NF; i++) { printf("%s ", $i) if(i % colCount == 0) { print " " } } } '
The output from Listing 5.26 is here:
aa bb cc
dd ee ff
gg hh
Listing 5.27 displays the contents of VariableColumns.txt
with lines of text that contain a different number of columns.
this is line one
this is line number one
this is the third and final line
Listing 5.28 displays the contents of Fields3.sh,
which illustrates how to print the contents of a file in a fixed number of columns using the awk
command.
awk '{printf("%s ", $0)}' | awk ' BEGIN { columnCount = 3 } { for(i=1; i<=NF; i++) { printf("%s ", $i) if( i % columnCount == 0 ) print " " } } ' VariableColumns.txt
The output from Listing 5.28 is here:
this is line
one this is
line number one
this is the
third and final
line
If you have read the preceding two examples, the code sample in this section is easy to understand: you will see how to realign columns of data that are correct in terms of their content, but have been placed in different rows (and therefore are misaligned). Listing 5.29 displays the contents of mixed-data.csv
with misaligned data values. In addition, the first line and final line in Listing 5.29 are empty lines, which will be removed by the shell script in this section.

Sara, Jones, 1000, CA, Sally, Smith, 2000, IL,
Dave, Jones, 3000, FL,
John, Jones, 4000,
CA, Dave, Jones, 5000, NY,
Mike, Jones, 6000, NY, Tony, Jones, 7000, WA

Listing 5.30 displays the contents of mixed-data.sh,
which illustrates how to realign the dataset in Listing 5.29.
#-----------------------------------------
# 1) remove blank lines
# 2) remove line feeds
# 3) print a LF after every fourth field
# 4) remove trailing ',' from each row
#-----------------------------------------
inputfile="mixed-data.csv"
outputfile="aligned-data.csv"

grep -v "^$" $inputfile |awk -F"," '{printf("%s",$0)}' | awk '
BEGIN { columnCount = 4 }
{
  for(i=1; i<=NF; i++) {
    printf("%s ", $i)
    if( i % columnCount == 0) { print "" }
  }
}' > temp-columns

# 4) remove trailing ',' from output:
cat temp-columns | sed 's/, $//' | sed 's/ $//' > $outputfile
Listing 5.30 starts with a grep
command that removes blank lines, followed by an awk
command that prints the rows of the dataset as a single line of text. The second awk
command initializes the columnCount
variable with the value 4 in the BEGIN
block, followed by a for
loop that iterates through the input fields. After four fields are printed on the same output line, a linefeed is printed, which has the effect of realigning the input dataset as an output dataset consisting of rows that have four fields. The output from Listing 5.30 is here:
Sara, Jones, 1000, CA
Sally, Smith, 2000, IL
Dave, Jones, 3000, FL
John, Jones, 4000, CA
Dave, Jones, 5000, NY
Mike, Jones, 6000, NY
Tony, Jones, 7000, WA
The preceding section showed you how to realign a dataset so that each row contains the same number of columns and also represents a single data record. The code sample in this section illustrates how to realign columns of data that are correct in terms of their content, and also place two records in each line of the new dataset. Listing 5.31 displays the contents of mixed-data2.csv
with misaligned data values, followed by Listing 5.32 that displays the contents of aligned-data2.csv
with the correctly formatted dataset.

Sara, Jones, 1000, CA, Sally, Smith, 2000, IL,
Dave, Jones, 3000, FL, John,
Jones, 4000, CA,
Dave, Jones, 5000, NY, Mike, Jones, 6000, NY,
Tony, Jones, 7000, WA

Sara, Jones, 1000, CA, Sally, Smith, 2000, IL
Dave, Jones, 3000, FL, John, Jones, 4000, CA
Dave, Jones, 5000, NY, Mike, Jones, 6000, NY
Tony, Jones, 7000, WA
Listing 5.33 displays the contents of mixed-data2.sh,
which illustrates how to realign the dataset in Listing 5.31.
#-----------------------------------------
# 1) remove blank lines
# 2) remove line feeds
# 3) print a LF after every 8 fields
# 4) remove trailing ',' from each row
#-----------------------------------------
inputfile="mixed-data2.csv"
outputfile="aligned-data2.csv"

grep -v "^$" $inputfile |awk -F"," '{printf("%s",$0)}' | awk '
BEGIN { columnCount = 4; rowCount = 2; currRow = 0 }
{
  for(i=1; i<=NF; i++) {
    printf("%s ", $i)
    if( i % columnCount == 0) { ++currRow }
    if(currRow > 0 && currRow % rowCount == 0) { currRow = 0; print "" }
  }
}' > temp-columns

# 4) remove trailing ',' from output:
cat temp-columns | sed 's/, $//' | sed 's/ $//' > $outputfile
Listing 5.33 is very similar to Listing 5.30. The key idea is to print a linefeed character after a pair of "normal" records have been processed, which is implemented via the rowCount and currRow logic in the second awk command.
Now you can generalize Listing 5.33 very easily by changing the initial value of the rowCount
variable to any other positive integer, and the code will work correctly without any further modification. For example, if you initialize rowCount
to the value 5,
then every row in the new dataset (with the possible exception of the final output row) will contain 5 “normal” data records.
Listing 5.34 displays the contents of VariableColumns.txt
with lines of text that contain a different number of columns.
this is line one
this is line number one
this is the third and final line
Listing 5.35 displays the contents of RemoveColumn.sh
that removes the first column from a text file.
awk '{
  for (i=2; i<=NF; i++) printf "%s ", $i;
  printf "\n";
}' products.txt
The loop is between 2 and NF,
which iterates over all the fields except for the first field. In addition, printf
explicitly adds newlines. The output of the preceding code snippet is here:
400 new
300 new
300 used
200 used
100 used
The preceding sections showed you how to align the rows of a dataset, and the code sample in this section illustrates how to extract a subset of the existing columns and a subset of the rows. Listing 5.36 displays the contents of sub-rows-cols.txt,
the desired dataset, which contains two columns from every other row of the file aligned-data.csv.
Sara, 1000
Dave, 3000
Dave, 5000
Tony, 7000
Listing 5.37 displays the contents of sub-rows-cols.sh,
which illustrates how to generate the dataset in Listing 5.36. Most of the code is the same as Listing 5.33, with the new code shown in bold.
#-----------------------------------------
# 1) remove blank lines
# 2) remove line feeds
# 3) print a LF after every fourth field
# 4) remove trailing ',' from each row
#-----------------------------------------
inputfile="mixed-data.csv"
outputfile="sub-rows-cols.txt"

grep -v "^$" $inputfile |awk -F"," '{printf("%s",$0)}' | awk '
BEGIN { columnCount = 4 }
{
  for(i=1; i<=NF; i++) {
    printf("%s ", $i)
    if( i % columnCount == 0) { print "" }
  }
}' > temp-columns

# 4) remove trailing ',' from output:
cat temp-columns | sed 's/, $//' | sed 's/ $//' > temp-columns2

cat temp-columns2 | awk '
BEGIN { rowCount = 2; currRow = 0 }
{
  if(currRow % rowCount == 0) { print $1, $3 }
  ++currRow
}' > temp-columns3

cat temp-columns3 | sed 's/,$//' | sed 's/ $//' > $outputfile
Listing 5.37 contains a new block of code that redirects the output of step #4 to a temporary file temp-columns2
whose contents are processed by another awk
command in the last section of Listing 5.37. Notice that the awk
command contains a BEGIN
block that initializes the variables rowCount
and currRow
with the values 2 and 0, respectively.
The main block prints columns 1 and 3 of the current line if the current row number is even, and then the value of currRow
is incremented. The output of this awk
command is redirected to yet another temporary file that is the input to the final code snippet, which uses the cat
command and two occurrences of the sed
command in order to remove a trailing “,” and a trailing space, as shown here:
cat temp-columns3 | sed 's/,$//' | sed 's/ $//' > $outputfile
Keep in mind that there are other ways to perform the functionality in Listing 5.37, and the main purpose is to show you different techniques for combining various bash commands.
Listing 5.38 displays the contents of WordCounts1.sh,
which illustrates how to count the frequency of words in a file.
awk '
# Print list of word frequencies
{
  for (i = 1; i <= NF; i++)
    freq[$i]++
}

END {
  for (word in freq)
    printf "%s\t%d\n", word, freq[word]
}
' columns2.txt
Listing 5.38 contains a block of code that processes the lines in columns2.txt.
Each time that a word (of a line) is encountered, the code increments the number of occurrences of that word in the hash table freq.
The END
block contains a for
loop that displays the number of occurrences of each word in columns2.txt.
The output from Listing 5.38 is here:
two	3
one	3
three	3
six	1
four	3
five	2
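The iteration order of for (word in freq) is unspecified, so the lines above can appear in any order. If you want the most frequent words first, a quick sketch is to print the count first and pipe the output through sort:
awk '{ for (i=1; i<=NF; i++) freq[$i]++ }
END { for (w in freq) print freq[w], w }' columns2.txt | sort -rn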
Listing 5.39 displays the contents of WordCounts2.sh,
which performs a case insensitive word count.
awk '
{
  # convert everything to lower case
  $0 = tolower($0)

  # remove punctuation
  #gsub(/[^[:alnum:]_[:blank:]]/, "", $0)

  for(i=1; i<=NF; i++) {
    freq[$i]++
  }
}

END {
  for(word in freq) {
    printf "%s\t%d\n", word, freq[word]
  }
}
' columns4.txt
Listing 5.39 is almost identical to Listing 5.38, with the addition of the following code snippet that converts the text in each input line to lowercase letters, as shown here:
$0 = tolower($0)
Listing 5.40 displays the contents of columns4.txt.
123 ONE TWO
456 three four
ONE TWO THREE FOUR five
123 six
one two three four five
The output from launching Listing 5.39 with columns4.txt
is here:
456	1
two	3
one	3
three	3
six	1
123	2
four	3
five	2
For simplicity, let's work with a text string so that we can see the intermediate results as we work toward the solution. This example will be familiar from prior chapters, but now we see how awk
does it.
Listing 5.41 displays the contents of onlywords.sh,
which contains three awk
commands for displaying the words, integers, and alphanumeric strings, respectively, in a text string.
x="ghi abc Ghi 123 #def5 123z" echo "Only words:" echo $x |tr -s ' ' '\n' | awk -F" " ' { if($0 ~ /^[a-zA-Z]+$/) { print $0 } } ' | sort | uniq echo echo "Only integers:" echo $x |tr -s ' ' '\n' | awk -F" " ' { if($0 ~ /^[0-9]+$/) { print $0 } } ' | sort | uniq echo echo "Only alphanumeric words:" echo $x |tr -s ' ' '\n' | awk -F" " ' { if($0 ~ /^[0-9a-zA-Z]+$/) { print $0 } } ' | sort | uniq echo
Listing 5.41 starts by initializing the variable x:
x="ghi abc Ghi 123 #def5 123z"
The next step is to split x into words:
echo $x |tr -s ' ' '\n'
The output is here:
ghi
abc
Ghi
123
#def5
123z
The third step is to invoke awk
and check for words that match the regular expression ^[a-zA-Z]+,
which matches any string consisting of one or more uppercase and/or lowercase letters (and nothing else):
if($0 ~ /^[a-zA-Z]+$/) { print $0 }
The output is here:
ghi
abc
Ghi
Finally, if you also want to sort the output and print only the unique words, redirect the output from the awk
command to the sort
command and the uniq
command.
The second awk
command uses the regular expression ^[0-9]+
to check for integers and the third awk
command uses the regular expression ^[0-9a-zA-Z]+
to check for alphanumeric words. The output from launching Listing 5.41 is here:
Only words:
Ghi
abc
ghi
Only integers:
123
Only alphanumeric words:
123
123z
Ghi
abc
ghi
Now you can replace the variable x
with a dataset in order to retrieve only alphabetic strings from that dataset.
Working with Multiline Records in awk
Listing 5.42 displays the contents of employees.txt
and Listing 5.43 displays the contents of Employees.sh
, which illustrates how to concatenate text lines in a file.
Name: Jane Edwards
EmpId: 12345
Address: 123 Main Street Chicago Illinois

Name: John Smith
EmpId: 23456
Address: 432 Lombard Avenue SF California
inputfile="employees.txt" outputfile="employees2.txt" awk ' { if($0 ~ /^Name:/) { x = substr($0,8) "," next } if( $0 ~ /^Empid:/) { #skip the Empid data row #x = x substr($0,7) "," next } if($0 ~ /^Address:/) { x = x substr($0,9) print x } } ' < $inputfile > $outputfile
The output from launching the code in Listing 5.43 is here:
Jane Edwards, 123 Main Street Chicago Illinois John Smith, 432 Lombard Avenue SF California
Now that you have seen a plethora of awk
code snippets and shell scripts containing the awk
command that illustrate various types of tasks that you can perform on files and datasets, you are ready for some use cases. The next section (which is the first use case) shows you how to split comma-delimited records whose quoted fields contain embedded commas, and the second use case shows you how to manipulate date strings.
The code sample in this section shows you how to use the awk
command in order to split the comma-separated fields in the rows of a dataset, where fields can contain nested quotes of arbitrary depth.
Listing 5.44 displays the contents of the file quotes3.csv,
which contains a “,” delimiter and multiple quoted fields.
field5,field4,field3,"field2,foo,bar",field1,field6,field7,"fieldZ"
fname1,"fname2,other,stuff",fname3,"fname4,foo,bar",fname5
"lname1,a,b","lname2,c,d","lname3,e,f","lname4,foo,bar",lname5
"Outer1 "Inner "Inner "Inner C" B" A" Outer1","XYZ1,c,d","XYZ2lname3,e,f"
Listing 5.45 displays the contents of the file delim1.sh,
which illustrates how to split the rows of quotes3.csv into fields, where quoted strings (even those containing embedded "," characters) are treated as single fields.
#inputfile="quotes1.csv" #inputfile="quotes2.csv" inputfile="quotes3.csv" grep -v "^$" $inputfile | awk ' { print "LINE #" NR ": " $0 printf ("-------------------------\n") for (i = 0; ++i <= NF;) printf "field #%d : %s\n", i, $i printf ("\n") }' FPAT='([^,]+)|(“[^”]+”)' < $inputfile
The output from launching the shell script in Listing 5.45 is here:
LINE #1: field5,field4,field3,"field2,foo,bar",field1,field6,field7,"fieldZ"
-------------------------
field #1 : field5
field #2 : field4
field #3 : field3
field #4 : "field2,foo,bar"
field #5 : field1
field #6 : field6
field #7 : field7
field #8 : "fieldZ"

LINE #2: fname1,"fname2,other,stuff",fname3,"fname4,foo,bar",fname5
-------------------------
field #1 : fname1
field #2 : "fname2,other,stuff"
field #3 : fname3
field #4 : "fname4,foo,bar"
field #5 : fname5

LINE #3: "lname1,a,b","lname2,c,d","lname3,e,f","lname4,foo,bar",lname5
-------------------------
field #1 : "lname1,a,b"
field #2 : "lname2,c,d"
field #3 : "lname3,e,f"
field #4 : "lname4,foo,bar"
field #5 : lname5

LINE #4: "Outer1 "Inner "Inner "Inner C" B" A" Outer1","XYZ1,c,d","XYZ2lname3,e,f"
-------------------------
field #1 : "Outer1 "Inner "Inner "Inner C" B" A" Outer1"
field #2 : "XYZ1,c,d"
field #3 : "XYZ2lname3,e,f"
As you can see, the task in this section is very easily solved via the awk
command (note that the FPAT variable is specific to GNU awk).
The code sample in this section shows you how to use the awk
command in order to reformat the date field in a dataset and change the order of the fields in the new dataset. For example, given the following input line in the original dataset:
Jane,Smith,20140805234658
The reformatted line in the output dataset has this format:
2014-08-05 23:46:58,Jane,Smith
Listing 5.46 displays the contents of the file dates2.csv,
which contains a “,” delimiter and three fields.
Jane,Smith,20140805234658
Jack,Jones,20170805234652
Dave,Stone,20160805234655
John,Smith,20130805234646
Jean,Davis,20140805234649
Thad,Smith,20150805234637
Jack,Pruit,20160805234638
Listing 5.47 displays the contents of string2date2.sh
, which converts the date field to a new format and shifts the new date to the first field.
inputfile="dates2.csv" outputfile="formatteddates2.csv" rm -f $outputfile; touch $outputfile for line in `cat $inputfile` do fname=`echo $line |cut -d"," -f1` lname=`echo $line |cut -d"," -f2` date1=`echo $line |cut -d"," -f3` # convert to new date format newdate=`echo $date1 | awk '{ print substr($0,1,4)"-"substr($0,5,2)"-"substr($0,7,2)" "substr($0,9,2)":"substr($0,11,2)":"substr($0,13,2)}'` # append newly formatted row to output file echo "${newdate},${fname},${lname}" >> $outputfile done
The contents of the new dataset are here:
2014-08-05 23:46:58,Jane,Smith
2017-08-05 23:46:52,Jack,Jones
2016-08-05 23:46:55,Dave,Stone
2013-08-05 23:46:46,John,Smith
2014-08-05 23:46:49,Jean,Davis
2015-08-05 23:46:37,Thad,Smith
2016-08-05 23:46:38,Jack,Pruit
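As a side note, the bash loop above spawns several subprocesses (cut and awk) for every input line. The same transformation can be sketched as a single awk pass over the file, which is much faster on large datasets:
awk -F"," '{
  d = $3
  printf("%s-%s-%s %s:%s:%s,%s,%s\n",
         substr(d,1,4), substr(d,5,2), substr(d,7,2),
         substr(d,9,2), substr(d,11,2), substr(d,13,2), $1, $2)
}' dates2.csv > formatteddates2.csv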
This chapter introduced the awk
command, which is essentially an entire programming language packaged into a single Unix command.
We explored some of its built-in variables as well as conditional logic, while loops, and for
loops in awk
in order to manipulate the rows and columns in datasets. You then saw how to delete lines and merge lines in datasets, and also how to print the contents of a file as a single line of text. Next you learned how to use metacharacters and character sets in awk
commands. You learned how to perform numeric calculations (such as addition, subtraction, multiplication, and division) in files containing numeric data, and also some numeric functions that are available in awk
.
In addition, you saw how to align columns in a dataset, how to delete columns, how to select a subset of columns from a dataset, and how to work with multiline records in datasets. Finally, you saw a couple of simple use cases involving nested quotes and date formats in a structured dataset.
At this point you have all the tools necessary to do quite sophisticated data cleansing and processing, and it is strongly encouraged that you try to apply them to some task or problem of interest. The final step of the learning process is doing something real.
Working on a real problem is what prompts thoughts such as "I saw something similar once, I wonder if there is a way to . . ." or the even more common "how do I do XXX in language YYY?" You can't ask those questions if you don't have a sense of what is possible.
At this point there is one more thing to say: congratulations! You have completed a fast-paced yet dense book, and if you are a bash neophyte, the material will probably keep you busy for many hours. The examples in the chapters provide a solid foundation, and the Appendix contains additional examples and use cases to further illustrate how the Unix commands work together. The combined effect demonstrates that the universe of possibilities is larger than the examples in this book, and ultimately they will spark ideas in you. Good luck!