Up to this point, we've shown you tools to do basic batch editing of text files. These tools, although powerful, have limitations. Although you can script ex commands, the range of text manipulation is quite limited. If you need more powerful and flexible batch editing tools, you need to look at programming languages that are designed for text manipulation. One of the earliest Unix languages to do this is awk, created by Al Aho, Peter Weinberger, and Brian Kernighan. Even if you've never programmed before, there are some simple but powerful ways that you can use awk. Whenever you have a text file that's arranged in columns from which you need to extract data, awk should come to mind.
For example, every Red Hat Linux system stores its version number in /etc/redhat-release. On my system, it looks like this:
Red Hat Linux release 7.1 (Seawolf)
When applying new RPM files to your system, it is often helpful to know which Red Hat version you're using. On the command line, you can retrieve just that number with:
awk '{print $5}' /etc/redhat-release
What's going on here? By default, awk splits each line of input on whitespace, as is explained below. In effect, it's like you are looking at one row of a spreadsheet. In spreadsheets, columns are usually named with letters. In awk, columns are numbered and you can see only one row (that is, one line of input) at a time. The Red Hat version number is in the fifth column. Similar to the way shells use $ for variable interpolation, the values of columns in awk are retrieved using variables that start with $ and are followed by an integer.
As you can guess, this is a fairly simple demonstration of awk, which includes support for regular expressions, branching and looping, and subroutines. For a more complete reference on using awk, see Effective awk Programming or sed & awk Pocket Reference, both published by O'Reilly.
Since there are many flavors of awk, such as nawk and gawk (Section 18.11), this article tries to provide a usable reference for the most common elements of the language. Dialect differences, when they occur, are noted. With the exception of array subscripts, values in [brackets] are optional; don't type the [ or ].
awk can be invoked in one of two ways:
awk [options] 'script' [var=value] [file(s)]
awk [options] -f scriptfile [var=value] [file(s)]
You can specify a script directly on the command line, or you can store a script in a scriptfile and specify it with -f. In most versions, the -f option can be used multiple times. The variable var can be assigned a value on the command line. The value can be a literal, a shell variable ($name), or a command substitution (`cmd`), but the assignment takes effect only when awk reaches that argument while working through its input files (i.e., after the BEGIN procedure has run). awk operates on one or more file(s). If none are specified (or if - is specified), awk reads from the standard input (Section 43.1).
The other recognized options are:
-Fc
Set the field separator to character c. This is the same as setting the system variable FS. nawk allows c to be a regular expression (Section 32.4). Each record (by default, one input line) is divided into fields by whitespace (blanks or tabs) or by some other user-definable field separator. Fields are referred to by the variables $1, $2, . . . $n. $0 refers to the entire record. For example, to print the first three (colon-separated) fields on separate lines:
% awk -F: '{print $1; print $2; print $3}' /etc/passwd
-v var=value
Assign a value to variable var. This allows assignment before the script begins execution. (Available in nawk only.)
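The difference between a var=value operand and the -v option is easiest to see by printing the variable from both a BEGIN procedure and a main procedure. The following is a minimal sketch; the function name assign_demo and the variable n are made up for the illustration, and it assumes an awk that supports -v (nawk, gawk, or any POSIX awk).

```shell
# assign_demo is a hypothetical helper name used only for this illustration.
assign_demo() {
    # n=5 is an operand: awk performs the assignment when it reaches that
    # argument, which happens after the BEGIN procedure has already run.
    printf 'one line\n' |
        awk 'BEGIN { print "BEGIN sees [" n "]" }
                   { print "main sees [" n "]" }' n=5

    # With -v, the assignment is made before the BEGIN procedure runs.
    awk -v n=5 'BEGIN { print "with -v, BEGIN sees [" n "]" }'
}

assign_demo
```

In the first command the BEGIN procedure sees an empty value, while the main procedure sees 5; with -v, BEGIN sees 5 as well.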
awk scripts consist of patterns and procedures:
pattern { procedure }
Both are optional. If pattern is missing, { procedure } is applied to all records. If { procedure } is missing, the matched record is written to the standard output.
pattern can be any of the following:
/regular expression/
relational expression
pattern-matching expression
BEGIN
END
Expressions can be composed of quoted strings, numbers, operators, functions, defined variables, and any of the predefined variables described later in Section 20.10.3.
Regular expressions use the extended set of metacharacters, as described in Section 32.15. In addition, ^ and $ (Section 32.5) can be used to refer to the beginning and end of a field, respectively, rather than the beginning and end of a record (line).
Relational expressions use the relational operators listed in Section 20.10.4 later in this article. Comparisons can be either string or numeric. For example, $2 > $1 selects records for which the second field is greater than the first.
Pattern-matching expressions use the operators ~ (match) and !~ (don't match). See Section 20.10.4 later in this article.
The BEGIN pattern lets you specify procedures that will take place before the first input record is processed. (Generally, you set global variables here.)
The END pattern lets you specify procedures that will take place after the last input record is read.
Except for BEGIN and END, patterns can be combined with the Boolean operators || (OR), && (AND), and ! (NOT). A range of lines can also be specified using comma-separated patterns:
pattern,pattern
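As a rough illustration of combined and range patterns, the sketch below runs two one-line scripts over the same five records. The sample data and the function names sample, range_demo, and bool_demo are invented for the example.

```shell
# sample prints five made-up records: a word and a number on each line.
sample() {
    printf '%s\n' 'alpha 1' 'start 2' 'beta 3' 'stop 4' 'gamma 5'
}

# Range pattern: print every record from the first one matching /start/
# through the next one matching /stop/ (no procedure, so records are printed).
range_demo() { sample | awk '/start/,/stop/'; }

# Boolean combination: second field greater than 2 AND record does NOT
# match /stop/; print only the first field of those records.
bool_demo() { sample | awk '$2 > 2 && !/stop/ { print $1 }'; }

range_demo
bool_demo
```

The range pattern prints the three records from "start 2" through "stop 4"; the combined pattern prints beta and gamma.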
procedure can consist of one or more commands, functions, or variable assignments, separated by newlines or semicolons (;), and contained within curly braces ({}). Commands fall into four groups:
Variable or array assignments
Printing commands
Built-in functions
Control-flow commands
Print the first field of each line:
{ print $1 }
Print all lines that contain pattern:
/pattern/
Print first field of lines that contain pattern:
/pattern/ { print $1 }
Print records containing more than two fields:
NF > 2
Interpret input records as a group of lines up to a blank line:
BEGIN { FS = "\n"; RS = "" }
{ ...process records... }
Print fields 2 and 3 in switched order, but only on lines whose first field matches the string URGENT:
$1 ~ /URGENT/ { print $3, $2 }
Count and print the number of lines that match pattern:
/pattern/ { ++x } END { print x }
Add numbers in second column and print total:
{ total += $2 } END { print "column total is", total }
Print lines that contain fewer than 20 characters:
length($0) < 20
Print each line that begins with Name: and that contains exactly seven fields:
NF == 7 && /^Name:/
nawk supports all awk variables. gawk supports both nawk and awk.
Version | Variable | Description |
---|---|---|
awk | FILENAME | Current filename |
 | FS | Field separator (default is whitespace) |
 | NF | Number of fields in current record |
 | NR | Number of the current record |
 | OFMT | Output format for numbers (default is %.6g) |
 | OFS | Output field separator (default is a blank) |
 | ORS | Output record separator (default is a newline) |
 | RS | Record separator (default is a newline) |
 | $0 | Entire input record |
 | $n | nth field in current record; fields are separated by FS |
nawk | ARGC | Number of arguments on command line |
 | ARGV | An array containing the command-line arguments |
 | ENVIRON | An associative array of environment variables |
 | FNR | Like NR, but relative to the current file |
 | RSTART | First position in the string matched by match function |
 | RLENGTH | Length of the string matched by match function |
 | SUBSEP | Separator character for array subscripts (default is \034) |
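To see how a few of these variables behave across more than one input file, here is a small sketch. It assumes mktemp -d is available for a scratch directory and an awk that has FNR (nawk or later); the function name nr_fnr_demo and the file names one and two are made up.

```shell
# nr_fnr_demo is a hypothetical helper; it builds two tiny files in a
# scratch directory and prints FILENAME, NR, FNR, and NF for every record.
nr_fnr_demo() {
    dir=$(mktemp -d) || return 1
    printf 'a\nb\n' > "$dir/one"
    printf 'c\n'    > "$dir/two"
    # Run in a subshell so the cd does not affect the caller.
    ( cd "$dir" && awk '{ print FILENAME, NR, FNR, NF }' one two )
    rm -rf "$dir"
}

nr_fnr_demo
```

NR keeps counting across both files (1, 2, 3), while FNR starts over at 1 when awk moves on to the second file.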
This table lists the operators, in increasing precedence, that are available in awk.
Variables can be assigned a value with an equal sign (=). For example:
FS = ","
Expressions using the operators +, -, *, /, and % (modulus) can be assigned to variables.
Arrays can be created with the split function (see below), or they can simply be named in an assignment statement. Array elements can be subscripted with numbers (array[1], . . . ,array[n]) or with names (as associative arrays). For example, to count the number of occurrences of a pattern, you could use the following script:
/pattern/ { array["pattern"]++ } END { print array["pattern"] }
awk commands may be classified as follows:
Arithmetic functions | String functions | Control flow statements | Input/Output processing |
---|---|---|---|
atan2 [3] | gsub [3] | break | close [3] |
cos [3] | index | continue | delete [3] |
exp | length | do/while [3] | getline [3] |
int | match [3] | exit | next |
log | split | for | print |
rand [3] | sub [3] | if | printf |
sin [3] | substr | return [3] | sprintf |
sqrt | tolower [3] | while | system [3] |
srand [3] | toupper [3] | | |
[3] Not in original awk.
The following alphabetical list of statements and functions includes all that are available in awk, nawk, or gawk. Unless otherwise mentioned, the statement or function is found in all versions. New statements and functions introduced with nawk are also found in gawk.
atan2(y,x)
close(filename-expr)
close(command-expr)
In some implementations of awk, you can have only ten files open simultaneously and one pipe; modern versions allow more than one pipe open. Therefore, nawk provides a close statement that allows you to close a file or a pipe. close takes as an argument the same expression that opened the pipe or file. (nawk)
cos(x)
delete array[element]
do body while (expr)
Looping statement. Execute statements in body, then evaluate expr. If expr is true, execute body again. More than one command must be put inside braces ({}). (nawk)
exit [expr]
Do not execute remaining instructions and do not read new input. END procedure, if any, will be executed. The expr, if any, becomes awk's exit status (Section 34.12).
exp(arg)
for ([init-expr]; [test-expr]; [incr-expr]) command
C-language-style looping construct. Typically, init-expr assigns the initial value of a counter variable. test-expr is a relational expression that is evaluated each time before executing the command. When test-expr is false, the loop is exited. incr-expr is used to increment the counter variable after each pass. A series of commands must be put within braces ({}). For example:
for (i = 1; i <= 10; i++) printf "Element %d is %s.\n", i, array[i]
for (item in array) command
For each item in an associative array, do command. More than one command must be put inside braces ({}). Refer to each element of the array as array[item].
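A common use of this form is to walk an associative array of counters in an END procedure. The sketch below counts how often each word occurs in its input; the function name count_words and the array name count are invented for the example. Because awk does not promise any particular order for the items, the output is piped through sort.

```shell
# count_words is a hypothetical helper: it reads text on standard input
# and prints each distinct word followed by its number of occurrences.
count_words() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }      # one counter per word
         END { for (w in count) print w, count[w] }' |  # visit every subscript
        sort
}

printf 'a b a\nc a b\n' | count_words
```

For the two sample lines this reports a 3, b 2, and c 1.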
getline [var] [<file]
or
command | getline [var]
Read next line of input. Original awk does not support the syntax to open multiple input streams. The first form reads input from file, and the second form reads the standard output of a Unix command. Both forms read one line at a time, and each time the statement is executed, it gets the next line of input. The line of input is assigned to $0, and it is parsed into fields, setting NF, NR, and FNR. If var is specified, the result is assigned to var and $0 is not changed. Thus, if the result is assigned to a variable, the current line does not change. getline is actually a function, and it returns 1 if it reads a record successfully, 0 if end-of-file is encountered, and -1 if for some reason it is otherwise unsuccessful. (nawk)
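Since getline returns 1, 0, or -1, the usual idiom is to loop while the result is greater than zero. The following sketch reads the output of a command one line at a time into a variable and then closes the pipe; the function name read_cmd and the variables cmd, line, and n are made up, and the command itself is just a printf that emits three words.

```shell
# read_cmd is a hypothetical helper showing the "command | getline var" form.
read_cmd() {
    awk 'BEGIN {
        # The command to run; the shell started by awk sees:
        #   printf "%s\n" red green blue
        cmd = "printf \"%s\\n\" red green blue"

        # Each call reads the next output line into the variable line.
        # The loop ends when getline returns 0 (end of output) or -1 (error).
        while ((cmd | getline line) > 0) {
            n++
            print n ": " line
        }

        # close takes the same expression that opened the pipe.
        close(cmd)
    }'
}

read_cmd
```

The parentheses around the getline expression keep the comparison with 0 from being read as part of the getline itself.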
gsub(r,s[,t])
Globally substitute s for each match of the regular expression r in the string t. Return the number of substitutions. If t is not supplied, defaults to $0. (nawk)
if (condition) command [else command]
If condition is true, do command(s), otherwise do command(s) in else clause (if any). condition can be an expression that uses any of the relational operators <, <=, ==, !=, >=, or >, as well as the pattern-matching operators ~ or !~ (e.g., if ($1 ~ /[Aa].*[Zz]/)). A series of commands must be put within braces ({}).
index(str,substr)
Return position of first substring substr in string str or 0 if not found.
int(arg)
length(arg)
log(arg)
match(s,r)
Function that matches the pattern, specified by the regular expression r, in the string s and returns either the position in s where the match begins or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH. (nawk)
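RSTART and RLENGTH are most useful together with substr, which can then pull out exactly the text that matched. A minimal sketch, with an invented input line and the hypothetical function name match_demo:

```shell
# match_demo is a hypothetical helper: find the first run of digits in the
# record and print where it starts, how long it is, and the text itself.
match_demo() {
    printf 'order id=48213 shipped\n' | awk '{
        if (match($0, /[0-9]+/))
            print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    }'
}

match_demo
```

Here the digits begin at character 10 and run for 5 characters, so the script prints 10 5 48213.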
next
Read next input line and start new cycle through pattern/procedures statements.
print [args] [destination]
Print args on output, followed by a newline. args is usually one or more fields, but it may also be one or more of the predefined variables or arbitrary expressions. If no args are given, prints $0 (the current input record). Literal strings must be quoted. Fields are printed in the order they are listed. If separated by commas (,) in the argument list, they are separated in the output by the OFS character. If separated by spaces, they are concatenated in the output. destination is a Unix redirection or pipe expression (e.g., > file) that redirects the default standard output.
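The effect of commas versus spaces in a print statement can be checked with a short experiment. This sketch sets OFS to a hyphen so the separator is easy to spot; the function name ofs_demo is made up.

```shell
# ofs_demo is a hypothetical helper comparing three ways of printing two fields.
ofs_demo() {
    printf 'a b\n' | awk 'BEGIN { OFS = "-" }
    {
        print $1, $2        # comma: fields joined by OFS
        print $1 $2         # space only: fields are concatenated
        print $1 " " $2     # explicit literal blank between the fields
    }'
}

ofs_demo
```

The three statements print a-b, ab, and a b, respectively.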
printf format [, expression(s)] [destination]
Formatted print statement. Fields or variables can be formatted according to instructions in the format argument. The number of expressions must correspond to the number specified in the format sections. format follows the conventions of the C-language printf statement. Here are a few of the most common formats:
%s
A string.
%d
A decimal number.
%n.mf
A floating-point number, where n is the minimum total width of the field and m is the number of digits after the decimal point.
%[-]nc
n specifies minimum field length for format type c, while - left-justifies value in field; otherwise value is right-justified.
format can also contain embedded escape sequences: \n (newline) or \t (tab) are the most common. destination is a Unix redirection or pipe expression (e.g., > file) that redirects the default standard output.
For example, using the following script:
{printf "The sum on line %s is %d.\n", NR, $1+$2}
and the following input line:
5 5
produces this output, followed by a newline:
The sum on line 1 is 10.
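Field widths and justification are easier to see when each value is wrapped in brackets. The sketch below, using the made-up function name width_demo, formats a string left- and right-justified in six columns, a floating-point number in six columns with two decimals, and an integer in four columns.

```shell
# width_demo is a hypothetical helper; the brackets mark the field edges.
width_demo() {
    awk 'BEGIN {
        printf "[%-6s][%6s][%6.2f][%4d]\n", "ab", "ab", 3.14159, 7
    }'
}

width_demo
```

This prints [ab    ][    ab][  3.14][   7]: each value is padded with blanks up to the requested width.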
rand( )
Generate a random number between 0 and 1. This function returns the same series of numbers each time the script is executed, unless the random number generator is seeded using the srand( ) function. (nawk)
return [expr]
Used at end of user-defined functions to exit the function, returning value of expression expr, if any. (nawk)
sin(x)
split(string,array[,sep])
Split string into elements of array array[1], . . . ,array[n]. string is split at each occurrence of separator sep. (In nawk, the separator may be a regular expression.) If sep is not specified, FS is used. The number of array elements created is returned.
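Because split returns the number of elements it created, that value makes a convenient loop bound. A small sketch, with an invented date string and the hypothetical names split_demo and part:

```shell
# split_demo is a hypothetical helper: break a colon-separated string into
# the array part, then print the element count and each element in order.
split_demo() {
    awk 'BEGIN {
        n = split("2001:07:15", part, ":")
        print n
        for (i = 1; i <= n; i++)
            print i, part[i]
    }'
}

split_demo
```

The script reports 3 elements, then prints 2001, 07, and 15 with their subscripts.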
sprintf(format [, expression(s)])
Return the value of expression(s), using the specified format (see printf). Data is formatted but not printed.
sqrt(arg)
srand(expr)
Use expr to set a new seed for random number generator. Default is time of day. Returns the old seed. (nawk)
sub(r,s[,t])
Substitute s for first match of the regular expression r in the string t. Return 1 if successful; 0 otherwise. If t is not supplied, defaults to $0. (nawk)
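The difference between sub and gsub shows up both in the modified string and in the return value. The following sketch applies each to a copy of the same record; the function name sub_demo and the variables s, t, n1, and n2 are invented for the illustration.

```shell
# sub_demo is a hypothetical helper: replace hyphens with plus signs,
# first with sub (first match only), then with gsub (every match).
sub_demo() {
    printf 'a-b-c-d\n' | awk '{
        s = $0; t = $0              # work on copies so $0 stays untouched
        n1 = sub(/-/, "+", s)       # returns 1: one substitution made
        n2 = gsub(/-/, "+", t)      # returns 3: three substitutions made
        print n1, s
        print n2, t
    }'
}

sub_demo
```

The first line of output is 1 a+b-c-d and the second is 3 a+b+c+d.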
substr(string,m[,n])
Return substring of string, beginning at character position m and consisting of the next n characters. If n is omitted, include all characters to the end of string.
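Character positions in substr start at 1, which a quick test makes clear. A minimal sketch, using the made-up function name substr_demo:

```shell
# substr_demo is a hypothetical helper showing substr with and without
# the optional length argument.
substr_demo() {
    awk 'BEGIN {
        s = "Seawolf"
        print substr(s, 4)       # from position 4 to the end of the string
        print substr(s, 1, 3)    # three characters starting at position 1
    }'
}

substr_demo
```

The first call yields wolf and the second yields Sea.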
system(command)
Function that executes the specified Unix command and returns its status (Section 34.12). The status of the command that is executed typically indicates its success (0) or failure (nonzero). The output of the command is not available for processing within the nawk script. Use command | getline to read the output of the command into the script. (nawk)
tolower(str)
Translate all uppercase characters in str to lowercase and return the new string. (nawk)
toupper(str)
Translate all lowercase characters in str to uppercase and return the new string. (nawk)
while (condition) command
Do command while condition is true (see if for a description of allowable conditions). A series of commands must be put within braces ({}).
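As one possible illustration of while combined with if and braces, the sketch below prints the fields of each record in reverse order. The function name reverse_fields is made up, and the same job could equally be done with a for loop.

```shell
# reverse_fields is a hypothetical helper: print the fields of every input
# record from last to first, separated by single blanks.
reverse_fields() {
    awk '{
        i = NF
        while (i > 0) {              # braces group the statements in the loop
            if (i > 1)
                printf "%s ", $i     # more fields to come: trailing blank
            else
                printf "%s\n", $i    # last one printed: end the line
            i--
        }
    }'
}

printf 'one two three\n' | reverse_fields
```

For the sample line this prints three two one.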
— DG