How can you parse (split, search) a string of text to find the last word, the second column, and so on? There are a lot of different ways. Pick the one that works best for you — or invent another one! (Unix has lots of ways to work with strings of text.)
The expr command (Section 36.21) can grab part of a string with a regular expression. The example below is from a shell script whose last command-line argument is a filename. The two commands below use expr to grab the last argument and all arguments except the last one. The "$*" gives expr a list of all command-line arguments in a single word. (Using "$@" (Section 35.20) here wouldn't work because it gives individually quoted arguments. expr needs all arguments in one word.)
last=`expr "$*" : '.* \(.*\)'`       # LAST ARGUMENT
first=`expr "$*" : '\(.*\) .*'`      # ALL BUT LAST ARGUMENT
Let's look at the regular expression that gets the last word. The leading part of the expression, .*, matches as many characters as it can, followed by a space. This includes all words up to and including the last space. After that, the end of the expression, \(.*\), matches the last word.
The regular expression that grabs the first words is the same as the previous one, but I've moved the \( \) pair. Now it grabs all words up to but not including the last space. The end of the regular expression, .*, matches the last space and last word, and expr ignores them. So the final .* really isn't needed here (though the space is). I've included the final .* because it follows from the first example.
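To make that concrete, here's a quick sketch with a made-up three-word argument list; the comments show what each expr command prints:

args="file1 file2 file3"           # stands in for "$*"
expr "$args" : '.* \(.*\)'         # prints: file3
expr "$args" : '\(.*\) .*'         # prints: file1 file2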
expr is great when you want to split a string into just two parts. The .* also makes expr good for skipping a variable number of words when you don't know how many words a string will have. But expr is poor at getting, say, the fourth word in a string. And it's almost useless for handling more than one line of text at a time.
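For example, here's a sketch of that word-skipping trick (the string is made up): grab everything after the first word, however many words follow:

string="cmd -o outfile infile1 infile2"
rest=`expr "$string" : '[^ ]* \(.*\)'`     # rest gets: -o outfile infile1 infile2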
awk can split lines into words, but it has a lot of overhead and can take some time to execute, especially on a busy system. The cut (Section 21.14) command starts more quickly than awk, but it can't do as much.
Both those utilities are designed to handle multiple lines of text. You can tell awk to handle a single line with its pattern-matching operators and its NR variable. You can also run those utilities with a single line of text, fed to the standard input through a pipe from echo. For example, to get the third field from a colon-separated string:
string="this:is:just:a:dummy:string" field3_awk=`echo "$string" | awk -F: '{print $3}'` field3_cut=`echo "$string" | cut -d: -f3`
Let's combine two echo commands. One sends text to awk or cut through a pipe; the utility ignores all the text from columns 1-24, then prints columns 25 to the end of the variable text. The outer echo prints The answer is and that answer. Notice that the inner double quotes are escaped with backslashes to keep the Bourne shell from interpreting them before the inner echo runs:
echo "The answer is `echo \"$text\" | awk '{print substr($0,25)}'`" echo "The answer is `echo \"$text\" | cut -c25-`"
The Bourne shell set (Section 35.25) command can be used to parse a single-line string and store it in the command-line parameters (Section 35.20) "$@", $*, $1, $2, and so on. Then you can also loop through the words with a for loop (Section 35.21) and use everything else the shell has for dealing with command-line parameters. Also, you can set the Bourne shell's IFS variable to control how the shell splits the string.
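Here's a minimal sketch of that technique (the string is made up): store the words in the command-line parameters, then loop over them:

string="one two three"
set x $string                # the x protects set from an empty or dash-leading string
shift                        # drop the leading x; now $1 is "one"
for word
do echo "word: $word"
done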
By default, the IFS (internal field separator) shell variable holds three characters: SPACE, TAB, and NEWLINE. These are the characters at which the shell splits command lines into separate words.
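If you want to see them for yourself, here's a quick sketch that makes the characters visible with od (the exact od output layout may differ on your system):

echo "$IFS" | od -c          # shows a space, \t, and \n (plus echo's own \n)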
If you have a line of text — say, from a database — and you want to split it into fields, you can put the field separator into IFS temporarily, use the shell's set (Section 35.25) command to store the fields in command-line parameters, then restore the old IFS.
For example, the chunk of a shell script below gets the current terminal settings from stty -g, whose output looks like this:
2506:5:bf:8a3b:3:1c:8:15:4:0:0:0:11:13:1a:19:12:f:17:16:0:0
In the next example, the shell parses the line returned from stty by the backquotes (Section 28.14). It stores x in $1, which stops errors if stty fails for some reason. (Without the x, if stty made no standard output, the shell's set command would print a list of all shell variables.) Then 2506 goes into $2, 5 into $3, and so on. The original Bourne shell can handle only nine parameters (through $9); if your input lines may have more than nine fields, this isn't a good technique. But this script uses the Korn shell, which (along with most other Bourne-type shells) doesn't have that limit.
#!/bin/ksh
oldifs="$IFS"
# Change IFS to a colon:
IFS=:
# Put x in $1, stty -g output in $2 thru ${23}:
set x `stty -g`
IFS="$oldifs"
# Window size is in 16th field (not counting the first "x"):
echo "Your window has ${17} rows."
Because you don't need a subprocess to parse the output of stty, this can be faster than using an external command like cut (Section 21.14) or awk (Section 20.10).
There are places where IFS can't be used because the shell separates command lines at spaces before it splits at IFS. It doesn't split the results of variable substitution or command substitution (Section 28.14) at spaces, though. Here's an example — three different ways to parse a line from /etc/passwd:
% cat splitter
#!/bin/sh
IFS=:
line='larry:Vk9skS323kd4q:985:100:Larry Smith:/u/larry:/bin/tcsh'
set x $line
echo "case 1: \$6 is '$6'"
set x `grep larry /etc/passwd`
echo "case 2: \$6 is '$6'"
set x larry:Vk9skS323kd4q:985:100:Larry Smith:/u/larry:/bin/tcsh
echo "case 3: \$6 is '$6'"
% ./splitter
case 1: $6 is 'Larry Smith'
case 2: $6 is 'Larry Smith'
case 3: $6 is 'Larry'
Case 1 used variable substitution and case 2 used command substitution; the sixth field contained the space. In case 3, though, with the colons on the command line, the sixth field was split: $6 became Larry and $7 was Smith. Another problem would have come up if any of the fields had been empty (as in larry::985:100:etc...): the shell would "eat" the empty field and $6 would contain /u/larry. Using sed with its escaped parentheses (Section 34.11) to do the searching and the parsing could solve the last two problems.
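For instance, here's a sketch of that sed fix (not from the original example): the escaped parentheses isolate the fifth colon-separated field, the full name, so an empty password field can't shift it out of place:

name=larry
fullname=`sed -n "/^$name:/s/^[^:]*:[^:]*:[^:]*:[^:]*:\([^:]*\):.*/\1/p" /etc/passwd`
echo "full name is: $fullname"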
The Unix sed (Section 34.1) utility is good at parsing input that you may or may not otherwise be able to split into words, at finding a single line of text in a group and outputting it, and many other things. In this example, I want to get the percentage-used of the filesystem mounted on /home. That information is buried in the output of the df (Section 15.8) command. On my system,[3] df output looks like:
% df
Filesystem kbytes used avail capacity Mounted on
...
/dev/sd3c 1294854 914230 251139 78% /work
/dev/sd4c 597759 534123 3861 99% /home
...
I want the number 99 from the line ending with /home. The sed address / \/home$/ will find that line (including a space before the /home makes sure the address doesn't match a line ending with /something/home). The -n option keeps sed from printing any lines except the line we ask it to print (with its p command). I know that the "capacity" is the only word on the line that ends with a percent sign (%). A space after the first .* makes sure that .* doesn't "eat" the first digit of the number that we want to match by [0-9]. The sed escaped-parenthesis operators (Section 34.11) grab that number:
usage=`df | sed -n '/ \/home$/s/.* \([0-9][0-9]*\)%.*/\1/p'`
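After that runs, $usage holds just the number (99 in the sample df output above). A quick sketch of putting it to work:

if [ "$usage" -gt 97 ]     # 97 is an arbitrary threshold; pick your own
then echo "WARNING: /home is ${usage}% full" 1>&2
fi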
Combining sed with eval (Section 27.8) lets you set several shell variables at once from parts of the same line. Here's a command line that sets two shell variables from the df output:
eval `df | sed -n '/ \/home$/s/^[^ ]* *\([0-9]*\) *\([0-9]*\).*/kb=\1 u=\2/p'`
The left-hand side of that substitution command has a regular expression that uses sed's escaped parenthesis operators. They grab the "kbytes" and "used" columns from the df output. The right-hand side outputs the two df values with Bourne shell variable-assignment commands to set the kb and u variables. After sed finishes, the resulting command line looks like this:
eval kb=597759 u=534123
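After eval runs that assignment, both variables are set in the current shell; a quick sketch of using them:

echo "/home: $u of $kb kbytes used"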