If you have become familiar with the customization techniques we presented in the previous chapter, you have probably run into various modifications to your environment that you want to make but can’t — yet. Shell programming makes these possible.
Some aspects of Korn shell programming are really extensions of the customization techniques we have already seen, while others resemble traditional programming language features. We have structured this chapter so that if you aren’t a programmer, you can read this chapter and do quite a bit more than you could with the information in the previous chapter. Experience with a conventional programming language like Pascal or C is helpful (though not strictly necessary) for subsequent chapters. Throughout the rest of the book, we will encounter occasional programming problems, called tasks, whose solutions make use of the concepts we cover.
A script, or file that contains shell commands, is a shell program. Your .profile and environment files, discussed in Chapter 3, are shell scripts.
You can create a script using the text editor of your choice. Once you have created one, there are a number of ways to run it. One, which we have already covered, is to type . scriptname (i.e., the command is a dot). This causes the commands in the script to be read and run as if you had typed them in.
Another way to run a script is simply to type its name and hit ENTER, just as if you were invoking a built-in command. This, of course, is the most convenient way. This method makes the script look just like any other Unix command, and in fact several “regular” commands are implemented as shell scripts (i.e., not as programs originally written in C or some other language), including spell, man on some systems, and various commands for system administrators. The resulting lack of distinction between “user command files” and “built-in commands” is one factor in Unix’s extensibility and, hence, its favored status among programmers.
You can run a script by typing its name only if . (the current directory) is part of your command search path, i.e., is included in your PATH variable (as discussed in Chapter 3). If . isn’t on your path, you must type ./scriptname, which is really the same thing as typing the script’s relative pathname (see Chapter 1).
Before you can invoke the shell script by name, you must also give it “execute” permission. If you are familiar with the Unix filesystem, you know that files have three types of permissions (read, write, and execute) and that those permissions apply to three categories of user (the file’s owner, a group of users, and everyone else). Normally, when you create a file with a text editor, the file is set up with read and write permission for you and read-only permission for everyone else.[46]
To add execute permission for yourself, use chmod:

chmod +x scriptname

Without execute permission, invoking the script by name fails with an error like:

ksh: scriptname: cannot execute [Permission denied]
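A minimal sketch of the whole sequence, using a hypothetical script name (greet) and a temporary directory; the script body uses echo rather than ksh's print so it also runs under a POSIX shell:

```shell
# Create a throwaway directory and a tiny script in it (names are made up).
mkdir -p /tmp/kshdemo && cd /tmp/kshdemo

cat > greet <<'EOF'
echo "hello from greet"
EOF

# Before chmod, ./greet would fail with "Permission denied".
chmod +x greet            # grant execute permission
output=$(./greet)         # invoke by relative pathname
echo "$output"
```

Running ./greet works even though we are not in PATH, because ./greet is a relative pathname.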
But there is a more important difference between the two ways of running shell scripts. While the “dot” method causes the commands in the script to be run as if they were part of your login session, the “just the name” method causes the shell to do a series of things. First, it runs another copy of the shell as a subprocess. The shell subprocess then takes commands from the script, runs them, and terminates, handing control back to the parent shell.
Figure 4-1 shows how the shell executes scripts.
Assume you have a simple shell script called fred that contains the commands bob and dave. In Figure 4-1.a, typing . fred causes the two commands to run in the same shell, just as if you had typed them in by hand. Figure 4-1.b shows what happens when you type just fred: the commands run in the shell subprocess while the parent shell waits for the subprocess to finish. You may find it interesting to compare this with the situation in Figure 4-1.c, which shows what happens when you type fred &.
As you will recall from Chapter 1, the &
makes the command
run in the background, which is really just
another term for “subprocess.” It turns out that the only significant
difference between Figure 4-1.c and Figure 4-1.b is that you have control of your terminal
or workstation while the command runs — you need not wait until it
finishes before you can enter further commands.
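The difference between the “dot” method and the subprocess method can be seen directly by watching what happens to a variable. This sketch uses a made-up script name (fredscript) whose only job is to assign a variable; the book's fred/bob/dave names are placeholders:

```shell
cd /tmp
# A script that just sets a variable (hypothetical example).
cat > fredscript <<'EOF'
marker="set inside the script"
EOF

marker="unset"

sh fredscript              # subprocess: the assignment dies with the child
after_subprocess=$marker

. ./fredscript             # dot: runs in the current shell
after_dot=$marker
```

After the subprocess run, marker is unchanged in the parent; after the dot run, the parent's marker has the new value.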
There are many ramifications to using shell subprocesses. An important one is that the exported environment variables that we saw in the last chapter (e.g., TERM, LOGNAME, PWD) are known in shell subprocesses, whereas other shell variables (such as any that you define in your .profile without an export statement) are not.
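A short sketch of that rule: ask a child shell to echo back two variables, only one of which was exported (the variable names here are made up for the illustration):

```shell
exported_var="visible in children"
unexported_var="parent only"
export exported_var

# Single quotes stop the parent from expanding the names itself;
# the child shell does the expansion.
in_child_exported=$(sh -c 'echo "$exported_var"')
in_child_unexported=$(sh -c 'echo "$unexported_var"')
```

The child sees the exported value but gets an empty string for the unexported one.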
Other issues involving shell subprocesses are too complex to go into now; see Chapter 7 and Chapter 8 for more details about subprocess I/O and process characteristics, respectively. For now, just bear in mind that a script normally runs in a shell subprocess.
The Korn shell’s function feature is an expanded version of a similar facility in the System V Bourne shell and a few other shells. A function is sort of a script-within-a-script; you use it to define some shell code by name and store it in the shell’s memory, to be invoked and run later.
Functions improve the shell’s programmability significantly, for two main reasons. First, when you invoke a function, it is already in the shell’s memory (except for automatically loaded functions; see Section 4.1.1.1, later in this chapter); therefore a function runs faster. Modern computers have plenty of memory, so there is no need to worry about the amount of space a typical function takes up. For this reason, most people define as many functions as possible rather than keep lots of scripts around.
The other advantage of functions is that they are ideal for organizing long shell scripts into modular “chunks” of code that are easier to develop and maintain. If you aren’t a programmer, ask one what life would be like without functions (also called procedures or subroutines in other languages) and you’ll probably get an earful.
To define a function, you can use either one of two forms:

function functname {      # Korn shell semantics
    shell commands
}

or:

functname () {            # POSIX semantics
    shell commands
}
The first form provides access to the full power and programmability
of the Korn shell.
The second is compatible with the syntax for shell functions
introduced in the System V Release 2 Bourne shell.
This form obeys the semantics of the POSIX standard,
which are less powerful than full Korn shell-style functions.
(We discuss the differences in detail shortly.)
We always use the first form in this book.
You can delete a function definition with the command unset -f functname. You can find out what functions are defined in your login session by typing functions.[47] (Note the s at the end of the command name.)
The shell will print not just the names but also the definitions
of all functions, in alphabetical order by function name.
Since this may result in long output, you might want to pipe
the output through more or redirect it to a file for examination
with a text editor.
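A sketch of the define/list/delete cycle. The whoson name is hypothetical; and since this sketch is written to run under bash as well, it uses typeset -f to check for a definition (ksh's functions command serves the listing role described above):

```shell
# Define a throwaway function.
whoson() { echo "function whoson ran"; }

# typeset -f name succeeds (and prints the definition) if it exists.
listed=$(typeset -f whoson >/dev/null 2>&1 && echo defined || echo missing)

unset -f whoson            # delete the function definition
listed_after=$(typeset -f whoson >/dev/null 2>&1 && echo defined || echo missing)
```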
Apart from the advantages, there are two important differences between functions and scripts. First, functions do not run in separate processes, as scripts do when you invoke them by name; the “semantics” of running a function are more like those of your .profile when you log in or any script when invoked with the “dot” command. Second, if a function has the same name as a script or executable program, the function takes precedence.
This is a good time to show the order of precedence for the various sources of commands. When you type a command to the shell, it looks in the following places until it finds a match:
1. Keywords, such as function and several others (e.g., if and for) that we will see in Chapter 5

2. Aliases (although you can’t define an alias whose name is a shell keyword, you can define an alias that expands to a keyword, e.g., alias aslongas=while; see Chapter 7 for more details)

3. Special built-ins, such as break and continue (the full list is . (dot), :, alias, break, continue, eval, exec, exit, export, login, newgrp, readonly, return, set, shift, trap, typeset, unalias, and unset)

4. Functions

5. Non-special built-ins, such as cd and whence

6. Scripts and executable programs, for which the shell searches in the directories listed in the PATH environment variable
We’ll examine this process in more detail in the section on command-line processing in Chapter 7.
If you need to know the exact source of a command, there is
an option to the whence built-in command that we saw
in Chapter 3. whence by itself will
print the pathname of a command if the command
is a script or executable
program, but it will only parrot the command’s
name back if it
is anything else.
But if you type whence -v commandname, you get more complete information, such as:
$ whence -v cd
cd is a shell builtin
$ whence -v function
function is a keyword
$ whence -v man
man is a tracked alias for /usr/bin/man
$ whence -v ll
ll is an alias for 'ls -l'
For compatibility with the System V Bourne shell, the Korn shell predefines the alias type='whence -v'.
This definitely makes the transition to the Korn shell easier for
long-time Bourne shell users; type is
similar to whence.
The whence command actually has several
options, described in Table 4-1.
For example, the AT&T Advanced Software Tools
group that distributes ksh93
also has many other tools, often installed in a separate ast/bin
directory. This feature allows the ast programs to find
their shared libraries, without the user having to manually adjust
LD_LIBRARY_PATH
in the .profile file.[48]
For example, if a command is found in /usr/local/ast/bin,
and the .paths file in that directory contains the
assignment LD_LIBRARY_PATH=../lib
, the shell
prepends /usr/local/ast/lib:
to the value of
LD_LIBRARY_PATH
before running the command.
Readers familiar with ksh88 will notice that
this part of the shell’s behavior has changed significantly.
Since ksh88 always read the environment
file, whether or not the shell was interactive, it was simplest to
just put function definitions there.
However, this could still yield a large, unwieldy file.
To get around this, you could create files in one or more
directories listed in $FPATH
.
Then, in the environment file, you would mark the functions
as being autoloaded:
autoload whoson ...
Marking a function with autoload[49] tells the shell that this name is a function, and to find the definition by searching $FPATH.
The advantage to this is that the function is not loaded into the shell’s
memory if it’s not needed.
The disadvantage is that you have to explicitly list all your functions
in your environment file.
ksh93’s integration of PATH
and FPATH
searching thus simplifies the way
you add shell functions to your personal shell function “library.”
As mentioned earlier, functions defined using the POSIX syntax obey POSIX semantics and not Korn shell semantics:

functname () {
    shell commands
}
The best way to understand this is to think of a POSIX function as being like a dot script. Actions within the body of the function affect all the state of the current script. In contrast, Korn shell functions have much less shared state with the parent shell, although they are not identical to totally separate scripts.
The technical details follow; they include information that we haven’t covered yet. So come back and reread this section after you’ve learned about the typeset command in Chapter 6 and about traps in Chapter 8.
POSIX functions share variables with the parent script. Korn shell functions can have their own local variables.
POSIX functions share traps with the parent script. Korn shell functions can have their own local traps.
POSIX functions cannot be recursive (call themselves).[50] Korn shell functions can.
When a POSIX function is run, $0
is
not changed to the name of the function.
If you use the dot command with the name of a Korn shell function, that function will obey POSIX semantics, affecting all the state (variables and traps) of the parent shell:
$ function demo {                           Define a Korn shell function
>     typeset myvar=3                       Set a local variable myvar
>     print "demo: myvar is $myvar"
> }
$ myvar=4                                   Set the global myvar
$ demo ; print "global: myvar is $myvar"    Run the function
demo: myvar is 3
global: myvar is 4
$ . demo                                    Run with POSIX semantics
demo: myvar is 3
$ print "global: myvar is $myvar"           See the results
global: myvar is 3
A major piece of the Korn shell’s programming functionality relates to shell variables. We’ve already seen the basics of variables. To recap briefly: they are named places to store data, usually in the form of character strings, and their values can be obtained by preceding their names with dollar signs ($). Certain variables, called environment variables, are conventionally named in all capital letters, and their values are made known (with the export statement) to subprocesses.
The chief difference between the Korn shell’s variable schema and those of conventional languages is that the Korn shell’s schema places heavy emphasis on character strings. (Thus it has more in common with a special-purpose language like SNOBOL than a general-purpose one like Pascal.) This is also true of the Bourne shell and the C shell, but the Korn shell goes beyond them by having additional mechanisms for handling integers and double-precision floating point numbers explicitly, as well as simple arrays.
As we have already seen, you can define values for variables with statements of the form varname=value, e.g.:
$ fred=bob
$ print "$fred"
bob
The most important special, built-in variables are called positional parameters. These hold the command-line arguments to scripts when they are invoked. Positional parameters have names 1, 2, 3, etc., meaning that their values are denoted by $1, $2, $3, etc. There is also a positional parameter 0, whose value is the name of the script (i.e., the command typed in to invoke it).

Two special variables contain all of the positional parameters (except positional parameter 0): * and @.
The difference between them is subtle but important, and
it’s apparent only when they are within double quotes.
"$*" is a single string that consists of all of the positional parameters, separated by the first character in the variable IFS (internal field separator), which is a space, TAB, and newline by default. On the other hand, "$@" is equal to "$1" "$2" ... "$N", where N is the number of positional parameters. That is, it’s equal to N separate double-quoted strings, which are separated by spaces.
We’ll explore the ramifications of this difference in a little while.
The variable #
holds the number of positional parameters
(as a character string).
All of these variables are “read-only,” meaning that you can’t
assign new values to them within scripts.
(They can be changed, just not via assignment. See
Section 4.2.1.2, later in this chapter.)
For example, assume that you have the following simple shell script:
print "fred: $*"
print "$0: $1 and $2"
print "$# arguments"
Assume further that the script is called fred. Then if you type fred bob dave, you will see the following output:
fred: bob dave
fred: bob and dave
2 arguments
Shell functions use positional parameters and special variables
like *
and #
in exactly the same way that shell scripts do.
If you wanted to define fred as a function, you could put
the following in your .profile or environment file:
function fred {
    print "fred: $*"
    print "$0: $1 and $2"
    print "$# arguments"
}
You get the same result if you type fred bob dave.
Typically, several shell functions are defined within a single shell script. Therefore each function needs to handle its own arguments, which in turn means that each function needs to keep track of positional parameters separately. Sure enough, each function has its own copies of these variables (even though functions don’t run in their own subprocess, as scripts do); we say that such variables are local to the function.
Other variables defined within functions are not local; they are global, meaning that their values are known throughout the entire shell script.[51] For example, assume that you have a shell script called ascript that contains this:
function afunc {
    print in function $0: $1 $2
    var1="in function"
}

var1="outside of function"
print var1: $var1
print $0: $1 $2

afunc funcarg1 funcarg2

print var1: $var1
print $0: $1 $2
If you invoke this script by typing ascript arg1 arg2, you will see this output:
var1: outside of function
ascript: arg1 arg2
in function afunc: funcarg1 funcarg2
var1: in function
ascript: arg1 arg2
In other words, the function afunc changes the value of the variable var1 from “outside of function” to “in function,” and that change is known outside the function, while $0, $1, and $2 have different values in the function and the main script.
Figure 4-2 shows this graphically.
It is possible to make other variables local to
functions by using the typeset command, which we’ll see in
Chapter 6.
Now that we have this background, let’s take a closer look at "$@" and "$*". These variables are two of the shell’s greatest idiosyncrasies, so we’ll discuss some of the most common sources of confusion.
IFS=,
print "$*"
Changing IFS in a script is fairly risky, but it’s probably OK as long as nothing else in the script depends on it. If this script were called arglist, the command arglist bob dave ed would produce the output bob,dave,ed. Chapter 10 contains another example of changing IFS.
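A safer variant of the arglist idea, as a sketch: do the comma-join inside a function and restore IFS afterward, so nothing else in the script is affected (the joinargs name is made up; echo/plain assignment keep it portable to bash):

```shell
# Join the positional parameters with commas, then restore IFS.
joinargs() {
    saved_ifs=$IFS
    IFS=,
    joined="$*"        # "$*" glues with the first character of IFS
    IFS=$saved_ifs
}

joinargs bob dave ed
echo "$joined"
```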
Why does "$@"
act like N separate double-quoted strings?
To allow you to use them again as separate values. For example,
say you want to call a function within your script with the same list
of positional parameters, like this:
function countargs {
    print "$# args."
}
Assume your script is called with the same arguments as arglist above. Then if it contains the command countargs "$*", the function prints 1 args. But if the command is countargs "$@", the function prints 3 args.
Being able to retrieve the arguments as they came in is also important in case you need to preserve any embedded white space. If your script was invoked with the arguments “hi”, “howdy”, and “hello there”, here are the different results you might get:
$ countargs $*
4 args
$ countargs "$*"
1 args
$ countargs $@
4 args
$ countargs "$@"
3 args
Because "$@"
always exactly preserves arguments,
we use it in just about all the example programs in this book.
Occasionally, it’s useful to change the positional parameters.
We’ve already mentioned that you cannot set them directly, using an
assignment such as 1="first"
.
However, the built-in command set can be used
for this purpose.
The set command is perhaps the single most complicated
and overloaded command in the shell. It takes a large number of options,
which are discussed in Chapter 9.
What we care about for the moment is that additional non-option arguments to
set replace the positional parameters.
Suppose our script was invoked with the three arguments “bob”,
“fred”, and “dave”. Then countargs "$@"
tells us that we have three arguments.
Upon using set to change the positional
parameters, $#
is updated too.
$ set one two three "four not five"         Change the positional parameters
$ countargs "$@"                            Verify the change
4 args
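The same behavior as a self-checking sketch, run at the top level of a script (the -- is a common defensive habit so that an argument beginning with a dash is not mistaken for an option to set):

```shell
# Replace the positional parameters; $# and $1..$4 update immediately.
set -- one two three "four not five"

new_count=$#
fourth=$4
```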
The set command also works inside a shell function. The shell function’s positional parameters are changed, but not those of the calling script:
$ function testme {
>     countargs "$@"               Show the original number of parameters
>     set a b c                    Now change them
>     countargs "$@"               Print the new count
> }
$ testme 1 2 3 4 5 6               Run the function
6 args.                            Original count
3 args.                            New count
$ countargs "$@"                   No change to invoking shell's parameters
4 args.
Before we show the many things you can do with shell variables, we have to make a confession: the syntax of $varname for taking the value of a variable is not quite accurate. Actually, it’s the simple form of the more general syntax, which is ${varname}.
Why two syntaxes? For one thing, the more general syntax is necessary if your code refers to more than nine positional parameters: you must use ${10} for the tenth instead of $10. (This ensures compatibility with the Bourne shell, where $10 means ${1}0.)
Aside from that, consider the Chapter 3 example of setting your primary prompt variable (PS1) to your login name:

PS1="($LOGNAME)-> "

Now suppose you want the prompt to be your login name followed by an underscore and a space. If you type:

PS1="$LOGNAME_ "

the shell tries to take the value of a variable named LOGNAME_, which (most likely) doesn’t exist.
For this reason, the full syntax for taking the value of a variable is ${varname}. So if we used:

PS1="${LOGNAME}_ "

we would get the desired yourname_.
It is safe to omit the curly braces ({}) if the variable name is followed by a character that isn’t a letter, digit, or underscore.
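A small sketch of why the braces matter (the variable names user and user_x are made up for the illustration):

```shell
user=yourname
unset user_x                 # ensure the misread name is empty

without_braces="$user_x"     # the shell looks up a variable named user_x
with_braces="${user}_x"      # the braces end the name at "user"
```

Without braces the result is empty; with braces you get yourname_x.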
As mentioned, Korn shell variables tend to be string-oriented. One operation that’s very common is to append a new value onto an existing variable. (For example, collecting a set of options into a single string.) Since time immemorial, this was done by taking advantage of variable substitution inside double quotes:
myopts="$myopts $newopt"
ksh93 provides the += operator as a more direct way to append:

myopts+=" $newopt"

This accomplishes the same thing, but it is more efficient, and it also makes it clear that the new value is being added onto the string.
(In C, the +=
operator adds the value on the right to the
variable on the left; x += 42
is the same as
x = x + 42
.)
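A sketch of the append idiom (the option strings here are arbitrary; += works for string append in ksh93 and also in bash 3.1 and later):

```shell
myopts="-n"
newopt="-r"
myopts+=" $newopt"    # append, preserving the existing value
```

After this, myopts holds "-n -r".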
ksh93 introduces a new feature, called
compound variables.
They are similar in nature to a Pascal or Ada record or a C struct, and they allow you to group related items together under the same name. Here are some examples:
now="May 20 2001 19:44:57"     Assign current date to variable now
now.hour=19                    Set the hour
now.minute=44                  Set the minute
...
$ print ${now.hour}
19
$ print $now.hour
May 20 2001 19:44:57.hour
You could build up such a variable one component at a time:

person="John Q. Public"
person.firstname=John
person.initial=Q.
person.lastname=Public
Fortunately, you can use a compound assignment to do it all in one fell swoop:
person=(firstname=John initial=Q. lastname=Public)
You can retrieve the value of either the entire variable, or a component, using print.
$ print $person                   Simple print
( lastname=Public initial=Q. firstname=John )
$ print -r "$person"              Print in full glory
(
        lastname=Public
        initial=Q.
        firstname=John
)
$ print ${person.initial}         Print just the middle initial
Q.
The second print command preserves the whitespace that the Korn shell provides when returning the value of a compound variable. The -r option to print is discussed in Chapter 7.
The order of the components is different from what was used in the initial assignment. This order depends upon how the Korn shell manages compound variables internally and cannot be controlled by the programmer.
A second assignment syntax exists, similar to the first:
person=(typeset firstname=John initial=Q. lastname=Public ; typeset -i age=42)
By using the typeset command, you can specify that a variable
is a number instead of a string. Here, person.age
is an
integer variable. The rest remain strings. The typeset
command and its options are presented in Chapter 6.
(You can also use readonly to declare that a
component variable cannot be changed.)
Just as you may use += to append to a regular variable, you can add components to a compound variable as well:
person+= (typeset spouse=Jane)
A space is allowed after the = but not before. This is true for compound assignments with both = and +=.
The Korn shell has additional syntaxes for compound assignment that apply only to array variables; they are also discussed in Chapter 6.
Finally, we’ll mention that the Korn shell has a special compound variable
named .sh
. The various components almost all relate to features
we haven’t covered yet, except ${.sh.version}
, which tells
you the version of the Korn shell that you have:
$ print ${.sh.version}
Version M 1993-12-28 m
We will see another component of .sh later in this chapter, and the other components are covered as we introduce the features they relate to.
Most of the time, as we’ve seen so far, you manipulate variables directly, by name (x=1, for example). The Korn shell allows you to manipulate variables indirectly, using something called a nameref. You create a nameref using typeset -n, or the more convenient predefined alias, nameref. Here is a simple example:
Here is a simple example:
$ name="bill"                   Set initial value
$ nameref firstname=name        Set up the nameref
$ print $firstname              Actually references variable name
bill
$ firstname="arnold"            Now change the indirect reference
$ print $name                   Shazzam! Original variable is changed
arnold
To find out the name of the real variable being referenced by the nameref, use ${!variable}:

$ print ${!firstname}
name
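For readers working in bash, declare -n (bash 4.3 and later) is the analogue of typeset -n; this sketch mirrors the transcript above:

```shell
name="bill"
declare -n firstname=name    # firstname now refers to the variable name

first_read=$firstname        # reads through the reference -> "bill"
firstname="arnold"           # writes through the reference
final_name=$name             # the original variable has changed
```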
$ date                                     Current day and time
Wed May 23 17:49:44 IDT 2001
$ function getday {                        Define a function
>     typeset -n day=$1                    Set up the nameref
>     day=$(date | awk '{ print $1 }')     Actually change it
> }
$ today=now                                Set initial value
$ getday today                             Run the function
$ print $today                             Display new value
Wed
The default output of date(1) looks like this:
$ date
Wed Nov 14 11:52:38 IST 2001
The getday function uses awk to print the first field, which is the day of the week. The result of this operation, which is done inside command substitution (described later in this chapter), is assigned to the local variable day. But day is a nameref; the assignment actually updates the global variable today.
Without the nameref facility, you have to resort to advanced tricks like using
eval (see Chapter 7) to make
something like this happen.
To remove a nameref, use unset -n, which removes the nameref itself, instead of unsetting the variable the nameref is a reference to.
Finally, note that variables that are namerefs may not have
periods in their names (i.e., be components of a compound variable).
They may, though, be references to a compound variable.
In particular, string operators let you do the following:
The basic idea behind the syntax of string operators is that special characters that denote operations are inserted between the variable’s name and the right curly brace. Any argument that the operator may need is inserted to the operator’s right.
The first group of string-handling operators tests for the existence of variables and allows substitutions of default values under certain conditions. These are listed in Table 4-2.
Operator                Substitution

${varname:-word}        If varname exists and isn’t null, return its value; otherwise return word.
                        Purpose: Returning a default value if the variable is undefined.

${varname:=word}        If varname exists and isn’t null, return its value; otherwise set it to word and then return its value.[a]
                        Purpose: Setting a variable to a default value if it is undefined.

${varname:?message}     If varname exists and isn’t null, return its value; otherwise print varname: message and abort the current command or script. Omitting message produces the default message "parameter null or not set."
                        Purpose: Catching errors that result from variables being undefined.

${varname:+word}        If varname exists and isn’t null, return word; otherwise return null.
                        Purpose: Testing for the existence of a variable.
The colon (:) in each of these operators is actually optional. If the colon is omitted, then change “exists and isn’t null” to “exists” in each definition, i.e., the operator tests for existence only.
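All four operators in one sketch (variable names are arbitrary; the :? case runs in a subshell so its abort doesn't kill the surrounding script):

```shell
unset undef
filled=value

with_default=${undef:-fallback}    # returns fallback; undef stays unset
with_assign=${undef:=fallback}     # returns fallback AND sets undef
now_set=$undef                     # proof that := assigned it
if_set=${filled:+present}          # filled is non-null, so "present"

# :? prints a message and aborts, so demonstrate it in a subshell.
( : "${missing:?is required}" ) 2>/dev/null && aborted=no || aborted=yes
```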
The first two of these operators are ideal for setting defaults for command-line arguments in case the user omits them. We’ll actually use all four in Task 4-1, which is our first programming task.
sort -nr "$1" | head -${2:-10}
Here is how this works: the sort(1) program sorts the data in the file whose name is given as the first argument ($1). (The double quotes allow for spaces or other unusual characters in file names, and also prevent wildcard expansion.) The -n option tells sort to interpret the first word on each line as a number (instead of as a character string); the -r tells it to reverse the comparisons, so as to sort in descending order.
So if the user types highest myfile with no second argument, the line that runs is:

sort -nr myfile | head -10
Or if the user types highest myfile 22, the line that runs is:

sort -nr myfile | head -22
Make sure you understand how the :-
string operator provides
a default value.
First, we can add comments to the code; anything between # and the end of a line is a comment. At minimum, the script should start with a few comment lines that indicate what the script does and the arguments it accepts. Next, we can improve the variable names by assigning the values of the positional parameters to regular variables with mnemonic names. Last, we can add blank lines to space things out; blank lines, like comments, are ignored. Here is a more readable version:
# highest filename [howmany]
#
# Print howmany highest-numbered lines in file filename.
# The input file is assumed to have lines that start with
# numbers.  Default for howmany is 10.

filename=$1
howmany=${2:-10}
sort -nr "$filename" | head -$howmany
If the user invokes the script without a filename argument, the command that runs is effectively:

sort -nr | head -10

and sort sits waiting for input from the terminal.
Therefore we need to make sure that the user supplies at least one argument. There are a few ways of doing this; one of them involves another string operator. We’ll replace the line:
filename=$1

with:

filename=${1:?"filename missing."}

Now if the user omits the argument, the shell prints:

highest: line 1: : filename missing.
Two things happen: the shell prints the error message, and the script exits without running the remaining code.
filename=$1
filename=${filename:?"missing."}
highest: line 2: filename: filename missing.
(Make sure you understand why.) Of course, there are ways of printing whatever message is desired; we’ll find out how in Chapter 5.
Before we move on, we’ll look more closely at the two remaining
operators in
Table 4-2
and see how we can incorporate them into
our task solution.
The :=
operator does roughly the
same thing as :-
, except that it has the side effect
of setting the
value of the variable to the given word if the variable doesn’t exist.
Therefore we would like to use := in our script in place of :-, but we can’t; we’d be trying to set the value of a positional parameter, which is not allowed. But if we replaced:
howmany=${2:-10}
with just:
howmany=$2
and moved the substitution down to the actual command line (as we
did at the start), then we could use the :=
operator:
sort -nr "$filename" | head -${howmany:=10}
Using :=
has the added benefit of setting the value of howmany
to 10 in case we need it afterwards in later versions of the script.
ALBUMS ARTIST
${header:+"ALBUMS ARTIST\n"}
print -n ${header:+"ALBUMS ARTIST\n"}
right before the command line that does the actual work.
The -n option to print causes it not to print a newline after printing its arguments. Therefore this print statement prints nothing, not even a blank line, if header is null; otherwise it prints the header line and a newline (\n).
We’ll continue refining our solution to Task 4-1 later in this chapter.
The next type of string operator is used to match portions of a variable’s string value against patterns. Patterns, as we saw in Chapter 1, are strings that can contain wildcard characters (*, ?, and [] for character sets and ranges).
Wildcards have been standard features of all Unix shells going back (at least) to the Version 6 Thompson shell.[52] But the Korn shell is the first shell to add to their capabilities. It adds a set of operators, called regular expression (or regexp for short) operators, that give it much of the string-matching power of advanced Unix utilities like awk(1), egrep(1) (extended grep(1)), and the Emacs editor, albeit with a different syntax. These capabilities go beyond those that you may be used to in other Unix utilities like grep, sed(1), and vi(1).
Advanced Unix users will find the Korn shell’s regular expression capabilities useful for script writing, although they border on overkill. (Part of the problem is the inevitable syntactic clash with the shell’s myriad other special characters.) Therefore we won’t go into great detail about regular expressions here. For more comprehensive information, the “very last word” on practical regular expressions in Unix is Mastering Regular Expressions, by Jeffrey E. F. Friedl. A more gentle introduction may be found in the second edition of sed & awk, by Dale Dougherty and Arnold Robbins. Both are published by O’Reilly & Associates. If you are already comfortable with awk or egrep, you may want to skip the following introductory section and go to Section 4.5.2.3, later in this chapter, where we explain the shell’s regular expression mechanism by comparing it with the syntax used in those two utilities. Otherwise, read on.
Think of regular expressions as strings that match patterns more powerfully than the standard shell wildcard schema. Regular expressions began as an idea in theoretical computer science, but they have found their way into many nooks and crannies of everyday, practical computing. The syntax used to represent them may vary, but the concepts are very much the same.
A shell regular expression can contain regular characters, standard wildcard characters, and additional operators that are more powerful than wildcards. Each such operator has the form x(exp), where x is the particular operator and exp is any regular expression (often simply a regular string). The operator determines how many occurrences of exp a string that matches the pattern can contain. Table 4-3 describes the shell’s regular expression operators and their meanings.
Operator | Meaning
*(exp) | 0 or more occurrences of exp
+(exp) | 1 or more occurrences of exp
?(exp) | 0 or 1 occurrences of exp
@(exp1|exp2|...) | Exactly one of exp1 or exp2 or ...
!(exp) | Anything that doesn’t match exp[a]
[a] Actually, …
A little-known alternative notation is to separate each exp with the ampersand character, &. In this case, all the alternative expressions must match. Think of the | as meaning “or,” while the & means “and.” (You can, in fact, use both of them in the same pattern list. The & has higher precedence, with the meaning “match this and that, OR match the next thing.”)
Table 4-4 provides some example uses of the shell’s
regular expression operators.
The Emacs editor supports customization files whose names end in .el (for Emacs LISP) or .elc (for Emacs LISP Compiled). List all Emacs customization files in the current directory.
In a directory of C source code, list all files that are not C source files (*.c), header files (*.h), the Makefile, or the README.
Filenames in the OpenVMS operating system end in a semicolon followed by a version number, e.g., fred.bob;23. List all OpenVMS-style filenames in the current directory.
In the first of these, we are looking for files that end in .el with an optional c. The expression that matches this is *.el?(c).
The second example depends on the four standard subexpressions *.c, *.h, Makefile, and README. The entire expression is !(*.c|*.h|Makefile|README), which matches anything that does not match any of the four possibilities.
The solution to the third example starts with *\;, the shell wildcard * followed by a backslash-escaped semicolon. Then, we could use the regular expression +([0-9]), which matches one or more characters in the range [0-9], i.e., one or more digits. This is almost correct (and probably close enough), but it doesn’t take into account that the first digit cannot be 0. Therefore the correct expression is *\;[1-9]*([0-9]), which matches anything that ends with a semicolon, a digit from 1 to 9, and zero or more digits from 0 to 9.
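These three patterns can be tried interactively. The sketch below uses a small helper built on case; the patterns work as-is in ksh, and in bash once extglob is enabled (the shopt line is a harmless no-op under ksh). The sample filenames are just illustrations.

```shell
# A tiny matcher: prints yes if the string matches the pattern.
# ksh supports these operators natively; bash needs extglob.
shopt -s extglob 2>/dev/null || true   # no-op in ksh

match() {
    case $1 in
        $2) echo yes ;;
        *)  echo no ;;
    esac
}

match site-init.elc '*.el?(c)'                    # yes: optional trailing c
match site-init.el  '*.el?(c)'                    # yes
match main.c        '!(*.c|*.h|Makefile|README)'  # no: explicitly excluded
match notes.txt     '!(*.c|*.h|Makefile|README)'  # yes
match 'fred.bob;23' '*\;[1-9]*([0-9])'            # yes: OpenVMS-style name
match 'fred.bob;0'  '*\;[1-9]*([0-9])'            # no: version starts with 0
```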
The POSIX standard formalizes the meaning of regular expression characters and operators. The standard defines two classes of regular expressions: Basic Regular Expressions (BREs), which are the kind used by grep and sed, and Extended Regular Expressions (EREs), which are the kind used by egrep and awk.
POSIX also changed what had been common terminology. What we saw earlier in Chapter 1 as a “range expression” is often called a “character class” in the Unix literature. It is now called a “bracket expression” in the POSIX standard. Within bracket expressions, besides literal characters such as a, ;, and so on, you can also have additional components:
A POSIX character class consists of keywords bracketed by [: and :]. The keywords describe different classes of characters such as alphabetic characters, control characters, and so on (see Table 4-5).
A collating symbol is a multicharacter sequence that should be treated as a unit. It consists of the characters bracketed by [. and .].
An equivalence class lists a set of characters that should be considered equivalent, such as e and è. It consists of a named element from the locale, bracketed by [= and =].
All three of these constructs must appear inside the square brackets of a bracket expression. For example, [[:alpha:]!] matches any single alphabetic character or the exclamation point; [[.ch.]] matches the collating element ch but does not match just the letter c or the letter h. In a French locale, [[=e=]] might match any of e, è, or é. Classes and matching characters are shown in Table 4-5.
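Character classes can be tried directly in shell patterns; they are POSIX features and need no special options. (Collating symbols and equivalence classes depend on your locale, so this sketch sticks to character classes.)

```shell
# Classify a single character using POSIX character classes
# inside bracket expressions.
classify() {
    case $1 in
        [[:alpha:]!]) echo "alphabetic or !" ;;
        [[:digit:]])  echo "digit" ;;
        *)            echo "other" ;;
    esac
}

classify q     # alphabetic or !
classify '!'   # alphabetic or !
classify 7     # digit
classify %     # other
```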
The following section compares Korn shell regular expressions to analogous features in awk and egrep. If you aren’t familiar with these, skip to Section 4.5.3.
Table 4-6 is an expansion of Table 4-3: the middle column shows the equivalents in awk/egrep of the shell’s regular expression operators.
The grep command has a feature called backreferences (or backrefs, for short). This facility provides a shorthand for repeating parts of a regular expression as part of a larger whole. It works as follows:
grep '\(abc\).*\1' file1 file2
Here, \(abc\) groups the literal string abc, and \1 later in the expression matches whatever the group matched; the command thus finds lines containing abc, followed by anything, followed by abc again.
It is worth reemphasizing that shell regular expressions can still contain standard shell wildcards. Thus, the shell wildcard ? (match any single character) is equivalent to . in egrep or awk, and the shell’s character set operator [...] is the same as in those utilities.[53] For example, the expression +([[:digit:]]) matches a number, i.e., one or more digits. The shell wildcard character * is equivalent to the shell regular expression *(?).
You can even nest the regular expressions: +([[:digit:]]|!([[:upper:]])) matches one or more digits or non-uppercase letters.
Two egrep and awk regexp operators do not have equivalents in the Korn shell: the beginning- and end-of-line anchors ^ and $. These are hardly necessary, since the Korn shell doesn’t normally operate on text files and does parse strings into words itself. (Essentially, the ^ and $ are implied as always being there. Surround a pattern with * characters to disable this.)
Read on for even more features in the very latest version of ksh.
Starting with ksh93l, the shell provides a number of additional regular expression capabilities. We discuss them here separately, because your version of ksh93 quite likely doesn’t have them, unless you download a ksh93 binary or build ksh93 from source. The facilities break down as follows.
Several new pattern matching facilities are available. They are described briefly in Table 4-7. More discussion follows after the table.
Special parenthesized subpatterns may contain options that control matching within the subpattern or the rest of the expression.
The character class [:word:] within a bracket expression matches any character that is “word constituent.” This is basically any alphanumeric character or the underscore (_).
A number of escape sequences are recognized and treated specially within parenthesized expressions.
~(+options:pattern list) | Enable options
~(-options:pattern list) | Disable options
Within parenthesized expressions, ksh recognizes all the standard ANSI C escape sequences, and they have their usual meaning. (See Section 7.3.3.1, in Chapter 7.) Additionally, the escape sequences listed in Table 4-8 are recognized and can be used for pattern matching.
Whew! This is all fairly heady stuff. If you feel a bit overwhelmed by it, don’t worry. As you learn more about regular expressions and shell programming and begin to do more and more complex text processing tasks, you’ll come to appreciate the fact that you can do all this within the shell itself, instead of having to resort to external programs such as sed, awk, or perl.
Table 4-9 lists the Korn shell’s pattern-matching operators.
Expression | Result
${path##/*/} | long.file.name
${path#/*/} | billr/mem/long.file.name
$path | /home/billr/mem/long.file.name
${path%.*} | /home/billr/mem/long.file
${path%%.*} | /home/billr/mem/long
Starting with ksh93l, these operators automatically set the .sh.match array variable. This is discussed in Section 4.5.7, later in this chapter.
We will incorporate one of these operators into our next programming task, Task 4-2.
Think of a C compiler as a pipeline of data processing components. C source code is input to the beginning of the pipeline, and object code comes out of the end; there are several steps in between. The shell script’s task, among many other things, is to control the flow of data through the components and designate output files.
You need to write the part of the script that takes the name of the input C source file and creates from it the name of the output object code file. That is, you must take a filename ending in .c and create a filename that is similar except that it ends in .o.
The task at hand is to strip the .c off the filename and append .o. A single shell statement does it:
objname=${filename%.c}.o
If filename had an inappropriate value (without .c) such as fred.a, the above expression would evaluate to fred.a.o: since there was no match, nothing is deleted from the value of filename, and .o is appended anyway. And, if filename contained more than one dot — e.g., if it were the y.tab.c that is so infamous among compiler writers — the expression would still produce the desired y.tab.o.
Notice that this would not be true if we used %% in the expression instead of %. The former operator uses the longest match instead of the shortest, so it would match .tab.c and evaluate to y.o rather than y.tab.o. So the single % is correct in this case.
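The longest/shortest distinction is easy to check interactively once the pattern contains a wildcard. A minimal sketch (using echo rather than ksh’s print, so the lines also run under other POSIX shells):

```shell
filename=y.tab.c

# % deletes the shortest matching suffix
echo "${filename%.c}.o"     # y.tab.o

# With the wildcard pattern .* the two operators differ:
echo "${filename%.*}.o"     # shortest match (.c) deleted: y.tab.o
echo "${filename%%.*}.o"    # longest match (.tab.c) deleted: y.o
```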
A longest-match deletion would be preferable, however, for Task 4-3.
Clearly the objective is to remove the directory prefix from the pathname. The following line does it:
bannername=${pathname##*/}
This solution is similar to the first line in the examples shown before. If pathname were just a filename, the pattern */ (anything followed by a slash) would not match, and the value of the expression would be $pathname untouched. If pathname were something like fred/bob, the prefix fred/ would match the pattern and be deleted, leaving just bob as the expression’s value. The same thing would happen if pathname were something like /dave/pete/fred/bob: since the ## deletes the longest match, it deletes the entire /dave/pete/fred/.
If we used #*/ instead of ##*/, the expression would have the incorrect value dave/pete/fred/bob, because the shortest instance of “anything followed by a slash” at the beginning of the string is just a slash (/).
The construct ${variable##*/} is actually quite similar to the Unix utility basename(1). In typical use, basename takes a pathname as argument and returns the filename only; it is meant to be used with the shell’s command substitution mechanism (see below). basename is less efficient than ${variable##*/} because it may run in its own separate process rather than within the shell.[55] Another utility, dirname(1), does essentially the opposite of basename: it returns the directory prefix only. It is equivalent to the Korn shell expression ${variable%/*} and is less efficient for the same reason.
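A quick sketch of both shell-only equivalents (echo is used instead of ksh’s print so the lines also run under other POSIX shells):

```shell
pathname=/home/billr/mem/long.file.name

# Equivalent of basename(1): delete the longest prefix ending in /
echo "${pathname##*/}"    # long.file.name

# Equivalent of dirname(1): delete the shortest suffix starting with /
echo "${pathname%/*}"     # /home/billr/mem
```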
Besides the pattern-matching operators that delete bits and pieces from the values of shell variables, you can do substitutions on those values, much as in a text editor. (In fact, using these facilities, you could almost write a line-mode text editor as a shell script!) These operators are listed in Table 4-10.
Operator | Meaning
${variable:start}
${variable:start:length} | These represent substring operations. The result is the value of variable starting at position start and going for length characters. The first character is at position 0, and if no length is provided, the rest of the string is used. Beginning with ksh93m, a negative start is taken as relative to the end of the string. For example, if a string has 10 characters, numbered 0 to 9, a start value of -2 means 7 (9 - 2 = 7). Similarly, if variable is an indexed array, a negative start yields an index by working backwards from the highest subscript in the array.
${variable/pattern/replace} | If variable contains a match for pattern, the first match is replaced with the text of replace.
${variable//pattern/replace} | This is the same as the previous operation, except that every match of the pattern is replaced.
${variable/pattern} | If variable contains a match for pattern, delete the first match of pattern.
${variable/#pattern/replace} | If variable contains a match for pattern, the first match is replaced with the text of replace. The match is constrained to occur at the beginning of variable’s value. If it doesn’t match there, no substitution occurs.
${variable/%pattern/replace} | If variable contains a match for pattern, the first match is replaced with the text of replace. The match is constrained to occur at the end of variable’s value. If it doesn’t match there, no substitution occurs.
The ${variable/pattern} syntax is different from the #, ##, %, and %% operators we saw earlier. Those operators are constrained to match at the beginning or end of the variable’s value, whereas the syntax shown here is not. For example:
$ path=/home/fred/work/file
$ print ${path/work/play}            Change work into play
/home/fred/play/file
$ objname=${filename/%.c/.o}         Change .c to .o, but only at end
$ allfiles="fred.c dave.c pete.c"
$ allobs=${allfiles//.c/.o}
$ print $allobs
fred.o dave.o pete.o
Finally, these operations may be applied to the positional parameters and to arrays, in which case they are done on all the parameters or array elements at once. (Arrays are described in Chapter 6.)
$ print "$@"
hi how are you over there
$ print ${@/h/H}                     Change h to H in all parameters
Hi How are you over tHere
As promised, here is a brief demonstration of the differences between greedy and non-greedy matching regular expressions:
$ x='12345abc6789'
$ print ${x//+([[:digit:]])/X}       Substitution with longest match
XabcX
$ print ${x//+-([[:digit:]])/X}      Substitution with shortest match
XXXXXabcXXXX
$ print ${x##+([[:digit:]])}         Remove longest match
abc6789
$ print ${x#+([[:digit:]])}          Remove shortest match
2345abc6789
Similarly, the third and fourth cases demonstrate removing text from the front of the value, using longest and shortest matching. In the third case, the longest match removes all the digits; in the fourth case, the shortest match removes just a single digit.
A number of operators relate to shell variable names, as seen in Table 4-11.
Namerefs were discussed in Section 4.4,
earlier in this chapter.
See there for an example of
${!
name
}
.
The last two operators in Table 4-11 might be useful for debugging and/or tracing the use of variables in a large script. Just to see how they work:
$ print ${!HIST*}
HISTFILE HISTCMD HISTSIZE
$ print ${!HIST@}
HISTFILE HISTCMD HISTSIZE
Several other operators related to array variables are described in Chapter 6.
There are three remaining operators on variables. One is ${#varname}, which returns the number of characters in the string.[56] (In Chapter 6 we see how to treat this and similar values as actual numbers so they can be used in arithmetic expressions.) For example, if filename has the value fred.c, then ${#filename} would have the value 6. The other two operators (${#array[*]} and ${#array[@]}) have to do with array variables, which are also discussed in Chapter 6.
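A one-line sketch of the length operator (again using echo for portability across shells):

```shell
filename=fred.c
echo ${#filename}    # 6: the number of characters in fred.c
```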
The .sh.match variable was introduced in ksh93l. It is an indexed array (see Chapter 6), whose values are set every time you do a pattern matching operation on a variable, such as ${filename%%*/}, with any of the # or % operators (for the shortest match), ## or %% (for the longest match), or / and // (for substitutions).
.sh.match[0] contains the text that matched the entire pattern. .sh.match[1] contains the text that matched the first parenthesized subexpression, .sh.match[2] the text that matched the second, and so on. The values of .sh.match become invalid (meaning, don’t try to use them) if the variable on which the pattern matching was done changes.
Again, this is a feature meant for more advanced programming and text processing, analogous to similar features in other languages such as perl. If you’re just starting out, don’t worry about it.
From the discussion so far, we’ve seen two ways of getting values into variables: by assignment statements and by the user supplying them as command-line arguments (positional parameters). There is another way: command substitution, which allows you to use the standard output of a command as if it were the value of a variable. You will soon see how powerful this feature is.
The syntax of command substitution is:
$(Unix command)
Here are some simple examples:
The value of $(pwd) is the current directory (same as the environment variable $PWD).
The value of $(ls) is the names of all files in the current directory, separated by newlines.
To find out detailed information about a command if you don’t know where its file resides, type ls -l $(whence -p command). The -p option forces whence to do a pathname lookup and not consider keywords, built-ins, etc.
To get the contents of a file into a variable, you can use varname=$(<filename). $(cat filename) will do the same thing, but the shell catches the former as a built-in shorthand and runs it more efficiently.
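The $(<filename) form can be sketched like this; the temporary filename is just for illustration, and the form is supported by ksh and bash (plain Bourne shells would need cat):

```shell
# Create a small file, then capture its contents in a variable.
tmpfile=/tmp/cmdsub_demo.$$
printf 'line one\nline two\n' > "$tmpfile"

contents=$(< "$tmpfile")    # faster built-in form of $(cat "$tmpfile")
echo "$contents"            # trailing newline is stripped, as usual

rm -f "$tmpfile"
```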
emacs $(grep -l 'command substitution' ch*.xml)
The -l option to grep prints only the names of files that contain matches.
Command substitution, like variable expansion, is done within double quotes. (Double quotes inside the command substitution are not affected by any enclosing double quotes.) Therefore, our rule in Chapter 1 and Chapter 3 about using single quotes for strings unless they contain variables will now be extended: “When in doubt, use single quotes, unless the string contains variables, or command substitutions, in which case use double quotes.”
(For backwards compatibility, the Korn shell supports the original Bourne shell (and C shell) command substitution notation using backquotes: `...`. However, it is considerably harder to use than $(...), since quoting and nested command substitutions require careful escaping. We don’t use the backquotes in any of the programs in this book.)
You will undoubtedly think of many ways to use command substitution as you gain experience with the Korn shell. One that is a bit more complex than those mentioned previously relates to a customization task that we saw in Chapter 3: personalizing your prompt string.
Recall that you can personalize your prompt string by assigning a value to the variable PS1. If you are on a network of computers, and you use different machines from time to time, you may find it handy to have the name of the machine you’re on in your prompt string. Most modern versions of Unix have the command hostname(1), which prints the network name of the machine you are on to standard output. (If you do not have this command, you may have a similar one like uname.) This command enables you to get the machine name into your prompt string by putting a line like this in your .profile or environment file:
PS1="$(hostname) $ "
(Here, the second dollar sign does not need to be preceded by a backslash. If the character after the $ isn’t special to the shell, the $ is included literally in the string.)
For example, if your machine had the name coltrane, then this statement would set your prompt string to "coltrane $ ".
Command substitution helps us with the solution to the next programming task, Task 4-4, which relates to the album database in Task 4-1.
The cut(1) utility is a natural for this task. cut is a data filter: it extracts columns from tabular data.[57] If you supply the numbers of columns you want to extract from the input, cut prints only those columns on the standard output. Columns can be character positions or — relevant in this example — fields that are separated by TAB characters or other delimiters.
Assume that the data table in our task is a file called albums and that it looks like this:
Coltrane, John|Giant Steps|Atlantic|1960|Ja
Coltrane, John|Coltrane Jazz|Atlantic|1960|Ja
Coltrane, John|My Favorite Things|Atlantic|1961|Ja
Coltrane, John|Coltrane Plays the Blues|Atlantic|1961|Ja
...
Here is how we would use cut to extract the fourth (year) column:
cut -f4 -d\| albums
fieldname=$1
cut -f$(getfield $fieldname) -d\| albums
If we ran this script with the argument year
, the output would be:
1960
1960
1961
1961
...
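The getfield command used by the script is not shown in this excerpt. Here is a minimal hypothetical sketch of what it might look like, assuming the albums fields appear in the order artist, title, label, year (an assumption based on the sample data, not on the original text):

```shell
# Hypothetical getfield: translate a field name in the albums
# database into its column number, for use with cut's -f option.
# The field order assumed here: artist|title|label|year|...
getfield() {
    case $1 in
        artist) echo 1 ;;
        title)  echo 2 ;;
        label)  echo 3 ;;
        year)   echo 4 ;;
        *)      echo "getfield: unknown field: $1" >&2
                return 1 ;;
    esac
}

getfield year    # prints 4, so the script runs: cut -f4 -d\| albums
```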
Task 4-5 is another small task that makes use of cut.
The command who(1) tells you who is logged in (as well as which terminal they’re on and when they logged in). Its output looks like this:
billr      console         May 22 07:57
fred       tty02           May 22 08:31
bob        tty04           May 22 08:12
The fields are separated by spaces, not TABs. Since we need the first field, we can get away with using a space as the field separator in the cut command. (Otherwise we’d have to use the option to cut that uses character columns instead of fields.) To provide a space character as an argument on a command line, you can surround it by quotes:
who | cut -d' ' -f1
With the above who output, this command’s output would look like this:
billr
fred
bob
This leads directly to a solution to the task. Just type:
mail $(who | cut -d' ' -f1)
The command mail billr fred bob will run, and then you can type your message.
Task 4-6 is another task that shows how useful command pipelines can be in command substitution.
function lsd {
    date=$1
    ls -l | grep -i "^.\{41\}$date" | cut -c55-
}
This function depends on the column layout of the ls -l command. In particular, it depends on dates starting in column 42 and filenames starting in column 55. If this isn’t the case in your version of Unix, you will need to adjust the column numbers.[58]
We use the grep search utility to match the date given as argument (in the form Mon DD, e.g., Jan 15 or Oct 6, the latter having two spaces) to the output of ls -l. (The regular expression argument to grep is quoted with double quotes, in order to perform the variable substitution.) This gives us a long listing of only those files whose dates match the argument. The -i option to grep allows you to use all lowercase letters in the month name, while the rather fancy argument means, “Match any line that contains 41 characters followed by the function argument.” For example, typing lsd 'jan 15' causes grep to search for lines that match any 41 characters followed by jan 15 (or Jan 15).
The output of grep is piped through our ubiquitous friend cut to retrieve just the filenames. The argument to cut tells it to extract characters in column 55 through the end of the line.
lp $(lsd 'jan 15')
The output of lsd is on multiple lines (one for each filename), but because the variable IFS (see earlier in this chapter) contains newline by default, the shell uses newline to separate words in lsd’s output, just as it normally does with space and TAB.
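You can see this word splitting at work without running lsd at all. A small sketch using set -- to split a newline-separated string into positional parameters:

```shell
# Newlines separate words just like spaces, because IFS contains
# space, TAB, and newline by default.
names='billr
fred
bob'

set -- $names    # unquoted expansion: the shell splits on IFS characters
echo $#          # 3
echo "$2"        # fred
```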
We conclude this chapter with a couple of functions that you may find handy in your everyday Unix use. They solve the problem presented by Task 4-7.
We start by implementing a significant subset of their capabilities and finish the implementation in Chapter 6. (For ease of development and explanation, our implementation ignores some things that a more bullet-proof version should handle. For example, spaces in filenames will cause things to break.)
If you don’t know what a stack is, think of a spring-loaded dish receptacle in a cafeteria. When you place dishes on the receptacle, the spring compresses so that the top stays at roughly the same level. The dish most recently placed on the stack is the first to be taken when someone wants food; thus, the stack is known as a “last-in, first-out” or LIFO structure. (Victims of a recession or company takeovers will also recognize this mechanism in the context of corporate layoff policies.) Putting something onto a stack is known in computer science parlance as pushing, and taking something off the top is called popping.
A stack is very handy for remembering directories, as we will see; it can “hold your place” up to an arbitrary number of times. The cd - form of the cd command does this, but only to one level. For example: if you are in firstdir and then you change to seconddir, you can type cd - to go back. But if you start out in firstdir, then change to seconddir, and then go to thirddir, you can use cd - only to go back to seconddir. If you type cd - again, you will be back in thirddir, because it is the previous directory.[59]
If you want the “nested” remember-and-change functionality that will take you back to firstdir, you need a stack of directories along with the dirs, pushd and popd commands. Here is how these work:[60]
pushd dir does a cd to dir and then pushes dir onto the stack.
popd does a cd to the top directory, then pops it off the stack.
For example, consider the series of events in Table 4-12. Assume that you have just logged in and that you are in your home directory (/home/you).
We will implement a stack as an environment variable containing a list of directories separated by spaces.
DIRSTACK="$PWD"
export DIRSTACK
Next, we need to implement dirs, pushd, and popd as functions. Here are our initial versions:
function dirs {        # print directory stack (easy)
    print $DIRSTACK
}

function pushd {       # push current directory onto stack
    dirname=$1
    cd ${dirname:?"missing directory name."}
    DIRSTACK="$PWD $DIRSTACK"
    print "$DIRSTACK"
}

function popd {        # cd to top, pop it off stack
    top=${DIRSTACK%% *}
    DIRSTACK=${DIRSTACK#* }
    cd $top
    print "$PWD"
}
The popd function makes yet another use of the shell’s pattern-matching operators. The first line uses the %% operator, which deletes the longest match of " *" (a space followed by anything). This removes all but the top of the stack. The result is saved in the variable top, again for readability reasons. The second line is similar, but going in the other direction. It uses the # operator, which tries to delete the shortest match of the pattern "* " (anything followed by a space) from the value of DIRSTACK. The result is that the top directory (and the space following it) is deleted from the stack.
This code is deficient in the following ways: first, it has no provision for errors. For example:
The third problem with the code is that it will not work if, for some reason, a directory name contains a space. The code will treat the space as a separator character. We’ll accept this deficiency for now. However, when you read about arrays in Chapter 6, think about how you might use them to rewrite this code and eliminate the problem.
[46] This actually depends on the setting of your umask, an advanced feature described in Chapter 10.
[48] ksh93 point releases h through l+ used a similar but more restricted mechanism, via a file named .fpath, and they hard-wired the setting of the library path variable. As this feature was not widespread, it was generalized into a single file starting with point release m.
[49] autoload is actually an alias for typeset -fu.
[50] This is a restriction imposed by the Korn shell, not by the POSIX standard.
[51] However, see the section on typeset in Chapter 6 for a way of making variables local to functions.
[52] The Version 6 shell was written by Ken Thompson. Stephen Bourne wrote the Bourne shell for Version 7.
[53] And, for that matter, the same as in grep, sed, ed, vi, etc. One notable difference is that the shell uses ! inside [...] for negation, while the various utilities all use ^.
[54] Don’t laugh — once upon a time, many Unix compilers had shell scripts as front-ends.
[55] basename may be built-in in some versions of ksh93. Thus it’s not guaranteed to run in a separate process.
[56] This may be more than the number of bytes for multibyte character sets.
[57] Some very old BSD-derived systems don’t have cut, but you can use awk instead. Whenever you see a command of the form cut -fN -dC filename, use this instead: awk -FC '{ print $N }' filename.
[58] For example, ls -l on GNU/Linux has dates starting in column 43 and filenames starting in column 57.
[60] We’ve done it here differently from the C shell. The C shell pushd pushes the initial directory onto the stack first, followed by the command’s argument. The C shell popd removes the top directory off the stack, revealing a new top. Then it cds to the new top directory. We feel that this behavior is less intuitive than our design here.