If you have become familiar with the customization techniques we presented in the previous chapter, you have probably run into various modifications to your environment that you want to make but can’t — yet. Shell programming makes these possible.
Some aspects of Korn shell programming are really extensions of the customization techniques we have already seen, while others resemble traditional programming language features. We have structured this chapter so that if you aren’t a programmer, you can read this chapter and do quite a bit more than you could with the information in the previous chapter. Experience with a conventional programming language like Pascal or C is helpful (though not strictly necessary) for subsequent chapters. Throughout the rest of the book, we will encounter occasional programming problems, called tasks, whose solutions make use of the concepts we cover.
A script, or file that contains shell commands, is a shell program. Your .profile and environment files, discussed in Chapter 3, are shell scripts.
You can create a script using the text editor of your choice. Once you have created one, there are a number of ways to run it. One, which we have already covered, is to type . scriptname (i.e., the command is a dot). This causes the commands in the script to be read and run as if you had typed them in.
Another way to run a script is simply to type its name and hit ENTER, just as if you were invoking a built-in command. This, of course, is the most convenient way. This method makes the script look just like any other Unix command, and in fact several “regular” commands are implemented as shell scripts (i.e., not as programs originally written in C or some other language), including spell, man on some systems, and various commands for system administrators. The resulting lack of distinction between “user command files” and “built-in commands” is one factor in Unix’s extensibility and, hence, its favored status among programmers.
You can run a script by typing its name only if . (the current directory) is part of your command search path, i.e., is included in your PATH variable (as discussed in Chapter 3). If . isn’t on your path, you must type ./scriptname, which is really the same thing as typing the script’s relative pathname (see Chapter 1).
Before you can invoke the shell script by name, you must also give it “execute” permission. If you are familiar with the Unix filesystem, you know that files have three types of permissions (read, write, and execute) and that those permissions apply to three categories of user (the file’s owner, a group of users, and everyone else). Normally, when you create a file with a text editor, the file is set up with read and write permission for you and read-only permission for everyone else.[46]
To add execute permission for yourself, use chmod:

chmod +x scriptname

Without execute permission, invoking the script by name fails with an error like:

ksh: scriptname: cannot execute [Permission denied]
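A minimal sketch of the whole sequence, using a hypothetical script name (greet) and a temporary directory; the script body uses echo rather than ksh's print so it also runs under a POSIX shell:

```shell
# Create a throwaway directory and a tiny script in it (names are made up).
mkdir -p /tmp/kshdemo && cd /tmp/kshdemo

cat > greet <<'EOF'
echo "hello from greet"
EOF

# Before chmod, ./greet would fail with "Permission denied".
chmod +x greet            # grant execute permission
output=$(./greet)         # invoke by relative pathname
echo "$output"
```

Running ./greet works even though we are not in PATH, because ./greet is a relative pathname.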
But there is a more important difference between the two ways of running shell scripts. While the “dot” method causes the commands in the script to be run as if they were part of your login session, the “just the name” method causes the shell to do a series of things. First, it runs another copy of the shell as a subprocess. The shell subprocess then takes commands from the script, runs them, and terminates, handing control back to the parent shell.
Figure 4-1 shows how the shell executes scripts.
Assume you have a simple shell script called fred that contains the commands bob and dave. In Figure 4-1.a, typing . fred causes the two commands to run in the same shell, just as if you had typed them in by hand. Figure 4-1.b shows what happens when you type just fred: the commands run in the shell subprocess while the parent shell waits for the subprocess to finish. You may find it interesting to compare this with the situation in Figure 4-1.c, which shows what happens when you type fred &.
As you will recall from Chapter 1, the &
makes the command
run in the background, which is really just
another term for “subprocess.” It turns out that the only significant
difference between Figure 4-1.c and Figure 4-1.b is that you have control of your terminal
or workstation while the command runs — you need not wait until it
finishes before you can enter further commands.
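The difference between the “dot” method and the subprocess method can be seen directly by watching what happens to a variable. This sketch uses a made-up script name (fredscript) whose only job is to assign a variable; the book's fred/bob/dave names are placeholders:

```shell
cd /tmp
# A script that just sets a variable (hypothetical example).
cat > fredscript <<'EOF'
marker="set inside the script"
EOF

marker="unset"

sh fredscript              # subprocess: the assignment dies with the child
after_subprocess=$marker

. ./fredscript             # dot: runs in the current shell
after_dot=$marker
```

After the subprocess run, marker is unchanged in the parent; after the dot run, the parent's marker has the new value.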
There are many ramifications to using shell subprocesses. An important one is that the exported environment variables that we saw in the last chapter (e.g., TERM, LOGNAME, PWD) are known in shell subprocesses, whereas other shell variables (such as any that you define in your .profile without an export statement) are not.
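A short sketch of that rule: ask a child shell to echo back two variables, only one of which was exported (the variable names here are made up for the illustration):

```shell
exported_var="visible in children"
unexported_var="parent only"
export exported_var

# Single quotes stop the parent from expanding the names itself;
# the child shell does the expansion.
in_child_exported=$(sh -c 'echo "$exported_var"')
in_child_unexported=$(sh -c 'echo "$unexported_var"')
```

The child sees the exported value but gets an empty string for the unexported one.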
Other issues involving shell subprocesses are too complex to go into now; see Chapter 7 and Chapter 8 for more details about subprocess I/O and process characteristics, respectively. For now, just bear in mind that a script normally runs in a shell subprocess.
The Korn shell’s function feature is an expanded version of a similar facility in the System V Bourne shell and a few other shells. A function is sort of a script-within-a-script; you use it to define some shell code by name and store it in the shell’s memory, to be invoked and run later.
Functions improve the shell’s programmability significantly, for two main reasons. First, when you invoke a function, it is already in the shell’s memory (except for automatically loaded functions; see Section 4.1.1.1, later in this chapter); therefore a function runs faster. Modern computers have plenty of memory, so there is no need to worry about the amount of space a typical function takes up. For this reason, most people define as many functions as possible rather than keep lots of scripts around.
The other advantage of functions is that they are ideal for organizing long shell scripts into modular “chunks” of code that are easier to develop and maintain. If you aren’t a programmer, ask one what life would be like without functions (also called procedures or subroutines in other languages) and you’ll probably get an earful.
To define a function, you can use either one of two forms:

function functname {      # Korn shell semantics
    shell commands
}

or:

functname () {            # POSIX semantics
    shell commands
}
The first form provides access to the full power and programmability
of the Korn shell.
The second is compatible with the syntax for shell functions
introduced in the System V Release 2 Bourne shell.
This form obeys the semantics of the POSIX standard,
which are less powerful than full Korn shell-style functions.
(We discuss the differences in detail shortly.)
We always use the first form in this book.
You can delete a function definition with the command unset -f functname. You can find out what functions are defined in your login session by typing functions.[47] (Note the s at the end of the command name.)
The shell will print not just the names but also the definitions
of all functions, in alphabetical order by function name.
Since this may result in long output, you might want to pipe
the output through more or redirect it to a file for examination
with a text editor.
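A sketch of the define/list/delete cycle. The whoson name is hypothetical; and since this sketch is written to run under bash as well, it uses typeset -f to check for a definition (ksh's functions command serves the listing role described above):

```shell
# Define a throwaway function.
whoson() { echo "function whoson ran"; }

# typeset -f name succeeds (and prints the definition) if it exists.
listed=$(typeset -f whoson >/dev/null 2>&1 && echo defined || echo missing)

unset -f whoson            # delete the function definition
listed_after=$(typeset -f whoson >/dev/null 2>&1 && echo defined || echo missing)
```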
Apart from the advantages, there are two important differences between functions and scripts. First, functions do not run in separate processes, as scripts do when you invoke them by name; the “semantics” of running a function are more like those of your .profile when you log in or any script when invoked with the “dot” command. Second, if a function has the same name as a script or executable program, the function takes precedence.
This is a good time to show the order of precedence for the various sources of commands. When you type a command to the shell, it looks in the following places until it finds a match:
1. Keywords, such as function and several others (e.g., if and for) that we will see in Chapter 5

2. Aliases (although you can’t define an alias whose name is a shell keyword, you can define an alias that expands to a keyword, e.g., alias aslongas=while; see Chapter 7 for more details)

3. Special built-ins, such as break and continue (the full list is . (dot), :, alias, break, continue, eval, exec, exit, export, login, newgrp, readonly, return, set, shift, trap, typeset, unalias, and unset)

4. Functions

5. Non-special built-ins, such as cd and whence

6. Scripts and executable programs, for which the shell searches in the directories listed in the PATH environment variable
We’ll examine this process in more detail in the section on command-line processing in Chapter 7.
If you need to know the exact source of a command, there is
an option to the whence built-in command that we saw
in Chapter 3. whence by itself will
print the pathname of a command if the command
is a script or executable
program, but it will only parrot the command’s
name back if it
is anything else.
But if you type whence -v commandname, you get more complete information, such as:
$ whence -v cd
cd is a shell builtin
$ whence -v function
function is a keyword
$ whence -v man
man is a tracked alias for /usr/bin/man
$ whence -v ll
ll is an alias for 'ls -l'
For compatibility with the System V Bourne shell, the Korn shell predefines the alias type='whence -v'.
This definitely makes the transition to the Korn shell easier for
long-time Bourne shell users; type is
similar to whence.
The whence command actually has several
options, described in Table 4-1.
For example, the AT&T Advanced Software Tools
group that distributes ksh93
also has many other tools, often installed in a separate ast/bin
directory. This feature allows the ast programs to find
their shared libraries, without the user having to manually adjust
LD_LIBRARY_PATH
in the .profile file.[48]
For example, if a command is found in /usr/local/ast/bin,
and the .paths file in that directory contains the
assignment LD_LIBRARY_PATH=../lib
, the shell
prepends /usr/local/ast/lib:
to the value of
LD_LIBRARY_PATH
before running the command.
Readers familiar with ksh88 will notice that
this part of the shell’s behavior has changed significantly.
Since ksh88 always read the environment
file, whether or not the shell was interactive, it was simplest to
just put function definitions there.
However, this could still yield a large, unwieldy file.
To get around this, you could create files in one or more
directories listed in $FPATH
.
Then, in the environment file, you would mark the functions
as being autoloaded:
autoload whoson ...
Marking a function with autoload[49] tells the shell that this name is a function, and to find the definition by searching $FPATH.
The advantage to this is that the function is not loaded into the shell’s
memory if it’s not needed.
The disadvantage is that you have to explicitly list all your functions
in your environment file.
ksh93’s integration of PATH
and FPATH
searching thus simplifies the way
you add shell functions to your personal shell function “library.”
As mentioned earlier, functions defined using the POSIX syntax obey POSIX semantics and not Korn shell semantics:

functname () {
    shell commands
}
The best way to understand this is to think of a POSIX function as being like a dot script. Actions within the body of the function affect all the state of the current script. In contrast, Korn shell functions have much less shared state with the parent shell, although they are not identical to totally separate scripts.
The technical details follow; they include information that we haven’t covered yet. So come back and reread this section after you’ve learned about the typeset command in Chapter 6 and about traps in Chapter 8.
POSIX functions share variables with the parent script. Korn shell functions can have their own local variables.
POSIX functions share traps with the parent script. Korn shell functions can have their own local traps.
POSIX functions cannot be recursive (call themselves).[50] Korn shell functions can.
When a POSIX function is run, $0
is
not changed to the name of the function.
If you use the dot command with the name of a Korn shell function, that function will obey POSIX semantics, affecting all the state (variables and traps) of the parent shell:
$ function demo {                           Define a Korn shell function
>     typeset myvar=3                       Set a local variable myvar
>     print "demo: myvar is $myvar"
> }
$ myvar=4                                   Set the global myvar
$ demo ; print "global: myvar is $myvar"    Run the function
demo: myvar is 3
global: myvar is 4
$ . demo                                    Run with POSIX semantics
demo: myvar is 3
$ print "global: myvar is $myvar"           See the results
global: myvar is 3
A major piece of the Korn shell’s programming functionality relates to shell variables. We’ve already seen the basics of variables. To recap briefly: they are named places to store data, usually in the form of character strings, and their values can be obtained by preceding their names with dollar signs ($). Certain variables, called environment variables, are conventionally named in all capital letters, and their values are made known (with the export statement) to subprocesses.
The chief difference between the Korn shell’s variable schema and those of conventional languages is that the Korn shell’s schema places heavy emphasis on character strings. (Thus it has more in common with a special-purpose language like SNOBOL than a general-purpose one like Pascal.) This is also true of the Bourne shell and the C shell, but the Korn shell goes beyond them by having additional mechanisms for handling integers and double-precision floating point numbers explicitly, as well as simple arrays.
As we have already seen, you can define values for variables with statements of the form varname=value, e.g.:
$ fred=bob
$ print "$fred"
bob
The most important special, built-in variables are called positional parameters. These hold the command-line arguments to scripts when they are invoked. Positional parameters have names 1, 2, 3, etc., meaning that their values are denoted by $1, $2, $3, etc. There is also a positional parameter 0, whose value is the name of the script (i.e., the command typed in to invoke it).

Two special variables contain all of the positional parameters (except positional parameter 0): * and @.
The difference between them is subtle but important, and
it’s apparent only when they are within double quotes.
"$*" is a single string that consists of all of the positional parameters, separated by the first character in the variable IFS (internal field separator), which is a space, TAB, and newline by default. On the other hand, "$@" is equal to "$1" "$2" ... "$N", where N is the number of positional parameters. That is, it’s equal to N separate double-quoted strings, which are separated by spaces.
We’ll explore the ramifications of this difference in a little while.
The variable #
holds the number of positional parameters
(as a character string).
All of these variables are “read-only,” meaning that you can’t
assign new values to them within scripts.
(They can be changed, just not via assignment. See
Section 4.2.1.2, later in this chapter.)
For example, assume that you have the following simple shell script:
print "fred: $*"
print "$0: $1 and $2"
print "$# arguments"
Assume further that the script is called fred. Then if you type fred bob dave, you will see the following output:
fred: bob dave
fred: bob and dave
2 arguments
Shell functions use positional parameters and special variables
like *
and #
in exactly the same way that shell scripts do.
If you wanted to define fred as a function, you could put
the following in your .profile or environment file:
function fred {
    print "fred: $*"
    print "$0: $1 and $2"
    print "$# arguments"
}
You get the same result if you type fred bob dave.
Typically, several shell functions are defined within a single shell script. Therefore each function needs to handle its own arguments, which in turn means that each function needs to keep track of positional parameters separately. Sure enough, each function has its own copies of these variables (even though functions don’t run in their own subprocess, as scripts do); we say that such variables are local to the function.
Other variables defined within functions are not local; they are global, meaning that their values are known throughout the entire shell script.[51] For example, assume that you have a shell script called ascript that contains this:
function afunc {
    print in function $0: $1 $2
    var1="in function"
}

var1="outside of function"
print var1: $var1
print $0: $1 $2

afunc funcarg1 funcarg2

print var1: $var1
print $0: $1 $2
If you invoke this script by typing ascript arg1 arg2, you will see this output:
var1: outside of function
ascript: arg1 arg2
in function afunc: funcarg1 funcarg2
var1: in function
ascript: arg1 arg2
In other words, the function afunc changes the value of the variable var1 from “outside of function” to “in function,” and that change is known outside the function, while $0, $1, and $2 have different values in the function and the main script.
Figure 4-2 shows this graphically.
It is possible to make other variables local to
functions by using the typeset command, which we’ll see in
Chapter 6.
Now that we have this background, let’s take a closer look at "$@" and "$*". These variables are two of the shell’s greatest idiosyncrasies, so we’ll discuss some of the most common sources of confusion.
IFS=,
print "$*"
Changing IFS in a script is fairly risky, but it’s probably OK as long as nothing else in the script depends on it. If this script were called arglist, the command arglist bob dave ed would produce the output bob,dave,ed. Chapter 10 contains another example of changing IFS.
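A safer variant of the arglist idea, as a sketch: do the comma-join inside a function and restore IFS afterward, so nothing else in the script is affected (the joinargs name is made up; echo/plain assignment keep it portable to bash):

```shell
# Join the positional parameters with commas, then restore IFS.
joinargs() {
    saved_ifs=$IFS
    IFS=,
    joined="$*"        # "$*" glues with the first character of IFS
    IFS=$saved_ifs
}

joinargs bob dave ed
echo "$joined"
```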
Why does "$@"
act like N separate double-quoted strings?
To allow you to use them again as separate values. For example,
say you want to call a function within your script with the same list
of positional parameters, like this:
function countargs {
    print "$# args."
}
Assume your script is called with the same arguments as arglist above. Then if it contains the command countargs "$*", the function prints 1 args. But if the command is countargs "$@", the function prints 3 args.
Being able to retrieve the arguments as they came in is also important in case you need to preserve any embedded white space. If your script was invoked with the arguments “hi”, “howdy”, and “hello there”, here are the different results you might get:
$ countargs $*
4 args
$ countargs "$*"
1 args
$ countargs $@
4 args
$ countargs "$@"
3 args
Because "$@"
always exactly preserves arguments,
we use it in just about all the example programs in this book.
Occasionally, it’s useful to change the positional parameters.
We’ve already mentioned that you cannot set them directly, using an
assignment such as 1="first"
.
However, the built-in command set can be used
for this purpose.
The set command is perhaps the single most complicated
and overloaded command in the shell. It takes a large number of options,
which are discussed in Chapter 9.
What we care about for the moment is that additional non-option arguments to
set replace the positional parameters.
Suppose our script was invoked with the three arguments “bob”,
“fred”, and “dave”. Then countargs "$@"
tells us that we have three arguments.
Upon using set to change the positional
parameters, $#
is updated too.
$ set one two three "four not five"         Change the positional parameters
$ countargs "$@"                            Verify the change
4 args
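The same behavior as a self-checking sketch, run at the top level of a script (the -- is a common defensive habit so that an argument beginning with a dash is not mistaken for an option to set):

```shell
# Replace the positional parameters; $# and $1..$4 update immediately.
set -- one two three "four not five"

new_count=$#
fourth=$4
```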
The set command also works inside a shell function. The shell function’s positional parameters are changed, but not those of the calling script:
$ function testme {
>     countargs "$@"               Show the original number of parameters
>     set a b c                    Now change them
>     countargs "$@"               Print the new count
> }
$ testme 1 2 3 4 5 6               Run the function
6 args.                            Original count
3 args.                            New count
$ countargs "$@"                   No change to invoking shell's parameters
4 args.
Before we show the many things you can do with shell variables, we have to make a confession: the syntax of $varname for taking the value of a variable is not quite accurate. Actually, it’s the simple form of the more general syntax, which is ${varname}.
Why two syntaxes? For one thing, the more general syntax is necessary if your code refers to more than nine positional parameters: you must use ${10} for the tenth instead of $10. (This ensures compatibility with the Bourne shell, where $10 means ${1}0.)
Aside from that, consider the Chapter 3 example of setting your primary prompt variable (PS1) to your login name:

PS1="($LOGNAME)-> "

Now suppose you want the prompt to be your login name followed by an underscore and a space. If you type:

PS1="$LOGNAME_ "

the shell tries to take the value of a variable named LOGNAME_, which (most likely) doesn’t exist.
For this reason, the full syntax for taking the value of a variable is ${varname}. So if we used:

PS1="${LOGNAME}_ "

we would get the desired yourname_.
It is safe to omit the curly braces ({}) if the variable name is followed by a character that isn’t a letter, digit, or underscore.
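A small sketch of why the braces matter (the variable names user and user_x are made up for the illustration):

```shell
user=yourname
unset user_x                 # ensure the misread name is empty

without_braces="$user_x"     # the shell looks up a variable named user_x
with_braces="${user}_x"      # the braces end the name at "user"
```

Without braces the result is empty; with braces you get yourname_x.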
As mentioned, Korn shell variables tend to be string-oriented. One operation that’s very common is to append a new value onto an existing variable. (For example, collecting a set of options into a single string.) Since time immemorial, this was done by taking advantage of variable substitution inside double quotes:
myopts="$myopts $newopt"
ksh93 provides the += operator as a more direct way to append:

myopts+=" $newopt"

This accomplishes the same thing, but it is more efficient, and it also makes it clear that the new value is being added onto the string.
(In C, the +=
operator adds the value on the right to the
variable on the left; x += 42
is the same as
x = x + 42
.)
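A sketch of the append idiom (the option strings here are arbitrary; += works for string append in ksh93 and also in bash 3.1 and later):

```shell
myopts="-n"
newopt="-r"
myopts+=" $newopt"    # append, preserving the existing value
```

After this, myopts holds "-n -r".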
ksh93 introduces a new feature, called
compound variables.
They are similar in nature to a Pascal or Ada record or a C struct, and they allow you to group related items together under the same name. Here are some examples:
now="May 20 2001 19:44:57"     Assign current date to variable now
now.hour=19                    Set the hour
now.minute=44                  Set the minute
...
$ print ${now.hour}
19
$ print $now.hour
May 20 2001 19:44:57.hour
You could build up such a variable one component at a time:

person="John Q. Public"
person.firstname=John
person.initial=Q.
person.lastname=Public
Fortunately, you can use a compound assignment to do it all in one fell swoop:
person=(firstname=John initial=Q. lastname=Public)
You can retrieve the value of either the entire variable, or a component, using print.
$ print $person                   Simple print
( lastname=Public initial=Q. firstname=John )
$ print -r "$person"              Print in full glory
(
        lastname=Public
        initial=Q.
        firstname=John
)
$ print ${person.initial}         Print just the middle initial
Q.
The second print command preserves the whitespace that the Korn shell provides when returning the value of a compound variable. The -r option to print is discussed in Chapter 7.
The order of the components is different from what was used in the initial assignment. This order depends upon how the Korn shell manages compound variables internally and cannot be controlled by the programmer.
A second assignment syntax exists, similar to the first:
person=(typeset firstname=John initial=Q. lastname=Public ; typeset -i age=42)
By using the typeset command, you can specify that a variable
is a number instead of a string. Here, person.age
is an
integer variable. The rest remain strings. The typeset
command and its options are presented in Chapter 6.
(You can also use readonly to declare that a
component variable cannot be changed.)
Just as you may use += to append to a regular variable, you can add components to a compound variable as well:
person+= (typeset spouse=Jane)
A space is allowed after the = but not before. This is true for compound assignments with both = and +=.
The Korn shell has additional syntaxes for compound assignment that apply only to array variables; they are also discussed in Chapter 6.
Finally, we’ll mention that the Korn shell has a special compound variable
named .sh
. The various components almost all relate to features
we haven’t covered yet, except ${.sh.version}
, which tells
you the version of the Korn shell that you have:
$ print ${.sh.version}
Version M 1993-12-28 m
We will see another component of .sh later in this chapter, and the other components are covered as we introduce the features they relate to.
Most of the time, as we’ve seen so far, you manipulate variables directly, by name (x=1, for example). The Korn shell allows you to manipulate variables indirectly, using something called a nameref. You create a nameref using typeset -n, or the more convenient predefined alias, nameref. Here is a simple example:
Here is a simple example:
$ name="bill"                   Set initial value
$ nameref firstname=name        Set up the nameref
$ print $firstname              Actually references variable name
bill
$ firstname="arnold"            Now change the indirect reference
$ print $name                   Shazzam! Original variable is changed
arnold
To find out the name of the real variable being referenced by the nameref, use ${!variable}:

$ print ${!firstname}
name
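For readers working in bash, declare -n (bash 4.3 and later) is the analogue of typeset -n; this sketch mirrors the transcript above:

```shell
name="bill"
declare -n firstname=name    # firstname now refers to the variable name

first_read=$firstname        # reads through the reference -> "bill"
firstname="arnold"           # writes through the reference
final_name=$name             # the original variable has changed
```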
$ date                                     Current day and time
Wed May 23 17:49:44 IDT 2001
$ function getday {                        Define a function
>     typeset -n day=$1                    Set up the nameref
>     day=$(date | awk '{ print $1 }')     Actually change it
> }
$ today=now                                Set initial value
$ getday today                             Run the function
$ print $today                             Display new value
Wed
The default output of date(1) looks like this:
$ date
Wed Nov 14 11:52:38 IST 2001
The getday function uses awk to print the first field, which is the day of the week. The result of this operation, which is done inside command substitution (described later in this chapter), is assigned to the local variable day. But day is a nameref; the assignment actually updates the global variable today.
Without the nameref facility, you have to resort to advanced tricks like using
eval (see Chapter 7) to make
something like this happen.
To remove a nameref, use unset -n, which removes the nameref itself, instead of unsetting the variable the nameref is a reference to.
Finally, note that variables that are namerefs may not have
periods in their names (i.e., be components of a compound variable).
They may, though, be references to a compound variable.
In particular, string operators let you do the following:
The basic idea behind the syntax of string operators is that special characters that denote operations are inserted between the variable’s name and the right curly brace. Any argument that the operator may need is inserted to the operator’s right.
The first group of string-handling operators tests for the existence of variables and allows substitutions of default values under certain conditions. These are listed in Table 4-2.
Operator                Substitution

${varname:-word}        If varname exists and isn’t null, return its value; otherwise return word.
                        Purpose: Returning a default value if the variable is undefined.

${varname:=word}        If varname exists and isn’t null, return its value; otherwise set it to word and then return its value.[a]
                        Purpose: Setting a variable to a default value if it is undefined.

${varname:?message}     If varname exists and isn’t null, return its value; otherwise print varname: message and abort the current command or script. Omitting message produces the default message "parameter null or not set."
                        Purpose: Catching errors that result from variables being undefined.

${varname:+word}        If varname exists and isn’t null, return word; otherwise return null.
                        Purpose: Testing for the existence of a variable.
The colon (:) in each of these operators is actually optional. If the colon is omitted, then change “exists and isn’t null” to “exists” in each definition, i.e., the operator tests for existence only.
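All four operators in one sketch (variable names are arbitrary; the :? case runs in a subshell so its abort doesn't kill the surrounding script):

```shell
unset undef
filled=value

with_default=${undef:-fallback}    # returns fallback; undef stays unset
with_assign=${undef:=fallback}     # returns fallback AND sets undef
now_set=$undef                     # proof that := assigned it
if_set=${filled:+present}          # filled is non-null, so "present"

# :? prints a message and aborts, so demonstrate it in a subshell.
( : "${missing:?is required}" ) 2>/dev/null && aborted=no || aborted=yes
```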
The first two of these operators are ideal for setting defaults for command-line arguments in case the user omits them. We’ll actually use all four in Task 4-1, which is our first programming task.
sort -nr "$1" | head -${2:-10}
Here is how this works: the sort(1) program sorts the data in the file whose name is given as the first argument ($1). (The double quotes allow for spaces or other unusual characters in file names, and also prevent wildcard expansion.) The -n option tells sort to interpret the first word on each line as a number (instead of as a character string); the -r tells it to reverse the comparisons, so as to sort in descending order.
So if the user types highest myfile with no second argument, the line that runs is:

sort -nr myfile | head -10
Or if the user types highest myfile 22, the line that runs is:

sort -nr myfile | head -22
Make sure you understand how the :-
string operator provides
a default value.
First, we can add comments to the code; anything between # and the end of a line is a comment. At minimum, the script should start with a few comment lines that indicate what the script does and the arguments it accepts. Next, we can improve the variable names by assigning the values of the positional parameters to regular variables with mnemonic names. Last, we can add blank lines to space things out; blank lines, like comments, are ignored. Here is a more readable version:
# highest filename [howmany]
#
# Print howmany highest-numbered lines in file filename.
# The input file is assumed to have lines that start with
# numbers.  Default for howmany is 10.

filename=$1
howmany=${2:-10}
sort -nr "$filename" | head -$howmany
If the user invokes the script without a filename argument, the command that runs is effectively:

sort -nr | head -10

and sort sits waiting for input from the terminal.
Therefore we need to make sure that the user supplies at least one argument. There are a few ways of doing this; one of them involves another string operator. We’ll replace the line:
filename=$1

with:

filename=${1:?"filename missing."}

Now if the user omits the argument, the shell prints:

highest: line 1: : filename missing.
Two things happen: the shell prints the error message, and the script exits without running the remaining code.
filename=$1
filename=${filename:?"missing."}
highest: line 2: filename: filename missing.
(Make sure you understand why.) Of course, there are ways of printing whatever message is desired; we’ll find out how in Chapter 5.
Before we move on, we’ll look more closely at the two remaining
operators in
Table 4-2
and see how we can incorporate them into
our task solution.
The :=
operator does roughly the
same thing as :-
, except that it has the side effect
of setting the
value of the variable to the given word if the variable doesn’t exist.
Therefore we would like to use := in our script in place of :-, but we can’t; we’d be trying to set the value of a positional parameter, which is not allowed. But if we replaced:
howmany=${2:-10}
with just:
howmany=$2
and moved the substitution down to the actual command line (as we
did at the start), then we could use the :=
operator:
sort -nr "$filename" | head -${howmany:=10}
Using :=
has the added benefit of setting the value of howmany
to 10 in case we need it afterwards in later versions of the script.
ALBUMS ARTIST
${header:+"ALBUMS ARTIST\n"}
print -n ${header:+"ALBUMS ARTIST\n"}
right before the command line that does the actual work.
The -n option to print causes it not to print a newline after printing its arguments. Therefore this print statement prints nothing, not even a blank line, if header is null; otherwise it prints the header line and a newline (\n).
We’ll continue refining our solution to Task 4-1 later in this chapter.
The next type of string operator is used to match portions of a variable’s string value against patterns. Patterns, as we saw in Chapter 1, are strings that can contain wildcard characters (*, ?, and [] for character sets and ranges).
Wildcards have been standard features of all Unix shells going back (at least) to the Version 6 Thompson shell.[52] But the Korn shell is the first shell to add to their capabilities. It adds a set of operators, called regular expression (or regexp for short) operators, that give it much of the string-matching power of advanced Unix utilities like awk(1), egrep(1) (extended grep(1)), and the Emacs editor, albeit with a different syntax. These capabilities go beyond those that you may be used to in other Unix utilities like grep, sed(1), and vi(1).
Advanced Unix users will find the Korn shell’s regular expression capabilities useful for script writing, although they border on overkill. (Part of the problem is the inevitable syntactic clash with the shell’s myriad other special characters.) Therefore we won’t go into great detail about regular expressions here. For more comprehensive information, the “very last word” on practical regular expressions in Unix is Mastering Regular Expressions, by Jeffrey E. F. Friedl. A more gentle introduction may be found in the second edition of sed & awk, by Dale Dougherty and Arnold Robbins. Both are published by O’Reilly & Associates. If you are already comfortable with awk or egrep, you may want to skip the following introductory section and go to Section 4.5.2.3, later in this chapter, where we explain the shell’s regular expression mechanism by comparing it with the syntax used in those two utilities. Otherwise, read on.
Think of regular expressions as strings that match patterns more powerfully than the standard shell wildcard schema. Regular expressions began as an idea in theoretical computer science, but they have found their way into many nooks and crannies of everyday, practical computing. The syntax used to represent them may vary, but the concepts are very much the same.
A shell regular expression can contain regular characters, standard wildcard characters, and additional operators that are more powerful than wildcards. Each such operator has the form x(exp), where x is the particular operator and exp is any regular expression (often simply a regular string). The operator determines how many occurrences of exp a string that matches the pattern can contain. Table 4-3 describes the shell’s regular expression operators and their meanings.
Operator | Meaning
*(exp) | 0 or more occurrences of exp
+(exp) | 1 or more occurrences of exp
?(exp) | 0 or 1 occurrences of exp
@(exp1|exp2|...) | Exactly one of exp1 or exp2 or ...
!(exp) | Anything that doesn’t match exp[a]
[a] Actually, …
A little-known alternative notation is to separate each exp with the ampersand character, &. In this case, all the alternative expressions must match. Think of the | as meaning “or,” while the & means “and.” (You can, in fact, use both of them in the same pattern list. The & has higher precedence, with the meaning “match this and that, OR match the next thing.”)
Table 4-4 provides some example uses of the shell’s
regular expression operators.
The Emacs editor supports customization files whose names end in .el (for Emacs LISP) or .elc (for Emacs LISP Compiled). List all Emacs customization files in the current directory.
In a directory of C source code, list all files that are not C source files (*.c), header files (*.h), the Makefile, or the README.
Filenames in the OpenVMS operating system end in a semicolon followed by a version number, e.g., fred.bob;23. List all OpenVMS-style filenames in the current directory.
In the first of these, we are looking for files that end in .el with an optional c. The expression that matches this is *.el?(c).
The second example depends on the four standard subexpressions *.c, *.h, Makefile, and README. The entire expression is !(*.c|*.h|Makefile|README), which matches anything that does not match any of the four possibilities.
The solution to the third example starts with *\;, the shell wildcard * followed by a backslash-escaped semicolon. Then, we could use the regular expression +([0-9]), which matches one or more characters in the range [0-9], i.e., one or more digits. This is almost correct (and probably close enough), but it doesn’t take into account that the first digit cannot be 0. Therefore the correct expression is *\;[1-9]*([0-9]), which matches anything that ends with a semicolon, a digit from 1 to 9, and zero or more digits from 0 to 9.
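These three patterns can be tried interactively. The sketch below uses a small helper built on case; the patterns work as-is in ksh, and in bash once extglob is enabled (the shopt line is a harmless no-op under ksh). The sample filenames are just illustrations.

```shell
# A tiny matcher: prints yes if the string matches the pattern.
# ksh supports these operators natively; bash needs extglob.
shopt -s extglob 2>/dev/null || true   # no-op in ksh

match() {
    case $1 in
        $2) echo yes ;;
        *)  echo no ;;
    esac
}

match site-init.elc '*.el?(c)'                    # yes: optional trailing c
match site-init.el  '*.el?(c)'                    # yes
match main.c        '!(*.c|*.h|Makefile|README)'  # no: explicitly excluded
match notes.txt     '!(*.c|*.h|Makefile|README)'  # yes
match 'fred.bob;23' '*\;[1-9]*([0-9])'            # yes: OpenVMS-style name
match 'fred.bob;0'  '*\;[1-9]*([0-9])'            # no: version starts with 0
```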
The POSIX standard formalizes the meaning of regular expression characters and operators. The standard defines two classes of regular expressions: Basic Regular Expressions (BREs), which are the kind used by grep and sed, and Extended Regular Expressions (EREs), which are the kind used by egrep and awk.
POSIX also changed what had been common terminology. What we saw earlier in Chapter 1 as a “range expression” is often called a “character class” in the Unix literature. It is now called a “bracket expression” in the POSIX standard. Within bracket expressions, besides literal characters such as a, ;, and so on, you can also have additional components:
A POSIX character class consists of keywords bracketed by [: and :]. The keywords describe different classes of characters such as alphabetic characters, control characters, and so on (see Table 4-5).
A collating symbol is a multicharacter sequence that should be treated as a unit. It consists of the characters bracketed by [. and .].
An equivalence class lists a set of characters that should be considered equivalent, such as e and è. It consists of a named element from the locale, bracketed by [= and =].
All three of these constructs must appear inside the square brackets of a bracket expression. For example, [[:alpha:]!] matches any single alphabetic character or the exclamation point; [[.ch.]] matches the collating element ch but does not match just the letter c or the letter h. In a French locale, [[=e=]] might match any of e, è, or é. Classes and matching characters are shown in Table 4-5.
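Character classes can be tried directly in shell patterns; they are POSIX features and need no special options. (Collating symbols and equivalence classes depend on your locale, so this sketch sticks to character classes.)

```shell
# Classify a single character using POSIX character classes
# inside bracket expressions.
classify() {
    case $1 in
        [[:alpha:]!]) echo "alphabetic or !" ;;
        [[:digit:]])  echo "digit" ;;
        *)            echo "other" ;;
    esac
}

classify q     # alphabetic or !
classify '!'   # alphabetic or !
classify 7     # digit
classify %     # other
```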
The following section compares Korn shell regular expressions to analogous features in awk and egrep. If you aren’t familiar with these, skip to Section 4.5.3.
Table 4-6 is an expansion of Table 4-3: the middle column shows the equivalents in awk/egrep of the shell’s regular expression operators.
The grep command has a feature called backreferences (or backrefs, for short). This facility provides a shorthand for repeating parts of a regular expression as part of a larger whole. It works as follows:
grep '\(abc\).*\1' file1 file2
Here, \(abc\) groups the literal string abc, and \1 later in the expression matches whatever the group matched; the command thus finds lines containing abc, followed by anything, followed by abc again.
It is worth reemphasizing that shell regular expressions can still contain standard shell wildcards. Thus, the shell wildcard ? (match any single character) is equivalent to . in egrep or awk, and the shell’s character set operator [...] is the same as in those utilities.[53] For example, the expression +([[:digit:]]) matches a number, i.e., one or more digits. The shell wildcard character * is equivalent to the shell regular expression *(?).
You can even nest the regular expressions: +([[:digit:]]|!([[:upper:]])) matches one or more digits or non-uppercase letters.
Two egrep and awk regexp operators do not have equivalents in the Korn shell: the beginning- and end-of-line anchors ^ and $. These are hardly necessary, since the Korn shell doesn’t normally operate on text files and does parse strings into words itself. (Essentially, the ^ and $ are implied as always being there. Surround a pattern with * characters to disable this.)
Read on for even more features in the very latest version of ksh.
Starting with ksh93l, the shell provides a number of additional regular expression capabilities. We discuss them here separately, because your version of ksh93 quite likely doesn’t have them, unless you download a ksh93 binary or build ksh93 from source. The facilities break down as follows.
Several new pattern matching facilities are available. They are described briefly in Table 4-7. More discussion follows after the table.
Special parenthesized subpatterns may contain options that control matching within the subpattern or the rest of the expression.
The character class [:word:] within a bracket expression matches any character that is “word constituent.” This is basically any alphanumeric character or the underscore (_).
A number of escape sequences are recognized and treated specially within parenthesized expressions.
~(+options:pattern list) | Enable options
~(-options:pattern list) | Disable options
Within parenthesized expressions, ksh recognizes all the standard ANSI C escape sequences, and they have their usual meaning. (See Section 7.3.3.1, in Chapter 7.) Additionally, the escape sequences listed in Table 4-8 are recognized and can be used for pattern matching.
Whew! This is all fairly heady stuff. If you feel a bit overwhelmed by it, don’t worry. As you learn more about regular expressions and shell programming and begin to do more and more complex text processing tasks, you’ll come to appreciate the fact that you can do all this within the shell itself, instead of having to resort to external programs such as sed, awk, or perl.
Table 4-9 lists the Korn shell’s pattern-matching operators.
Expression | Result
${path##/*/} | long.file.name
${path#/*/} | billr/mem/long.file.name
$path | /home/billr/mem/long.file.name
${path%.*} | /home/billr/mem/long.file
${path%%.*} | /home/billr/mem/long
Starting with ksh93l, these operators automatically set the .sh.match array variable. This is discussed in Section 4.5.7, later in this chapter.
We will incorporate one of these operators into our next programming task, Task 4-2.
Think of a C compiler as a pipeline of data processing components. C source code is input to the beginning of the pipeline, and object code comes out of the end; there are several steps in between. The shell script’s task, among many other things, is to control the flow of data through the components and designate output files.
You need to write the part of the script that takes the name of the input C source file and creates from it the name of the output object code file. That is, you must take a filename ending in .c and create a filename that is similar except that it ends in .o.
The task at hand is to strip the .c off the filename and append .o. A single shell statement does it:
objname=${filename%.c}.o
If filename had an inappropriate value (without .c) such as fred.a, the above expression would evaluate to fred.a.o: since there was no match, nothing is deleted from the value of filename, and .o is appended anyway. And, if filename contained more than one dot — e.g., if it were the y.tab.c that is so infamous among compiler writers — the expression would still produce the desired y.tab.o.
Notice that this would not be true if we used %% in the expression instead of %. The former operator uses the longest match instead of the shortest, so it would match .tab.c and evaluate to y.o rather than y.tab.o. So the single % is correct in this case.
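The longest/shortest distinction is easy to check interactively once the pattern contains a wildcard. A minimal sketch (using echo rather than ksh’s print, so the lines also run under other POSIX shells):

```shell
filename=y.tab.c

# % deletes the shortest matching suffix
echo "${filename%.c}.o"     # y.tab.o

# With the wildcard pattern .* the two operators differ:
echo "${filename%.*}.o"     # shortest match (.c) deleted: y.tab.o
echo "${filename%%.*}.o"    # longest match (.tab.c) deleted: y.o
```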
A longest-match deletion would be preferable, however, for Task 4-3.
Clearly the objective is to remove the directory prefix from the pathname. The following line does it:
bannername=${pathname##*/}
This solution is similar to the first line in the examples shown before. If pathname were just a filename, the pattern */ (anything followed by a slash) would not match, and the value of the expression would be $pathname untouched. If pathname were something like fred/bob, the prefix fred/ would match the pattern and be deleted, leaving just bob as the expression’s value. The same thing would happen if pathname were something like /dave/pete/fred/bob: since the ## deletes the longest match, it deletes the entire /dave/pete/fred/.
If we used #*/ instead of ##*/, the expression would have the incorrect value dave/pete/fred/bob, because the shortest instance of “anything followed by a slash” at the beginning of the string is just a slash (/).
The construct ${variable##*/} is actually quite similar to the Unix utility basename(1). In typical use, basename takes a pathname as argument and returns the filename only; it is meant to be used with the shell’s command substitution mechanism (see below). basename is less efficient than ${variable##*/} because it may run in its own separate process rather than within the shell.[55] Another utility, dirname(1), does essentially the opposite of basename: it returns the directory prefix only. It is equivalent to the Korn shell expression ${variable%/*} and is less efficient for the same reason.
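A quick sketch of both shell-only equivalents (echo is used instead of ksh’s print so the lines also run under other POSIX shells):

```shell
pathname=/home/billr/mem/long.file.name

# Equivalent of basename(1): delete the longest prefix ending in /
echo "${pathname##*/}"    # long.file.name

# Equivalent of dirname(1): delete the shortest suffix starting with /
echo "${pathname%/*}"     # /home/billr/mem
```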
Besides the pattern-matching operators that delete bits and pieces from the values of shell variables, you can do substitutions on those values, much as in a text editor. (In fact, using these facilities, you could almost write a line-mode text editor as a shell script!) These operators are listed in Table 4-10.
Operator | Meaning
${variable:start}
${variable:start:length} | These represent substring operations. The result is the value of variable starting at position start and going for length characters. The first character is at position 0, and if no length is provided, the rest of the string is used. Beginning with ksh93m, a negative start is taken as relative to the end of the string. For example, if a string has 10 characters, numbered 0 to 9, a start value of -2 means 7 (9 - 2 = 7). Similarly, if variable is an indexed array, a negative start yields an index by working backwards from the highest subscript in the array.
${variable/pattern/replace} | If variable contains a match for pattern, the first match is replaced with the text of replace.
${variable//pattern/replace} | This is the same as the previous operation, except that every match of the pattern is replaced.
${variable/pattern} | If variable contains a match for pattern, delete the first match of pattern.
${variable/#pattern/replace} | If variable contains a match for pattern, the first match is replaced with the text of replace. The match is constrained to occur at the beginning of variable’s value. If it doesn’t match there, no substitution occurs.
${variable/%pattern/replace} | If variable contains a match for pattern, the first match is replaced with the text of replace. The match is constrained to occur at the end of variable’s value. If it doesn’t match there, no substitution occurs.
The ${variable/pattern} syntax is different from the #, ##, %, and %% operators we saw earlier. Those operators are constrained to match at the beginning or end of the variable’s value, whereas the syntax shown here is not. For example:
$ path=/home/fred/work/file
$ print ${path/work/play}            Change work into play
/home/fred/play/file
$ objname=${filename/%.c/.o}         Change .c to .o, but only at end
$ allfiles="fred.c dave.c pete.c"
$ allobs=${allfiles//.c/.o}
$ print $allobs
fred.o dave.o pete.o
Finally, these operations may be applied to the positional parameters and to arrays, in which case they are done on all the parameters or array elements at once. (Arrays are described in Chapter 6.)
$ print "$@"
hi how are you over there
$ print ${@/h/H}                     Change h to H in all parameters
Hi How are you over tHere
As promised, here is a brief demonstration of the differences between greedy and non-greedy matching regular expressions:
$ x='12345abc6789'
$ print ${x//+([[:digit:]])/X}       Substitution with longest match
XabcX
$ print ${x//+-([[:digit:]])/X}      Substitution with shortest match
XXXXXabcXXXX
$ print ${x##+([[:digit:]])}         Remove longest match
abc6789
$ print ${x#+([[:digit:]])}          Remove shortest match
2345abc6789
Similarly, the third and fourth cases demonstrate removing text from the front of the value, using longest and shortest matching. In the third case, the longest match removes all the digits; in the fourth case, the shortest match removes just a single digit.
A number of operators relate to shell variable names, as seen in Table 4-11.
Namerefs were discussed in Section 4.4,
earlier in this chapter.
See there for an example of
${!
name
}
.
The last two operators in Table 4-11 might be useful for debugging and/or tracing the use of variables in a large script. Just to see how they work:
$ print ${!HIST*}
HISTFILE HISTCMD HISTSIZE
$ print ${!HIST@}
HISTFILE HISTCMD HISTSIZE
Several other operators related to array variables are described in Chapter 6.
There are three remaining operators on variables. One is ${#varname}, which returns the number of characters in the string.[56] (In Chapter 6 we see how to treat this and similar values as actual numbers so they can be used in arithmetic expressions.) For example, if filename has the value fred.c, then ${#filename} would have the value 6. The other two operators (${#array[*]} and ${#array[@]}) have to do with array variables, which are also discussed in Chapter 6.
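A one-line sketch of the length operator (again using echo for portability across shells):

```shell
filename=fred.c
echo ${#filename}    # 6: the number of characters in fred.c
```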
The .sh.match variable was introduced in ksh93l. It is an indexed array (see Chapter 6), whose values are set every time you do a pattern matching operation on a variable, such as ${filename%%*/}, with any of the # or % operators (for the shortest match), ## or %% (for the longest match), or / and // (for substitutions).
.sh.match[0] contains the text that matched the entire pattern. .sh.match[1] contains the text that matched the first parenthesized subexpression, .sh.match[2] the text that matched the second, and so on. The values of .sh.match become invalid (meaning, don’t try to use them) if the variable on which the pattern matching was done changes.
Again, this is a feature meant for more advanced programming and text processing, analogous to similar features in other languages such as perl. If you’re just starting out, don’t worry about it.
From the discussion so far, we’ve seen two ways of getting values into variables: by assignment statements and by the user supplying them as command-line arguments (positional parameters). There is another way: command substitution, which allows you to use the standard output of a command as if it were the value of a variable. You will soon see how powerful this feature is.
The syntax of command substitution is:
$(Unix command)
Here are some simple examples:
The value of $(pwd) is the current directory (same as the environment variable $PWD).
The value of $(ls) is the names of all files in the current directory, separated by newlines.
To find out detailed information about a command if you don’t know where its file resides, type ls -l $(whence -p command). The -p option forces whence to do a pathname lookup and not consider keywords, built-ins, etc.
To get the contents of a file into a variable, you can use varname=$(<filename). $(cat filename) will do the same thing, but the shell catches the former as a built-in shorthand and runs it more efficiently.
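The $(<filename) form can be sketched like this; the temporary filename is just for illustration, and the form is supported by ksh and bash (plain Bourne shells would need cat):

```shell
# Create a small file, then capture its contents in a variable.
tmpfile=/tmp/cmdsub_demo.$$
printf 'line one\nline two\n' > "$tmpfile"

contents=$(< "$tmpfile")    # faster built-in form of $(cat "$tmpfile")
echo "$contents"            # trailing newline is stripped, as usual

rm -f "$tmpfile"
```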
emacs $(grep -l 'command substitution' ch*.xml)
The -l option to grep prints only the names of files that contain matches.
Command substitution, like variable expansion, is done within double quotes. (Double quotes inside the command substitution are not affected by any enclosing double quotes.) Therefore, our rule in Chapter 1 and Chapter 3 about using single quotes for strings unless they contain variables will now be extended: “When in doubt, use single quotes, unless the string contains variables, or command substitutions, in which case use double quotes.”
(For backwards compatibility, the Korn shell supports the original Bourne shell (and C shell) command substitution notation using backquotes: `...`. However, it is considerably harder to use than $(...), since quoting and nested command substitutions require careful escaping. We don’t use the backquotes in any of the programs in this book.)
You will undoubtedly think of many ways to use command substitution as you gain experience with the Korn shell. One that is a bit more complex than those mentioned previously relates to a customization task that we saw in Chapter 3: personalizing your prompt string.
Recall that you can personalize your prompt string by assigning a value to the variable PS1. If you are on a network of computers, and you use different machines from time to time, you may find it handy to have the name of the machine you’re on in your prompt string. Most modern versions of Unix have the command hostname(1), which prints the network name of the machine you are on to standard output. (If you do not have this command, you may have a similar one like uname.) This command enables you to get the machine name into your prompt string by putting a line like this in your .profile or environment file:
PS1="$(hostname) $ "
(Here, the second dollar sign does not need to be preceded by a backslash. If the character after the $ isn’t special to the shell, the $ is included literally in the string.)
For example, if your machine had the name coltrane, then this statement would set your prompt string to "coltrane $ ".
Command substitution helps us with the solution to the next programming task, Task 4-4, which relates to the album database in Task 4-1.
The cut(1) utility is a natural for this task. cut is a data filter: it extracts columns from tabular data.[57] If you supply the numbers of columns you want to extract from the input, cut prints only those columns on the standard output. Columns can be character positions or — relevant in this example — fields that are separated by TAB characters or other delimiters.
Assume that the data table in our task is a file called albums and that it looks like this:
Coltrane, John|Giant Steps|Atlantic|1960|Ja
Coltrane, John|Coltrane Jazz|Atlantic|1960|Ja
Coltrane, John|My Favorite Things|Atlantic|1961|Ja
Coltrane, John|Coltrane Plays the Blues|Atlantic|1961|Ja
...
Here is how we would use cut to extract the fourth (year) column:
cut -f4 -d\| albums
fieldname=$1
cut -f$(getfield $fieldname) -d\| albums
If we ran this script with the argument year
, the output would be:
1960
1960
1961
1961
...
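The getfield command used by the script is not shown in this excerpt. Here is a minimal hypothetical sketch of what it might look like, assuming the albums fields appear in the order artist, title, label, year (an assumption based on the sample data, not on the original text):

```shell
# Hypothetical getfield: translate a field name in the albums
# database into its column number, for use with cut's -f option.
# The field order assumed here: artist|title|label|year|...
getfield() {
    case $1 in
        artist) echo 1 ;;
        title)  echo 2 ;;
        label)  echo 3 ;;
        year)   echo 4 ;;
        *)      echo "getfield: unknown field: $1" >&2
                return 1 ;;
    esac
}

getfield year    # prints 4, so the script runs: cut -f4 -d\| albums
```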
Task 4-5 is another small task that makes use of cut.
The command who(1) tells you who is logged in (as well as which terminal they’re on and when they logged in). Its output looks like this:
billr      console         May 22 07:57
fred       tty02           May 22 08:31
bob        tty04           May 22 08:12
The fields are separated by spaces, not TABs. Since we need the first field, we can get away with using a space as the field separator in the cut command. (Otherwise we’d have to use the option to cut that uses character columns instead of fields.) To provide a space character as an argument on a command line, you can surround it by quotes:
who | cut -d' ' -f1
With the above who output, this command’s output would look like this:
billr
fred
bob
This leads directly to a solution to the task. Just type:
mail $(who | cut -d' ' -f1)
The command mail billr fred bob will run, and then you can type your message.
Task 4-6 is another task that shows how useful command pipelines can be in command substitution.
function lsd {
    date=$1
    ls -l | grep -i "^.\{41\}$date" | cut -c55-
}
This function depends on the column layout of the ls -l command. In particular, it depends on dates starting in column 42 and filenames starting in column 55. If this isn’t the case in your version of Unix, you will need to adjust the column numbers.[58]
We use the grep search utility to match the date given as argument (in the form Mon DD, e.g., Jan 15 or Oct 6, the latter having two spaces) to the output of ls -l. (The regular expression argument to grep is quoted with double quotes, in order to perform the variable substitution.) This gives us a long listing of only those files whose dates match the argument. The -i option to grep allows you to use all lowercase letters in the month name, while the rather fancy argument means, “Match any line that contains 41 characters followed by the function argument.” For example, typing lsd 'jan 15' causes grep to search for lines that match any 41 characters followed by jan 15 (or Jan 15).
The output of grep is piped through our ubiquitous friend cut to retrieve just the filenames. The argument to cut tells it to extract characters in column 55 through the end of the line.
lp $(lsd 'jan 15')
The output of lsd is on multiple lines (one for each filename), but because the variable IFS (see earlier in this chapter) contains newline by default, the shell uses newline to separate words in lsd’s output, just as it normally does with space and TAB.
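You can see this word splitting at work without running lsd at all. A small sketch using set -- to split a newline-separated string into positional parameters:

```shell
# Newlines separate words just like spaces, because IFS contains
# space, TAB, and newline by default.
names='billr
fred
bob'

set -- $names    # unquoted expansion: the shell splits on IFS characters
echo $#          # 3
echo "$2"        # fred
```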
We conclude this chapter with a couple of functions that you may find handy in your everyday Unix use. They solve the problem presented by Task 4-7.
We start by implementing a significant subset of their capabilities and finish the implementation in Chapter 6. (For ease of development and explanation, our implementation ignores some things that a more bullet-proof version should handle. For example, spaces in filenames will cause things to break.)
If you don’t know what a stack is, think of a spring-loaded dish receptacle in a cafeteria. When you place dishes on the receptacle, the spring compresses so that the top stays at roughly the same level. The dish most recently placed on the stack is the first to be taken when someone wants food; thus, the stack is known as a “last-in, first-out” or LIFO structure. (Victims of a recession or company takeovers will also recognize this mechanism in the context of corporate layoff policies.) Putting something onto a stack is known in computer science parlance as pushing, and taking something off the top is called popping.
A stack is very handy for remembering directories, as we will see; it can “hold your place” up to an arbitrary number of times. The cd - form of the cd command does this, but only to one level. For example: if you are in firstdir and then you change to seconddir, you can type cd - to go back. But if you start out in firstdir, then change to seconddir, and then go to thirddir, you can use cd - only to go back to seconddir. If you type cd - again, you will be back in thirddir, because it is the previous directory.[59]
If you want the “nested” remember-and-change functionality that will take you back to firstdir, you need a stack of directories along with the dirs, pushd and popd commands. Here is how these work:[60]
pushd dir does a cd to dir and then pushes dir onto the stack.
popd does a cd to the top directory, then pops it off the stack.
For example, consider the series of events in Table 4-12. Assume that you have just logged in and that you are in your home directory (/home/you).
We will implement a stack as an environment variable containing a list of directories separated by spaces.
DIRSTACK="$PWD"
export DIRSTACK
Next, we need to implement dirs, pushd, and popd as functions. Here are our initial versions:
function dirs {        # print directory stack (easy)
    print $DIRSTACK
}

function pushd {       # push current directory onto stack
    dirname=$1
    cd ${dirname:?"missing directory name."}
    DIRSTACK="$PWD $DIRSTACK"
    print "$DIRSTACK"
}

function popd {        # cd to top, pop it off stack
    top=${DIRSTACK%% *}
    DIRSTACK=${DIRSTACK#* }
    cd $top
    print "$PWD"
}
The popd function makes yet another use of the shell’s pattern-matching operators. The first line uses the %% operator, which deletes the longest match of " *" (a space followed by anything). This removes all but the top of the stack. The result is saved in the variable top, again for readability reasons. The second line is similar, but going in the other direction. It uses the # operator, which tries to delete the shortest match of the pattern "* " (anything followed by a space) from the value of DIRSTACK. The result is that the top directory (and the space following it) is deleted from the stack.
This code is deficient in the following ways: first, it has no provision for errors. For example:
The third problem with the code is that it will not work if, for some reason, a directory name contains a space. The code will treat the space as a separator character. We’ll accept this deficiency for now. However, when you read about arrays in Chapter 6, think about how you might use them to rewrite this code and eliminate the problem.
[46] This actually depends on the setting of your umask, an advanced feature described in Chapter 10.
[48] ksh93 point releases h through l+ used a similar but more restricted mechanism, via a file named .fpath, and they hard-wired the setting of the library path variable. As this feature was not widespread, it was generalized into a single file starting with point release m.
[49] autoload is actually an alias for typeset -fu.
[50] This is a restriction imposed by the Korn shell, not by the POSIX standard.
[51] However, see the section on typeset in Chapter 6 for a way of making variables local to functions.
[52] The Version 6 shell was written by Ken Thompson. Stephen Bourne wrote the Bourne shell for Version 7.
[53] And, for that matter, the same as in grep, sed, ed, vi, etc. One notable difference is that the shell uses ! inside [...] for negation, while the various utilities all use ^.
[54] Don’t laugh — once upon a time, many Unix compilers had shell scripts as front-ends.
[55] basename may be built-in in some versions of ksh93. Thus it’s not guaranteed to run in a separate process.
[56] This may be more than the number of bytes for multibyte character sets.
[57] Some very old BSD-derived systems don’t have cut, but you can use awk instead. Whenever you see a command of the form cut -fN -dC filename, use this instead: awk -FC '{ print $N }' filename.
[58] For example, ls -l on GNU/Linux has dates starting in column 43 and filenames starting in column 57.
[60] We’ve done it here differently from the C shell. The C shell pushd pushes the initial directory onto the stack first, followed by the command’s argument. The C shell popd removes the top directory off the stack, revealing a new top. Then it cds to the new top directory. We feel that this behavior is less intuitive than our design here.