This chapter introduces you to the bash
shell. You will learn how to use some basic commands, such as navigating around the file system, listing files, and displaying the contents of files. This chapter is dense and contains a very eclectic mix of topics to quickly prepare you for later chapters. If you already have some knowledge of shell programming, you can probably skim quickly through this introductory chapter and proceed to Chapter 2.
The first part of this chapter starts with a brief introduction to some Unix shells, and then discusses files, file permissions, and directories. You will also learn how to create files and directories and how to change their permissions.
The second part of this chapter introduces simple shell scripts, along with instructions for making them executable. As you will see, shell scripts contain bash commands (and can optionally contain user-defined functions), so it’s a good idea to learn about bash
commands before you can create shell scripts (which include bash
scripts).
The third portion of this chapter discusses two useful bash commands: the cut
command (for cutting or extracting columns and/or fields from a dataset) and the paste
command (for “pasting” text or datasets together vertically).
In addition, the final part of this chapter uses the material from the previous section (i.e., the cut
command and paste
command) in a use case that illustrates how to switch the order of two columns in a dataset. As you will see later, there are other ways to perform this task, such as invoking the awk
command (discussed in Chapter 5).
There are a few points to keep in mind before delving into the details of shell scripts. First, shell scripts can be executed from the command line after adding “execute” permissions to the text file containing the shell script. Second, you can use the crontab
utility to schedule the execution of your shell scripts. The crontab
utility allows you to specify the execution of a shell script on an hourly, daily, weekly, or monthly basis. Tasks that are commonly scheduled via crontab
include performing backups, removing unwanted files, and so forth. If you are completely new to Unix, just keep in mind that there is a way to run scripts both from the command line and in a “scheduled” manner. Setting file permissions to run the script from the command line will be discussed later.
Third, the contents of any shell script can be as simple as a single command, or can comprise hundreds of lines of bash commands. In general, the more interesting shell scripts involve a combination of several bash commands. A learning tip: since there are usually several ways to produce the desired result, it’s helpful to read other people’s shell scripts to learn how to combine commands in useful ways.
Unix is an operating system created by Ken Thompson in the early 1970s, and today there are several variants available, such as HP-UX for HP machines and AIX for IBM machines. Linus Torvalds developed the Linux operating system during the 1990s, and many Linux commands are the same as their Unix counterparts (but differences exist, often in the commands for system administrators). The Mac OS X operating system is also Unix-based: its core (Darwin) derives largely from BSD Unix.
Unix has a rich and storied history, and if you are really interested in learning about its past, you can read online articles and also Wikipedia. This book foregoes those details and focuses on helping you quickly learn how to become productive with various commands.
The Bourne shell, which was written in the mid-1970s by Stephen R. Bourne, is the ancestor of most modern Unix shells (it replaced the earlier Thompson shell). In addition, the Bourne shell was one of the first shells to appear on Unix systems, and you will sometimes hear “the shell” as a reference to the Bourne shell. The Bourne shell is a POSIX
standard shell, usually installed as /bin/sh
on most versions of Unix, whose default prompt is the $
character. Consequently, Bourne shell scripts will execute on almost every version of Unix. In essence, the AT&T branches of Unix support the Bourne shell (sh), bash
, Korn shell (ksh), tsh
, and zsh
.
However, there is also the BSD
branch of Unix that uses the “C” shell (csh)
, whose default prompt is the %
character. In general, shell scripts written for csh
will not execute on AT&T branches of Unix, unless the csh
shell is also installed on those machines (and vice versa).
The Bourne shell is the most “unadorned” in the sense that it lacks some commands that are available in the other shells, such as history, noclobber
, and so forth. The various subcategories for the Bourne Shell are listed as follows:
Bourne shell (sh)
Korn shell (ksh)
Bourne Again shell (bash)
POSIX shell (sh)
The different C-type shells follow:
C shell (csh)
TENEX/TOPS C shell (tcsh)
While the commands and the shell scripts in this book are based on the bash
shell, many of the commands also work in other shells (and if not, those other shells have a similar command to accomplish the same goal). Performing an Internet search for “how do I do <bash command> in <shell name>” will often get you an answer. Sometimes the command is essentially the same, but with slightly different syntax, and typing “man <command>” in a command shell can provide useful information.
Bash is an acronym for “Bourne Again Shell,” which has its roots in the Bourne shell created by Stephen R. Bourne. Shell scripts based on the Bourne shell will execute in bash
, but the converse is not true. The bash
shell provides additional features that are unavailable in the Bourne shell, such as support for arrays (discussed later in this chapter).
On Mac OS X, the /bin
directory contains the following executable shells:
-r-xr-xr-x 1 root wheel 1377872 Apr 28 2017 /bin/ksh
-r-xr-xr-x 1 root wheel  630464 Apr 28 2017 /bin/sh
-rwxr-xr-x 1 root wheel  375632 Apr 28 2017 /bin/csh
-rwxr-xr-x 1 root wheel  592656 Apr 28 2017 /bin/zsh
-r-xr-xr-x 1 root wheel  626272 Apr 28 2017 /bin/bash
In case you’re interested, a nice comparison matrix of the support for various features among the preceding shells is here:
https://stackoverflow.com/questions/5725296/difference-between-sh-and-bash.
Something else that might surprise you: in some environments the Bourne shell sh
is the Bash shell, which you can check by typing the following command:
sh --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.
If you are new to the command line (be it Mac, Linux, or PCs), please read the Preface, which provides some useful guidelines for accessing command shells.
bash Commands
If you want to see the options for a specific bash command, you can often specify the -? switch (GNU versions of most commands also accept --help). For example, on many systems cat -? displays a usage message that lists the available options for the cat command. You can invoke the man command to see a description of a bash command and its options:
man cat
Keep in mind that the man
command produces terse explanations, and if those explanations are not clear enough, you can search for online code samples that provide more details.
In a command shell you will often perform basic operations, such as displaying (or changing) the current directory, listing the contents of a directory, displaying the contents of a file, and so forth. The following set of commands shows you how to perform these operations, and you can execute a subset of these commands in the sequence that is relevant to you. Options for some of the commands in this section (such as the ls command) are described in greater detail later in this chapter.
A frequently used Bash command is pwd
(“print working directory”), which displays the current directory, as shown here:
pwd
The output of the preceding command might look something like this:
/Users/jsmith
Use the cd
(“change directory”) command to go to a specific directory. For example, type the command cd /Users/jsmith/Mail
or cd Mail
if you are already in the /Users/jsmith
directory. You can navigate to your home directory with either of these commands:
$ cd $HOME
$ cd
One convenient way to return to the previous directory is the command cd - (a dash). Keep in mind that the cd command on Windows, when invoked without arguments, merely displays the current directory (which differs from the Unix cd command).
history Command
The history command displays the history of commands that you executed in the current command shell, as shown here:
history
A sample output of the preceding command is here:
1202 cat longfile.txt > longfile2.txt
1203 vi longfile2.txt
1204 cat longfile2.txt |fold -40
1205 cat longfile2.txt |fold -30
1206 cat longfile2.txt |fold -50
1207 cat longfile2.txt |fold -45
1208 vi longfile2.txt
1209 history
1210 cd /Library/Developer/CommandLineTools/usr/include/c++/
1211 cd /tmp
1212 cd $HOME/Desktop
1213 history
Now you can return to the directory in line 1210
with the following command:
!1210
The command !cd
will search backward through the history of commands to find the first command that matches the cd
command: in this case, line 1212
is the first match. If there weren’t any intervening cd
commands between the current command and the command in line 1210
, then !1210
and !cd
will have the same effect.
NOTE
Be careful with the “!” option with bash commands, because the command that matches the “!” might not be the one you intended, so it’s safer to use the history
command and then explicitly specify the correct number (in that history) when you invoke the “!” operator.
ls Command
The ls command is for listing filenames, and there are many switches available that you can use, as shown in this section. For example, the ls command displays the following filenames (the actual display depends on the font size and the width of the command shell) on my Mac:
apple-care.txt        iphonemeetup.txt  outfile.txt  ssl-instructions.txt
checkin-commands.txt  kyrgyzstan.txt    output.txt
The command ls -1
(the digit “1”) displays a vertical listing of filenames:
apple-care.txt
checkin-commands.txt
iphonemeetup.txt
kyrgyzstan.txt
outfile.txt
output.txt
ssl-instructions.txt
The command ls -l (the letter “l”) displays a long listing of filenames:
total 56
-rwx------ 1 ocampesato staff  25 Jan 06 19:21 apple-care.txt
-rwx------ 1 ocampesato staff 146 Jan 06 19:21 checkin-commands.txt
-rwx------ 1 ocampesato staff 478 Jan 06 19:21 iphonemeetup.txt
-rwx------ 1 ocampesato staff  12 Jan 06 19:21 kyrgyzstan.txt
-rw-r--r-- 1 ocampesato staff  11 Jan 06 19:21 outfile.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 19:21 output.txt
-rwx------ 1 ocampesato staff 176 Jan 06 19:21 ssl-instructions.txt
The command ls -lt (the letters “l” and “t”) displays a time-based long listing:
total 56
-rwx------ 1 ocampesato staff  25 Jan 06 19:21 apple-care.txt
-rwx------ 1 ocampesato staff 146 Jan 06 19:21 checkin-commands.txt
-rwx------ 1 ocampesato staff 478 Jan 06 19:21 iphonemeetup.txt
-rwx------ 1 ocampesato staff  12 Jan 06 19:21 kyrgyzstan.txt
-rw-r--r-- 1 ocampesato staff  11 Jan 06 19:21 outfile.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 19:21 output.txt
-rwx------ 1 ocampesato staff 176 Jan 06 19:21 ssl-instructions.txt
The command ls -ltr (the letters “l”, “t”, and “r”) displays a reversed time-based long listing of filenames:
total 56
-rwx------ 1 ocampesato staff 176 Jan 06 19:21 ssl-instructions.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 19:21 output.txt
-rw-r--r-- 1 ocampesato staff  11 Jan 06 19:21 outfile.txt
-rwx------ 1 ocampesato staff  12 Jan 06 19:21 kyrgyzstan.txt
-rwx------ 1 ocampesato staff 478 Jan 06 19:21 iphonemeetup.txt
-rwx------ 1 ocampesato staff 146 Jan 06 19:21 checkin-commands.txt
-rwx------ 1 ocampesato staff  25 Jan 06 19:21 apple-care.txt
Here is a description of the columns in the preceding output:
Column #1: represents the file type and the permissions on the file (see the following)
Column #2: shows the number of hard links to the file or directory
Column #3: indicates the (Unix user) owner of the file
Column #4: represents the group of the owner
Column #5: represents the file size in bytes
Column #6: shows the date and time when the file was last modified
Column #7: represents the file or directory name
In the ls -l listing example, every line for a file began with the - character, which indicates a regular file; a directory begins with d and a symbolic link begins with l. These (and other) initial values are described as follows:
- | Regular file (ASCII text file, binary executable, or hard link) |
b | Block special file (such as a physical hard drive) |
c | Character special file (such as a terminal device) |
d | Directory file that contains a listing of other files and directories |
l | Symbolic link file |
p | Named pipe (a mechanism for interprocess communications) |
s | Socket (for interprocess communication) |
Consult online documentation for more details regarding the ls command.
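As a minimal sketch of the file-type column, the following script builds a scratch directory (via mktemp, so nothing in your own directories is touched) and extracts the first character of each line of a long listing. The names plain.txt, subdir, and link are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Create a scratch directory holding one regular file, one directory, and one symbolic link.
workdir=$(mktemp -d)
touch "$workdir/plain.txt"
mkdir "$workdir/subdir"
ln -s plain.txt "$workdir/link"

# Keep only the first character of each listing line (the "total" line is skipped),
# then sort those characters and join them into a single string.
types=$(ls -l "$workdir" | grep -v '^total' | cut -c1 | LC_ALL=C sort | tr -d '\n')
echo "file types found: $types"

rm -rf "$workdir"
```

The script should report one character for each of the three entries: a dash for the regular file, d for the directory, and l for the symbolic link.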
Now let’s see how to display different lines of text in a text file. You can use the cat
command to display the entire contents of a file, but it’s a good idea to first get some information about the file contents. Specifically, use the wc
(word count) command that displays the number of lines, words, and characters in a text file, as shown here:
wc longfile.txt
37 80 408 longfile.txt
The preceding output shows that the file longfile.txt
contains 37 lines, 80 words, and 408 characters, which means that the file size is actually quite small (despite its name).
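If you need just one of the three counts, the wc command accepts the -l, -w, and -c switches. The following sketch assumes a small generated file (the name sample.txt is hypothetical) rather than longfile.txt, so that it can run anywhere. Note that -c counts bytes, which equals the number of characters only for single-byte text.

```shell
#!/bin/bash
# Build a three-line sample file (hypothetical name) in a scratch directory.
workdir=$(mktemp -d)
printf 'one two\nthree\nfour five six\n' > "$workdir/sample.txt"

# Reading from standard input makes wc print only the number, without the filename.
lines=$(wc -l < "$workdir/sample.txt")
words=$(wc -w < "$workdir/sample.txt")
bytes=$(wc -c < "$workdir/sample.txt")

# The unquoted expansions drop the leading padding that some versions of wc add.
echo "lines:" $lines "words:" $words "bytes:" $bytes

rm -rf "$workdir"
```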
cat Command
You can use the cat command to display the contents of longfile.txt:
cat longfile.txt
The preceding command displays the following text:
the contents of this long file are too long to see in a single screen and each line contains one or more words and if you use the cat command the (other lines are omitted)
As another example, suppose that the file temp1
has the following contents:
this is line1 of temp1
this is line2 of temp1
this is line3 of temp1
Suppose that the file temp2
has these contents:
this is line1 of temp2
this is line2 of temp2
Now type the following command that contains the ?
metacharacter (discussed in detail later in this chapter):
cat temp?
The output from the preceding command is shown here:
this is line1 of temp1
this is line2 of temp1
this is line3 of temp1
this is line1 of temp2
this is line2 of temp2
head and tail Commands
The head command displays the first ten lines of a text file (by default), an example of which is here:
head longfile.txt
The preceding command displays the following text:
the contents of this long file are too long to see in a single screen and each line contains one or more words
The head
command also provides an option to specify a different number of lines to display, as shown here:
head -4 longfile.txt
The preceding command displays the following text:
the contents of this long file are too long
The tail
command displays the last ten lines (by default) of a text file:
tail longfile.txt
The preceding command displays the following text:
is available in every shell including the bash shell csh zsh ksh and Bourne shell
NOTE
The last two lines in the preceding output are blank lines (not a typographical error in this page).
Similarly, the tail command allows you to specify a different number of lines to display: tail -4 longfile.txt displays the last 4 lines of longfile.txt.
Use the more
command to display a screenful of data, as shown here:
more longfile.txt
Press the <spacebar>
to view the next screenful of data, and press the <return>
key to see the next line of text in a file. Incidentally, some people prefer the less
command, which generates essentially the same output as the more
command. (A geeky joke: “What’s less? It’s more.”)
A very useful feature of Bash is its support for the pipe symbol (“|
”) that enables you to “pipe” or redirect the output of one command to become the input of another command. The pipe command is very handy when you want to perform a sequence of operations involving various Bash commands.
For example, the following code snippet combines the head
command with the cat
command and the pipe (“|
”) symbol:
cat longfile.txt| head -2
A technical point: the preceding command creates two processes (one for cat and one for head; more about processes later), whereas the command head -2 longfile.txt only creates a single process.
You can use the head
and tail
commands in more interesting ways. For example, the following command sequence displays lines 11 through 15 of longfile.txt:
head -15 longfile.txt |tail -5
The preceding command displays the following text:
and if you
use the cat
command the
file contents
scroll
Display the line numbers for the preceding output as follows:
cat -n longfile.txt | head -15 | tail -5
The preceding command displays the following text:
11 and if you
12 use the cat
13 command the
14 file contents
15 scroll
You won’t see the “tab” character from the output, but it’s visible if you redirect the previous command sequence to a file and then use the “-t” option with the cat
command:
cat -n longfile.txt | head -15 | tail -5 > 1
cat -t 1
11^Iand if you
12^Iuse the cat
13^Icommand the
14^Ifile contents
15^Iscroll
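The head and tail combination shown earlier generalizes to any range of lines. The following sketch wraps it in a function; the name print_lines and the generated file numbers.txt are hypothetical, and the function assumes that the file contains at least as many lines as the last line requested. The -n switch is the POSIX spelling of the shorter head -15 form used above.

```shell
#!/bin/bash
# print_lines FIRST LAST FILE: print lines FIRST through LAST of FILE (hypothetical helper).
print_lines() {
  local first=$1 last=$2 file=$3
  # head keeps lines 1..LAST; tail then keeps the final (LAST - FIRST + 1) of those.
  head -n "$last" "$file" | tail -n "$((last - first + 1))"
}

# Demonstrate on a generated 20-line file so that the example is self-contained.
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"
result=$(print_lines 11 15 "$workdir/numbers.txt")
echo "$result"
rm -rf "$workdir"
```

Because each line of the generated file contains its own line number, the output makes it easy to confirm that the range is correct.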
fold Command
The fold command enables you to “fold” the lines in a text file, which is useful for text files that contain long lines of text that you want to split into shorter lines. For example, here are the contents of longfile2.txt:
the contents of this long file are too long to see in a single screen and each line contains one or more words and if you use the cat command the file contents scroll off the screen so you can use other commands such as the head or tail or more commands in conjunction with the pipe command that is very useful in Unix and is available in every shell including the bash shell csh zsh ksh and Bourne shell
You can “fold” the contents of longfile2.txt
into lines whose length is 45
(just as an example) with this command:
cat longfile2.txt |fold -45
The output of the preceding command is here:
the contents of this long file are too long t
o see in a single screen and each line contai
ns one or more words and if you use the cat c
ommand the file contents scroll off the scree
n so you can use other commands such as the h
ead or tail or more commands in conjunction w
ith the pipe command that is very useful in U
nix and is available in every shell including
 the bash shell csh zsh ksh and Bourne shell
Notice that some words in the preceding output are split based on the line width, and not “newspaper style.”
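If you prefer lines that break between words, the fold command provides the -s switch, which breaks each line at a blank when possible. The following sketch compares the two behaviors on a short sentence held in a variable (the variable names are hypothetical). The exact placement of blanks at the line breaks can differ slightly between implementations of fold, so no output is reproduced here.

```shell
#!/bin/bash
# Sample sentence (the start of the longfile2.txt text) stored in a variable.
text="the contents of this long file are too long to see in a single screen"

# Without -s, fold breaks after exactly 20 characters, even in the middle of a word.
hard=$(printf '%s\n' "$text" | fold -w 20)

# With -s, fold breaks at a blank so that words are kept whole where possible.
soft=$(printf '%s\n' "$text" | fold -s -w 20)

printf '%s\n' "--- fold -w 20 ---" "$hard" "--- fold -s -w 20 ---" "$soft"
```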
In Chapter 4 you will learn how to display the lines in a text file that match a string or a pattern, and in Chapter 5 you will learn how to replace a string with another string in a text file.
Unix files have rwx privileges, where r = read privilege, w = write privilege, and x = execute privilege. A file with the execute privilege can be executed from the command line by typing the path to the file (for example, ./filename for a file in your current directory, or just the file name if its directory is listed in the PATH variable). Invoking an executable text file from the command line will cause the operating system to attempt to execute the commands inside the text file.
Use the chmod
command to set permissions for files. For example, if you need to set the permission rwx rw- r--
for a file, use the following:
chmod u=rwx,g=rw,o=r filename
In the preceding command the options u, g
, and o
represent user permissions, group permissions, and other permissions, respectively.
In order to add additional permissions on the current file, use +
to add permission to user, group, or others and use -
to remove the permissions. For example, given a file with the permissions rwx rw- r--
, add the executable permission as follows:
chmod o+x filename
This command adds the x
permission for others
.
Add the executable permission to all permission categories—that is, for user, group, and others—as follows:
chmod a+x filename
In the preceding command, the letter a
means “all.”
Specify a -
in order to remove any permission, as shown here:
chmod a-x filename
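The following sketch applies the chmod forms from this section to a scratch file and records the resulting permission string after each step. The helper function perms and the file name demo.txt are hypothetical. The octal form chmod 764 does not appear in the text above: it is included as an equivalent spelling of rwx rw- r-- (7 = rwx, 6 = rw-, 4 = r--).

```shell
#!/bin/bash
# Work on a scratch file so that no real files are affected.
workdir=$(mktemp -d)
f="$workdir/demo.txt"
touch "$f"

# perms FILE: print the nine permission characters from the long listing (hypothetical helper).
perms() { ls -l "$1" | cut -c2-10; }

chmod u=rwx,g=rw,o=r "$f"; step1=$(perms "$f")   # symbolic form of rwx rw- r--
chmod 764 "$f";            step2=$(perms "$f")   # octal form of the same permissions
chmod a-x "$f";            step3=$(perms "$f")   # remove execute for everyone
chmod o+x "$f";            step4=$(perms "$f")   # add execute for others only

echo "$step1 $step2 $step3 $step4"
rm -rf "$workdir"
```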
A so-called “invisible” (or hidden) file is one whose first character is the dot or period character (.). Unix programs (including the shell) use most of these files to store configuration information. Some common examples of hidden files include the following files:
.profile: the Bourne shell (sh) initialization script
.bash_profile: the bash shell (bash) initialization script
.kshrc: the Korn shell (ksh) initialization script
.cshrc: the C shell (csh) initialization script
.rhosts: the remote shell configuration file
To list invisible files, specify the -a
option to ls
:
ls -a
.        .profile   docs    lib      test_results
..       .rhosts    hosts   pub      users
.emacs   bin        hw1     res.01   work
.exrc    ch07       hw2     res.02
.kshrc   ch07.bak   hw3     res.03
Single dot .: This represents current directory.
Double dot ..: This represents parent directory.
Problematic filenames contain one or more whitespaces, hidden (non-printing) characters, or start with a dash (“-”) character.
You can use double quotes to list filenames that contain whitespaces, or you can precede each whitespace by a backslash (“\
”) character.
For example, if you have a file named One Space.txt
, you can use the ls
command as follows:
ls -l "One Space.txt"
ls -l One\ Space.txt
Filenames that start with a dash (“-”) character are difficult to handle because the dash character is the prefix that specifies options for bash commands. Consequently, if you have a file whose name is -abc, then the command ls -abc will not work correctly, because the characters after the dash are interpreted as switches for the ls command (on many systems a, b, and c are all valid ls switches, so the command lists the current directory instead of that file).
In most cases the best solution to this type of file is to rename the file. This can be done in your operating system if your client isn’t a Unix shell, or you can use the following special syntax for the mv (“move”) command to rename the file. The two dashes tell mv that no more options follow, so a leading dash in a filename is not treated as a switch. An example is here:
mv -- -abc.txt renamed-abc.txt
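The same two-dash convention works with most other commands, and a leading ./ is an alternative that works even with commands that do not recognize the two dashes. The following sketch shows both techniques in a scratch directory; the variable names are hypothetical.

```shell
#!/bin/bash
# Create a file whose name starts with a dash inside a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch -- -abc.txt

# Technique 1: "--" marks the end of the options, so -abc.txt is treated as a filename.
listed1=$(ls -- -abc.txt)

# Technique 2: with a leading "./" the name no longer starts with a dash.
listed2=$(ls ./-abc.txt)

# printf is used instead of echo because echo might treat a leading dash as an option.
printf '%s\n' "$listed1" "$listed2"

# Rename the file as described above, then clean up.
mv -- -abc.txt renamed-abc.txt
cd / && rm -rf "$workdir"
```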
There are many built-in environment variables available, and the following subsections discuss some of the more common variables.
env Command
The env (“environment”) command displays the variables that are in your bash environment. An example of the output of the env command is here:
SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/73/39lngcln4dj_scmgvsv53g_w0000gn/T/
OLDPWD=/tmp
TERM_SESSION_ID=63101060-9DF0-405E-84E1-EC56282F4803
USER=ocampesato
COMMAND_MODE=unix2003
PATH=/opt/local/bin:/Users/ocampesato/android-sdk-mac_86/platform-tools:/Users/ocampesato/android-sdk-mac_86/tools:/usr/local/bin:
PWD=/Users/ocampesato
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
LANG=en_US.UTF-8
NODE_PATH=/usr/local/lib/node_modules
HOME=/Users/ocampesato
LOGNAME=ocampesato
DISPLAY=/tmp/launch-xnTgkE/org.macosforge.xquartz:0
SECURITYSESSIONID=186a4
_=/usr/bin/env
Some interesting examples of setting an environment variable and also executing a command are described here:
https://stackoverflow.com/questions/13998075/setting-environment-variable-for-one-program-call-in-bash-using-env.
This section discusses some important environment variables, most of which you probably will not need to modify, but it’s useful to be aware of the existence of these variables and their purpose.
The HOME
variable contains the absolute path of the user’s home directory.
The HOSTNAME
variable specifies the Internet name of the host.
The LOGNAME
variable specifies the user’s login name.
The PATH
variable specifies the search path (see next subsection).
The SHELL
variable specifies the absolute path of the current shell.
The USER variable specifies the user’s current username. This value might be different from the login name if a superuser executes the su command to emulate another user’s permissions.
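The following sketch prints a few of these variables and also illustrates setting a variable for a single command, which is the technique discussed at the link above. GREETING is a hypothetical variable used purely for illustration, and the script assumes that sh is available on your system.

```shell
#!/bin/bash
# Print a few standard variables; ${NAME:-text} substitutes text when NAME is unset or empty.
echo "HOME  = ${HOME:-(not set)}"
echo "USER  = ${USER:-(not set)}"
echo "SHELL = ${SHELL:-(not set)}"

# GREETING is a hypothetical variable; make sure that it starts out unset.
unset GREETING

# A VAR=value prefix sets the variable in the environment of that one command only.
child_value=$(GREETING=hello sh -c 'echo "$GREETING"')
parent_value=${GREETING:-unset}

echo "inside the child command: $child_value"
echo "back in the parent shell: $parent_value"
```

The last line should report that the variable is unset, because the prefix form does not change the calling shell.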
PATH Environment Variable
Programs and other executable files can live in many directories, so operating systems provide a search path that lists the directories that the OS searches for executable files. Adding a directory to your path means an executable file can be called by just using the filename as a command, without having to call out its entire path, just as if it resided in your working directory.
The path is stored in an environment variable, which is a named string maintained by the operating system. These variables contain information available to the command shell and other programs.
The path variable is named PATH in Unix or Path in Windows (Unix is case-sensitive; Windows is not).
Setting the path in bash/Linux:
export PATH=$HOME/anaconda:$PATH
To add the Python
directory to the path for a particular session in bash:
export PATH="$PATH:/usr/local/bin/python"
In Bourne shell or ksh shell enter this command:
PATH="$PATH:/usr/local/bin/python"
NOTE
/usr/local/bin
is the location of the Python
executable
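Running an export command like the preceding ones more than once adds the same directory repeatedly. As a minimal sketch, the following function appends a directory only when it is not already listed; the function name add_to_path and the directory /opt/example/bin are hypothetical. The first command also shows a convenient way to display the search path with one directory per line.

```shell
#!/bin/bash
# List the directories in PATH, one per line, in the order in which they are searched.
echo "$PATH" | tr ':' '\n'

# add_to_path DIR: append DIR to PATH unless it is already listed (hypothetical helper).
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already present: do nothing
    *) PATH="$PATH:$1" ;;      # otherwise append the directory
  esac
}

add_to_path /opt/example/bin   # hypothetical directory
add_to_path /opt/example/bin   # the second call changes nothing
export PATH
```

Wrapping PATH in colons inside the case statement lets one pattern match a directory at the start, in the middle, or at the end of the list.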
The following command defines a shell variable called h1 (it becomes an environment variable, visible to child processes, only after you export it):
h1=$HOME/test
Now if you enter the following command:
echo $h1
you will see the following output on OS X:
/Users/jsmith/test
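The difference between a plain assignment and an exported variable shows up when another program is launched. The following sketch asks a child shell to report the value of h1 before and after the export command; it assumes that sh is available and that h1 is not already set in your environment (the script unsets it first to be safe).

```shell
#!/bin/bash
unset h1

# A plain assignment is visible in this shell but is not passed to child processes.
h1=$HOME/test
before=$(sh -c 'echo "${h1:-unset}"')

# After export, h1 is part of the environment that child processes inherit.
export h1
after=$(sh -c 'echo "${h1:-unset}"')

echo "child sees before export: $before"
echo "child sees after export:  $after"
```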
The next code snippet shows you how to set the alias ll
so that it displays a long listing of a directory:
alias ll="ls -l"
The following three alias definitions involve the ls
command and various switches:
alias ll="ls -l"
alias lt="ls -lt"
alias ltr="ls -ltr"
As an example, you can replace the command ls -ltr
(the letters “l,” “t,” and “r”) that you saw earlier in the chapter with the ltr
alias, and you will see the same reversed time-based long listing of filenames (reproduced here):
total 56
-rwx------ 1 ocampesato staff 176 Jan 06 19:21 ssl-instructions.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 19:21 output.txt
-rw-r--r-- 1 ocampesato staff  11 Jan 06 19:21 outfile.txt
-rwx------ 1 ocampesato staff  12 Jan 06 19:21 kyrgyzstan.txt
-rwx------ 1 ocampesato staff 478 Jan 06 19:21 iphonemeetup.txt
-rwx------ 1 ocampesato staff 146 Jan 06 19:21 checkin-commands.txt
-rwx------ 1 ocampesato staff  25 Jan 06 19:21 apple-care.txt
You can also define an alias that contains the Bash pipe (“|
”) symbol:
alias ltrm="ls -ltr|more"
In a similar manner, you can define aliases for directory related commands:
alias ltd="ls -lt | grep '^d'"
alias ltdm="ls -lt | grep '^d'|more"
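Aliases are mainly a convenience for interactive use: by default bash does not expand aliases inside shell scripts, and an alias cannot place its arguments in the middle of a pipeline. A shell function avoids both limitations. The following sketch rewrites the ltd alias as a function with the same name; the scratch directory and its contents are hypothetical.

```shell
#!/bin/bash
# ltd [DIR...]: long, time-sorted listing restricted to directories (function version of the alias).
ltd() {
  ls -lt "$@" | grep '^d'
}

# Try it on a scratch directory that contains one subdirectory and one regular file.
workdir=$(mktemp -d)
mkdir "$workdir/docs"
touch "$workdir/notes.txt"
count=$(ltd "$workdir" | wc -l)
echo "directories found:" $count
rm -rf "$workdir"
```

Because "$@" passes the function arguments to ls, you can type ltd /tmp, whereas the alias would append /tmp after the grep command.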
There are several commands available for finding executable files (binary files or shell scripts) by searching the directories in the PATH
environment variable: which, whence, whereis
, and whatis
. These commands produce results similar to the which
command, as discussed below.
The which
command gives the full path to whatever executable that you specify or a blank line if the executable is not in any directory that is specified in the PATH
environment variable. This is useful for finding out whether a particular command or utility is installed on the system.
which rm
The output of the preceding command is here:
/usr/bin/rm
The whereis command provides the information that you get from the which command, and also the location of the man page of the executable:
$ whereis rm
rm: /bin/rm /usr/share/man/man1/rm.1.bz2
The whatis command looks up the specified command in the whatis database, which is useful for identifying system commands and important configuration files. Consider it a simplified “man” command: it displays a one-line description of a command, whereas man displays the full documentation (e.g., type man ls and you will see several pages of explanation regarding the ls command).
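Inside shell scripts, a portable way to test whether a command is available is the command -v built-in, which is specified by POSIX and (unlike which) also recognizes shell built-ins and functions. The following sketch is an alternative to the commands above rather than a replacement for them; the function name have and the command name no-such-command-xyz are hypothetical.

```shell
#!/bin/bash
# have CMD: succeed if CMD can be run from this shell (hypothetical helper).
have() {
  command -v "$1" > /dev/null 2>&1
}

for cmd in ls rm no-such-command-xyz; do
  if have "$cmd"; then
    echo "$cmd: found at $(command -v "$cmd")"
  else
    echo "$cmd: not found"
  fi
done
```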
Shell scripts contain bash
commands, which are executed sequentially from top to bottom (i.e., in the sequence that they appear in a shell script), unless they are defined inside a function. In particular, user-defined functions in shell scripts are executed in the order that they are invoked instead of in the order that they appear in the shell script. However, you can change the sequence in which commands are executed by using conditional logic, case statements, loops, and functions.
Shell scripts can contain whatever bash
commands are available on your system (but be aware that some commands require the sudo
command, which in turn requires a password). Simple examples of shell scripts include file-related commands that create files, read data from files, and update the contents of files. Regardless of the contents of your shell scripts, they are interpreted “on the fly,” so there are no compilation steps that create a binary executable.
The purpose of shell scripts is to automate the process of executing a set of bash
commands so that you don’t need to execute them manually from the command line. If you need to execute a simple command from the command line, then it’s unlikely that you need to do so via a shell script: just type the command and press the <RETURN> key. Note that the Unix crontab utility enables you to schedule the execution of shell scripts at various points in time (the details of the crontab utility are outside the scope of this book).
As you probably know, comments are important in source code. A good shell script contains meaningful comments, which are preceded by a pound sign “#
,” that explain the purpose of different sections in the shell script. The exception is when the “#
” symbol appears in the first line of a shell script, as you will see in the next section.
Create the file test.sh
(using your favorite text editor) with the following contents:
#!/bin/bash
pwd
ls
cd /tmp
ls
mkdir /tmp/abc
touch /tmp/abc/emptyfile
ls /tmp/abc/
Now save the above content and make this script executable as follows:
chmod +x test.sh
Now you have your shell script ready to be executed as follows:
./test.sh
NOTE
The output from launching test.sh
depends on the contents of the /tmp
directory.
The first line in test.sh
is called the “shebang” line, which directs the system to launch the bash shell in order to invoke the commands in test.sh
. The term shebang is sort of a contraction of “hash” (for the “#” character) and “bang” (for the “!” character). Note that the initial “./
” of ./test.sh
specifies the file test.sh
in the current directory: if the file test.sh
is in your home directory, specify $HOME/test.sh.
In addition, if “.” is included in the PATH
environment variable, then you can simply type test.sh
without the “./
” prefix.
One point regarding the mkdir command: if you specify a path in which intermediate directories do not exist, then you need to use the -p switch. For example, if the directory /tmp/abc does not exist, then the following command requires the -p switch:
mkdir -p /tmp/abc/def
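The following sketch shows both outcomes in a scratch directory created by mktemp (so the result does not depend on the contents of /tmp). The variable names first_try and second_try are hypothetical.

```shell
#!/bin/bash
# Use a scratch directory so that the example starts from a known state.
workdir=$(mktemp -d)

# Without -p this fails, because the intermediate directory abc does not exist yet.
if mkdir "$workdir/abc/def" 2> /dev/null; then
  first_try="created"
else
  first_try="failed"
fi

# With -p, mkdir creates abc and then def; repeating the command is not an error.
mkdir -p "$workdir/abc/def"
mkdir -p "$workdir/abc/def"
second_try="failed"
[ -d "$workdir/abc/def" ] && second_try="created"

echo "without -p: $first_try; with -p: $second_try"
rm -rf "$workdir"
```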
As another example of a simple shell script, the following script uses the read
command, which takes the input from the keyboard and assigns that input value as the value of the variable PERSON
. The echo
command prints the input value on STDOUT
, which is the screen (by default).
#!/bin/sh
echo "What is your name?"
read PERSON
echo "Hello, $PERSON"
Here is sample invocation of this script:
$ ./test.sh
What is your name?
John Smith
Hello, John Smith
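A slightly more defensive variation is sketched below. The -r switch stops read from treating backslashes as escape characters, and the ${person:-stranger} expansion supplies a default when the input line is empty. The function name greet and the default value are hypothetical; the input arrives through a pipe so that the example can run unattended.

```shell
#!/bin/bash
# greet: read one line from standard input and print a greeting (hypothetical helper).
greet() {
  local person
  # -r keeps backslashes literal; IFS= preserves leading and trailing blanks.
  IFS= read -r person
  # Fall back to a default name when the input line is empty.
  echo "Hello, ${person:-stranger}"
}

# Supplying the input through a pipe replaces typing at the keyboard.
echo "John Smith" | greet
```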
You can combine multiple commands with a semicolon (“;”), as shown here:
cd /tmp; pwd; cd ~; pwd
The preceding code snippet navigates to the /tmp directory, prints the full path to the current directory, changes to your home directory (which is what ~ represents), and again prints the full path to the current directory. The output of the preceding command is here:
/tmp
/Users/jsmith
You can use command substitution (discussed in a later section) to assign the output to a variable, as shown here:
x=`cd /tmp; pwd; cd ~; pwd`
echo $x
The output of the preceding snippet is here:
/tmp /Users/jsmith
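Notice that the two paths appear on a single line: the variable x contains a newline between them, but the unquoted $x is split into words that echo joins with single spaces. The following sketch contrasts the unquoted and quoted forms. It uses the $(...) spelling of command substitution, which is equivalent to back ticks and easier to nest, and it changes to / instead of the home directory so that the result does not depend on the user (this assumes that /tmp exists).

```shell
#!/bin/bash
# Command substitution runs in a subshell, so these cd commands do not
# change the current directory of the calling shell.
x=$(cd /tmp; pwd; cd /; pwd)

unquoted=$(echo $x)      # word splitting joins the two lines with a single space
quoted=$(echo "$x")      # quoting preserves the newline between the two paths

echo "unquoted: $unquoted"
echo "quoted:"
echo "$quoted"
```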
printf Command and the echo Command
In brief, use the printf command instead of the echo command if you need to control the output format. One key difference is that the echo command appends a newline character to its output, whereas the printf command prints a newline only where the format string contains \n. Keep this point in mind when you see the printf statement in the awk code samples in Chapter 5.
As a simple example, place the following code snippet in a shell script:
printf "%-5s %-10s %-4s\n" ABC DEF GHI
printf "%-5s %-10s %-4.2f\n" ABC DEF 12.3456
Make the shell script executable and then launch the shell script, after which you will see the following output:
ABC   DEF        GHI
ABC   DEF        12.35
On the other hand, if you type the following pair of commands:
echo "ABC DEF GHI"
echo "ABC DEF 12.3456"
you will see the following output:
ABC DEF GHI
ABC DEF 12.3456
A detailed (and very lengthy) discussion regarding the printf command and the echo command is here:
https://unix.stackexchange.com/questions/65803/why-is-printf-better-than-echo
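As a small illustration of why printf is often the safer choice, the following sketch prints a value that happens to look like an option. It assumes the builtin echo of bash, which treats -n as an option; the variable names are illustrative.

```shell
#!/bin/bash
value="-n"
with_echo=$(echo "$value")              # echo parses -n as an option and prints nothing
with_printf=$(printf '%s\n' "$value")   # printf prints the string literally

# printf reuses the format string until all arguments are consumed.
rows=$(printf '%-5s|%5s|\n' ABC 1 DEF 22)

echo "echo gave:   [$with_echo]"
echo "printf gave: [$with_printf]"
echo "$rows"
```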
The echo Command and Whitespaces
The echo command preserves whitespaces in variables only when the variable is enclosed in double quotes, so in some cases the results might be different from your expectations.
Listing 1.1 displays the contents of EchoCut.sh, which illustrates the differences that can occur when the echo command is used with the cut command.
x1="123   456 789"
x2="123 456 789"
echo "x1 = $x1"
echo "x2 = $x2"
x3=`echo $x1 | cut -c1-7`
x4=`echo "$x1" | cut -c1-7`
x5=`echo $x2 | cut -c1-7`
echo "x3 = $x3"
echo "x4 = $x4"
echo "x5 = $x5"
Launch the code in Listing 1.1 and you will see the following output:
x1 = 123   456 789
x2 = 123 456 789
x3 = 123 456
x4 = 123   4
x5 = 123 456
The value of x3 is probably different from what you expected: there is only one blank space between 123 and 456 instead of the three blank spaces that appear in the definition of the variable x1.
This seemingly minor detail is important when you write shell scripts that check the values contained in specific columns of text files, such as payroll files and other files with financial data. The solution involves the use of double quote marks (and sometimes the IFS variable that is discussed in Chapter 2) that you can see in the definition of x4.
The “back tick” or command substitution feature of the Bourne shell is very powerful and enables you to combine multiple bash commands. You can also write very compact and powerful (and complicated) shell scripts with command substitution. The syntax is to simply precede and follow your command with a “`” (back tick) character. In Listing 1.2, the back tick command is `ls *py`.
Listing 1.2 displays the contents of CommandSubst.sh, which displays a subset of the list of files in a directory.
for f in `ls *py`
do
  echo "file is: $f"
done
Listing 1.2 contains a for loop that displays the filenames (in the current directory) that have a “py” suffix.
The output of Listing 1.2 on my MacBook is here:
file is: CapitalizeList.py
file is: CompareStrings.py
file is: FixedColumnCount1.py
file is: FixedColumnWidth1.py
file is: LongestShortest1.py
file is: My2DMatrix.py
file is: PythonBash.py
file is: PythonBash2.py
file is: StringChars1.py
file is: Triangular1.py
file is: Triangular2.py
file is: Zip1.py
NOTE
The output depends on whether or not you have any files with a .py suffix in the directory where you execute CommandSubst.sh.
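Looping over the output of ls splits filenames that contain spaces. A common alternative is to loop over the glob pattern directly, as in the following minimal sketch. The temporary directory and the file names in it are made up for illustration.

```shell
#!/bin/bash
# Build a throwaway directory with sample files (names are illustrative).
workdir=$(mktemp -d)
touch "$workdir/one.py" "$workdir/two words.py" "$workdir/notes.txt"

count=0
for f in "$workdir"/*.py; do
  [ -e "$f" ] || continue          # skip the literal pattern if nothing matches
  echo "file is: $(basename "$f")"
  count=$((count + 1))
done

rm -rf "$workdir"
echo "matched $count files"
```

The glob keeps "two words.py" as a single item, whereas the back tick version would treat it as two separate words.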
A very important concept when using shell scripts is that any variables set inside the script are no longer set when the script finishes executing. The rules are shown as follows:
If a variable isn’t set in a script, but is already defined and exported (for example, with the export command) before the script is executed, that variable will also be available inside the script.
If a variable is set in a script, it will override any existing variable with the same name after the variable is set, but once the script ends, the variable will revert to its old value (or to no value, if it did not exist outside the shell script).
For example, if your $HOME directory is /Users/jsmith, but inside a script on line 10 you define $HOME to be /Users/common/bin, then the value of $HOME is initially /Users/jsmith for lines 1–9, then becomes /Users/common/bin on line 10, and maintains that value until the last command in the shell script is executed. Then the value reverts to /Users/jsmith.
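The reason for this behavior is related to how Unix structures its processes: each shell script runs in a separate child process (a new shell), and a child process cannot modify the variables of the shell that launched it. A detailed discussion of processes is beyond the scope of this book.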
Therefore, the default behavior is that if you set the value of a variable in a shell script, then that variable (and its value) exists only for the duration of the execution of the shell script. There is a simple “workaround” whereby variables “hold” their values after a shell script has completed, and you’ll learn how to do so in a subsequent section.
Just to make sure that the distinction is clear, consider Listing 1.3, which displays the contents of the shell script abc.sh.
export x="123"
echo "inside abc.sh"
echo "x = $x"
Make sure that abc.sh is an executable shell script with the chmod command (as shown earlier in this chapter) and then launch the following sequence of commands from the command line:
export x="tom"
echo "x = $x"
./abc.sh
echo "x = $x"
The output from the preceding commands is here:
x = tom
inside abc.sh
x = 123
x = tom
As you can see, the value that is assigned to the variable x is only for the duration of the process associated with the shell script abc.sh. After execution has completed, the process terminates and the value of x reverts to its original value. Fortunately, there is a way to ensure that the values of variables in a shell script can be “set” for the current shell, a technique called “sourcing” the shell script, as described in the next section.
Now execute the following sequence of commands:
export x="tom smith"
echo "x = $x"
. abc.sh
echo "x = $x"
The output from the preceding commands is here:
x = tom smith
inside abc.sh
x = 123
x = 123
In the preceding code block, the value assigned to the variable x inside the shell script abc.sh overrides its previously defined value because “sourcing” (also called “dotting”) a shell script does not create a new process. Consequently, if a shell script assigns a new value to an existing variable, that new value is placed in the current environment and the previously defined value is lost.
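The difference between executing and sourcing a script can be checked in a few lines. The following is a minimal sketch that writes a throwaway script to a temporary file; the variable names after_child and after_source are illustrative.

```shell
#!/bin/bash
# Create a throwaway script that assigns a new value to x.
script=$(mktemp)
printf 'x="123"\n' > "$script"

x="tom"
sh "$script"            # runs in a child process
after_child=$x          # the current shell still has "tom"

. "$script"             # sourced: runs in the current shell
after_source=$x         # the current shell now has "123"

rm -f "$script"
echo "after child: $after_child, after sourcing: $after_source"
```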
Arrays are critical to data management and appear in a variety of real-world contexts. A common task is to group related data elements together and then reference them as a single row.
For example, at a volunteer event you might have to sign in and provide your name, address, and phone number so that the organizers can contact you later for future events. That related data could be thought of (and defined in bash) as:
volunteer[0]="name"
volunteer[1]="address"
volunteer[2]="phone number"
The sign-in list could then be captured as a file that uses an internal field separator (IFS) to make each row a volunteer and each data element (name, address, phone number) distinct, which makes the file easy to use with a later bash script (or any other programming language or program that understands the concept of IFS).
The IFS is a concept covered in detail in Chapter 2, but it will be used in the following examples so you get a taste of how it is used. If you are familiar with “.csv” (comma separated value) text output from spreadsheets, the comma in those files is the IFS. If you were to open a sign-in list that uses commas as the IFS in an Excel spreadsheet or Google Doc, you would have column A = name, column B = address, and column C = phone number, with each row a separate volunteer.
This section contains several shell scripts that illustrate some useful features of arrays in bash. Listing 1.4 displays the contents of array1.sh, which illustrates how to use an array and some operations that you can perform on arrays.
The syntax in bash is different enough from other programming languages that it’s worthwhile to use several examples to explore its behavior.
#!/bin/bash
# method #1:
fruits[0]="apple"
fruits[1]="banana"
fruits[2]="cherry"
fruits[3]="orange"
fruits[4]="pear"
echo "first fruit: ${fruits[0]}"
# method #2:
declare -a fruits2=(apple banana cherry orange pear)
echo "first fruit: ${fruits2[0]}"
# range of elements:
echo "last two: ${fruits[@]:3:2}"
# substring of element:
echo "substring: ${fruits[1]:0:3}"
arrlength=${#fruits[@]}
echo "length: ${#fruits[@]}"
Listing 1.5 displays the contents of names.txt and Listing 1.6 displays the contents of array-from-file.sh, which contains a for loop to iterate through the elements of an array whose initial values are based on the contents of names.txt.
Jane Smith
John Jones
Dave Edwards

#!/bin/bash
names="names.txt"
contents1=( `cat "$names"` )
echo "First loop:"
for w in "${contents1[@]}"
do
  echo "$w"
done
IFS=""
names="names.txt"
contents1=( `cat "$names"` )
echo "Second loop:"
for w in "${contents1[@]}"
do
  echo "$w"
done
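Listing 1.6 initializes the array variable contents1 by using command substitution with the cat command, followed by a loop that displays the elements of the array contents1. Because the default value of IFS splits on spaces, tabs, and newlines, each word becomes a separate array element. The second for loop is the same code as the first for loop, but this time with the value of IFS equal to “” (the empty string), which disables word splitting: the array then contains a single element that holds the entire contents of the file, so each row of names.txt is displayed intact.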
Launch the code in Listing 1.6 and you will see the following output:
First loop:
Jane
Smith
John
Jones
Dave
Edwards
Second loop:
Jane Smith
John Jones
Dave Edwards
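If you need one array element per line of the file, a while loop with the read command is a common approach. The following is a minimal sketch, not the method used in Listing 1.6; it writes the sample names to a temporary file so that it is self-contained, and the array name lines is illustrative.

```shell
#!/bin/bash
# Recreate the sample data in a temporary file.
names_file=$(mktemp)
printf '%s\n' "Jane Smith" "John Jones" "Dave Edwards" > "$names_file"

# Read the file line by line; IFS= and -r keep each line exactly as written.
lines=()
while IFS= read -r line; do
  lines+=("$line")
done < "$names_file"
rm -f "$names_file"

echo "number of elements: ${#lines[@]}"
echo "second element: ${lines[1]}"
```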
Listing 1.7 displays the contents of array-function.sh, which illustrates how to initialize an array and then display its contents in a user-defined function.
#!/bin/bash
# compact version of the code later in this script:
#items() { for line in "${@}" ; do printf "%s\n" "${line}" ; done ; }
#aa=( 7 -4 -e ) ; items "${aa[@]}"
items() {
  for line in "${@}"
  do
    printf "%s\n" "${line}"
  done
}
arr=( 123 -abc 'my data' )
items "${arr[@]}"
Listing 1.7 contains the items() function that displays the contents of the arr array that has been initialized prior to invoking this function. The output is shown here:
123
-abc
my data
Listing 1.8 displays the contents of array-loops1.sh, which illustrates how to determine the length of an initialized array and then display its contents via a for loop.
#!/bin/bash
fruits[0]="apple"
fruits[1]="banana"
fruits[2]="cherry"
fruits[3]="orange"
fruits[4]="pear"
# array length:
arrlength=${#fruits[@]}
echo "length: ${#fruits[@]}"
# print each element via a loop:
for (( i=1; i<${arrlength}+1; i++ ));
do
  echo "element $i of ${arrlength} : " ${fruits[$i-1]}
done
Listing 1.8 contains straightforward code for initializing an array and displaying its values.
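An alternative to counting from 1 up to the array length is to iterate over the array indexes directly with the "${!array[@]}" expansion. The following minimal sketch also appends an element with the += operator; the variable summary is illustrative and exists only to collect the result.

```shell
#!/bin/bash
fruits=(apple banana cherry)
fruits+=(orange)                 # append one element

summary=""
for i in "${!fruits[@]}"; do     # expands to the indexes 0 1 2 3
  echo "element $i: ${fruits[$i]}"
  summary+="$i=${fruits[$i]} "
done
echo "length: ${#fruits[@]}"
```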
This section is mainly for fun: you will see how to use nested loops to display a “triangular” output. Listing 1.9 displays the contents of nested-loops.sh, which illustrates how to display an alternating set of symbols in a triangular fashion.
#!/bin/bash
outermax=10
symbols[0]="#"
symbols[1]="@"
for (( i=1; i<${outermax}; i++ ));
do
  for (( j=1; j<${i}; j++ ));
  do
    printf "%-2s" ${symbols[($i+$j)%2]}
  done
  printf "\n"
done
for (( i=1; i<${outermax}; i++ ));
do
  for (( j=${i}+1; j<${outermax}; j++ ));
  do
    printf "%-2s" ${symbols[($i+$j)%2]}
  done
  printf "\n"
done
Listing 1.9 initializes some variables, followed by a nested loop. The outer loop is “controlled” by the loop variable i, whereas the inner loop (which depends on the value of i) is “controlled” by the loop variable j. The key point to notice is how the following code snippet prints alternating symbols in the symbols array, depending on whether the value of $i + $j is even or odd:
printf "%-2s" ${symbols[($i+$j)%2]}
You can easily generalize this code: if the symbols array contains arrlength elements, then replace the preceding code snippet with the following:
printf "%-2s" ${symbols[($i+$j)% $arrlength]}
Launch the code in Listing 1.9 and you will see the following output:
@
# @
@ # @
# @ # @
@ # @ # @
# @ # @ # @
@ # @ # @ # @
# @ # @ # @ # @
@ # @ # @ # @ #
@ # @ # @ # @
@ # @ # @ #
@ # @ # @
@ # @ #
@ # @
@ #
@
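The generalization to an arbitrary number of symbols can be sketched as follows. This is a small variation rather than the code in Listing 1.9: it uses three symbols, four rows, and collects the rows in a variable named triangle (an illustrative name) before printing them.

```shell
#!/bin/bash
symbols=("#" "@" "*")
arrlength=${#symbols[@]}

triangle=""
for (( i=1; i<=4; i++ )); do
  row=""
  for (( j=1; j<=i; j++ )); do
    # cycle through the symbols based on the value of i+j
    row+="${symbols[(i+j) % arrlength]} "
  done
  triangle+="${row% }"$'\n'      # drop the trailing space, end the row
done
printf '%s' "$triangle"
```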
The paste Command
The paste command is useful when you need to combine two files in a “pairwise” fashion. For example, Listing 1.10 and Listing 1.11 display the contents of the text files list1 and list2, respectively. You can think of paste as adding the contents of the second file as a new column in the first file. In our first example, the first file has a list of files to copy, and the second file has a list of files that are the destinations for the copy commands. The paste command then merges the two files into output that can be run to execute all the copy commands in one step.
cp abc.sh
cp abc2.sh
cp abc3.sh

def.sh
def2.sh
def3.sh
Listing 1.12 displays the result of invoking the following command:
paste list1 list2 >list1.sh
cp abc.sh def.sh
cp abc2.sh def2.sh
cp abc3.sh def3.sh
Listing 1.12 contains three cp commands that are the result of invoking the paste command. If you want to execute the commands in Listing 1.12, make this shell script executable and then launch the script, as shown here:
chmod +x list1.sh
./list1.sh
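By default, paste separates the merged columns with a tab character. If you prefer a single space, you can specify it with the -d option, as in the following minimal sketch. The temporary directory and the variable name merged are illustrative, and the sketch only builds the commands without executing them.

```shell
#!/bin/bash
# Recreate shortened versions of list1 and list2 in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' "cp abc.sh" "cp abc2.sh" > "$workdir/list1"
printf '%s\n' "def.sh" "def2.sh" > "$workdir/list2"

# Merge the files line by line, using a space instead of a tab.
merged=$(paste -d' ' "$workdir/list1" "$workdir/list2")
rm -rf "$workdir"

echo "$merged"
```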
Inserting Blank Lines with the paste Command
Instead of merging two equal length files, paste can also be used to add the same thing to every line in a file. This example inserts a blank line after every line in names.txt with this command:
paste -d'\n' - /dev/null < names.txt
The output is here:
Jane Smith

John Jones

Dave Edwards

Insert a blank line after every other line in names.txt with this command:
paste -d'\n' - - /dev/null < names.txt
The output is here:
Jane Smith
John Jones

Dave Edwards
Insert a blank line after every third line in names.txt with this command:
paste -d'\n' - - - /dev/null < names.txt
The output is here:
Jane Smith
John Jones
Dave Edwards

Note that there is a blank line after the third line in the preceding output. The shell script joinlines.sh (later in this chapter) also contains examples of one-line paste commands for joining consecutive lines of a dataset or text file.
The cut Command
The cut command enables you to extract fields separated by a specified delimiter (another word commonly used for IFS, especially when it is part of a command’s syntax rather than set as a separate variable), as well as a range of columns from an input stream. Some examples are here:
x="abc def ghi"
echo $x | cut -d" " -f2
The output (using a space " " as IFS, and -f2 to indicate the second column) of the preceding code snippet is here:
def
x="abc def ghi"
echo $x | cut -c2-5
The output of the preceding code snippet (-c2-5 means extract the characters in columns 2 through 5 from the variable) is here:
bc d
Listing 1.13 displays the contents of SplitName1.sh, which illustrates how to split a filename containing the “.” character as a delimiter/IFS.
fileName="06.22.04p.vp.0.tgz"
f1=`echo $fileName | cut -d"." -f1`
f2=`echo $fileName | cut -d"." -f2`
f3=`echo $fileName | cut -d"." -f3`
f4=`echo $fileName | cut -d"." -f4`
f5=`echo $fileName | cut -d"." -f5`
f5=`expr $f5 + 12`
newFileName="${f1}.${f2}.${f3}.${f4}.${f5}"
echo "newFileName: $newFileName"
Listing 1.13 uses the echo command and the cut command in order to initialize the variables f1, f2, f3, f4, and f5, after which a new filename is constructed. The output of the preceding shell script is here:
newFileName: 06.22.04p.vp.12
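The same result can be obtained with fewer invocations of cut, because the -f option accepts a range of fields. The following is a minimal sketch of that idea; it also replaces expr with shell arithmetic, and the variable name prefix is illustrative.

```shell
#!/bin/bash
fileName="06.22.04p.vp.0.tgz"

# Fields 1 through 4 in one step, then field 5 on its own.
prefix=$(printf '%s\n' "$fileName" | cut -d"." -f1-4)
f5=$(printf '%s\n' "$fileName" | cut -d"." -f5)

# $(( ... )) performs integer arithmetic without calling expr.
newFileName="${prefix}.$((f5 + 12))"
echo "newFileName: $newFileName"
```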
Metacharacters can be thought of as a complex set of wildcards. Regular expressions are “search patterns” that combine normal text and metacharacters. In concept a regular expression is much like a “find” tool (press ctrl-f in your browser), but bash (and Unix in general) allows for much more complex pattern matching because of its rich metacharacter set. There are entire books devoted to regular expressions, but this section contains enough information to get started, along with the key concepts needed for data manipulation and cleansing.
The following metacharacters are useful with regular expressions:
The ? metacharacter refers to 0 or 1 occurrences of something.
The + metacharacter refers to 1 or more occurrences of something.
The * metacharacter refers to 0 or more occurrences of something.
Note that “something” in the preceding descriptions can refer to a digit, letter, word, or more complex combinations.
Some examples are shown here:
The expression a? matches zero or one occurrences of the letter a.
The expression a+ matches one or more occurrences of the letter a.
The expression a* matches zero or more occurrences of the letter a.
The pipe “|” metacharacter provides a choice of options. Its meaning differs from that of the pipe symbol on the command line, because regular expressions have their own syntax, which often does not match the syntax of the shell. For example, the expression a|b means a or b, and the expression a|b|c means a or b or c.
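You can experiment with these metacharacters by using grep with the -E option (extended regular expressions). The following is a minimal sketch: the helper function match is hypothetical, the four test words are made up, and the -x option requires the pattern to match the whole line.

```shell
#!/bin/bash
# match (hypothetical helper): lists the sample words that a pattern matches.
match() {
  printf '%s\n' ac abc abbc adc | grep -E -x "$1" | paste -sd' ' -
}

optional=$(match 'ab?c')       # zero or one b
one_or_more=$(match 'ab+c')    # one or more b
zero_or_more=$(match 'ab*c')   # zero or more b
either=$(match 'a(b|d)c')      # b or d between a and c

echo "ab?c     : $optional"
echo "ab+c     : $one_or_more"
echo "ab*c     : $zero_or_more"
echo "a(b|d)c  : $either"
```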
The “$” metacharacter refers to the end of a line of text, and in line addresses inside the vi editor, the “$” character refers to the last line in a file.
The “^” metacharacter refers to the beginning of a string or a line of text. For example:
a$ matches "Mary Anna" but not "Anna Mary"
^A matches "Anna Mary" but not "Mary Anna"
In the case of regular expressions, the “^” metacharacter can also mean “does not match” when it is the first character inside a character class. The next section contains some examples of the “^” metacharacter.
Character classes enable you to express a range of digits, letters, or a combination of both. For example, the character class [0-9] matches any single digit; [a-z] matches any lowercase letter; and [A-Z] matches any uppercase letter. You can also specify subranges of digits or letters, such as [3-7], [g-p], and [F-X], as well as other combinations:
[0-9][0-9] matches a consecutive pair of digits
[0-9][0-9][0-9] matches three consecutive digits
\d{3} also matches three consecutive digits in Perl-style regular expressions (with POSIX tools such as grep -E, use [0-9]{3} instead)
The previous section introduced you to the “^” metacharacter, and here are some examples of using “^” with character classes:
1) ^[a-z] matches any lowercase letter at the beginning of a line of text
2) ^[^a-z] matches any line of text that does not start with a lowercase letter
Based on what you have learned thus far, you can understand the purpose of the following regular expressions:
3) ([a-z]|[A-Z]): either a lowercase letter or an uppercase letter
4) (^[a-z][a-z]): an initial lowercase letter followed by another lowercase letter
5) (^[^a-z][A-Z]): an initial character other than a lowercase letter, followed by an uppercase letter
Chapter 4 contains a section that discusses regular expressions, which combine character classes and metacharacters in order to create sophisticated expressions for matching complex string patterns (such as email addresses).
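Anchors and character classes can also be tested with grep -E. The following minimal sketch uses four made-up sample lines and sets LC_ALL=C (an assumption made here so that ranges such as [a-z] behave the same way on every system); the helper function count_matches is hypothetical.

```shell
#!/bin/bash
export LC_ALL=C
sample=$(printf '%s\n' "Mary Anna" "Anna Mary" "42 items" "aa bb")

# count_matches (hypothetical helper): number of sample lines that match.
count_matches() {
  printf '%s\n' "$sample" | grep -E -c "$1"
}

ends_with_a=$(printf '%s\n' "$sample" | grep -E 'a$')
starts_with_A=$(printf '%s\n' "$sample" | grep -E '^A')
not_lower_start=$(count_matches '^[^a-z]')
two_lower=$(printf '%s\n' "$sample" | grep -E '^[a-z][a-z]')
two_digits=$(printf '%s\n' "$sample" | grep -E '[0-9][0-9]')

echo "a\$ matches: $ends_with_a"
echo "^A matches: $starts_with_A"
echo "lines not starting with a lowercase letter: $not_lower_start"
echo "^[a-z][a-z] matches: $two_lower"
echo "[0-9][0-9] matches: $two_digits"
```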
At this point you’ve seen various combinations of bash commands that are connected with the “|” symbol. The general form looks something like this:
cmd1 | cmd2 | cmd3 .... >mylist
What happens if there are intermediate errors? You’ve seen how to redirect error messages to /dev/null, and you can also redirect error messages to a text file if you need to review them. Yet another option is to redirect stderr (“standard error”) to stdout (“standard out”), which is beyond the scope of this chapter.
Question: can an intermediate error cause the entire “pipeline” to fail? Unfortunately, this scenario can occur, and in general it’s a trial-and-error process to debug long and complex commands that involve multiple pipe symbols.
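One aid for debugging pipelines in bash is to inspect the exit status of each stage. By default the exit status of a pipeline is that of its last command, but bash also provides the PIPESTATUS array and the pipefail option. The following is a minimal sketch that uses the true and false commands as stand-ins for real pipeline stages.

```shell
#!/bin/bash
# Default behavior: the status of the pipeline is the status of the last command.
false | true
default_status=$?

# PIPESTATUS holds the exit status of every stage of the most recent pipeline.
false | true
stages="${PIPESTATUS[*]}"

# With pipefail, a failure in any stage makes the whole pipeline fail.
set -o pipefail
pipefail_status=0
false | true || pipefail_status=$?
set +o pipefail

echo "default: $default_status, stages: $stages, pipefail: $pipefail_status"
```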
Now consider the case where you need to redirect the output of multiple commands to the same location. For example, the following commands display output on the screen:
ls | sort; echo "the contents of /tmp: "; ls /tmp
You can easily redirect the output to a file with this command:
(ls | sort; echo "the contents of /tmp:"; ls /tmp) > myfile1
However, the parentheses cause the enclosed commands to run in a subshell (which is an extra process that consumes memory and CPU). You can avoid spawning a subshell by using {} instead of (), as shown here (the whitespace after { and the semicolon or newline before } are required):
{ ls | sort; echo "the contents of /tmp:"; ls /tmp; } > myfile1
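A visible consequence of the difference between the two grouping forms is that variable assignments inside parentheses do not survive, whereas assignments inside braces do. The following is a minimal sketch with illustrative variable names and a temporary output file.

```shell
#!/bin/bash
counter=0
( counter=1 )             # subshell: the assignment is lost afterward
after_parens=$counter

{ counter=2; }            # current shell: the assignment persists
after_braces=$counter

# Both forms can redirect the output of several commands at once.
outfile=$(mktemp)
{ echo "first"; echo "second"; } > "$outfile"
line_count=$(wc -l < "$outfile" | tr -d ' ')
rm -f "$outfile"

echo "after (): $after_parens, after {}: $after_braces, lines: $line_count"
```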
Suppose that you want to set a variable and execute a command, and then invoke a second command via a pipe, as shown here:
name=SMITH cmd1 | cmd2
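Unfortunately, cmd2 in the preceding code snippet does not see the value of name, because an assignment placed in front of a command applies only to that command (cmd1). One simple solution is to export the variable inside a subshell that runs the entire pipeline, as shown here:
(export name=SMITH; cmd1 | cmd2)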
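The scope of an assignment that precedes a command can be checked with a small experiment. In the following minimal sketch, two sh -c invocations stand in for cmd1 and cmd2 and simply report the value of name that they see; the variable names only_first and both are illustrative.

```shell
#!/bin/bash
unset name

# The assignment prefix applies to the first command only.
only_first=$(name=SMITH sh -c 'echo "cmd1:$name"' | sh -c 'cat; echo "cmd2:$name"')

# Exporting inside a subshell makes the variable visible to both commands,
# without changing the environment of the current shell.
both=$( (export name=SMITH; sh -c 'echo "cmd1:$name"' | sh -c 'cat; echo "cmd2:$name"') )

echo "$only_first"
echo "$both"
```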
Use the double ampersand && symbol if you want to execute a command only if a prior command succeeds. For example, the cd command is executed only if the mkdir command succeeds in the following code snippet:
mkdir /tmp2/abc && cd /tmp2/abc
The preceding command will fail because (by default) the /tmp2 directory does not exist. On the other hand, the following command succeeds because the -p option ensures that intermediate directories are created:
mkdir -p /tmp/abc/def && cd /tmp/abc && ls -l
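The following minimal sketch shows the same behavior in a throwaway directory, so that nothing is left behind in /tmp. The variables first and second are illustrative and record whether the command after && was executed.

```shell
#!/bin/bash
base=$(mktemp -d)

# Without -p, mkdir fails because the parent directory "a" does not exist,
# so the command after && is never executed.
first="skipped"
mkdir "$base/a/b" 2>/dev/null && first="ran"

# With -p, the intermediate directory is created and the chain continues.
second="skipped"
mkdir -p "$base/a/b" && cd "$base/a/b" && second="ran"

cd / && rm -rf "$base"
echo "first: $first, second: $second"
```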
The code sample in this section shows you how to use the paste command in order to join consecutive rows in a dataset. Listing 1.14 displays the contents of linepairs.csv, which contains letter and number pairs, and Listing 1.15 contains joinlines.sh, which illustrates how to match the pairs even though the line breaks are in different places between numbers and letters.
a,b,c,d,e,f,g
h,i,j,k,l
1,2,3,4,5,6,7,8,9
10,11,12

inputfile="linepairs.csv"
outputfile="linepairsjoined.csv"
# join pairs of consecutive lines:
paste -d " " - - < $inputfile > $outputfile
# join three consecutive lines:
#paste -d " " - - - < $inputfile > $outputfile
# join four consecutive lines:
#paste -d " " - - - - < $inputfile > $outputfile
The contents of the output file are shown here (note that the script is just joining pairs of lines; the three- and four-line command examples are commented out):
a,b,c,d,e,f,g h,i,j,k,l
1,2,3,4,5,6,7,8,9 10,11,12
Notice that the preceding output is not completely correct: there is a space “ ” instead of a “,” whenever a pair of lines is joined (between “g” and “h” and “9” and “10”). We can make the necessary revision using the sed command (discussed in Chapter 4):
cat $outputfile | sed "s/ /,/g" > $outputfile2
Examine the contents of $outputfile2 to see the result of the preceding code snippet.
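Another option is to let paste insert the comma directly, which avoids the extra sed step. The following is a minimal sketch that uses a shortened, made-up version of the data and supplies it through a pipe instead of a file.

```shell
#!/bin/bash
# Join each pair of consecutive lines with a comma instead of a space.
joined=$(printf '%s\n' "a,b,c" "d,e" "1,2,3" "4,5" | paste -d, - -)
echo "$joined"
```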
The code sample in this section shows you how to use the cut and paste commands in order to reverse the order of two columns in a dataset. Keep in mind that the purpose of the shell script in Listing 1.17 is to help you get some practice for writing bash scripts. The better solution involves a single line of code (shown at the end of this section).
Listing 1.16 displays the contents of namepairs.csv, which contains the first name and last name of a set of people, and Listing 1.17 contains reversecolumns.sh, which illustrates how to reverse these two columns.
Jane,Smith
Dave,Jones
Sara,Edwards

inputfile="namepairs.csv"
outputfile="reversenames.csv"
fnames="fnames"
lnames="lnames"
cat $inputfile|cut -d"," -f1 > $fnames
cat $inputfile|cut -d"," -f2 > $lnames
paste -d"," $lnames $fnames > $outputfile
The contents of the output file are shown here:
Smith,Jane
Jones,Dave
Edwards,Sara
The code in Listing 1.17 (after removing blank lines) consists of seven lines of code that involve creating two extra intermediate files. Unless you need those files, it’s a good idea to remove those two files (which you can do with one rm command).
Although Listing 1.17 is straightforward, there is a simpler way to execute this task: use the cat command and the awk command (discussed in detail in Chapter 5).
Specifically, compare the contents of reversecolumns.sh with the following single line of code that combines the cat command and the awk command in order to generate the same output:
cat namepairs.csv |awk -F"," '{print $2 "," $1}'
The output from the preceding code snippet is here:
Smith,Jane
Jones,Dave
Edwards,Sara
As you can see, there is a big difference in these two solutions. If you are unfamiliar with the awk command, then obviously you would not have thought of the second solution. However, the more you learn about bash commands and how to combine them, the more adept you will become in terms of writing better shell scripts to solve data cleaning tasks. Another important point: document the commands as they get more complex, as they can be hard to interpret later by others, or even by yourself if enough time has passed. A comment like the following can be extremely helpful for interpreting code:
# This command reverses first and last names in namepairs.csv
cat namepairs.csv |awk -F"," '{print $2 "," $1}'
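A small variation on the awk command sets the output field separator (OFS) so that the comma does not need to be repeated inside the print statement. The following is a minimal sketch that supplies the sample names through a pipe instead of reading a file.

```shell
#!/bin/bash
# Reverse first and last names; -v OFS="," makes print join fields with a comma.
reversed=$(printf '%s\n' "Jane,Smith" "Dave,Jones" "Sara,Edwards" |
  awk -F"," -v OFS="," '{ print $2, $1 }')
echo "$reversed"
```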
This chapter started with an introduction to some Unix shells, followed by a brief discussion of files, file permissions, and directories. You also learned how to create files and directories and how to change their permissions. Next you learned about environment variables, how to set them, and also how to use aliases. You also learned about “sourcing” (also called “dotting”) a shell script and how this changes variable behavior compared with calling a shell script in the normal fashion.
Next you learned about the cut command (for cutting columns and/or fields) and the paste command (for “pasting” text together vertically). Finally, you saw two use cases, the first of which involved the cut command and paste command to switch the order of two columns in a dataset, and the second showed you another way to perform the same task using concepts from later chapters.