Data Cleaning

CHAPTER 1 INTRODUCTION

This chapter introduces you to the bash shell. You will learn how to use some basic commands, such as navigating around the file system, listing files, and displaying the contents of files. This chapter is dense and contains a very eclectic mix of topics to quickly prepare you for later chapters. If you already have some knowledge of shell programming, you can probably skim quickly through this introductory chapter and proceed to Chapter 2.

The first part of this chapter starts with a brief introduction to some Unix shells, and then discusses files, file permissions, and directories. You will also learn how to create files and directories and how to change their permissions.

The second part of this chapter introduces simple shell scripts, along with instructions for making them executable. As you will see, shell scripts contain bash commands (and can optionally contain user-defined functions), so it’s a good idea to learn about bash commands before you can create shell scripts (which include bash scripts).

The third portion of this chapter discusses two useful bash commands: the cut command (for cutting or extracting columns and/or fields from a dataset) and the paste command (for “pasting” text or datasets together side by side, as columns).

In addition, the final part of this chapter uses the material from the previous section (i.e., the cut command and paste command) in a use case that illustrates how to switch the order of two columns in a dataset. As you will see later, there are other ways to perform this task, such as invoking the awk command (discussed in Chapter 5).

There are a few points to keep in mind before delving into the details of shell scripts. First, shell scripts can be executed from the command line after adding “execute” permissions to the text file containing the shell script. Second, you can use the crontab utility to schedule the execution of your shell scripts. The crontab utility allows you to specify the execution of a shell script on an hourly, daily, weekly, or monthly basis. Tasks that are commonly scheduled via crontab include performing backups, removing unwanted files, and so forth. If you are completely new to Unix, just keep in mind that there is a way to run scripts both from the command line and in a “scheduled” manner. Setting file permissions to run the script from the command line will be discussed later.

Third, the contents of any shell script can be as simple as a single command, or can comprise hundreds of lines of bash commands. In general, the more interesting shell scripts involve a combination of several bash commands. A learning tip: since there are usually several ways to produce the desired result, it’s helpful to read other people’s shell scripts to learn how to combine commands in useful ways.

What Is Unix?

Unix is an operating system created by Ken Thompson in the early 1970s, and today there are several variants available, such as HP-UX for HP machines and AIX for IBM machines. Linus Torvalds developed the Linux operating system during the 1990s, and many Linux commands are the same as their Unix counterparts (but differences exist, often in the commands for system administrators). The Mac OS X operating system is based on BSD Unix.

Unix has a rich and storied history, and if you are really interested in learning about its past, you can read online articles and also Wikipedia. This book foregoes those details and focuses on helping you quickly learn how to become productive with various commands.

Available Shell Types

The Bourne shell was written in the mid-1970s by Stephen R. Bourne and became the standard shell on Unix systems, so you will sometimes hear “the shell” as a reference to the Bourne shell. The Bourne shell is a POSIX standard shell, usually installed as /bin/sh on most versions of Unix, whose default prompt is the $ character. Consequently, Bourne shell scripts will execute on almost every version of Unix. In essence, the AT&T branches of Unix support the Bourne shell (sh), bash, Korn shell (ksh), tsh, and zsh.

However, there is also the BSD branch of Unix that uses the “C” shell (csh), whose default prompt is the % character. In general, shell scripts written for csh will not execute on AT&T branches of Unix, unless the csh shell is also installed on those machines (and vice versa).

The Bourne shell is the most “unadorned” in the sense that it lacks some commands that are available in the other shells, such as history, noclobber, and so forth. The various subcategories for the Bourne Shell are listed as follows:

Bourne shell (sh)

Korn shell (ksh)

Bourne Again shell (bash)

POSIX shell (sh)

The different C-type shells follow:

C shell (csh)

TENEX/TOPS C shell (tcsh)

While the commands and the shell scripts in this book are based on the bash shell, many of the commands also work in other shells (and if not, those other shells have a similar command to accomplish the same goal). Performing an Internet search for “how do I do <bash command> in <shell name>” will often get you an answer. Sometimes the command is essentially the same, but with slightly different syntax, and typing “man <command>” in a command shell can provide useful information.

What Is bash?

Bash is an acronym for “Bourne Again Shell,” which has its roots in the Bourne shell created by Stephen R. Bourne. Shell scripts based on the Bourne shell will execute in bash, but the converse is not true. The bash shell provides additional features that are unavailable in the Bourne shell, such as support for arrays (discussed later in this chapter).

On Mac OS X, the /bin directory contains the following executable shells:

-r-xr-xr-x   1  root   wheel   1377872   Apr 28  2017 /bin/ksh
-r-xr-xr-x   1  root   wheel    630464   Apr 28  2017 /bin/sh
-rwxr-xr-x   1  root   wheel    375632   Apr 28  2017 /bin/csh
-rwxr-xr-x   1  root   wheel    592656   Apr 28  2017 /bin/zsh
-r-xr-xr-x   1  root   wheel    626272   Apr 28  2017 /bin/bash

In case you’re interested, a nice comparison matrix of the support for various features among the preceding shells is here:

https://stackoverflow.com/questions/5725296/difference-between-sh-and-bash.

Something else that might surprise you: in some environments the Bourne shell sh is the Bash shell, which you can check by typing the following command:

sh  --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.

If you are new to the command line (be it Mac, Linux, or PCs), please read the Preface, which provides some useful guidelines for accessing command shells.

Getting Help for `bash` Commands

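If you want to see the options for a specific bash command, many commands print a short usage summary when you give them the --help switch (GNU/Linux) or an unrecognized switch such as -? (BSD and Mac OS X, where the summary follows an “illegal option” message). For example, cat --help displays the available options for the GNU cat command. You can invoke the man command to see a description of a bash command and its options: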

man  cat

Keep in mind that the man command produces terse explanations, and if those explanations are not clear enough, you can search for online code samples that provide more details.

Navigating Around Directories

In a command shell you will often perform basic operations, such as displaying (or changing) the current directory, listing the contents of a directory, displaying the contents of a file, and so forth. The following set of commands shows you how to perform these operations, and you can execute a subset of these commands in the sequence that is relevant to you. Options for some of the commands in this section (such as the ls command) are described in greater detail later in this chapter.

A frequently used Bash command is pwd (“print working directory”), which displays the current directory, as shown here:

pwd

The output of the preceding command might look something like this:

/Users/jsmith

Use the cd (“change directory”) command to go to a specific directory. For example, type the command cd /Users/jsmith/Mail or cd Mail if you are already in the /Users/jsmith directory. You can navigate to your home directory with either of these commands:

$   cd   $HOME
$   cd

One convenient way to return to the previous directory is the command cd -. Keep in mind that the cd command with no arguments on Windows merely displays the current directory (which differs from the Unix cd command, which takes you to your home directory).
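The navigation commands described above can be tried safely in a scratch directory. The following is a minimal sketch rather than a definitive recipe: the directory names Mail and Docs and the variable names are assumptions made for illustration, and mktemp is used only to obtain a temporary directory.

```shell
# Scratch directories created only for this demonstration (hypothetical names).
base=$(mktemp -d)
mkdir "$base/Mail" "$base/Docs"

cd "$base/Mail"       # absolute path
first=$(pwd)

cd ../Docs            # relative path: up one level, then into Docs
second=$(pwd)

cd - > /dev/null      # return to the previous directory (cd - also prints its name)
third=$(pwd)

echo "first:  $first"
echo "second: $second"
echo "third:  $third"   # the same directory as "first"

cd /                  # leave the scratch area before deleting it
rm -r "$base"
```

The redirection to /dev/null is there only because cd - prints the directory that it switches to.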

The `history` Command

The history command displays the history of commands that you executed in the current command shell, as shown here:

history

A sample output of the preceding command is here:

  1202   cat  longfile.txt > longfile2.txt
  1203   vi  longfile2.txt
  1204   cat  longfile2.txt  |fold  -40
  1205   cat  longfile2.txt  |fold  -30
  1206   cat  longfile2.txt  |fold  -50
  1207   cat  longfile2.txt  |fold  -45
  1208   vi  longfile2.txt
  1209   history
  1210   cd  /Library/Developer/CommandLineTools/usr/include/c++/
  1211   cd  /tmp
  1212   cd $HOME/Desktop
  1213   history

Now you can return to the directory in line 1210 with the following command:

!1210

The command !cd will search backward through the history of commands to find the first command that matches the cd command: in this case, line 1212 is the first match. If there weren’t any intervening cd commands between the current command and the command in line 1210, then !1210 and !cd will have the same effect.

NOTE

Be careful with the “!” option with bash commands, because the command that matches the “!” might not be the one you intended, so it’s safer to use the history command and then explicitly specify the correct number (in that history) when you invoke the “!” operator.

Listing Filenames with the `ls` Command

The ls command is for listing filenames, and there are many switches available that you can use, as shown in this section. For example, the ls command displays the following filenames (the actual display depends on the font size and the width of the command shell) on my Mac:

apple-care.txt        iphonemeetup.txt  outfile.txt  ssl-instructions.txt
checkin-commands.txt  kyrgyzstan.txt    output.txt

The command ls -1 (the digit “1”) displays a vertical listing of filenames:

apple-care.txt
checkin-commands.txt
iphonemeetup.txt
kyrgyzstan.txt
outfile.txt
output.txt
ssl-instructions.txt

The command ls -l (the letter “l”) displays a long listing of filenames:

total 56
-rwx------  1 ocampesato staff  25  Jan 06 19:21 apple-care.txt
-rwx------  1 ocampesato staff 146  Jan 06 19:21 checkin-commands.txt
-rwx------  1 ocampesato staff 478  Jan 06 19:21 iphonemeetup.txt
-rwx------  1 ocampesato staff  12  Jan 06 19:21 kyrgyzstan.txt
-rw-r--r--  1 ocampesato staff  11  Jan 06 19:21 outfile.txt
-rw-r--r--  1 ocampesato staff  12  Jan 06 19:21 output.txt
-rwx------  1 ocampesato staff 176  Jan 06 19:21 ssl-instructions.txt

The command ls -lt (the letters “l” and “t”) displays a time-based long listing:

total 56
-rwx------  1 ocampesato staff  25  Jan 06 19:21 apple-care.txt
-rwx------  1 ocampesato staff 146  Jan 06 19:21 checkin-commands.txt
-rwx------  1 ocampesato staff 478  Jan 06 19:21 iphonemeetup.txt
-rwx------  1 ocampesato staff  12  Jan 06 19:21 kyrgyzstan.txt
-rw-r--r--  1 ocampesato staff  11  Jan 06 19:21 outfile.txt
-rw-r--r--  1 ocampesato staff  12  Jan 06 19:21 output.txt
-rwx------  1 ocampesato staff 176  Jan 06 19:21 ssl-instructions.txt

The command ls -ltr (the letters “l”, “t”, and “r”) displays a reversed time-based long listing of filenames:

total 56
-rwx------  1 ocampesato staff 176  Jan 06 19:21 ssl-instructions.txt
-rw-r--r--  1 ocampesato staff  12  Jan 06 19:21 output.txt
-rw-r--r--  1 ocampesato staff  11  Jan 06 19:21 outfile.txt
-rwx------  1 ocampesato staff  12  Jan 06 19:21 kyrgyzstan.txt
-rwx------  1 ocampesato staff 478  Jan 06 19:21 iphonemeetup.txt
-rwx------  1 ocampesato staff 146  Jan 06 19:21 checkin-commands.txt
-rwx------  1 ocampesato staff  25  Jan 06 19:21 apple-care.txt

Here is a description of the columns in the preceding output:

Column #1: represents the file type and the permissions on the file (see the following)

Column #2: shows the number of hard links to the file or directory

Column #3: indicates the owner (a Unix user) of the file

Column #4: represents the group of the owner

Column #5: represents file size in bytes

Column #6: shows the date and time when this file was created or last modified

Column #7: represents file or directory name

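In an ls -l listing, every line that describes a file begins with a character such as d, -, or l (in the preceding example, every line begins with - because each entry is a regular file). These characters indicate the type of file that’s listed. These (and other) initial values are described as follows: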

-	Regular file (ASCII text file, binary executable, or hard link)
b	Block special file (such as a physical hard drive)
c	Character special file (such as a terminal)
d	Directory file that contains a listing of other files and directories
l	Symbolic link file
p	Named pipe (a mechanism for interprocess communications)
s	Socket (for interprocess communication)

Consult online documentation for more details regarding the ls command.
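As a small illustration of the file-type character described above, the following sketch creates a regular file, a directory, and a symbolic link in a scratch directory and extracts the first character of each ls -l line. The helper name file_type and the filenames are assumptions made for this example.

```shell
dir=$(mktemp -d)                 # scratch directory for the demonstration
touch "$dir/plain.txt"           # regular file
mkdir "$dir/subdir"              # directory
ln -s plain.txt "$dir/link"      # symbolic link to the regular file

# Hypothetical helper: print the first character of the long listing.
# The -d switch lists a directory itself instead of its contents.
file_type() { ls -ld "$1" | cut -c1; }

t_file=$(file_type "$dir/plain.txt")
t_dir=$(file_type "$dir/subdir")
t_link=$(file_type "$dir/link")

echo "$t_file $t_dir $t_link"    # prints: - d l

rm -r "$dir"
```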

Displaying Contents of Files

Now let’s see how to display different lines of text in a text file. You can use the cat command to display the entire contents of a file, but it’s a good idea to first get some information about the file contents. Specifically, use the wc (word count) command that displays the number of lines, words, and characters in a text file, as shown here:

wc longfile.txt
37       80      408 longfile.txt

The preceding output shows that the file longfile.txt contains 37 lines, 80 words, and 408 characters, which means that the file size is actually quite small (despite its name).
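If you need only one of the three counts, wc accepts the switches -l (lines), -w (words), and -c (bytes). The sketch below uses a small temporary file as a stand-in for longfile.txt; reading the file through input redirection keeps the filename out of the output, and the arithmetic expansion strips the leading spaces that some versions of wc print.

```shell
f=$(mktemp)                          # temporary stand-in for a text file
printf 'one two\nthree\n' > "$f"     # two lines, three words, fourteen bytes

lines=$(( $(wc -l < "$f") ))
words=$(( $(wc -w < "$f") ))
bytes=$(( $(wc -c < "$f") ))

echo "lines=$lines words=$words bytes=$bytes"   # lines=2 words=3 bytes=14

rm "$f"
```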

The `cat` Command

You can use the cat command to display the contents of longfile.txt:

cat longfile.txt

The preceding command displays the following text:

the contents
of this
long file
are too long
to see in a
single screen
and each line
contains
one or
more words
and if you
use the cat
command the
(other lines are omitted)

As another example, suppose that the file temp1 has the following contents:

this is line1 of temp1
this is line2 of temp1
this is line3 of temp1

Suppose that the file temp2 has these contents:

this is line1 of temp2
this is line2 of temp2

Now type the following command that contains the ? metacharacter (discussed in detail later in this chapter):

cat temp?

The output from the preceding command is shown here:

this is line1 of temp1
this is line2 of temp1
this is line3 of temp1
this is line1 of temp2
this is line2 of temp2

The `head` and `tail` Commands

The head command displays the first ten lines of a text file (by default), an example of which is here:

head longfile.txt

The preceding command displays the following text:

the contents
of this
long file
are too long
to see in a
single screen
and each line
contains
one or
more words

The head command also provides an option to specify a different number of lines to display, as shown here:

head -4 longfile.txt

The preceding command displays the following text:

the contents
of this
long file
are too long

The tail command displays the last ten lines (by default) of a text file:

tail longfile.txt

The preceding command displays the following text:

is available
in every shell
including the
bash shell
csh
zsh
ksh
and Bourne shell

NOTE

The last two lines in the preceding output are blank lines (not a typographical error in this page).

Use the more command to display a screenful of data, as shown here:

more longfile.txt

Press the <spacebar> to view the next screenful of data, and press the <return> key to see the next line of text in a file. Incidentally, some people prefer the less command, which generates essentially the same output as the more command. (A geeky joke: “What’s less? It’s more.”)

The Pipe Symbol

A very useful feature of Bash is its support for the pipe symbol (“|”) that enables you to “pipe” or redirect the output of one command to become the input of another command. The pipe symbol is very handy when you want to perform a sequence of operations involving various Bash commands.

For example, the following code snippet combines the head command with the cat command and the pipe (“|”) symbol:

cat longfile.txt | head -2

A technical point: the preceding command creates two processes, one for cat and one for head (more about processes later), whereas the command head -2 longfile.txt creates only a single process.

You can use the head and tail commands in more interesting ways. For example, the following command sequence displays lines 11 through 15 of longfile.txt:

head -15 longfile.txt |tail -5

The preceding command displays the following text:

and if you
use the cat
command the
file contents
scroll

Display the line numbers for the preceding output as follows:

cat -n longfile.txt | head -15 | tail -5

The preceding command displays the following text:

      11       and if you
      12       use the cat
      13       command the
      14       file contents
      15       scroll

You won’t see the “tab” character from the output, but it’s visible if you redirect the previous command sequence to a file and then use the “-t” option with the cat command:

cat  -n longfile.txt | head -15 | tail -5 > 1
cat  -t 1
     11^Iand if you
     12^Iuse the cat
     13^Icommand the
     14^Ifile contents
     15^Iscroll
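The head-and-tail combination shown earlier can be wrapped in a small function so that the line range becomes a pair of arguments. This is only a sketch: the function name print_lines is hypothetical, it uses the -n form of the head and tail switches, and it assumes that the file contains at least as many lines as the requested end line.

```shell
# Hypothetical helper: print lines START through END of FILE.
print_lines() {
    file=$1
    start=$2
    end=$3
    # Keep the first END lines, then keep the last (END - START + 1) of those.
    head -n "$end" "$file" | tail -n "$(( end - start + 1 ))"
}

f=$(mktemp)
seq 1 37 > "$f"                      # 37 numbered lines standing in for longfile.txt

result=$(print_lines "$f" 11 15)
echo "$result"                       # prints the numbers 11 through 15, one per line

rm "$f"
```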

The `fold` Command

The fold command enables you to “fold” the lines in a text file, which is useful for text files that contain long lines of text that you want to split into shorter lines. For example, here are the contents of longfile2.txt:

the contents of this long file are too long to see in a single
screen and each line contains one or more words and if you
use the cat command the file contents scroll off the screen so
you can use other commands such as the head or tail or more
commands in conjunction with the pipe command that is very
useful in Unix and is available in every shell including the
bash shell csh zsh ksh and Bourne shell

You can “fold” the contents of longfile2.txt into lines whose length is 45 (just as an example) with this command:

cat longfile2.txt |fold -45

The output of the preceding command is here:

the contents of this long file are too long t
o see in a single screen and each line contai
ns one or more words and if you use the cat c
ommand the file contents scroll off the scree
n so you can use other commands such as the h
ead or tail or more commands in conjunction w
ith the pipe command that is very useful in U
nix and is available in every shell including
 the bash shell csh zsh ksh and Bourne shell

Notice that some words in the preceding output are split based on the line width, and not “newspaper style.”
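If you prefer that lines break at blanks rather than in the middle of a word, the fold command also accepts the -s switch, and the width can be given with -w. The following sketch compares the two behaviors on a single long line taken from the sample text; the variable names are assumptions made for illustration.

```shell
text="the contents of this long file are too long to see in a single screen and each line contains one or more words"

hard=$(printf '%s\n' "$text" | fold -w 45)      # break at exactly 45 characters
soft=$(printf '%s\n' "$text" | fold -s -w 45)   # break after the last blank that fits

printf '%s\n' "$hard" | head -n 1    # the contents of this long file are too long t
printf '%s\n' "$soft" | head -n 1    # ends after the word "long" instead of splitting "to"
```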

In Chapter 4 you will learn how to display the lines in a text file that match a string or a pattern, and in Chapter 5 you will learn how to replace a string with another string in a text file.

File Ownership: Owner, Group, and World

Unix files have rwx privileges, where r = read privilege, w = write privilege, and x = execute privilege. A file that has the execute privilege can be executed from the command line by typing the path to the file (or just the file name if the file is in a directory that is listed in your PATH environment variable). Invoking an executable text file from the command line will cause the operating system to attempt to execute the commands inside the file.

Use the chmod command to set permissions for files. For example, if you need to set the permission rwx rw- r-- for a file, use the following:

chmod u=rwx,g=rw,o=r filename

In the preceding command the options u, g, and o represent user permissions, group permissions, and other permissions, respectively.

In order to add additional permissions on the current file, use + to add permission to user, group, or others and use - to remove the permissions. For example, given a file with the permissions rwx rw- r--, add the executable permission as follows:

chmod o+x filename

This command adds the x permission for others.

Add the executable permission to all permission categories—that is, for user, group, and others—as follows:

chmod a+x filename

In the preceding command, the letter a means “all.”

Specify a - in order to remove any permission, as shown here:

chmod a-x filename
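To see the effect of these chmod commands, you can apply them to a temporary file and inspect the first ten characters of its long listing after each step. This is a minimal sketch; the variable names are hypothetical, and note that the three clauses of a symbolic mode are separated by commas, not spaces.

```shell
f=$(mktemp)                              # temporary file for the demonstration

chmod u=rwx,g=rw,o=r "$f"                # set user, group, and other permissions
perm1=$(ls -l "$f" | cut -c1-10)

chmod o+x "$f"                           # add execute permission for others
perm2=$(ls -l "$f" | cut -c1-10)

chmod a-x "$f"                           # remove execute permission for everyone
perm3=$(ls -l "$f" | cut -c1-10)

echo "$perm1"    # -rwxrw-r--
echo "$perm2"    # -rwxrw-r-x
echo "$perm3"    # -rw-rw-r--

rm "$f"
```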

Hidden Files

A so-called “invisible” file is one whose first character is the dot or period character (.). Unix programs (including the shell) use most of these files to store configuration information. Some common examples of hidden files include the following files:

.profile: the Bourne shell (sh) initialization script

.bash_profile: the bash shell (bash) initialization script

.kshrc: the Korn shell (ksh) initialization script

.cshrc: the C shell (csh) initialization script

.rhosts: the remote shell configuration file

To list invisible files, specify the -a option to ls:

ls -a
.         .profile        docs        lib        test_results
..        .rhosts         hosts       pub        users
.emacs    bin             hw1         res.01     work
.exrc     ch07            hw2         res.02
.kshrc    ch07.bak        hw3         res.03

Single dot .: This represents current directory.

Double dot ..: This represents parent directory.

Handling Problematic Filenames

Problematic filenames contain one or more whitespaces, hidden (non-printing) characters, or start with a dash (“-”) character.

You can use double quotes to list filenames that contain whitespaces, or you can precede each whitespace by a backslash (“\”) character.

For example, if you have a file named One Space.txt, you can use the ls command as follows:

ls -l "One Space.txt"
ls -l One\ Space.txt

Filenames that start with a dash (“-”) character are difficult to handle because the dash character is the prefix that specifies options for bash commands. Consequently, if you have a file whose name is -abc, then the command ls -abc will not work correctly, because “-abc” is interpreted as the switches -a, -b, and -c for the ls command instead of as a filename.

In most cases the best solution to this type of file is to rename the file. This can be done in your operating system if your client isn’t a Unix shell, or you can use the following special syntax for the mv (“move”) command to rename the file. The two dashes (“--”) mark the end of the options, so mv treats everything after them as a filename, even a name that starts with a dash. An example is here:

mv -- -abc.txt renamed-abc.txt
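The following sketch exercises both kinds of problematic filenames in a scratch directory. It also shows a second way to refer to a file whose name starts with a dash, namely prefixing the name with ./ so that it no longer looks like an option. The filenames and variable names are assumptions made for illustration.

```shell
dir=$(mktemp -d)                 # scratch directory for the demonstration
cd "$dir"

touch ./-abc.txt                 # the leading ./ keeps the name from looking like an option
touch "One Space.txt"            # quotes keep the name as a single argument

mv -- -abc.txt renamed-abc.txt   # -- marks the end of the options
ls -l One\ Space.txt renamed-abc.txt

renamed=no
[ -f renamed-abc.txt ] && [ ! -e ./-abc.txt ] && renamed=yes
spaced=no
[ -f "One Space.txt" ] && spaced=yes
echo "renamed=$renamed spaced=$spaced"

cd /                             # leave the scratch area before deleting it
rm -r "$dir"
```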

Working with Environment Variables

There are many built-in environment variables available, and the following subsections discuss some of the more common variables.

The `env` Command

The env (“environment”) command displays the variables that are in your bash environment. An example of the output of the env command is here:

SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/73/39lngcln4dj_scmgvsv53g_w0000gn/T/
OLDPWD=/tmp
TERM_SESSION_ID=63101060-9DF0-405E-84E1-EC56282F4803
USER=ocampesato
COMMAND_MODE=unix2003
PATH=/opt/local/bin:/Users/ocampesato/android-sdk-mac_86/platform-tools:/Users/ocampesato/android-sdk-mac_86/tools:/usr/local/bin:
PWD=/Users/ocampesato
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
LANG=en_US.UTF-8
NODE_PATH=/usr/local/lib/node_modules
HOME=/Users/ocampesato
LOGNAME=ocampesato
DISPLAY=/tmp/launch-xnTgkE/org.macosforge.xquartz:0
SECURITYSESSIONID=186a4
_=/usr/bin/env

Some interesting examples of setting an environment variable and also executing a command are described here:

https://stackoverflow.com/questions/13998075/setting-environment-variable-for-one-program-call-in-bash-using-env

Useful Environment Variables

This section discusses some important environment variables, most of which you probably will not need to modify, but it’s useful to be aware of the existence of these variables and their purpose.

The HOME variable contains the absolute path of the user’s home directory.

The HOSTNAME variable specifies the Internet name of the host.

The LOGNAME variable specifies the user’s login name.

The PATH variable specifies the search path (see next subsection).

The SHELL variable specifies the absolute path of the current shell.

The USER variable specifies the user’s current username. This value might be different from the login name if a superuser executes the su command to emulate another user’s permissions.

Setting the `PATH` Environment Variable

Programs and other executable files can live in many directories, so operating systems provide a search path that lists the directories that the OS searches for executable files. Adding a directory to your path means an executable file can be called by just using the filename as a command, without having to call out its entire path, just as if it resided in your working directory.

The path is stored in an environment variable, which is a named string maintained by the operating system. These variables contain information available to the command shell and other programs.

The path variable is named PATH in Unix or Path in Windows (Unix is case-sensitive; Windows is not).

Setting the path in Unix/Linux:

export PATH=$HOME/anaconda:$PATH

To add the Python directory to the path for a particular session in bash:

export PATH="$PATH:/usr/local/bin/python"

In Bourne shell or ksh shell enter this command:

PATH="$PATH:/usr/local/bin/python"

NOTE

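Each entry in PATH is a directory, so the preceding commands assume that /usr/local/bin/python is the directory that contains the Python executable. If the executable itself is /usr/local/bin/python, add /usr/local/bin instead.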
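To watch the search path at work, the following sketch creates a tiny script in a temporary directory, places that directory at the front of PATH, and then runs the script by name alone. The script name hello_tool is hypothetical, and command -v is used here simply as a shell built-in that reports where a command would be found.

```shell
bindir=$(mktemp -d)                    # temporary directory that will hold one script

# Create a small executable script (the name hello_tool is made up for this example).
cat > "$bindir/hello_tool" <<'EOF'
#!/bin/sh
echo "hello from hello_tool"
EOF
chmod +x "$bindir/hello_tool"

PATH="$bindir:$PATH"                   # search the new directory first
export PATH

found=$(command -v hello_tool)         # full path that the shell resolved
output=$(hello_tool)                   # invoked by name, with no directory prefix

echo "$found"
echo "$output"                         # hello from hello_tool

rm -r "$bindir"
```

The change to PATH lasts only for the current shell session; to make it permanent you would place the export line in an initialization file such as .bash_profile.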

Specifying Aliases and Environment Variables

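The following command defines a shell variable called h1 (it becomes an environment variable, visible to the programs that you launch, only after you export it with the command export h1):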

h1=$HOME/test

Now if you enter the following command:

echo $h1

you will see the following output on OS X:

/Users/jsmith/test
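The difference between a plain shell variable and an exported environment variable shows up when you start a child process. In the sketch below, the value /tmp/test and the variable name greeting are assumptions chosen for illustration; sh -c simply runs a command in a separate child shell.

```shell
h1=/tmp/test                                    # a shell variable only
child_before=$(sh -c 'echo "${h1:-unset}"')     # the child process cannot see it

export h1                                       # now part of the environment
child_after=$(sh -c 'echo "${h1:-unset}"')      # the child process sees the value

# One-off form: the assignment applies to this single command only.
one_shot=$(greeting=hello sh -c 'echo "$greeting"')
parent_view=${greeting:-unset}                  # greeting was never set in this shell

echo "before export: $child_before"             # unset
echo "after export:  $child_after"              # /tmp/test
echo "one-shot:      $one_shot"                 # hello
echo "parent sees:   $parent_view"              # unset
```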

The next code snippet shows you how to set the alias ll so that it displays a long listing of a directory:

alias ll="ls -l"

The following three alias definitions involve the ls command and various switches:

alias ll="ls -l"
alias lt="ls -lt"
alias ltr="ls -ltr"

As an example, you can replace the command ls -ltr (the letters “l,” “t,” and “r”) that you saw earlier in the chapter with the ltr alias, and you will see the same reversed time-based long listing of filenames (reproduced here):

total 56
-rwx------ 1 ocampesato staff 176 Jan 06 19:21 ssl-instructions.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 19:21 output.txt
-rw-r--r-- 1 ocampesato staff  11 Jan 06 19:21 outfile.txt
-rwx------ 1 ocampesato staff  12 Jan 06 19:21 kyrgyzstan.txt
-rwx------ 1 ocampesato staff 478 Jan 06 19:21 iphonemeetup.txt
-rwx------ 1 ocampesato staff 146 Jan 06 19:21 checkin-commands.txt
-rwx------ 1 ocampesato staff  25 Jan 06 19:21 apple-care.txt

You can also define an alias that contains the Bash pipe (“|”) symbol:

alias ltrm="ls -ltr|more"

In a similar manner, you can define aliases for directory related commands:

alias ltd="ls -lt | grep '^d'"
alias ltdm="ls -lt | grep '^d'|more"

Finding Executable Files

There are several commands available for finding executable files (binary files or shell scripts) by searching the directories in the PATH environment variable: which, whence (a built-in command of ksh and zsh rather than bash), whereis, and whatis. The other commands produce results similar to those of the which command, as discussed below.

The which command gives the full path to whatever executable that you specify or a blank line if the executable is not in any directory that is specified in the PATH environment variable. This is useful for finding out whether a particular command or utility is installed on the system.

which rm

The output of the preceding command is here:

/usr/bin/rm

The whereis command provides the information that you get from the which command, and also the location of the man page of the executable:

$ whereis rm
rm: /bin/rm /usr/share/man/man1/rm.1.bz2

The whatis command looks up the specified command in the whatis database, which is useful for identifying system commands and important configuration files. Consider it a simplified “man” command, which displays concise details about bash commands (e.g., type man ls and you will see several pages of explanation regarding the ls command).

What Are Shell Scripts?

Shell scripts contain bash commands, which are executed sequentially from top to bottom (i.e., in the sequence that they appear in a shell script), unless they are defined inside a function. In particular, user-defined functions in shell scripts are executed in the order that they are invoked instead of in the order that they appear in the shell script. However, you can change the sequence in which commands are executed by using conditional logic, case statements, loops, and functions.

Shell scripts can contain whatever bash commands are available on your system (but be aware that some commands require the sudo command, which in turn requires a password). Simple examples of shell scripts include file-related commands that create files, read data from files, and update the contents of files. Regardless of the contents of your shell scripts, they are interpreted “on the fly,” so there are no compilation steps that create a binary executable.

The purpose of shell scripts is to automate the process of executing a set of bash commands so that you don’t need to execute them manually from the command line. If you need to execute a simple command from the command line, then it’s unlikely that you need to do so via a shell script: just type the command and press the <RETURN> key. Note that the crontab utility enables you to schedule the execution of shell scripts at various points in time (the details of the crontab utility are outside the scope of this book).

As you probably know, comments are important in source code. A good shell script contains meaningful comments, which are preceded by a pound sign “#,” that explain the purpose of different sections in the shell script. The exception is when the “#” symbol appears in the first line of a shell script, as you will see in the next section.

A Simple Shell Script

Create the file test.sh (using your favorite text editor) with the following contents:

#!/bin/bash
pwd
ls
cd /tmp
ls
mkdir /tmp/abc
touch /tmp/abc/emptyfile
ls /tmp/abc/

Now save the above content and make this script executable as follows:

chmod +x test.sh

Now you have your shell script ready to be executed as follows:

./test.sh

NOTE

The output from launching test.sh depends on the contents of the /tmp directory.

The first line in test.sh is called the “shebang” line, which directs the system to launch the bash shell in order to invoke the commands in test.sh. The term shebang is sort of a contraction of “hash” (for the “#” character) and “bang” (for the “!” character). Note that the initial “./” of ./test.sh specifies the file test.sh in the current directory: if the file test.sh is in your home directory, specify $HOME/test.sh. In addition, if “.” is included in the PATH environment variable, then you can simply type test.sh without the “./” prefix.

One point regarding the mkdir command: if you specify a path in which intermediate directories do not exist, then you need to use the -p switch. For example, if the directory /tmp/abc does not exist, then the following command requires the -p switch:

mkdir -p /tmp/abc/def
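The following sketch contrasts mkdir with and without the -p switch inside a scratch directory, so that nothing under /tmp/abc is touched. The variable names are assumptions made for illustration.

```shell
base=$(mktemp -d)                      # scratch directory; abc does not exist inside it yet

# Without -p the command fails because the intermediate directory abc is missing.
if mkdir "$base/abc/def" 2>/dev/null; then plain=ok; else plain=failed; fi

mkdir -p "$base/abc/def"               # creates abc and then def
mkdir -p "$base/abc/def"               # running it again is not an error

nested=no
[ -d "$base/abc/def" ] && nested=yes
echo "plain=$plain nested=$nested"     # plain=failed nested=yes

rm -r "$base"
```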

As another example of a simple shell script, the following script uses the read command, which takes the input from the keyboard and assigns that input value as the value of the variable PERSON. The echo command prints the input value on STDOUT, which is the screen (by default).

#!/bin/sh
echo "What is your name?"
read PERSON
echo "Hello, $PERSON"

Here is sample invocation of this script:

$ ./test.sh
What is your name?
John Smith
Hello, John Smith

Using a Semicolon to Separate Commands

You can combine multiple commands with a semicolon (“;”), as shown here:

cd /tmp; pwd; cd ~; pwd

The preceding code snippet navigates to the /tmp directory, prints the full path to the current directory, changes to your home directory (the meaning of “~”), and again prints the full path to the current directory. The output of the preceding command is here:

/tmp
/Users/jsmith

You can use command substitution (discussed in a later section) to assign the output to a variable, as shown here:

x=`cd /tmp; pwd; cd ~; pwd`
echo $x

The output of the preceding snippet is here:

/tmp /Users/jsmith
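As a hedged illustration of why the two paths appear on a single line, the sketch below uses the $(...) form of command substitution (equivalent to back ticks) and compares unquoted and quoted expansion. It uses cd / instead of cd ~ so that the result does not depend on a particular home directory; that substitution is an assumption of this example.

```shell
# $(...) is equivalent to back ticks; it runs in a subshell,
# so the cd commands do not change the directory of the calling shell.
x=$(cd /tmp; pwd; cd /; pwd)

# Unquoted: the shell splits the value into words, so the newline
# between the two paths is replaced by a single space.
one_line=$(echo $x)
echo "unquoted: $one_line"

# Quoted: the value is passed to echo unchanged, on two lines.
echo "quoted:"
echo "$x"
```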

The `printf` Command and the echo Command

In brief, use the printf command instead of the echo command if you need to control the output format. One key difference is that the echo command automatically appends a newline character, whereas the printf command prints a newline only if the format string contains \n. Keep this point in mind when you see the printf statement in the awk code samples in Chapter 5.

As a simple example, place the following code snippet in a shell script:

printf "%-5s %-10s %-4s\n" ABC DEF GHI
printf "%-5s %-10s %-4.2f\n" ABC DEF 12.3456

Make the shell script executable and then launch the shell script, after which you will see the following output:

ABC   DEF        GHI
ABC   DEF        12.35

On the other hand, if you type the following pair of commands:

echo "ABC DEF GHI"
echo "ABC DEF 12.3456"

you will see the following output:

ABC DEF GHI
ABC DEF 12.3456

A detailed (and very lengthy) discussion regarding the printf statement and the echo command is here:

https://unix.stackexchange.com/questions/65803/why-is-printf-better-than-echo
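The newline difference between the two commands can be sketched as follows. The strings and the variable name row are illustrative only; the second printf call also shows that the format string is reused until all of the arguments have been consumed.

```shell
# printf adds no newline unless the format string contains \n.
printf "%s" "no newline here"
printf "\n"

# The format string is applied repeatedly: two arguments per pass here,
# so four arguments produce two output lines.
printf "%-5s|%5s|\n" ABC DEF GHI JKL

# Capture one formatted row in a variable for later use.
row=$(printf "%-5s|%5s|" ABC DEF)
echo "row = $row"

# echo appends the newline for you.
echo "ABC DEF"
```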

The `echo` Command and Whitespaces

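The echo command prints whatever arguments the shell passes to it, so the whitespace inside a variable is preserved only when the variable is enclosed in double quotes; in other cases the results might be different from your expectations.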

Listing 1.1 displays the contents of EchoCut.sh that illustrates the differences that can occur when the echo command is used with the cut command.

LISTING 1.1. EchoCut.sh

x1="123   456   789"
x2="123 456 789"
echo "x1 = $x1"
echo "x2 = $x2"

x3=`echo $x1   | cut -c1-7`
x4=`echo "$x1" | cut -c1-7`
x5=`echo $x2   | cut -c1-7`
echo "x3 = $x3"
echo "x4 = $x4"
echo "x5 = $x5"

Launch the code in Listing 1.1 and you will see the following output:

x1 = 123   456   789
x2 = 123 456 789
x3 = 123 456
x4 = 123   4
x5 = 123 456

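The value of x3 is probably different from what you expected: there is only one blank space between 123 and 456 instead of the three blank spaces that appear in the definition of the variable x1. Because $x1 is not quoted in the definition of x3, the shell splits its value into separate words before the echo command sees them, and echo prints those words separated by single spaces.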

This seemingly minor detail is important when you write shell scripts that check the values contained in specific columns of text files, such as payroll files and other files with financial data. The solution involves the use of double quote marks (and sometimes the IFS variable that is discussed in Chapter 2) that you can see in the definition of x4.
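One way to see the effect of the double quotes is to compare string lengths. This is a minimal sketch; the variable names unquoted and quoted are illustrative and do not appear in Listing 1.1.

```shell
x1="123   456   789"      # three spaces between the fields

# Unquoted: the shell splits $x1 into three words, and echo joins
# them with single spaces.
unquoted=$(echo $x1)

# Quoted: the value reaches echo unchanged.
quoted=$(echo "$x1")

echo "unquoted: '$unquoted' (length ${#unquoted})"
echo "quoted:   '$quoted' (length ${#quoted})"
```

The lengths differ (11 versus 15 characters), which is exactly the kind of discrepancy that breaks scripts relying on fixed column positions.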

Command Substitution (“back tick”)

The “back tick” or command substitution feature of the Bourne shell is very powerful and enables you to combine multiple bash commands. You can also write very compact and powerful (and complicated) shell scripts with command substitution. The syntax is to simply precede and follow your command with a “`” (back tick) character. In Listing 1.2, the back tick command is `ls *py`.

Listing 1.2 displays the contents of CommandSubst.sh that displays a subset of the list of files in a directory.

LISTING 1.2. CommandSubst.sh

for f in `ls *py`
do
  echo "file is: $f"
done

Listing 1.2 contains a for loop that displays the filenames (in the current directory) that have a “py” suffix.

The output of Listing 1.2 on my MacBook is here:

file is: CapitalizeList.py
file is: CompareStrings.py
file is: FixedColumnCount1.py
file is: FixedColumnWidth1.py
file is: LongestShortest1.py
file is: My2DMatrix.py
file is: PythonBash.py
file is: PythonBash2.py
file is: StringChars1.py
file is: Triangular1.py
file is: Triangular2.py
file is: Zip1.py

NOTE

The output depends on whether or not you have any files with a .py suffix in the directory where you execute CommandSubst.sh.
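A loop over the output of ls splits filenames that contain spaces into separate words. The sketch below shows a common alternative that lets the shell expand the pattern directly; the temporary directory and the filenames in it are hypothetical and exist only to make the example self-contained.

```shell
# Create a scratch directory with two .py files (one with a space in its name).
dir=$(mktemp -d)
touch "$dir/first.py" "$dir/second script.py" "$dir/notes.txt"

count=0
for f in "$dir"/*.py
do
  # If nothing matches, the pattern itself is left in $f; skip it.
  [ -e "$f" ] || continue
  echo "file is: ${f##*/}"      # ${f##*/} strips the directory part
  count=$((count + 1))
done
echo "count: $count"

rm -r "$dir"
```

With the back tick version of the loop, "second script.py" would be reported as two separate files.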

Setting Environment Variables via Shell Scripts

A very important concept when using shell scripts is that any variables set inside the script are no longer set when the script finishes executing. The rules are shown as follows:

If a variable isn’t set in a script, but is already defined before the script is executed, that variable will also be available inside the script.

If a variable is set in a script, it will override any existing variable with the same name after the variable is set, but once the script ends, the variable will revert to its old value (or to no value, if it did not exist outside the shell script).

For example, if your $HOME directory is /Users/jsmith, but on line 10 of a script you define $HOME to be /Users/common/bin, then the value of $HOME is initially /Users/jsmith for lines 1–9, then becomes /Users/common/bin on line 10, and maintains that value until the last command in the shell script is executed. Then the value reverts to /Users/jsmith.

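The reason for this behavior is related to how Unix structures its processes: a shell script runs in its own child process (a new shell, hence the term “shell script”), and a child process cannot change the variables of the shell that launched it. A more detailed discussion is beyond the scope of this book.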

Therefore, the default behavior is that if you set the value of a variable in a shell script, then that variable (and its value) exist only for the duration of the execution of the shell script. There is a simple “workaround” whereby variables “hold” their values after a shell script has completed, and you’ll learn how to do so in a subsequent section.

Just to make sure that the distinction is clear, consider Listing 1.3 that displays the contents of the shell script abc.sh.

LISTING 1.3. abc.sh

export x="123"
echo "inside abc.sh"
echo "x = $x"

Make sure that abc.sh is an executable shell script with the chmod command (as shown earlier in this chapter) and then launch the following sequence of commands from the command line:

export x="tom"
echo "x = $x"
./abc.sh
echo "x = $x"

The output from the preceding commands is here:

x = tom
inside abc.sh
x = 123
x = tom

As you can see, the value that is assigned to the variable x is only for the duration of the process associated with the shell script abc.sh. After execution has completed, the process terminates and the value of x reverts to its original value. Fortunately, there is a way to ensure that the values of variables in a shell script can be “set” for the current shell, a technique called “sourcing” the shell script, as described in the next section.
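The same behavior can be reproduced without creating a script file. In this sketch, sh -c stands in for launching abc.sh (an assumption made to keep the example self-contained), and child_output is an illustrative variable name.

```shell
x="tom"
export x

# The child shell receives its own copy of x; changing that copy
# has no effect on the shell that launched it.
child_output=$(sh -c 'echo "child before: $x"; x="123"; echo "child after: $x"')

echo "$child_output"
echo "parent after: $x"
```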

Sourcing or “Dotting” a Shell Script

Now execute the following sequence of commands:

export x="tom smith"
echo "x = $x"
. abc.sh
echo "x = $x"

The output from the preceding commands is here:

x = tom smith
inside abc.sh
x = 123
x = 123

In the preceding code block, the value assigned to the variable x inside the shell script abc.sh overrides its previously defined value because “sourcing” (also called “dotting”) a shell script does not create a new process. Consequently, if a shell script assigns a new value to an existing variable, that new value is placed in the current environment and the previously defined value is lost.
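The difference between running and sourcing can be checked side by side. This sketch writes a one-line script to a temporary file (a stand-in for abc.sh; the names script, after_run, and after_source are illustrative) and then executes it both ways.

```shell
# Create a tiny script that assigns a value to x.
script=$(mktemp)
echo 'x="123"' > "$script"

x="tom smith"

sh "$script"              # child process: x in this shell is untouched
after_run=$x

. "$script"               # sourced: the assignment happens in this shell
after_source=$x

rm "$script"

echo "after running:  x = $after_run"
echo "after sourcing: x = $after_source"
```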

Working with Arrays

Arrays are critical to data management and appear in a variety of real-world contexts. A common requirement is to group related data elements together and then reference them as a unit, such as the fields within a row.

For example, at a volunteer event you might have to sign in and provide your name, address, and phone number so that the organizers can contact you later about future events. That related data could be thought of as an array (shown here conceptually; actual bash assignments have no spaces around the = sign):

volunteer[0] = name
volunteer[1] = address
volunteer[2] = phone number

The sign-in list could then be captured as a file that uses an internal field separator (IFS), so that each row is a volunteer and each data element (name, address, phone number) is distinct, which makes the file easy to use with a later bash script (or any other programming language or program that understands the concept of a field separator).

The IFS is a concept covered in detail in Chapter 2, but it will be used in the following examples so you get a taste of how it is used. If you are familiar with “.csv” (comma separated value) text output from spreadsheets, the comma in those files is the IFS. If you were to open a sign-in list that uses commas as the IFS in an Excel spreadsheet or in Google Sheets, you would have column A = name, column B = address, and column C = phone number, with each row a separate volunteer.
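As a small taste of the idea, the sketch below splits one comma-separated row into three variables. The row contents are invented for illustration, and the example assumes that no field contains an embedded comma (real CSV files with quoted commas need a more careful parser).

```shell
# A hypothetical sign-in row: name, address, phone number.
row="Jane Smith,12 Main St,555-0100"

# Setting IFS on the same line as read limits the change to this one
# command; the three fields are assigned to three variables.
IFS=, read -r name address phone <<EOF
$row
EOF

echo "name:    $name"
echo "address: $address"
echo "phone:   $phone"
```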

This section contains several shell scripts that illustrate some useful features of arrays in bash. Listing 1.4 displays the contents of array1.sh, which illustrates how to use an array and some operations that you can perform on arrays.

The syntax in bash is different enough from other programming languages that it’s worthwhile to use several examples to explore its behavior.

LISTING 1.4. array1.sh

#!/bin/bash

# method #1:
fruits[0]="apple"
fruits[1]="banana"
fruits[2]="cherry"
fruits[3]="orange"
fruits[4]="pear"
echo "first fruit: ${fruits[0]}"

# method #2:
declare -a fruits2=(apple banana cherry orange pear)
echo "first fruit: ${fruits2[0]}"

# range of elements:
echo "last two: ${fruits[@]:3:2}"

# substring of element:
echo "substring: ${fruits[1]:0:3}"

arrlength=${#fruits[@]}
echo "length: ${#fruits[@]}"

Listing 1.5 displays the contents of names.txt and Listing 1.6 displays the contents of array-from-file.sh, which contains a for loop to iterate through the elements of an array whose initial values are based on the contents of names.txt.

LISTING 1.5. names.txt

Jane Smith
John Jones
Dave Edwards

LISTING 1.6. array-from-file.sh

#!/bin/bash

names="names.txt"
contents1=( `cat "$names"` )

echo "First loop:"
for w in "${contents1[@]}"
do
  echo "$w"
done

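# split on newlines only, so each line of the file becomes one array element:
IFS=$'\n'
names="names.txt"
contents1=( `cat "$names"` )

echo "Second loop:"
for w in "${contents1[@]}"
do
  echo "$w"
done

Listing 1.6 initializes the array variable contents1 by using command substitution with the cat command, followed by a loop that displays the elements of the array contents1. With the default value of IFS (space, tab, and newline), every word in the file becomes a separate array element. The second for loop is the same code as the first for loop, but this time the value of IFS is a newline character (IFS=$'\n'), so the shell splits the output of cat on newlines only and each row of the file becomes one data element. (An empty value such as IFS="" disables word splitting entirely, which would place the whole file in a single array element.)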

Launch the code in Listing 1.6 and you will see the following output:

First loop:
Jane
Smith
John
Jones
Dave
Edwards
Second loop:
Jane Smith
John Jones
Dave Edwards
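An alternative that leaves IFS unchanged for the rest of the script is to read the file one line at a time and append each line to the array. This is a sketch that assumes bash (arrays and the += operator are not available in a plain POSIX sh); it creates its own temporary copy of the names so that it can run anywhere.

```shell
#!/bin/bash

# Create a temporary file with the same contents as names.txt.
file=$(mktemp)
printf '%s\n' "Jane Smith" "John Jones" "Dave Edwards" > "$file"

# IFS= (empty, for this read only) keeps leading and trailing blanks;
# -r keeps backslashes literal.
names=()
while IFS= read -r line
do
  names+=("$line")
done < "$file"
rm "$file"

echo "number of names: ${#names[@]}"
echo "second name: ${names[1]}"
```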

Listing 1.7 displays the contents of array-function.sh, which illustrates how to initialize an array and then display its contents in a user-defined function.

LISTING 1.7. array-function.sh

#!/bin/bash

# compact version of the code later in this script:
#items() { for line in "${@}" ; do printf "%s\n" "${line}" ; done ; }
#aa=( 7 -4 -e ) ; items "${aa[@]}"

items() {
  for line in "${@}"
  do
     printf "%s\n" "${line}"
  done
}

arr=( 123 -abc 'my data' )
items "${arr[@]}"

Listing 1.7 contains the items() function that displays the contents of the arr array that has been initialized prior to invoking this function. The output is shown here:

123
-abc
my data

Listing 1.8 displays the contents of array-loops1.sh, which illustrates how to determine the length of an initialized array and then display its contents via a for loop.

LISTING 1.8. array-loops1.sh

#!/bin/bash


fruits[0]="apple"
fruits[1]="banana"
fruits[2]="cherry"
fruits[3]="orange"
fruits[4]="pear"


# array length:
arrlength=${#fruits[@]}
echo "length: ${#fruits[@]}"


# print each element via a loop:
for (( i=1; i<${arrlength}+1; i++ ));
do
  echo "element $i of ${arrlength} : " ${fruits[$i-1]}
done

Listing 1.8 contains straightforward code for initializing an array and displaying its values.
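Two other common ways to traverse an array avoid the manual index arithmetic. The following sketch assumes bash and reuses the fruits array from the listing; the variable last is illustrative.

```shell
#!/bin/bash

fruits=(apple banana cherry orange pear)

# Iterate over the values directly; the quotes keep elements
# that contain spaces intact.
for fruit in "${fruits[@]}"
do
  echo "fruit: $fruit"
done

# Iterate over the indexes when the position is needed as well.
for i in "${!fruits[@]}"
do
  echo "element $((i + 1)) of ${#fruits[@]}: ${fruits[$i]}"
done

# The last element, computed from the array length.
last="${fruits[${#fruits[@]}-1]}"
echo "last: $last"
```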

Working with Nested Loops

This section is mainly for fun: you will see how to use nested loops to display a “triangular” output. Listing 1.9 displays the contents of nestedloops2.sh, which illustrates how to display an alternating set of symbols in a triangular fashion.

LISTING 1.9. nestedloops2.sh

#!/bin/bash

outermax=10
symbols[0]="#"
symbols[1]="@"

for (( i=1; i<${outermax}; i++ ));
do
  for (( j=1; j<${i}; j++ ));
  do
    printf "%-2s" ${symbols[($i+$j)%2]}
  done
  printf "\n"
done

for (( i=1; i<${outermax}; i++ ));
do
  for (( j=${i}+1; j<${outermax}; j++ ));
  do
    printf "%-2s" ${symbols[($i+$j)%2]}
  done
  printf "\n"
done

Listing 1.9 initializes some variables, followed by a nested loop. The outer loop is “controlled” by the loop variable i, whereas the inner loop (which depends on the value of i) is “controlled” by the loop variable j. The key point to notice is how the following code snippet prints alternating symbols in the symbols array, depending on whether or not the value of $i + $j is even or odd:

printf "%-2s" ${symbols[($i+$j)%2]}

You can easily generalize this code: if the symbols array contains arrlength elements, then replace the preceding code snippet with the following:

printf "%-2s" ${symbols[($i+$j)% $arrlength]}

Launch the code in Listing 1.9 and you will see the following output:

      @
      # @
      @ # @
      # @ # @
      @ # @ # @
      # @ # @ # @
      @ # @ # @ # @
      # @ # @ # @ # @
      @ # @ # @ # @ #
      @ # @ # @ # @
      @ # @ # @ #
      @ # @ # @
      @ # @ #
      @ # @
      @ #
      @

The `paste` Command

The paste command is useful when you need to combine two files in a “pairwise” fashion. For example, Listing 1.10 and Listing 1.11 display the contents of the text files list1 and list2, respectively. You can think of paste as adding the contents of the second file as a new column in the first file. In our first example, the first file has a list of files to copy, and the second file has a list of files that are the destination for the copy command. Paste then merges the two files into output that could then be run to execute all the copy commands in one step.

LISTING 1.10. list1

cp abc.sh
cp abc2.sh
cp abc3.sh

LISTING 1.11. list2

def.sh
def2.sh
def3.sh

Listing 1.12 displays the result of invoking the following command:

paste list1 list2 >list1.sh

LISTING 1.12. list1.sh

cp abc.sh    def.sh
cp abc2.sh   def2.sh
cp abc3.sh   def3.sh

Listing 1.12 contains three cp commands that are the result of invoking the paste command. If you want to execute the commands in Listing 1.12, make this shell script executable and then launch the script, as shown here:

chmod +x list1.sh
./list1.sh
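By default, paste separates the merged columns with a tab character. The sketch below uses the -d option to choose a single space instead; the temporary directory and the shortened two-line files are assumptions made so the example can run without touching list1 and list2.

```shell
# Build two small input files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' "cp abc.sh" "cp abc2.sh" > "$dir/list1"
printf '%s\n' "def.sh" "def2.sh" > "$dir/list2"

# -d ' ' joins corresponding lines with a space instead of a tab.
paste -d ' ' "$dir/list1" "$dir/list2"

# Keep the first merged line for inspection.
first=$(paste -d ' ' "$dir/list1" "$dir/list2" | head -n 1)

rm -r "$dir"
```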

Inserting Blank Lines with the `paste` Command

Instead of merging two equal length files, paste can also be used to add the same thing to every line in a file. This example inserts a blank line after every line in names.txt with this command:

paste -d'\n' - /dev/null < names.txt

The output is here:

Jane Smith

John Jones

Dave Edwards

Insert a blank line after every other line in names.txt with this command:

paste -d'\n' - - /dev/null < names.txt

The output is here:

Jane Smith
John Jones

Dave Edwards

Insert a blank line after every third line in names.txt with this command:

paste -d'\n' - - - /dev/null < names.txt

The output is here:

Jane Smith
John Jones
Dave Edwards

Note that there is a blank line after the third line in the preceding output. The shell script linepairs.sh (Listing 1.15, later in this chapter) also contains examples of one-line paste commands for joining consecutive lines of a dataset or text file.
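A related option is -s, which joins all of the lines of its input into a single line. This sketch pipes the three names through paste instead of reading names.txt, purely so that it is self-contained; the variable joined is illustrative.

```shell
# -s (serial) pastes all input lines into one line;
# -d ',' uses a comma as the separator, and - means standard input.
joined=$(printf '%s\n' "Jane Smith" "John Jones" "Dave Edwards" | paste -s -d ',' -)

echo "$joined"
```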

The `cut` Command

The cut command enables you to extract fields that are separated by a specified delimiter (another common word for a field separator, especially when it is supplied as part of a command’s syntax instead of being set through the IFS variable), as well as a range of character columns from an input stream. Some examples are here:

x="abc def ghi"
echo $x | cut -d" " -f2

The output (using space " " as IFS, and -f2 to indicate the second column) of the preceding code snippet is here:

def

Consider this code snippet:

x="abc def ghi"
echo $x | cut -c2-5

The output of the preceding code snippet (-c2-5 means extract the characters in columns 2 through 5 from the variable) is here:

bc d

Listing 1.13 displays the contents of SplitName1.sh, which illustrates how to split a filename containing the “.” character as a delimiter/IFS.

LISTING 1.13. SplitName1.sh

fileName="06.22.04p.vp.0.tgz"

f1=`echo $fileName | cut -d"." -f1`
f2=`echo $fileName | cut -d"." -f2`
f3=`echo $fileName | cut -d"." -f3`
f4=`echo $fileName | cut -d"." -f4`
f5=`echo $fileName | cut -d"." -f5`

f5=`expr $f5 + 12`

newFileName="${f1}.${f2}.${f3}.${f4}.${f5}"
echo "newFileName: $newFileName"

Listing 1.13 uses the echo command and the cut command in order to initialize the variables f1, f2, f3, f4, and f5, after which a new filename is constructed. The output of the preceding shell script is here:

newFileName: 06.22.04p.vp.12
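The cut command can also select several fields in one call, which avoids invoking it once per field. The following sketch reuses the filename from Listing 1.13; the variable names prefix, ends, and suffix are illustrative, and the last line shows shell parameter expansion as an alternative for the common task of extracting a suffix.

```shell
fileName="06.22.04p.vp.0.tgz"

# Fields 1 through 3 in a single call.
prefix=$(echo "$fileName" | cut -d"." -f1-3)

# Fields 1 and 6; cut keeps the delimiter between the selected fields.
ends=$(echo "$fileName" | cut -d"." -f1,6)

# Parameter expansion: remove everything up to and including the last "."
suffix=${fileName##*.}

echo "prefix: $prefix"
echo "ends:   $ends"
echo "suffix: $suffix"
```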

Working with Metacharacters

Metacharacters can be thought of as a complex set of wildcards. Regular expressions are search patterns that combine normal text and metacharacters. In concept a regular expression is much like a “find” tool (such as pressing Ctrl-F in your browser), but bash (and Unix in general) allows for much more complex pattern matching because of its rich metacharacter set. There are entire books devoted to regular expressions, but this section contains enough information to get started, along with the key concepts needed for data manipulation and cleansing.

The following metacharacters are useful with regular expressions:

The ? metacharacter refers to 0 or 1 occurrences of something.

The + metacharacter refers to 1 or more occurrences of something.

The * metacharacter refers to 0 or more occurrences of something.

Note that “something” in the preceding descriptions can refer to a digit, letter, word, or more complex combinations.

Some examples are shown here:

The expression a? matches zero or one occurrences of the letter a.

The expression a+ matches one or more occurrences of the letter a.

The expression a* matches zero or more occurrences of the letter a.

The pipe “|” metacharacter provides a choice of options. (It has a different meaning from the pipe symbol on the command line: regular expressions have their own syntax, which often differs from that of the shell.) For example, the expression a|b means a or b, and the expression a|b|c means a or b or c.

The “$” metacharacter refers to the end of a line of text, and in regular expressions inside the vi editor, the “$” metacharacter refers to the last line in a file.

The “^” metacharacter refers to the beginning of a string or a line of text. For example:

.*a$ matches "Mary Anna" but not "Anna Mary"
^A.* matches "Anna Mary" but not "Mary Anna"

When it is the first character inside a character class (square brackets), the “^” metacharacter means “does not match.” The next section contains some examples of this use of the “^” metacharacter.
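The metacharacters described in this section can be tried out with grep -E (extended regular expressions). This is a minimal sketch; the sample strings and variable names are illustrative, and the -c option simply counts the matching lines.

```shell
names=$(printf '%s\n' "Mary Anna" "Anna Mary" "Bob")

# a$ : lines that end with the letter a
ends_with_a=$(printf '%s\n' "$names" | grep -E 'a$')

# ^A : lines that begin with the letter A
starts_with_A=$(printf '%s\n' "$names" | grep -E '^A')

# colou?r : the u is optional (0 or 1 occurrences)
optional_u=$(printf '%s\n' color colour colouur | grep -E -c '^colou?r$')

# a+ requires at least one a, whereas a* also accepts none
one_or_more=$(printf '%s\n' b ab aab | grep -E -c '^a+b$')
zero_or_more=$(printf '%s\n' b ab aab | grep -E -c '^a*b$')

# cat|dog : either alternative
either=$(printf '%s\n' cat dog bird | grep -E -c '^(cat|dog)$')

echo "ends with a:   $ends_with_a"
echo "starts with A: $starts_with_A"
echo "colou?r: $optional_u, a+b: $one_or_more, a*b: $zero_or_more, cat|dog: $either"
```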

Working with Character Classes

Character classes enable you to express a range of digits, letters, or a combination of both. For example, the character class [0-9] matches any single digit; [a-z] matches any lowercase letter; and [A-Z] matches any uppercase letter. You can also specify subranges of digits or letters, such as [3-7], [g-p], and [F-X], as well as other combinations:

[0-9][0-9] matches a consecutive pair of digits

[0-9][0-9][0-9] matches three consecutive digits

\d{3} also matches three consecutive digits (in tools that support Perl-style syntax; [0-9]{3} is the portable equivalent)

The previous section introduced you to the “^” metacharacter, and here are some examples of using “^” with character classes:

1) ^[a-z] matches any lowercase letter at the beginning of a line of text

2) ^[^a-z] matches any line of text that does not start with a lowercase letter

Based on what you have learned thus far, you can understand the purpose of the following regular expressions:

3) ([a-z]|[A-Z]): either a lowercase letter or an uppercase letter

4) (^[a-z][a-z]): an initial lowercase letter followed by another lowercase letter

5) (^[^a-z][A-Z]): an initial character other than a lowercase letter, followed by an uppercase letter

Chapter 4 contains a section that discusses regular expressions, which combine character classes and metacharacters in order to create sophisticated expressions for matching complex string patterns (such as email addresses).
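Character classes can be tested in the same way with grep -E. In this sketch the sample data is invented, and LC_ALL=C is set (an assumption of this example) so that ranges such as [a-z] follow plain ASCII order regardless of the locale on your system.

```shell
data=$(printf '%s\n' "555-0100" "55-0100" "abc-0100" "Apple" "apple")

# Three digits, a hyphen, then four digits, and nothing else on the line.
phones=$(printf '%s\n' "$data" | LC_ALL=C grep -E -c '^[0-9]{3}-[0-9]{4}$')

# Lines that start with a lowercase letter.
lower_start=$(printf '%s\n' "$data" | LC_ALL=C grep -E -c '^[a-z]')

# Lines that start with any character other than a lowercase letter.
not_lower_start=$(printf '%s\n' "$data" | LC_ALL=C grep -E -c '^[^a-z]')

echo "phone-like lines:        $phones"
echo "start with lowercase:    $lower_start"
echo "start with anything else: $not_lower_start"
```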

The “pipe” Symbol and Multiple Commands

At this point you’ve seen various combinations of bash commands that are connected with the “|” symbol. The general form looks something like this:

cmd1 | cmd2 | cmd3 .... >mylist

What happens if there are intermediate errors? You’ve seen how to redirect error messages to /dev/null, and you can also redirect error messages to a text file if you need to review them. Yet another option is to redirect stderr (“standard error”) to stdout (“standard out”), which is beyond the scope of this chapter.

Question: can an intermediate error cause the entire “pipeline” to fail? Unfortunately, this scenario can occur, and in general it’s a trial-and-error process to debug long and complex commands that involve multiple pipe symbols.
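Two bash features can take some of the guesswork out of debugging a pipeline: the pipefail option and the PIPESTATUS array. The sketch below assumes bash and uses the false and true commands as stand-ins for a failing cmd1 and a succeeding cmd2.

```shell
#!/bin/bash

# Without pipefail, only the exit status of the last command counts.
if false | true
then
  default_result="success"
else
  default_result="failure"
fi

# PIPESTATUS records the exit status of every stage of the most recent
# pipeline; it must be read immediately after the pipeline runs.
false | true
statuses="${PIPESTATUS[*]}"

# With pipefail, the pipeline fails if any stage fails.
set -o pipefail
if false | true
then
  strict_result="success"
else
  strict_result="failure"
fi
set +o pipefail

echo "default:  $default_result"
echo "stages:   $statuses"
echo "pipefail: $strict_result"
```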

Now consider the case where you need to redirect the output of multiple commands to the same location. For example, the following commands display output on the screen:

ls | sort; echo "the contents of /tmp: "; ls /tmp

You can easily redirect the output to a file with this command:

(ls | sort; echo "the contents of /tmp:"; ls /tmp) > myfile1

However, the parentheses cause the enclosed commands to run in a subshell (which is an extra process that consumes memory and CPU). You can avoid spawning a subshell by using {} instead of (), as shown here (the whitespace after { is required, and the last command before } must end with a semicolon or a newline):

{ ls | sort; echo "the contents of /tmp:"; ls /tmp; } > myfile1
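Besides the cost of an extra process, the two forms of grouping differ in whether variable assignments survive. This is a minimal sketch with an illustrative variable named count.

```shell
count=1

# Parentheses: the assignment happens in a subshell and is lost
# when the subshell exits.
( count=2 )
after_parens=$count

# Braces: the assignment happens in the current shell and persists.
{ count=3; }
after_braces=$count

echo "after ( ): $after_parens"
echo "after { }: $after_braces"
```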

Suppose that you want to set a variable and execute a command, and then invoke a second command via a pipe, as shown here:

name=SMITH cmd1 | cmd2

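Unfortunately, cmd2 in the preceding code snippet does not recognize the value of name, because a variable assignment that precedes a command applies only to that command (cmd1). A simple solution is to export the variable inside a subshell that contains the entire pipeline, as shown here:

(export name=SMITH; cmd1 | cmd2)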
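The visibility of a variable on each side of a pipe can be checked directly. In the sketch below, two sh -c commands are hypothetical stand-ins for cmd1 and cmd2, and each one reports whether it can see name. The second pipeline exports the variable inside a subshell, which is one way to make it visible to every command in the pipeline without leaving it set afterward.

```shell
unset name

# Assignment prefix: only the first command in the pipeline receives name.
prefix_only=$(name=SMITH sh -c 'echo "cmd1 sees: ${name:-nothing}"' |
              sh -c 'cat; echo "cmd2 sees: ${name:-nothing}"')

# Export inside a subshell (the command substitution): both commands
# receive name, and it disappears again when the subshell exits.
exported=$(export name=SMITH
           sh -c 'echo "cmd1 sees: ${name:-nothing}"' |
           sh -c 'cat; echo "cmd2 sees: ${name:-nothing}"')

echo "$prefix_only"
echo "$exported"
echo "calling shell sees: ${name:-nothing}"
```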

Use the double ampersand && symbol if you want to execute a command only if a prior command succeeds. For example, the cd command is executed only if the mkdir command succeeds in the following code snippet:

mkdir /tmp2/abc && cd /tmp2/abc

The preceding command will fail because (by default) the /tmp2 directory does not exist. On the other hand, the following command succeeds because the -p option ensures that intermediate directories are created:

mkdir -p /tmp/abc/def && cd /tmp/abc && ls -l

A Simple Use Case

The code sample in this section shows you how to use the paste command in order to join consecutive rows in a dataset. Listing 1.14 displays the contents of linepairs.csv, which contains letter and number pairs, and Listing 1.15 contains linepairs.sh, which illustrates how to join the lines in pairs even though the line breaks are in different places for the letters and the numbers.

LISTING 1.14. linepairs.csv

a,b,c,d,e,f,g
h,i,j,k,l
1,2,3,4,5,6,7,8,9
10,11,12

LISTING 1.15. linepairs.sh

inputfile="linepairs.csv"
outputfile="linepairsjoined.csv"

# join pairs of consecutive lines:
paste -d " " - - < $inputfile > $outputfile

# join three consecutive lines:
#paste -d " " - - - < $inputfile > $outputfile

# join four consecutive lines:
#paste -d " " - - - - < $inputfile > $outputfile

The contents of the output file are shown here (note that the script is just joining pairs of lines; the three- and four-line command examples are commented out):

a,b,c,d,e,f,g h,i,j,k,l
1,2,3,4,5,6,7,8,9 10,11,12

Notice that the preceding output is not completely correct: there is a space “ ” instead of a “,” whenever a pair of lines is joined (between “g” and “h” and “9” and “10”). We can make the necessary revision using the sed command (discussed in Chapter 4):

cat $outputfile | sed "s/ /,/g" > $outputfile2

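Examine the contents of the file named by $outputfile2 to see the result of the preceding code snippet. Note that Listing 1.15 does not define the outputfile2 variable, so you must assign a filename to it before running the preceding command.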
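Another option is to ask paste for a comma delimiter in the first place, which removes the need for a second pass. This sketch feeds the four lines of linepairs.csv through a pipe rather than reading the file, so that it is self-contained; the variable joined is illustrative.

```shell
# -d ',' joins each pair of consecutive lines with a comma
# instead of a space, so no later substitution is required.
joined=$(printf '%s\n' "a,b,c,d,e,f,g" "h,i,j,k,l" "1,2,3,4,5,6,7,8,9" "10,11,12" |
         paste -d ',' - -)

echo "$joined"
```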

Another Simple Use Case

The code sample in this section shows you how to use the cut and paste commands in order to reverse the order of two columns in a dataset. Keep in mind that the purpose of the shell script in Listing 1.17 is to help you get some practice for writing bash scripts. The better solution involves a single line of code (shown at the end of this section).

Listing 1.16 displays the contents of namepairs.csv, which contains the first name and last name of a set of people, and Listing 1.17 contains reversecolumns.sh, which illustrates how to reverse these two columns.

LISTING 1.16. namepairs.csv

Jane,Smith
Dave,Jones
Sara,Edwards

LISTING 1.17. reversecolumns.sh

inputfile="namepairs.csv"
outputfile="reversenames.csv"
fnames="fnames"
lnames="lnames"

cat $inputfile|cut -d"," -f1 > $fnames
cat $inputfile|cut -d"," -f2 > $lnames

paste -d"," $lnames $fnames > $outputfile

The contents of the output file are shown here:

Smith,Jane
Jones,Dave
Edwards,Sara

The code in Listing 1.17 (after removing blank lines) consists of seven lines of code and creates two extra intermediate files. Unless you need those files, it’s a good idea to remove them (which you can do with one rm command).

Although Listing 1.17 is straightforward, there is a simpler way to execute this task: use the cat command and the awk command (discussed in detail in Chapter 5).

Specifically, compare the contents of reversecolumns.sh with the following single line of code that combines the cat command and the awk command in order to generate the same output:

cat namepairs.csv |awk -F"," '{print $2 "," $1}'

The output from the preceding code snippet is here:

Smith,Jane
Jones,Dave
Edwards,Sara

As you can see, there is a big difference in these two solutions. If you are unfamiliar with the awk command, then obviously you would not have thought of the second solution. However, the more you learn about bash commands and how to combine them, the more adept you will become in terms of writing better shell scripts to solve data cleaning tasks. Another important point: document the commands as they get more complex, as they can be hard to interpret later by others, or even by yourself if enough time has passed. A comment like the following can be extremely helpful to interpreting code:

# This command reverses first and last names in namepairs.csv
cat namepairs.csv |awk -F"," '{print $2 "," $1}'
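For comparison, the same column swap can be sketched with shell built-ins only, without intermediate files and without awk. The example pipes in the three rows of namepairs.csv so that it is self-contained, and it assumes that each row has exactly two comma-separated fields; the variable names first, last, and reversed are illustrative.

```shell
# Read each row into two variables (split on the comma),
# then print the variables in the opposite order.
reversed=$(printf '%s\n' "Jane,Smith" "Dave,Jones" "Sara,Edwards" |
  while IFS=, read -r first last
  do
    printf '%s,%s\n' "$last" "$first"
  done)

echo "$reversed"
```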

Summary

This chapter started with an introduction to some Unix shells, followed by a brief discussion of files, file permissions, and directories. You also learned how to create files and directories and how to change their permissions. Next you learned about environment variables, how to set them, and also how to use aliases. You also learned about “sourcing” (also called “dotting”) a shell script and how this changes variable behavior from calling a shell script in the normal fashion.

Next you learned about the cut command (for cutting columns and/or fields) and the paste command (for “pasting” text together vertically). Finally, you saw two use cases: the first used the paste command to join consecutive lines in a dataset, and the second used the cut command and paste command to switch the order of two columns in a dataset, followed by another way to perform the same task using concepts from later chapters.

