CHAPTER 2

USEFUL COMMANDS

This chapter discusses various bash commands that you can use when working with datasets, such as splitting, sorting, and comparing datasets. You see examples of finding files in a directory and then searching for strings in those files using the bash “pipe” command that redirects the output of one bash command as the input of a second bash command.

The first part of this chapter shows you how to merge, fold, and split datasets. This section also shows you how to sort files and find unique lines in files using the sort and uniq commands, respectively. The last portion explains how to compare text files and binary files.

The second section introduces you to the find command, which is a powerful command that supports many options. For example, you can search for files in the current directory or in subdirectories, and you can search for files based on their creation date and last modification date. One convenient combination is to “pipe” the output of the find command to the xargs command in order to search files for a particular pattern. Next you will see how to use the tr command, a tool that handles many common text transformations, such as changing case or removing whitespace. After the section that discusses the tr command, you will see a use case that shows you how to use the tr command in order to remove the ^M control character from a dataset.

The third section contains compression-related commands, such as cpio, tar, and bash commands for managing files that are already compressed (such as zdiff, zcmp, zmore, and so forth).

The fourth section introduces you to the IFS (Internal Field Separator) variable, which is useful for extracting data from a range of columns in a dataset. You will also see how to use the xargs command in order to “line up” the columns of a dataset so that all rows have the same number of columns.

The fifth section shows you how to create shell scripts, which contain bash commands that are executed sequentially, and also how to use recursion in order to compute the factorial value of a positive integer. The Appendix for this book contains additional shell scripts that use recursion in order to calculate the GCD (greatest common divisor) and LCM (lowest common multiple) of two positive integers, the Fibonacci value of a positive integer, and also the prime divisors of a positive integer.

The join Command

The join command allows you to merge two files in a meaningful fashion, which essentially creates a simple version of a relational database.

The join command operates on exactly two files, but pastes together only those lines with a common tagged field (usually a numerical label), and writes the result to stdout. The files to be joined should be sorted according to the tagged field for the matchups to work properly. Listing 2.1 and Listing 2.2 display the contents of 1.data and 2.data, respectively.

LISTING 2.1 1.data
100 Shoes
200 Laces
300 Socks
LISTING 2.2 2.data
100 $40.00
200 $1.00
300 $2.00

Now launch the following command:

join 1.data 2.data

The output is here:

100 Shoes $40.00
200 Laces $1.00
300 Socks $2.00
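
If the input files are not already sorted on the join field, you can sort them on the fly before joining them. The following snippet is a minimal sketch that assumes a bash shell with support for process substitution:

join <(sort 1.data) <(sort 2.data)

Because 1.data and 2.data are already sorted, this command produces the same output as the previous one.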

The fold Command

As you know from Chapter 1, the fold command enables you to display a set of lines with a fixed column width, and this section contains a few more examples. Note that this command does not take into account spaces between words: the output is displayed in columns that resemble a “newspaper” style.

The following command displays a set of lines with ten characters in each line:

x="aa bb cc d e f g h i j kk ll mm nn"
echo $x |fold -10

The output of the preceding code snippet is here:

aa bb cc d
 e f g h i
 j kk ll m
m nn

As another example, consider the following code snippet:

x="The quick brown fox jumps over the fat lazy dog. "
echo $x |fold -10

The output of the preceding code snippet is here:

The quick
brown fox
jumps over
 the fat l
azy dog.

The split Command

The split command is useful when you want to create a set of subfiles of a given file. By default, the subfiles are named xaa, xab, . . ., xaz, xba, xbb, . . ., xbz, . . . xza, xzb, . . ., xzz. Thus, the split command creates a maximum of 676 files (=26x26). The default size for each of these files is 1,000 lines.

The following snippet illustrates how to invoke the split command in order to split the file abc.txt into files with 500 lines each:

split -l 500 abc.txt

If the file abc.txt contains between 501 and 1,000 lines, then the preceding command will create the following pair of files:

xaa
xab

You can also specify a file prefix for the created files, as shown here:

split -l 500 abc.txt shorter

The preceding command creates the following pair of files:

shorterxaa
shorterxab
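
The pieces produced by the split command can be reassembled with the cat command. The following snippet is a minimal sketch that assumes the two files created by the preceding command:

cat shorterxaa shorterxab > recombined.txt

The file recombined.txt contains the same lines as the original input file.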

The sort Command

The sort command sorts the lines in a text file. For example, suppose that the text file test2.txt contains the following lines:

aa
cc
bb

The following simple example sorts the lines in test2.txt:

cat test2.txt |sort

The output of the preceding code snippet is here:

aa
bb
cc

The sort command arranges lines of text alphabetically by default. Some options for the sort command are here:

Option  Description
-n      Sort numerically (example: 10 will sort after 2); ignore blanks and tabs.
-r      Reverse the order of the sort.
-f      Sort upper- and lowercase together.
+x      Ignore the first x fields when sorting.
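
As a quick illustration of the difference between the default (alphabetic) sort and a numeric sort, consider the following minimal sketch:

printf "10\n2\n1\n" | sort
1
10
2
printf "10\n2\n1\n" | sort -n
1
2
10

The default sort places 10 before 2 because it compares the strings character by character, whereas the -n option compares the numeric values.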

You can use the sort command to display the files in a directory based on their file size. As a starting point, here is the unsorted long listing of a sample directory:

-rw-r--r-- 1 ocampesato staff  11 Jan 06 19:21 outfile.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 19:21 output.txt
-rwx------ 1 ocampesato staff  12 Jan 06 19:21 kyrgyzstan.txt
-rwx------ 1 ocampesato staff  25 Jan 06 19:21 apple-care.txt
-rwx------ 1 ocampesato staff 146 Jan 06 19:21 checkin-commands.txt
-rwx------ 1 ocampesato staff 176 Jan 06 19:21 ssl-instructions.txt
-rwx------ 1 ocampesato staff 417 Jan 06 19:43 iphonemeetup.txt

The sort command supports many options, some of which are summarized here.

The sort -r command sorts in reverse order, the sort -n command sorts on numeric data, and the sort -k command sorts on a specified field. For example, the following command displays the long listing of the files in a directory, sorted by file size:

ls -l | sort -k 5

The output is here:

total 72
-rwx------ 1 ocampesato staff  12 Jan 06 20:46 kyrgyzstan.txt
-rw-r--r-- 1 ocampesato staff  12 Jan 06 20:46 output.txt
-rw-r--r-- 1 ocampesato staff  14 Jan 06 20:46 outfile.txt
-rwx------ 1 ocampesato staff  25 Jan 06 20:46 apple-care.txt
-rwxr-xr-x 1 ocampesato staff  90 Jan 06 20:50 testvars.sh
-rwxr-xr-x 1 ocampesato staff 100 Jan 06 20:50 testvars2.sh
-rwx------ 1 ocampesato staff 146 Jan 06 20:46 checkin-commands.txt
-rwx------ 1 ocampesato staff 176 Jan 06 20:46 ssl-instructions.txt
-rwx------ 1 ocampesato staff 417 Jan 06 20:46 iphonemeetup.txt

Notice that the file listing is sorted based on the fifth column, which displays the file size of each file. You can display the files from largest to smallest by adding the -n (numeric) and -r (reverse) switches:

ls -l | sort -k 5 -n -r

In addition to sorting lists of files, you can use the sort command to sort the contents of a file. For example, suppose that the file abc2.txt contains the following:

This is line one
This is line two
This is line one
This is line three
Fourth line
Fifth line
The sixth line
The seventh line

The following command sorts the contents of abc2.txt:

sort abc2.txt

You can sort the contents of multiple files and redirect the output to another file:

sort outfile.txt output.txt > sortedfile.txt

An example of combining the commands sort and tail is shown here:

cat abc2.txt |sort |tail -5

The preceding command sorts the contents of the file abc2.txt and then displays the final five lines:

The sixth line
This is line one
This is line one
This is line three
This is line two

As you can see, the preceding output contains a duplicate line (“This is line one” appears twice). The next section shows you how to use the uniq command in order to remove duplicate lines.

The uniq Command

The uniq command removes adjacent duplicate lines, which is why its input is typically sorted first: after sorting, all duplicate lines are adjacent, so only the unique lines remain in the output. As a simple example, suppose the file test3.txt contains the following lines:

abc
def
abc
abc

The following command displays the unique lines:

cat test3.txt |sort | uniq

The output of the preceding code snippet is here:

abc
def
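
The uniq command also supports a -c option that prefixes each line with the number of times that it occurs (the input must still be sorted):

cat test3.txt | sort | uniq -c
3 abc
1 def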

How to Compare Files

The diff command enables you to compare two text files and the cmp command compares two binary files. For example, suppose that the file output.txt contains these two lines:

Hello
World

Suppose that the file outfile.txt contains these two lines:

goodbye
world

Then the output of this command:

diff output.txt outfile.txt

is shown here:

1,2c1,2
< Hello
< World
---
> goodbye
> world

Note that the diff command performs a case-sensitive text-based comparison, which means that the strings Hello and hello are different.
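
If you want a case-insensitive comparison of two text files, the -i option of the diff command treats Hello and hello as the same string. The cmp command performs a byte-by-byte comparison and reports the first byte at which the two files differ. A minimal sketch of both commands is here:

diff -i output.txt outfile.txt
cmp output.txt outfile.txt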

The od Command

The od command displays an octal dump of a file, which can be very helpful when you want to see embedded control characters (such as tab characters) that are not normally visible on the screen. This command contains many switches that you can see when you type man od.

As a simple example, suppose that the file abc.txt contains one line of text with the following three letters, separated by a tab character (which is not visible here) between each pair of letters:

a       b       c

The following command displays the tab and newline characters in the file abc.txt:

cat abc.txt | od -tc

The preceding command generates the following output:

0000000    a   \t   b   \t   c   \n
0000006

In the special case of tabs, another way to see them is to use the following cat command:

cat -t abc.txt

The output from the preceding command is here:

a^Ib^Ic

In Chapter 1 you learned that the echo command prints a newline character whereas the printf statement does not print a newline character (unless it is explicitly included). You can verify this fact for yourself with this code snippet:

echo abcde | od -c
0000000    a   b   c   d   e   \n
0000006
printf abcde | od -c
0000000    a   b   c   d   e
0000005

The tr Command

The tr command is a highly versatile command that supports many operations. For example, the tr command enables you to remove extraneous whitespaces in datasets, insert blank lines, print words on separate lines, and also translate characters from one character set to another character set (i.e., from uppercase to lowercase, and vice versa).

The following command capitalizes the letters in the variable x:

x="abc def ghi"
echo $x | tr [a-z] [A-Z]
ABC DEF GHI

Another way to convert from lowercase to uppercase:

cat columns4.txt | tr '[:lower:]' '[:upper:]'

In addition to lower and upper (shown in the preceding example), the tr command supports the following POSIX character classes:

alnum: alphanumeric characters

alpha: alphabetic characters

cntrl: control (non-printing) characters

digit: numeric characters

graph: graphic characters

lower: lowercase alphabetic characters

print: printable characters

punct: punctuation characters

space: whitespace characters

upper: uppercase characters

xdigit: hexadecimal characters 0–9 A–F

The following example removes white spaces in the variable x (initialized above):

echo $x |tr -ds " " ""
abcdefghi

The following command prints each word on a separate line:

echo "a b c" | tr -s " " "\012"
a
b
c

The following command replaces commas “,” with a linefeed:

echo "a,b,c" | tr -s "," "\n"
a
b
c

The following example replaces the linefeed in each line with a blank space, which produces a single line of output:

cat test4.txt |tr '\n' ' '

The output of the preceding command is here:

abc  def  abc  abc

The following example removes the linefeed character at the end of each line of text in a text file. As an illustration, Listing 2.3 displays the contents of abc2.txt.

LISTING 2.3 abc2.txt
This is line one
This is line two
This is line three
Fourth line
Fifth line
The sixth line
The seventh line

The following code snippet removes the linefeed character in the text file abc2.txt:

tr -d '\n' < abc2.txt

The output of the preceding tr code snippet is here:

This is line oneThis is line twoThis is line threeFourth lineFifth lineThe sixth lineThe seventh line

As you can see, the output is missing a blank space between consecutive lines, which we can insert with this command:

tr -s '\n' ' ' < abc2.txt

The output of the modified version of the tr code snippet is here:

This is line one This is line two This is line three Fourth line Fifth line The sixth line The seventh line

You can replace the linefeed character with a period “.” with this version of the tr command:

tr -s '\n' '.' < abc2.txt

The output of the preceding version of the tr code snippet is here:

This is line one.This is line two.This is line three.Fourth line.Fifth line.The sixth line.The seventh line.

The tr command with the -s option translates characters on a one-for-one basis, which means that you cannot replace each linefeed with a period followed by a space: specifying “. ” as the replacement has the same effect as specifying “.”. As a sort of “preview,” we can add a blank space after each period “.” by combining the tr command with the sed command (discussed in Chapter 4), as shown here:

tr -s '\n' '.' < abc2.txt | sed 's/\./\. /g'

The output of the preceding command is here:

This is line one. This is line two. This is line three. Fourth line. Fifth line. The sixth line. The seventh line.

Think of the preceding sed snippet as follows: “whenever a ‘dot’ is encountered, replace it with a ‘dot’ followed by a space, and do this for every such occurrence.”

You can also combine multiple commands using the Unix pipe symbol. For example, the following command sorts the contents of Listing 2.3, retrieves the “bottom” five lines of text, removes the duplicate lines, and then converts the text to uppercase letters:

cat abc2.txt |sort |tail -5 | uniq | tr [a-z] [A-Z]

Here is the output from the preceding command:

THE SIXTH LINE
THIS IS LINE ONE
THIS IS LINE THREE
THIS IS LINE TWO

You can also convert the first letter of a word to uppercase (or to lowercase) with the tr command, as shown here:

x="pizza"
x=`echo ${x:0:1} | tr '[a-z]' '[A-Z]'`${x:1}
echo $x

A slightly longer (one extra line of code) way to convert the first letter to uppercase is shown here:

x="pizza"
first=`echo $x|cut -c1|tr [a-z] [A-Z]`
second=`echo $x|cut -c2-`
echo $first$second

However, both of the preceding code blocks are somewhat obscure (at least for novices), so it’s probably better to use other tools, such as dataframes in R or RStudio.
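
Alternatively, if you are using bash version 4 or later, the ^ operator in parameter expansion capitalizes the first letter of a string (and ^^ capitalizes every letter). The following minimal sketch assumes a bash 4+ shell:

x="pizza"
echo ${x^}
Pizza
echo ${x^^}
PIZZA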

As you can see, it’s possible to combine multiple commands using the bash pipe symbol “|” in order to produce the desired output.

A Simple Use Case

The code sample in this section shows you how to use the tr command in order to replace the control character “^M” with a linefeed. Listing 2.4 displays the contents of the dataset controlm.csv that contains embedded control characters.

LISTING 2.4 controlm.csv
IDN,TEST,WEEK_MINUS1,WEEK0,WEEK1,WEEK2,WEEK3,WEEK4,WEEK10,WEEK
12,WEEK14,WEEK15,WEEK17,WEEK18,WEEK19,
WEEK21^M1,BASO,,1.4,,0.8,,1.2,,1.1,,,2.2,,,1.4^M1,
BASOAB,,0.05,,0.04,,0.05,,0.04,,,0.07,,,0.05^M1,EOS,,
6.1,,6.2,,7.5,,6.6,,,7.0,,,6.2^M1,EOSAB,,0.22,,0.30,,
0.27,,0.25,,,0.22,,,0.21^M1,HCT,,35.0,,34.2,,34.6,,34.3,,,36.2
,,,34.1^M1,HGB,,11.8,,11.1,,11.6,,11.5,,,12.1,,,
11.3^M1,LYM,,36.7

Listing 2.5 displays the contents of the file controlm.sh that illustrates how to remove the control characters from controlm.csv.

LISTING 2.5 controlm.sh
inputfile="controlm.csv"
removectrlmfile="removectrlmfile"
tr -s '\r' '\n' < $inputfile > $removectrlmfile

For convenience, Listing 2.5 contains a variable for the input file and one for the output file, but you can simplify the tr command in Listing 2.5 by using hard-coded values for the filenames.

The output from launching the shell script in Listing 2.5 is here:

IDN,TEST,WEEK_MINUS1,WEEK0,WEEK1,WEEK2,WEEK3,WEEK4,WEEK10,WEEK
12,WEEK14,WEEK15,WEEK17,WEEK18,WEEK19,WEEK21
1,BASO,,1.4,,0.8,,1.2,,1.1,,,2.2,,,1.4
1,BASOAB,,0.05,,0.04,,0.05,,0.04,,,0.07,,,0.05
1,EOS,,6.1,,6.2,,7.5,,6.6,,,7.0,,,6.2
1,EOSAB,,0.22,,0.30,,0.27,,0.25,,,0.22,,,0.21

As you can see, the task in this section is very easily solved via the tr command. Note that additional data cleaning is required in order to handle the empty fields in the output.

You can also replace the current delimiter “,” with a different delimiter, such as a “|” symbol with the following command:

cat removectrlmfile |tr -s ',' '|' > pipedfile

The resulting output is shown here (notice that the -s option also squeezes the consecutive commas, which is why the empty fields disappear):

IDN|TEST|WEEK_MINUS1|WEEK0|WEEK1|WEEK2|WEEK3|WEEK4|WEEK10|WEEK
12|WEEK14|WEEK15|WEEK17|WEEK18|WEEK19|WEEK21
1|BASO|1.4|0.8|1.2|1.1|2.2|1.4
1|BASOAB|0.05|0.04|0.05|0.04|0.07|0.05
1|EOS|6.1|6.2|7.5|6.6|7.0|6.2
1|EOSAB|0.22|0.30|0.27|0.25|0.22|0.21
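
If you want to keep the empty fields, omit the -s option so that each comma is translated one-for-one into a “|” symbol. The following minimal sketch writes the result to an arbitrarily named file pipedfile2:

cat removectrlmfile | tr ',' '|' > pipedfile2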

If you have a dataset with multiple delimiters in arbitrary order in multiple files, you can replace those delimiters with a single delimiter via the sed command, which is discussed in Chapter 4.

The find Command

The find command supports many options, including one for printing (displaying) the files returned by the find command, and another one for removing the files returned by the find command.

In addition, you can specify logical operators such as AND as well as OR in a find command. You can also specify switches to find the files (if any) that were created, accessed, or modified before (or after) a specific date.

Several examples are here:

find . -print displays all the files (including those in subdirectories)

find . -print |grep "abc" displays all the files whose names contain the string abc

find . -print |grep "sh$" displays all the files whose names have the suffix sh

find . -maxdepth 2 -print displays all files of depth at most 2 (including subdirectories)

You can also specify access times pertaining to files. For example, atime, ctime, and mtime refer to the last access time, the last status change time (when a file’s metadata was last modified), and the last modification time of a file, respectively.
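
For example, the following minimal sketch lists the files that were modified more recently than the file abc.txt (used earlier in this chapter):

find . -newer abc.txt -print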

As another example, the following command finds all the files modified less than 2 days ago and prints the line count of each:

$ find . -mtime -2 -exec wc -l {} \;

You can remove a set of files with the find command. For example, you can remove all the files in the current directory tree whose names end with the letter “m” as follows:

find . -name "*m" -print -exec rm {} \;

 

NOTE

Be careful when you remove files: run the preceding command without the -exec rm {} \; portion in order to review the list of files before deleting them.
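
A safer alternative (supported by most versions of the find command) is the -ok action, which prompts you for confirmation before removing each file:

find . -name "*m" -print -ok rm {} \;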

The tee Command

The tee command enables you to display output to the screen and also redirect the output to a file at the same time. The -a option will append subsequent output to the named file instead of overwriting the file. An example is here:

find . -print |xargs grep "sh$" | tee /tmp/blue

The preceding code snippet redirects the list of all files in the current directory (and those in any subdirectories) to the xargs command, which then searches—and prints—all the lines that end with the string “sh.” The result is displayed on the screen and is also redirected to the file /tmp/blue.

find . -print |xargs grep "^abc$" | tee -a /tmp/blue

The preceding code snippet also redirects the list of all files in the current directory (and those in any subdirectories) to the xargs command, which then searches—and prints—all the lines that contain only the string “abc.” The result is displayed on the screen and is also appended to the file /tmp/blue.

File Compression Commands

Bash supports various commands for compressing sets of files, including the tar, cpio, gzip, and gunzip commands. The following subsections contain simple examples of how to use these commands.

The tar Command

The tar command enables you to compress a set of files in a directory, uncompress a tar file, and also display the contents of a tar file.

The “c” option specifies “create,” the “f” option specifies “file,” and the “v” option specifies “verbose.” For example, the following command creates a compressed file called testing.tar and displays the files that are included in testing.tar during the creation of this file:

tar cvf testing.tar *.txt

The compressed file testing.tar contains the files with the suffix txt in the current directory, and you will see the following output:

a apple-care.txt
a checkin-commands.txt
a iphonemeetup.txt
a kyrgyzstan.txt
a outfile.txt
a output.txt
a ssl-instructions.txt

The following command extracts the files that are in the tar file testing.tar:

tar xvf testing.tar

The following command displays the contents of a tar file without uncompressing its contents:

tar tvf testing.tar

The preceding command displays output that resembles the long listing produced by the ls -l command.

The “z” option uses gzip compression. For example, the following command creates a compressed file called testing.tar.gz:

tar czvf testing.tar.gz *.txt
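
The z option can also be combined with the x and t options in order to extract a gzip-compressed tar file or to list its contents without extracting it:

tar xzvf testing.tar.gz
tar tzvf testing.tar.gz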

The cpio Command

The cpio command copies files into (and out of) archive files. For example, the following command creates the file archive.cpio that contains the files file1, file2, and file3:

ls file1 file2 file3 | cpio -ov > archive.cpio

The “-o” option specifies output mode (create an archive) and the “-v” option specifies verbose, which means that the filenames are displayed as they are placed in the archive file. The “-i” option specifies input mode (read an archive), the “-t” option lists the contents of an archive, and the “-d” option creates directories as needed when extracting files.

You can combine other commands (such as the find command) with the cpio command, an example of which is here:

find . -name "*.sh" | cpio -ov > shell-scripts.cpio

You can list the contents of the file archive.cpio with the following command:

cpio -it < archive.cpio

The output of the preceding command is here:

file1
file2
file3
1 block
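
To extract the files from the archive (and create any required directories), use the -i and -d options, optionally with -v for verbose output:

cpio -idv < archive.cpio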

The gzip and gunzip Commands

The gzip command creates a compressed file. For example, the following command creates the compressed file filename.gz:

gzip filename

Extract the contents of the compressed file filename.gz with the gunzip command:

gunzip filename.gz

You can create gzipped tarballs using the following methods:

Method #1:

tar -czvf archive.tar.gz [LIST-OF-FILES]

Method #2:

tar -cavf archive.tar.gz [LIST-OF-FILES]

The -a option specifies that the compression format should automatically be detected from the extension.

The bzip2 and bunzip2 Commands

The bzip2 utility uses a compression technique that is similar to gzip, except that bzip2 typically produces smaller (more compressed) files than gzip, and the bunzip2 utility decompresses those files. Both commands come with all Linux distributions. In order to compress a file with bzip2, use:

bzip2 filename
ls
filename.bz2
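
The bunzip2 command decompresses the file and restores the original filename:

bunzip2 filename.bz2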

The zip Command

The zip command is another utility for creating zip files. For example, if you have the files called file1, file2, and file3, then the following command creates the file file1.zip that contains these three files:

zip file1.zip file?

The zip command has useful options (such as –x for excluding files), and you can find more information in online tutorials.

Commands for zip Files and bz Files

There are various commands for handling zip files, including zdiff, zcmp, zmore, zless, zcat, zipgrep, zipsplit, zipinfo, zgrep, zfgrep, and zegrep.

Remove the initial “z” or “zip” from these commands to obtain the corresponding “regular” bash command.

For example, the zcat command is the counterpart to the cat command, so you can display the contents of a file in a .gz file without manually extracting that file and also without modifying the contents of the .gz file. Here is an example:

ls test.gz
zcat test.gz
A test file

# the file test.gz contains the single line “A test file”
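
Similarly, the zgrep command searches inside a compressed file just as the grep command searches a regular text file:

zgrep "test" test.gz
A test file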

Another set of utilities for bz files includes bzcat, bzcmp, bzdiff, bzegrep, bzfgrep, bzgrep, bzless, and bzmore.

Read the online documentation to find out more about these commands.

Internal Field Separator (IFS)

The Internal Field Separator (IFS) is an important concept in shell scripting that is useful for manipulating text data. IFS is a shell variable that stores the delimiting characters: it is the default delimiter string used by a running shell environment.

Consider the case where we need to iterate through the words in a string or through comma-separated values (CSV). In the first case we will use IFS=" " and in the second case we will use IFS=",". Suppose that the shell variable data is defined as follows:

data="name,sex,rollno,location"

To read each of the data elements into a variable, we can use IFS as shown here:

oldIFS=$IFS
IFS=,
for item in $data
do
  echo Item: $item
done
IFS=$oldIFS
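
Launching the preceding loop in a bash shell produces the following output:

Item: name
Item: sex
Item: rollno
Item: location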

The next section contains a code sample that relies on the value of IFS in order to extract data correctly from a dataset.

Data from a Range of Columns in a Dataset

Listing 2.6 displays the contents of the dataset datacolumns1.txt and Listing 2.7 displays the contents of the shell script datacolumns1.sh that illustrates how to extract data from a range of columns from the dataset in Listing 2.6.

LISTING 2.6 datacolumns1.txt
#23456789012345678901234567890
  1000   Jane   Edwards
  2000   Tom    Smith
  3000   Dave   Del Ray
LISTING 2.7 datacolumns1.sh
# empid: 03-09
# fname: 11-20
# lname: 21-30
IFS=''
inputfile="datacolumns1.txt"

while read line
do
  pound="`echo $line |grep '^#'`"

  if [ x"$pound" == x"" ]
  then
    echo "line: $line"
    empid=`echo "$line" |cut -c3-9`
    echo "empid: $empid"

    fname=`echo "$line" |cut -c11-19`
    echo "fname: $fname"

    lname=`echo "$line" |cut -c21-29`
    echo "lname: $lname"
    echo "--------------"
  fi
done < $inputfile

Listing 2.7 sets the value of IFS to an empty string, which is required for this shell script to work correctly (try running this script without setting IFS and see what happens). The body of this script contains a while loop that reads each line from the input file called datacolumns1.txt and sets the pound variable equal to “” if a line does not start with the “#” character OR sets the pound variable equal to the entire line if it does start with the “#” character. This is a simple technique for “filtering” lines based on their initial character.

The if statement executes for lines that do not start with a “#” character, and the variables empid, fname, and lname are initialized with the characters in columns 3 through 9, 11 through 19, and 21 through 29, respectively. The values of those three variables are printed each time they are initialized. As you can see, these variables are initialized by a combination of the echo command and the cut command, and the empty IFS ensures that leading blank spaces are not stripped when each line is read, so the column positions remain correct.

The output from Listing 2.7 is shown below:

line:   1000   Jane   Edwards
empid: 1000
fname: Jane
lname: Edwards
--------------
line:   2000   Tom    Smith
empid: 2000
fname: Tom
lname: Smith
--------------
line:   3000   Dave   Del Ray
empid: 3000
fname: Dave
lname: Del Ray
--------------
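
Incidentally, since the fields occupy fixed column positions, you can extract a single column from every non-comment line with a combination of the grep command and the cut command. The following minimal sketch displays only the empid column:

grep -v '^#' datacolumns1.txt | cut -c3-9
1000
2000
3000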

Working with Uneven Rows in Datasets

Listing 2.8 displays the contents of the dataset uneven.txt that contains rows with a different number of columns. Listing 2.9 displays the contents of the bash script uneven.sh that illustrates how to generate a dataset whose rows have the same number of columns.

LISTING 2.8 uneven.txt
abc1 abc2 abc3 abc4
abc5 abc6
abc1 abc2 abc3 abc4
abc5 abc6
LISTING 2.9 uneven.sh
inputfile="uneven.txt"
outputfile="even2.txt"

# ==> four fields per line

#method #1: four fields per line
cat $inputfile | xargs -n 4 >$outputfile

#method #2: two equal rows
#xargs -L 2 <$inputfile > $outputfile

echo "input file:"
cat $inputfile

echo "output file:"
cat $outputfile

Listing 2.9 contains two techniques for realigning the input file so that the output appears with four columns in each row. As you can see, both techniques involve the xargs command (which is an interesting use of the xargs command).

Launch the code in Listing 2.9, and the contents of the output file are shown here:

abc1 abc2 abc3 abc4
abc5 abc6 abc1 abc2
abc3 abc4 abc5 abc6
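
By contrast, the second (commented-out) method in Listing 2.9 groups the input two lines at a time, so the output consists of two identical rows with six fields each:

xargs -L 2 < uneven.txt
abc1 abc2 abc3 abc4 abc5 abc6
abc1 abc2 abc3 abc4 abc5 abc6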

Working with Functions in Shell Scripts

A shell function can be defined by using the keyword function, followed by the name of the function (specified by you) and a pair of round parentheses, followed by a pair of curly braces that contain shell commands. The general form is shown here:

function fname()
{
   statements;
}

An alternate method of defining a shell function is shown here:

fname()
{
   statements;
}

A function can be invoked by its name:

fname ; # executes function

Arguments can be passed to functions and can be accessed by the shell script:

fname arg1 arg2 ; # passing args

Listing 2.10 displays the contents of checkuser.sh, which illustrates how to prompt users for two input strings and then invoke a function with those strings as parameters.

LISTING 2.10 checkuser.sh
#!/bin/bash

function checkNewUser()
{
   echo "argument #1 = $1"
   echo "argument #2 = $2"
   echo "arg count = $#"

   if test "$1" = "John" && test "$2" = "Smith"
   then
     return 1
   else
     return 0
   fi
}

/bin/echo -n "First name: "
read fname
/bin/echo -n "Last name: "
read lname

checkNewUser $fname $lname
echo "result = $?"

Listing 2.10 contains the function checkNewUser() that displays the value of the first argument, the second argument, and the total number of arguments, respectively. This function returns the value 1 if the first argument is John and the second argument is Smith; otherwise the function returns 0.

The remaining portion of Listing 2.10 invokes the echo command twice in order to prompt users to enter a first name and a last name, and then invokes the function checkNewUser() with these two input values. A sample output from launching Listing 2.10 is shown here:

First name: John
Last name: Smith
argument #1 = John
argument #2 = Smith
arg count = 2
result = 1

What about using command substitution in order to invoke the function checkNewUser? In order to find out what would happen, let’s add the following code snippet to the bottom of Listing 2.10:

result=`checkNewUser $fname $lname`
echo "result = $result"

Launch the modified version of Listing 2.10, provide the same input values of John and Smith, and compare the following result with the previous result:

First name: John
Last name: Smith
argument #1 = John
argument #2 = Smith
arg count = 2
result = 1
result = argument #1 = John
argument #2 = Smith
arg count = 2
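
As you can see, command substitution captures everything that the function writes to stdout, not just its return value. If you want to retrieve a result via command substitution, a common idiom is to write a function that echoes only the value you want to capture. The following minimal sketch (not part of Listing 2.10) illustrates the idea:

fullName()
{
   # echo only the value that the caller should capture
   echo "$1 $2"
}

name=`fullName John Smith`
echo "name = $name"

In this case the variable name contains the string John Smith.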

Recursion and Shell Scripts

This section contains several examples of shell scripts with recursion, which is a topic that occurs in many programming languages. Although you probably won’t need to write many scripts that use recursion, it’s worthwhile to learn this concept, especially if you plan to study other languages.

If you already understand recursion, then the scripts in this section will be straightforward. In particular, you will learn how to calculate the factorial value of a positive integer. In case you are interested, the Appendix contains bash scripts for calculating the Fibonacci number of a positive integer, as well as bash scripts for calculating the greatest common divisor (GCD) and the least common multiple (LCM) of two positive integers.

Listing 2.11 displays the contents of Factorial.sh that computes the factorial value of a positive integer.

LISTING 2.11 Factorial.sh
#!/bin/sh

factorial()
{

   if [ "$1" -gt 1 ]
   then
      decr=`expr $1 - 1`
      result=`factorial $decr`
      product=`expr $1 \* $result`
      echo $product
   else
      # we have reached 1:
      echo 1
   fi
}
echo "Enter a number: "
read num

# add code to ensure it's a positive integer

echo "$num! = `factorial $num`"

Listing 2.11 contains the factorial() function with conditional logic: if the first parameter is greater than 1, then the variable decr is initialized as 1 less than the value of $1, followed by initializing result with the recursive invocation of the factorial() function with the argument decr. Finally, this block of code initializes product as the value of $1 multiplied by the value of result. Note that if the first parameter is not greater than 1, then the value 1 is returned.

The last portion of Listing 2.11 prompts users for a number and then the factorial value of that number is computed and displayed. For simplicity, non-integer values are not checked (you can try to add that functionality yourself).

Iterative Solutions for Factorial Values

Listing 2.12 displays the contents of Factorial2.sh, which computes the factorial value of a positive integer using a for loop.

LISTING 2.12 Factorial2.sh
#!/bin/bash

factorial()
{
   num=$1
   result=1
   for (( i=2; i<=${num}; i++ ));
   do
     result=$((${result}*$i))
   done

   echo $result
}

printf "Enter a number: "
read num

echo "$num! = `factorial $num`"

Listing 2.12 contains a function called factorial() that initializes the variable num to the first argument passed into the function factorial(), followed by the variable result whose initial value is 1. The next portion of Listing 2.12 is a for loop that iteratively multiplies the value of result by the numbers between 2 and num inclusive, and then echoes the value of the variable result.

The final portion of Listing 2.12 prompts users for a number and then uses command substitution to invoke the function factorial() with the user-supplied value. Note that no validation is performed in order to ensure that the input value is a non-negative integer. The echo statement displays the calculated factorial value.

Listing 2.13 displays the contents of Factorial3.sh, which computes the factorial value of a positive integer using a for loop and an array that keeps track of intermediate factorial values.

LISTING 2.13 Factorial3.sh
#!/bin/bash

factorial()
{
   num=$1
   result=1
   for (( i=2; i<=${num}; i++ ));
   do
     result=$((${result}*$i))
     factvalues[$i]=$result
   done
}

printf "Enter a number: "
read num

for (( i=1; i<=${num}; i++ ));
do
  factvalues[$i]=1
done

factorial $num

# print each element via a loop:
for (( i=1; i<=${num}; i++ ));
do
  echo "Factorial of $i : " ${factvalues[$i]}
done

Listing 2.13 is very similar to the code in Listing 2.12: the key difference is that intermediate factorial values are stored in the array factvalues. Notice the initial loop that sets each element of factvalues to 1: this ensures that factvalues[1] has the correct value (the factorial() function only fills in the entries from index 2 onward), and since the array is not declared local, its values remain visible outside the function, so we don’t need to return anything from the factorial() function.

The last portion of Listing 2.13 contains a for loop that displays the intermediate factorial values as well as the factorial of the user-provided input.

Summary

This chapter showed you examples of how to use some useful and versatile bash commands. First you learned about the bash commands join, fold, split, sort, and uniq. Next you learned about the find command and the xargs command. You also learned about various ways to use the tr command, which also appears in this chapter’s use case.

Then you saw some compression-related commands, such as cpio and tar, which help you create new compressed files and also help you examine the contents of compressed files.

In addition, you learned how to extract data from a range of columns, as well as the usefulness of the IFS variable. Finally, you saw an example of a bash script for computing the factorial value of a number via recursion.