APPENDIX

OTHER CODE SAMPLES

This appendix contains an assortment of bash scripts that illustrate how to solve some well-known tasks, such as recursion-based solutions for the GCD and LCM of two positive integers, as well as awk commands for processing multiple datasets in order to perform arithmetic calculations.

The shell scripts are grouped according to their respective chapters: for instance, awk-related bash scripts are listed in the section for Chapter 5. In some cases (such as Chapter 1), N/A is listed when there are no samples for a chapter. Please keep in mind that the coverage (in terms of explanations) of the code samples in this Appendix is fairly light: the assumption is that you have read the code samples in the chapters, thereby enabling you to understand this code without in-depth explanations.

Examples for Chapter 1

N/A

Examples for Chapter 2

The examples in this Appendix for Chapter 2 contain the following shell scripts for calculating Fibonacci numbers, the GCD and LCM of two positive integers, and the divisors of a positive integer:

Fibonacci.sh

gcd.sh

lcm.sh

Divisors2.sh

Calculating Fibonacci Numbers

Listing A.1 displays the contents of Fibonacci.sh that computes the Fibonacci value of a positive integer.

LISTING A.1: Fibonacci.sh
#!/bin/sh

LOGFILE="/tmp/a1"

rm -f $LOGFILE 2>/dev/null


fib()
{
    if [ "$1" -gt 3 ]

    then

echo "1 = $1 2 = $2 3 = $3" >> $LOGFILE


      decr1=`expr $2 - 1`

      decr2=`expr $3 - 1`

      decr3=`expr $3 - 2`

echo "d1 = $decr1 d2 = $decr2 d3 = $decr3" >> $LOGFILE


      fib1=`fib $2 $3 $decr2`

      fib2=`fib $3 $decr2 $decr3`

      fib=`expr $fib1 + $fib2`

      echo $fib

   else

      if [ "$1" -eq 3 ]

      then

        echo 2

      else

        echo 1

      fi

   fi

}

echo "Enter a number: "

read num

# add code to ensure it's a positive integer


if [ "$num" -lt 3 ]

then

  echo "fibonacci $num = 1"

else

  decr1=`expr $num - 1`

  decr2=`expr $num - 2`

  echo "fibonacci $num = `fib $num $decr1 $decr2`"

fi

In case you don't already know, the Fibonacci sequence is defined as follows:

F(1) = 1; F(2) = 1; and F(n) = F(n-1) + F(n-2) for n > 2.

Listing A.1 looks complicated, but in a sense it "extends" the technique shown in Listing 2.10 in Chapter 2. In particular, the code for calculating factorial values involves decrementing one variable, whereas calculating Fibonacci numbers involves decrementing two variables (which are called decr1 and decr2 in Listing A.1) in order to make recursive invocations of the fib() function.
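On a related note, each recursive call in Listing A.1 runs in a subshell, which becomes slow for larger inputs. The following sketch (the name fib_iter is illustrative, not from the chapters) computes the same values iteratively:

```shell
# Iterative Fibonacci with F(1) = F(2) = 1 (illustrative helper)
fib_iter()
{
  n="$1"
  if [ "$n" -le 2 ]
  then
    echo 1
    return
  fi
  a=1
  b=1
  i=2
  while [ "$i" -lt "$n" ]
  do
    c=`expr $a + $b`   # next value is the sum of the previous two
    a="$b"
    b="$c"
    i=`expr $i + 1`
  done
  echo "$b"
}
```

For example, fib_iter 10 prints 55, and no recursive subshells are spawned.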

Calculating the GCD of Two Positive Integers

Listing A.2 displays the contents of the shell script gcd.sh that computes the greatest common divisor of two positive integers.

LISTING A.2 gcd.sh
#!/bin/sh


function gcd()

{

  if [ $1 -lt $2 ]

  then

    result=`gcd $2 $1`

    echo $result

else

  remainder=`expr $1 % $2`

  if [ $remainder == 0 ]

  then

    echo $2

  else

    echo `gcd $2 $remainder`

  fi

 fi

}



a="4"

b="20"

result=`gcd $a $b`

echo "GCD of $a and $b = $result"


a="4"

b="22"

result=`gcd $a $b`

echo "GCD of $b and $a = $result"


a="20"

b="3"

result=`gcd $a $b`

echo "GCD of $b and $a = $result"


a="10"

b="10"

result=`gcd $a $b`

echo "GCD of $b and $a = $result"

Listing A.2 is a straightforward implementation of the Euclidean algorithm (check Wikipedia for details) for finding the GCD of two positive integers. The output from Listing A.2 shows the GCD of several pairs of positive integers, as shown here:


GCD of 4 and 20 = 4

GCD of 22 and 4 = 2

GCD of 3 and 20 = 1

GCD of 10 and 10 = 10
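The same algorithm can also be written without recursion: repeatedly replace the pair (a, b) with (b, a mod b) until b reaches zero. A minimal sketch (the name gcd_iter is illustrative):

```shell
# Iterative Euclidean algorithm: replace (a, b) with (b, a mod b)
# until b reaches 0 (illustrative helper name)
gcd_iter()
{
  a="$1"
  b="$2"
  while [ "$b" -ne 0 ]
  do
    r=`expr $a % $b`
    a="$b"
    b="$r"
  done
  echo "$a"
}
```

For example, gcd_iter 4 20 prints 4, and gcd_iter 22 4 prints 2, in agreement with the output shown above.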

Calculating the LCM of Two Positive Integers

Listing A.3 displays the contents of the shell script lcm.sh that computes the lowest common multiple (LCM) of two positive integers. This script includes the gcd() function from the shell script gcd.sh, because the GCD of two positive integers is needed in order to compute their LCM.

LISTING A.3: lcm.sh
#!/bin/sh



function gcd()

{

   if [ $1 -lt $2 ]

   then

     result=`gcd $2 $1`

     echo $result

   else

     remainder=`expr $1 % $2`


     if [ $remainder == 0 ]

     then

       echo $2

     else

       result=`gcd $2 $remainder`

       echo $result

     fi

   fi

}

function lcm()

{

     gcd1=`gcd $1 $2`

     lcm1=`expr $1 / $gcd1`

     lcm2=`expr $lcm1 \* $2`

     echo $lcm2
}

a="24"

b="10"

result=`lcm $a $b`

echo "The LCM of $a and $b = $result"


a="10"

b="30"

result=`lcm $a $b`

echo "The LCM of $a and $b = $result"

Notice that Listing A.3 contains the gcd() function to compute the GCD of two positive integers. This function is necessary because the next portion of Listing A.3 contains the lcm() function that invokes the gcd() function, followed by some multiplication steps in order to calculate the LCM of two numbers. The output from Listing A.3 displays the LCM of 24 and 10, as well as the LCM of 10 and 30, as shown here:

The LCM of 24 and 10 = 120
The LCM of 10 and 30 = 30
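The lcm() function relies on the identity lcm(a, b) = (a * b) / gcd(a, b). The following self-contained sketch (the names gcd2 and lcm2 are illustrative) shows the same identity in compact form:

```shell
# LCM via the identity lcm(a,b) = (a * b) / gcd(a,b)
# (self-contained sketch; the names gcd2 and lcm2 are illustrative)
gcd2()
{
  a="$1"
  b="$2"
  while [ "$b" -ne 0 ]
  do
    r=`expr $a % $b`
    a="$b"
    b="$r"
  done
  echo "$a"
}

lcm2()
{
  g=`gcd2 $1 $2`
  expr $1 \* $2 / $g
}
```

For example, lcm2 24 10 prints 120, in agreement with the output of Listing A.3.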

Calculating Prime Divisors

Listing A.4 displays the contents of the shell script Divisors2.sh that calculates the prime factors of a positive integer.

LISTING A.4: Divisors2.sh
#!/bin/sh

function divisors()

{

   div="2"

   num="$1"

   primes=""


   while (true)

   do

     remainder=`expr $num % $div`


     if [ $remainder == 0 ]

     then

      #echo "divisor: $div"

       primes="${primes} $div"

         num=`expr $num / $div`

       else

         div=`expr $div + 1`

       fi


       if [ $num -eq 1 ]

       then

         break

       fi

     done


     # use 'echo' instead of 'return'

     echo $primes

}


num="12"

primes=`divisors $num`

echo "The prime divisors of $num: $primes"


num="768"

primes=`divisors $num`

echo "The prime divisors of $num: $primes"


num="12345"

primes=`divisors $num`

echo "The prime divisors of $num: $primes"


num="23768"

primes=`divisors $num`

echo "The prime divisors of $num: $primes"

Listing A.4 contains the divisors() function that consists primarily of a while loop that checks for the divisors of num (which is initialized as the value of $1). The initial value of div is 2, and each time div divides num, the value of div is appended to the primes string, and num is replaced by num/div. If div does not divide num, div is incremented by 1. Note that the while loop in Listing A.4 terminates when num reaches the value of 1.

The output from Listing A.4 displays the prime divisors of 12, 768, 12345, and 23768, as shown here:


The prime divisors of 12: 2 2 3

The prime divisors of 768: 2 2 2 2 2 2 2 2 3

The prime divisors of 12345: 3 5 823

The prime divisors of 23768: 2 2 2 2971

The prime factors of 12 and 768 are computed in less than 1 second, but the calculation of the prime factors of 12345 and 23768 is significantly slower.
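One reason for the slowdown is that the loop in Listing A.4 keeps incrementing div all the way up to the last (possibly large) prime factor. A common refinement, sketched below with the illustrative name divisors_fast, stops the trial division once div*div exceeds num, at which point any remaining value of num is itself prime:

```shell
# Trial division that stops when div*div exceeds num: whatever
# remains in num at that point is prime (illustrative helper name)
divisors_fast()
{
  num="$1"
  div=2
  primes=""
  while [ `expr $div \* $div` -le "$num" ]
  do
    if [ `expr $num % $div` -eq 0 ]
    then
      primes="$primes $div"
      num=`expr $num / $div`
    else
      div=`expr $div + 1`
    fi
  done
  if [ "$num" -gt 1 ]
  then
    primes="$primes $num"   # the leftover factor is prime
  fi
  echo $primes
}
```

For example, divisors_fast 12345 prints 3 5 823 after far fewer loop iterations, because div never climbs past the square root of the remaining value.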

Examples for Chapter 3

The first example in this section illustrates how to determine which zip files contain SVG documents. The second example in this section shows you how to check the entries in a log file (with simulated values). The third code sample shows you how to use the grep command in order to simulate a relational database consisting of three "tables", each of which is represented by a dataset.

Listing A.5 displays the contents of myzip.sh that produces two lists of files: the first list contains the names of the zip files that contain SVG documents, and the second list contains the names of the zip files that do not contain SVG documents.

LISTING A.5: myzip.sh
foundlist=""

notfoundlist=""


for f in `ls *zip`

do

  found=`unzip -v $f |grep "svg$"`

  if [ "$found" != "" ]
  
  then

   #echo "$f contains SVG documents:"

   #echo "$found"

    foundlist="$f ${foundlist}"

else

    notfoundlist="$f ${notfoundlist}"

  fi

done


echo "Files containing SVG documents:"

echo $foundlist| tr ' ' '\n'


echo "Files not containing SVG documents:"

echo $notfoundlist |tr ' ' '\n'

Listing A.5 searches ("looks inside") zip files for the hard-coded string svg. If you want to search for some other string in a set of zip files, then manually replace this string with that other string. Alternatively, you can prompt users for a search string so you don't need to make manual modifications to the shell script.
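The following sketch (the function name search_zips is illustrative) shows one way to parameterize the search string instead of hard-coding it:

```shell
# Illustrative helper: search every zip file in the current
# directory for a string supplied as an argument
search_zips()
{
  searchstr="$1"
  for f in `ls *.zip 2>/dev/null`
  do
    found=`unzip -v "$f" | grep "$searchstr"`
    if [ "$found" != "" ]
    then
      echo "$f contains $searchstr"
    fi
  done
}
```

You could also replace the argument with a read statement to prompt for the string interactively.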

For your convenience, Listing A.6 displays the contents of searchstrings.sh that illustrates how to enter one or more strings on the command line, in order to search for those strings in the zip files in the current directory.

LISTING A.6: searchstrings.sh
foundlist=""

notfoundlist=""


if [ "$#" == 0 ]

then

    echo "Usage: $0 <string-list>"

    exit

fi


zipfiles=`ls *zip 2>/dev/null`

if [ "$zipfiles" = "" ]

then

     echo "*** No zip files in `pwd` ***"

     exit

fi

for str in "$@"

do

echo "Checking zip files for $str:"

for f in `ls *zip`

do

  found=`unzip -v $f |grep "$str"`

  if [ "$found" != "" ]

  then

    foundlist="$f ${foundlist}"

  else

    notfoundlist="$f ${notfoundlist}"

  fi

 done

 echo "Files containing $str:"

 echo $foundlist| tr ' ' '\n'


 echo "Files not containing $str:"

 echo $notfoundlist |tr ' ' '\n'

 foundlist=""

 notfoundlist=""

done

Listing A.6 first checks that at least one search string is specified on the command line, and then initializes the zipfiles variable with the list of zip files in the current directory. If zipfiles is null, an appropriate message is displayed.

The next section of Listing A.6 contains a for loop that processes each argument that was specified at the command line. For each such argument, another for loop will check for the names of the zip files that contain that argument. If there is a match, then the variable $foundlist is updated, otherwise the $notfoundlist variable is updated. When the inner loop has completed, the names of the matching files and the non-matching files are displayed, and then the outer loop is executed with the next command line argument.

Although the preceding explanation might seem complicated, a sample output from launching Listing A.6 will clarify how the code works:

./searchstrings.sh svg abc

Checking zip files for svg:

Files containing svg:


Files not containing svg:

shell-programming-manuscript.zip

shell-progr-manuscript-0930-2013.zip

shell-progr-manuscript-0207-2015.zip

shell-prog-manuscript.zip

Checking zip files for abc:

Files containing abc:


Files not containing abc:

shell-programming-manuscript.zip

shell-progr-manuscript-0930-2013.zip

shell-progr-manuscript-0207-2015.zip

shell-prog-manuscript.zip

If you want to perform the search for zip files in subdirectories, modify the loop as shown here:


for f in `find . -print |grep "zip$"`

do

   echo "Searching $f…"

   unzip -v $f |grep "svg$"

done

If you have the Java SDK on your machine, you can also use the jar command instead of the unzip command, as shown here:

jar tvf $f |grep "svg$"

Listing A.7 displays the contents of skutotals.sh that calculates the number of units sold for each SKU in skuvalues.txt.

LISTING A.7: skutotals.sh
SKUVALUES="skuvalues.txt"

SKUSOLD="skusold.txt"


for sku in `cat $SKUVALUES`

do

  total=`cat $SKUSOLD |grep $sku | awk '{total += $2} END
{print total}'`

  echo "UNITS SOLD FOR SKU $sku: $total"

done

Listing A.7 contains a for loop that iterates through the rows of the file skuvalues.txt, and passes those SKU values – one at a time – to a command that involves the cat, grep, and awk commands. The purpose of the latter combination of commands is to 1) find the matching lines in skusold.txt, 2) compute the sum of the values of the numbers in the second column, and 3) print the subtotal for the current SKU. In essence, this shell script prints the subtotals for each SKU value.

Launch skutotals.sh and you will see the following output:


UNITS SOLD FOR SKU 4520: 27

UNITS SOLD FOR SKU 5530: 17

UNITS SOLD FOR SKU 6550: 8

UNITS SOLD FOR SKU 7200: 90

UNITS SOLD FOR SKU 8000: 160
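Note that skutotals.sh re-reads skusold.txt once for every SKU. A single awk pass with an associative array produces all the subtotals at once; the following sketch uses illustrative demo data, since the contents of skusold.txt are not reproduced in this Appendix:

```shell
# Demo data (illustrative; the real skusold.txt is from Chapter 3)
cat > /tmp/skusold_demo.txt << 'EOF'
4520 12
4520 15
5530 17
6550 8
EOF

# One pass: accumulate units per SKU in an associative array
awk '{ total[$1] += $2 }
     END { for (sku in total) print "UNITS SOLD FOR SKU " sku ": " total[sku] }' /tmp/skusold_demo.txt | sort
```

The sort at the end is needed because awk's for (sku in total) loop does not guarantee any particular key order.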

We can generalize the previous shell script to take into account different prices for each SKU. Listing A.8 displays the contents of skuprices.txt.

LISTING A.8: skuprices.txt
4520 3.50

5530 5.00

6550 2.75

7200 6.25

8000 3.50

Listing A.9 displays the contents of skutotals2.sh that extends the code in Listing A.7 in order to calculate the revenue for each SKU.

LISTING A.9: skutotals2.sh
SKUVALUES="skuvalues.txt"

SKUSOLD="skusold.txt"

SKUPRICES="skuprices.txt"


for sku in `cat $SKUVALUES`

do

  skuprice=`grep $sku $SKUPRICES | cut -d" " -f2`

  subtotal=`cat $SKUSOLD |grep $sku | awk '{total += $2} END
{print total}'`

  total=`echo "$subtotal * $skuprice" |bc`

  echo "AMOUNT SOLD FOR SKU $sku: $total"

done

Listing A.9 contains a slight enhancement: instead of computing the subtotals of the number of units for each SKU, the revenue for each SKU is computed, where the revenue for each item equals the price of the SKU multiplied by the number of units sold for the given SKU. Launch skutotals2.sh and you will see the following output:

AMOUNT SOLD FOR SKU 4520: 94.50

AMOUNT SOLD FOR SKU 5530: 85.00

AMOUNT SOLD FOR SKU 6550: 22.00

AMOUNT SOLD FOR SKU 7200: 562.50

AMOUNT SOLD FOR SKU 8000: 560.00

Listing A.10 displays the contents of skutotals3.sh that calculates the minimum, maximum, average, and total number of units sold for each SKU in skuvalues.txt.

LISTING A.10: skutotals3.sh
SKUVALUES="skuvalues.txt"

SKUSOLD="skusold.txt"

TOTALS="totalspersku.txt"

rm -f $TOTALS 2>/dev/null


##############################

#calculate totals for each sku

##############################

for sku in `cat $SKUVALUES`

do

   total=`cat $SKUSOLD |grep $sku | awk '{total += $2} END
{print total}'`

   echo "UNITS SOLD FOR SKU $sku: $total"

   echo "$sku|$total" >> $TOTALS

done


##########################

#calculate max/min/average

##########################

awk -F"|" '

   BEGIN {first = 1;}

   {if(first) { min = max = sum = $2; first=0; next}}

   { if($2 < min) { min = $2 }

     if($2 > max) { max = $2 }

     sum += $2
   }

   END {print "Minimum = ",min

        print "Maximum = ",max

        print "Average = ",sum/NR

        print "Total = ",sum
}

' $TOTALS

Listing A.10 initializes some variables, followed by a for loop that invokes an awk command in order to compute subtotals (i.e., number of units sold) for each SKU value. The next portion of Listing A.10 contains an awk command that calculates the maximum, minimum, average, and sum for the SKU units in the files $TOTALS.

Launch the script file in Listing A.10 and you will see the following output:


UNITS SOLD FOR SKU 4520: 27

UNITS SOLD FOR SKU 5530: 17

UNITS SOLD FOR SKU 6550: 8

UNITS SOLD FOR SKU 7200: 90

UNITS SOLD FOR SKU 8000: 160

Minimum = 8

Maximum = 160

Average = 60.4

Total = 302

Simulating Relational Data with the grep Command

This section shows you how to combine the grep and cut commands in order to keep track of a small database of customers, their purchases, and the details of their purchases that are stored in three text files.

Keep in mind that there are many open source toolkits available that can greatly facilitate working with relational data and non-relational data. Those toolkits can be very robust and also minimize the amount of coding that is required.

Moreover, you can use the join command (discussed in Chapter 2) to perform SQL-like operations on datasets. Nevertheless, the real purpose of this section is to illustrate some techniques with grep that might be useful in your own shell scripts.
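As a quick illustration of the join alternative, the following sketch matches rows from two small demo files (illustrative data, not the actual datasets in this section) on their first field; note that join requires both files to be sorted on that field:

```shell
# Illustrative demo files, already sorted on the key in field 1
cat > /tmp/orders_demo.txt << 'EOF'
C1000 12/15/2012
C2000 12/15/2012
EOF

cat > /tmp/cust_demo.txt << 'EOF'
C1000 John Smith
C2000 Jane Davis
EOF

# join on the first field of both files
join /tmp/orders_demo.txt /tmp/cust_demo.txt
```

This prints lines such as "C1000 12/15/2012 John Smith", which is the SQL-like "inner join" of the two files.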

Listing A.11 displays the contents of the MasterOrders.txt text file.

LISTING A.11: MasterOrders.txt
M10000 C1000 12/15/2012

M11000 C2000 12/15/2012

M12000 C3000 12/15/2012

Listing A.12 displays the contents of the Customers.txt text file.

LISTING A.12: Customers.txt
C1000 John Smith LosAltos California 94002

C2000 Jane Davis MountainView California 94043

C3000 Billy Jones HalfMoonBay California 94040

Listing A.13 displays the contents of the PurchaseOrders.txt text file.

LISTING A.13: PurchaseOrders.txt
C1000,"Radio",54.99,2,"01/22/2013"

C1000,"DVD",15.99,5,"01/25/2013"

C2000,"Laptop",650.00,1,"01/24/2013"

C3000,"CellPhone",150.00,2,"01/28/2013"

Listing A.14 displays the contents of the MasterOrders.sh bash script.

LISTING A.14: MasterOrders.sh
# initialize variables for the three main files

MasterOrders="MasterOrders.txt"

CustomerDetails="Customers.txt"

PurchaseOrders="PurchaseOrders.txt"


# iterate through the "master table"

for mastCustId in `cat $MasterOrders | cut -d" " -f2`

do

  # get the customer information

  custDetails=`grep $mastCustId $CustomerDetails`


  # get the id from the previous line

  custDetailsId=`echo $custDetails | cut -d" " -f1`


  # get the customer PO from the PO file

  custPO=`grep $custDetailsId $PurchaseOrders`


  # print the details of the customer

  echo "Customer $mastCustId:"

  echo "Customer Details: $custDetails"

  echo "Purchase Orders: $custPO"

  echo "----------------------"

  echo

done

Listing A.14 initializes some variables for orders, details, and purchase-related datasets. The next portion of Listing A.14 contains a for loop that iterates through the id values in the MasterOrders.txt file and uses each id to find the corresponding row in the Customers.txt file as well as the corresponding row in the PurchaseOrders.txt file. Finally, the bottom of the loop displays the details of the information that were retrieved from the initial portion of the for loop. The output from Listing A.14 is here:

Customer C1000:

Customer Details: C1000 John Smith LosAltos California 94002

Purchase Orders: C1000,"Radio",54.99,2,"01/22/2013"

C1000,"DVD",15.99,5,"01/25/2013"

----------------------

Customer C2000:

Customer Details: C2000 Jane Davis MountainView California
94043

Purchase Orders: C2000,"Laptop",650.00,1,"01/24/2013"

----------------------

Customer C3000:

Customer Details: C3000 Billy Jones HalfMoonBay California
94040

Purchase Orders: C3000,"CellPhone",150.00,2,"01/28/2013"

----------------------

Checking Updates in a Logfile

Listing A.15 displays the contents of CheckLogUpdates.sh that illustrates how to periodically check the last line in a log file to determine the status of a system. This shell script simulates the status of a system by appending a new row that is based on the current timestamp. The shell script sleeps for a specified number of seconds, and on the third iteration the script appends a row with an error status in order to simulate an error. In the case of a shell script that is monitoring a live system, the error code is obviously generated outside the shell script.

LISTING A.15: CheckLogUpdates.sh
DataFile="mylogfile.txt"

OK="okay"

ERROR="error"

sleeptime="2"

loopcount=0

rm -f $DataFile 2>/dev/null; touch $DataFile

newline="`date` SYSTEM IS OKAY"

echo $newline >> $DataFile

while (true)

do

  loopcount=`expr $loopcount + 1`

  echo "sleeping $sleeptime seconds..."

  sleep $sleeptime

  echo "awake again..."

  lastline=`tail -1 $DataFile`

  if [ "$lastline" == "" ]

  then

    continue

  fi

  okstatus=`echo $lastline |grep -i $OK`

  badstatus=`echo $lastline |grep -i $ERROR`

  if [ "$okstatus" != "" ]

  then

  echo "system is normal"

  if [ $loopcount -lt 5 ]

  then

     newline="`date` SYSTEM IS OKAY"

  else

     newline="`date` SYSTEM ERROR"

  fi

  echo $newline >> $DataFile

 elif [ "$badstatus" != "" ]

 then

   echo "Error in logfile: $lastline"

   break

 fi

done

Listing A.15 initializes some variables and then ensures that the log file mylogfile.txt is empty. After an initial line is added to this log file, a while loop sleeps periodically and then examines the contents of the final line of text in the log file. New text lines are appended to this log file, and when an error message is detected, the code exits the while loop. A sample invocation of Listing A.15 is here:


sleeping 2 seconds...

awake again...

system is normal

sleeping 2 seconds...

awake again...

system is normal

sleeping 2 seconds...

awake again...

system is normal

sleeping 2 seconds...

awake again...

system is normal

sleeping 2 seconds...

awake again...

system is normal

sleeping 2 seconds...

awake again...

Error in logfile: Thu Nov 23 18:22:22 PST 2017 SYSTEM ERROR

The contents of the log file are shown here:


Thu Nov 23 18:22:12 PST 2017 SYSTEM IS OKAY

Thu Nov 23 18:22:14 PST 2017 SYSTEM IS OKAY

Thu Nov 23 18:22:16 PST 2017 SYSTEM IS OKAY

Thu Nov 23 18:22:18 PST 2017 SYSTEM IS OKAY

Thu Nov 23 18:22:20 PST 2017 SYSTEM IS OKAY

Thu Nov 23 18:22:22 PST 2017 SYSTEM ERROR
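The status check in Listing A.15 can also be factored into a small helper that uses a case statement instead of two grep invocations. A sketch (the function name log_status is illustrative):

```shell
# Illustrative helper: classify the last line of a log file
# with a case statement instead of two grep invocations
log_status()
{
  lastline=`tail -1 "$1" 2>/dev/null`
  case "$lastline" in
    *ERROR*) echo "error"   ;;
    *OKAY*)  echo "okay"    ;;
    *)       echo "unknown" ;;
  esac
}
```

For a live system you might combine such a helper with tail -f (which follows a growing file) instead of a sleep loop.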

Examples for Chapter 4

N/A

Examples for Chapter 5

This section of the Appendix contains an assortment of bash scripts that use awk in order to perform various tasks:

1)multiline.sh: convert multi-line records into single-line records

2)sumrows.sh: compute the total of each row in a dataset

3)genetics.sh: an example of the awk 'split' function

4)diagonal.sh: display the main/off-diagonal values and also compute the sum of the main/off-diagonal values

5)rainfall1.sh, rainfall2.sh, and rainfall3.sh: calculate column and row averages from multiple files

6)linear-combo.sh: compute combinations of the columns in multiple datasets

The details of these shell scripts are discussed in the following sections.

Processing Multiline Records

Listing A.16 displays the contents of the dataset multiline.txt and Listing A.17 displays the contents of the shell script multiline.sh that combines multiple lines into a single record.

LISTING A.16: multiline.txt
  Mary Smith

999 Appian Way

Roman Town, SF 94234

        Jane Adams

123 Main Street

Chicago, IL 67840

John Jones

321 Pine Road

Anywhere, MN 94949

Note that each record spans multiple lines that can contain whitespaces, and records are separated by a blank line.

LISTING A.17: multiline.sh
# Records are separated by blank lines

awk '

BEGIN { RS = "" ; FS = "\n" }

{

   gsub(/[ \t]+$/, "", $1)

   gsub(/[ \t]+$/, "", $2)

   gsub(/[ \t]+$/, "", $3)


   gsub(/^[ \t]+/, "", $1)

   gsub(/^[ \t]+/, "", $2)

   gsub(/^[ \t]+/, "", $3)


   print $1 ":" $2 ":" $3 ""

  #printf("%s:%s:%s\n",$1,$2,$3)

}

' multiline.txt

Listing A.17 contains a BEGIN block that sets RS ("record separator") as an empty string and FS ("field separator") as a linefeed. Doing so enables us to "slurp" multiple lines into the same record, using a blank line as a separator for different records. The gsub() function removes leading and trailing whitespaces and tabs for three fields in the datasets. The output from launching Listing A.17 is here:

Mary Smith:999 Appian Way:Roman Town, SF 94234

Jane Adams:123 Main Street:Chicago, IL 67840

John Jones:321 Pine Road:Anywhere, MN 94949
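Since the three gsub() pairs in Listing A.17 differ only in the field number, a loop over NF handles records with any number of fields. The following sketch uses an illustrative demo file with the same layout as multiline.txt:

```shell
# Illustrative demo record (same layout as multiline.txt)
cat > /tmp/multiline_demo.txt << 'EOF'
  Mary Smith
999 Appian Way
Roman Town, SF 94234
EOF

# Trim every field in a loop instead of one gsub() pair per field
awk 'BEGIN { RS = "" ; FS = "\n" }
{
  for (i = 1; i <= NF; i++) {
    gsub(/^[ \t]+|[ \t]+$/, "", $i)
  }
  print $1 ":" $2 ":" $3
}' /tmp/multiline_demo.txt
```

The single gsub() call removes leading and trailing whitespace in one pass by combining both patterns with the alternation operator "|".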

Adding the Contents of Records

Listing A.18 displays the contents of the dataset numbers.txt and Listing A.19 displays the contents of the shell script sumrows.sh that combines and adds the fields in each record.

LISTING A.18: numbers.txt
1 2 3 4 5

6 7 8 9 10

5 5 5 5 5
LISTING A.19: sumrows.sh
awk '{ for(i=1; i<=NF;i++) j+=$i; print j; j=0 }' numbers.txt

Listing A.19 contains a simple invocation of the awk command that contains a for loop that uses the variable j to hold the sum of the values of the fields in each record; after which the sum is printed and j is re-initialized to 0. The output from Listing A.19 is here:

15

40

25
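The complementary task of computing per-column totals (instead of per-row totals) uses an array indexed by field position. A sketch, using an illustrative demo file with the same contents as numbers.txt:

```shell
# Illustrative demo file with the same contents as numbers.txt
cat > /tmp/numbers_demo.txt << 'EOF'
1 2 3 4 5
6 7 8 9 10
5 5 5 5 5
EOF

# Accumulate each field position in an array, then print the totals
awk '{ for (i = 1; i <= NF; i++) col[i] += $i }
END {
  for (i = 1; i <= NF; i++) printf("%s%s", col[i], (i < NF) ? " " : "\n")
}' /tmp/numbers_demo.txt
```

This prints 12 14 16 18 20, i.e., the sum of each of the five columns.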

Using the split Function in awk

Listing A.20 displays the contents of the dataset genetics.txt (some rows wrap across more than one line) and Listing A.21 displays the contents of the shell script genetics.sh that uses the split() function in order to parse the contents of a field in a record.

LISTING A.20: genetics.txt
#extract rows with 'gene' and print rows and 'key' value

xyz3    GTF2GFF chro     55555   44444  key=chr1;Name=chr1

xyz3    GTF2GFF gene     77774   11111
key=XYZ123;NB=standard;Name=extra

xyz3    GTF2GFF exon     71874   12227  Super=NR_55555

xyz3    GTF2GFF exon     72613   12721  Super=NR_55555

xyz3    GTF2GFF exon     83221   14408  Super=NR_55555

xyz3    GTF2GFF gene     84362   29370
key=WASH7P;Note=extra;Name=ALPHA

xyz3    GTF2GFF exon     84362   14829  Super=NR_222222
LISTING A.21: genetics.sh
# required output:

#xyz3:77774:XYZ123

#xyz3:84362:WASH7P

awk -F" " '

{

  if( $3 == "gene" ) {

    split($6, triplet, /[;=]/)

    printf("%s:%s:%s\n", $1, $4, triplet[2] )

  }

}

' genetics.txt

Listing A.21 matches input lines whose third field equals gene, after which the array triplet is populated with the components of the sixth field, using the characters ";" and "=" as delimiters in the sixth field. The output consists of the first field, the fourth field, and the second element in the array triplet. The output from launching Listing A.21 is here:

xyz3:77774:XYZ123

xyz3:84362:WASH7P
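Note that split() also returns the number of pieces it produced, which can be used to validate a field before indexing into the array. A minimal sketch:

```shell
# split() returns the number of array elements it created
awk 'BEGIN {
  n = split("key=XYZ123;NB=standard;Name=extra", parts, /[;=]/)
  print "pieces: " n
  print "second: " parts[2]
}'
```

This prints "pieces: 6" and "second: XYZ123", because the string splits into six components at the ";" and "=" delimiters.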

Scanning Diagonal Elements in Datasets

Listing A.22 displays the contents of the dataset diagonal.csv and Listing A.23 displays the contents of the shell script diagonal.sh that displays the elements in the main diagonal and off-diagonal, and also computes the sum of the elements in the main diagonal and off-diagonal.

LISTING A.22: diagonal.csv
1,1,1,1,1

5,4,3,2,1

8,8,1,8,8

5,4,3,2,1

1,6,6,7,7
LISTING A.23: diagonal.sh
# NF is the number of fields in the current record.

# NR is the number of the current record/line

# (not the number of records in the file).

# In the END block (or the last line of the file)

# it's the number of lines in the file.

# Solution in R: https://gist.github.com/dsparks/3693115


echo "Main diagonal:"

awk -F"," '{ for (i=0; i<=NF; i++) if (NR >= 1 && NR == i)
print $(i) }' diagonal.csv

echo "Off diagonal:"

awk -F"," '{print $(NF+1-NR)}' diagonal.csv

echo "Main diagonal sum:"

awk -F"," '

BEGIN { sum = 0 }

{

   for (i=0; i<=NF; i++) { if (NR >= 1 && NR == i) { sum += $i
}  }

}

END { printf ("sum = %s\n",sum) }

' diagonal.csv

echo "Off diagonal sum:"

awk -F"," '

BEGIN { sum = 0 }

{

  for (i=0; i<=NF; i++) { if(NR >= 1 && i+NR == NF+1) { sum += $i; } }

}

END { printf ("sum = %s\n",sum) }

' diagonal.csv

Listing A.23 starts with an awk command that contains a loop that matches "diagonal" elements of the dataset, which is to say the first field of the first record, the second field of the second record, the third field of the third record, and so forth. This matching process is handled by the conditional logic inside the for loop.

The second part of Listing A.23 contains an awk command that prints the off-diagonal elements of the dataset, using a very simple print statement.

The third part of Listing A.23 contains an awk command that contains the same logic as the first awk command, and then calculates the cumulative sum of the diagonal elements.

The fourth part of Listing A.23 contains an awk command that contains logic that is similar to the first awk command, with the following variation:

if(NR >= 1 && i+NR == NF+1)

The preceding logic enables us to calculate the cumulative sum of the off-diagonal elements. The output from launching Listing A.23 is here:

Main diagonal:

1

4

1

2

7

Off diagonal:

1

2

1

4

1

Main diagonal sum:

sum = 15

Off diagonal sum:

sum = 9
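For square datasets such as diagonal.csv, the loops in Listing A.23 can be avoided entirely: the main-diagonal element of row NR is simply $NR, and the off-diagonal element is $(NF+1-NR). A sketch, using an illustrative demo file:

```shell
# Illustrative demo file with the same contents as diagonal.csv
cat > /tmp/diagonal_demo.csv << 'EOF'
1,1,1,1,1
5,4,3,2,1
8,8,1,8,8
5,4,3,2,1
1,6,6,7,7
EOF

echo "Main diagonal sum:"
awk -F"," '{ sum += $NR } END { print sum }' /tmp/diagonal_demo.csv

echo "Off diagonal sum:"
awk -F"," '{ sum += $(NF+1-NR) } END { print sum }' /tmp/diagonal_demo.csv
```

This prints 15 and 9, matching the sums from Listing A.23; the shortcut assumes the dataset is square (i.e., NR never exceeds NF).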

Listing A.24, Listing A.25, and Listing A.26 display the contents of the datasets rain1.csv, rain2.csv, and rain3.csv that are used in several shell scripts in this section.

LISTING A.24: rain1.csv
1,0.10,53,15

2,0.12,54,16

3,0.19,65,10

4,0.25,86,23

5,0.18,57,17

6,0.23,79,34

7,0.34,66,21
LISTING A.25: rain2.csv
1,0.00,63,24

2,0.02,64,25

3,0.09,75,19

4,0.15,66,28

5,0.08,67,36

6,0.13,79,23

7,0.24,68,25
LISTING A.26: rain3.csv
1,1.00,83,34

2,0.02,84,35

3,1.09,75,19

4,0.15,86,38

5,1.08,87,36

6,0.13,79,33

7,0.24,88,45

Adding Values From Multiple Datasets (1)

Listing A.27 displays the contents of the shell script rainfall1.sh that adds the numbers in the corresponding fields of several CSV files and displays the results.

LISTING A.27: rainfall1.sh
# => Calculate COLUMN averages for multiple files


#columns in rain.csv:

#DOW,inches of rain, degrees F, humidity (%)


#files: rain1.csv, rain2.csv, rain3.csv

echo "FILENAMES:"

ls rain?.csv


awk -F',' '

{

  inches+=$2

  degrees+=$3

  humidity+=$4

}

END {

  printf("FILENAME: %s\n", FILENAME)

  printf("inches: %.2f\n", inches/7)

  printf("degrees: %.2f\n", degrees/7)

  printf("humidity: %.2f\n", humidity/7)

}

' rain?.csv

Listing A.27 calculates the sum of the numbers in three columns (i.e., inches of rainfall, degrees Fahrenheit, and humidity as a percentage) in the datasets specified by the glob pattern rain?.csv, which in this particular example consists of the datasets rain1.csv, rain2.csv, and rain3.csv. Thus, Listing A.27 can handle multiple datasets (rain1.csv through rain9.csv). You can generalize this example to handle any dataset that starts with the string rain and ends with the suffix csv with the following pattern:

rain*.csv

The output from launching Listing A.27 is here:

FILENAMES:

rain1.csv   rain2.csv   rain3.csv

inches:   0.83

degrees:  217.71

humidity: 79.43

Adding Values From Multiple Datasets (2)

Listing A.28 displays the contents of the shell script rainfall12.sh that adds the numbers in the corresponding fields of several CSV files and displays the results.

LISTING A.28: rainfall2.sh
# => Calculate ROW averages for multiple files


#columns in rain.csv:

#DOW,inches of rain, degrees F, humidity (%)

#files: rain1.csv, rain2.csv, rain3.csv


awk -F',' '

{

   mon_rain[FNR]+=$2

   mon_degrees[FNR]+=$3

   mon_humidity[FNR]+=$4

   idx[FNR]++

}

END {

  printf("DAY INCHES DEGREES HUMIDITY\n")


  for(i=1; i<=FNR; i++){

   printf("%3d %-6.2f %-8.2f %-7.2f\n",

    i,mon_rain[i]/idx[i],mon_degrees[i]/idx[i],mon_humidity[i]/
    idx[i])

  }

}

' rain?.csv

Listing A.28 is similar to Listing A.27, except that this code sample uses the value of FNR as an array index in order to calculate the average rainfall, degrees Fahrenheit, and percentage humidity for each day of the week across the three files. The output from launching Listing A.28 is here:

DAY INCHES DEGREES HUMIDITY

  1 0.37   66.33    24.33

  2 0.05   67.33    25.33

  3 0.46   71.67    16.00

  4 0.18   79.33    29.67

  5 0.45   70.33    29.67

  6 0.16   79.00    30.00

  7 0.27   74.00    30.33

Listing A.29, Listing A.30, and Listing A.31 display the contents of the datasets zain1.csv, zain2.csv, and zain3.csv that are used in an upcoming shell script in this section.

LISTING A.29: zain1.csv
1,0.10,53,15

2,0.12,54,16

3,0.19,65,10

4,0.25,86,23

5,0.18,57,17

6,0.23,79,34

7,0.34,66,21
LISTING A.30: zain2.csv
1,0.00,63,24

2,0.02,64,25

3,0.09,75,19

4,0.15,66,28

5,0.08,67,36

6,0.13,79,23

7,0.24,68,25
LISTING A.31: zain3.csv
1,1.00,83,34

2,0.02,84,35

3,1.09,75,19

4,0.15,86,38

5,1.08,87,36

6,0.13,79,33

7,0.24,88,45

Adding Values From Multiple Datasets (3)

Listing A.32 displays the contents of the shell script rainfall3.sh that adds the numbers in the corresponding fields of several CSV files and displays the results.

LISTING A.32: rainfall3.sh
# => Calculate COLUMN averages for multiple files (backtick)


#columns in rain.csv:

#DOW,inches of rain, degrees F, humidity (%)

# specify the list of CSV files (supports multiple glob patterns)

files=`ls rain*csv zain*csv`


echo "FILES: `echo $files`"

awk -F',' '

{

  mon_rain[FNR]+=$2

  mon_degrees[FNR]+=$3

  mon_humidity[FNR]+=$4

  idx[FNR]++

}

END {

  printf("DAY INCHES DEGREES HUMIDITY\n")


  for(i=1; i<=FNR; i++){

    printf("%3d %-6.2f %-8.2f %-7.2f\n",

     i,mon_rain[i]/idx[i],mon_degrees[i]/idx[i],mon_humidity[i]/
     idx[i])

  }

}

' `echo $files`

Listing A.32 performs the same calculations as Listing A.28, with the following variation: the datasets are specified by the variable files, which is defined via the command substitution `ls rain*csv zain*csv`. You can modify this command to include any list of files that need to be processed. Notice that the final line of code in Listing A.32 uses backtick substitution to expand the file list in the variable files:

' `echo $files`

As yet another variation, you can specify a file – let's call it filelist.txt - that contains a list of filenames that you want to process, and then replace the preceding line as follows:

' `cat filelist.txt`

The output from launching Listing A.32 is here:

FILES: rain1.csv rain2.csv rain3.csv zain1.csv zain2.csv
zain3.csv

DAY INCHES DEGREES HUMIDITY

  1 0.37   66.33    24.33

  2 0.05   67.33    25.33

  3 0.46   71.67    16.00

  4 0.18   79.33    29.67

  5 0.45   70.33    29.67

  6 0.16   79.00    30.00

  7 0.27   74.00    30.33

Calculating Combinations of Field Values

Listing A.33 displays the contents of the shell script linear-combo.sh that computes various linear combinations of the columns in multiple datasets and displays one combined dataset as the output.

LISTING A.33: linear-combo.sh
# => combinations of columns

awk -F',' '

{

  $2 += $3 * 2 + $4 / 2

  $3 += $4 / 3 + $2 * $2 / 10

  $4 += $2 + $3

  $1 += $2 * 3 - $4 / 10

  printf("%d,%.2f,%.2f,%.2f\n",$1,$2,$3,$4)

}

' rain?.csv

Listing A.33 processes the values of the datasets rain1.csv, rain2.csv, and rain3.csv, whose contents are shown earlier in this section. The key observation to make is that the sequence of calculations in the body of the awk statement involves interdependencies.

Specifically, the value of $2 is a linear combination of the values of $3 and $4. Next, the value of $3 is a linear combination of the value of $4 and $2, where the latter is not the original value from the datasets, but its calculated value. Third, the value of $4 is a linear combination of $2 and of $3, both of which are calculated values and not the values in the datasets. Finally, the value of $1 is a linear combination of the newly calculated values for $2 and $4.

As you can see, awk provides the flexibility to specify practically any combination of calculations (including non-linear combinations) in a very simple and sequential fashion. The output of Listing A.33 is here:

194,113.60,1348.50,1477.10

196,116.12,1407.72,1539.84

204,135.19,1895.97,2041.16

187,183.75,3470.07,3676.82

202,122.68,1567.70,1707.38

194,175.23,3160.89,3370.12

207,142.84,2113.33,2277.17

201,138.00,1975.40,2137.40

202,140.52,2046.92,2212.44

201,159.59,2628.23,2806.82

203,146.15,2211.32,2385.47

203,152.08,2391.83,2579.91

199,169.63,2964.10,3156.73

206,148.74,2288.69,2462.43

183,184.00,3479.93,3697.93

182,185.52,3537.43,3757.95

200,160.59,2660.25,2839.84

179,191.15,3752.50,3981.65

178,193.08,3826.99,4056.07

195,174.63,3139.56,3347.19

173,198.74,4052.76,4296.50

Summary

In this appendix, you saw examples of how to use some useful and versatile bash commands. First you saw examples of shell scripts for various tasks involving recursion, such as computing the GCD (greatest common divisor) and the LCM (lowest common multiple) of two positive integers, the Fibonacci value of a positive integer, and also the prime divisors of a positive integer.

Next you saw a bash script with the grep command, a while loop, and other constructs that append data to a log file, with logic to determine when to exit the bash script. In addition, you learned how to use the grep command to simulate a simple relational database.

In the final portion of this Appendix you learned how to use awk to process records that span multiple lines, how to compute column sums and averages involving multiple datasets, and how to use awk-related functions such as gsub() and split(). Finally, you learned how to dynamically calculate various combinations of columns of numbers from multiple datasets.