Data Cleaning

CHAPTER 3 FILTERING DATA WITH `grep`

This chapter introduces you to the versatile grep command, whose purpose is to take a stream of text data and reduce it to only the parts that you care about. The grep command is useful not only by itself, but also in conjunction with other commands, especially the find command. This chapter contains many short code samples that illustrate various options of the grep command. Some code samples illustrate how to combine the grep command with commands from previous chapters.

The first part of this chapter introduces the grep command used in isolation, combined with the regular expression metacharacters (from Chapter 1) and also with code snippets that illustrate how to use some of the options of the grep command. Next you will learn how to match ranges of lines, how to use the so-called “back references” in grep, and how to “escape” metacharacters in grep.

The second part of this chapter shows you how to use the grep command in order to find empty lines and common lines in datasets, as well as how to use keys to match rows in datasets. Next you will learn how to use character classes with the grep command, as well as the backslash “\” character, and how to specify multiple matching patterns. Next you will learn how to combine the grep command with the find command and the xargs command, which is useful for matching a pattern in files that reside in different directories. This section also contains some examples of common mistakes that people make with the grep command.

The third section briefly discusses the egrep command and the fgrep command, which are related commands that provide additional functionality that is unavailable in the standard grep utility. The final section contains a use case that illustrates how to use the grep command in order to find matching lines that are then merged in order to create a new dataset.

What Is the `grep` Command?

The grep (“Global Regular Expression Print”) command is useful for finding substrings in one or more files. Several examples are here:

grep abc *sh displays all the lines that contain abc in files with suffix sh

grep -i abc *sh is the same as the preceding query, but case-insensitive

grep -l abc *sh displays the names of all the files with suffix sh that contain abc

grep -n abc *sh displays the matching lines, each prefixed with its line number, for the string abc in files with suffix sh

You can perform logical AND and logical OR operations with this syntax:

grep abc *sh | grep def matches lines containing abc AND def

grep "abc\|def" *sh matches lines containing abc OR def
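As a minimal sketch of the AND and OR forms above, the following snippet builds a small throwaway file (the file name and its contents are made up for illustration) and runs both searches against it. The OR search uses two -e patterns, which gives the same result as the "abc\|def" pattern without depending on \| support in basic regular expressions:

```shell
#!/bin/sh
# Hypothetical sample data, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'abc only' 'def only' 'abc and def' 'neither' > "$tmpdir/sample.sh"

# AND: the second grep filters the output of the first grep.
and_lines=$(grep abc "$tmpdir/sample.sh" | grep def)

# OR: a line matches if it contains either of the two -e patterns.
or_lines=$(grep -e abc -e def "$tmpdir/sample.sh")

echo "AND:"
echo "$and_lines"
echo "OR:"
echo "$or_lines"

rm -rf "$tmpdir"
```

Only the line containing both strings survives the AND pipeline, whereas the OR search returns the three lines that contain at least one of the strings.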

You can combine switches as well: the following command displays the names of the files that contain the string abc (case insensitive):

grep -il abc *sh

In other words, the preceding command lists the files whose contents include abc, Abc, ABc, ABC, abC, and so forth.

Another (less efficient) way to display the lines containing abc (case insensitive) is here:

cat file1 | grep -i abc

The preceding command involves two processes, whereas passing the filename directly to grep (as in grep -i abc file1) involves a single process. The execution time is roughly the same for small text files, but the difference can become significant if you are working with multiple large text files.

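You can combine the sort command, the pipe symbol, and the grep command. For example, the following command displays the files with a “Sep” date in order of increasing size (the -k5 option tells sort to use the fifth column of the ls -l output, which is the file size):

ls -l | grep " Sep " | sort -n -k5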

A sample output from the preceding command is here:

-rw-r--r-- 1 oswaldcampesato2 staff   3 Sep 27 2017 abc.txt
-rw-r--r-- 1 oswaldcampesato2 staff   6 Sep 21 2017 control1.txt
-rw-r--r-- 1 oswaldcampesato2 staff  27 Sep 28 2017 fiblist.txt
-rw-r--r-- 1 oswaldcampesato2 staff  28 Sep 14 2017 dest
-rw-r--r-- 1 oswaldcampesato2 staff  36 Sep 14 2017 source
-rw-r--r-- 1 oswaldcampesato2 staff 195 Sep 28 2017 Divisors.py
-rw-r--r-- 1 oswaldcampesato2 staff 267 Sep 28 2017 Divisors2.py

Metacharacters and the `grep` Command

The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any meta-character with special meaning may be quoted by preceding it with a backslash.

A regular expression may be followed by one of several repetition operators, as shown below.

‘.’ matches any single character.

‘?’ indicates that the preceding item is optional and will be matched at most once: Z? matches the empty string or Z.

‘*’ indicates that the preceding item will be matched zero or more times: Z* matches the empty string, Z, ZZ, ZZZ, and so forth.

‘+’ indicates that the preceding item will be matched one or more times: Z+ matches Z, ZZ, ZZZ, and so forth.

‘{n}’ indicates that the preceding item is matched exactly n times: Z{3} matches ZZZ.

‘{n,}’ indicates that the preceding item is matched n or more times: Z{3,} matches ZZZ, ZZZZ, and so forth.

‘{,m}’ indicates that the preceding item is matched at most m times: Z{,3} matches the empty string, Z, ZZ, and ZZZ.

‘{n,m}’ indicates that the preceding item is matched at least n times, but not more than m times: Z{2,4} matches ZZ, ZZZ, and ZZZZ.

The empty regular expression matches the empty string (i.e., a line in the input stream with no data). Two regular expressions may be joined by the infix operator ‘|’. When used in this manner, the infix operator behaves exactly like a logical “OR” statement, which directs the grep command to return any line that matches either regular expression.
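The repetition operators can be checked with a short experiment. The sketch below is only an illustration: the candidate strings and the count helper are made up for this example, and the -E option is used so that ?, +, and the brace forms can be written without backslashes. The -x option requires the whole line to match, which makes it easy to see exactly which strings each pattern accepts:

```shell
#!/bin/sh
# Six candidate lines: an empty line, then Z repeated one to five times.
candidates=$(printf '%s\n' '' 'Z' 'ZZ' 'ZZZ' 'ZZZZ' 'ZZZZZ')

# count PATTERN prints how many candidate lines match PATTERN in full.
# -E selects extended regular expressions; -x requires a whole-line match.
count() {
    printf '%s\n' "$candidates" | grep -E -x -c -e "$1"
}

optional=$(count 'Z?')           # the empty line and Z
star=$(count 'Z*')               # every line, including the empty one
plus=$(count 'Z+')               # every line except the empty one
exactly3=$(count 'Z{3}')         # ZZZ only
three_or_more=$(count 'Z{3,}')   # ZZZ, ZZZZ, ZZZZZ
two_to_four=$(count 'Z{2,4}')    # ZZ, ZZZ, ZZZZ

echo "Z?: $optional  Z*: $star  Z+: $plus"
echo "Z{3}: $exactly3  Z{3,}: $three_or_more  Z{2,4}: $two_to_four"
```

Note that Z? and Z* both accept the empty line, whereas Z+ does not.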

Escaping Metacharacters with the `grep` Command

Listing 3.1 displays the contents of lines.txt, which contains lines with words and metacharacters.

LISTING 3.1 lines.txt

abcd
ab
abc
cd
defg
.*.
..

The following grep command lists the lines of length two in lines.txt (the ^ “begins with” and $ “ends with” anchors restrict the length):

grep '^..$' lines.txt

The result is shown here:

ab
cd
..

The following command lists the lines of length two in lines.txt that consist of two dots (the backslash tells grep to interpret the dots as actual dots, not as metacharacters):

grep '^\.\.$' lines.txt

The result is shown here:

..

The following command displays the lines that consist entirely of one or more dots (the * is used as a metacharacter because it is not preceded by a backslash, so it means zero or more occurrences of the escaped dot that precedes it):

grep '^\.*\.$' lines.txt

The following command lists the lines that consist of a period, followed by an asterisk, and then another period (the * is now a literal character that must be matched because it is preceded by a backslash):

grep '^\.\*\.$' lines.txt
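The following sketch recreates lines.txt from Listing 3.1 in a temporary directory and runs the escaped and unescaped patterns side by side. The last command is an addition that is not part of the examples above: it uses the -F option (treat the pattern as a fixed string) together with -x (match the whole line) as an alternative to escaping each metacharacter by hand:

```shell
#!/bin/sh
# Recreate lines.txt (Listing 3.1) in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'abcd' 'ab' 'abc' 'cd' 'defg' '.*.' '..' > "$tmpdir/lines.txt"

# Unescaped dots match any character: every line of length two.
any_two=$(grep '^..$' "$tmpdir/lines.txt")

# Escaped dots match only literal dots.
two_dots=$(grep '^\.\.$' "$tmpdir/lines.txt")

# Escaped dot, escaped asterisk, escaped dot: the literal text .*.
dot_star_dot=$(grep '^\.\*\.$' "$tmpdir/lines.txt")

# Alternative: -F disables metacharacters, -x requires a whole-line match.
fixed_string=$(grep -F -x '.*.' "$tmpdir/lines.txt")

echo "length two : $(echo "$any_two" | tr '\n' ' ')"
echo "two dots   : $two_dots"
echo "dot-star   : $dot_star_dot"
echo "with -F -x : $fixed_string"

rm -rf "$tmpdir"
```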

Useful Options for the `grep` Command

There are many types of pattern matching possibilities with the grep command, and this section contains an eclectic mix of such commands that handle common scenarios.

In the following examples we have four text files (two .sh and two .txt) and two Word documents in a directory. The string abc is found on one line in abc1.txt and three lines in abc3.sh. The string ABC is found on two lines in ABC2.txt and four lines in ABC4.sh. Notice that abc is not found in ABC files, and ABC is not found in abc files.

ls *
ABC.doc  ABC4.sh        abc1.txt        ABC2.txt   abc.doc
abc3.sh

The following code snippet searches for occurrences of the string abc in all the files in the current directory that have sh as a suffix:

grep abc *sh
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle

The “-c” option counts the number of matching lines in each file (note that even though ABC4.sh has no matches, it is still listed with a count of zero):

grep -c abc *sh

The output of the preceding command is here:

ABC4.sh:0
abc3.sh:3

The “-e” option lets you match patterns that would otherwise cause syntax problems (a pattern that starts with the “-” character would normally be interpreted as an option for grep):

grep -e "-abc" *sh
abc3.sh:ends with -abc

The “-e” option also lets you match multiple patterns.

grep -e "-abc" -e "comment" *sh
ABC4.sh:# ABC in a comment
abc3.sh:ends with -abc

The “-i” option is to perform a case insensitive match:

grep -i abc *sh
ABC4.sh:ABC at start
ABC4.sh:ends with ABC
ABC4.sh:the ABC is in the middle
ABC4.sh:# ABC in a comment
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle

The “-v” option “inverts” the match, which means that the output consists of the lines that do not contain the specified string (the lines containing ABC are part of the output because -i is not used, and ABC4.sh also has an entirely empty line):

grep -v abc *sh

Use the “-iv” options to display the lines that do not contain a specified string using a case insensitive match:

grep -iv abc *sh
ABC4.sh:
abc3.sh:this line won't match

The “-l” option is to list only the filenames that contain a successful match (note this matches contents of files, not the filenames). The Word document matches because the actual text is still visible to grep, it is just surrounded by proprietary formatting gibberish. You can do similar things with other formats that contain text, such as XML, HTML, .csv, and so forth:

grep -l abc *
abc1.txt
abc3.sh
abc.doc

You can restrict the “-l” search to a subset of the files, such as the files with the sh suffix:

grep -l abc *sh

Use the “-il” options to display the filenames that contain a specified string using a case insensitive match:

grep -il abc *doc

The preceding command is very useful when you want to check for the occurrence of a string in Word documents.

The “-n” option displays the line number of each matching line:

grep -n abc *sh
abc3.sh:1:abc at start
abc3.sh:2:ends with -abc
abc3.sh:3:the abc is in the middle

The “-h” option suppresses the display of the filename for a successful match:

grep -h abc *sh
abc at start
ends with -abc
the abc is in the middle

For the next series of examples, we will use columns4.txt as shown in Listing 3.2.

LISTING 3.2 columns4.txt

123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
one two three
    four five

The “-o” option shows only the matched string (this is how you avoid returning the entire line that matches):

grep -o one columns4.txt

The “-o” option combined with the “-b” option shows the position of each matched string (it reports the byte offset from the start of the file, counting from 0, not the line number; here the “o” in “one” is at offset 59):

grep -o -b one columns4.txt
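The -o option also gives you a way to count individual occurrences of a string, as opposed to the matching lines that -c counts. The sketch below illustrates the difference with a small made-up file (demo.txt and its contents are hypothetical):

```shell
#!/bin/sh
# Hypothetical sample file: the first line contains abc three times.
tmpdir=$(mktemp -d)
printf '%s\n' 'abc abc abc' 'no match here' 'one abc' > "$tmpdir/demo.txt"

# -c counts the lines that contain at least one match.
matching_lines=$(grep -c abc "$tmpdir/demo.txt")

# -o prints each match on its own line, so wc -l counts occurrences.
occurrences=$(grep -o abc "$tmpdir/demo.txt" | wc -l | tr -d ' ')

echo "matching lines: $matching_lines"
echo "occurrences   : $occurrences"

rm -rf "$tmpdir"
```

Two lines contain abc, but the string occurs four times in total.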

You can specify a recursive search as shown here (output not shown because it will be different on every system or account; this searches not only every file in the directory /etc, but also every file in every subdirectory of /etc):

grep -r abc /etc

The preceding commands match lines where the specified string is a substring of a longer string in the file. For instance, the preceding commands will match occurrences of abc as well as abcd, dabc, abcde, and so forth.

grep ABC *txt
ABC2.txt:ABC at start or ABC in middle or end in ABC
ABC2.txt:ABCD DABC

If you want to match ABC only when it appears as a whole word, you can use the -w option, as shown here:

grep -w ABC *txt
ABC2.txt:ABC at start or ABC in middle or end in ABC

The --color switch displays the matching string in color:

grep --color abc *sh
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle

You can use the pair of metacharacters .* to find the occurrences of two words that are separated by an arbitrary number of intermediate characters.

The following command finds all lines that contain the strings one and three with any number of intermediate characters:

grep "one.*three" columns4.txt
one two three

You can “invert” the preceding result by using the -v switch, as shown here:

grep -v "one.*three" columns4.txt
123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
four five

The following command finds all lines that contain the strings one and three with any number of intermediate characters, where the match involves a case-insensitive comparison:

grep -i "one.*three" columns4.txt
ONE TWO THREE FOUR
one two three

You can “invert” the preceding result by using the -v switch, as shown here:

grep -iv "one.*three" columns4.txt
123 ONE TWO
456 three four
five 123 six
four five

Sometimes you need to search a file for the presence of either of two strings. For example, the following command finds the files that contain “start” or “end”:

grep -l 'start\|end' *
ABC2.txt
ABC4.sh
abc3.sh

Later in the chapter you will see how to find files that contain a pair of strings via the grep and xargs commands.

Character Classes and the `grep` Command

This section contains some simple one-line commands that combine the grep command with character classes. Note that a class name such as [:alpha:] must itself appear inside a bracket expression, which is why the patterns contain double brackets.

echo "abc" | grep '[[:alpha:]]'
abc
echo "123" | grep '[[:alpha:]]'
(returns nothing, no match)
echo "abc123" | grep '[[:alpha:]]'
abc123
echo "abc" | grep '[[:alnum:]]'
abc
echo "123" | grep '[[:alnum:]]'
123
echo "abc123" | grep '[[:alnum:]]'
abc123
echo "abc" | grep '[0-9]'
(returns nothing, no match)
echo "123" | grep '[0-9]'
123
echo "abc123" | grep '[0-9]'
abc123
echo "abc123" | grep -w '[0-9]'
(returns nothing, no match)

Working with the -c Option in `grep`

Consider a scenario in which a directory (such as a log directory) has files created by an outside program. Your task is to write a shell script that determines which (if any) of the files contain two occurrences of a string, after which additional processing is performed on the matching files (e.g., use email to send log files containing two or more error messages to a system administrator for investigation).

One solution involves the -c option for grep, followed by additional invocations of the grep command.

The command snippets in this section assume the following data files whose contents are shown below.

The file hello1.txt contains the following:

hello world1

The file hello2.txt contains the following:

hello world2
hello world2 second time

The file hello3.txt contains the following:

hello world3
hello world3 two
hello world3 three

Now launch the following commands: (2>/dev/null keeps warnings and errors caused by empty directories from cluttering up the output):

grep -c hello hello*txt 2>/dev/null
hello1.txt:1
hello2.txt:2
hello3.txt:3
grep -l hello hello*txt 2>/dev/null
hello1.txt
hello2.txt
hello3.txt
grep -c hello hello*txt 2>/dev/null |grep ":2$"
hello2.txt:2

Note how we use the “ends with” "$" metacharacter to grab just the files that have exactly two matches. We also use the colon ":2$" rather than just "2$" to prevent grabbing files that have 12, 32, or 142 matches (which would end in :12, :32, and :142).

What if we wanted to show “two or more” (as in the “2 or more errors in a log”)? You would instead use the invert (-v) option to exclude counts of exactly 0 or exactly 1.

grep -c hello hello*txt 2>/dev/null |grep -v ':[0-1]$'
hello2.txt:2
hello3.txt:3

In a real-world application, you would want to strip off everything after the colon to return only the filenames. There are many ways to do so, but we’ll use the cut command we learned in Chapter 1, which involves defining : as a delimiter with -d":" and using -f1 to return the first column (i.e., the part before the colon in the return text):

grep -c hello hello*txt 2>/dev/null | grep -v ':[0-1]$' | cut -d":" -f1
hello2.txt
hello3.txt
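Inside a shell script, another option is to compare the count numerically instead of filtering the counts with a second regular expression. The following is a minimal sketch under that assumption: it recreates the three hello files in a temporary directory, and the threshold of 2 and the variable names are chosen only for illustration:

```shell
#!/bin/sh
# Recreate hello1.txt, hello2.txt, and hello3.txt in a temporary directory.
tmpdir=$(mktemp -d)
printf 'hello world1\n' > "$tmpdir/hello1.txt"
printf 'hello world2\nhello world2 second time\n' > "$tmpdir/hello2.txt"
printf 'hello world3\nhello world3 two\nhello world3 three\n' > "$tmpdir/hello3.txt"

threshold=2   # illustrative value: flag files with two or more matching lines
flagged=""

for f in "$tmpdir"/hello*.txt
do
    # grep -c prints the number of matching lines in the file.
    count=$(grep -c hello "$f")
    if [ "$count" -ge "$threshold" ]
    then
        name=$(basename "$f")
        flagged="$flagged $name"
        echo "$name has $count matching lines"
        # additional processing of the file would go here
    fi
done

rm -rf "$tmpdir"
```

The numeric test also handles counts such as 10 or 21, which end in 0 or 1 and therefore would be dropped by the ':[0-1]$' filter.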

Matching a Range of Lines

In Chapter 1 you saw how to use the head and tail commands to display a range of lines in a text file. Now suppose that you want to search a range of lines for a string. For instance, the following command displays lines 7 through 15 of longfile.txt:

cat -n longfile.txt |head -15|tail -9

The output is here:

     7       and each line
     8       contains
     9       one or
    10       more words
    11       and if you
    12       use the cat
    13       command the
    14       file contents
    15       scroll

This command displays the subset of lines 7 through 15 of longfile.txt that contain the string and:

cat -n longfile.txt |head -15|tail -9 | grep and

The output is here:

     7      and each line
    11      and if you
    13      command the

This command includes a whitespace after the word and, thereby excluding the line with the word “command”:

cat -n longfile.txt |head -15|tail -9 | grep "and "

The output is here:

     7      and each line
    11      and if you

Note that the preceding command excludes lines that end in “and” because they do not have the whitespace after “and” at the end of the line. You could remedy this situation with an “OR” operator including both cases:

cat -n longfile.txt |head -15|tail -9 | grep " and\|and "
     7      and each line
    11      and if you
    13      command the

However, the preceding command allows “command” back into the mix. Hence, if you really want to match a specific word, it’s best to use the -w option, which is smart enough to handle the variations:

cat -n longfile.txt |head -15|tail -9 | grep -w "and"
     7      and each line
    11      and if you

The use of whitespace is safer if you are looking for something at the beginning or end of a line. This is a common approach when reading contents of log files or other structured text where the first word is often important (a tag like ERROR or Warning, a numeric code or a date). This command displays the lines that start with the word and:

cat longfile.txt |head -15|tail -9 | grep "^and "

The output is here (without the line number because we are not using “cat -n”):

and each line
and if you

Recall that the “use the file name(s) in the command, instead of using cat to display the file first” style is more efficient:

head -15 longfile.txt |tail -9 | grep "^and "
and each line
and if you

However, the head command does not display the line numbers of a text file, so the “cat first” (cat -n adds line numbers) style is used in the earlier examples when you want to see the line numbers, even though this style is less efficient. Basically, you only want to add an extra command to a pipe if it is adding value, otherwise it’s better to start with a direct call to the files you are trying to process with the first command in the pipe, assuming the command syntax is capable of reading in filenames.
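The following sketch puts the pieces together on a generated file. The contents of longfile.txt here are made up (the real file is not reproduced in this chapter), and the sed -n '7,15p' variant is an addition that is not used elsewhere in this section: it prints lines 7 through 15 directly, which avoids the head and tail arithmetic:

```shell
#!/bin/sh
# Generate a hypothetical 20-line file in a temporary directory.
# Lines 3, 7, 11, 15, and 19 start with the word "and"; the other
# lines contain "and" only as part of the word "command".
tmpdir=$(mktemp -d)
i=1
while [ "$i" -le 20 ]
do
    if [ $((i % 4)) -eq 3 ]
    then
        echo "and line $i"
    else
        echo "command line $i"
    fi
    i=$((i + 1))
done > "$tmpdir/longfile.txt"

# Lines 7 through 15: the first 15 lines, then the last 9 of those.
via_head_tail=$(head -n 15 "$tmpdir/longfile.txt" | tail -n 9 | grep -w and)

# The same range selected with sed.
via_sed=$(sed -n '7,15p' "$tmpdir/longfile.txt" | grep -w and)

echo "$via_head_tail"

rm -rf "$tmpdir"
```

Both pipelines return the same three lines, and the -w option keeps the “command” lines out of the result.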

Using Back References in the `grep` Command

The grep command allows you to reference a set of characters that match a regular expression placed inside a pair of parentheses. For grep to parse the parentheses correctly, each has to be preceded with the escape character “\”.

For example, grep 'a\(.\)' uses the “.” regular expression to match “ab” or “a3” but not “3a” or “ba”.

The back reference ‘\n’, where n is a single digit, matches the substring previously matched by the nth parenthesized sub-expression of the regular expression. For example, grep '\(a\)\1' matches “aa” and grep '\(a\)\1\1' matches “aaa”.

When used with alternation, if the group does not participate in the match, then the back reference makes the whole match fail. For example, grep 'a\(.\)\|b\1' will not match “ba” or “bb”, because the group takes no part in the second alternative (a line containing “ab” still matches, but only through the first alternative).

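If you have more than one regular expression inside pairs of parentheses, they are referenced (from left to right) by \1, \2, . . ., \9. Keep in mind that a back reference matches the same text that its group matched, not merely the same pattern:

grep -e '\([a-z]\)\([0-9]\)\1' matches a letter, a digit, and then the same letter again (a1a but not a1b), so it is stricter than this command:
grep -e '\([a-z]\)\([0-9]\)\([a-z]\)'
grep -e '\([a-z]\)\([0-9]\)\2' matches a letter followed by the same digit twice (a11 but not a12), so it is stricter than this command:
grep -e '\([a-z]\)\([0-9]\)\([0-9]\)'

The easiest way to think of it is that the number (for example, \2) is a placeholder or variable that holds whatever text the corresponding group matched, which saves you from spelling out that text again. As regular expressions can get extremely complex, this often helps code clarity.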
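One detail is worth checking for yourself: a back reference matches the same text that its group captured, whereas repeating the sub-expression accepts any text that fits the pattern. The sketch below compares the two forms on the strings a1a and a1b (the matches helper function is defined here only for this illustration):

```shell
#!/bin/sh
# matches STRING PATTERN prints "yes" if STRING matches the basic
# regular expression PATTERN, and "no" otherwise.
matches() {
    if printf '%s\n' "$1" | grep -q -e "$2"
    then
        echo yes
    else
        echo no
    fi
}

with_backref='\([a-z]\)\([0-9]\)\1'         # letter, digit, same letter again
with_repeat='\([a-z]\)\([0-9]\)\([a-z]\)'   # letter, digit, any letter

backref_a1a=$(matches 'a1a' "$with_backref")
backref_a1b=$(matches 'a1b' "$with_backref")
repeat_a1a=$(matches 'a1a' "$with_repeat")
repeat_a1b=$(matches 'a1b' "$with_repeat")

echo "back reference: a1a=$backref_a1a a1b=$backref_a1b"
echo "repeated group: a1a=$repeat_a1a a1b=$repeat_a1b"
```

Only the repeated sub-expression accepts a1b, because the third character differs from the letter captured by the first group.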

You can match consecutive digits or characters using the pattern \([0-9]\)\1. For example, the following command is a successful match because the string “1223” contains a pair of consecutive identical digits:

echo "1223" | grep -e '\([0-9]\)\1'

Similarly, the following command is a successful match because the string “12223” contains three consecutive occurrences of the digit 2:

echo "12223" | grep -e '\([0-9]\)\1\1'

You can check for the occurrence of two identical digits separated by any character with this expression:

echo "12z23" | grep -e '\([0-9]\).\1'

In an analogous manner, you can test for the occurrence of duplicate letters, as shown here:

echo "abbc" | grep -e '\([a-z]\)\1'

The following example matches an IP address, and does not use back references, just the “[0-9]” and “\.” regular expressions to match digits and periods (the Perl-style “\d” shorthand for a digit is not part of the basic regular expressions of grep; GNU grep accepts it only with the -P option):

echo "192.168.125.103" | grep -e '\([0-9][0-9][0-9]\)\.\([0-9][0-9][0-9]\)\.\([0-9][0-9][0-9]\)\.\([0-9][0-9][0-9]\)'

If you want to allow for fewer than three digits, you can use the expression \{1,3\}, which matches 1, 2, or 3 digits in the third block. In a situation where any of the four blocks might have fewer than three digits, you must use the same syntax in all four blocks:

echo "192.168.5.103" | grep -e '\([0-9][0-9][0-9]\)\.\([0-9][0-9][0-9]\)\.\([0-9]\{1,3\}\)\.\([0-9][0-9][0-9]\)'

You can perform more complex matches using back references. Listing 3.3 displays the contents of columns5.txt, which contains several lines that are palindromes (the same spelling from left-to-right as right-to-left). Note that the third line is an empty line.

LISTING 3.3 columns5.txt

one eno
ONE ENO

ONE TWO OWT ENO
four five

The following command finds all lines that are palindromes:

grep -w -e '\(.\)\(.\).*\2\1' columns5.txt

The output of the preceding command is here:

one eno
ONE ENO
ONE TWO OWT ENO

The idea is as follows: the first \(.\) captures a single character, followed by a second \(.\) that captures the next character, followed by any number of intermediate characters. The sequence \2\1 then requires the two captured characters to appear again in reverse order. Strictly speaking, this pattern checks only the outermost two characters of the match, which is sufficient to select the palindromes in this file.

Finding Empty Lines in Datasets

Recall that the metacharacter “^” refers to the beginning of a line and the metacharacter “$” refers to the end of a line. Thus, an empty line consists of the sequence ^$. You can find the single empty line in columns5.txt with this command:

grep -n "^$" columns5.txt

The output of the preceding grep command is here (use the -n switch to display line numbers, as blank lines will not otherwise show in the output):

3:

More commonly the goal is to simply strip the empty lines from a file. We can do that just by inverting the prior query (and not showing the line numbers):

grep -v "^$" columns5.txt
one eno
ONE ENO
ONE TWO OWT ENO
four five

As you can see, the preceding output displays four non-empty lines, and as we saw in the previous grep command, line #3 is an empty line.
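In practice, lines that look empty sometimes contain spaces or tabs, and the pattern ^$ does not match them. The following sketch (the file blank_demo.txt and its contents are made up for illustration) uses the character class [[:space:]] to treat whitespace-only lines as blank as well:

```shell
#!/bin/sh
# Hypothetical file: line 2 is empty and line 3 contains only spaces.
tmpdir=$(mktemp -d)
printf 'one\n\n   \ntwo\n' > "$tmpdir/blank_demo.txt"

# ^$ matches only the truly empty line.
empty_only=$(grep -c '^$' "$tmpdir/blank_demo.txt")

# ^[[:space:]]*$ also matches lines that contain only whitespace.
empty_or_blank=$(grep -c '^[[:space:]]*$' "$tmpdir/blank_demo.txt")

# Inverting the second pattern strips both kinds of lines.
cleaned=$(grep -v '^[[:space:]]*$' "$tmpdir/blank_demo.txt")

echo "empty lines         : $empty_only"
echo "empty or blank lines: $empty_or_blank"
echo "$cleaned"

rm -rf "$tmpdir"
```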

Using Keys to Search Datasets

Data is often organized around unique values (typically numbers) in order to distinguish otherwise similar things: for example, John Smith the manager must not be confused with John Smith the programmer in an employee dataset. Hence, each record is assigned a unique number that will be used for all queries related to employees. Moreover, their names are merely data elements of a given record, rather than a means of identifying a record that contains a particular person.

With the preceding points in mind, suppose that you have a text file in which each line contains a single key value. In addition, another text file consists of one or more lines, where each line contains a key value followed by a quantity value.

As an illustration, Listing 3.4 displays the contents of skuvalues.txt and Listing 3.5 displays the contents of skusold.txt. Note that an SKU is a term often used to refer to an individual product configuration, including its packaging, labeling, and so forth.

LISTING 3.4 skuvalues.txt

LISTING 3.5 skusold.txt

The Backslash Character and the `grep` Command

The “\” character has a special interpretation when it’s followed by the following characters:

“\b” = Match the empty string at the edge of a word.

“\B” = Match the empty string provided it’s not at the edge of a word, so:
“\brat\b” matches the separate word “rat” but not “crate,” and
“\Brat\B” matches “crate” but not “furry rat.”

“\<” = Match the empty string at the beginning of a word.

“\>” = Match the empty string at the end of a word.

“\w” = Match word constituent, it is a synonym for “[_[:alnum:]].”

“\W” = Match non-word constituent, it is a synonym for “[^_[:alnum:]].”

“\s” = Match whitespace, it is a synonym for “[[:space:]].”

“\S” = Match non-whitespace, it is a synonym for “[^[:space:]].”
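The word-boundary sequences can be tried out with a few one-line tests. The sketch below assumes GNU grep, because these backslash sequences are extensions rather than part of the POSIX basic regular expressions, and the matches helper function is defined here only for this illustration:

```shell
#!/bin/sh
# matches STRING PATTERN prints "yes" if STRING matches PATTERN, else "no".
# Assumption: GNU grep, which supports \b, \B, \< and \> as extensions.
matches() {
    if printf '%s\n' "$1" | grep -q -e "$2"
    then
        echo yes
    else
        echo no
    fi
}

word_in_phrase=$(matches 'furry rat' '\brat\b')   # rat is a separate word
word_in_crate=$(matches 'crate' '\brat\b')        # rat is inside a word
inner_in_crate=$(matches 'crate' '\Brat\B')       # no word edge on either side
inner_in_phrase=$(matches 'furry rat' '\Brat\B')  # rat sits at word edges
angle_form=$(matches 'rat race' '\<rat\>')        # same idea with \< and \>

echo "\\brat\\b : furry rat=$word_in_phrase crate=$word_in_crate"
echo "\\Brat\\B : crate=$inner_in_crate furry rat=$inner_in_phrase"
echo "\\<rat\\> : rat race=$angle_form"
```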

Multiple Matches in the `grep` Command

In an earlier example you saw how to use the -i option to perform a case insensitive match. However, you can also use the pipe “|” symbol to specify more than one sequence of regular expressions.

For example, the following grep expression matches any line that contains “one” as well as any line that contains “ONE TWO”:

grep "one\|ONE TWO" columns5.txt

The output of the preceding grep command is here:

one eno
ONE TWO OWT ENO

Although the preceding grep command specifies a pair of character strings, you can specify an arbitrary number of character sequences or regular expressions, as long as you put "\|" between each thing you want to match.

The `grep` Command and the `xargs` Command

The xargs command is often used in conjunction with the find command in bash. For example, you can search for the files under the current directory (including subdirectories) that have the sh suffix and then check which one of those files contains the string abc, as shown here:

find . -print | grep "sh$" | xargs grep -l abc

A more useful combination of the find and xargs commands is shown here:

find . -mtime -7 -name "*.sh" -print | xargs grep -l abc

The preceding command searches for all the files (including subdirectories) with suffix “sh” that have been modified within the last seven days, and pipes that list to the xargs command, which displays the names of the files that contain the string abc (case sensitive, because the -i option is not specified).

The find command supports many options, which can be combined via AND as well as OR in order to create very complex expressions.

Note that grep -R hello . also performs a search for the string hello in all files, including subdirectories, and follows the “one process” recommendation. On the other hand, the find . -print command searches for all files in all subdirectories, and you can pipe the output to xargs grep hello in order to find the occurrences of the word hello in all files (which involves two processes instead of one process).
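Keep in mind that xargs splits its input on whitespace by default, so a filename that contains a space is passed to grep as two separate names. A common remedy, sketched below, is to pair the -print0 option of find with the -0 option of xargs. Both options are extensions rather than part of POSIX, although the GNU and BSD versions support them, and the directory layout in this example is made up for illustration:

```shell
#!/bin/sh
# Hypothetical directory tree with a space in a directory and a filename.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/sub dir"
printf 'abc here\n' > "$tmpdir/sub dir/with space.sh"
printf 'nothing\n' > "$tmpdir/plain.sh"
printf 'abc but not a script\n' > "$tmpdir/notes.txt"

# -print0 ends each filename with a NUL byte and -0 tells xargs to split
# on NUL bytes, so spaces inside filenames are preserved.
found=$(find "$tmpdir" -type f -name '*.sh' -print0 | xargs -0 grep -l abc)

echo "$found"

rm -rf "$tmpdir"
```

Only the script whose contents include abc is reported, and its path is passed to grep intact.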

You can use the output of the preceding code snippet in order to copy the matching files to another directory, as shown here:

cp `find . -print | grep "sh$" | xargs grep -l abc` /tmp

Alternatively, you can copy the matching files in the current directory (without matching files in any subdirectories) to another directory with the grep command:

cp `grep -l abc *sh` /tmp

Yet another approach is to use back ticks in a loop so that you can perform additional processing on each file:

for file in `find . -print`
do
    echo "Processing the file: $file"
    # now do something here
done

Keep in mind that if you pass too many filenames to the xargs command you will see a “too many files” error message. In this situation, try to insert additional grep commands prior to the xargs command in order to reduce the number of files that are piped into the xargs command.

If you work with NodeJS, you know that the node_modules directory contains a large number of files. In most cases, you probably want to exclude the files in that directory when you are searching for a string, and the “-v” option is ideal for this situation. The following command excludes the files in the node_modules directory while searching for the names of the HTML files that contain the string src and redirecting the list of file names to the file src_list.txt (and also redirecting error messages to /dev/null):

find . -print | grep -v node | xargs grep -il src >src_list.txt 2>/dev/null

You can extend the preceding command to search for the HTML files that contain the string src and the string angular with the following command:

find . -print | grep -v node | xargs grep -il src | xargs grep -il angular >angular_list.txt 2>/dev/null

You can use the following combination of grep and xargs to find the files that contain both xml and def:

grep -l xml *svg |xargs grep -l def

A variation of the preceding command redirects error messages to /dev/null, as shown here:

grep -l hello *txt 2>/dev/null | xargs grep -c hello

Searching Zip Files for a String

There are at least three ways to search for a string in one or more zip files. As an example, suppose that you want to determine which zip files contain SVG documents.

The first way is shown here:

for f in `ls *zip`
do
   echo "Searching $f"
   jar tvf $f |grep "svg$"
done

When there are many zip files in a directory, the output of the preceding loop can be very verbose, in which case you need to scroll backward and probably copy/paste the names of the files that actually contain SVG documents into a separate file. A better solution is to put the preceding loop in a shell script and redirect its output. For instance, create the file findsvg.sh whose contents are the preceding loop, and then invoke this command:

./findsvg.sh 1>1 2>2

Notice that the preceding command redirects error messages (2>) to the file 2 and the results of the jar/grep command (1>) to the file 1. See the Appendix for another example of searching zip files for SVG documents.

Checking for a Unique Key Value

Sometimes you need to check for the existence of a string (such as a key) in a text file, and then perform additional processing based on its existence. However, do not assume that the existence of a string means that that string only occurs once. As a simple example, suppose the file mykeys.txt has the following content:

Suppose that you search for the string 2000, which you can do with findkey.sh, whose contents are displayed in Listing 3.6.

LISTING 3.6 findkey.sh

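key="2000"
if [ "`grep $key mykeys.txt`" != "" ]
then
 foundkey=true
else
 foundkey=false
fi
# count the lines that contain the key
linecount=`grep -c $key mykeys.txt`
echo "current key = $key"
echo "found key   = $foundkey"
echo "linecount   = $linecount"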

Listing 3.6 contains if/else conditional logic to determine whether or not the file mykeys.txt contains the value of $key (which is initialized as 2000). Launch the code in Listing 3.6 and you will see the following output:

current key = 2000
found key   = true
linecount   = 2

While the key value of 2000 does exist in mykeys.txt, you can see that it matches two lines in mykeys.txt. However, if mykeys.txt were part of a file with 100,000 (or more) lines, it’s not obvious that the value of 2000 matches more than one line. In this dataset, 2000 and 22000 both match, and you can prevent the extra matching line with this code snippet:

grep -w $key mykeys.txt

Thus, in files that have duplicate lines, you can count the number of lines that match the key by adding the -c option to the preceding code snippet. Another way to do so involves piping the output to wc -l, which displays the line count.
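The difference between a substring match and a whole-word match is easy to see with counts. The contents of mykeys.txt are not reproduced above, so the sketch below uses made-up contents (one key per line, including both 2000 and 22000) purely for illustration:

```shell
#!/bin/sh
# Hypothetical contents for mykeys.txt: one key per line.
tmpdir=$(mktemp -d)
printf '%s\n' '1000' '2000' '22000' '3000' > "$tmpdir/mykeys.txt"

key="2000"

# Substring match: 2000 is also found inside 22000.
substring_count=$(grep -c "$key" "$tmpdir/mykeys.txt")

# Whole-word match: only the key 2000 itself.
word_count=$(grep -c -w "$key" "$tmpdir/mykeys.txt")

# Whole-line match: useful when each line holds nothing but the key.
line_count=$(grep -c -x "$key" "$tmpdir/mykeys.txt")

echo "substring matches : $substring_count"
echo "whole-word matches: $word_count"
echo "whole-line matches: $line_count"

rm -rf "$tmpdir"
```

A count greater than 1 from the -w or -x search would indicate that the key really is duplicated in the file.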

Redirecting Error Messages

Another scenario involves the use of the xargs command with the grep command, which can result in “no such . . .” error messages:

find . -print | xargs grep -il abc

Make sure to redirect errors using the following variant:

find . -print | xargs grep -il abc 2>/dev/null

The `egrep` Command and the `fgrep` Command

The egrep command is an Extended grep that supports added grep features like “+” (1 or more occurrences of the previous character), “?” (0 or 1 occurrence of the previous character), and “|” (alternate matching). The egrep command is almost identical to grep -E, along with some caveats that are described here:

https://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html.

One advantage of using the egrep command is that it’s easier to understand the regular expressions than the corresponding expressions in grep (when it’s combined with back references).

The egrep (“extended grep”) command supports extended regular expressions, as well as the pipe “|” in order to specify multiple words in a search pattern. A match is successful if any of the words in the search pattern appear, so you can think of the search pattern as an “any” match. Thus, the pattern “abc|def” matches lines that contain either abc or def (or both).

For example, the following code snippet searches for the word abc or the word def (the -w option restricts matches to whole words) in all files whose names end in sh:

egrep -w 'abc|def' *sh

The preceding egrep command is an “or” operation: a line matches if it contains either abc or def.
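
As a hedged illustration of this “or” behavior, the following sketch runs the same search against a hypothetical sample file. It uses grep -E, which accepts the same extended patterns as egrep, and compares the result with an equivalent search that passes each word through its own -e option:

```shell
# Hypothetical sample file with a name ending in sh
tmpdir=$(mktemp -d)
printf '%s\n' 'abc one' 'two def' 'abcdef three' 'xyz four' > "$tmpdir/sample.sh"

# Extended regular expression with alternation; -w requires whole words,
# so the line containing abcdef is not reported
ere_hits=$(grep -E -w 'abc|def' "$tmpdir/sample.sh")

# Equivalent search without alternation: one -e option per word
multi_hits=$(grep -w -e 'abc' -e 'def' "$tmpdir/sample.sh")

echo "$ere_hits"
rm -rf "$tmpdir"
```

Both commands report the lines "abc one" and "two def" for this sample data.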

You can also use metacharacters in egrep expressions. For example, the following code snippet matches lines that start with 123 or end with four followed by a single space:

egrep '^123|four $' columns3.txt
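
The anchors in this pattern apply to each alternative separately: ^ belongs only to 123 and $ belongs only to the string four followed by a space. A small sketch with a hypothetical sample file (not the book's columns3.txt) makes this visible by printing the line numbers of the matching lines:

```shell
# Hypothetical sample data; line 3 ends with a trailing space
tmpdir=$(mktemp -d)
printf '%s\n' '123 one' 'one 123' 'three four ' 'three four' > "$tmpdir/sample.txt"

# -n prefixes each match with its line number; cut keeps only that number
anchored_lines=$(grep -E -n '^123|four $' "$tmpdir/sample.txt" | cut -d: -f1)

echo "matching line numbers:"
echo "$anchored_lines"
rm -rf "$tmpdir"
```

Only lines 1 and 3 match: line 2 contains 123 but not at the start, and line 4 ends with four without the trailing space.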

A more detailed explanation of grep, egrep, and fgrep is here:

https://superuser.com/questions/508881/what-is-the-difference-between-grep-pgrep-egrep-fgrep.

Displaying “Pure” Words in a Dataset with `egrep`

For simplicity, let’s work with a text string, and that way we can see the intermediate results as we work toward the solution. Let’s initialize the variable x as shown here:

x="ghi abc Ghi 123 #def5 123z"

The first step is to split x into words:

echo $x |tr -s ' ' '\n'

The output is here:

ghi
abc
Ghi
123
#def5
123z

The second step is to invoke egrep with the regular expression ^[a-zA-Z]+$, which matches any string consisting solely of one or more uppercase and/or lowercase letters:

echo $x |tr -s ' ' '\n' |egrep "^[a-zA-Z]+$"

The output is here:

ghi
abc
Ghi

If you also want to sort the output and print only the unique words, use this command:

echo $x |tr -s ' ' '\n' |egrep "^[a-zA-Z]+$" |sort | uniq

The output is here:

Ghi
abc
ghi

If you want to extract only the integers in the variable x, use this command:

echo $x |tr -s ' ' '\n' |egrep "^[0-9]+$" |sort | uniq

The output is here:

123

If you want to extract alphanumeric words from the variable x, use this command:

echo $x |tr -s ' ' '\n' |egrep "^[a-zA-Z0-9]+$" |sort | uniq

The output is here:

123
123z
Ghi
abc
ghi

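Note that the ASCII collating sequence places digits before uppercase letters, and uppercase letters before lowercase letters, for the following reason: 0 through 9 are hexadecimal values 0x30 through 0x39, the uppercase letters in A–Z are hexadecimal 0x41 through 0x5a, and the lowercase letters in a–z are hexadecimal 0x61 through 0x7a. If your locale uses a different collating sequence, the sort command can produce a different order; setting LC_ALL=C for the sort command forces the ASCII order shown here.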

Now you can replace echo $x with a command that displays the contents of a dataset (such as cat) in order to retrieve only alphabetic strings from that dataset.
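
As a sketch of that substitution, the following code reads a hypothetical two-line dataset from a file instead of a variable. The file name and its contents are assumptions made for illustration. LC_ALL=C is set for the sort command so that the ASCII ordering described above applies regardless of the current locale, and sort -u is used as a shorthand for sort | uniq:

```shell
# Hypothetical dataset (note the two consecutive spaces on the second line)
tmpdir=$(mktemp -d)
printf '%s\n' 'ghi abc Ghi 123' '#def5 123z  abc' > "$tmpdir/words.txt"

# Split into one word per line, keep purely alphabetic words, sort uniquely
pure_words=$(tr -s ' ' '\n' < "$tmpdir/words.txt" | grep -E '^[a-zA-Z]+$' | LC_ALL=C sort -u)

# Same pipeline, but keep only the integers
integers=$(tr -s ' ' '\n' < "$tmpdir/words.txt" | grep -E '^[0-9]+$' | LC_ALL=C sort -u)

echo "pure words:"
echo "$pure_words"
echo "integers:"
echo "$integers"
rm -rf "$tmpdir"
```

For this sample data the pure words are Ghi, abc, and ghi (the duplicate abc is removed), and the only integer is 123.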

The `fgrep` Command

The fgrep command (“fixed-string grep,” sometimes glossed as “fast grep”) is the same as grep -F: it treats the search pattern as a literal string rather than as a regular expression. Although fgrep is deprecated, it’s still supported in order to allow historical applications that rely on it to run unmodified. In addition, some older systems might not support the -F option for the grep command, so they use the fgrep command. If you really want to learn more about the fgrep command, perform an Internet search for tutorials.
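
The effect of fixed-string matching is easiest to see with a pattern that contains a metacharacter. In the following sketch (the sample file and its contents are hypothetical), the period in a.c matches any character when the pattern is treated as a regular expression, but only a literal period when grep -F is used:

```shell
# Hypothetical sample data
tmpdir=$(mktemp -d)
printf '%s\n' 'a.c' 'abc' 'a-c' > "$tmpdir/literal.txt"

# Regular expression: "." matches any single character
regex_count=$(grep -c 'a.c' "$tmpdir/literal.txt")

# Fixed string: "." matches only a literal period
fixed_count=$(grep -F -c 'a.c' "$tmpdir/literal.txt")

echo "regular expression matches: $regex_count"
echo "fixed-string matches:       $fixed_count"
rm -rf "$tmpdir"
```

Here the regular expression matches all three lines, whereas the fixed-string search matches only the line a.c.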

A Simple Use Case

The code sample in this section shows you how to use the grep command in order to find specific lines in a dataset and then “merge” pairs of lines to create a new dataset. This is very much like what a “join” command does in a relational database. Listing 3.7 displays the contents of the file test1.csv, which contains the initial dataset.

LISTING 3.7 test1.csv

F1,F2,F3,M0,M1,M2,M3,M4,M5,M6,M7,M8,M9,M10,M11,M12
1,KLM,,1.4,,0.8,,1.2,,1.1,,,2.2,,,1.4
1,KLMAB,,0.05,,0.04,,0.05,,0.04,,,0.07,,,0.05
1,TP,,7.4,,7.7,,7.6,,7.6,,,8.0,,,7.3
1,XYZ,,4.03,3.96,,3.99,,3.84,4.12,,,,4.04,,
2,KLM,,0.9,0.7,,0.6,,0.8,0.5,,,,0.5,,
2,KLMAB,,0.04,0.04,,0.03,,0.04,0.03,,,,0.03,,
2,EGFR,,99,99,,99,,99,99,,,,99,,
2,TP,,6.6,6.7,,6.9,,6.6,7.1,,,,7.0,,
3,KLM,,0.9,0.1,,0.5,,0.7,,0.7,,,0.9,,
3,KLMAB,,0.04,0.01,,0.02,,0.03,,0.03,,,0.03,,
3,PLT,,224,248,,228,,251,,273,,,206,,
3,XYZ,,4.36,4.28,,4.58,,4.39,,4.85,,,4.47,,
3,RDW,,13.6,13.7,,13.8,,14.1,,14.0,,,13.4,,
3,WBC,,3.9,6.5,,5.0,,4.7,,3.7,,,3.9,,
3,A1C,,5.5,5.6,,5.7,,5.6,,5.5,,,5.3,,
4,KLM,,1.2,,0.6,,0.8,0.7,,,0.9,,,1.0,
4,TP,,7.6,,7.8,,7.6,7.3,,,7.7,,,7.7,
5,KLM,,0.7,,0.8,,1.0,0.8,,0.5,,,1.1,,
5,KLM,,0.03,,0.03,,0.04,0.04,,0.02,,,0.04,,
5,TP,,7.0,,7.4,,7.3,7.6,,7.3,,,7.5,,
5,XYZ,,4.73,,4.48,,4.49,4.40,,,4.59,,,4.63,

Listing 3.8 displays the contents of the file joinlines.sh, which illustrates how to merge the pairs of matching lines in test1.csv.

LISTING 3.8 joinlines.sh

inputfile="test1.csv"
outputfile="joinedlines.csv"

# patterns to match:
klm1="1,KLM,"
klm5="5,KLM,"
xyz1="1,XYZ,"
xyz5="5,XYZ,"

# desired output (one merged line per pair):
# klm1,xyz1
# klm5,xyz5

# step 1: match patterns with CSV file:
klm1line="$(grep "$klm1" "$inputfile")"
xyz1line="$(grep "$xyz1" "$inputfile")"
xyz5line="$(grep "$xyz5" "$inputfile")"
# $klm5 matches 2 lines (we want first line):
klm5line="$(grep "$klm5" "$inputfile" | head -n 1)"
echo "klm1line: $klm1line"
echo "klm5line: $klm5line"
echo "xyz1line: $xyz1line"
echo "xyz5line: $xyz5line"

# step 2: create summary file
# (tr -d '\n' removes the newline so that the next line lands on the same row):
echo "$klm1line" | tr -d '\n' > "$outputfile"
echo "$xyz1line"               >> "$outputfile"
echo "$klm5line" | tr -d '\n' >> "$outputfile"
echo "$xyz5line"               >> "$outputfile"
echo; echo

Launching the shell script in Listing 3.8 displays the four matching lines and creates the file joinedlines.csv, whose contents are here:

1,KLM,,1.4,,0.8,,1.2,,1.1,,,2.2,,,1.41,XYZ,,4.03,3.96,,3.99,,3.84,4.12,,,,4.04,,
5,KLM,,0.7,,0.8,,1.0,0.8,,0.5,,,1.1,,5,XYZ,,4.73,,4.48,,4.49,4.40,,,4.59,,,4.63,

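As you can see, the grep command handles the task in this section with only a few lines of code. Note that the merged lines are concatenated without a separator (for example, the trailing 1.4 and the leading 1 run together as 1.41), and that additional data cleaning is required in order to handle the empty fields in the output.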
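
One possible refinement, offered only as a sketch and not as the method used in Listing 3.8, is to loop over the key values and insert a comma between the two halves of each merged line. The sample file below is a hypothetical, shortened stand-in for test1.csv. The patterns are anchored with ^ so that they match only at the start of a line, and head -n 1 keeps the first match when a pattern matches more than one line:

```shell
# Hypothetical, shortened stand-in for test1.csv
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.csv" <<'EOF'
1,KLM,1.4,0.8
1,XYZ,4.03,3.96
5,KLM,0.7,0.8
5,KLM,0.03,0.03
5,XYZ,4.73,4.48
EOF

# For each key, take the first KLM line and the first XYZ line,
# then join them with an explicit comma
for id in 1 5; do
  left=$(grep "^$id,KLM," "$tmpdir/sample.csv" | head -n 1)
  right=$(grep "^$id,XYZ," "$tmpdir/sample.csv" | head -n 1)
  echo "$left,$right"
done > "$tmpdir/joined.csv"

joined=$(cat "$tmpdir/joined.csv")
echo "$joined"
rm -rf "$tmpdir"
```

The explicit comma keeps the last field of the first line separate from the first field of the second line, which removes one of the cleanup steps mentioned above.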

Summary

This chapter showed you how to work with the grep utility, which is a very powerful Unix command for searching text files for strings. You saw various options for the grep command, and examples of how to use those options to find string patterns in text files.

Next you learned about egrep, a variant of the grep command that can simplify and also expand on the basic functionality of grep, and about when you might choose one command over the other.

Finally, you learned how to use key values in one text file to search for matching lines of text in another file, and perform join-like operations using the grep command.
