grep
This chapter introduces you to the versatile grep
command, whose purpose is to take a stream of text data and reduce it to only the parts that you care about. The grep
command is useful not only by itself, but also in conjunction with other commands, especially the find
command. This chapter contains many short code samples that illustrate various options of the grep
command. Some code samples illustrate how to combine the grep
command with commands from previous chapters.
The first part of this chapter introduces the grep
command used in isolation, combined with the regular expression metacharacters (from Chapter 1) and also with code snippets that illustrate how to use some of the options of the grep
command. Next you will learn how to match ranges of lines, how to use the so-called “back references” in grep,
and how to “escape” metacharacters in grep.
The second part of this chapter shows you how to use the grep
command in order to find empty lines and common lines in datasets, as well as how to use keys to match rows in datasets. Next you will learn how to use character classes with the grep
command, as well as the backslash “\” character, and how to specify multiple matching patterns. Next you will learn how to combine the grep
command with the find
command and the xargs
command, which is useful for matching a pattern in files that reside in different directories. This section also contains some examples of common mistakes that people make with the grep
command.
The third section briefly discusses the egrep command and the fgrep command, which are older names for grep -E (extended regular expressions) and grep -F (fixed strings), respectively. The final section contains a use case that illustrates how to use the grep
command in order to find matching lines that are then merged in order to create a new dataset.
grep
Command?
The grep (“Global Regular Expression Print”) command is useful for finding substrings in one or more files. Several examples are here:
grep abc *sh
displays all the lines that contain abc in files with the suffix sh
grep -i abc *sh
is the same as the preceding query, but case-insensitive
grep -l abc *sh
displays the names of the files with the suffix sh that contain abc
grep -n abc *sh
displays the matching lines, each prefixed with its line number, for the string abc in files with the suffix sh
You can perform logical AND
and logical OR
operations with this syntax:
grep abc *sh | grep def
matches lines containing abc
AND def
grep "abc\|def" *sh
matches lines containing abc
OR def
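As a minimal sketch of the AND and OR idioms, the following snippet builds a small throwaway file (its name and contents are hypothetical) and runs both forms. It uses repeated -e options for the OR case, which avoids relying on the "\|" operator (an extension that not every grep supports in basic regular expressions).

```shell
# Hypothetical sample data written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'abc only' 'def only' 'abc and def' 'neither' > "$sample"

# AND: the second grep filters the lines that survived the first grep.
and_lines=$(grep abc "$sample" | grep def)

# OR: each -e adds another pattern; a line matches if any pattern matches.
or_lines=$(grep -e abc -e def "$sample")

echo "AND:"
echo "$and_lines"
echo "OR:"
echo "$or_lines"
rm -f "$sample"
```

The AND form costs an extra process, but it keeps each pattern simple and does not depend on the order in which the two strings appear on a line.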
You can combine switches as well: the following command displays the names of the files that contain the string abc (case insensitive):
grep -il abc *sh
In other words, the preceding command lists the files whose contents include abc, Abc, ABc, ABC, abC, and so forth.
Another (less efficient) way to display the lines containing abc (case insensitive) is here:
cat file1 | grep -i abc
The preceding command involves two processes, whereas passing the filename directly to grep (grep -i abc file1) involves a single process. The execution time is roughly the same for small text files, but the difference can become significant if you are working with multiple large text files.
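You can combine the sort command, the pipe symbol, and the grep command. For example, the following command displays the files with a “Sep” date in order of increasing size (the -k5,5n option tells sort to compare the fifth column, which holds the file size, as a number):
ls -l | grep " Sep " | sort -k5,5n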
A sample output from the preceding command is here:
-rw-r--r-- 1 oswaldcampesato2 staff 3 Sep 27 2017 abc.txt
-rw-r--r-- 1 oswaldcampesato2 staff 6 Sep 21 2017 control1.txt
-rw-r--r-- 1 oswaldcampesato2 staff 27 Sep 28 2017 fiblist.txt
-rw-r--r-- 1 oswaldcampesato2 staff 28 Sep 14 2017 dest
-rw-r--r-- 1 oswaldcampesato2 staff 36 Sep 14 2017 source
-rw-r--r-- 1 oswaldcampesato2 staff 195 Sep 28 2017 Divisors.py
-rw-r--r-- 1 oswaldcampesato2 staff 267 Sep 28 2017 Divisors2.py
grep
Command
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash.
The ‘.’ metacharacter matches any single character. In addition, a regular expression may be followed by one of several repetition operators, as shown below. (In basic grep syntax the ?, +, {, }, and | operators must be preceded by a backslash; they work as written with grep -E or egrep.)
‘?’ indicates that the preceding item is optional and will be matched at most once: Z? matches the empty string or Z.
‘*’ indicates that the preceding item will be matched zero or more times: Z* matches the empty string, Z, ZZ, ZZZ, and so forth.
‘+’ indicates that the preceding item will be matched one or more times: Z+ matches Z, ZZ, ZZZ, and so forth.
‘{n}’ indicates that the preceding item is matched exactly n times: Z{3} matches ZZZ.
‘{n,}’ indicates that the preceding item is matched n or more times: Z{3,} matches ZZZ, ZZZZ, and so forth.
‘{,m}’ indicates that the preceding item is matched at most m times: Z{,3} matches the empty string, Z, ZZ, and ZZZ.
‘{n,m}’ indicates that the preceding item is matched at least n times, but not more than m times: Z{2,4} matches ZZ, ZZZ, and ZZZZ.
The empty regular expression matches the empty string. Two regular expressions may be joined by the infix operator ‘|’. When used in this manner, the infix operator behaves exactly like a logical “OR” statement, which directs the grep command to return any line that matches either regular expression.
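The repetition operators are easier to compare when they are run against the same candidate strings. The following sketch is illustrative only: the candidate strings are made up, -E selects extended syntax so that no backslashes are needed, and -x requires the pattern to match the whole line.

```shell
# Hypothetical candidates: the letter Y followed by zero to four Z characters.
candidates=$(printf '%s\n' Y YZ YZZ YZZZ YZZZZ)

# Each variable holds the candidates that match the pattern in full (-x).
optional=$(printf '%s\n' "$candidates" | grep -E -x 'YZ?')
one_or_more=$(printf '%s\n' "$candidates" | grep -E -x 'YZ+')
exactly_three=$(printf '%s\n' "$candidates" | grep -E -x 'YZ{3}')
at_least_three=$(printf '%s\n' "$candidates" | grep -E -x 'YZ{3,}')
two_to_three=$(printf '%s\n' "$candidates" | grep -E -x 'YZ{2,3}')

echo "YZ?     -> $(echo $optional)"
echo "YZ+     -> $(echo $one_or_more)"
echo "YZ{3}   -> $(echo $exactly_three)"
echo "YZ{3,}  -> $(echo $at_least_three)"
echo "YZ{2,3} -> $(echo $two_to_three)"
```

Without -x, every pattern above would also match longer strings, because grep reports a line as soon as any part of it matches.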
grep
Command
Listing 3.1 displays the contents of lines.txt, which contains lines with words and metacharacters.
abcd
ab
abc
cd
defg
.*.
..
The following grep
command lists the lines of length 2 (using the ^
begin with and $
end with operators to restrict length) in lines.txt:
grep '^..$' lines.txt
The following command lists the lines of length two in lines.txt
that contain two dots (the backslash tells grep
to interpret the dots as actual dots, not as metacharacters):
grep '^\.\.$' lines.txt
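The first command (with the unescaped dots) displays every line of length two:
ab
cd
..
The second command (with the escaped dots) displays only the line that consists of two dots:
..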
The following command displays the lines that consist entirely of one or more dots. The * is a metacharacter here because it is not preceded by a backslash, and it matches zero or more occurrences of the preceding item, which is an escaped dot. In lines.txt the only such line is the one with two dots:
grep '^\.*\.$' lines.txt
The following command lists the lines that consist of a period, followed by an asterisk, and then another period (the * is now a character that must be matched because it is preceded by a backslash):
grep '^\.\*\.$' lines.txt
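To check these patterns yourself, the following sketch recreates lines.txt in a temporary directory (the directory is an assumption made for illustration; the file contents follow Listing 3.1) and captures the result of each command.

```shell
# Recreate lines.txt from Listing 3.1 in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' abcd ab abc cd defg '.*.' '..' > "$workdir/lines.txt"

# Unescaped dots: any two characters.
any_two=$(grep '^..$' "$workdir/lines.txt")
# Escaped dots: exactly two literal dots.
two_dots=$(grep '^\.\.$' "$workdir/lines.txt")
# Unescaped *: zero or more literal dots followed by one literal dot.
only_dots=$(grep '^\.*\.$' "$workdir/lines.txt")
# Escaped *: a literal dot, a literal asterisk, a literal dot.
dot_star_dot=$(grep '^\.\*\.$' "$workdir/lines.txt")

printf 'any two chars:\n%s\n' "$any_two"
printf 'two dots: %s\n' "$two_dots"
printf 'only dots: %s\n' "$only_dots"
printf 'dot-star-dot: %s\n' "$dot_star_dot"
rm -rf "$workdir"
```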
grep
Command
There are many types of pattern matching possibilities with the grep command, and this section contains an eclectic mix of such commands that handle common scenarios.
In the following examples we have four text files (two .sh and two .txt) and two Word documents in a directory. The string abc is found on one line in abc1.txt and three lines in abc3.sh. The string ABC is found on two lines in ABC2.txt and four lines in ABC4.sh. Notice that abc is not found in the ABC files, and ABC is not found in the abc files.
ls *
ABC.doc ABC4.sh abc1.txt
ABC2.txt abc.doc abc3.sh
The following code snippet searches for occurrences of the string abc in all the files in the current directory that have sh as a suffix:
grep abc *sh
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle
The “-c” option counts the number of matching lines in each file (note that even though ABC4.sh has no matches, grep still reports it with a count of zero):
grep -c abc *sh
The output of the preceding command is here:
ABC4.sh:0
abc3.sh:3
The “-e” option lets you match patterns that would otherwise cause syntax problems (a leading “-” character is normally interpreted as the start of an option for grep):
grep -e "-abc" *sh
abc3.sh:ends with -abc
The “-e” option also lets you match multiple patterns:
grep -e "-abc" -e "comment" *sh
ABC4.sh:# ABC in a comment
abc3.sh:ends with -abc
The “-i” option performs a case-insensitive match:
grep -i abc *sh
ABC4.sh:ABC at start
ABC4.sh:ends with ABC
ABC4.sh:the ABC is in the middle
ABC4.sh:# ABC in a comment
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle
The “-v” option “inverts” the match, which means that the output consists of the lines that do not contain the specified string. The following command displays every line of ABC4.sh (ABC does not match abc because -i is not used), including its entirely empty line, as well as the one line of abc3.sh that does not contain abc:
grep -v abc *sh
Use the “-iv” options to display the lines that do not contain a specified string using a case-insensitive match:
grep -iv abc *sh
ABC4.sh:
abc3.sh:this line won't match
The “-l” option lists only the names of the files that contain a successful match (note that this matches the contents of files, not the filenames). The Word document matches because the actual text is still visible to grep; it is just surrounded by proprietary formatting gibberish. You can do similar things with other formats that contain text, such as XML, HTML, .csv, and so forth:
grep -l abc *
abc1.txt
abc3.sh
abc.doc
You can restrict the same search to the files that have sh as a suffix:
grep -l abc *sh
Use the “-il” options to display the filenames that contain a specified string using a case-insensitive match:
grep -il abc *doc
The preceding command is very useful when you want to check for the occurrence of a string in Word documents.
The “-n” option displays the line number of each matching line:
grep -n abc *sh
abc3.sh:1:abc at start
abc3.sh:2:ends with -abc
abc3.sh:3:the abc is in the middle
The “-h” option suppresses the display of the filename for a successful match:
grep -h abc *sh
abc at start
ends with -abc
the abc is in the middle
For the next series of examples, we will use columns4.txt as shown in Listing 3.2.
123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
one two three
four five
The “-o” option shows only the matched string (this is how you avoid returning the entire line that matches):
grep -o one columns4.txt
The “-o” option combined with the “-b” option shows the position of the matched string. The position is a byte offset that is counted from zero at the start of the file, not a line number; the “o” in “one” is at offset 59:
grep -o -b one columns4.txt
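The byte offset is easier to trust once you reproduce it. The sketch below assumes GNU grep (for the -o and -b options) and a copy of columns4.txt laid out with one record per line, as in Listing 3.2; the temporary directory is an illustrative assumption.

```shell
# Recreate columns4.txt with one record per line.
workdir=$(mktemp -d)
printf '%s\n' '123 ONE TWO' '456 three four' 'ONE TWO THREE FOUR' \
  'five 123 six' 'one two three' 'four five' > "$workdir/columns4.txt"

# -o prints only the matched text; -b prefixes it with its byte offset.
offset_report=$(grep -o -b one "$workdir/columns4.txt")
echo "$offset_report"

# Cross-check: the first four lines occupy 12 + 15 + 19 + 13 bytes
# (each count includes the trailing newline), so line 5 starts at byte 59.
bytes_before=$(head -n 4 "$workdir/columns4.txt" | wc -c | tr -d ' ')
echo "bytes before line 5: $bytes_before"
rm -rf "$workdir"
```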
You can specify a recursive search as shown here (output not shown because it will be different on every client or account). This command searches not only every file in the directory /etc, but also every file in every subdirectory of /etc:
grep -r abc /etc
The preceding commands match lines where the specified string is a substring of a longer string in the file. For instance, the preceding commands will match occurrences of abc as well as abcd, dabc, abcde, and so forth.
grep ABC *txt
ABC2.txt:ABC at start or ABC in middle or end in ABC
ABC2.txt:ABCD DABC
If you want to exclude everything except for an exact match, you can use the -w option, as shown here:
grep -w ABC *txt
ABC2.txt:ABC at start or ABC in middle or end in ABC
The --color switch displays the matching string in color:
grep --color abc *sh
abc3.sh:abc at start
abc3.sh:ends with -abc
abc3.sh:the abc is in the middle
You can use the pair of metacharacters .* to find the occurrences of two words that are separated by an arbitrary number of intermediate characters.
The following command finds all lines that contain the string one followed by the string three, with any number of intermediate characters:
grep "one.*three" columns4.txt
one two three
You can “invert” the preceding result by using the -v switch, as shown here:
grep -v "one.*three" columns4.txt
123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
four five
The following command finds all lines that contain the string one followed by the string three, with any number of intermediate characters, where the match involves a case-insensitive comparison:
grep -i "one.*three" columns4.txt
ONE TWO THREE FOUR
one two three
You can “invert” the preceding result by using the -v switch, as shown here:
grep -iv "one.*three" columns4.txt
123 ONE TWO
456 three four
five 123 six
four five
Sometimes you need to search a file for the presence of either of two strings. For example, the following command finds the files that contain “start” or “end”:
grep -l 'start\|end' *
ABC2.txt
ABC4.sh
abc3.sh
Later in the chapter you will see how to find files that contain a pair of strings via the grep
and xargs
commands.
grep
Command
This section contains some simple one-line commands that combine the grep command with character classes. Note that a character class such as [:alpha:] is only valid inside a bracket expression, which is why the brackets are doubled.
echo "abc" | grep '[[:alpha:]]'
abc
echo "123" | grep '[[:alpha:]]'
(returns nothing, no match)
echo "abc123" | grep '[[:alpha:]]'
abc123
echo "abc" | grep '[[:alnum:]]'
abc
echo "123" | grep '[[:alnum:]]'
123
echo "abc123" | grep '[[:alnum:]]'
abc123
echo "abc" | grep '[0-9]'
(returns nothing, no match)
echo "123" | grep '[0-9]'
123
echo "abc123" | grep '[0-9]'
abc123
echo "abc123" | grep -w '[0-9]'
(returns nothing, no match)
grep
Consider a scenario in which a directory (such as a log directory) has files created by an outside program. Your task is to write a shell script that determines which (if any) of the files contain a string on exactly two lines, after which additional processing is performed on the matching files (e.g., use email to send log files containing two or more error messages to a system administrator for investigation).
One solution involves the -c option for grep, followed by additional invocations of the grep command.
The command snippets in this section assume the following data files whose contents are shown below.
The file hello1.txt contains the following:
hello world1
The file hello2.txt contains the following:
hello world2
hello world2 second time
The file hello3.txt contains the following:
hello world3
hello world3 two
hello world3 three
Now launch the following commands (2>/dev/null keeps warnings and errors caused by empty directories from cluttering up the output):
grep -c hello hello*txt 2>/dev/null
hello1.txt:1
hello2.txt:2
hello3.txt:3
grep -l hello hello*txt 2>/dev/null
hello1.txt
hello2.txt
hello3.txt
grep -c hello hello*txt 2>/dev/null | grep ":2$"
hello2.txt:2
Note how we use the “ends with” "$" metacharacter to grab just the files that have exactly two matches. We also use the colon ":2$" rather than just "2$" to prevent grabbing files that have 12, 32, or 142 matches (which would end in :12, :32, and :142).
What if we wanted to show “two or more” (as in the “2 or more errors in a log”)? You would instead use the invert (-v) option to exclude counts of exactly 0 or exactly 1.
grep -c hello hello*txt 2>/dev/null | grep -v ':[0-1]$'
hello2.txt:2
hello3.txt:3
In a real-world application, you would want to strip off everything after the colon to return only the filenames. There are many ways to do so, but we’ll use the cut command we learned in Chapter 1, which involves defining : as a delimiter with -d":" and using -f1 to return the first column (i.e., the part before the colon in the return text):
grep -c hello hello*txt 2>/dev/null | grep -v ':[0-1]$' | cut -d":" -f1
hello2.txt
hello3.txt
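When the threshold is something other than two, excluding individual counts with a pattern becomes awkward. One possible alternative, shown here as a sketch rather than as the approach this chapter uses, is to let awk compare the count numerically. The variable min_count is a hypothetical name, and the sketch assumes that the filenames themselves contain no colons.

```shell
# Recreate the three sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'hello world1\n' > "$workdir/hello1.txt"
printf 'hello world2\nhello world2 second time\n' > "$workdir/hello2.txt"
printf 'hello world3\nhello world3 two\nhello world3 three\n' > "$workdir/hello3.txt"

min_count=2   # hypothetical threshold: report files with at least this many matching lines

# grep -c prints "filename:count"; awk splits on ":" and compares the count as a number.
matching_files=$(cd "$workdir" && grep -c hello hello*txt 2>/dev/null |
  awk -F: -v min="$min_count" '$2 >= min { print $1 }')

echo "$matching_files"
rm -rf "$workdir"
```

Changing min_count to 3 would report only hello3.txt, with no change to the pattern.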
In Chapter 1 you saw how to use the head and tail commands to display a range of lines in a text file. Now suppose that you want to search a range of lines for a string. For instance, the following command displays lines 7 through 15 of longfile.txt (head -15 keeps the first 15 lines, and tail -9 keeps the last 9 of those lines):
cat -n longfile.txt | head -15 | tail -9
The output is here:
7 and each line
8 contains
9 one or
10 more words
11 and if you
12 use the cat
13 command the
14 file contents
15 scroll
This command displays the subset of lines 7 through 15 of longfile.txt that contain the string and:
cat -n longfile.txt | head -15 | tail -9 | grep and
7 and each line
11 and if you
13 command the
This command requires whitespace both before and after the word and, thereby excluding the line with the word “command” (a trailing space alone is not enough, because “command the” also contains the string “and ”):
cat -n longfile.txt | head -15 | tail -9 | grep "[[:space:]]and "
The output is here:
7 and each line
11 and if you
Note that the preceding command excludes lines that end in “and” because they do not have the whitespace after “and” at the end of the line. You could remedy this situation with an “OR” operator including both cases:
cat -n longfile.txt | head -15 | tail -9 | grep "[[:space:]]and\|and "
7 and each line
11 and if you
13 command the
However, the preceding allows “command” back into the mix. Hence, if you really want to match a specific word, it’s best to use the -w option, which is smart enough to handle the variations:
cat -n longfile.txt | head -15 | tail -9 | grep -w "and"
7 and each line
11 and if you
The use of whitespace is safer if you are looking for something at the beginning or end of a line. This is a common approach when reading contents of log files or other structured text where the first word is often important (a tag like ERROR or Warning, a numeric code, or a date). This command displays the lines that start with the word and:
cat longfile.txt | head -15 | tail -9 | grep "^and "
The output is here (without the line number because we are not using “cat -n”):
and each line
and if you
Recall that the “use the file name(s) in the command, instead of using cat to display the file first” style is more efficient:
head -15 longfile.txt | tail -9 | grep "^and "
and each line
and if you
However, the head command does not display the line numbers of a text file, so the “cat first” style (cat -n adds line numbers) is used in the earlier examples when you want to see the line numbers, even though this style is less efficient. Basically, you only want to add an extra command to a pipe if it is adding value; otherwise it’s better to start with a direct call to the files you are trying to process with the first command in the pipe, assuming the command syntax is capable of reading in filenames.
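The arithmetic behind the head and tail combination is easy to get wrong: to keep lines start through end, head needs end and tail needs end - start + 1. The following sketch checks that rule against a generated 20-line file (a stand-in, since the contents of longfile.txt are not reproduced here) and compares it with sed -n, which accepts the range directly.

```shell
start=7
end=15

# Hypothetical 20-line sample file: "line 1" through "line 20".
sample=$(mktemp)
seq 1 20 | sed 's/^/line /' > "$sample"

# head keeps the first $end lines; tail keeps the last (end - start + 1) of them.
count=$((end - start + 1))
range_head_tail=$(head -n "$end" "$sample" | tail -n "$count")

# sed prints only the requested range of line numbers.
range_sed=$(sed -n "${start},${end}p" "$sample")

first_line=$(printf '%s\n' "$range_head_tail" | head -n 1)
last_line=$(printf '%s\n' "$range_head_tail" | tail -n 1)
echo "count=$count first='$first_line' last='$last_line'"
rm -f "$sample"
```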
grep
Command
The grep command allows you to reference the text that was matched by a regular expression placed inside a pair of parentheses. For grep to parse the parentheses correctly, each has to be preceded with the escape character “\”.
For example, grep 'a\(.\)' uses the “.” regular expression to match “ab” or “a3” but not “3a” or “ba”.
The back reference ‘\n’, where n is a single digit, matches the substring previously matched by the nth parenthesized sub-expression of the regular expression. For example, grep '\(a\)\1' matches “aa” and grep '\(a\)\1\1' matches “aaa”. (The expression '\(a\)\2' is invalid, because it refers to a second group that does not exist.)
When used with alternation, if the group does not participate in the match, then the back reference makes the whole match fail. For example, grep 'a\(.\)\|b\1' does not match ba or bb, because the group is not set when the second alternative is tried (although ab matches through the first alternative).
If you have more than one parenthesized sub-expression, they are referenced (from left to right) by \1, \2, . . ., \9. Keep in mind that a back reference matches the same text that the group matched, not merely the same pattern:
grep -e '\([a-z]\)\([0-9]\)\1' matches a1a but not a1b, whereas grep -e '\([a-z]\)\([0-9]\)\([a-z]\)' matches both.
grep -e '\([a-z]\)\([0-9]\)\2' matches a11 but not a12, whereas grep -e '\([a-z]\)\([0-9]\)\([0-9]\)' matches both.
The easiest way to think of it is that the number (for example, \2) is a variable that holds whatever text the corresponding group matched, so that you can require the same text to appear again later in the line.
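A short experiment makes the behavior of a back reference concrete: it repeats the text that a group matched, not the group's pattern. The helper function matches below is a hypothetical convenience written for this sketch, not a standard command.

```shell
# Hypothetical helper: prints "yes" if string $2 matches pattern $1, else "no".
matches() {
  if printf '%s\n' "$2" | grep -q -e "$1"; then echo yes; else echo no; fi
}

backref='\([a-z]\)\([0-9]\)\1'           # letter, digit, then the SAME letter again
repeated='\([a-z]\)\([0-9]\)\([a-z]\)'   # letter, digit, then ANY letter

same_letter=$(matches "$backref" a1a)
different_letter=$(matches "$backref" a1b)
any_letter=$(matches "$repeated" a1b)

echo "back reference vs a1a: $same_letter"
echo "back reference vs a1b: $different_letter"
echo "repeated group vs a1b: $any_letter"
```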
You can match consecutive identical digits using the pattern \([0-9]\)\1.
For example, the following command is a successful match because the string “1223
” contains a pair of consecutive identical digits:
echo "1223" | grep -e '\([0-9]\)\1'
Similarly, the following command is a successful match because the string “12223
” contains three consecutive occurrences of the digit 2:
echo "12223" | grep -e '\([0-9]\)\1\1'
You can check for the occurrence of two identical digits separated by any character with this expression:
echo "12z23" | grep -e '\([0-9]\).\1'
In an analogous manner, you can test for the occurrence of duplicate letters, as shown here:
echo "abbc" | grep -e '\([a-z]\)\1'
The following example matches an IP address and does not use back references. Because grep does not support the “\d” shorthand for digits in basic or extended regular expressions, the example uses the bracket expression “[0-9]” to match digits and “\.” to match periods:
echo "192.168.125.103" | grep -e '[0-9]\{3\}\.[0-9]\{3\}\.[0-9]\{3\}\.[0-9]\{3\}'
If you want to allow for fewer than three digits, you can use the expression \{1,3\}, which matches 1, 2, or 3 digits. The following command uses it on the third block. In a situation where any of the four blocks might have fewer than three digits, you must use the same syntax in all four blocks:
echo "192.168.5.103" | grep -e '[0-9]\{3\}\.[0-9]\{3\}\.[0-9]\{1,3\}\.[0-9]\{3\}'
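The same idea can be written more compactly with extended syntax, applying {1,3} to all four blocks and anchoring the pattern so that nothing else may appear on the line. This is a sketch only: the function name is_ipv4_shaped is hypothetical, and the pattern checks the shape of the address, not whether each block is in the range 0 through 255.

```shell
# Hypothetical helper: succeeds if $1 looks like four dot-separated blocks of 1-3 digits.
is_ipv4_shaped() {
  printf '%s\n' "$1" | grep -E -q '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

for candidate in 192.168.125.103 192.168.5.103 192.168.5 1234.1.1.1; do
  if is_ipv4_shaped "$candidate"; then
    echo "$candidate: shaped like an IPv4 address"
  else
    echo "$candidate: rejected"
  fi
done
```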
You can perform more complex matches using back references. Listing 3.3 displays the contents of columns5.txt, which contains several lines that are palindromes (the same spelling from left-to-right as right-to-left). Note that the third line is an empty line.
one eno
ONE ENO

ONE TWO OWT ENO
four five
The following command finds all lines that are palindromes:
grep -w -e '\(.\)\(.\).*\2\1' columns5.txt
The output of the preceding command is here:
one eno
ONE ENO
ONE TWO OWT ENO
The idea is as follows: the first \(.\) matches a single character, followed by a second \(.\) that matches another single character, followed by any number of intermediate characters. The sequence \2\1 then requires those two characters to appear again in reverse order. Note that this pattern only compares the two outermost characters on each side, so it is a heuristic that works for this dataset rather than a complete test for palindromes.
Recall that the metacharacter “^” refers to the beginning of a line and the metacharacter “$” refers to the end of a line. Thus, an empty line matches the pattern ^$. You can find the single empty line in columns5.txt with this command:
grep -n "^$" columns5.txt
The output of the preceding grep command is here (use the -n switch to display line numbers, as blank lines will not otherwise show in the output):
3:
More commonly the goal is to simply strip the empty lines from a file. We can do that just by inverting the prior query (and not showing the line numbers):
grep -v "^$" columns5.txt
one eno
ONE ENO
ONE TWO OWT ENO
four five
As you can see, the preceding output displays four non-empty lines, and as we saw in the previous grep command, line #3 is an empty line.
Data is often organized around unique values (typically numbers) in order to distinguish otherwise similar things: for example, John Smith
the manager must not be confused with John Smith
the programmer in an employee dataset. Hence, each record is assigned a unique number that will be used for all queries related to employees. Moreover, their names are merely data elements of a given record, rather than a means of identifying a record that contains a particular person.
With the preceding points in mind, suppose that you have a text file in which each line contains a single key value. In addition, another text file consists of one or more lines, where each line contains a key value followed by a quantity value.
As an illustration, Listing 3.4 displays the contents of skuvalues.txt and Listing 3.5 displays the contents of skusold.txt. Note that an SKU is a term often used to refer to an individual product configuration, including its packaging, labeling, and so forth.
4520
5530
6550
7200
8000
4520 12
4520 15
5530 5
5530 12
6550 0
6550 8
7200 50
7200 10
7200 30
8000 25
8000 45
8000 90
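One way to use the keys in skuvalues.txt to match rows in skusold.txt is sketched below: for each key, grep selects the rows whose first column is that key, and the quantities are added up. This is an illustrative assumption about what to do with the matched rows (the per-SKU total), not necessarily the processing the chapter goes on to perform.

```shell
# Recreate the two listings in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 4520 5530 6550 7200 8000 > "$workdir/skuvalues.txt"
printf '%s\n' '4520 12' '4520 15' '5530 5' '5530 12' '6550 0' '6550 8' \
  '7200 50' '7200 10' '7200 30' '8000 25' '8000 45' '8000 90' > "$workdir/skusold.txt"

report=$(
  while read -r sku; do
    total=0
    # "^$sku " anchors the key to the first column, so 4520 cannot match 14520.
    for qty in $(grep "^$sku " "$workdir/skusold.txt" | cut -d' ' -f2); do
      total=$((total + qty))
    done
    echo "$sku $total"
  done < "$workdir/skuvalues.txt"
)

echo "$report"
rm -rf "$workdir"
```

Anchoring the key with ^ and a trailing space matters for the same reason that -w matters elsewhere in this chapter: an unanchored key would also match longer values that merely contain it.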
grep
Command
The “\” character has a special interpretation when it’s followed by the following characters:
“\b
” = Match the empty string at the edge of a word.
“\B
” = Match the empty string provided it’s not at the edge of a word, so:
“\brat\b” matches the separate word “rat” but not “crate,” and
“\Brat\B” matches “crate” but not “furry rat.”
“\<
” = Match the empty string at the beginning of a word.
“\>
” = Match the empty string at the end of a word.
“\w
” = Match word constituent, it is a synonym for “[_[:alnum:]].”
“\W
” = Match non-word constituent, it is a synonym for “[^_[:alnum:]].”
“\s
” = Match whitespace, it is a synonym for “[[:space:]].”
“\S
” = Match non-whitespace, it is a synonym for “[^[:space:]].”
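The word-boundary sequences can be tried on a few sample words. The following sketch assumes GNU grep, where these backslash sequences are available; the sample strings are made up for illustration.

```shell
# Hypothetical sample lines.
text=$(printf '%s\n' 'furry rat' 'crate' 'rational')

# \b on both sides: "rat" must stand alone as a word.
whole_word=$(printf '%s\n' "$text" | grep '\brat\b')
# \B on both sides: "rat" must be strictly inside a longer word.
inside_word=$(printf '%s\n' "$text" | grep '\Brat\B')
# \< only: "rat" must begin a word, but the word may continue.
word_start=$(printf '%s\n' "$text" | grep '\<rat')

printf 'whole word:  %s\n' "$whole_word"
printf 'inside word: %s\n' "$inside_word"
printf 'word start:\n%s\n' "$word_start"
```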
grep
Command
In an earlier example you saw how to use the -i option to perform a case-insensitive match. You can also use the pipe “|” symbol (written as "\|" in basic grep syntax) to specify more than one regular expression.
For example, the following grep expression matches any line that contains “one” as well as any line that contains “ONE TWO”:
grep "one\|ONE TWO" columns5.txt
The output of the preceding grep command is here:
one eno
ONE TWO OWT ENO
Although the preceding grep command specifies a pair of character strings, you can specify an arbitrary number of character sequences or regular expressions, as long as you put "\|" between each thing you want to match.
grep
Command and the xargs
Command
The xargs command is often used in conjunction with the find command in bash. For example, you can search for the files under the current directory (including subdirectories) that have the sh suffix and then check which of those files contain the string abc, as shown here:
find . -print | grep "sh$" | xargs grep -l abc
A more useful combination of the find and xargs commands is shown here:
find . -mtime -7 -name "*.sh" -print | xargs grep -l abc
The preceding command searches for all the files (including subdirectories) with the suffix “sh” that have been modified within the last seven days, and pipes that list to the xargs command, which displays the names of the files that contain the string abc (a case-sensitive match, because the -i option is not specified).
The find
command supports many options, which can be combined via AND as well as OR in order to create very complex expressions.
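Filenames that contain spaces are split into separate words when find output is piped to xargs as plain text. Two widely available remedies are sketched below: the -print0 and -0 pair (supported by GNU and BSD versions of find and xargs) and the -exec ... {} + form of find. The directory and file names here are hypothetical.

```shell
# Hypothetical tree with a space in both a directory name and a filename.
workdir=$(mktemp -d)
mkdir -p "$workdir/sub dir"
printf 'abc here\n' > "$workdir/sub dir/my script.sh"
printf 'nothing to see\n' > "$workdir/other.sh"

# NUL-separated names survive spaces on their way through xargs.
via_xargs=$(find "$workdir" -type f -name '*.sh' -print0 | xargs -0 grep -l abc)

# find can also batch the names itself, with no pipe at all.
via_exec=$(find "$workdir" -type f -name '*.sh' -exec grep -l abc {} +)

echo "$via_xargs"
echo "$via_exec"
expected="$workdir/sub dir/my script.sh"
rm -rf "$workdir"
```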
Note that grep -R hello . also performs a search for the string hello in all files, including subdirectories, and follows the “one process” recommendation. On the other hand, the find . -print command lists all files in all subdirectories, and you can pipe the output to xargs grep hello in order to find the occurrences of the word hello in all files (which involves at least two processes instead of one process).
You can use the output of the preceding code snippet in order to copy the matching files to another directory, as shown here:
cp `find . -print | grep "sh$" | xargs grep -l abc` /tmp
Alternatively, you can copy the matching files in the current directory (without matching files in any subdirectories) to another directory with the grep command:
cp `grep -l abc *sh` /tmp
Yet another approach is to use back ticks in a loop so that you can perform additional processing on each file:
for file in `find . -print`
do
  echo "Processing the file: $file"
  # now do something here
done
Keep in mind that if you pass too many filenames to a command via back ticks (or a wildcard), you will see an “Argument list too long” error message. The xargs command avoids this problem by splitting a long list of filenames across multiple invocations of the command. You can also insert additional grep commands prior to the xargs command in order to reduce the number of files that are piped into the xargs command.
If you work with NodeJS, you know that the node_modules directory contains a large number of files. In most cases, you probably want to exclude the files in that directory when you are searching for a string, and the “-v” option is ideal for this situation. The following command excludes every path that contains the string node (which includes the node_modules directory) while searching for the names of the files that contain the string src, redirecting the list of filenames to the file src_list.txt (and also redirecting error messages to /dev/null):
find . -print | grep -v node | xargs grep -il src >src_list.txt 2>/dev/null
You can extend the preceding command to search for the files that contain the string src and the string angular with the following command:
find . -print | grep -v node | xargs grep -il src | xargs grep -il angular >angular_list.txt 2>/dev/null
You can use the following combination of grep and xargs to find the files that contain both xml and def:
grep -l xml *svg | xargs grep -l def
A variation of the preceding command redirects error messages to /dev/null, as shown here:
grep -l hello *txt 2>/dev/null | xargs grep -c hello
There are at least three ways to search for a string in one or more zip files. As an example, suppose that you want to determine which zip files contain SVG documents.
The first way is shown here:
for f in *zip
do
  echo "Searching $f"
  jar tvf "$f" | grep "svg$"
done
When there are many zip files in a directory, the output of the preceding loop can be very verbose, in which case you need to scroll backward and probably copy/paste the names of the files that actually contain SVG documents into a separate file. A better solution is to put the preceding loop in a shell script and redirect its output. For instance, create the file findsvg.sh whose contents are the preceding loop, and then invoke this command:
./findsvg.sh 1>1 2>2
Notice that the preceding command redirects error messages (2>) to the file named 2 and the results of the jar/grep command (1>) to the file named 1. See the Appendix for another example of searching zip files for SVG documents.
Sometimes you need to check for the existence of a string (such as a key) in a text file, and then perform additional processing based on its existence. However, do not assume that the existence of a string means that the string only occurs once. As a simple example, suppose the file mykeys.txt has the following content:
2000
22000
10000
3000
Suppose that you search for the string 2000, which you can do with findkey.sh, whose contents are displayed in Listing 3.6.
key="2000"
if [ "`grep $key mykeys.txt`" != "" ]
then
  foundkey=true
else
  foundkey=false
fi
# count the number of lines that contain the key
linecount=`grep -c $key mykeys.txt`
echo "current key = $key"
echo "found key = $foundkey"
echo "linecount = $linecount"
Listing 3.6 contains if/else conditional logic to determine whether or not the file mykeys.txt contains the value of $key (which is initialized as 2000). Launch the code in Listing 3.6 and you will see the following output:
current key = 2000
found key = true
linecount = 2
While the key value of 2000 does exist in mykeys.txt, you can see that it matches two lines in mykeys.txt. However, if mykeys.txt contained 100,000 (or more) lines, it would not be obvious that the value of 2000 matches more than one line. In this dataset, 2000 and 22000 both match, and you can prevent the extra matching line with this code snippet:
grep -w $key mykeys.txt
Thus, in files that have duplicate lines, you can count the number of lines that match the key by adding the -c option to the preceding code snippet. Another way to do so involves piping the output to wc -l, which displays the line count.
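The difference between a substring match and a whole-word match can be verified with a small experiment. The sketch below recreates mykeys.txt in a temporary file and uses grep -q, which sets the exit status without printing anything, as a tidier existence test than comparing the output with an empty string.

```shell
# Recreate mykeys.txt as a temporary file.
keys_file=$(mktemp)
printf '%s\n' 2000 22000 10000 3000 > "$keys_file"
key="2000"

# -q prints nothing; the exit status alone tells us whether the key exists.
if grep -q -w -- "$key" "$keys_file"; then
  foundkey=true
else
  foundkey=false
fi

substring_count=$(grep -c -- "$key" "$keys_file")    # counts 2000 and 22000
word_count=$(grep -c -w -- "$key" "$keys_file")      # counts only 2000

echo "found key = $foundkey"
echo "lines containing the key as a substring = $substring_count"
echo "lines containing the key as a whole word = $word_count"
rm -f "$keys_file"
```

If each line holds nothing but a key, the -x option (match the whole line) is an even stricter alternative to -w.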
Another scenario involves the use of the xargs command with the grep command, which can result in “no such . . .” error messages:
find . -print | xargs grep -il abc
Make sure to redirect errors using the following variant:
find . -print | xargs grep -il abc 2>/dev/null
egrep
Command and the fgrep
Command
The egrep command is an extended grep that supports additional features such as “+” (one or more occurrences of the previous item), “?” (zero or one occurrence of the previous item), and “|” (alternation) without requiring a preceding backslash. The egrep command is equivalent to grep -E, along with some caveats that are described here:
https://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html.
One advantage of using the egrep command is that its regular expressions are easier to read than the corresponding expressions in grep (especially when they are combined with back references).
The egrep
(“extended grep
”) command supports extended regular expressions, as well as the pipe “|” in order to specify multiple words in a search pattern. A match is successful if any of the words in the search pattern appear, so you can think of the search pattern as an “any” match. Thus, the pattern “abc|def
” matches lines that contain either abc
or def
(or both).
For example, the following code snippet enables you to search for occurrences of the string abc
as well as occurrences of the string def
in all files with the suffix sh
:
egrep -w 'abc|def' *sh
The preceding egrep
command is an “or” operation: a line matches if it contains either abc or def.
You can also use metacharacters in egrep expressions. For example, the following code snippet matches lines that start with 123 or end with four followed by a whitespace:
egrep '^123|four $' columns3.txt
A more detailed explanation of grep, egrep, and fgrep is here:
https://superuser.com/questions/508881/what-is-the-difference-between-grep-pgrep-egrep-fgrep.
egrep
For simplicity, let’s work with a text string, and that way we can see the intermediate results as we work toward the solution. Let’s initialize the variable x as shown here:
x="ghi abc Ghi 123 #def5 123z"
The first step is to split x
into words:
echo $x |tr -s ' ' '\n'
The output is here:
ghi
abc
Ghi
123
#def5
123z
The second step is to invoke egrep
with the regular expression ^[a-zA-Z]+$
, which matches any string consisting of one or more uppercase and/or lowercase letters (and nothing else):
echo $x |tr -s ' ' '\n' |egrep "^[a-zA-Z]+$"
The output is here:
ghi
abc
Ghi
If you also want to sort the output and print only the unique words, use this command:
echo $x |tr -s ' ' '\n' |egrep "^[a-zA-Z]+$" |sort | uniq
The output is here:
Ghi
abc
ghi
If you want to extract only the integers in the variable x
, use this command:
echo $x |tr -s ' ' '\n' |egrep "^[0-9]+$" |sort | uniq
The output is here:
123
If you want to extract alphanumeric words from the variable x
, use this command:
echo $x |tr -s ' ' '\n' |egrep "^[a-zA-Z0-9]+$" |sort | uniq
The output is here:
123
123z
Ghi
abc
ghi
Note that the ASCII collating sequence places digits before uppercase letters, and uppercase letters before lowercase letters, for the following reason: the digits 0 through 9 are hexadecimal values 0x30
through 0x39
, and the uppercase letters in A–Z
are hexadecimal 0x41
through 0x5a
, and the lowercase letters in a–z
are hexadecimal 0x61
through 0x7a
.
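The sort command orders its output according to the current locale, so on some systems the words may appear in a different order than the ASCII order described above. As a hedged sketch, setting LC_ALL=C for the sort command requests plain byte ordering:

```shell
x="ghi abc Ghi 123 #def5 123z"

# LC_ALL=C asks sort for byte (ASCII) ordering; sort -u combines sort and uniq.
sorted=$(echo "$x" | tr -s ' ' '\n' | grep -E '^[a-zA-Z0-9]+$' | LC_ALL=C sort -u)

echo "$sorted"
```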
Now you can replace echo $x
with a dataset in order to retrieve only alphabetic strings from that dataset.
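For instance, the pipeline can read from a file by redirecting the file into tr. The following sketch assumes a small temporary file with invented contents in place of a real dataset:

```shell
# Temporary stand-in for a dataset; its contents are invented.
dataset=$(mktemp)
printf 'ghi abc  Ghi\n123 #def5 abc\n' > "$dataset"

# Split into words, keep purely alphabetic ones, and print each once
# (LC_ALL=C pins the sort order to byte ordering).
words=$(tr -s ' ' '\n' < "$dataset" | grep -E '^[a-zA-Z]+$' | LC_ALL=C sort -u)

echo "$words"
rm -f "$dataset"
```

The duplicate abc is printed once, and the tokens 123 and #def5 are discarded.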
fgrep
Command
The fgrep
command (“fast grep”) is the same as grep -F,
which treats the search pattern as a fixed string rather than as a regular expression. Although fgrep is deprecated, it’s still supported so that historical applications that rely on it can run unmodified. In addition, some older systems might not support the -F option for the grep
command, so they use the fgrep
command. If you want to learn more about the fgrep
command, perform an Internet search for tutorials.
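The practical effect of fixed-string matching shows up when the search text contains a regular expression metacharacter such as the period. The following sketch compares grep -F with plain grep on two made-up version strings:

```shell
versions=$(printf 'v1.5\nv125\n')

# With -F the period is a literal period.
fixed=$(printf '%s\n' "$versions" | grep -F '1.5')

# Without -F the period matches any single character, so v125 also matches.
regex=$(printf '%s\n' "$versions" | grep '1.5')

echo "$fixed"
```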
The code sample in this section shows you how to use the grep
command in order to find specific lines in a dataset and then “merge” pairs of lines to create a new dataset. This is very much like what a “join” command does in a relational database. Listing 3.7 displays the contents of the file test1.csv,
which contains the initial dataset.
F1,F2,F3,M0,M1,M2,M3,M4,M5,M6,M7,M8,M9,M10,M11,M12
1,KLM,,1.4,,0.8,,1.2,,1.1,,,2.2,,,1.4
1,KLMAB,,0.05,,0.04,,0.05,,0.04,,,0.07,,,0.05
1,TP,,7.4,,7.7,,7.6,,7.6,,,8.0,,,7.3
1,XYZ,,4.03,3.96,,3.99,,3.84,4.12,,,,4.04,,
2,KLM,,0.9,0.7,,0.6,,0.8,0.5,,,,0.5,,
2,KLMAB,,0.04,0.04,,0.03,,0.04,0.03,,,,0.03,,
2,EGFR,,99,99,,99,,99,99,,,,99,,
2,TP,,6.6,6.7,,6.9,,6.6,7.1,,,,7.0,,
3,KLM,,0.9,0.1,,0.5,,0.7,,0.7,,,0.9,,
3,KLMAB,,0.04,0.01,,0.02,,0.03,,0.03,,,0.03,,
3,PLT,,224,248,,228,,251,,273,,,206,,
3,XYZ,,4.36,4.28,,4.58,,4.39,,4.85,,,4.47,,
3,RDW,,13.6,13.7,,13.8,,14.1,,14.0,,,13.4,,
3,WBC,,3.9,6.5,,5.0,,4.7,,3.7,,,3.9,,
3,A1C,,5.5,5.6,,5.7,,5.6,,5.5,,,5.3,,
4,KLM,,1.2,,0.6,,0.8,0.7,,,0.9,,,1.0,
4,TP,,7.6,,7.8,,7.6,7.3,,,7.7,,,7.7,
5,KLM,,0.7,,0.8,,1.0,0.8,,0.5,,,1.1,,
5,KLM,,0.03,,0.03,,0.04,0.04,,0.02,,,0.04,,
5,TP,,7.0,,7.4,,7.3,7.6,,7.3,,,7.5,,
5,XYZ,,4.73,,4.48,,4.49,4.40,,,4.59,,,4.63,
Listing 3.8 displays the contents of the file joinlines.sh
, which illustrates how to merge the pairs of matching lines in test1.csv.
inputfile="test1.csv"
outputfile="joinedlines.csv"
tmpfile2="tmpfile2"

# patterns to match:
klm1="1,KLM,"
klm5="5,KLM,"
xyz1="1,XYZ,"
xyz5="5,XYZ,"

#output:
#klm1,xyz1
#klm5,xyz5

# step 1: match patterns with CSV file:
klm1line="`grep $klm1 $inputfile`"
klm5line="`grep $klm5 $inputfile`"
xyz1line="`grep $xyz1 $inputfile`"

# take only the first line that matches $xyz5:
grep $xyz5 $inputfile > $tmpfile2
xyz5line="`head -1 $tmpfile2`"

echo "klm1line: $klm1line"
echo "klm5line: $klm5line"
echo "xyz1line: $xyz1line"
echo "xyz5line: $xyz5line"

# step 2: create summary file:
echo "$klm1line" | tr -d '\n' > $outputfile
echo "$xyz1line" >> $outputfile
echo "$klm5line" | tr -d '\n' >> $outputfile
echo "$xyz5line" >> $outputfile
echo; echo
Launching the shell script in Listing 3.8 creates the file joinedlines.csv, whose contents are here:
1,KLM,,1.4,,0.8,,1.2,,1.1,,,2.2,,,1.41,XYZ,,4.03,3.96,,3.99,,3.84,4.12,,,,4.04,,
5,KLM,,0.7,,0.8,,1.0,0.8,,0.5,,,1.1,,5,KLM,,0.03,,0.03,,0.04,0.04,,0.02,,,0.04,,5,XYZ,,4.73,,4.48,,4.49,4.40,,,4.59,,,4.63,
As you can see, the task in this section is very easily solved via the grep
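command. Note that additional data cleaning is required in order to handle the empty fields in the output, and that the second output line contains both of the rows that match 5,KLM, because that pattern matches two lines in test1.csv.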
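As one possible refinement (a sketch, not the approach used in Listing 3.8), the patterns can be anchored to the start of the line, only the first match can be kept for every key, and a comma can be inserted between the two merged lines. The helper name first_match and the temporary file are assumptions introduced for this example; the rows are copied from test1.csv:

```shell
# A few rows copied from test1.csv, stored in a temporary file.
csv=$(mktemp)
cat > "$csv" <<'EOF'
1,KLM,,1.4,,0.8,,1.2,,1.1,,,2.2,,,1.4
1,XYZ,,4.03,3.96,,3.99,,3.84,4.12,,,,4.04,,
5,KLM,,0.7,,0.8,,1.0,0.8,,0.5,,,1.1,,
5,KLM,,0.03,,0.03,,0.04,0.04,,0.02,,,0.04,,
5,XYZ,,4.73,,4.48,,4.49,4.40,,,4.59,,,4.63,
EOF

# first_match (hypothetical helper): print the first line of file $2
# that starts with the key $1.
first_match() {
  grep "^$1" "$2" | head -n 1
}

# For each id, join the KLM line and the XYZ line with a comma.
joined=$(
  for id in 1 5; do
    printf '%s,%s\n' "$(first_match "$id,KLM," "$csv")" \
                     "$(first_match "$id,XYZ," "$csv")"
  done
)

echo "$joined"
rm -f "$csv"
```

With this variant each output line contains exactly one KLM row followed by one XYZ row, although the empty fields still need to be cleaned up afterward.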
This chapter showed you how to work with the grep
utility, which is a very powerful Unix command for searching text files for strings. You saw various options for the grep
command, and examples of how to use those options to find string patterns in text files.
Next you learned about egrep
, which is a variant of the grep
command that can simplify and also extend the basic functionality of grep
, and you saw when you might choose one over the other.
Finally, you learned how to use key values in one text file to search for matching lines of text in another file, and perform join-like operations using the grep
command.