Getting a List of Nonmatching Files

You can use the grep (Section 13.2) option -c to tell you how many lines in a given file match a pattern, so you can also use it to find files that don't contain the pattern at all (i.e., a count of zero). This is a handy technique to package into a shell script.

Using grep -c

Let's say you're indexing a DocBook (SGML) document and you want to make a list of files that don't yet contain indexing tags. What you need to find are files with zero occurrences of the string <indexterm>. (If your tags might be uppercase, you'll also want the -i option (Section 9.22).) The following command:

% grep -c "<indexterm>" chapter*

might produce the following output:

chapter1.sgm:10
chapter2.sgm:27
chapter3.sgm:19
chapter4.sgm:0
chapter5.sgm:39
   ...
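One detail worth seeing in action: grep -c counts matching lines, not individual matches, so two tags on one line count once. That makes no difference for the zero case this article cares about. The following is a small sketch, assuming mktemp -d is available; the file names are made up for the demonstration.

```shell
# Scratch directory with two hypothetical files.
dir=$(mktemp -d)
cd "$dir" || exit 1

# One line holding two tags, plus one line with none.
printf '<indexterm>a</indexterm> <indexterm>b</indexterm>\nplain text\n' > two_on_one_line.sgm
# A file with no tags at all.
printf 'plain text\n' > none.sgm

# grep -c reports the number of matching LINES per file.
counts=$(grep -c "<indexterm>" two_on_one_line.sgm none.sgm)
printf '%s\n' "$counts"
# two_on_one_line.sgm:1
# none.sgm:0
```

A file with no matching lines always reports 0, which is all the rest of this technique relies on.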

This is all well and good, but suppose you need to check index entries in hundreds of reference pages. Well, just filter grep's output by piping it through another grep. The previous command can be modified as follows:

% grep -c "<indexterm>" chapter* | grep :0

This results in the following output:

chapter4.sgm:0

Using sed (Section 34.1) to strip the trailing :0, you can save the output as a list of files. For example, here's a trick for creating a list of files that don't contain index tags:

% grep -c "<indexterm>" * | sed -n 's/:0$//p' > ../not_indexed.list

The sed -n command prints only the lines that end in :0, and it strips the :0 from each one, so ../not_indexed.list contains a list of files, one per line. For a bit of extra safety, we've added a $ anchor (Section 32.5) to be sure sed matches only a 0 at the end of a line and not, say, a :0 inside some bizarre filename. (We've quoted (Section 27.12) the $ for safety, though it's not really necessary in most shells because $/ can't be a shell variable reference.) The .. pathname (Section 1.16) puts the not_indexed.list file into the parent directory. This is one easy way to keep grep from searching that file, but it may not be worth the bother.
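To see why the anchor matters, here is a hedged comparison of the unanchored grep :0 filter with the anchored sed command. It is only a sketch: the deliberately awkward file name odd:0name.sgm is invented for the test, and mktemp -d is assumed to exist on your system.

```shell
# Scratch directory with hypothetical file names.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '<indexterm>x</indexterm>\n' > 'odd:0name.sgm'   # contains the tag
printf 'no tags here\n' > plain.sgm                      # does not

# Unanchored filter: also matches the ":0" inside the odd file name.
loose=$(grep -c "<indexterm>" * | grep :0)

# Anchored filter: matches only a count of exactly 0 at end of line.
strict=$(grep -c "<indexterm>" * | sed -n 's/:0$//p')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

The loose filter reports both files, including the one that does contain the tag; the anchored sed command reports only plain.sgm.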

To edit all files that need index tags added, you could type this:

% vi `grep -c "<indexterm>" * | sed -n 's/:0$//p'`

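This command looks more natural once you've used backquotes for a while: the shell runs the pipeline inside them first, then passes its output to vi as filename arguments.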
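If you'd like to check what the backquotes will hand to the editor before launching it, one option is to capture the arguments with set -- instead of running vi. This is only a sketch with made-up chapter files; it also assumes the file names contain no spaces, because the shell splits the backquoted output into words.

```shell
# Scratch directory with three hypothetical chapter files.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '<indexterm>a</indexterm>\n' > ch1.sgm   # already indexed
printf 'nothing yet\n' > ch2.sgm                # not indexed
printf 'nothing yet\n' > ch3.sgm                # not indexed

# The shell runs the pipeline, splits its output into words, and uses
# the words as arguments. Here they become $1, $2, ... instead of
# being passed to vi.
set -- `grep -c "<indexterm>" * | sed -n 's/:0$//p'`

nfiles=$#
files="$*"
echo "vi would get $nfiles files: $files"
# vi would get 2 files: ch2.sgm ch3.sgm
```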

The vgrep Script

You can put the grep -c technique into a little script named vgrep with a couple of safety features added. (For "$@", see Section 35.20. The vgrep script is available from http://examples.oreilly.com/upt3.)

#!/bin/sh
case $# in
0|1) echo "Usage: `basename $0` pattern file [files...]" 1>&2; exit 2 ;;
2)  # Given a single filename, grep returns a count with no colon or name.
    grep -c -e "$1" "$2" | sed -n "s|^0\$|$2|p"
    ;;
*)  # With more than one filename, grep returns "name:count" for each file.
    pat="$1"; shift
    grep -c -e "$pat" "$@" | sed -n "s|:0\$||p"
    ;;
esac

Now you can type, for example:

% vi `vgrep "<indexterm>" *`

One of the script's safety features works around a problem that happens if you pass grep just one filename. In that case, most versions of grep won't print the file's name, just the number of matching lines. So the first sed command replaces a line consisting of just 0 with the filename.
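The single-file behavior is easy to confirm by hand. The sketch below uses invented file names in a scratch directory (assuming mktemp -d is available) and repeats the script's first sed command with a hypothetical variable named file standing in for $2.

```shell
# Scratch directory with two hypothetical files, neither containing the tag.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'nothing yet\n' > lonely.sgm
printf 'nothing yet\n' > other.sgm

# One file: grep -c prints only the count.
one=$(grep -c -e "<indexterm>" lonely.sgm)
echo "$one"
# 0

# Two files: grep -c prints name:count for each.
two=$(grep -c -e "<indexterm>" lonely.sgm other.sgm)
echo "$two"
# lonely.sgm:0
# other.sgm:0

# The script's fix for the one-file case: turn a bare 0 into the name.
file=lonely.sgm
fixed=$(grep -c -e "<indexterm>" "$file" | sed -n "s|^0\$|$file|p")
echo "$fixed"
# lonely.sgm
```

Note that the filename is pasted into the sed replacement text, so a name containing |, & or a backslash would confuse this particular substitution.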

The second safety feature is the grep -e option. It tells grep that the following argument is the search pattern, even if that pattern looks like an option because it starts with a dash (-). This lets you type commands like vgrep -0123 * to find files that don't contain the string -0123.
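Here is a small sketch of the -e feature at work, using two invented files in a scratch directory (again assuming mktemp -d). Without -e, grep would likely try to interpret -0123 as options rather than as the pattern, so the sketch doesn't attempt that form.

```shell
# Scratch directory with two hypothetical files.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'part -0123 in stock\n' > has.txt
printf 'nothing relevant\n' > lacks.txt

# -e marks the next argument as the pattern, leading dash and all.
with_e=$(grep -c -e "-0123" has.txt lacks.txt)
echo "$with_e"
# has.txt:1
# lacks.txt:0

# Same filtering step that vgrep applies to the counts.
missing=$(printf '%s\n' "$with_e" | sed -n 's/:0$//p')
echo "$missing"
# lacks.txt
```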

—DG and JP