Searching for Patterns Split Across Lines

[Section 13.9 introduced a script called cgrep, a general-purpose, grep-like program built with sed. It allows you to look for one or more words that appear on one line or across several lines. This article explains the sed tricks that are necessary to do this kind of thing. It gets into territory that is essential for any advanced applications of this obscure yet wonderful editor. Section 34.14 through Section 34.17 have background information. — JP]

Let's review the two examples from Section 13.9. The first command below finds all lines containing the word "system" in the file main.c and shows 10 additional lines of context above and below each match. The second command finds all occurrences of the word "awk" where it is followed by the word "perl" somewhere within the next 3 lines (since no file is named, it reads standard input):

cgrep -10 system main.c
cgrep -3 "awk.*perl"

Now the script, followed by an explanation of how it works:

See also: case (Section 35.11), expr (Section 36.21), shift (Section 35.22), ${?} (Section 36.7), \~..~ (Section 34.8), "$@" (Section 35.20)

#!/bin/sh
#  cgrep - multiline context grep using sed
#  Usage: cgrep [-context] pattern [file...]

n=3
case $1 in -[1-9]*)
    n=`expr 1 - "$1"`
    shift
esac
re=${1?}; shift

sed -n "
    1b start
    : top
    \~$re~{
        h; n; p; H; g
        b endif
    }
        N
        : start
        //{ =; p; }
    : endif
    $n,\$D
    b top
" "$@"

The sed script is embedded in a bare-bones shell wrapper (Section 35.19) to parse out the initial arguments because, unlike awk and perl, sed cannot directly access command-line parameters. If the first argument looks like a -context option, variable n is reset to one more than the number of lines specified, using a little trick: the argument is treated as a negative number and subtracted from 1, so -10 gives 1 - (-10) = 11. The pattern argument is then stored in $re, with the ${1?} syntax causing the shell to abort with an error message if no pattern was given. Any remaining arguments are passed as filenames to the sed command.
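The argument handling can be tried on its own. The sketch below repeats the same case/expr/shift logic inside a function called parse_context; that name is hypothetical and not part of cgrep, and the function exists only to show the arithmetic. The last line shows ${1?} aborting a subshell when no pattern is supplied.

```shell
#!/bin/sh
# parse_context is a hypothetical helper for illustration only;
# it repeats cgrep's option handling and prints the resulting n.
parse_context() {
    n=3
    case $1 in -[1-9]*)
        # "-10" is treated as a negative number: 1 - -10 = 11
        n=`expr 1 - "$1"`
        shift
    esac
    echo "$n"
}

parse_context -10 system main.c    # prints 11
parse_context system main.c        # prints 3 (the default)

# ${1?} aborts the shell when $1 is unset; a subshell keeps the
# demonstration itself running.
( set --; re=${1?}; echo "not reached" ) 2>/dev/null || echo "no pattern: aborted"
```

Because n ends up one larger than the number the user typed, the default of n=3 corresponds to two lines of context.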

So that the $re and $n parameters can be embedded, the sed script is enclosed in double quotes (Section 27.12). We use the -n option because we don't want to print out every line by default, and because we need to use the n command in the script without its side effect of outputting a line.
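To see what sed actually receives, the shell's expansion can be inspected by printing a double-quoted fragment. This is a small sketch with example values for n and re; the variable name fragment and the shortened script text are only for illustration.

```shell
#!/bin/sh
n=11
re=system

# Inside double quotes the shell substitutes $n and $re, turns \$ into a
# literal $, and leaves \~ untouched, so sed would see "\~system~p" and
# "11,$D".
fragment="\~$re~p; $n,\$D"
printf '%s\n' "$fragment"    # prints \~system~p; 11,$D
```

This is also why the last-line address has to be written as \$ in the script: an unescaped $D would be taken as a reference to a shell variable named D.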

The sed script itself looks rather unstructured (it was actually designed using a flowchart), but the basic algorithm is easy enough to understand. We keep a "window" of up to n lines in the pattern space (the newest line plus the requested number of context lines, since n was set to one more than that number) and scroll this window through the input stream. If an occurrence of the pattern comes into the window, the entire window is printed (providing the lines of previous context), and each subsequent line is printed until the pattern scrolls out of view again (providing the lines of following context). The sed idiom N;D is used to advance the window, with the D not kicking in until the first n lines of input have been accumulated.
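The window idiom is easiest to see in isolation. The following sketch is not taken from cgrep; it is the familiar sed one-liner that keeps a window of three lines and prints whatever is left in it at the end of input, so it behaves like tail -3. The function name last3 is made up for the example.

```shell
#!/bin/sh
# last3: keep a sliding window of three lines in the pattern space.
#   $q     on the last input line, print the window and quit
#   N      append the next input line to the window
#   4,$D   from line 4 on, drop the oldest line and restart the script
#          without reading new input
#   b top  until then, loop back so that N keeps filling the window
last3() {
    sed -e ':top' -e '$q' -e 'N' -e '4,$D' -e 'b top' "$@"
}

printf '%s\n' 1 2 3 4 5 6 | last3    # prints 4, 5 and 6, one per line
```

As in cgrep, the address on D (here 4,$) is what delays the scrolling until the window has filled up.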

The core of the script is basically an if-then-else construct that decides whether we are currently "in context." (The regular expression here is delimited by tilde (~) characters because tildes are less likely to occur in the user-supplied pattern than slashes.) If we are still in context, then the next line of input is read and output, temporarily using the hold space to save the window (and effectively doing an N in the process). Else we append the next input line (N) and search for the pattern again (an empty regular expression means to reuse the last pattern). If it's now found, the pattern must have just come into view — so we print the current line number followed by the contents of the window. Subsequent iterations will take the "then" branch until the pattern scrolls out of the window.
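The hold-space sequence in the "then" branch can also be tried alone. In this sketch the function name grow, the two input lines, and the final s command are assumptions added only to make the result visible. The new line is printed once by p, and the window ends up one line longer, just as if N had been used.

```shell
#!/bin/sh
# h  save the window in the hold space
# n  replace the pattern space with the next input line
#    (nothing is printed here because of -n)
# p  print just that new line
# H  append it to the saved window
# g  bring the enlarged window back into the pattern space
# The trailing s///gp joins the window with "+" to show its contents.
grow() {
    sed -n 'h; n; p; H; g; s/\n/+/gp' "$@"
}

printf '%s\n' first second | grow
# prints:
#   second
#   first+second
```

A plain N followed by p would have printed the whole window again, which is why the script takes the detour through the hold space.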

— GU