Back to log parsing

Of course, the previous example could easily have been coded using a couple of integer variables and calling x += increment on them. Let's look at a second example where coroutines actually save us some code. This example is a somewhat simplified (for pedagogical reasons) version of a problem I had to solve while working at Facebook.

The Linux kernel log contains lines that look almost, but not quite entirely, unlike this:

unrelated log messages 
sd 0:0:0:0 Attached Disk Drive 
unrelated log messages 
sd 0:0:0:0 (SERIAL=ZZ12345) 
unrelated log messages 
sd 0:0:0:0 [sda] Options 
unrelated log messages 
XFS ERROR [sda] 
unrelated log messages 
sd 2:0:0:1 Attached Disk Drive 
unrelated log messages 
sd 2:0:0:1 (SERIAL=ZZ67890) 
unrelated log messages 
sd 2:0:0:1 [sdb] Options 
unrelated log messages 
sd 3:0:1:8 Attached Disk Drive 
unrelated log messages 
sd 3:0:1:8 (SERIAL=WW11111) 
unrelated log messages 
sd 3:0:1:8 [sdc] Options 
unrelated log messages 
XFS ERROR [sdc] 
unrelated log messages

There are a whole bunch of interspersed kernel log messages, some of which pertain to hard disks. The hard disk messages might be interspersed with other messages, but they occur in a predictable format and order. For each, a specific drive with a known serial number is associated with a bus identifier (such as 0:0:0:0). A block device identifier (such as sda) is also associated with that bus. Finally, if the drive has a corrupt filesystem, it might fail with an XFS error.

Now, given the preceding log file, the problem we need to solve is how to obtain the serial number of any drives that have XFS errors on them. This serial number might later be used by a data center technician to identify and replace the drive.

We know we can identify the individual lines using regular expressions, but we'll have to change the regular expressions as we loop through the lines, since we'll be looking for different things depending on what we found previously. The other difficult bit is that if we find an error string, the information about which bus contains that string as well as the serial number have already been processed. This can easily be solved by iterating through the lines of the file in reverse order.

Before you look at this example, be warned—the amount of code required for a coroutine-based solution is scarily small:

import re


def match_regex(filename, regex):
    with open(filename) as file:
        lines = file.readlines()
    for line in reversed(lines):
        match = re.match(regex, line)
        if match:
            regex = yield match.groups()[0]


def get_serials(filename):
    ERROR_RE = "XFS ERROR (\[sd[a-z]\])"
    matcher = match_regex(filename, ERROR_RE)
    device = next(matcher)
    while True:
        try:
            bus = matcher.send(
                "(sd \S+) {}.*".format(re.escape(device))
            )
            serial = matcher.send("{} \(SERIAL=([^)]*)\)".format(bus))
 yield serial
            device = matcher.send(ERROR_RE)
        except StopIteration:
            matcher.close()
            return


for serial_number in get_serials("EXAMPLE_LOG.log"):
    print(serial_number)

This code neatly divides the job into two separate tasks. The first task is to loop over all the lines and spit out any lines that match a given regular expression. The second task is to interact with the first task and give it guidance as to what regular expression it is supposed to be searching for at any given time.

Look at the match_regex coroutine first. Remember, it doesn't execute any code when it is constructed; rather, it just creates a coroutine object. Once constructed, someone outside the coroutine will eventually call next() to start the code running. Then it stores the state of two variables filename and regex. It then reads all the lines in the file and iterates over them in reverse. Each line is compared to the regular expression that was passed in until it finds a match. When the match is found, the coroutine yields the first group from the regular expression and waits.

At some point in the future, other code will send in a new regular expression to search for. Note that the coroutine never cares what regular expression it is trying to match; it's just looping over lines and comparing them to a regular expression. It's somebody else's responsibility to decide what regular expression to supply.

In this case, that somebody else is the get_serials generator. It doesn't care about the lines in the file; in fact, it isn't even aware of them. The first thing it does is create a matcher object from the match_regex coroutine constructor, giving it a default regular expression to search for. It advances the coroutine to its first yield and stores the value it returns. It then goes into a loop that instructs the matcher object to search for a bus ID based on the stored device ID, and then a serial number based on that bus ID.

It idly yields that serial number to the outside for loop before instructing the matcher to find another device ID and repeat the cycle.

Basically, the coroutine's job is to search for the next important line in the file, while the generator's (get_serial, which uses the yield syntax without assignment) job is to decide which line is important. The generator has information about this particular problem, such as what order lines will appear in the file. The coroutine, on the other hand, could be plugged into any problem that required searching a file for given regular expressions.