Chapter 7. Practical Reference Tricks

This chapter looks at optimizing sorting and dealing with recursively defined data.

Review of Sorting

Perl’s built-in sort operator sorts text strings in their natural text order, by default.^[29]

This is fine if you want to sort text strings:

my @sorted = sort qw(Gilligan Skipper Professor Ginger Mary_Ann);

but gets pretty messy when you want to sort numbers:

my @wrongly_sorted = sort 1, 2, 4, 8, 16, 32;

The resulting list is 1, 16, 2, 32, 4, 8. Why didn’t sort order these properly? It treats each item as a string and sorts them in string order. Any string that begins with 3 sorts before any string that begins with 4.

You can fix this by overriding how Perl compares pairs of items in the list. By default, as Perl orders the items, a string comparison is used. A new comparison is specified using a sort block, placed between the sort keyword and the list of things to sort.^[30]

Within the sort block, $a and $b stand in for two of the items to be sorted. The last evaluated expression must return a -1, 0, or +1 value.^[31] If the value is -1, the value currently in $a must appear before the value in $b in the final sorted list. If the value is +1, then the value in $a must appear after the value in $b in the final sorted list. If the value is 0, you don’t know or can’t tell, so the results are unpredictable.^[32]

For example, to sort those numbers in their proper order, you can use a sort block comparing $a and $b, like so:

my @numerically_sorted = sort {
  if ($a < $b)    { -1 }
  elsif ($a > $b) { +1 }
  else            {  0 }
} 1, 2, 4, 8, 16, 32;

Now you have a proper numeric comparison, so you have a proper numeric sort. Of course, this is far too much typing, so you can use the spaceship operator instead:

my @numerically_sorted = sort { $a <=> $b } 1, 2, 4, 8, 16, 32;

The spaceship operator returns -1, 0, and +1, according to the rules discussed. A descending sort is as simple as reversing the position of $a and $b:

my @numerically_descending = sort { $b <=> $a } 1, 2, 4, 8, 16, 32;

In every place the previous sort expression returned -1, this expression returns +1, and vice versa. Thus, the sort is in the opposite order. It’s also easy to remember because if $a is to the left of $b, you get out the lower items first, just like a and b would be in the resulting list.

Likewise, a string sort can be indicated with cmp, although this is used less often because it is the default comparison. The cmp operator is handier when you have a more complex comparison, as you’ll see shortly.

Sorting with Indices

In the same way you used indices to solve a few problems with grep and map back in Chapter 5, you can also use indices with sort to get some interesting results. For example, let’s sort the list of names from earlier:

my @sorted = sort qw(Gilligan Skipper Professor Ginger Mary_Ann);
print "@sorted\n";

which necessarily results in:

Gilligan Ginger Mary_Ann Professor Skipper

But what if you wanted to look at the original list and determine which element of the original list now appears as the first, second, third, and so on, element of the sorted list? For example, Ginger is the second element of the sorted list and was the fourth element of the original list. How do you determine that the second element of the final list was the fourth element of the original list?

Well, you can apply a bit of indirection. Let’s not sort the actual names but rather the indices of each name:

my @input = qw(Gilligan Skipper Professor Ginger Mary_Ann);
my @sorted_positions = sort { $input[$a] cmp $input[$b] } 0..$#input;
print "@sorted_positions\n";

This prints 0 3 4 2 1, which means that the first element of the sorted list is element 0 of the original list, Gilligan. The second element of the sorted list is element 3 of the original list, which is Ginger, and so on. Now you can rank information rather than just move the names around.

Actually, you have the inverse of the rank. You still don’t know for a given name in the original list about which position it occupies in the output list. But with a bit more magic, you can get there as well:

my @input = qw(Gilligan Skipper Professor Ginger Mary_Ann);
my @sorted_positions = sort { $input[$a] cmp $input[$b] } 0..$#input;
my @ranks;
@ranks[@sorted_positions] = (0..$#sorted_positions);
print "@ranks\n";

The code prints 0 4 3 1 2. This means that Gilligan is position 0 in the output list, Skipper is position 4, Professor is position 2, and so on. The positions here are 0-based, so add 1 to get “human” ordinal values. One way to cheat is to use 1..@sorted_positions instead of 0..$#sorted_positions, so a way to dump it all out might look like:

my @input = qw(Gilligan Skipper Professor Ginger Mary_Ann);
my @sorted_positions = sort { $input[$a] cmp $input[$b] } 0..$#input;
my @ranks;
@ranks[@sorted_positions] = (1..@sorted_positions);
for (0..$#ranks) {
  print "$input[$_] sorts into position $ranks[$_]\n";
}

This results in:

Gilligan sorts into position 1
Skipper sorts into position 5
Professor sorts into position 4
Ginger sorts into position 2
Mary_Ann sorts into position 3

Sorting Efficiently

As the Professor tries to maintain the community computing facility (built entirely out of bamboo, coconuts, and pineapples, and powered by a live monkey), he continues to discover that people are leaving entirely too much data on the single monkey-powered filesystem and decides to print a list of offenders.

The Professor has written a subroutine called ask_monkey_about( ) which is given a castaway’s name and returns the number of pineapples of storage they use. You have to ask the monkey because he’s in charge of the pineapples. An initial naive approach to find the offenders from greatest to least might be something like:

my @castaways =
  qw(Gilligan Skipper Professor Ginger Mary_Ann Thurston Lovey);
my @wasters = sort {
  ask_monkey_about($b) <=> ask_monkey_about($a)
} @castaways;

In theory, this would be fine. For the first pair of names (Gilligan and Skipper), you ask the monkey “how many pineapples does Gilligan have?” and “how many pineapples does Skipper have?” You get back two values from the monkey and use them to order Gilligan and Skipper in the final list.

However, at some point, you have to compare the number of pineapples that Gilligan has with another castaway as well. For example, suppose the pair is Ginger and Gilligan. Ask the monkey about Ginger, get a number back, and then ask the monkey about Gilligan...again. This will probably annoy the monkey a bit, since you already asked earlier. But you need to ask for each value two, three, or maybe even four times just to put the seven values into order.

This can be a problem because it irritates the monkey.

How do you keep the number of monkey requests to a minimum? Well, you can build a table first. Use a map with seven inputs and seven outputs, turning each castaway item into a separate array reference, with each referenced array consisting of the castaway name and the pineapple count reported by the monkey:

my @names_and_pineapples = map {
  [ $_, ask_monkey_about($_) ]
} @castaways;

At this point, you asked the monkey seven questions in a row, but that’s the last time you have to talk to the monkey! You now have everything you need to finish the task.

For the next step, sort the arrayrefs, ordering them by the monkey-returned value:

my @sorted_names_and_pineapples = sort {
  $b->[1] <=> $a->[1];
} @names_and_pineapples;

In this subroutine, $a and $b are still two elements from the list of things to be sorted. When you’re sorting numbers, $a and $b are numbers; when you’re sorting references, $a and $b are references. Dereference them to get to the corresponding array itself, and pick out item 1 from the array (the monkey’s pineapple value). Because $b appears to the left of $a, it’ll be a descending sort as well. (You want a descending sort because the Professor wants the first name on the list to be the person who uses the most pineapples.)

You’re almost done, but what if you just wanted the top names, rather than the names and pineapple counts? You merely need to perform another map to transform the references back to the original data:

my @names = map $_->[0], @sorted_names_and_pineapples;

Each element of the list ends up in $_, so you’ll dereference that to pick out the element 0 of that array, which is just the name.

Now you have a list of names, ordered by their pineapple counts, and a calm monkey, all in three easy steps.

The Schwartzian Transform

The intermediate variables between each of these steps were not necessary, except as input to the next step. You can save yourself some brainpower by just stacking all the steps together:

my @names =
  map $_->[0],
  sort { $b->[1] <=> $a->[1] }
  map [ $_, ask_monkey_about($_) ],
  @castaways;

Because the map and sort operators are right to left, you have to read this construct from the bottom up. Take a list of @castaways, create some arrayrefs by asking the monkey a simple question, sort the list of arrayrefs, and then extract the names from each arrayref. This gives you the list of names in the desired order.

This construct is commonly called the Schwartzian Transform, which was named after me (but not by me), thanks to a Usenet posting I made many years ago. It has proven to be a very nice thing to have in your bag of sorting tricks.

If this transform looks like it might be too complex to memorize or come up with from first principles, it might help to look at the flexible and constant parts:

my @output_data =
  map $_->[0],
  sort { SORT COMPARISON USING $a->[1] AND $b->[1] }
  map [ $_, EXPENSIVE FUNCTION OF $_ ],
  @input_data;

The basic structure maps the original list into a list of arrayrefs, computing the expensive function only once for each; sorts those array refs, looking at the cached value of each expensive function invocation;^[33] and then extracts the original values back out in the new order. All you have to do is plug in the proper two operations, and you’re done.

Recursively Defined Data

While the data you’ve processed with references up to this point has been rather fixed-structure, sometimes you have to deal with hierarchical data, which is often defined recursively.

For example, a company organization chart has managers with direct reports, some of whom may also be managers themselves, or an HTML table that has rows containing cells—and some of those cells may also contain entire tables. The chart could also show a filesystem consisting of directories containing files and other directories.

You can use references to acquire, store, and process such hierarchical information. Frequently, the routines to manage the data structures end up as recursive subroutines. A recursive subroutine has a branch from which it calls itself to handle a portion of the task, and a branch that doesn’t call itself to handle the base cases.

For example, a recursive subroutine handling the factorial function, which is one of the simplest recursive functions, might look like:

sub factorial {
  my $n = shift;
  if ($n <= 1) {
    return 1;
  } else {
    return $n * factorial($n - 1);
  }
}

Here you have a base case where $n is less than or equal to 1, that does not invoke the recursive instance, along with a recursive case for $n greater than 1, which calls the routine to handle a portion of the problem (i.e., compute the factorial of the next lower number).

This task would probably be solved better using iteration rather than recursion, even though the classic definition of factorial is often given as a recursive operation.

Building Recursively Defined Data

Suppose you wanted to capture information about a filesystem, including the filenames and directory names, and their included contents. Represent a directory as a hash, in which the keys are the names of the entries within the directory, and values are undef for plain files. A sample /bin directory might look like:

my $bin_directory = {
  "cat" => undef,
  "cp" => undef,
  "date" => undef,
  ... and so on ...
};

Similarly, the Skipper’s home directory might also contain a personal bin directory (at something like ~skipper/bin) that contains personal tools:

my $skipper_bin = {
  "navigate" => undef,
  "discipline_gilligan" => undef,
  "eat" => undef,
};

nothing in either structure tells where the directory is located in the hierarchy. It just represents the contents of some directory.

Go up one level to the Skipper’s home directory, which is likely to contain a few files along with the personal bin directory:

my $skipper_home = {
  ".cshrc" => undef,
  "Please_rescue_us.pdf" => undef,
  "Things_I_should_have_packed" => undef,
  "bin" => $skipper_bin,
};

Ahh, notice that you have three files, but the fourth entry "bin" doesn’t have undef for a value but rather the hash reference created earlier for the Skipper’s personal bin directory. This is how you indicate subdirectories. If the value is undef, it’s a plain file; if it’s a hash reference, you have a subdirectory, with its own files and subdirectories. Of course, you can have combined these two initializations:

my $skipper_home = {
  ".cshrc" => undef,
  "Please_rescue_us.pdf" => undef,
  "Things_I_should_have_packed" => undef,
  "bin" => {
    "navigate" => undef,
    "discipline_gilligan" => undef,
    "eat" => undef,
  },
};

Now the hierarchical nature of the data starts to come into play.

Obviously, you don’t want to create and maintain a data structure by changing literals in the program. You should fetch the data by using a subroutine. Write a subroutine that for a given pathname returns undef if the path is a file, or a hash reference of the directory contents if the path is a directory. The base case of looking at a file is the easiest, so let’s write that:

sub data_for_path {
  my $path = shift;
  if (-f $path) {
    return undef;
  }
  if (-d $path) {
    ...
  }
  warn "$path is neither a file nor a directory\n";
  return undef;
}

If the Skipper calls this on .cshrc, he’ll get back an undef value, indicating that a file was seen.

Now for the directory part. You need a hash reference to be returned, which you declare as a named hash inside the subroutine. For each element of the hash, you call yourself to populate the value of that hash element. It goes something like this:

sub data_for_path {
  my $path = shift;
  if (-f $path or -l $path) {        # files or symbolic links
    return undef;
  }
  if (-d $path) {
    my %directory;
    opendir PATH, $path or die "Cannot opendir $path: $!";
    my @names = readdir PATH;
    closedir PATH;
    for my $name (@names) {
      next if $name eq "." or $name eq "..";
      $directory{$name} = data_for_path("$path/$name");
    }
    return \%directory;
  }
  warn "$path is neither a file nor a directory\n";
  return undef;
}

For each file within the directory being examined, the response from the recursive call to data_for_path is undef. This populates most elements of the hash. When the reference to the named hash is returned, the reference becomes a reference to an anonymous hash because the name immediately goes out of scope. (The data itself doesn’t change, but the number of ways in which you can access the data changes.)

If there is a subdirectory, the nested subroutine call uses readdir to extract the contents of that directory and returns a hash reference, which is inserted into the hash structure created by the caller.

At first, it may look a bit mystifying, but if you walk through the code slowly, you’ll see it’s always doing the right thing.

Test the results of this subroutine by calling it on . (the current directory) and seeing the result:

use Data::Dumper;
print Dumper(data_for_path("."));

Obviously, this will be more interesting if your current directory contains subdirectories.

Displaying Recursively Defined Data

The Dumper routine of Data::Dumper displays the output nicely, but what if you don’t like the format being used? You can write a routine to display the data. Again, for recursively defined data, a recursive subroutine is usually the key.

To dump the data, you need to know the name of the directory at the top of the tree because that’s not stored within the structure:

sub dump_data_for_path {
  my $path = shift;
  my $data = shift;
  if (not defined $data) { # plain file
    print "$path\n";
    return;
  }
  ...
}

For a plain file, dump the pathname; for a directory, $data is a hash reference. Let’s walk through the keys and dump the values:

sub dump_data_for_path {
  my $path = shift;
  my $data = shift;
  if (not defined $data) { # plain file
    print "$path\n";
    return;
  }
  my %directory = %$data;
  for (sort keys %directory) {
    dump_data_for_path("$path/$_", $directory{$_});
  }
}

For each element of the directory, you pass a path consisting of the incoming path followed by the current directory entry, and the data pointer is either undef for a file or a subdirectory hash reference for another directory. You can see the results by running:

dump_data_for_path(".", data_for_path("."));

Again, this is more interesting in a directory that has subdirectories, but the output should be similar to:

find . -print

from the shell prompt.

Exercises

The answers for all exercises can be found in Section A.6.

Exercise 1 [15 min]

Using the glob operator, a naive sort of every name in the /bin directory by their relative sizes might be written as:

my @sorted = sort { -s $a <=> -s $b } glob "/bin/*";

Rewrite this using the Schwartzian Transform technique.

If you don’t have many files in the /bin directory, perhaps because you don’t have a Unix machine, change the argument to glob as needed.

Exercise 2 [15 min]

Read up on the Benchmark module, included with Perl. Write a program that will answer the question, “How much does using the Schwartzian Transform speed up the task of Exercise 1?”

Exercise 3 [10 min]

Using a Schwartzian Transform, read a list of words, and sort them in “dictionary order.” Dictionary order ignores all capitalization and internal punctuation. Hint: The following transformation might be useful:

my $string = "Mary-Ann";
$string =~ tr/A-Z/a-z/;       # force all lowercase
$string =~ tr/a-z//cd;        # strip all but a-z from the string
print $string;                # prints "maryann"

Be sure you don’t mangle the data! If the input includes the Professor, and The skipper, the output should have them listed in that order, with that capitalization.

Exercise 4 [20 min]

Modify the recursive directory dumping routine so it shows the nested directories through indentation. An empty directory should show up as:

sandbar, an empty directory

while a nonempty directory should appear with nested contents, indented two spaces:

uss_minnow, with contents:
  anchor
  broken_radio
  galley, with contents:
    captain_crunch_cereal
    gallon_of_milk
    tuna_fish_sandwich
  life_preservers

^[29] My friends call that the “ASCIIbetical” ordering. Normally modern Perl doesn’t use ASCII; instead, it uses a default sort order, depending on the current locale and character set; see the perllocale (not perllocal!) manpage.

^[30]A sort block can also be specified as a subroutine name, causing the subroutine to be invoked when a comparison is needed.

^[31]Actually, you can use any negative or positive number in place of -1 and +1, respectively.

^[32]Recent Perl versions include a default sorting engine that is stable, so zero returns from the sort block cause the relative ordering of $a and $b to reflect their order in the original list. Older versions of Perl didn’t guarantee such stability, and a future version might not use a stable sort, so don’t rely on it.

^[33]An expensive operation is one that takes a relatively long time or a relatively large amount of memory.