This chapter looks at optimizing sorting and dealing with recursively defined data.
Perl’s built-in
sort
operator sorts text strings in their natural
text order, by default.[29]
This is fine if you want to sort text strings:
my @sorted = sort qw(Gilligan Skipper Professor Ginger Mary_Ann);
but gets pretty messy when you want to sort numbers:
my @wrongly_sorted = sort 1, 2, 4, 8, 16, 32;
The
resulting list is 1
,
16
, 2
, 32
,
4
, 8
. Why didn’t
sort
order these properly? It treats each item as
a string and sorts them in string order. Any string that begins with
3
sorts before any string that begins with
4
.
You can fix this by overriding how Perl
compares pairs of items in the list. By default, as Perl orders the
items, a string comparison is used. A new comparison is specified
using a sort block, placed between the
sort
keyword and the list of things to
sort.[30]
Within the sort block, $a
and
$b
stand in for two of the items to be sorted. The
last evaluated expression must return a -1
,
0
, or +1
value.[31] If the value is
-1
, the value currently in $a
must appear before the value in $b
in the final
sorted list. If the value is +1
, then the value in
$a
must appear after the value in
$b
in the final sorted list. If the value is 0,
you don’t know or can’t tell, so
the results are unpredictable.[32]
For example, to sort those numbers in their proper order, you can use
a sort block comparing $a
and
$b
, like so:
my @numerically_sorted = sort { if ($a < $b) { -1 } elsif ($a > $b) { +1 } else { 0 } } 1, 2, 4, 8, 16, 32;
Now you have a proper numeric comparison, so you have a proper numeric sort. Of course, this is far too much typing, so you can use the spaceship operator instead:
my @numerically_sorted = sort { $a <=> $b } 1, 2, 4, 8, 16, 32;
The spaceship operator returns
-1
, 0
, and
+1
, according to the rules discussed. A descending
sort is as simple as reversing the position of $a
and $b
:
my @numerically_descending = sort { $b <=> $a } 1, 2, 4, 8, 16, 32;
In every place the previous sort expression returned
-1
, this expression returns +1
,
and vice versa. Thus, the sort is in the opposite order.
It’s also easy to remember because if
$a
is to the left of $b
, you
get out the lower items first, just like a
and
b
would be in the resulting list.
Likewise, a string sort can be
indicated with cmp
, although this is used less
often because it is the default comparison. The
cmp
operator is handier when you have a more
complex comparison, as you’ll see shortly.
In the same way you used indices to solve a
few problems with grep
and map
back in Chapter 5, you can also use indices with
sort
to get some interesting results. For example,
let’s sort the list of names from earlier:
my @sorted = sort qw(Gilligan Skipper Professor Ginger Mary_Ann); print "@sorted\n";
which necessarily results in:
Gilligan Ginger Mary_Ann Professor Skipper
But what if you wanted to look at the original list and determine which element of the original list now appears as the first, second, third, and so on, element of the sorted list? For example, Ginger is the second element of the sorted list and was the fourth element of the original list. How do you determine that the second element of the final list was the fourth element of the original list?
Well, you can apply a bit of indirection. Let’s not sort the actual names but rather the indices of each name:
my @input = qw(Gilligan Skipper Professor Ginger Mary_Ann); my @sorted_positions = sort { $input[$a] cmp $input[$b] } 0..$#input; print "@sorted_positions\n";
This prints 0 3 4 2 1
, which means that the first
element of the sorted list is element 0 of the original list,
Gilligan. The second element of the sorted list is element 3 of the
original list, which is Ginger, and so on. Now you can rank
information rather than just move the names around.
Actually, you have the inverse of the rank. You still don’t know for a given name in the original list about which position it occupies in the output list. But with a bit more magic, you can get there as well:
my @input = qw(Gilligan Skipper Professor Ginger Mary_Ann); my @sorted_positions = sort { $input[$a] cmp $input[$b] } 0..$#input; my @ranks; @ranks[@sorted_positions] = (0..$#sorted_positions); print "@ranks\n";
The code prints 0 4 3 1 2
. This means that
Gilligan is position 0 in the output list, Skipper is position 4,
Professor is position 2, and so on. The positions here are 0-based,
so add 1
to get
“human” ordinal values. One way to
cheat is to use 1..@sorted_positions
instead of
0..$#sorted_positions
, so a way to dump it all out
might look like:
my @input = qw(Gilligan Skipper Professor Ginger Mary_Ann); my @sorted_positions = sort { $input[$a] cmp $input[$b] } 0..$#input; my @ranks; @ranks[@sorted_positions] = (1..@sorted_positions); for (0..$#ranks) { print "$input[$_] sorts into position $ranks[$_]\n"; }
Gilligan sorts into position 1 Skipper sorts into position 5 Professor sorts into position 4 Ginger sorts into position 2 Mary_Ann sorts into position 3
As the Professor tries to maintain the community computing facility (built entirely out of bamboo, coconuts, and pineapples, and powered by a live monkey), he continues to discover that people are leaving entirely too much data on the single monkey-powered filesystem and decides to print a list of offenders.
The Professor has written a subroutine called
ask_monkey_about( )
which is given a
castaway’s name and returns the number of pineapples
of storage they use. You have to ask the monkey because
he’s in charge of the pineapples. An initial naive
approach to find the offenders from greatest to least might be
something like:
my @castaways = qw(Gilligan Skipper Professor Ginger Mary_Ann Thurston Lovey); my @wasters = sort { ask_monkey_about($b) <=> ask_monkey_about($a) } @castaways;
In theory, this would be fine. For the first pair of names (Gilligan and Skipper), you ask the monkey “how many pineapples does Gilligan have?” and “how many pineapples does Skipper have?” You get back two values from the monkey and use them to order Gilligan and Skipper in the final list.
However, at some point, you have to compare the number of pineapples that Gilligan has with another castaway as well. For example, suppose the pair is Ginger and Gilligan. Ask the monkey about Ginger, get a number back, and then ask the monkey about Gilligan...again. This will probably annoy the monkey a bit, since you already asked earlier. But you need to ask for each value two, three, or maybe even four times just to put the seven values into order.
This can be a problem because it irritates the monkey.
How do you keep the number of
monkey requests to a minimum? Well, you can build a table first. Use
a map
with seven inputs and seven outputs, turning
each castaway item into a separate array reference, with each
referenced array consisting of the castaway name and the pineapple
count reported by the monkey:
my @names_and_pineapples = map { [ $_, ask_monkey_about($_) ] } @castaways;
At this point, you asked the monkey seven questions in a row, but that’s the last time you have to talk to the monkey! You now have everything you need to finish the task.
For the next step, sort the arrayrefs, ordering them by the monkey-returned value:
my @sorted_names_and_pineapples = sort { $b->[1] <=> $a->[1]; } @names_and_pineapples;
In this subroutine, $a
and $b
are still two elements from the list of things to be sorted. When
you’re sorting numbers, $a
and
$b
are numbers; when you’re
sorting references, $a
and $b
are references. Dereference them to get to the corresponding array
itself, and pick out item 1 from the array (the
monkey’s pineapple value). Because
$b
appears to the left of $a
,
it’ll be a descending sort as well. (You want a
descending sort because the Professor wants the first name on the
list to be the person who uses the most pineapples.)
You’re almost done, but what if you just wanted the
top names, rather than the names and pineapple counts? You merely
need to perform another map
to transform the
references back to the original data:
my @names = map $_->[0], @sorted_names_and_pineapples;
Each element of the list ends up in $_
, so
you’ll dereference that to pick out the element 0 of
that array, which is just the name.
Now you have a list of names, ordered by their pineapple counts, and a calm monkey, all in three easy steps.
The intermediate variables between each of these steps were not necessary, except as input to the next step. You can save yourself some brainpower by just stacking all the steps together:
my @names = map $_->[0], sort { $b->[1] <=> $a->[1] } map [ $_, ask_monkey_about($_) ], @castaways;
Because the map
and sort
operators are right to left, you have to read this construct from the
bottom up. Take a list of @castaways
, create some
arrayrefs by asking the monkey a simple question, sort the list of
arrayrefs, and then extract the names from each arrayref. This gives
you the list of names in the desired order.
This construct is commonly called the Schwartzian Transform, which was named after me (but not by me), thanks to a Usenet posting I made many years ago. It has proven to be a very nice thing to have in your bag of sorting tricks.
If this transform looks like it might be too complex to memorize or come up with from first principles, it might help to look at the flexible and constant parts:
my @output_data = map $_->[0], sort { SORT COMPARISON USING $a->[1] AND $b->[1] } map [ $_, EXPENSIVE FUNCTION OF $_ ], @input_data;
The basic structure maps the original list into a list of arrayrefs, computing the expensive function only once for each; sorts those array refs, looking at the cached value of each expensive function invocation;[33] and then extracts the original values back out in the new order. All you have to do is plug in the proper two operations, and you’re done.
While the data you’ve processed with references up to this point has been rather fixed-structure, sometimes you have to deal with hierarchical data, which is often defined recursively.
For example, a company organization chart has managers with direct reports, some of whom may also be managers themselves, or an HTML table that has rows containing cells—and some of those cells may also contain entire tables. The chart could also show a filesystem consisting of directories containing files and other directories.
You can use references to acquire, store, and process such hierarchical information. Frequently, the routines to manage the data structures end up as recursive subroutines. A recursive subroutine has a branch from which it calls itself to handle a portion of the task, and a branch that doesn’t call itself to handle the base cases.
For example, a recursive subroutine handling the factorial function, which is one of the simplest recursive functions, might look like:
sub factorial { my $n = shift; if ($n <= 1) { return 1; } else { return $n * factorial($n - 1); } }
Here you have a base case where $n
is less than or
equal to 1, that does not invoke the recursive instance, along with a
recursive case for $n
greater than 1, which calls
the routine to handle a portion of the problem (i.e., compute the
factorial of the next lower number).
This task would probably be solved better using iteration rather than recursion, even though the classic definition of factorial is often given as a recursive operation.
Suppose you wanted to capture
information about a filesystem, including the filenames and directory
names, and their included contents. Represent a directory as a hash,
in which the keys are the names of the entries within the directory,
and values are undef
for plain files. A sample
/bin
directory might look like:
my $bin_directory = { "cat" => undef, "cp" => undef, "date" => undef, ... and so on ... };
Similarly, the Skipper’s home directory might also
contain a personal bin
directory (at something
like ~skipper/bin
) that contains personal tools:
my $skipper_bin = { "navigate" => undef, "discipline_gilligan" => undef, "eat" => undef, };
nothing in either structure tells where the directory is located in the hierarchy. It just represents the contents of some directory.
Go up one level to the Skipper’s home directory,
which is likely to contain a few files along with the personal
bin
directory:
my $skipper_home = { ".cshrc" => undef, "Please_rescue_us.pdf" => undef, "Things_I_should_have_packed" => undef, "bin" => $skipper_bin, };
Ahh, notice that you have
three files, but the fourth entry "bin
"
doesn’t have undef
for a value
but rather the hash reference created earlier for the
Skipper’s personal bin
directory.
This is how you indicate subdirectories. If the value is
undef
, it’s a plain file; if
it’s a hash reference, you have a subdirectory, with
its own files and subdirectories. Of course, you can have combined
these two initializations:
my $skipper_home = { ".cshrc" => undef, "Please_rescue_us.pdf" => undef, "Things_I_should_have_packed" => undef, "bin" => { "navigate" => undef, "discipline_gilligan" => undef, "eat" => undef, }, };
Now the hierarchical nature of the data starts to come into play.
Obviously, you don’t
want to create and maintain a data structure by changing literals in
the program. You should fetch the data by using a subroutine. Write a
subroutine that for a given pathname returns undef
if the path is a file, or a hash reference of the directory contents
if the path is a directory. The base case of looking at a file is the
easiest, so let’s write that:
sub data_for_path { my $path = shift; if (-f $path) { return undef; } if (-d $path) { ... } warn "$path is neither a file nor a directory\n"; return undef; }
If the Skipper calls this on .cshrc
,
he’ll get back an undef
value,
indicating that a file was seen.
Now for the directory part. You need a hash reference to be returned, which you declare as a named hash inside the subroutine. For each element of the hash, you call yourself to populate the value of that hash element. It goes something like this:
sub data_for_path { my $path = shift; if (-f $path or -l $path) { # files or symbolic links return undef; } if (-d $path) { my %directory; opendir PATH, $path or die "Cannot opendir $path: $!"; my @names = readdir PATH; closedir PATH; for my $name (@names) { next if $name eq "." or $name eq ".."; $directory{$name} = data_for_path("$path/$name"); } return \%directory; } warn "$path is neither a file nor a directory\n"; return undef; }
For each file within the directory
being examined, the response from the recursive call to
data_for_path
is undef
. This
populates most elements of the hash. When the reference to the named
hash is returned, the reference becomes a reference to an anonymous
hash because the name immediately goes out of scope. (The data itself
doesn’t change, but the number of ways in which you
can access the data changes.)
If there is a subdirectory, the nested subroutine call uses
readdir
to extract the contents of that directory
and returns a hash reference, which is inserted into the hash
structure created by the caller.
At first, it may look a bit mystifying, but if you walk through the code slowly, you’ll see it’s always doing the right thing.
Test the results of this subroutine by calling it on . (the current directory) and seeing the result:
use Data::Dumper; print Dumper(data_for_path("."));
Obviously, this will be more interesting if your current directory contains subdirectories.
The Dumper
routine
of Data::Dumper
displays the output nicely, but
what if you don’t like the format being used? You
can write a routine to display the data. Again, for recursively
defined data, a recursive subroutine is usually the key.
To dump the data, you need to know the name of the directory at the top of the tree because that’s not stored within the structure:
sub dump_data_for_path { my $path = shift; my $data = shift; if (not defined $data) { # plain file print "$path\n"; return; } ... }
For a plain file, dump the pathname; for a directory,
$data
is a hash reference. Let’s
walk through the keys and dump the values:
sub dump_data_for_path { my $path = shift; my $data = shift; if (not defined $data) { # plain file print "$path\n"; return; } my %directory = %$data; for (sort keys %directory) { dump_data_for_path("$path/$_", $directory{$_}); } }
For each element of the directory, you pass a path consisting of the
incoming path followed by the current directory entry, and the data
pointer is either undef
for a file or a
subdirectory hash reference for another directory. You can see the
results by running:
dump_data_for_path(".", data_for_path("."));
Again, this is more interesting in a directory that has subdirectories, but the output should be similar to:
find . -print
from the shell prompt.
The answers for all exercises can be found in Section A.6.
Using the glob
operator, a naive sort of every name in the
/bin
directory by their relative sizes might be
written as:
my @sorted = sort { -s $a <=> -s $b } glob "/bin/*";
Rewrite this using the Schwartzian Transform technique.
If you don’t have many files in the
/bin
directory, perhaps because you
don’t have a Unix machine, change the argument to
glob
as needed.
Read up on the Benchmark
module, included with
Perl. Write a program that will answer the question,
“How much does using the Schwartzian Transform speed
up the task of Exercise 1?”
Using a Schwartzian Transform, read a list of words, and sort them in “dictionary order.” Dictionary order ignores all capitalization and internal punctuation. Hint: The following transformation might be useful:
my $string = "Mary-Ann"; $string =~ tr/A-Z/a-z/; # force all lowercase $string =~ tr/a-z//cd; # strip all but a-z from the string print $string; # prints "maryann"
Be sure you don’t mangle the data! If the input
includes the Professor
, and The
skipper
, the output should have them listed in
that order, with that capitalization.
Modify the recursive directory dumping routine so it shows the nested directories through indentation. An empty directory should show up as:
sandbar, an empty directory
while a nonempty directory should appear with nested contents, indented two spaces:
uss_minnow, with contents: anchor broken_radio galley, with contents: captain_crunch_cereal gallon_of_milk tuna_fish_sandwich life_preservers
[29] My friends call that the
“ASCIIbetical” ordering. Normally
modern Perl doesn’t use ASCII; instead, it uses a
default sort order, depending on the current locale and character
set; see the perllocale
(not
perllocal
!) manpage.
[30] A sort block can also be specified as a subroutine name, causing the subroutine to be invoked when a comparison is needed.
[31] Actually, you can use any negative or positive number in place
of -1
and +1
,
respectively.
[32] Recent Perl versions
include a default sorting engine that is stable,
so zero returns from the sort block cause the relative ordering of
$a
and $b
to reflect their
order in the original list. Older versions of Perl
didn’t guarantee such stability, and a future
version might not use a stable sort, so don’t rely
on it.
[33] An expensive operation is one that takes a relatively long time or a relatively large amount of memory.