Chapter 5. Benchmarking Perl

Tony Hoare’s famous quote goes, “Premature optimization is the root of all evil,” although it doesn’t often come with its setup: “We should forget about small efficiencies, say about 97% of the time.” That is, don’t sweat the small stuff until you need to. In this chapter I show how I can look into my Perl programs to see where the slow parts are. Before I start working to improve the performance of my program, I should check to see what the program is actually doing. Once I know where the slow parts are, I concentrate on those.

The term benchmark comes from surveyors. They create a physical mark in something to denote a known elevation and use that mark to determine other elevations. Those computed elevations can only be right if the original mark is right. Even if the original mark started off right, it might change: it sinks into the ground, the ground moves in an earthquake, or global warming redefines the ultimate benchmark we call sea level. Benchmarks are comparisons, not absolutes.

For computers, a benchmark compares the performance of one system against another. They can measure many dimensions, such as time to completion, resource use, network activity, or memory use. Several tools already exist for measuring the parts outside of Perl, so I won’t cover those here. I want to look inside Perl to see what I can find. I want to know if one bit of code is faster or uses less memory, or whatever is important to me.

Measuring things and extracting numbers is easy, and it’s often easy for us to believe the numbers that computers give us. This makes benchmarking dangerous. Unlike those surveyors, we can’t stand on a hill and know if we are higher or lower than the next hill by just looking. We have to carefully consider not only the numbers that we get from benchmarks, but the method we use to generate the numbers.

Benchmarking isn’t as popular as it used to be. The speed and storage of computers and the bandwidth of networks are not as limiting as they used to be, so we don’t feel like we have to work hard to conserve them. We also don’t have to pay (as in money, literally) for CPU cycles (in most cases), so we don’t care how many we actually use. At least, we don’t care as much as programmers used to care. After all, you’re using Perl, aren’t you?

Any measurement comes with risk. If I don’t understand what I’m measuring, what affects the measurement, or what the numbers actually mean, I can easily misinterpret the results. If I’m not careful about how I measure things, my numbers may be meaningless. I can let the computer do the benchmarking work for me, but I shouldn’t let it do the thinking for me.

A Perl program doesn’t run on its own. It depends on a perl interpreter, an operating system, and hardware. Each of those depends on other things. Even if I use the same machine but different perl interpreters, even builds of the same version of Perl, I may get different results. I could have compiled them with different C compilers that have different levels of optimization, I could have included different features in one interpreter, and so on. I’ll talk about this more toward the end of the chapter when I discuss perlbench.

You probably don’t have to imagine a situation where you develop on one platform but deploy on another. I get to visit many companies in my travels as a consultant, so I’ve been able to see a lot of different setups. Often, teams develop on one system that only they use, then deploy the result to a busy server that has a different version of Perl, a different version of the operating system, and a completely different load profile. What was quite speedy on a lightly used machine becomes unbearably slow when people start to use it. A good example of this is CGI programs, which become quite slow with increased load, versus speedy mod_perl programs, which scale quite nicely.

Any benchmark applies only to its situation. Extrapolating my results to other situations might not get me into trouble, but the extrapolations aren’t really valid either. The only way for me to really know what will happen in a particular situation is to test that situation. Along with my numbers, I have to report the details. It’s not enough to say, for instance, that I’m writing this on a 2011 MacBook Air running OS X 10.8. I have to tell you the details of my perl interpreter, how I compiled it (that’s just perl -V), and how I’ve tuned my operating system.

Also, I can’t measure something without interacting with it, and that interaction changes the situation. If I want to watch the details of Perl’s memory management, for instance, I can compile Perl with -DDEBUGGING_MSTATS, but then it’s not the same Perl interpreter. Although I can now watch the memory, I’ve probably slowed the entire program down (and I verify that at the end of this chapter when I show perlbench). If I add code to time the program, I have to execute that code, which means my program takes longer. In any case, I might have to use additional modules, which means that Perl has to find, load, and compile more code.

To measure the time it takes my program to run, I could just look at the clock on the wall when I start the program and look again when the program finishes. That’s the simplest way, and the most naïve too. This method might work in extreme circumstances, though. If I can reduce the runtime of my program from an entire workday to a couple of minutes, then I don’t care that the wall-clock method might be a bit inaccurate.

I don’t have to really look at my watch, though. I can time my program directly in my program if I like:

#!/usr/bin/perl

my $start = time;

#... the meat of my program

my $end   = time;

print "The total time is ", $end - $start;

For a short-running program, this method only measures a portion of the total time. What about all that time Perl spent compiling the code? If I use a lot of modules, a significant part of the time the whole process takes might be in the parts before Perl even starts running the program. Jean-Louis Leroy wrote “A Timely Start” for Perl.com about slow startup times in a Perl FTP program because Perl had to look through 23 different directories to find everything Net::FTP needed to load. The runtime portion was still pretty speedy, but the startup time was relatively long. Remember that Perl has to compile the program every time I run it (forgetting about things like mod_perl or FastCGI for the moment). If I use many modules, I make a lot of work for Perl to find and compile them every time I run my program.

If I want to time the whole process, compile time and runtime, I can create a wrapper around the program to do the wall-clock timing. I could take this number and compare it to the runtime numbers to estimate the compilation times:

#!/usr/bin/perl

my $start = time;

system  { $ARGV[0] } @ARGV;   # See the Security chapter!

my $end   = time;

printf "The whole time was %d seconds\n", $end - $start;

I would run it:

% perl bench perl -e 'sleep 5' 
The whole time was 5 seconds
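
If I want a rough idea of how much of the total is compile time, I can use the same wrapper to time perl -c, which makes perl find, load, and compile everything (running BEGIN blocks along the way) but stop before the run phase. The program name here is hypothetical:

% perl bench perl -c myprogram.pl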

The wall-clock method breaks down, though, because the operating system can switch between tasks, or even run different tasks at the same time. I can’t tell how much time the computer worked on my program only by looking at my watch. The situation is even worse when my program has to wait for resources that might be busy or for network latency. I can’t really blame my program in those cases.

The time program (not the Perl built-in) that comes with most Unix-like systems solves this by reporting how much time the operating system actually spent on my program. Your particular shell may even have a built-in command for it.

From the command line, I tell the time command what it should measure. It runs the command and reports its results. It breaks down the runtime by the real time, the user time, and the system time. The real time is the wall-clock time. The other two deal with how the operating system divides tasks between the system and my process. Mostly I don’t care about that distinction and only their sum matters to me.

When I time the sleep program (not the Perl built-in), the real time is the time I told it to sleep, but since that’s all that program does, the user and system times are minuscule. The output for your particular version of time may be different:

% time sleep 5
real    0m5.094s
user    0m0.002s
sys     0m0.011s

Behind the scenes, the time program just uses the times function from the standard C library, and that carries along accounting information (although we’re fortunate that we don’t have to pay for clock cycles anymore). The times Perl built-in does the same thing. In list context, it returns four times: the user and system time for the process itself, and the user and system time for its children. I take the ending times and subtract the starting times to get the elapsed CPU times:

#!/usr/bin/perl

my @start = times;

#... the meat of my program

my @end   = times;

my @diffs = map { $end[$_] - $start[$_] } 0 .. $#end;

print "The total time is @diffs";

I don’t have to do those calculations myself, though, because the Benchmark module, which comes with Perl, already does it for me. Again, this approach only measures the runtime:

#!/usr/bin/perl

use Benchmark;

my $start = Benchmark->new;

#... the meat of my program

my $end = Benchmark->new;

my $diff = timediff( $end, $start );

print "My program took: " . timestr( $diff ) . "\n";

# a Benchmark object holds [ real, user, sys, children's user, children's sys, iterations ]
my( $real, $child_user, $child_system ) = @$diff[0,3,4];

printf STDERR "\nreal\t%.3f\nuser\t%.3f\nsys\t%.3f\n",
              $real, $child_user, $child_system;

The output looks like the times output I showed previously, but now it comes completely from within my Perl program and just for the parts of the program between my calls to Benchmark->new. Instead of timing the entire program, I can focus on the part I want to examine.

This is almost the same thing David Kulp did to create the Perl Power Tools version of time. Take a benchmark, run the command of interest using system (so those are the children times), then take another benchmark once system returns. Since this version of time is pure Perl, it runs anywhere that Perl runs:

#!/usr/bin/perl
# benchmark_rc.pl
use Benchmark;

my $start = Benchmark->new;

my $rc = system { $ARGV[0] } @ARGV;

my $end = Benchmark->new;

my $diffs = timediff( $end, $start );

printf STDERR "\nreal %.2f\nuser %.2f\nsys  %.2f\n", @$diffs[0,3,4];

$rc &= 0xffff;
if ($rc == 0xff00) { exit 127; } else { exit ($rc >> 8); }

There’s a big problem with measuring CPU times and comparing them to program performance: they only measure the time my program used the CPU. They don’t include the time my program waits to get input, to send output, or to get control of some other resource. Those waits might matter much more than the CPU time. CPU time also doesn’t measure remote resource use, such as the time a remote database server spends working on the task.

Benchmarks by themselves aren’t very useful. I file them under the heading of “decision support.” They don’t directly answer questions for us. I might be able to use them to decide that I need to change a program to improve a number, but the number itself doesn’t tell me what to do. Sure, I know how long it takes to run my program, but it doesn’t tell me if I can make it any faster. I need to compare one implementation to another.

I could compare entire programs to each other, but that’s not very useful. When I’m trying to speed up a program, I’m going to change the parts that I think are slow. Most of the other parts will be the same, and the time to run all of those unchanged parts ends up in the total time, skewing the results. I really just want to compare the bits that are different, so I need to isolate the parts that I want to compare.

If I extract the different parts, I can create small programs with just those. Most of the time the sample program then takes to run applies only to the interesting bits. I’ll talk more about that later, but as I go through this next section, remember that anything I do has some overhead and every measurement changes the situation a bit, so I should think about the numbers before I accept them. For now, I’ll go back to the Benchmark module.

If I want to compare two small bits of code instead of entire programs, I can use some of the functions from Benchmark. I can either run each snippet for a fixed number of iterations and compare the total times, or do the inverse: run each snippet for a fixed amount of time and compare how many iterations each one completes.
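
For the time-based form, I pass a negative first argument, which Benchmark treats as a minimum number of CPU seconds to spend on each snippet. Here’s a minimal sketch; the snippets themselves are just stand-ins:

#!/usr/bin/perl
use Benchmark qw(timethese);

# a negative count means "run each snippet for at least this many CPU seconds"
timethese( -5, {
    'Join'   => sub { my $string = join '', 1 .. 1000 },
    'Concat' => sub { my $string = ''; $string .= $_ for 1 .. 1000 },
    } );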

In the timethese function from Benchmark, I give it a number of iterations as the first argument. The second argument is an anonymous hash where the keys are labels I give the snippets and the hash values represent the code I want to compare, in this case as string values that Perl will eval. In this sample program, I want to compare the speed of opendir and glob for getting a list of files:

#!/usr/bin/perl
# glob_opendir.pl
use v5.10;
use Benchmark;
use Cwd qw(cwd);

my( $files, $iterations ) = @ARGV;

$files      //=  1000;
$iterations //= 10000;

use File::Temp qw(tempdir);
my $dir = tempdir( CLEANUP => 1 );

my $original_cwd = cwd(); # save our current directory
chdir( $dir ) or die "Could not change to $dir: $!";
foreach ( 1 .. $files ) {
    open my($fh), '>', "$0.$_.tmp" or die "Could not create a file: $!";
    print { $fh } time();
    }

my $count = () = glob( '*' );
printf "$dir has %d files\n", $count;

timethese( $iterations, {
    'Opendir'  => q{ opendir my $dh, "."; my @f = readdir( $dh ) },
    'Glob'     => q{ my @f = glob(".* *") },
    }
    );

chdir( $original_cwd );

The timethese function prints a nice report that shows me the three times I discussed earlier:

% perl glob_opendir.pl
/var/folders/yg/47k_dc892sb8vyjtxbsfv2zh0000gn/T/vPcw4yjmrM has 1000 files
Benchmark: timing 10000 iterations of Glob, Opendir...
Glob: 36 wallclock secs (13.47 usr + 22.17 sys = 35.64 CPU) @ 280.58/s (n=10000)
Opendir: 7 wallclock secs ( 3.64 usr +  2.74 sys =  6.38 CPU) @ 1567.40/s 
         (n=10000)

These aren’t “The Numbers,” though. People try to get away with running the measurement once. Try it again. Then again. The results vary a little bit every time I run it; certainly some of this is merely round-off error:

% perl glob_opendir.pl
/var/folders/yg/47k_dc892sb8vyjtxbsfv2zh0000gn/T/ekWQkoZoeP has 1000 files
Benchmark: timing 10000 iterations of Glob, Opendir...
Glob: 36 wallclock secs (13.32 usr + 22.06 sys = 35.38 CPU) @ 282.65/s (n=10000)
Opendir: 6 wallclock secs ( 3.39 usr +  2.56 sys =  5.95 CPU) @ 1680.67/s 
         (n=10000)
% perl glob_opendir.pl
/var/folders/yg/47k_dc892sb8vyjtxbsfv2zh0000gn/T/xxo8olMbZ0 has 1000 files
Benchmark: timing 10000 iterations of Glob, Opendir...
Glob: 37 wallclock secs (13.77 usr + 23.26 sys = 37.03 CPU) @ 270.05/s (n=10000)
Opendir: 7 wallclock secs ( 3.61 usr +  2.80 sys =  6.41 CPU) @ 1560.06/s 
         (n=10000)

If I hadn’t already known that glob is an expensive operation, I might be surprised by these results. More likely, I would have to convince myself that they were true results and not a mistake on my part.

In that example, I gave Benchmark two strings to compare. The timethese function took the strings, created subroutines from them, and ran those subroutines. I can do the same thing with anonymous subroutines:

timethese( $iterations, {
    'Opendir'  => sub { opendir my $dh, "."; my @f = readdir( $dh ) },
    'Glob'     => sub { my @f = glob(".* *") },
    }
    );

I can also use references to defined subroutines:

my $results = timethese( $iterations, {
    'Opendir'  => \&bench_opendir,
    'Glob'     => \&bench_glob,
    }
    );

Now that I’ve given the subroutines names and defined them outside of the call to timethese, I can use them elsewhere. Subroutines that I can use elsewhere can go into modules, for instance, and I can test them with the conventional setup. I write about this in Chapter 17.

You might already see where I’m going with this. As I was working on this chapter, I wanted to run many benchmarks of different tasks, and I didn’t want to re-create the code to do that in every program. Instead, I created Surveyor::App to handle all of the boring controller code. With that out of the way, I could make a small module containing my target code and let Surveyor::App do the rest of the work:

% survey -p Surveyor::Benchmark::GetDirectoryListing

The Surveyor::Benchmark::GetDirectoryListing module can be simple:

package Surveyor::Benchmark::GetDirectoryListing;

sub set_up {
    my( $class, $directory ) = @_;
    unless( defined $directory ) {
        require Cwd;
        $directory = Cwd::cwd();
        }
    die "Directory [$directory] does not exist!\n" unless -e $directory;
    chdir( $directory ) or die "Could not change to $directory: $!\n";
    my @files = glob( '.* *' );
    printf "$directory has %d files\n", scalar @files;
    }

sub tear_down { 1 }

sub bench_opendir {
    opendir my( $dh ), ".";
    my @f = readdir( $dh );
    }

sub bench_glob {
    my @f = glob(".* *")
    }

__PACKAGE__;

Benchmarking can be deceptive if I let the computer do the thinking for me. The Benchmark module can spit out numbers all day long, but if I don’t think about what I’m doing and what those numbers actually mean, they aren’t useful. They may even lead me to believe something that isn’t true, and I have a nice example from my personal experience of mistrusting a benchmark.

In Chapter 10 of Intermediate Perl, we cover the Schwartzian Transform, which uses a cached sort key to avoid duplicating work during a sort. The Schwartzian Transform should be faster, especially for more elements and more complicated sort-key computations.
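
The transform has a characteristic three-stage shape. Here’s a sketch that orders filenames by modification age, computing -M only once per file; @filenames stands in for whatever list I already have:

my @sorted =
    map  { $_->[0] }                 # 3. pull the original names back out
    sort { $a->[1] <=> $b->[1] }     # 2. sort on the cached key
    map  { [ $_, -M $_ ] }           # 1. compute the key once per file
    @filenames;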

In one of the course exercises, to prove to our students that the transform actually boosts performance, we ask them to sort a bunch of filenames in order of their modification date. Looking up the modification time is an expensive operation, especially when I have to do it N*log(N) times. Since we got the answer we wanted, we didn’t investigate as fully as we should have.

The answer we used to give in Intermediate Perl was not the best answer. It is short so it fits on one slide, but it makes things seem worse than they really are. The Schwartzian Transform comes out ahead, as it should, but I always thought it should be faster.

Our example used Benchmark’s timethese to compare two methods to sort filenames by their modification age. The “Ordinary” sort computes the file modification age, -M $a, every time it needs to make a comparison. The “Schwartzian” method uses the Schwartzian Transform to compute the modification age once per file and store it with the filename. It’s a cached sort key:

use Benchmark qw{ timethese };
timethese( -2, {
   Ordinary    =>
      q{ my @results = sort { -M $a <=> -M $b } glob "/bin/*"; },
   Schwartzian =>
      q{ map $_->[0], sort { $a->[1] <=> $b->[1] } map [$_, -M], glob "/bin/*"; },
   });

This code has a number of problems. If I am going to compare two things, they need to be as alike as I can make them. Notice that in the “Ordinary” case I assign to @results and in the “Schwartzian” case I use map() in a void context. They do different things: one sorts and stores, and one just sorts. To compare them, they need to produce the same thing. In this case, they both need to store their result.
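
A minimal fix for that particular problem is to make both snippets store their results; this sketch keeps everything else the same:

use Benchmark qw{ timethese };
timethese( -2, {
   Ordinary    =>
      q{ my @results = sort { -M $a <=> -M $b } glob "/bin/*"; },
   Schwartzian =>
      q{ my @results = map $_->[0],
             sort { $a->[1] <=> $b->[1] }
             map [$_, -M], glob "/bin/*"; },
   });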

Also, I need to isolate the parts that are different and abstract the parts that are the same. In each code string I do a glob(), which I already know is an expensive operation. The glob() taints the results because it adds to the time for the two sorts of, um, sorts.

During one class, while the students were doing this lab exercise, I did my own homework by rewriting our benchmark, following the same process I should use in any benchmark.

I broke up the task into parts and timed the different bits to see how they impact the overall task. I identified three major parts to benchmark: creating a list of files, sorting the files, and assigning the sorted list. I want to time each of those individually, and I also want to time the bigger task. This seems like such a simple task, comparing two bits of code, but I can mess up in several ways if I’m not careful.

I also want to see how much the numbers improve from the example we have in the course slides, so I use the original code strings too. I try a bunch of different snippets to see how each part of the task contributes to the final numbers. How much of it comes from the list assignment, or from the filename generation through glob()? I build up a bunch of code strings from the various common parts.

First, I create some package variables. Benchmark turns my code strings into subroutines, and I want those subroutines to find these variables. They have to be global (package) variables. Although I know Benchmark puts these subroutines in the main:: package, I use L::*, which is short for Local. It’s not important that I do it in this particular way so much as that I abstract the common parts so they have as little effect as possible on the results.

The $L::glob variable is just the pattern I want glob() to use, and I get that from @ARGV so I can run this program over different directories to see how the times change with different numbers of files. I specify it once and use it everywhere I use glob(). That way, every code string gets the same list of files. I expect the Schwartzian Transform to get better and better as the number of files increases.

I also want to run some code strings that don’t use glob(), so I pre-glob the directory and store the list in @L::files. I think glob() is going to significantly affect the times, so I want to see the results with and without it.

The $code anonymous hash has the code strings. I want to test the pieces as well as the whole thing, so I start off with control strings to assign the list of files to a variable and to run a glob(). Benchmark also runs an empty subroutine behind the scenes so it can adjust its time for that overhead too. I expect the “assign” times to be insignificant and the glob() times to be a big deal. At the outset, I suspect the glob() may be as much as a third of the time of the benchmarks, but that’s just a guess.

The next set of code strings measure the sort. The sort_names string tries it in void context, and sort_names_assign does the same thing but assigns its result to an array. I expect a measurable difference, and the difference to be the same as the time for the assign string.

Then I try the original code strings from our exercise example, and call that ordinary_orig. That one uses a glob(), which I think inflates the time significantly. The ordinary_mod string uses the list of files in @L::files, which is the same thing as ordinary_orig without the glob(). I expect these two to differ by the time of the glob code string.

The last set of strings compares three things. The schwartz_orig string is the one I started with. In schwartz_orig_assign, I fix that to assign to an array, just as I did with ordinary_orig. If I want to compare them, they have to do the same thing. The final code string, schwartz_mod, gets rid of the glob():

#!/usr/bin/perl
# schwartzian_benchmark.pl
use v5.10;
use strict;
use Benchmark;
use Cwd qw(cwd);

use Getopt::Std;
getopts( 'f:g:i:s:', \ my %opts );

my $files      = $opts{f} // 100;
my $iterations = $opts{i} // 10_000;
my $seconds    = 0 - abs( $opts{'s'} // -2 );

use File::Temp qw(tempdir);
my $dir = tempdir( CLEANUP => 1 );

my( $original_cwd ) = cwd();
chdir( $dir ) or die "Could not change to $dir: $!";
foreach ( 1 .. $files ) {
    open my($fh), '>', "$0.$_.tmp" or die "Could not create a file: $!";
    print { $fh } time();
    }

$L::glob = $opts{g} // '*';
@L::files = glob $L::glob;

print "Testing with " . @L::files . " files\n";

my $transform = q|map $_->[0], sort { $a->[1] <=> $b->[1] } map [ $_, -M ]|;
my $sort      = q|sort { -M $a <=> -M $b }|;

my $code = {
    assign               =>  q| my @r = @L::files |,
    'glob'               =>  q| my @r = glob $L::glob |,
    sort_names           =>  q| sort { $a cmp $b } @L::files |,
    sort_names_assign    =>  q| my @r = sort { $a cmp $b } @L::files |,
    sort_times_assign    => qq| my \@r = $sort \@L::files |,

    ordinary_orig        => qq| my \@r = $sort glob \$L::glob |,
    ordinary_mod         => qq| my \@r = $sort \@L::files |,
    schwartz_orig        => qq| $transform, glob \$L::glob |,
    schwartz_orig_assign => qq| my \@r = $transform, glob \$L::glob |,
    schwartz_mod         => qq| my \@r = $transform, \@L::files |,
};

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
printf "Timing for %d CPU seconds...\n", $seconds ;
timethese( $seconds, $code );

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
print "\n", "-" x 73, "\n\n";
print "Timing for $iterations iterations\n";

timethese( $iterations, $code );
chdir( $original_cwd );

The Benchmark module provides the report, which I reformatted to make it a bit easier to read (so some of the output is missing and some lines are shorter). The results are not surprising, although I like to show the students that they didn’t waste an hour listening to me talk about how wonderful the transform is:

% perl schwartzian_benchmark.pl -f 500
Testing with 500 files
Timing for 2 CPU seconds...
Benchmark: running assign, glob, ordinary_mod, ordinary_orig,
    schwartz_mod, schwartz_orig, schwartz_orig_assign, sort_names,
    sort_names_assign, sort_times_assign for at least 2 CPU seconds...
    assign: ( 2.15 usr +  0.01 sys =  2.16 CPU) (n=12421)
      glob: ( 0.82 usr +  1.28 sys =  2.10 CPU) (n=1556)
ordinary_mod: ( 0.61 usr +  1.48 sys =  2.09 CPU) (n=1508)
ordinary_orig: ( 0.70 usr +  1.41 sys =  2.11 CPU) (n=829)
schwartz_mod: ( 1.25 usr +  0.85 sys =  2.10 CPU) (n=1426)
schwartz_orig: ( 0.94 usr +  1.08 sys =  2.02 CPU) (n=820)
schwartz_orig_assign: ( 1.11 usr +  1.12 sys =  2.23 CPU) (n=830)
sort_names: ( 2.13 usr +  0.01 sys =  2.14 CPU) (n=12942139)
sort_names_assign: ( 2.05 usr +  0.00 sys =  2.05 CPU) (n=14580)
sort_times_assign: ( 0.60 usr +  1.45 sys =  2.05 CPU) (n=1448)

-------------------------------------------------------------------------

Timing for 10000 iterations
Benchmark: timing 10000 iterations of assign, glob, ordinary_mod,
    ordinary_orig, schwartz_mod, schwartz_orig, schwartz_orig_assign,
    sort_names, sort_names_assign, sort_times_assign...
    assign:  1 secs ( 1.23 usr +  0.00 sys =  1.23 CPU)
      glob: 12 secs ( 4.68 usr +  6.89 sys = 11.57 CPU)
ordinary_mod: 14 secs ( 4.09 usr +  9.74 sys = 13.83 CPU)
ordinary_orig: 25 secs ( 8.42 usr + 16.70 sys = 25.12 CPU)
schwartz_mod: 15 secs ( 8.79 usr +  5.93 sys = 14.72 CPU)
schwartz_orig: 24 secs (11.41 usr + 12.94 sys = 24.35 CPU)
schwartz_orig_assign: 27 secs (13.47 usr + 13.51 sys = 26.98 CPU)
sort_names:  0 secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
            (warning: too few iterations for a reliable count)
sort_names_assign:  2 secs ( 1.42 usr +  0.00 sys =  1.42 CPU)
sort_times_assign: 14 secs ( 4.15 usr +  9.99 sys = 14.14 CPU)

The sort_names result stands out. It ran almost six million times a second. It also doesn’t do anything since it’s in a void context. It runs really fast, and it runs just as fast no matter what I put in the sort block. A sort in void context will always be the fastest. The difference between the sort and the map in void context is not as pronounced in schwartz_orig and schwartz_orig_assign, because it’s only the last map that is in void context. Both still have the rightmost map and the sort to compute before they can optimize for void context. Still, schwartz_orig manages roughly 10% more iterations per CPU second than schwartz_orig_assign, so the missing assignment gave it an apparent but unwarranted boost in our original example.

I like to look at the second set of results for the comparisons, and use the wall-clock seconds even though they are not as exact as the CPU seconds. Remember that the CPU times are only measuring time spent in the CPU, and that I’m really doing a lot of filesystem stuff here. The CPU times aren’t any more accurate than the wall-clock times.

The glob code string took about 12 seconds, and the schwartz_orig_assign code string took 27 seconds. If I subtract those extra 12 seconds from the 27, I get the same wall-clock time for schwartz_mod, just like I expected.

If I try the same benchmark with more and more files, I should see the Schwartzian Transform doing even better as the number of files grows.

I’m not done yet. I thought quite a bit about what I was benchmarking in that Schwartzian Transform problem, but I trusted Benchmark to handle the rest. I shouldn’t do that, and there are many reasons I need to be careful about how as well as the what.

Consider this program, which has a big problem that many people might miss when they first look at it. To better demonstrate the problem, I give the labels in $code a random one-digit prefix so I can randomize the order in which Benchmark runs them:

#!/usr/bin/perl
# side_effect.pl
use v5.10;

use Benchmark;

my $regex       = qr/\.pl\z/;
my @files = glob( $ARGV[0] );

my $grep_prefix = int( rand( 10 ) );
my $map_prefix  = int( rand( 10 ) );

my $code = {
    "$grep_prefix-grep"    => sub {
        push @files, scalar grep /$regex/, @files;
        return \@files;
        },
    "$map_prefix-map"     => sub {
        push @files, () = map { /$regex/ } @files;
        return \@files;
        },
    };

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
my $iterations = 1000;
print "Timing for $iterations iterations\n";

timethese( $iterations, $code );

I run this program several times, where sometimes the map version runs first and other times the grep version runs first:

% perl side_effect.pl '/Users/Amelia/bin/*'
Timing for 1000 iterations
Benchmark: timing 1000 iterations of 2-map, 5-grep...
     2-map:  0 secs ( 0.02 usr +  0.00 sys =  0.02 CPU) @ 50000.00/s
    5-grep:  0 secs ( 0.23 usr +  0.00 sys =  0.23 CPU) @ 4347.83/s

% perl side_effect.pl '/Users/Amelia/bin/*'
Timing for 1000 iterations
Benchmark: timing 1000 iterations of 0-map, 8-grep...
     0-map:  0 secs ( 0.02 usr +  0.00 sys =  0.02 CPU) @ 50000.00/s
            (warning: too few iterations for a reliable count)
    8-grep:  0 secs ( 0.24 usr +  0.01 sys =  0.25 CPU) @ 4000.00/s
            (warning: too few iterations for a reliable count)

% perl5.14.2 side_effect.pl '/Users/Amelia/bin/*'
Timing for 1000 iterations
Benchmark: timing 1000 iterations of 0-grep, 6-map...
    0-grep:  0 secs ( 0.23 usr +  0.00 sys =  0.23 CPU) @ 4347.83/s
     6-map:  1 secs ( 0.54 usr +  0.00 sys =  0.54 CPU) @ 1851.85/s

The results are vastly different on different runs, but consistent depending on the order in which the tests are run. When I go back to examine the program, I catch the problem; each test changes the value of @files, so the tests that come later deal with different input.
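
One way I could remove that side effect is to give each snippet a fresh copy of the input every time through. Here’s a sketch along those lines:

#!/usr/bin/perl
use Benchmark qw(timethese);

my @original_files = glob( $ARGV[0] );
my $regex = qr/\.pl\z/;

timethese( 1000, {
    'grep' => sub {
        my @files = @original_files;   # a fresh copy for every run
        push @files, scalar grep /$regex/, @files;
        return \@files;
        },
    'map'  => sub {
        my @files = @original_files;
        push @files, () = map { /$regex/ } @files;
        return \@files;
        },
    } );

The copying becomes part of what I’m timing, though, so it inflates both labels equally rather than fixing the measurement outright.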

I can use Benchmark::Forking instead. It has the same interface as Benchmark but runs the subroutine for each label in its own process:

#!/usr/bin/perl
# benchmark_forking.pl

use Benchmark::Forking qw(cmpthese);

my @files = glob( $ARGV[0] );

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
my $iterations = 10_000;
print "Timing for $iterations iterations\n";

cmpthese( $iterations, {
    "A" => sub { push @files, scalar grep 1, @files },
    "B" => sub { push @files, scalar grep 1, @files },
    "C" => sub { push @files, scalar grep 1, @files },
    } );

This doesn’t mean that it fixes the side-effect problem. My code still might be broken, but its brokenness can’t affect other tests:

% perl benchmark_forking.pl '/etc/*'
Timing for 10000 iterations
    Rate   B   A   C
B 4762/s  -- -0% -0%
A 4785/s  0%  --  0%
C 4785/s  0%  0%  --

% perl benchmark_forking.pl '/etc/*'
Timing for 10000 iterations
    Rate   B   A   C
B 4695/s  -- -1% -1%
A 4739/s  1%  --  0%
C 4739/s  1%  0%  --

This doesn’t really solve the problem, though, because the problem isn’t just between the separate subroutines; it’s between separate runs of the same subroutine. Each call makes the next call a little harder.

The Benchmark module averages the results of every run, but what if one of them took a very short or very long time, unlike most of the other runs? For instance, the first time you access a file the system might read it into the cache, and as you continually open it in other tests it’s quickly available.
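
One crude way to keep a single cold-cache run out of the averages is a warm-up pass that runs each snippet once before the timing starts. A sketch, reusing the directory-listing snippets from earlier:

#!/usr/bin/perl
use Benchmark qw(timethese);

my %code = (
    'Opendir' => sub { opendir my $dh, "."; my @f = readdir( $dh ) },
    'Glob'    => sub { my @f = glob( ".* *" ) },
    );

$_->() for values %code;   # warm-up: prime any caches before timing

timethese( 10_000, \%code );

That still doesn’t help with outliers that show up later in the run.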

Steffen Müller thought quite a bit about this problem when he created Dumbbench. It looks much like Benchmark, but it’s much more sophisticated. It watches the timings as the tests run and stops when it has a statistically significant sample. It can also exclude outliers, or data points far away from the norm. Best of all, though, it can stop when it reaches a certain precision instead of running lots of extra tests that don’t add anything:

#!/usr/bin/perl
# glob_opendir_dumbbench.pl
use v5.10;
use Benchmark::Dumb qw(timethese);
use Cwd qw(cwd);
my( $files, $precision ) = @ARGV;

$files      //= 1000;
$precision  //=    0.05;

use File::Temp qw(tempdir);
my $dir = tempdir( CLEANUP => 1 );

my( $original_cwd ) = cwd();
chdir( $dir ) or die "Could not change to $dir: $!";
foreach ( 1 .. $files ) {
    open my($fh), '>', "$0.$_.tmp" or die "Could not create a file: $!";
    print { $fh } time();
    }

my $count = () = glob( '*' );
printf "$dir has %d files\n", $count;

timethese(
    $precision,
    {
    'Opendir'  => q{ opendir my $dh, "."; my @f = readdir( $dh ) },
    'Glob'     => q{ my @f = glob(".* *") },
    }
    );

chdir( $original_cwd );

Dumbbench reports the number with an uncertainty (based on the spread of all of the individual results). Notice the low number of iterations reported at the end of each line:

% perl glob_opendir_dumbbench.pl 1000 0.05
/var/folders/yg/.../T/lfI3Tosk0i has 1000 files
Benchmark: ran Glob, Opendir.
Glob: 5.526e-03 +- 6.1e-05 wallclock secs (1.1%) @ ( 181 +-    2)/s (n=21)
Opendir: 9.25e-04 +- 4.2e-05 wallclock secs (4.5%) @ (1081 +-   49)/s (n=30)

% perl glob_opendir_dumbbench.pl 10000 0.05
/var/folders/yg/.../T/UgQ8DD0KA7 has 10000 files
Benchmark: ran Glob, Opendir.
Glob: 4.789e-02 +- 6.0e-04 wallclock secs (1.3%) @ (20.88 +- 0.26)/s (n=20)
Opendir: 8.78e-03 +- 3.6e-04 wallclock secs (4.1%) @ (113.8 +-  4.6)/s (n=20)

% perl glob_opendir_dumbbench.pl 100000 0.05
/var/folders/yg/.../T/LE_oNTC7XB has 100000 files
Benchmark: ran Glob, Opendir.
Glob: 3.865e+00 +- 2.5e-02 wallclock secs (0.647%) @ (0.2588 +- 0.0017)/s (n=20)
Opendir:9.761e-02 +- 8.3e-04 wallclock secs (0.850%) @ (10.245 +- 0.087)/s (n=20)

Since Dumbbench can stop when it knows it has enough runs, it doesn’t take as long to get results.
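
Benchmark::Dumb is the drop-in interface; Dumbbench also has its own object interface, along the lines of the synopsis in its documentation. Here’s a sketch of that style for the same comparison; the settings are illustrative:

#!/usr/bin/perl
use Dumbbench;

my $bench = Dumbbench->new(
    target_rel_precision => 0.005,   # stop when the result is good to about 0.5%
    initial_runs         => 20,      # starting sample size
    );

$bench->add_instances(
    Dumbbench::Instance::PerlSub->new(
        name => 'Opendir',
        code => sub { opendir my $dh, "."; my @f = readdir( $dh ) },
        ),
    Dumbbench::Instance::PerlSub->new(
        name => 'Glob',
        code => sub { my @f = glob( ".* *" ) },
        ),
    );

$bench->run;
$bench->report;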

When programmers talk about benchmarking, they are probably talking about speed. After all, that’s what the Benchmark Perl module measures, and what most articles on the subject discuss. Time is an easy thing to measure, so it’s understandable, though not necessarily right, that people measure what they can. Sometimes time is not the major constraint, but something else, such as memory use, causes the problem.

The perldebguts documentation says:

Perl trades memory for processing speed. Instead of doing a lot of computation, Perl does a lot of lookup. Higher-level languages handle memory management so the developer can think more about the task at hand than about getting more memory, releasing memory, or creating memory management bugs.

This ease of use comes at an expense, though. Since I don’t control the memory, and Perl doesn’t know what I plan to do ahead of time, Perl has to guess. When Perl needs more memory, it grabs a big chunk of it. If I need memory now, I’ll probably need more later too, so Perl gets more than I need immediately. If I watch the memory use of my program carefully, I’ll see it jump in big amounts, stay that way for a bit, then jump again. Perl doesn’t give memory back to the operating system, either. It needed the memory before, so it might need it again. It tries to reuse the space it doesn’t need anymore, but it’s going to keep what it’s got.
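
I can watch those jumps from inside a program. This rough sketch assumes a Unix-like system where ps reports the resident set size in kilobytes:

#!/usr/bin/perl

sub resident_kb {
    my $kb = `ps -o rss= -p $$`;   # resident memory of this process
    $kb =~ s/\s+//g;
    return $kb;
    }

printf "At the start:               %d KB\n", resident_kb();

my @big = (1) x 1_000_000;
printf "After building a big array: %d KB\n", resident_kb();

@big = ();
printf "After emptying the array:   %d KB\n", resident_kb();   # Perl keeps the memory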

Also, Perl is built on top of C, but it doesn’t have C’s data types. Perl has scalars, arrays, hashes, and a couple of others. Perl doesn’t expose the actual storage to me, so I don’t have to think about it. Not only that, but Perl has to deal with context. Are those data strings, or numbers, or both? Where, in memory, are all of the elements of the array? What other variables does this thing reference?

That’s the long way to say that a number in Perl is more than a number. It’s really a movie star with a big entourage. It may be a 32-bit integer, but it’s really 12 bytes. The Devel::Peek module lets me see what’s going on by letting me inspect the variable’s data structure to see how Perl stores it:

#!/usr/bin/perl

use Devel::Peek;

my $a;

print_message( "Before I do anything" );
Dump( $a );

print_message( "After I assign a string" );
$a = '123456789';
Dump( $a );

print_message( "After I use it as a number" );
$b = $a + 1;
Dump( $a );

sub print_message {
    print STDERR "\n", "-" x 50, "\n$_[0]\n", "-" x 50, "\n"
    }

The output shows me what Perl is tracking for that scalar at each point in the program. When I first create the variable, it doesn’t have a value. I can see that Perl created the scalar (in internals parlance, the SV, or “scalar value”), it has a reference count of 1, and that it has some flags set. The SV doesn’t have anything in it (that’s the NULL(0x0)), but it has an address, 0x1808248, because the scalar infrastructure is set up and ready to go when I’m ready to give it a value.

When I assign a string to $a, it has more flags set and now has a PV, a “pointer value,” which really means it’s just a string (or char * for you C people). Now the scalar value points to the string data.

When I use this scalar as a number for the first time, Perl has to convert the string to a number. Once it does that, it stores the number value too, turning my scalar into a PVIV, meaning that it has a pointer value and an integer value. Perl sets more flags to indicate that it’s done the conversion and it has both values. Next time it can access the number directly:

--------------------------------------------------
Before I do anything
--------------------------------------------------
SV = NULL(0x0) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY)

--------------------------------------------------
After I assign a string
--------------------------------------------------
SV = PV(0x1800908) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x301c10 "123456789"\0
  CUR = 9
  LEN = 10

--------------------------------------------------
After I use it as a number
--------------------------------------------------
SV = PVIV(0x1800c20) at 0x1808248
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,IOK,POK,pIOK,pPOK)
  IV = 123456789
  PV = 0x301c10 "123456789"\0
  CUR = 9
  LEN = 10

Just from that I can see that Perl is doing a lot of work. Each Perl variable has some overhead even if it doesn’t have a defined value. That’s OK, because that overhead is what makes Perl’s variables so convenient to use.

The Devel::Size module can tell me how much memory my variable takes up. I have to remember, though, that the actual memory is probably a little bit more since Perl has to align the low-level values at the appropriate byte boundaries. It can’t just store a value starting anywhere it likes:

#!/usr/bin/perl

use Devel::Size qw(size);

my $n;

print_message( "Before I do anything" );
print "Size is ", size( \$n );

print_message( "After I assign a string" );
$n = '1';
print "Size is ", size( \$n );

print_message( "After I use it as a number" );

my $m = $n + 1;
print "Size is ", size( \$n );

sub print_message { print "\n", "-" x 50, "\n$_[0]\n", "-" x 50, "\n" }

I see that even before I do anything, my scalar $n takes up 12 bytes, at least. When I assign it a string, the size of the scalar is larger, and by more than just the number of characters in the string. Perl tacks on a null byte to terminate the string and might have some extra space allocated in case the string gets bigger. When I use the string as a number, Perl stores the numeric version too, so the scalar gets even larger. Every one of these things can use a bit more memory than I expect:

--------------------------------------------------
Before I do anything
--------------------------------------------------
Size is 12

--------------------------------------------------
After I assign a string
--------------------------------------------------
Size is 26

--------------------------------------------------
After I use it as a number
--------------------------------------------------
Size is 31

What about references, which are also scalars? They only need to know where to find the value; they don’t store the value themselves, so a reference stays the same size even when the data it points to changes. I have to be careful with Devel::Size, though. If I give it a reference, it finds the size of the thing at which the reference points. That’s a good thing, as I’ll show when I try it with arrays or hashes. However, if I have a reference pointing at a reference, the size of that second reference is the size of the thing at which it points, which is just a reference:

#!/usr/bin/perl

use LWP::Simple;
use Devel::Size qw(size);

# watch out! This is 50+ MB big!
my $data = get( "http://www.cpan.org/src/stable.tar.gz" );

print "The size of the data is " , size( $data ), "\n";

my $ref = \$data;

print "The size of the reference is " , size( $ref ), "\n";

my $ref2 = \$ref;

print "The size of the second reference is " , size( $ref2 ), "\n";

The output shows that the second reference is just 16 bytes. It doesn’t include all of the data stored in the ultimate scalar. I’ll show in a moment why I need to know that, but I have to look at Perl’s containers first:

The size of the data is 12829217
The size of the reference is 12829217
The size of the second reference is 16

The situation for Perl’s containers is different. Arrays are collections of scalars, and hashes have scalar keys and scalar values. Those scalars can be the normal variety that hold values, or they can be references. The size function from Devel::Size tells us the size of the data structure. Remember, references may point to big values, but they don’t take up that much space themselves:

#!/usr/bin/perl

use Devel::Size qw(size);

my @array = ( 1 ) x 500;

print "The size of the array is ", size( \@array ), "\n";

I can see how much space the array takes up. The Devel::Size documentation is careful to note that this doesn’t count the size of the things in the array, just the size of the array. Notice that the size of my 500-element array is much smaller than 500 times the dozen or so bytes even an empty scalar takes:

The size of the array is 2052

That number isn’t counting the contents, though. The array takes up the same space no matter what the scalars hold:

#!/usr/bin/perl

use Devel::Size qw(size);

my $data   = '-' x 500;
print "The size of the scalar is ", size( $data ), "\n";

my @array = ( $data ) x 500;
print "The size of the array is ", size( \@array ), "\n";

I created a scalar with 500 characters, and the entire scalar, including the overhead, takes up 525 bytes. The array takes up the same space as it did previously:

The size of the scalar is 525
The size of the array is 2052

Devel::Size has a fix for this: its total_size function looks at each of the scalars in the container and adds up their sizes. The reference values may point to other containers, which may have more references. My array might look like it’s really small until I try to make a deep copy, store it, or do anything else where I have to get all the data in one place, reference or not:

#!/usr/bin/perl

use Devel::Size qw(size total_size);

my $data   = '-' x 500;
print "The       size of the scalar is ", size( $data ), "\n";
print "The total size of the scalar is ", total_size( $data ), "\n";

print "\n";

my @array = ( $data ) x 500;
print "The       size of the array is ", size( \@array ), "\n";
print "The total size of the array is ", total_size( \@array ), "\n";

Using total_size, the scalar size stays the same, and the array size now includes all the scalar sizes. The number, 264552, is 500 times 525 (the aggregate size of the scalars) plus 2052 (the size of the array structure itself):

The       size of the scalar is 525
The total size of the scalar is 525

The       size of the array is 2052
The total size of the array is 264552

I have to remember what this number actually is, though. It’s just the aggregate size of all the data to which the array eventually points. If I did this for all of my data structures, I wouldn’t get the program memory size because those structures might contain references to the same data.
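
Here’s a small sketch of why: two structures that share data each report the shared data in their totals, so adding the totals counts it twice:

#!/usr/bin/perl
use Devel::Size qw(total_size);

my $data = '-' x 500;

my @first  = ( \$data );
my @second = ( \$data );

# each total follows the reference and includes the 500-character string,
# so the sum counts that one string twice
my $sum = total_size( \@first ) + total_size( \@second );
print "Sum of totals is $sum\n";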

The same code can perform differently on different perl binaries, which might differ in their compilation options, compilers used, features included, and so on. For instance, threaded versions of Perl are slightly slower, as are shared library versions. It’s not necessarily bad to be slower if you want the other benefits, but you don’t always get to choose beforehand. For instance, the stock Perl on some Linux distributions is compiled for threads. If you think you might get a speed-up with a nonthreaded interpreter, find out before you reconfigure your system!
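
I can ask a particular perl how it was built by pulling individual values out of its configuration. The output here is from a hypothetical nonthreaded, nonshared build; yours will vary:

% perl -V:usethreads
usethreads='undef';
% perl -V:useshrplib
useshrplib='false';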

To compare different perl interpreters, Gisle Aas wrote perlbench. Its perlbench-run program takes the paths of the perl interpreters I want to test, runs a series of benchmarks on each of them, and produces a report. It normalizes the numbers to the time for the first interpreter I specify:

% perlbench-run /usr/local/bin/perl5*

The output first shows the details for each interpreter I’m testing and assigns them letters that correspond to a column in the table that it’s about to output. Especially interesting is the ccflags information, which might show options such as -DDEBUGGING_MSTATS, which makes the interpreter slower. Also interesting is the compiler information. It looks like I’ve got a pretty old version of gcc. That might be a good or bad thing. Different versions, or even different compilers, do better or worse jobs optimizing the code. These numbers only have relative meaning on the same machine:

A) v5.14.1 darwin-2level
    version     = 5.014001
    path        = /usr/local/perls/perl-5.14.1/bin/perl
    ccflags     = -fno-common -DPERL_DARWIN ...

B) v5.16.0 darwin-2level
    version     = 5.016
    path        = /usr/local/perls/perl-5.16.0/bin/perl
    ccflags     = -fno-common -DPERL_DARWIN ...

C) v5.16.1 darwin-2level
    version     = 5.016001
    path        = /usr/local/perls/perl-5.16.1/bin/perl
    ccflags     = -fno-common -DPERL_DARWIN ...

D) v5.18.0 darwin-2level
    version     = 5.018
    path        = /usr/local/perls/perl-5.18.0/bin/perl
    ccflags     = -fno-common -DPERL_DARWIN ...

After perlbench-run reports the details of the interpreters, it runs a series of Perl programs with each of them. It measures the time to execute, much like Benchmark’s timethese. Once it tries the program with all of the interpreters, it normalizes the results so that the first interpreter (the one labeled “A”) is 100. Lower numbers in the other columns mean that interpreter is slower for that test, and higher numbers (they can be above 100) mean it is faster. The numbers only have meaning within a single test; I can’t compare them to those from a different test, even in the same run:

                              A       B       C       D
                             ---     ---     ---     ---
    arith/mixed              100      99     103     107
    arith/trig               100      98     100     101
    array/copy               100      96     106     101
    array/foreach            100     100     100      96
    ...
    string/ipol              100     105     100     104
    string/tr                100      97     101     100

    AVERAGE                  100      98      99      98

    Results saved in ./benchres-006/index.html

If I have something special I want to test, I can add my own test files. Most of the infrastructure is already in place. The README from the perlbench distribution gives the basic format of a test file. I create my test and put it in perlbench’s benchmark/ directory. The distribution gives an example file:

# Name: My name goes here
# Require: 4

require 'benchlib.pl';

# YOUR SETUP CODE HERE
$a = 0;

&runtest(100, <<'ENDTEST');
    # YOUR TESTING CODE HERE
ENDTEST

Benchmarking is a tricky subject. It involves a lot of numbers and requires a good understanding of what’s actually going on. Not only do I have to look at my Perl program, but I should consider other factors, such as my choice of operating system, the Perl interpreter I’m using and how I compiled it, and anything else that interacts with my program. It’s not all about speed, either. I might want to compare the memory use of two approaches, or see which one takes up less bandwidth. Different situations have different constraints. No matter what I’m doing, I need to do my best to find out what’s really going on before I make any conclusions about how to make it better.

The Benchmark module’s documentation provides all of the details of its use. The module comes with Perl, so you should already have it.

In “A Timely Start”, Jean-Louis Leroy finds that his Perl program was slow because of the time it took to find the modules it needed to load.

In “When Perl Isn’t Quite Fast Enough”, Perl developer Nick Clark talks about why programs, in general, are slow, and which Perl design decisions can make Perl slow. The best part of his talk, which he originally gave at YAPC::EU 2002, is his solutions to speed up his programs. I heard his talk during the PerlWhirl 2005 cruise, and he related most of his discussion to his work to make Perl’s Unicode handling faster. If you get a chance to see his talk, take it! I think you’ll be entertained as well as educated.

I originally wrote the parts about benchmarking the Schwartzian Transform for PerlMonks in a node titled “Wasting Time Thinking About Wasted Time”. I nicked it almost verbatim from my original post. I still use that post in my Perl classes to show that even “experts” can mess up benchmarks.

The second Perl article I ever wrote was “Benchmarking Perl” for The Perl Journal #11, in which I show some of the other functions in Benchmark.

The perlbench distribution isn’t indexed in CPAN at the time of this writing, but you can still find it through CPAN Search or MetaCPAN. Check the README file for its documentation.

Steffen Müller explains Dumbbench in “The physicist’s way out”, “Your benchmarks suck!”, and “Hard data for timing distributions”.

I modified Dumbbench to create R data files to make box plots. You can read more about those in “Playing with Dumbbench”.

Peter Rabbitson has an interesting interactive demonstration of benchmarking in his perltalk-BenchmarkingIsHard GitHub repository.