Benchmark Functions
Benchmarking is the practice of measuring the performance of a program on a fixed workload. In Go, a benchmark function looks like a test function, but with the Benchmark prefix and a *testing.B parameter that provides most of the same methods as a *testing.T, plus a few extra related to performance measurement. It also exposes an integer field N, which specifies the number of times to perform the operation being measured. Here's a benchmark for IsPalindrome that calls it N times in a loop.
import "testing" func BenchmarkIsPalindrome(b *testing.B) { for i := 0; i < b.N; i++ { IsPalindrome("A man, a plan, a canal: Panama") } }
We run it with the command below. Unlike tests, by default no benchmarks are run. The argument to the -bench flag selects which benchmarks to run. It is a regular expression matching the names of Benchmark functions, with a default value that matches none of them. The “.” pattern causes it to match all benchmarks in the word package, but since there's only one, -bench=IsPalindrome would have been equivalent.
$ cd $GOPATH/src/gopl.io/ch11/word2
$ go test -bench=.
PASS
BenchmarkIsPalindrome-8  1000000  1035 ns/op
ok  gopl.io/ch11/word2  2.179s
The benchmark name's numeric suffix, 8 here, indicates the value of GOMAXPROCS, which is important for concurrent benchmarks.
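To measure a benchmark under several GOMAXPROCS values, the -cpu flag of go test accepts a comma-separated list; each value produces its own result line, with the suffix reflecting that setting (when the value is 1, no suffix is appended). For example:

$ go test -bench=IsPalindrome -cpu=1,2,4

(The timings reported will of course vary from machine to machine.)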
The report tells us that each call to IsPalindrome took about 1.035 microseconds, averaged over 1,000,000 runs. Since the benchmark runner initially has no idea how long the operation takes, it makes some initial measurements using small values of N and then extrapolates to a value large enough for a stable timing measurement to be made.
The reason the loop is implemented by the benchmark function, and not by the calling code in the test driver, is so that the benchmark function has the opportunity to execute any necessary one-time setup code outside the loop without this adding to the measured time of each iteration. If this setup code is still perturbing the results, the *testing.B parameter provides methods to stop, resume, and reset the timer, but these are rarely needed.
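For example, a benchmark whose input must be constructed first can call ResetTimer after the setup so that only the loop is measured. A minimal sketch, assuming the construction of a long input string is the expensive step (the string and the benchmark name here are illustrative, not part of the word package):

import (
    "strings"
    "testing"
)

func BenchmarkIsPalindromeLongInput(b *testing.B) {
    // Hypothetical one-time setup: build a large input string.
    input := strings.Repeat("A man, a plan, a canal: Panama. ", 1000)

    b.ResetTimer() // discard the time spent on the setup above
    for i := 0; i < b.N; i++ {
        IsPalindrome(input)
    }
}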
Now that we have a benchmark and tests, it's easy to try out ideas for making the program faster. Perhaps the most obvious optimization is to make IsPalindrome's second loop stop checking at the midpoint, to avoid doing each comparison twice:
n := len(letters) / 2
for i := 0; i < n; i++ {
    if letters[i] != letters[len(letters)-1-i] {
        return false
    }
}
return true
But as is often the case, an obvious optimization doesn’t always yield the expected benefit. This one delivered a mere 4% improvement in one experiment.
$ go test -bench=.
PASS
BenchmarkIsPalindrome-8  1000000  992 ns/op
ok  gopl.io/ch11/word2  2.093s
Another idea is to pre-allocate a sufficiently large array for use by letters, rather than expanding it by successive calls to append. Declaring letters with the capacity it will need, like this,
letters := make([]rune, 0, len(s))
for _, r := range s {
    if unicode.IsLetter(r) {
        letters = append(letters, unicode.ToLower(r))
    }
}
yields an improvement of nearly 35%, and the benchmark runner now reports the average over 2,000,000 iterations.
$ go test -bench=.
PASS
BenchmarkIsPalindrome-8  2000000  697 ns/op
ok  gopl.io/ch11/word2  1.468s
As this example shows, the fastest program is often the one that makes the fewest memory allocations. The -benchmem command-line flag will include memory allocation statistics in its report. Here we compare the number of allocations before the optimization:
$ go test -bench=. -benchmem
PASS
BenchmarkIsPalindrome  1000000  1026 ns/op  304 B/op  4 allocs/op
and after it:
$ go test -bench=. -benchmem
PASS
BenchmarkIsPalindrome  2000000  807 ns/op  128 B/op  1 allocs/op
Consolidating the allocations in a single
call to make
eliminated 75% of the allocations and halved the quantity
of allocated memory.
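Alternatively, a benchmark can request these statistics for itself by calling the ReportAllocs method on *testing.B, so that the B/op and allocs/op columns appear even when -benchmem is not supplied. A minimal sketch (the benchmark name is illustrative):

func BenchmarkIsPalindromeAllocs(b *testing.B) {
    b.ReportAllocs() // include B/op and allocs/op for this benchmark
    for i := 0; i < b.N; i++ {
        IsPalindrome("A man, a plan, a canal: Panama")
    }
}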
Benchmarks like this tell us the absolute time required for a given operation, but in many settings the interesting performance questions are about the relative timings of two different operations. For example, if a function takes 1ms to process 1,000 elements, how long will it take to process 10,000 or a million? Such comparisons reveal the asymptotic growth of the running time of the function. Another example: what is the best size for an I/O buffer? Benchmarks of application throughput over a range of sizes can help us choose the smallest buffer that delivers satisfactory performance. A third example: which algorithm performs best for a given job? Benchmarks that evaluate two different algorithms on the same input data can often show the strengths and weaknesses of each one on important or representative workloads.
Comparative benchmarks are just regular code. They typically take the form of a single
parameterized function, called from several Benchmark
functions
with different values, like this:
func benchmark(b *testing.B, size int) { /* ... */ }

func Benchmark10(b *testing.B)   { benchmark(b, 10) }
func Benchmark100(b *testing.B)  { benchmark(b, 100) }
func Benchmark1000(b *testing.B) { benchmark(b, 1000) }
The parameter size, which specifies the size of the input, varies across benchmarks but is constant within each benchmark. Resist the temptation to use the parameter b.N as the input size. Unless you interpret it as an iteration count for a fixed-size input, the results of your benchmark will be meaningless.
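One possible body for the parameterized benchmark above might look like the sketch below, assuming we are measuring IsPalindrome on inputs of a given length (the input construction is illustrative, not from the book, and requires the strings package): size fixes the input once, and b.N determines only how many times the measured operation runs.

func benchmark(b *testing.B, size int) {
    // size determines the input; b.N is only the iteration count.
    input := strings.Repeat("a", size) // an all-letter palindrome of the given length
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        IsPalindrome(input)
    }
}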
Patterns revealed by comparative benchmarks are particularly useful during program design, but we don’t throw the benchmarks away when the program is working. As the program evolves, or its input grows, or it is deployed on new operating systems or processors with different characteristics, we can reuse those benchmarks to revisit design decisions.
Exercise 11.6: Write benchmarks to compare the PopCount implementation in Section 2.6.2 with your solutions to Exercise 2.4 and Exercise 2.5. At what point does the table-based approach break even?
Exercise 11.7: Write benchmarks for Add, UnionWith, and other methods of *IntSet (§6.5) using large pseudo-random inputs. How fast can you make these methods run? How does the choice of word size affect performance? How fast is IntSet compared to a set implementation based on the built-in map type?