To tie these concepts together, we will look at the process of benchmarking, profiling, and optimizing a Rails action. This example comes from a real application, one that is fairly large and complicated. We have seen pieces of this application before; it is a map-based real estate search application. The application deals heavily with geospatial data, and is based on the PostGIS spatial extensions to PostgreSQL.
We have identified an action to profile: the action that performs the search itself (POST /searches).
This action is not particularly slow in absolute terms, but it is our
most commonly used feature, and any more performance we can get reduces
overall latency and makes our application feel snappier.
Once we have decided on an action whose performance we want to improve, we can profile it to see where our time is being spent. Jeremy Kemper recently added a new request profiler to Rails, which will be released with the final version of Rails 2.0. Its library is located at actionpack/lib/action_controller/request_profiler.rb, and it is accessible through script/performance/request. It is a fairly simple wrapper around the ruby-prof library, adding some commonly needed Rails functionality:
Rather than running a single action, the request profiler runs a specified integration test script, so the test procedure can be arbitrarily complex. This also means that we can profile non-GET requests, which is a must for the action we wish to profile.
The request profiler can run a script multiple times, while only profiling time actually spent in those actions (not the overhead of starting up the profiler).
The profiler script opens up both the flat and HTML graph profiles for us; we will see later how both of these are useful.
First, we need to install the ruby-prof gem:
$ sudo gem install ruby-prof
We need to write an integration script that drives the profiler. As mentioned previously, this script can be arbitrarily complicated, but ours will be a single request. The script uses the same methods as integration tests, except that the script's execution is wrapped in an integration runner (technically, the script's text is inserted into the runner using instance_eval), so the integration session methods can be called as top-level methods.
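Here is a minimal sketch of such a script; the search parameters are assumptions that mirror the POST data we will use later with Railsbench:

# test_script.rb
# A single POST request. Integration session methods such as post and get
# are available at the top level because the script's text is instance_eval'd
# into an integration runner.
post '/searches', :search => { :pclass => 1, :city_id => 149, :subtype => 'HOUSE' }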
We can drive the profiler with this script. We will also specify the number of trials; this is a fairly long-running action (around half a second on my machine without the profiling overhead), so we will run 10 trials:
$ script/performance/request -n 10 test_script.rb
This will generate two profiles in the tmp directory (and open them using open on OS X). The flat profile, which we have seen before, is tmp/profile-flat.txt. The new graph profile, which is much more detailed, is tmp/profile-graph.html.
The graph profile is extremely complicated, because it is a linearized version of the call graph for every function that was called during profiling. When opened in a browser, it starts off as shown in Figure 6-2.
We will examine portions of this graph, but a complete description of these fields is beyond the scope of this book. An introduction is available at http://ruby-prof.rubyforge.org/graph.txt.
The profile is divided into blocks, which are separated by horizontal lines. Each block in this profile is centered around one function—the one in boldface that has values in the first two columns (%Total and %Self). If we only look at that one line from each block, we have a fairly standard flat profile, like the one we saw previously from Ruby's Profiler. The %Total column indicates the percentage of overall time spent in this method and any methods it calls (and recursively on down); the %Self column excludes children and only relates to the current method.
The added value that we receive from the graph profile is the parent and child information in each block. Each function has zero or more parents (functions that called this function) and zero or more children (functions called by this function). Parents are listed above the current function within a block, and children are listed below the current function.
The meaning of the Calls column is different for different rows. For the primary row within a block, it simply refers to the total number of calls to the function. For the parent rows, it takes the form calls/total, where calls is the number of times the parent function called the current function and total is the total number of times the current function was called. Note that since all functions calling the current function are parents, the calls values of all parents will sum to the calls value of the current function.
For child rows, the semantics are the same, but they are from the child's point of view. The calls value of a child row is the number of times the current function calls the child function, as a fraction of the total number of times the child function was called. The denominator of this number will be the same as the calls value in the child function's own block. This gives a picture of how much impact the current function has on the child, relative to the child's other callers.
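To make these conventions concrete, here is a hypothetical block (the numbers are invented, not taken from our profile). Suppose foo is called 40 times in total, 30 times by bar and 10 times by baz, and makes 25 of the 50 total calls to scan:

    Calls    Name
    30/40    Object#bar       (parent)
    10/40    Object#baz       (parent)
       40    Object#foo       (current function)
    25/50    String#scan      (child)

The parent fractions sum to foo's own calls value (30 + 10 = 40), and the child row shows that foo is responsible for half of scan's calls.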
Now, we will examine the profile to find areas of our code that are slow. There are two main columns we need to watch while scanning the profile:

%Self
This value represents the amount of time spent in this method only (excluding time spent in its children), relative to total profiling time. If one method is doing an inordinate amount of work, it will have a proportionally higher %Self value.

Calls
This column shows the number of calls made to this function during the profiling period. All of the optimization in the world on one method will not help if the method is being called too many times. Again, algorithmic optimizations are always the place to start; then, individual bits of code can be tweaked if more performance is needed.
A good place to start is the flat profile, as it is sorted by the %Self column, highest first. (The graph profile is sorted by %Total instead.) The beginning of our flat profile looks like this:
Thread ID: 2057980
Total: 12.69

 %self  total   self   wait  child   calls  name
  9.93   2.06   1.26   0.00   0.80    2730  GeoRuby::SimpleFeatures::HexEWKBParser#decode_hex
  6.86   1.43   0.87   0.00   0.56    4430  Array#each_index
  4.65   1.82   0.59   0.00   1.23    4230  Array#each-1
  2.68   0.34   0.34   0.00   0.00  132130  Hash#[]=
  2.68   0.34   0.34   0.00   0.00  156540  Fixnum#==
  2.60   0.38   0.33   0.00   0.05   77550  String#gsub
Most of these values are fairly reasonable (we are doing a lot of necessary data processing in Ruby, and it is reasonable that Array#each_index would be a significant component of this). However, we are spending a lot of time in GeoRuby::SimpleFeatures::HexEWKBParser#decode_hex. The %Self value of 9.93 means that we are spending nearly 10% of the total request time inside this method. We should investigate this by looking for that method's block in the graph profile (see Figure 6-3).
Now, we may want to examine the source of these calls to decode_hex. This function has only one caller, HexEWKBParser#parse. Clicking on that line takes us to its own block (see Figure 6-4).
Following this chain of callers upward, we find that these calls are coming from the spatial_adapter plugin that we use to interface with the PostGIS database. These functions are doing a particularly computationally intensive form of type casting: they are converting the hex strings coming from PostGIS to Ruby objects such as points and polygons (EWKB is the extended well-known binary format that PostGIS uses to represent geometrical objects).
The decode_hex function is pure Ruby. It could probably be rewritten in C, OCaml, or another high-performance extension language, which would likely improve performance substantially. However, we should consider algorithmic improvements first.
The first thing that catches my eye about this profile is the number of calls. On the data set being used, our test query returns slightly fewer than 100 geospatial objects. So why are we decoding geospatial objects 2,730 times? After some investigation, we find this note in the spatial adapter's code:
Because ActiveRecord keeps only the string values directly returned from the database, it translates from these to the correct types every time an attribute is read (using the code returned by this method), which is probably OK for simple types, but might be less than efficient for geometries. Also, you cannot modify the geometry object returned directly or your change will not be saved.
So, every time we reference a geospatial column (via Listing#location), it is being decoded again from the string representation. We can do better than that. We will keep a cache of the last-seen string value (the EWKB value from PostGIS), as well as its corresponding geometry (the Ruby object). This is done by inserting the following code into the Listing class:
class Listing
  # Cache location information.
  # AR wants to cast the HexEWKB string to a Geometry on each access;
  # let's not do that.
  def location
    string = location_before_type_cast
    return @location_geometry if string == @location_string
    @location_string = string
    @location_geometry = super
  end

  # Invalidate cache when new location is assigned
  def location=(val)
    @location_geometry = @location_string = nil
    super
  end

  # Invalidate cache when record is reloaded
  def reload
    @location_geometry = @location_string = nil
    super
  end
end
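The effect, in a hypothetical console session (the record and its data are assumed):

listing = Listing.find(:first)
listing.location    # first access: the EWKB hex string is decoded once
listing.location    # same string, so the cached Geometry object is returned
listing.reload
listing.location    # cache was invalidated, so the value is decoded again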
Now that we have made a substantial change, it is time to reprofile and see how the numbers compare. After running the profiler again, we see that we have managed to push that method a few notches down the list:
Thread ID: 2057980
Total: 11.21

 %self  total   self   wait  child  calls  name
  8.56   1.38   0.96   0.00   0.42   4430  Array#each_index
  7.76   2.07   0.87   0.00   1.20   4230  Array#each-1
  4.19   0.74   0.47   0.00   0.27   1040  GeoRuby::SimpleFeatures::HexEWKBParser#decode_hex
We have cut the number of calls to decode_hex from 2,730 to 1,040. (This number is still higher than 100 because there is another spatial class that we have not optimized yet.) In addition, we have cut the time spent in decode_hex itself (the self column) from 1.26 seconds to 0.47 seconds without optimizing the actual function at all. When we benchmark these different versions of the application, we will see what kind of an impact the optimizations have on the action as a whole.
Next, we notice from the preceding flat profile that we are spending the most time in Array#each_index and Array#each-1 (the ruby-prof syntax indicating the first-level recursive call to Array#each). These are more complicated to track down, primarily because they have many callers (Array#each is used in many places).
The optimization process for this problem was fairly difficult, and we do not show it here; it was a fairly boring and highly application-specific optimization. (It involved rewriting a complicated clustering algorithm to build up the same data set using fewer intermediate data structures.) But the change did result in a small improvement in the profile, as well as the overall running time:
Thread ID: 2057980
Total: 10.7

 %self  total   self   wait  child   calls  name
  7.94   1.38   0.85   0.00   0.53    4430  Array#each_index
  5.23   1.77   0.56   0.00   1.21    4230  Array#each-1
  3.64   0.40   0.39   0.00   0.01  133960  Hash#[]=
  3.55   0.70   0.38   0.00   0.32    1040  GeoRuby::SimpleFeatures::HexEWKBParser#decode_hex
Now that we have made optimizations, we should see how they affect the actual performance of our application. We saved this section for later so that we could compare the two optimizations to each other; in reality, you should benchmark after each optimization to be sure that each change has the desired effect.
We will be comparing the performance of three versions of the application: (a) the original code, (b) the version with the Listing#location cache, and (c) the version with the refactored clustering code as well.
The benchmarking tool we will be using is Railsbench (http://railsbench.rubyforge.org/) by Stefan Kaes. It provides some very convenient tools for benchmarking Rails applications and visualizing the results. Railsbench has a slightly complicated installation procedure. First, install the gem:
$ sudo gem install railsbench
Next, you should add the directory containing the Railsbench scripts to your PATH, and make them executable. This can be done as follows:

$ eval `railsbench path`
$ sudo railsbench postinstall
Railsbench looks for some environment variables to know where the Rails application is and where its data should go, so we need to export those:
$ export RAILS_ROOT=.
$ export RAILS_PERF_DATA=.
Now Railsbench is installed and ready to run. But first, there are some changes we need to make to the application. We install the Railsbench code into our application:
$ railsbench install
That command provides us with a few files that are used for benchmarking:

config/benchmarks.rb
Railsbench configuration. Use this file to provide Rails with custom session data, if needed.

config/benchmarks.yml
Defines the set of benchmarks that can be run; each one has a name, URI, HTTP method, and optional POST data. Most of this file can be generated automatically from the application's routes; we will do this as our next step.

config/environments/benchmarking.rb
This is a separate Rails environment used for benchmarking. When created, it is a copy of the production.rb file, but it can be customized to meet the benchmark's needs.
Railsbench has a command to generate the config/benchmarks.yml file based on the application's routes. We will run this to create the basic version of the file.
$ railsbench generate_benchmarks
We delete actions from this file that we do not need to benchmark (such as named routes generated automatically by map.resources, some of which are unused). And we modify the searches_create benchmark, changing the method to post and adding the POST data we need:
searches_create:
  uri: /searches/
  action: create
  controller: searches
  method: post
  post_data: "search[pclass]=1&search[city_id]=149&search[subtype]=HOUSE"
There is one last change we must make. By default, all Rails environments except production have a log level of :debug. We want to set the benchmarking environment's log level to :info, so that we don't confound the benchmarking results with I/O issues from heavy log traffic. Add the following line to config/environments/benchmarking.rb:
config.log_level = :info
Now we can run the benchmarks. We use the Railsbench perf_run command, which takes as an argument the number of requests to make on each trial. We specify the benchmark from benchmarks.yml with the -bm= option, and we specify 20 trials with the RAILS_PERF_RUNS variable:
$ RAILS_PERF_RUNS=20 perf_run 100 -bm=searches_create
benchmarking 20 runs with options 100 -bm=searches_create
perf data file: ./perf_run.searches_create.txt
requests=100, options=-bm=searches_create
loading environment            2.43529

page request            total  stddev%    r/s    ms/r
searches_create      44.70490   0.4030   2.24  447.05
all requests         44.70490   0.4030   2.24  447.05
Railsbench can benchmark multiple actions during the same run. We are only benchmarking one action here, so we can just look at the searches_create line in this table for our information; the last line is just a summary.
The data shows us that we averaged 2.24 requests per second (447.05 milliseconds per request). Each of 20 trials involves 100 requests, so the mean runtime for each trial was 44.705 seconds. The standard deviation for that figure was 0.4030% of the mean, or 0.180 seconds. (Thus, 95% of the trials should fall between 44.34 seconds and 45.06 seconds, two standard deviations away from the mean.)
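As a quick sanity check, the arithmetic can be reproduced in a few lines of Ruby (the values are taken from the perf_run output above):

mean = 44.70490    # mean trial time, in seconds (44.70490 s per 100 requests)
sd   = 0.180       # standard deviation: 0.4030% of the mean
mean - 2 * sd      # => 44.3449 (lower bound, ~44.34 seconds)
mean + 2 * sd      # => 45.0649 (upper bound, ~45.06 seconds)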
perf_run saves its raw data for this run in perf_run.searches_create.txt, and we will feed that file into other utilities for analysis. Between benchmarks, we store this file away for comparison across the different versions under test.
Railsbench includes a utility called perf_comp that compares results between different runs of perf_run. When run with two arguments (the two data files to be compared), it gives a summary of each and a comparison between them. Here is the comparison between the first benchmark (the control) and the second (with the Listing#location cache improvement):
$ perf_comp a/perf_run.searches_create.txt b/perf_run.searches_create.txt
perf data file 1: a/perf_run.searches_create.txt
  requests=100, options=-bm=searches_create
perf data file 2: b/perf_run.searches_create.txt
  requests=100, options=-bm=searches_create

page              c1 real   c2 real  c1 r/s  c2 r/s  c1 ms/r  c2 ms/r  c1/c2
searches_create  44.70490  43.13935     2.2     2.3   447.05   431.39   1.04
all requests     44.70490  43.13935     2.2     2.3   447.05   431.39   1.04
This is mostly information we have seen before. The interesting bit of information is the far-right number in the table (1.04): the ratio of the old runtime to the new. In this case, we see that the cache optimization afforded us a 4% improvement in overall runtime on this action. This is a decent performance improvement, for an action that is already fairly well optimized.
Compare that to the second optimization, the refactoring:
$ perf_comp b/perf_run.searches_create.txt c/perf_run.searches_create.txt
perf data file 1: b/perf_run.searches_create.txt
  requests=100, options=-bm=searches_create
perf data file 2: c/perf_run.searches_create.txt
  requests=100, options=-bm=searches_create

page              c1 real   c2 real  c1 r/s  c2 r/s  c1 ms/r  c2 ms/r  c1/c2
searches_create  43.13935  42.79449     2.3     2.3   431.39   427.94   1.01
all requests     43.13935  42.79449     2.3     2.3   431.39   427.94   1.01
This change only resulted in a performance gain of 1%, which is probably not worth pursuing for performance's sake alone. (In this case, the code in question was in desperate need of a rewrite, so this was a net gain anyway.)
We should also compare the actual ranges that the trial times fall into, to be sure that these results are statistically significant. To visualize this, I instrumented the Railsbench PerfInfo class to show the actual trial times (not just the mean and standard deviation), and piped the data through R.[51] The result is shown in Figure 6-5.
This box-and-whisker plot shows the median value (the heavy line), the first and third quartile (into which the middle half of the data points fall; represented by the box), and the complete range (the brackets). From this, we see that the first optimization was a clear winner; the ranges of values do not even overlap.
The second optimization is not as clear-cut; there is still significant overlap between the two ranges of observations. If there is any performance gain, it is likely marginal, as it is overshadowed by the inherent variability of the response times.
[51] R (http://www.r-project.org/) is a language for statistical computing.