To tie these concepts together, we will look at the process of benchmarking, profiling, and optimizing a Rails action. This example comes from a real application, one that is fairly large and complicated. We have seen pieces of this application before; it is a map-based real estate search application. The application deals heavily with geospatial data, and is based on the PostGIS spatial extensions to PostgreSQL.
We have identified an action to profile: the action that performs the search itself (POST /searches).
This action is not particularly slow in absolute terms, but it is our
most commonly used feature, and any more performance we can get reduces
overall latency and makes our application feel snappier.
Once we have decided on an action whose performance we want to improve, we can profile it to see where our time is being spent. Jeremy Kemper recently added a new request profiler to Rails, which will be released with the final version of Rails 2.0. Its library is located at actionpack/lib/action_controller/request_profiler.rb, and it is accessible through script/performance/request. It is a fairly simple wrapper around the ruby-prof library, adding some commonly needed Rails functionality:
Rather than running a single action, the request profiler runs a specified integration test script, so the test procedure can be arbitrarily complex. This also means that we can profile non-GET requests, which is a must for the action we wish to profile.
The request profiler can run a script multiple times, while only profiling time actually spent in those actions (not the overhead of starting up the profiler).
The profiler script opens up both the flat and HTML graph profiles for us; we will see later how both of these are useful.
First, we need to install the ruby-prof gem:
$ sudo gem install ruby-prof
We need to write an integration script that drives the profiler. As mentioned previously, this script can be arbitrarily complicated, but ours will be a single request. The script uses the same methods as integration tests, except that the script's execution is wrapped in an integration runner (technically, the script's text is inserted into the runner using instance_eval), so the integration session methods can be called as top-level methods.
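Here is a minimal sketch of such a script; the search parameters are assumptions that mirror the POST data we will use later with Railsbench:

# test_script.rb
# A single POST request. Integration session methods such as post and get
# are available at the top level because the script's text is instance_eval'd
# into an integration runner.
post '/searches', :search => { :pclass => 1, :city_id => 149, :subtype => 'HOUSE' }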
We can drive the profiler with this script. We will also specify the number of trials; this is a fairly long-running action (around half a second on my machine without the profiling overhead), so we will run 10 trials:
$ script/performance/request -n 10 test_script.rb
This will generate two profiles in the tmp directory (and open them using open on OS X). The flat profile, which we have seen before, is tmp/profile-flat.txt. The new graph profile, which is much more detailed, is tmp/profile-graph.html.
The graph profile is extremely complicated, because it is a linearized version of the call graph for every function that was called during profiling. When opened in a browser, it starts off as shown in Figure 6-2.
We will examine portions of this graph, but a complete description of these fields is beyond the scope of this book. An introduction is available at http://ruby-prof.rubyforge.org/graph.txt.
The profile is divided into blocks, which are separated by horizontal lines. Each block in this profile is centered around one function—the one in boldface that has values in the first two columns (%Total and %Self). If we only look at that one line from each block, we have a fairly standard flat profile, like the one we saw previously from Ruby's Profiler. The %Total column indicates the percentage of overall time spent in this method and any methods it calls (and recursively on down); the %Self column excludes children and only relates to the current method.
The added value that we receive from the graph profile is the parent and child information in each block. Each function has zero or more parents (functions that called this function) and zero or more children (functions called by this function). Parents are listed above the current function within a block, and children are listed below the current function.
The meaning of the Calls column is different for different rows. For the primary row within a block, it simply refers to the total number of calls to the function. For the parent rows, it takes the form calls/total, where calls is the number of times the parent function called the current function and total is the total number of times the current function was called. Note that since all functions calling the current function are parents, the calls values of all parents will sum to the calls value of the current function.
For child rows, the semantics are the same, but they are from the child's point of view. The calls value of a child row is the number of times the current function calls the child function, as a fraction of the total number of times the child function was called. The denominator of this number will be the same as the calls value in the child function's own block. This gives a picture of how much impact the current function has on the child, relative to the child's other callers.
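To make these conventions concrete, here is a hypothetical block (the numbers are invented, not taken from our profile). Suppose foo is called 40 times in total, 30 times by bar and 10 times by baz, and makes 25 of the 50 total calls to scan:

    Calls    Name
    30/40    Object#bar       (parent)
    10/40    Object#baz       (parent)
       40    Object#foo       (current function)
    25/50    String#scan      (child)

The parent fractions sum to foo's own calls value (30 + 10 = 40), and the child row shows that foo is responsible for half of scan's calls.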
Now, we will examine the profile to find areas of our code that are slow. There are two main columns we need to watch while scanning the profile:

%Self
This value represents the amount of time spent in this method only (excluding time spent in its children), relative to total profiling time. If one method is doing an inordinate amount of work, it will have a proportionally higher %Self value.

Calls
This column shows the number of calls made to this function during the profiling period. All of the optimization in the world on one method will not help if the method is being called too many times. Again, algorithmic optimizations are always the place to start; then, individual bits of code can be tweaked if more performance is needed.
A good place to start is the flat profile, as it is sorted by the %Self column, highest first. (The graph profile is sorted by %Total instead.) The beginning of our flat profile looks like this:
Thread ID: 2057980
Total: 12.69

 %self  total   self   wait  child   calls  name
  9.93   2.06   1.26   0.00   0.80    2730  GeoRuby::SimpleFeatures::HexEWKBParser#decode_hex
  6.86   1.43   0.87   0.00   0.56    4430  Array#each_index
  4.65   1.82   0.59   0.00   1.23    4230  Array#each-1
  2.68   0.34   0.34   0.00   0.00  132130  Hash#[]=
  2.68   0.34   0.34   0.00   0.00  156540  Fixnum#==
  2.60   0.38   0.33   0.00   0.05   77550  String#gsub
Most of these values are fairly reasonable (we are doing a lot of necessary data processing in Ruby, and it is reasonable that Array#each_index would be a significant component of this). However, we are spending a lot of time in GeoRuby::SimpleFeatures::HexEWKBParser#decode_hex. The %Self value of 9.93 means that we are spending nearly 10% of the total request time inside this method. We should investigate this by looking for that method's block in the graph profile (see Figure 6-3).
Now, we may want to examine the source of these calls to decode_hex. This function has only one caller, HexEWKBParser#parse. Clicking on that line takes us to its own block (see Figure 6-4).
Following this chain of callers upward, we find that these calls are coming from the spatial_adapter plugin that we use to interface with the PostGIS database. These functions are doing a particularly computationally intensive form of type casting: they are converting the hex strings coming from PostGIS to Ruby objects such as points and polygons (EWKB is the extended well-known binary format that PostGIS uses to represent geometrical objects).
The decode_hex function is pure Ruby. It could probably be rewritten in C, OCaml, or another high-performance extension language, which would likely improve performance substantially. However, we should consider algorithmic improvements first.
The first thing that catches my eye about this profile is the number of calls. On the data set being used, our test query returns slightly fewer than 100 geospatial objects. So why are we decoding geospatial objects 2,730 times? After some investigation, we find this note in the spatial adapter's code:
Because ActiveRecord keeps only the string values directly returned from the database, it translates from these to the correct types every time an attribute is read (using the code returned by this method), which is probably OK for simple types, but might be less than efficient for geometries. Also, you cannot modify the geometry object returned directly or your change will not be saved.
So, every time we reference a geospatial column (via Listing#location), it is being decoded again from the string representation. We can do better than that. We will keep a cache of the last-seen string value (the EWKB value from PostGIS), as well as its corresponding geometry (the Ruby object). This is done by inserting the following code into the Listing class:
class Listing
  # Cache location information.
  # AR wants to cast the HexEWKB string to a Geometry on each access;
  # let's not do that.
  def location
    string = location_before_type_cast
    return @location_geometry if string == @location_string
    @location_string = string
    @location_geometry = super
  end

  # Invalidate cache when new location is assigned
  def location=(val)
    @location_geometry = @location_string = nil
    super
  end

  # Invalidate cache when record is reloaded
  def reload
    @location_geometry = @location_string = nil
    super
  end
end
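The effect, in a hypothetical console session (the record and its data are assumed):

listing = Listing.find(:first)
listing.location    # first access: the EWKB hex string is decoded once
listing.location    # same string, so the cached Geometry object is returned
listing.reload
listing.location    # cache was invalidated, so the value is decoded again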
Now that we have made a substantial change, it is time to reprofile and see how the numbers compare. After running the profiler again, we see that we have managed to push that method a few notches down the list:
Thread ID: 2057980
Total: 11.21

 %self  total   self   wait  child  calls  name
  8.56   1.38   0.96   0.00   0.42   4430  Array#each_index
  7.76   2.07   0.87   0.00   1.20   4230  Array#each-1
  4.19   0.74   0.47   0.00   0.27   1040  GeoRuby::SimpleFeatures::HexEWKBParser#decode_hex
We have cut the number of calls to decode_hex from 2,730 to 1,040. (This number is still higher than 100 because there is another spatial class that we have not optimized yet.) In addition, we have cut the time spent in decode_hex itself (the self column) from 1.26 seconds to 0.47 seconds without optimizing the actual function at all. When we benchmark these different versions of the application, we will see what kind of an impact the optimizations have on the action as a whole.
Next, we notice from the preceding flat profile that we are spending the most time in Array#each_index and Array#each-1 (the ruby-prof syntax indicating the first-level recursive call to Array#each). These are more complicated to track down, primarily because they have many callers (Array#each is used in many places).
The optimization process for this problem was fairly difficult, and we do not show it here; it was a fairly boring and highly application-specific optimization. (It involved rewriting a complicated clustering algorithm to build up the same data set using fewer intermediate data structures.) But the change did result in a small improvement in the profile, as well as the overall running time:
Thread ID: 2057980
Total: 10.7

 %self  total   self   wait  child   calls  name
  7.94   1.38   0.85   0.00   0.53    4430  Array#each_index
  5.23   1.77   0.56   0.00   1.21    4230  Array#each-1
  3.64   0.40   0.39   0.00   0.01  133960  Hash#[]=
  3.55   0.70   0.38   0.00   0.32    1040  GeoRuby::SimpleFeatures::HexEWKBParser#decode_hex
Now that we have made optimizations, we should see how they affect the actual performance of our application. We saved this section for later so that we could compare the two optimizations to each other; in reality, you should benchmark after each optimization to be sure that each change has the desired effect.
We will be comparing the performance of three versions of the application: (a) the original code, (b) the version with the Listing#location cache, and (c) the version with the refactored clustering code as well.
The benchmarking tool we will be using is Railsbench (http://railsbench.rubyforge.org/) by Stefan Kaes. It provides some very convenient tools for benchmarking Rails applications and visualizing the results. Railsbench has a slightly complicated installation procedure. First, install the gem:
$ sudo gem install railsbench
Next, you should add the directory containing the Railsbench scripts to your PATH, and make them executable. This can be done as follows:

$ eval `railsbench path`
$ sudo railsbench postinstall
Railsbench looks for some environment variables to know where the Rails application is and where its data should go, so we need to export those:
$ export RAILS_ROOT=.
$ export RAILS_PERF_DATA=.
Now Railsbench is installed and ready to run. But first, there are some changes we need to make to the application. We install the Railsbench code into our application:
$ railsbench install
That command provides us with a few files that are used for benchmarking:

config/benchmarks.rb
Railsbench configuration. Use this file to provide Rails with custom session data, if needed.

config/benchmarks.yml
Defines the set of benchmarks that can be run; each one has a name, URI, HTTP method, and optional POST data. Most of this file can be generated automatically from the application's routes; we will do this as our next step.

config/environments/benchmarking.rb
This is a separate Rails environment used for benchmarking. When created, it is a copy of the production.rb file, but it can be customized to meet the benchmark's needs.
Railsbench has a command to generate the config/benchmarks.yml file based on the application's routes. We will run this to create the basic version of the file.
$ railsbench generate_benchmarks
We delete actions from this file that we do not need to benchmark (such as named routes generated automatically by map.resources, some of which are unused). And we modify the searches_create benchmark, changing the method to post and adding the POST data we need:
searches_create:
  uri: /searches/
  action: create
  controller: searches
  method: post
  post_data: "search[pclass]=1&search[city_id]=149&search[subtype]=HOUSE"
There is one last change we must make. By default, all Rails environments except production have a log level of :debug. We want to set the benchmarking environment's log level to :info, so that we don't confound the benchmarking results with I/O issues from heavy log traffic. Add the following line to config/environments/benchmarking.rb:
config.log_level = :info
Now we can run the benchmarks. We use the Railsbench perf_run command, which takes as an argument the number of requests to make on each trial. We specify the benchmark from benchmarks.yml with the -bm= option, and we specify 20 trials with the RAILS_PERF_RUNS variable:
$ RAILS_PERF_RUNS=20 perf_run 100 -bm=searches_create
benchmarking 20 runs with options 100 -bm=searches_create
perf data file: ./perf_run.searches_create.txt
requests=100, options=-bm=searches_create
loading environment            2.43529

page request            total  stddev%    r/s    ms/r
searches_create      44.70490   0.4030   2.24  447.05
all requests         44.70490   0.4030   2.24  447.05
Railsbench can benchmark multiple actions during the same run. We are only benchmarking one action here, so we can just look at the searches_create line in this table for our information; the last line is just a summary.
The data shows us that we averaged 2.24 requests per second (447.05 milliseconds per request). Each of 20 trials involves 100 requests, so the mean runtime for each trial was 44.705 seconds. The standard deviation for that figure was 0.4030% of the mean, or 0.180 seconds. (Thus, 95% of the trials should fall between 44.34 seconds and 45.06 seconds, two standard deviations away from the mean.)
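As a quick sanity check, the arithmetic can be reproduced in a few lines of Ruby (the values are taken from the perf_run output above):

mean = 44.70490    # mean trial time, in seconds (44.70490 s per 100 requests)
sd   = 0.180       # standard deviation: 0.4030% of the mean
mean - 2 * sd      # => 44.3449 (lower bound, ~44.34 seconds)
mean + 2 * sd      # => 45.0649 (upper bound, ~45.06 seconds)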
perf_run saves its raw data for this run in perf_run.searches_create.txt, and we will feed that file into other utilities for analysis. Between benchmarks, we store this file away for comparison across the different versions under test.
Railsbench includes a utility called perf_comp that compares results between different runs of perf_run. When run with two arguments (the two data files to be compared), it gives a summary of each and a comparison between them. Here is the comparison between the first benchmark (the control) and the second (with the Listing#location cache improvement):
$ perf_comp a/perf_run.searches_create.txt b/perf_run.searches_create.txt
perf data file 1: a/perf_run.searches_create.txt
  requests=100, options=-bm=searches_create
perf data file 2: b/perf_run.searches_create.txt
  requests=100, options=-bm=searches_create

page              c1 real   c2 real  c1 r/s  c2 r/s  c1 ms/r  c2 ms/r  c1/c2
searches_create  44.70490  43.13935     2.2     2.3   447.05   431.39   1.04
all requests     44.70490  43.13935     2.2     2.3   447.05   431.39   1.04
This is mostly information we have seen before. The interesting bit of information is the far-right number in the table (1.04): the ratio of the old runtime to the new. In this case, we see that the cache optimization afforded us a 4% improvement in overall runtime on this action. This is a decent performance improvement, for an action that is already fairly well optimized.
Compare that to the second optimization, the refactoring:
$ perf_comp b/perf_run.searches_create.txt c/perf_run.searches_create.txt
perf data file 1: b/perf_run.searches_create.txt
  requests=100, options=-bm=searches_create
perf data file 2: c/perf_run.searches_create.txt
  requests=100, options=-bm=searches_create

page              c1 real   c2 real  c1 r/s  c2 r/s  c1 ms/r  c2 ms/r  c1/c2
searches_create  43.13935  42.79449     2.3     2.3   431.39   427.94   1.01
all requests     43.13935  42.79449     2.3     2.3   431.39   427.94   1.01
This change only resulted in a performance gain of 1%, which is probably not worth pursuing for performance's sake alone. (In this case, the code in question was in desperate need of a rewrite, so this was a net gain anyway.)
We should also compare the actual ranges that the trial times fall into, to be sure that these results are statistically significant. To visualize this, I instrumented the Railsbench PerfInfo class to show the actual trial times (not just the mean and standard deviation), and piped the data through R.[51] The result is shown in Figure 6-5.
This box-and-whisker plot shows the median value (the heavy line), the first and third quartile (into which the middle half of the data points fall; represented by the box), and the complete range (the brackets). From this, we see that the first optimization was a clear winner; the ranges of values do not even overlap.
The second optimization is not as clear-cut; there is still significant overlap between the two ranges of observations. If there is any performance gain, it is likely marginal, as it is overshadowed by the inherent variability of the response times.
[51] R (http://www.r-project.org/) is a language for statistical computing.