Hash tables in Ruby

image
Every time you write a method,
Ruby creates an entry in a hash table.

Hash tables are a commonly used, well known, old concept in computer science. They organize values into groups or “bins” based on an integer value calculated from each value called a “hash.” Later when you need to search for and find a value, by recalculating the hash value you can figure out which bin the value is contained in, to speed up the search.

Here’s a high level diagram showing a single hash object and its hash table:

image

On the left is the RHash structure; this is short for “Ruby Hash.” On the right, I show the hash table used by this hash, represented by the st_table structure. This C structure contains the basic information about the hash table, such as the number of entries saved in the table, the number of bins and a pointer to the bins. Each RHash structure contains a pointer to a corresponding st_table structure. Finally, I show some empty bins on the lower right. Ruby 1.8 and Ruby 1.9 initially create 11 bins for a new, empty hash.

The best way to understand how a hash table works is by stepping through an example. Let’s suppose I add a new key/value to a hash called my_hash:

my_hash[:key] = "value"

While executing this line of code, Ruby will create a new structure called an st_table_entry and will save it into the hash table for my_hash:

image

Here you can see Ruby saved the new key/value pair under the third bucket, #2. Ruby did this by taking the given key, the symbol :key in this example, and passing it to an internal hash function that returns a pseudorandom integer:

some_value = internal_hash_function(:key)

Next, Ruby takes the hash value, some_value in this example, and calculates the modulus by the number of bins… i.e. the remainder after dividing by the number of bins:

some_value % 11 = 2

In this diagram I imagine that the actual hash value for :key divided by 11 leaves a remainder of 2. Later in this chapter I’ll explore the hash functions that Ruby actually uses in more detail.

Now let’s add a second element to the hash:

my_hash[:key2] = "value2"

And this time let’s imagine that the hash value of :key2 divided by 11 yields a remainder of 5:

internal_hash_function(:key2) % 11 = 5

Now you can see Ruby places a second st_table_entry structure under bin #5, the sixth bin:

image

The benefit of using a hash table comes later, when you ask Ruby to retrieve the value for a given key:

p my_hash[:key]
=> "value"

If Ruby had saved all of the keys and values in an array or linked list, then it would have to iterate over all the elements in that array or list, looking for :key. This might take a very long time, depending on how many elements there were. But using a hash table Ruby can jump straight to the key it needs to find by recalculating the hash value for that key. It simply calls the hash function again:

some_value = internal_hash_function(:key)

…redivides the hash value by the number of bins and obtaining the remainder, the modulus:

some_value % 11 = 2

…and now Ruby knows to look in bin #2 for the entry with a key of :key. In a similar way, Ruby can later find the value for :key2 by repeating the same hash calculation:

internal_hash_function(:key2) % 11 = 5

Believe it or not, the C library used by Ruby to implement hash tables was originally written back in the 1980’s by Peter Moore from the University of California at Berkeley, and later modified by the Ruby core team. You can find Peter Moore’s hash table code in the C code files st.c and include/ruby/st.h. All of the function and structure names use the naming convention st_ in Peter’s hash table code.

Meanwhile, the definition of the RHash structure that represents every Ruby Hash object can be found in the include/ruby/ruby.h file. Along with RHash, here you’ll find all of the other primary object structures used in the Ruby source code: RString, RArray, RValue, etc.

Experiment 4-1: Retrieving a value from hashes of varying sizes

image

My first experiment will create hashes of wildly different sizes, from 1 element to 1 million elements and then measure how long it takes to find and return a value from each of these hashes. First, I create hashes of different sizes, based on powers of two, by running this code for different values of exponent:

size = 2**exponent
hash = {}
(1..size).each do |n|
  index = rand
  hash[index] = rand
end

Here both the keys and values are random floating values. Then I measure how long it takes to find one of the keys, the target_key 10,000 times using the benchmark library:

Benchmark.bm do |bench|
bench.report("retrieving an element
              from a hash with #{size} elements 10000 times") do
    10000.times do
      val = hash[target_key]
    end
  end
end

By using a hash table internally, Ruby is able to find and return value from a hash containing over a million elements just as fast as it takes to return one from a small hash:

image
Time to retrieve 10,000 values (ms) vs. hash size

Clearly the hash function Ruby uses is very fast, and once Ruby identifies the bin containing the target key, it is able to very quickly find the corresponding value and return it. What’s remarkable about this is that the values in this chart are more or less flat.