Ruby Under a Microscope

Tokenization, parsing and compilation in Rubinius

Now let’s take a look at Rubinius and how it parses your Ruby code. You may have heard that Rubinius is a version of Ruby implemented with Ruby, but did you know this also applies to the compiler that Rubinius uses? That’s right: as hard as it is to imagine, when you run a Ruby script using Rubinius, it compiles your Ruby code using Ruby.

At a high level the process looks very similar to MRI and JRuby:

Again at build time, before you ever run your Ruby program, Rubinius generates an LALR parser using Bison – the same tool that MRI Ruby uses. Just like JRuby, the Rubinius team has more or less copied the same grammar rules over from the original MRI parse.y file. In Rubinius the grammar file is called either “grammar18.y” or “grammar19.y” – just like JRuby, Rubinius maintains two copies of the grammar rules for its 1.8 and 1.9 compatibility modes.

Later, when you run your program, Rubinius converts your code into a token stream, then into an AST structure, and finally into high level instructions called “Rubinius instructions.” One nice feature of Rubinius is that it allows you to save these compiled instructions into special “.rbc” files. That is, Rubinius exposes a compile command that lets you precompile your Ruby code before you actually run it, if you prefer, saving some time later. Remember that MRI doesn’t provide this feature: Ruby 1.9 and 2.0 always compile your code every time you run it.

But what makes Rubinius fascinating is the way that it implements Ruby using Ruby, or more precisely a combination of C, C++ and Ruby. I’ll have more examples of this later in other chapters, but for now let’s take a look at how Rubinius parses and compiles your code. Here’s the same diagram I had for MRI and JRuby showing all the different forms your code takes internally inside of Rubinius when you run it:

When you run a Ruby script using Rubinius, your code is converted into all of these different formats, and ultimately into machine language! At the top, the picture is the same: your Ruby script is once again tokenized and parsed, and converted into a similar AST structure. Next, Rubinius iterates through the AST nodes, compiling them into high level instructions which I’ll call “Rubinius instructions.” These are similar to the YARV instructions that Ruby 1.9 and 2.0 use internally, except that, as I mentioned above, they can optionally be saved into .rbc files for later use.
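To make the first two forms concrete, here is a minimal sketch that prints a token stream and an AST for a one line script. It assumes you run it with MRI and uses MRI’s Ripper standard library as a stand-in, since the Rubinius parser is derived from the same grammar rules; the exact tokens and node types Rubinius produces internally may differ from what Ripper reports.

```ruby
require 'ripper'
require 'pp'

source = "puts 2 + 2"

# Token stream: keep only the token type and its text.
# (Ripper.lex also returns position information, which we drop here.)
tokens = Ripper.lex(source).map { |_pos, type, text| [type, text] }
pp tokens

# AST: Ripper.sexp returns the parse tree as nested arrays.
sexp = Ripper.sexp(source)
pp sexp
```

The token list starts with the identifier puts, and the root of the tree is a :program node that contains the method call and its argument.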

Then in order to execute these instructions, Rubinius uses a well known and very powerful open source framework called the “Low Level Virtual Machine” or LLVM. The LLVM framework includes a number of different, powerful tools that make it easy – or at least easier – to write a language compiler. LLVM provides a low-level instruction set, a virtual machine to execute these instructions along with optimizers, a C/C++ compiler (Clang), a debugger and more.

Rubinius primarily leverages the LLVM virtual machine itself by converting the high level Rubinius instructions into low level LLVM instructions using a JIT (“just in time”) compiler written by the Rubinius team. That is, first your Ruby code is parsed and compiled into Rubinius instructions; later Rubinius converts these high level instructions into their equivalent low level LLVM instructions using a background thread as your Rubinius process runs.

As we’ll continue to see in later chapters, Rubinius’s implementation is a tour de force – it’s an innovative, creative implementation of Ruby that at the same time leverages some of the best open source software available to provide fantastic performance. For me one of the most elegant aspects of Rubinius internals is the way that it seamlessly combines C++, C and Ruby code together – the parsing/compiling process is a good example of this. Here’s a closer look at the way Rubinius processes your code:

Inside of Rubinius, parsing and compiling your code is a team effort:

First, as I mentioned above, Rubinius uses the same Bison generated LALR parser that MRI Ruby does. Rubinius also uses similar C code to first tokenize your code file’s text.
But next, the C code triggered by the matching grammar rules in the parser creates AST nodes… that are implemented by Ruby classes! Every type of AST node has a corresponding Ruby class, all of which have a common Ruby super class: Rubinius::AST::Node.
Next, each of these AST node Ruby classes contains code that compiles that type of AST node into Rubinius instructions.
Finally, once your Rubinius process is running, a JIT compiler written in C++ converts these high level Rubinius instructions into low level LLVM instructions.

The Rubinius Ruby compiler, itself written in Ruby, is very readable and straightforward to understand. Indeed, having much of Rubinius implemented in Ruby is one of its most important features. To see what I mean, take a look at how the send AST node – or method call – is compiled into high level Rubinius instructions:

module Rubinius
  module AST
    class Send < Node

...

      def bytecode(g)
        pos(g)
        if @vcall_style and reference = check_local_reference(g)
          return reference.get_bytecode(g)
        end
        @receiver.bytecode(g)
        if @block
          @block.bytecode(g)
          g.send_with_block @name, 0, @privately
        elsif @vcall_style
          g.send_vcall @name
        else
          g.send @name, 0, @privately
        end
      end

...

This is a snippet from the lib/compiler/ast/sends.rb Rubinius source code file. This class, Rubinius::AST::Send, implements the Send Rubinius AST node that the parser creates when it encounters a method or function call in your Ruby script. You can see the reference to the Rubinius::AST::Node super class.

I won’t explain every detail, but at a high level the way this works is:

When Rubinius compiles the AST nodes into Rubinius instructions, it visits every AST node object and calls its bytecode method, passing in a generator object, called “g” here. The generator object provides a DSL for creating Rubinius instructions, e.g. send_with_block or send.
After checking for the case where the function call might actually be a reference to a local variable, Rubinius calls @receiver.bytecode – this compiles the receiver object first.
Then Rubinius creates either a send_with_block, send_vcall or send instruction depending on various attributes of the node object.
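The steps above can be sketched in plain Ruby. This is a toy model, not the Rubinius implementation: the ToyGenerator, ToyNode, ToySelf, ToyLiteral and ToySend classes, the emit_send method, and the array format of the recorded instructions are all hypothetical names I made up for illustration. The only idea taken from the snippet above is that each AST node class has a bytecode method that compiles its receiver first and then asks a generator object to emit an instruction.

```ruby
# Hypothetical generator: records instructions in an array instead of
# producing real Rubinius instructions.
class ToyGenerator
  attr_reader :instructions

  def initialize
    @instructions = []
  end

  def push_self
    @instructions << [:push_self]
  end

  def push_literal(value)
    @instructions << [:push_literal, value]
  end

  def emit_send(name, arg_count)
    @instructions << [:send, name, arg_count]
  end
end

# Hypothetical common super class, playing the role of an AST node base class.
class ToyNode
  def bytecode(g)
    raise NotImplementedError, "#{self.class} must implement bytecode"
  end
end

class ToySelf < ToyNode
  def bytecode(g)
    g.push_self
  end
end

class ToyLiteral < ToyNode
  def initialize(value)
    @value = value
  end

  def bytecode(g)
    g.push_literal(@value)
  end
end

class ToySend < ToyNode
  def initialize(receiver, name, arguments = [])
    @receiver = receiver
    @name = name
    @arguments = arguments
  end

  def bytecode(g)
    @receiver.bytecode(g)                     # compile the receiver first
    @arguments.each { |arg| arg.bytecode(g) } # then each argument
    g.emit_send(@name, @arguments.size)       # finally emit the send itself
  end
end

# A hand-built tree for: puts 2 + 2
ast = ToySend.new(ToySelf.new, :puts,
                  [ToySend.new(ToyLiteral.new(2), :+, [ToyLiteral.new(2)])])

g = ToyGenerator.new
ast.bytecode(g)
g.instructions.each { |instruction| p instruction }
```

Because every node compiles its children before emitting its own instruction, the recorded list comes out in the order a stack based virtual machine would need: the receiver and arguments are pushed before the send that consumes them.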

To save space I’m glossing over some details here, but it’s a real pleasure reading the Ruby compiler code inside Rubinius since it’s so easy to understand and follow. Again, you can find all of the AST node Ruby classes in the lib/compiler/ast folder in your Rubinius source tree.