Parsing: how Ruby understands the code you write

image
Ruby uses an LALR parser generator called Bison.

Now that Ruby has converted your code into a series of tokens, what does it do next? How does it actually understand and run your program? Does Ruby simply step through the tokens and execute each one in order?

No, it doesn’t… your code still has a long way to go before Ruby can run it. As I said above, the next step on your code’s journey through Ruby is called “parsing”: the process of grouping the words or tokens into sentences or phrases that make sense to Ruby. It is during the parsing process that Ruby takes into account the order of operations, methods and arguments, blocks, and other larger code structures. But how does Ruby do this? How can Ruby – or any language – actually “understand” what you’re telling it with your code? For me, this is one of the most fascinating areas of computer science… endowing a computer program with intelligence.

Ruby, like many programming languages, uses something called an “LALR parser generator” to process the stream of tokens that we just saw above. Parser generators were invented back in the 1960s; like the name implies, parser generators take a series of grammar rules and generate code that can later parse and understand tokens that follow those rules. The most widely used and well known parser generator is called Yacc (“Yet Another Compiler Compiler”), but Ruby instead uses a newer version of Yacc called Bison, part of the GNU project. The term “LALR” describes how the generated parser actually works internally – more on that below.

Bison, Yacc and other parser generators require you to express your grammar rules using “Backus–Naur Form” (BNF) notation. For Bison and Yacc, this grammar rule file will have a “.y” extension, named after “Yacc.” The grammar rule file in the Ruby source code is called parse.y – the same file I mentioned earlier that contains the tokenize code. It is in this parse.y file that Ruby defines the actual syntax and grammar that you have to use while writing your Ruby code. The parse.y file is really the heart and soul of Ruby – it is where the language itself is actually defined!

Ruby doesn’t use Bison to process the tokens directly; instead, Ruby runs Bison ahead of time, during its build process, to create the actual parser code. There are really two separate steps to the parsing process, then:

image

Ahead of time, before you ever run your Ruby program, the Ruby build process uses Bison to generate the parser code (parse.c) from the grammar rules file (parse.y). Then later at run time this generated parser code actually parses the tokens returned by Ruby’s tokenizer code. You might have built Ruby yourself from source manually or automatically on your computer by using a tool like Homebrew. Or someone else may have built Ruby ahead of time for you, if you installed Ruby with a prepared install kit.

As I explained at the end of the last section, the parse.y file, and therefore the generated parse.c file, also contains the tokenization code. This is why I show the diagonal arrow from parse.c to the “Tokenize” process on the lower left. In fact, the parse engine I am about to describe calls the tokenization code whenever it needs a new token. The tokenization and parsing processes actually occur simultaneously.
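This on-demand relationship between the two processes can be sketched in Ruby with an Enumerator that produces one token each time the caller asks for the next one – a toy model of my own, not Ruby’s actual C implementation:

```ruby
# A toy tokenizer: the Enumerator yields one token at a time,
# only when the caller (the "parser") requests the next one.
def tokenizer(code)
  Enumerator.new do |y|
    code.scan(/\d+|\w+|\S/) { |tok| y << tok }
  end
end

tokens = tokenizer("10.times do |n|")
tokens.next  # => "10"
tokens.next  # => "."
tokens.next  # => "times"
```

Nothing past the requested token is processed until the parser asks for it – which is exactly the interleaving described above.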

Understanding the LALR parse algorithm

Now let’s take a look at how grammar rules work – the best way to become familiar with grammar rules is to take a close look at one simple example. Suppose I want to translate from the Spanish:

Me gusta el Ruby

…to the English:

I like Ruby

… and that to do this I use Bison to generate a C language parser from a grammar file. Using the Bison/Yacc grammar rule syntax – the “Backus–Naur” notation – I can write a simple grammar rule like this, with the rule name on the left and the matching tokens on the right. (This is actually a slightly modified version of BNF that Bison uses – the original BNF syntax would have used ‘::=’ instead of a simple ‘:’ character.)

SpanishPhrase : me gusta el ruby {
  printf("I like Ruby\n");
}

This grammar rule means: if the token stream is equal to “me”, “gusta,” “el” and “ruby” – in that order – then we have a match. If there’s a match the Bison generated parser will run the given C code, the printf statement (similar to puts in Ruby), which will print out the translated English phrase.
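In Ruby terms, this single-rule match amounts to comparing the input token stream against the rule’s right-hand side – the names below are my own invention, not Bison’s:

```ruby
# One grammar rule: its name plus the exact token sequence it matches.
RULE = { name: :SpanishPhrase, tokens: %w[me gusta el ruby] }

def match?(rule, input)
  rule[:tokens] == input   # a match only if every token lines up, in order
end

# Run the rule's "action" on a match, like the printf above:
puts "I like Ruby" if match?(RULE, %w[me gusta el ruby])
```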

How does this work? Here’s a conceptual picture of the parsing process in action:

image

At the top I show the four input tokens, with the grammar rule right underneath them. It’s obvious in this case that there’s a match, since each input token corresponds directly to one of the terms on the right side of the grammar rule. In this example we have a match on the SpanishPhrase rule.

Now let’s change the example to be a bit more complex: suppose I need to enhance my parser to match both:

Me gusta el Ruby

and:

Le gusta el Ruby

…which means “She/He/It likes Ruby.” Here’s a more complex grammar file that can parse both Spanish phrases. (Again, this is the modified version of BNF used by Bison – the original syntax from the 1960s would use < > around the child rule names, for example: “VerbAndObject ::= <SheLikes> | <ILike>”.)

SpanishPhrase: VerbAndObject el ruby {
  printf("%s Ruby\n", $1);
};

VerbAndObject: SheLikes | ILike {
  $$ = $1;
};

SheLikes: le gusta {
  $$ = "She likes";
};

ILike: me gusta {
  $$ = "I like";
};

There’s a lot more going on here; you can see four grammar rules instead of just one. Also, I’m using the Bison directive $$ to return a value from a child grammar rule to a parent, and $1 to refer to a child’s value from a parent.
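The way values flow from child rules up to parents can be mimicked in Ruby with methods that return values to their callers – a toy recursive-descent version of my own, not how Bison’s generated parser actually works:

```ruby
# Each "rule" here is a Ruby method that consumes tokens and returns a
# value to its caller, playing the role of $$ in the Bison actions above.
def verb_and_object(tokens)
  case tokens.shift(2)
  when %w[le gusta] then "She likes"   # the SheLikes child rule
  when %w[me gusta] then "I like"      # the ILike child rule
  end
end

def spanish_phrase(tokens)
  subject = verb_and_object(tokens)    # the child rule's value, like $1
  "#{subject} Ruby" if subject && tokens == %w[el ruby]
end

spanish_phrase(%w[le gusta el ruby])  # => "She likes Ruby"
spanish_phrase(%w[me gusta el ruby])  # => "I like Ruby"
```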

Now things aren’t so obvious – the parser can’t immediately match any of the grammar rules like in the previous, trivial example:

image

Here, using the SpanishPhrase rule, the el and ruby tokens match, but le and gusta do not. Ultimately we’ll see how the child rule VerbAndObject does match “le gusta,” but for now there is no immediate match. And now that there are four grammar rules, how does the parser know which one to try to match against? And against which tokens?

This is where the real intelligence of the LALR parser comes into play. This acronym describes the algorithm the parser uses to find matching grammar rules, and means “Look Ahead LR parser.” We’ll get to the “Look Ahead” part in a minute, but let’s start with “LR:”

Here’s how the algorithm works for this example. First, the parser takes the input token stream:

image

… and shifts the tokens to the left, creating something I’ll call the “Grammar Rule Stack:”

image

Here, since the parser has processed just one token, le, it sits on the stack alone for the moment. “Grammar Rule Stack” is a simplification; in reality the parser does use a stack, but instead of grammar rules it pushes numbers onto its stack that indicate which grammar rule it just parsed. These numbers – or states from a state machine – help the parser keep track of where it is as it processes the tokens.

Next, the parser shifts another token to the left:

image

Now there are two tokens in the stack on the left. At this point the parser stops to examine all of the different grammar rules and looks for one that matches. In this case, it finds that the SheLikes rule matches:

image

This operation is called “reduce,” since the parser is replacing the pair of tokens with a single, matching rule. This seems very straightforward… the parser just has to look through the available rules and reduce or apply the single, matching rule.

The parser in our example can now reduce again, since there is another matching rule: VerbAndObject! This rule matches because of the OR (vertical bar) operator: it matches either the SheLikes or ILike child rules. The parser can next replace SheLikes with VerbAndObject:

image

But let’s stop for a moment and think about this a bit more carefully. How did the parser know to reduce and not continue to shift tokens? Also, in the real world there might actually be many matching rules the parser could reduce with – how does it know which rule to use? This is the crux of the algorithm that LALR parsers use… that Ruby uses… how does it decide whether to shift or reduce? And if it reduces, how does it decide which grammar rule to reduce with?

In other words, suppose at this point in the process…

image

… there were multiple matching rules that included “le gusta.” How would the parser know which rule to apply or whether to shift in the el token first before looking for a match?

Here’s where the “LA” (Look Ahead) portion of LALR comes in: in order to find the proper matching rule, the parser looks ahead at the next token:

image

Additionally, the parser maintains a state table of possible outcomes depending on what the next token is and which grammar rule was just parsed. The table contains a series of states describing which grammar rules have been parsed so far, and which state to move to next depending on the next token. LALR parsers are complex state machines that match patterns in the token stream. When you use Bison to generate the LALR parser, Bison calculates what this state table should contain, based on the grammar rules you provided.

In this example, the state table would contain an entry indicating that if the next token were el, the parser should first reduce using the SheLikes rule before shifting a new token.
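The shift/reduce loop for this tiny grammar can be sketched in Ruby. This greedy reduce-first version happens to work here without a state table; the real LALR parser would instead consult its lookahead token and state table to make each shift-or-reduce decision:

```ruby
# Grammar rules: right-hand-side sequence => the name that replaces it.
RULES = {
  %w[le gusta]              => "SheLikes",
  %w[me gusta]              => "ILike",
  %w[SheLikes]              => "VerbAndObject",
  %w[ILike]                 => "VerbAndObject",
  %w[VerbAndObject el ruby] => "SpanishPhrase"
}

def parse(tokens)
  stack = []
  until tokens.empty? && stack == ["SpanishPhrase"]
    # Reduce: replace the top of the stack with a matching rule name...
    rhs = RULES.keys.find { |seq| stack.last(seq.size) == seq }
    if rhs
      stack = stack[0...-rhs.size] << RULES[rhs]
    elsif tokens.empty?
      return nil              # no more tokens and no match possible
    else
      stack << tokens.shift   # ...otherwise shift the next token
    end
  end
  stack.first
end

parse(%w[le gusta el ruby])  # => "SpanishPhrase"
```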

I won’t show the details of what a state table looks like; if you’re interested, the actual LALR state table for Ruby can be found in the generated parse.c file. Instead let’s just continue the shift/reduce operations for my simple example. After matching the VerbAndObject rule, the parser would shift another token to the left:

image

At this point no rules would match, and the state machine would shift another token to the left:

image

And finally, the parent grammar rule SpanishPhrase would match after a final reduce operation:

image

Why have I shown you this Spanish to English example? Because Ruby parses your program in exactly the same way! Inside the Ruby parse.y source code file, you’ll see hundreds of rules that define the structure and syntax of the Ruby language. There are parent and child rules, and the child rules return values the parent rules can refer to in exactly the same way, using the $$, $1, $2, etc. symbols. The only real difference is scale – my Spanish phrase grammar is extremely simple, trivial really. On the other hand, Ruby’s grammar is very complex, an intricate series of interrelated parent and child grammar rules, which sometimes even refer to each other in circular, recursive patterns. But this complexity just means that the generated state table in the parse.c file is larger. The basic LALR algorithm – how the parser processes tokens and uses the state table – is the same.

Some actual Ruby grammar rules

Let’s take a quick look at some of the actual Ruby grammar rules from parse.y. Here’s my example Ruby script from the last section on tokenization:

10.times do |n|
  puts n
end

This is a very simple Ruby script, right? Since it’s so short, it shouldn’t be too difficult to trace the Ruby parser’s path through its grammar rules. Let’s take a look at how it works:

image

On the left I show the code that Ruby is trying to parse. On the right are the actual matching grammar rules from the Ruby parse.y file, shown in a simplified manner. The first rule, “program: top_compstmt,” is the root grammar rule which matches every Ruby program in its entirety. As you follow the list down, you can see a complex series of child rules that also match my entire Ruby script: “top statements,” a single statement, an expression, an argument and finally a “primary” value.

Once Ruby’s parser reaches the “primary” grammar rule, it encounters a rule that has two matching child rules: “method_call” and “brace_block.” Let’s take the “method_call” rule first:

image

The “method_call” rule matches the “10.times” portion of my Ruby code - i.e. where I call the times method on the 10 Fixnum object. You can see the rule matches another primary value, followed by a period character, followed, in turn, by an “operation2.” The period is simple enough, and here’s how the “primary_value” and “operation2” child rules work: first the “primary_value” rule matches the literal “10:”

image

And then the “operation2” rule matches the method name times:

image

What about the rest of my Ruby code? How does Ruby parse the contents of the “do … puts… end” block I passed to the times method? Ruby handles that using the “brace_block” rule from above:

image

I won’t go through all of the remaining child grammar rules here, but you can see how this rule, in turn, contains a series of other matching child rules.

To give you a taste of what the actual Ruby parse.y source code looks like, here’s a small portion of the file containing part of the “method_call” grammar rule I showed above:

method_call : ...
    | primary_value '.' operation2
        {
        /*%%%*/
            $<num>$ = ruby_sourceline;
        /*% %*/
        }
      opt_paren_args
        {
        /*%%%*/
            $$ = NEW_CALL($1, $3, $5);
            nd_set_line($$, $<num>4);
        /*%
            $$ = dispatch3(call, $1, ripper_id2sym('.'), $3);
            $$ = method_optarg($$, $5);
        %*/
        }

As in my Spanish to English example grammar file above, you can see there are snippets of complex C code that appear after each of the terms in the grammar rule. The Bison generated parser will execute these snippets if and when there’s a match for this rule on the tokens found in the target Ruby script. These C code snippets also contain Bison directives such as $$ and $1 that allow the code to create return values and to refer to values returned by other grammar rules. What we end up with is a confusing mix of C and Bison directives.

And to make things even worse, Ruby uses a trick during the Ruby build process to divide these C/Bison code snippets into separate pieces – some that are actually used by Ruby and others that are only used by the Ripper tool which we tried out in Experiment 1-1. Here’s how that works:

Ruby uses this very confusing syntax to allow the Ripper tool and Ruby itself to share the same grammar rules inside of parse.y.

And what are these snippets actually doing? As you might guess Ruby uses the Ripper code snippets to allow the Ripper tool to display information about what Ruby is parsing. We’ll try that next in Experiment 1-2. There’s also some bookkeeping code: Ruby uses the ruby_sourceline variable to keep track of what source code line corresponds to each portion of the grammar.

But more importantly, the snippets Ruby actually uses at run time when parsing your code create a series of “nodes” or temporary data structures that form an internal representation of your Ruby code. These nodes are saved in a tree structure called an Abstract Syntax Tree (or AST)… more on that in a minute. You can see one example of creating an AST node above, where Ruby calls the NEW_CALL C macro/function. This creates a new NODE_CALL node, which represents a method call. We’ll see later how Ruby eventually compiles this into byte code that can be executed by a virtual machine.
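Conceptually, a call like NEW_CALL builds a small typed node that points at its children. Here’s a rough Ruby approximation of those C structures – the field names and helper are my own, not Ruby’s actual internals:

```ruby
# A toy version of Ruby's internal AST nodes: each node has a type and
# child values, roughly mirroring what NEW_CALL($1, $3, $5) builds in C.
Node = Struct.new(:type, :children)

def new_call(receiver, method_name, args)
  Node.new(:NODE_CALL, [receiver, method_name, args])
end

# Roughly the node the parser would create for "10.times":
call = new_call(Node.new(:NODE_LIT, [10]), :times, nil)
call.type  # => :NODE_CALL
```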

Experiment 1-2: Using Ripper to parse different Ruby scripts

image

In Experiment 1-1 I showed you how to use Ripper to display the tokens Ruby converts your code into, and we just saw how all of the Ruby grammar rules in parse.y are also included in the Ripper tool. Now let’s learn how to use Ripper to display information about how Ruby parses your code. Here’s how to do it:

require 'ripper'
require 'pp'
code = <<STR
10.times do |n|
  puts n
end
STR
puts code
pp Ripper.sexp(code)

This is exactly the same code I showed in the first experiment, except that here I call Ripper.sexp instead of Ripper.lex. Running this I get:

[:program,
  [[:method_add_block,
     [:call,
       [:@int, "10", [1, 0]], :".",
       [:@ident, "times", [1, 3]]],
     [:do_block,
       [:block_var,
         [:params, [[:@ident, "n", [1, 13]]],
                   nil, nil, nil, nil, nil, nil],
         false],
       [[:command,
          [:@ident, "puts", [2, 2]],
          [:args_add_block, [[:var_ref, [:@ident, "n", [2, 7]]]],
                            false]]]]]]]

What the heck does this mean? I can see some bits and pieces from my Ruby script in this cryptic text, but what do all of the other symbols and arrays mean here?

It turns out that the output from Ripper is a textual representation of your Ruby code. As Ruby parses your code, matching one grammar rule after another, it converts the tokens found in your code file into a complex internal data structure called an Abstract Syntax Tree (AST). You can see some of the C code that produces this structure in the parse.y excerpt above. The purpose of the AST is to record the structure and syntactical meaning of your Ruby code. To see what I mean, here’s a small piece of the AST structure Ripper just displayed for my sample Ruby script:

image

This is how Ruby represents that single call puts n internally. This corresponds to the last three lines of the Ripper output:

[[:command,
   [:@ident, "puts", [2, 2]],
   [:args_add_block, [[:var_ref, [:@ident, "n", [2, 7]]]],
                     false]]]

Like in Experiment 1-1 when we displayed token information from Ripper, you can see the source code file line and column information are displayed as integers. For example [2, 2] indicates that Ripper found the puts call on line 2 at column 2 of my code file. Aside from that, you can see that Ripper outputs an array for each of the nodes in the AST, “[:@ident, “puts”, [2, 2]]” for example.

What’s interesting and important about this is that now my Ruby program is beginning to “make sense” to Ruby. Instead of a simple stream of tokens, which could mean anything, Ruby now has a detailed description of what I meant when I wrote puts n. We have a function call, “a command,” followed by an identifier node which indicates what function to call. Ruby uses the args_add_block node since you might optionally pass a block to a command/function call like this. Even though we are not passing a block in this case, the args_add_block node is still saved into the AST. Another interesting detail is how the n identifier is recorded as a :var_ref or variable reference node, and not as a simple identifier.
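Since the Ripper output is just nested Ruby arrays, you can walk it yourself. For example, here’s a short sketch of my own that collects every identifier along with its line/column position:

```ruby
require 'ripper'

# Recursively walk the nested sexp arrays, collecting
# [name, [line, column]] pairs for every :@ident node.
def idents(sexp, found = [])
  if sexp.is_a?(Array)
    if sexp[0] == :@ident
      found << [sexp[1], sexp[2]]
    else
      sexp.each { |child| idents(child, found) }
    end
  end
  found
end

idents(Ripper.sexp("10.times do |n|\n  puts n\nend"))
# => [["times", [1, 3]], ["n", [1, 13]], ["puts", [2, 2]], ["n", [2, 7]]]
```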

Let’s take a look at more of the Ripper output:

image

Here you can see Ruby now understands that “do |n| … end” is a block, with a single block parameter called n. The puts n box on the right represents the other part of the AST I showed above, the parsed version of the puts call.

And finally here’s the entire AST for my sample Ruby code:

image

Here you can see “method add block” means we’re calling a method and also passing it a block: “10.times do.” The “call” tree node obviously represents the actual method call “10.times”. This is the NODE_CALL node that we saw earlier in the C code snippet.

Again, the key point here is that now your Ruby program is no longer a simple series of tokens – Ruby now “understands” what you meant with your code. Ruby’s knowledge of your code is saved in the way the nodes are arranged in the AST.

To make this point even more clear, suppose I pass the Ruby expression “2+2” to Ripper like this:

require 'ripper'
require 'pp'
code = <<STR
2 + 2
STR
puts code
pp Ripper.sexp(code)

And running it I get:

[:program,
  [[:binary,
     [:@int, "2", [1, 0]],
     :+,
     [:@int, "2", [1, 4]]]]]

Here you can see the + is represented with an AST node called “binary:”

image

Not very surprising, but look what happens when I pass the expression “2 + 2 * 3” into Ripper:

require 'ripper'
require 'pp'
code = <<STR
2 + 2 * 3
STR
puts code
pp Ripper.sexp(code)

Now I get:

[:program,
 [[:binary,
   [:@int, "2", [1, 0]],
   :+,
   [:binary,
     [:@int, "2", [1, 4]],
     :*,
     [:@int, "3", [1, 8]]]]]]

And here’s what that looks like:

image

Note how Ruby was smart enough to realize that multiplication has a higher precedence than addition does. We all knew this, of course… this isn’t very interesting. But what is interesting to me here is how the AST tree structure itself inherently captures the information about the order of operations. The simple token stream: 2 + 2 * 3 just indicates what I wrote in my code file, while the parsed version saved to the AST structure now contains the meaning of my code – all the information Ruby will need later to execute it.
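You can see this directly by evaluating the nested :binary nodes: because the multiplication is nested deeper in the tree, it gets computed first. Here’s a toy evaluator of my own that handles only the :program, :binary, and :@int nodes from the Ripper output:

```ruby
require 'ripper'

# Evaluate the nested :binary / :@int nodes from Ripper's sexp output.
def evaluate(node)
  case node[0]
  when :program then evaluate(node[1][0])  # the single statement inside
  when :binary  then evaluate(node[1]).send(node[2], evaluate(node[3]))
  when :@int    then node[1].to_i
  end
end

evaluate(Ripper.sexp("2 + 2 * 3"))  # => 8
```

The evaluator never looks at precedence itself; the answer comes out right purely because of how the parser arranged the nodes.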

One final note: Ruby itself actually contains some debug code that can also display information about the AST node structure. To use it, just run your Ruby script with the “parsetree” option:

$ ruby --dump parsetree your_script.rb

This will display the same information we’ve just seen, but in a different format. Instead of showing symbols, the “parsetree” option will show the actual node names from the C source code. In the next section, about how Ruby compiles your code, I’ll also use the actual node names.