Chapter 16. Regular Expressions

Regular expressions provide you with powerful ways to find and modify patterns in text—not only short bits of text such as might be entered at a command prompt but also huge stores of text such as might be found in files on disk.

A regular expression takes the form of a pattern that is compared with a string. Regular expressions also provide the means by which you can modify strings so that, for example, you might change specific characters by putting them into uppercase, you might replace every occurrence of “Diamond” with “Ruby,” or you might read in a file of programming code, extract all the comments, and write out a new documentation file containing all the comments but none of the code. You’ll find out how to write a comment-extraction tool shortly. First, though, let’s take a look at some very simple regular expressions.

Making Matches

Just about the simplest regular expression is a sequence of characters (such as “abc”) that you want to find in a string. A regular expression to match “abc” can be created by placing those letters between two forward slash delimiters, like this: /abc/. You can test for a match using the =~ operator method like this:

regex0.rb

p( /abc/ =~ 'abc' )                 #=> 0

If a match is made, an integer representing the character position in the string is returned. If no match is made, nil is returned.

p( /abc/ =~ 'xyzabcxyzabc' )        #=> 3
p( /abc/ =~ 'xycab' )               #=> nil

You can also specify a group of characters, between square brackets, in which case a match will be made with any one of those characters in the string. Here, for example, the first match is made with “c”; then that character’s position in the string is returned:

p( /[abc]/ =~ 'xycba' )             #=> 2
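
As a small sketch of how the returned position can be used, the integer can index straight back into the string to recover the character that matched. The helper name first_match_char is hypothetical; it is not one of this chapter's sample files:

```ruby
# Hypothetical helper: returns the first character of aStr that matches
# aRegEx, or nil when there is no match.
def first_match_char( aStr, aRegEx )
    pos = ( aRegEx =~ aStr )        # integer position, or nil
    if pos then
        aStr[pos, 1]                # one-character substring at that position
    else
        nil
    end
end

p( first_match_char( 'xycba', /[abc]/ ) )    #=> "c"
p( first_match_char( 'xyz', /[abc]/ ) )      #=> nil
```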

Although I’ve used forward-slash delimiters in the previous examples, there are alternative ways of defining regular expressions: You can specifically create a new Regexp object initialized with a string, or you can precede the regular expression with %r and use custom delimiters—nonalphanumeric characters—as you can with strings (see Chapter 3). In the following example, I use curly bracket delimiters:

regex1.rb

regex1 = Regexp.new('^[a-z]*$')
regex2 = /^[a-z]*$/
regex3 = %r{^[a-z]*$}

Each of the previous examples defines a regular expression that matches an all-lowercase string (I’ll explain the details of the expressions shortly). These expressions can be used to test strings like this:

def test( aStr, aRegEx )
    if aRegEx =~ aStr then
        puts( "All lowercase" )
    else
        puts( "Not all lowercase" )
    end
end

test( "hello", regex1 )             #=> matches: "All lowercase"
test( "hello", regex2 )             #=> matches: "All lowercase"
test( "Hello", regex3 )             #=> no match: "Not all lowercase"
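
If you want to convince yourself that the three notations really do describe the same pattern, you can compare the resulting objects. This is only a sketch: the method name lowercase_patterns is my own invention, and it relies on the fact that two Regexp objects are equal when their source text and options are the same:

```ruby
# Hypothetical helper: builds the same pattern in the three ways shown above.
def lowercase_patterns
    [ Regexp.new( '^[a-z]*$' ), /^[a-z]*$/, %r{^[a-z]*$} ]
end

pats = lowercase_patterns
p( pats[0] == pats[1] )      #=> true
p( pats[1] == pats[2] )      #=> true
p( pats[0].source )          #=> "^[a-z]*$"
```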

To test for a match, you can use if and the =~ operator:

if /def/ =~ 'abcdef'

The previous expression evaluates to true if a match is made (and an integer is returned); it would evaluate to false if no match were made (and nil were returned):

if_test.rb

RegEx = /def/
Str1  = 'abcdef'
Str2  = 'ghijkl'

if RegEx =~ Str1 then
    puts( 'true' )
else
    puts( 'false' )
end                          #=> displays: true

if RegEx =~ Str2 then
    puts( 'true' )
else
    puts( 'false' )
end                          #=> displays: false
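
One point that may surprise programmers coming from other languages: a match at position 0 still counts as true, because in Ruby only nil and false are treated as false. The following sketch shows this; the method name matches? is hypothetical and simply turns the result of =~ into a strict true or false:

```ruby
# Hypothetical helper: converts the result of =~ into true or false.
def matches?( aRegEx, aStr )
    if aRegEx =~ aStr then true else false end
end

p( /abc/ =~ 'abcdef' )               #=> 0
p( matches?( /abc/, 'abcdef' ) )     #=> true  (0 counts as true in Ruby)
p( matches?( /def/, 'abcdef' ) )     #=> true
p( matches?( /def/, 'ghijkl' ) )     #=> false
```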

Frequently, it is useful to attempt to match some expression from the very start of a string; you can use the character ^ followed by a match term to specify this. It may also be useful to make a match from the end of the string; you use the character $ preceded by a match term to specify that.

start_end1.rb

puts( /^a/ =~ 'abc' )        #=> 0
puts( /^b/ =~ 'abc' )        #=> nil
puts( /c$/ =~ 'abc' )        #=> 2
puts( /b$/ =~ 'abc' )        #=> nil

Note

As mentioned previously, when a nil value is passed to print or puts in Ruby 1.9, nothing is displayed. In Ruby 1.8, nil is displayed. To be sure that nil is displayed in Ruby 1.9, use p instead of puts.
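
One detail is worth checking for yourself: in Ruby, ^ and $ anchor to the start and end of a line, so they can match in the middle of a string that contains newline characters. The anchors \A and \z, which are not otherwise used in this chapter, match only at the very start and the very end of the whole string. For a single-line string such as 'abc', the two kinds of anchor behave the same way. The constant TwoLines is a name I've made up for this sketch:

```ruby
# A string containing two lines: "abc" and "def".
TwoLines = "abc\ndef"

p( /^d/ =~ TwoLines )      #=> 4   (^ matches at the start of the second line)
p( /c$/ =~ TwoLines )      #=> 2   ($ matches just before the newline)
p( /\Ad/ =~ TwoLines )     #=> nil (\A matches only at the start of the string)
p( /c\z/ =~ TwoLines )     #=> nil (\z matches only at the end of the string)
p( /\Aa/ =~ TwoLines )     #=> 0
```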

Matching from the start or end of a string becomes more useful when it forms part of a more complex expression. Often such an expression tries to match zero or more instances of a specified pattern. The * character is used to indicate zero or more matches of the pattern that it follows. Formally, this is known as a quantifier. Consider this example:

start_end2.rb

p( /^[a-z 0-9]*$/ =~ 'well hello 123' )

Here, the regular expression specifies a range of characters between square brackets. This range includes all lowercase characters (a–z), all digits (0–9), and the space character (that’s the space between the z and the 0 in the expression shown earlier). The ^ character means the match must be made from the start of the string, the * character after the range means that zero or more matches with the characters in the range must be made, and the $ character means that the matches must be made right up to the end of the string. In other words, this pattern will only match a string containing lowercase characters, digits, and spaces from the start right to the end of the string:

puts( /^[a-z 0-9]*$/ =~ 'well hello 123' ) # match at 0
puts( /^[a-z 0-9]*$/ =~ 'Well hello 123' ) # no match due to ^ and upcase W

Actually, this pattern will also match an empty string, since * indicates that zero or more matches are acceptable:

puts( /^[a-z 0-9]*$/ =~ '' )        # this matches!

If you want to exclude empty strings, use + (to match one or more occurrences of the pattern):

puts( /^[a-z 0-9]+$/ =~ '' )        # no match

Try the code in start_end2.rb for more examples of ways in which ^, $, * and + may be combined with ranges to create a variety of different match patterns.
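
The difference between * and + is easiest to see when the anchors are left out. Because * is satisfied by zero occurrences, an unanchored pattern using * can succeed at position 0 without consuming any characters at all, whereas + has to find at least one real occurrence. The following is a sketch only; digit_positions is a hypothetical helper and not part of start_end2.rb:

```ruby
# Hypothetical helper: returns the match positions of an unanchored
# "zero or more digits" pattern and a "one or more digits" pattern.
def digit_positions( aStr )
    [ ( /[0-9]*/ =~ aStr ), ( /[0-9]+/ =~ aStr ) ]
end

p( digit_positions( 'abc123' ) )    #=> [0, 3]
p( digit_positions( 'abc' ) )       #=> [0, nil]
p( digit_positions( '' ) )          #=> [0, nil]
```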

You could use these techniques to determine specific characteristics of strings, such as whether a given string is uppercase, lowercase, or mixed case:

regex2.rb

aStr = "HELLO WORLD"

case aStr
    when /^[a-z 0-9]*$/
        puts( "Lowercase" )
    when /^[A-Z 0-9]*$/
        puts( "Uppercase" )
    else
        puts( "Mixed case\n" )
end

Since the string assigned to aStr is currently all uppercase, the previous code displays the “Uppercase” string. But if aStr were assigned hello world, it would display “Lowercase,” and if aStr were assigned Hello World, it would display “Mixed case.”
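
The same test can be wrapped in a method so that it is easy to try with several strings. This is a sketch under my own assumptions: describe_case is a hypothetical name, and the method returns the description instead of printing it. Notice that a when clause compares its pattern with the string using the === operator, and that a string containing no letters at all (such as "123") satisfies the first pattern and so is reported as lowercase:

```ruby
# Hypothetical helper: returns a description of the case of aStr.
def describe_case( aStr )
    case aStr
        when /^[a-z 0-9]*$/ then "Lowercase"
        when /^[A-Z 0-9]*$/ then "Uppercase"
        else "Mixed case"
    end
end

p( describe_case( "HELLO WORLD" ) )         #=> "Uppercase"
p( describe_case( "hello world" ) )         #=> "Lowercase"
p( describe_case( "Hello World" ) )         #=> "Mixed case"
p( describe_case( "123" ) )                 #=> "Lowercase"
p( /^[A-Z 0-9]*$/ === "HELLO WORLD" )       #=> true
```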

Often regular expressions are used to process the text in a file on disk. Let’s suppose, for example, that you want to display all the full-line comments in a Ruby file but omit all the code and partial-line comments. You could do this by trying to match from the start of each line (^) zero or more whitespace characters (a whitespace character is represented by \s) up to a comment character (#).

regex3a.rb

# displays all the full-line comments in a Ruby file
File.foreach( 'regex1.rb' ){ |line|
    if line =~ /^\s*#/ then
        puts( line )
    end
}
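
If you would like to experiment with the pattern without reading a file from disk, the same test can be applied to the lines of a string. The following is a minimal sketch: the method full_line_comments and the constant Sample are names I've invented for illustration, and the method collects the matching lines in an array (with surrounding whitespace stripped) instead of printing them:

```ruby
# Hypothetical helper: returns the full-line comments found in a string
# of Ruby source code, ignoring code and partial-line comments.
def full_line_comments( source )
    comments = []
    source.each_line{ |line|
        if line =~ /^\s*#/ then
            comments << line.strip
        end
    }
    comments
end

Sample = "# a full-line comment\nx = 1   # a partial-line comment\n    # an indented comment\nputs( x )\n"

p( full_line_comments( Sample ) )
#=> ["# a full-line comment", "# an indented comment"]
```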