Regular expressions provide you with powerful ways to find and modify patterns in text—not only short bits of text such as might be entered at a command prompt but also huge stores of text such as might be found in files on disk.
A regular expression takes the form of a pattern that is compared with a string. Regular expressions also provide the means by which you can modify strings so that, for example, you might change specific characters by putting them into uppercase, you might replace every occurrence of “Diamond” with “Ruby,” or you might read in a file of programming code, extract all the comments, and write out a new documentation file containing all the comments but none of the code. You’ll find out how to write a comment-extraction tool shortly. First, though, let’s take a look at some very simple regular expressions.
Just about the simplest regular expression is a sequence of characters (such as “abc”) that you want to find in a string. A regular expression to match “abc” can be created by placing those letters between two forward-slash delimiters, like this: /abc/. You can test for a match using the =~ operator method like this:
regex0.rb
p( /abc/ =~ 'abc' ) #=> 0
If a match is made, an integer representing the character position in the string is returned. If no match is made, nil is returned.
p( /abc/ =~ 'xyzabcxyzabc' ) #=> 3
p( /abc/ =~ 'xycab' ) #=> nil
You can also specify a group of characters, between square brackets, in which case a match will be made with any one of those characters in the string. Here, for example, the first match is made with “c”; then that character’s position in the string is returned:
p( /[abc]/ =~ 'xycba' ) #=> 2
Although I’ve used forward-slash delimiters in the previous examples, there are alternative ways of defining regular expressions: You can specifically create a new Regexp object initialized with a string, or you can precede the regular expression with %r and use custom delimiters, which must be nonalphanumeric characters, as you can with strings (see Chapter 3). In the following example, I use curly bracket delimiters:
regex1.rb
regex1 = Regexp.new('^[a-z]*$')
regex2 = /^[a-z]*$/
regex3 = %r{^[a-z]*$}
Each of the previous examples defines a regular expression that matches an all-lowercase string (I’ll explain the details of the expressions shortly). These expressions can be used to test strings like this:
def test( aStr, aRegEx )
  if aRegEx =~ aStr then
    puts( "All lowercase" )
  else
    puts( "Not all lowercase" )
  end
end

test( "hello", regex1 ) #=> matches: "All lowercase"
test( "hello", regex2 ) #=> matches: "All lowercase"
test( "Hello", regex3 ) #=> no match: "Not all lowercase"
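As a quick check that the three notations really do describe the same pattern, here is a minimal sketch. The variable names are my own, and it relies on Regexp#== (which compares the pattern text and options) and Regexp#source (which returns the pattern text without its delimiters):

```ruby
# Three ways of writing the same pattern
from_string  = Regexp.new('^[a-z]*$')
from_slashes = /^[a-z]*$/
from_percent = %r{^[a-z]*$}

p( from_string == from_slashes )    # the two objects describe the same pattern
p( from_slashes == from_percent )
p( from_string.source )             # the pattern text, without delimiters
```

Which notation you choose is mostly a matter of convenience: %r is handy when the pattern itself contains forward slashes, and Regexp.new is useful when the pattern is built from a string at runtime.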
To test for a match, you can use if and the =~ operator:
if /def/ =~ 'abcdef'
The previous condition is treated as true if a match is made (an integer is returned, and in Ruby any integer, including 0, counts as true); it is treated as false if no match is made (nil is returned):
if_test.rb
RegEx = /def/
Str1 = 'abcdef'
Str2 = 'ghijkl'
if RegEx =~ Str1 then
  puts( 'true' )
else
  puts( 'false' )
end #=> displays: true
if RegEx =~ Str2 then
  puts( 'true' )
else
  puts( 'false' )
end #=> displays: false
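One detail is worth spelling out: a match at the very start of a string returns 0, and Ruby treats 0 as true (only nil and false count as false), so the if test still succeeds. The following sketch illustrates this; matched? is a hypothetical helper of my own, not part of Ruby:

```ruby
# Hypothetical helper: converts the result of =~ into a plain true or false
def matched?( aRegEx, aStr )
  if aRegEx =~ aStr then true else false end
end

p( /abc/ =~ 'abcdef' )             #=> 0 (match at the first character)
p( matched?( /abc/, 'abcdef' ) )   #=> true (0 is not false in Ruby)
p( matched?( /abc/, 'xyz' ) )      #=> false (=~ returned nil)
```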
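Frequently, it is useful to attempt to match some expression from the very start of a string; you can use the character ^ followed by a match term to specify this. It may also be useful to make a match from the end of the string; you use the character $ preceded by a match term to specify that. (Strictly speaking, Ruby’s ^ and $ anchor to the start and end of a line; in single-line strings such as the ones used here, that is the same as the start and end of the string.)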
start_end1.rb
puts( /^a/ =~ 'abc' ) #=> 0
puts( /^b/ =~ 'abc' ) #=> nil
puts( /c$/ =~ 'abc' ) #=> 2
puts( /b$/ =~ 'abc' ) #=> nil
As mentioned previously, when a nil value is passed to print or puts in Ruby 1.9, nothing is displayed. In Ruby 1.8, nil is displayed. To be sure that nil is displayed in Ruby 1.9, use p instead of puts.
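The ^ and $ examples above all use single-line strings. In Ruby, these two characters anchor to the start and end of a line, whereas \A and \z anchor to the start and end of the whole string. The difference shows up only when a string contains newlines, as in this small sketch (the two_lines string is made up for illustration):

```ruby
two_lines = "abc\ndef"

p( /^def/ =~ two_lines )    #=> 4   (^ matches at the start of the second line)
p( /\Adef/ =~ two_lines )   #=> nil (\A matches only at the start of the string)
p( /abc$/ =~ two_lines )    #=> 0   ($ matches just before the newline)
p( /abc\z/ =~ two_lines )   #=> nil (\z matches only at the end of the string)
p( /def\z/ =~ two_lines )   #=> 4
```

If you are testing a whole string that might span several lines, \A and \z are the safer choice.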
Matching from the start or end of a string becomes more useful when it forms part of a more complex expression. Often such an expression tries to match zero or more instances of a specified pattern. The * character is used to indicate zero or more matches of the pattern that it follows. Formally, this is known as a quantifier. Consider this example:
start_end2.rb
p( /^[a-z 0-9]*$/ =~ 'well hello 123' )
Here, the regular expression specifies a range of characters between square brackets. This range includes all lowercase characters (a–z), all digits (0–9), and the space character (that’s the space between the z and the 0 in the expression shown earlier). The ^ character means the match must be made from the start of the string, the * character after the range means that zero or more matches with the characters in the range must be made, and the $ character means that the matches must be made right up to the end of the string. In other words, this pattern will only match a string containing lowercase characters, digits, and spaces from the start right to the end of the string:
puts( /^[a-z 0-9]*$/ =~ 'well hello 123' ) # match at 0
puts( /^[a-z 0-9]*$/ =~ 'Well hello 123' ) # no match due to ^ and upcase W
Actually, this pattern will also match an empty string, since * indicates that zero or more matches are acceptable:
puts( /^[a-z 0-9]*$/ =~ '' ) # this matches!
If you want to exclude empty strings, use + (to match one or more occurrences of the pattern):
puts( /^[a-z 0-9]+$/ =~ '' ) # no match
Try the code in start_end2.rb for more examples of ways in which ^, $, *, and + may be combined with ranges to create a variety of different match patterns.
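To see the two quantifiers side by side, the following sketch runs the * and + versions of the pattern against the same three strings. The variable names are my own, and this is not the code from start_end2.rb:

```ruby
pattern_star = /^[a-z 0-9]*$/   # zero or more characters from the range
pattern_plus = /^[a-z 0-9]+$/   # one or more characters from the range

[ 'well hello 123', 'Well hello 123', '' ].each{ |s|
  star = ( pattern_star =~ s ).inspect
  plus = ( pattern_plus =~ s ).inspect
  puts( "#{s.inspect}: * gives #{star}, + gives #{plus}" )
}
```

Only the empty string is treated differently: the * pattern matches it at position 0, whereas the + pattern returns nil.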
You could use these techniques to determine specific characteristics of strings, such as whether a given string is uppercase, lowercase, or mixed case:
regex2.rb
aStr = "HELLO WORLD"
case aStr
when /^[a-z 0-9]*$/
  puts( "Lowercase" )
when /^[A-Z 0-9]*$/
  puts( "Uppercase" )
else
  puts( "Mixed case\n" )
end
Since the string assigned to aStr is currently all uppercase, the previous code displays the “Uppercase” string. But if aStr were assigned hello world, it would display “Lowercase,” and if aStr were assigned Hello World, it would display “Mixed case.”
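The same case test can be wrapped in a method so that several strings can be tried at once. This is only a sketch: describe_case is a hypothetical name of my own, and it returns the description instead of printing it. Each when clause is tested with the === operator, which for a Regexp is true when the pattern matches.

```ruby
# Hypothetical method; the patterns are the ones used in regex2.rb
def describe_case( aStr )
  case aStr
  when /^[a-z 0-9]*$/ then "Lowercase"
  when /^[A-Z 0-9]*$/ then "Uppercase"
  else "Mixed case"
  end
end

p( describe_case( "HELLO WORLD" ) )   #=> "Uppercase"
p( describe_case( "hello world" ) )   #=> "Lowercase"
p( describe_case( "Hello World" ) )   #=> "Mixed case"
p( describe_case( "123" ) )           #=> "Lowercase" (no letters, but the first pattern matches)
```

Note the last result: because both patterns allow digits and spaces, and * allows zero matches, a string with no letters at all (or an empty string) is reported as lowercase simply because that pattern is tested first.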
Often regular expressions are used to process the text in a file on disk. Let’s suppose, for example, that you want to display all the full-line comments in a Ruby file but omit all the code and partial-line comments. You could do this by trying to match from the start of each line (^) zero or more whitespace characters (a whitespace character is represented by \s) up to a comment character (#).
regex3a.rb
# displays all the full-line comments in a Ruby file
File.foreach( 'regex1.rb' ){ |line|
  if line =~ /^\s*#/ then
    puts( line )
  end
}
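If you want to try the pattern without a file on disk, the same test can be applied to the lines of a string. In this sketch, full_line_comments is a hypothetical method of my own and the sample text is made up; it uses String#each_line in place of File.foreach:

```ruby
# Hypothetical method: returns the lines of aText that are full-line comments
def full_line_comments( aText )
  aText.each_line.select{ |line| line =~ /^\s*#/ }
end

# Sample source text, made up for this sketch
source = "# a full-line comment\n" +
         "x = 1   # a partial-line comment\n" +
         "    # an indented full-line comment\n" +
         "puts( x )\n"

puts( full_line_comments( source ) )
```

This displays the first and third lines only. The second line is skipped because the characters before its # are not all whitespace, so the pattern cannot match from the start of the line.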