Let's now focus on the Python side of things. Regular expression syntax is the furthest thing from object-oriented programming. However, Python's re module provides an object-oriented interface to the regular expression engine.
We've been checking whether the re.match function returns a valid object or not. If a pattern does not match, that function returns None. If it does match, however, it returns a useful object that we can introspect for information about the pattern.
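As a minimal sketch of that None-or-object behavior, the snippet below wraps re.match in a helper that reports what came back. The helper name describe_match and the sample strings are made up for illustration; group(0) and span() are standard methods on match objects.

```python
import re


def describe_match(pattern, text):
    """Summarize what re.match returned for this pattern and text."""
    match = re.match(pattern, text)
    if match is None:
        # re.match only succeeds if the pattern matches at the start of the text.
        return "no match"
    # group(0) is the full matched text; span() is its (start, end) offsets.
    return f"matched {match.group(0)!r} at {match.span()}"


print(describe_match(r"hel+o", "hello world"))  # matched 'hello' at (0, 5)
print(describe_match(r"hel+o", "say hello"))    # no match
```

Comparing against None explicitly, rather than relying on truthiness, is a style choice here; a successful match object is always truthy, so a plain `if match:` works just as well.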
So far, our regular expressions have answered questions such as, does this string match this pattern? Matching patterns is useful, but in many cases, a more interesting question is, if this string matches this pattern, what is the value of a relevant substring? If you use groups to identify parts of the pattern that you want to reference later, you can get them out of the match return value, as illustrated in the next example:
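    import re

    # A raw string keeps the backslash in \. from being treated as a string escape.
    pattern = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"
    search_string = "some.user@example.com"
    match = re.match(pattern, search_string)

    if match:
        # groups() returns a tuple of the captured groups; the domain is the first.
        domain = match.groups()[0]
        print(domain)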
The specification describing valid email addresses is extremely complicated, and the regular expression that accurately matches all possibilities is obscenely long. So, we cheated and made a simple regular expression that matches some common email addresses; the point is that we want to access the domain name (after the @ sign) so we can connect to that address. This is done easily by wrapping that part of the pattern in parentheses and calling the groups() method on the object returned by match.
The groups method returns a tuple of all the groups matched inside the pattern, which you can index to access a specific value. The groups are ordered from left to right. However, bear in mind that groups can be nested, meaning you can have one or more groups inside another group. In this case, the groups are returned in the order of their opening parentheses, so an outer group is returned before the groups nested inside it.
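The ordering of nested groups is easier to see with a concrete case. The following sketch uses a made-up date pattern (not one from the text above) in which an outer group wraps two inner groups:

```python
import re

# Hypothetical pattern: the outer group captures year-month,
# and the two inner groups capture the year and the month separately.
match = re.match(r"((\d{4})-(\d{2}))-\d{2}", "2024-03-15")

if match:
    # Groups are numbered by the position of their opening parenthesis,
    # so the outer group comes first.
    print(match.groups())  # ('2024-03', '2024', '03')
    print(match.group(1))  # 2024-03
    print(match.group(0))  # 2024-03-15 (the whole match, not part of groups())
```

Note that match.group(1) is the same value as match.groups()[0]; group numbering starts at 1 because group(0) is reserved for the entire match.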
In addition to the match function, the re module provides a couple of other useful functions, search and findall. The search function finds the first instance of a matching pattern, relaxing the restriction that the pattern must start at the first character of the string. Note that you can get a similar effect by using match and putting ^.*? at the front of the pattern to match any characters between the start of the string and the pattern you are looking for. (With the greedy form ^.* you would get the last occurrence rather than the first, and in either form the dot does not match newlines unless the re.DOTALL flag is set.)
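To make the difference between match and search concrete, here is a small sketch on an invented sample string. It also shows that prefixing the pattern with a greedy `.*` is only a rough substitute for search, since the greedy prefix backtracks from the end of the string and so lands on the last occurrence rather than the first:

```python
import re

text = "x id=17, id=42"  # made-up sample input
pattern = r"id=(\d+)"

# match anchors at the start of the string, so this fails.
print(re.match(pattern, text))  # None

# search scans forward and returns the first occurrence.
first = re.search(pattern, text)
print(first.group(1))  # 17

# A greedy prefix consumes as much as it can, so the LAST occurrence wins.
greedy = re.match(r"^.*" + pattern, text)
print(greedy.group(1))  # 42

# A non-greedy prefix behaves like search for single-line input.
lazy = re.match(r"^.*?" + pattern, text)
print(lazy.group(1))  # 17
```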
The findall function behaves similarly to search, except that it finds all non-overlapping instances of the matching pattern, not just the first one. Basically, it finds the first match, then it resets the search to the end of that matching string and finds the next one.
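A quick way to see what "non-overlapping" means for findall is to run a two-character pattern over a string in which overlapping matches would be possible. The inputs below are invented for illustration:

```python
import re

# "aaaaa" contains four overlapping occurrences of "aa" (at offsets 0, 1, 2, 3),
# but findall resumes scanning after the end of each match, so only two are found.
pairs = re.findall(r"aa", "aaaaa")
print(pairs)  # ['aa', 'aa']

# The same rule applies here: the trailing "5" has no partner and is dropped.
digits = re.findall(r"\d\d", "12345")
print(digits)  # ['12', '34']
```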
Instead of returning a list of match objects, as you might expect, findall returns a list of matching strings or tuples. Sometimes it's strings, sometimes it's tuples. It's not a very good API at all! As with all bad APIs, you'll have to memorize the differences and not rely on intuition. The type of the return value depends on the number of parenthesized groups inside the regular expression:
- If there are no groups in the pattern, re.findall will return a list of strings, where each value is a complete substring from the source string that matches the pattern
- If there is exactly one group in the pattern, re.findall will return a list of strings where each value is the contents of that group
- If there are multiple groups in the pattern, re.findall will return a list of tuples where each tuple contains a value from a matching group, in order
The examples in the following interactive session will hopefully clarify the differences:
    >>> import re
    >>> re.findall('a.', 'abacadefagah')
    ['ab', 'ac', 'ad', 'ag', 'ah']
    >>> re.findall('a(.)', 'abacadefagah')
    ['b', 'c', 'd', 'g', 'h']
    >>> re.findall('(a)(.)', 'abacadefagah')
    [('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'g'), ('a', 'h')]
    >>> re.findall('((a)(.))', 'abacadefagah')
    [('ab', 'a', 'b'), ('ac', 'a', 'c'), ('ad', 'a', 'd'), ('ag', 'a', 'g'), ('ah', 'a', 'h')]