Regular expressions allow us to identify patterns of data by using generic search patterns. For example, searching for all possible phone numbers of the XXX-XXX-XXXX type appearing in a document can be easily accomplished by one regular expression. We're going to create a regular expression module that will run a set of default expressions or a user-supplied expression against the processed WAL data. The purpose of the default expressions will be to identify relevant forensic information such as URLs or Personally Identifiable Information (PII).
While this section is by no means a primer on regular expressions, we'll briefly touch on the basics so that we can understand their advantages and the regular expressions used in the code. In Python, we use the re module to run regular expressions against strings. First, we compile the regular expression and then check whether it matches the string:
>>> import re
>>> phone = '214-324-5555'
>>> expression = r'214-324-5555'
>>> re_expression = re.compile(expression)
>>> if re_expression.match(phone): print(True)
...
True
Using the identical string as our expression results in a positive match. However, this would not capture other phone numbers. Regular expressions can use a variety of special characters that either represent a set of characters or control how the preceding element is interpreted. We use these special characters to refer to multiple sets of characters and create a generic search pattern.
Square brackets, [], define a set of characters to match and can contain ranges such as 0 through 9 or a through z. Curly braces, {n}, placed after an element require exactly n consecutive copies of that element for a match. Using these two special characters, we can create a much more generic search pattern:
>>> expression = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
This regular expression matches anything of the XXX-XXX-XXXX pattern containing only the digits 0 through 9. It wouldn't match phone numbers written in other formats, such as +1 800.234.5555. We can build more complicated expressions to include those types of patterns.
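As a minimal sketch of how this generic pattern behaves on a longer string, the following compares match(), which only tests at the start of the string, with search() and findall(), which scan the whole string. The sample text and the broader loose pattern are illustrative assumptions and are not the expressions used later in the chapter.

```python
import re

# Made-up sample text for illustration only.
text = 'Call 214-324-5555 or +1 800.234.5555 before 5pm.'

# The generic XXX-XXX-XXXX pattern from above.
strict = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')

# Hypothetical broader pattern: an optional +1 prefix and a dash,
# period, or space between the groups of digits.
loose = re.compile(r'(?:\+1[-. ]?)?[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')

match_result = strict.match(text)    # match() only tests the start of the string
search_result = strict.search(text)  # search() finds the first hit anywhere
strict_hits = strict.findall(text)   # findall() returns every non-overlapping hit
loose_hits = loose.findall(text)

print(match_result)           # None, because the text begins with 'Call'
print(search_result.group())  # 214-324-5555
print(strict_hits)            # ['214-324-5555']
print(loose_hits)             # ['214-324-5555', '+1 800.234.5555']
```

Because match() is anchored to the start of the string, search() or findall() is usually the better fit when hunting for patterns inside larger blocks of data.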
Another example we'll take a look at is matching credit card numbers. Fortunately, there exist standard regular expressions for some of the major cards such as Visa, MasterCard, American Express, and so on. The following are two expressions we could use for identifying a Visa card. The first, expression_1, matches any number starting with 4 followed by any 15 digits (0-9). The second, expression_2, matches any number starting with 4 followed by any 15 digits (0-9), with the four groups of four digits optionally separated by a space or dash:
>>> expression_1 = r'^4\d{15}$'
>>> expression_2 = r'^4\d{3}([\ \-]?)\d{4}\1\d{4}\1\d{4}$'
For the first expression, we've introduced three new special characters: ^, \d, and $. The caret (^) asserts that the match starts at the beginning of the string. Likewise, $ requires that the match ends at the end of the string (or, when the re.MULTILINE flag is used, at the end of a line). Together, they mean this pattern only matches if the credit card number is the only element in the string or on the line. The \d sequence is shorthand for a digit, which for our purposes is equivalent to [0-9] (for Python 3 text strings it also matches other Unicode decimal digits unless the re.ASCII flag is set). This expression could capture a credit card number such as 4111111111111111. Note that, with regular expressions, we use the r prefix to create a raw string, in which Python does not treat backslashes as escape characters. Because regular expressions also use the backslash as their escape character, without the raw prefix we would have to double every backslash so that Python doesn't interpret it as an escape character for itself.
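The effect of the raw string prefix and of the two anchors can be checked with a short sketch. The sample strings below are made up for illustration; the re.MULTILINE variant is shown only to contrast per-line anchoring with the default whole-string anchoring.

```python
import re

# A raw string and its double-backslash equivalent are the same pattern.
same_pattern = (r'^4\d{15}$' == '^4\\d{15}$')

visa = re.compile(r'^4\d{15}$')

# The number on its own matches; the same number embedded in text does not,
# because ^ and $ anchor the pattern to the start and end of the string.
alone = visa.search('4111111111111111')
embedded = visa.search('card: 4111111111111111')

# With re.MULTILINE, ^ and $ anchor to each line instead of the whole string.
multi = re.compile(r'^4\d{15}$', re.MULTILINE)
lines = 'name: test\n4111111111111111\nend'
whole_string_hits = visa.findall(lines)
per_line_hits = multi.findall(lines)

print(same_pattern)        # True
print(alone is not None)   # True
print(embedded)            # None
print(whole_string_hits)   # []
print(per_line_hits)       # ['4111111111111111']
```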
In the second expression, we use parentheses and square brackets to optionally match a space or dash between quartets. Notice the backslashes, which escape the space and the dash. Inside square brackets, an unescaped dash can denote a range (as in 0-9), so escaping it makes it clear that we mean a literal dash; escaping the space isn't strictly required here, but it is harmless. The parentheses capture whatever separator was matched, and \1 refers back to that captured text, so we don't have to rewrite the separator pattern each time and the same separator must be used between every quartet. Again, because of ^ and $, this pattern will only match if the number is the only element in the string or on the line. This expression would capture Visa cards such as 4111-1111-1111-1111 as well as anything expression_1 would match.
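To see the optional separator and the \1 backreference at work, here is a small sketch that runs the second Visa expression (with the separator class written as [\ \-]) against a few made-up sample strings. The samples and the results dictionary are assumptions for illustration only.

```python
import re

visa_grouped = re.compile(r'^4\d{3}([\ \-]?)\d{4}\1\d{4}\1\d{4}$')

samples = [
    '4111111111111111',     # no separator
    '4111-1111-1111-1111',  # dashes throughout
    '4111 1111 1111 1111',  # spaces throughout
    '4111-1111 1111-1111',  # mixed separators
    '5111-1111-1111-1111',  # does not start with 4
]

# Map each sample to True or False depending on whether it matches.
results = {sample: bool(visa_grouped.match(sample)) for sample in samples}

for sample, matched in results.items():
    print(sample, matched)
```

The first three samples match. The mixed-separator sample is rejected because \1 must repeat exactly what the group captured, and the last sample is rejected because it doesn't start with 4.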
Mastering regular expressions allows a user to create very thorough and comprehensive patterns. For the purpose of this chapter, we'll stick to fairly simple expressions to accomplish our tasks. As with any pattern matching, there's the possibility of generating false positives when large datasets are thrown at the pattern.
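A false positive is easy to produce with the simple phone pattern from earlier. The sketch below uses a made-up order number that happens to contain a phone-shaped run of digits, and a hypothetical tightened pattern that uses lookarounds to reject hits with a digit directly before or after them. This is one possible mitigation under those assumptions, not the approach taken by the chapter's code.

```python
import re

phone = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')

# Hypothetical tightened variant: no digit may directly precede or follow the hit.
bounded = re.compile(r'(?<![0-9])[0-9]{3}-[0-9]{3}-[0-9]{4}(?![0-9])')

# Made-up text: the order number is not a phone number.
text = 'Order 1234-567-89012 shipped; call 214-324-5555.'

simple_hits = phone.findall(text)
bounded_hits = bounded.findall(text)

print(simple_hits)   # ['234-567-8901', '214-324-5555']
print(bounded_hits)  # ['214-324-5555']
```

Tightening a pattern this way trades some false positives for the risk of missing oddly formatted but genuine hits, so the right balance depends on the data being searched.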