Regular Expressions

When dealing with string-manipulation functions in programming languages, the notion of regular expressions sometimes arises. In R, you must pay attention to this point when using the string functions grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit().

A regular expression is a kind of wild card. It’s shorthand to specify broad classes of strings. For example, the expression "[au]" refers to any string that contains either of the letters a or u. You could use it like this:

> grep("[au]",c("Equator","North Pole","South Pole"))
[1] 1 3

This reports that elements 1 and 3 of ("Equator","North Pole","South Pole")—that is, “Equator” and “South Pole”—contain either an a or a u.

A period (.) represents any single character. Here’s an example of using it:

> grep("o.e",c("Equator","North Pole","South Pole"))
[1] 2 3

This searches for three-character strings in which an o is followed by any single character, which is in turn followed by an e. Here is an example of the use of two periods to represent any pair of characters:

> grep("N..t",c("Equator","North Pole","South Pole"))
[1] 2

Here, we searched for four-letter strings consisting of an N, followed by any pair of characters, followed by a t.
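The calls above all return index positions. As a small sketch of two related forms (the variable name places is introduced here only for illustration), the value argument of grep() returns the matching strings themselves, and grepl() returns one logical value per element:

```r
places <- c("Equator", "North Pole", "South Pole")

# grep() returns the indices of the elements that match
idx <- grep("o.e", places)                  # 2 3

# value = TRUE returns the matching strings instead of their indices
hits <- grep("o.e", places, value = TRUE)   # "North Pole" "South Pole"

# grepl() returns a logical vector, one entry per element
has_au <- grepl("[au]", places)             # TRUE FALSE TRUE
```

The logical form is convenient for subsetting, as in places[has_au].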

A period is an example of a metacharacter, which is a character that is not to be taken literally. For example, if a period appears in the first argument of grep(), it doesn’t actually mean a period; it means any character.

But what if you want to search for a period using grep()? Here’s the naive approach:

> grep(".",c("abc","de","f.g"))
[1] 1 2 3

The result should have been 3, not (1,2,3). This call failed because periods are metacharacters. You need to escape the metacharacter nature of the period, which is done via a backslash:

> grep("\\.",c("abc","de","f.g"))
[1] 3

Now, didn’t I say a backslash? Then why are there two? Well, the sad truth is that the backslash itself must be escaped, which is accomplished by its own backslash! R treats a backslash inside a quoted string as an escape character too, so you must type "\\." in order for the regular expression engine to receive the two characters \. that it needs. This goes to show how arcanely complex regular expressions can become. Indeed, a number of books have been written on the subject of regular expressions (for various programming languages). As a start in learning about the topic, refer to R’s online help (type ?regex).
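To make the double backslash less mysterious, here is a brief sketch. It checks how many characters the string "\\." really holds, and then shows two other ways to match a literal period that avoid backslashes altogether. The variable names are chosen here for illustration only.

```r
x <- c("abc", "de", "f.g")

# The string literal "\\." holds just two characters: a backslash and a period
len <- nchar("\\.")                      # 2

# Three ways to search for a literal period
escaped <- grep("\\.", x)                # escape it inside the regular expression
bracket <- grep("[.]", x)                # a period inside brackets is literal
literal <- grep(".", x, fixed = TRUE)    # turn off regular expressions entirely
```

All three searches report element 3 only. Which form to use is largely a matter of taste; fixed = TRUE is the simplest when the pattern needs no metacharacters at all.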

Extended Example: Testing a Filename for a Given Suffix

Suppose we wish to test for a specified suffix in a filename. We might, for instance, want to find all HTML files (those with suffix .html, .htm, and so on). Here is code for that:

1    testsuffix <- function(fn,suff) {
2       parts <- strsplit(fn,".",fixed=TRUE)
3       nparts <- length(parts[[1]])
4       return(parts[[1]][nparts] == suff)
5    }

Let’s test it.

> testsuffix("x.abc","abc")
[1] TRUE
> testsuffix("x.abc","ac")
[1] FALSE
> testsuffix("x.y.abc","ac")
[1] FALSE
> testsuffix("x.y.abc","abc")
[1] TRUE

How does the function work? First note that the call to strsplit() on line 2 returns a list consisting of one element (because fn is a one-element vector)—a vector of strings. For example, calling testsuffix("x.y.abc","abc") will result in parts being a list consisting of a three-element vector with elements x, y, and abc. We then pick up the last element and compare it to suff.

A key aspect is the argument fixed=TRUE. Without it, the splitting argument . (called split in the list of strsplit()’s formal arguments) would have been treated as a regular expression. Since a period matches any character, strsplit() would then have treated every character of fn as a delimiter, leaving nothing but empty strings.
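The effect of fixed=TRUE is easy to check directly. The following sketch runs strsplit() on the same filename both ways (the variable names are illustrative):

```r
# With fixed = TRUE, the period is an ordinary delimiter
with_fixed <- strsplit("x.y.abc", ".", fixed = TRUE)[[1]]   # "x" "y" "abc"

# Without it, "." is a regular expression that matches every character,
# so every character is consumed as a delimiter
without_fixed <- strsplit("x.y.abc", ".")[[1]]

# Nothing is left between the delimiters except empty strings
all(without_fixed == "")                                    # TRUE
```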

Of course, we could also escape the period, as follows:

testsuffix <- function(fn,suff) {
   parts <- strsplit(fn,"\\.")
   nparts <- length(parts[[1]])
   return(parts[[1]][nparts] == suff)
}

Let’s check to see if it still works.

> testsuffix("x.y.abc","abc")
[1] TRUE

Here’s another way to do the suffix-test code that’s a bit more involved but a good illustration:

testsuffix <- function(fn,suff) {
   ncf <- nchar(fn)  # nchar() gives the string length
   # determine where suff would have to start if it is the suffix in fn
   suffstart <- ncf - nchar(suff) + 1
   # now check that suff is there
   return(substr(fn,suffstart,ncf)==suff)
}

Let’s look at the call to substr() here, with fn = "x.ac" and suff = "abc". In this case, ncf is 4 and suffstart will be 4 - 3 + 1 = 2, which means that an abc suffix would have to begin at the second character of fn. The call to substr() then becomes substr("x.ac",2,4), which extracts the substring in character positions 2 through 4 of x.ac. That substring is .ac, which is not abc, so the filename’s suffix is found not to be the latter.
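Since this section is about regular expressions, it is worth sketching a regex-based variant as well. The function name testsuffix_re is hypothetical, and the sketch assumes that suff contains no metacharacters of its own. It relies on one metacharacter not covered above: a dollar sign ($) anchors the match to the end of the string.

```r
# Hypothetical helper, not from the text above: a regex-based suffix test.
# Assumes suff itself contains no regular expression metacharacters.
testsuffix_re <- function(fn, suff) {
  # literal period, then the suffix, then the end of the string
  pattern <- paste0("\\.", suff, "$")
  grepl(pattern, fn)
}

testsuffix_re("x.y.abc", "abc")   # TRUE
testsuffix_re("x.abc", "ac")      # FALSE
testsuffix_re("xabc", "abc")      # FALSE, since no period precedes abc
```

The last call shows a difference from the substr() version, which compares only the trailing characters and so would accept xabc as having the suffix abc.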

Extended Example: Forming Filenames

Suppose we want to create five files, q1.pdf through q5.pdf, consisting of histograms of 100 random N(0,i²) variates. We could execute the following code:

for (i in 1:5)  {
   fname <- paste("q",i,".pdf")
   pdf(fname)
   hist(rnorm(100,sd=i))
   dev.off()
}

The main point in this example is the string manipulation we use to create the filename fname. For more details about the graphics operations used in this example, refer to Section 12.3.

The paste() function concatenates the string "q", the string form of the number i, and the string ".pdf". By default, though, it separates its arguments with a single space, so when i = 2, the variable fname will be q 2 .pdf. That isn’t quite what we want. On Linux systems, filenames with embedded spaces create headaches, so we want to remove the spaces. One solution is to use the sep argument, specifying an empty string for the separator, as follows:

for (i in 1:5)  {
   fname <- paste("q",i,".pdf",sep="")
   pdf(fname)
   hist(rnorm(100,sd=i))
   dev.off()
}
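As an aside, paste() is vectorized, so all five names can also be built in a single call outside the loop. The sketch below shows this, along with paste0(), which behaves like paste() with sep="" (the variable names are illustrative):

```r
# paste() recycles the shorter arguments, giving one name per value of 1:5
fnames <- paste("q", 1:5, ".pdf", sep = "")

# paste0() is shorthand for paste(..., sep = "")
fnames0 <- paste0("q", 1:5, ".pdf")

fnames                       # "q1.pdf" "q2.pdf" "q3.pdf" "q4.pdf" "q5.pdf"
identical(fnames, fnames0)   # TRUE
```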

Another approach is to employ the sprintf() function, borrowed from C:

for (i in 1:5)  {
   fname <- sprintf("q%d.pdf",i)
   pdf(fname)
   hist(rnorm(100,sd=i))
   dev.off()
}

For floating-point quantities, note also the difference between %f and %g formats:

> sprintf("abc%fdef",1.5)
[1] "abc1.500000def"
> sprintf("abc%gdef",1.5)
[1] "abc1.5def"

The %g format eliminated the superfluous zeros.
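A few more format variations may be useful when forming filenames. This sketch, with illustrative variable names, shows zero-padding an integer, fixing the number of decimal places, and passing a vector to sprintf():

```r
# %02d pads an integer with leading zeros to a width of 2
padded <- sprintf("q%02d.pdf", 7L)       # "q07.pdf"

# %.2f keeps exactly two digits after the decimal point
twodec <- sprintf("abc%.2fdef", 1.5)     # "abc1.50def"

# sprintf() is vectorized over its arguments
several <- sprintf("q%d.pdf", 1:3)       # "q1.pdf" "q2.pdf" "q3.pdf"
```

Zero-padded names have the side benefit of sorting in numeric order in a directory listing.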