Chapter 11. String Manipulation

Although R is a statistical language with numeric vectors and matrices playing a central role, character strings are surprisingly important as well. From birth dates stored in medical research data files to text-mining applications, character data arises quite frequently in R programs. Accordingly, R has a number of string-manipulation utilities, many of which will be introduced in this chapter.

An Overview of String-Manipulation Functions

Here, we’ll briefly review just some of the many string-manipulation functions R has to offer. Note that the call forms shown in this introduction are very simple, usually omitting many optional arguments. We’ll use some of those arguments in our extended examples later in the chapter, but do check R’s online help for further details.

grep()

The call grep(pattern,x) searches for a specified substring pattern in a vector x of strings. If x has n elements (that is, it contains n strings), then grep(pattern,x) will return a vector of length up to n. Each element of this vector will be an index i in x for which pattern was found as a substring of x[i].

Here’s an example of using grep:

> grep("Pole",c("Equator","North Pole","South Pole"))
[1] 2 3
> grep("pole",c("Equator","North Pole","South Pole"))
integer(0)

In the first case, the string "Pole" was found in elements 2 and 3 of the second argument, hence the output (2,3). In the second case, the string "pole" was not found anywhere, because the match is case sensitive, so an empty vector was returned.
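
A few optional arguments of grep() adjust how the matching is done. As a brief sketch using base R only (the variable names places, idx, hits, and dots are ours, chosen for illustration), ignore.case relaxes the case requirement, value returns the matching strings rather than their indices, and fixed treats pattern as a literal string rather than as a regular expression:

```r
places <- c("Equator","North Pole","South Pole")

# ignore.case=TRUE makes the match case insensitive
idx <- grep("pole",places,ignore.case=TRUE)   # 2 3

# value=TRUE returns the matching elements themselves
hits <- grep("Pole",places,value=TRUE)   # "North Pole" "South Pole"

# fixed=TRUE matches the pattern literally; without it, "." would
# match any character, since pattern is a regular expression by default
dots <- grep(".",c("abc","a.c"),fixed=TRUE)   # 2
```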

nchar()

The call nchar(x) finds the length of a string x. Here’s an example:

> nchar("South Pole")
[1] 10

The string "South Pole" was found to have 10 characters. C programmers, take note: There is no null character terminating R strings.

Also note that the results of nchar() will be unpredictable if x is not in character mode. For instance, nchar(NA) turns out to be 2, and nchar(factor("abc")) returned 1 in older versions of R, while recent versions signal an error instead. For more consistent results on nonstring objects, use Hadley Wickham’s stringr package on CRAN.
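
Since nchar() is vectorized, it can be applied to a whole vector of strings at once. The short sketch below also converts a nonstring object explicitly with as.character() before counting, which is one way to sidestep the inconsistencies just mentioned. The variable names are illustrative.

```r
# nchar() returns one length per element
lens <- nchar(c("Equator","North Pole","South Pole"))   # 7 10 10

# convert a factor to character mode explicitly before counting
f <- factor("abc")
flen <- nchar(as.character(f))   # 3
```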

paste()

The call paste(...) concatenates several strings, returning the result in one long string. Here are some examples:

> paste("North","Pole")
[1] "North Pole"
> paste("North","Pole",sep="")
[1] "NorthPole"
> paste("North","Pole",sep=".")
[1] "North.Pole"
> paste("North","and","South","Poles")
[1] "North and South Poles"

As you can see, the optional argument sep can be used to put something other than a space between the pieces being spliced together. If you specify sep as an empty string, the pieces won’t have any character between them.
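
Like most R functions, paste() is vectorized, and it has a second optional argument, collapse, that joins the elements of the result into a single string. A small sketch using only base R (the variable names are ours):

```r
# with vector arguments, paste() works elementwise, recycling "Pole"
poles <- paste(c("North","South"),"Pole")   # "North Pole" "South Pole"

# collapse joins the resulting elements into one string
joined <- paste(c("North","South"),"Pole",collapse=" and ")
# "North Pole and South Pole"

# paste0() is shorthand for paste(...,sep="")
np <- paste0("North","Pole")   # "NorthPole"
```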

sprintf()

The call sprintf(...) assembles a string from parts in a formatted manner. Here’s a simple example:

> i <- 8
> s <- sprintf("the square of %d is %d",i,i^2)
> s
[1] "the square of 8 is 64"

The name of the function is intended to evoke “string print,” that is, “printing” to a string rather than to the screen. Here, we are printing to the string s.

What are we printing? The format string says to first print “the square of ” and then the decimal value of i in place of the first %d, followed by “ is ” and the decimal value of i^2 in place of the second %d. (The term decimal here means in the base-10 number system, not that there will be a decimal point in the result.) The result is the string "the square of 8 is 64".
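
Other format codes follow the conventions of C's printf(). As a hedged illustration (the values and variable names here are made up for the example), %s inserts a string, %f inserts a floating-point number with an optional width and precision, and %% produces a literal percent sign. The function is vectorized as well.

```r
# %s for strings, %.2f for a number with two digits after the decimal point
s1 <- sprintf("%s is approximately %.2f","pi",pi)
# "pi is approximately 3.14"

# %5.1f pads to a total width of 5; %% prints a literal percent sign
s2 <- sprintf("%5.1f%%",12.345)   # " 12.3%"

# vector arguments yield one string per element
s3 <- sprintf("file%d.txt",1:3)   # "file1.txt" "file2.txt" "file3.txt"
```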

substr()

The call substr(x,start,stop) returns the substring in the given character position range start:stop in the given string x. Here’s an example:

> substr("Equator",3,5)
[1] "uat"
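
The function is vectorized over x, and it can also appear on the left side of an assignment to overwrite part of a string. The following sketch uses only base R; the variable names are illustrative.

```r
# one substring per element of the input vector
firsts <- substr(c("Equator","North Pole"),1,3)   # "Equ" "Nor"

# replacement form: overwrite characters 1 through 5 in place
s <- "North Pole"
substr(s,1,5) <- "South"
# s is now "South Pole"
```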

strsplit()

The call strsplit(x,split) splits a string x into substrings at each occurrence of another string, split, and returns them in an R list. Here’s an example:

> strsplit("6-16-2011",split="-")
[[1]]
[1] "6"    "16"   "2011"
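
Because the result is a list with one component per input string, a common next step is to pull out the component and convert it. Here is a minimal sketch in base R (the names parts, nums, dates, and dotted are ours). Note that split is treated as a regular expression by default, so a character such as "." needs fixed=TRUE to be taken literally.

```r
# the first (and only) list component holds the pieces
parts <- strsplit("6-16-2011",split="-")[[1]]   # "6" "16" "2011"

# convert the pieces from strings to integers
nums <- as.integer(parts)   # 6 16 2011

# with a vector input, each string gets its own list component
dates <- strsplit(c("6-16-2011","12-1-2010"),split="-")

# fixed=TRUE splits on a literal period rather than a regex wildcard
dotted <- strsplit("a.b.c",split=".",fixed=TRUE)[[1]]   # "a" "b" "c"
```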

regexpr()

The call regexpr(pattern,text) finds the character position of the first instance of pattern within text, as in this example:

> regexpr("uat","Equator")
[1] 3

This reports that “uat” did indeed appear in “Equator,” starting at character position 3.
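
When the pattern does not occur, regexpr() returns -1. R also attaches a match.length attribute to the returned value, giving the length of the matched text. A brief sketch in base R (the variable names are ours):

```r
m <- regexpr("uat","Equator")
pos <- as.integer(m)            # 3, the starting position
len <- attr(m,"match.length")   # 3, the length of the match

# no match is signaled by -1
miss <- as.integer(regexpr("xyz","Equator"))   # -1

# position and length together recover the matched text
found <- substr("Equator",pos,pos+len-1)   # "uat"
```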

gregexpr()

The call gregexpr(pattern,text) works like regexpr(), but it finds all instances of pattern rather than just the first. Here’s an example:

> gregexpr("iss","Mississippi")
[[1]]
[1] 2 5

This finds that “iss” appears twice in “Mississippi,” starting at character positions 2 and 5.
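
As with strsplit(), the result is a list with one component per element of text. Here is a minimal sketch of extracting the positions, along with base R's regmatches() to recover the matched substrings themselves (the variable names are illustrative):

```r
g <- gregexpr("iss","Mississippi")

# starting positions of all matches in the first (only) string
pos <- as.integer(g[[1]])   # 2 5

# number of matches found
n <- length(pos)   # 2

# regmatches() extracts the matched text for each position
hits <- regmatches("Mississippi",g)[[1]]   # "iss" "iss"
```

Be careful when counting matches this way: if there is no match at all, the component is the single value -1, so its length is 1 rather than 0.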