Common Functions Used with Factors

With factors, we have yet another member of the family of apply functions, tapply(). We’ll look at that function, as well as two other functions commonly used with factors: split() and by().

The tapply() Function

As motivation, suppose we have a vector x of ages of voters and a factor f showing some nonnumeric trait of those voters, such as party affiliation (Democrat, Republican, Unaffiliated). We might wish to find the mean ages in x within each of the party groups.

In typical usage, the call tapply(x,f,g) has x as a vector, f as a factor or list of factors, and g as a function. The function g() in our little example above would be R’s built-in mean() function. If we wanted to group by both party and another factor, say gender, we would need f to consist of the two factors, party and gender.

Each factor in f must have the same length as x. This makes sense in light of the voter example above; we should have as many party affiliations as ages. If a component of f is a vector, it will be coerced into a factor by applying as.factor() to it.

The operation performed by tapply() is to (temporarily) split x into groups, each group corresponding to a level of the factor (or a combination of levels of the factors in the case of multiple factors), and then apply g() to the resulting subvectors of x. Here’s a little example:

> ages <- c(25,26,55,37,21,42)
> affils <- c("R","D","D","R","U","D")
> tapply(ages,affils,mean)
 D  R  U
41 31 21

Let’s look at what happened. The function tapply() treated the vector ("R","D","D","R","U","D") as a factor with levels "D", "R", and "U". It noted that "D" occurred in indices 2, 3, and 6; "R" occurred in indices 1 and 4; and "U" occurred in index 5. For convenience, let’s refer to the three index vectors (2,3,6), (1,4), and (5) as x, y, and z, respectively. Then tapply() computed mean(ages[x]), mean(ages[y]), and mean(ages[z]) and returned those means in a three-element vector. That vector’s element names are "D", "R", and "U", reflecting the factor levels that were used by tapply().
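To make the grouping concrete, the following sketch redoes the computation by hand. The names d_idx, r_idx, u_idx, and by_hand are introduced here only for illustration; this is one way to mimic the result, not a description of how tapply() is implemented.

```r
ages <- c(25,26,55,37,21,42)
affils <- c("R","D","D","R","U","D")

# index vectors for each level, found with which()
d_idx <- which(affils == "D")   # 2 3 6
r_idx <- which(affils == "R")   # 1 4
u_idx <- which(affils == "U")   # 5

# apply mean() to each subvector and label the results by level
by_hand <- c(D = mean(ages[d_idx]), R = mean(ages[r_idx]), U = mean(ages[u_idx]))
by_hand

# the values agree with those returned by tapply()
all.equal(unname(by_hand), as.vector(tapply(ages, affils, mean)))
```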

What if we have two or more factors? Then each factor yields a set of groups, as in the preceding example, and the groups are ANDed together. As an example, suppose that we have an economic data set that includes variables for gender, age, and income. Here, the call tapply(x,f,g) might have x as income and f as a pair of factors: one for gender and the other coding whether the person is older or younger than 25. We may be interested in finding mean income, broken down by gender and age. If we set g() to be mean(), tapply() will return the mean incomes in each of four subgroups:

Male and under 25 years old
Female and under 25 years old
Male and over 25 years old
Female and over 25 years old

Here’s a toy example of that setting:

> d <- data.frame(list(gender=c("M","M","F","M","F","F"),
+    age=c(47,59,21,32,33,24),income=c(55000,88000,32450,76500,123000,45650)))
> d
  gender age income
1      M  47  55000
2      M  59  88000
3      F  21  32450
4      M  32  76500
5      F  33 123000
6      F  24  45650
> d$over25 <- ifelse(d$age > 25,1,0)
> d
  gender age income over25
1      M  47  55000      1
2      M  59  88000      1
3      F  21  32450      0
4      M  32  76500      1
5      F  33 123000      1
6      F  24  45650      0
> tapply(d$income,list(d$gender,d$over25),mean)
      0         1
F 39050 123000.00
M    NA  73166.67

We specified two factors, gender and an indicator variable for age over or under 25. Since each of these factors has two levels, tapply() partitioned the income data into four groups, one for each combination of gender and age, and then applied the mean() function to each group. The NA entry arises because the data contain no males aged 25 or under, so that group is empty.
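A quick way to see which combinations actually occur is to count the observations in each cell. The sketch below rebuilds the toy data frame and uses table() for the counts; the names counts and means are chosen here for illustration.

```r
d <- data.frame(gender=c("M","M","F","M","F","F"),
   age=c(47,59,21,32,33,24),
   income=c(55000,88000,32450,76500,123000,45650))
d$over25 <- ifelse(d$age > 25,1,0)

# count the observations in each gender/age cell
counts <- table(d$gender, d$over25)
counts

# the cell for males aged 25 or under is empty, so tapply() has nothing
# to pass to mean() and fills that position with NA
means <- tapply(d$income, list(d$gender,d$over25), mean)
is.na(means["M","0"])
```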

The split() Function

In contrast to tapply(), which splits a vector into groups and then applies a specified function to each group, split() stops at that first stage, just forming the groups.

The basic form, without bells and whistles, is split(x,f), with x and f playing roles similar to those in the call tapply(x,f,g); that is, x being a vector or data frame and f being a factor or a list of factors. The action is to split x into groups, which are returned in a list. (Note that x is allowed to be a data frame with split() but not with tapply().)

Let’s try it out with our earlier example.

> d
  gender age income over25
1      M  47  55000      1
2      M  59  88000      1
3      F  21  32450      0
4      M  32  76500      1
5      F  33 123000      1
6      F  24  45650      0
> split(d$income,list(d$gender,d$over25))
$F.0
[1] 32450 45650

$M.0
numeric(0)

$F.1
[1] 123000

$M.1
[1] 55000 88000 76500

The output of split() is a list, and recall that list components are denoted by dollar signs. So the last vector, for example, was named "M.1" to indicate that it was the result of combining "M" in the first factor and 1 in the second.
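Since split() also accepts a data frame as its first argument, a short sketch of that case may help. It splits the whole toy data frame by gender and then applies a function to each piece, which recovers what tapply() would compute. The names by_gender and piece are illustrative only.

```r
d <- data.frame(gender=c("M","M","F","M","F","F"),
   age=c(47,59,21,32,33,24),
   income=c(55000,88000,32450,76500,123000,45650))
d$over25 <- ifelse(d$age > 25,1,0)

# split the whole data frame by gender; each component is a data frame
by_gender <- split(d, d$gender)
names(by_gender)          # "F" "M"
by_gender$F

# applying a function to each piece gives the same group means
# as tapply(d$income,d$gender,mean)
sapply(by_gender, function(piece) mean(piece$income))

# with drop=TRUE, empty combinations such as M.0 are omitted
names(split(d$income, list(d$gender,d$over25), drop=TRUE))
```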

As another illustration, consider our abalone example from Section 2.9.2. We wanted to determine the indices of the vector elements corresponding to male, female, and infant. The data in that little example consisted of the seven-observation vector ("M","F","F","I","M","M","F"), assigned to g. We can do this in a flash with split().

> g <- c("M","F","F","I","M","M","F")
> split(1:7,g)
$F
[1] 2 3 7

$I
[1] 4

$M
[1] 1 5 6

The results show the female cases are in records 2, 3, and 7; the infant case is in record 4; and the male cases are in records 1, 5, and 6.

Let’s dissect this step-by-step. The vector g, taken as a factor, has three levels: "F", "I", and "M". The indices corresponding to the level "M" are 1, 5, and 6, which means that g[1], g[5], and g[6] all have the value "M". So, R sets the M component of the output to elements 1, 5, and 6 of 1:7, which is the vector (1,5,6). The F and I components are formed in the same way.
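The same dissection can be written out in code. This is only a sketch of the logic, using which() once per level; the names lev and manual are introduced here for illustration.

```r
g <- c("M","F","F","I","M","M","F")
lev <- levels(factor(g))     # the levels "F", "I", and "M"

# for each level, collect the positions at which g takes that value
manual <- lapply(lev, function(l) which(g == l))
names(manual) <- lev
manual$M

# compare with the grouping produced by split()
all.equal(manual, split(1:7,g))
```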

We can take a similar approach to simplify the code in our text concordance example from Section 4.2.4. There, we wished to input a text file, determine which words were in the text, and then output a list giving the words and their locations within the text. We can use split() to make short work of writing the code, as follows:

findwords <- function(tf) {
   # read in the words from the file, into a vector of mode character
   txt <- scan(tf,"")
   words <- split(1:length(txt),txt)
   return(words)
}

The call to scan() returns a character vector txt of the words read in from the file tf. So, txt[1] will contain the first word input from the file, txt[2] will contain the second word, and so on; length(txt) will thus be the total number of words read. Suppose for concreteness that that number is 220.

Meanwhile, txt itself, as the second argument in split() above, will be taken as a factor. The levels of that factor will be the various words in the file. If, for instance, the file contains the word world 6 times and the word climate 10 times, then “world” and “climate” will be two of the levels of txt. The call to split() will then determine where these and the other words appear in txt.
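To see the function at work, here is a small usage sketch. The eight-word text is made up for illustration and written to a temporary file; quiet=TRUE is added to the scan() call only to suppress its "Read n items" message.

```r
findwords <- function(tf) {
   # read in the words from the file, into a vector of mode character
   txt <- scan(tf,"",quiet=TRUE)
   split(1:length(txt),txt)
}

# write a small made-up text to a temporary file
tf <- tempfile()
writeLines(c("the world is the","climate of the world"),tf)

wl <- findwords(tf)
wl$the      # positions 1, 4, and 7
wl$world    # positions 2 and 8
unlink(tf)
```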

The by() Function

Suppose in the abalone example we wish to do regression analyses of diameter against length separately for each gender code: males, females, and infants. At first, this seems like something tailor-made for tapply(), but the first argument of that function must be a vector, not a matrix or a data frame. The function to be applied can return more than one value (range(), for example), but its input must be a vector. Yet the input for regression is a matrix (or data frame) with at least two columns: one for the predicted variable and one or more for predictor variables. In our abalone data application, the matrix would consist of a column for the diameter data and a column for length.

The by() function can be used here. It works like tapply() (which it calls internally, in fact), but it is applied to data frames or matrices rather than vectors. Here’s how to use it for the desired regression analyses:

> aba <- read.csv("abalone.data",header=TRUE)
> by(aba,aba$Gender,function(m) lm(m[,2]~m[,3]))
aba$Gender: F

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.04288      1.17918

----------------------------------------
aba$Gender: I

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.02997      1.21833

----------------------------------------
aba$Gender: M

Call:
lm(formula = m[, 2] ~ m[, 3])

Coefficients:
(Intercept)       m[, 3]
    0.03653      1.19480

Calls to by() look very similar to calls to tapply(), with the first argument specifying our data, the second the grouping factor, and the third the function to be applied to each group.

Just as tapply() forms groups of indices of a vector according to levels of a factor, this by() call finds groups of row numbers of the data frame aba. That creates three sub-data frames, one for each of the gender levels M, F, and I.

The anonymous function we defined regresses the second column of its argument m, which is one of those sub-data frames, against the third column. This function will be called three times, once for each of the three sub-data frames created earlier, thus producing the three regression analyses.
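Since the abalone file may not be at hand, the following sketch shows the same pattern on a tiny made-up data frame. The data frame toy and its column names Gender, Length, and Diameter are hypothetical, and the values are constructed so that the relation is exactly linear within each group. As a variation on the call above, the columns are referenced by name and each sub-data frame is passed through the data argument of lm(), which tends to give more readable output than m[,2] and m[,3].

```r
# made-up measurements; an exact linear relation within each group
toy <- data.frame(
   Gender = c("F","F","F","M","M","M"),
   Length = c(1,2,3,1,2,3),
   Diameter = c(2,4,6,4,7,10))

# refer to columns by name and pass each sub-data frame via data=
fits <- by(toy, toy$Gender, function(m) lm(Diameter ~ Length, data=m))

# collect the coefficients, one column per group
coefs <- sapply(fits, coef)
coefs
```

Because by() returns a list-like object with one component per group, sapply() can pull the same quantity out of every fit at once.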