Working with Tables

To begin exploring R tables, consider this example:

> u <- c(22,8,33,6,8,29,-2)
> fl <- list(c(5,12,13,12,13,5,13),c("a","bc","a","a","bc","a","a"))
> tapply(u,fl,length)
   a bc
5  2 NA
12 1  1
13 2  1

Here, tapply() again temporarily breaks u into subvectors, as you saw earlier, and then applies the length() function to each subvector. (Note that this is independent of what’s in u. Our focus now is purely on the factors.) Those subvector lengths are the counts of the occurrences of each of the 3 × 2 = 6 combinations of the two factors. For instance, 5 occurred twice with "a" and not at all with "bc"; hence the entries 2 and NA in the first row of the output. In statistics, this is called a contingency table.

There is one problem in this example: the NA value. It really should be 0, meaning that in no cases did the first factor have level 5 and the second have level "bc". The table() function creates contingency tables correctly.

> table(fl)
    fl.2
fl.1 a bc
  5  2  0
  12 1  1
  13 2  1

The first argument in a call to table() is either a factor or a list of factors. The two factors here were (5,12,13,12,13,5,13) and ("a","bc","a","a","bc", "a","a"). In this case, an object that is interpretable as a factor is counted as one.
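
As a quick check on the claim that table() reports zeros where tapply() reported NA, here is a minimal sketch using the same fl list. The variable names tab and tab2 are ours, introduced only for illustration. It also shows that passing the two factors as separate arguments gives the same counts as passing them in a list.

```r
fl <- list(c(5,12,13,12,13,5,13), c("a","bc","a","a","bc","a","a"))

# Contingency table from a list of two factor-like vectors
tab <- table(fl)

# The empty combination (5, "bc") is reported as a count of 0
tab["5", "bc"]

# The same counts, passing the two vectors as separate arguments
tab2 <- table(fl[[1]], fl[[2]])
all(as.vector(tab) == as.vector(tab2))
```

Note that cells are indexed here by level name, as character strings, rather than by position.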

Typically a data frame serves as the table() data argument. Suppose for instance the file ct.dat consists of election-polling data, in which candidate X is running for reelection. The ct.dat file looks like this:

"Vote for X" "Voted for X Last Time"
"Yes" "Yes"
"Yes" "No"
"No" "No"
"Not Sure" "Yes"
"No" "No"

In the usual statistical fashion, each row in this file represents one subject under study. In this case, we have asked five people the following two questions:

  • Do you plan to vote for candidate X?

  • Did you vote for X in the last election?

This gives us five rows in the data file.

Let’s read in the file:

> ct <- read.table("ct.dat",header=T)
> ct
  Vote.for.X Voted.for.X.Last.Time
1        Yes                   Yes
2        Yes                    No
3         No                    No
4   Not Sure                   Yes
5         No                    No

We can use the table() function to compute the contingency table for this data:

> cttab <- table(ct)
> cttab
          Voted.for.X.Last.Time
Vote.for.X No Yes
  No        2   0
  Not Sure  0   1
  Yes       1   1

The 2 in the upper-left corner of the table shows that we had, for example, two people who said “no” to the first and second questions. The 1 in the middle-right indicates that one person answered “not sure” to the first question and “yes” to the second question.
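
If you don't have ct.dat on hand, the same table can be reproduced without a file. The sketch below assumes the file contents shown earlier and feeds them to read.table() through its text argument; the variable ctlines is our own name, not part of the original example.

```r
# Inline copy of the ct.dat contents, so no file is needed
ctlines <- '"Vote for X" "Voted for X Last Time"
"Yes" "Yes"
"Yes" "No"
"No" "No"
"Not Sure" "Yes"
"No" "No"'

# read.table() converts the header strings into syntactically valid names
ct <- read.table(text = ctlines, header = TRUE)
names(ct)

cttab <- table(ct)
cttab["No", "No"]          # people answering "No" to both questions
cttab["Not Sure", "Yes"]   # "Not Sure" now, "Yes" last time
```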

We can also get one-dimensional counts, which are counts on a single factor, as follows:

> table(c(5,12,13,12,8,5))

 5  8 12 13
 2  1  2  1

Here’s an example of a three-dimensional table, involving voters’ genders, race (white, black, Asian, and other), and political views (liberal or conservative):

> v  # the data frame
  gender race pol
1 M W L
2 M W L
3 F A C
4 M O L
5 F B L
6 F B C
> vt <- table(v)
> vt
, , pol = C

      race
gender A B O W
     F 1 1 0 0
     M 0 0 0 0

, , pol = L

      race
gender A B O W
     F 0 1 0 0
     M 0 0 1 2

R prints out a three-dimensional table as a series of two-dimensional tables. In this case, it generates a table of gender and race for conservatives and then a corresponding table for liberals. For example, the second two-dimensional table says that there were two white male liberals.
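
The text does not show how v was created, so the following is only one plausible way to build it, using data.frame() with the six rows listed above. Once the table exists, individual cells and whole two-dimensional slices can be picked out by level name.

```r
# One possible construction of the data frame v (an assumption; the
# original may have been read from a file instead)
v <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  race   = c("W", "W", "A", "O", "B", "B"),
  pol    = c("L", "L", "C", "L", "L", "C")
)
vt <- table(v)

dim(vt)               # 2 genders x 4 races x 2 political views
vt["M", "W", "L"]     # number of white male liberals
vt[ , , "C"]          # the gender-by-race slice for conservatives
```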

Just as most (nonmathematical) matrix/array operations can be used on data frames, they can be applied to tables, too. (This is not surprising, given that the cell counts portion of a table object is an array.)

For example, we can access the table cell counts using matrix notation. Let’s apply this to our voting example from the previous section.

> class(cttab)
[1] "table"
> cttab[1,1]
[1] 2
> cttab[1,]
 No Yes
  2   0

In the second command, even though the first command had shown that cttab had class “table”, we treated it as a matrix and printed out its “[1,1] element.” Continuing this idea, the third command printed the first row of this “matrix.”

We can multiply or divide the matrix by a scalar. For instance, here’s how to change cell counts to proportions, dividing by the five respondents:

> cttab/5
          Voted.for.X.Last.Time
Vote.for.X  No Yes
  No       0.4 0.0
  Not Sure 0.0 0.2
  Yes      0.2 0.2
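
Dividing by a hard-coded 5 works only because we know the sample size. A slightly more general sketch divides by the table's own total, or uses the base R function prop.table(), which performs the same division. The data frame is rebuilt here so that the example stands alone; the names props, props2, and rowprops are ours.

```r
ct <- data.frame(
  Vote.for.X            = c("Yes", "Yes", "No", "Not Sure", "No"),
  Voted.for.X.Last.Time = c("Yes", "No", "No", "Yes", "No")
)
cttab <- table(ct)

props  <- cttab / sum(cttab)   # proportions of the grand total
props2 <- prop.table(cttab)    # same result via prop.table()

# Proportions within each row (each row sums to 1)
rowprops <- prop.table(cttab, 1)
rowprops
```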

In statistics, the marginal values of a variable are those obtained when this variable is held constant while others are summed. In the voting example, the marginal values of the Vote.for.X variable are 2 + 0 = 2, 0 + 1 = 1, and 1 + 1 = 2. We can of course obtain these via the matrix apply() function:

> apply(cttab,1,sum)
      No Not Sure      Yes
       2        1        2

Note that the labels here, such as No, came from the row names of the matrix, which table() produced.

But R supplies a function addmargins() for this purpose—that is, to find marginal totals. Here’s an example:

> addmargins(cttab)
          Voted.for.X.Last.Time
Vote.for.X No Yes Sum
  No        2   0   2
  Not Sure  0   1   1
  Yes       1   1   2
  Sum       3   2   5

Here, we got the marginal data for both dimensions at once, conveniently superimposed onto the original table.
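
When only the marginal totals themselves are wanted, without the original cells, base R also offers margin.table(), rowSums(), and colSums(). The following sketch rebuilds the voting table so it can be run on its own; rowtot and coltot are illustrative names.

```r
ct <- data.frame(
  Vote.for.X            = c("Yes", "Yes", "No", "Not Sure", "No"),
  Voted.for.X.Last.Time = c("Yes", "No", "No", "Yes", "No")
)
cttab <- table(ct)

rowtot <- margin.table(cttab, 1)   # totals for each level of Vote.for.X
coltot <- margin.table(cttab, 2)   # totals for each level of Voted.for.X.Last.Time
rowtot
coltot

# rowSums() and colSums() give the same numbers as plain named vectors
rowSums(cttab)
colSums(cttab)
```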

We can get the names of the dimensions and levels through dimnames(), as follows:

> dimnames(cttab)
$Vote.for.X
[1] "No"       "Not Sure" "Yes"

$Voted.for.X.Last.Time
[1] "No"  "Yes"

Let’s continue working with our voting example:

> cttab
          Voted.for.X.Last.Time
Vote.for.X No Yes
  No        2   0
  Not Sure  0   1
  Yes       1   1

Suppose we wish to present this data at a meeting, concentrating on those respondents who have made up their minds about voting for X in the current election. In other words, we wish to eliminate the Not Sure entries and present a subtable that looks like this:

          Voted.for.X.Last.Time
Vote.for.X No Yes
  No        2   0
  Yes       1   1

The function subtable() below performs subtable extraction. It has two arguments:

  • tbl: The table of interest, of class "table".

  • subnames: A list specifying the desired subtable extraction. Each component of this list is named after some dimension of tbl, and the value of that component is a vector of the names of the desired levels.

So, let’s review what we have in this example before looking at the code. The argument cttab will be a two-dimensional table, with dimension names Vote.for.X and Voted.for.X.Last.Time. Within those two dimensions, the level names are No, Not Sure, and Yes in the first dimension, and No and Yes in the second. Of those, we wish to exclude the Not Sure cases, so our actual argument corresponding to the formal argument subnames is as follows:

list(Vote.for.X=c("No","Yes"),Voted.for.X.Last.Time=c("No","Yes"))

We can now call the function.

> subtable(cttab,list(Vote.for.X=c("No","Yes"),
+    Voted.for.X.Last.Time=c("No","Yes")))
          Voted.for.X.Last.Time
Vote.for.X No Yes
       No   2   0
       Yes  1   1

Now that we have a feel for what the function does, let’s take a look at its innards.

1    subtable <- function(tbl,subnames) {
2       # get array of cell counts in tbl
3       tblarray <- unclass(tbl)
4       # we'll get the subarray of cell counts corresponding to subnames by
5       # calling do.call() on the "[" function; we need to build up a list
6       # of arguments first
7       dcargs <- list(tblarray)
8       ndims <- length(subnames)  # number of dimensions
9       for (i in 1:ndims) {
10         dcargs[[i+1]] <- subnames[[i]]
11      }
12      subarray <- do.call("[",dcargs)
13      # now we'll build the new table, consisting of the subarray, the
14      # numbers of levels in each dimension, and the dimnames() value, plus
15      # the "table" class attribute
16      dims <- lapply(subnames,length)
17      subtbl <- array(subarray,dims,dimnames=subnames)
18      class(subtbl) <- "table"
19      return(subtbl)
20   }

So, what’s happening here? To prepare for writing this code, I first did a little detective work to determine the structure of objects of class "table". Looking through the code of the function table(), I found that at its core, an object of class "table" consists of an array whose elements are the cell counts. So the strategy is to extract the desired subarray, then add names to the dimensions of the subarray, and then bestow "table" class status on the result.

For the code here, then, the first task is to form the subarray corresponding to the user’s desired subtable, and this constitutes most of the code. To this end, in line 3, we first extract the full cell counts array, storing it in tblarray. The question is how to use that to find the desired subarray. In principle, this is easy. In practice, that’s not always the case.

To get the desired subarray, I needed to form a subsetting expression on the array tblarray—something like this:

tblarray[some index ranges here]

In our voting example, the expression is as follows:

tblarray[c("No","Yes"),c("No","Yes")]

This is simple in concept but difficult to do directly, since tblarray could be of different dimensions (two, three, or anything else). Recall that R’s array subscripting is actually done via a function named "["(). This function takes a variable number of arguments: two for two-dimensional arrays, three for three-dimensional arrays, and so on.

This problem is solved by using R’s do.call(). This function has the following basic form:

do.call(f,argslist)

where f is a function and argslist is a list of arguments on which to call f(). In other words, the preceding code basically does this:

f(argslist[[1]],argslist[[2]],...)

This makes it easy to call a function with a variable number of arguments.
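
To make this concrete, here is a small sketch of do.call() applied to the subscripting function "[". The matrix m, the array arr, and the index list idx are made-up illustrations, not part of the voting example.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Ordinary subscripting ...
direct <- m["r2", c("a", "c")]
# ... and the same thing, with the arguments collected in a list
viacall <- do.call("[", list(m, "r2", c("a", "c")))
identical(direct, viacall)

# The list can have any length, so the same call works in three dimensions
arr <- array(1:24, dim = c(2, 3, 4))
idx <- list(1, 2:3, 4)                 # one index per dimension
do.call("[", c(list(arr), idx))        # same as arr[1, 2:3, 4]
```

The point is that the number of indices is decided at run time by the length of the list, which ordinary bracket syntax cannot express.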

For our example, we need to form a list consisting first of tblarray and then the user’s desired levels for each dimension. Our list looks like this:

list(tblarray,Vote.for.X=c("No","Yes"),Voted.for.X.Last.Time=c("No","Yes"))

Lines 7 through 11 build up this list for the general case, and line 12 passes it to do.call() to extract our subarray. Then we need to attach the names and set the class to "table". The former operation can be done via R’s array() function, which has the following arguments:

  • data: The data to be placed into the new array. In our case, this is subarray.

  • dim: The dimension lengths (number of rows, number of columns, number of layers, and so on). In our case, this is the value dims, computed in line 16.

  • dimnames: The dimension names and the names of their levels, already given to us by the user as the argument subnames.

This was a somewhat conceptually complex function to write, but it gets easier once you’ve mastered the inner structures of the "table" class.
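
Since the whole point of do.call() here is to handle any number of dimensions, it is worth trying the function on a three-dimensional table. The sketch below is a copy of subtable() without the listing's line numbers, with one small change of ours: sapply() replaces lapply() so that dims is a plain integer vector. The data frame v is rebuilt by hand from the six rows of the gender/race/politics example, which is an assumption about how it was created.

```r
subtable <- function(tbl, subnames) {
  tblarray <- unclass(tbl)                 # array of cell counts
  dcargs <- list(tblarray)                 # arguments for "["
  for (i in seq_along(subnames)) {
    dcargs[[i + 1]] <- subnames[[i]]       # one index vector per dimension
  }
  subarray <- do.call("[", dcargs)
  dims <- unname(sapply(subnames, length)) # number of levels kept per dimension
  subtbl <- array(subarray, dims, dimnames = subnames)
  class(subtbl) <- "table"
  subtbl
}

v <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  race   = c("W", "W", "A", "O", "B", "B"),
  pol    = c("L", "L", "C", "L", "L", "C")
)
vt <- table(v)

# Keep only the black and white respondents, in both genders and both views
res <- subtable(vt, list(gender = c("F", "M"),
                         race   = c("B", "W"),
                         pol    = c("C", "L")))
res
```

As in the original, the components of subnames must be listed in the same order as the dimensions of the table.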

It can be difficult to view a table that is very big, with a large number of rows or dimensions. One approach might be to focus on the cells with the largest frequencies. That’s the purpose of the tabdom() function developed below—it reports the dominant frequencies in a table. Here’s a simple call:

tabdom(tbl,k)

This reports the cells in the table tbl that have the k largest frequencies.

Here’s an example:

> d <- c(5,12,13,4,3,28,12,12,9,5,5,13,5,4,12)
> dtab <- table(d)
> tabdom(dtab,3)
   d Freq
3  5    4
5 12    4
2  4    2

The function tells us that the values 5 and 12 were the most frequent in d, with four instances each, and the next value reported was 4, with two instances (13 also occurred twice but fell outside the three rows requested). (The 3, 5, and 2 on the left are actually extraneous information; see the following discussion regarding converting a table to a data frame.)

As another example, consider our table cttab in the examples in the preceding sections:

> tabdom(cttab,2)
  Vote.for.X Voted.for.X.Last.Time Freq
1         No                    No    2
3        Yes                    No    1

So the combination No-No was most frequent, with two instances, with the second most frequent being Yes-No, with one instance.[1]

Well, how is this accomplished? It looks fairly complicated, but actually the work is made pretty easy by a trick, exploiting the fact that you can present tables in data frame format. Let’s use our cttab table again.

> as.data.frame(cttab)
  Vote.for.X Voted.for.X.Last.Time Freq
1         No                    No    2
2   Not Sure                    No    0
3        Yes                    No    1
4         No                   Yes    0
5   Not Sure                   Yes    1
6        Yes                   Yes    1

Note that this is not the original data frame ct from which the table cttab was constructed. It is simply a different presentation of the table itself. There is one row for each combination of the factors, with a Freq column added to show the number of instances of each combination. This latter feature makes our task quite easy.

1    # finds the cells in table tbl with the k highest frequencies; handling
2    # of ties is unrefined
3    tabdom <- function(tbl,k) {
4       # create a data frame representation of tbl, adding a Freq column
5       tbldf <- as.data.frame(tbl)
6       # determine the proper positions of the frequencies in a sorted order
7       freqord <- order(tbldf$Freq,decreasing=TRUE)
8       # rearrange the data frame in that order, and take the first k rows
9       dom <- tbldf[freqord,][1:k,]
10      return(dom)
11   }

The comments should make the code self-explanatory.

The sorting approach in line 7, which makes use of order(), is the standard way to sort a data frame (worth remembering, since the situation arises rather frequently).
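
As a standalone illustration of that idiom, here is a minimal sketch that sorts a small made-up data frame by one of its columns. The data frame df and its columns are hypothetical.

```r
df <- data.frame(name  = c("a", "b", "c", "d"),
                 score = c(3, 9, 1, 9))

# order() returns the row indices that would sort the column;
# using them as row subscripts rearranges the whole data frame
asc  <- df[order(df$score), ]
desc <- df[order(df$score, decreasing = TRUE), ]
asc
desc
```

The row names printed on the left still show each row's original position, which is the same effect seen in the tabdom() output.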

The approach taken here—converting a table to a data frame—could also be used in Section 6.3.2. However, you would need to be careful to remove levels from the factors to avoid zeros in cells.



[1] But didn’t the Not Sure–Yes and Yes-Yes combinations also have one instance and thus should be tied with Yes-No for second place? Yes, definitely. My code is cavalier regarding ties, and the reader is encouraged to refine it in that direction.
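
One possible refinement along those lines, offered only as a sketch: keep every cell whose frequency is at least as large as the kth largest one, so that ties for the last place are all reported. The function name tabdom_ties is ours, and this is just one of several reasonable tie policies.

```r
tabdom_ties <- function(tbl, k) {
  tbldf <- as.data.frame(tbl)
  tbldf <- tbldf[order(tbldf$Freq, decreasing = TRUE), ]
  k <- min(k, nrow(tbldf))        # guard against k exceeding the cell count
  cutoff <- tbldf$Freq[k]         # the kth largest frequency
  tbldf[tbldf$Freq >= cutoff, ]   # keep everything tied with it or above
}

ct <- data.frame(
  Vote.for.X            = c("Yes", "Yes", "No", "Not Sure", "No"),
  Voted.for.X.Last.Time = c("Yes", "No", "No", "Yes", "No")
)
cttab <- table(ct)
res2 <- tabdom_ties(cttab, 2)     # No-No, plus all three cells with one instance
res2

d <- c(5,12,13,4,3,28,12,12,9,5,5,13,5,4,12)
res3 <- tabdom_ties(table(d), 3)  # 5 and 12, plus both 4 and 13
res3
```

The price of this policy is that the result may contain more than k rows.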