Other Factor- and Table-Related Functions

R includes a number of other functions that are handy for working with tables and factors. We’ll discuss two of them here: aggregate() and cut().

Note

Hadley Wickham’s reshape package “lets you flexibly restructure and aggregate data using just two functions: melt and cast.” This package may take a while to learn, but it is extremely powerful. His plyr package is also quite versatile. You can download both packages from R’s CRAN repository. See Appendix B for more details about downloading and installing packages.

The aggregate() function calls tapply() once for each variable in a group. For example, in the abalone data, we could find the median of each variable, broken down by gender, as follows:

> aggregate(aba[,-1],list(aba$Gender),median)
  Group.1 Length Diameter Height WholeWt ShuckedWt ViscWt ShellWt Rings
1       F  0.590    0.465  0.160 1.03850   0.44050 0.2240   0.295    10
2       I  0.435    0.335  0.110 0.38400   0.16975 0.0805   0.113     8
3       M  0.580    0.455  0.155 0.97575   0.42175 0.2100   0.276    10

The first argument, aba[,-1], is the entire data frame except for the first column, which is Gender itself. The second argument, which must be a list, is our Gender factor as before. Finally, the third argument tells R to compute the median on each column in each of the data frames generated by the subgrouping corresponding to our factors. There are three such subgroups in our example here and thus three rows in the output of aggregate().
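To make the relationship between aggregate() and tapply() concrete, here is a minimal sketch. The small data frame d is hypothetical toy data standing in for the abalone set (which is not bundled with R), and the names d, agg, and byhand are illustrative only.

```r
# Hypothetical toy data; NOT the real abalone data frame
d <- data.frame(Gender = c("F", "M", "I", "F", "M", "I"),
                Length = c(0.60, 0.58, 0.40, 0.58, 0.56, 0.44),
                Rings  = c(10, 9, 7, 12, 11, 9))

# Median of every column except Gender, broken down by Gender
agg <- aggregate(d[, -1], list(d$Gender), median)
print(agg)

# The same medians computed one column at a time with tapply()
byhand <- sapply(d[, -1], function(col) tapply(col, d$Gender, median))
print(byhand)
```

Both results should hold the same per-group medians. aggregate() returns them as a data frame with the grouping column Group.1, while the sapply()/tapply() version returns a matrix whose row names are the group labels.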

A common way to generate factors, especially for tables, is the cut() function. You give it a data vector x and a set of bins defined by a vector b. The function then determines which bin each of the elements of x falls into.

The following is the form of the call we’ll use here:

y <- cut(x,b,labels=FALSE)

where the bins are defined to be the semi-open intervals (b[1],b[2]], (b[2],b[3]],.... Here’s an example:

> z
[1] 0.88114802 0.28532689 0.58647376 0.42851862 0.46881514 0.24226859 0.05289197
[8] 0.88035617
> seq(from=0.0,to=1.0,by=0.1)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> binmarks <- seq(from=0.0,to=1.0,by=0.1)
> cut(z,binmarks,labels=F)
[1] 9 3 6 5 5 3 1 9

This says that z[1], 0.88114802, fell into bin 9, which was (0.8,0.9]; z[2], 0.28532689, fell into bin 3, which was (0.2,0.3]; and so on.
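The semi-open convention matters for values that land exactly on a bin boundary. The following is a small sketch with made-up values (x and b here are illustrative, not from the example above), relying on the default behavior of cut() in which each bin is open on the left and closed on the right:

```r
b <- c(0, 10, 20, 30)          # bins: (0,10], (10,20], (20,30]
x <- c(10, 10.5, 20, 0, 25)    # made-up values, some exactly on a boundary

bins <- cut(x, b, labels = FALSE)
print(bins)
# 10 lands in bin 1 and 20 in bin 2, since each bin includes its right endpoint.
# 0 is the open left end of the first bin, so it falls in no bin and yields NA.
```

If the lowest boundary value should be counted, cut() has an include.lowest argument that can be set to TRUE.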

With labels=FALSE, cut() returns an integer vector of bin numbers, as seen in the example’s result. But we can convert it into a factor and possibly then use it to build a table. For instance, you can imagine using this to write your own specialized histogram function. (The R function findInterval() would be useful for this, too.)
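As a rough sketch of that idea, the bin numbers for the vector z shown earlier can be turned into a factor and tabulated to get histogram-style counts. The names bins, f, counts, and fi are illustrative; the factor levels are set explicitly so that empty bins still appear in the table.

```r
z <- c(0.88114802, 0.28532689, 0.58647376, 0.42851862,
       0.46881514, 0.24226859, 0.05289197, 0.88035617)
binmarks <- seq(from = 0.0, to = 1.0, by = 0.1)

# Integer bin numbers, as in the example above
bins <- cut(z, binmarks, labels = FALSE)

# Convert to a factor with one level per bin, then count
f <- factor(bins, levels = 1:(length(binmarks) - 1))
counts <- table(f)
print(counts)

# findInterval() gives the same bin numbers for these data. Note that by
# default it treats bins as closed on the left, [b[i], b[i+1]), so it can
# differ from cut() for values exactly on a boundary.
fi <- findInterval(z, binmarks)
print(fi)
```

Passing the counts to barplot() would be one simple way to draw the resulting histogram.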