Chapter 5. Data Frames

On an intuitive level, a data frame is like a matrix, with a two-dimensional rows-and-columns structure. However, it differs from a matrix in that each column may have a different mode. For instance, one column may consist of numbers, and another column might have character strings. In this sense, just as lists are the heterogeneous analogs of vectors in one dimension, data frames are the heterogeneous analogs of matrices for two-dimensional data.

On a technical level, a data frame is a list, with the components of that list being equal-length vectors. Actually, R does allow the components to be other types of objects, including other data frames. This gives us heterogeneous-data analogs of arrays in our analogy. But this use of data frames is rare in practice, and in this book, we will assume all components of a data frame are vectors.
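
The list view can be checked directly. The following is a minimal sketch that uses a small made-up data frame (df and its columns id and label are illustrative names, not examples from this book):

```r
# Illustrative data only; df, id, and label are hypothetical names.
df <- data.frame(id = 1:3, label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

is.list(df)          # a data frame is a list underneath
length(df)           # one list component per column
sapply(df, length)   # every component has the same length
sapply(df, mode)     # but the components may differ in mode
```

Here length() counts columns rather than rows, which is what we would expect of a list whose components are the columns.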

In this chapter, we’ll present quite a few data frame examples, so you can become familiar with their variety of uses in R.

Creating Data Frames

To begin, let’s take another look at our simple data frame example from Section 1.4.5:

> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d  # matrix-like viewpoint
  kids ages
1 Jack   12
2 Jill   10

The first two arguments in the call to data.frame() are clear: we wish to produce a data frame from our two vectors, kids and ages. However, the third argument, stringsAsFactors=FALSE, requires more comment.

If the named argument stringsAsFactors is not specified, the default depends on the version of R: it is TRUE in versions before 4.0.0 and FALSE from 4.0.0 onward. (In the older versions, you can also use options() to arrange the opposite default.) A value of TRUE means that if we create a data frame from a character vector (in this case, kids), R will convert that vector to a factor. Because our work with character data will typically be with vectors rather than factors, we’ll set stringsAsFactors to FALSE explicitly. We’ll cover factors in Chapter 6.
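
To see the effect of the argument, we can build the same data frame both ways and compare the class of the kids column. This is a small sketch; the names d_chr and d_fac are chosen here for illustration, and the argument is set explicitly in both calls so the result does not depend on the default.

```r
kids <- c("Jack", "Jill")
ages <- c(12, 10)

# d_chr and d_fac are illustrative names for the two variants.
d_chr <- data.frame(kids, ages, stringsAsFactors = FALSE)
d_fac <- data.frame(kids, ages, stringsAsFactors = TRUE)

class(d_chr$kids)    # character vector, left as is
class(d_fac$kids)    # converted to a factor
levels(d_fac$kids)   # the distinct strings become the factor levels
```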

Accessing Data Frames

Now that we have a data frame, let’s explore a bit. Since d is a list, we can access it as such via component index values or component names:

> d[[1]]
[1] "Jack" "Jill"
> d$kids
[1] "Jack" "Jill"

But we can treat it in a matrix-like fashion as well. For example, we can view column 1:

> d[,1]
[1] "Jack" "Jill"

This matrix-like quality is also seen when we take d apart using str():

> str(d)
'data.frame':   2 obs. of  2 variables:
 $ kids: chr  "Jack" "Jill"
 $ ages: num  12 10

R tells us here that d consists of two observations—our two rows—that store data on two variables—our two columns.
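
The same counts are available through the matrix-style query functions. A brief sketch, rebuilding the data frame so the block stands alone:

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

dim(d)     # rows, then columns, as for a matrix
nrow(d)    # number of observations
ncol(d)    # number of variables
names(d)   # the variable (column) names
```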

Consider three ways to access the first column of our data frame above: d[[1]], d[,1], and d$kids. Of these, the third would generally be considered clearer and, more importantly, safer than the first two. Naming the column identifies it better and makes it less likely that you will reference the wrong column. But in writing general code (say, writing R packages), the matrix-like notation d[,1] is needed, and it is especially handy if you are extracting subdata frames (as you’ll see in Section 5.2).
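
The safety argument can be made concrete with a short sketch. Here d2 is an illustrative name for a copy of d with its columns reordered. Access by position then silently returns a different column, while access by name does not.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10),
                stringsAsFactors = FALSE)

# All three forms return the same character vector.
identical(d[[1]], d$kids)
identical(d[, 1], d$kids)

# d2 (hypothetical) holds the same data with the columns swapped.
d2 <- d[, c("ages", "kids")]
d2[[1]]    # position 1 now holds the ages
d2$kids    # the name still finds the right column
```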

Extended Example: Regression Analysis of Exam Grades Continued

Recall our course examination data set from Section 1.5. There, the file had no header, but for this example it does, and the first few records in the file are now as follows:

"Exam 1" "Exam 2" Quiz
2.0      3.3      4.0
3.3      2.0      3.7
4.0      4.0      4.0
2.3      0.0      3.3
2.3      1.0      3.3
3.3      3.7      4.0

As you can see, each line contains the three test scores for one student. This is the classic two-dimensional file notion, like that alluded to in the preceding output of str(). Here, each line in our file contains the data for one observation in a statistical data set. The idea of a data frame is to encapsulate such data, along with variable names, into one object.

Notice that we have separated the fields here by spaces. Other delimiters may be specified, notably commas for comma-separated value (CSV) files (as you’ll see in Section 5.2.5). The variable names specified in the first record must be separated by the same delimiter as used for the data, which is spaces in this case. If the names themselves contain embedded spaces, as we have here, they must be quoted.

We read in the file as before, but in this case we state that there is a header record:

> examsquiz <- read.table("exams",header=TRUE)

The column names now appear, with periods replacing blanks:

> head(examsquiz)
  Exam.1 Exam.2 Quiz
1    2.0    3.3  4.0
2    3.3    2.0  3.7
3    4.0    4.0  4.0
4    2.3    0.0  3.3
5    2.3    1.0  3.3
6    3.3    3.7  4.0
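
The header handling can be tried without the exams file. The sketch below writes the first three records shown earlier to a temporary file (path, ex, and ex_raw are illustrative names) and reads them back twice. The second call passes check.names=FALSE, an argument of read.table() that turns off the conversion of names, so the embedded blanks are kept.

```r
# A temporary file stands in for "exams"; only three records are used.
path <- tempfile()
writeLines(c('"Exam 1" "Exam 2" Quiz',
             '2.0      3.3      4.0',
             '3.3      2.0      3.7',
             '4.0      4.0      4.0'), path)

ex <- read.table(path, header = TRUE)
names(ex)            # blanks in the names are replaced by periods

ex_raw <- read.table(path, header = TRUE, check.names = FALSE)
names(ex_raw)        # names are kept exactly as written in the header
ex_raw[["Exam 1"]]   # such names need [[ ]] or backquotes to access

unlink(path)         # remove the temporary file
```

Keeping the converted names is usually more convenient, since a name such as Exam.1 can be used directly after the $ operator.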