Preview of Some Important R Data Structures

R has a variety of data structures. Here, we will sketch some of the most frequently used structures to give you an overview of R before we dive into the details. This way, you can at least get started with some meaningful examples, even if the full story behind them must wait.

The vector type is really the heart of R. It’s hard to imagine R code, or even an interactive R session, that doesn’t involve vectors.

The elements of a vector must all have the same mode, or data type. You can have a vector consisting of three character strings (of mode character) or three integer elements (of mode integer), but not a vector with one integer element and two character string elements. We’ll talk more about vectors in Chapter 2.
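
As a minimal sketch of what happens if you try to mix modes anyway: c() does not stop with an error but converts every element to a common mode, here character. The variable name w is ours, chosen only for illustration.

```r
# Try to combine one number with two character strings.
w <- c(5, "abc", "de")
w          # the number 5 has become the string "5"
mode(w)    # "character"
length(w)  # 3
```

So the rule still holds: the resulting vector has a single mode, and the numeric value has lost its numeric nature.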

Character strings are actually single-element vectors of mode character (rather than mode numeric):

> x <- c(5,12,13)
> x
[1]  5 12 13
> length(x)
[1] 3
> mode(x)
[1] "numeric"
> y <- "abc"
> y
[1] "abc"
> length(y)
[1] 1
> mode(y)
[1] "character"
> z <- c("abc","29 88")
> length(z)
[1] 2
> mode(z)
[1] "character"

In the first example, we create a vector x of numbers, thus of mode numeric. Then we create two vectors of mode character: y is a one-element (that is, one-string) vector, and z consists of two strings.

R has various string-manipulation functions. Many deal with putting strings together or taking them apart, such as the two shown here:

> u <- paste("abc","de","f")  # concatenate the strings
> u
[1] "abc de f"
> v <- strsplit(u," ")  # split the string according to blanks
> v
[[1]]
[1] "abc" "de"  "f"
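
Two details of these functions are worth a short sketch. First, paste() separates its arguments with a blank by default, which the sep argument changes. Second, the [[1]] in the output above appears because strsplit() returns a list (a structure introduced later in this section), whose first component is the vector of pieces. The name u2 is ours.

```r
u  <- paste("abc", "de", "f")             # default separator: a single blank
u2 <- paste("abc", "de", "f", sep = "")   # no separator at all
u2                                        # "abcdef"

v <- strsplit(u, " ")   # a list with one component per input string
v[[1]]                  # the character vector of pieces
length(v[[1]])          # 3
```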

Strings will be covered in detail in Chapter 11.

An R matrix corresponds to the mathematical concept of the same name: a rectangular array of numbers. Technically, a matrix is a vector, but with two additional attributes: the number of rows and the number of columns. Here is some sample matrix code:

> m <- rbind(c(1,4),c(2,2))
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    2
> m %*% c(1,1)
     [,1]
[1,]    5
[2,]    4

First, we use the rbind() (for row bind) function to build a matrix from two vectors that will serve as its rows, storing the result in m. (A corresponding function, cbind(), combines several columns into a matrix.) Then entering the variable name alone, which we know will print the variable, confirms that the intended matrix was produced. Finally, we compute the matrix product of m and the vector (1,1), in that order. The matrix-multiplication operator, which you may know from linear algebra courses, is %*% in R.
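
The following sketch builds the same matrix with cbind() and shows that the order of the operands in %*% matters, since matrix multiplication is not commutative. The name m2 is ours, used only for illustration.

```r
m  <- rbind(c(1, 4), c(2, 2))   # build by rows, as in the text
m2 <- cbind(c(1, 2), c(4, 2))   # build the same matrix by columns
all(m == m2)                    # TRUE
dim(m)                          # 2 2: number of rows, number of columns

m %*% c(1, 1)   # (1,1) is treated as a column: a 2-by-1 result holding 5, 4
c(1, 1) %*% m   # (1,1) is treated as a row: a 1-by-2 result holding 3, 6
```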

Matrices are indexed using double subscripting, much as in C/C++, although subscripts start at 1 instead of 0.

> m[1,2]
[1] 4
> m[2,2]
[1] 2

An extremely useful feature of R is that you can extract submatrices from a matrix, much as you extract subvectors from vectors. Here’s an example:

> m[1,]  # row 1
[1] 1 4
> m[,2]  # column 2
[1] 4 2
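
Submatrix extraction is easier to appreciate on a slightly larger matrix. Here is a small sketch, using a 3-by-3 matrix b that we made up for the purpose:

```r
b <- matrix(1:9, nrow = 3)   # 3-by-3 matrix, filled column by column
b[2:3, 1:2]                  # rows 2-3 of columns 1-2: a 2-by-2 submatrix
b[, c(1, 3)]                 # all rows, columns 1 and 3
b[-1, ]                      # negative subscripts exclude: everything but row 1
```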

We’ll talk more about matrices in Chapter 3.

Like an R vector, an R list is a container for values, but its contents can be items of different data types. (C/C++ programmers will note the analogy to a C struct.) List elements are accessed using two-part names, which are indicated with the dollar sign $ in R. Here’s a quick example:

> x <- list(u=2, v="abc")
> x
$u
[1] 2

$v
[1] "abc"

> x$u
[1] 2

The expression x$u refers to the u component in the list x. The latter contains one other component, denoted by v.
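
A few other basic list operations, sketched briefly. The component name w added here is hypothetical, chosen only for illustration.

```r
x <- list(u = 2, v = "abc")
names(x)          # "u" "v"
x[["v"]]          # double brackets reach the same component as x$v
x$w <- c(5, 12)   # assigning to a new name adds a component
length(x)         # 3
```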

A common use of lists is to combine multiple values into a single package that can be returned by a function. This is especially useful for statistical functions, which can have elaborate results. As an example, consider R’s basic histogram function, hist(), introduced in Section 1.2. We called the function on R’s built-in Nile River data set:

> hist(Nile)

This produced a graph, but hist() also returns a value, which we can save:

> hn <- hist(Nile)

What’s in hn? Let’s take a look:

> print(hn)
$breaks
 [1]  400  500  600  700  800  900 1000 1100 1200 1300 1400

$counts
 [1]  1  0  5 20 25 19 12 11  6  1

$intensities
 [1] 9.999998e-05 0.000000e+00 5.000000e-04 2.000000e-03 2.500000e-03
 [6] 1.900000e-03 1.200000e-03 1.100000e-03 6.000000e-04 1.000000e-04

$density
 [1] 9.999998e-05 0.000000e+00 5.000000e-04 2.000000e-03 2.500000e-03
 [6] 1.900000e-03 1.200000e-03 1.100000e-03 6.000000e-04 1.000000e-04

$mids
 [1]  450  550  650  750  850  950 1050 1150 1250 1350

$xname
[1] "Nile"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Don’t try to understand all of that right away. For now, the point is that, besides making a graph, hist() returns a list with a number of components. Here, these components describe the characteristics of the histogram. For instance, the breaks component tells us where the bins in the histogram start and end, and the counts component gives the number of observations in each bin.

The designers of R decided to package all of the information returned by hist() into an R list, which can be accessed and manipulated by further R commands via the dollar sign.
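
Here is a minimal sketch of working with the returned list. We pass plot = FALSE so that hist() computes the list without drawing anything. The function center_spread() is hypothetical, not part of R; it only illustrates how your own functions can package several results in the same way.

```r
hn <- hist(Nile, plot = FALSE)   # compute the histogram list, skip the graph
hn$counts                        # observations per bin
sum(hn$counts)                   # total number of observations counted
length(Nile)                     # agrees with the sum above
hn$mids[which.max(hn$counts)]    # midpoint of the fullest bin

# Hypothetical function that returns two results packaged in a list
center_spread <- function(v) {
  list(center = mean(v), spread = sd(v))
}
cs <- center_spread(c(5, 12, 13))
cs$center                        # 10
```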

Remember that we could also print hn simply by typing its name:

> hn

But a more compact alternative for printing lists like this is str():

> str(hn)
List of 7
 $ breaks     : num [1:11] 400 500 600 700 800 900 1000 1100 1200 1300 ...
 $ counts     : int [1:10] 1 0 5 20 25 19 12 11 6 1
 $ intensities: num [1:10] 0.0001 0 0.0005 0.002 0.0025 ...
 $ density    : num [1:10] 0.0001 0 0.0005 0.002 0.0025 ...
 $ mids       : num [1:10] 450 550 650 750 850 950 1050 1150 1250 1350
 $ xname      : chr "Nile"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"

Here str stands for structure. This function shows the internal structure of any R object, not just lists.
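
To see this, you might try str() on the kinds of objects met earlier in this section. This is only a quick sketch; the exact layout of the output may vary somewhat across R versions.

```r
str(c(5, 12, 13))              # a numeric vector of length 3
str("abc")                     # a one-element character vector
str(rbind(c(1, 4), c(2, 2)))   # a 2-by-2 numeric matrix
str(list(u = 2, v = "abc"))    # a list of 2 components
```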

A typical data set contains data of different modes. In an employee data set, for example, we might have character string data, such as employee names, and numeric data, such as salaries. So, although a data set of (say) 50 employees with 4 variables per worker has the look and feel of a 50-by-4 matrix, it does not qualify as such in R, because it mixes types.

Instead of a matrix, we use a data frame. A data frame in R is a list, with each component of the list being a vector corresponding to a column in our “matrix” of data. Indeed, you can create data frames in just this way:

> d <- data.frame(list(kids=c("Jack","Jill"),ages=c(12,10)))
> d
  kids ages
1 Jack   12
2 Jill   10
> d$ages
[1] 12 10
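
As a quick check on the claim that a data frame is a list, the sketch below queries d with list operations, and then with matrix-style subscripts to show the rectangular "look and feel."

```r
d <- data.frame(list(kids = c("Jack", "Jill"), ages = c(12, 10)))
is.list(d)    # TRUE: a data frame is a list underneath
names(d)      # "kids" "ages", one component per column
d[["ages"]]   # list-style access, same as d$ages
d[2, ]        # matrix-style access: the second row
dim(d)        # 2 2
```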

Typically, though, data frames are created by reading in a data set from a file or database.
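
A minimal sketch of the file-reading route, using read.csv(). To keep the example self-contained, we first write a tiny made-up file to a temporary location; in practice, you would supply the path to your own data file.

```r
# Write a small CSV file; its contents are invented for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("kids,ages", "Jack,12", "Jill,10"), tmp)

# The header line of the file supplies the column names.
d2 <- read.csv(tmp, stringsAsFactors = FALSE)
d2$ages       # 12 10
unlink(tmp)   # remove the temporary file
```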

We’ll talk more about data frames in Chapter 5.

R is an object-oriented language. Objects are instances of classes. Classes are a bit more abstract than the data types you’ve met so far. Here, we’ll look briefly at the concept using R’s S3 classes. (The name stems from their use in the old S language, version 3, which was the inspiration for R.) Most of R is based on these classes, and they are exceedingly simple. Their instances are simply R lists but with an extra attribute: the class name.

For example, we noted earlier that the (nongraphical) output of the hist() histogram function is a list with various components, such as the breaks and counts components. There was also an attribute, which specified the class of the list, namely histogram.

> print(hn)
$breaks
 [1]  400  500  600  700  800  900 1000 1100 1200 1300 1400

$counts
 [1]  1  0  5 20 25 19 12 11  6  1
...
...
attr(,"class")
[1] "histogram"

At this point, you might be wondering, “If S3 class objects are just lists, why do we need them?” The answer is that the classes are used by generic functions. A generic function stands for a family of functions, all serving a similar purpose but each appropriate to a specific class.

A commonly used generic function is summary(). An R user who wants to use a statistical function, like hist(), but is unsure of how to deal with its output (which can be voluminous), can simply call summary() on the output, which is not just a list but an instance of an S3 class.

The summary() function, in turn, is actually a family of summary-making functions, each handling objects of a particular class. When you call summary() on some output, R searches for a summary function appropriate to the class at hand and uses it to give a friendlier representation of the list. Thus, calling summary() on the output of hist() produces a summary tailored to that function, and calling summary() on the output of the lm() regression function produces a summary appropriate for that function.
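
The dispatch mechanism can be sketched with a class of our own. Everything named here (the class employee, the object joe, and the function summary.employee()) is hypothetical, invented only for illustration. The sketch relies on the S3 convention that the method for a generic is named generic.classname.

```r
# An S3 object: an ordinary list plus a class attribute.
joe <- list(name = "Joe", salary = 55000)
class(joe) <- "employee"

# A method for the summary() generic, selected by the object's class.
summary.employee <- function(object, ...) {
  paste(object$name, "earns", object$salary)
}

summary(joe)            # dispatches to summary.employee()
summary(unclass(joe))   # with the class removed, the default summary is used
```

The same call, summary(), thus gives a tailored result for an employee object and a generic one for a plain list.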

The plot() function is another generic function. You can use plot() on just about any R object. R will find an appropriate plotting function based on the object’s class.
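
As a brief sketch, we can ask R which plotting function it would choose for the histogram object from earlier. The call plot = FALSE is used here only to obtain the object without drawing a graph.

```r
hn <- hist(Nile, plot = FALSE)
class(hn)                                       # "histogram"
is.function(getS3method("plot", "histogram"))   # TRUE: a histogram-specific method exists
methods(plot)                                   # lists the plot methods currently available
```

Calling plot(hn) would then draw the histogram, using the method found for class histogram.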

Classes are used to organize objects. Together with generic functions, they allow flexible code to be developed for handling a variety of different but related tasks. Chapter 9 covers classes in depth.