Chapter 12. Graphics

R has a very rich set of graphics facilities. The R home page (http://www.r-project.org/) has a few colorful examples, but to really appreciate R’s graphical power, browse through the R Graph Gallery at http://addictedtor.free.fr/graphiques.

In this chapter, we cover the basics of using R’s base, or traditional, graphics package. This will give you enough foundation to start working with graphics in R. If you’re interested in pursuing R graphics further, you may want to refer to the excellent books on the subject.[4]

To begin, we’ll look at the foundational function for creating graphs: plot(). Then we’ll explore how to build a graph, from adding lines and points to attaching a legend.

We now have an empty graph, ready for the next stage, which is adding a line:

> x <- c(1,2,3)
> y <- c(1,3,8)
> plot(x,y)
> lmout <- lm(y ~ x)
> abline(lmout)

After the call to plot(), the graph will simply show the three points, along with the x- and y-axes with hash marks. The call to abline() then adds a line to the current graph. Now, which line is this?

As you learned in Section 1.5, the result of the call to the linear-regression function lm() is a class instance containing the slope and intercept of the fitted line, as well as various other quantities that don’t concern us here. We’ve assigned that class instance to lmout. The slope and intercept will now be in lmout$coefficients.

So, what happens when we call abline()? This function simply draws a straight line, with the function’s arguments treated as the intercept and slope of the line. For instance, the call abline(c(2,1)) draws this line on whatever graph you’ve built up so far:

y = 2 + 1 · x

But abline() is written to take special action if it is called on a regression object (though, surprisingly, it is not a generic function). Thus, it will pick up the slope and intercept it needs from lmout$coefficients and plot that line. It superimposes this line onto the current graph, the one that graphs the three points. In other words, the new graph will show both the points and the line, as in Figure 12-2.
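To make the connection concrete, here is a small sketch (reusing the toy data from above) showing that calling abline() on the regression object and passing the intercept and slope explicitly draw the same line; coef(lmout) is the standard accessor for lmout$coefficients:

```r
x <- c(1,2,3)
y <- c(1,3,8)
plot(x,y)
lmout <- lm(y ~ x)
abline(lmout)  # abline() extracts intercept and slope from the lm object
# equivalent: pass the intercept and slope explicitly
abline(coef(lmout)[1], coef(lmout)[2])
```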

You can add more lines using the lines() function. Though there are many options, the two basic arguments to lines() are a vector of x-values and a vector of y-values. These are interpreted as (x,y) pairs representing points to be added to the current graph, with lines connecting the points. For instance, if X and Y are the vectors (1.5,2.5) and (3,3), you could use this call to add a line from (1.5,3) to (2.5,3) to the present graph:

> lines(c(1.5,2.5),c(3,3))

If you want the lines to “connect the dots,” but don’t want the dots themselves, include type="l" in your call to lines() or to plot(), as follows:

> plot(x,y,type="l")

You can use the lty parameter in plot() to specify the type of line, such as solid or dashed. To see the types available and their codes, enter this command:

> help(par)
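For instance, here is a quick sketch of three common line-type codes (1 is solid, the default; 2 is dashed; 3 is dotted):

```r
x <- c(1,2,3)
y <- c(1,3,8)
plot(x,y,type="l",lty=1)   # solid line (the default)
lines(x,y+1,lty=2)         # dashed line, shifted up so the curves don't overlap
lines(x,y+2,lty=3)         # dotted line
```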

Let’s plot nonparametric density estimates (these are basically smoothed histograms) for two sets of examination scores in the same graph. We use the function density() to generate the estimates. Here are the commands we issue:

> d1 <- density(testscores$Exam1,from=0,to=100)
> d2 <- density(testscores$Exam2,from=0,to=100)
> plot(d1,main="",xlab="")
> lines(d2)

First, we compute nonparametric density estimates from the two variables, saving them in objects d1 and d2 for later use. We then call plot() to draw the curve for exam 1, at which point the plot looks like Figure 12-3. We then call lines() to add exam 2’s curve to the graph, producing Figure 12-4.

Note that we asked R to use blank labels for the figure as a whole and for the x-axis. Otherwise, R would have gotten such labels from d1, which would have been specific to exam 1.

Also note that we needed to plot exam 1 first. The scores there were less diverse, so the density estimate was narrower and taller. Had we plotted exam 2, with its shorter curve, first, exam 1’s curve would have been too tall for the plot window. Here, we first ran the two plots separately to see which was taller, but let’s consider a more general situation.

Say we wish to write a broadly usable function that will plot several density estimates on the same graph. For this, we would need to automate the process of determining which density estimate is tallest. To do so, we would use the fact that the estimated density values are contained in the y component of the return value from the call to density(). We would then call max() on each density estimate and use which.max() to determine which density estimate is the tallest.
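Here is one way that function might be sketched (the name plotdensities() and its list argument are our own invention for illustration): given a list of density() outputs, find the one with the largest peak and plot it first, so the y-axis set up by plot() is tall enough for all the curves.

```r
# sketch: plot several density estimates on one graph, tallest first
plotdensities <- function(densities) {
   peaks <- sapply(densities, function(d) max(d$y))  # peak height of each curve
   tallest <- which.max(peaks)
   plot(densities[[tallest]], main="", xlab="")  # sets the y-axis range
   for (i in seq_along(densities))
      if (i != tallest) lines(densities[[i]])  # superimpose the rest
}
```

With the exam data above, plotdensities(list(d1,d2)) would reproduce the two-curve graph without our having to check by hand which curve is taller.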

The call to plot() both initiates the plot and draws the first curve. (Without specifying type="l", only the points would have been plotted.) The call to lines() then adds the second curve.

In Section 9.1.7, we defined a class "polyreg" that facilitates fitting polynomial regression models. Our code there included an implementation of the generic print() function. Let’s now add one for the generic plot() function:

# polyfit(x,maxdeg) fits all polynomials up to degree maxdeg; y is
# vector for response variable, x for predictor; creates an object of
# class "polyreg", consisting of outputs from the various regression
# models, plus the original data
polyfit <- function(y,x,maxdeg) {
   pwrs <- powers(x,maxdeg)  # form powers of predictor variable
   lmout <- list()  # start to build class
   class(lmout) <- "polyreg"  # create a new class
   for (i in 1:maxdeg) {
      lmo <- lm(y ~ pwrs[,1:i])
      # extend the lm class here, with the cross-validated predictions
      lmo$fitted.xvvalues <- lvoneout(y,pwrs[,1:i,drop=F])
      lmout[[i]] <- lmo
   }
   lmout$x <- x
   lmout$y <- y
   return(lmout)
}

# generic print() for an object fits of class "polyreg":  print
# cross-validated mean-squared prediction errors
print.polyreg <- function(fits) {
   maxdeg <- length(fits) - 2  # count lm() outputs only, not $x and $y
   n <- length(fits$y)
   tbl <- matrix(nrow=maxdeg,ncol=1)
   cat("mean squared prediction errors, by degree\n")
   colnames(tbl) <- "MSPE"
   for (i in 1:maxdeg) {
      fi <- fits[[i]]
      errs <- fits$y - fi$fitted.xvvalues
      spe <- sum(errs^2)
      tbl[i,1] <- spe/n
   }
   print(tbl)
}

# generic plot(); plots fits against raw data
plot.polyreg <- function(fits) {
   plot(fits$x,fits$y,xlab="X",ylab="Y")  # plot data points as background
   maxdg <- length(fits) - 2
   cols <- c("red","green","blue")
   dg <- curvecount <- 1
   while (dg < maxdg) {
      prompt <- paste("RETURN for XV fit for degree",dg,"or type degree",
         "or q for quit ")
      rl <- readline(prompt)
      dg <- if (rl == "") dg else if (rl != "q") as.integer(rl) else break
      lines(fits$x,fits[[dg]]$fitted.values,col=cols[curvecount%%3 + 1])
      dg <- dg + 1
      curvecount <- curvecount + 1
   }
}

# forms matrix of powers of the vector x, through degree dg
powers <- function(x,dg) {
   pw <- matrix(x,nrow=length(x))
   prod <- x
   for (i in 2:dg) {
      prod <- prod * x
      pw <- cbind(pw,prod)
   }
   return(pw)
}

# finds cross-validated predicted values; could be made much faster via
# matrix-update methods
lvoneout <- function(y,xmat) {
   n <- length(y)
   predy <- vector(length=n)
   for (i in 1:n) {
      # regress, leaving out ith observation
      lmo <- lm(y[-i] ~ xmat[-i,])
      betahat <- as.vector(lmo$coef)
      # the 1 accommodates the constant term
      predy[i] <- betahat %*% c(1,xmat[i,])
   }
   return(predy)
}

# polynomial function of x, coefficients cfs
poly <- function(x,cfs) {
   val <- cfs[1]
   prod <- 1
   dg <- length(cfs) - 1
   for (i in 1:dg) {
      prod <- prod * x
      val <- val + cfs[i+1] * prod
   }
   return(val)
}

As noted, the only new code is plot.polyreg(). For convenience, the code is reproduced here:

# generic plot(); plots fits against raw data
plot.polyreg <- function(fits) {
   plot(fits$x,fits$y,xlab="X",ylab="Y")  # plot data points as background
   maxdg <- length(fits) - 2
   cols <- c("red","green","blue")
   dg <- curvecount <- 1
   while (dg < maxdg) {
      prompt <- paste("RETURN for XV fit for degree",dg,"or type degree",
         "or q for quit ")
      rl <- readline(prompt)
      dg <- if (rl == "") dg else if (rl != "q") as.integer(rl) else break
      lines(fits$x,fits[[dg]]$fitted.values,col=cols[curvecount%%3 + 1])
      dg <- dg + 1
      curvecount <- curvecount + 1
   }
}

As before, our implementation of the generic function is named by appending the class name to the generic's name, yielding plot.polyreg() here.

The while loop iterates through the various polynomial degrees. We cycle through three colors by setting the vector cols; note the expression curvecount %% 3 + 1, which selects the color for each successive curve.
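To see the cycling on its own, here is the index evaluated in isolation (a standalone sketch, separate from the plotting code):

```r
cols <- c("red","green","blue")
# as curvecount runs 1,2,3,4,..., curvecount %% 3 + 1 runs 2,3,1,2,...
sapply(1:6, function(curvecount) cols[curvecount %% 3 + 1])
# "green" "blue" "red" "green" "blue" "red"
```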

The user can choose either to plot the next sequential degree or select a different one. The query, both user prompt and reading of the user’s reply, is done in this line:

rl <- readline(prompt)

We use the R string function paste() to assemble a prompt, offering the user a choice of plotting the next fitted polynomial, plotting one of a different degree, or quitting. The prompt appears in the interactive R window in which we issued the plot() call. For instance, after taking the default choice twice, the command window looks like this:

> plot(lmo)
RETURN for XV fit for degree 1 or type degree or q for quit
RETURN for XV fit for degree 2 or type degree or q for quit
RETURN for XV fit for degree 3 or type degree or q for quit
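Incidentally, paste() separates its arguments with a single space by default, which is how the prompt string above is formed; for degree 2, for example:

```r
paste("RETURN for XV fit for degree",2,"or type degree","or q for quit ")
# "RETURN for XV fit for degree 2 or type degree or q for quit "
```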

The plot window looks like Figure 12-5.



[4] These include Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (New York: Springer-Verlag, 2009); Dianne Cook and Deborah F. Swayne, Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (New York: Springer-Verlag, 2007); Deepayan Sarkar, Lattice: Multivariate Data Visualization with R (New York: Springer-Verlag, 2008); and Paul Murrell, R Graphics (Boca Raton, FL: Chapman and Hall/CRC, 2011).