An Overview of R Graphics

R includes tools for drawing most common types of charts, including bar charts, pie charts, line charts, and scatter plots. Additionally, R can also draw some less-familiar charts like quantile-quantile (Q-Q) plots, mosaic plots, and contour plots. The following table shows many of the charts included in the graphics package.

Graphics package function	Description
`barplot`	Bar and column charts
`dotchart`	Cleveland dot plots
`hist`	Histograms
`density`	Kernel density plots
`stripchart`	Strip charts
`qqnorm` (in `stats` package)	Quantile-quantile plots
`xplot`	Scatter plots
`smoothScatter`	Smooth scatter plots
`qqplot` (in `stats` package)	Quantile-quantile plots
`pairs`	Scatter plot matrices
`image`	Image plots
`contour`	Contour plots
`persp`	Perspective charts of three-dimensional data
`interaction.plot`	Summary of the response for two-way combinations of factors
`sunflowerplot`	Sunflower plots

You can show R graphics on the screen or save them in many different formats. Graphics Devices explains how to choose output methods. R gives you an enormous amount of control over graphics. You can control almost every aspect of a chart. Customizing Charts explains how to tweak the output of R to look the way you want. This section shows how to use many common types of R charts.

Scatter Plots

To show how to use scatter plots, we will look at cases of cancer in 2008 and toxic waste releases by state in 2006. Data on new cancer cases (and deaths from cancer) are tabulated by the American Cancer Society; information on toxic chemicals released into the environment is tabulated by the U.S. Environmental Protection Agency (EPA).^[35]

The sample data is included in the nutshell package:

> library(nutshell)
> data(toxins.and.cancer)

To show a scatter plot, use the plot function. plot is a generic function (you can “plot” many different types of objects); plot also can draw many types of objects, including vectors, tables, and time series. For simple scatter plots with two vectors, the function that is called is plot.default:

plot(x, y = NULL, type = "p",  xlim = NULL, ylim = NULL,
     log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
     ann = par("ann"), axes = TRUE, frame.plot = axes,
     panel.first = NULL, panel.last = NULL, asp = NA, ...)

Here is a description of the arguments to plot.

Argument	Description	Default
x, y	The data to be plotted. You may specify two separate vectors, `x` and `y`. Otherwise, you may specify a time series, formula, list, or matrix with two or more columns; see the help file for `xy.coords` for more details.
type	A character value that specifies the type of plot: `type="p"` for points, `type="l"` for lines, `type="o"` for overplotted points and lines, `type="b"` for points joined by lines, `type="s"` for stair steps, `type="h"` for histogram-style vertical lines, or `type="n"` for no points or lines.	`"p"`
xlim	A numeric vector with two values specifying the `x` limits of the plot.	`NULL`
log	A character value that specifies which axes should be plotted with a logarithmic scale. Use `log=""` for neither, `log="x"` for the x-axis, `log="y"` for the y-axis, and `log="xy"` for both.	`""`
main	The main title for the plot.	`NULL`
sub	The subtitle for the plot.	`NULL`
xlab	The label for the x-axis.	`NULL`
ylab	The label for the y-axis.	`NULL`
ann	If `ann=TRUE`, then axis titles and overall titles are included with plots. If `ann=FALSE`, then these annotations are not included.	`par("ann")`
axes	A logical value that specifies whether axes should be drawn.	`TRUE`
frame.plot	A logical value that specifies whether a box should be drawn around the plot.	`axes`
panel.first	An expression that is evaluated after the axes are drawn but before points are plotted.	`NULL`
panel.last	An expression that is evaluated after the points are plotted.	`NULL`
asp	A numeric value that specifies the aspect ratio of the plot (as y/x).	`NA`
...	Additional graphical parameters. (See Graphical Parameters for more information.)

Now let’s try our first plot. Let’s compare the overall cancer rate (number of cancer deaths divided by state population) to the presence of toxins (total toxic chemicals release divided by state area):

> # use with so that we don't have to keep typing the
> # data frame name
> attach(toxins.and.cancer)
> plot(total_toxic_chemicals/Surface_Area, deaths_total/Population)

The chart is shown in Figure 13-1. Perhaps there is a stronger correlation between airborne toxins and lung cancer:

> plot(air_on_site/Surface_Area, deaths_lung/Population)

This chart is shown in Figure 13-2. Suppose that you wanted to know which states were associated with which points. R provides some interactive tools for identifying points on plots. You can use the locator function to tell you the coordinates of a specific point (or set of points). To do this, first plot the data. Next, type locator(1). Then click on a point in the open graphics window. As an example, suppose that you plotted the data above, typed locator(1), and then clicked on the point in the upper-right corner. You would see output like this in the R console:

> locator(1)
$x
[1] 0.002499427

$y
[1] 0.0008182696

Figure 13-1. Total toxins and new cancer cases

Another useful function for identifying points is identify. This function can be used to interactively label points on a plot. To use identify with the data above, you would enter:

> identify(air_on_site/Surface_Area, deaths_lung/Population,
+   State_Abbrev)

While this command is running, you can click on individual points on the chart, and R will label those points with state names.

Figure 13-2. Toxins released by air and lung cancer deaths per capita

If you wanted to label all the points at once, you could use the text function to add labels to the plot. Here is how I drew the plot shown in Figure 13-3:

> plot(air_on_site/Surface_Area, deaths_lung/Population,
+   xlab="Air Release Rate of Toxic Chemicals",
+   ylab="Lung Cancer Death Rate")
> text(air_on_site/Surface_Area, deaths_lung/Population,
+   labels=State_Abbrev,
+   cex=0.5,
+   adj=c(0,-1))

Notice that I have added some extra arguments to refine the appearance of the plot. The xlab and ylab arguments are used to add labels to the x- and y-axes. The text function draws a label next to each point. I tweaked the size placement of the labels using the cex and adj arguments; see Graphical Parameters for more information.

Is this relationship significant? It is actually statistically significant (see Correlation tests), but we don’t have enough information to make a good argument that there is a causal relationship.

Figure 13-3. Toxins released by air and lung cancer deaths per capita, cleaned up

The plot function is a good choice if you want to plot two columns of data on one chart. However, suppose that you have more columns of data to plot, perhaps split into different categories. Or, suppose that you want to plot all the columns of one matrix against all the columns of another matrix. To plot multiple sets of columns against one another, you can use the matplot function:

matplot(x, y, type = "p", lty = 1:5, lwd = 1, pch = NULL,
        col = 1:6, cex = NULL, bg = NA,
        xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL,
        ..., add = FALSE, verbose = getOption("verbose"))

Matplot accepts the following arguments.

Argument	Description	Default
x, y	Vectors or matrices containing the data to be plotted. The number of rows and columns should match.	If `x` is not specified, then `x=1:ncol(y)`. If `y` is not specified, then `y=x; x=1:ncol(y)`.
type	A character vector specifying the types of plots to generate. Use `type="p"` for points, `type="l"` for lines, `type="b"` for both, `type="c"` for the lines part alone of `"b"`, `type="o"` for both overplotted points and lines, `type="h"` for histogram-like (or high-density) vertical lines, `type="s"` for stair steps, `type="S"` for other steps, or `type="n"` for no plotting.	`"p"`
lty	A vector of line types. See Graphical parameters by name for more details.	`1:5`
lwd	A vector of line widths. See Graphical parameters by name for more details.	`1`
pch	A vector of plotting characters. See Graphical parameters by name for more details.	`NULL`
col	A vector of colors. See Graphical parameters by name for more details.	`1:6`
cex	A vector of character expansion sizes. See Graphical parameters by name for more details.	`NULL`
bg	A vector of background colors for plot symbols. See Graphical parameters by name for more details.	`NA`
xlab, ylab	Character values specifying x- and y-axis labels.	`NULL`
xlim, ylim	Numeric values specifying ranges for the x- and y-axes.	`NULL`
...	Additional graphical parameters that are passed to `par`.	`NULL`
add	A logical value indicating whether to add to the current plot or to generate a new plot.	`FALSE`
verbose	A logical value indicating whether to write information to the console describing what `matplot` did.	`getOption("verbose")`

Many arguments to matplot have the same names as standard arguments to par. However, because matplot generates multiple plots at the same time, these arguments can be specified as vectors of multiple values when called by matplot. For more details on standard arguments, see Graphical Parameters.

If you are plotting a very large number of points, then you may prefer the function smoothScatter, which plots the density of points by shading different regions of the plot different shades, depending on the density of points in each region:

smoothScatter(x, y = NULL, nbin = 128, bandwidth,
              colramp = colorRampPalette(c("white", blues9)),
              nrpoints = 100, pch = ".", cex = 1, col = "black",
              transformation = function(x) x^.25,
              postPlotHook = box,
              xlab = NULL, ylab = NULL, xlim, ylim,
              xaxs = par("xaxs"), yaxs = par("yaxs"), ...)

For an example of smoothScatter, see Correlation and Covariance.

If you have a data frame with n different variables and you would like to generate a scatter plot for each pair of values in the data frame, try the pairs function. As an example, let’s plot the hits, runs, strikeouts, walks, and home runs for each Major League Baseball (MLB) player who had more than 100 at bats in 2008. To do this, we would use the following R statement:

> library(nutshell)
> data(batting.2008)
> pairs(batting.2008[batting.2008$AB>100, c("H", "R", "SO", "BB", "HR")])

The result is shown in Figure 13-4.

Figure 13-4. Pairs example

Plotting Time Series

R includes tools for plotting time series data. The plot function has a method for time series:

plot(x, y = NULL, plot.type = c("multiple", "single"),
        xy.labels, xy.lines, panel = lines, nc, yax.flip = FALSE,
        mar.multi = c(0, 5.1, 0, if(yax.flip) 5.1 else 2.1),
        oma.multi = c(6, 0, 5, 0), axes = TRUE, ...)

The arguments x and y specify ts objects, panel specifies how to plot the time series (by default, lines), and other arguments specify how to break time series into different plots (as in lattice). As an example, we’ll plot the turkey price data:

> library(nutshell)
> data(turkey.price.ts)
> plot(turkey.price.ts)

The results are shown in Figure 13-5. As you can see, turkey prices are very seasonal. There are huge sales in November and December (for Thanksgiving and Christmas) and minor sales in spring (probably for Easter).

Figure 13-5. Time series plot

Another way to look at seasonal effects is with an autocorrelation plot (which is also called a correlogram; see Figure 13-6). This plot shows how correlated points are with each other, by difference in time. You can also plot the autocorrelation function for a time series (which can be helpful for looking at cyclical effects). The plot is generated by default when you call acf, which computes the autocorrelation function. (Alternatively, you can generate the autocorrelation function with acf and then plot it separately.) Here is how to generate the plot for the turkey price data:

> acf(turkey.price.ts)

Figure 13-6. Autocorrelation function plot

As you can see, points are correlated over 12-month cycles (and inversely correlated over 6-month cycles). Time series analysis is discussed further in Chapter 23.

Bar Charts

To draw bar (or column) charts in R, use the barplot function.

As an example, let’s look at doctoral degrees awarded in the United States between 2001 and 2006:^[36]

> doctorates <- data.frame (
+   year=c(2001, 2002, 2003, 2004, 2005, 2006),
+   engineering=c(5323, 5511, 5079, 5280, 5777, 6425),
+   science=c(20643, 20017, 19529, 20001, 20498, 21564),
+   education=c(6436, 6349, 6503, 6643, 6635, 6226),
+   health=c(1591, 1541, 1654, 1633, 1720, 1785),
+   humanities=c(5213, 5178, 5051, 5020, 5013, 4949),
+   other=c(2159, 2141, 2209, 2180, 2480, 2436)
+ )

Or, if you prefer, you can just load the data from the nutshell package:

> library(nutshell)
> data(doctorates)

Now let’s transform this into a matrix for plotting:

> # make this into a matrix:
> doctorates.m <- as.matrix(doctorates[2:7])
> rownames(doctorates.m) <- doctorates[, 1]
> doctorates.m
     engineering science education health humanities other
2001        5323   20643      6436   1591       5213  2159
2002        5511   20017      6349   1541       5178  2141
2003        5079   19529      6503   1654       5051  2209
2004        5280   20001      6643   1633       5020  2180
2005        5777   20498      6635   1720       5013  2480
2006        6425   21564      6226   1785       4949  2436

The barplot function can’t work with a data frame, so we’ve created a matrix object for this problem with the data.

Let’s start by just showing a bar plot of doctorates in 2001 by type:

> barplot(doctorates.m[1, ])

As you can see from Figure 13-7, by default R shows the y-axis along with the size of each bar, but it does not show the x-axis. R also automatically uses column names to name the bars. Suppose that we wanted to show all the different years as bars stacked next to one another. Suppose that we also wanted the bars plotted horizontally and wanted to show a legend for the different years. To do this, we could use the following expression to generate the chart shown in Figure 13-8:

> barplot(doctorates.m, beside=TRUE, horiz=TRUE, legend=TRUE, cex.names=.75)

Figure 13-7. Simple bar plot example

Figure 13-8. Horizontal juxtaposed bar plot example

Finally, suppose that we wanted to show doctorates by year as stacked bars. To do this, we need to transform the matrix so that each column is a year and each row is a discipline. We also need to make sure that there is enough room to see the legend, so we’ll extend the limits on the y-axis:

> barplot(t(doctorates.m), legend=TRUE, ylim=c(0, 66000))

The chart generated by this expression is shown in Figure 13-9.

Figure 13-9. Stacked bar plot example

Here is a detailed description of barplot:

barplot(height, width = 1, space = NULL,
        names.arg = NULL, legend.text = NULL, beside = FALSE,
        horiz = FALSE, density = NULL, angle = 45,
        col = NULL, border = par("fg"),
        main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
        xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
        axes = TRUE, axisnames = TRUE,
        cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
        inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
        add = FALSE, args.legend = NULL, ...)

The barplot function is very flexible; here is a description of the arguments to barplot.

Argument	Description	Default
height	Either a numeric vector or a numeric matrix representing the values to be plotted. If values are given as a matrix and `beside=FALSE`, then the bars are stacked. If `beside=TRUE`, then the bars are plotted next to one another.
width	A numeric vector representing the widths of the bars.	`1`
space	If `beside=FALSE`, a numeric value indicating the amount of space between bars. You specify the space as a fraction of the average column width. If `beside=TRUE`, then you can specify a two-element vector, where the first element specifies the space within a group and the second represents the space between groups.	`if (is.matrix(height) & beside=TRUE) c(0, 1) else 0.2`
names.arg	A character vector specifying the names to be plotted for each bar (or group of bars).	`if (is.matrix(height)) col.names(height) else names(height)`
legend.text	A character vector or a logical value. If a logical value is given, then a legend is generated using the row names of height. If a character vector is given, then those character values are used instead. This function is mostly useful when height is a matrix (and there are two different dimensions that need labels).	`NULL`
beside	A logical value indicating whether columns should be stacked or drawn beside one another. Meaningful only when `height` is a matrix.	`FALSE`
horiz	A logical value specifying the direction to draw the bars. If `horiz=FALSE`, then bars are drawn vertically from left to right. If `horiz=TRUE`, then bars are drawn horizontally from bottom to top.	`FALSE`
density	A numeric value that specifies the density of shading lines in lines per inch. `density=NULL` means that no lines are drawn.	`NULL`
angle	A numeric value that specifies the slope of the shading lines (in degrees).	`45`
col	A vector of colors to use for the bars (or bar components).	Gray is used if height is a vector; a gamma-corrected gray palette if height is a matrix.
border	The color to be used for the border of the bars.	`par("fg")`
main	A character value to be used as the overall title.	`NULL`
sub	A character value to be used as a subtitle.	`NULL`
xlab	A character value to use as the label for the x-axis.	`NULL`
ylab	A character value to use as the label for the y-axis.	`NULL`
xlim	Limits for the x-axis.	`NULL`
ylim	Limits for the y-axis.	`NULL`
xpd	A logical value indicating if bars should be allowed to go outside the region.	`TRUE`
log	A character value specifying whether axis scales should be logarithmic.	`""`
axes	A logical value indicating whether axes should be drawn.	`TRUE`
axisnames	If `axisnames=TRUE` and `names.arg` is not null, then the second axis is drawn and labeled.	`TRUE`
cex.axis	A numeric value specifying the size of numeric axis labels relative to other text (`cex.axis=1` means full size).	`par("cex.axis")`
cex.names	A numeric value specifying the size of axis names relative to other text (`cex.axis=1` means full size).	`par("cex.axis")`
inside	A logical value that indicates whether to draw lines separating bars when `beside=TRUE`.	`TRUE`
plot	A logical value indicating whether to plot the chart.	`TRUE`
axis.lty	The line type for the axis.	`0`
offset	A numeric vector indicating how much bars should be shifted relative to the x-axis.	`0`
add	A logical value indicating if bars should be added to an existing plot.	`FALSE`
args.legend	A list of arguments to pass to legend (if `legend.text` is used).	`NULL`
...	Additional arguments passed to other graphical routines used inside `barplot` (typically, arguments to `par`).

Pie Charts

One of the most popular ways to plot data is the pie chart. Pie charts can be an effective way to compare different parts of a quantity, though there are lots of good reasons not to use pie charts.^[37] You can draw pie charts in R using the pie function:

pie(x, labels = names(x), edges = 200, radius = 0.8,
    clockwise = FALSE, init.angle = if(clockwise) 90 else 0,
    density = NULL, angle = 45, col = NULL, border = NULL,
    lty = NULL, main = NULL, ...)

Here is a description of the arguments to pie.

Argument	Description	Default
x	A vector of nonnegative numeric values that will be plotted.
labels	An expression to generate labels, a vector of character strings, or another object that can be coerced to a `graphicsAnnot` object and used as labels.	`names(x)`
edges	A numeric value indicating the number of segments used to draw the outside of the pie.	`200`
radius	A numeric value that specifies how big the pie should be. (Parts of the pie are cut off for values over 1.)	`0.8`
clockwise	A logical value indicating whether slices are drawn clockwise or counterclockwise.	`FALSE`
init.angle	A numeric value specifying the starting angle for the slices (in degrees).	`if (clockwise) 90 else 0`
density	A numeric value that specifies the density of shading lines in lines per inch. `density=NULL` means that no lines are drawn.	`NULL`
angle	A numeric value that specifies the slope of the shading lines (in degrees).	`45`
col	A numeric vector that specifies the colors to be used for slices. If `col=NULL`, then a set of six pastel colors is used.	`NULL`
border	Arguments passed to the polygon function to draw each slice.	`NULL`
lty	The line type used to draw each slice.	`NULL`
main	A character string that represents the title.	`NULL`

As a simple example, let’s use pie charts to show what happened to fish caught in the United States in 2006:

> # 2006 fishery data from 
> #   http://www.census.gov/compendia/statab/tables/09s0852.xls
> # units are millions of pounds of live fish
> domestic.catch.2006 <- c(7752, 1166, 463, 108)
> names(domestic.catch.2006) <- c("Fresh and frozen",
+    "Reduced to meal, oil, etc.", "Canned", "Cured")
> # note: cex.6 setting shrinks text size by 40% so you can see the labels
> pie(domestic.catch.2006, init.angle=100, cex=.6)

As shown in Figure 13-10, most of the fish (by weight) was sold fresh or frozen.

Figure 13-10. Pie chart

Plotting Categorical Data

The graphics package includes some very useful, and possibly unfamiliar, tools for looking at categorical data.

Suppose that you want to look at the conditional density of a set of categories dependent on a numeric value. You can do this with a conditional density plot, generated by the cdplot function:

cdplot(x, y,
  plot = TRUE, tol.ylab = 0.05, ylevels = NULL,
  bw = "nrd0", n = 512, from = NULL, to = NULL,
  col = NULL, border = 1, main = "", xlab = NULL, ylab = NULL,
  yaxlabels = NULL, xlim = NULL, ylim = c(0, 1), ...)

Here is the form of cdplot when called with a formula:

cdplot(formula, data = list(),
  plot = TRUE, tol.ylab = 0.05, ylevels = NULL,
  bw = "nrd0", n = 512, from = NULL, to = NULL,
  col = NULL, border = 1, main = "", xlab = NULL, ylab = NULL,
  yaxlabels = NULL, xlim = NULL, ylim = c(0, 1), ...,
  subset = NULL)

The cdplot function uses the density function to compute kernel density estimates across the range of numeric values and then plots these estimates. Here is the list of arguments to cdplot.

Argument	Description	Default
x, y, formula, data	Arguments used to specify the data to plot. You may specify either a numeric vector `x` containing data to plot and a factor vector `y` containing grouping information or a `formula` and a data frame (`data`) in which to evaluate the formula.
subset	A vector specifying the subset of values to be used when plotting. (Applies only when using a formula and a data frame.)	`NULL`
plot	Logical value specifying whether the conditional densities should be plotted.	`TRUE`
tol.ylab	A numeric vector that specifies a “tolerance parameter” for y-axis labels. If the difference between two labels is less than this parameter, then they are plotted equidistantly.	`0.05`
ylevels	A character or numeric vector that specifies the order in which levels should be plotted.	`NULL`
bw	The “smoothing bandwidth” to use when plotting. See the help file for `density` for more details.	`"nrd0"`
n	A numeric value specifying the number of points at which the density is estimated.	`512`
from	A numeric value specifying the lowest point at which the density is estimated.	`NULL`
to	A numeric value specifying the highest point at which the density is estimated.	`NULL`
col	A vector of fill colors for the different conditional values.	`NULL`
border	Border color for shaded polygons.	`1`
main	Main title.	`""`
xlab	x-axis label.	`NULL`
ylab	y-axis label.	`NULL`
yaxlabels	Character vector for labeling different conditional variables.	`NULL`
xlim	Range of x variables to plot.	`NULL`
ylim	Range of y variables to plot.	`c(0, 1)`
...	Other arguments passed to `density`.

As an example, let’s look at how the distribution of batting hand varies by batting average among MLB players in 2008:

> batting.w.names.2008 <- transform(batting.2008,
+      AVG=H/AB, bats=as.factor(bats), throws=as.factor(throws))
> cdplot(bats~AVG,data=batting.w.names.2008,
+      subset=(batting.w.names.2008$AB>100))

The results are shown in Figure 13-11. As you can see, the proportion of switch hitters (bats=="B") increases with higher batting average.

Figure 13-11. Conditional density plot

Suppose, instead, that you simply wanted to plot the proportion of observations for two different categorical variables. R also provides tools for visualizing this type of data. One of the most interesting charts available in R for showing the number of observations with certain properties is the mosaic plot. A mosaic plot shows a set of boxes corresponding to different factor values. The x-axis corresponds to one factor and the y-axis to another factor. To create a mosaic plot, use the mosaicplot function. Here is the form of the mosaicplot function for a contingency table:

mosaicplot(x, main = deparse(substitute(x)),
           sub = NULL, xlab = NULL, ylab = NULL,
           sort = NULL, off = NULL, dir = NULL,
           color = NULL, shade = FALSE, margin = NULL,
           cex.axis = 0.66, las = par("las"),
           type = c("pearson", "deviance", "FT"), ...)

There is also a method for mosaicplot that allows you to specify the data as a formula and data frame:

mosaicplot(formula, data = NULL, ...,
           main = deparse(substitute(data)), subset,
           na.action = stats::na.omit)

Here is a description of the arguments to mosaicplot.

Argument	Description	Default
x, formula, data	Specifies the data to be plotted. You may specify either a contingency table `x` or a formula and a data frame `(data)`. (If the variables in `formula` are defined in the current environment, then you may omit `data`.)
subset	A vector that specifies which values in `data` to plot.
main	A character value specifying the main title for the plot.	`deparse(substitute(x))`
sub	A character value specifying the subtitle for the plot.	`NULL`
xlab	A character value specifying the label for the x-axis.	`NULL`
ylab	A character value specifying the label for the y-axis.	`NULL`
sort	An integer vector that describes how to sort the variables in `x`. Specified as a permutation of `1:length(dim(x)`).	`NULL`
off	A numeric vector that specifies the spacing between each level of the mosaic as a percentage.	`NULL`
dir	A character vector that specifies which direction to plot each vector in `x`. Use `"v"` for vertical and `"h"` for horizontal.	`NULL`
color	A logical value or character vector specifying colors to use for color shading. You may use `color=TRUE` for a gamma-corrected color palette, `color=NULL` for grayscale, or `color=FALSE` for unfilled boxes.	`NULL`
shade	A logical value (or numeric vector) specifying whether to produce “extended mosaic plots” to visualize standardized residuals of a log-linear model for the table by color and outline of the mosaic’s tiles. You may specify `shade=FALSE` for standard plots, `shade=TRUE` for extended plots, or a numeric vector with up to five elements specifying cut points of the residuals.	`FALSE`
margin	A list of vectors containing marginal totals to fit in a log-linear model. See the help file for `loglin` for more information.	`NULL`
cex.axis	A numeric value specifying the magnification factor to use for axis annotation text.	`0.66`
las	Specifies the style of the axis labels.	`par("las")`
type	A character string indicating the type of residuals to plot. Use `type="pearson"` for components of Pearson’s chi-squared, `type="deviance"` for components of the likelihood ratio chi-squared, or `type="FT"` for the Freeman-Tukey residuals.	`c("pearson", "deviance", "FT")`
na.action	A function that specifies what mosaicplot should do if the data contains variables to be cross-tabulated that contain `NA` values.	A function that omits NA values (specifically, `stats::na.omit`)
...	Additional graphical parameters passed to other methods.

As an example, let’s create a mosaic plot showing the number of batters in the MLB in 2008. On the x-axis, we’ll show batting hand (left, right, or both), and on the y-axis we’ll show throwing hand (left or right). This function can accept either a matrix of values or a formula and a data frame. In this example, we’ll use a formula and a data frame. The plot is shown in Figure 13-12:

> mosaicplot(formula=bats~throws, data=batting.w.names.2008, color=TRUE)
> dev.off()

Figure 13-12. Mosaic plot

Another chart that is very similar to a mosaic plot is a spine plot. A spine plot shows different boxes corresponding to the number of observations associated with two factors. Figure 13-13 shows an example of a spine plot using the same batting data we used in the mosaic example:

> spineplot(formula=bats~throws, data=batting.w.names.2008)

Another function for looking at tables of data is assocplot. This function plots a set of bar charts, showing the deviation of each combination of factors from independence. (These are also called Cohen-Friendly association plots.) As an example, let’s look at the same data for batting and throwing hands:

> assocplot(table(batting.w.names.2008$bats, batting.w.names.2008$throws),
+   xlab="Throws", ylab="Bats")

Figure 13-13. Spine plot

The resulting plot is shown in Figure 13-14. Other useful plotting functions include stars and fourfoldplot. See the help files for more information.

Three-Dimensional Data

R includes a few functions for visualizing three-dimensional data. All of these functions can be used to plot a matrix of values. (Row indices correspond to x values, column indices to y values, and values in the matrix to z values.)

As an example of multidimensional data, I used elevation data for Yosemite Valley in Yosemite National Park (you can find a map at http://www.nps.gov/yose/planyourvisit/upload/yosevalley2008.pdf). The sample data I used for my examples is included in the nutshell library.

Figure 13-14. Association plot example

Getting Elevation Data

I downloaded the Yosemite data from the U.S. Geological Survey. Specifically, I used the National Map Seamless Server (available at http://seamless.usgs.gov/website/seamless/viewer.htm). This service allows you to search for a specific location and select a region from which to obtain elevation data. After you select the area that you want to export, a window will pop up called the “Request Summary Page.” There will be a link on this page to “Modify Data Request.” Click this link to modify the defaults, choose to export the data in GridFloat format, save the options, and download the file. The name of the file I downloaded was NED_09216343.zip, though the name of your file will be different.

Unzip the downloaded file. There are many different files inside the archive, including a lot of information about the request. The most important files are the .hdr file (which contains information you need to load the data) and the .flt file (which contains the data). Here is what was contained in the ned_09216343.hdr file that I downloaded:

ncols         562
nrows         253
xllcorner     -119.68111111082
yllcorner     37.699166665986
cellsize      0.00027777777779647
NODATA_value  -9999
byteorder     LSBFIRST

The GridFloat format saves the topological data as a stream of 4-byte floating-point values. You can load this into R using the readBin function. As noted in this file, there were 562 × 253 four-byte values encoded in little-endian format. So I loaded the data with the following statement:

> yosemite <- readBin(
+  "~/Documents/book/data/NED_09216343/ned_09216343.flt",
+  what="numeric", n=562*253, size=4, endian="little")

I then assigned dimensions with this statement:

> dim(yosemite) <- c(562,253)

Feel free to grab your own data samples for experimentation.

To view a three-dimensional surface, use the persp function. This function draws a plot of a three-dimensional surface for a specific perspective. (It does, of course, draw in only two dimensions.) If you want to show your nonstatistician friends that you are doing really cool math stuff with R, this is the function that draws the coolest plots:

persp(x = seq(0, 1, length.out = nrow(z)),
      y = seq(0, 1, length.out = ncol(z)),
      z, xlim = range(x), ylim = range(y),
      zlim = range(z, na.rm = TRUE),
      xlab = NULL, ylab = NULL, zlab = NULL,
      main = NULL, sub = NULL,
      theta = 0, phi = 15, r = sqrt(3), d = 1,
      scale = TRUE, expand = 1,
      col = "white", border = NULL, ltheta = -135, lphi = 0,
      shade = NA, box = TRUE, axes = TRUE, nticks = 5,
      ticktype = "simple", ...)

Here is a description of the values to persp.

Argument	Description	Default
x, y	Numeric vectors that explain what each dimension of `z` represents. (Specifically, `x` is a numeric vector representing the `x` values for each row in `z`, and `y` is a numeric vector representing the `y` values for each column in `z`.)	`x = seq(0, 1, length.out = nrow(z)) sy = seq(0, 1, length.out = ncol(z))`
z	A matrix of values to plot.
xlim, ylim, zlim	Numeric vectors with two values, representing the range of values to plot for `x`, `y`, and `z`, respectively.	`xlim = range(x) ylim = range(y), zlim = range(z, na.rm = TRUE)`
xlab, ylab, zlab	Character values specifying titles to plot for the x-, y-, and z-axes.	`NULL`
main	A character value specifying the main title for the plot.	`NULL`
sub	A character value specifying the subtitle for the plot.	`NULL`
theta	A numeric value that specifies the azimuthal direction of the viewing angle.	`0`
phi	A numeric value that specifies the colatitude of the viewing angle.	`15`
r	The distance of the viewing point from the center of the plotting box.	`sqrt(3)`
d	A numeric value that can be used to increase or decrease the perspective effect.	`1`
scale	A logical value specifying whether to maintain aspect ratios when plotting.	`TRUE`
expand	A numeric factor used to expand (when z > 1) or shrink (when z < 1) the z coordinates.	`1`
col	The color of the surface facets.	`"white"`
border	The color of the lines drawn around the surface facets.	`NULL`
ltheta	If specified, the surface is drawn as if illuminated from the direction specified by azimuth `ltheta` and colatitude `lphi`.	`-135`
lphi	See the explanation for `ltheta`.	`0`
shade	An exponent used to calculate the shade of the surface facets. See the help file for more information.	NA
box	A logical value indicating whether a bounding box for the surface should be drawn.	`TRUE`
axes	A logical value indicating whether axes should be drawn.	`TRUE`
nticks	A numeric value specifying the number of ticks to draw on each axis.	`5`
ticktype	A character value specifying the types of ticks drawn here. Use `ticktype="simple"` for arrows pointing in the direction of increase and `ticktype="detailed"` to show simple tick marks.	`"simple"`
...	Additional graphical parameters. See Graphical Parameters.

As an example of three-dimensional data, let’s take a look at Yosemite Valley. Specifically, let’s look toward Half Dome. To plot this elevation data, I needed to make two transformations. First, I needed to flip the data horizontally. In the data file, values move east to west (or left to right) as x indices increase and from north to south (or top to bottom) as y indices increase. Unfortunately, persp plots y coordinates slightly differently. Persp plots increasing y coordinates from bottom to top. So I selected y indices in reverse order. Here is an R expression to do this:

> # load the data:
> library(nutshell)
> data(yosemite)
> # check dimensions of data
> dim(yosemite)
[1] 562 253
> # select all 253 columns in reverse order
> yosemite.flipped <- yosemite[,seq(from=253, to=1)]

Next, I wanted to select only a square subset of the elevation points. To do this, I selected only the rightmost 253 columns of the yosemite matrix using an expression like this:

> yosemite.rightmost <- yosemite[nrow(yosemite) - ncol(yosemite) + 1,]

Note the “+ 1” in this statement; that’s to make sure that we take exactly 253 columns. (This is to avoid a fencepost error.)

To plot the figure, I rotated the image by 225° (through theta=225) and changed the viewing angle to 20° (phi=20). I adjusted the light source to be from a 45° angle (ltheta=45) and set the shading factor to 0.75 (shade=.75) to exaggerate topological features. Putting it all together, here is the code I used to plot Yosemite Valley looking toward Half Dome:

> # create halfdome subset in one expression:
> halfdome <- yosemite[(nrow(yosemite) - ncol(yosemite) + 1):562,
+   seq(from=253,to=1)]
> persp(halfdome,col=grey(.25), border=NA, expand=.15,
+   theta=225, phi=20, ltheta=45, lphi=20, shade=.75)

The resulting image is shown in Figure 13-15.

Figure 13-15. Perspective view of Yosemite Valley

Another useful function for plotting three-dimensional data is image. This function plots a matrix of data points as a grid of boxes, color coding the boxes based on the intensity at each location:

image(x, y, z, zlim, xlim, ylim, col = heat.colors(12),
      add = FALSE, xaxs = "i", yaxs = "i", xlab, ylab,
      breaks, oldstyle = FALSE, ...)

Here is a description of the arguments to image.

Argument	Description	Default
x, y	(Alternatively, you may pass `x` an argument that is a list containing elements named `x`, `y`, and `z`.)
z	A matrix of values to plot.
xlim, ylim	Two-element numeric vectors that specify the range of values in `x` and `y` that should be plotted.
zlim	The range of values for `z` for which colors should be plotted.
col	A vector of colors to plot. Typically generated by functions like `rainbow`, `heat.colors`, `topo.colors`, or `terrain.colors`.	`heat.colors(12)`
add	A logical value that specifies whether the plot should be added to the existing plot.	`FALSE`
xaxs, yaxs	Style for the x- and y-axes; see Graphical parameters by name.	`xlab="i", ylab="i"`
xlab, ylab	Labels for the `x` and `y` values.
breaks	An integer value specifying the number of break points for colors. (There must be at least one more color than break point.)
oldstyle	If `oldstyle=TRUE`, then the midpoints of the color intervals are equally spaced between the limits. If `oldstyle=FALSE`, then the range is split into color intervals of equal size.	`FALSE`
...	Additional arguments to par.

To plot the Yosemite Valley data using image, I needed to make several tweaks. First, I needed to specify an aspect ratio that matched the dimensions of the data by setting asp=253/562 (note that this is a standard graphics parameter passed to par). Then I specified a range of points on the y dimension to make sure that data was plotted from top to bottom (y=c(1,0)). Finally, I specified a set of 32 grayscale colors for this plot (col=sapply((0:32)/32,gray)). Here is an expression that generates an image plot from the Yosemite Valley data:

> image(yosemite, asp=253/562, ylim=c(1,0), col=sapply((0:32)/32, gray))

The results are shown in Figure 13-16.

A closely related tool for looking at multidimensional data, particularly in biology, is the heat map. A heat map plots a single variable on two axes, each representing a different factor. The heatmap function plots a grid, where each box is encoded with a different color depending on the size of the dependent variable. It may also plot a tree structure (called a dendrogram) to the side of each plot showing the hierarchy of values. As you might have guessed, the function for plotting heat maps in R is heatmap:

heatmap(x, Rowv=NULL, Colv=if(symm)"Rowv" else NULL,
        distfun = dist, hclustfun = hclust,
        reorderfun = function(d,w) reorder(d,w),
        add.expr, symm = FALSE, revC = identical(Colv, "Rowv"),
        scale=c("row", "column", "none"), na.rm = TRUE,
        margins = c(5, 5), ColSideColors, RowSideColors,
        cexRow = 0.2 + 1/log10(nr), cexCol = 0.2 + 1/log10(nc),
        labRow = NULL, labCol = NULL, main = NULL,
        xlab = NULL, ylab = NULL,
        keep.dendro = FALSE, verbose = getOption("verbose"), ...)

Figure 13-16. Image example: Yosemite Valley

Another useful function for plotting three-dimensional data is contour. The contour function plots contour lines, connecting equal values in the data:

contour(x = seq(0, 1, length.out = nrow(z)),
        y = seq(0, 1, length.out = ncol(z)),
        z,
        nlevels = 10, levels = pretty(zlim, nlevels),
        labels = NULL,
        xlim = range(x, finite = TRUE),
        ylim = range(y, finite = TRUE),
        zlim = range(z, finite = TRUE),
        labcex = 0.6, drawlabels = TRUE, method = "flattest",
        vfont, axes = TRUE, frame.plot = axes,
        col = par("fg"), lty = par("lty"), lwd = par("lwd"),
        add = FALSE, ...)

Here is a table showing the arguments to contour.

Argument	Description	Default
x, y	Numeric vectors specifying the location of grid lines at which values in the matrix `z` are measured. (Alternatively, you may specify a single matrix for `x` and omit `y` and `z`.)	`x=seq(0, 1, length.out=nrow(z)) y=seq(0, 1, length.out=ncol(z))`
z	A numeric vector containing values to be plotted.
nlevels	Number of contour levels. (Used only if `levels` is not specified.)	`10`
levels	A numeric vector of levels at which to draw lines.	`pretty(zlim, nlevels)`
labels	A vector of labels for the contour lines.	NULL
xlim, ylim, zlim	Numeric vectors of two elements specifying the range of `x`, `y`, and `z` values to include in the plot.	`xlim = range(x, finite = TRUE) ylim = range(y, finite = TRUE) zlim = range(z, finite = TRUE)`
labcex	Text scaling factor for contour labels.	`0.6`
drawlabels	A logical value specifying whether to draw contour labels.	`TRUE`
method	Character value specifying where to draw contour labels. Options include `method="simple"`, `method="edge"`, and `method="flattest"`.	`"flattest"`
vfont	A character vector with two elements specifying the font to use for contour labels. `vfont[1]` specifies a Hershey font family; `vfont[2]` specifies a typeface within the family.
axes	A logical value indicating whether to print axes.	`TRUE`
frame.plot	A logical value indicating whether to draw a box around the plot.	`axes`
col	A color for the contour lines.	`par("fg")`
lty	A type of lines to draw.	`par("lty")`
lwd	A width for the lines.	`par("lwd")`
add	A logical value specifying whether to add the contour lines to an existing plot (`add=TRUE`) or to create a new plot (`add=FALSE`).	`FALSE`
...	Additional arguments passed to `plot.window`, `title`, `Axis`, and `box`.

The following expression generates a contour plot using the Yosemite Valley data:

> contour(yosemite, asp=253/562, ylim=c(1, 0))

As with image, we needed to flip the y-axis and to specify an aspect ratio. The results are shown in Figure 13-17.

Contours are commonly added to existing image plots.

Figure 13-17. Contour example: Yosemite Valley

Plotting Distributions

When performing data analysis, it’s often very important to understand the shape of a data distribution. Looking at a distribution can tell you whether there are outliers in the data, or whether a certain modeling technique will work on your data, or simply how many observations are within a certain range of values.

The best-known technique for visualizing a distribution is the histogram. In R, you can plot a histogram with the hist function. As an example, let’s look at the number of plate appearances (PAs) for batters during the 2008 MLB season. Plate appearances count the number of times a player had the opportunity to bat; plate appearances include all times a player had a hit, made an out, reached on error, walked, was hit by pitch, hit a sacrifice fly, or hit a sacrifice bunt.

You can load this data set from the nutshell package:

> library(nutshell)
> data(batting.2008)

Let’s calculate the plate appearances for each player and then plot a histogram. The resulting histogram is shown in Figure 13-18:

> # PA (plate appearances) =
> #   AB (at bats) + BB (base on balls) + HBP (hit by pitch) +
> #   SF (sacrifice flies) + SH (sacrifice bunts)
> batting.2008 <- transform(batting.2008,
+   PA=AB+BB+HBP+SF+SH)
> hist(batting.2008$PA)

The histogram shows that there were a large number of players with fewer than 50 plate appearances. If you were to perform further analysis on this data (for example, looking at the average on-base percentage [OBP]), you might want to exclude these players from your analysis. As we will show in Proportion Test Design, you will need much larger sample sizes than 50 plate appearances to draw conclusions with the data.

Figure 13-18. Histogram showing the distribution of plate appearances in 2008

Let’s try generating a second histogram, this time excluding players with fewer than 25 at bats. We’ll also increase the number of bars, using the breaks argument to specify that we want 50 bins:

> hist(batting.2008[batting.2008$PA>25, "PA"], breaks=50, cex.main=.8)

The second histogram is shown in Figure 13-19.

A closely related type of chart is the density plot. Many statisticians recommend using density plots instead of histograms because they are more robust and easier to read. To plot a density plot from the plate appearance data (for batters with more than 25 plate appearances), we use two functions.

Figure 13-19. Histogram showing the number of plate appearances for players with over 25 plate appearances in 2008

First, we use density to calculate the kernel density estimates. Next, we use plot to plot the estimates. We could plot the diagram with an expression like this:

> plot(density(batting.2008[batting.2008$PA>25, "PA"]))

A common addition to a kernel density plot is a rug. A rug is essentially a strip plot shown along the axis, with each point represented by a short line segment. You can add a rug to the kernel density plot with an expression like:

> rug(batting.2008[batting.2008$PA>25, "PA"])

The final version of the density plot is shown in Figure 13-20.

Figure 13-20. Density plot of plate appearances with rug

Another way to view a distribution is the quantile-quantile (Q-Q) plot. Quantile-quantile plots compare the distribution of the sample data to the distribution of a theoretical distribution (often a normal distribution). As the name implies, they plot the quantiles from the sample data set against the quantiles from a theoretical distribution. If the sample data is distributed the same way as the theoretical distribution, all points will be plotted on a 45° line from the lower-left corner to the upper-right corner. Quantile-quantile plots provide a very efficient way to tell how a distribution deviates from an expected distribution.

You can generate these plots in R with the qqnorm function. Without arguments, this function will plot the distribution of points in each quantile, assuming a theoretical normal distribution. The plot is shown in Figure 13-21:

> qqnorm(batting.2008$AB)

If you would like to compare two actual distributions, or compare the data distribution to a different theoretical distribution, then try the function qqplot.

Figure 13-21. Quantile-quantile plot

Box Plots

Another very useful way to visualize a distribution is a box plot. A box plot is a compact way to show the distribution of a variable. The box shows the interquartile range. The interquartile range contains values between the 25th and 75th percentile; the line inside the box shows the median. The two “whiskers” on either side of the box show the adjacent values. A box plot is shown in Figure 13-22.

The adjacent values are intended to show extreme values, but they don’t always extend to the absolute maximum or minimum value. When there are values far outside the range we would expect for normally distributed data, those outlying values are plotted separately. Specifically, here is how the adjacent values are calculated: the upper adjacent value is the value of the largest observation that is less than or equal to the upper quartile plus 1.5 times the length of the interquartile range; the lower adjacent value is the value of the smallest observation that is greater than or equal to the lower quartile less 1.5 times the length of the interquartile range. Values outside the range of the whiskers are called outside values and are plotted individually.

Figure 13-22. Box plot components

To plot a box plot, use the boxplot function. Here is the default method of boxplot for vectors:

boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
        notch = FALSE, outline = TRUE, names, plot = TRUE,
        border = par("fg"), col = NULL, log = "",
        pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
        horizontal = FALSE, add = FALSE, at = NULL)

And here is the form of boxplot when a formula is specified:

boxplot(formula, data = NULL, ..., subset, na.action = NULL)

Here is a description of the arguments to boxplot.

Argument	Description	Default
formula	A formula of the form `y ~ grp`, where `y` is a variable to be plotted and `grp` is a variable describing a set of different plotting groups.
data	A data frame (or list) in which the variables used in `formula` are defined.
subset	A vector specifying a subset of observations to use in plotting.
x	A vector specifying values to plot.
...	Additional vectors to plot (or graphical parameters to pass to `bxp`). Each additional vector is plotted as an additional box.
range	A numeric value that determines the maximum amount that the whiskers extend from the boxes.	`1.5`
width	A numeric vector specifying the widths of the boxes being plotted.	`NULL`
varwidth	If `varwidth=TRUE`, then each box is drawn with a width proportional to the square root of the number of observations represented by the box. If `varwidth=FALSE`, boxes are plotted with the same width.	`FALSE`
notch	If `notch=TRUE`, then a “notch” is drawn in the boxes. Notches are drawn at `+/-1.58 IQR/sqrt(n)`; see the help file for `boxplot.stats` for an explanation of what they mean.	`FALSE`
outline	A logical value specifying whether outliers should be drawn.	`TRUE`
names	A character vector specifying the group names used to label each box plot.
plot	If `plot=TRUE`, then the box plots are plotted. If `plot=FALSE`, then boxplot returns a list of statistics that could be used to draw a box plot; see the help file for `boxplot` for more details.	`TRUE`
border	A character vector specifying the color to use for the outline of each box plot.	`par("fg")`
col	A character vector specifying the color to use for the background of each box plot.	`NULL`
log	A character value indicating whether the x-axis `(log="x")`, y-axis `(log="y")`, both axes `(log="xy")`, or neither axis `(log="")` should be plotted with a logarithmic scale.	`""`
pars	A list of additional graphical parameters passed to `bxp`.	`list(boxweb = 0.8, stapleweb = 0.5, outwex = 0.5)`
horizontal	A logical value indicating whether the boxes should be drawn horizontally `(horizontal=TRUE)` or vertically `(horizontal=FALSE)`.	`FALSE`
add	A logical value specifying whether the box plot should be added to an existing chart `(add=TRUE)` or if a new chart should be drawn `(add=FALSE)`.	`FALSE`
at	A numeric vector specifying the locations at which each box plot should be drawn.	1:n, where n is the number of boxes

As an example, let’s look at the team batting data from 2008. We’ll restrict the data to include only American League teams (it’s too hard to read a plot with 30 boxes, so this cuts it to 16) and include only players with over 100 plate appearances (to cut out marginal players with a small number of plate appearances). Finally, let’s adjust the text size on the axis so that all the labels fit. Here is the expression:

> batting.2008 <- transform(batting.2008,
+   OBP=(H+BB+HBP)/(AB+BB+HBP+SF))
> boxplot(OBP~teamID,
+   data=batting.2008[batting.2008$PA>100 & batting.2008$lgID=="AL",],
+   cex.axis=.7)

The results are shown in Figure 13-23.

Figure 13-23. Box plot showing on-base percentage for players in the AL in 2008

^[35] Data from both can be found in the Statistical Abstract of the United States, available online at http://www.census.gov/compendia/statab/.

^[36]As with many other examples in this book, this data was taken from the Statistical Abstract of the United States, 2009. This data comes from http://www.census.gov/compendia/statab/tables/09s0785.xls.

^[37]A lot of people dislike pie charts. I think they are good for saying, “Look how much bigger this number is than this number,” and they are very good at taking up lots of space on a page. Pie charts are not good at showing subtle differences between the size of different slices; search for “why pie charts are bad” on Google, and you’ll come up with dozens of sites explaining what’s wrong with them. Or just check the help file for pie, which says, “Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”