An Overview of R Graphics

R includes tools for drawing most common types of charts, including bar charts, pie charts, line charts, and scatter plots. Additionally, R can also draw some less-familiar charts like quantile-quantile (Q-Q) plots, mosaic plots, and contour plots. The following table shows many of the charts included in the graphics package.

Graphics package functionDescription
barplotBar and column charts
dotchartCleveland dot plots
histHistograms
densityKernel density plots
stripchartStrip charts
qqnorm (in stats package)Quantile-quantile plots
xplotScatter plots
smoothScatterSmooth scatter plots
qqplot (in stats package)Quantile-quantile plots
pairsScatter plot matrices
imageImage plots
contourContour plots
perspPerspective charts of three-dimensional data
interaction.plotSummary of the response for two-way combinations of factors
sunflowerplotSunflower plots

You can show R graphics on the screen or save them in many different formats. Graphics Devices explains how to choose output methods. R gives you an enormous amount of control over graphics. You can control almost every aspect of a chart. Customizing Charts explains how to tweak the output of R to look the way you want. This section shows how to use many common types of R charts.

To show how to use scatter plots, we will look at cases of cancer in 2008 and toxic waste releases by state in 2006. Data on new cancer cases (and deaths from cancer) are tabulated by the American Cancer Society; information on toxic chemicals released into the environment is tabulated by the U.S. Environmental Protection Agency (EPA).[35]

The sample data is included in the nutshell package:

> library(nutshell)
> data(toxins.and.cancer)

To show a scatter plot, use the plot function. plot is a generic function (you can “plot” many different types of objects); plot also can draw many types of objects, including vectors, tables, and time series. For simple scatter plots with two vectors, the function that is called is plot.default:

plot(x, y = NULL, type = "p",  xlim = NULL, ylim = NULL,
     log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
     ann = par("ann"), axes = TRUE, frame.plot = axes,
     panel.first = NULL, panel.last = NULL, asp = NA, ...)

Here is a description of the arguments to plot.

ArgumentDescriptionDefault
x, yThe data to be plotted. You may specify two separate vectors, x and y. Otherwise, you may specify a time series, formula, list, or matrix with two or more columns; see the help file for xy.coords for more details. 
typeA character value that specifies the type of plot: type="p" for points, type="l" for lines, type="o" for overplotted points and lines, type="b" for points joined by lines, type="s" for stair steps, type="h" for histogram-style vertical lines, or type="n" for no points or lines."p"
xlimA numeric vector with two values specifying the x limits of the plot.NULL
logA character value that specifies which axes should be plotted with a logarithmic scale. Use log="" for neither, log="x" for the x-axis, log="y" for the y-axis, and log="xy" for both.""
mainThe main title for the plot.NULL
subThe subtitle for the plot.NULL
xlabThe label for the x-axis.NULL
ylabThe label for the y-axis.NULL
annIf ann=TRUE, then axis titles and overall titles are included with plots. If ann=FALSE, then these annotations are not included.par("ann")
axesA logical value that specifies whether axes should be drawn.TRUE
frame.plotA logical value that specifies whether a box should be drawn around the plot.axes
panel.firstAn expression that is evaluated after the axes are drawn but before points are plotted.NULL
panel.lastAn expression that is evaluated after the points are plotted.NULL
aspA numeric value that specifies the aspect ratio of the plot (as y/x).NA
...Additional graphical parameters. (See Graphical Parameters for more information.) 

Now let’s try our first plot. Let’s compare the overall cancer rate (number of cancer deaths divided by state population) to the presence of toxins (total toxic chemicals release divided by state area):

> # use with so that we don't have to keep typing the
> # data frame name
> attach(toxins.and.cancer)
> plot(total_toxic_chemicals/Surface_Area, deaths_total/Population)

The chart is shown in Figure 13-1. Perhaps there is a stronger correlation between airborne toxins and lung cancer:

> plot(air_on_site/Surface_Area, deaths_lung/Population)

This chart is shown in Figure 13-2. Suppose that you wanted to know which states were associated with which points. R provides some interactive tools for identifying points on plots. You can use the locator function to tell you the coordinates of a specific point (or set of points). To do this, first plot the data. Next, type locator(1). Then click on a point in the open graphics window. As an example, suppose that you plotted the data above, typed locator(1), and then clicked on the point in the upper-right corner. You would see output like this in the R console:

> locator(1)
$x
[1] 0.002499427

$y
[1] 0.0008182696

Another useful function for identifying points is identify. This function can be used to interactively label points on a plot. To use identify with the data above, you would enter:

> identify(air_on_site/Surface_Area, deaths_lung/Population,
+   State_Abbrev)

While this command is running, you can click on individual points on the chart, and R will label those points with state names.

If you wanted to label all the points at once, you could use the text function to add labels to the plot. Here is how I drew the plot shown in Figure 13-3:

> plot(air_on_site/Surface_Area, deaths_lung/Population,
+   xlab="Air Release Rate of Toxic Chemicals",
+   ylab="Lung Cancer Death Rate")
> text(air_on_site/Surface_Area, deaths_lung/Population,
+   labels=State_Abbrev,
+   cex=0.5,
+   adj=c(0,-1))

Notice that I have added some extra arguments to refine the appearance of the plot. The xlab and ylab arguments are used to add labels to the x- and y-axes. The text function draws a label next to each point. I tweaked the size placement of the labels using the cex and adj arguments; see Graphical Parameters for more information.

Is this relationship significant? It is actually statistically significant (see Correlation tests), but we don’t have enough information to make a good argument that there is a causal relationship.

The plot function is a good choice if you want to plot two columns of data on one chart. However, suppose that you have more columns of data to plot, perhaps split into different categories. Or, suppose that you want to plot all the columns of one matrix against all the columns of another matrix. To plot multiple sets of columns against one another, you can use the matplot function:

matplot(x, y, type = "p", lty = 1:5, lwd = 1, pch = NULL,
        col = 1:6, cex = NULL, bg = NA,
        xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL,
        ..., add = FALSE, verbose = getOption("verbose"))

Matplot accepts the following arguments.

ArgumentDescriptionDefault
x, yVectors or matrices containing the data to be plotted. The number of rows and columns should match.If x is not specified, then x=1:ncol(y). If y is not specified, then y=x; x=1:ncol(y).
typeA character vector specifying the types of plots to generate. Use type="p" for points, type="l" for lines, type="b" for both, type="c" for the lines part alone of "b", type="o" for both overplotted points and lines, type="h" for histogram-like (or high-density) vertical lines, type="s" for stair steps, type="S" for other steps, or type="n" for no plotting."p"
ltyA vector of line types. See Graphical parameters by name for more details.1:5
lwdA vector of line widths. See Graphical parameters by name for more details.1
pchA vector of plotting characters. See Graphical parameters by name for more details.NULL
colA vector of colors. See Graphical parameters by name for more details.1:6
cexA vector of character expansion sizes. See Graphical parameters by name for more details.NULL
bgA vector of background colors for plot symbols. See Graphical parameters by name for more details.NA
xlab, ylabCharacter values specifying x- and y-axis labels.NULL
xlim, ylimNumeric values specifying ranges for the x- and y-axes.NULL
...Additional graphical parameters that are passed to par.NULL
addA logical value indicating whether to add to the current plot or to generate a new plot.FALSE
verboseA logical value indicating whether to write information to the console describing what matplot did. getOption("verbose")

Many arguments to matplot have the same names as standard arguments to par. However, because matplot generates multiple plots at the same time, these arguments can be specified as vectors of multiple values when called by matplot. For more details on standard arguments, see Graphical Parameters.

If you are plotting a very large number of points, then you may prefer the function smoothScatter, which plots the density of points by shading different regions of the plot different shades, depending on the density of points in each region:

smoothScatter(x, y = NULL, nbin = 128, bandwidth,
              colramp = colorRampPalette(c("white", blues9)),
              nrpoints = 100, pch = ".", cex = 1, col = "black",
              transformation = function(x) x^.25,
              postPlotHook = box,
              xlab = NULL, ylab = NULL, xlim, ylim,
              xaxs = par("xaxs"), yaxs = par("yaxs"), ...)

For an example of smoothScatter, see Correlation and Covariance.

If you have a data frame with n different variables and you would like to generate a scatter plot for each pair of values in the data frame, try the pairs function. As an example, let’s plot the hits, runs, strikeouts, walks, and home runs for each Major League Baseball (MLB) player who had more than 100 at bats in 2008. To do this, we would use the following R statement:

> library(nutshell)
> data(batting.2008)
> pairs(batting.2008[batting.2008$AB>100, c("H", "R", "SO", "BB", "HR")])

The result is shown in Figure 13-4.

R includes tools for plotting time series data. The plot function has a method for time series:

plot(x, y = NULL, plot.type = c("multiple", "single"),
        xy.labels, xy.lines, panel = lines, nc, yax.flip = FALSE,
        mar.multi = c(0, 5.1, 0, if(yax.flip) 5.1 else 2.1),
        oma.multi = c(6, 0, 5, 0), axes = TRUE, ...)

The arguments x and y specify ts objects, panel specifies how to plot the time series (by default, lines), and other arguments specify how to break time series into different plots (as in lattice). As an example, we’ll plot the turkey price data:

> library(nutshell)
> data(turkey.price.ts)
> plot(turkey.price.ts)

The results are shown in Figure 13-5. As you can see, turkey prices are very seasonal. There are huge sales in November and December (for Thanksgiving and Christmas) and minor sales in spring (probably for Easter).

Another way to look at seasonal effects is with an autocorrelation plot (which is also called a correlogram; see Figure 13-6). This plot shows how correlated points are with each other, by difference in time. You can also plot the autocorrelation function for a time series (which can be helpful for looking at cyclical effects). The plot is generated by default when you call acf, which computes the autocorrelation function. (Alternatively, you can generate the autocorrelation function with acf and then plot it separately.) Here is how to generate the plot for the turkey price data:

> acf(turkey.price.ts)

As you can see, points are correlated over 12-month cycles (and inversely correlated over 6-month cycles). Time series analysis is discussed further in Chapter 23.

To draw bar (or column) charts in R, use the barplot function.

As an example, let’s look at doctoral degrees awarded in the United States between 2001 and 2006:[36]

> doctorates <- data.frame (
+   year=c(2001, 2002, 2003, 2004, 2005, 2006),
+   engineering=c(5323, 5511, 5079, 5280, 5777, 6425),
+   science=c(20643, 20017, 19529, 20001, 20498, 21564),
+   education=c(6436, 6349, 6503, 6643, 6635, 6226),
+   health=c(1591, 1541, 1654, 1633, 1720, 1785),
+   humanities=c(5213, 5178, 5051, 5020, 5013, 4949),
+   other=c(2159, 2141, 2209, 2180, 2480, 2436)
+ )

Or, if you prefer, you can just load the data from the nutshell package:

> library(nutshell)
> data(doctorates)

Now let’s transform this into a matrix for plotting:

> # make this into a matrix:
> doctorates.m <- as.matrix(doctorates[2:7])
> rownames(doctorates.m) <- doctorates[, 1]
> doctorates.m
     engineering science education health humanities other
2001        5323   20643      6436   1591       5213  2159
2002        5511   20017      6349   1541       5178  2141
2003        5079   19529      6503   1654       5051  2209
2004        5280   20001      6643   1633       5020  2180
2005        5777   20498      6635   1720       5013  2480
2006        6425   21564      6226   1785       4949  2436

The barplot function can’t work with a data frame, so we’ve created a matrix object for this problem with the data.

Let’s start by just showing a bar plot of doctorates in 2001 by type:

> barplot(doctorates.m[1, ])

As you can see from Figure 13-7, by default R shows the y-axis along with the size of each bar, but it does not show the x-axis. R also automatically uses column names to name the bars. Suppose that we wanted to show all the different years as bars stacked next to one another. Suppose that we also wanted the bars plotted horizontally and wanted to show a legend for the different years. To do this, we could use the following expression to generate the chart shown in Figure 13-8:

> barplot(doctorates.m, beside=TRUE, horiz=TRUE, legend=TRUE, cex.names=.75)

Finally, suppose that we wanted to show doctorates by year as stacked bars. To do this, we need to transform the matrix so that each column is a year and each row is a discipline. We also need to make sure that there is enough room to see the legend, so we’ll extend the limits on the y-axis:

> barplot(t(doctorates.m), legend=TRUE, ylim=c(0, 66000))

The chart generated by this expression is shown in Figure 13-9.

Here is a detailed description of barplot:

barplot(height, width = 1, space = NULL,
        names.arg = NULL, legend.text = NULL, beside = FALSE,
        horiz = FALSE, density = NULL, angle = 45,
        col = NULL, border = par("fg"),
        main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
        xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
        axes = TRUE, axisnames = TRUE,
        cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
        inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
        add = FALSE, args.legend = NULL, ...)

The barplot function is very flexible; here is a description of the arguments to barplot.

ArgumentDescriptionDefault
heightEither a numeric vector or a numeric matrix representing the values to be plotted. If values are given as a matrix and beside=FALSE, then the bars are stacked. If beside=TRUE, then the bars are plotted next to one another. 
widthA numeric vector representing the widths of the bars.1
spaceIf beside=FALSE, a numeric value indicating the amount of space between bars. You specify the space as a fraction of the average column width. If beside=TRUE, then you can specify a two-element vector, where the first element specifies the space within a group and the second represents the space between groups.if (is.matrix(height) & beside=TRUE) c(0, 1) else 0.2
names.argA character vector specifying the names to be plotted for each bar (or group of bars).if (is.matrix(height)) col.names(height) else names(height)
legend.textA character vector or a logical value. If a logical value is given, then a legend is generated using the row names of height. If a character vector is given, then those character values are used instead. This function is mostly useful when height is a matrix (and there are two different dimensions that need labels).NULL
besideA logical value indicating whether columns should be stacked or drawn beside one another. Meaningful only when height is a matrix.FALSE
horizA logical value specifying the direction to draw the bars. If horiz=FALSE, then bars are drawn vertically from left to right. If horiz=TRUE, then bars are drawn horizontally from bottom to top.FALSE
densityA numeric value that specifies the density of shading lines in lines per inch. density=NULL means that no lines are drawn.NULL
angleA numeric value that specifies the slope of the shading lines (in degrees).45
colA vector of colors to use for the bars (or bar components).Gray is used if height is a vector; a gamma-corrected gray palette if height is a matrix.
borderThe color to be used for the border of the bars.par("fg")
mainA character value to be used as the overall title.NULL
subA character value to be used as a subtitle.NULL
xlabA character value to use as the label for the x-axis.NULL
ylabA character value to use as the label for the y-axis.NULL
xlimLimits for the x-axis.NULL
ylimLimits for the y-axis.NULL
xpdA logical value indicating if bars should be allowed to go outside the region.TRUE
logA character value specifying whether axis scales should be logarithmic.""
axesA logical value indicating whether axes should be drawn.TRUE
axisnamesIf axisnames=TRUE and names.arg is not null, then the second axis is drawn and labeled.TRUE
cex.axisA numeric value specifying the size of numeric axis labels relative to other text (cex.axis=1 means full size).par("cex.axis")
cex.namesA numeric value specifying the size of axis names relative to other text (cex.axis=1 means full size).par("cex.axis")
insideA logical value that indicates whether to draw lines separating bars when beside=TRUE.TRUE
plotA logical value indicating whether to plot the chart.TRUE
axis.ltyThe line type for the axis.0
offsetA numeric vector indicating how much bars should be shifted relative to the x-axis.0
addA logical value indicating if bars should be added to an existing plot.FALSE
args.legendA list of arguments to pass to legend (if legend.text is used).NULL
...Additional arguments passed to other graphical routines used inside barplot (typically, arguments to par). 

One of the most popular ways to plot data is the pie chart. Pie charts can be an effective way to compare different parts of a quantity, though there are lots of good reasons not to use pie charts.[37] You can draw pie charts in R using the pie function:

pie(x, labels = names(x), edges = 200, radius = 0.8,
    clockwise = FALSE, init.angle = if(clockwise) 90 else 0,
    density = NULL, angle = 45, col = NULL, border = NULL,
    lty = NULL, main = NULL, ...)

Here is a description of the arguments to pie.

ArgumentDescriptionDefault
xA vector of nonnegative numeric values that will be plotted. 
labelsAn expression to generate labels, a vector of character strings, or another object that can be coerced to a graphicsAnnot object and used as labels.names(x)
edgesA numeric value indicating the number of segments used to draw the outside of the pie.200
radiusA numeric value that specifies how big the pie should be. (Parts of the pie are cut off for values over 1.)0.8
clockwiseA logical value indicating whether slices are drawn clockwise or counterclockwise.FALSE
init.angleA numeric value specifying the starting angle for the slices (in degrees).if (clockwise) 90 else 0
densityA numeric value that specifies the density of shading lines in lines per inch. density=NULL means that no lines are drawn.NULL
angleA numeric value that specifies the slope of the shading lines (in degrees).45
colA numeric vector that specifies the colors to be used for slices. If col=NULL, then a set of six pastel colors is used.NULL
borderArguments passed to the polygon function to draw each slice.NULL
ltyThe line type used to draw each slice.NULL
mainA character string that represents the title.NULL

As a simple example, let’s use pie charts to show what happened to fish caught in the United States in 2006:

> # 2006 fishery data from 
> #   http://www.census.gov/compendia/statab/tables/09s0852.xls
> # units are millions of pounds of live fish
> domestic.catch.2006 <- c(7752, 1166, 463, 108)
> names(domestic.catch.2006) <- c("Fresh and frozen",
+    "Reduced to meal, oil, etc.", "Canned", "Cured")
> # note: cex.6 setting shrinks text size by 40% so you can see the labels
> pie(domestic.catch.2006, init.angle=100, cex=.6)

As shown in Figure 13-10, most of the fish (by weight) was sold fresh or frozen.

The graphics package includes some very useful, and possibly unfamiliar, tools for looking at categorical data.

Suppose that you want to look at the conditional density of a set of categories dependent on a numeric value. You can do this with a conditional density plot, generated by the cdplot function:

cdplot(x, y,
  plot = TRUE, tol.ylab = 0.05, ylevels = NULL,
  bw = "nrd0", n = 512, from = NULL, to = NULL,
  col = NULL, border = 1, main = "", xlab = NULL, ylab = NULL,
  yaxlabels = NULL, xlim = NULL, ylim = c(0, 1), ...)

Here is the form of cdplot when called with a formula:

cdplot(formula, data = list(),
  plot = TRUE, tol.ylab = 0.05, ylevels = NULL,
  bw = "nrd0", n = 512, from = NULL, to = NULL,
  col = NULL, border = 1, main = "", xlab = NULL, ylab = NULL,
  yaxlabels = NULL, xlim = NULL, ylim = c(0, 1), ...,
  subset = NULL)

The cdplot function uses the density function to compute kernel density estimates across the range of numeric values and then plots these estimates. Here is the list of arguments to cdplot.

ArgumentDescriptionDefault
x, y, formula, dataArguments used to specify the data to plot. You may specify either a numeric vector x containing data to plot and a factor vector y containing grouping information or a formula and a data frame (data) in which to evaluate the formula. 
subsetA vector specifying the subset of values to be used when plotting. (Applies only when using a formula and a data frame.)NULL
plotLogical value specifying whether the conditional densities should be plotted.TRUE
tol.ylabA numeric vector that specifies a “tolerance parameter” for y-axis labels. If the difference between two labels is less than this parameter, then they are plotted equidistantly.0.05
ylevelsA character or numeric vector that specifies the order in which levels should be plotted.NULL
bwThe “smoothing bandwidth” to use when plotting. See the help file for density for more details."nrd0"
nA numeric value specifying the number of points at which the density is estimated.512
fromA numeric value specifying the lowest point at which the density is estimated.NULL
toA numeric value specifying the highest point at which the density is estimated.NULL
colA vector of fill colors for the different conditional values.NULL
borderBorder color for shaded polygons.1
mainMain title.""
xlabx-axis label.NULL
ylaby-axis label.NULL
yaxlabelsCharacter vector for labeling different conditional variables.NULL
xlimRange of x variables to plot.NULL
ylimRange of y variables to plot.c(0, 1)
...Other arguments passed to density. 

As an example, let’s look at how the distribution of batting hand varies by batting average among MLB players in 2008:

> batting.w.names.2008 <- transform(batting.2008,
+      AVG=H/AB, bats=as.factor(bats), throws=as.factor(throws))
> cdplot(bats~AVG,data=batting.w.names.2008,
+      subset=(batting.w.names.2008$AB>100))

The results are shown in Figure 13-11. As you can see, the proportion of switch hitters (bats=="B") increases with higher batting average.

Suppose, instead, that you simply wanted to plot the proportion of observations for two different categorical variables. R also provides tools for visualizing this type of data. One of the most interesting charts available in R for showing the number of observations with certain properties is the mosaic plot. A mosaic plot shows a set of boxes corresponding to different factor values. The x-axis corresponds to one factor and the y-axis to another factor. To create a mosaic plot, use the mosaicplot function. Here is the form of the mosaicplot function for a contingency table:

mosaicplot(x, main = deparse(substitute(x)),
           sub = NULL, xlab = NULL, ylab = NULL,
           sort = NULL, off = NULL, dir = NULL,
           color = NULL, shade = FALSE, margin = NULL,
           cex.axis = 0.66, las = par("las"),
           type = c("pearson", "deviance", "FT"), ...)

There is also a method for mosaicplot that allows you to specify the data as a formula and data frame:

mosaicplot(formula, data = NULL, ...,
           main = deparse(substitute(data)), subset,
           na.action = stats::na.omit)

Here is a description of the arguments to mosaicplot.

ArgumentDescriptionDefault
x, formula, dataSpecifies the data to be plotted. You may specify either a contingency table x or a formula and a data frame (data). (If the variables in formula are defined in the current environment, then you may omit data.) 
subsetA vector that specifies which values in data to plot. 
mainA character value specifying the main title for the plot. deparse(substitute(x))
subA character value specifying the subtitle for the plot.NULL
xlabA character value specifying the label for the x-axis.NULL
ylabA character value specifying the label for the y-axis.NULL
sortAn integer vector that describes how to sort the variables in x. Specified as a permutation of 1:length(dim(x)).NULL
offA numeric vector that specifies the spacing between each level of the mosaic as a percentage.NULL
dirA character vector that specifies which direction to plot each vector in x. Use "v" for vertical and "h" for horizontal.NULL
colorA logical value or character vector specifying colors to use for color shading. You may use color=TRUE for a gamma-corrected color palette, color=NULL for grayscale, or color=FALSE for unfilled boxes.NULL
shadeA logical value (or numeric vector) specifying whether to produce “extended mosaic plots” to visualize standardized residuals of a log-linear model for the table by color and outline of the mosaic’s tiles. You may specify shade=FALSE for standard plots, shade=TRUE for extended plots, or a numeric vector with up to five elements specifying cut points of the residuals.FALSE
marginA list of vectors containing marginal totals to fit in a log-linear model. See the help file for loglin for more information.NULL
cex.axisA numeric value specifying the magnification factor to use for axis annotation text.0.66
lasSpecifies the style of the axis labels.par("las")
typeA character string indicating the type of residuals to plot. Use type="pearson" for components of Pearson’s chi-squared, type="deviance" for components of the likelihood ratio chi-squared, or type="FT" for the Freeman-Tukey residuals.c("pearson", "deviance", "FT")
na.actionA function that specifies what mosaicplot should do if the data contains variables to be cross-tabulated that contain NA values.A function that omits NA values (specifically, stats::na.omit)
...Additional graphical parameters passed to other methods. 

As an example, let’s create a mosaic plot showing the number of batters in the MLB in 2008. On the x-axis, we’ll show batting hand (left, right, or both), and on the y-axis we’ll show throwing hand (left or right). This function can accept either a matrix of values or a formula and a data frame. In this example, we’ll use a formula and a data frame. The plot is shown in Figure 13-12:

> mosaicplot(formula=bats~throws, data=batting.w.names.2008, color=TRUE)
> dev.off()

Another chart that is very similar to a mosaic plot is a spine plot. A spine plot shows different boxes corresponding to the number of observations associated with two factors. Figure 13-13 shows an example of a spine plot using the same batting data we used in the mosaic example:

> spineplot(formula=bats~throws, data=batting.w.names.2008)

Another function for looking at tables of data is assocplot. This function plots a set of bar charts, showing the deviation of each combination of factors from independence. (These are also called Cohen-Friendly association plots.) As an example, let’s look at the same data for batting and throwing hands:

> assocplot(table(batting.w.names.2008$bats, batting.w.names.2008$throws),
+   xlab="Throws", ylab="Bats")

The resulting plot is shown in Figure 13-14. Other useful plotting functions include stars and fourfoldplot. See the help files for more information.

R includes a few functions for visualizing three-dimensional data. All of these functions can be used to plot a matrix of values. (Row indices correspond to x values, column indices to y values, and values in the matrix to z values.)

As an example of multidimensional data, I used elevation data for Yosemite Valley in Yosemite National Park (you can find a map at http://www.nps.gov/yose/planyourvisit/upload/yosevalley2008.pdf). The sample data I used for my examples is included in the nutshell library.

To view a three-dimensional surface, use the persp function. This function draws a plot of a three-dimensional surface for a specific perspective. (It does, of course, draw in only two dimensions.) If you want to show your nonstatistician friends that you are doing really cool math stuff with R, this is the function that draws the coolest plots:

persp(x = seq(0, 1, length.out = nrow(z)),
      y = seq(0, 1, length.out = ncol(z)),
      z, xlim = range(x), ylim = range(y),
      zlim = range(z, na.rm = TRUE),
      xlab = NULL, ylab = NULL, zlab = NULL,
      main = NULL, sub = NULL,
      theta = 0, phi = 15, r = sqrt(3), d = 1,
      scale = TRUE, expand = 1,
      col = "white", border = NULL, ltheta = -135, lphi = 0,
      shade = NA, box = TRUE, axes = TRUE, nticks = 5,
      ticktype = "simple", ...)

Here is a description of the values to persp.

ArgumentDescriptionDefault
x, yNumeric vectors that explain what each dimension of z represents. (Specifically, x is a numeric vector representing the x values for each row in z, and y is a numeric vector representing the y values for each column in z.)x = seq(0, 1, length.out = nrow(z)) sy = seq(0, 1, length.out = ncol(z))
zA matrix of values to plot. 
xlim, ylim, zlimNumeric vectors with two values, representing the range of values to plot for x, y, and z, respectively.xlim = range(x) ylim = range(y), zlim = range(z, na.rm = TRUE)
xlab, ylab, zlabCharacter values specifying titles to plot for the x-, y-, and z-axes.NULL
mainA character value specifying the main title for the plot.NULL
subA character value specifying the subtitle for the plot.NULL
thetaA numeric value that specifies the azimuthal direction of the viewing angle.0
phiA numeric value that specifies the colatitude of the viewing angle.15
rThe distance of the viewing point from the center of the plotting box.sqrt(3)
dA numeric value that can be used to increase or decrease the perspective effect.1
scaleA logical value specifying whether to maintain aspect ratios when plotting.TRUE
expandA numeric factor used to expand (when z > 1) or shrink (when z < 1) the z coordinates.1
colThe color of the surface facets."white"
borderThe color of the lines drawn around the surface facets.NULL
lthetaIf specified, the surface is drawn as if illuminated from the direction specified by azimuth ltheta and colatitude lphi.-135
lphiSee the explanation for ltheta.0
shadeAn exponent used to calculate the shade of the surface facets. See the help file for more information.NA
boxA logical value indicating whether a bounding box for the surface should be drawn.TRUE
axesA logical value indicating whether axes should be drawn.TRUE
nticksA numeric value specifying the number of ticks to draw on each axis.5
ticktypeA character value specifying the types of ticks drawn here. Use ticktype="simple" for arrows pointing in the direction of increase and ticktype="detailed" to show simple tick marks."simple"
...Additional graphical parameters. See Graphical Parameters. 

As an example of three-dimensional data, let’s take a look at Yosemite Valley. Specifically, let’s look toward Half Dome. To plot this elevation data, I needed to make two transformations. First, I needed to flip the data horizontally. In the data file, values move east to west (or left to right) as x indices increase and from north to south (or top to bottom) as y indices increase. Unfortunately, persp plots y coordinates slightly differently. Persp plots increasing y coordinates from bottom to top. So I selected y indices in reverse order. Here is an R expression to do this:

> # load the data:
> library(nutshell)
> data(yosemite)
> # check dimensions of data
> dim(yosemite)
[1] 562 253
> # select all 253 columns in reverse order
> yosemite.flipped <- yosemite[,seq(from=253, to=1)]

Next, I wanted to select only a square subset of the elevation points. To do this, I selected only the rightmost 253 columns of the yosemite matrix using an expression like this:

> yosemite.rightmost <- yosemite[nrow(yosemite) - ncol(yosemite) + 1,]

Note the “+ 1” in this statement; that’s to make sure that we take exactly 253 columns. (This is to avoid a fencepost error.)

To plot the figure, I rotated the image by 225° (through theta=225) and changed the viewing angle to 20° (phi=20). I adjusted the light source to be from a 45° angle (ltheta=45) and set the shading factor to 0.75 (shade=.75) to exaggerate topological features. Putting it all together, here is the code I used to plot Yosemite Valley looking toward Half Dome:

> # create halfdome subset in one expression:
> halfdome <- yosemite[(nrow(yosemite) - ncol(yosemite) + 1):562,
+   seq(from=253,to=1)]
> persp(halfdome,col=grey(.25), border=NA, expand=.15,
+   theta=225, phi=20, ltheta=45, lphi=20, shade=.75)

The resulting image is shown in Figure 13-15.

Another useful function for plotting three-dimensional data is image. This function plots a matrix of data points as a grid of boxes, color coding the boxes based on the intensity at each location:

image(x, y, z, zlim, xlim, ylim, col = heat.colors(12),
      add = FALSE, xaxs = "i", yaxs = "i", xlab, ylab,
      breaks, oldstyle = FALSE, ...)

Here is a description of the arguments to image.

ArgumentDescriptionDefault
x, y(Alternatively, you may pass x an argument that is a list containing elements named x, y, and z.) 
zA matrix of values to plot. 
xlim, ylimTwo-element numeric vectors that specify the range of values in x and y that should be plotted. 
zlimThe range of values for z for which colors should be plotted. 
colA vector of colors to plot. Typically generated by functions like rainbow, heat.colors, topo.colors, or terrain.colors.heat.colors(12)
addA logical value that specifies whether the plot should be added to the existing plot.FALSE
xaxs, yaxsStyle for the x- and y-axes; see Graphical parameters by name.xlab="i", ylab="i"
xlab, ylabLabels for the x and y values. 
breaksAn integer value specifying the number of break points for colors. (There must be at least one more color than break point.) 
oldstyleIf oldstyle=TRUE, then the midpoints of the color intervals are equally spaced between the limits. If oldstyle=FALSE, then the range is split into color intervals of equal size.FALSE
...Additional arguments to par. 

To plot the Yosemite Valley data using image, I needed to make several tweaks. First, I needed to specify an aspect ratio that matched the dimensions of the data by setting asp=253/562 (note that this is a standard graphics parameter passed to par). Then I specified a range of points on the y dimension to make sure that data was plotted from top to bottom (y=c(1,0)). Finally, I specified a set of 32 grayscale colors for this plot (col=sapply((0:32)/32,gray)). Here is an expression that generates an image plot from the Yosemite Valley data:

> image(yosemite, asp=253/562, ylim=c(1,0), col=sapply((0:32)/32, gray))

The results are shown in Figure 13-16.

A closely related tool for looking at multidimensional data, particularly in biology, is the heat map. A heat map plots a single variable on two axes, each representing a different factor. The heatmap function plots a grid, where each box is encoded with a different color depending on the size of the dependent variable. It may also plot a tree structure (called a dendrogram) to the side of each plot showing the hierarchy of values. As you might have guessed, the function for plotting heat maps in R is heatmap:

heatmap(x, Rowv=NULL, Colv=if(symm)"Rowv" else NULL,
        distfun = dist, hclustfun = hclust,
        reorderfun = function(d,w) reorder(d,w),
        add.expr, symm = FALSE, revC = identical(Colv, "Rowv"),
        scale=c("row", "column", "none"), na.rm = TRUE,
        margins = c(5, 5), ColSideColors, RowSideColors,
        cexRow = 0.2 + 1/log10(nr), cexCol = 0.2 + 1/log10(nc),
        labRow = NULL, labCol = NULL, main = NULL,
        xlab = NULL, ylab = NULL,
        keep.dendro = FALSE, verbose = getOption("verbose"), ...)

Another useful function for plotting three-dimensional data is contour. The contour function plots contour lines, connecting equal values in the data:

contour(x = seq(0, 1, length.out = nrow(z)),
        y = seq(0, 1, length.out = ncol(z)),
        z,
        nlevels = 10, levels = pretty(zlim, nlevels),
        labels = NULL,
        xlim = range(x, finite = TRUE),
        ylim = range(y, finite = TRUE),
        zlim = range(z, finite = TRUE),
        labcex = 0.6, drawlabels = TRUE, method = "flattest",
        vfont, axes = TRUE, frame.plot = axes,
        col = par("fg"), lty = par("lty"), lwd = par("lwd"),
        add = FALSE, ...)

Here is a table showing the arguments to contour.

ArgumentDescriptionDefault
x, yNumeric vectors specifying the location of grid lines at which values in the matrix z are measured. (Alternatively, you may specify a single matrix for x and omit y and z.)x=seq(0, 1, length.out=nrow(z)) y=seq(0, 1, length.out=ncol(z))
zA numeric vector containing values to be plotted. 
nlevelsNumber of contour levels. (Used only if levels is not specified.)10
levelsA numeric vector of levels at which to draw lines.pretty(zlim, nlevels)
labelsA vector of labels for the contour lines.NULL
xlim, ylim, zlimNumeric vectors of two elements specifying the range of x, y, and z values to include in the plot.xlim = range(x, finite = TRUE) ylim = range(y, finite = TRUE) zlim = range(z, finite = TRUE)
labcexText scaling factor for contour labels.0.6
drawlabelsA logical value specifying whether to draw contour labels.TRUE
methodCharacter value specifying where to draw contour labels. Options include method="simple", method="edge", and method="flattest"."flattest"
vfontA character vector with two elements specifying the font to use for contour labels. vfont[1] specifies a Hershey font family; vfont[2] specifies a typeface within the family. 
axesA logical value indicating whether to print axes.TRUE
frame.plotA logical value indicating whether to draw a box around the plot.axes
colA color for the contour lines.par("fg")
ltyA type of lines to draw.par("lty")
lwdA width for the lines.par("lwd")
addA logical value specifying whether to add the contour lines to an existing plot (add=TRUE) or to create a new plot (add=FALSE).FALSE
...Additional arguments passed to plot.window, title, Axis, and box. 

The following expression generates a contour plot using the Yosemite Valley data:

> contour(yosemite, asp=253/562, ylim=c(1, 0))

As with image, we needed to flip the y-axis and to specify an aspect ratio. The results are shown in Figure 13-17.

Contours are commonly added to existing image plots.

When performing data analysis, it’s often very important to understand the shape of a data distribution. Looking at a distribution can tell you whether there are outliers in the data, or whether a certain modeling technique will work on your data, or simply how many observations are within a certain range of values.

The best-known technique for visualizing a distribution is the histogram. In R, you can plot a histogram with the hist function. As an example, let’s look at the number of plate appearances (PAs) for batters during the 2008 MLB season. Plate appearances count the number of times a player had the opportunity to bat; plate appearances include all times a player had a hit, made an out, reached on error, walked, was hit by pitch, hit a sacrifice fly, or hit a sacrifice bunt.

You can load this data set from the nutshell package:

> library(nutshell)
> data(batting.2008)

Let’s calculate the plate appearances for each player and then plot a histogram. The resulting histogram is shown in Figure 13-18:

> # PA (plate appearances) =
> #   AB (at bats) + BB (base on balls) + HBP (hit by pitch) +
> #   SF (sacrifice flies) + SH (sacrifice bunts)
> batting.2008 <- transform(batting.2008,
+   PA=AB+BB+HBP+SF+SH)
> hist(batting.2008$PA)

The histogram shows that there were a large number of players with fewer than 50 plate appearances. If you were to perform further analysis on this data (for example, looking at the average on-base percentage [OBP]), you might want to exclude these players from your analysis. As we will show in Proportion Test Design, you will need much larger sample sizes than 50 plate appearances to draw conclusions with the data.

Let’s try generating a second histogram, this time excluding players with fewer than 25 at bats. We’ll also increase the number of bars, using the breaks argument to specify that we want 50 bins:

> hist(batting.2008[batting.2008$PA>25, "PA"], breaks=50, cex.main=.8)

The second histogram is shown in Figure 13-19.

A closely related type of chart is the density plot. Many statisticians recommend using density plots instead of histograms because they are more robust and easier to read. To plot a density plot from the plate appearance data (for batters with more than 25 plate appearances), we use two functions.

First, we use density to calculate the kernel density estimates. Next, we use plot to plot the estimates. We could plot the diagram with an expression like this:

> plot(density(batting.2008[batting.2008$PA>25, "PA"]))

A common addition to a kernel density plot is a rug. A rug is essentially a strip plot shown along the axis, with each point represented by a short line segment. You can add a rug to the kernel density plot with an expression like:

> rug(batting.2008[batting.2008$PA>25, "PA"])

The final version of the density plot is shown in Figure 13-20.

Another way to view a distribution is the quantile-quantile (Q-Q) plot. Quantile-quantile plots compare the distribution of the sample data to the distribution of a theoretical distribution (often a normal distribution). As the name implies, they plot the quantiles from the sample data set against the quantiles from a theoretical distribution. If the sample data is distributed the same way as the theoretical distribution, all points will be plotted on a 45° line from the lower-left corner to the upper-right corner. Quantile-quantile plots provide a very efficient way to tell how a distribution deviates from an expected distribution.

You can generate these plots in R with the qqnorm function. Without arguments, this function will plot the distribution of points in each quantile, assuming a theoretical normal distribution. The plot is shown in Figure 13-21:

> qqnorm(batting.2008$AB)

If you would like to compare two actual distributions, or compare the data distribution to a different theoretical distribution, then try the function qqplot.

Another very useful way to visualize a distribution is a box plot. A box plot is a compact way to show the distribution of a variable. The box shows the interquartile range. The interquartile range contains values between the 25th and 75th percentile; the line inside the box shows the median. The two “whiskers” on either side of the box show the adjacent values. A box plot is shown in Figure 13-22.

The adjacent values are intended to show extreme values, but they don’t always extend to the absolute maximum or minimum value. When there are values far outside the range we would expect for normally distributed data, those outlying values are plotted separately. Specifically, here is how the adjacent values are calculated: the upper adjacent value is the value of the largest observation that is less than or equal to the upper quartile plus 1.5 times the length of the interquartile range; the lower adjacent value is the value of the smallest observation that is greater than or equal to the lower quartile less 1.5 times the length of the interquartile range. Values outside the range of the whiskers are called outside values and are plotted individually.

To plot a box plot, use the boxplot function. Here is the default method of boxplot for vectors:

boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
        notch = FALSE, outline = TRUE, names, plot = TRUE,
        border = par("fg"), col = NULL, log = "",
        pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
        horizontal = FALSE, add = FALSE, at = NULL)

And here is the form of boxplot when a formula is specified:

boxplot(formula, data = NULL, ..., subset, na.action = NULL)

Here is a description of the arguments to boxplot.

ArgumentDescriptionDefault
formulaA formula of the form y ~ grp, where y is a variable to be plotted and grp is a variable describing a set of different plotting groups. 
dataA data frame (or list) in which the variables used in formula are defined. 
subsetA vector specifying a subset of observations to use in plotting. 
xA vector specifying values to plot. 
...Additional vectors to plot (or graphical parameters to pass to bxp). Each additional vector is plotted as an additional box. 
rangeA numeric value that determines the maximum amount that the whiskers extend from the boxes.1.5
widthA numeric vector specifying the widths of the boxes being plotted.NULL
varwidthIf varwidth=TRUE, then each box is drawn with a width proportional to the square root of the number of observations represented by the box. If varwidth=FALSE, boxes are plotted with the same width.FALSE
notchIf notch=TRUE, then a “notch” is drawn in the boxes. Notches are drawn at +/-1.58 IQR/sqrt(n); see the help file for boxplot.stats for an explanation of what they mean.FALSE
outlineA logical value specifying whether outliers should be drawn.TRUE
namesA character vector specifying the group names used to label each box plot. 
plotIf plot=TRUE, then the box plots are plotted. If plot=FALSE, then boxplot returns a list of statistics that could be used to draw a box plot; see the help file for boxplot for more details.TRUE
borderA character vector specifying the color to use for the outline of each box plot.par("fg")
colA character vector specifying the color to use for the background of each box plot.NULL
logA character value indicating whether the x-axis (log="x"), y-axis (log="y"), both axes (log="xy"), or neither axis (log="") should be plotted with a logarithmic scale.""
parsA list of additional graphical parameters passed to bxp.list(boxweb = 0.8, stapleweb = 0.5, outwex = 0.5)
horizontalA logical value indicating whether the boxes should be drawn horizontally (horizontal=TRUE) or vertically (horizontal=FALSE).FALSE
addA logical value specifying whether the box plot should be added to an existing chart (add=TRUE) or if a new chart should be drawn (add=FALSE).FALSE
atA numeric vector specifying the locations at which each box plot should be drawn.1:n, where n is the number of boxes

As an example, let’s look at the team batting data from 2008. We’ll restrict the data to include only American League teams (it’s too hard to read a plot with 30 boxes, so this cuts it to 16) and include only players with over 100 plate appearances (to cut out marginal players with a small number of plate appearances). Finally, let’s adjust the text size on the axis so that all the labels fit. Here is the expression:

> batting.2008 <- transform(batting.2008,
+   OBP=(H+BB+HBP)/(AB+BB+HBP+SF))
> boxplot(OBP~teamID,
+   data=batting.2008[batting.2008$PA>100 & batting.2008$lgID=="AL",],
+   cex.axis=.7)

The results are shown in Figure 13-23.



[35] Data from both can be found in the Statistical Abstract of the United States, available online at http://www.census.gov/compendia/statab/.

[36] As with many other examples in this book, this data was taken from the Statistical Abstract of the United States, 2009. This data comes from http://www.census.gov/compendia/statab/tables/09s0785.xls.

[37] A lot of people dislike pie charts. I think they are good for saying, “Look how much bigger this number is than this number,” and they are very good at taking up lots of space on a page. Pie charts are not good at showing subtle differences between the size of different slices; search for “why pie charts are bad” on Google, and you’ll come up with dozens of sites explaining what’s wrong with them. Or just check the help file for pie, which says, “Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”