R includes tools for drawing most common types of charts, including
bar charts, pie charts, line charts, and scatter plots. Additionally, R
can also draw some less-familiar charts like quantile-quantile (Q-Q)
plots, mosaic plots, and contour plots. The following table shows many of
the charts included in the graphics
package.
Graphics package function | Description |
---|---|
barplot | Bar and column charts |
dotchart | Cleveland dot plots |
hist | Histograms |
density | Kernel density plots |
stripchart | Strip charts |
qqnorm (in stats package) | Quantile-quantile plots |
xplot | Scatter plots |
smoothScatter | Smooth scatter plots |
qqplot (in stats package) | Quantile-quantile plots |
pairs | Scatter plot matrices |
image | Image plots |
contour | Contour plots |
persp | Perspective charts of three-dimensional data |
interaction.plot | Summary of the response for two-way combinations of factors |
sunflowerplot | Sunflower plots |
You can show R graphics on the screen or save them in many different formats. Graphics Devices explains how to choose output methods. R gives you an enormous amount of control over graphics. You can control almost every aspect of a chart. Customizing Charts explains how to tweak the output of R to look the way you want. This section shows how to use many common types of R charts.
To show how to use scatter plots, we will look at cases of cancer in 2008 and toxic waste releases by state in 2006. Data on new cancer cases (and deaths from cancer) are tabulated by the American Cancer Society; information on toxic chemicals released into the environment is tabulated by the U.S. Environmental Protection Agency (EPA).[35]
The sample data is included in the nutshell
package:
> library(nutshell) > data(toxins.and.cancer)
To show a scatter plot, use the plot
function. plot
is a
generic function (you can “plot” many different types of objects);
plot
also can draw many types of
objects, including vectors, tables, and time series. For simple scatter
plots with two vectors, the function that is called is plot.default
:
plot(x, y = NULL, type = "p", xlim = NULL, ylim = NULL, log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ann = par("ann"), axes = TRUE, frame.plot = axes, panel.first = NULL, panel.last = NULL, asp = NA, ...)
Here is a description of the arguments to plot
.
Argument | Description | Default |
---|---|---|
x, y | The data to be plotted. You may specify two separate
vectors, x and y . Otherwise, you may specify a time
series, formula, list, or matrix with two or more columns; see
the help file for xy.coords
for more details. | |
type | A character value that specifies the type of plot:
type="p" for points, type="l" for lines, type="o" for overplotted points and
lines, type="b" for points
joined by lines, type="s" for
stair steps, type="h" for
histogram-style vertical lines, or type="n" for no points or
lines. | "p" |
xlim | A numeric vector with two values specifying the x limits of the plot. | NULL |
log | A character value that specifies which axes should be
plotted with a logarithmic scale. Use log="" for neither, log="x" for the x-axis, log="y" for the y-axis, and log="xy" for both. | "" |
main | The main title for the plot. | NULL |
sub | The subtitle for the plot. | NULL |
xlab | The label for the x-axis. | NULL |
ylab | The label for the y-axis. | NULL |
ann | If ann=TRUE , then axis
titles and overall titles are included with plots. If ann=FALSE , then these annotations are
not included. | par("ann") |
axes | A logical value that specifies whether axes should be drawn. | TRUE |
frame.plot | A logical value that specifies whether a box should be drawn around the plot. | axes |
panel.first | An expression that is evaluated after the axes are drawn but before points are plotted. | NULL |
panel.last | An expression that is evaluated after the points are plotted. | NULL |
asp | A numeric value that specifies the aspect ratio of the plot (as y/x). | NA |
... | Additional graphical parameters. (See Graphical Parameters for more information.) |
Now let’s try our first plot. Let’s compare the overall cancer rate (number of cancer deaths divided by state population) to the presence of toxins (total toxic chemicals release divided by state area):
> # use with so that we don't have to keep typing the > # data frame name > attach(toxins.and.cancer) > plot(total_toxic_chemicals/Surface_Area, deaths_total/Population)
The chart is shown in Figure 13-1. Perhaps there is a stronger correlation between airborne toxins and lung cancer:
> plot(air_on_site/Surface_Area, deaths_lung/Population)
This chart is shown in Figure 13-2.
Suppose that you wanted to know which states were associated with which
points. R provides some interactive tools for identifying points on plots. You can use the
locator
function to tell you the coordinates of a specific point (or set of points). To
do this, first plot the data. Next, type locator(1)
. Then click on a point in
the open graphics window. As an example, suppose that you plotted the
data above, typed locator(1)
, and
then clicked on the point in the upper-right corner. You would see
output like this in the R console:
> locator(1)
$x
[1] 0.002499427
$y
[1] 0.0008182696
Another useful function for identifying points is identify
. This function can be used to interactively label points
on a plot. To use identify
with the
data above, you would enter:
> identify(air_on_site/Surface_Area, deaths_lung/Population, + State_Abbrev)
While this command is running, you can click on individual points on the chart, and R will label those points with state names.
If you wanted to label all the points at once, you could use the
text
function to add labels to the plot. Here is how I drew
the plot shown in Figure 13-3:
> plot(air_on_site/Surface_Area, deaths_lung/Population, + xlab="Air Release Rate of Toxic Chemicals", + ylab="Lung Cancer Death Rate") > text(air_on_site/Surface_Area, deaths_lung/Population, + labels=State_Abbrev, + cex=0.5, + adj=c(0,-1))
Notice that I have added some extra arguments to refine the
appearance of the plot. The xlab
and
ylab
arguments are used to add labels
to the x- and y-axes. The
text
function draws a label next to
each point. I tweaked the size placement of the labels using the
cex
and adj
arguments; see Graphical Parameters for more information.
Is this relationship significant? It is actually statistically significant (see Correlation tests), but we don’t have enough information to make a good argument that there is a causal relationship.
The plot
function is a good
choice if you want to plot two columns of data on one chart. However,
suppose that you have more columns of data to plot, perhaps split into
different categories. Or, suppose that you want to plot all the columns
of one matrix against all the columns of another matrix. To plot
multiple sets of columns against one another, you can use the matplot
function:
matplot(x, y, type = "p", lty = 1:5, lwd = 1, pch = NULL, col = 1:6, cex = NULL, bg = NA, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ..., add = FALSE, verbose = getOption("verbose"))
Matplot
accepts the following
arguments.
Argument | Description | Default |
---|---|---|
x, y | Vectors or matrices containing the data to be plotted. The number of rows and columns should match. | If x is not specified,
then x=1:ncol(y) . If y is not specified, then y=x; x=1:ncol(y) . |
type | A character vector specifying the types of plots to
generate. Use type="p" for
points, type="l" for lines,
type="b" for both, type="c" for the lines part alone of
"b" , type="o" for both overplotted points
and lines, type="h" for
histogram-like (or high-density) vertical lines, type="s" for stair steps, type="S" for other steps, or type="n" for no plotting. | "p" |
lty | A vector of line types. See Graphical parameters by name for more details. | 1:5 |
lwd | A vector of line widths. See Graphical parameters by name for more details. | 1 |
pch | A vector of plotting characters. See Graphical parameters by name for more details. | NULL |
col | A vector of colors. See Graphical parameters by name for more details. | 1:6 |
cex | A vector of character expansion sizes. See Graphical parameters by name for more details. | NULL |
bg | A vector of background colors for plot symbols. See Graphical parameters by name for more details. | NA |
xlab, ylab | Character values specifying x- and y-axis labels. | NULL |
xlim, ylim | Numeric values specifying ranges for the x- and y-axes. | NULL |
... | Additional graphical parameters that are passed to
par . | NULL |
add | A logical value indicating whether to add to the current plot or to generate a new plot. | FALSE |
verbose | A logical value indicating whether to write information
to the console describing what matplot did. | getOption("verbose") |
Many arguments to matplot
have
the same names as standard arguments to par
. However, because matplot
generates multiple plots at the same
time, these arguments can be specified as vectors of multiple values
when called by matplot
. For more
details on standard arguments, see Graphical Parameters.
If you are plotting a very large number of points, then you may
prefer the function smoothScatter
, which plots the density of points by shading different
regions of the plot different shades, depending on the density of points
in each region:
smoothScatter(x, y = NULL, nbin = 128, bandwidth, colramp = colorRampPalette(c("white", blues9)), nrpoints = 100, pch = ".", cex = 1, col = "black", transformation = function(x) x^.25, postPlotHook = box, xlab = NULL, ylab = NULL, xlim, ylim, xaxs = par("xaxs"), yaxs = par("yaxs"), ...)
For an example of smoothScatter
, see Correlation and Covariance.
If you have a data frame with n different
variables and you would like to generate a scatter plot for each pair of
values in the data frame, try the pairs
function. As an example, let’s plot the
hits, runs, strikeouts, walks, and home runs for each Major League
Baseball (MLB) player who had more than 100 at bats in 2008. To do this,
we would use the following R statement:
> library(nutshell) > data(batting.2008) > pairs(batting.2008[batting.2008$AB>100, c("H", "R", "SO", "BB", "HR")])
The result is shown in Figure 13-4.
R includes tools for plotting time series data. The
plot
function has a method for time series:
plot(x, y = NULL, plot.type = c("multiple", "single"), xy.labels, xy.lines, panel = lines, nc, yax.flip = FALSE, mar.multi = c(0, 5.1, 0, if(yax.flip) 5.1 else 2.1), oma.multi = c(6, 0, 5, 0), axes = TRUE, ...)
The arguments x
and y
specify ts
objects, panel
specifies how to plot the time series
(by default, lines), and other arguments specify how to break time
series into different plots (as in lattice
). As an example, we’ll plot the turkey
price data:
> library(nutshell) > data(turkey.price.ts) > plot(turkey.price.ts)
The results are shown in Figure 13-5. As you can see, turkey prices are very seasonal. There are huge sales in November and December (for Thanksgiving and Christmas) and minor sales in spring (probably for Easter).
Another way to look at seasonal effects is with an autocorrelation
plot (which is also called a correlogram
; see Figure 13-6).
This plot shows how correlated points are with each other, by difference
in time. You can also plot the autocorrelation function for a time
series (which can be helpful for looking at cyclical effects). The plot
is generated by default when you call acf
, which computes the autocorrelation function. (Alternatively, you can generate the
autocorrelation function with acf
and
then plot it separately.) Here is how to generate the plot for the
turkey price data:
> acf(turkey.price.ts)
As you can see, points are correlated over 12-month cycles (and inversely correlated over 6-month cycles). Time series analysis is discussed further in Chapter 23.
To draw bar (or column) charts in R, use the barplot
function.
As an example, let’s look at doctoral degrees awarded in the United States between 2001 and 2006:[36]
> doctorates <- data.frame ( + year=c(2001, 2002, 2003, 2004, 2005, 2006), + engineering=c(5323, 5511, 5079, 5280, 5777, 6425), + science=c(20643, 20017, 19529, 20001, 20498, 21564), + education=c(6436, 6349, 6503, 6643, 6635, 6226), + health=c(1591, 1541, 1654, 1633, 1720, 1785), + humanities=c(5213, 5178, 5051, 5020, 5013, 4949), + other=c(2159, 2141, 2209, 2180, 2480, 2436) + )
Or, if you prefer, you can just load the data from the nutshell
package:
> library(nutshell) > data(doctorates)
Now let’s transform this into a matrix for plotting:
> # make this into a matrix: > doctorates.m <- as.matrix(doctorates[2:7]) > rownames(doctorates.m) <- doctorates[, 1] > doctorates.m engineering science education health humanities other 2001 5323 20643 6436 1591 5213 2159 2002 5511 20017 6349 1541 5178 2141 2003 5079 19529 6503 1654 5051 2209 2004 5280 20001 6643 1633 5020 2180 2005 5777 20498 6635 1720 5013 2480 2006 6425 21564 6226 1785 4949 2436
The barplot
function can’t work with a data frame, so we’ve created a
matrix object for this problem with the data.
Let’s start by just showing a bar plot of doctorates in 2001 by type:
> barplot(doctorates.m[1, ])
As you can see from Figure 13-7, by default R shows the y-axis along with the size of each bar, but it does not show the x-axis. R also automatically uses column names to name the bars. Suppose that we wanted to show all the different years as bars stacked next to one another. Suppose that we also wanted the bars plotted horizontally and wanted to show a legend for the different years. To do this, we could use the following expression to generate the chart shown in Figure 13-8:
> barplot(doctorates.m, beside=TRUE, horiz=TRUE, legend=TRUE, cex.names=.75)
Finally, suppose that we wanted to show doctorates by year as stacked bars. To do this, we need to transform the matrix so that each column is a year and each row is a discipline. We also need to make sure that there is enough room to see the legend, so we’ll extend the limits on the y-axis:
> barplot(t(doctorates.m), legend=TRUE, ylim=c(0, 66000))
The chart generated by this expression is shown in Figure 13-9.
Here is a detailed description of barplot
:
barplot(height, width = 1, space = NULL, names.arg = NULL, legend.text = NULL, beside = FALSE, horiz = FALSE, density = NULL, angle = 45, col = NULL, border = par("fg"), main = NULL, sub = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, xpd = TRUE, log = "", axes = TRUE, axisnames = TRUE, cex.axis = par("cex.axis"), cex.names = par("cex.axis"), inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0, add = FALSE, args.legend = NULL, ...)
The barplot
function is very
flexible; here is a description of the arguments to barplot
.
One of the most popular ways to plot data is the pie
chart. Pie charts can be an effective way to compare different parts of
a quantity, though there are lots of good reasons not to use pie
charts.[37] You can draw pie charts in R using the pie
function:
pie(x, labels = names(x), edges = 200, radius = 0.8, clockwise = FALSE, init.angle = if(clockwise) 90 else 0, density = NULL, angle = 45, col = NULL, border = NULL, lty = NULL, main = NULL, ...)
Here is a description of the arguments to pie
.
Argument | Description | Default |
---|---|---|
x | A vector of nonnegative numeric values that will be plotted. | |
labels | An expression to generate labels, a vector of character
strings, or another object that can be coerced to a graphicsAnnot object and used as
labels. | names(x) |
edges | A numeric value indicating the number of segments used to draw the outside of the pie. | 200 |
radius | A numeric value that specifies how big the pie should be. (Parts of the pie are cut off for values over 1.) | 0.8 |
clockwise | A logical value indicating whether slices are drawn clockwise or counterclockwise. | FALSE |
init.angle | A numeric value specifying the starting angle for the slices (in degrees). | if (clockwise) 90 else
0 |
density | A numeric value that specifies the density of shading
lines in lines per inch. density=NULL means that no
lines are drawn. | NULL |
angle | A numeric value that specifies the slope of the shading lines (in degrees). | 45 |
col | A numeric vector that specifies the colors to be used for
slices. If col=NULL , then a
set of six pastel colors is used. | NULL |
border | Arguments passed to the polygon function to draw each slice. | NULL |
lty | The line type used to draw each slice. | NULL |
main | A character string that represents the title. | NULL |
As a simple example, let’s use pie charts to show what happened to fish caught in the United States in 2006:
> # 2006 fishery data from > # http://www.census.gov/compendia/statab/tables/09s0852.xls > # units are millions of pounds of live fish > domestic.catch.2006 <- c(7752, 1166, 463, 108) > names(domestic.catch.2006) <- c("Fresh and frozen", + "Reduced to meal, oil, etc.", "Canned", "Cured") > # note: cex.6 setting shrinks text size by 40% so you can see the labels > pie(domestic.catch.2006, init.angle=100, cex=.6)
As shown in Figure 13-10, most of the fish (by weight) was sold fresh or frozen.
The graphics
package
includes some very useful, and possibly unfamiliar, tools for looking at
categorical data.
Suppose that you want to look at the conditional
density of a set of categories dependent on a numeric value. You
can do this with a conditional density plot, generated by the cdplot
function:
cdplot(x, y, plot = TRUE, tol.ylab = 0.05, ylevels = NULL, bw = "nrd0", n = 512, from = NULL, to = NULL, col = NULL, border = 1, main = "", xlab = NULL, ylab = NULL, yaxlabels = NULL, xlim = NULL, ylim = c(0, 1), ...)
Here is the form of cdplot
when
called with a formula:
cdplot(formula, data = list(), plot = TRUE, tol.ylab = 0.05, ylevels = NULL, bw = "nrd0", n = 512, from = NULL, to = NULL, col = NULL, border = 1, main = "", xlab = NULL, ylab = NULL, yaxlabels = NULL, xlim = NULL, ylim = c(0, 1), ..., subset = NULL)
The cdplot
function uses the
density
function to compute kernel
density estimates across the range of numeric values and then plots
these estimates. Here is the list of arguments to cdplot
.
Argument | Description | Default |
---|---|---|
x, y, formula, data | Arguments used to specify the data to plot. You may
specify either a numeric vector x containing data to plot and a factor
vector y containing grouping
information or a formula and
a data frame (data ) in which
to evaluate the formula. | |
subset | A vector specifying the subset of values to be used when plotting. (Applies only when using a formula and a data frame.) | NULL |
plot | Logical value specifying whether the conditional densities should be plotted. | TRUE |
tol.ylab | A numeric vector that specifies a “tolerance parameter” for y-axis labels. If the difference between two labels is less than this parameter, then they are plotted equidistantly. | 0.05 |
ylevels | A character or numeric vector that specifies the order in which levels should be plotted. | NULL |
bw | The “smoothing bandwidth” to use when plotting. See the
help file for density for
more details. | "nrd0" |
n | A numeric value specifying the number of points at which the density is estimated. | 512 |
from | A numeric value specifying the lowest point at which the density is estimated. | NULL |
to | A numeric value specifying the highest point at which the density is estimated. | NULL |
col | A vector of fill colors for the different conditional values. | NULL |
border | Border color for shaded polygons. | 1 |
main | Main title. | "" |
xlab | x-axis label. | NULL |
ylab | y-axis label. | NULL |
yaxlabels | Character vector for labeling different conditional variables. | NULL |
xlim | Range of x variables to plot. | NULL |
ylim | Range of y variables to plot. | c(0, 1) |
... | Other arguments passed to density . |
As an example, let’s look at how the distribution of batting hand varies by batting average among MLB players in 2008:
> batting.w.names.2008 <- transform(batting.2008, + AVG=H/AB, bats=as.factor(bats), throws=as.factor(throws)) > cdplot(bats~AVG,data=batting.w.names.2008, + subset=(batting.w.names.2008$AB>100))
The results are shown in Figure 13-11. As you
can see, the proportion of switch hitters (bats=="B"
) increases with higher batting
average.
Suppose, instead, that you simply wanted to plot the proportion of
observations for two different categorical variables. R also provides
tools for visualizing this type of data. One of the most interesting
charts available in R for showing the number of observations with certain properties is
the mosaic plot. A mosaic plot shows a set of boxes corresponding to
different factor values. The x-axis corresponds to
one factor and the y-axis to another factor. To
create a mosaic plot, use the mosaicplot
function. Here is the form of the
mosaicplot
function for a contingency table:
mosaicplot(x, main = deparse(substitute(x)), sub = NULL, xlab = NULL, ylab = NULL, sort = NULL, off = NULL, dir = NULL, color = NULL, shade = FALSE, margin = NULL, cex.axis = 0.66, las = par("las"), type = c("pearson", "deviance", "FT"), ...)
There is also a method for mosaicplot
that allows you to specify the data
as a formula and data frame:
mosaicplot(formula, data = NULL, ..., main = deparse(substitute(data)), subset, na.action = stats::na.omit)
Here is a description of the arguments to mosaicplot
.
Argument | Description | Default |
---|---|---|
x, formula, data | Specifies the data to be plotted. You may specify either
a contingency table x or a
formula and a data frame (data) . (If the variables in formula are defined in the current
environment, then you may omit data .) | |
subset | A vector that specifies which values in data to plot. | |
main | A character value specifying the main title for the plot. | deparse(substitute(x))
|
sub | A character value specifying the subtitle for the plot. | NULL |
xlab | A character value specifying the label for the x-axis. | NULL |
ylab | A character value specifying the label for the y-axis. | NULL |
sort | An integer vector that describes how to sort the
variables in x . Specified as
a permutation of 1:length(dim(x) ). | NULL |
off | A numeric vector that specifies the spacing between each level of the mosaic as a percentage. | NULL |
dir | A character vector that specifies which direction to plot
each vector in x . Use
"v" for vertical and "h" for horizontal. | NULL |
color | A logical value or character vector specifying colors to
use for color shading. You may use color=TRUE for a gamma-corrected color
palette, color=NULL for
grayscale, or color=FALSE for
unfilled boxes. | NULL |
shade | A logical value (or numeric vector) specifying whether to
produce “extended mosaic plots” to visualize standardized
residuals of a log-linear model for the table by color and
outline of the mosaic’s tiles. You may specify shade=FALSE for standard plots,
shade=TRUE for extended
plots, or a numeric vector with up to five elements specifying
cut points of the residuals. | FALSE |
margin | A list of vectors containing marginal totals to fit in a
log-linear model. See the help file for loglin for more information. | NULL |
cex.axis | A numeric value specifying the magnification factor to use for axis annotation text. | 0.66 |
las | Specifies the style of the axis labels. | par("las") |
type | A character string indicating the type of residuals to
plot. Use type= "pearson" for components of
Pearson’s chi-squared, type="deviance" for
components of the likelihood ratio chi-squared, or type="FT" for the
Freeman-Tukey residuals. | c("pearson", "deviance",
"FT") |
na.action | A function that specifies what mosaicplot should do if
the data contains variables to be cross-tabulated that contain
NA values. | A function that omits NA values (specifically, stats::na.omit ) |
... | Additional graphical parameters passed to other methods. |
As an example, let’s create a mosaic plot showing the number of batters in the MLB in 2008. On the x-axis, we’ll show batting hand (left, right, or both), and on the y-axis we’ll show throwing hand (left or right). This function can accept either a matrix of values or a formula and a data frame. In this example, we’ll use a formula and a data frame. The plot is shown in Figure 13-12:
> mosaicplot(formula=bats~throws, data=batting.w.names.2008, color=TRUE) > dev.off()
Another chart that is very similar to a mosaic plot is a spine plot. A spine plot shows different boxes corresponding to the number of observations associated with two factors. Figure 13-13 shows an example of a spine plot using the same batting data we used in the mosaic example:
> spineplot(formula=bats~throws, data=batting.w.names.2008)
Another function for looking at tables of data is assocplot
. This function plots a set of bar charts, showing the
deviation of each combination of factors from independence. (These are
also called Cohen-Friendly association plots.) As
an example, let’s look at the same data for batting and throwing
hands:
> assocplot(table(batting.w.names.2008$bats, batting.w.names.2008$throws), + xlab="Throws", ylab="Bats")
The resulting plot is shown in Figure 13-14. Other useful plotting functions include
stars
and fourfoldplot
. See the help files for more
information.
R includes a few functions for visualizing three-dimensional data. All of these functions can be used to plot a matrix of values. (Row indices correspond to x values, column indices to y values, and values in the matrix to z values.)
As an example of multidimensional data, I used elevation data for
Yosemite Valley in Yosemite National Park (you can find a map at http://www.nps.gov/yose/planyourvisit/upload/yosevalley2008.pdf).
The sample data I used for my examples is included in the nutshell
library.
To view a three-dimensional surface, use the persp
function. This function draws a plot of a
three-dimensional surface for a specific perspective. (It does, of
course, draw in only two dimensions.) If you want to show your
nonstatistician friends that you are doing really cool math stuff with
R, this is the function that draws the coolest plots:
persp(x = seq(0, 1, length.out = nrow(z)), y = seq(0, 1, length.out = ncol(z)), z, xlim = range(x), ylim = range(y), zlim = range(z, na.rm = TRUE), xlab = NULL, ylab = NULL, zlab = NULL, main = NULL, sub = NULL, theta = 0, phi = 15, r = sqrt(3), d = 1, scale = TRUE, expand = 1, col = "white", border = NULL, ltheta = -135, lphi = 0, shade = NA, box = TRUE, axes = TRUE, nticks = 5, ticktype = "simple", ...)
Here is a description of the values to persp
.
Argument | Description | Default |
---|---|---|
x, y | Numeric vectors that explain what each dimension of
z represents. (Specifically,
x is a numeric vector
representing the x values for
each row in z , and y is a numeric vector representing the
y values for each column in
z .) | x = seq(0, 1, length.out =
nrow(z)) sy = seq(0, 1, length.out = ncol(z)) |
z | A matrix of values to plot. | |
xlim, ylim, zlim | Numeric vectors with two values, representing the range
of values to plot for x ,
y , and z , respectively. | xlim = range(x) ylim = range(y),
zlim = range(z, na.rm = TRUE) |
xlab, ylab, zlab | Character values specifying titles to plot for the x-, y-, and z-axes. | NULL |
main | A character value specifying the main title for the plot. | NULL |
sub | A character value specifying the subtitle for the plot. | NULL |
theta | A numeric value that specifies the azimuthal direction of the viewing angle. | 0 |
phi | A numeric value that specifies the colatitude of the viewing angle. | 15 |
r | The distance of the viewing point from the center of the plotting box. | sqrt(3) |
d | A numeric value that can be used to increase or decrease the perspective effect. | 1 |
scale | A logical value specifying whether to maintain aspect ratios when plotting. | TRUE |
expand | A numeric factor used to expand (when z > 1) or shrink (when z < 1) the z coordinates. | 1 |
col | The color of the surface facets. | "white" |
border | The color of the lines drawn around the surface facets. | NULL |
ltheta | If specified, the surface is drawn as if illuminated from
the direction specified by azimuth ltheta and colatitude lphi . | -135 |
lphi | See the explanation for ltheta . | 0 |
shade | An exponent used to calculate the shade of the surface facets. See the help file for more information. | NA |
box | A logical value indicating whether a bounding box for the surface should be drawn. | TRUE |
axes | A logical value indicating whether axes should be drawn. | TRUE |
nticks | A numeric value specifying the number of ticks to draw on each axis. | 5 |
ticktype | A character value specifying the types of ticks drawn
here. Use ticktype= "simple" for arrows pointing
in the direction of increase and ticktype="detailed" to show
simple tick marks. | "simple" |
... | Additional graphical parameters. See Graphical Parameters. |
As an example of three-dimensional data, let’s take a look at
Yosemite Valley. Specifically, let’s look toward Half Dome. To plot this
elevation data, I needed to make two transformations. First, I needed to
flip the data horizontally. In the data file, values move east to west
(or left to right) as x indices increase and from
north to south (or top to bottom) as y indices
increase. Unfortunately, persp
plots
y coordinates slightly differently. Persp
plots increasing y
coordinates from bottom to top. So I selected y
indices in reverse order. Here is an R expression to do this:
> # load the data: > library(nutshell) > data(yosemite) > # check dimensions of data > dim(yosemite) [1] 562 253 > # select all 253 columns in reverse order > yosemite.flipped <- yosemite[,seq(from=253, to=1)]
Next, I wanted to select only a square subset of the elevation
points. To do this, I selected only the rightmost 253 columns of the
yosemite
matrix using an expression
like this:
> yosemite.rightmost <- yosemite[nrow(yosemite) - ncol(yosemite) + 1,]
Note the “+ 1” in this statement; that’s to make sure that we take exactly 253 columns. (This is to avoid a fencepost error.)
To plot the figure, I rotated the image by 225° (through theta=225
) and changed the viewing angle to
20° (phi=20
). I adjusted the light
source to be from a 45° angle (ltheta=45
) and set the shading factor to 0.75
(shade=.75
) to exaggerate topological
features. Putting it all together, here is the code I used to plot
Yosemite Valley looking toward Half Dome:
> # create halfdome subset in one expression: > halfdome <- yosemite[(nrow(yosemite) - ncol(yosemite) + 1):562, + seq(from=253,to=1)] > persp(halfdome,col=grey(.25), border=NA, expand=.15, + theta=225, phi=20, ltheta=45, lphi=20, shade=.75)
The resulting image is shown in Figure 13-15.
Another useful function for plotting three-dimensional data is
image
. This function plots a matrix of data points as a grid of
boxes, color coding the boxes based on the intensity at each
location:
image(x, y, z, zlim, xlim, ylim, col = heat.colors(12), add = FALSE, xaxs = "i", yaxs = "i", xlab, ylab, breaks, oldstyle = FALSE, ...)
Here is a description of the arguments to image
.
Argument | Description | Default |
---|---|---|
x, y | (Alternatively, you may pass x an argument that is a list
containing elements named x ,
y , and z .) | |
z | A matrix of values to plot. | |
xlim, ylim | Two-element numeric vectors that specify the range of
values in x and y that should be plotted. | |
zlim | The range of values for z for which colors should be
plotted. | |
col | A vector of colors to plot. Typically generated by
functions like rainbow ,
heat.colors , topo.colors , or terrain.colors . | heat .colors(12) |
add | A logical value that specifies whether the plot should be added to the existing plot. | FALSE |
xaxs, yaxs | Style for the x- and y-axes; see Graphical parameters by name. | xlab="i",
ylab="i" |
xlab, ylab | Labels for the x and
y values. | |
breaks | An integer value specifying the number of break points for colors. (There must be at least one more color than break point.) | |
oldstyle | If oldstyle=TRUE , then
the midpoints of the color intervals are equally spaced between
the limits. If oldstyle=FALSE , then the range is split into color intervals of
equal size. | FALSE |
... | Additional arguments to par. |
To plot the Yosemite Valley data using image
, I needed to make several tweaks. First,
I needed to specify an aspect ratio that matched the dimensions of the
data by setting asp=253/562
(note
that this is a standard graphics parameter passed to par
). Then I specified a range of points on
the y dimension to make sure that data was plotted
from top to bottom (y=c(1,0)
).
Finally, I specified a set of 32 grayscale colors for this plot
(col=sapply((0:32)/32,gray)
). Here is
an expression that generates an image plot from the Yosemite Valley
data:
> image(yosemite, asp=253/562, ylim=c(1,0), col=sapply((0:32)/32, gray))
The results are shown in Figure 13-16.
A closely related tool for looking at multidimensional data,
particularly in biology, is the heat map. A heat map plots a single
variable on two axes, each representing a different factor. The heatmap
function plots a grid, where each box is encoded with a
different color depending on the size of the dependent variable. It may
also plot a tree structure (called a
dendrogram) to the side of each plot showing the hierarchy of
values. As you might have guessed, the function for plotting heat maps
in R is heatmap
:
heatmap(x, Rowv=NULL, Colv=if(symm)"Rowv" else NULL, distfun = dist, hclustfun = hclust, reorderfun = function(d,w) reorder(d,w), add.expr, symm = FALSE, revC = identical(Colv, "Rowv"), scale=c("row", "column", "none"), na.rm = TRUE, margins = c(5, 5), ColSideColors, RowSideColors, cexRow = 0.2 + 1/log10(nr), cexCol = 0.2 + 1/log10(nc), labRow = NULL, labCol = NULL, main = NULL, xlab = NULL, ylab = NULL, keep.dendro = FALSE, verbose = getOption("verbose"), ...)
Another useful function for plotting three-dimensional data is
contour
. The contour
function plots contour lines,
connecting equal values in the data:
contour(x = seq(0, 1, length.out = nrow(z)), y = seq(0, 1, length.out = ncol(z)), z, nlevels = 10, levels = pretty(zlim, nlevels), labels = NULL, xlim = range(x, finite = TRUE), ylim = range(y, finite = TRUE), zlim = range(z, finite = TRUE), labcex = 0.6, drawlabels = TRUE, method = "flattest", vfont, axes = TRUE, frame.plot = axes, col = par("fg"), lty = par("lty"), lwd = par("lwd"), add = FALSE, ...)
Here is a table showing the arguments to contour
.
Argument | Description | Default |
---|---|---|
x, y | Numeric vectors specifying the location of grid lines at
which values in the matrix z
are measured. (Alternatively, you may specify a single matrix
for x and omit y and z .) | x=seq(0, 1, length.out=nrow(z))
y=seq(0, 1, length.out=ncol(z)) |
z | A numeric vector containing values to be plotted. | |
nlevels | Number of contour levels. (Used only if levels is not specified.) | 10 |
levels | A numeric vector of levels at which to draw lines. | pretty(zlim,
nlevels) |
labels | A vector of labels for the contour lines. | NULL |
xlim, ylim, zlim | Numeric vectors of two elements specifying the range of
x , y , and z values to include in the
plot. | xlim = range(x, finite = TRUE)
ylim = range(y, finite = TRUE) zlim = range(z, finite =
TRUE) |
labcex | Text scaling factor for contour labels. | 0.6 |
drawlabels | A logical value specifying whether to draw contour labels. | TRUE |
method | Character value specifying where to draw contour labels.
Options include method="simple" , method="edge" , and method="flattest" . | "flattest" |
vfont | A character vector with two elements specifying the font
to use for contour labels. vfont[1] specifies a Hershey font
family; vfont[2] specifies a
typeface within the family. | |
axes | A logical value indicating whether to print axes. | TRUE |
frame.plot | A logical value indicating whether to draw a box around the plot. | axes |
col | A color for the contour lines. | par("fg") |
lty | A type of lines to draw. | par("lty") |
lwd | A width for the lines. | par("lwd") |
add | A logical value specifying whether to add the contour
lines to an existing plot (add=TRUE ) or to create a new plot
(add=FALSE ). | FALSE |
... | Additional arguments passed to plot.window , title , Axis , and box . |
The following expression generates a contour plot using the Yosemite Valley data:
> contour(yosemite, asp=253/562, ylim=c(1, 0))
As with image
, we needed to
flip the y-axis and to specify an aspect ratio. The
results are shown in Figure 13-17.
Contours are commonly added to existing image plots.
When performing data analysis, it’s often very important to understand the shape of a data distribution. Looking at a distribution can tell you whether there are outliers in the data, or whether a certain modeling technique will work on your data, or simply how many observations are within a certain range of values.
The best-known technique for visualizing a distribution is the
histogram. In R, you can plot a histogram with the hist
function. As an example, let’s look at
the number of plate appearances (PAs) for batters during the 2008 MLB
season. Plate appearances count the number of times a player had the
opportunity to bat; plate appearances include all times a player had a
hit, made an out, reached on error, walked, was hit by pitch, hit a
sacrifice fly, or hit a sacrifice bunt.
You can load this data set from the nutshell
package:
> library(nutshell) > data(batting.2008)
Let’s calculate the plate appearances for each player and then plot a histogram. The resulting histogram is shown in Figure 13-18:
> # PA (plate appearances) = > # AB (at bats) + BB (base on balls) + HBP (hit by pitch) + > # SF (sacrifice flies) + SH (sacrifice bunts) > batting.2008 <- transform(batting.2008, + PA=AB+BB+HBP+SF+SH) > hist(batting.2008$PA)
The histogram shows that there were a large number of players with fewer than 50 plate appearances. If you were to perform further analysis on this data (for example, looking at the average on-base percentage [OBP]), you might want to exclude these players from your analysis. As we will show in Proportion Test Design, you will need much larger sample sizes than 50 plate appearances to draw conclusions with the data.
Let’s try generating a second histogram, this time excluding
players with fewer than 25 at bats. We’ll also increase the number of
bars, using the breaks
argument to
specify that we want 50 bins:
> hist(batting.2008[batting.2008$PA>25, "PA"], breaks=50, cex.main=.8)
The second histogram is shown in Figure 13-19.
A closely related type of chart is the density plot. Many statisticians recommend using density plots instead of histograms because they are more robust and easier to read. To plot a density plot from the plate appearance data (for batters with more than 25 plate appearances), we use two functions.
Figure 13-19. Histogram showing the number of plate appearances for players with over 25 plate appearances in 2008
First, we use density
to
calculate the kernel density estimates. Next, we use plot
to plot the estimates. We could plot the
diagram with an expression like this:
> plot(density(batting.2008[batting.2008$PA>25, "PA"]))
A common addition to a kernel density plot is a rug. A rug is essentially a strip plot shown along the axis, with each point represented by a short line segment. You can add a rug to the kernel density plot with an expression like:
> rug(batting.2008[batting.2008$PA>25, "PA"])
The final version of the density plot is shown in Figure 13-20.
Another way to view a distribution is the quantile-quantile (Q-Q) plot. Quantile-quantile plots compare the distribution of the sample data to the distribution of a theoretical distribution (often a normal distribution). As the name implies, they plot the quantiles from the sample data set against the quantiles from a theoretical distribution. If the sample data is distributed the same way as the theoretical distribution, all points will be plotted on a 45° line from the lower-left corner to the upper-right corner. Quantile-quantile plots provide a very efficient way to tell how a distribution deviates from an expected distribution.
You can generate these plots in R with the qqnorm
function. Without arguments, this function will plot the
distribution of points in each quantile, assuming a theoretical normal
distribution. The plot is shown in Figure 13-21:
> qqnorm(batting.2008$AB)
If you would like to compare two actual distributions, or compare
the data distribution to a different theoretical distribution, then try
the function qqplot
.
Another very useful way to visualize a distribution is a box plot. A box plot is a compact way to show the distribution of a variable. The box shows the interquartile range. The interquartile range contains values between the 25th and 75th percentile; the line inside the box shows the median. The two “whiskers” on either side of the box show the adjacent values. A box plot is shown in Figure 13-22.
The adjacent values are intended to show extreme values, but they don’t always extend to the absolute maximum or minimum value. When there are values far outside the range we would expect for normally distributed data, those outlying values are plotted separately. Specifically, here is how the adjacent values are calculated: the upper adjacent value is the value of the largest observation that is less than or equal to the upper quartile plus 1.5 times the length of the interquartile range; the lower adjacent value is the value of the smallest observation that is greater than or equal to the lower quartile less 1.5 times the length of the interquartile range. Values outside the range of the whiskers are called outside values and are plotted individually.
To plot a box plot, use the boxplot
function. Here is the default method
of boxplot
for vectors:
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch = FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"), col = NULL, log = "", pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5), horizontal = FALSE, add = FALSE, at = NULL)
And here is the form of boxplot
when a formula is specified:
boxplot(formula, data = NULL, ..., subset, na.action = NULL)
Here is a description of the arguments to boxplot
.
Argument | Description | Default |
---|---|---|
formula | A formula of the form y ~
grp , where y is a
variable to be plotted and grp is a variable describing a set of
different plotting groups. | |
data | A data frame (or list) in which the variables used in
formula are defined. | |
subset | A vector specifying a subset of observations to use in plotting. | |
x | A vector specifying values to plot. | |
... | Additional vectors to plot (or graphical parameters to
pass to bxp ). Each additional
vector is plotted as an additional box. | |
range | A numeric value that determines the maximum amount that the whiskers extend from the boxes. | 1.5 |
width | A numeric vector specifying the widths of the boxes being plotted. | NULL |
varwidth | If varwidth=TRUE , then
each box is drawn with a width proportional to the square root
of the number of observations represented by the box. If
varwidth=FALSE , boxes are
plotted with the same width. | FALSE |
notch | If notch=TRUE , then a
“notch” is drawn in the boxes. Notches are drawn at +/-1.58 IQR/sqrt(n) ; see the help file
for boxplot.stats for an
explanation of what they mean. | FALSE |
outline | A logical value specifying whether outliers should be drawn. | TRUE |
names | A character vector specifying the group names used to label each box plot. | |
plot | If plot=TRUE , then the
box plots are plotted. If plot=FALSE , then boxplot returns a
list of statistics that could be used to draw a box plot; see
the help file for boxplot for
more details. | TRUE |
border | A character vector specifying the color to use for the outline of each box plot. | par("fg") |
col | A character vector specifying the color to use for the background of each box plot. | NULL |
log | A character value indicating whether the
x-axis (log="x") , y-axis
(log="y") , both axes (log="xy") , or neither axis (log="") should be plotted with a
logarithmic scale. | "" |
pars | A list of additional graphical parameters passed to
bxp . | list(boxweb = 0.8, stapleweb =
0.5, outwex = 0.5) |
horizontal | A logical value indicating whether the boxes should be
drawn horizontally (horizontal=TRUE) or vertically
(horizontal=FALSE) . | FALSE |
add | A logical value specifying whether the box plot should be
added to an existing chart (add=TRUE) or if a new chart should be
drawn (add=FALSE) . | FALSE |
at | A numeric vector specifying the locations at which each box plot should be drawn. | 1:n, where n is the number of boxes |
As an example, let’s look at the team batting data from 2008. We’ll restrict the data to include only American League teams (it’s too hard to read a plot with 30 boxes, so this cuts it to 16) and include only players with over 100 plate appearances (to cut out marginal players with a small number of plate appearances). Finally, let’s adjust the text size on the axis so that all the labels fit. Here is the expression:
> batting.2008 <- transform(batting.2008, + OBP=(H+BB+HBP)/(AB+BB+HBP+SF)) > boxplot(OBP~teamID, + data=batting.2008[batting.2008$PA>100 & batting.2008$lgID=="AL",], + cex.axis=.7)
The results are shown in Figure 13-23.
[35] Data from both can be found in the Statistical Abstract of the United States, available online at http://www.census.gov/compendia/statab/.
[36] As with many other examples in this book, this data was taken from the Statistical Abstract of the United States, 2009. This data comes from http://www.census.gov/compendia/statab/tables/09s0785.xls.
[37] A lot of people dislike pie charts. I think they are good for
saying, “Look how much bigger this number is than this number,” and
they are very good at taking up lots of space on a page. Pie charts
are not good at showing subtle differences between the size of
different slices; search for “why pie charts are bad” on Google, and
you’ll come up with dozens of sites explaining what’s wrong with
them. Or just check the help file for pie
, which says, “Pie charts are a very
bad way of displaying information. The eye is good at judging linear
measures and bad at judging relative areas. A bar chart or dot chart
is a preferable way of displaying this type of data.”