This section describes high-level lattice functions. (We’ll cover panel functions in the next section.) We’ll start with functions for plotting a single vector of values, then move on to functions for plotting two variables, then three variables, and finally some functions that build on these.
In this section, I’m going to use the same data set for most of the examples: births in the United States during 2006.[43] The original data file contains a record for every birth in the United States during 2006, but the version included in the nutshell package contains only a 10% sample. Each record includes the following variables:
- Month of birth
- Day of week of birth
- Mother’s age
- Total birth order
- Weight gain (by mother)
- Sex of the child (M or F)
- Apgar score
- Mother’s education
- Number of prenatal visits
- Estimated weeks of gestation
- Delivery method
- “Plural” births (i.e., single, twins, triplets, etc.)
- Birth weight (in grams)
It takes a little while to process the raw data, so I’ve included this 10% sample in the nutshell package under the name births2006.smpl.
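If you’d like to follow along, here is a minimal sketch for loading and inspecting the sample (assuming the nutshell package is installed):
> library(nutshell)
> data(births2006.smpl)
> names(births2006.smpl)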
To draw bar charts with Trellis graphics, use the barchart function. The default method for barchart accepts a formula and a data frame as arguments:
barchart(x, data, panel = lattice.getOption("panel.barchart"), box.ratio = 2, ...)
You specify the formula with the argument x and the data frame with the argument data. (I’ll explain the rest of the arguments below.) However, you can also call barchart on an object of class table:
barchart(x, data, groups = TRUE, origin = 0, stack = TRUE, ..., horizontal = TRUE)
To call barchart with an object of class table, simply call barchart with the argument x set to a table. (You shouldn’t specify an argument for data; if you do, barchart will print a warning and ignore the argument.)
By default, the charts are actually drawn by the panel function panel.barchart:
panel.barchart(x, y, box.ratio = 1, box.width, horizontal = TRUE,
    origin = NULL, reference = TRUE, stack = FALSE, groups = NULL,
    col = if (is.null(groups)) plot.polygon$col
          else superpose.polygon$col,
    border = if (is.null(groups)) plot.polygon$border
             else superpose.polygon$border,
    lty = if (is.null(groups)) plot.polygon$lty
          else superpose.polygon$lty,
    lwd = if (is.null(groups)) plot.polygon$lwd
          else superpose.polygon$lwd,
    ...)
Let’s start by calculating a table of the number of births by day of week and then printing a bar chart of the result. It’s the first time that we’re using lattice graphics, so let’s start by loading the lattice package:
> library(lattice)
> births.dow <- table(births2006.smpl$DOB_WK)
> barchart(births.dow)
The results are shown in Figure 14-5. This is the default format for the barchart function: horizontal bars, a frame along the outside, tick marks, and turquoise-colored bars (on screen).
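Most of these defaults can be overridden through arguments that barchart passes through to panel.barchart. For example, here’s a quick sketch (relying on the col argument from the panel.barchart signature above) that draws gray bars instead:
> barchart(births.dow, col="gray")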
Notice that many more babies are born on weekdays than on weekends. That’s a little surprising: you might think that the number of births would be nearly the same, regardless of the day of the week. We’ll use lattice graphics to explore this data set further, to see if we can better understand this phenomenon.
You might wonder if there is a difference in the number of births because of the delivery method; maybe doctors just schedule a lot of cesarean sections on weekdays, and natural births occur all the time. This is the type of question that the lattice package is great for answering. Let’s start by eliminating records where the delivery method was unknown and then tabulate the number of births by day of week and method:
> births2006.dm <- transform(
+   births2006.smpl[births2006.smpl$DMETH_REC != "Unknown",],
+   DMETH_REC=as.factor(as.character(DMETH_REC)))
> dob.dm.tbl <- table(WK=births2006.dm$DOB_WK, MM=births2006.dm$DMETH_REC)
Now let’s plot the results:
> barchart(dob.dm.tbl)
The chart is shown in Figure 14-6. By default, barchart prints stacked bars with no legend. In Trellis terminology, the different colors show different groups. It does look like both types of births are less common on weekends, but it’s tough to compare the number of each type of birth in this chart. Also, notice that the different shades aren’t labeled, so it’s not immediately obvious what each shade represents. Let’s try to change the way the chart is displayed.
As an alternative, let’s try unstacking the bars (by specifying stack=FALSE) and adding a legend (by specifying auto.key=TRUE):
> barchart(dob.dm.tbl, stack=FALSE, auto.key=TRUE)
The results are shown in Figure 14-7. It’s a little easier to see that both types of births decrease on weekends, but it’s still a little difficult to compare values within each group. (When I try to focus on each group, I get distracted by the other group.) Different colored groups aren’t the best choice for this data, so let’s try a different approach.
Now let’s try changing this chart in two ways. First, we’ll split it into two different panels by telling barchart not to group by color, using the groups=FALSE argument. Second, we’ll change to columns (using the horizontal=FALSE argument) so we can easily compare the different values:
> barchart(dob.dm.tbl, horizontal=FALSE, groups=FALSE)
The new chart is shown in Figure 14-8. The two different charts are in different panels. Now, we can more clearly see what’s going on. The number of vaginal births decreases on weekends, by maybe 25% to 30%. However, C-sections drop by 50% to 60%. As you can see, lattice graphics let you quickly try different ways to present information, helping you zero in on the method that best illustrates what is happening in the data.
A good alternative to the bar chart is the Cleveland dot plot. Like bar charts, dot plots are useful for showing data where there is a single point for each category. Visually, they seem a lot less “busy” to me than bar charts, so I like using them to summarize larger data tables. To show dot plots in R, use the dotplot function:
dotplot(x, data, panel = lattice.getOption("panel.dotplot"), ...)
Much like barchart, the default method expects you to specify the data in a formula and a data frame, but there is a method for plotting tables as well:
## S3 method for class 'table':
dotplot(x, data, groups = TRUE, ..., horizontal = TRUE)
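As a quick sketch, the births.dow table that we computed for the bar chart example can be passed straight to this method:
> dotplot(births.dow)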
As a richer example of dotplot, let’s look again at the data on births by day of week. Is the pattern we saw above a seasonal pattern? First, we’ll create a new table counting births by month, week, and delivery method:
> dob.dm.tbl.alt <- table(WEEK=births2006.dm$DOB_WK,
+   MONTH=births2006.dm$DOB_MM,
+   METHOD=births2006.dm$DMETH_REC)
Next, we’ll plot the results using a dot plot. In this plot, we’ll keep on grouping, so that different delivery methods are shown in different colors (groups=TRUE). To help highlight differences, we’ll disable stacking (stack=FALSE). Finally, we’ll print a key so that it’s obvious what each symbol represents (auto.key=TRUE):
> dotplot(dob.dm.tbl.alt, stack=FALSE, auto.key=TRUE, groups=TRUE)
The results are shown in Figure 14-9. (To make the results print nicely, I generated these charts with the default black-and-white color scheme. If you try this yourself, the chart may look slightly different; depending on your platform, you’ll probably see hollow blue circles for C-section births and hollow purple circles for vaginal births.) As you can see, there are slight seasonal differences, but the overall pattern remains the same.
As another example of dot plots, let’s look at the tire failure data. In 2003, the National Highway Traffic Safety Administration (NHTSA) began a study into the durability of radial tires on light trucks. (This was three years after the Firestone recall of tires for Ford Explorers.) The NHTSA performed the tests in Phoenix, because it felt that the hot and dry conditions would be unusually stressful for tires (and because it had noted that many tire failures occur in the American Southwest). Over the next few years, it conducted hundreds of different tests on tires and released the data to the public. (See http://www.nhtsa.gov/portal/site/nhtsa/menuitem.8027fe7cfb6e727568d07a30343c44cc for links to this study.)
Tests were carried out on six different types of tires. Following is a table of the characteristics of the tires.
Tire | Size | Load Index | Speed Rating | Brand | Model | OE Vehicle | OE Model
---|---|---|---|---|---|---|---
B | P195/65R15 | 89 | S | BF Goodrich | Touring T/A | Chevy | Cavalier
C | P205/65R15 | 92 | V | Goodyear | Eagle GA | Lexus | ES300
D | P235/75R15 | 108 | S | Michelin | LTX M/S | Ford, Dodge | E 150 Van, Ram Van 1500
E | P265/75R16 | 114 | S | Firestone | Wilderness AT | Chevy/GMC | Silverado, Tahoe, Yukon
H | LT245/75R16/E | 120/116 | Q | Pathfinder | ATR A/S OWL | NA | NA
L | 255/65R16 | 109 | H | General | Grabber ST A/S | Mercedes | ML320
As an example, we’re going to look at one particular batch of tests from this study, called the “Stepped-Up Speed to Failure” test. In this test, tires were mounted on testing devices, and the testing facility first ran a number of basic checks to confirm that the tires were intact. The facility then tested the tires at increasing speeds until they failed. Specifically, each tire was run at a specific speed for 1 hour, and the speed was then increased in 10-km/h increments until either (a) the tire failed or (b) a prescribed limit was reached for that tire. (The limit depended on the tire’s speed rating.) After the limit was reached, the test was run continuously until the tire failed. The test data set is included in the nutshell package under the name tires.sus.
The data set contains a lot of information, but we’re going to focus on only three variables. Time_To_Failure is the time before each tire failed (in hours), Speed_At_Failure_km_h is the testing speed at which the tire failed, and Tire_Type is the type of tire tested. We know that tests were run at only certain stepped speeds; despite the fact that speed is a numeric variable, we can treat it as a factor. So we can use dot plots to show the one continuous variable (time to failure) by the speed at failure for each different type of tire:
> library(nutshell)
> data(tires.sus)
> dotplot(as.factor(Speed_At_Failure_km_h)~Time_To_Failure|Tire_Type,
+   data=tires.sus)
The result is shown in Figure 14-10. This diagram lets us clearly see how quickly tires failed in each of the tests. For example, all type D tires failed quickly at the testing speed of 180 km/h, but some type H tires lasted a long time before failure. We’ll revisit this example in Comparing means.
A very popular chart for showing the distribution of a variable is the histogram. You can plot histograms in the lattice package with the histogram function:
histogram(x, data, allow.multiple, outer = TRUE, auto.key = FALSE,
    aspect = "fill", panel = lattice.getOption("panel.histogram"),
    prepanel, scales, strip, groups, xlab, xlim, ylab, ylim,
    type = c("percent", "count", "density"),
    nint = if (is.factor(x)) nlevels(x) else round(log2(length(x)) + 1),
    endpoints = extend.limits(range(as.numeric(x), finite = TRUE),
        prop = 0.04),
    breaks, equal.widths = TRUE,
    drop.unused.levels = lattice.getOption("drop.unused.levels"),
    ..., lattice.options = NULL, default.scales = list(),
    subscripts, subset)
By default, histograms are drawn by panel.histogram:
panel.histogram(x, breaks, equal.widths = TRUE, type = "density",
    nint = round(log2(length(x)) + 1), alpha, col, border, lty, lwd,
    ...)
As an example of histograms, let’s look at the distribution of birth weights, grouped by plurality (single births, twins, triplets, and so on):
> histogram(~DBWT|DPLURAL, data=births2006.smpl)
The results are shown in Figure 14-11. Notice that the panels are ordered alphabetically by the conditioning variable. (That’s why the group names have the numbers at the front.) Also notice that the histogram function tries to fill in all the available space with squarish panels. This helps make each chart readable by itself but makes it difficult to compare the different groups.
To make it easier to compare groups, we can explicitly stack the charts on top of each other using the layout argument:
> histogram(~DBWT|DPLURAL, data=births2006.smpl, layout=c(1, 5))
The resulting chart is shown in Figure 14-12. As you can see, birth weights are roughly normally distributed within each group, but the mean weight drops as the number of births increases.
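Incidentally, histogram shows percentages on the y-axis by default. The type argument in the signature above also accepts "count" and "density"; here is a sketch of the same chart with raw counts:
> histogram(~DBWT|DPLURAL, data=births2006.smpl, layout=c(1, 5),
+   type="count")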
If you’d like to see a single line showing the distribution, instead of a set of columns representing bins, you can use kernel density plots. To draw them in R, use the densityplot function:
densityplot(x, data, allow.multiple = is.null(groups) || outer,
    outer = !is.null(groups), auto.key = FALSE, aspect = "fill",
    panel = lattice.getOption("panel.densityplot"),
    prepanel, scales, strip, groups, weights, xlab, xlim, ylab, ylim,
    bw, adjust, kernel, window, width, give.Rkern, n = 50,
    from, to, cut, na.rm,
    drop.unused.levels = lattice.getOption("drop.unused.levels"),
    ..., lattice.options = NULL, default.scales = list(),
    subscripts, subset)
By default, panels are drawn by panel.densityplot:
panel.densityplot(x, darg, plot.points = "jitter", ref = FALSE,
    groups = NULL, weights = NULL, jitter.amount, type, ...)
Let’s redraw the example above, replacing the histogram with a density plot. By default, densityplot will draw a strip chart under each chart, showing every data point. However, because the data set is so big (there are 427,432 observations), we’ll tell densityplot not to do this by specifying plot.points=FALSE:
> densityplot(~DBWT|DPLURAL, data=births2006.smpl,
+   layout=c(1,5), plot.points=FALSE)
The results are shown in Figure 14-13. One advantage of density plots over histograms is that you can superimpose them on one another and still read the results. By changing the conditioning variable (DPLURAL) to a grouping variable, we can stack these charts on top of one another:
> densityplot(~DBWT, groups=DPLURAL, data=births2006.smpl,
+   plot.points=FALSE, auto.key=TRUE)
The superimposed density plots are shown in Figure 14-14. As you can see, it’s easier to compare distribution shapes (and centers) by superimposing the charts.
A good alternative to the histogram is the strip plot, especially when there isn’t much data to plot. Strip plots look similar to dot plots, but they show different information: dot plots are designed to show one value per category (often a mean or a sum), while strip plots show many values. You can think of strip plots as one-dimensional scatter plots. To draw strip plots in R, use the stripplot function:
stripplot(x, data, panel = lattice.getOption("panel.stripplot"), ...)
By default, panels are drawn by panel.stripplot:
panel.stripplot(x, y, jitter.data = FALSE, factor = 0.5, amount = NULL, horizontal = TRUE, groups = NULL, ...)
As an example of a strip plot, let’s look at the weights of babies born in sets of four or more. There were only 44 observations in our data set that match this description, so a strip plot is a reasonable way to show density. In this case, we’ll use the subset argument to specify the set of observations we want to plot and add some random vertical noise to make the points easier to read by specifying jitter.data=TRUE:
> stripplot(~DBWT, data=births2006.smpl,
+   subset=(DPLURAL=="5 Quintuplet or higher" |
+     DPLURAL=="4 Quadruplet"),
+   jitter.data=TRUE)
The resulting chart is shown in Figure 14-15.
Another useful plot that you can generate within the lattice package is the quantile-quantile plot. A quantile-quantile plot compares the distribution of actual data values to a theoretical distribution. Specifically, it plots quantiles of the observed data against quantiles of a theoretical distribution. If the plotted points form a straight diagonal line (from top right to bottom left), then it is likely that the observed data comes from the theoretical distribution. Quantile-quantile plots are a very powerful technique for seeing how closely a data set matches a theoretical distribution (or how much it deviates from it).
To plot quantile-quantile plots using lattice graphics, use the qqmath function:
qqmath(x, data, allow.multiple = is.null(groups) || outer,
    outer = !is.null(groups), distribution = qnorm, f.value = NULL,
    auto.key = FALSE, aspect = "fill",
    panel = lattice.getOption("panel.qqmath"),
    prepanel = NULL, scales, strip, groups, xlab, xlim, ylab, ylim,
    drop.unused.levels = lattice.getOption("drop.unused.levels"),
    ..., lattice.options = NULL, default.scales = list(),
    subscripts, subset)
By default, panels are drawn by panel.qqmath:
panel.qqmath(x, f.value = NULL, distribution = qnorm, qtype = 7,
    groups = NULL, ...)
By default, the qqmath function compares the sample data to a normal distribution. If the sample data really is normally distributed, you’ll see a straight line. As an example, let’s plot 100,000 random values from a normal distribution to show what qqmath does:
> qqmath(rnorm(100000))
The results are shown in Figure 14-16.
Let’s plot a set of quantile-quantile plots for the birth weight data. Because the data set is rather large, we’ll plot only a random sample of 50,000 points:
> qqmath(~DBWT|DPLURAL,
+   data=births2006.smpl[sample(1:nrow(births2006.smpl), 50000), ],
+   pch=19, cex=0.25,
+   subset=(DPLURAL != "5 Quintuplet or higher"))
As you can see from Figure 14-17, the distribution of birth weights is not quite normal.
As another example, let’s look at real estate prices in San Francisco in 2008 and 2009. This data set is included in the nutshell package as sanfrancisco.home.sales. (See More About the San Francisco Real Estate Prices Data Set for more information on this data set.) Here is how to load the data:
> library(nutshell)
> data(sanfrancisco.home.sales)
Intuitively, it doesn’t make sense for real estate prices to be normally distributed. There are far more people with below-average incomes than above-average incomes. The lowest recorded price in the data set is $100,000; the highest is $9,500,000. Let’s take a look at this distribution with qqmath:
The distribution is shown in Figure 14-18. As expected, the distribution is not normal. It looks exponential, so let’s try a log transform:
> qqmath(~log(price), data=sanfrancisco.home.sales)
A log transform yields a distribution that looks pretty close to
normally distributed (see Figure 14-19).
Let’s take a look at how the distribution changes based on the number of bedrooms. To do this, we’ll split the distribution into groups and change the way the points are plotted. Specifically, we’ll plot smooth lines instead of individual points. (Point type is actually an argument for panel.xyplot, which is used to draw the chart.) We’ll add a key to the plot (using auto.key=TRUE). We’ll pass an explicit subset as an argument to the function instead of using the subset argument. (This helps clean up the key, which would show unused factor levels otherwise.)
> qqmath(~log(price), groups=bedrooms,
+   data=subset(sanfrancisco.home.sales,
+     !is.na(bedrooms) & bedrooms > 0 & bedrooms < 7),
+   auto.key=TRUE, drop.unused.levels=TRUE, type="smooth")
Notice that the lines are separate, with higher values for
higher numbers of bedrooms (see Figure 14-20).
We can do the same thing for square footage (see Figure 14-21). (I used the cut2 function from the Hmisc package to divide square footages into six equally sized quantile groups.)
> library(Hmisc)
> qqmath(~log(price), groups=cut2(squarefeet, g=6),
+   data=subset(sanfrancisco.home.sales, !is.na(squarefeet)),
+   auto.key=TRUE, drop.unused.levels=TRUE, type="smooth")
Here the separation is even more clear. We can see the same separation by neighborhood. (We’ll come back to this analysis in Chapter 20.)
This section describes Trellis plots for plotting two variables. Many real data sets (for example, financial data) record relationships between multiple numeric variables. The tools in this section can help you examine those relationships.
To generate scatter plots with the lattice package, use the xyplot function:
xyplot(x, data, allow.multiple = is.null(groups) || outer,
    outer = !is.null(groups), auto.key = FALSE, aspect = "fill",
    panel = lattice.getOption("panel.xyplot"),
    prepanel = NULL, scales = list(), strip = TRUE, groups = NULL,
    xlab, xlim, ylab, ylim,
    drop.unused.levels = lattice.getOption("drop.unused.levels"),
    ..., lattice.options = NULL, default.scales,
    subscripts = !is.null(groups), subset = TRUE)
Most of the work is done by the panel function panel.xyplot:
panel.xyplot(x, y, type = "p", groups = NULL, pch, col, col.line,
    col.symbol, font, fontfamily, fontface, lty, cex, fill, lwd,
    horizontal = FALSE, ..., jitter.x = FALSE, jitter.y = FALSE,
    factor = 0.5, amount = NULL)
As an example of a scatter plot, let’s take a look at the relationship between house size and price. Let’s start with a simple scatter plot, showing size and price:
> xyplot(price~squarefeet, data=sanfrancisco.home.sales)
The results of this command are shown in Figure 14-22. It looks like there is a rough correspondence between size and price (the plot looks vaguely cone shaped). This chart is hard to read, so let’s try modifying it. Let’s trim outliers (sales prices over $4,000,000 and properties over 6,000 square feet) using the subset argument. Additionally, let’s take a look at how this relationship varies by zip code. San Francisco is a pretty big place, and not all neighborhoods are equally in demand. (You probably know the cliché about the first three rules of real estate: location, location, location.)
Before plotting the price data, let’s pick a subset of zip codes to plot. A few parts of the city are sparsely populated (like the financial district, 94104) and don’t have enough data to make plotting interesting. Also, let’s exclude zip codes where square footage isn’t available:
> table(subset(sanfrancisco.home.sales, !is.na(squarefeet), select=zip))
94100 94102 94103 94104 94105 94107 94108 94109 94110 94111 94112
2 52 62 4 44 147 21 115 161 12 192
94114 94115 94116 94117 94118 94121 94122 94123 94124 94127 94131
143 101 124 114 92 92 131 71 85 108 136
94132 94133 94134 94158
82 47 105 13
So we’ll exclude 94100, 94104, 94108, 94111, 94133, and 94158 because there are too few sales to be interesting. (Note the strip argument. This simply prints the zip codes with the plots.)
> trellis.par.set(fontsize=list(text=7))
> xyplot(price~squarefeet|zip, data=sanfrancisco.home.sales,
+   subset=(zip!=94100 & zip!=94104 & zip!=94108 &
+     zip!=94111 & zip!=94133 & zip!=94158 &
+     price < 4000000 &
+     ifelse(is.na(squarefeet), FALSE, squarefeet < 6000)),
+   strip=strip.custom(strip.levels=TRUE))
The resulting plot is shown in Figure 14-23. Now the linear relationship is much more pronounced. Note the different slopes in different neighborhoods. As you might expect, some up-and-coming neighborhoods (like zip code 94110, which includes the Mission and Bernal Heights) are more shallowly sloped, while ritzy neighborhoods (like zip code 94123, which includes the Marina and Cow Hollow) are more steeply sloped.
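Before we customize the panel function below, note a lighter-weight option: panel.xyplot’s type argument accepts combinations such as c("p", "r"), which draws the points plus a least-squares regression line in each panel. A sketch, with the same outlier trimming as above:
> xyplot(price~squarefeet|zip, data=sanfrancisco.home.sales,
+   type=c("p", "r"),
+   subset=(price < 4000000 &
+     ifelse(is.na(squarefeet), FALSE, squarefeet < 6000)))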
We can make this slightly more readable by using neighborhood names. Let’s rerun the code, conditioning by neighborhood. We’ll also add a diagonal line to each plot (through a custom panel function) to make the charts even easier to read, change the default points plotted to be solid (through the pch=19 argument), and shrink them to a smaller size (through the cex=.2 argument):
> trellis.par.set(fontsize=list(text=7))
> dollars.per.squarefoot <- mean(
+   sanfrancisco.home.sales$price / sanfrancisco.home.sales$squarefeet,
+   na.rm=TRUE)
> xyplot(price~squarefeet|neighborhood,
+   data=sanfrancisco.home.sales,
+   pch=19,
+   cex=.2,
+   subset=(zip != 94100 & zip != 94104 & zip != 94108 &
+     zip != 94111 & zip != 94133 & zip != 94158 &
+     price < 4000000 &
+     ifelse(is.na(squarefeet), FALSE, squarefeet < 6000)),
+   strip=strip.custom(strip.levels=TRUE,
+     horizontal=TRUE,
+     par.strip.text=list(cex=.8)),
+   panel=function(...) {
+     panel.abline(a=0, b=dollars.per.squarefoot)
+     panel.xyplot(...)
+   }
+ )
This plot is shown in Figure 14-24.
The San Francisco home sales data set was taken from a particularly interesting time: the housing market crash. (The market fell a little late in San Francisco compared with other cities.) Let’s take a look at how prices changed over time during this period. We could plot just the median price or mean price, or the number of sales. However, the lattice package gives us tools that will let us watch how the whole distribution changed over time. Specifically, we can use box plots.
Box plots in the lattice package are just like box plots drawn with the graphics package, as described in Box Plots. The boxes represent prices from the 25th through the 75th percentiles (the interquartile range), the dots represent median prices, and the whiskers represent the minimum or maximum values. (When there are values that stretch beyond 1.5 times the length of the interquartile range, the whiskers are truncated at those extremes.)
To show box plots with Trellis graphics, use the bwplot function:
bwplot(x, data, allow.multiple = is.null(groups) || outer,
    outer = FALSE, auto.key = FALSE, aspect = "fill",
    panel = lattice.getOption("panel.bwplot"),
    prepanel = NULL, scales = list(), strip = TRUE, groups = NULL,
    xlab, xlim, ylab, ylim, box.ratio = 1, horizontal = NULL,
    drop.unused.levels = lattice.getOption("drop.unused.levels"),
    ..., lattice.options = NULL, default.scales,
    subscripts = !is.null(groups), subset = TRUE)
This function will, in turn, call panel.bwplot:
panel.bwplot(x, y, box.ratio = 1,
    box.width = box.ratio / (1 + box.ratio),
    horizontal = TRUE, pch, col, alpha, cex, font, fontfamily,
    fontface, fill, varwidth = FALSE, notch = FALSE,
    notch.frac = 0.5, ..., levels.fos, stats = boxplot.stats,
    coef = 1.5, do.out = TRUE)
Let’s show a set of box plots, with one plot per month. We’ll need to round the date ranges to the nearest month. A convenient way to do this in R is with the cut function. Here’s the number of sales by month in this data set:
> table(cut(sanfrancisco.home.sales$saledate, "month"))
2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01 2008-07-01
139 230 267 253 237 198
2008-08-01 2008-09-01 2008-10-01 2008-11-01 2008-12-01 2009-01-01
253 223 272 118 181 114
2009-02-01 2009-03-01 2009-04-01 2009-05-01 2009-06-01 2009-07-01
123 142 116 180 150 85
As you may remember from above, the cutoff dates don’t fall neatly on the beginning and ending of each month:
> min(sanfrancisco.home.sales$saledate)
[1] "2008-02-13"
> max(sanfrancisco.home.sales$saledate)
[1] "2009-07-14"
So don’t focus too much on the volumes in February 2008 or July 2009. (Volume was much lower in the spring.) Let’s take a look at the distribution of sales prices by month. Here’s the code to present this data using the default representation:
> bwplot(price~cut(saledate, "month"), data=sanfrancisco.home.sales)
Unfortunately, this doesn’t produce an easily readable plot, as you can see in Figure 14-25. It’s clear that there are a large number of outliers that are making the plot hard to see. Box plots assume a normal distribution, but this doesn’t make intuitive sense for real estate prices (as we saw in Univariate quantile-quantile plots). Let’s try plotting the box plots again, this time with the log-transformed values. To make it more readable, we’ll change to vertical box plots and rotate the text at the bottom:
> bwplot(log(price)~cut(saledate, "month"),
+   data=sanfrancisco.home.sales,
+   scales=list(x=list(rot=90)))
Taking a look at the plot (shown in Figure 14-26), we can more clearly see some trends. Median prices moved around a lot during this period, though the interquartile range moved less. Moreover, it looks like sales at the high end of the market slowed down quite a bit (looking at the outliers on the top and the top whiskers). But, interestingly, the basic distribution appears pretty stable from month to month.
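Because the number of sales varies quite a bit from month to month, one further refinement is to scale each box’s width by the number of observations behind it; here’s a sketch using the varwidth argument from the panel.bwplot signature above:
> bwplot(log(price)~cut(saledate, "month"),
+   data=sanfrancisco.home.sales,
+   varwidth=TRUE,
+   scales=list(x=list(rot=90)))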
If you would like to generate a matrix of scatter plots for many different pairs of variables, use the splom function:
splom(x, data, auto.key = FALSE, aspect = 1,
    between = list(x = 0.5, y = 0.5),
    panel = lattice.getOption("panel.splom"),
    prepanel, scales, strip, groups, xlab, xlim, ylab = NULL, ylim,
    superpanel = lattice.getOption("panel.pairs"),
    pscales = 5, varnames, drop.unused.levels,
    ..., lattice.options = NULL, default.scales, subset = TRUE)
Most of the work is done by panel.splom:
panel.splom(...)
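As a minimal sketch, here’s a scatter plot matrix of three numeric columns from the San Francisco data set used earlier (the pch and cex settings just keep the dense panels readable):
> splom(~sanfrancisco.home.sales[, c("price", "squarefeet", "bedrooms")],
+   pch=19, cex=.2)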
If you would like to generate quantile-quantile plots for comparing two distributions, use the qq function:
qq(x, data, aspect = "fill",
    panel = lattice.getOption("panel.qq"),
    prepanel, scales, strip, groups, xlab, xlim, ylab, ylim,
    f.value = NULL,
    drop.unused.levels = lattice.getOption("drop.unused.levels"),
    ..., lattice.options = NULL, qtype = 7,
    default.scales = list(), subscripts, subset)
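For example, to compare the birth weight distributions of boys and girls, you could try a sketch like this (assuming the child’s sex is stored in the two-level factor SEX, as in births2006.smpl):
> qq(SEX~DBWT, data=births2006.smpl, subset=(!is.na(DBWT)))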
If you would like to plot three-dimensional data with Trellis graphics, there are several functions available.
To plot three-dimensional data in flat grids, with colors showing different values for the third dimension, use the levelplot function:
levelplot(x, data, allow.multiple = is.null(groups) || outer,
    outer = TRUE, aspect = "fill",
    panel = lattice.getOption("panel.levelplot"),
    prepanel = NULL, scales = list(), strip = TRUE, groups = NULL,
    xlab, xlim, ylab, ylim, at, cuts = 15, pretty = FALSE,
    region = TRUE,
    drop.unused.levels = lattice.getOption("drop.unused.levels"),
    ..., lattice.options = NULL, default.scales = list(),
    colorkey = region, col.regions, alpha.regions, subset = TRUE)
Most of the work is done by panel.levelplot:
panel.levelplot(x, y, z, subscripts, at = pretty(z), shrink, labels,
    label.style = c("mixed", "flat", "align"), contour = FALSE,
    region = TRUE, col = add.line$col, lty = add.line$lty,
    lwd = add.line$lwd, ..., col.regions = regions$col,
    alpha.regions = regions$alpha)
As an example of level plots, we will look at the San Francisco home sales data set. Let’s start by looking at the number of home sales in different parts of the city. To do this, we’ll need to use the coordinate data in the San Francisco home sales data set. Unfortunately, we can’t use the coordinates directly; the coordinates are too precise, so the levelplot function simply plots a large number of points. (Try executing levelplot(price~latitude+longitude) to see what I mean.)
We’ll need to break the data into bins and count the number of homes within each bin. To do this, we’ll use the table and cut functions:
> attach(sanfrancisco.home.sales)
> levelplot(table(cut(longitude, breaks=40),
+     cut(latitude, breaks=40)),
+   scales=list(y=list(cex=.5),
+     x=list(rot=90, cex=.5)))
This plot is shown in Figure 14-27. If we were interested in looking at the average sales price by area, we could use a similar strategy. Instead of table, we’ll use the tapply function to aggregate observations. And while we’re at it, we’ll cut out the axis labels:
> levelplot(tapply(price,
+     INDEX=list(cut(longitude, breaks=40),
+       cut(latitude, breaks=40)),
+     FUN=mean),
+   scales=list(draw=FALSE))
This plot is shown in Figure 14-28. And, of course, you can use conditioning values with level plots. Let’s look at the number of home sales by numbers of bedrooms. We’ll simplify the data slightly by looking at houses with zero to four bedrooms and then houses with five bedrooms or more. We’ll also cut the number of breaks to keep the charts legible:
> bedrooms.capped <- ifelse(bedrooms < 5, bedrooms, 5)
> levelplot(table(cut(longitude, breaks=25),
+     cut(latitude, breaks=25),
+     bedrooms.capped),
+   scales=list(draw=FALSE))
This figure is shown in Figure 14-29.
If you would like to show contour plots with lattice (which resemble topographic maps), then use the contourplot function:
contourplot(x, data, panel = lattice.getOption("panel.contourplot"), cuts = 7, labels = TRUE, contour = TRUE, pretty = TRUE, region = FALSE, ...)
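As a quick sketch, contourplot (like levelplot) also has methods for matrices and tables, so R’s built-in volcano elevation matrix makes a convenient test case:
> contourplot(volcano, cuts=10)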
To plot points in three dimensions (technically, projections into two dimensions of the points in three dimensions), use the cloud function:
cloud(x, data, allow.multiple = is.null(groups) || outer,
    outer = FALSE, auto.key = FALSE, aspect = c(1, 1),
    panel.aspect = 1, panel = lattice.getOption("panel.cloud"),
    prepanel = NULL, scales = list(), strip = TRUE, groups = NULL,
    xlab, ylab, zlab,
    xlim = if (is.factor(x)) levels(x) else range(x, finite = TRUE),
    ylim = if (is.factor(y)) levels(y) else range(y, finite = TRUE),
    zlim = if (is.factor(z)) levels(z) else range(z, finite = TRUE),
    at, drape = FALSE, pretty = FALSE, drop.unused.levels,
    ..., lattice.options = NULL,
    default.scales = list(distance = c(1, 1, 1), arrows = TRUE,
        axs = axs.default),
    colorkey, col.regions, alpha.regions, cuts = 70,
    subset = TRUE, axs.default = "r")
By default, plots are drawn with panel.cloud:
panel.cloud(x, y, subscripts, z, groups = NULL, perspective = TRUE,
    distance = if (perspective) 0.2 else 0, xlim, ylim, zlim,
    panel.3d.cloud = "panel.3dscatter",
    panel.3d.wireframe = "panel.3dwire",
    screen = list(z = 40, x = -60), R.mat = diag(4),
    aspect = c(1, 1), par.box = NULL, xlab, ylab, zlab,
    xlab.default, ylab.default, zlab.default, scales.3d,
    proportion = 0.6, wireframe = FALSE, scpos, ..., at)
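As a sketch, we can view the San Francisco sales in three dimensions, with log-transformed price as the z-axis over the coordinate grid (dropping records with missing coordinates first):
> cloud(log(price)~longitude*latitude,
+   data=subset(sanfrancisco.home.sales,
+     !is.na(longitude) & !is.na(latitude)),
+   pch=19, cex=.2)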
If you have fitted a model to a data set, the rfs function can help you visualize how well the model fits the data:
rfs(model, layout=c(2, 1), xlab="f-value", ylab=NULL, distribution = qunif, panel, prepanel, strip, ...)
The rfs function plots residual and fit-spread (RFS) plots. As an example, we’ll use the model described in Example: A Simple Linear Model. The example is a linear model for runs scored in baseball games as a function of team offensive statistics. For a full explanation, see Chapter 20; here we just want to show what charts are plotted for linear models with the rfs function:
> rfs(runs.mdl)
The plot generated by this command is shown in Figure 14-30. Notice that the two curves are S shaped. The residual plot is a quantile-quantile plot of the residuals; we’d expect the plot to be linear if the data fit the assumed distribution. The default distribution choice for rfs is a uniform distribution, which clearly isn’t right. Let’s try generating a second set of plots, assuming a normal distribution for the residuals:
> rfs(runs.mdl, distribution=qnorm)
The results are shown in Figure 14-31. Notice that the plots are roughly linear. We expect a normally distributed error function for a linear regression model, so this is a good thing.
[43] This data set is available from http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm. I used the 2006 Birth Data File in this book. The data file is 3.1 GB uncompressed, which is way too big to load easily into R on a machine with only 4 GB. I used a Perl script to parse this file and return a limited number of records in CSV format.