High-Level Lattice Plotting Functions

This section describes high-level lattice functions. (We’ll cover panel functions in the next section.) We’ll start with functions for plotting a single vector of values, then move on to plotting two variables, then three, and finally to some functions that build on the others.

In this section, I’m going to use the same data set for most of the examples: births in the United States during 2006.[43] The original data file contains a record for every birth in the United States during 2006, but the version included in the nutshell package contains only a 10% sample. Each record includes the following variables:

DOB_MM

Month of birth

DOB_WK

Day of week of birth

MAGER

Mother’s age

TBO_REC

Total birth order

WTGAIN

Weight gain (by mother)

SEX

Sex of the child (M or F)

APGAR5

Apgar score

DMEDUC

Mother’s education

UPREVIS

Number of prenatal visits

ESTGEST

Estimated weeks of gestation

DMETH_REC

Delivery method

DPLURAL

“Plural” births (i.e., single, twins, triplets, etc.)

DBWT

Birth weight (in grams)

It takes a little while to process the raw data, so the nutshell package includes this 10% sample under the name births2006.smpl.
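
If you’d like to follow along, you can load the sample directly from the nutshell package:

> library(nutshell)
> data(births2006.smpl)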

To draw bar charts with Trellis graphics, use the function barchart. The default method for barchart accepts a formula and a data frame as arguments:

barchart(x,
         data,
         panel = lattice.getOption("panel.barchart"),
         box.ratio = 2,
         ...)

You specify the formula with the argument x and the data frame with the argument data. (I’ll explain the rest of the arguments below.) However, you can also call barchart on an object of class table:

barchart(x, data, groups = TRUE,
         origin = 0, stack = TRUE, ..., horizontal = TRUE)

To call barchart with an object of class table, simply call barchart with the argument x set to a table. (You shouldn’t specify an argument for data; if you do, barchart will print a warning and ignore the argument.)

By default, the charts are actually drawn by the panel function panel.barchart:

panel.barchart(x, y, box.ratio = 1, box.width,
               horizontal = TRUE,
               origin = NULL, reference = TRUE,
               stack = FALSE,
               groups = NULL,
               col = if (is.null(groups)) plot.polygon$col
                     else superpose.polygon$col,
               border = if (is.null(groups)) plot.polygon$border
                        else superpose.polygon$border,
               lty = if (is.null(groups)) plot.polygon$lty
                     else superpose.polygon$lty,
               lwd = if (is.null(groups)) plot.polygon$lwd
                     else superpose.polygon$lwd,
               ...)

Let’s start by calculating a table of the number of births by day of week and then printing a bar chart to show the number of births by day of week. It’s the first time that we’re using lattice graphics, so let’s start by loading the lattice package:

> library(lattice)
> births.dow <- table(births2006.smpl$DOB_WK)
> barchart(births.dow)

The results are shown in Figure 14-5. This is the default format for the barchart function: horizontal bars, a frame along the outside, tick marks, and turquoise-colored bars (on screen).
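
If you’d prefer the formula interface described above, one option (a sketch) is to convert the table to a data frame first; as.data.frame gives the resulting columns the default names Var1 and Freq:

# the same horizontal bar chart, drawn through the formula interface
births.dow.df <- as.data.frame(births.dow)
barchart(Var1~Freq, data=births.dow.df)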

Notice that many more babies are born on weekdays than on weekends. That’s a little surprising: you might think that the number of births would be nearly the same, regardless of the day of the week. We’ll use lattice graphics to explore this data set further, to see if we can better understand this phenomenon.

You might wonder if there is a difference in the number of births because of the delivery method; maybe doctors just schedule a lot of cesarean sections on weekdays, and natural births occur all the time. This is the type of question that the lattice package is great for answering. Let’s start by eliminating records where the delivery method was unknown and then tabulate the number of births by day of week and method:

> births2006.dm <- transform(
+   births2006.smpl[births2006.smpl$DMETH_REC != "Unknown",],
+    DMETH_REC=as.factor(as.character(DMETH_REC)))
> dob.dm.tbl <- table(WK=births2006.dm$DOB_WK, MM=births2006.dm$DMETH_REC)
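
As an aside, if your version of R is recent enough to include droplevels (R 2.12.0 or later), a more direct way to drop the unused "Unknown" level is a sketch like this:

# equivalent to the transform() call above: subset the data,
# then drop unused levels from all factor columns
births2006.dm <- droplevels(
  subset(births2006.smpl, DMETH_REC != "Unknown"))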

Now let’s plot the results:

> barchart(dob.dm.tbl)

The chart is shown in Figure 14-6. By default, barchart prints stacked bars with no legend. In Trellis terminology, the different colors show different groups. It does look like both types of births are less common on weekends, but it’s tough to compare the number of each type of birth in this chart. Also, notice that the different shades aren’t labeled, so it’s not immediately obvious what each shade represents. Let’s try to change the way the chart is displayed.

As an alternative, let’s try unstacking the bars (by specifying stack=FALSE) and adding a legend (by specifying auto.key=TRUE):

> barchart(dob.dm.tbl, stack=FALSE, auto.key=TRUE)

The results are shown in Figure 14-7. It’s a little easier to see that both types of births decrease on weekends, but it’s still a little difficult to compare values within each group. (When I try to focus on each group, I get distracted by the other group.) Different colored groups aren’t the best choice for this data, so let’s try a different approach.

Let’s try changing this chart in two ways. First, we’ll split it into two different panels by telling barchart not to group by color, using the groups=FALSE argument. Second, we’ll change the bars to columns (using the horizontal=FALSE argument), so we can compare values more easily:

> barchart(dob.dm.tbl, horizontal=FALSE, groups=FALSE)

The new chart is shown in Figure 14-8, with each delivery method in its own panel. Now we can see more clearly what’s going on. The number of vaginal births decreases on weekends, by maybe 25% to 30%. However, C-sections drop by 50% to 60%. As you can see, lattice graphics let you quickly try different ways to present information, helping you zero in on the method that best illustrates what is happening in the data.

Cleveland dot plots are a good alternative to bar charts. Like bar charts, dot plots are useful for showing data where there is a single point for each category. Visually, they seem a lot less “busy” to me than bar charts, so I like using them to summarize larger data tables. To show dot plots in R, use the function dotplot:

dotplot(x,
        data,
        panel = lattice.getOption("panel.dotplot"),
        ...)

Much like barchart, the default method expects you to specify the data in a formula and a data frame, but there is a method for plotting tables as well:

## S3 method for class 'table':
dotplot(x, data, groups = TRUE, ..., horizontal = TRUE)

As an example of dotplot, let’s look at the births-by-day-of-week data again. Is the pattern we saw above seasonal? First, we’ll create a new table counting births by day of week, month, and delivery method:

> dob.dm.tbl.alt <- table(WEEK=births2006.dm$DOB_WK,
+   MONTH=births2006.dm$DOB_MM,
+   METHOD=births2006.dm$DMETH_REC)

Next, we’ll plot the results using a dot plot. In this plot, we’ll keep on grouping, so that different delivery methods are shown in different colors (groups=TRUE). To help highlight differences, we’ll disable stacking values (stack=FALSE). Finally, we’ll print a key so that it’s obvious what each symbol represents (auto.key=TRUE):

> dotplot(dob.dm.tbl.alt, stack=FALSE, auto.key=TRUE, groups=TRUE)

The results are shown in Figure 14-9. (To make the results print nicely, I generated these charts with the default black-and-white color scheme. If you try this yourself, the chart may look slightly different. Depending on your platform, you’ll probably see hollow blue circles for C-section births and hollow purple circles for vaginal births.) As you can see, there are slight seasonal differences, but the overall pattern remains the same.
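
If you’d like to force the black-and-white look on screen, one option (a sketch) is to pass a monochrome theme through the par.settings argument, using lattice’s standard.theme function:

# redraw the dot plot using lattice's black-and-white theme
dotplot(dob.dm.tbl.alt, stack=FALSE, auto.key=TRUE, groups=TRUE,
        par.settings=standard.theme(color=FALSE))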

As another example of dot plots, let’s look at the tire failure data. In 2003, the National Highway Traffic Safety Administration (NHTSA) began a study into the durability of radial tires on light trucks. (This was three years after the Firestone recall of tires for Ford Explorers.) The NHTSA performed the tests in Phoenix, because it felt that the hot and dry conditions would be unusually stressful for tires (and because it had noted that many tire failures occur in the American Southwest). Over the next few years, it conducted hundreds of different tests on tires and released the data to the public. (See http://www.nhtsa.gov/portal/site/nhtsa/menuitem.8027fe7cfb6e727568d07a30343c44cc for links to this study.)

Tests were carried out on six different types of tires. Following is a table of the characteristics of the tires.

Tire  Size           Load Index  Speed Rating  Brand        Model           OE Vehicle   OE Model
B     P195/65R15     89          S             BF Goodrich  Touring T/A     Chevy        Cavalier
C     P205/65R15     92          V             Goodyear     Eagle GA        Lexus        ES300
D     P235/75R15     108         S             Michelin     LTX M/S         Ford, Dodge  E 150 Van, Ram Van 1500
E     P265/75R16     114         S             Firestone    Wilderness AT   Chevy/GMC    Silverado, Tahoe, Yukon
H     LT245/75R16/E  120/116     Q             Pathfinder   ATR A/S OWL     NA           NA
L     255/65R16      109         H             General      Grabber ST A/S  Mercedes     ML320

As an example, we’re going to look at one particular batch of tests from this study, called the “Stepped-Up Speed to Failure” test. Tires were mounted on testing devices, and the testing facility first ran a number of basic checks to confirm that each tire was intact. The facility then tested each tire at a specific speed for 1 hour, increasing the speed in 10-km/h increments until either (a) the tire failed or (b) a prescribed limit was reached. (The limit depended on the tire’s speed rating.) After the limit was reached, the test was run continuously until the tire failed. The test data set is in the package nutshell, under the name tires.sus.

The data set contains a lot of information, but we’re going to focus on only three variables. Time_To_Failure is the time before each tire failed (in hours), Speed_At_Failure_km_h is the testing speed at which the tire failed, and Tire_Type is the type of tire tested. We know that tests were run at only certain stepped speeds; despite the fact that speed is a numeric variable, we can treat it as a factor. So we can use dot plots to show the one continuous variable (time to failure) by the speed at failure for each different type of tire:

> library(nutshell)
> data(tires.sus)
> dotplot(as.factor(Speed_At_Failure_km_h)~Time_To_Failure|Tire_Type,
+   data=tires.sus)

The result is shown in Figure 14-10. This diagram lets us clearly see how quickly tires failed in each of the tests. For example, all type D tires failed quickly at the testing speed of 180 km/h, but some type H tires lasted a long time before failure. We’ll revisit this example in Comparing means.

A very popular chart for showing the distribution of a variable is the histogram. You can plot histograms in the lattice package with the function histogram:

histogram(x,
          data,
          allow.multiple, outer = TRUE,
          auto.key = FALSE,
          aspect = "fill",
          panel = lattice.getOption("panel.histogram"),
          prepanel, scales, strip, groups,
          xlab, xlim, ylab, ylim,
          type = c("percent", "count", "density"),
          nint = if (is.factor(x)) nlevels(x)
          else round(log2(length(x)) + 1),
          endpoints = extend.limits(range(as.numeric(x), finite = TRUE),
                                    prop = 0.04),
          breaks,
          equal.widths = TRUE,
          drop.unused.levels = lattice.getOption("drop.unused.levels"),
          ...,
          lattice.options = NULL,
          default.scales = list(),
          subscripts,
          subset)

By default, histograms are drawn by panel.histogram:

panel.histogram(x,
                breaks,
                equal.widths = TRUE,
                type = "density",
                nint = round(log2(length(x)) + 1),
                alpha, col, border, lty, lwd,
                ...)

As an example of histograms, let’s look at the distribution of birth weights, grouped by the number of births (single births, twins, triplets, and so on):

> histogram(~DBWT|DPLURAL, data=births2006.smpl)

The results are shown in Figure 14-11. Notice that the panels are ordered alphabetically by the conditioning variable. (That’s why the group names have the numbers at the front.) Also notice that the histogram function tries to fill in all the available space with squarish panels. This helps make each chart readable by itself but makes it difficult to compare the different groups.

To make it easier to compare groups, we can explicitly stack the charts on top of each other using the layout argument:

> histogram(~DBWT|DPLURAL, data=births2006.smpl, layout=c(1, 5))

The resulting chart is shown in Figure 14-12. As you can see, birth weights are roughly normally distributed within each group, but the mean weight drops as the number of births increases.
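
The nint argument (shown in the signature above) controls the number of bins, which defaults to round(log2(n) + 1) for numeric data. For example, here’s a sketch with finer bins:

# redraw the stacked histograms with roughly 50 bins each
histogram(~DBWT|DPLURAL, data=births2006.smpl,
          layout=c(1, 5), nint=50)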

If you’d like to see a single line showing the distribution, instead of a set of columns representing bins, you can use kernel density plots. To draw them in R, use the function densityplot:

densityplot(x,
            data,
            allow.multiple = is.null(groups) || outer,
            outer = !is.null(groups),
            auto.key = FALSE,
            aspect = "fill",
            panel = lattice.getOption("panel.densityplot"),
            prepanel, scales, strip, groups, weights,
            xlab, xlim, ylab, ylim,
            bw, adjust, kernel, window, width, give.Rkern,
            n = 50, from, to, cut, na.rm,
            drop.unused.levels = lattice.getOption("drop.unused.levels"),
            ...,
            lattice.options = NULL,
            default.scales = list(),
            subscripts,
            subset)

By default, panels are drawn by panel.densityplot:

panel.densityplot(x, darg, plot.points = "jitter", ref = FALSE,
                  groups = NULL, weights = NULL,
                  jitter.amount, type, ...)

Let’s redraw the example above, replacing the histogram with a density plot. By default, densityplot will draw a strip chart under each chart, showing every data point. However, because the data set is so big (there are 427,432 observations), we’ll tell densityplot not to do this by specifying plot.points=FALSE:

> densityplot(~DBWT|DPLURAL,data=births2006.smpl,
+   layout=c(1,5), plot.points=FALSE)

The results are shown in Figure 14-13. One advantage of density plots over histograms is that you can stack them on top of one another and still read the results. By changing the conditioning variable (DPLURAL) to a grouping variable, we can stack these charts on top of one another:

> densityplot(~DBWT, groups=DPLURAL, data=births2006.smpl,
+   plot.points=FALSE, auto.key=TRUE)

The superimposed density plots are shown in Figure 14-14. As you can see, it’s easier to compare distribution shapes (and centers) by superimposing the charts.
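
The bandwidth arguments in the densityplot signature (bw, adjust, and kernel) are passed through to R’s density function, so you can control how much smoothing is applied. For example, here’s a sketch that halves the default bandwidth to show more local detail:

# adjust=0.5 halves the default kernel bandwidth
densityplot(~DBWT, groups=DPLURAL, data=births2006.smpl,
            plot.points=FALSE, auto.key=TRUE, adjust=0.5)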

Another useful plot that you can generate with the lattice package is the quantile-quantile plot. A quantile-quantile plot compares the distribution of actual data values to a theoretical distribution. Specifically, it plots quantiles of the observed data against quantiles of the theoretical distribution. If the plotted points fall along a straight diagonal line (running from lower left to upper right), then it is likely that the observed data comes from the theoretical distribution. Quantile-quantile plots are a very powerful technique for seeing how closely a data set matches a theoretical distribution (or how much it deviates from it).

To plot quantile-quantile plots using lattice graphics, use the function qqmath:

qqmath(x,
       data,
       allow.multiple = is.null(groups) || outer,
       outer = !is.null(groups),
       distribution = qnorm,
       f.value = NULL,
       auto.key = FALSE,
       aspect = "fill",
       panel = lattice.getOption("panel.qqmath"),
       prepanel = NULL,
       scales, strip, groups,
       xlab, xlim, ylab, ylim,
       drop.unused.levels = lattice.getOption("drop.unused.levels"),
       ...,
       lattice.options = NULL,
       default.scales = list(),
       subscripts,
       subset)

By default, panels are drawn by panel.qqmath:

panel.qqmath(x, f.value = NULL,
             distribution = qnorm,
             qtype = 7,
             groups = NULL, ...)

By default, the function qqmath compares the sample data to a normal distribution. If the sample data is really normally distributed, you’ll see a straight line. As an example, let’s plot 100,000 random values from a normal distribution to show what qqmath does:

> qqmath(rnorm(100000))

The results are shown in Figure 14-16.
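
The distribution argument accepts any quantile function, so you can compare a sample against other theoretical distributions as well. For example, here’s a quick sketch comparing exponentially distributed data to an exponential distribution (the points should again fall near a straight line):

# quantile-quantile plot against the exponential distribution
qqmath(rexp(10000), distribution=qexp)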

Let’s plot a set of quantile-quantile plots for the birth weight data. Because the data set is rather large, we’ll plot only a random sample of 50,000 points:

qqmath(~DBWT|DPLURAL,
       data=births2006.smpl[sample(1:nrow(births2006.smpl), 50000), ],
       pch=19,
       cex=0.25,
       subset=(DPLURAL != "5 Quintuplet or higher"))

As you can see from Figure 14-17, the distribution of birth weights is not quite normal.

As another example, let’s look at real estate prices in San Francisco in 2008 and 2009. This data set is included in the nutshell package as sanfrancisco.home.sales.[44] (See More About the San Francisco Real Estate Prices Data Set for more information on this data set.) Here is how to load the data:

> library(nutshell)
> data(sanfrancisco.home.sales)

Intuitively, it doesn’t make sense for real estate prices to be normally distributed. There are far more people with below-average incomes than above-average incomes. The lowest recorded price in the data set is $100,000; the highest is $9,500,000. Let’s take a look at this distribution with qqmath:

> qqmath(~price, data=sanfrancisco.home.sales)

The distribution is shown in Figure 14-18. As expected, the distribution is not normal. It looks exponential, so let’s try a log transform:

> qqmath(~log(price), data=sanfrancisco.home.sales)

A log transform yields a distribution that looks pretty close to normally distributed (see Figure 14-19). Let’s take a look at how the distribution changes based on the number of bedrooms. To do this, we’ll split the distribution into groups and plot smooth lines instead of individual points, by specifying type="smooth". (The type argument is actually passed through to panel.xyplot, which draws the chart.) We’ll add a key to the plot (using auto.key=TRUE), and we’ll pass an explicit subset as the data argument instead of using the subset argument; this keeps unused factor levels out of the key.

> qqmath(~log(price), groups=bedrooms,
+   data=subset(sanfrancisco.home.sales,
+               !is.na(bedrooms) & bedrooms > 0 & bedrooms < 7),
+   auto.key=TRUE, drop.unused.levels=TRUE, type="smooth")

Notice that the lines are separate, with higher values for higher numbers of bedrooms (see Figure 14-20). We can do the same thing for square footage (see Figure 14-21). (I used the function cut2 from the package Hmisc to divide square footages into six equal-sized quantile groups.)

> library(Hmisc)
> qqmath(~log(price), groups=cut2(squarefeet, g=6),
+   data=subset(sanfrancisco.home.sales, !is.na(squarefeet)),
+   auto.key=TRUE, drop.unused.levels=TRUE, type="smooth")

Here the separation is even clearer. We can see the same separation by neighborhood, as in the sketch below. (We’ll come back to this analysis in Chapter 20.)
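
Here’s a rough sketch of the neighborhood version; with dozens of neighborhood levels, the key would be unreadable, so we omit it:

# same plot, grouped by neighborhood (no key: too many levels)
qqmath(~log(price), groups=neighborhood,
       data=sanfrancisco.home.sales, type="smooth")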

Next, let’s look at Trellis plots for two variables. Many real data sets (for example, financial data) record relationships between multiple numeric variables. The tools described here can help you examine those relationships.

To generate scatter plots with the lattice package, use the function xyplot:

xyplot(x,
       data,
       allow.multiple = is.null(groups) || outer,
       outer = !is.null(groups),
       auto.key = FALSE,
       aspect = "fill",
       panel = lattice.getOption("panel.xyplot"),
       prepanel = NULL,
       scales = list(),
       strip = TRUE,
       groups = NULL,
       xlab,
       xlim,
       ylab,
       ylim,
       drop.unused.levels = lattice.getOption("drop.unused.levels"),
       ...,
       lattice.options = NULL,
       default.scales,
       subscripts = !is.null(groups),
       subset = TRUE)

Most of the work is done by the panel function panel.xyplot:

panel.xyplot(x, y, type = "p",
             groups = NULL,
             pch, col, col.line, col.symbol,
             font, fontfamily, fontface,
             lty, cex, fill, lwd,
             horizontal = FALSE, ...,
             jitter.x = FALSE, jitter.y = FALSE,
             factor = 0.5, amount = NULL)

As an example of a scatter plot, let’s take a look at the relationship between house size and price. Let’s start with a simple scatter plot, showing size and price:

> xyplot(price~squarefeet, data=sanfrancisco.home.sales)

The results of this command are shown in Figure 14-22. It looks like there is a rough correspondence between size and price (the plot looks vaguely cone shaped). This chart is hard to read, so let’s try modifying it. Let’s trim outliers (sales prices over $4,000,000 and properties over 6,000 square feet) using the subset argument. Additionally, let’s take a look at how this relationship varies by zip code. San Francisco is a pretty big place, and not all neighborhoods are equally in demand. (You probably know the cliché about the first three rules of real estate: location, location, location.)

Before plotting the price data, let’s pick a subset of zip codes to plot. A few parts of the city are sparsely populated (like the financial district, 94104) and don’t have enough data to make plotting interesting. Also, let’s exclude zip codes where square footage isn’t available:

> table(subset(sanfrancisco.home.sales, !is.na(squarefeet), select=zip))

94100 94102 94103 94104 94105 94107 94108 94109 94110 94111 94112
    2    52    62     4    44   147    21   115   161    12   192
94114 94115 94116 94117 94118 94121 94122 94123 94124 94127 94131
  143   101   124   114    92    92   131    71    85   108   136
94132 94133 94134 94158
   82    47   105    13

So we’ll exclude 94100, 94104, 94108, 94111, 94133, and 94158 because there are too few sales to be interesting. (Note the strip argument. This simply prints the zip codes with the plots.)

> trellis.par.set(fontsize=list(text=7))
> xyplot(price~squarefeet|zip, data=sanfrancisco.home.sales,
+   subset=(zip!=94100 & zip!=94104 & zip!=94108 &
+           zip!=94111 & zip!=94133 & zip!=94158 &
+           price < 4000000 &
+           ifelse(is.na(squarefeet), FALSE, squarefeet < 6000)),
+   strip=strip.custom(strip.levels=TRUE))

The resulting plot is shown in Figure 14-23. Now the linear relationship is much more pronounced. Note the different slopes in different neighborhoods. As you might expect, some up-and-coming neighborhoods (like zip code 94110, which includes the Mission and Bernal Heights) are more shallowly sloped, while ritzy neighborhoods (like zip code 94123, which includes the Marina and Cow Hollow) are more steeply sloped.
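
Incidentally, if all you want is a least-squares line in each panel, you don’t need a custom panel function: panel.xyplot’s type argument accepts "r" for a regression line. Here’s a sketch, keeping the same outlier trims but skipping the zip code exclusions for brevity:

# type=c("p", "r") draws the points plus a least-squares fit
xyplot(price~squarefeet|zip, data=sanfrancisco.home.sales,
       subset=(price < 4000000 &
               ifelse(is.na(squarefeet), FALSE, squarefeet < 6000)),
       type=c("p", "r"))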

We can make this slightly more readable by using neighborhood names. Let’s rerun the code, conditioning by neighborhood. To make the charts even easier to read, we’ll add a diagonal reference line to each panel (through a custom panel function), change the plotted points to solid (through the pch=19 argument), and shrink them to a smaller size (through the cex=.2 argument):

> trellis.par.set(fontsize=list(text=7))
> dollars.per.squarefoot <- mean(
+   sanfrancisco.home.sales$price / sanfrancisco.home.sales$squarefeet,
+   na.rm=TRUE);
> xyplot(price~squarefeet|neighborhood,
+   data=sanfrancisco.home.sales,
+   pch=19,
+   cex=.2,
+   subset=(zip != 94100 & zip != 94104 & zip != 94108 &
+           zip != 94111 & zip != 94133 & zip != 94158 &
+           price < 4000000 &
+           ifelse(is.na(squarefeet), FALSE, squarefeet < 6000)),
+   strip=strip.custom(strip.levels=TRUE,
+   horizontal=TRUE,
+   par.strip.text=list(cex=.8)),
+   panel=function(...) {
+     panel.abline(a=0,b=dollars.per.squarefoot);
+     panel.xyplot(...);
+   }
+ )

This plot is shown in Figure 14-24.

The San Francisco home sales data set was taken from a particularly interesting time: the housing market crash. (The market fell a little late in San Francisco compared with other cities.) Let’s take a look at how prices changed over time during this period. We could plot just the median price or mean price, or the number of sales. However, the lattice package gives us tools that will let us watch how the whole distribution changed over time. Specifically, we can use box plots.

Box plots in the lattice package are just like box plots drawn with the graphics package, as described in Box Plots. The boxes represent prices from the 25th through the 75th percentiles (the interquartile range), the dots represent median prices, and the whiskers extend to the minimum and maximum values, except that whiskers are truncated at the most extreme values within 1.5 times the interquartile range of the box; any points beyond the whiskers are drawn individually as outliers.
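
Under the hood, panel.bwplot computes these statistics with boxplot.stats (note the stats argument in its signature below). You can inspect the five numbers directly; for example:

# lower whisker, lower hinge, median, upper hinge, upper whisker
boxplot.stats(log(sanfrancisco.home.sales$price))$stats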

To show box plots with Trellis graphics, use the function bwplot:

bwplot(x,
       data,
       allow.multiple = is.null(groups) || outer,
       outer = FALSE,
       auto.key = FALSE,
       aspect = "fill",
       panel = lattice.getOption("panel.bwplot"),
       prepanel = NULL,
       scales = list(),
       strip = TRUE,
       groups = NULL,
       xlab,
       xlim,
       ylab,
       ylim,
       box.ratio = 1,
       horizontal = NULL,
       drop.unused.levels = lattice.getOption("drop.unused.levels"),
       ...,
       lattice.options = NULL,
       default.scales,
       subscripts = !is.null(groups),
       subset = TRUE)

This function will, in turn, call panel.bwplot:

panel.bwplot(x, y, box.ratio = 1,
             box.width = box.ratio / (1 + box.ratio),
             horizontal = TRUE,
             pch, col, alpha, cex,
             font, fontfamily, fontface,
             fill, varwidth = FALSE,
             notch = FALSE, notch.frac = 0.5,
             ...,
             levels.fos,
             stats = boxplot.stats,
             coef = 1.5,
             do.out = TRUE)

Let’s show a set of box plots, with one plot per month. We’ll need to round the date ranges to the nearest month. A convenient way to do this in R is with the cut function. Here’s the number of sales by month in this data set:

> table(cut(sanfrancisco.home.sales$saledate, "month"))

2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01 2008-07-01
       139        230        267        253        237        198
2008-08-01 2008-09-01 2008-10-01 2008-11-01 2008-12-01 2009-01-01
       253        223        272        118        181        114
2009-02-01 2009-03-01 2009-04-01 2009-05-01 2009-06-01 2009-07-01
       123        142        116        180        150         85

Note that the cutoff dates don’t fall neatly at the beginning and end of a month:

> min(sanfrancisco.home.sales$saledate)
[1] "2008-02-13"
> max(sanfrancisco.home.sales$saledate)
[1] "2009-07-14"

So don’t focus too much on the volumes in February 2008 or July 2009. (Volume was much lower in the spring.) Let’s take a look at the distribution of sales prices by month. Here’s the code to present this data using the default representation:

> bwplot(price~cut(saledate, "month"), data=sanfrancisco.home.sales)

Unfortunately, this doesn’t produce an easily readable plot, as you can see in Figure 14-25. It’s clear that there are a large number of outliers making the plot hard to read. The outlier rule behind box plots works best for roughly normal data, and that isn’t an intuitively sensible assumption for real estate prices (as we saw in Univariate quantile-quantile plots). Let’s try plotting the box plots again, this time with log-transformed values. To make the chart more readable, we’ll change to vertical box plots and rotate the text at the bottom:

> bwplot(log(price)~cut(saledate, "month"),
+   data=sanfrancisco.home.sales,
+   scales=list(x=list(rot=90)))

Taking a look at the plot (shown in Figure 14-26), we can more clearly see some trends. Median prices moved around a lot during this period, though the interquartile range moved less. Moreover, it looks like sales at the high end of the market slowed down quite a bit (looking at the outliers on the top and the top whiskers). But, interestingly, the basic distribution appears pretty stable from month to month.
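
If you want the numbers behind the chart, you can compute the monthly medians directly with tapply; for example:

# median sale price by month
tapply(sanfrancisco.home.sales$price,
       cut(sanfrancisco.home.sales$saledate, "month"),
       median, na.rm=TRUE)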

If you would like to plot three-dimensional data with Trellis graphics, there are several functions available.

To plot three-dimensional data in flat grids, with colors showing different values for the third dimension, use the levelplot function:

levelplot(x,
          data,
          allow.multiple = is.null(groups) || outer,
          outer = TRUE,
          aspect = "fill",
          panel = lattice.getOption("panel.levelplot"),
          prepanel = NULL,
          scales = list(),
          strip = TRUE,
          groups = NULL,
          xlab,
          xlim,
          ylab,
          ylim,
          at,
          cuts = 15,
          pretty = FALSE,
          region = TRUE,
          drop.unused.levels = lattice.getOption("drop.unused.levels"),
          ...,
          lattice.options = NULL,
          default.scales = list(),
          colorkey = region,
          col.regions,
          alpha.regions,
          subset = TRUE)

Most of the work is done by panel.levelplot:

panel.levelplot(x, y, z,
                subscripts,
                at = pretty(z),
                shrink,
                labels,
                label.style = c("mixed", "flat", "align"),
                contour = FALSE,
                region = TRUE,
                col = add.line$col,
                lty = add.line$lty,
                lwd = add.line$lwd,
                ...,
                col.regions = regions$col,
                alpha.regions = regions$alpha)

As an example of level plots, we will look at the San Francisco home sales data set. Let’s start by looking at the number of home sales in different parts of the city. To do this, we’ll need to use the coordinate data in the San Francisco home sales data set. Unfortunately, we can’t use the coordinates directly; the coordinates are too precise, so the levelplot function would simply plot a large number of distinct points. (Try executing levelplot(price~latitude+longitude) to see what I mean.)

We’ll need to break the data into bins and count the number of homes within each bin. To do this, we’ll use the table and cut functions:

> attach(sanfrancisco.home.sales)
> levelplot(table(cut(longitude, breaks=40), 
+               cut(latitude, breaks=40)),
+         scales=list(y=list(cex=.5),
+                     x=list(rot=90, cex=.5)))

This plot is shown in Figure 14-27. If we were interested in looking at the average sales price by area, we could use a similar strategy. Instead of table, we’ll use the tapply function to aggregate observations. And while we’re at it, we’ll cut out the axis labels:

> levelplot(tapply(price,
+                  INDEX=list(cut(longitude, breaks=40),
+                             cut(latitude, breaks=40)),
+                  FUN=mean),
+           scales=list(draw=FALSE))

This plot is shown in Figure 14-28. And, of course, you can use conditioning values with level plots. Let’s look at the number of home sales by numbers of bedrooms. We’ll simplify the data slightly by looking at houses with zero to four bedrooms and then houses with five bedrooms or more. We’ll also cut the number of breaks to keep the charts legible:

> bedrooms.capped <- ifelse(bedrooms < 5, bedrooms, 5);
> levelplot(table(cut(longitude, breaks=25),
+                 cut(latitude, breaks=25),
+                 bedrooms.capped),
+           scales=list(draw=FALSE))

This figure is shown in Figure 14-29.

If you have fitted a model to a data set, the rfs function can help you visualize how well the model fits the data:

rfs(model, layout=c(2, 1), xlab="f-value", ylab=NULL,
    distribution = qunif,
    panel, prepanel, strip, ...)

The rfs function plots residual and fit-spread (RFS) plots. As an example, we’ll use the model described in Example: A Simple Linear Model. The example is a linear model for runs scored in baseball games as a function of team offensive statistics. For a full explanation, see Chapter 20; here we just want to show what charts are plotted for linear models with the rfs function:

> rfs(runs.mdl)

The plot generated by this command is shown in Figure 14-30. Notice that the two curves are S shaped. The residual plot is a quantile-quantile plot of the residuals; we’d expect the plot to be linear if the data fit the assumed distribution. The default distribution choice for rfs is a uniform distribution, which clearly isn’t right. Let’s try generating a second set of plots, assuming a normal distribution for the residuals:

> rfs(runs.mdl, distribution=qnorm)

The results are shown in Figure 14-31. Notice that the plots are roughly linear. We expect a normally distributed error function for a linear regression model, so this is a good thing.



[43] This data set is available from http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm. I used the 2006 Birth Data File in this book. The data file is 3.1 GB uncompressed, which is way too big to load easily into R on a machine with only 4 GB of RAM. I used a Perl script to parse this file and return a limited number of records in CSV format.

[44] It’s not completely dirty, either. I spent some time cleaning up and correcting the data: removing blatant duplicates, adding years to some recent condo listings, and a few other fixes.