R includes several packages for visualizing data: graphics, grid, and lattice. Usually, you’ll find that functions within the graphics and lattice packages are the most useful.[10] If you’re familiar with Microsoft Excel, you’ll find that R can generate all of the charts that
you’re familiar with: column charts, bar charts, line plots, pie charts,
and scatter plots. Even if that’s all you need, R makes it much easier
than Excel to automate the creation of charts and to customize them.
However, there are many, many more types of charts available in R, many of
them quite intuitive and elegant.
To make this a little more interesting, let’s work with some real data. We’re going to look at all field goal attempts in the National Football League (NFL) in 2005.[11] For those of you who aren’t familiar with American football, here’s a quick explanation. A team can attempt to kick a football between a set of goalposts to receive 3 points. If the kick misses, possession of the ball reverts to the other team (at the spot on the field where the kick was attempted).
First, let’s take a quick look at the distribution of distances. R provides a function, hist, that can do this quickly for us. Let’s start by loading the appropriate data set. (The data set is included in the nutshell package; see the Preface for information on how to obtain this package.)
> library(nutshell)
> data(field.goals)
Let’s take a quick look at the names of the columns in the field.goals data frame:
> names(field.goals)
[1] "home.team" "week" "qtr" "away.team"
[5] "offense" "defense" "play.type" "player"
[9] "yards" "stadium.type"
Now, let’s just try the hist command:
> hist(field.goals$yards)
This produces a chart like the one shown in Figure 3-1. (Depending on your system, if you try this yourself, you may see a differently colored and formatted chart. I tweaked a few graphical parameters so the charts would look good in print.) I wanted to see more detail about the number of field goals at different distances, so I modified the breaks argument to add more bins to the histogram:
> hist(field.goals$yards, breaks=35)
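If you want to check how breaks affects the binning without drawing anything, a minimal sketch like the following may help. It uses made-up distances (the yards vector below is simulated, not the field.goals data) and relies on the documented behavior that a single number passed to breaks is only a suggestion, which hist rounds to "pretty" break points:

```r
# Simulated kick distances; these values are an assumption for illustration
# and are not taken from the field.goals data set.
set.seed(1)
yards <- round(runif(500, min = 18, max = 60))

# plot = FALSE returns the histogram object without drawing a chart.
h_default <- hist(yards, plot = FALSE)
h_fine    <- hist(yards, breaks = 35, plot = FALSE)

# Number of bins actually used in each case. With breaks = 35 the count
# is close to, but not necessarily exactly, 35.
length(h_default$counts)
length(h_fine$counts)

# Every observation lands in exactly one bin either way.
sum(h_fine$counts) == length(yards)
```

The same inspection should work on field.goals$yards once the nutshell package is loaded.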
You can see the results of the breaks=35 version in Figure 3-2. R also features many other ways to visualize data. A great example is a strip chart, which plots one point along the x-axis for every value in a vector.
As an example, let’s look at the distance of blocked field goals. We can distinguish blocked field goals with the play.type variable in the field.goals data frame. Let’s take a quick look at how many blocked field goals there were in 2005. We’ll use the table function to tabulate the results:
> table(field.goals$play.type)
FG aborted FG blocked    FG good      FG no
         8         24        787        163
Now we’ll select only observations with blocked field goals. We’ll add a little jitter so we can see individual points. Finally, we will also change the appearance of the points using the pch argument:
> stripchart(field.goals[field.goals$play.type=="FG blocked",]$yards,
+     pch=19, method="jitter")
The results are shown in Figure 3-3.
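The row selection inside that stripchart call is easier to follow when broken into steps. Here is a small sketch on a hypothetical data frame (kicks and its values are invented for illustration; only the column names mirror field.goals):

```r
# A tiny made-up data frame with the same two columns used above.
kicks <- data.frame(
  yards     = c(22, 35, 41, 48, 52, 30, 44, 37),
  play.type = c("FG good", "FG blocked", "FG good", "FG no",
                "FG blocked", "FG good", "FG blocked", "FG good")
)

# Tabulate the outcomes, as with table(field.goals$play.type).
table(kicks$play.type)

# A logical vector picks the rows; the empty slot after the comma
# keeps all columns.
is_blocked <- kicks$play.type == "FG blocked"
blocked    <- kicks[is_blocked, ]
blocked$yards   # 35 52 44

# pch = 19 draws solid circles; method = "jitter" spreads points out
# so that overlapping values remain visible.
stripchart(blocked$yards, pch = 19, method = "jitter")
```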
As a second example, let’s use the cars data set, which is included in the datasets package that ships with R. The cars data set consists of a set of 50 observations:
> data(cars)
> dim(cars)
[1] 50  2
> names(cars)
[1] "speed" "dist"
Each observation contains the speed of the car and the distance required to stop. Let’s take a quick look at the contents of this data set:
> summary(cars)
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
Let’s plot the relationship between vehicle speed and stopping distance:
> plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
+     las = 1, xlim = c(0, 25))
The plot is shown in Figure 3-4. At a quick glance, we see that stopping distance is roughly proportional to speed.
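One way to make that visual impression a bit more concrete is to overlay a straight-line fit on the scatter plot. This is only a sketch: the least-squares line from lm is my choice of visual guide here, not a claim that a straight line is the right model for stopping distance.

```r
# Fit a straight line to the built-in cars data.
fit <- lm(dist ~ speed, data = cars)

# Redraw the scatter plot, then add the fitted line as a dashed guide.
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1, xlim = c(0, 25))
abline(fit, lty = 2)

# Intercept and slope of the fitted line.
coef(fit)
```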
Let’s try one more example, this time using lattice graphics. Lattice graphics provide some great tools for drawing pretty charts, particularly charts that compare different groups of points. By default, the lattice package is not loaded; you will get an error if you try calling a lattice function without loading the library. To load the library, use the following command:
> library(lattice)
We will talk more about R packages in Chapter 4.
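If you are unsure whether lattice is already attached in your session, a quick check along these lines may help (a sketch; the result of the first test depends on what you have loaded so far):

```r
# Is lattice on the search path yet? This may be FALSE in a fresh session.
"package:lattice" %in% search()

# Attach it, then check again.
library(lattice)
"package:lattice" %in% search()

# Lattice functions such as dotplot can now be found by name.
exists("dotplot")
```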
For example data, we’ll look at how American eating habits changed between 1980 and 2005.[12]
The consumption data set is available in the nutshell package. It contains 48 observations, each showing the amount of a commodity consumed (or produced) in a specific year. Data is available only for years that are multiples of 5 (so there are six unique years between 1980 and 2005). The amount of food consumed is given by Amount, the type of food is given by Food, and the year is given by Year.
Two of the variables are numeric vectors: Amount and Year. However, the other two, Food and Units, are an important data type that we haven’t seen yet: factors. A factor is an R object type that is used to compactly represent a vector of categorical values. Factors are used in many modeling functions. You can create a factor from another vector (typically a character vector) using the factor function. (We’ll discuss vectors in more detail in Vectors.)
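To see what a factor looks like in practice, here is a minimal sketch that builds one from a character vector. The food names are made up for illustration and are not taken from the consumption data:

```r
# A character vector of categorical values (invented for this example).
foods <- c("Beef", "Chicken", "Beef", "Fish")

# factor() records the distinct values as levels (sorted alphabetically
# by default) and stores each element as an integer code.
food_factor <- factor(foods)

levels(food_factor)       # "Beef" "Chicken" "Fish"
as.integer(food_factor)   # 1 2 1 3

# Counting observations per level works just as it does for play.type.
table(food_factor)
```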
To help reveal trends in the data, I decided to use the dotplot function. (This function resembles line charts in Excel.) Specifically, we’d like to look at how the Amount varies by Year. We’d like to separately plot the trend for each value of the Food variable. For lattice graphics, we specify the data that we want to plot through a formula, in this case, Amount ~ Year | Food. A formula is an R object that is used to express a relationship between a set of variables.
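Because a formula is an ordinary R object, you can create and inspect one without any data present. A short sketch (the name f is arbitrary):

```r
# Creating a formula does not evaluate Amount, Year, or Food, so this
# works even when no such variables exist in the session.
f <- Amount ~ Year | Food

class(f)      # "formula"
all.vars(f)   # "Amount" "Year" "Food"
```

In a lattice formula, the part to the left of ~ is plotted on the y-axis, the part to the right on the x-axis, and the variable after | splits the data into one panel per value.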
If you’d like, you can try plotting the relationship using the default settings:
> library(nutshell)
> library(lattice)
> data(consumption)
> dotplot(Amount~Year|Food, consumption)
I found the default plot hard to read: the axis labels were too big, the scale for each plot was the same, and the stacking didn’t look right to me. So I tuned the presentation a little bit. Here is the version that produced Figure 3-5:
> dotplot(Amount ~ Year | Food, data=consumption,
+     aspect="xy", scales=list(relation="sliced", cex=.4))
The aspect option sets the aspect ratio of each plot so that changes are drawn at roughly 45° angles (making changes easier to see). The scales option changes how the axes are drawn: here, relation="sliced" gives each panel its own axis limits, and cex shrinks the axis labels. I’ll discuss lattice plots in more detail in Chapter 14, explaining how to use different options to tune the look of your charts.
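If you don’t have the nutshell package handy, you can still experiment with the relation setting on a toy data frame. The sketch below uses invented numbers (toy is hypothetical, not the consumption data), converts Year to a factor so that it is treated as the categorical axis, and varies the relation for the y-axis only:

```r
library(lattice)

# Made-up amounts for two hypothetical foods over three years.
toy <- data.frame(
  Year   = rep(c(1980, 1990, 2000), times = 2),
  Food   = factor(rep(c("Cheese", "Rice"), each = 3)),
  Amount = c(17, 25, 30, 9, 16, 19)
)

# Build one plot per y-axis relation:
#   "same"   - all panels share one set of limits
#   "free"   - each panel gets limits that fit its own data
#   "sliced" - each panel gets its own limits, all of equal length
relations <- c("same", "free", "sliced")
plots <- lapply(relations, function(r) {
  dotplot(Amount ~ factor(Year) | Food, data = toy,
          scales = list(y = list(relation = r)))
})
names(plots) <- relations

# Lattice functions return objects; printing one draws it.
print(plots[["sliced"]])
```

Printing plots[["same"]] and plots[["free"]] in turn makes the difference between the settings easy to compare.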
[10] Other packages are available for visualizing data. For example, the rggobi package provides tools for creating interactive graphics.
[11] The data was provided by Aaron Schatz of Pro Football Prospectus. For more information, see the Football Outsiders website at http://www.footballoutsiders.com/, or you can find its annual books at most bookstores—both online and “brick and mortar.”
[12] I obtained the data from the 2009 Statistical Abstract of the United States, a terrific book of data about the United States that is published by the Census Bureau. I took a subset of the data, keeping consumption for only the largest categories. You can find this data at http://www.census.gov/compendia/statab/cats/health_nutrition/food_consumption_and_nutrition.html.