Charts and Graphics

R includes several packages for visualizing data: graphics, grid, and lattice. Usually, you’ll find that functions within the graphics and lattice packages are the most useful.^[10] If you’re familiar with Microsoft Excel, you’ll find that R can generate all of the charts that you’re familiar with: column charts, bar charts, line plots, pie charts, and scatter plots. Even if that’s all you need, R makes it much easier than Excel to automate the creation of charts and to customize them. However, there are many, many more types of charts available in R, many of them quite intuitive and elegant.

To make this a little more interesting, let’s work with some real data. We’re going to look at all field goal attempts in the National Football League (NFL) in 2005.^[11] For those of you who aren’t familiar with American football, here’s a quick explanation. A team can attempt to kick a football between a set of goalposts to receive 3 points. If it misses the field goal, possession of the ball reverts to the other team (at the spot on the field where the kick was attempted).

First, let’s take a quick look at the distribution of distances. R provides a function, hist, that can do this quickly for us. Let’s start by loading the appropriate data set. (The data set is included in the nutshell package; see the Preface for information on how to obtain this package.)

> library(nutshell)
> data(field.goals)

Let’s take a quick look at the names of the columns in the field.goals data frame:

> names(field.goals)
 [1] "home.team"    "week"         "qtr"          "away.team"
 [5] "offense"      "defense"      "play.type"    "player"
 [9] "yards"        "stadium.type"

Now, let’s just try the hist command:

> hist(field.goals$yards)

This produces a chart like the one shown in Figure 3-1. (Depending on your system, if you try this yourself, you may see a differently colored and formatted chart. I tweaked a few graphical parameters so the charts would look good in print.) I wanted to see more detail about the number of field goals at different distances, so I modified the breaks argument to add more bins to the histogram:

> hist(field.goals$yards, breaks=35)

Figure 3-1. Histogram of field goal attempts with default settings
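One detail worth knowing about breaks: when you pass a single number, R treats it as a suggestion and picks "pretty" cut points, so you may not get exactly that many bins. Here is a minimal sketch of how you might check this. It uses the built-in cars data rather than field.goals so that it runs without the nutshell package; the names h and h2 are just for illustration.

```r
# Compute the histogram without drawing it, so we can inspect the bins.
# cars$dist is used here only as a convenient built-in numeric vector.
h <- hist(cars$dist, breaks = 10, plot = FALSE)

h$breaks          # the cut points R actually chose
length(h$counts)  # the number of bins, which may differ from 10
sum(h$counts)     # every observation falls in exactly one bin

# To force specific bins, pass a vector of cut points instead.
h2 <- hist(cars$dist, breaks = seq(0, 120, by = 20), plot = FALSE)
length(h2$counts) # one bin per interval between cut points
```

If you need exact control over the bins, the vector form is usually the safer choice.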

You can see the results of the hist call with breaks=35 in Figure 3-2.

R also features many other ways to visualize data. A great example is a strip chart. This chart just plots one point along the x-axis for every value in a vector. As an example, let’s look at the distance of blocked field goals. We can distinguish blocked field goals with the play.type variable in the field.goals data frame. Let’s take a quick look at how many blocked field goals there were in 2005. We’ll use the table function to tabulate the results:

> table(field.goals$play.type)

FG aborted FG blocked    FG good      FG no
         8         24        787        163

Figure 3-2. Histogram of field goal distances, showing more bins

Now we’ll select only observations with blocked field goals. We’ll add a little jitter so we can see individual points. Finally, we will also change the appearance of the points using the pch argument:

> stripchart(field.goals[field.goals$play.type=="FG blocked",]$yards,
+            pch=19, method="jitter")

The results are shown in Figure 3-3.

Figure 3-3. Strip chart showing field goal attempt distances
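If the row selection in that stripchart call looks dense, it may help to see the same steps on a tiny data frame. This is only a sketch: fg is a hypothetical stand-in for field.goals, and its values are invented for illustration.

```r
# Hypothetical stand-in for field.goals (values invented for illustration)
fg <- data.frame(
  play.type = c("FG good", "FG blocked", "FG no", "FG blocked", "FG good"),
  yards     = c(32, 45, 51, 38, 27)
)

# Tabulate the outcomes, as we did with table(field.goals$play.type)
counts <- table(fg$play.type)

# A logical test on play.type picks out the blocked kicks;
# indexing yards with it keeps only those distances.
blocked <- fg$yards[fg$play.type == "FG blocked"]

# Jitter spreads the points vertically so overlapping values stay visible.
stripchart(blocked, pch = 19, method = "jitter", xlab = "yards")
```

Selecting the yards column first and then indexing it is equivalent to selecting rows of the data frame and then taking $yards, as the original command does.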

As a second example, let’s use the cars data set, which is included in the datasets package (loaded by default when R starts). The cars data set consists of a set of 50 observations:

> data(cars)
> dim(cars)
[1] 50  2
> names(cars)
[1] "speed" "dist"

Each observation contains the speed of the car and the distance required to stop. Let’s take a quick look at the contents of this data set:

> summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00

Let’s plot the relationship between vehicle speed and stopping distance:

> plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
+   las = 1, xlim = c(0, 25))

The plot is shown in Figure 3-4. At a quick glance, we see that stopping distance is roughly proportional to speed.

Figure 3-4. Plot of data in the cars data set
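One way to check the impression of a roughly linear trend is to overlay a fitted straight line. The following is a minimal sketch, not part of the original figure; it assumes a simple least-squares fit with lm is a reasonable first look, and the name fit is just for illustration.

```r
# Fit a straight line to the cars data and draw it over the scatter plot.
fit <- lm(dist ~ speed, data = cars)

plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", las = 1)
abline(fit, lty = 2)   # dashed least-squares line

coef(fit)              # intercept and slope of the fitted line
```

If the points curve away from the dashed line at high speeds, that is a hint that a straight line is only an approximation.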

Let’s try one more example, this time using lattice graphics. Lattice graphics provide some great tools for drawing pretty charts, particularly charts that compare different groups of points. By default, the lattice package is not loaded; you will get an error if you try calling a lattice function without loading the library. To load the library, use the following command:

> library(lattice)

We will talk more about R packages in Chapter 4.

For example data, we’ll look at how American eating habits changed between 1980 and 2005.^[12]

The consumption data set is available in the nutshell package. It contains 48 observations, each showing the amount of a commodity consumed (or produced) in a specific year. Data is available only for years that are multiples of 5 (so there are six unique years between 1980 and 2005). The amount of food consumed is given by Amount, the type of food is given by Food, and the year is given by Year.

Two of the variables are numeric vectors: Amount and Year. However, two of them are an important data type that we haven’t seen yet: factors. A factor is an R object type that is used to compactly represent a vector of categorical values. Factors are used in many modeling functions. You can create a factor from another vector (typically a character vector) using the factor function. In this data frame, the variables Food and Units are factors. (We’ll discuss vectors in more detail in Vectors.)
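To make the idea of a factor concrete, here is a small sketch that builds one from a character vector. The food names are invented for illustration and are not taken from the consumption data set.

```r
# A character vector of categories (values invented for illustration)
foods <- c("Beef", "Chicken", "Beef", "Fish", "Chicken")

f <- factor(foods)

levels(f)       # the distinct categories
as.integer(f)   # the integer codes stored internally, one per element
table(f)        # counts per level
```

The compact representation comes from storing each value as a small integer code plus one shared table of level names.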

To help reveal trends in the data, I decided to use the dotplot function. (This function resembles line charts in Excel.) Specifically, we’d like to look at how the Amount varies by Year. We’d like to separately plot the trend for each value of the Food variable. For lattice graphics, we specify the data that we want to plot through a formula, in this case, Amount ~ Year | Food. A formula is an R object that is used to express a relationship between a set of variables.
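A formula is an ordinary R object, so you can create one and inspect it before passing it to a plotting function. A minimal sketch (the name fm is just for illustration):

```r
# A formula is stored unevaluated; the variables need not exist yet.
fm <- Amount ~ Year | Food

class(fm)      # the object's class
all.vars(fm)   # the variable names mentioned in the formula
```

In lattice, the part to the left of ~ is plotted on the y-axis, the part to the right on the x-axis, and the variable after | splits the data into one panel per value.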

If you’d like, you can try plotting the relationship using the default settings:

> library(nutshell)
> library(lattice)
> data(consumption)
> dotplot(Amount~Year|Food, consumption)

I found the default plot hard to read: the axis labels were too big, the scale for each plot was the same, and the stacking didn’t look right to me. So I tuned the presentation a little bit. Here is the version that produced Figure 3-5:

> dotplot(Amount ~ Year | Food,data=consumption,
+   aspect="xy",scales=list(relation="sliced", cex=.4))

Figure 3-5. Lattice plot showing changes in American eating habits, 1980–2005

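The aspect option sets the aspect ratio of each panel; the value "xy" chooses it so that the line segments in the data are banked to roughly 45°, which makes changes easier to see. The scales option changes how the axes are drawn: relation="sliced" gives each panel its own axis limits while keeping the width of the range the same across panels, and cex shrinks the axis labels. I’ll discuss lattice plots in more detail in Chapter 14, explaining how to use different options to tune the look of your charts.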
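If you don’t have the nutshell package installed, you can experiment with the same two options on a built-in data set. This is only a sketch under that assumption: it uses xyplot and the ChickWeight data instead of dotplot and consumption, and the name p is just for illustration.

```r
library(lattice)

# Lattice functions return a "trellis" object; nothing is drawn until
# the object is printed (which happens automatically at the console).
p <- xyplot(weight ~ Time | Diet, data = ChickWeight,
            aspect = "xy",                              # bank to about 45 degrees
            scales = list(relation = "sliced", cex = 0.6))

print(p)   # draw the plot explicitly, as you must inside a script or function
```

Try changing relation to "same" or "free" to see how the panel axes respond.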

^[10]Other packages are available for visualizing data. For example, the rggobi package provides tools for creating interactive graphics.

^[11]The data was provided by Aaron Schatz of Pro Football Prospectus. For more information, see the Football Outsiders website at http://www.footballoutsiders.com/, or you can find its annual books at most bookstores—both online and “brick and mortar.”

^[12]I obtained the data from the 2009 Statistical Abstract of the United States, a terrific book of data about the United States that is published by the Census Bureau. I took a subset of the data, keeping consumption for only the largest categories. You can find this data at http://www.census.gov/compendia/statab/cats/health_nutrition/food_consumption_and_nutrition.html.