Example: A Simple Linear Model

A linear regression assumes that there is a linear relationship between the response variable and the predictors. Specifically, a linear regression assumes that a response variable y is a linear function of a set of predictor variables x₁, x₂, ..., x_n.

As an example, we’re going to look at how different metrics predict the runs scored by a baseball team.^[54] Let’s start by loading the data for every team between 2000 and 2008. We’ll use the SQLite database that we used in Chapter 13 and extract the fields we want using an SQL query:

> library(RSQLite)
> drv <- dbDriver("SQLite")
> con <- dbConnect(drv,
+   dbname=system.file("extdata","bb.db", package="nutshell"))
> team.batting.00to08 <- dbGetQuery(con,
+   paste(
+     'SELECT teamID, yearID, R as runs, ',
+     '   H-"2B"-"3B"-HR as singles, ',
+     '   "2B" as doubles, "3B" as triples, HR as homeruns, ',
+     '   BB as walks, SB as stolenbases, CS as caughtstealing, ',
+     '   HBP as hitbypitch, SF as sacrificeflies, ',
+     '   AB as atbats ',
+     '   FROM Teams ',
+     '   WHERE yearID between 2000 and 2008'
+     )
+   )

Or, if you’d like, you can just load the file from the nutshell package:

> library(nutshell)
> data(team.batting.00to08)

Because this is a book about R and not a book about baseball, I renamed the common abbreviations to more intuitive names for plays. Let’s look at scatter plots of runs versus each other variable so that we can see which variables are likely to be most important.

We’ll create a data frame for plotting, using the make.groups function:

> attach(team.batting.00to08);
> forplot <- make.groups(
+   singles        = data.frame(value=singles,       teamID,yearID,runs),
+   doubles        = data.frame(value=doubles,       teamID,yearID,runs),
+   triples        = data.frame(value=triples,       teamID,yearID,runs),
+   homeruns       = data.frame(value=homeruns,      teamID,yearID,runs),
+   walks          = data.frame(value=walks,         teamID,yearID,runs),
+   stolenbases    = data.frame(value=stolenbases,   teamID,yearID,runs),
+   caughtstealing = data.frame(value=caughtstealing,teamID,yearID,runs),
+   hitbypitch     = data.frame(value=hitbypitch,    teamID,yearID,runs),
+   sacrificeflies = data.frame(value=sacrificeflies,teamID,yearID,runs)
+   );
> detach(team.batting.00to08);

Now, we’ll generate the scatter plots using the xyplot function:

> xyplot(runs~value|which, data=forplot,
+   scales=list(relation="free"),
+   pch=19, cex=.2,
+   strip=strip.custom(strip.levels=TRUE,
+   horizontal=TRUE,
+   par.strip.text=list(cex=.8))
+ )

The results are shown in Figure 20-1. Intuitively, teams that hit a lot of home runs score a lot of runs. Interestingly, teams that walk a lot score a lot of runs as well (maybe even more than teams that score a lot of singles).

Figure 20-1. Scatter plots: runs as a function of different batter statistics

Fitting a Model

Let’s fit a linear model to the data and assign it to the variable runs.mdl. We’ll use the lm function, which fits a linear model using ordinary least squares:

> runs.mdl <- lm(
+   formula=runs~singles+doubles+triples+homeruns+
+   walks+hitbypitch+sacrificeflies+
+   stolenbases+caughtstealing,
+   data=team.batting.00to08)

R doesn’t show much information when you fit a model. (If you don’t print the returned object, most modeling functions will not show any information, unless there is an error.) To get information about a model, you have to use helper functions.

Helper Functions for Specifying the Model

In a formula object, some symbols have special interpretations. Specifically, “+”, “*”, “-”, and “^” are interpreted specially by R. This means that you need to use some helper functions to represent simple addition, multiplication, subtraction, and exponentiation in a model formula. To interpret an expression literally, and not as a formula, use the identity function I(). For example, suppose that you want to include only the product of variables a and b in a formula specification, but not just a or b. If you specify a*b, this is interpreted as a, b, or a*b. To include only a*b, use the identity function I() to protect the expression a*b:

lm(y~I(a*b))

Sometimes, you would like to fit a polynomial function. Writing out all the terms individually can be tedious, but R provides a short way to specify all the terms at once. To do this, you use the poly function to add all terms up to a specified degree:

poly(x, ..., degree = 1, coefs = NULL, raw = FALSE

As arguments, the poly function takes a vector x (or a set of vectors), degree to specify a maximum degree to generate, coefs to specify coefficients from a previous fit (when using poly to generate predicted values), and raw to specify whether to use raw and not orthogonal polynomials. For more information on how to specify formulas, see Formulas.

Getting Information About a Model

In R, statistical models are represented by objects; statistical modeling functions return statistical model objects. When you fit a statistical model with most statistical software packages (such as SAS or SPSS) they print a lot of diagnostic information. In R, most statistical modeling functions do not print any information.

If you simply call a model function in R but don’t assign the model to a variable, the R console will print the object. (Specifically, it will call the generic method print with the object generated by the modeling function.) R doesn’t clutter your screen with lots of information you might not want. Instead, R includes a large set of functions for printing information about model objects. This section describes the functions for getting information about lm objects. Many of these functions may also be used with other types of models; see the help files for more information.

Viewing the model

For most model functions (including lm), the best place to start is with the print method. If you are using the R console, you can simply enter the name of the returned object on the console to see the results of print:

> runs.mdl

Call:
lm(formula = runs ~ singles + doubles + triples + homeruns +
     walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
     data = team.batting.00to08)

Coefficients:
   (Intercept)         singles         doubles         triples
    -507.16020         0.56705         0.69110         1.15836
      homeruns           walks      hitbypitch  sacrificeflies
       1.47439         0.30118         0.37750         0.87218
   stolenbases  caughtstealing
       0.04369        -0.01533

To show the formula used to fit the model, use the formula function:

formula(x, ...)

Here is the formula on which the model function was called:

> formula(runs.mdl)
runs ~ singles + doubles + triples + homeruns + walks + hitbypitch +
    sacrificeflies + stolenbases + caughtstealing

To get the list of coefficients for a model object, use the coef function:

coef(object, ...)

Here are the coefficients for the model fitted above:

> coef(runs.mdl)
   (Intercept)        singles        doubles        triples
 -507.16019759     0.56704867     0.69110420     1.15836091
      homeruns          walks     hitbypitch sacrificeflies
    1.47438916     0.30117665     0.37749717     0.87218094
   stolenbases caughtstealing
    0.04369407    -0.01533245

Alternatively, you can use the alias coefficients to access the coef function.

To get a summary of a linear model object, you can use the summary function. The method used for linear model objects is:

summary(object, correlation = FALSE, symbolic.cor = FALSE, ...)

For the example above, here is the output of the summary function:

> summary(runs.mdl)

Call:
lm(formula = runs ~ singles + doubles + triples + homeruns +
    walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
    data = team.batting.00to08)

Residuals:
     Min       1Q   Median       3Q      Max
-71.9019 -11.8282  -0.4193  14.6576  61.8743

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -507.16020   32.34834 -15.678  < 2e-16 ***
singles           0.56705    0.02601  21.801  < 2e-16 ***
doubles           0.69110    0.05922  11.670  < 2e-16 ***
triples           1.15836    0.17309   6.692 1.34e-10 ***
homeruns          1.47439    0.05081  29.015  < 2e-16 ***
walks             0.30118    0.02309  13.041  < 2e-16 ***
hitbypitch        0.37750    0.11006   3.430 0.000702 ***
sacrificeflies    0.87218    0.19179   4.548 8.33e-06 ***
stolenbases       0.04369    0.05951   0.734 0.463487
caughtstealing   -0.01533    0.15550  -0.099 0.921530
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.21 on 260 degrees of freedom
Multiple R-squared: 0.9144,	Adjusted R-squared: 0.9114
F-statistic: 308.6 on 9 and 260 DF,  p-value: < 2.2e-16

When you print a summary object, the following method is used:

print(x, digits = max(3, getOption("digits") - 3),
      symbolic.cor = x$symbolic.cor,
      signif.stars = getOption("show.signif.stars"), ...

Predicting values using a model

To get the vector of residuals from a linear model fit, use the residuals function:

residuals(object, ...)

To get a vector of fitted values, use the fitted function:

fitted(object, ...)

Suppose that you wanted to use the model object to predict values in another data set. You can use the predict function to calculate predicted values using the model object and another data frame:

predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
        interval = c("none", "confidence", "prediction"),
        level = 0.95, type = c("response", "terms"),
        terms = NULL, na.action = na.pass,
        pred.var = res.var/weights, weights = 1, ...)

The argument object specifies the model returned by the fitting function, newdata specifies a new data source for predictions, and na.action specifies how to deal with missing values in newdata. (By default, predict ignores missing values. You can choose na.omit to simply return NA for observations in newdata with missing values.) The predict function can also return confidence intervals for predictions, in addition to exact values; see the help file for more information.

Analyzing the fit

To compute confidence intervals for the coefficients in the fitted model, use the confint function:

confint(object, parm, level = 0.95, ...)

The argument object specifies the model returned by the fitting function, parm specifies the variables for which to show confidence levels, and level specifies the confidence level. Here are the confidence intervals for the coefficients of the model fitted above:

> confint(runs.mdl)
                       2.5 %       97.5 %
(Intercept)    -570.85828008 -443.4621151
singles           0.51583022    0.6182671
doubles           0.57449582    0.8077126
triples           0.81752968    1.4991921
homeruns          1.37432941    1.5744489
walks             0.25570041    0.3466529
hitbypitch        0.16077399    0.5942203
sacrificeflies    0.49451857    1.2498433
stolenbases      -0.07349342    0.1608816
caughtstealing   -0.32152716    0.2908623

To compute the influence of different parameters, you can use the influence function:

influence(model, do.coef = TRUE, ...)

For more friendly output, try influence.measures:

influence.measures(model)

To get analysis of variance statistics, use the anova function. For linear models, the method used is anova.lmlist, which has the following form:

anova.lmlist(object, ..., scale = 0, test = "F")

By default, F-test statistics are included in the results table. You can specify test="F" for F-test statistics, test="Chisq" for chi-squared test statistics, test="Cp" for Mallows’ C_p statistic, or test=NULL for no test statistics. You can also specify an estimate of the noise variance σ² through the scale argument. If you set scale=0 (the default), then the anova function will calculate an estimate from the test data. The test statistic and p-values compare the mean square for each row to the residual mean square.

Here are the ANOVA statistics for the model fitted above:

> anova(runs.mdl)
Analysis of Variance Table

Response: runs
                Df Sum Sq Mean Sq   F value    Pr(>F)
singles          1 215755  215755  400.4655 < 2.2e-16 ***
doubles          1 356588  356588  661.8680 < 2.2e-16 ***
triples          1    237     237    0.4403 0.5075647
homeruns         1 790051  790051 1466.4256 < 2.2e-16 ***
walks            1 114377  114377  212.2971 < 2.2e-16 ***
hitbypitch       1   7396    7396   13.7286 0.0002580 ***
sacrificeflies   1  11726   11726   21.7643 4.938e-06 ***
stolenbases      1    357     357    0.6632 0.4161654
caughtstealing   1      5       5    0.0097 0.9215298
Residuals      260 140078     539
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interestingly, it appears that triples, stolen bases, and times caught stealing are not statistically significant.

You can also view the effects from a fitted model. The effects are the uncorrelated single degree of freedom values obtained by projecting the data onto the successive orthogonal subspaces generated by the QR-decomposition during the fitting process. To obtain a vector of orthogonal effects from the model, use the effects function:

effects(object, set.sign = FALSE, ...)

To calculate the variance-covariance matrix from the linear model object, use the vcov function:

vcov(object, ...)

Here is the variance-covariance matrix for the model fitted above:

> vcov(runs.mdl)
                (Intercept)       singles       doubles       triples
(Intercept)    1046.4149572 -6.275356e-01 -6.908905e-01 -0.8115627984
singles          -0.6275356  6.765565e-04 -1.475026e-04  0.0001538296
doubles          -0.6908905 -1.475026e-04  3.506798e-03 -0.0013459187
triples          -0.8115628  1.538296e-04 -1.345919e-03  0.0299591843
homeruns         -0.3190194  2.314669e-04 -3.940172e-04  0.0011510663
walks            -0.2515630  7.950878e-05 -9.902388e-05  0.0004174548
hitbypitch       -0.9002974  3.385518e-04 -4.090707e-04  0.0018360831
sacrificeflies    1.6870020 -1.723732e-03 -2.253712e-03 -0.0051709718
stolenbases       0.2153275 -3.041450e-04  2.871078e-04 -0.0009794480
caughtstealing   -1.4370890  3.126387e-04  1.466032e-03 -0.0016038175
                    homeruns         walks    hitbypitch sacrificeflies
(Intercept)    -3.190194e-01 -2.515630e-01 -0.9002974059   1.6870019518
singles         2.314669e-04  7.950878e-05  0.0003385518  -0.0017237324
doubles        -3.940172e-04 -9.902388e-05 -0.0004090707  -0.0022537124
triples         1.151066e-03  4.174548e-04  0.0018360831  -0.0051709718
homeruns        2.582082e-03 -4.007590e-04 -0.0008183475  -0.0005078943
walks          -4.007590e-04  5.333599e-04  0.0002219440  -0.0010962381
hitbypitch     -8.183475e-04  2.219440e-04  0.0121132852  -0.0011315622
sacrificeflies -5.078943e-04 -1.096238e-03 -0.0011315622   0.0367839752
stolenbases    -2.041656e-06 -1.400052e-04 -0.0001197102  -0.0004636454
caughtstealing  3.469784e-04  6.008766e-04  0.0001742039  -0.0024880710
                 stolenbases caughtstealing
(Intercept)     2.153275e-01  -1.4370889812
singles        -3.041450e-04   0.0003126387
doubles         2.871078e-04   0.0014660316
triples        -9.794480e-04  -0.0016038175
homeruns       -2.041656e-06   0.0003469784
walks          -1.400052e-04   0.0006008766
hitbypitch     -1.197102e-04   0.0001742039
sacrificeflies -4.636454e-04  -0.0024880710
stolenbases     3.541716e-03  -0.0050935339
caughtstealing -5.093534e-03   0.0241794596

To return the deviance of the fitted model, use the deviance function:

deviance(object, ...)

Here is the deviance for the model fitted above (though this value is just the residual sum of squares in this case because runs.mdl is a linear model):

> deviance(runs.mdl)
[1] 140077.6

Finally, to plot a set of useful diagnostic diagrams, use the plot function:

plot(x, which = c(1:3,5),
     caption = list("Residuals vs Fitted", "Normal Q-Q",
          "Scale-Location", "Cook's distance",
          "Residuals vs Leverage",
          expression("Cook's dist vs Leverage  " * h[ii] / (1 - h[ii]))),
     panel = if(add.smooth) panel.smooth else points,
     sub.caption = NULL, main = "",
     ask = prod(par("mfcol")) < length(which) && dev.interactive(),
     ...,
     id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75,
     qqline = TRUE, cook.levels = c(0.5, 1.0),
     add.smooth = getOption("add.smooth"), label.pos = c(4,2),
     cex.caption = 1)

This function shows the following plots:

Residuals against fitted values
A normal Q-Q plot
A scale-location plot of sqrt{| residuals |} against fitted values
(Not plotted by default) A plot of Cook’s distances versus row labels
A plot of residuals against leverages
(Not plotted by default) A plot of Cook’s distances against leverage/(1 − leverage)

There are many more functions available in R for regression diagnostics; see the help file for influence.measures for more information on many of these.

Refining the Model

Often, it is better to use the update function to refit a model. This can save you some typing if you are using R interactively. Additionally, this can save on computation time (for large data sets). You can run update after changing the formula (perhaps adding or subtracting a term) or even after changing the data frame.

For example, let’s fit a slightly different model to the data above. We’ll omit the variable sacrificeflies and add 0 as a variable (which means to fit the model with no intercept):

> runs.mdl2 <- update(runs.mdl,formula=runs ~ singles + doubles +
+   triples + homeruns + walks + hitbypitch +
+   stolenbases + caughtstealing + 0)
> runs.mdl2

Call:
lm(formula = runs ~ singles + doubles + triples + homeruns +
     walks + hitbypitch + stolenbases + caughtstealing - 1,
     data = team.batting.00to08)

Coefficients:
       singles         doubles         triples        homeruns
       0.29823         0.41280         0.95664         1.31945
         walks      hitbypitch     stolenbases  caughtstealing
       0.21352        -0.07471         0.18828        -0.70334

^[54] This example is closely related to the Batter Runs formula, which was popularized by Pete Palmer and John Thorn in the 1984 book The Hidden Game of Baseball. The original Batter Runs formula worked slightly differently: it predicted the number of runs above or below the mean, and it had no intercept. For more about this problem, see [Adler2006].