A linear regression assumes that there is a linear relationship between the response variable and the predictors. Specifically, a linear regression assumes that a response variable y is a linear function of a set of predictor variables x1, x2, ..., xn.
As an example, we’re going to look at how different metrics predict the runs scored by a baseball team.[54] Let’s start by loading the data for every team between 2000 and 2008. We’ll use the SQLite database that we used in Chapter 13 and extract the fields we want using an SQL query:
> library(RSQLite) > drv <- dbDriver("SQLite") > con <- dbConnect(drv, + dbname=system.file("extdata","bb.db", package="nutshell")) > team.batting.00to08 <- dbGetQuery(con, + paste( + 'SELECT teamID, yearID, R as runs, ', + ' H-"2B"-"3B"-HR as singles, ', + ' "2B" as doubles, "3B" as triples, HR as homeruns, ', + ' BB as walks, SB as stolenbases, CS as caughtstealing, ', + ' HBP as hitbypitch, SF as sacrificeflies, ', + ' AB as atbats ', + ' FROM Teams ', + ' WHERE yearID between 2000 and 2008' + ) + )
Or, if you’d like, you can just load the file from the nutshell
package:
> library(nutshell) > data(team.batting.00to08)
Because this is a book about R and not a book about baseball, I renamed the common abbreviations to more intuitive names for plays. Let’s look at scatter plots of runs versus each other variable so that we can see which variables are likely to be most important.
We’ll create a data frame for plotting, using the make.groups
function:
> attach(team.batting.00to08); > forplot <- make.groups( + singles = data.frame(value=singles, teamID,yearID,runs), + doubles = data.frame(value=doubles, teamID,yearID,runs), + triples = data.frame(value=triples, teamID,yearID,runs), + homeruns = data.frame(value=homeruns, teamID,yearID,runs), + walks = data.frame(value=walks, teamID,yearID,runs), + stolenbases = data.frame(value=stolenbases, teamID,yearID,runs), + caughtstealing = data.frame(value=caughtstealing,teamID,yearID,runs), + hitbypitch = data.frame(value=hitbypitch, teamID,yearID,runs), + sacrificeflies = data.frame(value=sacrificeflies,teamID,yearID,runs) + ); > detach(team.batting.00to08);
Now, we’ll generate the scatter plots using the xyplot
function:
> xyplot(runs~value|which, data=forplot, + scales=list(relation="free"), + pch=19, cex=.2, + strip=strip.custom(strip.levels=TRUE, + horizontal=TRUE, + par.strip.text=list(cex=.8)) + )
The results are shown in Figure 20-1. Intuitively, teams that hit a lot of home runs score a lot of runs. Interestingly, teams that walk a lot score a lot of runs as well (maybe even more than teams that score a lot of singles).
Let’s fit a linear model to the data and assign it to the
variable runs.mdl
. We’ll use the
lm
function, which fits a linear model using ordinary least
squares:
> runs.mdl <- lm( + formula=runs~singles+doubles+triples+homeruns+ + walks+hitbypitch+sacrificeflies+ + stolenbases+caughtstealing, + data=team.batting.00to08)
R doesn’t show much information when you fit a model. (If you don’t print the returned object, most modeling functions will not show any information, unless there is an error.) To get information about a model, you have to use helper functions.
In a formula object, some symbols have special
interpretations. Specifically, “+”, “*”, “-”, and “^” are interpreted
specially by R. This means that you need to use some helper functions to
represent simple addition, multiplication, subtraction, and
exponentiation in a model formula. To interpret an expression literally,
and not as a formula, use the identity function I()
. For example, suppose that you want to
include only the product of variables a
and b
in
a formula specification, but not just a
or b
. If
you specify a*b
, this is interpreted
as a
, b
, or a*b
.
To include only a*b
, use the identity
function I()
to protect the
expression a*b
:
lm(y~I(a*b))
Sometimes, you would like to fit a polynomial function. Writing
out all the terms individually can be tedious, but R provides a short
way to specify all the terms at once. To do this, you use the poly
function to add all terms up to a specified
degree:
poly(x, ..., degree = 1, coefs = NULL, raw = FALSE
As arguments, the poly
function
takes a vector x
(or a set of
vectors), degree
to specify a maximum
degree to generate, coefs
to specify
coefficients from a previous fit (when using poly
to generate predicted values), and
raw
to specify whether to use raw and
not orthogonal polynomials. For more information on how to specify
formulas, see Formulas.
In R, statistical models are represented by objects; statistical modeling functions return statistical model objects. When you fit a statistical model with most statistical software packages (such as SAS or SPSS) they print a lot of diagnostic information. In R, most statistical modeling functions do not print any information.
If you simply call a model function in R but don’t assign the
model to a variable, the R console will print the object. (Specifically,
it will call the generic method print
with the object generated by the modeling function.) R doesn’t clutter
your screen with lots of information you might not want. Instead, R
includes a large set of functions for printing information about model
objects. This section describes the functions for getting information
about lm
objects. Many of these
functions may also be used with other types of models; see the help
files for more information.
For most model functions (including lm
), the best place to start is with the
print
method. If you are using the
R console, you can simply enter the name of the returned object on the
console to see the results of print
:
> runs.mdl
Call:
lm(formula = runs ~ singles + doubles + triples + homeruns +
walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
data = team.batting.00to08)
Coefficients:
(Intercept) singles doubles triples
-507.16020 0.56705 0.69110 1.15836
homeruns walks hitbypitch sacrificeflies
1.47439 0.30118 0.37750 0.87218
stolenbases caughtstealing
0.04369 -0.01533
To show the formula used to fit the model, use the formula
function:
formula(x, ...)
Here is the formula on which the model function was called:
> formula(runs.mdl)
runs ~ singles + doubles + triples + homeruns + walks + hitbypitch +
sacrificeflies + stolenbases + caughtstealing
To get the list of coefficients for a model object, use the
coef
function:
coef(object, ...)
Here are the coefficients for the model fitted above:
> coef(runs.mdl)
(Intercept) singles doubles triples
-507.16019759 0.56704867 0.69110420 1.15836091
homeruns walks hitbypitch sacrificeflies
1.47438916 0.30117665 0.37749717 0.87218094
stolenbases caughtstealing
0.04369407 -0.01533245
Alternatively, you can use the alias coefficients
to access the coef
function.
To get a summary of a linear model object, you can use the
summary
function. The method used for linear model objects
is:
summary(object, correlation = FALSE, symbolic.cor = FALSE, ...)
For the example above, here is the output of the summary
function:
> summary(runs.mdl)
Call:
lm(formula = runs ~ singles + doubles + triples + homeruns +
walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
data = team.batting.00to08)
Residuals:
Min 1Q Median 3Q Max
-71.9019 -11.8282 -0.4193 14.6576 61.8743
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -507.16020 32.34834 -15.678 < 2e-16 ***
singles 0.56705 0.02601 21.801 < 2e-16 ***
doubles 0.69110 0.05922 11.670 < 2e-16 ***
triples 1.15836 0.17309 6.692 1.34e-10 ***
homeruns 1.47439 0.05081 29.015 < 2e-16 ***
walks 0.30118 0.02309 13.041 < 2e-16 ***
hitbypitch 0.37750 0.11006 3.430 0.000702 ***
sacrificeflies 0.87218 0.19179 4.548 8.33e-06 ***
stolenbases 0.04369 0.05951 0.734 0.463487
caughtstealing -0.01533 0.15550 -0.099 0.921530
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 23.21 on 260 degrees of freedom
Multiple R-squared: 0.9144, Adjusted R-squared: 0.9114
F-statistic: 308.6 on 9 and 260 DF, p-value: < 2.2e-16
When you print a summary object, the following method is used:
print(x, digits = max(3, getOption("digits") - 3), symbolic.cor = x$symbolic.cor, signif.stars = getOption("show.signif.stars"), ...
To get the vector of residuals from a linear model fit,
use the residuals
function:
residuals(object, ...)
To get a vector of fitted values, use the fitted
function:
fitted(object, ...)
Suppose that you wanted to use the model object to predict
values in another data set. You can use the predict
function to calculate predicted values using the model
object and another data frame:
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf, interval = c("none", "confidence", "prediction"), level = 0.95, type = c("response", "terms"), terms = NULL, na.action = na.pass, pred.var = res.var/weights, weights = 1, ...)
The argument object
specifies
the model returned by the fitting function, newdata
specifies a new data source for
predictions, and na.action
specifies how to deal with missing values in newdata
. (By default, predict
ignores missing values. You can
choose na.omit
to simply return
NA
for observations in newdata
with missing values.) The predict
function can also return confidence
intervals for predictions, in addition to exact values; see the help
file for more information.
To compute confidence intervals for the coefficients in
the fitted model, use the confint
function:
confint(object, parm, level = 0.95, ...)
The argument object
specifies
the model returned by the fitting function, parm
specifies the variables for which to
show confidence levels, and level
specifies the confidence level. Here are the confidence intervals for
the coefficients of the model fitted above:
> confint(runs.mdl)
2.5 % 97.5 %
(Intercept) -570.85828008 -443.4621151
singles 0.51583022 0.6182671
doubles 0.57449582 0.8077126
triples 0.81752968 1.4991921
homeruns 1.37432941 1.5744489
walks 0.25570041 0.3466529
hitbypitch 0.16077399 0.5942203
sacrificeflies 0.49451857 1.2498433
stolenbases -0.07349342 0.1608816
caughtstealing -0.32152716 0.2908623
To compute the influence of different parameters, you can use
the influence
function:
influence(model, do.coef = TRUE, ...)
For more friendly output, try influence.measures
:
influence.measures(model)
To get analysis of variance statistics, use the anova
function. For linear models, the method used is
anova.lmlist
, which has the
following form:
anova.lmlist(object, ..., scale = 0, test = "F")
By default, F-test statistics are included
in the results table. You can specify test="F"
for F-test
statistics, test="Chisq"
for
chi-squared test statistics, test="Cp"
for Mallows’
Cp statistic, or test=NULL
for no test statistics. You can
also specify an estimate of the noise variance
σ2 through the scale
argument. If you set scale=0
(the default), then the anova
function will calculate an estimate
from the test data. The test statistic and
p-values compare the mean square for each row to
the residual mean square.
Here are the ANOVA statistics for the model fitted above:
> anova(runs.mdl)
Analysis of Variance Table
Response: runs
Df Sum Sq Mean Sq F value Pr(>F)
singles 1 215755 215755 400.4655 < 2.2e-16 ***
doubles 1 356588 356588 661.8680 < 2.2e-16 ***
triples 1 237 237 0.4403 0.5075647
homeruns 1 790051 790051 1466.4256 < 2.2e-16 ***
walks 1 114377 114377 212.2971 < 2.2e-16 ***
hitbypitch 1 7396 7396 13.7286 0.0002580 ***
sacrificeflies 1 11726 11726 21.7643 4.938e-06 ***
stolenbases 1 357 357 0.6632 0.4161654
caughtstealing 1 5 5 0.0097 0.9215298
Residuals 260 140078 539
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interestingly, it appears that triples, stolen bases, and times caught stealing are not statistically significant.
You can also view the effects from a fitted model. The effects
are the uncorrelated single degree of freedom values obtained by
projecting the data onto the successive orthogonal subspaces generated
by the QR-decomposition during the fitting
process. To obtain a vector of
orthogonal effects from the model, use the effects
function:
effects(object, set.sign = FALSE, ...)
To calculate the variance-covariance matrix from the linear model object, use the vcov
function:
vcov(object, ...)
Here is the variance-covariance matrix for the model fitted above:
> vcov(runs.mdl)
(Intercept) singles doubles triples
(Intercept) 1046.4149572 -6.275356e-01 -6.908905e-01 -0.8115627984
singles -0.6275356 6.765565e-04 -1.475026e-04 0.0001538296
doubles -0.6908905 -1.475026e-04 3.506798e-03 -0.0013459187
triples -0.8115628 1.538296e-04 -1.345919e-03 0.0299591843
homeruns -0.3190194 2.314669e-04 -3.940172e-04 0.0011510663
walks -0.2515630 7.950878e-05 -9.902388e-05 0.0004174548
hitbypitch -0.9002974 3.385518e-04 -4.090707e-04 0.0018360831
sacrificeflies 1.6870020 -1.723732e-03 -2.253712e-03 -0.0051709718
stolenbases 0.2153275 -3.041450e-04 2.871078e-04 -0.0009794480
caughtstealing -1.4370890 3.126387e-04 1.466032e-03 -0.0016038175
homeruns walks hitbypitch sacrificeflies
(Intercept) -3.190194e-01 -2.515630e-01 -0.9002974059 1.6870019518
singles 2.314669e-04 7.950878e-05 0.0003385518 -0.0017237324
doubles -3.940172e-04 -9.902388e-05 -0.0004090707 -0.0022537124
triples 1.151066e-03 4.174548e-04 0.0018360831 -0.0051709718
homeruns 2.582082e-03 -4.007590e-04 -0.0008183475 -0.0005078943
walks -4.007590e-04 5.333599e-04 0.0002219440 -0.0010962381
hitbypitch -8.183475e-04 2.219440e-04 0.0121132852 -0.0011315622
sacrificeflies -5.078943e-04 -1.096238e-03 -0.0011315622 0.0367839752
stolenbases -2.041656e-06 -1.400052e-04 -0.0001197102 -0.0004636454
caughtstealing 3.469784e-04 6.008766e-04 0.0001742039 -0.0024880710
stolenbases caughtstealing
(Intercept) 2.153275e-01 -1.4370889812
singles -3.041450e-04 0.0003126387
doubles 2.871078e-04 0.0014660316
triples -9.794480e-04 -0.0016038175
homeruns -2.041656e-06 0.0003469784
walks -1.400052e-04 0.0006008766
hitbypitch -1.197102e-04 0.0001742039
sacrificeflies -4.636454e-04 -0.0024880710
stolenbases 3.541716e-03 -0.0050935339
caughtstealing -5.093534e-03 0.0241794596
To return the deviance of the fitted model, use the deviance
function:
deviance(object, ...)
Here is the deviance for the model fitted above (though this
value is just the residual sum of squares in this case because
runs.mdl
is a linear model):
> deviance(runs.mdl)
[1] 140077.6
Finally, to plot a set of useful diagnostic diagrams, use the
plot
function:
plot(x, which = c(1:3,5), caption = list("Residuals vs Fitted", "Normal Q-Q", "Scale-Location", "Cook's distance", "Residuals vs Leverage", expression("Cook's dist vs Leverage " * h[ii] / (1 - h[ii]))), panel = if(add.smooth) panel.smooth else points, sub.caption = NULL, main = "", ask = prod(par("mfcol")) < length(which) && dev.interactive(), ..., id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75, qqline = TRUE, cook.levels = c(0.5, 1.0), add.smooth = getOption("add.smooth"), label.pos = c(4,2), cex.caption = 1)
This function shows the following plots:
Residuals against fitted values
A normal Q-Q plot
A scale-location plot of sqrt{| residuals |} against fitted values
(Not plotted by default) A plot of Cook’s distances versus row labels
A plot of residuals against leverages
(Not plotted by default) A plot of Cook’s distances against leverage/(1 − leverage)
There are many more functions available in R for regression
diagnostics; see the help file for influence.measures
for more information on
many of these.
Often, it is better to use the update
function to refit a model. This can
save you some typing if you are using R interactively. Additionally,
this can save on computation time (for large data sets). You can run
update
after changing the formula
(perhaps adding or subtracting a term) or even after changing the data
frame.
For example, let’s fit a slightly different model to the data
above. We’ll omit the variable sacrificeflies
and add 0
as a variable (which means to fit the model
with no intercept):
> runs.mdl2 <- update(runs.mdl,formula=runs ~ singles + doubles + + triples + homeruns + walks + hitbypitch + + stolenbases + caughtstealing + 0) > runs.mdl2 Call: lm(formula = runs ~ singles + doubles + triples + homeruns + walks + hitbypitch + stolenbases + caughtstealing - 1, data = team.batting.00to08) Coefficients: singles doubles triples homeruns 0.29823 0.41280 0.95664 1.31945 walks hitbypitch stolenbases caughtstealing 0.21352 -0.07471 0.18828 -0.70334
[54] This example is closely related to the Batter Runs formula, which was popularized by Pete Palmer and John Thorn in the 1984 book The Hidden Game of Baseball. The original Batter Runs formula worked slightly differently: it predicted the number of runs above or below the mean, and it had no intercept. For more about this problem, see [Adler2006].