Now that we’ve seen a simple example of how models work in R, let’s describe in detail what lm does and how you can control it. A linear regression model is appropriate when the response variable (the thing that you want to predict) can be estimated from a linear function of the predictor variables (the information that you know). Technically, we assume that:

y = c0 + c1x1 + c2x2 + ... + cnxn + ε

where y is the response variable, x1, x2, ..., xn are the predictor variables (or predictors), c1, c2, ..., cn are the coefficients for the predictor variables, c0 is the intercept, and ε is the error term. (For more details on the assumptions of the least squares model, see Assumptions of Least Squares Regression.) The predictors can be simple variables or even nonlinear functions of variables.
Suppose that you have a matrix of observed predictor variables X and a vector of response values Y. (In this sentence, I’m using the terms “matrix” and “vector” in the mathematical sense.) We have assumed a linear model, so given a set of coefficients c, we can calculate a set of estimates ŷ for the input data X by calculating ŷ = Xc. The differences between the estimates ŷ and the actual values Y are called the residuals. You can think of the residuals as a measure of the prediction error; small residuals mean that the predicted values are close to the actual values. We also assume that the expected value of the error term in the model (the difference between the actual response values and the values given by the underlying linear relationship) is 0. This is important to remember: at best, a model is probabilistic.[55]
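To make this concrete, here is a quick sketch using the runs.mdl model fit in Example: A Simple Linear Model (it assumes that model object is still in your workspace):

# A sketch using runs.mdl from "Example: A Simple Linear Model".
# fitted() returns the estimates ŷ; residuals() returns y − ŷ.
head(fitted(runs.mdl))
head(residuals(runs.mdl))
# For a least squares fit with an intercept, the residuals sum to
# (approximately) zero:
sum(residuals(runs.mdl))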
Our goal is to find the set of coefficients c that does the best job of estimating Y given X; we’d like the estimates ŷ to be as close as possible to Y. In a classical linear regression model, we find coefficients c that minimize the sum of squared differences between the estimates ŷ and the observed values Y. Specifically, we want to find values for c that minimize:

∑i (yi − ŷi)²

where the sum runs over all of the observations.
This is called the least squares method for regression. You can use the lm function in R to estimate the coefficients in a linear model:[56]
lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)
Arguments to lm include the following.
Argument | Description | Default |
---|---|---|
formula | A formula object that specifies the form of the model to fit. | |
data | A data frame, list, or environment (or an object that can be coerced to a data frame) in which the variables in formula can be evaluated. | |
subset | A vector specifying the observations in data to include in the model. | |
weights | A numeric vector containing weights for each observation in data. | NULL |
na.action | A function that specifies what lm should do if there are NA values in the data. If NULL, no action is taken on NA values. | getOption("na.action"), which is na.omit in a factory-fresh R session |
method | The method to use for fitting. Only method="qr" fits a model, though you can specify method="model.frame" to return a model frame instead. | "qr" |
model | A logical value specifying whether the “model frame” should be returned. | TRUE |
x | A logical value specifying whether the “model matrix” should be returned. | FALSE |
y | A logical value specifying whether the response vector should be returned. | FALSE |
qr | A logical value specifying whether the QR decomposition should be returned. | TRUE |
singular.ok | A logical value specifying whether a singular fit is allowed; if FALSE, a singular fit is an error. | TRUE |
contrasts | A list of contrasts for factors in the model, specifying one contrast for each factor in the model. For example, for the formula y~a+b, to specify a Helmert contrast for a and a treatment contrast for b, you would use the argument contrasts=list(a="contr.helmert", b="contr.treatment"). Some options in R are "contr.helmert" for Helmert contrasts, "contr.sum" for sum-to-zero contrasts, "contr.treatment" to contrast each level with the baseline level, and "contr.poly" for contrasts based on orthogonal polynomials. See [Venables2002] for an explanation of why contrasts are important and how they are used. | When contrasts=NULL (the default), lm uses the value from options("contrasts") |
offset | A vector of offsets to use when building the model. (An offset is a linear term that is included in the model without fitting.) | |
... | Additional arguments passed to lower-level functions such as lm.fit (for unweighted models) or lm.wfit (for weighted models). | |
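For example, here is a sketch of a typical call, refitting the model from Example: A Simple Linear Model with a few of these arguments spelled out explicitly (it assumes the team.batting.00to08 data frame from that example is loaded):

# A sketch; assumes team.batting.00to08 from "Example: A Simple Linear Model".
runs.mdl <- lm(
  formula = runs ~ singles + doubles + triples + homeruns +
    walks + hitbypitch + sacrificeflies +
    stolenbases + caughtstealing,
  data = team.batting.00to08,
  na.action = na.omit)    # drop observations with missing values
summary(runs.mdl)         # coefficients, standard errors, and R-squared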
Model-fitting functions in R return model objects. A model object contains a lot of information about the fitted model (and the fitting operation). Different model objects contain slightly different information.
You may notice that most modeling functions share a few common arguments: formula, data, na.action, subset, and weights. These arguments mean the same thing for most modeling functions.
If you are working with a very large data set, you may want to consider using the biglm function (in the biglm package) instead of lm. This function uses only O(p²) memory for p variables, which is much less than the memory required by lm.
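Here is a minimal sketch of what a biglm fit might look like, assuming the biglm package is installed and reusing the batting data from the earlier example (the choice of predictors is just illustrative):

# A sketch; assumes the biglm package is installed and
# team.batting.00to08 is loaded. biglm() keeps only a small
# (order p^2) summary of the data in memory.
library(biglm)
runs.big <- biglm(runs ~ singles + doubles + triples + homeruns + walks,
                  data = team.batting.00to08)
summary(runs.big)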
Linear models fit with the least squares method are among the oldest statistical techniques, dating back to the age of slide rules. Even today, when computers are ubiquitous, high-quality statistical software is free, and statisticians have developed thousands of new estimation methods, they remain popular. One reason is that linear models are easy to understand. Another is that, by the Gauss-Markov theorem, the least squares estimator has the smallest variance among all linear unbiased estimators.
Technically, linear regression is not always appropriate. Ordinary least squares (OLS) regression (implemented through lm) is guaranteed to work only when certain properties of the training data are true. Here are the key assumptions:
Linearity. We assume that the response variable y is a linear function of the predictor variables x1, x2, ..., xn.
Full rank. No predictor variable is a linear combination of the others. (Equivalently, the predictor matrix is not singular.) In particular, for any pair of predictors: ∀ xi, xj, ∄ c such that xi = cxj.
Exogeneity of the predictor variables. The expected value of the error term ε is 0 for all possible values of the predictor variables.
Homoscedasticity. The variance of the error term ε is constant and is not correlated with the predictor variables.
Nonautocorrelation. In a sequence of observations, the error terms are not correlated with one another.
Exogenously generated data. The predictor variables x1, x2, ..., xn are generated independently of the process that generates the error term ε.
Normality. The error term ε is normally distributed with mean 0 and standard deviation σ.
In practice, OLS models often make accurate predictions even when one (or more) of these assumptions are violated.
By the way, it’s perfectly OK for there to be a nonlinear relationship between some of the predictor variables. Suppose that one of the variables is age. You could add age^2, log(age), or other nonlinear mathematical expressions using age to the model and not violate the assumptions above. You are effectively defining a set of new predictor variables: w1 = age, w2 = age², w3 = log(age). This doesn’t violate the linearity assumption (because the model is still a linear function of the predictor variables) or the full rank assumption (as long as the relationship between the new variables is not linear).
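Here is a minimal sketch of how such terms look in an lm formula; the data frame people and its columns income and age are hypothetical:

# A sketch with a hypothetical data frame 'people' containing
# columns 'income' and 'age'. I() protects arithmetic inside a
# formula, so I(age^2) enters the squared term as a new predictor.
income.mdl <- lm(income ~ age + I(age^2) + log(age), data = people)
summary(income.mdl)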
If you want to be careful, you can use test functions to check if the OLS assumptions apply:
You can test for heteroscedasticity using the function ncvTest in the car (Companion to Applied Regression) package, which implements the Breusch-Pagan test. (Alternatively, you could use the bptest function in the lmtest package, which implements the same test. The lmtest package includes a number of other functions for testing for heteroscedasticity; see the documentation for more details. A sketch using the lmtest functions appears after this list.)
You can test for autocorrelation in a model using the function durbin.watson in the car package, which implements the Durbin-Watson test. You can also use the function dwtest in the lmtest package by specifying a formula and a data set. (Alternatively, you could use the function bgtest in the lmtest package, which implements the Breusch-Godfrey test. This function also tests for higher-order disturbances.)
You can check that the predictor matrix is not singular by using the singular.ok=FALSE argument in lm.
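Here is a sketch of the lmtest alternatives mentioned above, applied to the runs.mdl model (it assumes the lmtest package is installed and that runs.mdl and team.batting.00to08 are still available from the earlier example):

# A sketch; assumes the lmtest package is installed and that
# runs.mdl and team.batting.00to08 are available.
library(lmtest)
bptest(runs.mdl)     # Breusch-Pagan test for heteroscedasticity
dwtest(runs ~ singles + doubles + triples + homeruns +
         walks + hitbypitch + sacrificeflies +
         stolenbases + caughtstealing,
       data = team.batting.00to08)   # Durbin-Watson test
bgtest(runs.mdl)     # Breusch-Godfrey test (higher-order autocorrelation)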
Incidentally, the example used in Example: A Simple Linear Model is not heteroscedastic:
> ncvTest(runs.mdl)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 1.411893 Df = 1 p = 0.2347424
Nor is there a problem with autocorrelation:
> durbin.watson(runs.mdl)
lag Autocorrelation D-W Statistic p-value
1 0.003318923 1.983938 0.884
Alternative hypothesis: rho != 0
Or with singularity:
> runs.mdl <- lm(
+   formula=runs~singles+doubles+triples+homeruns+
+     walks+hitbypitch+sacrificeflies+
+     stolenbases+caughtstealing,
+   data=team.batting.00to08, singular.ok=FALSE)
If a model has problems with heteroscedasticity or outliers, consider using a resistant or robust regression function, as described in Robust and Resistant Regression. If the data is homoscedastic and not autocorrelated, but the error form is not normal, then a good choice is ridge regression, which is described in Ridge Regression. If the predictors are closely correlated (and nearly collinear), then a good choice is principal components regression, as described in Principal Components Regression and Partial Least Squares Regression.
Often, ordinary least squares regression works well even with imperfect data. However, it’s better in many situations to use regression techniques that are less sensitive to outliers and heteroscedasticity. With R, there are alternative options for fitting linear models.
If you would like to fit a linear regression model to data with outliers, consider using resistant regression. The lqs function in the MASS package fits resistant regression models using estimators such as least median of squares (LMS) and least trimmed squares (LTS):
library(MASS)
## S3 method for class 'formula':
lqs(formula, data, ...,
    method = c("lts", "lqs", "lms", "S", "model.frame"),
    subset, na.action, model = TRUE,
    x.ret = FALSE, y.ret = FALSE, contrasts = NULL)
## Default S3 method:
lqs(x, y, intercept = TRUE,
    method = c("lts", "lqs", "lms", "S"),
    quantile, control = lqs.control(...),
    k0 = 1.548, seed, ...)
Robust regression methods can be useful when there are problems with heteroscedasticity and outliers in the data. The function rlm in the MASS package fits a model using M-estimation (or MM-estimation, if you specify method="MM"):
## S3 method for class 'formula':
rlm(formula, data, weights, ..., subset, na.action,
    method = c("M", "MM", "model.frame"),
    wt.method = c("inv.var", "case"),
    model = TRUE, x.ret = TRUE, y.ret = FALSE, contrasts = NULL)
## Default S3 method:
rlm(x, y, weights, ..., w = rep(1, nrow(x)),
    init = "ls", psi = psi.huber,
    scale.est = c("MAD", "Huber", "proposal 2"), k2 = 1.345,
    method = c("M", "MM"), wt.method = c("inv.var", "case"),
    maxit = 20, acc = 1e-4, test.vec = "resid", lqs.control = NULL)
You may also want to try the function lmRob in the robust package, which fits a model using MS- and S-estimation:
library(robust)
lmRob(formula, data, weights, subset, na.action,
      model = TRUE, x = FALSE, y = FALSE, contrasts = NULL,
      nrep = NULL, control = lmRob.control(...),
      genetic.control = NULL, ...)
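As a quick sketch (assuming the robust package is installed and using the shiller.index data set introduced in the example below), a call might look like this:

# A sketch; assumes the robust package is installed and the
# shiller.index data set (used in the example below) is loaded.
library(robust)
hpi.lmRob <- lmRob(Index ~ Year, data = shiller.index)
summary(hpi.lmRob)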
As a quick exercise, we’ll look at how lm, lqs, and rlm perform on some particularly ugly data: U.S. housing prices. We’ll use Robert Shiller’s home price index as an example, looking at home prices between 1890 and 2009.[57] First, we’ll load the data and fit it using an ordinary linear regression model, a robust regression model, and a resistant regression model:
> library(nutshell)
> data(shiller.index)
> hpi.lm <- lm(Index~Year, data=shiller.index)
> hpi.rlm <- rlm(Index~Year, data=shiller.index)
> hpi.lqs <- lqs(Index~Year, data=shiller.index)
Now we’ll plot the data to compare how each method worked. We’ll plot the models using the abline function because it allows you to specify a fitted model as an argument (as long as the model object has a coef method):
> plot(Index~Year, data=shiller.index, pch=19, cex=0.3)
> abline(reg=hpi.lm, lty=1)
> abline(reg=hpi.rlm, lty=2)
> abline(reg=hpi.lqs, lty=3)
> legend(x=1900, y=200, legend=c("lm", "rlm", "lqs"), lty=c(1, 2, 3))
As you can see from Figure 20-2, the standard linear model is influenced by big peaks (such as the growth between 2001 and 2006) and big valleys (such as the dip between 1920 and 1940). The robust regression method is less sensitive to peaks and valleys in this data, and the resistant regression method is the least sensitive.
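If you want to compare the three fits numerically as well as visually, one quick sketch (using the model objects created above) is to line up their coefficients:

# Compare the intercept and slope chosen by each method,
# using the hpi.lm, hpi.rlm, and hpi.lqs objects fit above.
rbind(lm = coef(hpi.lm), rlm = coef(hpi.rlm), lqs = coef(hpi.lqs))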
[55] By the way, the estimate returned by a model is not an exact prediction. It is, instead, the expected value of the response variable given the predictor variables. To be precise, the estimate ŷ means:

ŷ = E(y | x1, x2, ..., xn)

This observation is important when we talk about generalized linear models later.
[56] To efficiently calculate the coefficients, R uses several matrix calculations. R uses a method called QR decomposition to transform X into an orthogonal matrix Q and an upper triangular matrix R, where X = QR, and then calculates the coefficients as c = R⁻¹QᵀY.
[57] The data is available from http://www.irrationalexuberance.com/.