Now that we’ve seen a simple example of how models work in R, let’s describe in detail what lm does and how you can control it. A linear regression model is appropriate when the response variable (the thing that you want to predict) can be estimated from a linear function of the predictor variables (the information that you know). Technically, we assume that:

y = c0 + c1x1 + c2x2 + ... + cnxn + ε

where y is the response variable, x1, x2, ..., xn are the predictor variables (or predictors), c1, c2, ..., cn are the coefficients for the predictor variables, c0 is the intercept, and ε is the error term. (For more details on the assumptions of the least squares model, see Assumptions of Least Squares Regression.) The predictors can be simple variables or even nonlinear functions of variables.
Suppose that you have a matrix of observed predictor variables X and a vector of response values Y. (In this sentence, I’m using the terms “matrix” and “vector” in the mathematical sense.) We have assumed a linear model, so given a set of coefficients c, we can calculate a set of estimates ŷ for the input data X by calculating ŷ = Xc. The differences between the estimates ŷ and the actual values Y are called the residuals. You can think of the residuals as a measure of the prediction error; small residuals mean that the predicted values are close to the actual values. We also assume that the expected value of the error term in the model (the difference between the actual response values and the values given by the underlying linear relationship) is 0. This is important to remember: at best, a model is probabilistic.[55]
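To make this concrete, here is a quick sketch using the runs.mdl model fit in Example: A Simple Linear Model (it assumes that model object is still in your workspace):

# A sketch using runs.mdl from "Example: A Simple Linear Model".
# fitted() returns the estimates ŷ; residuals() returns y − ŷ.
head(fitted(runs.mdl))
head(residuals(runs.mdl))
# For a least squares fit with an intercept, the residuals sum to
# (approximately) zero:
sum(residuals(runs.mdl))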
Our goal is to find the set of coefficients c that does the best job of estimating Y given X; we’d like the estimates ŷ to be as close as possible to Y. In a classical linear regression model, we find coefficients c that minimize the sum of squared differences between the estimates ŷ and the observed values Y. Specifically, we want to find values for c that minimize:

∑i (yi − ŷi)²

where the sum runs over all of the observations.
This is called the least squares method for regression. You can use the lm function in R to estimate the coefficients in a linear model:[56]
lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)
Arguments to lm include the following.
Argument | Description | Default |
---|---|---|
formula | A formula object that specifies the form of the model to fit. | |
data | A data frame, list, or environment (or an object that can be coerced to a data frame) in which the variables in formula can be evaluated. | |
subset | A vector specifying the observations in data to include in the model. | |
weights | A numeric vector containing weights for each observation in data. | NULL |
na.action | A function that specifies what lm should do if there are NA values in the data. If NULL, no action is taken on NA values. | getOption("na.action"), which is na.omit in a factory-fresh R session |
method | The method to use for fitting. Only method="qr" fits a model, though you can specify method="model.frame" to return a model frame instead. | "qr" |
model | A logical value specifying whether the “model frame” should be returned. | TRUE |
x | A logical value specifying whether the “model matrix” should be returned. | FALSE |
y | A logical value specifying whether the response vector should be returned. | FALSE |
qr | A logical value specifying whether the QR decomposition should be returned. | TRUE |
singular.ok | A logical value specifying whether a singular fit is allowed; if FALSE, a singular fit is an error. | TRUE |
contrasts | A list of contrasts for factors in the model, specifying one contrast for each factor in the model. For example, for the formula y~a+b, to specify a Helmert contrast for a and a treatment contrast for b, you would use the argument contrasts=list(a="contr.helmert", b="contr.treatment"). Some options in R are "contr.helmert" for Helmert contrasts, "contr.sum" for sum-to-zero contrasts, "contr.treatment" to contrast each level with the baseline level, and "contr.poly" for contrasts based on orthogonal polynomials. See [Venables2002] for an explanation of why contrasts are important and how they are used. | When contrasts=NULL (the default), lm uses the value from options("contrasts") |
offset | A vector of offsets to use when building the model. (An offset is a linear term that is included in the model without fitting.) | |
... | Additional arguments passed to lower-level functions such as lm.fit (for unweighted models) or lm.wfit (for weighted models). | |
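For example, here is a sketch of a typical call, refitting the model from Example: A Simple Linear Model with a few of these arguments spelled out explicitly (it assumes the team.batting.00to08 data frame from that example is loaded):

# A sketch; assumes team.batting.00to08 from "Example: A Simple Linear Model".
runs.mdl <- lm(
  formula = runs ~ singles + doubles + triples + homeruns +
    walks + hitbypitch + sacrificeflies +
    stolenbases + caughtstealing,
  data = team.batting.00to08,
  na.action = na.omit)    # drop observations with missing values
summary(runs.mdl)         # coefficients, standard errors, and R-squared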
Model-fitting functions in R return model objects. A model object contains a lot of information about the fitted model (and the fitting operation). Different model objects contain slightly different information.
You may notice that most modeling functions share a few common arguments: formula, data, na.action, subset, and weights. These arguments mean the same thing for most modeling functions.
If you are working with a very large data set, you may want to consider using the biglm function (in the biglm package) instead of lm. This function uses only O(p²) memory for p variables, which is much less than the memory required by lm.
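Here is a minimal sketch of what a biglm fit might look like, assuming the biglm package is installed and reusing the batting data from the earlier example (the choice of predictors is just illustrative):

# A sketch; assumes the biglm package is installed and
# team.batting.00to08 is loaded. biglm() keeps only a small
# (order p^2) summary of the data in memory.
library(biglm)
runs.big <- biglm(runs ~ singles + doubles + triples + homeruns + walks,
                  data = team.batting.00to08)
summary(runs.big)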
Linear models fit with the least squares method are among the oldest statistical techniques, dating back to the age of slide rules. Even today, when computers are ubiquitous, high-quality statistical software is free, and statisticians have developed thousands of new estimation methods, they remain popular. One reason is that linear models are easy to understand. Another is that, by the Gauss-Markov theorem, the least squares estimator has the smallest variance among all linear unbiased estimators.
Technically, linear regression is not always appropriate. Ordinary least squares (OLS) regression (implemented through lm) is guaranteed to work only when certain properties of the training data are true. Here are the key assumptions:
Linearity. We assume that the response variable y is a linear function of the predictor variables x1, x2, ..., xn.
Full rank. No predictor variable is a linear combination of the others. (Equivalently, the predictor matrix is not singular.) In particular, for any pair of predictors: ∀ xi, xj, ∄ c such that xi = cxj.
Exogeneity of the predictor variables. The expected value of the error term ε is 0 for all possible values of the predictor variables.
Homoscedasticity. The variance of the error term ε is constant and is not correlated with the predictor variables.
Nonautocorrelation. In a sequence of observations, the error terms are not correlated with one another.
Exogenously generated data. The predictor variables x1, x2, ..., xn are generated independently of the process that generates the error term ε.
Normality. The error term ε is normally distributed with mean 0 and standard deviation σ.
In practice, OLS models often make accurate predictions even when one (or more) of these assumptions are violated.
By the way, it’s perfectly OK for there to be a nonlinear relationship between some of the predictor variables. Suppose that one of the variables is age. You could add age^2, log(age), or other nonlinear mathematical expressions using age to the model and not violate the assumptions above. You are effectively defining a set of new predictor variables: w1 = age, w2 = age², w3 = log(age). This doesn’t violate the linearity assumption (because the model is still a linear function of the predictor variables) or the full rank assumption (as long as the relationship between the new variables is not linear).
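Here is a minimal sketch of how such terms look in an lm formula; the data frame people and its columns income and age are hypothetical:

# A sketch with a hypothetical data frame 'people' containing
# columns 'income' and 'age'. I() protects arithmetic inside a
# formula, so I(age^2) enters the squared term as a new predictor.
income.mdl <- lm(income ~ age + I(age^2) + log(age), data = people)
summary(income.mdl)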
If you want to be careful, you can use test functions to check if the OLS assumptions apply:
You can test for heteroscedasticity using the function ncvTest in the car (Companion to Applied Regression) package, which implements the Breusch-Pagan test. (Alternatively, you could use the bptest function in the lmtest package, which implements the same test. The lmtest package includes a number of other functions for testing for heteroscedasticity; see the documentation for more details. A sketch using the lmtest functions appears after this list.)
You can test for autocorrelation in a model using the function durbin.watson in the car package, which implements the Durbin-Watson test. You can also use the function dwtest in the lmtest package by specifying a formula and a data set. (Alternatively, you could use the function bgtest in the lmtest package, which implements the Breusch-Godfrey test. This function also tests for higher-order disturbances.)
You can check that the predictor matrix is not singular by using the singular.ok=FALSE argument in lm.
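Here is a sketch of the lmtest alternatives mentioned above, applied to the runs.mdl model (it assumes the lmtest package is installed and that runs.mdl and team.batting.00to08 are still available from the earlier example):

# A sketch; assumes the lmtest package is installed and that
# runs.mdl and team.batting.00to08 are available.
library(lmtest)
bptest(runs.mdl)     # Breusch-Pagan test for heteroscedasticity
dwtest(runs ~ singles + doubles + triples + homeruns +
         walks + hitbypitch + sacrificeflies +
         stolenbases + caughtstealing,
       data = team.batting.00to08)   # Durbin-Watson test
bgtest(runs.mdl)     # Breusch-Godfrey test (higher-order autocorrelation)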
Incidentally, the example used in Example: A Simple Linear Model is not heteroscedastic:
> ncvTest(runs.mdl)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 1.411893 Df = 1 p = 0.2347424
Nor is there a problem with autocorrelation:
> durbin.watson(runs.mdl)
lag Autocorrelation D-W Statistic p-value
1 0.003318923 1.983938 0.884
Alternative hypothesis: rho != 0
Or with singularity:
> runs.mdl <- lm(
+   formula=runs~singles+doubles+triples+homeruns+
+     walks+hitbypitch+sacrificeflies+
+     stolenbases+caughtstealing,
+   data=team.batting.00to08, singular.ok=FALSE)
If a model has problems with heteroscedasticity or outliers, consider using a resistant or robust regression function, as described in Robust and Resistant Regression. If the data is homoscedastic and not autocorrelated, but the error form is not normal, then a good choice is ridge regression, which is described in Ridge Regression. If the predictors are closely correlated (and nearly collinear), then a good choice is principal components regression, as described in Principal Components Regression and Partial Least Squares Regression.
Often, ordinary least squares regression works well even with imperfect data. However, it’s better in many situations to use regression techniques that are less sensitive to outliers and heteroscedasticity. With R, there are alternative options for fitting linear models.
If you would like to fit a linear regression model to data with outliers, consider using resistant regression. The lqs function in the MASS package fits resistant regression models using estimators such as least median of squares (LMS) and least trimmed squares (LTS):
library(MASS)
## S3 method for class 'formula':
lqs(formula, data, ...,
    method = c("lts", "lqs", "lms", "S", "model.frame"),
    subset, na.action, model = TRUE,
    x.ret = FALSE, y.ret = FALSE, contrasts = NULL)
## Default S3 method:
lqs(x, y, intercept = TRUE,
    method = c("lts", "lqs", "lms", "S"),
    quantile, control = lqs.control(...),
    k0 = 1.548, seed, ...)
Robust regression methods can be useful when there are problems with heteroscedasticity and outliers in the data. The function rlm in the MASS package fits a model using M-estimation (or MM-estimation, if you specify method="MM"):
## S3 method for class 'formula':
rlm(formula, data, weights, ..., subset, na.action,
    method = c("M", "MM", "model.frame"),
    wt.method = c("inv.var", "case"),
    model = TRUE, x.ret = TRUE, y.ret = FALSE, contrasts = NULL)
## Default S3 method:
rlm(x, y, weights, ..., w = rep(1, nrow(x)),
    init = "ls", psi = psi.huber,
    scale.est = c("MAD", "Huber", "proposal 2"), k2 = 1.345,
    method = c("M", "MM"), wt.method = c("inv.var", "case"),
    maxit = 20, acc = 1e-4, test.vec = "resid", lqs.control = NULL)
You may also want to try the function lmRob in the robust package, which fits a model using MS- and S-estimation:
library(robust)
lmRob(formula, data, weights, subset, na.action,
      model = TRUE, x = FALSE, y = FALSE, contrasts = NULL,
      nrep = NULL, control = lmRob.control(...),
      genetic.control = NULL, ...)
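As a quick sketch (assuming the robust package is installed and using the shiller.index data set introduced in the example below), a call might look like this:

# A sketch; assumes the robust package is installed and the
# shiller.index data set (used in the example below) is loaded.
library(robust)
hpi.lmRob <- lmRob(Index ~ Year, data = shiller.index)
summary(hpi.lmRob)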
As a quick exercise, we’ll look at how lm, lqs, and rlm perform on some particularly ugly data: U.S. housing prices. We’ll use Robert Shiller’s home price index as an example, looking at home prices between 1890 and 2009.[57] First, we’ll load the data and fit it using an ordinary linear regression model, a robust regression model, and a resistant regression model:
> library(nutshell)
> data(shiller.index)
> hpi.lm <- lm(Index~Year, data=shiller.index)
> hpi.rlm <- rlm(Index~Year, data=shiller.index)
> hpi.lqs <- lqs(Index~Year, data=shiller.index)
Now we’ll plot the data to compare how each method worked. We’ll plot the models using the abline function because it allows you to specify a fitted model as an argument (as long as the model object has a coef method):
> plot(Index~Year, data=shiller.index, pch=19, cex=0.3)
> abline(reg=hpi.lm, lty=1)
> abline(reg=hpi.rlm, lty=2)
> abline(reg=hpi.lqs, lty=3)
> legend(x=1900, y=200, legend=c("lm", "rlm", "lqs"), lty=c(1, 2, 3))
As you can see from Figure 20-2, the standard linear model is influenced by big peaks (such as the growth between 2001 and 2006) and big valleys (such as the dip between 1920 and 1940). The robust regression method is less sensitive to peaks and valleys in this data, and the resistant regression method is the least sensitive.
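If you want to compare the three fits numerically as well as visually, one quick sketch (using the model objects created above) is to line up their coefficients:

# Compare the intercept and slope chosen by each method,
# using the hpi.lm, hpi.rlm, and hpi.lqs objects fit above.
rbind(lm = coef(hpi.lm), rlm = coef(hpi.rlm), lqs = coef(hpi.lqs))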
[55] By the way, the estimate returned by a model is not an exact prediction. It is, instead, the expected value of the response variable given the predictor variables. To be precise, the estimate ŷ means:

ŷ = E(y | x1, x2, ..., xn)

This observation is important when we talk about generalized linear models later.
[56] To efficiently calculate the coefficients, R uses several matrix calculations. R uses a method called QR decomposition to transform X into an orthogonal matrix Q and an upper triangular matrix R, where X = QR, and then calculates the coefficients as c = R⁻¹QᵀY.
[57] The data is available from http://www.irrationalexuberance.com/.