Details About the lm Function

Now that we’ve seen a simple example of how models work in R, let’s describe in detail what lm does and how you can control it. A linear regression model is appropriate when the response variable (the thing that you want to predict) can be estimated from a linear function of the predictor variables (the information that you know). Technically, we assume that:

y = c0 + c1x1 + c2x2 + ... + cnxn + ε

where y is the response variable, x1, x2, ..., xn are the predictor variables (or predictors), c1, c2, ..., cn are the coefficients for the predictor variables, c0 is the intercept, and ε is the error term. (For more details on the assumptions of the least squares model, see Assumptions of Least Squares Regression.) The predictors can be simple variables or even nonlinear functions of variables.

Suppose that you have a matrix of observed predictor variables X and a vector of response variables Y. (In this sentence, I’m using the terms “matrix” and “vector” in the mathematical sense.) We have assumed a linear model, so given a set of coefficients c, we can calculate a set of estimates ŷ for the input data X by calculating ŷ = Xc. The differences between the estimates ŷ and the actual values Y are called the residuals. You can think of the residuals as a measure of the prediction error; small residuals mean that the predicted values are close to the actual values. We also assume that the expected value of the error term in the model (the difference between the actual response values and the estimates) is 0. This is important to remember: at best, a model is probabilistic.[55]
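
To make this concrete, here is a small sketch using made-up numbers and an arbitrary guess at the coefficients (none of these values come from this chapter). It computes the estimates ŷ = Xc, the residuals, and the sum of squared residuals:

# hypothetical data: five observations of a single predictor
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
X <- cbind(1, x)               # model matrix with an intercept column
c.guess <- c(0.1, 2.0)         # an arbitrary guess at the coefficients
y.hat <- X %*% c.guess         # estimates: y.hat = Xc
res <- y - y.hat               # residuals: prediction errors
sum(res^2)                     # sum of squared residuals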

Our goal is to find the set of coefficients c that does the best job of estimating Y given X; we’d like the estimates ŷ to be as close as possible to Y. In a classical linear regression model, we find coefficients c that minimize the sum of squared differences between the estimates ŷ and the observed values Y. Specifically, we want to find values for c that minimize:

Σi (Yi - ŷi)²

This is called the least squares method for regression. You can use the lm function in R to estimate the coefficients in a linear model:[56]

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

Arguments to lm include the following.

formula
    A formula object that specifies the form of the model to fit.
data
    A data frame, list, or environment (or an object that can be coerced to a data frame) in which the variables in formula can be evaluated.
subset
    A vector specifying the observations in data to include in the model.
weights
    A numeric vector containing weights for each observation in data. Default: NULL.
na.action
    A function that specifies what lm should do if there are NA values in the data. If NULL, lm uses na.omit. Default: getOption("na.action"), which is na.omit in a factory-fresh R installation (na.fail if the option is unset).
method
    The method to use for fitting. Only method="qr" fits a model, though you can specify method="model.frame" to return a model frame instead. Default: "qr".
model
    A logical value specifying whether the “model frame” should be returned. Default: TRUE.
x
    A logical value specifying whether the “model matrix” should be returned. Default: FALSE.
y
    A logical value specifying whether the response vector should be returned. Default: FALSE.
qr
    A logical value specifying whether the QR-decomposition should be returned. Default: TRUE.
singular.ok
    A logical value specifying whether a singular fit is allowed; if singular.ok=FALSE, a singular fit is an error. Default: TRUE.
contrasts
    A list of contrasts for factors in the model, specifying one contrast for each factor in the model. For example, for the formula y~a+b, to specify a Helmert contrast for a and a treatment contrast for b, you would use the argument contrasts=list(a="contr.helmert", b="contr.treatment"). Some options in R are "contr.helmert" for Helmert contrasts, "contr.sum" for sum-to-zero contrasts, "contr.treatment" to contrast each level with the baseline level, and "contr.poly" for contrasts based on orthogonal polynomials. See [Venables2002] for an explanation of why contrasts are important and how they are used. Default: NULL, in which case lm uses the contrasts specified by options("contrasts").
offset
    A vector of offsets to use when building the model. (An offset is a linear term that is included in the model without fitting.)
...
    Additional arguments passed to lower-level functions such as lm.fit (for unweighted models) or lm.wfit (for weighted models).
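
To see how several of these arguments fit together, here is a hedged sketch of a single call. The data frame my.data, its columns (y, x1, grp, w), and the factor grp are hypothetical placeholders, not objects defined in this chapter:

# grp is assumed to be a factor; w is a column of observation weights
fit <- lm(y ~ x1 + grp,
          data = my.data,
          subset = w > 0,                    # fit only rows with positive weights
          weights = w,                       # weighted least squares
          na.action = na.exclude,            # drop NA rows from the fit
          contrasts = list(grp = "contr.treatment"))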

Model-fitting functions in R return model objects. A model object contains a lot of information about the fitted model (and the fitting operation). Different model objects contain slightly different information.
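
For example, assuming the runs.mdl object fitted in Example: A Simple Linear Model, you can inspect the returned object with a few standard functions (a rough sketch; the exact components depend on the arguments passed to lm):

class(runs.mdl)     # "lm"
names(runs.mdl)     # components such as coefficients, residuals, fitted.values, qr, ...
coef(runs.mdl)      # the fitted coefficients
summary(runs.mdl)   # coefficients, standard errors, R-squared, and more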

You may notice that most modeling functions share a few common arguments: formula, data, na.action, subset, and weights. These arguments mean the same thing for most modeling functions.

If you are working with a very large data set, you may want to consider using the biglm function (from the biglm package) instead of lm. This function uses only about p^2 memory for p variables, which is much less than the memory required by lm.
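
As a rough sketch (assuming the biglm package is installed, and using hypothetical data frames chunk1 and chunk2 that hold successive pieces of a large data set), you might fit the model one chunk at a time:

library(biglm)
fit <- biglm(y ~ x1 + x2, data = chunk1)   # fit on the first chunk
fit <- update(fit, chunk2)                 # fold in the next chunk
summary(fit)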

Fitting linear models by the least squares method is one of the oldest statistical techniques, dating back to the age of slide rules. Even today, when computers are ubiquitous, high-quality statistical software is free, and statisticians have developed thousands of newer estimation methods, linear models are still popular. One reason is that linear models are easy to understand. Another is that, under the standard assumptions, the least squares estimator has the smallest variance among all linear unbiased estimators, a result known as the Gauss-Markov theorem.

Technically, linear regression is not always appropriate. Ordinary least squares (OLS) regression (implemented through lm) is guaranteed to work well only when certain properties of the training data hold. Here are the key assumptions:

- Linearity: the response variable is a linear function of the predictor variables (plus an error term).
- Full rank: no predictor variable is an exact linear combination of the other predictors; equivalently, the model matrix is not singular.
- Exogeneity: the error term has an expected value of 0 and is uncorrelated with the predictor variables.
- Homoscedasticity and nonautocorrelation: the errors have constant variance and are not correlated with one another.
- Normality: for the usual hypothesis tests and confidence intervals to be valid, the errors should be normally distributed.

In practice, OLS models often make accurate predictions even when one (or more) of these assumptions is violated.

By the way, it’s perfectly OK for some predictor variables to be nonlinear functions of other variables. Suppose that one of the variables is age. You could add age^2, log(age), or other nonlinear mathematical expressions of age to the model without violating the assumptions above. You are effectively defining a set of new predictor variables: w1 = age, w2 = age^2, w3 = log(age). This doesn’t violate the linearity assumption (because the model is still a linear function of the predictor variables) or the full rank assumption (as long as the new variables are not exact linear functions of one another).
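
In R’s formula notation, you can write such transformed terms directly. Here is a brief sketch, assuming a hypothetical data frame people with columns income and age:

# I() protects arithmetic expressions like age^2 inside a formula
fit <- lm(income ~ age + I(age^2) + log(age), data = people)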

If you want to be careful, you can use diagnostic functions, such as ncv.test (which tests for non-constant variance) and durbin.watson (which tests for autocorrelation), to check whether the OLS assumptions hold.

Incidentally, the model fitted in Example: A Simple Linear Model shows no sign of heteroscedasticity:

> ncv.test(runs.mdl)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 1.411893    Df = 1     p = 0.2347424

Nor is there a problem with autocorrelation:

> durbin.watson(runs.mdl)
 lag Autocorrelation D-W Statistic p-value
   1     0.003318923      1.983938   0.884
 Alternative hypothesis: rho != 0

Nor with singularity; refitting the model with singular.ok=FALSE completes without an error:

> runs.mdl <- lm(
+    formula=runs~singles+doubles+triples+homeruns+
+                 walks+hitbypitch+sacrificeflies+
+                 stolenbases+caughtstealing,
+    data=team.batting.00to08,singular.ok=FALSE)

If a model has problems with heteroscedasticity or outliers, consider using a resistant or robust regression function, as described in Robust and Resistant Regression. If the data is homoscedastic and not autocorrelated, but the error form is not normal, then a good choice is ridge regression, which is described in Ridge Regression. If the predictors are closely correlated (and nearly collinear), then a good choice is principal components regression, as described in Principal Components Regression and Partial Least Squares Regression.

Often, ordinary least squares regression works well even with imperfect data. However, in many situations it’s better to use regression techniques that are less sensitive to outliers and heteroscedasticity. R provides several such alternatives for fitting linear models.



[55] By the way, the estimate returned by a model is not an exact prediction. It is, instead, the expected value of the response variable given the predictor variables. To be precise, the estimate ŷ means:

ŷ = E(y | x1, x2, ..., xn)

This observation is important when we talk about generalized linear models later.

[56] To efficiently calculate the coefficients, R uses several matrix calculations. R uses a method called QR-decomposition to transform X into an orthogonal matrix Q and an upper triangular matrix R, where X = QR, and then calculates the coefficients as c = R^-1 Q^T Y.
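
As a rough illustration using a small made-up data set (not one from this chapter), you can check that the QR-based calculation and lm agree:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
X <- cbind(1, x)          # model matrix with an intercept column
decomp <- qr(X)           # QR-decomposition of X
qr.coef(decomp, y)        # least squares coefficients from the decomposition
coef(lm(y ~ x))           # the same values from lm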
