Subset Selection and Shrinkage Methods

Modeling functions like lm will include every variable specified in the formula, calculating a coefficient for each one. Unfortunately, this means that lm may calculate coefficients for variables that aren’t needed. You can manually tune a model using diagnostics like summary and lm.influence. However, you can also use some other statistical techniques to reduce the effect of insignificant variables or remove them from a model altogether.

A simple technique for selecting the most important variables is stepwise variable selection. The stepwise algorithm works by repeatedly adding or removing variables from the model, trying to “improve” the model at each step. When the algorithm can no longer improve the model by adding or subtracting variables, it stops and returns the new (and usually smaller) model.

Note that “improvement” does not just mean reducing the residual sum of squares (RSS) for the fitted model. Adding a variable to a model never increases the RSS (see a statistics book for an explanation of why), but it does increase model complexity. Typically, AIC (Akaike’s information criterion) is used to weigh the value of each additional variable against the complexity it adds. The AIC is defined as AIC = − 2 ∗ log(L) + k ∗ edf, where L is the likelihood, edf is the equivalent degrees of freedom, and k is the penalty per degree of freedom (k = 2 gives the classical AIC).
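
As a quick illustration, you can compute this criterion for any fitted model with extractAIC, the function that step uses internally. (The data frame dat and the variable names here are hypothetical.)

# "dat" is a hypothetical data frame with a response y and predictors x1 and x2
fit <- lm(y ~ x1 + x2, data = dat)
# extractAIC returns the equivalent degrees of freedom and the AIC value
extractAIC(fit, k = 2)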

In R, you perform stepwise selection through the step function:

step(object, scope, scale = 0,
     direction = c("both", "backward", "forward"),
     trace = 1, keep = NULL, steps = 1000, k = 2, ...)

Here is a description of the arguments to step.

object
    An object representing a model, such as the objects returned by lm, glm, or aov.

scope
    An argument specifying the set of variables that you want in the final model and the set of all variables that you want to consider including in the model. The first set is called the lower bound, and the second is called the upper bound. If a single formula is specified, it is interpreted as the upper bound. To specify both an upper and a lower bound, pass a list with two formulas labeled upper and lower.

scale (default: 0)
    A value used in the definition of AIC for lm and aov models. See the help file for extractAIC for more information.

direction (default: "both")
    Specifies whether variables should only be added to the model (direction="forward"), only removed from the model (direction="backward"), or both (direction="both").

trace (default: 1)
    A numeric value that specifies whether to print details of the fitting process. Specify trace=0 (or a negative number) to suppress printing, trace=1 for normal detail, and higher numbers for even more detail.

keep (default: NULL)
    A function used to select a subset of components to keep from the fitted model object. The function accepts a fitted model object and an AIC statistic.

steps (default: 1000)
    A numeric value that specifies the maximum number of steps to take before the function halts.

k (default: 2)
    The multiple of the number of degrees of freedom to be used in the penalty calculation (extractAIC).

...
    Additional arguments for extractAIC.

There is an alternative implementation of stepwise selection in the MASS library: the stepAIC function. This function works similarly to step but operates on a wider range of model objects.
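
As a sketch of typical usage of step, the following starts from a full model and searches in both directions; the data frame dat and the variables y, x1, x2, and x3 are hypothetical.

# fit the full model on a hypothetical data frame "dat"
full.model <- lm(y ~ x1 + x2 + x3, data = dat)
# search between the intercept-only model and the full model,
# suppressing the step-by-step output with trace = 0
reduced.model <- step(full.model,
                      scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                      direction = "both", trace = 0)
summary(reduced.model)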

Stepwise variable selection still fits the model by ordinary least squares; it simply limits which variables appear in the model. In contrast, ridge regression keeps all the variables but places a constraint on the size of their coefficients, and it fits the model using different computations.

Ridge regression can be used to mitigate problems when there are several highly correlated variables in the underlying data. This condition (called multicollinearity) causes high variance in the results. Reducing the number, or impact, of regressors in the data can help reduce these problems.[58]

In Details About the lm Function, we described how ordinary linear regression finds the coefficients that minimize the residual sum of squares. Ridge regression does something similar: it attempts to minimize the sum of squared residuals plus a penalty on the size of the coefficients. The penalty is a constant λ times the sum of the squared coefficients. Specifically, ridge regression tries to minimize the following quantity:

Σ(i=1..N) (yi − β0 − Σ(j=1..p) xij ∗ βj)² + λ ∗ Σ(j=1..p) βj²

To estimate a model using ridge regression, you can use the lm.ridge function from the MASS package:

library(MASS)
lm.ridge(formula, data, subset, na.action, lambda = 0, model = FALSE,
         x = FALSE, y = FALSE, contrasts = NULL, ...)

Arguments to lm.ridge are the following.

formula
    A formula object that specifies the form of the model to fit.

data
    A data frame, list, or environment (or an object that can be coerced to a data frame) in which the variables in formula can be evaluated.

subset
    A vector specifying the observations in data to include in the model.

na.action
    A function that specifies what lm.ridge should do if there are NA values in the data. If NULL, na.omit is used.

lambda (default: 0)
    A scalar or vector of ridge constants.

model (default: FALSE)
    A logical value specifying whether the “model frame” should be returned.

x (default: FALSE)
    A logical value specifying whether the “model matrix” should be returned.

y (default: FALSE)
    A logical value specifying whether the response vector should be returned.

contrasts (default: NULL)
    A list of contrasts for factors in the model.

...
    Additional arguments to lm.fit.
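
As a brief sketch, the built-in longley data set (whose predictors are highly collinear) is a convenient way to experiment with lm.ridge; the range of ridge constants below is just a plausible starting point:

library(MASS)
# fit ridge regressions over a grid of ridge constants (lambda values)
ridge.models <- lm.ridge(Employed ~ ., data = longley,
                         lambda = seq(0, 0.1, by = 0.001))
plot(ridge.models)      # coefficient paths as lambda varies
select(ridge.models)    # lambda estimates from the HKB, L-W, and GCV criteria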

Another technique for reducing the size of the coefficients (and thus reducing their impact on the final model) is the lasso. Like ridge regression, lasso regression puts a penalty on the size of the coefficients. However, the lasso algorithm uses a different penalty: instead of a sum of squared coefficients, the lasso sums the absolute value of the coefficients. (In math terms, ridge uses L2-norms, while lasso uses L1-norms.) Specifically, the lasso algorithm tries to minimize the following value:

Σ(i=1..N) (yi − β0 − Σ(j=1..p) xij ∗ βj)² + λ ∗ Σ(j=1..p) |βj|

The best way to compute lasso regression in R is through the lars function:

library(lars)
lars(x, y, type = c("lasso", "lar", "forward.stagewise", "stepwise"),
    trace = FALSE, normalize = TRUE, intercept = TRUE, Gram,
    eps = .Machine$double.eps, max.steps, use.Gram = TRUE)

The lars function computes the entire lasso path at once. Specifically, it begins with a model with no variables. It then computes the lambda values for which each variable enters the model and shows the resulting coefficients. Finally, the lars algorithm computes a model with all the coefficients present, which is the same as an ordinary linear regression fit.

This function actually implements a more general algorithm called least angle regression; you have the option to choose least angle regression, forward stagewise regression, or stepwise regression instead of lasso. Here are the arguments to the lars function.

x
    A matrix of predictor variables.

y
    A numeric vector containing the response variable.

type (default: "lasso")
    The type of model to fit. Use type="lasso" for the lasso, type="lar" for least angle regression, type="forward.stagewise" for infinitesimal forward stagewise, and type="stepwise" for stepwise regression.

trace (default: FALSE)
    A logical value specifying whether to print details as the function is running.

normalize (default: TRUE)
    A logical value specifying whether each variable will be standardized to have an L2-norm of 1.

intercept (default: TRUE)
    A logical value indicating whether an intercept should be included in the model.

Gram
    The X’X matrix used in the calculations. To rerun lars with slightly different parameters, but the same underlying data, you may reuse the Gram matrix from a prior run to increase efficiency.

eps (default: .Machine$double.eps)
    An effective zero.

max.steps
    A limit on the number of steps taken by the lars function.

use.Gram (default: TRUE)
    A logical value specifying whether lars should precompute the Gram matrix. (When the number of variables is very large, precomputing it can be time consuming.)
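
As a short sketch, the lars package ships with a diabetes data set containing a predictor matrix x and a response vector y, which you can use to trace the entire lasso path:

library(lars)
data(diabetes)
# compute the full lasso path for the diabetes predictors
lasso.fit <- lars(diabetes$x, diabetes$y, type = "lasso")
plot(lasso.fit)   # coefficient paths as the penalty is relaxed
# coefficients halfway along the path; s is a fraction of the full L1 norm
predict(lasso.fit, s = 0.5, type = "coefficients", mode = "fraction")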

Both ridge regression and lasso regression are special cases of a family of models called the elastic net. Elastic nets are available in R through the function enet in the package elasticnet. (Both the algorithm and the code were developed by Hui Zou and Trevor Hastie.)

enet(x, y, lambda, max.steps, normalize, intercept, trace, eps)

Unfortunately, the enet function requires its predictors as a matrix; it does not accept a data frame and a formula. Here is a description of the parameters for enet:

x
    A matrix of predictor variables.

y
    A numeric vector containing the response variable.

lambda
    The quadratic penalty. Use lambda=0 for a lasso fit.

max.steps (default: 50 * min(ncol(x), nrow(x)-1))
    The maximum number of steps.

trace (default: FALSE)
    A logical value specifying whether to print progress.

normalize (default: TRUE)
    A logical value indicating whether to normalize the input before computing the fit.

intercept (default: TRUE)
    A logical value indicating whether to center the predictors.

eps (default: .Machine$double.eps)
    An effective zero.
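
Here is a hedged sketch using the diabetes data set included with the elasticnet package; the value of lambda is arbitrary and chosen only for illustration:

library(elasticnet)
data(diabetes)   # predictor matrix diabetes$x and response vector diabetes$y
# lambda is the quadratic (ridge-style) penalty; lambda = 0 would give a lasso fit
enet.fit <- enet(diabetes$x, diabetes$y, lambda = 0.5)
plot(enet.fit)   # coefficient paths for the elastic net solution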

Ordinary least squares regression doesn’t always work well with closely correlated variables. A useful technique for modeling such data is principal components regression, which works by first transforming the predictor variables using principal components analysis and then performing a linear regression on the transformed variables.

A closely related technique is partial least squares regression. In partial least squares regression, both the predictor and the response variables are transformed before fitting a linear regression. In R, principal components regression is available through the function pcr in the pls package:

library(pls)
pcr(..., method = pls.options()$pcralg)

Partial least squares is available through the function plsr in the same package:

plsr(..., method = pls.options()$plsralg)

Both functions are actually wrappers around the function mvr:

mvr(formula, ncomp, data, subset, na.action,
    method = pls.options()$mvralg,
    scale = FALSE, validation = c("none", "CV", "LOO"),
    model = TRUE, x = FALSE, y = FALSE, ...)
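
For example, the pls package includes the yarn data set, with a matrix of NIR spectra and a density response, which makes for a quick principal components regression sketch with leave-one-out cross-validation:

library(pls)
data(yarn)
# principal components regression with up to 6 components,
# validated by leave-one-out cross-validation
pcr.fit <- pcr(density ~ NIR, ncomp = 6, data = yarn, validation = "LOO")
summary(pcr.fit)
# the analogous plsr() call would fit a partial least squares model instead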


[58] For example, see [Greene2007].