Chapter 4. Advanced Feature Selection in Linear Models

"I found that math got to be too abstract for my liking and computer science seemed concerned with little details -- trying to save a microsecond or a kilobyte in a computation. In statistics I found a subject that combined the beauty of both math and computer science, using them to solve real-world problems."

This was quoted by Rob Tibshirani, Professor, Stanford University at http://statweb.stanford.edu/~tibs/research_page.html.

So far, we examined the usage of linear models for both quantitative and qualitative outcomes with an emphasis on the techniques of feature selection, that is, the methods and techniques to exclude useless or unwanted predictor variables. We saw that the linear models can be quite effective in the machine learning problems. However, newer techniques that have been developed and refined in the last couple of decades or so can improve the predictive ability and interpretability above and beyond the linear models that we've discussed in the preceding chapters. In this day and age, many datasets have numerous features in relation to the number of observations or, as it is called, high-dimensionality. If you ever have to work on a genomics problem, this will quickly become self-evident. Additionally, with the size of the data that we are being asked to work with, a technique like best subsets or stepwise feature selection can take inordinate amounts of time to converge even on high-speed computers. I'm not talking about minutes; but in many cases, hours of system time are required to get a best subsets solution.

There is a better way in these cases. In this chapter, we will look at the concept of regularization where the coefficients are constrained or shrunk towards zero. There are a number of methods and permutations to these methods of regularization but we will focus on Ridge regression, Least Absolute Shrinkage and Selection Operator (LASSO), and finally, Elastic net, which combines the benefit of both the techniques to one.

Regularization in a nutshell

You may recall that our linear model follows the form, Y = B0 + B1x1 +...Bnxn + e, and also that the best fit tries to minimize the RSS, which is the sum of the squared errors of the actual minus the estimate or e12 + e22 + … en2.

With regularization, we will apply what is known as a shrinkage penalty in conjunction with the minimization RSS. This penalty consists of a lambda (symbol λ) along with the normalization of the beta coefficients and weights. How these weights are normalized differs in the techniques and we will discuss them accordingly. Quite simply, in our model, we are minimizing (RSS + λ(normalized coefficients)). We will select the λ, which is known as the tuning parameter in our model building process. Please note that if lambda is equal to zero, then our model is equivalent to OLS as it cancels out the normalization term.

So what does this do for us and why does it work? First of all, regularization methods are very computationally efficient. In best subsets, we are searching 2p models and in large datasets, it may just not be feasible to attempt. In R, we are only fitting one model to each value of lambda and this is therefore far and away more efficient. Another reason goes back to our bias-variance trade-off that is discussed in the preface. In the linear model, where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates (James, 2013). Regularization through the proper selection of lambda and normalization may help you improve the model fit by optimizing the bias-variance trade-off. Finally, regularization of the coefficients works to solve the multicollinearity problems.

Ridge regression

Let's begin by exploring what ridge regression is and what it can and cannot do for you. With ridge regression, the normalization term is the sum of the squared weights, referred to as an L2-norm. Our model is trying to minimize RSS + λ(sum Bj²). As lambda increases, the coefficients shrink toward zero but do not ever become zero. The benefit may be an improved predictive accuracy but as it does not zero out the weights for any of your features, it could lead to issues in the model's interpretation and communication. To help with this problem, we will turn to LASSO.

LASSO

LASSO applies the L1-norm instead of the L2-norm as in ridge regression, which is the sum of the absolute value of the feature weights and thus minimizes RSS + λ(sum |Bj|). This shrinkage penalty will indeed force a feature weight to zero. This is a clear advantage over ridge regression as it may greatly improve the model interpretability.

The mathematics behind the reason that the L1-norm allows the weights/coefficients to become zero is out of the scope of this book (refer to Tibsharini, 1996 for further details).

If LASSO is so great, then ridge regression must be clearly obsolete. Not so fast! In a situation of high collinearity or high pairwise correlations, LASSO may force a predictive feature to zero and thus you can lose the predictive ability, that is, say if both feature A and B should be in your model, LASSO may shrink one of their coefficients to zero. The following quote sums up this issue nicely:

	"One might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size."
	-- (James, 2013)

There is the possibility of achieving the best of both the worlds and that leads us to the next topic, elastic net.

Elastic net

The power of elastic net is that it performs the feature extraction that ridge regression does not and it will group the features that LASSO fails to do. Again, LASSO will tend to select one feature from a group of correlated ones and ignore the rest. Elastic net does this by including a mixing parameter, alpha, in conjunction with lambda. Alpha will be between 0 and 1 and as before, lambda will regulate the size of the penalty. Please note that an alpha of zero is equal to ridge regression and an alpha of one is equivalent to LASSO. Essentially, we are blending the L1 and L2 penalties by including a second tuning parameter to a quadratic (squared) term of the beta coefficients. We will end up with the goal of minimizing (RSS + λ[(1-alpha) (sum|Bj| ²)/2 + alpha (sum |Bj|)])/N).

Let's put these techniques to the test. We will primarily utilize the leaps, glmnet, and caret packages to select the appropriate features and thus the appropriate model in our business case.