This was quoted by Rob Tibshirani, Professor, Stanford University at

So far, we have examined linear models for both quantitative and qualitative outcomes, with an emphasis on feature selection, that is, the methods used to exclude useless or unwanted predictor variables. We saw that linear models can be quite effective in machine learning problems. However, newer techniques that have been developed and refined over the last couple of decades can improve predictive ability and interpretability beyond the linear models discussed in the preceding chapters. Many datasets today have a large number of features relative to the number of observations, a property known as high-dimensionality. If you ever have to work on a genomics problem, this will quickly become self-evident. Additionally, with the size of the data that we are being asked to work with, a technique such as best subsets or stepwise feature selection can take an inordinate amount of time to converge, even on high-speed computers. I'm not talking about minutes; in many cases, hours of system time are required to get a best subsets solution.

There is a better way in these cases. In this chapter, we will look at the concept of regularization, where the coefficients are constrained or shrunk towards zero. There are a number of regularization methods and variations on them, but we will focus on ridge regression, Least Absolute Shrinkage and Selection Operator (LASSO), and finally, elastic net, which combines the benefits of both techniques into one.

You may recall that our linear model follows the form $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + e$, and that the best fit tries to minimize the RSS, which is the sum of the squared errors of the actual minus the estimate, or $e_1^2 + e_2^2 + \dots + e_n^2$.
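To make the RSS concrete, here is a minimal sketch in plain Python (chosen only so that it runs without any packages; it is not the chapter's R code). The toy data and the helper names `ols_fit`, `predict`, and `rss` are hypothetical and exist purely for illustration. It fits a one-predictor least squares line and compares its RSS with that of a slightly different line.

```python
# Hypothetical toy data: one predictor x and a response y.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]


def ols_fit(xs, ys):
    """Closed-form least squares estimates (b0, b1) for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predict(b0, b1, xs):
    """Estimates from the fitted line b0 + b1 * x."""
    return [b0 + b1 * xi for xi in xs]


def rss(actual, estimate):
    """Sum of the squared errors of the actual minus the estimate."""
    return sum((a - e) ** 2 for a, e in zip(actual, estimate))


b0, b1 = ols_fit(x, y)
rss_ols = rss(y, predict(b0, b1, x))
# Any other line, such as one with a slightly larger slope, has a larger RSS.
rss_other = rss(y, predict(b0, b1 + 0.1, x))
print("OLS RSS:", rss_ols, "perturbed RSS:", rss_other)
```

The least squares line is, by construction, the one with the smallest RSS on the data it was fit to, which is why the perturbed line scores worse.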

With regularization, we apply what is known as a shrinkage penalty in conjunction with the minimization of the RSS. This penalty consists of a lambda (symbol λ) along with the normalization of the beta coefficients and weights. How these weights are normalized differs between the techniques, and we will discuss them accordingly. Quite simply, in our model, we are minimizing (RSS + λ(normalized coefficients)). We select λ, which is known as the tuning parameter, during the model building process. Please note that if lambda is equal to zero, the penalty term vanishes and our model is equivalent to OLS.
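The effect of λ can be sketched with a single predictor. The example below is an illustration only: it assumes a squared-coefficient penalty (one possible choice of normalization), centered toy data so that no intercept is needed, and hypothetical helper names. Under those assumptions, the coefficient that minimizes RSS + λb² has the closed form sum(xy) / (sum(x²) + λ).

```python
# Hypothetical centered toy data, so the model is simply y = b * x + e.
x = [-1.5, -0.5, 0.5, 1.5]
y = [-3.05, -0.95, 0.85, 3.15]


def penalized_objective(b, lam):
    """RSS plus the shrinkage penalty; a squared-coefficient penalty is assumed."""
    rss = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    return rss + lam * b ** 2


def shrunken_coefficient(lam):
    """Coefficient that minimizes penalized_objective for a given lambda."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# lam = 0.0 reproduces the OLS estimate; larger values shrink b towards zero.
for lam in [0.0, 1.0, 10.0, 100.0]:
    b = shrunken_coefficient(lam)
    print(f"lambda={lam:>6}: b={b:.4f}, objective={penalized_objective(b, lam):.4f}")
```

With λ set to zero, the denominator reduces to sum(x²) and the estimate is the ordinary least squares one; as λ grows, the coefficient is pulled towards zero but, with this particular penalty, never reaches it exactly.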

So what does this do for us and why does it work? First of all, regularization methods are very computationally efficient. In best subsets, we are searching $2^p$ models, where p is the number of predictors, and in large datasets this may simply not be feasible. In R, we fit only one model for each value of lambda, which is far more efficient. Another reason goes back to the bias-variance trade-off discussed in the preface. In a linear model where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates (James, 2013). Regularization, through the proper selection of lambda and normalization, may improve the model fit by optimizing the bias-variance trade-off. Finally, regularization of the coefficients helps address multicollinearity problems.
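To get a feel for the computational argument, the short sketch below counts the candidate models that best subsets must consider for p predictors and sets that against one fit per value of lambda. The grid size of 100 lambda values is an assumption made for illustration, not a figure from this chapter, and the function name is hypothetical.

```python
def best_subsets_model_count(p):
    """Each of the p predictors is either in or out of a model: 2 ** p subsets."""
    return 2 ** p


# Assumed size of the lambda grid; regularization fits one model per value.
n_lambdas = 100

for p in (10, 20, 40):
    subsets = best_subsets_model_count(p)
    print(f"p={p}: best subsets models={subsets:,}, regularized fits={n_lambdas}")
```

The number of regularized fits stays fixed at the size of the lambda grid, while the best subsets search doubles with every predictor that is added.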