In linear models, regularization is a method for imposing additional constraints on a learning model, with the goal of preventing overfitting and improving generalization to unseen data. This is done by adding a penalty term to the loss function being optimized, meaning that, while fitting, regularized linear models may shrink coefficients severely, or even drive them to zero, effectively removing the corresponding features. There are two widely used regularization methods, called L1 and L2 regularization. Both techniques rely on the L-p norm, which is defined for a vector $x$ as:

$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$$
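As a quick illustration, the L-p norm can be computed directly from its definition; the vector below is a hypothetical example:

```python
import numpy as np

def lp_norm(x, p):
    # L-p norm: (sum_i |x_i|^p)^(1/p)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 0.0])

print(lp_norm(x, 1))  # L1 norm: |3| + |-4| + |0| = 7.0
print(lp_norm(x, 2))  # L2 norm: sqrt(9 + 16) = 5.0
```

Setting p = 1 or p = 2 in this formula yields exactly the penalties used by the two methods below.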
- L1 regularization, also known as lasso regularization, uses the L1 norm, which by the above formula reduces to the sum of the absolute values of the vector's entries. This penalty limits the coefficients in such a way that some may disappear entirely and become exactly 0. If the coefficient of a feature drops to 0, that feature will have no say in the prediction of new data observations and will not be chosen by a SelectFromModel selector.
- L2 regularization, also known as ridge regularization, imposes the squared L2 norm as a penalty (the sum of the squared vector entries). This shrinks coefficients toward 0, so they can become very small, but they never drop exactly to 0.
Regularization also helps with multicollinearity, the problem of having multiple features in a dataset that are linearly related to one another. A lasso penalty (L1) tends to force the coefficients of redundant, linearly dependent features to 0, ensuring that they aren't chosen by a SelectFromModel selector and helping combat overfitting.
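This interaction between the lasso penalty and SelectFromModel can be sketched as follows; the construction of the collinear feature pair is a hypothetical example:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
x0 = rng.normal(size=300)
# Feature 1 is (almost) a linear copy of feature 0: multicollinearity.
# Feature 2 is independent noise unrelated to the target.
X = np.column_stack([
    x0,
    2.0 * x0 + rng.normal(scale=0.01, size=300),
    rng.normal(size=300),
])
y = x0 + rng.normal(scale=0.1, size=300)

# Keep only features whose lasso coefficient is (essentially) nonzero.
selector = SelectFromModel(Lasso(alpha=0.1), threshold=1e-5).fit(X, y)
print(selector.get_support())
```

The lasso zeroes out one of the two collinear features (and the irrelevant one), so the selector keeps a single feature rather than both redundant copies.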