Under the standard least squares method, the estimated regression coefficients can vary wildly from one training sample to another, especially when features are correlated. We can formulate least squares regression as an optimization problem:
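$$\mathbf{w}^{*} = \underset{\mathbf{w}}{\arg\min} \; (\mathbf{y} - X\mathbf{w})^{\top} (\mathbf{y} - X\mathbf{w})$$

where X is the matrix whose rows are the feature vectors, y is the vector of targets, and w is the vector of regression coefficients (weights).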
What we have on the right here is just the residual sum of squares (RSS) written as a scalar product. Tikhonov-regularized least squares regression adds a penalty term, the squared L2 norm of the weight vector:
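$$\mathbf{w}^{*} = \underset{\mathbf{w}}{\arg\min} \; (\mathbf{y} - X\mathbf{w})^{\top} (\mathbf{y} - X\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_{2}^{2}$$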
where ‖w‖₂ is the L2 norm of the weight vector and λ is a scalar shrinkage parameter. It allows us to control the variance of the weights and keep it low. Like other hyperparameters, λ has to be chosen separately, usually on held-out data or via cross-validation. The larger it is, the smaller the regression coefficients (weights) will be.
Such an optimization problem has a closed-form solution, similar to the normal equation:
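$$\mathbf{w}^{*} = (X^{\top} X + \lambda I)^{-1} X^{\top} \mathbf{y}$$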
Here I is the identity matrix, whose main-diagonal elements are equal to 1 and all other elements are 0. Adding λI makes the matrix XᵀX + λI invertible even when XᵀX itself is singular or nearly so.
Linear regression regularized in this way is known as ridge regression. One of its advantages is that it can be used even when the features in the training data are highly correlated (multicollinearity). Unlike ordinary linear regression, ridge regression does not assume that the errors are normally distributed. It shrinks the absolute values of the coefficients, but they never reach exactly zero, which means that ridge regression performs no feature selection and can still perform poorly when many features are irrelevant.
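As a concrete illustration, below is a minimal NumPy sketch that fits ridge regression with the closed-form solution above on a small synthetic dataset with two nearly identical (highly correlated) features. The data and the λ value are arbitrary choices for the example, and for brevity the sketch penalizes the intercept as well; in practice the intercept is usually left unpenalized and the features are standardized first.

```python
import numpy as np

# Synthetic data with two nearly identical features (multicollinearity).
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1
X = np.column_stack([np.ones(n), x1, x2])  # first column is the intercept
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge_fit(X, y, lam=0.0)     # plain least squares: unstable weights
w_ridge = ridge_fit(X, y, lam=10.0)  # shrunken, much smaller weights
print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
```

With λ = 0 the correlated columns receive large, unstable coefficients; increasing λ shrinks them toward zero and stabilizes the fit.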