In the previous section we saw that overfitting degrades a model's ability to generalize. When a linear regression model contains many correlated variables, their coefficients can become poorly determined and exhibit high variance, and individual coefficients can grow very large. A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin. Imposing a size constraint on the coefficients alleviates this problem (Hastie et al. 2008). Regularization methods constrain the model parameters in some way and are therefore suitable for preventing overfitting.

In many regularization methods an additional penalty term is added to the objective function that is minimized to obtain the optimal parameter estimates, \(\hat \beta_{opt}\).

\[\hat \beta_{opt} = \text{arg min} \Vert \mathbf y - \mathbf X \beta\Vert^2 + \lambda g(\beta)\text{,}\] where \(g\) is a function of the coefficients \(\beta\) that encourages the desired properties of \(\beta\), and \(\lambda\) is a regularization parameter that weights the penalty against the squared-error term.
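
To make the structure of this objective concrete, the following minimal sketch evaluates it for a given \(\beta\). The function names (`penalized_loss`, `l2_penalty`, `l1_penalty`) and the toy data are illustrative assumptions, not part of the original text; the two penalties are the squared and the absolute-value penalty discussed in this section.

```python
import numpy as np

def penalized_loss(X, y, beta, lam, g):
    """Squared-error loss plus lam * g(beta) (hypothetical helper)."""
    residual = y - X @ beta
    return float(residual @ residual + lam * g(beta))

def l2_penalty(beta):
    # Sum of squared coefficients.
    return float(beta @ beta)

def l1_penalty(beta):
    # Sum of absolute coefficients.
    return float(np.sum(np.abs(beta)))

# Toy data in which beta reproduces y exactly, so only the penalty contributes.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

loss_l2 = penalized_loss(X, y, beta, lam=0.5, g=l2_penalty)  # 0.5 * (1 + 4)
loss_l1 = penalized_loss(X, y, beta, lam=0.5, g=l1_penalty)  # 0.5 * (1 + 2)
print(loss_l2, loss_l1)
```

Because the residual is zero in this toy case, the printed values are just \(\lambda g(\beta)\), which shows how the two penalties weight the same coefficient vector differently.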

Ridge regression, sometimes referred to as \(L_2\)-regularized regression, is a method to shrink the regression coefficients by imposing a penalty on their size. The Ridge regression uses a squared penalty on the regression coefficient vector \(\beta\).

\[\beta_{RR} = \text{arg min} \Vert \mathbf y - \mathbf X \beta\Vert^2 + \lambda \Vert\beta\Vert^2\]

Here, \(\lambda > 0\) is a regularization parameter that controls the amount of shrinkage: the larger the value of \(\lambda\), the greater the amount of shrinkage. The coefficients are shrunk toward zero but, for any finite \(\lambda\), generally do not become exactly zero. If \(\lambda \to 0\), the ridge estimates \(\beta_{RR}\) approach the least-squares estimates \(\beta_{LS}\).

\[ \begin{array}{l} \text{Case }\lambda \to 0 \quad\text{:} \quad \beta_{RR} \to \quad \beta_{LS}\\ \text{Case }\lambda \to \infty \quad\text{:} \quad \beta_{RR} \to \quad \overrightarrow 0\\ \end{array} \]

We can solve the ridge regression problem using exactly the same procedure as for least squares,

\[ \begin{align} \mathcal L & = \Vert \mathbf y - \mathbf X \beta \Vert ^2 + \lambda \Vert \beta\Vert^2 \\ & = (\mathbf y - \mathbf X \beta)^T(\mathbf y - \mathbf X \beta) + \lambda \beta^T\beta \end{align} \]

First, take the gradient of \(\mathcal L\) with respect to \(\beta\) and set to zero,

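\[\nabla \mathcal L = -2\mathbf X^T \mathbf y+2\mathbf X^T\mathbf X\beta+2\lambda\beta = 0\text{.}\] Then collect the terms in \(\beta\), which gives \((\mathbf X^T\mathbf X + \lambda \mathbf I)\beta = \mathbf X^T\mathbf y\), and solve for \(\beta\) to find that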

\[\beta_{RR}=(\mathbf X^T\mathbf X + \lambda \mathbf I)^{-1}\mathbf X^T\mathbf y\text{,}\]

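where \(\mathbf I\) is the identity matrix. For \(\lambda > 0\) the matrix \(\mathbf X^T\mathbf X + \lambda \mathbf I\) is always invertible, even when \(\mathbf X^T\mathbf X\) itself is singular because of strongly correlated variables.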
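
The closed-form solution can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `ridge_closed_form` and the synthetic data are made up for this example, no intercept is fitted, and the features are not standardized. The linear system is solved directly rather than forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta (hypothetical helper)."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=50)

# Reference least-squares solution for comparison with the lam -> 0 case.
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

for lam in (0.0, 1.0, 100.0, 1e6):
    beta_rr = ridge_closed_form(X, y, lam)
    # Print lambda, the ridge coefficients, and their Euclidean norm.
    print(lam, np.round(beta_rr, 4), round(float(np.linalg.norm(beta_rr)), 6))
```

With \(\lambda = 0\) the result coincides with the least-squares fit, and the norm of the coefficient vector decreases as \(\lambda\) grows, in line with the two limiting cases stated earlier.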

The LASSO (least absolute shrinkage and selection operator), also referred to as \(L_1\)-regularized regression, is a shrinkage method like ridge regression, with subtle but important differences.

The LASSO estimate is defined by

\[ \beta_{lasso} = \text{arg min} \Vert \mathbf y - \mathbf X \beta\Vert^2 + \lambda \Vert\beta\Vert_1\text{,} \]

where \(\Vert\beta\Vert_1\) is the \(L_1\) norm of the coefficient vector,

\[\Vert\beta\Vert_1 = \sum_{j=1}^d\vert\beta_j\vert\text{.}\]

The LASSO method performs both regularization and variable selection. During the LASSO model fitting process only a subset of the provided features is selected for use in the final model: for sufficiently large \(\lambda\), the LASSO sets certain coefficients exactly to zero, effectively choosing a simpler model that does not include the corresponding variables. In contrast to ridge regression, which can be solved analytically, the LASSO has no closed-form solution in general, so numerical optimization (e.g. coordinate descent) is required to find the solution.
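
Coordinate descent for the LASSO can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes the objective \(\Vert \mathbf y - \mathbf X \beta\Vert^2 + \lambda \sum_j \vert\beta_j\vert\), no intercept, no all-zero columns in \(\mathbf X\), and a fixed number of sweeps instead of a convergence check. The names `soft_threshold` and `lasso_coordinate_descent` are hypothetical. Each coordinate update minimizes the objective over a single \(\beta_j\) with the others held fixed, which leads to a soft-thresholding step with threshold \(\lambda/2\).

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return zero if |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize ||y - X beta||^2 + lam * sum(|beta_j|) by cyclic updates."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0)  # X_j^T X_j for every column j
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual: remove the contribution of every feature except j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # One-dimensional minimizer: soft-threshold at lam / 2, then rescale.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Orthonormal toy design: the solution is the soft-thresholded X^T y.
X_toy = np.eye(3)
y_toy = np.array([3.0, -0.2, 1.0])
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=1.0)
print(beta_toy)  # second coefficient is set exactly to zero
```

In the toy example the threshold is \(\lambda/2 = 0.5\), so the coefficient whose correlation with \(\mathbf y\) has magnitude 0.2 is set to zero while the others are shrunk by 0.5. This is the variable-selection behavior described above.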