Recall the equation for the multiple regression model

\[ y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,n\text{, } x\in \mathbb R^d \]

Further recall that in a regression analysis we try to find a set of unknown parameters \(\beta\) that \(y \approx f (x;\beta)\) for the data pair \((x; y)\). In order to find a appropriate set of parameters \(\beta\) we need to define an **objective function** \((\mathcal L)\), which allows us to find “good” values for the set of unknown parameters \(\beta\). One of the most popular estimation method is the so-called ordinary least squares (OSL) method. Thereby the coefficients \(\beta_0, \beta_1,\beta_2,...,\beta_{d}\) are picked by minimizing the residual sum of squares. The objective function for the residual sum of squares is written as

\[\mathcal L = \sum_{i=1}^n\epsilon_i^2=\sum_{i=1}^n(y_i - f(x_i; \beta))^2\text{.}\]

By plugging in the regression model equation from above we get

\[\mathcal L = \sum_{i=1}^n(y_i - \beta_0 + \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}\] where \(n\) corresponds to the number of observations and \(d\) corresponds to the number of features of our data set.

In the previous section we introduced the matrix notation for the vectors \(\mathbf y\) and the stacked matrix \(\mathbf X\) (written in bold letters to indicate that \(y\) is a \(n\)-dimensional vector, and \(X\) a \(n \times d\) dimensional matrix, respectively).

\[ \begin{align} \mathbf y_{n \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{bmatrix} ,\qquad \mathbf X_{n \times d}= \begin{bmatrix} -x_{1}^T - \\ -x_{2}^T -\\ \vdots \\ -x_{n1}-\\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} \end{align} \]

For linear regression (and classification) the intercept term \(\beta_0\) does not interact with any element in the vector \(x \in \mathbb R^d\). Thus it is convenient to populate the first column of the matrix \(\mathbf X\) with ones:

\[ \begin{align} \mathbf y_{n \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{bmatrix}\text{,} \quad \mathbf X_{n \times d+1}= \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 &x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots &\vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix}\text{,} \quad \mathbf \beta_{d+1 \times 1}^T= \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \\ \end{bmatrix} \end{align} \]

Note that the matrix \(\mathbf X\) is now of the size \(n \times d+1\) and that \(\beta\in \mathbb R^{d+1}\), and thus \([\beta_0, \beta_1,...,\beta_d]^T\) is of size \(d+1 \times 1\). The vector \(\mathbf y\) is denoted as **response vector** and the matrix \(\mathbf X\) is called the **model matrix**.

Recalling the basics of **Linear Algebra**, we know that the *dot product*, also referred to as the *scalar product* of a vector \(\mathbf u_{1\times n}\) with a vector \(\mathbf v_{n\times 1}\) results in a scalar \(w\) (\(w\) may be written as a matrix of size \(1\times1\)). The value of the scalar \(w\) corresponds to the sum of the products of the elements in the vectors \(\mathbf u\) and \(\mathbf v\).

\[\mathbf u \cdot \mathbf v = \sum_{i=1}^n u_1v_1+u_2v_2+...+ u_nv_n= w\]

Returning to our problem we may rewrite the original least squares objective function

\[\mathcal L = \sum_{i=1}^n(y_i - \beta_0 + \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}\] by using the matrix notation as

\[\mathcal L = \sum_{i=1}^n(y_i - x_i^T \beta)\]

The representation in matrix form allows us to further simplify the equation by plugging in the response vector \(\mathbf y\) and the model matrix \(\mathbf X\):

\[\mathcal L = \sum_{i=1}^n(y_i - x_i^T \beta) = \Vert\mathbf y - \mathbf X\beta\Vert^2 = (\mathbf y - \mathbf X\beta)^T(\mathbf y - \mathbf X\beta)\] Now, in order to find the least squares solution we take the gradient with respect to \(\beta\) and find a unique solution \(\hat \beta\):

\[\nabla_\beta\mathcal L=2\mathbf X^T\mathbf X \beta-2\mathbf X^T\mathbf y \quad \Rightarrow \quad\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\]

The matrix \(\mathbf H = \mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\) is sometimes called the **“hat” matrix** because it puts the hat on \(\mathbf y\) (Hastie et al 2008).

Given new data \(\mathbf X_{new}\), the least squares prediction for \(\hat y\) is

\[\hat y = \mathbf X_{new}\hat \beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\]

Be aware that the calculation of \(\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\) assumes \((\mathbf X^T\mathbf X)^{-1}\) exists. Therefore \(\mathbf X^T\mathbf X\) has to be a full rank matrix. A matrix is full rank when the \(n \times (d + 1)\) matrix \(\mathbf X\) has at least \(d + 1\) linearly independent rows. This means that any point in \(\mathbb R^{d+1}\) can be reached by a weighted combination of \(d + 1\) rows of \(\mathbf X\). Thus, if \(n < d + 1\) least squares fails because \((\mathbf X^T\mathbf X)^{-1}\) does not exist, and thus, there are an infinite number of possible solutions.