Recall the equation for the multiple regression model

$y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,n\text{, } x\in \mathbb R^d$

Further recall that in a regression analysis we try to find a set of unknown parameters $$\beta$$ that $$y \approx f (x;\beta)$$ for the data pair $$(x; y)$$. In order to find a appropriate set of parameters $$\beta$$ we need to define an objective function $$(\mathcal L)$$, which allows us to find “good” values for the set of unknown parameters $$\beta$$. One of the most popular estimation method is the so-called ordinary least squares (OSL) method. Thereby the coefficients $$\beta_0, \beta_1,\beta_2,...,\beta_{d}$$ are picked by minimizing the residual sum of squares. The objective function for the residual sum of squares is written as

$\mathcal L = \sum_{i=1}^n\epsilon_i^2=\sum_{i=1}^n(y_i - f(x_i; \beta))^2\text{.}$

By plugging in the regression model equation from above we get

$\mathcal L = \sum_{i=1}^n(y_i - \beta_0 + \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}$ where $$n$$ corresponds to the number of observations and $$d$$ corresponds to the number of features of our data set.

In the previous section we introduced the matrix notation for the vectors $$\mathbf y$$ and the stacked matrix $$\mathbf X$$ (written in bold letters to indicate that $$y$$ is a $$n$$-dimensional vector, and $$X$$ a $$n \times d$$ dimensional matrix, respectively).

\begin{align} \mathbf y_{n \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{bmatrix} ,\qquad \mathbf X_{n \times d}= \begin{bmatrix} -x_{1}^T - \\ -x_{2}^T -\\ \vdots \\ -x_{n1}-\\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} \end{align}

For linear regression (and classification) the intercept term $$\beta_0$$ does not interact with any element in the vector $$x \in \mathbb R^d$$. Thus it is convenient to populate the first column of the matrix $$\mathbf X$$ with ones:

\begin{align} \mathbf y_{n \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{bmatrix}\text{,} \quad \mathbf X_{n \times d+1}= \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 &x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots &\vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix}\text{,} \quad \mathbf \beta_{d+1 \times 1}^T= \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \\ \end{bmatrix} \end{align}

Note that the matrix $$\mathbf X$$ is now of the size $$n \times d+1$$ and that $$\beta\in \mathbb R^{d+1}$$, and thus $$[\beta_0, \beta_1,...,\beta_d]^T$$ is of size $$d+1 \times 1$$. The vector $$\mathbf y$$ is denoted as response vector and the matrix $$\mathbf X$$ is called the model matrix.

Recalling the basics of Linear Algebra, we know that the dot product, also referred to as the scalar product of a vector $$\mathbf u_{1\times n}$$ with a vector $$\mathbf v_{n\times 1}$$ results in a scalar $$w$$ ($$w$$ may be written as a matrix of size $$1\times1$$). The value of the scalar $$w$$ corresponds to the sum of the products of the elements in the vectors $$\mathbf u$$ and $$\mathbf v$$.

$\mathbf u \cdot \mathbf v = \sum_{i=1}^n u_1v_1+u_2v_2+...+ u_nv_n= w$

Returning to our problem we may rewrite the original least squares objective function

$\mathcal L = \sum_{i=1}^n(y_i - \beta_0 + \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}$ by using the matrix notation as

$\mathcal L = \sum_{i=1}^n(y_i - x_i^T \beta)$

The representation in matrix form allows us to further simplify the equation by plugging in the response vector $$\mathbf y$$ and the model matrix $$\mathbf X$$:

$\mathcal L = \sum_{i=1}^n(y_i - x_i^T \beta) = \Vert\mathbf y - \mathbf X\beta\Vert^2 = (\mathbf y - \mathbf X\beta)^T(\mathbf y - \mathbf X\beta)$ Now, in order to find the least squares solution we take the gradient with respect to $$\beta$$ and find a unique solution $$\hat \beta$$:

$\nabla_\beta\mathcal L=2\mathbf X^T\mathbf X \beta-2\mathbf X^T\mathbf y \quad \Rightarrow \quad\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$

The matrix $$\mathbf H = \mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T$$ is sometimes called the “hat” matrix because it puts the hat on $$\mathbf y$$ (Hastie et al 2008).

Given new data $$\mathbf X_{new}$$, the least squares prediction for $$\hat y$$ is

$\hat y = \mathbf X_{new}\hat \beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$

Be aware that the calculation of $$\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$$ assumes $$(\mathbf X^T\mathbf X)^{-1}$$ exists. Therefore $$\mathbf X^T\mathbf X$$ has to be a full rank matrix. A matrix is full rank when the $$n \times (d + 1)$$ matrix $$\mathbf X$$ has at least $$d + 1$$ linearly independent rows. This means that any point in $$\mathbb R^{d+1}$$ can be reached by a weighted combination of $$d + 1$$ rows of $$\mathbf X$$. Thus, if $$n < d + 1$$ least squares fails because $$(\mathbf X^T\mathbf X)^{-1}$$ does not exist, and thus, there are an infinite number of possible solutions.