A multiple linear regression model is a generalization of the simple linear regression model, discussed in section Linear Regression. A regression model relates the dependent variable (a.k.a. response variable), $$y$$, to a function of independent variables (a.k.a. explanatory or predictor variables), $$x$$, and unknown parameters (a.k.a. model coefficients) $$\beta$$. Such a regression model can be written as

$y = f(x; \beta)\text{.}$

The goal of regression is to find a function such that $$y \approx f (x;\beta)$$ for each data pair $$(x, y)$$. The function $$f(x;\beta)$$ is called a regression function, and its free parameters $$\beta$$ are the function coefficients. A regression method is linear if the prediction function $$f$$ is a linear function of the unknown parameters $$\beta$$; it does not have to be linear in $$x$$.
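To illustrate what "linear in the parameters" means, here is a minimal sketch. The quadratic function `f` and all coefficient values are hypothetical and chosen only for illustration. The function is nonlinear in $$x$$, yet it satisfies the superposition property with respect to $$\beta$$, which is what makes the model linear.

```python
import numpy as np

def f(x, beta):
    """Hypothetical regression function: quadratic in x, linear in beta."""
    features = np.array([1.0, x, x**2])  # fixed transformations of x
    return float(features @ beta)        # weighted sum of the coefficients

# Made-up coefficient vectors and input value (illustration only).
beta_a = np.array([1.0, 2.0, 0.5])
beta_b = np.array([-3.0, 0.0, 4.0])
x = 1.5

# Linearity in beta: f(x; 2*beta_a + 3*beta_b) == 2*f(x; beta_a) + 3*f(x; beta_b)
lhs = f(x, 2.0 * beta_a + 3.0 * beta_b)
rhs = 2.0 * f(x, beta_a) + 3.0 * f(x, beta_b)
is_linear_in_beta = bool(np.isclose(lhs, rhs))
```

The same check would fail for a function such as $$f(x;\beta) = e^{\beta x}$$, which is not linear in $$\beta$$.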

By extending the equation to a set of $$n$$ observations and $$d$$ explanatory variables, $$x_1,..., x_d$$, the regression model can be written as

\begin{align} y_i & = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_dx_{id} + \epsilon_i \text{,} \quad i = 1,2,\dots,n\text{, } \mathbf x_i\in \mathbb R^d\text{, } \end{align}

where $$\beta_0$$ corresponds to the intercept, sometimes referred to as bias, shift, or offset, and $$\epsilon_i$$ corresponds to the error term of the $$i$$-th observation. The estimates of the error terms obtained from a fitted model are referred to as residuals.
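The model equation for a single observation can be evaluated directly, as the following sketch shows. The helper name `predict_one` and all numbers are assumptions made up for illustration; in practice the coefficients are unknown and have to be estimated from data.

```python
def predict_one(x_i, beta_0, beta):
    """Evaluate beta_0 + sum_j x_ij * beta_j for one observation."""
    return beta_0 + sum(x_ij * beta_j for x_ij, beta_j in zip(x_i, beta))

# Made-up values for d = 2 explanatory variables (illustration only).
x_i = [2.0, 3.0]     # explanatory variables of observation i
beta_0 = 1.0         # intercept
beta = [0.5, -1.0]   # coefficients beta_1, beta_2
y_i = -0.5           # hypothetical observed response

y_hat = predict_one(x_i, beta_0, beta)  # 1.0 + 2.0*0.5 + 3.0*(-1.0)
residual = y_i - y_hat                  # part of y_i the model does not explain
```

With these numbers the prediction is $$-1.0$$, so the residual is $$0.5$$.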

A regression model based upon $$n$$ observations (measurements) consists of $$n$$ responses, $$y_1, y_2,..., y_n$$. For ease of notation, we write the responses as a one-dimensional column vector $$\mathbf y$$ of size $$n \times 1$$.

\begin{align} \mathbf y_{n \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{bmatrix} \end{align}

Moreover, for each particular observation $$i$$ $$(i = 1, 2,..., n)$$ we represent the associated $$d$$ explanatory variables as a column vector $$\mathbf x_i$$ as well.

\begin{align} \mathbf x_{i}= \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \\ \end{bmatrix} \quad \text{(e.g.)} \Rightarrow \begin{bmatrix} \text{height} \\ \text{weight} \\ \vdots \\ \text{age} \\ \end{bmatrix} \end{align}

Further, by transposing each $$\mathbf x_i$$ and stacking the resulting $$n$$ row vectors, we obtain a matrix $$\mathbf X$$ of size $$n \times d$$:

\begin{align} \mathbf X_{n \times d}= \begin{bmatrix} -x_{1}^T - \\ -x_{2}^T -\\ \vdots \\ -x_{n}^T -\\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} \end{align}

This matrix notation is very similar to a spreadsheet representation, where each row corresponds to an observation and each column to a feature. Please note that we assume that all features are continuous-valued $$(x \in \mathbb R^d)$$ and that there are more observations than dimensions $$(n > d)$$.
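As a minimal sketch of this layout, the snippet below stacks a few observation vectors into a matrix using NumPy. The height, weight, and age values are invented for illustration and do not come from any data set.

```python
import numpy as np

# Made-up observation vectors x_i = (height, weight, age); illustration only.
observations = [
    np.array([170.0, 65.0, 30.0]),
    np.array([182.0, 80.0, 45.0]),
    np.array([165.0, 58.0, 27.0]),
    np.array([175.0, 72.0, 52.0]),
]

# np.vstack places each 1-D vector as one row, i.e. it stacks the x_i^T.
X = np.vstack(observations)
n, d = X.shape  # n observations (rows), d features (columns)

# The assumption n > d from the text, checked for this toy example.
more_observations_than_features = n > d
```

Here `X[i - 1, j - 1]` holds the entry $$x_{ij}$$, because NumPy indices start at zero.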