A multiple linear regression model is a generalization of the simple linear regression model, discussed in the section Linear Regression. A regression model relates the dependent variable (a.k.a. response variable), \(y\), to a function of independent variables (a.k.a. explanatory or predictor variables), \(x\), and unknown parameters (a.k.a. model coefficients), \(\beta\). Such a regression model can be written as

\[ y = f(x; \beta)\text{.}\]

The goal of regression is to find a function such that \(y \approx f(x;\beta)\) for the data pair \((x, y)\). The function \(f(x;\beta)\) is called the regression function, and its free parameters \(\beta\) are the model coefficients. A regression method is linear if the prediction function \(f\) is a linear function of the unknown parameters \(\beta\).
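Note that linearity refers to the parameters, not the inputs: for example, \(f(x;\beta) = \beta_0 + \beta_1 x + \beta_2 x^2\) is quadratic in \(x\) but linear in \(\beta\), and is therefore still a linear regression model.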

Extending the equation to a set of \(n\) observations and \(d\) explanatory variables, \(x_1, \ldots, x_d\), the regression model can be written as

\[ \begin{align} y_i & = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_d x_{id} + \epsilon_i \text{,} \quad i = 1,2,\ldots,n\text{, } \mathbf x_i \in \mathbb R^d\text{,} \end{align} \]

where \(\beta_0\) corresponds to the intercept, sometimes referred to as the bias, shift, or offset, and \(\epsilon_i\) corresponds to the error term, also referred to as the residual.
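As a minimal sketch of this data-generating process (NumPy is our choice here; the original presents only the mathematics, and all parameter values are hypothetical), one might simulate such a model as follows:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n, d = 100, 3                        # n observations, d explanatory variables
beta0 = 1.5                          # intercept beta_0 (bias / offset)
beta = np.array([2.0, -1.0, 0.5])    # coefficients beta_1, ..., beta_d

X = rng.normal(size=(n, d))          # explanatory variables x_ij
eps = rng.normal(scale=0.1, size=n)  # error terms epsilon_i

# y_i = beta_0 + sum_j x_ij * beta_j + epsilon_i, for i = 1, ..., n
y = beta0 + X @ beta + eps
```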

A regression model based upon \(n\) observations (measurements) consists of \(n\) response variables, \(y_1, y_2, \ldots, y_n\). For ease of notation we collect the response variables in a one-dimensional column vector \(\mathbf y\) of size \(n \times 1\).

\[ \begin{align} \mathbf y_{n \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \\ \end{bmatrix} \end{align} \]

Moreover, for each particular observation \(\mathbf x_i\), \(i = 1, 2, \ldots, n\), we represent the associated \(d\) explanatory variables as a column vector as well.

\[ \begin{align} \mathbf x_{i}= \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \\ \end{bmatrix} \end{align} \text{, e.g. } \begin{bmatrix} \text{height} \\ \text{weight} \\ \vdots \\ \text{age} \\ \end{bmatrix} \]

Further, by transposing the \(\mathbf x_i\) we stack the set of \(n\) observation vectors into a matrix \(\mathbf X\) of size \(n \times d\):

\[ \begin{align} \mathbf X_{n \times d}= \begin{bmatrix} -\mathbf x_{1}^T- \\ -\mathbf x_{2}^T- \\ \vdots \\ -\mathbf x_{n}^T- \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix} \end{align} \]

This matrix notation is very similar to a spreadsheet representation, where each row corresponds to an observation and each column to a feature. Note that we assume all features are continuous-valued \((\mathbf x_i \in \mathbb R^d)\) and that there are more observations than dimensions \((n > d)\).
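In code, this stacking places one observation per row. A short NumPy sketch (again our own illustration, with hypothetical example values for the height/weight/age features mentioned above):

```python
import numpy as np

# four hypothetical observation vectors x_i, each with d = 3 features
# (e.g. height in m, weight in kg, age in years)
x1 = np.array([1.80, 75.0, 34.0])
x2 = np.array([1.65, 60.0, 28.0])
x3 = np.array([1.72, 82.0, 45.0])
x4 = np.array([1.90, 90.0, 51.0])

# stack the transposed column vectors row-wise: each row is one
# observation, each column one feature -- just like a spreadsheet
X = np.vstack([x1, x2, x3, x4])

print(X.shape)  # (4, 3): n = 4 observations, d = 3 features, so n > d
```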