A regression model relates the dependent variable (a.k.a. response variable), $$y$$, to a function of independent variables (a.k.a. explanatory or predictor variables), $$x$$, and unknown parameters (a.k.a. model coefficients) $$\beta$$. Such a regression model can be written as

$$y = f(x; \beta)\text{.}$$

The goal of regression is to find a function such that $$y \approx f (x;\beta)$$ for each data pair $$(x, y)$$. The function $$f(x;\beta)$$ is called the regression function, and its free parameters $$\beta$$ are the function coefficients. A regression method is linear if the prediction function $$f$$ is a linear function of the unknown parameters $$\beta$$; it does not have to be linear in $$x$$.
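The distinction between linearity in the parameters and linearity in $$x$$ can be illustrated with a short R sketch. The data below are simulated purely for illustration (the coefficients 2, 0.5 and -0.1 and the noise level are assumptions, not values from this text). The fitted curve is a parabola in $$x$$, yet the model is still a linear regression because each coefficient enters the function linearly.

```r
# Hypothetical data, simulated for illustration only
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 + 0.5 * x - 0.1 * x^2 + rnorm(50, sd = 0.3)

# f(x; beta) = beta_0 + beta_1 * x + beta_2 * x^2
# Nonlinear in x, but linear in beta_0, beta_1 and beta_2,
# so the ordinary linear model function lm() can estimate it.
fit <- lm(y ~ x + I(x^2))

# Estimated coefficients (intercept, x, x^2)
coef(fit)
```

The `I()` wrapper tells the formula interface to treat `x^2` as an arithmetic expression rather than as formula syntax.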

By extending the equation to a set of $$m$$ observations and $$d$$ explanatory variables, $$x_1, \ldots, x_d$$, the linear regression model can be written as

\begin{align} y_i & = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_dx_{id} + \epsilon_i \text{,} \quad i = 1,2,\ldots,m\text{, } \mathbf x_i\in \mathbb R^d\text{, } \end{align}

where $$\beta_0$$ corresponds to the intercept, sometimes referred to as bias, shift or offset, and $$\epsilon_i$$ corresponds to the error term of the $$i$$-th observation, whose estimates from a fitted model are referred to as residuals.
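As a minimal sketch of what this equation describes, the following R code generates responses observation by observation. All numbers (the sample size, the coefficient values and the noise level) are made-up assumptions for illustration. The two-dimensional array `x` is used only as indexed storage, so that `x[i, j]` plays the role of $$x_{ij}$$.

```r
# Hypothetical settings, chosen for illustration only
set.seed(42)
m <- 100                 # number of observations
d <- 3                   # number of explanatory variables
beta_0 <- 1.5            # intercept
beta <- c(2, -1, 0.5)    # coefficients beta_1, ..., beta_d

# x[i, j] is the value of explanatory variable j for observation i
x <- matrix(rnorm(m * d), nrow = m, ncol = d)

# epsilon[i] is the error term of observation i
epsilon <- rnorm(m, sd = 0.5)

# y_i = beta_0 + sum_j x_ij * beta_j + epsilon_i
y <- numeric(m)
for (i in 1:m) {
  y[i] <- beta_0 + sum(x[i, ] * beta) + epsilon[i]
}

head(y)
```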

A regression model based upon $$m$$ observations (measurements) consists of $$m$$ values of the response variable, $$y_1, y_2, \ldots, y_m$$. For ease of notation we write these values as a column vector $$\mathbf{y}$$ of size $$m \times 1$$:

\begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix} \end{align}

Moreover, for each observation $$\mathbf{x}_i$$, $$i = 1, 2, \ldots, m$$, we represent the $$d$$ associated explanatory variables as a column vector as well:

\begin{align} \mathbf x_{i}= \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \\ \end{bmatrix} \quad \text{(e.g.)} \Rightarrow \begin{bmatrix} \text{height} \\ \text{weight} \\ \vdots \\ \text{age} \\ \end{bmatrix} \end{align}

Further, by transposing each $$\mathbf{x}_i$$ into a row vector, we stack the $$m$$ observation vectors into a matrix $$\mathbf{X}$$ of size $$m \times d$$:

\begin{align} \mathbf X_{m \times d}= \begin{bmatrix} \mathbf x_{1}^T \\ \mathbf x_{2}^T \\ \vdots \\ \mathbf x_{m}^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix}\text{.} \end{align}

This matrix notation is very similar to a spreadsheet representation, where each row corresponds to an observation and each column to a feature. Please note that we assume that all features are continuous-valued $$(\mathbf x \in \mathbb R^d)$$ and that there are more observations than dimensions $$(m > d)$$.
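The spreadsheet analogy can be sketched in R as follows. The data frame `observations` and its values are hypothetical; only the feature names height, weight and age are taken from the example above. Each row of the resulting matrix corresponds to one transposed observation vector $$\mathbf x_i^T$$ and each column to one feature.

```r
# Hypothetical measurements: m = 5 observations, d = 3 features
observations <- data.frame(
  height = c(1.72, 1.80, 1.65, 1.90, 1.75),
  weight = c(68, 82, 55, 95, 74),
  age    = c(34, 45, 23, 51, 38)
)

# Convert the spreadsheet-like table into the matrix X
X <- as.matrix(observations)

m <- nrow(X)   # number of observations (rows)
d <- ncol(X)   # number of features (columns)

X[2, ]         # x_2^T: all features of the second observation
X[, "age"]     # the feature "age" across all observations

# Check the assumption that there are more observations than dimensions
m > d
```

Note that `as.matrix()` yields a numeric matrix here only because every column is numeric, which matches the assumption of continuous-valued features.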

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.