A regression model relates the dependent variable (a.k.a. response variable) $y$ to a function of independent variables (a.k.a. explanatory or predictor variables) $x$ and unknown parameters (a.k.a. model coefficients) $\beta$. We can write this regression model as $$ y = f(x; \beta)\text{.}$$ The goal of regression is to find a function such that $y \approx f(x;\beta)$ for the data pair $(x, y)$. The function $f(x;\beta)$ is called the regression function, and its free parameters $\beta$ are the function coefficients. We call a regression method linear if the prediction function $f$ is a linear function of the unknown parameters $\beta$.
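To make the notion of a prediction function that is linear in its parameters concrete, here is a minimal Python sketch; the function name, variable names and toy numbers are our own illustration, not part of the course material:

```python
import numpy as np

def f(x, beta):
    """Linear regression function: beta[0] is the intercept,
    beta[1:] are the coefficients of the explanatory variables."""
    return beta[0] + np.dot(x, beta[1:])

x = np.array([1.5, 3.0])           # one observation with d = 2 features (toy data)
beta = np.array([0.5, 2.0, -1.0])  # intercept plus d coefficients (toy values)
y_hat = f(x, beta)                 # prediction y ≈ f(x; beta)
print(y_hat)                       # 0.5
```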
By extending the equation to a set of $m$ observations and $d$ explanatory variables $x_1, \dots, x_d$, we write the regression model as

$$
\begin{align}
y_i & = \beta_0 + \sum_{j=1}^{d} x_{ij}\beta_j + \epsilon_i \\
& = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_d x_{id} + \epsilon_i\text{,} \quad i = 1, 2, \dots, m\text{,} \quad \mathbf x_i \in \mathbb R^d\text{,}
\end{align}
$$

where $\beta_0$ corresponds to the intercept, sometimes referred to as bias, shift or offset, and $\epsilon_i$ corresponds to the error term, referred to as the residual. A regression model based upon $m$ observations (measurements) consists of $m$ response variables, $y_1, y_2, \dots, y_m$. For ease of notation we write the response variables as a one-dimensional column vector $\mathbf y_{m \times 1}$:

$$
\mathbf y_{m \times 1} =
\begin{bmatrix}
y_{1} \\
y_{2} \\
\vdots \\
y_{m}
\end{bmatrix}
$$

Moreover, for each particular observation $\mathbf x_i$ $(i = 1, 2, \dots, m)$ we represent the associated $d$ explanatory variables as a column vector as well:

$$
\mathbf x_{i} =
\begin{bmatrix}
x_{i1} \\
x_{i2} \\
\vdots \\
x_{id}
\end{bmatrix}
\quad \text{(e.g.)} \Rightarrow
\begin{bmatrix}
\text{height} \\
\text{weight} \\
\vdots \\
\text{age}
\end{bmatrix}
$$

Further, by transposing $\mathbf x_i$ we stack the set of $m$ observation vectors into a matrix $\mathbf X$ of the form $\mathbf X_{m \times d}$:

$$
\mathbf X_{m \times d} =
\begin{bmatrix}
\mathbf x_{1}^T \\
\mathbf x_{2}^T \\
\vdots \\
\mathbf x_{m}^T
\end{bmatrix}
=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{md}
\end{bmatrix}\text{.}
$$

This matrix notation is very similar to a spreadsheet representation, where each row corresponds to an observation and each column to a feature. Please note that we assume that all features are continuous-valued $(\mathbf x_i \in \mathbb R^d)$ and that there are more observations than dimensions $(m > d)$.
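The matrix notation above translates directly into NumPy arrays. The following sketch, built on made-up toy data of our own, stacks $m = 4$ observations with $d = 2$ features into $\mathbf X$ and prepends a column of ones so that $\beta_0$ can act as the intercept:

```python
import numpy as np

# m = 4 observations, d = 2 continuous features (toy data for illustration)
X = np.array([[1.70, 65.0],
              [1.80, 80.0],
              [1.60, 55.0],
              [1.75, 72.0]])           # shape (m, d): rows = observations, columns = features
y = np.array([25.0, 31.0, 22.0, 28.0])  # response vector, shape (m,)

m, d = X.shape
assert m > d  # more observations than dimensions, as assumed above

# Prepend a column of ones so that beta[0] plays the role of the intercept
X_design = np.hstack([np.ones((m, 1)), X])  # shape (m, d + 1)
print(X_design.shape)  # (4, 3)
```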
We frequently make use of the following libraries and packages in the upcoming lessons:
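The original import cell is not reproduced here; as an assumption, the following shows the standard scientific Python stack that a statistics course of this kind typically relies on:

```python
# Assumed standard imports for the upcoming lessons (the original
# import list is missing here; this is a plausible reconstruction)
import numpy as np               # numerical arrays and linear algebra
import pandas as pd              # tabular data handling
import matplotlib.pyplot as plt  # plotting
```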
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.