Now that we have relaxed the constraints of the deterministic model and introduced an error term $\epsilon$, we run into another problem: there are infinitely many regression lines that fulfill the specifications of the probabilistic model.
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

fig, ax = plt.subplots(figsize=(12, 8))

# generate data scattered around a straight line
x = np.arange(1, 10)
y0 = 1.5 + (x * 2.2 + 6 * np.random.randn(len(x)))
ax.scatter(x, y0, color='k')
ax.set_xlabel('predictor variable')
ax.set_ylabel('response variable')

# four candidate regression lines, each labelled with its equation
style = {'color': 'k'}
ax.axline((0, 1.3), slope=2.6, **style)
ax.arrow(6.8, 8, -0.4, 10, **style)
ax.text(6.5, 7, r'$\hat{y} = 1.3 + 2.6x + e$', **style)

style = {'color': 'b'}
ax.axline((0, 1.7), slope=1.6, **style)
ax.arrow(8, 20, -1, -7, **style)
ax.text(8, 20.5, r'$\hat{y} = 1.7 + 1.6x + e$', **style)

style = {'color': 'r'}
ax.axline((0, 1.3), slope=2.9, **style)
ax.arrow(3, 20, 0.37, -9, **style)
ax.text(2.3, 20.5, r'$\hat{y} = 1.3 + 2.9x + e$', **style)

style = {'color': 'g'}
ax.axline((0, 1.9), slope=2.8, **style)
ax.arrow(0.9, 12, 1, -4.8, **style)
ax.text(0, 13, r'$\hat{y} = 1.9 + 2.8x + e$', **style)

ax.grid()
plt.show()
We need a strategy to select the one regression line that best describes the data. In this section we discuss one of the most popular methods for this task, the so-called ordinary least squares (OLS) method.
As mentioned in the previous section, for each pair of values $(x_i, y_i)$ the error $e_i$ is given by $y_i - \hat y_i$, where $\hat y_i$ denotes the $y$-value of the line $\hat y$ at $x_i$. To obtain the best-fitting line for the given data, we minimize the error sum of squares, denoted by SSE:
$$\min \; \mathrm{SSE} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 \, .$$
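To see the criterion in action, we can compute the SSE for each of the four candidate lines plotted above. The following sketch reuses the data generated in the figure (same seed), so the line with the smallest SSE is the best of those four candidates.
import numpy as np

np.random.seed(42)
x = np.arange(1, 10)
y = 1.5 + (x * 2.2 + 6 * np.random.randn(len(x)))

# candidate lines (intercept, slope) as plotted in the figure above
candidates = [(1.3, 2.6), (1.7, 1.6), (1.3, 2.9), (1.9, 2.8)]

for a, b in candidates:
    y_hat = a + b * x                 # fitted values of the candidate line
    sse = np.sum((y - y_hat) ** 2)    # error sum of squares
    print(f'y_hat = {a} + {b}x: SSE = {sse:.2f}')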
Recalling that $\bar x$ and $\bar y$ denote the means of $x$ and $y$, there exists an analytic solution for $\beta$ in the simple linear model:
$$\beta = \frac{\sum_{i=1}^n (x_i- \bar x)(y_i-\bar y)}{\sum_{i=1}^n (x_i-\bar x)^2} = \frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)}\text{,}$$and for $\alpha$:
$$\alpha = \bar y -\beta \bar x \, .$$
OLS yields the maximum likelihood estimate for the coefficients when the errors $\epsilon$ are uncorrelated, follow a Gaussian distribution, and have equal variance (homoscedasticity).
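The analytic solution translates directly into NumPy. The sketch below computes the closed-form estimates on the data generated above and, as a sanity check, compares them with np.polyfit, which solves the same least-squares problem for a degree-1 polynomial.
import numpy as np

np.random.seed(42)
x = np.arange(1, 10)
y = 1.5 + (x * 2.2 + 6 * np.random.randn(len(x)))

x_bar, y_bar = x.mean(), y.mean()

# slope: covariance of x and y divided by the variance of x
beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# intercept: places the line through the point of means
alpha = y_bar - beta * x_bar
print(f'closed form: y_hat = {alpha:.3f} + {beta:.3f}x')

# cross-check with NumPy's least-squares polynomial fit
beta_np, alpha_np = np.polyfit(x, y, deg=1)
print(f'np.polyfit:  y_hat = {alpha_np:.3f} + {beta_np:.3f}x')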