In this example, we load the *trees* data set shipped with the R package `datasets`. The statsmodels library provides a function to import these sample data sets for use in Python.

The *trees* data set provides measurements of the girth, height, and volume of timber in 31 felled black cherry trees, also known as *Prunus serotina*. The height of the trees is given in feet (ft) and the volume in cubic feet (cft). The girth is the diameter of the tree (in inches) measured at 4 ft 6 in (approx. 1.37 m) above the ground.

We take a quick look at the data set by applying the `head()` method and immediately convert the variables to meters and cubic meters, respectively, by applying these equations

$$1\;\text{in} = 0.0254\;\text{m}\text{,} \qquad 1\;\text{ft} = 0.3048\;\text{m}$$

and

$$1\; \text{cft} = 0.0283168\;\text{m}^3\text{.}$$

In [1]:

```
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
```

In [2]:

```
trees = sm.datasets.get_rdataset('trees').data
df = trees
df['Girth'] = df['Girth'] * 0.0254
df['Height'] = df['Height'] * 0.3048
df['Volume'] = df['Volume'] * 0.0283168
df = df.round(3)
df.head()
```

Out[2]:

| | Girth | Height | Volume |
|---|---|---|---|
| 0 | 0.211 | 21.336 | 0.292 |
| 1 | 0.218 | 19.812 | 0.292 |
| 2 | 0.224 | 19.202 | 0.289 |
| 3 | 0.267 | 21.946 | 0.464 |
| 4 | 0.272 | 24.689 | 0.532 |

The `seaborn` package provides the handy `pairplot()` method for plotting scatter plot matrices.

In [3]:

```
sns.pairplot(df)
plt.show()
```

A quick visual inspection indicates an approximately linear relationship between girth and volume, and a somewhat less pronounced linear relationship between height and volume. Any relation between height and girth is less obvious.

**The goal in this example is to build a linear regression model with Volume being the dependent variable and Height and Girth being the independent (explanatory) variables.**

First, we develop a linear regression model based on the matrix-based equations derived in the previous section. Thereafter, we apply the statsmodels function `ols()`.

A multiple linear regression model can be learned based on sample data by finding the optimal set of the coefficients $\beta$. The regression model is written as

$$y_i = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i\text{,}$$

where $\beta$ corresponds to the model coefficients, $x_i$ to the observation/measurement data, and $\epsilon_i$ to the residuals. We use the least squares method as our objective function, which may be written as

$$\mathcal E = \sum_{i=1}^n(y_i - x_i^T \beta)^2\text{.}$$

Minimizing the objective function yields the estimate of the model coefficients, $\hat \beta$, given by the following equation:

$$\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.$$

To calculate $\hat \beta$ in Python we start by constructing the model matrix $\mathbf X$ and the response vector $\mathbf y$ as numpy arrays. Be aware that we have to manually account for the intercept term, $\beta_0$, by populating the first column of the matrix $\mathbf X$ with ones. numpy provides the `transpose()` method to transpose an array, the `inv()` function in the linalg module to calculate the inverse of a matrix, and the `matmul()` function to conduct matrix multiplication.

In [4]:

```
y = df['Volume'].to_numpy()
x = sm.add_constant(df[['Girth', 'Height']].to_numpy())
left = np.matmul(x.transpose(), x)
right = np.matmul(x.transpose(), y)
solved = np.matmul(np.linalg.inv(left), right)
print(f'Intercept:\t{solved[0]},\nGirth:\t\t{solved[1]},\nHeight:\t\t{solved[2]}')
```

Intercept:	-1.6416612935772914,
Girth:		5.252827232565963,
Height:		0.03144365962146134

The calculation yields a value of `-1.642` for the intercept $\beta_0$, a value of `5.253` for $\beta_1$, the coefficient of the `Girth` variable, and a value of `0.031` for $\beta_2$, the coefficient of the `Height` variable. The intercept $\beta_0$ represents the volume when all independent variables are zero. The coefficient $\beta_1$ represents the volume change for a unit increase in the `Girth` variable, while the other independent variable is held constant. The coefficient $\beta_2$ represents the volume change for a unit increase in the `Height` variable, while the other independent variable is held constant.

Consequently, the linear model can be written as

$$\text{Volume} = -1.642 + 5.253 \times \text{Girth} + 0.031 \times \text{Height}\text{.}$$

Based on the regression equation it is fairly easy to predict the volume of any black cherry tree given its girth and height. We predict the volume of three black cherry trees with a girth of 0.3, 0.4, and 0.5 meters and a height of 20, 21, and 22 meters, respectively.

$$\text{Volume} = -1.642 + 5.253 \times 0.3 + 0.031 \times 20 = 0.55$$
$$\text{Volume} = -1.642 + 5.253 \times 0.4 + 0.031 \times 21 = 1.11$$
$$\text{Volume} = -1.642 + 5.253 \times 0.5 + 0.031 \times 22 = 1.67$$

This naive prediction procedure works perfectly fine; however, a more sophisticated and scalable procedure is to make use of the matrix notation we introduced in the previous section. Recall that our unknown coefficient vector $\hat \beta$ is calculated by the equation

$$\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\text{.}$$

Given a set of new data points in matrix form, $\mathbf X_{new}$, the least squares prediction for $\mathbf{\hat y}$ is

$$\mathbf{\hat y} = \mathbf X_{new}\hat \beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y{.}$$

In [5]:

```
X_new = pd.DataFrame({'Girth': [0.3, 0.4, 0.5],
                      'Height': [20, 21, 22]})
X_new = sm.add_constant(X_new)
left = np.matmul(x.transpose(), x)
right = np.matmul(x.transpose(), y)
beta_hat = np.matmul(np.linalg.inv(left), right)
y_hat = np.matmul(X_new.to_numpy(), beta_hat)
print(f'Tree 1:\t{y_hat[0]},\nTree 2:\t{y_hat[1]},\nTree 3:\t{y_hat[2]}')
```

Tree 1:	0.5630600686217242,
Tree 2:	1.119786451499782,
Tree 3:	1.6765128343778395

Compare the result of the naive prediction approach with the matrix-based prediction approach: the numbers match, apart from small deviations caused by rounding the coefficients and by the floating point representation during the computations.
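For completeness, the naive predictions from the regression equation can be reproduced with plain arithmetic, using the rounded coefficients from above:

```python
# Rounded coefficients from the fitted model
b0, b1, b2 = -1.642, 5.253, 0.031

# Three new trees: (girth in m, height in m)
trees_new = [(0.3, 20), (0.4, 21), (0.5, 22)]

# Apply the regression equation to each tree
predictions = [round(b0 + b1 * girth + b2 * height, 2)
               for girth, height in trees_new]
print(predictions)
```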

### `ols()` - Formula style

One function to build multiple linear regression models is statsmodels' `ols()` function, which we already discussed in the Linear Regression section. Here we use it with (R-style) formula notation as input; thus, a linear model is specified as `response ~ predictor(s)`. To increase the number of predictor variables we concatenate them by using the `+` sign.

In [6]:

```
model = smf.ols(formula='Volume ~ Girth + Height', data=df).fit()
model.summary()
```

Out[6]:

| Dep. Variable: | Volume | R-squared: | 0.948 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 253.5 |
| Date: | Wed, 05 Oct 2022 | Prob (F-statistic): | 1.16e-18 |
| Time: | 17:04:38 | Log-Likelihood: | 25.957 |
| No. Observations: | 31 | AIC: | -45.91 |
| Df Residuals: | 28 | BIC: | -41.61 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.6417 | 0.245 | -6.694 | 0.000 | -2.144 | -1.139 |
| Girth | 5.2528 | 0.296 | 17.763 | 0.000 | 4.647 | 5.859 |
| Height | 0.0314 | 0.012 | 2.593 | 0.015 | 0.007 | 0.056 |

| Omnibus: | 0.891 | Durbin-Watson: | 1.258 |
|---|---|---|---|
| Prob(Omnibus): | 0.640 | Jarque-Bera (JB): | 0.926 |
| Skew: | 0.304 | Prob(JB): | 0.629 |
| Kurtosis: | 2.410 | Cond. No. | 359. |

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The `summary()` method output returns, among other values, the intercept and the other regression coefficients. Compare the results with our matrix-based implementation of the multiple linear regression model. The numbers match perfectly!

An easy way to use an `ols()` object for prediction is to apply the `get_prediction()` method. Be aware that the new data provided to the `get_prediction()` method is the pandas `DataFrame` object `X_new` that we created earlier. As in the application above, we predict the volume of three black cherry trees with a girth of 0.3, 0.4, and 0.5 meters and a height of 20, 21, and 22 meters, respectively.

In [7]:

```
prediction_new = model.get_prediction(X_new)
prediction_new.predicted_mean
```

Out[7]:

array([0.56306007, 1.11978645, 1.67651283])

The `get_prediction()` method provides a simple approach to calculating confidence intervals. We simply use the `conf_int()` method to get the array of predicted values augmented with their lower (left) and upper (right) limits.

In [8]:

```
prediction_new.conf_int()
```

Out[8]:

array([[0.48240479, 0.64371535], [1.02944111, 1.21013179], [1.55296025, 1.80006542]])

The `summary_frame()` method additionally shows the prediction interval under the notation `obs_ci` (confidence interval for a new observation).

In [9]:

```
prediction_new.summary_frame()
```

Out[9]:

| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 0.563060 | 0.039375 | 0.482405 | 0.643715 | 0.323333 | 0.802787 |
| 1 | 1.119786 | 0.044105 | 1.029441 | 1.210132 | 0.876628 | 1.362945 |
| 2 | 1.676513 | 0.060316 | 1.552960 | 1.800065 | 1.419163 | 1.933863 |

Note that the default confidence level is 95% and may be changed by setting the argument `alpha`; the confidence level corresponds to $1 - \alpha$.

A higher confidence level causes the confidence interval to become wider, whereas a lower confidence level causes the interval to become narrower.

In multiple regression analysis tasks, there may be many independent variables, resulting in a high-dimensional model space, so visualization for model diagnostics is often restricted. In the case of bivariate data, the regression model can be visualized by a line. For $d$-dimensional data, the regression model results in a $d$-dimensional plane, a so-called *hyperplane*. However, when there are only two explanatory variables, such as in our simple example from above, the *hyperplane* is just an ordinary plane and we may visualize it by a 3D scatter plot. In Python, we may apply the `scatter3D()` method of the matplotlib library.

In [15]:

```
from mpl_toolkits.mplot3d import axes3d
sns.set(style="darkgrid")
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter3D(df['Girth'], df['Height'], df['Volume'])
ax.set_xlabel("Girth")
ax.set_ylabel("Height")
ax.set_zlabel("Volume")
plt.show()
```

Inspecting the 3D scatter plot, we see that the point data is very well approximated by a plane, indicating that the model fits the data well.
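To make this visual impression concrete, one may overlay the fitted regression plane on such a plot. The sketch below builds only the plane, using the rounded coefficients from above; the grid limits are illustrative choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Rounded coefficients from the fitted model
b0, b1, b2 = -1.642, 5.253, 0.031

# Grid over plausible girth (m) and height (m) ranges (illustrative limits)
G, H = np.meshgrid(np.linspace(0.2, 0.6, 20), np.linspace(19, 27, 20))
V = b0 + b1 * G + b2 * H  # volume predicted by the regression plane

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(G, H, V, alpha=0.4)
ax.set_xlabel("Girth")
ax.set_ylabel("Height")
ax.set_zlabel("Volume")
plt.show()
```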

Recall that we may access the properties of an `ols()` object. One of the most convenient methods to summarize the properties of an `ols()` instance is `summary()`.

In [20]:

```
model.summary()
```

Out[20]:

| Dep. Variable: | Volume | R-squared: | 0.948 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 253.5 |
| Date: | Wed, 05 Oct 2022 | Prob (F-statistic): | 1.16e-18 |
| Time: | 17:37:11 | Log-Likelihood: | 25.957 |
| No. Observations: | 31 | AIC: | -45.91 |
| Df Residuals: | 28 | BIC: | -41.61 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.6417 | 0.245 | -6.694 | 0.000 | -2.144 | -1.139 |
| Girth | 5.2528 | 0.296 | 17.763 | 0.000 | 4.647 | 5.859 |
| Height | 0.0314 | 0.012 | 2.593 | 0.015 | 0.007 | 0.056 |

| Omnibus: | 0.891 | Durbin-Watson: | 1.258 |
|---|---|---|---|
| Prob(Omnibus): | 0.640 | Jarque-Bera (JB): | 0.926 |
| Skew: | 0.304 | Prob(JB): | 0.629 |
| Kurtosis: | 2.410 | Cond. No. | 359. |

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The output of the `summary()` method has the title of the applied method (OLS). In the left column, the first lines give the dependent variable and basic info about the calculations. Below that we find some information about observations and degrees of freedom in the model and residuals.

On the right-hand side we find $R^2$, the squared *Pearson correlation coefficient*, also known as the *coefficient of determination*, and the adjusted $R^2$. The $R^2$ measure is interpreted as the proportion of total variation that is explained by the regression model. However, by the addition of explanatory variables to the model, the value of $R^2$ is inflated, regardless of whether or not the added variables are useful concerning the prediction of the response variable (Kerns 2010). In other words, the addition of a single explanatory variable to a regression model will increase the value of $R^2$, no matter how worthless the explanatory variable is. This issue is addressed by penalizing $R^2$ when parameters are added to the model. The result is an adjusted $R^2$, which is denoted by

$$\text{adjusted }R^2 =\left(R^2-\frac{d}{n-1}\right)\left(\frac{n-1}{n-d-1}\right)\text{,}$$

where $n$ corresponds to the number of observations and $d$ to the number of features in our data set.

Note that in many cases the values for $R^2$ and the adjusted $R^2$ will be very close to each other (such as in our example from above). If their values differ substantially, or if one changes dramatically when an explanatory variable is added, then the explanatory variables need to be critically assessed and possibly excluded from the regression model.

Below that it shows the $F$-statistic and its *p-value*. When the $F$-statistic is large, that is, when the explained variation is large relative to the unexplained variation, it is an indicator that the explanatory variables are useful to explain the response variable. At the bottom of this part, we find the Akaike information criterion (`AIC`), the Bayesian information criterion (`BIC`), and the log-likelihood, which are not further discussed here.

The next lines give the *regression coefficients* and the *intercept*, and in addition, for each of them, the *standard errors*, the *t-values*, and the *p-values*. If the *p-value* is small we conclude that there is a significant relationship between $y$ and $x_i$. The *p-values* imply that there is a significant linear relationship between `Volume` and `Girth` and between `Volume` and `Height` in the regression model. Further, it appears that the intercept is significant too. Note that *t-tests* only consider the effect of removing one variable and leaving in all the others.
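The quantities discussed above can also be accessed programmatically from a fitted model, rather than read off the summary table. A minimal sketch on synthetic data (the data and coefficient values are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic data mimicking the structure of the trees example
df = pd.DataFrame({'Girth': rng.uniform(0.2, 0.6, 31),
                   'Height': rng.uniform(19, 27, 31)})
df['Volume'] = (-1.6 + 5.3 * df['Girth'] + 0.03 * df['Height']
                + rng.normal(scale=0.05, size=31))

model = smf.ols('Volume ~ Girth + Height', data=df).fit()

print(model.params)   # intercept and regression coefficients
print(model.bse)      # standard errors
print(model.tvalues)  # t-values
print(model.pvalues)  # p-values
```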

This section was inspired by section 12 of the book *Introduction to Probability and Statistics using R* by G. Jay Kerns (2010).

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*