In this example, we load the trees data set that ships with the R package datasets; statsmodels provides a function to import such R sample data sets into Python. The trees data set provides measurements of the girth, height, and volume of timber in 31 felled black cherry trees, also known as Prunus serotina. The height of the trees is given in feet (ft) and the volume in cubic feet (cft). The girth is the diameter of the tree (in inches), measured at 4 ft 6 in (approx. 1.37 m) above the ground.
We take a quick look at the data set by applying the head()-method and immediately convert the variables to meters and cubic meters, respectively, using the conversion factors

$$1\;\text{ft} = 0.3048\;\text{m}\text{,} \qquad 1\;\text{in} = 0.0254\;\text{m}$$

and

$$ 1\; \text{cft} = 0.0283168\;\text{m}^3\text{.}$$

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
trees = sm.datasets.get_rdataset('trees').data
df = trees
df['Girth'] = df['Girth'] * 0.0254
df['Height'] = df['Height'] * 0.3048
df['Volume'] = df['Volume'] * 0.0283168
df = df.round(3)
df.head()
|   | Girth | Height | Volume |
|---|---|---|---|
| 0 | 0.211 | 21.336 | 0.292 |
| 1 | 0.218 | 19.812 | 0.292 |
| 2 | 0.224 | 19.202 | 0.289 |
| 3 | 0.267 | 21.946 | 0.464 |
| 4 | 0.272 | 24.689 | 0.532 |
It is always a good idea to visualize the data we are working with. However, instead of a scatter plot, which is fine for comparing two variables, we plot a scatter plot matrix to account for multiple variables. The seaborn package provides the handy pairplot() function for plotting scatter plot matrices.
sns.pairplot(df)
plt.show()
A quick visual inspection indicates an approximately linear relationship between girth and volume, and a somewhat less pronounced linear relationship between height and volume. Any relation between height and girth is less obvious.
The goal in this example is to build a linear regression model with Volume being the dependent variable and Height and Girth being the independent (explanatory) variables.

First, we develop a linear regression model based on the matrix-based equations derived in the previous section. Thereafter, we apply the statsmodels function ols().
A multiple linear regression model can be learned from sample data by finding the optimal set of coefficients $\beta$. The regression model is written as

$$y_i = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i\text{,}$$

where $\beta$ corresponds to the model coefficients, $x_i$ to the observation/measurement data, and $\epsilon$ to the residuals. We use the least squares method as our objective function, which may be written as

$$\mathcal E = \sum_{i=1}^n(y_i - x_i^T \beta)^2\text{.}$$

The optimal solution for this objective function, the estimate of the model coefficients $\hat \beta$, is obtained by applying the following equation:
$$\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.$$

To calculate $\hat \beta$ in Python we start by constructing the model matrix $\mathbf X$ and the response vector $\mathbf y$ as numpy arrays. Be aware that we have to manually account for the intercept term, $\beta_0$, by populating the first column of the matrix $\mathbf X$ with ones. numpy provides the transpose()-method to transpose an array, the inv()-function in the linalg module to calculate the inverse of a matrix, and the matmul()-function to conduct matrix multiplication.
y = df['Volume'].to_numpy()
x = sm.add_constant(df[['Girth', 'Height']].to_numpy())
left = np.matmul(x.transpose(), x)
right = np.matmul(x.transpose(), y)
solved = np.matmul(np.linalg.inv(left), right)
print(f'Intercept:\t{solved[0]},\nGirth:\t\t{solved[1]},\nHeight:\t\t{solved[2]}')
Intercept:	-1.6416612935772914,
Girth:		5.252827232565963,
Height:		0.03144365962146134
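As an aside (not part of the workflow above), explicitly inverting $\mathbf X^T\mathbf X$ can be numerically unstable when predictors are nearly collinear; numpy's np.linalg.lstsq solves the same least-squares problem via a more stable SVD-based routine. A minimal sketch on hypothetical toy data:

```python
import numpy as np

# Hypothetical toy data: intercept column, girth (m), height (m).
X = np.column_stack([np.ones(4),
                     [0.2, 0.25, 0.3, 0.35],
                     [20.0, 23.0, 21.0, 24.0]])
y = np.array([0.3, 0.45, 0.6, 0.75])

# Normal-equations solution, as in the text above ...
beta_ne = np.linalg.inv(X.T @ X) @ X.T @ y
# ... versus numpy's least-squares solver.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_ne, beta_ls))  # True: both yield the same estimate here
```

For well-conditioned problems, such as the trees data, both approaches agree to many decimal places.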
The calculation yields a value of -1.642 for the intercept $\beta_0$, a value of 5.253 for $\beta_1$, the coefficient of the Girth variable, and a value of 0.031 for $\beta_2$, the coefficient of the Height variable. The intercept $\beta_0$ represents the volume when all independent variables are zero. The coefficient $\beta_1$ represents the change in volume for a unit increase in the Girth variable while the other independent variable is held constant; likewise, $\beta_2$ represents the change in volume for a unit increase in the Height variable while the other independent variable is held constant.
Consequently, the linear model can be written as
$$\text{Volume} = -1.642 + 5.253 \times \text{Girth} + 0.031 \times \text{Height}\text{.} $$

Based on the regression equation it is fairly easy to predict the volume of any black cherry tree given its girth and height. We predict the volume of three black cherry trees with a girth of 0.3, 0.4, and 0.5 meters and a height of 20, 21, and 22 meters, respectively.

$$\text{Volume} = -1.642 + 5.253 \times 0.3 + 0.031 \times 20 = 0.55$$
$$\text{Volume} = -1.642 + 5.253 \times 0.4 + 0.031 \times 21 = 1.11$$
$$\text{Volume} = -1.642 + 5.253 \times 0.5 + 0.031 \times 22 = 1.67$$

This naive prediction procedure works perfectly fine; however, a more sophisticated and scalable procedure is to make use of the matrix notation introduced in the previous section. Recall that our unknown coefficient $\hat \beta$ is calculated by the equation
$$\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y{.}$$

Given a set of new data points in matrix form, $\mathbf X_{new}$, the least squares prediction for $\mathbf{\hat y}$ is

$$\mathbf{\hat y} = \mathbf X_{new}\hat \beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y{.}$$

X_new = pd.DataFrame({'Girth': [0.3, 0.4, 0.5],
'Height': [20, 21, 22]})
X_new = sm.add_constant(X_new)
left = np.matmul(x.transpose(), x)
right = np.matmul(x.transpose(), y)
both = np.matmul(np.linalg.inv(left), right)
solved = np.matmul(X_new.to_numpy(), both)
print(f'Tree 1:\t{solved[0]},\nTree 2:\t{solved[1]},\nTree 3:\t{solved[2]}')
Tree 1:	0.5630600686217242,
Tree 2:	1.119786451499782,
Tree 3:	1.6765128343778395
Compare the result of the naive prediction approach with the more sophisticated matrix-based prediction approach. The numbers match, though there are some small deviations related to the representation of floating point numbers during the computations.
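The matrix-based prediction can also be written very compactly; a minimal sketch using the rounded coefficients quoted above (labels and data as in the hand calculation):

```python
import numpy as np

# Rounded coefficients from the model above: intercept, Girth, Height.
beta = np.array([-1.642, 5.253, 0.031])

# New trees: a leading one for the intercept, then girth (m) and height (m).
X_new = np.array([[1.0, 0.3, 20.0],
                  [1.0, 0.4, 21.0],
                  [1.0, 0.5, 22.0]])

preds = X_new @ beta
print(preds)  # approximately 0.55, 1.11, 1.67 cubic meters
```

Because the coefficients are rounded to three decimals, these predictions differ slightly from the full-precision results above.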
ols() - Formula style

One function to build multiple linear regression models is statsmodels' ols() function, which we already discussed in the Linear Regression section. Here we use it with (R-style) formula notation as input; thus, a linear model is specified as response ~ predictor(s). To increase the number of predictor variables we concatenate them with the + sign.
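As an aside, the formula mini-language supports more than plain addition: `*` expands to main effects plus their interaction, and `- 1` drops the intercept. A brief sketch on hypothetical toy data (the data frame below is invented purely to illustrate the syntax):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data, just to illustrate the formula syntax.
df = pd.DataFrame({'Volume': [0.3, 0.5, 0.7, 0.9, 1.1],
                   'Girth':  [0.2, 0.3, 0.4, 0.5, 0.6],
                   'Height': [20.0, 22.0, 21.0, 24.0, 23.0]})

m1 = smf.ols('Volume ~ Girth + Height', data=df).fit()      # main effects only
m2 = smf.ols('Volume ~ Girth * Height', data=df).fit()      # adds Girth:Height interaction
m3 = smf.ols('Volume ~ Girth + Height - 1', data=df).fit()  # no intercept

print(len(m1.params), len(m2.params), len(m3.params))  # 3 4 2
```

In the example below we only need the plain additive form.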
model = smf.ols(formula='Volume ~ Girth + Height', data=df).fit()
model.summary()
Dep. Variable: | Volume | R-squared: | 0.948 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.944 |
Method: | Least Squares | F-statistic: | 253.5 |
Date: | Wed, 05 Oct 2022 | Prob (F-statistic): | 1.16e-18 |
Time: | 17:04:38 | Log-Likelihood: | 25.957 |
No. Observations: | 31 | AIC: | -45.91 |
Df Residuals: | 28 | BIC: | -41.61 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1.6417 | 0.245 | -6.694 | 0.000 | -2.144 | -1.139 |
Girth | 5.2528 | 0.296 | 17.763 | 0.000 | 4.647 | 5.859 |
Height | 0.0314 | 0.012 | 2.593 | 0.015 | 0.007 | 0.056 |
Omnibus: | 0.891 | Durbin-Watson: | 1.258 |
---|---|---|---|
Prob(Omnibus): | 0.640 | Jarque-Bera (JB): | 0.926 |
Skew: | 0.304 | Prob(JB): | 0.629 |
Kurtosis: | 2.410 | Cond. No. | 359. |
The summary()-method output returns, among other values, the intercept and the other regression coefficients. Compare the results with our matrix-based implementation of the multiple linear regression model. The numbers match perfectly!
An easy way to use a fitted ols() object for prediction is the predict()-method; the related get_prediction()-method additionally returns standard errors and intervals. Be aware that the new data provided to the get_prediction()-method is the pandas DataFrame object X_new that we created earlier. As in the application above, we predict the volume of three black cherry trees with a girth of 0.3, 0.4, and 0.5 meters and a height of 20, 21, and 22 meters, respectively.
prediction_new = model.get_prediction(X_new)
prediction_new.predicted_mean
array([0.56306007, 1.11978645, 1.67651283])
The numbers match! In addition, the get_prediction()-method provides a simple approach to calculating confidence intervals. We simply use the conf_int()-method to get, for each predicted value, its lower (left) and upper (right) limit.
prediction_new.conf_int()
array([[0.48240479, 0.64371535],
       [1.02944111, 1.21013179],
       [1.55296025, 1.80006542]])
The summary_frame()-method additionally shows the prediction interval, denoted obs_ci (the confidence interval for new observations).
prediction_new.summary_frame()
|   | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 0.563060 | 0.039375 | 0.482405 | 0.643715 | 0.323333 | 0.802787 |
| 1 | 1.119786 | 0.044105 | 1.029441 | 1.210132 | 0.876628 | 1.362945 |
| 2 | 1.676513 | 0.060316 | 1.552960 | 1.800065 | 1.419163 | 1.933863 |
Note that the default confidence level is 95% and may be changed by setting the argument alpha.
A higher confidence level causes the confidence interval to become wider, whereas a lower confidence level causes the interval to become narrower.
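This effect of alpha is easy to verify; a minimal sketch on synthetic data (the variables and seed below are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: a noisy linear relationship.
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': np.linspace(0, 10, 30)})
df['y'] = 2.0 + 0.5 * df['x'] + rng.normal(scale=0.3, size=30)

pred = smf.ols('y ~ x', data=df).fit().get_prediction(pd.DataFrame({'x': [5.0]}))

ci95 = pred.conf_int(alpha=0.05)  # 95% confidence interval
ci99 = pred.conf_int(alpha=0.01)  # 99% confidence interval

width95 = ci95[0, 1] - ci95[0, 0]
width99 = ci99[0, 1] - ci99[0, 0]
print(width99 > width95)  # True: the 99% interval is wider
```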
In multiple regression analysis tasks, there may be many independent variables, resulting in a high-dimensional model space, so visualization for model diagnostics is very often restricted. In the case of bivariate data, the regression model can be visualized by a line. For $d$-dimensional data, the regression model results in a $d$-dimensional plane, a so-called hyperplane. However, when there are only two explanatory variables, as in our simple example from above, the hyperplane is just an ordinary plane and we may visualize it by a 3D scatter plot. In Python, we may apply the scatter3D()-method of matplotlib's 3D axes.
from mpl_toolkits.mplot3d import axes3d  # registers the 3D projection
sns.set(style = "darkgrid")
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection = '3d')
ax.scatter3D(
df['Girth'],
df['Height'],
df['Volume'],
)
ax.set_xlabel("Girth")
ax.set_ylabel("Height")
ax.set_zlabel("Volume")
plt.show()
By inspecting the 3D scatter plot we see that the point data is very well approximated by a plane, indicating that the model fits the data well.
Recall that we may access the properties of an ols() object. One of the most convenient methods to summarize the properties of an ols() instance is summary().
model.summary()
Dep. Variable: | Volume | R-squared: | 0.948 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.944 |
Method: | Least Squares | F-statistic: | 253.5 |
Date: | Wed, 05 Oct 2022 | Prob (F-statistic): | 1.16e-18 |
Time: | 17:37:11 | Log-Likelihood: | 25.957 |
No. Observations: | 31 | AIC: | -45.91 |
Df Residuals: | 28 | BIC: | -41.61 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1.6417 | 0.245 | -6.694 | 0.000 | -2.144 | -1.139 |
Girth | 5.2528 | 0.296 | 17.763 | 0.000 | 4.647 | 5.859 |
Height | 0.0314 | 0.012 | 2.593 | 0.015 | 0.007 | 0.056 |
Omnibus: | 0.891 | Durbin-Watson: | 1.258 |
---|---|---|---|
Prob(Omnibus): | 0.640 | Jarque-Bera (JB): | 0.926 |
Skew: | 0.304 | Prob(JB): | 0.629 |
Kurtosis: | 2.410 | Cond. No. | 359. |
The output of the summary()-method starts with the title of the applied method (OLS). In the left column, the first lines give the dependent variable and basic information about the calculations. Below that we find some information about the observations and the degrees of freedom of the model and the residuals.
On the right-hand side we find the $R^2$, the squared Pearson correlation coefficient, also known as the coefficient of determination, and the adjusted $R^2$. The $R^2$ measure is interpreted as the proportion of total variation that is explained by the regression model. However, by the addition of explanatory variables to the model, the value of $R^2$ is inflated, regardless of whether or not the added variables are useful concerning the prediction of the response variable (Kerns 2010). In other words, the addition of a single explanatory variable to a regression model will increase the value of $R^2$, no matter how worthless the explanatory variable is. This issue is addressed by penalizing $R^2$ when parameters are added to the model. The result is an adjusted $R^2$ which is denoted by $$\text{adjusted }R^2 =\left(R^2-\frac{d}{n-1}\right)\left(\frac{n-1}{n-d-1}\right)\text{,}$$
where $n$ corresponds to the number of observations and $d$ to the number of features in our data set.
Note that in many cases the values for $R^2$ and the adjusted $R^2$ will be very close to each other (such as in our example from above). If their values differ substantially, or if one changes dramatically when an explanatory variable is added, then the explanatory variables need to be critically assessed and possibly excluded from the regression model.
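The adjusted $R^2$ formula can be checked directly against the summary table above, plugging in $R^2 = 0.948$, $n = 31$ observations, and $d = 2$ explanatory variables:

```python
# Adjusted R-squared from the formula above, using the values reported
# in the summary table (R-squared = 0.948, n = 31, d = 2).
r2, n, d = 0.948, 31, 2
adj_r2 = (r2 - d / (n - 1)) * ((n - 1) / (n - d - 1))
print(round(adj_r2, 3))  # 0.944, matching the Adj. R-squared in the table
```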
Below that it shows the $F$-statistic and its p-value. When the $F$-statistic is large, that is, when the explained variation is large relative to the unexplained variation, it is an indicator that the explanatory variables are useful variables to explain the response variable.
At the bottom of this part, we find the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the log-likelihood, which are not further discussed here.
The next lines give the regression coefficients and the intercept and, in addition, for each of them the standard errors, the t-values, and the p-values. If the p-value is small we conclude that there is a significant relationship between $y$ and $x_i$. The p-values imply that there is a significant linear relationship between Volume and Girth and between Volume and Height in the regression model. Further, it appears that the intercept is significant too. Note that t-tests only consider the effect of removing one variable while leaving in all the others.
This section was inspired by section 12 of the book Introduction to Probability and Statistics using R by G. Jay Kerns (2010).
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.