In this example we load the trees data set shipping with the
R-package datasets
. The trees data set provides
measurements of the girth, height and volume of timber in 31 felled
black cherry trees, also known as Prunus serotina. The height of the trees is
given in feet (ft) and the volume is given in cubic feet (cft). The
girth is the diameter of the tree (in inches) measured at 4 ft 6 in
(approx. 1.37 m) above the ground.
We take a quick look at the data set by applying the
head
function and immediately convert the variables in
meters and cubic meters, respectively, by applying these equations
\[ 1\; \text{in} = 0.0254\;\text{m,}\]
\[ 1\; \text{ft} = 0.3048\;\text{m,}\]
and
\[ 1\; \text{cft} = 0.0283168\;\text{m}^3\text{.}\]
library(datasets)
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
df <- trees
df$Girth <- df$Girth * 0.0254
df$Height <- df$Height * 0.3048
df$Volume <- df$Volume * 0.0283168
df <- round(df, 3) # round to three decimal digits
df
## Girth Height Volume
## 1 0.211 21.336 0.292
## 2 0.218 19.812 0.292
## 3 0.224 19.202 0.289
## 4 0.267 21.946 0.464
## 5 0.272 24.689 0.532
## 6 0.274 25.298 0.558
## 7 0.279 20.117 0.442
## 8 0.279 22.860 0.515
## 9 0.282 24.384 0.640
## 10 0.284 22.860 0.564
## 11 0.287 24.079 0.685
## 12 0.290 23.165 0.595
## 13 0.290 23.165 0.606
## 14 0.297 21.031 0.603
## 15 0.305 22.860 0.541
## 16 0.328 22.555 0.629
## 17 0.328 25.908 0.957
## 18 0.338 26.213 0.776
## 19 0.348 21.641 0.728
## 20 0.351 19.507 0.705
## 21 0.356 23.774 0.977
## 22 0.361 24.384 0.898
## 23 0.368 22.555 1.028
## 24 0.406 21.946 1.085
## 25 0.414 23.470 1.206
## 26 0.439 24.689 1.569
## 27 0.444 24.994 1.577
## 28 0.455 24.384 1.651
## 29 0.457 24.384 1.458
## 30 0.457 24.384 1.444
## 31 0.523 26.518 2.180
It is always a good idea to visualize the data we are working with.
However, instead of a scatter plot, which is fine for the comparison of
two variables, we plot a scatter plot matrix to account for multiple
variables. The lattice
package provides the handy
splom()
function for plotting scatter plot matrices.
library(lattice)
splom(df, xlab = "Scatter Plot Matrix")
A quick visual inspection indicates that there is an approximately linear relationship between girth and to somehow a less pronounced linear relationship between height and volume. Any relation between height and girth is less obvious.
The goal in this example is to build a linear regression
model with Volume
being the dependent variable and
Height
and Girth
being the independent
(explanatory) variables.
First, we develop a linear regression model based on the matrix-based
equations derived in the previous section. Thereafter, we apply the
build-in function lm()
.
A multiple linear regression model can be learned based on sample data by finding the optimal set of the coefficients \(\beta\). The regression model is written as
\[y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i\text{,} \\\]
where \(\beta\) corresponds to the model coefficients, \(x_i\) to the observation/measurement data, and \(\epsilon\) to the residuals. We use the least square method as our objective function, which may be written as
\[\mathcal E = \sum_{i=1}^n(y_i - x_i^T \beta)^2\text{.}\]
In order to calculate the optimum solution for the objective function with estimate to the model coefficients \(\hat \beta\) by applying the following equation:
\[\hat \beta = (\mathbf X^T\mathbf
X)^{-1}\mathbf X^T\mathbf y.\] In order to calculate \(\hat \beta\) in R we start by constructing
the model matrix \(\mathbf X\) and the
response vector \(\mathbf y\). Be aware
the we have to manually account for the intercept term, \(\beta_0\), by populating the first column
of the matrix \(\mathbf X\) with ones.
The software package R provides the t()
function to
transpose a matrix, the solve()
function to calculate the
inverse of a matrix, and a special character, %*%
, to
conduct matrix multiplication.
y <- as.matrix(df[, "Volume"]) # response vector
beta0 <- rep(1, nrow(df)) # intercept
X <- df[, c("Girth", "Height")] # feature selection
X <- as.matrix(cbind(beta0, X)) # model matrix
model1 <- solve(t(X) %*% X) %*% t(X) %*% y # solving for beta
model1
## [,1]
## beta0 -1.64166129
## Girth 5.25282723
## Height 0.03144366
The calculation yields a value of -1.642 for the intercept \(\beta_0\), a value of 5.253 for \(\beta_1\), the coefficient of the
Girth
variable, and a value of 0.031 for \(\beta_2\), the coefficient of the
Height
variable. The intercept \(\beta_0\) represents the volume when all
independent variables are zero. The coefficient \(\beta_1\) represents the change in volume
when there is a unit increase in the Girth
variable, while
the other independent variable is held constant. The coefficient \(\beta_2\) represents the change in volume
when there is a unit increase in the Height
variable, while
the other independent variable is held constant.
Consequently, the linear model can be written as
\[\text{Volume} = -1.642 + 5.253 \times \text{Girth} +0.031 \times \text{Height}{.} \]
Based on the regression equation it is fairly easy to predict the volume for any black cherry tree given its girth and height. We predict the volume of three black cherry trees with a girth of 0.3, 0.4 and 0.5 meters and a height of 20, 21 and 22 meters, respectively.
\[\text{Volume} = -1.642 + 5.253 \times 0.3+0.031 \times 20 = 0.55\] \[\text{Volume} = -1.642 + 5.253 \times 0.4+0.031 \times 21 = 1.11\]
\[\text{Volume} = -1.642 + 5.253 \times 0.5+0.031 \times 22 = 1.67\]
This naive prediction procedure works perfectly fine, however, a more sophisticated and scalable procedure is to make use of the matrix notation we introduced in the previous section. Recall that our unknown coefficients \(\hat \beta\) are calculated by the equation
\[\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y{.}\]
Given a bunch of new data points in matrix form \(\mathbf X_{new}\), the least squares prediction for \(\hat y\) is
\[\mathbf{\hat y} = \mathbf X_{new}\hat \beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y{.}\]
# response vector
y <- as.matrix(df[, "Volume"])
# model matrix
beta0 <- rep(1, nrow(df))
X <- df[, c("Girth", "Height")]
X <- as.matrix(cbind(beta0, X))
# new data matrix
X_new <- data.frame(
"Girth" = c(0.3, 0.4, 0.5),
"Height" = c(20, 21, 22)
)
# add intercept term to the new data matrix
beta0 <- rep(1, nrow(X_new))
X_new <- as.matrix(cbind(beta0, X_new))
# plug into equation from above
prediction <- X_new %*% solve(t(X) %*% X) %*% t(X) %*% y
prediction
## [,1]
## [1,] 0.5630601
## [2,] 1.1197865
## [3,] 1.6765128
Compare the result of the naive prediction approach with the more sophisticated matrix-based prediction approach. The numbers match, though there are some small deviations related to the representation of floating point numbers during the computations.
lm()
One function to build multiple linear regression models that ships
with R is the lm()
function, which we already discussed in
the section Linear Regression section. The lm()
expects the
formula notation as input; thus, a linear model is specified as
response ~ predictor(s)
. In order to increase the number of
predictor variables we concatenate them by using the +
sign.
model2 <- lm(Volume ~ Girth + Height, data = df)
model2
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = df)
##
## Coefficients:
## (Intercept) Girth Height
## -1.64166 5.25283 0.03144
The default output to the R-console returns the intercept and the other regression coefficients. Compare the results with our matrix-based implementation of the multiple linear regression model. The numbers match perfectly!
An easy way to use an lm
object for prediction is to
apply the predict()
function. Be aware that the new data
provided to the predict()
function must be a
data.frame
object or a list
object. As in the
application above we predict the volume of three black cherry trees with
a girth of 0.3, 0.4 and 0.5 meters and a height of 20, 21 and 22 meters,
respectively.
new <- data.frame("Girth" = c(0.3, 0.4, 0.5),
"Height" = c(20, 21, 22)
)
predict(model2, newdata = new)
## 1 2 3
## 0.5630601 1.1197865 1.6765128
The numbers match! In addition, the predict()
function
provides a simple approach to calculate the confidence and prediction
intervals. We apply the predict()
function and add the
argument interval = "confidence"
or
interval = "prediction"
to you get the vector of predicted
values augmented with its upper and lower limits.
predict(model2,
newdata = new,
interval = "confidence"
)
## fit lwr upr
## 1 0.5630601 0.4824048 0.6437153
## 2 1.1197865 1.0294411 1.2101318
## 3 1.6765128 1.5529603 1.8000654
predict(model2,
newdata = new,
interval = "prediction"
)
## fit lwr upr
## 1 0.5630601 0.3233328 0.8027874
## 2 1.1197865 0.8766277 1.3629452
## 3 1.6765128 1.4191626 1.9338631
Note that the default confidence level, respectively prediction level
is 95%. This can be changed with the level
argument.
predict(model2,
newdata = new,
interval = "confidence",
level = 0.99
)
## fit lwr upr
## 1 0.5630601 0.4542576 0.6718625
## 2 1.1197865 0.9979123 1.2416606
## 3 1.6765128 1.5098427 1.8431829
A higher confidence level causes the confidence interval to become wider, whereas as a lower confidence level causes the interval to become narrower.
predict(model2,
newdata = new,
interval = "confidence",
level = 0.9
)
## fit lwr upr
## 1 0.5630601 0.4960787 0.6300415
## 2 1.1197865 1.0447578 1.1948151
## 3 1.6765128 1.5739067 1.7791190
In multiple regression analysis tasks there may be many independent
variables, resulting in a high-dimensional model space, so that
visualization for the purpose of model diagnostics is very often
restricted. In the case of bivariate data the regression model can be
visualized by a line. For \(d\)-dimensional data the regression model
results in a \(d\)-dimensional plane, a
so-called hyperplane. However, when there are only
two explanatory variables, such as in our simple example from above, the
hyperplane is just an ordinary plane and we may visualize it by a 3D
scatter plot. In R we may apply the scatterplot3d()
function of the eponymous scatterplot3d
package.
library(scatterplot3d)
scatter.3d <- with(df,
scatterplot3d(Girth,
Height,
Volume,
pch = 16,
highlight.3d = TRUE,
angle = 60)
)
scatter.3d$plane3d(model2)
By inspecting the 3D scatter plot we see that the point data is very well approximated by the plane. Thus indicating, that the model fits well the data.
Recall that we may access the properties of a lm
object
by various extractor functions (see section Linear
Regression for more details). One of the most convenient functions to
summarize the properties of a model object in R is the
summary()
function.
summary(model2)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.182027 -0.074603 -0.004944 0.061530 0.240610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.64166 0.24524 -6.694 2.89e-07 ***
## Girth 5.25283 0.29572 17.763 < 2e-16 ***
## Height 0.03144 0.01212 2.593 0.0149 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1102 on 28 degrees of freedom
## Multiple R-squared: 0.9477, Adjusted R-squared: 0.9439
## F-statistic: 253.5 on 2 and 28 DF, p-value: < 2.2e-16
The output of the summary()
function starts with the
repetition of the function call.
The next lines give the distribution of the residuals that serve as a quick check of the distributional assumptions. The average of the residuals is zero by definition, so the median should not be far from zero.
The next lines give the regression coefficients and the
intercept, and in addition, for each of them the standard
errors, the t-values, and the p-values. The star
symbols are graphical indicators of the level of significance. If the
p-value is small we conclude that there is a significant
relationship between \(\mathbb y\) and
\(x_i\). The p-values imply
that there is a significant linear relationship between
Volume
and Girth
and between
Volume
and Height
in the regression model.
Further, it appears that the intercept is significant too. Note that the
t-tests only considers the effect of removing one variable and
leave in all the others.
The next line gives the residual standard error, an expression of the variation of the observations around the regression hyperplane.
The next line gives \(R^2\), the squared Pearson correlation coefficient, also known as the coefficient of determination and the adjusted \(R^2\). The \(R^2\) measure is interpreted as the proportion of total variation that is explained by the regression model. However, by the addition of explanatory variables to the model the value of \(R^2\) is inflated, regardless of whether or not the added variables are useful with respect to prediction of the response variable (Kerns 2010). In other words the addition of a single explanatory variable to a regression model will increase the value of \(R^2\), no matter how worthless the explanatory variable is. This issue is addressed by penalizing \(R^2\) when parameters are added to the model. The result is an adjusted \(R^2\) which is denoted by
\[\text{adjusted }R^2 =\left(R^2-\frac{d}{n-1}\right)\left(\frac{n-1}{n-d-1}\right)\text{,}\] \(\qquad\) where \(n\) corresponds to the number of observations and \(d\) to the number of features in our data set.
Note that in many cases the values for \(R^2\) and the adjusted \(R^2\) will be very close to each other (such as in our example from above). If their values differ substantially, or if one changes dramatically when an explanatory variable is added, then the explanatory variables need to be critically assessed and possibly excluded from the regression model.
The last line shows the \(F\)-statistic, the number of degrees of freedom and the p-value. When the \(F\)-statistic is large, that is, when the explained variation is large relative to the unexplained variation, it is an indicator that the explanatory variables are useful variables to explain the response variable.
Please note that there exists a very informative blog post by Felipe Rego, who provides a nicely written overview of the output of a linear model in R.
This section was inspired by section 12 of the book Introduction to Probability and Statistics using R by G. Jay Kerns (2010).
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.