In this example we load the trees data set shipping with the R-package datasets. The trees data set provides measurements of the girth, height and volume of timber in 31 felled black cherry trees, also known as Prunus serotina. The height of the trees is given in feet (ft) and the volume is given in in cubic feet (cft). The girth is the diameter of the tree (in inches) measured at 4 ft 6 in (approx. 1.37 m) above the ground.

We take a quick look at the data set by applying the head function and immediately convert the variables in meters, respectively, cubic meters, by applying these equations

$1\; \text{in} = 0.0254\;\text{m,}$

$1\; \text{ft} = 0.3048\;\text{m,}$

and

$1\; \text{cft} = 0.0283168\;\text{m}^3\text{.}$

library(datasets)
head(trees)
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
df <- trees
df$Girth <- df$Girth*0.0254
df$Height <- df$Height*0.3048
df$Volume <- df$Volume*0.0283168
df <- round(df,3) # round to three decimal digits
df
##    Girth Height Volume
## 1  0.211 21.336  0.292
## 2  0.218 19.812  0.292
## 3  0.224 19.202  0.289
## 4  0.267 21.946  0.464
## 5  0.272 24.689  0.532
## 6  0.274 25.298  0.558
## 7  0.279 20.117  0.442
## 8  0.279 22.860  0.515
## 9  0.282 24.384  0.640
## 10 0.284 22.860  0.564
## 11 0.287 24.079  0.685
## 12 0.290 23.165  0.595
## 13 0.290 23.165  0.606
## 14 0.297 21.031  0.603
## 15 0.305 22.860  0.541
## 16 0.328 22.555  0.629
## 17 0.328 25.908  0.957
## 18 0.338 26.213  0.776
## 19 0.348 21.641  0.728
## 20 0.351 19.507  0.705
## 21 0.356 23.774  0.977
## 22 0.361 24.384  0.898
## 23 0.368 22.555  1.028
## 24 0.406 21.946  1.085
## 25 0.414 23.470  1.206
## 26 0.439 24.689  1.569
## 27 0.444 24.994  1.577
## 28 0.455 24.384  1.651
## 29 0.457 24.384  1.458
## 30 0.457 24.384  1.444
## 31 0.523 26.518  2.180

It is always a good idea to visualize the data we are working with. However, instead of a scatter plot, which is fine for the comparison of two variables, we plot a scatter plot matrix to account for multiple variables. The lattice package provides the handy splom() function for plotting scatter plot matrices.

library(lattice)
splom(df)

A quick visual inspection indicates that there is an approximately linear relationship between girth and volume and to somehow a less pronounced linear relationship between height and volume. Any relation between height and girth is less obvious.

The goal in this example is to build a linear regression model with Volume being the dependent variable and Height and Girth being the independent (explanatory) variables.

First, we develop a linear regression model based on the matrix-based equations derived in the previous section. Thereafter, we apply the build-in function lm().

#### One step learning

A multiple linear regression model can be learned based on sample data by finding the optimal set of the coefficients $$\beta$$. The regression model is written as

$y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i\text{,} \\$

where $$\beta$$ corresponds to the model coefficients, $$x_i$$ to the observation/measurement data, and $$\epsilon$$ to the residuals. We use the least square method as our objective function, which may be written as

$\mathcal L = \sum_{i=1}^n(y_i - x_i^T \beta)\text{.}$

In order to calculate the optimum solution for the objective function with estimate to the model coefficients $$\hat \beta$$ by applying the following equation

$\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$

In order to calculate $$\hat \beta$$ in R we start by constructing the model matrix $$\mathbf X$$ and the response vector $$\mathbf y$$. Be aware the we have to manually account for the intercept term, $$\beta_0$$, by populating the first column of the matrix $$\mathbf X$$ with ones. The software package R provides the t() function to transpose a matrix, the solve() function to calculate the inverse of a matrix, and a special character, %*%, to conduct matrix multiplication.

y <- as.matrix(df[,'Volume']) # response vector

beta0 = rep(1, nrow(df)) # intercept
X <- df[,c('Girth','Height')] # feature selection
X <- as.matrix(cbind(beta0, X)) # model matrix

model1 <- solve(t(X)%*%X)%*%t(X)%*%y # sovling for beta
model1 
##               [,1]
## beta0  -1.64166129
## Girth   5.25282723
## Height  0.03144366

The calculation yields a value of -1.642 for the intercept $$\beta_0$$, a value of 5.253 for $$\beta_1$$, the coefficient of the Girth variable, and a value of 0.031 for $$\beta_2$$, the coefficient of the Height variable. The intercept $$\beta_0$$ represents the volume when all independent variables are zero. The coefficient $$\beta_1$$represents the change in volume when there is a unit increase in the Girth variable, while the other independent variable is held constant. The coefficient $$\beta_2$$ represents the change in volume when there is a unit increase in the Height variable, while the other independent variable is held constant.

Consequently, the linear model can be written as

$\text{Volume} = -1.642 + 5.253 \times \text{Girth}+0.031 \times \text{Height}$

#### Model prediction

Based on the regression equation it is fairly easy to predict the volume for any black cherry tree given its girth and height. We predict the volume of 3 black cherry trees with a girth of 0.3, 0.4 and 0.5 meters and a height of 20, 21 and 22 meters, respectively.

$\text{Volume} = -1.642 + 5.253 \times 0.3+0.031 \times 20 = 0.55$ $\text{Volume} = -1.642 + 5.253 \times 0.4+0.031 \times 21 = 1.11$

$\text{Volume} = -1.642 + 5.253 \times 0.5+0.031 \times 22 = 1.67$

This naive prediction procedure works perfectly fine, however, a more sophisticated and scalable procedure is to make use of the matrix notation we introduced in the previous section. Recall that our unknown coefficients $$\hat \beta$$ are calculated by the equation

$\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$

Given a bunch of new data points in matrix form $$\mathbf X_{new}$$, the least squares prediction for $$\hat y$$ is

$\hat y = \mathbf X_{new}\hat \beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$

# response vector
y <- as.matrix(df[,'Volume'])

# model matrix
beta0 = rep(1,nrow(df))
X <- df[,c('Girth','Height')]
X <- as.matrix(cbind(beta0, X))

# new data matrix
X.new = data.frame('Girth' = c(0.3, 0.4, 0.5),
'Height' = c(20, 21, 22))
# add intercept term to the new data matrix
beta0 = rep(1,nrow(X.new))
X.new <- as.matrix(cbind(beta0, X.new))

# plug into equation from above
prediction <- X.new%*%solve(t(X)%*%X)%*%t(X)%*%y
prediction
##           [,1]
## [1,] 0.5630601
## [2,] 1.1197865
## [3,] 1.6765128

Compare the result of the naive prediction approach with the more sophisticated matrix-based prediction approach. The numbers match, though there are some small deviations related to the representation of floating point numbers during the computations.

#### Build-in Linear Model function lm()

One function to build multiple linear regression models that ships with R is the lm() function, which we already disused in the section Linear Regression. The lm() expects the formula notation as input; thus, a linear model is specified as response ~ predictor(s). In order to increase the number of predictor variables we concatenate them by using the + sign.

model2 <- lm(Volume ~ Girth + Height, data = df)
model2
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = df)
##
## Coefficients:
## (Intercept)        Girth       Height
##    -1.64166      5.25283      0.03144

The default output to the R-console returns the intercept and the other regression coefficients. Compare the results with our matrix-based implementation of the multiple linear regression model. The numbers match perfectly!

#### Model prediction

An easy way to use a lm object for prediction is to apply the predict() function. Be aware that the new data provided to the predcit() function must be a data.frame object or a list object. As in the application above we predict the volume of 3 black cherry trees with a girth of 0.3, 0.4 and 0.5 meters and a height of 20, 21 and 22 meters, respectively.

new <- data.frame('Girth' = c(0.3, 0.4, 0.5),
'Height' =  c(20, 21, 22))
predict(model2, newdata = new)
##         1         2         3
## 0.5630601 1.1197865 1.6765128

The numbers match! In addition, the predict() function provides a simple approach to calculate the confidence and prediction intervals. We apply the predict() function and add the argument interval = "confidence" or interval = "prediction" to get the vector of predicted values augmented with its upper and lower limits.

predict(model2, newdata = new, interval = "confidence")
##         fit       lwr       upr
## 1 0.5630601 0.4824048 0.6437153
## 2 1.1197865 1.0294411 1.2101318
## 3 1.6765128 1.5529603 1.8000654
predict(model2, newdata = new, interval = "prediction")
##         fit       lwr       upr
## 1 0.5630601 0.3233328 0.8027874
## 2 1.1197865 0.8766277 1.3629452
## 3 1.6765128 1.4191626 1.9338631

Note that the default confidence level, respectively prediction level is 95%. This can be changed with the level argument.

predict(model2, newdata = new,
interval = "confidence", level = 0.99)
##         fit       lwr       upr
## 1 0.5630601 0.4542576 0.6718625
## 2 1.1197865 0.9979123 1.2416606
## 3 1.6765128 1.5098427 1.8431829

A higher confidence level causes the confidence interval to become wider, whereas as a lower confidence level causes the interval to become narrower.

predict(model2, newdata = new,
interval = "confidence", level = 0.9)
##         fit       lwr       upr
## 1 0.5630601 0.4960787 0.6300415
## 2 1.1197865 1.0447578 1.1948151
## 3 1.6765128 1.5739067 1.7791190

#### Model evaluation

In multiple regression analysis tasks there may be many independent variables, resulting in a high-dimensional model space, so that visualization for the purpose of model diagnostics is very often restricted. In the case of a bivariate data the regression model can be visualized by a line. For $$d$$-dimensional data the regression model results in a $$d$$-dimensional plane, a so-called hyperplane. However, when there are only two explanatory variables, such as in our simple example from above, the hyperplane is just an ordinary plane and we may visualize it by a 3D scatter plot. In R we may apply the scatterplot3d() function of the eponymous scatterplot3d package.

library(scatterplot3d)
scatter.3d <- with(df, scatterplot3d(Girth, Height, Volume,
pch = 16, highlight.3d = TRUE,
angle = 60))
scatter.3d\$plane3d(model2)

By inspecting the 3D scatter plot we see that the point data is very well approximated by the plane. Thus indicating, that the model fits well the data.

Recall that we may access the properties of a lm object by various extractor functions (see section Linear Regression for more details). One of the most convenient functions to summarize the properties of a model object in R is the summary() function.

summary(model2)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = df)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.182027 -0.074603 -0.004944  0.061530  0.240610
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.64166    0.24524  -6.694 2.89e-07 ***
## Girth        5.25283    0.29572  17.763  < 2e-16 ***
## Height       0.03144    0.01212   2.593   0.0149 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1102 on 28 degrees of freedom
## Multiple R-squared:  0.9477, Adjusted R-squared:  0.9439
## F-statistic: 253.5 on 2 and 28 DF,  p-value: < 2.2e-16
• The output of the summary() function starts with the repetition of the function call.

• The next lines give the distribution of the residuals that serve as a quick check of the distributional assumptions. The average of the residuals is zero by definition, so the median should not be far from zero.

• The next lines give the regression coefficients and the intercept, and in addition, for each of them the standard errors, the t-values, and the p-values. The star symbols are graphical indicators of the level of significance. If the p-value is small we conclude that there is a significant relationship between $$\mathbb y$$ and $$x_i$$. The p-values imply that there is a significant linear relationship between Volume and Girth and between Volume and Height in the regression model. Further, it appears that the intercept is significant too. Note that the t-tests only considers the effect of removing one variable and leave in all the others.

• The next line gives the residual standard error, an expression of the variation of the observations around the regression hyperplane.

• The next line gives $$R^2$$, the squared Pearson correlation coefficient, also known as the coefficient of determination and the adjusted $$R^2$$. The $$R^2$$ measure is interpreted as the proportion of total variation that is explained by the regression model. However, by the addition of explanatory variables to the model the value of $$R^2$$ is inflated, regardless of whether or not the added variables are useful with respect to prediction of the response variable (Kerns 2010). In other words the addition of a single explanatory variable to a regression model will increase the value of $$R^2$$, no matter how worthless the explanatory variable is. This issue is addressed by penalizing $$R^2$$ when parameters are added to the model. The result is an adjusted $$R^2$$ which is denoted by

$\text{adjusted }R^2 =\left(R^2-\frac{d}{n-1}\right)\left(\frac{n-1}{n-d-1}\right)\text{,}$ $$\qquad$$ where $$n$$ corresponds to the number of observations and $$d$$ to the number of features in our data set.

• Note that in many cases the values for $$R^2$$ and the adjusted $$R^2$$ will be very close to each other (such as in our example from above). If their values differ substantially, or if one changes dramatically when an explanatory variable is added, then the explanatory variables need to be critically assessed and possibly excluded from the regression model.

• The last line shows the $$F$$-statistic, the number of degrees of freedom and the p-value. When the $$F$$-statistic is large, that is, when the explained variation is large relative to the unexplained variation, it is an indicator that the explanatory variables are useful variables to explain the response variable.

Please note that there exists a very informative blog post by Felipe Rego, who provides a nicely written overview of the output of a linear model in R.

This section was inspired by section 12 of the book Introduction to Probability and Statistics using R by G. Jay Kerns (2010).