Regression validation, or regression diagnostics, is a set of procedures applied to assess the numerical results of a regression analysis. These procedures include graphical and quantitative analyses as well as formal statistical hypothesis tests. In this section we focus on the first two, graphical and quantitative analysis; statistical hypothesis tests for regression problems are covered in the section on hypothesis tests.
The coefficient of determination, denoted \(R^2\), is the proportion of the variation in the observed values that is explained by the regression equation. In other words, \(R^2\) is a statistical measure of how well the regression line approximates the real data points; it is thus a measure of the goodness of fit of the model.
The total variation of the response variable \(y\) is based on the deviation of each observed value \(y_i\) from the mean value \(\bar y\). This quantity is called total sum of squares, SST, and is given by
\[SST = \sum_{i=1}^n (y_i - \bar y)^2\text{.}\]
The total sum of squares (SST) can be decomposed into two parts: the deviation explained by the regression line (\(\hat y_i-\bar y\)) and the remaining unexplained deviation (\(y_i-\hat y_i\)). Consequently, the amount of variation which is explained by the regression is called the sum of squares due to regression, SSR, and is given by
\[SSR = \sum_{i=1}^n (\hat y_i- \bar y)^2\text{.}\]
The ratio of the sum of squares due to regression (SSR) and the total sum of squares (SST) is called the coefficient of determination and is denoted \(R^2\):
\[R^2 = \frac{SSR}{SST}\text{.}\]
\(R^2\) lies between 0 and 1. A value close to 0 suggests that the regression equation explains the data poorly, whereas an \(R^2\) of 1 indicates that the regression line fits the data perfectly.
The variation in the observed values of the response variable not explained by the regression is called sum of squared errors of prediction, SSE, and is given by
\[SSE = \sum_{i=1}^n (y_i-\hat y_i)^2\text{.} \]
Recall that the SSE is the quantity minimized to obtain the regression line that best describes the data; this is known as the method of ordinary least squares (OLS).
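For illustration, these quantities can be computed by hand from any fitted model. The following minimal sketch uses R's built-in cars data set (the variable names are ours) and verifies that \(R^2 = SSR/SST = 1 - SSE/SST\) matches the value reported by summary():
# minimal sketch: computing SST, SSR, SSE and R^2 by hand (built-in cars data set)
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist
y_hat <- fitted(fit)
SST <- sum((y - mean(y))^2)      # total sum of squares
SSR <- sum((y_hat - mean(y))^2)  # sum of squares due to regression
SSE <- sum((y - y_hat)^2)        # sum of squared errors of prediction
SSR / SST                        # coefficient of determination, R^2
1 - SSE / SST                    # equivalent, since SST = SSR + SSE
summary(fit)$r.squared           # the value reported by summary()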
A fundamental method for regression diagnostics in R is the summary() function. The lm() function returns a model object; this lm object contains the model properties, which can be inspected by applying the summary() function.
For demonstration purposes we set up the same model as in the previous section:
# load data
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
# take a random sample
n <- 12
sample_idx <- sample(1:nrow(students), n)
data <- students[sample_idx, c("height", "weight")]
model <- lm(weight ~ height, data = data)
summary(model)
##
## Call:
## lm(formula = weight ~ height, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5120 -1.4672 -0.5516 0.8700 5.7636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -74.80999 14.64261 -5.109 0.000458 ***
## height 0.86036 0.08372 10.277 1.24e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.614 on 10 degrees of freedom
## Multiple R-squared: 0.9135, Adjusted R-squared: 0.9049
## F-statistic: 105.6 on 1 and 10 DF, p-value: 1.237e-06
The output of the summary() function starts with a repetition of the function call.
The next lines give a view of the distribution of the residuals, which serves as a quick check of the distributional assumptions. The average of the residuals is zero by definition, so the median should not be far from zero.
The next lines give the intercept and the regression coefficient. In addition, for each of them the standard error, the t-value and the p-value are given. The star symbols are graphical indicators of the level of significance.
The next line gives the residual standard error, an expression of the variation of the observations around the regression line.
The next line gives \(R^2\), the squared Pearson correlation coefficient, also known as the coefficient of determination, and the adjusted \(R^2\), a statistical measure that may be used for feature selection in regression analyses with multiple predictors (multiple regression).
The last line shows the \(F\)-statistic, the number of degrees of freedom and the p-value.
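All of these quantities can also be extracted programmatically from the object returned by summary(); a minimal sketch (the components shown are standard elements of a summary.lm object):
s <- summary(model)
s$coefficients   # estimates, standard errors, t values and p-values
s$sigma          # residual standard error
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F-statistic and its degrees of freedom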
It is important to realize that you may run a linear regression analysis in R, or any other statistical software suite, and obtain a bunch of numbers, including a p-value, that allow you to state immediately whether the results are significant (or not). But are we really done after reporting the significance of the results?
Consider a very famous data set known as Anscombe’s quartet. It consists of four data sets and has the following form:
x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
---|---|---|---|---|---|---|---|
10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 |
6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 19 | 12.5 |
12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |
Anscombe’s quartet is so popular that the data set is included in R. You may load it into your workspace by typing data(anscombe). Now we calculate some descriptive statistical measures for each of the four \((x,y)\) pairs. First, we calculate the mean of each particular \(x\) and \(y\) in the data set:
data(anscombe)
# assign data to variables
x1 <- anscombe[, "x1"]
x2 <- anscombe[, "x2"]
x3 <- anscombe[, "x3"]
x4 <- anscombe[, "x4"]
y1 <- anscombe[, "y1"]
y2 <- anscombe[, "y2"]
y3 <- anscombe[, "y3"]
y4 <- anscombe[, "y4"]
# mean for x's
cbind(mean(x1), mean(x2), mean(x3), mean(x4))
## [,1] [,2] [,3] [,4]
## [1,] 9 9 9 9
# mean for y's
cbind(mean(y1), mean(y2), mean(y3), mean(y4))
## [,1] [,2] [,3] [,4]
## [1,] 7.500909 7.500909 7.5 7.500909
The values either match up perfectly or are very close to each other!
Next, we calculate the covariance of each \((x,y)\) pair:
# covariance for each (x,y) pair
cbind(var(x1, y1), var(x2, y2), var(x3, y3), var(x4, y4))
## [,1] [,2] [,3] [,4]
## [1,] 5.501 5.5 5.497 5.499
Not exactly the same, but definitely very close to each other.
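As a side note, the same pairwise summaries can be computed more compactly by looping over the four pairs; a minimal sketch:
# covariance of each (x, y) pair in one call
sapply(1:4, function(i) cov(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))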
Finally, we build a linear model for each subset using the lm() function and calculate the model coefficients and \(R^2\), the coefficient of determination:
model1 <- lm(y1 ~ x1)
model2 <- lm(y2 ~ x2)
model3 <- lm(y3 ~ x3)
model4 <- lm(y4 ~ x4)
rbind(coef(model1), coef(model2), coef(model3), coef(model4))
## (Intercept) x1
## [1,] 3.000091 0.5000909
## [2,] 3.000909 0.5000000
## [3,] 3.002455 0.4997273
## [4,] 3.001727 0.4999091
Amazing! They are nearly the same! And now \(R^2\):
rbind(
summary(model1)$r.squared,
summary(model2)$r.squared,
summary(model3)$r.squared,
summary(model4)$r.squared
)
## [,1]
## [1,] 0.6665425
## [2,] 0.6662420
## [3,] 0.6663240
## [4,] 0.6667073
Wow, what an analysis! We have thrown a lot of different statistical methods at Anscombe’s quartet and, honestly, the four data sets appear to be very similar to each other.
Are we now done with our analysis? No, not yet! No matter what, we should always check whether the model works well for the data. An easy way to do that is to visualize the data. Let us plot Anscombe’s data sets, including the regression lines:
par(mfrow = c(2, 2), mar = c(3, 2, 2, 2))
xlim <- c(0, 20)
ylim <- c(0, 15)
plot(x1, y1, xlim = xlim, ylim = ylim)
abline(model1, col = "red")
title("Anscombe's data set #1", cex.main = 0.8)
plot(x2, y2, xlim = xlim, ylim = ylim)
abline(model2, col = "red")
title("Anscombe's data set #2", cex.main = 0.8)
plot(x3, y3, xlim = xlim, ylim = ylim)
abline(model3, col = "red")
title("Anscombe's data set #3", cex.main = 0.8)
plot(x4, y4, xlim = xlim, ylim = ylim)
abline(model4, col = "red")
title("Anscombe's data set #4", cex.main = 0.8)
What a surprise! The main takeaway of this exercise is that we must check in many different ways whether a model works well for the given data. We pay attention to regression results, such as slope coefficients, p-values or \(R^2\), which tell us how well a model represents the given data. However, that is not the whole story. We also need to apply visual diagnostics. Visual inspection helps us to evaluate whether the linear regression assumptions are met and to identify outliers, influential observations and so-called leverage points, which affect the numerical output of the regression analysis.
The residual of an observed value is the difference between the observed value and the estimated value, \(y_i-\hat y_i\). It is what is left over after fitting a model to the data. The sum of squared errors of prediction (SSE), also known as the sum of squared residuals or the error sum of squares, is an indicator of how well a model represents the data.
If the residual of an observation \(x_i\), defined as \(\epsilon_i = y_i - \hat y_i\), is unusually large in absolute value, it may be that the observation comes from a different population or that there was some error in making or recording the observation.
Look at the plots above: one data point in Anscombe’s data set #3 clearly shows an unusually large residual. Such a data point needs special attention, as it influences the regression analysis. There is no overarching rule on how to deal with outliers, but depending on the domain knowledge of the researcher there are cases in which one might decide to exclude such an outlier from the analysis.
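Such a conspicuous observation can also be located programmatically from the residuals of the fitted model; a minimal sketch using the third Anscombe model fitted above:
res3 <- residuals(model3)
which.max(abs(res3))  # index of the observation with the largest absolute residual
max(abs(res3))        # magnitude of that residual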
In addition, we may analyze the residuals to check whether the linear regression assumptions are met. Regression residuals should be approximately normally distributed; that is, the regression should explain the structure in the data, and whatever is left over should be just noise, caused by measurement errors or many small, uncorrelated factors. The behavior of the residuals can be checked graphically by a plot of the residuals against the values of the predictor variable. In such a residual plot, the residuals should be randomly scattered around \(0\) and the variation around \(0\) should be constant.
Prior to plotting residuals it is common to standardize them. R provides the rstandard() function to standardize residuals and the rstudent() function to studentize them.
If the assumptions for regression inferences are met, the following two conditions should hold (Weiss, 2010):
A plot of the residuals (residual plot) against the values of the predictor variable should fall roughly in a horizontal band centered and symmetric about the x-axis.
A normal probability plot of the residuals should be roughly linear.
Only in the uppermost plot are the residuals fairly evenly distributed around zero. In the two lower plots this is not the case, indicating that the linear model assumptions are not fulfilled for those models.
The normal probability plots, often referred to as Q-Q plots, show that only in the left plot do the data points fall roughly on a straight line. This is not the case for the other plots, again indicating that the linear model assumptions are not fulfilled.
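As an illustration, residual and normal probability plots of this kind can be produced for the Anscombe models fitted above; the following is a minimal sketch (the choice of model and the panel layout are ours):
par(mfrow = c(1, 2))
# residual plot: standardized residuals against the predictor
plot(x1, rstandard(model1), xlab = "x1", ylab = "standardized residuals",
     main = "Residual plot, data set #1", cex.main = 0.8)
abline(h = 0, lty = 2)
# normal probability (Q-Q) plot of the standardized residuals
qqnorm(rstandard(model1), main = "Q-Q plot, data set #1", cex.main = 0.8)
qqline(rstandard(model1))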
Outliers are points that fall away from the cloud of data points. Outliers that fall horizontally away from the center of the cloud, without influencing the slope of the regression line, are called leverage points. Outliers that actually influence the slope of the regression line are called influential points; they are usually high-leverage points.
Let us build a toy data set to examine the concept of influential observations:
set.seed(110) # set seed for reproducibility
# random data generation
n <- 100
beta0 <- 5
beta1 <- 2.5
x <- runif(n, 0, 10)
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = 12) # add random noise
# generate leverage points
n_lev <- floor(n * 0.05)
x_lev <- runif(n_lev, 0, 8)
y_lev <- beta0^1.5 + beta1^3 * x_lev + rnorm(n_lev, mean = 0, sd = 12) # add random noise
# generate influential points
n_inf <- floor(n * 0.02)
x_inf <- runif(n_inf, 10, 15)
y_inf <- beta0 + beta1^2.5 * x_inf + rnorm(n_inf, mean = 0, sd = 12) # add random noise
# concatenate data sets
x_out <- c(x, x_inf, x_lev)
y_out <- c(y, y_inf, y_lev)
# build linear models
toy_model <- lm(y ~ x)
toy_model_lev <- lm(c(y, y_lev) ~ c(x, x_lev))
toy_model_inf <- lm(c(y, y_inf) ~ c(x, x_inf))
toy_model_out <- lm(y_out ~ x_out)
# plot the data and the regression line
plot(x, y, xlim = c(0, ceiling(max(x_out))), ylim = c(0, ceiling(max(y_out)))) # plot cloud data
points(x_lev, y_lev, col = "blue", pch = 16) # plot leverage points
points(x_inf, y_inf, col = "green", pch = 16) # plot influential points
abline(toy_model_out, col = "red", lwd = 2) # regression line, all data
abline(toy_model_lev, col = "blue", lty = 2, lwd = 2) # regression line, data+leverage points
abline(toy_model_inf, col = "green", lty = 2, lwd = 2) # regression line, data+influential points
abline(toy_model, col = "black", lty = 2, lwd = 2) # regression line, data
title("Toy data set")
# add legend
legend("topleft",
legend = c(
"data points",
"leverage points",
"influential points",
"regression line"
),
pch = c(1, 16, 16, NA),
col = c(1, "blue", "green", "red"),
lty = c(NA, NA, NA, 1),
cex = 0.8
)
The figure above clearly shows the impact of different types of outliers. The black dashed line shows the regression line without outliers. The blue dashed line shows the regression line, when the blue leverage points are included. The green dashed line shows the regression line, when the green influential points are included. The red line shows the final regression line, when all data is included. Obviously, the largest effect on the slope of the regression line is due to the green data points!
The leverage of an observation indicates its ability to move the regression model. Such observations are not necessarily erroneous, but they should be identified and verified. Leverage is measured by the hat value, which quantifies the overall influence of a single observation on the model predictions (Dalgaard, 2008). The hat value takes values between 0 and 1: a point with zero leverage has no effect on the regression model, and the higher the hat value, the higher the influence of that particular point on the regression model.
In R the leverage is calculated with the hatvalues() function, which returns the \(h\)-value for each particular data point. Values larger than about three times the average leverage, \(\bar h=(k+1)/n\), where \(k\) is the number of predictors in the model and \(n\) is the number of data points, are assumed to be influential.
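For the toy model above, with one predictor, the average leverage is therefore \(\bar h = 2/n\); this is easy to verify (a minimal sketch):
mean(hatvalues(toy_model_out))  # average leverage of the fitted toy model
2 / length(x_out)               # (k + 1)/n with k = 1 predictor; should give the same value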
Let us examine the toy data set from above with the hatvalues() function: we find the most influential points and plot the hat values, \(h\), against the observations.
h <- hatvalues(toy_model_out)
hi_lev <- which(h > 3 * mean(h)) # points with more than three times the average leverage
plot(x_out, h)
abline(v = mean(x_out), lty = 2)
abline(h = (1:3) * mean(h), col = c(1, 4, 2)) # 1x, 2x and 3x the average leverage
points(x_out[hi_lev], h[hi_lev], pch = 16, col = 2) # highlight high-leverage points
The most influential points are clearly identified.
Another method to capture influential outliers is Cook’s distance. This measure combines the leverage and the residual of each particular observation: the higher the leverage and the residual, the higher Cook’s distance. Typically, points with a Cook’s distance greater than 1 are classified as influential. In R we calculate Cook’s distance with the cooks.distance() function.
cook <- cooks.distance(toy_model_out)
plot(cook, ylab = "Cook's distance")
The plot identifies the most influential points in the model very clearly.
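Following the rule of thumb mentioned above, candidate influential points can also be flagged directly; a minimal sketch (depending on the data, the first result may be empty):
which(cook > 1)                      # observations exceeding the common cutoff of 1
head(sort(cook, decreasing = TRUE))  # the largest Cook's distances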
Other useful methods for regression diagnostics are the dffits() function, which expresses how much an observation affects the associated fitted value, and the dfbetas() function, which gives the change in the estimated parameters, relative to their standard errors, if an observation is excluded. The dfbetas() function returns a matrix object; hence we apply the matplot() function to plot the output. The solid line corresponds to the effect on \(a\), the intercept, and the dashed line corresponds to the effect on \(b\), the regression coefficient (Dalgaard, 2008).
par(mfrow = c(1, 2))
plot(dffits(toy_model_out), type = "l")
matplot(dfbetas(toy_model_out), type = "l", main = "")
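As with the other diagnostics, rule-of-thumb cutoffs are sometimes applied to these measures as well; the following minimal sketch uses conventional thresholds for DFFITS and DFBETAS (the specific cutoff values are common conventions, not fixed rules):
n_obs <- length(x_out)
# |DFFITS| above 2 * sqrt((k + 1)/n), here with k = 1 predictor
which(abs(dffits(toy_model_out)) > 2 * sqrt(2 / n_obs))
# |DFBETAS| above 2 / sqrt(n), reported per coefficient
which(abs(dfbetas(toy_model_out)) > 2 / sqrt(n_obs), arr.ind = TRUE)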
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.