Regression validation, or regression diagnostics, is a set of procedures used to assess the numerical results of a regression analysis. These procedures include graphical and quantitative analyses as well as formal statistical hypothesis tests. In this section we focus on the first two, graphical and quantitative analysis. Statistical hypothesis tests for regression problems are covered in the section on hypothesis tests.

Coefficient of Determination¶

The Coefficient of Determination, also denoted as $R^2$, is the proportion of variation in the observed values explained by the regression equation. In other words, $R^2$ is a statistical measure of how well the regression line approximates the real data points; thus, it is a measure of the goodness of fit of the model.

The total variation of the response variable $y$ is based on the deviation of each observed value $y_i$ from the mean value $\bar y$. This quantity is called total sum of squares, SST, and is given by

$$SST = \sum (y_i - \bar y)^2\text{.}$$

This total sum of squares (SST) can be decomposed into two parts: the deviation explained by the regression line, $\hat y_i-\bar y$, and the remaining unexplained deviation, $y_i-\hat y_i$. The amount of variation that is explained by the regression is called the sum of squares due to regression, SSR, and is given by

$$SSR = \sum (\hat y_i- \bar y)^2\text{.}$$

The ratio of the sum of squares due to regression (SSR) and the total sum of squares (SST) is called the coefficient of determination and is denoted $R^2$:

$$R^2 = \frac{SSR}{SST}\, .$$

$R^2$ lies between 0 and 1. A value near 0 suggests that the regression equation explains little of the variation in the data. An $R^2$ of 1 indicates that the regression line fits the data perfectly.

Just for the sake of completeness, the variation in the observed values of the response variable that is not explained by the regression is called the sum of squared errors of prediction, SSE, and is given by

$$SSE = \sum (y_i-\hat y_i)^2\text{.} $$

Recall that SSE is the quantity that is minimized to obtain the regression line that best describes the data; this is known as the ordinary least squares (OLS) method. Note that $SST = SSR + SSE$, so the coefficient of determination can equivalently be written as $R^2 = 1 - SSE/SST$.
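To make the decomposition concrete, here is a small self-contained sketch (synthetic toy data, made up for illustration only) that computes SST, SSR and SSE by hand and checks the identities above:

In [ ]:
import numpy as np

# synthetic toy data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(150, 200, 30)                   # e.g. heights
y = -55 + 0.75 * x + rng.normal(0, 5, 30)       # e.g. weights with noise

slope, intercept = np.polyfit(x, y, 1)          # ordinary least squares fit
y_hat = intercept + slope * x                   # fitted values

sst = np.sum((y - y.mean()) ** 2)               # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)           # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)                  # sum of squared residuals

print(np.isclose(sst, ssr + sse))               # SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)                 # both equal R^2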

The statsmodels summary() method¶

A fundamental tool for regression diagnostics in the statsmodels package is the summary() method of the RegressionResults class. Calling OLS() returns an instance of the OLS model class; calling its fit() method returns an instance of the RegressionResults class. This object contains the model results, which can be inspected by applying the summary() method.

For demonstration purposes we set up the same model as in the previous section.

In [186]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
%matplotlib inline 
In [2]:
students = pd.read_csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")
n = 24
students_sample = students.sample(n, random_state=14)
df = students_sample[['height', 'weight']]
x = df['height']
y = df['weight']
x_train = sm.add_constant(x)
model = sm.OLS(y, x_train)
fit = model.fit()
fit.summary()
Out[2]:
OLS Regression Results
Dep. Variable: weight R-squared: 0.891
Model: OLS Adj. R-squared: 0.886
Method: Least Squares F-statistic: 178.9
Date: Thu, 09 Mar 2023 Prob (F-statistic): 4.81e-12
Time: 08:53:19 Log-Likelihood: -54.440
No. Observations: 24 AIC: 112.9
Df Residuals: 22 BIC: 115.2
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -55.1467 9.568 -5.764 0.000 -74.990 -35.304
height 0.7494 0.056 13.376 0.000 0.633 0.866
Omnibus: 3.844 Durbin-Watson: 2.262
Prob(Omnibus): 0.146 Jarque-Bera (JB): 2.935
Skew: 0.855 Prob(JB): 0.230
Kurtosis: 2.900 Cond. No. 3.28e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
  • The left column of the summary() output starts with information about the dependent variable, the model and the estimation method.

  • Below, we find information about the date and time of the analysis, the number of observations, the degrees of freedom of the residuals and of the model, and the covariance type.

  • The right column starts with $R^2$, the squared Pearson correlation coefficient, also known as the coefficient of determination, and the adjusted $R^2$, a measure used for feature selection in regression analyses with multiple predictors (multiple regression).

  • The lines below show the $F$-statistic and its p-value, followed by further measures of model quality (log-likelihood, Akaike information criterion and Bayesian information criterion).

  • The next lines give the intercept (const) and the regression coefficient for height, together with their standard errors, t-values, p-values and confidence intervals.

  • Finally, we find more advanced parameters that we will not touch upon in this course.
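Most of the quantities reported by summary() can also be accessed directly as attributes of the RegressionResults object, which is convenient for further processing. A brief sketch, using the fit object created above:

In [ ]:
# selected attributes of the RegressionResults instance `fit`
print(fit.params)                      # intercept (const) and slope (height)
print(fit.bse)                         # standard errors of the coefficients
print(fit.tvalues, fit.pvalues)        # t-statistics and p-values
print(fit.conf_int())                  # 95% confidence intervals
print(fit.rsquared, fit.rsquared_adj)  # R^2 and adjusted R^2
print(fit.fvalue, fit.f_pvalue)        # F-statistic and its p-value
print(fit.aic, fit.bic)                # information criteria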

Diagnostic plots¶

It is important to realize that running a linear regression analysis with the statsmodels package, or any other statistical software suite, produces a set of numbers, including a p-value, so you can immediately state whether the results are statistically significant (or not). But are we done after reporting the significance of the results?

Consider a very famous data set, known as Anscombe's quartet. Anscombe's quartet consists of four data sets (I, II, III, IV). It is included in the seaborn package and has the following form:

In [3]:
anscombe = sns.load_dataset("anscombe")
anscombe
Out[3]:
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33
5 I 14.0 9.96
6 I 6.0 7.24
7 I 4.0 4.26
8 I 12.0 10.84
9 I 7.0 4.82
10 I 5.0 5.68
11 II 10.0 9.14
12 II 8.0 8.14
13 II 13.0 8.74
14 II 9.0 8.77
15 II 11.0 9.26
16 II 14.0 8.10
17 II 6.0 6.13
18 II 4.0 3.10
19 II 12.0 9.13
20 II 7.0 7.26
21 II 5.0 4.74
22 III 10.0 7.46
23 III 8.0 6.77
24 III 13.0 12.74
25 III 9.0 7.11
26 III 11.0 7.81
27 III 14.0 8.84
28 III 6.0 6.08
29 III 4.0 5.39
30 III 12.0 8.15
31 III 7.0 6.42
32 III 5.0 5.73
33 IV 8.0 6.58
34 IV 8.0 5.76
35 IV 8.0 7.71
36 IV 8.0 8.84
37 IV 8.0 8.47
38 IV 8.0 7.04
39 IV 8.0 5.25
40 IV 19.0 12.50
41 IV 8.0 5.56
42 IV 8.0 7.91
43 IV 8.0 6.89

With the Anscombe data set we can visualize several issues that may arise with regression models. We therefore calculate some descriptive statistics for each of the four $(x, y)$ data sets.

First, we split the data into the four subsets and calculate the mean of x and y for each subset.

In [4]:
ds_i = anscombe.query("dataset == 'I'")
ds_ii = anscombe.query("dataset == 'II'")
ds_iii = anscombe.query("dataset == 'III'")
ds_iv = anscombe.query("dataset == 'IV'")
mean_i = ds_i.mean(numeric_only=True)
mean_ii = ds_ii.mean(numeric_only=True)
mean_iii = ds_iii.mean(numeric_only=True)
mean_iv = ds_iv.mean(numeric_only=True)
print(mean_i['x'], mean_ii['x'], mean_iii['x'], mean_iv['x'])
print(mean_i['y'], mean_ii['y'], mean_iii['y'], mean_iv['y'])
9.0 9.0 9.0 9.0
7.500909090909093 7.50090909090909 7.5 7.500909090909091

The values either match exactly or are very close to each other!

Now we calculate the variance of x and y for each subset.

In [5]:
var_i = ds_i[['x', 'y']].var(ddof=0)    # population variance (ddof=0), as np.var computes
var_ii = ds_ii[['x', 'y']].var(ddof=0)
var_iii = ds_iii[['x', 'y']].var(ddof=0)
var_iv = ds_iv[['x', 'y']].var(ddof=0)
print(var_i['x'], var_ii['x'], var_iii['x'], var_iv['x'])
print(var_i['y'], var_ii['y'], var_iii['y'], var_iv['y'])
10.0 10.0 10.0 10.0
3.7520628099173554 3.752390082644628 3.747836363636364 3.7484082644628103
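As an aside, the same per-group summaries can be obtained more compactly with a pandas groupby; a minimal sketch (note that the built-in var aggregation uses the sample variance, ddof=1, whereas the computation above uses ddof=0):

In [ ]:
# mean and variance of x and y for all four data sets in one call
anscombe.groupby("dataset")[["x", "y"]].agg(["mean", "var"])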
In [6]:
ds_i
Out[6]:
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33
5 I 14.0 9.96
6 I 6.0 7.24
7 I 4.0 4.26
8 I 12.0 10.84
9 I 7.0 4.82
10 I 5.0 5.68

Not exactly the same, but definitely very close to each other. Finally, we build a linear model for each subset using the OLS() function and calculate the model coefficients and $R^2$, the coefficient of determination.

In [7]:
x_train_i = sm.add_constant(ds_i['x'])
x_train_ii = sm.add_constant(ds_ii['x'])
x_train_iii = sm.add_constant(ds_iii['x'])
x_train_iv = sm.add_constant(ds_iv['x'])
model_i = sm.OLS(ds_i['y'], x_train_i).fit()
model_ii = sm.OLS(ds_ii['y'], x_train_ii).fit()
model_iii = sm.OLS(ds_iii['y'], x_train_iii).fit()
model_iv = sm.OLS(ds_iv['y'], x_train_iv).fit()
print(model_i.params, "\n", model_ii.params, "\n", model_iii.params, "\n", model_iv.params)
const    3.000091
x        0.500091
dtype: float64 
 const    3.000909
x        0.500000
dtype: float64 
 const    3.002455
x        0.499727
dtype: float64 
 const    3.001727
x        0.499909
dtype: float64

Amazing! They are nearly the same! And now $R^2$:

In [8]:
print(model_i.rsquared, model_ii.rsquared, model_iii.rsquared, model_iv.rsquared)
0.666542459508775 0.6662420337274844 0.6663240410665593 0.6667072568984652

Wow, what an analysis! We have thrown a lot of different statistical methods at the four data sets, and honestly, they appear very similar to each other.

Are we now done with our analysis? No, not yet! We should always check whether the model actually works well for the data. An easy way to do that is to visualize the data. Let us plot Anscombe's data sets, including the regression lines.

In [9]:
sns.lmplot(
    x="x", 
    y="y", 
    data=anscombe, 
    hue="dataset", 
    col="dataset", 
    col_wrap=2, 
    truncate=False,
    ci=95,
    height=4,
)
Out[9]:
<seaborn.axisgrid.FacetGrid at 0x7ff75ddf36a0>

What a surprise! The main takeaway of this exercise is that we must check in many different ways whether a model works well for the data. We pay attention to regression results such as slope coefficients, p-values or $R^2$, which tell us how well a model represents the given data. However, that is not the whole story. We also need to apply visual diagnostics. Visual inspection helps to evaluate whether the linear regression assumptions are met and to identify outliers, influential observations and so-called [leverage points](https://en.wikipedia.org/wiki/Leverage_(statistics)), which affect the numerical output of the regression analysis.

Residual analysis¶

The residual of an observation is the difference between the observed value and the fitted value, $(y_i-\hat y_i)$. It is what is left over after fitting a model to the data. The sum of squared errors of prediction (SSE), also known as the sum of squared residuals or the error sum of squares, is an indicator of how well a model represents the data.

If the absolute residual, defined for observation $i$ as $e_i = y_i - \hat y_i$, is unusually large, the observation may come from a different population, or there may have been an error in making or recording the observation.

Compare the plots of data sets I and III above. Obviously, one data point in Anscombe's data set III shows an unusually large residual. Such a data point needs special attention, as it influences the regression analysis. There is no overarching rule for how to deal with outliers, but depending on the domain knowledge of the researcher, there are cases where one may decide to exclude such an outlier from the analysis.
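As a quick illustration with the students model fitted above (stored in fit), the residuals can be extracted directly and the largest absolute residuals inspected; the choice to look at the two largest values is arbitrary and only for illustration:

In [ ]:
resid = fit.resid                                        # e_i = y_i - y_hat_i
print(resid.abs().sort_values(ascending=False).head(2))  # largest absolute residuals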

In addition, we may analyse the residuals to check whether the linear regression assumptions are met. Regression residuals should be approximately normally distributed; that is, the regression should explain the structure in the data, and whatever is left over should be just noise, caused by measurement errors or many small uncorrelated factors. The assumptions can be checked graphically by a plot of the residuals against the values of the predictor variable. In such a residual plot, the residuals should be randomly scattered about $0$ and the variation around $0$ should be constant.

Prior to plotting, it is common to standardize the residuals (a short sketch for the students model follows the list below).

If the assumptions for regression inferences are met, the following two conditions should hold (Weiss 2010):

  • A plot of the residuals (residual plot) against the values of the predictor variable should fall roughly in a horizontal band centered and symmetric about the x-axis.

  • A normal probability plot of the residuals should be roughly linear.
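A minimal sketch of both checks for the students model fitted above, using the internally studentized (standardized) residuals provided by statsmodels (x and fit refer to the objects created earlier):

In [ ]:
infl = fit.get_influence()
std_resid = infl.resid_studentized_internal      # standardized residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# residual plot: standardized residuals vs. predictor
ax1.scatter(x, std_resid)
ax1.axhline(0, color="grey", linestyle="--")
ax1.set_xlabel("height")
ax1.set_ylabel("standardized residual")

# normal probability (Q-Q) plot of the standardized residuals
sm.qqplot(std_resid, line="s", ax=ax2)
plt.tight_layout()
plt.show()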

In [10]:
sns.set_theme(style="whitegrid")

rs = np.random.RandomState(7)
n = 200
x = rs.uniform(-100, 100, n)
y_1 = 5 * x + rs.uniform(0, 25, n)
y_2 = x ** 2 + rs.uniform(0, 25**2, n)

def h(x):
    return 100+.5*x**2
# eps = abs(rs.normal(0, h(x), n))

eps = np.arange(0, n)*h(x)

y_3 = 4*abs(x) + rs.normal(25, eps, n)

fig1 = plt.figure(figsize=(12, 3))
sns.residplot(x=x, y=y_1, color="g")
plt.title("no violation")
plt.show()

fig2 = plt.figure(figsize=(12, 3))
sns.residplot(x=x, y=y_2, color="g")
plt.title("violation of linearity")
plt.show()

fig3 = plt.figure(figsize=(12, 3))
sns.residplot(x=abs(x), y=y_3, color="g")
plt.title("violation of constant standard deviation (Heteroscedasticity)")
plt.show()

Only in the uppermost plot are the residuals fairly evenly scattered around zero; in the two lower plots this is not the case, indicating that the linear model assumptions are not fulfilled for those models.

In [11]:
from scipy import stats

def standardize(x):
    return (x - x.mean())/(x.std())

df = pd.DataFrame({
    'x': standardize(x), 
    'y_1': standardize(y_1), 
    'y_2': standardize(y_2), 
    'y_3': standardize(y_3), 
})
df.head(10)
Out[11]:
x y_1 y_2 y_3
0 -1.454424 -1.432784 1.388205 -0.061084
1 1.022916 0.983533 -0.034598 -0.056711
2 -0.179504 -0.158630 -1.061309 -0.059461
3 0.824148 0.862644 -0.437975 -0.101608
4 1.720302 1.728537 2.080581 -0.085717
5 0.172891 0.153483 -1.142410 -0.061537
6 0.041296 0.055769 -0.999634 -0.062519
7 -1.469413 -1.451283 1.447360 -0.099979
8 -0.777952 -0.764790 -0.420411 -0.116076
9 0.036937 0.005165 -1.112621 -0.066166
In [12]:
fig = sm.qqplot(df['y_1'], line ='45')
plt.show()

fig = sm.qqplot(df['y_2'], line ='45')
plt.show()

fig = sm.qqplot(df['y_3'], line ='45')
plt.show()

The normal probability plots, often referred to as Q-Q plots, show that only in the first plot (for y_1) do the data points fall roughly on a straight line. This is not the case for the other two plots, indicating that the linear model assumptions are not fulfilled there.

Outliers and influential points¶

Outliers are points that fall far away from the cloud of data points. Outliers that fall horizontally away from the center of the cloud, without influencing the slope of the regression line, are called leverage points. Outliers that actually influence the slope of the regression line are called influential points; they are usually high-leverage points.

Let us build a toy data set to examine the concept of influential observations:

In [163]:
rs = np.random.RandomState(1)  # set seed for reproducibility

# random data generation
n = 100
beta0 = 5
beta1 = 2.5

x = rs.uniform(0, 10, n)
y = beta0 + beta1 * x + rs.normal(loc = 0, scale = 12, size=n) # add random noise

# generate leverage points
n_lev = int(np.floor(n * 0.05))
x_lev = rs.uniform(0, 8, n_lev)
y_lev = beta0**1.5 + beta1**3 * x_lev + rs.normal(0, 12, n_lev)

# generate influential points
n_inf = int(np.floor(n * 0.02))
x_inf = rs.uniform(10, 15, n_inf)
y_inf = beta0 + beta1**2.5 * x_inf + rs.normal(0, 12, n_inf)

# concatenate data sets
x_lev = np.concatenate([x, x_lev])
y_lev = np.concatenate([y, y_lev])

x_out = np.concatenate([x_lev, x_inf])
y_out = np.concatenate([y_lev, y_inf])

# build the linear model (note: no constant term is added, so the fit has no intercept)
toy_model_out = sm.OLS(y_out, x_out).fit()

fig1, ax1 = plt.subplots(figsize=(12, 6))

sns.regplot(
    x="x", 
    y="y", 
    data=pd.DataFrame.from_dict({'x': x_out, 'y': y_out}),
    truncate=False,
    ax=ax1,
    label="influential points (105 - 106)"
)

sns.regplot(
    x="x", 
    y="y", 
    data=pd.DataFrame.from_dict({'x': x_lev, 'y': y_lev}),
    truncate=False,
    ax=ax1,
    label="leverage points (101 - 104)"
)

sns.regplot(
    x="x", 
    y="y", 
    data=pd.DataFrame.from_dict({'x': x, 'y': y}),
    truncate=False,
    ax=ax1,
    label="base points"
)

ax1.legend()
plt.show()

The figure above clearly shows the impact of the different types of outliers. The green line shows the regression line without outliers. The orange line shows the regression line when the leverage points are included. The blue line shows the regression line when the influential points are added as well, so that all data is included. The largest effect on the slope of the regression line is due to the inclusion of the blue (influential) data points!

The leverage of an observation indicates its potential to move the regression model. Such observations are not necessarily errors, but they should be identified and verified. Leverage is measured by the hat value, which quantifies the overall influence of a single observation on the model predictions (Dalgaard, 2008). It takes values between 0 and 1. A point with zero leverage has no effect on the regression model; the higher the hat value, the higher the influence of that particular point on the regression model. This can be seen in the plot of leverage vs. normalized squared residuals below.

In [168]:
fig1, ax1 = plt.subplots(figsize=(12, 4))
sm.graphics.plot_leverage_resid2(toy_model_out, ax=ax1)
plt.tight_layout()
plt.show()

We can see that even though points 101, 102 and 103 show large deviations from the regression line, they do not have large leverage values. Points 105 and 106, our influential points, on the other hand, show both large deviations from the regression line and large leverage. Thus they have a greater influence on the regression and need to be treated with care.
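The hat values themselves can also be extracted numerically from the influence object; a brief sketch for the toy model above (the cut-off $2p/n$ is a common rule of thumb, not a statsmodels default):

In [ ]:
infl_toy = toy_model_out.get_influence()
hat = infl_toy.hat_matrix_diag                 # leverage (hat value) of each observation

p = len(toy_model_out.params)                  # number of estimated parameters
n_obs = int(toy_model_out.nobs)                # number of observations
print(np.where(hat > 2 * p / n_obs)[0])        # indices of high-leverage points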

Cook’s distance¶

Another method to identify influential outliers is Cook's distance. This measure combines the leverage and the residual of each observation: the higher the leverage and the residual, the higher Cook's distance. Typically, points with a Cook's distance greater than 1 are classified as influential. We show Cook's distance in the plots below.
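For reference, Cook's distance for observation $i$ is commonly defined as

$$D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2}\text{,}$$

where $e_i$ is the residual, $h_{ii}$ the hat value (leverage), $p$ the number of model parameters and $s^2$ the estimated error variance; this is the textbook formulation, not anything specific to statsmodels.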

In [177]:
fig1, [ax1, ax2]  = plt.subplots(2, figsize=(12, 8))
#create instance of influence
influence = toy_model_out.get_influence()
#obtain Cook's distance for each observation
cooks = influence.cooks_distance

ax1.scatter(np.arange(len(cooks[0])), cooks[0])
ax1.set_xlabel('point number')
ax1.set_ylabel('Cooks Distance')
sm.graphics.influence_plot(toy_model_out, criterion="cooks", ax=ax2)
plt.tight_layout()
plt.show()

Both plots identify the most influential points in the model very well.
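To list the flagged observations numerically rather than reading them off the plots, the distances can be compared against a chosen cut-off; a short sketch reusing the cooks tuple computed above (the cut-offs 1 and 4/n are common conventions, not statsmodels defaults):

In [ ]:
d = cooks[0]                        # Cook's distances computed above
print(np.where(d > 1)[0])           # influential by the strict D > 1 rule
print(np.where(d > 4 / len(d))[0])  # influential by the more liberal 4/n rule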

Other useful regression diagnostics¶

Another useful regression diagnostic is the DFFITS measure, which expresses how much an observation affects its associated fitted value. Below we plot the DFFITS value of each observation, together with an influence plot based on DFFITS instead of Cook's distance.

In [211]:
fig1, [ax1, ax2]  = plt.subplots(2, figsize=(12, 8))
influence = toy_model_out.get_influence()
dffits = influence.dffits

ax1.scatter(np.arange(len(dffits[0])), dffits[0])
ax1.set_xlabel('point number')
ax1.set_ylabel('dffits')
sm.graphics.influence_plot(toy_model_out, criterion="DFFITS", ax=ax2)
plt.tight_layout()
plt.show()
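Analogously to Cook's distance, influence.dffits returns a conventional threshold ($2\sqrt{p/n}$) alongside the values, so the flagged observations can be listed directly; a brief sketch reusing the dffits tuple from the cell above:

In [ ]:
dffits_values, dffits_threshold = dffits                       # tuple computed above
print(dffits_threshold)
print(np.where(np.abs(dffits_values) > dffits_threshold)[0])   # flagged observations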

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.