Least squares estimation

Let us reload the working data set, the subset of the Keeling Curve from 1990 to 2015, from the previous section.

library(xts)
load(url("https://userpage.fu-berlin.de/soga/data/r-data/KeelingCurve_1990-2015.RData"))

We apply least squares estimation and fit four different models, a linear \((p=1)\), a quadratic \((p=2)\), a cubic \((p=3)\) and a quartic \((p=4)\) model, using the lm() function.

Exercise: Fit four polynomial models with \((p=1)\), \((p=2)\), \((p=3)\) and \((p=4)\).

## Your code here ...
t <- 1:length(co2_1990_2015)
fit1 <- NULL
fit2 <- NULL
fit3 <- NULL
fit4 <- NULL
Show code
t <- 1:length(co2_1990_2015)
fit1 <- lm(co2_1990_2015 ~ t)
fit2 <- lm(co2_1990_2015 ~ t + I(t^2))
fit3 <- lm(co2_1990_2015 ~ t + I(t^2) + I(t^3))
fit4 <- lm(co2_1990_2015 ~ t + I(t^2) + I(t^3) + I(t^4))
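
As a side note, the same models can be specified more compactly with the poly() function; with raw = TRUE the plain powers t, t^2, ... enter the design matrix, so the fit is equivalent to the one obtained with the I() terms above.

# Equivalent, more compact specification of the quartic model
fit4_poly <- lm(co2_1990_2015 ~ poly(t, 4, raw = TRUE))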


Since the residuals of a well-fitting model form a time series without the trend, we evaluate the goodness of fit of the models by plotting their residuals.

list_fits <- list(fit1, fit2, fit3, fit4)
power <- c("linear", "quadratic", "cubic", "quartic")

par(mfrow = c(2, 2), mar = c(3, 4, 2, 1))
for (i in 1:length(list_fits)) {
  r <- list_fits[[i]]$residuals
  plot.ts(r,
    ylab = "",
    main = paste("CO2 residuals from a", power[i], "fit"),
    cex.main = 0.8
  )
}

The residual plots indicate that the linear fit (upper left) does not remove the entire trend from the data; there appears to be some non-linearity in the trend. The residual plot of the quadratic fit (upper right) still shows some structure. The cubic model (lower left) and the quartic model (lower right) seem to fit the data well.
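
Besides visual inspection we may also compare the four candidate models numerically, for example via the Akaike information criterion, where smaller values indicate a better trade-off between goodness of fit and model complexity. A minimal sketch (results not shown here):

# Compare the four fits by AIC and by adjusted R-squared
sapply(list_fits, AIC)
sapply(list_fits, function(m) summary(m)$adj.r.squared)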

Following Ockham's razor we pick the cubic model for further analysis. A summary of the cubic model is provided by calling summary(fit3).

summary(fit3)
## 
## Call:
## lm(formula = co2_1990_2015 ~ t + I(t^2) + I(t^3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2318 -1.9402  0.2885  1.8282  3.9270 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.538e+02  5.091e-01 694.928  < 2e-16 ***
## t            8.488e-02  1.406e-02   6.036 4.54e-09 ***
## I(t^2)       4.017e-04  1.043e-04   3.851 0.000143 ***
## I(t^3)      -5.919e-07  2.191e-07  -2.702 0.007277 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.221 on 308 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9765 
## F-statistic:  4314 on 3 and 308 DF,  p-value: < 2.2e-16

The p-values indicate that the estimated coefficients of the cubic fit are statistically significant.
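
To see how well the cubic trend captures the long-term behaviour of the series, we may overlay the fitted values on the observations. A minimal sketch (axis labels chosen for illustration):

# Plot the observed series and overlay the fitted cubic trend
plot.ts(as.numeric(co2_1990_2015),
  ylab = "CO2 [ppm]",
  main = "Keeling Curve with fitted cubic trend",
  cex.main = 0.8
)
lines(t, fitted(fit3), col = "red", lwd = 2)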


Differencing

As an alternative to least squares estimation we may apply differencing. For the purpose of demonstration we apply first order \((\Delta y_t)\), second order \((\Delta^2 y_t)\), third order \((\Delta^3 y_t)\) and fourth order \((\Delta^4 y_t)\) differencing.

To difference a series, \(\Delta y_t = y_t - y_{t-1}\), in R we apply the diff() function. Use diff(y) for \(\Delta y_t\), diff(diff(y)) for \(\Delta^2 y_t\), diff(diff(diff(y))) for \(\Delta^3 y_t\), and so on for higher order differencing.

Exercise: Apply \((\Delta y_t)\), \((\Delta^2 y_t)\), \((\Delta^3 y_t)\) and \((\Delta^4 y_t)\) differencing to the co2_1990_2015 time series.

## Your code here ...
diff1 <- NULL
diff2 <- NULL
diff3 <- NULL
diff4 <- NULL
Show code
diff1 <- diff(co2_1990_2015)
diff2 <- diff(diff(co2_1990_2015))
diff3 <- diff(diff(diff(co2_1990_2015)))
diff4 <- diff(diff(diff(diff(co2_1990_2015))))
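
Instead of nesting diff() calls, the order of differencing can also be passed via the differences argument of diff(); the following sketch should reproduce diff2 from above.

# Second-order differencing in a single call, equivalent to diff(diff(...))
diff2_alt <- diff(co2_1990_2015, differences = 2)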

Let us plot the results:

list_diffs <- list(diff1, diff2, diff3, diff4)
diff_level <- c(1:4)
par(mfrow = c(2, 2), mar = c(3, 4, 2, 1))

for (i in 1:length(list_diffs)) {
  diff <- list_diffs[[i]]
  plot.ts(diff,
    ylab = "",
    main = substitute(paste("Differencing: ", Delta^l, y[t]), list(l = diff_level[i])),
    cex.main = 0.8
  )
}

The plots indicate that first order differencing already does a good job of removing the trend from the data. However, in order to base our decision on a statistical quantity, we make use of the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.

The null hypothesis of the KPSS test is that the time series is stationary around a deterministic trend (trend-stationary). The alternative hypothesis is that a unit root is present, i.e. that the process is non-stationary. After a shock, a trend-stationary process is mean-reverting: the series converges back towards its deterministic trend. For a unit-root process, in contrast, shocks have a permanent impact on the level of the series (no convergence over time).
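
To make this distinction concrete, the following sketch (purely illustrative, with simulated data) compares a trend-stationary series, whose deviations from a deterministic line die out, with a random walk, in which shocks accumulate permanently.

# Simulated examples: trend-stationary process vs. unit-root process
set.seed(1)
n <- 300
trend_stat <- 0.05 * (1:n) + arima.sim(model = list(ar = 0.5), n = n) # stationary AR(1) noise around a line
random_walk <- cumsum(rnorm(n)) # accumulated shocks: a unit-root process

par(mfrow = c(1, 2), mar = c(3, 4, 2, 1))
plot.ts(trend_stat, ylab = "", main = "Trend-stationary", cex.main = 0.8)
plot.ts(random_walk, ylab = "", main = "Random walk (unit root)", cex.main = 0.8)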

Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test

\(H_0: \quad \text{The time series is trend-stationary}\)
\(H_A: \quad \text{The time series is not trend-stationary}\)

We use the kpss.test() function from the tseries package. Please note that kpss.test() interpolates p-values from a table that only covers the range \(0.01 \le p \le 0.1\); if the computed statistic falls outside that table, a warning message is generated and the reported p-value is truncated to 0.01 or 0.1, respectively. Type help(kpss.test) into your console for further information.

library(tseries)

We apply the KPSS test to both the original and the first-order differenced data set.

kpss.test(co2_1990_2015, null = "Trend")
## Warning in kpss.test(co2_1990_2015, null = "Trend"): p-value smaller than
## printed p-value
## 
##  KPSS Test for Trend Stationarity
## 
## data:  co2_1990_2015
## KPSS Trend = 0.34532, Truncation lag parameter = 5, p-value = 0.01

The p-value of the KPSS test on the original data set (co2_1990_2015) is \(p<0.01\). Hence we would reject \(H_0\) in favor of \(H_A\); the original time series is not trend-stationary.

kpss.test(diff1, null = "Trend")
## Warning in kpss.test(diff1, null = "Trend"): p-value greater than printed
## p-value
## 
##  KPSS Test for Trend Stationarity
## 
## data:  diff1
## KPSS Trend = 0.002258, Truncation lag parameter = 5, p-value = 0.1

The p-value of the KPSS test on the differenced data set (diff1) is \(p>0.1\). Hence we would not reject \(H_0\) in favor of \(H_A\); the differenced time series is trend-stationary.
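
In the same way one could run the KPSS test on the higher-order differences as well. A compact sketch (warnings suppressed for readability, results not shown here) collects the p-values for all four differenced series:

# KPSS test for each differenced series; extract the p-values
# na.omit() drops any leading NAs that diff() may have introduced
sapply(list_diffs, function(d) {
  suppressWarnings(kpss.test(na.omit(d), null = "Trend"))$p.value
})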


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.