Let us reload the working data set, the subset of the *Keeling Cruve* from 1990 to 2015, from the
previous section.

```
library(xts)
load(url("https://userpage.fu-berlin.de/soga/data/r-data/KeelingCurve_1990-2015.RData"))
```

We apply the *least squares estimation* and fit 4 different
models, a linear \((p=1)\), a quadratic
\((p=2)\), a cubic \((p=3)\) and a quartic \((p=4)\) model, by applying the
`lm()`

function.

Exercise: Fit four polynomial models with \((p=1)\), \((p=2)\), \((p=3)\) and \((p=4)\).

```
## Your code here ...
t <- 1:length(co2_1990_2015)
fit1 <- NULL
fit2 <- NULL
fit3 <- NULL
fit4 <- NULL
```

```
t <- 1:length(co2_1990_2015)
fit1 <- lm(co2_1990_2015 ~ t)
```

`fit2 <- lm(co2_1990_2015 ~ t + I(t^2))`

`fit3 <- lm(co2_1990_2015 ~ t + I(t^2) + I(t^3))`

`fit4 <- lm(co2_1990_2015 ~ t + I(t^2) + I(t^3) + I(t^4))`

As the residuals from the fit yield a time series without the trend, we
evaluate the goodness of the models by plotting the residuals.

```
list_fits <- list(fit1, fit2, fit3, fit4)
power <- c("linear", "quadratic", "cubic", "quartic")
par(mfrow = c(2, 2), mar = c(3, 4, 2, 1))
for (i in 1:length(list_fits)) {
r <- list_fits[[i]]$residuals
plot.ts(r,
ylab = "",
main = paste("CO2 residuals from a", power[i], "fit"),
cex.main = 0.8
)
}
```

The plots of residuals indicate that the linear fit (upper left) did not remove the entire trend in the data. There appears to be some non-linearity to the trend. The residual plot from the quadratic fit (upper right) still showed some structure. The cubic model (lower left) and the quartic mode (lower right) seem to be a good fit to the data.

Following Ockham”s razor we pick the cubic model for further
analysis. A summary of the cubic model is provided by calling
`summary(fit3)`

.

`summary(fit3)`

```
##
## Call:
## lm(formula = co2_1990_2015 ~ t + I(t^2) + I(t^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2318 -1.9402 0.2885 1.8282 3.9270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.538e+02 5.091e-01 694.928 < 2e-16 ***
## t 8.488e-02 1.406e-02 6.036 4.54e-09 ***
## I(t^2) 4.017e-04 1.043e-04 3.851 0.000143 ***
## I(t^3) -5.919e-07 2.191e-07 -2.702 0.007277 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.221 on 308 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9765
## F-statistic: 4314 on 3 and 308 DF, p-value: < 2.2e-16
```

The *p*-values indicate that the estimated coefficients of the
cubic fit are statistical significant.

Alternatively, to least squares estimation we may apply
*differencing*. For the purpose of demonstration we apply first
order \((\Delta_{y_t})\), second order
\((\Delta_{y_t}^2)\), third order \((\Delta_{y_t}^3)\) and fourth order \((\Delta_{y_t}^4)\) differencing.

To difference a series, \(\Delta_{y_t}=y_t-y_{t-1}\), in R we apply
the `diff()`

function. Use `diff(y)`

for \((\Delta_{y_t})\),
`diff(diff(y))`

for \((\Delta_{y_t}^2)\),
`diff(diff(diff(y)))`

for \((\Delta_{y_t}^3)\), and so on for higher
order differencing.

Exercise: Apply \((\Delta_{y_t})\), \((\Delta_{y_t}^2)\),\((\Delta_{y_t}^3)\),\((\Delta_{y_t}^4)\) differencing to the`co2_1990_2015`

time series.

```
## Your code here ...
diff1 <- NULL
diff2 <- NULL
diff3 <- NULL
diff4 <- NULL
```

`diff1 <- diff(co2_1990_2015)`

`diff2 <- diff(diff(co2_1990_2015))`

`diff3 <- diff(diff(diff(co2_1990_2015)))`

`diff4 <- diff(diff(diff(diff(co2_1990_2015))))`

Let us plot the results:

```
list_diffs <- list(diff1, diff2, diff3, diff4)
diff_level <- c(1:4)
par(mfrow = c(2, 2), mar = c(3, 4, 2, 1))
for (i in 1:length(list_diffs)) {
diff <- list_diffs[[i]]
plot.ts(diff,
ylab = "",
main = substitute(paste("Differencing: ", Delta^l, y[t]), list(l = diff_level[i])),
cex.main = 0.8
)
}
```

The plot indicates that the first order differencing already did a
quite good job in removing the linear trend from the data. However, in
order to base our decision on a statistical quantity, we make use of the
**Kwiatkowski-Phillips-Schmidt-Shin
(KPSS)** test.

The null hypothesis in a KPSS test is that a time series is
stationary around a deterministic trend (*trend-stationary*). The
alternative hypothesis is that a unit
root is present, which means that the process is non-stationary. In
the presence of a shock, trend-stationary processes are mean-reverting.
This means that the time series will converge again towards the mean,
while shocks have a permanent impact on the mean of unit-root processes
(i.e. no convergence over time).

**Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test**

\(H_0: \quad \text{The time series is
trend-stationary}\)

\(H_A: \quad \text{The time series is not
trend-stationary}\)

We use the `kpss.test()`

function from the
`tseries`

package. Please note that if the computed
*p*-value is \(0.01 \le p \le
0.1\), then a warning message is generated. Type
`help(kpss.test)`

into your console for further
information.

`library(tseries)`

We apply the KPSS test on both, the original and the differenced data sets.

`kpss.test(co2_1990_2015, null = "Trend")`

```
## Warning in kpss.test(co2_1990_2015, null = "Trend"): p-value smaller than
## printed p-value
```

```
##
## KPSS Test for Trend Stationarity
##
## data: co2_1990_2015
## KPSS Trend = 0.34532, Truncation lag parameter = 5, p-value = 0.01
```

The *p*-value of the KPSS test on the original data set
(`co2_1990_2015`

) is \(p<0.01\). Hence we would reject \(H_0\) in favor of \(H_A\); the original time series is
**not** trend-stationary.

`kpss.test(diff1, null = "Trend")`

```
## Warning in kpss.test(diff1, null = "Trend"): p-value greater than printed
## p-value
```

```
##
## KPSS Test for Trend Stationarity
##
## data: diff1
## KPSS Trend = 0.002258, Truncation lag parameter = 5, p-value = 0.1
```

The *p*-value of the KPSS test on the differenced data set
(`diff1`

) is \(p>0.1\).
Hence we would not reject \(H_0\) in
favor of \(H_A\); the differenced time
series is trend-stationary.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*