In the previous section we introduced three strategies to eliminate seasonal effects:
\[y_t = \beta_1m_{1t}+\beta_2m_{2t}+...+\beta_{12}m_{12t} + w_t \text{,}\]
where \(m_{kt}\) is the [0, 1] indicator for month \(k\).
\[\Delta_{12}(y_t) = y_t -y_{t-12}\]
\[y_t=\beta_0+\beta_1\sin((2\pi/m)t)+\beta_2\cos((2\pi/m)t) +w_t \text{,}\] where \(m\) corresponds to the period.
Let us reload the working data set, the subset of the Keeling Cruve from 1990 to 2015, from the previous section.
library(xts)
load(url("https://userpage.fu-berlin.de/soga/data/r-data/KeelingCurve_1990-2015.RData"))
In order to build a model with indicator variables for the different
months, we first need to generate a factor variable corresponding to the
months in our time series. For this we use the gl()
function. The function gl()
generates factors by specifying
the pattern of their levels. The first number in the gl()
function, \(n\), gives the number of
levels (n = 12
for 12 months in this example). The second
number, \(k\), gives the number of
replications (k = 1
in our case since we have only a single
average for a given month). The length
argument tells R the
length of the time series. The labels
argument is optional.
We use it to label the different months.
library(lubridate)
month_labels <- month(1:12, label = TRUE)
month_factor <- gl(
n = 12,
k = 1,
length = length(co2_1990_2015),
label = month_labels
)
Exercise: Fit a cubic model by using the
lm()
function and addmonth_factor
as explanatory variable.
## Your code here...
t <- seq_along(co2_1990_2015)
seas_fit <- NULL
summary(seas_fit)
t <- seq_along(co2_1990_2015)
seas_fit <- lm(co2_1990_2015 ~ month_factor + t + I(t^2) + I(t^3))
summary(seas_fit)
##
## Call:
## lm(formula = co2_1990_2015 ~ month_factor + t + I(t^2) + I(t^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16631 -0.34900 0.01969 0.27951 1.46393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.536e+02 1.430e-01 2473.738 < 2e-16 ***
## month_factorFeb 6.458e-01 1.355e-01 4.767 2.94e-06 ***
## month_factorMar 1.397e+00 1.355e-01 10.312 < 2e-16 ***
## month_factorApr 2.501e+00 1.355e-01 18.458 < 2e-16 ***
## month_factorMay 2.913e+00 1.355e-01 21.500 < 2e-16 ***
## month_factorJun 2.107e+00 1.355e-01 15.550 < 2e-16 ***
## month_factorJul 3.589e-01 1.355e-01 2.648 0.00852 **
## month_factorAug -1.879e+00 1.355e-01 -13.865 < 2e-16 ***
## month_factorSep -3.553e+00 1.355e-01 -26.210 < 2e-16 ***
## month_factorOct -3.522e+00 1.356e-01 -25.980 < 2e-16 ***
## month_factorNov -2.260e+00 1.356e-01 -16.671 < 2e-16 ***
## month_factorDec -9.487e-01 1.356e-01 -6.996 1.75e-11 ***
## t 9.394e-02 3.098e-03 30.324 < 2e-16 ***
## I(t^2) 3.340e-04 2.298e-05 14.537 < 2e-16 ***
## I(t^3) -4.477e-07 4.827e-08 -9.276 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4885 on 297 degrees of freedom
## Multiple R-squared: 0.9989, Adjusted R-squared: 0.9989
## F-statistic: 1.955e+04 on 14 and 297 DF, p-value: < 2.2e-16
Most of the factor effects are highly significant indicated by small p-values. Let us plot the modeled data and the observed data to assess the model”s goodness of fit.
plot.ts(index(co2_1990_2015),
fitted(seas_fit),
col = "red", type = "l", lwd = 2,
ylab = expression("CO"[2] * " ppm"),
xlab = ""
)
points(index(co2_1990_2015), co2_1990_2015,
cex = 0.35, type = "p", col = "blue"
)
legend("topleft",
legend = c("observed", "modeled"),
col = c("blue", "red"),
lty = c(NA, 1),
pch = c(16, NA),
cex = 0.75
)
Let us plot the residuals from the cubic and periodic model versus time to see if the strong seasonal effect is now gone and the correlogram for CO2 data with trend and period removed.
par(mfrow = c(2, 1))
plot.ts(seas_fit$residuals,
ylab = "residuals",
main = "CO2 Residuals from the Cubic & Indicator Fit",
cex.main = 0.85
)
acf(seas_fit$residuals,
lag.max = 48,
main = NA
)
title("Correlogram for CO2 Data with Trend & Period removed",
cex.main = 0.85
)
The lag axes are in terms of seasons (12 months). From the correlogram it appears that the autocorrelations for the CO2 are positive for successive months up to about 12 months apart.
In R we apply the diff()
function for differencing,
setting the lag
argument to \(12\) results in seasonal differencing
# cubic fit
fit3 <- lm(as.numeric(co2_1990_2015) ~ t + I(t^2) + I(t^3))
# differencing
seas_diff <- diff(fit3$residuals, lag = 12)
par(mfrow = c(2, 1))
plot(seas_diff,
main = "CO2 Residuals from a Cubic trend & Seasonal Differencing",
cex.main = 0.85,
type = "l",
)
acf(na.trim(seas_diff),
lag.max = 48,
main = NA
)
title("Correlogram for CO2 Data with Trend & Period Removed",
cex.main = 0.85
)
The upper plot shows the residuals after linear trend removal with a cubic fit and seasonal differencing \((\Delta^{12}y_t)\). The lower plot shows the correlogram. The lag axes in the correlogram are in terms of seasons (12 months). From the correlogram it appears that the autocorrelations for the CO2 are positive for successive months up to about 8 months apart.
The advantage of fitting a regression model that is periodic, is the need of less parameters compared to the regression model with indicators. Let us fit the periodic model and take a look at the model summary.
t <- seq_along(co2_1990_2015)
x1 <- sin(t * 2 * pi / 12)
x2 <- cos(t * 2 * pi / 12)
co2_periodic <- lm(as.numeric(fit3$residuals) ~ x1 + x2)
summary(co2_periodic)
##
## Call:
## lm(formula = as.numeric(fit3$residuals) ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76416 -0.60746 -0.02468 0.58161 2.28855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.319e-16 4.357e-02 0.00 1
## x1 2.452e+00 6.161e-02 39.80 <2e-16 ***
## x2 -1.597e+00 6.161e-02 -25.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7696 on 309 degrees of freedom
## Multiple R-squared: 0.8796, Adjusted R-squared: 0.8788
## F-statistic: 1128 on 2 and 309 DF, p-value: < 2.2e-16
The regression coefficient estimates are both highly significant. Let us plot the linear trend corrected CO2 data along with the fitted values from the periodic regression model.
plot.ts(index(co2_1990_2015),
fitted(co2_periodic),
col = "red", type = "l", lwd = 2,
ylab = expression("CO"[2] * " ppm"),
xlab = "",
ylim = c(-5, 5)
)
points(index(co2_1990_2015), fit3$residuals,
pch = 16, cex = 0.7, type = "p", col = "blue"
)
legend("topleft",
legend = c("observed", "modeled"),
col = c("blue", "red"),
lty = c(NA, 1),
pch = c(16, NA),
cex = 0.75
)
The plot indicates that the general pattern of seasonality is well reproduced, however, the extreme values are not well fitted.
Let us take a look at the model residuals.
plot(co2_periodic$residuals,
ylab = "residuals",
main = "CO2 Residuals from a Cubic & Periodic Fit",
cex.main = 0.85
)
The plot still shows strong seasonal components, hence indicating that this kind of periodic regression model is not fully capable of modelling the seasonal component in the CO2 data. This is very likely due to missing information about the phase (\(\phi_1\) and \(\phi_2\)).
\[y_t=\beta_0+\beta_1\sin((2t\pi/m)+\phi_1)+\beta_2\cos((2t\pi/m)+\phi_2) +w_t\] We cannot estimate the additional phase information by a linear model, but we may estimate phase information by e.g. Fourier analysis.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.