307083_Eliminating_the_seasonal

In the previous section we introduced three strategies to eliminate seasonal effects:

Fitting a linear model with indicator variables for the different months

\[y_t = \beta_1m_{1t}+\beta_2m_{2t}+...+\beta_{12}m_{12t} + w_t \text{,}\]

where \(m_{kt}\) is the [0, 1] indicator for month \(k\).

Seasonal differencing

\[\Delta_{12}(y_t) = y_t -y_{t-12}\]

Fitting a periodic function

\[y_t=\beta_0+\beta_1\sin((2\pi/m)t)+\beta_2\cos((2\pi/m)t) +w_t \text{,}\] where \(m\) corresponds to the period.

Fitting a linear model with indicator

Let us reload the working data set, the subset of the Keeling Cruve from 1990 to 2015, from the previous section.

library(xts)
load(url("https://userpage.fu-berlin.de/soga/data/r-data/KeelingCurve_1990-2015.RData"))

In order to build a model with indicator variables for the different months, we first need to generate a factor variable corresponding to the months in our time series. For this we use the gl() function. The function gl() generates factors by specifying the pattern of their levels. The first number in the gl() function, \(n\), gives the number of levels (n = 12 for 12 months in this example). The second number, \(k\), gives the number of replications (k = 1 in our case since we have only a single average for a given month). The length argument tells R the length of the time series. The labels argument is optional. We use it to label the different months.

library(lubridate)
month_labels <- month(1:12, label = TRUE)
month_factor <- gl(
  n = 12,
  k = 1,
  length = length(co2_1990_2015),
  label = month_labels
)

Exercise: Fit a cubic model by using the lm() function and add month_factor as explanatory variable.

## Your code here...
t <- seq_along(co2_1990_2015)
seas_fit <- NULL
summary(seas_fit)

Show code

t <- seq_along(co2_1990_2015)
seas_fit <- lm(co2_1990_2015 ~ month_factor + t + I(t^2) + I(t^3))
summary(seas_fit)

## 
## Call:
## lm(formula = co2_1990_2015 ~ month_factor + t + I(t^2) + I(t^3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16631 -0.34900  0.01969  0.27951  1.46393 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      3.536e+02  1.430e-01 2473.738  < 2e-16 ***
## month_factorFeb  6.458e-01  1.355e-01    4.767 2.94e-06 ***
## month_factorMar  1.397e+00  1.355e-01   10.312  < 2e-16 ***
## month_factorApr  2.501e+00  1.355e-01   18.458  < 2e-16 ***
## month_factorMay  2.913e+00  1.355e-01   21.500  < 2e-16 ***
## month_factorJun  2.107e+00  1.355e-01   15.550  < 2e-16 ***
## month_factorJul  3.589e-01  1.355e-01    2.648  0.00852 ** 
## month_factorAug -1.879e+00  1.355e-01  -13.865  < 2e-16 ***
## month_factorSep -3.553e+00  1.355e-01  -26.210  < 2e-16 ***
## month_factorOct -3.522e+00  1.356e-01  -25.980  < 2e-16 ***
## month_factorNov -2.260e+00  1.356e-01  -16.671  < 2e-16 ***
## month_factorDec -9.487e-01  1.356e-01   -6.996 1.75e-11 ***
## t                9.394e-02  3.098e-03   30.324  < 2e-16 ***
## I(t^2)           3.340e-04  2.298e-05   14.537  < 2e-16 ***
## I(t^3)          -4.477e-07  4.827e-08   -9.276  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4885 on 297 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
## F-statistic: 1.955e+04 on 14 and 297 DF,  p-value: < 2.2e-16

Most of the factor effects are highly significant indicated by small p-values. Let us plot the modeled data and the observed data to assess the model”s goodness of fit.

plot.ts(index(co2_1990_2015),
  fitted(seas_fit),
  col = "red", type = "l", lwd = 2,
  ylab = expression("CO"[2] * " ppm"),
  xlab = ""
)

points(index(co2_1990_2015), co2_1990_2015,
  cex = 0.35, type = "p", col = "blue"
)

legend("topleft",
  legend = c("observed", "modeled"),
  col = c("blue", "red"),
  lty = c(NA, 1),
  pch = c(16, NA),
  cex = 0.75
)

Let us plot the residuals from the cubic and periodic model versus time to see if the strong seasonal effect is now gone and the correlogram for CO₂ data with trend and period removed.

par(mfrow = c(2, 1))

plot.ts(seas_fit$residuals,
  ylab = "residuals",
  main = "CO2 Residuals from the Cubic & Indicator Fit",
  cex.main = 0.85
)

acf(seas_fit$residuals,
  lag.max = 48,
  main = NA
)

title("Correlogram for CO2 Data with Trend & Period removed",
  cex.main = 0.85
)

The lag axes are in terms of seasons (12 months). From the correlogram it appears that the autocorrelations for the CO₂ are positive for successive months up to about 12 months apart.

Seasonal differencing

In R we apply the diff() function for differencing, setting the lag argument to \(12\) results in seasonal differencing

# cubic fit
fit3 <- lm(as.numeric(co2_1990_2015) ~ t + I(t^2) + I(t^3))

# differencing
seas_diff <- diff(fit3$residuals, lag = 12)
par(mfrow = c(2, 1))

plot(seas_diff,
  main = "CO2 Residuals from a Cubic trend & Seasonal Differencing",
  cex.main = 0.85,
  type = "l",
) 

acf(na.trim(seas_diff),
  lag.max = 48,
  main = NA
)

title("Correlogram for CO2 Data with Trend & Period Removed",
  cex.main = 0.85
)

The upper plot shows the residuals after linear trend removal with a cubic fit and seasonal differencing \((\Delta^{12}y_t)\). The lower plot shows the correlogram. The lag axes in the correlogram are in terms of seasons (12 months). From the correlogram it appears that the autocorrelations for the CO₂ are positive for successive months up to about 8 months apart.

Fitting a periodic function

The advantage of fitting a regression model that is periodic, is the need of less parameters compared to the regression model with indicators. Let us fit the periodic model and take a look at the model summary.

t <- seq_along(co2_1990_2015)
x1 <- sin(t * 2 * pi / 12)
x2 <- cos(t * 2 * pi / 12)
co2_periodic <- lm(as.numeric(fit3$residuals) ~ x1 + x2)
summary(co2_periodic)

## 
## Call:
## lm(formula = as.numeric(fit3$residuals) ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76416 -0.60746 -0.02468  0.58161  2.28855 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.319e-16  4.357e-02    0.00        1    
## x1           2.452e+00  6.161e-02   39.80   <2e-16 ***
## x2          -1.597e+00  6.161e-02  -25.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7696 on 309 degrees of freedom
## Multiple R-squared:  0.8796, Adjusted R-squared:  0.8788 
## F-statistic:  1128 on 2 and 309 DF,  p-value: < 2.2e-16

The regression coefficient estimates are both highly significant. Let us plot the linear trend corrected CO₂ data along with the fitted values from the periodic regression model.

plot.ts(index(co2_1990_2015),
  fitted(co2_periodic),
  col = "red", type = "l", lwd = 2,
  ylab = expression("CO"[2] * " ppm"),
  xlab = "",
  ylim = c(-5, 5)
)

points(index(co2_1990_2015), fit3$residuals,
  pch = 16, cex = 0.7, type = "p", col = "blue"
)

legend("topleft",
  legend = c("observed", "modeled"),
  col = c("blue", "red"),
  lty = c(NA, 1),
  pch = c(16, NA),
  cex = 0.75
)

The plot indicates that the general pattern of seasonality is well reproduced, however, the extreme values are not well fitted.

Let us take a look at the model residuals.

plot(co2_periodic$residuals,
  ylab = "residuals",
  main = "CO2 Residuals from a Cubic & Periodic Fit",
  cex.main = 0.85
)

The plot still shows strong seasonal components, hence indicating that this kind of periodic regression model is not fully capable of modelling the seasonal component in the CO₂ data. This is very likely due to missing information about the phase (\(\phi_1\) and \(\phi_2\)).

\[y_t=\beta_0+\beta_1\sin((2t\pi/m)+\phi_1)+\beta_2\cos((2t\pi/m)+\phi_2) +w_t\] We cannot estimate the additional phase information by a linear model, but we may estimate phase information by e.g. Fourier analysis.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.