We start our exercise by collecting data. The data is generated from a function that is unknown to you. This precondition makes the example more realistic, as in real applications we do not know the exact specification of the underlying data generation process. At the end of this section we uncover the secret of the data generation process.

This is the data, our observations, in tabular form. We have 25 data points; each data point is an \((x,y)\) pair.

poly.data
##              x           y
## 1  0.001399613 -0.23436656
## 2  0.971629779  0.64689524
## 3  0.579119475 -0.92635765
## 4  0.335693937  0.13000706
## 5  0.736736086 -0.89294863
## 6  0.492572335  0.33854780
## 7  0.737133774 -1.24171910
## 8  0.563693769 -0.22523318
## 9  0.877603280 -0.12962722
## 10 0.141426545  0.37632006
## 11 0.307203910  0.30299077
## 12 0.024509308 -0.21162739
## 13 0.843665029 -0.76468719
## 14 0.771206067 -0.90455412
## 15 0.149670258  0.77097952
## 16 0.359605608  0.56466366
## 17 0.049612895  0.18897607
## 18 0.409898906  0.32531750
## 19 0.935457898 -0.78703491
## 20 0.149476207  0.80585375
## 21 0.234315216  0.62944986
## 22 0.455297119  0.02353327
## 23 0.102696671  0.27621694
## 24 0.715372314 -1.20379729
## 25 0.681745393 -0.83059624

Here is the data in the form of a scatter plot:

plot(poly.data$x, poly.data$y)


Fitting a curve in R: The Notation in R

The statistical software R provides powerful functionality for fitting a polynomial to data. One of these functions is the lm() function, which we already know. However, in order to fit a \(k^{th}\)-order polynomial we add additional arguments to the model formula. In addition, there are two different ways of coding a polynomial regression.

For a 3rd-order polynomial the first option is to type lm(response ~ poly(predictor, 3)), and the second option is lm(response ~ predictor + I(predictor^2) + I(predictor^3)). We need the I(...) notation to prevent any special interpretation of operators in the model formula.

The difference between those two options is that in the first case the poly() function produces orthogonal polynomials, which have the advantage that inspecting the regression coefficients allows us to assess which polynomial term significantly improves the regression. However, the numerical values for \(\beta_i\) may no longer be plugged into the polynomial regression equation from above. In contrast, if we apply the second option, the resulting regression coefficients may be plugged into the regression equation and yield the correct numerical result. However, as the predictors are correlated (the problem of multicollinearity), it is often difficult to identify which polynomial term significantly improves the regression. Please note that the predictions are not affected at all by the choice of notation.

To alleviate the confusion we show an example in R. We fit two 2nd-order polynomial models to poly.data.

m1 <- lm(y ~ poly(x, 2), data = poly.data)
m2 <- lm(y ~ x + I(x^2), data = poly.data)

We review the output of both models.

m1
## 
## Call:
## lm(formula = y ~ poly(x, 2), data = poly.data)
## 
## Coefficients:
## (Intercept)  poly(x, 2)1  poly(x, 2)2  
##    -0.11891     -1.80079      0.07206
m2
## 
## Call:
## lm(formula = y ~ x + I(x^2), data = poly.data)
## 
## Coefficients:
## (Intercept)            x       I(x^2)  
##      0.4592      -1.3681       0.1890

As noted above, the regression coefficients differ. Let us predict \(\hat y\) for \(x=0.5\) by applying the predict() function.

predict(m1, newdata = list('x' = 0.5))
##          1 
## -0.1776587
predict(m2, newdata = list('x' = 0.5))
##          1 
## -0.1776587

As expected, both models predict \(\hat y = -0.1776587\) for \(x=0.5\).
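
For illustration, and as a quick check that is not part of the original analysis, we may also plug the raw coefficients of m2 into the regression equation by hand. This should reproduce the value returned by predict().

b <- coef(m2)                          # raw coefficients: intercept, x, x^2
x.new <- 0.5
b[1] + b[2] * x.new + b[3] * x.new^2   # should equal the predicted value above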


Fitting a curve in R (continued)

So now that we know about the notation in R, we build 6 different models, with \(k = 1,2,3,5,9,14\). For each model we calculate the RMSE. Finally, we plot the data together with the regression curve given by each particular model. For convenience we construct a loop to reduce the amount of coding.
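
Recall that the RMSE (assuming the usual definition, which matches the computation in the code below) is

\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat y_i - y_i)^2},\]

where \(\hat y_i\) are the fitted values and \(y_i\) the observations.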

# plotting setup
par(mfrow = c(3,2), mar = c(2,2,1,1)) # set up 6 subplots

# setup
RMSE <- data.frame('kth.order' = NA, 'RMSE' = NA) # empty data frame to store RMSE
vals <- list('x' = seq(min(poly.data$x), max(poly.data$x), by = 0.01)) # x values used for prediction

# run loop
k <- c(1,2,3,5,9,14) # k-th order

for (i in 1:length(k)){
  # build models
  model <- lm(y ~ poly(x, k[i]), data = poly.data)
  
  # calculate RMSE and store it for further usage
  RMSE[i,1] <- k[i] # store k-th order
  RMSE[i,2] <- sqrt(sum((fitted(model)-poly.data$y)^2)/length(poly.data$y)) # calculate RMSE
  
  # predict 
  predictions <- predict(model, newdata = vals)
  # plot
  plot(poly.data$x, poly.data$y, pch=16, col='blue')
  lines(vals$x, predictions, lwd=2, col='red')
  text(x=0.8, y=0.95, paste0('k = ', k[i], ', RMSE = ', round(RMSE[i,2],3))) # annotate the plot
}

Awesome, pretty plots! The figure shows that if we increase \(k\), the order of the polynomial, the curve becomes more flexible and fits the data better and better. The better the data is fitted, the lower the error (RMSE) becomes.

For convenience we plot the RMSE against \(k\).

plot(RMSE[,1], RMSE[,2], 
     xlab = 'k-th order', 
     ylab = 'RMSE', 
     type = 'b', 
     col = 'blue',
     pch = 16)

Hence, once again the question arises: which polynomial fits the data best? Do we believe that the 14th-order polynomial best reflects the underlying data generation process? Although we obtain an excellent fit to the observed data by increasing the order of the polynomial, it remains questionable whether such a high-order polynomial generalizes well. Imagine we conduct a new measurement campaign and receive new data. Do you believe the wildly oscillating curve of a high-order polynomial would still fit the data well? No, probably not!

This behavior is known as over-fitting. Recall that the goal is to learn the parameters from the data; thus we are interested in a model that generalizes well, not necessarily one that fits the observed data perfectly.


Learning from Data

How do we solve the problem? How do we determine the best \(k^{th}\)-order polynomial for our data set? Well, there are many methods and strategies to counteract over-fitting. In this section we follow a simple approach. First, we split the data set into two parts: one part we call the training set, the other part the validation set. Then we use all the data in the training set to learn the model parameters \(\beta_i\), in the same fashion as we did above. Thereafter we apply the learned model to predict the data in the validation set and evaluate the model performance by computing the RMSE. Thus, we use the validation set to optimize the model's complexity, given by \(k\), as sketched below.
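
Before we do so with a larger data set, here is a minimal sketch of that procedure, applied to poly.data for illustration only; the 70/30 split fraction, the seed and the choice of \(k = 3\) are arbitrary assumptions.

set.seed(1) # for reproducibility of the random split
idx <- sample(seq_len(nrow(poly.data)), size = round(0.7 * nrow(poly.data))) # training indices
train <- poly.data[idx, ]  # data used to learn the coefficients
valid <- poly.data[-idx, ] # data used to evaluate the model

fit <- lm(y ~ poly(x, 3), data = train)  # fit a 3rd-order polynomial on the training set
pred <- predict(fit, newdata = valid)    # predict the validation set
sqrt(mean((pred - valid$y)^2))           # validation RMSE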

Unfortunately, if we want to learn from data we ultimately need data to learn from. So far we have worked with 25 observations. That is not much. In real-life applications we would probably have to obtain new observations by conducting a new measurement campaign. In our exercise, however, we may generate more data fairly easily. Thus we continue this example with a new data set of 150 observations.

Let us plot the data!

plot(new.poly.data$x, new.poly.data$y)