We get started with our exercise by collecting data. The data is generated by a function unknown to you. This precondition makes the example more realistic, as in real applications we do not know the exact specification of the underlying data generation process either. At the end of this section we will uncover the secret of the data generation process. As always, we first import the required packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rs = np.random.RandomState(7)  # fix the random seed for reproducibility
n = 25
x = rs.uniform(0, 1, n)  # predictor: n points drawn uniformly from [0, 1]
y = np.sin(2 * np.pi * x) + rs.normal(0, 0.35, n)  # response: the (for now secret) data generation process
poly_data = pd.DataFrame.from_dict({"x": x, "y": y})
This is the data, our observations, in tabular form. We have 25 data points; each data point is an $(x, y)$ pair.
poly_data.head()
|   | x        | y         |
|---|----------|-----------|
| 0 | 0.076308 | 0.486031  |
| 1 | 0.779919 | -1.369874 |
| 2 | 0.438409 | 0.655364  |
| 3 | 0.723465 | -0.574302 |
| 4 | 0.977990 | -0.640220 |
Here is the data in the form of a scatter plot:
fig = plt.figure(figsize=[12,4])
plt.scatter(x, y)
plt.grid()
plt.xlabel("x")
plt.ylabel("y")
plt.show()
The package ecosystem of Python provides powerful functionality for fitting a polynomial to data.
One helpful module is statsmodels. It provides the ordinary least squares model through the OLS()
class, which we already know from simple linear regression.
However, in order to fit a $k^{th}$-order polynomial we need to add additional arguments to the function call.
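As a reminder, the simple linear regression we already know looks like this in that notation (a minimal sketch using poly_data from above):

# baseline: plain simple linear regression; params holds intercept and slope
baseline = sm.OLS.from_formula("y ~ x", data=poly_data).fit()
print(baseline.params)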
In addition, there are two different options for coding a polynomial regression. The first is the Python-style notation, in which the polynomial terms are generated by a preprocessing step. For a 3rd-order polynomial this option looks like this in pseudocode:

OLS.from_formula(response ~ polynomial(predictor, 3))

The second is the R-style formula notation:

OLS.from_formula(response ~ predictor + I(predictor^2) + I(predictor^3))

We need the I(...) notation to prevent any special interpretation of operators in the model formula.
The difference between these two options is that in the first case the polynomial terms are constructed for us by a preprocessing step; we can use the PolynomialFeatures preprocessor from scikit-learn for this. Note, however, that PolynomialFeatures already includes a constant (bias) column of its own, so combining it with sm.add_constant() duplicates columns in the design matrix. The predictions are unaffected, but each coefficient estimate is split across the duplicated columns, so the numerical values for $\beta_i$ may no longer be plugged into the polynomial regression equation from above. In contrast, if we apply the second option, the resulting regression coefficients may be plugged into the regression equation and yield the correct numerical result. However, as the powers of the predictor are correlated (the problem of multicollinearity), it is often difficult to identify which polynomial term significantly improves the regression.
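To see this multicollinearity directly, here is a quick check (an illustrative sketch using the x generated above):

# raw powers of x are strongly correlated with each other
powers = pd.DataFrame({"x": x, "x2": x**2, "x3": x**3})
print(powers.corr().round(3))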
Note: The prediction itself will not be affected at all by the choice of notation.
To alleviate the confusion we calculate an example. For this we construct two 2nd-order polynomial models for poly_data:
polynomial_features = PolynomialFeatures(degree=2)
# caution: PolynomialFeatures already includes a bias column, so together with
# sm.add_constant() the design matrix contains duplicated (collinear) columns
xp = polynomial_features.fit_transform(sm.add_constant(x))

m1 = sm.OLS(y, xp).fit()  # Python-style
m2 = sm.OLS.from_formula('y ~ x + I(x**2)', data=poly_data).fit()  # R-style formula
m1.params
array([ 0.13582645, 0.13582645, -0.23447937, 0.13582645, -0.23447937, -0.74967182])
m2.params
Intercept    0.407479
x           -0.468959
I(x ** 2)   -0.749672
dtype: float64
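Note how the coefficients of m1 are split across the duplicated columns: the three identical constant columns share the intercept ($3 \times 0.135826 \approx 0.407479$) and the two identical linear columns share the slope ($2 \times (-0.234479) \approx -0.468959$). The coefficients of m2, in contrast, can be read off directly. A quick check confirms that both models nevertheless produce identical fitted values:

# both design matrices span the same column space, so the fits agree
print(np.allclose(m1.fittedvalues, m2.fittedvalues))  # expected: True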
Now that we know the notation, we build six different models with $k = 1, 2, 3, 5, 8, 10$. For each model we calculate the RMSE. Finally, we plot the data together with the regression curve of each particular model. For convenience we construct a loop to reduce the amount of coding.
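Recall the definition of the root mean squared error that we compute inside the loop:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

where $\hat{y}_i$ denotes the fitted (or predicted) value and $y_i$ the observed value.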
# plotting setup
fig, ax = plt.subplots(3, 2, figsize=(12, 8), sharex=True, sharey=True)

# setup
vals = pd.DataFrame.from_dict({"x": np.arange(poly_data.x.min(), poly_data.x.max(), 0.01)})  # set up vector used for prediction
rmse_values = []

# run loop
k = [1, 2, 3, 5, 8, 10]  # k-th order
for i, axis in enumerate(ax.flat):
    # build model of order k[i]
    polynomial_features = PolynomialFeatures(degree=k[i])
    xp = polynomial_features.fit_transform(sm.add_constant(x))
    model = sm.OLS(y, xp).fit()
    # calculate RMSE and store it for further usage
    rmse_values.append(np.sqrt(np.sum((model.fittedvalues - poly_data.y)**2) / len(poly_data.y)))
    # predict on the grid
    xvals = polynomial_features.fit_transform(sm.add_constant(vals))
    predictions = model.predict(xvals)
    # plot
    axis.scatter(poly_data.x, poly_data.y)
    axis.plot(vals, predictions)
    axis.grid()
    axis.set_title(f"degree of polynomial = {k[i]}, RMSE = {np.round(rmse_values[i], 3)}")

plt.tight_layout()
plt.show()
RMSE = pd.DataFrame.from_dict({"kth_order": k, "value": rmse_values})
Awesome, pretty plots! The figure shows that as we increase $k$, the order of the polynomial, the curve becomes more flexible and fits the data better and better. The better the data is fitted, the lower the error (RMSE) becomes.
In order to visualize this behavior we plot the RMSE against $k$:
plt.plot(RMSE["kth_order"], RMSE["value"], 'o-')
plt.grid()
plt.xlabel("$k^{th}$-order")
plt.ylabel("RMSE")
plt.show()
Hence, once again the question arises: which is the best polynomial to fit the data? Do we believe that the 10th-order polynomial fits the underlying data generation process best? Even though we obtain an excellent fit to the observed data by increasing the order of the polynomial, it remains questionable whether a high-order polynomial generalizes well. Imagine we conduct a new measurement campaign and receive new data based on the same data generation process. Do you believe the wildly oscillating curve of a high-order polynomial will still fit the data well? No, probably not!
This problem is known as overfitting. Recall that the goal is to learn the model parameters from the data. Thus, we are interested in a model that generalizes well, not necessarily one that fits the observations perfectly.
So, how do we solve the problem? How do we determine the best $k^{th}$-order polynomial for our data set? Well, there are many methods and strategies to counteract overfitting. In this section we follow a simple approach. First, we split the data set into two parts. We call one part the training set and the other part the validation set. Then we use all the data in the training set to learn the model parameters $\beta_i$, in the same fashion as we did above. Thereafter, we apply the trained model to predict the data of the validation set and evaluate the model performance by computing the RMSE. Thus, we use the validation set to optimize the model's complexity, given by $k$.
Unfortunately, if we want to learn from data we ultimately need data to learn from. So far we worked with 25 observations. That is not much. In real life applications we would probably have to obtain new observations by conducting a new measurement campaign. In our exercise, however, we may generate more data fairly easily. Thus, we continue this example with a new data set of 150 observations.
Let us plot the data!
n1 = 150
x = rs.uniform(0, 1, n1)
y = np.sin(2 * np.pi * x) + rs.normal(0, 0.3, n1)
new_poly_data = pd.DataFrame.from_dict({"x": x, "y": y})
plt.scatter(new_poly_data.x, new_poly_data.y)
plt.grid()
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Now we are ready to build our training and validation set.
For this we use the pandas.DataFrame.sample() method to draw a random sample as our training data and then drop these rows (identified by their index) to obtain our validation data, stored as test in the code below.
We split the data in such a way that 65% of the data is assigned to the training set and the remaining 35% to the validation set.
train = new_poly_data.sample(frac=.65, random_state=0)
test = new_poly_data.drop(train.index)
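A quick sanity check of the resulting split sizes (with frac=.65 of 150 observations we expect roughly a 98/52 split):

# verify the 65/35 split
print(len(train), len(test))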
In this exercise we are going to build 14 models with $k = 1, 2, \ldots, 14$. We evaluate each of the 14 polynomial regression models by calculating the RMSE on the training set. Further, after building the model and obtaining the model parameters $\beta_i$, we use the model to predict the response variable in the validation set. Again, we rely on the RMSE to evaluate the predictions on the validation set. Finally, we plot the RMSE of each model for both the training set and the validation set. Based on the RMSE we assess the generalization of the model.
We already developed a nice piece of code above. Thus, we only modify and extend the existing code:
rmse_train_values = []
rmse_test_values = []

# run loop
k = np.arange(1, 15)  # k-th order
for i in k:
    # build model of order i on the training set
    polynomial_features = PolynomialFeatures(degree=i)
    xp = polynomial_features.fit_transform(sm.add_constant(train.x))
    model = sm.OLS(train.y, xp).fit()
    # calculate RMSE on the training set and store it for further usage
    rmse_train_values.append(
        np.sqrt(np.sum((model.fittedvalues - train.y)**2) / len(train.y))
    )
    # predict the validation set and calculate its RMSE
    xvals = polynomial_features.fit_transform(sm.add_constant(test.x))
    predictions = model.predict(xvals)
    rmse_test_values.append(
        np.sqrt(np.sum((predictions - test.y)**2) / len(test.y))
    )

RMSE = pd.DataFrame.from_dict({
    "kth_order": k,
    "test_values": rmse_test_values,
    "train_values": rmse_train_values
})
plt.figure(figsize=(12, 4))
plt.plot(RMSE.kth_order, RMSE.train_values, 'o-', label="train values")
plt.plot(RMSE.kth_order, RMSE.test_values, 'o-', label="test values")
plt.grid()
plt.legend()
plt.xlabel("$k^{th}$-order")
plt.ylabel("RMSE")
plt.show()
The figure shows that the error on the training data (blue line) decreases continuously. This makes perfect sense: the more complex the model becomes by increasing $k$, the better the model fits the training data. We observed the same behavior in the section above, when we trained our model with just 25 observations. If we take a look at the RMSE for the validation set (orange line), we see that with increasing $k$, and thus increasing model complexity, the error decreases at first. However, there is a sweet spot, indicated by the lowest RMSE, where the model is just complex enough to generalize well to the so far unseen validation data. If the model complexity increases further, the RMSE starts to increase, too. This indicates that the model is overfitting: it memorizes the training data well, but it becomes less predictive for so far unseen data, such as the data of the validation set. Take a look at the figure above to see that the lowest error on the validation set is obtained for a regression model of 3rd order.
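We can confirm this programmatically by extracting the order with the lowest validation RMSE from the RMSE data frame:

# polynomial order with the lowest validation RMSE
best_k = RMSE.loc[RMSE.test_values.idxmin(), "kth_order"]
print(best_k)  # expected: 3, in line with the discussion above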
In the previous section we found that a 3rd-order polynomial regression model performs best on the validation set. Now we plot that model.
# setup
# set up vector used for prediction
vals = np.arange(new_poly_data.x.min(), new_poly_data.x.max(), 0.01)

# refit the 3rd-order model on the training set
polynomial_features = PolynomialFeatures(degree=3)
xp = polynomial_features.fit_transform(sm.add_constant(train.x))
model = sm.OLS(train.y, xp).fit()

# predict on the grid
xvals = polynomial_features.fit_transform(sm.add_constant(vals))
predictions = model.predict(xvals)
fig, ax = plt.subplots()
ax.plot(train.x, train.y, "o", label="train data")
ax.plot(test.x, test.y, "o", label="test data")
ax.plot(vals, np.sin(2 * np.pi * vals), '-', label="signal")
ax.plot(vals, predictions, "-", label="predictions")
ax.legend(loc="best")
plt.show()
The figure shows that our model does a decent job of fitting the data, so we can be quite satisfied with it. The green curve also uncovers the secret from the beginning of this section: the data was generated by the signal $\sin(2 \pi x)$ plus Gaussian noise.
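As a final check (a small sketch reusing the fitted model from above), we can compute the RMSE of this 3rd-order model on the validation set; it should match the value for $k = 3$ in the RMSE data frame:

# out-of-sample RMSE of the final 3rd-order model
xtest = polynomial_features.fit_transform(sm.add_constant(test.x))
rmse_final = np.sqrt(np.mean((model.predict(xtest) - test.y)**2))
print(np.round(rmse_final, 3))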
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via email at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.