In the previous section we saw that feature engineering, such as transformations, basis expansions, and interactions, may significantly enhance the predictive power of a model. Consequently, the question arises as to which features should be included in the model. Hence, we are looking for a criterion that allows us to assess which combination of features gives us the best model performance and, at the same time, in order to counteract overfitting, penalizes the number of free parameters in our model.
There are two commonly applied model selection criteria, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
Both criteria (AIC and BIC) are measures of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC and BIC estimate the quality of each model relative to each of the other models. Hence, both criteria provide a means for model selection. When we add more model parameters $k$ to a model, we lose less information and the maximized log-likelihood $\ln(\mathcal L)$ increases. From the formulas below we can see that this causes the rightmost part of each formula to decrease. However, adding ever more explanatory variables leads to overfitting the training data. Thus, a penalty term, $2k$ and $\ln(n)k$ respectively, is introduced. This penalty is stronger for the BIC than for the AIC.
$$AIC = 2k - 2\ln(\mathcal L)$$

$$BIC = \ln(n)k - 2\ln(\mathcal L)\text{,}$$

where $k$ is the number of estimated model parameters, $n$ is the number of data points, and $\mathcal L$ is the maximized value of the likelihood function for the model $M$; i.e. $\mathcal L = P(x \vert \hat\theta, M)$, where $\hat\theta$ are the parameter values that maximize the likelihood function.
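As a quick sanity check, both formulas are easy to evaluate by hand. A minimal sketch with toy numbers (not from our data): a model with a slightly higher log-likelihood but two extra parameters is rejected by both criteria.

import numpy as np

def aic(log_likelihood, k):
    # AIC = 2k - 2 ln(L)
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # BIC = ln(n) k - 2 ln(L)
    return np.log(n) * k - 2 * log_likelihood

print(aic(-650.0, 3), bic(-650.0, 3, 100))  # 1306.0, 1313.8
print(aic(-649.0, 5), bic(-649.0, 5, 100))  # 1308.0, 1321.0 -> extra parameters not worth it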
Under the framework of step-wise model selection a criterion (e.g. AIC or BIC) is used for weighing the choices of adding (or excluding) model parameters, taking into account the number of parameters to be fitted. At each step the addition or removal that minimizes the criterion (e.g. AIC or BIC) score is performed.
Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models.
In Python we can compute these information criteria for our models via the statsmodels package. Let us use this to select some features by hand. First we will load our data and prepare our training and test data. We will start with a smaller subset of variables.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
dwd = pd.read_table(
"https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
index_col=0,
sep=',',
)
df = dwd.drop([
'DWD_ID',
'RECORD_LENGTH',
'STATION_NAME',
'FEDERAL_STATE',
'PERIOD',
'LON',
'LAT',
], axis=1).dropna()
subset = df[[
"MEAN_ANNUAL_RAINFALL",
"ALTITUDE",
"MAX_RAINFALL",
"MEAN_CLOUD_COVER",
"MEAN_ANNUAL_AIR_TEMP"
]]
train = subset.sample(frac=0.5, random_state=0)
test = subset.drop(train.index)
rmses = pd.read_feather('https://userpage.fu-berlin.de/soga/data/py-data/30221_rmses.feather').set_index('index')
train.head()
| ID | MEAN_ANNUAL_RAINFALL | ALTITUDE | MAX_RAINFALL | MEAN_CLOUD_COVER | MEAN_ANNUAL_AIR_TEMP |
|---|---|---|---|---|---|
| 94 | 778.0 | 363.0 | 37.0 | 65.0 | 9.2 |
| 239 | 533.0 | 316.0 | 36.0 | 67.0 | 8.2 |
| 148 | 678.0 | 68.0 | 39.0 | 68.0 | 10.2 |
| 167 | 571.0 | 69.0 | 38.0 | 65.0 | 8.9 |
| 502 | 511.0 | 131.0 | 33.0 | 66.0 | 9.1 |
Now we can create our first model, using a single predictor variable.
model = sm.OLS.from_formula("MEAN_ANNUAL_RAINFALL ~ ALTITUDE", data = train).fit()
print(f"predictor: ALTITUDE, aic: {model.aic}, bic: {model.bic}")
predictor: ALTITUDE, aic: 1304.5574092519862, bic: 1309.8073548785546
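As a cross-check, statsmodels exposes the maximized log-likelihood of a fit as `model.llf`, so both criteria can be reproduced from the formulas above; a minimal sketch (here $k = 2$: the intercept and the slope):

k = model.df_model + 1  # number of estimated coefficients: intercept + slope
n = model.nobs
print("AIC by hand:", 2 * k - 2 * model.llf)
print("BIC by hand:", np.log(n) * k - 2 * model.llf)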
Next, let us iterate through our selected predictors and compare the information criteria. Afterwards we will also look at the root-mean-square error (RMSE) of predictions on the training and test data to check for over- or underfitting of the models.
# A dictionary to store the results
results = {'predictor': [], 'aic':[], 'bic': []}
for col in train.columns.drop("MEAN_ANNUAL_RAINFALL"):
model = sm.OLS.from_formula(f"MEAN_ANNUAL_RAINFALL ~ {col}", data = train).fit()
results["predictor"].append(col)
results["aic"].append(model.aic)
results["bic"].append(model.bic)
results_df = pd.DataFrame(results).sort_values(by=['aic'], ignore_index=True)
results_df
| | predictor | aic | bic |
|---|---|---|---|
| 0 | MAX_RAINFALL | 1276.729049 | 1281.978995 |
| 1 | ALTITUDE | 1304.557409 | 1309.807355 |
| 2 | MEAN_ANNUAL_AIR_TEMP | 1332.742853 | 1337.992799 |
| 3 | MEAN_CLOUD_COVER | 1361.375435 | 1366.625380 |
We can see that "MAX_RAINFALL" shows the lowest AIC and BIC and thus performs best among these single-predictor models. If we additionally compute the RMSE, the remaining prediction error on the training data is lower than on the test data for every one of the tested models. Let us see how this changes if we add more features.
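Such a check takes only a few lines; a minimal sketch for the best single-predictor model (the same pattern works for the other predictors):

model = sm.OLS.from_formula("MEAN_ANNUAL_RAINFALL ~ MAX_RAINFALL", data=train).fit()
print("RMSE train:", mean_squared_error(model.predict(train), train["MEAN_ANNUAL_RAINFALL"], squared=False))
print("RMSE test:", mean_squared_error(model.predict(test), test["MEAN_ANNUAL_RAINFALL"], squared=False))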
Let us define a function to add more predictors and compare information criteria.
def add_feature(ds, current_features, dependent_var, scoring):
    results = {'predictor': [], 'aic': [], 'bic': []}
    for col in ds.columns:
        # We try a predictor if it isn't already selected
        if col not in (current_features + [dependent_var]):
            # Build our formula string by joining the list current_features and adding the new one
            formula = f"{dependent_var} ~ {' + '.join(current_features)} + {col}"
            model = sm.OLS.from_formula(formula, data=ds).fit()
            results['predictor'].append(col)
            results["aic"].append(model.aic)
            results["bic"].append(model.bic)
    results_df = pd.DataFrame(results).sort_values(by=[scoring], ignore_index=True)
    # Return the formula of the best candidate model, not just the last one tried
    best_formula = f"{dependent_var} ~ {' + '.join(current_features)} + {results_df.loc[0, 'predictor']}"
    return results_df, best_formula
We start with the already selected best predictor and compare the results for added predictors.
results, formula = add_feature(train, ['MAX_RAINFALL'], 'MEAN_ANNUAL_RAINFALL', scoring='aic')
print("New formula: " + formula)
results.head()
New formula: MEAN_ANNUAL_RAINFALL ~ MAX_RAINFALL + MEAN_CLOUD_COVER
| | predictor | aic | bic |
|---|---|---|---|
| 0 | MEAN_CLOUD_COVER | 1269.870867 | 1277.745785 |
| 1 | ALTITUDE | 1273.583872 | 1281.458790 |
| 2 | MEAN_ANNUAL_AIR_TEMP | 1274.532408 | 1282.407326 |
Great! MEAN_CLOUD_COVER yields the best results, and both criteria improved. Let's add another one.
results, formula = add_feature(train, ['MAX_RAINFALL', 'MEAN_CLOUD_COVER'], 'MEAN_ANNUAL_RAINFALL', scoring='aic')
print("New formula: " + formula)
results.head()
New formula: MEAN_ANNUAL_RAINFALL ~ MAX_RAINFALL + MEAN_CLOUD_COVER + ALTITUDE
| | predictor | aic | bic |
|---|---|---|---|
| 0 | ALTITUDE | 1268.130183 | 1278.630074 |
| 1 | MEAN_ANNUAL_AIR_TEMP | 1270.676485 | 1281.176376 |
The AIC still improves when adding the ALTITUDE predictor, but the BIC gets slightly worse than before. Here we see that the penalty term $\ln(n)k$ of the BIC is stronger than the $2k$ of the AIC. Let us do one more iteration and see what criterion values it yields.
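In fact, the BIC penalty exceeds the AIC penalty whenever

$$\ln(n)k > 2k \iff n > e^2 \approx 7.39\text{,}$$

i.e. for any data set with more than seven observations, which is virtually always the case in practice.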
results, formula = add_feature(train, ['MAX_RAINFALL', 'MEAN_CLOUD_COVER', "ALTITUDE"], 'MEAN_ANNUAL_RAINFALL', scoring='aic')
print("New formula: " + formula)
results.head()
model = sm.OLS.from_formula(formula, data = train).fit()
print("RMSE train: " + str(
mean_squared_error(model.predict(train), train['MEAN_ANNUAL_RAINFALL'], squared=False)
))
print("RMSE test: " + str(
mean_squared_error(model.predict(test), test['MEAN_ANNUAL_RAINFALL'], squared=False)
))
New formula: MEAN_ANNUAL_RAINFALL ~ MAX_RAINFALL + MEAN_CLOUD_COVER + ALTITUDE + MEAN_ANNUAL_AIR_TEMP
RMSE train: 116.53247656972394
RMSE test: 117.68357200524653
We can now see that both criteria increased. We would thus stop here and not add this last feature to our model (or, if we weight the BIC more strongly, neither of the last two features). This was a very manual step-wise approach. We have good control over our iterations by checking our scores by hand; on the other hand, it is a long process if there are many predictors to select from.
So let us build a function, the forward selector, to automate the iteration. We define a function that uses our `add_feature` function and stores the best result. It then iterates forward until the selected score/information criterion worsens for the first time. Let us use this with the full dataset.
def forward_selector(ds, dependent_var, scoring):
    # A dict to store the final results
    results = {'predictor': [], 'aic': [], 'bic': []}
    # The same to store the first iteration
    scoring_dict = {'predictor': [], 'aic': [], 'bic': []}
    current_features = []
    for col in ds.columns.drop(dependent_var):
        model = sm.OLS.from_formula(f"{dependent_var} ~ {col}", data=ds).fit()
        # Append the results of the first iteration
        scoring_dict["predictor"].append(col)
        scoring_dict['aic'].append(model.aic)
        scoring_dict['bic'].append(model.bic)
    # Transform the scores into a table
    scoring_table = pd.DataFrame(scoring_dict).sort_values(by=[scoring], ignore_index=True)
    for _ in range(len(ds.columns)):
        # Store the scoring value of the last iteration
        prev_score = scoring_table.loc[0, scoring]
        current_features.append(scoring_table.loc[0, 'predictor'])
        # Add a new feature
        scoring_table, _ = add_feature(ds, current_features, dependent_var, scoring)
        # Stop if the information criterion got worse
        if prev_score < scoring_table.loc[0, scoring]:
            break
        results['predictor'].append(scoring_table.loc[0, 'predictor'])
        results['aic'].append(scoring_table.loc[0, 'aic'])
        results['bic'].append(scoring_table.loc[0, 'bic'])
    # Return the results and the formula of the selected model (without the rejected feature)
    formula = f"{dependent_var} ~ {' + '.join(current_features)}"
    return pd.DataFrame(results), formula
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)
train.head()
| ID | ALTITUDE | MEAN_ANNUAL_AIR_TEMP | MEAN_MONTHLY_MAX_TEMP | MEAN_MONTHLY_MIN_TEMP | MEAN_ANNUAL_WIND_SPEED | MEAN_CLOUD_COVER | MEAN_ANNUAL_SUNSHINE | MEAN_ANNUAL_RAINFALL | MAX_MONTHLY_WIND_SPEED | MAX_AIR_TEMP | MAX_WIND_SPEED | MAX_RAINFALL | MIN_AIR_TEMP | MEAN_RANGE_AIR_TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 94 | 363.0 | 9.2 | 13.0 | 5.6 | 3.0 | 65.0 | 1628.0 | 778.0 | 4.0 | 31.8 | 31.5 | 37.0 | -11.3 | 7.4 |
| 239 | 316.0 | 8.2 | 12.5 | 4.0 | 3.0 | 67.0 | 1635.0 | 533.0 | 3.0 | 32.4 | 27.4 | 36.0 | -16.4 | 8.4 |
| 148 | 68.0 | 10.2 | 13.6 | 6.2 | 3.0 | 68.0 | 1362.0 | 678.0 | 3.0 | 33.0 | 29.0 | 39.0 | -13.1 | 7.4 |
| 167 | 69.0 | 8.9 | 13.2 | 4.7 | 2.0 | 65.0 | 1652.0 | 571.0 | 3.0 | 33.5 | 27.1 | 38.0 | -16.7 | 8.6 |
| 502 | 131.0 | 9.1 | 13.3 | 5.1 | 3.0 | 66.0 | 1612.0 | 511.0 | 4.0 | 33.2 | 28.8 | 33.0 | -14.8 | 8.1 |
Let's give it a try. We start with an empty model and offer the full set of predictors. In each iteration the selector evaluates the chosen information criterion (scoring) for every candidate model and picks the best one, until no further improvement is found.
results, formula = forward_selector(train, 'MEAN_ANNUAL_RAINFALL', 'aic')
print("Final formula: " + formula)
display(results)
Final formula: MEAN_ANNUAL_RAINFALL ~ MAX_RAINFALL + MAX_AIR_TEMP + MEAN_MONTHLY_MAX_TEMP + ALTITUDE + MEAN_CLOUD_COVER + MIN_AIR_TEMP + MEAN_MONTHLY_MIN_TEMP + MEAN_ANNUAL_AIR_TEMP
| | predictor | aic | bic |
|---|---|---|---|
| 0 | MAX_AIR_TEMP | 1987.052408 | 1996.333659 |
| 1 | MEAN_MONTHLY_MAX_TEMP | 1967.520464 | 1979.895465 |
| 2 | ALTITUDE | 1961.499172 | 1976.967923 |
| 3 | MEAN_CLOUD_COVER | 1956.724648 | 1975.287149 |
| 4 | MIN_AIR_TEMP | 1954.874349 | 1976.530600 |
| 5 | MEAN_MONTHLY_MIN_TEMP | 1938.649460 | 1963.399462 |
| 6 | MEAN_ANNUAL_AIR_TEMP | 1937.938436 | 1965.782187 |
Great. Let us check the RMSE for this model and compare it to the previous models.
model = sm.OLS.from_formula(formula, data = train).fit()
rmse_train = mean_squared_error(model.predict(train), train['MEAN_ANNUAL_RAINFALL'], squared=False)
rmse_test = mean_squared_error(model.predict(test), test['MEAN_ANNUAL_RAINFALL'], squared=False)
print("RMSE train: " + str(rmse_train))
print("RMSE test: " + str(rmse_test))
RMSE train: 87.3923774773306
RMSE test: 117.8846385376481
rmses.loc[len(rmses)] = ['forward model', rmse_train, rmse_test]
rmses
| index | name | train_RMSE | test_RMSE |
|---|---|---|---|
| 0 | baseline model | 243.882152 | 180.877011 |
| 0 | simple alt model | 154.992815 | 138.854544 |
| 0 | max rainfall model | 119.953630 | 117.437897 |
| 0 | multi alt rain model | 118.095746 | 113.746363 |
| 4 | forward model | 87.392377 | 117.884639 |
Backward-stepwise model selection is very similar to the forward-stepwise selection discussed above. It starts with the full model and sequentially deletes the predictor that has the least impact on the fit. Backward selection can only be used when the number of observations is larger than the number of features ($n > d$), since otherwise the full model cannot be fit, while forward-stepwise selection can always be used (Hastie et al. 2008).
Starting from our example we can implement backward-stepwise feature selection by changing a few things. For the sake of comparison we apply it in the same manner as in the previous section, but this time we start with the full model and exclude one feature after the other based on the chosen information criterion.
def remove_feature(ds, current_features, dependent_var, scoring):
    results = {'predictor': [], 'aic': [], 'bic': []}
    for col in current_features:
        # Build our formula string by joining current_features with one feature removed
        remaining_features = current_features.copy()
        remaining_features.remove(col)
        formula = f"{dependent_var} ~ {' + '.join(remaining_features)}"
        model = sm.OLS.from_formula(formula, data=ds).fit()
        results['predictor'].append(col)
        results["aic"].append(model.aic)
        results["bic"].append(model.bic)
    # ignore_index=True so that row 0 is always the best-scoring removal
    return pd.DataFrame(results).sort_values(by=[scoring], ignore_index=True)
def backward_selector(ds, dependent_var, scoring):
    # A dict to store the final results
    results = {'predictor': [], 'aic': [], 'bic': []}
    current_features = ds.columns.drop(dependent_var).tolist()
    # Score the full model to initialize the comparison
    formula = f"{dependent_var} ~ {' + '.join(current_features)}"
    model = sm.OLS.from_formula(formula, data=ds).fit()
    prev_score = model.aic if scoring == 'aic' else model.bic
    while len(current_features) > 1:
        # Evaluate removing each of the remaining features
        scoring_table = remove_feature(ds, current_features, dependent_var, scoring)
        # Stop if even the best removal worsens the score
        if prev_score < scoring_table.loc[0, scoring]:
            break
        prev_score = scoring_table.loc[0, scoring]
        # Drop the feature whose removal improves the criterion the most
        current_features.remove(scoring_table.loc[0, 'predictor'])
        results['predictor'].append(scoring_table.loc[0, 'predictor'])
        results['aic'].append(scoring_table.loc[0, 'aic'])
        results['bic'].append(scoring_table.loc[0, 'bic'])
    # Return the results and the formula of the reduced model
    formula = f"{dependent_var} ~ {' + '.join(current_features)}"
    return pd.DataFrame(results), formula
results, formula = backward_selector(train, 'MEAN_ANNUAL_RAINFALL', 'aic')
print("Final formula: " + formula)
display(results)
Final formula: MEAN_ANNUAL_RAINFALL ~ ALTITUDE + MEAN_CLOUD_COVER + MEAN_ANNUAL_SUNSHINE + MAX_MONTHLY_WIND_SPEED + MAX_AIR_TEMP + MAX_WIND_SPEED + MAX_RAINFALL + MIN_AIR_TEMP + MEAN_RANGE_AIR_TEMP
| | predictor | aic | bic |
|---|---|---|---|
| 0 | MEAN_ANNUAL_AIR_TEMP | 1947.284975 | 1984.409977 |
| 1 | MEAN_MONTHLY_MAX_TEMP | 1946.621664 | 1980.652916 |
| 2 | MEAN_MONTHLY_MIN_TEMP | 1946.411300 | 1977.348802 |
| 3 | MEAN_ANNUAL_WIND_SPEED | 1944.423818 | 1972.267570 |
results, formula = backward_selector(train, 'MEAN_ANNUAL_RAINFALL', 'bic')
print("Final formula: " + formula)
display(results)
Final formula: MEAN_ANNUAL_RAINFALL ~ ALTITUDE + MAX_AIR_TEMP + MAX_WIND_SPEED + MAX_RAINFALL + MIN_AIR_TEMP + MEAN_RANGE_AIR_TEMP
| | predictor | aic | bic |
|---|---|---|---|
| 0 | MEAN_ANNUAL_AIR_TEMP | 1947.284975 | 1984.409977 |
| 1 | MEAN_MONTHLY_MAX_TEMP | 1946.621664 | 1980.652916 |
| 2 | MEAN_MONTHLY_MIN_TEMP | 1946.411300 | 1977.348802 |
| 3 | MEAN_ANNUAL_WIND_SPEED | 1944.423818 | 1972.267570 |
| 4 | MEAN_CLOUD_COVER | 1944.637352 | 1969.387353 |
| 5 | MEAN_ANNUAL_SUNSHINE | 1943.184089 | 1964.840341 |
| 6 | MAX_MONTHLY_WIND_SPEED | 1941.632340 | 1960.194841 |
model = sm.OLS.from_formula(formula, data = train).fit()
rmse_train = mean_squared_error(model.predict(train), train['MEAN_ANNUAL_RAINFALL'], squared=False)
rmse_test = mean_squared_error(model.predict(test), test['MEAN_ANNUAL_RAINFALL'], squared=False)
print("RMSE train: " + str(rmse_train))
print("RMSE test: " + str(rmse_test))
RMSE train: 87.23602319787905
RMSE test: 118.36797395246535
rmses.loc[len(rmses)] = ['backward model', rmse_train, rmse_test]
rmses
| index | name | train_RMSE | test_RMSE |
|---|---|---|---|
| 0 | baseline model | 243.882152 | 180.877011 |
| 0 | simple alt model | 154.992815 | 138.854544 |
| 0 | max rainfall model | 119.953630 | 117.437897 |
| 0 | multi alt rain model | 118.095746 | 113.746363 |
| 4 | forward model | 87.392377 | 117.884639 |
| 5 | backward model | 87.236023 | 118.367974 |
With these functions we can thus run both forward- and backward-stepwise feature selection relying on the same information criteria.
In Python we can also implement a forward-stepwise model selection procedure using the `SequentialFeatureSelector()` class of the package Mlxtend. First, we select a model class, `LinearRegression`. In forward-stepwise selection the model starts with the intercept and nothing else. Moreover, we specify forward-stepwise selection by explicitly passing the argument `forward=True` to the function call.
Let's give it a try. We start with an empty model and offer our full set of predictors. By default the selector evaluates the coefficient of determination $r^2$ and picks the model that maximizes it.
X = df.drop('MEAN_ANNUAL_RAINFALL', axis=1)
y = df['MEAN_ANNUAL_RAINFALL']
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
n_samples = X_train.shape[0]
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Sequential Forward Selection(sfs)
sfs = SFS(
LinearRegression(),
k_features=(1, 12),
forward=True,
cv=None,
)
results = sfs.fit_transform(X, y)
sfs.k_feature_names_
('MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP')
sfs
SequentialFeatureSelector(cv=None, estimator=LinearRegression(), k_features=(1, 12), scoring='r2')
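Besides this summary, the selector records every step of the greedy search in its `subsets_` attribute. A minimal sketch of how to review the search path (assuming the fit above):

# Inspect the recorded search path: for each step, mlxtend stores the
# number of selected features, the score, and the feature names
for n_feat, step in sfs.subsets_.items():
    print(n_feat, round(step['avg_score'], 3), step['feature_names'])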
Let us take a closer look at the output of the `SequentialFeatureSelector()` class. It estimates using a `LinearRegression()` and selects a model consisting of between 1 and 12 features. It scores, however, using the $r^2$ scorer. Let us have a look at the model performance as a function of the number of features.
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Selection')
plt.grid()
plt.show()
This is helpful. There is one problem though: the $r^2$ scorer does not penalize additional variables and thus does not fit our methodology. Even though scikit-learn provides additional scoring functions, they do not offer AIC or BIC.
Just for fun, let us build our own LinearRegression class that scores on AIC.
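To see which expression to implement, recall that for an OLS model with Gaussian errors the maximized log-likelihood can be written in terms of the residual sum of squares $\mathrm{RSS}$. Up to an additive constant that is identical for all candidate models,

$$-2\ln(\mathcal L) = n\ln\left(\frac{\mathrm{RSS}}{n}\right) + n\left(1 + \ln(2\pi)\right)\text{,}$$

so $AIC = n\ln(\mathrm{RSS}/n) + 2k + \text{const}$. Since the constant cancels when comparing models, the class below simply drops it.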
# Custom linear regression class that scores on the (negative) AIC
class AICLinearRegression(LinearRegression):
    def score(self, X, y):
        n = y.shape[0]
        y_pred = self.predict(X)
        mse = mean_squared_error(y, y_pred)  # = RSS / n
        # AIC up to an additive constant under Gaussian errors;
        # X.shape[1] counts the fitted coefficients
        aic = n * np.log(mse) + 2 * X.shape[1]
        # Negate because the SequentialFeatureSelector regards higher values as better
        return -aic
sfs = SFS(
AICLinearRegression(),
k_features=(1, 12),
forward=True,
floating=False,
cv=None,
)
sfs = sfs.fit(X_train, y_train)
# Transform the dataset
X_train_sfs = sfs.transform(X_train)
X_test_sfs = sfs.transform(X_test)
# Fit the estimator using the reduced dataset
estimator = AICLinearRegression().fit(X_train_sfs, y_train)
# Calculate RMSE for training data
train_preds = estimator.predict(X_train_sfs)
train_rmse = mean_squared_error(y_train, train_preds, squared=False)
# Calculate RMSE for test data
test_preds = estimator.predict(X_test_sfs)
test_rmse = mean_squared_error(y_test, test_preds, squared=False)
print("Training RMSE: ", train_rmse)
print("Test RMSE: ", test_rmse)
Training RMSE: 91.9769045548552
Test RMSE: 91.23237008905065
rmses.loc[len(rmses)] = ['mlxtend SFS model', train_rmse, test_rmse]
_, ax = plt.subplots(figsize=(10,6))
rmses.plot(kind="bar", x="name", ax=ax)
ax.set_xlabel("")
ax.set_ylabel("RMSE in mm")
plt.show()
Let us leave it at that for now. A class using a BIC scorer could be built along the same lines; as this is not our focus here, we only sketch it below and then continue.
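A minimal sketch, assuming the same conventions as `AICLinearRegression` above (a hypothetical class, not part of any library):

# Hypothetical BIC-scoring variant; ln(n) * k replaces the 2 * k penalty of the AIC
class BICLinearRegression(LinearRegression):
    def score(self, X, y):
        n = y.shape[0]
        y_pred = self.predict(X)
        mse = mean_squared_error(y, y_pred)  # = RSS / n
        # BIC up to an additive constant under Gaussian errors
        bic = n * np.log(mse) + np.log(n) * X.shape[1]
        return -bic  # negate: the SequentialFeatureSelector maximizes the score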
rmses.reset_index().to_feather('30222_rmses.feather')
Parts of this section are inspired by the blog post Feature Selection Techniques by Maria Gusarova.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.