Logistic regression analysis
belongs to the class of generalized linear models. We apply
the glmnet() function from the eponymous
glmnet package to build an \(L_2\)-regularized logistic regression
model, which means that large coefficients are penalized by their squared
size to prevent overfitting. You may want to revisit the introductory
sections on logistic regression and regularization methods.
Suppose the response variable takes values in \(G=\{0,1\}\). Denote \(y_i=I(g_i=1)\). The model
\[P(G=1|X=x)=\frac{e^{\beta_0+\beta^{T}x}}{1+e^{\beta_0+\beta^{T}x}}\] can be written in the following form:
\[\log\frac{P(G=1|X=x)}{P(G=0|X=x)}=\beta_0+\beta^{T}x,\] the so-called logistic or log-odds transformation. The model parameters are estimated by minimizing the penalized objective
\[\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^{N}y_i(\beta_0+x_i^T\beta)-\log\left(1+e^{\beta_0+x_i^T\beta}\right)\right]+\lambda\left[(1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\right],\]
which reduces to ridge-penalized logistic regression for \(\alpha=0\) and to the lasso for \(\alpha=1\).
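To make the log-odds transformation concrete, here is a minimal numeric sketch in R; the coefficients are purely illustrative, not fitted values:
beta0 <- -1.5 # illustrative intercept (not a fitted value)
beta <- c(0.8, 0.3) # illustrative coefficient vector
x <- c(1.2, -0.5) # one observation
log_odds <- beta0 + sum(beta * x) # linear predictor: beta_0 + beta^T x
p <- exp(log_odds) / (1 + exp(log_odds)) # P(G = 1 | X = x)
c(log_odds = log_odds, probability = p)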
For further details on the computational mechanics, consult the
Glmnet Vignette by typing
vignette("glmnet_beta") in your console.
The glmnet() function expects an input matrix
X and the response vector y. To fit
a logistic regression model we add the argument
family = "binomial" to the function call. In the previous
sections we prepared the response vectors and the model matrices for the
training and test sets (y_train, y_test,
X_train, X_test, X_train_PC3,
X_test_PC3, and pca_scores_3). Let us load all
the data we need into our workspace.
# download the model input data from the previous section
# (skip this step if the file is already available locally)
download.file("https://userpage.fu-berlin.de/soga/data/r-data/pca_data_30300.RData",
              destfile = "pca_data_30300.RData", mode = "wb")
# load model input data into the workspace
load("pca_data_30300.RData")
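As a quick sanity check, we can confirm that the objects arrived in the workspace (assuming the .RData file contains the objects named above):
# list the objects restored from the .RData file
ls()
# dimensions of the training model matrix (rows = stations, columns = features)
dim(X_train)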
The general procedure is as follows: we fit the model on the training set, predict the response for the test set, and evaluate the model performance by calculating the model accuracy, the confusion matrix and Cohen’s kappa. The confusion matrix is a table that visualizes the performance of a predictive model. Each column of the matrix represents the instances in an actual class, while each row represents the instances in a predicted class (or sometimes vice versa).
Cohen’s kappa coefficient, \(κ\), is a statistic which measures inter-rater agreement for categorical items. It takes a value of \(κ=0\) if the agreement is entirely due to chance. A value of \(κ=1\) indicates perfect agreement.
The definition of \(κ\) is:
\[\kappa=\frac{p_0-p_e}{1-p_e},\] where \(p_0\) is the relative observed agreement (identical to accuracy), and \(p_e\) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category.
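As a quick illustration of the formula, \(κ\) can be computed by hand from any 2×2 confusion matrix; the sketch below uses counts that happen to match the baseline model’s confusion matrix derived later in this section:
# 2x2 confusion matrix: rows = predicted class, columns = actual class
tab <- matrix(c(27, 3, 1, 17), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference = c("0", "1")))
n <- sum(tab)
p0 <- sum(diag(tab)) / n # observed agreement (identical to accuracy)
pe <- sum(rowSums(tab) * colSums(tab)) / n^2 # hypothetical chance agreement
(p0 - pe) / (1 - pe) # Cohen's kappa, approx. 0.83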
In the baseline model we predict the response variable
HOT.DAY by including all 13 features in the data set. We
use the \(L_2\)-regularized logistic
regression model (ridge regression: alpha = 0) provided
by the glmnet package. In particular, we apply the
cv.glmnet() function, which by default performs 10-fold cross-validation on the data set and,
given the additional argument type.measure = "class", evaluates the
model performance by the misclassification error.
library(glmnet)

grid <- 10^seq(1, -3, length = 100) # set lambda sequence

set.seed(200)
m_log_baseline <- cv.glmnet(X_train,
                            y_train,
                            family = "binomial",
                            type.measure = "class", # misclassification error
                            alpha = 0, # ridge regression
                            lambda = grid)
Calling the plot() function on the
cv.glmnet object provides a nice graphical representation
of the relation between misclassification error and \(λ\) (on the log scale). The log
transformation is applied for visualization, because \(λ\) covers a wide numeric range. This is
purely a plotting choice to make interpretation easier; all computations
use the original \(λ\) values.
plot(m_log_baseline)
The plot shows that the misclassification error increases with \(λ\). The dotted vertical lines indicate the
value of \(λ\) that gives the minimum
mean cross-validated error and the largest value of \(λ\) such that the error is within 1
standard error of the minimum. We can access these values of \(λ\) via the object attributes
lambda.min and lambda.1se.
m_log_baseline$lambda.min
## [1] 0.001204504
m_log_baseline$lambda.1se
## [1] 0.0869749
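Before turning to prediction, it can be instructive to inspect the ridge-shrunken coefficients at the selected \(λ\). The coef() accessor for cv.glmnet objects returns them (an optional check; the output is omitted here):
# regularized coefficients at the lambda minimizing the cross-validated error
coef(m_log_baseline, s = "lambda.min")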
Finally, we use the test data set, X_test, and the best
\(λ\) to predict the response vector
y_test. We have to pass an additional argument to the
predict() function to indicate whether it should return
class labels (type = "class") or
probabilities (type = "response").
# prediction for the test set
m_log_baseline_pred <- predict(m_log_baseline, # learned model
                               X_test, # new data
                               type = "class", # class labels
                               s = "lambda.min") # lambda minimizing the CV error
tbl <- table(m_log_baseline_pred)
tbl
## m_log_baseline_pred
## 0 1
## 28 20
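For comparison, the same call with type = "response" returns the predicted probabilities instead of class labels; thresholding them at 0.5 reproduces the class labels above (a brief sketch):
# predicted probabilities P(HOT.DAY = 1 | x) for the test set
m_log_baseline_prob <- predict(m_log_baseline,
                               X_test,
                               type = "response",
                               s = "lambda.min")
# class labels via a 0.5 probability threshold
head(ifelse(m_log_baseline_prob > 0.5, "1", "0"))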
To assess the model performance we use the functionality of the
caret package. The function confusionMatrix()
creates a tabular representation of the confusion matrix. In addition, the
confusionMatrix object stores many other model metrics.
Here we just ask for the accuracy and Cohen’s
kappa. Accuracy measures the proportion of correct predictions,
while Cohen’s kappa indicates how well the predicted classes agree with
the actual classes after accounting for chance.
library(caret)
cm <- confusionMatrix(data = as.factor(m_log_baseline_pred), reference = y_test)
cm$table
## Reference
## Prediction 0 1
## 0 27 1
## 1 3 17
cm$overall["Accuracy"]
## Accuracy
## 0.9166667
cm$overall["Kappa"]
## Kappa
## 0.826087
The model performs very well and beats our accuracy benchmark of \(0.63\). The two model metrics, accuracy and Cohen’s kappa, are 0.92 and 0.83, respectively. In terms of classification accuracy we count 27 true negative and 17 true positive classifications. In contrast, we misclassified only 4 observations (1 false negative and 3 false positives).
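As an aside, caret also stores the no-information rate, the accuracy obtained by always predicting the majority class, in the confusionMatrix object; on this test set it is close to the benchmark of \(0.63\) (a quick check):
# no-information rate: accuracy of always predicting the majority class
cm$overall["AccuracyNull"]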
Now, we build an \(L_2\)-regularized
logistic regression model to predict the response variable
HOT.DAY by including just the first three principal
component scores. As in the previous section, we compute the confusion
matrix, the model accuracy and Cohen’s kappa for the test set to
evaluate the goodness of fit of the model.
# build the model
grid <- 10^seq(1, -3, length = 100) # set lambda sequence

set.seed(200)
m_log_PC3 <- cv.glmnet(X_train_PC3,
                       y_train,
                       family = "binomial",
                       type.measure = "class", # misclassification error
                       alpha = 0, # ridge regression
                       lambda = grid)
plot(m_log_PC3)
# prediction for the test set
m_log_PC3_pred <- predict(m_log_PC3, # learned model
                          X_test_PC3, # new data
                          type = "class", # class labels
                          s = "lambda.min") # lambda minimizing the CV error
cm <- confusionMatrix(data = as.factor(m_log_PC3_pred), reference = y_test)
cm$table
## Reference
## Prediction 0 1
## 0 27 4
## 1 3 14
cm$overall["Accuracy"]
## Accuracy
## 0.8541667
cm$overall["Kappa"]
## Kappa
## 0.6853933
The model performs reasonably well compared to the baseline model. The PC3 model accuracy of 0.85 is higher than our accuracy benchmark of \(0.63\). Cohen’s kappa is 0.69, indicating substantial agreement between the predicted and the actual classes beyond chance; note, however, that this metric can vary with different random data splits.
We count 27 true negative and 14 true positive classifications. In contrast, we misclassified only 7 observations (4 false negatives and 3 false positives).
Please note that we projected the 13-dimensional data set onto a 3-dimensional feature space. While the model still captures the main patterns, its performance should be interpreted with some caution given the variability of Cohen’s kappa across different data splits.
Let us visualize the classification results of the PC3 model on a map.
library(geodata)
library(sf)
library(ggplot2)
# Retrieve Federal States using the gadm() function from the geodata package
G1 <- geodata::gadm(country = "DEU", level = 1, tempdir())
G1_sf <- st_as_sf(G1)
# Predict the occurrence of hot days at each weather station based on the PC3 model (m_log_PC3), using the full-data PC scores pca_scores_3
m_log_PC3_full <- predict(m_log_PC3, # learned model
                          pca_scores_3, # new data
                          type = "class", # class labels
                          s = "lambda.min") # lambda minimizing the CV error
# add a new column called "Classification" to the data set:
# "FN" for false negative,
# "TN" for true negative,
# "FP" for false positive and
# "TP" for true positive
dwd_data$PC3_prediction <- as.vector(m_log_PC3_full) # assign prediction vector to data set
dwd_data$Classification <- NA # initialize empty column (NULL would drop the column)
# populate column
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 1] <- "TP"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 0] <- "TN"
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 0] <- "FN"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 1] <- "FP"
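As a quick consistency check before plotting, we can tabulate the four outcome categories across all stations (the exact counts depend on the full data set):
# frequency of each classification outcome
table(dwd_data$Classification)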
# plot the map
ggplot() +
  geom_sf(data = G1_sf,
          colour = "grey10", fill = "#FFFFFF") +
  geom_point(data = dwd_data,
             aes(x = lon, y = lat, colour = Classification),
             alpha = 0.7,
             size = 2) +
  scale_colour_manual(values = c("#D55E00", # FN (orange)
                                 "#CC79A7", # FP (pink)
                                 "#56B4E9", # TN (blue)
                                 "#009E73")) + # TP (green)
  theme_bw() +
  xlab("Longitude") + ylab("Latitude") +
  coord_sf()
The map shows an interesting pattern. First, we notice that the data set is slightly imbalanced, with non-hot days (the majority class) occurring more frequently than hot days (the minority class). This imbalance leads the model to favor the majority class, which results in many true negatives but also many false negatives. In other words, the model achieves high overall accuracy simply by predicting the more frequent class, which is problematic because it fails to identify a significant number of actual hot days, exactly the cases we are most interested in.
Looking at the spatial distribution, we see that true negatives (blue) dominate in the northern and western regions, where non-hot days are common and correctly predicted. True positives (green) are concentrated in the southern and southeastern regions, where hot days occur more often and the model performs better. However, false negatives (orange) are clustered in the transition zones between cooler and hotter regions, such as central Germany and southern edges, indicating that the model often misses hot days in borderline cases. False positives (pink), although very rare, also seem to appear in these transitional areas.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.