Logistic regression analysis belongs to the class of generalized linear models. We apply the glmnet() function from the eponymous glmnet package to build an \(L_2\)-regularized logistic regression model. You may want to revisit the introductory sections on logistic regression and regularization methods.
Suppose the response variable takes values in \(G=\{0,1\}\). Denote \(y_i=I(g_i=1)\). The model
\[P(G=1|X=x)=\frac{e^{\beta_0+\beta^{T}x}}{1+e^{\beta_0+\beta^{T}x}}\] can be written in the following form:
\[\log\frac{P(G=1|X=x)}{P(G=0|X=x)}=\beta_0+\beta^{T}x,\] the so-called logistic or log-odds transformation.
For the binomial family, glmnet fits the model by minimizing the penalized negative log-likelihood:
\[\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^{N}y_i(\beta_0+x_i^T\beta)-\log\left(1+e^{\beta_0+x_i^T\beta}\right)\right]+\lambda\left[(1-\alpha)||\beta||_2^2/2+\alpha||\beta||_1\right].\]
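Note that for the ridge regression used below (\(\alpha=0\)) the elastic net penalty reduces to the pure \(L_2\) term:
\[\lambda\left[(1-0)\,||\beta||_2^2/2+0\cdot||\beta||_1\right]=\frac{\lambda}{2}\sum_{j=1}^{p}\beta_j^2.\]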
For further details on the computational mechanics, visit the Glmnet Vignette by typing vignette("glmnet_beta") in your console.
The glmnet() function expects an input matrix X and a response vector y. In order to apply logistic regression we need to add the argument family = "binomial" to the function call. In the previous sections we prepared the response vectors and the model matrices for the training and test sets (y_train, y_test, X_train, X_test, X_train_PC3, X_test_PC3, and pca_PC3). Let us load all the data we need into our workspace.
# load model input data from previous section
load("pca_data_30300.RData")
The general procedure is as follows: we fit the model by cross-validation on the training set, predict the response for the test set, and evaluate the model performance by calculating the model accuracy, the confusion matrix and Cohen’s kappa. The confusion matrix is a table that visualizes the performance of a predictive model. Each column of the matrix represents the instances in an actual class, while each row represents the instances in a predicted class (or sometimes vice versa).
Cohen’s kappa coefficient, \(κ\), is a statistic which measures inter-rater agreement for categorical items. It takes a value of \(κ=0\) if the agreement between the items is no better than expected by chance. A value of \(κ=1\) indicates perfect agreement, i.e., a perfect model fit.
The definition of \(κ\) is:
\[\kappa=\frac{p_0-p_e}{1-p_e},\] where \(p_0\) is the relative observed agreement (identical to accuracy), and \(p_e\) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category.
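To make the definition concrete, here is a minimal sketch that computes \(κ\) by hand from a 2×2 confusion matrix; the counts are taken from the baseline model’s confusion matrix shown further below, and the result reproduces the kappa value reported there.
# minimal sketch: Cohen's kappa by hand from a 2x2 confusion matrix
# (rows = predicted class, columns = actual class)
tab <- matrix(c(16, 6, 2, 135), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference = c("0", "1")))
n <- sum(tab)
p0 <- sum(diag(tab)) / n # observed agreement (accuracy)
pe <- sum(rowSums(tab) * colSums(tab)) / n^2 # chance agreement
(p0 - pe) / (1 - pe) # kappa, approximately 0.77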
In the baseline model we predict the response variable HOT.DAY by including all 15 features in the data set. We use the \(L_2\)-regularized logistic regression model (ridge regression: alpha = 0) provided by the glmnet package. In particular we apply the cv.glmnet() function, which by default performs 10-fold cross-validation on the data set and evaluates the model performance by calculating the misclassification error, given the additional argument type.measure = "class" in the function call.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.2.3
grid <- 10^seq(1, -3, length = 100) # set lambda sequence
m_log_baseline <- cv.glmnet(X_train,
                            y_train,
                            family = "binomial",
                            type.measure = "class", # misclassification error
                            alpha = 0, # ridge regression
                            lambda = grid)
Calling the plot() function on the cv.glmnet object provides a nice graphical representation of the relation between the misclassification error and \(λ\).
plot(m_log_baseline)
The plot shows that the misclassification error increases with \(λ\). The dotted vertical lines indicate the value of \(λ\) that gives the minimum mean cross-validated error and the largest value of \(λ\) such that the error is within one standard error of the minimum. We can access these values of \(λ\) via the object attributes lambda.min and lambda.1se.
m_log_baseline$lambda.min
## [1] 0.01963041
m_log_baseline$lambda.1se
## [1] 0.04132012
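We can also inspect the fitted ridge coefficients at a selected value of \(λ\) with the coef() function; a quick sketch using lambda.min:
# inspect the fitted coefficients at the CV-selected lambda
coef(m_log_baseline, s = "lambda.min")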
Finally, we use the test data set, X_test, and the best \(λ\) to predict the response vector y_test. We have to add an additional argument to the predict() function in order to indicate whether we want the function to return class labels (type = "class") or probabilities (type = "response").
# prediction for the test set
m_log_baseline_pred <- predict(m_log_baseline, # learned model
                               X_test, # new data
                               type = "class", # class labels
                               s = "lambda.min") # lambda minimizing the CV error
tbl <- table(m_log_baseline_pred)
tbl
## m_log_baseline_pred
## 0 1
## 18 141
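If we are interested in the predicted probabilities rather than the class labels, we can rerun the prediction with type = "response"; a minimal sketch using the same model object:
# predicted probabilities P(HOT.DAY = 1 | x) instead of class labels
m_log_baseline_prob <- predict(m_log_baseline,
                               X_test,
                               type = "response",
                               s = "lambda.min")
head(m_log_baseline_prob)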
To assess the model performance we use the functionality of the caret package. The function confusionMatrix() creates a tabular form of the confusion matrix. In addition, the confusionMatrix object stores many other model metrics. Here we just ask for the accuracy and Cohen’s kappa.
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
cm <- confusionMatrix(data = as.factor(m_log_baseline_pred), reference = y_test)
cm$table
## Reference
## Prediction 0 1
## 0 16 2
## 1 6 135
cm$overall["Accuracy"]
## Accuracy
## 0.9496855
cm$overall["Kappa"]
## Kappa
## 0.7715517
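Beyond accuracy and Cohen’s kappa, the confusionMatrix object exposes further per-class metrics via its byClass attribute; for example, sensitivity and specificity:
# further metrics stored in the confusionMatrix object
cm$byClass[c("Sensitivity", "Specificity")]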
The model performs very well and beats our accuracy benchmark of \(0.86\). The two model metrics, accuracy and Cohen’s kappa, are 0.95 and 0.77, respectively. In terms of classification accuracy we count 16 true negative and 135 true positive classifications. In contrast, we misclassified only 8 observations (6 false negatives and 2 false positives).
Now we build an \(L_2\)-regularized logistic regression model to predict the response variable HOT.DAY by including just the first three principal component scores. As in the previous section, we compute the confusion matrix, the model accuracy and Cohen’s kappa for the test set to evaluate the goodness of fit of the model.
# build the model
grid <- 10^seq(1, -3, length = 100) # set lambda sequence
m_log_PC3 <- cv.glmnet(X_train_PC3,
                       y_train,
                       family = "binomial",
                       type.measure = "class", # misclassification error
                       alpha = 0, # ridge regression
                       lambda = grid)
plot(m_log_PC3)
# prediction for the test set
m_log_PC3_pred <- predict(m_log_PC3, # learned model
                          X_test_PC3, # new data
                          type = "class", # class labels
                          s = "lambda.min") # lambda minimizing the CV error
cm <- confusionMatrix(data = as.factor(m_log_PC3_pred), reference = y_test)
cm$table
## Reference
## Prediction 0 1
## 0 16 4
## 1 6 133
cm$overall["Accuracy"]
## Accuracy
## 0.9371069
cm$overall["Kappa"]
## Kappa
## 0.7257675
Compared to the baseline model, the PC3 model still performs very well. Its accuracy is higher than our accuracy benchmark of \(0.86\). The two model metrics, accuracy and Cohen’s kappa, are 0.94 and 0.73, respectively. We count 16 true negative and 133 true positive classifications. In contrast, we misclassified only 10 observations (6 false negatives and 4 false positives).
Please note that we projected a 15-dimensional data set onto a 3-dimensional feature space, and still the model performance is remarkable.
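As a quick sanity check of the dimensionality reduction, we may compare the shapes of the two model matrices (assuming both are still in the workspace):
# compare the dimensions of the full feature matrix and the PC score matrix
dim(X_train) # number of observations x 15 features
dim(X_train_PC3) # number of observations x 3 principal component scores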
Let us visualize the classification results of the PC3 model on a map.
library(raster)
library(ggplot2)
# Retrieve Federal States by the getData() function from the raster package
G1 <- getData(country = "Germany", level = 1)
## Warning in getData(country = "Germany", level = 1): getData will be removed in a future version of raster
## . You can use access these data, and more, with functions from the geodata package instead
# Predict the occurrence of hot days at a particular weather station based on the PC3 model (m_log_PC3) for the full data set pca_PC3
m_log_PC3_full <- predict(m_log_PC3, # learned model
                          pca_PC3, # new data
                          type = "class", # class labels
                          s = "lambda.min") # lambda minimizing the CV error
# add a new column called "Classification" to the data set
# "FN" for false negative,
# "TN" for true negative,
# "FP" for false positive and
# "TP" for true positive
dwd_data$PC3_prediction <- m_log_PC3_full # assign prediction vector to data set
dwd_data$Classification <- NA # initialize empty column (assigning NULL would drop it)
# populate column
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 1] <- "TP"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 0] <- "TN"
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 0] <- "FN"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 1] <- "FP"
# plot the map
ggplot() +
  geom_polygon(data = G1,
               aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#FFFFFF") +
  geom_point(data = dwd_data,
             aes(x = LON, y = LAT, colour = Classification),
             alpha = 0.7,
             size = 2) +
  scale_colour_manual(values = c("#D55E00",
                                 "#CC79A7",
                                 "#56B4E9",
                                 "#009E73")) +
  theme_bw() +
  xlab("Longitude") + ylab("Latitude") +
  coord_map()
The map shows an interesting pattern. First of all we realize that the data set is quite imbalanced, as there are many more positives than negatives. We see that the negatives (weather stations that have not recorded a hot day so far) are clustered: one cluster is located in the Alpine region and the Alpine foothills, another in the Central German Uplands, and another along the coastal regions of Germany. It is very interesting that the classification algorithm correctly predicted all observations in the southern part of Germany. In the central parts of Germany the algorithm sometimes failed to correctly predict positives as well as negatives. However, one pattern is very striking: in the coastal regions of Northern Germany the logistic regression model performed, in general, quite well, except for those weather stations located in the federal state of Schleswig-Holstein. In this particular coastal region nearly all class predictions turned out to be wrong. This is an interesting result, which warrants a more thorough investigation.
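To quantify the pattern seen on the map, we can tabulate the classification outcomes stored in the newly created column; a quick sketch:
# tabulate the classification outcomes underlying the map
table(dwd_data$Classification)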