Double constrained feature spaces: the logistic transformation

As introduced in the introductional part double contraint feature scales have a certain relation to compositional constraints.
Let us imagine a variable \(\mathbf{y}=(y_i\in ]a,b[_\mathbb R,i=1,\cdots,m)\) within a constrained interval \([l,u]\subset \mathbb R\) with \(l<a\) and \(b<u\). The absolute value is obviously of minor interest compared with the relative position between l and u. A first step should be a linear min-max-transformation with min=l and max=u:
\[ y \to y' =\frac{y-l}{u-l} \in\ ]0,1[\ \in \ {\mathbb R}_+ \]
Now, y’ divides the interval into two parts:
1. the distance to the left (=y) and
2. the distance to the right (1-y).
Both sum up to 1 and are therefore related to a 2-dimensional composition.

Applying an alr-transformation to our variable, we are getting a symmetric feature scale:
\[ y'\to y''=log(\frac{y'}{1-y'})\in \mathbb R\]
This so-called logistic transformation enable application of algebraic operation over the base field \(\mathbb R\).
The back transformations are:
\[\begin{align} y''\to y'=\frac{e^{y''}}{1+e^{y''}}&& and && y'\to y=(y'+l)\cdot(u-l) \end{align}\]


Example: Northern hemispheric ice extent under climate change

The National Snow and Ice Data Center at the University of Colorado in Boulder hosts and provides monthly sea ice data for both hemispheres since November 1978 to present. Under the condition of a warming earth system, we will estimate the changes for the northern hemisphere and project a possible trend until the end of this century for September.

Loading the data and plotting the time series

N09<-read.csv("ftp://sidads.colorado.edu/DATASETS/NOAA/G02135//north/monthly/data/N_09_extent_v3.0.csv")
plot(N09$year,N09$extent,main="ice cover extent september",xlab = "Year",ylab = "million sqrkm")


Now, we may estimate a linear trend:

Year<-1979:2100
model09<-lm(extent~year,data = N09)
conf09 <- predict(model09, newdata = list('year'=Year), interval="confidence", level = 0.95)
pred09 <- predict(model09, newdata = list('year'=Year),  interval="prediction", level = 0.95)

plot(N09$year, N09$extent,xlim = c(1979,2100),ylim = c(0,10),
     main = "Prediction of Arctic Ice Cover extent decrease for september",
     xlab = "Year",ylab = " Sea ice extent [mio. sqrkm]")
abline(model09, col='red')
lines(Year, conf09[,'lwr'])
lines(Year, conf09[,'upr'])
lines(Year, pred09[,'lwr'], lty=2)
lines(Year, pred09[,'upr'], lty=2)
abline(h=0,col='blue')

legend("topright",
       legend = c('observations', 'regression line', 'confidence bands', 'prediction bands'), 
       cex = 0.7, 
       col = c('black', 'red', 'black','black'), 
       lwd = c(NA, 1, 1, 1), 
       lty = c(NA, 1, 1, 2),
       pch = c(1, NA, NA, NA))


Modeling the observed ice cover extent from 1979 to 2020 for the northern hemisphere in September for the future, we have to expect a complete disappearance of the arctic ice in late summer with 95% probability within the second half of this century!

In 2035 we may expect an ICE in september with 95% confidence level between 2.07 and 4.41 milion sqrkm, therefore between 51.83% and 110.3% of last september extent.

In 2050 we may expect an ICE in September with 95% confidence level between 0.8 and 3.31 milion sqrkm, therefore between 20.09% and 82.77% of last september extent.
>The feature scale of the arctic ice extent in million square kilometer is clearly framed by a lower and upper limit: The extent definitly can neither be negative nor pass the equator. Thus, the lower limit is trivial but the upper limit unknown.

For the sake of simplicity we initially use the maximum extent as upper limit and remove this maximum from our time series and start a logistic approach:

logistic Transformation{ We start by scaling the data to the standard simplex \(]0,1[_{\mathbb R^+}\) by deviding the data smaller the maximum through the maximum as estimate of the upper limit. Afterward we transform the standard simplex by logistic transformation and plot the transfomed time series.

N09_frame<-(N09$extent[N09$extent<max(N09$extent)]/(max(N09$extent)))
N09_lrt<-log(N09_frame/(1-N09_frame))
plot(N09$year[N09$extent<max(N09$extent)],N09_lrt[N09$extent<max(N09$extent)],
     xlab="year",ylab = "lrt",main = "extent after logistic transformation")

Estimating the trend

Now, we can fit a linear model in the log-space:

Year<-1979:2100
year<-N09$year[N09$extent<max(N09$extent)]
data_log_ex<-data.frame(N09_lrt , year)

model09_log<-lm(N09_lrt ~ year,data = data_log_ex)
conf09_log <- predict.lm(model09_log, newdata = data.frame(year=Year), interval="confidence", level = 0.95)
pred09_log <- predict(model09_log, newdata = data.frame(year=Year),  interval="prediction", level = 0.95)

plot(N09$year[N09$extent<max(N09$extent)], N09_lrt,xlim = c(1979,2100),ylim = c(-6,6),
     main = "Prediction of ICE decrease for september log-space",
     xlab="Year",
     ylab="ICE logistic space")
abline(model09_log, col='red')
lines(Year, conf09_log[,'lwr'])
lines(Year, conf09_log[,'upr'])
lines(Year, pred09_log[,'lwr'], lty=2)
lines(Year, pred09_log[,'upr'], lty=2)
abline(h=0,col='blue')

legend("topright",
       legend = c('observations', 'regression line', 'confidence bands', 'prediction bands'), 
       cex = 0.7, 
       col = c('black', 'red', 'black','black'), 
       lwd = c(NA, 1, 1, 1), 
       lty = c(NA, 1, 1, 2),
       pch = c(1, NA, NA, NA))


Interpreting the zero base of the upper plot, half of the former ice cover extent (expressed by the maximum of observation) has dissapeared since roughly 2010.

September extent origin-space Next, let us estimate let us transform the model including confidence and prediction bands in the original space by: \[ICE=\frac{e^{bx+a}}{1+e^{bx+a}}\]

#backtransformation
conf09_origin<-as.data.frame(exp(conf09_log)/(1+exp(conf09_log)))*max(N09$extent)
pred09_origin<-as.data.frame(exp(pred09_log)/(1+exp(pred09_log)))*max(N09$extent)

plot(N09$year[N09$extent<max(N09$extent)], N09$extent[N09$extent<max(N09$extent)],
     xlim = c(1979,2100),ylim = c(0,10),
     main = "Prediction of Arctic Ice Cover extent decrease for september ",
     xlab = "Year",ylab = " Sea Ice Extent [mio. sqrkm]")
abline(model09, col='red',lwd=0.5)
lines(Year, conf09_origin[,'fit'],col='black')
lines(Year, conf09_origin[,'lwr'],col='green')
lines(Year, conf09_origin[,'upr'],col='green')
lines(Year, pred09_origin[,'lwr'],col='green', lty=2,lwd=0.8)
lines(Year, pred09_origin[,'upr'],col='green', lty=2,lwd=0.8)
abline(h=0,col='blue')
abline(h=1,col='red')

legend("topright",
       legend = c('observations', 'lin.regression line', 'logistic regression','confidence bands', 'prediction bands'), 
       cex = 0.7, 
       col = c('black', 'red', 'black','green','green'), 
       lwd = c(NA, 1, 1, 1,1), 
       lty = c(NA, 1, 1,1, 2),
       pch = c(1, NA, NA, NA, NA))

Depressing!!! With 95% probability, the arctic ice extent may pass the 1-million sqrkm marker between 2030 and 2070 and a complete disappearance in the second half of our century.
This simple empirical estimation is able to confirm the sophisticated physical modeling approach by Peng et al. 2020
There is still some work to do:
1. Are the residuals constant?

# are residuals constant?

plot(N09$year[N09$extent<max(N09$extent)],model09_log$residuals,ylim = c(-2,2),
     ylab="residuals logistic model",
     xlab="Year",
     main="Residuals of our inital model")
abline(h=0,col='red')


Ok, there are no conspicuous features and we may assume constant error variance.
However, this model is still based on the maximum as upper boundary, which is relative unlikely.
For the estimation of the upper boundary we may use the same criteria as for the OLS model: minimum sum of square residuals.
2.Estimating upper boundary
We will iterate possible candidates in a spatial resolution of 10,000 sqrkm starting 1 sqrkm above the maximum.

year<-N09$year
u<-seq(max(N09$extent)+0.000001,max(N09$extent)+8,0.01)
modelcomp<-as.data.frame(matrix(NA,nrow=length(u), ncol = 5),
                         col.names=c("u","res","a","b"))
modelcomp[,1]<-u
for (i in 1:length(u)) {
  N09_frame<-(N09$extent/u[i])
  N09_lrt<-log(N09_frame/(1-N09_frame))
  modelcomp[i,2]<-sum(lm(N09_lrt ~ year)$residuals^2)
  modelcomp[i,3:4]<-lm(N09_lrt~year)$coefficients
  modelcomp[i,5]<-sum((modelcomp[i,1]*(exp(modelcomp[i,3]+modelcomp[i,4]*year)/
                      (1+exp(modelcomp[i,3]+modelcomp[i,4]*year)))-N09$extent)^2)
  
}
plot(u,modelcomp[,5],
     xlab = "upper limit [mio.sqrkm]",
     ylab = "sum of squared residuals",
     main = "Estimating upper limit by minimizing sum of squared residuals")
points(u[modelcomp[,5]==min(modelcomp[,5])],
       modelcomp[,5][modelcomp[,5]==min(modelcomp[,5])],
       col='red',pch=16,cex=2.0)


Based on the assumption, that model with the least sum of squared residuals leads to the best estimation, we identified 9.21sqrkm as the maximum extent.
However, before relaxing we should test for normality and variance homogeneity:

N09_minres<-(N09$extent/min(modelcomp[,5]))
  N09_lrt_minres<-log(N09_minres/(1-N09_minres))
  modelminres<-lm(N09_lrt_minres~year)
plot(year,modelminres$residuals,ylim = c(-0.45,0.45))
abline(h=0,col='red')

shapiro.test(modelminres$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  modelminres$residuals
## W = 0.97908, p-value = 0.5974
sum(modelminres$residuals^2)
## [1] 1.555613

Great, again no conspicuous features and a clear decision for normality.
In a final step we will implement this optimal estimation into the model:

Final Prediction

conf09_log_opt <- predict.lm(modelminres, newdata = data.frame(year=Year), interval="confidence", level = 0.95)
pred09_log_opt <- predict(modelminres, newdata = data.frame(year=Year),  interval="prediction", level = 0.95)



conf09_orign_opt<-as.data.frame(exp(conf09_log)/(1+exp(conf09_log)))*max(N09$extent)
pred09_orign_opt<-as.data.frame(exp(pred09_log)/(1+exp(pred09_log)))*max(N09$extent)
plot(N09$year[N09$extent<max(N09$extent)], N09$extent[N09$extent<max(N09$extent)],xlim = c(1979,2100),ylim = c(0,10),
     main = "Prediction of Arctic Ice Cover extent decrease for september ",
     xlab = "Year",ylab = " Sea ice extent [mio. sqrkm]")
abline(model09, col='red',lwd=1.5)

lines(Year, conf09_orign_opt[,'fit'],col='black',lwd=2)
lines(Year, conf09_orign_opt[,'lwr'],col='blue')
lines(Year, conf09_orign_opt[,'upr'],col='blue')
lines(Year, pred09_orign_opt[,'lwr'],col='blue', lty=2,lwd=1.5)
lines(Year, pred09_orign_opt[,'upr'],col='blue', lty=2,lwd=1.5)
abline(h=0,col='green')
abline(h=1,col='green',lty=2)
legend("topright",
       legend = c('observations', 'lin.regression line', 'logistic regression','confidence bands', 'prediction bands'), 
       cex = 0.7, 
       col = c('black', 'red', 'black','blue','blue'), 
       lwd = c(NA, 1, 1, 1,1), 
       lty = c(NA, 1, 1,1, 2),
       pch = c(1, NA, NA, NA, NA))


There is obviously a relevant difference on this visual scale and we performed a physical, mathematical and statistical consistent model.



Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.