As noted in the introductory part, doubly constrained feature scales are closely related to compositional constraints.
Let us imagine a variable $\mathbf{y}$ in an open interval, i.e. $\mathbf{y}=(y_i\in\ ]a,b[_\mathbb R,\ i=1,\cdots,m)$, which lies within a constraining interval $[l,u]\subset \mathbb R$ with $l<a$ and $b<u$. The absolute value is obviously of minor interest compared with the relative position between $l$ and $u$. A first step should therefore be a linear min-max transformation with min $=l$ and max $=u$:
$$ y \to y' =\frac{y-l}{u-l} \in\ ]0,1[\ \subset \ {\mathbb R}_+ $$
Now, the transformed variable $y'$ divides the unit interval into two parts, $y'$ and $1-y'$. Both sum up to 1 and therefore form a 2-dimensional composition.
Applying an alr-transformation to our variable, we obtain a symmetric feature scale: $$ y'\to y''=\log\left(\frac{y'}{1-y'}\right)\in \mathbb R$$
This so-called logistic transformation enables the application of algebraic operations over the base field $\mathbb R$.
The back transformations are: $$\begin{align} y''\to y'=\frac{e^{y''}}{1+e^{y''}}&& \text{and} && y'\to y=y'\cdot(u-l)+l \end{align}$$
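A minimal numerical sketch of this transformation chain (the limits l and u below are purely illustrative values chosen for demonstration):

import numpy as np

# illustrative limits (hypothetical values, only for demonstration)
l, u = 0.0, 10.0
y = np.array([1.5, 5.0, 8.2])          # observations inside ]l, u[

# min-max scaling to ]0, 1[
y_prime = (y - l) / (u - l)

# logistic (alr-type) transformation to the real line
y_double_prime = np.log(y_prime / (1 - y_prime))

# back transformations
y_prime_back = np.exp(y_double_prime) / (1 + np.exp(y_double_prime))
y_back = y_prime_back * (u - l) + l

print(np.allclose(y, y_back))          # True: the chain is invertible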
The National Snow and Ice Data Center (NSIDC) at the University of Colorado in Boulder hosts and provides monthly sea ice data for both hemispheres from November 1978 to the present. Against the background of a warming Earth system, we will estimate the changes of the September ice cover for the northern hemisphere and project a possible trend until the end of this century.
Loading the data and plotting the time series
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# open the data set as csv file using pandas
df = pd.read_csv('ftp://sidads.colorado.edu/DATASETS/NOAA/G02135//north/monthly/data/N_09_extent_v3.0.csv')
df.head()
| | year | mo | data-type | region | extent | area |
|---|------|----|-----------|--------|--------|------|
| 0 | 1979 | 9 | Goddard | N | 7.05 | 4.58 |
| 1 | 1980 | 9 | Goddard | N | 7.67 | 4.87 |
| 2 | 1981 | 9 | Goddard | N | 7.14 | 4.44 |
| 3 | 1982 | 9 | Goddard | N | 7.30 | 4.43 |
| 4 | 1983 | 9 | Goddard | N | 7.39 | 4.70 |
# be aware of the spaces in the keys:
df.keys()
Index(['year', ' mo', ' data-type', ' region', ' extent', ' area'], dtype='object')
# use seaborn for plotting
import seaborn as sns
sns.set()
sns.scatterplot(data = df, x='year', y = ' extent')
plt.title("Ice cover extent september")
plt.ylabel('million sqrkm')
Now we may estimate the linear trend using seaborn.regplot. If we set truncate = True, the regression line is bounded by the data limits; if we set truncate = False, it extends to the x-axis limits.
plt.xlim(1975,2100)
plt.ylim(0,10)
sns.regplot(data = df, x='year', y = ' extent', truncate = False, ci = 95)
plt.title("Prediction of the ice cover extent september")
plt.ylabel('million sqrkm')
Extrapolating the linear model of the observed September ice cover extent of the northern hemisphere (1979 onward) into the future, we have to expect, within the 95% confidence band, a complete disappearance of the Arctic sea ice in late summer in the second half of this century!
In 2035 we may expect a September ice cover extent (ICE), at the 95% confidence level, between 2.07 and 4.41 million sqrkm, i.e. between 51.83% and 110.3% of last September's extent.
In 2050 we may expect a September ICE, at the 95% confidence level, between 0.8 and 3.31 million sqrkm, i.e. between 20.09% and 82.77% of last September's extent.
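The interval estimates above are read off the seaborn plot. One way to reproduce analogous values is to fit the same linear model with statsmodels and query the 95% confidence interval of the mean prediction for selected years; the following is a minimal sketch (the exact numbers depend on the data version downloaded):

import pandas as pd
import statsmodels.api as sm

# linear model extent ~ year on the untransformed data
X = sm.add_constant(df['year'])
ols = sm.OLS(df[' extent'], X).fit()

# 95% confidence intervals of the mean prediction for 2035 and 2050
years = pd.DataFrame({'const': 1.0, 'year': [2035, 2050]})
pred = ols.get_prediction(years)
print(pred.summary_frame(alpha=0.05)[['mean', 'mean_ci_lower', 'mean_ci_upper']])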
The feature scale of the Arctic ice extent in million square kilometers is clearly framed by a lower and an upper limit: the extent can definitely neither be negative nor pass the equator. Thus, the lower limit is trivial, but the upper limit is unknown.
For the sake of simplicity, we initially use the maximum observed extent as the upper limit (the maximum observation itself has to be treated separately below) and start a logistic approach:
Logistic Transformation
We start by scaling the data to the standard simplex $]0,1[\ \subset {\mathbb R}_+$ by dividing the data by the maximum as an estimate of the upper limit. Afterwards we map the data to the real line by the logistic transformation and plot the transformed time series.
# estimate the upper limit by the maximum observed extent
max_extent = df[' extent'].max()
df[' extent lrt'] = df[' extent']/max_extent
# if we hit the maximum, the outcome is 1. For this value we cannot calculate the
# logarithm of y'/(1-y'). Therefore, we approximate it by, e.g., 0.999999:
df.loc[df[' extent lrt'] == 1, ' extent lrt'] = 0.999999
# logistic transformation to the real line
df[' extent lrt'] = np.log(df[' extent lrt']/(1-df[' extent lrt']))
## Plot the data:
sns.scatterplot(data = df, x='year', y = ' extent lrt')
plt.title("Ice cover extent after logistic transformation")
plt.ylim(0,5)
plt.ylabel('ICE logistic space')
Now, we can fit a linear model in the log-space:
plt.xlim(1977,2100)
plt.ylim(-6,5)
plot_des = sns.regplot(data = df, x='year', y = ' extent lrt', truncate = False)
plt.axhline(0)
plt.title("Prediction of ICE decrease for september log-space")
plt.ylabel('ICE logistic space')
Interpreting the zero line of the plot above, half of the former ice cover extent (expressed by the maximum of the observations) has disappeared since roughly 2010.
Next, let us transform the model back to the original space by $$ ICE = \frac{ e^{ax+b} }{1+e^{ax+b}}\cdot ICE_{max}$$ where $a$ is the slope, $b$ the intercept and $ICE_{max}$ the maximum observed extent used as the upper limit.
# Since sns.regplot does not provide the coefficients of the regression line, we calculate the
# coefficients with statsmodels. Both approaches estimate the parameters by ordinary least squares, so the results agree.
import statsmodels.api as sm
y = df.iloc[:,6]
x = df.iloc[:,0]
x = sm.add_constant(x)
alpha = 0.05 # 95% confidence interval
model = sm.OLS(y, x)
lr = model.fit()
#lr = sm.OLS(y, sm.add_constant(x)).fit()
#print(lr.params)
print(lr.summary())
# Slope: a
a = lr.params[1]
# Intercept: b
b = lr.params[0]
# Confidence intervals:
lr_ci = lr.conf_int(alpha = 0.05)
# extend the prediction range until the year 2100:
x_1979_2100 = np.arange(1979, 2101)
x_1979_2100 = sm.add_constant(x_1979_2100)
y_1979_2100 = lr.predict(x_1979_2100)
                            OLS Regression Results
==============================================================================
Dep. Variable:             extent lrt   R-squared:                       0.381
Model:                            OLS   Adj. R-squared:                  0.366
Method:                 Least Squares   F-statistic:                     26.42
Date:                Tue, 12 Dec 2023   Prob (F-statistic):           6.40e-06
Time:                        13:34:36   Log-Likelihood:                -86.781
No. Observations:                  45   AIC:                             177.6
Df Residuals:                      43   BIC:                             181.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        202.7756     39.110      5.185      0.000     123.902     281.649
year          -0.1005      0.020     -5.140      0.000      -0.140      -0.061
==============================================================================
Omnibus:                       81.579   Durbin-Watson:                   2.337
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1377.022
Skew:                           4.699   Prob(JB):                    9.63e-300
Kurtosis:                      28.418   Cond. No.                     3.08e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.08e+05. This might indicate that there are strong multicollinearity or other numerical problems.
df[' extent back trafo'] = (np.exp(df[' extent lrt']))/(1+(np.exp(df[' extent lrt'])))*df[' extent'].max()
# predict until 2100:
lr_back = ((np.exp(a*x_1979_2100[:,1]+b))/(1+(np.exp(a*x_1979_2100[:,1]+b))))*df[' extent'].max()
Let us have a look at the scatter plot of the back-transformed data:
sns.scatterplot(data = df, x='year', y = ' extent back trafo')
plt.title("ICE decrease for september log-spac")
plt.ylabel('Sea ice extend in mio. sqrkm')
# In order to plot the logistic regression line (orange line below), we create a new data set
# with the back-transformed data of the linear regression line
d_1979_2100 = {'year': np.array(x_1979_2100)[:,1], 'extent' : lr_back }
df_1979_2100 = pd.DataFrame(data = d_1979_2100)
plt.xlim(1977,2100)
plot_des = sns.regplot(data = df, x='year', y = ' extent back trafo', truncate = False)
plt.axhline(0)
plt.axhline(1)
plt.title("Prediction of ICE decrease for september log-space")
plt.ylabel('ICE logistic space')
sns.lineplot(data = df_1979_2100 , x='year', y = 'extent')
As an exercise, you can back-transform the confidence intervals of the regression line from the log-space to the original space, and you will get the following result:
With 95% confidence, the Arctic ice extent may pass the 1 million sqrkm marker between 2030 and 2070 and disappear completely in the second half of our century. This simple empirical estimation confirms the sophisticated physical modeling approach of Peng et al. (2020).
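A possible sketch for this exercise, assuming the fitted model lr, the prediction matrix x_1979_2100 and the column ' extent' from the cells above (exact years depend on the data version downloaded):

# 95% confidence band of the regression line in logit space
pred_band = lr.get_prediction(x_1979_2100).summary_frame(alpha=0.05)

# back-transform both band limits to million square kilometres
upper_limit = df[' extent'].max()
band_lower = np.exp(pred_band['mean_ci_lower']) / (1 + np.exp(pred_band['mean_ci_lower'])) * upper_limit
band_upper = np.exp(pred_band['mean_ci_upper']) / (1 + np.exp(pred_band['mean_ci_upper'])) * upper_limit

# first year in which each band limit drops below the 1 million sqrkm marker
years = x_1979_2100[:, 1]
print('lower band limit passes 1 mio sqrkm in:', int(years[(band_lower < 1).to_numpy()].min()))
print('upper band limit passes 1 mio sqrkm in:', int(years[(band_upper < 1).to_numpy()].min()))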
As a last exercise, check the residuals of the transformed and the untransformed data, apply the Shapiro-Wilk test for normality to the different distributions, and interpret the results.
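A minimal sketch for this last exercise, assuming the logit-space model lr from above and using scipy.stats.shapiro (the model name ols_raw for the untransformed fit is introduced here only for illustration):

from scipy import stats
import statsmodels.api as sm

# residuals of a linear model on the untransformed extent
ols_raw = sm.OLS(df[' extent'], sm.add_constant(df['year'])).fit()

# Shapiro-Wilk test for both residual distributions
for name, resid in [('untransformed', ols_raw.resid), ('logit space', lr.resid)]:
    stat, p = stats.shapiro(resid)
    print(f'{name}: W = {stat:.3f}, p = {p:.4f}')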
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.