There are numerous reasons for scale transformations. Here we focus on the following:

* Scale inhomogeneity
* Algebraic constraints on limited feature scales
* Skewness, symmetry and normality
* Compositional constraints
* Double constrained feature spaces


Scale inhomogeneity

When measuring temperature, we can choose between different units: Kelvin [K], Celsius [°C] or Fahrenheit [°F]. Whereas Kelvin has an absolute origin of 0 K, the other two have arbitrary zero points. A simple shift of -273.15 links the Celsius and the Kelvin scale. In order to build physically meaningful statistical models, such transformations have to be performed first.

Further examples are different currencies, molar and mass weight percentages of e.g. chemical compositions, or length, area and volume units in metres, yards, inches, etc.

Such transformations can be described by a linear relation: \[ y=b\cdot x + a \] The constant \(a\) is the shift or translation value and \(b\) the slope or stretching factor.
In the following chapter, we will call these linear relations scaling (translation & stretching).

Example of a scaling transformation from Celsius to Fahrenheit: \[ \underbrace{y}_{[°F]} = \underbrace{1.8\cdot x_{[°C]} + 32}_{[°F]}\]
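A minimal sketch of this scaling in R (the Celsius values are made up for illustration):

x_celsius<-c(-10,0,21.5,37,100)   # some arbitrary temperatures in °C
x_fahrenheit<-1.8*x_celsius+32    # stretching factor b = 1.8, shift a = 32
x_fahrenheit
## [1]  14.0  32.0  70.7  98.6 212.0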

Algebraic constraints on limited feature scales

Most real-world variables are positive (i.e. hosted on \(\mathbb{R}_+\) or \(]0,+\infty[\)) or are limited to an interval \(]a,b[\ \subset \mathbb{R}_+\). Exceeding these boundaries by algebraic operations such as addition, subtraction, multiplication or division leads to useless or meaningless results.

Example
On the scale of positive real numbers \(\mathbb{R}_+\), multiplication and its inverse operation, division, cannot leave the set of positive real numbers. Thus, the positive real numbers together with multiplication, denoted by \((\mathbb{R}_+,\cdot)\), form a commutative group with 1 as the neutral element. In contrast, the set of positive real numbers together with addition, \((\mathbb{R}_+,+)\), has neither inverse elements nor a neutral element (because 0 is not a positive number). Thus, addition and subtraction cannot be performed without restriction.
But a log-transformation of any variable \[x\in\mathbb{R}_+:\ x\to x'=\log(x)\] opens the scale to \(x'\in\mathbb{R}\) and maps the neutral element 1 of \((\mathbb{R}_+,\cdot)\) onto the neutral element 0 of \((\mathbb{R},+)\)!
The field \((\mathbb{R},+,\cdot)\) allows nearly unlimited operations for statistical purposes (except e.g. \(\sqrt{-1}\) or \(\frac{1}{0}\)).

The inverse transformation, powering \[x'\to x=e^{x'},\]

reverses the log-transformation back to the positive real scale.
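A small R sketch illustrates this with two arbitrary positive numbers: their product becomes a sum on the log scale, and exp() powers the result back.

a<-0.04
b<-250
log(a*b)            # product on the positive real scale ...
## [1] 2.302585
log(a)+log(b)       # ... becomes a sum on the log scale
## [1] 2.302585
exp(log(a)+log(b))  # powering reverses the log-transformation
## [1] 10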

Skewness, symmetry and normality

Let us look at the following example of a data set of measured dissolved Mg (in g/l):

summary(Mg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1086  0.6269  1.0526  1.4219  1.7430 15.1291

Compared with the median, the arithmetic mean appears to be a poor estimate of central tendency for our heavily skewed data.
Let us try the geometric mean as a measure of central tendency, using the Gmean() function from the package DescTools:

library(DescTools)
param<-c("arithm. mean","Median","geomean")
values<-round(c(mean(Mg),median(Mg),Gmean(Mg)),2)
cbind(param,values)
##      param          values
## [1,] "arithm. mean" "1.42"
## [2,] "Median"       "1.05"
## [3,] "geomean"      "1.04"

Obviously, the geometric mean seems to be a far better estimate of the central tendency. The generalized mean inequality \(\bar{x}_{geo}\leq \bar{x}_{arith}\) explains part of the difference for right-skewed distributions, but it is not sufficient as an explanation.

Let us recall the geometric mean: \[ \bar{x}_{geo}=\sqrt[n]{\prod_{i=1}^n x_i}=\prod_{i=1}^n x_i^{\frac{1}{n}}= x_1^{\frac{1}{n}}\cdot x_2^{\frac{1}{n}}\cdots x_n^{\frac{1}{n}}\] The main apparent difference between these two means is the operator: a sum for the arithmetic mean and a product for the geometric mean.
Thus, the arithmetic mean calls for an additive or distance scale, whereas the geometric mean calls for a ratio scale approach.
In our case, we are dealing with a variable clearly hosted on the \(\mathbb{R}_+\)-scale, where only multiplication is allowed without restriction.

But we often need an arithmetic mean for further statistical parameters, e.g.:
- standard deviation
- skewness
- covariance
- correlation
- regression
- etc.

To solve this problem, we can apply the logarithm to the geometric mean:
\[\begin{align} \log(\bar{x}_{geo}) & =\log\left(x_1^{\frac{1}{n}}\cdot x_2^{\frac{1}{n}}\cdots x_n^{\frac{1}{n}}\right) \\ & =\frac{1}{n}\log(x_1)+\frac{1}{n}\log(x_2)+\cdots+\frac{1}{n}\log(x_n) \\ & =\frac{1}{n}\sum_{i=1}^{n}{\log(x_i)}={\overline{\log(x_i)}}_{arith} \end{align}\]

!!! The logarithm of the geometric mean is the arithmetic mean of the logarithms!!!
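We can verify this identity directly on our data (assuming the vector Mg and the package DescTools from above are still loaded):

Gmean(Mg)             # geometric mean on the original scale
exp(mean(log(Mg)))    # exp of the arithmetic mean of the logs gives the same value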
Applying the logarithm to our Mg data yields the following distribution:

log_Mg<-log(Mg)
hist(log_Mg, main = "Dissolved Mg on log-scale", xlab = "log(Mg)",breaks=50)

Using the log-transformation kills two birds with one stone:
1. We transfer our data onto the feature scale \(\mathbb{R}\), which allows nearly unlimited operations concerning differences and ratios!
2. We obtain a less skewed, nearly normal distribution, for which symmetric confidence intervals become meaningful.

A check for normality using shapiro.test() yields:

shapiro.test(log_Mg)
## 
##  Shapiro-Wilk normality test
## 
## data:  log_Mg
## W = 0.99868, p-value = 0.9747

OK, perfectly normal!

Here, we may calculate moments in the usual way:

moments<-c("arithm. mean log(x)","variance log(x)","stand. dev. log(x)","n")
mean_log_Mg<-mean(log_Mg)
var_log_Mg<-var(log_Mg)
sd_log_Mg<-sd(log_Mg)
n<-length(log_Mg)
x<-as.matrix(c(round(mean_log_Mg,3),round(var_log_Mg,3),round(sd_log_Mg,3),n))
res<-as.data.frame(x=x,row.names=moments)
res
##                          V1
## arithm. mean log(x)   0.041
## variance log(x)       0.610
## stand. dev. log(x)    0.781
## n                   500.000

Now, we may calculate confidence bounds in log space:

# 95% confidence bounds: t-quantile qt(0.975, n-1) times the standard error
low_CI<-mean_log_Mg-qt(0.975,n-1)*sd_log_Mg/sqrt(n)
up_CI<-mean_log_Mg+qt(0.975,n-1)*sd_log_Mg/sqrt(n)
CI<-c(low_CI,up_CI)
CI

… and a two-sigma-interval:

low_sigma<-mean_log_Mg-2*sd_log_Mg
up_sigma<-mean_log_Mg+2*sd_log_Mg
sigma<-c(low_sigma,up_sigma)
sigma
## [1] -1.520368  1.602613

Next, we transform our results back onto the original feature scale:

mean_Mg<-exp(mean_log_Mg)   # back-transformed mean = geometric mean
var_Mg<-exp(var_log_Mg)
sd_Mg<-exp(sd_log_Mg)
CI_origin<-exp(CI)
two_sigma<-exp(sigma)
robust<-round(c(mean_Mg,var_Mg,sd_Mg,CI_origin,two_sigma),3)
robust
## [1] 1.042 1.840 2.183 1.024 1.073 0.219 4.966

Just for comparison, we calculate the moments and intervals the “wrong” way:

w_mean_Mg<-mean(Mg)
w_var_Mg<-var(Mg)
w_sd_Mg<-sd(Mg)
w_CI<-c(mean(Mg)-qt(0.975,n-1)*sd(Mg)/sqrt(n),mean(Mg)+qt(0.975,n-1)*sd(Mg)/sqrt(n))
w_two_sigma<-c(mean(Mg)-2*sd(Mg),mean(Mg)+2*sd(Mg))
wrong<-round(c(w_mean_Mg,w_var_Mg,w_sd_Mg,w_CI,w_two_sigma),3)

and plot the results for visual inspection:

hist(Mg, main = "Dissolved Mg in g/l", xlab = "Mg in g/l",breaks=50, xlim = c(-1.5,8))
abline(v=wrong[c(1,4:7)],col='red')
abline(v=robust[c(1,4:7)],col='green')

Here, the green lines show the geometric mean together with its CI and 2-sigma interval after back-transformation (the robust way).
The red lines represent the meaningless arithmetic mean and its interval margins. We call them “meaningless” not least because of the negative concentration suggested by the lower 2-sigma margin of the arithmetic mean.

A magnification of the CIs reveals the difference between the two CIs:

hist(Mg, main = "Dissolved Mg in g/l", xlab = "Mg in g/l",breaks=50, xlim = c(1,1.5))
abline(v=wrong[c(1,4:7)],col='red')
abline(v=robust[c(1,4:7)],col='green')

On the original feature scale, the robust CIs are skewed, just like the underlying random model.

Other transformations that can improve normality/symmetry, but do not solve the algebraic constraints:
- square-root transformation
- reciprocal transformation
- arcsine (\(\arcsin\)) transformation
- arcsine-square-root (\(\arcsin\sqrt{x}\)) transformation

Important: In our example above, we have ignored the problem of an upper scale limit. No concentration can be larger than 100% or 1000 g/kg, etc. Here, the upper limit is far below 1000 g/litre due to the chemical saturation limit. However, nearly all observable measures are physically or ecologically (economically, demographically, etc.) limited, at least on earth. We will come back to this problem after explaining the statistical, algebraic and geometric problems of compositional constraints.

Compositional constraints

“If the level of investigation of compositional data with classical tools moves from a descriptive phase to the application of statistical tests, erroneous evaluation of the formulated hypothesis may be the result, a serious point when environmental questions are the matter of debate.” (A. Buccianti, 2013: Is compositional data analysis a way to see beyond the illusion?)

Compositions are sets of variables whose values sum up to a constant:
- dissolved cations and anions in a liquid
- frequencies of population groups (e.g. employees, jobless people, etc.)
- role allocation of a person during a time span in behavioral sciences
- nutrition content for food declaration
- many, many more …

Due to the compositional nature of such variables, they can neither be independent nor normally distributed.

A simple example:
Let us generate a 3-part random composition by defining two random variables and the remainder to the total:

set.seed(311)
A<-runif(30,0,33)
B<-runif(30,0,33)
C<-100-A-B
comp1<-cbind(A,B,C)
summary(comp1)
##        A                 B                 C        
##  Min.   : 0.4925   Min.   : 0.4783   Min.   :40.63  
##  1st Qu.: 7.9098   1st Qu.:11.8154   1st Qu.:52.27  
##  Median :21.3940   Median :17.6066   Median :62.03  
##  Mean   :18.4264   Mean   :17.9454   Mean   :63.63  
##  3rd Qu.:27.4527   3rd Qu.:25.2677   3rd Qu.:75.83  
##  Max.   :31.5422   Max.   :32.4978   Max.   :86.17

All three parts should be independent. Let us check:

library(psych)
pairs.panels(comp1)

As we can see, both random variables (A and B) are correlated with the remainder C. Now imagine we measure another part and extend the composition by a variable D.
Hereby, we re-generate C as a random variable and close all 4 parts to 100%:

C<-runif(30,0,70)
D<-runif(30,0,70)
comp2<-as.data.frame(cbind(comp1,D))
# set the composition to 100%
comp2<-as.data.frame(100*cbind(comp2$A/rowSums(comp2),
             comp2$B/rowSums(comp2),
             comp2$C/rowSums(comp2),
             comp2$D/rowSums(comp2)))
colnames(comp2)<-c("A","B","C","D")
pairs.panels(comp2,smooth = F,ellipses = F,lm=T,cor=T)


If we close a set of independent uniform random variables to 100%, none of them stays uniform anymore. They lose any symmetry and any possible initial normality!
Furthermore, apart from the pair A and B, all correlations are negative again.

Let us test the Pearson correlation coefficients for significance:

Test<-c("r(AB)","r(AC)","r(AD)","r(BC)","r(BD)","r(CD)")
p_values<-c(cor.test(comp2$A,comp2$B)$p.value<0.05,
            cor.test(comp2$A,comp2$C)$p.value<0.05,
            cor.test(comp2$A,comp2$D)$p.value<0.05,
            cor.test(comp2$B,comp2$C)$p.value<0.05,
            cor.test(comp2$B,comp2$D)$p.value<0.05,
            cor.test(comp2$C,comp2$D)$p.value<0.05)
cbind(Test,p_values)
##      Test    p_values
## [1,] "r(AB)" "FALSE" 
## [2,] "r(AC)" "FALSE" 
## [3,] "r(AD)" "TRUE"  
## [4,] "r(BC)" "TRUE"  
## [5,] "r(BD)" "FALSE" 
## [6,] "r(CD)" "TRUE"

50% of the possible pairs of random variables show a significant correlation!

What does this imply, besides the obvious spurious correlations?

We can interpret the correlation coefficient as the cosine of the angle between the deviation vectors from the mean \[\mathbf{x}_{d}=(x_i-\bar{x},\ i=1,\cdots,n)\]

and

\[ \mathbf{y}_{d}=(y_i-\bar{y},i=1,\cdots,n)\]

The cosine of the angle between two vectors is the scalar product divided by the product of both norms:

\[\cos \measuredangle (\mathbf{x}, \mathbf{y})=\frac{\langle\mathbf{x}, \mathbf{y}\rangle}{\parallel \mathbf{x} \parallel \cdot \parallel \mathbf{y} \parallel }= \frac {\sum_i{x_i\cdot y_i }}{\sqrt{\sum_i x_i^2}\cdot \sqrt{\sum_i y_i^2}} \] Setting \(\mathbf{x} =\mathbf{x}-\overline{\mathbf{x}}\) and \(\mathbf{y} =\mathbf{y}-\overline{\mathbf{y}}\), the scalar product in the numerator becomes the covariance term and the product of the norms yields the standard deviation terms, so that this cosine is exactly the Pearson correlation coefficient.
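A short R sketch with two arbitrary vectors illustrates this equivalence: the cosine of the angle between the deviation (mean-centered) vectors equals the Pearson correlation coefficient.

x<-c(2,4,6,9,13)            # arbitrary example data
y<-c(1,3,2,7,10)
xd<-x-mean(x)               # deviation vectors from the mean
yd<-y-mean(y)
# scalar product divided by the product of both norms
cos_angle<-sum(xd*yd)/(sqrt(sum(xd^2))*sqrt(sum(yd^2)))
cos_angle
cor(x,y)                    # identical to the Pearson correlation coefficient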

Thus, in compositions the deviation vectors forming the basis are oblique by nature and consequently non-Euclidean!

Hence, statistical methods using distances and angles (ratios) as optimization criteria cannot be applied to untransformed compositions. Such methods are e.g.:
- (ML-)Regression and correlation, GLM, PCA, LDA, ANN
- SVM, KNN, k-Means
- most hypotheses tests, etc.

This problem has been well known since Karl Pearson published the problem of spurious correlation in compositional data in 1897, more than 125 years ago. Felix Chayes (1960) suggested the use of ratios instead of proportions: \[ a,b \in ]0,1[_{\mathbb{R}}\to \frac{a}{b}\in \mathbb{R_+} \] One remaining disadvantage was the choice of numerator and denominator, because of the often huge difference between a/b and b/a. Another problem was still the restriction to the positive real feature scale.

In 1981, the Scottish statistician John Aitchison solved the problem by introducing log-ratios: \(\log \frac{a}{b}=-\log \frac{b}{a}\in\mathbb R\). Thus, swapping numerator and denominator just changes the sign of the log-ratio, but not its absolute value, and restores symmetry (cf. the paragraph above).
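For example, the ratios a/b and b/a may differ by orders of magnitude, whereas their log-ratios differ only in sign (the two values are arbitrary):

a<-0.2
b<-45
c(a/b,b/a)               # the two ratios differ hugely
c(log(a/b),log(b/a))     # the log-ratios have the same absolute value, opposite sign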

In his book The Statistical Analysis of Compositional Data (1986), Aitchison introduced two transformations:

  1. The additive log-ratio (alr) transformation is the logarithm of the ratio of each variable divided by one selected variable of the composition. Applied to a D-part composition, the alr results in a (D-1)-dimensional variable space over the field \((\mathbb{R},+,\cdot)\). Unfortunately, besides the asymmetry in the parts, the alr is not isometric. Thus, distances and angles are not consistent between the simplex and the resulting (D-1)-dimensional real space.
  2. The centered log-ratio (clr) transformation is the logarithm of the ratio of each value divided by the geometric mean of all parts of the corresponding observation. The clr transformation is symmetric with respect to the compositional parts and keeps the same number of components as the number of parts in the composition.

    Because orthogonal references in the resulting subspace are not obtained in a straightforward manner, Egozcue et al. 2003 introduced the
  3. Isometric log-ratio (ilr) transformation, which is isometric and provides an orthonormal basis. Thus, distances and angles are independent of the chosen sub-composition (symmetric in all parts of the sub-composition) and comparable between the Aitchison simplex and the (D-1)-dimensional real space.

However, a PCA or SVD analysis on clr-transformed compositions provides similar properties (Egozcue et al. 2003).
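The following base-R sketch applies the alr and clr transformations to the composition comp2 from above. The helper functions gmean, clr_row and alr_row are only illustrative; dedicated packages such as compositions provide these transformations (and the ilr) as ready-made functions.

gmean<-function(x) exp(mean(log(x)))                  # geometric mean of the parts
clr_row<-function(x) log(x/gmean(x))                  # clr: log of each part / geometric mean
alr_row<-function(x) log(x[-length(x)]/x[length(x)])  # alr: last part as denominator

comp2_clr<-t(apply(comp2,1,clr_row))   # 4 clr coordinates per observation
comp2_alr<-t(apply(comp2,1,alr_row))   # 3 alr coordinates per observation
pairs.panels(comp2_clr)                # correlation structure in clr coordinates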

Double constrained feature spaces

There is a certain relation to the compositional constraint when we face double-bounded feature spaces. Let us imagine a variable \(\mathbf{x}=(x_i\in\ ]l,u[_\mathbb{R},\ i=1,\cdots,m)\) within a constrained interval \(]l,u[\ \subset \mathbb{R}\). The absolute value is obviously of minor interest compared with the relative position between l and u.
A first step should be a linear min-max transformation with min = l and max = u:
\[ x \to x' =\frac{x-l}{u-l} \in\ ]0,1[\ \subset \ {\mathbb R}_+ \] Now, \(x'\) divides the interval into two parts: 1. the distance to the left (\(=x'\)) and 2. the distance to the right (\(=1-x'\)). Both sum up to 1 and therefore form a 2-part composition.
Applying an alr-transformation to our variable, we obtain a symmetric feature scale: \[ x'\to x''=\log\left(\frac{x'}{1-x'}\right)\in \mathbb R\] This so-called logistic (logit) transformation enables the application of algebraic operations over the base field \(\mathbb R\).
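A small R sketch of this two-step transformation, assuming a variable bounded between l and u (the bounds and values are invented):

l<-0                      # assumed lower bound
u<-100                    # assumed upper bound
x<-c(2,15,50,87,99)       # invented values within ]l,u[

x1<-(x-l)/(u-l)           # min-max transformation into ]0,1[
x2<-log(x1/(1-x1))        # logistic (logit) transformation onto R

l+(u-l)*exp(x2)/(1+exp(x2))  # back-transformation reproduces x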


Feature scales and machine learning approaches

As shown above, normalization, scaling, standardization and transformations are crucial requirements for a reliable and robust application of statistical methods beyond simple data description. Especially when applying complex machine learning algorithms to explore multivariate data patterns, it is easy to get entrapped by the violation of important preconditions: a violation of scale homogeneity when using gradient-descent-based optimization techniques, a non-Euclidean feature space for distance- or angle-based techniques, or simply the violation of algebraic constraints.
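The following sketch illustrates the scale-homogeneity issue for a distance-based method: without standardization, the feature with the larger numerical range dominates the Euclidean distance (the two features and their value ranges are invented):

set.seed(123)
conductivity<-runif(10,200,2000)   # invented feature with a large numerical range
ph<-runif(10,6,8)                  # invented feature with a small numerical range
dat<-cbind(conductivity,ph)

round(dist(dat)[1:3],2)            # distances governed almost entirely by conductivity
round(dist(scale(dat))[1:3],2)     # after standardization both features contribute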

A brief overview of common risks and serious pitfalls is provided by Minkyung Kang in a data science blog post:

ML models sensitive to feature scale

- gradient-descent-based methods (e.g. linear, ridge and lasso regression, logistic regression, neural networks)
- distance- and angle-based methods (e.g. KNN, k-means, SVM, PCA, LDA)

ML models not sensitive to feature scale

- tree-based methods (e.g. decision trees, random forests, gradient boosting)
- Naive Bayes

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.