There are countless reasons for scale transformations. Here we focus on the following:

* Scale inhomogeneity
* Algebraic constraints on positive feature scales
* Skewness, symmetry and normality
* Compositional data sets
* Doubly constrained feature scales


Scale inhomogeneity

When measuring temperature, we can choose between different units: Kelvin [K], Celsius [°C] or Fahrenheit [°F]. Whereas the Kelvin scale has an absolute origin of 0 K, the other two have arbitrary zero points. A simple shift of -273.15 links the Celsius scale to the Kelvin scale. In order to build physically meaningful statistical models, such transformations have to be performed first.

Further examples are different currencies, molar and mass percentages of e.g. chemical compositions, or length, area and volume units in meters, yards, inches, etc.

Such transformations can be described by the linear relation \[ y=b\cdot x + a \] The constant \(a\) is the shift or translation value and \(b\) the slope or stretching factor.
In the following chapter, we will call these linear relations scaling (translation & stretching).

Example of a scaling transformation from Celsius to Fahrenheit: \[ \underbrace{y}_{[°F]} = \underbrace{1.8\cdot x_{[°C]} + 32}_{[°F]}\]
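
As a quick sketch in R (the temperature values are chosen purely for illustration), the scaling is a single vectorized expression:

```r
# Linear scaling y = b*x + a: Celsius -> Fahrenheit with b = 1.8, a = 32
celsius <- c(-40, 0, 37, 100)      # illustrative temperatures in °C
fahrenheit <- 1.8 * celsius + 32   # stretching by 1.8, then shifting by 32
fahrenheit
## [1] -40.0  32.0  98.6 212.0
```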

Algebraic constraints on limited feature scales

Most real-world variables are positive (e.g. on \(\mathbb{R}_+\) or \(]0,+\infty[\)) or are limited to an interval \(]a,b[\subset\mathbb{R}_+\). Leaving these boundaries through algebraic operations such as \(+,-,\cdot,/\) leads to useless or meaningless results.

Example
On the scale of positive real numbers \(\mathbb{R}_+\), multiplication and its inverse operation, division, cannot leave the set of positive real numbers. Thus, the positive real numbers together with multiplication, denoted by \((\mathbb{R}_+,\cdot)\), form a commutative group with 1 as neutral element. In contrast, the set of positive real numbers together with addition, \((\mathbb{R}_+,+)\), has neither inverse elements nor a neutral element (because 0 is not a positive number). Thus, addition and subtraction cannot be performed without restriction.
But a log-transformation of any variable \[x\in\mathbb{R}_+: x\to x'=\log(x)\] opens the scale to \(x'\in\mathbb{R}\) and maps the neutral element 1 of \((\mathbb{R}_+,\cdot)\) to the neutral element 0 of \((\mathbb{R},+)\)!
The field \((\mathbb{R},+,\cdot)\) allows nearly unlimited operations for statistical purposes (except e.g. \(\sqrt{-1}\) or \(\frac{1}{0}\)).

The inverse transformation, powering \[x'\to x=e^{x'},\]

reverses the log-transformation back to the positive real scale.
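
A minimal check in R illustrates both directions and the mapping of the neutral elements:

```r
x <- c(0.5, 1, 2, 10)   # values on the positive real scale
x_log <- log(x)         # opened to the whole real scale
log(1)                  # neutral element 1 of multiplication maps to 0 of addition
## [1] 0
exp(x_log)              # powering reverses the log-transformation
## [1]  0.5  1.0  2.0 10.0
```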

Skewness, symmetry and normality

Let us look at the following summary of a data set of measured dissolved Mg:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1086  0.6269  1.0526  1.4219  1.7430 15.1291
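
The vector Mg itself is not defined in this excerpt. For readers who want to follow along, a comparable right-skewed sample can be simulated; the lognormal parameters below are assumptions chosen only to roughly mimic the summary above, not the original data:

```r
# Hypothetical stand-in for the measured Mg data (assumed, not the original!)
set.seed(42)                                    # arbitrary seed
Mg <- rlnorm(500, meanlog = 0.04, sdlog = 0.78) # 500 positive, right-skewed values
summary(Mg)
```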

Compared with the median, the arithmetic mean appears to be a poor estimate of central tendency for our heavily skewed data.
Let us try the geometric mean as a measure of central tendency, using the Gmean() function from the package DescTools:

library(DescTools)
param<-c("arithm. mean","Median","geomean")
values<-round(c(mean(Mg),median(Mg),Gmean(Mg)),2)
cbind(param,values)
##      param          values
## [1,] "arithm. mean" "1.42"
## [2,] "Median"       "1.05"
## [3,] "geomean"      "1.04"

Obviously, the geometric mean is a far better estimate of the central tendency here. The generalized mean inequality \(\bar{x}_{geo}\leq \bar{x}_{arith}\) guarantees that the geometric mean never exceeds the arithmetic mean, but it alone does not explain the right skewness.

Let us recall the geometric mean: \[ \bar{x}_{geo}=\sqrt[n]{\prod_{i=1}^n x_i}=\prod_{i=1}^n x_i^{\frac{1}{n}}= x_1^{\frac{1}{n}}\cdot x_2^{\frac{1}{n}}\cdots x_n^{\frac{1}{n}}\] The main apparent difference between the two means is the operator: a sum for the arithmetic mean and a product for the geometric mean.
Thus, the arithmetic mean suits an additive or distance scale, and the geometric mean a ratio scale.
In our case, we are dealing with a variable clearly hosted on the \(\mathbb{R}_+\)-scale, where only multiplication is allowed without restriction.

But we often need an arithmetic mean for further statistical parameters: e.g.
- standard deviation
- skewness
- covariance
- correlation
- regression
- etc.

To solve this problem, we can apply the logarithm to the geometric mean:
\[\begin{align} \log(\bar{x}_{geo}) & =\log(x_1^{\frac{1}{n}}\cdot x_2^{\frac{1}{n}}\cdots x_n^{\frac{1}{n}})= \\ & =\frac{1}{n}\log(x_1)+\frac{1}{n}\log(x_2)+\cdots+\frac{1}{n}\log(x_n)= \\ & =\frac{1}{n}\sum_{i=1}^{n}{\log(x_i)}={\overline{\log(x_i)}}_{arith} \end{align}\]

The logarithm of the geometric mean is the arithmetic mean of the logarithms!
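This identity is easy to verify numerically in base R (small toy vector, geometric mean computed by hand):

```r
x <- c(1, 2, 4, 8)
gm <- prod(x)^(1/length(x))   # geometric mean computed by hand
c(log(gm), mean(log(x)))      # both expressions give the same value
## [1] 1.039721 1.039721
```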
Applying the logarithm to our Mg data yields the following distribution:

log_Mg<-log(Mg)
hist(log_Mg, main = "Dissolved Mg on log-scale", xlab = "log(Mg)",breaks=50)

Using the log-transformation kills two birds with one stone:
1. We transform our data onto the feature scale \(\mathbb{R}\), which allows nearly unlimited operations concerning differences and ratios!
2. We obtain a less skewed, nearly normal distribution, where symmetric confidence intervals become meaningful.

Checking for normality using shapiro.test() yields:

shapiro.test(log_Mg)
## 
##  Shapiro-Wilk normality test
## 
## data:  log_Mg
## W = 0.99868, p-value = 0.9747

With a p-value of 0.97, normality is clearly not rejected!

Here, we may calculate moments in the usual way:

moments<-c("arithm. mean log(x)","variance log(x)","stand. dev. log(x)","n")
mean_log_Mg<-mean(log_Mg)
var_log_Mg<-var(log_Mg)
sd_log_Mg<-sd(log_Mg)
n<-length(log_Mg)
x<-as.matrix(c(round(mean_log_Mg,3),round(var_log_Mg,3),round(sd_log_Mg,3),n))
res<-as.data.frame(x=x,row.names=moments)
res
##                          V1
## arithm. mean log(x)   0.041
## variance log(x)       0.610
## stand. dev. log(x)    0.781
## n                   500.000

Now, we may calculate confidence bounds in log space:

low_CI<-mean_log_Mg-qt(0.975,n-1)*sd_log_Mg/sqrt(n)  # qt(): t-quantile, not pt()
up_CI<-mean_log_Mg+qt(0.975,n-1)*sd_log_Mg/sqrt(n)
CI<-c(low_CI,up_CI)
CI

… and a two-sigma-interval:

low_sigma<-mean_log_Mg-2*sd_log_Mg
up_sigma<-mean_log_Mg+2*sd_log_Mg
sigma<-c(low_sigma,up_sigma)
sigma
## [1] -1.520368  1.602613

Next, we transform our results back onto the original feature scale:

mean_Mg<-exp(mean_log_Mg)
var_Mg<-exp(var_log_Mg)
sd_Mg<-exp(sd_log_Mg)
CI_origin<-exp(CI)
two_sigma<-exp(sigma)
robust<-round(c(mean_Mg,var_Mg,sd_Mg,CI_origin,two_sigma),3)
robust
## [1] 1.042 1.840 2.183 1.024 1.073 0.219 4.966
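
Note what the back-transformation does to the intervals: an additive interval mean ± 2·sd in log space becomes a multiplicative interval on the original scale, i.e. the geometric mean multiplied and divided by the same factor. A small sketch using the (rounded) log-scale moments from above:

```r
m <- 0.041; s <- 0.781                 # rounded log-scale mean and sd from above
exp(c(m - 2*s, m + 2*s))               # back-transformed 2-sigma interval
exp(m) / exp(2*s); exp(m) * exp(2*s)   # identical bounds, written multiplicatively
```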

Just for comparison, we calculate the moments and intervals the “wrong” way:

w_mean_Mg<-mean(Mg)
w_var_Mg<-var(Mg)
w_sd_Mg<-sd(Mg)
w_CI<-c(mean(Mg)-qt(0.975,n-1)*sd(Mg)/sqrt(n),mean(Mg)+qt(0.975,n-1)*sd(Mg)/sqrt(n))
w_two_sigma<-c(mean(Mg)-2*sd(Mg),mean(Mg)+2*sd(Mg))
wrong<-round(c(w_mean_Mg,w_var_Mg,w_sd_Mg,w_CI,w_two_sigma),3)

and plot the stuff for visual inspection:

hist(Mg, main = "Dissolved Mg in g/l", xlab = "Mg in g/l",breaks=50, xlim = c(-1.5,8))
abline(v=wrong[c(1,4:7)],col='red')
abline(v=robust[c(1,4:7)],col='green')

Here, the green lines show the geometric mean with its CI and 2-sigma interval after back-transformation (the robust way).
The red lines represent the meaningless arithmetic mean and its interval margins. We call them “meaningless” at least because the lower 2-sigma margin of the arithmetic mean suggests a negative concentration.

Magnifying the plot around the means exhibits the difference between the two CIs:

hist(Mg, main = "Dissolved Mg in g/l", xlab = "Mg in g/l",breaks=50, xlim = c(1,1.5))
abline(v=wrong[c(1,4:7)],col='red')
abline(v=robust[c(1,4:7)],col='green')

On the original feature scale, the robust CIs are skewed, just like the underlying random model.

Other transformations that can improve normality/symmetry but do not solve the algebraic constraints:
- square-root transformation
- reciprocal transformation
- \(asin\)-transformation
- \(asin\)-\(\sqrt{}\)-transformation

Important: In the example above, we ignored the problem of the upper scale limit. No concentration can be larger than 100% or 1000 g/kg, etc. Here, the upper limit is far below 1000 g/liter due to a chemical saturation limit. However, nearly all observable measures are physically or ecologically (economically, demographically, etc.) limited, at least on earth. We will come back to this problem after explaining the statistical, algebraic and geometric problems of compositional constraints.

Compositional constraints

“If the level of investigation of compositional data with classical tools moves from a descriptive phase to the application of statistical tests, erroneous evaluation of the formulated hypothesis may be the result, a serious point when environmental questions are the matter of debate.” (from A. Buccianti, 2013: Is compositional data analysis a way to see beyond the illusion?)

Compositions are sets of variables whose values sum up to a constant:
- dissolved cations and anions in a liquid
- frequencies of population groups (e.g. employees, jobless people, etc.)
- role allocation of a person during a time span in behavioral sciences
- nutrition content for food declaration
- many, many more …

Due to their compositional nature, such variables can be neither independent nor normally distributed.

A simple example:
Let us generate a 3-part random composition by defining two uniform random variables and the remainder to the total of 100:

set.seed(311)
A<-runif(30,0,33)
B<-runif(30,0,33)
C<-100-A-B
comp1<-cbind(A,B,C)
summary(comp1)
##        A                 B                 C        
##  Min.   : 0.4925   Min.   : 0.4783   Min.   :40.63  
##  1st Qu.: 7.9098   1st Qu.:11.8154   1st Qu.:52.27  
##  Median :21.3940   Median :17.6066   Median :62.03  
##  Mean   :18.4264   Mean   :17.9454   Mean   :63.63  
##  3rd Qu.:27.4527   3rd Qu.:25.2677   3rd Qu.:75.83  
##  Max.   :31.5422   Max.   :32.4978   Max.   :86.17

Since A and B were drawn independently, one might expect all three parts to be independent. Let us check:

library(psych)
pairs.panels(comp1)
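
pairs.panels() will show that C is strongly negatively correlated with both A and B, because C is forced to take up the remainder to 100. The same can be checked numerically with the correlation matrix (exact values depend on the seed):

```r
# C = 100 - A - B is necessarily negatively correlated with A and B
round(cor(comp1), 2)
```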