There are nearly unlimited reasons for scale transformation. Here we focus on the following:
* Scale homogeneity
* Algebraic constraints on positive feature scales
* Skewness, symmetry and normality
* Compositional data sets
* Doubly constrained feature scales
When measuring temperature, we have the choice between different
units: Kelvin [K], Celsius [°C] or Fahrenheit [°F]. Whereas the Kelvin scale has
an absolute origin at 0 K, the other two have arbitrary zero points. A
simple shift of -273.15 links the Kelvin and the Celsius scale (°C = K - 273.15). In order to
build physically meaningful statistical models, such transformations have
to be performed first.
Further examples are different currencies, mole and mass (weight)
percentages of e.g. chemical compositions, or length, area and
volume units in metres, yards, inches, etc.
Such transformations can be described by linear relations of the form
\[ y = b \cdot x + a \] The constant \(a\) is the shift or translation value and \(b\)
the slope or stretching factor.
In the following chapter, we will call these linear relations
scaling (translation & stretching).
Example for scaling: transformation from Celsius to
Fahrenheit: \[ \underbrace{y}_{[°F]} =
\underbrace{1.8 \cdot x_{[°C]} + 32}_{[°F]}\]
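As a minimal sketch (the helper name scale_linear and the test values are only illustrative), such a scaling can be written directly in R:
scale_linear<-function(x,b,a){b*x+a}   # y = b*x + a: stretching b, translation a
scale_linear(c(0,25,100),b=1.8,a=32)   # Celsius -> Fahrenheit: 32, 77, 212 °F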
Most real-world variables are positive (i.e. on \(\mathbb{R}_+ = ]0,+\infty[\)) or are limited to an
interval \(]a,b[ \subset \mathbb{R}_+\).
Leaving these boundaries through algebraic operations such as \((+,-,\cdot,/)\)
leads to useless or meaningless results.
Example
On the scale of positive real numbers \(\mathbb{R}_+\), multiplication and its
inverse operation, division, cannot leave the set of positive real
numbers. Thus, the positive real numbers together with
multiplication, denoted by \((\mathbb{R}_+,\cdot)\), form a commutative
group with 1 as the neutral element. In contrast, the set of positive real
numbers together with addition, \((\mathbb{R}_+,+)\), has neither inverse
elements nor a neutral element (because 0 is not a positive number).
Thus, addition and subtraction cannot be performed without restriction.
But a log-transformation of a variable \[x\in\mathbb{R}_+: \quad x\to x'=\log(x)\]
opens the scale to \(x'\in\mathbb{R}\) and maps the
neutral element 1 of \((\mathbb{R}_+,\cdot)\)
to the neutral element 0 of \((\mathbb{R},+)\)!
The field \((\mathbb{R},+,\cdot)\)
provides nearly unlimited operations for statistical purposes (except
e.g. \(\sqrt{-1}\) or \(\frac{1}{0}\)).
The inverse transformation, powering, \[x'\to x=e^{x'}\]
reverses the log-transformation back to the positive real
scale.
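A minimal sketch of this round trip in R (the values are arbitrary):
v<-c(0.5,1,20)      # positive values on the R+ scale
v_log<-log(v)       # now on the whole real line; note log(1) = 0
exp(v_log)          # powering recovers the original positive values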
Let us look at the following example of a data set of measured dissolved Mg (in g/l):
summary(Mg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.1086  0.6269  1.0526  1.4219  1.7430 15.1291
Compared with the median, the arithmetic mean appears to be a poor
estimate of the central tendency for our heavily skewed data.
Let us try the geometric mean as a measure of central tendency, using
the Gmean() function from the package DescTools:
library(DescTools)
param<-c("arithm. mean","Median","geomean")
values<-round(c(mean(Mg),median(Mg),Gmean(Mg)),2)
cbind(param,values)
## param values
## [1,] "arithm. mean" "1.42"
## [2,] "Median" "1.05"
## [3,] "geomean" "1.04"
Obviously, the geometric mean seems to be a far better estimate of
the central tendency. For right-skewed distributions, part of the reason lies in
the generalized mean inequality \(\bar{x}_{geo}\leq
\bar{x}_{arith}\), but this alone is not a sufficient explanation.
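A quick numerical check of the inequality with arbitrary toy values (using Gmean() from the DescTools package loaded above):
v<-c(1,10,100)
mean(v)             # arithmetic mean: 37
Gmean(v)            # geometric mean: 10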
Let us recall the geometric mean: \[
\bar{x}_{geo}=\sqrt[n]{\prod_{i=1}^n x_i}=\prod_{i=1}^n
x_i^{\frac{1}{n}}= x_1^{\frac{1}{n}}\cdot x_2^{\frac{1}{n}}\cdots
x_n^{\frac{1}{n}}\] The main apparent difference between these
two means is the operator: a sum for the arithmetic mean and a product for the
geometric mean.
Thus, the arithmetic mean calls for an additive or distance scale,
and the geometric mean for a ratio scale approach.
In our case, we are dealing with a variable clearly hosted on the \(\mathbb{R}_+\)-scale, where only
multiplication is allowed without restriction.
But we often need an arithmetic mean for further statistical
parameters: e.g.
- standard deviation
- skewness
- covariance
- correlation
- regression
- etc.
To solve this problem, we can apply the logarithm to the geometric
mean:
\[\begin{align}
\log(\bar{x}_{geo}) &= \log\left(x_1^{\frac{1}{n}}\cdot x_2^{\frac{1}{n}}\cdots x_n^{\frac{1}{n}}\right) \\
&= \frac{1}{n}\log(x_1)+\frac{1}{n}\log(x_2)+\cdots+\frac{1}{n}\log(x_n) \\
&= \frac{1}{n}\sum_{i=1}^{n}{\log(x_i)}={\overline{\log(x_i)}}_{arith}
\end{align}\]
The logarithm of the geometric mean is the arithmetic mean of the logarithms!
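We can verify this identity numerically (toy values, again using Gmean()):
v<-c(0.2,1.5,4,12)
log(Gmean(v))       # logarithm of the geometric mean
mean(log(v))        # arithmetic mean of the logarithms: identical value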
Applying the logarithm to our Mg data yields the following distribution:
log_Mg<-log(Mg)
hist(log_Mg, main = "Dissolved Mg on log-scale", xlab = "log(Mg)",breaks=50)
Using the log-transformation kills two birds with one
stone:
1. We transform our data onto the feature scale \(\mathbb{R}\) with nearly unlimited
operations concerning differences and ratios!
2. We obtain a less skewed, nearly normal distribution, where
symmetric confidence intervals become meaningful.
A check for normality using shapiro.test()
yields:
shapiro.test(log_Mg)
##
## Shapiro-Wilk normality test
##
## data: log_Mg
## W = 0.99868, p-value = 0.9747
OK, with p = 0.97 the test gives no evidence against normality!
Here, we may calculate moments in the usual way:
# moments of the log-transformed data
mean_log_Mg<-mean(log_Mg)
var_log_Mg<-var(log_Mg)
sd_log_Mg<-sd(log_Mg)
n<-length(log_Mg)
x<-as.matrix(c(round(mean_log_Mg,3),round(var_log_Mg,3),round(sd_log_Mg,3),n))
res<-as.data.frame(x=x,row.names=c("arithm. mean log(x)","variance log(x)",
"stand. dev. log(x)","n"))
res
## V1
## arithm. mean log(x) 0.041
## variance log(x) 0.610
## stand. dev. log(x) 0.781
## n 500.000
Now, we may calculate confidence bounds in log space, using the 97.5 % quantile of the t distribution:
low_CI<-mean_log_Mg-qt(0.975,n-1)*sd_log_Mg/sqrt(n)
up_CI<-mean_log_Mg+qt(0.975,n-1)*sd_log_Mg/sqrt(n)
CI<-c(low_CI,up_CI)
CI
## [1] 0.02331638 0.07027670
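Alternatively, a 95 % confidence interval for the mean of log(Mg) can be obtained directly from t.test() as a cross-check (a sketch, assuming the same log_Mg vector):
t.test(log_Mg)$conf.int   # one-sample t confidence interval in log space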
… and a two-sigma-interval:
low_sigma<-mean_log_Mg-2*sd_log_Mg
up_sigma<-mean_log_Mg+2*sd_log_Mg
sigma<-c(low_sigma,up_sigma)
sigma
## [1] -1.520368 1.602613
Next, we transform our results back onto the original feature scale:
mean_Mg<-exp(mean_log_Mg)   # back-transformed mean = geometric mean
var_Mg<-exp(var_log_Mg)
sd_Mg<-exp(sd_log_Mg)       # multiplicative (geometric) standard deviation
CI_orign<-exp(CI)           # CI margins back on the original scale
two_sigma<-exp(sigma)       # 2-sigma margins back on the original scale
robust<-round(c(mean_Mg,var_Mg,sd_Mg,CI_orign,two_sigma),3)
robust
## [1] 1.042 1.840 2.183 1.024 1.073 0.219 4.966
Just for comparison, we calculate the moments and intervals the “wrong” way:
w_mean_Mg<-mean(Mg)
w_var_Mg<-var(Mg)
w_sd_Mg<-sd(Mg)
w_CI<-c(mean(Mg)-qt(0.975,n-1)*sd(Mg)/sqrt(n),mean(Mg)+qt(0.975,n-1)*sd(Mg)/sqrt(n))
w_two_sigma<-c(mean(Mg)-2*sd(Mg),mean(Mg)+2*sd(Mg))
wrong<-round(c(w_mean_Mg,w_var_Mg,w_sd_Mg,w_CI,w_two_sigma),3)
and plot the stuff for visual inspection:
hist(Mg, main = "Dissolved Mg in g/l", xlab = "Mg in g/l",breaks=50, xlim = c(-1.5,8))
abline(v=wrong[c(1,4:7)],col='red')
abline(v=robust[c(1,4:7)],col='green')
Here, the green lines show the geometric mean together with its CI
and 2-sigma interval after back-transformation (the robust
way).
The same in red represents the meaningless arithmetic mean and its
interval margins. We call them “meaningless” not least because of the negative
concentration suggested by the lower 2-sigma margin of the arithmetic
mean.
A magnification reveals the difference between the two CIs:
hist(Mg, main = "Dissolved Mg in g/l", xlab = "Mg in g/l",breaks=50, xlim = c(1,1.5))
abline(v=wrong[c(1,4:7)],col='red')
abline(v=robust[c(1,4:7)],col='green')
On the original feature scale, the robust CIs are skewed, just like the underlying random
model.
Other transformations that gain normality / symmetry but do not solve the algebraic constraints:
- square-root transformation
- reciprocal transformation
- \(asin\) transformation
- \(asin\)-sqrt transformation
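A brief sketch of these alternatives applied to an arbitrary vector of proportions:
p<-c(0.02,0.10,0.25,0.60)   # hypothetical proportions in ]0,1[
sqrt(p)                     # square-root transformation
1/p                         # reciprocal transformation
asin(p)                     # asin transformation
asin(sqrt(p))               # asin-sqrt transformation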
Important: In our example above, the problem of the upper scale limit remains. No concentration can be larger than 100 % or 1000 g/kg, etc. Here, the upper limit is far below 1000 g/litre due to a chemical saturation limit. However, nearly all observable measures are physically or ecologically (economically, demographically, etc.) limited, at least on earth. We will come back to this problem after explaining the statistical, algebraic and geometric problems of compositional constraints.
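A tiny illustration of this upper limit (the percentage values are hypothetical): a symmetric 2-sigma interval around the arithmetic mean can easily exceed 100 %:
conc<-c(86,91,95,98,99)     # hypothetical concentrations in %
mean(conc)+2*sd(conc)       # upper 2-sigma margin lies above 100 %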
“If the level of investigation of compositional data with classical tools moves from a descriptive phase to the application of statistical tests, erroneous evaluation of the formulated hypothesis may be the result, a serious point when environmental questions are the matter of debate.” (from A. Buccianti, 2013: Is compositional data analysis a way to see beyond the illusion?)
Compositions are sets of variables whose values sum up to a
constant:
- dissolved cations and anions in a liquid
- frequencies of population groups (e.g. employees, jobless people,
etc.)
- role allocation of a person during a time span in behavioral
sciences
- nutrition content for food declaration
- many, many more …
Due to their compositional nature, such variables can be neither
independent nor normally distributed.
A simple example:
Let us generate a 3-part random composition by defining 2 random
variables and a remainder to the total:
set.seed(311)
A<-runif(30,0,33)   # two independently drawn uniform parts
B<-runif(30,0,33)
C<-100-A-B          # closure: the remainder to the constant total of 100
comp1<-cbind(A,B,C)
summary(comp1)
## A B C
## Min. : 0.4925 Min. : 0.4783 Min. :40.63
## 1st Qu.: 7.9098 1st Qu.:11.8154 1st Qu.:52.27
## Median :21.3940 Median :17.6066 Median :62.03
## Mean :18.4264 Mean :17.9454 Mean :63.63
## 3rd Qu.:27.4527 3rd Qu.:25.2677 3rd Qu.:75.83
## Max. :31.5422 Max. :32.4978 Max. :86.17
All three parts should be independent:
library(psych)
pairs.panels(comp1)
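The scatterplot matrix will already hint at the closure effect; as a small additional sketch (using the comp1 matrix defined above), the correlations can be quantified directly:
round(cor(comp1),2)         # closure-induced correlations between the parts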