Recently, at a geoscience community meeting, the following
poll chart popped up on the screen:
This informative chart shows the frequency of n = 74 individual votes
on the relevance of creativity in daily work, given on an integer
scale from 1 (no relevance) to 10 (high relevance).
The figure contains a lot of information:
1. The absolute majority voted on the upper part of the scale
2. Only one individual voted “not important at all”
3. Nearly one third voted for the third highest value
4. 60% of the 74 individuals voted for levels \(\ge 8\)
5. A “Score” reports a value of 7.6
The first question that comes up:
What is the meaning of this “Score”?
# reconstruct the 74 individual votes from the percentages shown in the poll chart
data <- c(1, 4, 4,
          rep(x = 5,  times = round(0.07 * 74)),
          rep(x = 6,  times = round(0.08 * 74)),
          rep(x = 7,  times = round(0.2 * 74)),
          rep(x = 8,  times = round(0.32 * 74)),
          rep(x = 9,  times = round(0.16 * 74)),
          rep(x = 10, times = round(0.12 * 74)))
summary(data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 7.000 8.000 7.635 9.000 10.000
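Before interpreting the summary, a quick cross-check of the reconstructed data against observations 4 and 5 from the list above:

mean(data)        # ~7.64, matching the reported "Score" of 7.6
mean(data >= 8)   # ~0.61, i.e. roughly 60% voted for levels >= 8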
Okay, the score resembles the arithmetic mean!
But why is the mean, as a measure of central tendency, smaller than the values chosen by 60% of the voters?
A first, obvious explanation might be: the outlier is responsible for this meaningless result!
Let us check by ignoring the outlier:
# the same data without the single outlier (the one vote of 1)
data_cleaned <- c(4, 4,
                  rep(x = 5,  times = round(0.07 * 74)),
                  rep(x = 6,  times = round(0.08 * 74)),
                  rep(x = 7,  times = round(0.2 * 74)),
                  rep(x = 8,  times = round(0.32 * 74)),
                  rep(x = 9,  times = round(0.16 * 74)),
                  rep(x = 10, times = round(0.12 * 74)))
summary(data_cleaned)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 7.000 8.000 7.726 9.000 10.000
!!! The outlier has only a very small effect on the mean (< 0.1) and thus cannot be accepted as an excuse!
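A one-line check of that effect, using the two vectors defined above:

mean(data_cleaned) - mean(data)   # ~0.09, i.e. less than 0.1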
Question: May the geometric mean provide a more
meaningful measure for our problem?
Answer: Definitely not, because the geometric mean is
never larger than the arithmetic mean (cf. Hölder’s inequality).
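A quick numerical confirmation, computing the geometric mean as the exponential of the mean of the logarithms:

exp(mean(log(data)))   # ~7.4, indeed smaller than the arithmetic mean of ~7.6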
By the way, we could use the median, but in our case it would be a very coarse measure of central tendency, and even more so as a basis for further test statistics.
Let us assume that people are able to “measure relevance”
within a constrained interval and that the integers reflect
an equidistant metric.
Choosing one number then conveys two pieces of
information: the distance to the lower margin and the distance to the upper margin.
Both distances sum up to the width of the interval. Here,
neither the margins themselves nor the width of the interval are
important; only the number of eligible values influences the
resolution of the measure.
Thus, any number reflects a relative measure with respect to the margins and can therefore be transformed to a standard interval such as [0, 1] or [0, 100%].
However, for a meaningful examination of the constrained values, we should focus on the ratio of the two equally important distances, e.g. for [0, 1]: \[ x\to x'=\frac{x}{1-x}\]
Because \(x\) and \(1-x\) form a two-part composition, we apply an additive log-ratio (alr) transformation after John Aitchison (1986). In our case, the additive log-ratio transformation is identical to the logistic transformation, i.e. the logit: \[x\to x'=\log\left( \frac{x}{1-x}\right)\]
The back-transformation is the logistic function, i.e. the inverse logit: \[x' \to x=\frac{e^{x'}}{1+e^{x'}}\]
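As a minimal numerical illustration of this pair of transformations (the helper functions logit and inv_logit are defined here only for this sketch):

logit     <- function(x) log(x / (1 - x))
inv_logit <- function(y) exp(y) / (1 + exp(y))

logit(0.8)              # ~1.386
inv_logit(logit(0.8))   # 0.8 again: the back-transformation recovers the original value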
But before we start, we have to enlarge our interval by a small constant
\(c\) to avoid zeros and ones when one of the original margins (1 or 10) has been voted.
With the enlarged bounds \(l = 1-c\) and \(u = 10+c\), the min-max transformation reads \[ x \to x'=\frac{x-l}{u-l}\]
and calculate the desired parameters
# set a small constant c to avoid zeros and ones at the margins:
c <- 0.01

# min-max transformation with the enlarged bounds l = 1 - c and u = 10 + c
data_t_mm <- (data - (1 - c)) / (10 - 1 + 2 * c)

# logistic (logit) transformation into the unbounded real space
data_log <- log(data_t_mm / (1 - data_t_mm))
hist(data_log)

# mean, median and confidence bounds (mean +/- one standard error) in the transformed space
m_log   <- mean(data_log)
med_log <- median(data_log)
CI_low  <- m_log - sd(data_log) / sqrt(length(data_log))
CI_up   <- m_log + sd(data_log) / sqrt(length(data_log))
Back-transformation to the original scale:
# back-transform mean, median and confidence bounds to the original scale from 1 to 10
cor_mean   <- (1 - c) + (9 + 2 * c) * exp(m_log) / (1 + exp(m_log))
cor_median <- (1 - c) + (9 + 2 * c) * exp(med_log) / (1 + exp(med_log))
cor_CI_low <- (1 - c) + (9 + 2 * c) * exp(CI_low) / (1 + exp(CI_low))
cor_CI_up  <- (1 - c) + (9 + 2 * c) * exp(CI_up) / (1 + exp(CI_up))
paste("corrected mean =",round(cor_mean,2)," ","corrected median=",round(cor_median),"confidence interval=[",round(cor_CI_low,2),",",round(cor_CI_up,2),"]")
## [1] "corrected mean = 8.5 corrected median= 8 confidence interval=[ 8.14 , 8.8 ]"
Results:
1. corrected mean \(\bar x\) = 8.5
2. corrected confidence bounds (mean ± one standard error, back-transformed) = [8.14, 8.8]
Thus, we not only get a reasonable mean; the confidence bounds are also asymmetric, as they should be for a skewed distribution.
Furthermore, these transformations have not influenced the median (as it should be!).
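We can verify this invariance directly with the objects computed above:

all.equal(median(data), cor_median)   # TRUE: the monotone transformation leaves the median unchanged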
Applying classical statistics to bounded or finite integer scales always produces pointless measures, even if it is not as obvious as in our example above.
But there are a couple of ways out of the mess:
1. Under the assumption of a metric, use an appropriate transformation and apply common statistics in the real space (our example!)
2. Treat the data as discrete and use compositional statistics after Aitchison (1986) or, e.g., Pawlowsky-Glahn, Egozcue & Tolosana-Delgado (2015)
3. Depending on your application, the multinomial distribution may also provide robust estimates for probabilistic modeling (see the sketch below)
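To illustrate the last option, here is a minimal base-R sketch (not part of the original analysis) that treats the ten answer categories as a multinomial sample and derives a parametric bootstrap interval for the mean score:

# observed votes per category and estimated category probabilities
counts <- table(factor(data, levels = 1:10))
p_hat  <- prop.table(counts)

# parametric bootstrap: redraw 74 votes from the fitted multinomial model
set.seed(1)
boot       <- rmultinom(n = 1000, size = length(data), prob = p_hat)
boot_means <- colSums(boot * (1:10)) / length(data)
quantile(boot_means, c(0.025, 0.975))   # bootstrap interval for the mean score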
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.