Recently, at a geoscience community meeting, the following poll chart popped up on the screen:

Fig: Results of a workshop poll among n=74 participants on the importance of creativity for geoscientists

This chart shows the frequencies of n=74 individual votes on the relevance of creativity in daily work, on an integer scale from 1 (no relevance) to 10 (high relevance).

The figure contains a lot of information:
1. A clear majority voted in the upper part of the scale
2. Only one individual voted “not important at all”
3. Nearly one third voted for the third-highest value (8)
4. 60% of the 74 individuals voted for levels \(\ge 8\)
5. A “Score” reports a value of 7.6

This raises the first question:
What is the meaning of this “Score”?

Let us reconstruct the data:

# reconstruct the votes from the chart: counts = share * 74, rounded
data <- c(1, 4, 4,
          rep(x = 5,  times = round(0.07 * 74)),
          rep(x = 6,  times = round(0.08 * 74)),
          rep(x = 7,  times = round(0.20 * 74)),
          rep(x = 8,  times = round(0.32 * 74)),
          rep(x = 9,  times = round(0.16 * 74)),
          rep(x = 10, times = round(0.12 * 74)))
summary(data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   7.000   8.000   7.635   9.000  10.000
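As a quick cross-check, the reported “Score” can be reproduced as the weighted mean of the vote levels, with counts reconstructed from the shares shown in the chart (the single votes at levels 1 and 4 are read off the chart, as above):

```r
# reconstruct counts per vote level from the chart's shares
votes  <- c(1, 4, 5, 6, 7, 8, 9, 10)
counts <- c(1, 2, round(c(0.07, 0.08, 0.20, 0.32, 0.16, 0.12) * 74))

sum(counts)                   # 74 participants in total
weighted.mean(votes, counts)  # ~7.635, the reported "Score" of 7.6
```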

Okay, the score resembles the arithmetic mean!

But why is the mean, as a measure of central tendency, smaller than the values chosen by 60% of the voters?

A first explanation might be: the outlier is responsible for this misleading result!

Let us check by ignoring the outlier:

# same reconstruction, with the single vote at level 1 removed
data_cleaned <- c(4, 4,
                  rep(x = 5,  times = round(0.07 * 74)),
                  rep(x = 6,  times = round(0.08 * 74)),
                  rep(x = 7,  times = round(0.20 * 74)),
                  rep(x = 8,  times = round(0.32 * 74)),
                  rep(x = 9,  times = round(0.16 * 74)),
                  rep(x = 10, times = round(0.12 * 74)))
summary(data_cleaned)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   7.000   8.000   7.726   9.000  10.000

The outlier has only a very small effect (< 0.1) on the mean and thus cannot serve as an excuse!

Question: Might the geometric mean provide a more meaningful measure for our problem?
Answer: Definitely not, because the geometric mean is never larger than the arithmetic mean (cf. the AM–GM inequality)
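To confirm this, a minimal sketch computing the geometric mean \(\exp(\overline{\log x})\) of the reconstructed votes:

```r
# geometric mean via exp of the mean log
data <- c(1, 4, 4, rep(5, 5), rep(6, 6), rep(7, 15),
          rep(8, 24), rep(9, 12), rep(10, 9))
geo_mean <- exp(mean(log(data)))

geo_mean     # ~7.36, indeed smaller than the arithmetic mean
mean(data)   # ~7.635
```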

Alternatively, we could use the median, but in our case it would be a very coarse measure of central tendency, even more so for further test statistics.

The challenge of voting between two margins

Let us assume that people are able to “measure relevance” within a constrained interval and that the integers are considered to reflect an equidistant metric.

Choosing one number conveys two pieces of information:

  1. the distance to the lower margin and
  2. the distance to the upper margin!

Both distances sum to the width of the interval. Hereby, neither the margins themselves nor the width of the interval matter. Only the number of eligible values influences the resolution of the measure.

Thus, any number reflects a relative measure towards the margins and can therefore be transformed to a standard interval such as [0,1] or [0,100%].
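On the 1–10 scale this relative measure is simply the distance to the lower margin divided by the interval width; a minimal sketch (the helper name `rescale01` is ours):

```r
# rescale votes from [1, 10] onto the standard interval [0, 1]
rescale01 <- function(x, lower = 1, upper = 10) (x - lower) / (upper - lower)

rescale01(c(1, 5, 8, 10))   # 0.000 0.444 0.778 1.000
```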

However, for a meaningful examination of the constrained values, we should focus on the ratio of both equally important measures, e.g. for [0,1]: \[ x\to x'=\frac{x}{1-x}\]

Because \(x\) and \(1-x\) form a two-part composition, we apply an additive log-ratio transformation after John Aitchison (1986). In our case, the additive log-ratio transformation is identical to the logit transformation: \[x\to x'=\log\left( \frac{x}{1-x}\right)\]

The back-transformation is the logistic function, i.e. the inverse logit: \[x' \to x=\frac{e^{x'}}{1+e^{x'}}\]
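The two mappings are inverses of each other, which can be verified directly (the helper names `logit` and `inv_logit` are ours; base R also ships them as `qlogis()` and `plogis()`):

```r
# logit and its inverse, the logistic function
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(z) exp(z) / (1 + exp(z))

p <- c(0.1, 0.5, 0.9)
all.equal(inv_logit(logit(p)), p)   # TRUE: a round trip recovers p
```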


But before we start, we have to enlarge our interval by a small constant \(c\), so that votes on the original margins do not map to 0 or 1, where the logit is undefined.

Recipe:

  1. Set a small constant \(c\) and perform a min-max transformation into the open unit interval: \[ l=margin_{low}-c \qquad u=margin_{up}+c\]

\[ x'=\frac{x-l}{u-l}\]

  2. Perform a logit transformation on \(x'\in\,]0,1[\,\subset \mathbb R\) and calculate the desired parameters:

#set a small constant c to keep the rescaled values strictly inside ]0,1[:
c <- 0.01
#min-max transformation: l = 1 - c, u = 10 + c, width u - l = 9 + 2c
data_t_mm <- (data - (1 - c)) / (10 - 1 + 2 * c)
#logit transformation
data_log <- log(data_t_mm / (1 - data_t_mm))
hist(data_log)

m_log <- mean(data_log)
med_log <- median(data_log)
#interval bounds: mean +/- one standard error on the transformed scale
CI_low <- m_log - sd(data_log) / sqrt(length(data_log))
CI_up  <- m_log + sd(data_log) / sqrt(length(data_log))
  3. Transform these parameters back by applying the logistic function:

cor_mean<-(1-c)+(9+2*c)*exp(m_log)/(1+exp(m_log))
cor_median<-(1-c)+(9+2*c)*exp(med_log)/(1+exp(med_log))
cor_CI_low<-(1-c)+(9+2*c)*exp(CI_low)/(1+exp(CI_low))
cor_CI_up<-(1-c)+(9+2*c)*exp(CI_up)/(1+exp(CI_up))
paste("corrected mean =",round(cor_mean,2),"   ","corrected median=",round(cor_median),"confidence interval=[",round(cor_CI_low,2),",",round(cor_CI_up,2),"]")
## [1] "corrected mean = 8.5     corrected median= 8 confidence interval=[ 8.14 , 8.8 ]"

Results:
1. corrected mean \(\bar x\) = 8.5
2. corrected confidence interval (mean ± 1 standard error) = [8.14, 8.8]

Thus, we not only get a more plausible mean, but the confidence bounds are also asymmetric, as they should be for a skewed distribution.

Furthermore, these transformations have not changed the median, as it should be: monotone transformations preserve the median.
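This can be checked directly: the whole chain is strictly monotone, and since both middle values of the even-sized sample coincide, the sample median maps exactly onto itself (a sketch on the reconstructed data; the constant is named `c0` here to avoid masking base R's `c()`):

```r
data <- c(1, 4, 4, rep(5, 5), rep(6, 6), rep(7, 15),
          rep(8, 24), rep(9, 12), rep(10, 9))
c0 <- 0.01
x  <- (data - (1 - c0)) / (10 - 1 + 2 * c0)   # min-max into ]0,1[
z  <- log(x / (1 - x))                        # logit
# back-transform the median of the transformed values
med_back <- (1 - c0) + (9 + 2 * c0) * exp(median(z)) / (1 + exp(median(z)))

median(data)  # 8
med_back      # 8, up to floating-point error
```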

Take away:

Applying classical statistics to bounded or finite integer scales always produces misleading measures, even when it is not as obvious as in our example above.

But, there are a couple of ways out of the mess:

  1. Under the assumption of a metric scale, use an appropriate transformation and apply common statistics in real space (our example!)

  2. Treat the data as discrete and use compositional statistics after Aitchison (1986), see e.g. Pawlowsky-Glahn, Egozcue & Tolosana-Delgado (2015)

  3. Depending on your application, the multinomial distribution may also provide robust estimates for probabilistic modeling.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.