The Arithmetic Mean

\[ \text{Mean} = \frac{\text{Sum of all values}}{\text{Number of all values}} \]

The arithmetic mean calculated for sample data is denoted by \(\bar x\) (read as “x bar”), and the mean for population data is denoted by \(\mu\) (Greek letter mu). Thus, the mean may be expressed by the following equations:

\[ \bar x = \frac{1}{n}\sum_{i=1}^n x_i \] or \[ \mu = \frac{1}{N}\sum_{i=1}^N x_i \]

where \(\sum_{i=1}^n x_i\) is the sum of all values (\(x_1, x_2,...,x_n\)), \(N\) is the population size, and \(n\) is the sample size.

Let us consider the height variable in the students data set and calculate its arithmetic mean. Be aware that the students data set is a sample and is not the population of all students. Thus we calculate the \(\bar x\), the sample mean.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

stud.heights <- students$height # extract heights vector
stud.heigths.n <- length(stud.heights) # get the length of the heights vector, 
                                       # which is the number of observations
stud.heights.sum <- sum(stud.heights) # sum up the heights vector
stud.heights.xbar <- stud.heights.sum / stud.heigths.n 
sprintf("The arithmetic mean height in the students data set is: %s cm", round(stud.heights.xbar,1))
## [1] "The arithmetic mean height in the students data set is: 171.4 cm"

Of course there is as well an in-built function called mean().

mean(stud.heights)
## [1] 171.3808

The Geometric Mean

When studying phenomena such as inflation or population changes that involve periodic increases or decreases (known as rates of change), the geometric mean is more appropriate to find the average change over the entire period under investigation. To calculate the geometric mean of a sequence of \(n\) values \(x_1, x_2,..., x_n\), we multiply them together and then find the \(n^{th}\) root of this product.

\[ \bar x_{geo} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} \]

which can be rewritten as

\[ \bar x_{geo} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} =\bigg(\prod_{i=1}^n x_i \bigg)^{1/n} = \sqrt[n]{\prod_{i=1}^n x_i} \]

Let us make it clear by an example:

We consider the annual growth rates of a swarm of honey bees over a 5-year period. These rates of change are: 14%, 26%, 16%, -38%, -6%. Further, we know that at the beginning of the monitoring period there we 5,000 bees in the swarm. We are looking for the mean rate of population change.

At first we set up our variables in R

bees <- c(14, 26, 16, -38, -6) # rate of change in %

Now we apply, against better knowledge, the arithmetic mean:

sprintf("The mean rate of population change is: %s percent", mean(bees))
## [1] "The mean rate of population change is: 2.4 percent"

The arithmetic mean indicates that the swarm is growing over the period of five years! Well, we are skeptical, thus we calculate the annual growth of the swarm of bees explicitly. First, we transform the given percentages into relative growth rates (bees.growth.rel), and then we simply calculate the state of the bees population by sequentially multiplying the rates of change with the number of bees, which we know was at the beginning of the survey 5,000.

bees.growth.rel <- 1 + bees/100 # annual rates of growth/decline
bees.growth.rel
## [1] 1.14 1.26 1.16 0.62 0.94
round(5000 * 
        bees.growth.rel[1] * 
        bees.growth.rel[2] * 
        bees.growth.rel[3] * 
        bees.growth.rel[4] * 
        bees.growth.rel[5])
## [1] 4855

Wow, what a surprise! Obviously, there is something wrong! We expected the swarm to grow on average over time. However, we calculated a decline in the absolute number of bees!

Let us try the geometric mean:

Please note that \(\sqrt[n]{x} = x^{\frac{1}{n}}\).

To calculate the geometric mean explicitly we write:

bees.len <- length(bees) # number of observations
bees.growth.geom <- (bees.growth.rel[1] * 
                       bees.growth.rel[2] * 
                       bees.growth.rel[3] * 
                       bees.growth.rel[4] *
                       bees.growth.rel[5]) ^ (1/bees.len)
# rounded result
round(bees.growth.geom,3)
## [1] 0.994

Great! The geometric mean indicates that there is a decline in the number of species over time at an average rate of 0.994, which corresponds to -0.006%. We check that by taking 5,000 bees (the initial number of bees in the swarm) times 0.994 for each year; thus, resulting in 4,971 bees after the first year, 4,942 after the second year, 4,913 after the third year, 4,884 after the fourth year and 4,855 after the fifth year of observation. A perfect match! In contrast to the arithmetic mean, the geometric mean does not over-state the year-on-year growth!

Unfortunately, there is no in-built function for the geometric mean in the R-base package. Thus, we install the psych package by calling install.packages("psych"), and attaching it to the work space by calling library("psych"). Then, we may access the geometric.mean() function.

library("psych")
round(geometric.mean(bees.growth.rel),3)
## [1] 0.994

The Harmonic Mean

The harmonic mean is best used when we want to average inverse metrics, such as speed (km/h), or population density (pop/km2).

Consider the following example:

The distance from your house to the next lake is 40 km. You drove to the lake at a speed of 20 km per hour and returned home at a speed of 80 km per hour. What was my average speed for the whole trip?.

Let us first calculate the arithmetic mean.

one.way <- 20
back.way <- 80
mean(c(one.way,back.way))
## [1] 50

The arithmetic mean of the two speeds you drove at is 50 km per hour. However, this is not the correct average speed. It ignores the fact that you drove at 20 km per hour for a much longer time as you drove at 80 km per hour. To find the correct average speed, we must instead calculate the harmonic mean.

The harmonic mean \(\bar x_h\) for the positive real numbers \(x_1 , x_2 , ... , x_n\) is defined by

\[\bar x_h = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}}, \qquad x_i > 0 \text{ for all } i.\]

Unfortunately, there is no in-built function for the harmonic mean in the R-base package. So we either write our own function

my.harmonic.mean = function(vals){length(vals) / sum(1/vals)}

or we rely on the psych package. The package provides a harmonic.mean() function.

my.harmonic.mean(c(one.way,back.way))
## [1] 32
# function from the psych package
harmonic.mean(c(one.way,back.way))
## [1] 32

Perfect, both implementations give the same result. However, is this result correct? We can verify it by reasoning. In the example above, the distance from the lake to your home is 40 km. So the trip from A to B at a speed of 20 km/h will take 2 hours. The trip from B to A at a speed to 80 km/h will take 0.5 hours. The total time taken for the round distance 2x40 km will be 2.5 hours. The average speed will then be \(\frac{80}{2.5} = 32\).


The Weighted Mean

There are applications, where certain values in a data set may be considered more important than others. In general, for a sequence of \(n\) data values \(x_1, x_2,..., x_n\) an their corresponding weights \(w_1, w_2,..., w_n\), the weighted (arithmetic) mean is given by

\[ \bar x_w = \frac{\sum_{i=1}^n w_ix_i }{\sum_{i=1}^n w_i} \]

where \(\sum_{i=1}^n w_ix_i\) is obtained by multiplying each data value by its weight and then adding the products.

For example, to determine the grades of students in a course, an instructor may assign a weight to the final exam that is three times as much as that of the other exams. Let us find the weighted mean for a student who scores 45 and 68 on the first two exams and 74 on the final.

First, we calculate the weighted mean explicitly:

sum(45*1, 68*1, 74*3) / sum (1,1,3)
## [1] 67

Now, we repeat the calculation by applying the weighted.mean function of R:

stud.scores <- c(45,68,74)
stud.scores.weights <- c(1,1,3)
weighted.mean(stud.scores, stud.scores.weights)
## [1] 67

Just for the ease of comparison we calculate the arithmetic mean too.

round(mean(stud.scores))
## [1] 62

Please note that the weighting of input values is a principle that is applicable to other measures of the mean as well. For example, we may weight the input variable for calculating the geometric mean.


The Weighted Geometric Mean

\[ \bar x_{geo_w} = \bigg(\prod_{i=1}^n x_i^{w_i} \bigg)^{1/ \sum_{i=1}^n w_i}\]

where \(x_1,x_2,...x_n\) correspond to the data values and \(w_1,w_2,...w_n\) to the weights, respectively. Based on the equation from above it is straight forward to write an implementation in R:

my.weighted.geometric.mean = function(vals, weights){prod(vals)^(1/sum(weights))}

To make sure, that our implementation is correct we recompute the example of the swarm of bees from the section above. Recall, that in the example above we observed a swarm of bees over 5 years and noted the rates of change in the bee population. The annual rates of change were 1.14, 1.26, 1.16, 0.62, 0.94, which correspond to \(x_1,x_2,...x_n\). To reproduce the result from above we set all weights \(w_1,w_2,...w_n\) to be 1.

round(my.weighted.geometric.mean(bees.growth.rel, rep(1,5)),3)
## [1] 0.994

Correct we get the same result!


The Weighted Harmonic Mean

Finally, we will implement a weighted harmonic mean function by our own.

The weighted harmonic mean \(\bar x_{h_w}\) for the positive real numbers \(x_1 , x_2 , ... , x_n\) is defined by the following equation:

\[\bar x_{h_w} = \frac{w_1+w_2+\cdots +w_n}{\frac{w_1}{x_1} + \frac{w2}{x_2} + \cdots + \frac{w_n}{x_n}} = \frac{\sum_{i=1}^n w_i}{\sum_{i=1}^n \frac{w_i}{x_i}}, \qquad x_i > 0 \text{ for all } i.\]

Let us implement the weighted harmonic mean function in R. The coding is straightforward, however to expand the functionality of our weighted harmonic mean function, we include an if-statement. This if-statement in the code will normalize the weights if the weights are not given in proportions. When the if-statement is run, we print a user-feedback, otherwise there will be no feedback.

my.weighted.harmonic.mean = function(vals, weights){
  if (sum(weights) != 1){
    weights = weights/sum(weights)
    print('Weights do not sum up to 1. Weights are normalized...')
  } 
  sum(weights)/sum(weights/vals)
  }

Let us try our my.weighted.harmonic.mean on a fairly complex data set. The data set cities consists of all state capitals of Germany, its population size and its area. The goal is to calculate the mean population density for the state capitals of Germany. You may download the cities.csv file here. The data is retrieved from this website.

First, we load the data set, assign a proper name to the data set and take a look at it.

cities <- read.csv('https://userpage.fu-berlin.de/soga/200/2010_data_sets/cities.csv')
cities
##              name                        state area_km2 pop.size
## 1          Berlin                  Land Berlin   891.85  3415100
## 2          Bremen      Freie Hansestadt Bremen   325.42   546450
## 3         Dresden            Freistaat Sachsen   328.31   525100
## 4   D\xfcsseldorf     Land Nordrhein-Westfalen   217.41   593682
## 5          Erfurt       Freistaat Th\xfcringen   269.17   203480
## 6         Hamburg Freie und Hansestadt Hamburg   755.26  1751780
## 7        Hannover           Land Niedersachsen   204.14   514130
## 8            Kiel      Land Schleswig-Holstein   118.60   239860
## 9       Magdeburg          Land Sachsen-Anhalt   200.97   229924
## 10          Mainz         Land Rheinland-Pfalz    97.76   202750
## 11     M\xfcnchen             Freistaat Bayern   310.71  1388300
## 12        Potsdam             Land Brandenburg   187.27   159450
## 13 Saarbr\xfccken                     Saarland   167.07   176990
## 14       Schwerin  Land Mecklenburg-Vorpommern   130.46    91260
## 15      Stuttgart    Land Baden-W\xfcrttemberg   207.36   597939
## 16      Wiesbaden                  Land Hessen   203.90   272630

Second, we create a new column and calculate the population density (inhabitant per square kilometer).

cities$pop.density <- cities$pop.size / cities$area_km2
cities
##              name                        state area_km2 pop.size
## 1          Berlin                  Land Berlin   891.85  3415100
## 2          Bremen      Freie Hansestadt Bremen   325.42   546450
## 3         Dresden            Freistaat Sachsen   328.31   525100
## 4   D\xfcsseldorf     Land Nordrhein-Westfalen   217.41   593682
## 5          Erfurt       Freistaat Th\xfcringen   269.17   203480
## 6         Hamburg Freie und Hansestadt Hamburg   755.26  1751780
## 7        Hannover           Land Niedersachsen   204.14   514130
## 8            Kiel      Land Schleswig-Holstein   118.60   239860
## 9       Magdeburg          Land Sachsen-Anhalt   200.97   229924
## 10          Mainz         Land Rheinland-Pfalz    97.76   202750
## 11     M\xfcnchen             Freistaat Bayern   310.71  1388300
## 12        Potsdam             Land Brandenburg   187.27   159450
## 13 Saarbr\xfccken                     Saarland   167.07   176990
## 14       Schwerin  Land Mecklenburg-Vorpommern   130.46    91260
## 15      Stuttgart    Land Baden-W\xfcrttemberg   207.36   597939
## 16      Wiesbaden                  Land Hessen   203.90   272630
##    pop.density
## 1    3829.2314
## 2    1679.2146
## 3    1599.4030
## 4    2730.7024
## 5     755.9535
## 6    2319.4397
## 7    2518.5167
## 8    2022.4283
## 9    1144.0713
## 10   2073.9566
## 11   4468.1536
## 12    851.4444
## 13   1059.3763
## 14    699.5248
## 15   2883.5793
## 16   1337.0770

Third, we calculate the weight for each city, according to the population size

cities$pop.weight <- cities$pop.size / sum(cities$pop.size)
cities
##              name                        state area_km2 pop.size
## 1          Berlin                  Land Berlin   891.85  3415100
## 2          Bremen      Freie Hansestadt Bremen   325.42   546450
## 3         Dresden            Freistaat Sachsen   328.31   525100
## 4   D\xfcsseldorf     Land Nordrhein-Westfalen   217.41   593682
## 5          Erfurt       Freistaat Th\xfcringen   269.17   203480
## 6         Hamburg Freie und Hansestadt Hamburg   755.26  1751780
## 7        Hannover           Land Niedersachsen   204.14   514130
## 8            Kiel      Land Schleswig-Holstein   118.60   239860
## 9       Magdeburg          Land Sachsen-Anhalt   200.97   229924
## 10          Mainz         Land Rheinland-Pfalz    97.76   202750
## 11     M\xfcnchen             Freistaat Bayern   310.71  1388300
## 12        Potsdam             Land Brandenburg   187.27   159450
## 13 Saarbr\xfccken                     Saarland   167.07   176990
## 14       Schwerin  Land Mecklenburg-Vorpommern   130.46    91260
## 15      Stuttgart    Land Baden-W\xfcrttemberg   207.36   597939
## 16      Wiesbaden                  Land Hessen   203.90   272630
##    pop.density  pop.weight
## 1    3829.2314 0.313058464
## 2    1679.2146 0.050092471
## 3    1599.4030 0.048135340
## 4    2730.7024 0.054422177
## 5     755.9535 0.018652788
## 6    2319.4397 0.160583748
## 7    2518.5167 0.047129732
## 8    2022.4283 0.021987703
## 9    1144.0713 0.021076880
## 10   2073.9566 0.018585870
## 11   4468.1536 0.127263935
## 12    851.4444 0.014616606
## 13   1059.3763 0.016224479
## 14    699.5248 0.008365704
## 15   2883.5793 0.054812411
## 16   1337.0770 0.024991693
sprintf('The sum of the weight vector is %s',sum(cities$pop.weight) )
## [1] "The sum of the weight vector is 1"

Now, we finally apply our implementation of the weighted harmonic mean (my.weighted.harmonic.mean) and compare it to the arithmetic mean.

We start, by giving the function the cities$pop.weight vector as an input parameter.

my.weighted.harmonic.mean(cities$pop.density, cities$pop.weight)
## [1] 2363.438

Now we test the functionality of our function by providing the cities$pop.size vector as an input parameter.

my.weighted.harmonic.mean(cities$pop.density, cities$pop.size)
## [1] "Weights do not sum up to 1. Weights are normalized..."
## [1] 2363.438

Awesome, the function works as expected. The results are the identical.

We can conclude that the mean population density in state capitals of Germany is about 2,363 inhabitants/km2.

We may calculate the arithmetic mean of the population density for comparison.

mean(cities$pop.density)
## [1] 1998.255