\[ \text{Mean} = \frac{\text{Sum of all values}}{\text{Number of all values}} \]

The **arithmetic mean** calculated for sample data is
denoted by \(\bar x\) (read as “x bar”)
and the arithmetic mean for population data is denoted by \(\mu\) (Greek letter *mu*). Thus, the
mean can be expressed by the following equations:

\[ \bar x = \frac{1}{n}\sum_{i=1}^n x_i \] or \[ \mu = \frac{1}{N}\sum_{i=1}^N x_i \]

where \(\sum_{i=1}^n x_i\) is the sum of all values (\(x_1, x_2,...,x_n\)), \(N\) is the population size, and \(n\) is the sample size.

Let us consider the `height`

variable in the
`students`

data set and calculate its arithmetic mean. Be
aware that the `students`

data set is a *sample* and
**is not** the population of all students. Thus, we
calculate \(\bar x\), the sample
mean.

```
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
stud_heights <- students$height # extract heights vector
stud_heigths_n <- length(stud_heights) # get the length of the heights vector, i.e. the number of observations
stud_heights_sum <- sum(stud_heights) # sum up the heights vector
stud_heights_xbar <- stud_heights_sum / stud_heigths_n
sprintf("The arithmetic mean height in the students data set is: %s cm", round(stud_heights_xbar, 1))
```

`## [1] "The arithmetic mean height in the students data set is: 171.4 cm"`

Of course, R also already has an in-built function called
`mean()`

, which makes things a lot easier for us.

`mean(stud_heights)`

`## [1] 171.3808`

When studying phenomena such as inflation or population changes,
which involve periodic increases or decreases (known as *rates of
change*), the **geometric mean** is more appropriate to
find the average change over the entire period under investigation. To
calculate the geometric mean of a sequence of \(n\) values \(x_1,
x_2,..., x_n\), we multiply them and then find the \(n^{th}\) root of this product:

\[ \bar x_{geo} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} \]

which can be rewritten as

\[ \bar x_{geo} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} =\bigg(\prod_{i=1}^n x_i \bigg)^{1/n} = \sqrt[n]{\prod_{i=1}^n x_i} \]

Let us make it clear by calculating an example:

We consider the annual growth rates of a swarm of honey bees over a 5-year period. These rates of change are: 14 %, 26 %, 16 %, -38 %, -6 %. Further, we know that at the beginning of the monitoring period there were 5,000 bees in the swarm. We are looking for the mean rate of population change.

First, we set up our variable as a vector in R:

`bees <- c(14, 26, 16, -38, -6) # rate of change in %`

Now, we apply, against better knowledge, the arithmetic mean:

`sprintf("The mean rate of population change is: %s percent", mean(bees))`

`## [1] "The mean rate of population change is: 2.4 percent"`

The arithmetic mean indicates that the swarm is growing over the period of five years!

Well, we are skeptical, thus we calculate the annual growth of the
swarm of bees explicitly. First, we transform the given percentages into
relative growth rates (`bees_growth_rel`

). Then we simply
calculate the state of the bee population after 5 years by sequentially
multiplying the rates of change with the number of bees, which we know
was 5,000 at the beginning of the survey.

```
bees_growth_rel <- 1 + bees / 100 # annual rates of growth/decline
bees_growth_rel
```

`## [1] 1.14 1.26 1.16 0.62 0.94`

```
round(5000 *
bees_growth_rel[1] *
bees_growth_rel[2] *
bees_growth_rel[3] *
bees_growth_rel[4] *
bees_growth_rel[5])
```

`## [1] 4855`

Wow, what a surprise! Obviously, there is something wrong. We expected the swarm to grow on average over time. However, we calculated a decline in the absolute number of bees.

Let us try the geometric mean!

Please note that \(\sqrt[n]{x} = x^{\frac{1}{n}}\).

To calculate the geometric mean explicitly we write:

```
bees_len <- length(bees) # number of observations
bees_growth_geom <- (bees_growth_rel[1] *
bees_growth_rel[2] *
bees_growth_rel[3] *
bees_growth_rel[4] *
bees_growth_rel[5])^(1 / bees_len)
# rounded result
round(bees_growth_geom, 3)
```

`## [1] 0.994`

Great! The geometric mean indicates that there is a decline in the number of species over time at an average rate of 0.994, which corresponds to -0.006 %. We check that by taking 5,000 bees (the initial number of bees in the swarm) times 0.994 for each year; thus, resulting in 4,971 bees after the first year, 4,942 after the second year, 4,913 after the third year, 4,884 after the fourth year and 4,855 after the fifth year of observation. A perfect match! In contrast to the arithmetic mean, the geometric mean does not over-state the year-to-year growth!

Unfortunately, there is no in-built function for the geometric mean
in the R-base package. Thus, we install the `psych`

package by calling
`install.packages("psych")`

and attaching it to the work
space by calling `library("psych")`

. Then, we can access the
`geometric.mean()`

function.

```
library("psych")
round(geometric.mean(bees_growth_rel), 3)
```

`## [1] 0.994`

The **harmonic mean** is best used when we want to
average inverse metrics, such as speed (km/h) or population density
(pop/km^{2}).

Consider the following example:

The distance from your house to the next lake is 40 km. You drove to the lake at a speed of 20 km per hour and returned home at a speed of 80 km per hour. What was your average speed for the whole trip?

Let us first calculate the arithmetic mean.

```
one_way <- 20
back_way <- 80
mean(c(one_way, back_way))
```

`## [1] 50`

The arithmetic mean of the two speeds you drove at is 50 km per hour. However, this is not the correct average speed. It ignores the fact that you drove at 20 km per hour for a much longer time than you drove at 80 km per hour. To find the correct average speed, we must instead calculate the harmonic mean.

The harmonic mean \(\bar x_h\) for the positive real numbers \(x_1 , x_2 , ... , x_n\) is defined by

\[\bar x_h = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}}, \qquad x_i > 0 \text{ for all } i.\]

Unfortunately, there is no in-built function for the harmonic mean in the R-base package. So we either write our own function

```
my.harmonic.mean <- function(vals) {
length(vals) / sum(1 / vals)
}
my.harmonic.mean(c(one_way, back_way))
```

`## [1] 32`

or use the `harmonic.mean()`

function provided by the `psych`

package

```
# function from the psych package
harmonic.mean(c(one_way, back_way))
```

`## [1] 32`

Perfect, both implementations give the same result. However, is this result correct? We can verify it by reasoning. In the example above the distance from the lake to your home is 40 km. So the trip from A to B at a speed of 20 km/h will take 2 hours. The trip from B to A at a speed of 80 km/h will take 0.5 hours. The total time taken for the round trip of 2x40 km will be 2.5 hours. The average speed will then be \(\frac{80}{2.5} = 32\) km/h.

There are applications, where certain values in a data set may be
considered more important than others. In general, for a sequence of
\(n\) data values \(x_1, x_2,..., x_n\) and their corresponding
weights \(w_1, w_2,..., w_n\) the
**weighted (arithmetic) mean** is given by

\[ \bar x_w = \frac{\sum_{i=1}^n w_ix_i }{\sum_{i=1}^n w_i} \]

where \(\sum_{i=1}^n w_ix_i\) is obtained by multiplying each data value by its weight and then adding the products.

For example, to determine the grades of students in a course an instructor may assign a weight to the final exam that is three times as much as that of the other exams. Let us find the weighted mean for a student who scores 45 and 68 on the first two exams and 74 on the final.

First, we calculate the weighted mean explicitly:

`sum(45 * 1, 68 * 1, 74 * 3) / sum(1, 1, 3)`

`## [1] 67`

Now, we repeat the calculation by applying the
`weighted.mean`

function of R:

```
stud_scores <- c(45, 68, 74)
stud_scores_weights <- c(1, 1, 3)
weighted.mean(stud_scores, stud_scores_weights)
```

`## [1] 67`

Just for the ease of comparison we calculate the arithmetic mean, too:

`round(mean(stud_scores))`

`## [1] 62`

Please note that the weighting of input values is a principle that is applicable to other measures of the mean as well. For example, we may weight the input variable for calculating the geometric mean.

\[ \bar x_{geo_w} = \bigg(\prod_{i=1}^n x_i^{w_i} \bigg)^{1/ \sum_{i=1}^n w_i}\]

where \(x_1,x_2,...,x_n\) correspond
to the data values and \(w_1,w_2,...,w_n\) to the weights,
respectively. Based on the equation above it is straight forward to
write an implementation for the **weighted geometric mean**
in R:

```
my.weighted.geometric.mean <- function(vals, weights) {
prod(vals)^(1 / sum(weights))
}
```

To make sure that our implementation is correct we recompute the example of the swarm of bees from the section above. Recall, that in the example we observed a swarm of bees over 5 years and noted the rates of change in the bee population. The annual rates of change were 1.14, 1.26, 1.16, 0.62, 0.94, which correspond to \(x_1,x_2,...,x_n\). To reproduce the result from above we set all weights \(w_1,w_2,...,w_n\) to be 1.

`round(my.weighted.geometric.mean(bees_growth_rel, rep(1, 5)), 3)`

`## [1] 0.994`

Correct; we get the same result!

The **weighted harmonic mean** \(\bar x_{h_w}\) for the positive real
numbers \(x_1 , x_2 , ... , x_n\) is
defined as

\[\bar x_{h_w} = \frac{w_1+w_2+\cdots +w_n}{\frac{w_1}{x_1} + \frac{w2}{x_2} + \cdots + \frac{w_n}{x_n}} = \frac{\sum_{i=1}^n w_i}{\sum_{i=1}^n \frac{w_i}{x_i}}, \qquad x_i > 0 \text{ for all } i.\]

Let us implement the weighted harmonic mean function in R. The coding
is straightforward, however, we include an `if`

-statement to
expand the functionality of our function. This `if`

-statement
in the code will normalize the weights if the weights are not given in
proportions. When the `if`

-statement is run, it prints a
user-feedback, otherwise there will be no feedback.

```
my.weighted.harmonic.mean <- function(vals, weights) {
if (sum(weights) != 1) {
weights <- weights / sum(weights)
print("Weights do not sum up to 1. Weights are normalized...")
}
sum(weights) / sum(weights / vals)
}
```

Let us try our function `my.weighted.harmonic.mean`

on a
fairly complex data set. The data set `cities`

consists of
all state capitals of Germany, their population size and area. The goal
is to calculate the mean population density for the state capitals of
Germany. You may download the `cities.csv`

file here.
The data is retrieved from this website.

First, we load the data set, assign it a proper name and take a look at it.

```
cities <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/cities.csv")
cities
```

```
## name state area_km2 pop_size
## 1 Berlin Land Berlin 891.85 3520100
## 2 Bremen Freie Hansestadt Bremen 325.42 546450
## 3 Dresden Freistaat Sachsen 328.31 525100
## 4 D\xfcsseldorf Land Nordrhein-Westfalen 217.41 593682
## 5 Erfurt Freistaat Th\xfcringen 269.17 203480
## 6 Hamburg Freie und Hansestadt Hamburg 755.26 1751780
## 7 Hannover Land Niedersachsen 204.14 514130
## 8 Kiel Land Schleswig-Holstein 118.60 239860
## 9 Magdeburg Land Sachsen-Anhalt 200.97 229924
## 10 Mainz Land Rheinland-Pfalz 97.76 202750
## 11 M\xfcnchen Freistaat Bayern 310.71 1388300
## 12 Potsdam Land Brandenburg 187.27 159450
## 13 Saarbr\xfccken Saarland 167.07 176990
## 14 Schwerin Land Mecklenburg-Vorpommern 130.46 91260
## 15 Stuttgart Land Baden-W\xfcrttemberg 207.36 597939
## 16 Wiesbaden Land Hessen 203.90 272630
```

Second, we create a new column and calculate the population density (inhabitant per square kilometer).

```
cities$pop_density <- cities$pop_size / cities$area_km2
cities
```

```
## name state area_km2 pop_size pop_density
## 1 Berlin Land Berlin 891.85 3520100 3946.9642
## 2 Bremen Freie Hansestadt Bremen 325.42 546450 1679.2146
## 3 Dresden Freistaat Sachsen 328.31 525100 1599.4030
## 4 D\xfcsseldorf Land Nordrhein-Westfalen 217.41 593682 2730.7024
## 5 Erfurt Freistaat Th\xfcringen 269.17 203480 755.9535
## 6 Hamburg Freie und Hansestadt Hamburg 755.26 1751780 2319.4397
## 7 Hannover Land Niedersachsen 204.14 514130 2518.5167
## 8 Kiel Land Schleswig-Holstein 118.60 239860 2022.4283
## 9 Magdeburg Land Sachsen-Anhalt 200.97 229924 1144.0713
## 10 Mainz Land Rheinland-Pfalz 97.76 202750 2073.9566
## 11 M\xfcnchen Freistaat Bayern 310.71 1388300 4468.1536
## 12 Potsdam Land Brandenburg 187.27 159450 851.4444
## 13 Saarbr\xfccken Saarland 167.07 176990 1059.3763
## 14 Schwerin Land Mecklenburg-Vorpommern 130.46 91260 699.5248
## 15 Stuttgart Land Baden-W\xfcrttemberg 207.36 597939 2883.5793
## 16 Wiesbaden Land Hessen 203.90 272630 1337.0770
```

Third, we calculate the weight for each city according its population size.

```
cities$pop_weight <- cities$pop_size / sum(cities$pop_size)
cities
```

```
## name state area_km2 pop_size pop_density
## 1 Berlin Land Berlin 891.85 3520100 3946.9642
## 2 Bremen Freie Hansestadt Bremen 325.42 546450 1679.2146
## 3 Dresden Freistaat Sachsen 328.31 525100 1599.4030
## 4 D\xfcsseldorf Land Nordrhein-Westfalen 217.41 593682 2730.7024
## 5 Erfurt Freistaat Th\xfcringen 269.17 203480 755.9535
## 6 Hamburg Freie und Hansestadt Hamburg 755.26 1751780 2319.4397
## 7 Hannover Land Niedersachsen 204.14 514130 2518.5167
## 8 Kiel Land Schleswig-Holstein 118.60 239860 2022.4283
## 9 Magdeburg Land Sachsen-Anhalt 200.97 229924 1144.0713
## 10 Mainz Land Rheinland-Pfalz 97.76 202750 2073.9566
## 11 M\xfcnchen Freistaat Bayern 310.71 1388300 4468.1536
## 12 Potsdam Land Brandenburg 187.27 159450 851.4444
## 13 Saarbr\xfccken Saarland 167.07 176990 1059.3763
## 14 Schwerin Land Mecklenburg-Vorpommern 130.46 91260 699.5248
## 15 Stuttgart Land Baden-W\xfcrttemberg 207.36 597939 2883.5793
## 16 Wiesbaden Land Hessen 203.90 272630 1337.0770
## pop_weight
## 1 0.31960740
## 2 0.04961492
## 3 0.04767644
## 4 0.05390334
## 5 0.01847496
## 6 0.15905283
## 7 0.04668042
## 8 0.02177808
## 9 0.02087594
## 10 0.01840868
## 11 0.12605067
## 12 0.01447726
## 13 0.01606980
## 14 0.00828595
## 15 0.05428986
## 16 0.02475343
```

`sprintf("The sum of the weight vector is %s", sum(cities$pop_weight))`

`## [1] "The sum of the weight vector is 1"`

Now, we finally apply our implementation of the weighted harmonic
mean (`my.weighted.harmonic.mean`

) and compare it to the
arithmetic mean.

We start by giving the function the `cities$pop_weight`

vector as an input parameter.

`my.weighted.harmonic.mean(cities$pop_density, cities$pop_weight)`

`## [1] 2386.186`

Then we test the functionality of our function by providing the
`cities$pop_size`

vector as an input parameter.

`my.weighted.harmonic.mean(cities$pop_density, cities$pop_size)`

`## [1] "Weights do not sum up to 1. Weights are normalized..."`

`## [1] 2386.186`

Awesome, the function works as expected. The results are identical.

We can conclude that the mean population density in the state
capitals of Germany is about 2,386 inhabitants/km^{2}.

For comparison, we may calculate the arithmetic mean of the population density.

`mean(cities$pop_density)`

`## [1] 2005.613`

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*