The arithmetic mean is defined as follows:

$$ \text{Mean} = \frac{\text{Sum of all values}}{\text{Number of all values}} \, . $$The **arithmetic mean** calculated for sample data is denoted by $\bar x$ (read as "x bar") and the arithmetic mean for population data is denoted by $\mu$ (Greek letter *mu*).
Thus, the arithmetic mean can be calculated by the following equations:

or $$ \mu = \frac{1}{N}\sum_{i=1}^N x_i \, .$$

where $\sum_{i=1}^n x_i$ is the sum of all values ($x_1, x_2,...,x_n$), $N$ is the population size, and $n$ is the sample size.

In order to calculate the arithmetic mean via python, all the needed libraries have to be imported.

In [2]:

```
# First, let's import all the needed libraries.
import numpy as np
import pandas as pd
import scipy.stats as stat
```

`height`

variable in the `students`

data set and calculate its arithmetic mean. Be aware that the `students`

data set is a *sample* and **is not** the population of all students. Thus we calculate the $\bar x$, the sample mean.

In [3]:

```
students = pd.read_csv(
"https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
students_heights = students["height"] # extract heights vector
stud_heigths_n = len(students_heights) # get the length of the heights vector,
# which is the number of observations
stud_heights_sum = sum(students_heights) # sum up the heights vector
stud_heights_xbar = stud_heights_sum / stud_heigths_n
print(
f"The arithmetic mean height in the students data set is: {round(stud_heights_xbar,1)} cm."
)
```

The arithmetic mean height in the students data set is: 171.4 cm.

In [4]:

```
np.mean(students_heights)
```

Out[4]:

171.38075009103048

When studying phenomena such as inflation or population changes, which involve periodic increases or decreases (known as *rates of change*), the **geometric mean** is more appropriate to find the average change over the entire period under investigation.
To calculate the geometric mean of a sequence of $n$ values $x_1, x_2,..., x_n$, we multiply them and then find the $n^{th}$ root of this product:

which can be rewritten as

$$ \bar x_{geo} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} =\bigg(\prod_{i=1}^n x_i \bigg)^{1/n} = \sqrt[n]{\prod_{i=1}^n x_i} \, . $$Let us make it clear by calculating an example:

We consider the annual growth rates of a swarm of honey bees over a 5-year period. These rates of change are: 14 %, 26 %, 16 %, -38 %, -6 %. Further, we know that at the beginning of the monitoring period there were 5,000 bees in the swarm. We are looking for the mean rate of population change.

At first we set up our variables in python:

In [5]:

```
bees = np.array([14, 26, 16, -38, -6]) # rate of change in %
```

Now we apply, against better knowledge, the arithmetic mean:

In [6]:

```
bees_mean = np.mean(bees)
print(f"The mean rate of population change is: {round(bees_mean,1)} percent.")
```

The mean rate of population change is: 2.4 percent.

The arithmetic mean indicates that the swarm is growing over the period of five years!

Well, we are skeptical, thus we calculate the annual growth of the swarm of bees explicitly.
First, we transform the given percentages into relative growth rates (`bees_growth_rel`

).
Then we simply calculate the state of the bee population after 5 years by sequentially multiplying the rates of change with the number of bees, which we know was 5,000 at the beginning of the survey.

In [7]:

```
bees_growth_rel = 1 + bees / 100 # annual rates of growth/decline
bees_growth_rel
round(
5000
* bees_growth_rel[0]
* bees_growth_rel[1]
* bees_growth_rel[2]
* bees_growth_rel[3]
* bees_growth_rel[4]
)
```

Out[7]:

4855

Wow, what a surprise! Obviously, there is something wrong. We expected the swarm to grow on average over time. However, we calculated a decline in the absolute number of bees.

Let us try the geometric mean!

Please note that $\sqrt[n]{x} = x^{\frac{1}{n}}$.

To calculate the geometric mean explicitly we write:

In [8]:

```
bees_len = len(bees) # number of observations
bees_growth_geom = (
bees_growth_rel[0]
* bees_growth_rel[1]
* bees_growth_rel[2]
* bees_growth_rel[3]
* bees_growth_rel[4]
) ** (1 / bees_len)
```

In [9]:

```
# rounded result
round(bees_growth_geom, 3)
```

Out[9]:

0.994

Great! The geometric mean indicates that there is a decline in the number of species over time at an average rate of 0.994, which corresponds to -0.006%.

We check that by taking 5,000 bees (the initial number of bees in the swarm) times 0.994 for each year; thus, resulting in 4971 bees after the first year, 4942 after the second year, 4913 after the third year, 4884 after the fourth year and 4855 after the fifth year of observation.

A perfect match! In contrast to the arithmetic mean, the geometric mean does not over-state the year-to-year growth!

Unfortunately, there is no in-built function for the geometric mean in the Python Standard Library.
Thus, we install the `scipy`

library by calling `pip install scipy`

and then importing the `stats`

module within the`scipy`

library by calling `import scipy.stats`

.
Then, we can access the `gmean()`

function.

In [10]:

```
round(stat.gmean(bees_growth_rel), 3)
```

Out[10]:

0.994

The **harmonic mean** is best used when we want to average inverse metrics, such as speed (km/h) or population density (pop/km^{2}).

Consider the following example:

The distance from your house to the next lake is 40 km. You drove to the lake at a speed of 20 km per hour and returned home at a speed of 80 km per hour. What was your average speed for the whole trip?

Let us first calculate the arithmetic mean.

In [11]:

```
one_way = 20
back_way = 80
np.mean([one_way, back_way])
```

Out[11]:

50.0

The arithmetic mean of the two speeds you drove at is 50.0 km per hour. However, this is not the correct average speed. It ignores the fact that you drove at 20 km per hour for a much longer time than you drove at 80 km per hour. To find the correct average speed, we must instead calculate the harmonic mean.

The harmonic mean $\bar x_h$ for the positive real numbers $x_1 , x_2 , ... , x_n$ is defined by

$$\bar x_h = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}}, \qquad x_i > 0 \text{ for all } i.$$Unfortunately, there is no in-built function for the harmonic mean. So we either write our own function

In [12]:

```
def my_harmonic_mean(val):
"""
This function calculates the harmonic mean
based on an input vector called val
"""
return len(val) / np.sum(1 / val)
val = np.array([one_way, back_way])
my_harmonic_mean(val)
```

Out[12]:

32.0

In [13]:

```
# function from scipy library
stat.hmean([one_way, back_way])
```

Out[13]:

32.0

There are applications, where certain values in a data set may be considered more important than others.
In general, for a sequence of $n$ data values $x_1, x_2,..., x_n$ and their corresponding weights $w_1, w_2,..., w_n$ the **weighted (arithmetic) mean** is given by

where $\sum_{i=1}^n w_ix_i$ is obtained by multiplying each data value by its weight and then adding the products.

For example, to determine the grades of students in a course an instructor may assign a weight to the final exam that is three times as much as that of the other exams. Let us find the weighted mean for a student who scores 45 and 68 on the first two exams and 74 on the final.

First, we calculate the weighted mean explicitly:

In [14]:

```
np.sum([45 * 1, 68 * 1, 74 * 3]) / np.sum([1, 1, 3])
```

Out[14]:

67.0

In [15]:

```
stud_scores = np.array([45, 68, 74])
stud_weights = np.array([1, 1, 3])
np.average(stud_scores, weights=stud_weights)
```

Out[15]:

67.0

Just for the ease of comparison we calculate the arithmetic mean too.

In [16]:

```
round(np.mean(stud_scores))
```

Out[16]:

62

where $x_1,x_2,...,x_n$ correspond to the data values and $w_1,w_2,...,w_n$ to the weights, respectively.
Based on the equation above it is straight forward to write an implementation for the **weighted geometric mean** in Python:

In [17]:

```
def my_weighted_geometric_mean(val, weights):
"""
This function calculates the weighted geometric mean
based on an input vector called val and the applied weigths
"""
return np.prod(val) ** (1 / np.sum(weights))
```

In [18]:

```
weights = [1] * 5
round(my_weighted_geometric_mean(bees_growth_rel, weights), 3)
```

Out[18]:

0.994

Correct we get the same result!

Finally, we will implement a weighted harmonic mean function by our own.

The weighted harmonic mean $\bar x_{h_w}$ for the positive real numbers $x_1 , x_2 , ... , x_n$ is defined by the following equation:

$$\bar x_{h_w} = \frac{w_1+w_2+\cdots +w_n}{\frac{w_1}{x_1} + \frac{w2}{x_2} + \cdots + \frac{w_n}{x_n}} = \frac{\sum_{i=1}^n w_i}{\sum_{i=1}^n \frac{w_i}{x_i}}, \qquad x_i > 0 \text{ for all } i.$$Let us implement the weighted harmonic mean function in R. The coding is straightforward, however to expand the functionality of our weighted harmonic mean function, we include an `if`

-statement. This `if`

-statement in the code will normalize the weights if the weights are not given in proportions. When the `if`

-statement is run, we print a user-feedback, otherwise there will be no feedback.

In [19]:

```
def my_weighted_harmonic_mean(val, weights):
"""
This function calculates the weighted harmonic mean
based on an input vector called val and the applied weights
"""
if np.sum(weights) != 1:
weights = weights / np.sum(weights)
print("Weights do not sum up to 1. Weights are normalized...")
return np.sum(weights) / np.sum(weights / val)
```

Let us try our function `my_weighted_harmonic_mean`

on a fairly complex data set.
The data set `cities`

consists of all state capitals of Germany, their population size and area.
The goal is to calculate the mean population density for the state capitals of Germany.
You may download the `cities.csv`

file here.
The data is retrieved from this website.

First, using the `pandas`

library, we load the data set, assign it a proper name and take a look at it:

In [20]:

```
cities = pd.read_csv(
"https://userpage.fu-berlin.de/soga/data/raw-data/cities.csv",
encoding="ISO-8859-1", # add this to correctly read special characters (such as ü,ä,..)
encoding_errors="replace",
)
cities.head(10)
```

Out[20]:

name | state | area_km2 | pop_size | |
---|---|---|---|---|

1 | Berlin | Land Berlin | 891.85 | 3520100 |

2 | Bremen | Freie Hansestadt Bremen | 325.42 | 546450 |

3 | Dresden | Freistaat Sachsen | 328.31 | 525100 |

4 | Düsseldorf | Land Nordrhein-Westfalen | 217.41 | 593682 |

5 | Erfurt | Freistaat Thüringen | 269.17 | 203480 |

6 | Hamburg | Freie und Hansestadt Hamburg | 755.26 | 1751780 |

7 | Hannover | Land Niedersachsen | 204.14 | 514130 |

8 | Kiel | Land Schleswig-Holstein | 118.60 | 239860 |

9 | Magdeburg | Land Sachsen-Anhalt | 200.97 | 229924 |

10 | Mainz | Land Rheinland-Pfalz | 97.76 | 202750 |

In [21]:

```
cities["pop_density"] = cities["pop_size"] / cities["area_km2"]
cities.head(10)
```

Out[21]:

name | state | area_km2 | pop_size | pop_density | |
---|---|---|---|---|---|

1 | Berlin | Land Berlin | 891.85 | 3520100 | 3946.964176 |

2 | Bremen | Freie Hansestadt Bremen | 325.42 | 546450 | 1679.214554 |

3 | Dresden | Freistaat Sachsen | 328.31 | 525100 | 1599.403003 |

4 | Düsseldorf | Land Nordrhein-Westfalen | 217.41 | 593682 | 2730.702360 |

5 | Erfurt | Freistaat Thüringen | 269.17 | 203480 | 755.953487 |

6 | Hamburg | Freie und Hansestadt Hamburg | 755.26 | 1751780 | 2319.439663 |

7 | Hannover | Land Niedersachsen | 204.14 | 514130 | 2518.516704 |

8 | Kiel | Land Schleswig-Holstein | 118.60 | 239860 | 2022.428331 |

9 | Magdeburg | Land Sachsen-Anhalt | 200.97 | 229924 | 1144.071254 |

10 | Mainz | Land Rheinland-Pfalz | 97.76 | 202750 | 2073.956628 |

Third, we calculate the weight for each city according its population size.

In [22]:

```
cities["pop_weight"] = cities["pop_size"] / np.sum(cities["pop_size"])
pop_weight_sum = np.sum(cities["pop_weight"])
print(f"The sum of the weight vector is {pop_weight_sum}")
```

The sum of the weight vector is 1.0

Now, we finally apply our implementation of the weighted harmonic mean (`my_weighted_harmonic_mean`

) and compare it to the arithmetic mean.

We start by giving the function the `cities["pop_weight"]`

vector as an input parameter.

In [23]:

```
round(my_weighted_harmonic_mean(cities["pop_density"], cities["pop_weight"]), 3)
```

Out[23]:

2386.186

`cities["pop_size"]`

vector as an input parameter.

In [24]:

```
harm_mean_pop_size = round(
my_weighted_harmonic_mean(cities["pop_density"], cities["pop_size"]), 3
)
harm_mean_pop_size
```

Weights do not sum up to 1. Weights are normalized...

Out[24]:

2386.186

Awesome, the function works as expected. The results are identical.

We can conclude that the mean population density in the state capitals of Germany is about 2386.186 inhabitants/km^{2 .}

For comparison, we may calculate the arithmetic mean of the population density:

In [25]:

```
round(np.mean(cities["pop_density"]), 3)
```

Out[25]:

2005.613

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*