Another important measure of central tendency is the median. The median is the value of the middle term in a data set that has been ranked in increasing order. Thus, the median divides a ranked data set into two equal parts: half of the observations lie at or below it and half lie at or above it.

The calculation of the median consists of the following two steps:

  1. Rank the data set in increasing order.
  2. Find the middle term. The value of this term is the median.

Note that if the number of observations in a data set is odd, then the median is given by the value of the middle term in the ranked data. However, if the number of observations is even, then the median is given by the average of the values of the two middle terms (Mann 2012).
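To make the odd and even cases concrete, the two steps can be sketched in plain Python. This is a minimal illustration only: the function name `median_by_ranking` and the example values are hypothetical and are not part of the students data set.

```python
def median_by_ranking(values):
    """Return the median by ranking the values and taking the middle term(s)."""
    ranked = sorted(values)  # step 1: rank the data in increasing order
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd number of observations: the single middle term
        return ranked[mid]
    # even number of observations: average of the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2


odd_example = [23, 19, 21, 20, 25]  # hypothetical values, n = 5
even_example = [23, 19, 21, 20]     # hypothetical values, n = 4

print(median_by_ranking(odd_example))   # 21
print(median_by_ranking(even_example))  # 20.5
```

In practice a library routine such as `np.median` does the same job; the sketch only spells out the rule.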

Let us evaluate the median for the age variable of the students data set.

In [2]:
# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

matplotlib is a comprehensive Python library for creating static, animated and interactive visualizations. We use it here to plot the data.

In [3]:
students = pd.read_csv(
    "https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
stud_age = students["age"]  # extract age vector

plt.figure(figsize=(9, 5))  # set figure size
plt.xlabel("Student")
plt.ylabel("Age")
plt.plot(stud_age, "o", color="black")
plt.show()

By plotting the age variable we immediately see that some students are much older than the rest.

Let us calculate the median...

In [4]:
np.median(stud_age)
Out[4]:
21.0

...and compare it to the arithmetic mean.

In [5]:
np.mean(stud_age)
Out[5]:
22.541570578953756

Now, for visualization we add the median and the arithmetic mean to the plot.

In [6]:
plt.figure(figsize=(9, 5))  # set figure size
plt.plot(stud_age, "o", color="black")  # plot figure
plt.xlabel("Student")
plt.ylabel("Age")
plt.ylim(min(stud_age), max(stud_age) * 1.3)  # set limits for y-axis

plt.axhline(
    y=np.mean(stud_age), color="red", linestyle="-", lw=3, label="Arithmetic mean"
)  # add horizontal mean line
plt.axhline(
    y=np.median(stud_age), color="green", linestyle="-", lw=3, label="Median"
)  # add horizontal median line
plt.legend()
plt.show()

As we can see, the median is hardly influenced by the outliers, whereas the arithmetic mean is pulled upward by the few much older students. Consequently, the median is preferred over the mean as a measure of central tendency for data sets that contain outliers (Mann 2012).
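A small, self-contained sketch can illustrate this behaviour. It assumes a made-up list of ages (not taken from the students data set) and uses the `statistics` module from Python's standard library instead of NumPy, purely to keep the example free of dependencies.

```python
from statistics import mean, median

# Hypothetical ages, chosen only for illustration
ages = [19, 20, 21, 21, 22, 23, 24]
ages_with_outlier = ages + [64]  # add one much older student

# Compare how strongly each measure reacts to the single outlier
print("mean:  ", mean(ages), "->", mean(ages_with_outlier))
print("median:", median(ages), "->", median(ages_with_outlier))
```

In this toy example the single outlier raises the mean by more than five years, while the median moves only from 21 to 21.5.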


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.