Another very important measure of central tendency is the median. The median is the value of the middle term in a data set that has been ranked in increasing order. Thus, the median divides a ranked data set into two equal parts.
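The calculation of the median consists of the following two steps: first, rank the data set in increasing order; second, find the middle term. The value of this term is the median.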
Note that if the number of observations in a data set is odd, then the median is given by the value of the middle term in the ranked data. However, if the number of observations is even, then the median is given by the average of the values of the two middle terms (Mann 2012).
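The odd and even cases described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name median_by_rank and the two small age lists are hypothetical and are not part of the students data set.

```python
def median_by_rank(values):
    """Return the median by ranking the values and taking the middle term(s)."""
    ranked = sorted(values)  # step 1: rank the data in increasing order
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2  # step 2: locate the middle of the ranked data
    if n % 2 == 1:
        # odd number of observations: the single middle term is the median
        return ranked[mid]
    # even number of observations: average the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2


# Hypothetical example data (not taken from the students data set)
odd_ages = [23, 19, 21, 20, 35]
even_ages = [23, 19, 21, 20, 35, 22]

print(median_by_rank(odd_ages))   # 21
print(median_by_rank(even_ages))  # 21.5
```

For numeric input this should agree with np.median, which is used in the rest of this section.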
Let us evaluate the median for the age variable of the students data set.
# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
matplotlib is a Python library that is essential for visualizing and analyzing data. It is a comprehensive library for static, animated and interactive visualizations in Python.
students = pd.read_csv(
    "https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
stud_age = students["age"] # extract age vector
plt.figure(figsize=(9, 5)) # set figure size
plt.xlabel("Student")
plt.ylabel("Age")
plt.plot(stud_age, "o", color="black")
plt.show()
By plotting the age variable we immediately see that some students are much older than the rest.
Let us calculate the median...
np.median(stud_age)
21.0
...and compare it to the arithmetic mean.
np.mean(stud_age)
22.541570578953756
Now, for visualization we add the median and the arithmetic mean to the plot.
plt.figure(figsize=(9, 5)) # set figure size
plt.plot(stud_age, "o", color="black") # plot figure
plt.xlabel("Student")
plt.ylabel("Age")
plt.ylim(min(stud_age), max(stud_age) * 1.3) # set limits for y-axis
plt.axhline(
    y=np.mean(stud_age), color="red", linestyle="-", lw=3, label="Arithmetic mean"
)  # add horizontal mean line
plt.axhline(
    y=np.median(stud_age), color="green", linestyle="-", lw=3, label="Median"
)  # add horizontal median line
plt.legend()
plt.show()
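As we can see, the median is far less influenced by the outliers than the mean: the few much older students pull the arithmetic mean up to about 22.5 years, while the median stays at 21 years. Consequently, the median is preferred over the mean as a measure of central tendency for data sets that contain outliers (Mann 2012).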
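The effect can also be reproduced on a tiny data set. The following sketch uses hypothetical ages (not the students data) and, as an assumption made only to keep the example free of external dependencies, Python's built-in statistics module instead of NumPy. It adds a single extreme value and compares how far the mean and the median move.

```python
import statistics

# Hypothetical ages, chosen only for illustration
ages = [19, 20, 21, 21, 22, 23]
ages_with_outlier = ages + [60]  # add one much older student

# How far does each measure move when the outlier is added?
mean_shift = statistics.mean(ages_with_outlier) - statistics.mean(ages)
median_shift = statistics.median(ages_with_outlier) - statistics.median(ages)

print(round(mean_shift, 2))  # 5.57
print(median_shift)          # 0.0
```

In this example a single outlier raises the mean by more than five years, while the median does not change at all. In general the median can still move slightly when values are added, but it does not depend on how extreme those values are.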
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.