From the three quartiles ($Q1, Q2, Q3$) we can obtain a measure of center (the median, $Q2$) and measures of variation of the two middle quarters of the data, $Q2 - Q1$ for the second quarter and $Q3 - Q2$ for the third quarter. But the three quartiles do not tell us anything about the variation of the first and fourth quarters.
To gain this information, we include the minimum and maximum observations as well. The variation of the first quarter can be measured as the difference between the minimum and the first quartile, $Q1 - Min$. The variation of the fourth quarter can be measured as the difference between the third quartile and the maximum, $Max - Q3$. Thus, the minimum, maximum and quartiles together provide, among other things, information on center and variation (Weiss 2010).
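The four quarter-wise spreads described above can be sketched in a few lines. This is only an illustration: the values in `sample` are a made-up toy sample (not the students data), and the quartiles are computed with pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical toy sample, chosen for illustration only
sample = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 6.0, 9.0])

# Quartiles with pandas' default (linear) interpolation
q1, q2, q3 = sample.quantile([0.25, 0.5, 0.75])

# Spread of each quarter of the data
spreads = pd.Series(
    {
        "Q1 - Min": q1 - sample.min(),
        "Q2 - Q1": q2 - q1,
        "Q3 - Q2": q3 - q2,
        "Max - Q3": sample.max() - q3,
    }
)
print(spreads)
```

The four spreads always add up to the range $Max - Min$. In this toy sample the fourth quarter is by far the widest, which the three quartiles alone would not reveal.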
The so-called Tukey Five-Number Summary (after the mathematician John Wilder Tukey) of a data set consists of the $Min$, $Q1$, $Q2$, $Q3$ and $Max$ of the data set.
Python has no built-in function for the five-number summary, but the following function calculates it. For demonstration purposes, it is applied to the nc_score variable of the students data set.
# First, let's import all the needed libraries.
import pandas as pd
students = pd.read_csv(
"https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
nc_score = students["nc.score"]
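
def fivenum(x):
    """Return the minimum, quartiles and maximum of x as a pandas Series."""
    series = pd.Series(x)
    mi = series.min()
    # interpolation="nearest" returns an observed value instead of an interpolated one
    q1 = series.quantile(q=0.25, interpolation="nearest")
    me = series.median()
    q3 = series.quantile(q=0.75, interpolation="nearest")
    ma = series.max()
    return pd.Series([mi, q1, me, q3, ma], index=["min", "q1", "median", "q3", "max"])

fivenum(nc_score)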
min       1.00
q1        1.46
median    2.04
q3        2.78
max       4.00
dtype: float64
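This function returns the minimum, first quartile, median, third quartile and maximum of the input data. Because the quartiles are computed with interpolation="nearest", they are always observed values and approximate Tukey's lower and upper hinges.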
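In the strict sense, Tukey's hinges are the medians of the lower and the upper half of the sorted data, with the overall median counted in both halves when the number of observations is odd. The following is a minimal sketch of that definition. The helper name `tukey_hinges` is hypothetical and not part of pandas or NumPy.

```python
import numpy as np


def tukey_hinges(x):
    """Return (lower hinge, upper hinge) of x.

    Hypothetical helper: the hinges are the medians of the lower and
    upper halves of the sorted data; for an odd number of observations
    the overall median belongs to both halves.
    """
    values = np.asarray(x, dtype=float)
    values = np.sort(values[~np.isnan(values)])  # drop missing values, then sort
    half = (len(values) + 1) // 2  # size of each half
    return np.median(values[:half]), np.median(values[-half:])


# Small illustrative inputs (not the students data)
print(tukey_hinges(range(1, 10)))  # odd number of observations
print(tukey_hinges(range(1, 9)))   # even number of observations
```

For the integers 1 to 9 the hinges are 3 and 7. For the integers 1 to 8 they are 2.5 and 6.5, values that do not occur in the data and that a "nearest" quantile therefore cannot return.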
pandas offers a similar function, describe(), which is applicable to Series and DataFrame objects and resembles R's summary(). It provides similar statistics, but additionally reports the number of observations, the arithmetic mean and the standard deviation.
nc_score.describe()
count    8239.000000
mean        2.166481
std         0.811548
min         1.000000
25%         1.460000
50%         2.040000
75%         2.780000
max         4.000000
Name: nc.score, dtype: float64
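If only the five-number summary is needed, it can be picked out of the describe() result by label. The sketch below uses a small made-up series, `scores`, rather than the students data. Note that describe() computes its quartiles with pandas' default linear interpolation, so on other data they may differ slightly from quartiles computed with interpolation="nearest".

```python
import pandas as pd

# Hypothetical toy data, for illustration only
scores = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], name="toy_scores")

# describe() labels the quartiles "25%", "50%" and "75%"
five = scores.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(five)
```

Here the first and third quartiles (1.75 and 3.25) are interpolated between neighbouring observations and are not themselves values in the data.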
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.