20232_the_five_number

From the three quartiles ($Q1, Q2, Q3$) we can obtain a measure of center (the median, $Q2$) and measures of variation of the two middle quarters of the data, $Q2 - Q1$ for the second quarter and $Q3 - Q2$ for the third quarter. But the three quartiles do not tell us anything about the variation of the first and fourth quarters.

To gain this information, we include the minimum and maximum observations as well. The variation of the first quarter can be measured as the difference between the minimum and the first quartile, $Q1 - Min$. The variation of the fourth quarter can be measured as the difference between the third quartile and the maximum, $Max - Q3$. Thus, the minimum, maximum and quartiles together provide, among other things, information on center and variation (Weiss 2010).
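As a minimal sketch of how the four quarter spreads follow from the five numbers, the snippet below uses Python's standard `statistics` module on a small made-up sample. The sample values and the helper name `quarter_spreads` are illustrative assumptions and not part of the students data set. The `inclusive` quantile method is chosen here only because it interpolates linearly between observations.

```python
from statistics import quantiles


def quarter_spreads(values):
    """Return the spread of each quarter of the data (hypothetical helper)."""
    data = sorted(values)
    # Quartiles via linear interpolation between observations (assumed method).
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "first": q1 - data[0],     # Q1 - Min
        "second": q2 - q1,         # Q2 - Q1
        "third": q3 - q2,          # Q3 - Q2
        "fourth": data[-1] - q3,   # Max - Q3
    }


sample = [1, 2, 3, 4, 5, 7, 10, 15, 30]  # made-up, right-skewed sample
spreads = quarter_spreads(sample)
print(spreads)
```

For this sample the fourth quarter is by far the widest, which the three quartiles alone would not reveal.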

The so-called Tukey five-number summary (named after the mathematician John Wilder Tukey) of a data set consists of its $Min$, $Q1$, $Q2$, $Q3$ and $Max$.

Python has no built-in function for the five-number summary, but it is easy to write one. For demonstration purposes, the function defined below is applied to the nc_score variable of the students data set.

In [2]:

# First, let's import all the needed libraries.
import pandas as pd

In [3]:

students = pd.read_csv(
    "https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
nc_score = students["nc.score"]

In [4]:

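def fivenum(x):
    """Return min, first quartile, median, third quartile and max of x."""
    series = pd.Series(x)
    mi = series.min()
    # interpolation="nearest" returns the observation closest to the quantile position
    q1 = series.quantile(q=0.25, interpolation="nearest")
    me = series.median()
    q3 = series.quantile(q=0.75, interpolation="nearest")
    ma = series.max()
    return pd.Series([mi, q1, me, q3, ma], index=["min", "q1", "median", "q3", "max"])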

In [5]:

fivenum(nc_score)

Out[5]:

min       1.00
q1        1.46
median    2.04
q3        2.78
max       4.00
dtype: float64

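This function returns the minimum, first quartile, median, third quartile and maximum of the input data. Because the quartiles are computed with interpolation="nearest", they are always actual observations. They approximate Tukey's lower and upper hinges but need not match them exactly.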
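For comparison, here is a hedged sketch of the hinges themselves, assuming the common definition in which the lower and upper hinges are the medians of the lower and upper halves of the sorted data, with the overall median counted in both halves when the number of observations is odd. The helper name `tukey_hinges` and the sample values are hypothetical and only serve as illustration.

```python
from statistics import median


def tukey_hinges(values):
    """Return (lower hinge, upper hinge) of values (hypothetical helper)."""
    data = sorted(values)
    n = len(data)
    # Size of each half; for odd n the middle observation belongs to both halves.
    half = (n + 1) // 2
    lower_half = data[:half]
    upper_half = data[n - half:]
    return median(lower_half), median(upper_half)


hinges_even = tukey_hinges([1, 2, 3, 4, 5, 6, 7, 8])
hinges_odd = tukey_hinges([1, 2, 3, 4, 5, 7, 10, 15, 30])
print(hinges_even)  # (2.5, 6.5)
print(hinges_odd)   # (3, 10)
```

With an even number of observations the hinges can fall halfway between two observations, as in the first call, whereas a nearest-observation quantile always returns a value that occurs in the data.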

pandas also provides the describe() method for Series and DataFrame objects, which is comparable to R's summary() function. It reports similar statistics and additionally includes the number of observations, the arithmetic mean and the standard deviation.

In [6]:

nc_score.describe()

Out[6]:

count    8239.000000
mean        2.166481
std         0.811548
min         1.000000
25%         1.460000
50%         2.040000
75%         2.780000
max         4.000000
Name: nc.score, dtype: float64
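If only the five numbers are of interest, the corresponding rows can be selected from the describe() output by label. The following is a minimal sketch on a made-up sample; the variable names `toy` and `five` are assumptions for illustration. Note that describe() computes percentiles with linear interpolation by default, so its quartiles may differ slightly from those of a nearest-observation approach on other data.

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5, 7, 10, 15, 30])  # made-up sample

# Keep only the five-number rows of the describe() output.
five = toy.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(five)
```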

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.