20233_percentiles_and_percentile

Percentiles divide a ranked data set into 100 equal parts, so each ranked data set has 99 percentiles. The $k^{th}$ percentile is denoted by $P_k$, where $k$ is an integer in the range 1 to 99. For instance, the $25^{th}$ percentile is denoted by $P_{25}$.

Thus, the $k^{th}$ percentile, $P_k$, can be defined as a value in a data set such that about $k$ % of the measurements are smaller than the value of $P_k$ and about $(100 - k)$ % of the measurements are greater than the value of $P_k$.

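The approximate value of the $k^{th}$ percentile, denoted by $P_k$, is $$ P_k = \text{value of the } \left(\frac{kn}{100}\right)^{th} \text{ term in the ranked data set}$$ where $k$ denotes the number of the percentile and $n$ represents the sample size. In other words, $\frac{kn}{100}$ gives the position of the percentile in the ordered data, not the percentile itself.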
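To make the formula concrete, here is a minimal sketch that treats $\frac{kn}{100}$ as a position in the ranked data and reads off the value found there. The ten scores are made up for illustration (they are not taken from the students data set), and approx_percentile is a hypothetical helper name, not a library function.

```python
def approx_percentile(values, k):
    """Return the value at position round(k * n / 100) of the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    position = round(k * n / 100)        # 1-based position in the ranked data
    position = min(max(position, 1), n)  # keep the position inside the data
    return ranked[position - 1]          # Python lists are 0-based


# made-up example data
scores = [2.3, 1.1, 3.4, 1.9, 2.8, 1.5, 3.0, 2.1, 2.6, 1.7]

p30 = approx_percentile(scores, 30)  # position 30 * 10 / 100 = 3
p70 = approx_percentile(scores, 70)  # position 70 * 10 / 100 = 7
print(p30, p70)
```

With ten values, the 30th percentile sits at position 3 and the 70th at position 7 of the ranked list.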

As an exercise, we calculate the $38^{th}$, the $50^{th}$ and the $73^{rd}$ percentile of the nc_score variable of the students data set in Python. First, we calculate the $38^{th}$ percentile according to the equation given above.

In [1]:

# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import scipy.stats as stats

In [2]:

students = pd.read_csv(
    "https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
nc_score = students["nc.score"]

In [4]:

k = 38  # set the percentile k
n = len(nc_score)  # set n
position = round(k * n / 100)  # 1-based position in the ranked data
print(f"The {k}th percentile's position is number {position}.")

# select the value at that position in the ordered vector;
# .iloc is positional (0-based), whereas plain [] would look up the row label
nc_score.sort_values().iloc[position - 1]

The 38th percentile's position is number 3131.

Out[4]:

1.74

Alternatively, we apply the np.percentile() function to find the $38^{th}$, $50^{th}$ and $73^{rd}$ percentile of the nc_score variable.

In [5]:

np.percentile(nc_score, [38, 50, 73])

Out[5]:

array([1.74, 2.04, 2.71])

In [6]:

np.percentile(nc_score, 50)

Out[6]:

2.04

That worked out fine! You may check whether the median of the nc_score variable corresponds to the $50^{th}$ percentile, 2.04, as calculated above.
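The suggested check can be sketched as follows. To keep the example self-contained it uses a small made-up sample rather than the students data set; with the real data one would pass nc_score to both functions instead.

```python
import numpy as np

# made-up sample, for illustration only
sample = np.array([1.3, 2.7, 1.9, 3.1, 2.2, 2.4])

median = np.median(sample)       # middle of the ranked data
p50 = np.percentile(sample, 50)  # 50th percentile

# compare with a tolerance, since both results are floating-point numbers
same = np.isclose(median, p50)
print(median, p50, same)
```

By definition the median is the $50^{th}$ percentile, so the two results should agree.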

We can also calculate the percentile rank for a particular value $x_i$ of a data set by the following equation:

$$\text{Percentile rank of } x_i =\frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100$$

The percentile rank of $x_i$ gives the percentage of values in the data set that are less than $x_i$.
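As a small illustration, the percentile rank can be computed by hand by counting the values below $x_i$ and expressing that count as a percentage of all values. The scores below are made up, and percentile_rank is a hypothetical helper name used only for this sketch.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# made-up example data
scores = [1.1, 1.5, 1.7, 1.9, 2.1, 2.3, 2.6, 2.8, 3.0, 3.4]

rank = percentile_rank(scores, 2.5)  # 6 of the 10 values are below 2.5
print(rank)
```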

In Python, there is a built-in function to calculate the percentile rank. The scipy.stats.percentileofscore() function computes the percentile rank of a score relative to a list of scores.

Now, we can calculate, for instance, the percentile rank for a numerus clausus of 2.5.

In [7]:

# calculate the percentile rank
stats.percentileofscore(nc_score, 2.5)

Out[7]:

66.39762107051827

Rounding the result to the nearest integer, we can state that about 66 % of the students in our data set had a numerus clausus below (that is, better than) 2.5.
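One detail worth knowing: percentileofscore() accepts a kind argument that controls how values equal to the score are counted. The default, kind="rank", averages over ties, whereas kind="strict" counts only values strictly less than the score and therefore follows the "less than" definition given above. The sketch below compares the options on a short made-up list with one tie; on data without ties at the score, the choice makes little difference.

```python
import scipy.stats as stats

# made-up data with a tie at the score of interest
values = [1, 2, 3, 3, 4]
score = 3

ranks = {
    kind: stats.percentileofscore(values, score, kind=kind)
    for kind in ("rank", "weak", "strict", "mean")
}

for kind, value in ranks.items():
    print(f"{kind:>6}: {value}")
```

Here "weak" counts values less than or equal to the score, and "mean" is the average of the "weak" and "strict" results.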

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.