Percentiles divide a ranked data set into 100 equal parts, so each ranked data set has 99 percentiles. The $k^{th}$ percentile is denoted by $P_k$, where $k$ is an integer in the range 1 to 99. For instance, the 25th percentile is denoted by $P_{25}$.
Thus, the $k^{th}$ percentile, $P_k$, can be defined as a value in a data set such that about $k$ % of the measurements are smaller than the value of $P_k$ and about $(100 - k)$ % of the measurements are greater than the value of $P_k$.
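The approximate position of the $k^{th}$ percentile in the ranked data set is $$ \text{Position of } P_k = \frac{kn}{100}$$ where $k$ denotes the number of the percentile and $n$ represents the sample size. The value of $P_k$ is then the observation found at that position (rounded to the nearest integer) in the ranked data.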
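To make the position formula concrete, here is a minimal sketch on a small made-up data set of 20 values. The numbers and the helper name `percentile_by_position` are illustrative assumptions and not part of the students data set. The helper ranks the data, computes $kn/100$, rounds to the nearest integer and returns the value at that position.

```python
def percentile_by_position(values, k):
    """Approximate the k-th percentile as the value at position k*n/100 in the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    position = round(k * n / 100)        # 1-based position in the ranked data
    position = min(max(position, 1), n)  # keep the position inside the data
    return ranked[position - 1]          # Python lists are 0-based


# Hypothetical toy data: the integers 1 to 20 in shuffled order
toy_data = [12, 7, 3, 18, 5, 9, 14, 1, 16, 11, 2, 20, 8, 6, 15, 4, 19, 10, 13, 17]

p25 = percentile_by_position(toy_data, 25)  # position 25 * 20 / 100 = 5
print(p25)  # 5, the fifth-smallest value
```

Note that the position is counted from 1, whereas Python sequences are indexed from 0, hence the `position - 1`.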
As an exercise we calculate the 38th, the 50th and the 73rd percentile of the nc_score variable of the students data set in Python.
First, we calculate the 38th percentile according to the equation given above.
# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import scipy.stats as stats
students = pd.read_csv(
"https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
nc_score = students["nc.score"]
k = 38 # set the percentile k
n = len(nc_score) # set n
print(f"The {k}th percentile's position is number {round(k * n / 100)}.")
# select the value at that position in the sorted Series;
# .iloc is positional and 0-based, whereas plain [] would look up the original row label
nc_score.sort_values().iloc[round(k * n / 100) - 1]
The 38th percentile's position is number 3131.
1.74
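One detail worth keeping in mind when selecting from a sorted Series: `sort_values()` reorders the rows but keeps their original index labels. Selecting by label and selecting by position can therefore return different values. A small sketch with made-up values (not the students data) illustrates the difference:

```python
import pandas as pd

# Hypothetical toy Series with the default integer labels 0, 1, 2, 3
toy = pd.Series([3.1, 1.2, 2.5, 1.9])
ranked = toy.sort_values()  # values 1.2, 1.9, 2.5, 3.1 carrying labels 1, 3, 2, 0

by_label = ranked.loc[0]      # the row whose label is 0 -> 3.1
by_position = ranked.iloc[0]  # the first row in sorted order -> 1.2
print(by_label, by_position)
```

With an integer index, `ranked[0]` behaves like `ranked.loc[0]`, so `.iloc` is the safer choice whenever a position in the ranked data is meant.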
Alternatively, we apply the np.percentile() function to find the 38th, 50th and 73rd percentile of the nc_score variable.
np.percentile(nc_score, [38, 50, 73])
array([1.74, 2.04, 2.71])
np.percentile(nc_score, 50)
2.04
That worked out fine!
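The result of np.percentile() need not always coincide exactly with a value picked by the position formula, because by default NumPy interpolates linearly between neighbouring ranked values. The following sketch uses a made-up array to show the effect. It assumes the `method` keyword is available, which requires NumPy 1.22 or later.

```python
import numpy as np

toy = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical data, already ranked

linear = np.percentile(toy, 50)                   # default: interpolate between 2.0 and 3.0
lower = np.percentile(toy, 50, method="lower")    # take the lower neighbour
higher = np.percentile(toy, 50, method="higher")  # take the upper neighbour
print(linear, lower, higher)  # 2.5 2.0 3.0
```

For large data sets with many repeated values the choice of method rarely changes the result noticeably.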
You may check whether the median of the nc_score variable corresponds to the 50th percentile, 2.04, as calculated above.
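Such a check could be sketched as follows. To keep the snippet self-contained it uses a small made-up array named `toy_scores`. With the students data loaded, nc_score can be passed in its place.

```python
import numpy as np

# Hypothetical scores standing in for the nc_score variable
toy_scores = np.array([1.3, 2.7, 2.0, 3.4, 1.8, 2.2, 3.9])

median = np.median(toy_scores)
p50 = np.percentile(toy_scores, 50)

# Compare with a tolerance rather than ==, since both results are floats
print(median, p50, np.isclose(median, p50))
```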
We can also calculate the percentile rank for a particular value $x_i$ of a data set by the following equation:
$$\text{Percentile rank of } x_i =\frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100$$
The percentile rank of $x_i$ gives the percentage of values in the data set that are less than $x_i$.
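The definition translates almost literally into code. The following is a minimal sketch on made-up scores. The helper name `percentile_rank` and the data are illustrative assumptions. The fraction is multiplied by 100 to express it as a percentage.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


# Hypothetical scores
toy_scores = [1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0]

rank = percentile_rank(toy_scores, 2.5)  # 3 of the 8 values lie below 2.5
print(rank)  # 37.5
```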
In Python, the SciPy library provides a function to calculate the percentile rank. The scipy.stats.percentileofscore() function computes the percentile rank of a score relative to a list of scores.
Now, we can calculate, for instance, the percentile rank for a numerus clausus of 2.5.
# calculate the percentile rank
stats.percentileofscore(nc_score, 2.5)
66.39762107051827
Rounding the result to the nearest integer, we can state that about 66 % of the students in our data set had a numerus clausus better (that is, numerically lower) than 2.5.
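How ties are counted matters when the score itself occurs in the data. percentileofscore() accepts a `kind` argument for this purpose. The sketch below compares three of its options on a made-up list. `kind="strict"` corresponds to the "less than" definition, while the default `kind="rank"` averages the percentage rankings of tied values.

```python
from scipy import stats

toy = [1, 2, 3, 3, 4]  # hypothetical scores containing a tie at 3

rank = stats.percentileofscore(toy, 3)                   # default kind="rank"
strict = stats.percentileofscore(toy, 3, kind="strict")  # share of values strictly below 3
weak = stats.percentileofscore(toy, 3, kind="weak")      # share of values less than or equal to 3
print(rank, strict, weak)  # 70.0 40.0 80.0
```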
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.