Percentiles divide a ranked data set into 100 equal parts, so each ranked data set has 99 percentiles. The $$k^{th}$$ percentile is denoted by $$P_k$$, where $$k$$ is an integer in the range 1 to 99. For instance, the 25th percentile is denoted by $$P_{25}$$.

Thus, the $$k^{th}$$ percentile, $$P_k$$, can be defined as a value in a data set such that about $$k$$ % of the measurements are smaller than the value of $$P_k$$ and about $$(100 - k)$$ % of the measurements are greater than the value of $$P_k$$.
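The definition can be checked numerically. The following is a minimal sketch on hypothetical toy data (the integers 1 to 200, not the students data set); the variable names `x`, `p_k`, `share_below` and `share_above` are chosen for illustration only.

```r
# Hypothetical toy data: the integers 1 to 200 (not the students data set)
x <- 1:200
k <- 25

# 25th percentile, using the default settings of quantile()
p_k <- quantile(x, probs = k / 100, names = FALSE)

# share of values strictly below and strictly above P_k, in percent
share_below <- mean(x < p_k) * 100
share_above <- mean(x > p_k) * 100

print(c(P_k = p_k, below = share_below, above = share_above))
```

For this toy vector, about $$k$$ % of the values lie below $$P_k$$ and about $$(100 - k)$$ % lie above it, as the definition states.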

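The approximate position of the $$k^{th}$$ percentile, $$P_k$$, in the ranked data set is $\text{Position of } P_k = \frac{k \cdot n}{100}$ where $$k$$ denotes the number of the percentile and $$n$$ represents the sample size. The value of $$P_k$$ is the observation found at that position (rounded to a whole number) in the ranked data set.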

As an exercise, we calculate the 38th, the 50th and the 73rd percentile of the nc_score variable in R. First, we calculate the 38th percentile according to the equation given above.

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
nc_score <- students$nc.score
k <- 38 # set k
n <- length(nc_score) # set n
sprintf("The %sth percentile's position is number %s.", k, round(k * n / 100))
## [1] "The 38th percentile's position is number 3131."
# select value based on number in the ordered vector
sort(nc_score)[round(k * n / 100)]
## [1] 1.74
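
The rounding step matters whenever $$k \cdot n / 100$$ is not a whole number. A minimal sketch on hypothetical toy data (20 made-up values, not the students data set) illustrates this; the names `x`, `pos`, `pos_rounded` and `p_k` are assumptions made for this example.

```r
# Hypothetical toy data: 20 made-up measurements (not the students data set)
x <- c(2.1, 3.4, 1.0, 2.8, 1.7, 3.9, 2.5, 1.3, 3.1, 2.2,
       1.9, 3.6, 2.7, 1.5, 2.9, 3.3, 1.1, 2.4, 3.8, 2.0)
k <- 38
n <- length(x)

pos <- k * n / 100         # 7.6, not a whole number
pos_rounded <- round(pos)  # nearest whole position: 8

# value at that position in the ordered vector
p_k <- sort(x)[pos_rounded]
print(p_k)
```

Rounding to the nearest whole position is only one possible convention. Other definitions interpolate between neighbouring values, so results on small samples can differ slightly between methods.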

Alternatively, we apply R’s quantile() function to find the 38th, 50th and 73rd percentile of the nc_score variable.

quantile(nc_score, probs = c(0.38, 0.50, 0.73))
##  38%  50%  73%
## 1.74 2.04 2.71

That worked out fine! You may check if the median of the nc_score variable corresponds to the 50th percentile (2.04), as calculated above.
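
The suggested check can be sketched as follows. To keep the example self-contained, it uses a hypothetical toy vector `x` rather than the students data set; for the real data, replace `x` with `nc_score`.

```r
# Hypothetical toy data (not the students data set)
x <- c(1.2, 3.5, 2.0, 2.7, 1.8, 3.1, 2.4, 1.5)

med <- median(x)
p50 <- quantile(x, probs = 0.5, names = FALSE)

# with the default settings of quantile(), both values should agree
print(isTRUE(all.equal(med, p50)))
```

Here all.equal() is used instead of `==` to avoid spurious mismatches caused by floating-point arithmetic.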

We can also calculate the percentile rank for a particular value $$x_i$$ of a data set by the following equation: $\text{Percentile rank of } x_i =\frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100$ The percentile rank of $$x_i$$ gives the percentage of values in the data set that are less than $$x_i$$.

In R there is no built-in function to calculate the percentile rank. However, it is fairly easy to write such a function ourselves:

# user defined function based on the equation given above
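percentile.ranked <- function(a_vector, value) {
  # count the values strictly smaller than `value`
  numerator <- sum(a_vector < value)
  # total number of values in the data set
  denominator <- length(a_vector)
  # share in percent, rounded to one decimal place
  round(numerator / denominator, 3) * 100
}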

Now, we can calculate, for instance, the percentile rank for a numerus clausus of 2.5.

# calculate the percentile rank
value <- 2.5
percentile.ranked(nc_score, value)
## [1] 66.3

Rounding the result to the nearest integer value, we can state that about 66 % of the students in our data set had a numerus clausus better (that is, numerically lower) than 2.5.
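
As a cross-check, the same percentage can be obtained in one line with mean(). Base R's ecdf() computes a closely related quantity, but it counts values less than or equal to the given value, so the two differ when ties occur. The sketch below uses a hypothetical toy vector `scores` (not the students data set) to show the difference.

```r
# Hypothetical toy data (not the students data set)
scores <- c(1.3, 1.7, 2.0, 2.2, 2.5, 2.5, 2.8, 3.0, 3.4, 3.9)
value <- 2.5

# share of values strictly less than `value`, in percent
rank_strict <- mean(scores < value) * 100

# ecdf() counts values less than OR EQUAL to `value`, so ties are included
rank_inclusive <- ecdf(scores)(value) * 100

print(c(strict = rank_strict, inclusive = rank_inclusive))
```

Which variant is appropriate depends on whether values equal to $$x_i$$ should count towards its percentile rank. The equation given above uses the strict version.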

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.