Percentiles divide a ranked data set into 100 equal parts, so each ranked data set has 99 percentiles. The \(k^{th}\) percentile is denoted by \(P_k\), where \(k\) is an integer in the range 1 to 99. For instance, the 25th percentile is denoted by \(P_{25}\).
Thus, the \(k^{th}\) percentile, \(P_k\), can be defined as a value in a data set such that about \(k\) % of the measurements are smaller than the value of \(P_k\) and about \((100 - k)\) % of the measurements are greater than the value of \(P_k\).
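The approximate position of the \(k^{th}\) percentile in the ranked data set is \[ \text{Position of } P_k = \frac{k \cdot n}{100}\] where \(k\) denotes the number of the percentile and \(n\) represents the sample size. The value of \(P_k\) is the observation found at that position in the ranked data; if the result is not an integer, we round it to the nearest position.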
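To see how the formula above is used, here is a minimal sketch on a small toy vector. The vector `x` and the choice `k <- 30` are hypothetical and are not part of the students data set; the rounding step mirrors the approach used in the exercise that follows.

```r
# Hypothetical toy data, used only for illustration
x <- c(12, 5, 9, 21, 7, 15, 3, 18, 10, 14)

k <- 30                          # percentile of interest
n <- length(x)                   # sample size
position <- round(k * n / 100)   # position in the ordered vector
p_k <- sort(x)[position]         # observation at that position

# share of values strictly below and at or below p_k, in percent
share_below <- mean(x < p_k) * 100
share_at_or_below <- mean(x <= p_k) * 100

print(c(position = position, p_k = p_k))
print(c(below = share_below, at_or_below = share_at_or_below))
```

With so few observations, "about \(k\) %" is only a rough statement: the share of values below \(P_k\) and the share at or below it bracket the nominal \(k\) %. Note also that for very small \(k \cdot n\) the rounded position can be 0, in which case indexing returns an empty vector.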
As an exercise, we calculate the 38th, the 50th and the 73rd percentile of the nc_score variable in R. First, we calculate the 38th percentile according to the equation given above.
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
nc_score <- students$nc.score
k <- 38 # set k
n <- length(nc_score) # set n
sprintf("The %sth percentile's position is number %s.", k, round(k * n / 100))
## [1] "The 38th percentile's position is number 3131."
# select value based on number in the ordered vector
sort(nc_score)[round(k * n / 100)]
## [1] 1.74
Alternatively, we apply R’s quantile() function to find the 38th, 50th and 73rd percentile of the nc_score variable.
quantile(nc_score, probs = c(0.38, 0.50, 0.73))
## 38% 50% 73%
## 1.74 2.04 2.71
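For a large data set the two approaches agree closely, but on small samples they can differ, because quantile() by default (`type = 7`) interpolates linearly between neighbouring ordered values. Its `type` argument selects one of nine algorithms; `type = 1` always returns an observed data value. The sketch below compares both on a hypothetical toy vector `x`, which is an assumption made for illustration and not part of the students data set.

```r
# Hypothetical toy data, used only for illustration
x <- c(12, 5, 9, 21, 7, 15, 3, 18, 10, 14)

# default algorithm (type = 7): linear interpolation between order statistics
q_default <- quantile(x, probs = 0.38)

# type = 1: inverse of the empirical distribution function, no interpolation
q_type1 <- quantile(x, probs = 0.38, type = 1)

# position rule from the equation above
q_position <- sort(x)[round(38 * length(x) / 100)]

print(q_default)
print(q_type1)
print(q_position)
```

Here the default result lies between two observations, whereas `type = 1` and the position rule both pick an actual value from `x`.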
That worked out fine! You may check that the median of the nc_score variable corresponds to the 50th percentile (2.04), as calculated above.
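One way to carry out this check is sketched below on a hypothetical toy vector `x`, so that the snippet runs on its own; with the students data loaded, `x` could be replaced by `nc_score`. The comparison uses all.equal() rather than `==` to tolerate floating-point rounding, and unname() drops the "50%" label that quantile() attaches.

```r
# Hypothetical toy data, used only for illustration
x <- c(12, 5, 9, 21, 7, 15, 3, 18, 10, 14)

p50 <- unname(quantile(x, probs = 0.50))   # 50th percentile, default type = 7
med <- median(x)                           # sample median

# TRUE if both agree up to numerical tolerance
same <- isTRUE(all.equal(med, p50))
print(same)
```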
We can also calculate the percentile rank for a particular value \(x_i\) of a data set by the following equation: \[\text{Percentile rank of } x_i =\frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100\] The percentile rank of \(x_i\) gives the percentage of values in the data set that are less than \(x_i\).
Base R has no dedicated built-in function to calculate the percentile rank. However, it is fairly easy to write such a function ourselves:
# user-defined function based on the equation given above
percentile.ranked <- function(a_vector, value) {
  # count the values strictly less than `value`
  numerator <- sum(a_vector < value)
  denominator <- length(a_vector)
  round(numerator / denominator, 3) * 100
}
Now, we can calculate, for instance, the percentile rank for a numerus clausus of 2.5.
# calculate the percentile rank
value <- 2.5
percentile.ranked(nc_score, value)
## [1] 66.3
Rounding the result to the nearest integer value, we can state that about 66 % of the students in our data set had a numerus clausus below 2.5, that is, a better one.
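The same percentile rank can also be obtained without a helper function. The following is a minimal sketch, again on a hypothetical toy vector `x` with a hypothetical `value`, which are assumptions for illustration only. It relies on the fact that the mean of a logical vector is the share of `TRUE` entries, and contrasts this with ecdf() from the stats package, which counts values less than or equal to the given value rather than strictly less.

```r
# Hypothetical toy data, used only for illustration
x <- c(12, 5, 9, 21, 7, 15, 3, 18, 10, 14)
value <- 10

# share of values strictly less than `value`, in percent
rank_strict <- mean(x < value) * 100

# empirical cumulative distribution function: share of values <= `value`
rank_ecdf <- ecdf(x)(value) * 100

print(c(strict = rank_strict, ecdf = rank_ecdf))
```

The two results differ whenever `value` itself occurs in the data, so it is worth deciding up front which definition of the percentile rank is intended.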
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.