Percentiles and Percentile Rank

Percentiles divide a ranked data set into 100 equal parts, so each ranked data set has 99 percentiles. The \(k^{th}\) percentile is denoted by \(P_k\), where \(k\) is an integer in the range 1 to 99. For instance, the \(25^{th}\) percentile is denoted by \(P_{25}\).

Thus, the \(k^{th}\) percentile, \(P_k\), can be defined as a value in a data set such that about \(k\) % of the measurements are smaller than the value of \(P_k\) and about \((100 - k)\) % of the measurements are greater than the value of \(P_k\).

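The approximate position of the \(k^{th}\) percentile in the ranked data set is \[ \text{Position of } P_k = \frac{k \cdot n}{100}\] where \(k\) denotes the number of the percentile and \(n\) represents the sample size. The value of \(P_k\) is then the observation found at that position (rounded to a whole number) in the ranked data.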
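To make the formula concrete, here is a minimal sketch on a small made-up sample. The vector `x` and the choice `k <- 30` are assumptions for illustration only and are not part of the students data. The formula yields a position in the ranked data, and the percentile is read off at that position.

```r
# Hypothetical toy sample of n = 10 values (not part of the students data)
x <- c(12, 5, 8, 21, 15, 3, 9, 18, 7, 11)
k <- 30                      # percentile of interest
n <- length(x)               # sample size, here 10

pos <- round(k * n / 100)    # position in the ranked data: 30 * 10 / 100 = 3
p_k <- sort(x)[pos]          # value at that position in the sorted vector

pos
p_k
```

Here the ranked sample is 3, 5, 7, 8, ..., so the third position holds the value 7. Note that for very small \(k\) or \(n\) the rounded position can be 0, in which case the indexing returns an empty vector; this sketch does not guard against that.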

As an exercise, we calculate the \(38^{th}\), the \(50^{th}\) and the \(73^{rd}\) percentile of the nc_score variable in R. First, we calculate the \(38^{th}\) percentile according to the equation given above.

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
nc_score <- students$nc.score

k <- 38 # set k
n <- length(nc_score) # set n
sprintf("The %sth percentile's position is number %s.", k, round(k * n / 100))

## [1] "The 38th percentile's position is number 3131."

# select value based on number in the ordered vector
sort(nc_score)[round(k * n / 100)]

## [1] 1.74

Alternatively, we apply R’s quantile() function to find the \(38^{th}\), \(50^{th}\) and \(73^{rd}\) percentile of the nc_score variable.

quantile(nc_score, probs = c(0.38, 0.50, 0.73))

##  38%  50%  73% 
## 1.74 2.04 2.71

That worked out fine: both approaches give 1.74 for the \(38^{th}\) percentile. You may check whether the median of the nc_score variable corresponds to the \(50^{th}\) percentile (2.04), as calculated above.
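The suggested check can be sketched on a small made-up sample, which also shows why the position-based approach and quantile() need not always agree. By default quantile() uses `type = 7`, which interpolates between neighbouring ranked values, whereas `type = 1` returns an observed value. The vector `x` below is an assumption for illustration only.

```r
# Hypothetical toy sample with an even number of values
x <- c(3, 5, 7, 8, 9, 11, 12, 15, 18, 21)

# Default (type = 7) interpolates between neighbouring order statistics
q_default <- quantile(x, probs = 0.5, names = FALSE)

# type = 1 returns an observed value (inverse of the empirical CDF)
q_type1 <- quantile(x, probs = 0.5, type = 1, names = FALSE)

# The 50th percentile from the default method agrees with median()
same_as_median <- isTRUE(all.equal(q_default, median(x)))

q_default
q_type1
same_as_median
```

For this sample the default method returns 10 (halfway between 9 and 11), while `type = 1` returns the observed value 9. On a large data set such as nc_score the differences between methods are typically small.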

We can also calculate the percentile rank for a particular value \(x_i\) of a data set by the following equation: \[\text{Percentile rank of } x_i =\frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100\] The percentile rank of \(x_i\) gives the percentage of values in the data set that are less than \(x_i\).

Base R has no dedicated built-in function to calculate the percentile rank as defined above. However, it is fairly easy to write such a function ourselves:

# user-defined function based on the equation given above
percentile.ranked <- function(a_vector, value) {
  # count the values strictly smaller than `value` (no sorting required)
  numerator <- sum(a_vector < value)
  denominator <- length(a_vector)
  # proportion rounded to 3 decimals, expressed as a percentage
  round(numerator / denominator, 3) * 100
}

Now, we can calculate, for instance, the percentile rank for a numerus clausus of 2.5.

# calculate the percentile rank
value <- 2.5
percentile.ranked(nc_score, value)

## [1] 66.3

Rounding the result to the nearest integer, we can state that about 66 % of the students in our data set had a numerus clausus below 2.5, that is, a better one, since lower values are better.
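As a cross-check, the same percentage can be obtained more compactly. The sketch below uses a small made-up vector `x` (an assumption for illustration, not the students data) and compares the strict definition used above with base R's ecdf(), which counts values less than or equal to the given value.

```r
# Hypothetical toy sample
x <- c(1.3, 1.7, 2.0, 2.5, 2.5, 3.1, 3.4, 3.9)
value <- 2.5

# Share of values strictly below `value`, expressed as a percentage
rank_strict <- mean(x < value) * 100

# ecdf() returns a function giving the share of values <= its argument
rank_inclusive <- ecdf(x)(value) * 100

rank_strict
rank_inclusive
```

Three of the eight values lie below 2.5, so the strict rank is 37.5, while the inclusive rank is 62.5 because the two values equal to 2.5 are counted as well. Which convention is appropriate depends on how ties should be treated.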

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.