Quartiles and Interquartile Range

Quartiles divide a ranked data set into four equal parts. The three dividing values are called the first quartile (denoted by \(Q1\)), the second quartile (denoted by \(Q2\)) and the third quartile (denoted by \(Q3\)). The second quartile is the same as the median of the data set. The first quartile is the value of the middle term among the observations that are less than the median, and the third quartile is the value of the middle term among the observations that are greater than the median (Mann 2012).

Approximately 25 % of the values in a ranked data set are less than \(Q1\) and about 75 % are greater than \(Q1\). The second quartile, \(Q2\), divides a ranked data set into two equal parts; hence, the second quartile and the median are the same. Approximately 75 % of the data values are less than \(Q3\) and about 25 % are greater than \(Q3\). The difference between the third quartile and the first quartile of a data set is called the interquartile range (\(IQR\)) (Mann 2012).

\[ IQR = Q3-Q1\]

Let us switch to R and test its functionality for computing quantiles/quartiles. We will use the nc.score variable of the students data set to calculate the quartiles and the \(IQR\). The nc.score variable corresponds to the Numerus Clausus score of each particular student.

First, we subset the data and plot a histogram to further inspect the variable’s distribution.

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
nc_score <- students$nc.score
hist(nc_score, breaks = "sturges")

To calculate the quartiles for the nc_score variable, we apply the function quantile(). If you call the help() function on quantile(), you see that the default values for the argument probs are 0, 0.25, 0.5, 0.75 and 1. Thus, in order to calculate the quartiles (together with the minimum and the maximum) for the nc_score variable we just write:

quantile(nc_score)

##   0%  25%  50%  75% 100% 
## 1.00 1.46 2.04 2.78 4.00

which gives the same result as:

quantile(nc_score, probs = c(0, 0.25, 0.5, 0.75, 1))

##   0%  25%  50%  75% 100% 
## 1.00 1.46 2.04 2.78 4.00
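The interpretation of the quartiles as 25 %, 50 % and 75 % cut points can be checked empirically. The following is a minimal sketch on simulated scores rather than on the students data set; the vector sim_score, the seed and the uniform distribution are assumptions made purely for illustration.

```r
# simulated stand-in for a score variable (hypothetical data, not the students data set)
set.seed(42)
sim_score <- runif(1000, min = 1, max = 4)

# the three quartiles Q1, Q2 and Q3
sim_quart <- quantile(sim_score, probs = c(0.25, 0.5, 0.75), names = FALSE)

# share of observations at or below each quartile
prop_below <- sapply(sim_quart, function(q) mean(sim_score <= q))
prop_below
```

The three shares should come out close to 0.25, 0.5 and 0.75.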

Note: Not all statisticians define quartiles in exactly the same way. For a detailed discussion of the different methods for computing quartiles, see the online article “Quartiles in Elementary Statistics” by E. Langford (2006). In addition, the help page of quantile(), opened with help(quantile), describes the type argument, which selects the algorithm used to compute the quantiles.
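To illustrate how the definitions can disagree, the sketch below applies the textbook rule from the beginning of this section (the median of the values below and above the median) to a small made-up vector and compares it with the nine type options of quantile(). The vector x and the helper name textbook_quartiles are hypothetical and only serve as an example.

```r
# small made-up sample (hypothetical values)
x <- c(41, 7, 39, 15, 36, 40)

# textbook rule: Q1 and Q3 are the medians of the values below and above the median
textbook_quartiles <- function(v) {
  q2 <- median(v)
  c(Q1 = median(v[v < q2]), Q2 = q2, Q3 = median(v[v > q2]))
}
textbook_quartiles(x)

# first quartile under each of the nine algorithms offered by quantile()
q1_by_type <- sapply(1:9, function(t) quantile(x, probs = 0.25, type = t, names = FALSE))
names(q1_by_type) <- paste0("type", 1:9)
q1_by_type
```

For this vector the textbook rule gives a first quartile of 15, whereas the default type = 7 interpolates between neighbouring observations and returns 20.25. On larger data sets the differences between the methods tend to shrink.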

In order to calculate the \(IQR\) for the nc_score variable we either write…

nc_score_quart <- quantile(nc_score, names = FALSE)
nc_score_quart[4] - nc_score_quart[2]

## [1] 1.32

…or we apply the in-built function IQR():

IQR(nc_score)

## [1] 1.32
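Both approaches rest on the same two quantiles. As a small sketch on made-up values (the vector x is hypothetical), the \(IQR\) can also be obtained with diff(), and IQR() accepts the same type argument as quantile():

```r
# small made-up sample (hypothetical values)
x <- c(41, 7, 39, 15, 36, 40)

# IQR as the difference between the 75 % and the 25 % quantile
iqr_manual <- diff(quantile(x, probs = c(0.25, 0.75), names = FALSE))
iqr_builtin <- IQR(x)
all.equal(iqr_manual, iqr_builtin)

# IQR() passes type on to quantile(), so other quartile definitions can be used
iqr_type6 <- IQR(x, type = 6)
iqr_type6
```

Because the quartile definition affects \(Q1\) and \(Q3\), the value of the \(IQR\) may change with the type argument, especially for small samples.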

We can visualize the partitioning of the nc_score variable into quartiles by plotting a histogram and by adding a couple of additional lines of code.

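# compute the histogram without drawing it
h <- hist(nc_score, breaks = 50, plot = FALSE)

# assign each break to a quartile interval; the leading 0 adds an interval
# that catches the minimum, which the left-open intervals would otherwise miss
cuts <- cut(h$breaks, c(0, nc_score_quart))

# color the bars by quartile (the extra first interval shares the color of the 1st quartile)
plot(h,
  col = c(4, 4, 3, 2, 1)[cuts],
  main = "Quartiles",
  xlab = "Numerus Clausus score"
)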

# add legend
legend("topright",
  legend = c("1st", "2nd", "3rd", "4th"),
  col = c(4, 3, 2, 1),
  pch = 15
)
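A simpler alternative, sketched here under the assumption that marking the quartile positions is sufficient, is to draw vertical lines at the quartiles with abline(). The simulated vector sim_score stands in for nc_score and is hypothetical.

```r
# simulated stand-in for nc_score (hypothetical data)
set.seed(42)
sim_score <- runif(1000, min = 1, max = 4)

# quartiles Q1, Q2 and Q3
sim_q <- quantile(sim_score, probs = c(0.25, 0.5, 0.75), names = FALSE)

hist(sim_score, breaks = 50, main = "Quartiles", xlab = "Simulated score")

# dashed vertical lines at the three quartiles
abline(v = sim_q, lty = 2, lwd = 2, col = c(4, 3, 2))
legend("topright",
  legend = c("Q1", "Q2", "Q3"),
  col = c(4, 3, 2),
  lty = 2,
  lwd = 2
)
```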

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.