**Quartiles** divide a ranked data set into **four equal parts**.
The three values that separate these parts are the **first quartile** (denoted by $Q1$), the **second quartile** (denoted by $Q2$) and the **third quartile** (denoted by $Q3$).
The second quartile is the same as the median of a data set.
The first quartile is the value of the middle term among the observations that are less than the median and the third quartile is the value of the middle term among the observations that are greater than the median (Mann 2012).
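
The definition above can be sketched in a few lines of Python. This is only a minimal illustration: the helper name `quartiles_median_of_halves` and the small data set are hypothetical and not part of the original material, and the sketch assumes there are observations on both sides of the median.

```python
import numpy as np


def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) using the 'median of the halves' definition."""
    x = np.sort(np.asarray(values, dtype=float))
    q2 = np.median(x)
    lower = x[x < q2]  # observations less than the median
    upper = x[x > q2]  # observations greater than the median
    return float(np.median(lower)), float(q2), float(np.median(upper))


data = [7, 1, 5, 3, 9, 11, 13, 15]  # hypothetical example data
q1, q2, q3 = quartiles_median_of_halves(data)
print(q1, q2, q3)  # 4.0 8.0 12.0
```

For the ranked values 1, 3, 5, 7, 9, 11, 13, 15 the median is 8, the middle of the lower half is 4 and the middle of the upper half is 12.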

In [2]:

```
# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
```

In [3]:

```
cmap = plt.get_cmap("YlOrBr", 4)
cmap(4)
```

Out[3]:

(0.4, 0.1450980392156863, 0.02352941176470588, 1.0)

In [4]:

```
## hide cell
# Stacked grouped bar chart
plt.barh([1], 4, height=1.2, align="edge", color=cmap(3), edgecolor="black")
plt.barh([1], 3, height=1.2, align="edge", color=cmap(2), edgecolor="black")
plt.barh([1], 2, height=1.2, align="edge", color=cmap(1), edgecolor="black")
plt.barh([1], 1, height=1.2, align="edge", color=cmap(0), edgecolor="black")
plt.ylim(0, 4)
plt.xlim(0, 4)
plt.text(0.5, 1.5, "25%", fontsize=15)
plt.text(1.5, 1.5, "25%", fontsize=15)
plt.text(2.5, 1.5, "25%", fontsize=15)
plt.text(3.5, 1.5, "25%", fontsize=15)
plt.text(0.9, 0.7, "Q1", fontsize=15)
plt.text(1.9, 0.7, "Q2", fontsize=15)
plt.text(2.9, 0.7, "Q3", fontsize=15)
plt.arrow(
    2,
    3,
    0,
    -0.8,
    length_includes_head=True,
    head_width=0.15,
    head_length=0.25,
    color="black",
)
plt.text(1.7, 3.2, "Median", fontsize=15)
plt.axis("off")
plt.show()
```

Approximately 25 % of the values in a ranked data set are less than $Q1$ and about 75 % are greater than $Q1$.
The second quartile, $Q2$, divides a ranked data set into two equal parts; hence, the second quartile and the median are the same.
Approximately 75 % of the data values are less than $Q3$ and about 25 % are greater than $Q3$.
The difference between the third quartile and the first quartile of a data set is called the **interquartile range ($IQR$)** (Mann 2012).
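
As a rough check of these statements, the following sketch draws a synthetic sample (an assumption made purely for illustration, not the data used in this section), computes the quartiles with `np.percentile()` and measures the share of values below $Q1$ and $Q3$ as well as the $IQR$.

```python
import numpy as np

# Synthetic, normally distributed sample (illustrative assumption only)
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=10_000)

q1, q2, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # interquartile range: third quartile minus first quartile

share_below_q1 = np.mean(sample < q1)  # should be close to 0.25
share_below_q3 = np.mean(sample < q3)  # should be close to 0.75

print(f"Q1 = {q1:.3f}, Q2 = {q2:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")
print(f"share below Q1: {share_below_q1:.3f}, share below Q3: {share_below_q3:.3f}")
```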

Let us test Python's functionality for computing quantiles/quartiles.
We will use the `nc.score` variable of the `students` data set to calculate quartiles and the $IQR$.
The `nc.score` variable corresponds to the Numerus Clausus score of each particular student.

First, we subset the data and plot a histogram to further inspect the variable's distribution.

In [5]:

```
students = pd.read_csv(
    "https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
nc_score = students["nc.score"]
plt.hist(nc_score, bins="sturges", color="lightgrey", edgecolor="grey")
plt.title("Histogram of NC score")
plt.xlabel("nc")
plt.ylabel("Frequency")
plt.show()
```

To calculate the quartiles for the `nc_score` variable, we apply the function `np.percentile()`.
If you call the `help()` function on `np.percentile`, you see that the values for the argument `q` must be between 0 and 100.

Thus, in order to calculate the quartiles for the `nc_score` variable we just write:

In [6]:

```
help(np.percentile)
```

In [7]:

```
np.percentile(nc_score, [0, 25, 50, 75, 100])
```

Out[7]:

array([1. , 1.46, 2.04, 2.78, 4. ])
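
As an alternative sketch, pandas offers `Series.quantile()`, which expects fractions between 0 and 1 instead of percentages. The small series below is hypothetical example data, used here so that the snippet runs without downloading the `students` data set.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, not taken from the students data set
scores = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], name="score")

# Series.quantile() takes fractions in [0, 1] rather than percentages
quartiles = scores.quantile([0, 0.25, 0.5, 0.75, 1])
print(quartiles)

# With default settings both functions interpolate linearly and agree
same = np.allclose(
    quartiles.to_numpy(), np.percentile(scores, [0, 25, 50, 75, 100])
)
print(same)  # True
```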

Note: Not all statisticians define quartiles in exactly the same way.

For a detailed discussion of the different methods for computing quartiles, see e.g. the online article "Quartiles in Elementary Statistics" by E. Langford (2006).
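
To illustrate how much the result can depend on the definition, the sketch below computes the first quartile of a small, hypothetical data set with several of the estimation methods offered by `np.percentile()`. It assumes NumPy 1.22 or newer, where the keyword is called `method` (older versions used `interpolation`).

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])  # hypothetical example data

# A few of the estimation methods supported by np.percentile
methods = ["linear", "lower", "higher", "midpoint"]
q1_by_method = {m: float(np.percentile(data, 25, method=m)) for m in methods}

for name, value in q1_by_method.items():
    print(f"{name:>8}: Q1 = {value}")
# linear: 2.75, lower: 2.0, higher: 3.0, midpoint: 2.5
```

All four values are defensible first quartiles for the same data, which is why it is worth stating the method when reporting quartiles.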

In order to calculate the $IQR$ for the `nc_score` variable we either write...

In [8]:

```
nc_score_quart = np.percentile(nc_score, [0, 25, 50, 75, 100])
nc_score_quart[3] - nc_score_quart[1]
```

Out[8]:

1.3199999999999998

...or we apply the built-in function `iqr()` that is included in the `scipy.stats` library.

In [9]:

```
# Pass the raw data, not the array of previously computed quantiles
stats.iqr(nc_score)
```

Out[9]:

1.3199999999999998

We can also visualize the partitioning of the `nc_score` variable into quartiles by plotting a histogram and by adding a couple of additional lines of code.

In [10]:

```
ax = nc_score.plot.hist(bins=50, density=True, edgecolor="black", figsize=(10, 5))
for bar in ax.containers[0]:
    # get x midpoint of bar
    x = bar.get_x() + 0.5 * bar.get_width()
    # set bar color based on the quartile the midpoint falls into
    if x < nc_score_quart[0]:
        bar.set_color("blue")
        bar.set_edgecolor("grey")
    elif x < nc_score_quart[1]:
        bar.set_color("blue")
        bar.set_edgecolor("grey")
    elif x < nc_score_quart[2]:
        bar.set_color("red")
        bar.set_edgecolor("grey")
    elif x < nc_score_quart[3]:
        bar.set_color("green")
        bar.set_edgecolor("grey")
    elif x < nc_score_quart[4]:
        bar.set_color("black")
        bar.set_edgecolor("grey")
    else:
        bar.set_color("grey")
plt.title("Quartiles")
plt.ylabel("Density")
plt.xlabel("Numerus Clausus score")
plt.text(4, 0.6, "1st", color="blue")
plt.text(4, 0.55, "2nd", color="red")
plt.text(4, 0.5, "3rd", color="green")
plt.text(4, 0.45, "4th", color="black")
plt.show()
```
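
Besides colouring a histogram, each observation can also be assigned to its quartile group directly. The following is a minimal sketch using `pd.qcut()` on hypothetical data (the integers 1 to 100), with group labels chosen here only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the integers 1 to 100
values = pd.Series(np.arange(1, 101), name="value")

# Cut the data at the quartiles and label the four resulting groups
groups = pd.qcut(values, q=4, labels=["1st", "2nd", "3rd", "4th"])

# Each group should hold about a quarter of the observations
counts = groups.value_counts().sort_index()
print(counts)
```

With 100 distinct values every group contains exactly 25 observations; with ties or other sample sizes the groups are only approximately equal in size.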

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*