In data analysis, identifying outliers, i.e. observations that fall well outside the overall pattern of the data, is very important. An outlier requires special attention. It may be the result of a measurement or recording error, an observation from a different population, or an unusually extreme observation. Note that an extreme observation does not need to be an outlier; it can instead be an indication of skewness (Weiss 2010).

If we observe an outlier, we should try to determine its cause. If an outlier is caused by a measurement or recording error or if for some other reason it clearly does not belong to the data set, the outlier can simply be removed. However, if no explanation for an outlier is apparent, the decision whether to retain it in the data set is a difficult judgment call.

As a diagnostic tool for spotting observations that may be outliers,
we may use the quartiles and the interquartile range (\(IQR\)), i.e. the
difference between the third and the first quartile.
For this we define the **lower limit** and the
**upper limit** of a data set. The lower limit is the
number that lies \(1.5 \times IQR\)
below the first quartile; the upper limit is the number that lies \(1.5 \times IQR\) above the third quartile.
Observations that lie below the lower limit or above the upper limit are
potential outliers (Weiss 2010).

\[ \text{Lower limit} = Q_1 - 1.5 \times IQR \]

\[ \text{Upper limit} = Q_3 + 1.5 \times IQR \]
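As a minimal sketch of these two formulas, the following R snippet computes both limits for a small made-up vector `x` (the values are purely illustrative and not taken from any data set used in this section). It assumes R's default quantile definition in `quantile()` and `IQR()`, which may differ slightly from the quartile rule used in a particular textbook.

```r
# hypothetical toy data, chosen for illustration only
x <- c(2, 4, 5, 7, 8, 9, 10, 12, 40)

q1 <- unname(quantile(x, 0.25)) # first quartile
q3 <- unname(quantile(x, 0.75)) # third quartile
iqr <- IQR(x)                   # interquartile range, q3 - q1

lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# observations outside the limits are potential outliers
potential_outliers <- x[x < lower_limit | x > upper_limit]
potential_outliers
```

For this toy vector only the value 40 lies above the upper limit; no observation lies below the lower limit.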

A boxplot, also called a **box-and-whisker diagram**, is
based on the five-number summary and can be used to provide a
graphical display of the center and variation of a data set. These
diagrams were invented by the mathematician John
Wilder Tukey. Several types of
boxplots are in common use.

Box-and-whisker plots give a graphic representation of data using five measures: the median, the first quartile, the third quartile, and the smallest and the largest values that lie between the lower and the upper limits of the data set. The spacing between the different parts of the box indicates the degree of dispersion (spread) and skewness in the data. We can compare different distributions by making a box-and-whisker plot for each of them. Boxplots also help to detect outliers (Mann 2012). They can be drawn either horizontally or vertically.

The edges of the box are always the first and third quartile, and the
band inside the box is always the second quartile (the median). The
lines extending from the boxes (whiskers) indicate the variability
outside the upper and lower quartile. To construct a boxplot, we also
need the concept of adjacent values. The **adjacent
values** of a data set are the most extreme observations that
still lie within the lower and upper limits; i.e. they are the most
extreme observations that are not potential outliers. Outliers may be
plotted as individual points. Note that, if a data set has no potential
outliers, the adjacent values are just the minimum and maximum
observations (Weiss 2010).
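To make the notion of adjacent values concrete, here is a small sketch using `boxplot.stats()`, the base R function that computes the numbers behind a boxplot. The vector `x` is made up for illustration. Note that `boxplot.stats()` works with hinges (as in `fivenum()`) rather than with `quantile()`, so the box edges can differ slightly from the quartiles for some sample sizes.

```r
# hypothetical toy data, chosen for illustration only
x <- c(2, 4, 5, 7, 8, 9, 10, 12, 40)

# boxplot.stats() returns the numbers a boxplot is drawn from;
# coef = 1.5 is the multiplier of the box height used for the limits
bs <- boxplot.stats(x, coef = 1.5)

adjacent_values <- bs$stats[c(1, 5)] # ends of the two whiskers
box_and_median <- bs$stats[2:4]      # lower hinge, median, upper hinge
flagged <- bs$out                    # points that would be drawn individually

adjacent_values
flagged
```

Here the upper whisker ends at 12, the largest observation that is not a potential outlier, while 40 is flagged separately.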

Let us now construct a series of boxplots in order to analyze the
`students` data set in more depth. We start by constructing a
boxplot for the `nc.score` variable.

`students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")`

`boxplot(students$nc.score)`

We immediately get an impression of the spread and skewness in the
data. By adding the argument `horizontal = TRUE` to the
boxplot we rotate the boxplot by \(90^{\circ}\). For the sake of a better visual
impression we also color the box.

```
boxplot(students$nc.score,
  col = "blue",
  horizontal = TRUE,
  xlab = "Scores",
  main = "Numerus clausus"
)
```

Boxplots are a very powerful technique for exploratory data analysis
because it is very easy to condition the variable of interest, in our case
the `nc.score` variable, on other variables. In R we
condition one variable on another by using the `~` operator (this is
the so-called *formula notation*).
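As a small illustration of the formula notation, the sketch below uses a made-up data frame `toy` (its column names `score` and `group` are hypothetical). The formula `score ~ group` reads as "`score`, split by the levels of `group`". Setting `plot = FALSE` makes `boxplot()` return the computed statistics instead of drawing, which is convenient for inspecting how the data were grouped.

```r
# hypothetical toy data frame, for illustration only
toy <- data.frame(
  score = c(1.2, 1.8, 2.1, 2.5, 3.0, 3.4),
  group = c("a", "a", "a", "b", "b", "b")
)

# left of ~ : the variable of interest; right of ~ : the grouping variable
res <- boxplot(score ~ group, data = toy, plot = FALSE)

res$names # one box per group
res$n     # number of observations behind each box
```

Passing the data frame through the `data` argument is one design choice; referring to the columns with `$` inside the formula works as well.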

Let us plot a boxplot of the `nc.score` variable
conditioned on the `semester` variable. The
`semester` variable gives the semester in which the
particular student is currently enrolled. For your information: the minimum period
of study for the study programs under investigation is set to 4
semesters.

```
par(mar = c(2.5, 5, 5.8, 3))
boxplot(students$nc.score ~ students$semester,
  col = "blue",
  horizontal = TRUE,
  xlab = "Scores",
  ylab = "Semester",
  main = "Numerus clausus for students of\n different semesters"
)
```

Interesting, isn't it? The plot suggests that students in higher
semesters (\(> 5^{th}\)) tend to score lower on the *numerus
clausus*. Or, in other words, those students who finish their
studies within the minimum period of study tend to have a higher
*numerus clausus* score.

Still, we are not yet finished. We want to know whether gender has
any effect on that observation. We can easily incorporate an interaction
variable by simply adding the variable to the formula with a `+` sign. In
addition, we introduce the `notch` argument. If the notches
of two plots do not overlap, this is “strong evidence” that the two
medians differ (Chambers et al. (1983): Graphical Methods for Data
Analysis. Wadsworth & Brooks/Cole, p. 62). For further
information type `help(boxplot)` or
`help(boxplot.stats)` in your console. Please be aware that,
in order to get a nicer-looking y-axis, we must write one additional line
of code.

```
par(mar = c(2.5, 5, 5.8, 3), xpd = TRUE)
boxplot(students$nc.score ~ students$gender + students$semester,
  col = c("blue", "red"),
  horizontal = TRUE,
  notch = TRUE,
  xlab = "Numerus clausus scores",
  ylab = "Semester",
  yaxt = "n", # suppress the default y-axis, it is drawn manually below
  main = "Numerus clausus for students of\n different semesters and gender"
)
# add a legend
legend(
  x = 2,
  y = 16.6,
  legend = c("Female", "Male"),
  col = c("blue", "red"),
  pch = 15,
  bty = "n",
  pt.cex = 3,
  cex = 1,
  horiz = TRUE
)
# add the y axis labels, one per pair of boxes;
# factor levels follow the order in which boxplot() arranges the groups
axis(2, at = seq(1.5, 14, 2), labels = levels(factor(students$semester)), tick = TRUE)
```

This plot is not as easy to interpret. Still, it seems to confirm the
observation we made previously: students in higher
semesters (\(> 5^{th}\)) tend to have a lower *numerus
clausus* score. However, the impact of gender on the *numerus
clausus* scores is not as clear. We will have to apply methods of
**inferential statistics** to assess whether these
differences are *statistically significant* or whether these
fluctuations around the median may also be caused solely by chance.
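To make the `notch` argument used above more concrete: according to the R documentation of `boxplot.stats()`, the notches extend to \(\pm 1.58 \times IQR / \sqrt{n}\) around the median, where the \(IQR\) is taken as the height of the box. The following sketch reproduces the notch limits by hand for a made-up vector `x` (illustrative values only).

```r
# hypothetical toy data, chosen for illustration only
x <- c(2, 4, 5, 7, 8, 9, 10, 12, 40)

bs <- boxplot.stats(x)

# notch limits as computed by R
bs$conf

# the same limits by hand: median +/- 1.58 * (box height) / sqrt(n)
box_height <- bs$stats[4] - bs$stats[2]
manual <- bs$stats[3] + c(-1.58, 1.58) * box_height / sqrt(bs$n)
manual
```

Two such intervals that do not overlap correspond to notches that do not overlap in the plot. This is a visual aid and does not replace a formal test.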

To wrap this section up, and in order to see a boxplot with outliers
too, we plot the `height` variable against the
`gender` variable. This time we use the extremely powerful `ggplot2`
package for advanced plotting.
You can use all `ggplot2` functions after installing the package once with
`install.packages("ggplot2")` and attaching it to the
workspace by calling `library(ggplot2)`.

```
library(ggplot2)
ggplot(students, aes(gender, height, fill = gender)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 6, outlier.size = 1.5, width = 0.9) +
  geom_line(color = "#3366FF", alpha = 0.5) +
  labs(title = "The height of students based on gender", x = "", y = "Height in cm") +
  scale_fill_manual(name = "Legend", values = c("#fc8d59", "#91bfdb")) +
  theme_minimal()
```

Clearly, and not unexpectedly, there is a difference
in the height of the students between the two groups (male and
female). Female students tend to be shorter than male students, but, if
we look at the extremes, there are tall and short individuals in both
groups. However, as mentioned above, we will have to test our
observations for *statistical significance* to be more confident
that the observed difference in height is not just due to chance.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*