Outliers

In data analysis the identification of outliers, meaning observations that fall well outside the overall pattern of the data, is very important. An outlier requires special attention. It may be the result of a measurement or recording error, an observation from a different population or an unusual extreme observation. Note that an extreme observation does not need to be an outlier; it can instead be an indication of skewness (Weiss 2010).

If we observe an outlier, we should try to determine its cause. If an outlier is caused by a measurement or recording error or if for some other reason it clearly does not belong to the data set, the outlier can simply be removed. However, if no explanation for an outlier is apparent, the decision whether to retain it in the data set is a difficult judgment call.

As a diagnostic tool for spotting observations that may be outliers, we can use the quartiles and the interquartile range ($IQR$). For this we define the lower limit and the upper limit of a data set. The lower limit is the number that lies $1.5 \times IQR$ below the first quartile; the upper limit is the number that lies $1.5 \times IQR$ above the third quartile. Observations that lie below the lower limit or above the upper limit are potential outliers (Weiss 2010).

$$ \text{Lower limit} = Q1 - 1.5 \times IQR $$

$$ \text{Upper limit} = Q3 + 1.5 \times IQR $$
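
The two limits are straightforward to compute by hand. The following is a minimal sketch on a small, made-up sample (the numbers are hypothetical and not taken from any data set used in this section). It assumes the default linear interpolation of np.percentile for the quartiles; other quartile conventions may give slightly different limits.

```python
import numpy as np

# Hypothetical toy sample, for illustration only.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# First and third quartile (NumPy's default linear interpolation).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Limits as defined above.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations outside the limits are potential outliers.
potential_outliers = data[(data < lower_limit) | (data > upper_limit)]

print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)
print("Potential outliers:", potential_outliers)
```

Whether a flagged observation is actually removed remains the judgment call described above; the limits only mark candidates.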

Boxplots

A boxplot, also called a box-and-whisker diagram, is based on the five-number summary and can be used to provide a graphical display of the center and variation of a data set. These diagrams were invented by the mathematician John Wilder Tukey. Several types of boxplots are in common use.

Box-and-whisker plots give a graphic representation of data using five measures: the median, the first quartile, the third quartile, and the smallest and the largest value between the lower and the upper limits of the data set. The spacing between the different parts of the box indicates the degree of dispersion (spread) and skewness in the data. We can compare different distributions by making box-and-whisker plots for each of them. Box-and-whisker plots also help to detect outliers (Mann 2012). They can be drawn either horizontally or vertically.

In [2]:

# First, let's import all the needed libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

In [3]:

## hide cell
from IPython.display import Image

Image("20234_box-whisker.png")

Out[3]:

The edges of the box are always the first and third quartile, and the band inside the box is always the second quartile (the median). The lines extending from the boxes (whiskers) indicate the variability outside the upper and lower quartile. To construct a boxplot, we also need the concept of adjacent values. The adjacent values of a data set are the most extreme observations that still lie within the lower and upper limits; i.e. they are the most extreme observations that are not potential outliers. Outliers may be plotted as individual points. Note that, if a data set has no potential outliers, the adjacent values are just the minimum and maximum observations (Weiss 2010).
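
To make the notion of adjacent values concrete, here is a minimal sketch that computes them for a made-up sample. The helper name adjacent_values and the sample are hypothetical and serve only as an illustration; the quartiles again assume the default linear interpolation of np.percentile.

```python
import numpy as np


def adjacent_values(values):
    """Return (lower adjacent value, upper adjacent value, potential outliers)."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr

    # Observations within the limits; their extremes are the adjacent values.
    inside = x[(x >= lower_limit) & (x <= upper_limit)]
    # Observations outside the limits are potential outliers.
    outliers = x[(x < lower_limit) | (x > upper_limit)]
    return inside.min(), inside.max(), outliers


# Hypothetical toy sample, for illustration only.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]
low_adj, high_adj, outliers = adjacent_values(sample)
print("Adjacent values:", low_adj, high_adj)
print("Potential outliers:", outliers)
```

With its default setting whis=1.5, plt.boxplot draws the whiskers out to these adjacent values and plots the remaining observations as individual points.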

Let us now construct a series of boxplots in order to analyze the students data set in more depth. We start by constructing a boxplot for the nc.score variable.

In [4]:

students = pd.read_csv(
    "https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
nc_score = students["nc.score"]

fig = plt.figure(figsize=(7, 5))
plt.boxplot(nc_score)
plt.show()

We immediately get an impression of the spread and skewness in the data. By adding the argument vert=0 to the boxplot call we rotate the boxplot by $90^{\circ}$, so that it is drawn horizontally. For a better visual impression we also fill the box with color by setting patch_artist=True.

In [5]:

fig = plt.figure(figsize=(7, 5))
plt.boxplot(nc_score, vert=0, patch_artist=True)
plt.title("Numerus clausus")
plt.xlabel("Scores")
plt.show()

Boxplots are a very powerful technique for exploratory data analysis because it is very easy to condition the variable of interest, in our case the nc.score variable, on other variables. With the boxplot method of a pandas DataFrame we condition one variable on another by using the by argument.

Let us plot a boxplot of the nc.score variable conditioned on the semester variable. The semester variable corresponds to the semester the particular student is studying. For your information: The minimum period of study for the study programs under investigation is set to 4 semesters.

In [6]:

fig, ax = plt.subplots(figsize=(7, 5))
# Pass the axes explicitly; otherwise pandas opens a new figure
# and the figure created above stays empty.
students.boxplot(column=["nc.score"], by="semester", patch_artist=True, vert=0, ax=ax)
plt.tight_layout()
plt.show()

Interesting, isn't it? The plot suggests that students in higher semesters (> 5th) tend to score lower on the numerus clausus. Or, in other words, those students who finish their studies within the minimum period of study tend to have a higher numerus clausus score.
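
A conditioned boxplot can be cross-checked numerically by computing the quartiles per group. The following sketch uses a tiny, made-up data frame; the semester labels and scores are hypothetical and only mimic the column names of the students data set.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {
        "semester": ["1st", "1st", "1st", "2nd", "2nd", "2nd"],
        "nc.score": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
    }
)

# Quartiles of nc.score within each semester: one row per group,
# one column per quantile (0.25, 0.5, 0.75).
summary = toy.groupby("semester")["nc.score"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

The same groupby call could be applied to the students data frame; each row of the result then corresponds to one box in the conditioned plot.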

Still, we are not yet finished. We want to know whether gender has any effect on that observation. Therefore, we switch to the seaborn package for plotting boxplots, since it allows for easy handling of grouped boxplots.

We can easily incorporate an interaction variable by simply adding the variable with the hue argument. In addition, we introduce the notch argument. If the notches of two plots do not overlap this is "strong evidence" that the two medians differ (Chambers, et al. (1983): Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole, p. 62).
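
To illustrate what the notches represent, the sketch below computes a notch interval as median $\pm\ 1.57 \times IQR / \sqrt{n}$, which is the default rule matplotlib uses when no bootstrapping is requested. The helper names notch_interval and notches_overlap as well as the simulated groups are hypothetical and only serve as an illustration.

```python
import numpy as np


def notch_interval(values):
    """Approximate notch limits: median +/- 1.57 * IQR / sqrt(n)."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(x.size)
    return median - half_width, median + half_width


def notches_overlap(a, b):
    """Return True if the notch intervals of the two samples overlap."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a


# Simulated groups with clearly different centers (assumption for illustration).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=2.0, scale=0.5, size=200)
group_b = rng.normal(loc=3.0, scale=0.5, size=200)

print("Notch interval A:", notch_interval(group_a))
print("Notch interval B:", notch_interval(group_b))
print("Notches overlap:", notches_overlap(group_a, group_b))
```

Non-overlapping notches are a visual heuristic, not a formal test; the inferential methods mentioned later in this section are still needed for a proper assessment.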

Note that in order to position the legend nicely, we have to add one additional line.

In [7]:

sns.boxplot(
    y="semester",
    x="nc.score",
    data=students,
    palette="colorblind",
    hue="gender",
    orient="h",
    notch=True,
)

# place legend outside top right corner of plot
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", borderaxespad=0)

This plot is not as easy to interpret. Still, it seems to confirm the observation we made previously: students in higher semesters (> 5th) tend to score lower on the numerus clausus. However, the impact of gender on the numerus clausus scores is not as clear. We will have to apply methods of inferential statistics to assess whether these differences are statistically significant or whether they may be caused by chance alone.

To wrap up this section, and to see a boxplot that also shows outliers, we plot the height variable against the gender variable. We will use the seaborn package again for this task.

In [8]:

sns.boxplot(y="height", x="gender", data=students, palette="colorblind", notch=True)

Not unexpectedly, the height of the students differs between the two groups (male and female). Female students tend to be shorter than male students, but, if we look at the extremes, there are tall and short individuals in both groups. However, as mentioned above, we will have to test our observations for statistical significance to be more confident that the observed difference in height is not just due to chance.

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.