Let us elaborate on the concept of discrete random variables with an exercise.
Say our population under investigation consists of all students, all lecturers and all administrative staff members at FU Berlin. We randomly pick one of these individuals and ask him/her about his/her number of siblings. Consequently, the answer, the number of siblings of a randomly selected individual, is a discrete random variable, denoted $X$. The actual value (number of siblings) of $X$ depends on chance, but we may still list all possible values of $X$, e.g. 0 siblings, 1 sibling, 2 siblings, and so on. For simplicity, we limit the number of siblings in this exercise to 5.
# First, let's import all the needed libraries.
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
According to the website of FU Berlin, there are 30,600 students, 5,750 doctoral students, 341 professors and 4,270 staff members associated with FU Berlin. In total, there are 40,961 individuals at FU Berlin (please note that the actual numbers may change over time).
fu_studs = 30600
fu_PhD = 5750
fu_profs = 341
fu_staff = 4270
fu_all = np.sum([fu_studs, fu_PhD, fu_profs, fu_staff])
# The (hidden) true probabilities used to simulate the answers
experiment_prob = np.array([0.2, 0.425, 0.275, 0.07, 0.025, 0.005])


def siblings(x):
    """Simulate the number of siblings reported by x randomly chosen individuals."""
    return random.choices([0, 1, 2, 3, 4, 5], weights=experiment_prob, k=x)
As we do not have any idea of the probability associated with a particular number of siblings, we start with some experiments:
We pick one randomly chosen individual and ask for the number of siblings.
The answer is:
siblings(1)
[0]
We pick ten randomly chosen individuals and ask them about siblings.
The answers are:
siblings(10)
[2, 2, 1, 1, 2, 1, 1, 3, 2, 2]
We pick one hundred individuals and ask for siblings.
The answers are:
siblings(100)
[1, 2, 1, 1, 2, 1, 2, 1, 2, 0, 1, 0, 4, 1, 2, 2, 2, 3, 1, 1, 0, 2, 2, 1, 1, 0, 2, 1, 2, 1, 1, 2, 2, 1, 0, 2, 2, 1, 4, 3, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 0, 2, 2, 0, 1, 1, 2, 2, 2, 2, 3, 3, 1, 1, 1, 2, 1, 2, 1, 0, 1, 1, 1, 2, 2, 1, 0, 2, 2, 0, 3, 1, 3, 2, 3, 1, 1, 1, 1, 0, 0, 1, 2, 1, 1, 1, 1, 0]
You see, this form of notation quickly becomes confusing as we increase the number of individuals we interview. Thus, we decide to record the frequency and the corresponding relative frequency of the answers for the classes 0, 1, 2, 3, 4, 5 (to be explicit: the last class corresponds to 5 or more siblings), and present the experiment in the form of a nicely formatted table.
We pick 1,000 individuals and ask them about siblings.
The easiest way is to construct a `pd.Series` object, so that we can apply the pandas function `value_counts()`. Setting its argument `normalize=True` returns relative frequencies instead of counts.
random.seed(1000)  # fix the seed for reproducibility
experiment = pd.Series(siblings(1000))
siblings1000_f = experiment.value_counts()  # absolute frequencies
siblings1000_p = experiment.value_counts(normalize=True)  # relative frequencies
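To present both quantities side by side, the two Series can be combined into a single table. A minimal, self-contained sketch (the column names are our own choice; the toy weights are repeated here so the snippet runs on its own):

```python
import random

import pandas as pd

# Toy weights from above, repeated for self-containment
experiment_prob = [0.2, 0.425, 0.275, 0.07, 0.025, 0.005]

random.seed(1000)
sample = pd.Series(random.choices([0, 1, 2, 3, 4, 5], weights=experiment_prob, k=1000))

# Combine absolute and relative frequencies into one table
freq_table = pd.concat(
    [sample.value_counts().sort_index(), sample.value_counts(normalize=True).sort_index()],
    axis=1,
    keys=["frequency", "relative frequency"],
)
print(freq_table)
```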
After listing all possible values and calculating the corresponding relative frequencies, we still do not know the exact probabilities of the discrete random variable $X$ for the whole population of 40,961 individuals associated with FU Berlin. However, after talking to 1,000 randomly chosen individuals, we are quite confident that such a large number of interviews, compared to the size of the whole population (40,961), gives us a good approximation of the probabilities of the discrete random variable $X$ (number of siblings) for the whole population.
In the next step, we draw a proportion histogram (of the sample), which displays the possible values of the discrete random variable $X$ on the horizontal axis and the proportions of those values on the vertical axis. A proportion histogram may also serve as an approximation to the probability distribution. Please note that the sum of the probabilities, as well as the sum of the proportions, of any discrete random variable is 1.
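As a quick sanity check, both claims can be verified numerically. A minimal sketch that re-declares the toy weights so it runs on its own:

```python
import random

import pandas as pd

experiment_prob = [0.2, 0.425, 0.275, 0.07, 0.025, 0.005]  # toy weights from above

# The probabilities of a discrete random variable sum to 1
# (up to floating-point rounding)
print(sum(experiment_prob))

# ... and so do the proportions observed in any sample
random.seed(1000)
sample = pd.Series(random.choices([0, 1, 2, 3, 4, 5], weights=experiment_prob, k=1000))
print(sample.value_counts(normalize=True).sum())
```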
col_map = plt.get_cmap("Blues", 8)
siblings1000_p.sort_index().plot(
    kind="bar",
    figsize=(10, 5),
    color=col_map([1, 2, 3, 4, 5, 6]),
    align="center",
    width=1,
    edgecolor="black",
)
plt.title(
    "Proportion histogram for the random variable X, the number of siblings\n of randomly selected individuals at FU Berlin",
    fontsize=15,
)
plt.xlabel("# of siblings", fontsize=15)
plt.ylabel("Proportion", fontsize=15)
plt.text(4.5, 0.3, "1000 random \n samples", fontsize=12)
plt.show()
In many real-life applications we do not know the population's probability distribution, and we never will. This is mainly because in many applications the population is much too large, there is no way to obtain reliable data, or we have neither the money nor the time for exhaustive data collection. However, by increasing the number of independent observations of a random variable $X$, the proportion histogram of the sample will approximate the probability histogram of the whole population better and better. To illustrate this claim we scale up our experiment:
We sequentially pick 10, 100 and 1,000 randomly chosen individuals associated with FU Berlin and ask them about their number of siblings. We plot each of our three experiments and finally compare them to the actual/real probability distribution. (Please note that this is a toy example and does not represent the real number of siblings in the population of individuals at FU Berlin; thus the instructors of this e-learning module know the probability distribution of the population ;-).)
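The same claim can also be quantified without plots: the largest absolute gap between the sample proportions and the true probabilities should shrink as the sample size grows, roughly like $1/\sqrt{n}$. Below is a sketch under the toy distribution defined above; the helper name `max_deviation` is our own invention:

```python
import random

import numpy as np

# Toy weights from above, repeated for self-containment
experiment_prob = np.array([0.2, 0.425, 0.275, 0.07, 0.025, 0.005])


def max_deviation(n, seed=0):
    """Largest absolute gap between sample proportions and true probabilities."""
    rng = random.Random(seed)
    draws = rng.choices([0, 1, 2, 3, 4, 5], weights=experiment_prob, k=n)
    counts = np.bincount(draws, minlength=6)
    return float(np.max(np.abs(counts / n - experiment_prob)))


for n in [10, 100, 1000, 100_000]:
    print(n, round(max_deviation(n), 4))
```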
samples = [10, 100, 1000]
fig, axes = plt.subplots(2, 2, figsize=(12, 7))
fig.suptitle(
    "Proportion histograms for the random variable X, the number of siblings\n of randomly selected individuals at FU Berlin",
    fontsize=12,
    y=0.95,
)
axes[0, 0].set_title("10 random samples")
axes[0, 1].set_title("100 random samples")
axes[1, 0].set_title("1000 random samples")
axes[1, 1].set_title('the "known unknown" actual \nprobability distribution')

# Bottom-right panel: the true probability distribution
axes[1, 1].set_xlabel("# of siblings")
axes[1, 1].set_ylabel("Probability")
axes[1, 1].set_ylim(0, 0.5)
axes[1, 1].bar(
    np.array([0, 1, 2, 3, 4, 5]),
    experiment_prob,
    color=col_map([1, 2, 3, 4, 5, 6]),
    align="center",
    width=1,
    edgecolor="black",
)
for bars in axes[1, 1].containers:
    axes[1, 1].bar_label(bars)

# Remaining panels: proportion histograms of the three samples
for sample, axis in zip(samples, axes.ravel()):
    siblings_p = pd.Series(siblings(sample)).value_counts(normalize=True).sort_index()
    siblings_p.plot(
        kind="bar",
        color=col_map([1, 2, 3, 4, 5, 6]),
        align="center",
        width=1,
        edgecolor="black",
        ax=axis,
    )
    axis.set_xlabel("# of siblings")
    axis.set_ylabel("Proportion")
    axis.set_ylim(0, 0.5)
    for bars in axis.containers:
        axis.bar_label(bars)
plt.tight_layout()
plt.show()
The graphs confirm our hypothesis: by increasing the number of observations, the proportion histogram of the sample approximates the probability histogram of the whole population better and better.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.