The binomial distribution is the probability distribution for the number of successes in a sequence of Bernoulli trials (Weiss 2010).
In $n$ Bernoulli trials, the number of outcomes that contain exactly $x$ successes equals the binomial coefficient ${n \choose x}$ (Weiss 2010).
Let $X$ denote the total number of successes in $n$ Bernoulli trials with success probability $p$. Then the probability distribution of the random variable $X$ is given by:
$$ P(X = x) = {n \choose x}p^x(1 - p)^{n-x}, \qquad x = 0, 1, 2, . . . , n \, .$$The random variable $X$ is called a binomial random variable and is said to have the binomial distribution with parameters $n$ and $p$ (Weiss 2010).
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import math
import scipy.stats as stats
import random
import pandas as pd
n0 = 25
p0 = 0.3
x_1 = 3
x_2 = 15
x_3 = 9
Let us consider an example taken form the real world.
Long term statistics tell us, that the chance for passing the final statistics exam is 0.3. Yes, approximately 70 % of students fail in their statistic exam! By the way, if you complete this e-learning module your chances of passing the final exam will definitely increase. :-)
Let us consider a class of n0
students. What is the chance that exactly x_1
students of that particular class will pass the final statistics exam? Or written more formally $P(X = 3)$. Once again we start by implementing a naive approach in Python.
n = 25 # number of students
p = 0.3 # probability of success
k = 3 # exactly 3 students will pass the exam
math.comb(n, k) * p**k * (1 - p) ** (n - k)
0.02427998871170032
Wow, the probability that exactly x_1
out of n0
students $(P(X = 3))$ will pass the final statistics exam is quite low. What about the probability that exactly x_2
out of n0
students $(P(X = 15))$ will pass the final statistics exam. We turn to Python.
n = 25 # number of students
p = 0.3 # probability of success
k = 15 # exactly 15 students will pass the exam
math.comb(n, k) * p**k * (1 - p) ** (n - k)
0.0013248974242351928
The probability of $P(X = 15)$ is approximately 0.1%. We can continue with our experiments to find out all the probabilities of exactly one outcome for $k = 0,1,2,...,n$. Please note that for our particular example it is not very informative to know the probability to pass the exam of exactly one particular number ($k$) of students. For us it is of higher informational interest if we could answer the question what is the probability that $k$ or less students $(P(X \le k))$ will pass the exam, or equally interesting that $k$ or more students $(P(X \ge k))$ will pass the exam.
As an exercise we turn to Python and determine what is the probability that x_3
or less students will pass the final statistics exam $(P(X \le 9))$. So we are interested in the probability that 0 out of n0
, or 1 out of n0
, or 2 out of n0
, ..., or x_3
out of n0
students will pass the exam. To calculate this probability we can extend our naive approach and calculate $P(X = 0)+P(X = 1)+P(X = 2)+...+P(X = 9)$.
n = 25 # number of students
p = 0.3 # probability of success
(
math.comb(n, 0) * p**0 * (1 - p) ** (n - 0)
+ math.comb(n, 1) * p**1 * (1 - p) ** (n - 1)
+ math.comb(n, 2) * p**2 * (1 - p) ** (n - 2)
+ math.comb(n, 3) * p**3 * (1 - p) ** (n - 3)
+ math.comb(n, 4) * p**4 * (1 - p) ** (n - 4)
+ math.comb(n, 5) * p**5 * (1 - p) ** (n - 5)
+ math.comb(n, 6) * p**6 * (1 - p) ** (n - 6)
+ math.comb(n, 7) * p**7 * (1 - p) ** (n - 7)
+ math.comb(n, 8) * p**8 * (1 - p) ** (n - 8)
+ math.comb(n, 9) * p**9 * (1 - p) ** (n - 9)
)
0.8105639764950532
The probability that 9 or less students will pass the statistics exam $(P(X\le `r x.3`))$ is 81.1%. In turn the means that the probability that 10 or more students will pass the exam is $P(X > `r x.3`)= 1-P(X \le `r x.3`)$ or only 18.9%.
Beside the honestly quite unpleasant insight in the statistics of passing the final exam, it is clear that the naive implementation from above is quite tedious. Therefore we introduce the in-built function binom()
implemented in the scipy.stats
module. The binom()
function takes n
and p
and as shape parameters, where p
is the probability of a single success. Call the help()
on those functions for further information.
According to the help()
function the usage of binom()
is as follows: binom.pmf(k, n, p, loc)
, where k
corresponds to our $k$, n
to our $n$ and p
to our $p$. There are several methods to be used with the binom()
, e.g. the method pmf
gives the probability mass function. For more information on the different methods, check help()
.
In order to check if the binom.pmf()
function works as expected, we recall our test examples from above, which was $P(X = 3)$ and $P(X = 15)$.
n = 25 # number of students
p = 0.3 # probability of success
k = [3, 15]
stats.binom.pmf(k[0], n, p), stats.binom.pmf(k[1], n, p)
(0.024279988711700378, 0.0013248974242351943)
Compare these results with the results of our naive implementation (scroll up). The numbers should match.
The binom.pmf()
function becomes very handy if we consider to accumulate probabilities, such as $P(X \le x)$ or $P(X > x)$, because we can specify a vector for the argument x
, and sum it up afterwards. Let us consider the example from above $(P(X \le 9))$.
n = 25 # number of students
p = 0.3 # probability of success
k = np.arange(1, 10)
stats.binom.pmf(k, n, p)
array([0.00143686, 0.00738956, 0.02427999, 0.0572314 , 0.10301652, 0.14716646, 0.17119364, 0.16507958, 0.13363585])
np.sum(stats.binom.pmf(k, n, p))
0.8104298696330876
Another in-built function that is appropriate to answer $P(X \le 9)$ is the binom.cdf()
function, which allows to calculate the cumulative distribution function, where the number of successes is equal to or lower than m. The cdf
function is thus handy, because we can omit to call the sum
function. The usage of binom.cdf()
is as follows: binom.cdf(m,N,p)
.
Note: When calculating the lower or upper tail probability, there is no additional argument within the
cdf
function. Python will calculate the left/lower tail probabilities by default ($P(X \le x)$). To calculate the upper tail probability, one has to tweak the input by entering1- binom.cdf()
($P(X > x)$).
To make it clear let us recalculate $P(X \le 9)$. The result for the calculation above was 0.8104. Now we apply the binom.cdf()
function to get the same result.
n = 25 # number of students
p = 0.3 # probability of success
k = 9
stats.binom.cdf(9, n, p)
0.8105639764950546
In order the get $P(X > 9)$, which is the same as $1-P(X \le 9)$, we add 1-stats.binom.cdf()
to get the complementary probability.
n = 25 # number of students
p = 0.3 # probability of success
k = 9
1 - stats.binom.cdf(9, n, p)
0.18943602350494537
Finally, in order to visualize the binomial probability distribution we apply the method binom.rvs()
to the binom()
function, which generates random deviates for a binomial distribution defined by its size $n$ and probability of success/failure $p$.
n = 25 # number of students
p = 0.3 # probability of success
size = 100000 # number of random samples
random.seed(3) # set seed for reproducibility
random_binom_numbers = stats.binom.rvs(n, p, size=size)
plt.figure(figsize=(10, 5))
plt.hist(
random_binom_numbers,
density=True,
bins=len(np.unique(random_binom_numbers)),
color="#fc9d12",
edgecolor="grey",
) # density=False would make counts
plt.xlabel("Students passing the final exam")
plt.ylabel("Probability")
plt.title("Binomial Probability Distribution \nfor size=25 and p=0.3")
plt.xticks(
np.arange(min(random_binom_numbers), max(random_binom_numbers) + 1, 2.0)
) # define x-axis ticks
plt.show()
In addition we visualize the discussed probabilities form above: $P(X = 3)$, $P(X = 15)$, $P(X \le 9)$, $P(X > 9)$ (upper panel), and the corresponding cummulative binomial probability distribution function (lower panel).
n = 25 # number of students
p = 0.3 # probability of success
size = 100000 # number of random samples
random.seed(3) # set seed for reproducibility
random_binom_numbers = pd.Series(stats.binom.rvs(n, p, size=size))
plt.figure(figsize=(10, 5))
ax = random_binom_numbers.plot.hist(
bins=len(np.unique(random_binom_numbers)),
density=True,
edgecolor="grey",
figsize=(10, 5),
)
for bar in ax.containers[0]:
# get x midpoint of bar
x = bar.get_x() + 0.5 * bar.get_width()
# set bar color based on x
if x < 9.5:
bar.set_color("#fc4d62")
bar.set_edgecolor("black")
else:
bar.set_color("white")
bar.set_edgecolor("black")
plt.xlabel("Students passing the final exam")
plt.ylabel("Probability")
plt.title("Binomial Probability Distribution \nfor size=25 and p=0.3")
plt.xticks(np.arange(min(random_binom_numbers), max(random_binom_numbers) + 1, 2.0))
plt.xlim([0, n])
plt.ylim([0, 0.25])
# # P[X <= 11]
plt.axvline(x=9, color="red", linestyle="dashed")
plt.arrow(
8.5,
0.19,
-5.5,
0,
length_includes_head=False,
head_width=0.006,
head_length=2,
color="red",
)
plt.text(4, 0.21, r"$P(X \leq 9$)", fontsize=14, color="red")
plt.arrow(
9.5,
0.19,
5.5,
0,
length_includes_head=False,
head_width=0.006,
head_length=2,
color="red",
)
plt.text(11, 0.21, r"$P(X > 9$)", fontsize=14, color="red")
# # P[X=3]
plt.arrow(
3.5,
0.1,
0,
-0.07,
length_includes_head=True,
head_width=0.4,
head_length=0.02,
color="red",
)
plt.text(2, 0.11, r"$P(X =3$)", fontsize=14, color="red")
# # P[X=15]
plt.arrow(
14.5,
0.07,
0,
-0.06,
length_includes_head=True,
head_width=0.4,
head_length=0.02,
color="red",
)
plt.text(13.5, 0.08, r"$P(X =15$)", fontsize=14, color="red")
plt.show()
n = 25 # number of students
p = 0.3 # probability of success
size = 100000 # number of random samples
random.seed(3) # set seed for reproducibility
random_binom_numbers = pd.Series(stats.binom.rvs(n, p, size=size))
plt.figure(figsize=(10, 5))
plt.plot(
np.arange(1, n), stats.binom.cdf(np.arange(1, n), n, p), linewidth=3, color="black"
)
plt.xlabel("Students passing the final exam")
plt.ylabel("Cummulative Probability")
plt.title("Cummulative Binomial Probability Distribution \nfor size=25 and p=0.3")
plt.xticks(np.arange(min(random_binom_numbers), max(random_binom_numbers) + 9, 2.0))
plt.fill_between(
x=np.arange(1, n),
y1=stats.binom.cdf(np.arange(1, n), n, p),
where=(np.arange(1, n) <= 9),
color="red",
alpha=0.5,
)
plt.text(3, 0.85, r"$P(X \leq 9$) =0.81", fontsize=14, color="black")
plt.text(6, 0.21, r"$P(X \leq 9$)", fontsize=14, color="black")
plt.axvline(x=9, ymax=0.8, color="black", linestyle="dashed")
plt.hlines(y=0.81, xmin=0, xmax=9, color="black", linestyle="dashed")
plt.show()
In order to conclude this section, and in order to give you some intuition of the shapes of different binomial probability distributions, three different binomial probability and its corresponding cumulative binomial provability distributions for $p=0.1$, $p=0.5$, $p=0.9$ are given below.
n = 25
p = 0.1 # probability of success
size = 100000 # number of random samples
random_binom_numbers = pd.Series(stats.binom.rvs(n, p, size=size))
fig, ax = plt.subplots(figsize=(10, 5))
plt.title("size = 25, p = 0.1", fontsize=16)
ax.hist(
random_binom_numbers,
bins=len(np.unique(random_binom_numbers)),
color="white",
edgecolor="black",
)
ax.set_xlabel("Size", fontsize=14)
ax.set_ylabel("Probability", fontsize=14)
# twin object for two different y-axis on the sample plot
ax2 = ax.twinx()
# make a plot with different y-axis using second axis object
ax2.plot(
np.arange(1, n), stats.binom.cdf(np.arange(1, n), n, p), linewidth=3, color="black"
)
ax2.set_ylabel("Cummulative probability", fontsize=14)
plt.show()
n = 25
p = 0.5 # probability of success
size = 100000 # number of random samples
random_binom_numbers = pd.Series(stats.binom.rvs(n, p, size=size))
fig, ax = plt.subplots(figsize=(10, 5))
plt.title("size = 25, p = 0.5", fontsize=16)
ax.hist(
random_binom_numbers,
bins=len(np.unique(random_binom_numbers)),
color="white",
edgecolor="black",
)
ax.set_xlabel("Size", fontsize=14)
ax.set_ylabel("Probability", fontsize=14)
# twin object for two different y-axis on the sample plot
ax2 = ax.twinx()
# make a plot with different y-axis using second axis object
ax2.plot(
np.arange(1, n), stats.binom.cdf(np.arange(1, n), n, p), linewidth=3, color="black"
)
ax2.set_ylabel("Cummulative probability", fontsize=14)
plt.show()
n = 25
p = 0.9 # probability of success
size = 100000 # number of random samples
random_binom_numbers = pd.Series(stats.binom.rvs(n, p, size=size))
fig, ax = plt.subplots(figsize=(10, 5))
plt.title("size = 25, p = 0.9", fontsize=16)
ax.hist(
random_binom_numbers,
bins=len(np.unique(random_binom_numbers)),
color="white",
edgecolor="black",
)
ax.set_xlabel("Size", fontsize=14)
ax.set_ylabel("Probability", fontsize=14)
# twin object for two different y-axis on the sample plot
ax2 = ax.twinx()
# make a plot with different y-axis using second axis object
ax2.plot(
np.arange(1, n), stats.binom.cdf(np.arange(1, n), n, p), linewidth=3, color="black"
)
ax2.set_ylabel("Cummulative probability", fontsize=14)
plt.show()
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.