The binomial distribution is the probability distribution for the number of successes in a sequence of Bernoulli trials. In \(n\) Bernoulli trials the number of outcomes that contain exactly \(x\) successes equals the binomial coefficient \({n \choose x}\) (Weiss 2010).
Let \(X\) denote the total number of successes in \(n\) Bernoulli trials with the success probability \(p\). Then the probability distribution of the random variable \(X\) is given by:
\[ P(X = x) = {n \choose x}p^x(1 - p)^{n-x}, \qquad x = 0, 1, 2, . . . , n\]
The random variable \(X\) is called a binomial random variable and is said to have the binomial distribution with parameters \(n\) and \(p\) (Weiss 2010).
Let us consider an example taken form the real world.
Long term statistics tell us, that the chance for passing the final statistics exam is 0.3. Yes, approximately 70 % of students fail in their statistics exam! By the way, if you complete this e-learning module, your chances of passing the final exam will definitely increase :-)
Let us consider a class of 25 students. What is the chance that exactly 3 students of that particular class will pass the final statistics exam? Or written more formally \(P(X = 3)\). Once again, we start by implementing a naive approach in R.
n <- 25 # number of students
p <- 0.3 # probability of success
k <- 3 # exactly 3 students will pass the exam
choose(n, k) * p^k * (1 - p)^(n - k)
## [1] 0.02427999
Wow, the probability that exactly 3 out of 25 students \((P(X = 3))\) will pass the final statistics exam is quite low. What about the probability that exactly 15 out of 25 students \((P(X = 15))\) will pass the final statistics exam? We turn to R.
n <- 25 # number of students
p <- 0.3 # probability of success
k <- 15 # exactly 15 students will pass the exam
choose(n, k) * p^k * (1 - p)^(n - k)
## [1] 0.001324897
The probability of \(P(X = 15)\) is approximately 0.1 %. We can continue with our experiments to find out all the probabilities of exactly one outcome for \(k = 0,1,2,...,n\). Please note, that for our particular example it is not very informative to know the probability of exactly one particular number (\(k\)) of students passing the exam. For us it is of higher informational interest if we can answer the question, what the probability that \(k\) or less students \((P(X \le k))\) will pass the exam is or, equally interesting, that \(k\) or more students \((P(X \ge k))\) will pass the exam.
As an exercise we turn to R and determine the probability that 9 or less students will pass the final statistics exam \((P(X \le 9))\). So, we are interested in the probability that 0 out of 25, or 1 out of 25, or 2 out of 25, …, or 9 out of 25 students will pass the exam. To calculate this probability we can extend our naive approach and calculate \(P(X = 0)+P(X = 1)+P(X = 2) + ... + P(X = 9)\).
n <- 25 # number of students
p <- 0.3 # probability of success
choose(n, 0) * p^0 * (1 - p)^(n - 0) +
choose(n, 1) * p^1 * (1 - p)^(n - 1) +
choose(n, 2) * p^2 * (1 - p)^(n - 2) +
choose(n, 3) * p^3 * (1 - p)^(n - 3) +
choose(n, 4) * p^4 * (1 - p)^(n - 4) +
choose(n, 5) * p^5 * (1 - p)^(n - 5) +
choose(n, 6) * p^6 * (1 - p)^(n - 6) +
choose(n, 7) * p^7 * (1 - p)^(n - 7) +
choose(n, 8) * p^8 * (1 - p)^(n - 8) +
choose(n, 9) * p^9 * (1 - p)^(n - 9)
## [1] 0.810564
The probability that 9 or less students will pass the statistics exam \((P(X\le 9))\) is 81.1 %. In turn, that means the probability of 10 or more students passing the exam is \(P(X > 9)= 1-P(X \le 9)\) and only 18.9 %.
Exercise: Since it is quite a lot of typing, try to do the previous calculation based on a loop! (There are several ways to solve this exercise)
### your code here
n <- 25 # number of students
p <- 0.3 # probability of success
result <- 0
for (k in 0:9) {
result <- result + choose(n, k) * p^k * (1 - p)^(n - k)
}
result
## [1] 0.810564
Beside the honestly quite unpleasant insight into the statistics of
passing the final exam, it is clear that the naive implementation from
above is quite tedious. Therefore, we introduce the in-built functions
dbinom()
and pbinom()
. Call the
help()
on those functions for further information.
According to the help()
function the usage of
dbinom()
is as follows:
dbinom(x, size, prob, log = FALSE)
. We disregard the
log
argument and stay with its default
log = FALSE
. Thus, the dbinom()
function
simplifies to dbinom(x, size, prob)
, where x
corresponds to our \(k\),
size
to our \(n\) and
prob
to our \(p\). In
order to check if the dbinom()
function works as expected,
we recall our test examples from above, which were \(P(X = 3)\) and \(P(X = 15)\).
n <- 25 # number of students
p <- 0.3 # probability of success
k <- c(3, 15)
dbinom(k[1], n, p)
## [1] 0.02427999
dbinom(k[2], n, p)
## [1] 0.001324897
Compare these results with the results of our naive implementation (scroll up). The numbers should match.
The dbinom()
function becomes very handy, if we consider
accumulated probabilities, such as \(P(X \le
x)\) or \(P(X > x)\). We can
specify a vector for the argument x
and sum it up
afterwards. Let us consider the example from above \((P(X \le 9))\).
n <- 25 # number of students
p <- 0.3 # probability of success
k <- 0:9
dbinom(k, n, p)
## [1] 0.0001341069 0.0014368592 0.0073895618 0.0242799887 0.0572314020
## [6] 0.1030165235 0.1471664622 0.1711936397 0.1650795811 0.1336358514
sum(dbinom(k, n, p))
## [1] 0.810564
Exercise: Use the code implementation above to calculate P(X > 9)!
### your code here
n <- 25 # number of students
p <- 0.3 # probability of success
k <- 10:n
dbinom(k, n, p)
sum(dbinom(k, n, p))
## [1] 9.163601e-02 5.355351e-02 2.677676e-02 1.147575e-02 4.215583e-03
## [6] 1.324897e-03 3.548832e-04 8.051973e-05 1.533709e-05 2.421646e-06
## [11] 3.113545e-07 3.177086e-08 2.475652e-09 1.383905e-10 4.942517e-12
## [16] 8.472886e-14
## [1] 0.189436
Another in-built function that is appropriate to answer \(P(X \le 9)\) is the pbinom()
function, which returns the distribution function. The
pbinom()
function calculates the cumulative probability
distribution and is thus handy, because we can omit calling the
sum()
function. The usage of pbinom()
is as
follows:
pbinom(x, size, prob, lower.tail = TRUE, log.p = FALSE)
.
Again, we disregard the log
argument and stay with its
default log = FALSE
. In addition to the arguments from the
dbinom()
function (see above) a new argument is available:
lower.tail
. The default is set to TRUE
.
Note: If
lower.tail = TRUE
the probabilities are \(P(X \le x)\), otherwise \(P(X > x)\).
To make it clear let us recalculate \(P(X
\le 9)\). The result for the calculation above was 0.810564. Now
we apply the pbinom()
function to get the same result.
n <- 25 # number of students
p <- 0.3 # probability of success
k <- 9
pbinom(k, n, p, lower.tail = TRUE)
## [1] 0.810564
In order the get \(P(X > 9)\),
which is the same as \(1-P(X \le 9)\),
we set the argument lower.tail = FALSE
.
n <- 25 # number of students
p <- 0.3 # probability of failure
k <- 9
pbinom(k, n, p, lower.tail = FALSE)
## [1] 0.189436
Finally, in order to visualize the binomial probability
distribution we apply the rbinom()
function, which
generates random deviates for a binomial distribution defined by its
size \(n\) and probability of
success/failure \(p\).
size <- 25 # size
p <- 0.3 # probability of success
n <- 100000 # number of random samples
set.seed(3) # set seed for reproducibility
random_binom_numbers <- rbinom(n, size, p)
h <- hist(random_binom_numbers,
breaks = length(unique(random_binom_numbers)),
plot = FALSE
)
plot(h,
freq = FALSE,
space = NULL,
xlim = c(0, size),
xlab = "Students passing the final exam", # label for x-axis
ylab = "Probability", # label for y-axis
main = "Binomial Probability Distribution \nfor size=25 and p=0.3", # add title
col = "#fc9d12", # add color "#fc9d12"
xaxt = "n"
) # do not show ticks of the x-axis
axis(side = 1, at = seq(0.5, size, by = 1), labels = seq(1, size, by = 1)) # set ticks at defined positions for x axis
In addition, we visualize the discussed probabilities from above: \(P(X = 3)\), \(P(X = 15)\), \(P(X \le 9)\), \(P(X > 9)\) (upper panel) and the corresponding cumulative binomial probability distribution function (lower panel).
In order to conclude this section and in order to give you some intuition of the shapes of different binomial probability distributions, three different binomial probability distributions and their corresponding cumulative binomial probability distributions for \(p=0.1\), \(p=0.5\), \(p=0.9\) are given below.
Exercise: Imagine we found hints for 3 twenty-year-flood events (HQ20) within a 10 year time frame in a historical archive. Calculate the probability for this! Note: A HQ20 indicates a flood with a probability of p = 1/20 = 0.05 for a certain year.
### your code here
p <- 1 / 20
n <- 10
paste("The probability for three HQ20 in a decade is:", dbinom(x = 3, size = n, prob = p))
## [1] "The probability for three HQ20 in a decade is: 1.05 %"
Advanced Exercise: How many twenty-year-flood events may occur per decade with a probability of more than 98 %? Hint: Have a look at R’s
qbinom()
function!
### your code here
p <- 1 / 20
n <- 10
paste("With a probability of 98 %", qbinom(p = 0.95, size = n, prob = p), "or less twenty-year-flood events occur per decade.")
## [1] "With a probability of 98 % 2 or less twenty-year-flood events occur per decade."
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.