The mean of a discrete random variable $X$ is denoted $\mu_X$ or, when no confusion will arise, simply $\mu$. The terms expected value, $E(X)$, and expectation are commonly used in place of the term mean.
$$ E(X) = \sum_{i=1}^{N}x_iP(X=x_i) $$

In a large number of independent observations of a random variable $X$, the mean of those observations (the sample mean) will approximate the mean, $\mu$, of the population. The larger the number of observations, the closer the sample mean gets to $\mu$ (Weiss 2010).
# First, let's import all the needed libraries.
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
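Before we return to the siblings example, here is a minimal sketch of the definition above: the expected value is just a probability-weighted sum, and the average of a large number of simulated draws approaches it. The values and probabilities in this sketch are made up purely for illustration.
# a minimal illustration of E(X) as a probability-weighted sum (made-up values)
x = np.array([10, 20, 30])  # possible values of X
p = np.array([0.5, 0.3, 0.2])  # P(X = x_i), must sum to 1
print(np.sum(x * p))  # E(X) = 10*0.5 + 20*0.3 + 30*0.2 = 17.0

# the average of many independent draws approaches E(X)
draws = np.random.default_rng(0).choice(x, size=100_000, p=p)
print(draws.mean())  # close to 17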
Let us recall our experiment from the previous section, in which we picked 1,000 individuals and asked them how many siblings they have.
experiment_prob = np.array([0.2, 0.425, 0.275, 0.07, 0.025, 0.005])  # population probabilities for 0-5 siblings

def siblings(x):
    # draw x individuals, each reporting 0-5 siblings with the probabilities above
    return random.choices(np.array([0, 1, 2, 3, 4, 5]), weights=experiment_prob, k=x)

random.seed(1000)
experiment = pd.Series(siblings(1000))
siblings1000_f = experiment.value_counts().sort_index()  # absolute frequencies, ordered by number of siblings
siblings1000_p = experiment.value_counts(normalize=True).sort_index()  # relative frequencies, ordered by number of siblings
Let us again take a look at the table summarizing the experiment:
$$ \begin{array}{c|c|c} \text{Siblings} & \text{Frequency} & \text{Relative}\\ x & f & \text{frequency}\\ \hline 0 & 185 & 0.185 \\ 1 & 420 & 0.420 \\ 2 & 289 & 0.289\\ 3 & 64 & 0.064 \\ 4 & 35 & 0.035 \\ 5 & 7 & 0.007 \\ \hline & 1000 & 1.0 \end{array} $$

Let us calculate the expected value (mean) for that experiment.
\begin{align} E(X) & = \sum_{i=1}^{N}x_iP(X=x_i) \\ & = 0 \cdot P(X=0) + 1 \cdot P(X=1) + 2 \cdot P(X=2) + 3 \cdot P(X=3) + 4 \cdot P(X=4) + 5 \cdot P(X=5) \\ & = 0 \cdot 0.185 + 1 \cdot 0.420 + 2 \cdot 0.289 + 3 \cdot 0.064 + 4 \cdot 0.035 + 5 \cdot 0.007 \\ & = 1.365 \end{align}

Calculate in Python by typing:

np.sum(np.arange(0, 6) * siblings1000_p)
The resulting expected value of 1.365 is close to the population mean $\mu$, which we calculate using the population's probabilities (0.2, 0.425, 0.275, 0.07, 0.025, 0.005, taken from the lower right figure in the previous section). $$\mu = 0 \cdot 0.2 + 1 \cdot 0.425 + 2 \cdot 0.275 + 3 \cdot 0.07 + 4 \cdot 0.025 + 5 \cdot 0.005 = 1.31$$
Calculate in Python by typing:

np.sum(np.arange(0, 6) * experiment_prob)
Let us consider a fair six-sided die. We can easily compute the expected value $E(X)$ using Python. The term fair means that each outcome $x_i \in \{1,2,3,4,5,6\}$ of the random variable $X$ is equally likely to occur. Therefore $P(X=x_i) = \frac{1}{6}$.
$$E(X) = \sum_{i=1}^{6}x_iP(X=x_i) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5$$

In Python we write the following code:
# Expected value of a fair six-sided die
p_die = 1 / 6
die = np.arange(1, 7)
np.sum(die * p_die)
3.5
However, what if we are not sure whether the die is really fair? How do we know that we are not being cheated? Or, to put it differently: how often do we need to roll a die before we can be more confident?
Let us do a computational experiment: We know from the reasoning above that the expected value of a fair six-sided die is 3.5. We roll a die over and over again, store each result, and before every new roll we calculate the average of all rolls so far. To implement this little experiment, we write a for loop in Python.
### Simulation ###
eyes = np.arange(1, 7) # possible events
probs = [1 / len(eyes)] * len(eyes) # probabilities
expected_value = np.round(np.sum(eyes * probs), 2) # calculate expected value
n = 500  # maximum number of rolls
values = []  # empty list to store the outcome of each roll
averages = []  # empty list to store the running average after each roll
# for-loop: roll the die over and over and track the running average
for roll in range(1, n):
    values.append(random.choices(eyes, weights=probs, k=1)[0])  # draw a single roll
    averages.append(np.mean(values))  # average of all rolls so far
plt.plot(averages)
plt.xlabel("number of trials")
plt.ylabel("average of rolls so far")
plt.ylim((1, 6))
plt.axhline(y=3.5, color="r", linestyle="--", label=f"Expected value: {expected_value}")
plt.legend()
plt.show()
The graph shows that, after some initial volatile behavior, the curve flattens out and approaches the expected value $E(X) = 3.5$.
The standard deviation of a discrete random variable $X$ is denoted $\sigma_X$ or, when no confusion will arise, simply $\sigma$. It is defined as
$$ \sigma = \sqrt{\sum_{i=1}^{N}(x_i-\mu)^2P(X=x_i)} $$

Let us turn to Python and calculate the standard deviation for the die roll experiment from above. The outcomes of the rolls are stored in the list values. The relative frequency of each of the six faces in values approximates $\frac{1}{6} \approx 0.167$. We simply plug these relative frequencies into the equation for the standard deviation above, using the mean of the observed rolls, np.mean(values), in place of $\mu$.
x = pd.Series(np.arange(1, 7))  # possible outcomes of a die roll
p_x = pd.Series(values).value_counts(normalize=True).sort_index()  # relative frequencies, ordered by face value
p_x
1    0.164329
2    0.138277
3    0.154309
4    0.184369
5    0.178357
6    0.180361
dtype: float64
sd = np.sqrt(np.sum((x - np.mean(values)) ** 2 * p_x.values))  # standard deviation of the observed rolls
np.round(sd, 4)
1.7141
Roughly speaking, our experiment shows that, on average, the outcome of a single roll deviates by about 1.71 from the experimental mean of 3.62.
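For comparison, we can plug the theoretical probabilities of a fair die, $P(X=x_i) = \frac{1}{6}$, and the theoretical mean $\mu = 3.5$ into the same formula. The short sketch below reuses the eyes and probs objects defined in the simulation above.
# theoretical standard deviation of a fair six-sided die:
# sigma = sqrt(sum_i (x_i - mu)^2 * P(X = x_i)) with mu = 3.5 and P(X = x_i) = 1/6
mu = np.sum(eyes * probs)  # 3.5
sigma = np.sqrt(np.sum((eyes - mu) ** 2 * probs))
np.round(sigma, 4)  # approximately 1.7078
The experimental value of about 1.71 is close to this theoretical value, just as the running average in the plot above approached the theoretical expected value of 3.5.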
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.