The shape of the sampling distribution relates to the following two cases.
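When the population from which samples are drawn is normally distributed with its mean equal to $\mu$ and standard deviation equal to $\sigma$, then:
1. The mean of the sample means, $\mu_{\bar x}$, is equal to the mean of the population, $\mu$.
2. The standard deviation of the sample means, $\sigma_{\bar x}$, is equal to $\frac{\sigma}{\sqrt{n}}$.
3. The shape of the sampling distribution of the sample means $\bar x$ is normal, whatever the value of $n$.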
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
n1 = 5
n2 = 15
n3 = 30
n4 = 50
sigma = 1
mu = 0
Let us consider a normally distributed population. For the sake of simplicity we use the standard normal distribution, $X \sim N(\mu, \sigma)$, with $\mu = 0$ and $\sigma = 1$. Let us further calculate $\mu_{\bar x}$ and $\sigma_{\bar x}$ for samples of sizes $n = 5, 15, 30, 50$.
Recall that for a large enough number of repeated samples, $\mu_{\bar x} \approx \mu$. Thus, $\mu_{\bar x}$ is the same for all the sampling distributions under consideration:
$$\mu_{\bar x_{n=5}} = \mu_{\bar x_{n=15}} = \mu_{\bar x_{n=30}} = \mu_{\bar x_{n=50}} = \mu = 0$$
Recall the standard error of the sampling distribution, $\sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}$. Thus, we can easily compute $\sigma_{\bar x}$ for $n = 5, 15, 30, 50$. The different sampling distributions are visualized below.
$$\sigma_{\bar x_{n=5}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{5}} \approx 0.447$$
$$\sigma_{\bar x_{n=15}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{15}} \approx 0.258$$
$$\sigma_{\bar x_{n=30}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{30}} \approx 0.183$$
$$\sigma_{\bar x_{n=50}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{50}} \approx 0.141$$
seq = np.arange(-4, 4.01, 0.001)
n = [5, 15, 30, 50]
color = ["blue", "orange", "green", "red"]
plt.figure(figsize=(10, 7))
# Population distribution
plt.plot(
    seq,
    stats.norm.pdf(seq, mu, sigma),
    color="black",
    linewidth=2,
    label="Population distribution",
)
# Sampling distribution of the mean, one curve per sample size
for i, c in zip(n, color):
    plt.plot(
        seq,
        stats.norm.pdf(seq, mu, sigma / np.sqrt(i)),
        color=c,
        linewidth=2,
        label="$\\bar{x}$ for n = " + f"{i}",
    )
plt.vlines(0, 0, 15, color="darkgrey", linestyle="dashed")
plt.yticks([])
plt.ylim(0, 3)
plt.legend()
# Raw string avoids invalid escape sequences such as "\m"
plt.text(0.1, 2.5, r"$\mu_{\bar{x}} = \mu$", fontsize=16)
plt.show()
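There are two important observations regarding the sampling distribution of $\bar x$:
1. The spread of the sampling distribution is smaller than the spread of the population distribution, that is, $\sigma_{\bar x} < \sigma$.
2. The standard deviation of the sampling distribution decreases as the sample size increases.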
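The standard errors computed above can be reproduced in a few lines. The following is a minimal sketch, not part of the original notebook; the names `sample_sizes` and `standard_errors` are introduced here for illustration only.

```python
import numpy as np

sigma = 1  # population standard deviation (standard normal population)
sample_sizes = np.array([5, 15, 30, 50])  # illustrative name

# Standard error of the mean: sigma / sqrt(n)
standard_errors = sigma / np.sqrt(sample_sizes)

for n, se in zip(sample_sizes, standard_errors):
    print(f"n = {n:2d}: standard error = {se:.3f}")
```

Each value is below $\sigma = 1$, and the values shrink as $n$ grows.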
In order to verify the third claim from above, that the shape of the sampling distribution of $\bar x$ is normal, whatever the value of $n$, we conduct a computational experiment. A large number of times (trials = 1000) we sample from the standard normal distribution $N(\mu = 0, \sigma = 1)$, where each particular sample has a sample size of $n = 5, 15, 30$ or $50$. For each sample we calculate the sample mean $\bar x$ and visualize the empirical probabilities. Afterwards we compare the empirical distribution of those sample means with the sampling distributions calculated from the equations above.
trials = 1000
n = [5, 15, 30, 50]
mu = 0
sigma = 1
x = np.arange(-4, 4.01, 0.001)
color = ["blue", "orange", "green", "red"]
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
fig.suptitle("Relative frequency distribution (occurrences) of $\\bar{x}$", fontsize=20)
for i, ax, c in zip(n, axes.ravel(), color):
    # Theoretical sampling distribution of the mean for sample size i
    ax.plot(
        x,
        stats.norm.pdf(x, mu, sigma / np.sqrt(i)),
        color=c,
        linewidth=2,
        label="$\\bar{x}$ for n = " + f"{i}",
    )
    # Draw `trials` samples of size i from the population and average each one
    sample_means = stats.norm.rvs(mu, sigma, size=(trials, i)).mean(axis=1)
    ax.hist(
        sample_means,
        density=True,
        color="lightgrey",
        edgecolor="darkgrey",
    )
    ax.set_ylim(-0.1, 3)
    ax.set_xlim(-2, 2)
    ax.set_ylabel("Density")
    ax.set_title(
        f"Empirical Probabilities vs.\nSampling Distribution for sample size n={i}"
    )
plt.tight_layout()
plt.show()
The figure supports the third claim from above: the shape of the sampling distribution of $\bar x$ is normal, whatever the value of $n$.
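A numerical summary can complement the visual comparison. The sketch below is an illustration under stated assumptions and not part of the original notebook: it uses a fixed seed (an arbitrary choice) and 10,000 trials instead of 1,000 to reduce simulation noise, and it stores its output in a dictionary named `results`, which is introduced here only for this example. Skewness and excess kurtosis close to zero are consistent with a normal shape, although they do not prove it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed (arbitrary) for reproducibility
mu, sigma = 0, 1
trials = 10_000  # assumption: more trials than in the text, to reduce noise

results = {}  # illustrative container, keyed by sample size
for n in [5, 15, 30, 50]:
    # Draw `trials` samples of size n and take the mean of each sample
    sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    results[n] = {
        "mean": sample_means.mean(),
        "sd": sample_means.std(ddof=1),
        "theoretical_sd": sigma / np.sqrt(n),
        "skewness": stats.skew(sample_means),
        "excess_kurtosis": stats.kurtosis(sample_means),
    }
    print(n, {k: round(float(v), 3) for k, v in results[n].items()})
```

For every sample size, the empirical mean should lie close to $\mu = 0$ and the empirical standard deviation close to $\frac{\sigma}{\sqrt{n}}$.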
In addition, the figure shows that the distribution of the empirical probabilities (bars) fits the sampling distribution (colored line) well, and that the standard deviation of the sampling distribution of $\bar x$ decreases as the sample size increases. Recall that the y-axis represents the density, which is the probability per unit value of the random variable. This is why the probability density can take a value greater than 1, but only over a region with measure less than 1.
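The point about densities exceeding 1 can be checked directly. The following sketch, which is an illustration and not part of the original notebook, evaluates the sampling distribution for $n = 50$ on a fine grid and approximates the area under the curve with a simple Riemann sum; the names `peak` and `area` are introduced here only for this example.

```python
import numpy as np
from scipy import stats

n = 50
se = 1 / np.sqrt(n)  # standard error for sigma = 1

dx = 0.001
x = np.arange(-4, 4, dx)
pdf = stats.norm.pdf(x, loc=0, scale=se)

peak = pdf.max()         # height of the density at its mode
area = np.sum(pdf) * dx  # Riemann-sum approximation of the integral

print(f"peak density     = {peak:.3f}")
print(f"area under curve = {area:.3f}")
```

The peak of the density is about 2.82, well above 1, while the total area under the curve is still approximately 1.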
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.