The shape of the sampling distribution depends on which of the following two cases applies: sampling from a normally distributed population, or sampling from a population that is not normally distributed.
In the previous section we discussed the shape of the sampling distribution when sampling from a normally distributed population. In real-life applications, however, we often do not know the actual shape of the population distribution.
In order to understand how the shape of the population distribution affects the shape of the sampling distribution, we conduct an experiment. Let us consider three different continuous probability density functions: the uniform distribution, the beta distribution, and the gamma distribution. We do not go into the details of these distributions here; the figure below simply shows that none of them is normally distributed.
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.arange(0, 1, 0.0001)
fig, ax = plt.subplots(1, 3, figsize=(10, 5))
ax[0].plot(x, stats.uniform.pdf(x, loc=0.2, scale=0.6), color="black")
ax[0].set_yticks([])
ax[0].title.set_text("Uniform distribution\n Unif(loc=0.2, scale=0.6)")
ax[1].plot(x, stats.beta.pdf(x, 2, 5, loc=0, scale=1), color="black")
ax[1].set_yticks([])
ax[1].title.set_text("Beta distribution\n Beta(alpha=2, beta=5)")
ax[2].plot(x, stats.gamma.pdf(x, 0.8, loc=0, scale=1), color="black")
ax[2].set_yticks([])
ax[2].title.set_text("Gamma distribution\n Gamma(alpha=0.8)")
plt.show()
Now we conduct the same experiment as in the previous section. For a large enough number of times (trials = 1000) we sample from each of these three distributions. This time, however, each sample has a sample size of $n = 2, 5, 15$ or $30$. For each sample we calculate the sample mean $\bar x$ and visualize the empirical distribution of the sample means after 1,000 trials.
n = [2, 5, 15, 30]  # sample sizes
trials = 1000  # number of samples drawn for each sample size

fig, ax = plt.subplots(4, 3, figsize=(10, 7))
for c, i in enumerate(n):
    ## uniform
    uni = []
    for _ in range(trials):
        uni.append(np.mean(stats.uniform.rvs(loc=0.2, scale=0.6, size=i)))
    ax[c, 0].hist(uni, bins="scott", color="purple", edgecolor="grey")
    ax[c, 0].set_xlim(0.15, 0.85)
    ax[c, 0].set_yticks([])
    ax[c, 0].spines["top"].set_visible(False)
    ax[c, 0].spines["left"].set_visible(False)
    ax[c, 0].spines["right"].set_visible(False)
    ax[c, 0].title.set_text(f"Uniform distribution \nsample size n={i}")
    ## beta
    beta = []
    for _ in range(trials):
        beta.append(np.mean(stats.beta.rvs(a=2, b=5, loc=0, scale=1, size=i)))
    ax[c, 1].hist(beta, bins="scott", color="blue", edgecolor="grey")
    ax[c, 1].set_xlim(0, 0.6)
    ax[c, 1].set_yticks([])
    ax[c, 1].spines["top"].set_visible(False)
    ax[c, 1].spines["left"].set_visible(False)
    ax[c, 1].spines["right"].set_visible(False)
    ax[c, 1].title.set_text(f"Beta distribution \nsample size n={i}")
    ## gamma
    gamma = []
    for _ in range(trials):
        gamma.append(np.mean(stats.gamma.rvs(0.3, loc=0, scale=1, size=i)))
    ax[c, 2].hist(gamma, bins="scott", color="green", edgecolor="grey")
    ax[c, 2].set_xlim(-0.01, 0.8)
    ax[c, 2].set_yticks([])
    ax[c, 2].spines["top"].set_visible(False)
    ax[c, 2].spines["left"].set_visible(False)
    ax[c, 2].spines["right"].set_visible(False)
    ax[c, 2].title.set_text(f"Gamma distribution \nsample size n={i}")
plt.tight_layout()
plt.show()
The figure shows that, when the population is not normally distributed, the sampling distribution is not normal for small sample sizes; however, as the sample size becomes large ($n \ge 30$), the sampling distribution approximates a normal distribution. Also notice that the spread of the sampling distribution decreases as the sample size increases.
According to the central limit theorem, for a large sample size $(n \ge 30)$, the sampling distribution is approximately normal, irrespective of the shape of the population distribution (Mann 2012).
The mean and standard deviation of the sampling distribution of $\bar x$ are, respectively,
$$\mu_{\bar x} = \mu \qquad \text{and} \qquad \sigma_{\bar x}=\frac{\sigma}{\sqrt{n}} \, .$$

The sample size is usually considered to be large if $n \ge 30$.
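To make these two formulas tangible, here is a minimal sketch (not part of the experiment above) that re-uses the beta distribution with $\alpha=2$ and $\beta=5$: for each of the sample sizes $n = 2, 5, 15$ and $30$ it compares the mean and the standard deviation of 1,000 simulated sample means with the theoretical values $\mu$ and $\sigma/\sqrt{n}$. The choice of the beta distribution and the variable names are ours, purely for illustration.
import numpy as np
import scipy.stats as stats

trials = 1000  # number of simulated samples per sample size

# Population mean and variance of the Beta(2, 5) distribution used above
mu, var = stats.beta.stats(a=2, b=5, moments="mv")
mu, sigma = float(mu), float(np.sqrt(var))

print(f"Population: mu = {mu:.4f}, sigma = {sigma:.4f}")
for n in [2, 5, 15, 30]:
    # Draw 1,000 samples of size n and record each sample mean
    sample_means = [np.mean(stats.beta.rvs(a=2, b=5, size=n)) for _ in range(trials)]
    print(
        f"n = {n:2d}: mean of sample means = {np.mean(sample_means):.4f} (theory: {mu:.4f}), "
        f"sd of sample means = {np.std(sample_means, ddof=1):.4f} (theory: {sigma / np.sqrt(n):.4f})"
    )
The mean of the simulated sample means stays close to $\mu$ for every $n$, while their standard deviation shrinks roughly like $\sigma/\sqrt{n}$, in line with the decreasing spread observed in the figure above.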
Because the sampling distribution approximates a normal distribution, the area under its curve yields probabilistic information about the sample statistic.
Recall the empirical rule, also known as the 68-95-99.7 rule. Applied to the sampling distribution, it implies that
- about 68.26% of the sample means will be within one standard deviation of the population mean,
- about 95.44% of the sample means will be within two standard deviations of the population mean, and
- about 99.74% of the sample means will be within three standard deviations of the population mean.
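As a quick numerical check of these percentages, the following minimal sketch assumes a hypothetical population with $\mu = 100$ and $\sigma = 15$ and a sample size of $n = 36$ (made-up values, not taken from the text above) and uses the normal approximation of the sampling distribution to compute the area under the curve within one, two and three standard deviations of the mean.
import numpy as np
import scipy.stats as stats

# Hypothetical population parameters and sample size (made up for illustration)
mu, sigma, n = 100, 15, 36
se = sigma / np.sqrt(n)  # standard deviation of the sampling distribution, sigma/sqrt(n)

# Area under the normal curve between mu - k*se and mu + k*se for k = 1, 2, 3
for k in [1, 2, 3]:
    p = stats.norm.cdf(mu + k * se, loc=mu, scale=se) - stats.norm.cdf(mu - k * se, loc=mu, scale=se)
    print(f"P(mu - {k}*se <= x-bar <= mu + {k}*se) = {p:.4f}")
The resulting values (0.6827, 0.9545 and 0.9973) agree with the percentages above up to the rounding used in printed normal tables.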