The shape of the sampling distribution of the sample mean depends on which of the following two cases applies.

- The population from which samples are drawn has a normal distribution.
- The population from which samples are drawn does not have a normal distribution.

When the population from which samples are drawn is normally distributed with its mean equal to $\mu$ and standard deviation equal to $\sigma$, then:

- The mean of the sample means, $\mu_{\bar x}$, is equal to the mean of the population, $\mu$.
- The standard deviation of the sample means, $\sigma_{\bar x}$, is equal to $\frac{\sigma}{\sqrt{n}}$, assuming $\frac{n}{N} \le 0.05$, where $N$ is the population size.
- The shape of the sampling distribution of the sample means $(\bar x)$ is normal, whatever the value of $n$.

```
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
```

```
n1 = 5
n2 = 15
n3 = 30
n4 = 50
sigma = 1
mu = 0
```

Let us consider a normally distributed population. For the sake of simplicity we use the standard normal distribution, $X \sim N(\mu, \sigma)$, with $\mu = 0$ and $\sigma = 1$. We then calculate $\mu_{\bar x}$ and $\sigma_{\bar x}$ for samples of size $n = 5, 15, 30, 50$.

Recall that the mean of the sampling distribution equals the population mean, $\mu_{\bar x} = \mu$, and that the average of the observed sample means approaches this value as the number of repeated samples grows. Thus, $\mu_{\bar x}$ is the same for all the sampling distributions under consideration:

$$\mu_{\bar x_{n=5}} = \mu_{\bar x_{n=15}} = \mu_{\bar x_{n=30}} = \mu_{\bar x_{n=50}} = \mu = 0$$

Recall that the standard error of the sampling distribution is $\sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}$. Thus, we can easily compute $\sigma_{\bar x}$ for $n = 5, 15, 30, 50$. The different sampling distributions are visualized thereafter.

$$\sigma_{\bar x_{n=5}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{5}} \approx 0.447$$

$$\sigma_{\bar x_{n=15}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{15}} \approx 0.258$$

$$\sigma_{\bar x_{n=30}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{30}} \approx 0.183$$

$$\sigma_{\bar x_{n=50}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{50}} \approx 0.141$$
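The four values above can be reproduced in a couple of lines. This is a minimal sketch; the names `sizes` and `standard_errors` are chosen here for illustration and do not appear elsewhere in the notebook.

```python
import numpy as np

sigma = 1.0                          # population standard deviation (standard normal)
sizes = np.array([5, 15, 30, 50])    # sample sizes considered in the text

# Standard error of the mean for each sample size: sigma / sqrt(n).
standard_errors = sigma / np.sqrt(sizes)

for n_i, se in zip(sizes, standard_errors):
    print(f"n = {n_i:2d}: sigma_xbar = {se:.3f}")
```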

```
# Grid of x values on which the densities are evaluated.
seq = np.arange(-4, 4.01, 0.001)
n = [5, 15, 30, 50]
color = ["blue", "orange", "green", "red"]

plt.figure(figsize=(10, 7))
# Population distribution: N(mu, sigma).
plt.plot(
    seq,
    stats.norm.pdf(seq, mu, sigma),
    color="black",
    linewidth=2,
    label="Population distribution",
)
# Sampling distribution of the sample mean for each n: N(mu, sigma / sqrt(n)).
for i, c in zip(n, color):
    plt.plot(
        seq,
        stats.norm.pdf(seq, mu, sigma / np.sqrt(i)),
        color=c,
        linewidth=2,
        label=rf"$\bar{{x}}$ for n = {i}",
    )
plt.vlines(mu, 0, 15, color="darkgrey", linestyle="dashed")
plt.yticks([])
plt.ylim(0, 3)
plt.legend()
plt.text(0.1, 2.5, r"$\mu_{\bar{x}} = \mu$", fontsize=16)
plt.show()
```

There are two important observations regarding the sampling distribution of $\bar x$:

- The spread of the sampling distribution is smaller than the spread of the corresponding population distribution. In other words, $\sigma_{\bar x} < \sigma$ for any $n > 1$.
- The standard deviation of the sampling distribution decreases as the sample size increases.
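Both observations can also be checked by simulation. The sketch below is an illustration under assumptions that are not part of the original text: the seed, the number of repetitions (`repetitions = 20_000`), and the names `rng` and `empirical_sd` are all chosen here for the example.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed (assumption) for reproducibility
mu, sigma = 0.0, 1.0              # standard normal population
sizes = [5, 15, 30, 50]
repetitions = 20_000              # hypothetical number of repeated samples

empirical_sd = {}
for n_i in sizes:
    # Each row is one sample of size n_i; averaging along the row gives one sample mean.
    sample_means = rng.normal(mu, sigma, size=(repetitions, n_i)).mean(axis=1)
    empirical_sd[n_i] = sample_means.std(ddof=1)
    print(
        f"n = {n_i:2d}: empirical {empirical_sd[n_i]:.3f}, "
        f"theoretical {sigma / np.sqrt(n_i):.3f}"
    )
```

The empirical standard deviations should lie below $\sigma = 1$, shrink as $n$ grows, and sit close to $\frac{\sigma}{\sqrt{n}}$.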

To check the third claim from above, that the shape of the sampling distribution of $\bar x$ is normal whatever the value of $n$, we conduct a computational experiment. We repeat the following a large number of times (`trials = 1000`): we draw a sample from the standard normal distribution $N(\mu = 0, \sigma = 1)$, with sample size $n = 5, 15, 30, 50$, and calculate its sample mean $\bar x$. We then plot the relative frequency distribution of the sample means and compare this empirical distribution with the sampling distribution calculated from the equations above.

```
trials = 1000
n = [5, 15, 30, 50]
mu = 0
sigma = 1
x = np.arange(-4, 4.01, 0.001)
color = ["blue", "orange", "green", "red"]

fig, axes = plt.subplots(2, 2, figsize=(10, 7))
fig.suptitle(r"Relative frequency distribution (occurrences) of $\bar{x}$", fontsize=20)
for i, ax, c in zip(n, axes.ravel(), color):
    # Theoretical sampling distribution of the sample mean: N(mu, sigma / sqrt(n)).
    ax.plot(
        x,
        stats.norm.pdf(x, mu, sigma / np.sqrt(i)),
        color=c,
        linewidth=2,
        label=rf"$\bar{{x}}$ for n = {i}",
    )
    # Draw `trials` samples of size i from the population and average each sample.
    sample_means = stats.norm.rvs(mu, sigma, size=(trials, i)).mean(axis=1)
    ax.hist(
        sample_means,
        density=True,
        color="lightgrey",
        edgecolor="darkgrey",
    )
    ax.set_ylim(-0.1, 3)
    ax.set_xlim(-2, 2)
    ax.set_ylabel("Density")
    ax.title.set_text(
        f"Empirical Probabilities vs.\nSampling Distribution for sample size n={i}"
    )
plt.tight_layout()
plt.show()
```

The figure supports the third claim from above: the shape of the sampling distribution of $\bar x$ is normal, whatever the value of $n$.
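A histogram is only a visual check. One possible numerical complement, which is not part of the original experiment, is a Kolmogorov-Smirnov test of the simulated sample means against $N(\mu, \frac{\sigma}{\sqrt{n}})$, using `scipy.stats.kstest`. The seed and the name `ks_results` are assumptions made for this sketch.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)    # fixed seed (assumption) for reproducibility
mu, sigma = 0.0, 1.0              # standard normal population
trials = 1000

ks_results = {}
for n_i in [5, 15, 30, 50]:
    # One sample mean per row of simulated data.
    sample_means = rng.normal(mu, sigma, size=(trials, n_i)).mean(axis=1)
    # Compare the sample means with the theoretical N(mu, sigma / sqrt(n_i)).
    ks_results[n_i] = stats.kstest(
        sample_means, "norm", args=(mu, sigma / np.sqrt(n_i))
    )
    print(
        f"n = {n_i:2d}: KS statistic = {ks_results[n_i].statistic:.3f}, "
        f"p-value = {ks_results[n_i].pvalue:.3f}"
    )
```

A small KS statistic together with a p-value that is not small is consistent with normality; it does not prove it.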

In addition, the figure shows that the empirical distribution of the sample means (bars) closely follows the sampling distribution (colored line), and that the standard deviation of the sampling distribution of $\bar x$ decreases as the sample size increases. Recall that the y-axis represents the *density*, which is the **probability per unit value** of the random variable. This is why the probability density can take a value greater than 1, but only over a region with measure less than 1.
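The point about densities above 1 can be made concrete for $n = 50$. In this short sketch the interval half-width of 0.1 and the integration grid are arbitrary choices for illustration: the density at the mean exceeds 1, yet every probability, being an area under the curve, stays at or below 1.

```python
import numpy as np
import scipy.stats as stats

sd = 1 / np.sqrt(50)                        # standard error for n = 50
peak = stats.norm.pdf(0, loc=0, scale=sd)   # density at the mean, greater than 1

# Probability of a sample mean within +/- 0.1 of the mean: an area, so at most 1.
prob_near_mean = stats.norm.cdf(0.1, 0, sd) - stats.norm.cdf(-0.1, 0, sd)

# Riemann-sum approximation of the total area under the density curve.
grid = np.linspace(-4, 4, 80_001)
dx = grid[1] - grid[0]
total_area = np.sum(stats.norm.pdf(grid, 0, sd)) * dx

print(f"peak density        = {peak:.3f}")
print(f"P(|x_bar| <= 0.1)   = {prob_near_mean:.3f}")
print(f"total area (approx) = {total_area:.3f}")
```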

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*