The shape of the sampling distribution relates to the following two
cases:

1. The population from which samples are drawn has a normal
distribution.

2. The population from which samples are drawn does not have a normal
distribution.

When the population from which samples are drawn is normally distributed with its mean equal to \(\mu\) and standard deviation equal to \(\sigma\), then:

- The mean of the sample means, \(\mu_{\bar x}\), is equal to the mean of the population, \(\mu\).

- The standard deviation of the sample means, \(\sigma_{\bar x}\), is equal to \(\frac{\sigma}{\sqrt{n}}\), assuming \(\frac{n}{N} \le 0.05\).

- The shape of the sampling distribution of the sample means \((\bar x)\) is normal for any value of \(n\).

Let us consider a normally distributed population. For the sake of simplicity we use the standard normal distribution, \(X \sim N(\mu, \sigma)\), with \(\mu = 0\) and \(\sigma = 1\). Let us further calculate \(\mu_{\bar x}\) and \(\sigma_{\bar x}\) for samples of size \(n=5,15,30,50\).

Recall that, for a large enough number of repeated samples, \(\mu_{\bar x} \approx \mu\). Thus, \(\mu_{\bar x}\) of the different sampling distributions under consideration should be:

\[\mu_{\bar x_{n=5}} = \mu_{\bar x_{n=15}} = \mu_{\bar x_{n=30}} = \mu_{\bar x_{n=50}} = \mu = 0\]

Recall that the standard error of the sampling distribution is \(\sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}\). Thus, we can easily compute \(\sigma_{\bar x}\) for samples of \(n=5,15,30,50\) elements:

\[\sigma_{\bar x_{n=5}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{5}}\approx 0.447\]

\[\sigma_{\bar x_{n=15}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{15}}\approx 0.258\]

\[\sigma_{\bar x_{n=30}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{30}}\approx 0.183\]

\[\sigma_{\bar x_{n=50}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{50}} \approx 0.141\]
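The four standard errors above can be reproduced in a few lines of R. This is a minimal sketch; the variable names `sigma`, `n` and `se` are chosen here for illustration only.

```r
sigma <- 1                     # population standard deviation (standard normal)
n <- c(5, 15, 30, 50)          # sample sizes considered above
se <- sigma / sqrt(n)          # standard error of the mean, sigma / sqrt(n)
names(se) <- paste0("n=", n)   # label each value with its sample size
round(se, 3)                   # 0.447 0.258 0.183 0.141
```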

The different sampling distributions are visualized in the figure below.

There are two important observations regarding the sampling distribution of \(\bar x\):

- The spread of the sampling distribution is smaller than the spread of the corresponding population distribution. In other words, \(\sigma_{\bar x} < \sigma\) for any \(n > 1\).

- The standard deviation of the sampling distribution decreases as the sample size increases.
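Both observations can also be checked numerically. The following sketch simulates sample means from the standard normal distribution and compares their standard deviation with \(\frac{\sigma}{\sqrt{n}}\). The seed, the number of trials (10000) and the names `sample_means`, `empirical` and `theoretical` are assumptions made for this illustration.

```r
set.seed(1)                    # arbitrary seed, used only for reproducibility
trials <- 10000                # number of repeated samples (assumed value)
n <- c(5, 15, 30, 50)          # sample sizes

# for each sample size, draw `trials` samples and keep their means;
# the result is a matrix with one column per sample size
sample_means <- sapply(n, function(k) replicate(trials, mean(rnorm(k))))

empirical <- apply(sample_means, 2, sd)   # SD of the simulated sample means
theoretical <- 1 / sqrt(n)                # sigma / sqrt(n) with sigma = 1

round(rbind(empirical, theoretical), 3)
```

The empirical values should lie close to the theoretical ones, stay below \(\sigma = 1\), and shrink as \(n\) grows.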

In order to verify the third claim from above, that the shape of the sampling distribution of \(\bar x\) is normal whatever the value of \(n\), we conduct a computational experiment. For each sample size \(n=5,15,30,50\) we draw a large number of samples (1000) from the standard normal distribution \(N(\mu = 0, \sigma = 1)\). For each sample we calculate the sample mean \(\bar x\) and visualize the empirical probabilities of the sample means as a histogram. Afterwards we compare the empirical distribution with the sampling distributions calculated from the equations above.

```
trials <- 1000
n <- c(5, 15, 30, 50) # sample sizes
# empty matrix to store the sample means (rows: trials, columns: sample sizes)
out <- matrix(nrow = trials, ncol = length(n))
# plotting parameters
color <- c(2, 3, 4, 5, 6)
# random sampling: draw a sample of size n[j] and store its mean
for (i in seq_len(trials)) {
  for (j in seq_along(n)) {
    out[i, j] <- mean(rnorm(n[j]))
  }
}
# plotting: one panel per sample size
par(mfrow = c(2, 2), mar = c(3, 4, 2, 3))
for (i in seq_along(n)) {
  # compute the histogram without drawing it
  h <- hist(out[, i], breaks = "Scott", plot = FALSE)
  # draw the histogram on the density scale
  plot(h,
    freq = FALSE,
    xlim = c(-2, 2),
    main = paste("Empirical Probabilities vs.\nSampling Distribution for sample size n=", n[i]),
    cex.main = 0.75
  )
  # overlay the theoretical sampling distribution N(0, 1 / sqrt(n))
  curve(dnorm(x, mean = 0, sd = 1 / sqrt(n[i])),
    from = -4, to = 4, n = 1000,
    type = "l", # set line type
    lwd = 2, # set line width
    add = TRUE,
    col = color[i] # set line color
  )
  legend(
    x = 0.8, # set x position
    y = max(h$density) * 0.7, # set y position
    paste("n = ", n[i]), # set appropriate legend names
    lty = 1, # set line type
    lwd = 2, # set line width
    col = color[i], # set line color
    cex = 0.7 # set font size
  )
}
```

The figure supports the third claim from above: the shape of the sampling distribution of \(\bar x\) is normal for any value of \(n\).
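A histogram is only a visual check. As a complementary and equally informal check, one can compare the empirical quantiles of the simulated sample means with the quantiles of the theoretical sampling distribution. The sketch below does this for \(n = 5\); the seed and the names `x_bar`, `q_emp` and `q_theo` are assumptions made for this illustration.

```r
set.seed(42)                   # arbitrary seed, used only for reproducibility
trials <- 1000                 # number of repeated samples
n <- 5                         # sample size

# simulate the sample means
x_bar <- replicate(trials, mean(rnorm(n)))

p <- seq(0.1, 0.9, by = 0.1)                      # probabilities to compare
q_emp <- quantile(x_bar, probs = p)               # empirical quantiles
q_theo <- qnorm(p, mean = 0, sd = 1 / sqrt(n))    # quantiles of N(0, 1 / sqrt(n))

round(cbind(empirical = q_emp, theoretical = q_theo), 3)

# optional visual check: points close to the line indicate normality
# qqnorm(x_bar); qqline(x_bar)
```

If the sampling distribution is normal, the two columns should agree up to simulation noise.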

In addition, the figure shows that the distribution of the empirical probabilities (bars) fits the sampling distribution (colored line) well. Also, the standard deviation of the sampling distribution of \(\bar x\) decreases as the sample size increases. Recall that the y-axis represents the *density*, which is the **probability per unit value** of the random variable. This is why the probability density can take a value greater than 1: the total area under the curve must equal 1, so the density can exceed 1 only over an interval narrower than 1.
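The point about densities greater than 1 can be illustrated with the sampling distribution for \(n = 50\). In this sketch the names `sd_50`, `peak` and `area` are chosen for illustration, and the integration limits \(\pm 2\) are an assumption that covers essentially all of the probability mass.

```r
sd_50 <- 1 / sqrt(50)          # standard error for n = 50

# height of the density at the mean: larger than 1
peak <- dnorm(0, mean = 0, sd = sd_50)

# area under the density curve: still (numerically) equal to 1
area <- integrate(dnorm, lower = -2, upper = 2, mean = 0, sd = sd_50)$value

peak   # about 2.82
area   # approximately 1
```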

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*