#### Shape of the Sampling Distribution

The shape of the sampling distribution depends on which of the following two cases applies:
1. The population from which samples are drawn has a normal distribution.
2. The population from which samples are drawn does not have a normal distribution.

#### Sampling from a Normally Distributed Population

When the population from which samples are drawn is normally distributed with its mean equal to $$\mu$$ and standard deviation equal to $$\sigma$$, then:

1. The mean of the sample means, $$\mu_{\bar x}$$, is equal to the mean of the population, $$\mu$$.
2. The standard deviation of the sample means, $$\sigma_{\bar x}$$, is equal to $$\frac{\sigma}{\sqrt{n}}$$, assuming $$\frac{n}{N} \le 0.05$$.
3. The shape of the sampling distribution of the sample means $$(\bar x)$$ is normal for any sample size $$n$$.
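The first two statements can be retraced with a short derivation. This is a sketch under the assumption that the $$n$$ observations $$X_1, \dots, X_n$$ (notation introduced here for illustration) are drawn independently from the population, which is approximately the case when $$\frac{n}{N} \le 0.05$$:

$$\mu_{\bar x} = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu$$

$$\sigma_{\bar x}^2 = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n} \quad\Rightarrow\quad \sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}$$

The third statement rests on the fact that a sum of independent normally distributed random variables is itself normally distributed, so that $$\bar x$$ is normal for every $$n$$.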

Let us consider a normally distributed population. For the sake of simplicity we use the standard normal distribution, $$X \sim N(\mu, \sigma)$$, with $$\mu = 0$$ and $$\sigma = 1$$. Let us further calculate $$\mu_{\bar x}$$ and $$\sigma_{\bar x}$$ for samples of size $$n=5,15,30,50$$.

Recall that for a large enough number of repeated samples $$\mu_{\bar x} \approx \mu$$. Thus, $$\mu_{\bar x}$$ of the different sampling distributions under consideration should be:

$$\mu_{\bar x_{n=5}} = \mu_{\bar x_{n=15}} = \mu_{\bar x_{n=30}} = \mu_{\bar x_{n=50}} = \mu = 0$$

Recall that the standard error of the sampling distribution is $$\sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}$$. Thus, we can easily compute $$\sigma_{\bar x}$$ for samples of $$n=5,15,30,50$$ elements:

$$\sigma_{\bar x_{n=5}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{5}} \approx 0.447$$

$$\sigma_{\bar x_{n=15}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{15}} \approx 0.258$$

$$\sigma_{\bar x_{n=30}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{30}} \approx 0.183$$

$$\sigma_{\bar x_{n=50}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{50}} \approx 0.141$$
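The same arithmetic can be checked in R. The following is a minimal sketch; the variable names `sigma` and `se` are chosen here for illustration only.

```r
# sample sizes from the text; sigma = 1 for the standard normal distribution
n <- c(5, 15, 30, 50)
sigma <- 1

# standard error of the mean for each sample size
se <- sigma / sqrt(n)
names(se) <- paste0("n=", n)

round(se, 3)
# values: 0.447 0.258 0.183 0.141
```

Because `sqrt()` is vectorized, all four standard errors are computed in a single expression.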

The different sampling distributions are visualized in the figure below.
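A figure of this kind could be drawn along the following lines. This is only a sketch: the grid `x`, the colors and the line types are arbitrary choices made here, not necessarily those of the original figure.

```r
n <- c(5, 15, 30, 50)       # sample sizes from the text
x <- seq(-4, 4, by = 0.01)  # grid of values (an arbitrary choice)

# density of the sampling distribution for each sample size, one column per n
dens <- sapply(n, function(k) dnorm(x, mean = 0, sd = 1 / sqrt(k)))

# population density first (dashed), then the four sampling distributions
plot(x, dnorm(x),
  type = "l", lwd = 2, lty = 2,
  ylim = c(0, max(dens)),
  xlab = "x", ylab = "density",
  main = "Population vs. sampling distributions"
)
matlines(x, dens, lty = 1, lwd = 2, col = 2:5)
legend("topright",
  legend = c("population", paste("n =", n)),
  lty = c(2, rep(1, 4)), lwd = 2, col = c(1, 2:5), cex = 0.8
)
```

All curves are centered at $$\mu = 0$$; the larger the sample size, the narrower and taller the curve.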

There are two important observations regarding the sampling distribution of $$\bar x$$:

1. The spread of the sampling distribution is smaller than the spread of the corresponding population distribution. In other words, $$\sigma_{\bar x} < \sigma$$ for any $$n > 1$$.
2. The standard deviation of the sampling distribution decreases as the sample size increases.

To verify the third claim from above, that the shape of the sampling distribution of $$\bar x$$ is normal for any value of $$n$$, we conduct a computational experiment. For each of the sample sizes $$n=5,15,30,50$$ we draw 1000 random samples from the standard normal distribution, $$X \sim N(\mu = 0, \sigma = 1)$$. For each sample we calculate the sample mean $$\bar x$$, and we visualize the empirical distribution of these sample means as a density histogram. Afterwards we compare this empirical distribution with the sampling distribution calculated from the equations above.

```r
trials <- 1000
n <- c(5, 15, 30, 50) # sample sizes

# empty matrix to store the sample means (rows: trials, columns: sample sizes)
out <- matrix(nrow = trials, ncol = length(n))

# plotting parameters
color <- c(2, 3, 4, 5, 6)

# random sampling: one sample mean per trial and sample size
for (i in seq_len(trials)) {
  for (j in seq_along(n)) {
    out[i, j] <- mean(rnorm(n[j]))
  }
}

# plotting
par(mfrow = c(2, 2), mar = c(3, 4, 2, 3))

for (i in seq_along(n)) {
  # histogram of the simulated sample means
  h <- hist(out[, i],
    breaks = "Scott",
    plot = FALSE
  )
  plot(h,
    freq = FALSE,
    xlim = c(-2, 2),
    main = paste("Empirical Probabilities vs.\nSampling Distribution for sample size n =", n[i]),
    cex.main = 0.75
  )
  # theoretical sampling distribution, drawn on top of the histogram
  curve(dnorm(x, mean = 0, sd = 1 / sqrt(n[i])),
    from = -4, to = 4, n = 1000,
    add = TRUE, # add to the existing plot instead of starting a new one
    type = "l", # set line type
    lwd = 2, # set line width
    col = color[i] # set line color
  )
  legend(
    x = 0.8, # set x position
    y = max(h$density) * 0.7, # set y position
    legend = paste("n =", n[i]), # set appropriate legend names
    lty = 1, # set line type
    lwd = 2, # set line width
    col = color[i], # set line color
    cex = 0.7 # set font size
  )
}
```
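Alongside the plots, the simulated sample means can also be summarized numerically and set against the theoretical values. The sketch below repeats the simulation in vectorized form so that it runs on its own; the seed and the name `comparison` are arbitrary choices made here.

```r
set.seed(1) # arbitrary seed, only for reproducibility
trials <- 1000
n <- c(5, 15, 30, 50)

# one column of 1000 simulated sample means per sample size
means <- sapply(n, function(k) replicate(trials, mean(rnorm(k))))

# empirical mean and standard deviation vs. the theoretical standard error
comparison <- data.frame(
  n = n,
  mean_empirical = colMeans(means),
  sd_empirical = apply(means, 2, sd),
  sd_theoretical = 1 / sqrt(n)
)
print(round(comparison, 3))
```

The empirical means should lie close to $$\mu = 0$$, and the empirical standard deviations close to $$\frac{1}{\sqrt{n}}$$, with small deviations due to the finite number of trials.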

The figure supports the third claim from above: the shape of the sampling distribution of $$\bar x$$ is normal for any value of $$n$$.
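A histogram is only one way to judge normality. As a complementary check, one could look at normal Q-Q plots of the simulated sample means. The sketch below uses `qqnorm()` and `qqline()` from base R; the seed and the summary vector `qq_cor` are introduced here for illustration.

```r
set.seed(1) # arbitrary seed, only for reproducibility
trials <- 1000
n <- c(5, 15, 30, 50)

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
qq_cor <- numeric(length(n))

for (j in seq_along(n)) {
  means <- replicate(trials, mean(rnorm(n[j])))
  # qqnorm() draws the plot and invisibly returns the plotted quantiles
  qq <- qqnorm(means, main = paste("Normal Q-Q plot, n =", n[j]), cex = 0.5)
  qqline(means, col = 2, lwd = 2)
  # correlation between theoretical and sample quantiles
  qq_cor[j] <- cor(qq$x, qq$y)
}

round(qq_cor, 4)
```

If the sample means are normally distributed, the points fall close to the reference line and the correlations are close to 1 for every sample size.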

In addition, the figure shows that the empirical distribution of the sample means (bars) fits the sampling distribution (colored line) well. Also, the standard deviation of the sampling distribution of $$\bar x$$ decreases as the sample size increases. Recall that the y-axis represents the density, which is the probability per unit value of the random variable. This is why the probability density can take a value greater than 1, but only over an interval of length less than 1, so that the total area under the curve remains 1.
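The remark about densities greater than 1 can be illustrated for $$n = 50$$. The following is a minimal sketch; the names `sd_50`, `peak` and `area` are chosen here for illustration.

```r
sd_50 <- 1 / sqrt(50) # standard error for n = 50

# density at the mean: about 2.82, i.e. clearly greater than 1
peak <- dnorm(0, mean = 0, sd = sd_50)

# area under the density curve between -1 and 1 (about 7 standard errors
# on either side of the mean), which is numerically 1
area <- integrate(dnorm, lower = -1, upper = 1, mean = 0, sd = sd_50)$value

c(peak = peak, area = area)
```

The density exceeds 1 near the mean, yet the area under the curve, which is the actual probability, does not.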

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.