Based on our intuition of randomness in the sampling process, we
introduce the **Sampling Distribution**. The sampling
distribution is the probability distribution of a sample statistic (Lovric 2010).
The name of the computed statistic is often added to the title.
For example, if the computed statistic is the sample mean, the sampling
distribution is called **the sampling distribution of the
sample mean**.

Let us recall the simple example from the previous section, where the
population is represented by the first 100 integers \(\{1,2,3,...,100\}\). If we repeatedly
sample from that population and compute the sample statistic (e.g. \(\bar x\) or \(s\)) each time, the resulting distribution
of the sample statistic is called the **sampling distribution of
that statistic**.

Let us repeatedly take random samples \((x)\) of size \(n = 30\)
without replacement from this population. The random
sampling might generate sets that look like

\(\{50, 88, 43, 86, 21, 62, 40, 76, 55, 33, 5,
23, 1, 85, 46, 93, 6, 17, 28, 77, 95, 58, 74, 24, 57, 73, 72, 27, 14,
41\}\)

or

\(\{17, 80, 91, 67, 59, 66, 54, 63, 44, 39,
19, 90, 53, 43, 28, 98, 93, 33, 22, 83, 4, 61, 81, 38, 72, 97, 79, 29,
62, 3\}\).
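A single draw like the two shown above can be reproduced with a few lines of R. This is only a minimal sketch: the seed is an assumption added here for reproducibility and is not part of the original example, so the values drawn will differ from the sets printed above.

```r
set.seed(1) # assumption: fixed seed so the draw is reproducible

pop <- 1:100 # population: the integers 1 to 100
n <- 30      # sample size

# sample() draws without replacement by default (replace = FALSE)
x <- sample(pop, size = n)

x_bar <- mean(x) # sample mean
s <- sd(x)       # sample standard deviation

x     # the sampled values
x_bar
s
```

Running the last five lines again (without resetting the seed) gives a different sample and therefore a different \(\bar x\) and \(s\); this variation from sample to sample is what the sampling distribution describes.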

For each sample we calculate a sample statistic. In this example we take
the mean, \(\bar x\), of each sample.
Note, however, that the sample statistic could be any descriptive
statistic, such as the median, the standard deviation or a proportion,
among others. Once we have obtained the sample means for all samples, we list
their distinct values and the number of times each value occurs (frequencies)
in order to obtain relative frequencies, or **empirical
probabilities**. We turn to R to visualize the relative frequency
distribution of the sample means obtained by sampling the given population
1, 10, 100, 500, 1000 and 3000 times. Remember that the sample size is set to \(n=30\).

```
pop <- 1:100 # initialize population as all integers between 1 and 100
n <- 30 # sample size
# set plotting parameters
par(mfrow = c(3, 2), mar = c(2, 2, 2, 3), xpd = FALSE)
# start experiment
no_samples <- c(1, 10, 100, 500, 1000, 3000) # set number of samples to be drawn
# run the experiment once for each entry of no_samples (6 times)
for (i in seq_along(no_samples)) {
  # draw either 1, 10, 100, 500, 1000 or 3000 random samples of sample size n=30
  my_samples <- rep(NA, no_samples[i]) # initialize vector of length no_samples[i]
  for (j in 1:no_samples[i]) {
    # draw the j-th random sample and store its sample mean
    my_samples[j] <- mean(sample(pop, n))
  }
  # plot result (NOTE: the stripchart() function does not scale well.
  # If you want to experiment with the code you should plot histograms instead)
  stripchart(my_samples,
    method = "stack",
    offset = 0.4,
    at = .01,
    pch = 19,
    col = "red",
    xlim = c(30, 70)
  )
  # mark the population mean with a dashed vertical line
  abline(v = mean(pop), lty = 2)
  text(
    x = mean(pop) * 1.25,
    y = 1.8,
    labels = paste(no_samples[i], "random\nsamples"),
    col = "red"
  )
  text(x = mean(pop) * 0.98, y = 1.8, labels = expression(mu))
}
# add title
mtext(expression(paste(
  "Relative frequency distribution (occurrences) of ",
  bar(x)
)), outer = TRUE, cex = 1, line = -1.5)
```
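The relative frequencies (empirical probabilities) described before the plotting code can also be tabulated directly instead of being read off a chart. The following is a minimal sketch, not part of the original experiment: the seed, the variable names `k`, `sample_means`, `abs_freq` and `rel_freq`, and the use of `replicate()` in place of the explicit loop are all assumptions made for illustration.

```r
set.seed(42) # assumption: fixed seed for reproducibility

pop <- 1:100 # population
n <- 30      # sample size
k <- 1000    # number of repeated samples (one of the values used above)

# replicate() repeats the draw-and-average step k times
sample_means <- replicate(k, mean(sample(pop, n)))

# absolute frequencies: how often each distinct sample mean occurs
abs_freq <- table(sample_means)

# relative frequencies (empirical probabilities); they sum to 1
rel_freq <- prop.table(abs_freq)

head(rel_freq)
sum(rel_freq)
```

Because a mean of 30 integers can take many distinct values, most entries of `rel_freq` are small; grouping the means into bins, as a histogram does, gives a more readable picture.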

The figures above show that the more samples we take,
the better the relative frequency distribution of the sample
statistic approximates the sampling distribution. In other words, as
the number of samples approaches infinity, the resulting frequency
distribution will approach the sampling distribution. Lovric (2010)
states that “the **sampling distribution of a statistic**
is a probability distribution of that statistic derived from all
possible samples having the same size from the population”. However, the
sampling distribution should not be confused with a sample distribution:
the latter describes the distribution of values (elements) in one
particular sample.
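The difference between the two can be made visible with a short sketch. It assumes the same population and sample size as above; the seed, the variable names and the choice of 3000 repetitions are illustrative assumptions, and the histogram of sample means is only an approximation of the sampling distribution, since not all possible samples are drawn.

```r
set.seed(123) # assumption: fixed seed for reproducibility

pop <- 1:100 # population
n <- 30      # sample size

# sample distribution: the n values contained in one particular sample
one_sample <- sample(pop, n)

# approximate sampling distribution: means of many samples of size n
sample_means <- replicate(3000, mean(sample(pop, n)))

# plot both on the same x-axis range for comparison
par(mfrow = c(1, 2))
hist(one_sample, xlim = c(0, 100),
  main = "One sample (values)", xlab = "x")
hist(sample_means, xlim = c(0, 100),
  main = "Sample means", xlab = expression(bar(x)))

# the sample means vary far less than the individual values
sd(one_sample)
sd(sample_means)
```

The left panel spreads over most of the range from 1 to 100, while the right panel is concentrated around the population mean \(\mu = 50.5\).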

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*