Building on our intuition about randomness in the sampling process, we introduce the sampling distribution. The sampling distribution is the probability distribution of a sample statistic (Lovric 2010). The name of the computed statistic is often added to the title. For example, if the computed statistic is the sample mean, we speak of the sampling distribution of the sample mean.

Let us recall the simple example from the previous section, where the population is represented by the first 100 integers $$\{1,2,3,...,100\}$$. If we repeatedly sample from that population and compute a sample statistic (e.g. $$\bar x$$ or $$s$$) each time, the distribution of the resulting values is called the sampling distribution of that statistic.

Let us repeatedly take random samples $$(x)$$ of size $$n = 30$$ without replacement from this population. The random sampling might generate sets that look like
$$\{50, 88, 43, 86, 21, 62, 40, 76, 55, 33, 5, 23, 1, 85, 46, 93, 6, 17, 28, 77, 95, 58, 74, 24, 57, 73, 72, 27, 14, 41\}$$
or
$$\{17, 80, 91, 67, 59, 66, 54, 63, 44, 39, 19, 90, 53, 43, 28, 98, 93, 33, 22, 83, 4, 61, 81, 38, 72, 97, 79, 29, 62, 3\}$$.
For each sample we calculate a sample statistic. In this example we take the mean, $$\bar x$$, of each sample. Note, however, that the sample statistic could be any descriptive statistic, such as the median, the standard deviation or a proportion, among others. Once we have obtained the sample means for all samples, we list their distinct values together with the number of times each value occurs (its frequency) in order to obtain relative frequencies, or empirical probabilities. We turn to R to visualize the relative frequency distribution obtained by sampling the given population 1, 10, 100, 500, 1000 and 3000 times. Remember that the sample size is set to $$n=30$$.

pop <- 1:100 # initialize population as all integers between 1 and 100
n <- 30 # sample size

# set plotting parameters
par(mfrow = c(3, 2), mar = c(2, 2, 2, 3), xpd = FALSE)

# start experiment
no_samples <- c(1, 10, 100, 500, 1000, 3000) # set number of samples to be drawn

# run the experiment once for each entry of no_samples (6 times in total)
for (i in seq_along(no_samples)) {
  # draw either 1, 10, 100, 500, 1000 or 3000 random samples of sample size n = 30
  my_samples <- rep(NA, no_samples[i]) # initialize empty vector of length no_samples[i]
  for (j in 1:no_samples[i]) {
    # draw the j-th random sample (without replacement) and store its sample mean
    my_samples[j] <- mean(sample(pop, n))
  }
  # plot result (NOTE: the stripchart() function does not scale well.
  # If you want to experiment with the code you should plot histograms instead)
  stripchart(my_samples,
    method = "stack",
    offset = 0.4,
    at = .01,
    pch = 19,
    col = "red",
    xlim = c(30, 70)
  )

  # mark the population mean with a dashed vertical line
  abline(v = mean(pop), lty = 2)
  text(
    x = mean(pop) * 1.25,
    y = 1.8,
    labels = paste(no_samples[i], "random\nsamples"),
    col = "red"
  )
  text(x = mean(pop) * 0.98, y = 1.8, labels = expression(mu))
}
mtext(expression(paste(
"Relative frequency distribution (occurrences) of ",
bar(x)
)), outer = TRUE, cex = 1, line = -1.5)
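As the note in the code suggests, histograms scale better than `stripchart()` when the number of samples grows. The following is a minimal sketch of that alternative, not part of the original experiment: it assumes the same population and sample size as above, uses `replicate()` instead of the explicit inner loop, and fixes an arbitrary seed only to make the run reproducible. The names `sample_means` and `rel_freq` are introduced here for illustration.

```r
set.seed(42) # arbitrary seed, used only to make this sketch reproducible

pop <- 1:100       # same population as above
n <- 30            # same sample size as above
no_samples <- 3000 # number of random samples to draw

# draw no_samples random samples without replacement and keep each sample mean
sample_means <- replicate(no_samples, mean(sample(pop, n)))

# relative frequencies (empirical probabilities) of the distinct sample means
rel_freq <- prop.table(table(sample_means))

# reset the plotting layout, then draw a histogram on the density scale
par(mfrow = c(1, 1))
hist(sample_means,
  breaks = 30,
  freq = FALSE,
  col = "red",
  main = "Empirical distribution of the sample mean",
  xlab = expression(bar(x))
)
abline(v = mean(pop), lty = 2) # population mean
```

Because `rel_freq` is built with `prop.table()`, its entries sum to one, which is what allows us to read them as empirical probabilities.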

From the figures above we can see that the more samples we draw, the better the relative frequency distribution of the sample statistic approximates the sampling distribution. In other words, as the number of samples approaches infinity, the resulting relative frequency distribution approaches the sampling distribution. Lovric (2010) states that “the sampling distribution of a statistic is a probability distribution of that statistic derived from all possible samples having the same size from the population”. The sampling distribution should not be confused with a sample distribution, however: the latter describes the distribution of the values (elements) in one particular sample.
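The definition quoted above can be made concrete for a population small enough that all possible samples can be listed. The sketch below uses a toy population of six integers and samples of size two; both are assumptions chosen for illustration and do not appear in the original example. For the population of 100 integers with $$n = 30$$, the number of possible samples is on the order of $$10^{25}$$, which is why we relied on repeated random sampling instead.

```r
small_pop <- 1:6 # hypothetical toy population (assumption for illustration)
k <- 2           # hypothetical sample size (assumption for illustration)

# combn() enumerates all choose(6, 2) = 15 possible samples drawn without
# replacement; FUN = mean returns the sample mean of each of them
all_means <- combn(small_pop, k, FUN = mean)

# exact sampling distribution of the sample mean:
# each possible sample is equally likely, so we divide the counts by 15
sampling_dist <- table(all_means) / length(all_means)
print(sampling_dist)

# by contrast, a sample distribution describes the values in ONE sample
one_sample <- sample(small_pop, k)
print(table(one_sample))
```

In this toy case the mean of all 15 sample means coincides with the population mean of 3.5, and the most probable sample mean is also 3.5.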

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.