2054_sampling_distribution

Based upon our intuition of randomness in the sampling process, we introduce the Sampling Distribution. The sampling distribution is a distribution of a sample statistic (Lovric 2011). Often the name of the computed statistic is added as part of the title. For example, if the computed statistic is the sample mean, the sampling distribution would be titled the sampling distribution of the sample mean.

In [2]:

# First, let's import all the needed libraries.
import numpy as np
import random
import matplotlib.pyplot as plt

In [3]:

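# Toy values for experimenting; n and pop are redefined in the cells below,
# and population and n_pop are not used again in this section.
n = 3  # sample size for the toy population
population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # small toy population
pop = np.arange(1, 101, 1)  # all integers from 1 to 100
n_pop = 30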

Let us recall the simple example from the previous section, where the population is represented by the first 100 integers $\{1,2,3,...,100\}$. If we repeatedly sample from that population and compute a sample statistic each time (e.g. $\bar x$ or $s$), the resulting distribution of sample statistics is called the sampling distribution of that statistic.

Let us repeatedly take random samples $(x)$ of size $n = 30$ without replacement from this population. The random sampling might generate sets that look like
$\{22,85,53,64,34,93,80,82,84,71,74,63,72,56,73,3,88,57,25,46,48,21,51,38,9,67,75,97,8,90\}$
or
$\{39,68,49,58,9,57,93,74,81,65,66,59,36,20,97,67,79,70,54,78,75,95,63,10,15,37,89,24,28,61\}$, ....
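A single draw of this kind can be sketched as follows. This is a minimal illustration, assuming Python's `random.sample` (which samples without replacement); the seed value and the variable name `x` are chosen here only for the example.

```python
import random

import numpy as np

random.seed(42)  # assumed seed, only so that the draw is repeatable

pop = list(range(1, 101))  # population: the integers 1 to 100
n = 30  # sample size

# random.sample draws without replacement, so no element can occur twice
x = random.sample(pop, n)

print(sorted(x))
print("sample mean:", np.mean(x))
print("distinct elements:", len(set(x)))
```

Because the sampling is without replacement, the number of distinct elements always equals the sample size.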

For each sample we calculate a sample statistic. In this example we take the mean, $\bar x$, of each sample. Note, however, that the sample statistic could be any descriptive statistic, such as the median, the standard deviation, or a proportion. Once we have obtained the sample means for all samples, we list their distinct values and the number of occurrences (frequencies) of each value in order to obtain relative frequencies or empirical probabilities. We turn to Python to visualize the relative frequency distribution obtained by sampling the given population 1, 10, 100, 500, 1000, and 3000 times. The sample size is set to $n=30$.
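The step from sample means to relative frequencies can be sketched as below. This is only an illustration under assumptions: the seed, the number of repeated samples `k = 1000`, and the names `sample_means` and `rel_freq` are not part of the original material.

```python
import random

import numpy as np

random.seed(1)  # assumed seed for repeatability

pop = list(range(1, 101))  # population: the integers 1 to 100
n = 30  # sample size
k = 1000  # number of repeated samples (assumed for this sketch)

# one sample mean per repeated sample
sample_means = [float(np.mean(random.sample(pop, n))) for _ in range(k)]

# distinct values of the sample mean and how often each one occurs
values, counts = np.unique(sample_means, return_counts=True)

# relative frequencies (empirical probabilities) sum to 1
rel_freq = counts / counts.sum()

print(len(values), "distinct sample means")
print("sum of relative frequencies:", rel_freq.sum())
```

Dividing the counts by the total number of samples changes only the scale of the vertical axis, not the shape of the distribution.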

In [5]:

pop = list(np.arange(1, 101, 1))  # population: all integers from 1 to 100
n = 30  # sample size

# each entry is one more than the number of samples to draw,
# i.e. 1, 10, 100, 500, 1000 and 3000 samples
no_samples = [2, 11, 101, 501, 1001, 3001]

samples_list = []  # one list of sample means per entry of no_samples

for upper in no_samples:
    my_samples = []

    for _ in range(1, upper):
        # draw a random sample without replacement and store its mean
        my_samples.append(np.mean(random.sample(pop, n)))
    samples_list.append(my_samples)

In [6]:

fig, axes = plt.subplots(2, 3, figsize=(10, 5))
fig.suptitle("Relative frequency distribution (occurrences) of $\\bar{x}$", fontsize=20)

for axis, means, rs in zip(axes.ravel(), samples_list, no_samples):
    # distinct sample means and how often each one occurred
    values, counts = np.unique(means, return_counts=True)
    axis.plot(
        values,
        counts,
        c="red",
        marker="o",
        ms=10,
        linestyle="",
    )
    axis.set_yticks([])
    axis.set_ylim(-0.1, 15)
    axis.set_xlim(34, 66)
    # the dashed line marks the population mean
    axis.vlines(np.mean(pop), 0, 15, color="black", linestyle="dashed")
    axis.set_title(f"{rs-1} random\nsamples")

plt.tight_layout()
plt.show()

The more samples we draw, the better the relative frequency distribution of the sample statistic approximates the sampling distribution. In other words, as the number of samples approaches infinity, the resulting frequency distribution approaches the sampling distribution. Lovric (2011) states that "the sampling distribution of a statistic is a probability distribution of that statistic derived from all possible samples having the same size from the population". However, the sampling distribution should not be confused with a sample distribution: the latter describes the distribution of values (elements) in one particular sample.
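For a very small population, the definition quoted above can be applied literally by enumerating all possible samples. The following sketch assumes a toy population of the integers 1 to 10 and a sample size of 3; these values and the names `small_pop` and `sampling_dist` are illustrative choices, not part of the example with 100 integers.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import combinations

small_pop = range(1, 11)  # assumed toy population: the integers 1 to 10
n = 3  # assumed sample size for the toy population

# every possible sample of size n without replacement (order ignored)
all_samples = list(combinations(small_pop, n))

# exact sample mean of each possible sample
means = [Fraction(sum(s), n) for s in all_samples]

# all samples are equally likely, so probability = count / number of samples
counts = Counter(means)
sampling_dist = {
    m: Fraction(c, len(all_samples)) for m, c in sorted(counts.items())
}

# mean of the sampling distribution of the sample mean
expected = sum(m * p for m, p in sampling_dist.items())

print("possible samples:", len(all_samples))
print("sum of probabilities:", sum(sampling_dist.values()))
print("mean of the sampling distribution:", expected)

# number of possible samples of size 30 from a population of 100
print("possible samples for N=100, n=30:", math.comb(100, 30))
```

For the toy population there are 120 possible samples, and the mean of the exact sampling distribution equals the population mean of 5.5. For the population of 100 integers with $n=30$, the count printed in the last line is far too large to enumerate, which is one reason to approximate the sampling distribution by repeated random sampling.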

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.