Let us repeat the sampling process from the previous section five times, store the results in the variable my_experiment,
and print the mean, $\bar x$, of each sample.
# First, let's import all the needed libraries.
import numpy as np
import random
population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
my_experiment = []
for i in np.arange(1, 6):
    my_sample = random.sample(population, 3)
    my_experiment.append(np.mean(my_sample))
    print(f"Sample number {i} has a mean of {round(np.mean(my_sample),2)}.")
Sample number 1 has a mean of 6.33.
Sample number 2 has a mean of 7.0.
Sample number 3 has a mean of 4.67.
Sample number 4 has a mean of 4.33.
Sample number 5 has a mean of 4.33.
Different samples of the same size drawn from the same population generally yield different sample statistics because they contain different elements. Moreover, a sample statistic obtained from a sample, such as the sample mean $\bar x$, will in general differ from the corresponding population parameter, here the population mean $\mu$. The difference between the value of a sample statistic and the value of the corresponding population parameter is called the sampling error. In the case of the mean, the sampling error can be written as
$$\text{sampling error} = \bar x - \mu$$
Because a random sample is a set of values drawn from the population by chance, the resulting sampling error also occurs by chance. In other words, the sampling error is a random variable. However, one should note that besides this randomness there are other sources of error. These errors are often related to the data generation process and are subsumed under the term non-sampling error. Such errors are introduced, for example, by human handling of the data or by calibration errors of the measuring devices.
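To make the idea of the sampling error as a random variable more concrete, the following minimal sketch enumerates every possible sample of size 3 from the small population used above and computes the signed sampling error $\bar x - \mu$ for each of them. The variable names all_samples and sampling_errors are introduced here for illustration only, and the sketch uses the standard library (itertools, statistics) instead of random draws so that the result does not depend on chance.

```python
from itertools import combinations
from statistics import mean

population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mu = mean(population)  # population mean

# Enumerate every possible sample of size 3 (drawn without replacement).
all_samples = list(combinations(population, 3))

# Signed sampling error (sample mean minus population mean) for each sample.
sampling_errors = [mean(sample) - mu for sample in all_samples]

print(f"Number of possible samples: {len(all_samples)}")
print(f"Smallest sampling error: {min(sampling_errors)}")
print(f"Largest sampling error: {max(sampling_errors)}")
print(f"Average sampling error (absolute value): {abs(mean(sampling_errors)):.4f}")
```

Each random draw picks one of these samples, so the sampling error takes one of the listed values by chance. Across all possible samples the positive and negative errors cancel, which is why the average sampling error is zero up to floating-point rounding.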
To gain some intuition about the nature of the sampling error, we conduct an experiment. For this experiment the population of interest consists of the first 100 integers $\{1,2,3,...,100\}$. We want to analyse the effect of the sample size, $n$, on the sampling error. For the sake of simplicity we choose the sample mean as the statistic of interest. For a sufficiently large number of trials (trials = 1000) we calculate the absolute sampling error, $|\bar x - \mu|$, for samples of sizes $n = 10,25,50,75$ and then average it over all trials.
pop = list(np.arange(1, 101))
pop_mean = np.mean(pop)
vector_error_sample_10 = []
vector_error_sample_25 = []
vector_error_sample_50 = []
vector_error_sample_75 = []
trials = np.arange(1, 1001)
for trial in trials:
    my_sample_10 = random.sample(pop, 10)
    my_sample_25 = random.sample(pop, 25)
    my_sample_50 = random.sample(pop, 50)
    my_sample_75 = random.sample(pop, 75)
    error_sample_10 = abs(np.mean(my_sample_10) - pop_mean)
    error_sample_25 = abs(np.mean(my_sample_25) - pop_mean)
    error_sample_50 = abs(np.mean(my_sample_50) - pop_mean)
    error_sample_75 = abs(np.mean(my_sample_75) - pop_mean)
    vector_error_sample_10.append(error_sample_10)
    vector_error_sample_25.append(error_sample_25)
    vector_error_sample_50.append(error_sample_50)
    vector_error_sample_75.append(error_sample_75)
print(f"Sampling Error, n = 10: {round(np.mean(vector_error_sample_10),3)}")
print(f"Sampling Error, n = 25: {round(np.mean(vector_error_sample_25),3)}")
print(f"Sampling Error, n = 50: {round(np.mean(vector_error_sample_50),3)}")
print(f"Sampling Error, n = 75: {round(np.mean(vector_error_sample_75),3)}")
Sampling Error, n = 10: 6.874
Sampling Error, n = 25: 3.902
Sampling Error, n = 50: 2.356
Sampling Error, n = 75: 1.369
Based on the experiment above, we may conclude that the larger the sample size, the smaller the sampling error tends to be on average. In other words, as the sample size increases, the sample mean, $\bar x$, approximates the population mean, $\mu$, more closely. This is an important insight, which will be discussed in more detail in the section on Inferential Statistics.
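The experiment can also be written more compactly by looping over the sample sizes. The following is a sketch under stated assumptions rather than the original implementation: it uses only the standard library, the seed value 42 is an arbitrary choice made here for reproducibility, and the names rng, sample_sizes, n_trials and mean_abs_error are introduced for illustration.

```python
import random
from statistics import mean

pop = list(range(1, 101))
pop_mean = mean(pop)

rng = random.Random(42)  # fixed seed (arbitrary, assumed here) for reproducible draws
sample_sizes = [10, 25, 50, 75]
n_trials = 1000

# Mean absolute sampling error |x_bar - mu| per sample size, averaged over all trials.
mean_abs_error = {}
for n in sample_sizes:
    errors = [abs(mean(rng.sample(pop, n)) - pop_mean) for _ in range(n_trials)]
    mean_abs_error[n] = mean(errors)

for n, err in mean_abs_error.items():
    print(f"Mean absolute sampling error, n = {n}: {err:.3f}")
```

Storing the results in a dictionary keyed by sample size avoids one set of variables per sample size and makes it easy to add further values of $n$. The exact numbers depend on the seed, but the decreasing pattern should match the output shown above.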
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.