Let us repeat the sampling process from the previous section five times, store the sample means in the variable my_experiment, and print the mean, $$\bar x$$, of each sample:

population <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

my_experiment <- NULL
for (i in 1:5) {
  # draw 3 elements from the population without replacement
  my_sample <- sample(population, size = 3)
  # append the mean of this sample to the results
  my_experiment <- c(my_experiment, mean(my_sample))
  cat(sprintf("Sample number %s has a mean of %s.\n", i, round(mean(my_sample), 2)))
}
## Sample number 1 has a mean of 5.33.
## Sample number 2 has a mean of 6.
## Sample number 3 has a mean of 5.67.
## Sample number 4 has a mean of 4.67.
## Sample number 5 has a mean of 5.

Different samples of the same size selected from the same population generally yield different sample statistics because they contain different elements. Moreover, a sample statistic, such as the sample mean $$\bar x$$, will generally differ from the corresponding population parameter, here the population mean $$\mu$$. The difference between the value of a sample statistic and the value of the corresponding population parameter is called the sampling error. In the case of the mean, the sampling error can be written as:

$\text{sampling error} = \bar x - \mu.$

Because random sampling, that is, the process of drawing a set of values from the population, is governed by chance, the resulting sampling error also occurs by chance. In other words, the sampling error is a random variable.
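To make the random nature of the sampling error more tangible, the following minimal sketch draws many samples of size 3 from the small population used above and records the signed sampling error of each one. The seed, the number of repetitions (1000) and the variable names mu, sample_means and sampling_errors are assumptions chosen for illustration only.

```r
# Illustrative sketch; seed, repetition count and names are assumptions.
set.seed(123)  # fix the random number generator so the run is repeatable

population <- 1:10
mu <- mean(population)  # population mean, 5.5

# draw 1000 samples of size 3 (without replacement) and keep each sample mean
sample_means <- replicate(1000, mean(sample(population, size = 3)))

# signed sampling error of every sample: x_bar - mu
sampling_errors <- sample_means - mu

print(range(sampling_errors))  # individual errors can be negative or positive
print(mean(sampling_errors))   # on average the errors lie close to zero
```

Each run with a different seed gives a different set of errors, which is exactly what it means for the sampling error to be a random variable.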

Note: Besides the randomness of sampling described above, there are other sources of error. These errors are often related to the data generation process and are subsumed under the term non-sampling error. Such errors are introduced, for example, by human handling of the data or by calibration errors of the measuring devices.

To gain some intuition about the nature of the sampling error, we conduct an experiment. The population of interest consists of the first 100 positive integers, $$\{1,2,3,\ldots,100\}$$. We want to analyze the effect of the sample size, $$n$$, on the sampling error. For the sake of simplicity, we choose the sample mean as the statistic of interest. For each of the sample sizes $$n = 10,25,50,75$$ we draw 1000 samples, calculate the absolute sampling error $$|\bar x - \mu|$$ of each one, and then average these values over all trials:

pop <- 1:100
pop_mean <- mean(pop)
# one result vector per sample size
vector_error_sample_10 <- NULL
vector_error_sample_25 <- NULL
vector_error_sample_50 <- NULL
vector_error_sample_75 <- NULL

trials <- 1000
for (trial in 1:trials) {
  # draw one sample of each size (without replacement)
  my_sample_10 <- sample(pop, 10)
  my_sample_25 <- sample(pop, 25)
  my_sample_50 <- sample(pop, 50)
  my_sample_75 <- sample(pop, 75)

  # absolute sampling error of each sample mean
  error_sample_10 <- abs(mean(my_sample_10) - pop_mean)
  error_sample_25 <- abs(mean(my_sample_25) - pop_mean)
  error_sample_50 <- abs(mean(my_sample_50) - pop_mean)
  error_sample_75 <- abs(mean(my_sample_75) - pop_mean)

  # append the errors of this trial to the result vectors
  vector_error_sample_10 <- c(vector_error_sample_10, error_sample_10)
  vector_error_sample_25 <- c(vector_error_sample_25, error_sample_25)
  vector_error_sample_50 <- c(vector_error_sample_50, error_sample_50)
  vector_error_sample_75 <- c(vector_error_sample_75, error_sample_75)
}

print(paste("Sampling Error, n = 10: ", mean(vector_error_sample_10)))
##  "Sampling Error, n = 10:  7.0923"
print(paste("Sampling Error, n = 25: ", mean(vector_error_sample_25)))
##  "Sampling Error, n = 25:  4.0398"
print(paste("Sampling Error, n = 50: ", mean(vector_error_sample_50)))
##  "Sampling Error, n = 50:  2.26824"
print(paste("Sampling Error, n = 75: ", mean(vector_error_sample_75)))
##  "Sampling Error, n = 75:  1.34489333333333"

Based on the above experiment, we may conclude that the larger the sample size, the smaller the sampling error is on average. In other words, as the sample size increases, the sample mean $$\bar x$$ tends to lie closer to the population mean $$\mu$$. This is an important insight, which will be discussed in more detail in the section on Inferential Statistics.
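The experiment above can also be written more compactly. The following sketch is one possible variant, not the code used to produce the numbers shown above: it wraps the repeated sampling in a hypothetical helper function mean_abs_error and applies it to each sample size with sapply. The seed and the names sample_sizes, mean_abs_error and result are assumptions introduced for illustration.

```r
# Illustrative sketch; seed and helper names are assumptions.
set.seed(42)  # only to make the run repeatable

pop <- 1:100
pop_mean <- mean(pop)
sample_sizes <- c(10, 25, 50, 75)
trials <- 1000

# hypothetical helper: mean absolute sampling error for one sample size n
mean_abs_error <- function(n, trials) {
  # repeat the draw 'trials' times and keep |x_bar - mu| of every sample
  errors <- replicate(trials, abs(mean(sample(pop, n)) - pop_mean))
  mean(errors)
}

# apply the helper to every sample size and label the results
result <- sapply(sample_sizes, mean_abs_error, trials = trials)
names(result) <- paste0("n = ", sample_sizes)
print(round(result, 2))
```

Because the sample sizes are kept in a single vector, further values of $$n$$ can be added without duplicating any code.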

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de. You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.