Let us repeat the sampling process from the previous section 5 times, store the results in the variable my.experiment, and in addition print out the mean, \(\bar x\), of each particular sample.

# population of interest
population <- c(1,2,3,4,5,6,7,8,9,10)

# draw 5 samples of size 3 and store each sample mean
my.experiment <- NULL
for (i in 1:5){
  my.sample <- sample(population, size = 3)
  my.experiment <- c(my.experiment, mean(my.sample))
  cat(sprintf('Sample number %s has a mean of %s.\n', i, round(mean(my.sample),2)))
}
## Sample number 1 has a mean of 7.33.
## Sample number 2 has a mean of 5.67.
## Sample number 3 has a mean of 5.
## Sample number 4 has a mean of 6.67.
## Sample number 5 has a mean of 5.

Obviously, different samples (of the same size) selected from the same population yield different sample statistics because they contain different elements. Moreover, any sample statistic obtained from a sample, such as the sample mean \(\bar x\), will in general differ from the corresponding population parameter, here the population mean, \(\mu\). The difference between the value of a sample statistic and the value of the corresponding population parameter is called the sampling error. For the mean, the sampling error can be written as

\[\text{sampling error} = \bar x - \mu\]
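
To make this definition concrete, consider once more the toy population from above, whose population mean is \(\mu = 5.5\). The small sketch below computes the sampling error for a single sample of size 3; the particular sample c(2, 4, 9) is chosen arbitrarily for illustration.

# population mean of the toy population 1, 2, ..., 10
mu <- mean(population)
mu
## [1] 5.5

# an arbitrary sample of size 3, used purely for illustration
x <- c(2, 4, 9)
x.bar <- mean(x)
x.bar
## [1] 5

# sampling error: sample mean minus population mean
x.bar - mu
## [1] -0.5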

Due to the nature of random sampling, and thus due to the process of drawing a set of values from the population, the resulting sampling error occurs by chance; in other words, the sampling error is a random variable. However, one should note that besides the described randomness there are other sources of error. These errors are often related to the data generation process and are subsumed under the term non-sampling error. Such errors are introduced, for example, by human handling of the data or by calibration errors of the measuring devices, among others.
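
To see this randomness directly, we may repeat the drawing process a few times and compute the sampling error for each draw. The sketch below reuses the toy population from above; since sample() draws at random, the resulting errors differ from run to run (and from the numbers on your machine).

# sampling error of the sample mean for 5 independent samples of size 3;
# every execution yields a different set of errors
replicate(5, mean(sample(population, size = 3)) - mean(population))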

In order to gain some intuition on the nature of the sampling error we conduct an experiment. For this experiment the population of interest consists of the first 100 integers \(\{1,2,3,...,100\}\). We want to analyse the effect of the sample size, \(n\), on the sampling error. For the sake of simplicity we choose the sample mean as the statistic of interest. For a sufficiently large number of trials (trials = 1000) we calculate the absolute sampling error, \(\lvert\bar x - \mu\rvert\), for samples of size \(n = 10, 25, 50, 75\) and finally average it over all trials.

# population: the first 100 integers
pop <- 1:100
pop.mean <- mean(pop)

# vectors to store the absolute sampling error for each sample size
vector.error.sample_10 <- NULL
vector.error.sample_25 <- NULL
vector.error.sample_50 <- NULL
vector.error.sample_75 <- NULL

# number of repetitions
trials <- 1000
for (trial in 1:trials){
  my.sample_10 <- sample(pop, 10)
  my.sample_25 <- sample(pop, 25)
  my.sample_50 <- sample(pop, 50)
  my.sample_75 <- sample(pop, 75)
  
  # absolute deviation of each sample mean from the population mean
  error.sample_10 <- abs(mean(my.sample_10) - pop.mean)
  error.sample_25 <- abs(mean(my.sample_25) - pop.mean)
  error.sample_50 <- abs(mean(my.sample_50) - pop.mean)
  error.sample_75 <- abs(mean(my.sample_75) - pop.mean)
  
  vector.error.sample_10 <- c(vector.error.sample_10, error.sample_10)
  vector.error.sample_25 <- c(vector.error.sample_25, error.sample_25)
  vector.error.sample_50 <- c(vector.error.sample_50, error.sample_50)
  vector.error.sample_75 <- c(vector.error.sample_75, error.sample_75)
}

print(paste('Sampling Error, n = 10: ', mean(vector.error.sample_10)))
## [1] "Sampling Error, n = 10:  6.7518"
print(paste('Sampling Error, n = 25: ', mean(vector.error.sample_25)))
## [1] "Sampling Error, n = 25:  3.75268"
print(paste('Sampling Error, n = 50: ', mean(vector.error.sample_50)))
## [1] "Sampling Error, n = 50:  2.32848"
print(paste('Sampling Error, n = 75: ', mean(vector.error.sample_75)))
## [1] "Sampling Error, n = 75:  1.30461333333333"

Based on the experiment from above, we may conclude that the larger the sample size, the smaller the sampling error. In other words, as the sample size increases, the sample mean, \(\bar x\), approximates the population mean, \(\mu\), more and more closely. This is an important insight, which will be discussed in more detail in the section on Inferential Statistics.