
Sampling from a Population which is not Normally Distributed

In the previous section we discussed the shape of the sampling distribution when sampling from a normally distributed population. However, in real-life applications we often do not know the actual shape of the population distribution.

To understand how the shape of the population distribution affects the shape of the sampling distribution, we conduct an experiment. Let us consider three different continuous probability distributions: the uniform distribution, the beta distribution and the gamma distribution. We do not go into details here, but the figure below shows that none of these three probability density functions (PDFs) has the shape of a normal distribution:
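As a minimal sketch, the three population densities can be drawn with base R. The parameters are the ones used in the sampling code of this section; the evaluation grid and the plot layout are assumptions chosen only for illustration.

```r
# evaluation grid (an assumption; all three densities live on or near [0, 1])
x <- seq(0, 1, length.out = 500)

# population densities with the parameters used in the sampling code
d_unif <- dunif(x, min = 0.2, max = 0.8)
d_beta <- dbeta(x, shape1 = 2, shape2 = 5)
d_gamma <- dgamma(x, shape = 1, rate = 8)

# one panel per population, restoring the graphics settings afterwards
old_par <- par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))
plot(x, d_unif, type = "l", lwd = 2, col = 3,
  xlab = "x", ylab = "density", main = "Uniform(0.2, 0.8)"
)
plot(x, d_beta, type = "l", lwd = 2, col = 4,
  xlab = "x", ylab = "density", main = "Beta(2, 5)"
)
plot(x, d_gamma, type = "l", lwd = 2, col = 6,
  xlab = "x", ylab = "density", main = "Gamma(shape = 1, rate = 8)"
)
par(old_par)
```

The uniform density is flat, the beta density is right-skewed, and the gamma density with shape 1 is an exponential curve that decays from its maximum at zero.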

Now we conduct the same experiment as in the previous section: we draw a large number of samples (1,000) from each distribution. This time, however, we repeat the experiment for four sample sizes, \(n = 2, 5, 15, 30\). For each sample we calculate the sample mean \(\bar x\), and after 1,000 trials we visualize the empirical distribution of the sample means for each sample size.

trials <- 1000
n <- c(2, 5, 15, 30) # sample size

# empty matrix to store results of computations
out_unif <- matrix(nrow = trials, ncol = length(n))
out_beta <- matrix(nrow = trials, ncol = length(n))
out_gamma <- matrix(nrow = trials, ncol = length(n))

# random sampling
for (i in seq(trials)) {
  for (j in seq(length(n))) {
    out_unif[i, j] <- mean(runif(n[j], min = 0.2, max = 0.8))
    out_beta[i, j] <- mean(rbeta(n[j], shape1 = 2, shape2 = 5))
    out_gamma[i, j] <- mean(rgamma(n[j], shape = 1, rate = 8))
  }
}

# plotting
# set plotting variables
par(mfrow = c(4, 3), mar = c(2, 1, 3, 1))
plotting_variables <- list(out_unif, out_beta, out_gamma)

# generate 12 histogram plots and store them in a list
h_list <- vector("list", length(n) * length(plotting_variables))
h_list_pos <- 1
for (i in seq(1, length(n))) {
  for (j in seq(1, 3)) {
    out <- plotting_variables[[j]]
    h_list[[h_list_pos]] <- hist(out[, i],
      breaks = "Scott",
      plot = FALSE
    )
    h_list_pos <- h_list_pos + 1
  }
}

# loop through the list of histograms and plot them
xlim_list <- rep(list(c(0.2, 0.8), c(0, 0.6), c(0, 0.4)), 4)
n_vector <- rep(n, each = 3) # each sample size is used for three panels
# no trailing spaces here: paste() already inserts a space between its arguments
main_vector <- rep(c("Uniform", "Beta", "Gamma"), 4)
color <- rep(c(3, 4, 6), 4)

for (i in seq(1, length(h_list))) {
  plot(h_list[[i]],
    freq = FALSE,
    xlim = xlim_list[[i]],
    ylab = "",
    yaxt = "n",
    col = color[i],
    main = paste(main_vector[i], "distribution for sample size n =", n_vector[i]),
    cex.main = 0.9
  )
}

The figure shows that, for a population that is not normally distributed, the sampling distribution of \(\bar x\) is not normal when the sample size is small. As \(n\) grows, however, the sampling distribution approaches a normal distribution, and at \(n = 30\) it is approximately normal for all three populations. Also notice that the spread of the sampling distribution decreases as the sample size increases.

According to the central limit theorem, for a large sample size \((n \ge 30)\) the sampling distribution of \(\bar x\) is approximately normal, irrespective of the shape of the population distribution (Mann 2012).
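One way to see this convergence numerically is to track the skewness of the simulated sample means, which is zero for a normal distribution. The sketch below uses the gamma population from this section; the helper `skewness()`, the seed and the number of trials are assumptions made for illustration only.

```r
set.seed(1) # arbitrary seed, only for reproducibility
trials <- 10000
n <- c(2, 5, 15, 30)

# hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# skewness of the simulated sample means for each sample size
skew_hat <- sapply(n, function(k) {
  xbar <- replicate(trials, mean(rgamma(k, shape = 1, rate = 8)))
  skewness(xbar)
})

# the mean of k exponential draws has theoretical skewness 2 / sqrt(k)
round(rbind(n = n, simulated = skew_hat, theoretical = 2 / sqrt(n)), 3)
```

The skewness shrinks toward zero as the sample size grows, which is consistent with the sampling distribution becoming more symmetric and closer to normal.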

The mean and standard deviation of the sampling distribution of \(\bar x\) are, respectively:

\[\mu_{\bar x} = \mu \qquad \text{and} \qquad \sigma_{\bar x}=\frac{\sigma}{\sqrt{n}} \]

The sample size is usually considered to be large if \(n \ge 30\).
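The two formulas for \(\mu_{\bar x}\) and \(\sigma_{\bar x}\) can be checked by simulation. The following sketch uses the beta population from this section, whose mean and variance follow from the standard formulas \(\mu = a/(a+b)\) and \(\sigma^2 = ab/((a+b)^2(a+b+1))\); the seed and the number of trials are arbitrary choices.

```r
set.seed(42) # arbitrary seed, only for reproducibility
trials <- 10000
n <- c(2, 5, 15, 30)

# population: Beta(2, 5)
a <- 2
b <- 5
mu <- a / (a + b)
sigma <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))

# simulated vs. theoretical mean and standard deviation of the sample mean
res <- t(sapply(n, function(k) {
  xbar <- replicate(trials, mean(rbeta(k, shape1 = a, shape2 = b)))
  c(
    n = k,
    mean_sim = mean(xbar), mean_theory = mu,
    sd_sim = sd(xbar), sd_theory = sigma / sqrt(k)
  )
}))
round(res, 4)
```

Note that both formulas hold for every sample size, not only for large \(n\); only the approximate normality of the sampling distribution depends on \(n\) being large.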

Because the sampling distribution is approximately normal, areas under the normal curve yield approximate probabilities for sample statistics such as \(\bar x\).
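As an illustration, consider the gamma population from this section (shape 1, rate 8, so \(\mu = \sigma = 1/8\)) and a sample of size \(n = 30\). The interval from 0.10 to 0.15 is a hypothetical choice. For this particular population the exact distribution of \(\bar x\) is known, because the mean of \(n\) independent draws follows a gamma distribution with shape \(n\) and rate \(8n\), so the normal approximation can be compared with the exact value.

```r
# population: Gamma(shape = 1, rate = 8), so mu = 1/8 and sigma = 1/8
mu <- 1 / 8
sigma <- 1 / 8
n <- 30
se <- sigma / sqrt(n) # standard deviation of the sample mean

# P(0.10 <= xbar <= 0.15) as an area under the approximating normal curve
p_normal <- pnorm(0.15, mean = mu, sd = se) - pnorm(0.10, mean = mu, sd = se)

# exact probability: xbar follows Gamma(shape = n, rate = 8 * n)
p_exact <- pgamma(0.15, shape = n, rate = 8 * n) -
  pgamma(0.10, shape = n, rate = 8 * n)

c(normal_approx = p_normal, exact = p_exact)
```

In practice the exact distribution is rarely available, which is why the normal approximation is useful.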

Recall the empirical rule, also known as the 68-95-99.7 rule. Applied to the sampling distribution of \(\bar x\), whose standard deviation is \(\sigma_{\bar x} = \sigma/\sqrt{n}\), the rule implies that:

about 68.27% of the sample means will be within one standard deviation of the population mean,
about 95.45% of the sample means will be within two standard deviations of the population mean and
about 99.73% of the sample means will be within three standard deviations of the population mean.
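The normal-curve areas behind the rule can be computed with `pnorm()` and compared with simulated sample means. The sketch below uses the uniform population from this section with \(n = 30\); the seed and the number of trials are arbitrary choices.

```r
# areas under the standard normal curve within 1, 2 and 3 standard deviations
k <- 1:3
p_theory <- pnorm(k) - pnorm(-k)

set.seed(123) # arbitrary seed, only for reproducibility
trials <- 10000
n <- 30

# population: Uniform(0.2, 0.8)
mu <- (0.2 + 0.8) / 2
sigma <- (0.8 - 0.2) / sqrt(12)
se <- sigma / sqrt(n) # standard deviation of the sample mean

# share of simulated sample means within k standard deviations of mu
xbar <- replicate(trials, mean(runif(n, min = 0.2, max = 0.8)))
p_sim <- sapply(k, function(m) mean(abs(xbar - mu) <= m * se))

round(rbind(k = k, theory = p_theory, simulated = p_sim), 4)
```

The simulated shares should lie close to the theoretical areas, with small deviations due to sampling variability.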

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.
