In order to better understand the estimation of a population mean and the construction of a confidence interval we will go through the procedure based on a data set. Therefore, we load the students data set. You may download the students.csv file here or import it directly into R using the read.csv() function.

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

In this section we focus on the height of female students. The height variable of female students is roughly normally distributed - which is supported by the symmetric bell shape of the population histogram:

females <- subset(students, gender == "Female")
heights <- females$height

hist(heights, breaks = "FD")

Let us assume that the height measurements given in the data set are a very good approximation for the population of interest, which is the height of female students in cm. Based on the data given, we first compute the population mean \(\mu\) and the population standard deviation \(\sigma\).

heights_mean <- mean(heights)
heights_mean
## [1] 163.6533
heights_sigma <- sd(heights)
heights_sigma
## [1] 7.919726

We then construct a probability distribution by applying the dnorm() function defined by the previously calculated parameters, \(\mu\) and \(\sigma\), and plot it on top of the histogram.

x <- seq(min(heights), max(heights), by = 0.01)
height_pdf <- dnorm(x, mean = heights_mean, sd = heights_sigma)

plot(x, height_pdf,type='n')
hist(heights, breaks = "Scott", freq = F, add = T)
lines(x, height_pdf, type = "l", col = "red", ylab = "Density", xlab = "Height in cm")

A nice fit!

Now, we take a random sample with a sample size of \(n=10\) from the probability distribution by applying the rnorm() function and then calculate the sample mean \(\bar x\).

sample_size <- 10
my_sample <- rnorm(n = sample_size, mean = heights_mean, sd = heights_sigma)
x_bar <- mean(my_sample)
x_bar
## [1] 159.7953

Our random sample yields a sample mean (\(\bar x\)) of approximately 159.8. This is our point estimate for the population parameter of interest, which in this case is the mean height of female students (\(\mu\)).

How accurate is our point estimate? We can check, whether our estimate is equal to the true population parameter.

x_bar == heights_mean
## [1] FALSE

OK, we expected that.

Let us calculate some interval estimates by constructing the 90 %, 95 % and 99 % confidence intervals. Recall the equation for a confidence interval:

\[CI: \text{Point estimate} \pm z^*_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\]

The critical value \(z^*_{\alpha/2}\) evaluates to 1.64, 1.96 and 2.58 for confidence levels of 90 %, 95 % and 99 %, respectively.

Applied to our height data the equation yields the following:

\[CI_{90\%}: 159.8 \pm 1.64 \times \frac{7.92}{\sqrt{10}} = 159.8 \pm 4.12\]

  1. We can state that we are 90 % confident that the mean height of female students (the population parameter \(\mu\)) is between 155.68 and 163.91 cm.

\[CI_{95\%}: 159.8 \pm 1.96 \times \frac{7.92}{\sqrt{10}} = 159.8 \pm 4.91\]

  1. We can state that we are 95 % confident that the mean height of female students is between 154.89 and 164.7 cm.

\[CI_{99\%}: 159.8 \pm 2.58 \times \frac{7.92}{\sqrt{10}} = 159.8 \pm 6.45\]

  1. We can state that we are 99 % confident that the mean height of female students is between 153.34 and 166.25 cm.

It is obvious, that if we want to have a higher confidence that the unknown population parameter is contained in the interval, the margin of error becomes larger.


Just for a sanity check, let us investigate if we actually captured the true population value within our interval estimations. It is important to remember that the confidence interval does not assign a probability to our sample mean, but it states that under repeated random sampling the confidence interval is expected to include the population mean \(100(1-\alpha)\%\) of the time. In order to test that claim we write a simple R-function by ourselves.

CI.eval <- function(pop_mean, sigma, n, estimate, alpha) {
  # The function returns a vector of Booleans (TRUE or FALSE).
  # The function returns TRUE if the confidence interval contains the true population parameter and FALSE if not.

  out <- rep(NA, length(alpha))
  for (i in seq(1, length(alpha))) {
    out[i] <- pop_mean >= estimate - qnorm(alpha[i] / 2, lower.tail = F) * sigma / sqrt(n) &&
      pop_mean <= estimate + qnorm(alpha[i] / 2, lower.tail = F) * sigma / sqrt(n)
  }
  return(out)
}

Let us apply our self-built function CI.eval() to our data. We set pop_mean = heights_mean, sigma = heights_sigma, n = sample_size, estimate = x_bar, alpha = c(0.1, 0.05, 0.01) in order to evaluate if the three confidence intervals constructed above (90 %, 95 % and 99 %) contain the population mean. Finally, we transform the resulting vector into a data.frame object for enhanced readability.

eval <- CI.eval(
  pop_mean = heights_mean,
  sigma = heights_sigma,
  n = sample_size,
  estimate = x_bar,
  alpha = c(0.1, 0.05, 0.01)
)
data.frame(eval, row.names = c("90 %", "95 %", "99 %"))
##      eval
## 90 % TRUE
## 95 % TRUE
## 99 % TRUE

Well, interesting result. The true population mean (heights_mean = 163.65) is captured by all three confidence intervals. Obliviously, the mean of our random sample (x_bar = 159.8) was a quite good estimator.


So far, we just focused on one particular random sample. However, what should we expect if we repeat the sampling process over and over again?

A confidence interval of any particular confidence level states that under repeated random samples the confidence interval is expected to include the true population parameter \(100(1-\alpha)\%\) of the time. Thus, a 99 % confidence interval implies that, on average, in 99 out of 100 cases the margin of error is large enough to encompass the true population value. This as well implies that we are wrong one out of 100 times on average. Further, for a 95 % or a 90 % confidence interval we are wrong 5 or 10 out of 100 times, respectively. We will use the R machinery to test this claim by conducting a simulation experiment.

We sample from the height distribution from above. We take 10, 50, 100, 1000 and 10000 random samples. Each sample has a sample size of \(n=10\). For each sample we calculate the sample mean \(\bar x\), which is our sample statistic of interest. By applying our self-built function CI.eval() we test if the true population mean is captured by the 90 %, 95 % and 99 % confidence interval. We store the resulting Boolean vector and calculate the proportion of times each particular confidence interval actually contains the true population mean. Finally, we store the proportion as percentages in a data.frame object and rename its rows and columns for better readability.

df <- data.frame()

trials <- c(10, 50, 100, 1000, 10000)
n <- 10

for (trial in trials) {
  m <- matrix(NA, nrow = trial, ncol = 3)
  for (i in 1:trial) {
    my_sample <- rnorm(n = n, mean = heights_mean, sd = heights_sigma)
    x_bar <- mean(my_sample)
    eval <- CI.eval(
      pop_mean = heights_mean,
      sigma = heights_sigma,
      n = n,
      estimate = x_bar,
      alpha = c(0.1, 0.05, 0.01)
    )
    m[i, ] <- eval
  }
  df <- rbind(df, colSums(m) / trial * 100)
}
row.names(df) <- trials
colnames(df) <- c("90 %", "95 %", "99 %")
df
##        90 %  95 %   99 %
## 10    90.00 90.00 100.00
## 50    82.00 90.00  94.00
## 100   93.00 96.00  99.00
## 1000  90.40 96.00  99.00
## 10000 89.38 94.79  98.96

Actually, we see that by increasing the number of samples the percentages converge and reflect the probability as given by the respective level of confidence.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.