We practice the The One-Mean t-Interval Procedure to construct confidence intervals with the students data set. You may download the students.csv file here or import it directly into R using the read.csv() function.

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

In this section we focus again on the height of female students. Let us assume that the height measurements given in the data set are a very good approximation to the population of interest. This means we can sample from the probability density function based on the mean and the standard deviation of the data set.

females <- subset(students, gender == "Female")
heights <- females$height heights_mean <- mean(heights) heights_mean ##  163.6533 heights_sigma <- sd(heights) heights_sigma ##  7.919726 We take a random sample with a sample size of $$n=6$$ from the probability distribution, defined by the mean and standard deviation of the height variable, and calculate the sample mean $$\bar x$$ and the sample standard deviation $$s$$. sample_size <- 6 df <- sample_size - 1 my_sample <- rnorm(n = sample_size, mean = heights_mean, sd = heights_sigma) x_bar <- mean(my_sample) x_bar ##  163.4235 s <- sd(my_sample) s ##  7.468143 Our random sample yields a sample mean $$\bar x$$ of approximately 163.42 and a sample standard deviation $$s$$ of approximately 7.47. These are our point estimates for the population of interest, which in this case is the height of female students in our data set. How accurate is our point estimate? We can use R to check whether our estimates are equal to the true population parameters. x_bar == heights_mean ##  FALSE s == heights_sigma ##  FALSE s < heights_sigma ##  TRUE OK, we expected that! Let us now calculate interval estimates by constructing the 90 %, 95 % and 99 % confidence intervals. Recall the equation for a confidence interval and how to calculate the degree of freedom: $CI: \text{Point estimate} \pm t^*_{df,\,\alpha/2} \times \frac{s}{\sqrt{n}}$ $df = n-1$ The critical value $$t^*_{5,\,\alpha/2}$$ evaluates to 2.02, 2.57 and 4.03 for confidence levels of 90 %, 95 % and 99 %, respectively. Applied to our height data the equation yields the following: $CI_{90\%}: 163.42 \pm 2.02 \times \frac{7.47}{\sqrt{6}} = 163.42 \pm 6.14$ 1. We can state that we are 90 % confident that the mean height of female students (the population parameter $$\mu$$) is between 157.28 and 169.57 cm. $CI_{95\%}: 163.42 \pm 2.57 \times \frac{7.47}{\sqrt{6}} = 163.42 \pm 7.84$ 1. We can state that we are 95 % confident that the mean height of female students is between 155.59 and 171.26 cm. $CI_{99\%}: 163.42 \pm 4.03 \times \frac{7.47}{\sqrt{6}} = 163.42 \pm 12.29$ 1. We can state that we are 99 % confident that the mean height of female students is between 151.13 and 175.72 cm. It is obvious, that if we want to have a higher confidence that the unknown population parameter is contained in the interval, the margin of error becomes larger. Exercise: Calculate the confidence interval for the male students height (from the students dataset) using a confidence level of 90 %, 95 % and 99 %! Beware, that we do not know sigma, therefore we have to use the sample standard deviation and the t-distribution. ### your code here Show code # create a subset for male students males <- subset(students, gender == "Male") heights <- males$height
heights_mean <- mean(heights)
heights_sigma <- sd(heights)

# draw a sample of n = 6
sample <- rnorm(n = 6, mean = heights_mean, sd = heights_sigma)
sample_mean <- mean(sample)
sample_sd <- sd(sample)

# calculate the CIs
n <- 6
conf_level <- c(0.90, 0.95, 0.99)

for (conf in conf_level) {
alpha <- 1 - conf
t_lower <- qt(p = alpha / 2, df = n - 1, lower.tail = T)
t_upper <- qt(p = alpha / 2, df = n - 1, lower.tail = F)

CI_lower <- sample_mean + (t_lower * (sample_sd / sqrt(n)))
CI_upper <- sample_mean + (t_upper * (sample_sd / sqrt(n)))

print(paste("The ", conf * 100, "% Conficence Intervall for the male students-height is between ",
round(CI_lower, 2), "cm and ", round(CI_upper, 2), "cm.",
sep = ""
))
}
##  "The 90 % confidence interval for the male students-height is between 179.74cm and 187.45cm."
##  "The 95 % confidence interval for the male students-height is between 178.68cm and 188.52cm."
##  "The 99 % confidence interval for the male students-height is between 175.88cm and 191.31cm."

Exercise: Compare these results with the CI-estimation for female students calculated above! What does this mean for infering a (common/different) population?

There is no overlap of the confidence intervals of female and male mean heights in our samples. Hence, we may assume that male and female students come from different populations defined by body height!

For a sanity check we investigate, if we actually captured the true population value by our interval estimation. It is important to remember that the confidence interval does not assign a probability to our sample mean. In fact it states that under repeated random sampling, the confidence interval is expected to include the population mean $$100(1-\alpha)\%$$ of the time. In order to test that claim we slightly modify the R-function we wrote in the previous section.

CI.eval.t <- function(pop_mean, s, n, estimate, alpha) {
# The function returns a vector of Booleans (TRUE or FALSE).
# The function returns TRUE if the confidence interval contains the true population parameter and FALSE if not.
out <- rep(NA, length(alpha))
for (i in seq(1, length(alpha))) {
out[i] <- pop_mean >= estimate - qt(alpha[i] / 2, df = n - 1, lower.tail = F) * s / sqrt(n) &&
pop_mean <= estimate + qt(alpha[i] / 2, df = n - 1, lower.tail = F) * s / sqrt(n)
}
return(out)
}

Let us now apply our self-built function CI.eval.t() to our data. We set pop_mean = heights_mean, s = s, n = sample_size, estimate = x_bar, alpha = c(0.1,0.05,0.01) in order to evaluate if the three confidence intervals constructed above (90 %, 95 % and 99 %) contain the population mean. Finally, we transform the resulting vector into a data.frame object for enhanced readability.

eval <- CI.eval.t(
pop_mean = heights_mean,
s = s,
n = sample_size,
estimate = x_bar,
alpha = c(0.1, 0.05, 0.01)
)
data.frame(eval, row.names = c("90 %", "95 %", "99 %"))
##       eval
## 90 % FALSE
## 95 % FALSE
## 99 % FALSE

Well, interesting result. The true population mean (heights_mean = 179.07) is captured by all three confidence intervals. Obliviously, the mean of our random sample (x_bar = 163.42) was a quite good estimator.

So far, we focused on one sample. However, what should we expect if we repeat the sampling process over and over again?

A confidence interval of any particular confidence level states that under repeated random samples the confidence interval is expected to include the true population parameter $$100(1-\alpha)\%$$ of the time. Thus, a 99 % confidence interval implies that, on average, in 99 out of 100 cases the margin of error is large enough to encompass the true population value. This as well implies that we are wrong one out of 100 times on average. Further, for a 95 % or a 90 % confidence interval we are wrong 5 or 10 out of 100 times, respectively. We will use the R machinery to test this claim by conducting a simulation experiment.

We sample from the height distribution from above. We take 10, 50, 100, 1000 and 10000 random samples. Each sample has a sample size of $$n=6$$. For each sample we calculate the sample mean, which is our sample statistic of interest. By applying our self-built function CI.eval.t() we test if the true population mean is captured by the 90 %, 95 % and 99 % confidence interval. We store the resulting Boolean vector and calculate the proportion of times each particular confidence interval actually contains the true population mean. Finally, we store the proportion as percentages in a data.frame object and rename its rows and columns for better readability.

df <- data.frame()

trials <- c(10, 50, 100, 1000, 10000)
n <- 6

for (trial in trials) {
m <- matrix(NA, nrow = trial, ncol = 3)
for (i in 1:trial) {
my_sample <- rnorm(n = n, mean = heights_mean, sd = heights_sigma)
x_bar <- mean(my_sample)
s <- sd(my_sample)
eval <- CI.eval.t(
pop_mean = heights_mean,
s = s,
n = n,
estimate = x_bar,
alpha = c(0.1, 0.05, 0.01)
)
m[i, ] <- eval
}
df <- rbind(df, colSums(m) / trial * 100)
}
row.names(df) <- trials
colnames(df) <- c("90 %", "95 %", "99 %")
df
##         90 %   95 %   99 %
## 10    100.00 100.00 100.00
## 50     88.00  96.00 100.00
## 100    91.00  94.00  99.00
## 1000   86.50  93.30  98.90
## 10000  90.39  95.41  99.07

Actually, we see, as before, that by increasing the number of samples the percentages converge and reflect the probability as given by the respective level of confidence.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de. You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.