We practice the The One-Mean t-Interval Procedure to construct confidence intervals with a data set. Therefore we load the students data set. You may download the students.csv file here. First, we import the data set, and assign a proper name to it.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

In this section we focus again on the height of female students. Let us assume that the height measurements given in the data set are a very good approximation to population of interest, so that we sample from the probability density function based on mean and the standard deviation of the data set.

females <- subset(students, gender=='Female')
heights <- females\$height
heights.mean <- mean(heights)
heights.mean
## [1] 163.6533
heights.sigma <- sd(heights)
heights.sigma
## [1] 7.919726

We take a random sample with a sample size of $$n=6$$ from the probability distribution, defined by the mean and standard deviation of the height variable, and calculate the sample mean $$\bar x$$ and the sample standard deviation $$s$$.

sample.size <- 6
df <- sample.size - 1

my.sample <- rnorm(n = sample.size, mean = heights.mean, sd = heights.sigma)
x.bar <- mean(my.sample)
x.bar
## [1] 167.3798
s <- sd(my.sample)
s
## [1] 10.82438

Our random sample yields a sample mean $$\bar x$$ of approximately 167.38 and a sample standard deviation $$s$$ of approximately 10.82. These are our point estimates for the population of interest, which in this case is the height of female students in our data set.

How accurate is our point estimate? We ask R if our estimates are equal to the true population parameter?

x.bar == heights.mean
## [1] FALSE
s == heights.sigma
## [1] FALSE
s < heights.sigma
## [1] FALSE

OK, we expected that!

Let us calculate interval estimates by constructing the 90%, 95% and 99% confidence intervals. Recall the equation for a confidence interval and how to calculate the degree of freedom.

$CI: \text{Point estimate} \pm t^*_{df,\,\alpha/2} \times \frac{s}{\sqrt{n}}$

$df = n-1$

The critical value $$t^*_{5,\,\alpha/2}$$ evaluates to 2.02, 2.57, and 4.03 for confidence levels of 90%, 95% and 99%, respectively.

Applied to our height data the equation yields

$CI_{90\%}: 167.38 \pm 2.02 \times \frac{10.82}{\sqrt{6}} = 167.38 \pm 8.9$

Thus, we can state that we are 90% confident that the mean height of female students (the population parameter $$\mu$$) is between 158.48 and 176.28 cm.

$CI_{95\%}: 167.38 \pm 2.57 \times \frac{10.82}{\sqrt{6}} = 167.38 \pm 11.36$

Thus, we can state that we are 95% confident that the mean height of female students is between 156.02 and 178.74 cm.

$CI_{99\%}: 167.38 \pm 4.03 \times \frac{10.82}{\sqrt{6}} = 167.38 \pm 17.82$

Thus, we can state that we are 99% confident that the mean height of female students is between 149.56 and 185.2 cm.

It is obvious that if we want to have a higher confidence that the unknown population parameter is contained in the interval, the margin of error is larger.

For a sanity check we investigate if we actually captured the true population value by our interval estimation. It is important to remember that the confidence interval does not assign a probability to our sample mean, but it states that under repeated random sampling, the confidence interval is expected to include the population mean $$100(1-\alpha)\%$$ of the time. In order to test that claim we slightly modify the R-function written by ourselves in the previous section.

CI.eval.t <- function(pop.mean, s, n, estimate, alpha){
# The function returns a vector of booleans (TRUE of FALSE). The function returns TRUE if the confidence interval contains the true population parameter and FALSE if not.
out = rep(NA,length(alpha))
for (i in seq(1,length(alpha))){
out[i] =  pop.mean >= estimate-qt(alpha[i]/2, df=n-1,lower.tail=F)*s/sqrt(n) &&
pop.mean <= estimate+qt(alpha[i]/2, df=n-1, lower.tail=F)*s/sqrt(n)
}
return(out)
}

Let us now apply our self-built function CI.eval.t() to our data. We set pop.mean = heights.mean, s = s, n = sample.size, estimate = x.bar, alpha = c(0.1,0.05,0.01) in order to evaluate if the three confidence intervals constructed above (90%, 95% and 99%) contain the population mean. Finally, we transform the resulting vector into a data.frame object for enhanced readability.

eval <- CI.eval.t(pop.mean = heights.mean,
s = s,
n = sample.size,
estimate = x.bar,
alpha = c(0.1,0.05,0.01))
data.frame(eval, row.names = c('90%','95%','99%'))
##     eval
## 90% TRUE
## 95% TRUE
## 99% TRUE

Well, interesting result. The true population mean (heights.mean = 163.65) is captured by all three confidence intervals. Obliviously the mean of our random sample (x.bar = 167.38) was a quite good estimator.

So far we focused on one sample. However, what should we expect if we repeat the sampling process over and over again?

A confidence interval of any particular confidence level states that under repeated random samples the confidence interval is expected to include the true population parameter $$100(1-\alpha)\%$$ of the time. Thus, a 99% confidence interval implies that on average in 99 of 100 cases the margin of error is large enough to encompass the true value, however this as well implies that on average we are wrong for one out of 100 times. Further, for a 95% or a 90% confidence interval we are wrong for 5, respectively 10 out of 100 times. We use the R machinery to test this claim by conducting a simulation experiment.

We sample from the height distribution from above. We take 10, 50, 100, 1000, and 10000 random samples. Each sample has a sample size of $$n=6$$. For each sample we calculate the sample mean, which is our sample statistic of interest. By applying our self-built function CI.eval.t() we test if the true population mean is captured by the 90%, 95% and 99% confidence interval. We store the resulting Boolean vector and calculate the proportion of times each particular confidence interval actually contains the true population mean. Finally, we store the proportion as percentages in a data.frame object and rename its rows and columns for better readability.

df = data.frame()

trials = c(10, 50, 100, 1000, 10000)
n = 6

for (trial in trials){
m <- matrix(NA, nrow = trial, ncol = 3)
for (i in 1:trial){
my.sample <- rnorm(n = n, mean = heights.mean, sd = heights.sigma)
x.bar <- mean(my.sample)
s <- sd(my.sample)
eval <- CI.eval.t(pop.mean = heights.mean,
s = s,
n = n,
estimate = x.bar,
alpha = c(0.1,0.05,0.01))
m[i,] <- eval
}
df <- rbind(df,colSums(m)/trial * 100)
}
row.names(df) <- trials
colnames(df) <- c('90%','95%','99%')
df
##         90%    95%    99%
## 10    90.00 100.00 100.00
## 50    90.00  96.00 100.00
## 100   91.00  95.00  99.00
## 1000  89.50  94.70  99.30
## 10000 90.07  94.88  99.01

Actually, we see that by increasing the number of samples the numbers converge and reflect the probability as given by the level of confidence.