We practice the The One-Mean t-Interval Procedure
to construct confidence intervals with the students data set.
You may download the students.csv
file here
or import it directly into R using the read.csv()
function.
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
In this section we focus again on the height of female students. Let us assume that the height measurements given in the data set are a very good approximation to the population of interest. This means we can sample from the probability density function based on the mean and the standard deviation of the data set.
females <- subset(students, gender == "Female")
heights_f <- females$height
heights_mean_f <- mean(heights_f)
heights_mean_f
## [1] 163.6533
heights_sigma_f <- sd(heights_f)
heights_sigma_f
## [1] 7.919726
We take a random sample with a sample size of \(n=6\) from the probability distribution,
defined by the mean and standard deviation of the height
variable, and calculate the sample mean \(\bar
x\) and the sample standard deviation \(s\).
set.seed(335)
sample_size <- 6
df <- sample_size - 1
my_sample_f <- rnorm(n = sample_size, mean = heights_mean_f, sd = heights_sigma_f)
x_bar_f <- mean(my_sample_f)
x_bar_f
## [1] 158.7371
s_f <- sd(my_sample_f)
s_f
## [1] 7.11177
Our random sample yields a sample mean \(\bar x\) of approximately 158.74 and a sample standard deviation \(s\) of approximately 7.11. These are our point estimates for the population of interest, which in this case is the height of female students in our data set.
How accurate is our point estimate? We can use R to check whether our estimates are equal to the true population parameters.
x_bar_f == heights_mean_f
## [1] FALSE
s_f == heights_sigma_f
## [1] FALSE
s_f < heights_sigma_f
## [1] TRUE
OK, we expected that!
Let us now calculate interval estimates by constructing the 90 %, 95 % and 99 % confidence intervals. Recall the equation for a confidence interval and how to calculate the degree of freedom:
\[CI: \text{Point estimate} \pm t^*_{df,\,\alpha/2} \times \frac{s}{\sqrt{n}}\]
\[ df = n-1\]
The critical value \(t^*_{5,\,\alpha/2}\) evaluates to 2.02, 2.57 and 4.03 for confidence levels of 90 %, 95 % and 99 %, respectively.
Applied to our height data the equation yields the following:
\[CI_{90\%}: 158.74 \pm 2.02 \times \frac{7.11}{\sqrt{6}} = 158.74 \pm 5.85\]
\[CI_{95\%}: 158.74 \pm 2.57 \times \frac{7.11}{\sqrt{6}} = 158.74 \pm 7.46\]
\[CI_{99\%}: 158.74 \pm 4.03 \times \frac{7.11}{\sqrt{6}} = 158.74 \pm 11.71\]
It is obvious, that if we want to have a higher confidence that the unknown population parameter is contained in the interval, the margin of error becomes larger.
Exercise: Calculate the confidence interval for the male students height (from the
students
dataset) using a confidence level of 90 %, 95 % and 99 %! Beware, that we do not know sigma, therefore we have to use the sample standard deviation and the t-distribution.
### your code here
set.seed(335)
# create a subset for male students
males <- subset(students, gender == "Male")
heights_m <- males$height
heights_mean_m <- mean(heights_m)
heights_sigma_m <- sd(heights_m)
# draw a sample of n = 6
sample_m <- rnorm(n = 6, mean = heights_mean_m, sd = heights_sigma_m)
sample_mean_m <- mean(sample_m)
sample_sd_m <- sd(sample_m)
# calculate the CIs
n <- 6
conf_level <- c(0.90, 0.95, 0.99)
for (conf in conf_level) {
alpha <- 1 - conf
t_lower <- qt(p = alpha / 2, df = n - 1, lower.tail = T)
t_upper <- qt(p = alpha / 2, df = n - 1, lower.tail = F)
CI_lower <- sample_mean_m + (t_lower * (sample_sd_m / sqrt(n)))
CI_upper <- sample_mean_m + (t_upper * (sample_sd_m / sqrt(n)))
print(paste("The ", conf * 100, "% Conficence Intervall for the male students-height is between ",
round(CI_lower, 2), "cm and ", round(CI_upper, 2), "cm.",
sep = ""
))
}
## [1] "The 90% Conficence Intervall for the male students-height is between 168.21cm and 180.02cm."
## [1] "The 95% Conficence Intervall for the male students-height is between 166.59cm and 181.64cm."
## [1] "The 99% Conficence Intervall for the male students-height is between 162.3cm and 185.92cm."
Exercise: Compare these results with the CI-estimation for female students calculated above! What does this mean for infering a (common/different) population?
There is no overlap of the confidence intervals of female and male mean heights in our samples. Hence, we may assume that male and female students come from different populations defined by body height!
For a sanity check we investigate, if we actually captured the true population value by our interval estimation. It is important to remember that the confidence interval does not assign a probability to our sample mean. In fact it states that under repeated random sampling, the confidence interval is expected to include the population mean \(100(1-\alpha)\%\) of the time. In order to test that claim we slightly modify the R-function we wrote in the previous section.
CI.eval.t <- function(pop_mean, s, n, estimate, alpha) {
# The function returns a vector of Booleans (TRUE or FALSE).
# The function returns TRUE if the confidence interval contains the true population parameter and FALSE if not.
out <- rep(NA, length(alpha))
for (i in seq(1, length(alpha))) {
out[i] <- pop_mean >= estimate - qt(alpha[i] / 2, df = n - 1, lower.tail = F) * s / sqrt(n) &&
pop_mean <= estimate + qt(alpha[i] / 2, df = n - 1, lower.tail = F) * s / sqrt(n)
}
return(out)
}
Let us now apply our self-built function CI.eval.t()
to
our data. We set pop_mean = heights_mean_f
,
s = s_f
, n = sample_size
,
estimate = x_bar_f
, alpha = c(0.1,0.05,0.01)
in order to evaluate if the three confidence intervals constructed above
(90 %, 95 % and 99 %) contain the population mean. Finally, we transform
the resulting vector into a data.frame
object for enhanced
readability.
eval <- CI.eval.t(
pop_mean = heights_mean_f,
s = s_f,
n = sample_size,
estimate = x_bar_f,
alpha = c(0.1, 0.05, 0.01)
)
data.frame(eval, row.names = c("90 %", "95 %", "99 %"))
## eval
## 90 % TRUE
## 95 % TRUE
## 99 % TRUE
Well, interesting result. The true population mean
(heights_mean
= 163.65) is captured by all three confidence
intervals. Obliviously, the mean of our random sample
(x_bar
= 158.74) was a quite good estimator.
So far, we focused on one sample. However, what should we expect if we repeat the sampling process over and over again?
A confidence interval of any particular confidence level states that under repeated random samples the confidence interval is expected to include the true population parameter \(100(1-\alpha)\%\) of the time. Thus, a 99 % confidence interval implies that, on average, in 99 out of 100 cases the margin of error is large enough to encompass the true population value. This as well implies that we are wrong one out of 100 times on average. Further, for a 95 % or a 90 % confidence interval we are wrong 5 or 10 out of 100 times, respectively. We will use the R machinery to test this claim by conducting a simulation experiment.
We sample from the height
distribution from above. We
take 10, 50, 100, 1000 and 10000 random samples. Each sample has a
sample size of \(n=6\). For each sample
we calculate the sample mean, which is our sample statistic of interest.
By applying our self-built function CI.eval.t()
we test if
the true population mean is captured by the 90 %, 95 % and 99 %
confidence interval. We store the resulting Boolean vector and calculate
the proportion of times each particular confidence interval actually
contains the true population mean. Finally, we store the proportion as
percentages in a data.frame
object and rename its rows and
columns for better readability.
set.seed(335)
df <- data.frame()
trials <- c(10, 50, 100, 1000, 10000)
n <- 6
for (trial in trials) {
m <- matrix(NA, nrow = trial, ncol = 3)
for (i in 1:trial) {
my_sample <- rnorm(n = n, mean = heights_mean_f, sd = heights_sigma_f)
x_bar <- mean(my_sample)
s <- sd(my_sample)
eval <- CI.eval.t(
pop_mean = heights_mean_f,
s = s,
n = n,
estimate = x_bar,
alpha = c(0.1, 0.05, 0.01)
)
m[i, ] <- eval
}
df <- rbind(df, colSums(m) / trial * 100)
}
row.names(df) <- trials
colnames(df) <- c("90 %", "95 %", "99 %")
df
## 90 % 95 % 99 %
## 10 70.00 80.00 100.00
## 50 88.00 94.00 98.00
## 100 86.00 93.00 100.00
## 1000 90.50 95.00 98.90
## 10000 89.99 95.07 99.08
Actually, we see, as before, that by increasing the number of samples the percentages converge and reflect the probability as given by the respective level of confidence.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.