In order to better understand the estimation of a population mean and
the construction of a confidence interval we will go through the
procedure based on a data set. Therefore, we load the students
data set. You may download the students.csv
file here
or import it directly into R using the read.csv()
function.
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
In this section we focus on the height of female students. The
height
variable of female students is roughly normally
distributed - which is supported by the symmetric bell shape of the
population histogram:
females <- subset(students, gender == "Female")
heights <- females$height
hist(heights, breaks = "FD")
Let us assume that the height measurements given in the data set are a very good approximation for the population of interest, which is the height of female students in cm. Based on the data given, we first compute the population mean \(\mu\) and the population standard deviation \(\sigma\).
heights_mean <- mean(heights)
heights_mean
## [1] 163.6533
heights_sigma <- sd(heights)
heights_sigma
## [1] 7.919726
We then construct a probability distribution by applying the
dnorm()
function defined by the previously calculated
parameters, \(\mu\) and \(\sigma\), and plot it on top of the
histogram.
x <- seq(min(heights), max(heights), by = 0.01)
height_pdf <- dnorm(x, mean = heights_mean, sd = heights_sigma)
plot(x, height_pdf,type='n')
hist(heights, breaks = "Scott", freq = F, add = T)
lines(x, height_pdf, type = "l", col = "red", ylab = "Density", xlab = "Height in cm")
A nice fit!
Now, we take a random sample with a sample size of \(n=10\) from the probability distribution by
applying the rnorm()
function and then calculate the sample
mean \(\bar x\).
sample_size <- 10
my_sample <- rnorm(n = sample_size, mean = heights_mean, sd = heights_sigma)
x_bar <- mean(my_sample)
x_bar
## [1] 159.7953
Our random sample yields a sample mean (\(\bar x\)) of approximately 159.8. This is our point estimate for the population parameter of interest, which in this case is the mean height of female students (\(\mu\)).
How accurate is our point estimate? We can check, whether our estimate is equal to the true population parameter.
x_bar == heights_mean
## [1] FALSE
OK, we expected that.
Let us calculate some interval estimates by constructing the 90 %, 95 % and 99 % confidence intervals. Recall the equation for a confidence interval:
\[CI: \text{Point estimate} \pm z^*_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}\]
The critical value \(z^*_{\alpha/2}\) evaluates to 1.64, 1.96 and 2.58 for confidence levels of 90 %, 95 % and 99 %, respectively.
Applied to our height data the equation yields the following:
\[CI_{90\%}: 159.8 \pm 1.64 \times \frac{7.92}{\sqrt{10}} = 159.8 \pm 4.12\]
\[CI_{95\%}: 159.8 \pm 1.96 \times \frac{7.92}{\sqrt{10}} = 159.8 \pm 4.91\]
\[CI_{99\%}: 159.8 \pm 2.58 \times \frac{7.92}{\sqrt{10}} = 159.8 \pm 6.45\]
It is obvious, that if we want to have a higher confidence that the unknown population parameter is contained in the interval, the margin of error becomes larger.
Just for a sanity check, let us investigate if we actually captured the true population value within our interval estimations. It is important to remember that the confidence interval does not assign a probability to our sample mean, but it states that under repeated random sampling the confidence interval is expected to include the population mean \(100(1-\alpha)\%\) of the time. In order to test that claim we write a simple R-function by ourselves.
CI.eval <- function(pop_mean, sigma, n, estimate, alpha) {
# The function returns a vector of Booleans (TRUE or FALSE).
# The function returns TRUE if the confidence interval contains the true population parameter and FALSE if not.
out <- rep(NA, length(alpha))
for (i in seq(1, length(alpha))) {
out[i] <- pop_mean >= estimate - qnorm(alpha[i] / 2, lower.tail = F) * sigma / sqrt(n) &&
pop_mean <= estimate + qnorm(alpha[i] / 2, lower.tail = F) * sigma / sqrt(n)
}
return(out)
}
Let us apply our self-built function CI.eval()
to our
data. We set pop_mean = heights_mean
,
sigma = heights_sigma
, n = sample_size
,
estimate = x_bar
, alpha = c(0.1, 0.05, 0.01)
in order to evaluate if the three confidence intervals constructed above
(90 %, 95 % and 99 %) contain the population mean. Finally, we transform
the resulting vector into a data.frame
object for enhanced
readability.
eval <- CI.eval(
pop_mean = heights_mean,
sigma = heights_sigma,
n = sample_size,
estimate = x_bar,
alpha = c(0.1, 0.05, 0.01)
)
data.frame(eval, row.names = c("90 %", "95 %", "99 %"))
## eval
## 90 % TRUE
## 95 % TRUE
## 99 % TRUE
Well, interesting result. The true population mean
(heights_mean
= 163.65) is captured by all three confidence
intervals. Obliviously, the mean of our random sample
(x_bar
= 159.8) was a quite good estimator.
So far, we just focused on one particular random sample. However, what should we expect if we repeat the sampling process over and over again?
A confidence interval of any particular confidence level states that under repeated random samples the confidence interval is expected to include the true population parameter \(100(1-\alpha)\%\) of the time. Thus, a 99 % confidence interval implies that, on average, in 99 out of 100 cases the margin of error is large enough to encompass the true population value. This as well implies that we are wrong one out of 100 times on average. Further, for a 95 % or a 90 % confidence interval we are wrong 5 or 10 out of 100 times, respectively. We will use the R machinery to test this claim by conducting a simulation experiment.
We sample from the height
distribution from above. We
take 10, 50, 100, 1000 and 10000 random samples. Each sample has a
sample size of \(n=10\). For each
sample we calculate the sample mean \(\bar
x\), which is our sample statistic of interest. By applying our
self-built function CI.eval()
we test if the true
population mean is captured by the 90 %, 95 % and 99 % confidence
interval. We store the resulting Boolean vector and calculate the
proportion of times each particular confidence interval actually
contains the true population mean. Finally, we store the proportion as
percentages in a data.frame
object and rename its rows and
columns for better readability.
df <- data.frame()
trials <- c(10, 50, 100, 1000, 10000)
n <- 10
for (trial in trials) {
m <- matrix(NA, nrow = trial, ncol = 3)
for (i in 1:trial) {
my_sample <- rnorm(n = n, mean = heights_mean, sd = heights_sigma)
x_bar <- mean(my_sample)
eval <- CI.eval(
pop_mean = heights_mean,
sigma = heights_sigma,
n = n,
estimate = x_bar,
alpha = c(0.1, 0.05, 0.01)
)
m[i, ] <- eval
}
df <- rbind(df, colSums(m) / trial * 100)
}
row.names(df) <- trials
colnames(df) <- c("90 %", "95 %", "99 %")
df
## 90 % 95 % 99 %
## 10 90.00 90.00 100.00
## 50 82.00 90.00 94.00
## 100 93.00 96.00 99.00
## 1000 90.40 96.00 99.00
## 10000 89.38 94.79 98.96
Actually, we see that by increasing the number of samples the percentages converge and reflect the probability as given by the respective level of confidence.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.