We practice the *The One-Mean t-Interval Procedure*
to construct confidence intervals with the

`students.csv`

file here
or import it directly into R using the `read.csv()`

function.`students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")`

The *students* data set consists of 8239 rows, each of them
representing a particular student, and 16 columns, each of them
corresponding to a variable/feature related to that particular student.
These self-explaining variables are: *stud.id, name, gender, age,
height, weight, religion, nc.score, semester, major, minor, score1,
score2, online.tutorial, graduated, salary*.

In this section we focus again on the height of female students. Let us assume that the height measurements given in the data set are a very good approximation to the population of interest. This means we can sample from the probability density function based on the mean and the standard deviation of the data set.

```
females <- subset(students, gender == "Female")
heights <- females$height
heights_mean <- mean(heights)
heights_mean
```

`## [1] 163.6533`

```
heights_sigma <- sd(heights)
heights_sigma
```

`## [1] 7.919726`

We take a random sample with a sample size of \(n=6\) from the probability distribution,
defined by the mean and standard deviation of the `height`

variable, and calculate the sample mean \(\bar
x\) and the sample standard deviation \(s\).

```
sample_size <- 6
df <- sample_size - 1
my_sample <- rnorm(n = sample_size, mean = heights_mean, sd = heights_sigma)
x_bar <- mean(my_sample)
x_bar
```

`## [1] 163.4235`

```
s <- sd(my_sample)
s
```

`## [1] 7.468143`

Our random sample yields a sample mean \(\bar x\) of approximately 163.42 and a sample standard deviation \(s\) of approximately 7.47. These are our point estimates for the population of interest, which in this case is the height of female students in our data set.

How accurate is our point estimate? We can use R to check whether our estimates are equal to the true population parameters.

`x_bar == heights_mean`

`## [1] FALSE`

`s == heights_sigma`

`## [1] FALSE`

`s < heights_sigma`

`## [1] TRUE`

OK, we expected that!

Let us now calculate interval estimates by constructing the 90 %, 95 % and 99 % confidence intervals. Recall the equation for a confidence interval and how to calculate the degree of freedom:

\[CI: \text{Point estimate} \pm t^*_{df,\,\alpha/2} \times \frac{s}{\sqrt{n}}\]

\[ df = n-1\]

The critical value \(t^*_{5,\,\alpha/2}\) evaluates to 2.02, 2.57 and 4.03 for confidence levels of 90 %, 95 % and 99 %, respectively.

Applied to our height data the equation yields the following:

\[CI_{90\%}: 163.42 \pm 2.02 \times \frac{7.47}{\sqrt{6}} = 163.42 \pm 6.14\]

- We can state that we are 90 % confident that the mean height of female students (the population parameter \(\mu\)) is between 157.28 and 169.57 cm.

\[CI_{95\%}: 163.42 \pm 2.57 \times \frac{7.47}{\sqrt{6}} = 163.42 \pm 7.84\]

- We can state that we are 95 % confident that the mean height of female students is between 155.59 and 171.26 cm.

\[CI_{99\%}: 163.42 \pm 4.03 \times \frac{7.47}{\sqrt{6}} = 163.42 \pm 12.29\]

- We can state that we are 99 % confident that the mean height of female students is between 151.13 and 175.72 cm.

It is obvious, that **if we want to have a higher confidence
that the unknown population parameter is contained in the interval, the
margin of error becomes larger**.

Exercise: Calculate the confidence interval for the male students height (from the`students`

dataset) using a confidence level of 90 %, 95 % and 99 %! Beware, that we do not know sigma, therefore we have to use the sample standard deviation and the t-distribution.

`### your code here`

```
# create a subset for male students
males <- subset(students, gender == "Male")
heights <- males$height
heights_mean <- mean(heights)
heights_sigma <- sd(heights)
# draw a sample of n = 6
sample <- rnorm(n = 6, mean = heights_mean, sd = heights_sigma)
sample_mean <- mean(sample)
sample_sd <- sd(sample)
# calculate the CIs
n <- 6
conf_level <- c(0.90, 0.95, 0.99)
for (conf in conf_level) {
alpha <- 1 - conf
t_lower <- qt(p = alpha / 2, df = n - 1, lower.tail = T)
t_upper <- qt(p = alpha / 2, df = n - 1, lower.tail = F)
CI_lower <- sample_mean + (t_lower * (sample_sd / sqrt(n)))
CI_upper <- sample_mean + (t_upper * (sample_sd / sqrt(n)))
print(paste("The ", conf * 100, "% Conficence Intervall for the male students-height is between ",
round(CI_lower, 2), "cm and ", round(CI_upper, 2), "cm.",
sep = ""
))
}
```

```
## [1] "The 90 % confidence interval for the male students-height is between 179.74cm and 187.45cm."
## [1] "The 95 % confidence interval for the male students-height is between 178.68cm and 188.52cm."
## [1] "The 99 % confidence interval for the male students-height is between 175.88cm and 191.31cm."
```

Exercise: Compare these results with the CI-estimation for female students calculated above! What does this mean for infering a (common/different) population?

There is no overlap of the confidence intervals of female and male mean heights in our samples. Hence, we may assume that male and female students come from different populations defined by body height!

For a sanity check we investigate, if we actually captured the true population value by our interval estimation. It is important to remember that the confidence interval does not assign a probability to our sample mean. In fact it states that under repeated random sampling, the confidence interval is expected to include the population mean \(100(1-\alpha)\%\) of the time. In order to test that claim we slightly modify the R-function we wrote in the previous section.

```
CI.eval.t <- function(pop_mean, s, n, estimate, alpha) {
# The function returns a vector of Booleans (TRUE or FALSE).
# The function returns TRUE if the confidence interval contains the true population parameter and FALSE if not.
out <- rep(NA, length(alpha))
for (i in seq(1, length(alpha))) {
out[i] <- pop_mean >= estimate - qt(alpha[i] / 2, df = n - 1, lower.tail = F) * s / sqrt(n) &&
pop_mean <= estimate + qt(alpha[i] / 2, df = n - 1, lower.tail = F) * s / sqrt(n)
}
return(out)
}
```

Let us now apply our self-built function `CI.eval.t()`

to
our data. We set `pop_mean = heights_mean`

,
`s = s`

, `n = sample_size`

,
`estimate = x_bar`

, `alpha = c(0.1,0.05,0.01)`

in
order to evaluate if the three confidence intervals constructed above
(90 %, 95 % and 99 %) contain the population mean. Finally, we transform
the resulting vector into a `data.frame`

object for enhanced
readability.

```
eval <- CI.eval.t(
pop_mean = heights_mean,
s = s,
n = sample_size,
estimate = x_bar,
alpha = c(0.1, 0.05, 0.01)
)
data.frame(eval, row.names = c("90 %", "95 %", "99 %"))
```

```
## eval
## 90 % FALSE
## 95 % FALSE
## 99 % FALSE
```

Well, interesting result. The true population mean
(`heights_mean`

= 179.07) is captured by all three confidence
intervals. Obliviously, the mean of our random sample
(`x_bar`

= 163.42) was a quite good estimator.

So far, we focused on one sample. However, what should we expect if we repeat the sampling process over and over again?

A confidence interval of any particular confidence level states that under repeated random samples the confidence interval is expected to include the true population parameter \(100(1-\alpha)\%\) of the time. Thus, a 99 % confidence interval implies that, on average, in 99 out of 100 cases the margin of error is large enough to encompass the true population value. This as well implies that we are wrong one out of 100 times on average. Further, for a 95 % or a 90 % confidence interval we are wrong 5 or 10 out of 100 times, respectively. We will use the R machinery to test this claim by conducting a simulation experiment.

We sample from the `height`

distribution from above. We
take 10, 50, 100, 1000 and 10000 random samples. Each sample has a
sample size of \(n=6\). For each sample
we calculate the sample mean, which is our sample statistic of interest.
By applying our self-built function `CI.eval.t()`

we test if
the true population mean is captured by the 90 %, 95 % and 99 %
confidence interval. We store the resulting Boolean vector and calculate
the proportion of times each particular confidence interval actually
contains the true population mean. Finally, we store the proportion as
percentages in a `data.frame`

object and rename its rows and
columns for better readability.

```
df <- data.frame()
trials <- c(10, 50, 100, 1000, 10000)
n <- 6
for (trial in trials) {
m <- matrix(NA, nrow = trial, ncol = 3)
for (i in 1:trial) {
my_sample <- rnorm(n = n, mean = heights_mean, sd = heights_sigma)
x_bar <- mean(my_sample)
s <- sd(my_sample)
eval <- CI.eval.t(
pop_mean = heights_mean,
s = s,
n = n,
estimate = x_bar,
alpha = c(0.1, 0.05, 0.01)
)
m[i, ] <- eval
}
df <- rbind(df, colSums(m) / trial * 100)
}
row.names(df) <- trials
colnames(df) <- c("90 %", "95 %", "99 %")
df
```

```
## 90 % 95 % 99 %
## 10 100.00 100.00 100.00
## 50 88.00 96.00 100.00
## 100 91.00 94.00 99.00
## 1000 86.50 93.30 98.90
## 10000 90.39 95.41 99.07
```

Actually, we see, as before, that by increasing the number of samples the percentages converge and reflect the probability as given by the respective level of confidence.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*