So far we relied on \(\sigma\), the population standard deviation, to infer the population mean. The population parameter \(\sigma\) is used to calculate the standard error (\(SE=\frac{\sigma}{\sqrt{n}}\)), which is one constituent of the margin of error. However, what if we do not know the population standard deviation, as is usually the case? One may use the sample standard deviation, denoted by \(s\), as an estimate for the population standard deviation.

\[\text{if } s \approx \sigma \text{ then } SE = \frac{s}{\sqrt{n}}\]

We should note that, in contrast to \(\sigma\), the sample standard deviation \(s\) varies from sample to sample and may fall below or above \(\sigma\). As the sample size \(n\) increases, \(s\) becomes a better estimate of \(\sigma\). However, as long as we do not know \(\sigma\), we have to estimate two quantities when carrying out the inference procedure: the mean \(\mu\) and the standard deviation \(\sigma\). Using \(s\) as an estimate for \(\sigma\) therefore adds uncertainty to the estimation of the mean \(\mu\). To account for that extra uncertainty we use the so-called t-distribution, or Student’s t-distribution, to calculate the margin of error \((ME)\).
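The following sketch illustrates both points with a small simulation. The population parameters (mean 10, standard deviation 2), the sample size of 13 and the seed are assumptions chosen only for illustration. The first part shows how \(s\) scatters around \(\sigma\); the second part compares the critical values of the t-distribution with that of the standard normal distribution.

```r
set.seed(42)     # arbitrary seed, used only to make the run reproducible
sigma <- 2       # assumed "true" population standard deviation
n <- 13          # assumed sample size, so df = n - 1 = 12

# Draw 1000 samples of size n and record the sample standard deviation of each
s_values <- replicate(1000, sd(rnorm(n, mean = 10, sd = sigma)))
range(s_values)          # s falls both below and above sigma
mean(s_values < sigma)   # share of samples in which s underestimates sigma

# Critical values for a 95 % interval: t-distribution versus standard normal
dfs <- c(2, 5, 12, 30, 100, 1000)
t_crit <- qt(0.975, df = dfs)
z_crit <- qnorm(0.975)
round(rbind(df = dfs, t = t_crit, z = z_crit), 3)
```

The t critical values are larger than the corresponding \(z\)-score for every \(df\), which widens the interval, and they approach the \(z\)-score as \(df\) grows.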


The procedure to obtain a confidence interval for a population mean when the population standard deviation \(\sigma\) is unknown is essentially the same as when it is known, except that the t-distribution and the sample standard deviation \(s\) are used instead of the standard normal distribution (\(z\)-scores) and the population standard deviation \(\sigma\).

Recall the construction of a confidence interval:

\[CI: \text{point estimate} \pm ME\]

The margin of error \((ME)\) consists of the critical value and a measure of the variability of the sampling distribution. The critical value is \(t^*_{df,\,\alpha/2}\) for the given confidence level and \(df = n-1\) degrees of freedom. Its value is obtained from a t-distribution table or calculated in R with the qt() function. The measure of the variability of the sampling distribution is the standard error \(SE\). As the population standard deviation \(\sigma\) is not known, it is replaced by the sample standard deviation \(s\), which gives \(SE=\frac{s}{\sqrt{n}}\).
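As a minimal sketch of how the critical value can be obtained with qt(), assume a confidence level of 95 % and a sample size of 13 (both values are chosen here only for illustration). The two calls for the upper critical value are equivalent; they differ only in which tail the probability refers to.

```r
alpha <- 0.05   # assumed significance level (95 % confidence)
n <- 13         # assumed sample size, so df = n - 1 = 12

# Quantile with an area of 1 - alpha/2 to its left
t_upper <- qt(1 - alpha / 2, df = n - 1)

# Same quantile, specified by the area alpha/2 to its right
t_upper_alt <- qt(alpha / 2, df = n - 1, lower.tail = FALSE)

# The t-distribution is symmetric around 0, so the lower value is the mirror image
t_lower <- qt(alpha / 2, df = n - 1)

round(c(t_lower, t_upper), 2)   # -2.18 and 2.18
```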

Consequently, the \(100(1-\alpha)\%\) confidence interval for \(\mu\) is

\[CI: \bar x \pm t^*_{df,\, \alpha/2} \frac{s}{\sqrt{n}}\]
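To see the formula at work, the sketch below computes a 95 % confidence interval step by step and compares it with the interval returned by R's built-in t.test() function. The vector x holds made-up values and is purely hypothetical; any numeric sample could be used in its place.

```r
# Hypothetical sample of n = 13 measurements (made-up values for illustration)
x <- c(4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0, 4.7, 5.2, 4.0, 4.8, 5.1, 4.5)

n <- length(x)
x_bar <- mean(x)   # point estimate of mu
s <- sd(x)         # sample standard deviation, estimate of sigma
alpha <- 0.05      # 95 % confidence level

t_star <- qt(1 - alpha / 2, df = n - 1)   # critical value for df = n - 1
me <- t_star * s / sqrt(n)                # margin of error

ci_manual <- c(lower = x_bar - me, upper = x_bar + me)
ci_manual

# Cross-check against the one-sample t-test interval
ci_builtin <- t.test(x, conf.level = 1 - alpha)$conf.int
ci_builtin
```

Both approaches should yield the same lower and upper limits, since t.test() uses the same critical value and standard error for a one-sample interval.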

Let us construct some confidence intervals for practice! For the purpose of this exercise \(df\) is set to \(12\), which corresponds to a sample size of \(n = 13\).

Confidence level of 90 % (\(\alpha = 0.1\))

lower_90 <- qt(0.05, df = 12, lower.tail = TRUE)
upper_90 <- qt(0.05, df = 12, lower.tail = FALSE)

paste("The lower and upper limits of the interval that covers an area of 90 % around the mean are given by t-values (df=12) of", round(lower_90, 2), "and", round(upper_90, 2), ", respectively.")
## [1] "The lower and upper limits of the interval that covers an area of 90 % around the mean are given by t-values (df=12) of -1.78 and 1.78 , respectively."

For a confidence level of 90 % (\(\alpha = 0.1\)) the equation from above evaluates to

\[CI_{90\%}: \text{Point estimate} \pm 1.78 \times \frac{s}{\sqrt{n}}\]

Confidence level of 95 % (\(\alpha = 0.05\))

lower_95 <- qt(0.025, df = 12, lower.tail = TRUE)
upper_95 <- qt(0.025, df = 12, lower.tail = FALSE)

paste("The lower and upper limits of the interval that covers an area of 95 % around the mean are given by t-values (df=12) of", round(lower_95, 2), "and", round(upper_95, 2), ", respectively.")
## [1] "The lower and upper limits of the interval that covers an area of 95 % around the mean are given by t-values (df=12) of -2.18 and 2.18 , respectively."

For a confidence level of 95 % (\(\alpha = 0.05\)) the equation from above evaluates to

\[CI_{95\%}: \text{Point estimate} \pm 2.18 \times \frac{s}{\sqrt{n}}\]

Confidence level of 99 % (\(\alpha = 0.01\))

lower_99 <- qt(0.005, df = 12, lower.tail = TRUE)
upper_99 <- qt(0.005, df = 12, lower.tail = FALSE)

paste("The lower and upper limits of the interval that covers an area of 99 % around the mean are given by t-values (df=12) of", round(lower_99, 2), "and", round(upper_99, 2), ", respectively.")
## [1] "The lower and upper limits of the interval that covers an area of 99 % around the mean are given by t-values (df=12) of -3.05 and 3.05 , respectively."

For a confidence level of 99 % (\(\alpha = 0.01\)) the equation from above evaluates to

\[CI_{99\%}: \text{Point estimate} \pm 3.05 \times \frac{s}{\sqrt{n}}\]
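The three cases can also be handled in a single vectorized call, as sketched below. The values for s and n are hypothetical and serve only to show how the margin of error grows with the confidence level.

```r
s <- 0.5   # hypothetical sample standard deviation
n <- 13    # hypothetical sample size, so df = n - 1 = 12

conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels

t_star <- qt(1 - alpha / 2, df = n - 1)   # one critical value per confidence level
me <- t_star * s / sqrt(n)                # corresponding margins of error

data.frame(conf_level = conf_levels,
           t_star = round(t_star, 2),
           ME = round(me, 3))
```

A higher confidence level requires a larger critical value and therefore yields a wider interval for the same sample.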


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.