So far, we have relied on $\sigma$, the population standard deviation, to infer the population mean. The population parameter $\sigma$ is used to calculate the standard error ($SE = \frac {\sigma} {\sqrt {n}}$), which is one constituent of the margin of error. However, what if we do not know the population standard deviation, as is usually the case? In that case one may use the sample standard deviation, denoted by $s$, as an estimator for the population standard deviation.
$$\text {if }s \approx \sigma \text { then }SE = \frac {s} {\sqrt {n}}$$
We should note that, in contrast to $\sigma$, the sample standard deviation $s$ varies from sample to sample and will generally differ from $\sigma$; it may be smaller or larger than $\sigma$ in any given sample. By increasing the sample size $n$, $s$ becomes a better estimate for $\sigma$. However, as long as we do not know $\sigma$, we have to estimate two quantities when carrying out the inference procedure: the mean $\mu$ and the standard deviation $\sigma$. That is why using $s$ as an estimator for $\sigma$ adds uncertainty to the estimation of the mean $\mu$. To account for that additional uncertainty, we apply the so-called t-distribution, or Student's t-distribution, to calculate the margin of error ($ME$).
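The claim that $s$ varies from sample to sample, and settles closer to $\sigma$ as $n$ grows, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal, and the values of `sigma`, the mean of 10, the seed and the helper name `spread_of_s` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary fixed seed for reproducibility
sigma = 2.0  # hypothetical "true" population standard deviation


def spread_of_s(n, replicates=2000):
    """Draw `replicates` samples of size n; return mean and spread of s."""
    samples = rng.normal(loc=10.0, scale=sigma, size=(replicates, n))
    # ddof=1 divides by n - 1, which gives the sample standard deviation s
    s = samples.std(axis=1, ddof=1)
    return s.mean(), s.std()


results = {n: spread_of_s(n) for n in (5, 50, 500)}
for n, (mean_s, sd_s) in results.items():
    print(f"n = {n:3d}: average s = {mean_s:.3f}, spread of s = {sd_s:.3f}")
```

The spread of $s$ across repeated samples should shrink as $n$ increases, which is the sense in which $s$ becomes a better estimate for $\sigma$.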
The procedure to obtain a confidence interval for a population mean when the population standard deviation $\sigma$ is unknown is essentially the same as when it is known. The only difference is that the t-distribution and the sample standard deviation $s$ are used instead of the standard normal distribution ($z$-scores) and the population standard deviation $\sigma$.
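To see what swapping the standard normal distribution for the t-distribution does in practice, one can compare the two critical values side by side. The sketch below uses `norm.ppf` and `t.ppf` from `scipy.stats`; the 95 % confidence level and the chosen degrees of freedom are arbitrary values picked for illustration.

```python
from scipy.stats import norm, t

confidence = 0.95  # illustrative confidence level
alpha = 1 - confidence

# Critical value from the standard normal distribution (sigma known)
z_crit = norm.ppf(1 - alpha / 2)

# Critical values from the t-distribution for several degrees of freedom
t_crits = {df: t.ppf(1 - alpha / 2, df=df) for df in (5, 12, 30, 100)}

print(f"z critical value: {z_crit:.3f}")
for df, t_crit in t_crits.items():
    print(f"t critical value (df = {df:3d}): {t_crit:.3f}")
```

The t critical values are larger than the $z$ critical value and approach it as the degrees of freedom grow, so intervals based on $s$ are somewhat wider, most noticeably for small samples.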
Recall the construction of a confidence interval:
$$CI \text { : point estimate} \pm ME$$
The margin of error ($ME$) contains the critical value and a measure for the variability of the sampling distribution. The critical value is $t^{*}_{df, \alpha / 2}$ for the given confidence level and degrees of freedom. Its value is obtained from a t-distribution table for $n-1$ degrees of freedom, or it is calculated directly in Python using the t.ppf() function from the scipy.stats module. The measure for the variability of the sampling distribution is the standard error $SE$. As the population standard deviation $\sigma$ is not known, it is replaced by the sample standard deviation $s$, thus resulting in $SE = \frac {s} {\sqrt {n}}$.
Consequently, the $100(1-\alpha)\%$ confidence interval for $\mu$ is given by:
$$CI\ :\ \bar {x} \pm t^{*}_{df, \alpha / 2} \times \frac {s} {\sqrt {n}}$$
Let us construct some confidence intervals for practice! For this exercise, $df$ is set to $12$.
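As a minimal sketch of the whole formula in one place, the interval can be computed end to end for a small sample. The 13 values below are made up for illustration (so that $df = 12$), and `t_confidence_interval` is a hypothetical helper name, not part of scipy.

```python
import numpy as np
from scipy.stats import t

# Hypothetical sample of n = 13 measurements (made up for illustration)
sample = np.array([4.8, 5.1, 5.4, 4.9, 5.0, 5.6, 4.7,
                   5.2, 5.3, 4.9, 5.1, 5.0, 5.5])


def t_confidence_interval(x, confidence=0.95):
    """Return (lower, upper) limits of the t-based CI for the mean of x."""
    n = len(x)
    df = n - 1                      # degrees of freedom
    x_bar = np.mean(x)              # point estimate
    s = np.std(x, ddof=1)           # sample standard deviation
    se = s / np.sqrt(n)             # standard error
    alpha = 1 - confidence
    t_star = t.ppf(1 - alpha / 2, df=df)  # critical value
    me = t_star * se                # margin of error
    return x_bar - me, x_bar + me


for level in (0.90, 0.95, 0.99):
    lower, upper = t_confidence_interval(sample, level)
    print(f"{level:.0%} CI: [{lower:.3f}, {upper:.3f}]")
```

The interval is symmetric around the sample mean and widens as the confidence level increases, because only the critical value changes between the three calls.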
Confidence level of 90 % ($\alpha = 0.1$)
from scipy.stats import t
lower_90 = t.ppf(0.05, df = 12)
upper_90 = t.ppf(0.95, df = 12)
print("The lower and upper limits of the interval that covers an area of 90 % around the mean are given by t-values (df=12) of",
round(lower_90, 2), "and", round(upper_90, 2), ", respectively.")
The lower and upper limits of the interval that covers an area of 90 % around the mean are given by t-values (df=12) of -1.78 and 1.78 , respectively.
Hence, for a confidence level of 90 % ($\alpha = 0.1$) the equation from above evaluates to
$$CI_{90\%}\ :\ \text {Point estimate} \pm 1.78 \times \frac {s} {\sqrt {n}}$$
Confidence level of 95 % ($\alpha = 0.05$)
alpha = 0.05
# Critical values that cut off alpha / 2 in each tail of the t-distribution
lower_95 = t.ppf(alpha / 2, df = 12)
upper_95 = t.ppf(1 - (alpha / 2) , df = 12)
# Express the confidence level in percent (95 rather than 0.95)
print("The lower and upper limits of the interval that covers an area of", round(100 * (1 - alpha)),
"% around the mean are given by t-values (df=12) of",
round(lower_95, 2), "and", round(upper_95, 2), ", respectively.")
The lower and upper limits of the interval that covers an area of 95 % around the mean are given by t-values (df=12) of -2.18 and 2.18 , respectively.
Hence, for a confidence level of 95 % ($\alpha = 0.05$) the equation from above evaluates to
$$CI_{95\%}\ :\ \text {Point estimate} \pm 2.18 \times \frac {s} {\sqrt {n}}$$
Confidence level of 99 % ($\alpha = 0.01$)
alpha = 0.01
# Critical values that cut off alpha / 2 in each tail of the t-distribution
lower_99 = t.ppf(alpha / 2, df = 12)
upper_99 = t.ppf(1 - (alpha / 2) , df = 12)
# Express the confidence level in percent (99 rather than 0.99)
print("The lower and upper limits of the interval that covers an area of", round(100 * (1 - alpha)),
"% around the mean are given by t-values (df=12) of",
round(lower_99, 2), "and", round(upper_99, 2), ", respectively.")
The lower and upper limits of the interval that covers an area of 99 % around the mean are given by t-values (df=12) of -3.05 and 3.05 , respectively.
Hence, for a confidence level of 99 % ($\alpha = 0.01$) the equation from above evaluates to
$$CI_{99\%}\ :\ \text {Point estimate} \pm 3.05 \times \frac {s} {\sqrt {n}}$$
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.