To better understand the estimation of a population mean and the construction of a confidence interval, we work through the procedure on a data set: the students data set. You may download the students.csv file here to your local file system or import it directly as a web resource. In either case, you read the data set into Python as a pandas DataFrame object using the read_csv function:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
Note: Make sure that the pandas package is part of your mamba environment!
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student; the variables are self-explanatory.
In this section, we focus on the height of female students. The height variable of female students is roughly normally distributed, which is supported by the symmetric bell shape of the histogram below. To visualize the height distribution of female students, we first filter the students data set to create a subset, called females, that contains only the female students:
females = students.loc[students.gender == "Female"]
sns.histplot(females["height"],
bins = 29, color = "gainsboro").set(title='Histogram of female\'s height distribution')
[Figure: Histogram of female's height distribution]
Let us assume that the height measurements given in the data set are a very good approximation for the population of interest, which is the height of female students in cm. We first compute the population's mean $\mu$ and the population's standard deviation $\sigma$ based on the data given.
mean_heights = np.mean(females["height"])
print(mean_heights)
163.65328467153284
sigma_heights = np.std(females["height"])
print(sigma_heights)
7.918762263149209
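A small aside: by default, np.std divides by $n$ (ddof = 0), which yields the population standard deviation and is exactly what we want here, since we treat the data as the entire population. For a sample standard deviation you would pass ddof = 1; a minimal sketch:
# Population vs. sample standard deviation:
# ddof = 0 (default) divides by n, ddof = 1 divides by n - 1
print(np.std(females["height"], ddof = 0))   # population SD, as used above
print(np.std(females["height"], ddof = 1))   # sample SD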
We want to expand the plot from above by drawing the probability density function of the assumed Gaussian distribution, based on $\mu$ and $\sigma$, on top of the histogram.
ax = sns.histplot(females["height"],
kde=False, stat = 'density',
bins = 29, color = "gainsboro")
x = np.linspace(np.min(females["height"]),
np.max(females["height"]), 1000)
heights_pdf = norm.pdf(x,
loc = mean_heights, scale = sigma_heights)
ax.plot(x, heights_pdf, color = "red")
[Figure: Histogram of female's height distribution with the fitted normal density curve overlaid in red]
What a nice fit!
Now, we take a random sample with a sample size of $n=10$ from the assumed probability distribution using the random.normal()
function from the numpy
package. Afterwards, we calculate the sample mean $\bar {x}$.
np.random.seed(40)
sample = np.random.normal(loc = mean_heights,
scale = sigma_heights,
size = 10)
mean_sample = np.mean(sample)
print(mean_sample)
165.03599768738724
Our random sample yields a sample mean ($\bar {x}$) of approximately:
round(mean_sample, 2)
165.04
This is our point estimate for the population parameter of interest, which in this case is the mean height of female students ($\mu$).
How accurate is our point estimate? We can check whether our estimate is equal to the true population parameter:
mean_sample == mean_heights
False
OK, we expected that.
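Before constructing interval estimates, it is useful to quantify how far the sample mean typically deviates from the population mean. Assuming the population standard deviation $\sigma$ is known, this is the standard error $\sigma / \sqrt{n}$, the quantity that appears in the confidence interval equation below; a minimal sketch:
# Standard error of the sample mean for samples of size n = 10,
# assuming the population standard deviation is known
n = 10
standard_error = sigma_heights / np.sqrt(n)
print(round(standard_error, 2))
For our data this evaluates to roughly 2.5 cm.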
Let us calculate some interval estimates by constructing the 90 %, 95 % and 99 % confidence intervals. Recall the equation for a confidence interval:
$$CI: \text{point estimate} \pm z^{*}_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

The critical value $z^{*}_{\alpha/2}$ evaluates to 1.645, 1.96 and 2.576 for confidence levels of 90 %, 95 % and 99 %, respectively.
Applied to our height data, the equation yields the following:

$$CI_{90\%}: 165.04 \pm 1.645 \times \frac{7.92}{\sqrt{10}} = 165.04 \pm 4.12$$

Note: Obviously, if we want higher confidence that the unknown population parameter is contained in the interval, the margin of error becomes larger.
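We can let Python do this arithmetic for all three confidence levels at once. The following sketch uses norm.ppf from scipy.stats (imported above) to obtain the critical values and builds each interval around our point estimate:
# Critical values and confidence intervals around the sample mean,
# using the known population standard deviation and n = 10
n = 10
for conf in [0.90, 0.95, 0.99]:
    alpha = 1 - conf
    z_crit = norm.ppf(1 - alpha / 2)              # critical value z*_{alpha/2}
    moe = z_crit * sigma_heights / np.sqrt(n)     # margin of error
    print(f"{conf:.0%} CI: {mean_sample - moe:.2f} to {mean_sample + moe:.2f}")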
As a sanity check, let us investigate whether the true population value is captured within our interval estimates. It is important to remember that a confidence interval does not assign a probability to our particular sample mean; rather, it states that under repeated random sampling, the confidence interval is expected to include the population mean $100(1-\alpha)\%$ of the time. To test our intervals, we write a user-defined function (UDF) ourselves.
def CI_evaluation(pop_mean, sigma, n, estimate, alpha):
    # For each significance level in alpha, check whether the corresponding
    # confidence interval around the estimate captures the true population mean.
    check = []
    for sig in alpha:
        moe = norm.ppf(1 - (sig / 2)) * (sigma / np.sqrt(n))  # margin of error
        if pop_mean >= estimate - moe and pop_mean <= estimate + moe:
            check.append(True)
        else:
            check.append(False)
    return check
checks = CI_evaluation(mean_heights, sigma_heights, 10, mean_sample, [0.10, 0.05, 0.01])
df = pd.DataFrame(checks, columns = ["evaluation"])
df.index = ['90 %', '95 %', '99 %']
df
|      | evaluation |
|---|---|
| 90 % | True |
| 95 % | True |
| 99 % | True |
Well, an interesting result: all three confidence intervals capture the true population mean (mean_heights = $163.65$). Obviously, the mean of our random sample (mean_sample = $165.04$) is quite a good estimate.
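As an alternative to the manual check, the interval endpoints can also be obtained with norm.interval from scipy.stats (in recent SciPy versions its first argument is the confidence level); a short sketch that reproduces the check from above:
# Interval endpoints via norm.interval; the sampling distribution of the
# mean has standard deviation sigma / sqrt(n)
for conf in [0.90, 0.95, 0.99]:
    low, high = norm.interval(conf,
                              loc = mean_sample,
                              scale = sigma_heights / np.sqrt(10))
    print(f"{conf:.0%} CI: [{low:.2f}, {high:.2f}], "
          f"captures the population mean: {low <= mean_heights <= high}")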
So far, we have just focused on one particular random sample. However, what should we expect if we repeat the sampling process over and over again?
A confidence interval at any particular confidence level states that under repeated random sampling, the interval is expected to include the true population parameter $100(1-\alpha)\%$ of the time. Thus, a 99 % confidence interval implies that, on average, in 99 out of 100 cases the margin of error is large enough to encompass the true population value; this also implies that we are wrong, on average, in 1 out of 100 cases. Likewise, for a 95 % or a 90 % confidence interval we are wrong in 5 or 10 out of 100 cases, respectively. We will use the Python machinery to test this claim by conducting a simulation experiment.
We sample from the heights distribution from above. We take 10, 50, 100, 1000 and 10000 random samples, each with a sample size of $n=10$. For each sample, we calculate the sample mean $\bar{x}$, our sample statistic of interest. By applying our self-built function CI_evaluation, we test whether the true population mean is captured by the 90 %, 95 % and 99 % confidence intervals. We store the resulting Boolean values and calculate the proportion of times each particular confidence interval actually contains the true population mean. Finally, we store these proportions in a pandas DataFrame object and rename its rows and columns for better readability.
trials = [10, 50, 100, 1000, 10000]
columns = ['90', '95', '99']
results = []
for trial in trials:
    proofs = []
    for i in range(1, trial + 1):
        sample = np.random.normal(loc = mean_heights,
                                  scale = sigma_heights,
                                  size = 10)
        mean_sample = np.mean(sample)
        check = CI_evaluation(mean_heights, sigma_heights, 10, mean_sample, [0.10, 0.05, 0.01])
        proofs.append(check)
    proofs = pd.DataFrame(proofs, columns = columns)
    res = []
    for col in columns:
        # proportion of intervals that capture the true population mean
        res.append(proofs[col].sum() / trial)
    results.append(res)
df = pd.DataFrame(results, columns = ['90', '95', '99'])
df.index = trials
df
| number of samples | 90 % | 95 % | 99 % |
|---|---|---|---|
| 10 | 1.000 | 1.0000 | 1.0000 |
| 50 | 0.880 | 0.9000 | 0.9800 |
| 100 | 0.920 | 0.9500 | 0.9900 |
| 1000 | 0.901 | 0.9490 | 0.9890 |
| 10000 | 0.898 | 0.9476 | 0.9886 |

As the number of repeated samples grows, the observed proportions approach the nominal confidence levels of 0.90, 0.95 and 0.99, which confirms the claim stated above.
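One possible way to make this result more tangible is to visualize the individual intervals; a sketch that draws 50 new random samples, computes their 95 % confidence intervals and colours each one according to whether it captures the true population mean:
# Optional illustration: 95 % confidence intervals of 50 random samples,
# coloured by whether they capture the true population mean
n = 10
moe = norm.ppf(1 - 0.05 / 2) * sigma_heights / np.sqrt(n)
fig, ax = plt.subplots()
for i in range(50):
    xbar = np.mean(np.random.normal(loc = mean_heights,
                                    scale = sigma_heights,
                                    size = n))
    captured = xbar - moe <= mean_heights <= xbar + moe
    ax.plot([xbar - moe, xbar + moe], [i, i],
            color = "steelblue" if captured else "red")
ax.axvline(mean_heights, color = "black", linestyle = "--")
ax.set(xlabel = "height in cm", ylabel = "sample number",
       title = "95 % confidence intervals of 50 random samples")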
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.