To better understand the estimation of a population mean and the construction of a confidence interval, we work through the procedure on a data set: the students data set. You may download the students.csv file here to your local file system or import it directly as a web resource. In either case, you read the data set into Python as a pandas DataFrame object using the read_csv function:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
Note: Make sure that the pandas package is part of your mamba environment!
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student; the variables are self-explanatory.
In this section, we focus on the height of female students. The height variable of female students is roughly normally distributed, which is supported by the symmetric bell shape of the histogram below. To visualize the height distribution of female students, we first filter the students data set to create a subset, called females, that contains only the female students:
females = students.loc[students.gender == "Female"]
sns.histplot(females["height"],
bins = 29, color = "gainsboro").set(title='Histogram of female\'s height distribution')
[Figure: Histogram of female's height distribution]
Let us assume that the height measurements given in the data set are a very good approximation for the population of interest, which is the height of female students in cm. We first compute the population's mean $\mu$ and the population's standard deviation $\sigma$ based on the data given.
mean_heights = np.mean(females["height"])
print(mean_heights)
163.65328467153284
sigma_heights = np.std(females["height"])
print(sigma_heights)
7.918762263149209
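A small aside: by default, np.std divides by $n$ (ddof = 0), which yields the population standard deviation and is exactly what we want here, since we treat the data as the entire population. For a sample standard deviation you would pass ddof = 1; a minimal sketch:
# Population vs. sample standard deviation:
# ddof = 0 (default) divides by n, ddof = 1 divides by n - 1
print(np.std(females["height"], ddof = 0))   # population SD, as used above
print(np.std(females["height"], ddof = 1))   # sample SD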
We want to expand the plot from above by drawing the probability density function of the assumed Gaussian distribution, based on $\mu$ and $\sigma$, on top of the histogram.
ax = sns.histplot(females["height"],
kde=False, stat = 'density',
bins = 29, color = "gainsboro")
x = np.linspace(np.min(females["height"]),
np.max(females["height"]), 1000)
heights_pdf = norm.pdf(x,
loc = mean_heights, scale = sigma_heights)
ax.plot(x, heights_pdf, color = "red")
[Figure: Histogram of female's height distribution with the fitted normal density curve overlaid in red]
What a nice fit!
Now, we take a random sample with a sample size of $n=10$ from the assumed probability distribution using the random.normal()
function from the numpy
package. Afterwards, we calculate the sample mean $\bar {x}$.
np.random.seed(40)
sample = np.random.normal(loc = mean_heights,
scale = sigma_heights,
size = 10)
mean_sample = np.mean(sample)
print(mean_sample)
165.03599768738724
Our random sample yields a sample mean ($\bar {x}$) of approximately:
round(mean_sample, 2)
165.04
This is our point estimate for the population parameter of interest, which in this case is the mean height of female students ($\mu$).
How accurate is our point estimate? We can check whether our estimate is equal to the true population parameter:
mean_sample == mean_heights
False
OK, we expected that.
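Before constructing interval estimates, it is useful to quantify how far the sample mean typically deviates from the population mean. Assuming the population standard deviation $\sigma$ is known, this is the standard error $\sigma / \sqrt{n}$, the quantity that appears in the confidence interval equation below; a minimal sketch:
# Standard error of the sample mean for samples of size n = 10,
# assuming the population standard deviation is known
n = 10
standard_error = sigma_heights / np.sqrt(n)
print(round(standard_error, 2))
For our data this evaluates to roughly 2.5 cm.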
Let us calculate some interval estimates by constructing the 90 %, 95 % and 99 % confidence intervals. Recall the equation for a confidence interval:
$$CI: \text{point estimate} \pm z^{*}_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

The critical value $z^{*}_{\alpha/2}$ evaluates to 1.645, 1.96 and 2.576 for confidence levels of 90 %, 95 % and 99 %, respectively.
Applied to our height data, the equation yields the following:

$$CI_{90\%}: 165.04 \pm 1.645 \times \frac{7.92}{\sqrt{10}} = 165.04 \pm 4.12$$

Note: Obviously, if we want higher confidence that the unknown population parameter is contained in the interval, the margin of error becomes larger.
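We can let Python do this arithmetic for all three confidence levels at once. The following sketch uses norm.ppf from scipy.stats (imported above) to obtain the critical values and builds each interval around our point estimate:
# Critical values and confidence intervals around the sample mean,
# using the known population standard deviation and n = 10
n = 10
for conf in [0.90, 0.95, 0.99]:
    alpha = 1 - conf
    z_crit = norm.ppf(1 - alpha / 2)              # critical value z*_{alpha/2}
    moe = z_crit * sigma_heights / np.sqrt(n)     # margin of error
    print(f"{conf:.0%} CI: {mean_sample - moe:.2f} to {mean_sample + moe:.2f}")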
As a sanity check, let us investigate whether the true population value is captured within our interval estimates. It is important to remember that a confidence interval does not assign a probability to our particular sample mean; rather, it states that under repeated random sampling, the confidence interval is expected to include the population mean $100(1-\alpha)\%$ of the time. To test our intervals, we write a user-defined function (UDF) ourselves.
def CI_evaluation(pop_mean, sigma, n, estimate, alpha):
    # For each significance level in alpha, check whether the corresponding
    # confidence interval around the estimate captures the true population mean.
    check = []
    for sig in alpha:
        moe = norm.ppf(1 - (sig / 2)) * (sigma / np.sqrt(n))  # margin of error
        if pop_mean >= estimate - moe and pop_mean <= estimate + moe:
            check.append(True)
        else:
            check.append(False)
    return check
checks = CI_evaluation(mean_heights, sigma_heights, 10, mean_sample, [0.10, 0.05, 0.01])
df = pd.DataFrame(checks, columns = ["evaluation"])
df.index = ['90 %', '95 %', '99 %']
df
|      | evaluation |
|---|---|
| 90 % | True |
| 95 % | True |
| 99 % | True |
Well, an interesting result: all three confidence intervals capture the true population mean (mean_heights = $163.65$). Obviously, the mean of our random sample (mean_sample = $165.04$) is quite a good estimate.
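As an alternative to the manual check, the interval endpoints can also be obtained with norm.interval from scipy.stats (in recent SciPy versions its first argument is the confidence level); a short sketch that reproduces the check from above:
# Interval endpoints via norm.interval; the sampling distribution of the
# mean has standard deviation sigma / sqrt(n)
for conf in [0.90, 0.95, 0.99]:
    low, high = norm.interval(conf,
                              loc = mean_sample,
                              scale = sigma_heights / np.sqrt(10))
    print(f"{conf:.0%} CI: [{low:.2f}, {high:.2f}], "
          f"captures the population mean: {low <= mean_heights <= high}")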
So far, we have just focused on one particular random sample. However, what should we expect if we repeat the sampling process over and over again?
A confidence interval at any particular confidence level states that under repeated random sampling, the interval is expected to include the true population parameter $100(1-\alpha)\%$ of the time. Thus, a 99 % confidence interval implies that, on average, in 99 out of 100 cases the margin of error is large enough to encompass the true population value; this also implies that we are wrong, on average, in 1 out of 100 cases. Likewise, for a 95 % or a 90 % confidence interval we are wrong in 5 or 10 out of 100 cases, respectively. We will use the Python machinery to test this claim by conducting a simulation experiment.
We sample from the heights distribution from above. We take 10, 50, 100, 1000 and 10000 random samples, each with a sample size of $n=10$. For each sample, we calculate the sample mean $\bar{x}$, our sample statistic of interest. By applying our self-built function CI_evaluation, we test whether the true population mean is captured by the 90 %, 95 % and 99 % confidence intervals. We store the resulting Boolean values and calculate the proportion of times each particular confidence interval actually contains the true population mean. Finally, we store these proportions in a pandas DataFrame object and rename its rows and columns for better readability.
trials = [10, 50, 100, 1000, 10000]
columns = ['90', '95', '99']
results = []
for trial in trials:
    proofs = []
    for i in range(1, trial + 1):
        sample = np.random.normal(loc = mean_heights,
                                  scale = sigma_heights,
                                  size = 10)
        mean_sample = np.mean(sample)
        check = CI_evaluation(mean_heights, sigma_heights, 10, mean_sample, [0.10, 0.05, 0.01])
        proofs.append(check)
    proofs = pd.DataFrame(proofs, columns = columns)
    res = []
    for col in columns:
        # proportion of intervals that capture the true population mean
        res.append(proofs[col].sum() / trial)
    results.append(res)
df = pd.DataFrame(results, columns = ['90', '95', '99'])
df.index = trials
df
| number of samples | 90 % | 95 % | 99 % |
|---|---|---|---|
| 10 | 1.000 | 1.0000 | 1.0000 |
| 50 | 0.880 | 0.9000 | 0.9800 |
| 100 | 0.920 | 0.9500 | 0.9900 |
| 1000 | 0.901 | 0.9490 | 0.9890 |
| 10000 | 0.898 | 0.9476 | 0.9886 |

As the number of repeated samples grows, the observed proportions approach the nominal confidence levels of 0.90, 0.95 and 0.99, which confirms the claim stated above.
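One possible way to make this result more tangible is to visualize the individual intervals; a sketch that draws 50 new random samples, computes their 95 % confidence intervals and colours each one according to whether it captures the true population mean:
# Optional illustration: 95 % confidence intervals of 50 random samples,
# coloured by whether they capture the true population mean
n = 10
moe = norm.ppf(1 - 0.05 / 2) * sigma_heights / np.sqrt(n)
fig, ax = plt.subplots()
for i in range(50):
    xbar = np.mean(np.random.normal(loc = mean_heights,
                                    scale = sigma_heights,
                                    size = n))
    captured = xbar - moe <= mean_heights <= xbar + moe
    ax.plot([xbar - moe, xbar + moe], [i, i],
            color = "steelblue" if captured else "red")
ax.axvline(mean_heights, color = "black", linestyle = "--")
ax.set(xlabel = "height in cm", ylabel = "sample number",
       title = "95 % confidence intervals of 50 random samples")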
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.