We often want to determine the proportion (percentage) of members of a finite population that have a specified attribute. Generally, the population under consideration is too large for the population proportion to be found by taking a census. Suppose that a simple random sample of size \(n\) is taken from a population in which the proportion of members that have a specified attribute is \(p\). Then a random variable of primary importance in the estimation of \(p\) is the number of members sampled that have the specified attribute, which we denote as \(X\). The exact probability distribution of \(X\) depends on whether the sampling is done with or without replacement.

If sampling is done with replacement, the sampling process constitutes Bernoulli trials: each selection of a member from the population corresponds to a trial. A success occurs on a trial if the member selected in that trial has the specified attribute; otherwise, a failure occurs. The trials are independent because the sampling is done with replacement. The success probability remains the same from trial to trial: it always equals the proportion of the population that has the specified attribute. Therefore, the random variable \(X\) follows a binomial distribution with parameters \(n\) (the sample size) and \(p\) (the population proportion) (Weiss 2010).
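As a minimal sketch of the with-replacement case, the following R lines compute \(P(X = x)\) once from the binomial formula and once with dbinom(). The values of p, n and x are assumptions chosen for illustration only; they do not come from the text.

```r
# Assumed values for illustration (not from the text)
p <- 0.3   # population proportion with the attribute
n <- 10    # sample size, drawn with replacement
x <- 3     # number of sampled members with the attribute

# P(X = x) from the binomial formula ...
by_formula <- choose(n, x) * p^x * (1 - p)^(n - x)
# ... and from R's built-in binomial density
by_dbinom <- dbinom(x, size = n, prob = p)

# The probabilities of all possible values 0, 1, ..., n sum to 1
total <- sum(dbinom(0:n, size = n, prob = p))

print(c(by_formula = by_formula, by_dbinom = by_dbinom, total = total))
```

Both ways of computing the probability should agree up to floating-point rounding.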

In reality, however, sampling is ordinarily done without replacement. Under these circumstances, the sampling process does not constitute Bernoulli trials because the trials are not independent and the success probability varies from trial to trial. In other words, the random variable \(X\) does not follow a binomial distribution. Its distribution is referred to as a hypergeometric distribution (Weiss 2010).
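To illustrate why the trials are no longer independent, the sketch below uses a small hypothetical population. The values of N, M and n and the helper hyper_pmf() are assumptions for illustration. It shows how the success probability on the second draw depends on the first draw, and computes the hypergeometric probabilities by counting samples with choose().

```r
# Assumed values for illustration (not from the text)
N <- 20   # population size
M <- 8    # members with the attribute, so p = M / N = 0.4
n <- 5    # sample size, drawn without replacement

# Success probability on the second draw depends on the first outcome
p_second_after_success <- (M - 1) / (N - 1)
p_second_after_failure <- M / (N - 1)

# P(X = x): number of samples with x successes and n - x failures,
# divided by the number of all possible samples of size n
hyper_pmf <- function(x) choose(M, x) * choose(N - M, n - x) / choose(N, n)
probs <- hyper_pmf(0:n)

print(c(after_success = p_second_after_success,
        after_failure = p_second_after_failure))
print(round(probs, 4))
```

With replacement, both conditional probabilities would equal M / N; without replacement, they differ, which is what rules out Bernoulli trials.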

Still, in practice, a hypergeometric distribution can usually be approximated by a binomial distribution. As a rule of thumb, if the sample size does not exceed 5% of the population size, there is little difference between sampling with and without replacement.
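The quality of this approximation can be checked numerically. The following sketch assumes a hypothetical population (N and M are illustrative values) and compares the hypergeometric and binomial probabilities for a sample of 5% and a sample of 50% of the population. It relies on R's dhyper(), whose arguments m, n and k are the number of members with the attribute, the number without it, and the sample size; max_diff() is a helper defined here for illustration.

```r
# Assumed values for illustration (not from the text)
N <- 1000        # population size
M <- 300         # members with the attribute
p <- M / N       # population proportion, 0.3

# Largest absolute difference between the hypergeometric and binomial
# probabilities over all possible values of X, for a sample of size k
max_diff <- function(k) {
  x <- 0:k
  max(abs(dhyper(x, m = M, n = N - M, k = k) - dbinom(x, size = k, prob = p)))
}

max_diff_small <- max_diff(50)    # sample is 5% of the population
max_diff_large <- max_diff(500)   # sample is 50% of the population

print(c(small = max_diff_small, large = max_diff_large))
```

Under these assumptions, the discrepancy for the 5% sample should be clearly smaller than for the 50% sample.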


Sampling and the Binomial Distribution

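Suppose that a simple random sample of size \(n\) is taken from a finite population in which the proportion of members that have a specified attribute is \(p\). Then the number of sampled members that have the specified attribute has exactly a binomial distribution with parameters \(n\) and \(p\) if the sampling is done with replacement, and approximately a binomial distribution with parameters \(n\) and \(p\) if the sampling is done without replacement and the sample size does not exceed 5% of the population size.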

Hypergeometric Distribution in R

Analogous to the dbinom() function, R has the following built-in functions for the hypergeometric distribution:

dhyper(x, m, n, k, log = FALSE)
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
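A short usage sketch may help with the argument names. In these functions, m is the number of population members with the attribute, n is the number without it (not the sample size), and k is the number of members drawn. The numeric values below are assumptions chosen for illustration.

```r
# Assumed values for illustration (not from the text)
m <- 8    # population members with the attribute
n <- 12   # population members without the attribute
k <- 5    # sample size, drawn without replacement

# P(X = 2): probability of exactly two successes in the sample
p_exactly_2 <- dhyper(2, m = m, n = n, k = k)

# P(X <= 2): cumulative probability, equal to the sum of the densities
p_at_most_2 <- phyper(2, m = m, n = n, k = k)
p_summed    <- sum(dhyper(0:2, m = m, n = n, k = k))

# P(X > 2): upper tail via lower.tail = FALSE
p_more_than_2 <- phyper(2, m = m, n = n, k = k, lower.tail = FALSE)

# Median: smallest x with P(X <= x) >= 0.5
x_median <- qhyper(0.5, m = m, n = n, k = k)

print(c(p_exactly_2 = p_exactly_2, p_at_most_2 = p_at_most_2,
        p_more_than_2 = p_more_than_2, x_median = x_median))
```

Naming the arguments explicitly, as done here, avoids confusing R's n with the sample size \(n\) used in the text.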


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.