Derivation Of Mean And Variance Of Hypergeometric Distribution

Sep 21, 2024

The hypergeometric distribution is a discrete probability distribution that describes the probability of drawing a specific number of successes in a sample of size n taken without replacement from a finite population of size N. It applies whenever sampling is done without replacement from a population consisting of two distinct groups. This article derives the mean and variance of the hypergeometric distribution, providing a step-by-step explanation of its key characteristics.

Understanding the Hypergeometric Distribution

Before delving into the derivation of the mean and variance, let's establish a clear understanding of the hypergeometric distribution. Suppose we have a population of N items, where K of them are considered "successes" and the remaining (N - K) are "failures." We randomly select a sample of n items without replacement. The hypergeometric distribution models the probability of obtaining exactly k successes in this sample.

Key Parameters

The hypergeometric distribution is defined by three main parameters:

  • N: The size of the population.
  • K: The number of successes in the population.
  • n: The sample size.

Probability Mass Function (PMF)

The probability of obtaining exactly k successes in a sample of size n is given by the hypergeometric probability mass function (PMF):

P(X = k) = (K choose k) * (N - K choose n - k) / (N choose n)

where:

  • (K choose k) represents the number of ways to choose k successes from K successes in the population.
  • (N - K choose n - k) represents the number of ways to choose (n - k) failures from (N - K) failures in the population.
  • (N choose n) represents the total number of ways to choose a sample of size n from the population.
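As a sketch, this PMF can be evaluated directly with Python's standard library; the function name `hypergeom_pmf` below is illustrative, not taken from any particular library:

```python
from math import comb

def hypergeom_pmf(N, K, n, k):
    """P(X = k): probability of k successes in a sample of n drawn
    without replacement from N items of which K are successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Example: N = 10 items, K = 4 successes, sample size n = 3, exactly k = 2 successes.
print(hypergeom_pmf(10, 4, 3, 2))  # C(4,2) * C(6,1) / C(10,3) = 36/120 = 0.3
```

Note that `math.comb(m, r)` returns 0 when r > m, so terms with more failures than the population contains vanish automatically.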

Derivation of the Mean

The mean of the hypergeometric distribution, denoted by E(X), represents the expected number of successes in a sample. To derive the mean, we can use the following approach:

  1. Linearity of Expectation:

    • The expected value of a sum of random variables is equal to the sum of their individual expected values.
  2. Indicator Variables:

    • Define n indicator variables, X<sub>i</sub>, where:
      • X<sub>i</sub> = 1 if the i-th item in the sample is a success.
      • X<sub>i</sub> = 0 otherwise.
  3. Expected Value of an Indicator Variable:

    • By symmetry, each of the N items in the population is equally likely to occupy the i-th position in the sample, so the expected value of each indicator variable is simply the probability that the i-th item drawn is a success:
    • E(X<sub>i</sub>) = K/N
  4. Total Expected Value:

    • The total expected number of successes is the sum of the expected values of all indicator variables:
      • E(X) = E(X<sub>1</sub>) + E(X<sub>2</sub>) + ... + E(X<sub>n</sub>) = n * (K/N)

Therefore, the mean of the hypergeometric distribution is n * (K/N). This makes intuitive sense because it represents the sample size (n) multiplied by the proportion of successes in the population (K/N).
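The indicator-variable argument above can be checked empirically. The sketch below (the function name and parameter values are illustrative) draws repeated samples without replacement and compares the average number of successes against n * (K/N):

```python
import random

def empirical_mean(N, K, n, trials=50_000, seed=0):
    """Estimate E(X) by repeatedly drawing n items without replacement
    from a population with K successes (coded 1) and N - K failures (coded 0)."""
    rng = random.Random(seed)
    population = [1] * K + [0] * (N - K)
    total = 0
    for _ in range(trials):
        total += sum(rng.sample(population, n))
    return total / trials

# The derivation predicts E(X) = n * K/N = 10 * 20/50 = 4.0.
N, K, n = 50, 20, 10
print(empirical_mean(N, K, n))  # should be close to 4.0
```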

Derivation of the Variance

The variance of the hypergeometric distribution, denoted by Var(X), measures the spread or variability of the number of successes in a sample. We can derive the variance using the following steps:

  1. Variance of a Sum of Random Variables:

    • The variance of a sum of independent random variables is equal to the sum of their individual variances.
    • In this case, the indicator variables X<sub>i</sub> are not independent because the sampling is done without replacement.
  2. Covariance:

    • To account for the dependence, we need to consider the covariance between the indicator variables:
      • Cov(X<sub>i</sub>, X<sub>j</sub>) = E(X<sub>i</sub> * X<sub>j</sub>) - E(X<sub>i</sub>) * E(X<sub>j</sub>)
  3. Calculating Covariance:

    • E(X<sub>i</sub> * X<sub>j</sub>) is the probability that both the i-th and j-th draws are successes. The first of the pair is a success with probability K/N, and given that, the second is a success with probability (K-1)/(N-1), so E(X<sub>i</sub> * X<sub>j</sub>) = K(K-1)/[N(N-1)].
    • E(X<sub>i</sub>) * E(X<sub>j</sub>) = (K/N) * (K/N) = K<sup>2</sup>/N<sup>2</sup>
    • Therefore Cov(X<sub>i</sub>, X<sub>j</sub>) = K(K-1)/[N(N-1)] - K<sup>2</sup>/N<sup>2</sup> = -K(N - K)/[N<sup>2</sup>(N - 1)], which is negative: drawing a success reduces the chance of drawing another.
  4. Total Variance:

    • The variance of the hypergeometric distribution can be expressed as:
      • Var(X) = Σ<sub>i=1</sub><sup>n</sup> Var(X<sub>i</sub>) + 2 * Σ<sub>i<j</sub> Cov(X<sub>i</sub>, X<sub>j</sub>)
  5. Simplifying:

    • Since X<sub>i</sub> takes only the values 0 and 1, X<sub>i</sub><sup>2</sup> = X<sub>i</sub>, so Var(X<sub>i</sub>) = E(X<sub>i</sub>) - [E(X<sub>i</sub>)]<sup>2</sup> = (K/N) * (1 - K/N)
    • There are n(n-1)/2 pairs of indicator variables with i < j.
  6. Final Result:

    • Substituting into the sum:
      • Var(X) = n * (K/N) * (1 - K/N) + 2 * [n(n-1)/2] * Cov(X<sub>i</sub>, X<sub>j</sub>)
    • Using Cov(X<sub>i</sub>, X<sub>j</sub>) = -K(N - K)/[N<sup>2</sup>(N - 1)], which follows from the quantities in step 3, the terms combine and simplify to:
    Var(X) = n * (K/N) * (1 - K/N) * (N - n) / (N - 1)
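As a check, the closed-form variance can be compared against the variance computed exactly from the PMF; the helper names below are illustrative:

```python
from math import comb

def pmf(N, K, n, k):
    # P(X = k) from the hypergeometric PMF above.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def exact_variance(N, K, n):
    """Compute Var(X) directly as E[(X - E(X))^2] over the full support."""
    support = range(min(K, n) + 1)
    mean = sum(k * pmf(N, K, n, k) for k in support)
    return sum((k - mean) ** 2 * pmf(N, K, n, k) for k in support)

N, K, n = 50, 20, 10
closed_form = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
print(exact_variance(N, K, n), closed_form)  # the two values agree
```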
    

Interpretation and Applications

The derivation of the mean and variance of the hypergeometric distribution provides valuable insights into the behavior of this distribution.

  • The mean directly reflects the proportion of successes in the population and the sample size.
  • The variance depends on the proportion of successes, the sample size, and the population size. The factor (N - n)/(N - 1), known as the finite population correction, is at most 1, so the variance is never larger than that of a binomial distribution with the same n and p = K/N; as the sample size approaches the population size, the variance shrinks toward zero.
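A quick numeric sketch (parameter values chosen purely for illustration) shows how the finite population correction pulls the hypergeometric variance below the binomial variance n * p * (1 - p) as n grows toward N:

```python
# Compare hypergeometric variance with the binomial variance for p = K/N.
N, K = 50, 20
p = K / N
for n in (5, 25, 45):
    hyper = n * p * (1 - p) * (N - n) / (N - 1)  # hypergeometric Var(X)
    binom = n * p * (1 - p)                      # binomial variance, same n and p
    print(n, round(hyper, 3), round(binom, 3))
```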

The hypergeometric distribution has wide applications in various fields:

  • Quality Control: Determining the probability of finding defective items in a sample.
  • Genetics: Analyzing the probability of inheriting specific traits from a gene pool.
  • Survey Sampling: Estimating the proportion of a population with a particular characteristic based on a sample.
  • Poker: Calculating the probability of drawing specific cards in a hand.

Summary

The hypergeometric distribution is a powerful tool for modeling events involving sampling without replacement. Understanding the derivation of its mean and variance is essential for interpreting the distribution's behavior and applying it effectively in various scenarios. The mean represents the expected number of successes, while the variance quantifies the spread of potential outcomes. By comprehending these characteristics, we can utilize the hypergeometric distribution to gain insights into a wide range of applications where the concept of sampling without replacement plays a crucial role.