Stratified Sampling If You Are Interested In Sampling

The purpose of this document is to explore the concept of stratified sampling within the field of statistics, focusing on how to adapt estimation methods when the sampling process is not a simple random sample. Stratified sampling is a technique used to improve the accuracy and efficiency of statistical estimates by dividing the population into distinct subgroups, or strata, that are internally homogeneous. The goal is to ensure that each subgroup is adequately represented in the sample so that the overall estimates are more precise than those obtained via simple random sampling, especially when the population is heterogeneous across different segments.

This approach addresses research questions where understanding differences between subpopulations is vital, such as in social sciences, market research, ecology, and public health studies. For instance, a researcher might want to estimate the average income in a country by stratifying the population based on geographic regions or socioeconomic classes. Disciplines that extensively utilize stratified sampling include epidemiology, sociology, political science, and marketing. My interest in stratified sampling stems from its potential to produce more accurate, representative insights, particularly in diverse populations, and its relevance in designing efficient sampling strategies for large-scale surveys.

Fundamental Equations

Stratified sampling involves dividing the population into \(L\) strata, each with size \(N_h\), where \(h = 1, 2, ..., L\). From each stratum, a sample of size \(n_h\) is drawn, often proportionally to the stratum size, such that the total sample size is \(n = \sum_{h=1}^L n_h\). The key estimation goal is to estimate a population parameter, commonly the overall mean \(\bar{Y}\), based on the stratified sample.

The weighted stratified sample mean is calculated as:

\(\hat{\bar{Y}}_{st} = \sum_{h=1}^L \frac{N_h}{N} \bar{y}_h\),

where \(\bar{y}_h\) is the sample mean within stratum \(h\), \(N_h\) is the size of stratum \(h\), and \(N = \sum_{h=1}^L N_h\) is the total population size.
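The weighted mean above can be sketched in a few lines of Python. This is only an illustration: the function name `stratified_mean` and the two-stratum inputs are hypothetical and are not taken from any particular library or data set.

```python
from typing import Sequence


def stratified_mean(stratum_sizes: Sequence[float], stratum_means: Sequence[float]) -> float:
    """Weight each stratum's sample mean by its population share N_h / N."""
    if len(stratum_sizes) != len(stratum_means):
        raise ValueError("need exactly one sample mean per stratum")
    total = sum(stratum_sizes)  # N, the total population size
    return sum(size / total * mean for size, mean in zip(stratum_sizes, stratum_means))


# Hypothetical population: two strata of 60 and 40 units with sample means 10 and 20.
estimate = stratified_mean([60, 40], [10.0, 20.0])
print(estimate)
```

Note that the weights come from the known stratum sizes, not from the realized sample sizes, so the estimator remains valid even when the allocation is not proportional.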

The variance of this estimator, assuming simple random sampling within each stratum, is:

\(\operatorname{Var}(\hat{\bar{Y}}_{st}) = \sum_{h=1}^L \left(\frac{N_h}{N}\right)^2 \left(\frac{S_h^2}{n_h}\right) \left(1 - \frac{n_h}{N_h}\right)\),

where \(S_h^2\) is the variance within stratum \(h\); in practice it is unknown and is replaced by the sample variance computed from the \(n_h\) sampled units. When the sampling fraction \(n_h / N_h\) is small in every stratum, the finite population correction term \(\left(1 - n_h / N_h\right)\) is close to 1 and can be dropped.
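As a rough sketch, the variance formula can be coded with the finite population correction as an optional switch, which makes it easy to see how much the correction matters. The function name `stratified_variance`, the `use_fpc` flag, and the numeric inputs are assumptions made for illustration.

```python
from typing import Sequence


def stratified_variance(
    stratum_sizes: Sequence[float],
    sample_sizes: Sequence[int],
    stratum_variances: Sequence[float],
    use_fpc: bool = True,
) -> float:
    """Variance of the stratified mean, assuming SRS within each stratum."""
    total = sum(stratum_sizes)  # N
    variance = 0.0
    for size, n, s2 in zip(stratum_sizes, sample_sizes, stratum_variances):
        weight = size / total                      # W_h = N_h / N
        fpc = 1.0 - n / size if use_fpc else 1.0   # finite population correction
        variance += weight**2 * (s2 / n) * fpc
    return variance


# Hypothetical inputs: strata of 60 and 40 units, 6 and 4 units sampled,
# within-stratum variances 4 and 9.
with_fpc = stratified_variance([60, 40], [6, 4], [4.0, 9.0])
without_fpc = stratified_variance([60, 40], [6, 4], [4.0, 9.0], use_fpc=False)
print(with_fpc, without_fpc)
```

Because every stratum here has a 10% sampling fraction, the corrected variance is 0.9 times the uncorrected one.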

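Intuitively, stratified estimators capitalize on the homogeneity within strata to reduce variance: only within-stratum variability contributes to the variance of the estimator. The allocation of the sample among strata (proportional or optimal) also affects precision. Optimal allocation minimizes the variance for a fixed total sample size; in its common form (Neyman allocation, which assumes an equal sampling cost per unit in every stratum), \(n_h\) is proportional to \(N_h S_h\), so more of the sample goes to strata that are larger or more variable.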
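One common form of optimal allocation is Neyman allocation, which sets \(n_h\) proportional to \(N_h S_h\) and assumes the cost of sampling a unit is the same in every stratum. The sketch below returns unrounded allocations; the function name `neyman_allocation` and the inputs are hypothetical, and in practice the results would need rounding and capping at \(N_h\).

```python
import math
from typing import List, Sequence


def neyman_allocation(
    total_sample: int,
    stratum_sizes: Sequence[float],
    stratum_variances: Sequence[float],
) -> List[float]:
    """Return unrounded n_h proportional to N_h * S_h (equal unit costs assumed)."""
    products = [size * math.sqrt(s2) for size, s2 in zip(stratum_sizes, stratum_variances)]
    scale = total_sample / sum(products)
    return [scale * p for p in products]


# Hypothetical population: two equally sized strata, the second with three times
# the standard deviation of the first (variances 1 and 9).
allocation = neyman_allocation(100, [500, 500], [1.0, 9.0])
print(allocation)
```

With equal stratum sizes, the more variable stratum receives three quarters of the sample, whereas proportional allocation would split it evenly.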

A Simple Example

Suppose a university wants to estimate the average GPA of all students across three faculties: Science, Arts, and Business. The faculties are the strata, with known population sizes: Science (\(N_1 = 300\)), Arts (\(N_2 = 200\)), and Business (\(N_3 = 500\)). A sample of 100 students is drawn proportionally from each stratum based on their population sizes (proportional stratified sampling). The sample means (from a prior sample or a hypothetical data collection) are: Science \(\bar{y}_1 = 3.2\), Arts \(\bar{y}_2 = 3.5\), Business \(\bar{y}_3 = 3.3\). Sample variances are: Science \(S_1^2 = 0.25\), Arts \(S_2^2 = 0.20\), Business \(S_3^2 = 0.15\).

Calculate the estimated overall average GPA using stratified sampling and interpret the result.

First, determine the sample sizes per stratum:

\(n_1 = 100 \times (300/1000) = 30\),

\(n_2 = 100 \times (200/1000) = 20\),

\(n_3 = 100 \times (500/1000) = 50\).
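The same allocation can be computed programmatically. In this example the proportional shares are whole numbers, but in general they are not, so the sketch below adds a largest-remainder rounding step; that rounding rule and the function name `proportional_allocation` are assumptions for illustration, not part of the example itself.

```python
from typing import List, Sequence


def proportional_allocation(total_sample: int, stratum_sizes: Sequence[int]) -> List[int]:
    """Split total_sample across strata in proportion to N_h, using integer arithmetic."""
    total = sum(stratum_sizes)
    products = [total_sample * size for size in stratum_sizes]
    counts = [p // total for p in products]        # floor of n * N_h / N
    remainders = [p % total for p in products]     # fractional parts, scaled by N
    leftover = total_sample - sum(counts)
    # Give the remaining units to the strata with the largest fractional parts.
    order = sorted(range(len(counts)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


faculty_sizes = {"Science": 300, "Arts": 200, "Business": 500}
allocation = dict(zip(faculty_sizes, proportional_allocation(100, list(faculty_sizes.values()))))
print(allocation)
```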

Next, estimate the overall mean:

\(\hat{\bar{Y}}_{st} = (N_1/N)\bar{y}_1 + (N_2/N)\bar{y}_2 + (N_3/N)\bar{y}_3\),

where \(N = 1000\). Plugging in the numbers:

\(\hat{\bar{Y}}_{st} = (300/1000)(3.2) + (200/1000)(3.5) + (500/1000)(3.3)\),

which equals:

\(0.3 \times 3.2 + 0.2 \times 3.5 + 0.5 \times 3.3 = 0.96 + 0.70 + 1.65 = 3.31\).

Thus, the estimated average GPA is approximately 3.31. To assess the precision, compute the variance:

\(\operatorname{Var}(\hat{\bar{Y}}_{st}) \approx \sum_{h=1}^3 \left(\frac{N_h}{N}\right)^2 \frac{S_h^2}{n_h}\),

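ignoring the finite population correction (the correction factor is \(1 - n_h/N_h = 0.9\) in every stratum here, so the approximation is slightly conservative):

\(= (0.3)^2 \times 0.25/30 + (0.2)^2 \times 0.20/20 + (0.5)^2 \times 0.15/50\)

\(= 0.09 \times 0.00833 + 0.04 \times 0.01 + 0.25 \times 0.003\)

\(= 0.00075 + 0.0004 + 0.00075 = 0.0019\).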

The standard error (SE) is the square root of 0.0019, approximately 0.044. Therefore, the 95% confidence interval is roughly:

3.31 ± 1.96 * 0.044 ≈ (3.22, 3.40).
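The hand calculation can be checked with a short script. It is a minimal sketch that uses the example's figures; the variable names and the helper `variance_of_mean` are hypothetical, and the script also reports the interval with the finite population correction included for comparison.

```python
import math

sizes = [300, 200, 500]         # N_h for Science, Arts, Business
samples = [30, 20, 50]          # n_h under proportional allocation
means = [3.2, 3.5, 3.3]         # sample means from the example
variances = [0.25, 0.20, 0.15]  # sample variances from the example

N = sum(sizes)
weights = [size / N for size in sizes]
estimate = sum(w * m for w, m in zip(weights, means))


def variance_of_mean(use_fpc: bool) -> float:
    """Variance of the stratified mean, with or without the FPC."""
    total = 0.0
    for w, size, n, s2 in zip(weights, sizes, samples, variances):
        fpc = 1.0 - n / size if use_fpc else 1.0
        total += w**2 * s2 / n * fpc
    return total


for label, use_fpc in (("without FPC", False), ("with FPC", True)):
    se = math.sqrt(variance_of_mean(use_fpc))
    low, high = estimate - 1.96 * se, estimate + 1.96 * se
    print(f"{label}: SE = {se:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Since every faculty is sampled at a 10% rate, including the correction multiplies the variance by 0.9 and narrows the interval slightly.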

This example shows how the stratified estimator combines the stratum means using population weights, and how the within-stratum variances and per-stratum sample sizes determine its precision. Stratification improves on simple random sampling to the extent that the faculties differ in mean GPA and are internally homogeneous, and allocating the sample efficiently among strata can improve precision further.
