Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design

Mark Delorey, F. Jay Breidt, Colorado State University

Abstract

In aquatic resources, a two-stage sampling design can be employed to make the best use of what are often limited time and financial resources. Even with the ability to focus such resources, sample sizes are often not large enough to support model-free inferences. The presence of auxiliary information for the regions of interest suggests employing a model in our inferences. Breidt, Claeskens, and Opsomer (2003) propose incorporating this auxiliary information through a class of model-assisted estimators based on penalized spline regression in single-stage sampling. Zheng and Little (2003) also use penalized spline regression in a model-based approach for finite population estimation in a two-stage sample. In a survey context, weights computed from a set of auxiliary information are often applied to many study variables. With this approach, model-assisted estimators should fare better than model-based estimators. We compare the two through a series of simulations.

Two-Stage Sampling

• The population of elements $U = \{1, \ldots, k, \ldots, N\}$ is partitioned into clusters or primary sampling units (PSUs), $U_1, \ldots, U_i, \ldots, U_{N_I}$. So,
$$U = \bigcup_{i=1}^{N_I} U_i \quad \text{and} \quad N = \sum_{i=1}^{N_I} N_i,$$
where $N_i$ is the number of elements or secondary sampling units (SSUs) in $U_i$.
• First stage: A sample of clusters, $s_I$, is selected based on a design $p_I(\cdot)$ with inclusion probabilities $\pi_{Ii}$ and $\pi_{Iij}$.
– $\pi_{Ii}$ and $\pi_{Iij}$ are the first and second order inclusion probabilities, respectively
• Second stage: For every $i \in s_I$, a sample $s_i$ is drawn from $U_i$ based on the design $p_i(\cdot \mid s_I)$
• Typically require the second stage design to be invariant and independent of the first stage

Two-Stage Sampling with Aquatic Resources

• Time and expense constraints may make two-stage sampling more efficient
• Auxiliary information may be available on different scales

Case A: Cluster Level Auxiliaries (Our focus)
• The auxiliary information is available for all clusters in the population
• Leads to regression modeling of quantities associated with the clusters, such as cluster totals
• Cluster quantities can be computed for all clusters
• Population quantities can be computed from cluster estimates
• Example: Lake represents a cluster; auxiliary information is elevation

Case B: Complete Element Level Auxiliaries
• The auxiliary information is available for all elements in the population
• Leads to regression modeling of quantities associated with the elements
• Cluster and population quantities can then be computed from element estimates and observations
• Example: EMAP hexagon is cluster; lake is element; auxiliary information is elevation

Case C: Limited Element Level Auxiliaries
• The auxiliary information is available for all elements in selected clusters only
• Leads to regression modeling of quantities associated with the elements
• Regression estimators can be used for cluster-level quantities only for the clusters selected in the first-stage sample
• Example: Aerial photography of selected sites (clusters); for each point (element) in site, we have percent forested, urban, industrial

Case D: Limited Cluster Level Auxiliaries
• The auxiliary information is available for all clusters in the first-stage sample
• Not a very interesting case
• Design-based estimator can be used for population quantities
• In some cases, good estimators for population quantities are not available
• Example: Cluster is lake; auxiliary information is measure of size which is not available until site is visited

The Estimators (for population totals)

• Horvitz-Thompson (HT)
$$\hat{t}_{y\pi} = \sum_{i \in s_I} \frac{\hat{t}_{yi\pi}}{\pi_{Ii}}, \quad \text{where} \quad \hat{t}_{yi\pi} = \sum_{k \in s_i} \frac{y_k}{\pi_{k|i}}$$
• Model-assisted
$$\hat{t}_{y} = \sum_{i \in U_I} \hat{t}_{yip} + \sum_{i \in s_I} \frac{\hat{t}_{yi\pi} - \hat{t}_{yip}}{\pi_{Ii}},$$
where $\hat{t}_{yip}$ is the PSU total predicted by the model
• Model-based
$$\hat{t}_{y} = \sum_{i \in s_I} n_i \left(\bar{y}_i - \hat{\mu}_i\right) + \sum_{i \in U_I} N_i \hat{\mu}_i,$$
where $\bar{y}_i = n_i^{-1} \sum_{j \in s_i} y_j$ and $\hat{\mu}_i$ is the $i$th cluster mean predicted by the model

Generating Responses

• 500 PSUs; the number of SSUs per cluster ~ Uniform(50, 400)
• $\mu_{PSU} = m(\pi_I) + \varepsilon$, where $m(\cdot)$ is one of the eight functions below and $\varepsilon \sim N(0, \sigma^2 I)$
– We use first order inclusion probabilities proportional to size (pps)
– Auxiliary data is often proportional to size of cluster
• Response of interest $y_{ij} = \mu_i + \varepsilon_{ij}$, where $y_{ij}$ is the $j$th element in the $i$th cluster and $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$

[Figure: the eight mean functions $m(\cdot)$: linear, quadratic, bump, jump, exponential, growth, cycle 1, cycle 4]

Notes on the Models and Model Parameters

• 3 different models used
– Linear
– Penalized spline with random effect for PSU
– Penalized spline with no random effect for PSU
• In a survey context, such as those found in environmental monitoring, it is often desirable to obtain a single set of survey weights that can be used to predict any study variable. To accommodate this:
– Smoothing parameter for the spline is selected by fixing the degrees of freedom for the smooth rather than using a data-driven approach
– Variance component for the PSU effect is computed for the linear model, and the resulting covariance matrix and corresponding survey weights are applied to samples from other data sets
– In this kind of survey context, model-assisted estimators have good efficiency properties and should be superior to model-based estimators, which rely on correct specification of variance components

Comments on Simulation Results

• 500 samples from each of the populations were drawn
• Estimator labels:
– H-T = Horvitz-Thompson estimator
– M-A: lin = Model-assisted estimator using a linear model
– M-B: pmmra = Model-based estimator using a penalized spline and including a random effect for PSU
– M-A: pmm = Model-assisted estimator using a penalized spline with no random effect for PSU
• Each point represents the ratio MSE(estimator) : MSE(model-assisted estimator with random effect for PSU)
• Vertical black bars represent approximate 95% confidence intervals
• Model-assisted estimator with random effect for PSU is as efficient or more efficient than the model-based estimator; we do not appear to lose efficiency (with respect to MSE) by using model-assisted nonparametric methods

[Figure: MSE ratio (vertical axis) by estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm), one panel per mean function: linear, quadratic, bump, jump, exponential, growth, cycle 1, cycle 4]

Funding/Disclaimer

The work reported here was developed under the STAR Research Assistance Agreements CR-829095 and CR-829096 awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This poster has not been formally reviewed by EPA. The views expressed here are solely those of the presenter and of STARMAP, the program he represents.
EPA does not endorse any products or commercial services mentioned in this poster. This research is funded by the U.S. EPA Science To Achieve Results (STAR) Program, Cooperative Agreements # CR-829095 and # CR-829096.
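
The Horvitz-Thompson and model-assisted estimators of a population total described in this poster, for Case A (cluster-level auxiliaries), can be sketched in Python as follows. This is only a minimal illustration under assumptions that the poster does not state: the first stage uses systematic pps sampling, the second stage uses simple random sampling without replacement of a fixed number of elements per PSU, the working model is a linear regression of PSU totals on cluster size (the poster's penalized spline models are not implemented here), and the mean function, variances, sample sizes, and all function names (`systematic_pps`, `ht_total`, `model_assisted_total`, `demo`) are hypothetical.

```python
import numpy as np


def systematic_pps(pi, rng):
    """Systematic sampling with first-order inclusion probabilities pi (all < 1).

    Assumes sum(pi) is numerically an integer n; returns n distinct indices.
    """
    n = int(round(pi.sum()))
    cum = np.cumsum(pi)
    points = rng.uniform() + np.arange(n)        # one random start, unit spacing
    idx = np.searchsorted(cum, points, side="right")
    return np.minimum(idx, len(pi) - 1)          # guard against rounding at the end


def ht_total(t_hat, pi_s):
    """Horvitz-Thompson total: sum over sampled PSUs of t_hat_i / pi_Ii."""
    return np.sum(t_hat / pi_s)


def model_assisted_total(x_pop, s, t_hat, pi_s):
    """Model-assisted total with a linear working model (an assumption).

    Fits t_i ~ b0 + b1 * x_i by 1/pi-weighted least squares on the sampled
    PSUs, predicts every PSU total, then adds the HT-weighted residuals.
    """
    X_s = np.column_stack([np.ones(len(s)), x_pop[s]])
    w = np.sqrt(1.0 / pi_s)
    beta, *_ = np.linalg.lstsq(X_s * w[:, None], t_hat * w, rcond=None)
    pred = beta[0] + beta[1] * x_pop             # predicted PSU totals, all clusters
    return pred.sum() + np.sum((t_hat - pred[s]) / pi_s)


def demo(seed=0, n_psu=500, n_I=50, n_i=10):
    """Draw one two-stage sample from a simulated population (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    N_i = rng.integers(50, 401, size=n_psu)      # SSUs per cluster
    pi_I = n_I * N_i / N_i.sum()                 # pps first-stage probabilities
    # Hypothetical linear mean function of the scaled inclusion probability.
    mu = 1.0 + 2.0 * pi_I / pi_I.max() + rng.normal(0, 0.1, n_psu)
    y = [m + rng.normal(0, 1.0, size=k) for m, k in zip(mu, N_i)]
    true_total = sum(v.sum() for v in y)
    s = systematic_pps(pi_I, rng)
    # Second stage: SRS without replacement, so pi_{k|i} = n_i / N_i and the
    # estimated PSU total is N_i times the sample mean.
    t_hat = np.array(
        [N_i[i] * rng.choice(y[i], size=n_i, replace=False).mean() for i in s]
    )
    ht = ht_total(t_hat, pi_I[s])
    ma = model_assisted_total(N_i.astype(float), s, t_hat, pi_I[s])
    return true_total, ht, ma


if __name__ == "__main__":
    true_total, ht, ma = demo()
    print(f"true total {true_total:.1f}  HT {ht:.1f}  model-assisted {ma:.1f}")
```

Repeating `demo` over many seeds and comparing mean squared errors would mimic, in a simplified way, the kind of MSE-ratio comparison the poster reports; the numbers from this sketch are not comparable to the poster's results.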