Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design

Mark Delorey
Joint work with F. Jay Breidt and Jean Opsomer
September 8, 2005
Research supported by EPA Cooperative Agreements R829095 and R829096

Motivation

In resource monitoring and assessment, time and expense constraints may make two-stage sampling more efficient
  • Select a sample of watersheds; sample different bodies of water within selected watersheds
  • Select a sample of lakes; sample at different locations in selected lakes
Samples are not always sufficiently dense in small watersheds; availability of cheap auxiliary information (primarily from GIS) suggests incorporating a model
Auxiliary information may be available on different scales
Often there are many study variables; rather than fit a model for each one, we would like one set of weights that can be applied reasonably well to all variables, i.e., $\hat{y} = Hy$

Outline

Two-stage structure
Model-free, model-assisted, and model-based estimators
Penalized splines
Simulation results
Properties of model-assisted estimator using penalized spline

Two-Stage Structure

Population of elements $U = \{1, \ldots, k, \ldots, N\}$ is partitioned into clusters or primary sampling units (PSUs), $U_1, \ldots, U_i, \ldots, U_{N_I}$. So,

$$U = \bigcup_{i=1}^{N_I} U_i \quad \text{and} \quad N = \sum_{i=1}^{N_I} N_i$$

where $N_i$ is the number of elements or secondary sampling units (SSUs) in $U_i$.

Case A: Cluster Level Auxiliaries (Our focus)

The auxiliary information is available for all clusters in the population
Leads to regression modeling of quantities associated with the clusters, such as cluster totals and means
Cluster quantities can be computed for all clusters
Population quantities can be computed from cluster estimates
Example: Lake represents a cluster; auxiliary information is elevation

Case B: Complete Element Level Auxiliaries

The auxiliary information is available for all elements in the population
Leads to regression modeling of quantities associated with the elements
Cluster and population quantities can then be computed from element estimates and observations
Example: EMAP hexagon is cluster; lake is element; auxiliary information is elevation

Case C: Limited Element Level Auxiliaries

The auxiliary information is available for all elements in selected clusters only
Leads to regression modeling of quantities associated with the elements
Regression estimators can be used for cluster-level quantities only for the clusters selected in the first-stage sample
Population-level quantities can be estimated using design-based estimators
Example: Aerial photography of selected sites (clusters); for each point (element) in site, we have percent forested, urban, industrial

Case D: Limited Cluster Level Auxiliaries

The auxiliary information is available only for the clusters in the first-stage sample
Not a very interesting case
Design-based estimator can be used for population quantities
Example: Cluster is lake; auxiliary information is measure of size which is not available until site is visited

Sampling

First stage: A sample of clusters, $s_I$, is selected based on a design $p_I(\cdot)$ with inclusion probabilities $\pi_{Ii}$ and $\pi_{Iij}$
  • $\pi_{Ii}$ and $\pi_{Iij}$ are the first and second order inclusion probabilities, respectively
Second stage: For every $i \in s_I$, a sample $s_i$ is drawn from $U_i$ based on the design $p_i(\cdot \mid s_I)$
Typically require second stage design to be invariant and independent of the first stage

Other Notation

$t_y = \sum_U y_k = \sum_{U_I} t_{yi}$ is the total for the variable y over the entire population, where $t_{yi} = \sum_{U_i} y_k$ is the total for PSU i and $U_I$ is the population of PSUs
Where required, we will assume the population model:

$$\mu_i \sim N\left(f(x_i), \sigma^2\right)$$

where
  • $\mu_i$ is the mean of the y's in PSU i
  • $x_i$ is some auxiliary variable that is a known quantity (usually a total or mean) for PSU i

The Estimators (for population totals)

Model-free
Model-assisted
Model-based

Model-Free Estimator

If no other information than the sampling design is available, the Horvitz-Thompson estimator is often used:

$$\hat{t}_y = \sum_{s} \frac{y_k}{\pi_k} = \sum_{s_I} \frac{\hat{t}_{yi}}{\pi_{Ii}}, \quad \text{where} \quad \hat{t}_{yi} = \sum_{s_i} \frac{y_k}{\pi_{k|i}}$$

Notes:
  • Always design unbiased
  • Variance is large for small sample sizes
  • Does not make use of auxiliary information

Model-Assisted Estimator

$$\hat{t}_y = \sum_{U_I} \hat{t}_{yip} + \sum_{s_I} \frac{\hat{t}_{yi} - \hat{t}_{yip}}{\pi_{Ii}}$$

where $\hat{t}_{yip}$ is the PSU total predicted by the model

Properties:
  • Asymptotically unbiased and consistent even if model is misspecified
  • Variance is generally smaller than with HT, but larger than with the model-based estimator
  • Can incorporate auxiliary information

Model-Based Estimator

$$\hat{t}_y = \sum_{s_I} \left[ n_i \bar{y}_i + (N_i - n_i)\hat{\mu}_i \right] + \sum_{U_I \setminus s_I} N_i \hat{\mu}_i, \quad \text{where} \quad \bar{y}_i = \frac{1}{n_i} \sum_{s_i} y_{ij}$$

and $\hat{\mu}_i$ is the ith PSU mean predicted by the model

Properties:
  • Unbiased if model is correctly specified
  • Variance is generally smaller than with HT
  • Can incorporate auxiliary information

Notes on the Models

Three different models considered
  • Linear
  • Penalized spline with random effect for PSU
  • Penalized spline with no random effect for PSU
Extend model specification for penalized spline with random effect for PSU:

$$\mu_i \sim N\left(f(x_i), \sigma^2\right), \qquad y_{ij} \mid \mu_i \sim N\left(\mu_i, \sigma_e^2\right)$$

where $y_{ij}$ is the response for the jth element in PSU i, and $\sigma^2$ and $\sigma_e^2$ denote the between-PSU and within-PSU variances

Penalized Splines (P-Splines)

With a linear model, we assume

$$f(x_i) = \beta_0 + \beta_1 x_i$$

For a penalized spline,

$$f(x_i) = \beta_0 + \beta_1 x_i + \sum_{l=1}^{K} \beta_{l+1} (x_i - \kappa_l)_+$$

where $\kappa_1 < \cdots < \kappa_K$ are K fixed knots and $(x)_+ = x \, I\{x > 0\}$

Simulation Study

500 PSUs; the number of SSUs per cluster ~ Uniform(50, 400)
$\mu_{PSU} = f(\pi_I) + \varepsilon$, where $f(\cdot)$ is one of eight functions and $\varepsilon \sim N(0, \sigma^2 I)$
  • We use first order inclusion probabilities proportional to size (pps)
  • Auxiliary data is often proportional to size of cluster
Generate the response of interest $y_{ij} = \mu_i + \varepsilon_{ij}$, where $y_{ij}$ is the jth element in the ith cluster and $\varepsilon_{ij} \sim$ iid $N(0, \sigma_e^2)$

First Four Functions

linear, quadratic, bump, jump

Second Four Functions

exponential, growth, cycle 1, cycle 4

Some Simulation Results

(The first two numeric columns are the two variance parameters of the simulation model.)

Function       Variances        HT     LIN     SPL    MBRE
Linear         0.01   0.01    15.94    1.14    1.16    0.97
               0.01   0.25    10.34    4.63    1.13    0.95
               0.25   0.01     1.69    1.29    1.34    0.99
               0.25   0.25     1.20    0.98    1.02    0.94
Quadratic      0.01   0.01    28.46    9.20    1.07    0.91
               0.01   0.25    19.64   31.63    1.41    1.04
               0.25   0.01     3.61    2.48    1.06    0.97
               0.25   0.25     2.60    1.74    1.12    0.97
Bump           0.01   0.01     7.27    2.68    1.73    0.72
               0.01   0.25     6.58    3.29    1.37    1.11
               0.25   0.01     1.34    1.11    1.07    1.02
               0.25   0.25     1.41    1.11    1.17    1.03
Jump           0.01   0.01    10.94   10.38    2.54    0.87
               0.01   0.25    37.39   25.15    2.70    0.92
               0.25   0.01     4.55    2.48    1.12    0.95
               0.25   0.25     8.30    4.75    1.49    1.10

More Simulation Results

Function       Variances        HT     LIN     SPL    MBRE
Exponential    0.01   0.01    34.77    1.35    0.87    0.54
               0.01   0.25    39.47    1.96    1.85    1.14
               0.25   0.01     2.72    0.94    1.30    1.07
               0.25   0.25     3.13    0.90    1.15    1.01
Growth         0.01   0.01    12.49    4.20    1.28    0.93
               0.01   0.25    32.10   25.24    1.82    1.03
               0.25   0.01     2.80    1.68    1.20    1.04
               0.25   0.25     3.47    1.48    1.06    0.99
Cycle1         0.01   0.01    26.55    3.27    1.18    0.82
               0.01   0.25    32.01   18.80    1.37    1.05
               0.25   0.01     3.07    1.53    1.32    0.79
               0.25   0.25     2.97    2.11    1.23    1.03
Cycle4         0.01   0.01    32.96    3.52    1.17    0.87
               0.01   0.25     2.72    2.88    2.68    1.09
               0.25   0.01     1.02    1.10    1.04    0.91
               0.25   0.25     1.84    1.70    1.69    1.09

Why not use model-based?

In survey contexts, such as those found in environmental monitoring, it is often desirable to obtain a single set of survey weights that can be used to predict any study variable.
To accommodate this:
  • Smoothing parameter for spline is selected by fixing the degrees of freedom for the smooth rather than using a data-driven approach
  • With model-based, sampling design is ignored and estimates rely solely on the form of $f(\cdot)$

Relative MSE (Fitting to bump)

[Figure: MSE ratio for H-T, M-A: lin, M-B: pmm, and M-A: pmm; panels: linear, quadratic, bump, jump]
[Figure: MSE ratio for H-T, M-A: lin, M-B: pmm, and M-A: pmm; panels: exponential, growth, cycle 1, cycle 4]

Relative Bias (Fitting to bump)

[Figure: bias ratio for H-T, M-A: lin, M-B: pmm, and M-A: pmm; panels: linear, quadratic, bump, jump]
[Figure: bias ratio for H-T, M-A: lin, M-B: pmm, and M-A: pmm; panels: exponential, growth, cycle 1, cycle 4]

Relative Variance (Fitting to bump)

[Figure: variance ratio for H-T, M-A: lin, M-B: pmm, and M-A: pmm; panels: linear, quadratic, bump, jump]
[Figure: variance ratio for H-T, M-A: lin, M-B: pmm, and M-A: pmm; panels: exponential, growth, cycle 1, cycle 4]

Properties of Model-Assisted Estimator

The penalized spline estimator, $\hat{t}_{y,spl}$, is linear in y: $\hat{t}_{y,spl} = \sum_s w_k y_k$
It is location and scale invariant, in the sense that

$$\sum_{s} w_k (a y_k + b) = a \sum_{s} w_k y_k + N b$$

provided an intercept is kept in the model and

$$\sum_{k \in s_i} \frac{1}{\pi_{k|i}} = N_i$$

Under mild assumptions, the penalized spline estimator, $\hat{t}_{y,spl}$, is design $\sqrt{n_I}$-consistent for $t_y$, in the sense that

$$\frac{\hat{t}_{y,spl} - t_y}{N_I} = O_p\left(\frac{1}{\sqrt{n_I}}\right)$$

and has the following asymptotic distributional property:

$$\frac{\hat{t}_{y,spl} - t_y}{\sqrt{V\left( \sum_{U_I} t_{yip} + \sum_{s_I} \frac{\hat{t}_{yi} - t_{yip}}{\pi_{Ii}} \right)}} \overset{dist}{\longrightarrow} N(0, 1)$$

Again, under mild assumptions, the variance estimator $\hat{V}(\hat{t}_{y,spl})$ satisfies

$$\hat{V}\left(\hat{t}_{y,spl}\right) = V\left( \sum_{U_I} t_{yip} + \sum_{s_I} \frac{\hat{t}_{yi} - t_{yip}}{\pi_{Ii}} \right) + o_p\left(\frac{N_I^2}{n_I}\right)$$

The previous two results lead to:

$$\frac{\hat{t}_{y,spl} - t_y}{\sqrt{\hat{V}\left(\hat{t}_{y,spl}\right)}} \overset{dist}{\longrightarrow} N(0, 1)$$

Summary

Two-stage sampling designs are used frequently in natural resource monitoring and assessment
Samples are often sparse; model-free estimators will have high variance
Model-based estimators make use of auxiliary information and have good properties provided the model is correctly specified
Modeling with p-splines mitigates the problem of correctly specifying the model
Often, a model can't be fit to all study variables; model-assisted estimators still have reasonably good properties when weights from one model are applied to all study variables
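
Illustration: one set of weights from a penalized spline

The weight form of the model-assisted estimator and its location and scale invariance can be sketched numerically. The following is a minimal sketch under stated assumptions, not the implementation used for the simulations in these slides. Assumptions that are not in the slides: the function names (spline_basis, psu_weights), Poisson sampling as the pps first-stage design, a made-up bump-shaped mean function, a fixed smoothing parameter lam (rather than one chosen by fixing degrees of freedom), an auxiliary variable equal to rescaled cluster size, and estimated PSU totals simulated directly with noise instead of drawing a second-stage sample. The working model is for PSU means, so the regressors for PSU totals are N_i times the truncated-line spline basis.

```python
import numpy as np

def spline_basis(x, knots):
    """Columns 1, x, (x - kappa_l)_+ for each fixed knot kappa_l."""
    trunc = np.clip(x[:, None] - knots[None, :], 0.0, None)
    return np.column_stack([np.ones_like(x), x, trunc])

def psu_weights(x_pop, N_pop, in_sample, pi, knots, lam):
    """PSU-level weights w_i with estimator sum_i w_i * t_hat_i (hypothetical helper).

    Algebraically equal to: sum over all PSUs of model-predicted totals plus the
    inverse-probability-weighted sum of sample residuals, where the fit is a
    design-weighted, ridge-penalized least squares fit of PSU totals on N_i * basis.
    """
    Z_pop = N_pop[:, None] * spline_basis(x_pop, knots)   # regressors for PSU totals
    Z_s = Z_pop[in_sample]
    d = 1.0 / pi[in_sample]                                # first-stage design weights
    D = np.diag(np.r_[0.0, 0.0, np.ones(len(knots))])      # penalize knot terms only
    A = Z_s.T @ (d[:, None] * Z_s) + lam * D
    gap = Z_pop.sum(axis=0) - Z_s.T @ d                    # known totals minus HT totals
    return d + (d[:, None] * Z_s) @ np.linalg.solve(A, gap)

rng = np.random.default_rng(0)
n_psu, n_target = 500, 50                                  # 500 PSUs as in the simulation study
N_i = rng.integers(50, 401, size=n_psu).astype(float)      # SSUs per PSU, Uniform(50, 400)
pi = n_target * N_i / N_i.sum()                            # pps first-order inclusion probabilities
x = N_i / N_i.max()                                        # assumed auxiliary: rescaled size
in_sample = rng.random(n_psu) < pi                         # Poisson sampling (assumption)

# Hypothetical bump-shaped PSU means plus PSU-level noise
mu = 1.0 + np.exp(-50.0 * (x - 0.5) ** 2) + rng.normal(0.0, 0.1, n_psu)
# Stand-in for estimated PSU totals from a second-stage sample
t_hat = N_i[in_sample] * (mu[in_sample] + rng.normal(0.0, 0.1, in_sample.sum()))

knots = np.quantile(x, np.linspace(0.0, 1.0, 12)[1:-1])    # 10 interior knots
w = psu_weights(x, N_i, in_sample, pi, knots, lam=1e3)     # lam is an arbitrary fixed value

t_ma = w @ t_hat                                           # model-assisted estimate
t_ht = (t_hat / pi[in_sample]).sum()                       # Horvitz-Thompson estimate
print("HT:", t_ht, "model-assisted:", t_ma, "sum of N_i * mu_i:", (N_i * mu).sum())

# Location and scale invariance: y -> a*y + b shifts each PSU total by b * N_i
a, b = 2.0, 3.0
shifted = w @ (a * t_hat + b * N_i[in_sample])
print(np.isclose(shifted, a * t_ma + b * N_i.sum()))
```

Because the weights w depend only on the design and the auxiliary variable, the same vector can be applied to any study variable. The invariance check works because the intercept of the mean model is unpenalized, so the weights reproduce the population size N exactly when applied to the N_i.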