Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design

Mark Delorey, F. Jay Breidt, Colorado State University
Abstract
In surveys of aquatic resources, a two-stage sampling design can be employed to
make the best use of what are often limited time and financial resources.
Even with the ability to focus such resources, it is often the case that the
sample sizes are not sufficiently large to make model-free inferences.
The presence of auxiliary information for the regions of interest suggests
employing a model in our inferences. Breidt, Claeskens, and Opsomer
(2003) propose incorporating this auxiliary information through a class
of model-assisted estimators based on penalized spline regression in
single-stage sampling. Zheng and Little (2003) also use penalized spline
regression, but in a model-based approach to finite population estimation from
a two-stage sample. In a survey context, weights computed from a set of
auxiliary information are often applied to many study variables. With this
approach, model-assisted estimators should fare better than model-based
estimators. We compare the two through a series of simulations.
Two-Stage Sampling
• The population of elements $U = \{1, \ldots, k, \ldots, N\}$ is partitioned into
clusters or primary sampling units (PSUs), $U_1, \ldots, U_i, \ldots, U_{N_I}$. So,
$U = \bigcup_{i=1}^{N_I} U_i$ and $N = \sum_{i=1}^{N_I} N_i$,
where $N_i$ is the number of elements or secondary sampling units
(SSUs) in $U_i$.
• First stage: A sample of clusters, $s_I$, is selected based on a design,
$p_I(\cdot)$, with inclusion probabilities $\pi_{Ii}$ and $\pi_{Iij}$.
– $\pi_{Ii}$ and $\pi_{Iij}$ are the first and second order inclusion probabilities, respectively
• Second stage: For every $i \in s_I$, a sample $s_i$ is drawn from $U_i$ based on
the design $p_i(\cdot \mid s_I)$
• Typically require second stage design to be invariant and independent
of the first stage
The Estimators (for population totals)
• Horvitz-Thompson (HT)
$\hat{t}_y = \sum_{i \in s_I} \frac{\hat{t}_{yi}}{\pi_{Ii}}$, where $\hat{t}_{yi} = \sum_{k \in s_i} \frac{y_k}{\pi_{k|i}}$
• Model-assisted
$\hat{t}_y = \sum_{i \in U_I} \hat{t}_{yip} + \sum_{i \in s_I} \frac{\hat{t}_{yi} - \hat{t}_{yip}}{\pi_{Ii}}$,
where $\hat{t}_{yip}$ is the PSU total predicted by the model and the first sum runs over all PSUs in the population
• Model-based
$\hat{t}_y = \sum_{i \in s_I} n_i (\bar{y}_i - \hat{\mu}_i) + \sum_{i \in U_I} N_i \hat{\mu}_i$
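A minimal Python sketch of the three total estimators listed above, assuming NumPy and assuming that the PSU-level predictions come from some already fitted model. The function and argument names are illustrative and not from the poster; the second-stage inclusion probabilities and the model predictions are taken as given inputs.

```python
import numpy as np

def psu_total_hat(y, pi_k_given_i):
    """Second-stage estimate of one PSU total: sum over s_i of y_k / pi_{k|i}."""
    return float(np.sum(np.asarray(y, float) / np.asarray(pi_k_given_i, float)))

def ht_total(t_hat, pi_I):
    """Horvitz-Thompson total: sum over sampled PSUs of t_hat_i / pi_Ii."""
    return float(np.sum(np.asarray(t_hat, float) / np.asarray(pi_I, float)))

def model_assisted_total(t_pred_all, sampled, t_hat, pi_I):
    """Predicted totals summed over all PSUs, plus design-weighted residuals.

    t_pred_all : model-predicted total for every PSU in the population
    sampled    : indices of the sampled PSUs (hypothetical input layout)
    t_hat, pi_I: estimated totals and inclusion probabilities, aligned with `sampled`
    """
    t_pred_all = np.asarray(t_pred_all, float)
    resid = np.asarray(t_hat, float) - t_pred_all[sampled]
    return float(t_pred_all.sum() + np.sum(resid / np.asarray(pi_I, float)))

def model_based_total(N_all, mu_pred_all, sampled, n, ybar):
    """Sum over sampled PSUs of n_i (ybar_i - mu_hat_i), plus sum over all PSUs of N_i mu_hat_i."""
    N_all = np.asarray(N_all, float)
    mu = np.asarray(mu_pred_all, float)
    correction = np.asarray(n, float) * (np.asarray(ybar, float) - mu[sampled])
    return float(correction.sum() + np.sum(N_all * mu))
```

When the predictions are all zero, `model_assisted_total` reduces to `ht_total`, which is a quick sanity check on any implementation.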
Case B: Complete Element Level Auxiliaries
• The auxiliary information is available for all elements in the
population
• Leads to regression modeling of quantities associated with the
elements
• Cluster and population quantities can then be computed from element
estimates and observations
• Example: EMAP hexagon is cluster; lake is element; auxiliary
information is elevation
Two-Stage Sampling with Aquatic Resources
• Time and expense constraints may make two-stage sampling more
efficient
• Auxiliary information may be available on different scales
Generating Responses
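• 500 PSUs; the number of SSUs per cluster ~ Uniform(50, 400)
• $\mu_{PSU} = m(\pi_I) + \varepsilon$, where $m(\cdot)$ is one of the eight functions below and
$\varepsilon \sim N(0, \sigma^2 I)$
– We use first order inclusion probabilities proportional to size (pps)
– Auxiliary data is often proportional to size of cluster
• Response of interest $y_{ij} = \mu_i + \varepsilon_{ij}$, where $y_{ij}$ is the jth element in the
ith cluster and $\varepsilon_{ij} \sim$ iid $N(0, \sigma^2)$
[Figure: the eight mean functions m(·); panel labels: linear, quadratic, bump, jump, exponential, growth, cycle 1, cycle 4]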
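The response-generating steps in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the first-stage sample size `n_I`, both standard deviations, and the linear mean function are placeholder values that the poster does not give, and Uniform(50, 400) is read as a discrete uniform on the integers 50 to 400.

```python
import numpy as np

def generate_population(seed=0, n_psu=500, n_I=50, sigma_psu=0.1, sigma_elem=0.5):
    """Generate one two-stage population (all default values are assumptions)."""
    rng = np.random.default_rng(seed)

    # Number of SSUs per PSU, integers from 50 to 400 inclusive.
    N_i = rng.integers(50, 401, size=n_psu)

    # First-order inclusion probabilities proportional to PSU size (pps);
    # they sum to the assumed first-stage sample size n_I.
    pi_I = n_I * N_i / N_i.sum()

    # Placeholder mean function: linear in the rescaled inclusion probability.
    x = (pi_I - pi_I.min()) / (pi_I.max() - pi_I.min())
    m = 1.0 + 2.0 * x

    # PSU means: mean function plus PSU-level normal noise.
    mu = m + rng.normal(0.0, sigma_psu, size=n_psu)

    # Element responses: PSU mean plus iid element-level normal noise.
    y = [mu_i + rng.normal(0.0, sigma_elem, size=int(size))
         for mu_i, size in zip(mu, N_i)]
    return N_i, pi_I, mu, y
```

Swapping in another of the eight mean functions only requires replacing the line that computes `m`.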
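In the model-based estimator, $\bar{y}_i = \frac{1}{n_i} \sum_{j \in s_i} y_j$ is the sample mean of the ith cluster and $\hat{\mu}_i$ is the ith cluster mean
predicted by the model.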
Comments on Simulation Results
• 500 samples from each of the populations were drawn
• Estimator labels:
– H-T = Horvitz-Thompson estimator
– M-A: lin = Model-assisted estimator using a linear model
– M-B: pmmra = Model-based estimator using a penalized spline and including a random effect for PSU
– M-A: pmm = Model-assisted estimator using a penalized spline with no random effect for PSU
• Each point represents the ratio MSE(estimator) : MSE(model-assisted estimator with random effect for PSU)
• Vertical black bars represent approximate 95% confidence intervals
• Model-assisted estimator with random effect for PSU is as efficient as or more efficient than the model-based estimator; we
do not appear to lose efficiency (with respect to MSE) by using model-assisted nonparametric methods
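The MSE ratio plotted for each estimator can be computed from the simulated estimates roughly as below. This is a sketch with hypothetical function names; it assumes the true population total is known from the simulation, and it leaves out the confidence intervals because their construction is not described here.

```python
import numpy as np

def mse(estimates, true_total):
    """Monte Carlo mean squared error of an estimator over repeated samples."""
    e = np.asarray(estimates, dtype=float)
    return float(np.mean((e - true_total) ** 2))

def mse_ratio(estimates, reference_estimates, true_total):
    """MSE of one estimator divided by the MSE of the reference estimator."""
    return mse(estimates, true_total) / mse(reference_estimates, true_total)
```

A ratio above 1 means the estimator in the numerator has larger MSE than the reference estimator for that population.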
[Figure: MSE ratio for each estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm), one panel per mean function: linear, quadratic, exponential, growth, jump, bump, cycle 1, cycle 4; vertical axis: MSE ratio]
Case D: Limited Cluster Level Auxiliaries
• The auxiliary information is available for all clusters in the first-stage
sample
• Not a very interesting case
• Design-based estimator can be used for population quantities
• In some cases, good estimators for population quantities are not
available
• Example: Cluster is lake; auxiliary information is measure of size
which is not available until site is visited
Case C: Limited Element Level Auxiliaries
• The auxiliary information is available for all elements in selected
clusters only
• Leads to regression modeling of quantities associated with the
elements
• Regression estimators can be used for cluster-level quantities only for
the clusters selected in the first-stage sample
• Example: Aerial photography of selected sites (clusters); for each
point (element) in site, we have percent forested, urban, industrial
Case A: Cluster Level Auxiliaries (Our focus)
• The auxiliary information is available for all clusters in the population
• Leads to regression modeling of quantities associated with the
clusters, such as cluster totals
• Cluster quantities can be computed for all clusters
• Population quantities can be computed from cluster estimates
• Example: Lake represents a cluster; auxiliary information is elevation
Notes on the Models and Model Parameters
• 3 different models used
– Linear
– Penalized spline with random effect for PSU
– Penalized spline with no random effect for PSU
• In a survey context, such as those found in
environmental monitoring, it is often desirable to obtain
a single set of survey weights that can be used to predict
any study variable. To accommodate this:
– Smoothing parameter for spline is selected by fixing
the degrees of freedom for the smooth rather than
using a data driven approach
– Variance component for PSU effect is computed for
the linear model and resulting covariance matrix and
corresponding survey weights are applied to samples
from other data sets
– In this kind of survey context, model-assisted
estimators have good efficiency properties and should
be superior to model-based estimators which rely on
correct specification of variance components
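Fixing the degrees of freedom of the smooth, rather than tuning it to the data, can be sketched in Python as below. The sketch assumes a truncated linear spline basis, an unweighted fit, and a ridge penalty on the knot coefficients only; the poster specifies none of these, and all function names are hypothetical.

```python
import numpy as np

def spline_design(x, knots):
    """Design matrix [1, x, (x - k_1)_+, ..., (x - k_K)_+] (assumed basis)."""
    x = np.asarray(x, dtype=float)
    trunc = np.maximum(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, trunc])

def effective_df(C, lam, n_unpenalized=2):
    """Trace of the smoother matrix: tr[(C'C + lam * D)^{-1} C'C]."""
    D = np.eye(C.shape[1])
    D[:n_unpenalized, :n_unpenalized] = 0.0  # intercept and slope are not penalized
    CtC = C.T @ C
    return float(np.trace(np.linalg.solve(CtC + lam * D, CtC)))

def lambda_for_df(C, target_df, lo=1e-8, hi=1e8, iters=200):
    """Bisection on the log scale for the penalty giving the target degrees of freedom."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if effective_df(C, mid) > target_df:
            lo = mid  # fit is too flexible, so increase the penalty
        else:
            hi = mid
    return float(np.sqrt(lo * hi))
```

Because `effective_df` depends only on the auxiliary variable and not on any response, the penalty chosen this way is the same for every study variable, which is what allows a single set of weights to be reused.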
Funding/Disclaimer
The work reported here was developed under the STAR Research Assistance Agreement CR-829095 and
CR-829096 awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University.
This poster has not been formally reviewed by EPA. The views expressed here are solely those of the
presenter and the STARMAP, the Program he represents. EPA does not endorse any products or
commercial services mentioned in this poster.
This research is funded by the U.S. EPA Science To Achieve Results (STAR) Program,
Cooperative Agreements # CR-829095 and # CR-829096.