Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design
Mark Delorey

Joint work with F. Jay Breidt and Jean Opsomer
September 8, 2005
Research supported by EPA Cooperative Agreements
R829095 and R829096
Motivation
- In resource monitoring and assessment, time and expense constraints may make two-stage sampling more efficient
  • Select a sample of watersheds; sample different bodies of water within selected watersheds
  • Select a sample of lakes; sample at different locations in selected lakes
- Samples are not always sufficiently dense in small watersheds; availability of cheap auxiliary information (primarily from GIS) suggests incorporating a model
- Auxiliary information may be available on different scales
- Often many study variables; rather than fit a model for each one, would like one set of weights that can be applied reasonably well to all variables, i.e., $\hat{y} = Hy$
Outline
- Two-stage structure
- Model-free, model-assisted, and model-based estimators
- Penalized splines
- Simulation results
- Properties of model-assisted estimator using penalized spline
Two-Stage Structure
- Population of elements U = {1,…, k,…, N} is partitioned into clusters or primary sampling units (PSUs), $U_1, \ldots, U_i, \ldots, U_{N_I}$. So,
  $$U = \bigcup_{i=1}^{N_I} U_i \quad \text{and} \quad N = \sum_{i=1}^{N_I} N_i$$
  where $N_i$ is the number of elements or secondary sampling units (SSUs) in $U_i$.
Case A: Cluster Level Auxiliaries (Our focus)
- The auxiliary information is available for all clusters in the population
- Leads to regression modeling of quantities associated with the clusters, such as cluster totals and means
- Cluster quantities can be computed for all clusters
- Population quantities can be computed from cluster estimates
- Example: Lake represents a cluster; auxiliary information is elevation
Case B: Complete Element Level Auxiliaries
- The auxiliary information is available for all elements in the population
- Leads to regression modeling of quantities associated with the elements
- Cluster and population quantities can then be computed from element estimates and observations
- Example: EMAP hexagon is cluster; lake is element; auxiliary information is elevation
Case C: Limited Element Level Auxiliaries
- The auxiliary information is available for all elements in selected clusters only
- Leads to regression modeling of quantities associated with the elements
- Regression estimators can be used for cluster-level quantities only for the clusters selected in the first-stage sample
- Population-level quantities can be estimated using design-based estimators
- Example: Aerial photography of selected sites (clusters); for each point (element) in site, we have percent forested, urban, industrial
Case D: Limited Cluster Level Auxiliaries
- The auxiliary information is available only for the clusters in the first-stage sample
- Not a very interesting case
- Design-based estimator can be used for population quantities
- Example: Cluster is lake; auxiliary information is a measure of size that is not available until the site is visited
Sampling
- First stage: A sample of clusters, $s_I$, is selected based on a design $p_I(\cdot)$ with inclusion probabilities $\pi_{Ii}$ and $\pi_{Iij}$
  • $\pi_{Ii}$ and $\pi_{Iij}$ are the first- and second-order inclusion probabilities, respectively
- Second stage: For every $i \in s_I$, a sample $s_i$ is drawn from $U_i$ based on the design $p_i(\cdot \mid s_I)$
- Typically require second-stage design to be invariant and independent of the first stage
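As a rough illustration of the two stages, here is a minimal Python sketch. It assumes simple random sampling without replacement at both stages, which the slides do not specify (the simulation later uses pps at the first stage). The function name `two_stage_sample` and its arguments are hypothetical.

```python
import random

def two_stage_sample(cluster_sizes, n_clusters, n_per_cluster, seed=0):
    """Two-stage sample under an ASSUMED design: SRS of PSUs, then SRS within each PSU.

    Returns ({PSU index: sampled element indices}, pi_I, {PSU index: pi_{k|i}}).
    """
    rng = random.Random(seed)
    N_I = len(cluster_sizes)

    # First stage: select the sample of clusters s_I.
    s_I = sorted(rng.sample(range(N_I), n_clusters))
    pi_I = n_clusters / N_I  # first-order inclusion probability, equal for all PSUs under SRS

    # Second stage: the design inside PSU i depends only on i, not on which
    # other PSUs were drawn (invariance and independence).
    sample, pi_cond = {}, {}
    for i in s_I:
        n_i = min(n_per_cluster, cluster_sizes[i])
        sample[i] = sorted(rng.sample(range(cluster_sizes[i]), n_i))
        pi_cond[i] = n_i / cluster_sizes[i]  # conditional inclusion probability pi_{k|i}
    return sample, pi_I, pi_cond
```

For example, `two_stage_sample([10, 20, 30, 40], n_clusters=2, n_per_cluster=5)` selects two of the four clusters and five elements in each.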
Other Notation
- $t_y = \sum_{U} y_k = \sum_{U_I} t_{yi}$ is the total for the variable y over the entire population
- Where required, we will assume the population model:
  $$\mu_i \sim N\left(f(x_i), \sigma^2\right)$$
  where $\mu_i$ is the mean of the y's in PSU i
- $x_i$ is some auxiliary variable that is a known quantity (usually a total or mean) for PSU i
The Estimators (for population totals)
- Model-free
- Model-assisted
- Model-based
Model-Free Estimator
- If no information other than the sampling design is available, the Horvitz-Thompson estimator is often used:
  $$\hat{t}_y = \sum_{s} \frac{y_k}{\pi_k} = \sum_{s_I} \frac{\hat{t}_{yi}}{\pi_{Ii}}, \quad \text{where} \quad \hat{t}_{yi} = \sum_{s_i} \frac{y_k}{\pi_{k|i}}$$
- Notes:
  • Always design unbiased
  • Variance is large for small sample sizes
  • Does not make use of auxiliary information
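The Horvitz-Thompson estimator for a two-stage sample can be sketched in a few lines of Python. This is only an illustration: the function name `ht_total` and the dictionary layout are hypothetical, and the sketch assumes the conditional inclusion probability is the same for every element within a PSU.

```python
def ht_total(sample_y, pi_I, pi_cond):
    """Horvitz-Thompson estimate of the population total from a two-stage sample.

    sample_y: {PSU i: list of observed y values in s_i}
    pi_I:     {PSU i: first-stage inclusion probability}
    pi_cond:  {PSU i: conditional inclusion probability pi_{k|i}} (ASSUMED equal within PSU)
    """
    total = 0.0
    for i, ys in sample_y.items():
        # Expand the observed elements to an estimated PSU total.
        t_hat_yi = sum(y / pi_cond[i] for y in ys)
        # Expand the estimated PSU total to the population.
        total += t_hat_yi / pi_I[i]
    return total
```

With `sample_y = {0: [1.0, 2.0], 1: [3.0]}`, `pi_I = {0: 0.5, 1: 0.5}` and `pi_cond = {0: 0.5, 1: 0.25}`, the estimated PSU totals are 6 and 12, and the estimated population total is 36.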
Model-Assisted Estimator
$$\hat{t}_y = \sum_{U_I} \hat{t}_{yip} + \sum_{s_I} \frac{\hat{t}_{yi} - \hat{t}_{yip}}{\pi_{Ii}}$$
where $\hat{t}_{yip}$ is the PSU total predicted by the model
- Properties:
  • Asymptotically unbiased and consistent even if model is misspecified
  • Variance is generally smaller than with HT, but larger than with the model-based estimator
  • Can incorporate auxiliary information
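The structure of the model-assisted estimator (model predictions summed over all PSUs, plus a design-weighted correction from the sampled PSUs) can be sketched as follows. The function name `model_assisted_total` and its inputs are hypothetical. The predictions are taken as given, so the sketch is agnostic about which working model (linear or spline) produced them.

```python
def model_assisted_total(pred_all, t_hat, pi_I):
    """Model-assisted estimate of the population total.

    pred_all: {PSU i: model-predicted PSU total}, for EVERY PSU in the population
    t_hat:    {PSU i: estimated PSU total}, for sampled PSUs only
    pi_I:     {PSU i: first-stage inclusion probability}, for sampled PSUs only
    """
    # Sum of predicted totals over the whole population of PSUs.
    model_part = sum(pred_all.values())
    # Design-weighted residuals over the first-stage sample.
    correction = sum((t_hat[i] - pred_all[i]) / pi_I[i] for i in t_hat)
    return model_part + correction
```

If every prediction is zero, the correction term is all that remains and the estimator reduces to Horvitz-Thompson. If the predictions match the estimated PSU totals exactly, the correction vanishes.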
Model-Based Estimator
$$\hat{t}_y = \sum_{s_I} \left[ n_i \bar{y}_i + (N_i - n_i)\hat{\mu}_i \right] + \sum_{U_I \setminus s_I} N_i \hat{\mu}_i$$
where $\bar{y}_i = \frac{1}{n_i} \sum_{s_i} y_{ij}$ and $\hat{\mu}_i$ is the ith PSU mean predicted by the model
- Properties:
  • Unbiased if model is correctly specified
  • Variance is generally smaller than with HT
  • Can incorporate auxiliary information
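A minimal Python sketch of the model-based estimator is given below. For sampled PSUs it adds the observed values to predictions for the unobserved elements; for unsampled PSUs it uses predictions only. The function name `model_based_total` and the dictionary inputs are hypothetical, and the predicted means are taken as given.

```python
def model_based_total(sample_y, N, mu_hat):
    """Model-based (prediction) estimate of the population total.

    sample_y: {PSU i: list of observed y_ij}, for sampled PSUs only
    N:        {PSU i: N_i}, for every PSU in the population
    mu_hat:   {PSU i: model-predicted PSU mean}, for every PSU in the population
    """
    total = 0.0
    for i, N_i in N.items():
        if i in sample_y:
            ys = sample_y[i]
            n_i = len(ys)
            y_bar = sum(ys) / n_i
            # Observed part plus prediction for the N_i - n_i unobserved elements.
            total += n_i * y_bar + (N_i - n_i) * mu_hat[i]
        else:
            # Unsampled PSU: every element is predicted.
            total += N_i * mu_hat[i]
    return total
```

Note that no inclusion probabilities appear anywhere in this function: the sampling design plays no role once the model is fitted.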
Notes on the Models
- 3 different models considered
  • Linear
  • Penalized spline with random effect for PSU
  • Penalized spline with no random effect for PSU
- Extend model specification for penalized spline with random effect for PSU:
  $$\mu_i \sim N\left(f(x_i), \sigma^2\right), \qquad y_{ij} \mid \mu_i \sim N\left(\mu_i, \sigma^2\right)$$
  where $y_{ij}$ is the response for the jth element in PSU i
Penalized Splines (P-Splines)
- With a linear model, we assume
  $$f(x_i) = \beta_0 + \beta_1 x_i$$
- For a penalized spline,
  $$f(x_i) = \beta_0 + \beta_1 x_i + \sum_{l=1}^{K} \beta_{l+1} (x_i - \kappa_l)_+$$
  where $\kappa_1 < \dots < \kappa_K$ are K fixed knots and $(x)_+ = x \, I\{x > 0\}$
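The truncated-linear basis can be fitted by penalized least squares. The sketch below is one common way to do this, not necessarily the authors' implementation: it assumes a ridge-type penalty `lam` applied only to the knot coefficients, which the slides do not state. All function names are hypothetical, and a small pure-Python linear solver is included so the block is self-contained.

```python
def truncated_linear_basis(x, knots):
    """Basis row [1, x, (x - k_1)_+, ..., (x - k_K)_+]."""
    return [1.0, x] + [max(x - k, 0.0) for k in knots]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [A[r][:] + [b[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def fit_pspline(xs, ys, knots, lam):
    """Minimize sum (y - C beta)^2 + lam * (sum of squared knot coefficients)."""
    C = [truncated_linear_basis(x, knots) for x in xs]
    p = 2 + len(knots)
    A = [[sum(row[a] * row[b] for row in C) for b in range(p)] for a in range(p)]
    for j in range(2, p):
        A[j][j] += lam  # ASSUMPTION: intercept and slope are left unpenalized
    rhs = [sum(row[a] * y for row, y in zip(C, ys)) for a in range(p)]
    return solve(A, rhs)

def predict(beta, x, knots):
    """Evaluate the fitted spline at x."""
    return sum(b * c for b, c in zip(beta, truncated_linear_basis(x, knots)))
```

With exactly linear data the fit recovers the intercept and slope and sets the knot coefficients to zero (up to rounding), since the straight line has zero residual and zero penalty. Larger values of `lam` shrink the knot coefficients and pull the fit toward a straight line.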
Simulation Study
- 500 PSUs; the number of SSUs per cluster ~ Uniform(50, 400)
- $\mu_{PSU} = f(\pi_I) + \epsilon$, where f(·) is one of eight functions and $\epsilon \sim N(0, \sigma^2 I)$
  • We use first-order inclusion probabilities proportional to size (pps)
  • Auxiliary data is often proportional to size of cluster
- Generate the response of interest $y_{ij} = \mu_i + \epsilon_{ij}$, where $y_{ij}$ is the jth element in the ith cluster and $\epsilon_{ij} \sim$ iid $N(0, \sigma^2)$
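The data-generating steps above might be sketched as follows. Several details are assumptions, since the slides do not give them: the number of sampled PSUs (`n_sampled`), the use of integer cluster sizes, the names `sigma2_psu` and `sigma2_elem` for the two variance parameters, and the example mean function passed as `f`.

```python
import random

def simulate_population(f, n_psu=500, n_sampled=50,
                        sigma2_psu=0.01, sigma2_elem=0.25, seed=1):
    """Generate a two-stage population following the simulation outline.

    f is the PSU-level mean function, evaluated at the first-stage
    inclusion probability (which is proportional to cluster size).
    """
    rng = random.Random(seed)

    # Cluster sizes N_i: ASSUMED integer-valued uniform on [50, 400].
    sizes = [rng.randint(50, 400) for _ in range(n_psu)]
    N = sum(sizes)

    # pps first-stage inclusion probabilities; n_sampled is HYPOTHETICAL.
    pi_I = [n_sampled * N_i / N for N_i in sizes]

    # PSU means: mean function plus PSU-level noise.
    mu = [f(p) + rng.gauss(0.0, sigma2_psu ** 0.5) for p in pi_I]

    # Element responses: PSU mean plus element-level noise.
    y = [[m + rng.gauss(0.0, sigma2_elem ** 0.5) for _ in range(N_i)]
         for m, N_i in zip(mu, sizes)]
    return sizes, pi_I, mu, y
```

For instance, `simulate_population(lambda p: 1 + 2 * p)` uses a made-up linear mean function; the eight functions used in the study are not specified in the slides.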
First Four Functions
[Figure: plots of the first four functions; panels: linear, quadratic, bump, jump]
Second Four Functions
[Figure: plots of the second four functions; panels: exponential, growth, cycle 1, cycle 4]
Some Simulation Results
Function     σ²     σ²     HT      LIN     SPL    MBRE
Linear       0.01   0.01   15.94    1.14   1.16   0.97
             0.01   0.25   10.34    4.63   1.13   0.95
             0.25   0.01    1.69    1.29   1.34   0.99
             0.25   0.25    1.20    0.98   1.02   0.94
Quadratic    0.01   0.01   28.46    9.20   1.07   0.91
             0.01   0.25   19.64   31.63   1.41   1.04
             0.25   0.01    3.61    2.48   1.06   0.97
             0.25   0.25    2.60    1.74   1.12   0.97
Bump         0.01   0.01    7.27    2.68   1.73   0.72
             0.01   0.25    6.58    3.29   1.37   1.11
             0.25   0.01    1.34    1.11   1.07   1.02
             0.25   0.25    1.41    1.11   1.17   1.03
Jump         0.01   0.01   10.94   10.38   2.54   0.87
             0.01   0.25   37.39   25.15   2.70   0.92
             0.25   0.01    4.55    2.48   1.12   0.95
             0.25   0.25    8.30    4.75   1.49   1.10
More Simulation Results
Function      σ²     σ²     HT      LIN     SPL    MBRE
Exponential   0.01   0.01   34.77    1.35   0.87   0.54
              0.01   0.25   39.47    1.96   1.85   1.14
              0.25   0.01    2.72    0.94   1.30   1.07
              0.25   0.25    3.13    0.90   1.15   1.01
Growth        0.01   0.01   12.49    4.20   1.28   0.93
              0.01   0.25   32.10   25.24   1.82   1.03
              0.25   0.01    2.80    1.68   1.20   1.04
              0.25   0.25    3.47    1.48   1.06   0.99
Cycle1        0.01   0.01   26.55    3.27   1.18   0.82
              0.01   0.25   32.01   18.80   1.37   1.05
              0.25   0.01    3.07    1.53   1.32   0.79
              0.25   0.25    2.97    2.11   1.23   1.03
Cycle4        0.01   0.01   32.96    3.52   1.17   0.87
              0.01   0.25    2.72    2.88   2.68   1.09
              0.25   0.01    1.02    1.10   1.04   0.91
              0.25   0.25    1.84    1.70   1.69   1.09
Why not use model-based?
- In survey contexts, such as those found in environmental monitoring, it is often desirable to obtain a single set of survey weights that can be used to predict any study variable. To accommodate this:
  • Smoothing parameter for spline is selected by fixing the degrees of freedom for the smooth rather than using a data-driven approach
  • With model-based, sampling design is ignored and estimates rely solely on the form of f(·)
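Fixing the degrees of freedom means choosing the smoothing parameter from the auxiliary values alone, so the resulting weights do not depend on any particular study variable. The sketch below shows one way this could be done. It assumes the truncated-linear basis with a ridge penalty on the knot coefficients, defines degrees of freedom as the trace of the smoother matrix, and finds the penalty by bisection. None of these choices is confirmed by the slides, and all function names are hypothetical.

```python
def basis(x, knots):
    """Basis row [1, x, (x - k_1)_+, ..., (x - k_K)_+]."""
    return [1.0, x] + [max(x - k, 0.0) for k in knots]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[r][:] + [b[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def smoother_df(xs, knots, lam):
    """Effective degrees of freedom: trace of (C'C + lam*D)^{-1} C'C."""
    C = [basis(x, knots) for x in xs]
    p = 2 + len(knots)
    G = [[sum(r[a] * r[b] for r in C) for b in range(p)] for a in range(p)]
    A = [row[:] for row in G]
    for j in range(2, p):
        A[j][j] += lam  # penalty on knot coefficients only (ASSUMED form)
    # Entry j of solve(A, column j of G) is the j-th diagonal element of A^{-1} G.
    return sum(solve(A, [G[r][j] for r in range(p)])[j] for j in range(p))

def lambda_for_df(xs, knots, target_df, lo=1e-10, hi=1e10, iters=200):
    """Bisection on the log scale; df falls from 2 + K toward 2 as lam grows."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if smoother_df(xs, knots, mid) > target_df:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5
```

Because `smoother_df` uses only `xs` and `knots`, the selected penalty is the same for every study variable, which is what allows a single set of weights to be reused.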
Relative MSE (Fitting to bump)
[Figure: MSE ratio by estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm); panels: linear, quadratic, bump, jump]
Relative MSE (Fitting to bump)
[Figure: MSE ratio by estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm); panels: exponential, growth, cycle 1, cycle 4]
Relative Bias (Fitting to bump)
[Figure: bias ratio by estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm); panels: linear, quadratic, bump, jump]
Relative Bias (Fitting to bump)
[Figure: bias ratio by estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm); panels: exponential, growth, cycle 1, cycle 4]
Relative Variance (Fitting to bump)
[Figure: variance ratio by estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm); panels: linear, quadratic, bump, jump]
Relative Variance (Fitting to bump)
[Figure: variance ratio by estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm); panels: exponential, growth, cycle 1, cycle 4]
Properties of Model-Assisted Estimator
- The penalized spline estimator, $\hat{t}_{y,spl}$, is a linear operator
- It is location and scale invariant, in the sense that
  $$\sum_{s} w_k (a y_k + b) = a \sum_{s} w_k y_k + N b$$
  provided an intercept is kept in the model and
  $$\sum_{k \in s_i} \frac{1}{\pi_{k|i}} = N_i$$
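To see what the invariance property requires of the weights, it helps to expand the left-hand side. Writing the estimator as a weighted sum of the sampled y values, with weights $w_k$ that do not depend on y (which is what linearity means here), one line of algebra gives:

```latex
\sum_{s} w_k (a y_k + b)
  = a \sum_{s} w_k y_k + b \sum_{s} w_k ,
\qquad \text{so the property holds for all } a, b
\text{ exactly when } \sum_{s} w_k = N .
```

In other words, the two conditions stated on the slide (an intercept in the model, and conditional weights within each sampled PSU summing to $N_i$) are the ones under which the weights sum to the population size.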
Properties of Model-Assisted Estimator
- Under mild assumptions, the penalized spline estimator, $\hat{t}_{y,spl}$, is design $\sqrt{n_I}$-consistent for $t_y$, in the sense that
  $$\frac{\hat{t}_{y,spl} - t_y}{N_I} = O_p\left(\frac{1}{\sqrt{n_I}}\right)$$
  and has the following asymptotic distributional property:
  $$\frac{\hat{t}_{y,spl} - t_y}{\sqrt{V\left(\sum_{U_I} t_{yip} + \sum_{s_I} \frac{\hat{t}_{yi} - t_{yip}}{\pi_{Ii}}\right)}} \stackrel{dist}{\to} N(0,1)$$
Properties of Model-Assisted Estimator
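- Again, under mild assumptions, the variance estimator satisfies
  $$\hat{V}\left(\hat{t}_{y,spl}\right) = V\left(\sum_{U_I} t_{yip} + \sum_{s_I} \frac{\hat{t}_{yi} - t_{yip}}{\pi_{Ii}}\right) + o_p\left(\frac{N_I^2}{n_I}\right)$$
- The previous two results lead to:
  $$\frac{\hat{t}_{y,spl} - t_y}{\sqrt{\hat{V}\left(\hat{t}_{y,spl}\right)}} \stackrel{dist}{\to} N(0,1)$$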
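The practical consequence of the asymptotic normality result is a normal-theory confidence interval for the total. A minimal sketch, assuming a point estimate and a variance estimate have already been computed (the function name `normal_ci` is hypothetical):

```python
from statistics import NormalDist

def normal_ci(t_hat, v_hat, level=0.95):
    """Normal-approximation confidence interval: t_hat +/- z * sqrt(v_hat)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for a 95% interval
    half_width = z * v_hat ** 0.5
    return t_hat - half_width, t_hat + half_width
```

For example, an estimated total of 100 with an estimated variance of 4 gives an approximate 95% interval of about (96.08, 103.92).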
Summary
- Two-stage sampling designs are used frequently in natural resource monitoring and assessment
- Sample sizes are often sparse; model-free estimators will have high variance
- Model-based estimators make use of auxiliary information and have good properties provided model is correctly specified
- Modeling with p-splines solves problem of correctly specifying model
- Often, model can't be fit to all study variables; model-assisted estimators still have reasonably good properties when weights from one model are applied to all study variables