Nonparametric Survey Regression Estimation Using Penalized Splines

F. Jay Breidt*,**
Colorado State University
Jean D. Opsomer**
Iowa State University
(+ more folks acknowledged soon)
Research supported by EPA STAR Grants
R-82909501 (*CSU) and R-82909601 (**OSU)
The Usual Disclaimer

The work reported here was developed under
STAR Research Assistance Agreements CR-829095
and CR-829096 awarded by the U.S.
Environmental Protection Agency (EPA) to
Colorado State University and Oregon State
University. This presentation has not been formally
reviewed by EPA. The views expressed here are
solely those of the authors. EPA does not endorse
any products or commercial services mentioned in
this report.
Outline

Background:
Scales of inference
Specific versus generic
Model-assisted and model-based inference
Penalized splines:
Comparison to other smoothers; two-stage; small area
Variations: network data, increment data
Other:
Non-Gaussian time series
Summary:
Status of STARMAP.2 and DAMARS.5
Scales of Inference in Surveys

Large area:
sample itself suffices for inference
no model needed
Medium area:
use auxiliary information through a model
model helps inference but is not critical
Small area:
sample size is small or zero
inference must be based on a model
Specific and Generic Inference

Specific: one study variable, few population
parameters



lots of modeling resources to specify, estimate,
and diagnose a model
willingness to defend the model
Generic: many study variables, many
population parameters


no resources to model every variable
no single model is adequate/defensible
Generic Inferences in Aquatic
Resources


Generic inference is a common problem for
federal, state, and tribal agencies
Example: conduct a survey and prepare a
report



analyze large numbers of chemical, biological, and
physical variables
estimate means, quantiles, and distribution
functions
break down both by political classifications and by
various ecological classifications
Model-Assisted Survey
Inference



Scarce modeling resources for generic
inference, so we don’t trust models
Can we use a model without depending
on the model?
Model-assisted inference:


efficiency gains if model is right
sensible inference even if model is wrong
Model-Assisted Estimators

Form of model-assisted estimator:
(model-based prediction) + (design bias adjustment)
model incorporates auxiliary information
bias adjustment corrects for bad models
Classical parametric model-assisted:
prediction from linear regression model
Our idea: nonparametric model-assisted
prediction from kernel regression or other
“smoother” (JB & JO (2000), Annals of Stat)
Why Nonparametric?

More flexible model specification


smooth mean function, positive variance
function
Approximately correct more often


more opportunities for efficiency gains
from auxiliary information
often, not a large efficiency loss if
parametric specification is correct
Goals of Our Research


Focus on generic inference
Use flexible nonparametric models to
reduce misspecification bias



model-assisted: medium area problem
model-based: small area problem
Make the methods operationally feasible
for state and tribal agencies

linear smoothers generate generic weights
Penalized Splines

Very useful class of linear smoothers




Readily fits into standard linear mixed model framework
Modular, extensible, computationally convenient
Automated smoothing parameter selection and fitting with
standard software
Several ongoing projects:




Model-assisted p-spline estimation (Gerda Claeskens, JO,
JB); two-stage extensions (Mark Delorey)
Small area p-spline estimation (Gerda, Giovanna Ranalli,
Goran Kauermann, JO, JB)
Smoothing on networks (Giovanna, JB)
Semiparametric mixed models for increment-averaged core
data (Nan-Jung Hsu, Steve Ogle, JB)
Penalized Splines

Truncated linear basis allows slope
changes at each of many knots:
$$y = \beta_0 + \beta_1 x + \sum_{k=1}^{K} b_k (x - \kappa_k)_+ + \varepsilon$$
• Penalize for unnecessary slope changes:
$$\sum_{i} \Big( y_i - \beta_0 - \beta_1 x_i - \sum_{k=1}^{K} b_k (x_i - \kappa_k)_+ \Big)^2 + \lambda^2 \sum_{k=1}^{K} b_k^2$$
P-Splines: Influence of Penalty
• Fits with increasing penalty parameter
Penalized Splines Computation

Computation using S-Plus

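Set up design matrix + truncated linear splines
# x: covariate; knots: vector of K knot locations
one <- rep(1, length(x))
Z <- outer(x, knots, "-")
Z <- Z * (Z > 0)
C <- cbind(one, x, Z)

Solve for spline with fixed degrees of freedom
# pi: vector of inclusion probabilities (this masks the built-in constant pi)
# X is not defined on the slide; presumably a design matrix built like C at
# the prediction points (X = C gives fitted values at the sample points)
D <- diag(c(rep(0, 2), rep(1, K)))
mhat <- X %*% solve(t(C) %*% diag(1/pi) %*% C +
lambda^2 * D) %*% t(C) %*% diag(1/pi) %*% y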

For data-determined df/roughness penalty,
can use lme() to select via REML
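The S-Plus steps above can also be sketched in plain Python for readers without S-Plus. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`truncated_linear_basis`, `solve_linear`, `pspline_fit`, `predict`), the toy data, and the equal inclusion probabilities are all hypothetical. It solves the same weighted ridge system, (C'WC + lambda^2 D) theta = C'Wy with W = diag(1/pi), for a fixed lambda.

```python
def truncated_linear_basis(x, knots):
    """Rows [1, x_i, (x_i - k_1)_+, ..., (x_i - k_K)_+]."""
    return [[1.0, xi] + [max(xi - k, 0.0) for k in knots] for xi in x]


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def pspline_fit(x, y, knots, lam, incl_prob):
    """Coefficients minimizing the weighted penalized sum of squares."""
    C = truncated_linear_basis(x, knots)
    n, p = len(x), 2 + len(knots)
    w = [1.0 / pr for pr in incl_prob]  # survey weights 1/pi_i
    # A = C' W C + lam^2 D, where D penalizes only the knot coefficients
    A = [[sum(w[i] * C[i][a] * C[i][b] for i in range(n)) for b in range(p)]
         for a in range(p)]
    for k in range(2, p):
        A[k][k] += lam ** 2
    rhs = [sum(w[i] * C[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve_linear(A, rhs)


def predict(coef, x, knots):
    """Fitted values at the points x."""
    rows = truncated_linear_basis(x, knots)
    return [sum(c * v for c, v in zip(coef, row)) for row in rows]


# Toy data (made up): a V-shaped mean with a slope change at 0.5
x = [i / 10 for i in range(11)]
y = [abs(xi - 0.5) for xi in x]
knots = [0.25, 0.5, 0.75]
incl = [0.5] * len(x)
coef_small = pspline_fit(x, y, knots, 1e-6, incl)  # almost no penalty
coef_large = pspline_fit(x, y, knots, 1e3, incl)   # heavy penalty
```

With almost no penalty the fit reproduces the slope change at the middle knot. With a heavy penalty the knot coefficients shrink toward zero and the fit approaches a weighted straight-line regression.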
Model-Assisted P-Spline Estimator

Model-based prediction + design
bias adjustment:
$$\hat{t}_{MA} = \sum_{i \in U} \hat{m}_i + \sum_{i \in s} \frac{y_i - \hat{m}_i}{\pi_i}$$
• Asymptotically design-unbiased and design
consistent
• Asymptotic variance given by
$$N^{-2} \operatorname{Var}(\hat{t}_{MA}) = N^{-2} \sum_{i \in U} \sum_{j \in U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i - m_i}{\pi_i} \frac{y_j - m_j}{\pi_j} + o(n^{-1})$$
Design of Simulation Study

Model-assisted estimators:
Polynomial regression
Poststratification (piecewise constant)
Local polynomial regression (kernel)
Penalized spline
Model-based estimator:
Penalized spline
All use common degrees of freedom: 3 or 6
Eight response variables on one population:
Two noise levels
N=1000
Designs SI or STSI:
1000 replicate samples of size n=50
Estimator Comparisons: Common
Degrees of Freedom
MSE Ratio Relative to Model-Assisted Penalized Splines
Further Results from Simulation

Variance estimation:
For all estimators, variance estimator has negative bias
Weighted residual variance estimator performs better
Confidence interval coverage:
Somewhat less than nominal for all estimators (90-92%)
Undercoverage not as severe as bias would suggest
Negative weights: (2 df) x (2 designs) x (1000
reps) x (50 weights) = 200,000 weights
902 negative REG weights
145 negative LLR weights
2 negative MA weights
Two-Stage P-Spline Estimation

Available auxiliary information in two-stage
sampling:
All clusters
All elements
All elements in sampled clusters
Mark Delorey (poster): focus on first case
Simulation study comparing Horvitz-Thompson,
regression, model-based p-spline, and model-assisted p-spline with and without cluster random effects
Operational issues with df, cluster variance component
Some results: p-spline is good!
Semiparametric Small Area
Estimation


Gerda, Giovanna, Goran
Kauermann, JO, JB
Example: ANC level for
Northeastern lakes




557 observations over 113 HUCs
Average sample size/HUC: 4.9
64 HUCs contain fewer than 5
observations
Site-specific covariates: lake
location and elevation

Simple way to capture spatial
effects?
Semiparametric Small Area Model

Replace linear function of covariates by
more general model:



direct estimator = truth + sampling error
truth = semiparametric regression + area-specific
deviation
Semiparametric regression expressed
as linear mixed model


Thin plate splines
Low-rank radial basis functions
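One way to write the two-level structure on this slide is sketched below. The notation is an assumption, not taken from the slides: $d$ indexes small areas (HUCs), $\hat{\theta}_d$ is the direct estimator, $m(\cdot)$ is the smooth function of the covariates $x_d$, $u_d$ is the area-specific deviation, and $e_d$ is the sampling error.

```latex
\hat{\theta}_d = \theta_d + e_d, \qquad \theta_d = m(x_d) + u_d .
% With a low-rank spline basis, m(x_d) = x_d^\top \beta + z_d^\top \gamma,
% where the spline coefficients are treated as random effects:
\hat{\theta}_d = x_d^\top \beta + z_d^\top \gamma + u_d + e_d,
\qquad \gamma \sim N(0, \sigma_\gamma^2 I), \quad
u_d \sim N(0, \sigma_u^2), \quad e_d \sim N(0, \psi_d) .
```

Written this way, the spline coefficients and the area effects are both random effects in a single linear mixed model, which would explain why standard mixed-model software can produce the EBLUP.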
Small Area Estimation Results
• EBLUP for this model easily
handled with standard
software (SAS proc mixed,
SPlus lme())
P-Splines for Increment Data

Common for soil, sediment core data:

Datum represents not a single depth point but a depth
increment (e.g., cylinder of soil 2.5cm in diameter x 15cm
high, collected at 20-35 cm)


Integrate linear mixed model representation:


Ignoring increment structure leads to biased, inconsistent
estimators
Definite integral of truncated linear basis $(x-\kappa)_+$ becomes
differenced quadratic basis
$[(\text{top}-\kappa)_+]^2 - [(\text{bottom}-\kappa)_+]^2$
Immediate extension to small area estimation

E.g., soil mapping by map unit symbol
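The differenced quadratic basis can be checked numerically. The sketch below is in Python and rests on assumptions not stated on the slide: the function names are hypothetical, `lower` and `upper` are the limits of integration (the slide's "bottom" and "top" are read here as those limits), and the factor 1/2 and the division by the increment width are kept explicit, although a given formulation might absorb them into the coefficients.

```python
def integrated_truncated_linear(lower, upper, knot):
    """Integral of (x - knot)_+ over [lower, upper], in closed form."""
    def pos(v):
        return max(v - knot, 0.0)
    return 0.5 * (pos(upper) ** 2 - pos(lower) ** 2)


def increment_average_basis(lower, upper, knots):
    """Average of each truncated linear basis function over the increment."""
    width = upper - lower
    return [integrated_truncated_linear(lower, upper, k) / width for k in knots]


def midpoint_integral(lower, upper, knot, steps=1500):
    """Numerical check of the closed form using the midpoint rule."""
    h = (upper - lower) / steps
    return sum(max(lower + (j + 0.5) * h - knot, 0.0) for j in range(steps)) * h


# Increment collected at 20-35 cm, as in the slide's example; knots are made up
row = increment_average_basis(20.0, 35.0, [10.0, 25.0, 40.0])
```

Each observed increment contributes one such row to the design matrix in place of the point-evaluated basis.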
Carbon Sequestration

(Nan-Jung Hsu, Steve Ogle, JB) Broad
class of semiparametric mixed models
for increment-averaged data
Smoothing on Networks
• Current research with post-doc, Giovanna Ranalli
• have noisy data on stream network
• have within-network distance measure (rather than
“as the crow flies”)
• want interpolations at unsampled locations in network
• Semiparametric methodology readily extends to
this setting
• low-rank radial basis functions
• Possible real data from EPA (John Faustini)
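The difference between within-network distance and "as the crow flies" distance can be illustrated on a tiny example. This Python sketch assumes a hypothetical network of three straight segments meeting at a single confluence; the segment names, directions, and point locations are invented for illustration and are not the network used in the talk.

```python
import math

# Hypothetical toy network: three straight segments meeting at one confluence
# placed at the origin. Each segment is described by a unit direction vector
# pointing away from the confluence.
SEGMENTS = {
    "trib1": (-0.6, 0.8),
    "trib2": (0.6, 0.8),
    "main": (0.0, -1.0),
}


def coords(point):
    """Planar coordinates of a point given as (segment, distance to confluence)."""
    seg, d = point
    ux, uy = SEGMENTS[seg]
    return (ux * d, uy * d)


def euclidean_distance(p, q):
    """Straight-line ("as the crow flies") distance."""
    return math.dist(coords(p), coords(q))


def network_distance(p, q):
    """Distance travelled along the stream network."""
    (seg_p, d_p), (seg_q, d_q) = p, q
    if seg_p == seg_q:
        return abs(d_p - d_q)
    return d_p + d_q  # the path must pass through the confluence


p = ("trib1", 5.0)
q = ("trib2", 5.0)
d_euclid = euclidean_distance(p, q)
d_network = network_distance(p, q)
```

Two points on different tributaries can be close in the plane yet far apart along the network. A radial basis function built on the within-network distance would therefore treat them as less similar than a planar smoother would.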
Smoothing on Stream Networks

Toy stream network
Two first-order, one second-order stream segment
• Regression function is
exponential along straight reach
(two segments), constant along
remaining segment, continuous at
intersection
• n=150 noisy observations
obtained along network

Toy Network Results

Noisy observations
smoothed via



Low-rank thin plate
spline (2D, ignoring
network structure)
Within-network radial
basis functions (1D,
accounts for network
structure)
Network smooth
offers 25-30%
reduction in MISE
over spatial smooth
Non-Gaussian Time Series

Potential models for one-dimensional spatial
processes
Identification and Estimation


In Gaussian case, models of differing causality/invertibility
cannot be identified
Identification in non-Gaussian case:




Fit causal/invertible ARMA via Gaussian quasi-MLE
Examine residuals for IID-ness
If not IID, fit All-Pass model (LAD [Breidt, Davis, Trindade,
Ann. Stat. (2001)], MLE, rank estimation) to determine order
of non-causality or non-invertibility
Prediction and Estimation in non-Gaussian case:



Best MS prediction requires trickery
Exact MLE, Bayes for non-Gaussian MA
Exact and conditional MLE for MA with roots near unit circle
[Rosenblatt, Davis, Breidt, Hsu]
Asymptotic Results for All-Pass
Where Are We Now?

DAMARS.5: Nonparametric model-assisted

1. Extensions






1.1 continuous spatial domains (Siobhan; poster; Giovanna, work in
progress)
1.2 multiple phases (Kim (PhD 2004, ISU), working paper)
1.3 multiple auxiliary variables (gam: Gretchen, Goran, JO, JB, JASA 2nd
submission)
1.3-1.4 alternative smoothing (Gerda, JO, JB, p-splines; Biometrika 2nd
submission; Ranalli and Montanari, neural nets, JASA 2nd submission)
Other: two-stage kernels (Kim, JO, JB; JRSS submission); two-stage
splines (Mark, JB, poster)
2. Applications




2.1 CDF estimation (Alicia, JO, JB; poster, CJS submission)
2.2 “Medium” area (Siobhan, JO, JB; poster)
2.3 Surveys over time (Jehad Al-Jararha, JO, JB, spam with partial
overlap;)
2.4 Nonresponse (da Silva and Opsomer, Survey Methodology 2004)
Where Are We Now?

STARMAP.2: Local Inferences

1. Small area
• 1.1-1.4 Nonparametric model-assisted for spatial (Siobhan, poster;
Giovanna, work in progress); Semiparametric (Gerda, Giovanna, Goran,
JO, JB, working paper); Increments (Nan-Jung, Steve, JB, working
paper)
• 1.1 MLE for all-pass (Beth, RD, JB, JMVA submission); rank for all-pass
(Beth, RD, JB, working paper); Prediction for MA (Breidt and Hsu, Stat
Sinica 2004); Exact MLE for MA (Nan-Jung, RD, JB)
• Spatial trend detection (Hsin-Cheng Huang)
• Design aspects (Bill, JB, poster)

2. Deconvolution
• Formulated as another small area estimation problem using constrained
Bayes methods (Mark, JB, poster)
• Methodology seems OK; example (88 HUCs in MAHA) still being
tweaked; work in progress

3. Causal inference
• 3.1-3.3 (Alix G)
Some Summaries (these
projects only)

Some Invited Talks and Seminars
• Winemiller Symposium (Columbia, MO)
• Computational Environmetrics (Chicago, IL)
• Monitoring Symposium (Denver, CO)
• ICSA (Singapore)
• EMAP 2004 (Newport, RI)
• ENAR (Pittsburgh, PA)
• IWAP (Piraeus, Greece)
• IMS-ASA (Calcutta, India)
• Western Ecology Division, EPA (Corvallis, OR)
• University of Maryland (Baltimore County, MD)
• + Jean’s talks
More Summaries (these
projects only)


People
• Students: Ji-Yeon Kim, ISU PhD completed Spring 2004 (JO
and JB); Bill Coar, Mark Delorey, Jehad Al-Jararha, CSU PhD
work in progress; ISU student?
• Post-Doctoral Research Associate: Giovanna Ranalli
• Visiting Research Scientists: Nan-Jung Hsu and Hsin-Cheng
Huang
• Unsuspecting Collaborators: Gerda Claeskens and Goran
Kauermann
Papers
• 2 appeared, 2 tentatively accepted, 1 invited revision, 4
submitted, n working papers
Optimal Sampling Design under
Frame Imperfections

Motivated by problems with RF3 perennial
classification:
About 20% errors of omission and of commission!
Previous work: logistic regression for probability of perennial
as function of covariates (Bill Coar)
Compare optimal biased and unbiased designs using
anticipated MSE criterion:
Account for differential costs (in frame, not in frame;
perennial, non-perennial)
Minimize AMSE for fixed cost
Further work:
Asymptotic results for cases of negligible, non-negligible bias
Empirical results