Advanced Analysis of Complex Survey Data Part II Michael R. Elliott

Advanced Analysis of Complex
Survey Data
Part II
Michael R. Elliott
Presented at ARM 2009
J. Bienias & M. Elliott
ARM 2009: Advanced Analysis of Complex
Survey Data
Overview
• Extensions to linear regression
– Generalized linear models: logistic regression
– Survival Analysis: Cox Proportional Hazards Model
– Linear Mixed Models
• Software
Logistic Regression
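In a generalized linear model, $g(\mu_i) = x_i'\beta$ with $\mu_i = E(Y_i)$; $g$ is called the link function.
In logistic regression, the link is given by the “log-odds”:
$$g(\mu_i) = \log\left(\frac{\mu_i}{1-\mu_i}\right) = \log\left(\frac{P(Y_i = 1)}{P(Y_i = 0)}\right)$$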
β1 is interpreted as the change in log-odds of an
event occurring for a unit change in x1.
Logistic Regression
• Use of GLMs accounts for non-constant variance
caused by mean and variance being related
(recall that when $Y_i$ takes on 0 or 1, with $P(Y_i = 1) = \mu_i$,
then $\mathrm{var}(Y_i) = \mu_i(1-\mu_i)$).
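• Allows estimates of β to span real number line
without generating predicted values outside of
the range of μ:
$$\log\left(\frac{\mu_i}{1-\mu_i}\right) = \beta_0 + \beta_1 x_i \;\Rightarrow\; \frac{\mu_i}{1-\mu_i} = e^{\beta_0+\beta_1 x_i} \;\Rightarrow\; \mu_i = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}}$$
As $\hat\beta_0 + \hat\beta_1 x_i \to -\infty$, $\hat\mu_i \to 0$
As $\hat\beta_0 + \hat\beta_1 x_i \to \infty$, $\hat\mu_i \to 1$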
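The mapping from the linear predictor back to a probability can be illustrated with a minimal Python sketch. The function names `inv_logit` and `logit` and the coefficient values are assumptions chosen for illustration; they do not come from the slides.

```python
import math

def inv_logit(eta):
    """Map a linear predictor eta = b0 + b1*x to a probability between 0 and 1."""
    # Branch on the sign of eta so that exp() never receives a large positive argument.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def logit(mu):
    """Log-odds of a probability mu strictly between 0 and 1."""
    return math.log(mu / (1.0 - mu))

# Hypothetical coefficients, for illustration only.
b0, b1 = -2.0, 0.5
for x in (-100, 0, 4, 100):
    # However extreme x is, the predicted mean never leaves [0, 1].
    print(x, inv_logit(b0 + b1 * x))
```

Because the link is inverted in this way, no constraint on β is needed to keep fitted means in range.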
Logistic Regression
• Estimating logistic regression coefficients: find β such that
$$U(\beta) = \frac{\partial l}{\partial \beta} = \sum_{i=1}^{n} x_i\left(y_i - \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}\right) = 0.$$
– No closed-form solution: use Newton-Raphson
algorithm/iteratively reweighted least squares.
• PMLE: weighted estimator:
$$U_w(\beta) = \sum_{i=1}^{n} w_i x_i\left(y_i - \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}\right)$$
consistently estimates the population score equation
$$U_N(\beta) = \sum_{i=1}^{N} x_i\left(y_i - \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}\right)$$
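As a rough sketch of how the weighted score equation $U_w(\beta) = 0$ might be solved, the following Python code applies Newton-Raphson to a model with an intercept and a single covariate. The function name `fit_weighted_logistic`, the restriction to one covariate, and the toy data are assumptions made for illustration. The code yields point estimates only; design-based standard errors would still require linearization or replication.

```python
import math

def fit_weighted_logistic(x, y, w, n_iter=25, tol=1e-10):
    """Solve U_w(beta) = sum_i w_i * x_i * (y_i - mu_i) = 0 by Newton-Raphson
    for a model with an intercept b0 and one covariate coefficient b1."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        u0 = u1 = 0.0            # weighted score vector
        i00 = i01 = i11 = 0.0    # weighted information matrix (symmetric 2x2)
        for xi, yi, wi in zip(x, y, w):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - mu)
            v = wi * mu * (1.0 - mu)
            u0 += r
            u1 += r * xi
            i00 += v
            i01 += v * xi
            i11 += v * xi * xi
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} U, using the explicit 2x2 inverse.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical toy data: binary covariate, binary outcome, unequal case weights.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
w = [1, 1, 1, 3, 2, 1, 1, 1]
print(fit_weighted_logistic(x, y, w))
```

With a single binary covariate the model is saturated, so the fitted probabilities equal the weighted proportions in each group, which gives a simple way to check the result by hand.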
Logistic Regression
Logistic Regression, con.
Example – Log. Reg.
• Example: Snuff Use Among Men in the 1987 National
Health Interview Survey (NHIS).
• Table 6.2-2 presents two analyses of probability of
snuff use as a function of demographic variables.
• With 186 degrees of freedom, can use normal or chi-square
reference distribution instead of t- or F-distribution.
• Wald test for significance of race, with four categories,
uses a chi-square statistic with 3 degrees of freedom.
plus MSA, education, occupation,…
Logistic Regression Diagnostics
• Similar to multiple linear regression, there are
Added Variable Plots and Partial Residual
Plots.
Added Variable Plots for Log. Reg.
• Example: Development of Cancer as Function of
Transferrin Saturation for Women in the NHANES
I Epidemiologic Follow-up Study.
• Log-odds of probability of developing cancer in a
follow-up period modeled as a linear function of
transferrin saturation, smoking, race, income,
enumeration district (poverty versus nonpoverty)
and age. For transferrin saturation
• Added variable plot for transferrin saturation on
next page.
The areas of the bubbles are proportional to sample weights. Dashed line is weighted
least-squares regression line; slope 0.025. If pt. A is dropped it results in
Partial Residual Plots for Log. Reg.
• Example: Snuff Use Among Men Sampled in the
1987 NHIS.
• Used partial residual plot to determine the
functional relationship of age in the logistic
regression model. Model had age, race, region,
metropolitan statistical area, education,
occupation and marital status main effects.
• Visual inspection of partial residual plot for age
not too helpful because partial residuals for
15,513 non-snuff users pushed down to zero
because of scaling of y-axis.
• Use of kernel smoother suggests a quadratic term in
age.
Survival Analysis
• “Time-to-event” data: death, disability, divorce,
etc.
• Often censored before event occurs: observe
$Y_i = \min(T_i, C_i)$ and $\delta_i = I(T_i < C_i)$, where $T_i$ is the time of
the event and $C_i$ is the (noninformative) censoring
time.
Survival Analysis
• Survival function: $S(t) = 1 - F(t)$
(probability event has not occurred by time t)
• Hazard function:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T \le t + \Delta t \mid T \ge t)}{\Delta t} = -\frac{d}{dt}\ln S(t)$$
(instantaneous risk of an event occurring at time t)
• Cumulative hazard function: $H(t) = \int_0^t h(z)\,dz$
(“total rate” of experiencing the event up to
time t – also expected number of events).
– Also have
$$H(t) = -\int_0^t \frac{d}{dz}\ln S(z)\,dz = -\ln S(t).$$
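As a quick check on these relationships, consider the simplest case of a constant hazard. The rate $\lambda$ is a symbol introduced here purely for illustration.

```latex
% Constant hazard (exponential event times), \lambda > 0 assumed:
h(t) = \lambda
\;\Rightarrow\;
H(t) = \int_0^t \lambda \, dz = \lambda t
\;\Rightarrow\;
S(t) = e^{-H(t)} = e^{-\lambda t},
\qquad
-\frac{d}{dt}\ln S(t) = -\frac{d}{dt}(-\lambda t) = \lambda = h(t).
```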
Survival Function
• Estimating the survival function (1 − CDF) in the
presence of censoring.
– Normally estimate CDF as $\#(t_i \le t)/n$ and thus S as
$\#(t_i > t)/n$, but censoring will cause bias.
– Kaplan-Meier estimator accounts for censoring:
$$\hat S(t) = \prod_{t_i \le t} \hat P(T > t_i \mid T > t_{i-1}) = \prod_{t_i \le t}\left(1 - \frac{d_i}{Y_i}\right)$$
since
$$\hat P(T > t_i \mid T > t_{i-1}) = \frac{Y_i - d_i}{Y_i},$$
where $Y_i$ is the number of
subjects at risk (not censored) at time $t_i$ and $d_i$ is
the number that die at time $t_i$. (Subjects ordered
by time of death/censoring.)
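The product-limit calculation can be sketched in a few lines of Python. The function name `kaplan_meier`, the optional `weights` argument, and the toy data are assumptions for illustration. With all weights equal to 1 the function reproduces the estimator above; with case weights it replaces the counts $d_i$ and $Y_i$ by weighted totals.

```python
def kaplan_meier(time, event, weights=None):
    """Return [(t, S_hat(t))] evaluated at each distinct event time.

    time    : observed times Y_i = min(T_i, C_i)
    event   : 1 if the event was observed, 0 if censored
    weights : optional case weights (defaults to 1 for every subject)
    """
    if weights is None:
        weights = [1.0] * len(time)
    surv = 1.0
    curve = []
    # Step through the distinct times at which at least one event occurred.
    for t in sorted({ti for ti, di in zip(time, event) if di}):
        # (Weighted) number still under observation just before t.
        at_risk = sum(w for ti, w in zip(time, weights) if ti >= t)
        # (Weighted) number of events occurring exactly at t.
        deaths = sum(w for ti, di, w in zip(time, event, weights)
                     if di and ti == t)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical data: subjects 3 and 4 (times 3 and 5) are censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Censored subjects contribute to the at-risk totals up to their censoring time but never to the event counts, which is how the estimator avoids the bias of the naive proportion.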
Survival Function
• Example (`+’ means censored)
Survival Function
• Kaplan-Meier Curve (plot of $\hat S(t)$ against t).
Cox Proportional Hazard Model
Cox Proportional Hazard Model: models hazard as a
multiplicative function of covariates (linear on the log-hazard scale):
$$h(t) = h_0(t)\exp(x'\beta)$$
• βk is the change in the log-hazard associated with
a unit change in xk.
• Can estimate β without needing to estimate $h_0(t)$
(semiparametric model):
$$L(\beta) = \prod_{i=1}^{n}\left[\frac{\exp(x_i'\beta)}{\sum_{j \in R(t_i)} \exp(x_j'\beta)}\right]^{\delta_i}, \quad R(t_i) \equiv \text{set of subjects without an event by time } t_i$$
$$U(\beta) = \sum_{i=1}^{n} \delta_i\left(x_i - \frac{\sum_{j \in R(t_i)} x_j \exp(x_j'\beta)}{\sum_{j \in R(t_i)} \exp(x_j'\beta)}\right)$$
Survival Analysis: Survey Data
• Accommodate case weights using same trick as in
linear and generalized linear models.
– K-M estimator: Replace unweighted total that died at
time $t_i$ and unweighted total at risk with equivalent
weighted totals:
$$\hat S_w(t) = \prod_{t_i \le t}\left(1 - \frac{d_i^w}{Y_i^w}\right)$$
– PH model: Replace unweighted sums in $U(\beta)$ with
weighted sums:
$$U_w(\beta) = \sum_{i=1}^{n} w_i\,\delta_i\left(x_i - \frac{\sum_{j \in R(t_i)} w_j x_j \exp(x_j'\beta)}{\sum_{j \in R(t_i)} w_j \exp(x_j'\beta)}\right)$$
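A minimal Python sketch of the weighted score $U_w(\beta)$ for a single covariate is given below, together with a simple root-finder. The function names `weighted_cox_score` and `solve_cox`, the use of bisection rather than Newton-Raphson, the search interval, and the toy data are all assumptions for illustration. The risk set is taken to be everyone whose observed time is at least $t_i$, and no tied event times are assumed.

```python
import math

def weighted_cox_score(beta, time, event, x, w):
    """U_w(beta) for one covariate: sum over observed events of
    w_i * (x_i - weighted risk-set mean of x at time t_i)."""
    score = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue  # censored subjects contribute only through risk sets
        num = den = 0.0
        for j in range(len(time)):
            if time[j] >= time[i]:  # subject j is in the risk set R(t_i)
                e = w[j] * math.exp(beta * x[j])
                num += e * x[j]
                den += e
        score += w[i] * (x[i] - num / den)
    return score

def solve_cox(time, event, x, w, lo=-10.0, hi=10.0, tol=1e-10):
    """Find beta with U_w(beta) = 0 by bisection. The score is non-increasing
    in beta because the log partial likelihood is concave."""
    if (weighted_cox_score(lo, time, event, x, w) < 0
            or weighted_cox_score(hi, time, event, x, w) > 0):
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_cox_score(mid, time, event, x, w) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical data: four subjects, the third is censored, equal weights.
time = [1, 2, 3, 4]
event = [1, 1, 0, 1]
x = [1, 0, 1, 0]
w = [1.0, 1.0, 1.0, 1.0]
print(solve_cox(time, event, x, w))
```

Setting every weight to 1 recovers the unweighted score $U(\beta)$, so the same code covers both cases.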
Survival Analysis: Survey Data
• Linearized estimators of the variance also
available, but must be more careful in computing
since estimating equation is no longer a linear
combination of weighted sums.
– Same Wald-type test statistics with df determined by
sample design can be used as in linear and generalized
linear regression.
Example – Cox PH Model
• Data from 4th wave of the America’s Changing
Lives study
– Face-to-face survey: stratified, multistage unequal
probability of selection sample design.
– Oversample of African-Americans and persons 60+.
– 3,617 subjects enrolled in 1986 and followed in three
follow-up waves through 2003.
– 2,971 either dead or alive and followed through 4th
wave.
• Risk of death as a function of age, gender, race,
smoking status, and BMI.
Example – Cox PH Model
Unadjusted for design:
           coef  exp(coef)  se(coef)        z        p
mage     0.0825      1.086   0.00257    32.08  0.0e+00
female  -0.5005      0.606   0.06077    -8.24  2.2e-16
nonaa   -0.4426      0.642   0.06165    -7.18  7.0e-13
smoke1   0.5400      1.716   0.06961     7.76  8.7e-15
bmi1    -0.0110      0.989   0.00625    -1.76  7.9e-02

Adjusted for design:
           coef  exp(coef)  se(coef)        z        p
mage    0.08864      1.093   0.00525   16.889  0.0e+00
female -0.61771      0.539   0.08350   -7.398  1.4e-13
nonaa  -0.48441      0.616   0.07618   -6.359  2.0e-10
smoke1  0.55158      1.736   0.10280    5.366  8.1e-08
bmi1   -0.00565      0.994   0.01259   -0.449  6.5e-01
Cox PH Model: Model Checking
Proportional Hazards Assumption
PH assumption implies
$$S(t; x_i) = S_0(t)^{\exp(x_i'\beta)}$$
$$\Rightarrow \log S(t; x_i) = \exp(x_i'\beta)\,\log S_0(t)$$
$$\Rightarrow \log(-\log S(t; x_i)) = x_i'\beta + \log(-\log S_0(t))$$
Can check by plotting $\log(-\log \hat S_h(t))$ against t for
categorical variables with levels h = 1,…,H.
– Resulting lines should be parallel.
Cox PH Model: Model Checking
Proportional Hazards Assumption
[Figure: log(−log(S(t))) plotted against years of follow-up (0 to 15) in two panels: A-A vs. non A-A, and female vs. male.]
Cox PH Model: Model Checking
Proportional Hazards Assumption
• Can relax PH assumption by including interaction
term between time and covariate: $\beta_1 x + \beta_2 x t = (\beta_1 + \beta_2 t)x$
– $\beta_1$ is the instantaneous risk associated with x at the
start of follow-up.
– $\beta_1 + \beta_2 t$ is the instantaneous risk associated with x at
time t.
– Details require introducing time-dependent
covariates.
Cox PH Model: Model Checking
Martingale Residuals
Martingale Residuals: $M_i = \delta_i - \hat H_i(t)$.
Can use to consider functional form of
covariates (as in linear and generalized linear
models): plot $M_i$ against $x_i$ to look for non-linear
trends.
Cox PH Model: Model Checking
Martingale Residuals
[Figure: martingale residuals M plotted against age.]
Linear Mixed Models
• Setting of multiple observations per subject:
$Y_j = (Y_{j1}, \ldots, Y_{j,n_j})$
• Assume that linear regression model for outcome
includes unobserved subject-specific effects:
$$Y_j = X_j\beta + Z_j b_j + \varepsilon_j, \quad b_j \overset{iid}{\sim} N(0, G), \quad \varepsilon_j \overset{iid}{\sim} N(0, \sigma^2 I), \quad b_j \perp \varepsilon_j$$
• If $Z_j = X_j$, equivalent to each subject having
specific regression parameters drawn from a
common distribution:
$$Y_j = X_j\beta_j + \varepsilon_j, \quad \beta_j \overset{iid}{\sim} N(\beta, G), \quad \varepsilon_j \overset{iid}{\sim} N(0, \sigma^2 I), \quad \beta_j \perp \varepsilon_j$$
Linear Mixed Models
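• Common example is random slope-intercept
model:
$$y_{ji} = \beta_0 + \beta_1 t_{ji} + b_{j0} + b_{j1} t_{ji} + \varepsilon_{ji}$$
$$\begin{pmatrix} b_{j0} \\ b_{j1} \end{pmatrix} \overset{iid}{\sim} N_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\; G = \begin{pmatrix} g_{11} & g_{12} \\ g_{12} & g_{22} \end{pmatrix}\right), \quad \varepsilon_{ji} \overset{iid}{\sim} N(0, \sigma^2)$$
Can obtain estimates $\hat b_j$, so can estimate both a
population mean $\hat\beta_0 + \hat\beta_1 t_{ji}$ and a subject-specific
mean $(\hat\beta_0 + \hat b_{j0}) + (\hat\beta_1 + \hat b_{j1}) t_{ji}$.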
Linear Mixed Models
Ex: 20 subjects randomized to 2 treatment groups
and given a fever-lowering drug:
[Figure: fitted lines by subject. Legend: solid line: pop. mean; dotted line: subject-specific slope-intercept; dashed line: regression within subject.]
Linear Mixed Models
• Accounts for within-subject correlation
• Increases precision for within-subject covariates
(e.g., slope)
• Parcels out within-subject vs. between-subject
variance (in $\sigma^2$ and $G$, respectively).
• Estimation:
(1) Fix G and σ at some initial estimates $G^{(0)}$ and
$\sigma^{(0)}$ and estimate $\beta^{(1)}$ using weighted LS.
(2) Fix β at $\beta^{(1)}$ and estimate $G^{(1)}$ and $\sigma^{(1)}$.
(3) Repeat (1) and (2) until convergence.
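For reference, the weighted least-squares step (1) can be written out under the model $Y_j = X_j\beta + Z_j b_j + \varepsilon_j$ given earlier. The symbol $V_j$ for the marginal covariance of $Y_j$ is notation introduced here, not taken from the slides.

```latex
% Step (1): with G and \sigma^2 held fixed, the marginal covariance of Y_j is
V_j = Z_j G Z_j' + \sigma^2 I ,
% and the weighted (generalized) least-squares estimate of \beta is
\hat\beta = \Bigl( \sum_j X_j' V_j^{-1} X_j \Bigr)^{-1} \sum_j X_j' V_j^{-1} Y_j .
```

When $G = 0$ every $V_j$ is proportional to the identity and the expression reduces to ordinary least squares.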
Linear Mixed Models: Survey Data
• Difficult to formulate pseudo-maximum
likelihood because census log-likelihood is no
longer a sum – clustering at the population level.
• Instead replace iterative estimates of G, σ, and β
with their weighted equivalents:
– Assumes 2-stage sampling, with second stage of
sampling forming the “subject” (cluster)
– Requires knowing both first stage inclusion
probabilities πk and second stage inclusion
probabilities πi|k to compute correct weighted
estimates
– Need to normalize so that sum of $w_{i|k}$ is the sample size
$n_k$, not population size $N_k$, otherwise get biased
estimates of between-subject variance.
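The normalization of the second-stage weights can be sketched as follows. The function name `normalize_within_cluster`, the dictionary layout, and the example weights are assumptions for illustration; the raw weights are taken to be $1/\pi_{i|k}$.

```python
def normalize_within_cluster(weights_by_cluster):
    """Rescale second-stage weights w_{i|k} so that, within each cluster k,
    they sum to the cluster sample size n_k instead of the population size N_k."""
    scaled = {}
    for k, w in weights_by_cluster.items():
        n_k = len(w)      # number of sampled units in cluster k
        total = sum(w)    # sum of raw weights, which estimates N_k
        scaled[k] = [wi * n_k / total for wi in w]
    return scaled

# Hypothetical raw weights 1 / pi_{i|k} for two sampled clusters.
raw = {"a": [10.0, 30.0], "b": [5.0, 5.0, 10.0]}
print(normalize_within_cluster(raw))
```

The rescaling preserves the relative weights within each cluster, so units that were undersampled still count for more than their neighbours.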
Example--Linear Mixed Models
• Partners for Child Passenger Safety is a
probability sample of 31,985 children in
20,202 passenger vehicle crashes involving State Farm-insured vehicles.
• Stratification and unequal probability of
selection (oversample severe crashes)
• Clustering: data collected for all children in
vehicle if sampled.
• Goal is to estimate rate of “consequential”
injuries.
Example--Linear Mixed Models
Consider the model
$$\log\left(\frac{P(Y_{ji} = 1)}{1 - P(Y_{ji} = 1)}\right) = \beta_0 + b_j, \quad b_j \sim N(0, \psi_1)$$
versus
$$\log\left(\frac{P(Y_{ji} = 1)}{1 - P(Y_{ji} = 1)}\right) = \beta_0 + \beta_1 x_j + b_j, \quad b_j \sim N(0, \psi_2)$$
where j indexes the crash, i the child in the
crash, $Y_{ji}$ is an indicator of injury, and $x_j$ is an
indicator for towaway status of the vehicle.
Example--Linear Mixed Models
• The first model absorbs all of the crash-specific
variability of injury risk into $b_j$. Second
model strips out the variability of injury risk
explained by the towaway status of the
vehicle before absorbing remaining crash-specific
components into $b_j$.
• $\hat\psi_1 = 6.70$, $\hat\psi_2 = 5.04$, indicating that the drivability
of the vehicle explains about 25% of the injury
risk variability inherent in a crash.
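The 25% figure follows from the proportional reduction in the estimated random-effect variance between the two models:

```latex
\frac{\hat\psi_1 - \hat\psi_2}{\hat\psi_1}
  = \frac{6.70 - 5.04}{6.70}
  = \frac{1.66}{6.70}
  \approx 0.248 .
```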
Modeling in a Finite Population
Context: Summary of Basics
• Estimating “traditional” statistical models in
survey context
– Estimating equivalent finite population quantity such
as a least squares regression slope or logistic
regression parameter.
– Assuming population is generated as a simple random
sample from an infinite superpopulation that is
generated under the model.
• Standard modeling methods implicitly assume iid
sample.
• Need to accommodate sample design in
inference.
Modeling in a Finite Population
Context: Summary of Key Points
Point Estimation
• Case weights to account for unequal probability
of selection
– Model misspecification: interaction between
probability of selection and parameter of interest.
• Mixture models must assume correct specification of mean
to obtain correct estimates of variance components.
– Informative sampling: probability of selection is
associated with residuals in regression model.
– Can test with interaction term between weights and
parameter (e.g., regression slopes).
Modeling in a Finite Population
Context: Summary of Key Points
Variance Estimation
• Clustering, stratification
– Often done for cost or statistical efficiency
• Ignoring clustering typically underestimates
variance, ignoring stratification typically
overestimates variance.
• As in descriptive statistic setting, we
accommodate using linearization or replication
methods that account for clustering and
stratification.
Software
• Until past 5-10 years, only specialized software
could accommodate complex sample designs
(WesVarPC, SUDAAN, Epi Info)
• Now SAS, Stata and R can handle many standard
regression models, although specialized software
is still required for mixture models.
SAS: http://www.sas.com
Stata: http://www.stata.com
R: http://www.r-project.org/
IVEWare: http://www.isr.umich.edu/src/smp/ive/
MPlus: http://www.statmodel.com/
HLM: http://www.ssicentral.com/hlm/student.html
Software
SAS
Stata
R*
IVEWare MPlus
HLM
Linear
Models
X
X
X
X
X
X
GLMs
X
X
X
X
X
X
X
X
X
Linear
Mixed
Models
X**
X
GLMMs
X **
X
Survival
Models
*requires “survey” package.
**requires equal within-cluster weights.
Acknowledgments
• Thanks to Barry Graubard, National Cancer Institute
• Analysis of Health Surveys, EL Korn & BI Graubard (Wiley,
1999)
• “Inference for Superpopulation Parameters using Sample
Surveys.” BI Graubard & EL Korn, Statistical Science, 17
(2002), 73-96.
• “Finite Population Correction Factors (Panel Discussion).” K
Rust, B Graubard, WA Fuller, SL Stokes, & PS Kott. ASA
Proceedings, 2006.
• Bienias, J. L. (2001). Replicate-based variance estimation in
a SAS® macro. [Available from http://www.nesug.org.]
Contact Information
• mrelliot@umich.edu
• jbienias@alum.wustl.edu
• graubarb@exchange.nih.edu