Advanced Analysis of Complex Survey Data, Part II
Michael R. Elliott
Presented at ARM 2009
J. Bienias & M. Elliott, ARM 2009: Advanced Analysis of Complex Survey Data

Overview
• Extensions to linear regression
– Generalized linear models: logistic regression
– Survival analysis: Cox proportional hazards model
– Linear mixed models
• Software

Logistic Regression
• The function $g$ relating the mean $\mu_i = E(Y_i)$ to the linear predictor, $g(\mu_i) = \beta_0 + \beta_1 x_i$, is called the link function.
• In logistic regression, $g$ is given by the "log-odds":
$$g(\mu_i) = \log\frac{\mu_i}{1-\mu_i} = \log\frac{P(Y_i = 1)}{P(Y_i = 0)}$$
• $\beta_1$ is interpreted as the change in the log-odds of an event occurring for a unit change in the covariate $x$.
• Use of GLMs accounts for non-constant variance caused by the mean and variance being related (recall that when $Y_i$ takes on 0 or 1, with $P(Y_i = 1) = \mu_i$, then $\mathrm{var}(Y_i) = \mu_i(1-\mu_i)$).
• Allows estimates of $\beta$ to span the real number line without generating predicted values outside of the range of $\mu$:
$$\log\frac{\mu_i}{1-\mu_i} = \beta_0 + \beta_1 x_i \;\Rightarrow\; \frac{\mu_i}{1-\mu_i} = e^{\beta_0 + \beta_1 x_i} \;\Rightarrow\; \mu_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}$$
As $\hat\beta_0 + \hat\beta_1 x_i \to -\infty$, $\hat\mu_i \to 0$. As $\hat\beta_0 + \hat\beta_1 x_i \to \infty$, $\hat\mu_i \to 1$.
• Estimating logistic regression coefficients: find $\beta$ such that
$$U(\beta) = \frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{n} x_i\left(y_i - \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}\right) = 0.$$
– No closed-form solution: use the Newton-Raphson algorithm / iteratively reweighted least squares.
• Pseudo-maximum likelihood estimator (PMLE): the weighted estimating equation
$$U_w(\beta) = \sum_{i=1}^{n} w_i x_i\left(y_i - \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}\right)$$
consistently estimates the population score equation
$$U_N(\beta) = \sum_{i=1}^{N} x_i\left(y_i - \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}\right).$$

Example – Log. Reg.
• Example: Snuff Use Among Men in the 1987 National Health Interview Survey (NHIS).
• Table 6.2-2 presents two analyses of the probability of snuff use as a function of demographic variables.
• With 186 degrees of freedom, can use a normal or chi-square reference distribution instead of a t- or F-distribution.
• Wald test for significance of race, with four categories: use a chi-square statistic with 3 degrees of freedom.
plus MSA, education, occupation, …

Logistic Regression Diagnostics
• Similar to multiple linear regression, there are added variable plots and partial residual plots.

Added Variable Plots for Log. Reg.
• Example: Development of Cancer as a Function of Transferrin Saturation for Women in the NHANES I Epidemiologic Follow-up Study.
• Log-odds of the probability of developing cancer in a follow-up period modeled as a linear function of transferrin saturation, smoking, race, income, enumeration district (poverty versus nonpoverty), and age. For transferrin saturation
• The added variable plot for transferrin saturation follows.
[Figure: added variable plot for transferrin saturation.] The areas of the bubbles are proportional to the sample weights. The dashed line is the weighted least-squares regression line; slope 0.025. If pt. A is dropped it results in

Partial Residual Plots for Log. Reg.
• Example: Snuff Use Among Men Sampled in the 1987 NHIS.
• Used a partial residual plot to determine the functional relationship of age in the logistic regression model. The model had main effects for age, race, region, metropolitan statistical area, education, occupation, and marital status.
• Visual inspection of the partial residual plot for age is not too helpful because the partial residuals for the 15,513 non-snuff users are pushed down to zero by the scaling of the y-axis.
• Use of a kernel smoother suggests a quadratic term in age.

Survival Analysis
• "Time-to-event" data: death, disability, divorce, etc.
• Often censored before the event occurs: observe $Y_i = \min(T_i, C_i)$, $\delta_i = I(T_i < C_i)$, where $T_i$ is the time of the event and $C_i$ is the (noninformative) censoring time.
• Survival function: $S(t) = 1 - F(t)$ (probability the event has not occurred by time $t$).
• Hazard function:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T \le t + \Delta t \mid T \ge t)}{\Delta t} = -\frac{d}{dt}\ln S(t)$$
(instantaneous risk of an event occurring at time $t$).
• Cumulative hazard function: $H(t) = \int_0^t h(z)\,dz$ ("total rate" of experiencing the event up to time $t$; also the expected number of events).
– Also have
$$H(t) = -\int_0^t \frac{d}{dz}\ln S(z)\,dz = -\ln S(t).$$

Survival Function
• Estimating the survival function (1 − CDF) in the presence of censoring.
– Normally estimate the CDF as $\#(t_i \le t)/n$ and thus $S$ as $\#(t_i > t)/n$, but censoring will cause bias.
– The Kaplan-Meier estimator accounts for censoring:
$$\hat S(t) = \prod_{t_i \le t} \hat P(T > t_i \mid T > t_{i-1}) = \prod_{t_i \le t}\left(1 - \frac{d_i}{Y_i}\right),$$
since
$$\hat P(T > t_i \mid T > t_{i-1}) = \frac{Y_i - d_i}{Y_i},$$
where $Y_i$ is the number of subjects at risk (not censored) at time $t_i$ and $d_i$ is the number that die at time $t_i$. (Subjects ordered by time of death/censoring.)
• Example ('+' means censored) [table not reproduced]
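As a worked illustration of the product-limit formula above, here is a minimal Python sketch of the Kaplan-Meier estimator. The function name `kaplan_meier`, its optional `weights` argument, and the six-subject data set are made up for illustration and do not come from the slides. When weights are supplied, the event and at-risk counts simply become weighted totals.

```python
from collections import defaultdict


def kaplan_meier(times, events, weights=None):
    """Return [(t, S_hat(t))] at each distinct event time.

    times   : observed times, min(T_i, C_i)
    events  : 1 if the event was observed, 0 if censored
    weights : optional case weights (hypothetical argument); None means all 1
    """
    if weights is None:
        weights = [1.0] * len(times)

    deaths = defaultdict(float)   # (weighted) number of events at each time
    leaving = defaultdict(float)  # (weighted) number leaving the risk set at each time
    for t, d, w in zip(times, events, weights):
        leaving[t] += w
        if d:
            deaths[t] += w

    at_risk = float(sum(weights))  # everyone is at risk before the first time
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        if deaths[t] > 0:
            # Multiply by the conditional survival probability 1 - d_i / Y_i
            surv *= 1.0 - deaths[t] / at_risk
            curve.append((t, surv))
        # Subjects who died or were censored at t leave the risk set afterwards
        at_risk -= leaving[t]
    return curve


# Made-up data: six subjects, two of them censored (at times 3 and 5)
times = [1, 3, 3, 5, 6, 8]
events = [1, 1, 0, 0, 1, 1]
curve = kaplan_meier(times, events)
print([(t, round(s, 3)) for t, s in curve])
# prints [(1, 0.833), (3, 0.667), (6, 0.333), (8, 0.0)]
```

A subject censored at the same time as a death is counted as still at risk at that time, which is the usual convention; the slides do not state how ties are handled, so treat that choice as an assumption.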
Survival Function
• Kaplan-Meier curve (plot of $\hat S(t)$ against $t$).

Cox Proportional Hazards Model
Cox proportional hazards model: models the hazard as a multiplicative function of the covariates, linear on the log scale:
$$h(t) = h_0(t)\exp(x'\beta)$$
• $\beta_k$ is the change in the log-hazard associated with a unit change in $x_k$.
• Can estimate $\beta$ without needing to estimate $h_0(t)$ (semiparametric model):
$$L(\beta) = \prod_{i=1}^{n}\left[\frac{\exp(x_i'\beta)}{\sum_{j \in R(t_i)} \exp(x_j'\beta)}\right]^{\delta_i}, \qquad R(t_i) \equiv \text{set of subjects without an event by time } t_i$$
$$U(\beta) = \sum_{i=1}^{n} \delta_i\left[x_i - \frac{\sum_{j \in R(t_i)} x_j \exp(x_j'\beta)}{\sum_{j \in R(t_i)} \exp(x_j'\beta)}\right]$$

Survival Analysis: Survey Data
• Accommodate case weights using the same trick as in linear and generalized linear models.
– K-M estimator: replace the unweighted total that died at time $t_i$ and the unweighted total at risk with the equivalent weighted totals:
$$\hat S_w(t) = \prod_{t_i \le t}\left(1 - \frac{d_i^w}{Y_i^w}\right)$$
– PH model: replace the unweighted sums in $U(\beta)$ with weighted sums:
$$U_w(\beta) = \sum_{i=1}^{n} w_i \delta_i\left[x_i - \frac{\sum_{j \in R(t_i)} w_j x_j \exp(x_j'\beta)}{\sum_{j \in R(t_i)} w_j \exp(x_j'\beta)}\right]$$
• Linearized estimators of the variance are also available, but more care is needed in computing them since the estimating equation is no longer a linear combination of weighted sums.
– The same Wald-type test statistics, with df determined by the sample design, can be used as in linear and generalized linear regression.

Example – Cox PH Model
• Data from the 4th wave of the Americans' Changing Lives study
– Face-to-face survey: stratified, multistage, unequal probability of selection sample design.
– Oversample of African-Americans and persons 60+.
– 3,617 subjects enrolled in 1986; followed in three follow-ups through 2003.
– 2,971 either dead or alive and followed through the 4th wave.
• Risk of death as a function of age, gender, race, smoking status, and BMI.

Example – Cox PH Model
Unadjusted for design:
          coef      exp(coef)  se(coef)   z        p
mage      0.0825    1.086      0.00257    32.08    0.0e+00
female   -0.5005    0.606      0.06077    -8.24    2.2e-16
nonaa    -0.4426    0.642      0.06165    -7.18    7.0e-13
smoke1    0.5400    1.716      0.06961     7.76    8.7e-15
bmi1     -0.0110    0.989      0.00625    -1.76    7.9e-02

Adjusted for design:
          coef      exp(coef)  se(coef)   z        p
mage      0.08864   1.093      0.00525    16.889   0.0e+00
female   -0.61771   0.539      0.08350    -7.398   1.4e-13
nonaa    -0.48441   0.616      0.07618    -6.359   2.0e-10
smoke1    0.55158   1.736      0.10280     5.366   8.1e-08
bmi1     -0.00565   0.994      0.01259    -0.449   6.5e-01

Cox PH Model: Model Checking, Proportional Hazards Assumption
The PH assumption implies
$$S(t; x_i) = S_0(t)^{\exp(x_i'\beta)} \;\Rightarrow\; \log S(t; x_i) = \exp(x_i'\beta)\,\log S_0(t) \;\Rightarrow\; \log(-\log S(t; x_i)) = x_i'\beta + \log(-\log S_0(t))$$
Can check by plotting $\log(-\log \hat S_h(t))$ against $t$ for categorical variables with levels $h = 1, \ldots, H$.
– The resulting lines should be parallel.
[Figure: log(-log(S(t))) plotted against Years in two panels, A-A vs. Non A-A and Female vs. Male.]

Cox PH Model: Model Checking, Proportional Hazards Assumption
• Can relax the PH assumption by including an interaction term between time and covariate: $\beta_1 x + \beta_2 x t = (\beta_1 + \beta_2 t)x$
– $\beta_1$ is the effect on the log-hazard (instantaneous risk) associated with $x$ at the start of follow-up.
– $\beta_1 + \beta_2 t$ is the effect on the log-hazard (instantaneous risk) associated with $x$ at time $t$.
– Details require introducing time-dependent covariates.

Cox PH Model: Model Checking, Martingale Residuals
Martingale residuals: $M_i = \delta_i - \hat H_i(t_i)$. Can use these to consider the functional form of covariates (as in linear and generalized linear models): plot $M_i$ against $x_i$ to look for non-linear trends.
[Figure: martingale residuals M plotted against Age.]

Linear Mixed Models
• Setting of multiple observations per subject: $Y_j = (Y_{j1}, \ldots, Y_{j,n_j})$
• Assume that the linear regression model for the outcome includes unobserved subject-specific effects:
$$Y_j = X_j\beta + Z_j b_j + \varepsilon_j, \quad b_j \overset{iid}{\sim} N(0, G), \quad \varepsilon_j \overset{iid}{\sim} N(0, \sigma^2 I), \quad b_j \perp \varepsilon_j$$
• If $Z_j = X_j$, this is equivalent to each subject having specific regression parameters drawn from a common distribution:
$$Y_j = X_j\beta_j + \varepsilon_j, \quad \beta_j \overset{iid}{\sim} N(\beta, G), \quad \varepsilon_j \overset{iid}{\sim} N(0, \sigma^2 I), \quad \beta_j \perp \varepsilon_j$$
• A common example is the random slope-intercept model:
$$y_{ji} = \beta_0 + \beta_1 t_{ji} + b_{j0} + b_{j1} t_{ji} + \varepsilon_{ji}$$
$$\begin{pmatrix} b_{j0} \\ b_{j1} \end{pmatrix} \overset{iid}{\sim} N_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\; G = \begin{pmatrix} g_{11} & g_{12} \\ g_{12} & g_{22} \end{pmatrix}\right), \quad \varepsilon_{ji} \overset{iid}{\sim} N(0, \sigma^2)$$
Can obtain estimates $\hat b_j$, so can estimate both a population mean $\hat\beta_0 + \hat\beta_1 t_{ji}$ and a subject-specific mean $(\hat\beta_0 + \hat b_{j0}) + (\hat\beta_1 + \hat b_{j1}) t_{ji}$.

Linear Mixed Models
Ex: 20 subjects randomized to 2 treatment groups and given a fever-lowering drug:
[Figure legend: solid line, pop. mean.]
[Figure legend, continued: dotted line, subject-specific slope-intercept; dashed line, regression within subject.]

Linear Mixed Models
• Accounts for within-subject correlation.
• Increases precision for within-subject covariates (e.g., slope).
• Parcels out within-subject vs. between-subject variance in $G$ and $\sigma^2$.
• Estimation:
(1) Fix $G$ and $\sigma$ at some initial estimates $G^{(0)}$ and $\sigma^{(0)}$ and estimate $\beta^{(1)}$ using weighted LS.
(2) Fix $\beta$ at $\beta^{(1)}$ and estimate $G^{(1)}$ and $\sigma^{(1)}$.
(3) Repeat (1) and (2) until convergence.

Linear Mixed Models: Survey Data
• Difficult to formulate pseudo-maximum likelihood because the census log-likelihood is no longer a sum, owing to clustering at the population level.
• Instead replace the iterative estimates of $G$, $\sigma$, and $\beta$ with their weighted equivalents:
– Assumes 2-stage sampling, with the second stage of sampling forming the "subject" (cluster).
– Requires knowing both the first-stage inclusion probabilities $\pi_k$ and the second-stage inclusion probabilities $\pi_{i|k}$ to compute correct weighted estimates.
– Need to normalize so that $\sum_i w_{i|k}$ is the sample size $n_k$, not the population size $N_k$; otherwise get biased estimates of between-subject variance.

Example – Linear Mixed Models
• Partners for Child Passenger Safety is a probability sample of 31,985 children in 20,202 passenger vehicle crashes in State Farm-insured vehicles.
• Stratification and unequal probability of selection (oversample of severe crashes).
• Clustering: data collected for all children in the vehicle if sampled.
• Goal is to estimate the rate of "consequential" injuries.
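The weight normalization described under "Linear Mixed Models: Survey Data" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scale_within_cluster_weights` and the toy cluster labels and weights are hypothetical, and the raw within-cluster weights are taken to be $w_{i|k} = 1/\pi_{i|k}$.

```python
def scale_within_cluster_weights(cluster_ids, cond_weights):
    """Rescale within-cluster weights w_{i|k} so they sum to n_k in each cluster k."""
    totals = {}  # sum of raw weights per cluster
    counts = {}  # sample size n_k per cluster
    for k, w in zip(cluster_ids, cond_weights):
        totals[k] = totals.get(k, 0.0) + w
        counts[k] = counts.get(k, 0) + 1
    # Multiply each weight by n_k / (sum of raw weights in its cluster)
    return [w * counts[k] / totals[k] for k, w in zip(cluster_ids, cond_weights)]


# Made-up example: cluster "a" has 3 sampled units, cluster "b" has 2
clusters = ["a", "a", "a", "b", "b"]
raw = [2.0, 4.0, 6.0, 10.0, 30.0]
scaled = scale_within_cluster_weights(clusters, raw)
print(scaled)  # prints [0.5, 1.0, 1.5, 0.5, 1.5]
```

After scaling, the weights in cluster "a" sum to 3 and those in cluster "b" sum to 2, so relative weights within a cluster are preserved while the totals match the cluster sample sizes rather than the population sizes.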
Example – Linear Mixed Models
Consider the model
$$\log\frac{P(Y_{ji} = 1)}{1 - P(Y_{ji} = 1)} = \beta_0 + b_j, \quad b_j \sim N(0, \psi_1)$$
versus
$$\log\frac{P(Y_{ji} = 1)}{1 - P(Y_{ji} = 1)} = \beta_0 + \beta_1 x_j + b_j, \quad b_j \sim N(0, \psi_2)$$
where $j$ indexes the crash, $i$ the child in the crash, $Y_{ji}$ is an indicator of injury, and $x_j$ is an indicator for towaway status of the vehicle.
• The first model absorbs all of the crash-specific variability of injury risk into $b_j$. The second model strips out the variability of injury risk explained by the towaway status of the vehicle before absorbing the remaining crash-specific components into $b_j$.
• $\hat\psi_1 = 6.70$, $\hat\psi_2 = 5.04$, indicating that the drivability of the vehicle explains about 25% of the injury risk variability inherent in a crash ($1 - 5.04/6.70 \approx 0.25$).

Modeling in a Finite Population Context: Summary of Basics
• Estimating "traditional" statistical models in a survey context:
– Estimating the equivalent finite population quantity, such as a least squares regression slope or logistic regression parameter.
– Assuming the population is generated as a simple random sample from an infinite superpopulation that is generated under the model.
• Standard modeling methods implicitly assume an iid sample.
• Need to accommodate the sample design in inference.

Modeling in a Finite Population Context: Summary of Key Points, Point Estimation
• Case weights to account for unequal probability of selection
– Model misspecification: interaction between probability of selection and parameter of interest.
• Mixed models must assume correct specification of the mean to obtain correct estimates of variance components.
– Informative sampling: probability of selection is associated with residuals in the regression model.
– Can test with an interaction term between the weights and the parameter (e.g., regression slopes).

Modeling in a Finite Population Context: Summary of Key Points, Variance Estimation
• Clustering, stratification
– Often done for cost or statistical efficiency.
• Ignoring clustering typically underestimates variance; ignoring stratification typically overestimates variance.
• As in the descriptive statistics setting, we accommodate the design using linearization or replication methods that account for clustering and stratification.

Software
• Until the past 5-10 years, only specialized software could accommodate complex sample designs (WesVarPC, SUDAAN, Epi Info).
• Now SAS, Stata, and R can handle many standard regression models, although specialized software is still required for mixed models.
SAS: http://www.sas.com
Stata: http://www.stata.com
R: http://www.r-project.org/
IVEWare: http://www.isr.umich.edu/src/smp/ive/
MPlus: http://www.statmodel.com/
HLM: http://www.ssicentral.com/hlm/student.html

Software
Packages compared: SAS, Stata, R*, IVEWare, MPlus, HLM
Linear Models: X X X X X X
GLMs: X X X X X X X X X
Linear Mixed Models: X** X
GLMMs: X** X
Survival Models
[Column alignment of the X marks is not recoverable from the source.]
*requires "survey" package.
**requires equal within-cluster weights.

Acknowledgments
• Thanks to Barry Graubard, National Cancer Institute
• Analysis of Health Surveys, EL Korn & BI Graubard (Wiley, 1999)
• "Inference for Superpopulation Parameters using Sample Surveys." BI Graubard & EL Korn, Statistical Science, 17 (2002), 73-96.
• "Finite Population Correction Factors (Panel Discussion)." K Rust, B Graubard, WA Fuller, SL Stokes, & PS Kott. ASA Proceedings, 2006.
• Bienias, J. L. (2001). Replicate-based variance estimation in a SAS® macro.
[Available from http://www.nesug.org.]

Contact Information
• mrelliot@umich.edu
• jbienias@alum.wustl.edu
• graubarb@exchange.nih.edu