REVIEW
Correlated Data Regression
Biost 540

1. Correlation between outcomes is (almost) everywhere and takes many different forms.
   a. Mechanisms for correlation.
2. Correlation between outcomes cannot be ignored.
   a. Impact on interpretation of regression coefficients.
   b. Impact on precision of parameter estimators.
   c. Impact on validity of statistical inference.
3. Specification of the mean model is crucial for valid inference (just as for independent outcomes).
   a. Linear models (transformed outcome?)
   b. Generalized linear models (logistic, Poisson and linear regression).
   c. Choice of predictors, explanatory variables, covariates, interactions.
   d. Interpretation of regression coefficients.
4. Statistical models for correlated data use the concept of a cluster.
   a. Outcomes from the same cluster may be correlated, whereas outcomes in different clusters are independent. (Exceptions include models with spatial correlation and other more complex models.)
   b. Mathematical notation for correlated data uses two (or more) indices to index (1) the cluster, i.e., the primary sampling unit (p.s.u.), and (2) the elementary unit (e.u.) within the cluster.
   c. Example: for the response Y_ij, we let i denote the cluster (p.s.u.) and j denote the e.u. within cluster i.
5. Between-cluster effects and within-cluster effects have different interpretations and may have different numerical values.
   a. Recall issues with ecological fallacy.
6. Two formulations of regression models for correlated outcomes lead to models with distinct interpretations of parameter estimates. The model discussed thus far is the marginal (or direct) model. The random effects (conditional or mixed) model is the other. (We will discuss the random effects models following the midterm exam.)
7. In the marginal (direct) formulation, we directly (and separately) specify the mean model along with a working variance (and correlation) structure.
   a. We do not specify distributions of the data. We specify only the means and variances.
8. Valid estimates of means and regression coefficients can be obtained while ignoring the correlation.
   a. Can easily show for the sample mean. The general principle applies to regression coefficients (which are weighted means).
9. Statistical inference ignoring correlation is not valid.
10. The Intraclass Correlation Coefficient (ICC), the correlation between two outcomes in the same cluster, is a useful quantification of the magnitude of the within-cluster correlation.
   a. The ICC completely describes the correlation matrix for an exchangeable correlation structure. In general, the ICC represents an "average" pairwise correlation between two outcomes in the same cluster.
11. Other correlation structures are useful when the exchangeable correlation structure is not reasonable:
   a. Autoregressive: AR(1), AR(2)
   b. Unstructured
   c. Stationary versus non-stationary
12. Variance inflation factor: VIF = 1 + (n-1)ρ quantifies the impact of within-cluster correlation on the variance of the sample mean (ρ = ICC, n = cluster size).
13. Variance attenuation factor: VAF = 1 - ρ quantifies the impact of within-cluster correlation on the variances of within-cluster comparisons.
14. The derived variable approach is simple, easy to interpret and is usually valid.
   a. It is flexible and can be adapted to simple or complex designs. The general recipe is always the same:
      i. calculate a summary outcome measure (the "derived" variable) for each cluster,
      ii. apply ordinary independent-data analysis methods to the derived variable.
   b. The derived variable approach can address a variety of types of scientific questions:
      i. cluster averages (means),
      ii. regression coefficients (slopes),
      iii. time-trends (slope coefficient for "time"),
      iv. typically corresponding summaries in a correlated data model.
15. The "sandwich" variance estimator allows for valid inference about mean parameters in marginal models without requiring the correct specification of the variance (and correlation) structure.
   a. We also do not need to specify distributions of our outcomes. However, there is no free lunch. (See items 17 and 18.)
16. Generalized Estimating Equations (GEE) allow separate modeling of the mean and covariance structure, and weight observations according to a "working" covariance structure to increase the precision of estimates of mean parameters.
   a. The model involves specification of:
      i. mean model,
      ii. working variance structure,
      iii. working correlation structure.
17. Inference from GEE and random effects (maximum likelihood) models is based upon large sample (asymptotic) theory.
   a. An advantage of this fact for GEE is that we do not need distributional assumptions (such as normality). A disadvantage is described in the next item.
18. Inference based on asymptotic methods may be invalid with a small number of clusters.
   a. The robust variance estimate is susceptible to bias with a small number of clusters. This leads to Wald tests that are anti-conservative. Various alternatives, such as methods based on resampling, are available that may perform better in such situations.
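The variance inflation and attenuation factors in items 12 and 13 can be checked numerically. The sketch below is illustrative only: it assumes an exchangeable correlation structure with a common variance, and the function names and the values n = 5, ρ = 0.2 are hypothetical choices, not part of the course material. It computes the variance of a cluster mean and of a within-cluster difference directly from the covariance matrix and compares each to its independent-data counterpart.

```python
# Illustrative check of VIF = 1 + (n-1)*rho and VAF = 1 - rho.
# Assumption: exchangeable correlation with common variance sigma2.
# All function names and numeric values here are hypothetical.

def exchangeable_cov(n, rho, sigma2=1.0):
    """n x n covariance matrix: sigma2 on the diagonal, rho*sigma2 elsewhere."""
    return [[sigma2 if j == k else rho * sigma2 for k in range(n)]
            for j in range(n)]

def var_linear_combo(weights, cov):
    """Variance of sum_j w_j * Y_j, computed as w' Cov w."""
    n = len(weights)
    return sum(weights[j] * cov[j][k] * weights[k]
               for j in range(n) for k in range(n))

def vif(n, rho):
    """Variance inflation factor for the cluster mean."""
    return 1 + (n - 1) * rho

def vaf(rho):
    """Variance attenuation factor for within-cluster comparisons."""
    return 1 - rho

n, rho, sigma2 = 5, 0.2, 1.0
cov = exchangeable_cov(n, rho, sigma2)

# Cluster mean: compare its variance to the independent-data value sigma2 / n.
var_mean = var_linear_combo([1.0 / n] * n, cov)
ratio_mean = var_mean / (sigma2 / n)

# Within-cluster difference Y_1 - Y_2: compare to the independent-data value 2 * sigma2.
diff_weights = [1.0, -1.0] + [0.0] * (n - 2)
var_diff = var_linear_combo(diff_weights, cov)
ratio_diff = var_diff / (2 * sigma2)

print(f"cluster mean: variance ratio = {ratio_mean:.3f}, VIF = {vif(n, rho):.3f}")
print(f"difference:   variance ratio = {ratio_diff:.3f}, VAF = {vaf(rho):.3f}")
```

With these values the variance of the cluster mean is inflated by a factor of 1.8, while the variance of the within-cluster difference is reduced to 0.8 of its independent-data value. The same positive correlation therefore hurts precision for between-cluster quantities and helps it for within-cluster comparisons.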