REVIEW Correlated Data Regression Biost 540

1. Correlation between outcomes is (almost) everywhere and takes many
different forms.
a. Mechanisms for correlation.
2. Correlation between outcomes cannot be ignored.
a. Impact on interpretation of regression coefficients.
b. Impact on precision of parameter estimators.
c. Impact on validity of statistical inference.
3. Specification of the mean model is crucial for valid inference (just as for
independent outcomes).
a. Linear models (transformed outcome?)
b. Generalized linear models (logistic, Poisson and linear regression).
c. Choice of predictors, explanatory variables, covariates, interactions.
d. Interpretation of regression coefficients.
4. Statistical models for correlated data use the concept of a cluster.
a. Outcomes from the same cluster may be correlated whereas outcomes in
different clusters are independent. (Exceptions include models with spatial
correlation and other more complex models.)
b. Mathematical notation for correlated data uses two (or more) indices to
index (1) the cluster – primary sampling unit, and (2) the elementary unit
within the cluster.
c. Example: for the response Y_ij, we let i denote the cluster (p.s.u.) and j
denote the elementary unit (e.u.) within cluster i.
5. Between-cluster effects and within-cluster effects have different
interpretations and may have different numerical values.
a. Recall the issues with the ecological fallacy.
6. Two formulations of regression models for correlated outcomes lead to
models with distinct interpretations of parameter estimates. The model
discussed thus far is the marginal (or direct) model. The random effects
(conditional or mixed) model is the other. (We will discuss the random effects
models following the midterm exam.)
7. In the marginal (direct) formulation, we directly (and separately) specify the
mean model along with a working variance (and correlation) structure.
a. We do not specify distributions of the data. We specify only the means
and variances.
8. Valid point estimates of means and regression coefficients can be obtained
even if the correlation is ignored.
a. Can easily show for sample mean. The general principle applies to
regression coefficients (which are weighted means).
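The sample-mean case in item 8a can be written out in one line. This is a sketch under notation not in the original: K clusters of sizes n_i, N total outcomes, and a common mean E[Y_ij] = μ.

```latex
\mathrm{E}\big[\bar{Y}\big]
  = \mathrm{E}\!\left[\frac{1}{N}\sum_{i=1}^{K}\sum_{j=1}^{n_i} Y_{ij}\right]
  = \frac{1}{N}\sum_{i=1}^{K}\sum_{j=1}^{n_i}\mathrm{E}\big[Y_{ij}\big]
  = \mu
```

Only linearity of expectation is used, so the result holds whatever the within-cluster correlation. The correlation affects the variance of the sample mean, not its expectation.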
9. Statistical inference (standard errors, confidence intervals, p-values) that
ignores the correlation is not valid.
10. The Intraclass Correlation Coefficient (ICC): the correlation between two
outcomes in the same cluster is a useful quantification of the magnitude of the
within-cluster correlation.
a. The ICC completely describes the correlation matrix for an exchangeable
correlation structure. In general, ICC represents an “average” pairwise
correlation between two outcomes in the same cluster.
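As an illustration of item 10a, the sketch below builds an exchangeable correlation matrix from a single ICC value and computes the average pairwise correlation of any correlation matrix. The function names and the numbers used are hypothetical, chosen only for this example.

```python
def exchangeable_corr(n, icc):
    """n x n exchangeable correlation matrix: 1 on the diagonal, icc elsewhere."""
    return [[1.0 if j == k else icc for k in range(n)] for j in range(n)]


def average_pairwise_corr(R):
    """Average of the off-diagonal entries of a square correlation matrix R."""
    n = len(R)
    off_diag = [R[j][k] for j in range(n) for k in range(n) if j != k]
    return sum(off_diag) / len(off_diag)


# Hypothetical cluster of size 3 with ICC = 0.2
R = exchangeable_corr(3, 0.2)
for row in R:
    print(row)
print(average_pairwise_corr(R))
```

For an exchangeable structure the average pairwise correlation is the ICC itself, since every off-diagonal entry is the same value.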
11. Other correlation structures are useful when the exchangeable correlation
structure is not reasonable:
a. Autoregressive: AR(1), AR(2)
b. Unstructured
c. Stationary versus non-stationary
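For comparison with the exchangeable structure, the following sketch builds an AR(1) correlation matrix, in which the correlation between the j-th and k-th outcomes in a cluster is ρ raised to the power |j − k|. The function name and the values are hypothetical.

```python
def ar1_corr(n, rho):
    """n x n AR(1) correlation matrix with entries rho ** |j - k|."""
    return [[rho ** abs(j - k) for k in range(n)] for j in range(n)]


# Hypothetical cluster with 4 equally spaced measurements and rho = 0.5
R = ar1_corr(4, 0.5)
for row in R:
    print(row)
```

The correlation decays with the separation between measurements, whereas the exchangeable structure assigns the same correlation to every pair. An unstructured matrix places no pattern on the entries and so requires n(n − 1)/2 separate correlation parameters for a cluster of size n.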
12. Variance inflation factor: VIF = 1 + (n − 1)ρ describes the impact of
within-cluster correlation on the variance of the sample mean (ρ = ICC,
n = cluster size).
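The VIF follows from a short calculation. The derivation below is a sketch that assumes an exchangeable structure: the n outcomes in cluster i share a common variance σ² (a symbol introduced here, not in the original) and a common pairwise correlation ρ.

```latex
\mathrm{Var}\big(\bar{Y}_i\big)
  = \frac{1}{n^2}\left[\sum_{j=1}^{n}\mathrm{Var}(Y_{ij})
      + \sum_{j \neq k}\mathrm{Cov}(Y_{ij}, Y_{ik})\right]
  = \frac{1}{n^2}\left[n\sigma^2 + n(n-1)\rho\sigma^2\right]
  = \frac{\sigma^2}{n}\,\big[1 + (n-1)\rho\big]
```

The factor multiplying the independent-data variance σ²/n is the VIF. For example, with n = 10 and ρ = 0.1 the VIF is 1 + 9(0.1) = 1.9, so even a small ICC nearly doubles the variance of the cluster mean.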
13. Variance attenuation factor: VAF = 1 − ρ describes the impact of
within-cluster correlation on the variance of within-cluster comparisons.
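The VAF can be seen from the simplest within-cluster comparison, the difference between two outcomes in the same cluster. The sketch assumes both outcomes have variance σ² (a symbol introduced here) and correlation ρ.

```latex
\mathrm{Var}\big(Y_{i1} - Y_{i2}\big)
  = \mathrm{Var}(Y_{i1}) + \mathrm{Var}(Y_{i2}) - 2\,\mathrm{Cov}(Y_{i1}, Y_{i2})
  = 2\sigma^2 - 2\rho\sigma^2
  = 2\sigma^2\,(1 - \rho)
```

Under independence the variance would be 2σ², so positive within-cluster correlation shrinks the variance of a within-cluster comparison by the factor 1 − ρ, while it inflates the variance of the cluster mean.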
14. The derived variable approach is simple, easy to interpret, and usually
valid.
a. It is flexible and can be adapted to simple or complex designs. The general
recipe is always the same:
i. calculate a summary outcome measure (the “derived” variable) for
each cluster,
ii. apply ordinary independent-data analysis methods to the derived
variable.
b. Through the choice of summary measure, the derived variable approach can
address a variety of scientific questions:
i. cluster averages (means),
ii. regression coefficients (slopes),
iii. time trends (slope coefficient for “time”),
iv. typically, summaries that correspond to parameters in a correlated data
model.
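The two-step recipe in item 14a can be sketched as follows for a comparison of two groups of clusters. The data, group labels, and function names are hypothetical and serve only to illustrate the steps; the derived variable here is the cluster mean, and the independent-data method is a pooled-variance two-sample t statistic.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: cluster id -> outcomes measured within that cluster
treated = {"A": [5.1, 5.4, 5.0], "B": [6.2, 6.0], "C": [5.8, 5.9, 6.1, 6.0]}
control = {"D": [4.8, 5.0, 4.9], "E": [5.2, 5.1], "F": [4.6, 4.7, 4.9]}


def cluster_means(clusters):
    # Step i: reduce each cluster to one derived variable (its mean)
    return [mean(ys) for ys in clusters.values()]


def two_sample_t(x, y):
    # Step ii: ordinary pooled-variance t statistic on the derived variables
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, nx + ny - 2


t_stat, df = two_sample_t(cluster_means(treated), cluster_means(control))
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The degrees of freedom are based on the number of clusters (6 here), not the number of individual outcomes, which is how the approach accounts for the within-cluster correlation.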
15. The “sandwich” variance estimator allows for valid inference about mean
parameters in marginal models without requiring the correct specification of the
variance (and correlation) structure.
a. We also do not need to specify distributions of our outcomes. However,
there is no free lunch. (See items 17 and 18.)
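To make the sandwich idea concrete, the sketch below computes it for the simplest marginal model, an intercept-only mean model fit with an independence working structure, and compares it with the naive standard error that treats all outcomes as independent. The data are hypothetical, and no small-sample correction is applied.

```python
from math import sqrt

# Hypothetical clustered outcomes: one inner list per cluster
clusters = [[5.1, 5.4, 5.0], [6.2, 6.0, 6.3], [4.8, 5.0, 4.9], [5.8, 5.9, 6.1]]

all_y = [y for ys in clusters for y in ys]
N = len(all_y)
beta_hat = sum(all_y) / N  # estimate of the mean under working independence

# Naive variance: treats all N outcomes as independent
s2 = sum((y - beta_hat) ** 2 for y in all_y) / (N - 1)
naive_se = sqrt(s2 / N)

# Sandwich variance for this model: "bread" is 1/N on each side,
# "meat" is the sum over clusters of the squared within-cluster residual total
meat = sum(sum(y - beta_hat for y in ys) ** 2 for ys in clusters)
robust_se = sqrt(meat) / N

print(f"estimate = {beta_hat:.3f}")
print(f"naive SE = {naive_se:.3f}, sandwich SE = {robust_se:.3f}")
```

In these made-up data the outcomes within a cluster are similar to one another, so the residuals within a cluster do not cancel and the sandwich standard error is larger than the naive one. With only four clusters the sandwich estimate is itself unreliable, which is the caveat raised in item 18.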
16. Generalized Estimating Equations (GEE) allow separate modeling of the mean
and covariance structures, and weight observations according to a “working”
covariance structure to increase the precision of estimates of mean parameters.
a. Model involves specification of:
i. mean model,
ii. working variance structure,
iii. working correlation structure.
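The three components in item 16a appear explicitly in the usual form of the estimating equation. The notation below is a sketch using symbols not in the original: μ_i(β) is the vector of means for cluster i, A_i is the diagonal matrix of working variances, R_i(α) is the working correlation matrix, and φ is a scale parameter.

```latex
\sum_{i=1}^{K} D_i^{\top} V_i^{-1}\,\big(Y_i - \mu_i(\beta)\big) = 0,
\qquad
D_i = \frac{\partial \mu_i(\beta)}{\partial \beta^{\top}},
\qquad
V_i = \phi\, A_i^{1/2}\, R_i(\alpha)\, A_i^{1/2}
```

The mean model enters through μ_i(β) and D_i, the working variance through A_i, and the working correlation through R_i(α). The inverse of V_i supplies the weights referred to in item 16.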
17. Inference from GEE and random effects (maximum likelihood) models is
based upon large sample (asymptotic) theory.
a. An advantage of this fact for GEE is that we do not need distributional
assumptions (such as normality). A disadvantage is the next item…
18. Inference based on asymptotic methods may be invalid with a small number
of clusters.
a. The robust variance estimate is susceptible to bias with a small number of
clusters. This leads to Wald tests that are anti-conservative. Various
alternatives, such as methods based on resampling, are available that may
perform better in such situations.
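One resampling approach is a cluster bootstrap, sketched below. The source does not name a specific method, so this choice, the function name, the data, and the number of replicates are all assumptions made for illustration. Whole clusters are resampled with replacement so that the within-cluster correlation is preserved in every replicate.

```python
import random
from statistics import mean, stdev


def cluster_bootstrap_se(clusters, n_boot=2000, seed=540):
    """Bootstrap standard error of the overall mean, resampling whole clusters."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = len(clusters)
    estimates = []
    for _ in range(n_boot):
        # Draw k clusters with replacement, keeping each cluster intact
        sample = [clusters[rng.randrange(k)] for _ in range(k)]
        estimates.append(mean([y for ys in sample for y in ys]))
    return stdev(estimates)


# Hypothetical clustered outcomes: one inner list per cluster
clusters = [[5.1, 5.4, 5.0], [6.2, 6.0, 6.3], [4.8, 5.0, 4.9], [5.8, 5.9, 6.1]]
se = cluster_bootstrap_se(clusters)
print(f"cluster bootstrap SE = {se:.3f}")
```

Resampling individual outcomes instead of clusters would break the correlation structure and reproduce the problem described in item 9. A bootstrap with very few clusters has its own limitations, so this is an illustration of the idea and not a recommendation for any particular design.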