GEE and Generalized Linear Mixed Models
Tom Greene

Outline
• Subject-specific and population-average inference in generalized linear models
• Review of classical generalized linear models with independent observations
• Generalized Estimating Equations
• Contrasts of GLMMs with GEEs
• GEE example

Classes of Generalized Linear Models
• Linear Models (linear regression, ANOVA, ANCOVA)
  – E(Y) = Xβ
  – Responses independent
• Generalized Linear Models (logistic regression, Poisson regression, etc.)
  – g(E(Y)) = Xβ
  – Responses independent
• Linear Mixed Models
  – E(Y|b) = Xβ + Zb
  – Responses correlated
  – Correlation modeled in part by “random effects”
• Generalized Linear Mixed Models (GLMM)
  – g(E(Y|b)) = Xβ + Zb
  – Responses correlated
  – Correlation modeled in part by “random effects”
• Generalized Estimating Equations approach (GEE)
  – g(E(Y)) = Xβ
  – Responses correlated

Classes of Generalized Linear Models for Correlated Data
• Linear Mixed Models
  – E(Y|b) = Xβ + Zb
  – Responses correlated
  – Correlation modeled in part by “random effects”
• Generalized Estimating Equations approach (GEE): population-average inference
  – g(E(Y)) = Xβ
  – Responses correlated
  – Analysis describes differences in the mean of Y across the entire population
  – Analysis is informative from a population perspective; most relevant from the perspective of
      · Policy makers
      · Providers desiring to optimize outcomes across the entire population
• Generalized Linear Mixed Models (GLMM): subject-specific inference
  – g(E(Y|b)) = Xβ + Zb
  – Responses correlated
  – Correlation modeled in part by “random effects”
  – Analysis describes differences in the mean of Y conditional on the patient’s specific random effect b
  – Most relevant from an individual patient’s perspective
  – Often b represents a dimension
    of frailty. Hence, Xβ tells about the relationship of Y to X among patients with the same frailty

Extreme Example
• Subject-specific effect of X on Pr(Death): OR = 20 per 1-unit increase in X
• Population-average effect of X on Pr(Death): OR = 2.7 per 1-unit increase in X

Example: Toenail Data
• Toenail Dermatophyte Onychomycosis (TDO): common toenail infection, difficult to treat, affecting more than 2% of the population
• Design: randomized, double-blind, parallel-group, multicenter study for the comparison of two new compounds (A and B) for oral treatment
  – 2 × 189 patients randomized, 36 centers
  – 48 weeks of total follow-up (12 months)
  – 12 weeks of treatment (3 months)
  – Measurements at months 0, 1, 2, 3, 6, 9, 12
• Research question: severity relative to treatment of TDO?

Review of Generalized Linear Models (Independent Responses)
• Independent responses $Y_i$, $i = 1, 2, \ldots, N$
  – $Y_i$ has a distribution from the exponential family:
    $f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$
• Mean model
  – $\mu_i = E(Y_i \mid X_{i1}, X_{i2}, \ldots, X_{ip})$
  – $g(\mu_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$
  – The mean model is the only part we have to get right for valid large-sample inference!
• Variance function
  – $\mathrm{Var}(Y_i) = \phi V(\mu_i)$
  – $v_i = V(\mu_i)$ is a known function determined by the assumed distribution of Y within the exponential family

Extension to GEE for Longitudinal Data
GEE: Generalized Estimating Equations (Liang & Zeger, 1986; Zeger & Liang, 1986)
• Method is semi-parametric: estimating equations are derived without full specification of the joint distribution of a subject’s observations
• Instead, it requires specification of
  1. The mean model for the marginal distributions of the $y_{ij}$
  2. The variance function of $y_{ij}$ given $\mu_{ij}$
  3. The “working” correlation matrix for the vector of repeated observations from each subject
• Relies on independence across subjects (or clusters) to estimate consistently the variance of the regression coefficients

GEE Method Outline
1. Relate the marginal response $\mu_{ij} = E(y_{ij})$ to a linear combination of the covariates:
   $g(\mu_{ij}) = X_{ij}^{T} \beta$
   • $y_{ij}$ is the response for subject i at time j, $j = 1, 2, \ldots, n$
   • $X_{ij}$ is a p × 1 vector of covariates
   • $\beta$ is a p × 1 vector of regression coefficients
   • $g(\cdot)$ is the link function
2. Describe the variance of $y_{ij}$ as a function of the mean:
   $V(y_{ij}) = v(\mu_{ij})\,\phi$
   • $\phi$ is a possibly unknown scale parameter
   • $v(\cdot)$ is a known variance function

Link and Variance Functions
• Normally distributed response
  – $g(\mu_{ij}) = \mu_{ij}$ (“identity link”)
  – $v(\mu_{ij}) = 1$, so $V(y_{ij}) = \phi$
• Binary response (Bernoulli)
  – $g(\mu_{ij}) = \log[\mu_{ij} / (1 - \mu_{ij})]$ (“logit link”)
  – $v(\mu_{ij}) = \mu_{ij}(1 - \mu_{ij})$, $\phi = 1$
• Poisson response
  – $g(\mu_{ij}) = \log(\mu_{ij})$ (“log link”)
  – $v(\mu_{ij}) = \mu_{ij}$, $\phi = 1$

GEE Method Outline (continued)
3. Choose the form of an n × n “working” correlation matrix $R_i$ for each $Y_i$

Working Correlation Structures
[Slides showing working correlation structures, including AR(1); the matrices were not recovered]

GEE Estimation
• Define $A_i$ = n × n diagonal matrix with $v(\mu_{ij})$ as the jth diagonal element
• Define $R_i(\alpha)$ = n × n “working” correlation matrix (of the n repeated measures)
• The working variance–covariance matrix for $Y_i$ equals
  $V_i(\alpha) = \phi\, A_i^{1/2} R_i(\alpha) A_i^{1/2}$

GEE vs. GLMM: 1) Target of Inference
• GEE: population average
• GLMM: subject specific
Note: Recent work addresses performing population-average inference under GLMMs

GEE vs. GLMM: 2) Outputs
• GEE:
  – Coefficients relating Y to X
• GLMM:
  – Coefficients relating Y to X conditional on b
  – Estimates of subject-specific random effects
  – Variance of subject-specific random effects

GEE vs. GLMM: 3) Robustness
• GEE (with robust variance estimates):
  – Inference is valid in large samples even if the distribution of Y and/or the variance of Y are incorrectly specified
• GLMM (with model-based estimates):
  – Valid inference generally requires correct specification of the distribution of Y and of the variance of Y
Notes:
  1) Recent proposals for robust variance estimates under GLMM
  2) Inference for linear mixed models remains valid if Y is not normal, for large N
  3) Caveat to GEE robustness: GEE can be biased if time-dependent covariates are used, unless an independence working correlation matrix is used

GEE vs. GLMM: 4) Efficiency (power and width of confidence intervals)
• GEE:
  – Usually fairly efficient if the variance function is correctly specified
  – Between-subject comparisons are nearly efficient if an independence covariance structure is used for balanced data
• GLMM:
  – Maximum likelihood estimates are asymptotically efficient as long as the model is correctly specified

GEE vs. GLMM: 5) Missing Data
• “Classical” GEE (with robust variance estimates):
  – Valid inference if data are Missing Completely At Random (MCAR), even if the variance model is wrong
  – If the variance model is correct, the estimate of β is still consistent if data are MAR but not MCAR (but standard errors are not correct)
• GLMM (with model-based estimates):
  – Valid inference if data are Missing At Random (MAR)
Note: Various strategies exist for valid GEE inference if data are MAR

Missing Data
• Three general approaches to dealing with missing data under GEE, which assume MAR but not MCAR:
  1. Inverse probability weighting (Robins, Rotnitzky and Zhao, JASA, 1995)
  2. Multiple imputation
  3. Inverse probability weighting with augmentation, or doubly robust estimation
• Each method can incorporate covariate information not included in the GEE model itself. This can make the MAR assumption much more plausible.
• Methods 2 and 3 can be considerably more efficient than standard inverse probability weighting

GEE vs. GLMM: 6) Small to Moderate Samples
• GEE (with robust variance estimates):
  – Estimated standard errors are unstable and biased downwards
      · Inefficient estimating equation for estimating variance
      · Effectively uses a fully unstructured variance model
  – “Sample size” means the number of independent units
  – Various corrections have been proposed (available in PROC GLIMMIX)
• GLMM (with model-based estimates):
  – Large-sample approximations are often invoked, but performance is usually better than GEE with small to moderate N if the model is correctly specified

More Toenail Data
• Multicenter trial comparing active vs.
  control oral treatments for toenail infection
• Repeated measurements of binary outcome:
  – 0 = none or mild separation
  – 1 = severe separation
• 1908 observations in 294 patients, mostly over 1 year

    **** Standard GENMOD GEE program using Robust SEs *****;
    **** Binary outcome leads to default logistic link function ****;
    proc genmod descending;
      class id;
      model outcome = treatment month treatment*month / dist=bin;
      repeated subject=id / type=exch covb corrw;
      estimate 'Control Slope' month 1 / exp;
      estimate 'Treatment Slope' month 1 treatment*month 1 / exp;
    run;

Working Correlation Matrix

            Col1     Col2     Col3     Col4     Col5     Col6     Col7
    Row1  1.0000   0.4212   0.4212   0.4212   0.4212   0.4212   0.4212
    Row2  0.4212   1.0000   0.4212   0.4212   0.4212   0.4212   0.4212
    Row3  0.4212   0.4212   1.0000   0.4212   0.4212   0.4212   0.4212
    Row4  0.4212   0.4212   0.4212   1.0000   0.4212   0.4212   0.4212
    Row5  0.4212   0.4212   0.4212   0.4212   1.0000   0.4212   0.4212
    Row6  0.4212   0.4212   0.4212   0.4212   0.4212   1.0000   0.4212
    Row7  0.4212   0.4212   0.4212   0.4212   0.4212   0.4212   1.0000

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

    Parameter          Estimate   Standard Error   95% Confidence Limits        Z   Pr > |Z|
    Intercept           -0.5819           0.1720    -0.9191     -0.2446     -3.38     0.0007
    treatment            0.0072           0.2595    -0.5013      0.5157      0.03     0.9779
    month               -0.1713           0.0300    -0.2301     -0.1125     -5.71     <.0001
    treatment*month     -0.0777           0.0541    -0.1838      0.0283     -1.44     0.1509

Contrast Estimate Results
Slide annotation: “Can ignore in this case” (apparently referring to the Mean Estimate columns)

    Label                   Mean Estimate   Mean Confidence Limits   L'Beta Estimate   Standard Error
    Control Slope                  0.4573     0.4427       0.4719            -0.1713           0.0300
    Exp(Control Slope)                                                        0.8426           0.0253
    Treatment Slope                0.4381     0.4165       0.4599            -0.2490           0.0450
    Exp(Treatment Slope)                                                      0.7796           0.0351

    Label                   Alpha   L'Beta Confidence Limits   Chi-Square   Pr > ChiSq
    Control Slope            0.05    -0.2301       -0.1125          32.60       <.0001
    Exp(Control Slope)       0.05     0.7945        0.8936
    Treatment Slope          0.05    -0.3373       -0.1607          30.57       <.0001
    Exp(Treatment Slope)     0.05     0.7137        0.8515

    **** GLIMMIX GLMM Estimating Subject Specific Effects ****;
    **** Binary outcome leading to default logistic link function ****;
    proc glimmix method=RSPL data=toenail;
      class id;
      model outcome (event="1") = treatment month treatment*month / s dist=binary;
      random int / subject=id;
      estimate 'Control Slope' month 1 / or;
      estimate 'Treatment Slope' month 1 treatment*month 1 / or cl;
    run;

Solutions for Fixed Effects

    Effect             Estimate   Standard Error     DF   t Value   Pr > |t|
    Intercept           -0.7204           0.2370    292     -3.04     0.0026
    treatment          -0.02594           0.3360   1612     -0.08     0.9385
    month               -0.2782          0.03222   1612     -8.64     <.0001
    treatment*month    -0.09583          0.05105   1612     -1.88     0.0607

    *** Small Sample;
    data small;
      set toenail;
      if id <= 20;
    run;

    ** Standard GENMOD GEE with Robust SEs: 17 Patients Only ***;
    ** Binary outcome leading to default logistic link function **;
    proc genmod descending;
      class id;
      model outcome = treatment month treatment*month / dist=bin;
      repeated subject=id / type=exch covb corrw;
    run;

    Parameter          Estimate   Standard Error   95% Confidence Limits        Z   Pr > |Z|
    Intercept           -0.3558           0.6272    -1.5851      0.8736     -0.57     0.5706
    treatment            0.0527           0.9679    -1.8444      1.9497      0.05     0.9566
    month               -0.1543           0.0991    -0.3485      0.0400     -1.56     0.1196
    treatment*month      0.0272           0.1725    -0.3109      0.3654      0.16     0.8746

    **** GLIMMIX GEE program using Robust SEs;
    **** Binary outcome leads to default logistic link function;
    **** Restricted to 17 patients;
    **** Small N Adjustment of Morel, Bokossa, and Neerchal (2003);
    proc glimmix method=RSPL empirical=mbn data=small;
      class id;
      model outcome (event="1") = treatment month treatment*month / s dist=binary ddfm=kenwardroger;
      random _residual_ / subject=id type=cs;
    run;

Solutions for Fixed Effects

    Effect             Estimate   Standard Error     DF   t Value   Pr > |t|
    Intercept           -0.3605           0.7369     15     -0.49     0.6317
    treatment           0.05762           1.1209     15      0.05     0.9597
    month               -0.1530           0.1197     94     -1.28     0.2043
    treatment*month     0.02560           0.1984     94      0.13     0.8976

That’s all
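
Worked check: slopes as per-month odds ratios
The full-sample toenail results can be read on the odds-ratio scale by exponentiating the log-odds slopes; the treatment-arm slope is the month coefficient plus the treatment*month coefficient. The GEE lines below reproduce the Exp(...) rows of the GENMOD contrast output. The GLMM lines are computed here from the reported GLIMMIX fixed effects and are an illustration, not output shown on the slides.

```latex
\begin{align*}
\text{GEE (population average), control:}   \quad & \exp(-0.1713) \approx 0.8426 \\
\text{GEE (population average), treatment:} \quad & \exp(-0.1713 - 0.0777) = \exp(-0.2490) \approx 0.7796 \\
\text{GLMM (subject specific), control:}    \quad & \exp(-0.2782) \approx 0.757 \\
\text{GLMM (subject specific), treatment:}  \quad & \exp(-0.2782 - 0.09583) = \exp(-0.37403) \approx 0.688
\end{align*}
```

In this example the population-average odds ratios lie closer to 1 than the subject-specific ones, the same direction of difference as in the extreme example (OR = 2.7 vs. OR = 20).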