Department of Epidemiology and Public Health Unit of Biostatistics and Computational Sciences Regression models for binary and survival data PD Dr. C. Schindler Swiss Tropical and Public Health Institute University of Basel christian.schindler@unibas.ch Annual meeting of the Swiss Societies of Clinical Neurophysiology and of Neurology, Lugano, May 3rd 2012 Binary outcome data Binary endpoints Y Examples: Y = „death within 1 year“ Y = „disease progression within 2 years“ Y = „remission within three months“ Y = 1, event occurred = 0, otherwise Preliminaries: 1. Meaning of E(Y) for a binary variable Y E(Y) = Mean of Y at the population level = P(Y = 0) · 0 + P(Y = 1) · 1 = P(Y = 1) Thus, mean of Y = probability of the event represented by Y. 2. Notion of odds Odds (Y) = P(Y = 1) Example: P(Y = 1) / P(Y = 0) = P(Y = 1) / [1 – P(Y = 1)] = Odds(Y) / [1 + Odds(Y)] (*) Y = „disease progression“ P(Y =1) = 0.3 Odds(Y) = 0.3 / [1 – 0.3] = 0.3 / 0.7 = 0.429 P(Y=1) = 0.429 / [1 + 0.429] = 0.3 3. Notion of odds ratio (OR) X = „high risk“ (0 -> normal risk, 1 -> increased risk) P(Y = 1 | X = 0) = P(Y = 1) in subjects with X = 0 = 0.2 P(Y = 1 | X = 1) = P(Y = 1) in subjects with X = 1 = 0.4 Odds(Y | X = 0) = 0.2 / 0.8 = 0.25 Odds(Y | X = 1) = 0.4 / 0.6 = 0.667 OR(Y | X) = OR(Y | X = 1)/OR(Y | X = 0) = 0.667 / 0.25 = 2.67 with outcome (Y = 1) w/o outcome (Y = 0) with risk factor (X = 1) 40 60 100 w/o risk factor (X = 0) 20 80 100 70 130 200 Symmetry of OR: Prospective (cohort study): OR of Y between X=1 and X=0 = 40/60 : 20/80 = 2.67 Retrospective (case control study) OR of X between Y=1 and Y=0 = 40/20 : 60/80 = 2.67 4. Calculus of odds and odds ratios Example: risk of disease progression without risk factors A and B = 20%. OR of disease progression associated with risk factor A = 2.0. OR of disease progression associated with risk factor B = 3.0. Odds without risk factors = 0.2 / (1 – 0.2) = 0.25. a) Odds with risk factor A only = 0.25 2.0 = 0.5 b) Odds with risk factor B only = 0.25 3.0 = 0.75 c) Odds with both risk factors = 0.25 2.0 3.0 = 1.5 (if factors do not interact) Corresponding risks: a) 0.5/1.5 = 0.33, b) 0.75/1.75 = 0.43, c) 1.5/2.5 = 0.6 Regresssion models for probabilities / odds Idea: As in classical regression consider E(Y) = b0 + b1 · risk score Equivalent formulation of the model: P(Y = 1) = b0 + b1 · risk score Problem: P(Y = 1) in (0, 1) = b0 + b1· risk score can take values outside (0,1) Solution: P(Y = 1) = F ( b0 + b1 · risk score ) z = linear predictor (linear prediction score) where F(z) is a function whose values are always in (0, 1) Logistic function (standard choice) 1.0 F(z) 0.8 0.6 F(z) = ez / (1 + ez) 0.4 0.2 0.0 -4 -2 a) ez < 1 + ez => F(z) < 1 0 z 2 b) ez > 0 => F(z) > 0 . 4 Y = Outcome (1 = present, 0 = absent) X = risk factor (1 = present, 0 = absent) Linear predictor (logit): P (Y 1 | X ) e z = z 1 e z b0 + b1 · x e b 0 b1 x 1 e b 0 b1 x Recalling that P(Y = 1|X) = Odds(Y|X) / [1 + Odds(Y|X)] shows that Odds (Y | X ) e z e b 0 b1 x b 0 b 1 1 x = 1: Odds (Y ) e x = 0: Odds (Y ) e b 0 b 1 0 e b 0 b1 e b0 Odds Ratio of Y between x=1 and x=0: e = > b0 = ln (Odds(Y|X = 0)) b 0 b1 /e b0 e b1 = ln (OR(Y|X)) b1 Logistic regression P (Y 1) e z 1 e z e b 0 b1 x 1 e b 0 b1 x Probit regression P (Y 1) z b 0 b 1 x F = cumulative density function of standard normal distribution (another sigmoid shaped function ranging from 0 to 1) Logistic regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -117.34141 = = = = 200 9.66 0.0019 0.0395 -----------------------------------------------------------------------------outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------risk_factor | .9808289 .3227486 3.04 0.002 .3482533 1.613404 _cons | -1.386294 .25 -5.55 0.000 -1.876285 -.896303 ------------------------------------------------------------------------------ ln (Odds Ratio) = 0.9808 Odds Ratio ln (Odds(Y|X = 0) = -1.3863 Odds(Y | X = 0) = e 0.9808 = 2.67 = e-1.3863 = 0.25 P(Y = 1 | X = 0) = 0.25 / (1 + 0.25) = 0.2 P(Y = 1 | X = 1) = 0.25 2.67 / (1 + 0.25 2.67) = 0.4 Summary exp(coefficient of risk factor) = OR of outcome between those with and those without risk factor (cohort study) = OR of risk factor between those with and those without outcome (case-control study) exp(intercept term) = odds of outcome among unexposed subjects Direct output of Odds ratio and odds of unexposed: Logistic regression Number of obs LR chi2(1) Prob > chi2 Pseudo R2 Log likelihood = -117.34141 = = = = 200 9.66 0.0019 0.0395 -----------------------------------------------------------------------------outcome | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------risk_factor | 2.666666 .8606626 3.04 0.002 1.416591 5.019872 _cons | .2500001 .0625 -5.55 0.000 .153158 .4080755 ------------------------------------------------------------------------------ 95%-confidence interval of odds ratio: 0 1 2 3 ( 1.42 , 5.02) 4 5 Note: 1. Confidence intervals of odds ratios are asymmetrical! 2. If they do not include 1, then the respective association is statistically significant at the 5% level. Example from the literature Ritchie K. et al., The neuroprotective effects of caffeine – a prospective population study (The Three City Study), Neurology 2007; 69: 536-45 outcome = cognitive decline, measured as either a) decline by at least 6 units in Isaacs set test or b) decline by at least 2 units in Benton visual retention test over four years 1) 3) 2) 1. On average, odds of CD increased by 6% per additional year of age at baseline. 2. On average, odds of CD increased by 8% per additional unit in baseline cognitive test. 3. Compared to subjects with 5 years of education, the odds of CD among subjects with ≥12 years of education was reduced by more than 40%. Main result of paper: Significantly reduced risk of DIsaacs -6 among women who drank more than 3 units of caffeine per day at baseline compared to women who only drank 0 – 1 units CAVE: NOTE: Odds ratio (OR) Odds Odds > relative risk (RR) risk (relative frequency) risk (relative frequency) possible situations for OR and RR: a) OR < RR < 1 b) OR = RR = 1 c) 1 < RR < OR Interpretation of OR as relative risk and of odds as risk (relative frequency) is only appropriate if risks are small (i.e., < 10%) Model comparison Likelihood ratio test Akaike information criterion (AIC) Bayesian information criterion (BIC) Pseudo-R2 Likelihood L of a model Probability of observing exactly the same outcome data again if exactly the same predictor data are given, provided that the model describes reality in the best possible way with the given variables. log-Likelihood ln(L) natural logarithm of the likelihood ln(L) is always 0 , since probabilities are 1. The perfect model would have L = 1 and ln(L) = 0. The better the model, the closer ln(L) is to 0. Comparison of nested models A model M1 is said to be nested in another model M2 , if M1 is a special case of M2, e.g., if all the terms of M1 are also in M2 but not vice versa. Under the hypothesis, that the additional terms of M2 are of no predictive value, the difference likelihood ratio D = 2 [ln(L2) - ln(L1)] = 2 ln(L2/L1) has an approximate Chi2-distribution with df2 - df1 degrees of freedom (where dfi = number of parameters of model i). CAVE: Both models must be based on exactly the same data. In particular, their n‘s must be identical. Akaike information criterion („smaller is better“) AIC = - 2 ln(L) + 2 p penalty for complexity of the model (p = number of parameters of the model in addition to the intercept parameter) The two models compared must be based on exactly the same data (same n!), but they need not be nested and can contain different variables. Bayesian information criterion (Schwarz criterion) („smaller is better“) BIC = penalty for complexity of the model - 2 ln(L) + p ln(n) (p = number of parameters of the model in addition to the intercept parameter, n = sample size) The two models compared need not be based on the same data nor do they have to be nested. Pseudo-R2 There exists an analog of R2 with logistic and other generalized linear regression models. „total variance“ ln(Lnull model) ln(Lmodel) 0 „variance explained“ Pseudo-R2 = [ ln(Lmodel) - ln(Lnull model) ] / [ 0 - ln(Lnull model ) ] Null model = model with an intercept term only. Goodness of fit of a logistic regression model Can be assessed using the Hosmer-Lemeshow-Test. Mechanics of the test: 1. For each subject, the logistic regression model predicts its individual probability of having Y = 1. 2. Subjects are then categorized into a certain number of classes based on the size of their predicted probabilities. 3. In each of the classes, the proportion of subjects with Y = 1 is determined and compared with the mean value of the predicted probabilities (ideally the two values should coincide in each class). Analysis of survival data Censored and uncensored survival times t0 Observation period (= individual time scale!) x t1 true event -> uncensored survival time o loss to follow-up -> censored survival time Event-free survival until t1 -> censored survival time In survival analyses, two outcome variables are needed: 1. Event variable 1 = event was observed -> uncensored observation 0 = event was not observed -> censored observation 2. Variable for event-free time observed = time until event (uncensored observation) observation time (censored observation) Simple group comparisons of survival data 1. Construction of survival curves using the method of Kaplan-Meier 2. Comparison of survival curves using the log rank or the Wilcoxon test. 0.00 0.25 0.50 0.75 1.00 Kaplan-Meier survival estimates 0 20 group = 1 40 60 analysis time (weeks) group = 2 80 100 group = 3 S(t) = Proportion of patients without event until time t S(t+Dt) S(t) – S(t) h(t) Dt S(t+Dt) – S(t) -S(t) h(t) Dt S(t+Dt) – S(t) -h(t) Dt S(t) S(t+Dt) – S(t) S(t) -h(t) Dt • S(t) S(t) = -h(t) d/dt [ln(S(t))] = -h(t) h(t) = - d/dt [ln(S(t))] h(t) = instantaneous event risk (hazard at time t) S(0) S(t) S(t+Dt) t t+Dt Assumption of proportional hazards (PH) h1(t) = hazard function in group 1 h2(t) = hazard function in group 2 h2(t) / h1(t) = HR (hazard ratio) h2(t) = HR h1(t) PH: ratio of hazards is independent of time t 1.00 Kaplan-Meier survival curves 0 0.00 0.25 .005 .01 0.50 .015 0.75 .02 Smoothed hazard estimates 0 20 40 60 80 analysis time (weeks) group = 1 group = 2 group = 3 100 0 20 40 60 80 analysis time (weeks) group = 1 group = 2 group = 3 Hazard functions of group 1 and 2 are proportional. Hazard function of group 3 violates PH assumption 100 Logarithmized hazard estimates 0 -6.5 -6 .005 -5.5 .01 -5 .015 -4.5 -4 .02 Smoothed hazard estimates 0 20 40 60 80 analysis time (weeks) group = 1 group = 2 group = 3 100 0 20 40 60 80 analysis time (weeks) 100 group = 1 group = 2 group = 3 Logarithmized hazards of group 1 and 2 run parrallel (PH-assumption). Modelling of the hazard ratio Sir David Roxbee Cox * 1924 1950-1956 Cambridge University 1956-1966 Birkbeck College, London 1966-1988 Imperial College, London 1988-1994 Oxford University HR = e bx x=1 with risk factor x=0 without risk factor x dichotomous HR = e b1 = e b = hazard ratio associated with risk factor HR = e x continuous, HR = e b1 bx e.g. x = age at baseline = e b = hazard ratio associated with a unit increase in x (cross-sectional comparison) Multiple proportional hazard regression model („Cox-regression model“) HR = e b1x1 + b2x2 + ...... + bkxk Reference category: subjects with x1 = x1 = ... = xk = 0 In our example: No. of subjects = No. of failures = Time at risk = Log likelihood = 1500 671 97331 -4697.2077 Number of obs = 1500 LR chi2(2) Prob > chi2 = = 39.93 0.0000 -----------------------------------------------------------------------------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------group_2 | 1.393155 .1395814 3.31 0.001 1.144766 1.695439 group_3 | 1.832595 .1779825 6.24 0.000 1.514947 2.216846 ------------------------------------------------------------------------------ On average, the hazards in group 3 and 2 were higher by 83% (95%-CI: 51 to 121%) and 39% (95%-CI: 15 to 70%), respectively, than in group 1 (= reference group). Analysis restricted to first year: No. of subjects = No. of failures = Time at risk = 1500 579 55997 Number of obs = 1500 LR chi2(2) = 54.84 Log likelihood = -4073.1439 Prob > chi2 = 0.0000 -----------------------------------------------------------------------------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------group_2 | 1.387788 .1550301 2.93 0.003 1.114898 1.727472 group_3 | 2.122101 .222757 7.17 0.000 1.727489 2.606854 ------------------------------------------------------------------------------ Analysis restricted to survivors of first year: No. of subjects = No. of failures = Time at risk = 872 92 41334 Number of obs = 872 LR chi2(2) = 11.92 Log likelihood = -610.6509 Prob > chi2 = 0.0026 -----------------------------------------------------------------------------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------group_2 | 1.442213 .3266646 1.62 0.106 .9251881 2.248167 group_3 | .5261936 .1709134 -1.98 0.048 .2783979 .9945465 ------------------------------------------------------------------------------ Conclusion: 1. Hazard ratio between group 2 and group 1 was very similar in both years, i.e., around 1.4. (confirming the proportionality of the two hazard functions). 2. Hazard ratio between group 3 and group 1 was higher than 2 in the first year but smaller than 1 in the second year (confirming the sharp decrease of the hazard function of group 3, falling below the one of group 1 after 1 year). Example from the literature Koch M. et al., The natural history of secondary progressive multiple sclerosis, J. Neurol Neurosurg Psychiatry 2001; 81:1039-43 Some study characteristics: - 5207 patients from British Columbia with a remitting disease at baseline. - Onset of immunomodulatory treatment was considered as censoring event. Kaplan-Meier curves: PH-assumption was probably not satisfied by the factor gender! 1. On average, the covariate-adjusted hazard of secondary progression was higher in men than in women by 43% and the median of time to secondary progression was lower by 25% (i.e., 17.1 vs. 22.7 years). 2. On average, the covariate-adjusted hazard of secondary progression increased by 5% with each additional year of age at baseline. CAVE: 1. In general, the ratio of median or mean survival times between two groups is not inversely proportional to the hazard ratio. But in the special case of constant hazards, this relation holds for the mean survival times. 2. The hazard ratio may be interpreted as relative risk only if the event rates are small in the respective groups during the time period of interest. Thank you for your attention again!