Exact Logistic Regression
Larry Cook

Outline
• Review the logistic regression model
• Explore an example where model assumptions fail
  – Brief algebraic interlude
• Explore an example with a different issue where logistic regression fails
• Computational considerations
• Example SAS code

Logistic Regression
• Model a binary outcome, Y, with one or more predictors
  – Success/failure
  – Disease/not disease
• Model the outcome in terms of the log odds of a success
• log(odds of Yi = 1) = a + bxi

Why Log Odds?
• Canonical link function
• Makes a binary outcome continuous
• Solves this problem
  – Probability is constrained to [0, 1]
  – Odds are constrained to [0, ∞)
• Log odds are in (-∞, ∞)
• Exponentiating coefficients gives us estimates of odds ratios

Example: Motor Vehicle Crash Fatalities
• What are the odds of being hospitalized or killed in a motor vehicle crash for drivers using safety restraints vs. those who are not?
  – Outcome: Hospitalized/killed or not
  – Covariate: safety belt use

Hospital/Killed * Restraint Use
OR = 0.22, p-value < 0.001

Example: Motor Vehicle Crash Fatalities
• What are the odds of being hospitalized or killed in a motor vehicle crash for drivers using safety restraints vs. those who are not?
  – Outcome: Hospitalized/killed or not
  – Covariates: safety belt use, gender, age, alcohol, rural area

Logistic Regression Output
Parameter       Estimate   Odds Ratio   P-value
Intercept       -0.261                  < 0.001
Male            -0.576     0.56         < 0.001
Restraint Use   -1.430     0.24         < 0.001
Alcohol          1.065     2.90         < 0.001
Night            0.194     1.21           0.011
Rural            0.135     1.14         < 0.001

Assumptions
• Conditional probabilities follow a logistic function of the independent variables
• Observations are independent
• Asymptotics
  – Sample size is large enough
  – Minimum of 50 to 100 observations
  – 10 successes/failures per variable

Corneal Graft Rejections
• What if we are studying a rare disease?
• Data for eight kids in the young age group and eight in the older age group
• Hypothesis is that rejection is more likely in older children

Graft Rejections
                       Young (< 4 y.o.)   Older (> 4 y.o.)   Total
                       (X = 0)            (X = 1)
No Rejection (Y = 0)   7                  2                  9
Rejection (Y = 1)      1                  6                  7
Total                  8                  8                  16

OR = 21, p-value = 0.012, 100% of cells have expected counts < 5!!!
Fisher’s Exact Test p-value (2-sided) = 0.0406; (1-sided) = 0.0203

Let’s Tackle the Graft Rejection Example as Logistic Regression

Graft Rejections
               Young (< 4 y.o.)   Older (> 4 y.o.)   Total
No Rejection   7                  2                  9
Rejection      1                  6                  7
Total          8                  8                  16

Sample Size << 50!
Don’t have 10 successes or 10 failures!

Exact (Conditional) Logistic Regression
• Rather than using unconditional logistic regression, we will condition on nuisance parameters
• Use conditional maximum likelihood for estimation and inference

Warning
Algebra Ahead
Proceed with Caution

Logistic Model

Likelihood of a Sample

Sufficient Statistics

Conditioning
• If we are only trying to describe the relationship between rejection and age, do we care about the value of the intercept?
• Remove the intercept, a, from the likelihood by conditioning on its sufficient statistic, t0 = Σyi.
• Let S(t0) = set of all tables with Σyi = t0 and the observed sample sizes

Conditional Likelihood

Estimation

Inference

End of Algebra
Back to Example

Graft Rejections
                       Young (< 4 y.o.)   Older (> 4 y.o.)   Total
                       (X = 0)            (X = 1)
No Rejection (Y = 0)   7                  2                  9
Rejection (Y = 1)      1                  6                  7
Total                  8                  8                  16

Sufficient Statistics
t0 = Σyi = # of rejections = 7
t1 = Σxiyi = 0*(# of rejections in young) + 1*(# of rejections in old) = 0*1 + 1*6 = 6

Conditional Distribution for Graft Rejection
• Need to calculate all possible tables that have exactly 7 rejections
• Calculate how often each of the tables occurs
• Calculate the CMLE
• Calculate how rare our table is to obtain a p-value

Reference Set
Yng_NR   Yng_R   Old_NR   Old_R   t0   t1   Count    P[Table]
1        7       8        0       7    0    8        0.0007
2        6       7        1       7    1    224      0.0196
3        5       6        2       7    2    1,568    0.1371
4        4       5        3       7    3    3,920    0.3427
5        3       4        4       7    4    3,920    0.3427
6        2       3        5       7    5    1,568    0.1371
7        1       2        6       7    6    224      0.0196
8        0       1        7       7    7    8        0.0007
Total                                       11,440   1.000

Estimate b and Find a p-value
t1   Count   P[Table]
0    8       0.0007
1    224     0.0196
2    1,568   0.1371
3    3,920   0.3427
4    3,920   0.3427
5    1,568   0.1371
6    224     0.0196
7    8       0.0007

Estimate and p-value

Confidence Interval
• Lower Bound, b-
  – If t1 = t1,min: b- = -∞
  – Otherwise: b- is the value of b that produces an upper p-value of α/2
• Upper Bound, b+
  – If t1 = t1,max: b+ = ∞
  – Otherwise: b+ is the value of b that produces a lower p-value of α/2

Final Stats for Graft Rejection

Example 2
PECARN C-Spine Study

Case Control Study
              Control   Case   Total
Not Present   1,057     540    1,597
Present       2         0      2
Total         1,059     540    1,599

Any problems estimating the odds ratio?
Could exact logistic regression help?

What sufficient statistics are needed?
                  Not Present (X = 0)   Present (X = 1)   Total
Control (Y = 0)   1,057                 2                 1,059
Case (Y = 1)      540                   0                 540
Total             1,597                 2                 1,599

• Σy = 2
• Σxy = 0

Conditional Density
Case P   Case NP   Ctrl P   Ctrl NP   t0   t1   Count       P[Table]
0        540       2        1,057     2    0    560,211     0.438
1        539       1        1,058     2    1    571,860     0.448
2        538       0        1,059     2    2    145,530     0.114
Total                                           1,277,601   1.000

One-sided p-value = 0.438
Two-sided p-value = 2*0.438 = 0.876
95% confidence interval (-∞, 2.345)
Point estimate?
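The counts and p-values in the two reference sets above can be reproduced with a short Python sketch. It assumes the single-binary-covariate case, where each table count is a product of two binomial coefficients and the conditional probability of t1 is proportional to count × exp(b·t1). The names `conditional_counts`, `conditional_pmf`, and `cmle`, and the bisection bounds of ±20, are illustrative choices and do not come from the slides. For the PECARN data the sketch follows the slide's table, which counts the ways to place the 2 "present" subjects among 540 cases and 1,059 controls.

```python
from math import comb, exp


def conditional_counts(n0, n1, t0):
    """Map each feasible t1 to the number of tables with t0 successes overall,
    t1 of them in the x = 1 group (size n1) and t0 - t1 in the x = 0 group (size n0)."""
    lo, hi = max(0, t0 - n0), min(n1, t0)
    return {t1: comb(n1, t1) * comb(n0, t0 - t1) for t1 in range(lo, hi + 1)}


def conditional_pmf(counts, beta=0.0):
    """P[T1 = t | t0, beta] = c(t) * exp(beta * t) / sum_u c(u) * exp(beta * u)."""
    weights = {t: c * exp(beta * t) for t, c in counts.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def cmle(counts, t_obs, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection for the beta whose conditional mean of T1 equals t_obs.
    Only meaningful when t_obs lies strictly inside the support; the
    search bounds are an assumption of this sketch."""
    def mean(beta):
        return sum(t * p for t, p in conditional_pmf(counts, beta).items())

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < t_obs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Graft rejection: 8 young (x = 0), 8 older (x = 1), 7 rejections, t1 observed = 6
graft = conditional_counts(n0=8, n1=8, t0=7)
graft_pmf = conditional_pmf(graft)  # distribution under b = 0
graft_p_upper = sum(p for t, p in graft_pmf.items() if t >= 6)
graft_beta_hat = cmle(graft, t_obs=6)

# PECARN: 1,059 controls, 540 cases, 2 "present" in total, 0 of them among cases
pecarn = conditional_counts(n0=1059, n1=540, t0=2)
pecarn_p_lower = sum(p for t, p in conditional_pmf(pecarn).items() if t <= 0)
```

In the PECARN example the observed t1 sits on the boundary of its support, so `cmle` has no finite solution there; that is the situation the "Point estimate?" question points at.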
Median Unbiased Estimate

One More Example

Dose Response Toxicology Experiment
• 400 mice randomized to one of four levels of a drug
• Drug administered to each animal
• Outcome is the number of deaths in each dose level

Dose level   0     1     2     3     Total
Lived        99    97    95    90    381
Died         1     3     5     10    19
Total        100   100   100   100   400

Σy = 19
Σxy = 3 + 10 + 30 = 43

Exact vs. Unconditional
Exact
• Estimate = 0.710
• SE = 0.246
• OR = 2.03
• CI = (1.26, 3.52)
• p-value = 0.002
Unconditional
• Estimate = 0.712
• SE = 0.246
• OR = 2.04
• CI = (1.26, 3.30)
• p-value = 0.004

Computational Issues

Counting All the Tables
• One of the main hurdles for conditional logistic regression is counting all the tables in the sample space
  – Graft rejections – 11,440 possibilities
  – PECARN C-Spine – 1,277,601
  – Toxicology – 2.79 x 10^33
• Obviously don’t want to generate tables one at a time

Network Algorithm
• Graphical representation of the sample space
• Nodes represent a partial sum of the sufficient statistic
• Arcs have a combinatorial weighting value
• One path through the graph represents a table in the sample space

Example
        X=1   X=2   X=3   X=4   Total
Y=0     3     2     2     1     8
Y=1     0     1     1     2     4
Total   3     3     3     3     12

Sufficient Statistics
t0 = Σyi = 4
t1 = Σxiyi = 1*0 + 2*1 + 3*1 + 4*2 = 13

[Figure: network with nodes (0,0), (1,0), (2,0), (1,1), (2,1), (3,1), (1,2), (2,2), (3,2), (1,3), (2,3), (3,3), (2,4), (3,4), (4,4)]

        X=1   X=2   X=3   X=4   Total
Y=0     1     3     1     3     8
Y=1     2     0     2     0     4

[Figure: same network]

        X=1   X=2   X=3   X=4   Total
Y=0     3     2     2     1     8
Y=1     0     1     1     2     4

Network Representation of the Sample Space
[Figure: same network]

What About Multiple Covariates?
More Conditioning!
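The single-covariate network idea described above can be sketched as a stage-by-stage dynamic program. This is a simplified illustration, not the algorithm as implemented in any particular package: here a node is the pair of partial sums (Σy, Σxy) after each covariate group, an arc that places k successes in a group of size n carries the weight C(n, k), and the number of paths into a node is accumulated instead of listing tables one at a time. The function name `network_counts` is hypothetical.

```python
from collections import defaultdict
from math import comb


def network_counts(x_values, group_sizes):
    """Return {(t0, t1): number of tables}, processing one covariate group per stage.

    A node is the pair of partial sums (sum of y, sum of x*y) so far.
    An arc that puts k successes into a group of size n has weight C(n, k).
    """
    layer = {(0, 0): 1}  # start node with a single empty path
    for x, n in zip(x_values, group_sizes):
        next_layer = defaultdict(int)
        for (s0, s1), paths in layer.items():
            for k in range(n + 1):
                next_layer[(s0 + k, s1 + x * k)] += paths * comb(n, k)
        layer = dict(next_layer)
    return layer


# Four groups of size 3 with X = 1, 2, 3, 4, as in the small example
counts = network_counts([1, 2, 3, 4], [3, 3, 3, 3])

# Conditioning on the intercept: keep only terminal nodes with t0 = 4
t1_given_t0 = {t1: c for (t0, t1), c in sorted(counts.items()) if t0 == 4}
total_tables = sum(t1_given_t0.values())
p_table = {t1: c / total_tables for t1, c in t1_given_t0.items()}
```

With more covariates, the same sketch would carry one extra partial sum per covariate in each node, which is one way to picture the additional conditioning.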
Osteogenic Sarcoma
LogXact Manual
• 46 patients surgically treated for osteogenic sarcoma and then observed for disease recurrence within 3 years
• Covariates
  – Sex: Male = 1, Female = 0
  – Any Osteoid Pathology (AOP)
    • Present = 1, not = 0
• Interested in the effect of AOP

Osteogenic Sarcoma
                                                          Covariates
Covariate Group   No Recurrence   Recurrence   Group Size   Sex    AOP
                  (y = 0)         (y = 1)      (ni)         (x1)   (x2)
1                 8               0            8            0      0
2                 5               2            7            0      1
3                 9               4            13           1      0
4                 7               11           18           1      1
Total             29              17           46

Estimating the Effect of AOP
• New statistics to condition on
  – Group sizes
  – Sufficient statistic for intercept, Σy = 17
  – Sufficient statistic for coefficient for sex, Σx1y = 15
• Calculate the conditional distribution of Σx2y
  – Sufficient statistic for coefficient for AOP
  – Number of cases with AOP in recurrence (= 13)
  – Given exactly 17 with recurrence, 15 of which are males

Network Algorithm
• The Network Algorithm uses two passes
  – First pass conditions on the intercept
    • All tables with exactly 17 cases in recurrence
  – Second pass removes arcs that don’t produce the sufficient statistic for sex
    • All tables that don’t have 15 males in recurrence
• Proceed with estimation & inference as before

P[Σx2y = t2 | 17 in recurrence and 15 males]

Results

LR Test for Both Variables
• To test that the coefficients for sex and AOP are both zero simultaneously, need the joint conditional density
  – All possible combinations of males and patients with AOP in recurrence given exactly 17 patients in recurrence
  – Determine how rare it is to have 15 recurrent males AND 13 recurrent AOP patients

SAS Examples

Conclusion
• Exact (conditional) logistic regression
  – Useful method when asymptotic assumptions are not met or with separation
  – Utilizes conditioning to remove nuisance parameters from the likelihood
  – Very computationally intensive method
  – Network algorithm speeds up calculations

Questions?
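As a supplement to the osteogenic sarcoma example, the conditional distribution of Σx2y given Σy = 17 and Σx1y = 15 is small enough to obtain by direct enumeration over the four group counts. The sketch below does that by brute force instead of the two-pass network algorithm; the function name `conditional_counts_aop` and the variable names are illustrative only, and the probabilities are computed under a zero AOP coefficient.

```python
from itertools import product
from math import comb

# (sex x1, AOP x2, group size n) for the four covariate groups in the table
groups = [(0, 0, 8), (0, 1, 7), (1, 0, 13), (1, 1, 18)]


def conditional_counts_aop(groups, t0, t1):
    """Count tables by t2 = sum(x2*y), keeping only tables with
    sum(y) = t0 and sum(x1*y) = t1."""
    counts = {}
    ranges = [range(n + 1) for _, _, n in groups]
    for ys in product(*ranges):  # ys = recurrences in each covariate group
        if sum(ys) != t0:
            continue
        if sum(x1 * y for (x1, _, _), y in zip(groups, ys)) != t1:
            continue
        t2 = sum(x2 * y for (_, x2, _), y in zip(groups, ys))
        weight = 1
        for (_, _, n), y in zip(groups, ys):
            weight *= comb(n, y)  # ways to choose which patients recur
        counts[t2] = counts.get(t2, 0) + weight
    return dict(sorted(counts.items()))


aop_counts = conditional_counts_aop(groups, t0=17, t1=15)
aop_total = sum(aop_counts.values())
aop_pmf = {t2: c / aop_total for t2, c in aop_counts.items()}

# One-sided (upper) p-value for the observed t2 = 13 under a zero AOP coefficient
aop_p_upper = sum(p for t2, p in aop_pmf.items() if t2 >= 13)
```

Brute force is workable here because there are only 9 × 8 × 14 × 19 candidate tables; it would not scale to the larger examples, which is where the network algorithm earns its keep.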