Are We Overstating Our Confidence from Probabilistic Cost-Effectiveness Analysis?

Henry Glick
University of Pennsylvania
Issues in Cost-Effectiveness Analysis
www.uphs.upenn.edu/dgimhsr/presentations.htm
AcademyHealth, Boston, Massachusetts
6/28/2010

Background

• Prior to the 1990s, medical decision makers did not directly address sampling uncertainty
  – Instead relied on sensitivity analysis to evaluate the robustness of model-based recommendations
• Mid-1980s saw development of initial methods for use of probabilistic cost-effectiveness analysis (PCEA)
  – i.e., second-order Monte Carlo analysis (e.g., Doubilet et al. and Critchfield and Willard in Medical Decision Making)
• Eddy et al.'s Meta-Analysis by the Confidence Profile Method (1991) and O'Brien et al.'s In Search of Power and Significance (1994) set the stage for modern PCEA

Probabilistic Cost-Effectiveness Analysis

• Substitute distributions of variables for point estimates
  – e.g., use the distribution of the mean cost of living in a state in place of the point estimate of this mean cost
• Repeatedly run the model (at least 1000 times)
• In each repeated "run," draw a sample mean from each of the distributions
• As with deterministic CEA, average out the resulting tree
• As with bootstrapping primary data, use the results of the repeated Monte Carlo runs of the model to draw statistical conclusions, such as whether or not the difference in cost or the difference in effect is significant
• (A sketch of this procedure appears after the next slide)

Recent Rapid Adoption

• Use of PCEA has been growing throughout the last decade
  – Treatment of obstructive sleep apnea: of the 10 cost-effectiveness analyses published in Medline-indexed journals between 1999 and 2009, the 3 published before 2006 were deterministic CEAs, while the 7 published from 2006 on were all PCEAs
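The following is a minimal sketch of the second-order Monte Carlo procedure described above, assuming a hypothetical two-state, two-arm tree; all distributions, parameter values, and function names (e.g., one_run, average_out) are illustrative and are not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(20100628)

def one_run():
    """One second-order Monte Carlo run: draw a sample MEAN from each
    parameter's distribution, then average out a toy two-state tree."""
    # Illustrative distributions of means (not patient-level values)
    cost_state1 = rng.normal(6000.0, 400.0)   # mean cost of a year in state 1
    cost_state2 = rng.normal(1500.0, 200.0)   # mean cost of a year in state 2
    pref_state1 = rng.beta(60, 20)            # mean preference score, state 1
    pref_state2 = rng.beta(70, 10)            # mean preference score, state 2
    p1_usual = rng.beta(45, 55)               # P(state 1), usual care
    p1_new = rng.beta(30, 70)                 # P(state 1), new treatment

    def average_out(p1):
        # Expected cost and QALYs for an arm given its state probabilities;
        # note the state costs/preferences drawn above are reused in BOTH
        # arms (the "single draw" practice examined later in the talk)
        cost = p1 * cost_state1 + (1 - p1) * cost_state2
        qaly = p1 * pref_state1 + (1 - p1) * pref_state2
        return cost, qaly

    c0, q0 = average_out(p1_usual)
    c1, q1 = average_out(p1_new)
    return c1 - c0, q1 - q0

reps = np.array([one_run() for _ in range(1000)])   # at least 1000 runs
# Treat the runs like bootstrap replicates: point estimate, SE, 95% interval
for name, d in [("difference in cost", reps[:, 0]), ("difference in QALYs", reps[:, 1])]:
    lo, hi = np.percentile(d, [2.5, 97.5])
    print(f"{name}: mean={d.mean():.4g}, SE={d.std(ddof=1):.4g}, 95% CI=({lo:.4g}, {hi:.4g})")
```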
The Problem: Standard Errors of the Difference

• Standard errors (SEs) for the difference in costs or effects between 2 treatment groups can be:
  1) Directly estimated from the Monte Carlo replicates
  2) Calculated by use of a standard formula: SE_Diff = √(SE0² + SE1²)
• The 2 need not be the same, but unless the correlation between the costs (effects) in the 2 groups is substantial, the estimates should differ by at most 10% to 20%
• Comparison of the SEs for the difference calculated from the replicates versus the formula indicated results differ by substantially more than this amount

3 Teaching Examples

          SE0       SE1       SE Diff (Replicates)   SE Diff (Formula)   % Diff   Corr
Lupus
  Cost    45,831    42,396    1,647                  62,433              97.4     .9994
  QALYs   .8864     .9286     .2396                  1.2837              81.3     .9662
Cancer
  Cost    17,602    18,366    11,942                 25,439              53.1     .780
  QALYs   .0824     .0907     .0930                  .1225               24.1     .425
COPD
  Cost    237       362       273                    433                 37.0     .657
  QALYs   .0191     .0192     .0031                  .0271               88.6     .987

3 Teaching Examples (sorted by % difference; LU = lupus, CA = cancer, CO = COPD)

              SE Diff (Replicates)   SE Diff (Formula)   % Diff   Corr
CA, QALYs     .0930                  .1225               24.1     .425
CO, Cost      273                    433                 37.0     .657
CA, Cost      11,942                 25,439              53.1     .780
LU, QALYs     .2396                  1.2837              81.3     .9662
CO, QALYs     .0031                  .0271               88.6     .987
LU, Cost      1,647                  62,433              97.4     .9994

Examples from Direct Analysis of Patient-Level Data

            SE0      SE1      SE Diff (Bootstrap)   SE Diff (Formula)   % Diff   Corr
COPD
  Cost      575      594      707                   827                 14.5     .271
  QALYs     .0205    .0181    .0257                 .0273               5.9      .119
Addiction
  Cost      1,212    1,917    2,114                 2,268               6.8      .146
  QALYs     .0283    .0194    .0317                 .0343               7.6      .160
Nursing
  Cost      4,203    4,332    6,059                 6,036               -.4      -.008
  QALYs     .0120    .0120    .0172                 .0170               -1.2     -.023

Source of the Problem

• Excessively large correlations due to what appears to be a minor modeling decision: to draw once from each distribution and use the resulting point estimate for both arms of the model
  – e.g., in one run of the model, we might draw a mean of $6000 for being in state 1 for a year and we'd use this cost for both treatment arms
• This type of modeling does not mimic the data-generating mechanism for actual patient-level data
• (A sketch of this mechanism appears after the Extent of the Problem slide)

3 Teaching Examples: Single Draw vs Separate Draws

SE of the difference, single draw:
              Replicates   Formula    % Diff   Corr
Lupus
  Cost        1,647        62,433     97.4     .9994
  QALYs       .2396        1.2837     81.3     .9662
Cancer
  Cost        11,942       25,439     53.1     .780
  QALYs       .0930        .1225      24.1     .425
COPD
  Cost        273          433        37.0     .657
  QALYs       .0031        .0271      88.6     .987

SE of the difference, separate draws:
              Replicates   Formula    % Diff   Corr
Lupus
  Cost        65,398       65,445     .1       .002
  QALYs       1.2755       1.2742     -.1      -.002
Cancer
  Cost        25,505       25,521     .1       .001
  QALYs       .1224        .1222      -.2      -.005
COPD
  Cost        436          436        0        .005
  QALYs       .0266        .0270      1.5      .033

Extent of the Problem

• Problem of shrunken standard errors for the difference not unique to my teaching examples
• Difficult to assess the extent of the problem
• Informal review of 15 Medline-indexed articles published between 9/2009 and 4/2010 found:
  – 2 displayed evidence of potential shrinkage
  – 1 had evidence that was unclear
  – 12 did not provide the information needed to assess whether the standard errors were too small or not
• Given the lack of awareness of the problem, and the fact that when evaluable it is often observed, the problem may be substantial
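The following is a minimal simulation of the mechanism described on the Source of the Problem slide, assuming a single shared state-cost parameter and toy values (a mean of $6000 with an SE of $500, and arm-specific times in state); the numbers and the function simulate are illustrative, not from the talk. It shows how reusing one draw in both arms induces a near-perfect correlation between arms and shrinks the SE of the difference estimated from the replicates relative to √(SE0² + SE1²).

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs = 10_000

# Distribution of the mean cost of a year in state 1 (illustrative values)
mu, se = 6000.0, 500.0
# The arms differ only in expected time spent in state 1 (illustrative)
time_in_state_0, time_in_state_1 = 0.80, 0.60

def simulate(separate_draws):
    if separate_draws:
        c0 = rng.normal(mu, se, n_runs)   # one draw per arm in each run
        c1 = rng.normal(mu, se, n_runs)
    else:
        c = rng.normal(mu, se, n_runs)    # a single draw reused in both arms
        c0, c1 = c, c
    cost0 = time_in_state_0 * c0          # arm 0 expected cost per run
    cost1 = time_in_state_1 * c1          # arm 1 expected cost per run
    diff = cost1 - cost0
    corr = np.corrcoef(cost0, cost1)[0, 1]
    se_replicates = diff.std(ddof=1)                                # from the replicates
    se_formula = np.hypot(cost0.std(ddof=1), cost1.std(ddof=1))     # sqrt(SE0^2 + SE1^2)
    return corr, se_replicates, se_formula

for label, sep in [("single draw", False), ("separate draws", True)]:
    corr, se_reps, se_form = simulate(sep)
    pct_diff = 100 * (se_form - se_reps) / se_form
    print(f"{label:>14}: corr={corr:.3f}  SE(replicates)={se_reps:.0f}  "
          f"SE(formula)={se_form:.0f}  % diff={pct_diff:.0f}%")
```

With these toy values, the single-draw run yields a correlation of essentially 1 and an SE of the difference roughly 80% below the formula value, while the separate-draws run yields a near-zero correlation and close agreement between the two SEs, mirroring the pattern in the teaching-example tables.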
Testing the "Single Draw" Hypothesis

• HYPOTHESIS: Dramatic shrinkage in decision models' standard errors for differences will be observed when we make a single draw from each distribution and use this draw for both treatment groups ("single draw"), and will not be observed when we make 2 draws from each distribution ("separate draws")
• To establish a gold standard for testing this hypothesis, developed a decision model from a patient-level observational dataset
  – Used a clinical variable to define 3 disease states
  – Assigned patients to one of 2 treatment groups so that we could assess the SE of the differences

Assignment to Treatment Groups

• Assignment to treatment groups made to approximately equate:
  – Initial distribution among the 3 disease states
  – Treatment costs and preference scores for being in a state for a year
• Assignment also made so the transition probabilities differed between the 2 treatment groups
• All patients had "intervention" costs, but to mimic most models, these were "zeroed out" for patients in treatment group 0
• Difference between the results for the 2 treatment groups due to the transition probabilities and the intervention costs

Simplifications

• To simplify the model:
  – Limited the model to 2 periods
  – Eliminated observations in which the patient died
  – Unable to equate treatment-specific standard deviations for costs and preferences for being in a state
• Adds noise to the comparison of the data-based versus the model-based results, but impacts should be similar across model-based results

The Gold Standard

• Analyzed the data directly by bootstrapping mean costs and QALYs for each treatment group
• Estimated:
  – Means
  – Standard errors
  – Correlations
  – P-values
  – Acceptability curve

The Model

[Figure: decision tree. A decision node splits into Usual care and Intervention arms. In each arm, patients start in the Good, Moderate, or Poor state according to an initial distribution (Dist(1; 1), Dist(1; 2), Dist(1; 3)) and then transition to Good, Moderate, or Poor in the next period according to arm-specific transition distributions: Dist(21; 1-3) for usual care (with cloned subtrees dtgood and dtpoor) and Dist(24; 1-3) for the intervention (with cloned subtrees dtgoodi and dtpoori).]

The Distributions

• Used the primary data to define distributions
  – Dirichlet and beta distributions for the initial distribution and the transitions
  – For the primary hypothesis, built 2 versions of the model:
    • One used multivariate normal distributions for costs and preference scores (incorporated these data's full correlation structure)
    • A second model used independent gamma distributions for costs and preference scores
  – For a secondary question about the impact of use of gamma distributions, a third model used independent normal distributions

Analysis

• Developed one set of model-based predictions of means, SEs, correlations, p-values, and the acceptability curve by use of a single draw from each distribution
• Developed a second set of model-based predictions by use of separate draws
• Compared both sets of predictions to the results from the primary data analysis/bootstrapping

RESULTS

Difference in Costs and Effects

                          Cost Difference        QALY Difference
                          SE        p-value      SE        p-value
The Data                  95.25     0.0000       0.0300    0.85
Separate Draws, MVN       95.70     0.0000       0.0298    0.84
Separate Draws, Gamma     86.60     0.0000       0.0242    0.81
Single Draw, MVN          17.44     0.0000       0.0034    0.08
Single Draw, Gamma        14.12     0.0000       0.0034    0.08

Source of the Differences: Induced Correlations

                          Cost, Groups 0 vs 1    QALYs, Groups 0 vs 1
The Data                  -0.0036                -0.0015
Separate Draws, MVN       0.0024                 -0.0042
Separate Draws, Gamma     -0.0022                0.0042
Single Draw, MVN          0.9669                 0.9963
Single Draw, Gamma        0.9726                 0.9797

Acceptability Curve #1, the Bootstrapped Data

[Figure: acceptability curve; series: The Data; x-axis: willingness to pay, 0 to 375,000; y-axis: proportion acceptable, 0.00 to 1.00.]
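The acceptability curves shown here and on the following slides report, at each willingness-to-pay value, the proportion of bootstrap or Monte Carlo replicates in which the intervention is acceptable (i.e., has positive net monetary benefit). Below is a minimal sketch of that calculation, assuming arrays of cost- and QALY-difference replicates; the replicate values generated here are made up for illustration and are not the study's data.

```python
import numpy as np

def acceptability_curve(delta_cost, delta_qaly, wtp_grid):
    """Proportion of replicates in which the intervention is acceptable,
    i.e., has positive net monetary benefit, at each willingness to pay."""
    delta_cost = np.asarray(delta_cost)
    delta_qaly = np.asarray(delta_qaly)
    return np.array([np.mean(wtp * delta_qaly - delta_cost > 0)
                     for wtp in wtp_grid])

# Illustrative use with made-up replicates (not the study's data)
rng = np.random.default_rng(2)
d_cost = rng.normal(4000.0, 95.0, 1000)    # hypothetical cost-difference replicates
d_qaly = rng.normal(0.03, 0.03, 1000)      # hypothetical QALY-difference replicates
wtp_grid = np.arange(0, 375_001, 12_500)
print(np.round(acceptability_curve(d_cost, d_qaly, wtp_grid), 2))
```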
Acceptability Curve #2, Separate Draw MVN Model

[Figure: acceptability curves; series: The Data; MVN, separate; x-axis: willingness to pay, 0 to 375,000; y-axis: proportion acceptable.]

Acceptability Curve #3, Separate Draw Gamma Model

[Figure: acceptability curves; series: The Data; MVN, separate; Gamma, separate; same axes as above.]

Acceptability Curve #4, Single Draw Gamma Model

[Figure: acceptability curves; series: The Data; MVN, separate; Gamma, separate; Gamma, single; annotated value 0.9352; same axes as above.]

Overstatement of Bad Value

[Figure: acceptability curves; series: The Data; MVN, separate; Gamma, single; annotations at a willingness to pay of approximately 50,000: "42%: 2-tailed 16% confidence of bad value" and "7%: 2-tailed 86% confidence of bad value."]

Overstatement of Good Value

[Figure: acceptability curves; series: The Data; MVN, separate; Gamma, single; annotations at a willingness to pay of approximately 150,000: "74%: 2-tailed 48% confidence of good value" and "52%: 2-tailed 4% confidence of good value."]

Diagnosing the Problem

• Easiest way to diagnose the problem is to compare SEs calculated from the replicates with those calculated by use of the formula
• Model developers: should compare SEs routinely to assess the need for separate draws from distributions (a sketch of this check follows the summary slides)
• Reviewers/readers: can check for shrinkage only if authors report treatment-specific standard errors for costs and outcomes as well as the standard errors for the differences in costs and outcomes
  – Currently problematic, given that many articles don't report on costs or effects at all, and instead report only the cost-effectiveness ratio/acceptability curve
  – At least at the reviewer level, this information should be required

Multivariate Normal vs Gamma Distributions

[Figure: acceptability curves; series: The Data; MVN, separate; Gamma, separate; x-axis: willingness to pay, 0 to 375,000; y-axis: proportion acceptable.]

Acceptability Curve #5, Gamma vs Normal

[Figure: acceptability curves; series: The Data; MVN, single; Gamma, separate; Gamma, single; Normal, separate; Normal, single; same axes as above.]

Summary

• Use of single-draw models can lead to shrinkage of the standard errors of the difference in cost and effect
• Shrinkage can lead to upwardly biased estimates of confidence
• Equally affects all methods for describing sampling uncertainty:
  – CI for cost-effectiveness ratio
  – CI for net monetary benefit
  – Acceptability curves
  – Value-of-information curves

Summary (2)

• Extent of the problem unknown because authors commonly fail to report the estimates needed to assess the problem
  – May be substantial
• At a minimum, journals should require that, as part of the review process, authors provide treatment-specific standard errors for costs and effects
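Below is a minimal sketch of the diagnostic suggested on the Diagnosing the Problem slide, assuming the model developer has saved per-arm cost (or QALY) replicates. The function name check_se_shrinkage and the default 20% tolerance (taken from the "at most 10% to 20%" guideline earlier in the talk) are illustrative choices, not part of the talk.

```python
import numpy as np

def check_se_shrinkage(arm0, arm1, tolerance=0.20):
    """Compare the SE of the difference estimated from the replicates with
    the formula sqrt(SE0^2 + SE1^2); a shortfall well beyond the tolerance,
    together with a large between-arm correlation, suggests the arms may be
    sharing single draws and the SEs may be shrunken."""
    arm0, arm1 = np.asarray(arm0), np.asarray(arm1)
    se0, se1 = arm0.std(ddof=1), arm1.std(ddof=1)
    se_replicates = (arm1 - arm0).std(ddof=1)
    se_formula = np.hypot(se0, se1)
    pct_diff = (se_formula - se_replicates) / se_formula
    return {
        "SE0": se0,
        "SE1": se1,
        "SE_diff_replicates": se_replicates,
        "SE_diff_formula": se_formula,
        "pct_diff": pct_diff,
        "correlation": np.corrcoef(arm0, arm1)[0, 1],
        "possible_shrinkage": pct_diff > tolerance,
    }
```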
Extra Slides

Treatment Group Means and SD

Mean (SD) by period, treatment group, state, and outcome:

                  Period 1                             Period 2
                  Group 0           Group 1            Group 0             Group 1
State 1, Cost     86.67 (704.14)    86.11 (459.27)     57.03 (341.57)      57.76 (305.30)
State 1, QALY     .7648 (.2076)     .7647 (.2211)      .7765 (.1782)       .7764 (.2271)
State 2, Cost     63.44 (262.26)    65.08 (512.83)     202.61 (1067.79)    215.22 (1155.29)
State 2, QALY     .7332 (.1988)     .7332 (.2179)      .7446 (.2033)       .7440 (.2250)
State 3, Cost     342.17 (1273.69)  347.33 (1319.17)   134.15 (555.22)     136.58 (429.34)
State 3, QALY     .7224 (.2147)     .7222 (.1708)      .7137 (.2223)       .7135 (.1797)

"Chinese Menu" Decision Models

• Problem particularly affects what I call "Chinese menu" decision models
  – Obtain transition probabilities from "article (column) A," cost data from "article (column) B," preference scores from "article (column) C," etc.
• Characterized by use of the same set of disease states and the same costs and preference weights for being in these states in all arms of the model

Independent Normal vs Gamma Distributions

• As a field, we have moved away from the use of normal distributions in building our decision models and towards the use of gamma distributions
• In this example, and the others I've looked at, use of independent normal or independent gamma distributions yields nearly identical results
• While it doesn't hurt to use gamma distributions, we should remember that we are drawing from distributions of the mean, and that the central limit theorem is powerful (see the sketch below)
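To illustrate the central-limit-theorem point on the last slide, the sketch below draws from a gamma and a normal distribution of a mean that are matched on mean and standard error. The values ($6000 with an SE of $400) are made up; with an SE this small relative to the mean, the gamma shape parameter is large and the two distributions are nearly indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(3)

# A distribution of the MEAN cost, parameterized two ways with the same
# mean and standard error (illustrative values, not from the talk)
mean, se = 6000.0, 400.0
shape = (mean / se) ** 2      # gamma: mean = shape * scale
scale = se ** 2 / mean        #        var  = shape * scale^2

gamma_draws = rng.gamma(shape, scale, 100_000)
normal_draws = rng.normal(mean, se, 100_000)

for name, draws in [("gamma", gamma_draws), ("normal", normal_draws)]:
    p2_5, p97_5 = np.percentile(draws, [2.5, 97.5])
    print(f"{name:>6}: mean={draws.mean():.0f}  SD={draws.std(ddof=1):.0f}  "
          f"2.5th={p2_5:.0f}  97.5th={p97_5:.0f}")
```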