Parametric modelling of cost data: some simulation evidence

Andrew Briggs, University of Oxford
Richard Nixon, MRC Biostatistics Unit, Cambridge
Simon Dixon, University of Sheffield
Simon Thompson, MRC Biostatistics Unit, Cambridge

2003 CHEBS Seminar, Friday 7th November

Parametric modelling of cost data: Background
• Cost data are typically non-normally distributed, with high skew and kurtosis
• Arithmetic mean cost is of interest to policy makers
• Central Limit Theorem ensures the sample mean is a consistent estimator
• Commentators have proposed parametric modelling of cost data to improve efficiency
• In particular, the Lognormal distribution is commonly advocated
• Alternatively, the Gamma distribution is an increasingly popular choice

Parametric modelling of cost data: Choice of estimator
• If data are Lognormal, an efficient estimator of mean cost is exp(lm+lv/2), where lm and lv are the mean and variance of log cost
• If data are Gamma distributed, the maximum likelihood estimate of the population mean is the sample mean

Parametric distributions: Simulation experiment
• Lognormal / Gamma distributions
• Population mean was set to be 1000
• Five choices of coefficient of variation (CoV = 0.25, 0.5, 1.0, 1.5, 2.0) to define distribution parameters
• Samples of five different sizes (n = 20, 50, 200, 500, 2000) drawn from each distribution for each CoV
• 2 x 5 x 5 = 50 experiments
• Bias, coverage probability and RMSE all recorded

Parametric distributions: Distribution sets
[Figure: Lognormal and Gamma distribution sets]

Parametric distributions: Estimated RMSE from simulations

RMSE, Gamma data (columns are sample sizes)
              Sample Mean                       exp(lm+lv/2)
CoV     20    50   200   500  2000        20    50   200   500  2000
0.25    56    35    18    11     6        56    35    18    11     6
0.50   112    71    35    22    11       114    73    38    25    16
1.00   221   141    70    44    22       400   304   241   226   218
1.50   333   214   105    67    34      1388  1097   925   896   878
2.00   440   284   141    89    45      2663  1914  1510  1420  1378

RMSE, Lognormal data (columns are sample sizes)
              Sample Mean                       exp(lm+lv/2)
CoV     20    50   200   500  2000        20    50   200   500  2000
0.25    56    36    18    11     6        56    36    18    11     6
0.50   112    71    35    22    11       112    71    35    22    11
1.00   224   141    72    45    23       221   137    69    43    22
1.50   336   214   109    67    34       328   197    99    61    31
2.00   450   288   143    63    45       419   250   122    54    38

Parametric distributions: Estimated coverage probabilities

Coverage, Gamma data (columns are sample sizes)
              Sample Mean                       exp(lm+lv/2)
CoV     20    50   200   500  2000        20    50   200   500  2000
0.25  0.93  0.94  0.95  0.95  0.95      0.93  0.94  0.96  0.96  0.95
0.50  0.92  0.93  0.95  0.95  0.95      0.94  0.96  0.97  0.95  0.89
1.00  0.90  0.93  0.95  0.95  0.95      0.97  0.97  0.69  0.18     0
1.50  0.87  0.91  0.94  0.95  0.94      0.99  0.92  0.14     0     0
2.00  0.83  0.89  0.93  0.94  0.95      0.99  0.93  0.20     0     0

Coverage, Lognormal data (columns are sample sizes)
              Sample Mean                       exp(lm+lv/2)
CoV     20    50   200   500  2000        20    50   200   500  2000
0.25  0.92  0.93  0.95  0.95  0.95      0.91  0.93  0.95  0.95  0.95
0.50  0.91  0.93  0.94  0.95  0.95      0.91  0.93  0.94  0.95  0.95
1.00  0.87  0.91  0.93  0.94  0.95      0.90  0.93  0.95  0.95  0.95
1.50  0.83  0.88  0.92  0.94  0.94      0.89  0.93  0.94  0.95  0.94
2.00  0.80  0.86  0.91  0.92  0.94      0.88  0.92  0.94  0.95  0.95

Empirical cost distributions: Summary statistics for 3 data sets

Raw cost
Data set        n    mean      sd  skewness  kurtosis   CoV
CPOU          972     518   1,145       5.3        37   2.2
IV fluids   1,191   2,693   7,083       4.8        32   2.6
Paramedics  1,852   4,233   7,961       7.5        88   1.9

Log transformed cost
Data set        n    mean      sd  skewness  kurtosis   CoV
CPOU          972    5.37    1.19      0.59      3.73  0.22
IV fluids   1,191    6.51    1.32      1.69      4.72  0.20
Paramedics  1,852    7.70    1.09     -0.05      4.76  0.14

CoV: coefficient of variation

Empirical cost distributions: Data set 1: CPOU
[Figure: histograms (fraction of observations) of raw cost and of the natural log of cost, CPOU dataset]

Empirical cost distributions: Data set 2: IV Fluids
[Figure: histograms (fraction of observations) of raw cost and of the natural log of cost, IV fluids dataset]

Empirical cost distributions: Data set 3: Paramedics
[Figure: histograms (fraction of observations) of raw cost and of the natural log of cost, Paramedics dataset]

Empirical cost data sets: Simulation results

RMSE (columns are sample sizes)
                    Sample Mean               exp(lm+lv/2)
Data set        20    50   200   500      20    50   200   500
CPOU           253   160    70    37     234   141    95    87
IV fluids     1583  1015   480   249    1721  1233  1090  1083
Paramedics    1915  1131   585   321    1863   990   501   334

Coverage (columns are sample sizes)
                    Sample Mean               exp(lm+lv/2)
Data set        20    50   200   500      20    50   200   500
CPOU          0.76  0.83  0.95  0.98    0.80  0.79  0.62  0.28
IV fluids     0.77  0.86  0.94  0.98    0.60  0.47  0.12  0.00
Paramedics    0.78  0.84  0.89  0.95    0.86  0.87  0.84  0.84

Parametric cost modelling: Comments & conclusions
• “All models are wrong” (Box 1976)
• “No data are normally distributed” (Nester 1996)
• Costs are estimated from resource use times unit cost
• Any parametric assumption relating to costs is at best an approximation
• Simulations confirm that there are efficiency gains if an appropriate distribution is chosen
• But incorrect assumptions can lead to very misleading conclusions
• Sample mean performs well and is unlikely to lead to inappropriate inference
• Only when there are sufficient data to permit detailed modelling is the choice of an alternative estimator warranted
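
Illustrative sketch of the simulation experiment

The simulation experiment described in the slides (Lognormal or Gamma data, population mean 1000, a chosen CoV and sample size, two competing estimators) could be sketched roughly as below. This is a minimal illustration, not the authors' code. It assumes that lm and lv are the sample mean and sample variance of log cost, and that the distribution parameters are obtained by matching the stated mean and CoV. The names POP_MEAN, draw, estimators and simulate, and the reps and seed arguments, are hypothetical. Coverage is left out because the slides do not say how the confidence intervals were constructed, and the output should not be expected to reproduce the tabulated figures exactly.

```python
import numpy as np

POP_MEAN = 1000.0  # population mean stated in the slides


def draw(dist, cov, n, reps, rng):
    """Return a (reps, n) array of costs with mean POP_MEAN and the given CoV.

    Assumption: parameters are chosen by matching the mean and CoV.
    """
    if dist == "lognormal":
        sigma2 = np.log(1.0 + cov**2)          # variance on the log scale
        mu = np.log(POP_MEAN) - sigma2 / 2.0   # mean on the log scale
        return rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=(reps, n))
    if dist == "gamma":
        shape = 1.0 / cov**2                   # CoV of a Gamma is 1/sqrt(shape)
        scale = POP_MEAN * cov**2              # mean = shape * scale
        return rng.gamma(shape, scale, size=(reps, n))
    raise ValueError(f"unknown distribution: {dist}")


def estimators(costs):
    """Compute both estimators of mean cost for every simulated sample (row)."""
    logs = np.log(costs)
    lm = logs.mean(axis=1)          # mean of log cost (assumed meaning of lm)
    lv = logs.var(axis=1, ddof=1)   # variance of log cost (assumed meaning of lv)
    return {
        "sample_mean": costs.mean(axis=1),
        "exp(lm+lv/2)": np.exp(lm + lv / 2.0),
    }


def simulate(dist, cov, n, reps=2000, seed=0):
    """Estimate bias and RMSE of each estimator over `reps` simulated samples."""
    rng = np.random.default_rng(seed)
    est = estimators(draw(dist, cov, n, reps, rng))
    return {
        name: {
            "bias": float(np.mean(values - POP_MEAN)),
            "rmse": float(np.sqrt(np.mean((values - POP_MEAN) ** 2))),
        }
        for name, values in est.items()
    }


if __name__ == "__main__":
    # One cell of the 2 x 5 x 5 design: CoV = 1.0, n = 200, both distributions.
    for dist in ("lognormal", "gamma"):
        print(dist, simulate(dist, cov=1.0, n=200))
```

Looping simulate over the five CoV values and five sample sizes for each distribution would cover the full set of 50 experiments. Under these assumptions the Lognormal-based estimator is noticeably biased when the data are actually Gamma, which is the qualitative pattern the conclusions warn about.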