‘Good-looking’ statistics with nothing underneath: a warning tale
(Danish original title: ‘Kønne’ formler uden noget nedenunder: en advarende historie)
Jørgen Hilden, jhil@sund.ku.dk

Statistical counselling
Attic: Higher-order refinements.
2nd floor: 2nd moment. Standard errors, etc.
1st floor: 1st-moment issues. Bias?
Ground floor: What do you really want to know / measure? Meaningless estimand? Nonsense arithmetic?

Ground floor examples
• "The pH was doubled": OOSH!
• How do I define / calculate the mean waiting time to liver transplantation in 2012? Tricky, or impossible.
• Dr. NN, otologist: #(consultations) / #(patients seen) = 2.7 in 2012, = 1.5 in JAN-MAR 2013. An interpretable change?

…THEORY…
…on the dangers of inventing and popularizing new statistical (epidemiological) measures which are based entirely on ‘nice looks’ and have no proper theoretical underpinning.

Consider prognostics as to survival vs. death (D)
[Diagram labels: New biochemical marker; Standard clinical data; Oracle risk q; ‘Old’ oracle risk p; Better?]
$\chi^2 = 2\sum_i \{(\ln q_i - \ln p_i)\,D_i + (\ln \bar q_i - \ln \bar p_i)\,\bar D_i\}$ is high; odds ratio or hazard ratio, etc., highly significant.
(A bar denotes the complement, e.g. $\bar q = 1 - q$, $\bar D = 1 - D$.)

The statistics IDI = integrated discrimination improvement and its ‘little brother,’ the NRI = net reclassification index, were designed to measure the incremental prognostic impact that a new marker will have when added to a battery of prognostic markers for assessing the risk of a binary outcome.
Intuitively plausible? Yes, they are. But their popularity is undeserved, nonetheless.

Proposed ‘measures’ of the superiority of the new oracle:
$\mathrm{NRI} \approx E\{\operatorname{sign}(q - p) \mid \text{Death}\} + E\{\operatorname{sign}(p - q) \mid \text{Survives}\}$
$\mathrm{IDI} \approx E\{q - p \mid \text{Death}\} + E\{p - q \mid \text{Survives}\}$
Pencina et al. (2008+)
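The two proposed measures can be written down almost verbatim in code. The following is only a minimal sketch of the sample versions of the NRI and IDI formulas above; the function names `nri` and `idi` and the four-patient toy data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def nri(p, q, d):
    """Sample NRI: mean sign(q - p) among events plus mean sign(p - q) among non-events."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    d = np.asarray(d, dtype=bool)
    s = np.sign(q - p)
    return s[d].mean() - s[~d].mean()

def idi(p, q, d):
    """Sample IDI: mean (q - p) among events plus mean (p - q) among non-events."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    d = np.asarray(d, dtype=bool)
    diff = q - p
    return diff[d].mean() - diff[~d].mean()

# Hypothetical toy data: four patients, the last two die (d = 1).
p = [0.2, 0.4, 0.6, 0.8]   # 'old' risks
q = [0.1, 0.3, 0.7, 0.9]   # 'new' risks, each pushed 0.1 towards the observed outcome
d = [0, 0, 1, 1]

print("NRI:", nri(p, q, d))
print("IDI:", idi(p, q, d))
```

In this toy sample every new risk moves in the "right" direction, so the NRI takes its maximum value of 2 and the IDI equals 0.2.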
Standard measures of prognostic gain:
$\Delta(\text{logarithmic score}) = \frac{1}{n}\sum_i \{\ln(q_i/p_i)\,D_i + \ln(\bar q_i/\bar p_i)\,\bar D_i\} = \chi^2/(2n)$;
$2\,\Delta(\text{Harrell's } C) = \dfrac{\sum_{ij}\operatorname{sign}(q_i - q_j)(D_i - D_j)}{\sum_{ij}(D_i - D_j)^2} - (\text{ditto with the } p\text{'s})$.
(A bar denotes the complement, e.g. $\bar q = 1 - q$, $\bar D = 1 - D$.)

IDI and NRI were proposed because the C Index was regarded as the standard measure of prognostic performance, and it turned out to be "insensitive to new information":
"Look, the hazard ratio was as high as 2.5 and strongly significant (P = 0.0001), yet C only increased from 0.777 to 0.790!"

Main flaws of the NRI/IDI family of statistics
… gradually uncovered by various investigators:
Attic:
• sampling distributions much farther from Gaussian than originally thought.
2nd floor:
• original SE formulae wrong and seriously off the mark (when training data = evaluation data).
1st floor:
• biased towards attributing prognostic power to uninformative predictors, at least in logistic regression models (Monte Carlo), so they may fool their users;
• bias otherwise undefined or irrelevant (see **).
Ground floor: …

Main flaws of the NRI/IDI family of statistics (cont’d)
Ground floor:
• NRI/IDI do reflect prognostic gain, but **what do they measure? What optimality ideal do they portray?
• users may also be deliberately fooled by an opponent who wants to sell the q’s (i.e., sell the new marker equipment) and who already knows the p’s of patients in the sample [dishonesty pays; keyword: non-proper scoring rule].
Essence: they reward overconfidence, i.e., risk assessments in which large risks are too large and small risks too small.

Deliberately fooled??
Recall: $p_i$ = patient’s ‘old’ risk of ‘event’, $q_i$ = ‘new’ risk.
IDI (parameter) graphically defined:
$\mathrm{IDI} = E\{q - p \mid \text{event}\} - E\{q - p \mid \text{no event}\}$ = sum of arrows
[Figure: risk axis from 0 to 1 (= 100%); mean p and mean q marked for the Event and No event groups; arrows as referred to above.]

Alas!
The IDI is vulnerable to deliberate (or accidental) overconfidence…
The p rule can be "improved on" simply by making its predictions more extreme:
For patient i, the cheater may report a fake $q_i$ (let’s call it Z) equal to either 100% or zero: zero to the left, 100% to the right of the red line.
[Figure: p and q for the Event and No event groups; the red line marks the marginal event frequency, approximately known to the cheater.]

Proof
Consider IDI. The cheater tries to "optimize" Z. He expects the i’th patient to contribute to IDI:
$+(Z - p_i)/\#D$ with prob. $p_i$ (event), and
$-(Z - p_i)/(n - \#D)$ with prob. $(1 - p_i)$ (no event);
i.e., in expectation,
$(Z - p_i)\,\{\,p_i/\#D - (1 - p_i)/(n - \#D)\,\}$,
a linear function of Z. The factor in braces is positive exactly when $p_i > \#D/n$, so the expression is maximized by setting Z := 1 for $p_i > \#D/n$ and Z := 0 for $p_i < \#D/n$, where $\#D/n$ is the marginal frequency approximately known to him.
If in doubt, he may play safe by setting Z := $p_i$.

Adoption of IDI →
• Spurious results may arise when risks are overconfident (instead of being well calibrated), as may happen with an unlucky choice of regression program.
• (Cheaters beat the best probabilistic model, so…)
• A supporter of a new lab test may sell it without ever doing it!*
• Simply by exploiting knowledge of the assessment machinery, a poor prognostician can outperform a good prognostician.
* cf. The Emperor’s New Clothes

Stepping back: what do we really want?
Ideally, clinical innovations should be rated in human utility terms.
In particular: new information sources should be valued in terms of the clinical benefit that is expected to accrue from (optimized use of) the enlarged body of information: Value-of-Information (VOI) statistics.
All VOI-type, (quasi-)utility expectation statistics are Proper Scoring Rules (PSRs).

Key properties of a PSR:
Good performance cannot be faked.
It pays for a prognostician to strive to fully use the data at hand and to honestly report his assessment.
He cannot increase his performance score by ‘strategic votes,’ not even by exploiting his knowledge of the scoring machinery.
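The cheater's strategy and the contrast with a proper scoring rule can be checked numerically. The following is only a minimal simulation sketch under stated assumptions: the 'old' risks p are taken to be perfectly calibrated and uniformly distributed, the cheater knows only p and the marginal event frequency, and his extreme reports are clipped to `eps` and `1 - eps` (a hypothetical choice) so that the logarithmic score stays finite. The function names are illustrative.

```python
import numpy as np

def idi(p, q, d):
    """Sample IDI: mean (q - p) among events minus mean (q - p) among non-events."""
    diff = q - p
    return diff[d].mean() - diff[~d].mean()

def log_score_gain(p, q, d):
    """Mean logarithmic-score gain of q over p (the logarithmic score is a proper scoring rule)."""
    return np.mean(np.where(d, np.log(q / p), np.log((1 - q) / (1 - p))))

rng = np.random.default_rng(0)
n = 10_000

# Assumption: the 'old' risks are well calibrated, i.e. outcomes are drawn from p itself.
p = rng.uniform(0.05, 0.95, size=n)
d = rng.random(n) < p

# The cheater adds no information: he only pushes each p_i to an extreme,
# depending on which side of the marginal event frequency #D/n it falls.
eps = 0.001                      # hypothetical clipping value, keeps log() finite
base_rate = d.mean()
z = np.where(p > base_rate, 1 - eps, eps)

print("IDI of fake risks vs p:           ", idi(p, z, d))
print("Log-score gain of fake risks vs p:", log_score_gain(p, z, d))
```

Under these assumptions the IDI comes out clearly positive although Z contains no information beyond p, whereas the logarithmic-score gain is negative: the proper scoring rule penalizes the overconfidence that the IDI rewards.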
Conversely:
IDI can be faked
↓
IDI is not a PSR
↓
IDI is not a VOI criterion
↓
One cannot construct a decision scenario, not even a ridiculously artificial one, that has the IDI as its utility-expectation criterion.

Strengthened conclusion
Even in the absence of cheating, it cannot be claimed that IDI measures something arguably useful or constitutes a dependable yardstick.

Summing up the horror story: What went wrong in the Boston group?
(1) They knew no better than to embrace the C Index as their measure of prognostic power.
(2) C turned out to be disappointing owing to its unexpected resilience to ‘well supported’ novel prognostic markers [they mix up weight of effect and weight of evidence].
(3) They [undeservedly] discard C as ‘insensitive to new information.’
(4) They propose NRI, IDI and variants as being more sensitive to new information [overlooking that these are also sensitive to null or pseudo information].
(5) They rashly suggest SE formulae and make vague promises of Gaussian distribution in reasonably large samples [both wrong].

Thank you