`Good-looking` statistics with nothing underneath: a warning tale

Jørgen Hilden
jhil@sund.ku.dk
Higher-order refinements:
2nd floor: 2nd-moment issues – standard errors, etc.
1st floor: 1st-moment issues – bias?
Ground floor, statistical counselling: What do you really want to know / measure? Meaningless estimand?? Nonsense arithmetic??
Ground floor examples
”The pH was doubled” – OOSH! (pH is a logarithm, so ‘doubling’ it is nonsense arithmetic.)
How do I define / calculate the mean waiting time to liver transplantation in 2012? – Tricky, or impossible.
Dr. NN, otologist:
#(consultations) / #(patients seen) = 2.7 in 2012, but = 1.5 in JAN–MAR 2013.
– An interpretable change?
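The otologist's ratio can fall purely because the observation window is shorter. A minimal sketch, assuming (purely for illustration) that each patient's consultations follow a Poisson process at a constant annual rate; the rate of 2.5 visits/year is a made-up figure chosen only to roughly match Dr. NN's numbers:

```python
import math

def consults_per_patient_seen(rate_per_year, window_years):
    """Expected consultations per patient *seen* in an observation
    window, when each patient's consultations follow a Poisson
    process with the given annual rate.

    E[#consultations per patient]   = rate * window
    P(patient seen at least once)   = 1 - exp(-rate * window)
    ratio = (rate * window) / (1 - exp(-rate * window))
    """
    lam = rate_per_year * window_years
    return lam / (1.0 - math.exp(-lam))

full_year = consults_per_patient_seen(2.5, 1.0)   # whole of 2012
quarter   = consults_per_patient_seen(2.5, 0.25)  # JAN-MAR 2013
print(round(full_year, 2), round(quarter, 2))     # 2.72 1.34
```

With an unchanged consultation rate the ratio is ≈2.72 over a full year but only ≈1.34 over a quarter, so most of the apparent 2.7 → 1.5 "drop" requires no change in anything at all.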
…THEORY… on the dangers of inventing and popularizing new statistical (epidemiological) measures which are based entirely on ’nice looks’ and have no proper theoretical underpinning.
Consider prognostics as to survival vs. death (D):
[Diagram: standard clinical data alone → ’old’ oracle risk p; standard clinical data + new biochemical marker → better(?) oracle risk q.]
χ² = 2Σi{ (ln q – ln p)D + (ln @q – ln @p)@D }i
(@ denotes the complement: @q = 1 – q, @D = 1 – D)
is high; the odds ratio or hazard ratio, etc., is highly significant.
The statistics
IDI = integrated discrimination improvement
& its ‘little brother,’ the
NRI = net reclassification index,
were designed to measure the
incremental prognostic impact
that a new marker will have
when added to a battery of prognostic markers
for assessing the risk of a binary outcome.
Intuitively plausible? – Yes, they are. But
their popularity is undeserved, nonetheless.
Proposed ’measures’ of the superiority of the new oracle (Pencina et al., 2008 onwards):
NRI ≈ E{ sign(q – p) | Death } + E{ sign(p – q) | Survives }
IDI ≈ E{ q – p | Death } + E{ p – q | Survives }
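A minimal sketch of the sample versions of these two quantities; the toy risks and outcomes below are invented purely for illustration:

```python
def nri_idi(p, q, died):
    """Sample versions of the continuous NRI and the IDI.

    p, q : old and new predicted risks per subject
    died : 1 if the event (death) occurred, else 0
    NRI  = E{sign(q-p) | event} + E{sign(p-q) | no event}
    IDI  = E{q-p | event}       + E{p-q | no event}
    """
    def sign(x):
        return (x > 0) - (x < 0)
    events    = [(pi, qi) for pi, qi, d in zip(p, q, died) if d]
    nonevents = [(pi, qi) for pi, qi, d in zip(p, q, died) if not d]
    nri = (sum(sign(qi - pi) for pi, qi in events) / len(events)
           + sum(sign(pi - qi) for pi, qi in nonevents) / len(nonevents))
    idi = (sum(qi - pi for pi, qi in events) / len(events)
           + sum(pi - qi for pi, qi in nonevents) / len(nonevents))
    return nri, idi

# Toy data: q moves the two deaths up and the two survivors down.
p    = [0.2, 0.3, 0.6, 0.7]
q    = [0.1, 0.2, 0.7, 0.8]
died = [0, 0, 1, 1]
nri, idi = nri_idi(p, q, died)
print(nri, round(idi, 3))  # 2.0 0.2
```

Every subject is "reclassified" in the right direction, so the NRI attains its maximum of 2, while the IDI records the average 0.1 risk shift in each group.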
Standard measures of prognostic gain:
Δ(logarithmic score) = (1/n)Σ{ ln(q/p)D + ln(@q/@p)@D } = χ² / (2n);
2Δ(Harrell’s C) = Σij sign(qi – qj)(Di – Dj) / Σij |Di – Dj| – (the same with the p’s).
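The same toy setup can be scored with these standard measures. A sketch, assuming a binary outcome so that Harrell's C reduces to the area under the ROC curve; the four-patient data set is invented for illustration:

```python
import math

def delta_log_score(p, q, died):
    """Mean gain in logarithmic score,
    (1/n) * sum{ D*ln(q/p) + (1-D)*ln((1-q)/(1-p)) },
    i.e. the chi-square statistic divided by 2n."""
    n = len(died)
    return sum(d * math.log(qi / pi)
               + (1 - d) * math.log((1 - qi) / (1 - pi))
               for pi, qi, d in zip(p, q, died)) / n

def harrell_c(risk, died):
    """Harrell's C for a binary outcome: the probability that a random
    (death, survivor) pair is ordered correctly by the risk score,
    counting ties as 1/2."""
    deaths    = [r for r, d in zip(risk, died) if d]
    survivors = [r for r, d in zip(risk, died) if not d]
    conc = sum((rd > rs) + 0.5 * (rd == rs)
               for rd in deaths for rs in survivors)
    return conc / (len(deaths) * len(survivors))

p    = [0.2, 0.3, 0.6, 0.7]
q    = [0.1, 0.2, 0.7, 0.8]
died = [0, 0, 1, 1]
print(round(delta_log_score(p, q, died), 3),
      harrell_c(p, died), harrell_c(q, died))  # 0.135 1.0 1.0
```

Δ(log score) registers a clear gain (≈0.135 per patient), while C is already 1.0 under both rules and cannot move at all – a small-scale glimpse of the "C hardly budges" complaint quoted below.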
IDI and NRI were proposed because
the C Index was regarded as
the standard measure of prognostic performance,
and it turned out to be
“insensitive to new information”:
“Look, the hazard ratio was as high as 2.5 and
strongly significant (P = 0.0001),
yet C only increased from 0.777 to 0.790!”
Main flaws of the NRI/IDI family of statistics
… gradually uncovered by various investigators:
Attic: • sampling distributions much farther from Gaussian than originally thought.
2nd floor: • original SE formulae wrong and seriously off the mark (when training data = evaluation data).
1st floor: • biased towards attributing prognostic power to uninformative predictors, at least in logistic regression models (Monte Carlo), so they may fool their users; • bias otherwise undefined or irrelevant (see **).
Ground floor:
• NRI/IDI do reflect prognostic gain, but
**what do they measure?
What optimality ideal do they portray?
• users may also be deliberately fooled by
an opponent who wants to sell the q’s
(i.e., sell the new marker equipment)
and who already knows the p’s of patients in the sample
[dishonesty pays; keyword: non-proper scoring rule].
Essence: they reward overconfidence, i.e., reporting large risks as even larger and small risks as even smaller.
Deliberately fooled??
Recall: pi = patient’s ’old’ risk of ’event’, qi = ’new’ risk.
IDI (parameter) graphically defined:
IDI = E{ q – p | event } – E{q – p | no event } = sum of arrows
[Figure: risk scale from 0 to 1 (= 100%); mean p and mean q shown separately for the Event and No-event groups, with arrows marking the q – p shifts.]
Alas! – The IDI is vulnerable to deliberate ( or accidental )
overconfidence…
The p rule can be “improved on” simply by making its predictions more extreme: for patient i, the cheater may report a fake qi (call it Z) of either 100% or zero:
[Figure: the cheater reports Z = 0 to the left and Z = 100% to the right of a red line drawn at the marginal event frequency, which is approximately known to him; p and q shown for the Event and No-event groups.]
Proof
Consider IDI: the cheater tries to ”optimize” Z.
He expects the i’th patient to contribute to IDI
+(Z – pi) / #D with prob. pi (event), and
–(Z – pi) / (n – #D) with prob. (1 – pi) (no event); i.e., in expectation
(Z – pi){ pi / #D – (1 – pi) / (n – #D) },
a linear function of Z, maximized by setting
Z := 1 (resp. 0) for pi > (resp. <) #D/n = the marginal event frequency, approximately known to him.
If in doubt, he may play safe by setting Z := pi.
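The proof can be checked by simulation. A minimal Monte Carlo sketch (the uniform risk distribution is an illustrative assumption): outcomes are drawn from the true risks p, so no genuine model can improve on p, yet the cheater's all-or-nothing Z earns a clearly positive IDI:

```python
import random

random.seed(1)

def idi(p, q, died):
    """Sample IDI = E{q-p | event} + E{p-q | no event}."""
    ev  = [qi - pi for pi, qi, d in zip(p, q, died) if d]
    nev = [pi - qi for pi, qi, d in zip(p, q, died) if not d]
    return sum(ev) / len(ev) + sum(nev) / len(nev)

# True risks, spread over (0, 1); outcomes drawn from them,
# so p is already the best possible probabilistic model.
n = 20000
p = [random.uniform(0.05, 0.95) for _ in range(n)]
died = [random.random() < pi for pi in p]

pbar = sum(p) / n  # marginal event frequency, approx. known to cheater
z = [1.0 if pi > pbar else 0.0 for pi in p]  # all-or-nothing fake risks

print(round(idi(p, p, died), 3))  # 0.0 - honest reporting of p gains nothing
print(round(idi(p, z, died), 3))  # positive - the fake Z 'beats' the truth
```

Reporting the true risks themselves yields IDI = 0 exactly, while the cheater's extreme Z scores a solidly positive IDI despite containing no information beyond p.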
Adoption of IDI →
• Spurious results may arise when risks are overconfident (instead of being well calibrated), as may happen with an unlucky choice of regression program.
• (Cheaters beat the best probabilistic model, so…)
• A supporter of a new lab test may sell it without ever doing it!*
• Simply by exploiting knowledge of the assessment machinery, a poor prognostician can outperform a good prognostician.
* cf. The Emperor’s New Clothes
Stepping back – what do we really want?
Ideally, clinical innovations should be rated in human utility
terms.
In particular:
New information sources should be valued in terms of
the clinical benefit that is expected to accrue
from ( optimized use of ) the enlarged body of information:
Value-of-Information ( VOI ) statistics.
All VOI-type, (quasi-) utility expectation statistics are
Proper Scoring Rules ( PSRs ).
Key properties of a PSR: Good performance cannot be faked.
It pays for a prognostician to strive to fully use the data at hand
and to honestly report his assessment.
He cannot increase his performance score by ‘strategic votes,’
not even by exploiting his knowledge of the scoring machinery.
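A minimal numerical illustration of this defining property, using the logarithmic score (a standard PSR): whatever the true event probability, the expected score is maximized by reporting exactly that probability, so "strategic" reports can only hurt:

```python
import math

def expected_log_score(true_p, reported_r):
    """Expected logarithmic score when the event occurs with
    probability true_p and the forecaster reports reported_r."""
    return (true_p * math.log(reported_r)
            + (1 - true_p) * math.log(1 - reported_r))

true_p = 0.3
grid = [i / 100 for i in range(1, 100)]  # candidate reports 0.01..0.99
best_report = max(grid, key=lambda r: expected_log_score(true_p, r))
print(best_report)  # 0.3 - honesty is the optimal 'strategy'
```

Under the log score, shading the report towards 0 or 1 (the move that inflates the IDI) strictly lowers the forecaster's expected score.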
Conversely:
IDI can be faked
↓
IDI is not a PSR
↓
IDI is not a VOI criterion
↓
One cannot construct a decision scenario
– not even a ridiculously artificial one –
that has the IDI as its utility-expectation criterion.
Strengthened conclusion
Even in the absence of cheating,
it cannot be claimed that IDI
measures something arguably useful
or constitutes a dependable yardstick.
Summing up the horror story:
What went wrong in the Boston group?
(1) They knew no better than to embrace the C Index as their measure of prognostic power.
(2) C turned out to be disappointing owing to its unexpected resilience to ’well-supported’ novel prognostic markers [they mix up weight of effect & weight of evidence].
(3) They [undeservedly] discard C as ’insensitive to new
information.’
(4) They propose NRI, IDI and variants as being more
sensitive to new information [overlooking that these are also
sensitive to null or pseudo information].
(5) They rashly suggest SE formulae and
make vague promises of Gaussian distribution
in reasonably large samples [both wrong].
Thank you