PPTX - The Skeptic’s Toolbox

Evaluating Evidence in Medicine:
What Can Go Wrong?
Skeptic’s Toolbox
2012
Harriet Hall, MD
The SkepDoc
Overview
• What constitutes evidence in medicine?
• What can go wrong in clinical studies?
• Why even “evidence-based medicine” is
flawed.
Is This Evidence?
Is This Evidence? MRI Study of Salmon
• A salmon was shown photographs of humans
in social situations. It was asked to think about
what emotion the individual in the photo
must have been experiencing.
• The salmon couldn’t talk, but:
• On the fMRI scan, areas in the salmon’s brain
lit up, indicating increased blood flow,
indicating that the salmon was thinking.
Is This Evidence That:
• Salmon can see pictures?
• Salmon know what human emotions are?
• Salmon can identify emotions from pictures?
• Salmon can respond to requests of what to
think about?
What’s Wrong With This Picture?
The Salmon Was Dead and Gutted!
Statistical Artifact
• Each fMRI scan measures 50,000 voxels (3-D
pixels) and each study involves thousands of
scans.
• If you mine the data, you can find practically
anything you want.
• Brain scans are the new phrenology
– A blunt instrument
– Scans are pooled to establish normal average
– Often don’t mean what people think they mean
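A minimal sketch of why mining 50,000 voxels turns up "activity" even when nothing is there. It assumes every voxel is an independent test with no real signal, and the two thresholds are illustrative choices, not values from the salmon study; the function name is hypothetical.

```python
import random


def count_chance_hits(n_voxels: int, alpha: float, seed: int = 0) -> int:
    """Count voxels that look 'active' purely by chance.

    Under the null hypothesis (no real signal anywhere), each voxel's
    p-value is uniformly distributed on [0, 1), so it falls below the
    threshold alpha with probability alpha.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return sum(1 for _ in range(n_voxels) if rng.random() < alpha)


N_VOXELS = 50_000  # figure quoted on the slide

# Illustrative thresholds (assumptions, not taken from the study).
for alpha in (0.05, 0.001):
    hits = count_chance_hits(N_VOXELS, alpha)
    print(f"alpha={alpha}: {hits} voxels 'light up' by chance "
          f"(about {N_VOXELS * alpha:.0f} expected)")
```

Real voxels are correlated with their neighbors, so the independence assumption is a simplification; the point is only that uncorrected thresholds guarantee some hits.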
Amen poster
Would You Accept This Evidence?
• I tried it. I got better. It worked for me.
• Lots of people tried it and got better.
• We gave it to a lot of people in a study and they
improved.
• We compared it to a no-treatment group or a
usual-treatment group and it worked better.
• We compared it to a placebo and it worked
better.
• The weight of evidence from a large body of
studies shows that it works better than placebo
Is This Evidence?
• I tried it. I got better. It worked for me.
– Anecdote. Plural of anecdote is not data.
– Post hoc ergo propter hoc fallacy
– Does Echinacea prevent colds?
– Removing glucosamine didn’t remove effects
• We gave it to a lot of people in a study and they
improved.
– Uncontrolled study. Maybe they would have improved
without treatment.
– Cold got better in a week with treatment, lasted 7
days without treatment.
Is This Evidence?
• Our study compared it to a no-treatment group or a
usual-treatment group and it worked better.
– Hawthorne effect: Doing something is better than doing
nothing.
• Our study compared it to a placebo and it worked
better.
– Was the study blinded?
– Double blind, placebo-controlled randomized study is the
Gold Standard.
• BUT: What if we do a Gold Standard study on
something totally implausible and it works better than
a placebo?
There’s A Lot of Evidence:
A Fire Hose of Information
• 21 million papers are listed in PubMed:
– 700,000 more each year
– One a minute
• PubMed lists 23,000 journals, and there are
many more not listed
• You can find a study to support any belief.
Never Believe One Study
• Early positive studies often superseded by
better, negative studies (HRT).
• Ioannidis: Most published research findings
are wrong.
Ioannidis
• The smaller the study, the less likely the research
findings are to be true
• The smaller the effect, the less likely the research
findings are to be true.
• The greater the financial and other interests, the
less likely the research findings are to be true
• The hotter a scientific field (with more research
teams involved), the less likely the research
findings are to be true.
Evaluating a Study
• Ask a lot of questions
• I’ll cover some
Skeptics Question Everything
What Kind of Study?
• Case report
• Case series
• Case-control
• Cohort
• Epidemiologic
• RCT
• Placebo-controlled
• Blinded (single or double)
Who’s Paying?
• Studies sponsored by pharmaceutical
companies more likely to be positive
– Subtle bias
– Unpublished negative information
• Studies by researchers with financial conflicts
of interest (consulting fees, honoraria from
pharmaceutical company) more likely to be
positive: 91% vs. 67%
Big Pharma Distortion
• Turner looked at all antidepressant studies
registered with FDA
– Published studies: 94% positive
– Unpublished studies: 51% positive
Evidence that antidepressants don’t work?
No.
Effect Size
Turner vs. Kirsch
• Kirsch said an effect size < .5 means ineffective
– Effect size from journals: .41
– True effect size: .31
– Therefore antidepressants are not effective
• Turner said glass not empty, 1/3 full
• Patients’ responses not all-or-none; partial responses can
be meaningful
• Antidepressants DO work, just not as well as originally
thought.
• Kirsch supports psychotherapy, but its effect size is much
less than .5.
Scam Product Testing
• In-house: by non-academics on company’s
payroll
– Worthless. Tweaked to get desired results
• Independent testing companies: guns for hire
• Minuscule effects touted as significant
• Effects found, but not specific to product
– Amino acids may improve muscle strength
• Effects may not apply to average people (e.g.,
taping injuries)
Are the Researchers Biased?
• Homeopathy studies done by homeopaths
• Chiropractic studies done by chiropractors
• Surgical studies done by surgeons
• Studies published in specialty journals for a
biased audience
Who Are the Subjects?
• Self selection bias: who volunteers?
– Believers?
– Professional subjects?
• Select group not typical of the general
population.
– Men only? No children? Limited age group?
– Subjects with concurrent diseases not accepted
– Subjects taking other medications not accepted.
Were Negative Studies Suppressed?
• File drawer effect
– Negative studies not submitted for publication.
– What if 4/5 studies were negative but only the
positive one published?
• Publication bias
– Journals don’t like to publish negative studies.
– Journals don’t like to publish replications that
debunk original results. (Bem, Wiseman)
Did Workers Mislead Author?
• Technicians and subordinates know what the
researcher hopes to find.
– May try to please the boss, consciously or
unconsciously
– May circumvent blinding procedures
– Can record 4.5 as 4 or 5.
– Faking to make job easier (homeopathy prep)
Did Workers Mislead Author?
• Benveniste homeopathy study
• Counting basophil degranulation under the
microscope is somewhat subjective
• Only one technician got positive results
What Are the Odds?
• 9 out of 10 drugs in Phase I clinical trials fail.
• 50% of drugs that reach Phase III trials fail.
• A far higher percentage of promising drugs
never make it to clinical trials; they fail in
animal and in vitro trials.
Do the Data Justify the Conclusion?
• Teaching exercise:
1. Read the data section first
2. Draw your own conclusions
3. Read the paper’s conclusions
4. Scratch your head
Do the Data Justify the Conclusion?
Conclusion: low cholesterol kills children. The
higher the cholesterol, the better for health.
Do the Data Justify the Conclusion?
• Sample of opportunity: data not collected systematically
• Too few points to show correlation
• Correlation doesn’t prove causation
• Other explanations:
– Hygiene
– Poverty
– Disease
– Starvation
– Genetic factors
– Less access to medical care
• Better explanation: undernourished children have abnormally low
cholesterol levels
Do the Data Justify the Conclusion?
Conclusion: by the year 2038 100% of children will be autistic
What Aren’t They Telling Us?
• Selection methods
• Randomization methods
• Identity of placebo
• Whether people were fooled by placebo
• Proper blinding procedures?
• Other factors
– Glassware not thoroughly washed?
– Contaminants in lab?
– Mouse XMRV virus contaminated cell cultures in CFS study
– Did they really do what they said they did?
How Many Dropouts?
• 10 total patients: 7 neg., 3 pos. = 30% pos.
• 6 of the 7 neg. drop out because it’s not working
• Of the 4 who remain, 3 are pos.: the 30% success rate now looks like 75%
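The dropout arithmetic above can be sketched as follows. The function name is hypothetical, and the sketch assumes all six dropouts come from the patients who were not improving.

```python
def success_rate(positives: int, negatives: int) -> float:
    """Fraction of analyzed patients counted as successes."""
    return positives / (positives + negatives)


# Everyone who was enrolled: 3 positive, 7 negative.
enrolled = success_rate(positives=3, negatives=7)

# Only the patients who finished: 6 of the 7 negatives dropped out
# (assumption: no positive patient dropped out).
completers = success_rate(positives=3, negatives=7 - 6)

print(f"All enrolled patients: {enrolled:.0%}")
print(f"Completers only:       {completers:.0%}")
```

Analyzing everyone who was enrolled, rather than only those who finished, is the usual guard against this distortion.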
Where Was the Study Done?
Percent of Acupuncture Trials with Positive Results
• Canada, Australia, New Zealand: 30%
• US: 53%
• Scandinavia: 55%
• UK: 60%
• Rest of Europe: 78%
• Asia: 98%
• Brazil, Israel, Nigeria: 100%
What was the sample size?
• 1/3 of the chickens got better
• 1/3 of the chickens stayed the same
• What about the other third?
Were There Errors in Statistics?
• Wrong statistical test used
• Errors in calculation
What About Noncompliance?
• Did all subjects take their pills?
• Did they take them on time?
Noncompliance
• HIV Prophylaxis study in Africa
– 95% said they usually or always took meds on
time
– Pill count data: 88%
– Tests showed adequate plasma levels of drug: 15-26%
Tooth Fairy Science
• Are they trying to study something that
doesn’t exist?
Emily Rosa and the Emperor's New
Clothes
Inaccurate Measuring Methods?
• Questionnaires rely on unreliable memories
and patient honesty.
– “30% less pain”
– “I eat like a bird”
– “Only one drink”
Using a Bogus Test? Measuring the
Components of ASEA
• A mixture of 16 chemically recombined
products of salt and water with completely
new chemical properties.
• They used a fluorescent indicator as a probe
for unspecified “highly reactive oxygen
species”
How Many Endpoints Were There?
• Multiple endpoints: some will show false
correlations just by chance
• Statistical corrections applied?
• Inappropriate data mining?
• The heart prayer study
– 6 positive out of 26 factors studied
– Inconsistent pattern
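A rough sketch of how quickly chance findings pile up with multiple endpoints. It assumes the 26 endpoints are independent and each is tested at p=0.05, which is a simplification; the Bonferroni correction shown is one common adjustment, not necessarily what any particular study used. Function names are hypothetical.

```python
def familywise_error_rate(n_endpoints: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent endpoints."""
    return 1.0 - (1.0 - alpha) ** n_endpoints


def bonferroni_threshold(n_endpoints: int, alpha: float = 0.05) -> float:
    """Per-endpoint p-value cutoff that keeps the overall error rate near alpha."""
    return alpha / n_endpoints


N_ENDPOINTS = 26  # number of factors in the heart prayer study slide

print(f"Chance of at least one spurious 'hit': "
      f"{familywise_error_rate(N_ENDPOINTS):.0%}")
print(f"Expected spurious hits: {N_ENDPOINTS * 0.05:.1f}")
print(f"Bonferroni-corrected cutoff: p < {bonferroni_threshold(N_ENDPOINTS):.4f}")
```

With this many endpoints, a study that reports no correction should be read as exploratory at best.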
Were Goalposts Moved?
• AIDS prayer study: endpoint death
• Not enough subjects died: AIDS drugs kept
them alive
• They went back and looked at a lot of other
factors and found some apparent successes
(e.g., fewer doctor visits) but no change in
objective tests like CD4 count.
• Only 40 patients. Study wasn’t designed to
test non-death outcomes.
Statistical Significance ≠
Clinical Significance
• Did the drug lower the BP by 1% or 30%?
• Was the endpoint a lab value or a clinical benefit?
• B vitamin supplements lower homocysteine
but don’t lower risk of heart disease
• PSA screening finds cancers; doesn’t improve
survival
• Are the results POEMS – Patient Oriented
Evidence that Matters?
Was There Fraud?
• Dipak Das, resveratrol researcher
– Review board found him guilty of 145 counts of
fabrication or falsification of data
– 12 of his papers retracted so far
“I was blinded by work and my drive
for achievement”
• Hwang Woo-suk, stem cell researcher in South
Korea, claimed to have cloned human
embryonic stem cells
– Fabricated crucial data
– Embezzlement and bioethics law
violations
– Prison sentence (suspended)
– 2 papers in Science retracted.
– Fired from his job
Columbia Prayer Study
• Prayer doubled success of in vitro fertilization
– Seriously flawed study
– Convoluted design with 3 levels of overlapping prayer groups
– No controls for prayers outside study
– Investigated for lack of informed consent
• Authors
– Lobo, lead author, only learned of study 6-12 months after it was
completed. Denied any involvement other than editorial help.
– Cha severed his relationship with Columbia, refused to comment
– Wirth:
• Paranormal researcher with no medical degree
• Con man who went to federal prison for fraud and conspiracy
• Bruce Flamm debunked it in Skeptical Inquirer
• Retracted by journal, but only years later
• Still being cited as a valid study
How Were the Data Reported?
• NNT and NNH
– Lipitor for primary prevention of heart attacks:
• 19% Reduction
• NNT 75-250, NNH 200.
• Absolute risk vs. relative risk
– Cellphones increase the risk of acoustic neuroma. Relative
risk 200%.
– Baseline risk is 1:100,000
– 200% of 1 is 2
– Absolute risk 1 more in 100,000, or 0.001%
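The relative-versus-absolute arithmetic can be checked with a short sketch. It takes the slide's figures at face value (baseline risk of 1 in 100,000, risk doubled to 200% of baseline); the function names are hypothetical, and the number needed to harm is simply the reciprocal of the absolute risk increase.

```python
def absolute_risk_increase(baseline_risk: float, relative_risk: float) -> float:
    """Extra risk per person: exposed risk minus baseline risk."""
    return baseline_risk * relative_risk - baseline_risk


def number_needed(absolute_risk_difference: float) -> float:
    """NNT or NNH: people exposed per one additional outcome."""
    return 1.0 / absolute_risk_difference


baseline = 1 / 100_000   # baseline risk from the slide
relative_risk = 2.0      # "200%" of baseline, i.e. the risk doubles

extra = absolute_risk_increase(baseline, relative_risk)
print(f"Extra cases per 100,000 people: {extra * 100_000:.0f}")
print(f"Absolute risk increase: {extra:.3%}")
print(f"Number needed to harm: about {number_needed(extra):,.0f}")
```

The same relative risk sounds alarming or trivial depending on the baseline, which is why both figures should be reported together.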
What Are the Confidence Intervals?
• Confidence interval of 95%:
Where Was the Study Published?
• Acupuncture studies in acupuncture journals?
• Homeopathy studies in homeopathy journals?
• Acupuncture or homeopathy study published
in a major medical journal?
Were the Results Misinterpreted?
• True acupuncture = sham acupuncture
• Both better than placebo pill
• Acupuncturist’s interpretation:
– Sham acupuncture must work too
• Better interpretation: both are impressive
placebos.
Were Recommendations Justified?
• X didn’t work, but it didn’t cause harm, and
since we have no other effective treatment for
Y, we should continue to use X.
• New drug X didn’t work better than the
placebo, but we didn’t see any side effects,
and since we have no other effective
treatment for Y, X should be approved for
marketing.
Do we really know what the study
showed?
• Peer critiques, letters to editor
• Media distortions
– Presenting preliminary evidence as definitive
– Misinterpreting results of study
Glucosamine/Chondroitin Study
• Overall: not effective
• Subgroup analysis (10 subgroups)
– Pos. in patients with moderate to severe arthritis
– Neg. in patients with mild to moderate arthritis
• Reported in the media as both + and –
• Authors said study not powered to show
effectiveness in subgroup
What does p value mean?
• It’s significant at the p=0.05 level, so it must be
true.
• p=0.05
– Means a 5 in 100 chance of a positive result when the
treatment doesn’t work
– Doesn’t mean a 95% chance that a positive result is
true.
• Says nothing about the meaning
of a positive result
4 possible outcomes
P Value (false positive rate, 1 – specificity): Probability that
an ineffective treatment will give a
false positive result
Sensitivity: The probability that an
effective treatment will show a positive
result
Positive Predictive Value: if the study is
positive, how likely is it to be true?
PPV = TP / (TP + FP)
Prior Probability 50%
False positive rate: 0.05
If researchers don’t consider prior probability, they are automatically assigning a PP of 50%.
PPV = 80/(80+5) = 94%. 6% chance that positive results are wrong.
PPV with Prior Probability 5%
TP: 80% of 5% = 4%
FP: 5% of 95% = 4.75%
PPV = TP / (TP + FP) = 4% / 8.75%
Chance that positive result is true: 46%
Slightly less than coin toss
PPV with Prior Probability 5% and 1%
PP 5%
TP: 80% of 5% = 4%
FP: 5% of 95% = 4.75%
PPV = TP / (TP + FP) = 4% / 8.75%
Chance that positive result is true: 46%
Slightly less than coin toss
PP 1%
TP: 80% of 1% = 0.8%
FP: 5% of 99% = 4.95%
PPV = 0.8% / (0.8% + 4.95%) = 0.8/5.75 = 0.14
Chance that positive result is true: 14%
Chance that positive result is false: 86%
The Plausibility Problem, by David Weinberg, on SBM
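The PPV figures on these slides can be reproduced with a few lines. This is a sketch under the slides' own assumptions: 80% sensitivity (power) and a 5% false positive rate; the function name is hypothetical.

```python
def positive_predictive_value(prior: float,
                              power: float = 0.80,
                              alpha: float = 0.05) -> float:
    """Probability that a positive study result reflects a real effect.

    prior: probability, before the study, that the treatment works
    power: chance an effective treatment yields a positive result
    alpha: chance an ineffective treatment yields a positive result
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# The three prior probabilities used on the slides.
for prior in (0.50, 0.05, 0.01):
    ppv = positive_predictive_value(prior)
    print(f"Prior {prior:.0%}: chance a positive result is true = {ppv:.0%}")
```

The study design is identical in all three cases; only the plausibility of the claim changes, and the meaning of a "positive" result changes with it.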
Statistical Significance
• The p=0.05 cutoff is arbitrary
• Statistical significance doesn’t mean clinical
significance
• Be wary of studies that say outcome was
positive but admit it was not statistically
significant.
– “X worked better than Y, but the results didn’t
reach significance.”
Power of a study
• A power of .8 in medical research is
considered a very powerful study and would
require a large number of patients.
• If the power is less than that, the Positive
Predictive Value drops even more.
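To give a feel for why a power of .8 needs many patients, here is a rough sketch using the standard normal-approximation formula for comparing two group means. The formula, the two-sided p=0.05 threshold, and the effect sizes in the loop are assumptions for illustration, not figures from the talk; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def patients_per_group(effect_size: float,
                       power: float = 0.80,
                       alpha: float = 0.05) -> int:
    """Approximate patients needed in each arm of a two-group trial.

    effect_size: difference between groups in standard-deviation units
    Uses n = 2 * (z_alpha + z_power)^2 / effect_size^2 (normal approximation).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided significance cutoff
    z_power = z.inv_cdf(power)
    return ceil(2.0 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Hypothetical large, medium, and small standardized effects.
for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {patients_per_group(d)} patients per group")
```

Small effects, which are the norm for real treatments, push the required enrollment into the hundreds per group, which is why underpowered studies are so common.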
When CAN You Believe
Research? Bausell’s Quick Checklist
• Randomized with a credible control group
• At least 50 subjects per group
• Dropout rate 25% or less
• Published in a high-quality, prestigious, peer-reviewed journal
When CAN You Believe Research?
• Confirmed by other studies
• Consistent with other knowledge
• Prior probability
Systematic Reviews and Meta-analyses
• A flawed method to sort out conflicting
research results.
• A few high quality studies trump the
conclusions of a meta-analysis.
• The results of a meta-analysis usually fail to
predict the results of future good clinical
trials.
Does It Make Sense?
• Energy medicine proponents claim to
have measured a 2 milligauss magnetic
field emanating from practitioners’ hands
• Reproducible measurements by other scientists fall in
the range of 0.004 milligauss.
• The magnetic field of the earth is 500 milligauss.
• Refrigerator magnet: 50,000 milligauss.
• Even if the 2 milligauss measurement were accurate, it
would be many orders of magnitude below the cell’s
noise level and billions of times less than the energy
received by your eye when viewing the brightest star.
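A quick sketch to put the field strengths quoted above side by side. It only divides the slide's own numbers; the dictionary labels and function name are hypothetical.

```python
# Field strengths in milligauss, as quoted on the slide.
FIELDS_MILLIGAUSS = {
    "reproducible measurement from hands": 0.004,
    "claimed field from practitioners' hands": 2.0,
    "Earth's magnetic field": 500.0,
    "refrigerator magnet": 50_000.0,
}


def times_stronger(field: float, reference: float) -> float:
    """How many times stronger one field is than a reference field."""
    return field / reference


claimed = FIELDS_MILLIGAUSS["claimed field from practitioners' hands"]

for name, strength in FIELDS_MILLIGAUSS.items():
    ratio = times_stronger(strength, claimed)
    print(f"{name}: {strength} mG ({ratio:g}x the claimed field)")
```

Even taking the claimed value at face value, everyday background fields swamp it by factors of hundreds to tens of thousands.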
Evidence-Based Medicine Isn’t Enough
EBM is working hard, but it got something wrong.
What’s missing from the EBM
pyramid?
Carl Sagan
Don’t forget prior plausibility
• “Extraordinary claims require extraordinary
evidence.”
• If basic science says a treatment is implausible,
we must set the bar for clinical evidence higher.
EBM Founders’ Assumptions
• The rigorous clinical trial is the final
arbiter of any claim that has already
demonstrated promise by all other
criteria—basic science, animal studies,
legitimate case series, small controlled
trials, “expert opinion,” etc.
• Claims lacking in promise were not even
part of the discussion.
Plausibility Spectrum for CAM
• Homeopathy: close to zero.
• Acupuncture: intermediate
– Underlying Oriental concepts have low plausibility
– But it’s plausible that inserting needles in the skin
could cause physiological effects
• Herbal medicine: high plausibility because
plants produce drugs.
EBM Accepts Tooth Fairy Science
• Typical fairy tales:
– Reiki (faith healing that substitutes Eastern mysticism
for Christian beliefs)
– Homeopathy (essentially a form of sympathetic magic)
– Therapeutic touch (A misnomer for smoothing out
wrinkles in a mythical human energy field without
actually touching the patient)
• Trying to apply the tools of science to these
therapeutic modalities based on fantasy just
produces a lot of confusing noise
Tooth Fairy Science:
Studying Therapeutic Touch
• Therapeutic Touch is said to manipulate
the alleged “human energy field.”
• Controlled TT studies have been done for:
– Pain
– Bone marrow transplant
– Recovery from cardiac surgery
• Positive results due to the effects of
suggestion and attention, not to a mythical
energy field.
Tooth Fairy Science: Therapeutic Touch
• 2008 study: Randomized trial of healing touch
to speed recovery from coronary artery
bypass surgery:
– Decrease in anxiety and length of stay
– No significant differences for other endpoints
• Lab study: therapeutic touch affects DNA
synthesis and mineralization of human
osteoblasts in culture.
• Cochrane review: positive
Another EBM Pitfall: Pragmatic Studies
• Clinical trials select patients to minimize possible
confounders: Subjects tend to be healthier and on
fewer medications than the average patient.
• Pragmatic studies look at the outcome of treatments
in real-world settings.
• Clotbuster drugs worked well for strokes in clinical
trials, but with more extensive use in ERs they
caused more strokes from bleeding complications.
Pragmatic Trials May Not Be Appropriate
for CAM Treatments
• Intended to evaluate practical real-world use
of treatments that have already been proven
to work in clinical trials.
• Pragmatic trials can’t provide objective
evidence that a treatment has effects beyond
placebo.
• CAM proponents favor pragmatic studies
because
– They don’t control for placebo effects.
– They can bypass good science.
– They can make CAM look better than it
really is.
Pragmatic Trial
• Acupuncture vs. usual care for low back pain.
• Acupuncture wins.
• Have you proved acupuncture is really more
effective?
• No, this is Cinderella Science
Cinderella Revised
Before
Cinderella in rags
and ashes
After
Ugly Stepsister who has
had a complete
makeover
Pragmatic Trial of Acupuncture for Low
Back Pain
Usual Care
Acupuncture
Needles alone
Acupuncture
Evaluating Evidence in Medicine: What
Can Go Wrong?
• Placebo effect
• Therapeutic effect of consultation (suggestion, expectation)
• Unassisted natural healing (natural course of disease)
• Unrecognized treatments (the spaghetti sauce factor)
• Regression toward the mean
• Other concurrent conventional treatment
• Cessation of unpleasant treatment
• Lifestyle changes
• Publishing standards (p=0.05)
• Publication bias
What Can Go Right?
• Large well designed studies
• Prior plausibility
• Strongly positive results
• Consensus of experts
• Coherent body of evidence
Evaluating Evidence in Medicine: What
Can Go Wrong?
Dr. Jay Gordon
Pediatrician to the Stars
• “My very strong impression is that children
with the fewest vaccines, or no vaccines at all,
get sick less frequently and are healthier in
general. I truly believe they also develop less
autism.”
Take-Home Points
• There are various kinds of evidence, some more
credible than others.
• Most published clinical studies are wrong.
• Even “evidence-based medicine” can get it
wrong.
• Statistical significance ≠ clinical significance.
• Prior probability is important.
• Clinical studies don’t “prove” – they only change
the probabilities.
The End