Evaluating Evidence in Medicine: What Can Go Wrong?
Skeptic's Toolbox 2012
Harriet Hall, MD, The SkepDoc

Overview
• What constitutes evidence in medicine?
• What can go wrong in clinical studies?
• Why even "evidence-based medicine" is flawed.

Is This Evidence?

MRI Study of Salmon
• A salmon was shown photographs of humans in social situations. It was asked to think about what emotion the individual in the photo must have been experiencing.
• The salmon couldn't talk, but:
• On the fMRI scan, areas in the salmon's brain lit up, indicating increased blood flow, indicating that the salmon was thinking.

Is This Evidence That:
• Salmon can see pictures?
• Salmon know what human emotions are?
• Salmon can identify emotions from pictures?
• Salmon can respond to requests of what to think about?

What's Wrong With This Picture?
The salmon was dead and gutted!

Statistical Artifact
• Each fMRI scan measures 50,000 voxels (3-D pixels), and each study involves thousands of scans.
• If you mine the data, you can find practically anything you want (see the sketch below).
• Brain scans are the new phrenology:
– A blunt instrument
– Scans are pooled to establish a normal average
– Often don't mean what people think they mean
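To see why a dead salmon can "light up," here is a minimal simulation of the multiple-comparisons problem. The voxel count comes from the slide; the p < 0.001 threshold is an assumed, typical uncorrected cutoff, not the actual salmon analysis.

```python
import numpy as np

# Minimal sketch: test 50,000 independent "voxels" of pure noise at an
# uncorrected p < 0.001 and count how many pass by chance alone.
rng = np.random.default_rng(0)
n_voxels = 50_000
z = rng.standard_normal(n_voxels)   # voxel statistics under the null: no activity at all
threshold = 3.09                    # one-sided z cutoff for p ~ 0.001
hits = int(np.sum(z > threshold))
print(f"'Active' voxels found in pure noise: {hits} (expect ~{0.001 * n_voxels:.0f})")
# Roughly 50 voxels "light up" with no signal present; uncorrected, a few
# of them can cluster into a convincing blob on a dead fish's brain.
```

This was the salmon poster's own point: after proper correction for multiple comparisons, the "activity" disappeared.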
[Slide: Amen clinic poster]

Would You Accept This Evidence?
• I tried it. I got better. It worked for me.
• Lots of people tried it and got better.
• We gave it to a lot of people in a study and they improved.
• We compared it to a no-treatment group or a usual-treatment group and it worked better.
• We compared it to a placebo and it worked better.
• The weight of evidence from a large body of studies shows that it works better than placebo.

Is This Evidence?
• I tried it. I got better. It worked for me.
– Anecdote. The plural of anecdote is not data.
– Post hoc ergo propter hoc fallacy.
– Does echinacea prevent colds?
– Removing glucosamine didn't remove the effects.
• We gave it to a lot of people in a study and they improved.
– Uncontrolled study. Maybe they would have improved without treatment.
– The cold got better in a week with treatment; it lasted 7 days without treatment.

Is This Evidence?
• Our study compared it to a no-treatment group or a usual-treatment group and it worked better.
– Hawthorne effect: doing something is better than doing nothing.
• Our study compared it to a placebo and it worked better.
– Was the study blinded?
– The double-blind, placebo-controlled randomized study is the Gold Standard.
• BUT: what if we do a Gold Standard study on something totally implausible and it works better than a placebo?

There's A Lot of Evidence: A Fire Hose of Information
• 21 million papers are listed in PubMed:
– 700,000 more each year
– More than one a minute
• PubMed lists 23,000 journals, and there are many more not listed.
• You can find a study to support any belief.

Never Believe One Study
• Early positive studies are often superseded by better, negative studies (HRT).
• Ioannidis: most published research findings are false.

Ioannidis
• The smaller the study, the less likely the research findings are to be true.
• The smaller the effect, the less likely the research findings are to be true.
• The greater the financial and other interests, the less likely the research findings are to be true.
• The hotter a scientific field (with more research teams involved), the less likely the research findings are to be true.

Evaluating a Study
• Ask a lot of questions.
• I'll cover some of them.

Skeptics Question Everything

What Kind of Study?
• Case report
• Case series
• Case-control
• Cohort
• Epidemiologic
• RCT
• Placebo-controlled
• Blinded (single or double)

Who's Paying?
• Studies sponsored by pharmaceutical companies are more likely to be positive:
– Subtle bias
– Unpublished negative information
• Studies by researchers with financial conflicts of interest (consulting fees, honoraria from a pharmaceutical company) are more likely to be positive: 91% vs. 67%.

Big Pharma Distortion
• Turner looked at all antidepressant studies registered with the FDA:
– Published studies: 94% positive
– Unpublished studies: 51% positive
• Evidence that antidepressants don't work? No.

Effect Size: Turner vs. Kirsch
• Kirsch said an effect size < 0.5 means ineffective:
– Effect size from the published journals: 0.41
– True effect size: 0.31
– Therefore antidepressants are not effective.
• Turner said the glass is not empty but 1/3 full.
• Patients' responses are not all-or-none; partial responses can be meaningful.
• Antidepressants DO work, just not as well as originally thought.
• Kirsch supports psychotherapy, but its effect size is much less than 0.5.
(A sketch of what "effect size" means follows below.)
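For readers unfamiliar with the 0.41/0.31 figures above, here is a minimal sketch of the standardized effect size (Cohen's d) they refer to. The trial data below are invented for illustration; the published figures came from meta-analyses, not from any single trial like this.

```python
import numpy as np

def cohens_d(treated, control):
    """Standardized effect size: difference in means over the pooled SD."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * np.var(treated, ddof=1) +
                  (n2 - 1) * np.var(control, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(treated) - np.mean(control)) / np.sqrt(pooled_var)

# Invented numbers: improvement on a depression rating scale.
rng = np.random.default_rng(42)
drug = rng.normal(10.0, 8.0, 200)     # drug arm improves 10 points on average
placebo = rng.normal(7.5, 8.0, 200)   # placebo arm improves too
print(f"d = {cohens_d(drug, placebo):.2f}")   # around 0.3: real, but modest
```

An effect size of 0.3 means the average treated patient improved about a third of a standard deviation more than the average placebo patient, which is why Turner and Kirsch could look at the same number and disagree about whether it matters.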
Scam Product Testing
• In-house: by non-academics on the company's payroll.
– Worthless. Tweaked to get the desired results.
• Independent testing companies: guns for hire.
• Minuscule effects touted as significant.
• Effects found, but not specific to the product:
– Amino acids may improve muscle strength.
• Effects may not apply to average people (e.g., taping injuries).

Are the Researchers Biased?
• Homeopathy studies done by homeopaths.
• Chiropractic studies done by chiropractors.
• Surgical studies done by surgeons.
• Studies published in specialty journals for a biased audience.

Who Are the Subjects?
• Self-selection bias: who volunteers?
– Believers?
– Professional subjects?
• A select group not typical of the general population:
– Men only? No children? Limited age group?
– Subjects with concurrent diseases not accepted.
– Subjects taking other medications not accepted.

Were Negative Studies Suppressed?
• File drawer effect:
– Negative studies are not submitted for publication.
– What if 4 of 5 studies were negative but only the positive one was published?
• Publication bias:
– Journals don't like to publish negative studies.
– Journals don't like to publish replications that debunk original results. (Bem, Wiseman)

Did Workers Mislead the Author?
• Technicians and subordinates know what the researcher hopes to find.
– May try to please the boss, consciously or unconsciously.
– May circumvent blinding procedures.
– Can record 4.5 as 4 or 5.
– Faking to make the job easier (homeopathy prep).

Did Workers Mislead the Author?
• Benveniste homeopathy study.
• Counting basophil degranulation under the microscope is somewhat subjective.
• Only one technician got positive results.

What Are the Odds?
• 9 out of 10 drugs in Phase I clinical trials fail.
• 50% of drugs that reach Phase III trials fail.
• A far higher percentage of promising drugs never make it to clinical trials; they fail in animal and in vitro studies.

Do the Data Justify the Conclusion?
• Teaching exercise:
1. Read the data section first.
2. Draw your own conclusions.
3. Read the paper's conclusions.
4. Scratch your head.

Do the Data Justify the Conclusion?
• The paper's conclusion: low cholesterol kills children; the higher the cholesterol, the better for health.
• Sample of opportunity: data not collected systematically.
• Too few points to show a correlation.
• Correlation doesn't prove causation.
• Other explanations:
– Hygiene
– Poverty
– Disease
– Starvation
– Genetic factors
– Less access to medical care
• Better explanation: undernourished children have abnormally low cholesterol levels.

Do the Data Justify the Conclusion?
• Conclusion: by the year 2038, 100% of children will be autistic.

What Aren't They Telling Us?
• Selection methods
• Randomization methods
• Identity of the placebo
• Whether people were fooled by the placebo
• Proper blinding procedures?
• Other factors:
– Glassware not thoroughly washed?
– Contaminants in the lab?
– Mouse XMRV virus contaminated the cell cultures in the CFS study.
– Did they really do what they said they did?

How Many Dropouts?
• 10 total patients: 7 negative, 3 positive = 30% positive.
• 6 of the failures drop out because the treatment isn't working.
• The 30% success rate now looks like 75% (3 of the remaining 4).

Where Was the Study Done?
Percent of acupuncture trials with positive results:
• Canada, Australia, New Zealand: 30%
• US: 53%
• Scandinavia: 55%
• UK: 60%
• Rest of Europe: 78%
• Asia: 98%
• Brazil, Israel, Nigeria: 100%

What Was the Sample Size?
• 1/3 of the chickens got better.
• 1/3 of the chickens stayed the same.
• What about the other third?

Were There Errors in Statistics?
• Wrong statistical test used.
• Errors in calculation.

What About Noncompliance?
• Did all subjects take their pills?
• Did they take them on time?

Noncompliance
• HIV prophylaxis study in Africa:
– 95% said they usually or always took their meds on time.
– Pill count data: 88%.
– Tests showed adequate plasma levels of the drug: 15-26%.

Tooth Fairy Science
• Are they trying to study something that doesn't exist?
• Emily Rosa and the Emperor's New Clothes.

Inaccurate Measuring Methods?
• Questionnaires rely on unreliable memories and patient honesty.
– "30% less pain"
– "I eat like a bird"
– "Only one drink"

Using a Bogus Test? Measuring the Components of ASEA
• Claimed to be "a mixture of 16 chemically recombined products of salt and water with completely new chemical properties."
• They used a fluorescent indicator as a probe for unspecified "highly reactive oxygen species."

How Many Endpoints Were There?
• Multiple endpoints: some will show false correlations just by chance (see the sketch below).
• Were statistical corrections applied?
• Inappropriate data mining?
• The heart prayer study:
– 6 positive out of 26 factors studied.
– Inconsistent pattern.
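A quick back-of-the-envelope check on the heart prayer study's 26 endpoints, assuming for simplicity that the endpoints are independent and each is tested at p < 0.05:

```python
# With 26 endpoints each tested at p < 0.05, the chance of at least one
# false positive is high even if the treatment does nothing.
# (Assumes independent endpoints, a simplification.)
alpha, k = 0.05, 26
p_any = 1 - (1 - alpha) ** k
print(f"P(at least one chance 'hit' across {k} endpoints) = {p_any:.0%}")  # ~74%
# A Bonferroni correction would require p < alpha / k, i.e. p < 0.0019,
# before calling any single endpoint significant.
```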
Were Goalposts Moved?
• AIDS prayer study: the endpoint was death.
• Not enough subjects died: AIDS drugs kept them alive.
• They went back and looked at a lot of other factors and found some apparent successes (e.g., fewer doctor visits) but no change in objective tests like CD4 count.
• Only 40 patients. The study wasn't designed to test non-death outcomes.

Statistical Significance ≠ Clinical Significance
• Did the drug lower the BP by 1% or 30%?
• Was the endpoint a lab value or a clinical benefit?
• B vitamin supplements lower homocysteine but don't lower the risk of heart disease.
• PSA screening finds cancers; it doesn't improve survival.
• Are the results POEMs (Patient-Oriented Evidence that Matters)?

Was There Fraud?
• Dipak Das, resveratrol researcher:
– A review board found him guilty of 145 counts of fabrication or falsification of data.
– 12 of his papers retracted so far.
– "I was blinded by work and my drive for achievement."
• Hwang Woo-suk, stem cell researcher in South Korea, claimed to have cloned human embryonic stem cells:
– Fabricated crucial data.
– Embezzlement and bioethics law violations.
– Prison sentence (suspended).
– 2 papers in Science retracted.
– Fired from his job.

Columbia Prayer Study
• Prayer doubled the success of in vitro fertilization:
– Seriously flawed study.
– Convoluted design with 3 levels of overlapping prayer groups.
– No controls for prayers outside the study.
– Investigated for lack of informed consent.
• Authors:
– Lobo, the lead author, learned of the study only 6-12 months after it was completed. He denied any involvement other than editorial help.
– Cha severed his relationship with Columbia and refused to comment.
– Wirth: a paranormal researcher with no medical degree; a con man who went to federal prison for fraud and conspiracy.
• Bruce Flamm debunked it in Skeptical Inquirer.
• Retracted by the journal, but only years later.
• Still being cited as a valid study.

How Were the Data Reported?
• NNT and NNH:
– Lipitor for primary prevention of heart attacks: 19% relative risk reduction; NNT 75-250, NNH 200.
• Absolute risk vs. relative risk (worked through in the sketch below):
– Cellphones increase the risk of acoustic neuroma. Relative risk: 200%.
– The baseline risk is 1 in 100,000.
– 200% of 1 is 2.
– Absolute risk: 1 more in 100,000, an increase of 0.001%.
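The cellphone arithmetic from the slide, spelled out in code. The numbers are the slide's illustration, not a risk estimate from any particular study:

```python
# Relative vs. absolute risk, using the slide's cellphone example:
# a scary relative risk can hide a negligible absolute risk.
baseline = 1 / 100_000            # baseline risk of acoustic neuroma
relative_risk = 2.0               # a "200%" relative risk sounds alarming
exposed = baseline * relative_risk
absolute_increase = exposed - baseline
print(f"Risk goes from {baseline:.3%} to {exposed:.3%}")       # 0.001% -> 0.002%
print(f"Absolute increase: {absolute_increase:.3%}")           # 0.001%
print(f"Number needed to harm: {1 / absolute_increase:,.0f}")  # 100,000
```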
What Are the Confidence Intervals?
• A confidence interval of 95%: the range expected to contain the true value in 95% of repeated studies.

Where Was the Study Published?
• Acupuncture studies in acupuncture journals?
• Homeopathy studies in homeopathy journals?
• An acupuncture or homeopathy study published in a major medical journal?

Were the Results Misinterpreted?
• True acupuncture = sham acupuncture.
• Both better than a placebo pill.
• The acupuncturist's interpretation: sham acupuncture must work too.
• The better interpretation: both are impressive placebos.

Were Recommendations Justified?
• X didn't work, but it didn't cause harm, and since we have no other effective treatment for Y, we should continue to use X.
• New drug X didn't work better than the placebo, but we didn't see any side effects, and since we have no other effective treatment for Y, X should be approved for marketing.

Do We Really Know What the Study Showed?
• Peer critiques, letters to the editor.
• Media distortions:
– Presenting preliminary evidence as definitive.
– Misinterpreting the results of a study.

Glucosamine/Chondroitin Study
• Overall: not effective.
• Subgroup analysis (10 subgroups):
– Positive in patients with moderate to severe arthritis.
– Negative in patients with mild to moderate arthritis.
• Reported in the media as both positive and negative.
• The authors said the study was not powered to show effectiveness in a subgroup.

What Does the p Value Mean?
• "It's significant at the p=0.05 level, so it must be true."
• p=0.05:
– Means a 5-in-100 chance that an ineffective treatment would produce a positive result.
– Doesn't mean a 95% chance that a positive result is true.
• Says nothing about the meaning of a positive result.

4 Possible Outcomes
• A positive result is either a true positive (TP) or a false positive (FP); a negative result is either a true negative or a false negative.
• p value: the probability that an ineffective treatment will give a false positive result (the false positive rate, 1 - specificity).
• Sensitivity (power): the probability that an effective treatment will show a positive result.
• Positive predictive value (PPV): if the study is positive, how likely is it to be true?
• PPV = TP / (TP + FP)

PPV with Prior Probability 50% (assuming power 80%, p=0.05)
• If researchers don't consider prior probability, they are automatically assigning a PP of 50%.
• TP: 80% of 50% = 40%; FP: 5% of 50% = 2.5%.
• PPV = 40% / (40% + 2.5%) = 94%.
• 6% chance that positive results are wrong.

PPV with Prior Probability 5%
• TP: 80% of 5% = 4%.
• FP: 5% of 95% = 4.75%.
• PPV = TP / (TP + FP) = 4% / 8.75% = 46%.
• Chance that a positive result is true: 46%, slightly less than a coin toss.

PPV with Prior Probability 1%
• TP: 80% of 1% = 0.8%.
• FP: 5% of 99% = 4.95%.
• PPV = 0.8% / (0.8% + 4.95%) = 0.8/5.75 = 14%.
• Chance that a positive result is true: 14%; chance that it is false: 86%.
• See "The Plausibility Problem," by David Weinberg, on SBM.

Statistical Significance
• The p=0.05 cutoff is arbitrary.
• Statistical significance doesn't mean clinical significance.
• Be wary of studies that say the outcome was positive but admit it was not statistically significant.
– "X worked better than Y, but the results didn't reach significance."

Power of a Study
• A power of 0.8 in medical research is considered a very powerful study and would require a large number of patients.
• If the power is less than that, the positive predictive value drops even more (see the sketch below).
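The PPV arithmetic from the last few slides, collected into one small function. Power 0.8 and alpha 0.05 are the slides' own assumptions:

```python
def ppv(prior, power=0.80, alpha=0.05):
    """P(effect is real | study is positive), given a prior probability."""
    true_pos = power * prior          # real effects that are detected
    false_pos = alpha * (1 - prior)   # null effects that pass p < 0.05
    return true_pos / (true_pos + false_pos)

# Reproduces the slides' numbers:
for prior in (0.50, 0.05, 0.01):
    print(f"prior probability {prior:.0%}: PPV = {ppv(prior):.0%}")
# prior probability 50%: PPV = 94%
# prior probability 5%: PPV = 46%
# prior probability 1%: PPV = 14%
# And lower power makes it worse: ppv(0.05, power=0.4) is about 30%.
```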
When CAN You Believe Research? Bausell's Quick Checklist
• Randomized, with a credible control group.
• At least 50 subjects per group.
• Dropout rate of 25% or less.
• Published in a high-quality, prestigious, peer-reviewed journal.

When CAN You Believe Research?
• Confirmed by other studies.
• Consistent with other knowledge.
• Prior probability.

Systematic Reviews and Meta-analyses
• A flawed method to sort out conflicting research results.
• A few high-quality studies trump the conclusions of a meta-analysis.
• The results of a meta-analysis usually fail to predict the results of future good clinical trials.

Does It Make Sense?
• Energy medicine proponents claim to have measured a 2 milligauss magnetic field emanating from practitioners' hands.
• Reproducible measurements by other scientists fall in the range of 0.004 milligauss.
• The magnetic field of the earth is 500 milligauss.
• A refrigerator magnet: 50,000 milligauss.
• Even if the 2 milligauss measurement were accurate, it would be many orders of magnitude below the cell's noise level and billions of times less than the energy received by your eye when viewing the brightest star.

Evidence-Based Medicine Isn't Enough
• EBM is working hard, but it got something wrong.
• What's missing from the EBM pyramid?

Carl Sagan
• "Extraordinary claims require extraordinary evidence."

Don't Forget Prior Plausibility
• If basic science says a treatment is implausible, we must set the bar for clinical evidence higher.

EBM Founders' Assumptions
• The rigorous clinical trial is the final arbiter of any claim that has already demonstrated promise by all other criteria: basic science, animal studies, legitimate case series, small controlled trials, "expert opinion," etc.
• Claims lacking in promise were not even part of the discussion.

Plausibility Spectrum for CAM
• Homeopathy: close to zero.
• Acupuncture: intermediate.
– The underlying Oriental concepts have low plausibility.
– But it's plausible that inserting needles in the skin could cause physiological effects.
• Herbal medicine: high plausibility, because plants produce drugs.

EBM Accepts Tooth Fairy Science
• Typical fairy tales:
– Reiki (faith healing that substitutes Eastern mysticism for Christian beliefs).
– Homeopathy (essentially a form of sympathetic magic).
– Therapeutic touch (a misnomer for smoothing out wrinkles in a mythical human energy field without actually touching the patient).
• Trying to apply the tools of science to these therapeutic modalities based on fantasy just produces a lot of confusing noise.

Tooth Fairy Science: Studying Therapeutic Touch
• Therapeutic Touch is said to manipulate the alleged "human energy field."
• Controlled TT studies have been done for:
– Pain
– Bone marrow transplant
– Recovery from cardiac surgery
• Positive results are due to the effects of suggestion and attention, not to a mythical energy field.

Tooth Fairy Science: Therapeutic Touch
• 2008 study: a randomized trial of healing touch to speed recovery from coronary artery bypass surgery:
– Decrease in anxiety and length of stay.
– No significant differences for other endpoints.
• Lab study: therapeutic touch affects DNA synthesis and mineralization of human osteoblasts in culture.
• Cochrane review: positive.

Another EBM Pitfall: Pragmatic Studies
• Clinical trials select patients to minimize possible confounders: subjects tend to be healthier and on fewer medications than the average patient.
• Pragmatic studies look at the outcome of treatments in real-world settings.
• Clotbuster drugs worked well for strokes in clinical trials, but with more extensive use in ERs they caused more strokes from bleeding complications.

Pragmatic Trials May Not Be Appropriate for CAM Treatments
• They are intended to evaluate practical real-world use of treatments that have already been proven to work in clinical trials.
• Pragmatic trials can't provide objective evidence that a treatment has effects beyond placebo.
• CAM proponents favor pragmatic studies because:
– They don't control for placebo effects.
– They can bypass good science.
– They can make CAM look better than it really is.

Pragmatic Trial
• Acupuncture vs. usual care for low back pain.
• Acupuncture wins.
• Have you proved acupuncture is really more effective?
• No, this is Cinderella Science.

Cinderella Revised
• Before: Cinderella in rags and ashes.
• After: the ugly stepsister who has had a complete makeover.

[Diagram: pragmatic trial of acupuncture for low back pain, comparing usual care vs. acupuncture, and needles alone vs. acupuncture]

Evaluating Evidence in Medicine: What Can Go Wrong?
• Placebo effect
• Therapeutic effect of the consultation (suggestion, expectation)
• Unassisted natural healing (the natural course of the disease)
• Unrecognized treatments (the spaghetti sauce factor)
• Regression toward the mean (illustrated in the sketch below)
• Other concurrent conventional treatment
• Cessation of an unpleasant treatment
• Lifestyle changes
• Publishing standards (p=0.05)
• Publication bias
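Regression toward the mean deserves a concrete picture. Here is a minimal simulation with invented numbers: patients who enroll on a bad day drift back toward their own average with no treatment at all.

```python
import numpy as np

# Minimal sketch of regression toward the mean (all numbers invented).
# Patients enroll in a trial when a symptom score flares above 70,
# then are remeasured a month later with NO treatment in between.
rng = np.random.default_rng(7)
n = 100_000
usual = rng.normal(50, 10, n)          # each patient's usual symptom level
day1 = usual + rng.normal(0, 10, n)    # noisy score at enrollment
day30 = usual + rng.normal(0, 10, n)   # noisy score a month later, untreated
enrolled = day1 > 70                   # only patients having a bad day enroll
print(f"Enrolled patients' mean score, day 1:  {day1[enrolled].mean():.1f}")
print(f"Enrolled patients' mean score, day 30: {day30[enrolled].mean():.1f}")
# Scores fall by roughly a dozen points with zero treatment, purely
# because extreme measurements tend to be followed by less extreme ones.
```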
• Clinical studies don’t “prove” – they only change the probabilities. The End