The Importance of Controlled Experiments and the Semmelweis Reflex
Ronny Kohavi
5/7/2008, updated 5/19/2008, updated 7/13/2008

I wanted to share some stories I collected in the last few months about the importance of controlled experiments vs. other experimental designs in establishing causality. These are summarized in the first part of this document. The second part provides stories where people reject results because they conflict with their strongly held beliefs. Evidence (e.g., experimental results) that contradicts beliefs will be questioned, and attempts will be made to find any flaw in the design or the analysis. That’s one reason to build a trustworthy system and spend significant time on analysis and reliability.

Gary Loveman, the COO of Harrah’s, said that there were three ways to get fired at Harrah’s: steal, harass women, or institute a program or policy without first running an experiment (Hard Facts, p. 15). The culture at Microsoft is far from Loveman’s culture, and even when experiments are executed here, there will sometimes be resistance to incorporating the results.

1. Importance of Randomized Controlled Experiments

Here are two good examples of the importance of running randomized controlled experiments. The first example uses the term “placebo-controlled clinical trial,” which is the medical term for the randomized controlled experiment we use. The second example shows results from a quasi-experimental design, which, as its name implies, is not a real randomized design but tries to imitate one.

1.1 Hormone-Replacement Therapy

The following story is from the NY Times, May 5, 2002. It’s a story told by Kevin Patterson, an internist.

When I started practicing medicine in the early 90's, one of my enthusiasms was hormone-replacement therapy.
At that time, the observation had been made, repeatedly, that postmenopausal women who happened to take estrogen -- for osteoporosis or hot flashes, for instance -- were less likely to have heart attacks and strokes than women who didn't. I remember telling women in their 50's how premenopausal women were relatively immune to cardiovascular disease, at least compared with men, but that once they had been through menopause, this relative protection disappeared quickly. ''Take the estrogen,'' I suggested over and over. ''Preserve your youthful coronaries.'' This was in Manitoba, and these were pragmatic, sensible prairie women. I insisted to them that the recommendations and the evidence seemed clear. I remember my patients' brows knitting at the thought of menstrual cycles extending into their dotage, but ultimately the argument felt compelling. Certainly it did for me. I remembered being told in medical school that the underuse of estrogen was one of the great crimes of the medical patriarchy, itself an expression of latent misogyny. No misogynist I, off I went to work, my prescription pad leaping to hand at the sight of bifocals or pastel cardigans.

Then in 1998, the results of a formal, placebo-controlled clinical trial called the Heart and Estrogen/Progestin Replacement Study (HERS) were published. It showed that estrogen did not prevent heart attacks or strokes and, in fact, it made women more susceptible to blood clots. The net cardiovascular effect therefore was negative. This study astonished most doctors -- for me, it certainly felt like a betrayal. Betrayed by the recommendations, we had in turn betrayed many of the cardigan-clad women of our acquaintance.

A few months ago, in the emergency room of one of the hospitals I work in on Vancouver Island, I saw a woman in her mid-70's who was still taking Premarin, a common estrogen preparation.
She had been having chest pain, and I was admitting her for observation, to make sure she wasn't having a heart attack.

''So, you take the Premarin because . . . ?'' I asked.

''My sisters all had heart attacks in their 50's,'' she said. ''My doctor said the estrogen lowered my risk.''

''We now think it probably doesn't.''

''Really.''

''Yes.'' Me, nodding, smiling weakly.

''What changed?''

''Well, there were these studies that seemed to show that women who took estrogen had a relatively low incidence of heart attacks, but it turns out that really, it was the sort of woman who took estrogen who was less likely to have a heart attack. She was probably also less likely to smoke, more likely to seek regular medical attention -- she did something important different, anyway. When, just recently, they took a large group of women and randomly gave each woman either a placebo or estrogen, the ones taking estrogen didn't do at all better.''

''Well,'' she said. ''Isn't that something?''

My patient was not alone. The data from HERS were so surprising that many health-care providers seem not to believe them, even today. In 2001, Premarin was the third most-prescribed drug in the United States.

The key point: it was the woman who took estrogen who was less likely to have a heart attack – a correlation, not causation!

In 2002, more than 6 million women were taking PremPro [similar to Premarin]. Statistically, that translates into the following (the translation should be taken with a grain of salt, since it comes from an attorney site, hrt-attorneys.com):

480,000 additional breast cancer cases.
420,000 more heart attacks.
480,000 more strokes.
480,000 more blood clot cases.

Experimentation Platform

1.2 Twin Studies

The following story comes from a Nov 2007 Washington Post article titled Study Debunks Theory On Teen Sex, Delinquency, and from the twin study article.
A deep study by Ohio State University early in the year found that youngsters who lose their virginity earlier than their peers are more likely to become juvenile delinquents. To reach this conclusion, the authors took into account (technically, controlled for) many variables that could affect the dependent variable (juvenile delinquency). This is commonly called a “quasi-experimental design.” If you believe that no causes other than those controlled for could affect juvenile delinquency, then you should believe the result (losing virginity earlier makes youngsters more likely to become juvenile delinquents). This assumption is called Causal Sufficiency. In the above study, the authors controlled for a range of variables, including:

1. Gender
2. Race
3. Receipt of public assistance
4. Parental education
5. Family structure
6. Previous substance use and depression
7. Importance of religion
8. School GPA
9. Relative pubertal status
10. Virginity pledge status

With such an impressive list of causes under control, and data from a massive database called the National Longitudinal Study of Adolescent Health, who could question the result? The paper was accepted for publication.

But then someone did question the result. Paige Harden, a PhD student at the University of Virginia, and her colleagues used the same database and found 534 same-sex twins. Their study now controlled for genetic and environmental variables, and the result was reversed: earlier age at first sex predicted lower levels of delinquency in early adulthood. Since a twin study practically trumps all other quasi-experimental designs, the new result was published in the same journal. The first paper found a noncausal correlation!

While I shouldn’t make recommendations about early sex, this is a good example of why controlled experiments are important. Causal Sufficiency is a strong assumption.
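To make the danger of an uncontrolled cause concrete, here is a minimal simulation sketch in Python. Everything in it is hypothetical: the `simulate` function, the unobserved `healthy` trait, and all of the probabilities are made-up illustration values, not data from either study above. It assumes a treatment that truly raises the risk of a bad outcome, and a hidden trait that both lowers that risk and makes people more likely to choose the treatment. The same simulated people are compared twice: once by the treatment they chose, and once by a random coin-flip assignment.

```python
import random


def simulate(n=100_000, seed=0):
    """Return (observational_diff, randomized_diff) in event rates.

    Each diff is rate(treated) - rate(untreated). All numbers are
    hypothetical illustration values, not data from any study.
    """
    rng = random.Random(seed)
    true_effect = 0.02  # assumption: treatment RAISES event risk by 2 points

    # [events, people] for the treated (True) and untreated (False) groups
    observational = {True: [0, 0], False: [0, 0]}
    randomized = {True: [0, 0], False: [0, 0]}

    for _ in range(n):
        # Hidden trait that the analyst never sees or controls for.
        healthy = rng.random() < 0.5
        base_risk = 0.05 if healthy else 0.15

        # Observational world: healthy people choose treatment more often.
        chose = rng.random() < (0.8 if healthy else 0.2)
        # Randomized world: a coin flip, independent of the hidden trait.
        assigned = rng.random() < 0.5

        for table, treated in ((observational, chose), (randomized, assigned)):
            risk = base_risk + (true_effect if treated else 0.0)
            table[treated][0] += rng.random() < risk  # bool counts as 0 or 1
            table[treated][1] += 1

    def rate_difference(table):
        treated_rate = table[True][0] / table[True][1]
        untreated_rate = table[False][0] / table[False][1]
        return treated_rate - untreated_rate

    return rate_difference(observational), rate_difference(randomized)


if __name__ == "__main__":
    obs_diff, rct_diff = simulate()
    print(f"observational difference: {obs_diff:+.3f}")
    print(f"randomized difference:    {rct_diff:+.3f}")
```

Under these assumed numbers, the self-selected comparison should come out near -0.04 (the treatment looks protective), while the randomized comparison should come out near the true +0.02 (the treatment is harmful). The sign flips only because the hidden trait is unevenly spread across the self-selected groups, and randomization is what evens it out.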
While you may think you're controlling for all effects of time-of-day, day-of-week, season, geography, etc., there may be other factors that you are not taking into account that may reverse the trend, as was shown in this study. That's why randomized experimental designs are the gold standard!

2. The Semmelweis Reflex: Rejecting Results that Contradict Strongly Held Beliefs

The following stories illustrate what happens when there is solid evidence that contradicts strongly held beliefs.

2.1 Ignaz Semmelweis’s Childbed Fever

The story below is from the book Leadership and Self-Deception, the Encyclopedia Britannica, Childbed Fever: A Scientific Biography of Ignaz Semmelweis, and Wikipedia. I sent it to the authors of Hard Facts as a better example of something they discussed in the book. One of the authors blogged about it (http://bobsutton.typepad.com/my_weblog/2008/05/thesemmelweis.html) and correctly pointed out that controlled experiments are not always possible (e.g., the Yahoo/Microsoft merger). On the web and in services, they *are* possible, so let’s use this opportunity to its full extent.

Semmelweis was a European doctor, an obstetrician, in the mid 1800s. He worked at Vienna’s General Hospital, an important research hospital. The mortality rate in the ward where he practiced was one in 10 – one in every ten women giving birth there died! The reputation of Vienna General was so bad that women preferred to give birth on the street and then go to the hospital. The book Childbed Fever estimates that 2,000 women died each year from childbed fever in Vienna alone, and that in nineteenth-century Europe, childbed fever killed more than a million women.

The collection of symptoms associated with these deaths was known as “childbed fever” or puerperal fever. More than half the women who contracted the disease died within days.
Patients begged to be moved to a second section of the maternity ward where the mortality rate was one in fifty – still horrific, but far better than one-in-ten in Semmelweis’s section.

Semmelweis became obsessed with the problem. He tried to control for all factors, including birthing positions, ventilation, diet, and even the way laundry was done. The one obvious difference between the sections was that Semmelweis’s section was attended by doctors, while the other section was attended by midwives.

After returning from a four-month leave to visit another hospital, he discovered that the death rate had fallen significantly in his section of the ward in his absence. This, coupled with the death of his friend Jakob Kolletschka from an infection, led to the breakthrough. Jakob contracted an infection after his finger was accidentally punctured with a knife while he was performing a postmortem examination, and his autopsy showed a pathological situation similar to that of the women who were dying from childbed fever. Semmelweis proposed a connection between cadaveric contamination and childbed fever.

Yes, cadavers. Semmelweis spent far more time doing research on cadavers than other doctors. Vienna General was a teaching and research hospital and many doctors split their time between research on cadavers and treatment of live patients. The doctors in his section performed autopsies each morning on women who had died the previous day, but the midwives were not required or allowed to perform such autopsies. They hadn’t seen any problem with that practice because there was as yet no understanding of germs.

Semmelweis concluded that ‘particles’ from cadavers and other diseased patients were being transmitted to healthy patients on the hands of the physicians. He experimented with various cleansing agents and instituted a policy requiring physicians to wash their hands thoroughly in a chlorine and lime solution before examining any patient.
The death rate fell to one in a hundred!

After the initial success, where rates dropped, a new group of students was admitted and the students neglected the washings. The mortality rate increased and Semmelweis instituted stricter controls: the names of students were publicly displayed and assigned to each woman in labor, making it obvious who neglected the washings. Once again, the mortality rate fell (Childbed Fever, p. 53).

What is surprising about this story isn’t the discovery through attempts to control for factors, which led to the unthinkable conclusion (at the time) that there was something invisible that was transferred by the doctors. What is really shocking is how long it took the community of doctors to accept the results.

According to Encyclopedia Britannica, the mortality rate in Semmelweis’s division fell from 18.27% to 1.27% in 1848. That was not enough to generate sufficient recognition, and in 1849 he was dropped from his post at the clinic and turned down for a teaching post. Semmelweis spent the next six years at a hospital in Pest, Hungary, where he reduced the mortality rate in the obstetrics department to 0.85%, while in Prague and Vienna the rate was still about 10% to 15%.

According to Childbed Fever: A Scientific Biography of Ignaz Semmelweis (p. 69), an 1856 publication in a prominent Viennese medical periodical, Viennese Medical Weekly, by Jozsef Fleischer, a student of Semmelweis, showed the success of chlorine washings. However, the editor of the periodical wrote at the end of the report: “We believe that this chlorine-washing theory has long outlived its usefulness. The experiences and statistical results of most maternity institutions protest against the views presented above. It is time we are no longer to be deceived by this theory.”

Vienna continued to ignore his recommendations. In 1861, he published a book, but the community rejected his doctrine.
In 1865 he suffered a nervous breakdown and was taken to a mental hospital, where he was beaten by asylum personnel and died. It took another 14 years for the discovery to be accepted, after Louis Pasteur, in 1879, showed the presence of Streptococcus in the blood of women with childbed fever. Semmelweis is now recognized as a pioneer of antiseptic policy. More is available at Wikipedia’s Contemporary reaction to Ignaz Semmelweis.

A 2005 article called Simpson, Semmelweis, and Transformational Change by Grant et al. in Obstetrics and Gynecology claims that despite 150 years of evidence, recent research shows that hand-hygiene practices by healthcare workers remain unacceptably low. Inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States.

The Semmelweis Reflex is the reflex-like rejection of new knowledge because it contradicts entrenched norms, beliefs or paradigms.

2.2 Bloodletting: The First Clinical Trial

The following story is from the NY Times, May 5, 2002, Childbed Fever, A Physician Looks at the Death of Washington, and from Wikipedia.

Since the days of the ancient peoples, including the Mesopotamians, the Egyptians, the Greeks, the Mayans, and the Aztecs, the prevailing conception of illness was that the sick were contaminated by some toxin or contagion. These conditions could be improved by opening a vein and letting the sickness run out – bloodletting. Once the toxins were gone, the patient immediately felt different, and often better. As anyone who has given blood can tell you, losing a pint or two can make you feel transported, transformed. Intuitively, it was satisfying to doctors that the procedure left the patient feeling drained – physically, emotionally and into the sink.

The practice was continued by surgeons and barber-surgeons.
Though bloodletting was often recommended by physicians, it was carried out by barbers. This division of labor led to the distinction between physicians and surgeons. The red-and-white-striped pole of the barbershop, still in use today, is derived from this practice: the red represents the blood being drawn, the white represents the tourniquet used, and the pole itself represents the stick squeezed in the patient's hand to dilate the veins.

Figure 1: "Breathing a Vein" in 1804

Bloodletting was used to treat almost every disease. One British medical text recommended bloodletting for acne, asthma, cancer, cholera, coma, convulsions, diabetes, epilepsy, gangrene, gout, herpes, indigestion, insanity, jaundice, leprosy, ophthalmia, plague, pneumonia, scurvy, smallpox, stroke, tetanus, tuberculosis, and for some one hundred other diseases (Childbed Fever, p. 6). It was judged most effective to bleed patients while they were sitting upright or standing erect, and blood was often removed until the patient fainted.

Figure 2: The Lancet, a medical instrument used to open veins

Physicians often reported the simultaneous use of fifty or more leeches on a given patient. Through the 1830s the French imported about forty million leeches a year for medical purposes, and in the next decade, England imported six million leeches a year from France alone (Childbed Fever, p. 7).

On December 12, 1799, President George Washington, 68 years of age, rode his horse in heavy snowfall to inspect his plantation at Mount Vernon. It was about 30 degrees Fahrenheit, and he complained about a sore throat, yet rode again the day after. On December 14, he was in respiratory distress. Mr Albin Rawlins, the estate overseer, prepared a medicinal mixture of molasses, vinegar, and butter, and when Washington almost suffocated trying to swallow the concoction, Rawlins decided on bloodletting and removed 12-14 ounces of blood. Dr.
James Craik was brought in, and extracted another 20 ounces of blood, followed by yet another 20 ounces of blood. When a vinegar and hot water solution did not help, he extracted another 40 ounces of blood. In the afternoon, another doctor arrived, Dr. Dick, and he drew another 32 ounces of blood for a total of about 82 to 124 ounces, or 2.5 to 3.7 liters in ten hours. The total blood in George Washington’s body was estimated at 7 liters, so about 35% to over 50% was extracted, which inevitably led to preterminal anemia, hypovolemia, and hypotension. The fact that General Washington stopped struggling and appeared physically calm shortly before his death may have been due to profound hypotension and shock (A Physician Looks at the Death of Washington).

It is understood now that bloodletting only hastened the death of the ill. We know that bloodletting is unhelpful because a Parisian doctor named Pierre Louis did an experiment in 1836 that is now recognized as one of the first clinical trials, or a randomized controlled experiment. He treated people with pneumonia either with early, aggressive bloodletting or less aggressive measures; at the end of the experiment, Dr. Louis counted the bodies. They were stacked higher over by the bloodletting sink.

Despite the result of the controlled experiment, it took years for bloodletting to be recognized as useful only in very limited situations (e.g., in cases involving agitation, it has a sedative effect). Broussais, a well-known French physician, continued to recommend leeches, fifty at a time. Since leeches were used repeatedly and in treatment of various diseases, it was possible for the leeches themselves to convey the disease.

Interestingly, Biopharm Leeches, originally established in 1812, is still alive and providing 50,000 leeches for modern surgery. Their tagline: The Biting Edge of Science.

2.3 Police Lineups, Technical Knockout of Morality vs. Science

While the previous stories showed how it took time to learn, the following story (also available at http://jenk.livejournal.com/160588.html) shows that it may take a long time for results to sink in.