Selected Problems in Epidemiology
Nina H. Fefferman, Ph.D.
Co-Director, Tufts Univ. InForMID

Data mining in public health is not new, but it is more complicated.

A small historical example: Cholera, John Snow, 1854
During the height of the miasma theory of disease:
1) There was a cholera outbreak in London
2) John Snow became 'irrationally' convinced that cholera came from contaminated drinking water

So Snow went to the London Registrar-General. He looked at where those who died from cholera got their water, and when.

"The experiment … was on the grandest scale. No fewer than 300,000 people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into two groups without their choice, and, in most cases, without their knowledge; one group being supplied with water containing the sewage of London, and, amongst it, whatever might have come from the cholera patients, the other group having water quite free from such impurity."
On the Mode of Communication of Cholera, Second Edition, 1854

Snow's findings:

                                 Number of houses   Deaths from cholera   Deaths per 10,000 houses
Southwark and Vauxhall Company   40,046             1,263                 315
Lambeth Company                  26,107             98                    37
Rest of London                   256,423            1,422                 59

Before 1852, your chances of getting cholera were not correlated with which of the two water companies supplied you. In the epidemic of 1853-54, your chances of getting cholera if your water came from Southwark and Vauxhall were more than eight times greater than if it came from Lambeth (315 vs. 37 deaths per 10,000 houses).

And then it got really impressive: cholera recurred in the Soho district of London, and about 600 people died from it in a 10-day period. Once again Snow took the operational death-certificate data from the Registrar-General. This time he plotted the data on a clustering diagram, using a stacked-histogram technique plotted over a map of Soho to do the data mining.

Lives saved due to real-time data mining: based upon this map, Snow was able to convince the London Board of Guardians to remove the handle from the public pump located on Broad Street, and the outbreak of cholera subsided with this operational change. It was later revealed that the Broad Street well was contaminated by an underground cesspool at 40 Broad Street, just three feet from the well. The Broad Street pump, still without its handle, remains today as a tribute to Snow.

Modern problems are happening on every scale imaginable:
• Genetic – We know what we're looking at and what we're looking for, just not how to find it
• Single defined population – We know who we're looking at and what we're looking for, but not how to find it
• Undefined population – We don't know who to look at, but we know what to look for
• Undefined everything – We want to save lives, but don't know what to do at all

Human chromosome sequence lengths (in base pairs):

Chromosome   Length            Chromosome   Length
1            245,203,898       13           114,151,656
2            243,315,028       14           105,311,216
3            199,411,731       15           100,114,055
4            191,610,523       16            89,995,999
5            180,967,295       17            81,691,216
6            170,740,541       18            77,753,510
7            158,431,299       19            63,790,860
8            145,908,738       20            63,644,868
9            134,505,819       21            46,976,537
10           135,480,874       22            49,476,972
11           134,978,784       X            152,634,166
12           133,464,434       Y             50,961,097

Normally only one-tenth of a single percent of DNA (about 3 million bases) differs from one person to the next. Luckily, junk DNA makes up at least 50% of the human genome. But we still know of about 1.4 million locations where single nucleotide polymorphisms (SNPs) occur in humans.

Genetic epidemiology: You have good reason to believe that a disease has a genetic component. You have the sequenced genomes of some afflicted people. The human genome is huge. So we need data mining.

A paper on something like this: Rodin et al. 2005. Mining genetic epidemiology data with Bayesian networks: application to APOE gene variation and plasma lipid levels. J Comput Biol 12(1):1-11.

This type of examination is called a "large-scale genotype-phenotype association study." Classical statistical methods (e.g., multivariable regression, contingency table analysis) are ill suited for high-dimensional problems because they are "single inference procedures"; we need "joint inference procedures." Methods for combining results across multiple single inference procedures are inefficient. In this type of case, data-mining methods are hypothesis-generating and classical statistical methods are hypothesis-testing.
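To make the "single inference" point concrete, here is a minimal sketch (entirely synthetic data, hypothetical study sizes) of the classical per-SNP approach: test each SNP against disease status with its own chi-square test, then Bonferroni-correct for the number of tests. This is not the method of the Rodin et al. paper; it only shows why the classical route scales badly.

```python
# Minimal sketch of the classical "single inference" approach to a
# genotype-phenotype association study: one chi-square test per SNP,
# followed by a Bonferroni correction. All data are synthetic.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_subjects, n_snps = 500, 10_000          # real studies: ~1.4 million SNPs
genotypes = rng.integers(0, 3, size=(n_subjects, n_snps))  # copies of minor allele
affected = rng.integers(0, 2, size=n_subjects).astype(bool)

p_values = np.empty(n_snps)
for j in range(n_snps):
    # 2x3 contingency table: disease status vs. genotype at SNP j
    table = np.zeros((2, 3))
    for g in range(3):
        col = genotypes[:, j] == g
        table[0, g] = np.sum(~affected & col)
        table[1, g] = np.sum(affected & col)
    p_values[j] = chi2_contingency(table)[1]

alpha = 0.05 / n_snps  # Bonferroni: one independent test per SNP
hits = np.flatnonzero(p_values < alpha)
print(f"Bonferroni threshold: {alpha:.1e}; SNPs flagged: {hits.size}")
```

At the real scale of roughly 1.4 million SNPs, the Bonferroni threshold falls to about 3.6e-8, and interactions between loci are invisible because each test sees one SNP at a time. That is the gap that joint-inference methods such as Bayesian networks aim to fill.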
A single defined population: we know who we're looking at and what we're looking for, but not how to find it.

In an adverse-reaction study for a new vaccine or drug:
• We know who to watch (those who receive the treatment)
• We know what we're looking for ("bad things that happen to them")
• How do we find "it"?

We also have to monitor people who don't get the treatment and see what happens to them, so we wind up with a huge set of "all bad things that happen to lots of people." This leads to a lot of problems.

A reference and a paper on something like this: http://www.fda.gov/cder/aers/default.htm and Niu et al. 2001. Vaccine 19(32):4627-34.

Example problems in data mining for adverse events: health care providers report adverse reactions by patients to any drug. Unfortunately, many patients need to take several drugs at once, so all of them will be reported with the same event. And there's reporting bias – results don't reflect the overall population (only the people who needed the drug in the first place, but that's probably the portion we're worried about anyway).

Explicit example: Sudden Infant Death Syndrome (SIDS) and the polio vaccine. You can easily find a statistical association between the two. Does this mean the polio vaccine is dangerous? Not necessarily – the polio vaccine is mainly given to infants, who are the only possible victims of SIDS. Receiving the polio vaccine makes you much more likely to be an infant, and being an infant significantly increases your chance of SIDS. We would need to check whether there is an association within infants.
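Here is a minimal sketch of that within-infants check, with counts invented purely to illustrate confounding by age:

```python
# Hypothetical illustration of confounding by age in the SIDS / polio
# vaccine example. All counts are invented; the point is the contrast
# between the crude analysis and the analysis restricted to infants.

def risk_ratio(exposed_cases, exposed_n, unexposed_cases, unexposed_n):
    """Risk of the outcome among the exposed divided by risk among the unexposed."""
    return (exposed_cases / exposed_n) / (unexposed_cases / unexposed_n)

# Crude comparison, whole population. The "unexposed" group is everyone
# unvaccinated: 10,000 unvaccinated infants plus 890,000 older people,
# and all 5 unexposed SIDS deaths occur among those infants.
crude = risk_ratio(50, 100_000, 5, 900_000)
print(f"Crude risk ratio (all ages):    {crude:.0f}")   # 90 -- looks alarming

# Stratified comparison: infants only, the only group at risk of SIDS.
within_infants = risk_ratio(50, 100_000, 5, 10_000)
print(f"Risk ratio within infants only: {within_infants:.1f}")  # 1.0 -- no association
```

Restricting the comparison to the stratum that can actually experience the outcome removes the spurious association; stratification like this is the standard first line of defense against a known confounder.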
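Stepping back to the full stream of spontaneous reports: one widely used screen in pharmacovigilance databases of this kind is disproportionality analysis, for example the proportional reporting ratio (PRR). This is a standard technique in the field, not necessarily the method of the paper cited above, and the counts below are invented.

```python
# Minimal sketch of a disproportionality screen over spontaneous
# adverse-event reports (the kind of data in AERS / VAERS).
# All counts are hypothetical.

def prr(a, b, c, d):
    """Proportional reporting ratio for one drug-event pair.

    a: reports mentioning this drug and this event
    b: reports mentioning this drug, any other event
    c: reports mentioning other drugs and this event
    d: reports mentioning other drugs, other events
    """
    return (a / (a + b)) / (c / (c + d))

a, b, c, d = 30, 970, 300, 98_700   # invented report counts
print(f"PRR = {prr(a, b, c, d):.1f}")  # ~9.9 for this pair
```

A common rule of thumb flags a pair when PRR is at least 2 (usually alongside a minimum report count), but the polypharmacy and reporting-bias problems above mean a flag is only a hypothesis to investigate, not a verdict.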
Undefined population: we don't know who to look at, but we know what to look for.

Example: figuring out the source of a food-borne outbreak. (Good news: we know some diseases are caused by food-borne pathogens.) We can hypothesize that a certain activity is somehow related to the source – say, the food at a party being contaminated. Unfortunately, there can be a lot of food at one large party, and you might not know whether the food at the party is actually the culprit. You need to ask whether people at the party got sick, and if they did, you need to know which particular food at the party was contaminated.

The normal process here is to call everyone at the party and conduct a survey (see handout). These surveys can generate a huge amount of data, and there's no guarantee that the party was the source of the outbreak.

Horror scenario from a data perspective: food poisoning at the Republican National Convention. We wouldn't know
• which day
• which location
• which caterer
• how many people were made ill

How do you figure out what, how, and who in real time? Part of the problem is getting the answer before more people become sick, so you want to narrow the focus of your investigation as you go – ask fewer people, ask fewer questions; all these surveys take time. The sketch below shows the arithmetic at the core of each survey round.
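That arithmetic is simple even when the data set is not: for every food, compare the attack rate among guests who ate it with the rate among guests who did not. A minimal sketch with invented survey responses:

```python
# Minimal sketch of the classic party-outbreak analysis: per-food attack
# rates among guests who ate each item vs. those who did not.
# Survey responses here are invented.

surveys = [
    # (foods eaten, became ill?)
    ({"chicken", "salad"}, True),
    ({"chicken", "cake"}, True),
    ({"salad"}, False),
    ({"chicken"}, True),
    ({"cake"}, False),
    ({"salad", "cake"}, True),
    ({"cake"}, False),
    ({"chicken", "salad", "cake"}, True),
]

foods = set().union(*(eaten for eaten, _ in surveys))
for food in sorted(foods):
    ate = [ill for eaten, ill in surveys if food in eaten]
    skipped = [ill for eaten, ill in surveys if food not in eaten]
    rate_ate = sum(ate) / len(ate)
    rate_skipped = sum(skipped) / len(skipped)
    print(f"{food:8s} attack rate: ate {rate_ate:.0%}, did not eat {rate_skipped:.0%}")
```

Foods whose eaters got sick at a much higher rate than non-eaters (chicken, in this toy data) stay in the investigation; the rest can be dropped from the next, shorter questionnaire – which is exactly the narrowing described above.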
Undefined everything: we want to save lives, but don't know what to do at all.

Cancer (you'll hear more about this later in the program from Dmitriy Fradkin):
• Huge numbers of people diagnosed
• Huge numbers of possible contributing risks – environmental exposure to carcinogens, genetic predisposition, cancer-causing viruses
• Huge numbers of confounding factors – differences in diagnosis, treatment, and outcome; co-morbidity

Or let's say we're worried about the beginning of an outbreak of H5N1 avian flu. It will probably start out looking like normal flu, and how quickly we can figure out where it is will determine how quickly we can try active intervention strategies. We don't know where it will start: International travel? Near airports? International bird migration patterns? Along the coasts? Depending on the time of year? Once it's here, we don't really know how it will spread. Maybe we want an early warning system for cities – is the disease present or absent?

These are the types of epidemiological problems we face. What kinds of practical constraints do we have to expect?

There are many data collectors: insurance companies, HMOs, public health agencies. That raises issues of data control:
• Who controls the data?
• Is each entity found only at a single site?
• Do different sites contain different types of data?
• How can we make sure the data isn't redundant and therefore skewing our information?
• How can we make sure we get all the pertinent data at the same time – or at least, how fast is fast enough to figure out what we need as quickly as possible?
For more information, see http://www.hipaa.org/

And privacy and ethics: individual privacy concerns limit the willingness of the data custodians to share data, even with government agencies such as the U.S. Centers for Disease Control. In many cases, data is shared only after it has been "de-identified" according to HIPAA regulations. This removes a lot of useful information and doesn't really do a whole lot to protect privacy, but that's another issue (see Fefferman et al. 2005. J Public Health Policy 26(4):430-449). We need a whole different slew of data-mining techniques to mine data "blind" – when we don't know what we're seeing, what the numbers represent, how much they've been aggregated to represent averages, or what we're looking for.

And other problems. Sometimes we don't know where the best source of data is: we can monitor some cities more closely, and we can monitor certain diseases (notifiable diseases), although this is constrained by having to verify cases by lab test. Sometimes our expectations of "normal" levels of disease set the wrong benchmark for when we should start being concerned. Different diseases have different normal incidence, which means that an increase of 10 cases per year of one disease is an outbreak, while it would take an increase of 1,000 in another to be 'unusual.'

[Figure: BOTULISM, FOODBORNE – number of reported cases, by year, United States, 1983-2003]
[Figure: ESCHERICHIA COLI, ENTEROHEMORRHAGIC O157:H7 – number of reported cases, United States and U.S. territories, 2003]
Sometimes we expect something intermediate:
[Figure: SALMONELLOSIS – incidence per 100,000 population, by year, United States, 1973-2003]
And sometimes we expect the numbers to be reasonably large:
[Figure: ACQUIRED IMMUNODEFICIENCY SYNDROME (AIDS) – number of reported cases, by year, United States and U.S. territories, 1983-2003. Total includes all cases reported to CDC as of December 31, 2003, cases among residents of U.S. territories, and 220 cases among persons with unknown state of residence.]
And sometimes our methods of surveillance themselves create issues.

Sometimes our problems are prospective, and sometimes they are retrospective. In outbreak detection and biosurveillance, we want to find "unusual disease incidence" early. In adverse-reaction trials, we want to know overall effects; we don't particularly care about the time scales on which they act. In "classic epidemiological investigations," we are looking for the source of exposure in order to prevent further infection.

Advances in technology have caused a shift in our data-mining needs. It used to be that the bottleneck to appropriate analysis was figuring out where to look for the data and collecting it – a pre-processing problem. Due to advances in reporting technology, we're very close to real-time reporting for mortality data, and we're getting there for incidence data (for at least some diseases). Now we have to figure out how to find meaningful results in the chaos and clutter.

Data-mining techniques can be tailored to handle all of these problems. We haven't covered all of the problems, but as you can see, we need better techniques and we need more people working on the use of these techniques.

Thanks for attending this workshop – we need you!