
Computational, Statistical and Graph-Theoretical
Methods for Disease Mapping and Cluster
Detection
by
Shannon Christine Wieland
B.S., B.A., Ohio State University, 1999
Submitted to the Harvard-MIT Division of Health Sciences and Technology
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in the field of
MATHEMATICS AND MEDICAL ENGINEERING
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2007
© Shannon Wieland, 2007. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly paper and
electronic copies of this thesis document in whole or in part in any medium now known or
hereafter created.
Author
Department of Mathematics
Harvard-MIT Division of Health Sciences and Technology
August 9, 2007
Certified by
Kenneth Mandl, M.D.
Thesis Supervisor, Assistant Professor of Pediatrics
Certified by
Bonnie Berger, Ph.D.
Thesis Supervisor, Professor of Applied Mathematics
Accepted by
Alar Toomre, Ph.D.
Chairperson, Applied Mathematics Committee
Accepted by
David Jerison, Ph.D.
Chairperson, Department Committee on Graduate Students
Accepted by
Martha L. Gray, Ph.D.
Edward Hood Taplin Professor of Medical and Electrical Engineering
Director, Harvard-MIT Division of Health Sciences and Technology
Computational, Statistical and Graph-Theoretical Methods
for Disease Mapping and Cluster Detection
by
Shannon Christine Wieland
Submitted to the Department of Applied Mathematics and the Harvard-MIT
Division of Health Sciences and Technology
on May 18, 2007, in partial fulfillment of the
requirements for the degree of
Doctorate in Applied Mathematics and Medical Engineering
Abstract
Epidemiology, the study of disease risk factors in populations, emerged between the
16th and 19th centuries in response to terrifying epidemics of infectious diseases such
as yellow fever, cholera and bubonic plague. Traditional epidemiological studies have
led to modifications in hygiene, diet, and many other practices that have profoundly
altered the dynamic between humans and diseases.
In this thesis, we develop mathematical techniques to address modern challenges,
including emerging diseases such as SARS and West Nile virus, the threat of bioterrorism, and stringent legislation protecting patient privacy. Within spatial epidemiology,
one problem is to map the risk of disease across space (i.e., disease mapping), and
another is to analyze the data for clustering. We propose a general technique, cartograms created from exact patient location data, that can address both of these
problems. We also develop a graph-theoretical method to detect spatial clusters of
any shape based on Euclidean minimum spanning trees. For mapping applications,
we present an optimal strategy for mapping patient locations that preserves both
privacy and spatial patterns within the data. For real-time disease surveillance, in
which the goal is early detection of outbreaks based on time-series data, we introduce a generalized additive model that maintains constant specificity on various time
scales.
Thesis Supervisor: Kenneth D. Mandl
Title: Assistant Professor
Thesis Supervisor: Bonnie A. Berger
Title: Professor
Acknowledgments
Foremost, I would like to thank my parents Sharon and David Merritt and Frank
and Linda McDonald for looking after every aspect of my development throughout
my life, and in particular for planning, encouraging, and sacrificing for my education.
My parents-in-law Dennis and Ronnye Wieland have also been wonderfully supportive
of my studies and career goals. I am indebted to my husband Aaron Wieland for his
constant encouragement and for making my graduate school years happy ones, and to
my daughters Bailey and Gwyneth for giving me firm deadlines, which undoubtedly
sped along the process.
I am also extremely grateful for the mentorship of my Ph.D. advisors, Bonnie
Berger and Kenneth Mandl. In addition to introducing me to graph theory and
algorithms, Professor Berger has provided guidance during the past five years in many
areas, from my coursework to planning my professional life. I am also grateful for her
rare example of combining motherhood with a successful academic career. Professor
Mandl introduced me to the fields of spatial epidemiology and health surveillance,
and has taught me a great deal about envisioning, choosing, and collaborating on
projects. I am thankful for his singular regard for my best interests, and also for his
endless supply of witty and hilarious comments.
In addition to Professors Berger and Mandl, who helped develop all the ideas
presented in this thesis, John Brownstein has been a helpful mentor and worked closely
with me on three of the chapters of my thesis. I also collaborated with Chris Cassa
on privacy protection and with Lucy Hadden, Karen Olson, and Athos Bousvaros
on cartograms. I would also like to thank Daniel Kleitman for serving on my thesis
committee and for his helpful comments.
I am also thankful to many other people who have enriched my intellectual life.
These include my undergraduate advisor at Ohio State University, Sherwin Singer,
and Edward Marcotte at the University of Texas at Austin. I have had many useful
and fun conversations with several colleagues at MIT and Harvard, including Brad
Friedman, Lenore Cowen, Michael Baym, Gopal Ramachandran, Clark Freifeld, Gil
Alterovitz, and Ronald Rivest, and many of their suggestions were helpful in my
thesis.
I am also extremely grateful to my daughters' child care providers at the MIT
Technology Children's Center, a talented and caring group of teachers who have been
essential to our family: Michelle Zapatka, Alki Ikonomou, Ariel Brower, Susan Robinson, Julia Tompkins, Kettelyne Destin, Maria Bonilla, Francesca Foster and Tyhise
Garay. I would also like to acknowledge the helpful and kind administrative staff
at MIT and HST, especially Linda Okun, Michele Gallarelli, Andrew Kiss, Patrice
Macaluso, Domingo Altarejos, Cathy Modica and Kathleen Dickey.
I am also thankful for grant support from the National Library of Medicine, the
Medical Scientist Training Program, the MIT Health Sciences and Technology Bioinformatics and Integrative Genomics Program, the MIT Department of Mathematics,
and the MIT Childcare Scholarship Fund from the MIT Center for Work, Life, and
Family.
Contents

1  Introduction

2  Density-equalizing Euclidean minimum spanning trees for the detection of all disease cluster shapes
   2.1  Introduction  25
   2.2  EMST Cluster Detection  27
        2.2.1  Cartogram Construction  27
        2.2.2  Potential Clusters  28
        2.2.3  Statistical Significance  33
   2.3  Results  34
        2.3.1  West Nile Virus, New York City, 1999  34
        2.3.2  Inhalational Anthrax, Sverdlovsk, Russia, 1979  35
        2.3.3  Circular Clusters, Boston, Massachusetts  37
        2.3.4  Rectangular Clusters, Boston, Massachusetts  39
        2.3.5  Arbitrary Shapes  40
   2.4  Discussion  41

3  Cartograms for Mapping and Analyzing Event Disease Data
   3.1  Introduction
   3.2  Event Cartograms
        3.2.1  Data
        3.2.2  Event cartogram construction
        3.2.3  Mapping the disease risk
   3.3  Examples
        3.3.1  Simulated Distributions
        3.3.2  Pediatric Inflammatory Bowel Disease, Massachusetts, 1995-2006
   3.4  Discussion

4  Optimal anonymization of patient spatial data
   4.1  Introduction
   4.2  Methods
   4.3  Example
        4.3.1  New York county census blocks
        4.3.2  Sensitivity analysis
   4.4  Discussion

5  Automated real time constant-specificity surveillance for disease outbreaks  83
   5.1  Introduction  83
   5.2  Methods  85
        5.2.1  Data
        5.2.2  Time series algorithms
        5.2.3  Model predictions based on historical data
        5.2.4  Detecting variability in the specificity
        5.2.5  Simulated outbreaks
        5.2.6  Estimating sensitivity, specificity, and timeliness of detection
        5.2.7  Comparing outbreak detection among models
   5.3  Results
        5.3.1  Evaluation of specificity trends over time  94
        5.3.2  Comparison of sensitivity and timeliness of new and traditional methods
        5.3.3  Temporal sensitivity trends  97
   5.4  Discussion  98
   5.5  Conclusions  105
List of Figures

1-1  Die Seuche by A. Paul Weber, depicting the bubonic plague entering a city. Image courtesy of the National Library of Medicine.  19

1-2  The English physician John Snow created a dot map showing that cholera victims lived close to one public water pump, which was the source of the outbreak. Images courtesy of the National Library of Medicine.  20

2-1  Construction of the Voronoi diagram cartogram. a) One hundred cases (green) and 50 controls (red) are distributed on a map. b) The case locations are superimposed on the Voronoi diagram constructed from the controls. c) A density-equalizing cartogram of the Voronoi diagram distorts the original map so that all Voronoi regions have the same area. New case locations are assigned on the cartogram by randomly plotting each case within its corresponding Voronoi region.  28

2-2  Procedure to locate potential clusters illustrated on a set of 15 cases. The EMST is first constructed (top left). This is a tree connecting each case (circle) that minimizes the total summed edge distance. At each step, the longest remaining edge is deleted, forming two new connected components (red). Components that were unchanged from the previous step are shown in blue. The connected components are in one-to-one correspondence with the set of potential clusters.  30

2-3  Detection of 1999 New York West Nile virus cases by SaTScan and the EMST method. a) A typical data set consisting of the 56 West Nile virus cases (red and orange) and 400 background cases (blue and gray) are shown on a map of Connecticut, New Jersey and New York. Only part of the map is shown for clarity. The West Nile virus case locations have been randomly skewed for privacy [1]. The most likely cluster identified by SaTScan is shown (red and blue). The green shading represents the density of controls in each county. b) The Voronoi diagram cartogram of part of the study area is shown along with the transformed case locations. Although the Voronoi diagram cartogram regions are not shown, the distortion of county boundaries induced by the cartogram transformation is apparent. The minimum spanning tree (black edges) connects the most likely cluster identified by the EMST method (red and blue). The control density varies by less than 2.0% over the entire map.  36

2-4  SaTScan and EMST detection of the 1979 Sverdlovsk anthrax outbreak. a) A representative data set of 63 anthrax cases (red and orange) and 400 uniformly distributed background cases (blue and gray) is shown, along with the most likely cluster determined by SaTScan (red and blue). b) The EMST method most likely cluster (red and blue) is shown for the same data set, connected by the minimum spanning tree of the cartogram-transformed cases (black edges).  38

2-5  Equally detectable potential clusters of various shapes. A most likely cluster of 35 points selected from among the Boston circular cluster data sets, along with its minimum spanning tree, is shown in the upper left. Seven other configurations of 35 points, having minimum spanning trees with exactly the same weight, are also shown. Subject to the constraint imposed by the definition of a potential cluster above, all eight clusters have equivalent detectability by the EMST method. If embedded as potential clusters in a Boston data set of 500 total cases, all would achieve the same p-value of 0.0001.  41

3-1  Applications of cartograms to spatial epidemiology.  47

3-2  Example of a Voronoi tesselation. Left: One thousand points are distributed on a map. Right: The Voronoi tesselation of the points divides the map into 1000 regions. Each region consists of the portion of the map closest to one point. The density structure is preserved in the tesselated map; regions of small Voronoi cell area correspond to high point density.  50

3-3  Dot maps and cartograms of three hypothetical disease distributions. Dot maps of 5,000 controls (blue) and 2,500 cases (red) are shown in the left column (a, c and e). The controls are distributed in proportion to the underlying population. The cases are distributed to illustrate constant relative risk (a), risk increasing linearly by a factor of four from north to south (c), and a localized cluster with a three-fold increase in relative risk in Iowa and neighboring states (e). The right column (b, d and f) shows the cartogram-transformed case locations for the three distributions.  61

3-4  Isopleth surfaces estimating the relative risk of three hypothetical disease distributions on standard maps (a, c and e) and cartograms (b, d and f). The exact locations of the cases and controls are shown in figure 3-3. The case distributions illustrate constant risk (a and b), a four-fold increase in risk from south to north (c and d) and a cluster of three-fold risk increase centered in Iowa (e and f). The patterns are obscured on the standard maps because of the presence of high relative risk artifacts, but are clear on the cartograms.  62

3-5  Pediatric inflammatory bowel disease risk in Massachusetts, 1995-2006. a) A standard map of the study area. b) A cartogram was constructed from the Voronoi diagram of the 7988 control locations. The 901 IBD cases were randomly placed within the cartogram regions corresponding to their original locations on the Voronoi diagram. An isopleth relative risk surface was calculated from the transformed case locations using kernel methods. Original case and control locations are not shown to protect patients' privacy.  63

4-1  Schematic of transition probabilities. A patient found at each location may transition to any other location. In this simple example, there are three locations (represented by houses) and nine transition probabilities (represented by arrows). The probabilities are variables solved by linear programming.  68

4-2  Total population of each census block group in New York County, NY, according to the 2000 census.  74

4-3  Transition probabilities for the optimal strategy to de-identify s < 20,000 patients from New York County, New York with a fixed maximum re-identification probability. Transition probabilities from three of the 988 census blocks are shown, illustrating a few of the many possible transition distributions. The shading in region j represents the value of the probability P_ij of transitions into the region. a) Patients in one census block (purple asterisk) may remain there, or they may transition to one of several nearby blocks. b) All patients originally in one census block (purple asterisk) are assigned to one neighboring block. c) Patients are re-assigned from one block (purple asterisk) to one of four nearby census blocks. No patients are re-assigned to the original census block (i.e. P_ii = 0).  75

4-4  Histogram of the distance between original and de-identified locations for an individual randomly chosen from the population, under the optimal strategy to de-identify a set of s < 20,000 patients in New York County, New York.  76

4-5  Relationship between the re-identification probability, the number s of patients, and the expected transition distance for the optimal LP strategy to de-identify patients by census block group in New York county, New York. As the level of privacy protection decreases, patients are moved a smaller distance in expectation. Aggregation by zip code (green diamond) and first three zip code digits (magenta circle) are suboptimal strategies.  77

4-6  Aggregation of patients in New York County, New York by zip code and by first three zip code digits. Top) Census block groups have been aggregated by zip codes. Each census block group was assigned to the zip code containing its centroid. The expected distance moved by a randomly selected member of the population is 519 m. Bottom) Census block groups are aggregated by the first three zip code digits. The expected distance moved is 3.866 km.  81

5-1  Emergency department visits for respiratory presenting complaints, August 1, 1992 - July 30, 2004. Daily time series showing the number of patients presenting with respiratory complaints to the emergency department during a 12 year period.  86

5-2  Evaluating variability in specificity on three time scales. Plots of p-values for the chi-square test over various time scales for the five comparison models over a range of mean specificity values from 0.50 to 0.99, as well as p-values for the expectation-variance model. Top: calendar year of study. Middle: month of year. Bottom: day of week. The expectation-variance model has a p-value over 0.05 for the entire range of mean specificity values for all three time scales, so the null hypothesis of constant specificity is not rejected. All plots not shown are highly significant (p < 0.001) for non-constancy.  95

5-3  Average specificity trends over time. Average specificity for each calendar year, month, and day of week for the five comparison methods during the study period. Data shown were recorded for each model implemented at 85% mean specificity. Similar trends were observed for all methods at 97% mean specificity (data not shown).  96

5-4  Seasonal sensitivity trends. Average sensitivity for each month of the study period for the autoregressive (left), trimmed seasonal (center), and expectation-variance (right) models when applied to data containing a superimposed spike outbreak of 10 additional patients during one day. Data shown were collected at a mean specificity of 97%. The sensitivity of the trimmed seasonal and autoregression models is higher during the winter than during the summer. Sensitivity is higher during the summer than during the winter for the expectation-variance model. July receiver-operator (ROC) curves lie below February ROC curves for all three models (insets). Similar trends were observed for flat and linear outbreaks.  99

5-5  Seasonal trends in the mean and variance of ED visits. Mean number of ED visits (left axis, solid blue line) and mean variance in ED visits (right axis, dashed green line) as a function of the day of year. Data were smoothed using 5-day and 11-day moving averages, respectively. The ED utilization mean and variance are highest in the winter and lowest during the summer.  102
List of Tables

2.1  SaTScan and EMST method applied to West Nile virus. n, number of background cases added to cluster cases; SN, average sensitivity; FTC, average fraction of true cluster detected; FMLC, average fraction of most likely cluster coinciding with the true cluster (averaged over data sets for which a significant cluster was found); Δ, percent difference.  35

2.2  SaTScan and EMST method applied to anthrax. n, number of background cases added to cluster cases; SN, average sensitivity; FTC, average fraction of true cluster detected; FMLC, average fraction of most likely cluster coinciding with the true cluster (averaged over data sets for which a significant cluster was found); Δ, percent difference.  37

2.3  SaTScan and EMST method applied to circular clusters. r, radius of cluster in kilometers; d, relative cluster density; m, mean cluster size; SN, average sensitivity; FTC, average fraction of true cluster detected; FMLC, average fraction of most likely cluster coinciding with the true cluster (averaged over data sets for which a significant cluster was found); Δ, percent difference.  39

2.4  SaTScan and EMST method applied to rectangular clusters. r, ratio of cluster height to width; d, relative cluster density; SN, average sensitivity; FTC, average fraction of true cluster detected; FMLC, average fraction of most likely cluster coinciding with the true cluster (averaged over data sets for which a significant cluster was found); Δ, percent difference.  40

5.1  ROC curve areas for traditional and expectation-variance detection models applied to three different types of outbreaks superimposed on respiratory visits to an urban pediatric ED, August 1998 - July 2004.  97

5.2  Mean lag in detecting outbreaks of five additional patients per day superimposed on the pediatric ED respiratory visits, August 1998 - July 2004. Detection lag calculations exclude undetected outbreaks. Hence the sensitivity of the method must be considered when interpreting the detection lag.  97
Chapter 1
Introduction
Terrifying epidemics have swept through populations throughout human history.
Most famous among these is the bubonic plague, which spread in every direction
from the Gobi desert in China in the 1320's, devastating parts of Asia and Africa.
The plague reached Cyprus in 1347 and killed about one third of the European population in only two years [2].

Figure 1-1: Die Seuche by A. Paul Weber, depicting the bubonic plague entering a city. Image courtesy of the National Library of Medicine.

In recent history, 500 million people contracted "Spanish influenza" in 1918 and 1919 [3]. The epidemic began among soldiers in March of 1918
at Camp Funston in Kansas. In late August, three nearly simultaneous outbreaks in
Boston, Massachusetts, Freetown, Sierra Leone, and Brest, France signaled the start
of a global pandemic which claimed at least thirty million lives [4]. The fear inspired
by uncontrollable and fatal epidemics, such as the plague, yellow fever, cholera, and
influenza, was a driving force for early advances in the field of spatial epidemiology.
The first disease dot maps were published by a young surgeon named Valentine
Seaman in 1798, showing the locations of yellow fever victims in New York City. Seaman created the maps to support his theory that yellow fever was caused by "putrid
effluvia," which was ultimately disproved [5]. An English surgeon named John Snow
is often credited with founding the field of spatial epidemiology for his 1854 study
of cholera in London (see figure 1-2). At the time, frequent outbreaks of cholera,
resulting in severe diarrhea and death due to dehydration, were generally thought to
be caused by "miasma" in the air. Snow, who happened to live close to the epicenter
of a large outbreak occurring in late August, 1854, correctly theorized that cholera
was spread through contaminated water. He plotted the cases, revealing that they
clustered around one pump on Broad street.

Figure 1-2: The English physician John Snow created a dot map showing that cholera victims lived close to one public water pump, which was the source of the outbreak. (a) John Snow, 1847. (b) Map of cholera cases in London, 1854. Images courtesy of the National Library of Medicine.

Seven days into the outbreak, he presented his findings to the local Board of Guardians. The pump handle was removed,
ending the epidemic. The map was used to support Snow's theory in a subsequent
publication "On the mode of communication of cholera" [6]. Snow's success showed
the potential power of spatial methods in epidemiology: his finding not only saved
lives, but also gave new insight into the transmission of a poorly understood disease.
From the earliest disease dot maps, methods in spatial epidemiology have evolved
to include a range of statistical and graphical techniques encompassing several distinct
areas of study. Disease mapping explores spatial variations in disease risk, taking into
account variations in the underlying at-risk population. Disease clustering studies
investigate whether or not cases tend to cluster together more than expected, or
seek to find localized subsets of patients comprising clusters.
Ecological analysis
investigates the relationship between the distribution of cases and environmental risk
factors [7].
Despite its long and productive history, there are several challenges still facing
the field of spatial epidemiology. These include the need to rapidly detect emerging
diseases, such as Severe Acute Respiratory Syndrome and West Nile Virus, and bioterrorism events, such as the dissemination of anthrax through the United States postal
service in 2001. There is also an increased public awareness of issues surrounding patient privacy, and more stringent legislation protecting privacy of patient-identifiable
information, including geographic identifiers. Furthermore, there are recent advances
in geographical information systems and cartography methods that can be leveraged
for spatial epidemiology.
In this thesis, we respond to these new challenges and advances with several related projects. In chapter 2, we create a new graph-theoretical method to detect
spatial clusters of any shape. Existing disease cluster detection methods either cannot detect clusters of all shapes and sizes, or they identify highly irregular sets that overestimate the true extent of the cluster.
We introduce a graph-theoretical method for de-
tecting arbitrarily-shaped clusters based on the Euclidean minimum spanning tree
of cartogram-transformed case locations, which overcomes these shortcomings. The
method is illustrated using several clusters, including historical data sets from West
Nile virus and inhalational anthrax outbreaks. Sensitivity and accuracy comparisons
with the prevailing cluster detection method show that the method performs similarly on approximately circular historical clusters, and it greatly improves detection
for non-circular clusters.
The use of cartograms based on exact location data, developed for this method, is
explored in other contexts in chapter 3. Density-equalizing cartograms of disease case
locations are used to adjust for variation in the underlying at-risk population for the
purposes of visual representation and statistical analysis of disease risk. The use of
cartograms has been limited to analyzing count data in a small number of settings. We
show how to create and interpret cartograms from exact location data collected using
various types of traditional epidemiological studies. For mapping applications, there
is a simple relationship between cartogram case density and disease risk; for analysis,
the cartogram simplifies the null distribution of constant disease risk, enabling the
use of a variety of well-advanced statistical methods.
In chapter 4 we develop an optimal strategy for balancing the need for patient
privacy with the need to share information about the spatial distributions of diseases
for research and health surveillance. Ethical and legal mandates protect the privacy of
patient data collected for medical care and research. Accidental disclosures sometimes
occur, either because the guardians of the data do not anticipate a method of linking a
released data set to individuals, or because of methodological flaws in the procedures
used to ensure privacy. The prevailing solution, releasing data aggregated by large
areas, usually preserves privacy but suffers from substantial information loss. We
develop an alternative de-identification strategy to move individual locations based
on linear programming. The method guarantees that privacy is protected, and it moves each patient the minimum possible distance, in expectation, for a given level of privacy protection. Thus the de-identified set is ideal for subsequent
cluster detection or disease mapping studies. We illustrate how to de-identify patients
in New York county, New York, showing that privacy is guaranteed while moving
patients very short distances.
In chapter 5, we develop a temporal method to detect aberrant health events
for surveillance in real time. Detection of abnormal disease patterns is based on a
difference between patterns observed, and those predicted by models of historical
data. The usefulness of outbreak detection strategies depends on their specificity;
the false alarm rate affects the interpretation of alarms. We evaluate the specificity
of four traditional models: autoregressive, Serfling, trimmed seasonal, and wavelet-based. We apply each to 12 years of emergency department visits for respiratory
infection syndromes at a pediatric hospital, finding that the specificity of the four
models was almost always a non-constant function of the day of the week, month,
and year of the study (p < 0.05). We develop an outbreak detection method, called
the expectation-variance model, based on generalized additive modeling to achieve
a constant specificity by accounting for not only the expected number of visits, but
also the variance of the number of visits. The expectation-variance model achieves
constant specificity on all three time scales, as well as earlier detection and improved
sensitivity compared to traditional methods in most circumstances.
Modeling the
variance of visit patterns enables real-time detection with known, constant specificity
at all times. With constant specificity, public health practitioners can better interpret
the alarms and better evaluate the cost-effectiveness of surveillance systems.
Chapter 2
Density-equalizing Euclidean
minimum spanning trees for the
detection of all disease cluster
shapes
2.1
Introduction
Tests for the detection of disease clusters [8] are essential tools for identifying emergent infections and elucidating demographic and environmental factors influencing
diseases. The shapes of these clusters are unpredictable [9, 10, 11, 12, 13]. However,
the prevailing cluster detection method, a scan statistic that applies a likelihood ratio
test to a large number of overlapping circles in a study region, reports only circular
clusters [14, 15]. Straightforward extensions of the circular scan statistic, such as an
elliptical scan [16] and a rectangular scan [17], are also limited to detecting specific
outbreak shapes.
Originally published as: Wieland SC, Brownstein JS, Berger B, Mandl KD. Density-equalizing
Euclidean minimum spanning trees for the detection of all disease cluster shapes. Proceedings of
the National Academy of Sciences. May 22, 2007.
Few methods aim to detect clusters of arbitrary shape. One class of methods
based on graph theory has recently emerged to address this problem [18, 19, 20, 21].
However, these have several limitations: they are restricted to clusters that fit inside
a circular region of fixed size [18], they attempt to examine a set of potential clusters
too large to exhaustively search [19], they have poor specificity [20], or have yet to
be implemented or evaluated [21].
In addition to the difficulties inherent in any disease cluster detection method,
such as accounting for the underlying population density and controlling the level of
significance given multiple potential clusters of various sizes and in various locations,
arbitrary shape cluster detection presents particular challenges. As more shapes are
considered, the statistical power declines, and the computational running time may
become unreasonable for typical problem sizes [18]. Furthermore, if the exact case
locations are available, then considering every conceivable shape is problematic; it is
always possible to draw a bizarrely shaped region of infinitesimally small total area
that includes every case. This problem surfaces when data are aggregated into small
regions. Indeed, one study identified excessively large clusters with highly irregular
shapes having greater likelihood ratios than the inserted clusters which were the
detection targets [20].
In this study, we address these challenges by removing the notion of shape from
consideration, and replacing it with a mathematical formalization of potential clusters
based on intercase distances. We introduce a method to locate clusters of any shape
based on Euclidean minimum spanning trees (EMST's), which have previously found
application in heuristic methods to divide other kinds of data into a pre-determined
number of subsets [22, 23]. Application of the method to synthetic, West Nile virus,
and anthrax data sets show that sensitivity and accuracy are substantially improved
compared to the circular scan statistic method applied to non-circular clusters, which
likely include the majority of real disease clusters.
2.2
EMST Cluster Detection
Our cluster detection method consists of three sequential tasks. A density-equalizing
cartogram of the study region and disease cases is first constructed from a Voronoi
diagram of the controls. Second, the family of potential clusters to evaluate is defined,
since it is not computationally feasible to consider all 2^n subsets of n cases. Third,
the statistical significance of each potential cluster is evaluated. We address each of
these tasks below.
2.2.1
Cartogram Construction
We begin with the precise spatial coordinates of a set of disease cases and controls,
and a map of the study area.
We first create a Voronoi diagram of the control
locations, which subdivides the study area into the regions closest to each control
location [24] (see figure 2-1). The density of controls within each Voronoi region is
simply the number of controls in the region, which may be more than one if multiple
controls can occur at the same location, divided by the region's area. We use this
density function to create a density-equalizing cartogram of the Voronoi diagram.
Cartograms have previously been used for aggregate data to test for clustering of
several diseases [25, 26, 27, 28, 29]. To construct one, each point on the original map
is essentially magnified or demagnified according to its local density. The result is a
distorted map on which the density of controls is constant everywhere. Each case is
placed on the cartogram at a random location within the region corresponding to its
original Voronoi region, and all subsequent analyses are performed using these new
case locations. Under the null hypothesis of constant relative risk, the new locations
of the cases on the Voronoi diagram cartogram are uniformly and independently
distributed. We use a diffusion-based cartogram construction algorithm [29], although
other contiguous cartogram algorithms may also be suitable.
Figure 2-1: Construction of the Voronoi diagram cartogram. a) One hundred cases
(green) and 50 controls (red) are distributed on a map. b) The case locations are
superimposed on the Voronoi diagram constructed from the controls. c) A densityequalizing cartogram of the Voronoi diagram distorts the original map so that all
Voronoi regions have the same area. New case locations are assigned on the cartogram
by randomly plotting each case within its corresponding Voronoi region.
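As a concrete illustration of the region-assignment step, here is a minimal Python sketch (our own, not the thesis software; the function name is hypothetical and SciPy is assumed). It exploits the fact that a point's Voronoi region is, by definition, that of its nearest control, so a nearest-neighbor query suffices; the diffusion cartogram construction of [29] is not reproduced here.

    import numpy as np
    from scipy.spatial import cKDTree

    def voronoi_region_of_cases(case_xy, control_xy):
        # A case lies in the Voronoi region of its nearest control,
        # so a nearest-neighbor query identifies the region index.
        # The density-equalizing cartogram step itself is not shown.
        tree = cKDTree(np.asarray(control_xy))
        _, region = tree.query(np.asarray(case_xy))
        return region

On the cartogram, each case would then be re-plotted uniformly at random within the transformed image of its region.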
2.2.2
Potential Clusters
We call a potential cluster a subset of points S satisfying the property that every
subset of S is "closer" to at least one other point in S than to any other point outside
of S. To formalize this definition, we begin by defining the distance p(X, Y) between
two sets X and Y to be the smallest distance separating the sets:
    p(X, Y) = min_{a ∈ X, b ∈ Y} p(a, b)   if X ≠ ∅ and Y ≠ ∅;
    p(X, Y) = ∞                            otherwise,                (2.1)
where p(x, y) is the Euclidean distance between two points. We also define the internal
distance of a nonempty set S to be the maximum distance between any two nonempty
subsets of S whose union is S:
    p(S) = max_{X ∪ Y = S, X ≠ ∅, Y ≠ ∅} p(X, Y)                     (2.2)
We formally define a potential cluster as follows:
Definition. Let V be a nonempty set of cases of a disease. A potential cluster is a nonempty set S ⊆ V satisfying p(S) < p(S, V - S).

Note that the entire set V is a potential cluster, as are the sets {v} for every v ∈ V. If v is the nearest neighbor of w and w is the nearest neighbor of v, then {v, w} is a potential cluster.
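The distance definitions translate directly into code. The following brute-force Python sketch (our own illustration, exponential in |S| and intended only for very small sets) transcribes Eqs. 2.1 and 2.2 and the definition; points are coordinate tuples.

    import math

    def set_distance(X, Y):
        # p(X, Y) of Eq. 2.1: smallest pairwise distance between the
        # sets, infinite if either set is empty.
        if not X or not Y:
            return math.inf
        return min(math.dist(a, b) for a in X for b in Y)

    def internal_distance(S):
        # p(S) of Eq. 2.2. Overlapping splits give distance 0, so it
        # suffices to maximize over two-part partitions; the last point
        # is pinned to one side to avoid counting each split twice.
        pts = list(S)
        best = 0.0
        for mask in range(1, 2 ** (len(pts) - 1)):
            X = [p for i, p in enumerate(pts) if (mask >> i) & 1]
            Y = [p for i, p in enumerate(pts) if not (mask >> i) & 1]
            best = max(best, set_distance(X, Y))
        return best

    def is_potential_cluster(S, V):
        # The definition: p(S) < p(S, V - S).
        return internal_distance(S) < set_distance(S, V - S)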
We wish to consider every potential cluster in V, but it is not straightforward from
the definition how to locate potential clusters, nor how many of them are present.
Progress was made toward finding potential clusters in a different application in
bioinformatics [23] using the minimum spanning tree of V, a connected graph T
spanning a set of points having minimal total weight
    w(T) = Σ_{e ∈ E(T)} w(e)                                         (2.3)
where E(T) denotes the set of edges of T, and the weight w(e) of an edge e is in
this case the Euclidean distance between the endpoints of e. (For a detailed review
of graph theoretical definitions, see [30].) Given a set V of n points, every potential
cluster is a connected subgraph of the EMST T of V [23]. However, even for small
epidemiological data sets, the number of connected subgraphs may be extremely
large; EMST's of 50 and 75 random points have approximately 10^6 and 10^8 connected
subgraphs, respectively.
We prove that it is not necessary to consider all connected subgraphs of T to
find the potential clusters. Remarkably, there are at most 2n - 1 potential clusters,
of which n are trivial sets consisting of only one vertex. Furthermore, the potential
clusters may be quickly found from an EMST using a greedy edge deletion procedure.
After constructing an EMST of the set of cartogram case locations V, we iteratively
delete the longest remaining edge of T. At each iteration we consider the two newly
emergent connected components, each of which is a potential cluster. In this way, we
evaluate all n - 1 nontrivial potential clusters for statistical significance using a test
described below (see Figure 2-2).
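Run in reverse, greedy edge deletion becomes a Kruskal-style sequence of merges, which is convenient to implement with a union-find structure: adding the EMST edges in increasing weight order creates exactly the n - 1 nontrivial potential clusters. A minimal Python sketch (our own, assuming SciPy; names are hypothetical):

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def potential_clusters(points):
        # Build the EMST, then add its edges back in increasing weight
        # order; every merged component created along the way is one of
        # the n - 1 nontrivial potential clusters (greedy deletion in
        # reverse). Returns (frozenset of point indices, subtree weight).
        n = len(points)
        mst = minimum_spanning_tree(squareform(pdist(points))).tocoo()
        edges = sorted(zip(mst.data, mst.row, mst.col))
        parent = list(range(n))
        comp = {i: (frozenset([i]), 0.0) for i in range(n)}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        clusters = []
        for w, u, v in edges:
            ru, rv = find(int(u)), find(int(v))
            verts = comp[ru][0] | comp[rv][0]
            weight = comp[ru][1] + comp[rv][1] + w
            parent[ru] = rv
            comp[rv] = (verts, weight)
            clusters.append((verts, float(weight)))
        return clusters

Each returned pair gives a potential cluster's vertex set and the weight of its minimum spanning subtree, which is what the significance test below consumes.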
We prove that this procedure identifies the set of potential clusters by showing that potential clusters, characterized by the definition above, are in one-to-one correspondence with a small class of subgraphs of an EMST T.

Figure 2-2: Procedure to locate potential clusters illustrated on a set of 15 cases. The EMST is first constructed (top left). This is a tree connecting each case (circle) that minimizes the total summed edge distance. At each step, the longest remaining edge is deleted, forming two new connected components (red). Components that were unchanged from the previous step are shown in blue. The connected components are in one-to-one correspondence with the set of potential clusters.

For w ≥ 0, we define T_w to be the graph derived from T by deleting all edges of T having weight greater than w. We label the n - 1 edges of T in order of decreasing weight, so that w(e_1) ≥ w(e_2) ≥ ... ≥ w(e_{n-1}) > 0. If the edge weights are distinct, then there are n distinct graphs T_w; these are the graphs T = T_{w(e_1)} ⊃ T_{w(e_2)} ⊃ ... ⊃ T_{w(e_{n-1})} ⊃ T_0. T_{w(e_{k+1})} is formed from T_{w(e_k)} by deleting one edge, which splits one connected component of T_{w(e_k)} into two components. Thus T_{w(e_{k+1})} has k + 1 connected components, k - 1 of which are present in T_{w(e_k)}, and two of which are newly created. There are 2n - 1 total distinct connected components among all the graphs T_w (see Figure 2-2). If the edge weights are not distinct, then a variation of this argument shows that 2n - 1 is an upper bound on the number of distinct connected components. The following lemma characterizes the connected components:

Lemma 2.2.1. Let V be a nonempty set of points in a plane (representing cases of a disease). Let T be a Euclidean minimum spanning tree of V, S a nonempty subset of V, and T_S the subgraph of T induced by S. The set S is a potential cluster if and only if T_S is a connected component of T_0 or of T_{w(e_k)} for some k.
The proof is made easier by two simple lemmas.
Lemma 2.2.2. Let T_S be a connected subgraph of T with vertex set S. Then p(S) (Eq. 2.2) is equal to the maximum weight of an edge in T_S if |S| > 1, and 0 otherwise.
Proof: If |S| = 1, then S = {x} and

    p(S) = max_{X ∪ Y = S, X ≠ ∅, Y ≠ ∅} p(X, Y) = p({x}, {x}) = p(x, x) = 0.

If |S| > 1, let e = (v_1, v_2) be an edge of maximum weight in T_S. T_S - e has two components with vertex sets V_1 and V_2. We first show that p(V_1, V_2) = w(e), where w(e) is the weight of e. We have

    p(V_1, V_2) = min_{x ∈ V_1, y ∈ V_2} p(x, y) ≤ p(v_1, v_2).

Assume the inequality is strict, so there exist w_1 ∈ V_1 and w_2 ∈ V_2 with p(w_1, w_2) < p(v_1, v_2). The graph T - e + (w_1, w_2) is a spanning tree of V having lower weight than T, which is a contradiction. Hence p(V_1, V_2) = w(e).

We now show that p(S) = w(e). Since

    p(S) = max_{X ∪ Y = S, X ≠ ∅, Y ≠ ∅} p(X, Y) ≥ p(V_1, V_2) = w(e),

we need only prove that p(S) ≤ w(e). This is true if p(X, Y) ≤ w(e) for every X and Y satisfying the conditions ∅ ≠ X ⊆ S, ∅ ≠ Y ⊆ S and X ∪ Y = S. Let X and Y be arbitrary sets satisfying these conditions. If X and Y share a common element, then p(X, Y) = 0 ≤ w(e). If X and Y have no common element, then since they partition the vertices of T_S into two nonempty sets, there exists some edge f = (x, y) of T_S spanning X and Y. We have

    p(X, Y) = min_{a ∈ X, b ∈ Y} p(a, b) ≤ p(x, y) = w(f) ≤ w(e).

Hence p(S) = w(e).
Lemma 2.2.3. If S is a nonempty, proper subset of V, then p(S, V - S) is equal to the minimum weight of an edge in T spanning the cut (S, V - S).

Proof: Let e = (v_1, v_2) be an edge of T of minimum weight spanning (S, V - S). We have

    p(S, V - S) = min_{a ∈ S, b ∈ V-S} p(a, b) ≤ p(v_1, v_2) = w(e).

It suffices to prove that p(S, V - S) ≥ w(e), which holds if p(a, b) ≥ w(e) for every a ∈ S and b ∈ V - S. Suppose there exist some a ∈ S and b ∈ V - S for which p(a, b) < w(e). The edge (a, b) must not be in T since e has minimum weight of all edges spanning (S, V - S). The graph T + (a, b) therefore contains exactly one cycle, and the cycle contains some edge f ≠ (a, b) spanning (S, V - S). The graph T + (a, b) - f is a spanning tree of V, and

    w(T + (a, b) - f) = w(T) + w((a, b)) - w(f) < w(T) + w(e) - w(e) = w(T),

contradicting the minimality of the weight of T. Hence p(a, b) ≥ w(e) for every a ∈ S and b ∈ V - S, and so p(S, V - S) ≥ w(e).
Proof of Lemma 2.2.1: We first show that every potential cluster induces a connected component of T_0 or of T_{w(e_k)} for some k. Equivalently, we show that if a subgraph H of T is not a connected component of T_{w(e_k)} or of T_0, then the vertex set of H is not a potential cluster. Xu et al. [23] showed that every potential cluster induces a connected subgraph of T, so that if H is not connected, then its vertex set is not a potential cluster. Suppose H is a connected subgraph of T which is not a connected component of T_{w(e_k)} for any k, or of T_0. H must have at least one edge; let e_j be an edge of H of maximal weight. Let C be the connected component of T_{w(e_j)} containing e_j. Since H is a connected subgraph of T_{w(e_j)} containing e_j, H ⊂ C. (We refer interchangeably to a graph and its vertex set to simplify notation.) There exists some edge e ∈ T spanning H and C - H, and since e ∈ C, w(e) ≤ w(e_j). By lemma 2.2.2, p(H) = w(e_j), and by lemma 2.2.3, p(H, V - H) ≤ p(H, C - H) ≤ w(e) ≤ w(e_j). Hence p(H, V - H) ≤ p(H) and H is not a potential cluster.

To finish the proof, we must show that every connected component of T_{w(e_k)} for any k, or of T_0, is a potential cluster. This is trivial for T_{w(e_1)} = T and for T_0, whose components are the individual vertices. Let T_S be a connected component of T_{w(e_k)} ≠ T with vertex set S. Then p(S) ≤ w(e_k) by lemma 2.2.2. Since V - S ≠ ∅, there must be some edge e ∈ T spanning S and V - S. Since the edge is not in T_{w(e_k)}, w(e) > w(e_k). This is true for every spanning edge, so by lemma 2.2.3, p(S, V - S) > w(e_k). Hence p(S) < p(S, V - S), and so S is a potential cluster.
Note that the proof does not rely on the uniqueness of T, so degenerate EMST's
do not affect the ability of the method to capture all potential clusters. If the set of
cases V is continuously distributed on the cartogram, as in the present study, then
in theory the EMST is unique with probability 1. However, degenerate EMST's may
occur with extremely low probability due to the inability of computers to support
arbitrary precision.
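As a quick cross-check of the two sketches above (assuming both are in scope), every component enumerated by the union-find procedure should satisfy the brute-force definition on a small random instance:

    import numpy as np

    # Kept small because internal_distance is exponential in |S|.
    rng = np.random.default_rng(1)
    pts = [tuple(p) for p in rng.random((10, 2))]
    V = set(pts)
    for verts, _ in potential_clusters(np.array(pts)):
        S = {pts[i] for i in verts}
        assert is_potential_cluster(S, V)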
2.2.3
Statistical Significance
In order to assign a p-value to any potential cluster, a test statistic is required,
along with its distribution under the null hypothesis H_0 of independently, uniformly distributed cases on the cartogram. Let E be a potential cluster generated under H_0,
and let S be an observed potential cluster. We define
    Ps = Pr{ w(E) ≤ w(S) | card(E) = card(S) },                      (2.4)
where w is the weight of the potential cluster subgraph, and card denotes the number
of cases. Ps is the p-value corresponding to the observed candidate cluster weight,
conditioned on the number of cases in S. Because cases in a true cluster are closer
together than expected, the weight w(S) of a potential cluster S corresponding to
a hot-spot is likely to be smaller than a random EMST potential cluster subgraph
containing the same number of cases. Consequently, a hot-spot should have a low
value of Ps. We define the test statistic P to be the minimum value of Ps over the
set of nontrivial potential clusters containing at most half of the cases. Monte Carlo
techniques are used to fit Ps as a function of w(S) to a Gaussian distribution for each
possible value of card(S). The null distribution of P is subsequently estimated, again
by Monte Carlo, and a cutoff value corresponding to the desired level of significance
a is obtained.
The most significant cluster is reported, but the method could easily be modified
to report all significant clusters without affecting the asymptotic running time.
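A minimal Monte Carlo sampler for the conditional null distribution in Eq. 2.4, reusing the potential_clusters sketch above (our own illustration; the Gaussian fit and the cutoff for the test statistic P are omitted):

    import numpy as np

    def null_weight_samples(n_cases, n_reps, seed=0):
        # Cases are uniform on the unit-square cartogram under H_0.
        # Weights are keyed by cluster cardinality so that
        # Pr{w(E) <= w(S) | card(E) = card(S)} can be estimated, or fit
        # to a Gaussian as described in the text, for each card(S).
        rng = np.random.default_rng(seed)
        samples = {c: [] for c in range(2, n_cases + 1)}
        for _ in range(n_reps):
            pts = rng.random((n_cases, 2))
            for verts, weight in potential_clusters(pts):
                samples[len(verts)].append(weight)
        return samples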
2.3
Results
We applied the SaTScan circular scan statistic [15] and EMST method to several
types of data sets, finding that the EMST method was substantially better able to
detect non-circular clusters. The SaTScan Bernoulli model was used with a maximum
geographic window size containing 50% of the cases for each data set. For each method
and data set, the most significant cluster with a p-value of at most 0.05 computed
using 9,999 Monte Carlo replications was reported; thus the specificity, defined as
the probability of reporting no significant cluster in data generated under the null
hypothesis, was 0.95 for both methods and all data sets. The sensitivity, equal to the
fraction of clusters that were detected, was calculated for each data set and method.
To quantify the extent of overlap between the most likely cluster and the actual
cluster, we defined two other measures. We defined FTC to be the fraction of true
cluster cases that were correctly found in the most likely cluster, and FMLC to be the
fraction of cases in the most likely cluster that coincided with the true cluster.
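In code, both overlap measures reduce to an intersection count; a small sketch (our own helper, with clusters given as sets of case identifiers):

    def overlap_measures(most_likely, true_cluster):
        # FTC: fraction of the true cluster recovered by the most
        # likely cluster; FMLC: fraction of the most likely cluster
        # that coincides with the true cluster.
        hit = len(most_likely & true_cluster)
        return hit / len(true_cluster), hit / len(most_likely)

For example, overlap_measures({1, 2, 3}, {2, 3, 4}) returns (2/3, 2/3).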
2.3.1
West Nile Virus, New York City, 1999
The EMST method and SaTScan had similar performance detecting a 1999 outbreak
of West Nile virus in New York City [31]. This was encouraging because the 56 cases
appear to have an approximately circular distribution (see Figure 2-3), suggesting
an advantage for the circular scan statistic. We defined a study area consisting of
Connecticut, New Jersey and New York, and generated 10,000 controls within the
map distributed in proportion to 2000 U.S. census county population data. In order
to evaluate the methods, we required data sets with both outbreak and non-outbreak
cases. In addition to the West Nile virus cases, we generated 400, 600, 800, 1000 or
1200 additional non-outbreak background cases distributed according to the underlying population distribution. As the number of background cases increased, the West Nile virus cluster became harder to detect. We created 1000 data sets for each background case number. The data sets could represent, for example, emergency visits for neurological symptoms in a multi-state surveillance area, with controls drawn from all emergency visits. Figure 2-3 shows a typical data set along with its Voronoi diagram cartogram transformation and the most likely cluster obtained by both methods. The results of applying SaTScan and the EMST method to the data sets are summarized in Table 2.1.

Table 2.1: SaTScan and EMST method applied to West Nile virus. n, number of background cases added to cluster cases; SN, average sensitivity; FTC, average fraction of true cluster detected; FMLC, average fraction of most likely cluster coinciding with the true cluster (averaged over data sets for which a significant cluster was found); Δ, percent difference.

    n     | SaTScan           | EMST              | Comparisons
          | SN    FTC   FMLC  | SN    FTC   FMLC  | Δ SN    Δ FTC   Δ FMLC
    400   | 1.00  0.69  0.61  | 1.00  0.80  0.53  | +0.5%   +16%    -14%
    600   | 1.00  0.63  0.54  | 1.00  0.69  0.48  | +0.2%   +9.1%   -11%
    800   | 0.99  0.58  0.48  | 1.00  0.61  0.44  | +0.7%   +5.1%   -8.5%
    1000  | 0.99  0.55  0.44  | 0.99  0.55  0.41  | -0.4%   -0.1%   -6.8%
    1200  | 0.89  0.49  0.40  | 0.96  0.50  0.38  | +8.0%   +3.4%   -4.6%
Both methods displayed similar comparative performance for all numbers of background cases. The sensitivity of both methods declined from 1.0 for 400 background
cases to 0.96 and 0.89 for 1200 background cases for the EMST method and SaTScan,
respectively. The percent change in FTC of the EMST method compared to SaTScan varied from -0.4% to +16%, and the percent change in FMLC varied from -14% to -6.8%.
2.3.2
Inhalational Anthrax, Sverdlovsk, Russia, 1979
The EMST method had greater accuracy than SaTScan when applied to a highly noncircular outbreak of 62 cases of inhalational anthrax occurring in Sverdlovsk, Russia
in 1979 [9]. Because we lacked spatial references for the data necessary to geocode
the case locations, we used a uniform distribution within a square study region to
generate 10,000 controls. The set of cases consisted of 400, 600, 800, 1000, or 1200
uniformly distributed background cases, in addition to the anthrax case locations. These could represent, for example, visits for respiratory complaints to an emergency department, with controls drawn from all visits. For each number of background cases, 1000 data sets were generated. A typical data set is shown in Figure 2-4, along with the most likely cluster detected by SaTScan and the EMST method. The mean sensitivity, FTC, and FMLC are summarized in Table 2.2.

Figure 2-3: Detection of 1999 New York West Nile virus cases by SaTScan and the EMST method. a) A typical data set consisting of the 56 West Nile virus cases (red and orange) and 400 background cases (blue and gray) are shown on a map of Connecticut, New Jersey and New York. Only part of the map is shown for clarity. The West Nile virus case locations have been randomly skewed for privacy [1]. The most likely cluster identified by SaTScan is shown (red and blue). The green shading represents the density of controls in each county. b) The Voronoi diagram cartogram of part of the study area is shown along with the transformed case locations. Although the Voronoi diagram cartogram regions are not shown, the distortion of county boundaries induced by the cartogram transformation is apparent. The minimum spanning tree (black edges) connects the most likely cluster identified by the EMST method (red and blue). The control density varies by less than 2.0% over the entire map.

Table 2.2: SaTScan and EMST method applied to anthrax. n, number of background cases added to cluster cases; SN, average sensitivity; FTC, average fraction of true cluster detected; FMLC, average fraction of most likely cluster coinciding with the true cluster (averaged over data sets for which a significant cluster was found); Δ, percent difference.

    n     | SaTScan           | EMST              | Comparisons
          | SN    FTC   FMLC  | SN    FTC   FMLC  | Δ SN    Δ FTC   Δ FMLC
    400   | 0.98  0.32  0.65  | 0.98  0.48  0.49  | -0.4%   +48%    -24%
    600   | 0.88  0.28  0.53  | 0.86  0.39  0.40  | -2.3%   +38%    -25%
    800   | 0.60  0.19  0.44  | 0.72  0.32  0.32  | +19%    +68%    -28%
    1000  | 0.53  0.17  0.37  | 0.60  0.26  0.26  | +12%    +55%    -31%
    1200  | 0.35  0.11  0.32  | 0.52  0.21  0.22  | +46%    +100%   -31%
The EMST method had comparable or greater sensitivity than SaTScan for all
background population sizes, and it correctly identified a greater fraction of the anthrax cases (FTC) for all background population sizes. Both methods' sensitivity
declined as more background cases were added: from 0.98 to 0.52 for the EMST
method, and from 0.98 to 0.35 for SaTScan. The EMST method had a lower value of
FMLC than SaTScan, indicating that it overestimated the cluster to a greater extent
than SaTScan. However, the percent decline in FMLC incurred by using the EMST
method instead of SaTScan was about half of the gain in FTC.
2.3.3
Circular Clusters, Boston, Massachusetts
We also compared the ability of the EMST method and SaTScan to detect circular
clusters. Because the circular scan statistic is optimized to detect circular clusters,
we were surprised to find that the EMST method was as sensitive as SaTScan. The
study area consisted of the 59 zip codes within 10 km of Boston, Massachusetts. Ten
Figure 2-4: SaTScan and EMST Detection of 1979 Sverdlovsk anthrax outbreak. a)
A representative data set of 63 anthrax cases (red and orange) and 400 uniformly
distributed background cases (blue and gray) is shown, along with the most likely
cluster determined by SaTScan (red and blue). b) The EMST method most likely
cluster (red and blue) is shown for the same data set, connected by the minimum
spanning tree of the cartogram-transformed cases (black edges).
thousand controls were distributed on the map in proportion to zip code population
data from the 2000 U.S. census. Data sets of 500 total cases were created, each
containing a synthetic circular cluster with a radius of 1, 2 or 3 km placed at a random location within the study region. We defined the relative cluster density to be the
case density within the cluster divided by the case density outside the cluster. This
ratio varied from 2 to 5 in the data sets. For each combination of outbreak radius
and relative cluster density, 1000 data sets were created.
For small clusters containing on average fewer than 35 cases, the EMST method
had greater sensitivity. However, it is likely that stochastic effects caused such clusters to have non-circular shapes in general. Indeed, the smaller the cluster, the more
pronounced the EMST method's relative improvement in sensitivity. For larger clusters, the EMST method had similar sensitivity to SaTScan (0.1% less to 4.1% greater)
and similar values of FTC (3.4% less to 0.4% greater). However, SaTScan always had
a greater value of FMLC, indicating that it located large circular clusters with greater
overall accuracy than the EMST method. Table 2.3 summarizes the results.

Table 2.3: SaTScan and EMST method applied to circular clusters. r, radius of cluster in kilometers; d, relative cluster density; m, mean cluster size; SN, average sensitivity; FTC, average fraction of true cluster detected; FMLC, average fraction of most likely cluster coinciding with the true cluster (averaged over data sets for which a significant cluster was found); Δ, percent difference.

    Parameters     | SaTScan           | EMST              | Comparisons
    r  d      m    | SN    FTC   FMLC  | SN    FTC   FMLC  | Δ SN   Δ FTC  Δ FMLC
    1  2    8.2    | 0.03  0.03  0.39  | 0.07  0.06  0.22  | +112   +128   -42
    1  3   12.9    | 0.23  0.21  0.75  | 0.29  0.26  0.54  | +25    +26    -28
    1  4   16.3    | 0.45  0.41  0.84  | 0.49  0.45  0.66  | +7.1   +8.3   -21
    1  5   20.8    | 0.65  0.61  0.89  | 0.69  0.65  0.73  | +5.7   +6.4   -17
    2  2   33.7    | 0.30  0.25  0.79  | 0.39  0.30  0.59  | +27    +20    -25
    2  3   50.1    | 0.79  0.73  0.91  | 0.81  0.73  0.76  | +2.3   -0.3   -17
    2  4   64.4    | 0.94  0.89  0.94  | 0.95  0.90  0.82  | +1.1   +0.4   -13
    2  5   75.7    | 0.99  0.95  0.96  | 0.99  0.95  0.86  | 0.0    -0.3   -10
    3  2   79.5    | 0.74  0.65  0.86  | 0.77  0.63  0.72  | +4.1   -3.4   -17
    3  3  108.9    | 0.98  0.93  0.95  | 0.99  0.92  0.82  | +0.8   -2.0   -13
    3  4  133.0    | 1.00  0.97  0.97  | 1.00  0.96  0.88  | -0.1   -1.1   -9.8
    3  5  153.8    | 1.00  0.98  0.98  | 1.00  0.97  0.91  | 0.0    -0.8   -7.3
2.3.4 Rectangular Clusters, Boston, Massachusetts
In a study of rectangular clusters, we found that the EMST method had greater
sensitivity than SaTScan. Sets of 500 cases containing artificial rectangular clusters
having a height-to-width ratio of 1, 4 or 16, and relative cluster density between 2
and 5 were generated within the same study region as above, and 10,000 controls
were distributed in proportion to the background population as above. The cluster
area was fixed at 20 km², and 1000 data sets were generated for each combination of
parameters by randomly placing a rectangular cluster within the study region map.
The results are summarized in Table 2.4.
In general, the EMST method had greater sensitivity than SaTScan (0.2% less
to 166% greater), with the greatest percent increase in sensitivity when the cluster
signal strength was weak or the height-to-width ratio was large. The EMST method
captured a greater extent of the true cluster (FTC) than SaTScan for all cluster types
(2.6% to 419% greater). For most cluster types, there was a parallel decline in the fraction FMLC of the most likely cluster coinciding with the true cluster (20% less to 3.2% greater).

    Parameters     SaTScan                     EMST                    Comparisons
     r   d      SN    FTC   FMLC      SN    FTC   FMLC      Δ SN    Δ FTC    Δ FMLC
     1   2     0.56   0.47   0.82    0.61   0.50   0.65     +8.2%   +6.0%    -20%
     1   3     0.92   0.82   0.90    0.95   0.86   0.78     +3.2%   +4.7%    -13%
     1   4     0.99   0.91   0.93    0.99   0.94   0.85     -0.2%   +2.6%    -8.9%
     1   5     1.00   0.93   0.95    1.00   0.97   0.88     +0.2%   +4.5%    -7.3%
     4   2     0.43   0.26   0.69    0.58   0.42   0.62     +36%    +63%     -10.0%
     4   3     0.95   0.64   0.77    0.97   0.86   0.74     +2.2%   +34%     -4.4%
     4   4     1.00   0.73   0.79    1.00   0.95   0.80     +0.1%   +29%     +0.4%
     4   5     1.00   0.78   0.81    1.00   0.97   0.84      0.0%   +25%     +3.2%
     16  2     0.21   0.06   0.66    0.55   0.31   0.52     +166%   +419%    -21%
     16  3     0.82   0.25   0.72    0.98   0.74   0.60     +21%    +199%    -17%
     16  4     0.99   0.31   0.76    1.00   0.86   0.67     +0.9%   +177%    -11%
     16  5     1.00   0.35   0.77    1.00   0.93   0.73      0.0%   +166%    -6.0%

Table 2.4: SaTScan and EMST method applied to rectangular clusters. r, ratio of cluster height to width; d, relative cluster density; SN, average sensitivity; FTC, average fraction of true cluster detected; FMLC, average fraction of most likely cluster coinciding with the true cluster (averaged over data sets for which a significant cluster was found); Δ, percent difference.
2.3.5 Arbitrary Shapes
It is possible to gain insight into the EMST method's performance on other cluster
shapes without additional intensive computer simulations. The EMST test statistic
depends only on the cartogram, the total number of cases, and the cardinality and
weight of a potential cluster. Hence, we can extrapolate the p-value obtained for one
potential cluster to others having different shapes, but the same number of cases and
weight. To illustrate this, we selected one most likely cluster of 35 cases from one of
the Boston analysis data sets. The EMST method assigned a p-value of 0.0001 to this
potential cluster. Figure 2-5 shows several configurations of potential clusters having
the same number of cases and EMST weight, but very different shapes. If embedded
as potential clusters within a Boston data set of 500 total cases, they would each
achieve the same p-value of 0.0001. In fact, any potential cluster of 35 cases of any shape can be scaled in size to have the same weight, illustrating that the method can capture an infinite array of regular and irregular shapes.

Figure 2-5: Equally detectable potential clusters of various shapes. A most likely cluster of 35 points selected from among the Boston circular cluster data sets, along with its minimum spanning tree, is shown in the upper left. Seven other configurations of 35 points, having minimum spanning trees with exactly the same weight, are also shown. Subject to the constraint imposed by the definition of a potential cluster above, all eight clusters have equivalent detectability by the EMST method. If embedded as potential clusters in a Boston data set of 500 total cases, all would achieve the same p-value of 0.0001.
2.4 Discussion
We find that the EMST method is a powerful and accurate alternative to the circular
scan statistic for non-circular clusters. At a specificity of 95%, the method had comparable sensitivity to SaTScan applied to large synthetic circular clusters and to an
approximately circular West Nile virus outbreak. When applied to small circular clusters, synthetic rectangular clusters, and a highly irregular anthrax cluster, the EMST
method had greater sensitivity. Although SaTScan had better accuracy detecting
large circular clusters, the EMST method had comparable or superior accuracy for
all other cluster types. The EMST method is also able to detect a large variety of
shapes, including highly irregular ones.
In addition to accurately locating clusters of any shape and size, the EMST
method has two unique properties. First, its test statistic is based only on the weight
of the potential cluster subgraph. To our knowledge, all other tests that provide
the location of any detected clusters while allowing the user to set the level of significance for the test utilize the likelihood ratio test statistic developed by Kulldorff
and Nagarwalla [14]. This test statistic requires the area of each region considered,
which in turn requires a precise definition, including the shape, of the region. Second,
we formally define a cluster in mathematical terms that are independent of cluster
geometry, and which depend only on intercase distances. Traditionally, clusters are
often imprecisely defined; for example, Knox's frequently cited definition is "a geographically bounded group of occurrences of sufficient size and concentration to be
unlikely to have occurred by chance" [32].
Of other cluster detection methods designed to capture clusters of any shape, the
EMST method is most similar mathematically to the upper level set method of Patil
and Taillie [21], which examines a well-defined family of contiguous administrative
regions with high relative rates. Assunção et al. [20] used non-Euclidean minimum
spanning trees of a graph with different vertices, edges and edge weights to consider contiguous administrative regions having similar disease rates, whether high or
low. By contrast, we locate sets of individual cases corresponding to a mathematical
formalization of a cluster, using specific subsets of the EMST.
General tests of clustering [8] such as Tango's maximized excess events test [33],
and disease mapping methods, such as Bayesian partition models [34, 35], kriging [36],
and generalized additive models [37, 38], handle arbitrary geometric configurations
of cases without difficulty. However, these address separate problems within spatial
epidemiology, and comparison of clustering and disease mapping methods to cluster
detection methods is not straightforward [39].
The EMST method can easily be extended to analyze regional summary data,
consisting of counts of observed and expected disease cases for each region on a map.
A cartogram is constructed to equalize the density of expected disease cases, and each
observed case is randomly placed on the cartogram within its region of occurrence.
After constructing the cartogram, the procedure for case-control data is followed.
One limitation inherent in this and other methods for aggregated data is that
exact spatial locations are not used, which decreases cluster detection sensitivity and
accuracy [40]. This is also a limitation for the procedure detailed above for case-control data, since a loss of spatial information is incurred by randomizing cases within their regions of occurrence on the Voronoi diagram cartogram. Because the expected area of each region on the cartogram tends toward zero as the number of control locations increases, this loss can be minimized by increasing the number of controls. For 10,000 distinct controls on a square map, as used in our study, the loss of spatial information is modest: each region occupies on average 1/10,000 of the map area, so its linear scale is roughly √(1/10,000) = 1% of a side, and each case is expected to move approximately 1% of the length of one side of the square.
We found that the EMST method's gains in FTC for non-circular clusters were partially offset by a decline in FMLC, indicating that the EMST method reports fewer false negatives, but more false positives, than SaTScan. The relative cost to society
of false negatives and false positives depends on many factors. The cost of false
negative cases includes, for example, an increased risk of spread of a disease and
the possibility that infected individuals who are unaware of the outbreak may not
seek early treatment for symptoms, while the cost of false positive cases includes
unnecessarily investigating and alarming the community.
In retrospective research and prospective surveillance, the shapes of true clusters are not known a priori. Thus, in most cases, a method that is able to detect clusters of
any shape is preferable. Hence the EMST method may represent a practical adjunct
to methods currently used in public health practice.
Chapter 3

Cartograms for Mapping and Analyzing Event Disease Data

(Joint work with John S. Brownstein, Karen Olson, Athos Bousvaros, Bonnie Berger and Kenneth D. Mandl)

3.1 Introduction
From the earliest disease dot map [5] to information-rich modern maps such as the
annual U.S. cancer atlases, representations of the spatial distribution of diseases have
flourished. Disease maps serve several functions: describing a disease prior to more
rigorous statistical study, identifying areas of increased risk or features that may be
missed by mechanical mathematical analysis, and even suggesting etiology and control
strategies. Dot maps showing the exact locations of disease cases compromise privacy
[41] and do not account for variation in the underlying population. This has motivated
the use of choropleth and isopleth maps, depicting the average risk in administrative
regions, and smooth risk functions, respectively.
Disease maps typically use one
of several standard cartographic projections, in which areas on the map reflect the
surface area of regions represented, with the degree of distortion depending on the
projection. Thus a map of disease risk shows an approximation to the amount of
Joint work with John S. Brownstein, Karen Olson, Athos Bousvaros, Bonnie Berger and Kenneth
D. Mandl
land area in each region of increased risk. However, as noted by Dorling [42], diseases
infect people, not land. A standard map showing an increased relative risk confined
to a small area does not distinguish between a major outbreak in a dense metropolis
and a few cases in a rural community.
Cartograms have recently been introduced to capture this distinction. A cartogram, or density-equalizing projection, is a distorted map in which the area of each region is proportional to some quantity associated with the region, such as its population. Cartograms based on total census population have been used to simultaneously
depict the relative risk and the total population affected by leukemia [27], lung cancer
[29], cryptosporidiosis [28] and childhood cancers [25, 26, 43]. Cartograms have also
been used to test for global clustering of disease cases [43, 27, 25].
Although the recent development of an efficient algorithm to create minimally
distorted cartograms [29] has increased the practical possibilities for disease mapping
and analysis, the use of cartograms has been limited in scope. First, disease cartograms have usually been based on estimates of total population taken from census
data. In contrast, traditional epidemiological studies accommodate a multitude of
methods for defining the underlying population at risk. Second, cartograms used for
mapping and global clustering studies have only been constructed from count data,
in which the number of cases and the underlying population size are aggregated by
administrative regions such as counties. Aggregation results in a loss of spatial information, limiting the power for statistical analysis [40] and the ability to see trends.
Exact point location data (usually termed "event" data [44]) is increasingly available
due to clinical databases and fast geocoding software. The use of event data to create
cartograms is an unexplored alternative to count data.
In this study, we extend the use of cartograms of event data to other areas of
spatial epidemiology including disease mapping and global studies of clustering (see
figure 3-1). We show how cartograms based on event data can be used to visualize
and analyze several types of traditional epidemiological studies. For disease mapping,
there is a simple relationship between the density of cases on the cartogram and the
risk of the disease. For statistical tests related to clustering, the null hypothesis is
simplified by the cartogram transformation. This makes it possible to apply existing statistics developed for other fields to spatial epidemiology problems. We illustrate the use of cartograms based on event data for disease mapping using several simulated distributions of cases and controls. We also apply the method to cases of pediatric inflammatory bowel disease (IBD) in Massachusetts between 1995 and 2006, finding an increased relative risk in the southern half of the state.

Figure 3-1: Applications of cartograms to spatial epidemiology.
3.2 Event Cartograms

3.2.1 Data
We begin with a set of two-dimensional coordinates on a map, each having a binary
label (case or non-case) representing disease status. Ideally, this would consist of all
the members of the population at risk for the disease in a study region. Since this is
not feasible in practice, the data may be collected using a standard epidemiological
study design including:
1. a case-control study, in which the exposures of a group having the disease are
compared to those of a group without the disease.
2. a cohort study, in which a group is monitored (either prospectively or retrospectively) for the development of a disease over time.
3. a cross-sectional study, in which the health and exposure characteristics of a
group are measured at a single point in time.
We require the collection process to preserve spatial information about disease risk. A
study in which subjects are selected at random from the total population would clearly
satisfy this requirement, but may not be possible. As an alternative, a collection
process that is independent of spatial location would be suitable. However, even this
may not be possible. For example, consider a cohort of patients enrolled at a single
clinic monitored for the development of a certain disease. The collection process itself
is not spatially neutral; the chance that a patient visits the hospital likely depends on
his or her distance from it. However, the study may still preserve spatial information
about risk if, conditional on location, the odds of entering the study for those with
and without the disease is a constant. The precise requirement is:
p(enter study | case, location L) / p(enter study | non-case, location L) = c.    (3.1)
For cohort and cross-sectional studies, subjects enter the study independent of disease
status, so c = 1. For case-control studies, c is the ratio of the total number of cases
to controls.
Continuing the example above, the cohort study taking place at a single clinic
may or may not satisfy this requirement. For example, if the clinic specializes in the
treatment of the disease under study, its reputation may draw those with symptoms
consistent with the disease from a larger area than patients without such symptoms.
Patients close to the clinic may visit regardless of disease status, but distant patients
may be more likely to choose the clinic if they have the disease.
A more common example of a study that does not preserve the spatial risk structure is a matched case-control study, as previously noted [45]. Choosing controls to
match the potential confounding characteristics of the cases is problematic because
the spatial distribution of the matched controls may not be the same as a random
sample drawn from the total population of all non-cases.
3.2.2 Event cartogram construction
To create a cartogram from event data, a Voronoi tesselation (or Voronoi diagram)
is first created from the set of non-cases. (We will refer to the non-cases as controls
for simplicity, without loss of generality.) First described by Voronoi in 1908 [46],
Voronoi tesselations have found applications in diverse fields including forestry [47],
operations research [48], and computational geometry [49], as well as epidemiology
[50, 51, 35]. Given n controls c1, c2, ..., cn, the Voronoi tesselation partitions a map M into regions called Voronoi cells, consisting of the part of the map falling closest to each control:

vor(ci) = {p ∈ M : d(p, ci) ≤ d(p, cj) ∀ j = 1, ..., n} [24].    (3.2)
In regions with a high density of controls, the Voronoi cells tend to be smaller than
in low density regions. Figure 3-2 shows an example of a Voronoi tesselation.
We assign a population to each Voronoi cell by counting the number of controls it
contains. This number is usually one, but it may be greater if multiple controls occur at the same location on the map. For example, multiple family members in a
house or inhabitants of an apartment building may share the same geocoded location.
Next, we use the Voronoi tesselation map to create a cartogram, on which the area of
each projected cell is proportional to the number of controls in the original Voronoi
cell. Hence the density of the controls is uniform throughout the cartogram. We use
a fast diffusion-based cartogram construction algorithm developed by Gastner and
Newman [29] that produces minimal distortion. The algorithm simulates diffusion
of the population from regions of high density to low density regions, carrying map
boundaries along during the process, until the density is equalized. Other contiguous
cartogram construction algorithms may also be used.
Figure 3-2: Example of a Voronoi tesselation. Left: One thousand points are distributed on a map. Right: The Voronoi tesselation of the points divides the map into
1000 regions. Each region consists of the portion of the map closest to one point.
The density structure is preserved in the tesselated map; regions of small Voronoi cell
area correspond to high point density.
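As a minimal sketch of this first step (Python with scipy.spatial; a unit square stands in for the map, and the Gastner-Newman diffusion stage itself is not shown):

    import numpy as np
    from scipy.spatial import Voronoi

    rng = np.random.default_rng(0)
    controls = rng.random((1000, 2))       # hypothetical control locations

    # Coincident controls (e.g., residents of one apartment building) collapse
    # to a single Voronoi cell whose population is the number of controls there.
    sites, populations = np.unique(controls, axis=0, return_counts=True)

    vor = Voronoi(sites)                   # one cell per distinct control location
    # vor.regions[vor.point_region[i]] lists the vertex indices of cell i; the
    # diffusion cartogram step would then be asked to give cell i an area
    # proportional to populations[i].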
As Bithell [52] notes, computational artifacts may arise during cartogram construction that render the cartogram unsuitable for subsequent statistical analysis.
These include a failure to produce an equalized map; we apply the diffusion algorithm iteratively until the control density differs by no more than 1% over the entire
map. Because an initial step of the algorithm involves replacing the continuous density function with a discrete set of values on a grid, small regions of very high or low
density may be missed entirely in the digitization. If this occurs, the software may fail
to contract or expand these regions. To avoid this, we position the digitization grid so
that the region of highest or lowest density is always captured. Finally, Voronoi cells
may be specified by very few vertices, and as the cartogram is created, edges which
should bend in space remain straight and may cross one another; we eliminate this
possibility by adding additional vertices on the Voronoi edges until crossing edges are
not encountered.
After constructing the cartogram of the Voronoi tesselation, each case is plotted
on the cartogram in a random location within the region corresponding to its Voronoi
cell on the map. Although it is possible to show cases in their exact cartogram-transformed locations, they must be randomized because density is not equalized
within each Voronoi cell, as noted by Merrill [26]. Randomization represents a loss of
information, but the loss is small given sufficient numbers of distinct controls. Indeed,
it tends to zero as the number of distinct control locations increases.
3.2.3 Mapping the disease risk
The density of transformed cases on the cartogram is proportional to the risk of the
disease. To see this, consider the set {Ri}, i = 1, ..., nR, of nR ≤ n Voronoi cells on the original map M. Let ai and bi denote the number of cases and controls, respectively, in region Ri for i = 1, ..., nR. For cohort and cross-sectional studies, the quantity ai/bi estimates the odds of the disease in region Ri. For case-control studies, the relative risk of the disease is approximated by (n/m)(ai/bi), where m is the total number of cases.
Let R̃i denote the transformation on the cartogram of the region Ri for each i, and let

δ(R̃i) = (number of cases in R̃i) / area(R̃i)    (3.3)
denote the density of cases in the cartogram region R̃i. The number of cases in R̃i is equal to the number of cases ai on the original map falling into the Voronoi cell Ri, by the procedure outlined above. Because the cartogram scales areas to equalize the density of controls, the area of region R̃i is proportional to the number bi of controls in Ri. That is,

area(R̃i) = c · bi,    (3.4)

where c is a constant. The area of the entire cartogram M̃ is also proportional to the total number of controls, so

area(M̃) = c · Σ_{j=1}^{nR} bj = c · n.    (3.5)

The Gastner and Newman cartogram procedure preserves the total map area, so area(M̃) = area(M). Solving for the constant of proportionality c gives

area(R̃i) = area(M) · bi / n.    (3.6)

This is simply the observation that bi/n of the controls lie in region Ri, so R̃i occupies a fraction bi/n of the total cartogram area. Thus the density of cases on the cartogram is

δ(R̃i) = (n / area(M)) · (ai / bi).    (3.7)
Hence for cohort and cross-sectional studies, the odds of the disease is approximately
the density of cases on the cartogram divided by the average density of controls on the
original map. For case-control studies, the relative risk of the disease is approximately
the density of cases on the cartogram divided by the average density of cases on the
original map.
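These per-cell estimates translate directly into code; the sketch below (Python with numpy, toy counts standing in for the ai and bi of a real tesselation) computes both quantities:

    import numpy as np

    a = np.array([4, 1, 0, 7, 2])          # cases per Voronoi cell (toy values)
    b = np.array([10, 8, 5, 9, 11])        # controls per Voronoi cell (toy values)
    m, n = a.sum(), b.sum()

    odds = a / b                           # cohort and cross-sectional studies
    relative_risk = n * a / (m * b)        # case-control studies: equation 3.7
                                           # divided by the mean case density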
3.3 Examples

3.3.1 Simulated Distributions
We first illustrate the use of cartograms of event data for visualizing hypothetical
diseases having known relative risk surfaces. We generated cases and controls for three
examples: a constant risk surface, risk increasing with latitude, and a circular cluster.
Five thousand controls, used for all three examples, were randomly distributed on a
map of the continental United States in proportion to 2000 census county population
estimates [53]. To generate each control, a zip code was drawn from a multinomial
distribution with probability proportional to its census population. A map location
for the control was then randomly selected from a uniform distribution within the zip
code boundary.
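A sketch of this two-stage sampling, with invented populations; the unit square below is a placeholder where a real implementation would sample within each zip code's boundary polygon:

    import numpy as np

    rng = np.random.default_rng(0)
    pop = np.array([12000, 3400, 56000, 800])   # hypothetical zip code populations

    # Stage 1: draw a zip code for each control, proportional to population.
    zip_idx = rng.choice(len(pop), size=5000, p=pop / pop.sum())
    # Stage 2: place each control uniformly within its zip code (here the unit
    # square stands in for the actual boundary polygon).
    controls = rng.random((5000, 2))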
To illustrate constant risk, 2500 cases were distributed in proportion to the underlying population as above. Figure 3-3a shows the distribution of cases and controls.
Figure 3-3b shows the cartogram derived from the Voronoi diagram of the controls,
as well as the transformed locations of cases. The cases appear to be uniformly distributed, consistent with a relative risk of one everywhere on the map. Figures 3-4a
and 3-4b show isopleth relative risk surfaces for the same data set. The surface on the
map was created by dividing a kernel case density estimate by a kernel control density
estimate [54]. There are regions with extremely high relative risk estimates on the
order of 1000. The cartogram surface estimate, also created using kernel methods, is
much closer to the true relative risk of one everywhere.
The second example illustrates a smooth transition in relative risk. The relative
risk of the 2500 cases increased linearly by a factor of four from south to north. A
north-south gradient was chosen to illustrate diseases with latitude-dependent risk,
such as stroke, multiple sclerosis and melanoma. Figures 3-3c and 3-3d show the map
of cases and controls, and the cartogram-transformed case locations. The change in
relative risk is difficult to detect on the original map, but is suggested by the cartogram-transformed cases. Figures 3-4c and 3-4d show the corresponding isopleth risk surfaces. The latitudinal gradient is apparent on the cartogram, but is hard to detect
on the standard map due to the presence of multiple locations having extremely high
relative risk estimates.
The third example illustrates the presence of a localized cluster in the data. Within
the 2500 cases, a cluster was generated to roughly approximate the geographic distribution of a large multi-state mumps outbreak occurring in 2006 [55, 56]. The circular
cluster was centered in Iowa and involved neighboring states. The relative risk inside
the cluster was three times the risk outside the cluster. The cases could represent,
for example, national syndromic surveillance of fever, with controls drawn from all
syndromes. The cluster is difficult to discern on a dot map of cases and controls
(figure 3-3e). It appears on an isopleth surface created using a standard map (figure
3-4e), but is accompanied by artifactual regions of very high relative risk. The cluster
is clearly apparent on a cartogram scatter plot of cases (figure 3-3f), and is captured
with fidelity on a cartogram isopleth risk surface (figure 3-4f).
3.3.2 Pediatric Inflammatory Bowel Disease, Massachusetts, 1995-2006
We also used the cartogram method to visualize the spatial distribution of IBD among
the pediatric population in Massachusetts, finding an increase in relative risk in the
southern portion of the state. Cases were drawn from patients diagnosed with IBD
at any of 10 pediatric gastrointestinal clinics affiliated with an urban tertiary care
pediatric hospital in Boston, Massachusetts. All 1163 patients who were diagnosed
between January 25, 1995 and March 21, 2006 were considered. Of these, the addresses of 87 patients could not be geocoded. Of the remaining 1076 patients, the
cases were selected to be the 901 patients residing in Massachusetts. Controls were
drawn from clinic patients (exclusive of the cases) seen during the same time period
who were diagnosed with ICD9 codes 564.00 (constipation), 789.00 (abdominal pain,
unspecified site) or 564.1 (irritable bowel syndrome). Of these, the 7988 patients having addresses within Massachusetts were selected to be controls. The cartogram and
new case locations are shown in Figure 3-5a. The density of cases on the cartogram
is increased in the southern part of the state, reflecting an increased relative risk.
3.4 Discussion
Cartograms of event data, based on Voronoi diagrams of control locations, produce
maps of the variation in disease risk across space that can be used for visual displays
or statistical analysis. Visual interpretation of the cartogram is straightforward: the
density of cases reflects the odds or relative risk, and area reflects the size of the
population at risk. The method requires no user-specified input parameters, and it
minimizes the loss of spatial information. It is applicable to analyses on any scale,
from local epidemiological studies to national initiatives such as BioSense and CDC
influenza surveillance.
Previous applications of cartograms to disease mapping and analysis have been
limited to count data, aggregated by administrative regions such as census tracts
[27, 28, 25, 26, 43] or zip codes [29]. This is a reflection of the general disease mapping
literature; event data have received less attention [52] despite their greater precision
and information content. It is conceivable that count data may in fact be superior
because of the larger baseline of census population data. However, many epidemiological studies cannot draw baseline information from census data. For example, a
cohort selected for inclusion in a study may have characteristics differing from available census categories. Furthermore, the use of event data for cluster detection has
been shown to improve sensitivity compared to count data [40].
Alternative methods of creating cartograms from event data exist. In fact, any
method of estimating a continuous or discrete control density function from event
data may be used as the input for a cartogram. One example is binning by fixed administrative regions [28, 26]. This is essentially an aggregate method since it converts
event data into count data, and hence it loses spatial information. Our proposed
method "bins" data into dynamic regions that usually contain only one control. If
a different estimation method is used, it must be unbiased. That is, the estimate of
the mean density of controls in each region must tend to the exact distribution as
the number of controls increases. Otherwise, the null hypothesis of constant, independent individual risk does not lead to complete spatial randomness of the cases on
the cartogram.
Because there are existing methods for mapping and analyzing disease event data
that do not involve cartograms, we wish to consider the benefits and drawbacks of
using cartograms compared to these methods. In the case of mapping, cartograms
simultaneously depict the risk of the disease and the size of the underlying population; there is no straightforward way for standard methods to achieve this. However,
the cartogram's distortion of familiar map boundaries may detract from its visual
appeal. Either cartograms or standard methods may be used to create dot, choropleth, or isopleth maps. Standard dot maps show where the preponderance of cases
occurs, reflecting the burden on health care infrastructure. Although dot maps are commonly used (for example, [57, 58, 59, 45, 60]), it is difficult to visually compare the spatial distributions of cases and controls on them to assess variation in risk. In addition, dot maps
may compromise patient privacy; maps containing minimal spatial references may be
reverse-geocoded, frequently to the exact addresses of patients [61, 41]. Cartograms
suffer from the same limitation if any spatial features, such as peninsulas, islands and
sharp corners in administrative boundaries, can still be recognized after the cartogram
transformation. Appreciating spatial variation in risk on the cartogram requires discerning deviations from complete spatial randomness. Although large deviations are
obvious (for example, in figure 3-3f), the eye is not expert at this task, tending to
find patterns where none exist.
Choropleth maps bin event data by pre-existing administrative regions, and use
shading or color to represent the odds ratio [62] or prevalence [63] in each region. In
addition to the loss of information incurred by binning, several problems with choropleth maps have been noted in the literature, which apply equally to standard and
cartogram methods. Administrative boundaries are unrelated to the disease process.
Their irregularity may make the map difficult to interpret, and the choice of administrative groupings used may change the appearance of the map [36]. Regions with
smaller denominators are subject to greater random variation, producing more extreme rates [64]. Furthermore, sparsely populated areas are often clustered together
on the map, compounding the visual effect. The gray or color scale used to produce
the map may strongly affect the visual appearance [7]. In choropleth maps using standard projections, large areas may dominate the map, irrespective of the underlying
population size [36].
Isopleth maps, which estimate a smooth risk surface throughout the map, are an
alternative to choropleth maps. Although many methods exist for creating standard
isopleth maps from count data (for example, [36, 34]), the methods for event data are
limited in number. Bithell used separate kernel estimates of case and control densities
at each point, dividing to estimate the odds function [54]. Kelsall and Diggle used
kernel estimates of the conditional probability of being a disease case, given location
[38].
These methods do not allow adjustment for covariates.
To address spatial
confounding, which occurs when risk factors for a disease are not uniformly distributed
in space, Kelsall and Diggle also introduced the use of generalized additive models
that adjust for covariates. They modeled the log odds as the sum of a nonparametric
function of the location and a parametric function of the covariates. One limitation
of these methods is that under the null hypothesis of constant individual risk, the
variance of the estimate changes throughout the map. In regions of high density, the
variance is lower than in sparse areas.
On the cartogram, the density of cases may be estimated at each point to derive the
odds (for cross-sectional and cohort studies) or relative risk (for case-control studies),
giving rise to an isopleth surface. Kernel density estimation is a natural choice, which
would result in a variation of Bithell's method in which the bandwidth is the same
for the numerator and denominator, but adjusted (relative to the original map) at
each point depending on the local density of controls. This is preferable because the
variance of the estimate at each point is constant. Generalized additive models can
also be used in this setting to estimate the density while adjusting for covariates. This
improves the procedure proposed by Kelsall and Diggle by equalizing the variance of
the estimate.
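A sketch of the kernel estimate on the cartogram (scipy's gaussian_kde; the case coordinates are random stand-ins for real transformed locations). Because the cartogram preserves total map area and equalizes the control density, the case-control relative risk reduces to the map area times the normalized case density:

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    cases_cart = rng.random((901, 2))      # stand-in for transformed case locations
    area_M = 1.0                           # total map area, preserved by the cartogram

    # gaussian_kde integrates to 1, so the case density is m * kde while the
    # average case density is m / area(M); their ratio is simply area(M) * kde.
    kde = gaussian_kde(cases_cart.T)       # one fixed bandwidth suffices here
    grid = np.mgrid[0:1:100j, 0:1:100j].reshape(2, -1)
    relative_risk = area_M * kde(grid)     # isopleth surface on the grid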
The use of cartogram-transformed data for statistical analysis may be complicated by computational artifacts arising during the cartogram construction [52]. We
verify that the cartogram does indeed equalize the density of controls to within 1%
throughout the map. The null hypothesis of constant, independent individual risk
implies that cases are uniformly and independently distributed on the cartogram. For
studies of clustering, this is a simplified version of the usual problem in which the
underlying at-risk population has uniform density. In many cases, a statistical test
used for the general problem can be simplified for use on cartogram-transformed data.
However, the main benefit of using cartograms over standard methods for statistical
tests of deviation from constant disease risk is that it enables the use of a host of
methods that have been developed to analyze a single point process (cases) instead
of two point processes (cases and non-cases). Because the problem is easier, the body
of such methods is farther advanced than methods for two point processes.
For example, tests for the detection of clusters find localized "hotspots" of disease. The most commonly used test is the scan statistic developed by Kulldorff and
Nagarwalla [14, 15], which evaluates the statistical significance of a large number
of circular regions using a likelihood ratio test. The cartogram transformation does
not preclude subsequent use of the scan statistic; it may still be applied to test the
null hypothesis that cases arise from a homogeneous Poisson process. However, in
addition to standard cluster detection tests, the cartogram transformation allows the
application of tests of deviation from a uniform null distribution. An example is the
graph-theoretical cluster detection method developed in Chapter 2 to detect clusters
of any shape. There are no corresponding methods available for two point processes
to detect arbitrarily shaped clusters in event data, so the cartogram is necessary.
Cartograms are also useful for tests of global clustering, which detect the general
tendency of cases to cluster in space. Of the methods developed to compare two point
processes on standard maps, most suffer from the problem of multiple hypotheses due
to a variable parameter related to the scale of clustering. This makes it difficult to
calculate the probability of type I error. One exception is a statistic proposed by
Tango [33], which is the minimum p-value of a parameterized statistic that sums an
index of goodness-of-fit and one of autocorrelation [65, 66]. The test is for count data,
so event data must first be aggregated to use it. Simpler statistics such as the mean
nearest-neighbor distance and Ripley's K-function are available to test for global clustering of event data on the cartogram. Selvin and Merrill [27] developed a statistic
for cartogram-transformed data based on concentric polygons, defined to be polygons within a boundary polygon having the same centroid and parallel boundaries.
They first ordered the cases on the cartogram by their distance from the cartogram
boundary, from furthest to nearest. Then for each i, they calculated the proportion
Pi of the cartogram area contained in the concentric polygon passing through the ith
case. The value of Pi approximates the value of the uniform cumulative distribution
function of a uniform random variable on the interval [0, 1]. The expected value is
.n
The expected and observed distributions were compared using the Kolmogorov test.
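A sketch of that comparison using scipy; here the proportions are simulated under the null rather than computed from concentric polygons:

    import numpy as np
    from scipy.stats import kstest

    rng = np.random.default_rng(0)
    P = np.sort(rng.random(200))           # stand-in for the ordered proportions Pi

    # Under uniform, independent cases on the cartogram, the Pi form an ordered
    # uniform sample on [0, 1]; the Kolmogorov test measures the deviation.
    statistic, p_value = kstest(P, "uniform")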
We illustrated the use of cartograms for making dot and isopleth maps with a
preliminary analysis of an IBD case-control study. IBD is characterized by chronic
gastrointestinal inflammation believed to result from immune dysregulation.
The
term encompasses both ulcerative colitis (UC), which affects the large intestine, and
Crohn's disease (CD), which may affect the entire gastrointestinal tract. Both cause
a range of symptoms, such as diarrhea, weight loss, fever, abdominal pain, bleeding,
and bowel obstruction. They are also associated with a variety of extra-intestinal
manifestations, such as arthritis, skin nodules, and thromboembolic disease [67].
IBD is an interesting disease to study for clustering because its epidemiology is not
completely understood. Linkage analysis studies have implicated regions on several
chromosomes in susceptibility to UC and CD [68], with the most strongly associated
genetic mutation occurring on a gene involved in the innate immune response [69].
The rates of concordance among identical twins with UC and CD are 20 [70] and
67 [67] percent, respectively, indicating that environmental factors play a significant
role. Several factors are known to modify the genetic risk of IBD, most notably
cigarette smoking, appendectomy [71] and the use of oral contraceptives [67]. Other
potential risk factors, such as childhood hygiene, Mycobacterium paratuberculosis
infection, childhood viral infections, diet, and blood transfusions have shown weak
associations with IBD or have not been well studied [72]. In general, the environmental IBD risk modifiers are poorly understood, and further epidemiological study
is needed.
Previous IBD clustering studies have yielded some interesting preliminary results.
There is evidence for concordance among spouses and friends [73], and one study of
clustering at a global level among Swedish IBD cases found that patients were likely
to have been born closer together in time and space than expected by chance [74]. No
evidence of global clustering was detected in a study of CD cases surrounding Nottingham, England [75]. A seasonal pattern of IBD hospital admissions and a greater
prevalence of IBD in the northern U.S. than in the south were found by Sonnenberg
et al. [76]. A study of focused clustering concluded that a suspected cluster of four
cases in an eight-year period in one district of France was statistically significant [77];
another found increased prevalence in a region of Ireland suspected to have a high
burden of IBD [78].
While such focused studies are potentially illuminating, the
investigation of previously suspected clusters raises statistical concerns [64].
Our study found an increased density of cartogram-transformed IBD cases in
the southern portion of Massachusetts. While this may reflect an actual increase
in relative risk in the south, it may also result from a violation of the assumption
that spatial risk information is preserved by the study collection process, defined in
equation 3.1. This could result, for example, from a bias in referral patterns to the
gastrointestinal clinics. Our finding, although preliminary, gives cause for further
investigation of IBD spatial clustering.
Figure 3-3: Dot maps and cartograms of three hypothetical disease distributions. Dot
maps of 5,000 controls (blue) and 2,500 cases (red) are shown in the left column (a, c
and e). The controls are distributed in proportion to the underlying population. The
cases are distributed to illustrate constant relative risk (a), risk increasing linearly by a factor of four from south to north (c), and a localized cluster with a three-fold
increase in relative risk in Iowa and neighboring states (e). The right column (b, d
and f) shows the cartogram-transformed case locations for the three distributions.
[Figure 3-4 legend: relative risk shading in six bins: 0-0.73, 0.73-1.07, 1.07-1.7, 1.7-2.46, 2.46-3.46, 3.46-3,202.]
Figure 3-4: Isopleth surfaces estimating the relative risk of three hypothetical disease
distributions on standard maps (a,c and e) and cartograms (b,d and f). The exact
locations of the cases and controls are shown in figure 3-3. The case distributions
illustrate constant risk (a and b), a four-fold increase in risk from south to north
(c and d) and a cluster of three-fold risk increase centered in Iowa (e and f). The
patterns are obscured on the standard maps because of the presence of high relative
risk artifacts, but are clear on the cartograms.
[Figure 3-5 legend: relative risk shading in five bins: 0.19-0.67, 0.67-0.99, 0.99-1.37, 1.37-1.92, 1.92-2.78.]
Figure 3-5: Pediatric inflammatory bowel disease risk in Massachusetts, 1995-2006. a)
A standard map of the study area. b) A cartogram was constructed from the Voronoi
diagram of the 7988 control locations. The 901 IBD cases were randomly placed
within the cartogram regions corresponding to their original locations on the Voronoi
diagram. An isopleth relative risk surface was calculated from the transformed case
locations using kernel methods. Original case and control locations are not shown to
protect patients' privacy.
Chapter 4

Optimal anonymization of patient spatial data

(Joint work with Christopher A. Cassa, Kenneth D. Mandl and Bonnie Berger)

4.1 Introduction
Since the first disease dot map was published more than 200 years ago, it has become a staple of the epidemiologist's toolkit. Dot maps are frequently used as an
initial step in studying the spatial epidemiology of a disease, revealing associations
among the cases and with the landscape. Often published in medical journals, dot
maps represent one common and useful instance of sharing information about the
geographical distribution of diseases with the scientific community and the public.
Spatial data in other formats are also shared between individuals and institutions
for many purposes: to identify focal clusters, to study etiological factors influencing
disease risk, and ultimately to improve medical care and public health.
Despite their potential for public good, geographical identifiers such as zip codes,
street addresses, and locations on maps are highly identifying protected health information that pose a threat to patient privacy. Even coarse identifiers can be linked to individuals: one study found that 87% of subjects could be uniquely identified by their gender, zip code and date of birth [79]; another found that low-resolution dot maps of diseases published in several medical journals could be used to trace most cases to single addresses [41].
For this reason, data are frequently aggregated to preserve privacy. Aggregation by
zip codes (in the United States) or census enumeration districts (in the United Kingdom) is common for published maps, spatial epidemiological studies, and prospective
health surveillance. However, aggregation may erase spatial information useful for
research and health surveillance; for example, cluster detection methods to detect
spatial clusters are significantly less sensitive and specific when data are aggregated
by zip code [40]. Furthermore, in many instances, aggregation by zip code may not
be sufficiently privacy-preserving. In the 2000 U.S. census, 1210 inhabited zip codes
contained fewer than 100 people, and the least populated of these contained only one
person [53].
For research, disclosures of zip codes and other geographical identifiers are limited by the Health Insurance Portability and Accountability Act of 1996 (HIPAA). Under
this legislation, zip codes may be released as part of a limited data set, accompanied
by a data use agreement. The rule also defines a category of "non-identifiable data
sets," whose dissemination is not restricted. Either of two criteria must be met for
a data set to qualify. The first specifies that only the first three digits of a zip code
are included, provided that at least 20,000 people share the same first three digits.
Unfortunately, this is even coarser and less useful for research and public health
than zip codes. Furthermore, the level of privacy protection depends on the number
of patient records. For example, if it is revealed that 20 patients having a certain disease reside in a region containing 20,000 people, then there is a 1/1,000 chance that a randomly selected individual from the region is one of the patients. However, if 200 patients with the disease live in the region, then the probability that a random individual from the region is among the set of patients increases to 1/100.
The second criterion specifies that "the risk is very small that the information
could be used, alone or in combination with other reasonably available data, to identify an individual" [80]. In line with this, several strategies have been developed to
create a de-identified data set by applying a spatial transformation to each patient in
the original set. These include the family of "geographical masks" [81], deterministic
or stochastic functions of geographical identifiers designed to de-identify patient locations, while preserving the approximate spatial distribution of cases. These encompass
previous approaches such as aggregation and translation by fixed distances, as well as
affine transformations (consisting of scaling and rotation followed by translation) and
random perturbations. Cassa et al. [1] evaluated a probabilistic randomization scheme
using a bivariate Gaussian distribution with standard deviation inversely proportional
to the population density to standardize the level of privacy protection throughout
the map. Although these techniques represent a significant advance over aggregation, they apply the same transformation independent of the local geography, the
number of patient records, and, in several cases, the underlying population counts.
Consequently, the probability that any of the de-identified records originated from a
single individual depends upon all of these variables. Although it may be possible to
quantify this probability, the dependence on the variables may be complicated and
the method of quantification would not be straightforward. However, a quantitative
measure that captures privacy protection is essential for ethical and legal reasons.
We present a principled approach to de-identifying patient locations based on
linear programming that specifies the maximum probability of associating any of the
transformed locations with any individual in the population. This re-identification
probability is a user-specified parameter of the method. The solution is optimal in
that it guarantees that patients are moved the minimum distance for the level of
privacy protection offered. The method has the advantage that it does not move
patients to unrealistic locations, such as lakes and rivers. It may be used to create
de-identified data sets in accordance with HIPAA for publishing maps of diseases,
for cluster detection, or for other epidemiological investigations. Application of the
method to de-identifying patients in New York county, New York shows that a high
level of privacy can be achieved while moving patients very short distances.
4.2 Methods
Given the locations of a set of patients, the aim is to randomly assign new, de-identified locations that can be associated with the original patients with very low
risk. The distance between the original and new locations should be minimized. The
original locations may be any discrete geographical identifiers. For example, they
may be zip codes, census block groups, street addresses, or even pixels in an image.
The set of available locations must be known in advance; for example, these could be
all the census block groups in a state, or all the residential addresses within a city.
This problem can be captured by a linear programming (LP) model, a simple type
of mathematical model that consists of a set of decision variables, constraint equations, and an objective function. Given m available locations, the decision variables
are the transition probabilities Pij of assigning a patient in location i to a new location j (see figure 4-1). Once values have been assigned to the decision variables, each of s patients in a list of original locations is moved to a new location independently of the other patients. If a patient is originally in location i, a new location is drawn from the set j ∈ {1, 2, ..., m} using a multinomial distribution with probabilities Pij. The goal is thus to assign a value to each decision variable Pij so that this procedure ensures privacy and minimizes patient movement.

Figure 4-1: Schematic of transition probabilities. A patient found at each location may transition to any other location. In this simple example, there are three locations (represented by houses) and nine transition probabilities (represented by arrows). The probabilities are variables solved by linear programming.
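Before turning to the constraints, here is a minimal sketch of the randomization step itself (Python; the 3 × 3 transition matrix is invented, and in practice P would come from the linear program below):

    import numpy as np

    rng = np.random.default_rng(0)

    def randomize(original_locations, P):
        # Each patient moves independently: the new location of a patient at
        # location i is a single draw from the multinomial with probabilities P[i].
        return [rng.choice(len(P), p=P[i]) for i in original_locations]

    P = np.array([[0.9, 0.1, 0.0],         # hypothetical matrix; rows sum to one
                  [0.0, 1.0, 0.0],
                  [0.0, 0.5, 0.5]])
    new_locations = randomize([0, 2, 2], P)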
Constraint equations specify conditions that must be satisfied by the decision
variables Pij. Since the decision variables are probabilities, each must be nonnegative:
0 ≤ Pij for all 1 ≤ i ≤ m and 1 ≤ j ≤ m.    (4.1)
In addition, every case must be moved somewhere ("moves" to the original location
are allowed), so
Σ_{j=1}^m Pij = 1 for all 1 ≤ i ≤ m.    (4.2)
A final constraint guarantees that the risk of linking any randomized location with any original patient is small. In formal terms, we specify that the probability that any location from the randomized data set originated from any specific individual in the underlying population is at most ξ:

Pij − (ξ/s) · Σ_{k=1}^m nk Pkj ≤ 0    ∀ i, j.    (4.3)

In this equation, ξ is a user-specified privacy bound between zero and one, s ≥ 1 is the number of patients in the data set to be de-identified, and ni is the number of people in region i. For example, if the regions are census block groups, then the constants {ni}_{i=1}^m may be corresponding populations drawn from the same census. If the regions are exact addresses, then ni is assumed to be 1 for each i.
To derive this constraint, consider the probability of re-identifying a set of s cases that have been randomized to new locations. Given m available locations, let Pij denote the probability of transition from location i to location j for 1 ≤ i, j ≤ m. Given the set of s locations comprising the de-identified data set, we require the probability that any one of these derived from one specific individual to be at most ξ. Equivalently, the probability that a location from the randomized data set originated from an arbitrary specific individual is required to be at most ξ/s. Let X and Y denote the original and transformed locations, respectively. This condition is formally expressed as:

p(patient q | Y = j) ≤ ξ/s    (4.4)
for every individual q in the population and every location j ∈ {1, 2, ..., m}. The left hand side of this inequality is equivalent to

p(patient q ∩ X = L(q) | Y = j),    (4.5)

where L(q) is the location of individual q, or

p(patient q | X = L(q)) · p(X = L(q) | Y = j).    (4.6)

Assuming that all individuals in location L(q) have an equal chance of having the disease, we have

p(patient q | X = L(q)) = 1 / nL(q),    (4.7)
where nL(q) is the number of people in location L(q). Hence the condition expressed by equation 4.4 is

p(X = L(q) | Y = j) ≤ (ξ/s) · nL(q)    (4.8)

for every individual q and location j. Since the location of q, L(q), may only take on the values 1, 2, ..., m, this is equivalent to

p(X = i | Y = j) ≤ (ξ/s) · ni    (4.9)
for every i and j in the set 1, 2, ..., m. After multiplying both sides of equation 4.9 by p(Y = j), the left hand side becomes p(X = i ∩ Y = j), or p(Y = j | X = i) · p(X = i). Furthermore, p(Y = j | X = i) is simply the transition probability from location i to location j, so it is equivalent to the decision variable Pij. Hence equation 4.9 is equivalent to

Pij · p(X = i) ≤ (ξ/s) · ni · Σ_{k=1}^m Pkj · p(X = k)    (4.10)
for all i and j. Assuming that all individuals in the population have an equal probability of having the disease, we have

p(X = i) = ni / Σ_{r=1}^m nr.    (4.11)

Hence, we rearrange equation 4.10 to obtain

Pij − (ξ/s) · Σ_{k=1}^m nk Pkj ≤ 0    (4.12)

for all j, and for all i having ni > 0. However, if ni = 0, then no patients can be found in location i, so any conditions on Pij do not affect the strategy. Hence we require that the inequality holds for all i and j.
The objective function to be minimized is the expected distance that a patient is moved:

Σ_{i=1}^m Σ_{j=1}^m (ni dij / Σ_{r=1}^m nr) · Pij,    (4.13)

where dij is the distance between region i and region j. For example, this could be the distance between census block group centroids, or between exact addresses.
Several standard linear programming techniques to solve LP models, such as that specified by equations 4.1-4.13, have been developed. When applied to an LP model, they either locate an optimal solution that minimizes the objective function, or they prove that no solution exists. The latter happens if no probabilistic de-identification strategy has a risk of re-identification of at most ξ. For example, if there are m available individual addresses, then no strategy to de-identify s < m patients can achieve a risk of re-identification below s/m. If no strategy exists, then a larger re-identification risk can be specified (if acceptable for privacy protection), or the set of available locations can be expanded.
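The study itself solves this model with CPLEX (section 4.3.1); purely as a toy-scale sketch, equations 4.1-4.3 and 4.13 can also be assembled for scipy.optimize.linprog. The function name, the three locations, and their populations below are invented for illustration:

    import numpy as np
    from scipy.optimize import linprog

    def optimal_anonymization(coords, n, xi, s):
        # Solve the LP of equations 4.1-4.3 with the objective of equation 4.13.
        m, N = len(n), n.sum()
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
        c = (n[:, None] * d / N).ravel()            # expected distance moved

        A_eq = np.zeros((m, m * m))                 # each row of P sums to one
        for i in range(m):
            A_eq[i, i * m:(i + 1) * m] = 1.0

        A_ub = np.zeros((m * m, m * m))             # Pij - (xi/s) sum_k nk Pkj <= 0
        for i in range(m):
            for j in range(m):
                row = i * m + j
                A_ub[row, i * m + j] += 1.0
                for k in range(m):
                    A_ub[row, k * m + j] -= (xi / s) * n[k]

        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m * m),
                      A_eq=A_eq, b_eq=np.ones(m),
                      bounds=(0, None), method="highs")
        return res.x.reshape(m, m) if res.success else None

    coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
    P = optimal_anonymization(coords, n=np.array([40, 25, 35]), xi=0.02, s=1)

With these populations, no location holds the s/ξ = 50 people required for a patient to remain in place with certainty, so the optimal strategy necessarily mixes nearby locations.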
Especially for address data and image pixels, there may be many available locations, and consequently a large number of decision variables and constraint equations.
This affects the running time and storage requirements of linear programming methods. The problem size can be decreased by allowing only transitions from each region
to its k nearest neighbors, for some fixed k. The solution to this modified problem
may be slightly sub-optimal in terms of the distance patients are moved, but the
restriction does not affect the accuracy of the re-identification probability.
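A sketch of building that restricted variable set (the coordinates stand in for region centroids, and knn_variable_pairs is a hypothetical helper, not part of any library):

    import numpy as np

    def knn_variable_pairs(coords, k):
        # Keep only transitions Pij where j is among the k nearest locations to i
        # (including i itself), shrinking the LP from m^2 to m*k decision variables.
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
        nearest = np.argsort(d, axis=1)[:, :k]
        return [(i, j) for i in range(len(coords)) for j in nearest[i]]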
Simple variations of the linear program make it possible to capture other objective functions, constraint equations, or decision variable constraints. Instead of minimizing the expected distance, the expected squared distance may be used:

Σ_{i=1}^m Σ_{j=1}^m (ni dij² / Σ_{r=1}^m nr) · Pij.    (4.14)

The squared distance penalizes long distance moves more heavily than short moves, which may be less likely to affect subsequent clustering analyses of the de-identified data set. In fact, any objective function that is a linear combination of the decision variables Pij may be used without complicating the analysis.
If a deterministic strategy which always gives the same answer is preferred to a randomized strategy, this may be found by converting the problem into a binary integer program. This specifies that only the values 0 or 1 may be assigned to the decision variables. An optimal solution of the binary integer problem has the property that for any fixed i, Pij is equal to 1 for exactly one value of j, and is equal to 0 for all other values of j. The result is a mapping of the set of locations onto itself. For a fixed j, the set Ij = {i : Pij = 1}, if nonempty, has the property that

Σ_{i∈Ij} ni ≥ s/ξ.

In other words, the patients are binned into a subset of the locations, the number and positions of the bins minimize the expected transition distance, and the total population assigned to each bin is at least s/ξ. In general, the optimal deterministic strategy moves patients farther than the optimal randomized strategy.
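A self-contained toy sketch of this deterministic variant, using scipy.optimize.milp (scipy ≥ 1.9) with binary decision variables; the two locations, populations, and bound are invented, and at this bound the identity assignment is infeasible, so the optimum bins every patient into the larger location:

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    n = np.array([60.0, 40.0])                  # hypothetical populations
    xi, s = 0.02, 1                             # privacy bound and patient count
    d = np.array([[0.0, 1.0], [1.0, 0.0]])      # inter-location distances
    c = (n[:, None] * d / n.sum()).ravel()      # objective 4.13 over P00, P01, P10, P11

    A_eq = np.array([[1.0, 1.0, 0.0, 0.0],      # each row of P sums to one (eq. 4.2)
                     [0.0, 0.0, 1.0, 1.0]])
    A_ub = np.zeros((4, 4))                     # privacy constraints (eq. 4.3)
    for i in range(2):
        for j in range(2):
            A_ub[2 * i + j, 2 * i + j] += 1.0
            for k in range(2):
                A_ub[2 * i + j, 2 * k + j] -= (xi / s) * n[k]

    res = milp(c, integrality=np.ones(4), bounds=Bounds(0, 1),
               constraints=[LinearConstraint(A_ub, -np.inf, np.zeros(4)),
                            LinearConstraint(A_eq, np.ones(2), np.ones(2))])
    P = res.x.reshape(2, 2)                     # both locations map to location 0,
                                                # since s/xi = 50 exceeds n[1] = 40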
It is also simple to add additional linear constraints to the problem. For example,
it is possible to guarantee that no case is assigned to its original location by specifying
that Pii = 0 for every i in the LP model. Although this would not increase the level
of privacy, it may assuage fears that original locations may be released. In general,
additional constraints increase the optimal value of the objective function.
4.3 Example
To illustrate the method, we determine an optimal strategy to randomize patients in
New York County census block groups with a maximum re-identification probability
of 0.00005. We find that this bound is achieved while moving patients an average
distance of only 265 meters (m), and we show that the privacy bound is robust to
inaccurate census counts. The strategy preserves privacy to a greater extent than
both aggregation by zip code and aggregation by the first three digits of zip code.
4.3.1 New York County census block groups
We consider de-identifying case locations in New York County, NY grouped by census block groups. A census block group is a small geographical unit typically containing approximately 1500 people [82]. According to the 2000 census, the 988 census block groups in New York County contain between 0 and 15,112 people (see Figure 4-2). We devise the optimal
strategy to de-identify one patient with a maximum re-identification probability of
0.00005. This is consistent with the spirit of the HIPAA legislation since the first
three digits of a zip code may be released if shared by at least 20,000 people. This is
also the optimal strategy to de-identify 10 patients with a maximum re-identification probability of 1/2000, 100 patients with a maximum probability of 1/200, or, more generally, 1 ≤ s ≤ 20,000 patients with a maximum probability of s/20,000. Transitions from any
census block group were restricted to its nearest 100 neighbors. The LP model was solved using CPLEX LP software, resulting in a 988 x 988 matrix of transition probabilities. Each matrix row contained the transitions from a fixed census block group to every other census block group; by constraint, at most 100 of these were nonzero.
Under the optimal strategy, the expected distance between a patient's original and
de-identified location is only 265 m. Three of the 988 matrix rows are illustrated in
figure 4-3. These show three possible configurations: patients are re-assigned to the
same census block group or one of a few neighboring census block groups; patients
are re-assigned to a single nearby census block group; and patients are moved to
one of several possible census block groups which do not include the original location.
Figure 4-2: Total population of each census block group in New York County, NY, according to the 2000 census.
Even from this limited subset, it is clear that the optimal strategy would be difficult to
devise by hand. In particular, the optimal transition probabilities are not a monotonic
or regular function of the distance between census block groups, such as a Gaussian
function.
In general, transitions are more likely to occur between nearby locations, and the
likelihood of transition declines to zero as the distance between the regions increases
(figure 4-4). The vast majority of the 98,800 transition probabilities are zero, indicating that transitions between most regions never occur; only 3155, or about 3.2%,
of the transition probabilities are non-zero. For a fixed i, at most eight outgoing
transition probabilities Pij were non-zero. Thus it is unlikely that the restriction to
100 such transitions affected the optimality of the result.
Figure 4-3: Transition probabilities for the optimal strategy to de-identify s < 20,000 patients from New York County, New York with a maximum re-identification probability of s/20,000. Transition probabilities from three of the 988 census block groups are shown, illustrating a few of the many possible transition distributions. The shading in region j represents the value of the probability P_ij of transitions into the region (legend bins: 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1). a) Patients in one census block group (purple asterisk) may remain there, or they may transition to one of several nearby block groups. b) All patients originally in one census block group (purple asterisk) are assigned to one neighboring block group. c) Patients are re-assigned from one block group (purple asterisk) to one of four nearby census block groups. No patients are re-assigned to the original census block group (i.e. P_ii = 0).
To examine the relationship between re-identification probability and the expected
distance moved by a patient, we calculated the optimal de-identification strategies for
a range of re-identification bounds. Because the total population summed over all
census block groups is 1,696,038, the minimum achievable re-identification probability
is s/1,696,038, or approximately s · 0.00000059. This corresponds to the complete randomization strategy of moving each person in a list of s < 1,696,038 people to census block group i with probability n_i/n. The corresponding expected transition distance is 6.4 km. As the
re-identification probability is increased, the optimal strategy moves the patients less
in expectation. The least populated non-empty census block group contains only
one individual, so the strategy of re-assigning patients to their original locations
has a re-identification probability of 1 (which would be realized if one patient in a
"de-identified" set came from that census block group) and an expected transition
distance of 0 km. The optimal strategies for de-identifying patients were calculated
for a range of re-identification probabilities between these two extremes, and the
expected distance moved by each patient is shown in figure 4-5.

Figure 4-4: Histogram of the distance in meters between original and de-identified locations for an individual randomly chosen from the population, under the optimal strategy to de-identify a set of s < 20,000 patients in New York County, New York with a maximum re-identification probability of s/20,000.
These optimal LP strategies move patients less than other HIPAA-compliant
strategies (figure 4-5). Creating a HIPAA limited data set by aggregating patients by
zip code moves patients an expected 519 m. The least populated zip code contains
884 people (excluding empty zip codes and one zip code containing only one person),
so there is a maximum re-identification probability of s/884 for a set of s < 884 patients under this strategy. Aggregating by the first three digits of zip code to create
a HIPAA non-identifiable data set moves patients an expected 3.9 km, and has a maximum re-identification probability of s/8188. (In this case, the aggregated data set would not qualify as non-identifiable under HIPAA since some three-digit zip code prefixes are shared by fewer than 20,000 people.) Figure 4-6 shows the spatial bins into which patients are aggregated by zip code and by first three zip code digits.
Figure 4-5: Relationship between the re-identification probability, the number s of patients, and the expected transition distance for the optimal LP strategy to de-identify patients by census block group in New York County, New York. The horizontal axis shows the re-identification probability divided by s, from 1/1,696,038 to 1/2. As the level of privacy protection decreases, patients are moved a smaller distance in expectation. Aggregation by zip code (green diamond) and first three zip code digits (magenta circle) are suboptimal strategies.
4.3.2 Sensitivity analysis
Inaccuracies in the census estimates for each location used as input to the LP model
may affect the re-identification probability. This happens when the number of people
in a location is overestimated. It is elementary to show that overestimating all the
census numbers by a factor of f leads to a re-identification probability of f · ξ instead of ξ in the worst case. For example, if every region in the New York analysis had only
half the people reflected by the census, the re-identification probability in that analysis
would be 0.0001 instead of 0.00005. However, in practice, overestimating the census
numbers may have very little effect on the re-identification bound. To illustrate this,
we randomly chose 5% of the census blocks to have actual populations 10% below the
census estimates. For each i and j, we re-calculated the re-identification probability
that a patient reported to be in location j of the de-identified set corresponded to
any specific individual in region i using the strategy calculated above to de-identify
individuals with probability 1/20,000. We found that only 1839 of the 98,800, or fewer than
2%, of the re-identification probabilities violated the pre-specified bound of 0.00005.
The maximum probability was 0.000052.
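A minimal sketch of this check, again assuming the constraint form of equation 4.13 so that the re-identification probability for origin i and destination j is r_ij = P_ij / Σ_k n_k P_kj evaluated at the true populations (names are illustrative):

    import numpy as np

    def reidentification_violations(P, n_true, xi):
        """Count the probabilities r_ij = P_ij / sum_k n_k P_kj that exceed the
        bound xi when the true populations differ from the census counts."""
        denom = n_true @ P                       # sum_k n_k P_kj per destination j
        cols = denom > 0                         # destinations receiving any population
        r = P[:, cols] / denom[cols][None, :]
        return int((r > xi).sum()), float(r.max())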
4.4 Discussion
In the current climate of public concern for patient privacy and legislation imposing strict controls on the dissemination of patient-identifiable data, new strategies
for de-identifying data sets while preserving information for disease surveillance and
epidemiology are needed. It is imperative that strategies quantify the level of disclosure risk. The LP technique presented here for de-identifying spatial data has several
benefits over existing methods. First, the user-specified level of privacy protection
afforded by the method is mathematically well-defined. This re-identification probability is simply the maximum probability that any patient in the de-identified data set
corresponds to any single individual in the population. This re-identification probability holds even if the exact randomized strategy is known to the data recipients.
In other words, even knowledge of the complete set of transition probabilities {Pij}
would not help re-identify patients beyond the pre-specified probability.
Second, the strategy moves patients a smaller distance than the common practice
of aggregating by zip code, and it is far superior to the strategy suggested in the
HIPAA legislation of aggregating by the first three digits of zip code. In fact, it
moves patients a smaller distance, on average, than every other possible strategy,
either deterministic or random, obeying the same re-identification bound that can be
expressed as a matrix of transition probabilities.
Third, the technique is flexible, and can be extended based on the requirements of
the user to minimize other objective functions or capture other constraint equations.
It may also be used to calculate an optimal deterministic de-identification strategy,
which always assigns patients to the same locations.
In addition, the LP strategy does not assign cases to unrealistic places, such as
bodies of water or other uninhabited regions. While this does not help in scientific
exploration of the distribution of the disease using the de-identified data set, it makes
for more attractive maps of the de-identified locations.
The accuracy of the re-identification bound depends on a few assumptions. The
underlying population size at each location must be known in advance, although the
method appears to be robust to small inaccuracies. We also make the assumption
that no other information is available to influence the a priori probability that any
individual in the population has the disease. If any other information is available, it
must be incorporated into the problem. Otherwise, the re-identification probability
bound will not be correct, and privacy will not be guaranteed. For example, if the
final version of the data set is to contain both the location and the race of each patient,
then a de-identification strategy must be developed for each race represented. The
population sizes ni used in equations 4.3 and 4.13 must represent the number of people
of that race. Similarly, if age, sex, or any other identifier is released, a new LP model
reflecting the sub-population sizes must be solved for each value of the identifier. This
is not always possible since stratified population data may not be available. If the
population sizes are unknown, a lower bound on each population size will suffice to
ensure privacy, but the solution may not be optimal in expected distance.
For individual addresses, we recommend using a population size of 1 for each address in the LP model. This limits the probability of associating any household with a
case to the re-identification probability. However, the public may not feel comfortable
with any addresses being released, even if the probability that an individual at that
address has the disease is very small. An alternative is to use small aggregations such
as city blocks or census block groups, as in the example presented in the previous
section.
The measure of privacy protection proposed here, equal to the probability that any
individual in the underlying population is among the de-identified patients, captures
what is essentially important to a patient: "Will I be identified as having a disease as a
result of the disclosure?" This measure is difficult to compute for previous strategies,
since it may depend on variables such as the number of patient records and the study
region itself. Several other measures of confidentiality have been proposed. These
include Spruill's measure [81], equal to the proportion of records in the de-identified
set that lie closer to their original location than to all other locations in the original
set. The exact value of the measure for the LP strategy depends not only on the privacy bound ξ, but also on the number and locations of original records and on the particular values for destination locations drawn from the multinomial distribution. For low values of ξ or a small number s of records, Spruill's measure is close to one. As ξ tends toward 1, Spruill's measure approaches 1. Increasing s also decreases Spruill's measure. The interpretation of this is unclear because Spruill's measure does
not always capture the intuition about privacy. For example, creating a de-identified
set by shuffling the exact locations of all patients in the original set measures well
by Spruill, but is clearly unacceptable for privacy protection. Conversely, assigning
completely random locations to de-identify a data set of two patients measures poorly
by Spruill, but would certainly preserve privacy.
Armstrong et al. also proposed four other measures of confidentiality. The first of
these is a qualitative measure of vulnerability to geographical knowledge [81]. The
LP strategy has no disclosure risk by this measure, since knowledge of all the possible
locations does not decrease the re-identification probability. The second measures the
ability to infer from the de-identified set regions within the map having a high disease
risk. Like the de-identification strategies of aggregation and random perturbation, the
LP method may reveal regions of high disease risk. However, this is a strength of
the method, since the de-identified set may be used for disease mapping studies to
depict variation in the spatial risk. The third measures the ability to re-identify all
the patients, given the identity of some of the patients, and the final confidentiality
measure is the minimum number of unlabeled locations from the original data set that
can be used to compromise the entire de-identified set. There is no risk under the LP
strategy by these measures; since patients are randomly moved independently of each
other, relinking some of the patients cannot be used to compromise the identities of
others.
Figure 4-6: Aggregation of patients in New York County, New York by zip code and by first three zip code digits. Top) Census block groups have been aggregated by zip codes. Each census block group was assigned to the zip code containing its centroid. The expected distance moved by a randomly selected member of the population is 519 m, and the maximum probability that an individual is among a set of s de-identified patients is s/884. Bottom) Census block groups are aggregated by the first three zip code digits. The expected distance moved is 3.866 km, and the re-identification probability is s/8188.
Chapter 5

Automated real time constant-specificity surveillance for disease outbreaks

Originally published as: Wieland SC, Brownstein JS, Berger B, Mandl KD. Automated real time constant-specificity surveillance for disease outbreaks. BMC Medical Informatics and Decision Making. 2007;7(1):15.
5.1 Introduction
The release of anthrax in 2001, the Severe Acute Respiratory Syndrome (SARS)
outbreaks in China, Hong Kong and Toronto in 2002, and the emergence of new
diseases such as West Nile virus have underscored the need for automated, real-time
detection of outbreaks. Several such detection systems have been deployed in recent
years at the hospital [83, 84], city [85, 86, 87], regional [88, 89, 90] and national [91,
92, 93] levels. Many systems use time series algorithms to detect aberrant conditions,
such as CuSUM [94, 95, 96], variants of the Serfling method [85], multiresolution
wavelet-based models [97], and trimmed seasonal models [98].
An outcome of any of these statistical methods - whether or not there is an alarm on any given day - is uninformative without an estimate of the likelihood that an alarm signals a true outbreak. This likelihood depends in part on the specificity of the
detection method, equal to the proportion of non-outbreak days for which no alarm
is raised. The specificity is related to the false alarm rate by the simple equation
false alarm rate = 1 - specificity.
Even small changes in the specificity of the detection method may have a large impact
on the likelihood of a true outbreak. Despite the importance of knowing the specificity,
analysis of the specificity of outbreak detection algorithms has been rudimentary, and
it is common practice to report one average value of specificity that is assumed to
reflect the true specificity on any day of the year or week. Implicit in this is the
assumption that the specificity is constant as a function of time. If this assumption is
incorrect - if instead the specificity of an outbreak detection system is a function of
time that deviates significantly from its average value - then on any given day, a public
health practitioner cannot know the specificity of the system or the related probability
that there is a disease outbreak, and therefore cannot respond appropriately to alarms.
The sensitivity of a method, or proportion of outbreaks detected, is negatively
associated with its specificity. Unlike the specificity, however, it cannot be evaluated
from non-outbreak data. This is because in addition to its dependence on the specificity, it also depends on the characteristics of an outbreak, including its duration and
magnitude. Hence the trade-off between sensitivity and specificity must be carefully
considered in the context of the outbreak type of interest to ensure that both fall in
a useful range.
We sought to characterize changes in the specificity of alarms produced by standard time series outbreak detection methods as a function of time. We further explored how these changes affect the sensitivity of detection methods to several outbreak types. We introduced a statistical technique that allows us to model properties
of time series not captured by traditional models, developing an outbreak detection
strategy with constant specificity that may be used by public health practitioners for
biosurveillance.
5.2 Methods

5.2.1 Data
Data were collected retrospectively in the emergency department (ED) of an urban
pediatric tertiary care teaching hospital. All patients with respiratory presenting
complaints seen in the ED between August 1, 1992 and July 30, 2004 were included
in the study. The data were divided into a six-year training period, and a test
period consisting of the final six years. ED chief complaints were selected at triage
from among a constrained list, and classified as respiratory or non-respiratory using a
previously validated method [99]. The study was approved by the institutional review
board.
During the study period, approximately 137 patients were seen each day in the
ED. The number of daily visits for respiratory complaints varied from 2 to 78. The
mean number of respiratory visits was 21.05, and the standard deviation was 9.03
(see figure 5-1). These data and other hospital visit time series have previously been
shown to depend significantly on the day of the week and the season of the year
[98, 100, 101, 102].
5.2.2 Time series algorithms
We implemented five traditional time series models used for outbreak detection: a
simple autoregressive model, a Serfling model, the trimmed seasonal model, a wavelet-based model, and a generalized linear model. In addition, we introduced a model
of both the expectation and the variance based on generalized additive modeling
techniques. The input to each algorithm was a time series of historical daily ED
respiratory visit counts, and each returned a threshold number of visits for the day
immediately following the historical period. An alarm occurred when the actual
number of visits exceeded the threshold.
Figure 5-1: Emergency department visits for respiratory presenting complaints, August 1, 1992 - July 30, 2004. Daily time series showing the number of patients presenting with respiratory complaints to the emergency department during a 12-year period.

Autoregressive model. The autoregressive model predicted the number of ED respiratory visits using linear regression on the number of visits during the previous seven days:

E_t = a_0 + \sum_{k=1}^{7} a_k V_{t-k},    (5.1)

where E_t is the predicted number of visits on day t, V_{t-k} is the actual number of visits on day t - k, and the coefficients a_k were fitted by least squares regression using training data.
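A least-squares fit of equation 5.1 takes only a few lines; this sketch uses numpy and assumes visits is a one-dimensional array of daily counts (names are illustrative):

    import numpy as np

    def fit_autoregressive(visits, p=7):
        """Fit E_t = a_0 + sum_{k=1}^{p} a_k V_{t-k} by least squares."""
        X = np.column_stack([np.ones(len(visits) - p)] +
                            [visits[p - k:len(visits) - k] for k in range(1, p + 1)])
        y = visits[p:]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coef  # a_0, a_1, ..., a_p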
Serfling method. The Serfling method and its variants have been extensively
used for surveillance of influenza and other diseases [85, 103, 104]. Our implementation modeled the number of daily visits using linear regression on sine and cosine
terms having yearly periodicities to capture seasonal effects, categorical variables for
the day of week, and linear and quadratic terms. Under this model, the predicted
number of visits on day t was
E_t = \sum_{k=0}^{6} a_k \delta_{k,dow(t)} + a_7 t + a_8 t^2 + a_9 \sin\left(\frac{2\pi \, doy(t)}{365}\right) + a_{10} \cos\left(\frac{2\pi \, doy(t)}{365}\right),    (5.2)

where dow(t) is the day of the week from 0 to 6, doy(t) is the day of the year from 1 to 365, and the Kronecker delta function \delta_{x,y} is equal to 1 when x = y and 0 otherwise.
To calculate the day of the year during leap years, each day after February 28 was
treated as though it occurred on the previous day.
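One way to realize equation 5.2 is ordinary least squares on an explicit design matrix; the following sketch (with illustrative names) builds the day-of-week indicators, trend terms, and yearly harmonics:

    import numpy as np

    def serfling_design(t, dow, doy):
        """Design matrix for equation 5.2: day-of-week indicators, linear and
        quadratic trend, and yearly sine/cosine terms."""
        return np.column_stack([
            np.eye(7)[dow],                    # delta_{k,dow(t)} columns
            t, t ** 2,                         # linear and quadratic trend
            np.sin(2 * np.pi * doy / 365),
            np.cos(2 * np.pi * doy / 365),
        ])

    # coef, *_ = np.linalg.lstsq(serfling_design(t, dow, doy), visits, rcond=None)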
Trimmed seasonal model. The trimmed seasonal model is used in the AEGIS
system [105] for statewide real-time population health monitoring, and was implemented as previously described [98]. Beginning with training set data, the average
number of visits was calculated and subtracted from the data. From this, the average
for each day of the week was calculated and again subtracted. To remove seasonal
effects, the average for the day of the year was calculated after excluding the highest
and lowest 25% of values for each day of the year, and again subtracted from the data.
A first-order autoregressive, first-order moving average (ARMA) model was then fitted to the errors. The predicted number of visits Et was calculated by summing the
overall average, the average for the day of the week, the average for the day of the
year, and the ARMA prediction for day t.
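A sketch of the successive subtractions follows (the concluding ARMA(1,1) fit to the residual errors is omitted, days of the year are assumed to run from 1 to 365, and names are illustrative):

    import numpy as np
    from scipy.stats import trim_mean

    def trimmed_seasonal_components(visits, dow, doy):
        """Remove the overall mean, day-of-week means, and 25%-trimmed
        day-of-year means in succession, returning components and residuals."""
        overall = visits.mean()
        r = visits - overall
        dow_avg = np.array([r[dow == d].mean() for d in range(7)])
        r = r - dow_avg[dow]
        # Exclude the highest and lowest 25% of values for each day of the year.
        doy_avg = np.array([trim_mean(r[doy == d], 0.25) for d in range(1, 366)])
        r = r - doy_avg[doy - 1]
        return overall, dow_avg, doy_avg, r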
Wavelet model. The wavelet-based model was patterned after the wavelet anomaly detector developed by Zhang et al. [97]. The method used the number of daily visits in a training set, V_1, V_2, ..., V_{t-1}, to produce a prediction for day t. It consisted of the following steps:
1. A low-frequency wavelet component of the visit signal having periodicity of
more than 32 days was calculated. This period was selected by Zhang et al. because it removes seasonal effects while preserving higher-frequency information,
and because it is a power of 2, which is mathematically convenient for wavelet
analysis. We used the Haar wavelet in our implementation of the model [106].
2. This low-frequency baseline was subtracted from the original signal, producing
a residual for each day in the training set.
3. The predicted number of visits on day t was the value of the low-frequency
component on the previous day.
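Using the PyWavelets package, the low-frequency baseline of step 1 can be sketched as follows; zeroing the five finest detail levels removes components with periods of roughly 2 through 32 days (the training length is assumed to be a power of two, and names are illustrative):

    import numpy as np
    import pywt

    def lowfreq_baseline(visits):
        """Haar wavelet approximation retaining components with periodicity
        greater than about 32 days (details at levels 1-5 are zeroed)."""
        coeffs = pywt.wavedec(visits, "haar", level=5)
        coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
        return pywt.waverec(coeffs, "haar")[:len(visits)]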
Daily alarm thresholds for the autoregressive, Serfling, trimmed seasonal, and
wavelet-based models were calculated as the sum of the expected number of visits
and a multiple λ of the standard deviation of the model residuals on the historical training data. The value of λ was an adjustable parameter that affected the specificity
of each model.
Generalized linear model. The generalized linear model consisted of a Poisson
distribution function, an identity link function, and a linear predictor that included
day of the week, month of the year, holiday and linear trend terms:
E_t = \beta_0 + \sum_{k=0}^{5} \beta_{k+1} \delta_{k,dow(t)} + \sum_{k=0}^{10} \beta_{k+7} \delta_{k+1,moy(t)} + \beta_{18} I_{holiday}(t) + \beta_{19} t,    (5.3)

where dow(t) and \delta_{x,y} are described in equation 5.2, moy(t) is the month from 1 (January) to 12 (December), and I_{holiday}(t) is an indicator function equal to 1 if
day t is a holiday, and 0 otherwise. An alarm sounded if the value of the cumulative
distribution function of a Poisson random variable with mean Et exceeded the desired
specificity. This model was found by Jackson et al. [100] to have superior sensitivity
to a variety of outbreak types compared to several control-chart and exponential
weighted moving average models.
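The corresponding alarm threshold is the largest count whose Poisson cumulative distribution function does not exceed the desired specificity; a sketch using scipy.stats:

    from scipy.stats import poisson

    def glm_threshold(E_t, spec):
        """Largest integer A_t with Poisson CDF(A_t; E_t) <= spec;
        an alarm sounds when the observed count exceeds A_t."""
        k = int(poisson.ppf(spec, E_t))  # smallest k with CDF >= spec
        while k >= 0 and poisson.cdf(k, E_t) > spec:
            k -= 1
        return k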
Expectation-variance model. In addition, we developed and implemented a
novel method for outbreak detection that captures changes in the ED visit standard
deviation, as well as in the expected number of visits. In contrast to previous surveillance models, which assumed that the variance is constant or proportional to the
mean, it did not assume a functional form for the variance. Instead, the dependence
of both the mean number of visits and the variance on time was modeled explicitly. In other
applications, several statisticians have modeled the variance as a function of the same
or additional covariates used to model the mean using iterative successive relaxation
procedures (see, for example, [107] and [108]). We employed a simplified procedure
involving two distinct models: an expectation model of the daily expected number Et
of respiratory ED visits, and a variance model of the daily variance at2 of respiratory
ED visits. The number of daily visits is then modeled as a Gaussian with mean Et
and variance at2. Both components are generalized additive models (GAM's): nonparametric extensions of linear regression models having several variants depending
on the choice of smoothing technique, the procedure used to find estimates of the nonparametric functions for multivariate models, and the number of degrees of freedom
for each covariate [109, 110].
The GAM of the expectation accepted historical daily visit counts as input, and
modeled them as a function of linear time to capture a long-term trend, the day of
the year to account for seasonal trends, and the day of the week:
E_t = f_trend(t) + f_doy(doy(t)) + f_dow(dow(t)).    (5.4)
No smoothing was performed for the day-of-week term, since many replicates were
available for each day of the week. A Gaussian kernel smoother was used for the trend
term, and a Gaussian kernel smoother with circular boundaries was used for the day-of-year term since the day of year is a periodic covariate. Although a Gaussian was selected
for its ease of interpretation, in general the choice of kernel function has little effect
on the model compared to the choice of bandwidth [109]. Optimal bandwidths of the
two Gaussian smoothers were estimated by a two-step procedure. First, to optimize
the bandwidth of the day-of-year Gaussian, the mean predictive squared error (PSE)
on a training set consisting of the first six years of ED visit data was calculated for
a range of bandwidths using 10-fold cross-validation for a model containing only the
day-of-week and day-of-year covariates. The bandwidth minimizing the mean PSE
was chosen, corresponding to a Gaussian distribution with a standard deviation of
five days. Next, the bandwidth of the kernel used for the trend term was chosen by
using 10-fold cross-validation to estimate the mean PSE on the training set of a model
containing all three covariates for a range of trend bandwidths, using the previously
determined optimal bandwidth of the day-of-year kernel. The minimizing bandwidth
was again chosen, corresponding to a standard deviation of eight days. Because the
model contained multiple nonparametric functions, an iterative backfitting procedure
was used to estimate each until the model converged [109].
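The backfitting loop can be sketched as follows, with a Nadaraya-Watson Gaussian kernel smoother standing in for the smoothers described above (bandwidths follow the values estimated in the text; all names are illustrative):

    import numpy as np

    def kernel_smooth(x, y, bw, period=None):
        """Gaussian-kernel (Nadaraya-Watson) smoother; 'period' wraps a
        circular covariate such as the day of year."""
        d = x[None, :] - x[:, None]
        if period is not None:
            d = (d + period / 2) % period - period / 2
        w = np.exp(-0.5 * (d / bw) ** 2)
        return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

    def backfit_expectation(y, t, doy, dow, bw_trend=8.0, bw_doy=5.0, n_iter=20):
        """Backfitting for E_t = f_trend(t) + f_doy(doy(t)) + f_dow(dow(t))."""
        mean = y.mean()
        f_trend = np.zeros(len(y))
        f_doy = np.zeros(len(y))
        f_dow = np.zeros(len(y))
        for _ in range(n_iter):
            f_trend = kernel_smooth(t, y - mean - f_doy - f_dow, bw_trend)
            f_trend -= f_trend.mean()
            f_doy = kernel_smooth(doy, y - mean - f_trend - f_dow, bw_doy, period=365)
            f_doy -= f_doy.mean()
            r = y - mean - f_trend - f_doy   # day-of-week term: plain averages
            f_dow = np.array([r[dow == d].mean() for d in range(7)])[dow]
            f_dow -= f_dow.mean()
        return mean, f_trend, f_doy, f_dow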
The residuals of the expectation GAM on the historical data were squared and
used as the input to the variance GAM. This GAM was also a function of linear time,
day-of-year, and day-of-week variables:
σ_t^2 = g_trend(t) + g_doy(doy(t)) + g_dow(dow(t)).    (5.5)
The Gaussian smoothers were chosen to minimize the PSE on the training data set
using the same procedure as above. The optimal smoothers corresponded to Gaussian
distributions with standard deviations of 6 and 253 days for the day-of-year and trend
terms, respectively.
To set the alarm threshold for a given day, a composite expectation-variance
model consisting of the two GAMs was trained on the previous six years of data. The alarm threshold for the next day was calculated as the sum of the expected number of ED visits, as predicted by the expectation GAM, and a multiple λ of the expected standard deviation of ED visits, as predicted by the variance GAM:

A_t = E_t + λ σ_t    (5.6)
    = f_trend(t) + f_doy(doy(t)) + f_dow(dow(t))    (5.7)
      + λ \sqrt{g_trend(t) + g_doy(doy(t)) + g_dow(dow(t))}.    (5.8)
The value of λ was an adjustable model parameter.
All models were implemented using the Matlab software package, Version 7.0.1
[111]. The Matlab system identification, statistics and wavelet toolboxes were used
for the wavelet, generalized linear, and expectation-variance models.
5.2.3 Model predictions based on historical data
We used the expectation-variance model to generate alarm thresholds for each day
during the test period from August 1, 1998 to July 30, 2004, which comprised the
last six years of historical data. All of the available data could not be used for
testing because a training period was required. To predict each threshold, the model
was trained on the previous six years of data, ending the day before the day to be
predicted, and was blind to the actual number of ED visits on the prediction day.
The backfitting procedures to estimate the model successfully converged for each day
of the study period. The model predictions for both the expected number of patients
and the variance were always positive numbers throughout the study period. The
average absolute predictive error was approximately four patients during the study
period.
For each day, an alarm threshold was produced for each desired outbreak detection
specificity between 0.01 and 0.99 in 0.01 increments. This was achieved by varying the threshold parameter λ appropriately. For example, to generate an alarm threshold with specificity s on day T, the model was trained on the historical visit data, V_{T-2191}, ..., V_{T-1}. This generated model estimates for the expected number of visits for each day, E_{T-2191}, ..., E_{T-1}, E_T, as well as estimates for the expected standard deviation of visits, σ_{T-2191}, ..., σ_{T-1}, σ_T. The parameter λ was chosen so that the fraction of historical days for which the Z-score was at most λ was as close as possible to the desired specificity s. That is, λ was chosen to have the property that

\#\{t : T - 2191 \leq t \leq T - 1 \text{ and } V_t - E_t \leq λ σ_t\} \approx 2191 \cdot s.    (5.9)

The predicted threshold for day T was E_T + λ σ_T.
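Equivalently, λ is an empirical quantile of the historical Z-scores; a sketch (names illustrative):

    import numpy as np

    def choose_lambda(V, E, sigma, spec):
        """lambda such that the fraction of training days with Z-score
        (V_t - E_t) / sigma_t at most lambda is approximately spec (eq. 5.9)."""
        return np.quantile((V - E) / sigma, spec)

    # Threshold for the prediction day T:  A_T = E_T + choose_lambda(...) * sigma_T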
Alarm thresholds for each day of the test period and each desired specificity were
similarly calculated for the autoregressive, Serfling, trimmed seasonal, and wavelet
models. The alarm threshold for the generalized linear model was the largest integer
At for which the cumulative distribution function of a Poisson random variable with
mean Et was at most s. With the exception of wavelet model thresholds, all alarm
thresholds were calculated using the six years of visit data immediately preceding the
prediction day. The wavelet model requires a training period having length equal to
a power of two, so 2048 days of training data were used.
5.2.4 Detecting variability in the specificity
To determine whether a given model at a particular mean specificity had constant
specificity as a function of the day of the week, we tabulated the proportion of alarm
and non-alarm days at that mean specificity by day of the week. A chi-square analysis was performed under the null hypothesis that all days of the week had an equal
fraction of alarm days. A p-value less than 0.05 indicated that the specificity was dependent on the day of the week. To determine whether the specificity was constant as
a function of month and year, we performed similar chi-square analyses after tallying
alarm days by month of the year and by calendar year of the study, respectively.
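A sketch of the day-of-week test using scipy.stats.chi2_contingency, given a boolean alarm indicator for each non-outbreak day (names are illustrative):

    import numpy as np
    from scipy.stats import chi2_contingency

    def dow_specificity_pvalue(alarm, dow):
        """Chi-square test of the null hypothesis that every day of the week
        has an equal fraction of alarm days."""
        table = np.array([[np.sum(alarm[dow == d]), np.sum(~alarm[dow == d])]
                          for d in range(7)])
        _, p, _, _ = chi2_contingency(table)
        return p   # p < 0.05: the specificity depends on the day of the week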
5.2.5 Simulated outbreaks
In order to ascertain the sensitivity of the models to outbreaks, we superimposed three
synthetic outbreaks on the test data set: a flat outbreak of five additional patients per
day for seven days, a linear outbreak which increased from one to five patients over
five days, and a spike outbreak of 10 additional patients in one day. For each model,
each outbreak type, and each day of the test period, we created a new semisynthetic
data set by adding an outbreak beginning on that day to the original data set. We
then made an alarm threshold prediction for each of the outbreak days, and for each
desired specificity between 0.01 and 0.99, based on training using the semisynthetic
data set.
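The three outbreak shapes are simple to superimpose; a sketch (assuming the outbreak fits within the series, with illustrative names):

    import numpy as np

    def inject_outbreak(visits, start, kind):
        """Superimpose a synthetic outbreak beginning at index 'start'."""
        v = visits.astype(float)
        if kind == "flat":        # five additional patients per day for seven days
            v[start:start + 7] += 5
        elif kind == "linear":    # one to five additional patients over five days
            v[start:start + 5] += np.arange(1, 6)
        elif kind == "spike":     # ten additional patients in one day
            v[start] += 10
        return v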
5.2.6 Estimating sensitivity, specificity, and timeliness of detection
The actual mean specificity for one model at each desired input specificity was determined by running the model on the historical data set. Specificity was estimated
by calculating the fraction of days without alarms for each day of the week, month
of the year, or calendar year. Sensitivity calculations used the results of applying
each of the models to the semisynthetic data sets. The sensitivity was calculated
as the fraction of outbreaks for which there was at least one alarm day. Exact 95
percent binomial confidence intervals were calculated for each estimate of sensitivity
and specificity. Timeliness of detection was evaluated for each method by calculating
the mean lag in days between the start of a flat outbreak and the first alarm sounded.
Missed outbreaks, for which no alarms were sounded on any day of the outbreak,
were excluded from timeliness calculations. An alarm sounding on the first outbreak
day corresponded to a lag of zero. Timeliness was calculated at the
benchmark specificity values of 0.85 and 0.97.
5.2.7 Comparing outbreak detection among models
To compare the outbreak detection performance of the expectation-variance model
with the traditional models, receiver-operator (ROC) curves were constructed for
all models. ROC curves show the dependence of the mean sensitivity on the mean
specificity, and the area under the ROC curve is an indicator of overall performance.
The area was estimated by the trapezoidal method.
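Given paired mean sensitivity and specificity estimates over the range of input specificities, the trapezoidal area can be computed directly; a sketch:

    import numpy as np

    def roc_area(sens, spec):
        """Trapezoidal area under the ROC curve (sensitivity vs. 1 - specificity)."""
        fpr = 1.0 - np.asarray(spec, dtype=float)
        order = np.argsort(fpr)
        x, y = fpr[order], np.asarray(sens, dtype=float)[order]
        return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))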
5.3 Results

5.3.1 Evaluation of specificity trends over time
As suspected, the specificity of the five standard models was not constant over time.
Hypothesis testing indicated that the specificity of the Serfling, trimmed seasonal
and generalized linear models varied with the study calendar year and study month
(p < 0.05) over a range of mean specificities between 0.50 and 0.99. The autoregressive
model demonstrated a variable specificity with the study month and day of the week
(p < 0.05) for the same range of mean specificities, and the wavelet model had variable
specificity (p < 0.05) on all three time scales (figure 5-2). Several trends in the specificity
were apparent when the analysis was limited to particular values of mean specificity.
For example, at a mean specificity of 85 percent, corresponding to approximately
one false alarm each week, the autoregressive, Serfling, trimmed seasonal and wavelet
models had highest specificity in June and July and low specificity during the winter
months.
The specificity of the autoregressive and wavelet models was highest in
the middle of the week and lowest on Sunday, and the Serfling, trimmed seasonal
and generalized linear models had higher specificity during certain study years (figure 5-3).
Similar trends were observed at other mean specificity values, including 0.90, 0.95,
and 0.97 (data not shown).
By contrast, the expectation-variance model specificity was constant as a function
of the study year, study month, and the day of the week. Hypothesis testing resulted
in a p-value above 0.05 for the entire range of input specificities on all three time
scales, indicating that there was no evidence to suggest that the specificity was non-constant on any time scale (figure 5-2).
5.3.2 Comparison of sensitivity and timeliness of new and traditional methods
The expectation-variance model usually outperformed traditional approaches in terms of sensitivity. The area under the expectation-variance model ROC curve was equal to or greater than that of the five comparison models for all three outbreak types (table 5.1).
Figure 5-2: Evaluating variability in specificity on three time scales. Plots of p-values for the chi-square test over various time scales for the five comparison models over a range of mean specificity values from 0.50 to 0.99, as well as p-values for the expectation-variance model; the cutoff p = 0.05 is marked. Top: calendar year of study. Middle: month of year. Bottom: day of week. The expectation-variance model has a p-value above 0.05 over the entire range of mean specificity values for all three time scales, so the null hypothesis of constant specificity is not rejected. All plots not shown are highly significant (p < 0.001) for non-constancy.
Figure 5-3: Average specificity trends over time. Average specificity for each calendar year (1998-2004), month, and day of week for the five comparison methods during the study period. Data shown were recorded for each model implemented at 85% mean specificity. Similar trends were observed for all methods at 97% mean specificity (data not shown).
Table 5.1: ROC curve areas for traditional and expectation-variance detection models
applied to three different types of outbreaks superimposed on respiratory visits to an
urban pediatric ED, August 1998 - July 2004.
Detection method        Flat outbreak   Linear outbreak   Spike outbreak
Autoregression               0.94            0.88             0.90
Serfling                     0.93            0.89             0.88
Trimmed seasonal             0.95            0.91             0.89
Wavelet                      0.93            0.87             0.86
Generalized linear           0.95            0.91             0.91
Expectation-variance         0.95            0.91             0.91
Table 5.2: Mean lag in detecting outbreaks of five additional patients per day superimposed on the pediatric ED respiratory visits, August 1998 - July 2004. Detection lag calculations exclude undetected outbreaks. Hence the sensitivity of the method must be considered when interpreting the detection lag.

Detection method        Mean specificity   Mean sensitivity   Mean detection lag (days)
Autoregression                0.97               0.40                  2.26
Serfling                      0.97               0.36                  2.37
Trimmed seasonal              0.97               0.42                  2.26
Wavelet                       0.98               0.38                  2.43
Generalized linear            0.95               0.68                  1.93
Expectation-variance          0.97               0.58                  1.96
The expectation-variance method also performed well in terms of earliness of detection. At a benchmark mean specificity of approximately 97 percent, it detected a
seven-day outbreak consisting of five additional patients each day with a shorter lag
than the autoregressive, Serfling, trimmed seasonal, and wavelet models (table 5.2).
The expectation-variance model also had earlier detection than these models at 85
percent specificity (data not shown).
5.3.3 Temporal sensitivity trends
The sensitivity of outbreak detection depends on the size and shape of an outbreak,
as well as on the amount of noise in the ED utilization signal. Thus even when the
specificity is held constant, it is natural for the sensitivity to vary with the season,
day of the week, and trend. The ED visit signal had the least noise in the summer and
the most noise in the winter (figure 5-4). Hence the signal-to-noise ratio was highest in the
summer for any fixed type of outbreak, and the sensitivity of any reasonable detection strategy should theoretically be greater during the summer than in the winter.
Summer and winter ROC curves for the expectation-variance and five comparison
methods confirmed that summer sensitivity was greater than winter sensitivity when
the specificity was held fixed (figure 5-4, insets). However, at mean specificity values of 85
and 97 percent, plots of sensitivity over time for the autoregressive, Serfling, trimmed
seasonal and wavelet models showed a paradoxical increase in sensitivity to synthetic
outbreaks during winter months compared to summer months (figure 5-4). These seemingly
contradictory results occurred because the mean specificity of these four comparison
models was not the actual specificity during either the summer or winter. The specificity was significantly higher during the summer, corresponding to a shift to the left
along the summer ROC curve and a concomitant decline in summer sensitivity. The
opposite occurred in winter. This anomaly was corrected by the expectation-variance model (figure 5-4), since it operated at the same specificity during all seasons. The generalized linear model exhibited variable specificity by month, but its specificity was not highest during the summer months (figure 5-3), and hence it also had greater summer
sensitivity than winter sensitivity.
5.4 Discussion
We found that the specificity of outbreak detection was not constant for five traditional algorithms. This is important because having a standardized interpretation of
the statistical characteristics of an outbreak detection test, including the specificity,
aids public health practitioners in making rational decisions regarding resource allocation in the event of an alarm. The positive predictive value (PPV) of an alarm, the
probability that an alarm signals a real outbreak, bears directly on the priority and
extent of response required.

Figure 5-4: Seasonal sensitivity trends. Average sensitivity for each month of the study period for the autoregressive (left), trimmed seasonal (center), and expectation-variance (right) models when applied to data containing a superimposed spike outbreak of 10 additional patients during one day. Data shown were collected at a mean specificity of 97%. The sensitivity of the trimmed seasonal and autoregression models is higher during the winter than during the summer. Sensitivity is higher during the summer than during the winter for the expectation-variance model. July receiver-operator (ROC) curves lie below February ROC curves for all three models (insets). Similar trends were observed for flat and linear outbreaks.

The PPV is related to the specificity by the equation

PPV = \frac{sensitivity \cdot p}{sensitivity \cdot p + (1 - specificity) \cdot (1 - p)},    (5.10)
where p is the prior probability of an outbreak. Because the specificity of an alarm
strategy affects its PPV, it is crucial to have an accurate estimate of the specificity on
any particular day. Even small differences in the specificity may have a great impact
on the PPV; an alarm strategy at 95 percent specificity may have a PPV nearly twice
as high as the same strategy at 90 percent specificity, depending on the nature of the
outbreak considered and the sensitivity of the system. A public health practitioner
responding to an alarm in the first case may wish to devote twice as many resources
to investigating the alarm than in the second case.
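A one-line function makes the point concrete; with, say, p = 0.001 and sensitivity 0.5 (illustrative values), the PPV at 95 percent specificity is roughly twice that at 90 percent:

    def ppv(sens, spec, p):
        """Positive predictive value of an alarm (equation 5.10)."""
        return sens * p / (sens * p + (1 - spec) * (1 - p))

    # ppv(0.5, 0.95, 0.001) is approximately 0.0099;
    # ppv(0.5, 0.90, 0.001) is approximately 0.0050.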
The specificity also affects the overall cost associated with a surveillance model.
Let C_TP, C_FP, C_TN and C_FN denote the costs associated with true positive alarms, false positive alarms, true negatives, and false negatives, respectively. Then the expected total cost of an alarm strategy on a given day is a weighted sum of these costs:

E[cost] = C_TP \cdot sens \cdot p + C_FN \cdot (1 - sens) \cdot p + C_FP \cdot (1 - spec) \cdot (1 - p) + C_TN \cdot spec \cdot (1 - p).    (5.11)
Lowering the specificity contributes to the cost due to fruitlessly investigating more
false positive alarms, reflected in the third summand of the equation. At a specificity
of, for example, 99%, one can expect to experience a false alarm every 100 outbreak-free days. Lowering the specificity to 97% increases the false alarms to approximately
once per month. The cost equation can also be used to compare two alarm methods,
A and B. Strategy A is more cost-effective than strategy B if and only if the expected
cost of A is less than that of B:
(sens_A - sens_B)(C_TP \cdot p - C_FN \cdot p) < (spec_A - spec_B)(C_FP \cdot (1 - p) - C_TN \cdot (1 - p)).    (5.12)
Thus the greater the accuracy in the estimates of the specificity and sensitivity of
each method, the prior probability of an outbreak p, and the costs of each scenario,
the more accurately a public health department can compare the cost-effectiveness of
the various available surveillance methods.
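Equations 5.11 and 5.12 translate directly into code; a sketch:

    def expected_cost(sens, spec, p, c_tp, c_fn, c_fp, c_tn):
        """Expected daily cost of an alarm strategy (equation 5.11)."""
        return (c_tp * sens * p + c_fn * (1 - sens) * p
                + c_fp * (1 - spec) * (1 - p) + c_tn * spec * (1 - p))

    # Strategy A is more cost-effective than strategy B when
    # expected_cost(sens_A, spec_A, ...) < expected_cost(sens_B, spec_B, ...),
    # which is equivalent to equation 5.12.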
It may be desirable under certain conditions to have non-constant specificity. For
example, one may wish to adjust the specificity so that the PPV is constant as a
function of the day of the week, season, and trend. Alternatively, a high profile event
may merit special attention, requiring lower specificity surveillance to increase the
sensitivity to outbreaks. The expectation-variance model is preferable to traditional
models in these situations because its specificity is known more reliably than that
of traditional models.
Therefore the specificity can easily be adjusted with time
according to public health needs. By contrast, current models operate with unknown
specificity, and adjusting an unknown quantity presents a difficulty.
To understand the inability of traditional models to maintain constant specificity
over time, it is useful to recast the outbreak detection problem in terms of percentiles
instead of means. A perfect outbreak detection model operating at a specificity of
0.95 would output an alarm threshold equal to the 95th percentile for each day,
above which an alarm would sound. More generally, a perfect model at specificity k
would model the kth percentile. The autoregressive, Serfling, trimmed seasonal and
wavelet models assume that the data have normally distributed errors with constant
variance. They thus make a first approximation to this percentile by modeling the
mean, to which a constant (which depends on k) is added. One problem with this
approach is that the ED utilization signal is heteroscedastic - that is, its variance is
not constant as a function of time (figure 5-5). In practical terms, this means that the kth
percentile is sometimes farther from the signal mean than at other times. Hence it
cannot be captured by adding a constant value to the mean. The result is that during
periods of greatest ED utilization variance, such as the winter months (figure 5-5), the alarm
thresholds of these traditional models underestimate the kth percentile, leading to a
decreased winter specificity (figure 5-3). Conversely, all four models overestimate the alarm
threshold during the summer months, when the ED utilization variance is lowest.
In fact, neglecting the dependence of the ED visit variance on the day of week, day
of year, or long-term trend when determining the alarm threshold introduces some degree of systematic error in the alarm threshold, although it may not be of sufficient magnitude to cause statistically detectable variations in the specificity.

Figure 5-5: Seasonal trends in the mean and variance of ED visits. Mean number of ED visits (left axis, solid blue line) and mean variance in ED visits (right axis, dashed green line) as a function of the day of year. Data were smoothed using 5-day and 11-day moving averages, respectively. The ED utilization mean and variance are highest in the winter and lowest during the summer.
Although the generalized linear model does not assume that the variance is constant, it does assume that the data are Poisson distributed, and consequently that the signal variance is equal to the signal mean. However, the actual signal variance is greater than the mean; the ratio ranges from approximately one to more than three during the calendar year (figure 5-5). The result is that during periods of high relative signal variance, the specificity of the method is also relatively high. For example, in October, both the ratio of signal variance to signal mean (figure 5-5) and the specificity (figure 5-3) are high.
Changes in specificity may also result from systematic errors in the expected
number of ED visits predicted by the algorithms. For example, our implementations
of the wavelet and autoregression models do not take into account day-of-week effects
on the number of ED visits. Hence during high-volume days, such as Sundays, these
models underestimate the expected number of visits. This in turn lowers the alarm
cutoff value and the specificity compared to low-volume days such as Wednesdays.
The Serfling model constrains the seasonal effects of ED utilization to a sine wave.
However, the normal seasonal pattern of respiratory visits includes a spring increase
that coincides with the allergy season (figure 5-5), which cannot be captured by a sine curve. This causes a May dip in the specificity of the Serfling model (figure 5-3).
In addition to the approach considered here, it may be possible to apply a generalized additive or other model to the squared residuals of a traditional algorithm. A
model for the alarm threshold would then be constructed in a similar manner to the
expectation-variance model. Because the specificity is affected by systematic errors
in both the mean and the variance, it would be necessary to apply a statistical test
to ensure that the specificity was constant.
The expectation-variance model is a general time series method which could be
applied to surveillance of other syndromes and populations. Implemented here in
Matlab, it could easily be imported to other platforms, and it requires minimal additional computational resources for public health departments collecting surveillance
visit data. It does, however, have several limitations. While useful for modeling syndromes that are predictable functions of the trend, season, and day-of-week covariates,
such as respiratory or gastrointestinal illnesses, it would have limited utility compared
to simpler models for rare or sporadically occurring syndromes. The present study
has evaluated the specificity, sensitivity, and timeliness of detection using a training
set containing six years of data. However, this much historical data is not always
available for model training. Although the algorithm is easily adapted to shorter
training sets, future work is needed to assess its performance with such sets. As with other detection methods, the training data must be free of an outbreak of interest
in order for the specificity estimates to be accurate. Thus the training set used in
the present study would be useful for detecting anthrax, other bioterrorism events,
or large influenza outbreaks due to changing viral strains, but not for reliably detecting yearly average influenza outbreaks present in the data. Like other time series
methods, the model also does not take advantage of geospatial information or data
streams containing different types of data.
A more subtle limitation of the expectation-variance model is that its output
is a binary variable - the absence or presence of an alarm. Kleinman et al. [112]
proposed an approach to temporal and spatial surveillance which instead provides the
probability that an observed event would be expected in the absence of an outbreak.
This approach represents a shift from statistical testing to more detailed statistical
modeling techniques [113]. Although the current implementation of our method is
binary, it can easily be converted to a "modeling" approach. For example, a graph
of the specificity as a function of the alarm threshold corresponds to a predicted
cumulative distribution function of the number of visits on any given day.
In addition to the limitations of the model, our study is limited in its analysis of
sensitivity to various outbreak types. The sensitivity depends on the time series of
additional outbreak patient visits, of which an infinite array of possibilities exist. In
the absence of outbreak data capturing the essential features of the many diseases and
syndromes that may be monitored, we have used synthetic outbreaks having simple
functional forms or "canonical shapes" [114]. This makes comparisons between types
of outbreaks easy to interpret. Alternatively, the response to one or more known
outbreaks may be evaluated [100, 115]. This approach has the advantage that the
outbreaks are inherently realistic, since they are instances of true outbreaks. However,
they may be highly irregular and dominated by stochastic effects. Indeed, there is
no guarantee that they bear resemblance to future outbreaks of the same or other
diseases. The present study offers the promising conclusion that the expectation-variance model has good comparative sensitivity for a limited number of artificial
outbreaks, but more detailed study in the context of outbreaks of interest would be
necessary to conclude that the model is preferable to previous models for real-world
surveillance.
5.5 Conclusions
The interpretation of alarms using current outbreak detection strategies is difficult
because the specificity is extremely variable. The fluctuations in specificity are due to
changes on the same time scales in the variance of the ED utilization signal. Unlike
previous models, the model developed here accounts for changes with time of not only
the expected number of ED visits, but also of the variance of the number of visits.
It is our hope that this provides a useful method for achieving a signaling strategy
with known, constant specificity, enhancing the ability of public health practitioners
to interpret the meaning of an alarm.
Bibliography
[1] C.A. Cassa, S.J. Grannis, M. Overhage, and K.D. Mandl. A context-sensitive
approach to anonymizing spatial surveillance data: Impact on outbreak detection. J. Am. Med. Inform. Assoc., 13:160-165, 2006.
[2] J. Kelly. The Great Mortality: An Intimate History of the Black Death, the
Most Devastating Plague of All Time. Harper Collins, 2005.
[3] JS Oxford. Influenza A pandemics of the 20th century with special reference
to 1918: Virology, pathology and epidemiology. Reviews in Medical Virology,
10:119-133, 2000.
[4] A.W. Crosby. America's Forgotten Pandemic: The Influenza of 1918. Cambridge University Press, 2003.
[5] G.W. Shannon. Disease mapping and early theories of yellow fever. The Professional Geographer, 33(2):221-227, 1981.
[6] H. Brody, M. Rip, P. Vinten-Johansen, N. Paneth, and S. Rachman. Mapmaking and myth-making in Broad Street: The London cholera epidemic, 1854.
The Lancet, 356:64-68, 2000.
[7] Andrew B. Lawson. Statistical Methods in Spatial Epidemiology. Wiley, 2001.
[8] J. Besag and J. Newell. The detection of clusters in rare diseases. J. R. Stat.
Soc. Ser. A Stat. Soc., 154:143-155, 1991.
[9] M. Meselson, J. Guillemin, M. Hugh-Jones, A. Langmuir, I. Popova, A. Shelokov, and O. Yampolskaya. The Sverdlovsk anthrax outbreak of 1979. Science,
266:1202-1208, 1994.
[10] M.O. Ruiz, C. Tedesco, T.J. McTighe, C. Austin, and U. Kitron. Environmental
and social determinants of human risk during a West Nile virus outbreak in the
greater Chicago area, 2002. International Journal of Health Geographics, 3:8,
2004.
[11] P. Diggle. A point process modelling approach to raised incidence of a rare
phenomenon in the vicinity of a prespecified point. J. R. Stat. Soc. Ser. A Stat.
Soc., 153:349-362, 1990.
[12] M.J. Keeling, M.E.J. Woolhouse, D.J. Shaw, L. Matthews, M. Chase-Topping,
D.T. Haydon, S.J. Cornell, J. Kappey, J. Wilesmith, and B.T. Grenfell. Dynamics of the 2001 UK foot and mouth epidemic: Stochastic dispersal in a
heterogeneous landscape. Science, 294:813-817, 2001.
[13] P. Elliott, J. Wakefield, N. Best, and D. Briggs. Spatial Epidemiology: Methods
and Applications. Oxford University Press, 2000.
[14] M. Kulldorff and N. Nagarwalla. Spatial disease clusters: Detection and inference. Statistics in Medicine, 14:799-810, 1995.
[15] M. Kulldorff. A spatial scan statistic. Commun. Stat. Theor. M., 26:1481-1496,
1997.
[16] M. Kulldorff, L. Huang, L. Pickle, and L. Duczmal. An elliptical spatial scan
statistic. Statistics in Medicine, 25(22):3929-3943, 2006.
[17] D.B. Neill.
Detection of spatial and spatio-temporal clusters. PhD thesis,
Carnegie Mellon University, Pittsburgh, 2006.
[18] T. Tango and K. Takahashi. A flexibly shaped spatial scan statistic for detecting
clusters. International Journal of Health Geographics, 4:11, 2005.
[19] L. Duczmal and R. Assunção. A simulated annealing strategy for the detection
of arbitrarily shaped spatial clusters. Comput. Stat. Data Anal., 45:269-286,
2004.
[20] R. Assunção, M. Costa, A. Tavares, and S. Ferreira. Fast detection of arbitrarily
shaped disease clusters. Statistics in Medicine, 25:723-742, 2006.
[21] G.P. Patil and C. Taillie. Upper level set scan statistic for detecting arbitrarily
shaped hotspots. Environ. Ecol. Stat., 11:183-197, 2004.
[22] C.T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput., C-20:68-86, 1971.
[23] Y. Xu, V. Olman, and D. Xu. Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics, 18:536-545, 2002.
[24] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, 2000.
[25] D.W. Merrill, S. Selvin, E.R. Close, and H.H. Holmes. Use of density equalizing
map projections (DEMP) in the analysis of childhood cancer in four California
counties. Statistics in Medicine, 15:1837-1848, 1996.
[26] D. Merrill. Use of a density equalizing map projection in analysing childhood
cancer in four California counties. Statistics in Medicine, 20:1499-1513, 2001.
[27] S. Selvin and D. Merrill. Adult leukemia: A spatial analysis. Epidemiology,
13:151-156, 2002.
[28] A. Khalakdina, S. Selvin, and D.W. Merrill. Analysis of the spatial distribution
of cryptosporidiosis in AIDS patients in San Francisco using density equalizing
map projections (DEMP). Int. J. Hyg. Environ. Health, 206:553-561, 2003.
[29] M. Gastner and M. Newman. Diffusion-based method for producing density-equalizing maps. Proc. Natl. Acad. Sci. U.S.A., 101:7499-7504, 2004.
[30] B. Bollobás. Modern Graph Theory. Springer-Verlag, 1998.
[31] J.S. Brownstein, H. Rosen, D. Purdy, J.R. Miller, M. Merlino, F. Mostashari,
and D. Fish. Spatial analysis of West Nile virus: Rapid risk assessment of an
introduced vector-borne zoonosis. Vector Borne Zoonotic Dis., 2:157-164, 2002.
[32] E.G. Knox. Detection of clusters. In P. Elliott, editor, Methodology of Enquiries into Disease Clustering, pages 17-20. Small Area Health Statistics Unit,
London, 1989.
[33] T. Tango. A test for spatial disease clustering adjusted for multiple testing.
Statistics in Medicine, 19:191-204, 2000.
[34] D.G.T. Denison and C.C. Holmes. Bayesian partitioning for estimating disease
risk. Biometrics, 57:143-149, 2001.
[35] J.T.A.S. Ferreira, D.G.T. Denison, and C.C. Holmes. Partition modelling. In A.B. Lawson and D.G.T. Denison, editors, Spatial Cluster Modeling, pages 125-146. Chapman & Hall, London, 2002.
[36] O. Berke. Exploratory disease mapping: Kriging the spatial risk function from regional count data. International Journal of Health Geographics, 3:18, 2004.
[37] T. Webster, V. Vieira, J. Weinberg, and A. Aschengrau. Method for mapping population-based case-control studies: An application using generalized additive models. International Journal of Health Geographics, 5:26, 2006.
[38] J.E. Kelsall and P.J. Diggle. Spatial variation in risk of disease: A nonparametric binary regression approach. J. R. Stat. Soc. Ser. C Appl. Statist., 47(4):559-573, 1998.
[39] P.J. Diggle. Spatial Epidemiology: Methods and Applications, pages 87-103.
Oxford University Press, Oxford, 2000.
[40] K.L. Olson, S.J. Grannis, and K.D. Mandl. Privacy protection versus cluster
detection in spatial epidemiology. Am. J. Public Health, 96(11):2002-2008, 2006.
[41] J.S. Brownstein, C.A. Cassa, and K.D. Mandl. No place to hide - reverse identification of patients from published maps. New England Journal of Medicine,
355:1741-1742, 2006.
[42] D. Dorling. Worldmapper: The human anatomy of a small planet. PLoS Medicine, 4(1):e1, 2007.
[43] S. Selvin, J. Schulman, and D.W. Merrill. Distance and risk measures for the analysis of spatial data: A study of childhood cancers. Social Science and Medicine, 34(7):769-777, 1992.
[44] A.C. Gatrell, T.C. Bailey, P.J. Diggle, and B.S. Rowlingson. Spatial point pattern analysis and its application in geographical epidemiology. Transactions of the Institute of British Geographers, 21(1):256-274, 1996.
[45] A.G. Chetwynd, P.J. Diggle, and A. Marshall. Investigation of spatial clustering from individually matched case-control studies. Biostatistics, 2(3):277-293,
2001.
[46] M.G. Voronoi. Nouvelles applications des paramètres continus à la théorie des formes quadratiques [New applications of continuous parameters to the theory of quadratic forms]. J. Reine Angew. Math., pages 198-287, 1908.
[47] C.M. Gold, J. Nantel, and W. Yang. Outside-in: An alternative approach to forest map digitizing. International Journal of Geographical Information Science, 10(3):291-310, 1996.
[48] A.B. Mendes and I.H. Themido. Multi-outlet retail site location assessment. International Transactions in Operational Research, 11(1):1-18, 2004.
[49] R. Klein, K. Mehlhorn, and S. Meiser. Randomized incremental construction of abstract Voronoi diagrams. Computational Geometry: Theory and Applications, 3(3):157-184, 1993.
[50] F. Rezende, R.M.V. Almeida, and F.F. Nobre. Diagramas de Voronoi para a definição de áreas de abrangência de hospitais públicos no Município do Rio de Janeiro [Voronoi diagrams for the definition of public hospital catchment areas in the municipality of Rio de Janeiro]. Cad. Saúde Pública, 16:467-475, 2000.
[51] I. Hanigan, G. Hall, and K.B.G. Dear. A comparison of methods for calculating population exposure estimates of daily weather for health research. International Journal of Health Geographics, 5(1):38, 2006.
[52] J.F. Bithell. A classification of disease mapping methods. Statistics in Medicine, 19:2203-2215, 2000.
[53] The Massachusetts Institute of Technology Geodata Repository.
[54] J.F. Bithell. An application of density estimation to geographical epidemiology. Statistics in Medicine, 9:697-701, 1990.
[55] CDC. Mumps epidemic - Iowa, 2006. Morbidity and Mortality Weekly Report, 55(13):366-368, 2006.
[56] CDC. Update: Multistate outbreak of mumps - United States, January 1-May 2, 2006. Morbidity and Mortality Weekly Report, 55:1-5, 2006.
[57] S.B. Eng, D.H. Werker, A.S. King, S.A. Marion, A. Bell, J.L. Isaac-Renton, G.S. Irwin, and W.R. Bowie. Computer-generated dot maps as an epidemiologic tool: Investigating an outbreak of toxoplasmosis. Emerging Infectious Diseases, 5(6):815-819, 1999.
[58] T.J. Oyana, P. Rogerson, and J.S. Lwebuga-Mukasa. Geographic clustering of adult asthma hospitalization and residential exposure to pollution at a United States-Canada border crossing. American Journal of Public Health, 94(7):1250-1257, 2004.
[59] G.M. Jacquez, A. Kaufmann, J. Meliker, P. Goovaerts, G. AvRuskin, and J. Nriagu. Global, local and focused geographic clustering for case-control data with residential histories. Environmental Health: A Global Access Science Source, 4:4, 2005.
[60] D. Han, P.A. Rogerson, J. Nie, M.R. Bonner, J.E. Vena, D. Vito, P. Muti, M. Trevisan, S.B. Edge, and J.L. Freudenheim. Geographic clustering of residence in early life and subsequent risk of breast cancer (United States). Cancer Causes and Control, 15:921-929, 2004.
[61] A.J. Curtis, J.W. Mills, and M. Leitner. Spatial confidentiality and GIS: Re-engineering mortality locations from published maps about Hurricane Katrina. International Journal of Health Geographics, 5:44, 2006.
[62] M.T. Wallin, W.F. Page, and J.F. Kurtzke. Multiple sclerosis in US veterans of
the Vietnam era and later military service: Race, sex, and geography. Annals
of Neurology, 55(1):65-71, 2004.
[63] K. Torugsa, S. Anderson, N. Thongsen, N. Sirisopana, A. Jugsudee, P. Junlananto, S. Nitayaphan, S. Sangkharomya, and A.E. Brown. HIV epidemic among young Thai men, 1991-2000. Emerging Infectious Diseases, 9(7):881-883, 2003.
[64] S.F. Olsen. Cluster analysis and disease mapping - why, when and how? A step by step guide. British Medical Journal, 313:863-866, 1996.
[65] T. Tango. A class of tests for detecting 'general' and 'focused' clustering of rare
diseases. Statistics in Medicine, 14(21-22):2323-34, 1995.
[66] P.A. Rogerson. The detection of clusters using a spatial version of the chi-square goodness-of-fit statistic. Geographical Analysis, 31(1):128-147, 1999.
[67] Sonia Friedman and Richard S. Blumberg. Harrison's Principles of Internal Medicine, chapter 276. McGraw-Hill, 16th edition, 2006.
[68] D.K. Bonen and J.H. Cho. The genetics of inflammatory bowel disease. Gastroenterology, 124(2):521-536, 2003.
[69] Warren Strober, Peter J. Murray, Atsushi Kitani, and Tomohiro Watanabe. Signalling pathways and molecular interactions of NOD1 and NOD2. Nature Reviews Immunology, 6, 2006.
[70] Jonas Halfvarson, Lennart Bodin, Curt Tysk, Eva Lindberg, and Gunnar Järnerot. Inflammatory bowel disease in a Swedish twin cohort: A long-term follow-up of concordance and clinical characteristics. Gastroenterology, 124(7):1767-1773, 2003.
[71] Edward V. Loftus. Clinical epidemiology of inflammatory bowel disease: Incidence, prevalence, and environmental influences. Gastroenterology, 126, 2004.
[72] T. Andus and V. Gross. Etiology and pathophysiology of inflammatory bowel disease-environmental factors. Hepatogastroenterology, 47(31):29-43, 2000.
[73] J. Aisenberg and H.D. Janowitz. Cluster of inflammatory bowel disease in three close college friends? J Clin Gastroenterol, 17(4), 1993.
[74] Anders Ekbom, Matthew Zack, Hans-Olov Adami, and Charles Helmick. Is
there clustering of inflammatory bowel disease at birth? American Journal of
Epidemiology, 134(8):876-886, 1991.
[75] D.S. Miller, Andrea Keighley, P.G. Smith, A.O. Hughes, and M.J.S. Langman. A case-control method for seeking evidence of contagion in Crohn's disease. Gastroenterology, 71(3):385-387, 1976.
[76] Amnon Sonnenberg and Irene Wasserman. Epidemiology of inflammatory bowel
disease among U.S. military veterans. Gastroenterology, 101:122-130, 1991.
[77] M. Valenciano, B. Gagnière, C. Maurage, H. de Valk, and J.C. Desenclos. Étude d'un agrégat apparent de maladies de Crohn en Indre-et-Loire, 1990-1999 [Analysis of an apparent cluster of Crohn's disease cases in Indre-et-Loire, France (1990-1999)]. Rev Epidemiol Sante Publique, 50(6):509-517, 2002.
[78] D O'Donovan, D Keegan, G McEvoy, H Mulcahy, and D O'Donoghue. A cluster
of Crohn's disease in Ballybrack: Fact or fiction? Endoscopy, 36, 2004.
[79] L. Sweeney. k-Anonymity: A model for protecting privacy. Int J Uncertainty
Fuzziness Knowledge-Based Syst, 10:557-570, 2002.
[80] NIH publication number 04-5489: Research repositories, databases, and the HIPAA privacy rule. http://privacyruleandresearch.nih.gov/research-repositories.asp, 2004.
[81] M.P. Armstrong, G. Rushton, and D.L. Zimmerman. Geographically masking
health data to preserve confidentiality. Statistics in Medicine, 18(5):497-525,
1999.
[82] United States 2000 census. http://www.census.gov/geo/www/cob/bgmetadata.html.
[83] C.M. Yuan, S. Love, and M. Wilson. Syndromic surveillance at hospital emergency departments - Southeastern Virginia. MMWR Morb. Mortal. Wkly. Rep., 53 Suppl:56-58, 2004.
[84] L. Hammond, S. Papadopoulos, C.F. Johnson, S. MaWhinney, B. Nelson, and J.K. Todd. Use of an internet-based community surveillance network to predict seasonal communicable disease morbidity. Pediatrics, 109(3):414-418, 2002.
[85] F. Mostashari, A. Fine, D. Das, J. Adams, and M. Layton. Use of ambulance
dispatch data as an early warning system for communitywide influenzalike illness, New York City. J. Urban Health, 80 Supplement 1:i43-i49, 2003.
[86] R. Heffernan, F. Mostashari, D. Das, M. Besculides, C. Rodriguez, J. Greenko,
L. Steiner-Sichel, S. Balter, A. Karpati, P. Thomas, M. Phillips, J. Ackelsberg,
E. Lee, J. Leng, J. Hartman, K. Metzger, R. Rosselli, and D. Weiss. New York
City syndromic surveillance systems. MMWR Morb. Mortal. Wkly. Rep., 53
Supplement:25-27, 2004.
[87] M.D. Lewis, J.A. Pavlin, J.L. Mansfield, S. O'Brien, L.G. Boomsma, Y. Elbert,
and P.W. Kelley. Disease outbreak detection system using syndromic data in
the greater Washington DC area. Am. J. Prev. Med., 23(3):180-186, 2002.
[88] F.C. Tsui, J.U. Espino, V.M. Dato, P.H. Gesteland, J. Hutman, and M.M.
Wagner. Technical description of RODS: A real-time public health surveillance
system. J. Am. Med. Inform. Assoc., 10:399-408, 2003.
[89] M. Paladini. Daily emergency department surveillance system-Bergen County,
New Jersey. MMWR Morb. Mortal. Wkly. Rep., 53 Supplement:47-49, 2004.
[90] Z.F. Dembek, K. Carley, A. Siniscalchi, and J. Hadler. Hospital admissions syndromic surveillance - Connecticut, September 2001-November 2003. MMWR Morb. Mortal. Wkly. Rep., 53 Supplement:50-52, 2004.
[91] M.M. Wagner, J.M. Robinson, F.C. Tsui, J.U. Espino, and W.R. Hogan. Design
of a national retail data monitor for public health surveillance. J. Am. Med.
Inform. Assoc., 10:409-418, 2003.
[92] R. Platt, C. Bocchino, B. Caldwell, R. Harmon, K. Kleinman, R. Lazarus, A.F.
Nelson, J.D. Nordin, and D.P. Ritzwoller. Syndromic surveillance using minimum transfer of identifiable data: The example of the National Bioterrorism
Syndromic Surveillance Demonstration Program. J. Urban Health, 80 Supplement 1(2):i25-i31, 2003.
[93] D.L. Cooper, G. Smith, M. Baker, F. Chinemana, N. Verlander, E. Gerard,
V. Hollyoak, and R. Griffiths. National symptom surveillance using calls to
a telephone health advice service-United Kingdom, December 2001-February
2003. MMWR Morb. Mortal. Wkly. Rep., 53 Supplement:179-83, 2004.
[94] L. Hutwagner, W. Thompson, G.M. Seeman, and T. Treadwell. The Bioterrorism Preparedness and Response Early Aberration Reporting System (EARS).
J. Urban Health, 80 Supplement 1:i89-i96, 2003.
[95] L.C. Hutwagner, E.K. Maloney, N.H. Bean, L. Slutsker, and S.M. Martin. Using laboratory-based surveillance data for prevention: An algorithm for detecting Salmonella outbreaks. Emerg. Infect. Dis., 3(3):395-400, 1997.
[96] L. Hutwagner, T. Browne, G.M. Seeman, and A.T. Fleischauer. Comparing aberration detection methods with simulated data. Emerg. Infect. Dis., 11(2):314-316, 2005.
[97] J. Zhang, F.C. Tsui, M.M. Wagner, and W.R. Hogan. Detection of outbreaks from time series data using wavelet transform. In Proc. AMIA Symp., pages 748-752, 2003.
[98] B.Y. Reis and K.D. Mandl. Time series modeling for syndromic surveillance.
BMC Med. Inform. Decis. Mak., 3, 2003.
[99] A.J. Beitel, K.L. Olson, B.Y. Reis, and K.D. Mandl. Use of emergency department chief complaint and diagnostic codes for identifying respiratory illness in a pediatric population. Pediatr. Emerg. Care, 20(6):355-360, 2004.
[100] M.L. Jackson, A. Baer, I. Painter, and J. Duchin. A simulation study comparing
aberration detection algorithms for syndromic surveillance. BMC Med. Inform.
Decis. Mak., 7:6, 2007.
[101] J.C. Brillman, T. Burr, D. Forslund, E. Joyce, R. Picard, and E. Umland. Modelling emergency department visit patterns for infectious disease complaints: Results and application to disease surveillance. BMC Med. Inform. Decis. Mak., 5, 2005.
[102] R. Lazarus, K. Kleinman, I. Dashevsky, C. Adams, P. Kludt, A. DeMaria, and
R. Platt. Use of automated ambulatory-care encounter records for detection of
acute illness clusters, including potential bioterrorism events. Emerging Infectious Diseases, 8, 2002.
[103] R.E. Serfling. Methods for current statistical analysis of excess pneumonia-influenza deaths. Public Health Rep., 78(6):494-506, 1963.
[104] W.W. Thompson, D.K. Shay, E. Weintraub, L. Brammer, C.B. Bridges, N.J.
Cox, and K. Fukuda. Influenza-associated hospitalizations in the United States.
JAMA, 292:1333-1340, 2004.
[105] K.D. Mandl, J.M. Overhage, M.M. Wagner, W.B. Lober, P. Sebastiani,
F. Mostashari, J.A. Pavlin, P.H. Gesteland, T. Treadwell, E. Koski, L. Hutwagner, D.L. Buckeridge, R.D. Aller, and S. Grannis. Implementing syndromic
surveillance: A practical guide informed by the early experience. J. Am. Med.
Inform. Assoc., 11:141-150, 2004.
[106] A. Boggess and F.J. Narcowich. A First Course in Wavelets with Fourier Analysis. Prentice Hall Press, Upper Saddle River, NJ, 2001.
[107] M. Aitkin. Modelling variance heterogeneity in normal regression using GLIM.
Applied Statistics, 36, 1987.
[108] R.A. Rigby and D.M. Stasinopoulos. A semi-parametric additive model for
variance heterogeneity. Statistics and Computing, 6, 1996.
[109] T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Chapman & Hall, New York, NY, 1990.
[110] F. Dominici, A. McDermott, S.L. Zeger, and J.M. Samet. On the use of generalized additive models in time-series studies of air pollution and health. Am.
J. Epidemiol., 156:193-203, 2002.
[111] Matlab User's Guide. Mathworks, Inc., Natick, MA.
[112] K. Kleinman, R. Lazarus, and R. Platt. A generalized linear mixed models
approach for detecting incident clusters of disease in small areas, with an application to biological terrorism. Am. J. Epidemiol., 159:217-224, 2004.
[113] L.A. Waller. Invited commentary: Surveilling surveillance - some statistical
comments. Am. J. Epidemiol., 159:225-227, 2004.
[114] K.D. Mandl, B.Y. Reis, and C. Cassa. Measuring outbreak-detection performance by using controlled feature set simulations. MMWR Morb. Mortal. Wkly. Rep., 53 Suppl:130-136, 2004.
[115] A. Goldenberg, G. Shmueli, R.A. Caruana, and S.E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences, 99:5237-5240, 2002.