Space-Time Scan Statistics for Early Warning Systems
Martin Kulldorff
Department of Ambulatory Care and Prevention
Harvard University Medical School and Harvard Pilgrim Health Care

Content
• Background on Disease Surveillance
• Purely Spatial Scan Statistics: Brain Cancer in the United States
• Early Warning System using a Space-Time Permutation Scan Statistic: Syndromic Surveillance in New York City
• Various Extensions

Collaborators
Harvard Medical School: Ken Kleinman, Richard Platt, Katherine Yih
New York City Department of Health: Jessica Hartman, Rick Heffernan, Farzad Mostashari
University of Connecticut: David Gregorio, Zixing Fang
Universidade Federal de Minas Gerais: Renato Assunção, Luiz Duczmal

Importance of Early Disease Outbreak Detection
• Eliminate health hazards
• Warn about risk factors
• Earlier diagnosis of new cases
• Quarantine cases
• Scientific research concerning treatments, vaccines, etc.
• Early detection is especially critical for infectious diseases

Disease Surveillance
Data Sources
• Disease Registries
• Reportable Diseases
• Electronic Health Records
• Health Insurance Claims Data
• Vital Statistics (Mortality)
Types of Data
• Diagnosed Diseases
• Symptoms (Syndromic Surveillance)
• Lab Test Results
• Pharmaceutical Drug Sales

Disease Surveillance
Frequency of Analyses
• Daily
• Weekly
• Monthly
• Yearly

Purely Temporal Methods
Farrington CP, Andrews NJ, Beale AD, Catchpole MA (1996) A statistical algorithm for the early detection of outbreaks of infectious disease. J R Stat Soc A Stat Soc 159: 547–563.
Hutwagner LC, Maloney EK, Bean NH, Slutsker L, Martin SM (1997) Using laboratory-based surveillance data for prevention: An algorithm for detecting salmonella outbreaks. Emerg Infect Dis 3: 395–400.
Nobre FF, Stroup DF (1994) A monitoring system to detect changes in public health surveillance data. Int J Epidemiol 23: 408–418.
Reis B, Mandl K (2003) Time series modeling for syndromic surveillance. BMC Med Inform Decis Mak 3: 2.
Three Important Issues
• An outbreak may start locally.
• Purely temporal methods can be used simultaneously for multiple geographical areas, but that leads to multiple testing.
• Disease outbreaks may not conform to the pre-specified geographical areas.

Why Use a Scan Statistic?
With disease outbreaks:
• We do not know where they will occur.
• We do not know their geographical size.
• We do not know when they will occur.
• We do not know how rapidly they will emerge.

One-Dimensional Scan Statistic
[Figure]

The Spatial Scan Statistic
Create a regular or irregular grid of centroids covering the whole study region.
Create an infinite number of circles around each centroid, with the radius anywhere from zero up to a maximum so that at most 50 percent of the population is included.
For each circle:
– Obtain actual and expected number of cases inside and outside the circle.
– Calculate likelihood function.
Compare Circles:
– Pick circle with highest likelihood function as Most Likely Cluster.
Inference:
– Generate random replicas of the data set under the null hypothesis of no clusters (Monte Carlo sampling).
– Compare most likely clusters in real and random data sets (Likelihood ratio test).

Poisson Likelihood Function
$[c/\mu]^{c} \times [(C-c)/(C-\mu)]^{C-c}$
c = cases in circle
μ = expected cases in circle
C = total cases

Spatial Scan Statistic: Properties
– Adjusts for inhomogeneous population density.
– Simultaneously tests for clusters of any size and any location, by using circular windows with continuously variable radius.
– Accounts for multiple testing.
– Possibility to include confounding variables, such as age, sex or socio-economic variables.
– Aggregated or non-aggregated data (states, counties, census tracts, block groups, households, individuals).

U.S. Brain Cancer Mortality 1986-1995
                       deaths    rate* (95% CI)
Children (age <20):     5,062    0.75 (0.66-0.83)
Adults (age 20+):     106,710    6.0 (5.8-6.2)
Adult Women:           48,650    4.9 (4.7-5.0)
Adult Men:             58,060    7.2 (7.0-7.5)
* annual deaths / 100,000

Brain Cancer
Known risk factors:
• High dose ionizing radiation
• Selected congenital and genetic disorders
These explain only a small percentage of cases.
Potential risk factors: N-nitroso compounds?, phenols?, pesticides?, polycyclic aromatic hydrocarbons?, organic solvents?

Adjustments
All subsequent analyses were adjusted for:
• Age
• Gender
• Ethnicity (African-American, White, Other)

Brain Cancer Mortality, Children 1986-1995
[Map: SMR by county. Legend: 2.07-42.82 (highest 10%); 1.20-2.06; 0.83-1.19; 0.50-0.82; zero cases (1867 counties).]

Spatial Scan Statistic, Children
[Map: clusters numbered 1-7. Color key: high risk, not significant.]

Children: Seven Most Likely Clusters
Cluster             Obs    Exp    RR     p=
1. Carolinas         86     51    1.7    0.24
2. California        16    4.9    3.3    0.74
3. Michigan         318    250    1.3    0.74
4. S Carolina        24     10    2.5    0.79
5. Kentucky-Tenn    127     88    1.4    0.79
6. Wisconsin         10    2.4    4.1    0.98
7. Nebraska          12    3.6    3.3    0.99

Conclusions: Children
No statistically significant clusters detected.
Any part of the pattern seen on the original map may be due to chance.

What About Adults?

Brain Cancer Mortality, Adults 1986-1995
[Map: SMR by county. Legend: 9.46-24.44 (highest 10%); 8.05-9.45; 7.27-8.04; 6.72-7.26; 6.17-6.71; 5.68-6.16; 5.19-5.67; 4.51-5.18; 3.40-4.50; zero cases (312 counties).]

Spatial Scan Statistic: Adults
[Map: clusters numbered 1-13.]

Spatial Scan Statistic, Women
[Map: clusters numbered 1-13. Color key: low risk, p < 0.05; high risk, p < 0.05; low risk, not significant; high risk, not significant.]

Women: Most Likely Clusters
Cluster                 Obs     Exp     RR      p=
1. Arkansas et al.      2830    2328    1.22    0.0001
2. Carolinas            1783    1518    1.17    0.0001
3. Oklahoma et al.      1709    1496    1.14    0.003
4. Minnesota et al.     2616    2369    1.10    0.01
10. N.J. / N.Y.         1809    2300    0.79    0.0001
11. S Texas              127     214    0.59    0.0001
12. New Mexico et al.    849    1049    0.81    0.0001

Spatial Scan Statistic: Men
[Map: clusters numbered 1-15. Color key: low risk, p < 0.05; high risk, not significant; high risk, p < 0.05.]

Men: Most Likely Clusters
Cluster                    Obs     Exp     RR      p=
1. Kentucky et al.         3295    2860    1.15    0.0001
2. Carolinas               1925    1658    1.16    0.0001
3. Arkansas et al.         1143     964    1.19    0.001
4. Washington et al.       1664    1455    1.14    0.003
5. Michigan                1251    1074    1.17    0.005
11. N.J. / N.Y.            2084    2615    0.80    0.0001
12. S Texas                 157     262    0.60    0.0001
13. New Mexico et al.      1418    1680    0.84    0.0001
14. Upstate N.Y. et al.    1642    1895    0.87    0.0001

Conclusions: Adults
It is possible to pinpoint specific areas with higher and lower rates that are statistically significant, and unlikely to be due to chance.
The exact borders of detected clusters are uncertain.
Similar patterns for men and women.

Conclusion: General
The spatial scan statistic can be useful as an addition to disease maps, in order to determine if the observed patterns are likely due to chance or not.
A complement rather than a replacement for regular disease maps.

Space-Time Scan Statistic
Use a cylindrical window, with the circular base representing space and the height representing time.
We will only consider cylinders that reach the present time.
For each cylinder:
– Obtain actual and expected number of cases inside and outside the cylinder.
– Calculate likelihood function.
Compare Cylinders:
– Pick cylinder with highest likelihood function as Most Likely Cluster.
Inference:
– Generate random replicas of the data set under the null hypothesis of no clusters (Monte Carlo sampling).
– Compare most likely clusters in real and random data sets (Likelihood ratio test).

Space-Time Permutation Scan Statistic
1. For each cylinder, calculate the expected number of cases conditioning on the marginals:
$\mu_{st} = \left(\sum_{s} c_{st}\right) \times \left(\sum_{t} c_{st}\right) / C$
where $c_{st}$ = # cases at time t in location s and C = total number of cases
2. For each cylinder, calculate
$T_{st} = [c_{st}/\mu_{st}]^{c_{st}} \times [(C-c_{st})/(C-\mu_{st})]^{C-c_{st}}$ if $c_{st} > \mu_{st}$
$T_{st} = 1$, otherwise
3. Test statistic $T = \max_{st} T_{st}$
4. Generate random replicas of the data set conditioned on the marginals, by permuting the pairs of spatial locations and times.
5. Compare test statistic in real and random data sets using Monte Carlo hypothesis testing (Dwass, 1957):
$p = \mathrm{rank}(T_{\mathrm{real}}) / (1 + \#\mathrm{replicas})$

Space-Time Permutation Scan Statistic: Properties
– Adjusts for purely geographical clusters.
– Adjusts for purely temporal clusters.
– Simultaneously tests for outbreaks of any size at any location, by using cylindrical windows with variable radius and height.
– Accounts for multiple testing.
– Aggregated or non-aggregated data (counties, zip-code areas, census tracts, individuals, etc).

Let’s Try It!
• Historic data, Nov 15, 2001 – Nov 14, 2002
• Diarrhea, all age groups
• Use last 30 days of data.
• Temporal window size: 1-7 days
• Spatial window size: 0-5 kilometers
• Residential zip code and hospital coordinates

Results: Hospital Analyses
     Date      #days   #hosp   #cases   #exp    RR     p=       recurrence interval
A    Nov 21    6       1       101      73.6    1.4    0.0008   1 / 3.4 years
B    Jan 11    1       1       10       2.3     4.4    0.0007   1 / 3.9 years
C    Feb 26    4       2       97       66.9    1.4    0.0018   1 / 1.5 years
D    Mar 31    2       1       38       19.2    2.0    0.0017   1 / 1.6 years
E    Nov 1     6       3       122      86.6    1.4    0.0017   1 / 1.6 years
F    Nov 2     7       3       135      98.3    1.4    0.0008   1 / 3.4 years

Results: Residential Analyses
     Date      #days   #zips   #cases   #exp    RR     p=       recurrence interval
G    Feb 9     2       15      63       34.7    1.8    0.0005   1 / 5.5 years
H    Mar 7     2       8       63       37.3    1.7    0.0027   1 / 1.0 years

[Figure: # of visits by month, Nov 2001 to Nov 2002, for three series: Citywide, Areas with residential signals, Areas with hospital signals. Signals A-H are marked on the time axis.]

Real-Time Daily Analyses
• Starting November 1, 2003.
• Respiratory, Fever/Flu, Diarrhea, (+Vomiting)
• Hospital (and Residential) Analyses
• Spatial window size: 0-5 kilometers
• Temporal window size: 1-7 days

Real-Time Results, Nov 24, 2003: Hospital Analysis
Syndrome      #days   #hosp   #cases   #exp     RR     p=      recurrence interval
Respiratory   2       3       80       57.4     1.4    0.13    every 8 days
Fever/Flu     3       1       24       14.8     1.6    0.68    every day
Diarrhea      2       4       18       8.2      2.2    0.04    every 26 days

Real-Time Results, Nov 25, 2003: Hospital Analysis
Syndrome      #days   #hosp   #cases   #exp     RR     p=      recurrence interval
Respiratory   7       1       45       30.4     1.5    0.46    every 2 days
Fever/Flu     1       5       50       31.5     1.6    0.04    every 23 days
Diarrhea      3       4       22       11.5     1.9    0.17    every 6 days

Real-Time Results, Nov 26, 2003: Hospital Analysis
Syndrome      #days   #hosp   #cases   #exp     RR     p=      recurrence interval
Respiratory   5       2       233      199.4    1.1    0.63    every 2 days
Fever/Flu     7       7       299      252.1    1.2    0.05    every 22 days
Diarrhea      4       4       23       12.6     1.8    0.22    every 5 days

Real-Time Results, Nov 27, 2003: Hospital Analysis
Syndrome      #days   #hosp   #cases   #exp     RR     p=      recurrence interval
Respiratory   1       4       41       26.9     1.5    0.45    every 2 days
Fever/Flu     6       4       181      142.9    1.3    0.03    every 36 days
Diarrhea      5       3       29       14.1     1.7    0.50    every 2 days

Real-Time Results, Nov 28, 2003: Hospital Analysis
Syndrome      #days   #hosp   #cases   #exp     RR     p=      recurrence interval
Respiratory   2       4       98       78.8     1.2    0.82    every day
Fever/Flu     7       5       228      178.0    1.3    0.001   every 1000 days
Diarrhea      6       3       29       17.5     1.5    0.26    every 4 days

Real-Time Results, Nov 29, 2003: Hospital Analysis
Syndrome      #days   #hosp   #cases   #exp     RR     p=      recurrence interval
Respiratory   7       2       146      123.6    1.2    0.95    every day
Fever/Flu     7       4       253      195.7    1.3    0.001   every 1000 days
Diarrhea      7       4       44       29.4     1.5    0.21    every 5 days

Real-Time Results, Nov 30, 2003: Hospital Analysis
Syndrome      #days   #hosp   #cases   #exp     RR     p=      recurrence interval
Respiratory   1       1       19       10.7     1.8    0.69    every day
Fever/Flu     6       9       429      364.1    1.2    0.002   every 500 days
Diarrhea      1       5       12       4.4      2.7    0.06    every 17 days

Summary
Four strong diarrhea signals:
• Two were early signals for city-wide outbreaks likely due to norovirus.
• One was an early signal for a city-wide outbreak among children, likely due to rotavirus.
• One small outbreak of unknown etiology.
Three medium strength diarrhea signals:
• All during the rotavirus outbreak, possibly due to a shift in the geographical epicenter.
One real-time fever/flu signal, coinciding with the start of the flu season.

Different Data Streams
For example:
• Nurses Hotline Calls
• Regular Physician Visits
• Emergency Department Visits
• Ambulance Dispatches
• Pharmaceutical Drug Sales
• Lab Test Results

Multiple Data Streams
For each cylinder, add the Poisson log likelihoods:
$T_{st} = \log T_{st}^{[1]} + \log T_{st}^{[2]} + \log T_{st}^{[3]}$
Test statistic $T = \max_{st} T_{st}$

Syndromic Surveillance in Boston: Upper and Lower GI
• Harvard Pilgrim Health Care HMO members cared for by Harvard Vanguard Medical Associates
• Historical Data from Jan 1 to Dec 31, 2002
• Mimicking Surveillance from Sept 1 to Dec 31, 2002

Three Data Streams
• Telephone Calls ( ~ 20 / day)
• Urgent Care Visits ( ~ 9 / day)
• Regular Physician Visits ( ~ 22 / day)
Multiple contacts by the same person removed.
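As a rough illustration of the multi-stream statistic on the Multiple Data Streams slide, the sketch below sums per-stream Poisson log likelihood ratios for a single cylinder. The function names, the (c, mu, C) tuple layout and the example counts are hypothetical and are not taken from SaTScan or from the Boston data. It also assumes that a stream with no excess (c ≤ μ) contributes log 1 = 0, carrying over the "= 1, otherwise" rule from the single-stream definition.

```python
import math


def poisson_log_lr(c, mu, total):
    """Log of T_st for one stream and one cylinder.

    c     -- observed cases in the cylinder
    mu    -- expected cases in the cylinder (assumed 0 < mu < total)
    total -- total cases C in that stream
    Returns 0.0 (the log of T = 1) unless c exceeds mu.
    """
    if c <= mu:
        return 0.0
    inside = c * math.log(c / mu)
    rest = total - c
    # Treat 0 * log(0) as 0 when every case falls inside the cylinder.
    outside = rest * math.log(rest / (total - mu)) if rest > 0 else 0.0
    return inside + outside


def multi_stream_statistic(streams):
    """Sum the per-stream log likelihood ratios for one cylinder."""
    return sum(poisson_log_lr(c, mu, total) for c, mu, total in streams)


# Hypothetical counts for one cylinder in three streams: (c, mu, C).
example = [(10, 5.0, 100), (3, 5.0, 80), (7, 6.5, 120)]
print(multi_stream_statistic(example))
```

Under these assumptions, the overall test statistic would be the maximum of this sum over all candidate cylinders, evaluated in the same way on the real data and on each random replica.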
Strongest Signal: October 18
Recurrence Interval
Multiple Data Streams: < 1 / 1000 days
Single Data Streams:
– Tele: < 1 / 1000 days
– Urgent: ~ every day
– Regular: ~ every day

October 18 Signal
• Friday
• Number of Cases: 5 (all tele)
• Expected Cases: 0.04
• Location: Zip Code 01740
• Time Length: One Day
• Diagnosis: Pinworm Infestation (all 5)
• Same Family: Mother, Father, 3 Kids

Limitations
• Space-time clusters may occur for reasons other than disease outbreaks.
• Automated detection systems do not replace the observant eyes of physicians and other health workers.
• Epidemiological investigations by public health departments are needed to confirm or dismiss the signals.

Scan Statistics for Irregular Shaped Clusters
Duczmal, Assunção. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics and Data Analysis, 2004.
Patil, Taillie. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics, 2004.
Iyengar. Space-time clusters with flexible shapes. Morbidity and Mortality Weekly Report, 2005.
Tango, Takahashi. A flexibly shaped spatial scan statistic for detecting clusters. Int J Health Geographics, 2005.
Assunção, Costa, Tavares, Ferreira. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine, 2006.

Probability Models
• Poisson model (e.g. incidence, mortality)
• Bernoulli model (e.g. case-control data)
• Normal model (e.g. weight, blood lead levels)
• Exponential model (e.g. survival data)
• Ordinal model (e.g. early, medium and late stage cancer)
• Space-time permutation model (when only case data is available)

Application Areas
• Chronic Diseases
• Infectious Diseases
• Health Services
• Accidents
• Brain Imaging
• Toxicology
• Veterinary Medicine
• Psychology
• Demography
• Criminology
• History
• Archeology
• Ecology

Examples of Applications
Beato Filho, Assunção, Silva, Marinho, Reis, Almeida. Homicide clusters and drug traffic in Belo Horizonte, Minas Gerais, Brazil from 1995 to 1999. Cadernos de Saúde Pública, 2001.
Pellegrini. Analise espaço-temporal da leptospirose no municipio do Rio de Janeiro. Fiocruz, 2002.
Andrade, Silva, Martelli, Oliveira, Morais Neto, Siqueira Junior, Melo, Di Fabio. Population-based surveillance of pediatric pneumonia: use of spatial analysis in an urban area of Central Brazil. Cadernos de Saúde Pública, 2004.
Ceccato. Homicide in São Paulo, Brazil: Assessing spatial-temporal and weather variations. J Environmental Psychology, 2005.
Simões, Mendes, Marques, Pereira, Bagagli. Spatial clusters of paracoccidioidomycosis in southeastern Brazil. Revista do Instituto de Medicina Tropical de São Paulo, 2005.

SaTScan Software
Free. Download from www.satscan.org
Registered users in 116 countries:
1. USA
2. Canada
3. United Kingdom
4. Brazil
5. Italy
...
100s. Albania, Bhutan, Burma, Fiji, Grenada, Guinea, Iraq, Macao, Madagascar, Malawi, Malta, etc.

Future Topics
• Irregular shaped clusters
• Non-Euclidean neighbor definitions
• Multivariate data
• Multiple locations per observation
• Computational speed

Acknowledgement
Research funded by:
Alfred P Sloan Foundation
Centers for Disease Control and Prevention
Massachusetts Department of Health
National Cancer Institute
National Institute of Child Health and Human Development
National Institute of General Medical Sciences: Modeling Infectious Disease Agent Study (MIDAS)

References
Kulldorff. A spatial scan statistic. Communications in Statistics, Theory and Methods, 26:1481-1496, 1997.
Fang, Kulldorff, Gregorio. Brain cancer in the United States 1986-1995, A Geographical Analysis. Neuro-Oncology, 6:179-187, 2004.
Kulldorff, Heffernan, Hartman, Assunção, Mostashari. A space-time permutation scan statistic for disease outbreak detection. PLoS Medicine, 2(3):e59, 2005.
Kulldorff, Mostashari, Duczmal, Yih, Kleinman, Platt. Multivariate spatial scan statistics for disease surveillance. Statistics in Medicine, 2006, in press.
Kulldorff and IMS Inc. SaTScan v.7.0: Software for the spatial and space-time scan statistics, 2004. Free: http://www.satscan.org/