Center for Statistical Ecology and Environmental Statistics

Surveillance Hotspots Systems for Digital Government

By G. P. Patil1, R. Acharya2, R. Modarres3, W. L. Myers4, and S. L. Rathbun5

1 Center for Statistical Ecology and Environmental Statistics, Department of Statistics, Penn State University
2 Department of Computer Science and Engineering, Penn State University
3 Department of Statistics, George Washington University
4 School of Forest Resources and Office for Remote Sensing and Spatial Information Resources, Penn State Institutes of Environment, Penn State University
5 Department of Health Administration, Biostatistics and Epidemiology, University of Georgia

This material is based upon work supported by the National Science Foundation under Grant No. 0307010. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This project is funded, in part, under a grant with the Pennsylvania Department of Health using Tobacco Settlement Funds. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions.

[Invited paper in preparation for Encyclopedia of Digital Government]

Technical Report Number 2005-0203
TECHNICAL REPORTS AND REPRINTS SERIES
February 2005

Department of Statistics, The Pennsylvania State University, University Park, PA 16802
G. P. Patil, Distinguished Professor and Director
Tel: (814)865-9442 Fax: (814)865-1278 Email: gpp@stat.psu.edu
http://www.stat.psu.edu/~gpp
http://www.stat.psu.edu/hotspots

INTRODUCTION

Geoinformatic surveillance for spatial and temporal hotspot detection and prioritization is crucial in the 21st century. A hotspot may be any unusual phenomenon, anomaly, aberration, outbreak, elevated cluster, or critical area. Government agencies require hotspot delineation and prioritization for monitoring, etiology, management, or early warning.
Responsible factors may be natural, accidental, or intentional, with relevance to both infrastructure and security.

This article describes multidisciplinary research based on novel methods for hotspot detection and prioritization, driven by a diverse variety of case studies of interest to agencies, academia, and the private sector. These case studies concern critical societal issues, such as public health, ecosystem health, biodiversity and threats to biodiversity, emerging infectious disease, water management and conservation, carbon sources and sinks, persistent poverty, environmental justice, crop pathogens, invasive species management, biosurveillance, biosecurity, disease biogeoinformatics, social networks, sensor networks, hospital networks and syndromic surveillance, video mining, early warning, tsunami inundation, remote sensing, and disaster management.

Our approach involves an innovation of the popular circle-based spatial scan statistic. In particular, it employs the notion of an upper level set and is accordingly called the upper level set scan statistic system. It points to the next generation of sophisticated analytical and computational systems, effective for the detection of arbitrarily shaped hotspots along spatiotemporal dimensions. It also involves a novel prioritization scheme based on multiple indicators and stakeholder criteria; by working with Hasse diagrams and partially ordered sets, the scheme avoids reducing the indicators to a single index. It is accordingly called the poset prioritization and ranking system. See Patil and Taillie (2004a, 2004b).
The following websites have additional information:
(1) http://www.stat.psu.edu/hotspots/
(2) http://www.stat.psu.edu/~gpp/
(3) http://www.digitalgovernment.org/news/stories/2004/1104/1104_hotspots_heyman.jsp

UPPER LEVEL SET SCAN STATISTIC HOTSPOT SYSTEM

Patil and Taillie (2004a, 2004b) introduce an innovation of the circle-based spatial and spatiotemporal scan statistic that is popular in the health area. It employs the notion of an upper level set, and is accordingly called the upper level set (ULS) scan statistic. It points to a sophisticated analytical and computational system as the next generation of the present-day popular SaTScan (Kulldorff and Nagarwalla, 1995; Kulldorff, 1997; Kulldorff et al., 1998; Kulldorff, 2001; Mostashari et al., 2002; Waller, 2002).

Fig. 1. Limitations of circular scanning windows. (Left) An irregularly shaped cluster, perhaps a cholera outbreak along a winding river floodplain. Small circles miss much of the outbreak and large circles include many unwanted cells. (Right) Circular windows may report a single irregularly shaped cluster as a series of small clusters.

Background Theory of Scan Statistics

The spatial scan statistic concerns the following situation: A region $R$ of Euclidean space is tessellated or subdivided into cells, which will be denoted by the symbol $a$. Data are available in the form of a count $Y_a$ on each cell $a$. In addition, a “size” value $A_a$ is associated with each cell. The cell sizes $A_a$ are regarded as fixed and known, while the cell counts $Y_a$ are independent random variables. Two distributional settings are commonly studied:

Binomial: The size $A_a = N_a$ is a positive integer and $Y_a \sim \mathrm{Binomial}(N_a, p_a)$, where $p_a$ is an unknown parameter attached to cell $a$ with $0 < p_a < 1$.

Poisson: The size $A_a$ is a positive real number and $Y_a \sim \mathrm{Poisson}(\lambda_a A_a)$, where $\lambda_a > 0$ is an unknown parameter attached to cell $a$.

Each distributional model has a simple interpretation.
For the binomial, $N_a$ people reside in cell $a$ and each person contracts a certain disease independently with probability $p_a$. The cell count $Y_a$ is the number of diseased people. For the Poisson, $A_a$ is the size (e.g., area or some adjusted population size) of the cell $a$, and $Y_a$ is a realization of a Poisson process with intensity $\lambda_a$. In each scenario, the responses $Y_a$ are independent; it is assumed that spatial variability can be accounted for by cell-to-cell variation in model parameters.

The spatial scan statistic seeks to identify “hotspots” or “clusters” of cells having an elevated response with respect to the remainder of the region. Elevated response means large values for the rates (or intensities), $G_a = Y_a / A_a$, instead of the raw counts $Y_a$. The scan statistic easily accommodates other adjustments, such as for age or gender.

A collection of cells from the tessellation should satisfy several geometric properties before it can be considered as a candidate for a hotspot cluster. First, the union of the cells should comprise a geographically connected subset of the region $R$ (Fig. 2). Such collections of connected cells will be referred to as zones $Z$, and the set of all zones is denoted by $\Omega$. Second, the zone should not be excessively large. Otherwise, the zone instead of its exterior would constitute background. This restriction is generally achieved by limiting the search for hotspots to zones comprising less than, say, fifty percent of the region.

Fig. 2. A tessellated region. The collection of shaded cells in the left-hand diagram is connected and, therefore, constitutes a zone in $\Omega$. The collection on the right is not connected.

The notion of a hotspot is inherently vague and lacks any a priori definition. There is no “true” hotspot in the statistical sense of a true parameter value. A hotspot is instead defined by its estimate, provided the estimate is statistically significant.
To this end, the scan statistic adopts a hypothesis testing model in which the hotspot occurs as an unknown zonal parameter in the statement of the alternative hypothesis.

The traditional spatial scan statistic uses expanding circles to determine a reduced list $\Omega_0$ of candidate zones $Z$. By their very construction, these candidate zones tend to be compact in shape and may do a poor job of approximating actual clusters. The reduced parameter space of the circular scan statistic is determined entirely by the geometry of the tessellation and does not involve the data in any way. We propose a scan statistic that takes an adaptive point of view in which $\Omega_0$ depends very much upon the data. Furthermore, $\Omega_0$ induces a tree structure useful for visualization and for expressing uncertainty of hotspot clusters in the form of a hotspot confidence set on the tree.

Although the traditional spatial scan statistic is applicable only to tessellated data, the ULS approach has an abstract graph (i.e., vertices and edges) as its starting point. Accordingly, this approach can also be applied to data defined over networks, such as subway, water or highway systems. There is complete flexibility regarding the definition of adjacency. For example, one may declare two cells as adjacent (i) if their boundaries have at least one point in common, (ii) if their common boundary has positive length, or (iii) in the case of a drainage network, if the flow is from one cell to the next.

ULS Scan Statistic

The ULS scan statistic is an adaptive approach in which the reduced parameter space $\Omega_0 = \Omega_{ULS}$ is determined from the data using the empirical cell rates $G_a = Y_a / A_a$. These rates determine a function $a \mapsto G_a$ defined over the cells in the tessellation. This function has only finitely many values and each level $g$ defines an upper level set (ULS)

$$U_g = \{a : G_a \ge g\}.$$

[Figure 3: schematic response “surface” of the rate $G$ over the region $R$, showing levels $g$ and $g'$ and zones $Z_1$ to $Z_6$.]
Fig. 3. Schematic response surface with two response levels, $g$ and $g'$.
The upper level set determined by $g$ has three connected components, $Z_1$, $Z_2$ and $Z_3$; that determined by $g'$ has $Z_4$, $Z_5$ and $Z_6$ as its connected components. The diagram also illustrates the three ways in which connectivity can change as the level drops from $g$ to $g'$: (i) zones $Z_1$ and $Z_2$ grow in size and eventually coalesce into a single zone $Z_4$, (ii) zone $Z_3$ simply grows to $Z_5$, and (iii) zone $Z_6$ is newly emergent.

Since upper level sets do not have to be geographically connected (Fig. 3), we take the reduced list of candidate zones $\Omega_{ULS}$ to consist of all connected components of all possible upper level sets. The zones in $\Omega_{ULS}$ are plausible as potential hotspots since they are portions of upper level sets of the response rate. The number of zones is small enough for practical maximum likelihood search; in fact, the size of $\Omega_{ULS}$ does not exceed the number of cells in the tessellation.

A ULS-tree can be defined on the reduced parameter space $\Omega_{ULS}$. Its nodes are the zones $Z \in \Omega_{ULS}$ and are therefore collections of vertices from the abstract graph. Leaf nodes are typically singleton vertices at which the response rate attains a local maximum. The root node consists of all connected vertices in the abstract graph. Fig. 4 shows the tree structure for the surface from Fig. 3.

[Figure 4: schematic intensity “surface” (intensity $G$, levels $g$ and $g'$, zones $Z_1$ to $Z_6$) with junction nodes A, B and C. N.B. The intensity surface is cellular (piece-wise constant), with only finitely many levels. A, B, C are junction nodes where multiple zones coalesce into a single zone.]
Fig. 4. ULS connectivity tree for the schematic surface displayed in Fig. 3. The four leaf nodes correspond to surface peaks. The root node represents the entire region. Junction nodes (A, B and C) occur when two (or more) connected components coalesce into a single connected component.

A consequence of the adaptivity of the ULS approach is that $\Omega_{ULS}$ must be recalculated for each replicate in a simulation study. Efficient algorithms are needed for this calculation.
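One way such a calculation might be organized is sketched below. It is an illustrative sketch only, not the authors' implementation: it lowers the level through the distinct rate values and merges cells with a union-find structure, recording each connected component that changes as a candidate zone. The function name `uls_zones`, the input layout, and the five-cell example are assumptions made for illustration, and the restriction on zone size (for example, less than fifty percent of the region) is not applied here.

```python
from collections import defaultdict


def uls_zones(rates, adjacency):
    """Return candidate ULS zones as frozensets of cells (illustrative sketch).

    rates:     dict mapping cell -> empirical rate G_a = Y_a / A_a
    adjacency: dict mapping cell -> iterable of adjacent cells (symmetric)
    """
    parent = {}  # union-find forest over the cells admitted so far

    def find(a):
        # Follow parent links to the component root, halving paths on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group cells by rate so that tied cells enter at the same level.
    cells_at_level = defaultdict(list)
    for cell, g in rates.items():
        cells_at_level[g].append(cell)

    zones = []
    for g in sorted(cells_at_level, reverse=True):  # lower the level step by step
        new_cells = cells_at_level[g]
        for cell in new_cells:
            parent[cell] = cell
        for cell in new_cells:
            for neighbour in adjacency.get(cell, ()):
                if neighbour in parent:  # neighbour already in the upper level set
                    union(cell, neighbour)
        # Only components containing a newly admitted cell are new zones;
        # unchanged components were recorded at a higher level.
        touched = {find(cell) for cell in new_cells}
        members = defaultdict(set)
        for cell in parent:  # simple but O(n) per level; fine for a sketch
            root = find(cell)
            if root in touched:
                members[root].add(cell)
        zones.extend(frozenset(m) for m in members.values())
    return zones


# Hypothetical example: five cells on a path a - b - c - d - e.
example_rates = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 5.0, "e": 4.0}
example_adjacency = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"],
}
zones = uls_zones(example_rates, example_adjacency)
```

Each new zone contains at least one newly admitted cell, which is why the number of zones in this sketch cannot exceed the number of cells. Recording, for each new zone, which earlier zones it absorbed would give the parent-child links of the ULS-tree.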
Several generic algorithms are available in the computer science literature (Cormen et al., 2001, Section 22.3 for depth first search; Knuth, 1973, p. 353 or Press et al., 1992, Section 8.6 for transitive closure).

Hotspot Membership Rating

Zonal estimation uncertainty is visually depicted by inner and outer envelopes, where the outer envelope consists of all cells belonging to at least one zone in the confidence set. Cells in the inner envelope belong to all of the zones in the confidence set. In other words, the outer envelope is the union of all zones in the confidence set while the inner envelope is their intersection (Fig. 5; Fig. 6).

[Figure 5: nested regions labeled MLE, outer envelope, and inner envelope.]
Fig. 5. Estimation uncertainty in hotspot delineation. Cells in the inner envelope belong to all plausible estimates (at specified confidence level); cells in the outer envelope belong to at least one plausible estimate. The MLE is nested between the two envelopes.

A numerical rating may also be assigned to each cell for inclusion in the hotspot. The rating is the percentage of zones in the confidence set that include the cell under consideration. The inner envelope consists of cells receiving a 100% rating while the outer envelope contains the cells with a nonzero rating. A map of these ratings, with the superimposed MLE, provides a visual display of uncertainty of the hotspot delineation.

Typology of Space-Time Hotspots

Scan statistic methods extend readily to the detection of hotspots in space-time. A space-time version of the circle-based scan statistic employs cylindrical extensions of spatial circles, but cylinders are often unable to adequately represent the temporal evolution of a hotspot (Fig. 7). The space-time generalization of the ULS scan statistic can detect arbitrarily shaped hotspots in space-time (Patil and Taillie, 2004a). This lets us classify space-time hotspots into various evolutionary types, a few of which appear on the left-hand side of Fig. 8.
The merging hotspot is particularly interesting because, while it comprises a connected zone in space-time, several of its time slices are spatially disconnected. The diagrams in Fig. 8 are motivated by a study on “trajectories of persistent poverty in the US” being conducted by Amy Glasmeier of Penn State University.

[Figure 6: tessellated region R and its ULS tree, with labels MLE, junction node, alternative hotspot delineation, and alternative hotspot locus.]
Fig. 6. A confidence set of hotspots on the ULS tree. The different connected components correspond to different hotspot loci while the nodes within a connected component correspond to different delineations of that hotspot, all at the appropriate confidence level.

[Figure 7: a hotspot in space-time and its cylindrical approximation; axes: space and time (census year). Panel note: cylindrical approximation sees single hotspot as multiple hotspots.]
Fig. 7. Temporal evolution of a spatial hotspot is represented by the shape of the hotspot in space-time. Cylinders may not adequately capture this shape.

[Figure 8: panels Stationary Hotspot, Shifting Hotspot, Expanding Hotspot, and Merging Hotspot, plus time slices; axes: space (census tract) and time (census year: 1970, 1980, 1990, 2000).]
Fig. 8. The four diagrams on the left depict different types of space-time hotspots. The spatial dimension is represented schematically on the horizontal axis while time is on the vertical axis. The diagrams on the right show the trajectory (sequence of time slices) of a merging hotspot.

PARTIALLY ORDERED SET HOTSPOT PRIORITIZATION SYSTEM

The prioritization system of hotspot geoinformatics is concerned with the ranking of a finite collection of objects when a suite of indicator values is available for each member of the collection.
The objects can be represented as a configuration of points in indicator space, but the different indicators typically convey different comparative messages and there is no unique way to rank the objects while taking all indicators into account. A traditional approach is to assign a composite numerical score to each object by combining the indicator information in some fashion. Consciously or otherwise, every such composite involves judgments (often arbitrary or controversial) about tradeoffs or substitutability among indicators.

Rather than attempting to combine indicators, Patil and Taillie (2004b) take the view that the relative positions in indicator space determine only a partial ordering and that a given pair of objects may not be inherently comparable. Working with Hasse diagrams of the partial order, they study the collection of all rankings compatible with the partial order. In this way, an interval of possible ranks is assigned to each object. The intervals can be very wide. Noting, however, that ranks near the ends of each interval are usually infrequent under linear extensions, a distribution is obtained over the interval of possible ranks. This distribution, called the rank-frequency distribution, is unimodal, log-concave and represents the degree of ambiguity involved in attempting to assign a rank to the corresponding object. Stochastic ordering of distributions imposes a partial order on the collection of rank-frequency distributions. This collection of distributions is in one-to-one correspondence with the original collection of objects and the induced ordering on these objects is called the cumulative rank-frequency (CRF) ordering, extending the original partial order.

For example, Fig. 9 shows the Hasse diagram for a small partially ordered set (poset) with six objects, labeled a through f. The decision tree on the right enumerates all possible linear extensions of the poset, where each path through the tree determines a linear extension.
In this example, there are a total of 16 linear extensions. Object a is assigned rank 1 by nine of those extensions, rank 2 by five of the extensions, and rank 3 by the remaining two extensions. The cumulative rank frequencies for object a are thus 9, 9+5=14, and 9+5+2=16. These determine a cumulative rank profile for object a, as shown in Fig. 10, and similarly for the other five objects.

[Figure 9: Hasse diagram of poset B with objects a through f (left) and linear extension decision tree (right).]
Fig. 9. Hasse diagram and corresponding linear extension tree. The linear extension tree enumerates all admissible linear extensions of the poset. Dashed links in the decision tree are not implied by the partial order and are called jumps. If one tries to trace the linear extension in the original Hasse diagram, a jump would be required at each dashed link.

[Figure 10: cumulative frequency (0 to 16) plotted against rank (1 to 6), one profile for each of the objects a through f.]
Fig. 10. Cumulative rank-frequency distribution for the poset in Fig. 9. For this example, the six profiles are stacked one above the other, thus determining a linear ordering of the objects.

The CRF operator treats each linear extension as an equal “voter” in determining the CRF ranking. It is possible to generalize to a weighted CRF operator by giving linear extensions differential weights either on mathematical grounds (e.g., number of jumps) or empirical grounds (e.g., indicator concordance).

Explicit enumeration of all possible linear extensions is computationally impractical unless the number of objects is quite small. When enumeration is impractical, the rank-frequencies can be estimated using discrete Markov chain Monte Carlo (MCMC) methods.
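The rank-frequency computation can be illustrated on a very small poset by brute-force enumeration. The sketch below is an assumption-laden illustration rather than the method of Patil and Taillie (2004b): the function names are invented here, and the four-object poset is hypothetical (it is not the six-object poset of Fig. 9). It lists every permutation, keeps those compatible with the partial order, and tallies how often each object receives each rank.

```python
from itertools import accumulate, permutations


def linear_extensions(objects, above):
    """Yield every ranking (rank 1 first) compatible with the partial order.

    above: set of pairs (x, y) meaning x must be ranked above y.
    Brute force over all permutations, so only usable for tiny posets.
    """
    for perm in permutations(objects):
        position = {obj: i for i, obj in enumerate(perm)}
        if all(position[x] < position[y] for x, y in above):
            yield perm


def rank_frequencies(objects, above):
    """Count, for each object, how many linear extensions give it each rank."""
    freq = {obj: [0] * len(objects) for obj in objects}
    for extension in linear_extensions(objects, above):
        for rank, obj in enumerate(extension):
            freq[obj][rank] += 1
    return freq


def cumulative_profiles(freq):
    """Turn rank frequencies into cumulative rank-frequency profiles."""
    return {obj: list(accumulate(row)) for obj, row in freq.items()}


def crf_dominates(profile_x, profile_y):
    """True if x's cumulative profile lies on or above y's at every rank."""
    return all(x >= y for x, y in zip(profile_x, profile_y))


# Hypothetical poset: a above c, b above c, b above d.
objects = ["a", "b", "c", "d"]
above = {("a", "c"), ("b", "c"), ("b", "d")}

freq = rank_frequencies(objects, above)
crf = cumulative_profiles(freq)
```

For this hypothetical poset there are five linear extensions, and the four cumulative profiles are stacked one above the other in the order b, a, d, c, so the CRF ordering happens to be linear and extends the original partial order. Because the enumeration grows factorially with the number of objects, a sketch like this is only practical for very small posets.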
The resulting prioritization system has the following innovative features: it ranks and prioritizes hotspots; it uses multiple indicators and stakeholder criteria without integrating the indicators into an index; and it employs Hasse diagrams, partially ordered sets, and Markov chain Monte Carlo computations. These features lead to several key applications, including early warning systems and identification of critical areas for focused investigation.

In the area of Health Policy, Health Statistics, and Disease Etiology, the prioritization component may be combined with a hotspot detection component to yield a three-stage surveillance system:

First stage screening: Identification of significant clusters (hotspots) by an upper level set version of the scan statistic;
Second stage screening: Ranking and prioritization of significant hotspots using likelihood values and other attributes such as raw intensity values, remediation-feasibility scores, and socio-economic and demographic factors;
Third stage screening: Follow-up of hotspots for etiology and/or intervention.

For more details, see Patil and Taillie (2004b).

SELECT CASE STUDIES

In response to an ever increasing volume of georeferenced data, government agencies require a new generation of decision support systems for early detection, surveillance, and prioritization of hotspots. A decision support framework for geographic and network surveillance, using systems involving upper level sets and partially ordered sets, is applicable to a variety of important case studies, such as:

1. Cyber security and computer network diagnostics;
2. Tasking of self-organizing surveillance mobile sensor networks;
3. Drinking water quality and water utility vulnerability;
4. Surveillance network and early warning;
5. West Nile virus;
6. Crop pathogens and bioterrorism;
7. Disaster management: Oil spill detection, monitoring, and prioritization;
8. Network analysis of biological integrity in freshwater streams.
The framework can be applied to irregular networks, such as those formed by streams (Fig. 11), political units, social networks, and the internet. When applied to data collected over both space and time, the ULS scan statistic system may be used to detect shifting poverty hotspots (Fig. 11), coalescence of neighboring hotspots, or their growth.

Fig. 11. Data on a network of streams (left), and shifting poverty hotspots (right).

Protecting the nation’s computer networks from cyber attack is an important homeland security priority requiring diagnostic tools for detecting security attacks and infrastructure failures. A probabilistic finite state automaton (PFSA) describing a network element is obtained from its data stream output. The variational distance between the stochastic languages generated by normal and crisis automata may be used to form a crisis index. The ULS scan statistic is then applied to the crisis indices over a collection of network elements for hotspot detection. These hotspots and their prioritization can be used to detect coordinated attacks geographically spread over a network. Additional applications of PFSA include the tasking of self-organizing surveillance mobile sensor networks, geotelemetry with wireless sensor networks, video mining networks, and syndromic surveillance in public health.

Fig. 12. Framework for probabilistic finite state automata (left), and a metric for measuring the distance between two finite state automata (right).

The National Tsunami Hazard Mitigation Program (NTHMP) is the first systematic national effort for the production of inundation maps essential for tsunami hazard planning and mitigation. Without a clear understanding of what areas are most at risk, it is not possible to develop effective emergency response plans involving population and infrastructure vulnerability and evacuation routes (Gonzalez, 2001).
Inundation maps enable the construction of tsunami risk maps, where risk is the hazard times the exposure; for example, the probability that a particular grid cell is struck by a tsunami times the number of people occupying that cell. These form risk surfaces defined over tessellations of grid cells in regions under consideration. For purposes of optimal disaster management planning, it is essential to have the capability to recognize priority high risk areas with minimal false alarms. The tsunami disaster management application motivates new research, expanding the scope of the methods to geospatial continuous response risk variables with skewed distributions, and to hotspot trajectories representing changing spatial patterns of inundated areas with increasing tsunami severity. Understanding the latter typology may impact planning of evacuation routes. Under an expanding hotspot scenario, traffic is always directed outwards from the hotspots, but under merging hotspots, a portion of the traffic may be directed through regions between hotspots when the tsunami is predicted to be small.

Another significant contribution to tsunami disaster management will be to prioritize and rank risk hotspots, detected at specified confidence levels, with respect to multiple criteria, stakeholders, and indicators without reduction to a single index. Examples of such criteria may include the number of people at risk and the economic value of infrastructure, buildings, and their contents.

Fig. 13. Portion of a community projected to be inundated by a tsunami as predicted under a tsunami inundation model for a given earthquake scenario (blue region on the left). Two typologies expected under tsunamis of increasing severity (right).

CONCLUSION

Government agencies often require concise summaries of georeferenced data to support their decisions regarding the geographic allocation of resources.
Geoinformatic surveillance for spatial and spatiotemporal hotspot detection and prioritization is a critical need for the 21st century. A hotspot can mean an unusual phenomenon, anomaly, aberration, outbreak, or critical area. Hotspot delineation and prioritization may be required for etiology, management, or early warning. The article briefly describes a prototype Geoinformatic Hotspot Surveillance (GHS) system for hotspot delineation and prioritization (Fig. 14) in a variety of case studies of critical societal importance. The prototype system comprises modules for (1) hotspot detection and delineation, and (2) hotspot prioritization.

[Figure 14: flow diagram of the Geoinformatic Surveillance System with boxes labeled: geoinformatic spatio-temporal data from a variety of data products and data sources with agencies, academia, and industry; masks, filters; spatially distributed response variables; hotspot analysis; masks, filters; indicators, weights; prioritization; decision support systems.]
Fig. 14. Framework for the Geoinformatic Hotspot Surveillance (GHS) system.

REFERENCES

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2001). Introduction to Algorithms, Second Edition. MIT Press, Cambridge, Massachusetts.
Gonzalez, F. I. (2001). The NTHMP inundation mapping program. In Proceedings of the International Tsunami Symposium 2001. Seattle, August 7-10, pp. 29-54.
Knuth, D. E. (1973). The Art of Computer Programming: Volume 1, Fundamental Algorithms, Second Edition. Addison-Wesley, Reading, Massachusetts.
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics: Theory and Methods 26, 1481-1496.
Kulldorff, M. (2001). Prospective time-periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society, Series A 164, 61-72.
Kulldorff, M., Feuer, E. J., Miller, B. A., and Freedman, L. S. (1997). Breast cancer clusters in Northeast United States: A geographic analysis. American Journal of Epidemiology 146, 161-170.
Kulldorff, M. and Nagarwalla, N. (1995). Spatial disease clusters: Detection and inference. Statistics in Medicine 14, 799-810.
Kulldorff, M., Rand, K., Gherman, G., Williams, G., and Defrancesco, D. (1998). SaTScan version 2.1: Software for the spatial and space-time scan statistics. National Cancer Institute, Bethesda, MD.
Mostashari, F., Kulldorff, M., and Miller, J. (2002). Dead bird clustering: An early warning system for West Nile virus activity. Manuscript prepared for the New York City West Nile Virus Surveillance Working Group. Under review.
Patil, G. P. (2005). Geoinformatic surveillance of hotspot detection, prioritization, and early warning. Demo for 6th Annual National Conference on Digital Government Research, Atlanta, GA.
Patil, G. P., and Taillie, C. (2004a). Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics 11, 183-197.
Patil, G. P., and Taillie, C. (2004b). Multiple indicators, partially ordered sets, and linear extensions: Multi-criterion ranking and prioritization. Environmental and Ecological Statistics 11, 199-228.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C, Second Edition. Cambridge University Press, Cambridge.

NOTE: This research was supported by the National Science Foundation Digital Government Program Award Number EIA-0307010. Partner federal agencies include DOD, DOT, EPA, NASA, NCHS, NCI, NIEHS, USFS, and USGS with USGS as the coordinating agency. The contents have not been subjected to Agency review and therefore do not necessarily reflect the views of the Agencies and no official endorsement should be inferred.

TERMS AND DEFINITIONS

Hotspot: A connected subset of the study region with statistically significant elevated rates of disease, poverty, accidents, or any other relevant georeferenced phenomenon.
Upper Level Set Scan Statistic System: Adaptive system for hotspot detection and delineation based on upper level sets in georeferenced data.
Poset Prioritization and Ranking System: Nonparametric approach to ranking objects using multiple indicators based on cumulative rank functions constructed from Hasse diagrams and linear extension trees.
Hotspot Rating: Confidence level that a given cell belongs to a hotspot.
Typology of Space-Time Hotspots: Classification of the trajectories of hotspots over time when the upper level set scan statistic system is applied to space-time data.
Early Warning: Alert to a pending disaster.
Digital Government Case Studies: Investigations of interest to society demonstrating the efficacy of proposed digital informatic approaches to handling government databases.