A spatial neighborhood graph approach to Regional Co-location pattern discovery: summary of results
Pradeep Mohan*, Shashi Shekhar, Zhe Jiang
University of Minnesota, Twin-Cities, MN
James A. Shine, James P. Rogers, Nicole Wayant
US Army-ERDC, Topographic Engineering Center, Alexandria, VA
*Contact: mohan@cs.umn.edu

Outline
- Motivation
- Problem Formulation
- Computational Approach
- Conclusions and Future work

Motivation: Spatial Heterogeneity, the second law of Geography
Spatial heterogeneity (Goodchild, 2004; Goodchild 2003)
- Expectations vary across space.
- Global models may not explain locally observed phenomena.
- Need for place-based analysis.
Spatial heterogeneity in retail
- Traditional data mining: Which pairs of items sell together frequently? Answer: Diaper in transaction → Beer in transaction.
- Is this association true everywhere? Answer: Blue-collar neighborhoods.
Global spatial data mining: global co-location patterns
- Which pairs of spatial features are located together frequently? Example: gas stations and convenience stores.
Our focus
- Where do certain pairs of spatial features co-locate frequently? Example: Assaults happen frequently around downtown bars.

Applications
Crime analysis
- Localizing frequent crime patterns; opportunities for crime vary across space!
- Predicting localities of the next crime.
- Question: Do downtown bars lead to assaults more frequently?
Public health
- Localizing elevated disease risks around putative sources (e.g. mining areas).
- Question: Where does high asbestos concentration often lead to lung cancer?
Ecology
- Localizing symbiotic relationships between different species of plants / animals.
- Question: Where are Plover birds frequently found in the vicinity of a crocodile?
Image courtesy: www.amazon.com, www.startribune.com

Regional co-location patterns (RCP)
- Input: Spatial features, crime reports.
- Output: RCP (e.g. <(Bar, Assaults), Downtown>)
- Subsets of spatial features.
- Frequently located in certain regions of a study area.

Outline
- Motivation
- Problem Formulation
  - Basic Concepts
  - Problem Statement
  - Challenges
  - Related Work
- Computational Approach
- Conclusions and Future work

Basic Concepts: Neighborhoods
Prevalence locality
- Subsets of the spatial framework containing instances of a pattern.
- Simple representation to visualize: convex hull. Other representations are possible.
Neighborhood graph
- Given: a spatial neighbor relation (spatial neighborhood size).
- Nodes: individual event instances.
- Edges: presence (if the neighbor relation is satisfied).
- Based on the event-centric model (Huang, 2004).

Basic Concepts: Quantifying regional interestingness
Regional Participation Ratio (RPR)
- Conditional probability of observing a pattern instance within a locality, given an instance of a feature within that locality.
- $\mathrm{RPR}(\mathrm{RCP}, M) = \dfrac{\#\ \text{instances of event type } M \text{ participating in } \mathrm{PR}(\mathrm{RCP})}{\#\ \text{instances of } M \text{ in dataset}}$
- Example: $\mathrm{RPR}(\langle\{ABC\}, PL_2\rangle, A) = \frac{2}{4}$; $\mathrm{RPR}(\langle\{ABC\}, PL_2\rangle, B) = \frac{2}{6}$; $\mathrm{RPR}(\langle\{ABC\}, PL_2\rangle, C) = \frac{1}{4}$
Regional Participation Index (RPI)
- Quantifies the local fraction participating in a relationship.
- $\mathrm{RPI}(\mathrm{RCP}) = \min_{M}\{\mathrm{RPR}(\mathrm{RCP}, M)\}$
- Example: $\mathrm{RPI}(\langle\{ABC\}, PL_2\rangle) = \min\left\{\frac{2}{4}, \frac{2}{6}, \frac{1}{4}\right\} = \frac{1}{4}$

Detailed Statement
Given:
- A spatial framework.
- A collection of boolean spatial event types and their instances.
- A minimum interestingness threshold, Pθ.
- A symmetric and transitive neighbor relation R (e.g. based on spatial neighborhood size).
Find: All RCPs with prevalence >= Pθ.
Objective: Minimize computational cost.
Constraints:
(i) Spatial framework is heterogeneous.
(ii) Interest measure captures spatial heterogeneity.
(iii) Completeness: All prevalent RCPs are reported.
(iv) Correctness: Only prevalent RCPs are reported.
*Prevalence threshold = 0.25
*Spatial neighborhood size = 1 mile

Challenges
Conflicting requirements
- Interest measure captures spatial heterogeneity while supporting scalable algorithms.
Exponential search space
- Candidate pattern set cardinality is exponential in the number of event types.
[Illustration: statistical rigor, computational scalability]
Illustration: candidate pattern lattice for event types A, B, C
  {NULL}
  A    B    C
  AB   AC   BC
  ABC

  n    # Patterns   O(2^M)
  3    4            k1 * 2^3 (k1 > 0)
  4    11           k2 * 2^4 (k2 > 0)
  5    26           k3 * 2^5 (k3 > 0)
  6    57           k4 * 2^6 (k4 > 0)

Contributions
Regional co-location patterns
- Neighborhood-based formulation.
Interest measure
- Captures the local fraction of events participating in patterns.
- Shows attractive computational properties.
- Honors spatial heterogeneity.
Computational approach
- Computational structure: pattern space enumeration.
- Performance enhancement: maximal locality based pruning strategies.
Experimental evaluation
- Performance evaluation using real datasets (Lincoln, NE).
- Real-world case study.

Related Work
Approaches for regional co-location pattern discovery
- Zoning based (Celik et al., 2007)
- Fitness function clustering (Eick et al., 2008)
- Spatial neighborhood based (our work)
Zoning based
Fitness function clustering
- Reports one pattern per interesting region based on a criterion (e.g. max).
- Computational structure and pruning strategies not explored.
- Clustering is based on real-valued attributes.

Outline
- Motivation
- Problem Formulation
- Computational Approach
  - Pattern Space Enumeration
  - Performance Tuning
  - Experimental Evaluation
- Conclusions and Future work

Computational Approach
Prevalence threshold = 0.25
Key idea
- Enumerate the entire pattern space. Expensive!
- Examine each pattern and prune.
Pattern lattice: {Null}; A, B, C; candidate RCPs with their RPI values:

  Candidate RCP          RPI    Status
  <{AB},PL1({AB})>       0.16   ✕
  <{AB},PL2({AB})>       0.33   ✔
  <{AB},PL3({AB})>       0.25   ✔
  <{AC},PL1({AC})>       0.25   ✔
  <{AC},PL2({AC})>       0.25   ✔
  <{AC},PL3({AC})>       0.25   ✔
  <{BC},PL1({BC})>       0.16   ✕
  <{BC},PL2({BC})>       0.25   ✔
  <{BC},PL3({BC})>       0.16   ✕
  <{BC},PL4({BC})>       0.16   ✕
  <{ABC},PL1({ABC})>     0.16   ✕
  <{ABC},PL2({ABC})>     0.25   ✔
  <{ABC},PL3({ABC})>     0.25   ✔
Steps: compute neighborhoods; identify candidate RCP instances.
Legend: ✕ pruned RCP, ✔ accepted RCP.

Performance Tuning: Key Ideas
Key idea
- The interest measure shows special pruning properties in certain subsets of the spatial framework.
Maximal locality: key properties
- Collection of connected instances.
- Maximal localities are mutually disjoint.
- Contains several RCPs.
Key observations
- RPI is anti-monotone within maximal localities: pruning a co-location {AB} prunes all its supersets (e.g. {ABC}, {ABCD}, etc.).
- RPI within a maximal locality is an upper bound on the RPI of its constituent prevalence localities.

Performance Tuning
Prevalence threshold = 0.25
[Figure: lattice over {Null}, A, B, C evaluated per maximal locality ML1, ML2, ML3 (step: Compute Maximal Locality). Labels shown: {AB},0.167; {AC},0.25; {BC},0.167; {AB},0.25; {BC},0.33; {AC},0.25; ✕ <{BC},PL3({BC})>,0.167; ✕ <{BC},PL4({BC})>,0.167; <{AC},PL1({AC})>,0.25; "No RCP" (twice); "{ABC}: Pruned Automatically".]
Completeness
- Pruning a pattern within a maximal locality does not prune any valid RCPs.
- Due to the upper bound property of RPI.
- Due to the anti-monotonicity of RPI.
Correctness
- Accepting a pattern involves additional checks so that only prevalent RCPs are reported.

Experimental Evaluation: Spatial Neighborhood Size
What is the effect of spatial neighborhood size on the performance of different algorithms?
Fixed parameters: dataset size: 7498 instances; # features: 5; prevalence threshold: 0.07
[Plots: Run Time; # of RCPs]
Trends
- Run time: ML Pruning outperforms PS Enumeration by a factor of 1.5 - 5.
- # of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 4.13 - 19.

Experimental Evaluation: Feature Types
What is the effect of the number of feature types on the performance of different algorithms?
Fixed parameters: dataset size: 7498 instances; spatial neighborhood size: 800 feet; prevalence threshold: 0.07
[Plots: Run Time; # of RCPs]
Trends
- Run time: ML Pruning outperforms PS Enumeration by a factor of 1.2.
- # of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 1.6 - 3.5.

Real Dataset Case Study
Q: Where do assaults frequently occur around bars? Are there other factors?
Dataset: Lincoln, NE, crime data (Winter '07); neighborhood size = 0.25 miles; prevalence threshold = 0.07
[Maps: RCP of Larceny and Assaults; RCP of Bar and Assaults; RCP of Larceny, Bars and Assaults]
Observations
- Assaults are more likely to be found in areas reporting larceny crimes (e.g. 47.6% vs 21.1%).
- Bars in downtown are more likely to be crime-prone than bars in other areas (e.g. 21.1%, 20.1%).

Conclusion and Future work
Conclusions
- Neighborhood-based formulation of regional spatial patterns.
- Regional Participation Index: measures the local fraction of the global count.
- Vector representation for prevalence localities (other representations possible; convex for simplicity).
Future work
- Other representations for prevalence localities.
- Statistical interpretation: LISA statistics / variants of local Ripley's K, multiple hypothesis testing.
- Interpretation using predictive methods (e.g. Geographically Weighted Regression).

Acknowledgement
- Reviewers of ACM GIS.
- Members of the Spatial database and spatial data mining group, UMN.
- U.S. Department of Defense.
- Mr. Tom Casady and Kim Koffolt.
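
Supplementary sketch: Regional Participation Ratio and Index
The interest measure from the Basic Concepts slide can be illustrated with a few lines of Python. This is only a minimal sketch, not the authors' implementation: the function names, the dictionary-based representation of pattern instances, and the toy instance identifiers (A1, B1, ...) are all assumptions made for illustration. The toy counts are chosen so that the three ratios come out as 2/4, 2/6 and 1/4, the values used in the slide's RPI example.

```python
from fractions import Fraction


def regional_participation_ratio(pattern_instances, feature, total_counts):
    """RPR: distinct instances of `feature` that take part in the pattern
    instances of one prevalence locality, divided by the number of
    instances of `feature` in the whole dataset."""
    participating = {instance[feature] for instance in pattern_instances}
    return Fraction(len(participating), total_counts[feature])


def regional_participation_index(pattern_instances, features, total_counts):
    """RPI: the minimum RPR over the feature types of the pattern."""
    return min(
        regional_participation_ratio(pattern_instances, f, total_counts)
        for f in features
    )


# Hypothetical toy data (assumed, not taken from the slide figure):
# dataset-wide instance counts per feature type.
total_counts = {"A": 4, "B": 6, "C": 4}

# Hypothetical instances of pattern {ABC} inside one prevalence locality.
# Each pattern instance maps a feature type to the participating event id.
pl2_instances = [
    {"A": "A1", "B": "B1", "C": "C1"},
    {"A": "A2", "B": "B2", "C": "C1"},
]

features = ("A", "B", "C")
rpi = regional_participation_index(pl2_instances, features, total_counts)

# Prevalence threshold of 0.25, as on the Detailed Statement slide.
prevalence_threshold = Fraction(1, 4)
is_prevalent = rpi >= prevalence_threshold

print(rpi, is_prevalent)  # 1/4 True
```

Exact fractions are used instead of floats so that a value sitting exactly on the threshold (here 1/4 against 0.25) is compared without rounding error.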