Regional Co-location Patterns - Spatial Database Group

A Spatial Neighborhood Graph Approach to Regional Co-location Pattern Discovery: Summary of Results

Pradeep Mohan*, Shashi Shekhar, Zhe Jiang
University of Minnesota, Twin Cities, MN
James A. Shine, James P. Rogers, Nicole Wayant
US Army ERDC, Topographic Engineering Center, Alexandria, VA
*Contact: mohan@cs.umn.edu

Outline
- Motivation
- Problem Formulation
- Computational Approach
- Conclusions and Future Work

Motivation: Spatial Heterogeneity, the Second Law of Geography

Spatial heterogeneity (Goodchild, 2003; Goodchild, 2004)
- Expectations vary across space.
- Global models may not explain locally observed phenomena.
- Hence the need for place-based analysis.

Spatial heterogeneity in retail
- Traditional data mining: Which pairs of items frequently sell together?
- Answer: Diaper in transaction => Beer in transaction.
- Is this association true everywhere? Answer: in blue-collar neighborhoods.

Global spatial data mining: global co-location patterns
- Which pairs of spatial features are frequently located together? Example: gas stations and convenience stores.

Our focus
- Where do certain pairs of spatial features frequently co-locate? Example: assaults happen frequently around downtown bars.

Applications

Crime analysis
- Localizing frequent crime patterns. Opportunities for crime vary across space.
- Predicting localities of the next crime.
- Question: Do downtown bars lead to assaults more frequently?

Public health
- Localizing elevated disease risks around putative sources (e.g., mining areas).
- Question: Where does high asbestos concentration often lead to lung cancer?
- [Image courtesy: www.amazon.com]

Ecology
- Localizing symbiotic relationships between different species of plants or animals.
- Question: Where are plover birds frequently found in the vicinity of a crocodile?
- [Image courtesy: www.startribune.com]

Regional Co-location Patterns (RCP)
- Input: spatial features, crime reports.
- Output: RCPs (e.g., <(Bar, Assaults), Downtown>).
- An RCP is a subset of spatial features that is frequently located in certain regions of a study area.

Problem Formulation
- Basic Concepts
- Problem Statement
- Challenges
- Related Work

Basic Concepts: Neighborhoods

Prevalence locality
- A subset of the spatial framework containing instances of a pattern.
- Simple representation to visualize: convex hull.
- Other representations are possible.

Neighborhood graph
- Given: a spatial neighbor relation (spatial neighborhood size).
- Nodes: individual event instances.
- Edges: present if the neighbor relation is satisfied.
- Based on the event-centric model (Huang, 2004).

Basic Concepts: Quantifying Regional Interestingness

Regional Participation Ratio (RPR)
- The conditional probability of observing a pattern instance within a locality, given an instance of a feature within that locality.

\[ \mathrm{RPR}(\mathrm{RCP}, M) = \frac{\#\,\text{instances of event type } M \text{ participating in } \mathrm{PR}(\mathrm{RCP})}{\#\,\text{instances of } M \text{ in the dataset}} \]

- Example: for the RCP <{ABC}, PL2>, the participation ratios of A, B, and C form the set {1/4, 2/4, 2/6}.

Regional Participation Index (RPI)
- Quantifies the local fraction participating in a relationship.

\[ \mathrm{RPI}(\mathrm{RCP}) = \min_{M} \{ \mathrm{RPR}(\mathrm{RCP}, M) \} \]

- Example:

\[ \mathrm{RPI}(\langle \{ABC\}, PL_2 \rangle) = \min\left\{ \tfrac{2}{4}, \tfrac{2}{6}, \tfrac{1}{4} \right\} = \tfrac{1}{4} \]

Detailed Statement

Given:
- A spatial framework.
- A collection of boolean spatial event types and their instances.
- A minimum interestingness threshold, Pθ.
- A symmetric and transitive neighbor relation R (e.g., based on spatial neighborhood size).
- Example parameters: prevalence threshold = 0.25, spatial neighborhood size = 1 mile.

Find: All RCPs with prevalence >= Pθ.

Objective: Minimize computational cost.

Constraints:
(i) The spatial framework is heterogeneous.
(ii) The interest measure captures spatial heterogeneity.
(iii) Completeness: all prevalent RCPs are reported.
(iv) Correctness: only prevalent RCPs are reported.

Challenges
- Conflicting requirements: the interest measure must capture spatial heterogeneity while supporting scalable algorithms.
- Exponential search space: the cardinality of the candidate pattern set is exponential in the number of event types.
- Statistical rigor.

Illustration: computational scalability. For three event types A, B, C, the pattern lattice is {NULL}; A, B, C; AB, AC, BC; ABC.

n (event types)   # patterns   O(2^n)
3                 4            k1 * 2^3 (k1 > 0)
4                 11           k2 * 2^4 (k2 > 0)
5                 26           k3 * 2^5 (k3 > 0)
6                 57           k4 * 2^6 (k4 > 0)

Contributions

Regional co-location patterns
- Neighborhood-based formulation.
- Interest measure: captures the local fraction of events participating in patterns, shows attractive computational properties, and honors spatial heterogeneity.

Computational approach
- Computational structure: pattern space enumeration.
- Performance enhancement: pruning strategies based on maximal localities.

Experimental evaluation
- Performance evaluation using real datasets (Lincoln, NE).
- Real-world case study.

Related Work

Approaches for regional co-location pattern discovery
- Zoning based (Celik et al., 2007).
- Fitness function clustering (Eick et al., 2008).
- Spatial neighborhood based (our work).

Fitness function clustering
- Reports one pattern per interesting region based on a criterion (e.g., max).
- Computational structure and pruning strategies are not explored.
- Clustering is based on real-valued attributes.

Computational Approach
- Pattern Space Enumeration
- Performance Tuning
- Experimental Evaluation

Computational Approach: Pattern Space Enumeration

Key idea
- Enumerate the entire pattern space. Expensive!
- Examine each pattern and prune.
- Steps shown: compute neighborhoods; identify candidate RCP instances.

Example (prevalence threshold = 0.25; lattice {Null}; A, B, C):

Candidate RCP             RPI    Status
<{AB}, PL1({AB})>         0.16   pruned
<{AB}, PL2({AB})>         0.33   accepted
<{AB}, PL3({AB})>         0.25   accepted
<{AC}, PL1({AC})>         0.25   accepted
<{AC}, PL2({AC})>         0.25   accepted
<{AC}, PL3({AC})>         0.25   accepted
<{BC}, PL1({BC})>         0.16   pruned
<{BC}, PL2({BC})>         0.25   accepted
<{BC}, PL3({BC})>         0.16   pruned
<{BC}, PL4({BC})>         0.16   pruned
<{ABC}, PL1({ABC})>       0.16   pruned
<{ABC}, PL2({ABC})>       0.25   accepted
<{ABC}, PL3({ABC})>       0.25   accepted

Performance Tuning: Key Ideas

Key idea
- The interest measure shows special pruning properties in certain subsets of the spatial framework.

Maximal locality: key properties
- A collection of connected instances.
- Maximal localities are mutually disjoint.
- A maximal locality contains several RCPs.

Key observations
- RPI is anti-monotone within maximal localities: pruning a co-location, {AB}, prunes all its supersets (e.g., {ABC}, {ABCD}, etc.).
- RPI within a maximal locality is an upper bound on the RPI of its constituent prevalence localities.

Performance Tuning: Example (prevalence threshold = 0.25)

[Figure: after computing maximal localities ML1, ML2, and ML3, each co-location ({AB}, {AC}, {BC}) is scored per maximal locality with values 0.167, 0.25, or 0.33. Co-locations below the threshold yield no RCP in that locality, e.g., <{BC}, PL3({BC})> and <{BC}, PL4({BC})> at 0.167 are pruned, while <{AC}, PL1({AC})> at 0.25 is retained. {ABC} is pruned automatically.]

- Completeness: pruning a pattern within a maximal locality does not prune any valid RCPs, due to the upper bound property and the anti-monotonicity of RPI.
- Correctness: accepting a pattern involves additional checks so that only prevalent RCPs are reported.

Experimental Evaluation: Spatial Neighborhood Size
- Question: What is the effect of spatial neighborhood size on the performance of the different algorithms?
- Fixed parameters: dataset size = 7498 instances; number of features = 5; prevalence threshold = 0.07.
- [Figures: run time; number of RCPs examined]

Trends
- Run time: ML Pruning outperforms PS Enumeration by a factor of 1.5 to 5.
- Number of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 4.13 to 19.

Experimental Evaluation: Feature Types
- Question: What is the effect of the number of feature types on the performance of the different algorithms?
- Fixed parameters: dataset size = 7498 instances; spatial neighborhood size = 800 feet; prevalence threshold = 0.07.
- [Figures: run time; number of RCPs examined]

Trends
- Run time: ML Pruning outperforms PS Enumeration by a factor of 1.2.
- Number of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 1.6 to 3.5.

Real Dataset Case Study
- Question: Where do assaults frequently occur around bars? Are there other factors?
- Dataset: Lincoln, NE, crime data (Winter '07); neighborhood size = 0.25 miles; prevalence threshold = 0.07.
- [Maps: RCP of Larceny and Assaults; RCP of Bar and Assaults; RCP of Larceny, Bars and Assaults]

Observations
- Assaults are more likely to be found in areas reporting larceny crimes (e.g., 47.6% vs. 21.1%).
- Bars in downtown are more likely to be crime prone than bars in other areas (e.g., 21.1%, 20.1%).

Conclusions and Future Work

Conclusions
- Neighborhood-based formulation of regional spatial patterns.
- Regional Participation Index: measures the local fraction of the global count.
- Vector representation for prevalence localities (other representations are possible; convex for simplicity).

Future work
- Other representations for prevalence localities.
- Statistical interpretation: LISA statistics, variants of local Ripley's K, multiple hypothesis testing.
- Interpretation using predictive methods (e.g., Geographically Weighted Regression).

Acknowledgement
- Reviewers of ACM GIS.
- Members of the Spatial Database and Spatial Data Mining Group, UMN.
- U.S. Department of Defense.
- Mr. Tom Casady and Kim Koffolt.
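
The maximal localities described under Performance Tuning (collections of connected instances that are mutually disjoint) can be read as the connected components of the neighborhood graph. The following is a minimal sketch under stated assumptions, not the authors' implementation: the neighbor relation is taken to be a Euclidean distance threshold, any two instances within that distance are linked regardless of feature type, and the function name `maximal_localities`, the instance identifiers, and the coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist


def maximal_localities(instances, neighbor_dist):
    """Group instances into connected components of the neighborhood graph.

    instances: dict mapping instance id -> (feature type, (x, y)).
    neighbor_dist: assumed neighbor relation, a Euclidean distance threshold.
    """
    parent = {i: i for i in instances}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (merge components) whenever two instances are neighbors.
    for a, b in combinations(instances, 2):
        if dist(instances[a][1], instances[b][1]) <= neighbor_dist:
            parent[find(a)] = find(b)

    groups = {}
    for i in instances:
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=lambda s: sorted(s))


# Hypothetical instances of features A, B, C (coordinates are illustrative).
instances = {
    "A1": ("A", (0.0, 0.0)),
    "B1": ("B", (0.5, 0.0)),
    "C1": ("C", (0.9, 0.3)),
    "A2": ("A", (5.0, 5.0)),
    "B2": ("B", (5.4, 5.2)),
    "C2": ("C", (10.0, 10.0)),
}
localities = maximal_localities(instances, neighbor_dist=1.0)
print(localities)
```

Because each instance ends up in exactly one component, the resulting localities are disjoint by construction, which is the property the pruning strategy relies on.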

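The Regional Participation Ratio and Regional Participation Index defined under Basic Concepts can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the authors' code: the function name `regional_participation_index` is hypothetical, and the counts are chosen only to reproduce the fractions 2/4, 2/6, and 1/4 of the <{ABC}, PL2> example. Which fraction belongs to which feature is an assumption made for illustration.

```python
from fractions import Fraction


def regional_participation_index(participating, totals):
    """Return (RPI, per-feature RPR) for one regional co-location pattern.

    participating[f]: instances of feature f that take part in the pattern
                      inside the prevalence locality.
    totals[f]:        instances of feature f in the whole dataset.
    """
    rpr = {f: Fraction(participating[f], totals[f]) for f in participating}
    # RPI is the minimum participation ratio over the pattern's features.
    return min(rpr.values()), rpr


# Assumed pairing of the example fractions to features A, B, C.
participating = {"A": 2, "B": 2, "C": 1}
totals = {"A": 4, "B": 6, "C": 4}

rpi, rpr = regional_participation_index(participating, totals)
threshold = Fraction(1, 4)  # prevalence threshold of 0.25 from the example
is_prevalent = rpi >= threshold
print(rpi, is_prevalent)  # 1/4 True
```

Exact fractions are used instead of floats so that a pattern sitting exactly on the threshold, as here, is not rejected because of rounding.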
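The pattern counts in the Challenges illustration (4, 11, 26, 57 for 3 to 6 event types) match the number of feature subsets of size two or more, that is, 2^n - n - 1. The sketch below checks this by brute-force enumeration. The function name `candidate_patterns` is hypothetical, and treating only subsets with at least two features as candidate patterns is an assumption inferred from the table, not something the slides state.

```python
from itertools import combinations


def candidate_patterns(event_types):
    """Enumerate all subsets of event types with at least two members."""
    types = sorted(event_types)
    return [
        frozenset(combo)
        for size in range(2, len(types) + 1)
        for combo in combinations(types, size)
    ]


# Count candidates for n = 3..6 event types drawn from hypothetical labels.
counts = {n: len(candidate_patterns("ABCDEF"[:n])) for n in range(3, 7)}
print(counts)  # {3: 4, 4: 11, 5: 26, 6: 57}
```

Each additional event type roughly doubles the number of candidates, and every candidate may have several prevalence localities, which is why enumerating the whole pattern space is expensive.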