Regional Co-location Patterns - Spatial Database Group

A Spatial Neighborhood Graph Approach to Regional Co-location Pattern Discovery: Summary of Results
Pradeep Mohan*, Shashi Shekhar, Zhe Jiang
University of Minnesota, Twin-Cities, MN
James A. Shine, James P. Rogers, Nicole Wayant
US Army- ERDC, Topographic Engineering Center, Alexandria, VA
*Contact: mohan@cs.umn.edu
Outline
 Motivation
 Problem Formulation
 Computational Approach
 Conclusions and Future work
Motivation: Spatial Heterogeneity, the second law of Geography
Spatial Heterogeneity (Goodchild, 2004; Goodchild 2003)
 Expectations vary across space.
 Global models may not explain locally observed phenomena.
 Need for place based analysis.
Spatial Heterogeneity in Retail
 Traditional Data Mining: Which pairs of items sell together frequently?
 Answer: Diaper in Transaction ⇒ Beer in Transaction.
 Is this association true everywhere?
Answer: Blue-collar neighborhoods
Global Spatial Data Mining – Global Co-location patterns
 Which pairs of spatial features are located together frequently?
Example: Gas stations and Convenience Stores
Our Focus:
 Where do certain pairs of spatial features co-locate frequently?
Example: Assaults happen frequently around downtown bars.
Applications

Crime analysis
 Localizing frequent crime patterns; opportunities for crime vary across space.
 Predicting localities of the next crime.
 Question: Do downtown bars lead to assaults more frequently?
Public Health
 Localizing elevated disease risks around putative sources (e.g. mining areas).
 Question: Where does high asbestos concentration often lead to lung cancer?
Ecology
 Localizing symbiotic relationships between different species of plants / animals.
 Question: Where are Plover birds frequently found in the vicinity of a crocodile?
Image courtesy: www.amazon.com, www.startribune.com
Regional co-location patterns (RCP)
 Input: Spatial Features, Crime Reports.
 Output: RCP (e.g. < (Bar, Assaults), Downtown >)
 Subsets of spatial features.
 Frequently located in certain regions of a study area.
Outline
 Motivation
 Problem Formulation
   Basic Concepts
   Problem Statement
   Challenges
   Related Work
 Computational Approach
 Conclusions and Future work
Basic Concepts: Neighborhoods
Prevalence locality
 Subsets of spatial framework containing instances of a
Pattern.
 Simple representation to visualize: Convex Hull
 Other Representations possible.
Neighborhood Graph
 Given: A Spatial Neighbor Relation (spatial neighborhood size)
 Nodes: Individual event instances
 Edges: Presence (if the neighbor relation is satisfied)
 Based on Event Centric Model (Huang, 2004)
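The neighborhood graph described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the instance coordinates, the Euclidean distance test, and the name `neighborhood_graph` are all assumptions made for the example.

```python
from itertools import combinations
from math import dist

# Hypothetical event instances: (instance id, event type, x, y), units in miles.
instances = [
    ("A1", "A", 0.0, 0.0),
    ("B1", "B", 0.5, 0.0),
    ("C1", "C", 0.5, 0.6),
    ("A2", "A", 5.0, 5.0),
]

def neighborhood_graph(instances, size):
    """Return an adjacency dict: nodes are instances, edges join pairs within `size`."""
    graph = {iid: set() for iid, _, _, _ in instances}
    for (i, _, xi, yi), (j, _, xj, yj) in combinations(instances, 2):
        # Assumed neighbor relation: Euclidean distance at most `size`.
        if dist((xi, yi), (xj, yj)) <= size:
            graph[i].add(j)
            graph[j].add(i)
    return graph

graph = neighborhood_graph(instances, size=1.0)
```

With a neighborhood size of 1 mile, A1, B1 and C1 are pairwise neighbors, while A2 has no edges.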
Basic Concepts: Quantifying regional interestingness
 Conditional probability of observing a pattern instance within a locality, given an instance of a feature within that locality.
Regional Participation Ratio
RPR(RCP, M) = (# instances of event type M participating in the RCP) / (# instances of M in dataset)
Example
RPR(<{ABC}, PL2>, A) = 2/4;  RPR(<{ABC}, PL2>, B) = 2/6;  RPR(<{ABC}, PL2>, C) = 1/4
Regional Participation Index
 Quantifies the local fraction participating in a relationship.
RPI(RCP) = min over event types M in the RCP of RPR(RCP, M)
Example
RPI(<{ABC}, PL2>) = min{2/4, 2/6, 1/4} = 1/4
Detailed Statement
Given:
 A spatial framework,
 A collection of boolean spatial event types and
their instances.
 A minimum interestingness threshold, Pθ
 A symmetric and transitive neighbor relation R (e.g.
based on Spatial neighborhood size)
Example parameters: Prevalence Threshold = 0.25; Spatial neighborhood size = 1 mile
Find :
All RCPs with prevalence >= Pθ
Objective:
Minimize computational cost.
Constraints:
(i) Spatial framework is Heterogeneous.
(ii) Interest measure captures spatial heterogeneity.
(iii) Completeness : All prevalent RCPs are reported.
(iv) Correctness: Only prevalent RCPs are reported.
Challenges
Conflicting Requirements
 Interest measure captures spatial heterogeneity while supporting scalable algorithms.
Exponential search space.
 Candidate pattern set cardinality is exponential in the number of event types.

Illustration (figure): Statistical Rigor vs. Computational Scalability
Illustration:
Pattern lattice for event types A, B, C: {NULL}; A, B, C; AB, AC, BC; ABC

n    # Patterns    O(2^M)
3    4             k1*2^3 (k1 > 0)
4    11            k2*2^4 (k2 > 0)
5    26            k3*2^5 (k3 > 0)
6    57            k4*2^6 (k4 > 0)
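The pattern counts in the table can be reproduced by enumerating subsets of event types. The sketch below assumes that a candidate pattern is any subset with at least two event types, which matches the listed counts (2^n - n - 1); the function name is illustrative.

```python
from itertools import combinations

def candidate_patterns(event_types):
    """Yield every subset of event types with at least two members."""
    for size in range(2, len(event_types) + 1):
        yield from combinations(event_types, size)

for n in range(3, 7):
    types = [chr(ord("A") + i) for i in range(n)]
    count = sum(1 for _ in candidate_patterns(types))
    # Columns: n, enumerated count, closed form 2^n - n - 1.
    print(n, count, 2**n - n - 1)
```

Each added event type roughly doubles the number of candidates, and every candidate may in turn have several prevalence localities.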
Contributions

Regional Co-location Patterns
 Neighborhood based Formulation
 Interest Measure
   Captures the local fraction of events participating in patterns.
   Shows attractive computational properties; honors spatial heterogeneity.
Computational Approach
 Computational Structure – Pattern Space Enumeration
 Performance Enhancement – Maximal locality based Pruning Strategies
Experimental Evaluation
 Performance evaluation using real datasets (Lincoln, NE)
 Real world case study.

Related Work
Approaches for Regional Co-location Pattern discovery
 Zoning Based (Celik et al., 2007)
 Fitness Function Clustering (Eick et al., 2008)
 Spatial Neighborhood Based (Our Work)
Fitness Function Clustering
 Reports one pattern per interesting region based
on a criterion (e.g. Max)
 Computational structure and pruning strategies
not explored.
 Clustering is based on real valued attributes.
Outline
 Motivation
 Problem Formulation
 Computational Approach
   Pattern Space Enumeration
   Performance Tuning
   Experimental Evaluation
 Conclusions and Future work
Computational Approach
Prevalence Threshold = 0.25
Key Idea
 Enumerate the entire pattern space. Expensive!
 Examine each pattern and prune.
Steps
 Compute neighborhoods
 Identify candidate RCP instances
Illustration (✔ = Accepted RCP, ✕ = Pruned RCP)
Single event types: {Null}; A, B, C

Candidate RCP                RPI     Status
<{AB},PL1({AB})>             0.16    ✕
<{AB},PL2({AB})>             0.33    ✔
<{AB},PL3({AB})>             0.25    ✔
<{AC},PL1({AC})>             0.25    ✔
<{AC},PL2({AC})>             0.25    ✔
<{AC},PL3({AC})>             0.25    ✔
<{BC},PL1({BC})>             0.16    ✕
<{BC},PL2({BC})>             0.25    ✔
<{BC},PL3({BC})>             0.16    ✕
<{BC},PL4({BC})>             0.16    ✕
<{ABC},PL1({ABC})>           0.16    ✕
<{ABC},PL2({ABC})>           0.25    ✔
<{ABC},PL3({ABC})>           0.25    ✔
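The "examine each pattern and prune" step reduces to a threshold test once every candidate's RPI is known. The sketch below assumes the RPI values have already been computed and reuses the values listed on this slide; the tuple layout and variable names are illustrative.

```python
THRESHOLD = 0.25  # prevalence threshold from the slide

# (pattern, prevalence locality, RPI) for every enumerated candidate.
candidates = [
    ("AB", "PL1", 0.16), ("AB", "PL2", 0.33), ("AB", "PL3", 0.25),
    ("AC", "PL1", 0.25), ("AC", "PL2", 0.25), ("AC", "PL3", 0.25),
    ("BC", "PL1", 0.16), ("BC", "PL2", 0.25), ("BC", "PL3", 0.16), ("BC", "PL4", 0.16),
    ("ABC", "PL1", 0.16), ("ABC", "PL2", 0.25), ("ABC", "PL3", 0.25),
]

# Every candidate is examined; nothing is skipped, which is why this baseline is expensive.
accepted = [(p, loc) for p, loc, value in candidates if value >= THRESHOLD]
pruned = [(p, loc) for p, loc, value in candidates if value < THRESHOLD]

print(len(accepted), "accepted,", len(pruned), "pruned")
```

All 13 candidates have to be generated and scored before any of them can be discarded.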
Performance Tuning: Key Ideas
Key Idea
 Interest Measure shows special pruning properties in certain subsets of the
spatial framework.
Maximal Locality
Key Properties
 Collection of connected instances.
 Maximal localities are mutually disjoint.
 Contains several RCPs.
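One way to obtain sets with these properties is to take the connected components of the neighborhood graph. That reading of "maximal locality" is an assumption made for this sketch, as are the example graph and the function name.

```python
from collections import deque

def maximal_localities(graph):
    """Return the connected components of an adjacency dict as a list of sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        components.append(component)
    return components

# Hypothetical neighborhood graph over event instances.
graph = {
    "A1": {"B1"}, "B1": {"A1", "C1"}, "C1": {"B1"},
    "A2": {"C2"}, "C2": {"A2"},
    "B2": set(),
}
localities = maximal_localities(graph)
```

Components are disjoint by construction, which is consistent with the "mutually disjoint" property listed above. In this sketch an isolated instance such as B2 forms its own component.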
Key Observations
 RPI shows the anti-monotonicity property within Maximal Localities.
   Pruning a co-location, e.g. {AB}, prunes all its supersets (e.g. {ABC}, {ABCD}, etc.).
 RPI within a Maximal Locality is an upper bound on the RPI of its constituent Prevalence Localities.
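Anti-monotonicity allows a level-wise search in which a pattern is scored only if all of its sub-patterns survived. The sketch below is an Apriori-style illustration inside a single maximal locality, not the authors' algorithm; the RPI values and the names `prevalent_patterns` and `rpi_table` are hypothetical.

```python
from itertools import combinations

def prevalent_patterns(event_types, rpi, threshold):
    """Level-wise search: a pattern is a candidate only if all its sub-patterns survived."""
    survivors = []
    types = sorted(event_types)
    # Level 2: every pair of event types is a candidate.
    level = [frozenset(pair) for pair in combinations(types, 2)]
    size = 2
    while level:
        kept = {p for p in level if rpi(p) >= threshold}
        survivors.extend(sorted(kept, key=sorted))
        size += 1
        # Next level: supersets whose (size - 1)-subsets were all kept.
        level = [
            frozenset(c)
            for c in combinations(types, size)
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return survivors

# Hypothetical RPI values inside one maximal locality.
rpi_table = {
    frozenset("AB"): 0.167,
    frozenset("AC"): 0.25,
    frozenset("BC"): 0.33,
    frozenset("ABC"): 0.167,
}
evaluated = []

def rpi(pattern):
    evaluated.append(pattern)  # record which patterns were actually scored
    return rpi_table[pattern]

result = prevalent_patterns("ABC", rpi, threshold=0.25)
```

Because {AB} falls below the threshold, {ABC} is never scored: `evaluated` contains only the three pairs.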
Performance Tuning
Prevalence Threshold = 0.25
Step: Compute Maximal Locality
Single event types: {Null}; A, B, C
Illustration (✕ = pruned)
 ML1: {AB}, 0.167 ✕ (No RCP); {AC}, 0.25
   <{AC},PL1({AC})>, 0.25
 ML2: {BC}, 0.167 ✕ (No RCP)
 ML3: {AB}, 0.25; {BC}, 0.33; {AC}, 0.25
   ✕ <{BC},PL3({BC})>, 0.167
   ✕ <{BC},PL4({BC})>, 0.167
 {ABC}: Pruned Automatically
Completeness
 Pruning a pattern within a maximal locality does not prune any valid RCPs.
   Due to the upper bound property of RPI
   Due to the anti-monotonicity of RPI
Correctness
 Accepting a pattern involves additional checks so that only prevalent RCPs are reported.
Experimental Evaluation: Spatial Neighborhood Size
 What is the effect of spatial neighborhood size on performance of different algorithms ?
 Fixed Parameters: Dataset Size : 7498 instances; # Features: 5; Prevalence Threshold: 0.07
Figure panels: Run Time; # of RCPs
Trends
 Run Time: ML Pruning outperforms PS Enumeration by a factor of 1.5 to 5
 # of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 4.13 to 19
Experimental Evaluation: Feature Types
 What is the effect of number of feature types on performance of different algorithms ?
 Fixed Parameters: Dataset Size : 7498 instances; Spatial neighborhood size: 800 feet; Prevalence
Threshold: 0.07
Figure panels: Run Time; # of RCPs
Trends
 Run Time: ML Pruning outperforms PS Enumeration by a factor of 1.2
 # of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 1.6 to 3.5
Real Dataset Case study
Q: Where do assaults frequently occur around bars ? Are there other factors ?
Dataset: Lincoln, NE, Crime data (Winter ‘07), Neighborhood Size = 0.25 miles, Prevalence Threshold = 0.07
Figure panels: RCP of Larceny and Assaults; RCP of Bar and Assaults; RCP of Larceny, Bars and Assaults
Observations
 Assaults are more likely to be found in areas reporting larceny crimes (e.g. 47.6% vs 21.1%).
 Bars in Downtown are more likely to be crime prone than bars in other areas (e.g. 21.1%, 20.1%).
Conclusion and Future work
 Conclusions
   Neighborhood based formulation of Regional Spatial Patterns.
   Regional Participation Index: Measures the local fraction of the global count.
   Vector representation for Prevalence Localities (other representations possible; convex hull used for simplicity).
 Future Work
   Other representations for prevalence localities.
   Statistical interpretation: LISA statistics, variants of local Ripley’s K, multiple hypothesis testing.
   Interpretation using predictive methods (e.g. Geographically Weighted Regression).
Acknowledgement:
 Reviewers of ACM GIS
 Members of the Spatial database and spatial data mining group, UMN.
 U.S. Department of Defense.
 Mr. Tom Casady and Kim Koffolt.