SSCP: Mining Statistically Significant Co-location Patterns
Sajib Barua and Jörg Sander
Dept. of Computing Science, University of Alberta, Canada

Outline
• Introduction
• Related work
• Motivation
• Proposed Method
• Experimental evaluation: synthetic data, real data
• Conclusions

Definition
Co-location patterns are subsets of Boolean spatial features whose instances are often located in close spatial proximity.
Examples:
• {Nile crocodile, Egyptian plover}
• {Shopping mall, parking}

Event Centric Model
• Co-location is defined based on a spatial relationship R.
• A co-location type C is a set of n different spatial features f1, f2, …, fn.
[Figure: instances A2, B1, B2, C1, C2, C3, D1; {A2, B1, C1, D1} form a clique under the relation R.]
• {A2, B1, C1} is an instance of co-location {A, B, C}.
• {A2, B1, D1} is an instance of co-location {A, B, D}.
• {A2, C1, D1} is an instance of co-location {A, C, D}.
• {B1, C1, D1} is an instance of co-location {B, C, D}.
• {A2, B1, C1, D1} is an instance of co-location {A, B, C, D}.

Prevalence Measure
• The participation ratio (PR) of a feature in a co-location type C is the fraction of its instances participating in any instance of C.
• The participation index (PI) is the minimum participation ratio in C.
[Figure: instances A1, A2, B1, B2, C1, C2, C3]
• PI({A, B}) = min{1/2, 1/2} = 0.5
• PI({B, C}) = min{1, 2/3} ≈ 0.67
• PI({A, C}) = min{1/2, 1/3} ≈ 0.33
• PI({A, B, C}) = min{1/2, 1/2, 1/3} ≈ 0.33
• PR and PI are anti-monotonic: PI({A, B, C}) ≤ min{PI({A, B}), PI({B, C}), PI({A, C})}.

Related Work
Spatial statistics
• Ripley's K function, distance-based measures, co-variogram function.
Spatial data mining
• Koperski et al.
[4] mine spatial association rules.
• Morimoto [5] also looks for frequently occurring patterns.
• Shekhar et al. [2] introduce three models to materialize transactions.
• Huang et al. [3], Yoo et al. [6,7], and Xiao et al. [8].

Limitations of the Existing Methods
Spatial statistics
• Defined only for pairs.
Co-location mining
• Only one global threshold for PI is used.
• There is no guideline for setting the PI-threshold.
• Spatial auto-correlation and feature abundance effects are not addressed.
• A simple threshold can report meaningless patterns or miss meaningful ones.

Motivation
Assume PI-threshold = 0.4.
• A has fewer instances; B is abundant.
• A & B have a true spatial dependency.
• Existing co-location mining algorithms will not report {A, B}.

Motivation
Assume PI-threshold = 0.4.
• A & B are abundant.
• Both are randomly distributed.
• They do not have any true spatial dependency.
• Existing co-location mining algorithms will report {A, B}.

Motivation
Assume PI-threshold = 0.4.
• A & B are auto-correlated.
• They do not have any true spatial dependency.
• Existing co-location mining algorithms will report {A, B}.

Our Idea
• Our approach uses a statistical test.
• Spatial dependency is measured using PI.
[Figure: #○ = 12, #∆ = 12]
• If features ○ and ∆ were spatially independent of each other, what is the chance of seeing a PI-value of {○, ∆} equal to or higher than the observed PI-value (0.41)?

Generate Artificial Data Sets
[Figure: observed data; artificial data sets generated under the null model]

p-value computation
If p <= α, PIobs is statistically significant at level α.
Example: PIobs = 0.41, p-value = 0.163, α = 0.05. Since 0.163 > 0.05, PIobs is not significant here.

Auto-correlated Feature
• A & B are auto-correlated.
• They do not have any true spatial dependency.

Modeling Auto-correlation
• Auto-correlation is modeled as a cluster process.
• Poisson Cluster Process [9]
• Auto-correlation is measured in terms of the intensity and type of distribution of a parent process and of the offspring process around each parent.

Estimating Summary Statistics
Estimate the summary statistics.
• Auto-correlated feature: intensity of the parent and offspring processes (κ and µ values).
• Randomly distributed feature: Poisson intensity, either homogeneous (a constant) or non-homogeneous (a function of x and y).

Null Model Design
The artificial data sets maintain the following properties of the observed data:
• the same number of instances for each feature, and
• a similar spatial distribution for each individual feature.

p-value computation
• Estimate $p = \Pr(PI_0(C) \geq PI_{obs}(C))$.
• Use randomization tests, where a large number of data sets conforming to the null hypothesis is generated:
  $p = \frac{R^{\geq PI_{obs}} + 1}{R + 1}$
  where R is the number of simulations and $R^{\geq PI_{obs}}$ is the number of simulations in which $PI_0(C) \geq PI_{obs}(C)$.
• How many simulations do we need? Diggle suggested 500 simulations for α = 0.01 [10].

Improving Runtime: Data Generation
• In a simulation, we only generate feature instances of those clusters that are close enough to other, different features (either auto-correlated or not auto-correlated).
• This saves time in the artificial data generation step of a simulation.
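The randomization-test estimate on the p-value computation slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `randomization_p_value` and `is_significant` are hypothetical, and the simulated PI-values are assumed to have already been computed from data sets generated under the null model.

```python
def randomization_p_value(pi_obs, pi_null_values):
    """Estimate p = (R_ge + 1) / (R + 1).

    pi_obs         -- observed PI-value of a co-location C
    pi_null_values -- PI-values of C in R simulated data sets (assumed given)
    """
    r = len(pi_null_values)
    # R_ge: number of simulations whose PI-value is at least the observed one
    r_ge = sum(1 for value in pi_null_values if value >= pi_obs)
    return (r_ge + 1) / (r + 1)


def is_significant(pi_obs, pi_null_values, alpha=0.05):
    """Report C only if the estimated p-value is at most alpha."""
    return randomization_p_value(pi_obs, pi_null_values) <= alpha


# Toy illustration with made-up simulated values (not data from the slides):
# two of the four simulations reach PIobs = 0.41, so p = (2 + 1) / (4 + 1).
print(randomization_p_value(0.41, [0.10, 0.50, 0.30, 0.45]))
```

Adding 1 to both numerator and denominator counts the observed data set as one more realization of the null model, so the estimate can never be exactly zero.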
Improving Runtime: PI-value Computation
In a simulation $R_i$, for a co-location C:
• The p-value, $p = \frac{R^{\geq PI_{obs}} + 1}{R + 1}$, depends only on whether $PI_0^{R_i}(C) \geq PI_{obs}(C)$.
• $C' \subset C \;\&\; PI_0^{R_i}(C') < PI_{obs}(C) \Rightarrow PI_0^{R_i}(C) < PI_{obs}(C)$, so there is no need to compute $PI_0^{R_i}(C)$.
Procedure:
• In each simulation, compute the $PI_0^{R_i}$-values of all possible 2-size subsets.
• For a co-location C of size k (> 2), look up the PI-values of the 2-size subsets of C. If a subset C' is found for which $PI_0^{R_i}(C') < PI_{obs}(C)$, then $PI_0^{R_i}(C)$ does not need to be computed.
• Otherwise, $PI_0^{R_i}(C)$ is computed for simulation $R_i$.

An Example
Four features A, B, C, D.
• {A, B, C}: if $PI_0^{R_i}(\{A,B\}) < PI_{obs}(\{A,B,C\})$, then $PI_0^{R_i}(\{A,B,C\}) < PI_{obs}(\{A,B,C\})$, so there is no need to compute $PI_0^{R_i}(\{A,B,C\})$.
• $PI_0^{R_i}(\{A,B,C\}) < PI_{obs}(\{A,B,C\})$ does not imply $PI_0^{R_i}(\{A,B,C,D\}) < PI_{obs}(\{A,B,C,D\})$.
• {A, B, C, D}: pruned by checking its 2-size subsets.
The worst-case complexity is $O(2^n)$. The following keep the actual cost in practice below the worst-case cost:
• The size of the largest co-location is much smaller.
• The largest co-location size is predictable.
• If PIobs(C) = 0, we do not compute the $PI_0^{R_i}$-value of C.
• Our pruning strategies.

Experimental Results (1)
Negative association:
• Features ○ and ∆, with 40 instances each.
• This synthetic data set is generated using a multi-type Strauss process to impose a negative association (inhibition) between the two features.
Result:
• PIobs = 0.55 and p-value = 0.931 > 0.05 (α); hence {○, ∆} will not be reported.

Experimental Results (2)
Autocorrelation: #○ = 100 and #∆ = 120.
• ∆: independently and uniformly distributed over the space.
• ○: spatially auto-correlated.
• In our generated data, ∆ is found in most clusters of ○.
• The summary statistics of ○ are estimated by fitting a Matérn cluster process model [9] (κ = 40, µ = 5, r = 0.05).
Results:
• PIobs{○, ∆} = 0.49; an existing algorithm will report the pattern if a threshold <= 0.49 is chosen.
• p-value = 0.383 > 0.05 (α); {○, ∆} is not reported.

Experimental Results (3)
Multiple features: #○ = 40, #∆ = 40, #+ = 118, #x = 40, and = #30.
• Study area = unit square; co-location neighborhood radius = 0.1.
• Features ○ and ∆ are negatively associated.
• Feature + is spatially auto-correlated.
• Features +, ○, and x are positively associated.
• Feature is randomly distributed.
• Significant co-location patterns = {○, +}, {○, x}, {+, x}, {○, +, x}, {○, +, }, {○, x, }, {+, x, }, and {○, +, x, }.

Runtime Comparison (1)
• Features ○, ∆, +: auto-correlated and strongly associated. Each has 400 instances.
• Feature x: randomly distributed, with 20 instances.
• Our algorithm finds all co-locations of features ○, ∆, and x.
• The number of instances of each auto-correlated feature is increased: the number of clusters is kept the same, and the number of instances per cluster is increased by a factor k.
[Figures: runtime comparison; speedup]

Runtime Comparison (2)
• The number of clusters for features ○, ∆, and + is increased by a factor k, but the number of instances per cluster is kept the same.
• The total number of instances of x is increased by the same factor k.
[Figures: runtime comparison; speedup]

Ants Data
• ○ = Cataglyphis ants (29) and ∆ = Messor ants (68).
• PIobs{Cataglyphis, Messor} = min{24/29, 30/68} = 0.44.
• p-value = 0.142 > 0.05 (α); co-location {○, ∆} is not significant.
• R. D. Harkness also did not find any clear association between these two species.
• An existing algorithm will report {○, ∆} if PI-threshold <= 0.44.
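The participation index used in the results above, and the 2-size-subset lookup from the runtime slides, can be sketched as follows. This is a minimal sketch, not the authors' implementation: all function names are hypothetical, the neighbor relation R is assumed to be "Euclidean distance at most `radius`", and the PI computation is shown only for size-2 co-locations (larger sizes require clique instances and are omitted).

```python
from itertools import combinations
from math import hypot


def participation_index(ratios):
    """PI is the minimum participation ratio over the features of a co-location."""
    return min(ratios)


def pair_participation_index(points_a, points_b, radius):
    """PI of a size-2 co-location {A, B} under an assumed distance-based relation."""
    if not points_a or not points_b:
        return 0.0

    def near(p, q):
        return hypot(p[0] - q[0], p[1] - q[1]) <= radius

    # Count the instances of each feature that have a neighbor of the other feature.
    a_in = sum(1 for p in points_a if any(near(p, q) for q in points_b))
    b_in = sum(1 for q in points_b if any(near(p, q) for p in points_a))
    return participation_index([a_in / len(points_a), b_in / len(points_b)])


def can_skip(colocation, pi_obs, simulated_pair_pi):
    """True if some 2-size subset has a simulated PI below PIobs(C).

    By anti-monotonicity the simulated PI of C is then also below PIobs(C),
    so it does not have to be computed for this simulation.
    simulated_pair_pi maps frozenset({f, g}) to the simulated PI of {f, g}.
    """
    return any(
        simulated_pair_pi[frozenset(pair)] < pi_obs
        for pair in combinations(sorted(colocation), 2)
    )


# Ants data from the slide: min{24/29, 30/68}
print(round(participation_index([24 / 29, 30 / 68]), 2))  # 0.44
```

Storing the pair values in a dictionary keyed by `frozenset` keeps the lookup independent of feature order, which fits the "compute all 2-size subsets once per simulation" step.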
Toronto Address Repository Data

Found Co-locations

Conclusions
• A new definition for co-location patterns.
• Does not depend on a global threshold.
• Statistically meaningful.
• The runtime cost of randomization tests is reduced.
• Investigate other prevalence measures to check whether they allow additional pruning techniques.
• Removing redundant patterns.

References
1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: Proc. VLDB, pp. 487-499 (1994)
2. Shekhar, S. et al.: Discovering Spatial Co-location Patterns: A Summary of Results. In: Proc. SSTD, pp. 236-256 (2001)
3. Huang, Y. et al.: Discovering Colocation Patterns from Spatial Data Sets: A General Approach. IEEE TKDE 16(12), 1472-1485 (2004)
4. Koperski, K. et al.: Discovery of Spatial Association Rules in Geographic Information Databases. In: SSD, pp. 47-66 (1995)
5. Morimoto, Y.: Mining Frequent Neighboring Class Sets in Spatial Databases. In: SIGKDD, pp. 353-358 (2001)
6. Yoo, J. S. et al.: A Partial Join Approach for Mining Co-location Patterns. In: Proc. GIS, pp. 241-249 (2004)
7. Yoo, J. S. et al.: A Joinless Approach for Mining Spatial Co-location Patterns. IEEE TKDE 18(10), 1323-1337 (2006)
8. Xiao, X. et al.: Density Based Co-location Pattern Discovery. In: Proc. GIS, pp. 250-259 (2008)
9. Illian et al.: Statistical Analysis and Modeling of Spatial Point Patterns.
10. Diggle, P.J.: Statistical Analysis of Spatial Point Patterns (2003)

Questions?