Mining Statistically Significant Co-location Patterns

SSCP: Mining Statistically
Significant Co-location Patterns
Sajib Barua and Jörg Sander
Dept. of Computing Science
University of Alberta, Canada
Outline
Introduction
Related work
Motivation
Proposed Method
Experimental evaluation
Synthetic data
Real data
Conclusions
Definition
• Co-location patterns are subsets of Boolean spatial features whose instances are often located in close spatial proximity.
• Examples:
{Nile crocodile, Egyptian plover}
{Shopping mall, parking}
Event Centric Model
Co-location is defined based on a spatial
relationship R
A co-location type C is a set of n different
spatial features f1, f2, …, and fn.
[Figure: example instances A2, B1, B2, C1, C2, C3, and D1]
{A2, B1, C1, D1} form a clique under a relation R.
{A2, B1, C1} is an instance of co-location {A,B,C}
{A2, B1, D1} is an instance of co-location {A,B,D}
{A2, C1, D1} is an instance of co-location {A,C,D}
{B1, C1, D1} is an instance of co-location {B,C,D}
{A2, B1, C1, D1} is an instance of co-location {A,B,C,D}
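The event centric model can be illustrated with a small sketch that enumerates co-location instances as cliques under R. The instance lists and the neighbor pairs below are assumptions read off the slide's example, and the function name `instances_of` is hypothetical; R is represented simply as a set of unordered neighbor pairs.

```python
from itertools import combinations, product

def instances_of(colocation, instances, neighbors):
    """Return every combination with one instance per feature that is a clique under R."""
    rows = []
    for combo in product(*(instances[f] for f in colocation)):
        # A combination is an instance only if all of its pairs are neighbors.
        if all(frozenset(pair) in neighbors for pair in combinations(combo, 2)):
            rows.append(combo)
    return rows

# Assumed example data: feature instances as labeled in the figure.
instances = {"A": ["A2"], "B": ["B1", "B2"], "C": ["C1", "C2", "C3"], "D": ["D1"]}
# Assumed relation R: A2, B1, C1, and D1 are pairwise neighbors, nothing else is.
neighbors = {frozenset(pair) for pair in combinations(["A2", "B1", "C1", "D1"], 2)}

abc = instances_of(("A", "B", "C"), instances, neighbors)
abcd = instances_of(("A", "B", "C", "D"), instances, neighbors)
```

With these assumed inputs, `abc` holds the single instance (A2, B1, C1) and `abcd` the single instance (A2, B1, C1, D1), matching the statements on the slide.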
Prevalence Measure
• Participation ratio (PR) of a feature in a co-location type C is the fraction of its instances participating in any instance of C.
• Participation index (PI) of C is the minimum participation ratio over the features of C.
[Figure: example instances A1, A2, B1, B2, C1, C2, and C3]
PI({B, C}) = min{1, 2/3} = 0.67
PI({A, B}) = min{1/2, 1/2} = 0.5
PI({A, C}) = min{1/2, 1/3} = 0.33
PI({A, B, C}) = min{1/2, 1/2, 1/3} = 0.33
PR and PI are anti-monotonic:
PI({A, B, C}) <= min{PI({A, B}), PI({B, C}), PI({A, C})}
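A minimal sketch of the PR and PI computation follows. The coordinates and the neighborhood radius are hypothetical, chosen only so that the neighbor relations reproduce the PI-values listed above; the function names are likewise assumptions, and the brute-force enumeration is meant for tiny examples only.

```python
from itertools import combinations, product
from math import dist

def colocation_rows(points, pattern, radius):
    """Enumerate instances of `pattern`: one instance per feature, all pairwise within `radius`."""
    per_feature = [list(points[f].items()) for f in pattern]
    rows = []
    for combo in product(*per_feature):
        if all(dist(p, q) <= radius for (_, p), (_, q) in combinations(combo, 2)):
            rows.append(tuple(name for name, _ in combo))
    return rows

def participation_index(points, pattern, radius):
    """PI = minimum over features of (participating instances / all instances)."""
    rows = colocation_rows(points, pattern, radius)
    ratios = []
    for i, feature in enumerate(pattern):
        participating = {row[i] for row in rows}
        ratios.append(len(participating) / len(points[feature]))
    return min(ratios)

# Hypothetical coordinates: A1, B1, C1 are mutual neighbors, B2 and C2 are
# neighbors, and A2 and C3 are isolated.
points = {
    "A": {"A1": (0.0, 0.0), "A2": (10.0, 10.0)},
    "B": {"B1": (0.5, 0.0), "B2": (5.0, 0.0)},
    "C": {"C1": (0.0, 0.5), "C2": (5.5, 0.0), "C3": (20.0, 20.0)},
}
radius = 1.0  # assumed neighborhood distance

pi_ab = participation_index(points, ("A", "B"), radius)
pi_bc = participation_index(points, ("B", "C"), radius)
pi_ac = participation_index(points, ("A", "C"), radius)
pi_abc = participation_index(points, ("A", "B", "C"), radius)
```

Under these assumptions the four values come out as 1/2, 2/3, 1/3, and 1/3, and `pi_abc` does not exceed the PI of any 2-size subset, which is the anti-monotonic property.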
Related Work
• Spatial statistics
Ripley's K function, distance-based measures, co-variogram function.
• Spatial data mining
Koperski et al. [4] mine spatial association rules.
Morimoto [5] also looks for frequently occurring patterns.
Shekhar et al. [2] introduce three models to materialize transactions.
Huang et al. [3], Yoo et al. [6,7], and Xiao et al. [8].
Limitations of the Existing Methods
Spatial statistics
• Defined only for pairs.
Co-location mining
• Only one global threshold for PI is used.
• No guideline for setting the PI-threshold.
• Does not address spatial auto-correlation and feature abundance effects.
• A simple threshold can report meaningless patterns or miss meaningful patterns.
Motivation
Assume PI-threshold = 0.4
A has fewer instances
B is abundant
A & B have true spatial
dependency.
Existing co-location mining
algorithms will not report {A,B}.
Motivation
Assume PI-threshold = 0.4
A & B are abundant.
Both randomly distributed.
Do not have any true
spatial dependency.
Existing co-location mining algorithms
will report {A,B}.
Motivation
Assume PI-threshold = 0.4
A & B are auto-correlated.
Do not have any true
spatial dependency.
Existing co-location mining algorithms
will report {A,B}.
Our Idea
Our approach uses a statistical test.
#○ = 12, #∆ = 12
Spatial dependency is measured using PI.
If features ○ and ∆ were spatially independent of each other, what is the chance of seeing a PI-value of {○, ∆} equal to or higher than the observed PI-value (0.41)?
Generate Artificial Data Sets
Observed data
Artificial data sets generated under null model
p-value computation
If p <= α, PIobs is statistically significant at level α.
PIobs = 0.41, p-value = 0.163, α = 0.05
Since 0.163 > 0.05, the observed PI-value is not statistically significant.
Auto-correlated Feature
A & B are auto-correlated.
Do not have any true
spatial dependency.
Modeling Auto-correlation
Auto-correlation is modeled as a cluster
process.
Poisson Cluster Process [9]
• Autocorrelation is measured in terms of the intensity and type of distribution of a parent process and of the offspring process around each parent.
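To make the parent and offspring terminology concrete, here is a minimal sketch of one Poisson cluster process, the Matérn cluster process, on the unit square. It assumes parents arrive with intensity κ, each parent gets a Poisson(µ) number of offspring placed uniformly in a disc of radius r around it, and only offspring are kept. The function names, the padding of the window by r to reduce edge effects, and the Knuth-style Poisson sampler are choices of this sketch, not taken from the slides.

```python
import math
import random

def poisson(rng, mean):
    """Sample a Poisson count with Knuth's multiplication method (fine for moderate means)."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def matern_cluster(kappa, mu, r, rng):
    """Simulate offspring points of a Matern cluster process inside the unit square."""
    side = 1 + 2 * r  # parents are drawn in a window padded by r on every side
    points = []
    for _ in range(poisson(rng, kappa * side * side)):
        px, py = rng.uniform(-r, 1 + r), rng.uniform(-r, 1 + r)
        for _ in range(poisson(rng, mu)):
            # Uniform position in a disc: radius scales with sqrt of a uniform draw.
            rho = r * math.sqrt(rng.random())
            theta = rng.uniform(0, 2 * math.pi)
            x, y = px + rho * math.cos(theta), py + rho * math.sin(theta)
            if 0 <= x <= 1 and 0 <= y <= 1:
                points.append((x, y))
    return points

pts = matern_cluster(kappa=40, mu=5, r=0.05, rng=random.Random(1))
```

Passing an explicit `random.Random` instance keeps a simulation reproducible, which is convenient when many artificial data sets have to be generated.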
Estimating Summary Statistics
Estimate the summary statistics.
• Auto-correlated feature: intensity of the parent and offspring processes (κ and µ values).
• Randomly distributed feature: Poisson intensity, either homogeneous (a constant) or non-homogeneous (a function of x and y).
Null Model Design
The artificial data sets maintain the
following properties of the observed data:
same number of instances for each feature,
and
similar spatial distribution for each individual
feature.
p-value computation
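Estimate $p = \Pr(PI_0(C) \geq PI_{obs}(C))$, the probability of obtaining under the null model a PI-value at least as large as the observed one.
Use randomization tests, where a large number of data sets conforming to the null hypothesis is generated.
$p = \frac{R^{\geq PI_{obs}} + 1}{R + 1}$
where $R$ is the number of simulations and $R^{\geq PI_{obs}}$ is the number of simulations in which $PI_0(C) \geq PI_{obs}(C)$.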
How many simulations do we need?
Diggle suggested 500 simulations for α = 0.01 [10].
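The randomization test can be sketched for a pair of features as follows. This is a simplified illustration, not the authors' implementation: it assumes both features are non-auto-correlated, so the null model just redistributes the same number of instances of each feature uniformly over the unit square; auto-correlated features would need a cluster process instead. All function names are hypothetical.

```python
import random
from math import dist

def pair_pi(a_pts, b_pts, radius):
    """Participation index of a 2-feature co-location."""
    a_hit = sum(any(dist(a, b) <= radius for b in b_pts) for a in a_pts)
    b_hit = sum(any(dist(a, b) <= radius for a in a_pts) for b in b_pts)
    return min(a_hit / len(a_pts), b_hit / len(b_pts))

def uniform_points(n, rng):
    """Null model for a randomly distributed feature: n uniform points in the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def randomization_p_value(a_pts, b_pts, radius, simulations, rng):
    """Return (PI_obs, p) with p = (count of simulations with PI_0 >= PI_obs + 1) / (R + 1)."""
    pi_obs = pair_pi(a_pts, b_pts, radius)
    at_least = 0
    for _ in range(simulations):
        # Each artificial data set keeps the observed number of instances per feature.
        sim_a = uniform_points(len(a_pts), rng)
        sim_b = uniform_points(len(b_pts), rng)
        if pair_pi(sim_a, sim_b, radius) >= pi_obs:
            at_least += 1
    return pi_obs, (at_least + 1) / (simulations + 1)
```

As a usage example, calling `randomization_p_value(a, b, 0.1, 499, random.Random(0))` on two observed point lists `a` and `b` runs 499 simulations; the pattern is reported only if the returned p is at most α. Note that the smallest p-value the estimator can return is 1/(R + 1), which bounds from below the number of simulations needed for a given α.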
Improving Runtime: Data Generation
• In a simulation, we generate feature instances only for those clusters that are close enough to instances of other features (either auto-correlated or non-auto-correlated).
This shortens the artificial data generation step of a simulation.
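One possible reading of "close enough" is sketched below. It assumes that offspring lie within a cluster radius of their parent, so a cluster can only contribute to a co-location instance if some instance of another feature lies within the cluster radius plus the neighborhood radius of the parent. Both this criterion and the function name are assumptions of the sketch; the slides do not spell out the exact test.

```python
from math import dist

def clusters_worth_generating(parents, cluster_radius, other_points, neighbor_radius):
    """Keep only parents whose offspring could possibly neighbor another feature's instance."""
    reach = cluster_radius + neighbor_radius
    return [p for p in parents if any(dist(p, q) <= reach for q in other_points)]

# Hypothetical example: the second cluster is far from every other feature
# instance, so its offspring never need to be generated.
kept = clusters_worth_generating(
    parents=[(0.1, 0.1), (0.9, 0.9)],
    cluster_radius=0.05,
    other_points=[(0.2, 0.1)],
    neighbor_radius=0.1,
)
```

Skipped clusters cannot change any participation ratio of a co-location involving other features, which is why dropping them leaves the PI-values of the simulation unchanged under this assumption.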
Improving Runtime: PI-value Computation
• In a simulation $R_i$, for a co-location $C$ we need to know whether $PI_0^{R_i}(C) \geq PI_{obs}(C)$, since
$p = \frac{R^{\geq PI_{obs}} + 1}{R + 1}$
• If $C' \subset C$ and $PI_0^{R_i}(C') < PI_{obs}(C)$, then $PI_0^{R_i}(C) < PI_{obs}(C)$: no need to compute $PI_0^{R_i}(C)$.
Procedure:
• In each simulation, compute the $PI_0^{R_i}$-values of all possible 2-size subsets.
• For a co-location $C$ of size k (> 2), we look up the PI-values of its 2-size subsets. If a subset $C'$ is found for which $PI_0^{R_i}(C') < PI_{obs}(C)$, $PI_0^{R_i}(C)$ is not required to be computed.
• Otherwise $PI_0^{R_i}(C)$ is computed for simulation $R_i$.
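The lookup step of the procedure can be sketched as below. The table of 2-size PI-values for one simulation is assumed to be precomputed and keyed by unordered feature pairs; the function name and the example numbers are hypothetical.

```python
from itertools import combinations

def can_skip_simulation(colocation, pair_pi_sim, pi_obs):
    """True if some 2-size subset already proves that PI_0 of `colocation` is below `pi_obs`.

    Because PI is anti-monotonic, PI_0(C) <= PI_0(C') for every subset C',
    so one pair with PI_0(C') < PI_obs(C) settles the comparison.
    """
    for pair in combinations(sorted(colocation), 2):
        if pair_pi_sim[frozenset(pair)] < pi_obs:
            return True
    return False

# Hypothetical PI-values of all 2-size subsets in one simulation.
pair_pi_sim = {
    frozenset("AB"): 0.2,
    frozenset("AC"): 0.6,
    frozenset("BC"): 0.7,
}

skip = can_skip_simulation({"A", "B", "C"}, pair_pi_sim, pi_obs=0.33)
must_compute = not can_skip_simulation({"A", "B", "C"}, pair_pi_sim, pi_obs=0.15)
```

In the first call the pair {A, B} has a simulated PI of 0.2, below the observed 0.33, so the simulation cannot count towards the p-value and the full computation is skipped; in the second call no pair rules it out and the PI-value has to be computed.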
An Example
Four features A, B, C, D
• {A,B,C}: if $PI_0^{R_i}(\{A,B\}) < PI_{obs}(\{A,B,C\})$, then $PI_0^{R_i}(\{A,B,C\}) < PI_{obs}(\{A,B,C\})$. No need to compute $PI_0^{R_i}(\{A,B,C\})$.
• $PI_0^{R_i}(\{A,B,C\}) < PI_{obs}(\{A,B,C\})$ does not imply $PI_0^{R_i}(\{A,B,C,D\}) < PI_{obs}(\{A,B,C,D\})$.
• {A,B,C,D}: decided by checking its 2-size subsets.
The worst-case complexity is $O(2^n)$.
• The size of the largest co-location is much smaller.
• The largest co-location size is predictable.
• If $PI_{obs}(C) = 0$, we do not compute the $PI_0^{R_i}$-value of C.
• Our pruning strategies.
All these keep the actual cost in practice below the worst-case cost.
Experimental Results (1)
Negative association:
• Features ○ and ∆ with 40 instances each.
• This synthetic data set is generated using a multi-type Strauss process to impose a negative association (inhibition) between these two features.
Result
PIobs = 0.55 and p-value = 0.931 > 0.05 (α), hence {○, ∆} will not be reported.
Experimental Results (2)
Autocorrelation:
 #○ = 100, and #∆ = 120.
 ∆: independently and uniformly
distributed over the space
○: spatially auto-correlated
In our generated data, ∆ is found in
most clusters of ○.
 The summary statistics of ○ is estimated
by fitting the model of Matérn Cluster
process[9] (κ= 40, µ = 5, r = 0.05).
Results:
 PIobs {○, ∆} = 0.49, existing algorithm
will report the pattern if a threshold <=
0.49 is chosen.
 p-value = 0.383 > 0.05 (α); {○, ∆} is not
reported.
Experimental Results (3)
Multiple features:
#○ = 40, #∆ = 40, #+ = 118, #x = 40, and #◻ = 30.
• Study area = unit square, co-location neighborhood radius = 0.1
• Features ○ and ∆ are negatively associated.
• Feature + is spatially auto-correlated.
• Features +, ○, and x are positively associated.
• Feature ◻ is randomly distributed.
Significant co-location patterns = {○, +}, {○, x}, {+, x}, {○, +, x}, {○, +, ◻}, {○, x, ◻}, {+, x, ◻}, and {○, +, x, ◻}.
Runtime Comparison (1)
• Features ○, ∆, and + are auto-correlated and strongly associated. Each has 400 instances.
• Feature x is randomly distributed and has 20 instances.
• Our algorithm finds all co-locations of features ○, ∆, and x.
• The number of instances of each auto-correlated feature is increased:
  the number of clusters is kept the same;
  the number of instances per cluster is increased by a factor k.
[Figures: runtime comparison; speedup]
Runtime Comparison (2)
• The number of clusters for features ○, ∆, and + is increased by a factor k, but the number of instances per cluster is kept the same.
• The total number of instances of x is increased by the same factor k.
[Figures: runtime comparison; speedup]
Ants Data
• ○ = Cataglyphis ants (29) and ∆ = Messor ants (68).
• PIobs {Cataglyphis, Messor} = min{24/29, 30/68} = 0.44.
• p-value = 0.142 > 0.05 (α); co-location {○, ∆} is not significant.
• R. D. Harkness also did not find any clear association between these two species.
• An existing algorithm will report {○, ∆} if PI-threshold <= 0.44.
Toronto Address Repository Data
Found Co-locations
Conclusions
• A new definition for co-location patterns.
• Does not depend on a global threshold.
• Statistically meaningful.
• Runtime cost of randomization tests is reduced.
• Investigate other prevalence measures to check if they allow additional pruning techniques.
• Removing redundant patterns.
References
1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: Proc. VLDB, pp. 487-499 (1994)
2. Shekhar, S. et al.: Discovering Spatial Co-location Patterns: A Summary of Results. In: Proc. SSTD, pp. 236-256 (2001)
3. Huang, Y. et al.: Discovering Colocation Patterns from Spatial Data Sets: A General Approach. IEEE TKDE 16(12), 1472-1485 (2004)
4. Koperski, K. et al.: Discovery of Spatial Association Rules in Geographic Information Databases. In: Proc. SSD, pp. 47-66 (1995)
5. Morimoto, Y.: Mining Frequent Neighboring Class Sets in Spatial Databases. In: Proc. SIGKDD, pp. 353-358 (2001)
6. Yoo, J. S. et al.: A Partial Join Approach for Mining Co-location Patterns. In: Proc. GIS, pp. 241-249 (2004)
7. Yoo, J. S. et al.: A Joinless Approach for Mining Spatial Co-location Patterns. IEEE TKDE 18(10), 1323-1337 (2006)
8. Xiao, X. et al.: Density Based Co-location Pattern Discovery. In: Proc. GIS, pp. 250-259 (2008)
9. Illian, J. et al.: Statistical Analysis and Modelling of Spatial Point Patterns.
10. Diggle, P. J.: Statistical Analysis of Spatial Point Patterns (2003)
Questions?