Clustering Spatial Data in the Presence of Obstacles and Crossings

Density-Based Clustering of Spatial Data when Facing Physical Constraints
Authors: Dr. Osmar R. Zaïane and Chi-Hoon Lee
Database Laboratory
Department of Computing Science
University of Alberta
DBCluC
(Density-Based Clustering with Constraints)
• Introduction
• Related works
• Background Concepts
• Modeling Constraints
• DBCluC Algorithm
• Performance Evaluation
• Conclusion
Introduction
• Cluster Analysis
– Clustering (unsupervised classification) is the process of partitioning data objects into a set of meaningful sub-classes, called clusters, by maximizing closeness within a cluster and minimizing closeness between clusters.
• Taxonomy of Clustering methods
Data Clustering divides into Non-Constraint Based and Constraint Based methods. The Non-Constraint Based methods are:
– Partitioning: K-means, K-medoids, CLARANS
– Hierarchical: AGNES/DIANA, BIRCH, CURE, CHAMELEON
– Density-Based: DBSCAN, DENCLUE
– Grid-Based: STING, WaveCluster
– Graph Partitioning: AUTOCLUST
• Taxonomy of Clustering methods: Constraint Based extensions
– Partitioning: CLARANS → COD-CLARANS
– Graph Partitioning: AUTOCLUST → AUTOCLUST+
– Density-Based: DBSCAN → DBCluC
Introduction (Cont.)
• Key factors for a spatial clustering algorithm
– Scalability
– Discover arbitrary shaped clusters
– Discriminate noise and outliers
– Minimum Domain Knowledge
– Insensitive to data input order
– Constraints
• Operational Constraints
– Ex) SQL aggregate and existence constraints [4]
• Physical Constraints
– Ex) Obstacles [1, 2] and crossings
Related Works
• COD-CLARANS (A.K.H. Tung, et al. 2001)
– Defines the relationship between obstacles and data objects by
visibility graphs to compute obstructed distances between data
objects
– Requires expensive preprocessing steps.
– Inherits the disadvantages of CLARANS
• Number of clusters (k) must be specified in advance
• Main memory management
• Micro-clustering method; detects only spherical shaped clusters
• AUTOCLUST+ (Vladimir Estivill-Castro, et al. 2000)
– Delaunay structure for data points
– Model obstacles as a set of line segments
– Scalable and efficient in 2-dimensional space
Background Concepts
• DBSCAN
– Proposed by Ester, Kriegel, Sander, and Xu (KDD'96).
– Density-based spatial clustering algorithm that discriminates noise.
– Detects arbitrary shaped clusters in the presence of noise.
– R*-tree indexing structure (O(log n) per region query).
– Density notion evaluated by two parameters: Eps and MinPts.
• Eps: Maximum radius of the neighbourhood.
• MinPts: Minimum number of points in an Eps-neighbourhood of a given query point.
– N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}; the density condition on p is |N_Eps(p)| ≥ MinPts.
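The Eps-neighbourhood and the MinPts condition can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `eps_neighbourhood` and `is_core_point` are hypothetical, points are assumed to be 2-D tuples, and a linear scan stands in for the R*-tree range query.

```python
from math import dist

def eps_neighbourhood(points, p, eps):
    """Return every q in points with dist(p, q) <= eps (p itself included)."""
    return [q for q in points if dist(p, q) <= eps]

def is_core_point(points, p, eps, min_pts):
    """True if the Eps-neighbourhood of p holds at least min_pts points."""
    return len(eps_neighbourhood(points, p, eps)) >= min_pts

# Four points on a unit square plus one isolated point (illustrative data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
dense = is_core_point(points, (0.0, 0.0), eps=1.5, min_pts=4)
sparse = is_core_point(points, (5.0, 5.0), eps=1.5, min_pts=4)
```

With an index such as the R*-tree named above, the linear scan would be replaced by a range query; the density test itself stays the same.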
Background Concepts: DBSCAN
• Directly density-reachable
A point p is directly density-reachable from a point q wrt. Eps, MinPts if p ∈ N_Eps(q) and |N_Eps(q)| ≥ MinPts.
• Density-reachable
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi.
• Density-connected
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: point sets illustrating the three notions for points p, q and o, with MinPts = 4 and Eps = 2 cm]
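Density-reachability is a chain of directly density-reachable steps, so the set of points density-reachable from a core point can be collected with a breadth-first expansion. The sketch below assumes 2-D tuples, index-based bookkeeping, and a brute-force region query; `region_query` and `density_reachable` are hypothetical names.

```python
from collections import deque
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (linear scan)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def density_reachable(points, start, eps, min_pts):
    """Indices density-reachable from points[start]; empty if start is not a core point."""
    seeds = region_query(points, start, eps)
    if len(seeds) < min_pts:
        return set()
    reached = set(seeds)
    queue = deque(seeds)
    while queue:
        i = queue.popleft()
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            continue  # border point: reachable itself, but cannot extend the chain
        for j in neighbours:
            if j not in reached:
                reached.add(j)
                queue.append(j)
    return reached

# Four collinear points one unit apart, plus a distant point (illustrative data).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
from_core = density_reachable(points, 1, eps=1.0, min_pts=3)
from_border = density_reachable(points, 0, eps=1.0, min_pts=3)
```

The relation is not symmetric: the end point at index 0 is reached from index 1, but nothing is density-reachable from index 0 because it is not a core point. This asymmetry is why the density-connected notion goes through a common point o.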
Background Concepts: DBSCAN
• Cluster
– A non-empty subset C of data points satisfying the following conditions:
– 1) Maximality: ∀ p, q: if p ∈ C and q is density-reachable from p with respect to Eps and MinPts, then q ∈ C.
– 2) Connectivity: ∀ p, q ∈ C: p is density-connected to q with respect to Eps and MinPts.
• Noise
– Data point that does not belong to any cluster
Background Concepts (cont.)
• Obstacle Constraints:
1. An Obstacle entity - Disconnectivity functionality
• Grouping the nearest data objects is not feasible when an obstacle lies between them
• A polygon denoted by P(V, E), where V is the set of points (vertices) of the polygon and E is its set of line segments
• Types: Convex and Concave.
Background Concepts:
Obstacle free density notions
• Directly obstacle free density-reachable
A point p is directly obstacle free density-reachable from a point q wrt. Eps, MinPts if p ∈ N_Eps(q) and the edge joining p and q is obstacle-free.
• Obstacle free density-reachable
A point p is obstacle free density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly obstacle free density-reachable from pi.
• Obstacle free density-connected
A point p is obstacle free density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are obstacle free density-reachable from o.
[Figure: point sets with an obstacle illustrating the three notions for points p, q, r and o, with MinPts = 4 and Eps = 2 cm]
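The "obstacle-free edge" condition reduces to a segment intersection test between the edge pq and each obstacle line. The sketch below uses the standard orientation-based test; it assumes obstacles are already given as a list of line segments, omits the MinPts condition on q, and all function names are hypothetical.

```python
from math import dist

def orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1 left turn, -1 right turn, 0 collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def on_segment(a, b, c):
    """True if c, already known to be collinear with a-b, lies within the box of a-b."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))

def segments_intersect(p, q, a, b):
    """True if segment pq and segment ab share at least one point."""
    o1, o2 = orient(p, q, a), orient(p, q, b)
    o3, o4 = orient(a, b, p), orient(a, b, q)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and on_segment(p, q, a)) or (o2 == 0 and on_segment(p, q, b))
            or (o3 == 0 and on_segment(a, b, p)) or (o4 == 0 and on_segment(a, b, q)))

def directly_obstacle_free_reachable(p, q, eps, obstacle_lines):
    """p lies in the Eps-neighbourhood of q and the edge pq crosses no obstacle line."""
    if dist(p, q) > eps:
        return False
    return not any(segments_intersect(p, q, a, b) for a, b in obstacle_lines)

wall = [((1.0, -1.0), (1.0, 1.0))]  # one vertical obstacle line (illustrative)
blocked = directly_obstacle_free_reachable((0.0, 0.0), (2.0, 0.0), 3.0, wall)
clear = directly_obstacle_free_reachable((0.0, 0.0), (0.0, 2.0), 3.0, wall)
```

Here the first pair is within Eps but separated by the wall, while the second pair is within Eps and mutually visible.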
Background Concepts: DBCluC
• Cluster
– A non-empty subset C of data points satisfying the following conditions:
– 1) Maximality: ∀ p, q: if p ∈ C and q is obstacle free density-reachable from p with respect to Eps and MinPts, then q ∈ C.
– 2) Connectivity: ∀ p, q ∈ C: p is obstacle free density-connected to q with respect to Eps and MinPts.
• Noise
– Data point that does not belong to any cluster
Modeling Constraints – Obstacles
• Modeling Obstacles
– Objectives
• Assign Disconnectivity Functionality.
• Enhance performance when processing a large number of obstacles by reducing search spaces.
– Method: Polygon Reduction Algorithm
• Observation
– An obstacle can be modeled by a polygon.
– A given polygon creates a set of visible spaces with respect to the data objects to be clustered.
• Goal
– Maintain the set of visible spaces created by an obstacle associated with data objects.
• Approach
– Represent an obstacle as a set of Obstruction Lines.
Modeling Constraints
• Polygon Reduction Algorithm
• Two steps
1. Convexity Test
2. Construct obstruction lines
1. Convexity Test.
• A pre-stage that determines whether a polygon is convex or concave by checking the type of every point in the polygon.
• Approaches
– Turning Directional Approach
» Assumes the points of a polygon are enumerated in order: clockwise or counterclockwise
» O(n)
– Externality Approach
» Checks the relation between a polygon and an assessment edge that is "very" close to a query point
» O(n²)
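One common way to realise a turning-direction convexity test is with cross products: walk the vertices in order and flag every vertex whose turn goes against the polygon's overall orientation. The sketch below is an assumption about how such a test can be written, not the authors' code; it determines the orientation from the signed (shoelace) area, assumes a simple polygon given as an ordered list of 2-D tuples, and uses hypothetical names.

```python
def cross(o, a, b):
    """Cross product (a - o) x (b - o); positive for a counterclockwise turn at a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def concave_vertices(polygon):
    """Indices of the concave points of a simple polygon, in one O(n) pass."""
    n = len(polygon)
    # Twice the signed area: positive if vertices are listed counterclockwise.
    area2 = sum(polygon[i][0] * polygon[(i + 1) % n][1]
                - polygon[(i + 1) % n][0] * polygon[i][1] for i in range(n))
    sign = 1 if area2 > 0 else -1
    # A vertex is concave when its turn direction opposes the overall orientation.
    return [i for i in range(n)
            if sign * cross(polygon[i - 1], polygon[i], polygon[(i + 1) % n]) < 0]

def is_convex(polygon):
    """A polygon is convex if it has no concave point."""
    return not concave_vertices(polygon)

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
notched = [(0, 0), (4, 0), (4, 4), (2, 1), (0, 4)]  # vertex (2, 1) is a notch
square_ok = is_convex(square)
notch_at = concave_vertices(notched)
```

Because the orientation is derived from the signed area, the same function works whether the vertices are listed clockwise or counterclockwise.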
Examples of Convexity Test – Turning Directional Approach
[Figure: turning direction at consecutive polygon vertices v1, v2, v3]
Examples of Convexity Test – Externality Approach
[Figure: three query points, each with its assessment edge; two are labelled as convex points and one as a concave point. Annotation: a point inside the triangle area of the query point and the two endpoints of an assessment edge.]
Modeling Constraints
– Polygon Reduction Algorithm
1. Define the type of a polygon via Convexity Test
• A polygon is concave if ∃ a concave point in the polygon.
• A polygon is convex if ∀ points are convex points.
2. Convex – ⌈n/2⌉ obstruction lines, where n is the number of points of the polygon.
3. Concave – The number of obstruction lines depends on the shape of the given polygon
Modeling Obstacles: An example
[Figure: example polygon with vertices vs1 to vs6 and its obstruction lines]
Modeling Constraints – a crossing
• Crossing Modeling
– Objective
• Efficiently assign connectivity functionality.
– Method: A polygon with Entry Points and Entry Edges.
• Defined by users' or applications' demands
– Entry points modeled from a crossing connect reachable objects
[Figure: a crossing annotated with its Entry Points, Entry Edges and Eps]
DBCluC
• DBCluC
– Extension of DBSCAN
– Starts clustering from an arbitrary data point.
– Indexes data points with an SR-tree
• K-NN Query and Range Query available.
– Considers crossing constraints while (after) clustering.
– Considers obstacles after retrieving the neighbours of a given query point.
• Visibility between a query point and its neighbours is checked for all obstacles.
– Complexity
• O(N · log N · L), where N is the number of data points and L is the number of obstruction lines.
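The obstacle-handling part of the steps above can be sketched as DBSCAN whose neighbourhood query drops every neighbour hidden behind an obstruction line. This is a simplified illustration under stated assumptions, not the authors' implementation: a brute-force scan replaces the SR-tree range query, crossings are not modelled, only proper crossings of a line count as blocked, and all names (`dbcluc_sketch`, `visible_neighbours`) are hypothetical.

```python
from collections import deque
from math import dist

NOISE = -1

def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a)."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def _crosses(p, q, a, b):
    """True if segments pq and ab properly cross (touching cases ignored here)."""
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)

def visible_neighbours(points, i, eps, lines):
    """Eps-neighbours of points[i] whose joining edge crosses no obstruction line."""
    p = points[i]
    return [j for j, q in enumerate(points)
            if dist(p, q) <= eps and not any(_crosses(p, q, a, b) for a, b in lines)]

def dbcluc_sketch(points, eps, min_pts, lines):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = visible_neighbours(points, i, eps, lines)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = visible_neighbours(points, j, eps, lines)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point: keep expanding
        cluster += 1
    return labels

# Two columns of points one unit apart, plus one far-away point (illustrative data).
points = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (10, 10)]
wall = [((0.5, -1.0), (0.5, 3.0))]
without_wall = dbcluc_sketch(points, eps=1.0, min_pts=2, lines=[])
with_wall = dbcluc_sketch(points, eps=1.0, min_pts=2, lines=wall)
```

Without the wall the two columns merge into one cluster; with the wall between them they are split into two, and the isolated point stays noise in both runs. The visibility check inside `visible_neighbours` is where the factor L in the stated complexity comes from.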
Performance
• Performance Evaluation - based on synthetic data sets
– Detecting arbitrary shaped clusters
– Insensitive to data input order
– Discriminating noise and outliers
– Pruning search spaces

                                                   DS3        DS5
Number of data objects                             12k        1000
Number of obstacles (line segments) / crossings    7(29)/2    18(114)/3
Number of obstruction lines                        15         74
Performance (DS3)
[Figure panels: (a) Before clustering; (b) Clustering ignoring constraints; (c) Clustering with bridges; (d) Clustering with obstacles; (e) Clustering with obstacles and bridges]
Performance (DS5)
[Figure panels: (a) Before clustering; (b) Clustering ignoring constraints; (c) Clustering with bridges; (d) Clustering with obstacles; (e) Clustering with obstacles and bridges]
Performance
[Figure (a): Run time varying size of data objects. x-axis: number of data points (25k to 200k); y-axis: time in seconds (0 to 800)]
Performance
[Figure (b): Run time varying size of obstacles. x-axis: number of line segments in obstacles / number of obstruction lines, from 121/72 to 2400/1080 (N=38K, Eps=6.0, and MinPts=3); y-axis: time in seconds (0 to 600)]
Conclusion
• Propose a spatial clustering algorithm in the presence of
Constraints: Obstacles and Crossings.
• Modeling constraints
– Obstacles
• Polygon Reduction Algorithm.
– Reduces search spaces allowing DBCluC to handle large number
of obstacles
– Crossing
• Entry point and Entry edge.
– Control connectivity flow
• Experiments
– Scalable, efficient, and effective.
Future Work
• Indexing obstacles
– Prune search spaces for large number of obstacles
– Reduce the complexity of DBCluC to O(N•logN)
• Extension to higher dimensions with obstruction hyperplanes
• Consider the object altitude
• Consider more constraints: Time, Length of a crossing, Direction of Crossing (one direction/bidirectional)
• Extension to operational constraints
References
[1] A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles. In Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
[2] Vladimir Estivill-Castro and IckJai Lee. AUTOCLUST+: Automatic clustering of point-data sets in the presence of obstacles. In International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining (TSDM2000), pages 133-146, 2000.
[3] M. G. Stone. A mnemonic for areas of polygons. Amer. Math. Monthly, 93:479-480, 1986.
[4] Anthony K. H. Tung, Raymond T. Ng, Laks V. S. Lakshmanan, and Jiawei Han. Constraint-based clustering in large databases. In ICDT, pages 405-419, 2001.
[5] Osmar R. Zaïane and Chi-Hoon Lee. Clustering Spatial Data in the Presence of Obstacles: a Density-Based Approach. In Sixth International Database Engineering and Applications Symposium (IDEAS 2002), Edmonton, Alberta, Canada, July 17-19, 2002.
[6] Osmar R. Zaïane, Andrew Foss, Chi-Hoon Lee, and Weinan Wang. On Data Clustering Analysis: Scalability, Constraints and Validation. In Proc. of the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'02), pages 28-39, Taipei, Taiwan, May 2002.
[7] Osmar R. Zaïane and Chi-Hoon Lee. Clustering Spatial Data When Facing Physical Constraints. In Proc. of the IEEE 2002 International Conference on Data Mining (ICDM'2002), pp ??-??, Maebashi City, Japan, December 9-12, 2002.
Visibility Graph from [1]
[Figure: visibility graph connecting data objects p and q around obstacles O1 and O2 through obstacle vertices v1 to v5]
Delaunay diagram
- Collection of edges satisfying an "empty circle" property: for each edge we can find a circle containing the edge's endpoints but not containing any other points.
- Dual of Voronoi Diagram
[Figure: example Delaunay diagram]
Visible Space
Given a set D of n data objects with a polygon P(V, E), a visible space S is a space that has a set P of data objects satisfying the following:
1. Space S is defined by three edges: the first edge (edges) e ∈ E connects two minimal convex points vi, vj ∈ V, the second edge f is the extension of the line connecting vi and its other adjacent point vk ∈ V, and the third edge g is the extension of the line connecting vj and its other adjacent point vl ∈ V.
2. ∀ p, q ∈ P, p and q are visible to each other in S. Thus, P ⊆ D.
3. S is not visible to any other visible space S'. Thus, S' ∩ S = ∅.
[Figure: a polygon with edges e1 to e5 and the visible spaces S1 to S5 it creates]