Density-Based Clustering of Spatial Data when Facing Physical Constraints
Authors: Dr. Osmar R. Zaïane and Chi-Hoon Lee
Database Laboratory, Department of Computing Science, University of Alberta

DBCluC (Density-Based Clustering with Constraints)
• Introduction
• Related Works
• Background Concepts
• Modeling Constraints
• DBCluC Algorithm
• Performance Evaluation
• Conclusion

Introduction
• Cluster Analysis
  – Clustering (unsupervised classification) is the process of partitioning data objects into a set of meaningful sub-classes, called clusters, by maximizing closeness within a cluster and minimizing closeness between clusters.
• Taxonomy of Clustering Methods
  – Non-constraint-based
    • Partitioning: K-means, K-medoids, CLARANS
    • Hierarchical: AGNES/DIANA, BIRCH, CURE, CHAMELEON
    • Density-based: DBSCAN, DENCLUE
    • Grid-based: STING, WaveCluster
    • Graph partitioning: AUTOCLUST
  – Constraint-based
    • Partitioning: COD-CLARANS
    • Graph partitioning: AUTOCLUST+
    • Density-based: DBCluC

Introduction (Cont.)
• Key factors for a spatial clustering algorithm
  – Scalability
  – Discovery of arbitrarily shaped clusters
  – Discrimination of noise and outliers
  – Minimal domain knowledge
  – Insensitivity to data input order
  – Constraints
    • Operational constraints
      » e.g., SQL aggregate and existence constraints [4]
    • Physical constraints
      » e.g., obstacles [1, 2] and crossings

Related Works
• COD-CLARANS (A. K. H. Tung et al., 2001)
  – Defines the relationship between obstacles and data objects by visibility graphs in order to compute obstructed distances between data objects
  – Requires expensive preprocessing steps
  – Inherits the disadvantages of CLARANS
    • The number of clusters (k) must be specified
    • Main memory management
    • Micro-clustering method; detects only spherical clusters
• AUTOCLUST+ (V. Estivill-Castro et al., 2000)
  – Delaunay structure for data points
  – Models obstacles as a set of line segments
  – Scalable and efficient in 2-dimensional space

Background Concepts
• DBSCAN
  – Proposed by Ester, Kriegel, Sander, and Xu (KDD'96)
  – Density-based spatial clustering algorithm that discriminates noise
  – Detects arbitrarily shaped clusters in the presence of noise
  – Uses an R*-tree index (O(log n) per neighbourhood query)
  – Density is evaluated by two parameters: Eps and MinPts
    • Eps: maximum radius of the neighbourhood
    • MinPts: minimum number of points in the Eps-neighbourhood of a given query point
  – N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}; the density condition is |N_Eps(p)| ≥ MinPts

Background Concepts: DBSCAN
• Directly density-reachable
  – A point p is directly density-reachable from a point q wrt. Eps, MinPts if p ∈ N_Eps(q) and |N_Eps(q)| ≥ MinPts.
• Density-reachable
  – A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi.
• Density-connected
  – A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
(Figure: example point sets with points p, q, and o illustrating the three notions, with MinPts = 4 and Eps = 2 cm.)

Background Concepts: DBSCAN
• Cluster
  – A non-empty subset C of data points satisfying the following conditions:
  – 1) Maximality: ∀ p, q: if p ∈ C and q is density-reachable from p with respect to Eps and MinPts, then q ∈ C.
  – 2) Connectivity: ∀ p, q ∈ C: p is density-connected to q with respect to Eps and MinPts.
• Noise
  – A data point that does not belong to any cluster

Motivating Concepts: Obstacle
(Figure.)

Background Concepts (cont.)
• Obstacle constraints
  1. An obstacle entity
    • Disconnectivity functionality: grouping the nearest data objects is not feasible
    • A polygon denoted by P(V, E), where V is the set of points of the polygon and E is the set of its line segments
    • Types: convex and concave

Background Concepts: Obstacle-free density notions
• Directly obstacle-free density-reachable
  – A point p is directly obstacle-free density-reachable from a point q wrt. Eps, MinPts if p is directly density-reachable from q and the edge joining p and q is obstacle-free.
• Obstacle-free density-reachable
  – A point p is obstacle-free density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly obstacle-free density-reachable from pi.
• Obstacle-free density-connected
  – A point p is obstacle-free density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are obstacle-free density-reachable from o.
(Figure: example point sets with points p, q, r, and o near an obstacle, with MinPts = 4 and Eps = 2 cm.)

Background Concepts: DBCluC
• Cluster
  – A non-empty subset C of data points satisfying the following conditions:
  – 1) Maximality: ∀ p, q: if p ∈ C and q is obstacle-free density-reachable from p with respect to Eps and MinPts, then q ∈ C.
  – 2) Connectivity: ∀ p, q ∈ C: p is obstacle-free density-connected to q with respect to Eps and MinPts.
• Noise
  – A data point that does not belong to any cluster

Motivating Concepts: Obstacle
(Figure.)

Modeling Constraints – Obstacles
• Modeling obstacles
  – Objectives
    • Assign the disconnectivity functionality.
    • Enhance the performance of processing a large number of obstacles by reducing search spaces.
  – Method: Polygon Reduction Algorithm
    • Observation
      » An obstacle can be modeled by a polygon.
      » A given polygon creates a set of visible spaces with respect to the data objects to be clustered.
    • Goal
      » Maintain the set of visible spaces created by an obstacle with respect to the data objects.
    • Approach
      » Represent an obstacle as a set of obstruction lines.

Modeling Constraints
• Polygon Reduction Algorithm
  – Two steps
    1. Convexity test
    2. Construction of obstruction lines
• 1. Convexity test
  – A preliminary stage that determines whether a polygon is convex or concave by checking the type of every point in the polygon.
  – Approaches
    • Turning directional approach
      » Assumes the points of the polygon are enumerated in order, clockwise or counterclockwise
      » O(n)
    • Externality approach
      » Checks the relation between the polygon and an assessment edge that is "very" close to a query point
      » O(n²)

Examples of Convexity Test – Turning Directional Approach
(Figure: consecutive polygon points v1, v2, v3.)

Examples of Convexity Test – Externality Approach
(Figure: query points classified as convex or concave; labels: query point, convex point, concave point, assessment edge, and a point inside the triangle formed by the query point and the two endpoints of an assessment edge.)

Modeling Constraints – Polygon Reduction Algorithm
1. Determine the type of the polygon via the convexity test
  • A polygon is concave if it contains a concave point.
  • A polygon is convex if all of its points are convex points.
2. Convex: n/2 obstruction lines
3. Concave: the number of obstruction lines depends on the shape of the given polygon

Modeling Obstacles: An example
(Figure: a polygon with visible spaces vs1 to vs6; the numbers 8 and 4 also appear in the figure.)

Modeling Constraints – a crossing
• Crossing modeling
  – Objective
    • Efficiently assign the connectivity functionality.
  – Method: a polygon with entry points and entry edges
    • Defined by the demands of users or applications
  – Entry points modeled from a crossing connect reachable objects
(Figure: a crossing with the labels Entry Points, Eps, and Entry Edges.)

DBCluC
• DBCluC
  – Extension of DBSCAN
  – Starts clustering from an arbitrary data point
  – Indexes data points with an SR-tree
    • k-NN queries and range queries are available
  – Considers crossing constraints while (after) clustering
  – Considers obstacles after retrieving the neighbours of a given query point
    • Visibility between a query point and its neighbours is checked against all obstacles
  – Complexity
    • O(N · log N · L), where N is the number of data points and L is the number of obstruction lines

Performance
• Performance evaluation, based on synthetic data sets
  – Detecting arbitrarily shaped clusters
  – Insensitivity to data input order
  – Discriminating noise and outliers
  – Pruning search spaces

                                                   DS3          DS5
Number of data objects                             12k          1000
Number of obstacles (line segments) / crossings    7 (29) / 2   18 (114) / 3
Number of obstruction lines                        15           74

Performance (DS3)
(Figure panels: (a) Before clustering; (b) Clustering ignoring constraints; (c) Clustering with bridges; (d) Clustering with obstacles; (e) Clustering with obstacles and bridges.)

Performance (DS5)
(Figure panels: (a) Before clustering; (b) Clustering ignoring constraints; (c) Clustering with bridges; (d) Clustering with obstacles; (e) Clustering with obstacles and bridges.)

Performance
(Figure (a): run time for a varying number of data objects. x-axis: number of data points, 25k to 200k; y-axis: time in seconds, 0 to 800.)
(Figure (b): run time for a varying number of obstacles, with N = 38K, Eps = 6.0, and MinPts = 3. x-axis: number of line segments in obstacles / number of obstruction lines, at 121/72, 300/135, 600/270, 900/405, 1200/540, 1500/675, 1800/810, 2100/945, and 2400/1080; y-axis: time in seconds, 0 to 600.)

Conclusion
• Proposed a spatial clustering algorithm that works in the presence of constraints: obstacles and crossings.
• Modeling constraints
  – Obstacles
    • Polygon Reduction Algorithm
      » Reduces search spaces, allowing DBCluC to handle a large number of obstacles
  – Crossings
    • Entry points and entry edges
      » Control the connectivity flow
• Experiments
  – Scalable, efficient, and effective
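
As an illustration of the procedure summarized in this talk, the following Python sketch runs a DBSCAN-style expansion in which a neighbour counts only if the segment joining it to the query point crosses no obstruction line. It is a minimal sketch under several assumptions that are not from the slides: neighbours are found by brute force rather than with an SR-tree, obstruction lines are given directly as 2-D segments, MinPts is tested against the visible neighbours, crossings are left out, and the names `dbcluc_sketch`, `neighbours`, `visible`, and `segments_cross` are hypothetical.

```python
from math import hypot

NOISE = -1


def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, 0, or -1."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p, q, a, b):
    """True if segment pq properly crosses segment ab.

    Touching and collinear cases are treated as non-crossing (assumption).
    """
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)


def visible(p, q, obstruction_lines):
    """True if the edge joining p and q crosses no obstruction line."""
    return not any(segments_cross(p, q, a, b) for a, b in obstruction_lines)


def neighbours(points, i, eps, obstruction_lines):
    """Indices of points within Eps of points[i] that are visible from it.

    Brute-force scan; the slides use an SR-tree for this step.
    """
    p = points[i]
    return [j for j, q in enumerate(points)
            if hypot(p[0] - q[0], p[1] - q[1]) <= eps
            and visible(p, q, obstruction_lines)]


def dbcluc_sketch(points, eps, min_pts, obstruction_lines):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(points, i, eps, obstruction_lines)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(points, j, eps, obstruction_lines)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, keep expanding
        cluster += 1
    return labels
```

For example, two parallel columns of points one unit apart form a single cluster with `eps = 1.0` and `min_pts = 3` when no obstruction line is given, and split into two clusters once a vertical obstruction line is placed between the columns. Each neighbourhood query here costs O(N · L), which is why the slides combine a spatial index with the reduced set of obstruction lines.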

Future Work
• Indexing obstacles
  – Prune search spaces for a large number of obstacles
  – Reduce the complexity of DBCluC to O(N · log N)
• Extension to higher dimensions with obstruction hyperplanes
• Consider the altitude of objects
• Consider more constraints: time, length of a crossing, direction of a crossing (one-directional or bidirectional)
• Extension to operational constraints

References
[1] A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles. In Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
[2] V. Estivill-Castro and I. Lee. AUTOCLUST+: Automatic Clustering of Point-Data Sets in the Presence of Obstacles. In International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining (TSDM2000), pages 133-146, 2000.
[3] M. G. Stone. A Mnemonic for Areas of Polygons. Amer. Math. Monthly, 93:479-480, 1986.
[4] A. K. H. Tung, R. T. Ng, L. V. S. Lakshmanan, and J. Han. Constraint-Based Clustering in Large Databases. In ICDT, pages 405-419, 2001.
[5] O. R. Zaïane and C.-H. Lee. Clustering Spatial Data in the Presence of Obstacles: a Density-Based Approach. In Sixth International Database Engineering and Applications Symposium (IDEAS 2002), Edmonton, Alberta, Canada, July 17-19, 2002.
[6] O. R. Zaïane, A. Foss, C.-H. Lee, and W. Wang. On Data Clustering Analysis: Scalability, Constraints and Validation. In Proc. of the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'02), pp 28-39, Taipei, Taiwan, May 2002.
[7] O. R. Zaïane and C.-H. Lee. Clustering Spatial Data When Facing Physical Constraints. In Proc. of the 2002 IEEE International Conference on Data Mining (ICDM'02), pp ??-??, Maebashi City, Japan, December 9-12, 2002.

Visibility Graph from [1]
(Figure: visibility graph with points p and q, obstacles O1 and O2, and vertices v1 to v5.)

Delaunay diagram
• A collection of edges satisfying an "empty circle" property: for each edge there is a circle containing the edge's endpoints but no other points.
• Dual of the Voronoi diagram
(Figure: example Delaunay diagram.)

Visible Space
Given a set D of n data objects and a polygon P(V, E), a visible space S is a space containing a set P of data objects that satisfies the following:
1. Space S is defined by three edges: the first edge (or edges) e ∈ E connects two minimal convex points vi, vj ∈ V; the second edge f is the extension of the line connecting vi and its other adjacent point vk ∈ V; and the third edge g is the extension of the line connecting vj and its other adjacent point vl ∈ V.
2. ∀ p, q ∈ P, p and q are visible to each other in S. Thus, P ⊆ D.
3. S is not visible to any other visible space S'. Thus, S' ∩ S = ∅.
(Figure: a polygon with edges e1 to e5 and visible spaces S1 to S5.)
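
The visible-space definition relies on the convex points of the obstacle polygon. The Python sketch below shows one way to classify points with the turning-direction test mentioned under the convexity test. It assumes the points are listed in counterclockwise order with no three consecutive points collinear; the function names `classify_points` and `is_convex_polygon` are hypothetical and not from the slides.

```python
def cross(o, a, b):
    """Cross product (a - o) x (b - o); positive for a left turn o -> a -> b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def classify_points(polygon):
    """Classify each point of a counterclockwise polygon.

    polygon: list of (x, y) tuples in counterclockwise order (assumption).
    Returns a list of "convex" / "concave" strings, one per point.
    A point is convex when the boundary turns left at it.
    """
    n = len(polygon)
    kinds = []
    for i in range(n):
        prev_pt = polygon[i - 1]       # index -1 wraps to the last point
        cur_pt = polygon[i]
        next_pt = polygon[(i + 1) % n]
        turn = cross(prev_pt, cur_pt, next_pt)
        kinds.append("convex" if turn > 0 else "concave")
    return kinds


def is_convex_polygon(polygon):
    """A polygon is convex if all of its points are convex points."""
    return all(kind == "convex" for kind in classify_points(polygon))
```

Each point is visited once, which matches the O(n) cost stated for the turning directional approach. If the points were enumerated clockwise instead, the sign test would need to be reversed.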