Computational Geometry and Spatial Data Mining Marc van Kreveld Department of Information and Computing Sciences Utrecht University Two-part presentation • Morning: Introduction to computational geometry with examples from spatial data mining • Afternoon: Geometric algorithms for spatial data mining (and spatio-temporal data mining) Spatial data mining and computation • “Geographic data mining involves the application of computational tools to reveal interesting patterns in objects and events distributed in geographic space and across time” (Miller & Han, 2001) [ data analysis ? ] • Large data sets attempt to carefully define interesting patterns (to avoid finding non-interesting patterns) advanced algorithms needed for efficiency Introduction to CG • Some words on algorithms and efficiency • Computational geometry algorithms through examples from spatial data mining – Voronoi diagrams and clustering – Arrangements and largest clusters – Approximation for the largest cluster Algorithms and efficiency • You may know it all already: • Please look bored if you know all of this • Please look bewildered if you haven’t got a clue what I’m talking about Algorithms • Computational problems have an input size, denoted by n – A set of n numbers – A set of n points in the plane (2n coordinates) – A simple polygon with n vertices – A planar subdivision with n vertices • A computational problem defines desired output in terms of the input Algorithms • Examples of computational problems: – Given a set of n numbers, put them in sorted order – Given a set of n points, find the two that are closest – Given a simple polygon P with n vertices and a point q, determine if q is inside P P q Algorithms • An algorithm is a scheme (sequence of steps) that always gives the desired output from the given input • An algorithm solves a computational problem • An algorithm is the basis of an implementation Algorithms • An algorithm can be analyzed for its running time efficiency • Efficiency is expressed using O(..) notation, it gives the scaling behavior of the algorithm – O(n) time: the running time doubles (roughly) if the input size doubles – O(n2) time: the running time quadruples (roughly) if the input size doubles Algorithms • Why big-Oh notation? – Because it is machine-independent – Because it is programming languageindependent – Because it is compiler-independent unlike running time in seconds It is only algorithm/method-dependent Algorithms • Algorithms research is concerned with determining the most efficient algorithm for each computational problem – Until ~1978: O(n2) time – Until 1990: O(n log n) time – Now: O(n) time } polygon triangulation Algorithms • For some problems, efficient algorithms are unknown to exist • Approximation algorithms may be an option. E.g. TSP – Exact: exponential time – 2-approx: O(n log n) time – 1.5-approx: O(n3) time – (1+)-approx: O(n1/) time Voronoi diagrams and clustering • A Voronoi diagram stores proximity among points in a set Voronoi diagrams and clustering • Single-link clustering attempts to maximize the distance between any two points in different sets Voronoi diagrams and clustering Voronoi diagrams and clustering Voronoi diagrams and clustering • Algorithm (point set P; desired: k clusters): – Compute Voronoi diagram of P – Take all O(n) neighbors and sort by distance – While #clusters > k do • Take nearest neighbor pair p and q • If they are in different clusters, then merge them and decrement #clusters (else, do nothing) Voronoi diagrams and clustering • Analysis; n points in P: – Compute Voronoi diagram: O(n log n) time – Sort by distance: O(n log n) time – While loop that merges clusters: O(n log n) time (using union-find structure) • Total: O(n log n) + O(n log n) + O(n log n) = O(n log n) time Voronoi diagrams and clustering • What would an “easy” algorithm have given? – really easy: O(n3) time – slightly less easy: O(n2 log n) time n3 time 10 n2 log n 1000 n log n 100 200 300 Computing Voronoi diagrams • By plane sweep • By randomized incremental construction • By divide-and-conquer all give O(n log n) time Computing Voronoi diagrams • Study the geometry, find properties – 3-point empty circle Voronoi vertex – 2-point empty circle Voronoi edge Computing Voronoi diagrams • Some geometric properties are needed, regardless of the computational approach • Other geometric properties are only needed for some approach Computing Voronoi diagrams • Fortune’s sweep line algorithm (1987) – An imaginary line moves from left to right – The Voronoi diagram is computed while the known space expands (left of the line) Computing Voronoi diagrams • Beach line: boundary between known and unknown sequence of parabolic arcs – Geometric property: beach line is y-monotone it can be stored in a balanced binary tree Computing Voronoi diagrams • Events: changes to the beach line = discovery of Voronoi diagram features – Point events Computing Voronoi diagrams • Events: changes to the beach line = discovery of Voronoi diagram features – Point events Computing Voronoi diagrams • Events: changes to the beach line = discovery of Voronoi diagram features – Circle events Computing Voronoi diagrams • Events: changes to the beach line = discovery of Voronoi diagram features – Circle events Computing Voronoi diagrams • Events: changes to the beach line = discovery of Voronoi diagram features – Only point events and circle events exist Computing Voronoi diagrams • For n points, there are – n point events – at most 2n circle events Computing Voronoi diagrams • Handling an event takes O(log n) time due to the balanced binary tree that stores the beach line in total O(n log n) time Intermediate summary • Voronoi diagrams are useful for clustering (among many other things) • Voronoi diagrams can be computed efficiently in the plane, in O(n log n) time • The approach is plane sweep (by Fortune) Figures from the on-line animation of Allan Odgaard & Benny Kjær Nielsen Arrangements and largest clusters • Suppose we want to identify the largest subset of points that is in some small region – formalize “region” to circle – formalize “small’’ to radius r Place circle to maximize point containment r Arrangements and largest clusters • Bad idea: Try m = 1, 2, ... and test every subset of size m • Not so bad idea: for every 3 points, compute the smallest enclosing circle, test the radius and test the other points for being inside Arrangements and largest clusters • Bad idea analysis: A set of n points has roughly ( mn ) = O(nm) subsets of size m • Not so bad idea analysis: n points give ( n3 ) = O(n3) triples of points. Each can be tested in O(n) time O(n4) time algorithm Arrangements and largest clusters • The placement space of circles of radius r C p A circle C of radius r contains a point p if and only if the center of C lies inside a circle of radius r centered at p Arrangements and largest clusters • The placement space of circles of radius r Circles with center here contain 2 points of P Circles with center here contain 3 points of P Arrangements and largest clusters • Maximum point containment is obtained for circles whose center lies in the most covered cell of the placement space Computing the most covered cell • Compute the circle arrangement in a topological data structure • Fill the cells by the cover value by traversal of the arrangement 1 0 1 2 1 2 3 1 The value to be assigned to a cell is +/- 1 of its (known) neighbor Computing the most covered cell • Compute the circle arrangement: – by plane sweep: O(n log n + k log n) time – by randomized incremental construction in O(n log n + k) time where k is the complexity of the arrangement; k = O(n2) If the maximum coverage is denoted m, then k = O(nm) and the running time is O(n log n + nm) Computing the most covered cell • Randomized incremental construction: – Put circles in random order – “Glue” them into the topological structure for the arrangement with vertical extensions Every cell has ≤ 4 sides (2 vertical and 2 circular) Computing the most covered cell Every cell has ≤ 4 sides (2 vertical and 2 circular) Trace a new circle from its leftmost point to glue it into the arrangement the exit from any cell can be determined in O(1) time Computing the most covered cell • Randomized analysis can show that adding one circle C takes O(log n + k’ ) time, where k’ is the number of intersections with C • The whole algorithm takes O(n log n + k) time, where k = k’ is the arrangement size • The O(n + k) vertical extensions can be removed in O(n + k) time Computing the most covered cell • Traverse the arrangement (e.g., depth-first search) to fill the cover numbers in O(n + k) time • into a circle +1 • out of a circle -1 Intermediate summary • The largest cluster for a circle of radius r can be computed in O(n log n + nm) time if it has m entities • We use arrangement construction and traversal • The technique for arrangement construction is randomized incremental construction (Mulmuley, 1990) Largest cluster for approximate radius • Suppose the specified radius r for a cluster is not so strict, e.g. it may be 10% larger Place circle to maximize point containment r (1+) r If the largest cluster of radius r has m entities, we must guarantee to find a cluster of m entities and radius (1+) r Approximate radius clustering • The idea: snap the entity locations to grid points of a well-chosen grid Snapping should not move points too much: less than r /4 grid spacing r /4 works Approximate radius clustering • The idea: snap the entity locations to grid points of a well-chosen grid 1 1 1 1 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 For each grid point, collect and add the count of all grid points within distance (1+/2) r Approximate radius clustering • The idea: snap the entity locations to grid points of a well-chosen grid 1 1 1 1 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 For each grid point, collect and add the count of all grid points within distance (1+/2) r 1 Collected count = 10 Approximate radius clustering • The idea: snap the entity locations to grid points of a well-chosen grid 1 1 1 2 1 2 1 1 1 1 1 1 8 92 110 6 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 For each grid point, collect and add the count of all grid points within distance (1+/2) r Approximate radius clustering • Claim: a largest approximate radius cluster is given by the highest count 1 1 1 2 1 2 1 1 1 1 1 1 8 92 110 6 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 Approximate radius clustering • Let Copt be a radius-r circle with the most entities inside • Due to the grid spacing, we have a grid point within distance r /4 from the center of Copt that must have a count Approximate radius clustering • Snapping moves entities at most r /4 • C and Copt differ in radius r /2 no point in Copt can have moved outside C • Snapped points inside C have their origins inside a circle of radius at most (1+) r no points too far from C can have entered C Approximate radius clustering • Intuition: We use the in different places – Snapping points – Trying only circle centers on grid points ... and we guarantee to test a circle that contains all entities in the optimal circle, but not other entities too far away Approximate radius clustering • Efficiency analysis – n entities: each gives a count to O(1/2) grid cells – in O(n /2) time we have all collected counts and hence the largest count Exact or approximate? • O(n log n + nm) versus O(n /2) time • In practice: What is larger: m or 1 /2 ? – If the largest cluster is expected to be fairly small, then the exact algorithm is fine – If the largest cluster may be large and we don’t care about the precise radius, the approximate radius algorithm is better Concluding this session • Basic computational geometry ... Voronoi diagrams, arrangements, -approximation techniques ... is already useful for spatial data mining • Afternoon: spatial and spatio-temporal data mining and more geometric algorithms