Living under the Curse of Dimensionality
Dave Abel, CSIRO

Roadmap
• Why spend time on high-dimensional data?
• Non-trivial …
• Some explanations
• A different approach
• Which leads to …

The Subtext: data engineering
• Solution techniques are compositions of algorithms for fundamental operations;
• Algorithms assume certain contexts;
• There can be gains of orders of magnitude in using 'good' algorithms suited to the context;
• Sometimes better algorithms need to be built for new contexts.

COTS Database technology
• 'Simple' data is handled well, even for very large databases and high transaction volumes, by relational databases;
• Geospatial data (2d, 2.5d, 3d) is handled reasonably well;
• But pictures, series, sequences, …, are poorly supported.

For example …
• Find the 10 days for which trading on the LSX was most similar to today's, and the pattern for the following day;
• Find the 20 sequences from SwissProt that are most similar to this one;
• If I hum the first few bars, can you fetch the song from the music archive?

Dimensionality?
• It's all in the modelling;
• k-d means that the important relationships and operations on these objects involve a certain set of k attributes as a bloc;
• 1d: a list; key properties flow from the value of a single attribute (position in the list);
• 2d: points on a plane; key properties and relationships flow from position on the plane;
• 3d and 4d: …

All in the modelling …
Take a set of galaxies:
• Some physical interactions deal with galaxies as points in 3d (spatial) space;
• Analyses based on the colours of galaxies could instead consider them as points in (say) 5d (colour) space.

All in the modelling (>5d) …
Complex data types (pictures, graphs, etc.) can be modelled as k-d points using well-known tricks:
– A blinking star could be modelled by the histogram of its brightness;
– A photo could be represented as a histogram of brightness x colour (3x3) of its pixels, i.e. as a point in 9d space (a small sketch of this trick appears just before the Curse-of-Dimensionality examples below);
– A sonar echo could be modelled by its intensity every 10 ms after the first return.

Access Methods
• Access methods structure a data set for "efficient" search;
• The standard components of a method are:
– Reduction of the data set to a set of sub-sets (partitions);
– Definition of a directory (index) of partitions to allow traversal;
– Definition of a search algorithm that traverses intelligently.

Only a few variants on the theme
• Space-based:
– Cells derived by a regular decomposition of the data space, such that cells have 'nice' properties;
– Points assigned to cells.
• Data-based:
– Decomposition of the data set into sub-sets, such that the sub-sets have 'nice' properties;
– Incremental or bulk load.
Efficiency comes through pruning: the index supports discovery of the partitions that need not be accessed.

kd an extension of 2d?
• Extensive R&D on (geo)spatial databases, 1985-1995;
• Surely kd is just a generalisation of the problems in 2d and 3d?
• Analogues of 2d methods ran out of puff at about 8d, sometimes earlier;
• Why was this? Did it matter?

The Curse of Dimensionality
• Named by Bellman (1961);
• Its applicability has crept to include, generally, the "not commonsense" effects that become increasingly awkward as the dimensionality rises;
• And the non-linearity of costs with dimensionality (often exponential);
• Two examples.
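Before the two examples, a minimal sketch of the histogram trick from "All in the modelling (>5d)": reducing a photo to a point in 9d space. This is an illustrative reading of the slide, not the author's method; the 3x3 binning of brightness against a crude colour measure, and all names, are assumptions.

```python
# Illustrative sketch only; the binning scheme and names are assumptions, not the author's code.
import numpy as np

def photo_to_point(pixels):
    """Model a photo as a point in 9d space: a 3x3 histogram of
    brightness x (crude) colour, flattened and normalised.
    `pixels` is assumed to be an (H, W, 3) RGB array with values in [0, 1]."""
    brightness = pixels.mean(axis=2)              # per-pixel brightness in [0, 1]
    colour = pixels.argmax(axis=2) / 2.0          # dominant channel as a 3-level colour proxy
    hist, _, _ = np.histogram2d(brightness.ravel(), colour.ravel(),
                                bins=3, range=[[0, 1], [0, 1]])
    return hist.ravel() / hist.sum()              # the photo as a 9d point

# Toy usage: a random 64x64 "photo"
rng = np.random.default_rng(0)
print(photo_to_point(rng.random((64, 64, 3))))
```

A blinking star or a sonar echo would be handled the same way: a fixed-length vector of histogram or sample values, treated as a k-d point.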
CofD: Example 1
Sample the space [0,1]^d by a grid with a spacing of 0.1:
– 1d: 10 points;
– 2d: 100 points;
– 3d: 1,000 points;
– …
– 10d: 10,000,000,000 points.

CofD: Example 2
• Determine the mean number of points within a hypersphere of radius r, placed randomly within the unit hypercube with a density of a. Let's assume r << 1.
• Trivial if we ignore edge effects;
• But that would be misleading …

Edge effects?
P(edge effect) = 2r                  (1d)
               = 4r − 4r^2           (2d)
               = 6r − 12r^2 + 8r^3   (3d)
(in general, 1 − (1 − 2r)^d: the sphere avoids every face only if each coordinate of its centre lies in [r, 1 − r])

Which means …
• Under a uniform random distribution, a point is likely to be near a face (or edge) in high-dimensional space;
• Analyses quickly end up in intractable expressions;
• Usually, interesting behaviour is lost when models are simplified to permit neat analyses.

Early rumbles …
• Weber et al [1998]: assertions that tree-based indexes will fail to prune in high-d;
• Circumstantial evidence;
• Relied on 'well-known' comparative costs for disk and CPU (too generous);
• Not a welcome report!

Theorem of Instability
• Reported by Beyer et al [1999], formalised and extended by Shaft & Ramakrishnan [2005];
• For many data distributions, as the dimensionality grows all pairs of points tend towards the same distance apart:
  |d_f − d_c| / d_c → 0 as dim → ∞,
  where d_c and d_f are the distances from a point to its closest and farthest neighbours.
  (A small numerical illustration appears at the end of this part.)

[Figure: contrast plot for 3 Gaussian data sets.]

Which means …
• Any search method based on a contracting search region must fall to the performance of a naive (sequential) method, sooner or later;
• This arguably covers all approaches devised to date;
• So we need to think boldly (or change our interests) …

Target Problems
In high-d, operations are most commonly framed in terms of neighbourhoods:
– k Nearest Neighbours (kNN) query;
– kNN join;
– RkNN query.
In low-d, operations are most commonly framed in terms of ranges for attributes.

kNN Query
• For this query point q, retrieve the 10 objects most similar to it;
• Which requires that we define similarity, conventionally by a distance function;
• The query type in high-d; almost ubiquitous;
• A formidable literature.

kNN Join
• For each object of a set, determine the k most similar objects from the set;
• Encountered in data mining, classification, compression, …;
• A little care provides a big reward;
• Not a lot of investigation.

RkNN Query
• If a new object q appears, for which objects will it be a k Nearest Neighbour?
• E.g. a chain of bookstores knows where its stores are and where its frequent buyers live. It is about to open a new store in Stockbridge. For which frequent buyers will the new store be closer than the current stores?
• Even less investigation. High costs inhibit use.

Optimised Partitioning: the bet
• If we have a simple index structure and a simple search method, we can frame partitioning of the data set as an optimisation (assignment) problem;
• Although it's NP-hard, we can probably solve it, well enough, using an iterative method;
• And it might be faster.

Which requires
A. We devise the access method;
B. A formal statement of the problem: objective function; constraints;
C. A solution technique;
D. Evaluation.

Partitioning as the core concept
• Reduce the data set to subsets (partitions);
• Partitions contain a variable number of points, with an upper limit;
• Partitions have a Minimum Bounding Box (MBB).

Index
• The index is a list of the partitions' MBBs;
• In no particular order;
• Held in RAM (and so we should impose an upper limit on the number of partitions);
• Each entry is I = {id, {low, high}^d}.
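Returning to the Theorem of Instability above, a small numerical illustration (my own sketch, not from the talk): for uniformly distributed data, the relative gap between the farthest (d_f) and closest (d_c) distances from a random query point shrinks sharply as the dimensionality grows, which is exactly what defeats contracting-search-region methods.

```python
# Illustrative sketch only; parameters and names are my own choices.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000                                 # points in the unit hypercube
for d in (2, 10, 100, 1000):
    data = rng.random((n, d))              # uniform random data set
    q = rng.random(d)                      # a random query point
    dist = np.linalg.norm(data - q, axis=1)
    d_c, d_f = dist.min(), dist.max()      # closest and farthest distances
    print(f"d = {d:4d}   (d_f - d_c) / d_c = {(d_f - d_c) / d_c:.3f}")
```

The printed contrast drops steeply as d grows: in high dimensions almost every point is nearly as far away as the farthest one.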
Mindist Search Discipline
• Fetch and scan the partitions (in a certain sequence), maintaining a list of the k candidates;
• To scan a partition:
– Evaluate the distance from each member to the query point;
– If better than the current k'th candidate, place it in the list of candidates.

The Sequence: mindist
• We can cheaply evaluate the minimum distance from a query point to any point within an MBB (the mindist for a partition);
• If we fetch in ascending mindist, we can stop when a mindist is greater than the distance to the current k'th candidate;
• Conveniently, this is the optimum in terms of partitions fetched.
(A minimal sketch of the index and this search discipline appears at the end of this part.)

For example: [Figure: a worked example with partitions A, B and C and query point Q; partitions are fetched in ascending mindist, the candidate list is updated, and the search stops once the next mindist exceeds the current k'th candidate distance. Done!]

Objective Function
Minimise the total elapsed time of performing a large set of queries.
• Which requires a representative set of queries, from a historical record or a generator;
• And the solutions for those queries.

The Formal Statement

$$C(Q) = \sum_{m=1}^{nq} \sum_{j=1}^{np} \delta_{mj}\,\bigl(A(B_j) + C(B_j)\bigr)$$

where A(B) is the cost of fetching a partition of B points, and C(B) is the cost of scanning a partition of B points. Unit costs are acquired empirically, so we can plug in costs for different environments.

Constraints
• All points allocated to one (and only one) partition;
• Upper limit on the number of points in a partition;
• Upper limit on the number of partitions used.

Constraints (formally)

$$\sum_{j=1}^{np} x_{ij} = 1 \qquad i \in (1, \ldots, N)$$

$$\sum_{i=1}^{N} x_{ij} \le MaxPartSize \qquad j \in (1, \ldots, np)$$

$$\sum_{j=1}^{np} \delta_j \le MaxPart, \quad \text{where } \delta_j = \begin{cases} 1 & \text{if } \sum_{i=1}^{N} x_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$

with x_{ij} = 1 if point i is assigned to partition j, and 0 otherwise.

Finally …

$$\delta_{mj} = \begin{cases} 1 & \text{if } mindist(q_m, j) \le D_K^{opt}(q_m) \\ 0 & \text{otherwise} \end{cases} \qquad j \in (1, \ldots, np),\; m \in (1, \ldots, nq)$$

i.e. partition j is fetched for query q_m iff its mindist does not exceed the distance to the k'th nearest neighbour under the optimal search. Which leaves us with the assignments of points to partitions as the only decision variables.

The Solution Technique
• Applies a conventional iterative refinement to an Initial Feasible Solution;
• The problem seems to be fairly placid;
• Acceptable load times for the data sets trialled to date.

How to assess?
• It is not hard to generate meaningless performance data;
• Basic behaviour: synthetic data (varying N, d, k, distribution);
• Comparative: real data sets;
• Benchmarks: a naive method and the best previously reported;
• Careful implementation of a naive method can be rewarding.

[Figures: response with the number of points N; response with dimensionality.]

Data sets trialled:

Data Set   |      N | d   | Description
CorelHist  |  68040 | 32  | Colour histograms of images
CorelUCI   |  68040 | 64  | Colour histograms of images
Forests    | 581012 | 10  | Forest cover descriptions
Aerial     | 274966 | 60  | Texture data, aerial photographs
SF         | 174956 | 2   | Geographic point locations
TS         |  76325 | 5   | Time series, stock market indices
Landsat    | 275465 | 60  | Texture data, satellite images
Stock      |   6500 | 360 | Daily stock prices

Partitioning times and average query costs (elapsed ms):

Data Set   | Partitioning time (mm:ss) | OptP  | Sequential | iDistance
CorelHist  |  6:04                     |  1.11 |  8.63      |  3.44
CorelUCI   | 17:59                     |  3.72 | 16.22      | 10.25
Forest     | 15:34                     |  1.08 | 25.50      |  4.23
Aerial     | 55:04                     | 23.11 | 63.25      | 43.22
SF         |  1:36                     |  0.42 |  3.06      |  1.05
TS         |  0:48                     |  0.14 |  2.41      |  1.22
Landsat    | 61:47                     | 23.36 | 63.13      | 44.63
Stock      |  1:32                     |  1.89 |  8.19      |  5.83

What does it mean?
• Query times can be reduced by a factor of about 3, below the cutoff;
• The cutoff depends on the data set size;
• Some conjectures drawn from the theorems are based on an unrealistic model and are probably quantitatively wrong;
• Times for kNN queries have apparently fallen from 50 ms to 0.5 ms; 48.5 ms of that is attributable to system caching.

Join? RkNN?
• Work in progress!
• Specialist kNN join algorithms are well worthwhile;
• Optimised Partitioning for RkNN works well;
• Query costs fall from 5 sec (or so) to 5 ms (or so);
• Query + join + reverse is a nice package.
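To make the index and the mindist search discipline above concrete, a minimal sketch (my own, with assumed names; not the author's implementation): partitions with their MBBs held in a flat in-memory list, scanned in ascending mindist with the early-stopping rule.

```python
# Illustrative sketch of the mindist search discipline; structure and names are assumptions.
import heapq
import math

class Partition:
    """A partition: its points plus its Minimum Bounding Box (MBB)."""
    def __init__(self, points):
        self.points = points                               # list of d-dimensional tuples
        self.low = [min(c) for c in zip(*points)]          # MBB lower corner
        self.high = [max(c) for c in zip(*points)]         # MBB upper corner

def mindist(q, part):
    """Minimum possible distance from query point q to any point inside the MBB."""
    s = 0.0
    for qi, lo, hi in zip(q, part.low, part.high):
        gap = max(lo - qi, 0.0, qi - hi)                   # 0 when qi falls inside [lo, hi]
        s += gap * gap
    return math.sqrt(s)

def knn(q, partitions, k):
    """kNN query: fetch partitions in ascending mindist, stop once none can improve."""
    best = []                                              # max-heap of (-distance, point) candidates
    for part in sorted(partitions, key=lambda p: mindist(q, p)):
        if len(best) == k and mindist(q, part) > -best[0][0]:
            break                                          # remaining partitions cannot beat the k'th candidate
        for p in part.points:                              # scan the partition
            d = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return sorted((-nd, p) for nd, p in best)              # (distance, point) pairs, nearest first

# Toy usage: three small 2d partitions, k = 2
parts = [Partition([(1, 1), (2, 2)]),
         Partition([(5, 5), (6, 7)]),
         Partition([(9, 1), (8, 2)])]
print(knn((2.5, 2.5), parts, k=2))
```

The Optimised Partitioning work then treats the assignment of points to partitions as the decision variables, given a representative query set and the empirical fetch and scan costs A(B) and C(B).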
Which all suggests (Part 1)
• Neighbourhood operations have been used in only a few, specialised geospatial applications;
• Specific data structures have been used;
• A more general view of "neighbourhood" might open up more applications;
• E.g. finding clusters of galaxies from catalogues:
– Large groups of galaxies that are bound gravitationally;
– Available definitions are not helpful in "seeing" clusters; the core element is high density;
– Search by neighbourhoods, rather than an arbitrary grid, to find high-density regions.

Which all suggests (Part 2)
• Algorithms using kNN as a basic operation can apparently be accelerated by a factor of around 100;
• RkNN is apparently much cheaper than we expected (and …);
• Designer data structures appear possible (e.g. designed such that no more than 5% of transactions take more than 50 ms).

And which shows …
• There are many interesting, open problems out there for data engineers;
• Using Other People's Techniques can be quite profitable;
• Data engineers can be useful eScience team members.

More? dave.abel@csiro.au