Living under the Curse of
Dimensionality
Dave Abel
CSIRO
Roadmap
• Why spend time on high-dimensional data?
• Non-trivial …
• Some explanations
• A different approach
• Which leads to …
The Subtext: data engineering
• Solution techniques are compositions of
algorithms for fundamental operations;
• Algorithms assume certain contexts;
• There can be gains of orders of magnitude
in using ‘good’ algorithms that are suited to
the context;
• Sometimes better algorithms need to be
built for new contexts.
COTS Database technology
• ‘Simple’ data is handled well, even for very large databases and high transaction volumes, by relational databases;
• Geospatial data (2d, 2.5d, 3d) is handled
reasonably well;
• But pictures, series, sequences, …, are
poorly supported.
For example …
• Find the 10 days for which trading on the
LSX was most similar to today’s, and the
pattern for the following day;
• Find the 20 sequences from SwissProt
that are most similar to this one;
• If I hum the first few bars, can you fetch
the song from the music archive?
Dimensionality?
• It’s all in the modelling;
• K-d means … the important relationships and operations on these objects involve a certain set of k attributes as a bloc;
• 1d: a list; key properties flow from the value of a single attribute (its position in the list);
• 2d: points on a plane; key properties and relationships flow from position on the plane;
• 3d and 4d: …
All in the modelling …
Take a set of galaxies:
• Some physical interactions deal with
galaxies as points in 3d (spatial) space;
• Or analyses based on the colours of
galaxies could consider them as points in
(say) 5d (colour) space;
All in the modelling (>5d)…
Complex data types (pictures, graphs, etc.) can be modelled as kd points using well-known tricks:
– A blinking star could be modelled by the
histogram of its brightness;
– A photo could be represented as a histogram of brightness x colour (3x3) of its pixels, i.e. as a point in 9d space (see the sketch below);
– A sonar echo could be modelled by the
intensity every 10 ms after the first return.
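To make the "photo as a 9d point" trick concrete, here is a minimal sketch (Python with NumPy; the function name and the bucket boundaries are illustrative assumptions, not from the talk) that reduces an RGB image to a 3x3 brightness-by-colour histogram and treats it as a point in 9d space.

```python
import numpy as np

def photo_as_9d_point(rgb: np.ndarray) -> np.ndarray:
    """Reduce an HxWx3 uint8 image to a 3x3 brightness-by-colour histogram."""
    brightness = rgb.mean(axis=2)                  # per-pixel brightness, 0..255
    b_bucket = np.digitize(brightness, [85, 170])  # 3 brightness buckets
    c_bucket = rgb.argmax(axis=2)                  # dominant channel: R, G or B
    hist = np.zeros((3, 3))
    np.add.at(hist, (b_bucket.ravel(), c_bucket.ravel()), 1)
    return (hist / hist.sum()).ravel()             # a point in 9d space

# e.g. a random 64x64 "photo"
point = photo_as_9d_point(np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8))
print(point)                                       # 9 numbers summing to 1
```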
Access Methods
• Access methods structure a data set for
“efficient” search;
• The standard components of a method
are:
– Reduction of the data set to a set of sub-sets
(partitions);
– Definition of a directory (index) of partitions to
allow traversal;
– Definition of a search algorithm that traverses
intelligently.
Only a few variants on the theme
• Space-based
– Cells derived by a regular decomposition of the data space, s.t. cells have ‘nice’ properties;
– Points assigned to cells;
• Data-based
– Decomposition of the data set to sub-sets, s.t. the
sub-sets have ‘nice’ properties;
– Incremental or bulk load.
Efficiency comes through pruning: the index
supports discovery of the partitions that need not
be accessed.
kd an extension of 2d?
• Extensive R&D on (geo)spatial databases, 1985–1995;
• Surely kd is just a generalisation of the
problems in 2d and 3d?
• Analogues of 2d methods ran out of puff at
about 8d, sometimes earlier;
• Why was this? Did it matter?
The Curse of Dimensionality
• Named by Bellman (1961);
• The term’s scope has crept to cover, generally, the counter-intuitive effects that become increasingly awkward as the dimensionality rises;
• And the non-linearity of costs with
dimensionality (often exponential);
• Two examples.
CofD: Example 1
Sample the space [0,1]d by a grid with a
spacing of 0.1:
– 1d: 10 points
– 2d: 100 points
– 3d: 1000 points;
–…
– 10d: 10,000,000,000 (10^10) points.
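A one-line check of the growth (Python; nothing assumed beyond the 10-samples-per-axis grid on the slide):

```python
# Sampling [0,1]^d with a 0.1 spacing (10 samples per axis) needs 10**d points.
for d in (1, 2, 3, 10):
    print(f"{d}d: {10**d:,} grid points")
```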
CofD: Example 2
• Determine the mean number of points within a hypersphere of radius r, placed randomly within the unit hypercube, where points occur at density a. Let’s assume r << 1.
• Trivial if we ignore edge effects;
• But that would be misleading …
Edge effects?
P(edge effect) = 2r                   (1d)
               = 4r − 4r^2            (2d)
               = 6r − 12r^2 + 8r^3    (3d)
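The expansions above are consistent with the closed form P(edge effect) = 1 − (1 − 2r)^d, the probability that a randomly placed centre lies within r of at least one face. A minimal sketch (Python; r = 0.1 is an illustrative choice) of how quickly the edge becomes the common case:

```python
# P(edge effect) = 1 - (1 - 2r)**d: probability that a sphere of radius r,
# centred uniformly in the unit hypercube, pokes through at least one face.
r = 0.1
for d in (1, 2, 3, 10, 50, 100):
    print(f"{d:3d}d: P(edge effect) = {1 - (1 - 2 * r) ** d:.4f}")
```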
Which means …
• If it’s a uniform random distribution, a point is likely to be near a face (or edge) in high-dimensional space;
• Analyses quickly end up in intractable
expressions;
• Usually, interesting behaviour is lost when
models are simplified to permit neat
analyses.
Early rumbles …
• Weber et al [1998]: assertions that tree-based indexes will fail to prune in high-d;
• Circumstantial evidence;
• Relied on ‘well-known’ comparative costs
for disk and CPU (too generous);
• Not a welcome report!
Theorem of Instability
• Reported by Beyer et al [1999], formalised & extended by Shaft & Ramakrishnan [2005];
• For many data distributions, all pairs of
points are the same distance apart.
|d_c − d_f| → 0 as dim → ∞
(where d_c and d_f are the distances from a point to its closest and farthest neighbours)
[Figure: contrast plot for 3 Gaussian data sets]
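A minimal empirical sketch (Python with NumPy; the sample sizes are illustrative) of the effect the contrast plot shows: for Gaussian data, the relative gap between a query point’s nearest and farthest neighbour shrinks as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    data = rng.normal(size=(10_000, dim))
    query = rng.normal(size=dim)
    dists = np.linalg.norm(data - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  (d_f - d_c) / d_c = {contrast:.3f}")
```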
Which means …
• Any search method based on a contracting search region must fall to the performance of a naive (sequential) method, sooner or later;
• This (arguably) covers all approaches devised to date;
• So we need to think boldly (or change our
interests) ...
Target Problems
In high-d, operations are most commonly framed in terms of neighbourhoods:
– K Nearest Neighbours (kNN) query;
– kNN join;
– RkNN query.
In low-d, operations are most commonly framed
in terms of ranges for attributes.
kNN Query
• For this query point q, retrieve the 10
objects most similar to it.
• Which requires that we define similarity,
conventionally by a distance function;
• The query type in high-d;
• Almost ubiquitous in high-d;
• Formidable literatures.
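As a reference point, a brute-force kNN query under Euclidean distance is only a few lines (Python with NumPy; this naive scan is the baseline any index must beat):

```python
import numpy as np

def knn_query(data: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k objects most similar to q (Euclidean distance)."""
    dists = np.linalg.norm(data - q, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(1)
data = rng.random((68_040, 32))            # e.g. 32d colour histograms
print(knn_query(data, rng.random(32), k=10))
```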
kNN Join
• For each object of a set, determine the k
most similar points from the set;
• Encountered in data mining, classification,
compression, ….;
• A little care provides a big reward;
• Not a lot of investigation.
RkNN Query
• If a new object q appears, for what objects will it
be a k Nearest Neighbour?
• Eg a chain of bookstores knows where its stores
are and where its frequent-buyers live. It is
about to open a new store in Stockbridge. For
which frequent-buyers will the new store be
closer than the current stores?
• Even less investigation. High costs inhibit use.
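A minimal brute-force sketch (Python with NumPy; function and variable names are illustrative) of the bookstore example: for a candidate site q, find the buyers for whom q would enter their k nearest stores. The cost is quadratic in the data sizes, which is why cheaper RkNN methods matter.

```python
import numpy as np

def rknn(buyers: np.ndarray, stores: np.ndarray, q: np.ndarray, k: int) -> list[int]:
    """Buyers for whom the new store q would be among their k nearest stores."""
    result = []
    for i, b in enumerate(buyers):
        d_stores = np.linalg.norm(stores - b, axis=1)
        kth = np.partition(d_stores, k - 1)[k - 1]   # distance to current k'th store
        if np.linalg.norm(q - b) < kth:
            result.append(i)
    return result

rng = np.random.default_rng(2)
buyers, stores = rng.random((5_000, 2)), rng.random((40, 2))
print(rknn(buyers, stores, rng.random(2), k=3))      # new store at a random site
```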
Optimised Partitioning: the bet
• If we have a simple index structure and a
simple search method, we can frame
partitioning of the data set as an
optimisation (assignment) problem;
• Although it’s NP-hard, we can probably
solve it, well enough, using an iterative
method;
• And it might be faster.
Which requires
A. Devise the access method;
B. State the problem formally:
Objective function;
Constraints.
C. Choose a solution technique;
D. Evaluate.
Partitioning as the core concept
• Reduce the data set to subsets
(partitions).
• Partitions contain a variable number of
points, with an upper limit.
• Partitions have a Minimum Bounding Box.
Index
• The index is a list of the partitions’ MBBs;
• In no particular order;
• Held in RAM (and so we should impose an
upper limit on the number of partitions).
I = { id, {low, high}^d }
Mindist Search Discipline
• Fetch and scan the partitions (in a certain
sequence), maintaining a list of the k
candidates;
• To scan a partition,
– Evaluate dist from each member to the query
point;
– If better than the current k’th candidate, place
it in the list of candidates.
The Sequence: mindist
• Can simply evaluate the minimum
distance from a query point to any point
within an MBB (the mindist for a partition);
• If we fetch in ascending mindist, we can
stop when a mindist is greater than the
distance to the current k’th candidate;
• Conveniently, this is the optimum in terms
of partitions fetched.
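Putting the index and the search discipline together, here is a minimal sketch (Python with NumPy; partitions are held as plain arrays, and fetch/scan costs are ignored) of a kNN search that fetches partitions in ascending mindist order and stops as soon as no remaining partition can improve the k’th candidate:

```python
import numpy as np

def mindist(q: np.ndarray, low: np.ndarray, high: np.ndarray) -> float:
    """Minimum distance from q to any point inside the MBB [low, high]."""
    return float(np.linalg.norm(np.maximum(low - q, 0) + np.maximum(q - high, 0)))

def knn_search(partitions: list[np.ndarray], q: np.ndarray, k: int) -> np.ndarray:
    # the index: each partition's MBB, held in RAM, in no particular order
    index = sorted((mindist(q, p.min(axis=0), p.max(axis=0)), j)
                   for j, p in enumerate(partitions))
    best = np.full(k, np.inf)                 # distances of the current k candidates
    for md, j in index:                       # fetch in ascending mindist
        if md > best[-1]:                     # no unfetched partition can improve
            break
        dists = np.linalg.norm(partitions[j] - q, axis=1)   # scan the partition
        best = np.sort(np.concatenate([best, dists]))[:k]
    return best                               # distances to the k nearest points

rng = np.random.default_rng(3)
parts = [rng.random((200, 8)) + j * 0.01 for j in range(50)]   # 50 partitions, 8d
print(knn_search(parts, rng.random(8), k=10))
```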
For example
[Figure: worked example of the mindist discipline for a query point Q and partitions A, B and C, fetched in ascending mindist order until the search can stop]
Objective Function
Minimise the total elapsed time of
performing a large set of queries
Which requires that we have a
representative set of queries, from an
historical record or a generator. And we
have the solutions for those queries.
The Formal Statement
C(Q) = Σ_{m=1..nq} Σ_{j=1..np} δ_mj (A(B_j) + C(B_j))
Where A(B) is the cost of fetching a partition of B points, C(B) is the cost of scanning a partition of B points, and δ_mj marks whether partition j is fetched for query m (defined below).
Unit costs acquired empirically
We can plug in costs for different environments.
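As a rough sketch (Python with NumPy) of evaluating this objective for a candidate partitioning, where the cost models A and C, the unit costs and the function names are illustrative assumptions standing in for empirically acquired ones: sum, over a representative query workload with known answers, the fetch and scan costs of exactly those partitions the mindist discipline would touch.

```python
import numpy as np

def A(B: int) -> float:                 # assumed fetch-cost model (elapsed ms)
    return 0.5 + 0.002 * B

def C(B: int) -> float:                 # assumed scan-cost model (elapsed ms)
    return 0.001 * B

def mindist(q, low, high):
    return float(np.linalg.norm(np.maximum(low - q, 0) + np.maximum(q - high, 0)))

def workload_cost(partitions: list[np.ndarray], queries: np.ndarray,
                  dk_opt: np.ndarray) -> float:
    """C(Q): dk_opt[m] is the known k'th-NN distance for query m."""
    total = 0.0
    for q, dk in zip(queries, dk_opt):
        for p in partitions:
            if mindist(q, p.min(axis=0), p.max(axis=0)) <= dk:   # delta_mj = 1
                total += A(len(p)) + C(len(p))
    return total
```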
Constraints
• All points allocated to one (and only one) partition;
• Upper limit on points in a partition;
• Upper limit on number of partitions used.
Constraints
Σ_{j=1..N} x_ij = 1            ∀ i ∈ (1..N)
Σ_{j=1..N} γ_j ≤ MaxPart,      where γ_j = 1 if Σ_{i=1..N} x_ij > 0, and 0 otherwise
np_j ≤ MaxPartSize             ∀ j ∈ (1..N)
where x_ij = 1 if point i is allocated to partition j (0 otherwise), and np_j = Σ_i x_ij is the number of points in partition j.
Finally …
δ_mj = 1 iff mindist(q_m, j) ≤ DK_opt,   j ∈ (1..np), m ∈ (1..nq)
δ_mj = 0 otherwise
Which leaves us with the assignments of points to partitions as the
only decision variables.
The Solution Technique
• Applies a conventional iterative refinement
to an Initial Feasible Solution;
• The problem seems to be fairly placid;
• Acceptable load times for data sets trialled
to date.
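A minimal sketch of the refinement loop (Python with NumPy; it reuses the hypothetical workload_cost above, and the move strategy, a first-improvement pass over single-point reassignments, is an illustrative assumption rather than the technique actually used): start from a feasible assignment and accept any point move that lowers the workload cost without breaking the size limits.

```python
import numpy as np

def refine(points, assign, queries, dk_opt, max_part, max_part_size, rounds=3):
    """Iteratively improve an initial feasible assignment of points to partitions."""
    def cost(a):
        parts = [points[a == j] for j in range(max_part) if np.any(a == j)]
        return workload_cost(parts, queries, dk_opt)      # from the sketch above

    best = cost(assign)
    rng = np.random.default_rng(0)
    for _ in range(rounds):
        for i in rng.permutation(len(points)):
            for j in range(max_part):
                if j == assign[i] or np.count_nonzero(assign == j) >= max_part_size:
                    continue                               # keep the constraints
                trial = assign.copy()
                trial[i] = j
                c = cost(trial)
                if c < best:                               # first improvement
                    assign, best = trial, c
    return assign, best
```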
How to assess?
• Not hard to generate meaningless performance data;
• Basic behaviour: synthetic data (N, d, k,
distribution);
• Comparative: real data sets;
• Benchmarks: naive method and best-previously-reported;
• Careful implementation of a naive method can be rewarding.
[Figure: response with N (number of points)]
[Figure: response with dimensionality]
Data Set    N        d    Description
CorelHist   68040    32   Colour histograms of images
CorelUCI    68040    64   Colour histograms of images
Forests     581012   10   Forest cover descriptions
Aerial      274966   60   Texture data, aerial photographs
SF          174956   2    Geographic point locations
TS          76325    5    Time series, stock market indices
Landsat     275465   60   Texture data, satellite images
Stock       6500     360  Daily stock prices
Data Set    Partitioning Time (mm:ss)    OptP     Sequential   iDistance
CorelHist   6:04                         1.11     8.63         3.44
CorelUCI    17:59                        3.72     16.22        10.25
Forest      15:34                        1.08     25.50        4.23
Aerial      55:04                        23.11    63.25        43.22
SF          1:36                         0.42     3.06         1.05
TS          0:48                         0.14     2.41         1.22
Landsat     61:47                        23.36    63.13        44.63
Stock       1:32                         1.89     8.19         5.83
(Query costs are average elapsed time per query, in ms.)
What does it mean?
• Can reduce times by a factor of about 3, below the cutoff;
• The cutoff depends on the dataset size;
• Some conjectures drawn from the
Theorems are based on an unrealistic
model and are probably quantitatively
wrong;
• Times for kNN queries have apparently
fallen from 50 ms to 0.5 ms. 48.5 ms is
attributable to system caching.
Join? RkNN?
• Work in progress!
• Specialist kNN Join algorithms are well
worthwhile;
• Optimised Partitioning for RkNN works
well;
• Falls in query costs from 5 sec (or so) to 5
ms (or so);
• Query + join + reverse is a nice package.
Which all suggests (Part 1)
• Neighbourhood operations used only in a few,
specialised geospatial apps;
• Specific data structures used;
• More general view of “neighbourhood” might
open up more apps;
• Eg finding clusters of galaxies from catalogues:
– Large groups of galaxies that are bound
gravitationally;
– Available definitions are not helpful in “seeing”
clusters. The core element is high density;
– Search by neighbourhoods, rather than an arbitrary grid, to find high-density regions.
Which all suggests (Part 2)
• Algorithms using kNN as a basic operation
can be accelerated by (apparently) x100;
• RkNN is apparently much cheaper than
we expected (and …);
• Designer data structures appear possible
(eg design such that no more than 5% of
transactions take more than 50 ms).
And which shows …
• There are many interesting, open
problems out there, for data engineers;
• Using Other People’s Techniques can be
quite profitable;
• Data Engineers can be useful eScience
team members.
More?
dave.abel@csiro.au