
Search k-Nearest Neighbors
in High Dimensions
Tomer Peled
Dan Kushnir
Tell me who your neighbors are, and I'll know who you are
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Nearest Neighbor Search
Problem definition
• Given: a set P of n points in R^d, over some distance metric
• Find the nearest neighbor p of q in P

Applications
• Classification
• Clustering
• Segmentation
• Indexing
• Dimension reduction (e.g. LLE)

[Figure: a query point q in a 2D feature space with axes "weight" and "color"]
Naïve solution
• No preprocessing
• Given a query point q:
  – Go over all n points
  – Do the comparison in R^d
• Query time = O(nd)
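For concreteness, a minimal sketch of this naïve O(nd) scan, assuming NumPy and Euclidean distance (the slides leave the metric open):

```python
import numpy as np

def nearest_neighbor(P, q):
    """Brute-force nearest neighbor: compare q against all n points in R^d."""
    dists = np.sum((P - q) ** 2, axis=1)     # squared Euclidean distances, O(nd) work
    i = int(np.argmin(dists))
    return i, float(np.sqrt(dists[i]))

P = np.random.rand(1000, 64)                 # n = 1000 points in d = 64 dimensions
q = np.random.rand(64)
idx, dist = nearest_neighbor(P, q)
```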
Keep in mind
Common solution
• Use a data structure for acceleration
• Scalability with n and with d is important
When to use nearest neighbors
High level algorithms
[Diagram: parametric approaches (probability distribution estimation, complex models) vs. non-parametric approaches (density estimation, nearest neighbors) for sparse, high-dimensional data]
Nearest neighbors assume no prior knowledge about the underlying probability structure.
Nearest Neighbor
min_{pi ∈ P} dist(q, pi)

(r, ε)-Nearest Neighbor
[Figure: query q with two concentric circles of radii r and (1 + ε)·r, where r2 = (1 + ε)·r1]
• dist(q, p1) ≤ r
• dist(q, p2) ≤ (1 + ε)·r
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
The simplest solution
• "Lion in the desert"
Quadtree
• Split the first dimension into 2
• Repeat iteratively
• Stop when each cell has no more than 1 data point
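A rough sketch of this recursive splitting for 2D points, assuming distinct points and a capacity of one point per cell; the class name QuadNode and the bounding-box layout are illustrative, not from the talk:

```python
from dataclasses import dataclass, field

@dataclass
class QuadNode:
    xmin: float
    xmax: float
    ymin: float
    ymax: float
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)   # empty until the cell is split

    def insert(self, p):
        if self.children:                           # already split: route to a quadrant
            self._child_for(p).insert(p)
        else:
            self.points.append(p)
            if len(self.points) > 1:                # stop rule: at most 1 point per cell
                self._split()

    def _split(self):
        xm, ym = (self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2
        self.children = [
            QuadNode(self.xmin, xm, self.ymin, ym), QuadNode(xm, self.xmax, self.ymin, ym),
            QuadNode(self.xmin, xm, ym, self.ymax), QuadNode(xm, self.xmax, ym, self.ymax),
        ]
        old, self.points = self.points, []
        for p in old:                               # redistribute the points of this cell
            self._child_for(p).insert(p)

    def _child_for(self, p):
        xm, ym = (self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2
        return self.children[(p[0] >= xm) + 2 * (p[1] >= ym)]

root = QuadNode(0.0, 1.0, 0.0, 1.0)
for p in [(0.1, 0.2), (0.8, 0.3), (0.7, 0.9)]:
    root.insert(p)
```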
Quadtree – structure
[Figure: the plane is split at (X1, Y1); the root has four children labeled P<X1 & P<Y1, P<X1 & P≥Y1, P≥X1 & P<Y1, P≥X1 & P≥Y1]
Query – Quadtree
[Figure: descending the tree with the query point, using the same four quadrant labels]
In many cases this works.
Pitfall 1 – Quadtree
[Figure: a query case where the quadtree cell of q does not contain its nearest neighbor]
In some cases it doesn't.
Pitfall 1 – Quadtree
[Figure: a configuration where no split helps]
In some cases nothing works.

Pitfall 2 – Quadtree
[Figure: the number of cells grows with the dimension]
Could result in query time exponential in the number of dimensions: O(2^d).
Space-partition-based algorithms
• Could be improved
Reference: "Multidimensional Access Methods", Volker Gaede, O. Gunther
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Curse of dimensionality
• Naive: query time O(nd), space O(nd)
• For d > 10..20, space-partition methods become worse than a sequential scan
  – For most geometric distributions
• Techniques specific to high dimensions are needed
• Proved in theory and in practice by Barkol & Rabani 2000 and Beame & Vee 2002
Curse of dimensionality – some intuition
The number of cells grows exponentially with the dimension: 2, 2², 2³, …, 2^d
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Preview
• General solution – Locality Sensitive Hashing
• Implementation for Hamming space
• Generalization to l1 & l2
Hash function
Data_Item → Hash function → Key → Bin/Bucket (storage address in the data structure)
Example: X = a number in the range 0..n; hash = X modulo 3, giving keys 0..2.
Usually we would like related data items to be stored in the same bin.
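A toy illustration of this ordinary (non-locality-sensitive) hashing, using the slide's X mod 3 example; the bucket layout is illustrative:

```python
from collections import defaultdict

def h(x):
    # The slide's example hash: a number in 0..n mapped to one of 3 buckets.
    return x % 3

buckets = defaultdict(list)
for x in [4, 7, 9, 10, 15]:
    buckets[h(x)].append(x)      # items sharing a key land in the same bin

# buckets -> {1: [4, 7, 10], 0: [9, 15]}
```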
Recall: (r, ε)-Nearest Neighbor
[Figure: query q with circles of radii r and (1 + ε)·r, where r2 = (1 + ε)·r1]
• dist(q, p1) ≤ r
• dist(q, p2) ≤ (1 + ε)·r
Locality sensitive hashing
[Figure: query q with radii r and (1 + ε)·r, r2 = (1 + ε)·r1]
A hash family is (r, ε, P1, P2)-sensitive if:
• P1 ≡ Pr[I(p) = I(q)] is "high" if p is "close" to q (within r)
• P2 ≡ Pr[I(p) = I(q)] is "low" if p is "far" from q (beyond (1 + ε)·r)
Preview
• General solution – Locality Sensitive Hashing
• Implementation for Hamming space
• Generalization to l1 & l2
Hamming Space
• Hamming space = the set of 2^N binary strings of length N
• Hamming distance = the number of differing digits, a.k.a. signal distance (Richard Hamming)
Hamming Space
• Hamming space: binary strings of length N, e.g. 010100001111
• Hamming distance: 010100001111 vs. 010010000011 → distance = 4
• dist(x1, x2) = SUM(x1 XOR x2)
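A one-liner for this distance, assuming the strings are given as equal-length Python strings of '0'/'1':

```python
def hamming(x1: str, x2: str) -> int:
    """Hamming distance = number of positions where the bits differ (SUM of XOR)."""
    return sum(int(a) ^ int(b) for a, b in zip(x1, x2))

assert hamming("010100001111", "010010000011") == 4
```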
L1 to Hamming Space Embedding
Each coordinate (an integer in 0..C) is written in unary: C bits, of which the first value-many are 1.
Example with C = 11: the point p = (2, 8) becomes 11000000000 11111111000, so d' = C·d bits in total.
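A small sketch of this unary embedding, assuming integer coordinates in 0..C:

```python
def unary_embed(p, C):
    """Embed an integer vector with coordinates in 0..C into Hamming space:
    each coordinate v becomes v ones followed by C - v zeros (d' = C * d bits)."""
    return "".join("1" * v + "0" * (C - v) for v in p)

assert unary_embed([2, 8], 11) == "11000000000" + "11111111000"
```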
Hash function (bit sampling)
p ∈ H^d'  (e.g. p = 11000000000 11111111000)
• g_j(p) = p|I_j : the projection of p onto a random subset I_j of k bit positions (here k = 3 digits), for j = 1..L
• Store p into bucket g_j(p) = p|I_j (e.g. 101); there are 2^k buckets per table
Construction
[Figure: each point p is inserted into L hash tables, into bucket g_j(p) for j = 1..L]
Query
[Figure: the query q is looked up in the corresponding bucket of each of the L tables]
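Putting the pieces together, a compact sketch of the construction and query, assuming binary strings and the bit-sampling hash g_j(p) = p|I_j above; the class name and parameter values are illustrative:

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    def __init__(self, d_prime, k=3, L=4, seed=0):
        rng = random.Random(seed)
        # L independent samplings of k bit positions out of d'.
        self.I = [rng.sample(range(d_prime), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _g(self, p, j):
        return "".join(p[i] for i in self.I[j])   # p | I_j

    def insert(self, p):
        for j, table in enumerate(self.tables):
            table[self._g(p, j)].append(p)

    def query(self, q):
        # Union of the L buckets that q falls into; verify candidates exactly afterwards.
        cands = set()
        for j, table in enumerate(self.tables):
            cands.update(table[self._g(q, j)])
        return cands

lsh = BitSamplingLSH(d_prime=22, k=3, L=4)
lsh.insert("1100000000011111111000")
print(lsh.query("1100000000011111110000"))
```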
Alternative intuition: random projections
With C = 11, the point p = (2, 8) is embedded as 11000000000 11111111000 (d' = C·d bits); sampling k bits of this code is a random projection of it.
Alternative intuition: random projections
[Figure: the k = 3 sampled bits of p (e.g. 101) select one of the 2³ = 8 buckets 000, 001, …, 111]
Repeating L times
[Figure: the k-bit sampling is repeated L times, giving L tables of 2^k buckets each]
Secondary hashing
• A simple secondary hash maps the 2^k bucket keys of each table into M buckets of size B, with M·B = α·n (e.g. α = 2)
• This supports tuning the trade-off between dataset size and storage volume
The above hashing is locality-sensitive
Pr[p and q fall in the same bucket] = (1 − Distance(p, q) / #dimensions)^k
[Figure: collision probability vs. Distance(q, pi), plotted for k = 1 and k = 2]
Adapted from Piotr Indyk's slides
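A quick check of this formula, showing how raising k widens the gap between close and far pairs (values are illustrative):

```python
def collision_prob(dist, d_prime, k):
    # Each of the k sampled bits agrees with probability 1 - dist/d';
    # the samples are independent, so the probabilities multiply.
    return (1 - dist / d_prime) ** k

# Larger k sharpens the separation between "close" and "far" pairs:
print(collision_prob(2, 22, 1), collision_prob(10, 22, 1))   # ~0.91 vs ~0.55
print(collision_prob(2, 22, 3), collision_prob(10, 22, 3))   # ~0.75 vs ~0.16
```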
Preview
• General solution – Locality Sensitive Hashing
• Implementation for Hamming space
• Generalization to l2
Direct L2 solution
• New hashing function
• Still based on sampling
• Uses a mathematical trick: p-stable distributions for Lp distance
• Gaussian distribution for L2 distance
Central limit theorem
[Figure: a weighted sum of Gaussians is again a Gaussian]
• v1·X1 + v2·X2 + … + vn·Xn
• v1..vn = real numbers
• X1..Xn = independent, identically distributed (i.i.d.)
Central limit theorem
Σ_i v_i·X_i  =  (Σ_i |v_i|²)^{1/2} · X   (in distribution)
(the dot product of v with the random X_i equals the norm of v times a single Gaussian X)
Norm → Distance
Σ_i u_i·X_i − Σ_i v_i·X_i  =  (Σ_i |u_i − v_i|²)^{1/2} · X   (in distribution)
The two dot products use the same d random numbers X_i; their difference, for feature vectors u (vector 1) and v (vector 2), distributes as the L2 distance between u and v times a single Gaussian X.
The full hashing
h_{a,b}(v) = ⌊(a · v + b) / w⌋
• v – the features vector
• a – a vector of d random numbers
• b – a random phase, drawn from Random[0, w]
• w – the discretization step
[Figure: example with vectors [22 77 42] and [34 82 21]]
The full hashing – example
h_{a,b}(v) = ⌊(a · v + b) / w⌋
With a · v = 7944, phase b = 34 and step w = 100: ⌊(7944 + 34) / 100⌋ = 79, i.e. the projection lands in the bin [7900, 8000) of the discretized line … 7800, 7900, 8000, 8100, 8200 …
The full hashing – ingredients
h_{a,b}(v) = ⌊(a · v + b) / w⌋
• v – the d-dimensional features vector
• a – d numbers drawn i.i.d. from a p-stable distribution
• b – a random phase, drawn from Random[0, w]
• w – the discretization step
Generalization: p-stable distributions
• L2 ↔ Central Limit Theorem ↔ Gaussian (normal) distribution
• Lp, p = ε..2 ↔ Generalized Central Limit Theorem ↔ p-stable distribution (e.g. Cauchy for L1, Gaussian for L2)
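A minimal sketch of one hash table from this family for L2, assuming NumPy and Gaussian entries for a (the 2-stable case); w, k and the table layout are illustrative:

```python
import numpy as np
from collections import defaultdict

class L2LSH:
    """One p-stable hash table: key(v) = (floor((a_i . v + b_i) / w))_{i=1..k}."""
    def __init__(self, d, k=4, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(k, d))         # rows a_i ~ N(0, 1)^d, 2-stable for L2
        self.b = rng.uniform(0.0, w, size=k)     # random phases in [0, w)
        self.w = w
        self.table = defaultdict(list)

    def key(self, v):
        return tuple(np.floor((self.A @ v + self.b) / self.w).astype(int))

    def insert(self, v):
        self.table[self.key(v)].append(v)

    def candidates(self, q):
        return self.table[self.key(q)]
```

As in the Hamming case, L such tables would be built in practice and the union of the query's buckets verified exactly.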
P-stable summary
• Works for the (r, ε)-Nearest Neighbor problem
• Generalizes to 0 < p ≤ 2
• Improves query time

Latest results (reported by email by Alexander Andoni):
Query time = O(d·n^{1/(1+ε)}·log n)  →  O(d·n^{1/(1+ε)²}·log n)
Parameters selection
• 90% success probability → best query time performance (for Euclidean space)

Parameters selection …
• A single projection hits an ε-Nearest Neighbor with Pr = p1
• k projections hit an ε-Nearest Neighbor with Pr = p1^k
• All L hashings fail to collide with Pr = (1 − p1^k)^L
• To ensure a collision (e.g. with 1 − δ ≥ 90%): 1 − (1 − p1^k)^L ≥ 1 − δ
• For Euclidean space this gives L = log(δ) / log(1 − p1^k)
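A tiny calculation of L from this bound, with illustrative values of p1, k and δ:

```python
import math

def tables_needed(p1, k, delta):
    """Number of hash tables L so that at least one of them collides a true
    epsilon-near neighbor with probability >= 1 - delta: (1 - p1^k)^L <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

# Illustrative values: per-bit collision probability p1 = 0.9, k = 10 bits, delta = 0.1.
print(tables_needed(0.9, 10, 0.1))   # -> 6 tables
```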
… Parameters selection
[Figure: running time vs. k, split into candidate extraction and candidate verification]
Pros & Cons
Pros:
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sizes (sub-linear dependence)
• Predictable running time
Cons:
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
• Works best for Hamming distance (although it can be generalized to Euclidean space)
• In secondary storage, a linear scan is pretty much all we can do (for high dimensions)
• Requires the radius r to be fixed in advance
From Piotr Indyk's slides
Conclusion
• …but in the end, everything depends on your data set
• Try it at home
  – Visit: http://web.mit.edu/andoni/www/LSH/index.html
  – Email Alex Andoni: Andoni@mit.edu
  – Test it on your own data (C code, under Red Hat Linux)
LSH – Applications
• Searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun)
• Searching image databases (see the following)
• Image segmentation (see the following)
• Image classification ("Discriminant Adaptive Nearest Neighbor Classification", T. Hastie, R. Tibshirani)
• Texture classification (see the following)
• Clustering (see the following)
• Embedding and manifold learning (LLE, and many others)
• Compression – vector quantization
• Search engines ("LSH Forest: Self-Tuning Indexes for Similarity Search", M. Bawa, T. Condie, P. Ganesan)
• Genomics ("Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing", J. Buhler)
• In short: whenever k-nearest neighbors (KNN) are needed.
Motivation
• A variety of procedures in learning require KNN computation.
• KNN search is a computational bottleneck.
• LSH provides a fast approximate solution to the problem.
• LSH requires hash function construction and parameter tuning.
Outline
• Fast Pose Estimation with Parameter Sensitive Hashing, G. Shakhnarovich, P. Viola, and T. Darrell
  – Finding sensitive hash functions.
• Mean Shift Based Clustering in High Dimensions: A Texture Classification Example, B. Georgescu, I. Shimshoni, and P. Meer
  – Tuning LSH parameters.
  – The LSH data structure is used for algorithm speedups.
Fast Pose Estimation with Parameter Sensitive Hashing
G. Shakhnarovich, P. Viola, and T. Darrell
The problem: given an image x, what are the parameters θ in this image?
i.e. the angles of the joints, the orientation of the body, etc.
Ingredients
• Input query image with unknown angles (parameters).
• Database of human poses with known angles.
• Image feature extractor – edge detector.
• Distance metric in feature space: d_x.
• Distance metric in angle space:
  d_θ(θ1, θ2) = Σ_{i=1..m} (1 − cos(θ_{1i} − θ_{2i}))
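A direct transcription of this angle metric, assuming angles in radians and NumPy:

```python
import numpy as np

def angle_distance(theta1, theta2):
    """d(theta1, theta2) = sum_i (1 - cos(theta1_i - theta2_i));
    zero for identical poses, insensitive to 2*pi wrap-around."""
    t1, t2 = np.asarray(theta1), np.asarray(theta2)
    return float(np.sum(1.0 - np.cos(t1 - t2)))

# Two 3-joint poses (radians):
print(angle_distance([0.0, 1.0, 2.0], [0.1, 1.0, 2.5]))
```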
Example-based learning
• Construct a database of example images with their known angles.
• Given a query image, run your favorite feature extractor.
• Compute the KNN from the database.
• Use these KNNs to compute the average angles of the query.

Input: query → find the KNN in the database of examples → output: average angles of the KNN
The algorithm flow
Input query → feature extraction → processed query → PSH (finds neighbors in the database of examples) → LWR → output match

Pipeline stages: Feature Extraction → PSH → LWR
The image features
Image features are multiscale edge histograms:
• Edge orientations are quantized into the bins 0, π/4, π/2, 3π/4.
• Each feature counts the edges of one orientation inside one image window; the figure's example is φ_107(x), the count of π/4 edges in window A.
[Figure: two example windows A and B on the image]
PSH: the basic assumption
There are two metric spaces here: the feature space (d_x) and the parameter space (d_θ).
We want similarity to be measured in the angle space, whereas LSH works on the feature space.
• Assumption: the feature space is closely related to the parameter space.
Insight: Manifolds
• A manifold is a space in which every point has a neighborhood resembling Euclidean space.
• But the global structure may be complicated: curved.
• For example: lines are 1D manifolds, planes are 2D manifolds, etc.

Is this magic?
[Figure: the query q mapped between the feature space and the parameter space (angles)]
Parameter Sensitive Hashing (PSH)
The trick: estimate the performance of different hash functions on examples, and select those sensitive to d_θ.
The hash functions are applied in feature space, but the KNN are valid in angle space.
PSH as a classification problem
• Label pairs of examples with similar / non-similar angles.
• Define hash functions h on the feature space.
• Predict the labeling of similar / non-similar examples using h.
• Compare the labeling: if the labeling by h is good, accept h; else change h.
A pair of examples (x_i, θ_i), (x_j, θ_j) is labeled:
  y_ij = +1  if d_θ(θ_i, θ_j) ≤ r
  y_ij = −1  if d_θ(θ_i, θ_j) ≥ r·(1 + ε)
[Figure: example pairs labeled +1 / −1, with r = 0.25]
A binary hash function on the features:
  h_{φ,T}(x) = +1  if φ(x) > T
  h_{φ,T}(x) = −1  otherwise
Predict the labels:
  ŷ_h(x_i, x_j) = +1  if h_{φ,T}(x_i) = h_{φ,T}(x_j)
  ŷ_h(x_i, x_j) = −1  otherwise
h ,T will place both examples in the same
bin or separate them :

T
 (x)
Find the best T* that predicts the true
labeling with the probabilit ies constraints.
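A rough sketch of this selection step under simplifying assumptions: for a single feature φ we scan candidate thresholds and keep the one whose induced pair labeling best matches y_ij; the paper's probability constraints on p1 and p2 are approximated here by raw accuracy, so this only conveys the flavor of the procedure.

```python
import numpy as np

def best_threshold(phi, pairs, y):
    """phi: feature value per example; pairs: list of (i, j); y: +1/-1 label per pair.
    Returns the threshold T whose hash h(x) = sign(phi[x] - T) best predicts the labels."""
    best_T, best_acc = None, -1.0
    for T in np.unique(phi):
        h = np.where(phi > T, 1, -1)
        # Pair prediction: +1 if both examples fall on the same side of T.
        pred = np.array([1 if h[i] == h[j] else -1 for i, j in pairs])
        acc = np.mean(pred == y)
        if acc > best_acc:
            best_T, best_acc = T, acc
    return best_T, best_acc

phi = np.array([0.1, 0.2, 0.8, 0.9])
pairs = [(0, 1), (2, 3), (0, 3)]
y = np.array([1, 1, -1])            # first two pairs similar, last pair dissimilar
print(best_threshold(phi, pairs, y))
```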
Local Weighted Regression (LWR)
• Given a query image, PSH returns its KNNs.
• LWR uses the KNNs to compute a weighted estimate of the angles of the query:
  β* = arg min_β  Σ_{x_i ∈ N(x_0)}  d_θ( g(x_i, β), θ_i ) · K( d_X(x_i, x_0) )
  (K is a distance-based weight)
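A simplified sketch of the final averaging, assuming a zeroth-order (constant) local model: the neighbors' angles are combined with weights that decay with feature-space distance. The paper fits a full locally weighted regression, so this captures only the spirit of the step.

```python
import numpy as np

def weighted_angle_estimate(neighbor_angles, feature_dists, bandwidth=1.0):
    """Zeroth-order LWR: weighted circular mean of the neighbors' joint angles,
    with kernel weights K(d) = exp(-(d / bandwidth)^2) on feature-space distances."""
    w = np.exp(-(np.asarray(feature_dists) / bandwidth) ** 2)   # dist. weights
    A = np.asarray(neighbor_angles)                             # shape (k, m), radians
    # Average angles on the circle (avoids wrap-around problems).
    s = (w[:, None] * np.sin(A)).sum(axis=0)
    c = (w[:, None] * np.cos(A)).sum(axis=0)
    return np.arctan2(s, c)

angles = [[0.1, 1.0], [0.2, 1.2], [6.2, 0.9]]   # 3 neighbors, 2 joints each
print(weighted_angle_estimate(angles, feature_dists=[0.5, 0.7, 1.5]))
```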
Results
Synthetic data were generated:
• 13 angles: 1 for the rotation of the torso, 12 for the joints.
• 150,000 images.
• Nuisance parameters added: clothing, illumination, facial expression.
• 1,775,000 example pairs.
• 137 meaningful features were selected out of 5,123 (how?).
  Recall: P1 is the probability of a positive hash, P2 the probability of a bad hash, B the max number of points in a bucket.
• 18-bit hash functions (k), 150 hash tables (L).
• Without the selection, 40 bits and 1,000 hash tables would have been needed.
• Test on 1,000 synthetic examples: PSH searched only 3.4% of the data per query.
Results – real data
• 800 images.
• Processed by a segmentation algorithm.
• 1.3% of the data were searched.
Results – real data
Interesting mismatches
Fast pose estimation – summary
• A fast way to compute the angles of a human body figure.
• Moving from one representation space to another.
• Training a sensitive hash function.
• Smart averaging of the KNN.
Food for Thought
• The basic assumption may be problematic
(distance metric, representations).
• The training set should be dense.
• Texture and clutter.
• General: some features are more important
than others and should be weighted.
Food for Thought: Point Location in Different Spheres (PLDS)
• Given: n spheres in R^d, centered at P = {p1, …, pn}, with radii {r1, …, rn}.
• Goal: given a query q, preprocess the points in P so as to find a point pi whose sphere 'covers' the query q.
[Figure: a sphere of radius ri around pi covering q]
Courtesy of Mohamad Hegaze
Mean-Shift Based Clustering in High Dimensions: A
Texture Classification Example
B. Georgescu, I. Shimshoni, and P. Meer
Motivation:
• Clustering high dimensional data by using local
density measurements (e.g. feature space).
• Statistical curse of dimensionality:
sparseness of the data.
• Computational curse of dimensionality:
expensive range queries.
• LSH parameters should be adjusted for optimal
performance.
Outline
• Mean-shift in a nutshell + examples.
Our scope:
• Mean-shift in high dimensions – using LSH.
• Speedups:
  1. Finding optimal LSH parameters.
  2. Data-driven partitions into buckets.
  3. Additional speedup by using the LSH data structure.
Roadmap: Mean-shift → LSH → LSH: optimal K, L → LSH: data partition → LSH: data structure
Mean-Shift in a Nutshell
[Figure: a point and the bandwidth window around it]
KNN in mean-shift
Bandwidth should be inversely proportional to the density in the region:
• high density – small bandwidth
• low density – large bandwidth
The bandwidth is based on the k-th nearest neighbor of the point.
Adaptive mean-shift vs. non-adaptive.
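A compact sketch of adaptive mean-shift with the bandwidth taken from the k-th nearest neighbor, assuming NumPy and a flat kernel; the paper finds the neighbors with LSH, here a brute-force search stands in.

```python
import numpy as np

def adaptive_bandwidths(X, k):
    """Per-point bandwidth h_i = distance from x_i to its k-th nearest neighbor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(D, axis=1)[:, k]          # column 0 is the point itself

def mean_shift_step(x, X, h):
    """One adaptive mean-shift step with a flat kernel of radius h."""
    inside = np.linalg.norm(X - x, axis=1) <= h
    return X[inside].mean(axis=0)            # move x to the mean of its window

X = np.random.rand(200, 5)                   # 200 points in 5 dimensions
h = adaptive_bandwidths(X, k=10)
x = X[0].copy()
for _ in range(20):                          # iterate toward the local mode
    x = mean_shift_step(x, X, h[0])
```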
Image segmentation algorithm
1. Input: data in 5D (3 color + 2 spatial x, y) or 3D (1 gray + 2 spatial x, y).
2. Resolution controlled by the bandwidths: hs (spatial), hr (color).
3. Apply filtering.
Mean-shift: A Robust Approach Towards Feature Space Analysis, D. Comaniciu et al., TPAMI '02
Image segmentation algorithm
Filtering: each pixel is replaced by the value of its nearest mode.
[Figure: mean-shift trajectories; original, filtered, and segmented images]
Filtering examples
[Figure: original and filtered squirrel; original and filtered baboon]
Segmentation examples
[Figure: segmentation results]
Mean-shift: A Robust Approach Towards Feature Space Analysis, D. Comaniciu et al., TPAMI '02
Mean-shift in high dimensions
• Statistical curse of dimensionality: sparseness of the data → variable bandwidth.
• Computational curse of dimensionality: expensive range queries → implemented with LSH.
LSH-based data structure
• Choose L random partitions; each partition includes K pairs (d_k, v_k).
• For each point we check: x_{i, d_k} ≤ v_k.
• This partitions the data into cells (see the sketch below).
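A sketch of this structure under simple assumptions: each of the L partitions stores K random (dimension, cut value) pairs, and a point's cell key is the Boolean vector of its K comparisons; the class name and parameters are illustrative.

```python
import numpy as np
from collections import defaultdict

class RandomPartitionLSH:
    def __init__(self, X, K=8, L=4, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        self.partitions = []                 # L partitions, each = K pairs (d_k, v_k)
        self.tables = []
        for _ in range(L):
            dims = rng.integers(0, d, size=K)
            vals = rng.uniform(X.min(axis=0)[dims], X.max(axis=0)[dims])
            self.partitions.append((dims, vals))
            table = defaultdict(list)
            for i, x in enumerate(X):
                table[self._key(x, dims, vals)].append(i)
            self.tables.append(table)

    @staticmethod
    def _key(x, dims, vals):
        return tuple(bool(x[dk] <= vk) for dk, vk in zip(dims, vals))   # x_{d_k} <= v_k

    def neighbors(self, q):
        """Union of the query's cells over the L partitions (the candidate set)."""
        cand = set()
        for (dims, vals), table in zip(self.partitions, self.tables):
            cand.update(table[self._key(q, dims, vals)])
        return cand

X = np.random.rand(500, 16)
lsh = RandomPartitionLSH(X, K=8, L=4)
print(len(lsh.neighbors(X[0])))
```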
Choosing the optimal K and L
• For a query q, compute the smallest number of distances to points in its buckets.
• N_{C_l} ≅ n / (K/d + 1)^d   (expected number of points in a single cell)
• N_{C_∪} ≤ L · N_{C_l}   (points in the union of the query's L cells)
• As L increases, C_∪ increases but C_∩ decreases.
• C_∩ determines the resolution of the data structure.
• Large K → a smaller number of points in a cell.
• If L is too small, points might be missed; if L is too big, C_∪ might include extra points.
Choosing optimal K and L
• Determine accurately the KNN distance (bandwidth) for m randomly selected data points.
• Choose an error threshold ε.
• The optimal K and L should keep the approximate distance within that threshold.
Choosing optimal K and L
• For each K, estimate the error for L.
• In one run over all L's: find the minimal L satisfying the constraint, L(K).
• Minimize the running time t(K, L(K)) to find the minimum.
[Figure: approximation error for K, L; L(K) for ε = 0.05; running time t[K, L(K)]]
Data-driven partitions
• In the original LSH, cut values are chosen at random in the range of the data.
• Suggestion: randomly select a point from the data and use one of its coordinates as the cut value.
[Figure: points-per-bucket distribution, uniform vs. data-driven cuts]
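A small sketch contrasting the two choices of cut value; the function names and the skewed test data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 20))      # skewed data: uniform cuts waste buckets

def uniform_cut(X, dim):
    # Original LSH: cut value drawn uniformly from the data range in that dimension.
    return rng.uniform(X[:, dim].min(), X[:, dim].max())

def data_driven_cut(X, dim):
    # Suggestion: take the coordinate of a randomly selected data point,
    # so cuts follow the data distribution and buckets are more balanced.
    return X[rng.integers(len(X)), dim]
```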
Additional speedup
• Assume that all points in C_∩ will converge to the same mode (C_∩ acts as a kind of aggregate).
Speedup results
65,536 points, 1,638 points sampled, k = 100
Food for thought
[Figure: behavior in low dimension vs. high dimension]
A thought for food…
• Choose K, L by sample learning, or take the traditional values.
• Can one estimate K, L without sampling?
• Does it help to know the data dimensionality or the data manifold?
• Intuitively: the dimensionality implies the number of hash functions needed.
• The catch: efficient dimensionality learning requires KNN.
15:30 cookies…..
Summary
• LSH trades some accuracy for a gain in complexity.
• Applications that involve massive data in high dimensions require LSH's fast performance.
• Extension of LSH to different spaces (PSH).
• Learning the LSH parameters and hash functions for different applications.
Conclusion
• …but in the end, everything depends on your data set
• Try it at home
– Visit:
http://web.mit.edu/andoni/www/LSH/index.html
– Email Alex Andoni
Andoni@mit.edu
– Test over your own data
(C code under Red Hat Linux )
Thanks
• Ilan Shimshoni (Haifa)
• Mohamad Hegaze (Weizmann)
• Alex Andoni (MIT)
• Mica and Denis