SASH
Spatial Approximation Sample
Hierarchy
Authors: Michael E. Houle, Jun Sakuma
SASH features


Index data in high-dimensional space
Fast construction of the index


N log N
Fast lookups of k approximate nearest
neighbors

k log N
Drawbacks of other methods

Slow construction


Slow lookups


Require a k-NN index to construct a k-NN
index
Reduce to grid searches or sequential
search
But they may allow for true nearest
neighbor queries
SASH construction

Two-phase process


Phase 1: divide the set into a hierarchy of
subsets
Phase 2: link elements of the hierarchy
together
SASH construction: phase 1




Start with a set of points in a metric
space
Divide the set in half randomly
Repeatedly divide the “second half” of
the set until there is one element
remaining
This hierarchy of sets reminds me of a
skip list
SASH subsets


The partitioning process yields roughly log N
sets of size 2^k, 0 ≤ k ≤ log N
Label the sets S_0 (for the set containing
one element, namely the root) through
S_h (for the largest set, containing
approximately N/2 elements)
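A minimal sketch of the random halving, from the largest set down to the root. The function name partition_levels and the seed parameter are made up for illustration; the slides do not give code.

```python
import random

def partition_levels(points, seed=0):
    """Split points into levels by repeated random halving.

    Returns a list whose first entry holds the single root and whose
    last entry holds roughly half of all points.
    """
    rng = random.Random(seed)
    remaining = list(points)
    rng.shuffle(remaining)            # random order, so every split is random
    levels = []
    while len(remaining) > 1:
        half = len(remaining) // 2
        levels.append(remaining[half:])   # this part becomes one level
        remaining = remaining[:half]      # the rest is divided again
    levels.append(remaining)          # the last element left is the root
    levels.reverse()                  # root first, largest set last
    return levels

# For 15 points the level sizes are [1, 2, 4, 8].
sizes = [len(s) for s in partition_levels(range(15))]
```

When N is not of the form 2^(h+1) - 1 the sizes are only approximately powers of two, which matches the "roughly" in the slide.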
SASH appearance



A SASH is a hierarchy of sets of size 2^k,
0 ≤ k ≤ h, with directed edges from the
set of size 2^(k-1) to the set of size 2^k
A SASH is generally not a tree, but it
has some of the flavor of a binary tree
with edges from sets of a certain size to
sets that are double that size.
A SASH usually has many more edges.
SASH construction: phase 2



The SASH is constructed inductively by
first setting SASH_0 = S_0.
For 1 ≤ i ≤ h, SASH_{i-1} is a partial
SASH on the set S_0 ∪ S_1 ∪ … ∪ S_{i-1}
SASH_i is constructed by starting with
SASH_{i-1} and producing new directed
edges from elements in S_{i-1} to elements
in S_i.
SASH construction: phase 2


Let SASH_0 be the root, S_0
For 1 ≤ i ≤ h, assume SASH_{i-1} exists, then

For each c in S_i, use SASH_{i-1} to find P possible
parents of c in S_{i-1}
Once all c in S_i link to possible parents, each p in
S_{i-1} links to the C closest children that chose it as a
possible parent
If some orphan objects in S_i have no parents
linking to them, repeat the above, allowing them
to try linking to more parents.
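The linking step for one level can be sketched as below. It assumes the candidate parents of each child have already been found; link_level, candidates and dist are hypothetical names used only for this illustration.

```python
from collections import defaultdict

def link_level(candidates, dist, C):
    """Link parents to children for one level.

    candidates maps each child to the parents it proposed.
    Returns (edges, orphans): edges maps each parent to at most C of
    the closest children that proposed it; orphans lists the children
    that no parent accepted.
    """
    requests = defaultdict(list)
    for child, parents in candidates.items():
        for p in parents:
            requests[p].append(child)       # child asks p to be its parent
    edges = {}
    for p, kids in requests.items():
        # each parent keeps only its C closest requesting children
        edges[p] = sorted(kids, key=lambda c: dist(p, c))[:C]
    adopted = {c for kids in edges.values() for c in kids}
    orphans = [c for c in candidates if c not in adopted]
    return edges, orphans

# Real-line example: parents 0 and 10, each child proposes one parent, C=2.
edges, orphans = link_level({1: [0], 2: [0], 3: [0], 9: [10]},
                            lambda a, b: abs(a - b), C=2)
# edges == {0: [1, 2], 10: [9]} and orphans == [3]
```

The orphans returned here are the objects that would retry with a larger set of candidate parents.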
SASH parameters: P and C

In practice, P is small, and C is at
least twice P (their experiments use C=4P)


It is likely that objects will have at least one
parent that links to them, and if C > 2P, all
orphans can eventually find parents
Children link to “nearby” parents, and parents
then link to “nearby” children

The symmetric use of “nearby” gives good results,
even though the relation isn’t really symmetric.
A Completed SASH
Example on the real line with
P=2 and C=4
Randomly divide the set in
half until reaching one point
The sets Si
SASH Construction Example


Red nodes are in a completed SASH.
Light blue nodes are in the process of
being added to a SASH. Black nodes
have not been processed.
Links from children to parents are
green, and links from parents to
children are red.
SASH0:Construction P=2, C=4
SASH0:Complete
SASH1:Construction P=2, C=4
SASH1:Link children to parents
SASH1:Link parents to children
SASH1:Complete
SASH2:Construction
SASH2:Link children to parents
SASH2:Link parents to children
SASH2:Complete
SASH3:Construction
SASH3:Link children to parents
SASH3:Link parents to children
Some of the green arrows
were not reversed
Because parents only link to
their C=4 closest children
The green arrows are not
parts of the completed SASH
SASH3:Complete
SASH4:Construction P=2, C=4
SASH4:Link children to parents
SASH4:Link parents to children
The green links were not
returned to the children
The three purple nodes are
orphans
Link them by doubling P as
needed.
Orphans link to P=4 parents
Parents link to up to C=4
children
Two orphans were linked, and
one remains
Link the final orphan to P=8
parents
Link parents to the orphan
The final green arrows are
removed
SASH4:Complete
What am I hiding from you
about this algorithm?

This part can be expensive

Cost of this operation

For each c in S_i, use SASH_{i-1} to find P
possible parents of c in S_{i-1}

There are N/2 points in S_h, and N/4 points
in S_{h-1}, so brute force needs N^2/8 distance checks
Or we could build an index, like a quadtree,
and do a k-NN search directly
This is expensive, and is the catch-22 of
most k-NN algorithms
SASH uses an N log N method
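The two costs can be put side by side. The right-hand bound is a rough estimate only: it assumes each of the N inserted points descends at most log N levels and compares against at most PC children per level, and it ignores the extra work for orphans.

```latex
\[
|S_h| \cdot |S_{h-1}| \approx \frac{N}{2} \cdot \frac{N}{4} = \frac{N^2}{8}
\qquad \text{versus} \qquad
N \cdot PC \cdot \log_2 N
\]
```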
Avoiding k-NN search in SASH
construction

Instead, perform a partial search query
on the new point using the partially
constructed SASH


Start with the root as the current set
While not at the bottom of the partial
SASH, let the current set equal the P
children of the current set that are closest
to the new point
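The descent described above can be sketched as follows. The partial SASH is represented here as a plain dict from each node to its list of children; that representation and the names approximate_parents and depth are assumptions made for this sketch.

```python
def approximate_parents(query, root, children, dist, P, depth):
    """Walk down `depth` levels from the root, keeping at each level
    the P nodes closest to the query. The nodes kept at the last level
    are the approximate parents of the query point."""
    current = [root]
    for _ in range(depth):
        # gather the children of everything kept at the previous level
        pool = {c for node in current for c in children.get(node, [])}
        if not pool:
            break                     # nothing below: stop early
        current = sorted(pool, key=lambda n: dist(n, query))[:P]
    return current

# Real-line example with a small hand-built hierarchy.
children = {8: [4, 12], 4: [2, 6], 12: [10, 14]}
parents = approximate_parents(5.5, 8, children,
                              lambda a, b: abs(a - b), P=2, depth=2)
# parents == [6, 2]
```

No nearest-neighbor index is consulted anywhere: only the links already present in the partial SASH are followed.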
Approximate parent search
without a k-NN graph
Start at the root
Search children
Keep the 2 children closest to
the query point
Search children
Keep the 2 children closest to
the query point
Search children
Keep the 2 closest children to
the query point
These are the approximate
parents of the query point
Important points:


No k-NN index needed
Log N search time for each element


Up to P objects retained at each level, and
each of those has up to C children
Only those PC children are searched at
each level to find the P closest objects to
send down to the next level.
SASH Issues


When a large number of children are
clustered near a few parents, some
children will be orphaned and have
parents that are farther away
A SASH is mostly static

Some new nodes can be added, but
clusters need to be filtered up through the
hierarchy during the construction process
Queries with a completed
SASH


Similar to the process described above
to get approximate parents
Two types of searches described


Uniform: Keep the same number of
children at each level
Geometric: Start the search with a small
number of nodes kept at each level, then
increase it
Queries with a completed
SASH


The big difference between constructing
the SASH and using it for queries is that
in the construction process, only the
nodes in the final partial SASH are
used.
In a query on a completed SASH, all of
the intermediate points visited can be
used in the final k-ANN search
Geometric search

Keeping too few points near the root
may lead to bad results, so instead of
starting near 1, the authors found that
keeping at least 0.5*PC nodes (4 in the
case of P=2, C=4) at the levels near the
root sufficed to keep the search broad enough
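One possible way to compute the number of nodes kept per level is sketched below. The slides give only the floor of 0.5*PC; the growth rule used here (the quota rises geometrically with depth until it reaches k at the bottom level) is an assumption made for illustration, and geometric_quotas is a made-up name.

```python
import math

def geometric_quotas(k, h, n, P, C):
    """Return the quota for levels 1..h (level h is the bottom).

    Assumed rule: quota_i = max(k ** (1 - (h - i) / log2(n)), 0.5 * P * C),
    rounded up. Requires n > 1.
    """
    floor = math.ceil(0.5 * P * C)    # never keep fewer than 0.5*PC nodes
    log_n = math.log2(n)
    quotas = []
    for i in range(1, h + 1):
        grown = math.ceil(k ** (1 - (h - i) / log_n))
        quotas.append(max(grown, floor))
    return quotas

quotas = geometric_quotas(k=16, h=4, n=16, P=2, C=4)
# the first quota is the floor (4) and the last one is k (16)
```

A uniform search would simply use the same quota k at every level.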
Search process



Let k_i be the number of elements we
will keep at level i of the SASH
Let U_0 = S_0, the root
For 1 ≤ i ≤ h


Find all children of elements in U_{i-1}
Let U_i be the k_i children of U_{i-1} that are
closest to the query point
Search process


After the sets U_0, …, U_h have been
determined, let U = U_0 ∪ U_1 ∪ … ∪ U_h
Then the final result is the k closest
points in U to the query point
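The search process can be sketched as below. As before, the SASH is represented as a dict from each node to its list of children, and the per-level quotas are passed in as a list; sash_query and both representations are assumptions made for this sketch.

```python
def sash_query(query, root, children, dist, k, quotas):
    """Approximate k-nearest-neighbor query.

    quotas[i] is the number of nodes kept at level i+1. Every node kept
    at any level stays a candidate for the final answer.
    """
    current = [root]
    visited = [root]                  # the union of all kept sets
    for quota in quotas:
        pool = {c for node in current for c in children.get(node, [])}
        if not pool:
            break
        current = sorted(pool, key=lambda n: dist(n, query))[:quota]
        visited.extend(current)
    # exact k-NN, but only over the small set of visited nodes
    return sorted(set(visited), key=lambda n: dist(n, query))[:k]

children = {8: [4, 12], 4: [2, 6], 12: [10, 14]}
result = sash_query(5.5, 8, children, lambda a, b: abs(a - b),
                    k=2, quotas=[2, 2])
# result == [6, 4]: node 4 was visited at an intermediate level
```

Unlike the parent search used during construction, the answer here can include nodes from any level, which is why 4 appears in the result.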
Search complexity


Each U_i has at most k elements, and
each of those has at most C children, so
we perform at most Ck distance
calculations at each of log N levels, for
O(k log N) time when C is a constant
Once U has been determined, we
perform a true k-NN search on a set of
size at most k log N
Use of transitivity when
searching



We follow links from parents to children
under the assumption that children are
close to parents
We keep only the objects closest to the
query at each level
This gives good results in practice, but
may fail in pathological cases
Pathological example of failure
of transitivity



Pathological case on the real line
Assume the rest of the SASH is to the
left or the right of the chains shown
(following the dotted arrows)
The query will return two of the nodes
visited at the top, even though there
are points closer to the query, Q
Pathological example of failure
of transitivity when k=2
[Figure: nodes R, S, T, A and B, and the query point Q, on the real line]
A search for Q first finds S and
T
T’s children are closer to Q
than those of S
The search continues below T
R and S are returned as the
k=2 nearest neighbors of Q
However, A and B are the true
k=2 nearest neighbors of Q
SASH Comparison to MTree




MTree (Ciaccia, Patella, Zezula) – Deals with
overlapping objects, uses a balanced
hierarchy with buckets and spheres as
regions
SASH-4: P=4, C=4P
MEDLINE – 1,055,073 objects with 1,101,003
attributes. Represents keywords found in
medical abstracts. Average 75 nonzero
attributes per object
SSeq = sequential search on a randomly
selected subset of the data
Complexity Comparison
Speed vs. accuracy
Internal SASH Comparisons


BactORF – Bacterial protein sequences;
385,039 objects with 40,000 attributes
– Sparse: 125 nonzero attributes per
object
VidFrame – Video; 9,000,000 objects
with 32 attributes, densely nonzero
SASH P=3,4,5,8,16; C=4P
Boosted SASH
Different dataset sizes
Conclusion




SASH indexes high-dimensional spaces
Efficient construction and query times
Uses approximate similarity, and a
generalization of equivalence relations
(symmetry and a weak form of
transitivity) to get good results
Large body of work in fuzzy logic on
transitivity and approximate similarity