SASH: Spatial Approximation Sample Hierarchy
  Authors: Michael E. Houle, Jun Sakuma

SASH features
  Indexes data in high-dimensional space
  Fast construction of the index (N log N)
  Fast lookups of k approximate nearest neighbors (k log N)

Drawbacks of other methods
  Slow construction
  Slow lookups
  Require a k-NN index to construct a k-NN index
  Reduce to grid searches or sequential search
  But they may allow for true nearest neighbor queries

SASH construction
  Two-phase process
  Phase 1: divide the set into a hierarchy of subsets
  Phase 2: link elements of the hierarchy together

SASH construction: phase 1
  Start with a set of points in a metric space
  Divide the set in half randomly
  Repeatedly divide the “second half” of the set until there is one element remaining
  This hierarchy of sets reminds me of a skip list

SASH subsets
  The partitioning process yields roughly log N sets of size 2^k, 0 ≤ k ≤ log N
  Label the sets S_0 (the set containing one element, namely the root) through S_h (the largest set, containing approximately N/2 elements)

SASH appearance
  A SASH is a hierarchy of sets of size 2^k, 0 ≤ k ≤ h, with directed edges from the set of size 2^(k-1) to the set of size 2^k
  A SASH is generally not a tree, but it has some of the flavor of a binary tree, with edges from sets of a certain size to sets that are double that size
  A SASH usually has many more edges than a tree

SASH construction: phase 2
  The SASH is constructed inductively by first setting SASH_0 = S_0
  For 1 ≤ i ≤ h, SASH_{i-1} is a partial SASH on the set S_0 ∪ S_1 ∪ … ∪ S_{i-1}
  SASH_i is constructed by starting with SASH_{i-1} and producing new directed edges from elements in S_{i-1} to elements in S_i
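As a rough illustration of phase 1 above, the random halving can be sketched in Python. This is a minimal sketch, not the authors' code: the function name `partition_levels` and the `seed` parameter are assumptions made for the example, and the larger half is kept as the lower level so that 15 points split cleanly into sets of size 8, 4, 2, and 1.

```python
import random

def partition_levels(points, seed=0):
    """Split points into levels S_0..S_h by repeated random halving.

    Hypothetical helper for illustration; returns a list of lists,
    with levels[0] holding the single root element.
    """
    remaining = list(points)
    if not remaining:
        return []
    rng = random.Random(seed)   # seeded so the example is repeatable
    rng.shuffle(remaining)      # a random order makes each split a random half
    levels = []
    while len(remaining) > 1:
        half = len(remaining) // 2
        levels.append(remaining[half:])   # this half becomes the next level up from the bottom
        remaining = remaining[:half]      # the other half is divided again
    levels.append(remaining)              # one element left: the root, S_0
    levels.reverse()                      # order the levels from S_0 to S_h
    return levels

levels = partition_levels(range(15))
print([len(s) for s in levels])  # [1, 2, 4, 8]
```

With 15 points the level sizes are exactly 2^k for k = 0..3; for other values of N the sizes are only approximately powers of two, which is consistent with the "roughly" in the slides.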
SASH construction: phase 2
  Let SASH_0 be the root, S_0
  For 1 ≤ i ≤ h, assume SASH_{i-1} exists, then
    For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
    Once all c in S_i have linked to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent
    If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents

SASH parameters: P and C
  In practice, P is small and C is at least twice P (their experiments use C = 4P)
  It is likely that objects will have at least one parent that links to them, and if C > 2P, all orphans can eventually find parents
  Children link to “nearby” parents, and parents then link to “nearby” children
  The symmetric use of “nearby” gives good results, even though the relation isn’t really symmetric

A completed SASH
  Example on the real line with P=2 and C=4
  Randomly divide the set in half until reaching one point
  The sets S_i

SASH construction example
  Red nodes are in a completed SASH
  Light blue nodes are in the process of being added to a SASH
  Black nodes have not been processed
  Links from children to parents are green, and links from parents to children are red
SASH_0: Construction (P=2, C=4)
SASH_0: Complete
SASH_1: Construction (P=2, C=4)
SASH_1: Link children to parents
SASH_1: Link parents to children
SASH_1: Complete
SASH_2: Construction
SASH_2: Link children to parents
SASH_2: Link parents to children
SASH_2: Complete
SASH_3: Construction
SASH_3: Link children to parents
SASH_3: Link parents to children
  Some of the green arrows were not reversed, because parents only link to their C=4 closest children
  The green arrows are not part of the completed SASH
SASH_3: Complete
SASH_4: Construction (P=2, C=4)
SASH_4: Link children to parents
SASH_4: Link parents to children
  The green links were not returned to the children
  The three purple nodes are orphans
  Link them by doubling P as needed
  Orphans link to P=4 parents
  Parents link to up to C=4 children
  Two orphans were linked, and one remains
  Link the final orphan to P=8 parents
  Link parents to the orphan
  The final green arrows are removed
SASH_4: Complete

What am I hiding from you about this algorithm?
  For 1 ≤ i ≤ h, assume SASH_{i-1} exists, then
    For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
    Once all c in S_i have linked to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent
    If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents
  This part can be expensive: for each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
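To make the linking step concrete, here is a minimal Python sketch of linking a single level. It makes several assumptions that are not in the slides: the name `link_level` is hypothetical, points are distinct numbers on the real line, and possible parents are found by brute force, which is exactly the expensive step flagged above. Orphans are also handled one at a time, each attaching to the nearest parent with room, rather than in the batch rounds shown in the example slides.

```python
def link_level(parents, children, P=2, C=4, dist=lambda a, b: abs(a - b)):
    """Link one level of children to the level above (illustrative only).

    Returns {parent: [children]}. Assumes distinct, hashable points.
    """
    # Step 1: each child nominates its P closest parents (brute force here).
    requests = {p: [] for p in parents}
    for c in children:
        for p in sorted(parents, key=lambda p: dist(p, c))[:P]:
            requests[p].append(c)

    # Step 2: each parent keeps only the C closest children that chose it.
    edges = {p: sorted(reqs, key=lambda c: dist(p, c))[:C]
             for p, reqs in requests.items()}

    # Step 3: orphans retry with a doubled number of candidate parents
    # until one of the candidates still has room for another child.
    linked = {c for kept in edges.values() for c in kept}
    for c in children:
        if c in linked:
            continue
        want = 2 * P
        while True:
            nearest = sorted(parents, key=lambda p: dist(p, c))[:want]
            with_room = [p for p in nearest if len(edges[p]) < C]
            if with_room:
                edges[with_room[0]].append(c)  # closest parent that has room
                break
            if want >= len(parents):
                raise ValueError("no parent has room; C is too small")
            want *= 2
    return edges

edges = link_level([0.0, 10.0], [1.0, 2.0, 3.0, 9.0], P=1, C=2)
print(edges)  # {0.0: [1.0, 2.0], 10.0: [9.0, 3.0]}
```

In the usage line, the child at 3.0 prefers the parent at 0.0 but is crowded out by two closer children, so it ends up attached to the farther parent at 10.0.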
Cost of this operation
  For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}
  There are N/2 points in S_h and N/4 points in S_{h-1}, so comparing every pair takes N^2/8 checks
  Or we could build an index, like a quadtree, and do a k-NN search directly
  This is expensive, and is the catch-22 of most k-NN algorithms
  SASH uses an N log N method

Avoiding k-NN search in SASH construction
  Instead, perform a partial search query on the new point using the partially constructed SASH
  Start with the root as the current set
  While not at the bottom of the partial SASH, let the current set equal the P children of the current set that are closest to the new point

Approximate parent search without a k-NN graph
  Start at the root
  At each level, search the children of the current set and keep the 2 children closest to the query point
  The 2 nodes kept at the last level are the approximate parents of the query point

Important points
  No k-NN index needed
  Log N search time for each element
  Up to P objects are retained at each level, and each of those has up to C children
  Only those PC children are searched at each level to find the P closest objects to send down to the next level

SASH issues
  When a large number of children are clustered near a few parents, some children will be orphaned and have parents that are farther away
  A SASH is mostly static
  Some new nodes can be added, but clusters need to be filtered up through the hierarchy during the construction process

Queries with a completed SASH
  Similar to the process described above to get approximate parents
  Two types of searches are described
    Uniform: keep the same number of children at each level
    Geometric: keep a small number of nodes at the levels near the root, then increase it at lower levels
  The big difference between constructing the SASH and using it for queries is that in the construction process, only the nodes kept at the final level of the partial SASH are used
  In a query on a completed SASH, all of the intermediate points visited can be used in the final k-ANN search

Geometric search
  Keeping too few points near the root may lead to bad results, so instead of starting near 1, the authors found that keeping 0.5*PC nodes (4 in the case of P=2, C=4) at the smaller levels sufficed to keep the search broad enough

Search process
  Let k_i be the number of elements we will keep at level i of the SASH
  Let U_0 = S_0, the root
  For 1 ≤ i ≤ h
    Find all children of elements in U_{i-1}
    Let U_i be the k_i children of U_{i-1} that are closest to the query point
  After the sets U_0, …, U_h have been determined, let U = U_0 ∪ U_1 ∪ … ∪ U_h
  Then the final result is the k closest points in U to the query point

Search complexity
  Each U_i has at most k elements, and each of those has at most C children, so we perform at most Ck distance calculations at each of log N levels, for k log N time
  Once U has been determined, we perform a true k-NN search on a set of size k log N

Use of transitivity when searching
  We follow links from parents to children under the assumption that children are close to parents
  We keep only the objects closest to the query at each level
  This gives good results in practice, but may fail in pathological cases

Pathological example of failure of transitivity
  Pathological case on the real line
  Assume the rest of the SASH is to the left or the right of the chains shown (following the dotted arrows)
  The query will return two of the nodes visited at the top, even though there are points closer to the query, Q

Pathological example of failure of transitivity when k=2
  [Figure: nodes R, S, T, A, B and the query point Q]
  A search for Q first finds S and T
  T’s children are closer to Q than those of S
  The search continues below T
  R and S are returned as the k=2 nearest neighbors of Q
  However, A and B are the true k=2 nearest neighbors of Q

SASH comparison to MTree
  MTree (Ciaccia, Patella, Zezula): deals with overlapping objects; uses a balanced hierarchy with buckets and spheres as regions
  SASH-4: P=4, C=4P
  MEDLINE: 1,055,073 objects with 1,101,003 attributes, representing keywords found in medical abstracts; average of 75 nonzero attributes per object
  SSeq: sequential search on a randomly selected subset of the data

Complexity comparison

Speed vs. accuracy

Internal SASH comparisons
  BactORF: bacterial protein sequences; 385,039 objects with 40,000 attributes; sparse, with 125 nonzero attributes per object
  VidFrame: video; 9,000,000 objects with 32 attributes, densely nonzero
  SASH P=3,4,5,8,16; C=4P

Boosted SASH

Different dataset sizes

Conclusion
  SASH indexes high-dimensional spaces
  Efficient construction and query times
  Uses approximate similarity, and a generalization of equivalence relations (symmetry and a weak form of transitivity), to get good results
  There is a large body of work in fuzzy logic on transitivity and approximate similarity
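The search process described earlier (build U_0 through U_h, then take the k closest points in their union) can be sketched in Python. This is a sketch under stated assumptions, not the authors' implementation: `sash_query`, the `children` dictionary, and the `quota` callback (which returns k_i for level i) are all hypothetical names, and the tiny hierarchy on the real line is hand-built for the example rather than produced by SASH construction.

```python
def sash_query(root, children, q, k, quota, dist=lambda a, b: abs(a - b)):
    """Approximate k-NN query over a parent -> children mapping (illustrative).

    quota(i) gives k_i, the number of nodes kept at level i.
    """
    current = [root]      # U_0 is the root
    visited = {root}      # U, the union of all U_i
    level = 1
    while True:
        # All children of the nodes kept at the previous level.
        candidates = {c for p in current for c in children.get(p, [])}
        if not candidates:
            break         # reached the bottom of the hierarchy
        # U_i: the k_i candidates closest to the query point.
        current = sorted(candidates, key=lambda c: dist(c, q))[:quota(level)]
        visited.update(current)
        level += 1
    # Final step: an exact k-NN search over the small set U.
    return sorted(visited, key=lambda x: dist(x, q))[:k]

# Hand-built hierarchy on the real line; a node may have several parents.
children = {
    8: [4, 12],
    4: [2, 6, 10], 12: [6, 10, 14],
    2: [1, 3], 6: [5, 7], 10: [9, 11], 14: [13, 15],
}
result = sash_query(8, children, q=6.4, k=2, quota=lambda i: 2)
print(result)  # [6, 7]
```

The usage line is a uniform search, since `quota` returns the same value at every level. A geometric search would pass a `quota` that grows with the level, with a floor such as the 0.5*PC mentioned in the slides; the exact growth schedule is not given there, so it is left out of the sketch.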