Nearest Neighbor Model, k-d tree, Locality sensitive hashing

NN(k, xq)

Example document-term matrix (word counts per article):

          "Baseball"  "Congress"  "Police"  "House"  "Gate"  "Money"  ...
Article1       0           0          2        7        9       12    ...
Article2       0           0         41        6        7        2    ...
Article3       2           1          0        0        4        9    ...
Article4       0           6          0        0        0       14    ...
Article5       0           1          0        1        4       12    ...
Article6       8           2          0        2        1        5    ...
...           ...         ...        ...      ...      ...     ...    ...

Dimensions
• Works well in low-dimensional spaces
• "Curse of dimensionality": as the dimension increases, the space becomes large very fast and the available data become sparse

k-d tree: Building
• A balanced binary tree over data with k dimensions
• Building the tree:
  • Given a set of examples, the root node is chosen to be the median of those examples along the ith dimension.
  • Half of the examples go to the left branch and the other half go to the right branch of the tree.
  • Repeat recursively until fewer than two examples are left.
• How to choose which dimension to split on?
  • j mod k for the jth level of the tree (any dimension can be used for a split several times)
  • Split on the dimension with the widest spread of values

[Figure: recursive splitting of a 2-D cell — first the whole cell is split by the red line, the subcells created by red are split further by green, and all four resulting subcells are split again by blue.]

k-d tree: Lookup
• Exact lookup
  • Same as lookup in a binary search tree.
  • Must keep track of which dimension divides the current level of the tree.
• Nearest neighbor
  • What if the query point is very close to a dividing boundary?
  • Compute the distance to the dividing boundary.
  • If too close, check both the left and right branches of the dividing boundary.
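The building and lookup steps above can be sketched in Python. This is a minimal illustration (the dictionary-based node layout and function names are my own, not from the slides): splits cycle through dimensions via `depth mod k`, and the search backtracks into the far branch only when the query is closer to the dividing boundary than to the current best point.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree: at each level, split the points
    at the median along dimension (depth mod k)."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Nearest-neighbor lookup: descend toward the query as in a binary
    search tree, then check the far branch only if the dividing boundary
    is closer than the current best distance."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or math.dist(query, point) < math.dist(query, best):
        best = point
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < math.dist(query, best):  # query too close to the boundary
        best = nearest(far, query, best)    # so the other side must be checked
    return best
```

Using the median at every level is what keeps the tree balanced, so an exact lookup costs O(log n); the boundary check is what turns it into a correct nearest-neighbor search.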
LSH

[Figure: a 1-D projection of 2-D points and a 2-D projection of 3-D points.]

LSH relies on the idea:
• If two points are close together in n-dimensional space, their projections to a lower-dimensional space will be close too.

Projection
[Figure: hash buckets of two hash tables, g1(x) and g2(x).]
• Create multiple (l) random projections to create multiple hash tables g1(x), …, gl(x)
• Given a query point xq:
  • Fetch the set of points from each bin gk(xq) for all k
  • Union these sets to form the candidate set C
  • Compute the actual distance from xq to each point in C, and keep the nearest ones
• With high probability, the true nearest points show up in one of the bins, mixed with some faraway points (which are ignored).

Example: 13 million images in 512 dimensions — LSH narrows the search to a few thousand candidates.