Three Cool Algorithms You've Never Heard Of!
Carey Nachenberg
cnachenberg@symantec.com

Cool Data Structure: The Metric Tree

[Figure: an example metric tree of cities. The root holds "LA" with a threshold of 1500km; cities at most 1500km away (SF, threshold 100km; San Jose, 200km; Las Vegas, 1000km; Merced, 70km; ...) descend down one branch, while cities more than 1500km away (NYC, 1100km; Boston, 400km; Providence, 200km; Austin, 250km; Atlanta, 600km; New Orleans, 300km; ...) descend down the other. Each node repeats the same "within threshold / beyond threshold" split.]

Challenge: Building a Fuzzy Spell Checker

Imagine you're building a word processor and you want to implement a spell checker that gives suggestions... Of course it's easy to tell the user that their word is misspelled...

Question: What data structure could we use to determine if a word is in a dictionary or not?

Right: a hash table or binary search tree could tell you if a word is spelled correctly. But what if the user types "lobeky" and we want to efficiently provide possible alternatives, like "lonely", "lovely", or "locale"?

Providing Alternatives?

Before we can provide alternatives, we need a way to find close matches... One useful tool for this is the "edit distance" metric.

Edit Distance: how many letters must be added, deleted, or replaced to get from word A to word B.

For example, "lobeky" -> "lovely" has an edit distance of 2 (replace b with v, replace k with l), and "lobeky" -> "lowly" has an edit distance of 3.

So given the user's misspelled word, and this edit distance function... how can we use this to provide the user with spelling suggestions?
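(The deck treats e(x,y) as a given; as a point of reference, here is a minimal sketch of the classic dynamic-programming computation of it. The function name editDistance and the implementation details are my own illustration, not part of the original slides.)

    #include <algorithm>
    #include <string>
    #include <vector>

    // A minimal sketch of edit distance e(x,y): the number of single-letter
    // additions, deletions, or replacements needed to turn x into y.
    int editDistance(const std::string &x, const std::string &y)
    {
        int m = x.length(), n = y.length();
        // dp[i][j] = edit distance between the first i letters of x
        // and the first j letters of y
        std::vector<std::vector<int>> dp(m + 1, std::vector<int>(n + 1));
        for (int i = 0; i <= m; i++) dp[i][0] = i;
        for (int j = 0; j <= n; j++) dp[0][j] = j;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int replace = (x[i-1] == y[j-1]) ? 0 : 1;
                dp[i][j] = std::min({ dp[i-1][j] + 1,            // delete a letter
                                      dp[i][j-1] + 1,            // add a letter
                                      dp[i-1][j-1] + replace }); // replace (or keep)
            }
        }
        return dp[m][n];
    }

    // e.g., editDistance("lobeky", "lovely") returns 2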
Providing Alternatives?

Well, we could take our misspelled word and compute its edit distance to every word in the dictionary (aardvark: 8, ark: 5, acorn: 6, ... bone, bonfire, ... lonely, lonesome, ...), and then give the user all words with an edit distance of <= 3...

But that's really, really slow! There's a better way! But before we talk about it, let's talk about edit distance a bit more...

Edit Distance

As it turns out, the edit distance function, e(x,y), is what we call a "metric distance function." What does that mean?

1. e(x,y) = e(y,x)
   The edit distance from "foo" to "food" is the same as from "food" to "foo".
2. e(x,y) >= 0
   You can never have a negative edit distance... Well, that makes sense...
3. e(x,z) <= e(x,y) + e(y,z)
   It's never cheaper to do two conversions than a direct conversion, aka "the triangle inequality". For example, e("foo","feed") = 3 and e("feed","goon") = 4, for a total cost of 7, which is more than the direct e("foo","goon") = 2.

Metric Distance Functions

Given some word w (e.g., "pier"), let's say I happen to know all words with an edit distance of 1 from that word: tier, peer, piper, pies, pie. Call this w's cloud.

Now, if my misspelled word m (e.g., "zifs") has an edit distance of 3 from w, what does that guarantee about these other words?

Right: if e("zifs","pier") is 3, and all of these other words are exactly 1 edit from "pier", then by the triangle inequality "zifs" must be at most 4 edits away from any word in the cloud:

    e("zifs","pier") = 3, e("pier","piper") = 1, total cost: 4

Let's see: and directly, e("zifs","piper") = 4 and e("zifs","pies") = 2.

But by the same reasoning, none of these words can be less than 2 edits away from "zifs": in the best case a cloud word is one letter closer to "zifs" than "pier" is (e.g., if one of pier's letters was replaced by one of zifs' letters).

Imagine if we had thousands of different clouds like this: {rate, hate, date, gate, ate, gale} around one center word, {tier, peer, pier, pies, piper, pie} around another, plus words like "computer" and "pencil" off on their own. We could compare your misspelled word m to just the center word w of each cloud. If e(m,w) is less than some threshold edit distance, then the cloud's other words are good suggestions...

A Better Way?

That works well, but then again, we'd still have to do thousands of comparisons (one to each cloud)... Hmmm. Can we figure out a more efficient way to do this? Say with log2(D) comparisons, where D is the number of words in your dictionary?

Duh... Well of course, we'll need a tree!

The Metric Tree

The Metric Tree was invented in 1991 by Jeffrey Uhlmann of the Naval Research Laboratory. Each node in a Metric Tree holds a word, an edit distance threshold value, and left and right next pointers.

    struct MetricTreeNode
    {
        string word;
        unsigned int editThreshold;
        MetricTreeNode *left, *right;
    };

Let's see how to build a Metric Tree! Building one is really slow, but once we build it, searching it is really fast!

The Metric Tree

Node *buildMTree(SetOfWords &S)
1. Pick a random word W from set S.
2. Compute the edit distance for all other words in set S to your random word W.
3. Sort all these words based on their edit distance di to your random word W.
4. Select the median value of di; let dmed be this median edit distance.
5. Now, create a root node N for our tree and put our word W in this node. Set its editThreshold value to dmed.
6. N->left = buildMTree(subset of S that is <= dmed)
7. N->right = buildMTree(subset of S that is > dmed)
8. return N

    main()
    {
        Let S = {every word in the dictionary};
        Node *root = buildMTree(S);
    }
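(Here is a hedged, compilable sketch of the recipe above, reusing the illustrative editDistance helper from earlier. The pivot choice, the nth_element-based median split, and the vector-of-strings representation of SetOfWords are my own choices; the slides leave those details open.)

    #include <algorithm>
    #include <string>
    #include <vector>

    struct MetricTreeNode
    {
        std::string word;
        unsigned int editThreshold;
        MetricTreeNode *left, *right;
    };

    // Build a metric tree over the words in S by splitting around the median
    // edit distance to a pivot word. For brevity the "random" pivot is just
    // the last word of S; a real build would shuffle or pick randomly.
    MetricTreeNode *buildMTree(std::vector<std::string> S)
    {
        if (S.empty()) return nullptr;

        std::string W = S.back();                        // step 1: pick pivot W
        S.pop_back();
        MetricTreeNode *N = new MetricTreeNode{W, 0, nullptr, nullptr};
        if (S.empty()) return N;                         // single word: leaf node

        std::vector<int> d(S.size());                    // steps 2-3: distances to W
        for (size_t i = 0; i < S.size(); i++)
            d[i] = editDistance(S[i], W);

        std::vector<int> sorted = d;                     // step 4: median distance
        std::nth_element(sorted.begin(),
                         sorted.begin() + sorted.size() / 2, sorted.end());
        int dmed = sorted[sorted.size() / 2];
        N->editThreshold = dmed;                         // step 5

        std::vector<std::string> nearWords, farWords;
        for (size_t i = 0; i < S.size(); i++)
            (d[i] <= dmed ? nearWords : farWords).push_back(S[i]);
        N->left  = buildMTree(nearWords);                // step 6
        N->right = buildMTree(farWords);                 // step 7
        return N;                                        // step 8
    }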
The Metric Tree: Building One

Let's build a tree from the dictionary {goat, oyster, roster, hippo, toad, hamster, mouse, chicken, rooster}.

First pass: pick a random word, say "rooster", and compute every other word's edit distance to it: roster is 1 edit away, oyster 2, hamster 3, mouse 4, while toad, goat, hippo, and chicken are all more than 4 edits away. The median distance dmed is 4, so the root node holds "rooster" with an editThreshold of 4. The words within 4 edits of "rooster" (roster, oyster, hamster, mouse) are passed to buildMTree to build the left subtree, and the words more than 4 edits away (toad, goat, hippo, chicken) are passed to buildMTree to build the right subtree.

The recursion then repeats the same steps on each subset: in the slides' example, "mouse" becomes the left subtree's root with an editThreshold of 4, "toad" becomes the right subtree's root with an editThreshold of 5, and so on down to single-word leaf nodes, whose thresholds are left at 0.
A Metric Tree

So now we have a metric tree!

    "rooster" 4
      <= 4: "mouse" 4
               <= 4: "oyster" 4
                        <= 4: "roster" 0
               >  4: "hamster" 0
      >  4: "toad" 5
               <= 5: "goat" 5
                        <= 5: "hippo" 0
                        >  5: "chicken" 0

How do we interpret it? Every word to the left of "rooster" is guaranteed to be within 4 edits of it... and every word to the right of "rooster" is guaranteed to be more than 4 edits away... And this same structure is repeated recursively!

Searching

When you search a metric tree, you specify the word you're looking for and an edit-distance search radius, e.g., "I want to find words within 2 edits of 'roaster'." Starting at the root, there are three cases to consider:

1. Your word and its search radius are totally inside the node's edit threshold (e.g., "roaster" with radius 2 at the root). In this case, all of your matches are guaranteed to be in our left subtree...

2. Your word and its search radius are partially inside and partially outside the edit threshold (e.g., "goute" with radius 2). In this case, some matches will be in our left subtree and some in our right subtree...

3. Your word and its search radius are completely outside the edit threshold (e.g., "vhivken" with radius 2). In this case, all matches will be in our right subtree.

Metric Tree: Search Algorithm

Here's a simplified* search algorithm:

    PrintMatches(Node *cur, string misspell, int rad)
    {
        if e(misspell, cur->word) <= rad
            then print the current word
        if e(misspell, cur->word) <= cur->editThreshold
            then PrintMatches(cur->left, misspell, rad)
        if e(misspell, cur->word) > cur->editThreshold
            then PrintMatches(cur->right, misspell, rad)
    }

    *This is a slight simplification...

Let's trace PrintMatches(root, "chomster", 2):

At the root, e("chomster","rooster") = 3. Since 3 is greater than our radius of 2, "rooster" is not a close enough match to print. But since 3 is less than or equal to the root's editThreshold of 4, let's go left...

At "mouse", e("chomster","mouse") = 5. So "mouse" is outside of chomster's radius of 2; it's not a close enough match to print. And since 5 is greater than this node's editThreshold of 4, we will go right.

At "hamster", e("chomster","hamster") = 2. So "hamster" is inside of chomster's radius of 2. We've got a match!
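(Here is the walkthrough's simplified search logic as a hedged C++ sketch, again using the illustrative editDistance helper and MetricTreeNode from earlier; the deck itself leaves e() abstract.)

    #include <iostream>
    #include <string>

    // The simplified search from the walkthrough: descend into exactly one
    // subtree of each node, printing any word within rad edits of misspell.
    void PrintMatches(MetricTreeNode *cur, const std::string &misspell, int rad)
    {
        if (cur == nullptr)
            return;
        int d = editDistance(misspell, cur->word);
        if (d <= rad)
            std::cout << cur->word << "\n";           // close enough: a match!
        if (d <= (int)cur->editThreshold)
            PrintMatches(cur->left, misspell, rad);   // inside the threshold: go left
        else
            PrintMatches(cur->right, misspell, rad);  // outside: go right
    }

    // e.g., PrintMatches(root, "chomster", 2) prints "hamster" on the tree above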
Other Metric Tree Applications

In addition to spell checking, the Metric Tree can be used with virtually any application where the items obey metric rules! Pretty cool, huh?

Here's the full search algorithm from the original paper (without my earlier simplifications):

    PrintMatches(Node *cur, string misspell, int rad)
    {
        if ( e(cur->word, misspell) <= rad )
            cout << cur->word;
        if ( e(cur->word, misspell) - rad <= cur->editThreshold )
            PrintMatches(cur->left, misspell, rad);
        if ( e(cur->word, misspell) + rad >= cur->editThreshold )
            PrintMatches(cur->right, misspell, rad);
    }

Note that, unlike the simplified version, this one can descend into both subtrees: by the triangle inequality, the left subtree can contain a match only if e(cur->word, misspell) - rad <= cur->editThreshold, and the right subtree only if e(cur->word, misspell) + rad >= cur->editThreshold, so we recurse into whichever subtrees the search radius overlaps.

Challenge: Space-efficient Set Membership

There are many problems where we want to maintain a set S of items and then check if a new item X is in the set, e.g.:

"Is 'carey nachenberg' a student at UCLA?"
"Is the phone number '424-750-7519' known to be used by a terrorist cell?"

So, what data structures could you use for this? Right! Both hash tables and binary search trees allow you to:
1. Hold a bunch of items.
2. Quickly search through them to see if they hold an item X.

So what's the problem? Well, binary search trees and hash tables are memory hogs! But if I JUST want to do two things:
1. Add new items to the set
2. Check if an item was previously added to the set

I can actually create a much more memory-efficient data structure! In other words, if I never need to:
1. Print the items of the set (after they've been added).
2. Enumerate each value in the set.
3. Erase items from the set.

Then we can do much better than our classic data structures! But first...

A Hash Primer*

A hash function is a function, y = f(x), that takes an input x (like a string) and returns an output number y for that input. The ideal hash function returns entirely different values for each different input, even if two inputs are almost identical:

    int y, z;
    y = idealHashFunction("carey");
    cout << y;
    z = idealHashFunction("cArey");
    cout << z;

So even though these two strings are almost identical, a good hash function might return y=92629 and z=152.

*Not that kind of hash.

Hash Functions

Here's a not-so-good hash function. Can anyone figure out why?

    int hashFunc(const string &name)
    {
        int i, total = 0;
        for (i = 0; i < name.length(); i++)
            total = total + name[i];
        return total;
    }

Right, because similar inputs produce the same output:

    int y, z;
    y = hashFunc("bat");
    z = hashFunc("tab");    // y == z!!!! BAD!

How can we fix this? By changing our function so that each letter's contribution depends on its position:

    total = total + (name[i] * (i+1));

That's a little better, although not great...
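(A quick sanity check of that fix; hashFunc2 is my own name for the position-weighted version, not the deck's. With plain summing, "bat" and "tab" both hash to 98+97+116 = 311; with position weighting they differ.)

    #include <iostream>
    #include <string>
    using namespace std;

    // position-weighted version of hashFunc from above
    int hashFunc2(const string &name)
    {
        int total = 0;
        for (int i = 0; i < (int)name.length(); i++)
            total = total + (name[i] * (i + 1));
        return total;
    }

    int main()
    {
        // prints "640 604": the anagrams no longer collide
        cout << hashFunc2("bat") << " " << hashFunc2("tab") << endl;
    }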
A Better Hash Function

The CRC or Cyclic Redundancy Check algorithm is an excellent hash function. This function was designed to check network packets for corruption. We won't go into CRC's details, but it's a perfectly fine hashing algorithm:

    unsigned CRC(unsigned seed, const string &s)
    {
        unsigned crc = seed;
        for (int i = 0; i < s.length(); i++)
            crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
        return crc;
    }

Most hash functions require a seed value to be passed in; typically you'd use a seed value of 0xFFFFFFFF with CRC. But you can change the seed if you like, and this results in a (much) different hash value, even for the same input!

Ok, so we have a good hash function, now what?

A Simple Set Membership Algorithm

Imagine that I know I want to store up to 1 million items in my set... I could create an array of, say, 100 million bits and then do the following:

    class SimpleSet
    {
    public:
        void insertItem(const string &name)
        {
            int slot = CRC(SEED, name);
            slot = slot % 100000000;
            m_arr[slot] = 1;
        }

        bool isItemInSet(const string &name)
        {
            int slot = CRC(SEED, name);
            slot = slot % 100000000;
            if (m_arr[slot] == 1)
                return true;
            else
                return false;
        }

    private:
        BitArray m_arr[100000000];
    };

Here's how it might be used:

    main()
    {
        SimpleSet s;
        s.insertItem("Carey");
        s.insertItem("Flint");

        if (s.isItemInSet("Flint") == true)
            cout << "Flint's in my set!";
    }

Inserting "Carey" hashes to some slot (e.g., CRC value 3000012131, modded down to slot 12131) and sets that bit to 1; looking up "Flint" hashes to its own slot (e.g., 9721) and simply checks that bit.

A Simple Set Membership Algorithm

Ok, so what's the problem with our SimpleSet? Right! There's a chance of collisions! What if two names happen to hash right to the same slot?

    main()
    {
        SimpleSet coolPeople;
        coolPeople.insertItem("Carey");

        if (coolPeople.isItemInSet("Paul"))
            cout << "Paul Agbabian is cool!";
    }

Ack! If we put 1 million items in our 100-million-entry array... we'll have a collision rate of about 1%! Actually, depending on your requirements, that might not be so bad...

Our simple set can hold about 1M items in just 12.5MB of memory! While it does have some false positives, it's much smaller than a hash table or binary search tree... But we've got to be able to do better... Right?

Right! That's where the Bloom Filter comes in! The Bloom Filter was invented by Burton Bloom in 1970. Let's take a look!

The Bloom Filter

In a Bloom Filter, we use an array of bits just like our original algorithm! But instead of just using 1 hash function and setting just one bit for each insertion... we use K hash functions, compute K hash values, and set K bits!

    class BloomFilter
    {
    public:
        const int K = 4;    // We'll see how K is chosen in a bit. It's a
                            // constant and its value is computed from:
                            // 1. The max # of items you want to add.
                            // 2. The size of the array.
                            // 3. Your desired false positive rate.

        void insertItem(const string &name)
        {
            for (int i = 0; i < K; i++)
            {
                int slot = CRC(i, name);
                slot = slot % 100000000;
                m_arr[slot] = 1;
            }
        }
        ...
    private:
        BitArray m_arr[100000000];
    };

    main()
    {
        BloomFilter coolPeople;
        coolPeople.insertItem("Preston");
    }

Notice that each time we call the CRC function, it starts with a different seed value. (Passing K different seed values is the same as using K different hash functions...)
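(The deck's CRC leaves CONST1, CONST2, and crcTable undefined. As a hedged, self-contained stand-in, here is a seeded FNV-1a hash, which plays the same "K seeds = K hash functions" role; any decent seedable string hash would do.)

    #include <cstdint>
    #include <string>

    // A stand-in for the deck's seeded CRC: FNV-1a, mixed with a caller seed.
    uint32_t seededHash(uint32_t seed, const std::string &s)
    {
        uint32_t h = 2166136261u ^ seed;   // FNV offset basis, xor'd with the seed
        for (char c : s) {
            h ^= (uint8_t)c;
            h *= 16777619u;                // FNV prime
        }
        return h;
    }

    // usage: slot = seededHash(i, name) % 100000000, for i = 0..K-1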
The Bloom Filter

Now to search, we do the same thing: compute all K hash values and check all K bits. Note: we only say an item is a member of the set if all K bits are set to 1. If any bit that we check is 0, then we have a miss...

    class BloomFilter
    {
    public:
        ...
        bool isItemInSet(const string &name)
        {
            for (int i = 0; i < K; i++)
            {
                int slot = CRC(i, name);
                slot = slot % 100000000;
                if (m_arr[slot] == 0)
                    return false;
            }
            return true;
        }

    private:
        BitArray m_arr[100000000];
    };

    main()
    {
        BloomFilter coolPeople;
        coolPeople.insertItem("Preston");

        if (coolPeople.isItemInSet("Carey"))
            cout << "I figured...";
    }

Ok, so what's the big deal? All we're doing is checking K bits instead of 1?!!? Well, it turns out that this dramatically reduces the false positive rate! Ok... so the only questions are, how do we choose:
1. The size of our bit array?
2. The value of K?

Let's see!

The Bloom Filter

If you want to store N items in your Bloom Filter, and you want a false positive rate of F... you'll want to have M bits in your bit array:

    M = N * log(F) / log(0.6185)

And you'll want to use K different hash functions:

    K = 0.7 * M / N

Let's see some stats!

    To store:   with this FP rate:   use M bits (bytes):     and K hash fns:
    1M          .1%                  14.4M bits (1.79MB)     10
    100M        .001%                2.4B bits (299MB)       17
    100M        .00001%              3.4B bits (419MB)       23

Now you've got to admit, that's pretty efficient! Of course, unlike a hash table, there is some chance of having a false positive... But for many projects, this is not an issue, especially if you can guarantee a certain maximum level of FPs!

Now that's COOL! And you've (hopefully) never heard about it!
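(As a sanity check on the table's first row, here is a tiny helper, my own arithmetic rather than the deck's, that just evaluates the two sizing formulas.)

    #include <cmath>
    #include <cstdio>

    // Evaluate the Bloom filter sizing formulas for N items and FP rate F.
    int main()
    {
        double N = 1e6, F = 0.001;                       // 1M items, 0.1% FP rate
        double M = N * std::log(F) / std::log(0.6185);   // bits needed
        double K = 0.7 * M / N;                          // hash functions needed
        std::printf("M = %.1fM bits (%.2fMB), K = %.0f\n",
                    M / 1e6, M / 8e6, K);
        // prints roughly: M = 14.4M bits (1.80MB), K = 10
    }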
Challenge: Constant-time Searching for Similar Items (in a High-dimensional Space)

Problem: I've got a large collection C of existing web pages, and I want to determine if a new web page P is a close match to any pages in my existing collection.

Obvious approach: I could iterate through all C of my existing pages and do a pair-wise comparison of page P to each page. But that's inefficient! So how can we do it faster?

Answer: Use Locality Sensitive Hashing! LSH has two operations:

Inserting items into the hash table: we add a bunch of items (e.g., web pages) into a locality-sensitive hash table.

Given an item, finding closely-related items in the hash table: once we have a filled locality-sensitive hash table, we want to search it for a new item and see if it contains anything similar.

LSH, Operation #1: Insertion

Here's the insertion algorithm:

Step #1: Take each input item (e.g., a web page or email) and convert it to a feature vector of size V. What's a feature vector? It's a fixed-length array of floating-point numbers that measure various attributes of each input item:

    const int V = 6;
    float fv[V];
    fv[0] = # of times the word "free" was used in the email
    fv[1] = # of times the word "viagra" was used in the email
    fv[2] = # of exclamation marks used in the email
    fv[3] = the length of the email in words
    fv[4] = the average length of each word found in the email
    fv[5] = the ratio of punctuation to letters in the email

The items in the feature vector should be chosen to provide maximum differentiation between different categories of items (e.g., spam vs. clean email)!

Why compute a feature vector for each input item? The feature vector is a way of plotting each item into N-space. In principle, items (e.g., emails) with similar content (i.e., similar feature vectors) should occupy similar regions of N-space:

    Input #1: "Click here now for free viagra!!!!!"  ->  fv1 = {1, 1, 5, 6, 4.17, 0.2}
    Input #2: "Please come to the meeting at 5pm."   ->  fv2 = {0, 0, 1, 7, 3.71, 0.038}

Step #2: Once you have a feature vector for each of your items, you determine the size of your hash table. (Note: N must be a power of 2, e.g., 65536 or 1,048,576.) "I'm going to need to hold 100 million email feature vectors, so I'll want an open hash table of size N = 1 million."

Wait! Why is our hash table smaller than the # of items we want to store? Because we want to put related items in the same bucket/slot of the table!

Step #3: Next compute the number of bits B required to represent N in binary. If N is 1 million, B will be log2(1 million), or 20.

Step #4: Now, create B (e.g., 20) RANDOM feature vectors that are the same dimension as your input feature vectors:

    R1  = {.277, .891, 3, .32, 5.89, .136}
    R2  = {2.143, .073, 0.3, 4.9, .58, .252}
    ...
    R19 = {.8, .425, 6.43, 5.6, .197, 1.43}
    R20 = {1.47, .256, 4.15, 5.6, .437, .075}

What are these B random vectors for? Each of the B random vectors defines a hyper-plane in N-space; each hyper-plane is perpendicular to its random vector. (For instance, in 3-space, R1 = {1,0,1}, R2 = {0,0,3}, and R3 = {0,2.5,0} each define a plane through the origin perpendicular to themselves.) If we have B such random vectors, we essentially chop up N-space with B possibly overlapping slices! So in our example, we'd have B=20 hyper-planes chopping up our V=6 dimensional space (chopping it up into 2^20 different regions!).

Ok, let's consider a single random vector, R1, and its hyper-plane for now. Now let's consider a second vector, v1. If the tips of those two vectors are on the same side of R1's hyper-plane, then the dot product of the two vectors will be positive: R1 · v1 > 0. On the other hand, if the tips of the two vectors are on opposite sides of R1's hyper-plane, then the dot product will be negative: R1 · v2 < 0.

So this is useful: by computing the dot product of a random vector R and an item's vector v, we can determine which side of R's hyper-plane the item lies on in N-space.

Step #5: Create an empty open hash table with 2^B buckets (e.g., 2^20 = 1M), and label each bucket's number in binary rather than decimal: 000...0000, 000...0001, 000...0010, ..., 111...1111. (You'll see why soon.)

Step #6: For each item we want to add to our hash table, take the item's feature vector and dot-product multiply it by every one of our B random-valued vectors. For "Click here now for free viagra!!!!!", fv = {1, 1, 5, 6, 4.17, 0.2}, the slides' illustrative values are:

    fv · R1  = -3.25   ->  opposite side of R1's hyper-plane
    fv · R2  = -1.73   ->  opposite side of R2's hyper-plane
    ...
    fv · R19 =  0.18   ->  same side as R19
    fv · R20 =  5.24   ->  same side as R20

This basically tells us whether our feature vector is on the same side or the opposite side of the hyper-plane of every one of our random vectors. Now convert every positive dot product to a 1 and every negative dot product to a 0. Concatenating these bits gives us a B-digit (e.g., 20-digit) binary number, which we can use to compute a bucket number in our hash table and store our item!
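(Here is a hedged sketch of steps #4 and #6: computing an item's bucket index from the signs of B dot products. The Gaussian random-vector generation and the names makeRandomVectors and computeBucket are my own illustration; the deck doesn't specify how the random vectors are drawn.)

    #include <cstdint>
    #include <random>
    #include <vector>

    // Step #4: B random vectors of dimension V, one per hyper-plane.
    std::vector<std::vector<float>> makeRandomVectors(int B, int V, unsigned rngSeed)
    {
        std::mt19937 rng(rngSeed);
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        std::vector<std::vector<float>> R(B, std::vector<float>(V));
        for (auto &r : R)
            for (auto &x : r)
                x = gauss(rng);
        return R;
    }

    // Step #6: dot fv against each random vector; positive -> bit 1,
    // negative -> bit 0; the concatenated bits form the bucket number.
    uint32_t computeBucket(const std::vector<float> &fv,
                           const std::vector<std::vector<float>> &R)
    {
        uint32_t bucket = 0;
        for (const auto &r : R) {
            float dot = 0;
            for (size_t i = 0; i < fv.size(); i++)
                dot += r[i] * fv[i];
            bucket = (bucket << 1) | (dot > 0 ? 1 : 0);  // one bit per hyper-plane
        }
        return bucket;   // in [0, 2^B), usable directly as a hash-table index
    }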
LSH, Operation #1: Insertion

Basically, every item in bucket 00000000000000000000 will be on the opposite side of every random vector's hyper-plane. And every item in bucket 11111111111111111111 will be on the same side of every random vector's hyper-plane. And items in bucket 00000000000000000001 will be on the same side as R20, but the opposite side of R1, R2, ..., R19.

So each bucket essentially represents one of the 2^20 different regions of N-space, as divided by the 20 random hyper-plane slices.

LSH, Operation #2: Searching

Searching for closely-related items is the same as inserting!

Step #1: Compute the feature vector for your item.
Step #2: Dot-product multiply this vector by your B random vectors.
Step #3: Convert all positive dot products to 1, and all negative dot products to 0.
Step #4: Use the concatenated binary number to pick a bucket in your hash table.

And voila, you've located similar feature vectors/items!

LSH, One Last Point...

Typically, we don't just use one LSH hash table... we use two or more, each with a different set of random vectors! Why? Because two genuinely similar items can still land in different buckets if they happen to fall on opposite sides of even one hyper-plane; with several independently sliced tables, it's unlikely they get separated in all of them. Then, when searching for a new vector V, we take the union of all buckets that V hashes to, from all hash tables, to obtain a list of matches.

Questions?