Three Cool Algorithms You’ve Never Heard Of

Three Cool Algorithms
You’ve Never Heard Of!
Carey Nachenberg
cnachenberg@symantec.com
Cool Data Structure: The Metric Tree
[Figure: an example metric tree of cities. Each node holds a city and a distance threshold; one branch leads to the cities within the threshold (e.g., <=1500km away) and the other to the cities beyond it (e.g., >1500km away). Nodes shown: LA (1500km), NYC (1100km), Las Vegas (1000km), SF (100km), San Jose (200km), Boston (400km), Austin (250km), Merced (70km), Providence (200km), Atlanta (600km), New Orleans (300km), …]
Challenge: Building a Fuzzy Spell Checker
Imagine you’re building a word
processor and you want to
implement a spell checker that
gives suggestions…
Of course it’s easy to tell the user
that their word is misspelled…
Question: What data structure
could we use to determine if a
word is in a dictionary or not?
Right – a hash table or binary
search tree could tell you if a word
is spelled correctly.
lobeky
Suggestions
lonely
lovely
locale
…
But what if we want to
efficiently provide the
user with possible
alternatives?
Providing Alternatives?
Before we can provide
alternatives, we need a way to find
close matches…
One useful tool for this is the
“edit distance” metric.
Edit Distance: How many letters
must be added, deleted or
replaced to get from word A to B.
lobeky -> lovely has an edit distance of 2 (replace b with v, and k with l).
lobeky -> lowly has an edit distance of 3 (replace b with w, delete e, and replace k with l).
So given the user’s
misspelled word, and this
edit distance function…
How can we use this to
provide the user with
spelling suggestions?
Providing Alternatives?
Well, we could take our misspelled
word and compute its edit distance
to every word in the dictionary!
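[Figure: the misspelled word “lobeky” is compared against every word in the dictionary (aardvark, ark, acorn, …, bone, bonfire, …, lonely, lonesome, …) and an edit distance is computed for each one.]
And then give the user all words
with an edit distance of <=3…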
But that’s really, really slow!
There’s a better way!
But before we talk about
it, let’s talk about edit
distance a bit more…
Edit Distance
As it turns out, the edit distance
function, e(x,y), is what we call a
“metric distance function.”
What does that mean?
1. e(x,y) = e(y,x)
The edit distance of “foo” from “food”
is the same as from “food” to “foo”
2. e(x,y) >= 0
You can never have a negative
edit distance… Well that makes sense…
3. e(x,z) <= e(x,y) + e(y,z)
It’s never cheaper to do two
conversions than a direct conversion.
aka “the triangle inequality”
e(“foo”,”feed”) = 3
e(“feed”,”goon”) = 4
Total cost:
7
>
e(“foo”,”goon”) = 2
Metric Distance Functions
Given some word w (e.g., pier), let’s say I happen to know all words with an edit distance of 1 from that word…
[Figure: a cloud of words around “pier”: tier, peer, piper, pies, pie. The misspelled word “zifs” sits 3 edits away from “pier”.]
Now, if my misspelled word m (e.g., zifs) has an edit distance of 3 from w, what does that guarantee about these other words?
Right: If e(“zifs”,”pier”) is 3, and all these other words are exactly 1 edit away from pier…
Then by definition, “zifs” must be at most 4 edits away from any word in this cloud!
Why? Because we know that all of these words have at most one character difference from “pier”…
e(“zifs”,”pier”) = 3
e(“pier”,”piper”) = 1
Total cost: 4
Let’s see: e(“zifs”,”piper”) = 4
But by the same reasoning, none of these words can be less than 2 edits away from “zifs”…
So if “pier” is 3 away from “zifs”, then in the best case these other words would be one letter closer to “zifs” (e.g., if one of pier’s letters was replaced by one of zifs’ letters)...
And directly: e(“zifs”,”pies”) = 2
Imagine if we had thousands of different clouds like this.
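Both bounds in the cloud example follow from the triangle inequality and the symmetry of e. As a worked sketch, write m for the misspelled word (“zifs”), w for the cloud’s center word (“pier”), and x for any word in the cloud, so that e(m,w) = 3 and e(w,x) = 1:

```latex
\begin{align*}
e(m,x) &\le e(m,w) + e(w,x) = 3 + 1 = 4 \\
e(m,w) &\le e(m,x) + e(x,w)
  \;\Longrightarrow\;
  e(m,x) \ge e(m,w) - e(w,x) = 3 - 1 = 2
\end{align*}
```

So every word in the cloud lies between 2 and 4 edits from “zifs”, which matches the two direct computations on the slide.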
Metric Distance Functions
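[Figure: “zifs” is compared against the center word of each of several word clouds, with an edit distance shown for each comparison. The clouds contain words such as rate, hate, date, gate, ate, gale, table, tier, peer, pier, pies, piper, pie, computer and pencil.]
We could compare your misspelled word to
the center word of each cloud. If e(m,w) is
less than some threshold edit distance, then
the cloud’s other words are good
suggestions…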
A Better Way?
That works well, but then again, we’d still
have to do thousands of comparisons
(one to each cloud)…
Hmmm. Can we figure out a more efficient
way to do this?
Say with log2(D) comparisons, where D is
the number of words in your dictionary?
Duh… Well of course, we’ll need a tree!
The Metric Tree
The Metric Tree was invented in 1991 by Jeffrey
Uhlmann of the Naval Research Labs.
Each node in a Metric Tree holds a word, an
edit distance threshold value and left and right next pointers.
struct MetricTreeNode
{
string word;
unsigned int editThreshold;
MetricTreeNode *left, *right;
};
Let’s see how to build a Metric Tree! Building one is really slow,
but once we build it, searching it is really fast!
The Metric Tree
Node *buildMTree(SetOfWords &S)
1. Pick a random word W from set S.
2. Compute the edit distance for all other
words in set S to your random word W.
3. Sort all these words based on their edit
distance di to your random word W.
4. Select the median value of di,
let dmed be this median edit distance.
5. Now, create a root node N for our tree and
put our word W in this node. Set its
editThreshold value to dmed.
6. N->left = buildMTree(subset of S that is <= dmed)
7. N->right = buildMTree(subset of S that is > dmed)
8. return N
main()
{
Let S = {every word in the dictionary};
Node *root = buildMTree(S);
}
SetOfWords
goat
oyster
roster
hippo
toad
hamster
mouse
chicken
rooster
[Animation: buildMTree is applied step by step to the word set {goat, oyster, roster, hippo, toad, hamster, mouse, chicken, rooster}. “rooster” is picked as the root and the median edit distance to it is dmed = 4, so the words within 4 edits of “rooster” go to its left subtree and the rest go to its right subtree. The same steps then repeat recursively inside each subtree until every word has its own node.]
A Metric Tree
So now we have a
metric tree!
[Figure: the finished tree. Each node holds a word and its edit threshold: “rooster” 4, “mouse” 4, “toad” 5, “oyster” 4, “goat” 5, “roster” 0, “hamster” 0, “hippo” 0, “chicken” 0.]
How do we
interpret it?
Every word to the left of
rooster is guaranteed to
be within 4 edits of it…
And every word to the
right of rooster is
guaranteed to be more
than 4 edits away…
And this same structure
is repeated recursively!
Searching
[Figure: the example tree, with the search word “roaster” and its search radius of 2 drawn next to “rooster” (edit threshold 4).]
When you search a metric tree, you specify the word you’re looking for and an edit-distance radius, e.g., I want to find words within 2 edits of “roaster”.
Starting at the root, there are three cases to consider:
1. Your word and its search radius are totally inside the edit threshold.
In this case, all of your matches are guaranteed to be in our left subtree…
Searching
[Figure: the same tree, now with the search word “goute” and a search radius of 2.]
2. Your word and its search
radius are partially inside
and partially outside the
edit threshold.
In this case, some matches
will be in our left subtree
and some in our right
subtree…
Searching
[Figure: the same tree, now with the search word “vhivken” and a search radius of 2.]
3. Your word and its search
radius are completely
outside the edit threshold.
In this case, all matches
will be in our right subtree.
Metric Tree: Search Algorithm
PrintMatches(Node *cur, string misspell, int rad)
{
if e(misspell,cur->word) <= rad
then print the current word
if e(misspell,cur->word) <= cur->editThreshold
then PrintMatches(cur->left)
if e(misspell,cur->word) > cur->editThreshold
then PrintMatches(cur->right);
}
*This is a slight simplification…
PrintMatches(root,”chomster”,2);
[Animation: the search is traced through the example tree.]
e(“chomster”,”rooster”) = 3
So rooster is outside of chomster’s radius of 2. It’s not a close enough match to print…
Since 3 is less than our editThreshold of 4, let’s go left…
e(“chomster”,”mouse”) = 5
So mouse is outside of chomster’s radius of 2. It’s not a close enough match to print…
Since 5 is greater than our editThreshold of 4, we won’t go left. We will go right.
e(“chomster”,”hamster”) = 2
So hamster is inside of chomster’s radius of 2. We’ve got a match! Print hamster!
Other Metric Tree Applications
In addition to spell checking, the
Metric Tree can be used with
virtually any application where the
items obey metric rules!
Pretty cool, huh? Here’s the full search
algorithm from the original paper
(without my earlier simplifications):
PrintMatches(Node *cur, string misspell, int rad)
{
if (cur == NULL)
return; // ran off the bottom of the tree
if ( e(cur->word, misspell) <= rad )
cout << cur->word;
// the search radius reaches into the "within threshold" region
if ( e(cur->word, misspell) - rad <= cur->editThreshold )
PrintMatches(cur->left, misspell, rad);
// the search radius reaches into the "beyond threshold" region
if ( e(cur->word, misspell) + rad >= cur->editThreshold )
PrintMatches(cur->right, misspell, rad);
}
Challenge: Space-efficient Set
Membership
There are many problems where we want
to maintain a set S of items and then
check if a new item X is in the set, e.g.:
“Is ‘carey nachenberg’ a student at UCLA?”
“Is the phone number ‘424-750-7519’ known to
be used by a terrorist cell?”
So, what data structures could you
use for this?
Right! Both hash tables and
binary search trees allow you to:
1. Hold a bunch of items.
2. Quickly search through them to see
if they hold an item X.
So what’s the problem?
Well, binary search trees and hash
tables are memory hogs!
But if I JUST want to do two things:
1. Add new items to the set
2. Check if an item was previously added to a set
I can actually create a much more
memory efficient data structure!
In other words, if I never need to:
1. Print the items of the set (after
they’ve been added).
2. Enumerate each value in the set.
3. Erase items from the set.
Then we can do
much better than
our classic data
structures!
But first… A hash primer*
A hash function is a function, y=f(x), that takes an
input x (like a string) and returns an
output number y for that input.
The ideal hash function returns entirely different values for
each different input, even if two inputs are almost identical:
int y,z;
y = idealHashFunction(“carey”);
cout << y;
z = idealHashFunction(“cArey”);
cout << z;
So even though these two strings are almost identical, a good
hash function might return y=92629 and z=152.
* Not that kind of hash.
Hash Functions
Here’s a not-so-good
hash function.
int hashFunc(const string &name)
{
int i, total=0;
for (i=0;i<name.length(); i++)
total = total + name[i];
return(total);
}
Can anyone figure
out why?
Right – because
similar inputs
produce the same
output:
int y, z;
y = hashFunc("bat");
z = hashFunc("tab");
// y == z!!!! BAD!
How can we fix this?
By changing our function! That’s a
little better, although not great…
total = total + (name[i] * (i+1));
A Better Hash Function
The CRC or Cyclic Redundancy
Check algorithm is an excellent hash
function.
This function was designed to
check network packets for
corruption.
We won’t go into CRC’s details, but it’s a perfectly
fine hashing algorithm…
Ok, so we have a good hash function, now what?
A Simple Set Membership Algorithm
Most hash functions require a seed (initialization) value to be passed in.
Here’s how it might be used:
unsigned CRC(unsigned seed, string &s)
{
unsigned crc = seed;
for (int i=0;i<s.length();i++)
crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
return(crc);
}
Typically you’d use a seed value of 0xFFFFFFFF with CRC.
But you can change the seed if you like – this results in a (much) different hash value, even for the same input!
Imagine that I know I want to store up to 1 million items in my set…
I could create an array of say… 100 million bits
And then do the following…
class SimpleSet
{
public:
…
void insertItem(string &name)
{
unsigned slot = CRC(SEED, name); // unsigned: CRC values can exceed the range of int
slot = slot % 100000000;
m_arr[slot] = 1;
}
bool isItemInSet(string &name)
{
unsigned slot = CRC(SEED, name);
slot = slot % 100000000;
if (m_arr[slot] == 1)
return(true);
else return(false);
}
private:
BitArray m_arr[100000000];
};
main()
{
SimpleSet s;
s.insertItem("Carey");
s.insertItem("Flint");
if (s.isItemInSet("Flint") == true)
cout << "Flint’s in my set!";
}
[Figure: the bit array s starts out as all 0s. “Carey” hashes to 3000012131, which mod 100,000,000 is slot 12131; “Flint” lands in slot 9721. Both bits are set to 1, and the later lookup of “Flint” checks slot 9721 again.]
A Simple Set Membership Algorithm
Ok, so what’s the problem with
our SimpleSet?
Right! There’s a chance of
collisions!
What if two names happen to
hash right to the same slot?
main()
{
SimpleSet coolPeople;
coolPeople.insertItem("Carey");
if (coolPeople.isItemInSet("Paul"))
cout << "Paul Agbabian is cool!";
}
[Figure: the bit array coolPeople. “Carey” hashes to 3000012131 and “Paul” hashes to 11000012131; mod 100,000,000, both land in slot 12131, so the lookup of “Paul” finds a 1.]
class SimpleSet
{
public:
…
void insertItem(string &name)
{
unsigned slot = CRC(SEED,name); // unsigned: CRC values can exceed the range of int
slot = slot % 100000000;
m_arr[slot] = 1;
}
bool isItemInSet(string &name)
{
unsigned slot = CRC(SEED,name);
slot = slot % 100000000;
if (m_arr[slot] == 1)
return(true);
else return(false);
}
private:
BitArray m_arr[100000000];
};
Ack! If we put 1 million items in
our 100 million entry array…
we’ll have a collision rate of
about 1%!
Actually, depending on your
requirements,
that might not be so bad…
A Simple Set Membership Algorithm
Our simple set can hold about 1M
items in just 12.5MB of memory!
While it does have some false positives, it’s much smaller than a
hash table or binary search tree…
But we’ve got to be able to do
better… Right?
Right! That’s where the Bloom Filter
comes in!
The Bloom Filter was invented by
Burton Bloom in 1970.
Let’s take a look!
The Bloom Filter
In a Bloom Filter, we use an array of bits just like our original algorithm!
But instead of just using 1 hash function and setting just one bit for each insertion…
We use K hash functions, compute K hash values and set K bits!
We’ll see how K is chosen in a bit. It’s a constant and its value is computed from:
1. The max # of items you want to add.
2. The size of the array.
3. Your desired false positive rate.
class BloomFilter
{
public:
…
const int K = 4;
void insertItem(string &name)
{
for (int i=0;i< K ;i++)
{
unsigned slot = CRC( i , name); // unsigned: CRC values can exceed the range of int
slot = slot % 100000000;
m_arr[slot] = 1;
}
}
private:
BitArray m_arr[100000000];
};
main()
{
BloomFilter coolPeople;
coolPeople.insertItem("Preston");
}
Notice that each time we call the CRC function, it starts with a different seed value:
unsigned CRC(unsigned seed, string &s)
{
unsigned crc = seed;
for (int i=0;i<s.length();i++)
crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
return(crc);
}
(Passing K different seed values is the same as using K different hash functions…)
[Figure: the bit array coolPeople. The four hash values for “Preston” are 9000022531, 79929, 9197 and 3000000013; mod 100,000,000 they give slots 22531, 79929, 9197 and 13, and all four bits are set to 1.]
The Bloom Filter
Now to search, we do the
same thing!
Note: We only say an item is
a member of the set if all K
bits are set to 1.
Note: If any bit that we
check is 0, then we have a
miss…
main()
{
BloomFilter coolPeople;
coolPeople.insertItem("Preston");
if (coolPeople.isItemInSet("Carey"))
cout << "I figured…";
}
class BloomFilter
{
public:
…
bool isItemInSet(string &name)
{
for (int i=0;i< K ;i++)
{
unsigned slot = CRC( i , name); // unsigned: CRC values can exceed the range of int
slot = slot % 100000000;
if (m_arr[slot] == 0)
return(false);
}
return(true);
}
private:
BitArray m_arr[100000000];
};
[Figure: the bit array coolPeople, with the four bits for “Preston” set to 1.]
The Bloom Filter
Ok, so what’s the big deal?
All we’re doing is checking K
bits instead of 1?!!?
Well, it turns out that this
dramatically reduces the
false positive rate!
Ok… So the only questions are,
how do we choose:
1. The size of our bit-array?
2. The value of K?
Let’s see!
class BloomFilter
{
public:
void insertItem(string &name)
{
for (int i=0;i< K ;i++)
{
unsigned slot = CRC( i , name); // unsigned: CRC values can exceed the range of int
slot = slot % 100000000;
m_arr[slot] = 1;
}
}
bool isItemInSet(string &name)
{
for (int i=0;i< K ;i++)
{
unsigned slot = CRC( i , name); // unsigned: CRC values can exceed the range of int
slot = slot % 100000000;
if (m_arr[slot] == 0)
return(false);
}
return(true);
}
private:
BitArray m_arr[100000000];
The Bloom Filter
If you want to store N items
in your Bloom Filter…
And you want a false
positive rate of F%...
You’ll want to have M bits in
your bit array (with F written as a fraction, e.g., .001 for .1%):
M = N * log(F) / log(.6185)
And you’ll want to use K different hash functions:
K = .7 * M / N
Let’s see some stats!
To store N items   with this FP rate   use M bits (bytes)      and K hash fns
1M                 .1%                 14.4M bits (1.79MB)     10
100M               .001%               2.4B bits (299MB)       17
100M               .00001%             3.4B bits (419MB)       23
Now you’ve got to admit, that’s pretty efficient!
Of course, unlike a hash table, there is some chance of having a false positive…
But for many projects, this is not an issue, especially if you can guarantee a certain maximum level of FPs!
Now that’s COOL! And you’ve (hopefully) never heard about it!
Challenge: Constant-time
searching for similar items
(in a high-dimensional space)
Problem:
I’ve got a large collection C of existing web-pages, and I want to determine if
a new web-page P is a close match to any pages in my existing collection.
Obvious approach:
I could iterate through all C of my existing pages and do a pair-wise
comparison of page P to each page.
But that’s inefficient!
So how can we do it faster?
Answer: Use Locality Sensitive Hashing!
LSH has two operations:
Inserting items into the hash table:
We add a bunch of items (e.g., web pages) into a
locality-sensitive hash table
Given an item, find closely-related
items in the hash table:
Once we have a filled locality-sensitive hash table, we want
to search it for a new item and see if it contains
anything similar.
LSH, Operation #1: Insertion
Here’s the Insertion algorithm:
Step #1:
Take each input item (e.g., a web-page) and convert it to a feature vector of size V.
What’s a feature vector?
It’s a fixed-length array of floating point numbers that measure various attributes
about each input item.
const int V = 6;
float fv[V];
fv[0] = # of times the word “free” was used in the email
fv[1] = # of times the word “viagra” was used in the email
fv[2] = # of exclamation marks used in the email
fv[3] = The length of the email in words
fv[4] = The average length of each word found in the email
fv[5] = The ratio of punctuation marks to letters in the email
The items in the feature vector should be chosen to provide maximum
differentiation between different categories of items (e.g., spam vs clean email)!
LSH, Operation #1: Insertion
Why compute a feature vector for each input item?
The feature vector is a way of plotting each item into N-space.
In principle, items (e.g. emails) with similar content
(i.e., similar feature vectors) should occupy similar regions of N-space.
Input #1:
“Click here now for free viagra!!!!!”
Input #2:
“Please come to the meeting at 5pm.”
fv1 = {1, 1, 5, 6, 4.17, 0.2}
fv2 = {0, 0, 0, 7, 3.71, 0.038}
[Figure: fv1 and fv2 plotted as two points along the first three feature axes.]
LSH, Operation #1: Insertion
Step #2:
Once you have a feature vector for each of your items,
you determine the size of your hash table.
“I’m going to need to hold 100 million email feature vectors,
so I’ll want an open hash table of size N = 1 million”
Note: N must be a power of 2, e.g., 65536, or 1,048,576
Wait! Why is our hash table smaller than the # of items we want to store?
Because we want to put related items in the same bucket/slot of the table!
Step #3:
Next compute the number of bits B required to represent N in binary.
If N is 1 million, B will be log2(1 million), or 20.
LSH, Operation #1: Insertion
Step #4:
Now, create B (e.g., 20) RANDOM feature vectors that are the
same dimension as your input feature vectors.
R1 = {.277,.891,3,.32,5.89, .136}
R2 = {2.143,.073,0.3,4.9, .58, .252}
…
R19 = {.8,.425,6.43,5.6,.197,1.43}
R20 = {1.47,.256,4.15,5.6,.437,.075}
LSH, Operation #1: Insertion
What are these B random vectors for?
Each of the B random vectors defines a hyper-plane in N-space!
(each hyper-plane is perpendicular to its random vector)
R1 = {1,0,1}
R2 = {0,0,3}
R3 = {0,2.5,0}
If we have B such random vectors, we
essentially chop up N-space with B
possibly overlapping slices!
So in our example, we’d have B=20
hyper-planes chopping up our
V=6 dimensional space.
(Chopping it up into
2^20 different regions!)
LSH, Operation #1: Insertion
Ok, let’s consider a single random vector, R1, and its hyper-plane for now.
Now let’s consider a second vector, v1.
If the tips of those two vectors are on the same side
of R1’s hyper-plane, then the dot-product of the two
vectors will be positive.
R1 · v1 > 0
On the other hand, if the tips of R1 and another vector, v2, are
on opposite sides of R1’s hyper-plane, then the dot-product of the two vectors will be negative.
R1 · v2 < 0
So this is useful – if we compute the dot product of two
vectors R and v, we can determine if they’re close to each
other or far from each other in N-space.
LSH, Operation #1: Insertion
Step #5:
Create an empty open hash table
with 2^B buckets (e.g. 2^20 = 1M).
Let’s label each bucket’s # using binary rather than decimal numbers.
(You’ll see why soon)
[Figure: the hash table’s buckets, labeled 000…0000, 000…0001, 000…0010, 000…0011, …, 111…1110, 111…1111.]
Step #6:
For each item we want to add to
our hash table…
Take the feature vector for the item...
“Click here now for free viagra!!!!!” -> {1, 1, 5, 6, 4.17, 0.2}
And dot-product multiply it by every one of
our B random-valued vectors…
Random vector                            Dot product   Feature vector is on the…   Bit
R1 = {.277,.891,3,.32,5.89, .136}        -3.25         Opp. side of R1             0
R2 = {2.143,.073,0.3,4.9, .58, .252}     -1.73         Opp. side of R2             0
…
R19 = {.8,.425,6.43,5.6,.197,1.43}       .18           Same side as R19            1
R20 = {1.47,.256,4.15,5.6,.437,.075}     5.24          Same side as R20            1
This basically tells us whether our feature vector
is on the same side or the opposite side of the
hyper-plane of every one of our random vectors.
Now convert every positive dot-product to a 1
And convert every negative dot-product into a 0
And if we concatenate the 1s and 0s, this gives us a B-digit (e.g., 20 digit) binary number.
Which we can use to compute a
bucket number in our hash table
and store our item!
LSH, Operation #1: Insertion
Basically, every item in bucket
000…0000
will be on the opposite side of the hyper-planes of all the random vectors.
And every item in bucket
111…1111
will be on the same side of the hyper-planes of all the random vectors.
And items in bucket
000…0001
will be on the same side as R20, but
the opposite side of R1, R2… R19.
So each bucket essentially represents
one of the 2^20 different regions of N-space, as divided by the 20 random
hyper-plane slices.
LSH, Operation #2: Searching
Searching for closely-related
items is the same as inserting!
Step #1:
Compute the feature vector for
your item
“Click here now for free viagra!!!!!” -> {1, 1, 5, 6, 4.17, 0.2}
Step #2:
Dot-product multiply this vector by
your B random vectors
Step #3:
Convert all positive dot-products to 1,
and all negative dot-products to 0
Step #4:
Use the concatenated binary number
to pick a bucket in your hash table
And voilà – you’ve located similar
feature vectors/items!
LSH, One Last Point…
Typically, we don’t just use one LSH hash table…
But we use two or more, each with a
different set of random vectors!
Why?
Then, when searching for a new vector V, we take the
union of all buckets that V hashes to, from all hash
tables to obtain a list of matches.
Questions?