Improving minhashing: De Bruijn sequences and primitive roots for counting trailing zeroes

Why things you didn’t think you cared about are actually practical means of bit-twiddling

Mark Manasse
Frank McSherry
Kunal Talwar
Microsoft Research
This all comes out of a perennial quest to make minhashing faster.
• Minhashing is a technique for sampling an element from a stream which is
◦ Uniformly random (every element equally likely)
◦ Consistent (similar stream ↔ similar sample)
• P(S(A) = S(B)) = |A ∩ B| / |A ∪ B|
Minhashing
• View documents, photos, music, etc. as a set of features
• View the feature set as a high-dimensional vector
• Find closely-matching vectors
◦ Most of the time
◦ Proportionally close
• In L2, this leads us to cosine similarity (Indyk, Motwani)
◦ A hash function whose bits match proportionally to the cosine of the angle between two vectors
◦ Allows off-line computation of the hash, and faster comparison of hashes
Locality sensitive hashing
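
A minimal sketch of the random-hyperplane idea this refers to (illustrative, not the authors' code; planes is an assumed array of pre-drawn N(0,1) hyperplanes, shared by every vector being hashed):

#include <stddef.h>

/* One hash bit per random hyperplane: the sign of the projection.
   Two vectors' bits agree with probability 1 - angle/pi. */
unsigned lsh_bits(const double *v, size_t dim, unsigned nbits,
                  const double *planes)   /* nbits x dim N(0,1) entries */
{
    unsigned bits = 0;                    /* nbits <= 32 assumed */
    for (unsigned b = 0; b < nbits; b++) {
        double dot = 0.0;
        for (size_t i = 0; i < dim; i++)
            dot += v[i] * planes[b * dim + i];
        bits = (bits << 1) | (dot >= 0.0);  /* keep only the sign */
    }
    return bits;
}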
Working in L1: Jaccard similarity
• Given two sets, define sim(A,B) to equal the cardinality of the intersection of the sets divided by the cardinality of the union
• Proven useful, when applied to the set of phrases in a web page, when testing for near duplicates

[Figure: Venn diagram of A ∩ B]
• Pick ~100 such samples, by picking ~100 random one-to-one functions mapping to a well-ordered range
• For each function, select the pre-image of the smallest image under that function
• Naively, this takes 100 such evaluations per input item (one for each function)
• Improve by a factor of almost 8, by
◦ Choosing a 64-bit function
◦ Taking the lead 8 bits of the 8 images generated by carving the 64 bits into 8 pieces
◦ Computing more bits, but only when needed
Basic idea, and old speedup
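
A minimal sketch of this naive scheme (hash64 is an assumed family of independent 64-bit hash functions indexed by f, not the authors' implementation):

#include <stdint.h>
#include <stddef.h>

uint64_t hash64(unsigned f, uint64_t x);  /* assumed: the f-th random function */

/* For each of k functions, keep the element whose image is smallest. */
void minhash(const uint64_t *elems, size_t n, unsigned k, uint64_t *sample)
{
    for (unsigned f = 0; f < k; f++) {
        uint64_t best = elems[0], bestimg = hash64(f, elems[0]);
        for (size_t i = 1; i < n; i++) {
            uint64_t img = hash64(f, elems[i]);
            if (img < bestimg) { bestimg = img; best = elems[i]; }
        }
        sample[f] = best;  /* pre-image of the smallest image under f */
    }
}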
• Assume a uniformly chosen random function, mapping injectively to infinite binary sequences
• Order the sequences lexicographically
• Given sets A and B, and a random function f, argmin(f(A ∪ B)) is certain to be an element of A or B, and is in the intersection with probability exactly the Jaccard coefficient
• Sampling requires uniformity and consistency
◦ Uniformity, so that probability mass is spread evenly
◦ Consistency, so that small perturbations don’t matter
Why this approximates Jaccard
• Carve the 64 bits into an expected 32 pieces, by dividing at 1’s into blocks of the form 10…0
• 32, because we expect half of the bits to be 1
• Better yet, the number of maximal-length sequences is bounded by 2, independent of the length of the input
• But, how do we efficiently divide at 1’s?
New idea to speed up and reduce collisions
• Could go bit at a time, shifting and testing
◦ Lots of missed branch predictions, lots of tests
• Could look at the low-order byte; if zero, shift by 8; if not, do a table look-up
◦ Almost certainly good enough, in practice
• But we can do mathematically better…
Dividing at 1’s
• Reduce to a simpler problem: taking the logarithm base 2 of a power of 2
• Given x in two’s complement, x & -x is the smallest power of 2 in the binary expansion of x
◦ Works because -x = ~x + 1
▪ ~x ends in 01…1: a zero at the smallest power of 2 in x, with ones below it
◦ x & (x - 1) removes the least power of 2 (not useful here)
◦ x ^ -x is all ones above the least power of 2
◦ x ^ (x - 1) is all ones at and below the least power of 2
• So all 64-bit numbers can be reduced to only 65 possibilities, depending on the least power of 2 (see the demonstration below)
Dividing at 1’s smoothly
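
A quick demonstration of these identities (a sketch; the values in the comments are for x = 0xb8, binary …10111000):

#include <stdio.h>

int main(void)
{
    unsigned long long x = 0xb8;
    printf("%llx\n", x & -x);       /* 8: smallest power of 2 in x */
    printf("%llx\n", x & (x - 1));  /* b0: least power of 2 removed */
    printf("%llx\n", x ^ -x);       /* fffffffffffffff0: ones above it */
    printf("%llx\n", x ^ (x - 1));  /* f: ones at and below it */
    return 0;
}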
• Naïve: binary search in a sorted table
• Better: perfect hashing
• Using x & -x, all 65 possible values are powers of 2, or 0
◦ 2 is a primitive root modulo 67 (kudos to Peter Montgomery for noting this)
◦ So the powers of 2 generate the multiplicative group {1, 2, …, 66} modulo 67
◦ That is, the first 66 powers of 2 are distinct mod 67
• So, take (x & -x) % 67, and look in a table (sketched below)
How do we figure out which one?
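
Put together, the whole computation is one AND, one modulo, and one lookup; a sketch (table and function names are illustrative):

#include <stdint.h>

static unsigned char log67[67];  /* log67[(1ULL << i) % 67] = i */

static void init_log67(void)
{
    for (unsigned i = 0; i < 64; i++)
        log67[(1ULL << i) % 67] = (unsigned char)i;
}

/* x & -x isolates the least set bit, a power of 2; since 2 is a
   primitive root mod 67, its residue indexes the table uniquely.
   x == 0 lands in slot 0 and must be handled separately. */
static unsigned trailing_zeroes_67(uint64_t x)
{
    return log67[(x & (1 + ~x)) % 67];
}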
• Leiserson, Prokop, and Randall noticed that De Bruijn sequences are even better
• Like Gray codes, only folded
◦ De Bruijn sequences are vaguely like Gray codes
▪ A Hamiltonian circuit of the hypercube is a Gray code
▪ A Hamiltonian circuit of the De Bruijn graph is …
• De Bruijn sequences allow candy necklaces where any sequence of k candies occurs at exactly one starting point in clockwise order
• They always exist (even generalized to higher dimensions, but we don’t need that)
• (00011101)* is such a sequence for 3-bit binary
Perfect, maybe. But optimal?
• Any rotation of a De Bruijn sequence is De Bruijn
• The reversal of a De Bruijn sequence is De Bruijn
• Canonicalize sequences by rotating to the least
• Three canonical sequences for binary sequences of length 6; one is its own reversal (6 is even)
• Starting with 6 zeroes, the first five bits needed in rotation are zero, so a shift is good enough
• Just look at the high-order 6 bits after multiplying by the constant DB = 0x0218a392cd3d5dbfUL
• Doesn’t handle 0, just powers of 2
More de Bruijn
#define DB 0x0218a392cd3d5dbfUL
#define NRES 100
unsigned short dblookup[64];  // initialized to dblookup[(DB << i) >> 58] = i
unsigned result[NRES + 64];   // answers plus spill-over space
unsigned n = 0;               // quantity produced
unsigned rb = 0;              // remaining bits of randomness
unsigned elog = 0;            // left-over zeroes
unsigned long long cur;
while (n < NRES) {
    cur = newrandom(key);     // next 64 random bits
    elog += rb;               // zeroes left over at the top of the last word
    rb = 64;
    while (cur != 0) {
        // isolate least set bit, multiply by DB, table gives trailing-zero count
        unsigned short log = dblookup[((cur & (1 + ~cur)) * DB) >> 58];
        cur >>= log + 1;      // consume the zeroes and the terminating 1
        rb -= log + 1;
        result[n++] = log + elog;
        elog = 0;
    }
}
Few branch misses
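
A sketch of the table initialization the dblookup comment describes, and the lookup in isolation (reusing DB and dblookup from above); because DB starts with six zero bits, the plain left shift is correct even once zeroes are shifted in:

void init_dblookup(void)
{
    /* every 6-bit window of the De Bruijn constant is distinct, so the
       top 6 bits of (DB << i) determine i uniquely */
    for (unsigned i = 0; i < 64; i++)
        dblookup[(DB << i) >> 58] = (unsigned short)i;
}

unsigned trailing_zeroes_db(unsigned long long x)  /* x != 0 */
{
    return dblookup[((x & (1 + ~x)) * DB) >> 58];
}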
Selecting randomly and repeatably from weighted distributions

Mark Manasse
Frank McSherry
Kunal Talwar
Microsoft Research
• Jaccard, as defined, doesn’t work when the number of copies of an element is a distinguishing factor
• If we generalize a little, we get the sum of the lesser numbers of occurrences divided by the sum of the greater: sim(A,B) = Σ min(A(x), B(x)) / Σ max(A(x), B(x))
◦ Still works even for non-integral counts
◦ Allows for weighting of elements by importance
• Same as before, for integer counts, if we replace items with item × instance pairs
◦ <cat, cat, dog> → {cat1, cat2, dog1}
• The sample is an (item, instance) pair, not just an item
Jaccard, extended to multisets
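
A minimal sketch of the generalized similarity, with counts held in parallel arrays indexed by item id (the layout is an assumption for illustration, not the authors' representation):

#include <stddef.h>

/* sim(A,B) = sum of lesser counts / sum of greater counts */
double multiset_jaccard(const double *a, const double *b, size_t nitems)
{
    double lesser = 0.0, greater = 0.0;
    for (size_t i = 0; i < nitems; i++) {
        lesser  += a[i] < b[i] ? a[i] : b[i];
        greater += a[i] < b[i] ? b[i] : a[i];
    }
    return greater > 0.0 ? lesser / greater : 1.0;  /* empty vs. empty */
}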
• To allow for faster computation, we estimate similarity by sampling
• Pick some number of samples, where for any sets A and B, each sample agrees with probability equal to sim(A,B)
• Count the average number of matching samples
• To get a good sample, pick a random one-to-one mapping of set elements to a well-ordering, and pick the preimage of the (unique!) smallest
Sampling, instead of pairwise computation, for sets
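
The estimate itself is then just the agreement rate across the k per-function samples; a sketch:

/* fraction of the k samples on which documents A and B agree */
double estimate_sim(const unsigned long long *sa,
                    const unsigned long long *sb, unsigned k)
{
    unsigned matches = 0;
    for (unsigned i = 0; i < k; i++)
        matches += (sa[i] == sb[i]);
    return (double)matches / k;
}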
• Given a good way to approximate Jaccard for sets, we can convert a multiset (but not a distribution) into a set by replacing 100 occurrences of “cat” by “cat1”, “cat2”, …, “cat100”
• Requires (if streaming) remembering how many prior occurrences of elements have been seen
Multiset Jaccard, one implementation
• The previous technique is linear in input size, if the input is {cat, cat, cat, dog}
• Exponential in input size if the input is (cat, 100)
• Probability to the rescue!
◦ If our random map is to a real number between 0 and 1, we don’t need to generate 100 random values to find the smallest
◦ Pr(X > x) = 1 − x
◦ Pr(min of k draws > x) = (1 − x)^k
Multiset reduction considered inefficient
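
Concretely, the minimum of k uniform draws can be produced from a single uniform by inverting that distribution; a sketch (uniform01 is an assumed generator, seeded per key for repeatability):

#include <math.h>

double uniform01(void);  /* assumed: uniform in (0,1) */

/* Pr(min of k > x) = (1-x)^k, so u = (1-x)^k is itself uniform;
   inverting gives min = 1 - u^(1/k). Works for non-integral k too. */
double min_of_k(double k)
{
    return 1.0 - pow(uniform01(), 1.0 / k);
}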
• That’s the probability, but not one that lets us pick samples to test for agreement
• Not good enough to pick an element; we have to pick an instance of the element
◦ (cat, 100) and (cat, 200) are only .5 similar
• Has to be repeatable
◦ If cat7 is chosen from (cat, 100), we mustn’t choose cat73 from (cat, 200) (but cat104 would be OK)
Not so fast!
• A sampling process must pick an element of the input
◦ For discrete things, an integer occurrence at most equal to the number of occurrences
◦ For non-negative real-valued things, a real value at most equal to the input
• Must pick uniformly by weight
• Must pick the same sample from any subset containing the sample
Properties for repeatable sampling
Uniformity
• To be uniform in a column, we have to pick a number smaller than a given number
• A variant of reservoir sampling suffices
◦ Given n, pick a random number below n uniformly
◦ Given that number, pick a random number below that, and repeat
◦ To make this repeatable and expected constant time, break into powers of 2
◦ Given n, round up to the next higher power of 2, and repeat the downward process until below half

[Figure: levels 0.5, 1.0, 2.0, 4.0, and 8.0, with selections S(8,1) = 7.632…, S(8,2) = 4.918…, S(4,1) = 3.054…, S(1,1) = 0.783…, and n = 3.724…. For this choice of n, S(4,1) is the downward selection; for a slightly smaller n, S(1,1) would be selected.]
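
A sketch of the naive downward chain these bullets describe, before the power-of-2 bucketing that makes it repeatable (uniform01 as before, seeded per column):

#include <stdio.h>

double uniform01(void);  /* assumed: uniform in (0,1) */

/* From n, repeatedly pick uniformly below the previous value; each
   step halves the value in expectation, so the chain has roughly
   log2(n/limit) selections. */
void downward_chain(double n, double limit)
{
    for (double x = n * uniform01(); x > limit; x *= uniform01())
        printf("selected %f\n", x);
}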
• Same process, but we have to first round up, by finding the smallest chosen number above n
• First check the power-of-2 range containing n
• The next level up contains something if the first selected number < 2^(k+1) is > 2^k
• If this happens, take the smallest number in that range
• Otherwise, repeat at the next power of 2 up

[Figure: same diagram as before. For this choice of n, S(8,2) is the upward selection.]
Scaling up
• Given a column scaled up to n, we need to construct the right distribution for the second smallest number (assuming the n’th is the smallest)
• In the discrete case (if we consider only integers as valid choices), Pr(2nd smallest > x) is proportional to (1 − x)^(n−1) · x, so the CDF = (n+1)·x^n − n·x^(n+1) = x^n·(1 − n(x − 1))
• In the continuous case (which we can get by scaling the discrete case), the CDF = x^n·(1 − n·ln x)
• Pick a random luckiness factor p for a column, and solve for x in CDF = p by iteration
• Pick the column with the smallest x value
Picking a column
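
The slides leave the iteration unspecified; as a sketch, bisection suffices, since this CDF is monotone on (0, 1) (its derivative is -n^2 · x^(n-1) · ln x, which is positive there):

#include <math.h>

/* solve x^n * (1 - n*ln x) = p for x in (0,1) */
double solve_cdf(double n, double p)
{
    double lo = 0.0, hi = 1.0;
    for (int i = 0; i < 60; i++) {  /* 60 halvings: ~double precision */
        double mid = 0.5 * (lo + hi);
        if (pow(mid, n) * (1.0 - n * log(mid)) < p)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}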
• We can use just one bit to decide whether a power-of-2 range has any selected values
• So use a single random value to decide which of the 64 powers of 2 are useful
◦ Either by computing 64 samples in parallel, or
◦ By computing 64 intervals at once
• Use logarithms of the CDF rather than the CDF itself to keep things reasonable; look at 1 − x instead of x to keep the log away from 0
• Partially evaluate convergence to save time, and compare preimages to the CDF when possible
Reducing randomness and improving numerical accuracy
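
A sketch of that log-domain evaluation, parametrized by y = 1 − x so the logarithm stays accurate when x is near 1:

#include <math.h>

/* ln CDF = n*ln x + ln(1 - n*ln x), with x = 1 - y */
double log_cdf(double y, double n)
{
    double lx = log1p(-y);           /* ln(1 - y) = ln x */
    return n * lx + log1p(-n * lx);  /* -n*lx >= 0, so log1p is safe */
}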