Example solved in class

The Shingle x Document matrix below is reproduced from MinwiseHashing-Lect.pptx, slide 26. Since
it is transposed, columns are documents and rows are shingles.
Input matrix (Shingles x Documents)
Shingle   d1  d2  d3  d4
1          1   0   1   0
2          1   0   0   1
3          0   1   0   1
4          0   1   0   1
5          0   1   0   1
6          1   0   1   0
7          1   0   1   0
The universal hash equation from slide 28 is:
h_{a,b}(x) = ((a·x + b) mod p) mod N
where:
a, b … random integers
p … prime number (p > N)
N … number of rows (shingles) in the matrix
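As a minimal sketch of the equation above, the helper below builds one hash function from fixed parameters. The name make_universal_hash and the example parameters are illustrative assumptions, not taken from the slides, and the code only checks p > n (it does not verify that p is prime).

```python
def make_universal_hash(a, b, p, n):
    """Return h(x) = ((a*x + b) mod p) mod n for fixed a, b, p, n.

    Hypothetical helper for illustration; primality of p is assumed,
    not checked.
    """
    if p <= n:
        raise ValueError("p must be larger than n")

    def h(x):
        return ((a * x + b) % p) % n

    return h


# Illustrative parameters only (not the ones used in the exercise).
h = make_universal_hash(a=5, b=1, p=11, n=4)
example_values = [h(x) for x in range(1, 5)]
print(example_values)  # [2, 0, 1, 2]
```

Returning a closure keeps a, b, p and n fixed, so several hash functions can be created side by side and applied to the same row ids.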
Calculate the minhash signatures, in both row-id and hash-value form, for two hash functions: the
first with a1 = 3 and b1 = 2, and the second with a2 = 1 and b2 = 4. Let p be the large prime
2147483647. Because a and b are tiny for ease of calculation, a * x + b never reaches p (its
largest value here is 3 * 7 + 2 = 23), so the “mod p” step can be ignored for this problem.
Solution: Our output should be two 2 x 4 matrices (2 hash functions x 4 documents): one for row-id
(shingle index) and one for hash value. Since N (number of rows) is 7, we can simplify the first
hash function to h_{3,2}(x) = (3 * x + 2) mod 7 and the second to h_{1,4}(x) = (x + 4) mod 7. For
convenience we can rewrite the documents as sets d1 = {1, 2, 6, 7}, d2 = {3, 4, 5}, d3 = {1, 6, 7},
and d4 = {2, 3, 4, 5}.
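The two simplified hash functions can be tabulated for every shingle with a short Python sketch. The names P, N, h, h1 and h2 are chosen here for illustration; the parameter values are the ones stated in the problem. The assertions also confirm that the “mod p” step changes nothing for these inputs.

```python
P = 2147483647  # the large prime given in the problem
N = 7           # number of rows (shingles)


def h(a, b, x):
    """Full universal hash: ((a*x + b) mod P) mod N."""
    return ((a * x + b) % P) % N


shingles = range(1, N + 1)
h1 = {x: h(3, 2, x) for x in shingles}  # a1 = 3, b1 = 2
h2 = {x: h(1, 4, x) for x in shingles}  # a2 = 1, b2 = 4

# "mod P" is a no-op here because a*x + b is always far below P.
assert all(h1[x] == (3 * x + 2) % N for x in shingles)
assert all(h2[x] == (x + 4) % N for x in shingles)

for x in shingles:
    print(x, h1[x], h2[x])
# 1 5 5
# 2 1 6
# 3 4 0
# 4 0 1
# 5 3 2
# 6 6 3
# 7 2 4
```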
For hash function 1 and document 1, we choose shingle 2, because its hash value (3 * 2 + 2) mod 7
= 1 is smaller than the hash values of document 1’s other shingles 1, 6, and 7 (which are 5, 6,
and 2, respectively). So we place “2” in cell (1,1) of the “Shingle index” matrix, and “1” in cell
(1,1) of the “Hash value” matrix. Repeating this for every other cell (taking, for each hash
function and document, the minimum hash value and its corresponding shingle index) gives the full
answer below:
Hash value
      d1  d2  d3  d4
h1     1   0   2   0
h2     3   0   3   0
Shingle index
      d1  d2  d3  d4
h1     2   4   7   4
h2     6   3   6   3
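The whole answer can be checked with a short Python sketch. The function and variable names (universal_hash, minhash, docs, hash_params) are assumptions made for this illustration; the document sets and parameters are those given in the problem. Each output row corresponds to one hash function, and each column to one of the documents d1 to d4.

```python
P = 2147483647  # large prime from the problem statement
N = 7           # number of rows (shingles)

docs = {
    "d1": {1, 2, 6, 7},
    "d2": {3, 4, 5},
    "d3": {1, 6, 7},
    "d4": {2, 3, 4, 5},
}
hash_params = [(3, 2), (1, 4)]  # (a1, b1), (a2, b2)


def universal_hash(x, a, b):
    return ((a * x + b) % P) % N


def minhash(doc, a, b):
    """Return (shingle, hash value) for the shingle with the smallest hash."""
    shingle = min(doc, key=lambda x: universal_hash(x, a, b))
    return shingle, universal_hash(shingle, a, b)


hash_values = []
shingle_index = []
for a, b in hash_params:
    row = [minhash(docs[name], a, b) for name in ("d1", "d2", "d3", "d4")]
    shingle_index.append([s for s, _ in row])
    hash_values.append([v for _, v in row])

print(hash_values)    # [[1, 0, 2, 0], [3, 0, 3, 0]]
print(shingle_index)  # [[2, 4, 7, 4], [6, 3, 6, 3]]
```

Because 3 and 1 are both coprime to 7, each hash function maps the shingles 1 to 7 to distinct values, so the minimum within a document is never tied.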