HASH TABLES
DEALING WITH RAW BYTES
SOME PROBABILISTIC ANALYSIS
Hash Tables
Motivations
 Balanced search trees
 Store (key, value)-pairs
 O(log n)-time search, insert, delete
 Relatively complex implementations
 Can we improve running times for basic operations?
CSE 250, Spring 2012, SUNY Buffalo, @Hung Q. Ngo
5/28/2016
Example
 Given a set of (key, value)-pairs
 Keys are in the set {0, 1, 2, 3, …, n-1}
 Values could be anything
 Store them in an array (direct access table)
Index:  0     1     2     3     …   n-2      n-1
Value:  v0    NULL  v2    NULL  …   V[n-2]   V[n-1]
 Search/insert/delete takes O(1)-time
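A minimal sketch of the direct access table above, assuming string values and C++17. The class name `DirectAccessTable` and its member names are illustrative, not from the slides.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Direct access table: keys are exactly the integers 0..n-1,
// so the key itself is the array index.
class DirectAccessTable {
public:
    // All n slots start empty (the NULL entries in the picture).
    explicit DirectAccessTable(std::size_t n) : slots_(n) {}

    // Each operation touches exactly one slot: O(1) time.
    void insert(std::size_t key, const std::string& value) { slots_.at(key) = value; }
    void erase(std::size_t key) { slots_.at(key).reset(); }

    // Returns the stored value, or std::nullopt if the slot is empty.
    std::optional<std::string> search(std::size_t key) const { return slots_.at(key); }

private:
    std::vector<std::optional<std::string>> slots_;
};
```

The table costs n slots no matter how few pairs are actually stored.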
What is the drawback?
 Unlikely to see such a perfect key set in practice
 (Key, Value) = (English Word, Meaning)
 (Key, Value) = (function, address)
 (Key, Value) = (URL, IP address)
 Wastes lots of space when n >> # pairs
 Say keys are 8-byte integers, then n = $2^{64}$
The Static Key Set Case
- DICTIONARY
- RESERVED WORDS IN A PROGRAMMING LANGUAGE
- COMMAND NAMES IN AN OS
- FILE NAMES IN CD-ROM
LAZY ARRAY & MINIMAL PERFECT HASHING
Static Key Set
 Wanted: data structure for an online dictionary
 With what we know so far, what are the choices?
 Use a balanced BST such as RB tree, AVL tree
 Randomize the keys and insert one by one in a normal BST or a splay tree
 Sort the keys, use binary search
 Sorting + binary search is the best of the three options
   Search still takes O(log n)-time
   Can we do better?
A Wild Solution
 Every key is a series of 0s and 1s
 There is always some integer m such that, for all practical purposes
   A key K is an (≤ m)-bit number bin-rep(K)
   E.g. in ASCII, bin-rep(“cse250”) = 0x637365323530
   Use this m-bit number as an index to an array of values
 What’s the longest non-technical English word?
 Floccinaucinihilipilification (29 characters)
 What’s the longest technical English word?
 Pneumonoultramicroscopicsilicovolcanoconiosis (a disease)
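A small sketch of bin-rep, assuming keys of at most 8 ASCII characters so that the result fits in a 64-bit integer. The function name `bin_rep` is illustrative; longer keys would need a wider integer type.

```cpp
#include <cstdint>
#include <string>

// Interprets the bytes of `key` as one big-endian integer.
// Assumption for this sketch: key.size() <= 8, so the value fits in 64 bits.
std::uint64_t bin_rep(const std::string& key) {
    std::uint64_t value = 0;
    for (unsigned char c : key) {
        value = (value << 8) | c;  // append the next byte on the right
    }
    return value;
}
```

For example, `bin_rep("cse250")` yields 0x637365323530, matching the value on the slide.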
The solution is wild in at least two ways
 So, say m = 8 × 30 = 240 bits
 Need an array A of $2^{240} \approx 1.77 \times 10^{72}$ elements
 Even if we have that much memory space, there is still one major problem
   A[x] has to be initialized to NULL for all x from 0 to $2^{240}-1$
   NULL is just 0
 We have n = 150000 = $15 \times 10^{4}$ words, say
   Initializing the data structure takes ≥ $n^{13}$ steps
   The O(n log n) sorting + binary search looks great
The lazy array data structure
 Sequentially read inputs into an array Dict of size n
 Dict[i] is the i’th (word, value) pair
 A is the huge array indexed by bin-rep(word); it is never initialized
 Insert all words:
   for (i = 0; i < n; ++i) A[Dict[i].word] = i
 search(x): (x is a word)
   if (0 ≤ A[x] ≤ n-1 && Dict[A[x]].word == x)
       return Dict[A[x]].value
   else
       return false
 You can even delete(x) in O(1)-time
 If search(x) succeeds, just set Dict[A[x]] = NULL
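The idea can be sketched in C++ on a deliberately tiny universe. This assumes 16-bit keys instead of 240-bit ones, and it simulates "uninitialized" memory by filling A with pseudo-random garbage, since reading truly uninitialized memory is undefined behavior in C++. The names `LazyArray` and `Entry` are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <random>
#include <string>
#include <utility>
#include <vector>

struct Entry {
    std::uint16_t key;
    std::string value;
    bool live;  // false once the entry has been deleted
};

class LazyArray {
public:
    explicit LazyArray(std::vector<std::pair<std::uint16_t, std::string>> items)
        : A_(1u << 16) {
        // Stand-in for never-initialized memory: arbitrary garbage in every cell.
        std::mt19937 rng(12345);
        for (auto& cell : A_) cell = static_cast<std::uint32_t>(rng());
        // Only the n cells that belong to real keys are ever written.
        for (auto& [key, value] : items) {
            A_[key] = static_cast<std::uint32_t>(dict_.size());
            dict_.push_back({key, std::move(value), true});
        }
    }

    // A[x] is trusted only if it points into Dict AND Dict agrees on the key.
    std::optional<std::string> search(std::uint16_t x) const {
        std::size_t i = A_[x];
        if (i < dict_.size() && dict_[i].live && dict_[i].key == x) return dict_[i].value;
        return std::nullopt;
    }

    void erase(std::uint16_t x) {
        std::size_t i = A_[x];
        if (i < dict_.size() && dict_[i].key == x) dict_[i].live = false;
    }

private:
    std::vector<std::uint32_t> A_;  // indexed by key; mostly garbage
    std::vector<Entry> dict_;       // the n real (key, value) pairs
};
```

A garbage cell that happens to hold a valid index is harmless, because the key stored in Dict will not match the key being searched.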
Lazy array, an illustration
[Figure: lazy array illustration]
Dict (array of size n):
  0:   (“UB”, “University at Buffalo”)
  1:   (“Mark Twain”, “Great writer”)
  2:   (“cse250”, “boring course”)
  3:   NULL
  …
  n-1: (“Matrix”, “Best Scifi Movie”)
A (indexed by bin-rep of the word):
  A[0x4D6174726978] = n-1   (“Matrix”)
  A[0x637365323530] = 2     (“cse250”)
  A[0x637365323531] = 2     (“cse251”: set to 2 by accident)
Major drawback and an inspiration
 Used a humongous amount of space
 $n \ll 2^{b}$, where b = # bits to represent the longest word
 However, if there was a function h
 0 ≤ h(word) ≤ n-1
 For any two words x & y, h(x) ≠ h(y)
 Then, we’re (almost) in good shape
Sample Input, n=6
Key        Hash code (using ASCII), in hex   Index   Value
“Adam”     4164616D                          0       …
“Ashley”   4173686C6579                      1       …
“Daniel”   44616E69656C                      2       …
“Kayla”    4B61796C61                        3       …
“Mike”     4D696B65                          4       …
“Troy”     54726F79                          5       …
Function h(hash_code) → {0,1,2,3,4,5}
What’s that function h?
int h(int a) {
    a = (((a%256)%100)%41)%10;
    a = (a*a)%14;
    return (a>2) ? a-4 : a;
}
Took me ½ hour to come up with that
Minimal Perfect Hash Function
 S: the set of n (hash codes of) keys
 S = {0x4164616D, 0x4173686C6579, 0x44616E69656C, 0x4B61796C61, 0x4D696B65, 0x54726F79} in the example above
 h: S → {0,1,…,n-1} is an MPHF if it is a bijection
 We want to
 Find such a function h (in a short amount of time)
 Maybe store … the function in a data structure!
 Evaluating h(code) should take O(1)-time
 Possible, but a little bit complicated
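One naive way to "find such a function" is brute force: try candidates from a parameterized family until one is a bijection on S. The sketch below assumes a hypothetical family $h_a(x) = ((a \cdot (x \bmod p)) \bmod p) \bmod n$ with an arbitrarily chosen prime p = 1000003. Neither the family nor the names `find_mphf` and `is_bijection` come from the slides, and the search is not guaranteed to succeed for every key set.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical candidate family for this sketch (not the slides' function h).
constexpr std::uint64_t kPrime = 1000003;  // a prime, chosen arbitrarily

std::uint64_t candidate_hash(std::uint64_t a, std::uint64_t x, std::uint64_t n) {
    // a < kPrime and (x % kPrime) < kPrime, so the product fits in 64 bits.
    return (a * (x % kPrime)) % kPrime % n;
}

// True if the candidate with multiplier `a` maps the n keys onto {0..n-1}
// without any collision, i.e. it is a minimal perfect hash function for them.
bool is_bijection(std::uint64_t a, const std::vector<std::uint64_t>& keys) {
    std::vector<bool> used(keys.size(), false);
    for (std::uint64_t x : keys) {
        std::uint64_t slot = candidate_hash(a, x, keys.size());
        if (used[slot]) return false;
        used[slot] = true;
    }
    return true;
}

// Tries multipliers 1, 2, ... and returns the first that works, if any.
std::optional<std::uint64_t> find_mphf(const std::vector<std::uint64_t>& keys) {
    for (std::uint64_t a = 1; a < kPrime; ++a) {
        if (is_bijection(a, keys)) return a;
    }
    return std::nullopt;
}
```

Storing the function then amounts to storing the single multiplier a, and evaluating it takes O(1) time. The search itself can take many attempts, because only a small fraction of functions are bijections on a given key set.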
Hashing – General Ideas
- HASHING FIRST PROPOSED BY ARNOLD DUMEY (1956)
- HASH CODES
- CHAINING
- OPEN ADDRESSING, LINEAR PROBING, QUADRATIC PROBING
Top level view
[Figure: top level view of h(object). Arbitrary objects (strings, doubles, ints), of which n objects are actually used → hash code → int with wide range → compression function → {0,1,…,m-1}. We will also call the compression function the hash function.]
Good Hash Function
 If key1 = key2, then h(key1) = h(key2)
 If key1 ≠ key2, then it’s extremely unlikely that h(key1) = h(key2)
   Collision problem!
 Constructing the function h takes little time
 Given key, computing h(key) takes O(|key|)-time
Collision is simply unavoidable
 Pigeonhole principle
 K+1 pigeons, K holes → at least one hole with ≥ 2 pigeons
 There are many more objects in the universe than m
 Object set = set of strings of length ≤ characters
 Object set = set of possible URLs
 Object set = set of possible file names in a CD-ROM
 While m is something like a few hundred thousand or less
Hash codes for int-style types
 Say we want hash codes to map to a (4-byte) int
 Easy when objects =
 short int, or int, or char or unsigned char
 Simply cast them to uint32_t
 What about when objects =
 long int (8 byte integers)
[Figure: the 8 bytes x0 x1 … x7 of a long int are reduced to a 4-byte hash code y0 y1 y2 y3]
Casting Down
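#include <cstdint>
#include <iostream>
using namespace std;

// Hash code by truncation: keep only the low-order 4 bytes.
// Fixed-width types are used because unsigned long is not 8 bytes everywhere.
uint32_t hash_code1(uint64_t a) {
    return static_cast<uint32_t>(a);
}

int main() {
    uint64_t a = 0x8888888877777777;
    uint64_t b = 0x1111111177777777;
    // Both lines report 77777777 as the hash code: a and b collide.
    cout << hex << a << " converted to " << hash_code1(a) << endl;
    cout << hex << b << " converted to " << hash_code1(b) << endl;
    return 0;
}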
Drawback of Casting from Long to Int
 We ignore the first (high-order) 4 bytes of information
 If key1 and key2 differ only in the first 4 bytes, they will collide!
 On the other hand, if keys are uniformly distributed, we are OK.
 Could also sum 1st 4 bytes with 2nd 4 bytes
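A short sketch of the two options, using the same two sample values as the casting example. The function names `hash_code_cast` and `hash_code_sum` are illustrative.

```cpp
#include <cstdint>

// Truncation: keeps only the low-order 4 bytes.
std::uint32_t hash_code_cast(std::uint64_t a) {
    return static_cast<std::uint32_t>(a);
}

// Folding: adds the high-order 4 bytes to the low-order 4 bytes.
// Unsigned 32-bit addition wraps around modulo 2^32.
std::uint32_t hash_code_sum(std::uint64_t a) {
    return static_cast<std::uint32_t>(a >> 32) + static_cast<std::uint32_t>(a);
}
```

With a = 0x8888888877777777 and b = 0x1111111177777777, truncation gives 0x77777777 for both, while the sum gives 0xFFFFFFFF for a and 0x88888888 for b, so the two keys no longer collide.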
Hash codes for strings & variable length objects
 Say we have a universe of character array objects
 “Computer Science”
 “Floccinaucinihilipilification”
 “Alan Turing”
 …
 How do we produce 4-byte hash codes for them?
Hash codes for strings (or byte-sequences)
 Add up the characters
 XOR 4-bytes at a time
 Polynomial hash codes
 Shifting hash codes
 FNV hash
 MurmurHash
 Etc.
 Important Lesson: data-dependency!
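Three of the hash codes in the list can be sketched as follows. The FNV variant shown is 32-bit FNV-1a with its standard constants; the slides do not say which FNV variant was used, so that choice is an assumption, as are the function names.

```cpp
#include <cstdint>
#include <string>

// "Add up the characters": anagrams always collide.
std::uint32_t sum_hash(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h += c;
    return h;
}

// Polynomial hash code: h = s[0]*base^(k-1) + ... + s[k-1], modulo 2^32.
// base = 31 and base = 33 are the two values in the experiment table.
std::uint32_t poly_hash(const std::string& s, std::uint32_t base = 31) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h = h * base + c;
    return h;
}

// 32-bit FNV-1a: xor in each byte, then multiply by the FNV prime.
std::uint32_t fnv1a_hash(const std::string& s) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}
```

The sum treats "ab" and "ba" identically (both give 195), while the polynomial code is position-sensitive (3105 versus 3135 with base 31). This is one small instance of the data-dependency lesson: a key set full of anagrams is bad for the sum and harmless for the others.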
Some experimental results
Hash code function (into uint32_t)   # of collisions   Max bucket size
Sum                                  1730769           175
Xor                                  583               3
Shift7                               56                2
Poly31                               22                2
Poly33                               22                2
FNV                                  0                 1
FNV is widely used, in DNS & Twitter, for example
Compression functions
 4-byte hash codes can’t be used as indices
 $2^{32} = 4{,}294{,}967{,}296 \approx 4 \times 10^{9}$ is too many
 To store n entries, need indices in {0,1,…,m-1}
 m should be close to n (say n = 50K, m = 60K)
 Compression function f: uint32_t → {0,1,…,m-1}
   Division method
   Multiplication method
   Universal hashing
 Compression functions are hash functions and thus these methods can be used to design hash codes too!
Hash function design problem
 Universe U = all uint32_t integers
 S, an unknown subset of n members of U
 Find f : U → {0,1,…,m-1}
 Computing f(u) is fast
 Minimize collisions
 Note: suppose |U| > m ≥ n
 For a fixed S, there always exists f with no collisions
 For a fixed f, there always exists S with lots of collisions
 If S’s distribution is truly arbitrary
 The best f is such that f(s) is uniformly distributed on {0…m-1}
 Ball-into-Bins model: Throw n “balls” randomly into m “bins”
An analogy – the birthday problem
 U = 7 billion people in the world
 S = set of students in this room
 f maps students to birthdates {Jan 01, …, Dec 31}
 So m = 365 (forget leap year)
 Question:
 If S is chosen randomly from U, how large must S be until it is more likely to have a collision than not?
 This is called the birthday “paradox”
Birthday paradox
 Say there are n students in this room
 Prob[1st student does not “collide”] = 1
 Prob[2nd student does not “collide”] = 1-1/m
 Prob[3rd student does not “collide” | first two didn’t collide] = 1-2/m
 …
 Overall probability of no collision is
   $\left(1-\frac{1}{m}\right)\left(1-\frac{2}{m}\right)\cdots\left(1-\frac{n-1}{m}\right) < \frac{1}{2}$ when n = 23 and m = 365
 When n=30, Prob[no collision] ≈ 30%
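The product above is easy to evaluate numerically. A small sketch follows; the function names are illustrative.

```cpp
// Probability that n keys thrown uniformly into m slots are all distinct:
// (1 - 1/m)(1 - 2/m)...(1 - (n-1)/m).
double prob_no_collision(int n, int m) {
    double p = 1.0;
    for (int i = 1; i < n; ++i) {
        p *= 1.0 - static_cast<double>(i) / m;
    }
    return p;
}

// Smallest n for which a collision is more likely than not.
int smallest_n_for_likely_collision(int m) {
    int n = 1;
    while (prob_no_collision(n, m) >= 0.5) ++n;
    return n;
}
```

For m = 365 this gives n = 23, and `prob_no_collision(30, 365)` is a little over 0.29, in line with the ≈ 30% quoted above.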
Rarity of Minimal Perfect Hash Function
 Consider |U| = N, m = n, MPHF is a bijection
 Number of functions from U to {0,1,…,n-1} is $n^{N}$
 For a fixed S (but unknown) of size n
 number of MPHF for S is $n! \cdot n^{N-n}$
 Hence, the fraction of functions which are MPHF is $\frac{n! \cdot n^{N-n}}{n^{N}} = \frac{n!}{n^{n}}$
   When n = 10, the ratio is 0.00036…
   When n = 20, the ratio is $2.32 \times 10^{-8}$
Division method
 Compress by taking the remainder: $f(x) = x \bmod m$
 How does this function perform for different m?
 The answer depends a lot on the distribution of S in the universe U
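A sketch of the division method together with a small helper that counts collisions for a given key set. It assumes the usual definition f(x) = x mod m; the helper and its names are illustrative and are not the experiment code behind the numbers on the slides.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Division method: compress a 32-bit hash code into {0, ..., m-1}.
std::uint32_t compress_division(std::uint32_t code, std::uint32_t m) {
    return code % m;
}

struct CollisionStats {
    std::size_t collisions;       // keys that landed in an already occupied bucket
    std::size_t max_bucket_size;  // size of the fullest bucket
};

CollisionStats count_collisions(const std::vector<std::uint32_t>& codes, std::uint32_t m) {
    std::vector<std::size_t> bucket(m, 0);
    CollisionStats stats{0, 0};
    for (std::uint32_t code : codes) {
        std::size_t& size = bucket[compress_division(code, m)];
        if (size > 0) ++stats.collisions;
        ++size;
        stats.max_bucket_size = std::max(stats.max_bucket_size, size);
    }
    return stats;
}
```

The dependence on the distribution of S shows up even in a toy case: the codes 0, 10, 20, …, 60 all collide when m = 10 but spread perfectly when m = 7.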
m from 50K to 60K
 Total # collisions: 19K
 Max bucket size 6-8, typically
 Recall n ≈ 47K
 Could we have guessed this result without coding?
 Something in the spirit of the birthday paradox?
 Motto: Think, then code!
Balls into Bins
 Throw n balls into m bins randomly
 Probability a given bin is empty is $(1-1/m)^{n} \approx e^{-n/m}$
 Expected number of empty bins is $m e^{-n/m}$
 It can be shown mathematically that on average, when m ≈ n, the maximum bin size is about $\frac{\ln n}{\ln \ln n}$
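The estimate can be checked against a quick simulation. The sketch below assumes a Mersenne Twister with a caller-supplied seed as the source of randomness; the names `expected_empty_bins`, `BinStats` and `throw_balls` are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// The estimate from the slide: m * e^(-n/m).
double expected_empty_bins(std::size_t n, std::size_t m) {
    return m * std::exp(-static_cast<double>(n) / m);
}

struct BinStats {
    std::size_t empty_bins;
    std::size_t max_bin_size;
};

// Throws n balls into m bins uniformly at random and reports what happened.
BinStats throw_balls(std::size_t n, std::size_t m, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, m - 1);
    std::vector<std::size_t> bins(m, 0);
    for (std::size_t i = 0; i < n; ++i) ++bins[pick(rng)];

    BinStats stats{0, 0};
    for (std::size_t size : bins) {
        if (size == 0) ++stats.empty_bins;
        stats.max_bin_size = std::max(stats.max_bin_size, size);
    }
    return stats;
}
```

For n = 47000 and m = 50000 the formula gives roughly 19.5K empty bins, and a simulated run should land close to that value.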
These estimates are incredibly good!
 n = 47000, m = 50000
 $m e^{-n/m} \approx 19000$
 And
 You can repeat the experiment with m ≈ 100K
Multiplication method – slightly better!
Golden ratio
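The formula for this slide did not survive extraction. A sketch of the textbook multiplication method, assuming the common form $f(x) = \lfloor m \cdot (xA \bmod 1) \rfloor$ with $A = (\sqrt{5}-1)/2$, the golden ratio conjugate, is given below. Whether the slide used exactly this form is an assumption, and the function name is illustrative.

```cpp
#include <cmath>
#include <cstdint>

// Multiplication method: scale the fractional part of code * A up to {0, ..., m-1}.
// Assumption: A = (sqrt(5) - 1) / 2, about 0.618.
std::uint32_t compress_multiplication(std::uint32_t code, std::uint32_t m) {
    const double A = (std::sqrt(5.0) - 1.0) / 2.0;
    double frac = std::fmod(code * A, 1.0);  // fractional part, in [0, 1)
    return static_cast<std::uint32_t>(std::floor(m * frac));
}
```

Unlike the division method, the result does not depend on m being prime, because the mixing comes from the multiplication by A rather than from the remainder.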
Universal Hashing
 Adversary can always pick a key set S which creates O(n) collisions
   Denial of Service attack (more later!)
 Universal hashing approach
   Design a family H of hash functions such that for any k ≠ k’, $\Pr_{h \in H}[h(k) = h(k')] \le 1/m$
   Pick a hash function h in H uniformly at random
   Note that the key set is chosen by the adversary
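One classical universal family is $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$ with p prime and a, b chosen at random. The sketch below assumes p = 2^31 − 1 and keys smaller than p, so that all arithmetic fits in 64 bits. The slides do not say which family was intended, so this is only one possible instance, and the class name is illustrative.

```cpp
#include <cstdint>
#include <random>

// One member of the family h_{a,b}(k) = ((a*k + b) mod p) mod m,
// drawn at random when the object is constructed.
class UniversalHash {
public:
    static constexpr std::uint64_t kPrime = 2147483647;  // 2^31 - 1, a prime

    UniversalHash(std::uint32_t m, std::mt19937_64& rng) : m_(m) {
        a_ = std::uniform_int_distribution<std::uint64_t>(1, kPrime - 1)(rng);
        b_ = std::uniform_int_distribution<std::uint64_t>(0, kPrime - 1)(rng);
    }

    // Assumption for this sketch: key < kPrime.
    std::uint32_t operator()(std::uint32_t key) const {
        return static_cast<std::uint32_t>(((a_ * key + b_) % kPrime) % m_);
    }

private:
    std::uint32_t m_;
    std::uint64_t a_ = 1;
    std::uint64_t b_ = 0;
};
```

The adversary may know the family, but not the random a and b drawn when the table is created, so no fixed key set is bad for every choice.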
Theoretical Results
 α = n/m is the load factor of the hash table
 Expected bucket size is at most 1 + α
 Fact: when the universal family is n-independent, it behaves almost as if we throw balls randomly and independently into bins
Additional notes
 Table size m: should choose a prime for mod compression
   If it’s even, even % even = even
   Objects in computer memory often start at an even address
   If it’s a power of 2 then we effectively mod out the high-order bits
 Lost the relative order of keys
 Can’t answer queries such as:
   “what are the keys (& associated values) in between 3 and 432?”
   “list the smallest k keys”
   BTW, how do we answer such a query with a BST?
Collision resolution
SEPARATE CHAINING
OPEN ADDRESSING
CUCKOO HASHING
Separate Chaining
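[Figure: separate chaining. A table with columns Index and Pointer and rows 0 to 4; the keys Turing, Cantor, Knuth, Karp and Dijkstra are hashed into the table, and each slot points to a chain of the keys that landed there.]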
Performance
 Under simple uniform hashing assumption
 i.e. each object is hashed into a bucket with probability 1/m, uniformly and independently of other objects
 Expected search time Θ(1+α)
 Worst-case search time Θ(n) – though very unlikely
 Using universal hashing, expected time for n operations is Θ(n)
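A minimal sketch of a hash table with separate chaining, assuming string keys and values, C++17, and `std::hash` standing in for the hash code plus division compression. The class name `ChainedHashTable` is illustrative.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class ChainedHashTable {
public:
    explicit ChainedHashTable(std::size_t m) : buckets_(m) {}

    // Walks one chain: expected length is the load factor alpha = n/m.
    void insert(const std::string& key, const std::string& value) {
        auto& chain = buckets_[index(key)];
        for (auto& kv : chain) {
            if (kv.first == key) { kv.second = value; return; }  // overwrite
        }
        chain.emplace_back(key, value);
        ++size_;
    }

    std::optional<std::string> search(const std::string& key) const {
        for (const auto& kv : buckets_[index(key)]) {
            if (kv.first == key) return kv.second;
        }
        return std::nullopt;
    }

    bool erase(const std::string& key) {
        auto& chain = buckets_[index(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->first == key) { chain.erase(it); --size_; return true; }
        }
        return false;
    }

    double load_factor() const { return static_cast<double>(size_) / buckets_.size(); }

private:
    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, std::string>>> buckets_;
    std::size_t size_ = 0;
};
```

Every operation only touches the chain of one bucket, which is where the Θ(1+α) expected search time comes from. If all keys land in the same bucket, the chain has length n and search degrades to a linear scan.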
Denial of Service Attacks
 http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.html
 http://www.ocert.org/advisories/ocert-2011-003.html
 http://permalink.gmane.org/gmane.comp.security.full-disclosure/83694
 “Hash tables are a commonly used data structure in most programming
languages. Web application servers or platforms commonly parse attacker-controlled POST form data into hash tables automatically, so that they can
be accessed by application developers. If the language does not provide a
randomized hash function or the application server does not recognize
attacks using multi-collisions, an attacker can degenerate the hash table by
sending lots of colliding keys. The algorithmic complexity of inserting n
elements into the table then goes to O(n**2), making it possible to exhaust
hours of CPU time using a single HTTP request.”
BTW
 You can do Separate Treeing too!
 They don’t teach that in school
 What’s the performance of your hash table then?
 What’s the drawback?
Open Addressing
 Store all entries in the hash table itself, no pointer to the “outside”
 Advantage
 Less space waste
 Perhaps good cache usage
 Disadvantage
 More complex collision resolution
 Slower operations
Open Addressing
[Figure: open addressing. A table with slots 0 to 7 holding the keys Turing, Cantor, Knuth, Karp and Dijkstra directly. Probe sequences shown: h(“Knuth”, 0), h(“Knuth”, 1); h(“Karp”, 0), h(“Karp”, 1); h(“Dijkstra”, 0), h(“Dijkstra”, 1), h(“Dijkstra”, 2).]
Open Addressing Scheme
 Instead of
   h : U → {0,1,…,m-1}
   e.g. h(key) = 3
 We use an extended hash function which defines a probe sequence
   h : U × {0,1,…,m-1} → {0,1,…,m-1}
   e.g. h(key, 0) = 5, h(key, 1) = 9, h(key, 2) = 7, …
Insert Algorithm
for (i = 0; i < m; i++) {
    j = h(key, i);             // i-th slot in the probe sequence of key
    if (Table[j] == NULL) {    // first empty slot found
        Table[j] = entry;      // insert entry
        break;
    }
}
if (i == m) report error "hash table overflow"
Desirable Property of Probe Sequence
 For any key, h(key, 0), …, h(key, m-1) is a permutation of the set {0,1,…,m-1}
 What happens if the property does not hold?
 How do we do search, BTW?
Delete
 Find where the key is
 Can’t simply remove and set the entry to NULL
 Why?
 One solution
 Set deleted entry to be a special DELETED object
 Modify insert so that a new object replaces a DELETED entry as well as a NULL entry
 When searching, pass over DELETED entries – don’t stop!
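Insert, search and delete with DELETED markers can be sketched as follows. This assumes linear probing with step 1 and `std::hash` for h’; the class and member names are illustrative.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

class LinearProbingTable {
    enum class State { Empty, Occupied, Deleted };
    struct Slot {
        State state = State::Empty;
        std::string key;
        std::string value;
    };

public:
    explicit LinearProbingTable(std::size_t m) : table_(m) {}

    // Returns false on overflow (no Empty or Deleted slot on the probe sequence).
    bool insert(const std::string& key, const std::string& value) {
        std::optional<std::size_t> free_slot;
        for (std::size_t i = 0; i < table_.size(); ++i) {
            std::size_t j = probe(key, i);
            Slot& s = table_[j];
            if (s.state == State::Occupied) {
                if (s.key == key) { s.value = value; return true; }  // overwrite
            } else {
                if (!free_slot) free_slot = j;       // remember first reusable slot
                if (s.state == State::Empty) break;  // key cannot be further along
            }
        }
        if (!free_slot) return false;
        table_[*free_slot] = Slot{State::Occupied, key, value};
        return true;
    }

    std::optional<std::string> search(const std::string& key) const {
        for (std::size_t i = 0; i < table_.size(); ++i) {
            const Slot& s = table_[probe(key, i)];
            if (s.state == State::Empty) return std::nullopt;  // stop only at Empty
            if (s.state == State::Occupied && s.key == key) return s.value;
            // Deleted or a different key: keep probing.
        }
        return std::nullopt;
    }

    bool erase(const std::string& key) {
        for (std::size_t i = 0; i < table_.size(); ++i) {
            Slot& s = table_[probe(key, i)];
            if (s.state == State::Empty) return false;
            if (s.state == State::Occupied && s.key == key) {
                s.state = State::Deleted;  // leave a marker instead of Empty
                return true;
            }
        }
        return false;
    }

private:
    std::size_t probe(const std::string& key, std::size_t i) const {
        return (std::hash<std::string>{}(key) + i) % table_.size();
    }

    std::vector<Slot> table_;
};
```

If erase reset the slot to Empty, a later search for a key that had probed past that slot would stop too early and wrongly report the key as missing.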
Three typical choices for probe sequence
 Linear probing -- $h(\text{key}, i) = (h'(\text{key}) + c\,i) \bmod m$
   Good hash function h’, and c relatively prime to m (why?)
   Causes primary clustering problem
   Widely used due to excellent cache usage
 Quadratic probing -- $h(\text{key}, i) = (h'(\text{key}) + c_1 i + c_2 i^2) \bmod m$
   $c_2 \ne 0$ is an auxiliary constant
 Double hashing -- $h(\text{key}, i) = (h_1(\text{key}) + i \cdot h_2(\text{key})) \bmod m$
   Need $h_2(\text{key})$ relatively prime to m
   E.g., $m = 2^k$ for some k, and $h_2(\text{key})$ always odd
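The three probe sequences, plus a helper that checks the permutation property for one key, can be sketched as below. The initial hash values are passed in as plain numbers, and all names are illustrative.

```cpp
#include <cstddef>
#include <vector>

std::size_t linear_probe(std::size_t h0, std::size_t i, std::size_t c, std::size_t m) {
    return (h0 + c * i) % m;
}

std::size_t quadratic_probe(std::size_t h0, std::size_t i, std::size_t c1,
                            std::size_t c2, std::size_t m) {
    return (h0 + c1 * i + c2 * i * i) % m;
}

std::size_t double_hash_probe(std::size_t h1, std::size_t h2, std::size_t i, std::size_t m) {
    return (h1 + i * h2) % m;
}

// True if probe(0), ..., probe(m-1) visits every slot exactly once,
// i.e. the probe sequence is a permutation of {0, ..., m-1}.
template <typename Probe>
bool visits_all_slots(Probe probe, std::size_t m) {
    std::vector<bool> seen(m, false);
    for (std::size_t i = 0; i < m; ++i) {
        std::size_t j = probe(i);
        if (seen[j]) return false;
        seen[j] = true;
    }
    return true;
}
```

With m = 8, a linear step c = 1 or a double-hashing step h2 = 3 reaches all 8 slots, while c = 2 or h2 = 4 shares a factor with m and keeps revisiting the same few slots. That is the reason for the "relatively prime to m" conditions.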
Analysis (α < 1)
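 Expected # of probes in an unsuccessful search (assuming uniform hashing): at most $\frac{1}{1-\alpha}$
 Insertion on average takes time $O\left(\frac{1}{1-\alpha}\right)$
 Expected # of probes in a successful search: at most $\frac{1}{\alpha}\ln\frac{1}{1-\alpha}$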
Cuckoo Hashing
- Rasmus Pagh & Flemming Friche Rodler, 2001
- A variant of open addressing
- Does not use perfect hashing
- Time:
  - O(1)-lookup time in the worst-case
  - O(1)-amortized insertion time
- Space:
  - 3 words per key like BSTs
- Very competitive in practice
Cuckoo Hashing – Basic Idea
[Figure: cuckoo hashing with two hash functions h1 and h2 and the keys HQN, Karp, Levin, Knuth, Cantor, Dijkstra and Turing. Rehash! (pick new & random h1, h2)]
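A compact sketch of a cuckoo hash set follows. It assumes two tables, seeded 64-bit FNV-1a as h1 and h2, a limit of 32 evictions before rehashing, and tables that double in size on rehash. The limit, the doubling and all names are choices made for this sketch, not details from the slides.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <optional>
#include <random>
#include <string>
#include <utility>
#include <vector>

class CuckooSet {
public:
    explicit CuckooSet(std::size_t slots_per_table = 8)
        : t1_(slots_per_table), t2_(slots_per_table), rng_(2012) {
        pick_seeds();
    }

    // Worst-case O(1): a key can only live in one of two slots.
    bool contains(const std::string& key) const {
        return t1_[slot(key, seed1_)] == key || t2_[slot(key, seed2_)] == key;
    }

    void insert(std::string key) {
        if (contains(key)) return;
        for (;;) {
            for (int kick = 0; kick < 32; ++kick) {
                auto& s1 = t1_[slot(key, seed1_)];
                if (!s1) { s1 = std::move(key); return; }
                std::swap(key, *s1);  // evict the occupant of table 1
                auto& s2 = t2_[slot(key, seed2_)];
                if (!s2) { s2 = std::move(key); return; }
                std::swap(key, *s2);  // evict the occupant of table 2
            }
            rehash();  // too many evictions: new h1, h2, then retry this key
        }
    }

private:
    using Table = std::vector<std::optional<std::string>>;

    // Seeded 64-bit FNV-1a; the seed selects the hash function.
    std::size_t slot(const std::string& key, std::uint64_t seed) const {
        std::uint64_t h = 14695981039346656037ULL ^ seed;
        for (unsigned char c : key) { h ^= c; h *= 1099511628211ULL; }
        return h % t1_.size();
    }

    void pick_seeds() { seed1_ = rng_(); seed2_ = rng_(); }

    // Collects all stored keys, doubles both tables, picks new h1 and h2,
    // and reinserts everything.
    void rehash() {
        std::vector<std::string> keys;
        for (Table* t : {&t1_, &t2_}) {
            for (auto& s : *t) {
                if (s) keys.push_back(std::move(*s));
            }
        }
        std::size_t m = 2 * t1_.size();
        t1_.assign(m, std::nullopt);
        t2_.assign(m, std::nullopt);
        pick_seeds();
        for (auto& k : keys) insert(std::move(k));
    }

    Table t1_, t2_;
    std::mt19937_64 rng_;
    std::uint64_t seed1_ = 0, seed2_ = 0;
};
```

Lookup never probes more than two slots. All the work is shifted to insertion, which may evict a chain of keys and occasionally rebuild both tables.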