Hash Tables
Hash tables
Dealing with raw bytes
Some probabilistic analysis
Motivations
• Balanced search trees
– Store (key, value)-pairs
– O(log n)-time search, insert, delete, max, min
– O(log n + |output|)-time range query
– Relatively complex implementations
• Can we improve running times for basic
operations?
5/28/2016
CSE 250, Fall 2012, SUNY Buffalo, @Hung Q. Ngo
Example
• Given a set of (key, value)-pairs
– Keys are in the set {0, 1, 2, 3, …, n-1}
– Values could be anything
• Store them in an array (direct access table)
    Index:  0    1     2    3     …  n-2    n-1
    Value:  v0   NULL  v2   NULL  …  vn-2   vn-1
• Search/insert/delete takes O(1)-time
What are the drawbacks?
• We are unlikely to see such a perfect key set in practice
– (Key, Value) = (English Word, Meaning)
– (Key, Value) = (function, address)
– (Key, Value) = (URL, IP address)
• Thus there’ll be lots of NULL entries
– Wastes lots of space because n >> # pairs
– Say keys are 8-byte integers, so n = 2^64
• Can’t do range-query efficiently
-
Dictionary
Reserved words in a programming language
Command names in an OS
File names on a CD-ROM
Lazy array & Minimal Perfect hashing
THE STATIC KEY SET CASE
Static Key Set
• Wanted: data structure for an online dictionary
• With what we know so far, what are the choices?
– Use a balanced BST such as RB tree, AVL tree
– Randomize the keys and insert one by one in a
normal BST or a splay tree
– Sort the keys, use binary search
• Sorting + binary search is the best of the three
options
– Search still takes O(log n)-time
– Can we do better?
A Wild Solution
• Every key is a series of 0s and 1s
• There is always some integer m such that, for all
practical purposes
– A key K is an (≤ m)-bit number bin-rep(K)
– E.g. in ASCII, bin-rep(“cse250”) = 0x637365323530
– Use this m-bit number as an index to an array of values
• What’s the longest non-technical English word?
– Floccinaucinihilipilification (29 characters)
• What’s the longest technical English word?
– Pneumonoultramicroscopicsilicovolcanoconiosis (a
disease)
The solution is wild in at least two ways
• So, say m = 8 × 30 = 240 bits
– Need an array A of 2^240 ≈ 1.77 × 10^72 elements
• Even if we had that much memory space, there is still one major problem
– A[x] must be initialized to NULL for all x from 0 to 2^240 - 1
– NULL is just 0
– Dictionary has n = 150000 = 15 × 10^4 words, say
– Initializing the data structure takes ≥ n^13 steps
– The O(n log n) sorting + binary search looks great!
The lazy array data structure
• Sequentially read inputs into an array Dict
of size n
– Dict[i] is the i’th (word, value) pair
• Insert all words into huge array A
– For (i=0; i<n; ++i) A[Dict[i].word] = i
• search(x): (x is a word)
– If (0 ≤ A[x] ≤ n-1 && Dict[A[x]].word == x)
• Return Dict[A[x]].value
– Else
• Return false
• You can even delete(x) in O(1)-time
– Just set Dict[A[x]] = NULL
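The lazy array idea can be tried at a small scale. The sketch below is only an illustration under stated assumptions: keys are small integers standing in for bin-rep(word), the names LazyArray and Entry are made up, and the uninitialized array A is simulated by filling it with an arbitrary garbage value (reading truly uninitialized memory is undefined behavior in C++, and std::vector initializes its storage anyway, so the sketch does not save the initialization time). It needs C++17 for std::optional.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical names, for illustration only.
struct Entry {
    std::uint32_t key;    // stands in for bin-rep(word)
    std::string value;
};

class LazyArray {
public:
    // universe: number of possible keys; garbage: arbitrary initial content of A.
    LazyArray(std::size_t universe, std::uint32_t garbage)
        : A(universe, garbage) {}

    void insert(std::uint32_t key, const std::string& value) {
        assert(key < A.size());
        A[key] = static_cast<std::uint32_t>(dict.size());  // A[key] = index into Dict
        dict.push_back({key, value});
    }

    // A[key] is trusted only if it points into Dict AND Dict points back at key.
    std::optional<std::string> search(std::uint32_t key) const {
        assert(key < A.size());
        std::uint32_t i = A[key];
        if (i < dict.size() && dict[i].key == key) return dict[i].value;
        return std::nullopt;
    }

private:
    std::vector<std::uint32_t> A;   // the huge array; its initial content is never relied on
    std::vector<Entry> dict;        // the n pairs actually stored
};
```

With garbage value 2 and three inserted keys, a lookup of a key that was never inserted finds A[key] = 2 "by accident", but the check against Dict[2] rejects it.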
Lazy array, an illustration
[Figure: the array Dict holds
  0: (“UB”, “University at Buffalo”)
  1: (“Mark Twain”, “Great writer”)
  2: (“cse250”, “boring course”)
  3: NULL
  …
  n-1: (“Matrix”, “Best Scifi Movie”)
The huge array A holds A[0x4D6174726978] = n-1 (“Matrix”), A[0x637365323530] = 2 (“cse250”), and A[0x637365323531] = 2 (“cse251”, set to 2 by accident). The search for “cse251” still fails because Dict[2].word is “cse250”.]
Major drawback and an inspiration
• Used a humongous amount of space
– n << 2^(# bits to represent the longest word)
• However, if there was a function h
– 0 ≤ h(word) ≤ n-1
– For any two words x & y, h(x) ≠ h(y)
• Then, we’re (almost) in good shape
Sample Input, n=6
Key        Hash code (using ASCII), in hex   Index   Value
“Adam”     4164616D                          0       …
“Ashley”   4173686C6579                      1       …
“Daniel”   44616E69656C                      2       …
“Kayla”    4B61796C61                        3       …
“Mike”     4D696B65                          4       …
“Troy”     54726F79                          5       …
Function h(hash_code) → {0,1,2,3,4,5}
What was that function h?
int h(int a) {
    a = (((a%256)%100)%41)%10;
    a = (a*a)%14;
    return (a>2) ? a-4 : a;
}
Took me ½ hour to come up with that
Minimal Perfect Hash Function
• S: the set of n (hash codes of) keys
– S = {0x4164616D, 0x4173686C6579, 0x44616E69656C, 0x4B61796C61, 0x4D696B65, 0x54726F79} in the example above
• h: S → {0,1,…,n-1} is a MPHF if it is a bijection
• We want to
– Find such a function h (in a short amount of time)
– Maybe store the function … in a data structure!
– Evaluate h(code) in O(1)-time
• Possible, but a little bit complicated
-
Hashing first proposed by Arnold Dumey (1956)
Hash codes
Chaining
Open addressing, linear probing, quadratic probing
HASHING – GENERAL IDEAS
Top level view
[Figure: arbitrary objects (strings, doubles, ints), of which only n are actually used → hash code → int with wide range → compression function → {0,1,…,m-1}. The composition h(object) maps an object to one of m indices; we will also call this the hash function.]
Good Hash Function
• If key1 ≠ key2, then it’s extremely unlikely that
h(key1) = h(key2)
– Collision problem!
• Constructing the function h takes little time
• Given key, computing h(key) takes O(|key|)-time
Collision is simply unavoidable
• Pigeonhole principle
– K+1 pigeons, K holes ⇒ at least one hole with ≥ 2 pigeons
• There are many more objects in the universe than m
– Object set = set of strings of length ≤ 30 characters
– Object set = set of possible URLs
– Object set = set of possible file names on a CD-ROM
– While the range size m is something like a few hundred thousand or less
Hash codes for int-style types
• Say we want hash codes that map to a (4-byte) int
• Easy when objects =
– short, or int, or char or unsigned char
– Simply cast them to uint32_t
• What about when objects =
– long int (8 byte integers)
[Figure: the 8 bytes x0, x1, …, x7 of a long int must be mapped to the 4 bytes y0, y1, y2, y3 of the hash code.]
Casting Down, Lose Information
#include <cstdint>
#include <iostream>
using namespace std;

// Hash code by truncation: keeps only the low-order 4 bytes.
uint32_t hash_code1(uint64_t a) {
    return static_cast<uint32_t>(a);
}

int main() {
    // The two keys differ only in their high-order 4 bytes.
    uint64_t a = 0x8888888877777777;
    uint64_t b = 0x1111111177777777;
    cout << hex << a << " converted to " << hash_code1(a) << endl;
    cout << hex << b << " converted to " << hash_code1(b) << endl;
    return 0;
}
8888888877777777 converted to 77777777
1111111177777777 converted to 77777777
Drawback of Casting from Long to Int
• We ignore the first 4 bytes of information
• If key1 and key2 differ only in the first 4 bytes,
they will collide!
• On the other hand, if keys are uniformly
distributed, we are OK.
• Could also sum 1st 4 bytes with 2nd 4 bytes
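The last suggestion, summing the two 4-byte halves, can be sketched as follows. The function names are hypothetical, and the 8-byte key is assumed to be a uint64_t.

```cpp
#include <cassert>
#include <cstdint>

// Hash code by truncation: keeps only the low-order 4 bytes.
std::uint32_t hash_code_cast(std::uint64_t a) {
    return static_cast<std::uint32_t>(a);
}

// Hash code by folding: adds the high 4 bytes to the low 4 bytes.
// Unsigned arithmetic is modulo 2^32, so an overflow simply wraps around.
std::uint32_t hash_code_sum(std::uint64_t a) {
    std::uint32_t hi = static_cast<std::uint32_t>(a >> 32);
    std::uint32_t lo = static_cast<std::uint32_t>(a);
    return hi + lo;
}
```

For the two keys 0x8888888877777777 and 0x1111111177777777, truncation gives the same code for both, while folding gives 0xFFFFFFFF and 0x88888888. Folding still collides on other inputs, for example when the two halves are swapped.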
Hash codes for strings & variable length objects
• Say we have a universe of character array
objects
– “Computer Science”
– “Floccinaucinihilipilification”
– “Alan Turing”
–…
• How do we produce 4-byte hash codes for
them?
Hash codes for strings (or byte-sequences)
• Add up the characters
• XOR 4 bytes at a time
• Polynomial hash codes
• Shifting hash codes
• FNV hash
• MurmurHash
• Etc.
• Important Lesson: data-dependency!
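Three of the hash codes listed above can be sketched in a few lines each. This is a minimal illustration, not the code used for the course experiments: the function names are made up, and the FNV variant is assumed to be 32-bit FNV-1a, since the slide does not say which variant is meant.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sum of the bytes: cheap, but any two anagrams collide.
std::uint32_t sum_hash(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h += c;
    return h;
}

// Polynomial hash code: the bytes are coefficients of a polynomial
// evaluated at `base` (31 and 33 are classic choices), modulo 2^32.
std::uint32_t poly_hash(const std::string& s, std::uint32_t base = 31) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h = h * base + c;   // Horner's rule
    return h;
}

// 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime.
std::uint32_t fnv1a_hash(const std::string& s) {
    std::uint32_t h = 2166136261u;                // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;                           // FNV prime
    }
    return h;
}
```

The data-dependency lesson shows up already on tiny inputs: sum_hash gives "ab" and "ba" the same code, while poly_hash gives 3105 and 3135.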
Some experimental results
Hash code function into uint32_t   # of collisions   Max bucket size
Sum                                1730769           175
Xor                                583               3
Shift7                             56                2
Poly31                             22                2
Poly33                             22                2
FNV                                0                 1
FNV is widely used, in DNS & Twitter, for example
Compression functions
• 4-byte hash codes can’t be used as indices
– 2^32 = 4,294,967,296 ≈ 4 × 10^9 is too many
• Store n entries, need indices in {0,1,…,m-1}
– m should be close to n (say n = 50K, m = 60K)
• Compression function
– f: uint32_t → {0,1,…,m-1}
– Division method
– Multiplication method
– Universal hashing
Compression functions are hash functions, and thus these methods can be used to design hash codes too!
Hash function design problem
• Universe U = all uint32_t integers
• S, an unknown subset of n members of U
• Find f : U → {0,1,…,m-1}
– Computing f(u) is fast
– Minimize collisions
• Note: suppose |U| > m ≥ n
– For a fixed S, there always exists f with no collisions
– For a fixed f, there always exists S with lots of collisions
• If S’s distribution is truly arbitrary
– The best f is such that f(s) is uniformly distributed on
{0…m-1}
– Ball-into-Bins model: Throw n “balls” randomly into m
“bins”
An analogy – the birthday problem
• U = 7 billion people in the world
• S = set of students in this room
• f maps students to birthdates {Jan 01, …, Dec
31}
– So m = 365 (forget leap years)
• Question:
– If S is chosen randomly from U, how large must S be
until it is more likely to have a collision than not?
• This is called the birthday “paradox”
Birthday paradox
• Say there are n students in this room
• Prob[1st student does not “collide”] = 1
• Prob[2nd student does not “collide”] = 1-1/m
• Prob[3rd student does not “collide” | first two didn’t collide] = 1-2/m
• …
• Overall probability of no collision is
(1-1/m)(1-2/m)…(1-(n-1)/m) < ½
when n=23 and m = 365
• When n=30, Prob[no collision] ≈ 30%
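The product above is easy to evaluate numerically. A minimal sketch, with a hypothetical function name:

```cpp
#include <cassert>

// Probability that n keys thrown uniformly at random into m slots are all
// distinct: (1 - 1/m)(1 - 2/m)...(1 - (n-1)/m).
double prob_no_collision(int n, int m) {
    assert(n >= 0 && m > 0);
    if (n > m) return 0.0;                 // pigeonhole: a collision is certain
    double p = 1.0;
    for (int i = 1; i < n; ++i) p *= 1.0 - static_cast<double>(i) / m;
    return p;
}
```

With m = 365, the value first drops below ½ at n = 23 (it is still above ½ at n = 22), and at n = 30 it is a little over 29%.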
Rarity of Minimal Perfect Hash Function
• Consider |U| = N, m = n, MPHF is a bijection
• Number of functions from U to {0,1,…,n-1} is n^N
• For a fixed (but unknown) S of size n
– The number of MPHFs for S is n! · n^(N-n)
– Hence, the fraction of functions which are MPHFs is n! · n^(N-n) / n^N = n!/n^n
– When n = 10, the ratio is 0.00036…
– When n = 20, the ratio is 2.32 × 10^-8
Division method
• The division method compresses by taking the remainder: f(u) = u mod m
• How does this function perform for different m?
• The answer depends a lot on the distribution
of S in the universe U
m from 50K to 60K
• Total # collisions: 19K
• Max bucket size 6-8, typically
• Recall n ≈ 47K
• Could we have guessed this result without
coding?
– Something in the spirit of the birthday paradox?
– Motto: Think, then code!
Balls into Bins
• Throw n balls into m bins randomly
• Probability a given bin is empty is
(1-1/m)^n ≈ e^(-n/m)
• Expected number of empty bins is m·e^(-n/m)
• It can be shown mathematically that on average, when m ≈ n, the maximum bin size is about ln n / ln ln n
These estimates are incredibly good!
• n = 47000, m = 50000
• m·e^(-n/m) ≈ 19000
• And
• You can repeat the experiment with m ≈ 100K
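The experiment is easy to repeat in simulation. The sketch below throws n balls into m bins with a seeded pseudo-random generator and reports the number of empty bins and the largest bin, for comparison with the estimate m·e^(-n/m). The names BinStats, throw_balls, and expected_empty_bins are made up for this illustration, and the random throws stand in for a real hash function applied to real keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

struct BinStats {
    int empty_bins;    // bins that received no ball
    int max_bin_size;  // size of the fullest bin
};

// Throws n balls into m bins using a seeded Mersenne Twister.
BinStats throw_balls(int n, int m, std::uint32_t seed) {
    assert(n >= 0 && m > 0);
    std::mt19937 rng(seed);
    std::vector<int> bins(m, 0);
    for (int i = 0; i < n; ++i) ++bins[rng() % m];  // modulo bias is negligible here
    BinStats s{0, 0};
    for (int b : bins) {
        if (b == 0) ++s.empty_bins;
        s.max_bin_size = std::max(s.max_bin_size, b);
    }
    return s;
}

// The estimate from the balls-into-bins analysis: m * e^(-n/m) empty bins.
double expected_empty_bins(int n, int m) {
    return m * std::exp(-static_cast<double>(n) / m);
}
```

Calling throw_balls(47000, 50000, seed) for a few seeds and comparing empty_bins with expected_empty_bins(47000, 50000) shows how tightly the count concentrates around the estimate.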
Multiplication method – slightly better!
Golden ratio
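One common formulation of the multiplication method, assumed here because the text gives only the name and the golden-ratio hint, is Knuth's: f(u) = floor(m · frac(u · A)) with A = (√5 - 1)/2 ≈ 0.618. The sketch below computes it in 32-bit fixed point; the function names are hypothetical, and the division method f(u) = u mod m is included for comparison.

```cpp
#include <cassert>
#include <cstdint>

// Multiplication method in 32-bit fixed point (assumed formulation):
//   f(u) = floor(m * frac(u * A)),  A = (sqrt(5) - 1) / 2
std::uint32_t compress_mul(std::uint32_t u, std::uint32_t m) {
    assert(m > 0);
    const std::uint32_t A32 = 2654435769u;   // floor(A * 2^32)
    std::uint32_t frac = u * A32;            // frac(u * A) scaled by 2^32 (wraps mod 2^32)
    // Multiply the fraction by m and keep the integer part.
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(frac) * m) >> 32);
}

// Division method, for comparison: f(u) = u mod m.
std::uint32_t compress_div(std::uint32_t u, std::uint32_t m) {
    assert(m > 0);
    return u % m;
}
```

For example, with m = 1000 the keys 1 and 2 map to 618 and 236, the first three decimals of frac(A) and frac(2A). Unlike the division method, the result does not depend on m being prime.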
Universal Hashing
• An adversary can always pick a key set S which creates Θ(n) collisions
– Denial of Service attack (more later!)
• Universal hashing approach
– Design a family H of hash functions such that for any k ≠ k’, Prob[h(k) = h(k’)] ≤ 1/m when h is drawn uniformly at random from H
– Pick a hash function h in H uniformly at random
– Note that the key set is chosen by the adversary
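A classic universal family is h_{a,b}(k) = ((a·k + b) mod p) mod m with p prime, a drawn from {1,…,p-1} and b from {0,…,p-1}. The sketch below is one possible realization, not the course's code: the class name is made up, p is fixed to 2^31 - 1, and keys are assumed to be smaller than p.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One member h_{a,b}(k) = ((a*k + b) mod p) mod m of the family.
// Assumption of this sketch: keys satisfy k < p, with p = 2^31 - 1 (a prime).
class UniversalHash {
public:
    static constexpr std::uint64_t P = 2147483647ULL;  // 2^31 - 1

    // Picks a in [1, p-1] and b in [0, p-1] at random.
    UniversalHash(std::uint32_t m, std::mt19937_64& rng)
        : m_(m), a_(1 + rng() % (P - 1)), b_(rng() % P) {
        assert(m > 0);
    }

    std::uint32_t operator()(std::uint32_t k) const {
        assert(k < P);
        // a*k < 2^62, so the sum fits in 64 bits without overflow.
        return static_cast<std::uint32_t>(((a_ * k + b_) % P) % m_);
    }

private:
    std::uint64_t m_, a_, b_;
};
```

The random choice happens once, when the table is created. The adversary may know the family but not the chosen a and b, so no fixed key set is bad for every member of the family.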
Theoretical Results
• α = n/m is the load factor of the hash table
• Expected bucket size is at most 1 + α
• Fact: when the universal family is n-independent, its functions behave almost as if we throw balls randomly and independently into bins
Additional notes
• Table size m: should choose a prime for mod
compression
– If it’s even, even % even = even
– Objects in computer memory often start with even
address
– If it’s a power of 2 then we effectively mod out the
high-order bits
• Lost the relative order of keys
– Can’t answer queries such as:
• “what are the keys (& associated values) in between 3 and
432?”
• “list the smallest k keys”
– BTW, how do we answer such queries with a BST?
Separate chaining
Open addressing
Cuckoo hashing
COLLISION RESOLUTION
Separate Chaining
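[Figure: a table with indices 0 to 4; each slot holds a pointer to a chain (linked list) of the keys that hash to that index. The keys Turing, Cantor, Knuth, Karp, and Dijkstra are stored in the chains.]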
Performance
• Under the simple uniform hashing assumption
– i.e., each object is hashed into a bucket with probability 1/m, uniformly and independently of the other objects
• Expected search time Θ(1+α)
• Worst-case search time Θ(n), though very unlikely
• Using universal hashing, expected time for n operations is Θ(n) (when m is proportional to n)
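A separate-chaining table can be sketched in a few dozen lines. The class below is a hypothetical minimal version (string keys, int values, a fixed number of buckets, no resizing); it uses std::hash as the hash code and the division method as the compression function, and needs C++17 for std::optional.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal table: string keys, int values, m fixed buckets.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t m) : buckets_(m) { assert(m > 0); }

    void insert(const std::string& key, int value) {
        auto& chain = buckets_[index(key)];
        for (auto& kv : chain)
            if (kv.first == key) { kv.second = value; return; }  // overwrite
        chain.emplace_back(key, value);
        ++size_;
    }

    std::optional<int> search(const std::string& key) const {
        for (const auto& kv : buckets_[index(key)])   // walk one chain only
            if (kv.first == key) return kv.second;
        return std::nullopt;
    }

    bool remove(const std::string& key) {
        auto& chain = buckets_[index(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it)
            if (it->first == key) { chain.erase(it); --size_; return true; }
        return false;
    }

    double load_factor() const {                      // alpha = n / m
        return static_cast<double>(size_) / buckets_.size();
    }

private:
    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }
    std::vector<std::list<std::pair<std::string, int>>> buckets_;
    std::size_t size_ = 0;
};
```

Every operation touches a single chain, which is where the Θ(1+α) expected cost comes from: one hash evaluation plus a walk over a chain of expected length α.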
Denial of Service Attacks
• http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.html
• http://www.ocert.org/advisories/ocert-2011-003.html
• http://permalink.gmane.org/gmane.comp.security.full-disclosure/83694
• “Hash tables are a commonly used data structure in most programming
languages. Web application servers or platforms commonly parse
attacker-controlled POST form data into hash tables automatically, so
that they can be accessed by application developers. If the language
does not provide a randomized hash function or the application server
does not recognize attacks using multi-collisions, an attacker can
degenerate the hash table by sending lots of colliding keys. The
algorithmic complexity of inserting n elements into the table then goes to
O(n**2), making it possible to exhaust hours of CPU time using a single
HTTP request.”
BTW
• You can do Separate Treeing too!
• They don’t teach that in school
• What’s the performance of your hash table
then?
• What’s the drawback?
Open Addressing
• Store all entries in the hash table itself, no
pointer to the “outside”
• Advantage
– Less space waste
– Perhaps good cache usage
• Disadvantage
– More complex collision resolution
– Slower operations
Open Addressing
[Figure: a table with indices 0 to 7 storing the keys Turing, Cantor, Knuth, Karp, and Dijkstra directly in its slots. Arrows show the probes h(“Knuth”, 0), h(“Knuth”, 1); h(“Karp”, 0), h(“Karp”, 1); and h(“Dijkstra”, 0), h(“Dijkstra”, 1), h(“Dijkstra”, 2).]
Open Addressing Scheme
• Instead of
h : U → {0,1,…,m-1}
e.g. h(key) = 3
• We use an extended hash function which defines a probe sequence
h : U × {0,1,…,m-1} → {0,1,…,m-1}
e.g. h(key, 0) = 5
h(key, 1) = 9
h(key, 2) = 7
…
Insert Algorithm
int i;
for (i = 0; i < m; i++) {
    j = h(key, i);               // i-th slot in the probe sequence of key
    if (Table[j] == NULL) {      // first empty slot found
        Table[j] = entry;        // insert entry
        break;
    }
}
if (i == m) report error "hash table overflow";
Desirable Property of Probe Sequence
• For any key, h(key, 0), …, h(key, m-1) is a
permutation of the set {0,1, …, m-1}
• What happens if the property does not hold?
• How do we search, BTW?
Delete
• Find where the key is
• Can’t simply remove and set the entry to NULL
– Why?
• One solution
– Set deleted entry to be a special DELETED object
– Modify insert so that new object replaces a
DELETED entry as well as a NULL entry
– When searching, pass over DELETED entries: don’t stop!
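Insert, search, and delete with a DELETED marker fit together as in the following sketch. It is a hypothetical minimal table, not the course's implementation: keys are uint32_t, values are int, the probe sequence is linear probing with h’(key) = key mod m, and the table never grows. It needs C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Open addressing with linear probing, h(key, i) = (h'(key) + i) mod m.
class ProbedTable {
    enum class State { Empty, Full, Deleted };       // Deleted = the DELETED marker
    struct Slot { State state = State::Empty; std::uint32_t key = 0; int value = 0; };

public:
    explicit ProbedTable(std::size_t m) : slots_(m) { assert(m > 0); }

    bool insert(std::uint32_t key, int value) {
        std::size_t m = slots_.size(), free_slot = m;
        for (std::size_t i = 0; i < m; ++i) {
            Slot& s = slots_[probe(key, i)];
            if (s.state == State::Full) {
                if (s.key == key) { s.value = value; return true; }  // overwrite
            } else {
                if (free_slot == m) free_slot = probe(key, i);  // first reusable slot
                if (s.state == State::Empty) break;  // key cannot be further along
            }
        }
        if (free_slot == m) return false;            // hash table overflow
        slots_[free_slot] = {State::Full, key, value};
        return true;
    }

    std::optional<int> search(std::uint32_t key) const {
        std::size_t i = find(key);
        if (i == slots_.size()) return std::nullopt;
        return slots_[i].value;
    }

    bool remove(std::uint32_t key) {
        std::size_t i = find(key);
        if (i == slots_.size()) return false;
        slots_[i].state = State::Deleted;            // do not reset to Empty
        return true;
    }

private:
    std::size_t probe(std::uint32_t key, std::size_t i) const {
        return (key % slots_.size() + i) % slots_.size();
    }
    // Returns the index of the slot holding key, or m if the key is absent.
    std::size_t find(std::uint32_t key) const {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const Slot& s = slots_[probe(key, i)];
            if (s.state == State::Empty) break;      // stop only at a truly empty slot
            if (s.state == State::Full && s.key == key) return probe(key, i);
        }
        return slots_.size();
    }
    std::vector<Slot> slots_;
};
```

With m = 8, the keys 1, 9, and 17 all start probing at slot 1 and end up in slots 1, 2, and 3. After removing 9, a search for 17 must walk over the DELETED slot 2; had slot 2 been reset to empty, the search would stop there and miss 17.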
Three typical choices for probe sequence
• Linear probing -- h(key, i) = h’(key) + c·i (mod m)
– Good hash function h’, and c relatively prime to m (why?)
– Causes primary clustering problem
– Widely used due to excellent cache usage
• Quadratic probing -- h(key, i) = h’(key) + c1·i + c2·i^2 (mod m)
– c2 ≠ 0 is an auxiliary constant
• Double hashing -- h(key, i) = h1(key) + i*h2(key) (mod m)
– Need h2(key) relatively prime to m
– E.g., m = 2^k for some k, and h2(key) always odd
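The requirement that the step be relatively prime to m can be checked directly. The sketch below generates the probe sequence (h1 + i·h2) mod m and tests whether it visits every slot exactly once. The function names are made up, and h1 and h2 are passed in as plain numbers rather than computed from a key. Linear probing is the special case h2 = c.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Probe sequence h(key, i) = (h1 + i * h2) mod m for i = 0..m-1,
// where h1 = h1(key) and h2 = h2(key) are given directly.
std::vector<std::size_t> probe_sequence(std::size_t h1, std::size_t h2, std::size_t m) {
    assert(m > 0);
    std::vector<std::size_t> seq(m);
    for (std::size_t i = 0; i < m; ++i) seq[i] = (h1 + i * h2) % m;
    return seq;
}

// True if seq visits every slot 0..m-1 exactly once.
bool is_permutation_of_slots(const std::vector<std::size_t>& seq) {
    std::vector<bool> seen(seq.size(), false);
    for (std::size_t j : seq) {
        if (j >= seq.size() || seen[j]) return false;
        seen[j] = true;
    }
    return true;
}
```

With m = 8, an odd step such as 5 yields a permutation of all eight slots, while the step 4 only ever visits two slots, so an insert could report overflow on a nearly empty table.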
Analysis (α < 1)
• Expected # of probes in an unsuccessful search is at most 1/(1-α) (assuming uniform hashing)
• Insertion on average takes time O(1/(1-α))
• Expected # of probes in a successful search is at most (1/α) ln(1/(1-α))
Cuckoo Hashing
- Rasmus Pagh & Flemming Friche Rodler, 2001
- A variant of open addressing
- Does not use perfect hashing
- Time:
- O(1)-lookup time in the worst-case
- O(1)-amortized insertion time
- Space:
- 3 words per key like BSTs
- Very competitive in practice
Cuckoo Hashing – Basic Idea
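[Figure: two tables with hash functions h1 and h2 holding the keys Karp, HQN, Levin, Knuth, Cantor, Dijkstra, and Turing. An inserted key evicts the key occupying its slot, and the evicted key moves to its slot in the other table, possibly evicting another key. If the evictions do not stop: Rehash! (pick new & random h1, h2).]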
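The basic idea can be sketched as a small set of uint32_t keys. This is an illustration under assumptions, not the algorithm exactly as published: the class name, the kick limit, the growth rule, and the hash family (multiply by a random odd number and keep the high bits) are all choices made for this sketch. It needs C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <random>
#include <utility>
#include <vector>

// Cuckoo hash set: two tables of 2^bits slots, two hash functions h1 and h2.
class CuckooSet {
    using Slot = std::optional<std::uint32_t>;   // an empty optional is an empty nest

public:
    explicit CuckooSet(unsigned log2_m = 4, std::uint64_t seed = 1)
        : bits_(log2_m), rng_(seed) {
        assert(log2_m >= 1 && log2_m < 31);
        reset_tables();
    }

    // Worst-case O(1): a key can only live in one of two slots.
    bool contains(std::uint32_t x) const {
        return t_[0][h(0, x)] == x || t_[1][h(1, x)] == x;
    }

    void insert(std::uint32_t x) {
        if (contains(x)) return;
        if (Slot homeless = place(x)) rehash(*homeless);
    }

private:
    std::size_t h(int which, std::uint32_t x) const {
        return static_cast<std::uint32_t>(x * mult_[which]) >> (32 - bits_);
    }

    // Places x, evicting residents back and forth between the two tables.
    // Returns the key left without a nest if the kick limit is reached.
    Slot place(std::uint32_t x) {
        const unsigned max_kicks = 8 * bits_ + 8;
        int which = 0;
        for (unsigned k = 0; k < max_kicks; ++k) {
            Slot& nest = t_[which][h(which, x)];
            if (!nest) { nest = x; return std::nullopt; }
            std::swap(x, *nest);    // x takes the nest; the old resident is evicted
            which = 1 - which;      // the evicted key tries its slot in the other table
        }
        return x;
    }

    // Rehash: pick new random h1, h2 (growing if needed) and reinsert all keys.
    void rehash(std::uint32_t homeless) {
        std::vector<std::uint32_t> keys{homeless};
        for (const auto& table : t_)
            for (const Slot& nest : table)
                if (nest) keys.push_back(*nest);
        while (2 * keys.size() > (std::size_t{1} << bits_)) ++bits_;
        bool ok = false;
        while (!ok) {
            reset_tables();
            ok = true;
            for (std::uint32_t k : keys)
                if (place(k)) { ok = false; break; }
        }
    }

    void reset_tables() {
        for (int i = 0; i < 2; ++i) {
            mult_[i] = static_cast<std::uint32_t>(rng_()) | 1u;  // random odd multiplier
            t_[i].assign(std::size_t{1} << bits_, Slot{});
        }
    }

    unsigned bits_;
    std::mt19937_64 rng_;
    std::uint32_t mult_[2];
    std::vector<Slot> t_[2];
};
```

Lookup inspects exactly two slots no matter how full the tables are. All the variable work is pushed into insert, whose occasional rehash is what the amortized O(1) bound accounts for.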