Hash Table

Hash table
• The goal
– O(1) runtime for:
• Insert
• Find
• Delete
• Can get O(log n) operations with binary search,
set, or map
• Utilizes a hash function
Create and Insert
Find
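The create, insert, find, and delete operations above can be sketched as a small separate-chaining table. This is a minimal illustration, not the course's implementation: the class name `ChainedHashTable`, the use of `std::hash`, and storing each bin as a `std::list` are all assumptions, and m is assumed to be greater than 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

// Hypothetical separate-chaining hash table of strings (illustrative sketch).
class ChainedHashTable {
public:
    // Create: allocate m empty bins (m must be > 0).
    explicit ChainedHashTable(std::size_t m) : bins_(m) {}

    // Insert: hash the key to one bin and append it if it is not already there.
    // Returns false when the key was already stored.
    bool insert(const std::string& key) {
        auto& bin = bins_[indexOf(key)];
        if (std::find(bin.begin(), bin.end(), key) != bin.end()) return false;
        bin.push_back(key);
        return true;
    }

    // Find: hash the key and scan only that one bin.
    bool find(const std::string& key) const {
        const auto& bin = bins_[indexOf(key)];
        return std::find(bin.begin(), bin.end(), key) != bin.end();
    }

    // Delete: remove the key from its bin; returns true if it was present.
    bool erase(const std::string& key) {
        auto& bin = bins_[indexOf(key)];
        auto it = std::find(bin.begin(), bin.end(), key);
        if (it == bin.end()) return false;
        bin.erase(it);
        return true;
    }

private:
    // Map a key to a bin index in [0, m).
    std::size_t indexOf(const std::string& key) const {
        return std::hash<std::string>{}(key) % bins_.size();
    }

    std::vector<std::list<std::string>> bins_;
};
```

Every operation touches a single bin, so its cost is proportional to that bin's size rather than to the number of stored items.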
We Want Uniformity
• Hash table is slow if the bins are large
• If the hash function produces random integers
– Values are spread throughout bins evenly
– Fast Find
Hash Function Performance
• n = 133910
– Number of English words in our dictionary
• m = 1024
– Number of bins in the hash table
• What makes a good hash function?
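One way to compare hash functions on a word list is to count how many words land in each bin and look at the largest bin. The helper names `binSizes` and `maxBinSize` are hypothetical, and this sketch assumes m > 0; it is not the code used to produce the measurements in these slides.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Count how many words land in each of the m bins under the given hash.
std::vector<std::size_t> binSizes(
        const std::vector<std::string>& words, std::size_t m,
        const std::function<std::size_t(const std::string&)>& hash) {
    std::vector<std::size_t> sizes(m, 0);
    for (const auto& w : words) ++sizes[hash(w) % m];
    return sizes;
}

// Largest bin: the most items a single find may have to scan.
std::size_t maxBinSize(const std::vector<std::size_t>& sizes) {
    return sizes.empty() ? 0 : *std::max_element(sizes.begin(), sizes.end());
}
```

A hash function that spreads words evenly keeps the largest bin close to n/m; a poor one piles many words into a few bins.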
ASCII % m
[Figure: "ASCII % m", number of items per bin; bin axis 1 to 1024, count axis 0 to 6000]
Sum of ASCII % m
[Figure: "Sum of ASCII % m", number of items per bin; bin axis 1 to 1024, count axis 0 to 700]
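A plausible reading of "Sum of ASCII % m" is: add up the character codes of the word, then reduce modulo m. The function name `sumAsciiHash` is hypothetical and the exact variant measured in the slides is not shown, so treat this as a sketch under that assumption.

```cpp
#include <cstddef>
#include <string>

// Assumed form of the "sum of ASCII % m" hash: add the character codes,
// then take the remainder modulo the number of bins m (m > 0).
std::size_t sumAsciiHash(const std::string& word, std::size_t m) {
    std::size_t sum = 0;
    for (unsigned char c : word) sum += c;
    return sum % m;
}
```

Because addition ignores order, anagrams such as "ab" and "ba" always collide. With m = 1024, a lowercase word of up to 8 letters sums to at most 8 × 122 = 976, so short words never wrap around the table.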
XOR << 1
[Figure: "XOR << 1", number of items per bin; bin axis 1 to 1024, count axis 0 to 350]
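The slides do not spell out the "XOR << 1" hash. One common form, assumed here, shifts the running value left by one bit and XORs in the next character. The name `shiftXorHash` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumed form of the "XOR << 1" hash: for each character, shift the
// running value left by one bit and XOR in the character code, then
// reduce modulo the number of bins m (m > 0).
std::size_t shiftXorHash(const std::string& word, std::size_t m) {
    std::size_t h = 0;
    for (unsigned char c : word) h = (h << 1) ^ c;
    return h % m;
}
```

Unlike a plain sum, the shift makes the result depend on character order, so "ab" and "ba" land in different bins here.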
Built-in hash function
[Figure: "C++ Hash", number of items per bin; bin axis 1 to 1024, count axis 0 to 250]
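The built-in hash is presumably `std::hash<std::string>` from `<functional>`; that is an assumption, since the slides only say "C++ Hash". A minimal sketch of mapping a word to one of m bins with it (the wrapper name `builtInHash` is hypothetical):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Map a word to a bin in [0, m) using the standard library's string hash.
// The raw hash value is implementation-defined, so bin numbers can differ
// between compilers and standard libraries.
std::size_t builtInHash(const std::string& word, std::size_t m) {
    return std::hash<std::string>{}(word) % m;
}
```

The same `std::hash` specialization is what `std::unordered_set<std::string>` and `std::unordered_map` use by default.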
More Bins
Usually want about as many bins as items
m = 131071
Comparison
• n = 133910
• m = 131071
• Hash table
– Max bin size is 7
– 1.6 average bin size
– Single Vector lookup with 1.6 items to search through
• Binary search/set/map: O(log n) runtime
– log₂(133910) ≈ 17
– 15 comparisons on average
– Access and compare 15 items
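The numbers in this comparison can be checked with a few lines of arithmetic. Note that n/m here is about 1.02, not 1.6. One reading of the 1.6 figure, which is an assumption and not stated on the slide, is that it averages over non-empty bins only; under random hashing that average is α / (1 − e^(−α)). The function names below are hypothetical.

```cpp
#include <cmath>

// Load factor alpha = n / m.
double loadFactor(double n, double m) { return n / m; }

// Expected average size of a non-empty bin under random hashing:
// n items divided by the m * (1 - e^{-alpha}) bins expected to be non-empty.
double expectedNonEmptyBinSize(double n, double m) {
    double alpha = loadFactor(n, m);
    return alpha / (1.0 - std::exp(-alpha));
}

// Approximate depth of a binary search over n sorted items.
double binarySearchDepth(double n) { return std::log2(n); }
```

For n = 133910 and m = 131071 these give a load factor of about 1.02, a non-empty bin average of about 1.6, and a binary search depth of about 17.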
Balls into Bins
• Throw n balls into m bins randomly
• Probability a given bin is empty is
$(1 - 1/m)^n \approx e^{-n/m}$
• Expected number of empty bins is $m e^{-n/m}$
• It can be shown mathematically that on
average, when m ≈ n, the maximum bin size is
about $\ln n / \ln \ln n$
5/28/2016
CSE 250, SUNY Buffalo, @Hung Q. Ngo
These estimates are incredibly good!
• n = 47000, m = 50000
• $m e^{-n/m} \approx 19500$
• And
• You can repeat the experiment with m ≈ 100K
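The experiment can be repeated with a short simulation: evaluate the empty-bin formula, then throw n balls into m bins and count the empty ones. This is a sketch under assumptions: the function names are hypothetical, and `rng() % m` with a seeded `std::mt19937` stands in for a truly random throw.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Formula for the expected number of empty bins: m * e^{-n/m}.
double expectedEmptyBins(double n, double m) { return m * std::exp(-n / m); }

// Throw n balls into m bins with a seeded generator and count empty bins.
std::size_t simulateEmptyBins(std::size_t n, std::size_t m, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<std::size_t> bins(m, 0);
    for (std::size_t i = 0; i < n; ++i) ++bins[rng() % m];

    std::size_t empty = 0;
    for (std::size_t count : bins) {
        if (count == 0) ++empty;
    }
    return empty;
}
```

For n = 47000 and m = 50000 the formula gives about 19531, and a simulated count should fall close to that value.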
Theoretical Results
• α = n/m is the load factor of the hash table
• Expected bucket size is at most 1 + α
• Fact: when the universal family is n-independent, its hash functions behave
almost as if we throw balls randomly and independently into
bins
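The 1 + α bound follows from a short calculation. It is sketched here for a stored key x, assuming only that any two distinct keys collide with probability at most 1/m, which is the defining property of a universal family.

```latex
% x is a stored key; every other key y collides with x
% with probability at most 1/m.
\begin{align*}
\mathbb{E}[\text{size of } x\text{'s bucket}]
  &= 1 + \sum_{y \neq x} \Pr[h(y) = h(x)] \\
  &\le 1 + \frac{n - 1}{m} \\
  &\le 1 + \alpha .
\end{align*}
```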
Performance
• Under the simple uniform hashing assumption
– i.e., each object is hashed into each bucket with probability 1/m,
independently of the other objects
• Expected search time Θ(1+α)
• Worst-case search time Θ(n), though very unlikely
• Using universal hashing, expected time for n
operations is Θ(n)
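The slides do not say which universal family they have in mind. A classic choice, assumed here, is the Carter-Wegman family h(k) = ((a·k + b) mod p) mod m with p prime. The struct name `UniversalHash` and the choice p = 2^31 − 1 are illustrative.

```cpp
#include <cstdint>

// One member h_{a,b} of the Carter-Wegman family (assumed, not from the slides):
//   h(k) = ((a * k + b) mod p) mod m, with p prime, 1 <= a < p, 0 <= b < p.
// The caller is expected to pick a and b at random when the table is built.
struct UniversalHash {
    static constexpr std::uint64_t p = 2147483647ULL;  // 2^31 - 1, a prime
    std::uint64_t a;
    std::uint64_t b;
    std::uint64_t m;  // number of bins, m > 0

    std::uint64_t operator()(std::uint64_t key) const {
        // key % p keeps the product below 2^62, so it cannot overflow.
        return ((a * (key % p) + b) % p) % m;
    }
};
```

The guarantee comes from the random choice of a and b, not from any single fixed pair: for a fixed pair an adversary can still find keys that collide.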
Probe Sequences
• If you only want 1 element per bin
• If the bin is occupied
– Pick a new bin deterministically
• Different methods
– Use the next bin (linear probing)
– Step by the square of the attempt number (quadratic probing)
– Add a multiple of a second hash (double hashing)
• Gets very slow as the table fills up
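The simplest probe sequence, using the next bin, can be sketched as follows. The class name `LinearProbingSet` is hypothetical, m is assumed to be greater than 0, and deletion is left out because removing a key from a probed table needs extra bookkeeping.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical open-addressing set with at most one key per bin.
class LinearProbingSet {
public:
    explicit LinearProbingSet(std::size_t m) : keys_(m), used_(m, false) {}

    // Insert: start at the hashed bin and move to the next bin until a free
    // one (or the key itself) is found. Returns false if the table is full.
    bool insert(const std::string& key) {
        std::size_t start = std::hash<std::string>{}(key) % keys_.size();
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t j = (start + i) % keys_.size();
            if (!used_[j]) {
                keys_[j] = key;
                used_[j] = true;
                return true;
            }
            if (keys_[j] == key) return true;  // already stored
        }
        return false;
    }

    // Find: follow the same probe sequence; an unused bin ends the search.
    bool find(const std::string& key) const {
        std::size_t start = std::hash<std::string>{}(key) % keys_.size();
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t j = (start + i) % keys_.size();
            if (!used_[j]) return false;
            if (keys_[j] == key) return true;
        }
        return false;
    }

private:
    std::vector<std::string> keys_;
    std::vector<bool> used_;
};
```

As the table fills, runs of occupied bins grow longer, so each insert and find probes more bins before it stops.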
Secure Hash Functions
• Can’t be easily inverted
• Doesn’t leak information about what was
hashed
• h(data) = hashValue
• Given hashValue, can’t determine anything
about data
• Given hashValue, can’t find data’ such that:
– h(data’) = hashValue
Secure Hash Functions
• Verify a downloaded file
• Download and check if the hash matches the
given hash value
• Apache_OpenOffice_4.1.1_MacOS_x86-64_install_en-US.dmg
– MD5 = 53147bd5b16e0e42c3370a223231a4cf
– (https://www.openoffice.org/download/index.html)
• Prevents attackers from sending a malicious file
– Public wifi
• Secure hash needed
– Attacker can’t write a different file with the same hash
Passwords and Hashing

SHA256 hash of my password:
1906bc7c801f03c41551b06e2fd406e8f471787c51357e8731ec61dd599f04c8

SHA256 hash of my password with 1 edit:
6410ef0d3a6d3324fcba02131e5742215c99301055398a75457a27ac89dffb5f
Inputs must match exactly
• Performance difference on Friday