Hash Table

Hash table
• The goal is O(1) runtime for:
  – Insert
  – Find
  – Delete
• Can get O(log n) operations with binary search, set, or map
• Utilizes a hash function

Create and Insert

Find

We Want Uniformity
• Hash table is slow if the bins are large
• If the hash function produces random integers
  – Values are spread throughout bins evenly
  – Fast Find

Hash Function Performance
• n = 133910
  – Number of English words in our dictionary
• m = 1024
  – Number of bins in the hash table
• What makes a good hash function?

ASCII % m
[Chart: "ASCII % m", bin sizes for bins 1 to 1024, count axis 0 to 6000]

Sum of ASCII % m
[Chart: "Sum of ASCII % m", bin sizes for bins 1 to 1024, count axis 0 to 700]

XOR << 1
[Chart: "XOR << 1", bin sizes for bins 1 to 1024, count axis 0 to 350]

Built-in hash function
[Chart: "C++ Hash", bin sizes for bins 1 to 1024, count axis 0 to 250]

More Bins
• Usually want about as many bins as items
• m = 131071

Comparison
• n = 133910
• m = 131071
• Hash table
  – Max bin size is 7
  – 1.6 average bin size (over non-empty bins; n/m itself is about 1.02)
  – Single vector lookup with 1.6 items to search through
• Binary search/set/map: O(log n) runtime
  – log2(133910) ≈ 17
  – 15 comparisons on average
  – Access and compare 15 items

Balls into Bins
• Throw n balls into m bins randomly
• Probability a given bin is empty is $(1 - 1/m)^n \approx e^{-n/m}$
• Expected number of empty bins is about $m e^{-n/m}$
• It can be shown mathematically that on average, when m ≈ n, the maximum bin size is about $\ln n / \ln \ln n$

5/28/2016 CSE 250, SUNY Buffalo, @Hung Q. Ngo

These estimates are incredibly good!
• n = 47000, m = 50000
• $m e^{-n/m} \approx 19500$
• And [formula lost in extraction]
• You can repeat the experiment with m ≈ 100K

Theoretical Results
• α = n/m is the load factor of the hash table
• Expected bucket size is at most 1 + α
• Fact: when the universal family is n-independent, the hash values behave almost as if we throw balls randomly and independently into bins

Performance
• Under the simple uniform hashing assumption
  – i.e., each object is hashed into a bucket with probability 1/m, uniformly and independently of other objects
• Expected search time Θ(1 + α)
• Worst-case search time Θ(n)
  – though very unlikely
• Using universal hashing, expected time for n operations is Θ(n)

Probe Sequences
• If you only want 1 element per bin
• If the bin is occupied
  – Pick a new bin deterministically
• Different methods
  – Use the next bin (linear probing)
  – Step by the square of the attempt number (quadratic probing)
  – Add a second hash (double hashing)
• Gets very slow as the table fills up

Secure Hash Functions
• Can’t be easily inverted
• Doesn’t leak information about what was hashed
• h(data) = hashValue
• Given hashValue, can’t determine anything about data
• Given hashValue, can’t find data’ such that:
  – h(data’) = hashValue

Secure Hash Functions
• Verify a downloaded file
• Download and check if the hash matches the given hash value
• Apache_OpenOffice_4.1.1_MacOS_x86-64_install_en-US.dmg
  – MD5 = 53147bd5b16e0e42c3370a223231a4cf
  – (https://www.openoffice.org/download/index.html)
• Prevents attackers from sending a malicious file
  – Public wifi
• Secure hash needed
  – Attacker can’t write a different file with the same hash

Passwords and Hashing
• SHA256 hash of my password
  – 1906bc7c801f03c41551b06e2fd406e8f471787c51357e8731ec61dd599f04c8
• SHA256 hash of my password with 1 edit
  – 6410ef0d3a6d3324fcba02131e5742215c99301055398a75457a27ac89dffb5f
• Inputs must match exactly
• Performance difference on Friday
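
The balls-into-bins experiment from the "These estimates are incredibly good!" slide can be repeated with a few lines of C++. The sketch below is not from the slides: the names `BinStats`, `throw_balls`, and `expected_empty_bins` are made up for illustration, and it assumes `std::mt19937` with a caller-chosen seed as the source of randomness rather than a real hash function.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Summary of one balls-into-bins experiment (hypothetical helper type).
struct BinStats {
    std::size_t empty_bins;    // bins that received no ball
    std::size_t max_bin_size;  // largest number of balls in any one bin
};

// Throw n balls into m bins, choosing each bin pseudo-randomly.
// The seed is supplied by the caller so a run can be reproduced.
BinStats throw_balls(std::size_t n, std::size_t m, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<std::size_t> bins(m, 0);
    for (std::size_t i = 0; i < n; ++i) {
        // rng() % m has a slight modulo bias, negligible for m far below 2^32.
        ++bins[rng() % m];
    }
    BinStats stats{0, 0};
    for (std::size_t count : bins) {
        if (count == 0) ++stats.empty_bins;
        stats.max_bin_size = std::max(stats.max_bin_size, count);
    }
    return stats;
}

// Estimate from the slides: expected number of empty bins is about m * e^(-n/m).
double expected_empty_bins(std::size_t n, std::size_t m) {
    return static_cast<double>(m) *
           std::exp(-static_cast<double>(n) / static_cast<double>(m));
}
```

Calling `throw_balls(47000, 50000, seed)` from `main` and comparing `empty_bins` with `expected_empty_bins(47000, 50000)` (roughly 19,531) should show the two numbers landing close together. Raising `m` to around 100000 repeats the variation the slide suggests.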