Chapter 5: Hashing
CS 340

• Hash Table ADT
• Hash Functions
• Collision Resolution
• Rehashing

Hashing

Hashing is a technique for performing searches, insertions, and deletions on a list in constant time on average. A particular component of each data element being stored is used as a key, which is mapped to a particular cell in a hash table. Problems arise when collisions occur, i.e., when multiple data elements are mapped to the same cell.

The term “hash” was coined to illustrate the analogy between hashing and the culinary practice of chopping and mixing ingredients to make a hash. Essentially, the input domain is “chopped” into several subdomains, which are then “mixed” into the output range to improve the uniformity of their distribution.

The Hash Table Abstract Data Type

A hash table is a list of keys, mapped to particular cells via a hash function.
• The table is implemented as a fixed-size array.
• The table size and hash function are strategically chosen to avoid collisions.

Example: A hash table to hold the CS Department faculty and staff

Keys (last names): Bartholomew, Bouvier, Cummins, Ehlmann, Fujinoki, Klein, Mayer, Stefik, Tornaritis, Wang, White, Yu

• Hash function: length of last name
  Table size: 11; collision-free keys: 9; ave. # comparisons per name: 1.50
• Hash function: (Office room #) % 15
  Table size: 15; collision-free keys: 12; ave. # comparisons per name: 1.42
• Hash function: ((Sum of office room # digits) * (# of vowels in last name) + (Last 2 digits of office phone #)) % 25
  Table size: 25; collision-free keys: 24; ave. # comparisons per name: 1.08

Choosing The Hash Table Size

Define the load factor, λ, of a hash table to be the ratio of the number of elements in the hash table to the table size.
• If λ > 1, then collisions are inevitable, so it is wise to choose a table size greater than the number of anticipated elements.
• If λ << 1, then there will be a very large number of empty slots, lessening the probability of a collision, but wasting a lot of memory.

Example: A hash table to hold the 2011 holidays

• Table size: 12; hash function: Month #; load factor: 1.5
    1: New Year’s Day, Martin Luther King, Jr. Day
    2: Lincoln’s Birthday, Valentine’s Day, Washington’s Birthday
    3: St. Patrick’s Day
    4: Easter Sunday
    5: Mother’s Day, Memorial Day
    6: Flag Day, Father’s Day
    7: Independence Day
    8: (empty)
    9: Labor Day
    10: Columbus Day, Halloween
    11: Veterans Day, Thanksgiving
    12: Christmas
• Table size: 43; hash function: (Month #) + (Day #); load factor: 0.419

Choosing The Hash Function

Given a particular hash table size and a particular type of data, the hash function should be chosen to minimize the number of collisions. This usually requires an in-depth analysis of the keys expected to go in the table.

Example: A hash table to hold CS undergraduate course enrollment statistics

There are 24 active courses: 108, 140, 145, 150, 240, 275, 312, 314, 321, 325, 330, 340, 390, 423, 425, 434, 438, 447, 454, 456, 482, 490, 495, and 499.

Summing the digits yields: 9, 5, 10, 6, 6, 14, 6, 8, 6, 10, 6, 7, 12, 9, 11, 9, 15, 15, 13, 15, 14, 13, 18, and 22. (Table size: 18; 7 non-collisions.)
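The load factor and the digit-based hash functions of this example can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original slides; the function names (load_factor, weighted_digit_hash, collision_free_keys) and the weights parameter are hypothetical, and only the first five course numbers are hashed.

```python
from collections import Counter


def load_factor(num_elements, table_size):
    """Ratio of the number of stored elements to the table size."""
    return num_elements / table_size


def weighted_digit_hash(number, weights=(1, 1, 1)):
    """Hash a three-digit number as a weighted sum of its digits.

    weights = (hundreds, tens, ones); (1, 1, 1) is the plain digit sum.
    """
    hundreds, tens, ones = number // 100, (number // 10) % 10, number % 10
    return weights[0] * hundreds + weights[1] * tens + weights[2] * ones


def collision_free_keys(keys, hash_fn):
    """Return the keys whose hash value is shared with no other key."""
    counts = Counter(hash_fn(k) for k in keys)
    return [k for k in keys if counts[hash_fn(k)] == 1]


sample = [108, 140, 145, 150, 240]  # first five course numbers listed above
print([weighted_digit_hash(c) for c in sample])          # [9, 5, 10, 6, 6]
print(collision_free_keys(sample, weighted_digit_hash))  # [108, 140, 145]
print(load_factor(18, 12))                               # 1.5
print(round(load_factor(18, 43), 3))                     # 0.419
```

Passing a different weights tuple, for example (9, 3, 1), gives a weighted variant of the digit sum that can be compared in the same way.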
Weighting the digits before summing spreads the same 24 courses over larger tables:

• Summing (one’s digit) + 3*(ten’s digit) + 9*(100’s digit) yields: 17, 21, 26, 24, 30, 44, 32, 34, 34, 38, 36, 39, 54, 45, 47, 49, 53, 55, 55, 57, 62, 63, 68, and 72. (Table size: 56; 20 non-collisions.)
• Summing (one’s digit) + 2*(ten’s digit) + 4*(100’s digit) yields: 12, 12, 17, 14, 16, 27, 16, 20, 17, 21, 18, 28, 30, 23, 26, 30, 31, 30, 32, 34, 34, 39, and 43. (Table size: 32; 10 non-collisions.)
• Summing (100’s digit) + 3*(ten’s digit) + 9*(one’s digit) yields: 73, 13, 58, 16, 14, 68, 24, 26, 18, 54, 12, 15, 30, 37, 55, 49, 85, 79, 55, 73, 46, 31, 76, and 112. (Table size: 100; 20 non-collisions.)

Collision Resolution

What should be done when a collision does occur? There are two main strategies: separate chaining and probing.

Separate Chaining

With separate chaining, the hash table is an array of linked lists, with each linked list containing all of the elements that map to the same value.

[Figure: the 2011 holidays stored by month number in a table of linked lists; for example, the February cell holds the chain Lincoln’s Birthday, Valentine’s Day, Washington’s Birthday.]

Disadvantages:
• Average successful search: 1 + (λ/2) comparisons
• Average unsuccessful search: λ comparisons
• Worst case search: n comparisons (for a ...)

Collision Resolution (Continued)

Probing

With probing, the hash table is an array of values, with a whole series of cells probed until no collision occurs (i.e., cells h_0(x), h_1(x), h_2(x), … are tried, where h_i(x) = (Hash(x) + f(i)) mod tablesize, with f(0) = 0).
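The two strategies above can be sketched side by side. This is an illustrative Python sketch under stated assumptions, not the slides' implementation: the class name ChainedHashTable, the function probe_cells, the integer keys, the table size of 10, and the choice f(i) = i are all hypothetical, and Python lists stand in for the linked lists.

```python
from itertools import islice


class ChainedHashTable:
    """Separate chaining: each cell holds every key that hashes to it."""

    def __init__(self, size, hash_fn=lambda key: key):
        # One chain per cell; a Python list stands in for a linked list.
        self.cells = [[] for _ in range(size)]
        self.hash_fn = hash_fn

    def _chain(self, key):
        return self.cells[self.hash_fn(key) % len(self.cells)]

    def insert(self, key):
        chain = self._chain(key)
        if key not in chain:
            chain.append(key)

    def comparisons(self, key):
        """Count the key comparisons made while searching for key."""
        chain = self._chain(key)
        for position, stored in enumerate(chain, start=1):
            if stored == key:
                return position  # successful search
        return len(chain)        # unsuccessful: the whole chain was examined


def probe_cells(key, f, table_size, hash_fn=lambda key: key):
    """Yield h_0(x), h_1(x), ... with h_i(x) = (Hash(x) + f(i)) mod table_size."""
    i = 0
    while True:
        yield (hash_fn(key) + f(i)) % table_size
        i += 1


table = ChainedHashTable(10)
for year in (1492, 1776, 1812, 1992):
    table.insert(year)
print(table.cells[2])           # [1492, 1812, 1992]
print(table.comparisons(1992))  # 3
print(list(islice(probe_cells(1992, lambda i: i, 10), 4)))  # [2, 3, 4, 5]
```

With chaining, colliding keys share a cell and lengthen its chain; with probing, each key ends up in a cell of its own somewhere along its probe sequence.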
Linear Probing: f(i) is a linear function

Example: f(i) = 3i and Hash(x) = x (the slot numbers correspond to a table of 10 cells)

  insert 1492 (slot 2)
  insert 1776 (slot 6)
  insert 1812 (slot 2 → slot 5)
  insert 1945 (slot 5 → slot 8)
  insert 1968 (slot 8 → slot 1)
  insert 1992 (slot 2 → slot 5 → slot 8 → slot 1 → slot 4)

  Final table: slot 1: 1968; slot 2: 1492; slot 4: 1992; slot 5: 1812; slot 6: 1776; slot 8: 1945

Problems With Linear Probing:
• Coefficient and table size must be relatively prime or free cells may not be found.
• Bad tendency to experience primary clustering, resulting in many collisions.

Collision Resolution (Continued)

Quadratic Probing: f(i) is a quadratic function

Example: f(i) = 2i² and Hash(x) = x

  insert 1492 (slot 2)
  insert 1776 (slot 6)
  insert 1812 (slot 2 → slot 4)
  insert 1945 (slot 5)
  insert 1968 (slot 8)
  insert 1992 (slot 2 → slot 4 → slot 0)

  Final table: slot 0: 1992; slot 2: 1492; slot 4: 1812; slot 5: 1945; slot 6: 1776; slot 8: 1968

Problems With Quadratic Probing:
• Coefficient and table size must be carefully chosen or free cells may be ignored.
• Bad tendency to experience secondary clustering, since keys with the same original hashed value will follow the same sequence of cells through the table.

Collision Resolution (Continued)

Double Hashing: f(i) is a second hash function, multiplied by an iterative value

Example: f(i) = i * Hash2(x), where Hash2(x) = 7 - (x mod 7), and Hash(x) = x

  insert 1492 (slot 2)
  insert 1776 (slot 6)
  insert 1812 (slot 2 → slot 3)
  insert 1945 (slot 5)
  insert 1968 (slot 8)
  insert 1992 (slot 2 → slot 5 → slot 8 → slot 1)

  Final table: slot 1: 1992; slot 2: 1492; slot 3: 1812; slot 5: 1945; slot 6: 1776; slot 8: 1968

Problems With Double Hashing:
• A strategic choice must be made for both hashing functions.
• Calculation will be much more expensive in the event of a collision, since a second hash function must be evaluated.
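The three probing examples above can be replayed in code. This is a minimal Python sketch rather than the slides' own implementation; the helper names insert, build, STRATEGIES, and KEYS are hypothetical, the table size of 10 is inferred from the slot numbers, and giving up after table-size probes is an arbitrary cutoff chosen for the sketch.

```python
def insert(table, key, f):
    """Place key in the first free cell of its probe sequence.

    f(i, key) is the probe offset f(i); Hash(x) = x as in the examples.
    Returns the list of cells tried, ending with the cell that was used.
    """
    size = len(table)
    tried = []
    for i in range(size):  # arbitrary cutoff for this sketch
        cell = (key % size + f(i, key)) % size
        tried.append(cell)
        if table[cell] is None:
            table[cell] = key
            return tried
    raise RuntimeError(f"no free cell found for {key}")


def build(keys, f, size=10):
    """Insert all keys into an empty table of the given size."""
    table = [None] * size
    for key in keys:
        insert(table, key, f)
    return table


STRATEGIES = {
    "linear": lambda i, x: 3 * i,            # f(i) = 3i
    "quadratic": lambda i, x: 2 * i * i,     # f(i) = 2i^2
    "double": lambda i, x: i * (7 - x % 7),  # f(i) = i * Hash2(x)
}

KEYS = [1492, 1776, 1812, 1945, 1968, 1992]

for name, f in STRATEGIES.items():
    print(name, build(KEYS, f))
# linear    [None, 1968, 1492, None, 1992, 1812, 1776, None, 1945, None]
# quadratic [1992, None, 1492, None, 1812, 1945, 1776, None, 1968, None]
# double    [None, 1992, 1492, 1812, None, 1945, 1776, None, 1968, None]
```

Only the offset function changes between the three strategies; the insertion loop itself is identical.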
Rehashing

When a hash table starts getting too full, with many delays caused by repeated collisions, rehashing the values into a new, larger table with a new hash function may alleviate the problem.

Insert: Bush2 (2001), Clinton (1993), Bush (1989), Reagan (1981), Carter (1977), Ford (1974), Nixon (1969), Johnson (1963), Kennedy (1961), Eisenhower (1953)

Hash (president) = first_year_in_office mod 11

  Cells 0 to 10: Nixon, Reagan, Clinton, Kennedy, (empty), Ford, Johnson, Eisenhower, Carter, Bush, Bush2

Inserting Obama (2009) would cause a collision in slot 7, so…

REHASH

Hash (president) = first_year_in_office mod 23

  Occupied cells after rehashing: Bush2 (0), Reagan (3), Kennedy (6), Johnson (8), Bush (11), Nixon (14), Clinton (15), Ford (19), Eisenhower (21), Carter (22); Obama is then inserted into the new table.

Associative Arrays

Hashing is used extensively in modern programming, particularly in database management, network security, and operating systems. Consequently, most modern programming languages have built-in mechanisms for implementing associative arrays, i.e., dictionaries based on the key-value concept of hash tables.

C++ “map”:

    #include <map>
    #include <string>
    using namespace std;

    // Note: std::map is an ordered, tree-based container;
    // std::unordered_map is its hash-table counterpart.
    int main()
    {
        map<string, string> phone_book;
        phone_book["Sally Smart"] = "555-9999";
        phone_book["John Doe"] = "555-1212";
        phone_book["J. Random Hacker"] = "553-1337";
        return 0;
    }

Java “map”:

    import java.util.HashMap;
    import java.util.Map;

    Map<String, String> phoneBook = new HashMap<>();
    phoneBook.put("Sally Smart", "555-9999");
    phoneBook.put("John Doe", "555-1212");
    phoneBook.put("J. Random Hacker", "555-1337");

Lua “table”:

    phone_book = {
        ["Sally Smart"] = "555-9999",
        ["John Doe"] = "555-1212",
        ["J. Random Hacker"] = "553-1337", -- Trailing comma is OK
    }

    aTable = {
        -- Table as value
        subTable = { 5, 7.5, k = true }, -- key is "subTable"
        -- Function as value
        ['John Doe'] = function (age)
            if age < 18 then return "Young" else return "Old!" end
        end,
        -- Table and function (and other types) can also be used as keys
    }

Perl “hash”:

    my %phone_book = (
        'Sally Smart'      => '555-9999',
        'John Doe'         => '555-1212',
        'J. Random Hacker' => '553-1337',
    );

Python “dictionary”:

    phonebook = {
        'Sally Smart' : '555-9999',
        'John Doe' : '555-1212',
        'J. Random Hacker' : '553-1337'
    }

Ruby “hash”:

    phonebook = {
        'Sally Smart' => '555-9999',
        'John Doe' => '555-1212',
        'J. Random Hacker' => '553-1337'
    }
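Built-in associative arrays typically enlarge their tables behind the scenes, which is the rehashing step described earlier in this chapter. The following Python sketch replays the presidents example under stated assumptions: the class ProbingTable and its methods are hypothetical, collisions are resolved by linear probing with f(i) = i (the slides do not say which probing scheme the example uses), and keys are the first years in office.

```python
class ProbingTable:
    """Open-addressing table with Hash(x) = x mod size and f(i) = i (assumed)."""

    def __init__(self, size):
        self.cells = [None] * size

    def insert(self, key):
        """Store key in the first free cell of its probe sequence; return the cell."""
        size = len(self.cells)
        for i in range(size):
            cell = (key + i) % size
            if self.cells[cell] is None:
                self.cells[cell] = key
                return cell
        raise RuntimeError("table is full")

    def rehash(self, new_size):
        """Move every stored key into a new table of new_size cells."""
        old_keys = [key for key in self.cells if key is not None]
        self.cells = [None] * new_size
        for key in old_keys:
            self.insert(key)  # the hash is now key mod new_size


# First years in office, in the insertion order given in the Rehashing example.
years = [2001, 1993, 1989, 1981, 1977, 1974, 1969, 1963, 1961, 1953]

table = ProbingTable(11)
for year in years:
    table.insert(year)
print(table.cells[7])      # 1953, the cell that 2009 (2009 mod 11 = 7) would hit

table.rehash(23)
print(table.insert(2009))  # 9 under the linear-probing assumption
```

In this sketch the rehashed table holds the first ten keys without any collisions, while 2009 still meets one occupied cell (2009 mod 23 = 8) before settling in cell 9.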