241-423 Advanced Data Structures and Algorithms
Semester 2, 2013-2014

10. Hashing

Objectives
– introduce hashing, hash functions, hash tables, collisions, linear probing, double hashing, bucket hashing, JDK hash classes

ADSA: Hashing/10

Contents
1. Searching an Array
2. Hashing
3. Creating a Hash Function
4. Solution #1: Linear Probing
5. Solution #2: Double Hashing
6. Solution #3: Bucket Hashing
7. Table Resizing
8. Java's hashCode()
9. Hash Tables in Java
10. Efficiency
11. Ordered/Unordered Sets and Maps

1. Searching an Array
• If the array is not sorted, a search requires O(n) time.
• If the array is sorted, a binary search requires O(log n) time.
• If the array is organized using hashing, then constant-time search, O(1), is possible.

2. Hashing
• A hash function takes a search item and returns its array location (its index position).
• The array + hash function is called a hash table.
• Hash tables support efficient insertion and search in O(1) time
  – but the hash function must be carefully chosen

A Simple Hash Function
• The simple hash function:
    hashCode("apple") = 5
    hashCode("watermelon") = 3
    hashCode("grapes") = 8
    hashCode("cantaloupe") = 7
    hashCode("kiwi") = 0
    hashCode("strawberry") = 9
    hashCode("mango") = 6
    hashCode("banana") = 2
• The resulting table:
    index  item
    0      kiwi
    1      (empty)
    2      banana
    3      watermelon
    4      (empty)
    5      apple
    6      mango
    7      cantaloupe
    8      grapes
    9      strawberry

A Table with Key-Value Pairs
• A hash table for a map storing (ID, Name) items
  – ID is a nine-digit integer
• The hash table is an array of size N = 10,000
• The hash function is hashCode(ID) = last four digits of the ID key
    index  entry
    0
    1      (025610001, jim)
    2      (981100002, ad)
    3
    4      (451220004, tim)
    …
    9997
    9998   (200759998, tim)
    9999

Applications of Hash Tables
• Small databases
• Compilers
• Web browser caches

3.
Creating a Hash Function
• The hash function should return the table index where an item is to be placed
  – but in general it is not possible to write a perfect (collision-free) hash function
  – the best we can usually do is to write a hash function that tells us where to start looking for a location to place the item

A Real Hash Function
• A more realistic hash function produces:
    hash("apple") = 5
    hash("watermelon") = 3
    hash("grapes") = 8
    hash("cantaloupe") = 7
    hash("kiwi") = 0
    hash("strawberry") = 9
    hash("mango") = 6
    hash("banana") = 2
    hash("honeydew") = 6
• The table before "honeydew" is added:
    index  item
    0      kiwi
    1      (empty)
    2      banana
    3      watermelon
    4      (empty)
    5      apple
    6      mango
    7      cantaloupe
    8      grapes
    9      strawberry
• Now what? Cell 6 is already occupied by "mango".

Collisions
• A collision occurs when two items hash to the same array location
  – e.g. "mango" and "honeydew" both hash to 6
• Where should we place the second and other items that hash to this same location?
• Three popular solutions:
  – linear probing, double hashing, bucket hashing (chaining)

4. Solution #1: Linear Probing
• Linear probing handles collisions by placing the colliding item in the next empty table cell (perhaps after cycling around the table)
    Key = 32 to be added [hash(32) = 6]
    index  0   1   2   3   4   5   6   7   8   9   10  11  12
    key            41          18  44  59  32  22
• Each inspected table cell is called a probe
  – in the above example, three probes were needed to insert 32 (cells 6, 7 and 8)

Example 1
• A hash table with N = 13 and h(k) = k mod 13
• Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
    k    h(k)  Probes
    18   5     5
    41   2     2
    22   9     9
    44   5     5 6
    59   7     7
    32   6     6 7 8
    31   5     5 6 7 8 9 10
    73   8     8 9 10 11
• Total number of probes: 19
• Final table:
    index  0   1   2   3   4   5   6   7   8   9   10  11  12
    key            41          18  44  59  32  22  31  73

Example 2: Insertion
• Suppose we want to add "seagull":
  – hash(seagull) = 143
  – 3 probes: table[143] is not empty; table[144] is not empty; table[145] is empty
  – put seagull at location 145
• The table after the insertion:
    index  item
    ...
    141    (empty)
    142    robin
    143    sparrow
    144    hawk
    145    seagull
    146    (empty)
    147    bluejay
    148    owl
    ...
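The probe counts in Example 1 can be checked with a short program. The following is a minimal sketch, not code from the course: the class name LinearProbeTable and its methods are hypothetical, keys are assumed to be non-negative integers, and null is used to mark an empty cell.

```java
import java.util.Arrays;

public class LinearProbeTable {
    private final Integer[] cells;   // null marks an empty cell (an assumption of this sketch)

    LinearProbeTable(int size) {
        cells = new Integer[size];
    }

    // Hash function from Example 1: h(k) = k mod N, for non-negative k.
    private int h(int k) {
        return k % cells.length;
    }

    // Inserts k using linear probing.
    // Returns the number of cells probed, or -1 if the table is full.
    int insert(int k) {
        int i = h(k);
        for (int probes = 1; probes <= cells.length; probes++) {
            if (cells[i] == null) {      // empty cell: place the key here
                cells[i] = k;
                return probes;
            }
            if (cells[i] == k) {         // key already present: do nothing
                return probes;
            }
            i = (i + 1) % cells.length;  // move to the next cell, cycling around
        }
        return -1;
    }

    // Returns the key stored at index i, or null if the cell is empty.
    Integer cellAt(int i) {
        return cells[i];
    }

    public static void main(String[] args) {
        LinearProbeTable t = new LinearProbeTable(13);
        int total = 0;
        for (int k : new int[] {18, 41, 22, 44, 59, 32, 31, 73}) {
            total += t.insert(k);
        }
        System.out.println("Total probes: " + total);   // prints "Total probes: 19"
        System.out.println(Arrays.toString(t.cells));
    }
}
```

Running main() with the keys of Example 1 gives 19 probes in total, and 31 ends up in cell 10 after six probes, which matches the table above.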
Searching
• Look up "seagull":
  – hash(seagull) = 143
  – 3 probes: table[143] != seagull; table[144] != seagull; table[145] == seagull
  – found seagull at location 145
• The table used in this and the following examples:
    index  item
    ...
    141    (empty)
    142    robin
    143    sparrow
    144    hawk
    145    seagull
    146    (empty)
    147    bluejay
    148    owl
    ...

Searching Again
• Look up "cow":
  – hash(cow) = 144
  – 3 probes: table[144] != cow; table[145] != cow; table[146] is empty
  – "cow" is not in the table since the probes reached an empty cell

Insertion Again
• Add "hawk":
  – hash(hawk) = 143
  – 2 probes: table[143] != hawk; table[144] == hawk
  – hawk is already in the table, so do nothing

Insertion Again
• Add "cardinal":
  – hash(cardinal) = 147
  – 3 or more probes: 147 and 148 are occupied, and 148 is the last cell; "cardinal" goes in location 0 (or 1, or 2, or ...)

Search with Linear Probing
• Search hash table A
• get(k)
  – start at cell h(k)
  – probe consecutive locations until:
    • an item with key k is found, or
    • an empty cell is found, or
    • N cells have been unsuccessfully probed (N is the table size)

    Algorithm get(k)
      i ← h(k)
      p ← 0                      // count num of probes
      repeat
        c ← A[i]
        if c = ∅                 // empty cell
          return null
        else if c.key() = k
          return c.element()
        else                     // linear probing
          i ← (i + 1) mod N
          p ← p + 1
      until p = N
      return null

Lazy Deletion
• Deletions are done by marking a table cell as deleted, rather than emptying the cell.
• Deleted locations are treated as empty when inserting and as occupied during a search.
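The get(k) algorithm and the lazy deletion rule above can be combined in one small class. This is a sketch under stated assumptions rather than the course's implementation: the class LazyDeleteTable, its State enum and its method names are hypothetical, keys are non-negative integers, and h(k) = k mod N as in Example 1.

```java
import java.util.Arrays;

public class LazyDeleteTable {
    // Each cell is EMPTY, FULL, or DELETED (the lazy-deletion marker).
    private enum State { EMPTY, FULL, DELETED }

    private final int[] keys;
    private final State[] state;

    LazyDeleteTable(int size) {
        keys = new int[size];
        state = new State[size];
        Arrays.fill(state, State.EMPTY);
    }

    private int h(int k) {
        return k % keys.length;          // assumes k is non-negative
    }

    // Search: DELETED cells count as occupied, so probing continues past them.
    // Returns the index holding k, or -1 if k is not in the table.
    int find(int k) {
        int i = h(k);
        for (int p = 0; p < keys.length; p++) {
            if (state[i] == State.EMPTY) return -1;               // empty cell ends the search
            if (state[i] == State.FULL && keys[i] == k) return i; // found the key
            i = (i + 1) % keys.length;                            // linear probing
        }
        return -1;                                                // N cells probed
    }

    // Delete: mark the cell instead of emptying it.
    boolean delete(int k) {
        int i = find(k);
        if (i < 0) return false;
        state[i] = State.DELETED;
        return true;
    }

    // Insert: DELETED cells count as empty, so they can be reused.
    // Returns false only when no free cell exists.
    boolean insert(int k) {
        if (find(k) >= 0) return true;   // already present, do nothing
        int i = h(k);
        for (int p = 0; p < keys.length; p++) {
            if (state[i] != State.FULL) {
                keys[i] = k;
                state[i] = State.FULL;
                return true;
            }
            i = (i + 1) % keys.length;
        }
        return false;
    }

    public static void main(String[] args) {
        LazyDeleteTable t = new LazyDeleteTable(13);
        t.insert(18); t.insert(44); t.insert(31);  // all hash to 5: cells 5, 6, 7
        t.delete(44);                              // cell 6 becomes DELETED
        System.out.println(t.find(31));            // prints 7: the search skips cell 6
        t.insert(57);                              // 57 mod 13 = 5: reuses cell 6
        System.out.println(t.find(57));            // prints 6
    }
}
```

If cell 6 were emptied instead of marked, the search for 31 would stop at cell 6 and wrongly report that 31 is missing, which is the reason for the marker.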
Updates with Lazy Deletion
• delete(k)
  – start at cell h(k)
  – probe consecutive cells until:
    • a cell with key k is found:
      put DELETED in the cell; return true
    • or an empty cell is found:
      return false
    • or N cells have been probed:
      return false
• insert(k, v)
  – start at cell h(k)
  – probe consecutive cells until:
    • a cell i is found that is either empty or contains DELETED:
      put v in the cell; return true
    • or N cells have been probed:
      return false

Clustering
• Linear probing tends to form "clusters".
  – a cluster is a sequence of non-empty array locations with no empty cells among them
    • e.g. the cluster in Example 1 (cells 5 to 11)
• The bigger a cluster gets, the more likely it is that new items will hash into that cluster, and make it even bigger.
• Clusters reduce hash table efficiency
  – searching becomes sequential (O(n))
• If the size of the table is large relative to the number of items, linear probing is fast (O(1))
  – a good hash function generates indices that are evenly distributed over the table range, and collisions will be minimal
• As the ratio of the number of items (n) to the table size (N) approaches 1, hashing slows down to the speed of a sequential search (O(n)).

5. Solution #2: Double Hashing
• In the event of a collision, compute a second 'offset' hash function
  – this is used as an offset from the collision location
• Linear probing always uses an offset of 1, which contributes to clustering. Double hashing makes the offset depend on the key, so colliding keys usually follow different probe sequences.

Example of Double Hashing
• A hash table with N = 13, h(k) = k mod 13, and d(k) = 7 - (k mod 7)
• Insert keys 18, 41, 22, 44, 59, 32, 31, 73
    k    h(k)  d(k)  Probes
    18   5     3     5
    41   2     1     2
    22   9     6     9
    44   5     5     5 10
    59   7     4     7
    32   6     3     6
    31   5     4     5 9 0
    73   8     4     8
• Total number of probes: 11
• Final table:
    index  0   1   2   3   4   5   6   7   8   9   10  11  12
    key    31      41          18  32  59  73  22  44

6.
Solution #3: Bucket Hashing
• The previous solutions use open addressing:
  – all items go into the array itself
• In bucket hashing (also called chaining), an array cell points to a linked list of items that all hash to that cell.
    [Figure: table cells 141 to 148, each pointing to a linked list of the items that hash to it (robin, sparrow, hawk, seagull, bluejay, owl)]
• Chaining is generally faster than linear probing:
  – searching only examines items that hash to the same table location
• With linear probing and double hashing, the number of table items (n) is limited to the table size (N), whereas the linked lists in chaining can keep growing.
• To delete an element, just erase it from its list.

7. Table Resizing
• As the number of items in the hash table increases, search speed goes down.
• Increase the hash table size when the number of items in the table reaches a specified percentage of its size. This works with chaining as well as with open addressing.
• Create a new table with the specified size and cycle through the items in the original table.
• For each item, use the hash() value modulo the new table size to hash to a new index.
• With chaining, insert the item at the front of the linked list at that index.

8.
Java's hashCode()
• public int hashCode() is defined in Object
  – the default version typically returns a value derived from the object's identity (often described as its memory address, although this is implementation-dependent)
• hashCode() does not know the size of the hash table
  – the returned value must be adjusted
    • e.g. hashCode() % N
• hashCode() can be overridden in your classes

Coding your own hashCode()
• Your hashCode() must:
  – always return the same value for the same item
    • it can't use random numbers, or the time of day
  – always return the same value for equal items
    • if o1.equals(o2) is true then hashCode() for o1 and o2 must be the same number
• A good hashCode() should:
  – be fast to evaluate
  – produce uniformly distributed hash values
    • this spreads the hash table indices around the table, which helps minimize collisions
  – not assign similar hash values to similar items

String Hash Function
• In the majority of hash table applications, the key is a string.
  – combine the string's characters to form an integer

    public int hashCode() {
      int hash = 0;
      // s is the char[] array holding the string's characters
      for (int i = 0; i < s.length; i++)
        hash = 31*hash + s[i];
      return hash;
    }

    String strA = "and";
    String strB = "uncharacteristically";
    String strC = "algorithm";
    hashVal = strA.hashCode();   // hashVal = 96727
    hashVal = strB.hashCode();   // hashVal = -2112884372
    hashVal = strC.hashCode();   // hashVal = 225490031

• A hash function might overflow and return a negative number. The following code ensures that the table index is nonnegative.

    tableIndex = (hashVal & Integer.MAX_VALUE) % tableSize

Time24 Hash Function
• For the Time24 class, the hash value for an object is its time converted to minutes.
• Since each hour is 60 minutes more than the last, and a minute is between 0 and 59, each distinct time has a unique hash value.

    public int hashCode() {
      return hour*60 + minute;
    }

9.
Hash Tables in Java
• Java provides HashSet, Hashtable and HashMap in java.util
  – HashSet is a set
  – Hashtable and HashMap are maps
• Hashtable is synchronized; it can be accessed safely from multiple threads
  – Hashtable stores colliding entries in buckets (the JDK documentation calls this an "open" hash table), and uses rehash() to resize the table
• HashMap is newer, faster, and usually better, but it is not synchronized
  – HashMap uses a bucket hash, and has a remove() method (as does Hashtable)

Hash Table Operations
• HashSet, Hashtable and HashMap have no-argument constructors, and constructors that take an integer initial capacity.
• HashSet has add(), contains(), remove(), iterator(), etc.
• Hashtable and HashMap include:
  – public Object put(Object key, Object value)
    • returns the previous value for this key, or null
  – public Object get(Object key)
  – public void clear()
  – public Set keySet()
    • dynamically reflects changes in the hash table
  – many others

Using HashMap
• A HashMap with Strings as keys and values: a telephone book
    key                   value
    "Charles Nguyen"      "(531) 9392 4587"
    "Lisa Jones"          "(402) 4536 4674"
    "William H. Smith"    "(998) 5488 0123"

Coding a Map

    HashMap<String, String> phoneBook = new HashMap<String, String>();
    phoneBook.put("Charles Nguyen", "(531) 9392 4587");
    phoneBook.put("Lisa Jones", "(402) 4536 4674");
    phoneBook.put("William H. Smith", "(998) 5488 0123");
    String phoneNumber = phoneBook.get("Lisa Jones");
    System.out.println( phoneNumber );

  prints: (402) 4536 4674

    HashMap<String, String> h =
        new HashMap<String, String>(100 /*capacity*/, 0.75f /*load factor*/);
    h.put( "WA", "Washington" );
    h.put( "NY", "New York" );
    h.put( "RI", "Rhode Island" );
    h.put( "BC", "British Columbia" );

Capacities and Load Factors
• HashMaps round capacities up to powers of two
  – e.g.
    100 --> 128
  – default capacity is 16; load factor is 0.75
• The load factor is used to decide when it is time to double the size of the table
  – in this example, once the number of elements passes 96
  – 128 * 0.75 == 96
• Hashtables work best with capacities that are prime numbers.

10. Efficiency
• Hash tables are efficient
  – until the table is about 70% full, the number of probes (places looked at in the table) is typically only 2 or 3
• The cost of insertion and access is O(1)
• Even if the table is nearly full (leading to long searches), efficiency remains quite high.
• Hash tables work best when the table size (N) is a prime number. HashMaps use powers of 2 for N.
• In the worst case, searches, insertions and removals on a hash table take O(n) time (n = no. of items)
  – the worst case occurs when all the keys inserted into the map collide
• The load factor λ = n/N affects the performance of a hash table (N = table size).
  – for linear probing, 0 ≤ λ ≤ 1
  – for chaining with lists, it is possible that λ > 1
• Assume that the hash function uniformly distributes indices around the hash table.
  – we can expect λ = n/N elements in each cell
    • on average, an unsuccessful search makes λ comparisons before arriving at the end of a list and returning failure
    • mathematical analysis shows that the average number of probes for a successful search is approximately 1 + λ/2
  – so keep λ small!

11. Ordered/Unordered Sets and Maps
• Use an ordered set or map if an iteration should return items in order
  – average search time: O(log n)
• Use an unordered set or map with hashing when fast access and updates are needed without caring about the ordering of items
  – average search time: O(1)

Timing Tests
• SearchComp.java:
  – read a file of 25025 randomly ordered words and insert each word into a TreeSet and a HashSet
  – report the amount of time required to build both data structures
  – shuffle the file input and time a search of the TreeSet and HashSet for each shuffled word
  – report the total time required for both search techniques
• Ford & Topp's HashSet build and search times are much better than those of their TreeSet.
• SearchJComp.java:
  – replace Ford & Topp's TreeSet and HashSet by the ones in the JDK
  – the JDK HashSet and TreeSet are much the same speed for searching
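A timing test along the lines of SearchJComp.java can be approximated as follows. This is a rough sketch, not the course program: the class SetTimingSketch and its helper methods are hypothetical, generated strings stand in for the 25025-word file, and System.nanoTime() gives only a coarse measurement that varies between runs and machines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class SetTimingSketch {

    // Builds n distinct pseudo-words in random order (a stand-in for the word file).
    static List<String> makeWords(int n, long seed) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            words.add("w" + i);
        }
        Collections.shuffle(words, new Random(seed));
        return words;
    }

    // Searches the set for every word and returns the elapsed time in nanoseconds.
    static long timeSearch(Set<String> set, List<String> words) {
        long start = System.nanoTime();
        int found = 0;
        for (String w : words) {
            if (set.contains(w)) found++;
        }
        long elapsed = System.nanoTime() - start;
        if (found != words.size()) {
            throw new IllegalStateException("some words were not found");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        List<String> words = makeWords(25025, 42L);
        Set<String> tree = new TreeSet<>(words);   // ordered set, O(log n) search
        Set<String> hash = new HashSet<>(words);   // hashed set, O(1) average search
        Collections.shuffle(words, new Random(7L)); // search in a different order
        System.out.println("TreeSet search (ns): " + timeSearch(tree, words));
        System.out.println("HashSet search (ns): " + timeSearch(hash, words));
    }
}
```

With only about 25,000 items, log n is roughly 15, so the gap between the two JDK classes can be small in practice; a single run like this should not be read as a benchmark.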