Hashing and Hash tables

Dictionary based on arrays
• Suppose we have a dictionary with small integer keys, in the range 0 - 999.
• We can use an array containing the values, with the key as the index.
    private Object[] theDictionary = new Object[1000];
• We retrieve the value associated with a particular key by simply indexing into the array:
    //require: 0 <= key && key <= 999
    public Element get (int key) {
        return (Element) theDictionary[key];
    }

Trade-offs
• Pluses
  – Accessing the dictionary requires only constant time, rather than the linear time required by list search or the logarithmic time required by binary search.
• Minuses
  – The set of possible keys is rarely limited to a small range of integers, or even to integers.
  – When keys are integers, the set of possible values is typically large, such as social security numbers.
  – Keys are often character strings (names, for instance) or more general objects.

Hashing: Generalizing the index
• We can still use the idea of indexing an array if we convert each key to an array index.
• This conversion is done by a hash function. A hash function takes a key as argument and returns an array index as result. If theDictionary is the name of the array, a hash function can be specified as follows:
    //ensure: 0 <= result && result < theDictionary.length
    int hash (Key key)
• To locate an item in the dictionary, the hash function is applied to the key to obtain the array index:
    public Element valueOf (Key key) {
        return (Element) theDictionary[hash(key)];
    }

Hashing
• The goal is efficiency: the hash function must be easy to compute.
• For example, if keys are social security numbers:
    int hash (int key) {
        return key % 1000;
    }
[Figure: keys passed through the hash function to array indices 0 to 7]

Hashing
• Problems:
  – Which hash function to use.
  – Collisions (hashing is not perfect): the hash function will map many different keys to the same index.
• Requirements for a hash function:
  – i. Easy to compute.
  – ii. Distributes keys evenly over the indices.
  – iii. “Equal” keys must hash to the same index.
• The goal here is to reduce the number of collisions.
  – Load factor:
    • Distribution of collisions in the dictionary.

Object’s hash method
• We need to know something about the expected keys. For instance, if keys are seven-digit numbers and the index set is 0 - 999, a reasonable hash function simply takes the high-order three digits:
    int hash (int key) {
        return key / 10000;
    }
• However, if the seven-digit keys are local telephone numbers, this is clearly not a good approach: we are in fact selecting the local exchange, of which there will be very few.

Object’s hash method
• The Java class Object defines a method hashCode, specified as
    public int hashCode();
• The method hashCode as defined in the class Object is only guaranteed to produce equal values for identical arguments. That is, if a == b, then a.hashCode() == b.hashCode().
• Java classes are expected either to use this built-in hashCode or to redefine it whenever they redefine equals.
• hashCode can be used to build a hash function. Thus a hash function for the array theDictionary might be written:
    int hash (Key key) {
        // floorMod keeps the result in 0..length-1 even when hashCode() is negative
        // (Math.abs would still return a negative value for Integer.MIN_VALUE).
        return Math.floorMod(key.hashCode(), theDictionary.length);
    }

Canonical Java class definition
• Thus a well-defined class must include:
    public boolean equals(Object obj){…}
    public int hashCode(){…}
    public Object clone(){…}
    public String toString(){…}
• This is a very strong requirement, atypical for most of our applications.
• We usually redefine equals on objects.
• Thus the 3rd requirement of a hash function, that equal keys must hash to the same index, requires that we redefine hashCode in every class where we redefine equals.

Hashing functions
• Casting to an integer
• Polynomial hash code
• Folding
• Shifting

Handling Collisions: Open chaining
• This scheme is more commonly called open addressing with linear probing.
• If we attempt to insert an element in the dictionary and the location returned by the hash function is already occupied, we search the array for the next unoccupied location.
    //Add a key-value pair, given that the dictionary is not full.
    public void add (Key key, Element value) {
        int index = hash(key);
        // Find the next available position
        while (theDictionary[index] != null)
            index = (index+1)%theDictionary.length;
        theDictionary[index] = new KeyValuePair(key, value);
    }

Handling Collisions: Open chaining
As an example, suppose the array contains ten elements, keys are three-digit numbers, and the hash function simply returns the low-order digit of the key. Elements with keys 126, 128, 206, 208, 306, 776 are added in that order. The array after each of the first four insertions:

    index       0    1    2    3    4    5    6    7    8    9
    add 126     .    .    .    .    .    .   126   .    .    .
    add 128     .    .    .    .    .    .   126   .   128   .
    add 206     .    .    .    .    .    .   126  206  128   .
    add 208     .    .    .    .    .    .   126  206  128  208

Key 206 hashes to 6, which is occupied, so it moves to 7. Key 208 hashes to 8, which is occupied, so it moves to 9.
Continuing the example (keys 126, 128, 206, 208, 306, 776 added in that order), the last two insertions wrap around to the start of the array:

    index       0    1    2    3    4    5    6    7    8    9
    add 306    306   .    .    .    .    .   126  206  128  208
    add 776    306  776   .    .    .    .   126  206  128  208

Key 306 hashes to 6; cells 6 through 9 are occupied, so it wraps around to 0. Key 776 also hashes to 6 and ends up at 1.

Handling Collisions: Open chaining
• To locate a key, start at its hash location and search the array sequentially.
    public Element valueOf (Key key) {
        int index = hash(key);
        while (theDictionary[index] != null &&
               !theDictionary[index].key().equals(key))
            index = (index+1)%theDictionary.length;
        if (theDictionary[index] == null)
            return null;
        else
            return theDictionary[index].value();
    }
• If the key searched for is not in the dictionary, the search reaches an empty cell and returns null. The algorithm assumes the array is not full.
• Problems with deletion:
  – If an item is deleted by replacing it with null, a later search might find the null before reaching an element further along the chain.
  – Instead, a deleted element can be left in the table and marked “deleted.”
  – Over time the table becomes cluttered with deleted elements.
    • The deleted elements will need to be removed,
    • and the remaining elements must be repositioned in the table.
  – This is referred to as rehashing, and is a relatively expensive operation.

Handling Collisions: separate chaining
• Implements the dictionary as an array of lists. An element is added by appending it to the list specified by the hash function.
• The time required to add an element is constant if the list append operation is constant.
• The time required to locate a key depends only on the number of items hashed to the same location.
• Deletion presents no particular problem.
[Figure: array of ten lists, indices 0 to 9, holding the keys 126, 206, 128, 208, 306, 776]

Load factor
• The more elements a table contains, the more the lookup method exhibits linear behavior.
• load factor = (number of elements) / (array size).
• With open chaining,
  – the load factor is 0 for an empty table,
  – and 1 for a completely full table.
• With separate chaining, the load factor can be larger than 1.

Load factor
• We can use a smaller number of buckets, but at the risk of increasing the number of collisions, so the length of the longest list will grow.
• Given m keys in n slots, the load factor is m/n.
• As it increases, we find fewer empty buckets.
• Probability that a bucket is empty: $e^{-m/n}$
• With m/n == 2, about 13% of the buckets will be empty.
• With m/n == 5, about 0.7% will be empty.

Load factor
• With linear probing,
  – if the load factor is 0.5 (i.e., the table is half full), locating a key in the table requires on average examining only about 1.5 cells.
  – Inserting an element, or determining that an element is not in the table, requires only about 2.5 probes.
  – As the load factor approaches 1, performance degenerates. With a load factor of 0.9, locating a key that is in the table takes about 5.5 probes on average, while inserting an element or determining that a key is absent takes roughly 50.
  – Linear probing is acceptable if the load factor remains small, less than 0.5. This is easy to achieve for a stable dictionary whose size is known a priori.

Assuming a uniform distribution
• The probability that k out of m keys will fall into a particular bucket of a table with n entries:
    $$\frac{(m/n)^k \, e^{-m/n}}{k!}$$
• For a given load factor m/n, the expected number of nodes examined for a successful search is approximately 1 + (m/n)/2.
• For an unsuccessful search, it is (m/n)/2.

Load factor
• A small load factor is not easy to maintain with a dynamic table, with many additions and deletions. We must then be prepared to increase the table size when the table becomes too full. However, since the hash function depends on the size of the array, increasing the table size requires repositioning all elements in the table, that is, rehashing.
• With separate chaining, a load factor of 1.0 is generally considered acceptable. Locating a key requires about 1.5 probes with a load factor of 1.0.

Java’s hashing tables
• Java defines the interface Map:
    public V put (K key, V value)
        Adds a key-value pair to the dictionary, or replaces the value if an entry with the specified key is already in the dictionary. Returns the value previously associated with the key, or null if the key was not previously in the map.
    public V get (Object key)
        Retrieves the value associated with the specified key.
    public V remove (Object key)
        Removes the item with the specified key. Returns the value associated with the key, or null if the key was not in the dictionary.

HashMap
• HashMap is a concrete class that implements Map.
• The table of key-value pairs is implemented with a separate-chaining approach.
• The hash function is based on the object’s own hashCode plus a supplemental hash function, which defends against poor-quality hash functions.
• Keys must implement hashCode and equals in a consistent way. That is, if k1 and k2 are keys and k1.equals(k2), then it must be the case that k1.hashCode() == k2.hashCode().
• The class HashMap provides several constructors. One version allows specification of the initial array size (initialCapacity) and the maximum acceptable load factor:
    public HashMap (int initialCapacity, float loadFactor);

HashMap
• HashMap rounds initialCapacity up to a power of two, so there is no benefit in choosing a prime number.
• The specified load factor must be positive; values between 0 and 1 are typical.
• If the load factor of the table is exceeded, a new array is allocated and the elements are rehashed into it. The new array is twice the size of the previous one. (The older Hashtable class grows to twice the previous size, plus one.)
• A second constructor requires only that the initial capacity be specified:
    public HashMap (int initialCapacity);
• A default load factor of 0.75 is used in this case.
• Finally, a HashMap with a default initial capacity of 16 can be created by using a constructor with no arguments.
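
The HashMap operations described above can be sketched as follows. This is a minimal illustration, not code from the slides: the class names HashMapSketch and PhoneKey, and the sample data, are assumptions made up for the example. It shows a key class whose equals and hashCode agree, and the values returned by put, get, and remove.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class HashMapSketch {

    // Hypothetical key type (not from the slides): a telephone number split
    // into exchange and line. equals and hashCode use the same two fields,
    // so keys that are equal always produce equal hash codes.
    static final class PhoneKey {
        private final String exchange;
        private final String line;

        PhoneKey(String exchange, String line) {
            this.exchange = exchange;
            this.line = line;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof PhoneKey)) return false;
            PhoneKey other = (PhoneKey) obj;
            return exchange.equals(other.exchange) && line.equals(other.line);
        }

        @Override
        public int hashCode() {
            return Objects.hash(exchange, line);
        }
    }

    public static void main(String[] args) {
        // Initial capacity 16 and load factor 0.75, given explicitly.
        Map<PhoneKey, String> directory = new HashMap<>(16, 0.75f);

        // put returns the previous value, or null if the key was absent.
        String before = directory.put(new PhoneKey("555", "0100"), "Ada");
        String replaced = directory.put(new PhoneKey("555", "0100"), "Grace");

        // A distinct but equal key object finds the same entry.
        String found = directory.get(new PhoneKey("555", "0100"));

        // remove returns the stored value; afterwards the key is gone.
        String removed = directory.remove(new PhoneKey("555", "0100"));
        boolean stillThere = directory.containsKey(new PhoneKey("555", "0100"));

        // prints: null Ada Grace Grace false
        System.out.println(before + " " + replaced + " " + found + " "
                + removed + " " + stillThere);
    }
}
```

If PhoneKey overrode equals but kept the hashCode inherited from Object, the second put and the get would most likely look in a different bucket and miss the existing entry, which is why the two methods must be redefined together.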