Hashing and Hash tables

Dictionary based on arrays
• Suppose we have a dictionary with small integer keys, in the
range 0 - 999.
• We can use an array containing the values, with the key as index.
private Object[] theDictionary = new Object[1000];
• We retrieve the value associated with a particular key by
simply indexing into the array:
//require: 0 <= key && key <= 999
public Element get (int key) {
return (Element) theDictionary[key];
}
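A minimal sketch of such an array-backed dictionary is shown below. The class name ArrayDictionary and the put method are illustrative assumptions, not part of the slides; values are kept as plain Object to stay self-contained.

```java
// Sketch of a dictionary indexed directly by small integer keys.
// ArrayDictionary and put are hypothetical names used for illustration.
class ArrayDictionary {
    private final Object[] theDictionary = new Object[1000];

    // require: 0 <= key && key <= 999
    void put(int key, Object value) {
        theDictionary[key] = value;
    }

    // require: 0 <= key && key <= 999
    Object get(int key) {
        return theDictionary[key]; // null if nothing was stored under key
    }
}
```

Both operations are a single array access, which is where the constant-time behavior comes from.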
Trade-offs
• Pluses
– Accessing the dictionary requires only constant time, rather than the
linear or logarithmic time required by list traversal or binary search.
• Minuses
– The set of possible keys is rarely limited to a small range of integers,
or even to integers.
– When keys are integers, the set of possible values is typically large,
such as social security numbers.
– Keys are often character strings (names, for instance) or more
general objects.
Hashing: Generalizing the index
• We can use the idea of indexing an array if we convert a key to an
array index.
• The conversion is done by a hash function. A hash function takes a key
as argument, and returns an array index as result. If theDictionary is the
name of the array, a hash function can be specified as follows:
//ensure: 0 <= result && result < theDictionary.length
int hash (Key key)
• To locate an item in the dictionary, the hash function is applied to the
key to obtain the array index:
public Element valueOf (Key key) {
return theDictionary[hash(key)];
}
Hashing
• The goal is efficiency: the hash function must be easy to compute.
• For example, if keys are social security numbers:
[Figure: keys are passed through the hash function to array indices 0 through 7]
int hash (int key) {
return key % 1000;
}
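As a small illustration, the sketch below applies the same function to two nine-digit keys. The class name ModHashDemo and the sample keys are made up for this example.

```java
// Sketch: the low-order three digits of the key select the array index.
// ModHashDemo is a hypothetical name; keys are assumed non-negative.
class ModHashDemo {
    static int hash(int key) {
        return key % 1000;
    }
}
```

With this function, the keys 123456789 and 987654789 both map to index 789, so distinct keys can land on the same index.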
Hashing
• Problems:
– Which hash function to use.
– Collisions (hashing is not perfect): the hash function will map
many different keys to the same index.
• Requirements for a hash function:
– i. Easy to compute.
– ii. Distributes key values evenly over the indices.
– iii. “Equal” keys must hash to the same index.
• The goal here is to reduce the number of collisions.
– Load factor: distribution of collisions in the dictionary.
Hashing
• We need to know something about the expected keys. For
instance, if keys are seven-digit numbers and the index set
is 0 - 999, a reasonable hash function simply takes the
high-order three digits:
int hash (int key) {
return key / 10000;
}
• However, if the seven-digit keys are local telephone
numbers, then this is clearly not a good approach. We are
in fact selecting the local exchange, of which there will be
very few, so most keys would share a handful of indices.
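The effect can be checked with a short sketch that counts how many distinct indices a set of keys occupies under two candidate functions. The class name, the method names, and the sample telephone numbers are all invented for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch comparing two hash functions on seven-digit telephone numbers.
// All names and sample keys here are hypothetical.
class ExchangeClusterDemo {
    static int highDigits(int key) { return key / 10000; } // the exchange prefix
    static int lowDigits(int key)  { return key % 1000; }  // the last three digits

    // Returns {distinct indices under highDigits, distinct indices under lowDigits}.
    static int[] distinctIndices(int[] keys) {
        Set<Integer> high = new HashSet<>();
        Set<Integer> low = new HashSet<>();
        for (int k : keys) {
            high.add(highDigits(k));
            low.add(lowDigits(k));
        }
        return new int[] { high.size(), low.size() };
    }
}
```

For the six sample numbers 5550123, 5554567, 5558910, 5551234, 5559876 and 5552468, the high-order digits give a single index (555) while the low-order digits give six different ones.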
Object’s hash method
• Java objects are expected either to use the built-in hash function or to
redefine it whenever equals is redefined.
• The method hashCode as defined in the class Object is only
guaranteed to produce equal values for identical arguments. That is,
if a == b, then a.hashCode() == b.hashCode().
• The Java class Object defines a method hashCode,
specified as
public int hashCode();
• It can be used to build a hash function. Thus a hash
function for the array theDictionary might be written:
int hash (Key key) {
// floorMod keeps the result in 0 .. length-1 even when hashCode() is negative
return Math.floorMod(key.hashCode(), theDictionary.length);
}
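One detail worth checking when reducing a hashCode to an array index is the handling of negative values: Math.abs(Integer.MIN_VALUE) is still negative, so taking the absolute value before the remainder can yield a negative index. The sketch below uses Math.floorMod instead; the class name SafeIndex is an assumption for this example.

```java
// Sketch: map any hashCode to an index in 0 .. length-1.
// SafeIndex is a hypothetical helper, not part of the slides.
class SafeIndex {
    static int index(Object key, int length) {
        // floorMod never returns a negative result for a positive length
        return Math.floorMod(key.hashCode(), length);
    }
}
```

For a key whose hashCode is Integer.MIN_VALUE and a table of length 1000, this returns 352, whereas Math.abs(Integer.MIN_VALUE) % 1000 evaluates to -648.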
Canonical Java class definition
• Thus a well-defined class must include:
• public boolean equals(Object obj){…}
• public int hashCode(){…}
• public Object clone(){…}
• public String toString(){…}
• This is a very strong requirement, atypical for most of our applications.
• We usually redefine equals on objects.
• Thus the 3rd requirement of a hashing function, that equal keys hash
to the same index, requires that we redefine Java’s hashCode method in
every class where we redefine equals.
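A minimal sketch of a key class that redefines equals and hashCode together follows. The class Point and its fields are hypothetical; the point is only that hashCode is computed from the same fields that equals compares.

```java
import java.util.Objects;

// Hypothetical key class used only to illustrate consistent equals/hashCode.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Point)) return false;
        Point other = (Point) obj;
        return x == other.x && y == other.y;
    }

    // Uses exactly the fields compared in equals, so equal points hash alike.
    @Override
    public int hashCode() { return Objects.hash(x, y); }

    @Override
    public String toString() { return "(" + x + ", " + y + ")"; }
}
```

Without the hashCode override, two equal Point objects would usually get different hash codes from Object, and a lookup with one would miss an entry stored under the other.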
Hashing functions
• Casting to an integer
• Polynomial hash code
• Folding
• Shifting
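As one concrete instance, a polynomial hash code for strings can be sketched as below. The multiplier 31 is the one documented for java.lang.String.hashCode; the class name PolynomialHash is an assumption for this example.

```java
// Sketch of a polynomial hash code evaluated with Horner's rule:
// s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], with int overflow wrapping.
class PolynomialHash {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }
}
```

Because each character is weighted by its position, "ab" (3105) and "ba" (3135) get different codes, which a simple sum of characters would not achieve.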
Handling Collisions: Open chaining
• If we attempt to insert an element in the dictionary
and the location returned by the hash function is
already occupied, we search the array for the next
unoccupied location. (This strategy is more commonly
called open addressing with linear probing.)
//Add a key-value pair, given that the dictionary is not full.
public void add (Key key, Element value) {
int index = hash(key);
// Find the next available position, wrapping around at the end
while (theDictionary[index] != null)
index = (index+1)%theDictionary.length;
theDictionary[index] = new KeyValuePair(key, value);
}
As an example, suppose the array contains ten elements, keys are three-digit
numbers, and the hash function simply returns the low-order digit of the key.
Elements with keys 126, 128, 206, 208, 306, 776 are added in that order.
The array after each addition (empty cells shown as -):

Key added   0    1    2    3    4    5    6    7    8    9
126         -    -    -    -    -    -    126  -    -    -
128         -    -    -    -    -    -    126  -    128  -
206         -    -    -    -    -    -    126  206  128  -
208         -    -    -    -    -    -    126  206  128  208
306         306  -    -    -    -    -    126  206  128  208
776         306  776  -    -    -    -    126  206  128  208
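The insertion sequence of this example can be replayed with a short sketch. It stores bare int keys rather than key-value pairs and uses 0 to mark an empty cell; both simplifications, and the class name LinearProbingDemo, are assumptions made for brevity.

```java
import java.util.Arrays;

// Sketch: replay the example insertions with linear probing.
// Assumes positive keys and a table that never becomes full.
class LinearProbingDemo {
    static int[] insertAll(int[] keys, int size) {
        int[] table = new int[size]; // 0 marks an empty cell
        for (int key : keys) {
            int index = key % size;  // low-order digit when size is 10
            while (table[index] != 0) {
                index = (index + 1) % size; // next cell, wrapping around
            }
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {126, 128, 206, 208, 306, 776};
        System.out.println(Arrays.toString(insertAll(keys, 10)));
    }
}
```

Note how 306 and 776 wrap around to indices 0 and 1, far from their hash location 6; this clustering is what makes long probe sequences likely as the table fills.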
• To locate a key, start at hash location and sequentially
search array.
public Element valueOf (Key key) {
int index = hash(key);
while (theDictionary[index] != null &&
!theDictionary[index].key().equals(key))
index = (index+1)%theDictionary.length;
if (theDictionary[index] == null)
return null;
else
return theDictionary[index].value();
}
• If the key searched for is not in the dictionary, the search finds an
empty element and returns null. The algorithm assumes the array is not full.
• Problems with deletion.
– If an item is deleted by replacing it with null, a later search might
find the null before reaching an element further along the probe sequence.
– Instead, a deleted element can be left in the table and marked “deleted.”
– Over time the table becomes cluttered with deleted elements.
– The deleted elements will then need to be removed, and the
remaining elements must be repositioned in the table.
• This is referred to as rehashing, and is a relatively expensive operation.
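The "deleted" marker idea can be sketched as follows. This is a simplified set of keys rather than the key-value dictionary of the slides, and the class name TombstoneTable and the DELETED sentinel are assumptions for illustration.

```java
// Sketch of linear probing with "deleted" markers.
// Assumes at least one cell is always truly empty (null) and that
// add is only called with keys not already present.
class TombstoneTable {
    private static final Object DELETED = new Object(); // marks a removed cell
    private final Object[] cells;

    TombstoneTable(int size) { cells = new Object[size]; }

    private int hash(Object key) {
        return Math.floorMod(key.hashCode(), cells.length);
    }

    void add(Object key) {
        int index = hash(key);
        // Stop at an empty cell or reuse a cell marked deleted.
        while (cells[index] != null && cells[index] != DELETED) {
            index = (index + 1) % cells.length;
        }
        cells[index] = key;
    }

    boolean contains(Object key) {
        int index = hash(key);
        // Markers are skipped; only a truly empty cell ends the search.
        while (cells[index] != null) {
            if (cells[index] != DELETED && cells[index].equals(key)) return true;
            index = (index + 1) % cells.length;
        }
        return false;
    }

    void remove(Object key) {
        int index = hash(key);
        while (cells[index] != null) {
            if (cells[index] != DELETED && cells[index].equals(key)) {
                cells[index] = DELETED; // leave a marker instead of null
                return;
            }
            index = (index + 1) % cells.length;
        }
    }
}
```

After adding 126, 206 and 306 to a table of size 10 and removing 206, a search for 306 still succeeds because it passes over the marker at index 7 instead of stopping there.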
Handling Collisions: separate chaining
• Implements the dictionary as an array of lists. An element is added by
appending it to the list specified by the hash function.
[Figure: array of ten lists, indices 0 to 9, holding the keys 126, 206, 128, 208, 306, 776]
• Time required to add an element will be constant, if the list append
operation is constant.
• Time required to locate a key depends only on the number of items
hashed to the same location.
• Deletion presents no particular problem.
Load factor
• The more elements a table contains, the more the lookup
method exhibits linear behavior.
• load factor = n / array size, where n is the number of elements in the table.
• With open chaining,
– load factor is 0 for an empty table,
– 1 for a completely full table.
• With separate chaining, load factor can be larger than 1.
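A separate-chaining table can be sketched with an array of lists, as below. It stores bare int keys and hashes by remainder; those choices and the class name ChainedTable are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of separate chaining: one list ("bucket") per array index.
class ChainedTable {
    private final List<List<Integer>> buckets = new ArrayList<>();

    ChainedTable(int size) {
        for (int i = 0; i < size; i++) buckets.add(new ArrayList<>());
    }

    private int hash(int key) { return Math.floorMod(key, buckets.size()); }

    // Appending to the chosen list; no probing for a free cell is needed.
    void add(int key) { buckets.get(hash(key)).add(key); }

    // Only the keys that hashed to the same index are examined.
    boolean contains(int key) { return buckets.get(hash(key)).contains(key); }

    // Deletion touches a single list; no markers or rehashing are needed.
    boolean remove(int key) { return buckets.get(hash(key)).remove(Integer.valueOf(key)); }

    List<Integer> bucket(int index) { return buckets.get(index); }
}
```

Adding 126, 128, 206, 208, 306 and 776 to a table of size 10 leaves the list at index 6 holding 126, 206, 306, 776 and the list at index 8 holding 128, 208. Since the lists can grow without bound, the table can hold more elements than it has array cells.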
• We can use a smaller number of buckets, but at the risk of
increasing the number of collisions, so the length of the
longest list will grow.
• Given m keys in n slots, the load factor is m/n.
• As it increases, we find fewer empty buckets.
• Probability that a bucket is empty: $e^{-m/n}$
• With m/n == 2, about 13% of the buckets will be empty.
• With m/n == 5, about 0.7% will be empty.
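These percentages can be reproduced with a few lines of code. The sketch below also includes the exact value (1 - 1/n)^m for uniformly distributed keys, which the exponential approximates; the class and method names are assumptions for this example.

```java
// Sketch: expected fraction of empty buckets at a given load factor.
class EmptyBuckets {
    // Approximation e^(-m/n), taking the load factor m/n directly.
    static double approximate(double loadFactor) {
        return Math.exp(-loadFactor);
    }

    // Exact probability that one given bucket receives none of m uniform keys.
    static double exact(int m, int n) {
        return Math.pow(1.0 - 1.0 / n, m);
    }
}
```

approximate(2) is about 0.135 and approximate(5) is about 0.0067, matching the 13% and 0.7% figures; exact(2000, 1000) agrees with the first to three decimal places.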
• With linear probing,
– if the load factor is 0.5 (i.e., the table is half full), locating a key
in the table requires on average examining only about 1.5 cells.
– Inserting an element, or determining that an element is not in the
table, requires only about 2.5 probes.
– As the load factor approaches 1, performance degenerates. With a
load factor of 0.9, roughly 50 cells need to be examined to insert an
element or to determine that a key is absent (about 5.5 to locate a key
that is present).
– Linear probing is acceptable if the load factor remains small, less than
0.5. This is easy to achieve for a stable dictionary whose size can be given
a priori.
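The probe counts quoted here agree with the standard approximations from Knuth's analysis of linear probing, which the slides do not state explicitly: about (1 + 1/(1 - a))/2 probes for a successful search and (1 + 1/(1 - a)^2)/2 for an insertion or unsuccessful search at load factor a. A minimal sketch, with hypothetical class and method names:

```java
// Sketch: Knuth's approximations for linear probing, valid for 0 <= a < 1.
class LinearProbingCost {
    // Expected probes to find a key that is present.
    static double successful(double a) {
        return 0.5 * (1 + 1 / (1 - a));
    }

    // Expected probes to insert a key or to find that it is absent.
    static double unsuccessful(double a) {
        return 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
    }
}
```

At a = 0.5 these give 1.5 and 2.5 probes; at a = 0.9 they give about 5.5 and 50.5, which shows how quickly the cost of insertions and failed lookups grows as the table fills.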
Assuming a uniform distribution
• The probability that k out of m keys will fall into a
particular bucket of a table with n entries is approximately
$\frac{(m/n)^k \, e^{-m/n}}{k!}$
• For a given load factor, m/n, expected number of
nodes examined for a successful search is
approximately 1 + (m/n)/2
• For an unsuccessful search, it is approximately m/n, since the whole
list at that index must be examined.
Load factor
• A small load factor is not easy to maintain with a dynamic table, with many
additions and deletions. We must then be prepared to increase the
table size when the table becomes too full. However, since the
hash function depends on the size of the table, increasing the
table size requires repositioning all elements in the table,
that is, rehashing.
• With separate chaining, a load factor of 1.0 is generally
considered acceptable. Locating a key requires about 1.5
probes with a load factor of 1.0.
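Why growing the table forces every element to move can be seen in a small sketch. It uses linear probing over bare int keys with 0 as the empty marker; these simplifications and the names RehashDemo, insert and rehash are assumptions for illustration.

```java
// Sketch: growing a linear-probing table by reinserting every key.
// Assumes positive keys and tables that never become full.
class RehashDemo {
    static void insert(int[] table, int key) {
        int index = key % table.length; // depends on the table length
        while (table[index] != 0) {
            index = (index + 1) % table.length;
        }
        table[index] = key;
    }

    // Copying cells position by position would be wrong: each key's
    // index must be recomputed against the new length.
    static int[] rehash(int[] oldTable, int newLength) {
        int[] newTable = new int[newLength];
        for (int key : oldTable) {
            if (key != 0) insert(newTable, key);
        }
        return newTable;
    }
}
```

With a table of length 5, the keys 126, 131 and 137 end up in cells 1, 2 and 3 after probing; rehashed into a table of length 10 they land in cells 6, 1 and 7, each at its own hash location.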
Java’s hashing tables
• Java defines the interface Map<K,V>, whose methods include:
public V put (K key, V value)
Add a key-value pair to the dictionary, or replace the value if an
entry with the specified key is already in the dictionary. Returns
the value previously associated with the key, or null if the key
was not previously in the map.
public V get (Object key)
Retrieve the value associated with a specified key, or null if the
key is not in the map.
public V remove (Object key)
Remove the item with the specified key. Returns the value
associated with the key, or null if the key was not in the
dictionary.
• HashMap is a concrete class that implements Map.
• The table of key-value pairs is implemented with a separate-chaining
approach.
• The hash function is based on object’s own hashCode plus a
supplemental hash function, which defends against poor quality hash
functions.
• Keys must implement hashCode and equals in a consistent way. That
is, if k1 and k2 are keys and k1.equals(k2), then it must be the case that
k1.hashCode() == k2.hashCode().
• The class HashMap provides several constructors. One version allows
specification of the initial array size (initialCapacity) and the
maximum acceptable load factor.
• public HashMap (int initialCapacity, float loadFactor);
HashMap
• HashMap keeps its array size at a power of two and rounds initialCapacity
up accordingly, so choosing a prime initialCapacity does not improve
performance (that advice applies to the older Hashtable class).
• The specified load factor must be positive and is normally between 0 and 1.
• If the load factor of the table is exceeded, a new array is allocated, and
the elements are rehashed into it. The size of the new array is twice the
size of the previous one (the older Hashtable class uses twice the size plus one).
• A second constructor requires only that the initial capacity be specified:
• public HashMap (int initialCapacity);
• A default load factor of 0.75 is used in this case.
• Finally, a HashMap with a default initial capacity of 16 and the default
load factor of 0.75 can be created by using a constructor with no arguments.
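A short usage sketch of HashMap through the Map interface follows. The class name HashMapDemo and the sample entries are made up; the constructor and the put, get and remove methods are the standard java.util ones.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: creating and using a HashMap with explicit capacity and load factor.
class HashMapDemo {
    static Map<String, Integer> build() {
        // 16 and 0.75f are written out here; they are also the defaults.
        Map<String, Integer> ages = new HashMap<>(16, 0.75f);
        ages.put("Ada", 36);
        ages.put("Alan", 41);
        return ages;
    }
}
```

On the map returned by build(), put("Ada", 37) returns the previous value 36, remove("Alan") returns 41, and both get and remove return null for a key that is not present.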