Hashing as a Dictionary Implementation

Hashing
Chapters 19-20
What is Hashing?
A technique that determines an index or
location for storage of an item in a data
structure
The hash function receives the search key
• Returns the index of an element in an array
called the hash table
• The index is known as the hash index
A perfect hash function maps each search
key into a different integer suitable as an
index to the hash table
What is Hashing?
A hash function indexes its hash table.
What is Hashing?
Two steps of the hash function
• Convert the search key into an integer called
the hash code
• Compress the hash code into the range of
indices for the hash table
Typical hash functions are not perfect
• They can allow more than one search key to
map into a single index
• This is known as a collision
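The two steps, and how they can produce a collision, can be sketched as follows. This is only an illustration: the class name, the phone-number keys, and the digits-as-integer hash code are assumptions made for the example, not the hash function used in the slides' figures.

```java
class TwoStepHashDemo {
    static final int TABLE_SIZE = 101; // a prime table size (assumed for this sketch)

    // Step 1 (assumed scheme): convert a key such as "555-1163" into an integer hash code
    // by dropping the hyphen and reading the digits as a number.
    static int hashCodeOf(String phoneNumber) {
        return Integer.parseInt(phoneNumber.replace("-", ""));
    }

    // Step 2: compress the hash code into the range 0 .. TABLE_SIZE - 1.
    static int hashIndexOf(String phoneNumber) {
        return hashCodeOf(phoneNumber) % TABLE_SIZE;
    }
}
```

Under these assumptions, the hypothetical keys "555-1163" and "555-1264" have different hash codes that differ by exactly 101, so they compress to the same index and collide.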
What is Hashing?
A collision caused by the hash function h
(for a table of size 101)
Hash Functions
General characteristics of a good hash
function
• Minimize collisions
• Distribute entries uniformly throughout
the hash table
• Be fast to compute
Computing Hash Codes
We will override the hashCode method of
Object
Guidelines
• If a class overrides the method equals, it
should override hashCode
• If the method equals considers two objects
equal, hashCode must return the same value
for both objects
• If an object invokes hashCode more than
once during the execution of a program, and its
data has not changed, it must return the same hash code
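A minimal sketch of a class that follows these guidelines is shown below. The class FullName and its fields are hypothetical, chosen only for illustration; java.util.Objects.hash is one convenient way to combine fields, not the only one.

```java
import java.util.Objects;

final class FullName {
    private final String first;
    private final String last;

    FullName(String first, String last) {
        this.first = first;
        this.last = last;
    }

    // Two names are equal when both parts are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof FullName)) return false;
        FullName that = (FullName) other;
        return first.equals(that.first) && last.equals(that.last);
    }

    // Built from the same fields that equals compares, so equal objects
    // always produce the same hash code.
    @Override
    public int hashCode() {
        return Objects.hash(first, last);
    }
}
```

Because the fields are final, repeated calls to hashCode on the same object return the same value.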
Computing Hash Codes
Hash code for a primitive type
• Use the primitive typed key itself
• Can cast types byte, short, or char to int
• Manipulate internal binary representations
(e.g. use folding)
• e.g., for long, casting to int would lose the high-order 32 bits
• but could divide into two 32-bit halves (by shifting),
then add or XOR (^) the results
• e.g., int hashCode = (int)(key ^ (key >> 32));
• For a search key of type double:
long bits = Double.doubleToLongBits(key);
int hashCode = (int)(bits ^ (bits >> 32));
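The folding expressions above can be wrapped in small helper methods for experimentation. The class and method names here are hypothetical; the bodies are the expressions from the slide.

```java
class PrimitiveHashCodes {
    // Fold a 64-bit long into 32 bits by XOR-ing its high and low halves.
    static int hashOfLong(long key) {
        return (int) (key ^ (key >> 32));
    }

    // Fold the 64-bit representation of a double in the same way.
    static int hashOfDouble(double key) {
        long bits = Double.doubleToLongBits(key);
        return (int) (bits ^ (bits >> 32));
    }
}
```

Only the low 32 bits survive the cast to int, so using the signed shift >> here should give the same result as the unsigned shift >>> that the library methods Long.hashCode and Double.hashCode use.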
Computing Hash Codes
For a string s with n characters having Unicode value u_i for
the ith character (e.g., u_0 u_1 u_2 … u_{n-1}) and positive constant g
(e.g., 31 in Java’s String class), the hash code could be:
u_0 g^{n-1} + u_1 g^{n-2} + … + u_{n-2} g + u_{n-1}
or
(…((u_0 g + u_1) g + u_2) g + … + u_{n-2}) g + u_{n-1} (Horner’s method)
e.g., public int hashCode() in Java class String
The hash code for a string, s
int hash = 0;
int n = s.length();
for (int i = 0; i < n; i++)
    hash = g * hash + s.charAt(i); // g is a positive constant
Note: hash could be negative due to overflow
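To check the loop against the library, it can be placed in a method that takes g as a parameter. The class and method names below are hypothetical.

```java
class StringHashDemo {
    // Horner's method: evaluates u_0 g^(n-1) + ... + u_(n-1) with one
    // multiplication per character. int arithmetic wraps around on overflow.
    static int hornerHash(String s, int g) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = g * hash + s.charAt(i);
        }
        return hash;
    }
}
```

With g = 31, hornerHash(s, 31) should agree with s.hashCode() for any string s, including long strings whose hash code has wrapped around to a negative value.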
Compressing a Hash Code
Must compress the hash code “c” so it fits
into the index range
Typical method is to compute c modulo n
• n is a prime number (the size of the table)
• Index will then be between 0 and n – 1
private int getHashIndex(Object key) {
    int hashIndex = key.hashCode() % hashTable.length;
    if (hashIndex < 0)
        hashIndex = hashIndex + hashTable.length;
    return hashIndex;
} // end getHashIndex
Note:
if c is non-negative, 0 <= c%n <= n-1
if c is negative, -(n-1) <= c%n <= 0
(if c%n is negative, add n to c%n so 1 <= result <= n-1)
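The sign handling can be tried in isolation with a version that takes the hash code and table size as parameters. The names below are hypothetical; Math.floorMod is a standard library alternative that always returns a non-negative result for a positive divisor.

```java
class CompressDemo {
    // Same logic as getHashIndex, with the hash code and table size passed in.
    static int compress(int hashCode, int tableSize) {
        int index = hashCode % tableSize; // may be negative in Java
        if (index < 0) {
            index += tableSize;           // shift into 1 .. tableSize - 1
        }
        return index;
    }

    // Alternative using the library's floor modulus.
    static int compressWithFloorMod(int hashCode, int tableSize) {
        return Math.floorMod(hashCode, tableSize);
    }
}
```

For example, -7 % 5 is -2 in Java, so compress(-7, 5) returns 3. Taking Math.abs of the hash code first is a tempting shortcut, but it fails for Integer.MIN_VALUE, whose absolute value does not fit in an int.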
Resolving Collisions
Options when the hash function returns a
location already used in the table
• Use another location in the table
(“open addressing”)
• Change the structure of the hash table so
that each array location can represent
multiple values
Open Addressing with Linear Probing
Open addressing scheme locates alternate
location
• New location must be open, available
Linear probing
• If collision occurs at hashTable[k], look
successively at location k + 1, k + 2, …
Open Addressing with Linear Probing
The effect of linear probing after adding four entries
whose search keys hash to the same index.
Open Addressing with Linear Probing
A revision of the hash table in the previous figure when
linear probing resolves collisions; each entry contains a
search key and its associated value
Removals
A hash table if remove used null to remove entries.
Removals
We need to distinguish among three kinds
of locations in the hash table
1. Occupied
• The location references an entry in the dictionary
2. Empty
• The location contains null and always did
3. Available
• The location's entry was removed from the dictionary
Open Addressing with Linear Probing
A linear probe sequence (a) after adding an entry; (b) after
removing two entries;
Open Addressing with Linear Probing
A linear probe sequence (c) after a search; (d) during the
search while adding an entry; (e) after an addition to a
formerly occupied location.
Searches that Dictionary Operations Require
To retrieve an entry
• Search the probe sequence for the key
• Examine entries that are present, ignore locations in
available state
• Stop search when key is found or null reached
To remove an entry
• Search the probe sequence same as for retrieval
• If key is found, mark location as available
To add an entry
• Search probe sequence same as for retrieval
• Note first available slot
• Use available slot if the key is not found
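The three searches can be sketched in Java as below. This is a simplified illustration, not the dictionary class developed in these slides: the class name, the parallel key and value arrays, and the AVAILABLE sentinel object are assumptions, and the table never resizes.

```java
class LinearProbingTable {
    private static final Object AVAILABLE = new Object(); // marks a removed entry
    private final Object[] keys;   // null = empty, AVAILABLE = removed, else occupied
    private final Object[] values;
    private int size;

    LinearProbingTable(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    private int hashIndex(Object key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    // Follows the probe sequence; returns the index of key, or -1 once null is reached.
    private int locate(Object key) {
        int index = hashIndex(key);
        for (int probes = 0; probes < keys.length; probes++) {
            Object k = keys[index];
            if (k == null) return -1;                        // empty location ends the search
            if (k != AVAILABLE && k.equals(key)) return index;
            index = (index + 1) % keys.length;               // skip available or non-matching
        }
        return -1;
    }

    Object getValue(Object key) {
        int index = locate(key);
        return index < 0 ? null : values[index];
    }

    Object remove(Object key) {
        int index = locate(key);
        if (index < 0) return null;
        Object oldValue = values[index];
        keys[index] = AVAILABLE;                             // mark, do not set to null
        values[index] = null;
        size--;
        return oldValue;
    }

    // Returns the old value if key was already present, otherwise null.
    Object add(Object key, Object value) {
        int index = hashIndex(key);
        int firstFree = -1;                                  // first available or empty slot seen
        for (int probes = 0; probes < keys.length; probes++) {
            Object k = keys[index];
            if (k == null) {
                if (firstFree < 0) firstFree = index;
                break;
            }
            if (k == AVAILABLE) {
                if (firstFree < 0) firstFree = index;
            } else if (k.equals(key)) {
                Object oldValue = values[index];
                values[index] = value;
                return oldValue;
            }
            index = (index + 1) % keys.length;
        }
        if (firstFree < 0) throw new IllegalStateException("hash table is full");
        keys[firstFree] = key;
        values[firstFree] = value;
        size++;
        return null;
    }

    int getSize() { return size; }
}
```

Note that add keeps searching past an available slot before using it; stopping at the first available slot could insert a duplicate of a key that sits later in the probe sequence.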
Open Addressing, Quadratic Probing
Change the probe sequence
• Given hash index k
• Probe to k + 1, k + 2^2, k + 3^2, …, k + j^2
Can reach any location in the hash table
if table size is a prime number and if
hash table is at most half full
Avoids primary clustering
• But can lead to secondary clustering
Open Addressing, Quadratic Probing
A probe sequence of length 5 using quadratic probing.
Note: for hash index k and table size n, we can improve
efficiency by using the recurrence relation
k_{i+1} = (k_i + 2i + 1) modulo n
for i >= 0 and k_0 = k.
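The recurrence works because (i + 1)^2 - i^2 = 2i + 1, so each probe is reached from the previous one with an addition instead of a squaring. A small sketch, with hypothetical names and assuming a non-negative hash index k:

```java
class QuadraticProbeDemo {
    // Returns the first `length` probe locations k_0, k_1, ... for hash index k
    // in a table of size n, using k_{i+1} = (k_i + 2i + 1) mod n.
    static int[] probeSequence(int k, int n, int length) {
        int[] sequence = new int[length];
        int index = k % n;
        for (int i = 0; i < length; i++) {
            sequence[i] = index;
            index = (index + 2 * i + 1) % n;
        }
        return sequence;
    }
}
```

For k = 3 and n = 11, the first five probes are 3, 4, 7, 1, 8, which matches (3 + i^2) mod 11 for i = 0 through 4.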
Open Addressing with Double Hashing
Resolves collision by examining locations
• At original hash index
• Plus an increment determined by 2nd function
Second hash function
• Different from first
• Depends on search key
• Returns nonzero value
Reaches every location in hash table if table size
is prime
Avoids both primary and secondary clustering
Open Addressing with Double Hashing
e.g., h1(key) = key modulo 7 and h2(key) = 5 - (key modulo 5)
The first three locations in a probe sequence generated
by double hashing for the search key. Note: sum of the
two hash functions must be computed modulo table size.
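The probe sequence for these two functions can be generated with a short sketch. The class name and the sample key 16 are assumptions for illustration; the sketch also assumes non-negative integer keys.

```java
class DoubleHashingDemo {
    static final int TABLE_SIZE = 7; // prime, matching h1's modulus

    static int h1(int key) { return key % TABLE_SIZE; }

    // Always in 1..5 for non-negative keys, so the increment is never zero.
    static int h2(int key) { return 5 - (key % 5); }

    // Returns the first `length` locations probed for key.
    static int[] probeSequence(int key, int length) {
        int[] sequence = new int[length];
        int index = h1(key);
        int step = h2(key);
        for (int i = 0; i < length; i++) {
            sequence[i] = index;
            index = (index + step) % TABLE_SIZE; // sum taken modulo the table size
        }
        return sequence;
    }
}
```

For the key 16, h1 gives 2 and h2 gives 4, so the probes begin 2, 6, 3 and visit all seven locations within seven steps.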
Separate Chaining
Alter the structure of the hash table
Each location can represent multiple
values
• Each location called a bucket
Bucket can be a(n)
• List
• Sorted list
• Chain of linked nodes
• Array
• Vector
Separate Chaining
A hash table for use with separate chaining; each bucket
is a chain of linked nodes.
Separate Chaining
Where new entry is inserted into linked bucket when
integer search keys are (a) duplicate and unsorted;
Separate Chaining
Where new entry is inserted into linked bucket when
integer search keys are (b) distinct and unsorted;
Separate Chaining
Where new entry is inserted into linked bucket when
integer search keys are (c) distinct and sorted
Pseudo-code for Chaining Algorithms
(for distinct search keys and unsorted chains)
//Algorithm add(key, value)
index = getHashIndex(key)
if (hashTable[index] == null) {
    hashTable[index] = new Node(key, value)
    currentSize++
    return null
}
else {
    Search chain that begins at hashTable[index] for a node that contains key
    if (key is found) {
        // assume currentNode references the node that contains key
        oldValue = currentNode.getValue()
        currentNode.setValue(value)
        return oldValue
    }
    else {
        // add new node to end of chain
        // assume nodeBefore references the last node
        newNode = new Node(key, value)
        nodeBefore.setNextNode(newNode)
        currentSize++
        return null
    }
}
Pseudo-code for Chaining Algorithms (continued)
//Algorithm remove(key)
index = getHashIndex(key)
Search chain that begins at hashTable[index] for node that contains key
if (key is found) {
    Remove node that contains key from chain
    currentSize--
    return value in removed node
}
else
    return null
//Algorithm getValue(key)
index = getHashIndex(key)
Search chain that begins at hashTable[index] for node that contains key
if (key is found)
    return value in found node
else
    return null
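One way the chaining algorithms might look in Java is sketched below, for distinct search keys and unsorted chains. The class name, the nested Node class, and the fixed table size are assumptions; the sketch does not rehash.

```java
class ChainedHashTable {
    // One link in a bucket's chain.
    private static final class Node {
        final Object key;
        Object value;
        Node next;
        Node(Object key, Object value) { this.key = key; this.value = value; }
    }

    private final Node[] hashTable;
    private int currentSize;

    ChainedHashTable(int tableSize) {
        hashTable = new Node[tableSize];
    }

    private int getHashIndex(Object key) {
        return Math.floorMod(key.hashCode(), hashTable.length);
    }

    // Replaces the value if key is present (returning the old value);
    // otherwise appends a new node to the end of the chain and returns null.
    Object add(Object key, Object value) {
        int index = getHashIndex(key);
        Node nodeBefore = null;
        for (Node current = hashTable[index]; current != null; current = current.next) {
            if (current.key.equals(key)) {
                Object oldValue = current.value;
                current.value = value;
                return oldValue;
            }
            nodeBefore = current;
        }
        Node newNode = new Node(key, value);
        if (nodeBefore == null) hashTable[index] = newNode; // bucket was empty
        else nodeBefore.next = newNode;
        currentSize++;
        return null;
    }

    // Unlinks the node containing key and returns its value, or null if absent.
    Object remove(Object key) {
        int index = getHashIndex(key);
        Node nodeBefore = null;
        Node current = hashTable[index];
        while (current != null && !current.key.equals(key)) {
            nodeBefore = current;
            current = current.next;
        }
        if (current == null) return null;
        if (nodeBefore == null) hashTable[index] = current.next;
        else nodeBefore.next = current.next;
        currentSize--;
        return current.value;
    }

    Object getValue(Object key) {
        for (Node n = hashTable[getHashIndex(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }

    int getSize() { return currentSize; }
}
```

Unlike open addressing, removal here needs no "available" marker, because unlinking a node does not break the search path for the other keys in the bucket.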
Efficiency Observations
Successful retrieval or removal
• Same efficiency as successful search
Unsuccessful retrieval or removal
• Same efficiency as unsuccessful search
Successful addition
• Same efficiency as unsuccessful search
Unsuccessful addition
• Same efficiency as successful search
Load Factor
Perfect hash function not always possible
or practical
• Thus, collisions likely to occur
As hash table fills
• Collisions occur more often
Measure for table fullness, the load factor*
λ = (number of entries in the dictionary) / (number of locations in the hash table)
Note: max value for load factor depends on type of collision resolution used; for
separate chaining, there is no maximum value…
*for open addressing
Cost of Open Addressing
(Linear Probing)
(1/2)[1 + 1/(1 - λ)^2] for an unsuccessful search
(1/2)[1 + 1/(1 - λ)] for a successful search
The average number of comparisons required by a search of the hash
table for given values of the load factor when using linear probing.
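The two linear-probing formulas can be evaluated directly to see how quickly the cost grows as the table fills. The class and method names are hypothetical.

```java
class LinearProbingCost {
    // Average comparisons for an unsuccessful search: (1/2)[1 + 1/(1 - lambda)^2]
    static double unsuccessful(double lambda) {
        return 0.5 * (1 + 1 / ((1 - lambda) * (1 - lambda)));
    }

    // Average comparisons for a successful search: (1/2)[1 + 1/(1 - lambda)]
    static double successful(double lambda) {
        return 0.5 * (1 + 1 / (1 - lambda));
    }
}
```

At λ = 0.5 the formulas give 2.5 and 1.5 comparisons; at λ = 0.9 they give about 50.5 and 5.5, which is why linear probing is usually kept well below a full table.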
Cost of Open Addressing
(Quadratic Probing or Double Hashing)
1/(1-λ) for an unsuccessful search
(1/λ)log[1/(1-λ)] for a successful search
Note: for quadratic probing or double hashing, should have λ < 0.5
The average number of comparisons required by a search
of the hash table for given values of the load factor when
using either quadratic probing or double hashing.
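These formulas can be evaluated in the same way. The names below are hypothetical, and the sketch assumes that log in the successful-search formula denotes the natural logarithm.

```java
class DoubleHashingCost {
    // Average comparisons for an unsuccessful search: 1/(1 - lambda)
    static double unsuccessful(double lambda) {
        return 1 / (1 - lambda);
    }

    // Average comparisons for a successful search: (1/lambda) ln[1/(1 - lambda)]
    // (natural logarithm assumed)
    static double successful(double lambda) {
        return (1 / lambda) * Math.log(1 / (1 - lambda));
    }
}
```

Under that assumption, a half-full table (λ = 0.5) costs 2 comparisons on average for an unsuccessful search and about 1.39 for a successful one.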
Cost of Separate Chaining
Note: Reasonable efficiency requires only λ < 1
Average number of comparisons required by search of hash table for
given values of load factor when using separate chaining. Note that the
load factor here is the # of dictionary entries / # of chains (i.e., the load
factor is the average # of dictionary entries per chain).
Rehashing
When load factor becomes too large
• Expand the hash table
Double the present size, then increase the
result to the next prime number
Place the current entries into the new hash
table (i.e., recompute the index for each
entry using the new table size)
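Choosing the new table size can be sketched as follows. The class and method names are hypothetical, and trial division is used only because table sizes are small enough for it to be fast.

```java
class RehashSize {
    // Trial-division primality test; adequate for table sizes.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Doubles the current size, then moves up to the next prime.
    static int nextTableSize(int currentSize) {
        int candidate = 2 * currentSize;
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }
}
```

For a table of size 101, doubling gives 202 and the next prime is 211. Each entry must then be re-added using the new size, because an index computed as hash code modulo 101 is generally different from the same hash code modulo 211.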
Comparing Schemes for Collision Resolution
Average number of comparisons required by search of hash table
versus λ for 4 techniques when search is
(a) successful;
(b) unsuccessful.
A Dictionary Implementation
That Uses Hashing
A hash table and one of its entry objects
A Dictionary Implementation
That Uses Hashing
Beginning of private class TableEntry
• Made internal to dictionary class
private class TableEntry implements java.io.Serializable
{
    private Object entryKey;
    private Object entryValue;
    private boolean inTable; // true if entry is in hash table

    private TableEntry(Object key, Object value)
    {
        entryKey = key;
        entryValue = value;
        inTable = true;
    } // end constructor
    ...
A Dictionary Implementation That
Uses Hashing
A hash table containing dictionary entries, removed
entries, and null values.
Java Class Library: The Class HashMap
Assumes search-key objects belong to a
class that overrides methods hashCode
and equals
Hash table is collection of buckets
Constructors
• public HashMap()
• public HashMap (int initialSize)
• public HashMap (int initialSize,
float maxLoadFactor)
• public HashMap (Map table)
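A brief usage sketch of the four constructors follows. The class name, the sample names, and the phone numbers are made up for illustration; String keys are used because String already overrides hashCode and equals.

```java
import java.util.HashMap;
import java.util.Map;

class HashMapDemo {
    static Map<String, String> buildPhoneBook() {
        // Default constructor: the library chooses the initial size and load factor.
        Map<String, String> book = new HashMap<>();
        book.put("Ann", "555-1163");

        // Explicit initial size.
        Map<String, String> sized = new HashMap<>(101);
        sized.put("Bob", "555-1264");

        // Explicit initial size and maximum load factor.
        Map<String, String> tuned = new HashMap<>(101, 0.5f);
        tuned.putAll(book);
        tuned.putAll(sized);

        // Copy constructor: a new map containing the same entries.
        return new HashMap<>(tuned);
    }
}
```

Calling buildPhoneBook().get("Ann") returns "555-1163", and looking up a name that was never added returns null.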