Hashing

Lecture 17
Chapter 5, Hashing
• dictionary operations
• general idea of hashing
• hash functions
• chaining
• closed hashing
April 11, 11
Dictionary operations
o search
o insert
o delete
Applications:
• database search
• books in a library
• patient records, GIS data etc.
• web page caching (web search)
• combinatorial search (game tree)
Dictionary operations
[Table: cost of Search, Insert and Delete for four implementations: unsorted array, sorted array, unsorted linked list and sorted linked list. Costs count comparisons and data movements combined, assuming keys can be compared with <, > and = outcomes. Of the twelve entries, ten are O(n), one is O(1), and one is O(log n) (search in a sorted array, by binary search).]
Exercise: Create a similar table separately for data movements and
for comparisons.
Performance goal for dictionary operations:
O(n) is too inefficient.
Goal
• O(log n) on average
• O(log n) in the worst-case
• O(1) on average
Data structures that achieve these goals:
(a) binary search tree
(b) balanced binary search tree (AVL tree)
(c) hashing. (but worst-case is O(n))
Hashing
o An important and widely useful technique
for implementing dictionaries.
o Constant time per operation (on the
average).
o Worst case time proportional to the size of
the set for each operation (just like array and
linked list implementation)
General idea
U = Set of all possible keys: (e.g. 9 digit SS #)
If n = |U| is not very large, a simple way to
support dictionary operations is:
map each key e in U to a unique integer h(e)
in the range 0 .. n – 1.
Boolean array H[0 .. n – 1] to store keys.
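The ideal case above can be sketched as a direct-address table. This is a minimal illustration that assumes keys have already been mapped to integers 0 .. n – 1; the class name DirectAddressSet is made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct-address dictionary: one Boolean slot per possible key.
// Assumes each key has already been mapped to an integer in 0 .. n-1.
class DirectAddressSet {
public:
    explicit DirectAddressSet(std::size_t n) : present(n, false) {}

    bool search(std::size_t k) const { assert(k < present.size()); return present[k]; }
    void insert(std::size_t k)       { assert(k < present.size()); present[k] = true; }
    void remove(std::size_t k)       { assert(k < present.size()); present[k] = false; }

private:
    std::vector<bool> present;   // plays the role of H[0 .. n-1]
};
```

Every operation is a single array access, which is why this scheme is attractive when |U| is small enough to allocate.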
General idea
Ideal case not realistic
• U the set of all possible keys is usually very large so we
can’t create an array of size n = |U|.
• Create an array H of size m much smaller than n.
• The number of keys actually present at any time is usually much
smaller than n.
• A mapping h: U -> {0, 1, …, m – 1} is called a hash
function.
Example: D = students currently enrolled in courses, U =
set of all SS #’s, hash table of size = 1000
Hash function h(x) = last three digits.
Example (continued)
Insert Student “Dan” SS# = 1238769871
h(1238769871) = 871
[Figure: hash table with buckets 0, 1, 2, 3, ..., 999; bucket 871 points to Dan, followed by NULL.]
Example (continued)
Insert Student “Tim” SS# = 1872769871
h(1872769871) = 871, same as that of Dan.
Collision
[Figure: hash table with buckets 0, 1, 2, 3, ..., 999; bucket 871 already holds Dan, followed by NULL.]
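The example hash function can be written in one line. A small sketch, assuming the numbers are stored as 64-bit integers (the two sample values do not fit in a 32-bit int); the function name lastThreeDigits is chosen here for illustration.

```cpp
#include <cassert>

// Hash function from the example: keep the last three decimal digits,
// which maps every key into the range 0 .. 999.
int lastThreeDigits(long long ssn) {
    assert(ssn >= 0);
    return static_cast<int>(ssn % 1000);
}
```

Both 1238769871 and 1872769871 end in 871, so the two students land in the same bucket.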
Hash Functions
If h(k1) = h(k2) = b, then k1 and k2 have a collision at slot b.
There are two approaches to resolve collisions.
Collision Resolution Policies
Two ways to resolve:
(1) Open hashing, also known as separate
chaining
(2) Closed hashing, a.k.a. open addressing
Chaining: keys that collide are stored in a linked
list.
Previous Example:
Insert Student “Tim” SS# = 1872769871
h(1872769871) = 871, same as that of Dan.
Collision
[Figure: bucket 871 of the hash table (buckets 0 .. 999) now points to a linked list holding both Dan and Tim, terminated by NULL.]
Open Hashing
Each entry of the hash table is a pointer to the head of a
linked list
All elements that hash to a particular bucket
are placed on that bucket’s linked list
Records within a bucket can be ordered in
several ways
by order of insertion, by key value order, or by
frequency of access order
Open Hashing Data Organization
[Figure: an array of D bucket headers, numbered 0 .. D-1, each pointing to its own linked list.]
Implementation of open hashing - search
bool contains( const HashedObj & x )
{
    // Bind by reference so the bucket's list is not copied on every search.
    const list<HashedObj> & whichList = theLists[ myhash( x ) ];
    return find( whichList.begin( ), whichList.end( ), x ) != whichList.end( );
}
find is a function template declared in the STL header <algorithm>. A
possible implementation is:
template<class InputIterator, class T>
InputIterator find( InputIterator first, InputIterator last,
                    const T& value )
{
    for( ; first != last; ++first )
        if( *first == value ) break;
    return first;
}
Implementation of open hashing - insert
bool insert( const HashedObj & x )
{
    // Must be a reference: pushing onto a copy would leave the table unchanged.
    list<HashedObj> & whichList = theLists[ myhash( x ) ];
    if( find( whichList.begin( ), whichList.end( ), x ) != whichList.end( ) )
        return false;
    whichList.push_back( x );
    return true;
}
The new key is inserted at the end of the list.
Implementation of open hashing - delete
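Deletion follows the same pattern as search and insert: find the key in its bucket's list and erase it. The class below is a self-contained sketch for string keys, not the textbook class; the name ChainedTable and the default of 101 buckets are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Minimal separate-chaining table for strings (illustrative sketch).
class ChainedTable {
public:
    explicit ChainedTable(std::size_t buckets = 101) : theLists(buckets) {
        assert(buckets > 0);
    }

    bool contains(const std::string& x) const {
        const std::list<std::string>& whichList = theLists[myhash(x)];
        return std::find(whichList.begin(), whichList.end(), x) != whichList.end();
    }

    bool insert(const std::string& x) {
        std::list<std::string>& whichList = theLists[myhash(x)];
        if (std::find(whichList.begin(), whichList.end(), x) != whichList.end())
            return false;                 // already present
        whichList.push_back(x);
        return true;
    }

    // Delete: locate x in its bucket's list and unlink it.
    bool remove(const std::string& x) {
        std::list<std::string>& whichList = theLists[myhash(x)];
        auto itr = std::find(whichList.begin(), whichList.end(), x);
        if (itr == whichList.end())
            return false;                 // not found, nothing to delete
        whichList.erase(itr);
        return true;
    }

private:
    std::vector<std::list<std::string>> theLists;   // one list per bucket

    std::size_t myhash(const std::string& key) const {
        std::size_t hashVal = 0;
        for (unsigned char ch : key)
            hashVal = 37 * hashVal + ch;  // unsigned arithmetic wraps around
        return hashVal % theLists.size();
    }
};
```

As with search, the cost of remove is proportional to the length of the one list that is scanned.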
Choice of hash function
A good hash function should:
• be easy to compute
• distribute the keys uniformly to the buckets
• use all the fields of the key object.
Example: key is a string over {a, …, z, 0, … 9, _ }
Suppose hash table size is n = 10007.
(Choose table size to be a prime number.)
Good hash function: interpret the string as a number to
base 37 and compute mod 10007.
h(“word”) = ? “w” = 23, “o” = 15, “r” = 18 and “d” = 4.
h(“word”) = (23 * 37^3 + 15 * 37^2 + 18 * 37^1 + 4) %
10007
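The arithmetic for "word" can be checked with a few lines of code. This sketch assumes the letter coding a = 1, ..., z = 26, which matches the values 23, 15, 18 and 4 given above; the codes for digits and '_' are not stated, so only lowercase letters are accepted here. The function name hashBase37 is hypothetical.

```cpp
#include <cassert>
#include <string>

// Base-37 string hash from the example.
// Assumption: 'a' = 1, ..., 'z' = 26; other characters are rejected
// because their codes are not given in the text.
int hashBase37(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    long long value = 0;
    for (char ch : key) {
        assert(ch >= 'a' && ch <= 'z');
        // Multiply the running value by the base, add the next digit,
        // and reduce so the value never grows large.
        value = (value * 37 + (ch - 'a' + 1)) % tableSize;
    }
    return static_cast<int>(value);
}
```

For "word" the unreduced value is 23 * 37^3 + 15 * 37^2 + 18 * 37 + 4 = 1186224, and 1186224 % 10007 = 5398, which is what hashBase37("word", 10007) returns.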
Computing hash function for a string
Horner’s rule:
( … ((a_0 x + a_1) x + a_2) x + … + a_{n-2}) x + a_{n-1}
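unsigned int hash( const string & key )
{
    // Unsigned arithmetic wraps around on overflow; signed overflow is undefined.
    unsigned int hashVal = 0;
    for( size_t i = 0; i < key.length( ); i++ )
        hashVal = 37 * hashVal + key[ i ];
    return hashVal;
}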
Computing hash function for a string
int myhash( const HashedObj & x ) const
{
    int hashVal = hash( x );
    hashVal %= theLists.size( );
    return hashVal;
}
Alternatively, we can apply % theLists.size() after each iteration of
the loop in hash function.
int myHash( const string & key )
{
    int hashVal = 0;
    int s = theLists.size( );
    for( int i = 0; i < key.length( ); i++ )
        hashVal = ( 37 * hashVal + key[ i ] ) % s;
    return hashVal % s;
}
Analysis of open hashing/chaining
Open hashing uses more memory than open addressing (because
of pointers), but is generally more efficient in terms of time.
If the keys arriving are random and the hash function is good, keys
will be nicely distributed to different buckets and so each list will
be roughly the same size.
Let n = the number of keys present in the hash table.
m = the number of buckets (lists) in the hash table.
If there are n elements in the set, then each bucket will have roughly
n/m elements.
If we can estimate n and choose m to be ~ n, then the average
bucket size will be about 1. (Most buckets will have a small number of
items.)
Analysis continued
Average time per dictionary operation:
m buckets, n elements in dictionary -> average n/m
elements per bucket
n/m = λ is called the load factor.
insert, search, remove operations take O(1 + n/m) =
O(1 + λ) time each (1 for the hash function
computation)
If we can choose m ~ n, constant time per operation on
average. (Assuming each element is equally likely to be
hashed to any bucket, the running time is constant,
independent of n.)
Closed Hashing
Associated with closed hashing is a rehash strategy:
“If we try to place x in bucket h(x) and find it
occupied, find alternative location h1(x), h2(x), etc.
Try successively until all the cells have been probed.
If this happens, then the hash table is full.”
h(x) is called home bucket
Simplest rehash strategy is called linear hashing
hi(x) = (h(x) + i) % D
In general, the collision resolution strategy is to
generate a sequence of hash table addresses
(probe sequence); test each slot until you find an
empty one (probing)
Closed Hashing
Example: m = 8, keys a, b, c, d have hash values h(a) = 3, h(b) = 0,
h(c) = 4, h(d) = 3
Where do we insert d? Bucket 3 is already filled.
Probe sequence using linear hashing:
h1(d) = (h(d)+1)%8 = 4%8 = 4
h2(d) = (h(d)+2)%8 = 5%8 = 5*   (first empty bucket, so d goes here)
h3(d) = (h(d)+3)%8 = 6%8 = 6
Etc.
The probe sequence wraps around to the beginning of the table.
Table after inserting d:
bucket:  0  1  2  3  4  5  6  7
key:     b  -  -  a  c  d  -  -
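The example can be replayed in code. A minimal sketch of insertion with linear probing, where each key is supplied together with its home bucket; the char-valued table and the '\0' marker for an empty bucket are choices made for this illustration.

```cpp
#include <cassert>
#include <vector>

// Insert key into a table of D buckets using linear probing.
// home is h(key). Returns the bucket used, or -1 if the table is full.
int insertLinear(std::vector<char>& table, char key, int home) {
    const int D = static_cast<int>(table.size());
    assert(home >= 0 && home < D);
    for (int i = 0; i < D; ++i) {
        int slot = (home + i) % D;        // h_i(x) = (h(x) + i) % D
        if (table[slot] == '\0') {        // empty bucket found
            table[slot] = key;
            return slot;
        }
    }
    return -1;                            // every bucket probed: table is full
}
```

Inserting a, b, c with home buckets 3, 0, 4 into an empty table of size 8 and then d with home bucket 3 places d in bucket 5.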
Operations Using Linear Hashing
• Test for membership: search
• Examine h(k), h1(k), h2(k), …, until we find k or
an empty bucket or home bucket
case 1: successful search -> return true
case 2: unsuccessful search -> false
case 3: unsuccessful search and table is full
• If deletions are not allowed, strategy works!
• What if deletions?
Dictionary Operations with Linear Hashing
• What if deletions?
If we reach empty bucket, cannot be sure that k is
not somewhere else and empty bucket was
occupied when k was inserted
• Need special placeholder deleted, to distinguish
bucket that was never used from one that once
held a value
Implementation of closed hashing
Code slightly modified from the text.
// CONSTRUCTION: an approximate initial size or default of 101
//
// ******************PUBLIC OPERATIONS*********************
// bool insert( x )       --> Insert x
// bool remove( x )       --> Remove x
// bool contains( x )     --> Return true if x is present
// void makeEmpty( )      --> Remove all items
// int hash( string str ) --> Global method to hash strings
There is no distinction between hash function used in closed
hashing and open hashing. (I.e., they can be used in either context
interchangeably.)
template <typename HashedObj>
class HashTable
{
  public:
    explicit HashTable( int size = 101 ) : array( nextPrime( size ) )
      { makeEmpty( ); }

    bool contains( const HashedObj & x )
      { return isActive( findPos( x ) ); }

    void makeEmpty( )
    {
        currentSize = 0;
        for( int i = 0; i < array.size( ); i++ )
            array[ i ].info = EMPTY;
    }

    bool insert( const HashedObj & x )
    {
        int currentPos = findPos( x );
        if( isActive( currentPos ) )
            return false;
        array[ currentPos ] = HashEntry( x, ACTIVE );
        if( ++currentSize > array.size( ) / 2 )
            rehash( );                  // rehash when load factor exceeds 0.5
        return true;
    }

    bool remove( const HashedObj & x )
    {
        int currentPos = findPos( x );
        if( !isActive( currentPos ) )
            return false;
        array[ currentPos ].info = DELETED;   // leave a placeholder, not EMPTY
        return true;
    }

    enum EntryType { ACTIVE, EMPTY, DELETED };

  private:
    struct HashEntry
    {
        HashedObj element;
        EntryType info;
        HashEntry( const HashedObj & e = HashedObj( ), EntryType i = EMPTY )
          : element( e ), info( i ) { }
    };

    vector<HashEntry> array;
    int currentSize;

    bool isActive( int currentPos ) const
      { return array[ currentPos ].info == ACTIVE; }

    // Linear probing as written. For double hashing, initialize offset to
    // s_hash( x ); for quadratic probing, add "offset += 2;" after advancing.
    int findPos( const HashedObj & x )
    {
        int offset = 1;
        int currentPos = myhash( x );
        while( array[ currentPos ].info != EMPTY &&
               array[ currentPos ].element != x )
        {
            currentPos += offset;       // compute ith probe
            if( currentPos >= array.size( ) )
                currentPos -= array.size( );
        }
        return currentPos;
    }

    // myhash, rehash and nextPrime are defined as in the text (not shown).
};
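To see why the DELETED placeholder matters, here is a compact, self-contained sketch of a linear-probing table with lazy deletion. It is a simplification rather than the class from the text: it stores non-negative ints, uses h(x) = x % capacity, never rehashes, and relies on the caller to keep at least one bucket unused. The name ProbingTable is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear probing with lazy deletion (illustrative sketch, fixed capacity).
class ProbingTable {
public:
    explicit ProbingTable(std::size_t capacity) : slots(capacity), used(0) {
        assert(capacity > 1);
    }

    bool contains(int x) const { return slots[findPos(x)].info == ACTIVE; }

    bool insert(int x) {
        std::size_t pos = findPos(x);
        if (slots[pos].info == ACTIVE) return false;   // already present
        if (slots[pos].info == EMPTY) {
            assert(used + 1 < slots.size());           // keep one EMPTY slot
            ++used;
        }
        slots[pos].element = x;
        slots[pos].info = ACTIVE;
        return true;
    }

    bool remove(int x) {
        std::size_t pos = findPos(x);
        if (slots[pos].info != ACTIVE) return false;
        slots[pos].info = DELETED;                     // placeholder, not EMPTY
        return true;
    }

private:
    enum EntryType { ACTIVE, EMPTY, DELETED };
    struct Entry {
        int element;
        EntryType info;
        Entry() : element(0), info(EMPTY) {}
    };

    std::vector<Entry> slots;
    std::size_t used;   // number of slots that are ACTIVE or DELETED

    // Probe from the home bucket until x or a never-used slot is found.
    std::size_t findPos(int x) const {
        assert(x >= 0);
        std::size_t pos = static_cast<std::size_t>(x) % slots.size();
        while (slots[pos].info != EMPTY && slots[pos].element != x)
            pos = (pos + 1) % slots.size();
        return pos;
    }
};
```

With capacity 11, the keys 7, 18 and 29 all have home bucket 7 and occupy buckets 7, 8 and 9. After remove(18), a search for 29 still succeeds because the probe passes over the DELETED bucket instead of stopping there.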
Performance Analysis - Worst Case
• Initialization: O(m), m = # of buckets
• Insert and search: O(n), n number of elements
currently in the table
– Suppose there are close to n elements in the table
that form a chain. Now want to search x, and say x
is not in the table. It may happen that h(x) = start
address of a very long chain. Then, it will take O(c)
time to conclude failure. c ~ n.
• No better than an unsorted array.
Example
I. Table of size 11, h(k) = k % 11:
bucket:  0     1     2     3  4  5  6  7     8     9     10
key:     1001  9537  3016  -  -  -  -  9874  2009  9875  -
1. What if the next element has home bucket 0 (h(k) = k % 11 = 0)?
   -> It goes to bucket 3.
   The same holds for elements with home bucket 1 or 2.
   A record with home bucket 3 stays in bucket 3.
   -> p = 4/11 that the next record will go to bucket 3.
2. Similarly, records hashing to 7, 8, 9 will end up in 10
   (as will records hashing to 10 itself, so p = 4/11).
3. Only records hashing to 4 will end up in 4 (p = 1/11); same for 5 and 6.
II. Insert 1052 (home bucket 7):
bucket:  0     1     2     3  4  5  6  7     8     9     10
key:     1001  9537  3016  -  -  -  -  9874  2009  9875  1052
The next element goes to bucket 3 with p = 8/11.
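The probabilities in this example can be recomputed mechanically. The sketch below counts, for every bucket, how many home positions would send a new key there under linear probing; dividing a count by the table size gives the probability, assuming all home buckets are equally likely. The function name landingCounts is hypothetical.

```cpp
#include <cassert>
#include <vector>

// counts[b] = number of home buckets whose linear probe sequence
// reaches b as its first empty bucket.
std::vector<int> landingCounts(const std::vector<bool>& occupied) {
    const int m = static_cast<int>(occupied.size());
    std::vector<int> counts(m, 0);
    for (int home = 0; home < m; ++home) {
        for (int i = 0; i < m; ++i) {
            int slot = (home + i) % m;
            if (!occupied[slot]) {
                ++counts[slot];
                break;
            }
        }
    }
    return counts;
}
```

For table I (buckets 0, 1, 2, 7, 8, 9 occupied) the counts for buckets 3, 4, 5, 6, 10 are 4, 1, 1, 1, 4. Once bucket 10 is also filled, the count for bucket 3 rises to 8, which is the 8/11 in part II.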
Performance Analysis - Average Case
• Distinguish between successful and
unsuccessful searches
• Delete = successful search for record to be
deleted
• Insert = unsuccessful search along its probe
sequence
• Expected cost of hashing is a function of
how full the table is: load factor λ = n/m
Random probing model vs. linear probing
model
• It can be shown that average costs under linear
hashing (probing), where λ is the load factor, are:
• Insertion: (1/2)(1 + 1/(1 – λ)^2)
• Deletion: (1/2)(1 + 1/(1 – λ))
• Random probing: Suppose we use the following
approach: we create a sequence of hash functions h1,
h2, … all of which are independent of each other.
• insertion: 1/(1 – λ)
• deletion: (1/λ) ln(1/(1 – λ))
Random probing – analysis of insertion (unsuccessful
search)
What is the expected number of times one should roll a
die before getting 4?
Answer: 6 (probability of success = 1/6.)
More generally, if the probability of success = p,
expected number of times you repeat until you succeed
is 1/p.
If the current load factor = λ, then the probability of
success = 1 – λ, since the proportion of empty slots is 1 – λ.
The expected number of probes for an insertion is therefore 1/(1 – λ).
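To get a feel for these formulas, they can be evaluated for a given load factor. A small sketch; the function names are made up here, and the logarithm in the random-probing deletion cost is taken to be the natural logarithm.

```cpp
#include <cassert>
#include <cmath>

// Expected number of probes as a function of the load factor lf, 0 < lf < 1.
double linearInsert(double lf) {
    assert(lf > 0.0 && lf < 1.0);
    return 0.5 * (1.0 + 1.0 / ((1.0 - lf) * (1.0 - lf)));
}

double linearDelete(double lf) {
    assert(lf > 0.0 && lf < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - lf));
}

double randomInsert(double lf) {
    assert(lf > 0.0 && lf < 1.0);
    return 1.0 / (1.0 - lf);
}

double randomDelete(double lf) {
    assert(lf > 0.0 && lf < 1.0);
    return std::log(1.0 / (1.0 - lf)) / lf;   // natural logarithm
}
```

At load factor 0.5 the values are 2.5 and 1.5 for linear probing, and 2 and about 1.386 for random probing. The gap widens quickly as the load factor approaches 1.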
Improved Collision Resolution
• Linear probing: hi(x) = (h(x) + i) % D
• all buckets in table will be candidates for inserting a new
record before the probe sequence returns to home position
• clustering of records, leads to long probing sequence
• Linear probing with increment c > 1: hi(x) = (h(x) + ic) % D
• c constant other than 1
• records with adjacent home buckets will not follow same
probe sequence
• Double hashing: hi(x) = (h(x) + i g(x)) % D
• g is another hash function that is used as the increment
amount.
• Avoids clustering problems associated with linear probing.
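The three strategies differ only in the step added at each probe, which the sketch below makes explicit. The second hash function g shown here, g(x) = 1 + x % (D – 1), is a hypothetical choice for integer keys; the slides do not specify one. It is never 0, so the probe always moves.

```cpp
#include <cassert>

// i-th probe position with a fixed step in a table of D buckets:
// step = 1 is linear probing, step = c is linear probing with increment c,
// and step = g(x) is double hashing.
int probe(int home, int i, int step, int D) {
    assert(D > 0 && home >= 0 && home < D && i >= 0 && step > 0);
    return (home + i * step) % D;
}

// Hypothetical second hash function for non-negative integer keys.
int g(int key, int D) {
    assert(D > 1 && key >= 0);
    return 1 + key % (D - 1);
}
```

With D = 11, the keys 7 and 18 share home bucket 7. Linear probing sends both to bucket 8 next, while double hashing with this g sends 7 to bucket 4 and 18 to bucket 5, so the two keys no longer follow the same probe sequence.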
Comparison with Closed Hashing
• Worst case performance is O(n) for both. Average
case is a small constant in both cases when the load factor is small.
• Closed hashing – uses less space.
• Open hashing – behavior is not sensitive to load
factor. Also no need to resize the table since memory
is dynamically allocated.
Proof for the case of insertion: 1/(1 – λ)
Recall: a geometric distribution involves a sequence of
independent random experiments, each with outcome
success (with prob = p) or failure (with prob = 1 – p).
We repeat the experiment until we get success.
The question is: what is the expected number of trials
performed?
Answer: 1/p
Probes are assumed to be independent. In the case of insertion,
success means finding an empty slot, so the probability of
success is 1 – λ, where λ is the load factor.
Thus, the expected number of probes = 1/(1 – λ)
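The answer 1/p can be checked from the definition of expectation. A short derivation, writing q = 1 – p and using the standard series sum of k q^(k-1) = 1/(1 – q)^2 for |q| < 1; here X is the number of trials up to and including the first success, and λ denotes the load factor:

```latex
\begin{align*}
E[X] &= \sum_{k=1}^{\infty} k \, p \, q^{k-1}
      = p \cdot \frac{1}{(1-q)^{2}}
      = \frac{p}{p^{2}}
      = \frac{1}{p}, \\
p = 1 - \lambda \;&\Longrightarrow\; E[X] = \frac{1}{1-\lambda}.
\end{align*}
```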
[Figure: two plots, "Successful search" and "Unsuccessful search", each showing the average number of probes (0 to 20) against the load factor (0 to 1) for linear probing, double hashing and separate chaining.]
Another hash function - Multiplication Method
$$h(key) = \lfloor m \, (key \cdot A \bmod 1) \rfloor, \quad \text{where } 0 < A < 1$$
We choose m to be a power of 2 ($m = 2^p$) and
$$A = \frac{\sqrt{5} - 1}{2} = 0.61803398\ldots$$
For example, key = 12345, m = 512 then:
$$h(key) = \lfloor 512 \cdot (12345 \cdot 0.618\ldots \bmod 1) \rfloor = \lfloor 512 \cdot (7629.62959 \bmod 1) \rfloor = \lfloor 512 \cdot 0.62959 \rfloor = \lfloor 322.35 \rfloor = 322$$
Multiplication Method: Implementation
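[Figure: the w-bit key is multiplied by the w-bit value A·2^w, giving a 2w-bit product made of a high-order word and a low-order word; h(key) is obtained by extracting p bits from the product.]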
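Both forms of the multiplication method can be sketched in a few lines. The floating-point version follows the formula directly; the fixed-point version assumes w = 32, m = 2^p, and that the p bits are taken from the top of the low-order word of the product. The function names are made up for this example, and 2654435769 is floor(A · 2^32) for A = (√5 – 1)/2.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Floating-point form: h(key) = floor(m * (key * A mod 1)).
int multHashFloat(std::uint32_t key, int m) {
    assert(m > 0);
    const double A = (std::sqrt(5.0) - 1.0) / 2.0;
    const double product = key * A;
    const double frac = product - std::floor(product);   // key * A mod 1
    return static_cast<int>(std::floor(m * frac));
}

// Fixed-point form for w = 32 and m = 2^p (assumptions of this sketch).
std::uint32_t multHashBits(std::uint32_t key, unsigned p) {
    assert(p >= 1 && p <= 32);
    const std::uint32_t s = 2654435769u;   // floor(A * 2^32)
    const std::uint32_t low = key * s;     // low-order word: product mod 2^32
    return low >> (32 - p);                // keep the top p bits of that word
}
```

For key 12345 with m = 512 (p = 9), both functions return 322. The fixed-point version avoids floating-point arithmetic entirely: one multiplication and one shift.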