Hashing (Ch. III.11)

Data Structure := (Set; Operations-on-Set)

Examples:
Table = ({items}, {Insert, Delete, Search})
List  = ({items}, {Insert, Delete, Search})
Stack = ({items}, {Push, Pop})
Queue = ({items}, {Enqueue, Dequeue})

Note: stack and queue do not support the search operation; {Insert, Delete, Search} are called the dictionary operations.
Typically Operations-on-Set ⊆ {Insert, Modify, Delete, Search, Find-Max, Find-Min, etc. …}
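As a quick illustration (a minimal Python sketch; the variable names are ours, not from the text), a stack and a queue viewed as (Set; Operations-on-Set) pairs:

    from collections import deque

    stack = []               # Stack = ({items}, {Push, Pop})
    stack.append(3)          # Push
    stack.append(7)          # Push
    top = stack.pop()        # Pop -> 7 (LIFO)

    queue = deque()          # Queue = ({items}, {Enqueue, Dequeue})
    queue.append(3)          # Enqueue
    queue.append(7)          # Enqueue
    front = queue.popleft()  # Dequeue -> 3 (FIFO)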
Comparison with algebraic structures in mathematics:
similarity:
same basic idea: algebraic structure := (Set; Operations-on-Set)
Example: group (A, +): e.g. (real numbers, addition), ({0,1,2}, + mod 3)
difference:
The set in a mathematical structure is fixed and often infinite; in CS the set is always finite, and items are repeatedly added and removed (the set is dynamic). In fact, inserting and deleting items are central to data structure work, are performed a great number of times, and therefore need to be very fast.
Goal:
1. Design: find the data structure most appropriate for the task at hand.
Example: priority queues for scheduling.
2. Implementation: should satisfy software engineering criteria: more general criteria such as modularity and scalability, and/or more specific criteria such as encapsulation, information hiding, inheritance, etc.
Example: heap for implementing priority queues.
3. Performance: Operations-on-Set should be as fast as possible: ideally the worst-case or average time should be of constant order, or at most of order lg n.
Example: worst-case asymptotic times for priority queues implemented with a heap:
T-Insert/Change = Θ(lg n), T-Extract = Θ(lg n)
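For instance, Python's standard heapq module provides a heap-based priority queue (a min-heap); a quick illustration of the Θ(lg n) insert and extract operations:

    import heapq

    pq = []                          # heap-based priority queue (min-heap)
    for key in [5, 1, 9, 3]:
        heapq.heappush(pq, key)      # Insert: O(lg n)
    smallest = heapq.heappop(pq)     # Extract-Min: O(lg n) -> 1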
Hashing
Goal: develop a data structure for the dictionary operations, i.e. insert, delete, search, as fast and as elegantly implemented as possible, with no need to worry about max, min, etc.
First thing that comes to mind: a table, implemented as an array, with direct access to every element. (see Figure 11.1, p.223)
Pros: Θ(1) for all dictionary operations
Cons: the array requires lots of space, and may be outright impossible if we work with a large number of items/keys.
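A minimal sketch of such a direct-access table in Python, assuming integer keys drawn from a universe {0, …, U_SIZE−1} (U_SIZE is an illustrative bound, not from the text); every operation is O(1), but the space is proportional to the size of the whole universe:

    U_SIZE = 1_000_000
    table = [None] * U_SIZE        # O(|U|) space, even if few keys are stored

    def da_insert(key, data):      # O(1)
        table[key] = data

    def da_search(key):            # O(1)
        return table[key]

    def da_delete(key):            # O(1)
        table[key] = None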
Main Ideas of Hash Tables:
- The number of keys that can potentially occur in the problem may be very large, but the number of values that actually need to be stored is relatively small.
- We could make a rule (function) that assigns not one but a whole set of values to a single
table slot, taking the risk that relatively few values will be assigned to the same slot in a
particular run of the problem. Thus we generalize the direct access table idea from "one
key-value to one table slot" to "set of key-values to one table slot". (see Figure 11.2,
p.225)
- To make sure only a few values go to the same slot, we need to come up with a clever rule for assigning values to slots. Intuitively this could be accomplished by dispersing the values in a hodgepodge way, or "hashing" them "evenly" across the table slots -- a hash function;
- Under very unfortunate circumstances all values can hash to a single slot, making the worst-case time of the order of n, O(n) = O(#items), a depressing result. The average time, however, is proportional to the average number of values hashing to a slot. Assuming the table has m slots and our hash function disperses the n values evenly over these slots, the average list length should be n/m, making the average table access time O(n/m) = O(#items/#table-slots).
- Of course, we still cannot avoid more than one key hashing to a single slot: when this occurs there is a collision, and we need a collision resolution policy.
Now let's work out the details. Let
U – universe of keys;
K – set of keys actually stored in the table, with |K| = n;
T[0 … m−1] – array implementing the table with m slots;
α = n/m = #items / #table-slots – the load factor.

Hash function: h: U → {0, 1, 2, …, m−1}
Collision Resolution by Chaining: (see Figure 11.3, p.225)
The first thing that comes to mind is simply to make a linked list of all keys with the same hash value and attach it to the corresponding table slot T[h(k)].
Let
k – some key or element of U, while
x – a pointer to an item/element to be stored; the element has a key, key[x] ∈ U, and assorted satellite data (for more detail review section 10.2, Linked lists).
Assume a doubly linked list, where each element carries:
x – pointer to the element
key[x] – the element's key
prev[x] – pointer to the previous element
next[x] – pointer to the next element
CHAINED-HASH-INSERT(T, x)  // p.227
  insert x at the head of the list at T[h(key[x])]        O(1)

CHAINED-HASH-SEARCH(T, k)  // p.227
  search for an element with key k in the list at T[h(k)]        O(length of list at T[h(k)])

CHAINED-HASH-DELETE(T, x)  // p.227
  delete x from the list at T[h(key[x])]        O(1) if doubly linked list
Note that
- if we are deleting based on some key k and not on a pointer x to the element, we must first search for the key k in the list at T[h(k)]: the time is then proportional to the length of the list at T[h(k)], as in searching;
- if the list is not doubly linked, i.e. the previous link is missing and only the next link is provided, we must still search for x in order to find the element preceding x and update its next link to point to the element after x. The worst case occurs when x is last, requiring us to traverse the whole list and making the time again proportional to the length of the list at T[h(k)] (see section 10.2, and p.226).
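A minimal Python sketch of hashing with chaining along these lines; ordinary Python lists stand in for the linked chains and hash(key) % m is a stand-in hash function, so delete here costs O(length of chain) rather than the O(1) achievable with a doubly linked list and a pointer to the element:

    class ChainedHashTable:
        def __init__(self, m):
            self.m = m                          # number of slots
            self.T = [[] for _ in range(m)]     # one chain per slot

        def _h(self, key):
            return hash(key) % self.m           # stand-in hash function

        def insert(self, key, data):            # CHAINED-HASH-INSERT
            # prepend to the chain (O(1) with a real linked list)
            self.T[self._h(key)].insert(0, (key, data))

        def search(self, key):                  # CHAINED-HASH-SEARCH
            for k, d in self.T[self._h(key)]:   # O(length of chain)
                if k == key:
                    return d
            return None

        def delete(self, key):                  # CHAINED-HASH-DELETE by key
            chain = self.T[self._h(key)]        # O(length of chain) here
            for i, (k, _) in enumerate(chain):
                if k == key:
                    del chain[i]
                    return

    t = ChainedHashTable(13)
    t.insert(42, "payload")
    print(t.search(42))    # "payload"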
Analysis of hashing with chaining: How long does it take to search for a key?
Worst case: O(n) trivial
Average case: O(1 + α), where α is the load factor
Proof Idea:
Assume: each key is equally likely to be hashed to any slot or equivalently each slot is equally
likely to get a key -- simple uniform hashing
Cost of unsuccessful search (Theorem 11.1, p.227): O(1+α)
- O(1) to compute hash function and access slot T[ h(k)]
- search entire list at slot, average list length = load factor = α
Cost of successful search (Theorem 11.2, p.227): O(1 + α/2 − α/(2n))
- O(1) to access slot T[h(k)]
- the average list length of a successful search for some key k is
  T = (1/n) Σ (avg-number-elements-examined-before-k + 1).
Intuitively, as k is equally likely to be the key of the first, as well as of the last element, as well as of any element in between, the avg-number-elements-examined-before-k is α/2;
More precisely: how many elements were examined before finding k depends on when k was inserted, or how many elements were added after k, i.e.
avg-number-elements-examined-before-k = avg-number-elements-added-after-k.
Thus if k was added
1st → avg-number-elements-added-after-k = (n−1)/m
2nd → avg-number-elements-added-after-k = (n−2)/m
3rd → avg-number-elements-added-after-k = (n−3)/m
…
ith → avg-number-elements-added-after-k = (n−i)/m, with i = 1, 2, …, n
1 n ni
1 n
1 n
 1) 
(
 ( n  i )  1
n i1 m
nm i1
n i 1
n n
1 n n
=
1 
i 
nm i1 nm i1 n
T =
=
n 1 n(n  1)

1
m nm 2
=

(n  1)
 1 n
 
1    
1  
1
2m
2 2m n
2 2m
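A small experiment in Python (assumptions: random distinct integer keys, insertion at the head of each chain, and the stand-in hash h(k) = k mod m) comparing the observed average number of elements examined in successful searches with the 1 + α/2 − α/(2n) estimate:

    import random

    m, n = 101, 500                          # assumed table size and key count
    keys = random.sample(range(10**6), n)    # n distinct random keys
    chains = [[] for _ in range(m)]
    for k in keys:
        chains[k % m].insert(0, k)           # insert at head, as in CHAINED-HASH-INSERT

    # average number of elements examined over all n successful searches
    total = sum(chain.index(k) + 1 for chain in chains for k in chain)
    alpha = n / m
    print(total / n)                         # observed average
    print(1 + alpha/2 - alpha/(2*n))         # predicted: 1 + α/2 − α/(2n)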
What is a good hash function? One that disperses values evenly, so that
for most x ≠ y, h(x) ≠ h(y).
How to do this?
(a) Reason systematically through probabilities, and ensure that the condition of uniform hashing is met, i.e. the probability that a key hashes to slot j is the same for all slots:
P(j) = Σ_{k : h(k) = j} P(k) = 1/m for j = 0, 1, …, m−1.
Unfortunately, we usually do not know the probability distribution of the keys and cannot check this.
(b) Ad hoc methods that try to destroy any dependency patterns in the data, and thus make hash values independent of each other:
with a fixed hash function: division method, multiplication method
with a randomly chosen hash function for each run: universal hashing
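Universal hashing is covered in Ch. 11.3.3; as a sketch, the classic family h_{a,b}(k) = ((a·k + b) mod p) mod m, where p is a prime larger than any key and a, b are chosen at random for each run (the constants below are illustrative assumptions):

    import random

    p = 2_147_483_647              # a prime assumed larger than every key
    m = 701                        # number of table slots (illustrative)

    def random_universal_hash():
        # draw h_{a,b}(k) = ((a*k + b) mod p) mod m at random for this run
        a = random.randrange(1, p)     # a in {1, ..., p-1}
        b = random.randrange(0, p)     # b in {0, ..., p-1}
        return lambda k: ((a * k + b) % p) % m

    h = random_universal_hash()
    print(h(123456))               # some slot in {0, ..., m-1}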
Division method (Ch. 11.3.1):
h(k) = k mod m
The simplest thing that can come to mind, in fact. There are some caveats, however:
1. Do not choose m = 2^p (some power of 2) when working with binary numbers: choosing m = 2^p makes the hash function h(k) depend only on the p lowest bits of k.
Let k have a (w+1)-bit representation:
k = k_w·2^w + k_{w−1}·2^{w−1} + … + k_{p+1}·2^{p+1} + k_p·2^p + k_{p−1}·2^{p−1} + … + k_1·2^1 + k_0·2^0
  = (k_w·2^{w−p} + k_{w−1}·2^{w−1−p} + … + k_{p+1}·2^1 + k_p)·2^p + (k_{p−1}·2^{p−1} + k_{p−2}·2^{p−2} + … + k_1·2^1 + k_0·2^0)
Dividing by m = 2^p yields
k/m = (k_w·2^{w−p} + k_{w−1}·2^{w−1−p} + … + k_{p+1}·2^1 + k_p)   [integer part of the division by m, irrelevant for h(k)]
    + (k_{p−1}·2^{−1} + k_{p−2}·2^{−2} + … + k_1·2^{−p+1} + k_0·2^{−p})   [fractional part = remainder of the division by m, i.e. h(k)]

Example: w+1 = 8, p = 4,
k = 1101 1010
  = 1·2^7 + 1·2^6 + 0·2^5 + 1·2^4 + 1·2^3 + 0·2^2 + 1·2^1 + 0·2^0 = (1101)·2^4 + (1010)·2^0
Dividing by m = 2^4 yields
k/m = 1·2^3 + 1·2^2 + 0·2^1 + 1·2^0 + 1·2^{−1} + 0·2^{−2} + 1·2^{−3} + 0·2^{−4} = (1101)·2^0 + (1010)·2^{−4}
with integer part (1101) and fractional part (1010)·2^{−4}, so h(k) = (1010)_2 = 10.
Similarly, do not choose m to be a power of 10 (or of radix d) for decimal (or radix-d) applications.
2. If k is a character string interpreted in radix 2^p and m = 2^p − 1, any pair of strings that are identical except for a transposition of two adjacent characters hash to the same slot.
3. Good values for m are primes not too close to exact powers of 2.
See examples p.231
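A short Python check of caveat 1, using the example key above: with m = 2^p, k mod m is exactly the p low-order bits of k (i.e. k & (m − 1)), so keys differing only in their high bits all collide:

    m = 2**4                        # m = 2^p with p = 4: a bad choice of divisor
    k = 0b11011010                  # the example key k = 1101 1010

    print((k % m) == (k & (m - 1))) # True: h(k) depends only on the low 4 bits
    print(bin(k % m))               # 0b1010, the low-order bits of k

    # keys that differ only in their high bits all hash to the same slot:
    print(0b00011010 % m, 0b11111010 % m, 0b01011010 % m)  # 10 10 10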
Multiplication method (Ch. 11.3.2):
h(k) = floor( m · (k·A mod 1) )
with
m = 2^p
0 < A = s/2^w < 1, or equivalently 0 < s = A·2^w < 2^w

Example: k = 13, p = 3, w = 4, A = (√5 − 1)/2 = 0.618033… ≈ 9/16
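A sketch of this computation in Python; following the notes' 9/16 approximation we take s = 9 (so A = 9/16), which is our assumption about the intended value of s:

    p, w = 3, 4
    m = 2**p
    s = 9                        # assumed: the w = 4 bit approximation 9/16 of A
    k = 13

    frac = (k * s) % (2**w)      # k*A mod 1, kept as an integer scaled by 2^w
    h = (m * frac) >> w          # floor(m * frac / 2^w)
    print(h)                     # 2, since 13 * 9/16 mod 1 = 5/16 and 8 * 5/16 = 2.5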