Hashing

Sections 10.2 – 10.3
CS 302
Dr. George Bebis
The Search Problem
• Unsorted list
– O(N)
• Sorted list
– O(logN) using arrays (i.e., binary search)
– O(N) using linked lists
• Binary Search tree
– O(logN) (i.e., balanced tree)
– O(N) (i.e., unbalanced tree)
• Can we do better than this?
– Direct Addressing
– Hashing
Direct Addressing
• Assumptions:
– Key values are distinct
– Each key is drawn from a universe U = {0, 1, . . . , n - 1}
• Idea:
– Store the items in an array, indexed by keys
Direct Addressing (cont’d)
• Direct-address table
representation:
– An array T[0 . . . n - 1]
– Each slot, or position, in T
corresponds to a key in U
Search, insert, delete in O(1) time!
– For an element x with key k, a
pointer to x will be placed in
location T[k]
– If there are no elements with
key k in the set, T[k] is empty,
represented by NIL
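A minimal Python sketch of a direct-address table, assuming integer keys in U = {0, 1, ..., n - 1}. The class name `DirectAddressTable` and its method names are hypothetical, chosen only for illustration; `None` plays the role of NIL.

```python
class DirectAddressTable:
    """Direct-address table for keys drawn from U = {0, 1, ..., n - 1}."""

    def __init__(self, n):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * n

    def insert(self, key, record):
        self.slots[key] = record      # O(1): the key is the array index

    def search(self, key):
        return self.slots[key]        # O(1); None means "no such element"

    def delete(self, key):
        self.slots[key] = None        # O(1)


table = DirectAddressTable(10)
table.insert(3, "record for key 3")
print(table.search(3))   # record for key 3
print(table.search(4))   # None
```

Every operation is a single array access, which is where the O(1) bound comes from; the cost is one slot for every key in U, whether or not it is used.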
Direct Addressing (cont’d)
Example 1: Suppose that the keys are integers from 1 to 100 and
that there are about 100 records.
Create an array A of 100 items and store the record whose key
is equal to i in A[i].
|K| = |U|
|K|: # elements in K
|U|: # elements in U
Direct Addressing (cont’d)
Example 2: Suppose that the keys are 9-digit social security
numbers (SSN)
Although we could use the same idea, it would be very inefficient
(i.e., it would use an array of size 1 billion to store 100 records)
|K| << |U|
Hashing
Idea:
– Use a function h to compute the slot for each key
– Store the element in slot h(k)
• A hash function h transforms a key into an
index in a hash table T[0…m-1]:
h : U → {0, 1, . . . , m - 1}
• We say that k hashes to slot h(k)
Hashing (cont’d)
[Figure: the universe of keys U contains the set K of actual keys k1, ..., k5; h maps each key to a slot of the table T[0 .. m-1], with h(k2) = h(k5)]
h : U → {0, 1, . . . , m - 1}
hash table size: m
Hashing (cont’d)
Example 2: Suppose that the keys are 9-digit social security
numbers (SSN)
Advantages of Hashing
• Reduce the range of array indices handled:
m instead of |U|
where m is the hash table size: T[0, …, m-1]
• Storage is reduced.
• O(1) search time (under certain assumptions).
Collisions
Collisions occur when h(ki)=h(kj), i≠j
[Figure: keys from K ⊂ U mapped by h into the table T[0 .. m-1]; k2 and k5 collide because h(k2) = h(k5)]
Collisions (cont’d)
• For a given set K of keys:
– If |K| ≤ m, collisions may or may not happen,
depending on the hash function!
– If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
• Avoiding collisions completely might not be
easy.
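The two cases above can be checked with a small Python sketch. The helper `count_collisions` and the hash function h(k) = k mod 5 are assumptions made for illustration only.

```python
from collections import Counter


def count_collisions(keys, h):
    """Return how many keys land in a slot that already holds another key."""
    slot_counts = Counter(h(k) for k in keys)
    return sum(count - 1 for count in slot_counts.values())


m = 5


def h(k):
    return k % m   # illustrative hash function, not prescribed by the slides


print(count_collisions([0, 1, 2, 3, 4], h))      # 0: |K| <= m, no collisions
print(count_collisions([0, 5, 10, 15], h))       # 3: |K| <= m, yet every key collides
print(count_collisions([0, 1, 2, 3, 4, 5], h))   # 1: |K| > m forces a collision
```

The same hash function gives zero collisions for one key set and the maximum for another of size at most m, while any set of more than m keys must collide.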
Handling Collisions
• We will discuss two main methods:
(1) Chaining
(2) Open addressing
• Linear probing
• Quadratic probing
• Double hashing
Chaining
• Idea:
– Put all elements that hash to the same slot into a
linked list
– Slot j contains a pointer to the head of the list of all
elements that hash to j
Chaining (cont’d)
• How to choose the size of the hash table m?
– Small enough to avoid wasting space.
– Large enough to avoid many collisions and keep
linked-lists short.
– Typically 1/5 or 1/10 of the total number of elements.
• Should we use sorted or unsorted linked lists?
– Unsorted
• Insert is fast
• Can easily remove the most recently inserted elements
Hash Table Operations
• Search
• Insert
• Delete
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
• Running time depends on the length of the list of
elements in slot h(k)
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]
• Locating T[h(key[x])] takes O(1) time; the insert takes O(1)
time overall since the lists are unsorted.
• Note: if no duplicates are allowed, it would take
extra time to check whether the item was already inserted.
Deletion in Hash Tables
Alg.: CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(key[x])]
• T[h(key[x])] takes O(1) time.
• Finding the item depends on the length of the list
of elements in slot h(key[x])
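The three chained operations can be sketched in Python as follows. This is a minimal illustration under stated assumptions: integer keys, the hash function h(k) = k mod m, and hypothetical class names `Node` and `ChainedHashTable`. Unlike CHAINED-HASH-DELETE(T, x), which receives the element x, this `delete` takes a key and must first find it in the chain.

```python
class Node:
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node


class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.heads = [None] * m          # slot j points to the head of chain j

    def _h(self, key):
        return key % self.m              # assumed hash function (integer keys)

    def insert(self, key, value):
        j = self._h(key)
        # Insert at the head of the unsorted chain: O(1), no duplicate check.
        self.heads[j] = Node(key, value, self.heads[j])

    def search(self, key):
        node = self.heads[self._h(key)]
        while node is not None and node.key != key:
            node = node.next             # cost grows with the chain length
        return None if node is None else node.value

    def delete(self, key):
        j = self._h(key)
        prev, node = None, self.heads[j]
        while node is not None and node.key != key:
            prev, node = node, node.next
        if node is None:
            return False                 # key not present
        if prev is None:
            self.heads[j] = node.next    # removed the head of the chain
        else:
            prev.next = node.next
        return True


t = ChainedHashTable(7)
t.insert(10, "a")
t.insert(17, "b")        # 10 and 17 both hash to slot 3
print(t.search(17))      # b
print(t.delete(10))      # True
print(t.search(10))      # None
```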
Analysis of Hashing with Chaining:
Worst Case
• How long does it take to
search for an element with a
given key?
• Worst case:
– All n keys hash to the same slot;
then O(n) plus time to compute
the hash function
[Figure: table T[0 .. m-1] in which one slot holds a single chain containing all the keys]
Analysis of Hashing with Chaining:
Average Case
• It depends on how well the hash
function distributes the n keys among
the m slots
• Under the following assumptions:
(1) n = O(m)
(2) any given element is equally likely to
hash into any of the m slots (i.e., simple
uniform hashing property)
then O(1) time plus time to compute
the hash function
[Figure: table T[0 .. m-1] in which slot j holds a chain of nj elements; empty slots have length 0, e.g., n0 = 0 and nm-1 = 0]
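A small simulation can make the average case concrete. The sketch below is illustrative only: it models simple uniform hashing by placing each of the n keys into a slot chosen uniformly at random, and the function name `chain_lengths` and the fixed seed are assumptions.

```python
import random


def chain_lengths(n, m, seed=0):
    """Place n keys into m slots uniformly at random; return the chain lengths."""
    rng = random.Random(seed)
    lengths = [0] * m
    for _ in range(n):
        lengths[rng.randrange(m)] += 1
    return lengths


lengths = chain_lengths(n=1000, m=500)
print(sum(lengths) / len(lengths))   # 2.0: the average chain length is n/m
print(max(lengths))                  # longest chain; depends on the seed
```

The average chain length is always n/m (often called the load factor), so when n = O(m) it is bounded by a constant, which is what gives the O(1) expected search time.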
Properties of Good Hash Functions
• Good hash function properties
(1) Easy to compute
(2) Approximates a random function
i.e., for every input, every output is equally likely.
(3) Minimizes the chance that similar keys hash to the
same slot
e.g., strings such as pt and pts should hash to different slots.
• We will discuss two methods:
– Division method
– Multiplication method
The Division Method
• Idea:
– Map a key k into one of the m slots by taking
the remainder of k divided by m
h(k) = k mod m
• Advantage:
– fast, requires only one operation
• Disadvantage:
– Certain values of m are bad (i.e., they cause many collisions), e.g.,
• powers of 2
• non-prime numbers
Example
• If m = 2^p, then h(k) is just the least
significant p bits of k
– p = 1 ⇒ m = 2
⇒ h(k) ∈ {0, 1}, least significant 1 bit of k
– p = 2 ⇒ m = 4
⇒ h(k) ∈ {0, 1, 2, 3}, least significant 2 bits of k
• Choose m to be a prime, not close to a
power of 2
[Table: sample keys k, with column 2 showing k mod 97 (m = 97) and column 3 showing k mod 100 (m = 100)]
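The effect of choosing m as a power of 2 can be seen in a short Python sketch. The function name `h_division` and the sample keys are assumptions chosen for illustration; they are not the keys from the slide's table.

```python
def h_division(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


p = 3
m = 2 ** p
keys = [8, 16, 24, 40]           # hypothetical keys that differ only in high bits

for k in keys:
    # With m = 2^p the hash is exactly the p least significant bits of k.
    assert h_division(k, m) == k & (m - 1)

print([h_division(k, 8) for k in keys])    # [0, 0, 0, 0]
print([h_division(k, 97) for k in keys])   # [8, 16, 24, 40]
```

With m = 8 all four keys collide because their low 3 bits are identical, while the prime m = 97 spreads them over different slots.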
The Multiplication Method
Idea:
(1) Multiply key k by a constant A, where 0 < A < 1
(2) Extract the fractional part of kA
(3) Multiply the fractional part by m
(4) Truncate the result
h(k) = ⌊m (k A mod 1)⌋
where ⌊ ⌋ truncates, e.g., ⌊12.3⌋ = 12
fractional part of kA = k A mod 1 = kA - ⌊kA⌋
• Disadvantage: Slower than division method
• Advantage: Value of m is not critical
Example – Multiplication Method
Suppose k=6, A=0.3, m=32
(1) k x A = 1.8
(2) fractional part: 1.8 - ⌊1.8⌋ = 0.8
(3) m x 0.8 = 32 x 0.8 = 25.6
(4) ⌊25.6⌋ = 25
h(6)=25
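The four steps can be reproduced with a short Python sketch. The function name `h_multiplication` is hypothetical, and floating-point arithmetic is used for simplicity, so intermediate values are only approximately 1.8 and 0.8.

```python
import math


def h_multiplication(k, A, m):
    """Multiplication method: h(k) = floor(m * (k*A mod 1)), with 0 < A < 1."""
    fractional = (k * A) % 1.0        # steps (1) and (2): fractional part of k*A
    return math.floor(m * fractional) # steps (3) and (4): scale by m, truncate


print(h_multiplication(6, 0.3, 32))   # 25
```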
Open Addressing
• Idea: store the keys in the table itself
• No need to use linked lists anymore
• Basic idea:
– Insertion: if a slot is full, try another one,
until you find an empty one.
– Search: follow the same probe sequence.
– Deletion: need to be careful!
• Search time depends on the length of
probe sequences!
[Figure: e.g., insert 14 with probe sequence <1, 5, 9>]
Generalize hash function notation:
• A hash function contains two arguments now:
(i) key value, and (ii) probe number
h(k,p),
p=0,1,...,m-1
• Probe sequence:
<h(k,0), h(k,1), h(k,2), …. >
– Probe sequence must be a permutation of
<0,1,...,m-1>
– There are m! possible permutations
• Example (insert 14):
Probe sequence:
<h(14,0), h(14,1), h(14,2)>=<1, 5, 9>
Common Open Addressing Methods
• Linear probing
• Quadratic probing
• Double hashing
• None of these methods can generate more than
m^2 different probe sequences!
Linear probing: Inserting a key
• Idea: when there is a collision, check the next available
position in the table:
h(k,i) = (h1(k) + i) mod m
i=0,1,2,...
• i=0: first slot probed: h1(k)
• i=1: second slot probed: (h1(k) + 1) mod m
• i=2: third slot probed: (h1(k) + 2) mod m, and so on
probe sequence: < h1(k), h1(k)+1 , h1(k)+2 , ....> (wrapping around after slot m-1)
• How many probe sequences can linear probing
generate?
m probe sequences maximum
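A minimal Python sketch of the linear probing sequence, assuming integer keys and h1(k) = k mod m (the slides do not fix h1 here; the function name `linear_probe_sequence` is hypothetical).

```python
def linear_probe_sequence(k, m):
    """Return the full probe sequence h(k, i) = (h1(k) + i) mod m, i = 0..m-1."""
    h1 = k % m                                   # assumed primary hash function
    return [(h1 + i) % m for i in range(m)]


print(linear_probe_sequence(14, 13))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0]

# Only m distinct sequences exist: one per starting slot.
sequences = {tuple(linear_probe_sequence(k, 13)) for k in range(1000)}
print(len(sequences))   # 13
```

Because the starting slot alone determines the rest of the sequence, two keys with the same h1(k) always probe the same slots in the same order.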
Linear probing: Searching for a key
• Given a key, generate a probe
sequence using the same procedure.
• Three cases:
(1) Position in table is occupied with an
element of equal key ⇒ FOUND
(2) Position in table is occupied with a
different element ⇒ KEEP SEARCHING
(3) Position in table is empty ⇒ NOT FOUND
[Figure: table T[0 .. m-1]; the probe sequence wraps around from slot m-1 to slot 0]
Linear probing: Searching for a key (cont’d)
• Running time depends on the length of
the probe sequences.
• Need to keep probe sequences
short to ensure fast search.
Linear probing: Deleting a key
• First, find the slot containing the key
to be deleted (e.g., delete 98).
• Can we just mark the slot as empty?
– No: a search stops at the first empty slot, so it would be
impossible to retrieve keys that were inserted after that slot
was occupied and whose probe sequences pass through it!
• Solution
– “Mark” the slot with a sentinel value
DELETED
• The deleted slot can later be used
for insertion.
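The insert, search, and delete rules for linear probing can be put together in a minimal Python sketch. It assumes integer keys and h1(k) = k mod m; the class name `LinearProbingTable` and the sentinel objects `EMPTY` and `DELETED` are hypothetical names, and the insert does not check for duplicates.

```python
EMPTY = object()     # slot has never held a key
DELETED = object()   # slot held a key that was later deleted


class LinearProbingTable:
    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        h1 = key % self.m                       # assumed primary hash function
        for i in range(self.m):
            yield (h1 + i) % self.m             # wraps around after slot m-1

    def insert(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key             # DELETED slots are reusable
                return j
        raise RuntimeError("hash table is full")

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY:
                return None                     # empty slot: NOT FOUND
            if self.slots[j] is not DELETED and self.slots[j] == key:
                return j                        # FOUND
        return None                             # probed every slot

    def delete(self, key):
        j = self.search(key)
        if j is None:
            return False
        self.slots[j] = DELETED                 # keep later keys reachable
        return True


t = LinearProbingTable(13)
for key in (5, 18, 31):      # all three hash to slot 5
    t.insert(key)
t.delete(18)                 # slot 6 becomes DELETED, not EMPTY
print(t.search(31))          # 7: the search steps over the DELETED slot
```

Had slot 6 been reset to `EMPTY` instead, the search for 31 would have stopped there and reported the key as missing.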
Primary Clustering Problem
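• Long chunks of occupied slots are created.
• As a result, some slots become more likely than others
to be filled next.
• Probe sequences increase in length ⇒ search time
increases!!
[Figure: initially, all slots have probability 1/m; after several insertions the empty slots that follow runs of occupied slots become more likely: slot b: 2/m, slot d: 4/m, slot e: 5/m]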
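The slot probabilities can be computed with a short Python sketch. It assumes that h1(k) is equally likely to be any of the m slots, and the function name `next_fill_probabilities` and the example table are hypothetical, not the table from the figure.

```python
def next_fill_probabilities(occupied):
    """occupied[j] is True if slot j is full.

    Returns {empty slot: probability that the next linear-probing insert
    lands there}, assuming h1(k) is uniform over the m slots.
    """
    m = len(occupied)
    probs = {}
    for j in range(m):
        if occupied[j]:
            continue
        run = 0
        i = (j - 1) % m
        while occupied[i]:          # length of the occupied run ending just before j
            run += 1
            i = (i - 1) % m
        # The insert lands in j if h1(k) is j itself or any slot in that run.
        probs[j] = (run + 1) / m
    return probs


occupied = [False] * 10
for j in (2, 3, 4):
    occupied[j] = True              # a cluster of three occupied slots

probs = next_fill_probabilities(occupied)
print(probs[5])   # 0.4: slot 5 absorbs hits on slots 2, 3, 4 and 5
print(probs[0])   # 0.1: an isolated empty slot keeps probability 1/m
```

An empty slot that follows a run of r occupied slots is filled next with probability (r + 1)/m, so large clusters tend to grow even larger.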
Quadratic probing
i=0,1,2,...
• Clustering is less serious but still a problem (secondary
clustering)
• How many probe sequences can quadratic probing
generate?
m: the initial position determines the probe sequence
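The probing formula itself is not present in the text above, so the following Python sketch assumes one common form of quadratic probing, h(k,i) = (h1(k) + i^2) mod m, with h1(k) = k mod m. Both choices, and the function name `quadratic_probe_sequence`, are assumptions for illustration.

```python
def quadratic_probe_sequence(k, m, count):
    """First `count` probes of h(k, i) = (h1(k) + i*i) mod m (assumed form)."""
    h1 = k % m                                  # assumed primary hash function
    return [(h1 + i * i) % m for i in range(count)]


print(quadratic_probe_sequence(14, 13, 5))   # [1, 2, 5, 10, 4]
print(quadratic_probe_sequence(27, 13, 5))   # [1, 2, 5, 10, 4]
```

Keys 14 and 27 share the initial slot 1 and therefore follow exactly the same probe sequence, which is the secondary clustering effect and the reason only m distinct sequences are possible.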
Double Hashing
(1) Use one hash function to determine the first slot.
(2) Use a second hash function to determine the
increment for the probe sequence:
h(k,i) = (h1(k) + i h2(k) ) mod m, i=0,1,...
• Initial probe: h1(k)
• Second probe is offset by h2(k) mod m, and so on ...
• Advantage: handles clustering better
• Disadvantage: more time consuming
• How many probe sequences can double hashing
generate?
m^2 (why? each pair (h1(k), h2(k)) can give a different probe sequence)
Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1+ (k mod 11)
h(k,i) = (h1(k) + i h2(k) ) mod 13
• Insert key 14:
i=0: h(14,0) = h1(14) = 14 mod 13 = 1
i=1: h(14,1) = (h1(14) + h2(14)) mod 13
= (1 + 4) mod 13 = 5
i=2: h(14,2) = (h1(14) + 2 h2(14)) mod 13
= (1 + 8) mod 13 = 9
[Figure: hash table T[0 .. 12] containing the keys 79, 69, 98, 72, 14, 50; slots 1 and 5 are already occupied, so 14 is stored in slot 9]
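The example can be replayed in Python. The hash functions are the ones given above; the helper `insert` and the order in which the other five keys are inserted (79, 69, 72, 98, 50) are assumptions, since the text only shows the final table contents.

```python
def h1(k):
    return k % 13


def h2(k):
    return 1 + (k % 11)


def h(k, i):
    return (h1(k) + i * h2(k)) % 13


def insert(table, k):
    """Probe h(k, 0), h(k, 1), ... and store k in the first free slot."""
    for i in range(len(table)):
        j = h(k, i)
        if table[j] is None:
            table[j] = k
            return j
    raise RuntimeError("no free slot found")


table = [None] * 13
for k in (79, 69, 72, 98, 50):        # assumed insertion order
    insert(table, k)

print([h(14, i) for i in range(3)])   # [1, 5, 9]
print(insert(table, 14))              # 9: slots 1 and 5 are already taken
```

Slot 1 holds 79 and slot 5 holds 98 (which itself was displaced from slot 7 by 72), so key 14 settles in slot 9 on its third probe.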