Lecture 3 ADT Map/Dictionary, Hash Tables, Skip Lists TDDC32

Lecture notes in Design and Implementation of a Software Module in Java, 21 January 2013
Tommy Färnqvist, IDA, Linköping University
Lecture Topics
Contents
1 ADT Map/Dictionary
  1.1 Definitions
  1.2 Implementation
2 Hash Tables
  2.1 Collision Resolution
  2.2 Choosing a Hash Function
3 Skip Lists

1 ADT Map/Dictionary
1.1 Definitions
ADT Map
• Domain: sets of entries/pairs (key, value). The sets are partial functions mapping keys to values: each key occurs in at most one entry.
• Typical operations:
– size() return the number of pairs in the set
– isEmpty() check whether the set is empty
– get(k) return the value associated with k, or null if there is no such key
– put(k, v) add (k, v) to the set and return null if k is new; otherwise replace the value by v and return the old value
– remove(k) remove entry (k, v) and return v; return null if there is no such entry
• Examples:
– Course database: (code, name)
– Associative memory: (address, value)
– Sparse matrix: ((row, column), value)
– Lunch menu: (day, course)
• Static Map: no updates allowed
• Dynamic Map: updates are allowed
Example: operations performed on an initially empty Map M

operation    output   M
isEmpty()    true     ∅
put(5,A)     null     (5,A)
put(7,B)     null     (5,A),(7,B)
put(2,C)     null     (5,A),(7,B),(2,C)
put(8,D)     null     (5,A),(7,B),(2,C),(8,D)
put(2,E)     C        (5,A),(7,B),(2,E),(8,D)
get(7)       B        (5,A),(7,B),(2,E),(8,D)
get(4)       null     (5,A),(7,B),(2,E),(8,D)
get(2)       E        (5,A),(7,B),(2,E),(8,D)
size()       4        (5,A),(7,B),(2,E),(8,D)
remove(5)    A        (7,B),(2,E),(8,D)
remove(2)    E        (7,B),(8,D)
get(2)       null     (7,B),(8,D)
isEmpty()    false    (7,B),(8,D)
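The trace above can be replayed on java.util.Map, whose get, put and remove use the same null-returning conventions as the ADT. This is only a sketch: the class name MapTraceDemo and the helper buildExample are made up for illustration, and LinkedHashMap is chosen merely so that entries print in insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Replays the example trace on java.util.Map (hypothetical demo class).
class MapTraceDemo {
    // Builds the map after put(5,A), put(7,B), put(2,C), put(8,D).
    static Map<Integer, String> buildExample() {
        Map<Integer, String> m = new LinkedHashMap<>(); // keeps insertion order
        m.put(5, "A");
        m.put(7, "B");
        m.put(2, "C");
        m.put(8, "D");
        return m;
    }

    public static void main(String[] args) {
        Map<Integer, String> m = buildExample();
        System.out.println(m.put(2, "E"));  // C: the old value is returned
        System.out.println(m.get(7));       // B
        System.out.println(m.get(4));       // null: no such key
        System.out.println(m.size());       // 4
        System.out.println(m.remove(5));    // A
        System.out.println(m.remove(2));    // E
        System.out.println(m.isEmpty());    // false
    }
}
```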
ADT Dictionary
• Domain: sets of pairs (key, value). The sets are relations between keys and values: a key may occur in several pairs.
• Typical operations:
– size() return the number of pairs in the set
– isEmpty() check whether the set is empty
– find(k) return some entry with key k, or null if no such pair exists
– findAll(k) return an iterable collection of all entries with key k
– insert(k, v) add (k, v) and return the new entry
– remove(k, v) remove and return pair (k, v); return null if there is no such pair
– entries() return an iterable collection of all entries
• Example:
– Swedish-English dictionary . . . , (jakt, yacht), (jakt, hunting), . . .
– Telephone directory (multiple numbers allowed)
– Relation between liuid and courses taken
– Lunch menu (multiple courses): (day, course)
• Static Dictionary: no updates allowed
• Dynamic Dictionary: updates are allowed
Example: operations on an initially empty Dictionary D

operation         output        D
isEmpty()         true          ∅
insert(5,A)       (5,A)         (5,A)
insert(7,B)       (7,B)         (5,A),(7,B)
insert(2,C)       (2,C)         (5,A),(7,B),(2,C)
insert(8,D)       (8,D)         (5,A),(7,B),(2,C),(8,D)
insert(2,E)       (2,E)         (5,A),(7,B),(2,C),(8,D),(2,E)
find(7)           (7,B)         (5,A),(7,B),(2,C),(8,D),(2,E)
find(4)           null          (5,A),(7,B),(2,C),(8,D),(2,E)
find(2)           (2,C)         (5,A),(7,B),(2,C),(8,D),(2,E)
findAll(2)        (2,C),(2,E)   (5,A),(7,B),(2,C),(8,D),(2,E)
size()            5             (5,A),(7,B),(2,C),(8,D),(2,E)
remove(find(5))   (5,A)         (7,B),(2,C),(8,D),(2,E)
find(5)           null          (7,B),(2,C),(8,D),(2,E)
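A minimal sketch of the Dictionary ADT, backed by an unordered list so that one key can occur in several entries. The class name ListDictionary is hypothetical, and remove takes an entry (as in remove(find(5)) in the trace) rather than a (k, v) pair.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Unordered-list dictionary: the same key may occur in several entries.
class ListDictionary<K, V> {
    private final List<Entry<K, V>> entries = new ArrayList<>();

    int size() { return entries.size(); }

    boolean isEmpty() { return entries.isEmpty(); }

    // Returns the first entry with key k, or null if there is none.
    Entry<K, V> find(K k) {
        for (Entry<K, V> e : entries) {
            if (e.getKey().equals(k)) return e;
        }
        return null;
    }

    // Returns all entries with key k, in insertion order.
    List<Entry<K, V>> findAll(K k) {
        List<Entry<K, V>> result = new ArrayList<>();
        for (Entry<K, V> e : entries) {
            if (e.getKey().equals(k)) result.add(e);
        }
        return result;
    }

    // Adds (k, v) and returns the new entry.
    Entry<K, V> insert(K k, V v) {
        Entry<K, V> e = new SimpleEntry<>(k, v);
        entries.add(e);
        return e;
    }

    // Removes the given entry; returns it, or null if it was not present.
    Entry<K, V> remove(Entry<K, V> e) {
        return entries.remove(e) ? e : null;
    }
}
```

Every operation except insert scans the list, so this sketch has the O(n) look-up cost of an unordered table.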
1.2 Implementation
Implementations: Map, Dictionary
• Table/array: sequence of memory chunks of equal size
– Unordered: no particular order between T[i] and T[i + 1]
– Ordered: . . . but here T[i] < T[i + 1]
• Linked list
– Unordered
– Ordered
• (Binary) Search Trees
• Hashing
• Skip Lists
Table representation of Dictionary
Unordered table:
find by linear search
• unsuccessful look-up: n comparisons ⇒ O(n) time
• successful look-up, worst case: n comparisons ⇒ O(n) time
• successful look-up, average case with uniform distribution of requests:
  $\frac{1}{n}(1 + 2 + \dots + n) = \frac{n+1}{2}$ comparisons ⇒ O(n) time
Table representation of Dictionary
Ordered table (keys in Map/Dictionary are linearly ordered):
find by binary search
• look-up: O(log n) time
• . . . but updates are expensive: inserting or removing an entry may shift up to n elements, so O(n) time
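Binary search on an ordered table can be sketched as follows. The class and method names are illustrative only, and the table holds plain int keys to keep the example short.

```java
// Binary search over a sorted key table (hypothetical demo class).
class OrderedTableSearch {
    // Returns the index of key in the sorted array keys, or -1 if absent.
    static int find(int[] keys, int key) {
        int lo = 0, hi = keys.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2; // avoids overflow of lo + hi
            if (keys[mid] == key) return mid;
            if (keys[mid] < key) lo = mid + 1; // continue in the upper half
            else hi = mid - 1;                 // continue in the lower half
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] keys = {2, 5, 7, 8};
        System.out.println(find(keys, 7)); // 2
        System.out.println(find(keys, 4)); // -1
    }
}
```

Each iteration halves the remaining range, which gives the O(log n) look-up time.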
Search tree representation of Dictionary
Balanced search tree:
• look-up: O(log n) time
• updates also in O(log n) time
Can we do better?
Yes, using hash tables
• Idea: given a table T[0, . . . , max] to store elements in, find a suitable table index for each element
• Find a function h such that h(key) ∈ [0, . . . , max] and (ideally) such that k1 ≠ k2 ⇒ h(k1) ≠ h(k2)
• Store each key-value pair (k, v) in T[h(k)]

2 Hash Tables
Hash Table
• In practice, hash functions do not give unique values (are not injective)
• We need conflict (or collision) resolution
• We also need to find a good hash function

2.1 Collision Resolution
Two principles for handling collisions:
• Chaining: keep conflicting data in linked lists
– Separate chaining: Keep a linked list of the colliding items outside the table
– Coalesced chaining: Store all items inside the table
• Open addressing: Store all items inside the table and let some algorithm decide what index to use in case of a collision
[Swe: Separat länkning, samlad länkning, öppen adressering]
Chaining
[Figure: hash table with slots 0 to 12, shown with separate chaining and with coalesced chaining]
Open addressing
[Figure: open addressing]
Example: Hashing with Separate Chaining
• Hash table of size 13
• Hash function h with h(k) = k mod 13
• Store 10 integer keys: 54, 10, 18, 25, 28, 41, 38, 36, 12, 90
index   chain
0
1
2       41 → 28 → 54
3
4
5       18
6
7
8
9
10      36 → 10
11
12      90 → 12 → 38 → 25
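The example table can be rebuilt with a few lines of Java. This is a sketch under two assumptions: keys are non-negative, and each new key is placed at the head of its chain (which matches the chain order in the figure). The class name SeparateChainingDemo is made up.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Separate chaining with h(k) = k mod m (hypothetical demo class).
class SeparateChainingDemo {
    // Builds a table of m chains and inserts each key at the head of its chain.
    static List<LinkedList<Integer>> build(int m, int[] keys) {
        List<LinkedList<Integer>> table = new ArrayList<>();
        for (int i = 0; i < m; i++) table.add(new LinkedList<>());
        for (int k : keys) table.get(k % m).addFirst(k);
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {54, 10, 18, 25, 28, 41, 38, 36, 12, 90};
        List<LinkedList<Integer>> table = build(13, keys);
        for (int i = 0; i < table.size(); i++) {
            System.out.println(i + ": " + table.get(i));
        }
    }
}
```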
Separate Chaining: find
Given: key k, hash table T , hash function h
• compute h(k)
• search for k in the list pointed to by T [h(k)]
Notation: probe = one access to the linked list data structure
• 1 probe for accessing the list header (if non-empty)
• 1+1 probes for accessing the contents of the first element
• 1+2 probes for accessing the contents of the second element
• ...
A probe (just pointer dereferencing) takes constant time. How many probes P are needed to retrieve a hash table entry?
[Swe: probe=sondering]
Separate Chaining: Unsuccessful Look-up
• n data items
• m positions in the table
Worst case:
• All items have the same hash value: P = 1 + n
Average case:
• Hash values equally distributed among the m positions
• Average length α of a list: α = n/m
– α = n/m is called the load factor
• P = 1 + α
Separate Chaining: Successful Look-up
Average case:
• Access to T [h(k)] (beginning of list L): 1
• Traversing L ⇒ k found after about |L|/2 elements on average
• Expected (or average) |L| corresponds to α, thus: P = 1 + α/2
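As a worked instance of the two average-case formulas, take n = 10 keys in a table with m = 13 positions, as in the earlier example. The values assume uniformly distributed hash values, as the formulas do; a particular set of keys can deviate from them.

```latex
\begin{align*}
\alpha &= \frac{n}{m} = \frac{10}{13} \approx 0.77 \\
P_{\text{unsuccessful}} &= 1 + \alpha \approx 1.77 \\
P_{\text{successful}} &= 1 + \frac{\alpha}{2} \approx 1.38
\end{align*}
```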
Coalesced Chaining: items inside table
• Place data items in the table
• Extend them with pointers
• Resolve collisions by using the first free slot
A chain may contain keys with different hash values, but all keys with the same hash value appear in the same chain.
+ Better space utilization
- Table may become full
- Long chains (look-up takes longer)
[Figure: coalesced chaining]
Open addressing
• Store all elements inside the table
• Use a fixed algorithm to find a free slot
Sequential/Linear Probing
• desired hash index j = h(k)
• in case of conflict, go to the next free position
• if at the end of the table, continue from the beginning
• Nearby positions fill up rapidly (primary clustering)
• How to do remove(k)?
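Linear probing for insert and find might look like the sketch below. It assumes non-negative int keys, h(k) = k mod m, and null as the marker for a free slot. The class name LinearProbingDemo is hypothetical, and the keys in main are borrowed from the separate chaining example.

```java
// Open addressing with linear probing (hypothetical demo class).
class LinearProbingDemo {
    // Inserts k and returns the index used, or -1 if the table is full.
    static int insert(Integer[] table, int k) {
        int m = table.length;
        int j = k % m;                 // desired index h(k)
        for (int i = 0; i < m; i++) {
            int idx = (j + i) % m;     // wrap around at the end of the table
            if (table[idx] == null) {
                table[idx] = k;
                return idx;
            }
        }
        return -1;
    }

    // Returns the index of k, or -1 if a free slot is reached first.
    static int find(Integer[] table, int k) {
        int m = table.length;
        int j = k % m;
        for (int i = 0; i < m; i++) {
            int idx = (j + i) % m;
            if (table[idx] == null) return -1; // end of the collision chain
            if (table[idx] == k) return idx;
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[13];
        for (int k : new int[] {54, 10, 18, 25, 28, 41, 38, 36, 12, 90}) {
            System.out.println(k + " -> slot " + insert(table, k));
        }
    }
}
```

Note that find relies on free slots to stop early, which is exactly why removal needs care.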
Open addressing — remove()
The element to remove may be part of a collision chain. How can we tell?
If it is part of a chain, we cannot just remove it!
• Since all keys are stored, re-hash all remaining data?
• Scan the elements after it, re-hash or compact when suitable, and stop at the first free slot?
• Ignore: insert a “deleted” marker if the next slot is non-empty
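The “deleted” marker approach can be sketched as below. As a simplification, this version always writes the marker on removal, even when the next slot is free. The class name TombstoneTable and the three slot states are assumptions for illustration; keys are non-negative ints and h(k) = k mod m with linear probing.

```java
// Linear probing with a "deleted" marker per slot (hypothetical demo class).
class TombstoneTable {
    private static final int FREE = 0, USED = 1, DELETED = 2;
    private final int[] keys;
    private final int[] state;

    TombstoneTable(int m) {
        keys = new int[m];
        state = new int[m]; // all slots start as FREE
    }

    // Inserts k into the first slot that is not in use; false if there is none.
    boolean insert(int k) {
        int m = keys.length;
        for (int i = 0; i < m; i++) {
            int idx = (k % m + i) % m;
            if (state[idx] != USED) { // FREE and DELETED slots can be reused
                keys[idx] = k;
                state[idx] = USED;
                return true;
            }
        }
        return false;
    }

    // Returns the slot of k or -1; DELETED slots do not stop the scan.
    int find(int k) {
        int m = keys.length;
        for (int i = 0; i < m; i++) {
            int idx = (k % m + i) % m;
            if (state[idx] == FREE) return -1;
            if (state[idx] == USED && keys[idx] == k) return idx;
        }
        return -1;
    }

    // Marks the slot of k as DELETED so later keys in the chain stay reachable.
    boolean remove(int k) {
        int idx = find(k);
        if (idx < 0) return false;
        state[idx] = DELETED;
        return true;
    }
}
```

The price of this scheme is that markers accumulate, so unsuccessful look-ups get slower until the table is rebuilt.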
Double Hashing – or what to do on a collision?
• A second hash function h2 computes the increment in case of conflicts
• An increment beyond the end of the table is taken modulo m = tableSize
Linear probing is double hashing with h2(k) = 1.
Requirements on h2:
• h2(k) ≠ 0 for all k
• For each k, h2(k) has no common divisors with m ⇒ all table positions can be reached
A common choice: h2(k) = q − (k mod q) for a prime q < m, with m prime (i.e., pick a prime less than the table size!)
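A sketch of insertion with double hashing, using h(k) = k mod m and h2(k) = q − (k mod q). The class name DoubleHashingDemo is hypothetical, keys are assumed non-negative, and the caller is trusted to pass a prime q smaller than a prime table size.

```java
// Double hashing: probe h(k), h(k) + h2(k), h(k) + 2*h2(k), ... modulo m.
class DoubleHashingDemo {
    static int h(int k, int m) { return k % m; }

    // Secondary hash q - (k mod q); the result lies in 1..q, so it is never 0.
    static int h2(int k, int q) { return q - (k % q); }

    // Inserts k and returns the slot used, or -1 if no free slot is found.
    static int insert(Integer[] table, int k, int q) {
        int m = table.length;
        int step = h2(k, q);
        for (int i = 0; i < m; i++) {
            int idx = (h(k, m) + i * step) % m;
            if (table[idx] == null) {
                table[idx] = k;
                return idx;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[13];
        for (int k : new int[] {54, 28, 41}) { // all three have h(k) = 2
            System.out.println(k + " -> slot " + insert(table, k, 7));
        }
    }
}
```

With m = 13 and q = 7, the colliding keys 28 and 41 get different increments (7 and 1), so they do not follow the same probe sequence.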
2.2 Choosing a Hash Function
What is a good hash function?
Assume k is a natural number.
Hashing should give a uniform distribution of hash values, but this depends on the distribution of keys in the data to be hashed.
Example: Hashing of last names in a (Swedish) student group
• hash function: the ASCII value of the last character
• bad choice: the majority of the names end with ’n’
Example Hash Functions
• Memory address
– Reinterpret the memory address of the object as an integer (default hash code of all Java objects)
– Good in general, except for numeric and string keys
• Integer cast
– Reinterpret the bits of the key as an integer
– Suitable for keys of length less than or equal to the number of bits of the integer type
• Component sum
– Partition the bits of the key into components of fixed length (e.g. 16 or 32 bits) and sum the components (ignoring overflows)
– Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type
• Polynomial accumulation
– Partition the bits of the key into a sequence of components of fixed length (e.g. 8, 16, or 32 bits): $a_0 a_1 \dots a_{n-1}$
– Evaluate the polynomial
  $p(z) = a_0 + a_1 z + a_2 z^2 + \dots + a_{n-1} z^{n-1}$
  at a fixed value z, ignoring overflows
– Especially suitable for strings. (E.g. z = 33 gives at most 6 collisions on a set of 50000 English words.)
• Polynomial p(z) can be evaluated in O(n) time using Horner’s rule:
– The following polynomials are successively computed, each from the previous one in O(1) time
  $p_0(z) = a_{n-1}$
  $p_i(z) = a_{n-i-1} + z\,p_{i-1}(z)$  (i = 1, 2, . . . , n − 1)
• We have $p(z) = p_{n-1}(z)$
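Polynomial accumulation with Horner’s rule, and a component sum for a 64-bit key, can be sketched as follows. The class and method names are made up, the characters of the string serve as the components a_0, ..., a_{n−1}, and int arithmetic wraps around silently, which is how overflows are "ignored" here. The result can be negative and still has to be mapped to a table index.

```java
// Two hash code sketches from the lists above (hypothetical demo class).
class PolynomialHash {
    // Evaluates p(z) = a_0 + a_1 z + ... + a_{n-1} z^{n-1} with Horner's rule,
    // where a_i is the i-th character of s.
    static int hash(String s, int z) {
        int p = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            p = s.charAt(i) + z * p; // one Horner step, O(1) each
        }
        return p;
    }

    // Component sum of a 64-bit key: add its two 32-bit halves.
    static int componentSum(long key) {
        return (int) (key >>> 32) + (int) key;
    }

    public static void main(String[] args) {
        System.out.println(hash("ab", 33));                    // 97 + 33 * 98 = 3331
        System.out.println(componentSum(0x0000000100000002L)); // 1 + 2 = 3
    }
}
```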
Hashing by integer division
Let m be the table size
h(k) = k mod m
Avoid
• m = 2^d: hashing gives the d last bits of k
• m = 10^d: hashing gives the d last decimal digits of k
Usually, a prime number is suggested for m.
Check samples of real data to experiment with hash parameters.
See http://burtleburtle.net/bob/hash/doobs.html for other opinions on the matter.
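The effect of a power-of-two table size is easy to observe. In this sketch the three keys are chosen (as an assumption, purely for illustration) to differ only in their upper bits, and 251 is simply a prime close to 256.

```java
// Division hashing h(k) = k mod m for non-negative k (hypothetical demo class).
class DivisionHashDemo {
    static int h(int k, int m) { return k % m; }

    public static void main(String[] args) {
        int[] keys = {0x1234, 0xAB34, 0xFF34}; // same low byte 0x34
        for (int k : keys) {
            // m = 2^8 keeps only the 8 low bits, so all three keys land in slot 52;
            // the prime m = 251 spreads them over slots 142, 154 and 72.
            System.out.println(k + ": mod 256 = " + h(k, 256) + ", mod 251 = " + h(k, 251));
        }
    }
}
```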
3 Skip Lists
Skip List
• A hierarchical linked list
• A randomized alternative for implementation of ADT Dictionary
• Insertion uses randomization (“coin flipping”)
• Good expected-case performance
• Worst-case performance in skip lists is very unlikely (for more than 250 items, the risk of a search time more than 3 times the expected is below 10^−6)
Skip List data structure
• Levels L1, . . . , Lh of nodes (key, value)
• Some nodes appear in several levels (towers)
• Special keys: −∞ and +∞, smaller/larger than any real key
• Several levels of doubly linked lists, less dense higher up
– Level 1: all nodes in a doubly linked list between −∞ and +∞, ordered by the ’<’ relation
– On average half of the nodes in Li appear in Li+1
– Special keys −∞ and +∞ appear on all levels
– Only −∞ and +∞ appear on level Lh
Example: Skip List
Searching
When searching for a key k:
• Follow the list of the highest level . . .
– Stop before we pass a key ki > k (otherwise we would overshoot what we are looking for)
– If found, return it, otherwise . . .
• We have stopped at a level:
– Have we found the key?
– If not, switch to the next lower level (via the “last tower”) and continue searching
– Returns: the largest key ki ≤ k (which might be −∞)
Searching
Search for key k:
• Similarity with binary search – but for lists
• Example: find(18)
Insert
function INSERT(x)
    P ← FIND(x)
    if P.value < x then
        insert new list cell after P
        “toss a coin” to decide how high this “tower” should be:
        while “toss a coin” = yes do
            increase the tower height one step
            (possibly increase the height of the skip list)
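A compact Java sketch of search and insert is given below. It deviates from the description above in ways that should be read as assumptions: each tower is a single node with an array of forward pointers instead of doubly linked levels, the head node stands in for −∞ and null for +∞, keys are plain ints (Integer.MIN_VALUE is reserved for the head), duplicates are rejected, and the height is capped at a made-up MAX_LEVEL of 16. The class name SkipList and the seed parameter are likewise illustrative.

```java
import java.util.Random;

// Skip list over int keys with forward-pointer towers (hypothetical sketch).
class SkipList {
    private static final int MAX_LEVEL = 16;

    private static final class Node {
        final int key;
        final Node[] next; // next[i] is the successor on level i
        Node(int key, int height) {
            this.key = key;
            this.next = new Node[height];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_LEVEL); // acts as -infinity
    private final Random coin;
    private int height = 1; // number of levels currently in use

    SkipList(long seed) { coin = new Random(seed); }

    // Start at the top level, move right while the next key is <= k, then drop down.
    boolean contains(int k) {
        Node p = head;
        for (int lvl = height - 1; lvl >= 0; lvl--) {
            while (p.next[lvl] != null && p.next[lvl].key <= k) {
                p = p.next[lvl];
            }
        }
        return p != head && p.key == k;
    }

    // Inserts k unless it is already present; returns true if a tower was added.
    boolean insert(int k) {
        Node[] last = new Node[MAX_LEVEL]; // last tower visited on each level
        Node p = head;
        for (int lvl = height - 1; lvl >= 0; lvl--) {
            while (p.next[lvl] != null && p.next[lvl].key < k) {
                p = p.next[lvl];
            }
            last[lvl] = p;
        }
        if (p.next[0] != null && p.next[0].key == k) return false;

        int h = 1;
        while (h < MAX_LEVEL && coin.nextBoolean()) h++;         // coin flips set the tower height
        for (int lvl = height; lvl < h; lvl++) last[lvl] = head; // newly opened levels start at head
        if (h > height) height = h;

        Node node = new Node(k, h);
        for (int lvl = 0; lvl < h; lvl++) { // splice the tower into every level it reaches
            node.next[lvl] = last[lvl].next[lvl];
            last[lvl].next[lvl] = node;
        }
        return true;
    }
}
```

A typical use is `SkipList s = new SkipList(42L); s.insert(20); s.contains(20);`. Passing a fixed seed makes the tower heights reproducible between runs.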
Example: insert(20)
Delete . . . and properties
• Similar to insert:
– Search
– If found, remove and fix the links between the towers
• Worst-case execution time of find, insert, and remove on a skip list of n items is O(n + h)
• But expected execution time (assuming uniform distribution of keys) is O(log n) if the search starts at height ⌊log n⌋
Voluntary Homework Problem
In a hash table of size 13, use linear probing to insert: 14, 43, 28, 66, 79, 19, 1, 21, 72, 29
Describe where each element is placed and explain why.