Lecture Overview: ADT

J. Maluszynski, IDA, Linköpings Universitet, 2004.
TDDB57 DALG – Lecture 4: ADT Array, Hashing .
Lecture Overview: (1) Arrays, (2) Hashing
ADT Array
Representation of arrays
list representations
ADT Set, ADT Dictionary
Hashing techniques
– Chaining: separate or coalesced
– Open addressing: Linear Probing or Double Hashing
Hash Functions
[Lewis/Denenberg pp. 130–131, pp. 133–134, pp. 139–140, 6.1, 8.3, 8.5]

ADT Array
Array: a function from an integer interval [l, u] of indices to a value set
Operations:
Access(X, i) returns the value of X at index i
Length(X) returns the number of elements, u − l + 1
Assign(X, i, v) changes the value of X at index i to v
Initialize(X, v) assigns v to every element of X
Special case: strings with additional operations
Multidimensional arrays

Arrays with elements of different size
Different elements X[i] may have different sizes s_i, l ≤ i ≤ u
Example: EMIL (s1 = 4), XY (s2 = 2), LISSI (s3 = 5), MURPHY (s4 = 6)
Represented as table of pointers to elements:
– space for pointers in addition to space for data
– time needed to follow them
+ fast substitution by moving pointers instead of moving data elements of large size s_i

Contiguous Representation of Arrays (for constant element size s)
Elements of array X stored in a table
Start address X
Each element has size s
i-th element stored starting at address X + s·(i − l)
Access in O(1) time
Length or u must be stored
Alternatively, a sentinel value may be used as end marker instead of keeping u.
Iteration:
sequence of addresses computed as above until u reached
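The address arithmetic of the contiguous representation can be sketched as follows. This is a minimal illustration, assuming constant element size s; the class name ContiguousArray, its parameter names and the sample start address 1000 are hypothetical and not part of the lecture, and a Python list stands in for raw memory.

```python
class ContiguousArray:
    """ADT Array over indices l..u, stored in one table of fixed-size cells."""

    def __init__(self, l, u, v=None, base=0, s=1):
        self.l, self.u = l, u
        self.base, self.s = base, s        # start address X and element size s (illustrative)
        self.cells = [v] * (u - l + 1)     # Initialize(X, v)

    def address(self, i):
        """Address of the i-th element: X + s * (i - l), computed in O(1)."""
        if not self.l <= i <= self.u:
            raise IndexError(i)
        return self.base + self.s * (i - self.l)

    def access(self, i):
        """Access(X, i): map the address back to a cell of the backing list."""
        return self.cells[(self.address(i) - self.base) // self.s]

    def assign(self, i, v):
        """Assign(X, i, v)."""
        self.cells[(self.address(i) - self.base) // self.s] = v

    def length(self):
        """Length(X) = u - l + 1."""
        return self.u - self.l + 1


X = ContiguousArray(3, 7, v=0, base=1000, s=4)
X.assign(5, 42)
print(X.address(5), X.access(5), X.length())  # 1008 42 5
```

Storing l and u (or the length) is what makes the bounds check and the iteration limit possible; a sentinel value would replace the stored u.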
Linked list representation of arrays
Used for sparse arrays: majority of the elements are null elements.
Example: 0 4 0 0 25 0 0 0 81 0 0 0
is stored as a list of (Inf, Idx, Next) items: (4, 2) → (25, 5) → (81, 9)
– access in O(m) time in the worst case, with m number of nonzero entries
For a sparse matrix we may use
+ list items linked in two dimensions (Right, Down pointers; row/col index)
+ compressed storage using one-dimensional arrays
[Figure: a sparse matrix with nonzero entries 11, 13, 22, 31, 43, shown uncompressed, row-compressed (arrays A, FirstInRow, Col) and column-compressed (arrays A, FirstInCol, Row)]
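The linked list representation of a sparse array can be sketched as below. The field names inf, idx and next follow the (Inf, Idx, Next) items of the example; the function names from_dense and access are illustrative assumptions, and indices are taken to start at 1 as in the example.

```python
class Node:
    """One list item: value (Inf), array index (Idx), link to the next item (Next)."""

    def __init__(self, inf, idx, next=None):
        self.inf, self.idx, self.next = inf, idx, next


def from_dense(values, null=0):
    """Build the list from a dense sequence, keeping only non-null elements."""
    head = tail = None
    for idx, v in enumerate(values, start=1):
        if v != null:
            node = Node(v, idx)
            if head is None:
                head = tail = node
            else:
                tail.next = node
                tail = node
    return head


def access(head, i, null=0):
    """Access(X, i): walk the list, O(m) in the worst case for m non-null entries."""
    node = head
    while node is not None and node.idx < i:
        node = node.next
    return node.inf if node is not None and node.idx == i else null


head = from_dense([0, 4, 0, 0, 25, 0, 0, 0, 81, 0, 0, 0])
print(access(head, 5), access(head, 6))  # 25 0
```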
ADT Set
Operations on a set S:
MakeEmptySet() creates and returns a new, empty set
IsMember(S, x) searches in S for an element x, returns true iff x ∈ S
IsEmptySet(S) returns true iff S = ∅
Insert(S, x) adds x to S
Delete(S, x) removes x from S
Size(S) returns the number of elements in S
IsEqual(S1, S2) returns true iff S1 = S2
Union(S1, S2) returns S1 ∪ S2
Intersection(S1, S2) returns S1 ∩ S2
Difference(S1, S2) returns S1 \ S2
For ordered sets (order function ≤):
FindMin(S) returns the least element x in S: x ≤ y for all y ∈ S
Sets
Set: collection of (mutually different) elements from a single universe.
Multiset: an element may be contained multiple times
Elements in a set are not necessarily ordered, but an available (total) order
can be used for more efficient retrieval of elements.
We consider only finite sets for representation in a computer.
Examples:
Set of cities in a country
Membership list of an association
Set of declared names maintained in a compiler
Set of words known to a spell-checking program
...
Dictionaries
Data sets: data items retrieved by a key.
Examples (data set, key):
Telephone directory, person name
English-Swedish dictionary, English word
Railway timetable, destination
Identifier list in compiled program, identifier
...
Static Dictionary: no updates allowed
Dynamic Dictionary: updates allowed
ADT Dictionary
Domain:
A dictionary is a set of pairs (k, i)
k ∈ K key,
i ∈ I information
The keys are unique and linearly ordered.
Operations on a dictionary D:
MakeEmptyDictionary(K, I) creates an empty dictionary for keys from K, information from I
LookUp(D, k) searches in D for a pair (item) with key k; returns (k, i) if found, Λ otherwise.
IsEmpty(D) returns true iff D contains no items.
Insert(D, k, i) adds to D an item (k, i) if k is not yet in D, otherwise overwrites the old information entry for k by i.
Delete(D, k) removes from D the item with key k, if existing.

Hashing
How to achieve (O(1) time) retrieval from a Set or Dictionary?

Ideal Hashing
Represent the data set as a table T (hash table).
Find a function h (hash function)
that maps keys to indices of T in a unique way:
k1 ≠ k2 ⇒ h(k1) ≠ h(k2)
and that can be computed in time O(1).
Store each item (k, i) in T[h(k)]
[Figure: LookUp(D,k) computes h(k) = j and finds the item (k, i) at index j of table T]

Hashing: Resolving Conflicts
In practice, hash functions are not unique (not injective).
A conflict / collision: k1 ≠ k2 but h(k1) = h(k2)
Chaining:
conflicting data elements are represented as linked lists
Separate Chaining:
the hash table includes only a pointer to the list
Coalesced Chaining:
the data are placed in the table and linked
a chain may include data with different hash values
Open Addressing:
the index for a conflicting value is determined by an algorithm
Hashing with Separate Chaining: Example
Store 10 integer keys: 54, 10, 18, 25, 28, 41, 38, 36, 12, 90
using a hash table of size 13
and hash function h with h(k) = k mod 13
[Figure: table with slots 0–12 and one linked list per nonempty slot; slot 2 holds 54, 28, 41; slot 5 holds 18; slot 10 holds 10, 36; slot 12 holds 25, 38, 12, 90]

Hashing with Separate Chaining: LookUp
Given: key k, hash table T, hash function h
1. compute h(k)
2. search for k in the list pointed to by T[h(k)]
Notation: probe = one access to the linked list data structure
1 probe for accessing the list header (if nonempty)
1 + 1 probes for accessing the contents of the first element
1 + 2 probes for accessing the contents of the second element
...
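The separate chaining example and the probe counting of LookUp can be sketched as follows. Python lists stand in for the linked lists, new keys are appended at the end of a chain (an assumption; the slide does not fix the insertion position), and the function names build and lookup are illustrative.

```python
M = 13  # table size from the example


def h(k):
    """Hash function of the example: h(k) = k mod 13."""
    return k % M


def build(keys):
    """Create the table; each slot holds the chain of keys hashing to it."""
    table = [[] for _ in range(M)]
    for k in keys:
        table[h(k)].append(k)
    return table


def lookup(table, k):
    """Return (found, probes): 1 probe for the list header, 1 more per element inspected."""
    probes = 1
    for item in table[h(k)]:
        probes += 1
        if item == k:
            return True, probes
    return False, probes


table = build([54, 10, 18, 25, 28, 41, 38, 36, 12, 90])
print(table[2], table[12])   # [54, 28, 41] [25, 38, 12, 90]
print(lookup(table, 41))     # (True, 4)
print(lookup(table, 15))     # (False, 4)
```

Key 15 is not stored but hashes to slot 2, so the unsuccessful lookup inspects the whole chain of three elements.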
How many probes P are needed to retrieve a hash table entry?
A probe (pointer dereferencing and comparison) takes constant time.

Hashing with Separate Chaining: Unsuccessful lookup
n data items
m positions in the table
Worst case:
all items have the same hash value
P = 1 + n
Average case:
Hash values equally distributed among m:
Average length of the list α = n/m
α = n/m is called the load factor
P = 1 + α

Hashing with Separate Chaining: Successful lookup
Average case:
Expected number P of probes for given key k
Access to T[h(k)] (beginning of a list L): 1 probe
Traversing L: k found after (|L| + 1)/2 probes
Expected |L| is α, thus:
P = 1 + (α + 1)/2 = 3/2 + α/2
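As a worked check of the probe counts for separate chaining, the sketch below evaluates them for the example table (n = 10 keys, m = 13 positions). It assumes the average-case values P = 1 + α for an unsuccessful lookup and P = 3/2 + α/2 for a successful one, with worst case P = 1 + n; the function names are illustrative.

```python
def load_factor(n, m):
    """alpha = n / m for n data items in a table with m positions."""
    return n / m


def probes_unsuccessful_avg(n, m):
    """1 probe for the list header plus a full traversal of a list of expected length alpha."""
    return 1 + load_factor(n, m)


def probes_successful_avg(n, m):
    """1 probe for the list header plus (alpha + 1) / 2 probes until k is found."""
    return 1 + (load_factor(n, m) + 1) / 2


def probes_unsuccessful_worst(n):
    """All n items share one hash value, so the whole list is traversed."""
    return 1 + n


print(round(load_factor(10, 13), 3))              # 0.769
print(round(probes_unsuccessful_avg(10, 13), 3))  # 1.769
print(round(probes_successful_avg(10, 13), 3))    # 1.885
print(probes_unsuccessful_worst(10))              # 11
```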
Coalesced Chaining (1)
Eliminate the first probe
by storing the first list item in the table entry
[Figure: hash table with slots 0–12; the first item of each chain is stored directly in the table entry. Keys shown: 41, 10, 28, 36, 12, 18, 90, 54, 38, 25]
The increase in space consumption is acceptable
if key fields are small or hash table is quite full
Coalesced Chaining (2)
Place data items in the table
Extend them with pointers
Resolve collisions by using the first free slot
Insert(41)
Insert(15)
Insert(1)
Insert(0)
...
[Figure: table with slots 0–12 after inserting 41, 15, 1 and 0, with chain pointers between entries]
Chains may contain keys with different hash values,
but all keys with the same hash value appear in the same list.
– table may become full
+ exploits space better with about the same average time

Open Addressing with Linear Probing
Sequential / Linear Probing
desired hash index j = h(k)
in case of conflict go to the next free position
from end of the table go to the beginning
Sequence of probed table positions:
H(k, 0) = h(k)
H(k, l+1) = (H(k, l) + 1) mod m
that is, H(k, l) = (h(k) + l) mod m
– close positions rapidly filled (primary clustering)
– Delete ??
[Figure: table with slots 0–9 holding 10, 12, 23, 34, 45, 99; Insert(62) probes 5 times]

Two Deletion Techniques for Open Addressing
Open addressing: a position that is in a probe sequence cannot simply be made into an empty position
Two approaches:
Rehashing: the deleted position is made empty; all subsequent items in the probe sequence are deleted and re-inserted into the hash table
Deleted bit: the deleted position is marked but kept in the probe sequence.
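Insertion with linear probing, including the probe count, can be sketched as below. The sketch assumes h(k) = k mod m with m = 10, which is consistent with the Insert(62) example but not stated on the slide; the function names are illustrative.

```python
def h(k, m):
    """Assumed hash function: h(k) = k mod m."""
    return k % m


def insert(table, k):
    """Insert k by linear probing and return the number of probes used."""
    m = len(table)
    for l in range(m):
        pos = (h(k, m) + l) % m      # H(k, l) = (h(k) + l) mod m
        if table[pos] is None:
            table[pos] = k
            return l + 1
    raise OverflowError("table is full")


table = [None] * 10
for k in (10, 12, 23, 34, 45, 99):
    insert(table, k)

print(insert(table, 62))  # 5
print(table)              # [10, None, 12, 23, 34, 45, 62, None, None, 99]
```

Key 62 hashes to position 2 and has to pass the occupied positions 2, 3, 4 and 5 before it reaches the free position 6, which illustrates primary clustering.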
Deletion under Open Addressing
[Figure: table with slots 0–9 holding 10, 12, 22, 32, 43, 99, before and after Delete(22).
If the slot of 22 is simply made empty, Lookup(32) stops at the empty slot: WRONG!!!
"Rehashing" technique: the table afterwards holds 10, 12, 32, 43, 99.
"Deleted bit" technique: the slot is kept in the probe sequence as "22 d".]
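The deletion problem and the two techniques (rehashing, deleted bit) can be sketched as follows for linear probing. The sketch assumes h(k) = k mod m with m = 10, which is not stated on the slide; all function names are illustrative, and DELETED is a marker object standing for a slot whose deleted bit is set.

```python
EMPTY = None
DELETED = object()   # stands for a slot whose "deleted bit" is set


def probe(k, m):
    """Linear probing sequence H(k, l) = (h(k) + l) mod m, with h(k) = k mod m assumed."""
    return ((k % m + l) % m for l in range(m))


def insert(table, k):
    """Place k in the first free slot of its probe sequence (k assumed not present)."""
    for pos in probe(k, len(table)):
        if table[pos] is EMPTY or table[pos] is DELETED:
            table[pos] = k
            return
    raise OverflowError("table is full")


def find(table, k):
    """Return the position of k, or None; an EMPTY slot ends the search."""
    for pos in probe(k, len(table)):
        if table[pos] is EMPTY:
            return None
        if table[pos] is not DELETED and table[pos] == k:
            return pos
    return None


def delete_naive(table, k):
    """Wrong: emptying the slot cuts the probe sequences that pass through it."""
    pos = find(table, k)
    if pos is not None:
        table[pos] = EMPTY


def delete_marked(table, k):
    """Deleted bit: mark the slot but keep it in the probe sequence."""
    pos = find(table, k)
    if pos is not None:
        table[pos] = DELETED


def delete_rehash(table, k):
    """Rehashing: empty the slot, then re-insert the items that follow up to the next empty slot."""
    pos = find(table, k)
    if pos is None:
        return
    table[pos] = EMPTY
    nxt = (pos + 1) % len(table)
    while table[nxt] is not EMPTY:
        moved, table[nxt] = table[nxt], EMPTY
        if moved is not DELETED:
            insert(table, moved)
        nxt = (nxt + 1) % len(table)


def build(keys, m=10):
    """Create a table of size m and insert the keys in order."""
    table = [EMPTY] * m
    for k in keys:
        insert(table, k)
    return table


KEYS = [10, 12, 22, 32, 43, 99]

t = build(KEYS)
delete_naive(t, 22)
print(find(t, 32))   # None: the lookup stops at the emptied slot

t = build(KEYS)
delete_rehash(t, 22)
print(t)             # [10, None, 12, 32, 43, None, None, None, None, 99]
```

With delete_marked the table keeps its layout and Lookup(32) simply continues past the marked slot; the price is that marked slots still lengthen probe sequences until they are reused.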
Lab 2: Open Addressing with Linear Probing and Re-hashing
The assignment: implement a hash table
open addressing, linear probing, deletion by re-hashing;
the keys are strings
the associated information is a type information; may be empty.
insert and delete are to be implemented by a single write operation:
– write(K, empty): delete K
– write(K, T): delete K; insert (K, T)
A skeleton code is given.
The assignment is to implement the operations look-up and write.

What is a good hash function?
Hashing should give a uniform distribution of hash values,
but this depends on the distribution of keys in the data to be hashed.
Example:
hashing last names in a (Swedish) student group
hash function: the ASCII-value of the last character
Bad choice: the majority of the names end with n

Open Addressing with Double Hashing
Assume k is a natural number.
second hashing function h2 computes increments in case of conflicts
increment beyond the table is taken modulo m, with m = size(T)
H(k, 0) = h(k)
H(k, l+1) = (H(k, l) + h2(k)) mod m = (h(k) + (l+1)·h2(k)) mod m
Linear probing is double hashing with h2(k) = 1
Requirements on h2:
h2(k) ≠ 0 for all k.
For each k, h2(k) has no common divisor with m:
all table positions can be reached by the computed increments
A common choice for h2:
h2(k) = q − (k mod q) for q < m, q prime, m prime.
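The probe sequence of double hashing can be sketched as below. The sketch assumes h(k) = k mod m and the choice h2(k) = q − (k mod q) with a prime q < m; the concrete values m = 13 and q = 7 are illustrative and not taken from the lecture.

```python
def probe_sequence(k, m, q):
    """Positions H(k, 0), ..., H(k, m-1) with H(k, l) = (h(k) + l * h2(k)) mod m."""
    h1 = k % m           # primary hash function (assumed: modulo-division)
    h2 = q - (k % q)     # increment in the range 1..q, so never 0
    return [(h1 + l * h2) % m for l in range(m)]


seq = probe_sequence(54, m=13, q=7)
print(seq[:4])                         # [2, 4, 6, 8]
print(sorted(seq) == list(range(13)))  # True: every table position is reached
```

Because m is prime, any increment between 1 and m − 1 has no common divisor with m, which is why the sequence visits all 13 positions.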
Hashing by Modulo-Division
m size of the table
The hash function is
h(k) = k mod m
Avoid:
m = 2^d: hashing gives d last bits of k
m = 10^d: hashing of decimal numbers gives last d digits
Prime numbers suggested for m
Check samples of real data to experiment with hash parameters

Hashing by Multiplication
Choose an irrational number θ
Choose table size m
h(k) = ⌊m · (kθ − ⌊kθ⌋)⌋
A nice θ is the reciprocal of golden ratio φ:
θ = φ^(−1) = (√5 − 1)/2 ≈ 0.61803399
Example: hashing with θ = φ^(−1), m = 100
[Figure: example table of input values and hash values; numbers shown: 1982, 87, 2000, 23, 1968, 22, 1954, 57, 1953, 95, 1774, 33]
Uniform distribution!!!

Hashing by Multiplication: A property of irrational numbers
θ an irrational number.
For i = 1, ..., n denote:
f_i = iθ − ⌊iθ⌋
Notice: f_i ∈ (0, 1); all of them are distinct.
We put them in order:
f_i1 < f_i2 < ... < f_in
For every n, there are at most three different lengths of the intervals (f_ij, f_ij+1)
f_(n+1) falls in a maximum length interval
Example with θ = φ^(−1):
f1 = 0.618, f2 = 0.236, f3 = 0.854, f4 = 0.472, f5 = 0.090, f6 = 0.708, f7 = 0.326
[Figure: approximate distribution of f1, ..., f10 on the axis 0.0 to 1.0]
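Hashing by multiplication with θ equal to the reciprocal of the golden ratio can be sketched as follows. The sketch reproduces the values f1, ..., f7 and checks the property that the points f_i cut the unit interval into at most three different interval lengths. The function names are illustrative; the interval count treats 0 and 1 as end points and rounds lengths to 9 decimals to absorb floating-point error, both of which are assumptions of this sketch.

```python
import math

THETA = (math.sqrt(5) - 1) / 2   # reciprocal of the golden ratio, about 0.61803399


def frac(x):
    """Fractional part x - floor(x)."""
    return x - math.floor(x)


def h(k, m):
    """h(k) = floor(m * (k*theta - floor(k*theta)))."""
    return math.floor(m * frac(k * THETA))


def interval_lengths(n):
    """Distinct lengths of the intervals into which f_1..f_n cut [0, 1]."""
    points = [0.0] + sorted(frac(i * THETA) for i in range(1, n + 1)) + [1.0]
    return {round(b - a, 9) for a, b in zip(points, points[1:])}


print([round(frac(i * THETA), 3) for i in range(1, 8)])
# [0.618, 0.236, 0.854, 0.472, 0.09, 0.708, 0.326]
print([h(k, 100) for k in (1, 2, 3)])   # [61, 23, 85]
print(len(interval_lengths(7)))         # 2
```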