Hashing for Direct File Access

116096421
Modified by Li Ma
Hashing for Direct Files
Introduction to Hashing
To access a record in a serial file, all previous records must be accessed first.
Consequently accessing record#100 is much slower than accessing record#1.
With direct files, records can be accessed directly, without accessing other records first.
So retrieving record#100 is just as fast as retrieving record#1. This requires direct
access storage.
Question: Given the key for a record, how do we find the record without scanning the
entire file?
Answer: From a key value, compute the address of the record, i.e., apply a function F
to obtain the record address: address = F(key).
The process of computing the record address from the key is called hashing, and all
the operations applied to the key to obtain the address make up the hash function.
Several different keys may have the same address, i.e. F(key1) = F(key2) = … This is
called a collision.
Hash Function
Load factor = (#records in file) / (max #records the file can hold), so the load factor
ranges from 0 to 1 (0 ≤ load factor ≤ 1)
– Load factor = 0 for an empty file
– Load factor = 1 for a full file
As load factor approaches 1, collisions become more likely, so the file has to be
expanded.
Often, the key is alphanumeric. In such cases, hashing usually consists of two steps:
1. Convert the key to a number
2. From the number, compute an address
With a good hash function, keys are distributed randomly (and uniformly) throughout the
file.
Example A
1. Take every third letter in a key (starting with the first letter) and add up the
alphabetic positions of these letters:
a. M o z a r t -> 13 + 1 = 14
b. T c h a i k o v s k y -> 20 + 1 + 15 + 11 = 47
2. Given a number k, use (k mod N) as the record address, where N is the file size (i.e.,
the maximum number of records).
Alphabetic positions:
A=1   B=2   C=3   D=4   E=5   F=6   G=7   H=8   I=9   J=10  K=11  L=12  M=13
N=14  O=15  P=16  Q=17  R=18  S=19  T=20  U=21  V=22  W=23  X=24  Y=25  Z=26

Key             Numeric Equivalent    (Mod 16) Address
Mozart          14                    14
Tchaikovsky     47                    15
Ravel           23                    7
Beethoven       44                    12
Mendelssohn     44                    12
Bach            10                    10
Greig           16                    0
Rachmaninoff    55                    7
Vivaldi         32                    0
Chopin          19                    3

N = 16
Load Factor = 10/16 = 0.625
#Collisions = 3
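The table for Example A can be reproduced with a short sketch. The function names `numeric_equivalent` and `hash_a` are made up for illustration, and a collision is counted here each time a key lands on an address that an earlier key already occupies, which is an assumption about how the count of 3 was obtained.

```python
def numeric_equivalent(key: str) -> int:
    """Sum the alphabetic positions (A=1 ... Z=26) of letters 1, 4, 7, ... of the key."""
    return sum(ord(ch) - ord("A") + 1 for ch in key.upper()[::3])


def hash_a(key: str, n: int = 16) -> int:
    """Example A: numeric equivalent of the key, reduced mod the file size n."""
    return numeric_equivalent(key) % n


KEYS = ["Mozart", "Tchaikovsky", "Ravel", "Beethoven", "Mendelssohn",
        "Bach", "Greig", "Rachmaninoff", "Vivaldi", "Chopin"]

addresses = [hash_a(k) for k in KEYS]

# Count a collision whenever an address has already been used by an earlier key.
seen = set()
collisions = 0
for address in addresses:
    if address in seen:
        collisions += 1
    seen.add(address)

load_factor = len(KEYS) / 16
print(addresses, load_factor, collisions)
```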
Example B: Mid-Square Method
1. Concatenate the alphabetic positions (two digits each) of the first and last letters
in the key, then square the result
2. Take the middle 2 digits of the squared number, call them k, and use (k mod N) as the
record address
Example:
MOZART: M = 13, T = 20 -> 1320 -> (1320)^2 = 1742400 -> middle 2 digits = 24 -> 24 mod 16 = 8
Key             Number    Square     Middle 2 digits    (mod 16) Address
Mozart          1320      1742400    24                 8
Tchaikovsky     2025      4100625    06                 6
Ravel           1812      3283344    33                 1
Beethoven       0214      45796      79                 15
Mendelssohn     1314      1726596    65                 1
Bach            0208      43264      26                 10
Greig           0707      499849     98                 2
Rachmaninoff    1806      3261636    16                 0
Vivaldi         2209      4879681    96                 0
Chopin          0314      98596      59                 11

#Collisions = 2
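A minimal sketch of the mid-square method follows. The notes do not say exactly which two digits count as the "middle" of a 5-, 6-, or 7-digit square; the rule used in `middle_two_digits` (start at index `(len - 1) // 2`) is an assumption chosen because it reproduces every row of the table. All function names are hypothetical.

```python
def first_last_number(key: str) -> int:
    """Concatenate the two-digit alphabetic positions of the first and last letters."""
    def position(ch: str) -> int:
        return ord(ch.upper()) - ord("A") + 1
    return position(key[0]) * 100 + position(key[-1])


def middle_two_digits(value: int) -> int:
    """Take two digits from the middle of value (assumed rule, see lead-in)."""
    digits = str(value)
    start = (len(digits) - 1) // 2
    return int(digits[start:start + 2])


def hash_b(key: str, n: int = 16) -> int:
    """Example B: square the number, keep the middle two digits, reduce mod n."""
    return middle_two_digits(first_last_number(key) ** 2) % n


KEYS = ["Mozart", "Tchaikovsky", "Ravel", "Beethoven", "Mendelssohn",
        "Bach", "Greig", "Rachmaninoff", "Vivaldi", "Chopin"]

middles = [middle_two_digits(first_last_number(k) ** 2) for k in KEYS]
addresses = [hash_b(k) for k in KEYS]
print(middles)
print(addresses)
```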
Example C: Folding Method
1. Convert the key to a number (first and last letters, as in Example B, but without
squaring the number)
2. Partition the number into equal parts, fold the parts over each other (the folded
part is read with its digits reversed), then sum, and truncate if needed.
Example:
MOZART: M = 13, T = 20 -> 1320 -> parts 13 and 20 -> 13 + 02 (fold over, then sum) = 15
Key             Number    Sum after fold over    (mod 16) Address
Mozart          1320      (13+02) 15             15
Tchaikovsky     2025      (20+52) 72             8
Ravel           1812      (18+21) 39             7
Beethoven       0214      (02+41) 43             11
Mendelssohn     1314      (13+41) 54             6
Bach            0208      (02+80) 82             2
Greig           0707      (07+70) 77             13
Rachmaninoff    1806      (18+60) 78             14
Vivaldi         2209      (22+90) 112            0
Chopin          0314      (03+41) 44             12

#Collisions = 0
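The folding step can be sketched as below, assuming the four-digit number is split into two two-digit halves and the right half is reversed before the sum, as the worked sums in the table suggest. The optional truncation step mentioned in the method is not modelled here; the function names are illustrative only.

```python
def first_last_number(key: str) -> int:
    """Concatenate the two-digit alphabetic positions of the first and last letters."""
    def position(ch: str) -> int:
        return ord(ch.upper()) - ord("A") + 1
    return position(key[0]) * 100 + position(key[-1])


def fold(number: int) -> int:
    """Split a four-digit number in two, reverse the right half, and add the halves."""
    digits = f"{number:04d}"
    left, right = digits[:2], digits[2:]
    return int(left) + int(right[::-1])


def hash_c(key: str, n: int = 16) -> int:
    """Example C: folded sum reduced mod n (no truncation applied)."""
    return fold(first_last_number(key)) % n


KEYS = ["Mozart", "Tchaikovsky", "Ravel", "Beethoven", "Mendelssohn",
        "Bach", "Greig", "Rachmaninoff", "Vivaldi", "Chopin"]

sums = [fold(first_last_number(k)) for k in KEYS]
addresses = [hash_c(k) for k in KEYS]
print(sums)
print(addresses)
```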
Collision Resolution
Recall the definition for hashing:
– Hashing: Given the key for a record, compute the record address.
– Collision: When two records hash to the same address.
– The hash value (address) is called their home address, but one of them must be
stored elsewhere. Finding the other storage location is called collision
resolution.
There are two basic approaches to resolving a shared address:
1. Open addressing or Progressive Overflow: Store the record at some other
address in the same file.
2. Separate overflow: Store the record in another file, called overflow area.
Open addressing
All records are stored in one file. The basic idea is:
– For each key, generate a sequence of addresses, called the probe sequence:
PA0, PA1, PA2, ...
– When a collision occurs, store the new record at the first available probe
address, i.e. at the first PAi that is not already storing a record.
Note that PA0 = home address = hash(key)
The most common probe sequence is of the form:
PAi = [hash(key) + c(i)] mod N, where i = 0,1,…, N-1.
The function hash(key) is a hash function, i.e. a function that maps keys to integers
in the range from 0 to N-1.
The function c(i) represents the collision resolution strategy. It is required to have the
following two properties:
– Property 1: c(0)=0. This ensures that the first probe in the sequence is the
home address.
– Property 2: The set of values {c(0) mod N, c(1) mod N,……, c(N-1) mod N}
must contain every integer between 0 and N-1. This property ensures that the
probe sequence eventually probes every possible file position.
Linear Probing
The simplest collision resolution strategy in open addressing is called Linear
Probing, in which:
PAi = [hash(key) + i * step] mod N, where N = File size (max # records) and
step is a constant, usually 1.
Note: PAi+1 = (PAi + step) mod N
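As a small illustration of the formula, the sketch below lists a full linear probe sequence and checks property 2 for it. The helper name `linear_probe_sequence` is hypothetical. The second call shows a case the notes do not discuss: a step that shares a factor with N (here step 4 with N = 12) revisits the same few addresses, so it would not satisfy property 2.

```python
def linear_probe_sequence(home: int, n: int, step: int = 1) -> list[int]:
    """Return PA0 ... PA(n-1) for linear probing: (home + i * step) mod n."""
    return [(home + i * step) % n for i in range(n)]


# Step 1 in a file of size 13: every address is probed exactly once.
sequence = linear_probe_sequence(10, 13)
covers_all = sorted(sequence) == list(range(13))

# Step 4 in a file of size 12: only addresses 0, 4 and 8 are ever probed.
partial = linear_probe_sequence(0, 12, step=4)

print(sequence[:5], covers_all)
print(sorted(set(partial)))
```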
Example:
N= 13 (In this example, assume the file size is 13)
PAi = [hash(key) + i] mod 13
Key             hash(key)         PA1    PA2    PA3    PA4
Mozart          14 mod 13 = 1     2      3      4      5
Tchaikovsky     47 mod 13 = 8     9      10     11     12
Ravel           23 mod 13 = 10    11     12     0      1
Beethoven       44 mod 13 = 5     6      7      8      9
Mendelssohn     44 mod 13 = 5     6      7      8      9
Bach            10 mod 13 = 10    11     12     0      1
Greig           16 mod 13 = 3     4      5      6      7
Rachmaninoff    55 mod 13 = 3     4      5      6      7
Vivaldi         32 mod 13 = 6     7      8      9      10
Chopin          19 mod 13 = 6     7      8      9      10
Hash File:
Record Number    Key
0
1                Mozart
2
3                Greig
4                Rachmaninoff
5                Beethoven
6                Mendelssohn
7                Vivaldi
8                Tchaikovsky
9                Chopin
10               Ravel
11               Bach
12
Search:
Search for Ravel, cost = 1 access
Search for Chopin, cost = 4 accesses
Insertion:
To insert an element, we follow the same probe sequence that would be used
to search for it. Thus linear probing finds an empty cell by doing a
linear search starting from position hash(key).
For example: Insert Eisner
Since hash(Eisner) = 6, we search for an empty address for Eisner starting at
home address 6. The search stops when an empty address is reached (i.e., at
address 12), so the cost is 7 accesses.
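The hash file, the two search costs, and the cost of inserting Eisner can be checked with the following sketch. It assumes the Example A hash (sum of every third letter's position) with N = 13 and step 1, and that the ten composers are inserted in the order listed in the table; `insert` and `search` are illustrative names, not part of the notes.

```python
N = 13  # file size used in the example


def numeric_equivalent(key: str) -> int:
    """Example A: sum the alphabetic positions of letters 1, 4, 7, ... of the key."""
    return sum(ord(ch) - ord("A") + 1 for ch in key.upper()[::3])


def probe_addresses(key: str):
    """Yield PA0, PA1, ... for linear probing with step 1."""
    home = numeric_equivalent(key) % N
    for i in range(N):
        yield (home + i) % N


def insert(table: list, key: str) -> int:
    """Store key at the first empty probe address; return the number of accesses."""
    for accesses, address in enumerate(probe_addresses(key), start=1):
        if table[address] is None:
            table[address] = key
            return accesses
    raise RuntimeError("file is full")


def search(table: list, key: str):
    """Return (address, accesses); address is None if an empty slot is reached first."""
    for accesses, address in enumerate(probe_addresses(key), start=1):
        if table[address] is None:
            return None, accesses
        if table[address] == key:
            return address, accesses
    return None, N


table = [None] * N
for composer in ["Mozart", "Tchaikovsky", "Ravel", "Beethoven", "Mendelssohn",
                 "Bach", "Greig", "Rachmaninoff", "Vivaldi", "Chopin"]:
    insert(table, composer)

ravel = search(table, "Ravel")
chopin = search(table, "Chopin")
eisner_cost = insert(table, "Eisner")
print(ravel, chopin, eisner_cost)
```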
Deletion:
Assume we delete Chopin by simply freeing its address (address 9), and then
insert Eisner (home address = 6). The search for an empty address now stops at
address 9, where Eisner can be stored. But a search should not always stop here!
– Problem: Deletions can cause searches to end too soon, since
searching stops at the first empty address. A record stored farther along
the same probe sequence than the freed address can no longer be found.
– Solution: When a record is deleted, we mark the address with a
“tombstone”. When searching for an empty address, we pass over
tombstones without stopping.
For example:
After Bach is also deleted, there are tombstones at addresses 9 and 11.
A search for an empty address for Eisner (home address = 6) passes over both
tombstones and stops at address 12, so the cost is still 7 accesses.
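The tombstone scheme can be sketched as follows, starting from the hash file built earlier and using the home addresses from the linear probing table. It follows the rule as stated in the notes, where an insertion passes over tombstones and uses the first truly empty address. Many implementations instead reuse a tombstone slot on insertion; that variant is not shown. The names `TOMBSTONE`, `find`, `delete`, and `insert` are assumptions made for this example.

```python
N = 13
EMPTY = None
TOMBSTONE = "<deleted>"  # marker for a deleted record (hypothetical representation)

# Home addresses taken from the linear probing table (Eisner hashes to 6).
HOME = {"Mozart": 1, "Tchaikovsky": 8, "Ravel": 10, "Beethoven": 5,
        "Mendelssohn": 5, "Bach": 10, "Greig": 3, "Rachmaninoff": 3,
        "Vivaldi": 6, "Chopin": 6, "Eisner": 6}

# The hash file as it stands before any deletion.
table = [EMPTY, "Mozart", EMPTY, "Greig", "Rachmaninoff", "Beethoven",
         "Mendelssohn", "Vivaldi", "Tchaikovsky", "Chopin", "Ravel", "Bach", EMPTY]


def find(table: list, key: str):
    """Return (address, accesses). Tombstones are passed over; an empty slot ends the search."""
    for i in range(N):
        address = (HOME[key] + i) % N
        if table[address] is EMPTY:
            return None, i + 1
        if table[address] == key:
            return address, i + 1
    return None, N


def delete(table: list, key: str) -> None:
    """Replace the record with a tombstone instead of freeing the address."""
    address, _ = find(table, key)
    if address is not None:
        table[address] = TOMBSTONE


def insert(table: list, key: str) -> int:
    """Store key at the first truly empty address, skipping tombstones; return accesses."""
    for i in range(N):
        address = (HOME[key] + i) % N
        if table[address] is EMPTY:
            table[address] = key
            return i + 1
    raise RuntimeError("no empty address")


delete(table, "Chopin")
delete(table, "Bach")
tombstones = [a for a, record in enumerate(table) if record == TOMBSTONE]
eisner_cost = insert(table, "Eisner")
print(tombstones, eisner_cost)
```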
Problem of linear probing:
If many keys hash to the same vicinity, a dense cluster of records can form.
The time required for a search increases with the size of the cluster. This is
called primary clustering.
For example:
Home address    PA0    PA1    PA2    PA3    PA4
5               5      6      7      8      9
6               6      7      8      9      10
Suppose the home address for Ravel, Bach, and Greig is 5, and that for Chopin,
Vivaldi, and Mozart is 6 (these assumed values differ from the earlier calculation).
Insert the following records in this order:
Ravel(5), Chopin(6), Bach(5), Vivaldi(6), Greig(5), Mozart(6)
Record Number    Key
0
1
2
3
4
5                Ravel
6                Chopin
7                Bach
8                Vivaldi
9                Greig
10               Mozart
11
12
Addresses 5 to 10 form a primary cluster.
Access Cost:
– How many probes does it take to retrieve Bach or Vivaldi?
  3 probes (i.e., 3 file accesses)
– How many probes does it take to retrieve Greig or Mozart?
  5 probes
These accesses slow down retrieval (and updating) significantly.
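The cluster and its access costs can be reproduced with a few lines, using the assumed home addresses 5 and 6 given above and linear probing with step 1. The number of probes needed to store a record equals the number needed to retrieve it later, so the sketch records that count at insertion time.

```python
N = 13

# (key, assumed home address), in the insertion order given in the example
ORDER = [("Ravel", 5), ("Chopin", 6), ("Bach", 5),
         ("Vivaldi", 6), ("Greig", 5), ("Mozart", 6)]

table = [None] * N
probes = {}
for key, home in ORDER:
    for i in range(N):
        address = (home + i) % N  # linear probing, step 1
        if table[address] is None:
            table[address] = key
            probes[key] = i + 1   # accesses needed to reach this record
            break

print(table[5:11])
print(probes)
```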
Partial Solution:
Use a non-linear probing function.
An alternative to linear probing that addresses the primary clustering problem
is called quadratic probing.
Quadratic Probing
In quadratic probing, the function c(i) is a quadratic function in i of the form c(i) = i^2.
Clearly c(i) = i^2 satisfies property 1. The following theorem gives the conditions
under which quadratic probing works:
Theorem: When quadratic probing is used in a file of size N, where N is a prime
number, the first floor(N/2) probes are distinct.
For example:
PAi = [hash(key) + 2i^2] mod 13
Since N = 13, floor(N/2) = 6, so the first 6 probes are distinct.
i    2i^2    5 + 2i^2    6 + 2i^2    PAi (home 5)    PAi (home 6)
0    0       5           6           5               6
1    2       7           8           7               8
2    8       13          14          0               1
3    18      23          24          10              11
4    32      37          38          11              12
5    50      55          56          3               4
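The theorem can be checked numerically for N = 13. The sketch below is only a check, not a proof; `quadratic_probes` is a hypothetical helper whose `factor` parameter covers both c(i) = i^2 and the c(i) = 2i^2 variant used in the example. The last line illustrates why the guarantee stops at about half the file: over all 13 probes, c(i) = i^2 reaches only 7 different addresses.

```python
def quadratic_probes(home: int, n: int, count: int, factor: int = 1) -> list[int]:
    """Return the first `count` probe addresses (home + factor * i^2) mod n."""
    return [(home + factor * i * i) % n for i in range(count)]


N = 13
half = N // 2  # floor(N/2) = 6

# The first floor(N/2) probes are distinct for every home address (c(i) = i^2).
all_distinct = all(len(set(quadratic_probes(home, N, half))) == half
                   for home in range(N))

# The probe sequences used in the example (c(i) = 2i^2).
home5 = quadratic_probes(5, N, half, factor=2)
home6 = quadratic_probes(6, N, half, factor=2)

# Over a full run of N probes, c(i) = i^2 does not reach every address.
reached = len(set(quadratic_probes(0, N, N)))

print(all_distinct, home5, home6, reached)
```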
Insert the following records in this order:
Ravel(5), Chopin(6), Bach(5), Vivaldi(6), Greig(5), Mozart(6)
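Hash file (the last two columns give the probe number i at which each address is
reached from home address 5 and from home address 6):
Address    Record     Probe i (home 5)    Probe i (home 6)
0          Greig      2
1          Mozart                         2
2
3                     5
4                                         5
5          Ravel      0
6          Chopin                         0
7          Bach       1
8          Vivaldi                        1
9
10                    3
11                    4                   3
12                                        4
Secondary cluster for home address 5: addresses 5, 7, 0, 10, 11, 3
Secondary cluster for home address 6: addresses 6, 8, 1, 11, 12, 4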
The primary cluster has been broken up into two secondary clusters.
All records with the same home address follow the same sequence of probe
addresses. This sequence is a secondary cluster.
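A short sketch of the same insertion with the probe function PAi = (home + 2i^2) mod 13 shows the effect: the six records no longer sit in one contiguous run, but records with the same home address still queue up along one shared probe sequence. The home addresses are the assumed values 5 and 6 from the example.

```python
N = 13

ORDER = [("Ravel", 5), ("Chopin", 6), ("Bach", 5),
         ("Vivaldi", 6), ("Greig", 5), ("Mozart", 6)]

table = [None] * N
probes = {}
for key, home in ORDER:
    for i in range(N):
        address = (home + 2 * i * i) % N  # quadratic probing with c(i) = 2i^2
        if table[address] is None:
            table[address] = key
            probes[key] = i + 1           # accesses needed to reach this record
            break

occupied = {address: key for address, key in enumerate(table) if key is not None}
print(occupied)
print(probes)
```

Compared with linear probing on the same input, the worst case drops from 5 probes to 3, yet Greig and Mozart still pay for the earlier records that share their home addresses.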
Solution: Double Hashing
Double Hashing
While quadratic probing eliminates the primary clustering problem, it places a
restriction on the number of items that can be put in the file. The file must be less
than half full.
Double hashing is yet another method of generating a probing sequence. It
requires two distinct hash functions:
hash1: K -> {0, 1, ..., N-1}
hash2: K -> {1, 2, ..., N-1}
The probing sequence is then computed as follows
PAi = [hash1(key) + i*hash2(key)] mod N
Each probe sequence is linear, but the step size now depends on the key.
How do we select a double hashing function?
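We can select hash2(key) = 1 + (key mod (N-1)), where key here stands for the
numeric equivalent of the key. For example (N = 13; the hash1 values are the assumed
home addresses 5 and 6 used in the previous examples):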
(Alphabetic positions: A=1, B=2, ..., Z=26.)

Key        Number         hash1    hash2    PA0    PA1    PA2    PA3    PA4
Mozart     13+1=14        6        3        6      9      12     2      5
Ravel      18+5=23        5        12       5      4      3      2      1
Bach       2+8=10         5        11       5      3      1      12     10
Greig      7+9=16         5        5        5      10     2      7      12
Vivaldi    22+1+9=32      6        9        6      2      11     7      3
Chopin     3+16=19        6        8        6      1      9      4      12
In the table, for Bach, PAi = [hash1(10) + i*hash2(10)] mod 13, which with the table's
values hash1 = 5 and hash2 = 11 gives PAi = (5 + 11i) mod 13.
– Records are now scattered throughout the file
– No cluster (primary or secondary)
– Records can now be accessed with fewer probes, i.e., more records are
stored at or near their home addresses.
– Thus, retrieval is now faster.
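The double hashing table can be recomputed with the sketch below. The hash1 values are the assumed home addresses from the table, not computed from the keys, and hash2 follows the formula 1 + (number mod (N-1)). The insertion order reuses the one from the earlier clustering examples, which is an assumption here, since the notes do not show the resulting file.

```python
N = 13

# key -> (numeric equivalent, assumed hash1 value), as listed in the table
RECORDS = {"Mozart": (14, 6), "Ravel": (23, 5), "Bach": (10, 5),
           "Greig": (16, 5), "Vivaldi": (32, 6), "Chopin": (19, 6)}


def hash2(number: int) -> int:
    """Second hash function: a step size in the range 1 .. N-1."""
    return 1 + number % (N - 1)


def probe_sequence(key: str, count: int = 5) -> list[int]:
    """Return PA0 ... PA(count-1) = (hash1 + i * hash2) mod N."""
    number, h1 = RECORDS[key]
    return [(h1 + i * hash2(number)) % N for i in range(count)]


sequences = {key: probe_sequence(key) for key in RECORDS}

# Insert in the order used in the earlier examples (assumed order).
table = [None] * N
probes = {}
for key in ["Ravel", "Chopin", "Bach", "Vivaldi", "Greig", "Mozart"]:
    for i, address in enumerate(probe_sequence(key, count=N)):
        if table[address] is None:
            table[address] = key
            probes[key] = i + 1
            break

print(sequences["Bach"])
print(probes)
```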
Buckets
Often a disk reads and writes not one record at a time but an entire block of
data, which may hold a large number of records. It therefore makes sense for
hash functions to compute block addresses, not record addresses.
This way, several records can be stored at the same address. Each such block is
called a bucket.
Now addresses refer to buckets, not records. We say that a hash function
identifies a bucket, into which the record is placed.
Collisions are a problem only when the bucket is full. A record that hashes to a
full bucket causes an overflow in that bucket.
Example:
Key             hash(key)
Mozart          1
Tchaikovsky     2
Ravel           4
Beethoven       5
Mendelssohn     5
Bach            4
Greig           3
Rachmaninoff    5
Vivaldi         0
Chopin          1

Using linear probing:
Step size = 1
Bucket size = 2

Bucket    Records
0         Vivaldi
1         Mozart, Chopin
2         Tchaikovsky
3         Greig
4         Ravel, Bach
5         Beethoven, Mendelssohn

#overflow = 1: bucket 5 was already full when Rachmaninoff (hash = 5) arrived, so the
record for Rachmaninoff has been discarded.
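The bucket example can be sketched as below. The number of buckets (6) is inferred from the addresses 0 to 5 in the figure, and the sketch models only what the figure shows: a record goes into its home bucket if there is room and is otherwise counted as an overflow. A fuller implementation would go on to probe the next bucket or use an overflow area; that part is left out here.

```python
BUCKETS = 6       # inferred from the bucket addresses 0..5 in the example
BUCKET_SIZE = 2   # records per bucket

# (key, hash value) in the order listed in the example
RECORDS = [("Mozart", 1), ("Tchaikovsky", 2), ("Ravel", 4), ("Beethoven", 5),
           ("Mendelssohn", 5), ("Bach", 4), ("Greig", 3), ("Rachmaninoff", 5),
           ("Vivaldi", 0), ("Chopin", 1)]

buckets = [[] for _ in range(BUCKETS)]
overflow = []
for key, home in RECORDS:
    if len(buckets[home]) < BUCKET_SIZE:
        buckets[home].append(key)   # room left in the home bucket
    else:
        overflow.append(key)        # home bucket is full: overflow

for address, records in enumerate(buckets):
    print(address, records)
print("overflow:", overflow)
```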
With this organization:
o Fewer overflows
o Records are stored much closer to home address
o Faster retrieval
So far we have used open addressing and all records are stored in one file.
Tradeoff of open addressing
– Clustering increases access time
– To avoid clustering, records with same home address must be scattered
throughout the file
– This leads to more disk-head movements, which also increases access time.
Solution: Separate Overflow