Chap12. Extensible Hashing

advertisement
File Structures by Folk, Zoellick and Riccardi
Chap12. Extendible Hashing
서울대학교 컴퓨터공학부
객체지향시스템연구실
SNU-OOPSLA-LAB
교수 김 형 주
File Structures
SNU-OOPSLA Lab.
1
Chapter Objectives





Describe the problem solved by extendible hashing and related
approaches
Explain how extendible hashing works; show how it combines
tries with conventional, static hashing
Use the buffer, file, and index classes of previous chapters to
implement extendible hashing, including deletion
Review studies of extendible hashing performance
Examine alternative approaches to the same problem, including
dynamic hashing, linear hashing, and hashing schemes that
control splitting by allowing for overflow buckets
File Structures
SNU-OOPSLA Lab.
2
Contents






12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches
File Structures
SNU-OOPSLA Lab.
3
12.1 Introduction

Dynamic files


Static hashing




undergo a lot of growths
described in chapter 11 (direct hashing)
typically worse than B-Tree for dynamic files
eventually requires file reorganization
Extendible hashing


hashing for dynamic file
Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)
File Structures
SNU-OOPSLA Lab.
4
Overview(1)

Direct access (hashing) files have static size, so not
suitable for files whose size is unknown in advance

Dynamic file structure is desired which retains the feature
of fast retrieval by primary key, and which also expands
and contracts as the number of records in the file
fluctuates (without reorganizing the whole file)

Similar motivation!

Indexed-sequential File ==> B tree

Hashing ==> Extendible Hashing
File Structures
SNU-OOPSLA Lab.
5
Overview(2)

Extendible Hashing
Primary key
Hashing function
H(key)
Extract first d digit
Directory
Index
File Structures
Table look-up
SNU-OOPSLA Lab.
File pointer
6
12.2 How Extendible Hashing works

Idea from Tries file (radix searching)

The branching factor of the tree is equal to the # of alternative
symbols in each position of the key
e.g.) Radix 26 trie - able, abrahms, adams, anderson,
adnrews, baird
 Use
a
b
File Structures
the first n characters for branching
b
d
n
l
r
adams
d
able
abrahms
e
anderson
r
andrews
baird
SNU-OOPSLA Lab.
7
Extendible Hashing




H maps keys to a fixed address space, with size the largest
prime less than a power of 2 (65531 < 216)
File pointers point to blocks of records known as buckets,
where an entire bucket is read by one physical data transfer,
buckets may be added to or removed from the file dynamically
The d bits are used as an index in a directory array containing
2d entries, which usually resides in primary memory
The value d, the directory size(2d), and the number of buckets
change automatically as the file expands and contracts
File Structures
SNU-OOPSLA Lab.
8
Extendible Hashing Example
Directory with d=3 and 4 buckets
d’=1
d=3
000
001
010
011
100
101
110
111
B0
H(key)=0
d’=3
B100 H(key)=100
d’=3
B101 H(key)=101
d’=2
B11 H(key)=11
File Structures
SNU-OOPSLA Lab.
9
Turning the trie into a directory

Using Trie for extendible hashing
(1) Use Radix 2 Trie :
Keys in A : beginning with 0
Keys in B : beginning with 10
Keys in C : beginning with 11
A
0
1
0
1
B
C
(2) Retrieving from secondary storage the buckets containing
keys, instead of individual keys
File Structures
SNU-OOPSLA Lab.
10
Representation of Trie (1)

Tree is not preferable (directory is not big)

A flattened array
1. Make a complete full binary tree
2. Collapse it into the directory structure
0
1
0
1
A
00
A
01
0
1
File Structures
B
10
B
C
11
C
SNU-OOPSLA Lab.
11
Representation of Trie(2)

Directory is a complete binary tree

Directory entry : a pointer to the associated bucket

Given an address beginning with the bits 10, the 210
directory entries

Introduced for uniform distribution
File Structures
SNU-OOPSLA Lab.
12
Retrieve a record

Steps in retrieving a record with a given key





find H(given key)
extract first d bits of H(given key)
use this value as an index into the directory to find a pointer
use this pointer to read a bucket into primary memory
locate the desired record within the bucket (scan)
File Structures
SNU-OOPSLA Lab.
13
Expansion & Contraction(1)


A pair of adjunct buckets with the same value of d’ which
share a common value of the first d’-1 bits of H(key) can
be combined if the average load < 50%, so all records
would be able to fit into one bucket
File contraction is the reverse of expansion; the directory
can be compacted and d decremented whenever all pairs
of pointers have the same values
File Structures
SNU-OOPSLA Lab.
14
Expansion & Contraction(2)
Bucket B0 overflows, then splits into B0 and B1
d=3
d’=2
000
001
010
d’=2
011
100
d’=3
101
110
111
d’=3
B00 H(key)=00..
B01 H(key)=01..
B100 H(key)=100..
B00 H(key)=101..
d’=2
B00 H(key)=11..
File Structures
SNU-OOPSLA Lab.
15
Expansion & Contraction(3)
d=4
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
File Structures
d’=2
B00 H(key)=00..
d’=2
B01 H(key)=01..
d’=4
B1000H(key)=1000..
d’=4
B1001H(key)=1001..
d’=3
B101 H(key)=101..
d’=2
B11 H(key)=11..
Bucket B100 overflows, d increase to 4
SNU-OOPSLA Lab.
16
Splitting to Handle Overflow (1)

When overflow occurs
e.g.1) Overflowing of bucket A
Split A into A and D
 Come to use additional unused bits
 No need to expand the directory

00
A
01
10
11
File Structures
B
C
00
A
01
D
10
B
11
C
SNU-OOPSLA Lab.
17
Splitting to Handle Overflow(2)

e.g. Overflowing of bucket B

Do not have additional unused bits
(need to expand the directory)
1. Divide B using 3 bits of hash address
2. Make a complete full binary tree
3. Collapse it into the directory structure
00
A
01
10
11
File Structures
B
C
SNU-OOPSLA Lab.
18
1. Result of overflow of bucket B
A
0
B
0
1
0
1
D
1
C
3. Directory
2. Complete Binary Tree
0
0
000
0
1
001
1
0
A
1
1
0
1
File Structures
0
1
0
1
A
010
011
B
B
100
D
101
C
SNU-OOPSLA Lab.
D
110
C
111
19
Creating Address

Function hash(KEY)



Fold/Add hashing algorithm
Do not MOD hashing value by address space since no fixed
address space exists
Output from the hash function for a number of keys
bill
lee
pauline
alan
julie
mike
elizabeth
mark
File Structures
0000 0011 0110 1100
0000 0100 0010 1000
0000 1111 0110 0101
0100 1100 1010 0010
0010 1110 0000 1001
0000 0111 0100 1101
0010 1100 0110 1010
0000 1010 0000 0111
SNU-OOPSLA Lab.
20
Int Hash (char * key)
{
int sum = 0;
int len = strlen(key);
if (len % 2 == 1) len ++; // make len even
for (int j = 0; j < len; j+2)
sum = (sum + 100 * key[j] + key[j+1]) % 19937;
return sum;
}
Figure 12.7 Function Hash (key) returns an integer hash value for key
for a 15 bit
File Structures
SNU-OOPSLA Lab.
21
Int MakeAddress (char * key, int depth)
{
int retval = 0;
int hashVal = Hash(key);
// reverse the bits
for (int j = 0; j < depth; j++)
{
retval = retval << 1;
int lowbit = hashVal & 1;
retval = retval | lowbit;
hashVal = hashVal >> 1;
}
return retval;
}
Figure 12.9 Function MakeAddress(key,depth)
File Structures
SNU-OOPSLA Lab.
22

















Class Bucket: protected TextIndex
{protected:
Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
int Insert (char * key, int recAddr);
int Remove(char * key);
Bucket * Split ();
int NewRange (int & newStart, int & newEnd);
int Redistribute (Bucket & newBucket);
int FindBuddy ();
int TryCombine ();
int Combine (Bucket * buddy, int buddyIndex);
int Depth;
Directory & Dir;
int BucketAddr;
friend class Directory;
friend class BucketBuffer;
}; Figure 12.10 Main members of class Bucket
File Structures
SNU-OOPSLA Lab.
23
class Directory
{public:
Directory (…..); ~Directory();
int Open (..); int Create(…); int Close();
int Insert(…); int Delete(…); int Search(…);
protected
int DoubleSize();
int Collape();
int InsertBucket (….);
int Find (…);
int StoreBucket(…);
int LoadBucket(…)
…..
}
Figure 12.11 Definition of class Directory
File Structures
SNU-OOPSLA Lab.
24
12.4 Deletion

When to combine buckets

Buddy buckets: the buckets are siblings and at the leaf level
of the tree (Buddy means something like friend)
e.g., B and D in page 19 are buddy buckets

Examine the directory to see if we can make changes
there

Shrink the directory if none of the buckets requires the depth
of address information that is currently available in the
directory
File Structures
SNU-OOPSLA Lab.
25
Buddy Bucket

Given a bucket with an address uvwxy, where u, v,
w, x, and y have values of either 0 or 1, the buddy
bucket, if it exists, has the value uvwxz, such that
z = y XOR 1

If enough keys are deleted, the contents of buddy
buckets can be combined into a single bucket
File Structures
SNU-OOPSLA Lab.
26
Collapsing the Directory


Collapse condition

If a single cell, downsizing is impossible

If there is a pair of directory cells that do not both point to the
same bucket, collapsing is impossible
Allocating space

Allocate half the size of the original

Copy the bucket references shared by each cell pair to a single
cell in the new directory
File Structures
SNU-OOPSLA Lab.
27
12.5 Extendible Hashing Performance



Time : O(1)

If the directory can kept in RAM: a single access

Otherwise: two accesses are necessary
Space utilization of the bucket

r (# of records), b (block size), N (# of Blocks)

Utilization = r / bN

Average utilization ==> 0.69
Space utilization for the directory
 How
large a directory should we expect to have,
given an expected number of keys?

Expected value for the directory size by Flajolet(1983)

Estimated directory size =3.92 / b X r(1+1/b)
File Structures
SNU-OOPSLA Lab.
28
Space utilization for buckets

Periodic and fluctuating




With uniform distributed addresses, all the buckets tend to fill up at the
same time -> split at the same time
As buffer fills up : 90%
After a concentrated series of splits : 50%
r : # of records , b : block size



N ~= 4/(b ln 2)
Utilization = r / bN ~= ln 2 = 0.69
Average utilization of 69%
 B tree space utilization

Normal B-tree : 67%, B-tree with redistribution in insertion : 85 %
File Structures
SNU-OOPSLA Lab.
29
12.6 Alternative Approaches(1): Dynamic Hashing

Similar to dynamic extendible hashing


Use a directory to track bucket addresses
Extend the directory through the use of tries

Start with a hash function that covers an address space of
a fixed size

When overflow occurs

splits forming the leaves of a trie that grows down from the
original address node makes a trie
File Structures
SNU-OOPSLA Lab.
30
Alternative Approaches(2): Dynamic Hashing


Two kinds of nodes

External node: reference a data bucket

Internal node: point to two children index nodes

When a node has split children, it changed from an external
node to an internal node
Two hash functions

Apply the first hash function original address space

if external node is found : search is completed

if internal node is found : apply second hash function
File Structures
SNU-OOPSLA Lab.
31
(a)
(b)
1
2
1
3
2
Original
address
space
4
3
40
(c)
1
20
21
41
4
3
2
Original
address
space
4
1
Original
address
space
41
410
File Structures
SNU-OOPSLA Lab.
411
32
Dynamic Hashing vs. Extendible Hashing(1)

Overflow handling



Both schemes extend the hash function locally, as a binary search
trie
Both schemes use directory structure

Dynamic hashing: a linked structure

Extendible hashing: perfect tree expressible as an array
Space Utilization

both schemes is the same (space utilization : 69%)
File Structures
SNU-OOPSLA Lab.
33
Dynamic Hashing and Extendible Hashing(2)

Growth of directory



Actual size of an index node


Dynamic hashing: slower, more gradual growth
Extendible hashing: extend directory by doubling it
Dynamic hashing is lager than a directory cell in extendible
hashing (because of pointers)
Page fault


Dynamic hashing: more than one page fault (with linked structure
for the directory)
Extendible hashing: single page fault
File Structures
SNU-OOPSLA Lab.
34
Alternative Approaches(3): Linear Hashing

Unlike extendible hashing and dynamic hashing, linear hashing does
not use a directory.

The actual address space is extended one bucket at a time as buckets
overflow

Because the extension of the address space does not necessarily
correspond to the bucket that is overflowing,
linear hashing necessarily involves the use of overflow buckets, even
as the address space expands

No directories: Avoid additional seek resulting from additional layer

Use more bits of hashed value

hd(k) : depth d hashing function (using function make_address)
File Structures
SNU-OOPSLA Lab.
35
The growth of address space in linear hashing(1)
w
a
b
c
d
00
01
10
11
a
b
c
d
A
000
01
10
11
100
(b)
(a)
y
x
a
00
b
01
c
10
x
d
A
11
100
B
101
a
00
b
01
(c)
File Structures
c
10
d
11
A
100
B
C
101
110
(d)
SNU-OOPSLA Lab.
(continued...)
36
The growth of address space in linear hashing(2)
x
a
00
b
01
c
10
d
11
A
100
B
C
D
101
110
111
(e)
File Structures
SNU-OOPSLA Lab.
37
Alternative Approaches(5)
:Approaches to Controlling Splitting


Postpone splitting: increase space utilization

B-Tree: redistribution rather than splitting

Hashing: placing records in chains of overflow buckets to
postpone splitting
Triggering event for splitting


Linear hashing
 Every time any bucket overflows
 Not split overflowing bucket
Litwin(1980): overall load factor of the file

Below 2 seeks, 75% ~ 80% storage utilization
File Structures
SNU-OOPSLA Lab.
38
Alternative Approaches(5)
:Approaches to Controlling Splitting

Postpone splitting for extensible hashing

Use chaining overflow bucket

Avoid doubling directory space

1.1 seek, 76% ~ 81% storage utilization
File Structures
SNU-OOPSLA Lab.
39
Let’s Review !!!






12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches
File Structures
SNU-OOPSLA Lab.
40
Download