Hash-Based Indexes Introduction As for any index, 3 alternatives for data entries

advertisement
Hash-Based Indexes
(From Chapter 11)
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Introduction
As for any index, 3 alternatives for data entries k*:
Hash-based indexes are best for equality
selections.
Static and dynamic hashing techniques exist
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Static Hashing
# primary pages fixed, allocated sequentially,
never de-allocated; overflow pages if needed.
h(k) mod N = bucket to which data entry with
key k belongs. (N = # of buckets)
h(key) mod N
key
0
2
h
N-1
Primary bucket pages
Overflow pages
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Static Hashing (Contd.)
Buckets contain data entries.
Hash fn works on search key field(s) of record r.
Must distribute values over range 0 ... N-1.
h(key) =
Long overflow chains can develop and degrade
performance
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Extendible Hashing
Main idea: If bucket (primary page) becomes full,
why not re-organize file by doubling # of buckets?
But reading and writing all buckets is expensive!
Idea:
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Insert h(r)=14
LOCAL DEPTH
GLOBAL DEPTH
2
00
2
4* 12* 32*16*
Bucket A
2
1* 5* 21*13* Bucket B
01
10
2
11
10*
Bucket C
2
DIRECTORY
15* 7* 19*
Bucket D
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Insert h(r)=20
2
LOCAL DEPTH
4* 12* 32* 16*
GLOBAL DEPTH
Bucket A
2
2
1* 5* 21*13* Bucket B
00
01
10
2
11
10*
Bucket C
2
DIRECTORY
15* 7* 19*
Bucket D
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Insert h(r)=20
2
LOCAL DEPTH
32* 16*
GLOBAL DEPTH
Bucket A
2
2
1* 5* 21*13* Bucket B
00
01
10
2
11
10*
Bucket C
2
DIRECTORY
15* 7* 19*
Bucket D
2
4* 12* 20*
Bucket A2
(`split image'
of Bucket A)
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Insert h(r)=32
LOCAL DEPTH
GLOBAL DEPTH
0
0
4* 12* 1* 10*
Bucket A
DIRECTORY
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Insert h(r)=16
LOCAL DEPTH
GLOBAL DEPTH
1
1
4* 12* 10* 32*
Bucket A
0
1
1
DIRECTORY
1*
Bucket B
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Insert h(r)=20
LOCAL DEPTH
GLOBAL DEPTH
2
2
4* 12* 32* 16*
1
1*
00
Bucket A
Bucket B
01
10
2
11
10*
Bucket C
DIRECTORY
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Insert h(r)=5, 15, 7, 19
LOCAL DEPTH
GLOBAL DEPTH
3
000
3
32* 16*
Bucket A
1
1* 5* 15* 7*
Bucket B
001
010
2
011
10*
Bucket C
100
101
110
111
3
DIRECTORY
4* 12* 20*
Bucket A2
(`split image'
of Bucket A)
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Deletions
Inverse of insertion
If removal of data entry makes bucket
empty, merge with ‘split image’
If each directory element points to same
bucket as its split image, can halve
directory
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Comments on Extendible
Hashing
If directory fits in memory, equality search
answered with _____ I/O; else _____
100MB file, 100 bytes/rec, 4K pages contain 1,000,000
records (as data entries) and 25,000 directory elements;
chances are high that directory will fit in memory.
Directory grows in spurts, and, if the distribution of
hash values is ________, directory can grow large
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Linear Hashing
This is another dynamic hashing scheme, an
alternative to Extendible Hashing
LH handles the problem of long overflow
chains without using a directory, and
handles duplicates
Main idea:
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Inserting h(r) = 43
Level=2, N=4
h
PRIMARY
Next=0 PAGES
h
3
2
000
00
001
01
010
10
011
11
32*44* 36*
9* 25* 5*
14* 18*10*30*
31*35* 7* 11*
(This info
is for illustration
only!)
(The actual contents
of the linear hashed
file)
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Example (Inserting h(r) = 43)
Level=2
h
h
3
PRIMARY
2 Next=0 PAGES
000
00
001
01
010
10
011
11
OVERFLOW
PAGES
32* 44* 36*
9* 25* 5*
14* 18*10*30*
31*35* 7* 11*
43*
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Inserting h(r) = 50 (End of a
Round)
Level=2
h3
h2
000
00
001
01
010
10
PRIMARY
PAGES
OVERFLOW
PAGES
32*
9* 25*
66*18* 10* 34*
Next=3
31*35* 7* 11*
011
11
100
00
101
01
5* 37*29*
110
10
14*30*22*
43*
44*36*
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Overview of LH File
In the middle of a round.
Bucket to be split
Next
Buckets that existed at the
beginning of this round:
this is the range of
Buckets split in this round:
If h Level ( search key value )
is in this range, must use
h Level+1 ( search key value )
to decide if entry is in
`split image' bucket.
hLevel
`split image' buckets:
created (through splitting
of other buckets) in this round
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Summary
Hash-based indexes: best for ______ searches,
cannot support _____ searches.
Static Hashing can lead to ________________.
Extendible Hashing uses directory doubling to avoid
___________
Duplicates may require ________________
Linear hashing avoids directory by splitting in rounds
Naturally handles ______________
Uses overflow buckets (but not very long in practice)
Database Management Systems, R. Ramakrishnan and Johannes Gehrke
Download