Storage and Indexing

Database Systems
(資料庫系統)
November 8, 2004
Lecture #9
By Hao-hua Chu (朱浩華)
Announcement
• Midterm exam: November 20 (Sat), 2:30 PM in CSIE 101/103
• Assignment #6 is available on the course homepage.
– It is due on 11/24.
– It is very difficult.
– We suggest you do it before the midterm exam.
• Assignment #7 will be available on the course homepage later this afternoon.
– It is due 11/16 (next Tuesday).
– It is easy.
– It will help you prepare for the midterm exam.
Cool Ubicomp Project
Counter Intelligence (MIT)
• Smart kitchen & kitchen wares
• Talking Spoon
– Salty, sweet, hot?
• Talking Cutlery
– Bacteria?
• Smart fridge & counters
– RFID tags
– Tracking food from the fridge to your mouth
Hash-Based Indexing
Chapter 11
Introduction
• Recall that hash-based indexes are best for equality selections.
– They cannot support range searches.
– Equality selections are useful for join operations.
• Static and dynamic hashing techniques exist
– Trade-offs similar to ISAM vs. B+ trees.
– Static hashing technique
– Two dynamic hashing techniques
• Extendible Hashing
• Linear Hashing
Static Hashing
• # primary pages fixed, allocated sequentially, never deallocated; overflow pages if needed.
• h(key) mod N = bucket to which the data entry with key k belongs. (N = # of buckets)

[Figure: h maps each key via h(key) mod N to one of N primary bucket pages (0 .. N-1); overflow pages are chained off full buckets.]
Static Hashing (Contd.)
• Buckets contain data entries.
• Hash function works on the search key field of record r.
– Ideally distributes values uniformly over range 0 .. N-1.
– h(key) = (a * key + b) usually works well.
– a and b are constants; lots known about how to tune h.
• Cost for insertion/deletion/search:
– two/two/one disk page I/Os (assuming no overflow chains).
• Long overflow chains can develop and degrade performance.
– Why poor performance? Overflow chains must be scanned linearly.
– Extendible and Linear Hashing: dynamic techniques that fix this problem.
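The static scheme above can be sketched in Python (hypothetical names; each bucket list stands in for a primary page plus its overflow chain, so a linear scan of the list models walking the chain):

```python
class StaticHashTable:
    """Static hashing sketch: N buckets fixed at creation, never changed."""

    def __init__(self, n_buckets=4, a=3, b=1):
        self.n = n_buckets
        self.a, self.b = a, b          # constants of h(key) = a*key + b
        self.buckets = [[] for _ in range(n_buckets)]

    def _h(self, key):
        # h(key) mod N picks the bucket; N never changes.
        return (self.a * key + self.b) % self.n

    def insert(self, key):
        # Appending past the page capacity models an overflow chain.
        self.buckets[self._h(key)].append(key)

    def search(self, key):
        # Linear scan = walking the primary page and its overflow pages.
        return key in self.buckets[self._h(key)]
```

Because N is fixed, a skewed key distribution simply makes some bucket lists (chains) long, which is exactly the degradation the slide describes.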
Simple Solution
• Avoid creating overflow pages:
– When a bucket (primary page) becomes full, double the # of buckets & re-organize the file.
• What’s wrong with this simple solution?
– High cost: reading and writing all pages is expensive!
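The cost of the naive rehash is easy to see in a sketch (hypothetical function; not from the slides): every key in every page is read and rewritten.

```python
def naive_double(buckets):
    """Rehash-everything: double the bucket count and move every key.
    Cost is proportional to ALL pages, which is why it is expensive."""
    new = [[] for _ in range(2 * len(buckets))]
    for bucket in buckets:                       # reads every page...
        for key in bucket:
            new[key % len(new)].append(key)      # ...and rewrites every entry
    return new

table = [[0, 4], [1, 5], [2, 6], [3, 7]]   # N = 4, h = key mod N
table = naive_double(table)                # N = 8: all 8 keys are touched
```

Extendible hashing avoids exactly this: it doubles only a small directory of pointers and rehashes only the one page that overflowed.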
Extendible Hashing
• Basic idea (another level of abstraction):
– Use a directory of pointers to buckets (hash to the directory entry).
– Double the # of buckets by doubling the directory.
– Split just the bucket that overflowed!
• Directory much smaller than file, so doubling it is much cheaper.
• Only one page of data entries is split:
– The page that overflows is rehashed into two pages.
• Trick lies in how the hash function is adjusted!
– Before doubling the directory, h(r) -> 0 .. N-1 buckets.
– After doubling the directory, h(r) -> 0 .. 2N-1.
Example
• Directory is an array of size 4.
• To find the bucket for r, take the last `global depth' bits of h(r).
– Example: if h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
• Global depth: # of bits used for hashing directory entries.
• Local depth of a bucket: # of bits used for hashing within that bucket.
• When can global depth be different from local depth?

[Figure: directory (GLOBAL DEPTH 2) with entries 00, 01, 10, 11 pointing to data pages. Bucket A: 4* 12* 32* 16*; Bucket B: 1* 5* 21* 13*; Bucket C: 10*; Bucket D: 15* 7* 19*; all LOCAL DEPTHs are 2.]
Insert h(r)=20 (Causes Doubling)
[Figure, before: GLOBAL DEPTH 2; Bucket A: 4* 12* 32* 16* is full. Binary values: 4 = 0000 0100, 12 = 0000 1100, 20 = 0001 0100, 16 = 0001 0000, 32 = 0010 0000.]

[Figure, after: directory doubled to 8 entries 000 .. 111 (GLOBAL DEPTH 3). Bucket A (local depth 3): 32* 16*; Bucket A2 (local depth 3, `split image' of Bucket A): 4* 12* 20*; Bucket B: 1* 5* 21* 13*; Bucket C: 10*; Bucket D: 15* 7* 19* (local depths still 2).]
Extendible Hashing Insert
• Check if the bucket is full.
– If not, insert and done!
• Otherwise, check whether local depth = global depth.
– If no, rehash the entries, distribute them into two buckets, and increment the local depth.
– If yes, double the directory, then rehash the entries and distribute them into two buckets.
• The directory is doubled by copying it over and `fixing' the pointer to the split image page.
– This works only when the least significant bits are used to index the directory.
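The insert procedure above can be sketched in Python (hypothetical class and names; a bucket is a dict holding its local depth and keys, and h(r) is taken to be the key itself, as in the slides):

```python
class ExtendibleHash:
    """Extendible hashing sketch. Several directory entries may share
    one bucket object; each bucket tracks its own local depth."""

    PAGE_SIZE = 4  # data entries per bucket page (assumption)

    def __init__(self):
        self.global_depth = 1
        b0 = {"depth": 1, "keys": []}
        b1 = {"depth": 1, "keys": []}
        self.directory = [b0, b1]

    def _dir_index(self, key):
        # Use the least significant `global_depth` bits of h(key) = key.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket["keys"]) < self.PAGE_SIZE:
            bucket["keys"].append(key)          # bucket not full: done
            return
        if bucket["depth"] == self.global_depth:
            self.directory += self.directory    # double by copying over
            self.global_depth += 1
        # Split the overflowing bucket: rehash its entries into two pages.
        bucket["depth"] += 1
        image = {"depth": bucket["depth"], "keys": []}
        bit = 1 << (bucket["depth"] - 1)
        old, bucket["keys"] = bucket["keys"], []
        for k in old:
            (image if k & bit else bucket)["keys"].append(k)
        # Fix directory pointers that should now reference the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image
        self.insert(key)  # retry; may trigger a further split

    def search(self, key):
        return key in self.directory[self._dir_index(key)]["keys"]
```

Replaying the slides' example (insert 4, 12, 32, 16, 1, 5, 21, 13, then 20) ends with global depth 3, Bucket A holding 32 and 16, and the split image holding 4, 12, 20.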
Insert 9
1: 0000 0001, 5: 0000 0101, 21: 0001 0101, 13: 0000 1101, 9: 0000 1001

[Figure, before: GLOBAL DEPTH 3; Bucket B (local depth 2): 1* 5* 21* 13* is full.]

[Figure, after: Bucket B (local depth 3): 1* 9*; Bucket B2 (local depth 3, `split image' of Bucket B): 5* 13* 21*. The directory is unchanged, since local depth was less than global depth; only the pointer for 101 is fixed. Buckets A: 32* 16*, A2: 4* 12* 20*, C: 10*, D: 15* 7* 19* are untouched.]
Directory Doubling
Why use least significant bits in the directory?
 Allows for doubling via copying!

[Figure: 6 = 110. With least significant bits, growing global depth from 2 to 3 turns directory 00, 01, 10, 11 into two stacked copies 000 .. 111, and 6* stays reachable under both 10 and 110. With most significant bits, doubling would interleave entries instead of simply copying. Least Significant vs. Most Significant.]
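The copying trick is a one-liner in a sketch (here `directory` is just a Python list of bucket labels; the labels are hypothetical):

```python
def double_directory(directory):
    """Doubling via copying: entry i and entry i + len(old) differ only
    in the new high bit, which buckets with smaller local depth ignore,
    so both copies may keep pointing at the same buckets."""
    return directory + directory   # pointers are fixed later, on split

old = ["A", "B", "C", "D"]        # global depth 2: indices 00 .. 11
new = double_directory(old)       # global depth 3: indices 000 .. 111
assert new[0b010] is new[0b110]   # 6 = 110 still reaches bucket "C"
```

This only works with least-significant-bit indexing; with most-significant bits, the new entries would have to be interleaved, not appended.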
Comments on Extendible Hashing
• If the directory fits in memory, an equality search is answered with one disk access; else two.
– 100MB file, 100 bytes/rec: you have 1M data entries.
– A 4K page (a bucket) can contain 40 data entries, so you need about 25,000 directory elements; chances are high that the directory will fit in memory.
– If the distribution of hash values is skewed (concentrated on a few buckets), the directory can grow large.
• Delete: if removal of a data entry makes a bucket empty, it can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
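The sizing arithmetic above is easy to check (using the slide's round numbers, with 4K approximated as 4000 bytes):

```python
file_bytes = 100_000_000       # 100MB file
record_bytes = 100             # 100 bytes per record
page_bytes = 4_000             # 4K page (bucket), approximated as 4000

data_entries = file_bytes // record_bytes            # 1,000,000 entries
entries_per_bucket = page_bytes // record_bytes      # 40 entries per page
directory_size = data_entries // entries_per_bucket  # 25,000 elements

# At, say, 8 bytes per pointer the directory is ~200KB,
# so it comfortably fits in memory.
print(data_entries, entries_per_bucket, directory_size)
```

With exact binary units (100 * 2^20 bytes, 4096-byte pages) the count comes out slightly higher, around 26,000, but the conclusion is the same.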
Linear Hashing (LH)
• This is another dynamic hashing scheme, an alternative to
Extendible Hashing.
– LH fixes the problem of long overflow chains (in static hashing)
without using a directory (in extendible hashing).
• Basic Idea: Use a family of hash functions h0, h1, h2, ...
– Each function’s range is twice that of its predecessor.
– Pages are split when overflows occur – but not necessarily the
overflowing page. (Splitting occurs in turn, in a round robin
fashion.)
– Buckets are added gradually (one bucket at a time).
– When all the pages at one level (the current hash function) have
been split, a new level is applied.
– Primary pages are allocated consecutively.
Levels of Linear Hashing
• Initial stage.
– The initial level distributes entries into N0 buckets.
– Call the hash function that performs this h0.
• Splitting buckets.
– If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing).
– Also, when a bucket overflows, some bucket is split.
• The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows).
• The next bucket to be split is the second bucket in the file, and so on, until the N0-th has been split.
• When buckets are split, their entries (including those in overflow pages) are distributed using h1.
– To access split buckets, the next-level hash function (h1) is applied.
– h1 maps entries to 2N0 (or N1) buckets.
Levels of Linear Hashing (Cnt)
• Level progression:
– Once all Ni buckets of the current level (i) are split, the hash
function hi is replaced by hi+1.
– The splitting process starts again at the first bucket, and hi+2 is
applied to find entries in split buckets.
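The level mechanics above can be sketched in Python (hypothetical names; h_level(k) = k mod (N0 * 2^level), and a bucket list longer than the page size models a primary page with an overflow chain):

```python
class LinearHash:
    """Linear hashing sketch: no directory, round-robin splits."""

    def __init__(self, n0=4, page_size=3):
        self.n0, self.page_size = n0, page_size
        self.level = 0                 # current level: h_level in use
        self.next = 0                  # next bucket to split (round robin)
        self.buckets = [[] for _ in range(n0)]

    def _bucket_for(self, key):
        i = key % (self.n0 * 2 ** self.level)            # try h_level
        if i < self.next:                                # bucket already split?
            i = key % (self.n0 * 2 ** (self.level + 1))  # use h_level+1
        return i

    def insert(self, key):
        b = self._bucket_for(key)
        overflow = len(self.buckets[b]) >= self.page_size
        self.buckets[b].append(key)    # chains an overflow entry if full
        if overflow:
            self._split()

    def _split(self):
        # Split the bucket `next` points at -- not necessarily the one
        # that overflowed -- redistributing its entries with h_level+1.
        old = self.buckets[self.next]
        self.buckets.append([])        # new page falls in sequence at the end
        mod = self.n0 * 2 ** (self.level + 1)
        self.buckets[self.next] = [k for k in old if k % mod == self.next]
        self.buckets[-1] = [k for k in old if k % mod != self.next]
        self.next += 1
        if self.next == self.n0 * 2 ** self.level:
            self.level += 1            # whole level split: advance level
            self.next = 0

    def search(self, key):
        return key in self.buckets[self._bucket_for(key)]
```

Replaying the slides' example (64, 36, 1, 17, 5, 6, 31, 15, then 9) splits bucket 00 into 000 (64) and 100 (36), advances `next` to 1, and leaves 9 on bucket 01's overflow chain.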
Linear Hashing Example
[Figure: h0 buckets, three entries per page. next -> 00: 64, 36; 01: 1, 17, 5; 10: 6; 11: 31, 15.]

• Initially, the level equals 0 and N0 equals 4 (three entries fit on a page).
• h0 maps index entries to one of four buckets.
• h0 is used and no buckets have been split.
• Now consider what happens when 9 (1001) is inserted (it will not fit in the second bucket).
• Note that `next' indicates which bucket is to split next (round robin).
Linear Hashing Example 2
[Figure: 000 (h1): 64; next -> 01 (h0): 1, 17, 5, overflow: 9; 10 (h0): 6; 11 (h0): 31, 15; 100 (h1): 36.]

• The page indicated by `next' is split (the first one), and next is incremented.
• An overflow page is chained to the primary page to contain the inserted value.
• If h0 maps a value from zero to next - 1 (just the first page in this case), h1 must be used to insert the new entry.
• Note how the new page falls naturally into the sequence as the fifth page.
Linear Hashing Example 3
[Figure: bucket contents after the inserts, with the markers next1, next2, next3 showing `next' advancing round-robin past each split. Split buckets (e.g. 000: 64, 8, 32, overflow 16; 001: 1, 17, 9; 101: 5, 13) are accessed with h1; unsplit buckets still use h0.]

• Assume inserts of 8, 7, 18, 14, 11, 32, 16, 10, 13, 23.
• After the 2nd split, the base level is 1 (N1 = 8); use h1.
• Subsequent splits will use h2 for inserts between the first bucket and next - 1.
Linear Hashing vs. Extendible Hashing
• What is the similarity?
– One round-robin pass of splitting in LH is the same as a one-step doubling of the directory in EH.
• What are the differences?
– Directory overhead vs. none.
– Overflow pages vs. none.
– Gradual splitting (of pages) vs. one-step doubling (of the directory).
– Pages allocated in order vs. not in order.
– Splitting non-overflowing pages vs. splitting overflowing pages.
Summary
• Hash-based indexes: best for equality searches; cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
– Directory keeps track of buckets and doubles periodically.
– Can get large with skewed data; additional I/O if it does not fit in main memory.
– A skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
Summary (Contd.)
• Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.
– Overflow pages are not likely to be long.
– Space utilization could be lower than Extendible Hashing, since splits are not concentrated on `dense' data areas.
• Can tune the criterion for triggering splits to trade off slightly longer chains for better space utilization.