Secondary Indexes and B-Trees

More on Indexes
Secondary Indexes
B-Trees
Source: our textbook,
slides by Hector Garcia-Molina
Secondary Indexes
Sometimes we want multiple indexes on a
relation.
• Ex: search Candies(name,manf) both by name
and by manufacturer
Typically the file would be sorted using the
key (ex: name) and the primary index would
be on that field.
The secondary index is on any other attribute
(ex: manf).
Secondary index also facilitates finding
records, but cannot rely on them being sorted
Sparse Secondary Index?
No!
Since records are not sorted on that
key, cannot predict the location of a
record from the location of any other
record.
Thus secondary indexes are always
dense.
[Figure: a sparse index on an unsorted sequence field does not make sense. Sparse index entries: 30, 20, 80, 100, 90, ...; data file order: 30, 50, 20, 70, 80, 40, 100, 10, 90, 60]
Design of Secondary Indexes
Always dense, usually with duplicates
Consists of key-pointer pairs ("key"
means search key, not relation key)
Entries in index file are sorted by key
Therefore second-level index is sparse
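A minimal Python sketch of such a dense index, assuming a small in-memory list stands in for the data file and list positions stand in for record pointers. The candy data and the names `build_secondary_index` and `lookup` are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical data file, sorted on the primary key (name).
# A record's list position plays the role of its record pointer.
candies = [
    ("Crunchie", "Cadbury"),
    ("KitKat", "Nestle"),
    ("Mars", "Mars Inc"),
    ("Smarties", "Nestle"),
    ("Twirl", "Cadbury"),
]

def build_secondary_index(records, field):
    """Dense index: one (search key, pointer) pair per record, sorted by key."""
    return sorted((rec[field], ptr) for ptr, rec in enumerate(records))

def lookup(index, key):
    """Return the pointers of all records whose search key equals `key`."""
    i = bisect_left(index, (key,))  # first entry whose key is >= `key`
    ptrs = []
    while i < len(index) and index[i][0] == key:
        ptrs.append(index[i][1])
        i += 1
    return ptrs

manf_index = build_secondary_index(candies, field=1)
print(lookup(manf_index, "Nestle"))  # pointers to KitKat and Smarties
```

Because the index entries are sorted even though the data file is not sorted on manf, a binary search over the index (or a sparse second level on top of it) works.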
Secondary indexes
[Figure: secondary index on the unsorted sequence field. Data file order: 30, 50, 20, 70, 80, 40, 100, 10, 90, 60. Dense first level: 10, 20, 30, 40, 50, 60, 70, ...; sparse second level: 10, 50, 90, ...]
Secondary Index and
Duplicate Keys
Scheme in previous diagram wastes
space in the presence of duplicate keys
If a search key value appears n times in
the data file, then there are n entries
for it in the index.
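Duplicate values & secondary indexes
One option: one index entry per record
Problem:
excess overhead!
• disk space
• search time
[Figure: dense index with entries 10, 10, 10, 20, 20, 30, 40, 40, 40, 40, ... pointing into a data file whose sequence field is in the order 20, 10, 20, 40, 10, 40, 10, 40, 30, 40]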
Buckets
To avoid repeating values, use a level of
indirection
Put buckets between the secondary index file
and the data file
One entry in index for each search key K; its
pointer goes to a location in a "bucket file",
called the bucket for K
Bucket holds pointers to all records with
search key K
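A rough sketch of the indirection, assuming the bucket file is one flat list of record pointers and each index entry stores where its bucket starts. Storing the bucket length in the index entry is a simplification made here, and `build_bucket_index` and `pointers_for` are illustrative names.

```python
from collections import defaultdict

def build_bucket_index(keys_in_file_order):
    """Return (index, bucket_file) for the search keys of an unsorted data file.

    bucket_file is one flat list of record pointers, grouped by key.
    index holds one (key, bucket start, bucket length) entry per distinct key.
    """
    groups = defaultdict(list)
    for ptr, key in enumerate(keys_in_file_order):
        groups[key].append(ptr)
    index, bucket_file = [], []
    for key in sorted(groups):
        index.append((key, len(bucket_file), len(groups[key])))
        bucket_file.extend(groups[key])
    return index, bucket_file

def pointers_for(index, bucket_file, key):
    """Follow the index entry for `key` to its bucket of record pointers."""
    for k, start, length in index:  # a real index would binary-search here
        if k == key:
            return bucket_file[start:start + length]
    return []

# Search keys in data-file order (sample values with duplicates).
file_keys = [20, 10, 20, 40, 10, 40, 10, 40, 30, 40]
index, bucket_file = build_bucket_index(file_keys)
print(len(index))                           # one entry per distinct key
print(pointers_for(index, bucket_file, 40))
```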
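Duplicate values & secondary indexes
[Figure: index with one entry per distinct key (10, 20, 30, 40, 50, 60, ...); each entry points to a bucket of record pointers, which in turn point into the unsorted data file]
Buckets save space as long as
search keys are larger than
pointers and the average key
appears at least twice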
Why “bucket” idea is useful
Indexes
name: primary
dept: secondary
floor: secondary
Records
Emp (name,dept,floor,...)
Query: SELECT name FROM Emp
WHERE dept = 'Toy' AND floor = 2
[Figure: the dept index entry for Toy and the floor index entry for 2 each point to a bucket of pointers into Emp]
Intersect Toy dept bucket and floor 2
bucket to get set of matching Emp’s
Saves disk I/O's
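The intersection can be sketched as follows, assuming each bucket is held as a set of record pointers and list positions stand in for pointers. The employee rows and the helper name `buckets` are invented for illustration.

```python
# Hypothetical Emp(name, dept, floor) records; list position = record pointer.
emp = [
    ("Ann", "Toy", 2),
    ("Bob", "Toy", 1),
    ("Cid", "Shoe", 2),
    ("Dee", "Toy", 2),
]

def buckets(records, field):
    """Map each value of `field` to the set of pointers of records holding it."""
    result = {}
    for ptr, rec in enumerate(records):
        result.setdefault(rec[field], set()).add(ptr)
    return result

dept_index = buckets(emp, 1)
floor_index = buckets(emp, 2)

# Intersect the two buckets first; only then touch the Emp records.
matching_ptrs = sorted(dept_index["Toy"] & floor_index[2])
names = [emp[ptr][0] for ptr in matching_ptrs]
print(names)
```

Only records whose pointers appear in both buckets are fetched, which is where the disk I/O saving comes from.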
Summary of Indexes So Far
Advantages:
• simple
• index is sequential file, good for scans
Disadvantages:
• either inserts are expensive
• or lose sequentiality (cf. next slide)
Instead use B-tree data structure to
implement index
Example
[Figure: index stored in continuous (sequential) blocks holding keys 10, 20, 30, 33, 40, 50, 60, 70, 80, 90 plus free space; later keys 39, 31, 35, 36, 32, 38, 34 end up in an overflow area (not sequential)]
B-Trees
Several related data structures
Key features are:
• automatically adjust number of levels of
indexes as size of data file changes
• storage on blocks is managed to keep
every block between half full and full =>
no overflow blocks needed
We'll actually study B+ trees
B-Tree Structure
an example of a balanced search tree: every
root-to-leaf path has same length
each node (vertex) in the tree is a block,
which contains search keys and pointers
parameter n, which is largest value so that
n+1 pointers and n keys fit in one block
• Ex: If block size is 4096 bytes, keys are 4 bytes,
and pointers are 8 bytes, then n = 340.
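The arithmetic behind n can be checked with a short sketch, assuming the block holds nothing but keys and pointers (no header). The function name `max_keys_per_block` is illustrative.

```python
def max_keys_per_block(block_size, key_size, ptr_size):
    """Largest n such that n keys and n+1 pointers fit in one block."""
    # Constraint: n * key_size + (n + 1) * ptr_size <= block_size
    return (block_size - ptr_size) // (key_size + ptr_size)

print(max_keys_per_block(4096, 4, 8))  # 340
```

With n = 340 the block uses 340*4 + 341*8 = 4088 bytes; n = 341 would need 4100 bytes, which no longer fits in 4096.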
Constraints on B-Tree Nodes
Keys in leaf nodes are copies of keys from
data file, in sorted order
Root contains between 2 and n+1 index node
pointers
Each internal node contains between
⌈(n+1)/2⌉ and n+1 index node pointers
Each non-leaf node consists of
ptr_1, key_1, ptr_2, key_2, …, key_(m-1), ptr_m
where ptr_i points to index node with keys
between key_(i-1) and key_i
Constraints (cont'd)
Each leaf contains between ⌊(n+1)/2⌋
and n data record pointers, plus a "next
leaf" pointer
Associated with each data record
pointer is a key, and the pointer points
to the data record with that key
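A small sketch of the occupancy bounds as a function of n, assuming the textbook's rounding (ceiling of (n+1)/2 for internal nodes, floor for leaves) and not counting a leaf's "next leaf" pointer. The function name `pointer_bounds` is illustrative.

```python
def pointer_bounds(n):
    """Return (min, max) pointer counts per node type for parameter n."""
    ceil_half = (n + 2) // 2   # ceiling of (n+1)/2
    floor_half = (n + 1) // 2  # floor of (n+1)/2
    return {
        "root": (2, n + 1),              # index node pointers
        "internal": (ceil_half, n + 1),  # index node pointers
        "leaf": (floor_half, n),         # data record pointers
    }

print(pointer_bounds(3))
print(pointer_bounds(4))
```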
Example B-tree nodes with n = 3
[Figure: a leaf holding keys 30 and 35, with pointers to the records with keys 30 and 35; a non-leaf holding key 30, with one pointer to the part of the tree with keys < 30 and one to the part with keys ≥ 30. Each node is drawn in textbook notation and in a more concise notation.]
Sample non-leaf
[Figure: non-leaf node with keys 57, 81, 95 and four pointers: to keys k < 57, to keys 57 ≤ k < 81, to keys 81 ≤ k < 95, and to keys k ≥ 95]
Sample leaf node:
[Figure: leaf node with keys 57, 81, 95, reached from a non-leaf node; its pointers lead to the records with keys 57, 81, and 95, and a final pointer leads to the next leaf in sequence]
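n=3
[Figure: full node vs. min. node. Non-leaf: full node has keys 120, 150, 180; min. node has key 30. Leaf: full node has keys 3, 5, 11; min. node has keys 30, 35. Annotation: "counts even if null".]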
B-Tree Example
n=3
[Figure: root with key 100; internal nodes with keys 30 and 120, 150, 180; leaves 3 5 11 | 30 35 | 100 101 110 | 120 130 | 150 156 179 | 180 200; leaf pointers go to records]
Insert into B+tree
(a) simple case
• space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
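Cases (a) and (b) can be sketched for a single leaf, assuming a leaf is just a sorted list of keys and that on overflow the left half keeps ⌈(n+1)/2⌉ keys, with the first key of the new right leaf passed up to the parent. The function name `insert_into_leaf` is illustrative, and cases (c) and (d), which repeat the split one level up, are not shown.

```python
from bisect import insort

def insert_into_leaf(keys, key, n):
    """Insert `key` into a leaf holding at most n keys.

    Returns (left, right, separator). If the leaf does not overflow,
    right and separator are None. Otherwise the leaf is split and
    separator is the key the parent must receive.
    """
    keys = list(keys)
    insort(keys, key)                 # keep the leaf sorted
    if len(keys) <= n:                # case (a): space available in leaf
        return keys, None, None
    mid = (len(keys) + 1) // 2        # case (b): left keeps ceil((n+1)/2) keys
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]

print(insert_into_leaf([30, 31], 32, n=3))   # simple case, no split
print(insert_into_leaf([3, 5, 11], 7, n=3))  # leaf overflow, split
```

If the parent is itself full when it receives the separator, the same splitting idea applies to the parent, and a split of the root creates a new root.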
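(a) Insert key = 32
n=3
[Figure: the leaf with keys 30, 31 has room, so 32 is added to it, giving 30 31 32; the leaf 3 5 11 and the index nodes above (keys 30 and 100) are unchanged]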
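(b) Insert key = 7
n=3
[Figure: the leaf 3 5 11 is full, so it splits into 3 5 and 7 11, and the separator key 7 is added to the parent; the leaf 30 31 is unchanged]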
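(c) Insert key = 160
n=3
[Figure: inserting 160 splits the leaf 150 156 179 into 150 156 and 160 179; the parent 120 150 180 is already full, so it splits as well (into 120 150 and 180) and key 160 moves up to the node containing 100]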
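(d) New root, insert 45
n=3
[Figure: old root with keys 10, 20, 30 over leaves 1 2 3 | 10 12 | 20 25 | 30 32 40. Inserting 45 splits the last leaf into 30 32 and 40 45; the full root splits into nodes with keys 10 20 and 40, and a new root with key 30 is created above them]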
Deletion from B-tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf
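A sketch of cases (a) to (c) for a leaf and its left sibling, under assumptions chosen to match the slide examples: a leaf needs at least ⌊(n+1)/2⌋ keys, an underfull leaf is coalesced with its sibling when all keys fit in one node, and otherwise the keys are shared out evenly. Other policies (for example, preferring redistribution whenever the sibling has a spare key) are also possible. The function name `delete_from_leaf` is illustrative, and the matching parent update is only described in comments.

```python
def delete_from_leaf(left, right, key, n):
    """Delete `key` from leaf `right`, whose left sibling is `left`.

    Returns (leaves, action), where action is "simple", "coalesce",
    or "redistribute".
    """
    min_keys = (n + 1) // 2
    right = [k for k in right if k != key]
    if len(right) >= min_keys:
        return [left, right], "simple"        # (a) leaf is still full enough
    merged = left + right
    if len(merged) <= n:
        # (b) everything fits in one leaf; the parent loses one key/pointer
        return [merged], "coalesce"
    # (c) share the keys; the parent's separator becomes the first key
    # of the new right leaf
    mid = (len(merged) + 1) // 2
    return [merged[:mid], merged[mid:]], "redistribute"

print(delete_from_leaf([10, 20, 30], [40, 50], 50, n=4))      # coalesce
print(delete_from_leaf([10, 20, 30, 35], [40, 50], 50, n=4))  # redistribute
```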
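(b) Coalesce with sibling
n=4
• Delete 50
[Figure: leaves 10 20 30 and 40 50 under a parent with keys 10, 40, 100. Deleting 50 leaves the second leaf with only 40, so it is merged into its sibling, giving 10 20 30 40, and key 40 is removed from the parent]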
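(c) Redistribute keys
n=4
• Delete 50
[Figure: leaves 10 20 30 35 and 40 50 under a parent with keys 10, 40, 100. Deleting 50 leaves the second leaf with only 40; key 35 moves over from the sibling, giving 10 20 30 and 35 40, and the parent key 40 is replaced by 35]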
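(d) Non-leaf coalesce
n=4
• Delete 37
[Figure: root with key 25; internal nodes with keys 10 20 and 30 40; leaves 1 3 | 10 14 | 20 22 | 25 26 | 30 37 | 40 45. Deleting 37 leaves a leaf holding only 30, which coalesces with its sibling 25 26. The internal node above is then left with only key 40, so it coalesces with its sibling 10 20 and the root key 25, producing a new root with keys 10, 20, 25, 40]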
B-tree deletions in practice
– Often, coalescing is not implemented
– Too hard and not worth it!
Applications of B-Trees
B-tree is used to implement indexes
The data record pointers in the leaves
correspond to the data record pointers in
sequential indexes
Some example uses:
• B-tree search key is primary key for data file, leaf
pointers form a dense index on the file
• B-tree search key is primary key for data file, leaf
pointers form a sparse index on the file
• B-tree search key is not primary key, leaf pointers
form a dense index on the file
B-Trees with Duplicate Keys
Change definition of B-tree:
If key K appears in an internal node, then K is
the smallest "new" key in the subtree S
rooted at the pointer that follows K in the
node
"New" means K does not appear in the part
of the B-tree to the left of S but it does
appear in S
Allow null key in certain situations
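Example B-Tree with Duplicates
[Figure: B-tree whose leaves are 2 3 5 | 7 13 | 13 17 23 | 23 23 | 23 37 41 | 43 47; keys 13 and 23 are repeated across leaves, and one internal node holds a null key (shown as "-") where the following subtree contains no new key]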
Lookup in B-Trees
Assume no duplicate keys.
Assume B-tree is a dense index.
To find the record with key K, search starting at
the root and ending at a leaf:
• if current node is not a leaf and has keys K1,
K2, …, Kn, find the largest key, Ki, in the
sequence that is ≤ K (take i = 0 if K < K1).
• follow the (i+1)-st pointer to a node at the
next level and repeat
• when a leaf node is reached, find the key with
value K and follow the associated pointer to
the data record
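A minimal sketch of the descent, assuming no duplicate keys and a tiny in-memory `Node` class (a hypothetical stand-in for disk blocks). At each non-leaf node it counts the keys that are ≤ K and follows the pointer just after them. The tree uses the keys of the n = 3 example tree.

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys          # sorted search keys
        self.children = children  # non-leaf: len(keys) + 1 child nodes
        self.records = records    # leaf: one record (pointer) per key

def lookup(node, key):
    """Return the record for `key`, or None if the key is absent."""
    while node.children is not None:
        # i = number of keys <= key; in a 0-based list, children[i]
        # is the (i+1)-st pointer.
        i = bisect_right(node.keys, key)
        node = node.children[i]
    for k, rec in zip(node.keys, node.records):
        if k == key:
            return rec
    return None

def leaf(*keys):
    """Build a leaf whose 'records' are just descriptive strings."""
    return Node(list(keys), records=[f"record {k}" for k in keys])

root = Node([100], children=[
    Node([30], children=[leaf(3, 5, 11), leaf(30, 35)]),
    Node([120, 150, 180], children=[
        leaf(100, 101, 110), leaf(120, 130), leaf(150, 156, 179), leaf(180, 200),
    ]),
])
print(lookup(root, 156))
```

Each loop iteration corresponds to reading one block, so the cost of a lookup is the number of levels in the tree.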
Range Queries with B-Trees
Range query: a query in which a range of
values is sought. Examples:
• SELECT * FROM R WHERE R.k > 40;
• SELECT * FROM R WHERE R.k >= 10 AND R.k <= 25;
To find all keys in the range [a,b]:
• Do a lookup on a: leads to leaf where a could be
• Search the leaf for all keys ≥ a
• If we find a key > b, we are done
• Else follow next-leaf pointer and continue searching
in the next leaf
• Continue until finding a key > b or no more leaves
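The leaf scan can be sketched as follows, assuming the starting leaf has already been located by a lookup on a and that leaves are chained by a next-leaf pointer. The `Leaf` class and the `link` and `range_scan` helpers are hypothetical, in-memory stand-ins for leaf blocks.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys  # sorted keys stored in this leaf
        self.next = None  # next-leaf pointer

def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for current, following in zip(leaves, leaves[1:]):
        current.next = following
    return leaves[0]

def range_scan(start_leaf, a, b):
    """Collect all keys k with a <= k <= b, starting at the leaf where a could be."""
    result = []
    leaf = start_leaf
    while leaf is not None:          # stop when there are no more leaves
        for k in leaf.keys:
            if k > b:                # found a key > b: we are done
                return result
            if k >= a:
                result.append(k)
        leaf = leaf.next             # follow the next-leaf pointer
    return result

leaves = [Leaf([3, 5, 11]), Leaf([30, 35]), Leaf([100, 101, 110]), Leaf([120, 130])]
link(leaves)
print(range_scan(leaves[0], 10, 25))   # lookup on 10 leads to the first leaf
print(range_scan(leaves[1], 30, 101))  # lookup on 30 leads to the second leaf
```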
Efficiency of B-Trees
B-trees allow lookup, insertion and deletion of
records with very few disk I/Os
Number of disk I/Os is number of levels in the B-tree plus cost of any reorganization
If n is at least 10, then splitting/merging blocks
will be rare and usually limited to the leaves
For typical sizes of keys, pointers, blocks and files,
3 levels suffice (see next slide)
Also can keep root block of B-tree in memory
Size of B-Tree
• Assume
  – 4096 bytes per block
  – 4 bytes per key (e.g., integer)
  – 8 bytes per pointer
  – no header info in the block
• Then n = 340 (can keep n keys and n+1 pointers in a
block)
• Assume on average a block has 255 pointers
• Count:
  – one node at level 1 (the root)
  – 255 nodes at level 2
  – 255*255 = 65,025 nodes at level 3 (leaves)
  – each leaf has 255 pointers, so total number of records is more
than 16 million
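The count can be reproduced with a short sketch, assuming every node (including each leaf) uses exactly the average of 255 pointers. The function names are illustrative.

```python
def nodes_per_level(fanout, levels):
    """Number of nodes at each level, root first."""
    return [fanout ** level for level in range(levels)]

def records_indexed(fanout, levels):
    """Each leaf holds `fanout` record pointers, so records = fanout ** levels."""
    return fanout ** levels

print(nodes_per_level(255, 3))   # [1, 255, 65025]
print(records_indexed(255, 3))   # 16581375
```

Here 255 is three quarters of n = 340, i.e. blocks that are midway between half full and full.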