Slides on external sorts, MultiSet / B+

CSC 556 – DBMS II, Spring 2013
April 10 & 17, 2013
Storage media hierarchies, external
sorts, B-trees
Storage medium abstraction
• Storage medium as a subclass of an interface
allows you to prototype the storage medium
semi-independently from the DBMS structures atop it.
Sequential File I/O
• Queue abstraction supports sequential file I/O
or core I/O via a series of enqueue, peek &
dequeue calls. Supplies one-record lookahead.
• External sorts, i.e., sorts where the data do
not fit into memory, rely on this approach.
• A record size entry can precede any variable
length record in the file, OR
• A sentinel value can mark a record’s end.
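A minimal sketch of the first option, a record size entry preceding each variable-length record. It assumes a 4-byte size field in host byte order; the helper names writeSizedRecord and readSizedRecord are hypothetical, and any std::iostream (a std::stringstream in a test, a std::fstream in practice) can stand in for the sequential file.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Write one variable-length record: a 4-byte size entry, then the payload.
void writeSizedRecord(std::ostream &out, const std::string &payload) {
    std::uint32_t size = static_cast<std::uint32_t>(payload.size());
    out.write(reinterpret_cast<const char *>(&size), sizeof size);
    out.write(payload.data(), size);
}

// Read the next record sequentially; returns false at end of data.
bool readSizedRecord(std::istream &in, std::string &payload) {
    std::uint32_t size = 0;
    if (!in.read(reinterpret_cast<char *>(&size), sizeof size)) return false;
    payload.resize(size);
    if (size == 0) return true;          // empty record, nothing more to read
    return static_cast<bool>(in.read(&payload[0], size));
}
```

Reading the size first tells the reader exactly how many bytes to consume, so the payload may contain any byte value, which a sentinel scheme cannot allow without escaping.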
Direct Access File I/O
• Direct access (a.k.a. random access) file I/O uses a seek
system call to locate a position within a direct-access (binary)
data file.
• man lseek, fseek, ftell
• Offsets are from 0, or current seek position, or end.
• Unix 3C library builds atop level 2 system calls. DBMS
may access level 2 system calls directly.
• It treats the file as an array where a seek address is an offset
into the array.
• Applications attempt to keep contiguous records in
contiguous blocks.
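A minimal sketch of treating a file as an array of fixed-size records. The Record layout and the helper names are hypothetical, and the C++ stream calls seekp and seekg stand in for lseek; a std::stringstream or a binary std::fstream can play the role of the data file.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>

struct Record {          // hypothetical fixed-width record
    int key;
    char name[12];
};

// Seek address of record number `index`: an offset into the "array".
std::streamoff recordOffset(std::size_t index) {
    return static_cast<std::streamoff>(index * sizeof(Record));
}

void writeRecordAt(std::iostream &file, std::size_t index, const Record &rec) {
    file.clear();                                     // reset any earlier EOF state
    file.seekp(recordOffset(index), std::ios::beg);   // offset from 0
    file.write(reinterpret_cast<const char *>(&rec), sizeof rec);
}

// Returns false when the requested record lies past the end of the data.
bool readRecordAt(std::iostream &file, std::size_t index, Record &rec) {
    file.clear();
    file.seekg(recordOffset(index), std::ios::beg);
    return static_cast<bool>(file.read(reinterpret_cast<char *>(&rec), sizeof rec));
}
```

Because every record has the same width, locating record i costs one multiplication and one seek, regardless of file size.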
Merge sort & Sequential File I/O
• The split phase partitions alternate runs from a queue into
two helper queues, where the initial queue is the file to be
sorted, and the helpers are temporary files.
• A run is a sorted subsequence.
• The merge phase doubles the run length by merging peer
runs from the helper queues back into the initial queue.
• Natural-length merge sort inspects the data to locate run
boundaries. It may stumble onto big runs.
• Another variant uses an internal sort such as quicksort to
build larger initial runs in memory.
• A file-based sort is an external sort.
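The split and merge phases described above can be sketched with in-memory queues. This is an illustration under stated assumptions, not the course implementation: std::deque stands in for the sequential files, the run length is fixed per pass (not natural-length), and the names splitPhase, mergePhase and mergeSort are hypothetical.

```cpp
#include <cstddef>
#include <deque>

using Queue = std::deque<int>;   // in-memory stand-in for a sequential file

// Split phase: deal alternate runs of length runLen from main to the helpers.
static void splitPhase(Queue &main, Queue &tmp0, Queue &tmp1, std::size_t runLen) {
    Queue *dest = &tmp0;
    while (!main.empty()) {
        for (std::size_t i = 0; i < runLen && !main.empty(); ++i) {
            dest->push_back(main.front());
            main.pop_front();
        }
        dest = (dest == &tmp0) ? &tmp1 : &tmp0;
    }
}

// Merge phase: merge peer runs from the helpers back into main.
static void mergePhase(Queue &main, Queue &tmp0, Queue &tmp1, std::size_t runLen) {
    while (!tmp0.empty() || !tmp1.empty()) {
        std::size_t left0 = runLen, left1 = runLen;   // records left in each peer run
        while ((left0 > 0 && !tmp0.empty()) || (left1 > 0 && !tmp1.empty())) {
            bool take0;
            if (left0 == 0 || tmp0.empty()) take0 = false;
            else if (left1 == 0 || tmp1.empty()) take0 = true;
            else take0 = tmp0.front() <= tmp1.front();   // one-record lookahead
            Queue &src = take0 ? tmp0 : tmp1;
            main.push_back(src.front());
            src.pop_front();
            --(take0 ? left0 : left1);
        }
    }
}

// Sorts main in place and returns the number of split/merge passes.
int mergeSort(Queue &main) {
    Queue tmp0, tmp1;
    int passes = 0;
    for (std::size_t runLen = 1; runLen < main.size(); runLen *= 2, ++passes) {
        splitPhase(main, tmp0, tmp1, runLen);
        mergePhase(main, tmp0, tmp1, runLen);
    }
    return passes;
}
```

Each pass touches every record once and doubles the run length, so N records need about log2(N) passes.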
April 10 example of merge sort
• End of split phase with run length = 1.
• End-of-run record is underlined.
• Destination of this phase is in bold.
main: -17 -101 5 0 18 29 62 666 1 -1
tmp0: -17 5 18 62 1
tmp1: -101 0 29 666 -1
April 10 example of merge sort
• End of split phase with run length = 1.
main: -17 -101 5 0 18 29 62 666 1 -1
tmp0: -17 5 18 62 1
tmp1: -101 0 29 666 -1
• End of merge phase grows run length to 2.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -17 5 18 62 1
tmp1: -101 0 29 666 -1
April 10 example of merge sort
• End of split phase with run length = 2.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 18 29 -1 1
tmp1: 0 5 62 666
• End of merge phase grows run length to 4.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 18 29 -1 1
tmp1: 0 5 62 666
April 10 example of merge sort
• End of split phase with run length = 4.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 0 5 -1 1
tmp1: 18 29 62 666
• End of merge phase grows run length to 8.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 0 5 -1 1
tmp1: 18 29 62 666
April 10 example of merge sort
• End of split phase with run length = 8.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 0 5 18 29 62 666
tmp1: -1 1
• End of merge phase grows run length to 16.
main: -101 -17 -1 0 1 5 18 29 62 666
tmp0: -101 -17 0 5 18 29 62 666
tmp1: -1 1
Merge sort is O(n log(n))
• Picture a merge sort as a tree growing up from
N runs of length 1 to 1 run of length N.
April 10 radix10 sort (bucket sort)
• This sort also uses sequential file I/O.
• Initial sequence of integers.
main: -17 -101 5 0 18 29 62 666 1 -1
• Sequence normalized to non-negative values
with digits to accommodate the largest.
main: 084 000 106 101 119 130 163 767 102 100
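The normalization step can be sketched as follows. This is a minimal illustration with hypothetical helper names (normalize, digitsNeeded); it shifts every value by the negated minimum, which is 101 for the example sequence, and it ignores integer overflow for extreme inputs.

```cpp
#include <algorithm>
#include <vector>

// Shift every value so the minimum becomes 0; returns the offset that was added.
int normalize(std::vector<int> &values) {
    if (values.empty()) return 0;
    int offset = -*std::min_element(values.begin(), values.end());
    for (int &v : values) v += offset;
    return offset;
}

// Number of base-10 digits needed to write the largest (non-negative) value.
int digitsNeeded(const std::vector<int> &values) {
    int largest = values.empty() ? 0 : *std::max_element(values.begin(), values.end());
    int digits = 1;
    while (largest >= 10) {
        largest /= 10;
        ++digits;
    }
    return digits;
}
```

Subtracting the returned offset after sorting restores the original values.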
Number temp queues = radix
main: 084 000 106 101 119 130 163 767 102 100
tmp0: 000 130 100
tmp1: 101
tmp2: 102
tmp3: 163
tmp4: 084
tmp5: (empty)
tmp6: 106
tmp7: 767
tmp8: (empty)
tmp9: 119
Number temp queues = radix
main: 000 130 100 101 102 163 084 106 767 119
tmp0: 000 100 101 102 106
tmp1: 119
tmp2: (empty)
tmp3: 130
tmp4: (empty)
tmp5: (empty)
tmp6: 163 767
tmp7: (empty)
tmp8: 084
tmp9: (empty)
Number temp queues = radix
main: 000 100 101 102 106 119 130 163 767 084
tmp0: 000 084
tmp1: 100 101 102 106 119 130 163
tmp2: (empty)
tmp3: (empty)
tmp4: (empty)
tmp5: (empty)
tmp6: (empty)
tmp7: 767
tmp8: (empty)
tmp9: (empty)
Number temp queues = radix
• Sort is O(N x D) for N items and D digits, but
D is a constant and does not affect growth rate.
• Radix sort is therefore O(N) on data size.
• It requires a fixed-bit-width key field.
• Fixed-width fields required for a relational DBMS.
• It has a lot of copy overhead.
• Use a radix with many bits, many (smaller) files.
main: 000 084 100 101 102 106 119 130 163 767
final: -101 -17 -1 0 1 5 18 29 62 666
Base-2^numbits code.
void radixsort(interface_queueOfInts *queueToSort, int numbits,
               interface_queueOfInts *temporaryQueues[]) {
    const int numqueues = (1 << numbits);   // one temporary queue per digit value
    const int mask = numqueues - 1;         // low numbits bits set
    const int allbits = sizeof(int) * 8;    // bits per sorted value
    for (int shifter = 0; shifter < allbits; shifter += numbits) {
        splitphase(queueToSort, temporaryQueues, shifter, mask);
        mergephase(queueToSort, temporaryQueues, numqueues);
    }
}
Base-2^numbits code.
static void splitphase(interface_queueOfInts *merger,
                       interface_queueOfInts *splitter[],
                       int shifter, int bitmask) {
    bool ignoreme;  // We only peek on queues with data, etc.
    while (merger->canPeek()) {  // while there are runs to split
        int value = merger->peek(ignoreme);
        merger->dequeue();
        int qid = (value >> shifter) & bitmask;  // current digit selects the queue
        splitter[qid]->enqueue(value);
    }
}
Base-2^numbits code.
static void mergephase(interface_queueOfInts *merger,
                       interface_queueOfInts *splitter[],
                       int numqueues) {
    bool ignoreme;
    // Drain the temporary queues in digit order back into the main queue.
    for (int qtodrain = 0; qtodrain < numqueues; qtodrain++) {
        while (splitter[qtodrain]->canPeek()) {
            merger->enqueue(splitter[qtodrain]->peek(ignoreme));
            splitter[qtodrain]->dequeue();
        }
    }
}
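The slides do not show interface_queueOfInts itself, so the following self-contained sketch substitutes a hypothetical in-memory CoreQueue whose method names mirror the calls used above (peek here takes no argument, which is an assumption). It handles non-negative values only, as in the normalized example, and folds the split and merge phases into one function.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical in-memory queue; a file-backed class could offer the same calls.
class CoreQueue {
public:
    bool canPeek() const { return !data.empty(); }
    int peek() const { return data.front(); }
    void dequeue() { data.pop_front(); }
    void enqueue(int value) { data.push_back(value); }
private:
    std::deque<int> data;
};

// Least-significant-digit radix sort of non-negative ints, radix 2^numbits.
void radixSortNonNegative(CoreQueue &queue, int numbits) {
    const unsigned numqueues = 1u << numbits;
    const unsigned mask = numqueues - 1;
    const int allbits = static_cast<int>(sizeof(int) * 8) - 1;   // skip the sign bit
    std::vector<CoreQueue> temps(numqueues);
    for (int shifter = 0; shifter < allbits; shifter += numbits) {
        while (queue.canPeek()) {                        // split phase
            int value = queue.peek();
            queue.dequeue();
            temps[(static_cast<unsigned>(value) >> shifter) & mask].enqueue(value);
        }
        for (CoreQueue &temp : temps) {                  // merge phase
            while (temp.canPeek()) {
                queue.enqueue(temp.peek());
                temp.dequeue();
            }
        }
    }
}
```

With numbits = 4 a 32-bit int takes 8 passes over 16 temporary queues; numbits = 8 halves the passes at the price of 256 queues, which is the trade-off behind "many bits, many (smaller) files."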
MultiSet is a B+-tree Map over
various storage subclasses
• DataMine/mset/MultiSet.h
Relational DBMS
• Flat, fixed width records (tuples) fit into contiguous
memory locations & file blocks.
• Take the least common multiple (LCM) of the record
size and block size, and allocate using that.
• Unix lseek and fcntl are the primary system calls.
Windows has counterparts. The low-end cylinder
allocation on the disk is managed by the operating
system.
fixed0 fixed1 fixed2 fixed2 fixed3 fixed4 fixed5 fixed6 fixed7 fixed8
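The LCM rule above can be sketched in a few lines. The function names are hypothetical, and the sizes in the usage note are example values, not figures from the course.

```cpp
#include <cstddef>

std::size_t greatestCommonDivisor(std::size_t a, std::size_t b) {
    while (b != 0) {
        std::size_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

// Smallest allocation unit that holds whole records and whole blocks:
// the least common multiple of the record size and the block size.
std::size_t allocationUnit(std::size_t recordSize, std::size_t blockSize) {
    return recordSize / greatestCommonDivisor(recordSize, blockSize) * blockSize;
}
```

For 48-byte records and 512-byte blocks the unit is 1536 bytes, which is exactly 32 records in 3 blocks, so no record straddles an allocation boundary.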
B-trees and index files
• B-trees are balanced multiway search trees; typically degree D > 2, as
determined by block size.
• How many B-tree node records fit into a disk block?
• Each node holds between D/2 and D entries.
• The root is an exception. It can hold < D/2 entries.
• In B+-trees the leaves hold the actual data, typically as seek
indices into the contiguous database file. B+-trees also link the
leaves into a sorted chain for range-based serial access.
• B+-tree interior nodes hold pointers to children.
• A B-tree grows from the bottom up whenever an insertion
causes a node to split.
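One way to answer the block-size question is to compute the largest degree whose node still fits in a block. The sketch below assumes a node layout of one parent link plus D child links and D keys, and ignores padding and alignment; the function name is hypothetical.

```cpp
#include <cstddef>

// Largest degree D such that linkSize + D * (linkSize + keySize) <= blockSize.
std::size_t degreeForBlock(std::size_t blockSize, std::size_t linkSize,
                           std::size_t keySize) {
    if (blockSize < linkSize) return 0;
    return (blockSize - linkSize) / (linkSize + keySize);
}
```

With 4096-byte blocks and 8-byte links and keys this gives D = 255, so a tree of depth 3 can address millions of records with three block reads.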
MultiSet
• MultiSet uses sets of keys or multisets of keys (duplicate keys
allowed) to map to application data elements.
• Search includes ==, <, <=, > or >= key.
• First or last key occurrence for actual multisets.
• Greatest lesser value when key is not present.
• Also supports least greater value.
• Serial linked list at leaves (B+ tree) supports duplicate
keys & iterating over results, including following
operations.
• A result is a MultiSet amenable to union, intersection
and set difference with other MultiSet objects.
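The search variants listed above can be illustrated with the standard library. This sketch uses std::multiset as a stand-in and is not the MultiSet API; the helper names greatestLesser and leastGreater are hypothetical.

```cpp
#include <iterator>
#include <set>

// Greatest key strictly less than `key`; returns false if none exists.
bool greatestLesser(const std::multiset<int> &keys, int key, int &result) {
    auto it = keys.lower_bound(key);        // first element >= key
    if (it == keys.begin()) return false;
    result = *std::prev(it);
    return true;
}

// Least key strictly greater than `key`; returns false if none exists.
bool leastGreater(const std::multiset<int> &keys, int key, int &result) {
    auto it = keys.upper_bound(key);        // first element > key
    if (it == keys.end()) return false;
    result = *it;
    return true;
}
```

lower_bound and upper_bound also delimit the first and one-past-the-last occurrence of a duplicated key, which corresponds to the first/last occurrence searches.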
MultiSet.h
typedef unsigned long location ;
const unsigned BTREEDEGREE = 16 ;
template <class KeyType>
struct treenode { // basic internal node
    location parent ; // type treenode
    location child[BTREEDEGREE] ; // treenodes or leafnodes
    KeyType key[BTREEDEGREE] ; // min keys for those children
};
template <class KeyType>
struct leafnode { // leaf connects treenodes to treeelems
    location parent ; // type treenode
    location prev, next ; // siblings or cousins
    location child[BTREEDEGREE] ; // treeelems
    KeyType key[BTREEDEGREE] ; // keys for those children
};
MultiSet.h
template <class ElementType>
struct treeelem { /* a btree leaf element */
ElementType element ; // main contents
location next ;
// next avail. if needed for a free list
};
template <class ElementType, class KeyType>
class MultiSet {
/* ElementType is the type of the set's elements. KeyType is the type,
typically part or possibly all of an ElementType object, that
constitutes a search key.
... */
Abstract location
• Location is an unsigned long that is either a
seek offset (file) or a cast of an object pointer.
• MultiSet records the depth of the tree.
• Interior nodes are type treenode.
• Leaf nodes are type leafnode.
• Leaf nodes point to treeelem application data.
• Those may be app data or may be record
indices into another file of flat data records.
Searching through the B-tree
• findleaf uses findsubtree on interior nodes.
• Uses O(log n) binary search on interior node.
• Calls findsubtree in a loop until hitting the
leaves.
• Multi-key version finds first or last instance.
• findslot returns the array index inside a node.
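A minimal sketch of the binary search inside one interior node. It assumes that key[i] is the minimum key under child i and that the first count slots are in use; the name findSlot and its signature are hypothetical, not the MultiSet code.

```cpp
#include <algorithm>
#include <cstddef>

// Index of the child whose subtree may contain `key`.
std::size_t findSlot(const int *keys, std::size_t count, int key) {
    const int *pos = std::upper_bound(keys, keys + count, key);  // first key > target
    return pos == keys ? 0 : static_cast<std::size_t>(pos - keys) - 1;
}
```

Calling this once per level, from the root down to a leaf, gives the loop described above. Using std::lower_bound instead of std::upper_bound would favor the first instance of a duplicated key over the last.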
Insertion in the B-tree
• randominsert starts at the root
• Initial element is a special case.
• Root node is always special because it may contain
fewer than degree / 2 entries.
• Otherwise, if it fits into a node, put it there.
• When the node is full, split its entries plus the new one
into two nodes of size N/2 and N/2 + 1, where N is the degree.
• Add the new node to its parent.
• If that parent node was already full, split it recursively.
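The split step can be sketched on a single node's sorted keys. This is an illustration only: it uses std::vector in place of the fixed-size key array, assumes an even degree N, and the name insertAndSplit is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Insert `key` into a full, sorted node of N keys and split the N + 1 keys
// into a left node of N/2 keys and a right node of N/2 + 1 keys.
std::pair<std::vector<int>, std::vector<int>>
insertAndSplit(std::vector<int> node, int key) {
    node.insert(std::upper_bound(node.begin(), node.end(), key), key);
    std::size_t half = node.size() / 2;
    std::vector<int> left(node.begin(), node.begin() + half);
    std::vector<int> right(node.begin() + half, node.end());
    return {left, right};
}
```

The first key of the right node is what the parent receives as the minimum key for its new child; if the parent is full, the same split is applied one level up.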
Deletion from a B-tree
• There is no problem when number of entries remains
>= N/2. Just slide entries above the deleted entry
down to cover the deleted one.
• When entries < N/2, try merging with neighbors in a
serial chain (siblings or cousins).
• If the two nodes together hold few enough entries, merge
them into one node. The other node is then empty; delete it
from its parent.
• If parent entries < N/2, perform recursive delete.
• See deleteelem and deleteleaf in MultiSet.
http://en.wikipedia.org/wiki/B-tree
• keeps keys in sorted order for sequential traversing
• uses a hierarchical index to minimize the number of
disk reads
• uses partially full blocks to speed insertions and
deletions
• keeps the index balanced with an elegant recursive
algorithm
• minimizes waste by making sure the interior nodes
are at least half full
CoreMultiSet & FileMultiSet
• storage read / write via abstract location
• readnode / readleaf / readelem
• writenode / writeleaf / writeelem
• allocnode / allocleaf / allocelem
• freenode / freeleaf / freeelem
• FileMultiSet maintains free lists of the above.
• Flat file structure in main relational file makes
storage management of free list “easy.”
Other Indexing – Skip Lists
• Skip lists support probabilistic log(N) lookup,
insertion & deletion of a key -> value mapping.
• We will review Pugh’s paper from the 1990’s.
• A skip list links each mapping in a series of key-sorted
linked lists, in which higher-order lists contain fewer
members, acting as “highways” & “boulevards” in
locating keys.
• Skip lists provide better support for concurrency than
some balanced tree algorithms by avoiding global
tree restructuring on rebalancing.
Other Indexing – Hashing
• Hashing approaches O(1) (constant-time) lookup for an
ideal hash function.
• The ideal hash function preserves all of the bits of
distinguishing information in a search key, while folding
them into fewer bits to use as a lookup index into an
array, an index file, or a flat file of fixed-width records.
• Hash tables disambiguate key collisions either by storing
colliding elements in a linked list (chained hashing) or by
rehashing to a new bucket (open address hashing).
• Hashing supports only == tests, not <, <=, >, >=.
Other Indexing – Sorting
• When a sequence of fixed-size records or indices is
sorted in a file of contiguous records, binary search is
a viable option for locating a key.
• Hashing and sorting are particularly appropriate for
indexing on-the-fly, temporary result sets to be
combined via intersection, union or set difference.
• Approximate O(1) hashing is fast when combining
result sets based only on equality.
• Sort-based search and merging are appropriate when
a query requests a result set sorted on an attribute.
Query Processing (Elmasri &
Navathe chapter 19)
• Translate SQL statements into abstract syntax tree
using basic compilation techniques.
• Interpret the abstract syntax tree.
• ORDER-BY and elimination of duplicate tuples in a
PROJECT are supported by external sort.
• Duplicate elimination can use low-level memcmp byte
comparison when comparing fixed-size records.
• If SORT is based on an index key or composite index
key, or if query does not entail PROJECT, then a
B+-tree index avoids need for a sort.
SELECT Processing
• Use indexed fields as primary search keys.
• Use index-based set intersection, union and
difference operations to support AND, OR and NOT.
• Use slow (O(n)) sequential search, or external sorting
(O(n log(n))) where appropriate (ORDER-BY or
duplicate elimination) combined with binary search.
• Use hashing for == matching.
• Use composite search key indexing or hashing.
• Utilize selectivity where possible.
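The index-based set operations for AND, OR and NOT can be sketched with the standard set algorithms. The sketch assumes each index lookup yields a sorted list of record indices; the alias RowIds and the three function names are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

using RowIds = std::vector<long>;   // sorted record indices from one index lookup

// A AND B: rows present in both lookups.
RowIds andRows(const RowIds &a, const RowIds &b) {
    RowIds out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// A OR B: rows present in either lookup.
RowIds orRows(const RowIds &a, const RowIds &b) {
    RowIds out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

// A AND NOT B: rows in the first lookup that are absent from the second.
RowIds andNotRows(const RowIds &a, const RowIds &b) {
    RowIds out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out));
    return out;
}
```

Each operation is a single linear merge over the two sorted inputs, so the base table is touched only for rows that survive the combined predicate.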
JOIN Processing
• Nested-loop is brute force approach (O(NK)).
• Single-loop when join attributes are indexed.
• Sort-merge join requires records to be sorted
on the join attributes, and then merged.
• Partition-hash join hashes smaller of two
contributing relations into chained hash table.
• Larger relation is then hashed on join
attributes to retrieve tuples from the smaller.
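A minimal in-memory sketch of the build and probe steps of a hash join. The Dept and Emp record layouts are hypothetical, std::unordered_multimap stands in for the chained hash table, and the partitioning needed when the smaller relation does not fit in memory is not shown.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Dept { int dno; std::string name; };     // hypothetical smaller relation
struct Emp { std::string ssn; int dno; };       // hypothetical larger relation

// Equi-join on dno: hash the smaller relation, then probe with the larger.
std::vector<std::pair<std::string, std::string>>
hashJoin(const std::vector<Dept> &smaller, const std::vector<Emp> &larger) {
    std::unordered_multimap<int, const Dept *> table;
    for (const Dept &d : smaller) table.emplace(d.dno, &d);       // build phase
    std::vector<std::pair<std::string, std::string>> result;
    for (const Emp &e : larger) {                                  // probe phase
        auto range = table.equal_range(e.dno);
        for (auto it = range.first; it != range.second; ++it)
            result.emplace_back(e.ssn, it->second->name);
    }
    return result;
}
```

Each relation is scanned once, so the cost is roughly O(N + K) for an equality join, compared with O(NK) for the nested loop.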
PROJECT Processing
• If projection includes a distinguishing key, the
projected tuples are already unique. Just
select a subset of the query results.
• Otherwise, DISTINCT projections require
sorting based on entire tuple to eliminate
duplicates.
• It is also possible to hash on entire tuples to
eliminate duplicates.
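Sort-based duplicate elimination can be sketched in memory as follows. The two-column Tuple type is a hypothetical projection result, and std::sort stands in for the external sort a DBMS would use on large results.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

using Tuple = std::tuple<int, int>;   // hypothetical projected tuple (Dno, Salary)

// Sort on the entire tuple so duplicates become adjacent, then drop repeats.
std::vector<Tuple> distinct(std::vector<Tuple> rows) {
    std::sort(rows.begin(), rows.end());
    rows.erase(std::unique(rows.begin(), rows.end()), rows.end());
    return rows;
}
```

The hashing alternative would insert each tuple into a std::unordered_set-style table keyed on the whole tuple and keep only first occurrences, avoiding the sort when sorted output is not needed.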
Approaches to Query Optimization
• Pipelined or stream-based processing of query
stages across multiple thread / memory units.
• Compiler optimization techniques such as
common subexpression elimination.
• Functional approaches such as lazy evaluation.
• Use meta-data and heuristics such as size of
contributing relations (smaller is faster and
may fit into memory), key distribution data.
Costs to consider
• Access cost (disk I/O), e.g., NFS vs. local disk.
• Disk storage cost for intermediate files.
• Computation cost (O(?)).
• Memory usage cost (avoid thrashing).
• Communication cost for distributed systems.
• Maintenance cost in terms of resources &
availability of the database.
Physical Database Design
• Chapter 20 in Elmasri & Navathe textbook.
• Accumulate query statistics & data mine
them.
• What attributes to index?
• When to use a clustered index on a non-key.
• Actual dataset can be organized on only 1 key.
• Hashing works well on equality-only joins.
• Dynamic hashing for volatile files.
Physical Database Techniques
• CREATE [ UNIQUE ] INDEX <index name> ON <table
name> (<column name> [ <order> ] { , <column
name> [ <order> ] } ) [ CLUSTER ] ;
• Denormalization demotes normalized tables to
weaker forms for increased speed.
• Vertical partitioning splits relations over attributes to
speed projection dynamics.
• Horizontal partitioning splits relations over indexed
tuples to speed selection dynamics.
Collect statistics
• Storage statistics include data about table spaces,
index spaces, buffer pools. DBMS may require
preallocation and sizing / tuning for clustering.
• I/O and device performance statistics include read /
write (paging) activity on disk extents, hot spots, and
thrashing for core memory.
• NFS/local/in-core, number of network interface cards,
amount of core memory, cache, memory topology.
• Query / transaction statistics help determine
attributes to index & query distributions.
Tuning queries
• Precompiled queries offer opportunities for profiling and speed
improvement.
• Avoid generating larger intermediate result sets when smaller ones are
available.
• Avoid nested queries that generate large cross-products in favor of
sequential queries. The example on p. 737 has the potential to search
all of M for each tuple from E.
Nested query:
SELECT Ssn
FROM EMPLOYEE E
WHERE Salary = (SELECT MAX(Salary)
                FROM EMPLOYEE AS M
                WHERE M.Dno = E.Dno) ;
Sequential queries:
SELECT MAX(Salary) AS High_salary, Dno INTO TEMP
FROM EMPLOYEE GROUP BY Dno ;
SELECT EMPLOYEE.Ssn FROM EMPLOYEE, TEMP
WHERE EMPLOYEE.Salary = TEMP.High_salary
AND EMPLOYEE.Dno = TEMP.Dno ;