CSC 556 – DBMS II, Spring 2013
April 10 & 17, 2013
Storage media hierarchies, external sorts, B-trees

Storage medium abstraction
• Storage medium as a subclass of an interface allows you to prototype the storage medium semi-independently from the DBMS structures atop it.

Sequential File I/O
• Queue abstraction supports sequential file I/O or core I/O via a series of enqueue, peek & dequeue calls. It supplies one-record lookahead.
• External sorts, i.e., sorts where the data do not fit into memory, rely on this approach.
• A record size entry can precede any variable-length record in the file, OR
• A sentinel value can mark a record’s end.

Direct Access File I/O
• Direct access (a.k.a. random access) file I/O uses a seek system call to locate a position within a direct-access (binary) data file.
• man lseek, fseek, ftell
• Offsets are from 0, or current seek position, or end.
• Unix 3C library functions (fseek, ftell) build atop level 2 system calls (lseek). DBMS may access level 2 system calls directly.
• It treats the file as an array where a seek address is an offset into the array.
• Applications attempt to keep contiguous records in contiguous blocks.

Merge sort & Sequential File I/O
• The split phase partitions alternate runs from a queue into two helper queues, where the initial queue is the file to be sorted, and the helpers are temporary files.
• A run is a sorted subsequence.
• The merge phase doubles the run length by merging peer runs from the helper queues back into the initial queue.
• Natural-length merge sort inspects the data to locate run boundaries. It may stumble onto big runs.
• Another variant uses an internal sort such as quicksort to build larger initial runs in memory.
• A file-based sort is an external sort.

April 10 example of merge sort
• End of split phase with run length = 1.
• End-of-run record is underlined.
• Destination of this phase is in bold.
main: -17 -101 5 0 18 29 62 666 1 -1
tmp0: -17 5 18 62 1
tmp1: -101 0 29 666 -1

• End of merge phase grows run length to 2.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -17 5 18 62 1
tmp1: -101 0 29 666 -1

• End of split phase with run length = 2.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 18 29 -1 1
tmp1: 0 5 62 666

• End of merge phase grows run length to 4.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 18 29 -1 1
tmp1: 0 5 62 666

• End of split phase with run length = 4.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 0 5 -1 1
tmp1: 18 29 62 666

• End of merge phase grows run length to 8.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 0 5 -1 1
tmp1: 18 29 62 666

• End of split phase with run length = 8.
main: -101 -17 0 5 18 29 62 666 -1 1
tmp0: -101 -17 0 5 18 29 62 666
tmp1: -1 1

• End of merge phase grows run length to 16.
main: -101 -17 -1 0 1 5 18 29 62 666
tmp0: -101 -17 0 5 18 29 62 666
tmp1: -1 1

Merge sort is O(n log(n))
• Picture a merge sort as a tree growing up from N runs of length 1 to 1 run of length N.
• The tree has about log2(N) levels, and each level's split and merge pass moves all N records.

April 10 radix10 sort (bucket sort)
• This sort also uses sequential file I/O.
• Initial sequence of integers.
main: -17 -101 5 0 18 29 62 666 1 -1
• Sequence normalized to non-negative values (here by adding 101) with enough digits to accommodate the largest.
main: 084 000 106 101 119 130 163 767 102 100

Number temp queues = radix
• Pass 1: main split on the ones digit.
main: 084 000 106 101 119 130 163 767 102 100
tmp0: 000 130 100
tmp1: 101
tmp2: 102
tmp3: 163
tmp4: 084
tmp5: (empty)
tmp6: 106
tmp7: 767
tmp8: (empty)
tmp9: 119

• Pass 2: main (merged from pass 1) split on the tens digit.
main: 000 130 100 101 102 163 084 106 767 119
tmp0: 000 100 101 102 106
tmp1: 119
tmp2: (empty)
tmp3: 130
tmp4: (empty)
tmp5: (empty)
tmp6: 163 767
tmp7: (empty)
tmp8: 084
tmp9: (empty)

• Pass 3: main (merged from pass 2) split on the hundreds digit.
main: 000 100 101 102 106 119 130 163 767 084
tmp0: 000 084
tmp1: 100 101 102 106 119 130 163
tmp2 through tmp6: (empty)
tmp7: 767
tmp8, tmp9: (empty)

• Sort is O(N x D) for N items and D digits, but D is a constant and does not affect growth rate.
• Radix sort is therefore O(N) on data size.
• It requires a fixed-bit-width key field.
    • Fixed-width fields required for a relational DBMS.
• It has a lot of copy overhead.
    • Use a radix with many bits, many (smaller) files.
main: 000 084 100 101 102 106 119 130 163 767
final: -101 -17 -1 0 1 5 18 29 62 666

Base-2^numbits code.
    void radixsort(interface_queueOfInts *queueToSort, int numbits,
            interface_queueOfInts * temporaryQueues[]) {
        const int numqueues = (1 << numbits);   // one temporary queue per digit value
        const int mask = numqueues - 1 ;        // low numbits bits set
        const int allbits = sizeof(int) * 8 ;   // bits per sorted value
        for (int shifter = 0 ; shifter < allbits ; shifter += numbits) {
            splitphase(queueToSort, temporaryQueues, shifter, mask);
            mergephase(queueToSort, temporaryQueues, numqueues);
        }
    }

    static void splitphase(interface_queueOfInts *merger,
            interface_queueOfInts * splitter[], int shifter, int bitmask) {
        bool ignoreme ; // We only peek on queues with data, etc.
        while (merger->canPeek()) { // while there are runs to split
            int value = merger->peek(ignoreme);
            merger->dequeue();
            int qid = (value >> shifter) & bitmask ; // current digit selects the queue
            splitter[qid]->enqueue(value);
        }
    }

    static void mergephase(interface_queueOfInts *merger,
            interface_queueOfInts * splitter[], int numqueues) {
        bool ignoreme ;
        // Drain the temporary queues in ascending digit order.
        for (int qtodrain = 0 ; qtodrain < numqueues ; qtodrain++) {
            while (splitter[qtodrain]->canPeek()) {
                merger->enqueue(splitter[qtodrain]->peek(ignoreme));
                splitter[qtodrain]->dequeue();
            }
        }
    }

MultiSet is a B+-tree Map over various storage subclasses
• DataMine/mset/MultiSet.h

Relational DBMS
• Flat, fixed-width records (tuples) fit into contiguous memory locations & file blocks.
• Take the least common multiple (LCM) of the record size and block size, and allocate using that. For example, 96-byte records and 4096-byte blocks give an LCM of 12288 bytes, which is exactly 3 blocks and 128 records.
• Unix lseek and fcntl are the primary system calls. Windows has counterparts. The low-end cylinder allocation on the disk is managed by the operating system.
[Figure: a row of contiguous fixed-width records, fixed0 through fixed8]

B-trees and index files
• B-trees are balanced multiway search trees; typically degree D > 2, as determined by block size.
• How many B-tree node records fit into a disk block?
• Each node holds between D/2 and D entries.
• The root is an exception. It can hold < D/2 entries.
• In B+-trees the leaves hold the actual data, typically as seek indices into the contiguous database file. B+-trees also link the leaves into a sorted chain for range-based serial access.
• B+-tree interior nodes hold pointers to children.
• A B-tree grows from the bottom up whenever an insertion causes a node to split.

MultiSet
• MultiSet uses sets of keys or multisets of keys (duplicate keys allowed) to map to application data elements.
• Search includes ==, <, <=, > or >= key.
• First or last key occurrence for actual multisets.
• Greatest lesser value when key is not present.
• Also supports least greater value.
• Serial linked list at leaves (B+-tree) supports duplicate keys & iterating over results, including the following operations.
• A result is a MultiSet amenable to union, intersection and set difference with other MultiSet objects.
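
The search operations listed for MultiSet (first or last occurrence, greatest lesser value, least greater value) have close analogues in the C++ standard library. The sketch below uses std::multiset<int> as an in-memory stand-in. It is not the course's MultiSet class, and the names KeySet, KeyIter, greatestLesser, leastGreater, firstOccurrence, lastOccurrence and countEqual are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical stand-in for a multiset of keys (duplicates allowed).
typedef std::multiset<int> KeySet;
typedef KeySet::const_iterator KeyIter;

// Greatest key strictly less than `key`; keys.end() when none exists.
KeyIter greatestLesser(const KeySet& keys, int key) {
    KeyIter it = keys.lower_bound(key);      // first entry with *it >= key
    return it == keys.begin() ? keys.end() : std::prev(it);
}

// Least key strictly greater than `key`; keys.end() when none exists.
KeyIter leastGreater(const KeySet& keys, int key) {
    return keys.upper_bound(key);            // first entry with *it > key
}

// First occurrence of `key`; keys.end() when the key is absent.
KeyIter firstOccurrence(const KeySet& keys, int key) {
    KeyIter it = keys.lower_bound(key);
    return (it != keys.end() && *it == key) ? it : keys.end();
}

// Last occurrence of `key`; the key must be present.
KeyIter lastOccurrence(const KeySet& keys, int key) {
    KeyIter it = keys.upper_bound(key);      // one past the last duplicate
    assert(it != keys.begin() && *std::prev(it) == key);
    return std::prev(it);
}

// Number of duplicates of `key`.
std::size_t countEqual(const KeySet& keys, int key) {
    return keys.count(key);
}
```

With keys {5, 18, 18, 18, 29, 62}, greatestLesser(keys, 20) points at 18, leastGreater(keys, 18) points at 29, and countEqual(keys, 18) is 3. Like the B+-tree leaf chain, the ordered container lets a caller iterate forward from any of these positions.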

MultiSet.h
    typedef unsigned long location ;
    const unsigned BTREEDEGREE = 16 ;

    template <class KeyType> struct treenode { // basic internal node
        location parent ;               // type treenode
        location child[BTREEDEGREE] ;   // treenodes or leafnodes
        KeyType key[BTREEDEGREE] ;      // min keys for those children
    };

    template <class KeyType> struct leafnode { // leaf connects treenodes to treeelems
        location parent ;               // type treenode
        location prev, next ;           // siblings or cousins
        location child[BTREEDEGREE] ;   // treeelems
        KeyType key[BTREEDEGREE] ;      // keys for those children
    };

    template <class ElementType> struct treeelem { /* a btree leaf element */
        ElementType element ;   // main contents
        location next ;         // next avail. if needed for a free list
    };

    template <class ElementType, class KeyType> class MultiSet {
        /* ElementType is the type of the set's elements. KeyType is the type,
           typically part or possibly all of an ElementType object, that
           constitutes a search key. ... */

Abstract location
• Location is an unsigned long that is either a seek offset (file) or a cast of an object pointer.
• MultiSet records the depth of the tree.
• Interior nodes are type treenode.
• Leaf nodes are type leafnode.
• Leaf nodes point to treeelem application data.
• Those may be app data or may be record indices into another file of flat data records.

Searching through the B-tree
• findleaf uses findsubtree on interior nodes.
• Uses O(log n) binary search on interior node.
• Calls findsubtree in a loop until hitting the leaves.
• Multi-key version finds first or last instance.
• findslot returns the array index inside a node.

Insertion in the B-tree
• randominsert starts at the root.
• Initial element is a special case.
• Root node is always special because it may contain fewer than degree / 2 entries.
• Otherwise, if it fits into a node, put it there.
• When the node is full, split into two nodes of size N/2 and N/2 + 1, where N is the node degree.
• Add the new node to its parent.
• If that parent node was already full, split it recursively.

Deletion from a B-tree
• There is no problem when the number of entries remains >= N/2. Just slide entries above the deleted entry down to cover the deleted one.
• When entries < N/2, try merging with neighbors in a serial chain (siblings or cousins).
• If there are too few for that, merge into one node. The other is empty; delete it from its parent.
• If parent entries < N/2, perform recursive delete.
• See deleteelem and deleteleaf in MultiSet.

http://en.wikipedia.org/wiki/B-tree
• keeps keys in sorted order for sequential traversing
• uses a hierarchical index to minimize the number of disk reads
• uses partially full blocks to speed insertions and deletions
• keeps the index balanced with an elegant recursive algorithm
• minimizes waste by making sure the interior nodes are at least half full

CoreMultiSet & FileMultiSet
• storage read / write via abstract location
• readnode / readleaf / readelem
• writenode / writeleaf / writeelem
• allocnode / allocleaf / allocelem
• freenode / freeleaf / freeelem
• FileMultiSet maintains free lists of the above.
• Flat file structure in main relational file makes storage management of free list “easy.”

Other Indexing – Skip Lists
• Skip lists support probabilistic O(log(N)) lookup, insertion & deletion of a key -> value mapping.
• We will review Pugh’s paper from the 1990s.
• A skip list links each mapping in a series of key-sorted linked lists, in which higher-order lists contain fewer members, acting as “highways” & “boulevards” in locating keys.
• Skip lists provide better support for concurrency than some balanced tree algorithms by avoiding global tree restructuring on rebalancing.

Other Indexing – Hashing
• Hashing approaches O(1) (constant-time) lookup for an ideal hash function.
• The ideal hash function preserves all of the bits of distinguishing information in a search key, while folding them into fewer bits to use as a lookup index into an array, an index file, or a flat file of fixed-width records.
• Hash tables disambiguate key collisions either by storing colliding elements in a linked list (chained hashing) or by rehashing to a new bucket (open address hashing).
• Hashing supports only == tests, not <, <=, >, >=.

Other Indexing – Sorting
• When a sequence of fixed-size records or indices is sorted in a file of contiguous records, binary search is a viable option for locating a key.
• Hashing and sorting are particularly appropriate for indexing on-the-fly, temporary result sets to be combined via intersection, union or set difference.
• Approximate O(1) hashing is fast when combining result sets based only on equality.
• Sort-based search and merging are appropriate when a query requests a result set sorted on an attribute.

Query Processing (Elmasri & Navathe chapter 19)
• Translate SQL statements into an abstract syntax tree using basic compilation techniques.
• Interpret the abstract syntax tree.
• ORDER-BY and elimination of duplicate tuples in a PROJECT are supported by external sort.
• Duplicate elimination can use low-level memcmp byte comparison when comparing fixed-size records.
• If SORT is based on an index key or composite index key, or if the query does not entail PROJECT, then a B+-tree index avoids the need for a sort.

SELECT Processing
• Use indexed fields as primary search keys.
• Use index-based set intersection, union and difference operations to support AND, OR and NOT.
• Use slow (O(n)) sequential search, or external sorting (O(n log(n))) where appropriate (ORDER-BY or duplicate elimination) combined with binary search.
• Use hashing for == matching.
• Use composite search key indexing or hashing.
• Utilize selectivity where possible.

JOIN Processing
• Nested-loop is the brute force approach (O(NK) for relations of N and K tuples).
• Single-loop when join attributes are indexed.
• Sort-merge join requires records to be sorted on the join attributes, and then merged.
• Partition-hash join hashes the smaller of two contributing relations into a chained hash table.
• Larger relation is then hashed on join attributes to retrieve tuples from the smaller.

PROJECT Processing
• If projection includes a distinguishing key, the projected tuples are already unique. Just select a subset of the query results.
• Otherwise, DISTINCT projections require sorting based on entire tuple to eliminate duplicates.
• It is also possible to hash on entire tuples to eliminate duplicates.

Approaches to Query Optimization
• Pipelined or stream-based processing of query stages across multiple thread / memory units.
• Compiler optimization techniques such as common subexpression elimination.
• Functional approaches such as lazy evaluation.
• Use meta-data and heuristics such as size of contributing relations (smaller is faster and may fit into memory), key distribution data.

Costs to consider
• Access cost (disk I/O), e.g., NFS vs. local disk.
• Disk storage cost for intermediate files.
• Computation (O(?)) cost.
• Memory usage cost (avoid thrashing).
• Communication cost for distributed systems.
• Maintenance cost in terms of resources & availability of the database.

Physical Database Design
• Chapter 20 in Elmasri & Navathe textbook.
• Accumulate query statistics & data mine them.
• What attributes to index?
• When to use a clustered index on a non-key.
• Actual dataset can be organized on only 1 key.
• Hashing works well on equality-only joins.
• Dynamic hashing for volatile files.

Physical Database Techniques
• CREATE [ UNIQUE ] INDEX <index name>
      ON <table name> (<column name> [ <order> ] { , <column name> [ <order> ] } )
      [ CLUSTER ] ;
• Denormalization demotes normalized tables to weaker forms for increased speed.
• Vertical partitioning splits relations over attributes to speed projection dynamics.
• Horizontal partitioning splits relations over indexed tuples to speed selection dynamics.

Collect statistics
• Storage statistics include data about table spaces, index spaces, buffer pools. DBMS may require preallocation and sizing / tuning for clustering.
• I/O and device performance statistics include read / write (paging) on disk extents, hot spots, and thrashing for core memory.
• NFS/local/in-core, number of network interface cards, amount of core memory, cache, memory topology.
• Query / transaction statistics help determine attributes to index & query distributions.

Tuning queries
• Precompiled queries offer opportunities for profiling and speed improvement.
• Avoid generating larger intermediate result sets when smaller ones are available.
• Avoid nested queries that generate large cross-products in favor of sequential queries.
• The p. 737 example has the potential to search all of M for each tuple from E:

    SELECT Ssn
    FROM EMPLOYEE E
    WHERE Salary = ( SELECT MAX(Salary)
                     FROM EMPLOYEE AS M
                     WHERE M.Dno = E.Dno ) ;

• The same result as sequential queries through a temporary table:

    SELECT MAX(Salary) AS High_salary, Dno INTO TEMP
    FROM EMPLOYEE
    GROUP BY Dno ;

    SELECT EMPLOYEE.Ssn
    FROM EMPLOYEE, TEMP
    WHERE EMPLOYEE.Salary = TEMP.High_salary
      AND EMPLOYEE.Dno = TEMP.Dno ;
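
The cost difference between the nested query and the TEMP-table version can be sketched in C++ over an in-memory vector. This is only an illustration under stated assumptions: the Employee struct and the functions topPaidNested and topPaidTemp are hypothetical, integer salaries are assumed, and nothing here claims to show how a particular DBMS evaluates either query.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory stand-in for an EMPLOYEE tuple.
struct Employee {
    std::string ssn;
    int dno;
    int salary;
};

// Nested-query style: for each tuple E, rescan every tuple M. O(N^2).
std::vector<std::string> topPaidNested(const std::vector<Employee>& emp) {
    std::vector<std::string> result;
    for (const Employee& e : emp) {
        int maxSalary = e.salary;
        for (const Employee& m : emp) {       // MAX(Salary) WHERE M.Dno = E.Dno
            if (m.dno == e.dno && m.salary > maxSalary) maxSalary = m.salary;
        }
        if (e.salary == maxSalary) result.push_back(e.ssn);
    }
    return result;
}

// Sequential-query style: build TEMP once, then join against it.
// Expected O(N) with a hash table keyed on Dno.
std::vector<std::string> topPaidTemp(const std::vector<Employee>& emp) {
    std::unordered_map<int, int> temp;        // Dno -> High_salary, like TEMP
    for (const Employee& e : emp) {
        auto it = temp.find(e.dno);
        if (it == temp.end()) temp[e.dno] = e.salary;
        else if (e.salary > it->second) it->second = e.salary;
    }
    std::vector<std::string> result;
    for (const Employee& e : emp) {
        auto it = temp.find(e.dno);
        assert(it != temp.end());             // every Dno was recorded in pass 1
        if (e.salary == it->second) result.push_back(e.ssn);
    }
    return result;
}
```

Both functions return the same Ssn values in input order. The second reads the relation twice instead of once per tuple, which is the point of preferring sequential queries over a nested query.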