CS 4420 Project Option 1 Phase I Design

Clint Jed Casper & Derrick Coetzee

Our Phase I project consists of three main groups of interacting objects:

1. The front-end command language parser
2. The objects providing the block abstraction on top of files
   a. BlockFile: provides cached block-level access to a specific file
   b. Block: represents a particular block of a file and allows reading and updating it
   c. BufferManager: caches recently used blocks and flushes dirty ones to disk when they are removed from the cache
3. The objects representing the structures stored in the database
   a. Catalog: describes attributes of the database as a whole and stores metadata
   b. Relation: describes one table of data and manages its storage
   c. Attribute: describes one column of a table and its properties
   d. Index, BtreeIndex: provide efficient very-large integer-to-integer maps

[Class diagram: the Lex/Yacc command language parser invokes the StorageManager facade. The StorageManager uses the Catalog, which holds the Relations; each Relation (name, dataFile, attributes) has n Attributes (name, type), and each Attribute has 0-1 Index, implemented by BtreeIndex. The Catalog (catalogFile), each Relation, and each Index (indexFile) uses a BlockFile (filename), and the singleton BufferManager's blockCache holds the cached Blocks of all BlockFiles.]

Phase I Implementation

Scanner/Parser Subsystem

The DBMS is command line driven. The following is a small demonstration of the DBMS command line interface.

    [src]$ smalldb
    Welcome to smalldb
    > CREATE TABLE Car(pnum, Integer, clr, String);
    > PRINT CATALOG;
    Car(
        pnum:Integer
        clr:String
    )
    > PRINT BUFFERSTATS;
    Logical block access count: 0
    Physical block access count: 0
    > EXIT;

We used Lex and Yacc to generate the scanner/parser subsystem. When a SQL command is entered at the prompt, the parser extracts the necessary information and calls the corresponding method provided by the StorageManager. An excerpt from our context-free grammar demonstrates this mechanism:

    Command:
        CREATE TABLE ID4 '(' AttrList ')' ';'
            { SM->CreateTable($3.stringval, $5.attrList); }
      | CREATE INDEX ID4 ON ID4 '(' ID4 ')' OptionalNoDup ';'
            { SM->CreateIndex($3.stringval, $5.stringval, $7.stringval, $9.boolval); }
      | LOAD TABLE ID4 ID4 ';'
            { SM->LoadTable($3.stringval, $4.stringval); }

Storage Manager

The StorageManager has no low-level function of its own; it simply provides a convenient interface to the storage management subsystem of the DBMS, indicated by the large dotted box in the diagram. Some of its primary methods include:

• CreateTable

  Steps of the table creation process:
  1. Create a new Relation object.
  2. Add attributes to the Relation.
  3. Insert the Relation into the catalog.

• LoadTable

  The semantics of LoadTable are as follows. For every Relation, the DBMS creates a data file with the relative pathname ./data/relation-name.sdb, which is used to store the Relation's data (but not its metadata). If the file already exists, the DBMS instead loads the existing data it contains. Whenever a LOAD TABLE command is executed on a relation, the data in the named load file is encoded and overwrites whatever data currently exists in the .sdb file; once the data has been loaded, the load file is no longer needed.

  Steps of the table loading process (sketched in code below):
  1. Look up the Relation in the catalog and obtain its BlockFile.
  2. Free all the old Blocks in the file.
  3. Allocate a new Block.
  4. Read tuples from the load file, encode them, and write them into the Block. Write as many tuples into the Block as will fit; a tuple never spans multiple blocks. Update relation metadata, such as the number of tuples, as tuples are added.
  5. Flush the Block to disk.
  6. Repeat steps 3-5 until all the tuples have been loaded into the data file.
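To make the packing logic concrete, the following C++ sketch illustrates steps 3 through 6 against the interfaces described in this document. It is a sketch only: names such as AllocateBlock, BytesRemaining, WriteInt, SetDirty, and ReadAndEncodeTuple are hypothetical stand-ins rather than our actual method names, each encoded field is assumed to occupy 4 bytes, and pinning is omitted.

    // Illustrative sketch of the LoadTable packing loop (steps 3-6).
    // All class and method names are hypothetical stand-ins for the
    // operations described in the text; each encoded field is 4 bytes.
    #include <istream>
    #include <vector>

    void LoadTuples(Relation& rel, BlockFile& dataFile, std::istream& loadFile) {
        Block* block = dataFile.AllocateBlock();             // step 3
        std::vector<int> tuple;                              // one encoded tuple
        while (ReadAndEncodeTuple(loadFile, rel, tuple)) {   // step 4
            int tupleBytes = 4 * static_cast<int>(tuple.size());
            if (block->BytesRemaining() < tupleBytes) {      // never span blocks:
                block->Flush();                              // step 5
                block = dataFile.AllocateBlock();            // step 6: back to 3
            }
            for (int field : tuple)
                block->WriteInt(field);   // writes 4 bytes at the block cursor
            block->SetDirty();
            rel.IncrementTupleCount();    // keep relation metadata up to date
        }
        block->Flush();                   // flush the final, partially full block
    }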
• PrintTable

  Steps of the table printing process:
  1. Look up the Relation in the catalog and obtain its BlockFile.
  2. Iterate through each Block of the BlockFile.
  3. For each Block, print each tuple in the order in which the tuples occur in the block.

Catalog

The catalog is a collection of Relation objects. During start-up, the DBMS reads each of the schemas stored in the catalog file on disk into main memory. The DBMS expects to find the catalog in a file with the relative pathname ./catalog/root. For each schema, a Relation object, which stores the relation's metadata, is created and kept in the Catalog object. At this point, the data in the catalog file essentially becomes an in-memory data structure. However, any changes made to the Catalog object are still reflected on disk for persistence: whenever a table is added to the database, a new Relation object is created and inserted into the Catalog object in memory, and the schema for that relation is written to the appropriate block of the catalog file. To simplify things, the catalog uses a fixed record format to store the schemas on disk.

For the purposes of Phase I, the primary functions provided by the Catalog class are:
• Read the catalog metadata from the catalog file on disk
• Print a summary of the relations in the catalog
• Insert a new relation into the catalog and add it to the catalog file

BlockFile

A BlockFile provides an abstraction of a file as a series of blocks, as opposed to the stream of data provided by the OS. A BlockFile is primarily used to read a block from a file or to allocate a new block at the end of a file.

An important design decision for our DBMS is that all storage is managed through the same BlockFile abstraction: data files, index files, the catalog file, and so forth are all handled through the same interface. One important ramification of this decision is that the blocks of the catalog file or an index file are cached and accessed by the BufferManager in exactly the same manner as the blocks of a data file, and blocks from different files may be stored together in the global cache. Each Relation, each Index, and the Catalog has an associated BlockFile.

The functionality of a BlockFile is tightly coupled with that of the BufferManager object and of the Blocks that make up the BlockFile. A BlockFile implements automatic opening and closing of the actual disk file: the file is opened, if necessary, before any read or write request, and it is closed when the last Block of the BlockFile is ejected from the cache by the BufferManager. Given a file offset, a BlockFile retrieves the Block that corresponds to that offset and sets the cursor within the Block to the correct position. The Block that is returned may have come either from the BufferManager's cache or from a disk read. A BlockFile also exports a method to allocate a new Block at the end of a file; this causes no disk operations until the Block is flushed. In addition, it maintains static counters that track the total number of logical block accesses and physical block accesses (cache misses) performed by the system.
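To make the shape of this interface concrete, here is a minimal hypothetical C++ declaration; the names and signatures are illustrative and may differ from our actual class.

    // Hypothetical sketch of the BlockFile interface described above;
    // names and signatures are illustrative, not our actual declarations.
    #include <string>

    class Block;  // described in the Block section below

    class BlockFile {
    public:
        explicit BlockFile(const std::string& filename);

        // Returns the Block containing the given byte offset, consulting
        // the BufferManager's cache first and reading from disk on a miss,
        // and positions the Block's cursor at that offset within the block.
        Block* GetBlock(long offset);

        // Allocates a new Block at the end of the file; this causes no
        // disk operations until the Block is flushed.
        Block* AllocateBlock();

        // System-wide statistics reported by PRINT BUFFERSTATS.
        static long logicalAccesses;   // all block requests
        static long physicalAccesses;  // cache misses (actual disk reads)

    private:
        std::string filename;  // the underlying OS file is opened lazily,
                               // and closed when the file's last cached
                               // Block is ejected by the BufferManager
    };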
Block

A Block object represents an abstraction of a physical disk block. Blocks are produced by operations on a BlockFile and are cached by the BufferManager. A Block allows you to write directly into the buffer that will eventually be written to disk, and to read data from the buffer as well. Any Block that has been modified is considered "dirty," and this must be indicated by explicitly setting a dirty flag. All dirty Blocks are written to disk either when the Block is ejected from the cache by the BufferManager or when the system exits. The Block object also provides a flush method that forces the data to be written back to disk immediately. Note that we are not attempting to design a fault-tolerant DBMS: writes can be suspended indefinitely, and a crash could leave the DBMS in an inconsistent and unpredictable state.

A Block also has a simple one-bit reference counting mechanism to "pin" it in the cache while some other piece of code is using it. This prevents the Block from being ejected from the cache and destroyed while other references to it still exist.

For the purposes of Phase I, the primary methods provided by the Block class support simple stream-based reading and writing within the block: they read and write 4-byte values at a cursor position within the block, which is set initially by the BlockFile, advanced by the read and write methods, and can also be set manually.

BufferManager

The BufferManager manages the caching of Blocks. It uses a Least Recently Used (LRU) policy to determine which Block must be ejected when a new Block is inserted and the cache is already full. For any dirty Block selected for ejection from the cache, the BufferManager makes sure that the Block is flushed back to disk before discarding it. The BufferManager is never accessed directly outside of the block abstraction subsystem; instead, it communicates only with BlockFile objects during their read and write operations.

To implement these policies, the BufferManager keeps a simple list of Blocks ordered by the time at which they were last read; Blocks are ejected from the end of this list when it grows too large. For the purposes of Phase I, the primary functions provided by the BufferManager class are:
• Get a block from the cache, if it is present, given its file and block number. This also entails marking the block as the most recently used one.
• Insert a new block into the most recently used position in the cache.
• Eject the least recently used block from the cache, flushing it if it is dirty.
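A self-contained C++ sketch of this LRU bookkeeping follows, using the standard list-plus-map arrangement keyed by a (file id, block number) pair. It is illustrative only: the real BufferManager caches Block objects rather than integers, and must also honor pinning and flush dirty blocks, both omitted here.

    // Self-contained sketch of the LRU policy described above. Pinning
    // and dirty-block flushing are omitted for brevity.
    #include <cstddef>
    #include <list>
    #include <map>
    #include <utility>

    class LruCacheSketch {
    public:
        explicit LruCacheSketch(std::size_t capacity) : capacity(capacity) {}

        // Returns the cached value, or -1 if absent; a hit moves the
        // entry to the most recently used (front) position.
        int Get(int fileId, int blockNum) {
            auto it = index.find({fileId, blockNum});
            if (it == index.end()) return -1;
            order.splice(order.begin(), order, it->second);  // mark as MRU
            return it->second->value;
        }

        // Inserts at the MRU position, ejecting the LRU entry if the
        // cache is full. (Assumes the key is not already cached.)
        void Put(int fileId, int blockNum, int value) {
            if (order.size() >= capacity) {
                Entry& lru = order.back();  // a real BufferManager would
                index.erase({lru.fileId, lru.blockNum});  // flush if dirty
                order.pop_back();
            }
            order.push_front(Entry{fileId, blockNum, value});
            index[{fileId, blockNum}] = order.begin();
        }

    private:
        struct Entry { int fileId, blockNum, value; };
        std::size_t capacity;
        std::list<Entry> order;  // front = most recently used
        std::map<std::pair<int, int>, std::list<Entry>::iterator> index;
    };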
Relation

Relation is a wrapper class for all of the metadata that describes the schema of a particular relation. Beyond holding that metadata, its principal purpose is to write and read the schema metadata to and from a Block. Each Relation keeps an ordered list of Attributes. Relations are also used in displaying table information.

Attribute

The Attribute object describes the metadata for an attribute of a particular relation, including its name, its type, and the index associated with it, if any. Attributes are important participants in the creation of indexes, and are also used in displaying table information.

Index and BtreeIndex

The Index class provides the abstract interface for an index mechanism, used to locate a record given the value of one of its attributes. The three key functions are:
• Create an index
• Insert a (key, value) pair
• Given a key, look up its value, or else determine that the key is not present

Notice that there is no means of removing or modifying keys. Because our database is read-only, these are not required; they would, however, be straightforward extensions, analogous to insertion.

Because all attributes are limited to 4 bytes, we limited keys to 32-bit integers. Because we can locate a record in a given file based solely on its 32-bit file offset, and indeed we are limited to such offsets by the library, we also limited values to 32-bit integers. The problem is thus reduced to constructing very large maps from integers to integers. We used a B+-tree for our implementation, for the experience, although for a read-only database such as this one a multilevel sparse index would be more appropriate.

Our B+-tree implementation is not a completely standard one, however. First, we always store all keys in the tree, even when the key is the primary key. This decreases coupling: the index does not need to know how records are formatted in a block, nor does the relation need to perform the indexing function of scanning through a block for a matching record. Since any relation with at least two attributes has at most about 60 records per block, and in our B+-tree n is about 70, this will not add more than one additional level to the tree.

Second, we use different n values for leaf nodes (currently 62) and internal nodes (currently 84). The goal is to make each tree node occupy almost exactly one block. This simplifies reading and writing tree nodes: we simply read and write the corresponding blocks. These n values are also large enough to ensure a very flat tree. No node can be more than half empty.

Our insertion algorithm is also fundamentally different from the usual one. As described by Sedgewick in his Algorithms in C++, our insertion is top-down: instead of splitting nodes as late as possible, it splits any full node it visits while traversing the tree. With this greedy approach, every split is a constant-time operation, since the algorithm can assume there is room in the parent node. We chose the approach for its simplicity, but it also has the advantage of making worst-case behavior less likely, by splitting full nodes sooner than they would be otherwise.

Creating a new index is simple, involving only the creation of a BlockFile for the index file and a block corresponding to an empty root node. The root node is always the first block in the file, which eliminates the need to store metadata describing its position. Because the block cache is naïve LRU, we also make the optimization of keeping the root node in memory at all times.

Both insertion and lookup involve a common process of tree traversal, which examines first the root node, finds the correct pointer to follow, reads that block, and then repeats with the next block until a leaf node is reached. If things were simpler, we could then just examine this leaf node, either returning the value if the key is found (for lookup) or inserting the new pair (for insertion).

Insertion is complicated by the possibility that the leaf node is full. To avert this, we split any full node visited on the way down, in the usual way. If a new root node must be created, the old root node must be copied to another place in the file; this is safe because no other nodes contain pointers to the root node. A new root node, with one key and two pointers, is then placed in the first block.

Lookup, on the other hand, is complicated by the fact that the index must work for keys with duplicates. To solve this, the lookup function does not return a single value; instead, it finds the leaf node containing the first value for the key and returns a simple iterator object whose implementation is tied to the particular type of index. The iterator keeps a handle on the leaf node containing the values for the target key and yields a new value each time one is requested, possibly following block pointers to the next leaf node, until all the values for that key are exhausted. Destroying the iterator then releases the reference on the leaf node.
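A hypothetical usage sketch of this iterator follows; the names Lookup, Next, and PrintTupleAt are illustrative stand-ins, not our actual API.

    // Hypothetical sketch of duplicate-aware lookup via the iterator
    // described above; all names are illustrative, not our actual API.
    void PrintMatches(Index& index, BlockFile& dataFile, int key) {
        IndexIterator* it = index.Lookup(key);  // pins the first matching leaf
        int offset;
        while (it->Next(&offset)) {   // may follow pointers to the next leaf
            // 'offset' is the 32-bit file offset of one matching record.
            Block* b = dataFile.GetBlock(offset);
            PrintTupleAt(b);          // hypothetical helper: decode and print
        }
        delete it;                    // releases the reference on the leaf node
    }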
Conclusion

Our Phase I system provides all of the capabilities needed to perform the critical functions of Phase II. We can load large files of test data into the system while collecting statistics about them, and the Index and Relation objects allow easy scanning through records and searching for them; these are the main capabilities required by the query optimizer. The loosely coupled, extensible object structure will let us add the functionality we need while limiting complexity. We therefore feel that our existing system is powerful enough to allow us to move into the next phase.