CS 4420 Project Option 1
Phase I Design
Clint Jed Casper & Derrick Coetzee
Our Phase I project consists of three main groups of interacting objects:
1. The front-end command language parser
2. The objects providing the block abstraction on top of files
a. BlockFile: provides cached block-level access to a specific file
b. Block: represents a particular block of a file, allows reading/updating
c. BufferManager: caches recently used blocks and flushes dirty ones to
disk when they are removed from the cache
3. The objects representing the structures stored in the database
a. Catalog: describes attributes of the database as a whole, stores metadata
b. Relation: describes one table of data and manages its storage
c. Attribute: describes one column of a table and its properties
d. Index, BtreeIndex: provide efficient very-large integer-to-integer maps
[Class diagram: the Lex/Yacc command language parser drives the StorageManager facade. A large dotted box encloses the storage management subsystem: the Catalog (catalogFile) holds n Relations; each Relation (name, dataFile) holds n Attributes (name, type); each Attribute has 0-1 Index, implemented by BtreeIndex (indexFile). The Catalog, each Relation, and each Index use a BlockFile (filename), whose cached Blocks are managed by the singleton BufferManager (blockCache).]
Phase I Implementation
Scanner/Parser Subsystem
The DBMS is command line driven. The following is a small demonstration of the
DBMS command line interface.
[src]$ smalldb
Welcome to smalldb
> CREATE TABLE Car(pnum, Integer, clr, String);
> PRINT CATALOG;
Car( pnum:Integer clr:String )
> PRINT BUFFERSTATS;
Logical block access count: 0
Physical block access count: 0
> EXIT;
We utilized Lex and Yacc to generate the scanner/parser subsystem. When a SQL
command is entered at the prompt, the parser extracts the necessary information and calls
the corresponding method provided by the StorageManager. An example excerpt from
our context-free grammar demonstrates this mechanism.
Command: CREATE TABLE ID4 '(' AttrList ')' ';'
{ SM->CreateTable($3.stringval, $5.attrList); }
| CREATE INDEX ID4 ON ID4 '(' ID4 ')' OptionalNoDup ';'
{ SM->CreateIndex($3.stringval, $5.stringval, $7.stringval, $9.boolval); }
| LOAD TABLE ID4 ID4 ';'
{ SM->LoadTable($3.stringval, $4.stringval); }
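The $n.stringval, $n.attrList, and $n.boolval references in these actions imply a semantic-value type shared between the scanner and the parser. A minimal sketch of what that type might look like (the field types are our assumption, not the project's exact declarations):

    // Sketch of the Yacc semantic value implied by the grammar actions
    // above; the field types are assumptions.
    struct AttrList;                 // built up by the AttrList production

    struct SemanticValue {
        char     *stringval;         // identifiers (ID4) and file names
        AttrList *attrList;          // attribute name/type pairs
        bool      boolval;           // e.g. the OptionalNoDup flag
    };
    #define YYSTYPE SemanticValue    // tell Yacc to use this type for $n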
Storage Manager
The StorageManager has no low-level function of its own; it simply provides a convenient interface to the storage management subsystem of the DBMS, indicated by the large dotted box in the diagram. Some of its primary methods include the following (a sketch of the facade interface appears after the method descriptions):
• CreateTable
Steps of the table creation process include:
1. Create a new Relation object
2. Add attributes to the Relation
3. Insert the Relation into the catalog
• LoadTable
The semantics of LoadTable follow. For every Relation the DBMS will create a data file with the relative pathname ./data/relation-name.sdb, which will be used to store the Relation's data (but not its metadata). If the file already exists, the DBMS will instead load the existing data it contains. Whenever a LOAD TABLE command is executed on a relation, the file from which the data is loaded will be encoded and used to overwrite any data that currently exists in the .sdb file. Once the data has been loaded, the file it was loaded from is no longer needed. Steps of the table loading process include:
1. Look up the Relation in the catalog and obtain its BlockFile.
2. Free all the old Blocks in the file.
3. Allocate a new Block.
4. Read tuples from the load file, encode them, and write them into the Block. Write as many tuples into the Block as possible; tuples never span multiple blocks. Also update relation metadata, such as the number of tuples, as tuples are added.
5. Flush the Block to disk.
6. Repeat steps 3-5 until all the tuples have been loaded into the data file.
• PrintTable
Steps of the table printing process include:
1. Look up the Relation in the catalog and obtain its BlockFile.
2. Iterate through each Block of the BlockFile.
3. For each Block, print the tuples in the order they occur in the block.
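As promised above, a sketch of the facade interface; the signatures are our reconstruction from the grammar actions and the method descriptions, not the project's exact declarations:

    // Sketch of the StorageManager facade; signatures are assumptions.
    struct AttrList;                 // attribute name/type pairs from the parser

    class StorageManager {
    public:
        // CREATE TABLE: build a Relation, add its Attributes, and
        // insert the Relation into the catalog.
        void CreateTable(const char *name, AttrList *attrs);

        // CREATE INDEX: attach a (BtreeIndex) index to one attribute
        // of a relation, optionally disallowing duplicate keys.
        void CreateIndex(const char *indexName, const char *table,
                         const char *attribute, bool noDuplicates);

        // LOAD TABLE: encode tuples from loadFile into
        // ./data/<table>.sdb, overwriting any data already there.
        void LoadTable(const char *table, const char *loadFile);

        // PRINT: walk the relation's BlockFile block by block.
        void PrintTable(const char *table);
    };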
Catalog
The catalog is a collection of Relation objects. During start-up, the DBMS reads each of the schemas that are stored in the catalog file on disk into main memory. The DBMS expects to find the catalog in a file having the relative pathname ./catalog/root. For each schema, a Relation object, which stores the relation's metadata, is created and kept in the Catalog object. At this point, the data in the catalog file essentially becomes an in-memory data structure. However, any changes made to the Catalog object are still reflected on disk for the purposes of persistence.
Whenever a table is added to the database, a new Relation object is created and inserted into the Catalog object in memory, and the schema for that relation is written to the appropriate block of the catalog file. To simplify things, the catalog uses a fixed record format to store the schemas on disk (a sketch of one possible record layout follows the list below). For the purposes of Phase I, the primary functions provided by the Catalog class are:
• Read the catalog metadata from the catalog file on disk
• Print a summary of the relations in the catalog
• Insert a new relation into the catalog and add it to the catalog file
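Because the record format is fixed, each schema can be read or written as a single fixed-width structure; a sketch of one possible layout follows (every field width here is an assumption, chosen only to illustrate the fixed-record idea):

    // One plausible fixed-width catalog record; all widths are assumptions.
    const int kMaxName  = 16;    // fixed space for a name
    const int kMaxAttrs = 8;     // fixed number of attribute slots

    struct CatalogAttr {
        char name[kMaxName];
        int  type;               // e.g. 0 = Integer, 1 = String
        int  hasIndex;           // whether the 0-1 associated Index exists
    };

    struct CatalogRecord {
        char        relationName[kMaxName];
        int         numAttributes;
        int         numTuples;   // relation metadata
        CatalogAttr attrs[kMaxAttrs];
    };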
BlockFile
A BlockFile provides an abstraction that allows a file to be viewed as a series of blocks, as opposed to the stream-of-bytes view provided by the OS. A BlockFile is primarily used to read a block from a file or to allocate a new block at the end of a file.
An important design decision for our DBMS is that all storage is managed through the
same BlockFile abstraction. This means that data files, index files, the catalog file, and
so forth are all handled through the same interface. One important ramification of this
design decision is that the blocks of the catalog file or an index file are cached and
accessed in the BufferManager in the exact same manner as the blocks of a data file, and
blocks from different files may be stored together in the global cache.
Each Relation, Index, and the Catalog has an associated BlockFile. The functionality of a
BlockFile is tightly coupled with that of the BufferManager object and of the Blocks that
make up a given BlockFile. A BlockFile implements automatic opening and closing of
the actual disk file. The file is opened, if necessary, before any read or write request and
it is closed when the last Block of the BlockFile is ejected from the cache by the
BufferManager.
Given a file offset, a BlockFile will retrieve the Block that corresponds to that
offset and set the cursor, within the Block, to the correct value. The Block that is
returned may have come from either the BufferManager’s cache or a read from the disk.
It also exports a method to allocate a new Block at the end of a file; this causes no disk
operations until it is flushed. In addition, it maintains static counters to keep statistics on
the total number of logical block accesses and physical block accesses (cache misses)
performed by the system.
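A sketch of the BlockFile interface just described; the names and signatures are our reconstruction:

    // Sketch of the BlockFile interface; names are assumptions.
    class Block;

    class BlockFile {
    public:
        // Retrieve the Block containing the given 32-bit file offset,
        // from the BufferManager's cache if possible and from disk
        // otherwise, with the Block's cursor set to match the offset.
        // Opens the underlying OS file first if necessary.
        Block *GetBlock(int offset);

        // Allocate a new Block at the end of the file; this causes no
        // disk operations until the Block is flushed.
        Block *AllocateBlock();

        // Statistics shared by all BlockFiles: every access counts as
        // a logical access; cache misses also count as physical ones.
        static int logicalAccesses;
        static int physicalAccesses;
    };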
Block
A Block object represents an abstraction for a physical disk block. Blocks are produced
by operations on a BlockFile, and are cached by BufferManager. A Block allows you to
write directly into the buffer that will eventually be written to disk, and you can read data
from the buffer as well. Any Block that has been modified is considered “dirty,” and this
must be indicated by explicitly setting a dirty flag. All dirty Blocks are written to disk
when either the Block is ejected from the cache by the BufferManager or when the
system is exited. The Block object provides a flush method that forces the data to be
written back to disk immediately.
Note that we are not attempting to design a fault-tolerant DBMS. Therefore, writes can
be indefinitely suspended, and if a crash were to occur then it could result in the DBMS
being in an inconsistent and unpredictable state. A Block also has a simple one-bit
reference counting mechanism to “pin” it in the cache while some other piece of the code
is using it. This prevents it from being ejected from the cache and destroyed while other
references to the Block still exist.
For the purposes of Phase I, the primary methods provided by the Block class support simple stream-based reading and writing within the block: methods are provided to read and write 4-byte values at a cursor position within the block, which is set initially by the BlockFile, advanced by the read and write methods, and can be set manually as well.
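A sketch of these methods (the names are assumptions):

    // Sketch of the Block methods described above; names are assumptions.
    class Block {
    public:
        // Cursor-based access to 4-byte values; both calls advance the
        // cursor, which is set initially by the BlockFile.
        int  ReadInt();
        void WriteInt(int value);   // caller must also mark the block dirty

        void Seek(int cursor);      // the cursor can also be set manually

        void SetDirty();            // written back on ejection or exit
        void Flush();               // force an immediate write to disk

        // One-bit reference count: a pinned Block cannot be ejected.
        void Pin();
        void Unpin();
    };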
BufferManager
The BufferManager manages the caching of Blocks. It uses a Least Recently Used policy
to determine which block must be ejected when a new Block is inserted and the cache is
already full. For any dirty Block that is selected for ejection from the cache, the
BufferManager will make sure that the Block is flushed back to disk before discarding it.
The BufferManager is never accessed directly outside of the block abstraction subsystem;
instead, it communicates only with BlockFile objects during their read and write
operations.
To implement these policies, the BufferManager keeps a simple list of blocks, ordered by the time at which they were last accessed; when the cache grows too large, blocks are ejected from the least recently used end of this list. For the purposes of Phase I, the primary functions provided by the BufferManager class are the following (a sketch of the bookkeeping appears after the list):
• Get a block from the cache, if it is present, given its file and block number. This also entails marking the block as the most recently used one.
• Insert a new block into the most recently used position in the cache.
• Eject the least recently used block from the cache, flushing it if it is dirty.
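A minimal sketch of the LRU bookkeeping, using std::list to stand in for whatever list structure the implementation actually uses; the helper functions are hypothetical:

    #include <list>

    class Block;
    class BlockFile;

    // Hypothetical helpers standing in for the real cache operations.
    bool Matches(Block *b, BlockFile *file, long blockNum);
    void FlushIfDirty(Block *b);

    class BufferManager {
        std::list<Block *> lru;      // front = most recently used
        unsigned capacity;
    public:
        BufferManager(unsigned cap) : capacity(cap) {}

        // Get a block from the cache, if present, marking it most
        // recently used by splicing it to the front of the list.
        Block *Get(BlockFile *file, long blockNum) {
            for (std::list<Block *>::iterator it = lru.begin();
                 it != lru.end(); ++it)
                if (Matches(*it, file, blockNum)) {
                    lru.splice(lru.begin(), lru, it);
                    return lru.front();
                }
            return 0;                // miss: the BlockFile reads from disk
        }

        // Insert a new block at the MRU position, ejecting (and
        // flushing, if dirty) the LRU block when full. A real version
        // would also skip pinned blocks when choosing a victim.
        void Insert(Block *b) {
            if (lru.size() >= capacity) {
                FlushIfDirty(lru.back());
                lru.pop_back();
            }
            lru.push_front(b);
        }
    };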
Relation
This is a wrapper class for all of the metadata that describes the schema for a particular relation. Beyond holding this metadata, its principal purpose is to write and read the schema's metadata to and from a Block. Relations keep track of an ordered list of Attributes. Relations are also used in displaying table information.
Attribute
The Attribute object describes the metadata for an attribute in a particular relation,
including its name, type, and the index associated with it, if any. Attributes are important
participants in the creation of indexes, and are also used in displaying table information.
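A sketch of the two metadata wrappers side by side (field names are assumptions):

    #include <string>
    #include <vector>

    class Index;
    class BlockFile;

    struct Attribute {
        std::string name;
        int         type;      // Integer or String
        Index      *index;     // the 0-1 associated index; null if none
    };

    struct Relation {
        std::string            name;
        BlockFile             *dataFile;    // ./data/<name>.sdb
        std::vector<Attribute> attributes;  // ordered list of columns
        int                    numTuples;   // updated during LOAD TABLE
    };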
Index and BtreeIndex
The Index class provides the abstract interface for an index mechanism, used to locate a
record given the value of one of its attributes. The three key functions are:
• Create an index
• Insert a (key, value) pair
• Given a key, look up its value, or else determine that the key is not present
Notice that there is no means of removing or modifying keys. Because our database is read-only, these are not required; however, they would be straightforward extensions, analogous to insertion.
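A sketch of the abstract interface (names are assumptions; Lookup returns an iterator in anticipation of the duplicate-key handling described below):

    // Sketch of the abstract Index interface; since keys and values
    // are both 32-bit integers, the interface is an int-to-int map.
    class IndexIterator;   // yields successive values for one key

    class Index {
    public:
        virtual ~Index() {}

        // Insert a (key, value) pair; removal and modification are
        // not supported, since the database is read-only.
        virtual void Insert(int key, int value) = 0;

        // Find the values for a key, or determine that it is absent
        // (a null return); an iterator is returned because keys may
        // have duplicates, as described later.
        virtual IndexIterator *Lookup(int key) = 0;
    };

    // Creating an index constructs the concrete implementation, which
    // sets up the index BlockFile and an empty root node.
    class BtreeIndex : public Index { /* B+-tree implementation */ };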
Because all attributes are limited to 4 bytes, we limited keys to 32-bit integers. Because we can locate a record in a given file based solely on its 32-bit file offset, and indeed we are limited to such offsets by the library, we also limited the values to 32-bit integers. Thus the problem is reduced to one of constructing very large maps from integers to integers. We used a B+-tree for our implementation for the experience, although for a read-only database such as this one a multilevel sparse index would be more appropriate. The B+-tree implementation is not a completely standard one, however.
For one thing, we always store all keys in the tree, even when the key is the primary key. This decreases coupling: the index does not need to know how records are formatted in a block, and the relation does not need to perform the index's work of scanning through a block for a matching record. Also, since any relation with at least two attributes has at most about 60 records per block, and in our B+-tree n is about 70, this won't add more than one additional level to the tree.
For another, we use different n values for leaf nodes (currently 62) and internal nodes (currently 84). The goal is to make each tree node occupy almost exactly one block. This simplifies reading and writing tree nodes; we simply read and write the corresponding blocks. These n values are also large enough to ensure a very flat tree. No node can be more than half empty.
Our insertion algorithm is also fundamentally different from the usual one. Following the approach described by Sedgewick in his Algorithms in C++, our insertion is top-down: instead of splitting nodes as late as possible, it splits any full node it visits while traversing the tree. With this greedy approach, every split is a constant-time operation, since it can assume there is room in the parent node. The approach was chosen for its simplicity, but it also has the advantage of making worst-case behavior less likely, by splitting full nodes sooner than they would be otherwise.
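A pseudocode-level sketch of the descent; Node and the Split helper are hypothetical stand-ins for the real block-based structures:

    struct Node {
        bool  IsFull() const;
        bool  IsLeaf() const;
        Node *ChildFor(int key);             // follow the correct pointer
        void  InsertEntry(int key, int value);
    };

    // Split a full node, promoting its middle key into parent (which
    // is guaranteed to have room, having been split on the way down if
    // it was full), and return the half that should receive key. A
    // null parent means the node is the root: the old root is copied
    // to another block, and a fresh root with one key and two pointers
    // is written into the first block of the file.
    Node *Split(Node *parent, Node *node, int key);

    void TopDownInsert(Node *root, int key, int value) {
        Node *parent = 0;
        Node *node = root;
        while (true) {
            if (node->IsFull())
                node = Split(parent, node, key); // split on the way down
            if (node->IsLeaf())
                break;
            parent = node;
            node = node->ChildFor(key);
        }
        node->InsertEntry(key, value);           // the leaf now has room
    }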
Creating a new index is simple, involving only the creation of a BlockFile for the index
file and a block corresponding to an empty root node; the root node is always the first
block in the file, which eliminates the need for storing metadata describing its position.
Because the block cache is naïve LRU, we also make the optimization of storing the root
node in memory at all times.
Both insertion and lookup involve a common process of tree traversal, which examines first the root node, finds the correct pointer to follow, reads that block, and then repeats this with the next node, until a leaf node is reached. If things were simpler, we could then just examine this leaf node, either returning the value if the key is found (for lookup) or inserting the new pair (for insertion).
Insertion is complicated by the possibility that the leaf node is full. To avert this, we
perform a split of any full node visited on the way down in the usual way. If a new root
node must be created, the old root node must be copied to another place in the file. This is
safe, because no other nodes contain pointers to the root node. A new root node is then
placed in the first block with one key and two pointers.
Lookup, on the other hand, is complicated by the fact that the index needs to work for keys with duplicates. To solve this, the lookup function does not return a single value; instead it finds the leaf node containing the first value for that key and returns a simple iterator object whose implementation is tied to the particular type of index. The iterator keeps a handle on the leaf node containing the values of the target key and yields a new value each time one is demanded, possibly following block pointers to the next leaf node, until all the values for that key are exhausted. Destroying the iterator then releases the reference on the leaf node.
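A sketch of the iterator protocol and a hypothetical consumer; the type and method names are assumptions:

    #include <cstdio>

    class IndexIterator {
    public:
        // Store the next value for the target key and return true,
        // following block pointers to the next leaf node as needed;
        // return false once all values for the key are exhausted.
        bool Next(int &value);

        ~IndexIterator();   // releases the reference on the leaf node
    };

    // Hypothetical usage: print the file offset of every matching record.
    void PrintMatches(IndexIterator *it) {
        int offset;
        while (it->Next(offset))
            std::printf("record at offset %d\n", offset);
        delete it;          // unpins the leaf block
    }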
Conclusion
Our Phase I system provides all of the capabilities that will be needed to perform the critical functions of Phase II. We are able to load large files of test data into the system while collecting statistics about them, and the Index and Relation objects allow easy scanning through and searching for records; these are the main capabilities required by the query optimizer. The loosely coupled, extensible object structure will allow us to easily add the functionality we need while limiting complexity. Thus we feel our existing system is powerful enough to allow us to move into the next phase.