ppt 1.7MB

advertisement
For 2004 ACM Honored Class
Feng Qian April 3, 2007



Introduction & Implementation in C++ (钱风)
Implementation in Java (周牧星)
Schedule and grading policy (马融)





Objective: Build an Database Management
System (DBMS) in C++/Java
A team project, hard but fun engineering
work
No official base code is given, you are
encouraged to work from zero
No strict technical restrictions, much more
flexible than previous projects (Compiler / OS)
TAs: 马融、周牧星、曲文涛、钱风




A Profound understanding of modern DBMS
A Proficient programming ability for large
engineering projects
A Perfect training for debugging skill for
thousands lines of codes
A Practical simulation of teamwork
environment in industrial zone
 No pains, No gains 









Storage
Buffer
Heap File
Index
SQL Parser
SQL Engine
Concurrency Control
Database Manager
Interface
Bottom Layer
Top Layer



Arrange data on disk, in the form of Page
Similar to Virtual Memory in OS
High level layers only knows a Page ID as an
entry point for access of the data

My Implementation
 Virtual address space: 4GB
 Self-grow page files: 4MB each
 An advanced bitmap: segment tree
 Main interfaces
(All deal with file operations on disk)
▪ CreateDB, OpenDB, CloseDB, DestroyDB
▪ AllocatePage, DeallocatePage
▪ WritePage, ReadPage

A traditional cache strategy, store frequently
accessed pages in memory (instead of in the
disk storage system)

My implementation
 Store 2048 frequently used pages
 2nd-chance replacement algorithm
 Main interfaces:
▪ AllocatePage, DeallocatePage
▪ GetPage (Handle page missing)
▪ ReleasePage
▪ PinPage, UnpinPage (Reference count)





provides the ability of sequential access of
the records, as well as insertion, deletion,
updating and etc basic features.
Manage table structures
Manipulate tables in unit of tuple
Generate iterators for upper layers
The heap file defines the format of table
structures and tuples in pages.

Use single page to store a table structure
CREATE TABLE student (studentid INTEGER, name VARCHAR(10),
age INTEGER, schoolid INTEGER);
Table Info: Table ID: 2B, Num of columns: 2B, Num of tuples: 4B, First page ID: 4B
Num of pages: 4B, Table name: 10B
Each Field: Column Name: 2B, Column type: 2B, Column size: 2B, Not null: 1B
Key: 1B Offset: 2B, Index info: 4B
In the table structure page:

Store the table content
 Assume a tuple (123,“Zhang”,20,45) is stored in Page 5
 The unique identification (RID) of this tuple is defined by its page (5) and its
offset in the page (0x010h)
 Use linked multiple pages to store all tuples in a table
Structure
Page
Global Bitmap
Content
Page 1
Content
Page 2
Content
Page 3
…
Content
Page n

Main Interface of my implementation
 CreateTable
 DropTable
 InsertTuple
 DeleteTuple
 UpdateTuple
 SearchTuple


To accelerate the access of tuples
The most simple type of index:
Always keep tuples in order
Screenshot of Microsoft Access 2007

Typical index technique: B+ Tree
BT Header
Page
some important information about the B+ tree
is recorded in the BT Header Page, such as the
root page id of the tree and the key type
BT Index
Page
BTIndexPage is the internal node
and only contains index to the
child B+ tree page.
BT Leaf
Page
BTLeafPage is the leaf node and
contain index entries with RIDs

Common B+ Tree operations
 Search, Insert, Delete
 BoundaryQuery (return an array of RIDs)
Maintain B+ Tree at the same time of manipulating
the heap file.
 Keep the balance of the tree
 To be simple, you can build a B+ Tree for only the
column with the “primary key” flag.
 Please refer to algorithm books and papers for more
information of B+ Tree



Parse SQL statement into the grammar tree
Two ways
 Write a naïve parser yourself
 Use tools (YACC for C++ users, JCup for Java users)

Your DBMS should support st the following
SQL statements
 SELECT fields FROM tables [WHERE Wclause] [GROUP BY fields] [HAVING





fexpression] [ORDER BY fields]
CREATE TABLE tablename (ctclauses)
DROP TABLE tablename
INSERT INTO tablename (fields) VALUE (values)
DELETE FROM tablename [WHERE Wclause]
UPDATE tablename SET uclauses [WHERE Wclause]
 Group functions: AVG | COUNT | MIN | MAX | SUM
 Relation Op.: AND | OR
 Data Types: CHAR | VARCHAR | DATE | TIME | DOUBLE | INTEGER | LONG

Examples
 CREATE TABLE student
(studentid INTEGER,
name VARCHAR(30) DEFAULT '(NONAME)',
age INTEGER DEFAULT -1,
dat DATE DEFAULT DATE '2000-01-01' NOT NULL);
 SELECT star1.name, star2.name FROM MovieStar star1, MovieStar star2
WHERE Star1.address=star2.address AND Star1.name<Star2.name;
 SELECT name FROM MovieExec WHERE (cert#) IN
(SELECT producerC# FROM Movie WHERE ( title , year ) IN
(SELECT movieTitle , movieYear FROM StarsIn
WHERE starName = 1));
 SELECT COUNT(birthdate) FROM MovieStar
GROUP BY gender HAVING gender='F';


Execute the parsed SQL statement
A possible implementation
SQL Statement
SQL Grammar Tree (by SQL Parser)
Relational Algebra Query Tree
Optimized to Execution Tree

Relational Algebra Query Tree

Optimized Execution Tree
Refer http://www.ittc.ku.edu/~sgauch/647/s99/notes/11b/sld001.htm for details
Different users can open the same database in the
same time
 They also can read from the same table
concurrently
 They cannot access the same table when someone
do writing operations on it
 Thus, a concurrency control is needed


A possible implementation
 Set up a table lock
 Two kinds of locks: Read lock and Write lock
When a table has a write lock, no operations are
allowed on it. When a table has a read lock, only
read requirements are allowed.
 If a user want to do some operations on it, he
must wait until the previous lock is released.
 Avoid dead locking!





The bridge between access interface and the core
system
Since we allow open more than one database
simultaneously, the DB manager should maintain all
opened databases in current system.
Maintain system tables for each database
package the result from the low-level system, and
send it to the right session
Logger

Main functions in my implementation
 Hold the connected sessions (multi-threaded)
 Manage every opened database in current system
 Package the result and do response
 Logger

Two types of user interfaces
 Graphics User Interface
 XDBC (JDBC, ODBC or you defined) programming
interface

Remote user interfaces should be
implemented
 C++ users: recommend raw socket
 Java users: recommend RMI

I “created” a simple API. Sample program:
#include “FatwormAPI.h”
CFatWormAPI::Connect(“192.168.1.12”,1202);
CFatWormAPI::ExecuteCommand("LOGIN AAA AAA;");
CFatWormAPI::ExecuteCommand("OPEN DATABASE STU;");
CFatWormAPI::ExecuteCommand(“SELECT * FROM student2 WHERE
studentid<1000;");
for (int i=0;i<CFatWormAPI::GetSelectCount();i++) {
printf("%d %s %d %d\n",
CFatWormAPI::GetInteger(i,0),
CFatWormAPI::GetChar(i,1),
CFatWormAPI::GetInteger(i,2),
CFatWormAPI::GetInteger(i,3)
);
}
CFatWormAPI::DisConnect();
Client
Local Access
Socket/RMI
Remote Server
Database Manager
SQL parsing & SQL Engine
Heap File
Concurrency
Control
B+ Tree Indexing
Storage & Buffer
X
D
B
C

Advantage
 Very Fast (Most real DBMS are written in C++)
 Direct memory access
inline static int ReadInt(const BYTE * pData,int ofs) {
return *((int *)(pData+ofs));
}
 Powerful Marco
#define MAKE_RID(PID,OFS) ((((DWORD)(PID))<<20) | (DWORD)(OFS))

Disadvantage
 “Access violation” program – Be careful when
programming!

Contact me via e-mail
 shinyflute@gmail.com
 shinyflute@sjtu.org
Download