Time for Change: Why Not Transact In Memory?
Sang K. Cha
Email: chask@db.stanford.edu
On leave from Seoul National University, Korea

Transact In Memory Project
My interest in main memory databases started from:
WS-IRIS at HP Labs (1991-1992.1)
• In-memory personal object manager project of Tore Risch (before the SmallBase project, which became TimesTen)
Related previous projects at Seoul National Univ.:
C++ design & implementation of MMDBMS prototypes
• Initial design of MMDBMS modules (1993-1994)
• Mr.RT for ETRI (1995-1996)
  – 2+ local startups
• Xmas: extension of Mr.RT (1997-1999)
Object-oriented middleware for spatial object databases (1994-1999)
Transact in Memory project (2000 - present)

Transact In Memory Project
Spirit: do what we think is important rather than what others (government bureaucrats or committees) think is important.
No more Build and Perish: deal with the end users directly for quick feedback on our work.
Funding scheme: experimental
• Presently self-managed through Transact In Memory, Inc. for research and development freedom
• Venture capital financing in the future?

Transact In Memory Project
Our Users: performance-demanding users with simple applications (with 3n-letter company product bases)
• Telco companies
• Stock brokerage companies
• …
Our People: 20+ and expanding
• Graduate students
• Consultant (patent attorney for IP protection)
• Full-time field service engineering & marketing staff

Outline
Introduction to Main Memory DBMS
• MMDBMS vs DRDBMS
• Memory-primary vs disk-primary architecture
More on the Transact In Memory Project
• Some hypotheses
• Parallel architecture with parallel logging and recovery
• Some memory-primary design
• P*TIME: a Highly Parallel Transact In Memory Engine
Summary

Main Memory DBMS
Database resident in memory
• Read transactions simply read the in-memory data.
• Update transactions do in-memory updates and write the update log to the log disk.
• Occasionally, checkpoint the dirty pages of the in-memory database to the disk-resident backup DB to shorten the recovery time.
(A minimal code sketch of this flow appears after the memory-price slides below.)
[Figure: primary DB in memory, with logging to the log disk and checkpointing to the disk-resident backup DB]

MMDBMS Technical Issues
Logging and Recovery: critical for not losing updates
In-Memory Index Structures
• Efficient L2 cache utilization?
Concurrency Control of Index and Data
DBMS Process & Application Binding Architecture
Query Processing
Memory Management
Availability for Mission-critical Applications

Reminder on Hardware Advances
1000 times over the last 15 years [Patterson]
• CPU power, memory capacity, disk capacity
• Not really for memory and disk speed
• Network interface speed: 100Mbps – Gbps (maybe 100 times)
VAX 11/750 in 1984 (Stanford KBMS personal workstation)
• CPU (32 bit, cycle time: 320 ns, CPI: 10)
• 8MB memory (800 ns for write, 640 ns for read)
• Hundreds of MB of disk space
• 10Mbps Ethernet connection
How much speedup in DBMS performance over the same period? Not 1000 times!

In the old days (1984), a few MB was said…
"With the availability of very large, relatively inexpensive main memories, it is becoming possible to keep large databases resident in memory. … At the present time, memory for VAX 11/780 costs approximately $1,500 a megabyte. In 1990, a gigabyte of memory should cost less than $200,000 (with the availability of 1Mb chips). If 4 Mbit chips are available, the price might be as low as $50,000."
- DeWitt, Katz, Olken, Shapiro, Stonebraker, and Wood, "Implementation Techniques for Main Memory Database Systems," SIGMOD 1984
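To make the read/update/checkpoint flow on the "Main Memory DBMS" slide concrete, here is a minimal sketch: reads are served from memory only, updates modify memory and append a sequential redo log record, and an occasional checkpoint copies the in-memory table to a disk-resident backup file. It is an illustration only, not P*TIME code; the class and member names (MiniMemDB and so on) are hypothetical.

// Minimal sketch of the MMDBMS flow described above: reads touch only
// memory, updates append a redo log record, and a checkpoint occasionally
// copies the in-memory state to a disk-resident backup DB.
// All names (MiniMemDB, etc.) are hypothetical illustrations.
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>
#include <unordered_map>

class MiniMemDB {
public:
    MiniMemDB(const std::string& log_path, const std::string& backup_path)
        : log_(log_path, std::ios::app), backup_path_(backup_path) {}

    // Read transaction: served entirely from memory, no disk access.
    std::optional<std::string> read(uint64_t key) const {
        auto it = table_.find(key);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }

    // Update transaction: in-memory update plus a sequential redo log write.
    void update(uint64_t key, const std::string& value) {
        table_[key] = value;
        log_ << key << ' ' << value << '\n';   // redo record
        log_.flush();                          // force to the log disk at commit
    }

    // Occasional checkpoint: copy the in-memory database to the backup DB
    // so that recovery only needs to replay the log tail.
    void checkpoint() {
        std::ofstream backup(backup_path_, std::ios::trunc);
        for (const auto& [key, value] : table_)
            backup << key << ' ' << value << '\n';
    }

private:
    std::unordered_map<uint64_t, std::string> table_;  // primary DB in memory
    std::ofstream log_;                                 // log disk
    std::string backup_path_;                           // disk-resident backup DB
};

Recovery under such a naive scheme would reload the backup file and replay the log tail; the parallel logging and recovery slides later in this deck show how the serial bottlenecks in that step can be removed.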
and today a few GB….
We write the paper with essentially the same wording:
"As the price of memory continues to drop below $1,000/GB, it is now feasible to place many of the database tables and indexes in main memory."
- Kim, Cha, Kwon, SIGMOD 2001
Real numbers:

  Machine         Max CPU#   Max Memory
  Sun Blade 1K    2          8GB
  Sun Fire 3800   8          64GB
  Sun Fire 15K    108        0.5TB
  Compaq ML 570   4          16GB

  Memory prices (www.kingston.com): $1170/2GB, $7021/4GB; $914/2GB, $2143/4GB

MMDBMS vs DRDBMS
A well-designed MMDBMS should outperform a DRDBMS significantly, not only for read-oriented applications but also for update-intensive applications!
• with memory-optimized data structures and algorithms
• with disk access reduced to the sequential form of log writing and occasional checkpointing

Q: Is a Disk Database with a Large Buffer the Same as a Main Memory Database?
No!
• Complex mapping between disk and memory
  – E.g., traversing index blocks in the buffer requires bookkeeping of the mapping between disk and memory addresses
    » Swizzle or not
• Index block design is not optimized against hardware cache misses.
[Figure: disk-resident database with index and data blocks cached in a large buffer; index entries record disk addresses; log disk on the side]

Other Reasons to Say No!
Update performance advantage of the main memory database
Disk-resident DBMS is limited in taking advantage of hardware advances:
• Huge code base
  – many levels of abstraction for portability
  – backward compatibility with old customers (long history of evolution)
• Slow to change
• Too many control knobs to optimize
• Disk-primary design

Two Different Paradigms
Disk Primary:
• Design the disk-primary structures and algorithms.
• Cache the disk blocks of the disk-resident database in the memory buffer.
Memory Primary:
• Design the memory-primary structures and algorithms. Cache-conscious design.
• If necessary, map the in-memory structures to the disk. Uniform mapping.

Memory Primary versus Disk Primary
[Figure: memory hierarchy of CPU, cache memory, main memory (DRAM), and disk; the memory-primary focus is the cache/DRAM boundary, the disk-primary focus is the DRAM/disk boundary]

Why Memory-Primary Design?
Memory access is no longer uniform!
Cost of L2 cache misses: a hundred clock cycles, or a thousand-instruction opportunity (from David Patterson et al., IEEE Micro, 1997)

Memory-Primary Design
Cache-conscious: the well-known disk access rule comes back in a slightly different flavor.
Random access is expensive:
• Example: pointer chasing
Block access is more efficient:
• Align data structures with L2 cache lines (e.g., 64 bytes).
• Pack the right data in the cache block.

Cache Behavior of Commercial DBMSs (on a Uniprocessor Pentium II Xeon)
Anastassia Ailamaki et al., "DBMSs on a Modern Processor: Where Does Time Go?", VLDB 99
• Memory-related delays: 40-80% of execution time
• Data accesses on caches: 19-86% of memory stalls
Multiprocessor cache behavior? Probably worse because of coherence cache misses.

Coherence Cache Miss Problem
Updates on shared data structures on shared-memory multiprocessor systems lead to cache invalidation and thus coherence cache misses.
Problems with frequently updated, shared internal control structures: locks, transaction tables, etc. (see the sketch below)
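As a concrete illustration of the cache-line alignment rule and the coherence-miss problem above, the sketch below pads frequently updated, shared control fields so that each one owns its own 64-byte L2 cache line; updates by different processors then stop invalidating each other's lines. The 64-byte line size and all structure names are assumptions made for illustration.

// Sketch: aligning frequently updated, shared control structures to
// separate cache lines so that concurrent updates from different CPUs
// do not cause coherence cache misses (false sharing).
// CACHE_LINE = 64 bytes is an assumption; names are illustrative.
#include <atomic>
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64;

// Bad: two hot counters share one cache line; every update by one CPU
// invalidates the line in the other CPU's cache.
struct SharedCounters {
    std::atomic<long> commits{0};
    std::atomic<long> aborts{0};
};

// Better: each hot field owns a full cache line.
struct alignas(CACHE_LINE) PaddedCounter {
    std::atomic<long> value{0};
    char pad[CACHE_LINE - sizeof(std::atomic<long>)];
};

struct PartitionControl {
    PaddedCounter commits;   // updated by committing threads
    PaddedCounter aborts;    // updated independently, on its own line
};

static_assert(sizeof(PaddedCounter) == CACHE_LINE,
              "each counter occupies exactly one cache line");

The same alignment idea applies to index nodes, which the cache-conscious index slides later in the deck discuss.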
Transact In Memory Project at SNU
Objective: build a highly parallel, highly scalable main-memory DBMS
• Exploit the parallelism in logging, recovery, and CC
• Exploit advances in HW computing power
Specific technical issues published so far:
• Parallel logging and recovery [ICDE01]
• Cache-conscious index [SIGMOD01]
• Cache-conscious concurrency control of index [VLDB01]

Transact In Memory Project: Some Hypotheses
MMDBMS should do more than just keep the database in memory
• New architecture and new algorithms are needed; otherwise DRDBMS will catch up eventually.
Memory-primary design should boost the performance significantly.
MMDBMS is the best place to apply the self-tuning RISC DBMS concept!
• Thanks to Surajit Chaudhuri and Gerhard Weikum for their VLDB 2000 vision paper!
Once we have a high-performance MMDBMS, today's multi-tier architecture may not be the best fit.
• Database servers may become idle …
This is also a good place for inventing …
• Because of the new cost model, new premise

Parallel Main Memory DBMS
[Figure: parallel MMDBMS architecture; transaction manager, lock manager, memory manager, data dictionary, replication manager, recovery manager, checkpoint manager, and log manager surround the primary database; the checkpoint manager writes Backup DB 1 … m through buffers (parallel checkpointing), and the log manager writes Log 1 … n through buffers (parallel logging), coordinated via a log anchor]

Parallel Main Memory DBMS
Fully parallel recovery: backup database loading and log processing all in parallel
[Figure: the recovery manager runs backup DB read (BDR) and backup DB replay (BDP) threads over Backup DB 1 … m, and log read (LR) and log replay (LP) threads over Log 1 … n, all in parallel. BDP: Backup DB rePlay, LP: Log rePlay, BDR: Backup DB Read, LR: Log Read]

Problems with Existing Logging Schemes
Constraint on log replay by the serialization order
Conventional wisdom:
• Distribute log records to multiple disks
  – Pays the cost of merging and applying log records in the serialization order during recovery
• Partition the database and assign different log disks to different partitions => limited parallelism
  – Poor utilization of multiple log disks when updates are skewed to certain partitions
  – Difficulty of handling transactions updating multiple segments

Differential Logging: Definition
Let the value of R change: p → q
Differential log: Δ(p,q) = p ⊕ q
Redo: p ⊕ Δ(p,q) = q
Undo: q ⊕ Δ(p,q) = p
where ⊕ denotes the bit-wise XOR
[Lee, Kim, and Cha, ICDE 2001]

Differential Logging Example
Value of resource R: 0000 → 0101 → 1001

  Log sequence number   Resource ID   …   XOR difference
  L1                    R             …   0101
  L2                    R             …   1100

Redo L1; Redo L2: 0000 → 0101 → 1001
Redo L2; Redo L1: 0000 → 1100 → 1001
Either replay order yields the same final value.

Source of Parallelism
The XOR operation is commutative and associative.
Serialization order is insignificant
• Log records can be applied in arbitrary order.
• Log records can be distributed to multiple disks to improve the logging and recovery performance.
Redo and undo can be intermixed
• Single-pass recovery possible
Backup DB loading can be intermixed with log processing
• Initialize the primary DB with 0's. Copying the backup DB into memory is then equivalent to applying XOR to the backup data and the corresponding memory initialized with 0's.
(A minimal code sketch of differential logging follows.)
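Here is a minimal sketch of the XOR-based differential logging idea above, reproducing the example slide. It assumes fixed-size 64-bit resource values for simplicity; the actual scheme in [Lee, Kim, and Cha, ICDE 2001] works on arbitrary byte ranges, and the struct and function names are illustrative.

// Sketch of differential logging: the log record stores before XOR after,
// and because XOR is commutative and associative, redo (and undo) records
// can be applied in any order.
// Fixed-size 64-bit values and the names used here are simplifying assumptions.
#include <cassert>
#include <cstdint>

struct DiffLogRecord {
    uint64_t resource_id;
    uint64_t diff;          // before XOR after
};

DiffLogRecord make_log(uint64_t id, uint64_t before, uint64_t after) {
    return {id, before ^ after};
}

// Redo and undo are the same operation: XOR the diff into the current value.
void apply(uint64_t& value, const DiffLogRecord& rec) { value ^= rec.diff; }

int main() {
    // R: 0000 -> 0101 -> 1001 (binary), as in the example slide.
    DiffLogRecord l1 = make_log(/*id=*/1, 0b0000, 0b0101);  // diff = 0101
    DiffLogRecord l2 = make_log(/*id=*/1, 0b0101, 0b1001);  // diff = 1100

    // Recovery may replay the records in either order.
    uint64_t a = 0, b = 0;
    apply(a, l1); apply(a, l2);   // 0000 -> 0101 -> 1001
    apply(b, l2); apply(b, l1);   // 0000 -> 1100 -> 1001
    assert(a == 0b1001 && b == 0b1001);

    // Undo is the same XOR: applying l2 again restores the previous value.
    apply(a, l2);
    assert(a == 0b0101);
    return 0;
}

Because every record can start from a zero-initialized page, backup DB loading and log replay can also be interleaved freely, as the "Source of Parallelism" slide states.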
Cache-conscious Index Structures
Design focus: minimize L2 cache misses
Index node alignment with cache lines
• Node size: a few multiples of L2 cache blocks
Pointer elimination for increasing fanout → reduced height → reduced cache misses
• CSB+-tree [Rao & Ross, SIGMOD 2000]
Key compression for increasing fanout
• CR-tree [Kim et al., SIGMOD 2001], pkT/pkB-tree [Chen et al., SIGMOD 2001]
[Figure: node layouts of a B+-tree vs a CSB+-tree with example keys 23, 34, 47, 58]

Concurrency Control of Index
What if we combine cache-conscious indexes with traditional latch-based index concurrency control schemes?
• Lock coupling involves latching and unlatching index nodes while traversing down the tree.
• Latching/unlatching an index node involves a memory write on the cache block containing the latch.
• Memory writes on shared data structures on shared-memory multiprocessor systems lead to coherence cache misses.

OLFIT Concurrency Control: Overview
Consistent node access protocol [Cha et al., VLDB 2001, patent pending] (a minimal code sketch appears at the end of this deck):
ReadNode:
• Just read the node optimistically, with no latch operations.
• At the end, detect a read-update conflict by checking the latch state and the version number updated by the UpdateNode operation.
• Retry on conflict.
UpdateNode:
• Latch and unlatch the index node before and after updating the node content
  – Prevents other node updates from interfering
• Increment the version number of the index node before unlatching.

P*TIME: Highly Parallel Transact In Memory Engine
[Figure: multiple CPUs with caches access the primary DB through cache-conscious index concurrency control; parallel checkpointing to backup DB partitions, parallel logging to log disk partitions, and parallel recovery from both]

P*TIME: Highly Parallel Transact In Memory Engine
Lightweight, multithreaded architecture
• Scalable performance on multiprocessor platforms
• Implemented mostly in C++
• Small footprint (<10MB)
Industry-standard API: SQL, JDBC (with binding to Java and C++), ODBC
Support of DB-embedded application architecture
Multi-level concurrency control
Data replication for high availability
Simple performance model
• Ease of application tuning and optimization

P*TIME-Embedded Telco Directory Server
15M-customer database: 1 table (~1GB), 3 hash indexes (~1.3GB)
• Backup DB loading time: 16 sec (65MB/sec)
• Log processing time: 30-40MB/sec
• Parallel index building time: 19 sec
• Initial DB loading from text file: 6 min

Summary
The memory-primary architecture
• Utilizes the hardware resources more efficiently
• Still many things to do
P*TIME: second-generation MMDBMS
• Memory-primary architecture with parallelism exploited
• Other optimizations in implementation
Numerous applications in Telecom and Internet
• Directory servers
• Update-intensive applications: batch → on-line
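Referenced from the OLFIT slide above: a minimal sketch of the optimistic ReadNode / latched UpdateNode idea, assuming a single version word whose low bit doubles as the latch. This is not the P*TIME implementation; the actual protocol in [Cha et al., VLDB 2001] also covers node splits via link pointers and memory-ordering details that are omitted here, and all names are illustrative.

// Sketch of OLFIT-style node access: readers take no latch and instead
// validate against a version word; writers latch the node, update it,
// and advance the version when releasing the latch.
// Using one atomic word whose lowest bit is the latch is an assumption
// made for brevity; a production reader would also need proper fences or
// atomic payload reads, omitted here.
#include <atomic>
#include <cstdint>

struct Node {
    std::atomic<uint64_t> version{0};  // low bit = latch, upper bits = version
    int keys[15];                      // node payload (illustrative)
};

// ReadNode: optimistic read with retry on conflict.
template <class Fn>
void read_node(Node& n, Fn&& read_body) {
    for (;;) {
        uint64_t v1 = n.version.load(std::memory_order_acquire);
        if (v1 & 1) continue;           // node is latched by an updater: retry
        read_body(n);                   // read the node content with no latch
        uint64_t v2 = n.version.load(std::memory_order_acquire);
        if (v1 == v2) return;           // no concurrent update observed
        // otherwise a conflicting update happened during the read: retry
    }
}

// UpdateNode: latch, modify, advance the version, unlatch.
template <class Fn>
void update_node(Node& n, Fn&& write_body) {
    uint64_t v;
    for (;;) {                          // acquire the latch (set the low bit)
        v = n.version.load(std::memory_order_relaxed);
        if (v & 1) continue;            // already latched, spin
        if (n.version.compare_exchange_weak(v, v | 1,
                                            std::memory_order_acquire))
            break;
    }
    write_body(n);                      // exclusive update of the node content
    // release: clear the latch bit and advance the version in one store
    n.version.store(v + 2, std::memory_order_release);
}

Because readers write nothing, concurrent index traversals do not invalidate each other's cache lines, which is exactly the coherence-miss argument made earlier in the deck.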