Optimistic Intra-Transaction Parallelism on Chip-Multiprocessors
Chris Colohan (1), Anastassia Ailamaki (1), J. Gregory Steffan (2) and Todd C. Mowry (1,3)
(1) Carnegie Mellon University  (2) University of Toronto  (3) Intel Research Pittsburgh

Chip Multiprocessors are Here!
- AMD Opteron, IBM Power 5, Intel Yonah
- 2 cores now, soon 4, 8, 16, or 32
- Multiple threads per core
- How do we best use them?

Multi-Core Enhances Throughput
[Figure: users submit transactions to a database server (DBMS plus database) running on multiple cores]
- Cores can run concurrent transactions and improve throughput
- Can multiple cores also improve transaction latency?

Parallelizing transactions
    SELECT cust_info FROM customer;
    UPDATE district WITH order_id;
    INSERT order_id INTO new_order;
    foreach(item) {
        GET quantity FROM stock;
        quantity--;
        UPDATE stock WITH quantity;
        INSERT item INTO order_line;
    }
- Intra-query parallelism is used for long-running queries (decision support), but it does not work for short queries, and short queries dominate commercial workloads
- Intra-transaction parallelism, where each thread spans multiple queries, is hard to add to existing systems: you need to change the interface, add latches and locks, and worry about the correctness of parallel execution
- Thread Level Speculation (TLS) makes parallelization easier: it breaks the transaction into threads, relying on speculation to preserve correctness

Thread Level Speculation (TLS)
[Figure: epoch 1 and epoch 2 read and write *p and *q, executed sequentially vs. in parallel; a premature read of *p in epoch 2 triggers a violation and epoch 2 restarts]
- Divide the work into epochs
- Buffer speculative state
- Detect violations (inter-epoch data dependences)
- Restart to recover
- Worst case: sequential; best case: fully parallel
- Data dependences limit performance
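To make this execution model concrete, the following is a minimal, single-threaded software sketch of the idea (not the hardware design used in this work, and every name in it is illustrative): each epoch records its reads and buffers its writes, epochs commit in program order, and a later epoch that read data an earlier epoch wrote is squashed and re-executed.

    /* tls_sketch.c: a toy, single-threaded analogy of the TLS execution model. */
    #include <stdio.h>
    #include <string.h>

    #define NVARS 4
    static int memory[NVARS];            /* the committed ("real") state        */

    typedef struct {
        int read_set[NVARS];             /* which variables this epoch read     */
        int wrote[NVARS];                /* which variables this epoch wrote    */
        int write_buf[NVARS];            /* speculative (buffered) values       */
    } epoch_t;

    static int ep_read(epoch_t *e, int v)
    {
        e->read_set[v] = 1;
        return e->wrote[v] ? e->write_buf[v] : memory[v];
    }

    static void ep_write(epoch_t *e, int v, int val)
    {
        e->wrote[v] = 1;                 /* buffered: not visible to memory yet */
        e->write_buf[v] = val;
    }

    /* Commit: the epoch is now non-speculative, so drain its write buffer. */
    static void ep_commit(epoch_t *e)
    {
        for (int v = 0; v < NVARS; v++)
            if (e->wrote[v]) memory[v] = e->write_buf[v];
    }

    /* Violation check (conservative): did the later epoch read anything the
     * earlier epoch wrote?  Real hardware only flags reads that happened too
     * early, but the effect is the same in this sketch. */
    static int violated(const epoch_t *earlier, const epoch_t *later)
    {
        for (int v = 0; v < NVARS; v++)
            if (earlier->wrote[v] && later->read_set[v]) return 1;
        return 0;
    }

    /* One epoch's speculative work: decrement the quantity of one item,
     * echoing the stock update in the New Order loop. */
    static void run_epoch(epoch_t *e, int item)
    {
        memset(e, 0, sizeof *e);
        int quantity = ep_read(e, item);     /* GET quantity FROM stock    */
        ep_write(e, item, quantity - 1);     /* UPDATE stock WITH quantity */
    }

    int main(void)
    {
        for (int v = 0; v < NVARS; v++) memory[v] = 10;

        epoch_t e1, e2;
        run_epoch(&e1, 0);                   /* epoch 1: item 0                */
        run_epoch(&e2, 0);                   /* epoch 2: same item, dependence */

        ep_commit(&e1);                      /* epochs commit in order         */
        if (violated(&e1, &e2)) {            /* epoch 2 read stale data        */
            printf("violation: restarting epoch 2\n");
            run_epoch(&e2, 0);
        }
        ep_commit(&e2);
        printf("final quantity = %d\n", memory[0]);  /* 8, as in sequential order */
        return 0;
    }

In the common case the two epochs touch different items, violated() returns 0, and both commits go through without any re-execution, which is exactly why loops like the one above parallelize well.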
A Coordinated Effort
- Transactions (TPC-C): the transaction programmer
- DBMS (BerkeleyDB): the DBMS programmer chooses epoch boundaries and removes performance bottlenecks
- Hardware (simulated machine): the hardware developer adds TLS support to the architecture

So what's new?
Intra-transaction parallelism:
- without changing the transactions
- with minor changes to the DBMS
- without having to worry about locking
- without introducing concurrency bugs
- with good performance: transaction latency is halved on four cores

Related Work
- Optimistic concurrency control (Kung 82)
- Sagas (Garcia-Molina & Salem 87)
- Transaction chopping (Shasha 95)

Outline
- Introduction
- Related work
- Dividing transactions into epochs
- Removing bottlenecks in the DBMS
- Results
- Conclusions

Case Study: New Order (TPC-C)
    GET cust_info FROM customer;
    UPDATE district WITH order_id;
    INSERT order_id INTO new_order;
    foreach(item) {
        GET quantity FROM stock WHERE i_id=item;
        UPDATE stock WITH quantity-1 WHERE i_id=item;
        INSERT item INTO order_line;
    }
- The loop accounts for 78% of transaction execution time
- The only loop-carried dependence is the quantity field, and a conflict on it is very unlikely (about 1 in 100,000)
- To divide the transaction into epochs, the loop becomes speculative:
    GET cust_info FROM customer;
    UPDATE district WITH order_id;
    INSERT order_id INTO new_order;
    TLS_foreach(item) {
        GET quantity FROM stock WHERE i_id=item;
        UPDATE stock WITH quantity-1 WHERE i_id=item;
        INSERT item INTO order_line;
    }

Dependences in the DBMS
[Figure: dependences between epochs inside the DBMS serialize execution]
- Performance tuning: profile execution, remove the bottleneck dependence, repeat

Buffer Pool Management
[Figure: each epoch calls get_page(5) and put_page(5) against a shared buffer pool, bumping the page's reference count]
- TLS ensures the first epoch gets the page first, but who cares? The ordering on the reference count does not matter to the transaction
- Escape speculation for get_page: invoke the operation for real, store an undo function, then resume speculation
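As a rough illustration of that escape-and-undo idiom (with no-op stand-ins for the TLS primitives and a toy reference count in place of BerkeleyDB's real buffer pool, so every name here is illustrative), get_page can be performed non-speculatively while its inverse, put_page, is logged as the undo action to run if the epoch is squashed:

    /* escape_sketch.c: a toy version of escaping speculation around get_page(). */
    #include <stdio.h>

    /* Stand-ins for the TLS hardware primitives (no-ops in this sketch). */
    static void tls_escape_speculation(void) { /* stop buffering state   */ }
    static void tls_resume_speculation(void) { /* resume buffering state */ }

    /* Toy buffer-pool calls: just a reference count on one page. */
    static int page_ref;
    static int *get_page(int page_no) { (void)page_no; page_ref++; return &page_ref; }
    static void put_page(void *page)  { (void)page; page_ref--; }

    /* Undo log: operations to run only if the epoch is violated and restarted. */
    typedef void (*undo_fn)(void *);
    static struct { undo_fn fn; void *arg; } undo_log[64];
    static int undo_cnt;

    static void enqueue_undo(undo_fn fn, void *arg)
    {
        undo_log[undo_cnt].fn  = fn;
        undo_log[undo_cnt].arg = arg;
        undo_cnt++;
    }

    static void epoch_undo_all(void)         /* called on a violation */
    {
        while (undo_cnt > 0) {
            undo_cnt--;
            undo_log[undo_cnt].fn(undo_log[undo_cnt].arg);
        }
    }

    /* get_page() wrapped in the escape idiom: pin the page for real,
     * and log the matching unpin as its undo. */
    static int *get_page_escaped(int page_no)
    {
        tls_escape_speculation();
        int *page = get_page(page_no);
        enqueue_undo(put_page, page);
        tls_resume_speculation();
        return page;
    }

    int main(void)
    {
        get_page_escaped(5);
        printf("ref count after speculative get_page: %d\n", page_ref);  /* 1 */
        epoch_undo_all();             /* pretend this epoch was violated */
        printf("ref count after undo:                 %d\n", page_ref);  /* 0 */
        return 0;
    }

If the epoch commits instead, the logged undo entries are simply discarded; the put_page that releases the page is handled differently, as described next, because it cannot be undone.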
Buffer Pool Management
- put_page, however, is not undoable, so the escape-and-undo trick cannot be applied to it
- Instead, delay put_page until the end of the epoch, which avoids the dependence entirely

Removing Bottleneck Dependences
We introduce three techniques:
- Delay operations until non-speculative: mutex and lock acquire and release; buffer pool, memory, and cursor release; log sequence number assignment
- Escape speculation: buffer pool, memory, and cursor allocation
- Traditional parallelization: memory allocation, cursor pool, error checks, false sharing

Experimental Setup
- Detailed simulation: four superscalar, out-of-order CPUs with 128-entry reorder buffers; a 32KB 4-way L1 cache per CPU, a shared 2MB 4-way L2 cache, and the rest of the memory hierarchy modeled in detail
- TPC-C transactions on BerkeleyDB: in-core database, single user, single warehouse
- Measure an interval of 100 transactions; we measure latency, not throughput

Optimizing the DBMS: New Order
[Chart: normalized New Order execution time (busy / cache miss / violated / idle CPU) as bottleneck dependences are successively removed; annotations: 26% improvement at first, other CPUs not helping, cache misses increase, can't optimize much more]
- This process took me 30 days and under 1,200 lines of code

Other TPC-C Transactions
[Chart: normalized execution time for New Order, Delivery, Stock Level, Payment, and Order Status on 4 CPUs, broken down into busy / cache miss / failed speculation / idle CPU]
- 3 of the 5 transactions speed up by 46-66%

Conclusions
- A new form of parallelism for databases: intra-transaction parallelism
- A tool for attacking transaction latency without major changes to the DBMS
- TLS can be applied to more than transactions
- Transaction latency is halved by using 4 CPUs

Any questions? For more information, see www.colohan.com

Backup Slides Follow…

TPC-C Transactions on 2 CPUs
[Chart: the same breakdown on 2 CPUs]
- 3 of the 5 transactions speed up by 30-44%

LATCHES

Latches
- Provide mutual exclusion between transactions
- Cause violations between epochs: the read-test-write cycle is a read-after-write (RAW) dependence
- Not needed between epochs, since TLS already provides mutual exclusion

Latches: Aggressive Acquire
- One large critical section: the first epoch acquires the latch and increments latch_cnt; each later epoch increments latch_cnt, does its work, and enqueues its release; as each epoch becomes homefree it commits its work and decrements latch_cnt, and the latch is released when the count reaches zero

Latches: Lazy Acquire
- Small critical sections: the non-speculative epoch acquires and releases the latch normally, while each speculative epoch enqueues its acquire and release; when a speculative epoch becomes homefree it acquires the latch, commits its work, and releases the latch
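A rough sketch of lazy acquire, assuming a single latch, a plain pthread mutex standing in for the DBMS latch, and made-up names for the TLS runtime hooks:

    /* lazy_latch_sketch.c: deferring latch operations to epoch commit. */
    #include <stdio.h>
    #include <pthread.h>

    static pthread_mutex_t latch = PTHREAD_MUTEX_INITIALIZER;

    /* While speculative, the critical section runs without the latch; TLS
     * conflict detection already orders the epochs' accesses to the data the
     * latch protects, so we only record that the latch would have been held. */
    static int deferred_pairs;

    static void lazy_latch_acquire(void) { deferred_pairs++; }
    static void lazy_latch_release(void) { /* deferred along with its acquire */ }

    /* When the epoch becomes non-speculative ("homefree"), it takes the latch
     * for real, commits its buffered work, and releases the latch. */
    static void epoch_homefree_commit(void)
    {
        while (deferred_pairs > 0) {
            pthread_mutex_lock(&latch);
            /* ... commit this epoch's buffered updates to the latched data ... */
            pthread_mutex_unlock(&latch);
            deferred_pairs--;
        }
    }

    int main(void)
    {
        lazy_latch_acquire();
        /* ... speculative work on the latched structure, with no latch held ... */
        lazy_latch_release();

        epoch_homefree_commit();      /* acquire, commit work, release */
        printf("latch held only around the commit\n");
        return 0;
    }

The aggressive scheme instead takes the latch once up front and holds it across the epochs, trading fewer acquire and release operations for one much larger critical section.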
HARDWARE

TLS in Database Systems
[Figure: non-database TLS uses many short epochs, while TLS in database systems runs concurrent transactions with large epochs]
- Large epochs mean more dependences, which must be tolerated, and more speculative state, which needs bigger buffers

Feedback Loop
- The programmer thinks "I know this is parallel!" and turns
      for() { do_work(); }
  into
      par_for() { do_work(); }
- Then, "must make it faster": profiling results are fed back into the code, over and over

Violations == Feedback
[Figure: in the parallel execution, a premature read of *p raises a violation; the report identifies the addresses involved (0x0FD8, 0xFD20, 0x0FC0, 0xFC18), telling the programmer which dependence to attack]

Eliminating Violations
[Figure: after the *p dependence is eliminated, violations on *q remain]
- Removing one dependence exposes the next, and an optimization may even make execution slower

Tolerating Violations: Sub-epochs
[Figure: with sub-epochs, a violation on *q restarts only the offending sub-epoch rather than the whole epoch]

Sub-epochs
- Started periodically by the hardware (how many? when to start?)
- Hardware implementation is just like epochs: use more epoch contexts
- No need to check for violations between sub-epochs within an epoch
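As a loose software analogy (the real mechanism is implemented in hardware, and the checkpoint interval below is made up), a sub-epoch behaves like a periodic checkpoint inside an epoch: a violation rolls the epoch back only to the most recent checkpoint instead of all the way to its start.

    /* subepoch_sketch.c: checkpoint-style analogy for sub-epochs. */
    #include <stdio.h>

    #define N_ITEMS  8
    #define INTERVAL 3        /* hypothetical sub-epoch length, in items */

    typedef struct { int next_item; int partial_sum; } checkpoint_t;

    int main(void)
    {
        checkpoint_t ckpt = { 0, 0 };     /* start of the epoch           */
        int sum = 0;                      /* the epoch's speculative work */
        int inject_violation = 1;         /* fake one violation at item 5 */

        for (int i = 0; i < N_ITEMS; i++) {
            if (i % INTERVAL == 0) {      /* sub-epoch boundary: checkpoint */
                ckpt.next_item   = i;
                ckpt.partial_sum = sum;
            }
            sum += i;

            if (i == 5 && inject_violation) {
                inject_violation = 0;
                printf("violation at item %d: restart from item %d, not item 0\n",
                       i, ckpt.next_item);
                sum = ckpt.partial_sum;       /* roll back to the checkpoint  */
                i   = ckpt.next_item - 1;     /* loop ++ resumes at next_item */
            }
        }
        printf("sum = %d (expected %d)\n", sum, (N_ITEMS - 1) * N_ITEMS / 2);
        return 0;
    }

Discarding less work per violation is what makes the frequent dependences of large epochs tolerable.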