Applying Thread Level Speculation to Database Transactions Chris Colohan (Adapted from his Thesis Defense talk) Chip Multiprocessors are Here! AMD Opteron IBM Power 5 Intel Yonah 2 cores now, soon will have 4, 8, 16, or 32 Multiple threads per core How do we best use them? 2 Multi-Core Enhances Throughput Users Database Server Transactions DBMS Database Cores can run concurrent transactions and improve throughput 3 Multi-Core Enhances Throughput Users Database Server Transactions DBMS Database Can multiple cores improve transaction latency? 4 Parallelizing transactions DBMS SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } Intra-query parallelism Used for long-running queries (decision support) Does not work for short queries Short queries dominate in commercial workloads 5 Parallelizing transactions DBMS SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } Intra-transaction parallelism Each thread spans multiple queries Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution… 6 Parallelizing transactions DBMS SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } Intra-transaction parallelism Thread Level Speculation (TLS) Hard to add to existing systems! makes parallelization Need to change interface, add latcheseasier. and locks, worry Breaks transaction into threads about correctness of parallel execution… 7 Thread Level Speculation (TLS) Epoch 1 Epoch 2 =*p *p= *p= Time =*q *q= *q= =*p =*p =*q =*q Sequential Parallel 8 Thread Level Speculation (TLS) Epoch 1 *p= Epoch 2 Violation! =*p *p= Time R2 =*p *q= *q= =*q =*p Use epochs Detect violations Restart to recover Buffer state Worst case: =*q Best case: Sequential Sequential Fully parallel Parallel Data dependences limit performance. 9 A Coordinated Effort TPC-C Transactions DBMS BerkeleyDB Hardware Simulated machine 10 A Coordinated Effort Transaction Programmer DBMS Programmer Choose epoch boundaries Remove performance bottlenecks Hardware Developer Add TLS support to architecture 11 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Hardware Support Results Conclusions 12 Case Study: New Order (TPC-C) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } 78% of transaction execution time Only dependence is the quantity field Very unlikely to occur (1/100,000) 13 Case Study: New Order (TPC-C) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; GET cust_info FROM customer; INSERT item INTO order_line; UPDATE district WITH order_id; } INSERT order_id INTO new_order; TLS_foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } 14 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Hardware support Results Conclusions 15 Time Dependences in DBMS 16 Dependences in DBMS Time Dependences serialize execution! Performance tuning: Profile execution Remove bottleneck dependence Repeat 17 Buffer Pool Management CPU get_page(5) put_page(5) Buffer Pool ref: 0 1 18 Buffer Pool Management get_page(5) put_page(5) get_page(5) Time CPU put_page(5) get_page(5) put_page(5) Buffer Pool ref: 0 get_page(5) put_page(5) TLS ensures first epoch gets page first. Who cares? 19 Buffer Pool Management Escape speculation get_page(5) CPU put_page(5) Invoke operation Store undo function Resume speculation get_page(5) Time • • • • put_page(5) get_page(5) put_page(5) Buffer Pool ref: 0 get_page(5) put_page(5) = Escape Speculation 20 get_page() wrapper page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret; tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut); ret = get_page(id); tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation() Wraps get_page() return ret; } 21 get_page() wrapper page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret; tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut); No violations while calling get_page() ret = get_page(id); tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation() return ret; } 22 get_page() wrapper page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret; tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut); ret = get_page(id); tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation() May get bad input data from speculative thread! return ret; } 23 get_page() wrapper page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; Only page_t *ret; tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut); one epoch per transaction at a time ret = get_page(id); tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation() return ret; } 24 get_page() wrapper page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret; tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut); ret = get_page(id); tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation() How to undo get_page() return ret; } 25 get_page() wrapper page_t *get_page_wrapper(pageid_t id) { Isolated static tls_mutex mut; page_t *ret; Undoing this operation tls_escape_speculation();does not cause check_get_arguments(id); cascading aborts tls_acquire_mutex(&mut); ret = get_page(id); Undoable Easy way to return tls_release_mutex(&mut); system to initial state tls_on_violation(put, ret); tls_resume_speculation() return ret; } Can also be used for: Cursor management malloc() 26 Buffer Pool Management get_page(5) put_page(5) get_page(5) Time CPU put_page(5) get_page(5) put_page(5) Buffer Pool ref: 0 get_page(5) put_page(5) Not undoable! = Escape Speculation 27 Buffer Pool Management get_page(5) put_page(5) get_page(5) Time CPU put_page(5) Buffer Pool ref: 0 get_page(5) put_page(5) = Escape Speculation Delay put_page until end of epoch Avoid dependence 28 Removing Bottleneck Dependences We introduce three techniques: Delay operations until non-speculative Escape speculation Mutex and lock acquire and release Buffer pool, memory, and cursor release Log sequence number assignment Buffer pool, memory, and cursor allocation Traditional parallelization Memory allocation, cursor pool, error checks, false sharing 29 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Hardware support Results Conclusions 30 TLS in Database Systems Time Large epochs: • More dependences • Must tolerate • More state • Bigger buffers Non-Database TLS TLS in Database Systems Concurrent transactions 31 Feedback Loop for() { do_work(); } think I know this is parallel! par_for() { do_work(); } Must…Make …Faster feed back back feed feed back back feed feed back back feed feed back feed back feed back feed back feed 32 Violations == Feedback *p= Violation! =*p *p= Time R2 =*p *q= *q= =*q =*p Must…Make =*q …Faster Sequential Parallel 0x0FD8 0xFD20 0x0FC0 0xFC18 33 Eliminating Violations 0x0FD8 0xFD20 0x0FC0 0xFC18 Violation! =*p *p= Time R2 =*p *q= Violation! =*q *q= =*q =*q Parallel Optimization may make slower? Eliminate *p Dep. 34 Time Tolerating Violations: Sub-epochs Violation! =*q *q= Violation! =*q *q= =*q =*q Eliminate *p Dep. Sub-epochs 35 Sub-epochs Started periodically by hardware How many? When to start? Violation! *q= Hardware implementation =*q Just like epochs =*q Use more epoch contexts No need to check violations between sub-epochs within an epoch Sub-epochs 36 Old TLS Design CPU CPU L1 $ L1 $ CPU L1 $ CPU L1 $ Buffer speculative state in write back L1 cache Restart by invalidating speculative lines Invalidation Detect violations through invalidations Problems: L2 $ large • L1 cache not enough • Later epochs only get values on commit Rest of system only sees committed data Rest of memory system 37 New Cache Design CPU L1 $ CPU CPU L1 $ L1 $ CPU L1 $ Speculative writes immediately visible to L2 (and later epochs) Restart by invalidating speculative lines Buffer speculative and non-speculative state for all epochs in L2 L2 $ Invalidation Rest of memory system Detect violations at lookup time Invalidation coherence between L2 caches 38 New Features CPU L1 $ CPU CPU L1 $ L1 $ CPU L1 $ Speculative state in L1 and L2 cache Cache line replication (versions) L2 $ Data dependence tracking within cache Speculative victim cache Rest of memory system 39 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Hardware support Results Conclusions 40 Experimental Setup Detailed simulation Superscalar, out-of-order, 128 entry reorder buffer Memory hierarchy modeled in detail CPU CPU CPU CPU 32KB 4-way L1 $ 32KB 4-way L1 $ 32KB 4-way L1 $ 32KB 4-way L1 $ TPC-C transactions on BerkeleyDB In-core database Single user Single warehouse Measure interval of 100 transactions Measuring latency not throughput 2MB 4-way L2 $ Rest of memory system 41 Optimizing the DBMS: New Order Time (normalized) 1.25 26% improvement 1 0.75 0.5 0.25 Other CPUs not helping Cache misses increase Idle CPU Violated Cache Miss Busy Can’t optimize much more 0 42 Optimizing the DBMS: New Order Time (normalized) 1.25 1 0.75 0.5 Idle CPU Violated Cache Miss Busy 0.25 0 This process took me 30 days and <1200 lines of code. 43 Other TPC-C Transactions Time (normalized) 1 3/5 Transactions speed up by 46-66% 0.75 Idle CPU Violated Cache Miss Busy 0.5 0.25 0 New Order Delivery Stock Level Payment Order Status 44 Scaling Idle CPU Violated Latch Stall Cache Miss Busy Time (normalized) 1 0.75 0.5 0.25 0 Seq. 2 CPUs 4 CPUs 8 CPUs 45 Scaling New Order 150 Idle CPU Violated Latch Stall Cache Miss Busy Time (normalized) 1 0.75 0.5 0.25 0 Seq. 2 CPUs 4 CPUs 8 CPUs Seq. 2 CPUs 4 CPUs 8 CPUs 46 Conclusions A new form of parallelism for databases Intra-transaction parallelism Tool for attacking transaction latency Without major changes to DBMS With feasible new hardware TLS can be applied to more than transactions Halve transaction latency by using 4 CPUs 47