Applying Thread Level Speculation to Database Transactions Chris Colohan

advertisement
Applying Thread Level
Speculation to Database
Transactions
Chris Colohan
(Adapted from his Thesis Defense talk)
Chip Multiprocessors are Here!
AMD Opteron
IBM Power 5
Intel Yonah



2 cores now, soon will have 4, 8, 16, or 32
Multiple threads per core
How do we best use them?
2
Multi-Core Enhances Throughput
Users
Database Server
Transactions
DBMS
Database
Cores can run concurrent transactions
and improve throughput
3
Multi-Core Enhances Throughput
Users
Database Server
Transactions
DBMS
Database
Can multiple cores improve
transaction latency?
4
Parallelizing transactions
DBMS
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
GET quantity FROM stock;
quantity--;
UPDATE stock WITH quantity;
INSERT item INTO order_line;
}

Intra-query parallelism



Used for long-running queries (decision support)
Does not work for short queries
Short queries dominate in commercial workloads
5
Parallelizing transactions
DBMS
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
GET quantity FROM stock;
quantity--;
UPDATE stock WITH quantity;
INSERT item INTO order_line;
}

Intra-transaction parallelism


Each thread spans multiple queries
Hard to add to existing systems!

Need to change interface, add latches and locks, worry
about correctness of parallel execution…
6
Parallelizing transactions
DBMS
SELECT cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
GET quantity FROM stock;
quantity--;
UPDATE stock WITH quantity;
INSERT item INTO order_line;
}

Intra-transaction parallelism

Thread Level Speculation (TLS)
Hard to add to existing systems!
makes
parallelization
Need
to change
interface, add latcheseasier.
and locks, worry

Breaks transaction into threads

about correctness of parallel execution…
7
Thread Level Speculation (TLS)
Epoch 1
Epoch 2
=*p
*p=
*p=
Time
=*q
*q=
*q=
=*p
=*p
=*q
=*q
Sequential
Parallel
8
Thread Level Speculation (TLS)
Epoch 1
*p=
Epoch 2
Violation!
=*p 
*p=

Time
R2
=*p
*q=
*q=


=*q 
=*p
Use epochs
Detect violations
Restart to recover
Buffer state
Worst case:

=*q

Best case:

Sequential
Sequential
Fully parallel
Parallel
Data dependences limit performance.
9
A Coordinated Effort
TPC-C
Transactions
DBMS
BerkeleyDB
Hardware
Simulated machine
10
A Coordinated Effort
Transaction
Programmer
DBMS Programmer
Choose epoch
boundaries
Remove performance
bottlenecks
Hardware Developer
Add TLS support to
architecture
11
Outline







Introduction
Related work
Dividing transactions into epochs
Removing bottlenecks in the DBMS
Hardware Support
Results
Conclusions
12
Case Study: New Order (TPC-C)
GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
GET quantity FROM stock
WHERE i_id=item;
UPDATE stock WITH quantity-1
WHERE i_id=item;
INSERT item INTO order_line;
}

78% of transaction
execution time
Only dependence is the quantity field

Very unlikely to occur (1/100,000)
13
Case Study: New Order (TPC-C)
GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach(item) {
GET quantity FROM stock
WHERE i_id=item;
UPDATE stock WITH quantity-1
WHERE i_id=item;
GET cust_info FROM customer;
INSERT item INTO order_line;
UPDATE district WITH order_id;
}
INSERT order_id INTO new_order;
TLS_foreach(item) {
GET quantity FROM stock
WHERE i_id=item;
UPDATE stock WITH quantity-1
WHERE i_id=item;
INSERT item INTO order_line;
}
14
Outline







Introduction
Related work
Dividing transactions into epochs
Removing bottlenecks in the DBMS
Hardware support
Results
Conclusions
15
Time
Dependences in DBMS
16
Dependences in DBMS
Time
Dependences serialize execution!
Performance tuning:
 Profile execution
 Remove bottleneck dependence
 Repeat
17
Buffer Pool Management
CPU
get_page(5)
put_page(5)
Buffer Pool
ref: 0
1
18
Buffer Pool Management
get_page(5)
put_page(5)
get_page(5)
Time
CPU
put_page(5)
get_page(5)
put_page(5)
Buffer Pool
ref: 0
get_page(5)
put_page(5)
TLS ensures first
epoch gets page first.
Who cares?
19
Buffer Pool Management
Escape speculation
get_page(5)
CPU
put_page(5)
Invoke operation
Store undo function
Resume speculation
get_page(5)
Time
•
•
•
•
put_page(5)
get_page(5)
put_page(5)
Buffer Pool
ref: 0
get_page(5)
put_page(5)
= Escape Speculation
20
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
static tls_mutex mut;
page_t *ret;
tls_escape_speculation();
check_get_arguments(id);
tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut);
tls_on_violation(put, ret);
tls_resume_speculation()
 Wraps
get_page()
return ret;
}
21
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
static tls_mutex mut;
page_t *ret;
tls_escape_speculation();
check_get_arguments(id);
tls_acquire_mutex(&mut);
 No violations
while calling
get_page()
ret = get_page(id);
tls_release_mutex(&mut);
tls_on_violation(put, ret);
tls_resume_speculation()
return ret;
}
22
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
static tls_mutex mut;
page_t *ret;
tls_escape_speculation();
check_get_arguments(id);
tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut);
tls_on_violation(put, ret);
tls_resume_speculation()
 May get bad
input data from
speculative
thread!
return ret;
}
23
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
static tls_mutex mut;
 Only
page_t *ret;
tls_escape_speculation();
check_get_arguments(id);
tls_acquire_mutex(&mut);
one
epoch per
transaction at a
time
ret = get_page(id);
tls_release_mutex(&mut);
tls_on_violation(put, ret);
tls_resume_speculation()
return ret;
}
24
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
static tls_mutex mut;
page_t *ret;
tls_escape_speculation();
check_get_arguments(id);
tls_acquire_mutex(&mut);
ret = get_page(id);
tls_release_mutex(&mut);
tls_on_violation(put, ret);
tls_resume_speculation()
 How to undo
get_page()
return ret;
}
25
get_page() wrapper
page_t *get_page_wrapper(pageid_t id) {
 Isolated
static tls_mutex mut;
page_t *ret;
 Undoing this
operation
tls_escape_speculation();does not cause
check_get_arguments(id); cascading aborts
tls_acquire_mutex(&mut);

ret = get_page(id);
Undoable
Easy way to return
tls_release_mutex(&mut); system to initial state

tls_on_violation(put, ret);
tls_resume_speculation()
return ret;
}

Can also be used for:


Cursor management
malloc()
26
Buffer Pool Management
get_page(5)
put_page(5)
get_page(5)
Time
CPU
put_page(5)
get_page(5)
put_page(5)
Buffer Pool
ref: 0
get_page(5)
put_page(5)
Not undoable!
= Escape Speculation
27
Buffer Pool Management
get_page(5)
put_page(5)
get_page(5)
Time
CPU
put_page(5)
Buffer Pool
ref: 0
get_page(5)
put_page(5)
= Escape Speculation

Delay put_page until end of
epoch

Avoid dependence
28
Removing Bottleneck Dependences
We introduce three techniques:
 Delay operations until non-speculative




Escape speculation


Mutex and lock acquire and release
Buffer pool, memory, and cursor release
Log sequence number assignment
Buffer pool, memory, and cursor allocation
Traditional parallelization

Memory allocation, cursor pool, error checks,
false sharing
29
Outline







Introduction
Related work
Dividing transactions into epochs
Removing bottlenecks in the DBMS
Hardware support
Results
Conclusions
30
TLS in Database Systems
Time
Large epochs:
• More dependences
• Must tolerate
• More state
• Bigger buffers
Non-Database
TLS
TLS in Database
Systems
Concurrent
transactions
31
Feedback Loop
for() {
do_work();
}
think
I know this
is parallel!
par_for() {
do_work();
}
Must…Make
…Faster
feed back
back feed
feed back
back feed
feed back
back feed
feed back
feed
back
feed
back
feed
back
feed
32
Violations == Feedback
*p=
Violation!
=*p
*p=
Time
R2
=*p
*q=
*q=
=*q
=*p
Must…Make
=*q …Faster
Sequential
Parallel
0x0FD8
0xFD20
0x0FC0
0xFC18
33
Eliminating Violations
0x0FD8
0xFD20
0x0FC0
0xFC18
Violation!
=*p
*p=
Time
R2
=*p
*q=
Violation!
=*q
*q=
=*q
=*q
Parallel
Optimization may
make slower?
Eliminate *p Dep.
34
Time
Tolerating Violations: Sub-epochs
Violation!
=*q
*q=
Violation!
=*q
*q=
=*q
=*q
Eliminate *p Dep.
Sub-epochs
35
Sub-epochs

Started periodically by
hardware



How many?
When to start?
Violation!
*q=
Hardware implementation

=*q
Just like epochs


=*q
Use more epoch contexts
No need to check violations
between sub-epochs within
an epoch
Sub-epochs
36
Old TLS Design
CPU
CPU
L1 $
L1 $
CPU
L1 $
CPU
L1 $
Buffer speculative state
in write back L1 cache
Restart by invalidating
speculative lines
Invalidation
Detect violations
through invalidations
Problems:
L2 $
large
• L1 cache not
enough
• Later epochs only get values
on commit
Rest of system only
sees committed data
Rest of memory system
37
New Cache Design
CPU
L1 $
CPU
CPU
L1 $
L1 $
CPU
L1 $
Speculative writes
immediately visible to
L2 (and later epochs)
Restart by invalidating
speculative lines
Buffer speculative and
non-speculative state
for all epochs in L2
L2 $
Invalidation
Rest of memory system
Detect violations at
lookup time
Invalidation coherence
between L2 caches
38
New Features
CPU
L1 $
CPU
CPU
L1 $
L1 $
CPU
L1 $
Speculative state in L1 and
L2 cache
Cache line replication
(versions)
L2 $
Data dependence tracking
within cache
Speculative victim cache
Rest of memory system
39
Outline







Introduction
Related work
Dividing transactions into epochs
Removing bottlenecks in the DBMS
Hardware support
Results
Conclusions
40
Experimental Setup

Detailed simulation



Superscalar, out-of-order,
128 entry reorder buffer
Memory hierarchy
modeled in detail
CPU
CPU
CPU
CPU
32KB
4-way
L1 $
32KB
4-way
L1 $
32KB
4-way
L1 $
32KB
4-way
L1 $
TPC-C transactions on
BerkeleyDB





In-core database
Single user
Single warehouse
Measure interval of 100
transactions
Measuring latency not
throughput
2MB 4-way L2 $
Rest of memory system
41
Optimizing the DBMS: New Order
Time (normalized)
1.25
26%
improvement
1
0.75
0.5
0.25
Other CPUs
not helping
Cache misses
increase
Idle CPU
Violated
Cache Miss
Busy
Can’t optimize
much more
0
42
Optimizing the DBMS: New Order
Time (normalized)
1.25
1
0.75
0.5
Idle CPU
Violated
Cache Miss
Busy
0.25
0
This process took me 30 days
and <1200 lines of code.
43
Other TPC-C Transactions
Time (normalized)
1
3/5 Transactions
speed up by 46-66%
0.75
Idle CPU
Violated
Cache Miss
Busy
0.5
0.25
0
New Order
Delivery
Stock Level
Payment
Order Status
44
Scaling
Idle CPU
Violated
Latch Stall
Cache Miss
Busy
Time (normalized)
1
0.75
0.5
0.25
0
Seq.
2 CPUs
4 CPUs
8 CPUs
45
Scaling
New Order 150
Idle CPU
Violated
Latch Stall
Cache Miss
Busy
Time (normalized)
1
0.75
0.5
0.25
0
Seq.
2 CPUs
4 CPUs
8 CPUs
Seq.
2 CPUs
4 CPUs
8 CPUs
46
Conclusions

A new form of parallelism for databases


Intra-transaction parallelism




Tool for attacking transaction latency
Without major changes to DBMS
With feasible new hardware
TLS can be applied to more than
transactions
Halve transaction latency by using 4 CPUs
47
Download