An Integrated Hardware-Software Approach to Transactional Memory Sean Lie

advertisement
An Integrated Hardware-Software
Approach to Transactional Memory
Sean Lie
6.895 Theory of Parallel Systems
Monday December 8th, 2003
Transactional Memory
„
Transactional memory provides atomicity without the
problems associated with locks.
Locks
if (i<j) {
a = i; b = j;
} else {
a = j; b = i; }
Transactional Memory
Lock(L[a]); Lock(L[b]);
StartTransaction;
Flow[i] = Flow[i] – X;
Flow[i] = Flow[i] – X;
Flow[j] = Flow[j] + X;
Flow[j] = Flow[j] + X;
Unlock(L[b]); Unlock(L[a]);
EndTransaction;
„
I propose an integrated hardware-software approach to
transactional memory.
Hardware Transactional Memory
„
HTM: Transactional memory
can be implemented in
hardware using the cache and
cache coherency mechanism.
[Herlihy & Moss]
„
Uncommitted transactional
data is stored in the cache.
„
Transactional data is marked
in the cache using an
additional bit per cache line.
„
HTM has very low overhead
but has size and length
limitations.
Trans Tag Status
1
0
1
0
Data
Software Transactional Memory
Original
Object
Trans
Version
Readers List
248
‘a’
FLAG
NULL
NULL
NULL
48
77
“Mass. Ave.”
NULL
NULL
FLEX Software
Transaction System
„
STM: Transactional memory
can be implemented in
software using compiler and
library support.
„
Uncommitted transactional
data is stored in a copy of the
object.
„
Transactional data is marked
by flagging the object field.
„
STM does not have size or
length limitations but has high
overhead.
Results
„ An
integrated approach gives the best of both worlds.
† Common
z
case: HTM mode - Small/short transactions run fast.
† Uncommon
z
case:
STM mode - Large/long transactions are slower but possible.
„ An
integrated hardware-software transactional memory
system was implemented and evaluated.
† HTM
was implemented in the UVSIM software simulator.
† A subset of STM functionality was implemented for the benchmark applications.
† HTM was modified to be software-compatible.
Hardware vs. Software
„ HTM
has much lower overhead than STM.
„ A
network flow µbenchmark (node-push) was
implemented for evaluating overhead.
1 processor overheads:
Atomicity
“Worst”
“More Realistic”
Mechanism
Cycles (% of Base)
Locks
505%
136%
HTM
153%
104%
1879%
206%
STM
„
“worst” case:
Back-to-back small transactions
“more realistic” case:
Some processing between
small transactions
However, HTM has 2 serious limitations.
Hardware Limitation:
Cache Capacity
„
HTM uses the cache to hold
all transactional data.
„
Therefore, HTM aborts
transactions larger than the
cache.
„
Restricting transaction size is
awkward and not modular.
Size will depend on
associativity, block size, etc.
in addition to cache size.
† Cache configuration change
from processor to processor.
Trans Tag Status
1
1
†
1
1
Data
Hardware Limitation:
Context Switches
Cache
Trans data
for thread a
„
The cache is the only
transactional buffer for all
threads.
Processor
„
Therefore, HTM aborts
transactions on context
switches.
„
Restricting context switches
is awkward and not modular.
Thread a
†
Thread b
Context switches occur
regularly in modern systems
(e.g.. TLB exceptions).
HSTM: An Integrated Approach
„
Transactions are switched
from HTM to STM when
necessary.
„
When a transaction aborts in
HTM, it is restart in STM.
„
HTM is modified to be
software-compatible.
Trans Tag Status
1
0
Data
Readers List
248
‘a’
FLAG
77
“Mass. Ave.”
1
0
Software-Compatible HTM
Original
Object
Readers List
248
‘a’
FLAG
77
“Mass. Ave.”
Store Check
null
null??
Y
Load Check
flag
flag??
N
Y
Normal Memory
Operation
N
„
1 new instruction: xAbort
„
Additional checks are
performed in software:
On loads, check if the memory
location is set to FLAG.
† On stores, check if there is a
readers list.
†
Abort
„
„
Software checks are slow.
Performing checks adds a 2.2x performance overhead over pure
HTM in “worst” case (1.1x in “more realistic” case).
Overcoming Size Limitations
„ The
node-push benchmark was modified to touch
more nodes to evaluate size limitations.
„ HSTM
uses HTM when possible and STM when
necessary.
300
HTM
STM
Cycles per Node
250
200
150
HTM Transactions
stop fitting after
this point
100
50
0
1
5
9
13
17
21
25
29
33
37
Transaction Size (Number of Nodes x 100)
41
45
49
Overcoming Size Limitations
„ The
node-push benchmark was modified to touch
more nodes to evaluate size limitations.
„ HSTM
uses HTM when possible and STM when
necessary.
300
HTM
STM
Cycles per Node
250
HSTM
200
150
HTM Transactions
stop fitting after
this point
100
50
0
1
5
9
13
17
21
25
29
33
37
Transaction Size (Number of Nodes x 100)
41
45
49
Overcoming Context Switching
Limitations
„ Context
switches occur on TLB exceptions.
„ The
node-push benchmark was modified to choose
from a larger set of nodes.
† More
nodes Æ higher probability of TLB miss (Pabort).
„ HSTM
behaves like HTM when Pabort is low and like
STM when Pabort is high.
3000
6
STM
Avg. Cycles per Iteration
2500
5
HTM
2000
4
1500
3
1000
2
500
1
0
0
1
4
7
10
13
16
19
Number of Nodes (x1000)
22
25
28
31
Avg. Aborts per HTM Transaction
TLB Aborts
Overcoming Context Switching
Limitations
„ Context
switches occur on TLB exceptions.
„ The
node-push benchmark was modified to choose
from a larger set of nodes.
† More
nodes Æ higher probability of TLB miss (Pabort).
„ HSTM
behaves like HTM when Pabort is low and like
STM when Pabort is high.
3000
STM
HTM
Avg. Cycles per Iteration
2500
HSTM
2000
1500
1000
500
0
1
4
7
10
13
16
19
Number of Nodes (x1000)
22
25
28
31
Conclusions
„
An integrated approach gives the best of both worlds.
† Common
z
case: HTM mode - Small/short transactions run fast.
† Uncommon
z
„
case:
STM mode - Large/long transactions are slower but possible.
Trade-offs: † “STM
z
This is acceptable since it is uncommon.
† “HTM
z
mode” is not has fast as pure STM.
mode” is not has fast as pure HTM.
Is this acceptable?
Future Work
„ Full
implementation of STM in UVSIM
„ Integration
compiler
of software-compatible HTM into the FLEX
„ Evaluate
how software-compatible HTM performs for
parallel applications
„ Should
software-compatible modifications be moved
into hardware?
„ Can
a transaction be transferred from hardware to
software during execution?
Download