Hardware Support for Efficient Transactional and Supervised Memory Systems Jayaram Bobba

advertisement
Hardware Support for Efficient
Transactional and Supervised Memory
Systems
Jayaram Bobba
Dissertation Defense
1/14/2010
Dept. of Computer Sciences
University of Wisconsin–Madison
Overview:
1) Research Area
2) Challenges/
Contributions
3) Big Picture
Research Area
Device Scaling
Abundant Transistors
Emergence of CMPs
 Hard to Program
Hardware Support to Improve Productivity
Empty/full-bits
Transactional Memory
MemTracker
Deterministic Memory
7/26/2016
Wisconsin Multifacet Project
Supervised Systems
2
Challenges
• Supervised Systems
– Sequential-consistency only
– Ad hoc hardware
– Lack of formalism
• Transactional Memory
– “Most transactions are small”
• Self-fulfilling
– Limited applicability
7/26/2016
Wisconsin Multifacet Project
Contribution 1:
Supervised Memory
TSOdata ,Safe Supervision
Contribution 2:
TokenTM
Contribution 3:
StealthTest
3
Software
Big Picture
Applications
Tools
Hardware
Supervised
Systems
Supervised
Memory
7/26/2016
StealthTest
TokenTM
TSOdata and
Safe Supervision
Wisconsin Multifacet Project
4
Outline
Slide Count
•
•
•
•
•
Motivation
Supervised Memory
TokenTM
StealthTest
Conclusion
7/26/2016
4
18
4 /19
16
6
Wisconsin Multifacet Project
5
On Software Productivity
Better
More
Hardware Software
More Productivity
Yannis’s “Law”: Programmer
Productivity doubles every 6
years
7/26/2016
More Performance
Moore’s Law
Moore’s Law will continue
But Yannis’s Law?
Wisconsin Multifacet Project
6
What has changed?
• “A Fundamental Turn towards Concurrency in
Software” [Herb Sutter, 2005]
• Moore’s Law -> Better Computers
– Sequential Computers (Past)
• Memory wall, Power wall etc.
– Attack of the killer CMPs* (Current)
• How to program? Expose parallelism to software
• Parallel programs hard to write
* Adapted from “Attack of the killer micros” by Eugene Brooks
7/26/2016
Wisconsin Multifacet Project
7
Who solves the productivity issue?
• Why, Of course, hardware architects!
• Long live Moore’s Law
– Spend some transistors on productivity issues
• Architectural Support for Enhancing Productivity
–
–
–
–
–
7/26/2016
for language features
for bug avoidance
for debugging
for performance feedback
and so on…
Wisconsin Multifacet Project
8
Seriously, Who should solve it?
• HW Architects or SW Engineers?
• ‘software crisis’ in the past too…
“We must now reconsider the balance of hardware
• Why HW architects?and software and to provide more specialized
–
function in hardware than we have previously, in
More bang for the buck
order to(Economic)
drastically simplify the programming
process” Edward A. Feustel, IEEE TOC, July 1973 in
• Software/IT (1,152 billion)
vs Hardware (138 billion)
support of Tagged Memory
[Wen Mei Hwu, Micro-39 Keynote]
– SW cannot do it alone (Technical)
• Decades of automatic parallelization efforts
• Virtual Memory, Tagged Memory for LISP-like languages
7/26/2016
Wisconsin Multifacet Project
9
Outline
• Motivation
• Supervised Memory
–
–
–
–
Background/Motivation
Explore relaxed supervised systems
Define Supervised Memory
Propose formal models
• TokenTM
• StealthTest
• Conclusion
7/26/2016
Wisconsin Multifacet Project
10
Why Supervised Systems?
• Synchronization
– Hardware TM systems
– Empty/Full-bits
• [Berry et al 2006] Graph processing algorithms on 4 processor MTA >
64K BG/L
• Controlled non-determinism
– Deterministic/Interleaving Constrained Multiprocessing
• Debugging
– Log-based architectures
• Safety
– Heap checkers, Bounds checkers
• Language Features
– Hardware-assisted Garbage Collection
7/26/2016
Wisconsin Multifacet Project
11
What are Supervised Systems?
1) out-of-band metadata per data block
2) monitor & control (supervise) memory
accesses to data
3) execute handlers on specific metadata states
• pure software possible, but inefficient
– shadow memory
E.g., Valgrind. Mean Slowdown 22X [Nethercote et
al., VEE2007]
7/26/2016
Wisconsin Multifacet Project
12
State-of-the-Art
• Expect Sequentially-Consistent (SC) hardware
– Most hardware is not
• Ad hoc
– Whither primitives?
• Informal treatment of memory consistency
– Ambiguous/Incorrect
7/26/2016
Wisconsin Multifacet Project
13
Contributions
• Expect Sequentially-Consistent (SC) hardware
– Most hardware is not
Explore relaxed supervised systems
• Ad hoc
– Whither primitives?
Define Supervised Memory
• Informal treatment of memory consistency
– Ambiguous/Incorrect
7/26/2016
Propose formal memory models
Wisconsin Multifacet Project
14
Outline
• Motivation
• Supervised Memory
–
–
–
–
Background/Motivation
Explore relaxed supervised systems
Define Supervised Memory
Propose formal models
• TokenTM
• StealthTest
• Conclusion
7/26/2016
Wisconsin Multifacet Project
15
Explore relaxed supervised systems
PC
PC
r1
r2
r3
r1
r2
r3
ST 0x01, A
0x01
0x10
ST 1, [A]
LD [B], r1
ST
0x10, C
ST 2,[C]
LD [C], r3
ST A
LD B
Memory
Store
Buffer
Processor
TSO-lite: A TSO-compliant system
7/26/2016
Block
Data
A
0x00
B
0x01
C
0x11
Wisconsin Multifacet Project
Metadata
16
Explore relaxed supervised systems
LD
ST
LD
Exception
None
LD/ST
7/26/2016
Memory
Full
Empty
PC
PC
r1
r2
r3
r1
r2
r3
ST 0x01, A
0x01
Store
Buffer
ST
Processor
Empty/Full-Bits on TSO-lite
ST 1, [A]
LD [B], r1
ST
0x10, C
ST 2,[C]
LD [C], r3
I1: NO LOAD
BYPASS
Block
Data
Metadata
A
0x00
Full
B
0x01
None
C
0x11
Empty
Wisconsin Multifacet Project
EXCEPTION
I2: LATE
EXCEPTIONS
17
Explore relaxed supervised systems
Deterministic Shared Memory (DMP)
[Devietti et al., ASPLOS 2009]
“depending upon the consistency model of the
underlying hardware, threads must perform a
memory fence at the edge of a quantum”
• Insert a fence after the last operation in the
quantum
• Insert a fence before the first shared operation in
the quantum
I3: Reordered metabit-reads
7/26/2016
Illustration
Wisconsin Multifacet Project
18
Outline
• Motivation
• Supervised Memory
–
–
–
–
Background/Motivation
Explore relaxed supervised systems
Define Supervised Memory
Propose formal models
• TokenTM
• StealthTest
• Conclusion
7/26/2016
Wisconsin Multifacet Project
19
Define Supervised Memory
What is Supervised Memory?
• Each memory location A,
– data (A.d)
– metadata (A.m)
• New operations
– Supervised Load (sLD A)
– Supervised Store (sST A)
• Jump on reading special metadata (Optionally)
– Hardware exception
7/26/2016
Wisconsin Multifacet Project
20
Define Supervised Memory
Supervised Operations
sLD A =>
Start:
atomic{
curm = Val[RA.m]
// Read metadata
nextm = NEXT(Load, curm) // Check software// specified FSM
If nextm == EXCEPTION then
Jump to Handler
RA.d
// Read data
If (nextm != curm) then
WA.m,nextm
// Update metadata
}
Handler:
…
7/26/2016
Wisconsin Multifacet Project
21
Define Supervised Memory
Using Supervised Memory
• Software assigns semantics to metadata
– Metastates stored as metadata
• E.g., Initialized, Uninitialized
– Metastate transition function (NEXT)
• Use supervised operations to monitor/control
data operations
– E.g., catch read access to uninitialized data
7/26/2016
Wisconsin Multifacet Project
22
Outline
• Motivation
• Supervised Memory
–
–
–
–
Background/Motivation
Explore relaxed supervised systems
Define Supervised Memory
Propose formal models
• TokenTM
• StealthTest
• Conclusion
7/26/2016
Wisconsin Multifacet Project
23
Propose formal models
TSO Axioms
[Hangal et al., ISCA 2004]
7/26/2016
Wisconsin Multifacet Project
24
Propose formal models
TSO Axioms
[Hangal et al., ISCA 2004]
Axiom
Description
Order
Total Order on all write accesses
Atomicity
No intervening accesses for atomic operations
Termination
All write accesses eventually complete
Value
Reads return latest value from memory or store buffer
Memory
Barrier
No reordering across a barrier
ReadAny
Accesses cannot pass outstanding reads
WriteWrite
Write access cannot pass outstanding writes
Rd A
Rd B
7/26/2016
Rd A
Wr B
Wr A
Wr B
Wr A
Rd B
Wisconsin Multifacet Project
Reordering Axioms
Allows store buffers
25
Propose formal models
TSOall: A Consistency Model for
Supervised Memory
TSO axioms applied to all accesses—data and metadata
+ (Simple) Like TSO
— (Slow) Prohibits optimizations
Thread: sST A ->[Rd A.m, Wr A.d, Wr A.m]
sLD B ->[Rd B.m, Rd B.d]
=> Store buffers ineffective
• Tension
– Ease of Reasoning vs Performance
7/26/2016
Wisconsin Multifacet Project
26
Propose formal models
Blast from the Past
[Adve and Hill, ISCA1990]
• Ease of Reasoning (SC) vs Performance (RC)
• Observation:
– Simple programs rely only on certain SC orders
– Ignore non-essential orders. Still appears as SC
• Challenge: Simple? Non-essential orders?
• Solution: Data-race-freedom
– For data-race-free programs, RC = SC
7/26/2016
Wisconsin Multifacet Project
27
Propose formal models
Safe Supervision
Motivation
• Ease of Reasoning (TSOall) vs Performance (?)
• Observation:
– Simple supervised programs rely only on certain TSOall orders
– Ignore non-essential orders. Still appears as TSOall
• Challenge: Simple? Non-essential orders?
• Solution: Safe Supervision
– For safely supervised programs, ? = TSOall
7/26/2016
Examples
Wisconsin Multifacet Project
28
Safe Supervision
• metadata accesses to location A not used to order
operations to a different location B
Initially, A.m = Empty, B.d = 0
Thread 1:
Thread 2:
B.d = 1
While (A.m == Empty);
A.m = Full
Read B.d
• Most uses of supervision are safely supervised. E.g.,
• Heap Checker: Initialized/Uninitialized values
• Transactional Memory: Conflict Detection information
7/26/2016
Definition
Wisconsin Multifacet Project
29
Propose formal models
Axiom
Description
Order
Total Order on all write accesses
Atomicity
No intervening accesses for atomic operations
Termination
All write accesses eventually complete
Value
Reads return latest value from memory or store buffer
Memory
Barrier
No reordering across a barrier
ReadAny
Data accesses cannot pass outstanding data reads
WriteWrite
Data writes cannot pass outstanding data writes
Reordering Axioms
TSOdata: Fast Yet Simple
Thread: sST B->[Rd A.m, Wr A.d, Wr A.m] Store buffers can
be used
sLDA ->[Rd B.m, Rd B.d]
For safely supervised programs, TSOdata = TSOall
7/26/2016
Wisconsin Multifacet Project
30
TSOdata on OpenSPARC T2
• Goal: Explore low-level issues on a real design
• Late Exceptions with deferred handlers
– Dump store buffer entries on exception
– Enhance store buffer to carry Virtual Address (VA)
– ~200 cycles to read out 4 entries
• Disable store buffer bypassing for supervised
loads
• Low space overhead for adding metabits (~4%)
7/26/2016
Wisconsin Multifacet Project
31
Supervised Memory Summary
• Expects Sequentially-Consistent (SC) hardware
– Most hardware is not
Explore relaxed memory systems
• Ad hoc
– Whither primitives?
Define Supervised Memory
• Informal treatment of memory consistency
– Ambiguous/Incorrect
7/26/2016
Propose formal memory models
Wisconsin Multifacet Project
32
Outline
•
•
•
•
•
Motivation
Supervised Memory
TokenTM [ISCA 2008]
StealthTest
Conclusion
7/26/2016
Longer Version
Wisconsin Multifacet Project
33
TokenTM Summary
• Current Hardware TMs
– Most Transactions Small & Short Running
– Penalize large/long transactions
– Too restrictive for wide-spread TM use?
• Hypothesis
– Must Support Efficient Large/Long Transactions As
Well
– Is such an HTM even possible?
• Yes! TokenTM
1. LogTM’s Log to buffer unbounded values
2. Transactional Tokens for unbounded conflict detection
• Conflict state in memory metabits
7/26/2016
Wisconsin Multifacet Project
34
Transactional Tokens
• Challenge: How to efficiently track Read/Write
sets?
• Token Coherence [Martin03]
– Read/Write sets for cache coherence
• Solution: Transactional Tokens
– T tokens per memory block
– At least one token to read, All T tokens to write
(token conflict detection)
– Token Metadata
<c0,c1,…,ci,…> where 0≤ci≤T is count of tokens
held by thread with TID i.
7/26/2016
Wisconsin Multifacet Project
35
Tokens and Supervised Memory
• Challenge: Where to store Unbounded, Globally
Accessible Token Metadata?
– unbounded and globally accessible
• Solution
– Supervised Memory’s Metadata
– Piggyback on existing Virtual Memory and
Cache Coherence mechanisms
Skip Animation
7/26/2016
Wisconsin Multifacet Project
36
TokenTM: a Large-Transaction TM
• New Conflict Detection Mechanism
– Transactional Tokens in Supervised Memory
– Token Coherence [Martin03] at different level
• Version Management
– Save old/new values for unbounded Write set
– LogTM [Moore06] undo log
7/26/2016
Wisconsin Multifacet Project
37
Outline
•
•
•
•
•
Motivation
Supervised Memory
TokenTM
StealthTest [PACT 2009]
Conclusion
7/26/2016
Wisconsin Multifacet Project
38
StealthTest Summary (1/2)
The Problem: fork Overhead
• Software testing hard
– Multithreading makes harder
– Run tests on deployed software
E.g., Delta Execution for patch testing
[Tucek et al., ASPLOS 2009]
– Non-intrusive mechanisms
• fork (existing)
7/26/2016
Wisconsin Multifacet Project
Functionally
Hidden
• Online software testing can help
Low
Overhead
Good
Scaling
39
StealthTest Summary (2/2)
Solution: TM for testing
• Demonstrate two uses
• Delta Execution
• In vivo Testing
7/26/2016
Wisconsin Multifacet Project
Functionally
Hidden
• Leverage Transactional Memory for online testing
• Non-Intrusive?
– transaction { test(); abort}
Low
Overhead
– Fast TM mechanisms
Good
Scaling
40
Outline
•
•
•
•
Motivation
Supervised Memory
TokenTM
StealthTest
– Online Software Testing
• E.g., Patch Validation
– StealthTest: TM for online testing
– Delta Execution using StealthTest
– In vivo Testing using StealthTest (Optionally)
• Conclusion
7/26/2016
Wisconsin Multifacet Project
41
Online Patch Validation
• Bug fixes can introduce more bugs
– Patches must be validated
• Online Validation [Nagaraja et al., OSDI 2004]
– Increased resource usage
– Lockstep execution
Input
Output
Production
Testing
Diff
7/26/2016
Wisconsin Multifacet Project
42
Delta Execution
[Tucek et al., ASPLOS 2009]
• Online Patch Validation
Most patches are small
Patched and Un-patched executions similar
• Delta Execution
– Run together except when they differ
Prior Work Delta Execution
Increased Resource Usage
Lockstep Execution
7/26/2016
O
O
Wisconsin Multifacet Project
P
P
43
Production
Testing
Delta Execution using fork
Time
7/26/2016
Wisconsin Multifacet Project
44
Multi-threading and fork
Production
Testing
‘Park’ all other threads
Time
7/26/2016
Stop all threads to get a consistent
memory snapshot
Wisconsin Multifacet Project
45
fork
Poor Performance
~9.8ms for split/~106ms for merge [Tucek et al, ASPLOS 2009]
Poor Scalability
Web-server response rate reduced by 43%
Want an alternate mechanism
7/26/2016
Wisconsin Multifacet Project
46
Outline
•
•
•
•
Motivation
Supervised Memory
TokenTM
StealthTest
– Online Software Testing
• E.g., Patch Validation
– StealthTest: TM for online testing
– Delta Execution using StealthTest
– In vivo Testing using StealthTest (Optionally)
• Conclusion
7/26/2016
Wisconsin Multifacet Project
47
Delta Execution
Delta Execution
using StealthTest
Isolate
patched execution
Introspect
patched execution
Monitor
delta data access
StealthTest
Transactional
Memory fork
transaction{…}
Version Management
Tracks new/old values
Conflict Detection
Monitor accesses
Execute on child
process
Page diffing
mprotect
7/26/2016
Wisconsin Multifacet Project
48
StealthTest Interface
Delta Execution
Isolate
patched execution
ST_begin_transaction
ST_abort_transaction
Introspect
patched execution
ST_get_old
ST_get_new
Monitor
delta data access
ST_protect_set
ST_protect_clear
StealthTest
Transactional
Memory
transaction{…}
7/26/2016
Version Management
Tracks new/old values
Wisconsin Multifacet Project
Conflict Detection
Monitor accesses
49
Requirements from TM
• Strong Atomicity [Martin et al., CAL 2006]
Transactions isolated from non-transactions
=> Test transactions isolated from application code
• Flexible Conflict Resolution
Can abort transactions if necessary
=> Abort tests if they block application
• Communication from within transactions
=> Expose result of a test
7/26/2016
Wisconsin Multifacet Project
50
Outline
•
•
•
•
Motivation
Supervised Memory
TokenTM
StealthTest
– Online Software Testing
• E.g., Patch Validation
– StealthTest: TM for online testing
– Delta Execution using StealthTest
– In vivo Testing using StealthTest (Optionally)
• Conclusion
7/26/2016
Wisconsin Multifacet Project
51
Production Testing
fork
Delta Execution
using StealthTest
Install
D data
fork
Patched
execution
Unpatched
execution
Compute and
Isolate D data
Merged
execution
7/26/2016
Production
StealthTest
transaction
Wisconsin Multifacet Project
52
Production Testing
fork
Multi-threaded
Delta Execution
Install
D data
fork
Patched
execution
Unpatched
execution
Compute and
Isolate D data
Merged
execution
7/26/2016
Original
StealthTest
transaction
Wisconsin Multifacet Project
53
Evaluation
(1) Effective?
(2) Non-intrusive?
• Workloads
– Collection of multi-threaded server apps
– Same as Tucek et al., ASPLOS 2009
• Pin-based TM Emulation
• 2-way SMP with 2.4GHz Pentium 4 CPUs and 2.5GB RAM
7/26/2016
Wisconsin Multifacet Project
54
(1) Effective?
Program
Description
Patch
Description
Patch Verified?
fork
StealthTest
Crafty
Chess App
Code refactoring
P
P
Raytrace
Raytracer
Result reporting fix
P
P
Tar
Archive Util
Incremental archiving fix
P
P
Apache1
Web Server
Buffer overflow fix
P
P
Apache2
Web Server
Buffer overflow fix
P
P
DNSCache
DNS Cache
Behavior Change
P
P
MySQL5.0
DB Server
Extra permission checks
P
O
OpenSSL
Security Lib
Added bug in TLS handling
P
O
Squid
Web Cache
Buffer overflow fix
P
O
ATPhttpd
Web Server
Buffer overflow fix
P
O
7/26/2016
Wisconsin Multifacet Project
Works
Memory
allocation
sockets
55
(2) Non-intrusive?
Program
Description
fork
ForkOverhead(%)
PatchDuration(%)
Crafty
Chess App
0.1
<0.1
Raytrace
Raytracer
0.2
0.5
Tar
Archive Util
41
7.3
Apache1
Web Server
2.8
0.1
Apache2
Web Server
12
0.1
DNSCache
DNS Cache
65
0.1
MySQL5.0
DB Server
4.7
5.0
OpenSSL
Security Lib
12
<0.1
Squid
Web Cache
2.9
0.2
ATPhttpd
Web Server
65
0.8
7/26/2016
Wisconsin Multifacet Project
56
Outline
•
•
•
•
Motivation
Supervised Memory
TokenTM
StealthTest
– Online Software Testing
• E.g., Patch Validation
– StealthTest: TM for online testing
– Delta Execution using StealthTest
– In vivo Testing using StealthTest
• Conclusion
7/26/2016
Wisconsin Multifacet Project
57
StealthTest Summary
• Software testing hard
• Online software testing can help
• StealthTest leverages TM for
non-intrusive online testing
• Demonstrate two uses
– Delta Execution
– In vivo Testing
7/26/2016
Wisconsin Multifacet Project
Functionally
Hidden
– Existing mechanisms inadequate
Low
Overhead
Good
Scaling
58
Outline
•
•
•
•
•
Motivation
Supervised Memory
TokenTM
StealthTest
Conclusion
7/26/2016
Wisconsin Multifacet Project
59
Contribution 1: Supervised Memory
[Under Submission]
• Supervised Systems – Useful, Renewed interest
• Problem
– SC only, while most systems are not
– Ad hoc hardware, specific to a supervised system
– No Formalism, leads to ambiguity/incorrectness
• Contributions
– Explore non-SC systems
– General model for supervision: Supervised Memory
– Formal Specification
7/26/2016
Wisconsin Multifacet Project
60
Contribution 2: TokenTM
[Bobba et al., ISCA 2008]
• Transactional Memory, a supervised system
• Problem
– “Most transactions are small”, Self-fulfilling assumption
– Penalize large/long transactions
– Too restrictive for wide-spread TM use?
• Contributions
– TokenTM
• First HTM to support efficient large/long transactions as well
• Follow-up: Purdue’s LiteTM [Jafri et al., HPCA 2010]
7/26/2016
Wisconsin Multifacet Project
61
Contribution 3: StealthTest
[Bobba et al., PACT 2009]
• Using transactional memory for testing
• Problem
– Existing fork-based mechanisms
• High overhead
• Poor scalability
• Contributions
– StealthTest, low-overhead interface for online testing
– Two StealthTest-based testing frameworks
7/26/2016
Wisconsin Multifacet Project
62
Other Research and Contributions
• Performance Pathologies
– Bobba et al., ISCA 2007
– Bobba et al., IEEE Micro Top Picks Jan 2008
• LogTM-SE
– Yen et al., HPCA 2007
• Nested LogTM
– Moravan et al., ASPLOS 2006
• LogTM
– Moore et al., HPCA 2006
• GEMS LogTM-SE Implementation
– Development, Release and Support
7/26/2016
Wisconsin Multifacet Project
63
Acknowledgments
Advisors
Mark Hill, David Wood
Mike Swift, Ben Liblit, Shan Lu, Karu Sankaralingam
Mikko Lipasti, Jeffrey Naughton
Co-authors
Kevin Moore, Luke Yen, Haris Volos, Michelle Moravan, Weiwei Xiong,
Neelam Goyal
Colleagues
Alaa Alameldeen, Arkaprava Basu, Brad Beckmann, Polina Dudnik, Dan
Gibson, Mike Marty, Somayeh Sardashti, Rathijit Sen, Cong Wang,
Yasuko Watanabe, Min Xu
Matt Allen, Piramanayagam Arumuga Nainar, Siddharth Barman,
Koushik Chakraborthy, Venkat Govindraju, Amit Kumar, Srinath
Sridharan, Philip Wells
7/26/2016
Wisconsin Multifacet Project
64
Software
Key Contributions
Applications
Tools
Hardware
Supervised
Systems
Supervised
Memory
7/26/2016
StealthTest
TokenTM
TSOdata and
Safe Supervision
Wisconsin Multifacet Project
65
Backup
7/26/2016
Wisconsin Multifacet Project
66
CPU Trends
From “The Free Lunch Is Over” by Herb Sutter
7/26/2016
Wisconsin Multifacet Project
67
The ‘Re-Birth’ of Parallel Programming
• Sequential Computers
– Memory wall, Power wall etc.
• Attack of the killer CMPs*
– General-purpose parallel computers
– How to program?
Expose parallelism to software
“The Free Lunch is Over” – Herb Sutter, 2005
* Adapted from “Attack of the killer micros” by Eugene Brooks
7/26/2016
Wisconsin Multifacet Project
68
Parallel Programming is Hard
(currently)
• Hard for programmers
– Correctness
• Synchronization, Data races, Atomicity violations
– Performance
• Communication, Scheduling, Load-Balancing, Critical
Path
• Hard for tools
– Compilers, Static Analysis
• Intractable/Inefficient
7/26/2016
Wisconsin Multifacet Project
69
Houston, We Have a Problem!
• Who should solve this problem?
• Yannis’s Law: Programmer Productivity
doubles every 6 years
– http://ix.cs.uoregon.edu/~yannis/law.html
• Proebsting's Law: Compiler Advances Double
Computing Power Every 18 Years
– http://research.microsoft.com/enus/um/people/toddpro/papers/law.htm
7/26/2016
Wisconsin Multifacet Project
70
Parallel Algorithms vs Moore’s Law
• “Improvement resulting from … algorithmic
speedup is comparable to that resulting from
from the hardware speedup due to Moore’s
Law over the same length of time”
David E. Keyes
“A Science-Based Case for Large-Scale
Simulation”, July 2003.
7/26/2016
Wisconsin Multifacet Project
71
In the “Landscape of Parallel
Computing Research…” [Asanovic et
al., 2006]
7/26/2016
Wisconsin Multifacet Project
72
Why not Tagged Memory?
[Gehringer and Keedy, CAN 1985]
• Type information in tags
• Arguments do not apply to dynamically-typed
languages like Lisp
• For other languages,
– Simpler but more specialized designs
– Compilers improved to make the proposals moot
7/26/2016
Wisconsin Multifacet Project
73
Explore relaxed memory systems
Existing proposals assume SC
• Assume SC or don’t deal with multiprocessors
7/26/2016
Proposal
Base
Architecture
Implementation
WWT
MIPS
SC
Tapeworm
MIPS
SC
LogTM
SPARC
SC
OneTM
SPARC
SC
Informing Memory MIPS, Alpha
SC
SafeMem
x86
x86
MemTracker
MIPS
SC
DMP
x86
SC
Wisconsin Multifacet Project
74
DMP Correctness
7/26/2016
Wisconsin Multifacet Project
75
Non-TSOall Executions
7/26/2016
Wisconsin Multifacet Project
76
Propose formal models
TSOdata is Complex
Empty/full-bits
sST
Empty
sLD
Exception
7/26/2016
Full
Initial State:
A.d = 0, A.m = None
B.d = 0, B.m = Empty
sST
T0:
dST 1, A
sLD B
T1:
sST B, 1
dLD A
Can dLD A return 0?
Wisconsin Multifacet Project
77
Safe Supervision
7/26/2016
Wisconsin Multifacet Project
78
7/26/2016
Wisconsin Multifacet Project
79
TokenTM Logical Operation
Thread X
PC
BEGIN_XACT
Load A
Store B
COMMIT_XACT
Thread Y
Undo Log
Undo Log
BEGIN_XACT
Load A
Store A
COMMIT_XACT
PC
ABORT
Shared Memory
Block
7/26/2016
Data
Metadata
<cx, cy, …>
A
0x..00..
<1,0,…>
<0,0,…>
<1,1,…>
B
0x..00..
B:0x..11..
0x..00..
<0,0,…>
<T,0,…>
C
0x..10..
<0,0,…>
Wisconsin Multifacet Project
Insufficient
tokens
80
Existing HTM Systems
Assumption: Most transactions small & short running
 Optimized for small transactions
 Degrade with large, long running transactions
• Non-localized Overhead, E.g.,
LogTM-SE [Yen07] false conflicts
OneTM [Blundel07] serializes
• Complex, Expensive Operations, E.g.,
XTM [Chung06]& PTM [Chuang06] manipulate page tables
Premature Optimization?
7/26/2016
Wisconsin Multifacet Project
81
Why Large Transactions?
Programmers may want large (>>cache) and/or
long (>> ctx switch) transactions
– HLL transactions invoke unpredictable lower-level code
– Replace critical sections containing syscalls or I/O
– Avoid concurrency bugs [Lu08]
• But “Most transactions small & short running”
– Restrict TM to use by gurus (like OS spin locks)?
– Self fulfilling prophesy?
Must Support Efficient Large/Long Transactions As Well
7/26/2016
Wisconsin Multifacet Project
82
Toward a Large-Transaction TM
Efficiently detect conflicts between in-flight transactions
using Read/Write Sets
• Unbounded
• Globally accessible
Small Transactions:
Low Overhead
Fast read/write set ops.
E.g., Add to read set
Clear read set
Large Transactions:
Localized Overhead
Accessible read/write set
(potentially unbounded)
Minimal Changes to
Coherence / VM
7/26/2016
N
O
Wisconsin Multifacet Project
Heavyweight eviction ops
Negative acks
Additional page tables
83
Existing Mechanisms
Synergy between cache coherence and conflict detection
Hence, overload cache coherence
Small Transactions:
Low Overhead
+ Excellent for bounded/small TM
But,
- ‘Virtualization’ on overflows
- Tough to access ‘virtualized’ state
7/26/2016
Wisconsin Multifacet Project
Minimal Changes to
× Coherence / VM
×
Large Transactions:
Localized Overhead
84
TokenTM: a Large-Transaction TM
• New Conflict Detection Mechanism
This Talk
– Transactional Tokens in Tagged Memory
– Token Coherence [Martin03] at different level
• Version Management
– Save old/new values for unbounded Write set
– LogTM [Moore06] undo log
7/26/2016
Wisconsin Multifacet Project
85
Transactional Tokens
• Challenge: How to efficiently track Read/Write sets?
• Token Coherence [Martin03]
– Read/Write sets for cache coherence
• Solution: Transactional Tokens
– T tokens per memory block
– At least one token to read, All T tokens to write
(token conflict detection)
– Token Metadata
<c0,c1,…,ci,…> where 0≤ci≤T is count of tokens held
by thread with TID i.
7/26/2016
Wisconsin Multifacet Project
86
Tagged Memory
• Challenge: Where to store Unbounded, Globally
Accessible Token Metadata?
• Virtual Memory
– unbounded and globally accessible
• Solution, similar to OneTM [Blundel07]
– Tag Virtual Memory
– Piggyback on existing Virtual Memory and
Cache Coherence mechanisms
7/26/2016
Wisconsin Multifacet Project
87
TokenTM Logical Operation
Thread X
PC
BEGIN_XACT
Load A
Store B
COMMIT_XACT
Thread Y
Undo Log
Undo Log
BEGIN_XACT
Load A
Store A
COMMIT_XACT
PC
ABORT
Shared Memory
Block
7/26/2016
Data
Metadata
<cx, cy, …>
A
0x..00..
<1,0,…>
<0,0,…>
<1,1,…>
B
0x..00..
B:0x..11..
0x..00..
<0,0,…>
<T,0,…>
C
0x..10..
<0,0,…>
Wisconsin Multifacet Project
Insufficient
tokens
88
Storing Metadata
Unbounded
Difficult to access globally
Thread Y
Thread X
PC
BEGIN_XACT
Load A
Store B
COMMIT_XACT
Undo
Log
Token
log
Undo
TokenLog
log
Cx
CY
PC
BEGIN_XACT
Load A
Store A
COMMIT_XACT
Software
Tagged Memory
Block
Data
Metadata
Metastate
(Sum,
<c , cTID)
…>
x
7/26/2016
Hardware
y,
A
0x..00..
B
0x..00..
(0, -)
<0,0,…>
(0, -)
<0,0,…>
C
Lossy
Summary
0x..10..
(0, -)
<0,0,…>
Wisconsin Multifacet Project
Concise
Accessible
89
Hardware Metastate
• Metadata summary (sum, TID)
– sum, total number of tokens acquired
– TID, identify owner when sum = 1 or sum = T (optional)
Some summaries,
<c0, c1, …, ci, …>
(sum, TID)
<0, 0, 0, 0>
(0, -)
<0, 0, 1, 0>
(1, 2)
<0, T, 0, 0>
(T, 2)
<0, 1, 1, 1>
(3, -)
• Concise -> Stored in packed field (e.g., State[1:2] , Attr[3:16])
• Fast -> Accessed as part of normal memory operation
7/26/2016
Wisconsin Multifacet Project
90
Token Logs
• Distributed structures for unbounded
Read/Write sets
– per-thread
– stored in program memory (e.g., heap)
– list of <address, num_tokens>
Token log
A: 1
B: T
• Accessible to hardware for fast ops
– Add to read set -> Append to token log
7/26/2016
Wisconsin Multifacet Project
91
Double-entry Bookkeeping
(Keeping Metadata Consistent)
Thread X
Logical
Token State
Metadata
<cx, cy, …>
PC
BEGIN_XACT
Load A
Store B
COMMIT_XACT
Thread Y
Token log
Token log
A: 1
A: 1
PC
BEGIN_XACT
Load A
Store A
COMMIT_XACT
<1,1,…>
<1,0,…>
<0,0,…>
Software
<0,0,…>
Hardware
<0,0,…>
7/26/2016
Block
Metastate
(Sum, TID)
A
(2, X)
(1,
(0,
-)
B
(0, -)
C
(0, -)
Wisconsin Multifacet Project
92
Implementing Hardware Metastate
Thread X
BEGIN_XACT
Load A
Store B
COMMIT_XACT
Thread Y
Token log
Token log
A: 1
BEGIN_XACT
Load A
Store A
COMMIT_XACT
Software
Load A
Private
Caches
Tag
A
Coherence
State
Data
Exclusive
Owned 0x..00..
0x..00..
Modified
Load A
Hardware
Tag
Data
Sum TID
A Shared 0x..00.. 1 X
Coherence
State
Sum TID
101, XXData A
A
DATAFwd_GETS
A
GETS A
Block Directory
Data
0x..00..
A Shared
Exclusive
Not
Present
@@
P1,P2
P1 0x..00..
Main
Memory
7/26/2016
Metastate
GETS A
(Sum,
TID)
Sum TID
Wisconsin Multifacet Project
(0,0)
0 0,
(0,0)
Upgrade A
Shared copies cannot
update metastate
Solution: Fission / Fusion
93
Metastate
Fission
Thread X
Thread Y
BEGIN_XACT
Load A
Store B
COMMIT_XACT
Token log
Token log
A: 1
A: 1
BEGIN_XACT
Load A
Store A
COMMIT_XACT
Software
1,X
Coherence
State
fission
Tag
Data Sum TID
0,Private A Modified
0x..00.. 1,X 1 X
Owned
Caches
0x..00..
Fwd_GETS A
Main
Memory
7/26/2016
Tag
A
Load A
Coherence
State
Data
Shared
0x..00..
Data A
Block Directory
A Shared
Exclusive
@ P1
@ P1,P2
Data
Sum TID
Hardware
Sum TID
01 Y-
GETS A
0x..00..
Wisconsin Multifacet Project
94
Metastate Fusion
• Metastate Fusion
– On store, metastate copies fused back
• Why does fission/fusion work?
– Store sees ‘complete’ metastate
– Load sees
• ‘complete’ metastate, if writer exists
• ‘partial’ metastate, otherwise
7/26/2016
Wisconsin Multifacet Project
95
Hardware Cost
• Additional metabits in caches/memory
– Recoded ECC to cull metabits
• Changes to coherence protocols
– Additional payload on messages
– Minimal changes to protocol logic
• Requires non-silent eviction
7/26/2016
Wisconsin Multifacet Project
96
Evaluation Methodology
• Methodology
– Full System Simulation
– Multifacet GEMS
• Base System
– 32-core CMP system, in-order, single-issue cores
– Private 4-way 32KB writeback split I&D L1 caches
– Shared 8-way 8 MB writeback L2
– On-chip directory @ L2, MESI coherence
– Packet-switched interconnect in a tiled topology
7/26/2016
Wisconsin Multifacet Project
97
TM Systems
• LogTM-SE [Yen07] variant
 Parallel Bloom Filters for conflict detection
 4 2Kbit H3 filters
+ Compact, less hardware overhead
- False Conflicts
• LogTM-SE_Perfect
+ No False Conflicts
- Unimplementable
• TokenTM
7/26/2016
Wisconsin Multifacet Project
98
Results
Performance Normalized to
LogTM-SE_Perfect
1.2
1
Large Transactions:
Minor
degradation
Localized
with
largeOverhead
transactions
0.8
0.6
LogTM-SE
0.4
LogTM-SE_Perfect
0.2
TokenTM
0
Comparable
on
Small Transactions:
small
Lowtransactions
Overhead
7/26/2016
Wisconsin Multifacet Project
99
7/26/2016
Wisconsin Multifacet Project
100
In vivo Testing
[Murphy et al. TR 2007, Chu et al. ICST 2008]
• Run unit tests on deployed software
+ More testing
+ More realistic
Catch bugs early
7/26/2016
Wisconsin Multifacet Project
101
In vivo Testing
using StealthTest
ST_begin_transaction();
try {
test();
ST_begin_escape();
fprintf(log, “…”, success);
ST_end_escape();
} catch/except() {
ST_begin_escape();
fprintf(log, “…”, fail);
ST_end_escape();
}
ST_abort_transaction(NO_RETRY);
7/26/2016
Wisconsin Multifacet Project
102
Evaluation
• Workloads
– Bugbench
• Server Workloads
– STAMP
• Transactional Memory benchmarks
• Implementation
– Intel STM
• Language-Based TM
– TL2 STM
• Library-Based TM
• Quad-core workstation with RHEL5
7/26/2016
Wisconsin Multifacet Project
103
(1) Effective?
Description
Size
(LOC)
Bug Type
Error
Detected?
NCOM
file compress
1.9K
Stack Smash
Yes
POLY
file “unixier”
0.7K
Stack Smash
Yes
GZIP
file compress
8.2K
Buffer Overflow
Yes
MAN
documentation
4.7K
Buffer Overflow
Yes
BC
calculator
17.0K
Buffer Overflow
Yes
HTPD1
web server
224K
Atomicity
Yes
SQUD
proxy cache
93.5K
Buffer Overflow
Possible
CVS
version control
114.5K
Double Free
Possible
MSQL2
DBMS
514K
Atomicity
Possible
MSQL3
DBMS
1028K
Atomicity
Possible
7/26/2016
Wisconsin Multifacet Project
Unsupported
Library Calls
Program
Works
• Built on Intel STM.
• Run tests on Bugbench applications
104
(2) Non-intrusive?
• Built on TL2 STM.
• Run tests on STAMP applications (1000 tests per min)
Normalized Execution Time
4
3.5
3
2.5
None
2
fork
1.5
StealthTest
1
0.5
0
Genome
7/26/2016
Intruder
Vacation
Wisconsin Multifacet Project
Yada
105
Atomicity Violation Bugs?
7/26/2016
Wisconsin Multifacet Project
106
Degree-2 Transactions
• Isolate only writes.
• Implementation
– Reads in escape action
– Early Release
– Add new type of transaction to TM
7/26/2016
Wisconsin Multifacet Project
107
StealthTest Wish List
• Hardware Support
• System Calls within Transactions
• Interaction between Locks and Transactions
7/26/2016
Wisconsin Multifacet Project
108
4
Normalized Execution Time
3.5
3
None
2.5
ForkIV-10^3
2
ForkIV-10^1
1.5
StealthIV-10^1
1
StealthIV-10^3
0.5
0
Genome
7/26/2016
Intruder
Vacation
Wisconsin Multifacet Project
Yada
109
Related Work
• Binary Translation (SPROCKETS)
• Code Emulation (STEM)
• TLS (Oplinger&LAM, PathExpander)
7/26/2016
Wisconsin Multifacet Project
110
In vivo Testing Motivation
• Ordering Bug in MySQL
• In vivo Test: Data Consistency Checks
Buggy Code (In mysys/thr_lock.c):
void thr_lock_delete(THR_LOCK *lock) {
…
pthread_mutex_destroy(&lock->mutex);
…
list_delete(thr_lock_thread_list, &lock->list);
…
}
7/26/2016
Wisconsin Multifacet Project
111
Download