ATLAS: FPGA-based HTM Software Development Environment

ATLAS
(a.k.a. RAMP Red)
Parallel Programming with Transactional Memory
Njuguna Njoroge and Sewook Wee
Transactional Coherence and Consistency
Computer System Lab
Stanford University
http://tcc.stanford.edu/prototypes
Why we built ATLAS
 Multicore processors expose the challenges of multithreaded programming
• Transactional Memory simplifies parallel programming
 As simple as coarse-grain locks
 As fast as fine-grain locks
 Currently missing for evaluating TM
• Fast TM prototypes to develop software on
 FPGAs' improving capabilities make them attractive for CMP prototyping
• Fast: can operate at > 100 MHz
• More logic, memory and I/Os
• Larger libraries of pre-designed IP cores
 ATLAS: 8-processor Transactional Memory System
• 1st FPGA-based Hardware TM system
• Member of the RAMP initiative (RAMP Red)
ATLAS provides …
 Speed
• > 100x speed-up over SW simulator [FPGA 2007]
 Rich software environment
• Linux OS
• Full GNU environment (gcc, glibc, etc.)
 Productivity
• Guided performance tuning
• Standard GDB environment + deterministic replay
TCC’s Execution Model
Transaction
• Building block of a program
• Critical region
• Executed atomically & isolated from others
[Figure: execution timeline for CPU 0, CPU 1 and CPU 2. Each CPU repeats Execute Code, Arbitrate, Commit. One transaction stores to 0xbeef (st 0xbeef) and commits that address; a transaction on another CPU that had already loaded it (ld 0xbeef) is violated, undoes its work, and re-executes the load.]
In TCC, All Transactions All The Time [PACT 2004]
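The execute / arbitrate / commit / undo cycle can be sketched in software. The model below is an illustration only, assuming each transaction tracks plain read and write address sets; `txn_t`, `txn_load`, `txn_store`, `txn_commit`, `txn_restart` and `MAX_SET` are hypothetical names, not part of ATLAS or the TM API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SET 16 /* illustrative capacity, not an ATLAS parameter */

/* Hypothetical software model of one in-flight transaction. */
typedef struct {
    uint32_t read_set[MAX_SET];  /* addresses speculatively loaded */
    size_t   n_reads;
    uint32_t write_set[MAX_SET]; /* addresses speculatively stored */
    size_t   n_writes;
    bool     violated;
} txn_t;

/* Record a speculative load. Returns false if the read set is full. */
bool txn_load(txn_t *t, uint32_t addr) {
    if (t->n_reads == MAX_SET) return false;
    t->read_set[t->n_reads++] = addr;
    return true;
}

/* Record a speculative store. Returns false if the write set is full. */
bool txn_store(txn_t *t, uint32_t addr) {
    if (t->n_writes == MAX_SET) return false;
    t->write_set[t->n_writes++] = addr;
    return true;
}

static bool txn_has_read(const txn_t *t, uint32_t addr) {
    for (size_t i = 0; i < t->n_reads; i++)
        if (t->read_set[i] == addr) return true;
    return false;
}

/* The winner of arbitration commits: its write set is compared with every
 * other transaction's read set, and any reader of a committed address is
 * marked violated. Returns the number of newly violated transactions. */
size_t txn_commit(txn_t *winner, txn_t *others[], size_t n_others) {
    assert(winner != NULL && !winner->violated);
    size_t newly_violated = 0;
    for (size_t i = 0; i < n_others; i++) {
        txn_t *o = others[i];
        if (o->violated) continue;
        for (size_t w = 0; w < winner->n_writes; w++) {
            if (txn_has_read(o, winner->write_set[w])) {
                o->violated = true;
                newly_violated++;
                break;
            }
        }
    }
    winner->n_reads = 0;
    winner->n_writes = 0;
    return newly_violated;
}

/* Undo: drop all speculative state so the transaction can re-execute. */
void txn_restart(txn_t *t) {
    t->n_reads = 0;
    t->n_writes = 0;
    t->violated = false;
}
```

With the addresses from the slide, a transaction that stored 0xbeef and commits will violate a transaction that loaded 0xbeef, while one that only loaded 0xaaaa is unaffected.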
CMP Architecture for TCC
[Figure: TCC processor and data cache datapath. Components: processor with register checkpoint; data cache with V, R7:0 and W7:0 bits per line, TAG array (2-ported) and DATA array (single-ported); Store Address FIFO; snoop control; commit control; commit bus (commit address in/out, commit data in/out); refill bus.]
 Speculatively Read Bits: e.g. ld 0xdeadbeef
 Speculatively Written Bits: e.g. st 0xcafebabe
 Commit: read pointers from the Store Address FIFO, flush addresses with W bits set
 Violation Detection: compare the incoming commit address to the R bits
ATLAS 8-way CMP on BEE2 Board
[Figure: BEE2 board block diagram. Four User FPGAs each hold two TCC PPCs, each with an I$ and a TCC$, connected through a User Switch. The control side shows the Linux PPC with I/O (disk, net), the Control Switch, the Commit Token Arbiter and the DDR2 DRAM Controller.]
User FPGAs
 4 FPGAs for a total of 8 TCC CPUs
 PPC, TCC caches, BRAMs and busses run @ 100 MHz
Control FPGA
 Linux PPC @ 300 MHz
• Launch TCC apps here
• Handle system services for TCC PowerPCs
 Fabric runs @ 100 MHz
ATLAS Software Overview
[Figure: software stack, in the order listed on the slide: TM Application; TM API; ATLAS Profiler; ATLAS Subsystem; Linux OS; ATLAS HW on BEE2]
 TM applications are easy to write with the TM API
 The ATLAS profiler provides runtime profiling and guided performance tuning
 The ATLAS subsystem provides Linux OS support for the TM application
ATLAS subsystem
[Figure: timeline across the Linux PPC and TCC PPC0, PPC1, PPC2 … PPC7. Labeled steps: invokes parallel work; transfers initial context; transactions commit or are violated; joins parallel work; exit with app. stats.]
ATLAS System Support
[Figure: request and reply between a TCC PPC and the Linux PPC]
 TCC PPC requests OS support (TLB miss, system call).
 Linux PPC regenerates and services the request.
 Linux PPC replies back to the requestor.
 Serialize, if request is irrevocable
• System Call
• Page-out
Coding with TM API: histogram
int main(int argc, char* argv[]) {
    // … sequential code …
    TM_PARALLEL(run, NULL, numCpus);
    // … sequential code …
    return 0;
}

// static scheduling with interleaved access to A[]
void* run(void* args) {
    int i = TM_GET_THREAD_ID();
    for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
        TM_BEGIN();
        bucket[A[i]]++;   // each bucket update is one transaction
        TM_END();
    }
    return NULL;
}
OpenTM will provide high-level (OpenMP style) pragmas
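To try the histogram loop without ATLAS hardware, the TM API names can be replaced by single-threaded stand-ins. The macros below are hypothetical substitutes written for this sketch, not the ATLAS implementation: they run each logical thread to completion in turn, which is one legal serial order because every bucket update is its own transaction. The input array, `NUM_BUCKETS` and `histogram_bucket` are also invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-threaded stand-ins for the TM API (assumption:
 * the real macros live in an ATLAS header that is not shown here). */
static int tm_cur_thread;
static int tm_num_threads;
#define TM_GET_THREAD_ID()  (tm_cur_thread)
#define TM_GET_NUM_THREAD() (tm_num_threads)
#define TM_BEGIN()          ((void)0)
#define TM_END()            ((void)0)
#define TM_PARALLEL(fn, arg, n)                               \
    do {                                                      \
        tm_num_threads = (n);                                 \
        for (tm_cur_thread = 0; tm_cur_thread < tm_num_threads; \
             tm_cur_thread++)                                 \
            (void)((fn)(arg));                                \
    } while (0)

#define NUM_LOOP    12
#define NUM_BUCKETS 4

/* Made-up input data: four 0s, three 1s, two 2s, three 3s. */
static const int A[NUM_LOOP] = {0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 3, 3};
static int bucket[NUM_BUCKETS];

/* Same loop as on the slide: thread t handles i = t, t + N, t + 2N, ... */
static void *run(void *args) {
    (void)args;
    int i = TM_GET_THREAD_ID();
    for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
        assert(A[i] >= 0 && A[i] < NUM_BUCKETS);
        TM_BEGIN();
        bucket[A[i]]++;
        TM_END();
    }
    return NULL;
}

/* Rebuild the histogram with numCpus logical threads and return the
 * count in bucket b. */
int histogram_bucket(int numCpus, int b) {
    for (int k = 0; k < NUM_BUCKETS; k++) bucket[k] = 0;
    TM_PARALLEL(run, NULL, numCpus);
    return bucket[b];
}
```

Because the interleaved schedule partitions the indices of A[] among threads, the counts should not depend on the number of logical threads chosen.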
Guided Performance Tuning
 TAPE: Light-weight runtime profiler [ICS 2005]
 Tracking most significant violations (longest loss time)
• Violated object address
• PC where object was read
• Loss time & number of occurrences
• Committing thread’s ID and transaction PC
 Tracking most significant overflows (longest duration)
• Overflow: when speculative state can no longer stay in TCC$
• PC where the overflow occurred
• Overflow duration & number of occurrences
• Type of overflow (LRU or Write Buffer)
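The violation fields listed above map naturally onto a small record. The sketch below assumes a fixed-size table that keeps only the entries with the largest accumulated loss time; the replacement policy, the table size, and the names `violation_rec_t`, `tape_record` and `tape_worst` are guesses made for illustration and are not taken from TAPE itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPE_SLOTS 4 /* illustrative table size */

/* One tracked violation, with the fields named on the slide. */
typedef struct {
    uint32_t obj_addr;     /* violated object address */
    uint32_t read_pc;      /* PC where the object was read */
    uint64_t loss_cycles;  /* accumulated loss time */
    uint32_t occurrences;  /* number of occurrences */
    uint8_t  committer_id; /* committing thread's ID (latest) */
    uint32_t committer_pc; /* committing transaction's PC (latest) */
} violation_rec_t;

typedef struct {
    violation_rec_t slot[TAPE_SLOTS];
    size_t used;
} violation_table_t;

/* Record one violation. Repeats of the same (address, read PC) pair are
 * merged; when the table is full, the entry with the smallest loss is
 * replaced only if the new violation lost more time. */
void tape_record(violation_table_t *t, uint32_t obj_addr, uint32_t read_pc,
                 uint64_t loss, uint8_t committer_id, uint32_t committer_pc) {
    assert(t != NULL);
    size_t min_i = 0;
    for (size_t i = 0; i < t->used; i++) {
        violation_rec_t *r = &t->slot[i];
        if (r->obj_addr == obj_addr && r->read_pc == read_pc) {
            r->loss_cycles += loss;
            r->occurrences++;
            r->committer_id = committer_id;
            r->committer_pc = committer_pc;
            return;
        }
        if (r->loss_cycles < t->slot[min_i].loss_cycles) min_i = i;
    }
    size_t dst;
    if (t->used < TAPE_SLOTS) {
        dst = t->used++;
    } else if (loss > t->slot[min_i].loss_cycles) {
        dst = min_i;
    } else {
        return; /* less significant than everything tracked */
    }
    t->slot[dst] = (violation_rec_t){ obj_addr, read_pc, loss, 1,
                                      committer_id, committer_pc };
}

/* Most significant violation (longest loss time), or NULL if none. */
const violation_rec_t *tape_worst(const violation_table_t *t) {
    const violation_rec_t *worst = NULL;
    for (size_t i = 0; i < t->used; i++)
        if (worst == NULL || t->slot[i].loss_cycles > worst->loss_cycles)
            worst = &t->slot[i];
    return worst;
}
```

An overflow table could follow the same shape, with duration in place of loss time and an overflow type field (LRU or write buffer) in place of the committer fields.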
Deterministic Replay
 All Transactions All The Time
• TM 101: Transaction is executed atomically and in isolation
• TM’s illusion: transaction starts after older transactions finish
 Only need to record “the order of commit”
• Minimal runtime overhead & footprint size = 1B / transaction
[Figure: two timelines. Logging execution: T0, T1 and T2 commit their write-sets, and the commit order is recorded (LOG: T0 T1 T2). Replay execution: the token arbiter enforces the commit order specified in LOG.]
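Recording and enforcing the commit order can be sketched with a byte array. This is a minimal model under one stated assumption: the single byte logged per transaction holds the ID of the committing CPU (eight CPUs fit easily). `commit_log_t`, `log_commit`, `replay_try_commit` and `LOG_CAPACITY` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 1024 /* illustrative size */

typedef struct {
    uint8_t order[LOG_CAPACITY]; /* one byte per committed transaction */
    size_t  len;                 /* entries recorded during logging */
    size_t  next;                /* replay cursor */
} commit_log_t;

/* Logging run: the arbiter grants the token as usual and appends the
 * winner's CPU ID. Returns false if the log is full. */
bool log_commit(commit_log_t *log, uint8_t cpu) {
    assert(log != NULL);
    if (log->len == LOG_CAPACITY) return false;
    log->order[log->len++] = cpu;
    return true;
}

/* Replay run: the arbiter grants the commit token only to the CPU that
 * is next in the log. A false return means the requester must wait. */
bool replay_try_commit(commit_log_t *log, uint8_t cpu) {
    assert(log != NULL);
    if (log->next >= log->len || log->order[log->next] != cpu) return false;
    log->next++;
    return true;
}
```

With the slide's log (T0 T1 T2), a request from CPU 2 is refused until CPUs 0 and 1 have committed, which reproduces the original interleaving.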
Useful Features of Replay
 Monitoring code in the transaction
• Remember we only record the transaction order
 Verification
• Log is not written in stone
• Complete runtime scenario coverage is possible
 Choice of running Replay on
• ATLAS itself
 HW support for other debugging tools (see next slide)
• Local machine (your favorite desktop or workstation)
 Runs natively on faster local machine, sequentially
 Seamless access to existing debugging tools
GDB support
 Current status
• GDB integrated with local machine replay
 GDB provides debuggability while guaranteeing deterministic replay
• The items below are work in progress
 Breakpoint
• Thread local BP vs. global BP
• Stop the world by controlling commit token
 Stepping
• Backward stepping: Transaction is ready to roll back
• Transaction stepping
 Unlimited data-watch (ATLAS only)
• Separate monitor TCC cache to register data-watches
Conclusion: ATLAS provides
 Speed
• > 100x speed-up over SW simulator [FPGA 2007]
 Software environment
• Linux OS
• Full GNU environment (gcc, glibc, etc.)
 Productivity
• TAPE: Guided performance tuning
• Deterministic replay
• Standard GDB environment
 Future Work
• High-level language support (Java, Python, …)
Questions and Answers
 tcc_fpga_xtreme@mailman.stanford.edu
 ATLAS Team Members
• System Hardware – Njuguna Njoroge, PhD Candidate
• System Software – Sewook Wee, PhD Candidate
• High level languages – Jiwon Seo, PhD Candidate
• HW Performance – Lewis Mbae, BS Candidate
 Past contributors
• Interconnection Fabric – Jared Casper, PhD Candidate
• Undergrads – Justy Burdick, Daxia Ge, Yuriy Teslar