ATLAS (a.k.a. RAMP Red): Parallel Programming with Transactional Memory
Njuguna Njoroge and Sewook Wee
Transactional Coherence and Consistency, Computer Systems Lab, Stanford University
http://tcc.stanford.edu/prototypes

Why We Built ATLAS
Multicore processors expose the challenges of multithreaded programming
• Transactional Memory (TM) simplifies parallel programming:
  As simple as coarse-grain locks
  As fast as fine-grain locks
What is currently missing for evaluating TM:
• Fast TM prototypes to develop software on
FPGAs' improving capabilities make them attractive for CMP prototyping:
• Fast: can operate at > 100 MHz
• More logic, memory, and I/Os
• Larger libraries of pre-designed IP cores
ATLAS: an 8-processor Transactional Memory system
• 1st FPGA-based hardware TM system
• Member of the RAMP initiative: RAMP Red

ATLAS Provides …
Speed
• > 100x speed-up over a SW simulator [FPGA 2007]
Rich software environment
• Linux OS
• Full GNU environment (gcc, glibc, etc.)
Productivity
• Guided performance tuning
• Standard GDB environment + deterministic replay

TCC's Execution Model
Transaction
• Building block of a program
• Critical region
• Executed atomically & isolated from others

[Execution trace: CPU 0, CPU 1, and CPU 2 execute transactional code (e.g., ld 0xdddd, ld 0xeeee, st 0xbeef, ld 0xaaaa, ld 0xbbbb) speculatively in parallel; a finishing transaction arbitrates for the commit token and commits; meanwhile CPU 2 speculatively executes ld 0xbeef.]
When the committed store to 0xbeef reaches CPU 2, which speculatively loaded that address, CPU 2's transaction is violated: it undoes its work and re-executes its code.
In TCC: All Transactions, All The Time [PACT 2004]

CMP Architecture for TCC
[Diagram: each processor has a register checkpoint; the data cache tags speculatively-read (R7:0) and speculatively-written (W7:0) bits per line, with a 2-ported tag array and a single-ported data array; stores are also recorded in a Store Address FIFO; commit and refill buses connect the caches.]
• Commit: read pointers from the Store Address FIFO and flush the addresses whose W bits are set
• Violation detection: compare incoming commit addresses against lines whose R bits are set

ATLAS: 8-way CMP on a BEE2 Board
• 4 user FPGAs, each with two PowerPC (PPC) cores, instruction caches, and TCC caches, for a total of 8 TCC CPUs
• PPCs, TCC caches, BRAMs, and buses run at 100 MHz
• Control FPGA: a Linux PPC at 300 MHz, the DDR2 DRAM controller, and the commit token arbiter; the switch fabric runs at 100 MHz
• The Linux PPC launches TCC applications, handles system services for the TCC PowerPCs, and handles I/O (disk, network)

ATLAS Software Overview
Stack: TM application → TM API → ATLAS profiler → ATLAS subsystem → Linux OS → ATLAS HW on BEE2
• TM applications can be written easily with the TM API
• The ATLAS profiler provides runtime profiling and guided performance tuning
• The ATLAS subsystem provides Linux OS support for the TM application

ATLAS Subsystem
• The Linux PPC invokes the parallel work and transfers the initial context to TCC PPC0 through PPC7
• The TCC PPCs execute transactions (commit, violation handling)
• The Linux PPC joins the parallel work, and the application exits with its statistics

ATLAS System Support
• A TCC PPC requests OS support (TLB miss, system call)
• The Linux PPC regenerates and services the request
• The Linux PPC replies back to the requestor
• Serialize if the request is irrevocable:
  • System call
  • Page-out

Coding with the TM API: Histogram

  int main(int argc, char* argv[]) {
      /* … sequential code … */
      TM_PARALLEL(run, NULL, numCpus);
      /* … sequential code … */
  }

  /* static scheduling with interleaved access to A[] */
  void* run(void* args) {
      int i = TM_GET_THREAD_ID();
      for (; i < NUM_LOOP; i += TM_GET_NUM_THREAD()) {
          TM_BEGIN();
          bucket[A[i]]++;
          TM_END();
      }
      return NULL;
  }

OpenTM will provide high-level (OpenMP-style) pragmas.

Guided Performance Tuning
TAPE: a light-weight runtime profiler [ICS 2005]
Tracks the most significant violations (longest loss time):
• Violated object address
• PC where the object was read
• Loss time & number of occurrences
• Committing thread's ID and transaction PC
Tracks the most significant overflows (longest duration):
• Overflows occur when speculative state can no longer stay in the TCC cache
• PC where the overflow occurs
• Overflow duration & number of occurrences
• Type of overflow (LRU or write buffer)

Deterministic Replay
All Transactions, All The Time
• TM 101: a transaction executes atomically and in isolation
• TM's illusion: a transaction starts after older transactions finish
So we only need to record the order of commit
• Minimal runtime overhead & footprint: 1 byte / transaction
[Diagram: during the logging run, transactions T0, T1, T2 commit and their order is recorded (LOG: T0 T1 T2); during the replay run, the token arbiter enforces the commit order specified in the log.]

Useful Features of Replay
Monitoring code in the transaction
• Remember: we record only the transaction order
Verification
• The log is not written in stone
• Complete runtime-scenario coverage is possible
Choice of where to run replay:
• On ATLAS itself: HW support for other debugging tools (see next slide)
• On a local machine (your favorite desktop or workstation): runs natively and sequentially on the faster local machine, with seamless access to existing debugging tools

GDB Support
Current status
• GDB integrated with local-machine replay
• GDB provides debuggability while guaranteeing deterministic replay
Below are work in progress:
Breakpoints
• Thread-local BP vs. global BP
• Stop the world by controlling the commit token
Stepping
• Backward stepping: a transaction is always ready to roll back
• Transaction stepping
Unlimited data watch (ATLAS only)
• A separate monitor TCC cache registers the data watches

Conclusion: ATLAS Provides
Speed
• > 100x speed-up over a SW simulator [FPGA 2007]
Software environment
• Linux OS
• Full GNU environment (gcc, glibc, etc.)
Productivity
• TAPE: guided performance tuning
• Deterministic replay
• Standard GDB environment
Future work
• High-level language support (Java, Python, …)

Questions and Answers
tcc_fpga_xtreme@mailman.stanford.edu
ATLAS Team Members
• System hardware: Njuguna Njoroge, PhD candidate
• System software: Sewook Wee, PhD candidate
• High-level languages: Jiwon Seo, PhD candidate
• HW performance: Lewis Mbae, BS candidate
Past contributors
• Interconnection fabric: Jared Casper, PhD candidate
• Undergrads: Justy Burdick, Daxia Ge, Yuriy Teslar