BERKELEY PAR LAB

A Case for FAME: FPGA Architecture Model Execution
Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanovic, David Patterson
The Parallel Computing Lab, UC Berkeley
ISCA '10

A Brief History of Time

- Hardware prototyping was initially popular among architects, but prototyping each point in a design space is expensive
- Simulators became the popular, cost-effective alternative
- Software Architecture Model Execution (SAME) simulators are the most popular kind
- SAME performance scaled along with uniprocessor performance scaling

The Multicore Revolution

- Abrupt change to multicore architectures
- HW and SW systems are larger and more complex:
  - Timing-dependent nondeterminism
  - Dynamic code generation
  - Automatic tuning of application kernels
- We need more simulation cycles than ever

The Multicore Simulation Gap

- As the number of cores increases exponentially, the time to model a target cycle increases accordingly
- SAME is difficult to parallelize because of cycle-by-cycle interactions
- Relaxed simulation synchronization may not work
- We must bridge the simulation gap

One Decade of SAME

            Median Instructions    Median #Cores   Median Instructions
            Simulated/Benchmark                    Simulated/Core
ISCA 1998   267M                   1               267M
ISCA 2008   825M                   16              100M

- The effect: dramatically shorter simulation runs (~10 ms of simulated target time)

FAME: FPGA Architecture Model Execution

- The SAME approach provides inadequate simulation throughput and latency
- We need a fundamentally new strategy to maximize useful experiments per day
- We want the flexibility of SAME and the performance of hardware
- Ours: FPGA Architecture Model Execution (FAME), named by analogy with SAME
- Why FPGAs?
  - FPGA capacity is scaling with Moore's Law; a few cores now fit on a single FPGA
  - Highly concurrent programming model with cheap synchronization

Non-FAME: FPGA Computers

- FPGA computers: using FPGAs to build a production computer
- Example: RAMP Blue (UC Berkeley, 2006)
  - 1008 MicroBlaze cores
  - No MMU; message passing only
  - Requires lots of hardware: 21 BEE2 boards (a full rack), 84 FPGAs
  - RTL directly mapped to the FPGA, so it is time-consuming to modify
- Cool and useful, but not a flexible simulator

FAME: System Simulators in FPGAs

[Diagram: two different target systems, A (two cores with private I$/D$, shared L2$/interconnect, DRAM) and B (five cores with private I$/D$, three L2$ banks, shared interconnect, DRAM), each simulated on the same host system, the FAME simulator]

A Vast FAME Design Space

- The FAME design space is even larger than SAME's
- Three dimensions of FAME simulators:
  - Direct or decoupled: does one host cycle model one target cycle?
  - Full RTL or abstract RTL?
  - Host single-threaded or host multi-threaded?
- See the paper for a full FAME taxonomy!

FAME Dimension 1: Direct vs. Decoupled

- Direct FAME: compile the target RTL to the FPGA
  - Problem: common ASIC structures map poorly to FPGAs
  - Solution: resource-efficient multi-cycle FPGA mapping
- Decoupled FAME: decouple host cycles from target cycles
  - Full RTL is still modeled, so timing accuracy is still guaranteed

[Diagram: a target register file with four read ports (Rd1-Rd4) and two write ports (W1, W2) mapped to a decoupled host implementation with two read ports, one write port, and an FSM that spends multiple host cycles per target cycle]

FAME Dimension 2: Full RTL vs. Abstract RTL

- Decoupled FAME models the full RTL of the target machine, but:
  - We don't have full RTL in the initial design phase
  - Full RTL is too much work for design-space exploration
- Abstract FAME: model the target RTL at a high level
  - For example, split the timing and functional models (à la SAME)
  - Also enables runtime parameterization: run different simulations without re-synthesizing the design
- The advantages of abstract FAME come at a cost: model verification
  - Timing of the abstract model is not guaranteed to match the target machine

[Diagram: target RTL abstracted into a separate functional model and timing model]

FAME Dimension 3: Single- or Multi-threaded Host

- Problem: a big manycore doesn't fit on an FPGA, even abstracted
- Problem: long host latencies reduce utilization
- Solution: host-multithreading: a single hardware pipeline holds multiple copies of CPU state (per-thread PCs and register files) and interleaves them, emulating many target CPUs

[Diagram: four target CPUs emulated by one multithreaded emulation engine on the FPGA: a single pipeline with per-thread PCs and GPRs sharing one I$ and D$]

Metrics besides Cycles: Power, Area, Cycle Time

- FAME simulators determine how many cycles a program takes to run
- Computing power, area, and cycle time is the SAME old story:
  - Push the target RTL through a VLSI flow
  - Use analytical or empirical models
- Collecting event statistics for model inputs is much faster than with SAME

RAMP Gold: A Multithreaded FAME Simulator

- Rapid, accurate simulation of manycore architectural ideas using FPGAs
- Initial version models 64 SPARC V8 cores with a shared memory system on a $750 board
- Hardware FPU and MMU; boots an OS
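The host-multithreading idea can be illustrated with a toy software sketch. This is our own simplified illustration, not RAMP Gold's actual microarchitecture: one host "pipeline" loop round-robins among several target CPU contexts, so each host cycle advances a different target core and no single context monopolizes the pipeline. The `TargetContext` fields and the two-entry stand-in "program" are hypothetical.

```python
# Toy sketch of host-multithreading (illustrative only, not RAMP Gold's
# real design): one host pipeline interleaves many target CPU contexts.

from dataclasses import dataclass, field

@dataclass
class TargetContext:
    """Per-target-core architectural state held by the host pipeline."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 8)
    retired: int = 0  # target instructions completed so far

def run_host_pipeline(program, n_cores, host_cycles):
    """Interleave n_cores target contexts on a single host pipeline.

    `program` is a list of (dest_reg, immediate) pairs standing in for a
    real ISA; each host cycle executes one instruction for one context.
    """
    ctxs = [TargetContext() for _ in range(n_cores)]
    for cycle in range(host_cycles):
        ctx = ctxs[cycle % n_cores]           # round-robin thread select
        dest, imm = program[ctx.pc % len(program)]
        ctx.regs[dest] += imm                 # "execute" the instruction
        ctx.pc += 1
        ctx.retired += 1
    return ctxs

# Usage: 4 target cores share one host pipeline for 40 host cycles,
# so each target core retires 10 instructions.
ctxs = run_host_pipeline([(0, 1), (1, 2)], n_cores=4, host_cycles=40)
```

The design point this illustrates: target cores cost only state (a `TargetContext`), not a duplicated pipeline, which is why host-multithreading lets a modest FPGA emulate a large manycore.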
                   Cost            Performance (MIPS)   Simulations per day
Simics (SAME)      $2,000          0.1 - 1              1
RAMP Gold (FAME)   $2,000 + $750   50 - 100             100

RAMP Gold Target Machine

[Diagram: the target is 64 SPARC V8 cores, each with private I$ and D$, connected through a shared L2$/interconnect to DRAM]

RAMP Gold Model

- One-socket manycore target (the 64-core SPARC V8 machine above)
- Split functional/timing model, both in hardware:
  - Functional model pipeline executes the SPARC V8 ISA and holds architectural state
  - Timing model pipeline captures pipeline timing detail and holds timing state
- Host multithreading of both the functional and timing models
- Functional-first, timing-directed simulation
- Built for Xilinx Virtex-5 systems
[RAMP Gold, DAC '10]

Case Study: Manycore OS Resource Allocation

- Spatial resource allocation in a manycore system is hard: combinatorial explosion in the number of apps and the number of resources
- Idea: use predictive models of app performance to make allocation easier on the OS
  - HW partitioning for performance isolation (so the models still work when apps run together)
- Problem: evaluating the effectiveness of the resulting scheduling decisions requires running hundreds of schedules for billions of cycles each
  - Simulation-bound: 8.3 CPU-years for Simics!
- See the paper for details of the app modeling strategy

Case Study: Manycore OS Resource Allocation (Results)

[Bar charts: normalized runtime of the worst, chosen, and best schedules for the Synthetic Only and PARSEC Small workloads, then again with PARSEC Large added]

- The technique appears to perform very well for synthetic or reduced-input workloads, but is lackluster in reality!

RAMP Gold Performance

- FAME (RAMP Gold) vs. SAME (Simics) performance on the PARSEC parallel benchmarks with large input sets
- More than 250x faster than the full-system simulator for a 64-core target system

[Bar chart: geometric-mean speedup for 4 to 64 target cores, for functional-only, functional + cache/memory (g-cache), and functional + cache/memory + coherency (GEMS) configurations]

Researcher Productivity is Inversely Proportional to Latency

- Simulation latency is even more important than throughput:
  - How long before the experimenter gets feedback?
  - How many experimenter-days are wasted if there was an error in the experimental setup?

        Median Latency (days)   Maximum Latency (days)
FAME    0.04                    0.12
SAME    7.50                    33.20

Fallacy: FAME is Too Hard

- FAME simulators are more complex, but not greatly so; efficient, complete SAME simulators are also quite complex
- Most experiments only need to change the timing model
  - RAMP Gold's timing model is only 1000 lines of SystemVerilog
  - We modeled Globally Synchronized Frames [Lee08] in 3 hours and 100 lines of code
- Corollary fallacy: architects don't need to write RTL
  - We design hardware; we shouldn't be scared of HDL

Fallacy: FAME Costs Too Much

- Running SAME in the cloud (EC2) is much more expensive!
  - FAME: 5 XUP boards at $750 each; electricity at $0.10 per kWh
  - SAME: EC2 Medium-High instances at $0.17 per hour

        Runtime (hours)   Cost for first experiment   Cost for next experiment   Carbon offset (trees)
FAME    257               $3,750                      $10                        0.1
SAME    73,000            $12,500                     $12,500                    55.0

- Are architects good stewards of the environment? SAME uses the energy of 45 seconds of the Gulf oil spill!
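The cost comparison can be sanity-checked from the figures quoted on the slide itself. The arithmetic below is our own back-of-the-envelope check, using only those quoted numbers; FAME's first-experiment cost is dominated by the one-time board purchase, while every SAME experiment pays the full instance-hour bill again (the slide evidently rounds $12,410 up to $12,500).

```python
# Back-of-the-envelope check of the FAME vs. SAME cost table,
# using only the figures quoted on the slide.

def fame_first_experiment_cost(boards=5, board_price=750):
    # Capital cost dominates FAME's first experiment: 5 XUP boards at $750 each.
    return boards * board_price

def same_experiment_cost(hours=73_000, rate=0.17):
    # EC2 instance-hours at $0.17/hour; every repeat run costs this again.
    return hours * rate

fame_first = fame_first_experiment_cost()   # $3,750, matching the table
same_each = same_experiment_cost()          # about $12,410; the slide rounds to $12,500
```

This is why the "next experiment" column diverges so sharply: FAME's marginal cost is just electricity (~$10 at $0.10/kWh), while SAME's marginal cost equals its first-experiment cost.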
Fallacy: Statistical Sampling Will Save Us

- Sampling may not make sense for multiprocessors: timing is now architecturally visible
  - It may be OK for transactional workloads
- Even when sampling is appropriate, runtime is dominated by functional warming, so we still need FAME
  - The FAME simulator ProtoFlex (CMU) was originally designed for exactly this purpose
- Parallel programs of the future will likely be dynamically adaptive and auto-tuned, which may render sampling useless

Challenge: Simulator Debug Loop Can Be Longer

- It takes 2 hours to push RAMP Gold through the CAD tools
- Software RTL simulation for debugging the simulator is also very slow; the SAME debug loop is only minutes long
- But the sheer speed of FAME eases some tasks: try debugging and porting a complex parallel program in SAME

Challenge: FPGA CAD Tools

- Compared to ASIC tools, FPGA tools are immature
  - We encountered 84 formally tracked bugs while developing RAMP Gold, including several in the formal verification tools!
- This is by far FAME's biggest barrier (help us, industry!)
- On the bright side, the more people using FAME, the better

When Should Architects Still Use SAME?

- SAME is still appropriate in some situations:
  - Pure functional simulation
  - ISA design
  - Uniprocessor pipeline design
- FAME is necessary for manycore research with modern applications

Conclusions

- FAME uses FPGAs to build simulators, not computers
- FAME works, it's fast, and we're using it
- SAME doesn't cut it, so use FAME!
- Thanks to the entire RAMP community for contributions to the FAME methodology
- Thanks to NSF, DARPA, Xilinx, SPARC International, IBM, Microsoft, Intel, and UC Discovery for funding support
- RAMP Gold source code is available: http://ramp.eecs.berkeley.edu/gold