FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations Michael K. Papamichael, James C. Hoe, Onur Mutlu papamix@cs.cmu.edu, jhoe@ece.cmu.edu, onur@cmu.edu Computer Architecture Lab at Our work was supported by NSF. We thank Xilinx and Bluespec for their FPGA and tool donations. 5/3/2011 Simulation in Computer Architecture Slow for large-scale multiprocessor studies Full-system fidelity + long benchmarks How can we make it faster? Speed, accuracy, flexibility trade-off Full-system simulators sacrifice accuracy for speed and flexibility Accuracy Speed Flexibility Accelerate simulation with FPGAs Can simulate up to millions of gates Orders of magnitude simulation speedup CALCM Computer Architecture Lab at Carnegie Mellon 2 The FIST Project Explores fast NoC models for full-system simulations FPGA-friendly, but avoid direct implementation Low error, many topologies, >10M packets/sec Simpler requirements of full-system simulation Estimate packet latencies, capture high-order effects R P L2 L1 R P L2 L1 4x4 5x5 P L2 L1 R FPGA (Xilinx LX110T) R P L2 L1 FPGA area requirements for state-of-the-art mesh NoC* CALCM *NoC RTL from http://nocs.stanford.edu/router.html 3 Computer Architecture Lab at Carnegie Mellon FIST Approach View NoC as set of routers/links Abstract router into black-box Represent by load-delay curves R R R R R Specific to each router configuration and traffic pattern R N R N R N R N R N Latency R R N State Logic Buffers R Load R N R N R N N CALCM Computer Architecture Lab at Carnegie Mellon 4 FIST Approach Treat each hop as a set of load-delay curves Trade-off between model complexity and fidelity Keep track of load at each node To track router load monitor traffic over window of time R N R N R N R N R N R N Latency R N R N Load R N CALCM Computer Architecture Lab at Carnegie Mellon 5 FIST in Action Route packet from source to destination Determine routers that will be traversed Sum up the delays for each traversed router Index load-delay curves using current load at each router R N S R N R N N R N R N R R N R N D packet delay R N CALCM Computer Architecture Lab at Carnegie Mellon 6 Outline Introduction to FIST FIST-based Network Models Evaluation Related Work & Conclusions CALCM Computer Architecture Lab at Carnegie Mellon Outline Introduction to FIST FIST-based Network Models Evaluation Related Work & Conclusions CALCM Computer Architecture Lab at Carnegie Mellon Putting FIST Into Context Detailed network models Cycle-accurate network simulators (e.g. BookSim) Analytical network models Typically study networks under synthetic traffic patterns Updated Curves Use Curves Train Curves Feedback Network models within full-system simulators Model network within a broader simulated system Assign delay to each packet traversing the network Traffic generated by real workloads CALCM Computer Architecture Lab at Carnegie Mellon 9 Offline and Online FIST Offline FIST Detailed network simulator generates curves offline Can use synthetic or actual workload traffic Load curves into FIST and run experiment Detailed Network Model Online FIST (tolerates dynamic changes in network behavior) Initialization of curves same as offline Periodically run detailed network simulator on the side Compare accuracy and, if necessary, update curves Provide feedback and receive updated curves Detailed Network Model CALCM Computer Architecture Lab at Carnegie Mellon 10 Online Training in Action Example with no initial training 50 45 40 35 30 25 20 15 10 5 0 … After Training … Actual Latency Estimated Latency Before Training … 0 66 132 198 264 330 396 462 1884 1950 2016 2082 2148 2214 2280 2346 2412 2478 Latency El Ellapsed Cycles 1000s) El Elapsed cycles (in(in 1000s) CALCM Computer Architecture Lab at Carnegie Mellon FIST Applicability “FIST-Friendly” Networks Exhibit stable, predictable behavior as load fluctuates Actual traffic similar to training traffic FIST Limitations Depends on fidelity, representativeness of training models Higher loads and large buffers can limit FIST’s accuracy High network load increased packet latency variance Large buffers increased range of observed packet latencies Cannot capture fine-grain packet interactions Cannot replace cycle-accurate detailed network models FIST only as good as its training data CALCM Computer Architecture Lab at Carnegie Mellon 12 Applying FIST to NoCs NoCs affected by on-chip limitations and scarce resources Employ simple routing algorithms Operate at low loads Usually simple deterministic routing NoCs usually over-provisioned to handle worst-case Have been observed to operate at low injection rates Small buffers On-chip abundance of wires reduces buffering requirements Amount of buffering in NoCs is limited or even eliminated NoCs are “FIST-Friendly” CALCM Computer Architecture Lab at Carnegie Mellon 13 Outline Introduction to FIST FIST-based Network Models Evaluation Related Work & Conclusions CALCM Computer Architecture Lab at Carnegie Mellon FIST Implementations Software Implementation of FIST (written in C++) Implements online and offline FIST models Hardware Implementation (written in Bluespec) Precisely replicates software-based FIST Block diagram of architecture Packet Descriptors Src Dest Size Router Elements Routing Logic Pick routers Calculate Delay Load Tracker Curve BRAM Handle load tracking & delay queries CALCM Computer Architecture Lab at Packet Delays Partial Delays Carnegie Mellon 15 Peeking Under The Hood Tracking Latency T 7 6 5 4 3 2 S H D R0 R2 Store-and-forward 0 R0 R1 R2 R1 H 5 2 3 4 5 6 10 7 15 20 25 30 T H 2 3 4 5 6 7 T H 2 3 4 5 6 7 T Packet Latency = 30 R0 Latency = 9 R1 Latency = 14 R2 Latency = 7 Wormhole-routed 0 R0 R1 R2 H 5 2 3 4 10 15 5 6 H 2 3 7 20 30 T 4 H Injection latency 25 5 6 2 3 7 T 4 5 T Packet Latency = 30 R0 Latency = R1 Latency = R2 Latency = ? Traversal latencies Use separate injection and traversal latency curves per router Similar issues arise for load tracking & dynamic training CALCM Computer Architecture Lab at Carnegie Mellon Methodology Examined online and offline FIST models Network and system configuration Consists of control, data and coherence packets Offline and Online FIST models with two curves per router 26 SPEC CPU2006 benchmarks of varying network intensity 8 SPLASH-2 and 2 PARSEC workloads Traffic generated by cache misses 4x4, 8x8, 16x16 wormhole-routed mesh Each network node hosts core+coherent L1 and a slice of L2 Multiprogrammed and multithreaded workloads Replaced cycle-accurate NoC model in tiled CMP simulator Curves represent injection and traversal latency at each router Initial training using uniform random synthetic traffic Please see paper for more details! CALCM Computer Architecture Lab at Carnegie Mellon 17 Accuracy Results (offline) 8x8 mesh using FIST offline model Average Latency and Aggregate IPC Error Latency/IPC Error (in %) 15% 10% Latency Error IPC Error IPC Error < 4% 5% 0% -5% -10% Latency Error < 8% -15% MP (Low) MP (Med) CALCM Computer Architecture Lab at MP (High) Carnegie Mellon MT (SPL/PAR) 18 Accuracy Results (online) 8x8 mesh using FIST online model Error(in %) LatencyError Latency/IPC Average Latency and Aggregate IPC Error 0.1 10% 0.05 5% Latency Error IPC Error 0%0 -5% -0.05 -10% -0.1 MP (Low) MP (Med) MP (High) MT (SPL/PAR) Both Latency and IPC Error below 3% CALCM Computer Architecture Lab at Carnegie Mellon 19 What about a very simple model? 8x8 mesh using hop-based model 1 How 0.8 80% Error(in %) Latency Latency/IPC Error atency Error 0.6 C Error 60% does simple network model affect high-order results? Latency Error Latency Error IPC Error IPC Error 0.4 40% 0.2 20% 0 0% FIST models always within this range -0.2 -20% -0.4 -40% -0.6 -10% -60% -0.8 -80% MP (Low) MP (Med) MP (High) MT (SPL/PAR) Very high error for both latency and IPC! CALCM Computer Architecture Lab at Carnegie Mellon 20 Performance Results Qualitative comparison Speed (packets/sec) 100M 10M 1M 100K 10K HW FIST SW FIST SW NoC Sims (e.g. BookSim) 1K Complexity / Level of Detail SW-based speedup results for 16x16 mesh Offline FIST: 43x Online FIST: 18x HW-based speedup (offline): ~3-4 orders of magnitude CALCM Computer Architecture Lab at Carnegie Mellon 21 Hardware Implementation Results FPGA resource usage & clock frequency Different mesh configurations Xilinx Virtex-5 LX155T FPGA FIST Model Size 4x4 8x8 12x12 16x16 20x20 FPGA Area 4% 15% 34% 60% 94% Freq. 380 MHz 263 MHz 250 MHz 214 MHz 200 MHz Direct Implementation FPGA Area Freq. 61% 130 MHz Will not fit - FIST can scale to large NoCs with many routers CALCM Computer Architecture Lab at Carnegie Mellon Outline Introduction to FIST FIST-based Network Models Evaluation Related Work & Conclusions CALCM Computer Architecture Lab at Carnegie Mellon Related Work Vast body of work on network modeling Abstract network modeling Analytical models, hardware prototyping, etc. Performance vs. accuracy trade-off studies [Burger 95] Load-delay curve representation of network [Lugones 09] FPGAs for network modeling Cycle-accurate fidelity at the cost of limited scalability Time-multiplexing can help with scalability [Wang 10] But still suffer from high implementation complexity CALCM Computer Architecture Lab at Carnegie Mellon Conclusions & Future Directions Conclusions Full-system simulators can tolerate small inaccuracies FIST can provide fast SW- or HW-based NoC models SW model provides 18x-43x average speedup w/ <2% error HW model can scale to 100s routers with >1000x speedup NoCs within a CMP are “FIST-friendly” But not all networks good candidates for FIST modeling Future Directions FPGA-friendly NoC models at multiple levels of fidelity Configurable generation of hardware NoC models CALCM Computer Architecture Lab at Carnegie Mellon Thanks! Questions? CALCM Computer Architecture Lab at Carnegie Mellon