The Microarchitecture of FPGA-Based Soft Processors
Peter Yiannacouras, Jonathan Rose, Greg Steffan
University of Toronto, Electrical and Computer Engineering

Processors and FPGAs
■ Processors are present in many digital systems
■ Soft processors are implemented in the FPGA fabric, alongside custom logic
■ Our goal is to study the architecture of soft processors

Motivation for Understanding Soft Processor Architecture
1. Soft processors are popular
   – 16% of FPGA designs use a soft processor (FPGA Journal, November 2003)
   – This number has increased and will continue to increase
2. Soft processors are end-user customizable
   – Application-specific architectural tradeoffs can be tuned by designers

Don't we already understand processor architecture?
■ Not accurately/completely: traditional studies give accurate cycle-to-cycle behaviour, but only estimated area/power and no clock frequency impact
■ Not in the FPGA domain: lookup tables instead of transistors, and dedicated RAMs and multipliers are fast
⇒ Must revisit processor architecture in the FPGA context

Research Goals
1. Generate soft processor implementations → a system for generating RTL
2. Develop a measurement methodology → metrics for comparing soft processors
3. Develop an understanding of architectural tradeoffs → analyze the area/performance/power space
⇒ Explore soft processor architecture experimentally

Soft Processor Rapid Exploration Environment (SPREE)
[Figure: two inputs, an ISA description and a datapath description, are fed to SPREE, which outputs RTL]

Input: Instruction Set Architecture (ISA) Description
■ A graph of Generic Operations (GENOPs); edges indicate the flow of data
■ Example: MIPS ADD (add rd, rs, rt) becomes FETCH → RFREAD, RFREAD → ADD → RFWRITE
■ The ISA is currently fixed (a subset of MIPS I)

Input: Datapath Description
■ An interconnection of hand-coded components, which allows efficient synthesis
■ Described using C++
■ SPREE component library: instruction fetch, register file, ALU, shifter, multiplier, data memory, writeback
■ Limited to simple in-order issue pipelines

Step 1. ISA vs. Datapath Verification
■ Components are also described using GENOPs
■ SPREE verifies that the datapath components can implement every GENOP in the ISA graph (e.g. FETCH, RFREAD, ADD, and RFWRITE map onto the instruction fetch, register file, and ALU)

Step 2. Datapath Instantiation
■ Multiplexer insertion
■ Unused connection/component removal

Step 3. Control Generation
■ Control logic is generated for each datapath component
■ A laborious step performed automatically

Output: Verilog RTL Description
[Figure: the instantiated datapath with generated control attached to each component, emitted as Verilog RTL]

Back-end Infrastructure
■ Benchmarks: MiBench, Dhrystone 2.1, RATES, XiRisc
■ The generated RTL is simulated with Modelsim (RTL simulator) to obtain (1) cycle count
■ The RTL is synthesized with Quartus II 4.2 (CAD software) to a Stratix 1S40 to obtain (2) resource usage, (3) clock frequency, and (4) power
■ In this work we can measure each accurately!

Metrics for Measurement
■ Area: equivalent Stratix logic elements (LEs), using relative silicon areas for RAMs/multipliers
■ Performance: wall-clock time (cycle count ÷ clock frequency), arithmetic mean across the benchmark set
■ Energy: dynamic energy per instruction (e.g. nJ/instr), excluding I/O
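To make the metric definitions concrete, the sketch below shows how the three measurements combine. It is only an illustration under our own naming; BenchmarkRun, wall_clock_us, and the other identifiers are hypothetical and not part of SPREE's tool flow.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-benchmark measurements (not SPREE's actual data structures).
struct BenchmarkRun {
    uint64_t cycles;        // cycle count from RTL simulation (Modelsim)
    double   dynamic_nj;    // dynamic energy in nJ, excluding I/O
    uint64_t instructions;  // dynamic instruction count
};

// Wall-clock time in microseconds: cycle count divided by clock frequency.
double wall_clock_us(const BenchmarkRun& b, double fmax_mhz) {
    return static_cast<double>(b.cycles) / fmax_mhz;  // cycles / MHz = microseconds
}

// Arithmetic mean of wall-clock time across the benchmark set.
double mean_wall_clock_us(const std::vector<BenchmarkRun>& runs, double fmax_mhz) {
    double total = 0.0;
    for (const auto& b : runs) total += wall_clock_us(b, fmax_mhz);
    return total / runs.size();
}

// Energy metric: dynamic energy per instruction (nJ/instr).
double energy_per_instr_nj(const BenchmarkRun& b) {
    return b.dynamic_nj / static_cast<double>(b.instructions);
}
```

Dividing a cycle count by a frequency in MHz yields microseconds directly, which matches the wall-clock-time axes used in the results that follow.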
Trace-Based Verification
■ Ensure SPREE generates functional processors
■ Each benchmark application is run on the generated RTL in Modelsim (RTL simulator) and on MINT (instruction-set simulator); the two execution traces are compared
■ All generated soft processors are verified this way

Architectural Exploration Results

Architectural Features Explored
■ Hardware vs. software multiplication
■ Shifter implementation
■ Pipelining: depth, organization, forwarding

Validation of SPREE Through Comparison to Altera's Nios II
■ Nios II has three variations:
   – Nios II/e – unpipelined, no HW multiplier
   – Nios II/s – 5-stage, with HW multiplier
   – Nios II/f – 6-stage, dynamic branch prediction
■ Caveats – not a completely fair comparison:
   – Very similar but tweaked ISA
   – Nios II supports exceptions, an OS, and caches; we do not, and save on the hardware costs
■ We believe the comparison is meaningful

SPREE vs. Nios II
[Figure: average wall-clock time (µs) vs. area (equivalent LEs) for the SPREE processors and the Altera Nios II/e, II/s, and II/f; the highlighted SPREE design has a 3-stage pipe, HW multiply, and a multiply-based shifter]
■ Competitive and can dominate (9% smaller, 11% faster)

Hardware vs. Software Multiplication
■ Hardware multiply is fast but not always needed
■ It wastes area (220 LEs) and can waste energy
[Figure: energy per instruction (nJ/instr) for hardware vs. software multiply on the 2-, 3-, and 5-stage processors]
■ Total energy is wasted if there are few multiply instructions, and saved if there are many

Shifter Implementation
■ Shifters are expensive in FPGAs
■ We explore three implementations:
   1. Serial shifter (shift register)
   2. Multiplier-based barrel shifter (hard multiplier)
   3. LUT-based barrel shifter (multiplexer tree)
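The three shifter organizations can be summarized behaviourally as follows. The hardware versions live in the generated Verilog, so this C++ sketch (with function names of our own choosing) only illustrates the logic and cost structure of each option.

```cpp
#include <cstdint>

// 1. Serial shifter: one bit position per cycle using a shift register.
//    Cheap in LEs, but a shift by k costs k cycles.
uint32_t serial_shift_left(uint32_t value, unsigned amount, unsigned& cycles) {
    cycles = amount;                      // models the variable latency
    for (unsigned i = 0; i < amount; ++i)
        value <<= 1;
    return value;
}

// 2. Multiplier-based barrel shifter: reuse the dedicated hard multiplier,
//    since shifting left by k is multiplication by 2^k.
uint32_t mul_based_shift_left(uint32_t value, unsigned amount) {
    return value * (uint32_t{1} << amount);   // 2^amount fed to the multiplier
}

// 3. LUT-based barrel shifter: log2(32) = 5 multiplexer stages,
//    each conditionally shifting by a power of two.
uint32_t lut_based_shift_left(uint32_t value, unsigned amount) {
    for (unsigned stage = 0; stage < 5; ++stage)
        if (amount & (1u << stage))
            value <<= (1u << stage);
    return value;
}
```

The serial shifter needs one cycle per bit position, the multiplier-based version trades LEs for the dedicated multiplier block, and the LUT-based version spends LEs on the five multiplexer stages.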
Performance-Area of Different Shifter Implementations
[Figure: average wall-clock time (µs) vs. area (equivalent LEs) for the serial, multiplier-based, and LUT-based shifters on the 3-, 4-, and 5-stage pipelines]
■ The multiplier-based shifter is a good compromise

Pipeline Depth
■ Explored pipelines of 2 to 7 stages; 1-stage and 6-stage pipelines are not interesting
■ Pipeline organizations:
   – 2-stage:       F/D/R/EX/M | WB
   – 3-stage:       F/D | R/EX/M | WB
   – 4-stage:       F | D | R/EX/M | WB
   – 5-stage (new): F | D | R/EX | EX/M | WB
   – 7-stage:       F | D | R | EX | EX | EX/M | WB

Pipeline Depth and Performance
[Figure: clock frequency (MHz), cycles per instruction, and average wall-clock time (µs) for the 2- through 7-stage pipelines]
■ The 2-stage and 7-stage pipelines suffer from nuances
■ The 3-, 4-, and 5-stage pipelines perform the same

Pipeline Organization Tradeoff
■ 4-stage (A): F | D | R/EX/M | WB
■ 4-stage (B): F/D | R/EX | EX/M | WB
[Figure: average wall-clock time (µs) vs. area (equivalent LEs) for the two 4-stage organizations]
■ 4-stage (B) is 15% faster but requires up to 70 more LEs

Pipeline Forwarding
■ Pipeline: F | D/R | EX | M | WB
■ Forwarding prevents stalls when data hazards occur
■ MIPS has two source operands (rs & rt)
■ Four forwarding configurations are possible (a behavioural sketch of the forwarding mux appears at the end of these slides):
   – No forwarding
   – Forward rs
   – Forward rt
   – Forward both rs and rt

Pipeline Forwarding (results)
[Figure: average wall-clock time (µs) vs. area (equivalent LEs) for no forwarding, forward rs, forward rt, and forward rs & rt on the 3-, 4-, and 5-stage pipelines]
■ Up to 20% speed improvement when forwarding both operands
■ The rs operand benefits more than rt (9% faster)

Summary of Presented Architectural Conclusions
■ Hardware multiplication can be wasteful
■ The multiplier-based shifter is a sweet spot
■ 3-stage pipelines are attractive
■ Tradeoffs exist within pipeline organization
■ Forwarding improves performance by up to 20% and favours the rs operand

Future Work
■ Explore other exciting architectural axes: branch prediction, aggressive forwarding, ISA changes, VLIW datapaths, caches and memory hierarchy, compiler optimizations
■ Port to other devices
■ Explore aggressive customization
■ Add exceptions and OS support
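As referenced in the Pipeline Forwarding slide above, the sketch below illustrates forwarding being enabled or disabled per operand. The structure and field names (ExMemStage, forward_operand, and so on) are our own assumptions for illustration, not SPREE's generated RTL.

```cpp
#include <cstdint>

// Hypothetical snapshot of the EX/M pipeline register; field names are ours.
struct ExMemStage {
    bool     writes_reg;   // the instruction ahead will write the register file
    uint8_t  dest_reg;     // its destination register
    uint32_t result;       // the value it will write back
};

// Forwarding mux for one source operand: if the instruction ahead is about to
// write the register we are reading, bypass the register file and use its
// result directly, avoiding a stall.
uint32_t forward_operand(uint8_t src_reg, uint32_t regfile_value,
                         const ExMemStage& exmem, bool forwarding_enabled) {
    if (forwarding_enabled && exmem.writes_reg &&
        exmem.dest_reg == src_reg && src_reg != 0)   // $0 is hardwired in MIPS
        return exmem.result;
    return regfile_value;
}

// The four configurations in the study correspond to enabling this mux
// independently for rs and rt.
struct OperandValues { uint32_t rs, rt; };
OperandValues read_operands(uint8_t rs, uint8_t rt,
                            uint32_t rs_from_rf, uint32_t rt_from_rf,
                            const ExMemStage& exmem,
                            bool forward_rs, bool forward_rt) {
    return { forward_operand(rs, rs_from_rf, exmem, forward_rs),
             forward_operand(rt, rt_from_rf, exmem, forward_rt) };
}
```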