Improving Pipelined Soft Processors with Multithreading

Martin Labrecque, Gregory Steffan
ECE Dept., University of Toronto
Presented at RAAW 2006, Orlando, FL

Processors and FPGAs
• FPGAs increasingly implement SoCs, with CPUs
• Soft processors: processors in the FPGA fabric
[Figure: an FPGA containing a soft processor datapath (PC, instruction memory, register array, ALU, data memory) next to custom logic]
Soft processors are:
• Easier to program than HDL
• Customizable

Soft processors in Embedded Systems
What do designers care about?
• Minimizing area?
• Matching frequency?
• Hitting performance target?
Area efficiency: a combined metric
  Performance / Area = (Instr. Count x Frequency) / (Cycle Count x Area)
We trade off 4 criteria (soft proc. power is related to area)

Multithreading
  (Million Instr. x Frequency) / (# Cycles x Area)
• Replace processor stalls: fill them with instructions from other threads
• When to switch thread? Every instruction (e.g. Sun’s Niagara)
• Convenient technique for in-order processors
Fine-grained multithreading: 1 instr. per thread in round-robin

Avoiding processor stall cycles
BEFORE: traditional execution, 3 stages
[Figure: 3-stage pipeline timeline (F, E, W) for a single instruction stream]
• Data and control hazards create stall cycles
AFTER: multithreading, 3 stages
[Figure: 3-stage pipeline timeline (F, E, W) with instructions from Thread1, Thread2 and Thread3 interleaved]
• Multithreading: execute streams of independent instructions
• Ideally, eliminates all stalls

How useful is multithreading?
• Commercial SPs: single-threaded (Nios II, MicroBlaze)
• Fort et al. [FCCM’06] have shown: a multithreaded SP is smaller than multiple SPs, with some performance degradation
• We go further by showing that the area efficiency of a multithreaded SP is greater than the area efficiency of a single-threaded SP
• Not straightforward; here is how we did it

Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading
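The stall-filling idea from the "Avoiding processor stall cycles" slide can be sketched with a toy issue model. This is only an illustration under stated assumptions, not the processors evaluated in the talk: `simulate`, `programs` and `hazard_gap` are hypothetical names, every hazard is modeled as a fixed minimum distance between a producer and its dependent instruction, and threads are scheduled in strict round-robin.

```python
def simulate(programs, hazard_gap):
    """Toy fine-grained multithreading model (illustrative assumption).

    programs:   one list per thread; each entry is True when that instruction
                depends on the previous instruction of the same thread.
    hazard_gap: minimum number of cycles between issuing a producer and
                issuing its dependent instruction.
    Returns (cycles, instructions_issued).
    """
    num_threads = len(programs)
    next_index = [0] * num_threads      # next instruction to issue, per thread
    last_issue = [None] * num_threads   # cycle of the last issue, per thread
    total = sum(len(p) for p in programs)
    issued = 0
    cycle = 0
    while issued < total:
        t = cycle % num_threads         # strict round-robin thread selection
        i = next_index[t]
        if i < len(programs[t]):
            depends = programs[t][i]
            ready = (
                not depends
                or last_issue[t] is None
                or cycle - last_issue[t] >= hazard_gap
            )
            if ready:
                next_index[t] += 1
                last_issue[t] = cycle
                issued += 1
            # otherwise this cycle is a stall (bubble)
        cycle += 1
    return cycle, issued


if __name__ == "__main__":
    program = [False, True, True, False]   # hypothetical dependence pattern
    for threads in (1, 3):
        cycles, count = simulate([program] * threads, hazard_gap=3)
        print(f"{threads} thread(s): {count} instr in {cycles} cycles, "
              f"IPC = {count / cycles:.2f}")
```

With as many threads as the hazard distance, consecutive instructions of one thread are far enough apart that the model never stalls, which is the "ideally, eliminates all stalls" case.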
Single-Threaded Processor (simplified)
[Figure: datapath with PC, +4 incrementer, instruction memory, register array, ALU, data memory, forwarding lines and hazard detection logic]

2-Threaded Processor (simplified)
[Figure: datapath with two PCs, +4 incrementer, instruction memory, register array, ALU, data memory, control and hazard detection logic]
• Replicate state for each thread
• Simplify control logic

Additional storage for multiple threads
• N x: program counters, registers, data mem.
• More efficiently done in FPGA than in ASIC
• Increase memory size while preserving frequency
Multithreading builds on the strengths of FPGAs

Measurement Infrastructure
[Figure: measurement flow]
• Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc) and RTL feed the Modelsim RTL simulator: 1. Cycle count
• RTL feeds Quartus II 5.0 CAD software, targeting Stratix 1S40C5: 2. Resource usage, 3. Clock frequency, 4. Power
• Single-threaded processors: SPREE System [FPGA’06]
We can measure area/performance/energy accurately

Evaluation methodology
• Same benchmark running on all threads
  – Some mixed-benchmark results are in the paper
• Run until completion of the last thread
• Same instruction space
• We present results with fixed-latency on-chip RAM
• We are implementing a solution for off-chip RAM

Processors: 3, 5 and 7 stages
  Pipeline  Stages                               Area      Frequency
  Pipe3     F/D, R/EX/M, WB                      1174 LEs  78.3 MHz
  Pipe5     F, D, R/EX1, EX2/M, WB               1283 LEs  86.79 MHz
  Pipe7     F, D, R, EX1, EX2/M, EX3/WB1, WB2    1557 LEs  100.59 MHz
  F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
• Best of each pipeline depth generated by SPREE
• By default: thread count = number of pipeline stages

Area efficiency results
[Figure: bar chart of area efficiency (MIPS / 1000 LEs) for single-threaded vs. MT processors at 3, 5 and 7 stages; bar labels 77%, 33%, 106%]
• Area efficiency is most improved with deeper pipelines
• 3- and 7-stages have similar area efficiency

[Figure: IPC (instructions per cycle) per benchmark (gol, bitcnts, vlc, iquant, quant, fir, fft, des, crc, bubble_sort) and mean; ideal IPC = 1]
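The area-efficiency metric used in these results (MIPS per 1000 LEs) follows directly from the formula on the "Soft processors in Embedded Systems" slide. The sketch below is illustrative: `area_efficiency` and `improvement_percent` are hypothetical helper names, the instruction and cycle counts are made-up inputs, and only the 1174 LEs / 78.3 MHz pair is taken from the pipeline slide (assumed here to belong to the 3-stage processor).

```python
def area_efficiency(instr_count, cycle_count, frequency_mhz, area_les):
    """Return area efficiency in MIPS per 1000 LEs.

    MIPS = (instructions / cycles) * clock frequency in MHz.
    """
    ipc = instr_count / cycle_count
    mips = ipc * frequency_mhz
    return mips / (area_les / 1000.0)


def improvement_percent(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0


if __name__ == "__main__":
    # Area and frequency from the pipeline slide; the instruction and
    # cycle counts below are invented purely for illustration.
    single = area_efficiency(1_000_000, 1_250_000, 78.3, 1174)
    print(f"single-threaded: {single:.1f} MIPS / 1000 LEs")
```

Because area appears in the denominator, a multithreaded design can win on this metric even when it is larger, as long as its cycle count and frequency improve by more than its area grows.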
IPC results for 3, 5 and 7 stages
[Figure: mean IPC of pipe3_mt, pipe5_mt and pipe7_mt, normalized to the corresponding single-threaded processor]
• 24%, 45% and 104% more instructions per cycle, respectively

Improvements to the Baseline Multithreaded Soft Processors
• Optimize away unpipelined multi-cycle paths
• Selection of architectural features
  1) Multiplier implementation
  2) Number of registers
  3) Number of threads
• Combination of techniques optimizing area efficiency

1- Changing multiplication support
[Figure: register file, Hi/Lo registers, multiplier and MUX]
• Default MIPS has Hi/Lo registers
• 3-operand multiplies (Nios II and MicroBlaze)
  – Two instructions compute high and low parts
  – Avoids replicating Hi and Lo register support

2- Reducing the register file
• Not all registers are utilized [RAAW’06]
• Many threads can combine the savings
• Results in saved memory blocks
[Figure: two per-thread register ranges 1..N (2N registers in total) reduced to 1..N-k each (2N-2k in total)]
• Applicable to the 5-stage processor
• Slightly increases cycle count due to increased register pressure
• Allows area and frequency improvements

Reducing the Number of Threads
• Usually: # threads = # pipeline stages
• Last stage: writeback to non-conflicting register
[Figure: 3-stage pipeline timeline (F, E, W) with instructions from Thread1, Thread2 and Thread3]
• Positive effect on the 5- and 7-stage processors
• Helps meet processing latency deadline (shorter round-robin)
• Gives designers more flexibility

Conclusions
• Multithreaded SPs outperform single-threaded SPs
  – Assumes independent threads
  – Assumes use of on-chip memory
• 33%, 77% and 106% increase in area efficiency
• Demonstrated that benefits increase with pipeline depth
• Techniques to optimize away unpipelined multi-cycle paths
• Selection and combination of architectural features
  – Multiplier support
  – Number of threads
  – Number of registers
Commercial FPGA makers should have a multithreaded SP

Long term goals
• Multiple multithreaded soft processors
• Research using off-chip memory hierarchy
• Study of synchronization mechanisms
• Make it easy to target and scale up for non-HW people
Experimental Testbed: NetFPGA
  – Virtex-II Pro
  – 4 x 1 Gbps Ethernet
  – PCI board
  – 64 MB DDR2 DRAM
• Stanford/Xilinx platform
• Collaboration with network researchers
• Perform real high bandwidth experiments

Thank you
Martin Labrecque (martinl@eecg.utoronto.ca), Gregory Steffan
ECE Dept., University of Toronto

Where do threads come from?
• Event processing, e.g. multiple sources of interrupts
• Packet processing, e.g. CAN, RS-485, Ethernet, etc.
• Systems handling requests, e.g. bus controllers
For now, we consider independent threads

SPREE vs Nios II [IEEE TCAD’07]
[Figure: geomean wall clock time (us, lower is faster) versus area (equivalent LEs, lower is smaller) for SPREE processors and Altera Nios II/e, Nios II/s and Nios II/f]

Architectural Parameters Used in SPREE
• Multiplication support: hardware FU or software routine
• Shifter implementation: flipflops, multiplier, or LUTs
• Pipelining: depth (2-7 stages), forwarding lines
We focus on core microarchitecture (for now)

Contributions on Multithreaded Soft Processors
• Multithreaded SPs dominate single-threaded processors in area and IPC
• Demonstrated that these benefits increase with the # of pipeline stages
• Explained techniques to optimize away unpipelined multi-cycle paths
• Selection of architectural features
  – Number of threads
  – Number of registers
  – Multiplier support
• Combination of techniques that optimize area efficiency

Unpipelined Multicycle Paths
Example of 3-stage pipeline with multicycle on load, store, shift and multiplies
  ST: F/D, R/EX, EX, WB
  MT: F/D, R/EX, M, WB
• Not practical in ST because of hazard detection
• Important source of IPC improvement

Changing multiplication support
[Figure: normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr) for Hi/Lo versus 3-operand multiplies on the 3-, 5- and 7-stage processors]
For multithreaded SPs, 3op-multiplies always win
Reducing the Number of Threads
[Figure: normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr) for pipe3_mt_2T, pipe5_mt_4T and pipe7_mt_6T]
Positive effect on the 5- and 7-stage processors

SPREE System (Soft Processor Rapid Exploration Environment)
■ Input: Processor description (ISA, datapath)
■ Made of hand-coded components
■ SPREE System
  1. Verify ISA against datapath
  2. Datapath Instantiation
  3. Control Generation
■ Output: Synthesizable Verilog RTL

Multithreading
  (Million Instr. x Frequency) / (# Cycles x Area)
• Replace processor stalls: fill them with instructions from other threads
• When to switch thread?
  – Multiple techniques
  – Most common: every instruction (e.g. Sun’s Niagara)
[Figure: interleaved instructions in the pipeline over time: T1, T2, T3, T1, T2, T3]
Fine-grained multithreading: 1 instr. per thread in round-robin

Removed load and branch delay slots in the code
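The "shorter round-robin" argument for reducing the number of threads can be made concrete with a small calculation. This is a sketch under stated assumptions, not a result from the talk: `issue_interval_ns` and `thread_latency_us` are hypothetical helpers, the pipeline is assumed to issue one instruction per cycle with no stalls, and the clock frequency is an arbitrary input assumed not to change with the thread count.

```python
def issue_interval_ns(num_threads, frequency_mhz):
    """Time between two consecutive issue slots of the same thread
    under strict round-robin, assuming one issue per cycle."""
    cycle_ns = 1000.0 / frequency_mhz
    return num_threads * cycle_ns


def thread_latency_us(instr_per_thread, num_threads, frequency_mhz):
    """Time for one thread to issue `instr_per_thread` instructions,
    assuming no stalls (illustrative best case)."""
    return instr_per_thread * issue_interval_ns(num_threads, frequency_mhz) / 1000.0


if __name__ == "__main__":
    # Hypothetical 100 MHz clock; compare 7 threads against 6 threads.
    for threads in (7, 6):
        print(threads, "threads:",
              thread_latency_us(1000, threads, 100.0), "us per 1000 instr")
```

Under these assumptions each thread waits one slot fewer between its own instructions, so its individual latency drops even though total throughput stays the same.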