Improving Pipelined Soft Processors with Multithreading

Martin Labrecque
Gregory Steffan
ECE Dept. University of Toronto
Presented at RAAW 2006, Orlando, FL
Processors and FPGAs
• FPGAs increasingly implement SoCs, with CPUs
• Soft processors: processors in the FPGA fabric
[Figure: an FPGA containing a soft processor (datapath with PC, instruction memory, register array, ALU and data memory) next to custom logic]
Soft processors are:
• Easier to program than HDL
• Customizable
Soft processors in Embedded Systems

What do designers care about?
• Minimizing area?
• Matching frequency?
• Hitting performance target?
• Area efficiency: a combined metric
$$\text{Area efficiency} = \frac{\text{Performance}}{\text{Area}} = \frac{\text{Instr. Count} \times \text{Frequency}}{\text{Cycle Count} \times \text{Area}}$$
We trade off 4 criteria (soft processor power is related to area)
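As a rough illustration of the combined metric, the sketch below computes area efficiency in MIPS per 1000 LEs (the unit used on the results slide). The function name and the input numbers are hypothetical and chosen only to show how the four criteria interact; they are not measurements from this talk.

```python
def area_efficiency(instr_count, cycle_count, freq_mhz, area_les):
    """Area efficiency in MIPS per 1000 LEs.

    MIPS = (instructions / cycles) * clock frequency in MHz.
    """
    ipc = instr_count / cycle_count      # instructions per cycle
    mips = ipc * freq_mhz                # million instructions per second
    return mips / (area_les / 1000.0)    # normalize by area in thousands of LEs


# Hypothetical numbers, for illustration only.
baseline = area_efficiency(1_000_000, 2_000_000, 80.0, 1000)
improved = area_efficiency(1_000_000, 1_250_000, 80.0, 1100)
print(f"baseline: {baseline:.1f} MIPS/kLE, improved: {improved:.1f} MIPS/kLE")
```

In this made-up case the second design is 10% larger, yet its lower cycle count more than pays for the extra area, which is the kind of trade-off the metric is meant to expose.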
Multithreading

Replace processor stalls: fill them with instructions from other threads
$$\text{Area efficiency} = \frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$$
When to switch thread?
Every instruction (e.g. Sun’s Niagara)
Convenient technique for in-order processors
Fine-grained multithreading: 1 instr. per thread in round-robin
Avoiding processor stall cycles
[Figure, BEFORE: traditional execution on a 3-stage pipeline (F, E, W) over time]
Data and control hazards create stall cycles
• Multithreading: execute streams of independent instructions
[Figure, AFTER: 3-stage pipeline (F, E, W) over time with instructions from Thread1, Thread2 and Thread3 interleaved]
Ideally, eliminates all stalls
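The before/after picture can be approximated with a toy issue model. This is a minimal sketch under stated assumptions, not the simulator used for the results: each instruction carries a hypothetical latency (the number of cycles before the next instruction of the same thread may issue), and one thread is given the issue slot per cycle in strict round-robin order.

```python
def run(threads):
    """Issue instructions in fine-grained round-robin order.

    threads: list of per-thread latency lists (assumed model, see above).
    Returns (cycles_used, instructions_issued).
    """
    n = len(threads)
    next_instr = [0] * n   # index of the next instruction in each thread
    ready_at = [0] * n     # earliest cycle at which each thread may issue
    total = sum(len(t) for t in threads)
    issued = 0
    cycle = 0
    while issued < total:
        t = cycle % n      # the thread that owns this cycle's issue slot
        if next_instr[t] < len(threads[t]) and cycle >= ready_at[t]:
            ready_at[t] = cycle + threads[t][next_instr[t]]
            next_instr[t] += 1
            issued += 1
        # otherwise the slot is wasted: a stall (bubble)
        cycle += 1
    return cycle, issued


# Hypothetical latencies: 1 = no hazard, 2 or 3 = hazard behind the instruction.
program = [1, 3, 1, 2, 1, 3]

for label, workload in (("single-threaded", [program]),
                        ("3 threads", [program] * 3)):
    cycles, instrs = run(workload)
    print(f"{label}: {instrs} instructions in {cycles} cycles, "
          f"IPC = {instrs / cycles:.2f}")
```

With a single thread, every latency above 1 turns into bubbles. With three threads on this workload, each thread only gets a slot every third cycle, so no latency of 3 or less is ever exposed.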
How useful is multithreading?

Commercial SPs: single-threaded (Nios II, MicroBlaze)
Fort et al. [FCCM’06] have shown:
• multithreaded SP smaller than multiple SPs
• with some performance degradation
We go further by showing that the area efficiency of a multithreaded SP is GREATER THAN the area efficiency of a single-threaded SP
Not straightforward; here is how we did it
Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading
Single-Threaded Processor (simplified)
[Figure: datapath with PC, +4 incrementer, instruction memory, register array, ALU, data memory, forwarding lines, and hazard detection logic]
2-Threaded Processor (simplified)
[Figure: datapath with two PCs, +4 incrementer, instruction memory, register array, ALU, data memory, control, and hazard detection logic]
• Replicate state for each thread
• Simplify control logic
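A behavioral sketch of what "replicate state for each thread" could look like is given below. The class and method names are hypothetical; the sketch assumes one program counter per thread and a single register array that is indexed by thread id and register number, with fetch alternating between threads in round-robin order.

```python
class MultithreadedState:
    """Per-thread architectural state for an N-threaded processor (sketch)."""

    def __init__(self, num_threads, num_regs=32, start_pcs=None):
        self.num_threads = num_threads
        self.num_regs = num_regs
        # One program counter per thread.
        self.pcs = list(start_pcs) if start_pcs else [0] * num_threads
        # One register array holding every thread's registers back to back.
        self.regfile = [0] * (num_threads * num_regs)
        self.current = 0   # thread that fetches next

    def _index(self, thread, reg):
        # The thread id selects a bank, the register number an entry in it.
        return thread * self.num_regs + reg

    def read_reg(self, thread, reg):
        return self.regfile[self._index(thread, reg)]

    def write_reg(self, thread, reg, value):
        if reg != 0:       # MIPS register 0 is hardwired to zero
            self.regfile[self._index(thread, reg)] = value & 0xFFFFFFFF

    def next_fetch(self):
        """Return (thread, pc) for this cycle and advance to the next thread."""
        thread = self.current
        pc = self.pcs[thread]
        self.pcs[thread] = pc + 4
        self.current = (thread + 1) % self.num_threads
        return thread, pc


state = MultithreadedState(2, start_pcs=[0x0, 0x100])
print([state.next_fetch() for _ in range(4)])
```

Because consecutive pipeline slots belong to different threads, a write by one thread never lands in another thread's bank, which is one way to see why the control logic can be simplified.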
Additional storage for multiple threads
N x: program counters, registers, data mem.
• More efficiently done in FPGA than in ASIC
• Increase memory size while preserving frequency
Multithreading builds on the strengths of FPGAs
Measurement Infrastructure
[Figure: measurement flow]
Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc) and RTL of single-thread processors from the SPREE System [FPGA’06]
Modelsim RTL Simulator: 1. Cycle Count
Quartus II 5.0 CAD Software, Stratix 1S40C5: 2. Resource Usage, 3. Clock Frequency, 4. Power
We can measure area/performance/energy accurately
Evaluation methodology

• Same benchmark running on all threads
• Some mixed-benchmark results in the paper
• Run until completion of the last thread
• Same instruction space
• We present results with fixed-latency on-chip RAM
• We are implementing a solution for off-chip RAM
Processors: 3, 5 and 7 stages
Pipe3: F/D, R/EX/M, WB (1174 LEs, 78.3 MHz)
Pipe5: F, D, R/EX1, EX2/M, WB (1283 LEs, 86.79 MHz)
Pipe7: F, D, R, EX1, EX2/M, EX3/WB1, WB2 (1557 LEs, 100.59 MHz)
Legend: F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
Best of each pipeline depth generated by SPREE
By default: thread count = number of pipeline stages
Area efficiency results
[Figure: bar chart of area efficiency (MIPS / 1000 LEs), single-threaded vs. MT, for the 3-, 5- and 7-stage processors; MT gains annotated as 33%, 77% and 106%]
• Area efficiency is most improved with deeper pipelines
• 3- and 7-stages have similar area efficiency
IPC results for 3, 5 and 7 stages
[Figure: IPC (instructions per cycle) per benchmark (gol, bitcnts, vlc, iquant, quant, fir, fft, des, crc, bubble_sort, Mean); ideal IPC = 1]
[Figure: mean IPC of pipe3_mt, pipe5_mt and pipe7_mt, normalized to the single-threaded processor]
24%, 45% and 104% more instructions per cycle, respectively
Improvements to the Baseline Multithreaded Soft Processors
• Optimize away unpipelined multi-cycle paths
• Selection of architectural features
1) Multiplier implementation
2) Number of registers
3) Number of threads
Combination of techniques optimizing area efficiency
1- Changing multiplication support
[Figure: register file, Hi/Lo registers, multiplier and MUX]
• Default MIPS has Hi/Lo registers
• 3-operand multiplies (Nios II and MicroBlaze)
– Two instructions compute high and low parts
– Avoids replicating support for the Hi and Lo registers
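The difference between the two multiply styles can be sketched as follows, for the unsigned 32-bit case only. The function names are hypothetical and are not the actual Nios II or MicroBlaze mnemonics: `mult_hilo` models a single multiply that writes dedicated Hi and Lo registers, while `mul_lo` and `mul_hi_unsigned` model two 3-operand instructions that each write an ordinary destination register.

```python
MASK32 = 0xFFFFFFFF


def mult_hilo(a, b):
    """Hi/Lo style: one multiply produces both halves in dedicated registers."""
    product = (a & MASK32) * (b & MASK32)
    return product >> 32, product & MASK32   # (Hi, Lo)


def mul_lo(a, b):
    """3-operand style: low 32 bits of the product, written to a normal register."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mul_hi_unsigned(a, b):
    """3-operand style: high 32 bits of the unsigned product."""
    return ((a & MASK32) * (b & MASK32)) >> 32


a, b = 0xFFFFFFFF, 2
hi, lo = mul_hi_unsigned(a, b), mul_lo(a, b)
print(hex(hi), hex(lo))
```

Both styles yield the same 64-bit result. The 3-operand form simply keeps it in the general register file, so there is no separate Hi/Lo state to replicate per thread.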
2- Reducing the register file
Not all registers are utilized [RAAW’06]
• Many threads can combine the savings
• Results in saved memory blocks
[Figure: two threads with registers 1..N each occupy 2N entries; trimmed to 1..N-k each, they occupy 2N-2k]
• Applicable to the 5-stage processor
• Slightly increases cycle count due to increased register pressure
• Allows area and frequency improvements
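To see why several threads need to combine their savings before a memory block is freed, here is a back-of-the-envelope sketch. The 4096-bit block size, the 32-bit word width and the register counts are assumptions made for illustration; the sketch also ignores any duplication of the register file for extra read ports.

```python
import math


def regfile_blocks(num_threads, regs_per_thread, word_bits=32, block_bits=4096):
    """Number of on-chip memory blocks needed to hold all threads' registers."""
    total_bits = num_threads * regs_per_thread * word_bits
    return math.ceil(total_bits / block_bits)


# Hypothetical example: 5 threads, each trimmed from 32 to 25 registers.
print(regfile_blocks(5, 32), "blocks with full register sets")
print(regfile_blocks(5, 25), "blocks after trimming")
```

Under these assumptions, trimming seven registers from one thread alone would not cross a block boundary, but the same trim applied to all five threads does.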
Reducing the Number of Threads
• Usually: # threads = # pipeline stages
• Last stage: writeback to non-conflicting register
[Figure: 3-stage pipeline timing diagram (F, E, W) over time, with legend Thread1, Thread2, Thread3]
Positive effect on the 5 and 7-stage processors
Helps meet processing latency deadline (shorter round-robin)
Gives designers more flexibility
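The latency argument can be made concrete with a small calculation. This sketch assumes strict round-robin issue with no stalls, so a given thread issues one instruction every `num_threads` cycles; the function name and the 100 MHz clock are hypothetical.

```python
def per_thread_interval_ns(num_threads, freq_mhz):
    """Time between two consecutive instructions of the same thread."""
    cycle_ns = 1000.0 / freq_mhz     # clock period in nanoseconds
    return num_threads * cycle_ns    # one issue slot per round-robin rotation


# Hypothetical 100 MHz clock: dropping from 7 to 6 threads shortens the rotation.
for threads in (7, 6):
    print(threads, "threads:", per_thread_interval_ns(threads, 100.0), "ns")
```

A shorter rotation means each individual thread makes progress sooner, which is what helps when a single thread has a processing deadline.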
Conclusions





Multithreaded SPs outperform single-threaded SPs
• Assumes independent threads
• Assumes use of on-chip memory
33%, 77% and 106% increase in area efficiency
Demonstrated that benefits increase with pipeline depth
Techniques to optimize away unpipelined multi-cycle paths
Selection and combination of architectural features
• Multiplier support
• Number of threads
• Number of registers
Commercial FPGA makers should have a multithreaded SP
Long term goals

Multiple multithreaded soft processors
• Research using off-chip memory hierarchy
• Study of synchronization mechanisms
• Make it easy to target and scale up for non-HW people
Experimental Testbed: NetFPGA
–Virtex-II Pro
–4 x 1 Gbps Ethernet
–PCI board
–64 MB DDR2 DRAM


Stanford/Xilinx platform
Collaboration with network researchers
Perform real high bandwidth experiments
Thank you
Martin Labrecque (martinl@eecg.utoronto.ca)
Gregory Steffan
ECE Dept. University of Toronto
Where do threads come from?

Event processing
• e.g. multiple sources of interrupts
Packet processing
• e.g. CAN, RS-485, Ethernet, etc.
Systems handling requests
• e.g. bus controllers
For now, we consider independent threads
SPREE vs Nios II [IEEE TCAD’07]
[Figure: scatter plot of geomean wall clock time (us) versus area (equivalent LEs) for SPREE processors and Altera Nios II/e, Nios II/s and Nios II/f; axes annotated "faster" and "smaller"]
Architectural Parameters Used in SPREE
Multiplication Support
• Hardware FU or software routine
Shifter implementation
• Flipflops, multiplier, or LUTs
Pipelining
• Depth (2-7 stages)
• Forwarding lines
We focus on core microarchitecture (for now)
Contributions on Multithreaded Soft Processors
Multithreaded SPs dominate single-threaded processors in area and IPC
Demonstrated that these benefits increase with the # of pipeline stages
Explained techniques to optimize away unpipelined multi-cycle paths
Selection of architectural features
• Number of threads
• Number of registers
• Multiplier support
Combination of techniques that optimize area efficiency
Unpipelined Multicycle Paths

Example of 3-stage pipeline with multicycle paths on loads, stores, shifts and multiplies
[Figure: ST pipeline stages F/D, R/EX, EX, WB; MT pipeline stages F/D, R/EX, M, WB]
Not practical in ST because of hazard detection
Important source of IPC improvement
Changing multiplication support
[Figure: bar chart of normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr) for Hi/Lo vs. 3op multiplies on the 3-, 5- and 7-stage processors]
For multithreaded SPs, 3op-multiplies always win
Reducing the Number of Threads
[Figure: bar chart of normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr) for pipe3_mt_2T, pipe5_mt_4T and pipe7_mt_6T]
Positive effect on the 5- and 7-stage processors
SPREE System
(Soft Processor Rapid Exploration Environment)
■ Input: Processor description (ISA and datapath)
■ Made of hand-coded components
■ SPREE System
1. Verify ISA against datapath
2. Datapath Instantiation
3. Control Generation
■ Output: Synthesizable Verilog RTL
Multithreading

$$\text{Area efficiency} = \frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$$
Replace processor stalls: fill them with instructions from other threads
When to switch thread?
Multiple techniques
Most common: every instruction (e.g. Sun’s Niagara)
[Figure: instructions from T1, T2 and T3 interleaved in the pipeline over time]
Fine-grained multithreading: 1 instr. per thread in round-robin

Removed load and branch delay slots in the code