MICRO 44 Talk - Compilers Creating Custom Processors

Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August
University of Michigan
(Intel, Northrop Grumman, UIUC, Princeton)
MICRO-44
December 6, 2011
University of Michigan
Electrical Engineering and Computer Science
Computational Efficiency Landscape
• Energy dilemma
• More gates can fit on a die
• But power constraints limit their use
• To scale performance, need to increase efficiency
[Figure: Performance (GFLOPs) versus Power (Watts) on log scales, with diagonal iso-efficiency lines labeled in Mops/mW and an arrow marked "Better Power Efficiency". Plotted designs: Embedded Processors, Pentium M, Core 2, Core i7, AMD Opteron, IBM Cell, GTX 280, GTX 295, S1070, AMD 6850. Power regions: UltraPortable, Portable with frequent charges, Wall Power, Dedicated Power Network.]
Where Does The Energy Go?
• Energy used in a single-issue RISC in-order core [Dally’08]
• Instruction fetch and decode energy dominates
• Actual execution barely consumes 10%
Plenty of opportunities to save energy…
Increasing Efficiency with Accelerators
Application regularity defines success:
1. Small dominant code segments
2. Little control flow
3. Narrow application set
4. Data parallelism
[Figure: design space with axes Flexibility and Efficiency, Performance. Design points: General Purpose Processors, FPGAs, ASIPs, DSPs, SIMD, Loop Accelerators, ASICs.]
• Accelerators can give 10 – 50X efficiency
Utility Factor for Accelerators
• What fraction of the code gets accelerated?
• Most solutions fail for “irregular” or “general-purpose” code
[Figure: the same Flexibility versus Efficiency, Performance design space (General Purpose Processors, FPGAs, ASIPs, DSPs, SIMD, Loop Accelerators, ASICs), with an unfilled region marked "???".]
Goal: A design to target irregular codes
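The utility-factor question can be made concrete with an Amdahl-style estimate. The sketch below is illustrative only: the function name `overall_energy_gain` and the coverage values are assumptions, not figures from the talk. It assumes a fraction `covered` of the baseline energy is spent in code the accelerator can run, and that the accelerator uses `efficiency` times less energy on that code.

```python
def overall_energy_gain(covered: float, efficiency: float) -> float:
    """Whole-program energy-efficiency gain when only part of the code is accelerated.

    covered:    fraction of baseline energy spent in accelerable code (0..1)
    efficiency: how many times less energy the accelerator uses on that code
    """
    if not 0.0 <= covered <= 1.0:
        raise ValueError("covered must be between 0 and 1")
    if efficiency <= 0.0:
        raise ValueError("efficiency must be positive")
    # Energy left after acceleration, relative to a baseline of 1.0.
    remaining = (1.0 - covered) + covered / efficiency
    return 1.0 / remaining


# Hypothetical coverage values; 10x is the low end of the 10 - 50X range quoted earlier.
for covered in (0.1, 0.5, 0.9):
    print(f"coverage {covered:.0%}: {overall_energy_gain(covered, 10.0):.2f}x overall")
```

Under these assumptions, a 10X accelerator that covers only 10% of the energy yields roughly a 1.1X overall gain, which is why coverage of irregular code matters as much as peak efficiency.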
The BERET Architecture
• A compute engine for “hot regular regions” in irregular codes
• Key insights:
1. Exploits recurring instructions (traces) to save on redundant fetches and decodes
2. Uses a bundled execution model to save on redundant register reads/writes
[Figure: a Program with Hot Regions divided between the CPU and BERET; the CPU and BERET are drawn alongside the L1 I$ and L1 D$ and are connected by copy live-ins and copy live-outs.]
BERET: Bundled Execution of REcurring Traces
Insight 1: Recurring Instructions
• Typical loops in irregular codes are large and control intensive!
• How about such loops?
► We leverage looping traces for savings
1. Straight-line code → simple hardware
2. Typically short → easy to buffer
3. Significant fetch / decode savings for buffered instructions
[Figure: Control Flow Graph (CFG) with basic blocks BB 0 through BB 7 and BB 20, edges annotated with branch probabilities (85% / 15%, 10% / 90%, 50% / 50%), and hot basic blocks highlighted. A looping trace is extracted as BB 1, BB 2, BB 5, BB 20, with side-exit checks labeled "BB 3 exit?" and "BB 4 exit?".]
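One simple way to picture how a looping trace falls out of a profiled CFG is to follow the most likely successor from a seed block until control returns to it. The sketch below assumes exactly that greedy policy; it is not taken from the talk. The function `looping_trace`, the `example_cfg` edges, and their probabilities are hypothetical and only loosely echo the figure.

```python
from typing import Dict, List, Optional, Tuple

CFG = Dict[str, List[Tuple[str, float]]]  # block -> [(successor, probability)]


def looping_trace(cfg: CFG, seed: str, max_blocks: int = 16) -> Optional[Tuple[List[str], float]]:
    """Follow the most likely successor from `seed` until control returns to it.

    Returns (blocks in the trace, probability of finishing one iteration
    without a side exit), or None if no loop back to `seed` is found.
    """
    trace = [seed]
    loop_back = 1.0
    block = seed
    while True:
        successors = cfg.get(block, [])
        if not successors:
            return None  # control leaves the region
        nxt, prob = max(successors, key=lambda edge: edge[1])
        loop_back *= prob
        if nxt == seed:
            return trace, loop_back
        if nxt in trace or len(trace) == max_blocks:
            return None  # inner cycle, or trace too long to buffer
        trace.append(nxt)
        block = nxt


# Hypothetical profile data, not the CFG from the slide.
example_cfg: CFG = {
    "BB1": [("BB2", 0.90), ("BB3", 0.10)],
    "BB2": [("BB5", 0.85), ("BB4", 0.15)],
    "BB3": [("BB5", 1.00)],
    "BB4": [("BB5", 1.00)],
    "BB5": [("BB20", 1.00)],
    "BB20": [("BB1", 0.95), ("BB7", 0.05)],
    "BB7": [],
}

result = looping_trace(example_cfg, "BB1")
if result is not None:
    blocks, probability = result
    print(" -> ".join(blocks), f"(loop-back probability {probability:.2f})")
```

The less likely successors that the walk skips are the side exits of the trace; a trace is worth offloading only when its loop-back probability is high.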
Frequency of Recurring Instructions
Offload stable traces in irregular loops
Insight 2: Bundled Execution
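• Traditional processors issue and execute instructions in isolation…
[Figure: a dataflow graph of LD, +, /, &, >>, <<, and ST operations, shown first as individually issued instructions and then grouped under bundled execution.]
Isolated execution: 11 instrs, 14 reads, 10 writes
Bundled execution: 3 instrs, 6 reads, 2 writes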
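The read and write savings come from values that are produced and consumed inside the same bundle and therefore never touch the register file. The sketch below counts register-file traffic under that assumption. The `Instr` type, the two cost functions, and the six-instruction trace are all made up for illustration (they do not reproduce the graph on the slide), and every value is assumed to have a unique name.

```python
from typing import List, NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str               # value produced ("" for stores and branches)
    op: str
    srcs: Tuple[str, ...]   # values consumed


def isolated_cost(trace: List[Instr]) -> Tuple[int, int, int]:
    """(instructions issued, register reads, register writes) when every
    instruction accesses the register file on its own."""
    reads = sum(len(ins.srcs) for ins in trace)
    writes = sum(1 for ins in trace if ins.dest)
    return len(trace), reads, writes


def bundled_cost(bundles: List[List[Instr]], live_out: Set[str]) -> Tuple[int, int, int]:
    """(bundles issued, register reads, register writes) when values used
    only inside a bundle are forwarded internally."""
    reads = writes = 0
    for index, bundle in enumerate(bundles):
        produced: Set[str] = set()
        external: Set[str] = set()
        for ins in bundle:
            # Only inputs that were not produced earlier in this bundle are read.
            external.update(src for src in ins.srcs if src not in produced)
            if ins.dest:
                produced.add(ins.dest)
        later_uses = {src for later in bundles[index + 1:] for ins in later for src in ins.srcs}
        reads += len(external)
        # Only values needed by a later bundle or after the trace are written back.
        writes += len(produced & (later_uses | live_out))
    return len(bundles), reads, writes


# Hypothetical straight-line trace.
trace = [
    Instr("t1", "LD", ("a",)),
    Instr("t2", "+", ("t1", "b")),
    Instr("t3", "<<", ("t2", "c")),
    Instr("t4", "LD", ("d",)),
    Instr("t5", "&", ("t4", "t3")),
    Instr("", "ST", ("t5", "e")),
]
bundles = [trace[:3], trace[3:]]

print("isolated:", isolated_cost(trace))
print("bundled: ", bundled_cost(bundles, set()))
```

For this made-up trace the counts drop from (6, 10, 5) to (2, 6, 1), the same kind of reduction the slide reports for its own example.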
Efficiency of Bundled Execution
[Figure: Normalized Perf/Power (y-axis, 1 to 2.6) versus Bundle length (x-axis, 1 to 5). All results normalized to a bundle length of 1.]
Bundled execution increases datapath efficiency by more than 2x
BERET Hardware Design
• Hardware design objectives:
► Capable of executing straight-line code in a loop (traces)
► Support for bundled execution of trace instructions
► Handle trace side-exits, and transfer control to the main processor
[Figure: BERET datapath drawn between the I$ and D$. Index bits select config. bits from the Configuration RAM (CRAM); an Input Latch feeds SEB 1, SEB 2, ..., SEB N through a MUX; results pass through an Output Latch to a Store Buffer, an Internal Register File, and a Writeback Bus. An example SEB chains LD, ALU, <<, ALU. Pipeline stages: Configure SEB (1 – 2 cycles), Execute SEB (1 – 5 cycles), Writeback (1 – 2 cycles).]
SEB: Subgraph Execution Block
Compiler Support
1. Trace Detection
2. Mapping traces to SEBs
[Figure: compiler flow from the Program to Hot Traces (with high loop back probability), to the data flow subgraphs of a hot trace, to a Configuration for BERET with SEBs (SEB 0, SEB 1, SEB 2, SEB 3). The example trace reads from the RF and contains LD, ST, ADD, SUB, MPY, AND, OR, SHIFT, and BR operations; BR operations are shown with Assert and exit labels under Control.]
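To give a feel for the second step, the sketch below covers a trace with SEB-sized chunks using a greedy longest-match rule. This is a deliberate simplification and not the compiler's actual algorithm: it treats the trace as a linear list of operation classes, ignores data dependences, and uses invented SEB templates. The names `SEB_TEMPLATES`, `fit_length`, and `map_trace` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical SEB templates: the ordered functional units each SEB chains together.
SEB_TEMPLATES: Dict[str, Tuple[str, ...]] = {
    "SEB0": ("LD", "ALU", "SHIFT", "ALU"),
    "SEB1": ("ALU", "ALU", "ST"),
    "SEB2": ("LD", "ALU", "BR"),
    "SEB3": ("SHIFT", "ALU", "ST"),
}


def fit_length(ops: Sequence[str], start: int, template: Sequence[str]) -> int:
    """Number of consecutive ops from `start` that fit `template` in order.
    Units the ops do not need are skipped (treated as bypassed)."""
    slot = 0
    count = 0
    for op in ops[start:]:
        while slot < len(template) and template[slot] != op:
            slot += 1
        if slot == len(template):
            break
        slot += 1
        count += 1
    return count


def map_trace(ops: Sequence[str], templates: Dict[str, Tuple[str, ...]]) -> List[Tuple[str, List[str]]]:
    """Greedily cover `ops` with the SEB that absorbs the most ops at each step."""
    mapping: List[Tuple[str, List[str]]] = []
    position = 0
    while position < len(ops):
        best_name, best_len = "", 0
        for name, template in templates.items():
            length = fit_length(ops, position, template)
            if length > best_len:
                best_name, best_len = name, length
        if best_len == 0:
            raise ValueError(f"no SEB can execute {ops[position]!r}")
        mapping.append((best_name, list(ops[position:position + best_len])))
        position += best_len
    return mapping


# Hypothetical trace of operation classes.
example_ops = ["LD", "ALU", "SHIFT", "ALU", "ALU", "ST", "LD", "ALU", "BR"]
for seb, chunk in map_trace(example_ops, SEB_TEMPLATES):
    print(seb, chunk)
```

Each chunk corresponds to one bundle, so fewer and longer chunks mean fewer configuration and register-file accesses per trace iteration.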
CPU-BERET Execution Flow
[Figure: execution timeline (Execution Time axis) of a hot region moving between the CPU and BERET.
Copy Live-Ins: registers copied to BERET.
BERET Execution: program executes on BERET as repeated Header and Body iterations, with register file copies RF-0 and RF-1 shown across iterations.
Side Exit: Assert discovered, last iteration squashed.
Copy Live-Outs: back to main processor.]
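The hand-off can be mimicked in a few lines of software. The sketch below is a behavioral model under stated assumptions, not the hardware protocol: it assumes that two register-file copies act as a committed checkpoint and a speculative working copy (one reading of RF-0 and RF-1 in the figure), and that a failed assert discards the working copy. The names `run_on_beret` and `step` are hypothetical.

```python
from typing import Callable, Dict, Iterable

Registers = Dict[str, int]


def run_on_beret(
    cpu_regs: Registers,
    live_ins: Iterable[str],
    live_outs: Iterable[str],
    iteration: Callable[[Registers], bool],
    max_iterations: int = 1_000_000,
) -> int:
    """Run trace iterations until an assert fails; return the number committed.

    `iteration` updates the registers in place and returns False when an
    assert fires, that is, when the trace takes a side exit.
    """
    committed = {name: cpu_regs[name] for name in live_ins}  # copy live-ins
    completed = 0
    while completed < max_iterations:
        working = dict(committed)        # speculative copy for this iteration
        if not iteration(working):
            break                        # side exit: squash the last iteration
        committed = working              # iteration finished cleanly: commit it
        completed += 1
    for name in live_outs:               # copy live-outs back to the main processor
        cpu_regs[name] = committed[name]
    return completed


def step(regs: Registers) -> bool:
    """Hypothetical trace body: accumulate i, then assert the loop continues."""
    regs["acc"] += regs["i"]
    regs["i"] += 1
    return regs["i"] <= regs["n"]


cpu = {"i": 0, "acc": 0, "n": 3, "other": 7}
done = run_on_beret(cpu, live_ins=["i", "acc", "n"], live_outs=["i", "acc"], iteration=step)
print(done, cpu)
```

After the side exit the main processor resumes from the last committed state, so the squashed iteration is re-executed there along whichever path the trace did not cover.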
Energy Savings
[Figure: energy savings, reported for the Training set and the Test set.]
Performance Impact
Concluding Remarks
• Scaling program performance in an energy-constrained environment requires improving computational efficiency
• Most accelerators exploit program regularity for savings
• BERET is a configurable engine that saves energy by:
► Exploiting hot traces to avoid redundant fetches and decodes
► Using a bundled execution model to reduce temporary variable reads and writes
Energy Saving: ~35%
Performance Enhancement: ~10%
Area Overhead: 20%
Questions
• For more, see http://cccp.eecs.umich.edu
Fine Grain Program Phase Behavior
Traditional phases too coarse-grained to match accelerator
[Figure: program execution timeline from 0M to 10M comparing Traditional phases with Fine-grain phases; annotation: accelerate the pink portions.]
Hypothesis of This Work
Irregular programs are composed of fine-grain periods of high degrees of
regularity. We can identify these periods and run them on an accelerator
customized for “simple” execution.