ppt - Compilers Creating Custom Processors

Hot Chips 16
August 24, 2004
OptimoDE: Programmable Accelerator Engines Through Retargetable Customization
Nathan Clark, Hongtao Zhong, Kevin Fan, Scott Mahlke
CCCP Research Group, University of Michigan
http://cccp.eecs.umich.edu
Krisztián Flautner, Koen Van Nieuwenhove
ARM Limited
OptimoDE Overview
• OptimoDE
– A configurable VLIW-styled Data Engine architecture
– Targeted at intensive data processing
• Characteristics
– Very wide performance envelope
• Power / area / speed tradeoff
• Exploiting parallelism in applications
– Unlimited data path configuration options
– User extensible through ISA customization
• Semi-automatic design system
– User-in-the-loop design, retargetable compiler toolchain
OptimoDE in a System On Chip
[Figure: SoC block diagram. Labeled blocks: ARM CPU, Memory, SDRAM Controller, SDRAM, DMA Controller, Interrupt Controller, AHB Bus Matrix, and the OptimoDE Data Engine with its MEM, SRAM, FIFO, I/O, switch, CTRL, DATA 1, and DATA 2 connections.]
OptimoDE Architecture Model
• Functional Units
– ALU, ACU, Multipliers
– Custom
• Memory
– RAM (asynch / synch)
– ROM
• I/O ports
– addressable
– handshake protocol
• Registers
– Register files
• Interconnect
– Direct connection
– Shared bus
• Controller
• All layers required
• Intra-layer configuration
[Figure: layered model showing function units, memories, I/O ports, register files (regs), interconnect layers, and the controller (µC).]
Design Toolchain
[Figure: design toolchain flow in two stages, µA Definition and µA Evaluation. Labeled elements: Librarian and User; OptimoDE Library and User Library; DesignDE (instantiate, save) producing candidate microarchitectures µA .c1, .c2, .c3 with .inc files; DEvelop (set target, #include …) compiling C source such as main() { … = dct(); }; ISS (load, run / profile).]
Compiler Toolchain
• DEvelop
– INPUT: C Source, Description
– OUTPUT: Microcode
• analysis/opti
– Syntax checks
– Dataflow analysis and dataflow graph
• bind
– Match architecture
• schedule
– Optimize code and register use
• Analysis feedback
[Figure: a dataflow graph of * and + operations is bound to functional units and scheduled into cycles 0 to 5.]
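The schedule step places dataflow-graph operations into cycles subject to the available functional units. The following is a minimal sketch of resource-constrained list scheduling, not the DEvelop algorithm itself. It assumes every operation takes one cycle, and the function name, graph, and unit counts are hypothetical.

```python
def list_schedule(ops, deps, units):
    """Assign each operation to a cycle.

    ops:   dict of op id -> op type, e.g. {1: "*", 3: "+"}
    deps:  dict of op id -> set of op ids whose results it needs
    units: dict of op type -> number of functional units of that type
    Assumes single-cycle operations (an illustrative simplification).
    """
    cycle_of = {}
    remaining = set(ops)
    cycle = 0
    while remaining:
        free = dict(units)  # units still unused in this cycle
        issued = []
        for op in sorted(remaining):
            # Ready only if every producer finished in an earlier cycle.
            ready = all(
                d in cycle_of and cycle_of[d] < cycle
                for d in deps.get(op, ())
            )
            if ready and free.get(ops[op], 0) > 0:
                free[ops[op]] -= 1
                cycle_of[op] = cycle
                issued.append(op)
        if not issued:
            raise ValueError("no progress: missing unit type or cyclic graph")
        remaining.difference_update(issued)
        cycle += 1
    return cycle_of


# Hypothetical graph: 3 = 1 + 2, 5 = 3 + 4, where 1, 2, 4 are multiplies.
example_ops = {1: "*", 2: "*", 3: "+", 4: "*", 5: "+"}
example_deps = {3: {1, 2}, 5: {3, 4}}
narrow = list_schedule(example_ops, example_deps, {"*": 1, "+": 1})
wide = list_schedule(example_ops, example_deps, {"*": 2, "+": 1})
```

With one multiplier the example needs four cycles; adding a second multiplier shortens it to three, which is the kind of datapath tradeoff the analysis feedback is meant to expose.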
32-point DCT Microarchitecture
[Figure: datapath diagram with blocks inport, outport, rom, ram_1, ram_2, acu_1, acu_2, acu_3, alu_1, alu_2, imm_1, imm_2, and control.]
• 2 Custom FUs, 2 RAM, 1 ROM, 3 ACU, 2 I/O ports
• Designer responsible for creating custom units manually
Retargetable Customization
• Prototype 2 technologies in OptimoDE
– Automated ISA customization
– Retargetable customization to an “application-area”
• Customizing for 1 application
– Programmability → Nominally programmable
– Critical problem – Cannot sustain performance across similar applications
– How well does a custom ISA generalize?
• 5 encryption algorithms, create custom design for each
• Average loss >80% versus native [MICRO, 2003]
– Proactive generalization creates a retargetable design
Creating Custom Instructions
• Candidate discovery
– Identify customization
opportunities
• Examine program DFG
• Partition DFG at:
– Memory operations
– Unprofitable edges
• Enumerate candidate
subgraphs within each
partition
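The discovery steps above can be sketched in a few lines. This is an illustrative sketch only: it splits the DFG at memory operations (the unprofitable-edge cut is omitted), then enumerates connected subgraphs by brute force, which is exponential and only practical for tiny partitions. The opcode names, the size limit, and the example graph are assumptions, not taken from the OptimoDE tools.

```python
from itertools import combinations

MEMORY_OPS = {"load", "store"}  # assumed opcode names


def partition_dfg(opcodes, edges):
    """Remove memory operations, then return the connected components."""
    compute = {n for n, op in opcodes.items() if op not in MEMORY_OPS}
    adj = {n: set() for n in compute}
    for src, dst in edges:
        if src in adj and dst in adj:  # edges touching memory ops are cut
            adj[src].add(dst)
            adj[dst].add(src)
    seen, parts = set(), []
    for start in sorted(compute):
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in part:
                part.add(n)
                stack.extend(adj[n] - part)
        seen |= part
        parts.append(part)
    return parts, adj


def is_connected(subset, adj):
    subset = set(subset)
    reached, stack = set(), [next(iter(subset))]
    while stack:
        n = stack.pop()
        if n not in reached:
            reached.add(n)
            stack.extend((adj[n] & subset) - reached)
    return reached == subset


def enumerate_candidates(part, adj, max_size=3):
    """List every connected subgraph of 2..max_size nodes in one partition."""
    return [
        combo
        for size in range(2, max_size + 1)
        for combo in combinations(sorted(part), size)
        if is_connected(combo, adj)
    ]


# Hypothetical DFG: two load -> compute -> store chains.
opcodes = {1: "load", 2: "shr", 3: "and", 4: "add", 5: "store",
           6: "load", 7: "xor", 8: "or", 9: "store"}
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (6, 7), (7, 8), (8, 9)]
parts, adj = partition_dfg(opcodes, edges)
found = [enumerate_candidates(p, adj) for p in parts]
```

Here the first partition yields the candidates (2, 3), (3, 4), and (2, 3, 4), and the second yields only (7, 8).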
Grouping and Selection
• Group candidate subgraphs with same structure
• Estimate performance and cost for each group
• Greedily select groups to implement in hardware subject to budget
• 1 CFU created per group
[Figure: candidate subgraphs sorted into Groups 1 to 4, annotated with estimates:]
Group 1 – Cost: 2 Adders, Gain: 10,000 Cycles
Group 2 – Cost: 0.5 Adders, Gain: 1,000 Cycles
Groups 3 and 4 – Cost: 1 Adder each, Gains: 1,500 Cycles and 2,500 Cycles
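A minimal sketch of greedy selection under a cost budget follows. The slide says only that groups are selected greedily; ranking by gain per unit cost is an assumed heuristic, and the function name is hypothetical. The costs and gains are the figures from the slide, except that assigning 1,500 cycles to Group 3 and 2,500 cycles to Group 4 is a guess, since the extracted labels do not make the pairing clear.

```python
def greedy_select(groups, budget):
    """Pick groups in order of gain per unit cost until the budget runs out.

    groups: list of (name, cost_in_adders, gain_in_cycles)
    budget: total cost allowed, in the same adder units
    Returns (chosen names, total estimated gain).
    """
    chosen, total_gain, remaining = [], 0, budget
    ranked = sorted(groups, key=lambda g: g[2] / g[1], reverse=True)
    for name, cost, gain in ranked:
        if cost <= remaining:  # skip anything that no longer fits
            chosen.append(name)
            remaining -= cost
            total_gain += gain
    return chosen, total_gain


groups = [
    ("Group 1", 2.0, 10_000),
    ("Group 2", 0.5, 1_000),
    ("Group 3", 1.0, 1_500),  # gain assignment assumed
    ("Group 4", 1.0, 2_500),  # gain assignment assumed
]
selection = greedy_select(groups, budget=3.0)
```

With a budget of 3 adders, this picks Group 1 and Group 4 for an estimated 12,500 cycles; each chosen group would then become one CFU.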
Proactively Generalize Groups
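• Cost-effectively extend group functionality to enable reuse
• Wildcard – multiple functionality at nodes
• Subsumed – configurable interconnect to bypass nodes
[Figure: a candidate dataflow graph from Input 1 and Input 2 to Output using >>, |, +, and the constants 0x8 and 0xFF, shown next to its generalized form, whose nodes carry the alternatives (0x8, 0x4), (+,-), and (|,&).]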
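The two generalization mechanisms can be modeled in software as follows. This is a hedged sketch, not the hardware design from the talk: a wildcard node is represented as a set of allowed operations, and a subsumed node as one that may be bypassed. The chain-shaped unit, the class and function names, and the 32-bit width are all assumptions made for illustration.

```python
import operator

# Operations a node might implement; the selection is illustrative.
OPS = {
    ">>": operator.rshift,
    "+": operator.add,
    "-": operator.sub,
    "|": operator.or_,
    "&": operator.and_,
}
MASK32 = 0xFFFFFFFF  # assume a 32-bit datapath


class Node:
    """One CFU node: a set of selectable ops, optionally bypassable."""

    def __init__(self, allowed_ops, bypassable=False):
        self.allowed_ops = set(allowed_ops)
        self.bypassable = bypassable


def run_cfu(nodes, config, value, operands):
    """Evaluate a chain of nodes; a config entry of None bypasses that node."""
    for node, choice, operand in zip(nodes, config, operands):
        if choice is None:
            if not node.bypassable:
                raise ValueError("node cannot be bypassed")
            continue
        if choice not in node.allowed_ops:
            raise ValueError(f"unsupported operation {choice!r}")
        value = OPS[choice](value, operand) & MASK32
    return value


# Hypothetical generalized unit: shift, then |/& (bypassable), then +/-.
cfu = [Node({">>"}), Node({"|", "&"}, bypassable=True), Node({"+", "-"})]
original = run_cfu(cfu, [">>", "&", "+"], 0x1234, [0x8, 0xFF, 1])
reused = run_cfu(cfu, [">>", None, "-"], 0x1234, [0x4, 0, 2])
```

The same unit serves both patterns: the first configuration computes ((x >> 8) & 0xFF) + 1, and the second skips the middle node to compute (x >> 4) - 2.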
Native Speedups
[Figure: bar chart of Speedup (y-axis 0 to 9) for 3Des, AES, Blowfish, Md5, Rc4, and SHA. Series: ARM 926EJ (200 MHz); OptimoDE (333 MHz, 5 Issue: 3 ALU, 1 Mem, 1 Brn); OptimoDE + 1 CFU (333 MHz).]
Importance of Generalization
[Figure: bar chart of Speedup (y-axis 0 to 7) for each pairing of 3des, Blowfish, Rc4, AES, and Sha. Series: Base, Generalized. Key: application run – application designed for.]
Case Study - Md5
[Figure: Md5 Speedup (y-axis 0 to 10) versus CFU Cost Budget (Relative to 32-bit ripple-carry adder) (x-axis 0 to 10).]
OptimoDE Design for this Point
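[Figure: datapath with ALU 1, ALU 2, ACU, SRAM, and CFU, each fed by register files (RF), plus Control Memory. Two CFU dataflow graphs are shown: one from Inputs 1 to 4 to Output using three + nodes, >> and << with the constants 0x1B and 0x5, and |; the other from Inputs 1 to 3 to Output using ^, &, and ^.]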
Die Area Breakdown
• OptimoDE = 5.5 mm² in 0.13 µm
• ARM 926EJ = 5.0 mm² in 0.13 µm
• Breakdown
– Baseline Design: 78%
– Control Memory: 11%
– Reg. File and Interconnect: 9%
– CFU: 2%
[Figure: pie chart of the breakdown beside the datapath diagram (ALU 1, ALU 2, ACU, SRAM, CFU, RF, Control Memory).]
Conclusions
• OptimoDE
– Configurable VLIW-style data engine architecture
– Automated tools for implementing embedded signal
and data processing solutions
• Automatic retargetable customization
– Customized design combined with cost-effective
generalization
– Performance programmability: performance stability across a family of similar applications
For More Information
• CCCP group website
– cccp.eecs.umich.edu
• ARM OptimoDE information
– www.arm.com/products/CPUs/families/OptimoDE.html
Designing for a Domain
[Figure: Fraction of Native Speedup Attained (y-axis 0 to 1) for 3des, AES, Blowfish, Rc4, Sha, Md5, and Mean, plotted against Apps Designed For: Rc4; Rc4, Sha; Blowfish, Rc4, Sha; 3des, Blowfish, Rc4, Sha; 3des, AES, Blowfish, Rc4, Sha.]