Spatial Computation Mihai Budiu Computing without General-Purpose Processors

Spatial Computation
Computing without General-Purpose Processors
Mihai Budiu
Microsoft Research – Silicon Valley
Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein
Carnegie Mellon University
Outline
• Intro: Problems of current architectures
[Figure: processor performance by year, 1980 to 2000, log scale from 1 to 1000]
• Compiling Application-Specific Hardware
• ASH Evaluation
• Conclusions
2
Resources
[Intel]
• We do not worry about not having hardware resources
• We worry about being able to use hardware resources
3
Complexity
[Figure: log scale from 10^4 to 10^10, with ALUs marked; gate delay 5ps vs. wire delay 20ps]
Cannot rely on global signals
(clock is a global signal)
4
Complexity
[Figure: log scale from 10^4 to 10^10, with ALUs marked; gate delay 5ps vs. wire delay 20ps]
• Simple, short, unidirectional interconnect
• Automatic translation C → HW
• Simple hw, mostly idle
• No interpretation
• Distributed control, asynchronous
Cannot rely on global signals
(clock is a global signal)
5
Our Proposal:
Application-Specific Hardware
• ASH addresses these problems
• ASH is not a panacea
• ASH “complementary” to CPU
[Diagram: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), both connected to the cache ($) and Memory]
6
Paper Content
• Automatic translation of
C to hardware dataflow machines
• High-level comparison of
dataflow and superscalar
• Circuit-level evaluation: power, performance, area
7
Outline
• Problems of current architectures
• CASH:
Compiling Application-Specific Hardware
• ASH Evaluation
• Conclusions
8
Application-Specific Hardware
C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw
9
Computation
Program:
x = a & 7;
...
y = x >> 2;
IR (dataflow): a, 7 → & → x; x, 2 → >> → y
Circuits: &7 → >>2
Program       IR              Circuits
Operations    Nodes           Pipeline stages
Variables     Def-use edges   Channels (wires)
No interpretation
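The mapping on this slide (operations become nodes, variables become def-use wires) can be sketched in C for the `x = a & 7; y = x >> 2;` example. This is only an illustrative software model, not the CASH compiler's representation; the names `node_t`, `run_chain`, and `slide_example` are hypothetical.

```c
#include <stdint.h>

/* Each node is hard-wired to one operation; there is no instruction
   fetch or decode, which is the sense of "no interpretation". */
typedef uint32_t (*op_fn)(uint32_t, uint32_t);

static uint32_t op_and(uint32_t l, uint32_t r) { return l & r; }
static uint32_t op_shr(uint32_t l, uint32_t r) { return l >> r; }

typedef struct {
    op_fn op;        /* the operation this node performs */
    uint32_t konst;  /* constant second operand (7 and 2 on the slide) */
} node_t;

/* Hypothetical helper: push a value through a linear chain of nodes.
   The local variable `wire` plays the role of a def-use edge. */
uint32_t run_chain(const node_t *nodes, int n, uint32_t input) {
    uint32_t wire = input;
    for (int i = 0; i < n; i++)
        wire = nodes[i].op(wire, nodes[i].konst);
    return wire;
}

/* The slide's program: x = a & 7; y = x >> 2; returns y. */
uint32_t slide_example(uint32_t a) {
    const node_t circuit[] = { { op_and, 7 }, { op_shr, 2 } };
    return run_chain(circuit, 2, a);
}
```

In hardware the two nodes would be separate pipeline stages working concurrently; the sequential loop here only mimics the order in which values flow along the wires.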
10
Basic Computation = Pipeline Stage
[Diagram: an operation (+) followed by a latch; signals: data, valid, ack]
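A pipeline stage with data, valid, and ack signals can be modeled in software roughly as follows. This is a simplified sketch that assumes a one-entry latch and treats the function return values as the ack signal; `stage_t`, `stage_offer`, and `stage_take` are hypothetical names, and the real circuit is asynchronous rather than call-driven.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t data;  /* latched result of the operation */
    bool valid;     /* true while the latch holds an unconsumed result */
} stage_t;

/* Producer side of an adder stage. Returns true (the ack back to the
   producer) only if the latch was free and the sum was stored. */
bool stage_offer(stage_t *s, uint32_t a, uint32_t b) {
    if (s->valid)
        return false;          /* latch occupied: producer must wait */
    s->data = a + b;
    s->valid = true;
    return true;
}

/* Consumer side. Taking the value acts as the ack that frees the latch. */
bool stage_take(stage_t *s, uint32_t *out) {
    if (!s->valid)
        return false;          /* nothing to consume yet */
    *out = s->data;
    s->valid = false;
    return true;
}
```

Because each stage only talks to its immediate neighbors, no global controller or clock is needed to sequence the values.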
11
Distributed Control Logic
[Diagram: global FSM; pipeline stage (+) with rdy and ack signals]
short, local wires
12
MUX: Forward Branches
if (x > 0)
y = -x;
else
y = b*x;
[Diagram: dataflow graph with inputs b, x, and 0; nodes *, -, >, and !; a mux (φ) produces y]
SSA = no arbitration
Conditionals ⇒ Speculation
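The idea that conditionals turn into speculation can be sketched in C: both sides of the branch are computed unconditionally and a mux picks the result. The decoded-mux shape (one predicate per data input) is an assumption made for illustration, and `mux2` and `forward_branch` are hypothetical names.

```c
#include <stdint.h>

/* Decoded mux: each data input has its own one-bit predicate, and the
   caller guarantees exactly one predicate is true, so no arbitration
   between inputs is needed. */
static int32_t mux2(int p0, int32_t d0, int p1, int32_t d1) {
    return p0 ? d0 : (p1 ? d1 : 0);
}

/* if (x > 0) y = -x; else y = b*x;  written as straight-line dataflow. */
int32_t forward_branch(int32_t x, int32_t b) {
    int32_t neg = -x;      /* then-side, computed speculatively */
    int32_t mul = b * x;   /* else-side, computed speculatively */
    int p = x > 0;         /* the ">" node */
    return mux2(p, neg, !p, mul);   /* "!" supplies the other predicate */
}
```

The cost of this style is that both the negation and the multiplication always execute, even though only one result is used.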
13
Memory Access
[Diagram: LD and ST operations connect through a pipelined, arbitrated network to a monolithic memory]
• local communication
• global structures
Future work: fragment this!
14
Outline
• Problems of current architectures
• Compiling ASH
• ASH Evaluation
• Conclusions
15
Evaluating ASH
• C: Mediabench kernels (1 hot function/benchmark)
• CASH core, then Verilog back-end
• Commercial tools: Synopsys, Cadence P/R
• 180nm std. cell library, 2V (~1999 technology)
• ASIC + Mem, simulated in ModelSim (Verilog simulation)
• Output: performance numbers
16
Compile Time
• C: 200 lines
• CASH core: 20 seconds
• Verilog back-end: 10 seconds
• Synopsys, Cadence P/R: 20 minutes, 1 hour
• Result: ASIC + Mem
17
ASH Area
[Bar chart: area in square mm (axis 0 to 8) per Mediabench kernel (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e), split into Mem access and Datapath; reference marks: P4: 217, minimal RISC core]
18
ASH vs 600MHz CPU [.18 μm]
[Bar chart: times slower (axis 0 to 4) per Mediabench kernel (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e) and avg; bar labels in extraction order, not matched to kernels: 3.65, 3.57, 1.93, 1.87, 1.52, 1.55, 1.35, 0.77, 0.60, 1.23, 0.70, 0.53, 0.48]
19
Bottleneck: Memory Protocol
[Diagram: ST and LD operations, LSQ, Memory]
• Enabling dependent operations requires round-trip to memory.
• Limit study: round trip zero time ⇒ up to 5x speed-up.
• Exploring novel memory access protocols.
20
Power
[Bar chart: power in mW (axis 0.0 to 45.0) per Mediabench kernel (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e) and avg; bar labels in extraction order, not matched to kernels: 42.5, 34.4, 28.3, 25.2, 23.6, 21.8, 25.2, 22.5, 21.6, 13.0, 9.3, 9.3, 29.7]
Reference points [mW]: DSP 110; μP 4000; Xeon [+cache] 67000
21
Energy-delay vs. Wattch
[Bar chart: energy-delay vs superscalar (times better), log scale from 1 to 10000, per Mediabench kernel (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e) and avg]
22
Energy Efficiency
[Chart: energy efficiency in operations/nJ, log scale from 0.01 to 1000; annotation: 1000x]
• Dedicated hardware
• ASH media kernels
• Asynchronous μP
• FPGA
• General-purpose DSP
• Microprocessors
23
Outline
Problems of current architectures
+ Compiling ASH
+ Evaluation
= Related work, Conclusions
24
Related Work
• Optimizing compilers
• High-level synthesis
• Reconfigurable computing
• Dataflow machines
• Asynchronous circuits
• Spatial computation
We target an extreme point in the design space:
no interpretation,
fully distributed computation and control
25
ASH Design Point
• Design an ASIC in a day
• Fully automatic synthesis to layout
• Fully distributed control and computation
(spatial computation)
– Replicate computation to simplify wires
• Energy/op rivals custom ASIC
• Performance rivals superscalar
• E×t (energy-delay) 100 times better than any processor
26
Conclusions
Spatial computation strengths
Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Designer productivity
27
Backup Slides
• Absolute performance
• Control logic
• Exceptions
• Leniency
• Normalized area
• Loops
• ASH weaknesses
• Splitting memory
• Recursive calls
• Leakage
• Why not compare to…
• Targeting FPGAs
28
Absolute Performance
[Bar chart: megaoperations per second (axis 0 to 9000) per Mediabench kernel (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e) and avg; series: MOPSall, MOPSspec, MOPS]
29
Pipeline Stage
[Diagram: pipeline stage circuit; blocks labeled C, =, Reg, and D; signals rdyin, rdyout, ackin, ackout, datain, dataout]
Exceptions
• Strictly speaking, C has no exceptions
• In practice hard to accommodate
exceptions in hardware implementations
• An advantage of software flexibility:
PC is single point of execution control
[Diagram: CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation), both connected to the cache ($$$) and Memory]
31
Critical Paths
if (x > 0)
y = -x;
else
y = b*x;
[Diagram: dataflow graph for this code; inputs b, x, and 0; nodes *, -, >, and !; output y]
32
Lenient Operations
if (x > 0)
y = -x;
else
y = b*x;
[Diagram: dataflow graph for this code; inputs b, x, and 0; nodes *, -, >, and !; output y]
Solves the problem of unbalanced paths
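One way to picture a lenient operation is a mux that fires as soon as the selected input has arrived, without waiting for the slower, unselected one. The sketch below assumes a decoded mux and models "not yet arrived" with a `ready` flag; `wire_t` and `lenient_mux` are hypothetical names, not taken from the talk.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool ready;     /* has this wire delivered its value yet? */
    int32_t value;  /* payload, meaningful only when ready is true */
} wire_t;

/* Lenient decoded mux: produces an output once some predicate is known
   to be true and its matching data input has arrived. The other data
   input (for example a slow multiplier) may still be in flight. */
bool lenient_mux(wire_t p0, wire_t d0, wire_t p1, wire_t d1, int32_t *out) {
    if (p0.ready && p0.value && d0.ready) { *out = d0.value; return true; }
    if (p1.ready && p1.value && d1.ready) { *out = d1.value; return true; }
    return false;   /* not enough inputs yet to determine the output */
}
```

For the slide's code with x = 5, the predicate and the negation arrive quickly, so the mux can emit y = -5 while the multiplier result for `b*x` is still pending; that is how the long multiply path stops being critical when it is not selected.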
33
Normalized Area
[Bar chart: per Mediabench kernel (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e) and avg; two axes: Lines/sq mm (0 to 120) and sq mm/kbyte (0 to 2.5)]
34
Control Flow ⇒ Data Flow
[Diagram: three node types with data inputs; labels φ, predicate, p, and !]
• Merge (label)
• Gateway
• Split (branch)
35
Loops
int sum=0, i;
for (i=0; i < 100; i++)
sum += i*i;
return sum;
[Diagram: dataflow loop for i and sum with nodes 0, +1, < 100, *, +, !, and ret]
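The loop on this slide can be read as values circulating around the dataflow graph until the predicate turns false. The C sketch below labels each statement with the node it stands for; it is a sequential approximation only, the function name `sum_squares_dataflow` and the parameter `n` (100 on the slide) are hypothetical, and the node semantics in the comments are assumptions rather than the compiler's exact definitions.

```c
#include <stdint.h>

int32_t sum_squares_dataflow(int32_t n) {
    /* Merge nodes: each loop-carried wire starts from its initial
       value (0) and is later fed by the back edge. */
    int32_t i = 0;
    int32_t sum = 0;

    for (;;) {
        int pred = i < n;          /* "< 100" node on the slide */
        if (!pred)                 /* "!" node: loop-exit predicate */
            return sum;            /* sum is released to the ret node */

        int32_t sq = i * i;        /* "*" node */
        sum = sum + sq;            /* "+" node, back edge into sum */
        i = i + 1;                 /* "+1" node, back edge into i */
    }
}
```

In the circuit the `i` and `sum` cycles are separate rings, so the next `i` can be computed while the multiply for the current iteration is still in progress.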
36
ASH Weaknesses
• Both branch and join not free
• Static dataflow
(no re-issue of same instr)
• Memory is “far”
• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming
• Calls/returns not lenient
37
Branch Prediction
for (i=0; i < N; i++) {
...
if (exception) break;
}
[Diagram: loop dataflow graph with nodes i, 1, +, <, exception, !, and &, marking the ASH crit path and the CPU crit path]
• Predicted not taken: effectively a noop for CPU!
• Predicted taken.
• result available before inputs
38
Memory Partitioning
• MIT RAW project: Babb FCCM '99, Barua HiPC '00, Lee ASPLOS '00
• Stanford SpC: Semeria DAC '01, TVLSI '02
• Illinois FlexRAM: Fraguela PPoPP '03
• Hand-annotations #pragma
39
Recursion
• save live values
• recursive call
• restore live values
[Diagram: stack]
40
Leakage Power
$P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$
• Employ circuit-level techniques
• Cut power supply of idle circuit portions
– most of the circuit is idle most of the time
– strong locality of activity
• High-$V_T$ transistors on non-critical path
41
Why Not Compare To…
• In-order processor
– Worse in all metrics than superscalar, except power
– We beat it in all metrics, including performance
• DSP
– We expect roughly the same results as for superscalar
(Wattch maintains high IPC for these kernels)
• ASIC
– No available tool-flow supports C to the same degree
• Asynchronous ASIC
– We compared with a Balsa synthesis system
– We are 15 times better in E×t than the resulting ASIC
• Async processor
– We are 350 times better in E×t than Amulet (scaled to .18 μm)
42
Compared to Next Talk
Engine [180nm]   Performance [MIPS]   E/instruction [pJ]
SNAP/LE          28                   24
SNAP/LE          240                  218
ASH              1100                 20
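The performance and energy figures on this slide can be combined into an energy-delay (E×t) value per instruction. The helper below is a back-of-the-envelope sketch, not a metric definition from the talk: it assumes delay per instruction is simply the inverse of the MIPS rate, and `energy_delay_pj_us` is a hypothetical name.

```c
/* Energy-delay product per instruction, in pJ * microseconds.
   Assumption: 1 MIPS means one instruction per microsecond, so the
   delay per instruction is 1/mips microseconds. */
double energy_delay_pj_us(double mips, double pj_per_instr) {
    double us_per_instr = 1.0 / mips;
    return pj_per_instr * us_per_instr;
}
```

Calling it with each engine's pair of numbers, for example `energy_delay_pj_us(1100.0, 20.0)` for ASH and `energy_delay_pj_us(28.0, 24.0)` for the slower SNAP/LE point, gives values that can be compared directly; lower is better. The comparison is only as meaningful as the assumption that an ASH operation and a SNAP/LE instruction represent similar amounts of work.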
43
Why not target FPGA
• Do not support asynchronous circuits
• Very inefficient in area, power, delay
• Too fine-grained for datapath circuits
• We are designing an async FPGA
44