Spatial Computation

Computing without General-Purpose Processors
Mihai Budiu
Microsoft Research – Silicon Valley
joint work with
Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein
Carnegie Mellon University
May 10, 2005
Outline
• Intro: Problems of current architectures
[Figure: performance (log scale, 1 to 1000) by year, 1980 to 2000]
• Compiling Application-Specific Hardware
• ASH Evaluation
• Conclusions
Resources
[Intel]
• We do not worry about not having hardware resources
• We worry about being able to use hardware resources
Complexity
[Figure: ALUs on a log scale from 10^4 to 10^10; labels: gate 5ps, wire 20ps]
Cannot rely on global signals
(clock is a global signal)
Complexity
[Figure: ALUs on a log scale from 10^4 to 10^10; labels: gate 5ps, wire 20ps]
• Simple, short, unidirectional interconnect
• Automatic translation C → HW
• Simple hw, mostly idle
• No interpretation
• Distributed control, asynchronous
Cannot rely on global signals
(clock is a global signal)
Our Proposal:
Application-Specific Hardware
• ASH addresses these problems
• ASH is not a panacea
• ASH “complementary” to CPU
[Diagram: CPU (low-ILP computation + OS + VM), ASH (high-ILP computation), $, Memory]
Outline
• Problems of current architectures
• CASH:
Compiling Application-Specific Hardware
• ASH Evaluation
• Conclusions
Application-Specific Hardware
C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw
Computation
Program:
x = a & 7;
...
y = x >> 2;
Dataflow IR: a → (& 7) → x → (>> 2)
Circuits: (&7) → (>>2), no interpretation
Program      Dataflow IR     Circuits
Operations   Nodes           Pipeline stages
Variables    Def-use edges   Channels (wires)
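The mapping from operations to nodes and from variables to def-use edges can be sketched in software. This is an illustrative model only: the names `node`, `node_fire`, `run_chain`, and `example_xy` are hypothetical and not part of CASH. A real ASH circuit has no interpreter, so the loop below merely stands in for the wires between pipeline stages.

```c
#include <stdint.h>

/* Hypothetical node kinds: an operation with one constant operand. */
typedef enum { OP_AND_CONST, OP_SHR_CONST } op_kind;

typedef struct {
    op_kind  kind;
    uint32_t constant;   /* the 7 in "& 7", the 2 in ">> 2" */
} node;

/* Fire one node: consume the value on its input edge, produce its output. */
static uint32_t node_fire(const node *n, uint32_t in) {
    switch (n->kind) {
    case OP_AND_CONST: return in & n->constant;
    case OP_SHR_CONST: return in >> n->constant;
    }
    return 0;
}

/* Push a value through a linear chain of nodes (each def has one use). */
uint32_t run_chain(const node *nodes, int count, uint32_t input) {
    uint32_t v = input;
    for (int i = 0; i < count; i++)
        v = node_fire(&nodes[i], v);
    return v;
}

/* The slide's example: x = a & 7; y = x >> 2; returns y. */
uint32_t example_xy(uint32_t a) {
    static const node chain[] = { { OP_AND_CONST, 7u }, { OP_SHR_CONST, 2u } };
    return run_chain(chain, 2, a);
}
```

Calling `example_xy(13)` computes `13 & 7` and then shifts the result right by 2.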
Basic Computation = Pipeline Stage
[Diagram: an adder (+) followed by a latch, with data, valid, and ack signals]
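A rough software model of the data/valid/ack handshake of one stage, under the assumption that a stage fires only when all inputs are valid and its output latch is free. The type `stage` and the functions `adder_try_fire` and `stage_consume` are hypothetical names for illustration, not the actual circuit or any CASH interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one pipeline stage: an adder plus its output latch. */
typedef struct {
    uint32_t data;    /* value held in the latch */
    bool     valid;   /* latch holds a result nobody has consumed yet */
} stage;

/* Try to fire the adder. Returns true (the "ack" to both producers) only if
 * both inputs were valid and the latch was free, so the inputs were consumed. */
bool adder_try_fire(stage *s, uint32_t a, bool a_valid, uint32_t b, bool b_valid) {
    if (!a_valid || !b_valid || s->valid)
        return false;
    s->data  = a + b;
    s->valid = true;
    return true;
}

/* The consumer takes the latched value. Returning true models the consumer's
 * "ack", which frees the latch for the next computation. */
bool stage_consume(stage *s, uint32_t *out) {
    if (!s->valid)
        return false;
    *out = s->data;
    s->valid = false;
    return true;
}
```

In this model a second `adder_try_fire` call fails until `stage_consume` has emptied the latch, which is the back-pressure that the ack signal provides.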
Asynchronous Computation
[Animation, events numbered 1 to 8: data, valid, and ack signals moving through a network of adder (+) stages and latches]
Distributed Control Logic
[Diagram labels: global FSM, ack, rdy, +]
short, local wires
MUX: Forward Branches
if (x > 0)
    y = -x;
else
    y = b*x;
[Diagram: dataflow graph with inputs b, x, and 0; nodes *, -, >, !; a mux produces y; the critical path is marked]
SSA = no arbitration
Conditionals ⇒ Speculation
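A minimal sketch of what speculation means for this conditional, assuming both sides are computed unconditionally and a mux with one predicate per data input picks the result. The functions `decoded_mux` and `select_y` are hypothetical names. The sketch ignores timing, and `-x` overflows for the most negative `int`, as in the original C.

```c
#include <stdbool.h>

/* Hypothetical decoded mux: one predicate per data input; the caller
 * guarantees that exactly one predicate is true, so no arbitration is needed. */
static int decoded_mux(bool p_then, int v_then, bool p_else, int v_else) {
    if (p_then) return v_then;
    if (p_else) return v_else;
    return 0;   /* unreachable when exactly one predicate holds */
}

/* if (x > 0) y = -x; else y = b*x;  with both sides evaluated speculatively. */
int select_y(int x, int b) {
    bool p        = x > 0;    /* the ">" node */
    int  then_val = -x;       /* speculated: computed even if p is false */
    int  else_val = b * x;    /* speculated: computed even if p is true  */
    return decoded_mux(p, then_val, !p, else_val);
}
```

Because neither side has side effects, computing both is safe; the cost is that the slower side (here the multiply) can sit on the critical path.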
Control Flow ⇒ Data Flow
[Diagram: Merge (label) with data inputs; Gateway with data and predicate inputs; Split (branch) with labels f, p, !]
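One way to picture these primitives in software, assuming a gateway forwards its data token only when its predicate is true and a merge forwards whichever input carries a token. The names `token`, `gateway`, `merge`, and `route` are hypothetical, and the semantics are a simplification, not a specification of the actual circuits.

```c
#include <stdbool.h>

/* Hypothetical token: a value plus a flag saying whether it is present. */
typedef struct {
    int  value;
    bool present;
} token;

/* Gateway: pass the data token through only if the predicate is true;
 * otherwise the token is discarded. */
token gateway(token data, bool predicate) {
    token out = { data.value, data.present && predicate };
    return out;
}

/* Merge: forward whichever input holds a token. Assumes at most one input
 * is present at a time, which control flow guarantees at a label. */
token merge(token a, token b) {
    return a.present ? a : b;
}

/* A split followed by a merge: the token takes exactly one of two paths,
 * chosen by p, and the merge rejoins the paths. */
token route(token data, bool p) {
    token taken     = gateway(data, p);    /* path followed when p is true  */
    token not_taken = gateway(data, !p);   /* path followed when p is false */
    return merge(taken, not_taken);
}
```

For any present token and either value of `p`, `route` returns the token unchanged, because exactly one gateway lets it through.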
Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Diagram: dataflow graph of the loop; i and sum start at 0; nodes +1, < 100, *, +, !, ret]
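The loop's dataflow graph can be read as the following sequential sketch, where each statement is annotated with the node it stands for. The function name `sum_squares_dataflow` and the `limit` parameter are hypothetical; in hardware the nodes run concurrently, which this model does not capture.

```c
#include <stdbool.h>

/* Sequential reading of the loop's dataflow graph. `limit` generalizes the
 * constant 100 from the slide. */
int sum_squares_dataflow(int limit) {
    int i   = 0;   /* initial token entering i's merge node   */
    int sum = 0;   /* initial token entering sum's merge node */

    for (;;) {
        bool p = i < limit;        /* predicate node "< 100"             */
        if (!p)
            return sum;            /* !p steers sum to the return node   */
        int sq = i * i;            /* "*" node                           */
        sum = sum + sq;            /* "+" node, fed back to sum's merge  */
        i   = i + 1;               /* "+1" node, fed back to i's merge   */
    }
}
```

With `limit` set to 100 this computes the same value as the C loop on the slide, 328350.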
Pipelining
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Animation, steps 1 to 7: dataflow graph of the loop with nodes i, +1, <= 100, *, sum, +; the multiplier is pipelined (8 stages); values i=0 and i=1 advance through the graph]
[Step 7 labels: i’s loop, predicate, long latency pipe, sum’s loop, critical path]
Predicate ack edge is on the critical path.
Pipeline balancing
[Diagram: the same graph with a decoupling FIFO; labels: i’s loop, sum’s loop, critical path]
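A decoupling FIFO can be sketched as a small ring buffer: the producer keeps running until the buffer fills, instead of stalling as soon as the consumer is busy. The depth of 8 is an assumption chosen to match the 8 multiplier stages mentioned above; the talk does not state the FIFO depth. The names `fifo`, `fifo_push`, and `fifo_pop` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 8   /* assumed depth, not taken from the talk */

/* Hypothetical ring-buffer FIFO; initialize with { {0}, 0, 0 }. */
typedef struct {
    int    slots[FIFO_DEPTH];
    size_t head;    /* index of the oldest element   */
    size_t count;   /* number of elements stored     */
} fifo;

/* Producer side: returns false (no ack) when the FIFO is full,
 * which is the only time the producer has to stall. */
bool fifo_push(fifo *f, int v) {
    if (f->count == FIFO_DEPTH)
        return false;
    f->slots[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return true;
}

/* Consumer side: returns false (no valid data) when the FIFO is empty. */
bool fifo_pop(fifo *f, int *out) {
    if (f->count == 0)
        return false;
    *out = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In this model the fast loop can run up to `FIFO_DEPTH` iterations ahead of the slow one before `fifo_push` starts to fail.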
Procedures
[Diagram: Caller and Callee, with Call, Argument, Return, and Continuation nodes]
Memory Access
[Diagram: LD and ST operations reach a monolithic memory through a pipelined, arbitrated network]
local communication
global structures
Future work: fragment this!
Outline
• Problems of current architectures
• Compiling ASH
• ASH Evaluation
• Conclusions
Evaluating ASH
C (Mediabench kernels, 1 hot function/benchmark) → CASH core → Verilog back-end → Synopsys, Cadence P/R (commercial tools; 180nm std. cell library, 2V; ~1999 technology) → ASIC
ASIC + Mem → ModelSim (Verilog simulation) → performance numbers
Compile Time
• C: 200 lines
• CASH core: 20 seconds
• Verilog back-end: 10 seconds
• Synopsys, Cadence P/R: 20 minutes, 1 hour
• Result: ASIC + Mem
ASH Area
[Bar chart: area in sq mm (axis 0 to 4.5), split into memory access and circuit, for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; reference marks: minimal RISC core, P4: 217]
ASH vs 600MHz CPU
[4-wide OOO, 0.18 μm]
[Bar chart: times faster (axis 0 to 2.5) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and the average; bar labels range from 0.43 to 2.40]
Bottleneck: Memory Protocol
[Diagram: LD and ST operations, LSQ, Memory]
• Enabling dependent operations requires a round trip to memory.
• Exploring novel memory access protocols.
Power [mW]
[Bar chart: ASH power in mW (axis 0 to 70) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and the average; reference points: DSP 110, μP 4000, Xeon [+cache] 67000]
Energy-delay
[Bar chart, log scale 1 to 10000: times better than superscalar, for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and the average]
Energy Efficiency (op/nJ)
(non-speculative arithmetic)
[Bar chart: operations/nJ (axis 0 to 160) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e]
Energy Efficiency
[Chart: energy efficiency in operations/nJ, log scale 0.01 to 1000, for microprocessors, general-purpose DSP, FPGA, asynchronous μP, ASH media kernels, and dedicated hardware; a span of 1000x is marked]
Outline
Problems of current architectures
+ Compiling ASH
+ Evaluation
= Related work, Conclusions
Bibliography
• Dataflow: A Complement to Superscalar
Mihai Budiu, Pedro Artigas, and Seth Copen Goldstein
ISPASS 2005
• Spatial Computation
Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein
ASPLOS 2004
• C to Asynchronous Dataflow Circuits: An End-to-End Toolflow
Girish Venkataramani, Mihai Budiu, Tiberiu Chelcea, and Seth Copen Goldstein
IWLS 2004
• Optimizing Memory Accesses For Spatial Computation
Mihai Budiu and Seth Copen Goldstein
CGO 2003
• Compiling Application-Specific Hardware
Mihai Budiu and Seth Copen Goldstein
FPL 2002
Related Work
• Optimizing compilers
• High-level synthesis
• Reconfigurable computing
• Dataflow machines
• Asynchronous circuits
• Spatial computation
We target an extreme point in the design space:
no interpretation,
fully distributed computation and control
ASH Design Point
• Design an ASIC in a day
• Fully automatic synthesis to layout
• Fully distributed control and computation
(spatial computation)
– Replicate computation to simplify wires
• Energy/op rivals custom ASIC
• Performance rivals superscalar
• E×t 100 times better than any processor
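The E×t figure is read here as the energy-delay product: energy spent times execution time, where lower is better. A small sketch of how such a ratio could be computed follows. The function names are hypothetical, and any numbers passed in are placeholders chosen by the caller, not measurements from this talk.

```c
/* Energy-delay product: energy (joules) times execution time (seconds).
 * Lower is better. */
double energy_delay(double energy_j, double delay_s) {
    return energy_j * delay_s;
}

/* How many times better (smaller) the new design's energy-delay product is
 * compared to a baseline. A result of 100.0 means "100 times better". */
double energy_delay_gain(double base_energy_j, double base_delay_s,
                         double new_energy_j, double new_delay_s) {
    return energy_delay(base_energy_j, base_delay_s) /
           energy_delay(new_energy_j, new_delay_s);
}
```

Because the metric multiplies the two quantities, a design that uses 50 times less energy and is also 2 times faster gains a factor of 100 in E×t.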
Conclusions
Spatial computation strengths
Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Designer productivity
Backup Slides
• Absolute performance
• Control logic
• Exceptions
• Leniency
• Normalized area
• ASH weaknesses
• Splitting memory
• Recursive calls
• Leakage
• Why not compare to…
• Targeting FPGAs
Absolute Performance
[Bar chart: millions of operations per second (series MOPS, MOPSspec, MOPSall), axis 0 to 6000 with one bar labeled 12300, for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; the CPU range is marked]
Pipeline Stage
[Circuit diagram: signals rdyin, ackin, datain, rdyout, ackout, dataout; elements C, =, D, Reg]
Exceptions
• Strictly speaking, C has no exceptions
• In practice hard to accommodate
exceptions in hardware implementations
• An advantage of software flexibility:
PC is single point of execution control
[Diagram: CPU (low-ILP computation + OS + VM + exceptions), ASH (high-ILP computation), $$$, Memory]
Critical Paths
if (x > 0)
    y = -x;
else
    y = b*x;
[Diagram: dataflow graph with inputs b, x, and 0; nodes *, -, >, !; output y]
Lenient Operations
if (x > 0)
    y = -x;
else
    y = b*x;
[Diagram: dataflow graph with inputs b, x, and 0; nodes *, -, >, !; output y]
Solves the problem of unbalanced paths
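A sketch of the idea, assuming "lenient" means that an operation produces its output as soon as the inputs that determine it have arrived, without waiting for the rest. The names `wire`, `lenient_mux`, and `strict_mux` are hypothetical and only illustrate the contrast.

```c
#include <stdbool.h>

/* Hypothetical wire: a value plus a flag saying whether it has arrived. */
typedef struct {
    int  value;
    bool ready;
} wire;

/* Strict mux: waits for the predicate and BOTH data inputs. */
wire strict_mux(wire pred, wire then_in, wire else_in) {
    wire out = { 0, false };
    if (pred.ready && then_in.ready && else_in.ready) {
        out.value = pred.value ? then_in.value : else_in.value;
        out.ready = true;
    }
    return out;
}

/* Lenient mux: needs only the predicate and the input it selects.
 * The unselected input (for example a slow multiply) is not waited for. */
wire lenient_mux(wire pred, wire then_in, wire else_in) {
    wire out = { 0, false };
    if (pred.ready) {
        wire chosen = pred.value ? then_in : else_in;
        if (chosen.ready)
            out = chosen;
    }
    return out;
}
```

For the example above with x > 0, the lenient mux can emit `-x` while `b*x` is still in flight, so the multiplier is no longer on the critical path for that case.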
Normalized Area
[Bar chart: source lines/sq mm (axis 0 to 250) and object code Kb/sq mm (axis 0 to 6) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and the average]
ASH Weaknesses
• Both branch and join not free
• Static dataflow
(no re-issue of same instr)
• Memory is “far”
• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming
• Calls/returns not lenient
Branch Prediction
for (i=0; i < N; i++) {
    ...
    if (exception) break;
}
[Diagram: dataflow graph of the loop with nodes i, +1, <, exception, !, &; the ASH critical path and the CPU critical path are marked]
• Predicted not taken: effectively a noop for CPU!
• Predicted taken.
• result available before inputs
Memory Partitioning
• MIT RAW project: Babb FCCM ’99, Barua HiPC ’00, Lee ASPLOS ’00
• Stanford SpC: Semeria DAC ’01, TVLSI ’02
• Illinois FlexRAM: Fraguela PPoPP ’03
• Hand annotations: #pragma
Recursion
save live values
recursive call
restore live values
[Diagram: live values go to and from a stack]
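The save/call/restore sequence can be illustrated with a factorial in which the argument lives in a single global "latch", assumed here to mimic a circuit that has only one copy of each value. Everything in this sketch (`reg_n`, `stack_mem`, `fact_circuit`, `fact`, and the stack size) is hypothetical and only models the idea, not the generated hardware.

```c
#define STACK_SLOTS 32          /* assumed stack size for this sketch */

static int reg_n;               /* the single latch holding n */
static int stack_mem[STACK_SLOTS];
static int sp;                  /* next free stack slot */

/* The recursive call reuses reg_n, so the caller's n must be saved first.
 * Valid for small n only: depth is bounded by STACK_SLOTS and the result
 * must fit in an int. */
static int fact_circuit(void) {
    if (reg_n <= 1)
        return 1;
    stack_mem[sp++] = reg_n;    /* save live value                      */
    reg_n = reg_n - 1;          /* argument of the recursive call       */
    int r = fact_circuit();     /* recursive call overwrites reg_n      */
    reg_n = stack_mem[--sp];    /* restore live value                   */
    return reg_n * r;
}

/* Entry point: load the argument latch, reset the stack, run. */
int fact(int n) {
    reg_n = n;
    sp = 0;
    return fact_circuit();
}
```

Without the save and restore, `reg_n` would hold 1 after the innermost call returns and every multiplication would use the wrong value.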
Leakage Power
$P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$
• Employ circuit-level techniques
• Cut power supply of idle circuit portions
– most of the circuit is idle most of the time
– strong locality of activity
Why Not Compare To…
• In-order processor
– Worse in all metrics than superscalar, except power
– We beat it in all metrics, including performance
• DSP
– We expect roughly the same results as for superscalar
(Wattch maintains high IPC for these kernels)
• ASIC
– No available tool-flow supports C to the same degree
• Asynchronous ASIC
– We compared with a Balsa synthesis system
– We are 15 times better in Et compared to resulting ASIC
• Async processor
– We are 350 times better in Et than Amulet (scaled to .18)
Why not target FPGA
• Do not support asynchronous circuits
• Very inefficient in area, power, delay
• Too fine-grained for datapath circuits
• We are designing an async FPGA