Spatial Computation
Mihai Budiu
CMU CS

A model of general-purpose computation
based on Application-Specific Hardware.

Thesis committee:
Seth Goldstein
Peter Lee
Todd Mowry
Babak Falsafi
Nevin Heintze

Ph.D. Thesis defense, December 8, 2003
2
Thesis Statement
Application-Specific Hardware (ASH):
• can be synthesized by adapting software
compilation for predicated architectures,
provides high performance for programs with
high ILP, with very low power consumption,
• is a more scalable and efficient computation
substrate than monolithic processors.
3
Outline
• Introduction
• Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
4
CPU Problems
• Complexity
• Power
• Global signals
• Limited ILP
5
Design Complexity
Design Time: CAD productivity favors FPL
[Chart, 1981–2009: chip size (K logic transistors) grows at 58%/year while designer productivity (transistors/staff-month) grows at only 21%/year, opening a design productivity gap.]
Source: S. Malik, orig. Sematech; from Michael Flynn’s FCRC 2003 talk
6
Communication vs. Computation
[Figure: a gate switches in 5ps, but driving a wire takes 20ps.]
Power consumption on wires is also dominant
7
Our Approach: ASH
Application-Specific Hardware
8
Resource Binding Time
[Figure: for a CPU, the hardware exists first (1) and programs are bound to it later (2); for ASH, the programs come first (1) and the hardware is synthesized for them (2).]
9
Hardware Interface
[Figure: a CPU places the ISA between software and hardware; ASH places a virtual ISA between software and the gates of the hardware.]
10
Application-Specific Hardware
[Flow: C program → Compiler → Dataflow IR → Reconfigurable/custom hardware]
11
Contributions
[Figure: this work sits at the intersection of embedded systems, asynchronous circuits, computer architecture, reconfigurable computing, compilation, high-level synthesis, nanotechnology, dataflow machines, and theory.]
12
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
13
Computation = Dataflow
Programs → Circuits

x = a & 7;
...
y = x >> 2;

[Circuit: a and the constant 7 feed an & unit producing x; x and the constant 2 feed a >> unit producing y.]

• Operations → functional units
• Variables → wires
• No interpretation
14
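The slide's fragment runs as ordinary C; a minimal sketch (the function name is illustrative, not from the thesis):

```c
#include <assert.h>

/* The two statements from the slide: each operation (&, >>)
   becomes a functional unit; x and y become wires. */
int dataflow_example(int a) {
    int x = a & 7;   /* & unit: inputs a and the constant 7 */
    int y = x >> 2;  /* >> unit: inputs x and the constant 2 */
    return y;
}
```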
Basic Operation
[Figure: a functional unit (+) followed by a latch, with data, valid, and ack handshake signals.]
15
Asynchronous Computation
[Figure: a chain of + units with latches; numbered events (1–8) show data and ack signals propagating, each unit firing as soon as its inputs are valid and its output slot is acknowledged.]
16
Distributed Control Logic
[Figure: each functional unit carries a small local FSM driving its rdy/ack handshake.]
short, local wires
asynchronous control
17
Forward Branches

if (x > 0)
    y = -x;
else
    y = b*x;

[Circuit: both -x and b*x are computed speculatively; the x > 0 test and its negation select y; the multiplier lies on the critical path.]

Conditionals → Speculation
18
Control Flow → Data Flow
[Figure: three dataflow constructs replace control flow: Merge (label) combines data arriving from alternative predecessors; Gateway admits data under a predicate p; Split (branch) steers data according to a predicate and its negation.]
19
Loops

int sum = 0, i;
for (i = 0; i < 100; i++)
    sum += i*i;
return sum;

[Circuit: i and sum circulate around the loop, both initialized from the constant 0; i feeds the multiplier (i*i), the +1 increment, and the < 100 comparison; the loop predicate steers sum back around the loop or to the ret node.]
20
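The loop on this slide computes a fixed value, which a plain C version confirms (the function name is illustrative):

```c
#include <assert.h>

/* The source program from the slide: in the circuit, i and sum
   circulate around the loop until the i < 100 predicate fails. */
int sum_of_squares(void) {
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;
}
```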
Predication and Side-Effects
[Figure: a Load unit takes an address, a predicate, and a token; it produces data and passes a token to memory. Tokens enforce the sequencing of side-effects; predicated memory operations are not speculated.]
21
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software
compilation for predicated architectures,
provides high performance for programs with
high ILP, with very low power consumption,
• is a more scalable and efficient computation
substrate than monolithic processors.
22
Outline
• Introduction
• CASH: Compiling for ASH
  – An optimization on the SIDE
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
23
Availability Dataflow Analysis

y = a*b;
...
if (x) {
    ...
    ... = a*b;   // a*b is available here: reuse y
}
24
Dataflow Analysis Is Conservative

if (x) {
    ...
    y = a*b;
}
...
... = a*b;   // is y available here?
25
Static Instantiation, Dynamic Evaluation

flag = false;
if (x) {
    ...
    y = a*b;
    flag = true;
}
...
... = flag ? y : a*b;
26
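The transformation can be checked in plain C; in this sketch both versions must agree, and the way the values are consumed is invented for the harness:

```c
#include <assert.h>
#include <stdbool.h>

/* Before: whether the second a*b is redundant depends on x,
   which dataflow analysis cannot resolve statically. */
int without_flag(int a, int b, int x) {
    int y = 0, z;
    if (x)
        y = a * b;
    z = a * b;       /* recomputed unconditionally */
    return z + y;
}

/* After: a run-time flag records whether y already holds a*b,
   so the multiply is redone only when it was not available. */
int with_flag(int a, int b, int x) {
    bool flag = false;
    int y = 0, z;
    if (x) {
        y = a * b;
        flag = true;
    }
    z = flag ? y : a * b;
    return z + y;
}
```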
SIDE Register Promotion Impact
[Charts: percent reduction in loads (up to ~53%) and stores, split into % promoted and % eliminated by PRE, across Mediabench (adpcm, epic, g721, gsm, jpeg, mpeg2, pegwit, pgp, rasta, mesa) and SpecInt (099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 183.equake, 188.ammp, 197.parser, 254.gap, 300.twolf) benchmarks.]
27
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
28
Performance Evaluation
[Figure: ASH with a load-store queue (LSQ), an 8K L1, a 1/4M L2, and limited-bandwidth memory, compared against a 4-way OOO CPU with the same memory hierarchy.]
Assumption: all operations have the same latency.
29
Media Kernels, vs 4-way OOO
[Chart: ASH speed-up (times faster) over the 4-way OOO processor on Mediabench kernels (adpcm, epic, g721, gsm, jpeg, mesa, mpeg2, pegwit, rasta); most kernels fall between 1× and 3×, with annotated outliers at 5.8×.]
30
Media Kernels, IPC
[Chart: base (OOO) IPC vs. ASH IPC for the same kernels; ASH sustains IPCs up to ~25, far above the 4-wide superscalar.]
31
Speed-up / IPC Correlation
[Chart: per kernel, the speed-up closely tracks the ASH-to-superscalar IPC ratio (ratios up to ~12).]
32
Low-Level Evaluation
[Flow: C → CASH core (results shown so far; all results in thesis) → Verilog back-end → Synopsys, Cadence P/R → ASIC, using a 180nm standard-cell library at 2V (~1999 technology; results in the next two slides).]
33
Area
[Chart: synthesized area in mm² for the Mediabench kernels (adpcm, g721, gsm, jpeg, mpeg2, pegwit); every design fits in under ~12 mm².]
Reference: P4 in 180nm has 217 mm²
34
Power
[Chart: ASH power, times smaller than OOO: adpcm_d 70×, g721_d 41×, g721_e 41×, gsm_d 129×, gsm_e 147×, jpeg_d 94×, mpeg2_d 121×, mpeg2_e 136×, pegwit_d 303×, pegwit_e 303×.]
vs 4-way OOO superscalar, 600 MHz, with clock gating (Wattch), ~6W
35
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software
compilation for predicated architectures,
provides high performance for programs with
high ILP, with very low power consumption,
• is a more scalable and efficient computation
substrate than monolithic processors.
36
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
  – dataflow pipelining
• ASH vs. superscalar processors
• Conclusions
37
Pipelining

int sum = 0, i;
for (i = 0; i < 100; i++)
    sum += i*i;
return sum;

[Animation, cycles 1–5: the i*i multiplication uses a pipelined multiplier (8 stages); successive iterations (i=0, i=1, ...) enter the multiplier before earlier ones finish, so sustained throughput depends on pipeline balancing.]
42
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
43
This Is Obvious!
ASH runs at full dataflow speed,
so a CPU cannot do any better
(if the compilers are equally good).
44
SpecInt95, ASH vs 4-way OOO
[Chart: percent slower/faster on 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex; ASH ranges from ~30% faster to ~50% slower than the 4-way OOO processor.]
45
Branch Prediction

for (i = 0; i < N; i++) {
    ...
    if (exception) break;
}

[Figure: the exception exit is predicted not taken (effectively a no-op for the CPU) and the back edge is predicted taken, so the CPU's branch result is available before its inputs; on ASH the critical path runs through the exception test and the loop-control AND gate.]
46
SpecInt95, perfect prediction
[Chart: percent slower/faster, baseline vs. perfect prediction, on 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex (no data for one benchmark); results range from roughly -60% to +60%.]
47
ASH Problems
• Both branch and join not free
• Static dataflow
(no re-issue of same instr)
• Memory is “far”
• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming
• Calls/returns not lenient
• ...
48
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software
compilation for predicated architectures,
provides high performance for programs with
high ILP, with very low power consumption,
• is a more scalable and efficient computation
substrate than monolithic processors.
49
Outline
Introduction
+ CASH: Compiling for ASH
+ Media processing on ASH
+ ASH vs. superscalar processors
= Conclusions
50
Strengths
[Two-column comparison. CPU: economies of scale, highly optimized, branch prediction, control speculation, global signals/decisions. ASH: low power, simple verification(?), specialized to the application, unlimited ILP, simple hardware, no fixed window, full dataflow.]
51
Conclusions
• Compiling “around the ISA” is a fruitful
research approach.
• Distributed computation structures require
more synchronization overhead.
• Spatial Computation efficiently implements
high-ILP computation with very low power.
52
Backup Slides
• Control logic
• Pipeline balancing
• Lenient execution
• Dynamic Critical Path
• Memory PRE
• Critical path analysis
• CPU + ASH
53
Control Logic
[Circuit: C-elements combine rdyin with ackout to produce rdyout and ackin, and to clock datain through a register (Reg) onto dataout.]
back to talk
54
Last-Arrival Events
[Figure: a + unit with data, ack, and valid signals.]
• Event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges
55
Dynamic Critical Path
1. Start from the last node
2. Trace back along last-arrival edges
3. Some edges may repeat
back to analysis
56
Critical Paths

if (x > 0)
    y = -x;
else
    y = b*x;

[Circuit: the conditional circuit from before, with its dynamic critical path highlighted.]
57
Lenient Operations

if (x > 0)
    y = -x;
else
    y = b*x;

[Circuit: the multiplexor producing y is lenient: it emits its result as soon as the selected input is known.]
Solve the problem of unbalanced paths
back to talk
58
Pipelining (continued)
[Animation, cycles 6–7: i's loop runs ahead of sum's loop because the multiplier is a long-latency pipe; the predicate ack edge of sum's loop ends up on the critical path.]
61

Pipeline Balancing
[Animation: a decoupling FIFO inserted between i's loop and sum's loop takes the predicate ack edge off the critical path, letting the multiplier pipeline stay full.]
back to presentation
63
Register Promotion

Before:
(p1) *p = …
(p2) … = *p

After:
(p1) *p = …
(p2 ∧ ¬p1) … = *p

The load is executed only if the store is not.
64
Register Promotion (2)

Before:
(p1) *p = …
(p2) … = *p

After:
(p1) *p = …
(false) … = *p

• When p2 ⇒ p1 the load becomes dead...
• ...i.e., when the store dominates the load in the CFG
back
65
≈ PRE

Before:
(p1) ... = *p
(p2) ... = *p

After:
(p1 ∨ p2) ... = *p

This corresponds in the CFG to lifting the load to a basic block dominating the original loads.
66
Store-store (1)

Before:
(p1) *p = …
(p2) *p = ...

After:
(p1 ∧ ¬p2) *p = …
(p2) *p = ...

• When p1 ⇒ p2 the first store becomes dead...
• ...i.e., when the second store post-dominates the first in the CFG
67
Store-store (2)

Before:
(p1) *p = …
(p2) *p = ...

After:
(p1 ∧ ¬p2) *p = …
(p2) *p = ...

• Token edge eliminated, but...
• ...transitive closure of tokens preserved
back
68
A Code Fragment

for (i = 0; i < 64; i++)
{
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    Y[i] = X[j].q;
}

SpecINT95: 124.m88ksim: init_processor, stylized
69
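A runnable rendering of the fragment; the struct layout and the 0xF sentinel follow the slide, while the harness data is invented:

```c
#include <assert.h>

struct entry { int r; int q; };

/* For each i, scan X until a matching tag or the 0xF sentinel,
   then copy the corresponding q field into Y. */
void translate(struct entry *X, int *Y, int n) {
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i)
                break;
        Y[i] = X[j].q;
    }
}
```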
Dynamic Critical Path
[Figure: the dynamic critical path of the inner loop runs through the sizeof(X[j]) address computation, the load predicate, and the loop predicate.]

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
70
MIPS gcc Code

LOOP:
L1: beq   $v0,$a1,EXIT   ; X[j].r == i
L2: addiu $v1,$v1,20     ; &X[j+1].r
L3: lw    $v0,0($v1)     ; X[j+1].r
L4: addiu $a0,$a0,1      ; j++
L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
EXIT:

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

L1 → L2 → L3 → L5 → L1: a 4-instruction loop-carried dependence
71
If Branch Prediction Correct

LOOP:
L1: beq   $v0,$a1,EXIT   ; X[j].r == i
L2: addiu $v1,$v1,20     ; &X[j+1].r
L3: lw    $v0,0($v1)     ; X[j+1].r
L4: addiu $a0,$a0,1      ; j++
L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
EXIT:

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

With L1 → L2 → L3 → L5 → L1 collapsed by prediction, the superscalar is issue-limited: 2 cycles/iteration sustained.
72
Critical Path with Prediction
[Figure: even with branch prediction, loads are not speculative and remain on the critical path.]

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
73
Prediction + Load Speculation
[Figure: with prediction and load speculation, ~4 cycles/iteration; the load is not pipelined because of its ack edge (a self-anti-dependence).]

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
74
OOO Pipe Snapshot

LOOP:
L1: beq   $v0,$a1,EXIT   ; X[j].r == i
L2: addiu $v1,$v1,20     ; &X[j+1].r
L3: lw    $v0,0($v1)     ; X[j+1].r
L4: addiu $a0,$a0,1      ; j++
L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
EXIT:

[Table: a snapshot of the OOO pipeline stages (IF, DA, EX, WB, CT); register renaming keeps multiple copies of L1–L5 from different iterations in flight simultaneously.]
75
Unrolling?

for (i = 0; i < 64; i++)
{
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}

back to talk
76
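The unrolled inner loop can be compared against the original; in this sketch the j adjustments at the breaks are my addition (not from the slide) to keep the returned index in step with the original search:

```c
#include <assert.h>

struct entry { int r; int q; };

/* Original search: stop at a match or the 0xF sentinel. */
int search_orig(struct entry *X, int i) {
    int j;
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    return j;
}

/* Unrolled by 2: the extra sentinel test covers the odd slot. */
int search_unrolled(struct entry *X, int i) {
    int j;
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF) { j++; break; }
        if (X[j+1].r == i)   { j++; break; }
    }
    return j;
}
```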
Ideal Architecture
[Figure: the CPU runs low-ILP computation, the OS, and the VM; ASH runs high-ILP computation; both share Memory.]
back
77