Computing Without Processors
Thesis Proposal
Mihai Budiu
July 30, 2001
Thesis Committee:
Seth Goldstein, chair
Todd Mowry
Peter Lee
Babak Falsafi, ECE
Nevin Heintze, Agere Systems
This presentation uses TeXPoint by George Necula
Four Types of Research
• Solve nonexistent problems
• Solve past problems
• Solve current problems
• Solve future problems
2
The Law
[Chart (source: Intel) illustrating Moore's Law: exponential growth over time]
3
The Crossover Phenomenon
[Chart: a technology metric vs. time, with two curves crossing at some point]
4
Example Crossover
[Chart: access speed (ns) vs. time; the CPU and DRAM curves cross near 200 ns around 1980, marking the transition from "no caches" to "caches"]
5
Trouble Ahead for Microarchitecture
Signal Propagation
[Chart: mm vs. time; die size (around 20 mm) keeps growing while the distance a signal can travel in one clock keeps shrinking, and the two cross around now]
7
Reliability & Yield
[Chart: defects/chip vs. time; with each new process, the defects occurring cross above the defects tolerable around now]
8
Energy
[Chart: power vs. time; CPU consumption crosses the thermal-dissipation limit (around 100W) near now]
9
Instruction-Level Parallelism (ILP)
[Chart: instructions vs. time; instructions fetched and instructions committed diverge around now]
10
Premises of this Research
• We will have lots of gates
– Moore’s law continues
– Nanotechnology
• Contemporary architectures do not scale
11
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work
12
ASH
Application-Specific Hardware
[Diagram: a compiler translates an HLL program into a circuit that runs on reconfigurable hardware]
13
ASH: A Scalable Architecture
-- Thesis Statement --
Application-specific hardware on a reconfigurable-hardware substrate is a solution for the smooth evolution of computer architecture.
We can provide scalable compilers for translating high-level languages into hardware.
14
Example

int f(void)
{
  int i = 0, j = 0;
  for (; i < 10; i++)
    j += i;
  return j;
}

15
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work
16
ASH and Nanotechnology
• Build reconfigurable hardware using nanotechnology
• Low power: 10^10 gates use less than 2 W
Huge structures
• Low cost: nanocents/gate
• High density: 10^5x over CMOS
[Figure: a nano-RAM cell; in yellow, a CMOS RAM cell at the same scale]
17
A Limit Study of Performance
A graph of the whole program execution:
• Basic block
• Control-flow transfer
• Memory write
• Memory read
• Memory word
18
Typical Program Graph (g721_e)
[Graph with labeled features: memory reads, control-flow transfers, a 100% code cluster, memcpy, and a 100% memory cluster]
19
Program Graph After Inlining memcpy
[Graph: the same program graph once memcpy has been inlined]
20
Application Slowdown
[Chart: slowdown relative to native execution (times slower, up to about 11x) for 099.go, 129.compress, 130.li, 132.ijpeg, and the Mediabench codecs (adpcm, epic, g721, gsm, jpeg, mpeg2; encoder and decoder variants), under 1 clock/square and 5 clocks/square models]
21
How Time Is Spent
No caches: reads expensive
No speculation
[Chart: for the same benchmarks, the percentage of time (0% to 100%) spent in idle, execution, control flow, and register traffic]
22
Lesson
The spatial model of computation has
different properties.
23
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Future work
24
CASH: Compiling for ASH
• Program to circuits
• Memory partitioning
• Interconnection net
25
Compilation (problem addressed: Reliability)

1. Program:

int reverse(int x)
{
  int k, r = 0;
  for (k = 0; k < 32; k++) {
    r = (r << 1) | (x & 1);
    x = x >> 1;
  }
  return r;
}

2. Split-phase Abstract Machines: computations & local storage; unknown-latency ops.
3. Configurations placed independently
4. Placement on chip
26
Split-phase Abstract Machines (problem addressed: Power)
[Diagram: the control-flow graph (CFG) is partitioned into split-phase abstract machines SAM 1, SAM 2, and SAM 3]
27
Hyperblock => SAM
• Single-entry, multiple exit
• May contain loops
28
SAM => FSM
[Diagram: an FSM with Start, Loop, and Exit states; it uses local memory and also accesses remote memory]
29
Implementing SAMs
- interesting details -
30
The SAM FSM
[Diagram: combinational logic plus a register form the FSM; args and a start signal enter the computation, predicates (control) steer it, and results and an exit signal leave]
31
Computation = Dataflow (problem addressed: Signals)

Programs:
  x = a & 7;
  ...
  y = x >> 2;

Circuits:
[Circuit: a feeds an AND gate with the constant 7, producing x; x feeds a right shift by the constant 2, producing y]

• Variables => wires + tokens
• No token store; no token matching
• Local communication only
32
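Below is a minimal C sketch of the token-driven execution model for the two-operation graph above (illustrative only; the struct and function names are assumptions, not CASH output). Each operation fires only when a token is present on its input wire; firing consumes the input token and deposits one on the output wire, and all communication is between direct neighbors.

#include <stdio.h>

/* A wire carries a value plus a one-bit token: there is no token store
 * and no token matching, only local producer/consumer communication. */
typedef struct { int value; int token; } wire;

/* x = a & 7: fires when the input token is present and the output is free. */
static void op_and7(wire *in, wire *out) {
    if (in->token && !out->token) {
        out->value = in->value & 7;
        out->token = 1;
        in->token = 0;
    }
}

/* y = x >> 2: same firing rule. */
static void op_shr2(wire *in, wire *out) {
    if (in->token && !out->token) {
        out->value = in->value >> 2;
        out->token = 1;
        in->token = 0;
    }
}

int main(void) {
    wire a = { 29, 1 }, x = { 0, 0 }, y = { 0, 0 };
    op_and7(&a, &x);             /* token flows from a to x */
    op_shr2(&x, &y);             /* and from x to y         */
    printf("y = %d\n", y.value); /* (29 & 7) >> 2 == 1      */
    return 0;
}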
Tokens & Synchronization
• Tokens signal operation completion
• Possible implementations:
[Diagram: three schemes, labeled Local, Global, and Static, built from data, ack, valid, and reset signals]
33
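As one illustration, here is a small C model of the kind of local data/ack handshake the diagram suggests (field and function names are assumptions; the real signals are wires, not memory): the producer raises valid to publish a token, and the consumer raises ack to free the channel.

#include <stdio.h>

/* A one-place channel: a token is the pair (data, valid); ack tells the
 * producer the token has been consumed. */
typedef struct { int data; int valid; int ack; } channel;

/* Returns 1 if the token was accepted, 0 if the channel is still busy. */
static int send(channel *c, int v) {
    if (c->valid) return 0;      /* previous token not yet consumed */
    c->data = v;
    c->valid = 1;
    c->ack = 0;
    return 1;
}

/* Returns 1 if a token was available and consumed. */
static int receive(channel *c, int *v) {
    if (!c->valid) return 0;
    *v = c->data;
    c->ack = 1;                  /* acknowledge: channel is free again */
    c->valid = 0;
    return 1;
}

int main(void) {
    channel ch = { 0, 0, 0 };
    int v;
    send(&ch, 42);
    if (receive(&ch, &v))
        printf("token carried %d\n", v);
    return 0;
}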
Speculation and Eager Muxes (problem addressed: ILP)

  if (x > 0)
    y = -x;
  else
    y = b*x;   /* slow */

[Circuit: both the negation and the slow multiply b*x are computed speculatively; the predicate x > 0 and its complement drive an eager mux producing y]
Static Single Assignment implemented in hardware
34
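A rough source-level picture of what the eager mux does for the code above (an illustrative rewrite, not actual compiler output): both arms are evaluated unconditionally, and the predicate only selects which result defines y.

/* Both arms execute speculatively; the predicate acts as the mux select. */
int select_y(int x, int b) {
    int p  = (x > 0);      /* predicate                      */
    int y1 = -x;           /* speculative "then" arm         */
    int y2 = b * x;        /* speculative "else" arm (slow)  */
    return p ? y1 : y2;    /* eager mux picks the live value */
}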
Predicates
• Select variable definition
• Guard side-effects
  – Memory access
  – Procedure calls
• Control looping
• Decide exit branch
[Examples in the figure: two definitions x=... reaching a use ...=x, and a guarded store *q = 2;]
35
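A small C sketch of the first two roles (hypothetical names, for illustration only): the predicate selects which definition of x reaches the use, and the same predicate guards the store *q = 2 so speculation never causes a visible side effect.

/* Predicate p selects the reaching definition of x and guards the store. */
int predicated(int p, int a, int b, int *q) {
    int x1 = a + 1;        /* definition of x on one path      */
    int x2 = b - 1;        /* definition of x on the other     */
    int x  = p ? x1 : x2;  /* predicate selects the definition */

    if (p)                 /* predicate guards the side effect */
        *q = 2;

    return x;              /* ...=x uses the selected value    */
}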
Computing Predicates
[Diagram: predicate computation on a small control-flow graph with blocks s, t, and b]
• Correct for irreducible graphs
• Correct even when speculatively computed
• Can be eagerly computed
36
Loops + Dataflow = Pipelining

  for (i = 0; i < 10; i++)
    a[i] += i;

[Circuit: the induction variable i (starting at 0, incremented by 1) and the address &a[0] feed a pipeline of load, add, and store units; several iterations (a[0] through a[3]) are in flight at once]
37
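To make the overlap concrete at the source level, here is a hand-pipelined version of the loop above (an illustrative rewrite assuming separate load, add, and store stages; not what CASH emits): the load for iteration i+1 is issued while iteration i finishes its add and store.

/* Equivalent to: for (i = 0; i < 10; i++) a[i] += i; */
void pipelined(int a[10]) {
    int t = a[0];                 /* prologue: load for iteration 0 */
    for (int i = 0; i < 9; i++) {
        int next = a[i + 1];      /* load for the next iteration    */
        a[i] = t + i;             /* add + store for this iteration */
        t = next;
    }
    a[9] = t + 9;                 /* epilogue: last add + store     */
}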
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work
38
Evolutionary Path
Microprocessors → ASH
The problem with ASH: resources
39
Virtualization
40
CPU+ASH
[Diagram: a CPU (running the OS, VM, and support computation) and an ASH fabric (running the core computation) share the memory]
41
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work
42
ASH Benefits

Problem      Solution
Reliability  Configuration around defects
Power        Only “useful” gates switching
Signals      Localized computation
ILP          Statically extracted

43
Scalable Performance
[Chart: performance vs. time; from now on, the ASH curve keeps climbing while the CPU curve levels off]
44
Summary
• Contemporary CPU architecture faces
lots of problems
• Application-Specific Hardware (ASH)
provides a scalable technology
• Compiling HLL into hardware dataflow
machines is an effective solution
45
Timeline
[Gantt chart from now (06/01) through 12/02, with milestones at 06/01, 09/01, 12/01, 04/02, 06/02, 09/02, and 12/02; tasks: CASH core, explore architectural/compiler trade-offs, hw/sw partitioning (ASH + CPU), cost models, loop parallelization, memory partitioning, ASH simulation, write thesis]
46
Extras
• Related work
• Reconfigurable hardware
• Other cross-over phenomena
• A CPU + ASH study
• More about predicates
47
Related Work
• Hardware synthesis from HLL
• Reconfigurable hardware
• Predicated execution
• Dataflow machines
• Speculative execution
• Predicated SSA
back
48
Reconfigurable Hardware
[Diagram: an interconnection network of programmable switches connecting universal gates and/or storage elements]
back to presentation
49
Main RH Ingredient: RAM Cell
Universal gate = RAM
[Diagram: a 4-entry RAM addressed by inputs a0 and a1, programmed with 0, 0, 0, 1, produces a0 & a1 on its data output]
Switch controlled by a 1-bit RAM cell
back
50
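A tiny C model of why a RAM cell is a universal gate (illustrative; the type and function names are assumptions): the inputs form the address, and the stored bits are the truth table, so reprogramming the same cell yields a different gate.

#include <stdio.h>

/* A 2-input lookup table: a 4-entry, 1-bit RAM addressed by the inputs. */
typedef struct { int bits[4]; } lut2;

static int lut2_eval(const lut2 *l, int a0, int a1) {
    return l->bits[(a1 << 1) | a0];   /* the inputs form the RAM address */
}

int main(void) {
    lut2 and_gate = { {0, 0, 0, 1} }; /* programmed as a0 & a1           */
    lut2 or_gate  = { {0, 1, 1, 1} }; /* same cell, reprogrammed as OR   */
    printf("AND(1,1)=%d OR(0,1)=%d\n",
           lut2_eval(&and_gate, 1, 1),
           lut2_eval(&or_gate, 0, 1));
    return 0;
}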
Reconfigurable Computing
• Back to ENIAC-style computation
• Synthesize one machine to solve one
problem
back
back to “extras”
51
Efficiency
[Chart: hardware resources vs. time, split into idle and used portions]
52
Manufacturing Cost
[Chart: cost vs. time; the cost curve crosses the affordable level (around 3x10^9 $) near now]
53
Complexity
[Chart: transistors vs. time (10^8 to 10^10); the number available grows past the number manageable around now]
54
CAD Tools
[Chart: manual interventions vs. time; the interventions necessary exceed those feasible around now]
back
55
ASH Benefits

Problem      Solution
Reliability  Configuration around defects
Power        Only “useful” gates switching
Signals      Localized computation
ILP          Statically extracted
Complexity   Hierarchy of abstractions
CAD          Compiler + local place & route
Efficiency   Circuit customized to application
Cost         No masks, no physics, same substrate
Performance  Scalable

back
56
CPU+ASH Study
• Reconfigurable functional unit on
processor pipeline
• Adapted SimpleScalar 3.0
• ASH & CPU use the same memory
hierarchy (incl. L1)
• ASH can access CPU registers
• CPU pipeline interlocked with ASH
• Results pending
back
57
Simplifying Predicates
• Shared implementations
• Control equivalence
[Diagram: control-flow blocks a, b, and c illustrating the two simplifications]
58
Deep Speculation

  if (p)
    if (q)
      x = a;
    else
      x = b;
  else
    x = c;

[Circuit: a, b, and c feed a mux controlled by the predicates p&q, p&!q, and !p, producing x]
59
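An illustrative source-level flattening of the nest above (not compiler output): each definition of x gets a full-path predicate, and the final value is one wide selection over all of them.

/* The nested control flow collapses into mutually exclusive path
 * predicates, each guarding one definition of x. */
int deep_spec(int p, int q, int a, int b, int c) {
    int pa = p && q;               /* path predicate for x = a */
    int pb = p && !q;              /* path predicate for x = b */
    /* !p is the remaining path, x = c. */
    return pa ? a : (pb ? b : c);  /* wide, eager selection    */
}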
Predicates & Tokens
[Diagram: the guarded store *q = 2 waits for conditions such as safe & ready and P & P_ready before firing]
Predicated tokens:
• Eliminate wires
• Eliminate speculation
back
60