TU Kaiserslautern - Xputer Lab Configware Engineering for

advertisement
PATMOS 2015, the 25th International Workshop on Power
and Timing Modeling, Optimization and Simulation;
Salvador, Bahia, Brazil, Sept 1-5, 2015
Reiner Hartenstein
TU Kaiserslautern
How to cope with
the Power Wall
>> Outline <<
TU Kaiserslautern
• The Power Wall
• “Dataflow” Computing
• Reconfigurable Computing
• Time to Space Mapping
• The Xputer Paradigm
• Conclusions
http://www.uni-kl.de
© 2015, reiner@hartenstein.de
2
http://hartenstein.de
TU Kaiserslautern
spin-off from the
EU’s PATMOS project
Oldest conference series
on power efficiency issues
3 years elder than ISPLED.
http://hartenstein.de/PATMOS/
Power efficiency is going to become an industry-wide issue
Some incremental improvements are on track,
at all abstraction levels - not only in
hardware design and software development
however, there is still a lot to do
© 2015, reiner@hartenstein.de
http://hartenstein.de
Power Efficiency of
Programming Languages
(an example)
© 2015, reiner@hartenstein.de
http://hartenstein.de
Programming Language Popularity
(IEEE)
source: iStockphoto
not yet
determined
by power
efficiency
© 2015, reiner@hartenstein.de
5
http://hartenstein.de
ICT infrastructures
TU Kaiserslautern
Already the internet’s 2007
carbon footprint was higher than
that of worldwide air traffic
Power consumption by internet:
x30 til 2030 if trends continue
G. Fettweis, E. Zimmermann:
ICT Energy Consumption Trends and Challenges;
WPMC'08, Lapland, Finland,
8 – 11 Sep 2008
It‘s more than the entire world‘s
total power consumption to-day !!!
© 2015,
6 6
© New
York reiner@hartenstein.de
Times
Data Center at Dallas
IDC predicts: the total number
of data centers
around the world
http://hartenstein.de
will peak at 8.6 million in 2017.
TU Kaiserslautern
Google‘s Electricity Bill
(for example)
Google asked the Federal Energy Regulatory Commission
for the authority to sell electricity,
Google patent for water-based
data centers: the ocean provides
cooling & power (motion of surface)
Cost of a data center is determined solely by the monthly
power bill, not by the cost of hardware or maintenance
„The possibility of computer equipment power consumption
spiraling out of control could have serious consequences
for the overall affordability of computing” [L. A. Barrosso, Google]
© 2015, reiner@hartenstein.de
http://hartenstein.de
TU Kaiserslautern
Massive Critique of the
von Neumann Model
Patterson’s Law:
bandwidth gap grows 50% / year
has reached >1000x (long ago)
Dave Patterson
instruction streams are
very slow and extremely
memory-cycle-hungry
“von Neumann
Syndrome”
C.V. Ramamoorthy
Von Neumann Syndrome
Nathan’s Law:
Software is a gas.
It expands to fill
all its containers ...
Nathan Myhrvold
Software dimension: MLoC
(Million Lines of Code):
Mac OS X 10.4
SAP NetWeaver 2007
Debian 5.0 2009
86
238
324
von Neumann: an extremely
power-inefficient paradigm
John Backus 1978: Can programming be liberated from the von Neumann style?
© 2015, reiner@hartenstein.de
8
http://hartenstein.de
>> Outline <<
TU Kaiserslautern
• The Power Wall
• “Dataflow” Computing
• Reconfigurable Computing
• Time to Space Mapping
• The Xputer Paradigm
• Conclusions
http://www.uni-kl.de
© 2015, reiner@hartenstein.de
9
http://hartenstein.de
Terminology Problems
TU Kaiserslautern
Searching a more power-efficient machine paradigm:
Stressing differences to „Control-Flow“ Computers an area
called „Dataflow“ Computers was started mid‘ 70ies at MIT
Xputer area people are forced to sidestep by
using terms like „data-driven“ or „data streams“…
although the „Dataflow“ scene is I-Structure-centered.
© 2015, reiner@hartenstein.de
http://hartenstein.de
MIT Tagged Token Dataflow Architecture*
I-Structure
Storage
no PC
no updateable global store
PE:
?
I-Structure
Storage
RU
Token
Queue
to/from the
Communication
Network
„instruction is executed
even if some of its operands
are not yet available“
*) source: Jurij Silc
© 2015, reiner@hartenstein.de
Communication
Network
PE
PE
Wait-Match Unit &
Waiting Token Store
Instruction
Fetch Unit
SU
Form
Token
Unit
ALU &
Form
Tag
11
Program Store &
Constant Store
http://hartenstein.de
Solution: I-Structure concept*
TU Kaiserslautern
Problems with
“Dataflow”
Architectures
read request
deferrend but write
operation is allowed
at least one read
request has
been deferred
?
• very complex
data structures !
can be read but not written
?
• “each update consumes the structure and
the value produces a new data structure”
[source: Jurij Silc]
•“awkward or
even impossible
to implement”
© 2015, reiner@hartenstein.de
12
„instruction is executed even
if some of its operands
are not yet available“
*) „I-Structure Flow“
instead
of „Dataflow“
http://hartenstein.de
A Second Opinion
TU Kaiserslautern
D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn:
A Second Opinion on Data Flow Machines and
Languages; IEEE COMPUTER, February 1982
the subtitle:
"... data flow techniques attract a great deal of attention.
Other alternatives, however, offer more hope for the future."
( Still active …
5th workshop on Data-Flow Execution Models for Extreme Scale
Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA )
However, the power efficiency break-thru did not happen here
© 2015, reiner@hartenstein.de
http://hartenstein.de
Computing Paradigms
CPU
program
counter
term
term
CPU
Xputer
RPU
DFC*
DFC
DPU
program
counter
rDPU
data
counters
DPUs
yes
no
no
I-
structure
*) based on data dependency graph
© 2015, reiner@hartenstein.de
execution
triggered by
paradigm
instruction instruction-streambased (von Neumann)
fetch
data
arrival**
data-driven or datastream-based
Reconfigurable Computing
complicated
I-structure “Dataflow” Computer
handling
14
**) “transport-triggered”
http://hartenstein.de
>> Outline <<
TU Kaiserslautern
• The Power Wall
• “Dataflow” Computing
• Reconfigurable Computing
• Time to Space Mapping
• The Xputer Paradigm
• Conclusions
http://www.uni-kl.de
© 2015, reiner@hartenstein.de
15
http://hartenstein.de
TU Kaiserslautern
First FPGA
available 1984
from Xilinx
LUT
LUT
LUT
LUT
LUT
LUT
Table
Table
LUT
© 2015, reiner@hartenstein.de
http://hartenstein.de
Reconfigurable Computing
(RC): the intensive Impact
TU Kaiserslautern
Speed-ups by von Neumann to RC Migrations
Tarek
El-Ghazawi
[Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]
SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster
Application
DNA and Protein
sequencing
DES breaking
Power
Savings
Cost
Size
8723
779
22
253
28514
3439
96
1116
Speed-up
factor
massively much less
equipment
saving
needed
energy
much less memory and
bandwidth needed
© 2015, reiner@hartenstein.de
17
http://hartenstein.de
Speedup- 106 Speedup-Factor
Factor
+ Pre-FPGA solutions
TU Kaiserslautern
Image processing,
Pattern matching, DES breaking28514
Multimedia DSP and
105
15000
Grid-based DRC*
(„fair comparizon“)
no FPGA: DPLA on
MoM by TU-KL*
103
6000
1984: 1 DPLA
replaces 256
FPGAs
fabricated by E.I.S.
Multi University
Project Chip
103
Speed-ups by vN
Software to FPGA
Migrations
100
1985
1990
Reed-Solomon
Decoding 2400
video-rate
stereo vision MAC
pattern
recognition
730
900
BLAST
52
40
8723
Crypto
3000
CT imaging
1000
288
Viterbi Decoding
Smith-Waterman
pattern matching
457
100
1000
400
SPIHT wavelet-based
image compression
http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html
DNA & protein
sequencing
wireless
real-time face
detection
88
protein
identification
FFT
molecular
dynamics
simulation
Bioinformatics
20
GRAPE
100
© 2015, reiner@hartenstein.de *) TU Kaiserslautern
18
http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf
http://hartenstein.de
Design Rule Check Accelerator
*) PISA DRC accelerator [ICCAD 1984]
TU Kaiserslautern
Processing 4-by-4 Reference Patterns
DPLA: fabricated by the
E.I.S. Multi University Project: http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html
1984: 1 DPLA
replaces 256 FPGAs
© 2015, reiner@hartenstein.de
DPLA was more area-efficient than
FPGA - by several orders of magnitude
19
http://hartenstein.de
The Reconfigurable
Computing Paradox
although the effective
integration density of
FPGAs is by 4 orders of
magnitude behind the
Gordon Moore curve,
because of:
•wiring overhead
• reconfigurability overhead
•routing congestion
•etc.
Reinvent Computing
Enabling software developers to apply their skills
over FPGAs has been a long and, as of yet, unreached
research objective in reconfigurable computing.
© 2015, reiner@hartenstein.de
20
http://hartenstein.de
Obstacles to widespread FPGA adoption
go well beyond the required skill set
- Workshop at FPL_2015
http://reconfigurablecomputing4themasses.net/
© 2015, reiner@hartenstein.de
21
http://hartenstein.de
>> Outline <<
TU Kaiserslautern
• The Power Wall
• “Dataflow” Computing
• Reconfigurable Computing
• Time to Space Mapping
• The Xputer Paradigm
• Conclusions
http://www.uni-kl.de
© 2015, reiner@hartenstein.de
22
http://hartenstein.de
TU Kaiserslautern
Dual paradigm mind set:
an old hat– but was ignored
time to space mapping: procedural to structural:
loop to pipe mapping
PDP-16 RTMs:
1971
why did it
take 25
years to
find out?
token bit
evoke
FF
FF
FF
1967: W. A. Clark: Macromodular Computer
Systems; 1967 SJCC, AFIPS Conf. Proc.
C. G. Bell et al: The Description and Use of RegisterTransfer Modules (RTM's); IEEE Trans-C21/5, May 1972
© 2006, reiner@hartenstein.de
23
http://hartenstein.de
Transformations since the 70ies
(time to time/space mapping)
Loop Transformations: rich methodology published:
[survey: Diss. Karin Schmidt, 1994, Shaker Verlag]
time domain:
procedure domain
program loop
space domain:
structure domain
Strip Mining
Transformation
Pipeline
k time steps,
n DPUs
n x k time steps,
1 CPU
time algorithm
© 2015, reiner@hartenstein.de
space/time algorithmus
24
http://hartenstein.de
The Systolic Arrays (1)
1980
no instruction streams needed
time
(pipe network) DPA
x
x
x
DataPath Array (array of DPUs)
x
x
x
|
x
x
x
|
|
time
x x x
define:
x x x ... which data item
at which time
x x x - at which port
„data streams“
© 2015, reiner@hartenstein.de
execution
transporttriggered
port #
- - - x x x
time
- - - - x x x
- - - - - x x x
port #
M. J. Foster, H. T. Kung: The
Design of Special-Purpose
VLSI Chips ...
IEEE 7th ISCA, La Baule,
France, May 6-8, 1980
input
data
stream
|
|
|
|
|
|
|
|
|
|
|
x
x
x
time
25
x
x
x
|
x
x
x
port #
output
data
streams
H. T. Kung
http://hartenstein.de
TU Kaiserslautern
M. J. Foster and H. T. Kung:
“The Design of SpecialPurpose VLSI Chips ... “
http://karl-steinbuch.org
Systolic Arrays (2)
Karl Steinbuch
why not general purpose ?
It is not sufficient to invent
something. You need to recognize,
that you have invented something.
© 2015, reiner@hartenstein.de
26
http://hartenstein.de
What Synthesis Method?
H. T. Kung: „of course algebraic!“ (linear projection)
only linear pipes
supports only special applications
with strictly regular data dependencies
http://kressarray.de/
My student Rainer Kress replaced it by simulated annealing:
supports also any irregular & wild shape pipe networks
© 2015, reiner@hartenstein.de
27
*) KressArray [ASP-DAC-1995]
http://hartenstein.de
H. T. Kung: “It’s not our job”
TU Kaiserslautern
another Tunnel Vision Symptom
without a sequencer: missed to
invent a new machine paradigm
(the Xputer)
© 2015, reiner@hartenstein.de
*) or receives
28
http://hartenstein.de
The Xputer machine Paradigm
TU Kaiserslautern
(TU-KL)
Xputer literature
is the TU-KL‘s Symbiosis of
Time to Space Mapping and
Reconfigurable Computing.
obtained by adding
auto-sequencing memory
with multiple data counters
instead of a program counter
ASM
GAG
ASM
RAM
data
counter
ASM
ASM: Auto-Sequencing Memory
GAG: Generic Address Generstor
© 2015, reiner@hartenstein.de
ASM
http://hartenstein.de
>> Outline <<
TU Kaiserslautern
• The Power Wall
• “Dataflow” Computing
• Reconfigurable Computing
• Time to Space Mapping
• The Xputer Paradigm
• Conclusions
http://www.uni-kl.de
© 2015, reiner@hartenstein.de
30
http://hartenstein.de
What means Configware?
Xputer pages
Software
Source
time domain
Software to
Configware
Migration
(instruction-procedural)
© 2015, reiner@hartenstein.de
Configware
Source
Placement
& Routing
mapper
Software
Compiler
Software Code
space domain
Configware Code
( time domain)
(structural: space domain)
31
http://hartenstein.de
Compilation: Software vs.
Configware u. Flowware
TU Kaiserslautern
Xputer literature
Software
Engineering
source program
Configware
Engineering
C, FORTRAN
MATHLAB
placement source „program“
& routing
mapper
software
compiler
procedural:
time domain)
configware
compiler
space
domain
data scheduler
software code
configware code
© 2015, reiner@hartenstein.de
32
time domain
flowware code
http://hartenstein.de
Heterogeneous: Co-Compilation
TU Kaiserslautern
Xputer pages
C, FORTRAN, MATHLAB
automatic SW / CW partitioner
Software /
Configware
software Co-Compiler
compiler
mapper
configware
compiler
data scheduler
software code
configware code
© 2015, reiner@hartenstein.de
33
flowware code
http://hartenstein.de
Paradigm Shift Consequences
TU Kaiserslautern
Xputer literature
von Neumann:
CPU
software
Program Counter (PC)
Software Engineering
resources: fixed
algorithm: variable
Xputer: Configware
Configuration Code (CC)
configware
flowware
1 programming
source needed
Engineering
resources: variable
algorithm: variable
2 programming
sources needed
Data Counters (DCs) sequencing code (e. g. see MoPL language)
© 2015, reiner@hartenstein.de
34
http://hartenstein.de
ASM
x
x
x
use data counters,
no program counter
x
x
x
|
|
|
x x x
x x x x x x - -
32 ports, or
n x 32 ports
Xputer pages
© 2015, reiner@hartenstein.de
|
|
|
|
|
|
|
|
|
|
x
x
x
x
x
x
35
|
x
x
x
ASM
other example
|
ASM
100 & more
on-chip ASM
are feasible
x
x
x
ASM
implemented ASM
by distributed ASM
on-chipmemory ASM
ASM
ASM
reconfigurable
(pipe network) rDPA
ASM Data stream
generators
- - - x x x
ASM
- - - - x x x
ASM
- - - - - x x x
ASM
Xputer
machine
paradigm
GAG
RAM
data
counter
ASM: AutoSequencing
Memory
http://hartenstein.de
state register(s):
MoPL
program counter: data counter(s):
Software Languages
read next instruction
goto (instruction address)
jump to (instruction address)
instruction loop
instruction loop nesting
instruction loop escape
instruction stream branching
no: no internally parallel loops
Flowware Languages
read next data item
goto (data address)
jump to (data address)
data loop
data loop nesting
data loop escape
data stream branching
yes: internally parallel loops
But there is an Asymmetry
© 2006, reiner@hartenstein.de
36
more simple:
no ALU tasks
Xputer literature
TU Kaiserslautern
Xputer pages
Duality of procedural Languages
http://hartenstein.de
CoDe-X Dual Paradigm
Application Development
TU Kaiserslautern
high level language
Juergen Becker’s
PhD thesis 1996
C language source
Partitioner
SW
compiler
CPU
CW
compiler
software/configware
co-compiler
software code configware code
instructiondatastreamstreambased
based
CPU
rDPU rDPU rDPU rDPU
rDPU rDPU rDPU rDPU
© 2015, reiner@hartenstein.de
accelerator
hardwired
rDPU rDPU rDPU rDPU
rDPU rDPU rDPU rDPU
reconfigurable
37
accelerator
http://hartenstein.de
>> Outline <<
TU Kaiserslautern
• The Power Wall
• “Dataflow” Computing
• Reconfigurable Computing
• Time to Space Mapping
• The Xputer Paradigm
• Conclusions
http://www.uni-kl.de
© 2015, reiner@hartenstein.de
38
http://hartenstein.de
Illustrating the Paradigm Trap
TU Kaiserslautern
the watering can
model [Hartenstein]
( crippled
watering can )
© 2015, reiner@hartenstein.de
The von Neumann
Manycore Approach
( many
crippled
watering
cans )
The von Neumann
single core Approach
The
Memory
Wall
(1)
many
von
Neumann
bottlenecks
39
http://hartenstein.de
Illustrating the Paradigm Trap
(2)
TU Kaiserslautern
The “Dataflow“ Computer
extremely
complicated:
no watering
can model !
(a power efficiency
break-thru did not
yet happen here)
© 2015, reiner@hartenstein.de
40
http://hartenstein.de
Xputer: the only massively
power-efficient Paradigm
TU Kaiserslautern
the watering can model [Hartenstein]
The Xputer Paradigm
has no von
Neumann
bottleneck
© 2015, reiner@hartenstein.de
41
http://hartenstein.de
thank you !
© 2015, reiner@hartenstein.de
42
http://hartenstein.de
END
© 2015, reiner@hartenstein.de
43
http://hartenstein.de
Backup for discussion:
© 2015, reiner@hartenstein.de
44
http://hartenstein.de
Pipelining through DPU Arrays:
the TU-KL Xputer principle
no memory wall
DPA
massively avoiding
memory cycles
DPU operation is
transport-triggered
|
|
- - - x x x
- - - - x x x
x x x - -
- - - - - x x x
|
|
|
|
|
|
|
|
|
|
|
x
x
x
no message passing
nor thru common memory
45
input data streams
|
x x x
x x x -
no instruction streams
© 2015, reiner@hartenstein.de
x
x
x
x
x
x
x
x
x
x
x
x
output data streams
|
x
x
x
DPA = DPU array
DPU = Data Path Unit
http://hartenstein.de
I-Structures (I = incremental) - part 1
Jurij Silc: Dataflow Architectures
http://csd.ijs.si/courses/dataflow/
26
28
28
© 2015, reiner@hartenstein.de
http://hartenstein.de
27
I-Structures -
part 2
Jurij Silc: Dataflow Architectures
(PE)
http://csd.ijs.si/courses/dataflow/
MIT
Tagged-Token
Dataflow
Architecture
31
29
I-structure
select
© 2015, reiner@hartenstein.de
Istructure
assign
30
http://hartenstein.de
20
Taxonomy
TU Kaiserslautern
Flynn‘s taxonomy:
von Neumann only
Diana‘s taxonomy:
Reconfigurable Computing
Reiner‘s taxonomy:
heterogeneous systems
Reiner‘s 2nd taxonomy:
Xputers only
noI
© 2015, reiner@hartenstein.de
48
http://hartenstein.de
Von Neumann Syndrome
TU Kaiserslautern
Lambert M. Surhone, Mariam T.
Tennoe,
Susan F. Hennessow
(ed.): Von Neumann Syndrome;
ßetascript publishing 2011
© 2015, reiner@hartenstein.de
49
http://hartenstein.de
Download