TU Kaiserslautern - Xputer Lab Configware Engineering for

PATMOS 2015, the 25th International Workshop on Power
and Timing Modeling, Optimization and Simulation;
Salvador, Bahia, Brazil, Sept 1-5, 2015
Reiner Hartenstein
TU Kaiserslautern
IEEE fellow
FPL fellow
SDPS fellow
http://hartenstein.de
downloadable from
http://xputer.de
How to cope with
the Power Wall
>> Outline <<
• The Power Wall
• “Dataflow” Computing
• Reconfigurable Computing
• Time to Space Mapping
• The Xputer Paradigm
• Conclusions
http://www.uni-kl.de
© 2015, reiner@hartenstein.de
The Workshop Series
Spin-off from the PATMOS project
Project leader: Reiner Hartenstein
Partner leader: Antonio Núñez
Partner leader: Francis Jutand
Oldest conference series on power efficiency
http://hartenstein.de/PATMOS/
Power efficiency is going to become an industry-wide issue.
Some incremental improvements are on track at all abstraction levels;
however, there is still a lot to be done.
Power-Efficient Computing
Power-efficient Microchip Design
(tutorial by Jan Rabaey)
Power-efficient Computer Architectures
Power-efficient Languages and Compilers
Power-efficient Software Implementation
(Power-efficient Memory)
Power-efficient Machine Paradigm (my presentation)
Three tectonic shifts:
ICT infrastructures
• the energy-constrained world
• from internet of people to internet of (every)thing
• the end of scaling as we know it
© Hewlett Packard
Power consumption by internet: x30 until 2030 if trends continue
G. Fettweis, E. Zimmermann: ICT Energy Consumption Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008
It's more than the entire world's total power consumption today!
Data Center at Dallas (© New York Times)
Terminology Problems
Stressing the differences from “Control-Flow” computers, an area called “Dataflow” computers was started in the mid-1970s at MIT.
Xputer area people are forced to sidestep by using terms like “data-driven” or “data streams” …
… although the “Dataflow” scene is “I-Structure”-centered (or “Tagged Token Flow”-centered)
A Second Opinion
D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982
The subtitle:
"... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future."
(Workshops are still active, e.g. the 5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA)
However, the power efficiency breakthrough did not happen here
Speed-up Factors
[Figure: speed-up factors (logarithmic scale, up to 10^6) of migrations from von Neumann software to FPGAs, plus pre-FPGA solutions, plotted over time from 1985 on. Application areas shown: image processing, pattern matching, multimedia; pattern recognition; DES breaking (28514); crypto; CT imaging; Viterbi decoding; Smith-Waterman pattern matching; SPIHT wavelet-based image compression; BLAST; DNA & protein sequencing (8723); Reed-Solomon decoding (2400); video-rate stereo vision; MAC; DSP and wireless; real-time face detection; protein identification; FFT; molecular dynamics simulation; bioinformatics; GRAPE]
Pre-FPGA solution: Design Rule Check accelerator PISA, speed-up 15000 (“fair comparison”); no FPGA: DPLA on MoM by TU-KL*, 1984, more than 15 years earlier
1984: 1 DPLA replaces 256 FPGAs; fabricated by E.I.S. Multi University Project Chip
http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html
*) TU Kaiserslautern
For references see here:
http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf
The Reconfigurable Computing Paradox
These speed-ups are obtained although the effective integration density of FPGAs is 4 orders of magnitude behind the Gordon Moore curve, because of:
• wiring overhead
• reconfigurability overhead
• routing congestion
von Neumann: an extremely power-inefficient paradigm
“von Neumann Syndrome” (C.V. Ramamoorthy)
Reinvent Computing
Obstacles to widespread FPGA adoption
go well beyond the required skill set
- Workshop at FPL_2015
http://reconfigurablecomputing4themasses.net/
What about Acceleration by Graphics Processors?
• Drastically smaller speed-ups, if any (von Neumann)
• Power savings mostly not documented
• R. Vuduc et al.*: “… adding a GPU is equivalent to adding one more multicore CPU socket …”
*) R. Vuduc, J. Choi, M. Guney, A. Shringarpure: On the Limits of GPU Acceleration; Proc. HotPar'10, 2nd USENIX Workshop on Hot Topics in Parallelism, June 14-15, 2010, Berkeley, CA, USA; USENIX Assoc., Berkeley, CA, USA
http://newport.eecs.uci.edu/~amowli/resources/papers/vuduc2010-hotpar.pdf
Dual paradigm mind set: an old hat, but it was ignored
Time to space mapping: procedural to structural (loop to pipe mapping)
PDP-16 RTMs (1971): why did it take 25 years to find out?
[Figure labels: demultiplexer, token bit, evoke, FF]
W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.
C. G. Bell et al.: The Description and Use of Register Transfer Modules (RTM's); IEEE Trans. C-21/5, May 1972
The Systolic Arrays (1)
1980: no instruction streams needed
DPA (pipe network): DataPath Array (array of DPUs)
“data streams”: define which data item arrives at which time at which port
Execution is transport-triggered
Kung's example (algebra)
[Figure: input data stream and output data streams, skewed over time and port #, flowing through the DPA]
M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips ...; IEEE 7th ISCA, La Baule, France, May 6-8, 1980
[Photo: H. T. Kung]
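The notion of a data stream ("which data item at which time at which port") can be made concrete with a small simulation. The sketch below is not Kung's original example: it assumes a linear array of DPUs computing a matrix-vector product, with one vector element held in each DPU and the matrix elements arriving as skewed streams. The names skewed_schedule and systolic_matvec are hypothetical.

```python
def skewed_schedule(a):
    """Data stream definition: element a[i][j] enters port j at time i + j."""
    schedule = {}
    for i, row in enumerate(a):
        for j, value in enumerate(row):
            schedule.setdefault(i + j, {})[j] = value
    return schedule


def systolic_matvec(a, x):
    """Compute y = a * x on a linear array of len(x) DPUs (assumed layout).

    DPU j holds x[j]. Partial sums move one DPU to the right per time step.
    No instruction stream exists: a DPU fires when its data items arrive.
    """
    m, n = len(a), len(x)
    schedule = skewed_schedule(a)
    pipe = [None] * n          # pipe[j]: partial sum at the output of DPU j
    results = []
    for t in range(m + n - 1):
        arriving = schedule.get(t, {})
        new_pipe = [None] * n
        for j in range(n):
            if j == 0:
                # a fresh partial sum (0) enters DPU 0 once per matrix row
                incoming = 0 if t < m else None
            else:
                incoming = pipe[j - 1]
            if incoming is not None:
                new_pipe[j] = incoming + arriving[j] * x[j]
        pipe = new_pipe
        if pipe[n - 1] is not None:
            results.append(pipe[n - 1])   # finished sum leaves the last DPU
    return results


print(systolic_matvec([[1, 2], [3, 4], [5, 6]], [10, 1]))
```

The schedule dictionary is the whole "program" here: it only states arrival times and ports, which is the point of the slide.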
Systolic Arrays (2)
M. J. Foster and H. T. Kung: “The Design of Special-Purpose VLSI Chips ...”
“It is not sufficient to invent something. You need to recognize that you have invented something.” (Karl Steinbuch)
http://karl-steinbuch.org
What Synthesis Method?
H. T. Kung (a mathematician): “of course algebraic!” (linear projection)
• only linear pipes
• supports only very special applications with strictly regular data dependencies
My student Rainer Kress replaced it by simulated annealing (KressArray [ASP-DAC-1995], http://kressarray.de/):
this also supports irregular and wild-shaped pipe networks
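To illustrate why simulated annealing copes with irregular pipe networks, here is a minimal placement sketch. It is not the KressArray mapper: it assumes a plain rectangular grid of DPU slots, a cost function equal to the total Manhattan wire length, and a geometric cooling schedule. All names and parameters (anneal_placement, t_start, t_end, and so on) are hypothetical.

```python
import math
import random


def wire_length(positions, edges):
    """Total Manhattan distance over all connections of the pipe network."""
    return sum(abs(positions[a][0] - positions[b][0]) +
               abs(positions[a][1] - positions[b][1]) for a, b in edges)


def anneal_placement(nodes, edges, rows, cols,
                     steps=4000, t_start=4.0, t_end=0.05, seed=0):
    """Place operators on a rows x cols grid by simulated annealing."""
    if len(nodes) > rows * cols:
        raise ValueError("grid too small for the given operators")
    rng = random.Random(seed)
    slots = list(nodes) + [None] * (rows * cols - len(nodes))
    rng.shuffle(slots)

    def positions():
        return {n: divmod(i, cols) for i, n in enumerate(slots) if n is not None}

    cost = wire_length(positions(), edges)
    best, best_cost = positions(), cost
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)   # cooling
        i, j = rng.randrange(len(slots)), rng.randrange(len(slots))
        slots[i], slots[j] = slots[j], slots[i]                # trial move
        new_cost = wire_length(positions(), edges)
        # always accept improvements; accept worse moves with Boltzmann probability
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = positions(), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]            # undo the move
    return best, best_cost


# An irregular network: a chain with a side branch and a feedback connection.
nodes = ["in", "mul", "add", "shift", "cmp", "out"]
edges = [("in", "mul"), ("mul", "add"), ("add", "shift"),
         ("shift", "cmp"), ("cmp", "out"), ("mul", "cmp"), ("cmp", "add")]
placement, cost = anneal_placement(nodes, edges, rows=3, cols=3)
print(cost, placement)
```

Nothing in the loop depends on the network being regular or linear, which is what a purely algebraic projection method requires.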
H. T. Kung: “It’s not our job”
Another tunnel vision symptom:
without a sequencer, the chance to invent a new machine paradigm (the Xputer) was missed
The Xputer Machine Paradigm (TU-KL)
is the TU-KL's symbiosis of Time to Space Mapping and Reconfigurable Computing!
It is obtained by adding auto-sequencing memory (ASM),
with data counters instead of a program counter.
GAG: Generic Address Generator
[Figure labels: ASM blocks, each consisting of GAG, RAM, and data counter]
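How a data counter replaces a program counter can be sketched in a few lines. The following is only an illustration under stated assumptions, not the TU-KL GAG design: the address generator is modelled as nested counters with per-level strides, and the memory as a flat Python list. The names address_stream and auto_sequencing_memory are hypothetical.

```python
def address_stream(base, limits, strides):
    """Yield the address sequence of nested data counters (limits[0] is outermost)."""
    if any(limit <= 0 for limit in limits):
        raise ValueError("every loop limit must be positive")
    counters = [0] * len(limits)
    while True:
        yield base + sum(c * s for c, s in zip(counters, strides))
        level = len(counters) - 1
        while level >= 0:                  # increment innermost counter, carry outward
            counters[level] += 1
            if counters[level] < limits[level]:
                break
            counters[level] = 0
            level -= 1
        if level < 0:                      # outermost counter wrapped: scan finished
            return


def auto_sequencing_memory(ram, base, limits, strides):
    """Turn a RAM block plus an address generator into a data stream."""
    for address in address_stream(base, limits, strides):
        yield ram[address]


# A 4 x 4 array stored row by row; scan the 2 x 2 window starting at row 1, column 1.
ram = list(range(100, 116))
print(list(auto_sequencing_memory(ram, base=5, limits=(2, 2), strides=(4, 1))))
```

The scan pattern is fixed by base, limits, and strides before the run, so no address arithmetic is fetched or executed as instructions while the data stream is running.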
Duality of procedural Languages

                            Software Languages              Flowware Languages (e.g. MoPL)
state register(s):          program counter                 data counter(s)
                            read next instruction           read next data item
                            goto (instruction address)      goto (data address)
                            jump to (instruction address)   jump to (data address)
                            instruction loop                data loop
                            instruction loop nesting        data loop nesting
                            instruction loop escape         data loop escape
                            instruction stream branching    data stream branching
internally parallel loops:  no                              yes

But there is an Asymmetry: Flowware is simpler (no ALU tasks)
Compilation: Software vs. Configware and Flowware
Software Engineering (procedural: time domain):
source program (C, etc.) -> software compiler -> software code
Configware Engineering (time to space mapping):
source “program” -> configware compiler:
  mapper -> configware code (space domain)
  data scheduler -> flowware code (time domain)
Heterogeneous: Co-Compilation (important: why?)
Source: C, or other high level language
CoDe-X Software / Configware Co-Compiler (Jürgen Becker's Ph.D. thesis):
automatic SW / CW partitioner, feeding
  software compiler -> software code
  configware compiler: mapper -> configware code; data scheduler -> flowware code
Illustrating the Paradigm Trap (1)
the watering can model [Hartenstein]
The von Neumann single core Approach: The Memory Wall (crippled watering can)
The von Neumann Manycore Approach: many von Neumann bottlenecks (many crippled watering cans)
Illustrating the Paradigm Trap (2)
The “Dataflow” Computer: extremely complicated, no watering can model!
(a power efficiency breakthrough did not happen here)
Xputer: the only massively power-efficient Paradigm
the watering can model [Hartenstein]
The Xputer Paradigm has no von Neumann bottleneck,
fully supporting Reconfigurable Computing
We need a Seismic Shift …
… to avoid future unaffordable ICT power consumption cost
It’s an extremely massive challenge!
The software from more than half a century sits squarely on top;
that’s why heterogeneous is important.
For many more years we must work under a heterogeneous triple-paradigm mind set:
Configware, Flowware, and still even Software
thank you !
END
Backup for discussion:
Reconfigurable Computing (RC): the intensive Impact
Speed-ups by von Neumann to RC Migrations
SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster
[Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]

Application                   Speed-up factor   Savings divisor
                                                Power    Cost    Size
DNA and Protein sequencing    8723              779      22      253
DES breaking                  28514             3439     96      1116

Massively saving energy; much less equipment needed;
much less memory and bandwidth needed
Taxonomy
Flynn's taxonomy: von Neumann only
Diana's taxonomy: Reconfigurable Computing
Reiner's taxonomy: heterogeneous systems
Reiner's 2nd taxonomy: Xputers only
First FPGA available 1984 from Xilinx
[Figure: FPGA fabric built from look-up tables (LUTs)]
Transformations since the 1970s (time to time/space mapping)
Loop Transformations: rich methodology published
[survey: Diss. Karin Schmidt, 1994, Shaker Verlag]
time domain (procedure domain): program loop, n x k time steps, 1 CPU (time algorithm)
Strip Mining Transformation
space domain (structure domain): Pipeline, k time steps, n DPUs (space/time algorithm)
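The step count behind strip mining can be checked with a toy model. This sketch makes simplifying assumptions: every loop iteration is independent, one operation costs one time step, and the n DPUs are only modelled by counting a strip of n items as a single step. The function names are hypothetical.

```python
def sequential(data, op):
    """Time domain: one CPU, one item per time step."""
    out, steps = [], 0
    for item in data:
        out.append(op(item))
        steps += 1
    return out, steps


def strip_mined(data, op, n):
    """Time/space domain: the loop is cut into strips of n items.

    Each strip is handed to n DPUs working side by side, so a whole
    strip is counted as one time step.
    """
    out, steps = [], 0
    for start in range(0, len(data), n):          # outer loop over strips (time)
        strip = data[start:start + n]
        out.extend(op(item) for item in strip)    # inner loop unrolled into space
        steps += 1
    return out, steps


data = list(range(12))                            # n x k = 4 x 3 items
square = lambda v: v * v
print(sequential(data, square)[1], strip_mined(data, square, n=4)[1])
```

With n = 4 and k = 3 the sequential version needs n x k = 12 steps and the strip-mined version k = 3 steps, while both return the same results.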
The Reconfigurable Computing Paradox
although the effective integration density of FPGAs is 4 orders of magnitude behind the Gordon Moore curve, because of:
• wiring overhead
• reconfigurability overhead
• routing congestion
• etc.
Reinvent Computing
Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing.
Xputer machine paradigm
use data counters, no program counter
ASM: Auto-Sequencing Memory, implemented by distributed on-chip memory
ASM usage: data stream generator
[Figure: general purpose reconfigurable rDPA (pipe network) fed by ASM blocks; skewed data streams enter and leave the array; each ASM block is drawn with GAG, RAM, and data counter]
GAG: Generic Address Generators (reconfigurable: to avoid memory-cycle-hungry address computation)
Paradigm Shift Consequences
von Neumann: Software Engineering
  CPU resources: fixed; algorithm: variable
  1 programming source needed: software
  state register: Program Counter (PC)
Xputer: Configware Engineering
  resources: variable; algorithm: variable
  2 programming sources needed:
    configware: Configuration Code (CC)
    flowware: sequencing code for the Data Counters (DCs) (e. g. see MoPL language)
Xputer and vN: Heterogeneous Engineering
  all 3 programming sources needed
Pipelining through DPU Arrays: the TU-KL Xputer principle
• no memory wall: massively avoiding memory cycles
• DPU operation is transport-triggered
• no instruction streams
• no message passing, nor communication through common memory
[Figure: skewed input data streams entering and output data streams leaving the DPA]
DPA = DPU array
DPU = Data Path Unit
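"Transport-triggered" means that a DPU fires when a data item arrives, not when an instruction is fetched. A minimal push-style sketch of this idea follows. It assumes one input per DPU and models the wiring between DPUs as direct callbacks; the class DPU and the helper build_pipe are hypothetical names, not part of any Xputer tool.

```python
class DPU:
    """Data Path Unit: applies a fixed operation whenever an item arrives."""

    def __init__(self, operation, downstream):
        self.operation = operation      # configured once, before the run
        self.downstream = downstream    # wiring to the next DPU or to an output port

    def receive(self, item):
        # The arrival of the item is the only trigger: no program counter,
        # no instruction fetch, no intermediate result written to memory.
        self.downstream(self.operation(item))


def build_pipe(operations, sink):
    """Wire DPUs into a linear pipe; return the input port of the first one."""
    port = sink
    for operation in reversed(operations):
        port = DPU(operation, port).receive
    return port


results = []
input_port = build_pipe([lambda v: v + 1, lambda v: v * 2], results.append)
for item in [1, 2, 3]:                  # the input data stream
    input_port(item)
print(results)
```

Intermediate values pass directly from one DPU to the next, which is the sense in which the pipe avoids memory cycles.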
Von Neumann Syndrome
Lambert M. Surhone, Mariam T. Tennoe, Susan F. Henssonow (ed.): Von Neumann Syndrome; ßetascript publishing 2011
Computing Paradigms
[Figure labels: CPU with program counter; Xputer with rDPU and data counters; DFC with DPUs and I-structure]

term         program counter   execution triggered by             paradigm
CPU          yes               instruction fetch                  instruction-stream-based (von Neumann)
DPU / RPU    no                data arrival**                     data-driven or data-stream-based (Reconfigurable Computing)
DFC*         no                complicated I-structure handling   “Dataflow” Computer

*) based on tagged token “I-Structure”
**) “transport-triggered”
MIT Tagged Token Dataflow Architecture*
(I would call it Tagged Token Flow Architecture)
• no Program Counter
• no updateable global store
• “instruction is executed even if some of its operands are not yet available”
[Figure labels: PEs and I-Structure Storage units connected to/from the Communication Network; RU, SU, Token Queue, Wait-Match Unit & Waiting Token Store, Instruction Fetch Unit, Program Store & Constant Store, ALU & Form Tag, Form Token Unit]
*) source: Jurij Silc
Problems with such “Dataflow” Architectures: very complex data structures!
“each update consumes the structure and the value produces a new data structure”: “awkward or even impossible to implement” [source: Jurij Silc]
Solution: I-Structure concept*
Element states of an I-Structure:
• “can be read but not written”
• “read request deferred but write operation is allowed”
• “at least one read request has been deferred”
I would call it “Tagged Token Flow” (= Tagged Token Structure)
*) “I-Structure Flow” instead of “Dataflow”
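The three quoted element states can be modelled directly. The sketch below is an illustration only, not the MIT hardware: it assumes the conventional state names present, absent, and waiting for the three quotes, and it represents a deferred read request as a Python callback. The class name IStructureCell is hypothetical.

```python
class IStructureCell:
    """One write-once element of an I-Structure.

    absent  : read request deferred, but a write operation is allowed
    waiting : at least one read request has been deferred
    present : can be read but not written
    """

    def __init__(self):
        self.state = "absent"
        self.value = None
        self.deferred = []              # read requests waiting for the value

    def read(self, consumer):
        if self.state == "present":
            consumer(self.value)        # value available: answer at once
        else:
            self.deferred.append(consumer)
            self.state = "waiting"

    def write(self, value):
        if self.state == "present":
            raise RuntimeError("an I-Structure element may be written only once")
        self.value = value
        self.state = "present"
        pending, self.deferred = self.deferred, []
        for consumer in pending:        # serve every deferred read request
            consumer(value)


cell = IStructureCell()
seen = []
cell.read(seen.append)                  # too early: the request is deferred
cell.write(7)                           # the deferred request is served now
cell.read(seen.append)                  # served immediately
print(cell.state, seen)
```

Even this toy version needs per-element state and a queue of deferred requests, which hints at the bookkeeping effort the slide calls very complex.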
I-Structures (I = incremental) - part 1
Jurij Silc: Dataflow Architectures
http://csd.ijs.si/courses/dataflow/
I-Structures - part 2
Jurij Silc: Dataflow Architectures
http://csd.ijs.si/courses/dataflow/
[Figure: MIT Tagged-Token Dataflow Architecture (PE), with I-structure select and I-structure assign]
Power Efficiency of Programming Languages (an example)
How big is big ? (© Hewlett Packard)