Dataflow Supercomputers
Michael J. Flynn
Maxeler Technologies and Stanford University
Outline
• History
• Dataflow as a supercomputer technology
• OpenSPL: generalizing the dataflow programming model
• Optimizing the hardware for dataflow
The great parallel processor debate of 1967
• Amdahl espouses the sequential machine and posits Amdahl’s law
• Daniel Slotnick (father of ILLIAC IV) posits the parallel approach while recognizing the problem of programming
“The parallel approach to computing does require
that some original thinking be done about numerical
analysis and data management in order to secure
efficient use. In an environment which has represented the
absence of the need to think as the highest virtue
this is a decided disadvantage.”
-Daniel Slotnick (1967)
…Speedup in parallel processing is achieved by programming effort…
The (multi-core) Parallel Processor Problem
• Efficient distribution of tasks
• Inter-node communications (data assembly & dispatch) reduce computational efficiency (speedup/nodes)
• Memory bottleneck limitations
• The sequential programming model: layers of abstraction hide critical sources of, and limits to, efficient parallel execution
May’s first law: Software efficiency halves every 18 months, exactly compensating for Moore’s Law.
May’s second law: Compiler technology doubles efficiency no faster than once a decade.
- David May
Dataflow as a supercomputer technology
• Looking for another way: some experience from Maxeler
• Dataflow is an old technology (’70s and ’80s); conceptually it creates an ideal machine to match the program, but the interconnect problem was insurmountable in its day.
• Today’s FPGAs have come a long way and enable an emulation of the dataflow machine.
Hardware and Software Alternatives
• Hardware: a reconfigurable heterogeneous accelerator array model
• Software: a spatial (2D) dataflow programming model rather than a sequential model
Accelerator HW model
• Assumes host CPU + FPGA accelerator
• Application consists of two parts
  – Essential (high-usage, >99%) part (kernel(s))
  – Bulk part (<1% dynamic activity)
• Essential part is executed on the accelerator; bulk part on the host (see the sketch below)
• So Slotnick’s law of effort now applies only to a small portion of the application
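To make the split concrete, here is a hypothetical host-side sketch in Java; DFE, DFE.load, dfe.run, readInput and postProcess are illustrative names only, not a documented API:

// Hypothetical host-side view of the accelerator model: the bulk part
// runs on the CPU while the essential kernel streams through the DFE.
float[] input = readInput();              // bulk part: I/O and setup, on the host
float[] output = new float[input.length];
DFE dfe = DFE.load("essentialKernel");    // configure the FPGA with the kernel's dataflow machine
dfe.run(input, output);                   // essential (>99% usage) part, streamed on the accelerator
postProcess(output);                      // bulk part: post-processing, back on the host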
FPGA accelerator hardware model:
server with acceleration cards
Each (essential) program has a data flow graph (DFG).
The ideal HW to execute the DFG is a dataflow machine that exactly matches the DFG.
A compiler / translator transforms the DF machine so that it can be emulated by the FPGA.
FPGA-based accelerators, while slow in cycle time, offer much more flexibility in matching DFGs.
Limitation 1: the DFG is limited in (static) size to O(10^4) nodes.
Limitation 2: only the control structure is matched, not the data access patterns.
Acceleration with Static, Synchronous, Streaming DFMs
• Create a static DFM (unroll loops, etc.; see the sketch following this list); generally the goal is throughput, not latency.
• Create a fully synchronous DFM synchronized to multiple memory channels. The time through the DFM is always the same.
• Stream computations across the long DFM array, creating MISD or pipelined parallelism.
• If silicon area and pin BW allow, create multiple copies of the DFM (as with SIMD or vector computations).
• Iterate on the DFM aspect ratio to optimize speedup.
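As a minimal sketch of the unrolling step, in the pseudo-OpenSPL/SCSVar style of the examples later in this deck (io.input, stream.offset and scsFloat are taken from those examples; the 4-point sum itself is illustrative):

// y[i] = x[i] + x[i+1] + x[i+2] + x[i+3]: the four loop iterations are
// unrolled into three spatial adders forming one static, pipelined DFM.
SCSVar x = io.input("x", scsFloat(7, 17));
SCSVar sum = x
           + stream.offset(x, 1)
           + stream.offset(x, 2)
           + stream.offset(x, 3);
io.output("y", sum, scsFloat(7, 17));

A new x enters on every tick, so throughput is one result per tick regardless of the window length.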
Acceleration with Static, Synchronous, Streaming DFMs
Create a fully synchronous dataflow machine synchronized to multiple memory channels, then stream computations across a long array.
[Figure: a PCIe accelerator card with memory is a DFE (Engine). Data from node memory streams into the FPGA-based DFM through Computation #1, intermediate results are buffered, flow through Computation #2, and results return to memory.]
Example: x² + 30

SCSVar x = io.input("x", scsInt(32));
SCSVar result = x * x + 30;
io.output("y", result, scsInt(32));

[Dataflow graph: x feeds both inputs of a multiplier; the product and the constant 30 feed an adder whose result is output y.]
Example: Moving Average
Y[n] = (X[n-1] + X[n] + X[n+1]) / 3

SCSVar x = io.input("x", scsFloat(7, 17));
SCSVar prev = stream.offset(x, -1);
SCSVar next = stream.offset(x, 1);
SCSVar sum = prev + x + next;
SCSVar result = sum / 3;
io.output("y", result, scsFloat(7, 17));
Example: Choices

SCSVar x = io.input("x", scsUInt(24));
SCSVar result = (x > 10) ? x + 1 : x - 1;
io.output("y", result, scsUInt(24));

[Dataflow graph: x feeds a comparison with 10, an adder (+1) and a subtractor (-1); the comparison drives a multiplexer that selects between the two arms, producing output y. Both arms are computed in space; the condition only selects.]
Data flow graph as generated by the compiler
4866 nodes; about 250x100. Each node represents a line of Java code with area and time parameters, so that the designer can change the aspect ratio to improve pin BW, area usage and speedup.
• 8 dataflow engines (192-384 GB RAM)
• High-speed MaxRing interconnect
• Zero-copy RDMA between CPUs and DFEs over InfiniBand
• Dynamic CPU/DFE balancing
Example: Seismic Data Processing
For oil & gas exploration: distribute a grid of sensors over a large area.
Sonic-impulse the area and record the reflections: frequency, amplitude and delay at each sensor.
Sea-based surveys use 30,000 sensors to record data (120 dB range), each sampled at more than 2 kbps, with a new sonic impulse every 10 seconds.
On the order of terabytes of data each day.
[Figure: survey cube, 1200 m on a side; generates >1 GB every 10 s.]
• Up to 240x speedup for one MAX2 card compared to a single CPU core
• Speedup increases with cube size
• 1-billion-point modelling domain using a single FPGA card
Achieved computational speedup for the entire application (not just the kernel) compared to an Intel server:
• RTM with Chevron: VTI 19x and TTI 25x
• Credit 32x and Rates 26x
• Sparse Matrix 20-40x
• Lattice Boltzmann Fluid Flow 30x
• Seismic Trace Processing 24x
• Conjugate Gradient Opt 26x
So for HPC, how can emulation (FPGA) be better than high-performance x86 processor(s)?
• The multi-core approach lacks robustness in streaming hardware (spanning area, time, power)
• Multi-core lacks a robust parallel software methodology and tools
• FPGAs emulate the ideal dataflow machine
• Success comes from their flexibility in matching the DFG with a synchronous DFM, streaming data through it, and their sheer size (>1 million cells)
• Effort and support tools provide significant application speedup
Generalizing the programming model: OpenSPL
• Open Spatial Programming Language, an orderly way to expose parallelism
• 2D dataflow is the programmer’s model; Java is the syntax
• Could target hardware implementations beyond the DFEs:
  – map onto CPUs (e.g., using OpenMP/MPI)
  – GPUs
  – other accelerators
Temporal Computing (1D)
• A program is a sequence of instructions
• Performance is dominated by:
  – Memory latency
  – ALU availability
[Figure: timeline of CPU and memory activity. For each instruction i: get instruction i, read data, compute, write result, then move on to instruction i+1. The actual computation time is only a small slice of each step; the rest is memory access.]
Spatial Computing (2D)
[Figure: synchronous data movement through a 2D fabric of ALUs, control and buffers. Data streams in, flows through the array, and results stream out; reading data [1..N], computation, and writing results [1..N] overlap in time. Performance is throughput-dominated.]
OpenSPL Basics
• Control and data flows are decoupled
  – both are fully programmable
  – they can run in parallel for maximum performance
• Operations exist in space and by default run in parallel (see the sketch after this list)
  – their number is limited only by the available space
• All operations can be customized at various levels
  – e.g., from the algorithm down to the number representation
• Data sets (actions) stream through the operations
• The data transport and processing can be matched
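A minimal sketch of that default parallelism, in the SCSVar style used earlier (the two-output kernel is illustrative):

// Independent operations become separate units in space: the adder and
// the multiplier below both produce a result on every tick, concurrently.
SCSVar x = io.input("x", scsInt(32));
SCSVar doubled = x + x;    // one adder in space
SCSVar squared = x * x;    // one multiplier in space, running in parallel
io.output("d", doubled, scsInt(32));
io.output("s", squared, scsInt(32));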
OpenSPL Models
• Memory:
  – Fast Memory (FMEM): many, small in size, low latency
  – Large Memory (LMEM): few, large in size, high latency
  – Scalars: many, tiny, lowest latency, fixed during execution
• Execution:
  – data sets + scalar settings are sent as atomic “actions”
  – all data flows through the system synchronously in “ticks”
• Programming:
  – an API allows construction of a computation graph
  – meta-programming allows complex construction (see the sketch after this list)
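As a hedged sketch of meta-programming in the earlier SCSVar style: an ordinary Java loop runs at graph-construction time and unrolls an N-point moving average into a chain of adders (fixing N at construction time is an assumption for illustration):

// The loop executes while the graph is built, not at run time: each
// iteration adds two stream.offset nodes and two adders to the DFM.
int N = 5;                                // window size, fixed when the graph is built
SCSVar x = io.input("x", scsFloat(7, 17));
SCSVar sum = x;
for (int k = 1; k <= N / 2; k++) {
    sum = sum + stream.offset(x, -k);     // k points behind
    sum = sum + stream.offset(x, k);      // k points ahead
}
io.output("y", sum / N, scsFloat(7, 17));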
Spatial Arithmetic
• Operations are instantiated as separate arithmetic units
• Units along data paths use custom arithmetic and number representation
• The above may reduce individual unit sizes
  – can maximize the number that fit on a given SCS
• Data rates of memory and I/O communication may also be maximized due to scaled-down data sizes
[Figure: a standard 32-bit float (sign, 8-bit exponent, 23-bit mantissa) versus a potentially optimal encoding (sign, 3-bit exponent, 10-bit mantissa).]
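In the SCSVar style of the earlier examples, the narrow encoding from the figure is just a change of type parameters; reading scsFloat(exponent, mantissa) this way follows the moving-average example and is an assumption here:

// The same computation in two encodings: the narrow type shrinks every
// unit on the path and the I/O bandwidth the stream consumes.
SCSVar wide = io.input("xw", scsFloat(8, 23));    // standard single-precision layout
SCSVar narrow = io.input("xn", scsFloat(3, 10));  // potentially optimal encoding
io.output("yw", wide * wide + 30, scsFloat(8, 23));
io.output("yn", narrow * narrow + 30, scsFloat(3, 10));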
Spatial Arithmetic at All Levels
• Arithmetic optimizations at the bit level
  – e.g., minimizing the number of ’1’s in binary numbers, leading to linear savings of both space and power (the zeros are omitted in the implementation)
• Higher-level arithmetic optimizations
  – e.g., in matrix algebra, the location of all non-zero elements in sparse matrix computations is important (see the sketch below)
• Spatial encoding of data structures can reduce transfers between memory and computational units (boosting performance and improving efficiency)
  – In temporal computing, encoding and decoding would take time and could eventually cancel out all of the advantages
  – In spatial computing, encoding and decoding just consume a bit of additional space
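For the sparse-matrix point, compressed sparse row (CSR) is one standard encoding that stores only the non-zero elements and their locations; this plain-Java sketch is illustrative and not taken from the talk:

// CSR encoding of a 4x4 matrix with one non-zero per row: streaming
// (value, column) pairs means the zeros are never transferred at all.
double[] values = {5.0, 8.0, 3.0, 6.0};  // the non-zero elements, row by row
int[] cols      = {0, 1, 2, 1};          // column index of each value
int[] rowPtr    = {0, 1, 2, 3, 4};       // where each row starts in values[]

double[] x = {1, 2, 3, 4};               // y = A * x touches only non-zeros
double[] y = new double[4];
for (int r = 0; r < 4; r++)
    for (int k = rowPtr[r]; k < rowPtr[r + 1]; k++)
        y[r] += values[k] * x[cols[k]];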
Benchmarking Spatial Computers
• Spatial computing systems generate one result
during every tick
• SC system efficiency is strongly determined by how
efficiently data can be fed from external sources
• Fair comparison metrics are needed, among others:
– computations per cubic foot of datacenter space
– computations per Watt
– operational costs per computation
Hardware: FPGA Pros & Cons
• The FPGA, while quite suitable for emulation, is not an ideal hardware substrate
  – Too fine-grained, wasted area
  – Expensive
  – Place-and-route times are excessive and will get longer
  – Slow cycle time
• FPGA advantages
  – Commodity part with the best process technology
  – Flexible interconnect
  – Transistor density scaling
  – In principle, possible to quickly reduce to an ASIC
Silicon device density scaling (ITRS 10-year projections)
Net: there are either 20 billion transistors or 50 gigabytes of flash on a 1 cm² die.
Hardware alternatives: DFArray
• Clearly an array structure is attractive
• The LUT is inefficient in the context of dataflow
• Dataflow operations are relatively few and well defined (arithmetic, logic, mux, FIFO and store)
• Flexibility in data sizing is important, but not necessarily down to the bit
• Existing DSPs (really MACs) in FPGAs are prized and well used (2,000 or so per chip)
Hardware alternatives: DFArray
• The flexible interconnect required by the dataflow will still limit the cycle time (perhaps 5x slower than a CPU). This also reduces power requirements.
• The great advantage is the increased operational density (perhaps 10x over existing FPGAs), enabling much larger dataflow machines and greater parallelism.
• Avoids the very long (many hours) place-and-route time required by FPGAs
Hardware alternatives: DFArray
• The big disadvantage: it’s not a commodity part (even an expensive one).
• There are many research issues in determining the best dataflow hardware fabric.
Conclusions
Parallel processing demands rethinking algorithms, the programming approach and environment, and hardware.
The success of FPGA acceleration points to the weakness of evolutionary approaches to parallel processing, hardware (multi-core) and software (C++, etc.), at least for some applications.
The automation of acceleration is still early on; still required: tools, a methodology for writing applications, an analysis methodology and (maybe) a new hardware basis.
For FPGA success, software is key: VHDL and inefficient place and route are big limitations.
Conclusions 2
• In parallel processing: to find success, start with the problem, not the solution.
• There’s a lot of research ahead to effectively create parallel translation technology.
Thank you