PVTOL Overview

Parallel Vector Tile-Optimized Library
(PVTOL) Architecture
Jeremy Kepner, Nadya Bliss, Bob Bond, James Daly, Ryan Haney, Hahn Kim, Matthew Marzilli, Sanjeev Mohindra, Edward Rutledge, Sharon Sacco, Glenn Schrader
MIT Lincoln Laboratory
May 2007
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and
are not necessarily endorsed by the United States Government.
Outline
• Introduction
• PVTOL Machine Independent Architecture
  – Machine Model
  – Hierarchal Data Objects
  – Data Parallel API
  – Task & Conduit API
  – pMapper
• PVTOL on Cell
  – The Cell Testbed
  – Cell CPU Architecture
  – PVTOL Implementation Architecture on Cell
  – PVTOL on Cell Example
  – Performance Results
• Summary
PVTOL Effort Overview
Goal: Prototype advanced software technologies to exploit novel processors (tiled processors, CPUs in disk drives) for DoD sensors
• Have demonstrated 10x performance benefit of tiled processors
• Novel storage should provide 10x more IO
Approach: Develop Parallel Vector Tile Optimizing Library (PVTOL) for high performance and ease-of-use
[Figure: an automated parallel mapper assigns the stages of an FFT pipeline (A → FFT → B → FFT → C) to processors P0, P1, P2; hierarchical arrays; ~1 TByte RAID disks]
DoD Relevance: Essential for flexible, programmable sensors with large IO and processing requirements (wideband digital arrays, massive storage)
• Wide area data
• Collected over many time scales
Mission Impact:
• Enabler for next-generation synoptic, multi-temporal sensor systems
Technology Transition Plan
• Coordinate development with sensor programs
• Work with DoD and Industry standards bodies (DoD software standards)
Embedded Processor Evolution
High Performance Embedded Processors
[Chart: MFLOPS / Watt (10 to 10000) vs. year (1990–2010) for i860 XR, i860, SHARC, PowerPC 603e, 750, PowerPC with AltiVec (MPC7400, MPC7410, MPC7447A), and Cell (estimated)]
• 20 years of exponential growth in FLOPS / Watt
• Requires switching architectures every ~5 years
• Cell processor is current high performance architecture
Cell Broadband Engine
• Cell was designed by IBM, Sony and Toshiba
• Asymmetric multicore processor
  – 1 PowerPC core + 8 SIMD cores
• Playstation 3 uses Cell as main processor
• Mercury provides Cell-based computer systems for high-performance applications
Multicore Programming Challenge
Past Programming Model: Von Neumann
• Great success of Moore's Law era
  – Simple model: load, op, store
  – Many transistors devoted to delivering this model
• Moore's Law is ending
  – Need transistors for performance
Future Programming Model: ???
• Processor topology includes:
  – Registers, cache, local memory, remote memory, disk
• Cell has multiple programming models
Increased performance comes at the cost of exposing complexity to the programmer
Parallel Vector Tile-Optimized Library (PVTOL)
• PVTOL is a portable and scalable middleware library for multicore processors
• Enables incremental development:
  1. Develop serial code (desktop)
  2. Parallelize code (cluster)
  3. Deploy code (embedded computer)
  4. Automatically parallelize code
Make parallel programming as easy as serial programming
PVTOL Development Process
Serial PVTOL code
void main(int argc, char *argv[]) {
  // Initialize PVTOL
  process pvtol(argc, argv);

  // Create input, weights, and output matrices
  typedef Dense<2, float, tuple<0, 1> > dense_block_t;
  typedef Matrix<float, dense_block_t, LocalMap> matrix_t;
  matrix_t input(num_vects, len_vect),
           filts(num_vects, len_vect),
           output(num_vects, len_vect);

  // Initialize arrays
  ...

  // Perform TDFIR filter
  output = tdfir(input, filts);
}
PVTOL Development Process
Parallel PVTOL code
void main(int argc, char *argv[]) {
  // Initialize PVTOL
  process pvtol(argc, argv);

  // Add parallel map
  RuntimeMap map1(...);

  // Create input, weights, and output matrices
  typedef Dense<2, float, tuple<0, 1> > dense_block_t;
  typedef Matrix<float, dense_block_t, RuntimeMap> matrix_t;
  matrix_t input(num_vects, len_vect, map1),
           filts(num_vects, len_vect, map1),
           output(num_vects, len_vect, map1);

  // Initialize arrays
  ...

  // Perform TDFIR filter
  output = tdfir(input, filts);
}
PVTOL Development Process
Embedded PVTOL code
void main(int argc, char *argv[]) {
  // Initialize PVTOL
  process pvtol(argc, argv);

  // Add hierarchical map
  RuntimeMap map2(...);

  // Add parallel map
  RuntimeMap map1(..., map2);

  // Create input, weights, and output matrices
  typedef Dense<2, float, tuple<0, 1> > dense_block_t;
  typedef Matrix<float, dense_block_t, RuntimeMap> matrix_t;
  matrix_t input(num_vects, len_vect, map1),
           filts(num_vects, len_vect, map1),
           output(num_vects, len_vect, map1);

  // Initialize arrays
  ...

  // Perform TDFIR filter
  output = tdfir(input, filts);
}
PVTOL Development Process
Automapped PVTOL code
void main(int argc, char *argv[]) {
  // Initialize PVTOL
  process pvtol(argc, argv);

  // Create input, weights, and output matrices (maps determined automatically)
  typedef Dense<2, float, tuple<0, 1> > dense_block_t;
  typedef Matrix<float, dense_block_t, AutoMap> matrix_t;
  matrix_t input(num_vects, len_vect),
           filts(num_vects, len_vect),
           output(num_vects, len_vect);

  // Initialize arrays
  ...

  // Perform TDFIR filter
  output = tdfir(input, filts);
}
PVTOL Components
• Performance
  – Achieves high performance
• Portability
  – Built on standards, e.g. VSIPL++
• Productivity
  – Minimizes effort at user level
PVTOL Architecture
Portability: Runs on a range of architectures
Performance: Achieves high performance
Productivity: Minimizes effort at user level
PVTOL preserves the simple load-store programming model in software
Machine Model - Why?
• Provides description of underlying hardware
• pMapper: Allows for simulation without the hardware
• PVTOL: Provides information necessary to specify map hierarchies
[Figure: hardware is described by a machine model parameter list]
  size_of_double =
  cpu_latency =
  cpu_rate =
  mem_latency =
  mem_rate =
  net_latency =
  net_rate =
  ...
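To make the parameter list concrete, a machine-model entry could be represented as a small record like the sketch below; the field names follow the slide, but the struct itself, its units, and the example values are illustrative assumptions rather than PVTOL's actual machine-model API.

#include <cstddef>

// Hypothetical sketch of a machine-model parameter record (not the PVTOL API).
// Field names follow the slide; units and values are assumed.
struct NodeModelParams {
  std::size_t size_of_double;  // bytes per double
  double cpu_latency;          // seconds to issue an operation
  double cpu_rate;             // operations per second
  double mem_latency;          // seconds per memory access
  double mem_rate;             // bytes per second to/from memory
  double net_latency;          // seconds per message
  double net_rate;             // bytes per second over the network
};

// Illustrative values for one node
const NodeModelParams exampleNode = {8, 1e-9, 2.4e9, 1e-7, 25.6e9, 1e-5, 1e9};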
PVTOL Machine Model
• Requirements
  – Provide hierarchical machine model
  – Provide heterogeneous machine model
• Design
  – Specify a machine model as a tree of machine models
  – Each subtree or node can be a machine model in its own right
[Figure: example machine model tree — Entire Network → Cluster of Clusters → {CELL Cluster (CELLs, each with SPE 0–7 and their local stores (LS)), Dell Cluster 2 GHz (Dell 0 – Dell 15), Dell Cluster 3 GHz (Dell 0 – Dell 32)}]
Machine Model UML Diagram
A machine model constructor can consist of just node information (flat) or additional children information (hierarchical).
[UML: a Machine Model has a Node Model and 0..* child Machine Models; a Node Model has 0..1 each of a CPU Model, Memory Model, Disk Model, and Comm Model]
A machine model can take a single machine model description (homogeneous) or an array of descriptions (heterogeneous).
The PVTOL machine model differs from the PVL machine model in that it separates the Node (flat) and Machine (hierarchical) information.
Machine Models and Maps
Machine model is tightly coupled to the maps in the application.
• Machine model defines layers in the tree
• Maps provide mapping between layers
[Figure: the machine model tree (Entire Network → Cluster of Clusters → CELL Cluster / Dell Clusters → CELLs / Dells → SPEs → local stores (LS)); a map (grid, dist, policy, ...) is attached at each level to map data between adjacent layers]
*Cell node includes main memory
Example: Dell Cluster
clusterMap = (grid: 1x8, dist: block, policy: default, nodes: 0:7, map: dellMap)
dellMap    = (grid: 4x1, dist: block, policy: default)
[Figure: matrix A* is distributed across the Dell cluster (Dell 0, Dell 1, Dell 2, ...), and each node's piece is further split into its cache]

NodeModel nodeModelCluster, nodeModelDell, nodeModelCache;
// Flat machine model constructor
MachineModel machineModelCache = MachineModel(nodeModelCache);
// Hierarchical machine model constructors
MachineModel machineModelDell = MachineModel(nodeModelDell, 1, machineModelCache);
MachineModel machineModelMyCluster = MachineModel(nodeModelCluster, 32, machineModelDell);

*Assumption: each block fits into the cache of each Dell node.
Example: 2-Cell Cluster
clusterMap = (grid: 1x2, dist: block, policy: default, nodes: 0:1, map: cellMap)
cellMap    = (grid: 1x4, dist: block, policy: default, nodes: 0:3, map: speMap)
speMap     = (grid: 4x1, dist: block, policy: default)
[Figure: matrix A* is distributed across the CLUSTER (CELL 0, CELL 1), then across SPEs 0–7 of each Cell, then into each SPE's local store (LS)]

NodeModel nmCluster, nmCell, nmSPE, nmLS;
// Flat machine model constructor
MachineModel mmLS = MachineModel(nmLS);
// Hierarchical machine model constructors
MachineModel mmSPE = MachineModel(nmSPE, 1, mmLS);
MachineModel mmCell = MachineModel(nmCell, 8, mmSPE);
MachineModel mmCellCluster = MachineModel(nmCluster, 2, mmCell);

*Assumption: each block fits into the local store (LS) of the SPE.
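The pictured map attributes can be expressed with the Grid / DataDist / ProcList / DataMgmtPolicy / RuntimeMap constructors that appear later in the Data Declaration slides. The sketch below is an assumption about how the three maps of this 2-Cell example would be written (the 2-D constructor forms and the meaning of the DataDist argument are guesses), not code taken from the original slides.

// Hedged sketch: the three maps above, written with RuntimeMap-style constructors.

// speMap: 4x1 block distribution of each SPE's data into its local store (temporal)
Grid speGrid(4, 1, Grid.ARRAY);
DataDist speDist(2);                                // block distribution
DataMgmtPolicy spePolicy(DataMgmtPolicy.DEFAULT);
RuntimeMap speMap(speGrid, speDist, spePolicy);

// cellMap: 1x4 block distribution across SPEs 0:3 of a Cell, tiled by speMap
Grid cellGrid(1, 4, Grid.ARRAY);
DataDist cellDist(2);
Vector<int> speRanks(4);
speRanks(0) = 0; speRanks(1) = 1; speRanks(2) = 2; speRanks(3) = 3;
ProcList speList(speRanks);
DataMgmtPolicy cellPolicy(DataMgmtPolicy.DEFAULT);
RuntimeMap cellMap(cellGrid, cellDist, speList, cellPolicy, speMap);

// clusterMap: 1x2 block distribution across Cells 0:1, then mapped by cellMap
Grid clusterGrid(1, 2, Grid.ARRAY);
DataDist clusterDist(2);
Vector<int> cellRanks(2);
cellRanks(0) = 0; cellRanks(1) = 1;
ProcList cellList(cellRanks);
RuntimeMap clusterMap(clusterGrid, clusterDist, cellList, cellMap);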
Machine Model Design Benefits
Simplest case (mapping an array onto a cluster of nodes) can be defined in a familiar fashion (PVL, pMatlab):
  clusterMap = (grid: 1x8, dist: block, nodes: 0:7)
[Figure: matrix A* distributed across a Dell cluster (Dell 0, Dell 1, Dell 2, ...)]
Ability to define heterogeneous models allows execution of different tasks on very different systems:
  taskMap = (nodes: [Cell Dell])
[Figure: tasks mapped across a CLUSTER containing a CELL 0 and a Dell Cluster]
Hierarchical Arrays UML
[UML: a View (Vector, Matrix, or Tensor) is associated with Iterators, a Block, and a Map (LocalMap or RuntimeMap); a RuntimeMap owns 0..* layer managers; the DistributedManager / DistributedMapping hierarchy provides the layer-specific classes]
Layer managers and their mappings: NodeLayerManager / NodeLayerMapping, DiskLayerManager / DiskLayerMapping, OocLayerManager / OocLayerMapping, SwLayerManager / SwLayerMapping, HwLayerManager / HwLayerMapping, TileLayerManager / TileLayerMapping
Isomorphism
[Figure: CELL Cluster → CELL 0 / CELL 1 → SPE 0 ... SPE 7 → LS]
Each level of the machine model corresponds to a map and a layer manager:
• (grid: 1x2, dist: block, nodes: 0:1, map: cellMap) ↔ NodeLayerManager (upperIf: none, lowerIf: heap)
• (grid: 1x4, dist: block, policy: default, nodes: 0:3, map: speMap) ↔ SwLayerManager (upperIf: heap, lowerIf: heap)
• (grid: 4x1, dist: block, policy: default) ↔ TileLayerManager (upperIf: heap, lowerIf: tile)
Machine model, maps, and layer managers are isomorphic
Hierarchical Array Mapping
Machine Model: CELL Cluster → CELL 0 / CELL 1 → SPE 0 ... SPE 7 → LS
Hierarchical Map:
  clusterMap = (grid: 1x2, dist: block, nodes: 0:1, map: cellMap)
  cellMap    = (grid: 1x4, dist: block, policy: default, nodes: 0:3, map: speMap)
  speMap     = (grid: 4x1, dist: block, policy: default)
Hierarchical Array:
[Figure: the array is split between CELL 0 and CELL 1, each Cell's piece is split across SPEs 0–7, and each SPE's piece is tiled into its local store (LS)]
*Assumption: each block fits into the local store (LS) of the SPE. CELL X implicitly includes main memory.
Spatial vs. Temporal Maps
• Spatial Maps
  – Physical: distribute across multiple processors
  – Logical: distribute across multiple processors; assign ownership of array indices in main memory to tile processors; may have a deep or shallow copy of data
• Temporal Maps
  – Partition data owned by a single storage unit into multiple blocks
  – Storage unit loads one block at a time
  – E.g. out-of-core, caches
[Figure: the CELL Cluster maps (grid: 1x2, dist: block, nodes: 0:1, map: cellMap) and (grid: 1x4, dist: block, policy: default, nodes: 0:3, map: speMap) distribute data spatially across Cells and SPEs; the final map (grid: 4x1, dist: block, policy: default) distributes each SPE's data temporally into its local store (LS)]
Layer Managers
• Manage the data distributions between adjacent levels in the machine model
• NodeLayerManager (upperIf: none, lowerIf: heap): spatial distribution between nodes
• DiskLayerManager (upperIf: none, lowerIf: disk): spatial distribution between disks
• OocLayerManager (upperIf: disk, lowerIf: heap): temporal distribution between a node's disk and main memory (deep copy)
• SwLayerManager (upperIf: heap, lowerIf: heap): spatial distribution between two layers in main memory (shallow/deep copy)
• HwLayerManager (upperIf: heap, lowerIf: cache): temporal distribution between main memory and cache (deep/shallow copy)
• TileLayerManager (upperIf: heap, lowerIf: tile): temporal distribution between main memory and tile processor memory (deep copy)
Note: these managers imply that there is main memory at the SPE level.
Tile Iterators
• Iterators are used to access temporally distributed tiles
• Kernel iterators
  – Used within kernel expressions
• User iterators
  – Instantiated by the programmer
  – Used for computation that cannot be expressed by kernels
  – Row-, column-, or plane-order
• Data management policies specify how to access a tile
  – Save data
  – Load data
  – Lazy allocation (pMappable)
  – Double buffering (pMappable)
[Figure: a row-major iterator visits tiles 1, 2, 3, 4 distributed across SPE 0 and SPE 1 of CELL 0 and CELL 1]
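A sketch of what a user iterator loop over temporally distributed tiles looks like; beginLinear( ) / endLinear( ) follow the tiled pulse compression example later in this deck, while the variable names and the kernel call are illustrative.

// Sketch of a user iterator walking temporally distributed tiles (illustrative).
dataIter   = data.beginLinear(0, 1);    // row-major (linear) tile order
outputIter = output.beginLinear(0, 1);
while (dataIter != data.endLinear()) {
  // The tile layer manager loads the current tile into local store according to
  // the data management policy (load/save, lazy allocation, double buffering),
  // the computation runs on that tile, and the result is written back.
  output = someKernel(data);
  dataIter++;
  outputIter++;
}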
Pulse Compression Example
[Figure: CPIs (CPI 0, CPI 1, CPI 2, ...) stream through a CELL Cluster: CELL 0 performs the DIT (data input task), CELL 1 the DAT (data analysis task), and CELL 2 the DOT (data output task); each Cell distributes its work across SPEs and their local stores]
API Requirements
• Support transitioning from serial to parallel to hierarchical code without significantly rewriting code
[Figure: PVL supports uniprocessor → parallel processor → embedded parallel processor, with data that fits in main memory; PVTOL additionally supports uniprocessor w/ cache optimizations, parallel processor w/ cache optimizations, and parallel tiled processor, with data that fits in main memory, in cache, or in tile memory]
Data Types
• Block types
  – Dense
• Element types
  – int, long, short, char, float, double, long double
• Views
  – Vector, Matrix, Tensor
• Map types
  – Local, Runtime, Auto
• Layout types
  – Row-, column-, plane-major

Dense<int Dims, class ElemType, class LayoutType>
Vector<class ElemType, class BlockType, class MapType>
Data Declaration Examples
Serial:
  // Create tensor
  typedef Dense<3, float, tuple<0, 1, 2> > dense_block_t;
  typedef Tensor<float, dense_block_t, LocalMap> tensor_t;
  tensor_t cpi(Nchannels, Npulses, Nranges);

Parallel:
  // Node map information
  Grid grid(Nprocs, 1, 1, Grid.ARRAY);      // Grid
  DataDist dist(3);                         // Block distribution
  Vector<int> procs(Nprocs);                // Processor ranks
  procs(0) = 0; ...
  ProcList procList(procs);                 // Processor list
  RuntimeMap cpiMap(grid, dist, procList);  // Node map

  // Create tensor
  typedef Dense<3, float, tuple<0, 1, 2> > dense_block_t;
  typedef Tensor<float, dense_block_t, RuntimeMap> tensor_t;
  tensor_t cpi(Nchannels, Npulses, Nranges, cpiMap);
Data Declaration Examples
Hierarchical:
  // Tile map information
  Grid tileGrid(1, NTiles, 1, Grid.ARRAY);               // Grid
  DataDist tileDist(3);                                  // Block distribution
  DataMgmtPolicy tilePolicy(DataMgmtPolicy.DEFAULT);     // Data mgmt policy
  RuntimeMap tileMap(tileGrid, tileDist, tilePolicy);    // Tile map

  // Tile processor map information
  Grid tileProcGrid(NTileProcs, 1, 1, Grid.ARRAY);       // Grid
  DataDist tileProcDist(3);                              // Block distribution
  Vector<int> tileProcs(NTileProcs);                     // Processor ranks
  tileProcs(0) = 0; ...
  ProcList tileProcList(tileProcs);                      // Processor list
  DataMgmtPolicy tileProcPolicy(DataMgmtPolicy.DEFAULT); // Data mgmt policy
  RuntimeMap tileProcMap(tileProcGrid, tileProcDist, tileProcList,
                         tileProcPolicy, tileMap);       // Tile processor map

  // Node map information
  Grid grid(Nprocs, 1, 1, Grid.ARRAY);                   // Grid
  DataDist dist(3);                                      // Block distribution
  Vector<int> procs(Nprocs);                             // Processor ranks
  procs(0) = 0; ...
  ProcList procList(procs);                              // Processor list
  RuntimeMap cpiMap(grid, dist, procList, tileProcMap);  // Node map

  // Create tensor
  typedef Dense<3, float, tuple<0, 1, 2> > dense_block_t;
  typedef Tensor<float, dense_block_t, RuntimeMap> tensor_t;
  tensor_t cpi(Nchannels, Npulses, Nranges, cpiMap);
Pulse Compression Example
Untiled version:
  // Declare weights and cpi tensors
  tensor_t cpi(Nchannels, Npulses, Nranges, cpiMap),
           weights(Nchannels, Npulses, Nranges, cpiMap);

  // Declare FFT objects
  Fftt<float, float, 2, fft_fwd> fftt;
  Fftt<float, float, 2, fft_inv> ifftt;

  // Iterate over CPI's
  for (i = 0; i < Ncpis; i++) {
    // DIT: Load next CPI from disk
    ...

    // DAT: Pulse compress CPI
    output = ifftt(weights * fftt(cpi));

    // DOT: Save pulse compressed CPI to disk
    ...
  }

Tiled version:
  // Declare weights and cpi tensors
  tensor_t cpi(Nchannels, Npulses, Nranges, cpiMap),
           weights(Nchannels, Npulses, Nranges, cpiMap);

  // Declare FFT objects
  Fftt<float, float, 2, fft_fwd> fftt;
  Fftt<float, float, 2, fft_inv> ifftt;

  // Iterate over CPI's
  for (i = 0; i < Ncpis; i++) {
    // DIT: Load next CPI from disk
    ...

    // DAT: Pulse compress CPI, tile by tile
    dataIter = cpi.beginLinear(0, 1);
    weightsIter = weights.beginLinear(0, 1);
    outputIter = output.beginLinear(0, 1);
    while (dataIter != cpi.endLinear()) {
      output = ifftt(weights * fftt(cpi));
      dataIter++; weightsIter++; outputIter++;
    }

    // DOT: Save pulse compressed CPI to disk
    ...
  }

Kernelized tiled version is identical to untiled version
Setup Assign API
• Library overhead can be reduced by an initialization-time expression setup
  – Store PITFALLS communication patterns
  – Allocate storage for temporaries
  – Create computation objects, such as FFTs

Assignment Setup Example:
  Equation eq1(a, b*c + d);   // expressions stored in Equation objects
  Equation eq2(f, a / d);
  for( ... ) {
    ...
    eq1();                    // expressions invoked without re-stating them
    eq2();
    ...
  }

Expression objects can hold setup information without duplicating the equation
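To make the setup-once / invoke-many pattern concrete, here is a minimal generic-C++ sketch of an equation object that captures an assignment at setup time and re-invokes it later. It illustrates the idea only and is not PVTOL's Equation implementation (which also caches communication patterns, temporaries, and computation objects).

#include <cstddef>
#include <functional>
#include <vector>

// Minimal sketch of the setup-once / invoke-many pattern (not PVTOL's Equation).
class Equation {
public:
  // Capture the assignment "lhs = expression" at setup time.
  explicit Equation(std::function<void()> assignment)
      : assignment_(std::move(assignment)) {}
  void operator()() const { assignment_(); }  // re-invoke without re-stating it
private:
  std::function<void()> assignment_;
};

int main() {
  std::vector<double> a(4), b(4, 1.0), c(4, 2.0), d(4, 3.0), f(4);

  // eq1: a = b*c + d, eq2: f = a / d (element-wise), each set up once
  Equation eq1([&] { for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i]*c[i] + d[i]; });
  Equation eq2([&] { for (std::size_t i = 0; i < f.size(); ++i) f[i] = a[i] / d[i]; });

  for (int iter = 0; iter < 10; ++iter) { eq1(); eq2(); }
  return 0;
}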
Redistribution: Assignment
[Figure: hierarchical arrays A and B distributed across a CELL Cluster (CELL 0, CELL 1 → SPE 0–7 → local stores)]
• Main memory is the highest level where all of A and B are in physical memory. PVTOL performs the redistribution at this level, and also performs the data reordering during the redistribution.
• PVTOL 'commits' B's local store memory blocks to main memory, ensuring memory coherency.
• PVTOL 'invalidates' all of A's local store blocks at the lower layers, causing the layer manager to re-load the blocks from main memory when they are accessed.

PVTOL A=B Redistribution Process:
1. PVTOL 'commits' B's resident temporal memory blocks.
2. PVTOL descends the hierarchy, performing PITFALLS intersections.
3. PVTOL stops descending once it reaches the highest set of map nodes at which all of A and all of B are in physical memory.
4. PVTOL performs the redistribution at this level, reordering data and performing element-type conversion if necessary.
5. PVTOL 'invalidates' A's temporal memory blocks.

The programmer simply writes 'A=B'; the corner turn is dictated by the maps and the data ordering (row-major vs. column-major).
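In user code the entire process above is triggered by a single assignment. A minimal sketch, assuming the Matrix / RuntimeMap declarations shown earlier; rowMap and colMap (transposed grids, forcing a corner turn) are illustrative names, and the map constructor arguments are elided.

typedef Dense<2, float, tuple<0, 1> > dense_block_t;
typedef Matrix<float, dense_block_t, RuntimeMap> matrix_t;

RuntimeMap rowMap( /* grid: 1xN, dist: block, nodes, child maps */ );
RuntimeMap colMap( /* grid: Nx1, dist: block, nodes, child maps */ );

int rows = 1024, cols = 1024;
matrix_t B(rows, cols, rowMap);   // distributed by rows
matrix_t A(rows, cols, colMap);   // distributed by columns

// Commits B's tiles, redistributes/reorders in main memory, invalidates A's tiles
A = B;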
Redistribution: Copying
Programmer creates a new view using the copy constructor with a new hierarchical map.
[Figure: A and B are hierarchical arrays on a CELL Cluster (CELL 0, CELL 1 → SPEs → LS), each with its own hierarchical map (grid 1x2, dist block, nodes 0:1, map cellMap at the top level; grid 1x4 or 4x1, dist block, policy default, nodes 0:3, map speMap below)]
Deep copy:
• A allocates a hierarchy based on its hierarchical map
• Commit B's local store memory blocks to main memory
• A allocates its own memory and copies the contents of B
Shallow copy:
• A allocates a hierarchy based on its hierarchical map
• Commit B's local store memory blocks to main memory
• A shares the memory allocated by B; no copying is performed
Pulse Compression + Doppler Filtering Example
[Figure: CPIs (CPI 0, CPI 1, CPI 2, ...) stream through a CELL Cluster: CELL 0 performs the DIT, CELL 1 performs the DAT (pulse compression plus Doppler filtering, using additional SPEs), and CELL 2 performs the DOT]
Tasks & Conduits
A means of decomposing a problem into a set of asynchronously coupled sub-problems (a pipeline)
[Figure: pipeline of Task 1, Task 2, and Task 3 connected by Conduits A, B, and C]
• Each Task is SPMD
• Conduits transport distributed data objects (i.e. Vector, Matrix, Tensor) between Tasks
• Conduits provide multi-buffering
• Conduits allow easy task replication
• Tasks may be separate processes or may co-exist as different threads within a process
Tasks w/ Implicit Task Objects
[UML: a Task has a Task Function (roughly equivalent to the "run" method of a PVL task), a Thread, a Map, and a Communicator; it may have 0..* Sub-Tasks with their own sub-Maps and sub-Communicators]
• A PVTOL Task consists of a distributed set of Threads that use the same communicator
• Threads may be either preemptive or cooperative*
• Task ≈ Parallel Thread
* PVL task state machines provide primitive cooperative multi-threading
Cooperative vs. Preemptive Threading
Cooperative User Space Threads (e.g. GNU Pth)
[Diagram: Thread 1 and Thread 2 hand control to a user-space scheduler via yield( ) and return from yield( )]
• PVTOL calls yield( ) instead of blocking while waiting for I/O
• O/S support of multithreading not needed
• Underlying communication and computation libs need not be thread safe
• SMPs cannot execute tasks concurrently
Preemptive Threads (e.g. pthread)
[Diagram: the O/S scheduler switches Thread 1 and Thread 2 on interrupts and I/O waits]
• SMPs can execute tasks concurrently
• Underlying communication and computation libs must be thread safe
PVTOL can support both threading styles via an internal thread wrapper layer
Task API
Support functions get values for the current task SPMD:
• length_type pvtol::num_processors();
• const_Vector<processor_type> pvtol::processor_set();
Task API:
• template<class T>
  pvtol::tid pvtol::spawn( void (*TaskFunction)(T&),
                           T& params,
                           Map& map );
• int pvtol::tidwait(pvtol::tid);
Similar to a typical thread API except for the spawn map
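A minimal usage sketch of this API; the full DIT/DAT/DOT pattern appears on the following slides. The names MyParams, myTask, and taskMap, and the Map constructor arguments, are illustrative assumptions.

struct MyParams { int numCpis; };

void myTask(MyParams& p) {
  // Runs SPMD on every processor named in the task's map
  length_type np = pvtol::num_processors();
  // ... process p.numCpis CPIs across np processors ...
}

int main(int argc, char* argv[]) {
  process pvtol(argc, argv);                  // initialize PVTOL (as on earlier slides)
  MyParams params = { 10 };
  Map taskMap( /* processors for this task */ );
  pvtol::tid t = pvtol::spawn(myTask, params, taskMap);
  return pvtol::tidwait(t);                   // wait for the task to finish
}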
Explicit Conduit UML (Parent Task Owns Conduit)
[UML: an Application Parent Task (a PVTOL Task with a Thread and ThreadFunction) owns 0..* PVTOL Conduits; child Application Functions own the endpoints, i.e. 0..* Conduit Data Readers and 0..* Conduit Data Writers; each endpoint manages a PVTOL Data Object]
• Parent task owns the conduits
• Application tasks own the endpoints (i.e. readers & writers)
• Multiple Readers are allowed; only one Writer is allowed
• Reader & Writer objects manage a Data Object and provide a PVTOL view of the comm buffers
Implicit Conduit UML (Factory Owns Conduit)
[UML: a Conduit Factory Function (a PVTOL Task with a Thread and ThreadFunction) owns 0..* PVTOL Conduits; Application Functions own the endpoints, i.e. 0..* Conduit Data Readers and 0..* Conduit Data Writers; each endpoint manages a PVTOL Data Object]
• Factory task owns the conduits
• Application tasks own the endpoints (i.e. readers & writers)
• Multiple Readers are allowed; only one Writer is allowed
• Reader & Writer objects manage a Data Object and provide a PVTOL view of the comm buffers
Conduit API
Conduit Declaration API:
  template<class T>
  class Conduit {
    Conduit( );
    Reader& getReader( );
    Writer& getWriter( );
  };

Conduit Reader API:
  template<class T>
  class Reader {
  public:
    Reader( Domain<n> size, Map map, int depth );
    void setup( Domain<n> size, Map map, int depth );
    void connect( );           // block until conduit ready
    pvtolPtr<T> read( );       // block until data available
    T& data( );                // return reader data object
  };

Conduit Writer API:
  template<class T>
  class Writer {
  public:
    Writer( Domain<n> size, Map map, int depth );
    void setup( Domain<n> size, Map map, int depth );
    void connect( );           // block until conduit ready
    pvtolPtr<T> getBuffer( );  // block until buffer available
    void write( pvtolPtr<T> ); // write buffer to destination
    T& data( );                // return writer data object
  };

Note: the Reader and Writer connect( ) methods block waiting for conduits to finish initializing, performing a function similar to PVL's two-phase initialization.
Conceptually similar to the PVL Conduit API
Task & Conduit API Example w/ Explicit Conduits
typedef struct { Domain<2> size; int depth; int numCpis; } DatParams;
int DataInputTask(const DitParams*);
int DataAnalysisTask(const DatParams*);
int DataOutputTask(const DotParams*);

int main( int argc, char* argv[])
{
  ...
  // Conduits created in parent task
  Conduit<Matrix<Complex<Float>>> conduit1;
  Conduit<Matrix<Complex<Float>>> conduit2;

  // Pass Conduits to children via Task parameters
  DatParams datParams = ...;
  datParams.inp = conduit1.getReader( );
  datParams.out = conduit2.getWriter( );

  // Spawn Tasks
  vsip::tid ditTid = vsip::spawn( DataInputTask, &ditParams, ditMap );
  vsip::tid datTid = vsip::spawn( DataAnalysisTask, &datParams, datMap );
  vsip::tid dotTid = vsip::spawn( DataOutputTask, &dotParams, dotMap );

  // Wait for Completion
  vsip::tidwait( ditTid );
  vsip::tidwait( datTid );
  vsip::tidwait( dotTid );
}

"Main Task" creates Conduits, passes them to sub-tasks as parameters, and waits for them to terminate
DAT Task & Conduit Example w/ Explicit Conduits
int DataAnalysisTask(const DatParams* p)
{
  // Declare and load weights
  Vector<Complex<Float>> weights( p.cols, replicatedMap );
  ReadBinary (weights, "weights.bin" );

  // Complete conduit initialization
  Conduit<Matrix<Complex<Float>>>::Reader inp( p.inp );
  inp.setup(p.size, map, p.depth);
  Conduit<Matrix<Complex<Float>>>::Writer out( p.out );
  out.setup(p.size, map, p.depth);

  // connect( ) blocks until the conduit is initialized
  inp.connect( );
  out.connect( );

  for(int i=0; i<p.numCpis; i++) {
    // read( ) blocks until data is received
    pvtolPtr<Matrix<Complex<Float>>> inpData( inp.read() );
    // getBuffer( ) blocks until an output buffer is available
    pvtolPtr<Matrix<Complex<Float>>> outData( out.getBuffer() );
    (*outData) = ifftm( vmmul( weights, fftm( *inpData, VSIP_ROW ), VSIP_ROW ) );
    // write( ) sends the data; pvtolPtr destruction implies reader extract
    out.write(outData);
  }
}

Sub-tasks are implemented as ordinary functions
DIT-DAT-DOT Task & Conduit API Example w/ Implicit Conduits
typedef struct { Domain<2> size; int depth; int numCpis; } TaskParams;
int DataInputTask(const InputTaskParams*);
int DataAnalysisTask(const AnalysisTaskParams*);
int DataOutputTask(const OutputTaskParams*);

int main( int argc, char* argv[ ])
{
  ...
  TaskParams params = ...;
  // Conduits are NOT created in the parent task

  // Spawn Tasks
  vsip::tid ditTid = vsip::spawn( DataInputTask, &params, ditMap );
  vsip::tid datTid = vsip::spawn( DataAnalysisTask, &params, datMap );
  vsip::tid dotTid = vsip::spawn( DataOutputTask, &params, dotMap );

  // Wait for Completion
  vsip::tidwait( ditTid );
  vsip::tidwait( datTid );
  vsip::tidwait( dotTid );
}

"Main Task" just spawns sub-tasks and waits for them to terminate
DAT Task & Conduit Example w/ Implicit Conduits
int DataAnalysisTask(const AnalysisTaskParams* p)
{
  Vector<Complex<Float>> weights( p.cols, replicatedMap );
  ReadBinary (weights, "weights.bin" );

  // Constructors communicate with the factory to find the other end based on name
  Conduit<Matrix<Complex<Float>>>::Reader
    inp("inpName", p.size, map, p.depth);
  Conduit<Matrix<Complex<Float>>>::Writer
    out("outName", p.size, map, p.depth);

  // connect( ) blocks until the conduit is initialized
  inp.connect( );
  out.connect( );

  for(int i=0; i<p.numCpis; i++) {
    // read( ) blocks until data is received
    pvtolPtr<Matrix<Complex<Float>>> inpData( inp.read() );
    // getBuffer( ) blocks until an output buffer is available
    pvtolPtr<Matrix<Complex<Float>>> outData( out.getBuffer() );
    (*outData) = ifftm( vmmul( weights, fftm( *inpData, VSIP_ROW ), VSIP_ROW ) );
    // write( ) sends the data; pvtolPtr destruction implies reader extract
    out.write(outData);
  }
}

Implicit Conduits connect using a "conduit name"
Conduits and Hierarchal Data Objects
Conduit connections may be:
• Non-hierarchal to non-hierarchal
• Non-hierarchal to hierarchal
• Hierarchal to non-hierarchal
• Hierarchal to hierarchal
Example task function w/ hierarchal mappings on conduit input & output data:
  ...
  input.connect();
  output.connect();
  for(int i=0; i<nCpi; i++) {
    pvtolPtr<Matrix<Complex<Float>>> inp( input.getHandle( ) );
    pvtolPtr<Matrix<Complex<Float>>> oup( output.getHandle( ) );
    do {
      *oup = processing( *inp );
      inp->getNext( );
      oup->getNext( );
    } while (more-to-do);
    output.write( oup );
  }
  // Per-tile Conduit communication possible (implementation dependent)
Conduits insulate each end of the conduit from the other's mapping
Replicated Task Mapping
• Replicated tasks allow conduits to abstract away round-robin parallel pipeline stages
• Good strategy for when tasks reach their scaling limits
[Figure: Task 1 → Conduit A → "Replicated" Task 2 (Rep #0, Rep #1, Rep #2) → Conduit B → Task 3 → Conduit C]
Replicated mapping can be based on a 2D task map (i.e. each row in the map is a replica mapping; the number of rows is the number of replicas)
PVTOL and Map Types
PVTOL distributed arrays are templated on map type:
• LocalMap — the matrix is not distributed
• RuntimeMap — the matrix is distributed and all map information is specified at runtime
• AutoMap — the map is either fully defined, partially defined, or undefined (the focus of this section)
Notional matrix construction:
  Matrix<float, Dense, AutoMap> mat1(rows, cols);
  – float specifies the data type, i.e. double, complex, int, etc.
  – Dense specifies the storage layout
  – AutoMap specifies the map type
pMapper and Execution in PVTOL
pMapper is an automatic mapping system that
• uses lazy evaluation
• constructs a signal flow graph
• maps the signal flow graph at data access
[Flowchart: if all maps are of type LocalMap or RuntimeMap, the APPLICATION produces OUTPUT directly; if at least one map of type AutoMap (unspecified or partial) is present, the SIGNAL FLOW EXTRACTOR builds a SIGNAL FLOW GRAPH, the EXPERT MAPPING SYSTEM (using a PERFORMANCE MODEL and ATLAS) maps it, and the EXECUTOR/SIMULATOR produces the OUTPUT]
Examples of Partial Maps
A partially specified map has one or more of the map attributes unspecified at one or more layers of the hierarchy.
Examples:
• Grid: 1x4, Dist: block, Procs: (unspecified) — pMapper will figure out which 4 processors to use
• Grid: 1x*, Dist: block, Procs: (unspecified) — pMapper will figure out how many columns the grid should have and which processors to use; note that if the processor list was provided the map would become fully specified
• Grid: 1x4, Dist: (unspecified), Procs: 0:3 — pMapper will figure out whether to use block, cyclic, or block-cyclic distribution
• Grid: (unspecified), Dist: block, Procs: (unspecified) — pMapper will figure out what grid to use and on how many processors; this map is very close to being completely unspecified
pMapper
• will be responsible for determining attributes that influence performance
• will not discover whether a hierarchy should be present
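For illustration only: the deck shows AutoMap( ) (unspecified) and AutoMap(grid, dist, procList, childMap) (fully specified); the partially specified constructor form sketched below, with a grid and distribution but no processor list, is hypothetical, as are the variable names.

Grid grid1x4(1, 4, Grid.ARRAY);
DataDist blockDist(2);

AutoMap mapA(grid1x4, blockDist);   // Grid: 1x4, Dist: block, Procs: unspecified (hypothetical ctor)
AutoMap mapB;                       // completely unspecified map

typedef Dense<2, float, tuple<0, 1> > dense_block_t;
typedef Matrix<float, dense_block_t, AutoMap> matrix_t;
matrix_t a(rows, cols, mapA);       // pMapper picks which 4 processors to use
matrix_t b(rows, cols, mapB);       // pMapper picks grid, distribution, and processors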
pMapper UML Diagram
[UML: the map classes are divided into those handled by pMapper (AutoMap) and those that are not (LocalMap, RuntimeMap)]
pMapper is only invoked when an AutoMap-templated PvtolView is created.
pMapper & Application
// Create input tensor (flat)
typedef Dense<3, float, tuple<0, 1, 2> > dense_block_t;
typedef Tensor<float, dense_block_t, AutoMap> tensor_t;
tensor_t input(Nchannels, Npulses, Nranges);          // AutoMap is partially specified or unspecified

// Create input tensor (hierarchical)
AutoMap tileMap();
AutoMap tileProcMap(tileMap);
AutoMap cpiMap(grid, dist, procList, tileProcMap);    // AutoMap is fully specified
typedef Dense<3, float, tuple<0, 1, 2> > dense_block_t;
typedef Tensor<float, dense_block_t, AutoMap> tensor_t;
tensor_t input(Nchannels, Npulses, Nranges, cpiMap);

• For each Pvar in the Signal Flow Graph (SFG), pMapper checks whether the map is fully specified
• If it is, pMapper moves on to the next Pvar; pMapper will not attempt to remap a pre-defined map
• If the map is not fully specified, pMapper will map it
• When a map is being determined for a Pvar, the returned map has all levels of the hierarchy specified, i.e. all levels are mapped at the same time
Mercury Cell Processor Test System
Mercury Cell Processor System
• Single Dual Cell Blade
  – Native tool chain
  – Two 2.4 GHz Cells running in SMP mode
  – Terra Soft Yellow Dog Linux 2.6.14
• Received 03/21/06
  – Booted & running same day
  – Integrated w/ LL network < 1 wk
  – Octave (Matlab clone) running
  – Parallel VSIPL++ compiled
• Each Cell has 153.6 GFLOPS (single precision) – 307.2 GFLOPS for the system @ 2.4 GHz (maximum)
Software includes:
• IBM Software Development Kit (SDK)
  – Includes example programs
• Mercury Software Tools
  – MultiCore Framework (MCF)
  – Scientific Algorithms Library (SAL)
  – Trace Analysis Tool and Library (TATL)
Cell Model
Element Interconnect Bus
• 4 ring buses at ½ processor speed
• Each ring 16 bytes wide
• Max bandwidth 96 bytes / cycle (204.8 GB/s @ 3.2 GHz)
Synergistic Processing Element (SPE)
• 128 SIMD registers, 128 bits wide
• Dual issue instructions
Local Store
• 256 KB flat memory
Memory Flow Controller
• Built-in DMA engine
PPE: 64-bit PowerPC (AS); VMX, GPU, FPU, LS, ...; L1 and L2 caches
• PPE and SPEs need different programming models
  – The SPE's MFC runs concurrently with the program
  – PPE cache loading is noticeable
  – PPE has direct access to memory
Hard to use SPMD programs on PPE and SPE
Compiler Support
• GNU gcc
  – gcc, g++ for PPU and SPU
  – Supports SIMD C extensions
• IBM XLC
  – C, C++ for PPU, C for SPU
  – Supports SIMD C extensions
  – Promises transparent SIMD code (vadd does not produce SIMD code in SDK)
• IBM Octopiler
  – Promises automatic parallel code with DMA
  – Based on OpenMP
• GNU provides a familiar product
• IBM's goal is easier programmability
  – Will it be enough for high performance customers?
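For reference, the SIMD C extensions mentioned above look roughly like the following on the SPU side; a minimal sketch only, assuming the spu_intrinsics.h header and spu_add intrinsic shipped with the SDK toolchains.

/* Minimal SPU SIMD C-extension sketch: add four floats per operation. */
#include <spu_intrinsics.h>

void vadd4(const vector float *a, const vector float *b,
           vector float *c, int nquads)
{
    for (int i = 0; i < nquads; i++) {
        c[i] = spu_add(a[i], b[i]);   /* one SIMD add covers 4 floats */
    }
}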
Mercury’s MultiCore Framework (MCF)
MCF provides a network across Cell’s
coprocessor elements.
Manager (PPE) distributes data to Workers (SPEs)
Synchronization API for Manager and its workers
Workers receive task and data in “channels”
Worker teams can receive different pieces of data
DMA transfers are abstracted away by “channels”
Workers remain alive until network is shutdown
MCF’s API provides a Task Mechanism whereby
workers can be passed any computational kernel.
PVTOL-65
4/13/2015
Can be used in conjunction with Mercury’s SAL
(Scientific Algorithm Library)
MIT Lincoln Laboratory
Mercury’s MultiCore Framework (MCF)
MCF provides API data distribution “channels” across
processing elements that can be managed by PVTOL.
PVTOL-66
4/13/2015
MIT Lincoln Laboratory
Sample MCF API functions
Manager functions:
• Initialization/shutdown: mcf_m_net_create( ), mcf_m_net_initialize( ), mcf_m_net_add_task( ), mcf_m_net_add_plugin( ), mcf_m_team_run_task( ), mcf_m_team_wait( ), mcf_m_net_destroy( ), mcf_m_mem_alloc( ), mcf_m_mem_free( ), mcf_m_mem_shared_alloc( )
• Channel management: mcf_m_tile_channel_create( ), mcf_m_tile_channel_destroy( ), mcf_m_tile_channel_connect( ), mcf_m_tile_channel_disconnect( ), mcf_m_tile_distribution_create_2d( ), mcf_m_tile_distribution_destroy( ), mcf_m_tile_channel_get_buffer( ), mcf_m_tile_channel_put_buffer( )
• Data transfer: mcf_m_dma_pull( ), mcf_m_dma_push( ), mcf_m_dma_wait( )
Worker functions:
• Initialization/shutdown: mcf_w_main( ), mcf_w_mem_alloc( ), mcf_w_mem_free( ), mcf_w_mem_shared_attach( )
• Channel management: mcf_w_tile_channel_create( ), mcf_w_tile_channel_destroy( ), mcf_w_tile_channel_connect( ), mcf_w_tile_channel_disconnect( ), mcf_w_tile_channel_is_end_of_channel( ), mcf_w_tile_channel_get_buffer( ), mcf_w_tile_channel_put_buffer( )
• Data transfer: mcf_w_dma_pull_list( ), mcf_w_dma_push_list( ), mcf_w_dma_pull( ), mcf_w_dma_push( ), mcf_w_dma_wait( )
Cell PPE – SPE Manager / Worker Relationship
[Figure: PPE ↔ Main Memory ↔ SPE]
• PPE loads data into main memory
• PPE launches SPE kernel expression
• SPE loads data from main memory to and from its local store
• SPE writes results back to main memory
• SPE indicates that the task is complete
PPE (manager) "farms out" work to the SPEs (workers)
SPE Kernel Expressions
• PVTOL application
  – Written by user
  – Can use expression kernels to perform computation
• Expression kernels
  – Built into PVTOL
  – PVTOL will provide multiple kernels, e.g.
      Simple: +, -, *, FFT
      Complex: pulse compression, Doppler filtering, STAP
• Expression kernel loader
  – Built into PVTOL
  – Launched onto tile processors when PVTOL is initialized
  – Runs continuously in background
Kernel Expressions are effectively SPE overlays
SPE Kernel Proxy Mechanism
User code (on PPE):
  Matrix<Complex<Float>> inP(...);
  Matrix<Complex<Float>> outP(...);
  outP = ifftm(vmmul(fftm(inP)));

PVTOL expression or pMapper SFG executor (on PPE): checks for a signature match, then calls the matching proxy, e.g.
  pulseCompress( Vector<Complex<Float>>& wgt,
                 Matrix<Complex<Float>>& inP,
                 Matrix<Complex<Float>>& outP );

Pulse compress SPE proxy (on PPE): marshals the parameter set (name, data, mappings) and performs a lightweight spawn of the kernel:
  struct PulseCompressParamSet ps;
  ps.src = wgt.data, inP.data;
  ps.dst = outP.data;
  ps.mappings = wgt.map, inP.map, outP.map;
  MCF_spawn(SpeKernelHandle, ps);

Pulse compress kernel (on SPE):
  get mappings from input param;
  set up data streams;
  while (more to do) {
    get next tile;
    process;
    write tile;
  }

Kernel proxies map expressions or expression fragments to available SPE kernels
Kernel Proxy UML Diagram
[UML: user code issues Program Statements containing 0..* Expressions; on the manager/main processor an Expression is realized either by a Direct Implementation (calling a computation library such as FFTW) or by an SPE or FPGA Computation Kernel Proxy; the proxies drive the corresponding SPE or FPGA Computation Kernels on the workers]
This architecture is applicable to many types of accelerators (e.g. FPGAs, GPUs)
DIT-DAT-DOT on Cell Example
[Figure: CPI 1, CPI 2, CPI 3 flow from the PPE DIT through the PPE DAT (which invokes the SPE pulse compression kernel) to the PPE DOT; numbered arrows 1–4 mark the data movement steps]

Explicit tasks:
  // DIT
  for (...) {
    read data;
    outcdt.write( );
  }
  // DAT
  for (...) {
    incdt.read( );
    pulse_comp ( );
    outcdt.write( );
  }
  // DOT
  for (...) {
    incdt.read( );
    write data;
  }

Implicit tasks:
  for (...) {
    a = read data;       // DIT
    b = a;
    c = pulse_comp(b);   // DAT
    d = c;
    write_data(d);       // DOT
  }

SPE pulse compression kernel (invoked by the DAT in both cases):
  for (each tile) {
    load from memory;
    out = ifftm(vmmul(fftm(inp)));
    write to memory;
  }
Benchmark Description
Benchmark hardware: Mercury Dual Cell Testbed (PPEs plus 1 – 16 SPEs)
Benchmark software: Octave (Matlab clone) calling a simple FIR proxy that drives the SPE FIR kernel
Based on the HPEC Challenge Time Domain FIR Benchmark
Time Domain FIR Algorithm
[Figure: a length-4 filter slides along the reference input data, forming one dot product (output point) per position]
• Number of operations:
  – k = filter size, n = input size, nf = number of filters
  – Total FLOPs: ~ 8 x nf x n x k
• Output size: n + k - 1
HPEC Challenge parameters (TDFIR):
  Set   k     n      nf
  1     128   4096   64
  2     12    1024   20
• TDFIR uses complex data
• TDFIR uses a bank of filters
  – Each filter is used in a tapered convolution
  – A convolution is a series of dot products
FIR is one of the best ways to demonstrate FLOPS
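As a reference point, here is a small, self-contained plain-C++ sketch (no PVTOL, no SIMD) of the benchmark's inner computation: a bank of nf complex FIR filters of length k, each convolved with an input of length n, matching the ~8 x nf x n x k FLOP count and the n + k - 1 output size above. The code is illustrative, not the benchmark implementation.

#include <complex>
#include <cstddef>
#include <iostream>
#include <vector>

using cfloat = std::complex<float>;

// Bank of complex FIR filters; each output has length n + k - 1.
std::vector<std::vector<cfloat>> tdfir(const std::vector<std::vector<cfloat>>& filters,
                                       const std::vector<cfloat>& input) {
  std::vector<std::vector<cfloat>> outputs;
  for (const auto& filt : filters) {
    std::vector<cfloat> out(input.size() + filt.size() - 1, cfloat(0, 0));
    for (std::size_t i = 0; i < input.size(); ++i)
      for (std::size_t j = 0; j < filt.size(); ++j)
        out[i + j] += input[i] * filt[j];   // 1 complex multiply-add = 8 real FLOPs
    outputs.push_back(out);
  }
  return outputs;
}

int main() {
  // HPEC Challenge TDFIR set 2: k = 12, n = 1024, nf = 20
  const std::size_t k = 12, n = 1024, nf = 20;
  std::vector<std::vector<cfloat>> filters(nf, std::vector<cfloat>(k, cfloat(1, 0)));
  std::vector<cfloat> input(n, cfloat(1, 1));

  auto outputs = tdfir(filters, input);
  std::cout << "outputs: " << outputs.size() << " x " << outputs[0].size()
            << ", FLOPs ~ " << 8.0 * nf * n * k << "\n";
  return 0;
}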
Performance Time Domain FIR (Set 1)
[Plots: time (seconds) and GFLOPS vs. number of iterations (L) for 1, 2, 4, 8, and 16 SPEs of a Cell @ 2.4 GHz; constants M = 64, N = 4096, K = 128]
Set 1 has a bank of 64 size-128 filters with size-4096 input vectors
• Octave runs TDFIR in a loop
  – Averages out overhead
  – Applications typically run convolutions many times
Maximum GFLOPS for TDFIR #1 @ 2.4 GHz:
  # SPEs   1    2    4    8    16
  GFLOPS   16   32   63   126  253
Performance Time Domain FIR (Set 2)
[Plots: time (seconds) and GFLOPS vs. number of iterations (L) for 1, 2, 4, 8, and 16 SPEs of a Cell @ 2.4 GHz; constants M = 20, N = 1024, K = 12]
Set 2 has a bank of 20 size-12 filters with size-1024 input vectors
• TDFIR set 2 scales well with the number of processors
  – Runs are less stable than set 1
GFLOPS for TDFIR #2 @ 2.4 GHz:
  # SPEs   1    2    4    8    16
  GFLOPS   10   21   44   85   185
Summary
Goal: Prototype advanced software technologies to exploit novel processors (tiled processors, CPUs in disk drives) for DoD sensors
• Have demonstrated 10x performance benefit of tiled processors
• Novel storage should provide 10x more IO
Approach: Develop Parallel Vector Tile Optimizing Library (PVTOL) for high performance and ease-of-use
[Figure: an automated parallel mapper assigns the stages of an FFT pipeline (A → FFT → B → FFT → C) to processors P0, P1, P2; hierarchical arrays; ~1 TByte RAID disks]
DoD Relevance: Essential for flexible, programmable sensors with large IO and processing requirements (wideband digital arrays, massive storage)
• Wide area data
• Collected over many time scales
Mission Impact:
• Enabler for next-generation synoptic, multi-temporal sensor systems
Technology Transition Plan
• Coordinate development with sensor programs
• Work with DoD and Industry standards bodies (DoD software standards)