Platform Unaware Programming Models

Rosa M. Badia
Pieter Bellens, Jorge Ejarque, Josep M. Perez,
Jesus Labarta, Marc de Palol, Raül Sirvent, Enric Tejedor
Barcelona Supercomputing Center (BSC-CNS)
Technical University of Catalonia (UPC)
rosa.m.badia@bsc.es
Outline

- “*” superscalar: overview (“*” = GRID or Cell or SMP)
- Programming model syntax
- Evolution and platforms
- Generic runtime features
- Specific features
  - Grid version
  - Cell Superscalar (CellSs)
Overview

Superscalar processor:
- Instructions
- Functional units
- Registers
- Memory
A sequential program flow is executed concurrently: out of order, with speculation, ...
Overview

Goal: ease the programming of {Grid, multicore, ...} applications

Basic idea:
[Figure: block diagram of a superscalar processor core (IFU, IDU, ISU, FXU, FPU, LSU, BXU, L2/L3 directory and control), set against a Grid of resources]
Timescale: ns on a processor ➨ ~100 μs on the Cell, minutes/hours on the Grid.

Mapping of concepts:

Superscalar concept     | Cell BE          | Grid
Instructions            | Block operations | Full binary
Functional units        | SPUs             | Computational resources
Fetch & decode unit     | PPE              | Local host
Registers (name space)  | Main memory      | Files
Registers (storage)     | SPU memory       | Files
* superscalar (*Ss): standard sequential languages
- On standard processors the code runs sequentially: “easy” programming
- On other platforms it runs in parallel: “decent” performance

Constraints:
- Portable
- Coarse-grained task algorithms
- Operations only access their arguments and local data (see the sketch below)
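A hedged illustration of that constraint (not from the original slides; sum and bad_sum are made-up tasks):

#pragma css task input(a[n]) output(s)
void sum(float *a, unsigned int n, float *s) {
    float acc = 0.0f;                    /* local data only */
    for (unsigned int i = 0; i < n; i++)
        acc += a[i];                     /* arguments only */
    *s = acc;
}

/* Invalid as a task: reads global state the runtime cannot track. */
float g;
#pragma css task input(a[n]) output(s)
void bad_sum(float *a, unsigned int n, float *s) {
    *s = g + a[0];                       /* touches the global g */
}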
Objectives and overview
for (int i = 0; i < MAXITER; i++) {
    newBWd = GenerateRandom();
    subst(referenceCFG, newBWd, newCFG);
    dimemas(newCFG, traceFile, DimemasOUT);
    post(newBWd, DimemasOUT, FinalOUT);
    if (i % 3 == 0) Display(FinalOUT);
}
fd = open(FinalOUT, R);
printf("Results file:\n");
present(fd);
close(fd);
Objectives and overview

Input/output data:
[Figure: task dependence graph of the example. Each iteration contributes a Subst -> DIMEMAS -> EXTRACT chain; every third iteration also feeds a Display task; the final GS_open waits on the last results. Tasks are mapped onto the CIRI Grid.]
Syntax

Small set of annotations (pragmas)

Task annotation:
- An independent piece of code without side effects
- Placed before the declaration of a subroutine

#pragma css task input(n) output(result)
void factorial(unsigned int n, unsigned int *result) {
    ...
}

#pragma css task input(left[leftSize], right[rightSize]) output(result[leftSize+rightSize])
void merge(float *left, unsigned int leftSize, float *right, unsigned int rightSize, float *result) {
    ...
}
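A hedged sketch of how such tasks are invoked (not from the original slides, and assuming the factorial task annotated above): calls look like ordinary function calls, the runtime turns each one into an asynchronous task, and the wait-on pragma shown later in this section lets the main program read results back safely.

#include <stdio.h>

int main(void) {
    unsigned int r1, r2;
    factorial(5, &r1);      /* becomes an asynchronous task           */
    factorial(10, &r2);     /* independent task: may run concurrently */
    #pragma css wait on(r1)
    #pragma css wait on(r2)
    printf("%u %u\n", r1, r2);   /* safe to read after the waits */
    return 0;
}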
Syntax

int main(int argc, char **argv) {
    int i, j, k;
    initialize(A, B, C);
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
}

static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

[Figure: matrices stored as NB x NB blocks of BS x BS floats]
Syntax

int main(int argc, char **argv) {
    int i, j, k;
    initialize(A, B, C);
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
Syntax

int main(int argc, char **argv) {
    int i, j, k;
    initialize(A, B, C);
#pragma css start
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
#pragma css finish
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
Syntax

int main(int argc, char **argv) {
    int i, j, k, ii, jj;
    initialize(A, B, C);
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);

    /* Print the result, waiting only for the block about to be printed */
    for (i = 0; i < N; i++)
        for (ii = 0; ii < BSIZE; ii++) {
            for (j = 0; j < N; j++) {
                #pragma css wait on(matrix[i][j])
                for (jj = 0; jj < BSIZE; jj++)
                    fprintf(file, "%f ", matrix[i][j][ii][jj]);
            }
            fprintf(file, "\n");
        }
}
Evolution & platforms

Grid – GRID superscalar
- Version 1 – dependences based on files
  - Based on IDL (no annotations), code generation
  - C/C++, Perl, Java, shell script
  - GT2, GT4, ssh/scp, Ninf-G
  - Deployment center, monitor
  - Checkpointing, fault tolerance
  - Version used in BEinGRID – being integrated with the GridWay metascheduler (DRMAA)
  - Open source (as of now), Apache v2 license
- Version 2 – dependences for (almost) any data type
  - C
  - Source-to-source compiler
  - Runtime with fewer features than version 1
Evolution & platforms

Grid – GRID superscalar
- Version 3 – componentized version of GRID superscalar
  - Designed in the framework of CoreGRID, WP7
  - Based on GCM (ProActive implementation)
  - Java-GAT to access the underlying middleware (Java-SAGA soon?)
  - Implementation ongoing
- Semantic scheduler
  - Based on resource ontologies
  - Prototype for version 1
  - Under development in BREIN
Evolution & platforms

Clusters – MareNostrum version
- Version 1 tailored to the MareNostrum supercomputer
- Takes the local scheduler (first LoadLeveler, now SLURM) and GPFS features into account
- Uses ssh/scp instead of Grid middleware
Evolution & platforms

Cell/BE
- Cell Superscalar (CellSs) – version for the Cell BE multicore processor
  - Based on version 2
  - Tailored to the Cell BE multicore
  - Open source (GPL for the compiler, LGPL for the runtime)

Homogeneous multicores / SMPs
- SMP superscalar (SMPSs)
  - CellSs compiler
  - Runtime ported to SMPs using threads
  - A much easier platform than Cell/BE!
Generic runtime features

Data dependence analysis:
- Detects RaW, WaR and WaW dependences between tasks, based on their parameters
  - For files: according to file names
  - For data in general: according to memory addresses
- A directed acyclic graph (DAG) of tasks is built from these dependences
[Figure: the Subst/DIMEMAS/EXTRACT/Display example again, now drawn as the DAG the runtime builds]
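To make the three dependence kinds concrete, a small illustrative sketch in the pragma syntax of this deck (not from the original slides; produce and consume are made up):

#pragma css task input(a) output(b)
void produce(float *a, float *b);

#pragma css task input(b) output(c)
void consume(float *b, float *c);

void example(float *x, float *y, float *z) {
    produce(x, y);    /* writes y                                         */
    consume(y, z);    /* RaW: reads the y written above -> a DAG edge     */
    produce(z, y);    /* WaR with consume's read of y, WaW with the first
                         write of y -> both removable by renaming (next)  */
}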
Generic runtime features

WaW and WaR dependences can be avoided by renaming:
- File renaming
- Data renaming

while (!end_condition()) {
    T1(…, …, "f1");
    T2("f1", …, …);
    T3(…, …, …);
}

[Figure: iteration graph for T1/T2/T3. Each T1 rewrites "f1", creating a WaW dependence with the previous T1 and a WaR dependence with the previous T2. Renaming gives each generation of "f1" its own instance (f1_1, f1_2, ...), so the chains T1_1/T2_1/T3_1 ... T1_N/T2_N/T3_N become independent.]
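A hedged illustration of the same idea on in-memory data (not from the original slides; produce and consume are made-up tasks):

#pragma css task output(x)
void produce(float *x);

#pragma css task input(x)
void consume(float *x);

void loop(int iters) {
    float v;
    for (int i = 0; i < iters; i++) {
        produce(&v);    /* WaW with the previous produce, WaR with consume */
        consume(&v);    /* RaW on produce: the only true dependence        */
    }
    /* With data renaming, each produce writes a fresh instance of v
       (v_1, v_2, ...), so every produce/consume pair can be in flight
       concurrently. */
}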
Specific features: Grid

Interface Definition Language (IDL) file, in XML format:
- In/out/inout files or scalars
- The subroutines/functions listed will be executed on a remote server in the Grid
<?xml version="1.0" encoding="UTF-8"?>
<interface name="example">
  <function name="subst" type="void">
    <argument name="referenceCFG" direction="in" type="file"/>
    <argument name="newBW" direction="in" type="double"/>
    <argument name="newCFG" direction="out" type="file"/>
  </function>
  <function name="dimemas" type="void">
    <argument name="newCFG" direction="in" type="file"/>
    <argument name="traceFile" direction="in" type="file"/>
    <argument name="DimemasOUT" direction="out" type="file"/>
  </function>
  <function name="post" type="void">
    <argument name="newBW" direction="in" type="double"/>
    <argument name="DimemasOUT" direction="in" type="file"/>
    <argument name="FinalOUT" direction="inout" type="file"/>
  </function>
  <function name="display" type="void">
    <argument name="toplot" direction="in" type="file"/>
  </function>
</interface>
Specific features: Grid

Code generation: gsstubgen
- Generates the code necessary to build a Grid application from a sequential application
  - Function stubs (master side)
  - Worker main program (worker side)

[Diagram: app.idl -> gsstubgen. Client side: app.c, app-stubs.c, app.h, app_constraints.cc, app_constraints_wrapper.cc, app_constraints.h. Server side: app-worker.c, app-functions.c.]
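A conceptual sketch of what a master-side stub in app-stubs.c does (the runtime entry point and descriptor names below are hypothetical, not the actual generated code): instead of computing anything, the stub registers an asynchronous task with the runtime, describing each argument's direction and type so dependences can be tracked by file name.

/* Hypothetical runtime entry point and argument descriptors. */
enum { GSS_FILE_IN, GSS_SCALAR_IN, GSS_FILE_OUT };
void GSS_AddTask(const char *name, ...);

/* Illustrative stub for subst(), as declared in the IDL above. */
void subst(char *referenceCFG, double newBW, char *newCFG) {
    GSS_AddTask("subst",
                GSS_FILE_IN,   referenceCFG,
                GSS_SCALAR_IN, &newBW,
                GSS_FILE_OUT,  newCFG);   /* no computation here */
}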
Specific features: Grid

Automatic configuration and compilation: the deployment center
[Figure: deployment center screenshot]
GRID superscalar application architecture
[Figure: application architecture]
Specific features: Grid

Runtime features:
1. Data dependence analysis
2. File renaming
3. Task scheduling
4. Resource brokering
5. Shared disks management and file transfer policy
6. Scalar results collection
7. Checkpointing at task level
8. API functions implementation
9. Exception handling
10. Fault tolerance
Specific features: Grid

File transfer policy:
[Figure: client and two servers with their working directories. Input f1 travels from the client to server1, where T1 runs; T1's output f4 is a temporary file passed between the servers' working directories for T2 on server2; T2's result f7 is sent back to the client.]
Specific features: Grid

Task scheduling:
- The scheduler tries to allocate a resource to each ready task (greedy scheduling)
- Constraint matching
  - The ClassAd library is used to match resource ClassAds with task ClassAds
  - This filters the available resources down to those that match the task constraints
- Locality exploitation
  - If more than one resource fulfils the constraints, the resource that minimizes this formula is selected (see the sketch below):
    f(t,r) = FT(r) + ET(t,r)
    where t = task, r = resource, FT(r) = file transfer time to resource r, and ET(t,r) = execution time of task t on resource r (using a user-provided cost function)
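A minimal sketch of the locality-aware selection step after constraint matching (not the actual runtime code; resource_t, task_t, file_transfer_time and exec_time are assumed helpers):

#include <float.h>
#include <stddef.h>

typedef struct resource resource_t;
typedef struct task task_t;

/* Assumed helpers: estimated transfer time of t's input files to r,
   and estimated execution time of t on r (user-provided cost function). */
double file_transfer_time(const task_t *t, const resource_t *r);
double exec_time(const task_t *t, const resource_t *r);

/* Among the resources that passed ClassAd matching, pick the one
   minimizing f(t,r) = FT(r) + ET(t,r). */
resource_t *select_resource(const task_t *t, resource_t **matching, int n) {
    resource_t *best = NULL;
    double best_f = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double f = file_transfer_time(t, matching[i])
                 + exec_time(t, matching[i]);
        if (f < best_f) { best_f = f; best = matching[i]; }
    }
    return best;
}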
Specific features: Grid

Call sequence without GRID superscalar:
[Diagram: app.c calls app-functions.c directly; everything runs on the LocalHost]
Specific features: Grid

Call sequence with GRID superscalar:
[Diagram: on the LocalHost, app.c calls app-stubs.c, which hands the call to the GRID superscalar runtime (app_constraints_wrapper.cc / app_constraints.cc resolve constraints); the runtime reaches the RemoteHost through GT2, where app-worker.c executes app-functions.c]
Specific features: Grid

Sample applications:
- Ghyper: computation of molecular potential energy hypersurfaces
- Drug design: an evolutionary algorithm to find the proteins that eliminate the wrong signals in cells that produce cancer
- Metagenomics (15 million BLAST executions)
- Process and product design (optimization problems, chemical engineering) – BEinGRID
- Automatic extraction of structure from huge MPI trace files
- Astrophysics: telemetry parameter simulations (GASS) and received-data management (GOG) for the ESA GAIA project
- Institut Neurociencia Alicante / Institut d'Investigacions Biomèdiques August Pi i Sunyer: a model for processing the cerebral cortex
GRID superscalar ongoing work

- Fault tolerance features for the Grid (partially implemented) and for MareNostrum
- Interoperation with GridWay
  - Based on DRMAA
  - Several GRID superscalar apps managed by GridWay
  - More abstraction from the underlying infrastructure
    - e.g., EGEE, since GridWay is gLite-enabled
- Semantic scheduler in the framework of the BREIN project
- Componentized version
  - New scheduling algorithms, experimentation with the GCM implementation, MareNostrum as target architecture
- ssh/scp version for the Spanish Supercomputing Network
  - Composed of MareNostrum and smaller “MareNostrums”
Cell Superscalar (CellSs)

So, what is the Cell BE?
- Architecture point of view: an SMT PPE plus eight SPEs
- Programmer's point of view: hard to optimize
  - Separate address spaces
  - Tiny local memory
  - Bandwidth
[Figure: Cell BE block diagram (PPE + 8 SPEs)]
[Figures: CellSs from the user's point of view; the CellSs compiling infrastructure; execution behaviour; scheduling strategy (panels a, b, c)]
CellSs execution behaviour

[Trace figures: the main thread generates tasks while a helper thread schedules and dispatches them to the SPEs. Tasks are clustered (chains of 7 block multiplies on 64 x 64 floats blocks), with data staged in/out and reused within a cluster. Further views show waiting for SPE availability, schedule & dispatch, task generation, stage-out and notification, and graph update.]
CellSs: preliminary results

- Toy application that generates totally independent tasks
- Each SPU task transposes a 64 x 64 floats block (input: 64 x 64 floats, output: 64 x 64 floats)
- Task duration artificially increased to test different durations: 4.89, 9.71, 19.42, 24.23, 48.40, 96.80 and 120.90 μs

[Plot: scalability with different task sizes; speed-up (1-16) vs. #SPUs, one curve per task duration]
CellSs: preliminary results

- Results for different versions of the matrix multiply, from scalar code to the tiled version from the SDK
- Task durations: from 2022.77 μs down to 21.86 μs

[Plot: scalability analysis of matrix multiply; speed-up vs. #SPUs (1-8), one curve per task duration: 2022.77, 281.32, 117.47, 58.46, 27.87 and 21.86 μs]
CellSs: preliminary results

[Figure: results in GFlops]

[Figures: overheads in matrix multiply (SDK version), in absolute time and as % of total time]
CellSs: Cholesky factorization

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );
        }
        strsm_tile( A[j][j], A[i][j] );
    }
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C);

#pragma css task input(T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B);

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C);
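The annotation for the fourth kernel is omitted on the slide; by analogy with the others it would look like this (a sketch, not from the original):

#pragma css task inout(A[64][64])
void spotrf_tile(float *A);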
CellSs: Cholesky factorization

[Plot: Cholesky factorization scalability; speed-up vs. #SPUs (1-8) for matrix sizes 1024 x 1024, 2048 x 2048 and 4096 x 4096]

[Figure: task dependence graph for a 320 x 320 floats matrix (blocks of 64 x 64, 5 x 5 blocks)]

Performance (in GFlops):

#SPUs | 1024 x 1024 | 2048 x 2048 | 4096 x 4096
1     | 11.99       | 17.56       | 18.74
CellSs: Cholesky factorization

[Plots: Cholesky performance in GFlops vs. matrix size (up to ~5000), overall and counting only task scheduling and execution]
CellSs: issues and ongoing efforts

- CellSs programming model
  - Memory association
  - Array regions
  - Subobject accesses
  - Blocks larger than the Local Store
  - Access to global memory by tasks?
  - Inline directives
- CellSs runtime system
  - Further optimization of overheads (insert task and remove task)
  - Scheduling algorithms: overhead, locality
  - Overlays
  - Short-circuiting (SPE -> SPE transfers)
- SMP superscalar (SMPSs)
More information

GRID superscalar home page: www.bsc.es/grid/grid_superscalar
CellSs download: www.bsc.es/cellsuperscalar