Platform Unaware Programming Models

Rosa M. Badia, Pieter Bellens, Jorge Ejarque, Josep M. Perez, Jesus Labarta, Marc de Palol, Raül Sirvent, Enric Tejedor
Barcelona Supercomputing Center (BSC-CNS)
Technical University of Catalonia (UPC)
rosa.m.badia@bsc.es
May 2007

Outline
• "*" superscalar: overview, where "*" = GRID, Cell or SMP
• Programming model syntax
• Evolution and platforms
• Generic runtime features
• Specific features:
  – Grid version
  – Cell Superscalar (CellSs)

Overview
• A superscalar processor takes a sequential flow of instructions and, using its functional units, registers and memory, executes them concurrently, out of order, with speculation, ...
• Goal: ease the programming of {Grid, multicore, ...} applications.
• Basic idea: apply the same concepts at a much coarser grain, where an "instruction" lasts not nanoseconds but around 100 microseconds (Cell) or seconds to minutes to hours (Grid).
  [Figure: processor block diagram (IFU, IDU, ISU, FXU, FPU, LSU, BXU, L2, L3 directory/control) side by side with a Grid]
• Mapping of concepts:
  – Instructions → block operations
  – Functional units → SPUs
  – Fetch & decode unit → PPE
  – Registers (name space) → main memory
  – Registers (storage) → SPU memory
  – Full binary → computational resources / local host
  – Files → files
• Standard sequential languages: "*" superscalar (*Ss)
  – On standard processors it runs sequentially: "easy" programming
  – On other platforms it runs in parallel: "decent" performance
• Constraints, to remain portable:
  – Algorithms expressed as coarse-grain tasks
  – Operations only access their arguments and local data

Objectives and overview
A sequential application (a Dimemas parameter study):

for (int i = 0; i < MAXITER; i++) {
    newBWd = GenerateRandom();
    subst(referenceCFG, newBWd, newCFG);
    dimemas(newCFG, traceFile, DimemasOUT);
    post(newBWd, DimemasOUT, FinalOUT);
    if (i % 3 == 0)
        Display(FinalOUT);
}
fd = open(FinalOUT, R);
printf("Results file:\n");
present(fd);
close(fd);

From the input/output data of each call, a task graph is derived:
[Figure: task dependence graph with one Subst → DIMEMAS → EXTRACT chain per iteration and a Display task every third iteration; the tasks run on the (CIRI) Grid while the main program's GS_open waits for the final file]

Syntax
• A small set of annotations (pragmas).
• Task annotation:
  – Marks an independent piece of code without side effects
  – Written before the declaration of a subroutine

#pragma css task input(n) output(result)
void factorial(unsigned int n, unsigned int *result)
{
    ...
}

#pragma css task input(left[leftSize], right[rightSize]) output(result[leftSize+rightSize])
void merge(float *left, unsigned int leftSize,
           float *right, unsigned int rightSize,
           float *result)
{
    ...
}
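At the call site nothing changes: the main program invokes the annotated functions as ordinary calls, and the runtime turns each invocation into a task, deriving dependencies from the declared input/output parameters. A minimal, hypothetical driver for the merge task above (the function name, array names and sizes are illustrative, not from the slides):

#define N 1024

void merge_example(void)
{
    float a[N], b[N], ab[2*N];
    float c[N], d[N], cd[2*N];
    float abcd[4*N];

    /* ... fill a, b, c and d with sorted data ... */

    merge(a, N, b, N, ab);          /* task 1                          */
    merge(c, N, d, N, cd);          /* task 2, independent of task 1   */
    merge(ab, 2*N, cd, 2*N, abcd);  /* task 3, reads the outputs of    */
                                    /* tasks 1 and 2, so it waits      */
}

The first two calls have no data in common, so the runtime can run them concurrently; the third reads both results and is scheduled only when they are ready.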
Syntax
Example: blocked matrix multiply, where the matrices are NB x NB arrays of BS x BS blocks. Annotating block_addmultiply as a task is the only change to the sequential code:

int main(int argc, char **argv)
{
    int i, j, k;

    initialize(A, B, C);

    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    int i, j, k;

    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

• Optionally, #pragma css start and #pragma css finish bracket the region of the main program where calls become tasks:

#pragma css start
for (i = 0; i < NB; i++)
    for (j = 0; j < NB; j++)
        for (k = 0; k < NB; k++)
            block_addmultiply(C[i][j], A[i][k], B[k][j]);
#pragma css finish

• #pragma css wait on(...) suspends the main program until a given piece of data is ready, for example to print results as they are produced:

for (i = 0; i < N; i++)
    for (ii = 0; ii < BSIZE; ii++) {
        for (j = 0; j < N; j++) {
            #pragma css wait on(matrix[i][j])
            for (jj = 0; jj < BSIZE; jj++)
                fprintf(file, "%f ", matrix[i][j][ii][jj]);
        }
        fprintf(file, "\n");
    }

Evolution & platforms
• Grid – GRID superscalar
  – Version 1: dependencies based on files
    · Based on IDL (no annotations), code generation
    · C/C++, Perl, Java, shell script
    · GT2, GT4, ssh/scp, Ninf-G
    · Deployment center, monitor
    · Checkpointing, fault tolerance
    · Version used in BEinGRID; being integrated with the GridWay metascheduler (DRMAA)
    · Open source as of now, Apache v2 license
  – Version 2: dependencies for (almost) any data type
    · C source-to-source compiler
    · Runtime with fewer features than version 1
  – Version 3: componentized version of GRID superscalar
    · Designed in the framework of CoreGRID WP7
    · Based on GCM (ProActive implementation)
    · Java-GAT to access the underlying middleware (Java-SAGA possibly later)
    · Implementation ongoing
  – Semantic scheduler
    · Based on resource ontologies
    · Prototype for version 1
    · Under development in BREIN
• Clusters – MareNostrum version
  – Version 1 tailored to the MareNostrum supercomputer
  – Takes into account the local scheduler (first LoadLeveler, now SLURM) and GPFS features
  – Uses ssh/scp instead of Grid middleware
• Cell/BE – Cell Superscalar (CellSs)
  – Version for the Cell BE multicore processor
  – Based on version 2, tailored to the Cell BE
  – Open source (GPL for the compiler, LGPL for the runtime)
• Homogeneous multicores / SMPs – SMP superscalar (SMPSs)
  – CellSs compiler
  – Runtime ported to SMPs using threads
  – A much easier platform than the Cell/BE!
  – The annotations themselves are platform-unaware, as the sketch below illustrates.
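A minimal, hypothetical task in the pragma syntax shown earlier (the function itself is illustrative, not from the slides). Since the annotations only declare data directions and say nothing about the platform, the same source can be compiled with the CellSs toolchain for the Cell/BE or with SMPSs for a multicore:

#define BS 64  /* block size, matching the 64x64 blocks used elsewhere */

/* Hypothetical platform-unaware task: the pragma declares only data
   directions, so the same code targets CellSs (SPE tasks) or SMPSs
   (threads) without modification. */
#pragma css task input(x) inout(y)
void axpy_block(float x[BS], float y[BS])
{
    for (int i = 0; i < BS; i++)
        y[i] += 2.0f * x[i];
}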
Generic runtime features
• Data dependence analysis:
  – Detects RaW, WaR and WaW dependencies between tasks, based on their parameters:
    for files, according to the file names; for data in general, according to the memory addresses.
  – The tasks' directed acyclic graph is built from these dependencies.
  [Figure: task graph of the Dimemas example, with Subst → DIMEMAS → EXTRACT chains and a Display task]
• Renaming:
  – WaW and WaR dependencies can be avoided by renaming (file renaming, data renaming):

while (!end_condition()) {
    T1(..., ..., "f1");   /* writes f1                        */
    T2("f1", ..., ...);   /* reads f1: RaW dependence on T1   */
    T3(..., ..., ...);
}

  – Instance N of T1 has a WaW dependence on instance N-1 of T1 and a WaR dependence on instance N-1 of T2. The runtime renames the successive versions of "f1" to "f1_1", "f1_2", ..., so that only the true RaW dependencies (each T1 to the T2 of the same iteration) remain and the iterations can overlap.

Specific features: Grid
• Interface Definition Language (IDL) file in XML format:
  – In/out/inout files or scalars
  – The subroutines/functions listed will be executed on a remote server in the Grid

<?xml version="1.0" encoding="UTF-8"?>
<interface name="example">
  <function name="subst" type="void">
    <argument name="referenceCFG" direction="in" type="file"/>
    <argument name="newBW" direction="in" type="double"/>
    <argument name="newCFG" direction="out" type="file"/>
  </function>
  <function name="dimemas" type="void">
    <argument name="newCFG" direction="in" type="file"/>
    <argument name="traceFile" direction="in" type="file"/>
    <argument name="DimemasOUT" direction="out" type="file"/>
  </function>
  <function name="post" type="void">
    <argument name="newBW" direction="in" type="double"/>
    <argument name="DimemasOUT" direction="in" type="file"/>
    <argument name="FinalOUT" direction="inout" type="file"/>
  </function>
  <function name="display" type="void">
    <argument name="toplot" direction="in" type="file"/>
  </function>
</interface>

• Code generation: gsstubgen
  – Generates the code necessary to build a Grid application from a sequential application:
    function stubs (master side) and a worker main program (worker side).
  – From app.idl, gsstubgen produces the client-side files app-stubs.c, app.h, app_constraints.cc, app_constraints_wrapper.cc and app_constraints.h, and the server-side file app-worker.c; the user supplies app.c and app-functions.c. A conceptual sketch of the stub idea follows.
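Conceptually, the master-side stub replaces each original function with code that hands the call to the runtime together with the parameter directions from the IDL, so dependence analysis can run before anything executes. The sketch below is purely illustrative: GS_AddTask and its enums are invented names for exposition, not gsstubgen's actual generated code nor the real GRID superscalar runtime API.

/* Invented runtime API, for illustration only. */
typedef enum { GS_IN, GS_OUT } gs_direction;
typedef enum { GS_FILE, GS_DOUBLE } gs_type;
void GS_AddTask(const char *name, ...);

/* Master-side stub: registers a task instead of executing dimemas
   locally; the directions mirror the IDL and drive the dependence
   analysis on the file names. */
void dimemas(char *newCFG, char *traceFile, char *DimemasOUT)
{
    GS_AddTask("dimemas",
               GS_IN,  GS_FILE, newCFG,
               GS_IN,  GS_FILE, traceFile,
               GS_OUT, GS_FILE, DimemasOUT);
}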
Specific features: Grid
• Automatic configuration and compilation: the deployment center.
  [Figure: deployment center screenshot]
• GRID superscalar application architecture.
  [Figure: application architecture]
• Runtime features:
  1. Data dependence analysis
  2. File renaming
  3. Task scheduling
  4. Resource brokering
  5. Shared-disk management and file transfer policy
  6. Scalar result collection
  7. Checkpointing at task level
  8. API function implementation
  9. Exception handling
  10. Fault tolerance
• File transfer policy:
  [Figure: working directories example. T1 runs on server1, reading f1 and producing the temporary file f4; T2 runs on server2, consuming f4 and producing f7, which is transferred back to the client. Files are kept in per-server working directories to avoid retransfers.]
• Task scheduling:
  – The scheduler tries to allocate a resource to each of the ready tasks (greedy scheduling).
  – Constraint matching: the ClassAd library matches resource ClassAds against task ClassAds, filtering the available resources down to those that satisfy the task's constraints.
  – Locality exploitation: if more than one resource fulfils the constraints, the resource that minimizes

    f(t, r) = FT(r) + ET(t, r)

    is selected, where t is the task, r the resource, FT(r) the time to transfer t's files to resource r, and ET(t, r) the execution time of task t on r (using a user-provided cost function). A slower machine that already holds most of the input files can therefore beat a faster one that would need transfers.
• Call sequence without GRID superscalar: app.c calls app-functions.c directly, everything on the local host.
• Call sequence with GRID superscalar: app.c calls app-stubs.c, which invokes the GRID superscalar runtime (together with app_constraints_wrapper.cc and app_constraints.cc) on the local host; the runtime reaches the remote host through GT2, where app-worker.c runs app-functions.c.

Sample applications:
• GHyper: computation of molecular potential-energy hypersurfaces
• Drug design: an evolutionary algorithm to find proteins that eliminate the wrong signals in cells that produce cancer
• Metagenomics (15 million BLAST executions)
• Process and product design (optimization problems, chemical engineering), in BEinGRID
• Automatic extraction of structure from huge MPI trace files
• Astrophysics: telemetry parameter simulations (GASS) and received-data processing (GOG) for the ESA GAIA project
• Instituto de Neurociencias (Alicante) / Institut d'Investigacions Biomèdiques August Pi i Sunyer: a model of processing in the cerebral cortex

GRID superscalar ongoing work
• Fault tolerance features for the Grid (partially implemented) and for MareNostrum
• Interoperation with GridWay:
  – Based on DRMAA
  – Several GRID superscalar applications managed by GridWay
  – More abstraction from the underlying infrastructure, e.g. EGEE, since GridWay is gLite-enabled
• Semantic scheduler, in the framework of the BREIN project
• Componentised version: new scheduling algorithms, experimentation with the GCM implementation, MareNostrum as a target architecture
• ssh/scp version for the Spanish Supercomputing Network, composed of MareNostrum and smaller "MareNostrums"

Cell Superscalar (CellSs)
So, what is the Cell BE?
• From the architecture point of view: a multicore with an SMT PPE and eight SPEs.
• From the programmer's point of view: hard to optimize, with separate address spaces, tiny SPE local memories, and bandwidth constraints.
  [Figure: Cell BE block diagram annotated with the user's, programmer's and processor's points of view]

CellSs compiling infrastructure
[Figure: compilation flow]

CellSs execution behaviour
• The main thread generates tasks; a helper thread schedules and dispatches them to the SPEs, waiting for SPE availability and updating the task graph as stage-out notifications arrive.
• Tasks are clustered (e.g., chains of 7 block multiplies on blocks of 64x64 floats) to amortize overheads, with stage in/out and data reuse.
  [Figures: scheduling strategy stages (a), (b), (c); execution time lines showing task generation, schedule and dispatch, stage out and notification, and graph update]

CellSs: preliminary results
• A toy application that generates totally independent tasks.
• Each SPU task performs a transposition of a block of 64x64 floats (input: 64x64 floats, output: 64x64 floats).
• The task duration was artificially increased to test different durations: 4.89, 9.71, 19.42, 24.23, 48.40, 96.80 and 120.9 µs.
• Scalability was measured for the different task sizes; the task itself could look like the hedged sketch below.
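The slides do not show the toy task's source; a minimal sketch of what such a task could look like, in the pragma syntax used earlier (the function name and body are illustrative):

/* Hypothetical 64x64 block-transpose task, in the style of the
   annotated examples above: reads one block, writes another.
   Every call is independent, so all tasks can run in parallel. */
#pragma css task input(in) output(out)
void transpose_block(float in[64][64], float out[64][64])
{
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++)
            out[j][i] = in[i][j];
}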
[Figure: toy-application scalability, speedup vs. number of SPUs (1 to 16) for task durations of 4.89, 9.71, 19.42, 24.23, 48.40, 96.80 and 120.9 µs]

CellSs: preliminary results
• Matrix multiply, with task versions ranging from plain scalar code to the tiled version from the SDK.
• Task durations range from about 2022.77 µs (scalar) down to 21.86 µs (SDK tile).
  [Figure: matrix multiply scalability, speedup vs. number of SPUs (1 to 8) for task durations of 2022.77, 281.32, 117.47, 58.46, 27.87 and 21.86 µs]
• Results in GFlops.
  [Figure: matrix multiply performance in GFlops]
• Overheads (absolute time) in matrix multiply, SDK version.
  [Figure]
• Overheads (% of total time) in matrix multiply, SDK version.
  [Figure]

CellSs: Cholesky factorization

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++)
            sgemm_tile(A[i][k], A[j][k], A[i][j]);
        strsm_tile(A[j][j], A[i][j]);
    }
    for (j = 0; j < i; j++)
        ssyrk_tile(A[i][j], A[i][i]);
    spotrf_tile(A[i][i]);
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C);

#pragma css task input(T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B);

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C);

#pragma css task inout(A[64][64])
void spotrf_tile(float *A);

[Figures: Cholesky factorization scalability, speedup vs. number of SPUs (1 to 8) for 1024x1024, 2048x2048 and 4096x4096 matrices; task dependence graph for a 320x320-float matrix (64x64 blocks, 5x5 blocks)]

Performance (in GFlops): 11.99 for 1024x1024, 17.56 for 2048x2048 and 18.74 for 4096x4096 matrices.

[Figures: Cholesky performance in GFlops as a function of matrix size (up to ~5000), overall and counting only task scheduling and execution]

CellSs: issues and ongoing efforts
• CellSs programming model:
  – Memory association
  – Array regions
  – Subobject accesses
  – Blocks larger than the Local Store
  – Access to global memory from tasks?
  – Inline directives
• CellSs runtime system:
  – Further reduction of overheads (task insertion and removal)
  – Scheduling algorithms: overhead, locality
  – Overlays
  – Short-circuiting (SPE-to-SPE transfers)
• SMP superscalar (SMPSs)

More information
• GRID superscalar home page: www.bsc.es/grid/grid_superscalar
• CellSs home page: www.bsc.es/cellsuperscalar