Towards Optimized UPC Implementations
Tarek A. El-Ghazawi, The George Washington University (tarek@gwu.edu)
Presented at IBM T.J. Watson Research Center, 02/22/05

Agenda
- Background
- UPC Language Overview
- Productivity
- Performance Issues
- Automatic Optimizations
- Conclusions

Parallel Programming Models
What is a programming model?
- An abstract machine that outlines the view of data and execution as perceived by the programmer
- Where architecture and applications meet
- A non-binding contract between the programmer and the compiler/system
Good programming models should:
- Allow efficient mapping onto different architectures
- Keep programming easy
Benefits:
- Applications gain independence from the architecture
- Architectures gain independence from the applications

Programming Models
[Figure: processes/threads and their address spaces under the three models]
- Message passing: MPI
- Shared memory: OpenMP
- Distributed shared memory / PGAS: UPC

Programming Paradigms Expressivity
                        Locality implicit                Locality explicit
  Parallelism implicit  Sequential (e.g. C, Fortran, Java)   Data parallel (e.g. HPF, C*)
  Parallelism explicit  Shared memory (e.g. OpenMP)          Distributed shared memory / PGAS (e.g. UPC, CAF, and Titanium)

What is UPC?
- Unified Parallel C
- An explicit parallel extension of ISO C
- A distributed shared memory / PGAS parallel programming language

Why not message passing?
Performance:
- High penalty for short transactions
- Cost of calls
- Two-sided
- Excessive buffering
Ease of use:
- Explicit data transfers
- Domain decomposition does not maintain the original global application view
- More code and conceptual difficulty

Why DSM/PGAS?
Performance:
- No calls
- Efficient short transfers
- Locality
Ease of use:
- Implicit transfers
- Consistent global application view
- Less code and conceptual difficulty

Why DSM/PGAS: New Opportunities for Compiler Optimizations
[Figure: Sobel operator over an image distributed across Thread 0..Thread 3, with ghost zones at the thread boundaries]
- The DSM programming model exposes sequential remote accesses at compile time (a stencil sketch appears below, after the platform list)
- Opportunity for compiler-directed prefetching

History
- Initial technical report from IDA, in collaboration with LLNL and UCB, May 1999
- UPC consortium of government, academia, and HPC vendors, coordinated by GWU, IDA, and DoD
- Current participants: IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, ...

Status
- Specification v1.0 completed February 2001; v1.1.1 in October 2003; v1.2 will add collectives and UPC-IO
- Benchmarking suites: STREAM, GUPS, RandomAccess, the NPB suite, Splash-2, and others
- Testing suite v1.0, v1.1
- Short courses and tutorials in the US and abroad
- Research exhibits at SC 2000-2004
- UPC web site: upc.gwu.edu
- UPC book by mid-2005 from John Wiley and Sons
- Manual(s)

Hardware Platforms
UPC implementations are available for:
- SGI Origin 2000/3000 - 32- and 64-bit GCC
- UCB - 32-bit GCC
- Intrepid
- Cray T3D/E
- Cray X-1
- HP AlphaServer SC, Superdome
- Berkeley UPC compiler: Myrinet, Quadrics, and InfiniBand clusters
- Beowulf reference implementation (MPI-based, MTU)
- New ongoing efforts by IBM and Sun
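Returning to the Sobel ghost-zone opportunity above: the sketch below (not from the talk) shows a UPC stencil kernel in which the reads that cross into a neighboring thread's rows appear as ordinary array references, so a compiler can recognize the regular remote access pattern and prefetch or aggregate it. The names grid, edge, and the size N are illustrative assumptions, and a static THREADS compilation environment is assumed.

```c
#include <upc_relaxed.h>   /* relaxed consistency; upc.h would also work */
#include <stdlib.h>

#define N 512                        /* illustrative image size (assumption) */
#define ROWS_PER_THREAD (N / THREADS)

/* Row-block distribution: each thread owns a contiguous band of rows.
   Assumes static THREADS, N divisible by THREADS, and a block size within
   the implementation's UPC_MAX_BLOCK_SIZE. */
shared [ROWS_PER_THREAD * N] unsigned char grid[N][N];
shared [ROWS_PER_THREAD * N] unsigned char edge[N][N];

int main(void)
{
    /* Each thread processes only the rows it has affinity to.  The reads of
       grid[i-1][...] and grid[i+1][...] at the band boundaries touch rows
       owned by neighboring threads (the "ghost zones"); because they appear
       as plain array references, the remote access pattern is visible to the
       compiler at compile time. */
    for (int i = MYTHREAD * ROWS_PER_THREAD; i < (MYTHREAD + 1) * ROWS_PER_THREAD; i++) {
        if (i == 0 || i == N - 1) continue;          /* skip the image border */
        for (int j = 1; j < N - 1; j++) {
            int gx = grid[i-1][j+1] + 2*grid[i][j+1] + grid[i+1][j+1]
                   - grid[i-1][j-1] - 2*grid[i][j-1] - grid[i+1][j-1];
            int gy = grid[i+1][j-1] + 2*grid[i+1][j] + grid[i+1][j+1]
                   - grid[i-1][j-1] - 2*grid[i-1][j] - grid[i-1][j+1];
            int mag = abs(gx) + abs(gy);
            edge[i][j] = (unsigned char)(mag > 255 ? 255 : mag);
        }
    }
    upc_barrier;                                      /* wait for all bands */
    return 0;
}
```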
UPC Execution Model
- A number of threads working independently in an SPMD fashion
- MYTHREAD specifies the thread index (0..THREADS-1)
- The number of threads is specified at compile time or run time
- Process and data synchronization when needed:
  - Barriers and split-phase barriers
  - Locks and arrays of locks
  - Fence
  - Memory consistency control

UPC Memory Model
[Figure: Thread 0 .. Thread THREADS-1, each owning a partition of the shared space plus its own private space]
- A shared space with thread affinity, plus private spaces
- A pointer-to-shared can reference all locations in the shared space
- A private pointer may reference only addresses in its private space or addresses in its portion of the shared space
- Static and dynamic memory allocation is supported for both shared and private memory

UPC Pointers
How to declare them?
  int *p1;               /* private pointer pointing locally */
  shared int *p2;        /* private pointer pointing into the shared space */
  int *shared p3;        /* shared pointer pointing locally */
  shared int *shared p4; /* shared pointer pointing into the shared space */
You may find many people using "shared pointer" to mean a pointer pointing to a shared object, e.g. equivalent to p2, but it could be p4 as well.
[Figure: p1 and p2 reside in each thread's private space; p3 and p4 reside in the shared space with affinity to thread 0]

Synchronization - Barriers
- There is no implicit synchronization among the threads
- UPC provides the following synchronization mechanisms:
  - Barriers
  - Locks
  - Memory consistency control
  - Fence

Memory Consistency Models
- Has to do with the ordering of shared operations, and with when a change to a shared object made by one thread becomes visible to the others
- Consistency can be strict or relaxed
- Under the relaxed consistency model, shared operations can be reordered by the compiler / runtime system
- The strict consistency model enforces sequential ordering of shared operations (no operation on shared data can begin before the previous ones are done, and changes become visible immediately)
- The user specifies the memory model through:
  - declarations
  - pragmas for a particular statement or sequence of statements
  - use of barriers and global operations
- Programmers are responsible for using the correct consistency model
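The synchronization and consistency constructs above can be combined as in the following minimal sketch (not from the talk). The names hits, total, done, and sum_lock are illustrative assumptions; the calls and qualifiers (upc_all_lock_alloc, upc_lock/upc_unlock, upc_barrier, strict) are standard UPC.

```c
#include <upc_relaxed.h>   /* program default: relaxed consistency */

shared int hits[THREADS];          /* one counter per thread (element i has affinity to thread i) */
shared int total;                  /* shared scalar, affinity to thread 0 */
strict shared int done;            /* strict flag: updates are ordered and immediately visible */
upc_lock_t *sum_lock;              /* private pointer to a shared lock object */

int main(void)
{
    sum_lock = upc_all_lock_alloc();        /* collective: every thread gets the same lock */

    hits[MYTHREAD] = MYTHREAD + 1;          /* each thread writes only its own element */
    upc_barrier;                            /* make all writes visible before reading */

    /* Critical section: serialize the read-modify-write of the shared scalar. */
    upc_lock(sum_lock);
    total += hits[MYTHREAD];
    upc_unlock(sum_lock);

    upc_barrier;
    if (MYTHREAD == 0)
        done = 1;                           /* strict write: no reordering past this point */
    return 0;
}
```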
UPC and Productivity Metrics
- Lines of 'useful' code: indicates development time as well as maintenance cost
- Number of 'useful' characters: an alternative way to measure development and maintenance effort
- Conceptual complexity: function level, keyword usage, number of tokens, maximum loop depth, ...

Manual Effort - NPB Example
                         NPB-CG         NPB-EP        NPB-FT         NPB-IS         NPB-MG
                         #line  #char   #line #char   #line  #char   #line  #char   #line  #char
  SEQ (baseline for UPC)  665   16145    127   2868    575   13090    353    7273    610   14830
  UPC                     710   17200    183   4117   1018   21672    528   13114    866   21990
  SEQ (baseline for MPI)  506   16485    130   4741    665   22188    353    7273    885   27129
  MPI                    1046   37501    181   6567   1278   44348    627   13324   1613   50497

  UPC effort (%) = (#UPC - #SEQ) / #SEQ x 100;  MPI effort (%) = (#MPI - #SEQ) / #SEQ x 100

  UPC effort (%)    6.77   6.53   44.09  43.55   77.04  65.56   49.58  80.31   41.97  48.28
  MPI effort (%)  106.72 127.49   36.23  38.52   92.18  99.87   77.62  83.20   82.26  86.14

Manual Effort - More Examples
        GUPS           Histogram      N-Queens
        #line  #char   #line  #char   #line  #char
  SEQ     41    1063     12     188     86    1555
  MPI     98    2979     30     705    166    3332
  SEQ     41    1063     12     188     86    1555
  UPC     47    1251     20     376    139    2516

  MPI effort (%)  139.02 180.02  150.00 275.00   93.02 124.28
  UPC effort (%)   14.63  17.68   66.67 100.00   61.63  61.80

Conceptual Complexity - HIST and GUPS
[Tables in the original slides break down, for the UPC and MPI versions of HISTOGRAM and GUPS, the number of parameters, function calls, references to THREADS/MYTHREAD (or myrank/nprocs), and UPC/MPI constructs and types across work distribution, data distribution, communication, synchronization & consistency, and miscellaneous operations. For example, the UPC HISTOGRAM version needs shared declarations, one lock declaration with lock/unlock, and two barriers, while the MPI version needs Scatter, Reduce (implicit with the collective), Init/Finalize, and communication calls; the UPC GUPS version uses upc_forall, shared declarations, and all_alloc/free, while the MPI version uses memory allocation/free, windows, one-sided operations, collectives (implicit with WinFence), barriers, Init/Finalize, comm_rank/comm_size, Wtime, and error handling. Overall conceptual-complexity scores: HISTOGRAM - UPC 22 vs. MPI 47; GUPS - UPC 43 vs. MPI 136.]

UPC Optimization Issues
Particular challenges:
- Avoiding address translation
- Cost of address translation
Special opportunities:
- Locality-driven, compiler-directed prefetching
- Aggregation
General:
- Low-level optimized libraries, e.g. collectives
- Backend optimizations
- Overlapping of remote accesses and synchronization with other work

Showing Potential Optimizations Through Emulated Hand-Tunings
Different hand-tuning levels:
- UPC.O0: unoptimized UPC code
- UPC.O1: privatized UPC code
- UPC.O2: prefetched UPC code - a hand-optimized variant using block get/put to mimic the effect of prefetching
- UPC.O3: fully hand-tuned UPC code - a hand-optimized variant integrating privatization, aggregation of remote accesses, and prefetching
T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 IEEE International Conference on Parallel Processing (ICPP'01), pp. 365-372.

STREAM Benchmark: Address Translation Cost and Local Space Privatization - Cluster
  MB/s          Put      Get      Scale    Sum      Copy (arr)  Copy (ptr)  Memcpy   Memset
  CC            N/A      N/A      1565.04  5409.30  1340.99     1488.02     1223.86  2401.26
  UPC Private   N/A      N/A      1687.63  1776.81  1383.57      433.45     1252.47  2352.71
  UPC Local     1196.51  1082.89    54.22    82.70    47.20       90.67     1202.80  2398.90
  UPC Remote     241.43   237.51     0.09     0.16     0.09        0.20     1197.22  2360.59
Results gathered on a Myrinet cluster.
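A hedged sketch (not from the slides) of what the privatization level (UPC.O1) measured above means in practice: when a thread touches only shared data it has affinity to, a pointer-to-shared can be cast to an ordinary C pointer, bypassing the pointer-to-shared address translation. The names a, b, N, and the kernel are illustrative assumptions; a static THREADS environment with N divisible by THREADS is assumed.

```c
#include <upc_relaxed.h>

#define N 1024                            /* illustrative size (assumption) */
shared [N/THREADS] double a[N], b[N];     /* block distribution; assumes static THREADS */

void scale_local(double factor)
{
    /* UPC.O0 style: every access goes through pointer-to-shared translation,
       even though each thread only touches the elements it owns. */
    upc_forall (int i = 0; i < N; i++; &a[i])
        a[i] = factor * b[i];
}

void scale_private(double factor)
{
    /* UPC.O1 style: cast the thread's own block to plain C pointers once,
       then index with ordinary private pointers (no translation, no calls).
       The cast is legal because the referenced block has affinity to MYTHREAD. */
    double *pa = (double *) &a[MYTHREAD * (N/THREADS)];
    double *pb = (double *) &b[MYTHREAD * (N/THREADS)];
    for (int i = 0; i < N/THREADS; i++)
        pa[i] = factor * pb[i];
}
```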
STREAM Benchmark: Address Translation and Local Space Privatization - DSM Architecture
  MB/s                                   Bulk operations                    Element-by-element operations
                                         Memory copy  Block get  Block put  Array set  Array copy  Sum  Scale
  GCC                                    127          N/A        N/A        175        106         223  108
  UPC Private                            127          N/A        N/A        173        106         215  107
  UPC Local Shared                       139          140        136         26         14          31   13
  UPC Remote Shared (within SMP node)    130          129        136         26         13          30   13
  UPC Remote Shared (beyond SMP node)    112          117        136         24         12          28   12

Aggregation and Overlapping of Remote Shared Memory Accesses
[Figures: execution time of UPC N-Queens (1-32 NP) and UPC Sobel Edge (1-16 threads), unoptimized vs. fully hand-optimized, on the SGI Origin 2000]
- The benefit of hand-optimization is greatly application dependent (see the bulk-transfer sketch below):
  - N-Queens does not perform any better, mainly because it is an embarrassingly parallel program
  - The Sobel edge detector gains a speedup of one order of magnitude after hand-optimization and scales almost perfectly linearly

Impact of Hand-Optimizations on NPB.CG
[Figure: computation time of NPB.CG Class A on the SGI Origin 2000 for 1-32 processors, comparing UPC-O0, UPC-O1, UPC-O3, and GCC]

Shared Address Translation Overhead
- A private memory access compiles down to the actual access; a local shared memory access adds a UPC put/get function call and an address calculation on top of the actual access
[Figure: local shared access time on the SGI Origin 2000 with GCC-UPC, roughly 514 ns in total: the memory access itself is about 144 ns, and the remaining ~370 ns (components of 247 ns and 123 ns in the chart) is address calculation plus put/get function-call overhead]
Quantification of the address translation overheads:
- The address translation overhead is quite significant: more than 70% of the work of a local-shared memory access
- This demonstrates the real need for optimization

Shared Address Translation Overheads for Sobel Edge Detection
[Figure: execution time on 1-16 processors, broken into processing + memory access, address calculation, and address function call, for UPC.O0 (unoptimized) and UPC.O3 (hand-optimized)]
Ox notations from T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001.

Reducing Address Translation Overheads via Translation Look-Aside Buffers
F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005.
- Use look-up Memory Model Translation Buffers (MMTB) to perform fast translations
- Two alternative methods are proposed to create and use MMTBs:
  - FT: basic method using direct addressing
  - RT: advanced method using indexed addressing
- Prototyped as a compiler-enabled optimization: no modifications to the actual UPC codes are needed
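The aggregation used by the O2/O3 hand-tuning levels above boils down to replacing many fine-grained remote reads with one bulk transfer into private memory. A minimal sketch (not from the talk), assuming illustrative names and sizes and a static THREADS environment; upc_memget is the standard UPC bulk-get routine.

```c
#include <upc_relaxed.h>

#define N   4096                    /* illustrative size (assumption) */
#define BLK (N / THREADS)           /* assumes static THREADS, N divisible by THREADS */

shared [BLK] double src[N];
double local_buf[BLK];              /* private scratch buffer on every thread */

double sum_next_block(void)
{
    int next = (MYTHREAD + 1) % THREADS;

    /* Unoptimized code would read src[next*BLK + i] element by element,
       paying one remote access per element.  The aggregated version does a
       single bulk transfer of the neighbor's whole block into private
       memory, then computes locally. */
    upc_memget(local_buf, &src[next * BLK], BLK * sizeof(double));

    double s = 0.0;
    for (int i = 0; i < BLK; i++)
        s += local_buf[i];
    return s;
}
```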
Different Strategies - Full Table (FT)
Consider shared int array[8]; distributed across 4 THREADS (default block size: array[i] has affinity to thread i mod 4). An MMTB look-up table FT is stored on each thread, with one entry per array element holding that element's virtual address.
[Figure: each of the 4 threads holds a copy of the 8-entry FT table of virtual addresses, e.g. FT[0] = 0x57FF8040, FT[1] = 0x5FFF8040, ..., FT[7] = 0x6FFF8048]
To initialize FT:  for all i in [0,7]:  FT[i] = _get_vaddr(&array[i])
To access array[]: for all i in [0,7]:  array[i] = _get_value_at(FT[i])
Pros:
- Direct mapping, no address calculation
Cons:
- Large memory required
- Can lead to competition over caches and main memory

Different Strategies - Reduced Table (RT): Infinite Block Size
RT strategy with BLOCKSIZE = infinite:
- Only the address of the first element of the array needs to be saved, since all of the array data is contiguous
- Only one table entry in this case
- The address calculation step is simple in this case
Consider shared [] int array[4];
To initialize RT:  RT[0] = _get_vaddr(&array[0])
To access array[]: for all i in [0,3]:  array[i] = _get_value_at(RT[0] + i)
[Figure: the whole array resides on THREAD 0; every thread's RT has the single entry RT[0]]

Different Strategies - Reduced Table (RT): Default Block Size
RT strategy with BLOCKSIZE = 1:
- Only the address of the first element on each thread is saved, since each thread's slice of the array data is contiguous
- Less memory required than FT: the MMTB buffer has THREADS entries
- The address calculation step is somewhat costly, but much cheaper than in current implementations
Consider shared [1] int array[16];
To initialize RT:  for all i in [0,THREADS-1]:  RT[i] = _get_vaddr(&array[i])
To access array[]: for all i in [0,15]:  array[i] = _get_value_at(RT[i mod THREADS] + (i / THREADS))
[Figure: thread t holds array[t], array[t+4], array[t+8], array[t+12]; every thread's RT has entries RT[0..3]]

Different Strategies - Reduced Table (RT): Arbitrary Block Size
RT strategy with arbitrary block sizes:
- Only the address of the first element of each block is saved, since all data within a block is contiguous
- Less memory required than for FT, but more than in the previous cases
- The address calculation step is more costly than in the previous cases
Consider shared [2] int array[16];
To initialize RT:  for all i in [0,7]:  RT[i] = _get_vaddr(&array[i * blocksize(array)])
To access array[]: for all i in [0,15]:  array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array)))
[Figure: thread t holds the blocks {array[2t], array[2t+1]} and {array[2t+8], array[2t+9]}; every thread's RT has entries RT[0..7]]
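A hedged sketch of how the RT arbitrary-block-size translation above might look as compiler-generated code. _get_vaddr() and _get_value_at() stand for internal runtime helpers from the referenced paper, not standard UPC calls; their prototypes here, the vaddr_t type, and the four-thread static layout are assumptions for illustration only.

```c
#include <upc_relaxed.h>

#define NELEMS  16
#define BLOCK   2
#define NBLOCKS (NELEMS / BLOCK)

shared [BLOCK] int array[NELEMS];   /* assumes a static environment with 4 threads */

/* Assumed helper prototypes (illustrative, not standard UPC). */
typedef unsigned long vaddr_t;
extern vaddr_t _get_vaddr(shared void *p);      /* raw virtual address behind a pointer-to-shared */
extern int     _get_value_at(vaddr_t addr);     /* dereference such an address */

static vaddr_t RT[NBLOCKS];         /* reduced table, replicated on every thread */

void rt_init(void)
{
    /* One entry per block: the address of the block's first element. */
    for (int b = 0; b < NBLOCKS; b++)
        RT[b] = _get_vaddr(&array[b * BLOCK]);
}

int rt_read(int i)
{
    /* One table lookup plus a division and a modulo by the block size.
       The slide's pseudocode counts the offset in elements; here it is
       scaled to bytes. */
    return _get_value_at(RT[i / BLOCK] + (i % BLOCK) * sizeof(int));
}
```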
Performance Impact of the MMTB - Sobel Edge
[Figures: execution time of Sobel edge detection (N=2048) on 1-16 threads for O0, O0.FT, O0.RT, O3, and MPI, shown with and without O0]
- FT and RT perform around 6- to 8-fold better than the regular basic UPC version (O0)
- The RT strategy is slower than FT because the address calculation (arbitrary block size case) becomes more complex
- FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)

Performance Impact of the MMTB - Matrix Multiplication
[Figures: execution time of matrix multiplication (N=256) on 1-16 threads for UPC.O0, UPC.O0.FT, UPC.O0.RT, UPC.O3, and MPI, plus hardware-counter profiles at 16 threads: computation, L1 and L2 data cache misses, graduated loads, graduated stores, decoded branches, and TLB misses]
- FT strategy: an increase in L1 data cache misses due to the large table size
- RT strategy: L1 misses are kept low, but an increase in the number of loads and stores is observed, reflecting the extra computation (arbitrary block size used)

Time and Storage Requirements of the Address Translation Methods for the Matrix Multiply Microkernel
For a shared array of N elements with block size B (E: element size in bytes, P: pointer size in bytes):

               Storage per shared array   Memory accesses per        Arithmetic operations per
                                          shared memory access       shared memory access
  UPC.O0       N*E                        more than 25               more than 5
  UPC.O0.FT    N*E + N*P*THREADS          1                          0
  UPC.O0.RT    N*E + (N/B)*P*THREADS      1                          up to 3

- Comparison among the optimizations of storage, memory access, and computation requirements
- The number of loads and stores can increase with the arithmetic operations

UPC Work-Sharing Construct Optimizations
By thread/index number (for, integer test):
  for (i = 0; i < N; i++) {
      if (MYTHREAD == i % THREADS)
          loop body;
  }
By thread/index number (upc_forall, integer affinity):
  upc_forall (i = 0; i < N; i++; i)
      loop body;
By the address of a shared variable (for, address test):
  for (i = 0; i < N; i++) {
      if (upc_threadof(&shared_var[i]) == MYTHREAD)
          loop body;
  }
By the address of a shared variable (upc_forall, address affinity):
  upc_forall (i = 0; i < N; i++; &shared_var[i])
      loop body;
By thread/index number (for, optimized):
  for (i = MYTHREAD; i < N; i += THREADS)
      loop body;
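A small self-contained sketch (illustrative names and sizes, not from the talk) wrapping two of the variants above into a complete program: with the default cyclic layout, the upc_forall address-affinity form and the manually optimized for loop assign each iteration to the same owning thread. A static THREADS environment is assumed for the array declaration.

```c
#include <upc_relaxed.h>

#define N 1000                      /* illustrative size (assumption) */
shared int shared_var[N];           /* default block size [1]: element i lives on thread i % THREADS;
                                       assumes a static THREADS environment */

int main(void)
{
    int i;

    /* upc_forall, address affinity: iteration i runs on the thread that owns
       shared_var[i], so every access below is local. */
    upc_forall (i = 0; i < N; i++; &shared_var[i])
        shared_var[i] = i;

    upc_barrier;

    /* Hand-optimized equivalent: with the cyclic layout, the owner of
       element i is thread i % THREADS, so each thread strides directly
       through its own elements. */
    for (i = MYTHREAD; i < N; i += THREADS)
        shared_var[i] = 2 * shared_var[i];

    upc_barrier;
    return 0;
}
```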
Performance of Equivalent upc_forall and for Loops
[Figure: time in seconds on 1-16 processors for the five loop variants: upc_forall address, upc_forall integer, for address, for integer, and for optimized]

Performance Limitations Imposed by Sequential C Compilers - STREAM
  MB/s             Bulk                        Element-by-element
                   memcpy  memset  Struct cp   Copy (arr)  Copy (ptr)  Set     Sum     Scale   Add     Triad
  NUMA    Fortran  291.21  163.90  N/A         291.59      N/A         159.68  135.37  246.30  235.10  303.82
          C        231.20  214.62  158.86      120.57      152.77      147.70  298.38  133.40   13.86   20.71
  Vector  Fortran  14423   11051   N/A         14407       N/A         11015   17837   14423   10715   16053
          C        18850    5307   7882         7972        7969       10576   18260    7865    3874    5824

Loopmark - SET/ADD Operations
Comparing the compiler loopmark listings for each Fortran / C operation on the vector system
(legend: V = vectorized, M = multistreamed, p = conditional, partial and/or computed; timing calls via mysecond()/mysecond_() surround each operation in both versions):

  MEMSET (bulk set)
    Fortran:  a(1:n) = 1.0d0                                  -> V M
    C:        memset(a, 1, NDIM*sizeof(elem_t));              (library call)
  SET
    Fortran:  arrsum = 2.0d0
              DO i = 1,n
                c(i) = arrsum
                arrsum = arrsum + 1
              END DO                                           -> MV
    C:        set = 2;
              for (i=0; i<NDIM; i++) { c[i] = (set++); }       -> MV
  ADD
    Fortran:  c(1:n) = a(1:n) + b(1:n)                         -> V M
    C:        for (j=0; j<NDIM; j++) { c[j] = a[j] + b[j]; }   -> Vp

UPC vs. CAF Using the NPB Workloads
In general, UPC is slower than CAF, mainly due to:
- Point-to-point vs. barrier synchronization
  - Better scalability with proper collective operations
  - Program writers can do point-to-point synchronization using the current constructs
- Scalar performance of source-to-source translated code
  - Alias analysis (C pointers): can highlight the need to explicitly use restrict to help several compiler backends
  - Lack of support for multi-dimensional arrays in C: can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC
  - Need for exhaustive C compiler analysis:
    - A failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than CAF
    - A failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
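The restrict point above can be illustrated with a sketch like the following (an illustrative triad-style kernel, not from the talk): qualifying the pointers gives the C backend the no-alias guarantee that a Fortran front end can assume by default, which is what allows aggressive vectorization, unrolling, and software pipelining.

```c
/* Without restrict, a C backend must assume c, a, and b may alias, which can
   block vectorization, unrolling, and software pipelining of this kernel.
   With restrict, the no-alias guarantee is explicit (C99). */
void triad(double *restrict c, const double *restrict a,
           const double *restrict b, double scalar, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + scalar * b[i];
}
```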
Conclusions
- UPC is a locality-aware parallel programming language
- With proper optimizations, UPC can outperform MPI on random short accesses and can otherwise perform as well as MPI
- UPC is very productive, and UPC applications result in much smaller and more readable code than MPI
- UPC compiler optimizations are still lagging, in spite of the substantial progress that has been made
- For future architectures, UPC has the unique opportunity of having very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions

Conclusions
In general, there are four types of optimizations:
- Optimizations that exploit the locality consciousness and other unique features of UPC
- Optimizations that keep the overhead of UPC low
- Optimizations that exploit architectural features
- Standard optimizations that are applicable to all systems and compilers

Conclusions
Optimizations are possible at three levels:
- A source-to-source program acting during the compilation phase and incorporating most UPC-specific optimizations
- C backend compilers that can compete with Fortran
- A strong run-time system that can work effectively with the operating system

Selected Publications
- T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming. John Wiley & Sons, New York, 2005. ISBN 0-471-22048-5. (June 2005)
- T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, and A. Mohamed, "Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study", Journal of Future Generation Computer Systems, North-Holland (accepted).
- T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 IEEE International Conference on Parallel Processing (ICPP'01), pp. 365-372.
- T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study", Supercomputing 2002 (SC2002), Baltimore, November 2002.
- F. Cantonnet, T. El-Ghazawi, P. Lorenz, and J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005.
- CUG and PPOP