Applications of UPC
Jason Beech-Brandt, Cray Centre of Excellence for HECToR
Novel Parallel Programming Languages for HPC, EPCC, Edinburgh, 18 June 2008

Outline
UPC basics
• Shared and private data – scalars and arrays
• Pointers
• Dynamic memory
Practical uses of UPC
• GUPS (Global Random Access) benchmark
• BenchC – MPI CFD code transitioned to a UPC/MPI hybrid
• XFlow – pure UPC CFD code
Cray support for PGAS
• Legacy – X1E, T3E
• Current – X2, XT
• Future – Baker

Context
Most parallel programs are written using either:
• Message passing with a SPMD model
  Usually for scientific applications with C/Fortran
  Scales easily
• Shared memory with threads, in OpenMP, Threads+C/C++/Fortran, or Java
  Usually for non-scientific applications
  Easier to program, but less scalable performance
Global Address Space (GAS) languages take the best of both:
• global address space like threads (programmability)
• SPMD parallelism like MPI (performance)
• local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space Languages
Explicitly parallel programming model with SPMD parallelism
• Fixed at program start-up, typically one thread per processor
Global address space model of memory
• Allows the programmer to directly represent distributed data structures
Address space is logically partitioned
• Local vs. remote memory (two-level hierarchy)
Programmer control over performance-critical decisions
• Data layout and communication
Performance transparency and tunability are goals
Multiple PGAS languages: UPC (C), CAF (Fortran), Titanium (Java)

UPC Overview and Design Philosophy
Unified Parallel C (UPC) is:
• An explicit parallel extension of ANSI C
• A partitioned global address space language
• Sometimes called a GAS language
Similar to the C language philosophy
• Programmers are clever and careful, and may need to get close to the hardware to get performance – but can get in trouble
• Concise and efficient syntax
Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
Based on ideas in Split-C, AC, and PCP

UPC Execution Model
A number of threads working independently in a SPMD fashion
• Number of threads specified at compile time or run time; available as the program variable THREADS
• MYTHREAD specifies the thread index (0..THREADS-1)
• upc_barrier is a global synchronization: all threads wait
• There is a form of parallel loop, upc_forall
There are two compilation modes
• Static threads mode: THREADS is specified at compile time by the user
  The program may use THREADS as a compile-time constant
• Dynamic threads mode: compiled code may be run with varying numbers of threads

Hello World in UPC
Any legal C program is also a legal UPC program.
If you compile and run it as UPC with P threads, it will run P copies of the program. Using this fact, plus the identifiers from the previous slides, we can write a parallel hello world:

  #include <upc.h>    /* needed for UPC extensions */
  #include <stdio.h>

  main() {
    printf("Thread %d of %d: hello UPC world\n", MYTHREAD, THREADS);
  }
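To see the execution model in action, here is a minimal sketch (an addition to the deck's example, using only the standard identifiers introduced above): the threads take turns printing, with a barrier separating the turns.

  #include <upc.h>
  #include <stdio.h>

  int main() {
    int t;
    for (t = 0; t < THREADS; t++) {
      if (t == MYTHREAD)
        printf("Thread %d of %d checking in\n", MYTHREAD, THREADS);
      upc_barrier;     /* every thread reaches this barrier on each pass */
    }
    return 0;
  }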
Private vs. Shared Variables in UPC
Normal C variables and objects are allocated in the private memory space of each thread.
Shared variables are allocated only once, with thread 0:

  shared int ours;    /* use sparingly: performance */
  int mine;

[Figure: the global address space – "ours" sits in thread 0's shared partition; every thread has its own private "mine".]

Shared and Private Data
Examples of shared and private data layout. Assume THREADS = 4:

  shared int x;           /* x will have affinity to thread 0 */
  shared int y[THREADS];
  int z;

will result in the layout:

  Thread 0    Thread 1    Thread 2    Thread 3
  x, y[0]     y[1]        y[2]        y[3]      (shared)
  z           z           z           z         (private)

Shared and Private Data

  shared int A[4][THREADS];

will result in the following data layout (element A[i][j] has affinity to thread j):

  Thread 0    Thread 1    Thread 2    ...
  A[0][0]     A[0][1]     A[0][2]
  A[1][0]     A[1][1]     A[1][2]
  A[2][0]     A[2][1]     A[2][2]
  A[3][0]     A[3][1]     A[3][2]

Blocking of Shared Arrays
The default block size is 1.
Shared arrays can be distributed on a block-per-thread basis, round robin, with arbitrary block sizes.
A block size is specified in the declaration as follows:
• shared [block-size] type array[N];
• e.g.: shared [4] int a[16];

Blocking of Shared Arrays
Block size and THREADS determine affinity.
Affinity means in which thread's local shared-memory space a shared data item will reside.
Element i of a blocked array has affinity to thread:

  floor(i / blocksize) mod THREADS

Shared and Private Data
Shared objects are placed in memory based on affinity.
Affinity can also be defined based on the ability of a thread to refer to an object through a private pointer.
All non-array shared-qualified objects, i.e. shared scalars, have affinity to thread 0.
Threads can access both shared and private data.

Shared and Private Data
Assume THREADS = 4:

  shared [3] int A[4][THREADS];

will result in the following data layout (blocks of three consecutive elements, dealt round robin):

  Thread 0    Thread 1    Thread 2    Thread 3
  A[0][0]     A[0][3]     A[1][2]     A[2][1]
  A[0][1]     A[1][0]     A[1][3]     A[2][2]
  A[0][2]     A[1][1]     A[2][0]     A[2][3]
  A[3][0]     A[3][3]
  A[3][1]
  A[3][2]
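The affinity rule above can be checked at run time. The following sketch is an addition (not from the slides); upc_threadof() is the standard UPC query that returns the thread with affinity to a shared address, and its answer should agree with floor(k/3) mod THREADS for linearized index k.

  #include <upc.h>
  #include <stdio.h>

  shared [3] int A[4][THREADS];

  int main() {
    int i, j;
    if (MYTHREAD == 0)
      for (i = 0; i < 4; i++)
        for (j = 0; j < THREADS; j++)
          /* upc_threadof() reports which thread owns a shared element */
          printf("A[%d][%d] has affinity to thread %d\n",
                 i, j, (int)upc_threadof(&A[i][j]));
    return 0;
  }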
upc_forall
A vector addition can be written as follows:

  #define N 100*THREADS
  shared int v1[N], v2[N], sum[N];

  void main() {
    int i;
    upc_forall(i=0; i<N; i++; i)
      sum[i] = v1[i] + v2[i];
  }

• The fourth expression is the affinity expression: iteration i is executed by thread i mod THREADS.
• The code would be correct but slow if the affinity expression were i+1 rather than i (every access would then be remote).
• The cyclic data distribution may perform poorly on some machines.

UPC Pointers
Where does the pointer reside, and where does it point?

                            points to local     points to shared
  resides in private space  PP (p1)             SP (p2)
  resides in shared space   PS (p3)             SS (p4)

  int *p1;                 /* private pointer to local memory  */
  shared int *p2;          /* private pointer to shared space  */
  int *shared p3;          /* shared pointer to local memory   */
  shared int *shared p4;   /* shared pointer to shared space   */

A shared pointer to local memory (p3) is not recommended.

UPC Pointers
[Figure: p1 and p2 exist privately on every thread; p3 and p4 are single shared pointers with affinity to thread 0.]

  int *p1;                 /* private pointer to local memory  */
  shared int *p2;          /* private pointer to shared space  */
  int *shared p3;          /* shared pointer to local memory   */
  shared int *shared p4;   /* shared pointer to shared space   */

Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.

Common Uses for UPC Pointer Types
int *p1;
• These pointers are fast (just like C pointers)
• Use to access local data in the part of the code performing local work
• Often cast a pointer-to-shared to one of these to get faster access to shared data that is local
shared int *p2;
• Use to refer to remote data
• Larger and slower, due to the test-for-local plus possible communication
int *shared p3;
• Not recommended
shared int *shared p4;
• Use to build shared linked structures, e.g., a linked list

UPC Pointers
In UPC, pointers to shared objects have three fields:
• thread number
• local address of the block
• phase (position within the block)
Example: the Cray T3E implementation packs the fields into 64 bits:

  phase: bits 63-49    thread: bits 48-38    virtual address: bits 37-0

UPC Pointers
Pointer arithmetic supports blocked and non-blocked array distributions.
Casting of shared to private pointers is allowed, but not vice versa!
When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer-to-shared may be lost.
Casting shared to local is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast.
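A minimal sketch of the cast in the legal direction (an addition, not from the deck): each thread casts the one shared element it owns down to an ordinary C pointer for fast local access.

  #include <upc.h>

  shared int x[THREADS];

  int main() {
    /* well defined: x[MYTHREAD] has affinity to the casting thread */
    int *p = (int *) &x[MYTHREAD];
    *p = MYTHREAD;       /* fast local access, no communication */
    return 0;
  }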
Dynamic Memory Allocation in UPC
Dynamic memory allocation of shared memory is available in UPC.
Functions can be collective or not.
A collective function has to be called by every thread and will return the same value to all of them.
By convention, the name of a collective function typically includes "all".

Collective Global Memory Allocation

  shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
  nblocks: number of blocks
  nbytes:  block size

This function has the same result as upc_global_alloc, but it is collective and is expected to be called by all threads.
All the threads will get the same pointer.
Equivalent to:

  shared [nbytes] char[nblocks * nbytes]

Collective Global Memory Allocation
[Figure: one region of THREADS blocks of N bytes spread across the shared partitions; every thread's private ptr refers to the same region.]

  shared [N] int *ptr;
  ptr = (shared [N] int *) upc_all_alloc(THREADS, N*sizeof(int));

Global Memory Allocation

  shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
  nblocks: number of blocks
  nbytes:  block size

Non-collective; expected to be called by one thread.
The calling thread allocates a contiguous memory region in the shared space.
Space allocated per calling thread is equivalent to:

  shared [nbytes] char[nblocks * nbytes]

If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer.
[Figure: each calling thread obtains its own region of blocks distributed across all threads' shared partitions.]

Local-Shared Memory Allocation

  shared void *upc_alloc(size_t nbytes);
  nbytes: block size

Non-collective; expected to be called by one thread.
The calling thread allocates a contiguous memory region in the local-shared space of the calling thread.
Space allocated per calling thread is equivalent to:

  shared [] char[nbytes]

If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer.

Local-Shared Memory Allocation
[Figure: each calling thread's region lives entirely in its own shared partition.]

  shared [] int *ptr;
  ptr = (shared [] int *) upc_alloc(N*sizeof(int));

Memory Space Clean-up

  void upc_free(shared void *ptr);

The upc_free function frees the dynamically allocated shared memory pointed to by ptr.
upc_free is not collective.
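Putting the allocation calls together, a hedged sketch (added here; the array name and block size are illustrative): a collective allocation, per-thread initialization through blocked indexing, and a single non-collective free.

  #include <upc.h>

  #define N 10

  int main() {
    int i;
    /* collective: all threads call, and all receive the same pointer
       to THREADS blocks of N ints */
    shared [N] int *work =
        (shared [N] int *) upc_all_alloc(THREADS, N * sizeof(int));

    for (i = 0; i < N; i++)
      work[MYTHREAD * N + i] = MYTHREAD;  /* block MYTHREAD is local */

    upc_barrier;           /* all writes complete ...                */
    if (MYTHREAD == 0)
      upc_free(work);      /* ... then one thread frees (not collective) */
    return 0;
  }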
Lots more I haven't mentioned!
Synchronization – there is no implicit synchronization among the threads; it's up to you!
• Barriers (blocking): upc_barrier expr_opt;
• Split-phase barriers (non-blocking): upc_notify expr_opt; upc_wait expr_opt;
  Note: upc_notify is not blocking; upc_wait is.
• Locks – collective and global
String functions in UPC
• UPC equivalents of memcpy, memset
upc_forall
• Work-sharing for construct
Special functions
• Shared pointer information (phase, block size, thread number)
• Shared object information (size, block size, element size)
UPC collectives
UPC-IO
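As a small taste of the synchronization features just listed, a hedged sketch (an addition to the deck; upc_all_lock_alloc, upc_lock/upc_unlock, upc_lock_free, and the upc_notify/upc_wait pair are all standard UPC): a lock-protected shared counter followed by a split-phase barrier.

  #include <upc.h>
  #include <stdio.h>

  shared int counter;     /* shared scalar: affinity to thread 0 */

  int main() {
    /* collective: every thread receives the same lock */
    upc_lock_t *lock = upc_all_lock_alloc();

    upc_lock(lock);
    counter++;            /* critical section: one thread at a time */
    upc_unlock(lock);

    upc_notify;           /* non-blocking half of the split-phase barrier  */
    /* independent local work could overlap communication here */
    upc_wait;             /* blocking half: completes when all have notified */

    if (MYTHREAD == 0) {
      printf("counter = %d (expect %d)\n", counter, THREADS);
      upc_lock_free(lock);   /* like upc_free, not collective */
    }
    return 0;
  }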
UPC in Use
• Global Random Access (GUPS) benchmark written in UPC
• BenchC – unstructured-mesh CFD code – hybrid UPC/MPI
• XFlow – unstructured dynamic-mesh CFD code – pure UPC

UPC Random Access: Designed for Speed
This version of UPC Random Access was originally written in Spring 2004.
Written to maximize speed:
• Had to work inside the HPCC benchmark
• Had to run well on any number of CPUs
It also happens to be a very productive way of writing the Global RA.

UPC Random Access: Highlights
Trivial to parallelize: each PE gets its share of updates.
Unified Parallel C allows direct, one-sided access to distributed variables; NO two-sided messages!
Decomposed "Table" into two dimensions to allow explicit, fast computation of LocalOffset and PE number.
The serial version is very succinct:

  u64Int Ran;
  Ran = 1;
  for (i=0; i<NUPDATE; i++) {
    Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0);
    GlobalOffset = Ran & (TABSIZE-1);
    Table[GlobalOffset] ^= Ran;
  }

Productivity: Fewer Lines of Code
The entire UPC update loop:

  #pragma _CRI concurrent
  for (j=0; j<STRIPSIZE; j++)
    for (i=0; i<SendCnt/STRIPSIZE; i++) {
      VRan[j] = (VRan[j] << 1) ^ ((s64Int) VRan[j] < ZERO64B ? POLY : ZERO64B);
      GlobalOffset = VRan[j] & (TableSize - 1);
      if (PowerofTwo)
        LocalOffset = GlobalOffset >> logNumProcs;
      else
        LocalOffset = (double)GlobalOffset / (double)THREADS;
      WhichPe = GlobalOffset - LocalOffset*THREADS;
      Table[LocalOffset][WhichPe] ^= VRan[j];
    }

The base (MPI) version, by contrast:

  NumRecvs = (NumProcs > 4) ? (Mmin(4, MAX_RECV)) : 1;
  for (j = 0; j < NumRecvs; j++)
    MPI_Irecv(&LocalRecvBuffer[j*LOCAL_BUFFER_SIZE], localBufferSize,
              INT64_DT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq[j]);
  while (i < SendCnt) {
    do {
      MPI_Testany(NumRecvs, inreq, &index, &have_done, &status);
      if (have_done) {
        if (status.MPI_TAG == UPDATE_TAG) {
          MPI_Get_count(&status, INT64_DT, &recvUpdates);
          bufferBase = index*LOCAL_BUFFER_SIZE;
          for (j=0; j < recvUpdates; j++) {
            inmsg = LocalRecvBuffer[bufferBase+j];
            LocalOffset = (inmsg & (TableSize - 1)) - GlobalStartMyProc;
            HPCC_Table[LocalOffset] ^= inmsg;
          }
        } else if (status.MPI_TAG == FINISHED_TAG) {
          NumberReceiving--;
        } else {
          abort();
        }
        MPI_Irecv(&LocalRecvBuffer[index*LOCAL_BUFFER_SIZE], localBufferSize,
                  INT64_DT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                  &inreq[index]);
      }
    } while (have_done && NumberReceiving > 0);
    if (pendingUpdates < maxPendingUpdates) {
      Ran = (Ran << 1) ^ ((s64Int) Ran < ZERO64B ? POLY : ZERO64B);
      GlobalOffset = Ran & (TableSize-1);
      if (GlobalOffset < Top)
        WhichPe = GlobalOffset / (MinLocalTableSize + 1);
      else
        WhichPe = (GlobalOffset - Remainder) / MinLocalTableSize;
      if (WhichPe == MyProc) {
        LocalOffset = (Ran & (TableSize - 1)) - GlobalStartMyProc;
        HPCC_Table[LocalOffset] ^= Ran;
      } else {
        HPCC_InsertUpdate(Ran, WhichPe, Buckets);
        pendingUpdates++;
      }
      i++;
    } else {
      MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE);
      if (have_done) {
        outreq = MPI_REQUEST_NULL;
        pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize,
                             &peUpdates);
        MPI_Isend(&LocalSendBuffer, peUpdates, INT64_DT, (int)pe,
                  UPDATE_TAG, MPI_COMM_WORLD, &outreq);
        pendingUpdates -= peUpdates;
      }
    }
  }
  while (pendingUpdates > 0) {
    do {
      MPI_Testany(NumRecvs, inreq, &index, &have_done, &status);
      if (have_done) {
        if (status.MPI_TAG == UPDATE_TAG) {
          MPI_Get_count(&status, INT64_DT, &recvUpdates);
          bufferBase = index*LOCAL_BUFFER_SIZE;
          for (j=0; j < recvUpdates; j++) {
            inmsg = LocalRecvBuffer[bufferBase+j];
            LocalOffset = (inmsg & (TableSize - 1)) - GlobalStartMyProc;
            HPCC_Table[LocalOffset] ^= inmsg;
          }
        } else if (status.MPI_TAG == FINISHED_TAG) {
          NumberReceiving--;
        } else {
          abort();
        }
        MPI_Irecv(&LocalRecvBuffer[index*LOCAL_BUFFER_SIZE], localBufferSize,
                  INT64_DT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                  &inreq[index]);
      }
    } while (have_done && NumberReceiving > 0);
    MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE);
    if (have_done) {
      outreq = MPI_REQUEST_NULL;
      pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize,
                           &peUpdates);
      MPI_Isend(&LocalSendBuffer, peUpdates, INT64_DT, (int)pe,
                UPDATE_TAG, MPI_COMM_WORLD, &outreq);
      pendingUpdates -= peUpdates;
    }
  }
  for (proc_count = 0; proc_count < NumProcs; ++proc_count) {
    if (proc_count == MyProc) {
      finish_req[MyProc] = MPI_REQUEST_NULL;
      continue;
    }
    MPI_Isend(&Ran, 1, INT64_DT, proc_count, FINISHED_TAG,
              MPI_COMM_WORLD, finish_req + proc_count);
  }
  while (NumberReceiving > 0) {
    MPI_Waitany(NumRecvs, inreq, &index, &status);
    if (status.MPI_TAG == UPDATE_TAG) {
      MPI_Get_count(&status, INT64_DT, &recvUpdates);
      bufferBase = index * LOCAL_BUFFER_SIZE;
      for (j=0; j < recvUpdates; j++) {
        inmsg = LocalRecvBuffer[bufferBase+j];
        LocalOffset = (inmsg & (TableSize - 1)) - GlobalStartMyProc;
        HPCC_Table[LocalOffset] ^= inmsg;
      }
    } else if (status.MPI_TAG == FINISHED_TAG) {
      NumberReceiving--;
    } else {
      abort();
    }
    MPI_Irecv(&LocalRecvBuffer[index*LOCAL_BUFFER_SIZE], localBufferSize,
              INT64_DT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
              &inreq[index]);
  }
  MPI_Waitall(NumProcs, finish_req, finish_statuses);
  HPCC_FreeBuckets(Buckets, NumProcs);
  for (j = 0; j < NumRecvs; j++) {
    MPI_Cancel(&inreq[j]);
    MPI_Wait(&inreq[j], &ignoredStatus);
  }

Productivity: Algorithm Transparency
The UPC loop maps directly onto the algorithm:
• Generate the random number
• Compute GO (the global offset)
• Decompose GO into LO (the local offset) and WhichPe
• XOR VRan into Table

  #pragma _CRI concurrent
  for (j=0; j<STRIPSIZE; j++)
    for (i=0; i<SendCnt/STRIPSIZE; i++) {
      VRan[j] = (VRan[j] << 1) ^ ((s64Int)VRan[j] < ZERO64B ? POLY : ZERO64B);
      GlobalOffset = VRan[j] & (TableSize - 1);
      if (PowerofTwo)
        LocalOffset = GlobalOffset >> logNumProcs;
      else
        LocalOffset = (double)GlobalOffset / (double)THREADS;
      WhichPe = GlobalOffset - LocalOffset*THREADS;
      Table[LocalOffset][WhichPe] ^= VRan[j];
    }

Productivity + Speed = Results
UPC Random Access sustains 7.69 GUPS on 1008 Cray X1E MSPs.
• Works inside the HPCC framework
• Is "in the spirit" of the benchmark
• Easy to understand and modify if the computations become more complex
The future:
• Atomic XORs will vastly improve performance – all memory references will be "fire and forget"
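For illustration, a hedged sketch of the two-dimensional decomposition idea described above (the declaration style and sizes are assumptions, not the benchmark's actual source): with THREADS as the second dimension and the default block size of 1, element [i][j] has affinity to thread j, so Table[LocalOffset][WhichPe] is a direct one-sided update of thread WhichPe's memory.

  #include <upc.h>

  #define LOCAL_SIZE 4096                 /* illustrative per-thread size */
  typedef unsigned long long u64Int;

  shared u64Int Table[LOCAL_SIZE][THREADS];   /* [i][j] lives on thread j */

  int main() {
    u64Int ran = (u64Int)MYTHREAD * 2654435761u + 1;   /* stand-in stream */
    int LocalOffset = (int)(ran % LOCAL_SIZE);
    int WhichPe     = (int)(ran % THREADS);
    /* one-sided remote XOR; concurrent updates may race, as GUPS allows */
    Table[LocalOffset][WhichPe] ^= ran;
    return 0;
  }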
BenchC
Unstructured finite element CFD solver, developed at AHPCRC.
Extensive and detailed timing routines:
• I/O
• Problem setup and initialization
• Communication routines
• Matrix solver routines
• "Number crunching" routines
Extensively validated.
MPI performance is very well understood on a number of platforms; UPC performance is well understood on the Cray X1/E.
Communication during the number-crunching kernels can be either 100% MPI or 100% UPC:
• Compile-time option
• The UPC version still uses MPI during the set-up stages (not tracked in the performance measurements)

BenchC
MPI mode
• All communication and timing done using MPI calls
UPC mode
• Problem I/O, setup, and initialization done using MPI
• Core communication (gathers, scatters, collective operations) and timing done using UPC
The communication method (MPI vs. UPC) is selected at compile time.
UPC was incrementally added to the code, one routine at a time.

Data Layout
Very lightweight modification to the code – little disruption.
Create shared buffers for the boundary exchanges:

  shared double *shared buffSH[THREADS];
  buffSH[MYTHREAD] = (shared double *shared) upc_alloc(nec*sizeof(Element));
  buff = (double *) buffSH[MYTHREAD];

[Figure: each thread's buffer lives in its own shared partition; buffSH[] is a shared directory of pointers, and each thread keeps a fast private alias buff.]

BenchC UPC Communication Patterns
Gathers/scatters (after an initial ordering of the data):

  ip = ep[i];
  for (j = 0; j < num; j++)
    bg[iloc1 + j] = buffSH[ip][iloc2 + j];

Broadcasts/reductions:

  global_normSH[MYTHREAD] = loc_val;
  if (MYTHREAD == 0) {
    for (ip = 1; ip < THREADS; ip++)
      loc_val += global_normSH[ip];
    for (ip = 0; ip < THREADS; ip++)
      global_normSH[ip] = loc_val;
  }

On the Cray X1/E all communication algorithms multistream and vectorize.
UPC performance is very good:
• PGAS concepts are supported at the hardware level

BenchC on the X1/E
[Figure: speedup of UPC and MPI BenchC on the Cray X1E – GFlops versus number of MSPs (up to 128), with the UPC curve above the MPI curve and both close to the "perfect" line.]
UPC performance exceeds that of MPI, by an increasing margin as the processor count grows – this validates the UPC approach.

XFlow
Complex CFD applications have moving components and/or changing domain shapes:
• Rotational geometries and/or flapping wings
• Most fluid-structure interaction applications
• Engines, turbines, pumps, etc.
• Fluid-particle flows and free-surface flows
Many methods have been developed to solve these types of applications:
• None is ideal, and all have limitations
An MPI approach to "dynamic-mesh CFD" was unsuccessful in the past due to algorithm complexity (~1997 time frame).
The ultimate goal of "dynamic-mesh CFD":
• Traditionally, a mesh is created and the computed solution is then "dependent" on (i.e. only as good as) that initial mesh
• The ultimate goal of this project is to reverse this paradigm: the mesh should be "dependent" on the solution, by continuously changing

XFlow
Fully integrate automatic mesh generation within the parallel flow solver:
• Mesh generation never stops; it runs in conjunction with the flow solver
  Element connectivity changes as required to maintain a "Delaunay" mesh
  New nodes are added as required to match user-specified refinement values
  Existing nodes are deleted when not needed
• The mesh continuously changes due to changes in the geometry and/or the solution
  The mesh can grow or shrink at each time step
A very complicated method:
• Parallelism (UPC), vectorization, dynamic data structures, solvers (mesh moving and fluid flow), general CFD accuracy, scalability, CAD links, etc.
• Will take time to fully evaluate

XFlow
The mesh is distributed amongst all processors (fairly uniform loading).
Developed a fast parallel Recursive Center Bisection mesh partitioner.
Each processor maintains and controls its own piece of the mesh:
• Each processor has a list of nodes, faces, and elements
• Each list consists of an array of C structures (Node, Face, or Element arrays)
• These arrays are declared "shared", which adds a "processor dimension" to each array:

  elementsSH[proc][local-index].n[0-through-4]
  elementsSH[proc][local-index].np[0-through-4]
  elements[local-index].c[X]
  elements[local-index].det

Any processor can read from, or write to, another processor's "entities" whenever required.
Only one processor is allowed to actually change the mesh at a time.
Lots of barriers (upc_barrier) and synchronization throughout.

XFlow
What this method needs:
• Each processor able to reference any arbitrary element, face, or node across the entire mesh
• Each processor able to modify any other processor's portion of the mesh
• Each processor able to search anywhere in the mesh
• For performance, minimize these off-processor references by using smart mesh-partitioning techniques
Why MPI is not a good fit for this method:
• Can't arbitrarily read from or write to other processors' data
• Searches "stop" at processor/partition boundaries
  Partition boundaries are "hard" (enforced) boundaries
• A processor can't change or alter another processor's mesh structure
Why UPC is good:
• You can do these kinds of things
• Need to use memory/process barriers carefully

Implementation using UPC (continued)
At each time step:
1. Test whether re-partitioning is required
2. Set up inter-processor communication
   • Required at each time step, if the mesh has changed
3. Block (color) elements into vectorizable groups
   • Due to the node-scatter operation in the finite element routines
4. Calculate the refinement value at each mesh node by solving the Laplace equation with known values on the boundaries
   • May be replaced by relating mesh refinement values to an error measure
5. Move the mesh by solving a modified form of the linear elasticity equations
6. Solve the coupled fluid-flow equation system
7. Apply the "dynamic-mesh update" routines:
   1. Swap element faces in order to obtain a "Delaunay" mesh
   2. Add new nodes (based on element measures) at locations where there are not enough
   3. Delete existing nodes from locations where there are too many
   4. Swap element faces to improve mesh quality

Implementation using UPC (mesh partitioning)
• Recursive Center Bisection mesh partitioner
• Partitioning and (especially) re-distribution of the mesh is easy with UPC
  – Complicated, with lots of code required, to do this with MPI
• Vectorizes fairly well, so re-partitioning can be carried out quickly

Implementation using UPC (algorithms)
• "1D" (walking) search algorithms are very important
• They may also span processor boundaries
  – Easy to implement using UPC
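What such a cross-partition walk can look like, as a hedged sketch (the Element fields and the ring initialization below are illustrative stand-ins for XFlow's n/np neighbour links, not its actual structures):

  #include <upc.h>

  typedef struct {
    int nb;      /* local index of one neighbouring element */
    int nbp;     /* thread that owns that neighbour          */
  } Element;

  shared Element *shared elementsSH[THREADS];

  /* Follow 'hops' neighbour links from element (ep, e); the walk crosses
     partition boundaries simply by indexing elementsSH[ep][e]. */
  void walk(int ep, int e, int hops) {
    while (hops-- > 0) {
      Element cur = elementsSH[ep][e];   /* one-sided read, local or remote */
      ep = cur.nbp;
      e  = cur.nb;
    }
  }

  int main() {
    elementsSH[MYTHREAD] = (shared Element *) upc_alloc(sizeof(Element));
    upc_barrier;
    elementsSH[MYTHREAD][0].nb  = 0;                         /* link each element */
    elementsSH[MYTHREAD][0].nbp = (MYTHREAD + 1) % THREADS;  /* to the next thread's */
    upc_barrier;
    walk(MYTHREAD, 0, THREADS);        /* walk once around the ring */
    return 0;
  }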
Implementation using UPC (example)

  /* ELEMENT BACKUP */
  elemA = elementsSH[epA][eA];
  elemB = elementsSH[epB][eB];

  /* PRELIMINARIES */
  ifc  = elementsSH[epA][eA].f [iA];
  ifcP = elementsSH[epA][eA].fp[iA];
  if (facesSH[ifcP][ifc].fix) return NOT_OK;
  nA  = elementsSH[epA][eA].n [iA];
  nB  = elementsSH[epB][eB].n [iB];
  npA = elementsSH[epA][eA].np[iA];
  npB = elementsSH[epB][eB].np[iB];
  for (i = 0; i < NFC; i++) {
    fB[i]  = elementsSH[epB][eB].f [i];
    fpB[i] = elementsSH[epB][eB].fp[i];
  }
  if (fqc == AR) {
    q1 = elementsSH[epA][eA].qu;
    if (elementsSH[epB][eB].qu > q1) q1 = elementsSH[epB][eB].qu;
  }
  nmin = elementsSH[epA][eA].n [iminA];
  pmin = elementsSH[epA][eA].np[iminA];
  nmid = elementsSH[epA][eA].n [imidA];
  pmid = elementsSH[epA][eA].np[imidA];
  nmax = elementsSH[epA][eA].n [imaxA];
  pmax = elementsSH[epA][eA].np[imaxA];

  /* CONSTRUCT NEW ELEMENTS */
  neplus = 0;
  ne1 = eA;
  ne2 = eB;
  ne3 = new_elem(&neplus);
  nep1 = epA;
  nep2 = epB;
  nep3 = MYTHREAD;
  hn[0] = elementsSH[nep2][ne2].hn;
  hn[1] = elementsSH[nep3][ne3].hn;
  elementsSH[nep2][ne2] = elementsSH[epA][eA];
  elementsSH[nep3][ne3] = elementsSH[epA][eA];
  elementsSH[nep2][ne2].hn = hn[0];
  elementsSH[nep3][ne3].hn = hn[1];
  elementsSH[nep1][ne1].n [iminA] = nB;
  elementsSH[nep1][ne1].np[iminA] = npB;
  elementsSH[nep2][ne2].n [imidA] = nB;
  elementsSH[nep2][ne2].np[imidA] = npB;
  elementsSH[nep3][ne3].n [imaxA] = nB;
  elementsSH[nep3][ne3].np[imaxA] = npB;

  /* CHECK NEW ELEMENT STATUS */
  ier1 = elem_state(ne1, nep1);
  ier2 = elem_state(ne2, nep2);
  ier3 = elem_state(ne3, nep3);
  if (ier1 != OK || ier2 != OK || ier3 != OK)
    return NOT_OK;

Application – Micro-UAVs
Important technology for the DoD.
Lots of areas to look into:
• Optimization techniques for shape and motion
• Optimization techniques for control algorithms
• Fluid-structure interactions
Clips from "Artificial Muscles EAP", Yoseph Bar-Cohen, Ph.D., Jet Propulsion Laboratory
http://ndeaa.jpl.nasa.gov/nasa-nde/lommas/eap/EAP-web.htm

Micro-UAVs (results)
Reynolds number ~ 400
[Figures: velocity vectors at a cross-section; mesh at a cross-section.]

Cray UPC/CAF Support
Supported from the T3E days until the present!
X1/X2
• UPC/CAF supported natively in the Cray compiler and at the hardware level – remote address translation is done in hardware
• Cray compilers have supported UPC and CAF for many years – lots of code exposure and aggressive optimization of PGAS codes
• The performance tools (CrayPat) allow collecting data on UPC (and CAF) codes
• TotalView supports debugging of UPC code
• HECToR X2 addition – happening now!
XT
• UPC support through Berkeley UPC – distributed in module format
Future architectures – PGAS support in hardware

References
http://upc.gwu.edu/ – Unified Parallel C at George Washington University
http://upc.lbl.gov/ – Berkeley Unified Parallel C Project
http://docs.cray.com/ – Cray C and C++ Reference Manual