Applications of UPC
Jason J Beech-Brandt
Cray Centre of Excellence for HECToR
Novel Parallel Programming Languages for HPC
EPCC, Edinburgh
18 June 2008
Outline
UPC basics…
• Shared and Private data – scalars and arrays
• Pointers
• Dynamic memory
Practical uses of UPC
• GUPS (Global Random Access) benchmark
• BenchC – MPI CFD code transitioned to UPC/MPI hybrid
• Xflow – Pure UPC CFD code
Cray support for PGAS
• Legacy – X1E, T3E
• Current – X2, XT
• Future – Baker
Context
Most parallel programs are written using either:
• Message passing with a SPMD model
Usually for scientific applications with C/Fortran
Scales easily
• Shared memory with threads in OpenMP, Threads+C/C++/Fortran or
Java
Usually for non-scientific applications
Easier to program, but less scalable performance
Global Address Space (GAS) Languages take the best of both
• global address space like threads (programmability)
• SPMD parallelism like MPI (performance)
• local/global distinction, i.e., layout matters (performance)
Partitioned Global Address Space Languages
Explicitly-parallel programming model with SPMD parallelism
• Fixed at program start-up, typically 1 thread per processor
Global address space model of memory
• Allows programmer to directly represent distributed data structures
Address space is logically partitioned
• Local vs. remote memory (two-level hierarchy)
Programmer control over performance critical decisions
• Data layout and communication
Performance transparency and tunability are goals
Multiple PGAS languages: UPC (C), CAF (Fortran), Titanium (Java)
UPC Overview and Design Philosophy
Unified Parallel C (UPC) is:
• An explicit parallel extension of ANSI C
• A partitioned global address space language
• Sometimes called a GAS language
Similar to the C language philosophy
• Programmers are clever and careful, and may need to get close to the hardware to get performance, but can get in trouble
• Concise and efficient syntax
Common and familiar syntax and semantics for parallel C with simple
extensions to ANSI C
Based on ideas in Split-C, AC, and PCP
UPC Execution Model
A number of threads working independently in a SPMD fashion
• Number of threads specified at compile-time or run-time; available as
program variable THREADS
• MYTHREAD specifies thread index (0..THREADS-1)
• upc_barrier is a global synchronization: all wait
• There is a form of parallel loop, upc_forall
There are two compilation modes
• Static Threads mode:
THREADS is specified at compile time by the user
The program may use THREADS as a compile-time constant
• Dynamic threads mode:
Compiled code may be run with varying numbers of threads
Hello World in UPC
Any legal C program is also a legal UPC program
If you compile and run it as UPC with P threads, it will run P copies of the
program.
Using this fact, plus the identifiers from the previous slides, we can write a parallel hello world:
#include <upc.h>    /* needed for UPC extensions */
#include <stdio.h>

int main() {
  printf("Thread %d of %d: hello UPC world\n", MYTHREAD, THREADS);
  return 0;
}
Private vs. Shared Variables in UPC
Normal C variables and objects are allocated in the private memory space of each thread.
Shared variables are allocated only once, with affinity to thread 0.

shared int ours;  // use sparingly: performance
int mine;

[Figure: "ours" appears once in the global (shared) address space; each thread has its own private "mine".]
Shared and Private Data
Examples of Shared and Private Data Layout:
Assume THREADS = 4
shared int x; /*x will have affinity to thread 0 */
shared int y[THREADS];
int z;
will result in the layout:
[Layout: y[0], y[1], y[2], y[3] are distributed one element per thread (y[i] on thread i); x resides on thread 0; each thread has its own private copy of z.]
Shared and Private Data
shared int A[4][THREADS];
will result in the following data layout:
With the default block size of 1, the elements are dealt out cyclically, so column j lands on thread j:

Thread 0: A[0][0]  A[1][0]  A[2][0]  A[3][0]
Thread 1: A[0][1]  A[1][1]  A[2][1]  A[3][1]
Thread 2: A[0][2]  A[1][2]  A[2][2]  A[3][2]
Thread 3: A[0][3]  A[1][3]  A[2][3]  A[3][3]
Blocking of Shared Arrays
Default block size is 1
Shared arrays can be distributed on a block-per-thread basis, round robin, with arbitrary block sizes.
A block size is specified in the declaration as follows:
• shared [block-size] type array[N];
• e.g.: shared [4] int a[16];
Blocking of Shared Arrays
Block size and THREADS determine affinity
The term affinity refers to the thread in whose local shared-memory space a shared data item resides
Element i of a blocked array has affinity to thread:
  floor(i / blocksize) mod THREADS
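For example (a worked check, not on the original slide): with shared [4] int a[16] and THREADS = 4, element a[5] lies in block floor(5/4) = 1, and 1 mod 4 = 1, so a[5] has affinity to thread 1; likewise a[13] lies in block 3 and has affinity to thread 3.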
Shared and Private Data
Shared objects placed in memory based on affinity
Affinity can also be defined based on the ability of a thread to refer to an object by a private pointer
All non-array shared-qualified objects, i.e. shared scalars, have affinity to thread 0
Threads access shared and private data
Shared and Private Data
Assume THREADS = 4
shared [3] int A[4][THREADS];
will result in the following data layout:
Thread 0: A[0][0]  A[0][1]  A[0][2]   A[3][0]  A[3][1]  A[3][2]
Thread 1: A[0][3]  A[1][0]  A[1][1]   A[3][3]
Thread 2: A[1][2]  A[1][3]  A[2][0]
Thread 3: A[2][1]  A[2][2]  A[2][3]

(Blocks of 3 consecutive elements, taken in row-major order, are dealt round robin to the threads.)
upc_forall
• A vector addition can be written as follows; the fourth field of upc_forall is the affinity expression, so iteration i executes on thread i mod THREADS
• The code would be correct but slow if the affinity expression were i+1 rather than i
• The cyclic data distribution may perform poorly on some machines

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

int main() {
  int i;
  upc_forall (i = 0; i < N; i++; i)
      sum[i] = v1[i] + v2[i];
  return 0;
}
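A common variant, not shown on the slide, uses a pointer-to-shared as the affinity expression, so each iteration runs on the thread that owns the element it writes, whatever the block size; a minimal sketch:

#include <upc.h>

#define BLK 10
#define N   (10*BLK*THREADS)

shared [BLK] int v1[N], v2[N], sum[N];

int main() {
  int i;
  /* &sum[i] as the affinity expression: iteration i executes on the
     thread returned by upc_threadof(&sum[i]), following the blocked layout */
  upc_forall (i = 0; i < N; i++; &sum[i])
      sum[i] = v1[i] + v2[i];
  return 0;
}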
UPC Pointers
Where does the pointer point, and where does the pointer reside?

                        resides in Private    resides in Shared
  points to Local       PP (p1)               PS (p3)
  points to Shared      SP (p2)               SS (p4)

int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */
Shared to private is not recommended.
UPC Pointers
[Figure: p1 and p2 are private – each thread has its own copy; p3 and p4 are shared – a single instance in the global address space, visible to all threads.]
int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */
Pointers to shared often require more storage and are more costly to dereference;
they may refer to local or remote memory.
Common Uses for UPC Pointer Types
int *p1;
• These pointers are fast (just like C pointers)
• Use to access local data in part of code performing local work
• Often cast a pointer-to-shared to one of these to get faster access to shared data that is local (see the sketch below)
shared int *p2;
• Use to refer to remote data
• Larger and slower due to test-for-local + possible communication
int *shared p3;
• Not recommended
shared int *shared p4;
• Use to build shared linked structures, e.g., a linked list
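A minimal sketch of the "cast to a local pointer" idiom from the list above (the array name and sizes are illustrative): the block with affinity to MYTHREAD is accessed through a plain C pointer.

#include <upc.h>
#include <stdio.h>

#define BLK 256

shared [BLK] double data[BLK*THREADS];     /* one block of BLK doubles per thread */

int main() {
  shared [BLK] double *gp = &data[MYTHREAD*BLK];   /* pointer to shared (p2 style) */
  double *lp = (double *) gp;    /* p1 style: legal because this block is local   */
  int i;

  for (i = 0; i < BLK; i++)
      lp[i] = MYTHREAD + 0.5;    /* fast local stores, no shared-pointer overhead */

  upc_barrier;
  if (MYTHREAD == 0 && THREADS > 1)
      printf("data[BLK] (owned by thread 1) = %g\n", data[BLK]);
  return 0;
}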
UPC Pointers
In UPC pointers to shared objects have three fields:
• thread number
• local address of block
• phase (specifies position in the block)
Example: the Cray T3E implementation packs the three fields into one 64-bit word:
  Phase: bits 63–49   Thread: bits 48–38   Virtual Address: bits 37–0
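The three fields can be inspected with the standard UPC queries upc_threadof, upc_phaseof and upc_addrfield; a minimal sketch (the array shape is illustrative):

#include <upc.h>
#include <stdio.h>

shared [4] int a[4*THREADS];

int main() {
  if (MYTHREAD == 0) {
    shared [4] int *p = &a[6];     /* block 1, phase 2 */
    printf("thread = %d, phase = %d, addrfield = %lu\n",
           (int) upc_threadof(p), (int) upc_phaseof(p),
           (unsigned long) upc_addrfield(p));
  }
  return 0;
}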
UPC Pointers
Pointer arithmetic supports blocked and non-blocked
array distributions
Casting of shared to private pointers is allowed, but not vice versa!
When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer-to-shared may be lost
Casting of shared to local is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast
Dynamic Memory Allocation in UPC
Dynamic memory allocation of shared memory is
available in UPC
Functions can be collective or not
A collective function has to be called by every thread
and will return the same value to all of them
As a convention, the name of a collective function
typically includes “all”
Collective Global Memory Allocation
shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
  nblocks: number of blocks
  nbytes: block size
This function has the same result as upc_global_alloc, but it is a collective function, expected to be called by all threads
All the threads will get the same pointer
Equivalent to:
  shared [nbytes] char[nblocks * nbytes]
Collective Global Memory Allocation
[Figure: the THREADS blocks of N allocated by upc_all_alloc are spread one per thread across the shared space; each thread's private ptr refers to the same distributed object.]
shared [N] int *ptr;
ptr = (shared [N] int *) upc_all_alloc(THREADS, N*sizeof(int));
Global Memory Allocation
shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
  nblocks: number of blocks
  nbytes: block size
Non-collective; expected to be called by one thread
The calling thread allocates a contiguous memory region in the shared space
Space allocated per calling thread is equivalent to:
  shared [nbytes] char[nblocks * nbytes]
If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer
[Figure: upc_global_alloc called from several threads – each call creates its own set of blocks of N spread across the shared space, and each calling thread's private ptr refers to its own region.]
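A minimal usage sketch for upc_global_alloc (the block size N and the publishing pattern are illustrative, not from the slides): one thread allocates the distributed region and publishes the pointer through a shared scalar.

#include <upc.h>

#define N 4                                /* block size in elements */

shared [N] int *shared table;              /* shared pointer to shared space (p4 style) */

int main() {
  int i;
  if (MYTHREAD == 0)
      table = (shared [N] int *) upc_global_alloc(THREADS, N * sizeof(int));
  upc_barrier;                             /* every thread now sees the allocation */

  for (i = 0; i < N; i++)
      table[MYTHREAD*N + i] = MYTHREAD;    /* touch the block with local affinity */

  upc_barrier;
  if (MYTHREAD == 0)
      upc_free((shared void *) table);
  return 0;
}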
Local-Shared Memory Allocation
shared void *upc_alloc(size_t nbytes);
  nbytes: block size
Non-collective; expected to be called by one thread
The calling thread allocates a contiguous memory region in the local shared space of the calling thread
Space allocated per calling thread is equivalent to:
  shared [] char[nbytes]
If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer
Local-Shared Memory Allocation
[Figure: each calling thread's upc_alloc region of N lives entirely in that thread's local shared space; each thread's private ptr points at its own region.]
shared [] int *ptr;
ptr = (shared [] int *)upc_alloc(N*sizeof( int ));
Memory Space Clean-up
void upc_free(shared void *ptr);
The upc_free function frees the dynamically allocated
shared memory pointed to by ptr
upc_free is not collective
Lots more I haven’t mentioned!
Synchronization – no implicit synchronization among the threads – it’s up to you!
• Barriers (Blocking)
  upc_barrier expr_opt;
• Split-Phase Barriers (Non-blocking) – see the sketch at the end of this list
  upc_notify expr_opt;
  upc_wait expr_opt;
  Note: upc_notify does not block; upc_wait does
• Locks – collective and global
String functions in UPC
• UPC equivalents of memcpy, memset
upc_forall
• Work sharing for construct
Special functions
• Shared pointer information (phase, block size, thread number)
• Shared object information (size, block size, element size)
UPC collectives
UPC-IO
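A minimal sketch of the split-phase barrier mentioned above (the shared flags array and the local work are illustrative): purely local computation is overlapped with the barrier.

#include <upc.h>

shared int flags[THREADS];
int local_sum;

int main() {
  flags[MYTHREAD] = MYTHREAD;      /* publish something to shared space      */
  upc_notify;                      /* non-blocking: signal "I have arrived"  */
  local_sum = MYTHREAD * 2;        /* local work overlaps the barrier        */
  upc_wait;                        /* block until every thread has notified  */

  /* from here on it is safe to read the other threads' flags[] entries */
  return 0;
}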
UPC in Use
Global Random Access (GUPS) benchmark written in UPC
BenchC – unstructured mesh CFD code – hybrid UPC/MPI code
XFlow – unstructured dynamic mesh CFD code – UPC
UPC Random Access:
Designed for Speed
This version of UPC Random Access was originally written in
Spring 2004
Written to maximize speed
Had to work inside of the HPCC benchmark
Had to run well on any number of CPUs
Also happens to be a very productive way of writing the
Global RA.
UPC Random Access: Highlights
Trivial to parallelize, each PE gets its share of updates
Unified Parallel C allows direct, one-sided access to distributed variables;
NO two-sided messages!
Decomposed “Table” into 2 Dims. to allow explicit, fast computation of
LocalOffset and PE number
Serial version is very succinct:

u64Int Ran;
Ran = 1;
for (i = 0; i < NUPDATE; i++) {
  Ran = (Ran << 1) ^ (((s64Int) Ran < 0) ? POLY : 0);
  GlobalOffset = Ran & (TABSIZE - 1);
  Table[GlobalOffset] ^= Ran;
}
Productivity: Fewer lines of code
UPC VERSION:

#pragma _CRI concurrent
for (j=0; j<STRIPSIZE; j++)
  for (i=0; i<SendCnt/STRIPSIZE; i++) {
    VRan[j] = (VRan[j] << 1) ^ ((s64Int) VRan[j] < ZERO64B ? POLY : ZERO64B);
    GlobalOffset = VRan[j] & (TableSize - 1);
    if (PowerofTwo)
      LocalOffset = GlobalOffset >> logNumProcs;
    else
      LocalOffset = (double)GlobalOffset / (double)THREADS;
    WhichPe = GlobalOffset - LocalOffset*THREADS;
    Table[LocalOffset][WhichPe] ^= VRan[j];
  }
}

BASE VERSION (MPI):

NumRecvs = (NumProcs > 4) ? (Mmin(4, MAX_RECV)) : 1;
for (j = 0; j < NumRecvs; j++)
  MPI_Irecv(&LocalRecvBuffer[j*LOCAL_BUFFER_SIZE], localBufferSize, INT64_DT,
            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq[j]);
while (i < SendCnt) {
  do {
    MPI_Testany(NumRecvs, inreq, &index, &have_done, &status);
    if (have_done) {
      if (status.MPI_TAG == UPDATE_TAG) {
        MPI_Get_count(&status, INT64_DT, &recvUpdates);
        bufferBase = index*LOCAL_BUFFER_SIZE;
        for (j=0; j < recvUpdates; j++) {
          inmsg = LocalRecvBuffer[bufferBase+j];
          LocalOffset = (inmsg & (TableSize - 1)) - GlobalStartMyProc;
          HPCC_Table[LocalOffset] ^= inmsg;
        }
      } else if (status.MPI_TAG == FINISHED_TAG) {
        NumberReceiving--;
      } else {
        abort();
      }
Productivity: Fewer lines of code

UPC VERSION: (complete on the previous slide)

BASE VERSION (MPI, continued):

      MPI_Irecv(&LocalRecvBuffer[index*LOCAL_BUFFER_SIZE], localBufferSize, INT64_DT,
                MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq[index]);
    }
  } while (have_done && NumberReceiving > 0);
  if (pendingUpdates < maxPendingUpdates) {
    Ran = (Ran << 1) ^ ((s64Int) Ran < ZERO64B ? POLY : ZERO64B);
    GlobalOffset = Ran & (TableSize-1);
    if (GlobalOffset < Top)
      WhichPe = (GlobalOffset / (MinLocalTableSize + 1));
    else
      WhichPe = ((GlobalOffset - Remainder) / MinLocalTableSize);
    if (WhichPe == MyProc) {
      LocalOffset = (Ran & (TableSize - 1)) - GlobalStartMyProc;
      HPCC_Table[LocalOffset] ^= Ran;
    }
    else {
      HPCC_InsertUpdate(Ran, WhichPe, Buckets);
      pendingUpdates++;
    }
    i++;
  }
  else {
Productivity: Fewer lines of code

BASE VERSION (MPI, continued):

    MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE);
    if (have_done) {
      outreq = MPI_REQUEST_NULL;
      pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates);
      MPI_Isend(&LocalSendBuffer, peUpdates, INT64_DT, (int)pe, UPDATE_TAG,
                MPI_COMM_WORLD, &outreq);
      pendingUpdates -= peUpdates;
    }}}
while (pendingUpdates > 0) {
  do {
    MPI_Testany(NumRecvs, inreq, &index, &have_done, &status);
    if (have_done) {
      if (status.MPI_TAG == UPDATE_TAG) {
        MPI_Get_count(&status, INT64_DT, &recvUpdates);
        bufferBase = index*LOCAL_BUFFER_SIZE;
        for (j=0; j < recvUpdates; j++) {
          inmsg = LocalRecvBuffer[bufferBase+j];
          LocalOffset = (inmsg & (TableSize - 1)) - GlobalStartMyProc;
          HPCC_Table[LocalOffset] ^= inmsg;
        }
      } else if (status.MPI_TAG == FINISHED_TAG) {
        NumberReceiving--;
Productivity: Fewer lines of code

BASE VERSION (MPI, continued):

      } else {
        abort(); }
      MPI_Irecv(&LocalRecvBuffer[index*LOCAL_BUFFER_SIZE], localBufferSize, INT64_DT,
                MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq[index]);
    }} while (have_done && NumberReceiving > 0);
    MPI_Test(&outreq, &have_done, MPI_STATUS_IGNORE);
    if (have_done) {
      outreq = MPI_REQUEST_NULL;
      pe = HPCC_GetUpdates(Buckets, LocalSendBuffer, localBufferSize, &peUpdates);
      MPI_Isend(&LocalSendBuffer, peUpdates, INT64_DT, (int)pe, UPDATE_TAG,
                MPI_COMM_WORLD, &outreq);
      pendingUpdates -= peUpdates;
    } }
for (proc_count = 0; proc_count < NumProcs; ++proc_count) {
  if (proc_count == MyProc) { finish_req[MyProc] = MPI_REQUEST_NULL; continue; }
  MPI_Isend(&Ran, 1, INT64_DT, proc_count, FINISHED_TAG, MPI_COMM_WORLD,
            finish_req + proc_count);
}
while (NumberReceiving > 0) {
Productivity: Fewer lines of code

BASE VERSION (MPI, continued):

  MPI_Waitany(NumRecvs, inreq, &index, &status);
  if (status.MPI_TAG == UPDATE_TAG) {
    MPI_Get_count(&status, INT64_DT, &recvUpdates);
    bufferBase = index * LOCAL_BUFFER_SIZE;
    for (j=0; j < recvUpdates; j++) {
      inmsg = LocalRecvBuffer[bufferBase+j];
      LocalOffset = (inmsg & (TableSize - 1)) - GlobalStartMyProc;
      HPCC_Table[LocalOffset] ^= inmsg;
    }
  } else if (status.MPI_TAG == FINISHED_TAG) {
    NumberReceiving--;
  } else {
    abort(); }
  MPI_Irecv(&LocalRecvBuffer[index*LOCAL_BUFFER_SIZE], localBufferSize, INT64_DT,
            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &inreq[index]);
}
MPI_Waitall(NumProcs, finish_req, finish_statuses);
HPCC_FreeBuckets(Buckets, NumProcs);
for (j = 0; j < NumRecvs; j++) {
  MPI_Cancel(&inreq[j]);
  MPI_Wait(&inreq[j], &ignoredStatus);
}
Productivity: Algorithm Transparency
The UPC kernel maps directly onto the algorithm: generate a random number, compute the global offset (GO), decompose GO into a local offset (LO) and a target PE, then XOR VRan into Table.

#pragma _CRI concurrent
for (j=0; j<STRIPSIZE; j++)
  for (i=0; i<SendCnt/STRIPSIZE; i++) {
    /* generate random number */
    VRan[j] = (VRan[j] << 1) ^ ((s64Int)VRan[j] < ZERO64B ? POLY : ZERO64B);
    /* compute GO */
    GlobalOffset = VRan[j] & (TableSize - 1);
    /* decompose GO into LO and WhichPe */
    if (PowerofTwo)
      LocalOffset = GlobalOffset >> logNumProcs;
    else
      LocalOffset = (double)GlobalOffset / (double)THREADS;
    WhichPe = GlobalOffset - LocalOffset*THREADS;
    /* XOR VRan and Table */
    Table[LocalOffset][WhichPe] ^= VRan[j];
}}
Productivity + Speed = Results
UPC Random Access sustains 7.69 GUPS on 1008 Cray X1E MSPs.
Works inside the HPCC framework
Is “in the spirit” of the benchmark
Easy to understand and modify if computations are more complex
The Future
• Atomic XORs will vastly improve performance
All memory references will be “Fire and Forget”
BenchC
Unstructured finite element CFD solver
Extensive and detailed timing routines
• I/O
• Problem setup and initialization
• Communication routines
• Matrix solver routines
• “Number crunching” routines
Extensively validated
Developed at AHPCRC
MPI Performance is very well understood on a number of platforms.
UPC performance is well understood on Cray X1/E.
Communication during the number-crunching kernels can be either
100% MPI or 100% UPC
• Compile-time option
• UPC version still uses MPI during set-up stages (not tracked in
performance measurements)
BenchC
MPI mode
• All communication and timing done using MPI calls
UPC mode
• Problem I/O, setup and initialization done using MPI
• Core communication (gathers, scatters, collective
operations), timing done using UPC
Communication method (MPI vs. UPC) is selected at
compile-time
UPC was incrementally added to the code, one routine at a
time.
Data Layout
Very lightweight modification to the code – little disruption
Create shared buffers for boundary exchanges
• shared double *shared buffSH[THREADS]
• buffSH[MYTHREAD] = (shared double *shared)upc_alloc(nec*sizeof(Element))
• buff = (double *) buffSH[MYTHREAD]
[Figure: buffSH[0], buffSH[1], …, buffSH[THREADS-1] are shared pointers, each referring to a boundary buffer (N1, N2, …, Nn) allocated in its owner's local shared space; each thread's private buff aliases its own buffer.]
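Putting the bullets above together, a minimal sketch of the buffer set-up (Element and nec are illustrative stand-ins for the BenchC types and sizes; the pointee is written here with an indefinite block size, shared [], so that buffSH[ip][k] indexes within thread ip's buffer):

#include <upc.h>

typedef struct { double v[4]; } Element;     /* stand-in for BenchC's Element */

shared [] double *shared buffSH[THREADS];    /* directory: one entry per thread */

int main() {
  int nec = 1000;      /* illustrative number of boundary entries */
  double *buff;        /* fast private view of this thread's buffer */

  buffSH[MYTHREAD] = (shared [] double *) upc_alloc(nec * sizeof(Element));
  buff = (double *) buffSH[MYTHREAD];        /* legal cast: the block is local */
  upc_barrier;         /* all buffers exist before anyone reads remotely */

  /* any thread may now read buffSH[ip][...] directly for a remote thread ip */
  return 0;
}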
BenchC UPC Communication Patterns
Gathers/Scatters
…initial ordering of data…
ip = ep[i];
for (j = 0; j < num; j++)
  bg[iloc1 + j] = buffSH[ip][iloc2 + j];

Broadcasts/Reductions

global_normSH[MYTHREAD] = loc_val;
if (MYTHREAD == 0){
  for (ip = 1; ip < THREADS; ip++) loc_val += global_normSH[ip];
  for (ip = 0; ip < THREADS; ip++) global_normSH[ip] = loc_val;
}
On Cray X1/E all communication algorithms multistream and vectorize
UPC performance is very good
• PGAS concepts supported at the hardware level
BenchC on the X1/E
Speedup for UPC and MPI BenchC on Cray X1E
[Chart: GFlops versus number of MSPs (up to 128) for X1E UPC, X1E MPI, and perfect scaling.]
UPC performance exceeds that of MPI, by an increasing margin with
processor count – validates the UPC approach
XFlow
Complex CFD applications have moving components and/or changing domain
shapes
• Rotational geometries and/or flapping wings
• Most fluid-structure interaction applications
• Engines, turbines, pumps, etc.
• Fluid-particle flows and free-surface flow
Many methods have been developed to solve these types of applications
• None are ideal and all have limitations
MPI approach to “Dynamic-Mesh CFD” was unsuccessful in the past due to
algorithm complexity (~1997 time frame)
Ultimate goal of “Dynamic-Mesh CFD”
• Traditionally, a mesh is created and then the computed solution is
“dependent” on (i.e. only as good as) that initial mesh
• Ultimate goal of this project is to reverse this paradigm
Mesh should be “dependent” on the solution by continuously changing
XFlow
Fully integrate automatic mesh generation within the parallel flow solver
• Mesh generation never stops and runs in conjunction with the flow solver
Element connectivity changes as required to maintain a
“Delaunay” mesh
New nodes added as required to match user-specified refinement
values
Existing nodes deleted when not needed
• Mesh continuously changes due to changes in geometry and/or the
solution
Mesh size can grow or shrink at each time step
Very complicated method
• Parallelism (UPC), vectorization, dynamic data structures, solvers
(mesh moving and fluid flow), general CFD accuracy, scalability, CAD
links, etc.
• Will take time to fully evaluate
XFlow
Mesh is distributed amongst all processors (fairly uniform loading)
Developed a fast Parallel Recursive Center Bi-section mesh partitioner
Each processor maintains and controls its own piece of the mesh
• Each processor has a list of nodes, faces, and elements
• Each list consists of an array of C structures (Node, Face, or Element arrays)
• These arrays are defined “shared”
Adds a “processor-dimension” to each array
elementsSH[proc][local-index].n[0-through-4]
elementsSH[proc][local-index].np[0-through-4]
elements[local-index].c[X]
elements[local-index].det
Can read-from, or write-to, other processors’ “entities” whenever required
Only 1 processor is allowed to actually change the mesh at one time
Lots of barriers (upc_barrier) and synchronization throughout
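A minimal sketch of what such a declaration might look like (the struct fields, NFC and nec are illustrative guesses, not the real XFlow definitions; an indefinite block size keeps each list entirely on its owning thread):

#include <upc.h>

#define NFC 4                              /* faces per element (illustrative) */

typedef struct {
  int    n[NFC+1], np[NFC+1];              /* node indices and owning threads */
  int    f[NFC],  fp[NFC];                 /* face indices and owning threads */
  double c[3], det, qu, hn;                /* illustrative geometric data     */
} Element;

/* the "processor dimension": elementsSH[proc] is thread proc's element list */
shared [] Element *shared elementsSH[THREADS];

int main() {
  int nec = 5000;                          /* illustrative local element count */

  elementsSH[MYTHREAD] = (shared [] Element *) upc_alloc(nec * sizeof(Element));
  upc_barrier;

  /* any thread can now read or modify another thread's entities, e.g.
     elementsSH[proc][local_index].n[0], with upc_barrier used for ordering */
  return 0;
}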
XFlow
What this method needs…
• Each processor able to reference any arbitrary elements, faces, or nodes
across entire mesh
• Each processor able to modify any other processor’s portion of the mesh
• Each processor able to search anywhere in the mesh
• For performance, minimize these off-processor references by using smart
mesh partitioning techniques
Why MPI is not a good fit for this method
• Can’t arbitrarily read-from or write-to other processors’ data
• Searches “stop” at processor/partition boundaries
Partition boundaries are “hard” (enforced) boundaries
• A processor can’t change and alter another processor’s mesh structure
Why is UPC good
• You can do these kinds of things
Need to carefully use memory/process barriers
Implementation using UPC (continued)
At each time step…
1. Test if re-partitioning is required
2. Set up inter-processor communication
   • Required at each time step, if the mesh has changed
3. Block (color) elements into vectorizable groups
   • Due to the node-scatter operation in the Finite Element routines
4. Calculate the refinement value at each mesh node by solving the Laplace equation with known values on the boundaries
   • May be replaced by relating mesh refinement values to an error measure
5. Move the mesh by solving a modified form of the Linear Elasticity equations
6. Solve the coupled fluid-flow equation system
7. Apply the “Dynamic-Mesh Update” routines
   1. Swap element faces in order to obtain a “Delaunay” mesh
   2. Add new nodes (based on element measures) at locations where there are not enough
   3. Delete existing nodes from locations where there are too many
   4. Swap element faces to improve mesh quality
Implementation using UPC (mesh partitioning)
• Recursive Center Bisection mesh partitioner
• Partitioning and (especially) re-distribution of the mesh is easy with UPC
  – Complicated, and lots of code required, to do this with MPI
• Vectorizes fairly well, so re-partitioning can be carried out quickly
Implementation using UPC (algorithms)
• “1D” (walking) search algorithms are very important
• May also span processor boundaries
– Easy to implement using UPC
Implementation using UPC (example)
/* ELEMENT BACKUP */
elemA = elementsSH[epA][eA];
elemB = elementsSH[epB][eB];

/* PRELIMINARIES */
ifc  = elementsSH[epA][eA].f [iA];
ifcP = elementsSH[epA][eA].fp[iA];
if (facesSH[ifcP][ifc].fix) return NOT_OK;
nA  = elementsSH[epA][eA].n [iA];
nB  = elementsSH[epB][eB].n [iB];
npA = elementsSH[epA][eA].np[iA];
npB = elementsSH[epB][eB].np[iB];
for (i = 0; i < NFC; i++){
  fB[i]  = elementsSH[epB][eB].f [i];
  fpB[i] = elementsSH[epB][eB].fp[i];
}
if (fqc == AR){
  q1 = elementsSH[epA][eA].qu;
  if (elementsSH[epB][eB].qu > q1)
    q1 = elementsSH[epB][eB].qu;
}
nmin = elementsSH[epA][eA].n [iminA];
pmin = elementsSH[epA][eA].np[iminA];
nmid = elementsSH[epA][eA].n [imidA];
pmid = elementsSH[epA][eA].np[imidA];
nmax = elementsSH[epA][eA].n [imaxA];
pmax = elementsSH[epA][eA].np[imaxA];

/* CONSTRUCT NEW ELEMENTS */
neplus = 0;
ne1 = eA;
ne2 = eB;
ne3 = new_elem(&neplus);
nep1 = epA;
nep2 = epB;
nep3 = MYTHREAD;
hn[0] = elementsSH[nep2][ne2].hn;
hn[1] = elementsSH[nep3][ne3].hn;
elementsSH[nep2][ne2] = elementsSH[epA][eA];
elementsSH[nep3][ne3] = elementsSH[epA][eA];
elementsSH[nep2][ne2].hn = hn[0];
elementsSH[nep3][ne3].hn = hn[1];
elementsSH[nep1][ne1].n [iminA] = nB;
elementsSH[nep1][ne1].np[iminA] = npB;
elementsSH[nep2][ne2].n [imidA] = nB;
elementsSH[nep2][ne2].np[imidA] = npB;
elementsSH[nep3][ne3].n [imaxA] = nB;
elementsSH[nep3][ne3].np[imaxA] = npB;

/* CHECK NEW ELEMENT STATUS */
ier1 = elem_state(ne1, nep1);
ier2 = elem_state(ne2, nep2);
ier3 = elem_state(ne3, nep3);
if (ier1 != OK || ier2 != OK || ier3 != OK)
  return NOT_OK;
Application - Micro-UAVs
Important technology for the DoD
Lots of areas to look into
• Optimization techniques for shape and motion
• Optimization techniques for control algorithms
• Fluid-structure interactions
Clips from “Artificial Muscles EAP”
Yoseph Bar-Cohen, Ph.D.
Jet Propulsion Laboratory
http://ndeaa.jpl.nasa.gov/nasa-nde/lommas/eap/EAP-web.htm
Micro-UAVs (results)
Reynolds number ~ 400
[Figures: velocity vectors at a cross-section; mesh at a cross-section.]
Cray UPC/CAF support
Supported from T3E days until the present!
X1/X2
• UPC/CAF supported natively in Cray compiler and at the hardware level –
remote address translation done in hardware
• Cray compilers have supported UPC and CAF for many years – lots of code
exposure and aggressive optimization of PGAS codes
• Performance tools, CrayPat, allow for collecting data on UPC (and CAF)
code
• Totalview supports debugging of UPC code
• HECToR X2 addition – happening now!
XT
• UPC support through Berkeley UPC – distributed in module format
Future architectures – PGAS support in hardware
References
http://upc.gwu.edu/ - Unified Parallel C at George Washington University
http://upc.lbl.gov/ - Berkeley Unified Parallel C Project
http://docs.cray.com/ - Cray C and C++ Reference Manual