Lecture 7, Chapter 6: Synchronous Computations
http://bioinfo.uncc.edu/szhang

Synchronous Computations
In a fully synchronous application, all processes synchronize at certain points.

Barrier
A basic mechanism for synchronizing processes, inserted at the point in each process where it must wait. For example, an MPI_Barrier() call is inserted where each process must wait; no process can continue past that point until every process has reached it.

    MPI_Barrier(MPI_COMM_WORLD);

Processes reaching the barrier at different times (figure).
Calling the barrier through library routines (figure).

MPI_Barrier()
The barrier's only parameter is a named communicator. Every process in the communicator must call MPI_Barrier(); the call blocks each process until all processes in the group have called it, and only then does it return.

Barrier implementations
Centralized counter implementation (a linear barrier) (figure).

A good barrier implementation must take into account that a barrier may be used more than once by a process: a process may enter the barrier for a second time before earlier processes have left it for the first time. Counter-based barriers therefore usually have two phases:
- Arrival phase: a process entering the arrival phase cannot leave it until all processes have entered it.
- Departure phase: processes then move to the departure phase and are released.
The two-phase design handles this re-entrant scenario.

Example code: a master process implements the barrier by counting arrivals with a for loop.

Master:
    for (i = 0; i < n; i++)    /* count slaves as they reach the barrier */
        recv(Pany);
    for (i = 0; i < n; i++)    /* release the slaves */
        send(Pi);

Slave processes:
    send(Pmaster);
    recv(Pmaster);             /* blocks the process until released */

Barrier implementation in a message-passing system (figure).

Tree implementation of a barrier
More efficient: O(log p) steps. Suppose there are 8 processes, P0, P1, P2, P3, P4, P5, P6, P7:
1st stage: P1 sends a message to P0 (when P1 reaches its barrier);
           P3 sends a message to P2 (when P3 reaches its barrier);
           P5 sends a message to P4 (when P5 reaches its barrier);
           P7 sends a message to P6 (when P7 reaches its barrier).
2nd stage: P2 sends a message to P0 (once P2 and P3 have reached their barriers);
           P6 sends a message to P4 (once P6 and P7 have reached their barriers).
3rd stage: P4 sends a message to P0 (once P4, P5, P6 and P7 have reached their barriers);
           P0 terminates the arrival phase (when P0 reaches its barrier and has received the message from P4).
Release is done with a reverse tree construction.

Tree barrier (figure).
Butterfly barrier (figure).

Local synchronization
Suppose process Pi needs to synchronize and exchange data with processes Pi-1 and Pi+1 before continuing. This is not a perfect three-process barrier, because process Pi-1 only synchronizes with Pi and continues as soon as Pi allows; similarly, process Pi+1 only synchronizes with Pi.

Deadlock
Deadlock can occur when a pair of processes each send to and receive from the other. In particular, if both processes call synchronous routines (or blocking routines without sufficient buffering) and both send first, each waits for the other to receive and they deadlock. One solution: arrange for one process to receive first and then send, while the other sends first and then receives.

A combined blocking routine that avoids deadlock: sendrecv(). The local synchronization example above can be rewritten with it. MPI provides MPI_Sendrecv() and MPI_Sendrecv_replace(). MPI_Sendrecv() actually has 12 parameters!

Synchronized Computations
Can be classified as:
- fully synchronous, or
- locally synchronous.
In fully synchronous computations, all processes involved in the computation must synchronize. In locally synchronous computations, processes only need to synchronize with a set of logically nearby processes, not with all processes involved in the computation.

Example of fully synchronous computation: data parallel computations
The same operation is performed on different data elements simultaneously, i.e., in parallel. Particularly convenient because:
- it is easy to program (essentially only one program is needed);
- the problem size can be scaled up easily;
- many numeric and some non-numeric problems can be cast in a data parallel form.

Example
Add the same constant k to every element of an array:

    for (i = 0; i < n; i++)
        a[i] = a[i] + k;

The statement a[i] = a[i] + k can be executed by multiple processors simultaneously, each processor using a different index i (0 <= i < n).
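As a concrete illustration of this data-parallel example on a message-passing machine, the following is a minimal sketch (not from the lecture): the array is distributed across the processes with MPI_Scatter, each process adds k to its own block, and the result is gathered back with MPI_Gather. The array size N, the constant K, and all variable names are illustrative choices, and the sketch assumes N is divisible by the number of processes.

    /* Sketch: data-parallel "add k to every element" with MPI.
     * Assumes N is divisible by the number of processes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define N 16        /* illustrative array size */
    #define K 3         /* constant added to every element */

    int main(int argc, char *argv[])
    {
        int rank, size, i;
        int a[N];                       /* full array, meaningful on the root only */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local_n = N / size;         /* block of consecutive elements per process */
        int *local = malloc(local_n * sizeof(int));

        if (rank == 0)                  /* the root initializes the data */
            for (i = 0; i < N; i++) a[i] = i;

        /* distribute one block to each process */
        MPI_Scatter(a, local_n, MPI_INT, local, local_n, MPI_INT, 0, MPI_COMM_WORLD);

        for (i = 0; i < local_n; i++)   /* the data-parallel operation a[i] = a[i] + k */
            local[i] = local[i] + K;

        /* collect the updated blocks back on the root */
        MPI_Gather(local, local_n, MPI_INT, a, local_n, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (i = 0; i < N; i++) printf("%d ", a[i]);
            printf("\n");
        }

        free(local);
        MPI_Finalize();
        return 0;
    }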
Data parallel computation (figure).

The SIMD forall construct
Some parallel programming languages (e.g., Fortran dialects, C#, Oracle PL/SQL) provide a specific "parallel" construct, the forall compound statement, to specify data parallel operations.

Example:

    forall (i = 0; i < n; i++) {
        body
    }

states that the n instances of the loop body can be executed simultaneously. The loop variable i takes a different value in each instance of the body: the first instance has i = 0, the next i = 1, and so on.

To add k to each element of an array a, we can write

    forall (i = 0; i < n; i++)
        a[i] = a[i] + k;

forall is not available in message-passing library routines, but the data parallel technique can still be applied to multiprocessors and multicomputers. Example: to add k to the elements of an array,

    i = myrank;        /* process rank */
    a[i] = a[i] + k;   /* body */
    barrier(mygroup);

where myrank is a process rank between 0 and n - 1.

Data Parallel Example: Prefix Sum Problem
Given a list of numbers x0, ..., xn-1, compute all the partial sums, i.e.:
x0 + x1; x0 + x1 + x2; x0 + x1 + x2 + x3; ...
The problem can also be defined with associative operations other than addition. It has been widely studied and has practical applications in areas such as processor allocation, data compaction, sorting, and polynomial evaluation.

Data parallel method for the partial sums of 16 numbers (figure).
Data parallel prefix sum operation (figure).

    // Calculate_PrefixSum: the nodes cooperate to compute the prefix sums.
    // myid:  node number (its binary representation identifies the hypercube node).
    // mynum: the data this node contributes, i.e. this node's/process's original value.
    // n:     dimension of the hypercube.
    int Calculate_PrefixSum( int myid, int mynum, int n )
    {
        int result = mynum;   // this node's prefix sum is accumulated in result
        int msg = result;     // running subcube total exchanged with partners
        int exp2 = 1;
        int i, num, partner;
        MPI_Status status;
        for( i = 0; i < n; i++ ) {
            partner = myid ^ exp2;    // partner differs from myid in bit i
            // Note: both partners send first, so this exchange relies on MPI
            // buffering the small message; MPI_Sendrecv() would be the safe form.
            MPI_Send( &msg, 1, MPI_INT, partner, 0, MPI_COMM_WORLD );
            MPI_Recv( &num, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status );
            msg += num;                           // total over the subcube seen so far
            if ( partner < myid ) result += num;  // only lower-numbered partners belong to the prefix
            MPI_Barrier( MPI_COMM_WORLD );
            exp2 = exp2 << 1;
        }
        printf( "myid:%3d,\tresult:%3d\n", myid, result );  // print this node's final result
        return result;
    }

Synchronous Iteration (Synchronous Parallelism)
Each iteration is composed of several processes that start together at the beginning of the iteration; the next iteration cannot begin until all processes have finished the previous one.

Using the forall construct:

    for (j = 0; j < n; j++)          /* for each synchronous iteration */
        forall (i = 0; i < N; i++) { /* N processes, each using */
            body(i);                 /* a specific value of i */
        }

Using a message-passing barrier:

    for (j = 0; j < n; j++) {   /* for each synchronous iteration */
        i = myrank;             /* find the value of i to be used */
        body(i);
        barrier(mygroup);
    }

Another example of a fully synchronous iterative program: solving a system of linear equations with Jacobi iteration
Solving a general system of linear equations by iteration. Suppose there are n equations and n unknowns x0, x1, x2, ..., xn-1, and that the i-th equation (0 <= i < n) has the general form

    ai,0 x0 + ai,1 x1 + ... + ai,n-1 xn-1 = bi

Rewrite the i-th equation by solving it for xi:

    xi = (1 / ai,i) [ bi - (ai,0 x0 + ... + ai,i-1 xi-1 + ai,i+1 xi+1 + ... + ai,n-1 xn-1) ]

i.e.

    xi = ( bi - Σ(j != i) ai,j xj ) / ai,i

This equation gives xi in terms of the other unknowns and can be used as an iteration formula for computing successive approximations to each unknown.

Jacobi Iteration
All values of x are updated together. It can be shown that the Jacobi method converges if, in every row, the absolute value of the diagonal element is greater than the sum of the absolute values of the other elements in that row (the matrix is diagonally dominant), i.e.

    |ai,i| > Σ(j != i) |ai,j|   for all i.

This condition is sufficient but not necessary.

Termination
A simple, common approach: compare the values computed in one iteration to the values obtained from the previous iteration, and terminate the computation when all values are within a given tolerance, i.e. when

    |xi(t) - xi(t-1)| < tolerance   for all i,

where xi(t) is the value of xi after the t-th iteration. However, this does not guarantee the solution to that accuracy.

Convergence Rate (figure).

Parallel Code
Assign one unknown to each process and let every process perform the same number of iterations. Process Pi can be written as:

    x[i] = b[i];                            /* initialize unknown */
    for (iteration = 0; iteration < limit; iteration++) {
        sum = -a[i][i] * x[i];
        for (j = 0; j < n; j++)             /* compute summation */
            sum = sum + a[i][j] * x[j];
        new_x[i] = (b[i] - sum) / a[i][i];  /* compute unknown */
        allgather(&new_x[i]);               /* broadcast/receive values */
        global_barrier();                   /* wait for all processes */
    }

allgather() sends the newly computed value of x[i] from process i to every other process and collects the values broadcast from the other processes. This introduces a new message-passing operation, allgather (an MPI sketch of it follows below).
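In MPI the allgather step in the pseudocode above corresponds to MPI_Allgather(). The following is a minimal runnable sketch (not the lecture's code): one unknown per process, each process computes its own new value, and MPI_Allgather distributes the new values so that every process obtains the complete updated vector for the next iteration. The small diagonally dominant test system, the fixed iteration count LIMIT, and all variable names are illustrative choices; the sketch assumes the number of processes is at most MAX_N.

    /* Sketch: Jacobi iteration with one unknown per process, using
     * MPI_Allgather to exchange the updated values each iteration. */
    #include <stdio.h>
    #include <mpi.h>

    #define MAX_N 64      /* upper bound on the number of processes handled here */
    #define LIMIT 100     /* fixed number of iterations for the sketch */

    int main(int argc, char *argv[])
    {
        double a[MAX_N][MAX_N], b[MAX_N], x[MAX_N], new_xi, sum;
        int i, j, n, iteration;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &i);   /* process i owns unknown x[i] */
        MPI_Comm_size(MPI_COMM_WORLD, &n);   /* one unknown per process */

        for (j = 0; j < n; j++) {            /* small diagonally dominant test system */
            a[i][j] = 1.0;                   /* only row i is used by this process */
            x[j] = 0.0;
        }
        a[i][i] = 2.0 * n;                   /* |a_ii| > sum of off-diagonal entries */
        b[i] = 1.0 + i;

        for (iteration = 0; iteration < LIMIT; iteration++) {
            sum = -a[i][i] * x[i];
            for (j = 0; j < n; j++)          /* off-diagonal summation */
                sum += a[i][j] * x[j];
            new_xi = (b[i] - sum) / a[i][i]; /* new value of this process's unknown */

            /* each process contributes one value and receives the full updated
               vector x in rank order; no separate barrier is needed, since the
               allgather cannot complete before every process has contributed */
            MPI_Allgather(&new_xi, 1, MPI_DOUBLE, x, 1, MPI_DOUBLE, MPI_COMM_WORLD);
        }

        if (i == 0)
            for (j = 0; j < n; j++) printf("x[%d] = %f\n", j, x[j]);

        MPI_Finalize();
        return 0;
    }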
Allgather
Broadcast and gather values in one composite construction:
MPI_Allgather();  MPI_Allreduce();

    n = pSize;                   /* one unknown per process */
    if (!pID) input();           /* the root process reads the input */
    MPI_Bcast(a, MAX_A, MPI_FLOAT, 0, MPI_COMM_WORLD);   /* broadcast the coefficients */
    MPI_Bcast(b, MAX_N, MPI_FLOAT, 0, MPI_COMM_WORLD);   /* broadcast b */
    for (i = 0; i < n; i++) {    /* initialize the x values */
        x[i] = b[i];
        new_x[i] = 0;
        result_x[i] = b[i];      /* so the first iteration starts from b */
    }
    i = pID;                     /* this process handles the unknown whose index is its ID */
    do {                         /* iterate to compute x[i] */
        iteration++;             /* increase the iteration count */
        /* use the previous iteration's result as this iteration's starting value,
           and reset the new results to 0 */
        for (j = 0; j < n; j++) {
            x[j] = result_x[j];
            new_x[j] = 0;
            result_x[j] = 0;
        }
        MPI_Barrier(MPI_COMM_WORLD);          /* synchronize */
        sum = -a[i][i] * x[i];
        for (j = 0; j < n; j++)
            sum += a[i][j] * x[j];            /* summation */
        new_x[i] = (b[i] - sum) / a[i][i];    /* new x[i] */
        /* combine all results with a reduction: each process contributes only
           its own nonzero entry, so the sum reassembles the full vector */
        MPI_Allreduce(new_x, result_x, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        if (iteration > MAX_ITERATION)
            break;                            /* stop if the iteration limit is exceeded */
    } while (!tolerance());                   /* keep iterating while the tolerance is not met */

Partitioning
Usually the number of processors is much smaller than the number of data items to be processed, so the problem is partitioned so that each processor takes on more than one data item.

Block allocation: allocate groups of consecutive unknowns to processors in increasing order.
Cyclic allocation: processors are allocated one unknown at a time in order; i.e., processor P0 is allocated x0, xp, x2p, ..., x((n/p)-1)p, processor P1 is allocated x1, xp+1, x2p+1, ..., x((n/p)-1)p+1, and so on.
Cyclic allocation has no particular advantage here (indeed, it may be disadvantageous because the indices of the unknowns have to be computed in a more complex way).

Effects of computation and communication in Jacobi iteration
The consequences of using different numbers of processors are worked out in the textbook, which derives the resulting computation and communication time expressions.

MPI Safe Message Passing Routines
MPI offers several methods for safe communication, including: arranging the order of sends and receives so that two communicating processes do not both send first; the combined MPI_Sendrecv() routine; buffered-mode sends (MPI_Bsend()); and nonblocking routines (MPI_Isend()/MPI_Irecv() completed with MPI_Wait() or MPI_Waitall()). A sketch of the nonblocking approach is given below.

To be continued…
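As a closing illustration of the nonblocking approach listed above, the following is a minimal sketch (process count, payload, and variable names are illustrative, not from the lecture): two processes exchange a value without risking the send-first deadlock described earlier, because MPI_Isend()/MPI_Irecv() return immediately and the actual waiting is deferred to MPI_Waitall().

    /* Sketch: deadlock-free exchange between two processes using nonblocking
     * sends and receives. Both processes post the send and the receive first
     * and only then wait, so neither blocks the other. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, sendval, recvval, partner;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2 && rank < 2) {          /* only ranks 0 and 1 take part */
            partner = 1 - rank;
            sendval = rank * 100;             /* illustrative payload */

            MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* complete both operations */

            printf("process %d received %d from process %d\n", rank, recvval, partner);
        }

        MPI_Finalize();
        return 0;
    }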