Lecture 7, Chapter 6
Synchronous Computations
http://bioinfo.uncc.edu/szhang
1
Synchronous Computations
In a fully synchronous application, all processes synchronize at
certain points.
Barrier
A basic mechanism for synchronizing processes: an MPI_Barrier() is
inserted at the point in each process where it must wait. All processes
can continue from that point only when every process has reached it.
MPI_Barrier(MPI_COMM_WORLD);
2
Processes reaching the barrier at different times
3
Calling the barrier through library routines
4
MPI_Barrier()
The barrier's only parameter is a named communicator.
Every process in the communicator must call MPI_Barrier(); the call blocks
the process until all processes in the group have called it, and only then
does it return.
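A minimal sketch of how MPI_Barrier() is typically used; the sleep() call and the printed messages are illustrative assumptions added so that processes reach the barrier at different times, not code from the slides.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sleep(rank);                      /* processes reach the barrier at different times */
    printf("process %d reached the barrier\n", rank);

    MPI_Barrier(MPI_COMM_WORLD);      /* no process passes this point until all have called it */

    printf("process %d released\n", rank);
    MPI_Finalize();
    return 0;
}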
5
Barrier implementations
Centralized counter implementation (a linear barrier):
6
A good barrier implementation must take into account that a barrier may be
used more than once by a process.
A process may enter the barrier for a second time before previous processes
have left it for the first time.
7
Counter-based barriers are usually organized in two phases:
• Arrival phase: a process that enters the arrival phase cannot leave it
  until all processes have entered the arrival phase;
• Departure phase: processes then move into the departure phase and are
  released.
The two-phase design handles the reentrant scenario.
8
Example code:
Master:
/* the master implements the barrier by counting with for loops */
for (i = 0; i < n; i++)    /* count slaves as they reach the barrier */
    recv(Pany);
for (i = 0; i < n; i++)    /* release slaves */
    send(Pi);
Slave processes:
send(Pmaster);
recv(Pmaster);             /* blocks the process until released */
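A hedged MPI sketch of this centralized (linear) barrier; counter_barrier(), the choice of rank 0 as the master, the tags, and the dummy payload are illustrative assumptions rather than code from the slides.

#include <mpi.h>

/* linear barrier: rank 0 plays the master, every other rank is a slave */
static void counter_barrier(MPI_Comm comm)
{
    int rank, p, i, dummy = 0;
    MPI_Status status;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    if (rank == 0) {
        for (i = 1; i < p; i++)                              /* arrival phase: count the slaves */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &status);
        for (i = 1; i < p; i++)                              /* departure phase: release them */
            MPI_Send(&dummy, 1, MPI_INT, i, 1, comm);
    } else {
        MPI_Send(&dummy, 1, MPI_INT, 0, 0, comm);            /* report arrival to the master */
        MPI_Recv(&dummy, 1, MPI_INT, 0, 1, comm, &status);   /* block until released */
    }
}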
9
Barrier implementation in a message-passing system
10
Tree implementation of a barrier
More efficient. O(log p) steps
Suppose 8 processes, P0, P1, P2, P3, P4, P5, P6, P7:
1st stage: P1 sends message to P0; (when P1 reaches its barrier)
P3 sends message to P2; (when P3 reaches its barrier)
P5 sends message to P4; (when P5 reaches its barrier)
P7 sends message to P6; (when P7 reaches its barrier)
2nd stage: P2 sends message to P0; (P2 & P3 reached their barrier)
P6 sends message to P4; (P6 & P7 reached their barrier)
3rd stage: P4 sends message to P0; (P4, P5, P6, & P7 reached barrier)
P0 terminates arrival phase;
(when P0 reaches barrier & received message from P4)
Release with a reverse tree construction.
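A hedged C/MPI sketch of this tree barrier for p equal to a power of two; tree_barrier() is an illustrative helper (MPI_Barrier() already provides this service in practice), and the tags and dummy payload are assumptions.

#include <mpi.h>

static void tree_barrier(MPI_Comm comm)
{
    int rank, p, d, dummy = 0;
    MPI_Status status;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);                       /* assume p is a power of two */

    /* arrival phase: signals fold towards process 0 */
    for (d = 1; d < p; d <<= 1) {
        if (rank % (2 * d) == d) {                 /* report arrival, then wait for release */
            MPI_Send(&dummy, 1, MPI_INT, rank - d, 0, comm);
            break;
        } else if (rank % (2 * d) == 0) {          /* collect the partner's arrival signal */
            MPI_Recv(&dummy, 1, MPI_INT, rank + d, 0, comm, &status);
        }
    }

    /* departure phase: releases fan out from process 0 along the reverse tree */
    for (d = p / 2; d > 0; d >>= 1) {
        if (rank % (2 * d) == 0)
            MPI_Send(&dummy, 1, MPI_INT, rank + d, 1, comm);
        else if (rank % (2 * d) == d)
            MPI_Recv(&dummy, 1, MPI_INT, rank - d, 1, comm, &status);
    }
}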
11
Tree barrier
12
Butterfly Barrier
13
Local Synchronization
Suppose process Pi needs to synchronize and exchange data with
process Pi-1 and process Pi+1 before continuing:
Not a perfect three-process barrier because process Pi-1
will only synchronize with Pi and continue as soon as Pi
allows. Similarly, process Pi+1 only synchronizes with Pi.
14
Deadlock
Deadlock can occur when each of a pair of processes sends to and
receives from the other.
In particular, if both processes call synchronous routines (or blocking
routines without sufficient buffering) and both send first, deadlock
results: each waits for the other to receive.
One solution: arrange for one process to receive first and then send,
and the other to send first and then receive.
15
Combined blocking routines that avoid deadlock
sendrecv()
The local synchronization example above becomes:
MPI provides MPI_Sendrecv() and
MPI_Sendrecv_replace().
MPI_Sendrecv() actually has 12 parameters!
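A hedged sketch of the local synchronization pattern written with MPI_Sendrecv(); the neighbour ranks, the tags, and the single-integer payload are illustrative assumptions.

int rank, p, left, right, my_data, from_left, from_right;
MPI_Status status;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
left  = (rank == 0)     ? MPI_PROC_NULL : rank - 1;   /* boundary processes have no neighbour */
right = (rank == p - 1) ? MPI_PROC_NULL : rank + 1;
my_data = rank;                                       /* illustrative payload */

/* exchange with both neighbours; each call sends and receives, so neither side can deadlock */
MPI_Sendrecv(&my_data,   1, MPI_INT, right, 0,
             &from_left, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &status);
MPI_Sendrecv(&my_data,    1, MPI_INT, left,  1,
             &from_right, 1, MPI_INT, right, 1, MPI_COMM_WORLD, &status);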
16
Synchronized Computations
Can be classified as:
• Fully synchronous
or
• Locally synchronous
In a fully synchronous computation, all processes involved in the
computation must be synchronized.
In a locally synchronous computation, processes only need to
synchronize with a set of logically nearby processes, not
all processes involved in the computation.
17
Examples of fully synchronous computation
Data Parallel Computations
Same operation performed on different data elements
simultaneously; i.e., in parallel.
Particularly convenient because:
• Easy to program (essentially only a single program is needed).
• The problem size can be scaled up easily.
• Many numeric and some non-numeric problems can be
cast in a data parallel form.
18
Example
Add the same constant k to every element of an array:
for (i = 0; i < n; i++)
a[i] = a[i] + k;
where the statement:
a[i] = a[i] + k;
can be executed simultaneously by multiple processors, each using a
different index i (0 <= i < n).
19
Data parallel computation
20
The SIMD forall construct
Some parallel programming languages (Fortran, C#, Oracle, etc.) provide a
specific "parallel" construct, the forall compound statement, to specify
data parallel operations.
Example
forall (i = 0; i < n; i++) {
body
}
states that the n instances of the loop body can be executed simultaneously.
The loop variable i takes a different value in each instance of the body:
0 in the first instance, 1 in the next, and so on.
21
To add k to each element of an array, a, we can write
forall (i = 0; i < n; i++)
a[i] = a[i] + k;
22
forall is not available in message-passing library routines
Data parallel technique applied to multiprocessors and
multicomputers
Example
To add k to the elements of an array:
i = myrank;           /* process rank */
a[i] = a[i] + k;      /* body */
barrier(mygroup);
where myrank is a process rank between 0 and n - 1.
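A hedged MPI fragment of the same idea, with one array element per process; the value of k, the array size, and keeping a full copy of a[] on every process are illustrative assumptions.

int rank, p, k = 3;
int a[64];                            /* every process holds a copy of a[]; size is illustrative */

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &p);

a[rank] = a[rank] + k;                /* each process updates its own element (the body) */
MPI_Barrier(MPI_COMM_WORLD);          /* wait until every process has updated its element */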
23
Data Parallel Example: Prefix Sum Problem
Given a list of numbers, x0, …, xn-1, compute all the partial sums, i.e.:
x0 + x1;
x0 + x1 + x2;
x0 + x1 + x2 + x3; … .
Can also be defined with associative operations other
than addition.
Widely studied. Practical applications in areas such as
processor allocation, data compaction, sorting, and
polynomial evaluation.
24
A data parallel method for computing the partial sums of 16 numbers
25
Data parallel prefix sum operation
26
27
// Calculate_PrefixSum: computes the prefix sum cooperatively across all nodes.
// myid:  node number, corresponding to its binary representation.
// mynum: this node's (process's) original data value.
// n:     dimension of the hypercube.
int Calculate_PrefixSum( int myid, int mynum, int n )
{
    int result = mynum;   // the prefix sum computed by this node is accumulated in result
    int msg = result;
    int exp2 = 1;
    int i, num, partner;
    MPI_Status status;
    for( i = 0; i < n; i++ ) {
        partner = myid ^ exp2;
        MPI_Send( &msg, 1, MPI_INT, partner, 0, MPI_COMM_WORLD );
        MPI_Recv( &num, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status );
        msg += num;
        if ( partner < myid ) result += num;
        MPI_Barrier( MPI_COMM_WORLD );
        exp2 = exp2 << 1;
    }
    printf( "myid:%3d,\tresult:%3d,\n", myid, result );   // print this node's final result
    return result;
}
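A hedged sketch of how Calculate_PrefixSum() might be driven; using the rank (plus one) as each node's data value is an illustrative assumption.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, p, dim = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    while ((1 << dim) < p) dim++;              /* hypercube dimension; assume p is a power of two */

    Calculate_PrefixSum(myid, myid + 1, dim);  /* e.g. node i contributes the value i + 1 */

    MPI_Finalize();
    return 0;
}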
28
29
Synchronous Iteration
(Synchronous Parallelism)
Each iteration is made up of a number of processes that all start together
at the beginning of the iteration; the next iteration cannot begin until
all processes have finished the previous iteration.
Using forall construct:
for (j = 0; j < n; j++)              /* for each synch. iteration */
    forall (i = 0; i < N; i++) {     /* N processes each using    */
        body(i);                     /* a specific value of i     */
    }
30
Using message passing barrier:
for (j = 0; j < n; j++) {            /* for each synchr. iteration */
    i = myrank;                      /* find value of i to be used */
    body(i);
    barrier(mygroup);
}
31
Another example of a fully synchronous iterative program:
solving a system of linear equations by Jacobi iteration
Solving a General System of Linear Equations by Iteration
Suppose the equations are of a general form with n equations and n unknowns:

    ai,0 x0 + ai,1 x1 + ai,2 x2 + … + ai,n-1 xn-1 = bi        (0 <= i < n)

where the unknowns are x0, x1, x2, … xn-1.
32
Rewrite the i-th equation

    ai,0 x0 + ai,1 x1 + … + ai,i xi + … + ai,n-1 xn-1 = bi

as

    xi = (1 / ai,i) [ bi - (ai,0 x0 + … + ai,i-1 xi-1 + ai,i+1 xi+1 + … + ai,n-1 xn-1) ]

i.e.

    xi = (1 / ai,i) [ bi - Σ(j≠i) ai,j xj ]

This equation gives xi in terms of the other unknowns.
It can be used as an iteration formula for computing successive approximations
to each unknown.
33
Jacobi Iteration
All values of x are updated together.
It can be shown that the Jacobi method converges if, in every row, the absolute
value of the diagonal element a is greater than the sum of the absolute values
of the other a's in that row (the matrix is diagonally dominant), i.e.

    |ai,i| > Σ(j≠i) |ai,j|        (0 <= i < n)

This condition is a sufficient but not a necessary condition.
34
Termination
A simple, common approach: compare the values computed in one iteration to
the values obtained from the previous iteration.
Terminate the computation when all values are within a given tolerance,
i.e., when

    |xi(t) - xi(t-1)| < error tolerance        for all i

where xi(t) and xi(t-1) are the results of two successive iterations.
However, this does not guarantee the solution to that accuracy.
35
Convergence Rate
36
Parallel Code
One process is allocated to each unknown, and every process performs the
same number of iterations. Process Pi can be written as follows:
x[i] = b[i];                             /* initialize unknown */
for (iteration = 0; iteration < limit; iteration++) {
    sum = -a[i][i] * x[i];
    for (j = 0; j < n; j++)              /* compute summation */
        sum = sum + a[i][j] * x[j];
    new_x[i] = (b[i] - sum) / a[i][i];   /* compute unknown */
    allgather(&new_x[i]);                /* bcast/recv values */
    global_barrier();                    /* wait for all processes */
}
allgather() sends the newly computed value of x[i] from process i to every
other process and collects the data broadcast from the other processes.
37
Introduce a new message-passing operation - Allgather.
Allgather
Broadcast and gather values in one composite construction.
MPI_Allgather();
MPI_Allreduce();
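A hedged fragment showing how MPI_Allgather() could replace the allgather()/global_barrier() pair above, with one unknown per process; the array bound and float types are illustrative assumptions.

#define MAX_N 1024                    /* illustrative upper bound on n */
float x[MAX_N], new_xi;
int i;                                /* this process's unknown, i = rank */

/* ... compute new_xi for unknown i exactly as in the loop body above ... */

/* each process contributes one value and receives the full updated vector; the call
   cannot complete on any process until every process has contributed its value,
   so no separate barrier is needed */
MPI_Allgather(&new_xi, 1, MPI_FLOAT, x, 1, MPI_FLOAT, MPI_COMM_WORLD);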
38
n = pSize;                                            // one unknown per process
if(!pID) input();                                     // the root process reads the input
MPI_Bcast(a, MAX_A, MPI_FLOAT, 0, MPI_COMM_WORLD);    // broadcast the coefficients
MPI_Bcast(b, MAX_N, MPI_FLOAT, 0, MPI_COMM_WORLD);    // broadcast b
for(i = 0; i < n; i++) {
    x[i] = b[i];
    new_x[i] = b[i];
    result_x[i] = b[i];   // also initialize result_x so the first iteration starts from b
}                         // initialize the x values
i = pID;                  // this process handles the unknown whose index equals its rank
do {                      // iterate to compute x[i]
    iteration++;          // increment the iteration count
    // use the previous iteration's results as this iteration's starting values,
    // and reset the new results to 0
    for(j = 0; j < n; j++)
    {
        x[j] = result_x[j];
        new_x[j] = 0;
        result_x[j] = 0;
    }
    MPI_Barrier(MPI_COMM_WORLD);                      // synchronize all processes
    sum = - a[i][i] * x[i];
    for(j = 0; j < n; j++) sum += a[i][j] * x[j];     // compute the summation
    new_x[i] = (b[i] - sum) / a[i][i];                // compute the new x[i]
    // use a sum reduction to collect every process's newly computed value
    MPI_Allreduce(new_x, result_x, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    if(iteration > MAX_ITERATION) {
        break;
    }                     // exit if the maximum number of iterations is exceeded
} while(!tolerance());    // keep iterating while the required accuracy has not been reached
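A hedged sketch of the tolerance() test called above, applying the termination criterion from slide 35; the TOLERANCE constant and the assumption that n, x[] and result_x[] are globals visible to the function are illustrative.

#include <math.h>

#define TOLERANCE 0.0001f             /* illustrative error tolerance */

/* returns 1 when every unknown has changed by less than TOLERANCE since the previous iteration */
int tolerance(void)
{
    int j;
    for (j = 0; j < n; j++)
        if (fabsf(result_x[j] - x[j]) >= TOLERANCE)
            return 0;
    return 1;
}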
39
40
Partitioning
Usually the number of processors is much smaller than the number
of data items to be processed. Partition the problem so
that each processor takes on more than one data item.
41
block allocation – allocate groups of consecutive
unknowns to processors in increasing order.
cyclic allocation – processors are allocated one unknown at a time
in order; i.e., processor P0 is allocated x0, xp, x2p, …, x((n/p)-1)p,
processor P1 is allocated x1, xp+1, x2p+1, …, x((n/p)-1)p+1, and so
on.
Cyclic allocation has no particular advantage here (indeed, it
may be disadvantageous because the indices of the
unknowns have to be computed in a more complex way).
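A hedged fragment of the index calculations the two allocations imply, assuming n is divisible by the number of processes p; update_unknown() is a hypothetical placeholder for the per-unknown work.

int chunk = n / p;                    /* unknowns per process */

/* block allocation: this process owns a group of consecutive unknowns */
for (j = rank * chunk; j < (rank + 1) * chunk; j++)
    update_unknown(j);

/* cyclic allocation: this process owns every p-th unknown, starting at its rank */
for (j = rank; j < n; j += p)
    update_unknown(j);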
42
Effects of computation and communication
in Jacobi iteration
The consequences of using different numbers of processors are worked out
in the textbook.
Get:
43
MPI Safe Message Passing Routines
MPI offers several methods for safe communication:
44
To be continued…
45