Foundations of Parallel Programming

• Introduction to Parallel Programming
• Approaches for Parallel Programs
• Parallel Programming Model
• Parallel Programming Paradigm
Parallel Programming is a Complex Task
• Parallel software developers handle issues such as:
  – non-determinism
  – communication
  – synchronization
  – data partitioning and distribution
  – load balancing
  – fault tolerance
  – heterogeneity
  – shared or distributed memory
  – deadlocks
  – race conditions
Users’ Expectations from Parallel Programming Environments
• Parallel computing can only be widely successful if parallel software is able to meet the expectations of users, such as:
  – provide architecture/processor-type transparency
  – provide network/communication transparency
  – be easy to use and reliable
  – provide support for fault tolerance
  – accommodate heterogeneity
  – assure portability
  – provide support for traditional high-level languages
  – be capable of delivering increased performance
  – provide parallelism transparency
2 Methods for Constructing Parallel Programs

Serial code segment:
for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];

(a) Constructing a parallel program with library routines
id = my_process_id();
p = number_of_processes();
for (i = id; i < N; i = i + p) A[i] = b[i] * b[i+1];
barrier();
for (i = id; i < N; i = i + p) c[i] = A[i] + A[i+1];
Here my_process_id(), number_of_processes(), and barrier() are library routines.
Examples: MPI, PVM, Pthreads

(b) Extending a serial language
A(0:N-1) = b(0:N-1) * b(1:N)
c = A(0:N-1) + A(1:N)
Example: Fortran 90

(c) Constructing a parallel program with compiler annotations
#pragma parallel
#pragma shared(A,b,c)
#pragma local(i)
{
# pragma pfor iterate(i=0;N;1)
for (i=0;i<N;i++) A[i]=b[i]*b[i+1];
# pragma synchronize
# pragma pfor iterate(i=0;N;1)
for (i=0;i<N;i++) c[i]=A[i]+A[i+1];
}
Example: SGI Power C
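Since Pthreads appears among the library-routine examples, here is a minimal Pthreads sketch of approach (a); the thread count, array size, and sample data are illustrative assumptions, and the second loop stops at N-1 so it never reads past the end of A.

#include <pthread.h>
#include <stdio.h>

#define N 16                      /* illustrative problem size       */
#define P 4                       /* illustrative number of threads  */

static double b[N + 1], A[N], c[N];
static pthread_barrier_t bar;

static void *worker(void *arg)
{
    int id = (int)(long)arg;                       /* my_process_id()        */
    for (int i = id; i < N; i += P)                /* cyclic iteration split */
        A[i] = b[i] * b[i + 1];
    pthread_barrier_wait(&bar);                    /* barrier()              */
    for (int i = id; i < N - 1; i += P)            /* stop at N-1: c[N-1] would read A[N] */
        c[i] = A[i] + A[i + 1];
    return NULL;
}

int main(void)
{
    pthread_t t[P];
    for (int i = 0; i <= N; i++) b[i] = i;         /* sample data */
    pthread_barrier_init(&bar, NULL, P);
    for (long id = 0; id < P; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < P; id++)
        pthread_join(t[id], NULL);
    pthread_barrier_destroy(&bar);
    printf("c[0] = %g\n", c[0]);
    return 0;
}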
2 Methods for Constructing Parallel Programs

Comparison of the three construction methods:
• Library routines (e.g., MPI, PVM): easy to implement, no new compiler needed; but no compiler checking, analysis, or optimization.
• Language extension (e.g., Fortran 90): allows compiler checking, analysis, and optimization; but hard to implement, requires a new compiler.
• Compiler annotations (e.g., SGI Power C, HPF): falls between the library-routine and extension approaches; the annotations have no effect on a serial platform.
3 Parallelism Issues

3.1 Homogeneity of processes
• SIMD: all processes execute the same instruction at the same time.
• MIMD: different processes may execute different instructions at the same time.
• SPMD: the processes are homogeneous; multiple processes execute the same code on different data (generally a synonym for data parallelism). Usually corresponds to parallel loops, data-parallel constructs, and a single code.
• MPMD: the processes are heterogeneous; multiple processes execute different code. Writing a fully heterogeneous parallel program for a machine with 1000 processors is very difficult.
3 Parallelism Issues

Parallel block:
  parbegin S1 S2 S3 ... Sn parend
S1, S2, S3, ..., Sn may be different pieces of code.

Parallel loop: when all processes in a parallel block share the same code,
  parbegin S1 S2 S3 ... Sn parend   (S1, S2, ..., Sn are the same code)
can be simplified to
  parfor (i = 1; i <= n; i++) S(i)
3 Parallelism Issues

Constructing SPMD programs

Single-code method:
To express the SPMD program
  parfor (i = 0; i <= N; i++) foo(i)
the user writes a single program such as
  pid = my_process_id();
  numproc = number_of_processes();
  parfor (i = pid; i <= N; i = i + numproc) foo(i)
The program is compiled into one executable A, which a shell script loads onto N processing nodes:
  run A -numnodes N

Data-parallel method:
To express the SPMD program
  parfor (i = 0; i <= N; i++) { C[i] = A[i] + B[i]; }
the user can write a single data-parallel assignment
  C = A + B
or
  forall (i = 1, N) C[i] = A[i] + B[i]
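A minimal MPI sketch of the single-code method; foo and N are placeholders, MPI_Comm_rank and MPI_Comm_size stand in for my_process_id() and number_of_processes(), and a launcher such as mpirun -np 4 ./A plays the role of the loading script run A -numnodes N.

#include <mpi.h>
#include <stdio.h>

#define N 100                                      /* illustrative */

static void foo(int i) { printf("task %d\n", i); } /* placeholder task body */

int main(int argc, char **argv)
{
    int pid, numproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);           /* my_process_id()       */
    MPI_Comm_size(MPI_COMM_WORLD, &numproc);       /* number_of_processes() */

    /* every process runs this same executable; the cyclic loop realizes
       parfor (i = 0; i <= N; i++) foo(i) */
    for (int i = pid; i <= N; i += numproc)
        foo(i);

    MPI_Finalize();
    return 0;
}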
3 Parallelism Issues

Constructing MPMD programs

Multi-code method (for languages that provide no parallel block or parallel loop):
To express the MPMD program
  parbegin S1 S2 S3 parend
the user writes three programs, compiles them into three executables S1, S2, S3, and loads them onto three processing nodes with a shell script:
  run S1 on node1
  run S2 on node2
  run S3 on node3
S1, S2, and S3 are sequential programs plus library calls for interaction.

Faking MPMD with SPMD:
The same MPMD program can be expressed as the SPMD program
  parfor (i = 0; i < 3; i++) {
    if (i == 0) S1
    if (i == 1) S2
    if (i == 2) S3
  }
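A minimal MPI sketch of faking MPMD with SPMD: one executable in which each rank branches to its own code; S1, S2, and S3 are placeholder routines.

#include <mpi.h>
#include <stdio.h>

/* S1, S2, S3 stand for the three different computations (placeholders) */
static void S1(void) { printf("S1\n"); }
static void S2(void) { printf("S2\n"); }
static void S3(void) { printf("S3\n"); }

int main(int argc, char **argv)
{
    int i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &i);
    /* one executable, but each rank runs different code: MPMD faked with SPMD */
    if (i == 0) S1();
    if (i == 1) S2();
    if (i == 2) S3();
    MPI_Finalize();
    return 0;
}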
3 Parallelism Issues

3.2 Static and dynamic parallelism
Program structure: the way a program is built from its components.

Static parallelism: if the structure of the program and the number of processes can be determined before run time (at compile time, link time, or load time), the program is said to have static parallelism.
Example of static parallelism (P, Q, R are static):
  parbegin P, Q, R parend

Dynamic parallelism: otherwise the program is said to have dynamic parallelism, which means that processes are created and terminated at run time.
Example of dynamic parallelism:
  while (C > 0) begin
    fork(foo(C));
    C := boo(C);
  end
3 Parallelism Issues

A general way to exploit dynamic parallelism: Fork/Join

Process A:
begin
  Z := 1;
  fork(B);
  T := foo(3);
end

Process B:
begin
  fork(C);
  X := foo(Z);
  join(C);
  output(X + Y);
end

Process C:
begin
  Y := foo(Z);
end

Fork: spawn a child process.
Join: force the parent process to wait for the child process to finish.
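A minimal Pthreads sketch of the same Fork/Join structure, using threads in place of processes; foo is a placeholder computation, and main joins B so the program does not exit before B finishes (an assumption not present in the pseudocode above).

#include <pthread.h>
#include <stdio.h>

static int Z, X, Y;

static int foo(int v) { return v + 1; }            /* placeholder computation */

static void *C(void *arg) { Y = foo(Z); return NULL; }

static void *B(void *arg)
{
    pthread_t c;
    pthread_create(&c, NULL, C, NULL);             /* fork(C)                */
    X = foo(Z);
    pthread_join(c, NULL);                         /* join(C): wait for C    */
    printf("%d\n", X + Y);                         /* output(X + Y)          */
    return NULL;
}

int main(void)                                     /* plays the role of process A */
{
    pthread_t b;
    int T;
    Z = 1;
    pthread_create(&b, NULL, B, NULL);             /* fork(B)                */
    T = foo(3);
    pthread_join(b, NULL);                         /* keep the program alive until B is done */
    (void)T;
    return 0;
}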
3 Parallelism Issues

3.3 Process groups
Purpose: to support interaction among processes; processes that need to interact are often scheduled into the same group.
A group member is uniquely identified by a group identifier plus a member rank.

3.4 Partitioning and allocation
Principle: keep the system busy with computation most of the time, rather than idle or busy with interaction, without sacrificing parallelism (degree of parallelism).
Partitioning: dividing the data and the workload.
Allocation: mapping the partitioned data and workload onto computing nodes (processors).
Allocation styles:
  Explicit allocation: the user specifies how data and workload are placed.
  Implicit allocation: the compiler and run-time system decide.
Proximity principle: place the data a process needs close to the process code that uses it.
3 Parallelism Issues

Degree of Parallelism (DOP): the number of processes executing simultaneously.
Granularity: the amount of computation performed between two consecutive parallel or interaction operations.
  Instruction-level parallelism
  Block-level parallelism
  Process-level parallelism
  Task-level parallelism
Degree of parallelism and granularity are often roughly inversely related: increasing the grain size decreases the degree of parallelism, while increasing the degree of parallelism increases the system (synchronization) overhead.
Levels of Parallelism

(Figure: a program is decomposed into tasks such as func1(), func2(), func3(); each function contains statements such as a(0)=.., b(0)=..; and each statement breaks down into individual operations (+, x, Load).)

Code granularity and who parallelises it:
• Large grain (task level): program, parallelised with PVM/MPI
• Medium grain (control level): function (thread), parallelised with threads
• Fine grain (data level): loop, parallelised by the compiler
• Very fine grain (multiple issue): individual operations, parallelised by the hardware (CPU)
Responsible for Parallelization

Grain size   Code item                                   Parallelised by
Very fine    Instruction                                 Processor
Fine         Loop / instruction block                    Compiler
Medium       (Standard one-page) function                Programmer
Large        Program / separate heavy-weight process     Programmer
4 Interaction/Communication Issues

Interaction: the mutual influence among processes.

4.1 Types of interaction
Communication: an operation that transfers data between two or more processes.
Communication mechanisms:
  Shared variables
  Parameters passed from a parent process to a child process
  Message passing
4 Interaction/Communication Issues

Synchronization:
  Atomicity synchronization
  Control synchronization (barriers, critical regions)
  Data synchronization (locks, conditional critical regions, monitors, events)

Examples:

Atomicity synchronization:
  parfor (i := 1; i < n; i++) {
    atomic { x := x + 1; y := y - 1 }
  }

Barrier synchronization:
  parfor (i := 1; i < n; i++) {
    Pi
    barrier
    Qi
  }

Critical region:
  parfor (i := 1; i < n; i++) {
    critical { x := x + 1; y := y + 1 }
  }

Data synchronization (semaphore/lock):
  parfor (i := 1; i < n; i++) {
    lock(S);
    x := x + 1;
    y := y - 1;
    unlock(S)
  }
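As a concrete illustration of the lock-based data synchronization above, here is a minimal Pthreads sketch; the thread count is an illustrative assumption and the mutex plays the role of S.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                                 /* illustrative */

static int x = 0, y = 0;
static pthread_mutex_t S = PTHREAD_MUTEX_INITIALIZER;

static void *body(void *arg)
{
    pthread_mutex_lock(&S);       /* lock(S)                              */
    x = x + 1;                    /* the two updates now appear atomic    */
    y = y - 1;
    pthread_mutex_unlock(&S);     /* unlock(S)                            */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, body, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("x = %d, y = %d\n", x, y);   /* expect x = 4, y = -4 */
    return 0;
}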
4 Interaction/Communication Issues

Aggregation: combining the partial results computed by the individual processes into a complete result through a sequence of supersteps, each containing a short computation and a simple communication and/or synchronization.
Aggregation operations:
  Reduction
  Scan
Example: computing the inner product of two vectors
  parfor (i := 1; i < n; i++) {
    X[i] := A[i] * B[i];
    inner_product := aggregate_sum(X[i]);
  }
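A minimal MPI sketch of the reduction-style aggregation for the inner product, assuming each process holds one element of A and B (the data values are illustrative); MPI_Reduce plays the role of aggregate_sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process holds one element of A and B (illustrative data) */
    double a = rank + 1.0, b = 2.0;
    double x = a * b;                 /* local partial product */
    double inner_product = 0.0;

    /* reduction: sum the partial products onto rank 0 */
    MPI_Reduce(&x, &inner_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("inner product = %g\n", inner_product);
    MPI_Finalize();
    return 0;
}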
4 Interaction/Communication Issues

4.2 Modes of interaction

(Figure: processes P1, P2, ..., Pn all reaching a piece of interaction code C.)

Synchronous interaction: all participants arrive at and execute the interaction code C at the same time.
Asynchronous interaction: a process that reaches C may execute it without waiting for the other processes to arrive.
4 Interaction/Communication Issues

4.3 Patterns of interaction
  One to one: point-to-point
  One to many: broadcast, scatter
  Many to one: gather, reduce
  Many to many: total exchange, scan, permutation/shift
4 Interaction/Communication Issues

(Figure: three processes P1, P2, P3 initially holding the values 1, 3, 5; each panel shows their contents before and after the operation.)
(a) Point-to-point (one to one): P1 sends a value to P3.
(b) Broadcast (one to many): P1 sends a value to all processes.
(c) Scatter (one to many): P1 sends a different value to each process.
(d) Gather (many to one): P1 receives a value from each process.
4 Interaction/Communication Issues

(e) Total exchange (many to many): every process sends a different message to every other process.
(f) Shift/permutation (many to many): each process sends a value to the next process and receives a value from the previous one.
(g) Reduce (many to one): P1 obtains the sum 1 + 3 + 5 = 9.
(h) Scan (many to many): P1 obtains 1, P2 obtains 1 + 3 = 4, and P3 obtains 1 + 3 + 5 = 9.
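A minimal MPI sketch of the scan pattern (h), assuming three processes holding the values 1, 3, 5 as in the figure; MPI_Scan computes the inclusive prefix sums.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value, prefix;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = 2 * rank + 1;             /* ranks 0, 1, 2 hold 1, 3, 5 as in the figure */

    /* inclusive prefix sum: rank 0 gets 1, rank 1 gets 4, rank 2 gets 9 */
    MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("P%d: %d\n", rank + 1, prefix);
    MPI_Finalize();
    return 0;
}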
Approaches for Parallel Programs
Development
• Implicit Parallelism
– Supported by parallel languages and parallelizing
compilers that take care of identifying parallelism, the
scheduling of calculations and the placement of data.
• Explicit Parallelism
– based on the assumption that the user is often the
best judge of how parallelism can be exploited for a
particular application.
– the programmer is responsible for most of the
parallelization effort, such as task decomposition,
mapping tasks to processors, and the communication
structure.
Strategies for Development
Parallelization Procedure

(Figure: Sequential computation -> Decomposition -> Tasks -> Assignment -> Process elements -> Orchestration -> Mapping -> Processors.)
Sample Sequential Program

FDM (Finite Difference Method)

...
loop{
  for (i=0; i<N; i++){
    for (j=0; j<N; j++){
      a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                     + a[i-1][j] + a[i+1][j] + a[i][j]);
    }
  }
}
...
Parallelize the Sequential Program
• Decomposition: each update of a[i][j] is a task.

...
loop{
  for (i=0; i<N; i++){
    for (j=0; j<N; j++){
      a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                     + a[i-1][j] + a[i+1][j] + a[i][j]);
    }
  }
}
...
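A minimal sketch of how such a stencil loop might be annotated with OpenMP, assuming a Jacobi-style variant that writes into a second array so the (i, j) updates are independent (the in-place update above carries loop dependences); N and the sample data are illustrative.

#include <stdio.h>

#define N 256                       /* illustrative grid size */

static double a[N][N], b[N][N];

/* one sweep of a Jacobi-style variant: reads a, writes b, so every (i, j)
   update is independent and the rows can be divided among the threads */
static void sweep(void)
{
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            b[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                           + a[i-1][j] + a[i+1][j] + a[i][j]);
}

int main(void)
{
    a[N/2][N/2] = 1.0;              /* sample data */
    sweep();                        /* compile with -fopenmp to run in parallel */
    printf("%g\n", b[N/2][N/2]);
    return 0;
}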
Parallelize the Sequential Program
• Assignment: divide the tasks equally among the process elements (PEs).
Parallelize the Sequential Program
• Orchestration: the process elements need to communicate and to synchronize.
Parallelize the Sequential Program
• Mapping: the process elements are mapped onto the processors of the multiprocessor.
Parallel Programming Models
• Sequential Programming Model
• Shared Memory Model (Shared Address Space Model)
– DSM
– Threads/OpenMP (enabled for clusters)
– Java threads
• Message Passing Model
– PVM
– MPI
• Hybrid Model
– Mixing shared and distributed memory model
– Using OpenMP and MPI together
Sequential Programming Model
• Functional
– Naming: Can name any variable in virtual address
space
• Hardware (and perhaps compilers) does translation to
physical addresses
– Operations: Loads and Stores
– Ordering: Sequential program order
Sequential Programming Model
• Performance
– Rely on dependences on single location (mostly):
dependence order
– Compiler: reordering and register allocation
– Hardware: out of order, pipeline bypassing, write
buffers
– Transparent replication in caches
SAS (Shared Address Space) Programming Model

(Figure: two threads (processes) in one system communicate through a shared variable X; one thread executes write(X), the other read(X).)
Shared Address Space Architectures
•Naturally provided on wide range of platforms
–History dates at least to precursors of mainframes in
early 60s
–Wide range of scale: few to hundreds of processors
•Popularly known as shared memory machines or
model
–Ambiguous: memory may be physically distributed
among processors
Shared Address Space Architectures
•Any processor can directly reference any memory
location
–Communication occurs implicitly as result of loads and
stores
•Convenient:
–Location transparency
–Similar programming model to time-sharing on
uniprocessors
• Except processes run on different processors
• Good throughput on multiprogrammed workloads
Shared Address Space Model
• Process: virtual address space plus one or more threads of control
• Portions of the address spaces of processes are shared

(Figure: the virtual address spaces of processes P0..Pn communicating via shared addresses; the shared portion of each address space maps to common physical addresses in the machine's physical address space, while the private portions (P0 private .. Pn private) map to separate ones; loads and stores address either portion.)

• Writes to a shared address are visible to other threads (in other processes too)
• Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
• The OS uses shared memory to coordinate processes
Communication Hardware for SAS
• Also a natural extension of the uniprocessor
• Already have processors, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort
  – Memory capacity is increased by adding memory modules, I/O by adding I/O controllers and devices
  – Add processors for processing!

(Figure: processors, memory modules (Mem), and I/O controllers attached to a shared interconnect.)
History of SAS Architecture
• “Mainframe” approach
  – Motivated by multiprogramming
  – Extends the crossbar used for memory bandwidth and I/O
  – Originally processor cost limited the scale; later, the cost of the crossbar
  – Bandwidth scales with p
  – High incremental cost; use multistage networks instead
• “Minicomputer” approach
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming
  – Used heavily for parallel computing
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for a uniprocessor
  – Bus is the bandwidth bottleneck
    • caching is key: coherence problem
  – Low incremental cost

(Figures: a crossbar connecting processors (P), memory modules (M), and I/O channels (I/O C); a bus-based SMP with per-processor caches ($).)
Example: Intel Pentium Pro Quad

(Figure: four P-Pro modules, each with a CPU, a 256-KB L2 cache, an interrupt controller, and a bus interface, sharing the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller/MIU drives 1-, 2-, or 4-way interleaved DRAM, and PCI bridges connect PCI buses with PCI I/O cards.)

– Highly integrated, targeted at high volume
– Low latency and bandwidth
Example: SUN Enterprise

(Figure: CPU/memory cards (two processors with caches plus a memory controller) and I/O cards (SBUS slots, 100bT, SCSI, 2 FiberChannel) share the Gigaplane bus (256-bit data, 41-bit address, 83 MHz) through a bus interface/switch.)

• Memory is on the processor cards themselves
  – 16 cards of either type: processors + memory, or I/O
  – But all memory is accessed over the bus, so the machine is symmetric
  – Higher bandwidth, higher latency bus
Scaling Up

(Figure: a “dance hall” organization with all memory modules on the far side of the network from the processors and caches, versus a distributed-memory organization with a memory module attached to each processor.)

– Problem is the interconnect: cost (crossbar) or bandwidth (bus)
– Dance hall: bandwidth still scalable, but lower cost than a crossbar
  • latencies to memory are uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
  • Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
Shared Address Space Programming Model
• Naming
  – Any process can name any variable in the shared space
• Operations
  – Loads and stores, plus those needed for ordering
• Simplest Ordering Model
  – Within a process/thread: sequential program order
  – Across threads: some interleaving (as in time-sharing)
  – Additional orders through synchronization
Synchronization
• Mutual exclusion (locks)
– No ordering guarantees
• Event synchronization
– Ordering of events to preserve dependences
– e.g. producer —> consumer of data
– 3 main types:
• point-to-point
• global
• group
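A minimal Pthreads sketch of point-to-point event synchronization for the producer -> consumer dependence mentioned above; the data value and the condition-variable protocol are illustrative assumptions.

#include <pthread.h>
#include <stdio.h>

/* point-to-point event synchronization: the consumer waits until the
   producer signals that the data is ready (preserves the dependence) */
static int data, ready = 0;
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    pthread_mutex_lock(&m);
    data = 42;                       /* produce (illustrative value) */
    ready = 1;
    pthread_cond_signal(&cv);        /* post the event               */
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *consumer(void *arg)
{
    pthread_mutex_lock(&m);
    while (!ready)                   /* wait for the event           */
        pthread_cond_wait(&cv, &m);
    printf("consumed %d\n", data);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}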
MP Programming Model

(Figure: a process on Node A executes send(Y) and a process on Node B executes receive(Y’); the value of Y travels as a message from Node A into Y’ on Node B.)
Message-Passing Programming Model

(Figure: process P executes Send X, Q, t from address X in its local address space; process Q executes Receive Y, P, t into address Y in its local address space; the send and the receive are matched on process and tag t.)

– Send specifies the data buffer to be transmitted and the receiving process
– Recv specifies the sending process and the application storage to receive into
– Memory-to-memory copy, but need to name processes
– User process names only local data and entities in process/tag space
– In the simplest form, the send/recv match achieves a pairwise synchronization event
– Many overheads: copying, buffer management, protection
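A minimal MPI sketch of the matched send/receive, assuming two processes; rank 0 plays process P sending X with tag t, and rank 1 plays process Q receiving into Y.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, X = 7, Y;
    int tag = 0;                               /* the "t" used to match send and recv */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* process P: send the contents of X to process 1 with tag t */
        MPI_Send(&X, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* process Q: receive from process 0 with tag t into Y */
        MPI_Recv(&Y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", Y);
    }
    MPI_Finalize();
    return 0;
}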
Message Passing Architectures
• Complete computer as building block, including I/O
– Communication via explicit I/O operations
• Programming model: directly access only private
address space (local memory), comm. via explicit
messages (send/receive)
• Programming model more removed from basic
hardware operations
– Library or OS intervention
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
  – synchronous ops
  – Replaced by DMA, enabling non-blocking ops
    • Buffered by the system at the destination until recv
• Diminishing role of topology
  – Store-and-forward routing
  – Cost is in the node-network interface
  – Simplifies programming

(Figure: a 3-dimensional hypercube with nodes labelled 000 through 111.)
Example: IBM SP-2

(Figure: an SP-2 node is essentially a complete RS6000 workstation: Power 2 CPU, L2 cache, memory bus, memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus carrying an i860-based NIC with DMA that connects to a general interconnection network formed from 8-port switches.)

– Made out of essentially complete RS6000 workstations
– Network interface integrated in the I/O bus (bandwidth limited by the I/O bus)
  • Doesn’t need to see memory references
Example: Intel Paragon

(Figure: an Intel Paragon node has two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, a memory controller with 4-way interleaved DRAM, and a DMA-driven network interface (NI); nodes attach to a 2D grid network with 8-bit, 175 MHz, bidirectional links, as in Sandia’s Intel Paragon XP/S-based supercomputer.)

– Network interface integrated in the memory bus, for performance
Message Passing Programming Model
• Naming
– Processes can name private data directly.
– No shared address space
• Operations
– Explicit communication: send and receive
– Send transfers data from private address space to
another process
– Receive copies data from process to private
address space
– Must be able to name processes
Message Passing Programming Model
• Ordering
– Program order within a process
– Send and receive can provide pt-to-pt synch
between processes
• Can construct global address space
– Process number + address within process address
space
– But no direct operations on these names
Design Issues Apply at All Layers
• Programming model’s position provides
constraints/goals for system
• In fact, each interface between layers supports
or takes a position on
– Naming model
– Set of operations on names
– Ordering model
– Replication
– Communication performance
Naming and Operations
• Naming and operations in programming model
can be directly supported by lower levels, or
translated by compiler, libraries or OS
• Example
– Shared virtual address space in programming model
• Hardware interface supports shared physical
address space
– Direct support by hardware through v-to-p mappings,
no software layers
Naming and Operations (Cont’d)
• Hardware supports independent physical
address spaces
– system/user interface: can provide SAS through
OS
• v-to-p mappings only for data that are local
• remote data accesses incur page faults; brought in via
page fault handlers
– Or through compilers or runtime, so above
sys/user interface
Naming and Operations (Cont’d)
• Example: Implementing Message Passing
• Direct support at hardware interface
– But match and buffering benefit from more flexibility
• Support at sys/user interface or above in
software (almost always)
– Hardware interface provides basic data transport
– Send/receive built in sw for flexibility (protection, buffering)
– Or lower interfaces provide SAS, and send/receive built on
top with buffers and loads/stores
Naming and Operations (Cont’d)
• Need to examine the issues and tradeoffs at
every layer
– Frequencies and types of operations, costs
• Message passing
– No assumptions on orders across processes except
those imposed by send/receive pairs
• SAS
– How processes see the order of other processes’
references defines semantics of SAS
• Ordering very important and subtle
Naming and Operations (Cont’d)
• Uniprocessors play tricks with orders to gain
parallelism or locality
• These are more important in multiprocessors
• Need to understand which old tricks are valid,
and learn new ones
• How programs behave, what they rely on, and
hardware implications
Message Passing Model / Shared Memory Model

                 Message Passing     Shared Memory
Architecture     any                 SMP or DSM
Programming      difficult           easier
Performance      good                better (SMP), worse (DSM)
Cost             less expensive      very expensive (e.g. SunFire 15K: $4,140,830)
Parallelisation Paradigms
• Task-Farming / Master-Worker
• Single-Program Multiple-Data (SPMD)
• Pipelining
• Divide and Conquer
• Speculation
Master-Worker/Slave Model
• The master decomposes the problem into small tasks, distributes them to the workers, and gathers the partial results to produce the final result.
• Mapping/Load Balancing
  – Static
  – Dynamic: used when the number of tasks is larger than the number of CPUs, when the tasks are only known at run time, or when the CPUs are heterogeneous.

(Figure: static master-worker organization.)
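A minimal MPI sketch of a master-worker scheme with dynamic load balancing, assuming at least two processes and more tasks than workers; compute() and the task granularity are illustrative placeholders.

#include <mpi.h>
#include <stdio.h>

#define NTASKS 20                        /* assumes NTASKS >= number of workers */

/* placeholder for the real work done on one task */
static int compute(int task) { return task * task; }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                 /* master */
        MPI_Status st;
        int next = 0, sum = 0, result, stop = -1;
        /* hand one task to each worker to start with */
        for (int w = 1; w < size && next < NTASKS; w++, next++)
            MPI_Send(&next, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        /* dynamic load balancing: whoever finishes gets the next task */
        for (int done = 0; done < NTASKS; done++) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &st);
            sum += result;                           /* gather partial results */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            }
        }
        printf("final result = %d\n", sum);
    } else {                                         /* worker */
        int task, result;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task < 0) break;                     /* stop marker */
            result = compute(task);
            MPI_Send(&result, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}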
Single-Program Multiple-Data
• Most commonly used model.
• Each process executes the same piece of code, but on a different part of the data: the data is split among the available processors.
• Also known as geometric/domain decomposition or data parallelism.
Pipelining
• Suitable for fine-grained parallelism.
• Also suitable for applications involving multiple stages of execution that need to operate on a large number of data sets.
Divide and Conquer
• A problem is divided into two or more sub-problems; each sub-problem is solved independently and the results are combined.
• 3 operations: split, compute, and join.
• Master-worker/task-farming is like divide and conquer with the master doing both the split and the join operation.
• (It can be seen as a “hierarchical” master-worker technique.)
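A minimal divide-and-conquer sketch in C with OpenMP tasks: split the index range, compute the two halves independently, and join the partial sums; the cutoff and the data are illustrative assumptions (compile with -fopenmp for parallel execution).

#include <stdio.h>

/* divide-and-conquer sum: split the range, solve the halves independently,
   then join the partial results; small ranges are computed directly */
static long sum(const int *a, int lo, int hi)
{
    if (hi - lo <= 4) {                      /* compute: small enough to solve directly */
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;            /* split */
    long left, right;
    #pragma omp task shared(left)
    left = sum(a, lo, mid);                  /* the two sub-problems are independent */
    right = sum(a, mid, hi);
    #pragma omp taskwait                     /* join */
    return left + right;
}

int main(void)
{
    int a[100];
    for (int i = 0; i < 100; i++) a[i] = i;  /* sample data */
    long total;
    #pragma omp parallel
    #pragma omp single
    total = sum(a, 0, 100);
    printf("%ld\n", total);                  /* expect 4950 */
    return 0;
}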
Speculative Parallelism
• Used when it is quite difficult to achieve parallelism through one of the previous paradigms.
• Problems with complex dependencies: use “look-ahead” execution.
• Employing different algorithms to solve the same problem; the first to produce a result is used.