Foundations of Parallel Programming

• Introduction to Parallel Programming
• Approaches for Parallel Programs
• Parallel Programming Models
• Parallel Programming Paradigms

Parallel Programming is a Complex Task
• Parallel software developers must handle issues such as:
– non-determinism
– communication
– synchronization
– data partitioning and distribution
– load balancing
– fault tolerance
– heterogeneity
– shared or distributed memory
– deadlocks
– race conditions

Users' Expectations of Parallel Programming Environments
• Parallel computing can only be widely successful if parallel software meets users' expectations, such as:
– provide architecture/processor-type transparency
– provide network/communication transparency
– be easy to use and reliable
– provide support for fault tolerance
– accommodate heterogeneity
– assure portability
– provide support for traditional high-level languages
– be capable of delivering increased performance
– provide parallelism transparency

2 Approaches to Constructing Parallel Programs

Sequential code segment:
for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];

(a) Construction with library routines (examples: MPI, PVM, Pthreads):
id = my_process_id();
p = number_of_processes();
for (i = id; i < N; i = i + p) A[i] = b[i] * b[i+1];
barrier();
for (i = id; i < N; i = i + p) c[i] = A[i] + A[i+1];
Here my_process_id(), number_of_processes(), and barrier() are supplied by the library.

(b) Construction by extending a sequential language (example: Fortran 90):
A(0:N-1) = b(0:N-1) * b(1:N)
c = A(0:N-1) + A(1:N)

(c) Construction with compiler annotations (example: SGI Power C):
#pragma parallel
#pragma shared(A, b, c)
#pragma local(i)
{
#pragma pfor iterate(i = 0; N; 1)
for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
#pragma synchronize
#pragma pfor iterate(i = 0; N; 1)
for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];
}

Comparison of the three construction approaches:
– Library routines (examples: MPI, PVM): easy to implement, no new compiler needed; but no compiler checking, analysis, or optimization.
– Language extension (example: Fortran 90): allows compiler checking, analysis, and optimization; but hard to implement and requires a new compiler.
– Compiler annotations (examples: SGI Power C, HPF): falls between the library and extension approaches; the annotations simply have no effect on a sequential platform.

3 Parallelism Issues

3.1 Homogeneity of Processes
SIMD: all processes execute the same instruction at the same time.
MIMD: different processes may execute different instructions at the same time.
SPMD: the processes are homogeneous; multiple processes execute the same code on different data (generally a synonym for data parallelism). Usually corresponds to parallel loops, data-parallel constructs, and single-code programs.
MPMD: the processes are heterogeneous; multiple processes execute different code.
Writing a completely heterogeneous parallel program for a machine with 1000 processors is very difficult.

Parallel block:
parbegin S1 S2 S3 ... Sn parend
S1, S2, S3, ..., Sn may be different pieces of code.

Parallel loop: when all processes in a parallel block share the same code,
parbegin S1 S2 S3 ... Sn parend (with S1 ... Sn identical)
simplifies to
parfor (i = 1; i <= n; i++) S(i)

Constructing SPMD programs

Single-code method. To express the SPMD program
parfor (i = 0; i <= N; i++) foo(i)
the user writes one program:
pid = my_process_id();
numproc = number_of_processes();
parfor (i = pid; i <= N; i = i + numproc) foo(i)
This program is compiled into an executable A, which a shell script loads onto N processing nodes:
run A -numnodes N

Data-parallel method. To express the SPMD program
parfor (i = 0; i <= N; i++) { C[i] = A[i] + B[i]; }
the user can write a single data-parallel assignment:
C = A + B
or
forall (i = 1, N) C[i] = A[i] + B[i]

Constructing MPMD programs

Multiple-code method (for languages that provide no parallel block or parallel loop). To express the MPMD program
parbegin S1 S2 S3 parend
the user writes three programs, compiles them into three executables S1, S2, S3, and loads them onto three processing nodes with a shell script:
run S1 on node1
run S2 on node2
run S3 on node3
S1, S2, and S3 are sequential programs plus library calls for interaction.

Faking MPMD with SPMD. The same MPMD program
parbegin S1 S2 S3 parend
can be expressed as the SPMD program
parfor (i = 0; i < 3; i++) {
  if (i == 0) S1
  if (i == 1) S2
  if (i == 2) S3
}

3.2 Static and Dynamic Parallelism
Program structure: the way a program is built from its components.
Static parallelism: the structure of the program and the number of processes can be determined before it runs (at compile time, link time, or load time).
Example of static parallelism (P, Q, R are static):
parbegin P, Q, R parend
Dynamic parallelism: otherwise the program has dynamic parallelism, which means that processes are created and terminated at run time.
Example of dynamic parallelism:
while (C > 0) begin
  fork(foo(C));
  C := boo(C);
end

The general way to exploit dynamic parallelism: Fork/Join
Fork: spawn a child process.
Join: force the parent process to wait for the child.

Process A: begin Z := 1; fork(B); T := foo(3); end
Process B: begin fork(C); X := foo(Z); join(C); output(X + Y); end
Process C: begin Y := foo(Z); end

(A POSIX-threads sketch of this fork/join example follows below.)
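As a concrete illustration, here is a minimal POSIX-threads sketch of the fork/join example above. It is only a sketch: foo() is a placeholder computation, and pthread_create/pthread_join stand in for fork and join (the final join in main is added so the program does not exit before B finishes).

/* Fork/join sketch of processes A, B, C using POSIX threads.
 * Compile with: cc -pthread forkjoin.c                         */
#include <pthread.h>
#include <stdio.h>

static int Z, X, Y, T;

static int foo(int v) { return v * v; }      /* placeholder computation */

static void *process_C(void *arg) {          /* Process C: Y := foo(Z) */
    Y = foo(Z);
    return NULL;
}

static void *process_B(void *arg) {          /* Process B */
    pthread_t c;
    pthread_create(&c, NULL, process_C, NULL);   /* fork(C) */
    X = foo(Z);
    pthread_join(c, NULL);                       /* join(C): wait for C */
    printf("%d\n", X + Y);                       /* output(X + Y) */
    return NULL;
}

int main(void) {                             /* Process A */
    pthread_t b;
    Z = 1;
    pthread_create(&b, NULL, process_B, NULL);   /* fork(B) */
    T = foo(3);
    pthread_join(b, NULL);   /* not in the slide; keeps main alive until B is done */
    return 0;
}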
3.3 Process Groups
Purpose: to support interaction among processes; processes that need to interact are usually scheduled into the same group.
A group member is uniquely identified by its group identifier plus its member rank.

3.4 Partitioning and Allocation
Principle: keep the system busy computing most of the time, rather than idle or busy interacting, without sacrificing parallelism (the degree of parallelism).
Partitioning: cutting up the data and the workload.
Allocation: mapping the partitioned data and workload onto the computing nodes (processors).
Allocation styles:
– Explicit allocation: the user specifies how data and load are placed.
– Implicit allocation: the compiler and runtime support system decide.
Locality principle: place the data a process needs close to the process code that uses it.

Degree of Parallelism and Granularity
Degree of parallelism (DOP): the number of processes executing simultaneously.
Granularity: the amount of computation performed between two consecutive parallel or interaction operations:
– instruction-level parallelism
– block-level parallelism
– process-level parallelism
– task-level parallelism
Degree of parallelism and grain size are often roughly reciprocal: increasing the grain size reduces the degree of parallelism, while increasing the degree of parallelism increases the system (synchronization) overhead.

Levels of Parallelism
(figure: tasks are parallelized with PVM/MPI, functions/threads with a threads package, loops by compilers, and instructions by the CPU hardware)
Code granularity and code item:
– Large grain (task level): program
– Medium grain (control level): function (thread)
– Fine grain (data level): loop (compiler)
– Very fine grain (multiple issue): handled in hardware

Responsible for Parallelization
Grain size   Code item                                  Parallelized by
Very fine    instruction                                processor
Fine         loop / instruction block                   compiler
Medium       function (roughly one page of code)        programmer
Large        program / separate heavyweight process     programmer

4 Interaction/Communication Issues
Interaction: the mutual influence between processes.

4.1 Types of Interaction
Communication: an operation that transfers data between two or more processes.
Communication mechanisms:
– shared variables
– parameters passed from parent to child process
– message passing

Synchronization:
– atomic synchronization
– control synchronization (barriers, critical sections)
– data synchronization (locks, conditional critical regions, monitors, events)

Examples:
Atomic synchronization:
parfor (i := 1; i < n; i++) {
  atomic { x := x + 1; y := y - 1 }
}
Barrier synchronization:
parfor (i := 1; i < n; i++) {
  Pi
  barrier
  Qi
}
Critical section:
parfor (i := 1; i < n; i++) {
  critical { x := x + 1; y := y + 1 }
}
Data synchronization (semaphore/lock):
parfor (i := 1; i < n; i++) {
  lock(S); x := x + 1; y := y - 1; unlock(S)
}

Aggregation: merging the partial results computed by the individual processes into one complete result through a sequence of supersteps, each containing a short computation and a simple communication and/or synchronization.
Aggregation operations: reduction, scan.
Example: computing the inner product of two vectors (a concrete MPI sketch appears at the end of this section):
parfor (i := 1; i < n; i++) {
  X[i] := A[i] * B[i];
  inner_product := aggregate_sum(X[i]);
}

4.2 Modes of Interaction
(figure: processes P1, P2, ..., Pn all reaching a piece of interaction code C)
Synchronous interaction: all participants arrive at the interaction code C at the same time and execute it together.
Asynchronous interaction: a process that reaches C may execute it without waiting for the other processes to arrive.

4.3 Interaction Patterns
– One to one: point to point
– One to many: broadcast, scatter
– Many to one: gather, reduce
– Many to many: total exchange, scan, permutation/shift

(figure: three processes P1, P2, P3 initially holding the values 1, 3, 5)
(a) Point to point (one to one): P1 sends one value to P3.
(b) Broadcast (one to many): P1 sends one value to all processes.
(c) Scatter (one to many): P1 sends a different value to every node.
(d) Gather (many to one): P1 receives one value from every node.
(e) Total exchange (many to many): every node sends a different message to every other node.
(f) Shift (permutation, many to many): every node sends a value to the next node and receives a value from the previous node.
(g) Reduce (many to one): P1 obtains the sum 1 + 3 + 5 = 9.
(h) Scan (many to many): P1 obtains 1, P2 obtains 1 + 3 = 4, P3 obtains 1 + 3 + 5 = 9.

Approaches for Parallel Program Development
• Implicit parallelism
– Supported by parallel languages and parallelizing compilers that take care of identifying parallelism, scheduling the computation, and placing the data.
• Explicit parallelism
– Based on the assumption that the user is often the best judge of how parallelism can be exploited for a particular application.
– The programmer is responsible for most of the parallelization effort, such as task decomposition, mapping tasks to processors, and the communication structure.
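To make the reduction pattern (g) and the inner-product example concrete, here is a minimal explicitly parallel sketch using MPI library routines. The vector length, the vector contents, and the cyclic partitioning are illustrative assumptions, not part of the original example.

/* Inner product with a reduction: each process computes a partial sum
 * over a cyclic slice of the index range, then MPI_Reduce combines them. */
#include <mpi.h>
#include <stdio.h>

#define N 1000                        /* illustrative vector length */

int main(int argc, char **argv) {
    int rank, nprocs, i;
    double A[N], B[N], local = 0.0, inner_product = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (i = 0; i < N; i++) {         /* illustrative data */
        A[i] = 1.0;
        B[i] = 2.0;
    }

    /* Cyclic partitioning, as in the library-routine example:
       each process handles i = rank, rank + nprocs, ...        */
    for (i = rank; i < N; i += nprocs)
        local += A[i] * B[i];

    /* Reduction (many to one): rank 0 obtains the sum of all partial results. */
    MPI_Reduce(&local, &inner_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("inner product = %f\n", inner_product);

    MPI_Finalize();
    return 0;
}

Replacing MPI_Reduce with MPI_Allreduce or MPI_Scan would realize the reduce-to-all and scan patterns from the list above.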
Strategies for Development

Parallelization Procedure
Sequential computation -> (decomposition) -> tasks -> (assignment) -> process elements -> (orchestration) -> (mapping) -> processors

Sample Sequential Program: FDM (Finite Difference Method)
...
loop {
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j] + a[i][j]);
    }
  }
}
...

Parallelize the Sequential Program
• Decomposition: identify the tasks, here the grid-point updates inside the loop nest shown above.
• Assignment: divide the tasks equally among the process elements (PEs).
• Orchestration: the PEs need to communicate and to synchronize.
• Mapping: map the PEs onto the processors of the multiprocessor.

Parallel Programming Models
• Sequential programming model
• Shared memory model (shared address space model)
– DSM
– Threads/OpenMP (enabled for clusters)
– Java threads
• Message passing model
– PVM
– MPI
• Hybrid model
– mixing the shared and distributed memory models
– using OpenMP and MPI together

Sequential Programming Model
• Functional
– Naming: can name any variable in the virtual address space
• the hardware (and perhaps the compiler) translates to physical addresses
– Operations: loads and stores
– Ordering: sequential program order
• Performance
– Relies mostly on dependences on a single location: dependence order
– Compiler: reordering and register allocation
– Hardware: out-of-order execution, pipeline bypassing, write buffers
– Transparent replication in caches

SAS (Shared Address Space) Programming Model
(figure: two threads/processes in one system communicate by writing and reading a shared variable X)
(A minimal OpenMP sketch of this model appears at the end of this subsection.)

Shared Address Space Architectures
• Naturally provided on a wide range of platforms
– History dates back at least to precursors of mainframes in the early 1960s
– Wide range of scale: a few to hundreds of processors
• Popularly known as shared memory machines or the shared memory model
– Ambiguous: the memory may be physically distributed among the processors
• Any processor can directly reference any memory location
– Communication occurs implicitly as a result of loads and stores
• Convenient:
– Location transparency
– Programming model similar to time-sharing on uniprocessors
• except that processes run on different processors
• good throughput on multiprogrammed workloads

Shared Address Space Model
• Process: a virtual address space plus one or more threads of control
• Portions of the address spaces of different processes are shared
(figure: virtual address spaces of a collection of processes communicating via shared addresses; the shared portion of each address space maps to common physical addresses, the private portions P0, P1, ..., Pn map to private physical memory; loads and stores on shared addresses carry the communication)
• Writes to a shared address are visible to the other threads (in other processes too)
• A natural extension of the uniprocessor model: conventional memory operations for communication, special atomic operations for synchronization
• The OS uses shared memory to coordinate processes

Communication Hardware for SAS
• Also a natural extension of the uniprocessor
• We already have processors, one or more memory modules, and I/O controllers connected by some hardware interconnect
– Memory capacity is increased by adding memory modules, I/O capacity by adding I/O controllers and devices
– Add processors for processing!
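A minimal OpenMP sketch of the shared address space model described above: the threads communicate implicitly through loads and stores on shared arrays, and the implicit barrier at the end of the first worksharing loop orders those writes before the reads in the second phase. The array size and input data are illustrative assumptions; the two phases mirror the A/c example from the beginning of the chapter.

/* Shared address space in practice: threads share A, b, c; only the
 * loop index is private. Compile with: cc -fopenmp sas.c            */
#include <omp.h>
#include <stdio.h>

#define N 16                          /* illustrative problem size */

int main(void) {
    double b[N + 1], A[N], c[N];
    int i;

    for (i = 0; i <= N; i++)          /* illustrative input data */
        b[i] = i;

    #pragma omp parallel private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            A[i] = b[i] * b[i + 1];

        /* implicit barrier here: all writes to A are visible
           before any thread starts reading A below             */

        #pragma omp for
        for (i = 0; i < N - 1; i++)
            c[i] = A[i] + A[i + 1];
    }

    printf("c[0] = %f\n", c[0]);
    return 0;
}

The implicit barrier after the first omp for plays the same role as the #pragma synchronize in the SGI Power C version shown earlier.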
History of SAS Architecture
• "Mainframe" approach
– Motivated by multiprogramming
– Extends the crossbar used for memory bandwidth and I/O
– Originally the cost was limited by the processors; later by the crossbar
– Bandwidth scales with p
– High incremental cost; use a multistage network instead
• "Minicomputer" approach
– Almost all microprocessor systems have a bus
– Motivated by multiprogramming
– Used heavily for parallel computing
– Called a symmetric multiprocessor (SMP)
– Latency larger than for a uniprocessor
– The bus is the bandwidth bottleneck
• caching is key, which raises the coherence problem
– Low incremental cost

Example: Intel Pentium Pro Quad
(figure: four P-Pro modules, each with a CPU, interrupt controller, 256-KB L2 cache, and bus interface, on a shared P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with a memory controller/MIU, 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI I/O cards)
– Highly integrated, targeted at high volume
– Low latency and bandwidth

Example: SUN Enterprise
(figure: up to 16 cards on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz); CPU/memory cards carry two processors with L2 caches, a memory controller, and memory; I/O cards carry SBUS slots, 100bT, SCSI, and FibreChannel interfaces)
– 16 cards of either type: processors + memory, or I/O
– Memory sits on the processor cards themselves, but all memory is accessed over the bus, so the machine is symmetric
– Higher bandwidth, higher latency bus

Scaling Up
(figure: "dance hall" organization, with processors and caches on one side of a scalable network and memory modules on the other, versus distributed memory, with a memory module attached to each processor)
– The problem is the interconnect: cost (crossbar) or bandwidth (bus)
– Dance hall: bandwidth is still scalable, at lower cost than a crossbar
• latencies to memory are uniform, but uniformly large
– Distributed memory, or non-uniform memory access (NUMA)
• construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)

Shared Address Space Programming Model
• Naming
– Any process can name any variable in the shared space
• Operations
– Loads and stores, plus the operations needed for ordering
• Simplest ordering model
– Within a process/thread: sequential program order
– Across threads: some interleaving (as in time-sharing)
– Additional orders are imposed through synchronization

Synchronization
• Mutual exclusion (locks)
– No ordering guarantees
• Event synchronization
– Ordering of events to preserve dependences
– e.g. producer -> consumer of data
– 3 main types:
• point-to-point
• global
• group

MP (Message Passing) Programming Model
(figure: a process on node A executes send(Y) and a process on node B executes receive(Y'); the message carries the value of Y from A into Y' on B)
(A minimal MPI send/receive sketch appears a few slides below.)

Message-Passing Programming Model
(figure: Send X, Q, t issued by process P matches Receive Y, P, t issued by process Q, copying from address X in P's local address space to address Y in Q's local address space)
– Send specifies the data buffer to be transmitted and the receiving process
– Recv specifies the sending process and the application storage to receive into
– Memory-to-memory copy, but the processes must be named
– A user process names only local data and entities in process/tag space
– In the simplest form, the send/recv match achieves a pairwise synchronization event
– Many overheads: copying, buffer management, protection

Message Passing Architectures
• A complete computer, including I/O, is the building block
– Communication via explicit I/O operations
• Programming model: directly access only the private address space (local memory); communication via explicit messages (send/receive)
• The programming model is further removed from basic hardware operations
– Library or OS intervention
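A minimal MPI sketch of the send/receive model above, assuming exactly two processes: rank 0 plays node A and sends Y, rank 1 plays node B and receives it into Y'. The payload value and the tag are arbitrary choices for illustration.

/* Point-to-point message passing: send(Y) on node A, receive(Y') on node B. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double Y = 3.14, Yprime = 0.0;    /* illustrative payload */
    const int tag = 99;               /* arbitrary message tag */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* node A: send Y; names the receiving process and the tag */
        MPI_Send(&Y, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* node B: receive into Y'; names the sending process and the tag */
        MPI_Recv(&Yprime, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %f\n", Yprime);
    }

    MPI_Finalize();
    return 0;
}

The matching of (source, tag) on the receive with (destination, tag) on the send is exactly the pairwise synchronization event described on the previous slide.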
Evolution of Message-Passing Machines
• Early machines: a FIFO on each link
– synchronous operations
– replaced by DMA, enabling non-blocking operations
• messages buffered by the system at the destination until the receive
• Diminishing role of topology
– store-and-forward routing
– the cost is now in the node-network interface
– this simplifies programming
(figure: hypercube interconnect with nodes labelled 000 through 111)

Example: IBM SP-2
(figure: each SP-2 node is essentially a complete RS6000 workstation, with a POWER2 CPU, L2 cache, memory bus, memory controller, and 4-way interleaved DRAM; the network interface card, containing an i860 and DRAM with DMA, sits on the MicroChannel I/O bus; nodes are connected by a general interconnection network formed from 8-port switches)
– Made out of essentially complete RS6000 workstations
– Network interface integrated on the I/O bus (bandwidth limited by the I/O bus)
• it does not need to see memory references

Example: Intel Paragon
(figure: each Paragon node has two i860 processors with L1 caches, a memory controller, 4-way interleaved DRAM, and a network interface with DMA on the 64-bit, 50 MHz memory bus; nodes attach to a 2D grid network with 8-bit, 175 MHz bidirectional links, the basis of Sandia's Intel Paragon XP/S supercomputer)
– Network interface integrated on the memory bus, for performance

Message Passing Programming Model
• Naming
– Processes can name their private data directly
– There is no shared address space
• Operations
– Explicit communication: send and receive
– Send transfers data from the private address space to another process
– Receive copies data from another process into the private address space
– Processes must be nameable
• Ordering
– Program order within a process
– Send and receive can provide point-to-point synchronization between processes
• A global address space can be constructed
– process number + address within that process's address space
– but there are no direct operations on these names

Design Issues Apply at All Layers
• The programming model's position provides constraints and goals for the system
• In fact, each interface between layers supports or takes a position on
– the naming model
– the set of operations on names
– the ordering model
– replication
– communication performance

Naming and Operations
• Naming and operations in the programming model can be supported directly by the lower levels, or translated by the compiler, libraries, or OS
• Example: a shared virtual address space in the programming model
– The hardware interface supports a shared physical address space
– Direct support by hardware through virtual-to-physical mappings, with no software layers

Naming and Operations (Cont'd)
• Hardware supports independent physical address spaces
– The system/user interface can still provide SAS through the OS
• virtual-to-physical mappings only for data that are local
• remote data accesses incur page faults and are brought in by page-fault handlers
– Or SAS is provided by compilers or the runtime, above the system/user interface

Naming and Operations (Cont'd)
• Example: implementing message passing
• Direct support at the hardware interface
– but matching and buffering benefit from more flexibility
• Support at the system/user interface or above, in software (almost always)
– The hardware interface provides basic data transport
– Send/receive are built in software for flexibility (protection, buffering)
– Or the lower interfaces provide SAS, and send/receive are built on top with buffers and loads/stores

Naming and Operations (Cont'd)
• The issues and trade-offs need to be examined at every layer
– frequencies and types of operations, and their costs
• Message passing
– No assumptions on ordering across processes except those imposed by send/receive pairs
• SAS
– How processes see the order of other processes' references defines the semantics of SAS
• Ordering is very important and subtle
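To illustrate why cross-thread ordering in a shared address space needs explicit synchronization, here is a minimal producer/consumer sketch using POSIX threads: the mutex and condition variable impose the producer-before-consumer order that plain loads and stores alone would not guarantee. The value produced is a placeholder.

/* Event synchronization (producer -> consumer) with a mutex and
 * condition variable. Compile with: cc -pthread order.c          */
#include <pthread.h>
#include <stdio.h>

static int data;                     /* shared location written by the producer */
static int ready = 0;                /* protected flag: has data been produced?  */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    pthread_mutex_lock(&m);
    data = 42;                       /* produce the value */
    ready = 1;
    pthread_cond_signal(&c);         /* event synchronization: wake the consumer */
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *consumer(void *arg) {
    pthread_mutex_lock(&m);
    while (!ready)                   /* wait until the producer's write is ordered before us */
        pthread_cond_wait(&c, &m);
    printf("consumed %d\n", data);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t p, q;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&q, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(q, NULL);
    return 0;
}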
Naming and Operations (Cont'd)
• Uniprocessors play tricks with orders to gain parallelism or locality
• These tricks are even more important in multiprocessors
• We need to understand which old tricks remain valid, and learn new ones
• And understand how programs behave, what they rely on, and the hardware implications

Message Passing Model vs. Shared Memory Model
              Message passing     Shared memory
Architecture  any                 SMP or DSM
Programming   difficult           easier
Performance   good                better (SMP), worse (DSM)
Cost          less expensive      very expensive (e.g. SunFire 15K: $4,140,830)

Parallelization Paradigms
• Task farming / master-worker
• Single-program multiple-data (SPMD)
• Pipelining
• Divide and conquer
• Speculation

Master-Worker/Slave Model
• The master decomposes the problem into small tasks, distributes them to the workers, and gathers the partial results to produce the final result. (A minimal MPI sketch appears at the end of these slides.)
• Mapping/load balancing can be
– static
– dynamic
• Dynamic balancing is used when the number of tasks is larger than the number of CPUs, when the tasks are only known at run time, or when the CPUs are heterogeneous.

Single-Program Multiple-Data (SPMD)
• The most commonly used model.
• Each process executes the same piece of code, but on a different part of the data: the data are split among the available processors.
• Also called geometric/domain decomposition or data parallelism.

Pipelining
• Suitable for fine-grained parallelism.
• Also suitable for applications involving multiple stages of execution that need to operate on a large number of data sets.

Divide and Conquer
• A problem is divided into two or more subproblems, each subproblem is solved independently, and the results are combined.
• Three operations: split, compute, and join.
• Master-worker/task-farming is like divide and conquer with the master doing both the split and the join.
• (Divide and conquer can be seen as a "hierarchical" master-worker technique.)

Speculative Parallelism
• Used when it is quite difficult to achieve parallelism through one of the previous paradigms.
• For problems with complex dependencies: use "look-ahead" execution.
• Another form: employing different algorithms to solve the same problem.
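To close, here is a minimal MPI sketch of the master-worker (task-farming) paradigm with demand-driven (dynamic) task distribution. The task itself (squaring an integer), the task count, and the stop sentinel are placeholder assumptions; at least two processes are required.

/* Master-worker: rank 0 hands out tasks on demand and gathers results;
 * the other ranks are workers. Tag 0 carries tasks, tag 1 carries results. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS 8                                  /* illustrative number of tasks */

static int do_task(int t) { return t * t; }       /* placeholder worker computation */

int main(int argc, char **argv) {
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);        /* assumes nprocs >= 2 */

    if (rank == 0) {                               /* master */
        int next = 0, active = 0, result, worker;
        MPI_Status st;
        int stop = -1;                             /* sentinel meaning "no more work" */

        for (worker = 1; worker < nprocs && next < NTASKS; worker++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, worker, 0, MPI_COMM_WORLD);   /* seed each worker */
        for (; worker < nprocs; worker++)
            MPI_Send(&stop, 1, MPI_INT, worker, 0, MPI_COMM_WORLD);   /* idle workers stop */

        while (active > 0) {                       /* gather results, refill on demand */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &st);
            printf("result: %d\n", result);
            active--;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            }
        }
    } else {                                       /* worker */
        int t, r;
        for (;;) {
            MPI_Recv(&t, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (t < 0) break;                      /* stop sentinel */
            r = do_task(t);
            MPI_Send(&r, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

Replacing the demand-driven refill with a fixed round-robin distribution of all NTASKS tasks up front would turn this into the static variant mentioned on the master-worker slide.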