Overview of Parallel Algorithms

Content
• Parallel Computing Model
• Basic Techniques of Parallel Algorithm Design

Von Neumann Model
[Figure: the von Neumann architecture — memory (MAR, MDR), input devices (keyboard, mouse, scanner, disk), output devices (monitor, printer, LED, disk), and a processing unit (ALU, temporary registers) driven by a control unit (PC, IR).]

Instruction Processing
• Fetch instruction from memory
• Decode instruction
• Evaluate address
• Fetch operands from memory
• Execute operation
• Store result

Parallel Computing Model
• A computational model
  • Bridges software and hardware
  • Provides an abstract architecture for algorithm design
• Ex) PRAM, BSP, LogP

Parallel Programming Model
• What does the programmer use to write code?
• Determines how communication and synchronization are expressed
• The communication primitives exposed to the programmer realize the programming model
• Ex) uniprocessor, multiprogramming, data parallel, message passing, shared address space

Aspects of Parallel Processing
[Figure: four layers of concern — (4) application developer, (3) algorithm developer (parallel computing model, parallel programming model), (2) system programmer (middleware), (1) architecture designer (multiprocessors: processors and memory connected by an interconnection network).]

Parallel Computing Models – Parallel Random Access Machine (PRAM)
Characteristics:
• Processors Pi (0 ≤ i ≤ p-1)
• Each processor has its own local memory
• A single global shared memory
• Every processor can access the shared memory

Illustration of PRAM
[Figure: p processors P1, P2, ..., Pp driven by a common clock and connected to a single shared memory. A single program is executed in MIMD mode; each processor has a unique index.]

Parallel Random Access Machine
Types of operation:
• Synchronous
  • Processors operate in lockstep
  • At each step, a processor is either working or idle
  • Suited to both SIMD and MIMD architectures
• Asynchronous
  • Each processor has a local clock; processors must be synchronized explicitly
  • Suited to MIMD architectures

Problems with PRAM
• A simplified description of real-world parallel systems
• Ignores many kinds of overhead
  • Latency, bandwidth, remote memory access, memory access conflicts, synchronization cost, etc.
• An algorithm with good theoretical performance on the PRAM may perform poorly in practice

Parallel Random Access Machine
Read/write conflicts:
• EREW: Exclusive Read, Exclusive Write
  • No concurrent operations (read or write) on the same variable
• CREW: Concurrent Read, Exclusive Write
  • Concurrent reads of the same variable are allowed
  • Writes are exclusive
• ERCW: Exclusive Read, Concurrent Write
• CRCW: Concurrent Read, Concurrent Write

Parallel Random Access Machine
Basic input/output operations:
• Global memory
  • global read (X, x)
  • global write (Y, y)
• Local memory
  • read (X, x)
  • write (Y, y)

Example: Sum on the PRAM model
Sum an array A of n = 2^k numbers on a PRAM with n processors.
Compute S = A(1) + A(2) + ... + A(n) by building a binary tree of partial sums.

Example: Sum on the PRAM model
At level 1, Pi sets B(i) = A(i).
At each level above 1, Pi computes B(i) = B(2i-1) + B(2i).
The result is S = B(1).
[Figure: the binary summation tree over processors P1..P8, with B(1)..B(8) initialized from A(1)..A(8) at the leaves.]

Example: Sum on the PRAM model
Algorithm for processor Pi (i = 1, 2, ..., n)
Input: A, an array of n = 2^k elements in global memory
Output: S = A(1) + A(2) + ... + A(n)
Local variables: n, the processor identity i
Begin
1. global read (A(i), a)
2. global write (a, B(i))
3. for h = 1 to log n do
     if (i ≤ n / 2^h) then begin
       global read (B(2i-1), x)
       global read (B(2i), y)
       z = x + y
       global write (z, B(i))
     end
4. if i = 1 then global write (z, S)
End
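To make the tree summation concrete, here is a minimal sketch (not part of the original slides) that simulates the log n synchronous PRAM steps on an ordinary shared-memory machine. OpenMP threads stand in for the PRAM processors, and the fresh buffer allocated per step stands in for the lockstep reads-before-writes of a PRAM step; the function name and array contents are illustrative assumptions.

// A minimal sketch: log(n)-step tree summation, simulated on one shared-memory machine.
#include <omp.h>
#include <cstdio>
#include <vector>

double tree_sum(std::vector<double> B) {          // B(i) = A(i) initially
    while (B.size() > 1) {
        std::vector<double> next(B.size() / 2);
        #pragma omp parallel for                  // one "processor" per pair
        for (long i = 0; i < (long)next.size(); ++i)
            next[i] = B[2 * i] + B[2 * i + 1];    // B(i) = B(2i-1) + B(2i), 0-based here
        B.swap(next);                             // implicit barrier = end of a PRAM step
    }
    return B[0];                                  // S = B(1)
}

int main() {
    std::vector<double> A = {4, 9, 1, 7, 8, 11, 2, 12};
    std::printf("sum = %g\n", tree_sum(A));       // prints 54
}

The double buffer avoids the in-place read/write conflict that the PRAM hides by definition: in a synchronous step all reads happen before any write.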
Other Distributed Models
• Distributed Memory Model
  • No global memory
  • Each processor has its own local memory
• Postal Model
  • A processor sends a request when it accesses non-local memory
  • The processor does not stall; it keeps working until the data arrives

Network Models
• Focus on the impact of the communication network topology
• The main concern of early parallel computing
• Distributed memory model
• The cost of a remote memory access depends on the topology and on the access pattern
• Provide efficient
  • Data mapping
  • Communication routing

LogP
• Motivated by the design of real parallel computers
• A model of distributed-memory multiprocessors
• Processors communicate by point-to-point messages
• The goal is to analyze the performance bottlenecks of a parallel machine
  • Characterizes the performance of the communication network
  • Helps guide data placement
  • Shows the importance of balanced communication

Model Parameters
• Latency (L)
  • The delay of sending a message from the source to the destination
  • Determined by hop count and hop delay
• Communication overhead (o)
  • The time a processor spends sending or receiving one message
• Communication bandwidth (g)
  • The minimum gap between consecutive messages
• Processor count (P)
  • The number of processors

LogP Model
[Figure: a message timeline — the sender pays overhead o, the message travels for latency L, the receiver pays overhead o, and successive messages are separated by the gap g.]

Bulk Synchronous Parallel
• Bulk Synchronous Parallel (BSP)
  • P processors, each with local memory
  • A router
  • Periodic global synchronization
• Accounts for
  • Limited bandwidth
  • Latency
  • Synchronization cost
• Does not account for
  • Communication overhead
  • Processor topology

BSP Computer
• A distributed-memory architecture
• Three kinds of components
  • Nodes
    • Processor
    • Local memory
  • Router (communication network)
    • Point-to-point message passing or shared variables
  • Barriers
    • Full or partial

Illustration of BSP
[Figure: nodes (processor + local memory, parameter w) connected by a communication network (parameter g), with a barrier (parameter l).]
w parameter: the maximum computation time within a superstep; computation consumes at most w clock cycles.
g parameter: the number of clock cycles needed to deliver one message unit when all processors are communicating, i.e., the network bandwidth; if h is the maximum number of messages sent or received in a superstep, the communication takes gh cycles.
l parameter: a barrier synchronization takes l clock cycles.

BSP Program
• A BSP computation consists of S supersteps
• A superstep is a sequence of computation steps followed by a barrier
• Superstep
  • Any remote memory access takes effect only after the barrier – loose synchronization

BSP Program
[Figure: two supersteps over processors P1..P4, each consisting of a computation phase, a communication phase, and a barrier.]

Example: Pregel
Pregel is a framework developed by Google (SIGMOD 2010)
• High scalability
• Fault tolerance
• Flexible enough to express arbitrary graph algorithms

Bulk Synchronous Parallel Model
[Figure: iterations in which each CPU computes on its portion of the data, exchanges data, and waits at a barrier before the next iteration.]

Graph
[Figure: an example graph.]

Entities and Supersteps
• A computation consists of vertices, edges, and a sequence of iterations (supersteps)
• Every vertex has a value
• Every edge has a source vertex, an edge value, and a destination vertex
• In every superstep:
  • A user-defined function F is executed on every vertex V
  • F reads the messages sent to V in superstep S - 1 and sends messages to other vertices, which will be received in superstep S + 1
  • F may modify the state of vertex V and of its outgoing edges
  • F can change the topology of the graph

Algorithm Termination
• Termination is decided by the vertices voting to halt
• In superstep 0, every vertex is active
• All active vertices participate in the computation of any given superstep
• When a vertex votes to halt, it becomes inactive
• A vertex re-enters the active state if it receives an external message
• The program terminates when all vertices are simultaneously inactive
[Figure: vertex state machine — an active vertex votes to halt and becomes inactive; a received message makes it active again.]

The Pregel API in C++
A Pregel program is written by subclassing the vertex class:

// Template arguments define the types for vertices, edges and messages
template <typename VertexValue, typename EdgeValue, typename MessageValue>
class Vertex {
 public:
  // Override Compute to define the computation at each superstep
  virtual void Compute(MessageIterator* msgs) = 0;
  const string& vertex_id() const;
  int64 superstep() const;
  const VertexValue& GetValue();        // get the value of the current vertex
  VertexValue* MutableValue();          // modify the value of the vertex
  OutEdgeIterator GetOutEdgeIterator();
  // Pass messages to other vertices
  void SendMessageTo(const string& dest_vertex, const MessageValue& message);
  void VoteToHalt();
};

Pregel Code for Finding the Max Value

class MaxFindVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    double currMax = GetValue();
    SendMessageToAllNeighbors(currMax);
    for ( ; !msgs->Done(); msgs->Next()) {
      if (msgs->Value() > currMax)
        currMax = msgs->Value();
    }
    if (currMax > GetValue())
      *MutableValue() = currMax;
    else
      VoteToHalt();
  }
};

Finding the Max Value in a Graph
[Figure: a four-vertex graph with initial values 3, 6, 2, 1. The number inside a node is its value, blue arrows are messages, and blue nodes have voted to halt; after a few supersteps every vertex holds the maximum value, 6.]
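The following is a minimal single-process sketch (not Google's implementation, and simplified: a vertex only re-sends when its value improves) of how the BSP supersteps and the vote-to-halt mechanism drive the max-value example. The adjacency lists, inbox/outbox vectors, and vertex values are hypothetical stand-ins for the framework's internal state.

// Simulating Pregel-style supersteps for max-value propagation on one machine.
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical undirected graph over 4 vertices, given as adjacency lists.
    std::vector<std::vector<int>> nbr = {{1}, {0, 2, 3}, {1, 3}, {1, 2}};
    std::vector<double> value = {3, 6, 2, 1};
    std::vector<bool> active(4, true);
    std::vector<std::vector<double>> inbox(4), outbox(4);

    for (int step = 0; ; ++step) {
        bool any_active = false;
        for (int v = 0; v < 4; ++v) {
            if (!active[v] && inbox[v].empty()) continue;      // halted, no messages
            active[v] = true;                                   // message reactivates
            double cur = value[v];
            for (double m : inbox[v]) if (m > cur) cur = m;     // fold incoming messages
            if (cur > value[v] || step == 0) {
                value[v] = cur;
                for (int u : nbr[v]) outbox[u].push_back(cur);  // send to neighbors
            } else {
                active[v] = false;                              // vote to halt
            }
            inbox[v].clear();
            any_active = any_active || active[v];
        }
        for (int v = 0; v < 4; ++v) { inbox[v].swap(outbox[v]); outbox[v].clear(); } // barrier
        if (!any_active) break;                                 // all inactive: terminate
    }
    for (int v = 0; v < 4; ++v) std::printf("vertex %d -> %g\n", v, value[v]); // all print 6
}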
Model Survey Summary
• No single model is acceptable!
• Across models, a subset of characteristics receives the focus of the majority of models:
  • Computational parallelism
  • Communication latency
  • Communication overhead
  • Communication bandwidth
  • Execution synchronization
  • Memory hierarchy
  • Network topology

Computational Parallelism
• Number of physical processors
• Static versus dynamic parallelism
  • Should the number of processors be fixed?
  • Fault-recovery networks allow for node failure
  • Many parallel systems allow incremental upgrades by increasing the node count

Latency
• Fixed or variable message length?
• Network topology?
• Communication overhead?
• Contention-based latency?
• Memory hierarchy?

Bandwidth
• A limited resource
• With low latency, there is a tendency to abuse bandwidth by flooding the network

Synchronization
• Solving a wide class of problems requires asynchronous parallelism
• Synchronization is achieved via message passing
• Synchronization is itself a communication cost

Unified Model?
• Difficult
  • Parallel machines are complicated and still evolving
  • Users come from diverse disciplines
• Requires a common set of characteristics derived from the needs of different users
• Again, a balance is needed between descriptiveness and prescriptiveness

Content
• Parallel Computing Model
• Basic Techniques of Parallel Algorithm Design
  • Concepts
  • Decomposition
  • Task
  • Mapping
  • Algorithm Model

Decomposition, Tasks, and Dependency Graphs
• The first step in designing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently
• A decomposition can be represented as a task dependency graph: nodes stand for tasks and edges for dependencies between tasks

Example: Multiplying a Dense Matrix with a Vector
Each element of the output vector y can be computed independently, so the matrix-vector product can be decomposed into n tasks.

Example: Database Query Processing
Execute the following query on the database below:
MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")

ID#   Model    Year  Color  Dealer  Price
4523  Civic    2002  Blue   MN      $18,000
3476  Corolla  1999  White  IL      $15,000
7623  Camry    2001  Green  NY      $21,000
9834  Prius    2001  Green  CA      $18,000
6734  Civic    2001  White  OR      $17,000
5342  Altima   2001  Green  FL      $19,000
3845  Maxima   2001  Blue   NY      $22,000
8354  Accord   2000  Green  VT      $18,000
4395  Civic    2001  Red    CA      $17,000
7352  Civic    2002  Red    WA      $18,000

Example: Database Query Processing
• Executing the query can be divided into tasks; each task can be thought of as producing an intermediate result that satisfies one of the conditions
• Edges indicate that the output of one task is the input of another

Example: Database Query Processing
The same problem can be decomposed in other ways, and different decompositions may differ significantly in performance.

Task Granularity
• The more tasks a decomposition produces, the finer its granularity; the fewer tasks, the coarser the granularity

Degree of Concurrency
• The number of tasks that can execute in parallel is the degree of concurrency of a decomposition
  • Maximum degree of concurrency
  • Average degree of concurrency
• The finer the task granularity, the larger the degree of concurrency

Task Interaction Graphs
• Tasks usually need to exchange data
• The graph that captures these exchanges among tasks is called the task interaction graph
• Task interaction graphs represent data dependencies; task dependency graphs represent control dependencies

Task Interaction Graphs: An Example
Multiplying a sparse matrix A by a vector b: computing each element of the result vector can be treated as an independent task.
Because of memory optimizations, b is partitioned among the tasks, and the resulting task interaction graph turns out to be the same as the graph of the matrix A.

Processes and Mapping
• The number of tasks usually exceeds the number of processing elements, so tasks must be mapped to processes
• An appropriate mapping is critical to the performance of a parallel algorithm
• The mapping is determined by both the task dependency graph and the task interaction graph
  • The task dependency graph is used to keep the work spread evenly across all processes at every point in time (minimum idling and optimal load balance)
  • The task interaction graph is used to keep the interaction between processes to a minimum (minimum communication)

Processes and Mapping: Example
Mapping the database query tasks to processes: tasks at the same level have no dependencies among them, so they can be assigned to different processes.

Decomposition Techniques
• Recursive decomposition
• Data decomposition
• Exploratory decomposition
• Speculative decomposition

Recursive Decomposition
• Suited to problems that can be solved by divide and conquer
• The given problem is first decomposed into a set of sub-problems
• The sub-problems are decomposed recursively until the desired task granularity is reached

Recursive Decomposition: Example
The classic example is quicksort. Once the list has been partitioned around the pivot, each sub-list can be processed concurrently (i.e., each sub-list represents an independent subtask). This can be repeated recursively; a sketch follows below.
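A minimal sketch (an assumption, not the course's code) of recursive decomposition for quicksort: after the sequential partition step, the two sub-lists become independent subtasks, here expressed as OpenMP tasks.

#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <vector>

void quicksort(std::vector<int>& a, int lo, int hi) {   // sorts the half-open range [lo, hi)
    if (hi - lo < 2) return;
    int pivot = a[hi - 1];
    int mid = lo;
    for (int i = lo; i < hi - 1; ++i)                    // partition around the pivot
        if (a[i] < pivot) std::swap(a[i], a[mid++]);
    std::swap(a[mid], a[hi - 1]);
    #pragma omp task shared(a)                           // left sub-list: independent subtask
    quicksort(a, lo, mid);
    #pragma omp task shared(a)                           // right sub-list: independent subtask
    quicksort(a, mid + 1, hi);
    #pragma omp taskwait
}

int main() {
    std::vector<int> a = {5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2};
    #pragma omp parallel
    #pragma omp single                                   // one thread spawns the task tree
    quicksort(a, 0, (int)a.size());
    for (int x : a) std::printf("%d ", x);
    std::printf("\n");
}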
Recursive Decomposition: Example
We first start with a simple serial loop for computing the minimum entry in a given list:

procedure SERIAL_MIN (A, n)
begin
  min := A[0];
  for i := 1 to n − 1 do
    if (A[i] < min) min := A[i];
  endfor;
  return min;
end SERIAL_MIN

Recursive Decomposition: Example
We can rewrite the loop as follows:

procedure RECURSIVE_MIN (A, n)
begin
  if (n = 1) then
    min := A[0];
  else
    lmin := RECURSIVE_MIN (A, n/2);
    rmin := RECURSIVE_MIN (&(A[n/2]), n - n/2);
    if (lmin < rmin) then
      min := lmin;
    else
      min := rmin;
    endelse;
  endelse;
  return min;
end RECURSIVE_MIN

Recursive Decomposition: Example
The code above can be illustrated by finding the minimum of {4, 9, 1, 7, 8, 11, 2, 12}. The task dependency graph is a binary tree of RECURSIVE_MIN calls.

Data Decomposition
• Partition the data and assign the partitions to different tasks
  • Input data partitioning
  • Intermediate data partitioning
  • Output data partitioning
    • Each element of the output can be computed independently

Output Data Decomposition: Example
Multiplying two n x n matrices A and B to produce the matrix C. The computation of the output matrix C can be decomposed into the following four tasks, one per block of C:
Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2

Output Data Decomposition: Example
For the same matrix multiplication, the following two decompositions into eight tasks can also be derived:

Decomposition I                           Decomposition II
Task 1: C1,1 = A1,1 B1,1                  Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1           Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2                  Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,2 B2,2           Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,1 B1,1                  Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,2 B2,1           Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2                  Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2           Task 8: C2,2 = C2,2 + A2,2 B2,2

Input Data Partitioning
• If the output is not known in advance, input partitioning can be used instead
• Each task processes a part of the input data and produces a partial result; the partial results are combined into the final result

Input Data Partitioning: Example
Counting the frequencies of itemsets in a set of transactions can use input data partitioning.

Partitioning Input and Output Data
Input and output partitioning can also be combined to obtain a higher degree of concurrency. For the transaction-counting example, the transaction set (input) and the itemset counts (output) can be partitioned at the same time.

Intermediate Data Partitioning
• A computation can usually be viewed as a sequence of transformations from the input to the output
• It can therefore pay off to decompose the intermediate results

Intermediate Data Partitioning: Example
Let us revisit the example of dense matrix multiplication.

Intermediate Data Partitioning: Example
A decomposition of the intermediate data structure leads to the following decomposition into 8 + 4 tasks:

Stage I
Task 01: D1,1,1 = A1,1 B1,1    Task 02: D2,1,1 = A1,2 B2,1
Task 03: D1,1,2 = A1,1 B1,2    Task 04: D2,1,2 = A1,2 B2,2
Task 05: D1,2,1 = A2,1 B1,1    Task 06: D2,2,1 = A2,2 B2,1
Task 07: D1,2,2 = A2,1 B1,2    Task 08: D2,2,2 = A2,2 B2,2

Stage II
Task 09: C1,1 = D1,1,1 + D2,1,1    Task 10: C1,2 = D1,1,2 + D2,1,2
Task 11: C2,1 = D1,2,1 + D2,2,1    Task 12: C2,2 = D1,2,2 + D2,2,2

Intermediate Data Partitioning: Example
The task dependency graph for this decomposition into 12 tasks connects each Stage I task to the Stage II task that consumes its result.

Exploratory Decomposition
• In many cases, the decomposition unfolds as the execution proceeds
• Such applications typically involve searching a state space of solutions
• Suitable applications include combinatorial optimization, theorem proving, games, etc.

Exploratory Decomposition: Example
The 15 puzzle (a tile puzzle).

Exploratory Decomposition: Example
Generate the successor states of the current state and treat the search from each of them as an independent task.

Speculative Decomposition
• In some applications, dependencies between tasks are not known in advance
• Two approaches:
  • Conservative approaches identify independent tasks only when they are guaranteed to have no dependencies
  • Optimistic approaches schedule tasks even when they may turn out to be erroneous
• Conservative approaches may yield little concurrency; optimistic approaches may require a roll-back mechanism

Speculative Decomposition: Example
Simulating a network (for example a production line or a computer network). The task is to simulate the behavior of the network for different inputs and node parameters (such as delay).

Hybrid Decompositions
• In quicksort, recursive decomposition alone limits concurrency; data decomposition can be combined with recursive decomposition
• Discrete event simulation can combine data decomposition with speculative decomposition
• Finding the minimum can combine data decomposition with recursive decomposition

Task Characteristics
• Task characteristics influence the choice of a parallel algorithm and its performance
  • Task generation
  • Task granularity
  • The size of the data associated with a task

Task Generation
• Static task generation
  • Examples: matrix operations, graph algorithms, image processing applications, and other structured problems
  • The decomposition is usually based on data decomposition or recursive decomposition
• Dynamic task generation
  • An example is the 15 puzzle – each board is generated from the previous one
  • Such applications are typically decomposed using exploratory or speculative techniques; a small sketch of dynamic task generation follows below
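A minimal sketch (a toy state space, not the 15 puzzle itself, and not from the slides) of dynamic task generation for exploratory decomposition: each state spawns one new task per successor state, so tasks come into existence as the search unfolds rather than up front. The successor function and depth limit are illustrative assumptions.

#include <omp.h>
#include <atomic>
#include <cstdio>
#include <vector>

std::atomic<long> leaves{0};

// Hypothetical successor function: a "state" is just a depth counter,
// and each state has two successors until a fixed depth is reached.
std::vector<int> successors(int depth) {
    if (depth == 4) return {};            // leaf state
    return {depth + 1, depth + 1};
}

void explore(int state) {
    std::vector<int> next = successors(state);
    if (next.empty()) { leaves++; return; }
    for (int s : next) {
        #pragma omp task firstprivate(s)  // a new task per successor state
        explore(s);
    }
    #pragma omp taskwait
}

int main() {
    #pragma omp parallel
    #pragma omp single
    explore(0);
    std::printf("leaf states explored: %ld\n", leaves.load()); // prints 16
}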
Task Sizes
• Task sizes may be uniform or non-uniform
• Example: in combinatorial optimization problems it is difficult to estimate the size of the state space, and hence the size of each task

Size of Data Associated with Tasks
• The size of the data associated with a task may be small or large relative to the size of the task's computation
• A small context implies that an algorithm can easily communicate the task to other processes dynamically (e.g., the 15 puzzle)
• A large context ties the task to a process; alternatively, an algorithm may attempt to reconstruct the context at another process rather than communicating it

Characteristics of Task Interactions
• Tasks may communicate with each other in various ways. The associated dichotomy is:
• Static interactions
  • The tasks and their interactions are known a priori. These are relatively simple to code into programs.
• Dynamic interactions
  • The timing or the set of interacting tasks cannot be determined a priori. These interactions are harder to code, especially, as we shall see, using message-passing APIs.

Characteristics of Task Interactions
• Regular interactions
  • There is a definite pattern (in the graph sense) to the interactions. These patterns can be exploited for efficient implementation.
• Irregular interactions
  • Interactions lack well-defined topologies.

Characteristics of Task Interactions: Example
A simple example of a regular static interaction pattern is image dithering. The underlying communication pattern is a structured (2-D mesh) one.

Characteristics of Task Interactions: Example
The multiplication of a sparse matrix with a vector is a good example of a static irregular interaction pattern: the interaction pattern follows the sparsity structure of the matrix.

Characteristics of Task Interactions
• Interactions may be read-only or read-write
  • In read-only interactions, tasks only read data items associated with other tasks
  • In read-write interactions, tasks both read and modify data items associated with other tasks

Mapping
• Mapping techniques for load balancing
  • Static and dynamic mapping
• Methods for minimizing interaction overheads
  • Maximizing data locality
  • Minimizing contention and hot spots
  • Overlapping communication and computation
  • Replication vs. communication
  • Group communication vs. point-to-point communication
• Parallel algorithm design models
  • Data-parallel, work-pool, task-graph, master-slave, pipeline, and hybrid models

Mapping Techniques
• Mappings must minimize overheads
• The primary overheads are communication and idling
• Minimizing both often means pursuing conflicting objectives
  • Assigning all work to one processor trivially minimizes communication, at the expense of significant idling

Mapping Techniques for Minimum Idling
A mapping must minimize idling and balance the load at the same time; merely balancing the load does not minimize idling.

Mapping Techniques for Minimum Idling
Mapping techniques can be static or dynamic.
• Static mapping
  • Tasks are mapped to processes a priori
  • For this to work, we must have a good estimate of the size of each task; even then, the mapping problem may be NP-complete
• Dynamic mapping
  • Tasks are mapped to processes at runtime
  • This may be because the tasks are generated at runtime or because their sizes are not known in advance

Schemes for Static Mapping
• Mappings based on data partitioning
• Mappings based on task-graph partitioning
• Hybrid mappings

Mappings Based on Data Partitioning
The simplest data decomposition schemes for dense matrices are 1-D block distribution schemes, illustrated by the sketch below.
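A minimal sketch (an assumption, not from the slides) of a 1-D block distribution: process p of P owns a contiguous band of rows, with the remainder rows spread over the first few processes. The function name is illustrative.

#include <cstdio>

// Compute the half-open row range [lo, hi) owned by process p out of P for n rows.
void block_range(int n, int P, int p, int* lo, int* hi) {
    int base = n / P, extra = n % P;
    *lo = p * base + (p < extra ? p : extra);
    *hi = *lo + base + (p < extra ? 1 : 0);
}

int main() {
    int n = 10, P = 4;
    for (int p = 0; p < P; ++p) {
        int lo, hi;
        block_range(n, P, p, &lo, &hi);
        std::printf("process %d owns rows [%d, %d)\n", p, lo, hi);
    }
    // prints [0,3) [3,6) [6,8) [8,10)
}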
Block Array Distribution Schemes
Block distribution schemes can be generalized to higher dimensions as well.

Block Array Distribution Schemes: Examples
• For multiplying two dense matrices A and B, we can partition the output matrix C using a block decomposition
• For load balance, we give each task the same number of elements of C (note that each element of C corresponds to a single dot product)

Cyclic and Block Cyclic Distributions
• If the amount of computation associated with data items varies, a block decomposition may lead to significant load imbalances
• A simple example of this is LU decomposition (Gaussian elimination) of dense matrices

LU Factorization of a Dense Matrix
A decomposition of LU factorization into 14 tasks; notice the significant load imbalance.
[Figure: the 14 tasks of a blocked LU factorization.]

Block Cyclic Distributions
• A variation of the block distribution scheme that can be used to alleviate the load-imbalance and idling problems
• Partition the array into many more blocks than the number of available processes
• Blocks are assigned to processes in a round-robin manner so that each process gets several non-adjacent blocks

Mappings Based on Task Partitioning
• Partition a given task dependency graph across processes
• Determining an optimal mapping for a general task dependency graph is an NP-complete problem

Task Partitioning: Mapping a Binary Tree Dependency Graph
The example illustrates the dependency graph of one view of quicksort and how it can be assigned to the processes of a hypercube.

Hierarchical Mappings
• Sometimes a single mapping technique is inadequate
• For example, the task mapping of the binary tree (quicksort) cannot use a large number of processors
• For this reason, task mapping can be used at the top level and data partitioning within each level

An example of task partitioning at the top level with data partitioning at the lower level.

Schemes for Dynamic Mapping
• Dynamic mapping is sometimes also referred to as dynamic load balancing, since load balancing is the primary motivation for dynamic mapping
• Dynamic mapping schemes can be centralized or distributed

Centralized Dynamic Mapping
• Processes are designated as masters or slaves
• When a process runs out of work, it requests more work from the master
• As the number of processes increases, the master may become a bottleneck
• To alleviate this, a process may pick up several tasks (a chunk) at a time; this is called chunk scheduling (see the shared-memory sketch after the next slide)
• Selecting large chunk sizes may in turn lead to significant load imbalances
• A number of schemes gradually decrease the chunk size as the computation progresses

Distributed Dynamic Mapping
• Each process can send work to or receive work from other processes
• This alleviates the bottleneck of centralized schemes
• There are four critical questions:
  • How are sending and receiving processes paired together?
  • Who initiates the work transfer?
  • How much work is transferred?
  • When is a transfer triggered?
• The answers to these questions are generally application specific.
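Referring back to the chunk scheduling idea above, here is a minimal shared-memory analogue (an assumption, not a message-passing master/slave program): OpenMP's dynamic schedule hands each thread a new chunk of iterations whenever it runs out of work, which is the same dynamic-mapping idea in miniature.

#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000;
    double sum = 0.0;
    // chunks of 16 iterations are handed out on demand, balancing uneven work
    #pragma omp parallel for schedule(dynamic, 16) reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        double work = 0.0;
        for (int j = 0; j <= i % 50; ++j) work += j;   // deliberately uneven task sizes
        sum += work;
    }
    std::printf("sum = %g, max threads = %d\n", sum, omp_get_max_threads());
}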
Minimizing Interaction Overheads
• Maximize data locality
  • Where possible, reuse intermediate data; restructure the computation so that data can be reused within smaller time windows
• Minimize the volume of data exchanged
  • There is a cost associated with each word that is communicated
• Minimize the frequency of interactions
  • There is a startup cost associated with each interaction
• Minimize contention and hot spots
  • Use decentralized techniques and replicate data where necessary

Minimizing Interaction Overheads (cont.)
• Overlap computation with interactions
  • Use non-blocking communication, multithreading, and prefetching to hide latencies
• Replicate data or computations
• Use group communication instead of point-to-point primitives
• Overlap interactions with other interactions

Parallel Algorithm Models
An algorithm model is essentially a choice of decomposition and mapping techniques that reduces the interaction between tasks.
• Data-Parallel Model
  • Tasks are statically mapped to processes and each task performs similar operations on different data
• Task Graph Model
  • The task dependency graph governs the interactions between tasks and is exploited to promote locality or to reduce interaction costs

Parallel Algorithm Models (cont.)
• Master-Slave Model
  • One or more processes generate work and allocate it to worker processes, either statically or dynamically
• Pipeline / Producer-Consumer Model
  • A stream of data passes through a succession of processes, each of which performs some task on it; a small sketch follows below
• Hybrid Models
  • Several models are combined hierarchically, or applied sequentially to different phases, to solve a problem
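A minimal sketch (an assumption, not from the slides) of the pipeline / producer-consumer model: one thread produces items into a shared queue and another consumes them, so the two stages of the pipeline overlap in time. The queue, flag, and item values are illustrative.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void producer() {
    for (int i = 1; i <= 5; ++i) {
        { std::lock_guard<std::mutex> lk(m); q.push(i * i); }   // stage 1: produce an item
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }         // signal end of the stream
    cv.notify_one();
}

void consumer() {
    while (true) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !q.empty() || done; });
        if (q.empty() && done) break;
        int x = q.front(); q.pop();
        lk.unlock();
        std::printf("consumed %d\n", x);                        // stage 2: consume the item
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}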