FOUNDATION TO PARALLEL PROGRAMMING

CONTENT
• Introduction to parallel programming
• Parallel programming models
• Parallel programming paradigms

Parallel Programming is a Complex Task
• Problems facing parallel software developers:
– Non-determinism
– Communication
– Synchronization
– Partitioning and distribution
– Load balancing
– Fault tolerance
– Race conditions
– Deadlock
– ...

Levels of Parallelism
(Figure: tasks i-1, i, i+1, function calls, array statements and instructions illustrate the levels, from PVM/MPI at the top down to the CPU.)
• Large grain (task level): program, parallelized with PVM/MPI
• Medium grain (control level): function (thread), parallelized with threads
• Fine grain (data level): loop, parallelized by the compiler
• Very fine grain (multiple issue): instructions, parallelized in hardware

Responsible for Parallelization

    Grain Size   Code Item                                  Parallelised by
    Very fine    Instruction                                Processor
    Fine         Loop / instruction block                   Compiler
    Medium       Function ("standard one page")             Programmer
    Large        Program / separate heavy-weight process    Programmer

Parallelization Procedure
• Decomposition: the sequential computation is broken into tasks
• Assignment: the tasks are assigned to process elements
• Orchestration: the process elements are coordinated (communication and synchronization)
• Mapping: the process elements are mapped onto processors

Sample Sequential Program: FDM (Finite Difference Method)

    ...
    loop {
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] +
                                 a[i-1][j] + a[i+1][j] + a[i][j]);
            }
        }
    }
    ...

Parallelize the Sequential Program
• Decomposition: the same FDM loop; each inner-loop update of a[i][j] is a task (an OpenMP sketch of this loop appears after the Synchronization slide below).
• Assignment: divide the tasks equally among the process elements (PEs).
• Orchestration: the PEs need to communicate and to synchronize.
• Mapping: the PEs are mapped onto the processors of a multiprocessor.

Parallel Programming Models
• Sequential programming model
• Shared memory model (shared address space model)
– DSM
– Threads/OpenMP (enabled for clusters)
– Cilk
– Java threads
• Message passing model
– PVM
– MPI
• Functional programming
– MapReduce

Parallel Programming Models (continued)
• Partitioned Global Address Space (PGAS) languages
– UPC, Coarray Fortran, Titanium
• Languages and paradigms for hardware accelerators
– CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL

Trends
(Figure: scalar and vector applications; MPP systems with message passing (MPI) on distributed memory; multi-core nodes with OpenMP on shared memory; accelerators (GPGPU, FPGA) with CUDA, OpenCL; combined into hybrid codes.)

Sequential Programming Model
• Functional view
– Naming: can name any variable in the virtual address space; hardware (and perhaps compilers) does the translation to physical addresses
– Operations: loads and stores
– Ordering: sequential program order
• Performance
– Relies (mostly) on dependences on a single location: dependence order
– Compiler: reordering and register allocation
– Hardware: out-of-order execution, pipeline bypassing, write buffers
– Transparent replication in caches

SAS (Shared Address Space) Programming Model
(Figure: two threads (processes) in one system; one executes write(X) and the other read(X) on a shared variable X.)

Shared Address Space Programming Model
• Naming: any process can name any variable in the shared space
• Operations: loads and stores, plus those needed for ordering
• Simplest ordering model:
– Within a process/thread: sequential program order
– Across threads: some interleaving (as in time-sharing)
– Additional orders through synchronization

Synchronization
• Mutual exclusion (locks)
– No ordering guarantees
• Event synchronization
– Ordering of events to preserve dependences
– e.g. producer -> consumer of data (see the sketch below)
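As a concrete illustration of the two synchronization mechanisms above, here is a minimal sketch in C with POSIX threads, assuming a single shared value: the lock provides mutual exclusion, and the condition variable provides the producer -> consumer event ordering. The names (data, ready, producer, consumer) are chosen for the example, not taken from the slides.

    /* Producer -> consumer event synchronization: a sketch with POSIX threads.
     * The mutex gives mutual exclusion; the condition variable enforces the
     * ordering "data is produced before it is consumed". */
    #include <pthread.h>
    #include <stdio.h>

    static int data;              /* shared variable (the "X" of the SAS model) */
    static int ready = 0;         /* event flag: has the producer finished?     */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);   /* mutual exclusion around shared state */
        data = 42;                /* produce the data                     */
        ready = 1;
        pthread_cond_signal(&c);  /* event: wake the consumer             */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);
        while (!ready)                    /* wait until the event has happened */
            pthread_cond_wait(&c, &m);
        printf("consumed %d\n", data);    /* dependence on the producer is preserved */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, q;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&q, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(q, NULL);
        return 0;
    }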
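The FDM loop from the decomposition example can be parallelized in this shared-address-space style with OpenMP. Note that the loop on the slide updates a[][] in place (a Gauss-Seidel-style sweep with loop-carried dependences), so this sketch assumes a Jacobi-style variant that writes to a separate array, which makes the row-wise decomposition safe; the array size, sweep count and boundary handling are illustrative.

    /* OpenMP sketch of the FDM stencil: decomposition = one task per grid point,
     * assignment = rows divided among threads by the parallel-for schedule.
     * Assumes a Jacobi-style update (reads a, writes a_new) so iterations are
     * independent; the in-place update on the slide would need e.g. red-black
     * ordering instead. */
    #include <stdio.h>

    #define N 512
    static double a[N + 2][N + 2], a_new[N + 2][N + 2];  /* halo of boundary values */

    int main(void)
    {
        for (int sweep = 0; sweep < 100; sweep++) {      /* the outer "loop{}" */
            #pragma omp parallel for                      /* rows shared among threads */
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++)
                    a_new[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] +
                                         a[i-1][j] + a[i+1][j] + a[i][j]);

            #pragma omp parallel for                      /* copy back for the next sweep */
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++)
                    a[i][j] = a_new[i][j];
        }
        printf("a[1][1] = %f\n", a[1][1]);
        return 0;
    }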
MP (Message Passing) Programming Model
(Figure: a process on Node A executes send(Y); a process on Node B executes receive(Y'); a message carries Y into Y'.)

Message-Passing Programming Model
(Figure: process P executes "Send X, Q, t" from address X in its local address space; process Q executes "Receive Y, P, t" into address Y in its local address space; the two are matched on process and tag t.)
• Send specifies the data buffer to be transmitted and the receiving process
• Receive specifies the sending process and the storage into which the received data is placed
• A user process can only name local variables and entities in its own address space
• There are many overheads: copying, buffer management, protection
(An MPI sketch of a matched send/receive pair appears after the CUDA slides below.)

Message Passing Programming Model
• Naming
– A process can directly name only its local variables
– There is no shared address space
• Operations
– Explicit communication: send and receive
– Send transfers data from the private address space to another process
– Receive copies data into the private address space
– It must be possible to name processes
• Ordering
– Within a process, order is determined by the program
– Send and receive provide point-to-point synchronization between processes
• A global address space can be constructed
– e.g., process id + address within the process address space
– But there are no direct operations on it

Functional Programming
• Function operations do not modify data structures; they create new ones
• The original data is never changed
• Data flow is not fixed explicitly in the program design
• The order of operations does not matter

    fun foo(l: int list) = sum(l) + mul(l) + length(l)

The order of sum(), mul(), etc. does not matter: they do not modify l.

GPU
• Graphics Processing Unit
• A GPU consists of a large number of cores, e.g. hundreds of cores
• A typical CPU, by contrast, has 2, 4, 8 or 12 cores
• Cores? Processing units on a chip that share at least memory or the L1 cache
• General-purpose computation using the GPU (GPGPU) in applications other than 3D graphics
• The GPU accelerates the critical path of the application

CPU vs GPU
(Figure: comparison of CPU and GPU architectures.)

GPU and CPU
• Typically the GPU and CPU coexist in a heterogeneous setting
• The "less" computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism)
• NVIDIA's GPU architecture is called the CUDA (Compute Unified Device Architecture) architecture, accompanied by the CUDA programming model and the CUDA C language

What is CUDA?
• CUDA: Compute Unified Device Architecture
• A parallel computing architecture developed by NVIDIA: the computing engine in the GPU
• CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs

Processing Flow
The CUDA processing flow:
1. Copy data from main memory to GPU memory
2. The CPU launches the computation (kernel) on the GPU
3. The GPU executes it in parallel on each core
4. Copy the results from GPU memory back to main memory

CUDA Programming Model
• Definitions: device = GPU; host = CPU; kernel = function that runs on the device
• A kernel is executed by a grid of thread blocks
• A thread block is a batch of threads that can cooperate with each other by:
– Sharing data through shared memory
– Synchronizing their execution
• Threads from different blocks cannot cooperate

CUDA Kernels and Threads
• Parallel portions of an application are executed on the device as kernels
– One kernel is executed at a time
– Many threads execute each kernel
• Differences between CUDA threads and CPU threads
– CUDA threads are extremely lightweight: very little creation overhead, instant switching
– CUDA uses thousands of threads to achieve efficiency; multi-core CPUs can use only a few

Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
• All threads run the same code
• Each thread has an ID that it uses to compute memory addresses and make control decisions

Minimal Kernels / Manage Memory / CPU vs GPU
(The example code and figures from these slides did not survive extraction; a minimal kernel sketch follows below. © NVIDIA Corporation 2009)
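Since the "Minimal Kernels" slide content is missing, the following is only a stand-in sketch in CUDA C, in the spirit of the preceding slides: a grid of thread blocks runs the same kernel, and each thread uses its ID to compute the memory address it operates on. The kernel name, problem size, and the use of unified memory are choices made for this example, not taken from the original slide.

    /* Minimal CUDA kernel sketch: each thread uses its ID to pick one element. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread ID       */
        if (i < n)                                      /* grid may overshoot n   */
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        /* Unified memory keeps the sketch short; explicit cudaMemcpy calls would
         * mirror the four-step processing flow on the earlier slide. */
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks = (n + threads - 1) / threads;   /* grid of thread blocks   */
        vec_add<<<blocks, threads>>>(a, b, c, n);   /* host launches the kernel */
        cudaDeviceSynchronize();                    /* wait for the device      */

        printf("c[0] = %f\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }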
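Returning to the Message-Passing Programming Model slides above, here is a minimal MPI sketch of a matched send/receive pair, assuming two processes: rank 0 plays the role of process P (it names the data buffer, the destination and a tag), and rank 1 plays process Q (it names the source and the storage for the incoming data). The tag value and variable names are illustrative.

    /* Minimal MPI sketch of matched send/receive: rank 0 names the data buffer
     * and the receiving process; rank 1 names the sending process and the
     * storage for the incoming data. Run with: mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, x, y;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                /* process P: send X to process 1, tag 99 */
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {         /* process Q: receive from process 0 into Y */
            MPI_Recv(&y, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", y);  /* send/recv also synchronizes the pair */
        }

        MPI_Finalize();
        return 0;
    }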
Partitioned Global Address Space
• Most parallel programs are written using either:
– Message passing with an SPMD model (MPI)
  • Usually for scientific applications with C++/Fortran
  • Scales easily
– Shared memory with threads in OpenMP, threads + C/C++/Fortran, or Java
  • Usually for non-scientific applications
  • Easier to program, but less scalable performance
• Partitioned Global Address Space (PGAS) languages take the best of both:
– SPMD parallelism like MPI
– Local/global distinction, i.e., layout matters
– Global address space like threads (programmability)

How does PGAS compare to other models?
(Figure: processes/threads and address spaces for each model.)
• Message passing (MPI): each process has its own address space
• Shared memory (OpenMP): all threads share a single address space
• PGAS (UPC, CAF, X10): one global address space, partitioned among the threads
• Computation is performed at multiple places
• A place holds data that can be operated on by remote processes
• Data lives at the place where it was created, for its whole lifetime
• Data at one place may point to data at another place
• Data structures (e.g. arrays) may be distributed across many places
• A place expresses locality

PGAS Overview
• "Partitioned Global View" (or PGAS)
– Global address space: every thread can see all of the data, so there is no need to replicate data
– Partitioned: the global address space is divided, so the programmer is aware of data sharing among the threads
• Implementations
– GA library from PNNL
– Unified Parallel C (UPC), Fortran 2008 coarrays
– X10, Chapel
• Concepts
– Memories and structures
– Partition and mapping
– Threads and affinity
– Local and non-local accesses
– Collective operations and "owner computes"

Memories and Distributions
• Software memory: a distinct logical storage area in a computer program (e.g., heap or stack); for parallel software we use multiple memories
• Structure: a collection of data created by program execution (arrays, trees, graphs, etc.)
• Partition: division of a structure into parts
• Mapping: assignment of structure parts to memories

Software Memory Examples
• Executable image (figure): "program linked, loaded and ready to run"
• Memories
– Static memory: the data segment
– Heap memory: holds allocated structures; explicitly managed by the programmer (malloc, free)
– Stack memory: holds function call records; implicitly managed by the runtime during execution

Affinity and Nonlocal Access
• Affinity is the association between a thread and a memory
– If a thread has affinity with a memory, it can access the structures in that memory
– Such memories are called local memories
• Nonlocal access (example)
– Thread 0 needs part B
– Part B is in memory 1
– Thread 0 has no affinity with memory 1
– Nonlocal accesses are usually implemented by communication between processes and are therefore expensive

Threads and Memories for Different Programming Methods

    Method         Thread Count           Memory Count         Nonlocal Access
    Sequential     1                      1                    N/A
    OpenMP         either 1 or p          1                    N/A
    MPI            p                      p                    No; messages required
    CUDA           1 (host) + p (device)  2 (host + device)    No; DMA required
    UPC, FORTRAN   p                      p                    Supported
    X10            n                      p                    Supported

Hybrid (MPI + OpenMP + CUDA + ...)
• Takes the best of all models
• Exploits the memory hierarchy
• Many HPC applications are adopting this model
• Mainly due to developer inertia: it is hard to rewrite millions of lines of source

Hybrid parallel programming
• Python: ensemble simulations
• MPI: domain partition
• OpenMP: outer-loop partition
• CUDA: assign inner-loop iterations to GPU threads
(A minimal MPI + OpenMP sketch follows.)
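A minimal sketch of the hybrid style shown above, assuming a sum over a large index range as a stand-in workload: MPI partitions the domain across processes (one per node), and OpenMP partitions the local loop across the cores of each node; the CUDA level is omitted to keep the sketch short.

    /* Hybrid MPI + OpenMP sketch: MPI splits the domain across processes (one per
     * node), OpenMP splits the local loop across the cores of that node.
     * Compile e.g.: mpicc -fopenmp hybrid.c */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000                /* global problem size (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, size, provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;                   /* MPI level: domain partition */
        int lo = rank * chunk;
        int hi = (rank == size - 1) ? N : lo + chunk;

        double local = 0.0, global = 0.0;
        #pragma omp parallel for reduction(+:local)  /* OpenMP level: loop partition */
        for (int i = lo; i < hi; i++)
            local += (double)i * i;             /* stand-in for the real inner work */

        /* combine the per-node partial results */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of squares = %.0f\n", global);

        MPI_Finalize();
        return 0;
    }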
Design Issues Apply at All Layers
• The programming model's position provides constraints/goals for the system
• In fact, each interface between layers supports or takes a position on:
– Naming model
– Set of operations on names
– Ordering model
– Replication
– Communication performance

Naming and Operations
• Naming and operations in the programming model can be supported directly by lower levels, or translated by the compiler, libraries or the OS
• Example: shared virtual address space in the programming model
– The hardware interface supports a shared physical address space
– Direct support by hardware through virtual-to-physical mappings, no software layers

Naming and Operations (Cont'd)
• The hardware supports independent physical address spaces
– The system/user interface can provide SAS through the OS: virtual-to-physical mappings exist only for data that are local; remote data accesses incur page faults and are brought in via page-fault handlers
– Or SAS is provided by compilers or the runtime, i.e., above the system/user interface

Naming and Operations (Cont'd)
• Example: implementing message passing
– Direct support at the hardware interface
– Support at the system/user interface or above, in software (almost always)
• The hardware interface provides basic data transport
– Send/receive are built in software for flexibility (protection, buffering)
– Or lower interfaces provide SAS, and send/receive are built on top with buffers and loads/stores

Naming and Operations (Cont'd)
• Need to examine the issues and tradeoffs at every layer
– Frequencies and types of operations, and their costs
• Message passing
– No assumptions on ordering across processes except those imposed by send/receive pairs
• SAS
– How processes see the order of other processes' references defines the semantics of SAS
– Ordering is very important and subtle

Ordering Model
• Uniprocessors play tricks with orders to gain parallelism or locality
• These tricks are even more important in multiprocessors
• Need to understand which old tricks are still valid, and learn new ones
• How programs behave, what they rely on, and the hardware implications

Parallelization Paradigms
• Task-farming / master-worker
• Single-Program Multiple-Data (SPMD)
• Pipelining
• Divide and conquer
• Speculation

Master-Worker/Slave Model
• The master decomposes the problem into small tasks, distributes the tasks to workers for execution, and then collects the partial results to form the final result (an MPI sketch appears at the end of this section)
• Mapping/load balancing
– Static
– Dynamic

Single-Program Multiple-Data (SPMD)
• Every process executes the same code but operates on different data
• Domain decomposition, data parallelism

Pipelining
• Suitable for:
– Fine-grained parallelism
– Applications that execute in multiple stages

Divide and Conquer
• The problem is split into several sub-problems; each sub-problem is solved independently, and the partial results are merged
• Three operations: split, compute, and join
• Master-worker/task-farming is similar to divide and conquer, with the master performing the split and join operations
• In form it resembles a hierarchical master-worker scheme (an OpenMP task sketch appears at the end of this section)

Speculative Parallelism
• Suitable when there are complex dependences among the sub-problems
• Uses "look ahead" execution
• Uses multiple algorithms to solve the same problem
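The master-worker paradigm above can be sketched in MPI with dynamic load balancing: the master decomposes the problem into small tasks, hands one task at a time to each worker, and sends the next task to whichever worker returns a result first. The task itself (squaring an integer), the tags and the task count are placeholders for a real workload.

    /* Master-worker (task farming) sketch in MPI with dynamic load balancing.
     * Rank 0 is the master; it decomposes the problem into NTASKS small tasks,
     * distributes them on demand, and collects the results.
     * Run with: mpirun -np 4 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS 20
    #define TAG_WORK 1
    #define TAG_STOP 2

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                                   /* master */
            int next = 0, received = 0, result, total = 0;
            MPI_Status st;
            /* seed every worker with one task */
            for (int w = 1; w < size && next < NTASKS; w++, next++)
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            /* collect results; hand the next task to whoever answered */
            while (received < next) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                received++;
                total += result;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                }
            }
            /* tell all workers to stop, then report the combined result */
            for (int w = 1; w < size; w++)
                MPI_Send(&w, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            printf("sum of squares 0..%d = %d\n", NTASKS - 1, total);
        } else {                                           /* worker */
            int task, result;
            MPI_Status st;
            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                result = task * task;                      /* the "small task" */
                MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }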
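The split/compute/join structure of divide and conquer can be sketched with OpenMP tasks; a recursive array sum keeps the example short. The cutoff, array contents and function name are illustrative, and without OpenMP the same code simply runs sequentially.

    /* Divide-and-conquer sketch with OpenMP tasks: split the range in two,
     * compute the halves as independent tasks, then join (add) their results. */
    #include <stdio.h>

    #define N 1000000
    #define CUTOFF 10000                  /* below this size, compute sequentially */

    static double a[N];

    static double dc_sum(int lo, int hi)
    {
        if (hi - lo <= CUTOFF) {          /* compute: small enough, do it directly */
            double s = 0.0;
            for (int i = lo; i < hi; i++)
                s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;     /* split */
        double left, right;
        #pragma omp task shared(left)     /* each sub-problem solved independently */
        left = dc_sum(lo, mid);
        #pragma omp task shared(right)
        right = dc_sum(mid, hi);
        #pragma omp taskwait              /* join */
        return left + right;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = 1.0;

        double total;
        #pragma omp parallel
        #pragma omp single                /* one thread starts the recursion; tasks fan out */
        total = dc_sum(0, N);

        printf("total = %.0f\n", total);  /* expect 1000000 */
        return 0;
    }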