FOUNDATION TO PARALLEL PROGRAMMING

CONTENT
• Introduction to parallel programming
• Parallel programming models
• Parallel programming paradigms

Parallel Programming is a Complex Task
• Problems facing parallel software developers:
– Non-determinism
– Communication
– Synchronization
– Partitioning and distribution
– Load balancing
– Fault tolerance
– Race conditions
– Deadlock
– ...

Levels of Parallelism
(Figure: tasks i-1, i, i+1, function calls, array statements and instructions illustrate the levels, from PVM/MPI at the top down to the CPU.)
• Large grain (task level): program, parallelized with PVM/MPI
• Medium grain (control level): function (thread), parallelized with threads
• Fine grain (data level): loop, parallelized by the compiler
• Very fine grain (multiple issue): instructions, parallelized in hardware

Responsible for Parallelization

    Grain Size   Code Item                                  Parallelised by
    Very fine    Instruction                                Processor
    Fine         Loop / instruction block                   Compiler
    Medium       Function ("standard one page")             Programmer
    Large        Program / separate heavy-weight process    Programmer

Parallelization Procedure
• Decomposition: the sequential computation is broken into tasks
• Assignment: the tasks are assigned to process elements
• Orchestration: the process elements are coordinated (communication and synchronization)
• Mapping: the process elements are mapped onto processors

Sample Sequential Program: FDM (Finite Difference Method)

    ...
    loop {
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] +
                                 a[i-1][j] + a[i+1][j] + a[i][j]);
            }
        }
    }
    ...

Parallelize the Sequential Program
• Decomposition: the same FDM loop; each inner-loop update of a[i][j] is a task (an OpenMP sketch of this loop appears after the Synchronization slide below).
• Assignment: divide the tasks equally among the process elements (PEs).
• Orchestration: the PEs need to communicate and to synchronize.
• Mapping: the PEs are mapped onto the processors of a multiprocessor.

Parallel Programming Models
• Sequential programming model
• Shared memory model (shared address space model)
– DSM
– Threads/OpenMP (enabled for clusters)
– Cilk
– Java threads
• Message passing model
– PVM
– MPI
• Functional programming
– MapReduce

Parallel Programming Models (continued)
• Partitioned Global Address Space (PGAS) languages
– UPC, Coarray Fortran, Titanium
• Languages and paradigms for hardware accelerators
– CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL

Trends
(Figure: scalar and vector applications; MPP systems with message passing (MPI) on distributed memory; multi-core nodes with OpenMP on shared memory; accelerators (GPGPU, FPGA) with CUDA, OpenCL; combined into hybrid codes.)

Sequential Programming Model
• Functional view
– Naming: can name any variable in the virtual address space; hardware (and perhaps compilers) does the translation to physical addresses
– Operations: loads and stores
– Ordering: sequential program order
• Performance
– Relies (mostly) on dependences on a single location: dependence order
– Compiler: reordering and register allocation
– Hardware: out-of-order execution, pipeline bypassing, write buffers
– Transparent replication in caches

SAS (Shared Address Space) Programming Model
(Figure: two threads (processes) in one system; one executes write(X) and the other read(X) on a shared variable X.)

Shared Address Space Programming Model
• Naming: any process can name any variable in the shared space
• Operations: loads and stores, plus those needed for ordering
• Simplest ordering model:
– Within a process/thread: sequential program order
– Across threads: some interleaving (as in time-sharing)
– Additional orders through synchronization

Synchronization
• Mutual exclusion (locks)
– No ordering guarantees
• Event synchronization
– Ordering of events to preserve dependences
– e.g. producer -> consumer of data (see the sketch below)
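As a concrete illustration of the two synchronization mechanisms above, here is a minimal sketch in C with POSIX threads, assuming a single shared value: the lock provides mutual exclusion, and the condition variable provides the producer -> consumer event ordering. The names (data, ready, producer, consumer) are chosen for the example, not taken from the slides.

    /* Producer -> consumer event synchronization: a sketch with POSIX threads.
     * The mutex gives mutual exclusion; the condition variable enforces the
     * ordering "data is produced before it is consumed". */
    #include <pthread.h>
    #include <stdio.h>

    static int data;              /* shared variable (the "X" of the SAS model) */
    static int ready = 0;         /* event flag: has the producer finished?     */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);   /* mutual exclusion around shared state */
        data = 42;                /* produce the data                     */
        ready = 1;
        pthread_cond_signal(&c);  /* event: wake the consumer             */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);
        while (!ready)                    /* wait until the event has happened */
            pthread_cond_wait(&c, &m);
        printf("consumed %d\n", data);    /* dependence on the producer is preserved */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, q;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&q, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(q, NULL);
        return 0;
    }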
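The FDM loop from the decomposition example can be parallelized in this shared-address-space style with OpenMP. Note that the loop on the slide updates a[][] in place (a Gauss-Seidel-style sweep with loop-carried dependences), so this sketch assumes a Jacobi-style variant that writes to a separate array, which makes the row-wise decomposition safe; the array size, sweep count and boundary handling are illustrative.

    /* OpenMP sketch of the FDM stencil: decomposition = one task per grid point,
     * assignment = rows divided among threads by the parallel-for schedule.
     * Assumes a Jacobi-style update (reads a, writes a_new) so iterations are
     * independent; the in-place update on the slide would need e.g. red-black
     * ordering instead. */
    #include <stdio.h>

    #define N 512
    static double a[N + 2][N + 2], a_new[N + 2][N + 2];  /* halo of boundary values */

    int main(void)
    {
        for (int sweep = 0; sweep < 100; sweep++) {      /* the outer "loop{}" */
            #pragma omp parallel for                      /* rows shared among threads */
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++)
                    a_new[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] +
                                         a[i-1][j] + a[i+1][j] + a[i][j]);

            #pragma omp parallel for                      /* copy back for the next sweep */
            for (int i = 1; i <= N; i++)
                for (int j = 1; j <= N; j++)
                    a[i][j] = a_new[i][j];
        }
        printf("a[1][1] = %f\n", a[1][1]);
        return 0;
    }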
MP (Message Passing) Programming Model
(Figure: a process on Node A executes send(Y); a process on Node B executes receive(Y'); a message carries Y into Y'.)

Message-Passing Programming Model
(Figure: process P executes "Send X, Q, t" from address X in its local address space; process Q executes "Receive Y, P, t" into address Y in its local address space; the two are matched on process and tag t.)
• Send specifies the data buffer to be transmitted and the receiving process
• Receive specifies the sending process and the storage into which the received data is placed
• A user process can only name local variables and entities in its own address space
• There are many overheads: copying, buffer management, protection
(An MPI sketch of a matched send/receive pair appears after the CUDA slides below.)

Message Passing Programming Model
• Naming
– A process can directly name only its local variables
– There is no shared address space
• Operations
– Explicit communication: send and receive
– Send transfers data from the private address space to another process
– Receive copies data into the private address space
– It must be possible to name processes
• Ordering
– Within a process, order is determined by the program
– Send and receive provide point-to-point synchronization between processes
• A global address space can be constructed
– e.g., process id + address within the process address space
– But there are no direct operations on it

Functional Programming
• Function operations do not modify data structures; they create new ones
• The original data is never changed
• Data flow is not fixed explicitly in the program design
• The order of operations does not matter

    fun foo(l: int list) = sum(l) + mul(l) + length(l)

The order of sum(), mul(), etc. does not matter: they do not modify l.

GPU
• Graphics Processing Unit
• A GPU consists of a large number of cores, e.g. hundreds of cores
• A typical CPU, by contrast, has 2, 4, 8 or 12 cores
• Cores? Processing units on a chip that share at least memory or the L1 cache
• General-purpose computation using the GPU (GPGPU) in applications other than 3D graphics
• The GPU accelerates the critical path of the application

CPU vs GPU
(Figure: comparison of CPU and GPU architectures.)

GPU and CPU
• Typically the GPU and CPU coexist in a heterogeneous setting
• The "less" computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism)
• NVIDIA's GPU architecture is called the CUDA (Compute Unified Device Architecture) architecture, accompanied by the CUDA programming model and the CUDA C language

What is CUDA?
• CUDA: Compute Unified Device Architecture
• A parallel computing architecture developed by NVIDIA: the computing engine in the GPU
• CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs

Processing Flow
The CUDA processing flow:
1. Copy data from main memory to GPU memory
2. The CPU launches the computation (kernel) on the GPU
3. The GPU executes it in parallel on each core
4. Copy the results from GPU memory back to main memory

CUDA Programming Model
• Definitions: device = GPU; host = CPU; kernel = function that runs on the device
• A kernel is executed by a grid of thread blocks
• A thread block is a batch of threads that can cooperate with each other by:
– Sharing data through shared memory
– Synchronizing their execution
• Threads from different blocks cannot cooperate

CUDA Kernels and Threads
• Parallel portions of an application are executed on the device as kernels
– One kernel is executed at a time
– Many threads execute each kernel
• Differences between CUDA threads and CPU threads
– CUDA threads are extremely lightweight: very little creation overhead, instant switching
– CUDA uses thousands of threads to achieve efficiency; multi-core CPUs can use only a few

Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
• All threads run the same code
• Each thread has an ID that it uses to compute memory addresses and make control decisions

Minimal Kernels / Manage Memory / CPU vs GPU
(The example code and figures from these slides did not survive extraction; a minimal kernel sketch follows below. © NVIDIA Corporation 2009)
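Since the "Minimal Kernels" slide content is missing, the following is only a stand-in sketch in CUDA C, in the spirit of the preceding slides: a grid of thread blocks runs the same kernel, and each thread uses its ID to compute the memory address it operates on. The kernel name, problem size, and the use of unified memory are choices made for this example, not taken from the original slide.

    /* Minimal CUDA kernel sketch: each thread uses its ID to pick one element. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread ID       */
        if (i < n)                                      /* grid may overshoot n   */
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        /* Unified memory keeps the sketch short; explicit cudaMemcpy calls would
         * mirror the four-step processing flow on the earlier slide. */
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks = (n + threads - 1) / threads;   /* grid of thread blocks   */
        vec_add<<<blocks, threads>>>(a, b, c, n);   /* host launches the kernel */
        cudaDeviceSynchronize();                    /* wait for the device      */

        printf("c[0] = %f\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }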
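Returning to the Message-Passing Programming Model slides above, here is a minimal MPI sketch of a matched send/receive pair, assuming two processes: rank 0 plays the role of process P (it names the data buffer, the destination and a tag), and rank 1 plays process Q (it names the source and the storage for the incoming data). The tag value and variable names are illustrative.

    /* Minimal MPI sketch of matched send/receive: rank 0 names the data buffer
     * and the receiving process; rank 1 names the sending process and the
     * storage for the incoming data. Run with: mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, x, y;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                /* process P: send X to process 1, tag 99 */
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {         /* process Q: receive from process 0 into Y */
            MPI_Recv(&y, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", y);  /* send/recv also synchronizes the pair */
        }

        MPI_Finalize();
        return 0;
    }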
Partitioned Global Address Space
• Most parallel programs are written using either:
– Message passing with an SPMD model (MPI)
  • Usually for scientific applications with C++/Fortran
  • Scales easily
– Shared memory with threads in OpenMP, threads + C/C++/Fortran, or Java
  • Usually for non-scientific applications
  • Easier to program, but less scalable performance
• Partitioned Global Address Space (PGAS) languages take the best of both:
– SPMD parallelism like MPI
– Local/global distinction, i.e., layout matters
– Global address space like threads (programmability)

How does PGAS compare to other models?
(Figure: processes/threads and address spaces for each model.)
• Message passing (MPI): each process has its own address space
• Shared memory (OpenMP): all threads share a single address space
• PGAS (UPC, CAF, X10): one global address space, partitioned among the threads
• Computation is performed at multiple places
• A place holds data that can be operated on by remote processes
• Data lives at the place where it was created, for its whole lifetime
• Data at one place may point to data at another place
• Data structures (e.g. arrays) may be distributed across many places
• A place expresses locality

PGAS Overview
• "Partitioned Global View" (or PGAS)
– Global address space: every thread can see all of the data, so there is no need to replicate data
– Partitioned: the global address space is divided, so the programmer is aware of data sharing among the threads
• Implementations
– GA library from PNNL
– Unified Parallel C (UPC), Fortran 2008 coarrays
– X10, Chapel
• Concepts
– Memories and structures
– Partition and mapping
– Threads and affinity
– Local and non-local accesses
– Collective operations and "owner computes"

Memories and Distributions
• Software memory: a distinct logical storage area in a computer program (e.g., heap or stack); for parallel software we use multiple memories
• Structure: a collection of data created by program execution (arrays, trees, graphs, etc.)
• Partition: division of a structure into parts
• Mapping: assignment of structure parts to memories

Software Memory Examples
• Executable image (figure): "program linked, loaded and ready to run"
• Memories
– Static memory: the data segment
– Heap memory: holds allocated structures; explicitly managed by the programmer (malloc, free)
– Stack memory: holds function call records; implicitly managed by the runtime during execution

Affinity and Nonlocal Access
• Affinity is the association between a thread and a memory
– If a thread has affinity with a memory, it can access the structures in that memory
– Such memories are called local memories
• Nonlocal access (example)
– Thread 0 needs part B
– Part B is in memory 1
– Thread 0 has no affinity with memory 1
– Nonlocal accesses are usually implemented by communication between processes and are therefore expensive

Threads and Memories for Different Programming Methods

    Method         Thread Count           Memory Count         Nonlocal Access
    Sequential     1                      1                    N/A
    OpenMP         either 1 or p          1                    N/A
    MPI            p                      p                    No; messages required
    CUDA           1 (host) + p (device)  2 (host + device)    No; DMA required
    UPC, FORTRAN   p                      p                    Supported
    X10            n                      p                    Supported

Hybrid (MPI + OpenMP + CUDA + ...)
• Takes the best of all models
• Exploits the memory hierarchy
• Many HPC applications are adopting this model
• Mainly due to developer inertia: it is hard to rewrite millions of lines of source

Hybrid parallel programming
• Python: ensemble simulations
• MPI: domain partition
• OpenMP: outer-loop partition
• CUDA: assign inner-loop iterations to GPU threads
(A minimal MPI + OpenMP sketch follows.)
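A minimal sketch of the hybrid style shown above, assuming a sum over a large index range as a stand-in workload: MPI partitions the domain across processes (one per node), and OpenMP partitions the local loop across the cores of each node; the CUDA level is omitted to keep the sketch short.

    /* Hybrid MPI + OpenMP sketch: MPI splits the domain across processes (one per
     * node), OpenMP splits the local loop across the cores of that node.
     * Compile e.g.: mpicc -fopenmp hybrid.c */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000                /* global problem size (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, size, provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;                   /* MPI level: domain partition */
        int lo = rank * chunk;
        int hi = (rank == size - 1) ? N : lo + chunk;

        double local = 0.0, global = 0.0;
        #pragma omp parallel for reduction(+:local)  /* OpenMP level: loop partition */
        for (int i = lo; i < hi; i++)
            local += (double)i * i;             /* stand-in for the real inner work */

        /* combine the per-node partial results */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of squares = %.0f\n", global);

        MPI_Finalize();
        return 0;
    }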
Design Issues Apply at All Layers
• The programming model's position provides constraints/goals for the system
• In fact, each interface between layers supports or takes a position on:
– Naming model
– Set of operations on names
– Ordering model
– Replication
– Communication performance

Naming and Operations
• Naming and operations in the programming model can be supported directly by lower levels, or translated by the compiler, libraries or the OS
• Example: shared virtual address space in the programming model
– The hardware interface supports a shared physical address space
– Direct support by hardware through virtual-to-physical mappings, no software layers

Naming and Operations (Cont'd)
• The hardware supports independent physical address spaces
– The system/user interface can provide SAS through the OS: virtual-to-physical mappings exist only for data that are local; remote data accesses incur page faults and are brought in via page-fault handlers
– Or SAS is provided by compilers or the runtime, i.e., above the system/user interface

Naming and Operations (Cont'd)
• Example: implementing message passing
– Direct support at the hardware interface
– Support at the system/user interface or above, in software (almost always)
• The hardware interface provides basic data transport
– Send/receive are built in software for flexibility (protection, buffering)
– Or lower interfaces provide SAS, and send/receive are built on top with buffers and loads/stores

Naming and Operations (Cont'd)
• Need to examine the issues and tradeoffs at every layer
– Frequencies and types of operations, and their costs
• Message passing
– No assumptions on ordering across processes except those imposed by send/receive pairs
• SAS
– How processes see the order of other processes' references defines the semantics of SAS
– Ordering is very important and subtle

Ordering Model
• Uniprocessors play tricks with orders to gain parallelism or locality
• These tricks are even more important in multiprocessors
• Need to understand which old tricks are still valid, and learn new ones
• How programs behave, what they rely on, and the hardware implications

Parallelization Paradigms
• Task-farming / master-worker
• Single-Program Multiple-Data (SPMD)
• Pipelining
• Divide and conquer
• Speculation

Master-Worker/Slave Model
• The master decomposes the problem into small tasks, distributes the tasks to workers for execution, and then collects the partial results to form the final result (an MPI sketch appears at the end of this section)
• Mapping/load balancing
– Static
– Dynamic

Single-Program Multiple-Data (SPMD)
• Every process executes the same code but operates on different data
• Domain decomposition, data parallelism

Pipelining
• Suitable for:
– Fine-grained parallelism
– Applications that execute in multiple stages

Divide and Conquer
• The problem is split into several sub-problems; each sub-problem is solved independently, and the partial results are merged
• Three operations: split, compute, and join
• Master-worker/task-farming is similar to divide and conquer, with the master performing the split and join operations
• In form it resembles a hierarchical master-worker scheme (an OpenMP task sketch appears at the end of this section)

Speculative Parallelism
• Suitable when there are complex dependences among the sub-problems
• Uses "look ahead" execution
• Uses multiple algorithms to solve the same problem
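The master-worker paradigm above can be sketched in MPI with dynamic load balancing: the master decomposes the problem into small tasks, hands one task at a time to each worker, and sends the next task to whichever worker returns a result first. The task itself (squaring an integer), the tags and the task count are placeholders for a real workload.

    /* Master-worker (task farming) sketch in MPI with dynamic load balancing.
     * Rank 0 is the master; it decomposes the problem into NTASKS small tasks,
     * distributes them on demand, and collects the results.
     * Run with: mpirun -np 4 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS 20
    #define TAG_WORK 1
    #define TAG_STOP 2

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                                   /* master */
            int next = 0, received = 0, result, total = 0;
            MPI_Status st;
            /* seed every worker with one task */
            for (int w = 1; w < size && next < NTASKS; w++, next++)
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            /* collect results; hand the next task to whoever answered */
            while (received < next) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                received++;
                total += result;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                }
            }
            /* tell all workers to stop, then report the combined result */
            for (int w = 1; w < size; w++)
                MPI_Send(&w, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            printf("sum of squares 0..%d = %d\n", NTASKS - 1, total);
        } else {                                           /* worker */
            int task, result;
            MPI_Status st;
            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                result = task * task;                      /* the "small task" */
                MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }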
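The split/compute/join structure of divide and conquer can be sketched with OpenMP tasks; a recursive array sum keeps the example short. The cutoff, array contents and function name are illustrative, and without OpenMP the same code simply runs sequentially.

    /* Divide-and-conquer sketch with OpenMP tasks: split the range in two,
     * compute the halves as independent tasks, then join (add) their results. */
    #include <stdio.h>

    #define N 1000000
    #define CUTOFF 10000                  /* below this size, compute sequentially */

    static double a[N];

    static double dc_sum(int lo, int hi)
    {
        if (hi - lo <= CUTOFF) {          /* compute: small enough, do it directly */
            double s = 0.0;
            for (int i = lo; i < hi; i++)
                s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;     /* split */
        double left, right;
        #pragma omp task shared(left)     /* each sub-problem solved independently */
        left = dc_sum(lo, mid);
        #pragma omp task shared(right)
        right = dc_sum(mid, hi);
        #pragma omp taskwait              /* join */
        return left + right;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = 1.0;

        double total;
        #pragma omp parallel
        #pragma omp single                /* one thread starts the recursion; tasks fan out */
        total = dc_sum(0, N);

        printf("total = %.0f\n", total);  /* expect 1000000 */
        return 0;
    }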