Parallel Programming

FOUNDATIONS OF PARALLEL PROGRAMMING
CONTENT
• Introduction to parallel programming
• Parallel programming models
• Parallel programming paradigms
Parallel Programming is a Complex Task
• Problems facing the parallel software developer:
– Non-determinism
– Communication
– Synchronization
– Partitioning and distribution
– Load balancing
– Fault tolerance
– Race conditions
– Deadlock
– ...
Levels of Parallelism
[Figure: three tasks (Task i-1, Task i, Task i+1), each a function (func1, func2, func3) containing statements such as a(0)=.., b(0)=.., broken down to individual operations (+, x, load); each level is labelled with what exploits it: PVM/MPI, Threads, Compilers, CPU]

Code granularity vs. code item:
• Large grain (task level): program - PVM/MPI
• Medium grain (control level): function (thread) - Threads
• Fine grain (data level): loop - Compiler
• Very fine grain (multiple issue): instruction - hardware (CPU)
Responsible for Parallelization
Grain Size   Code Item                                  Parallelised by
Very Fine    Instruction                                Processor
Fine         Loop / instruction block                   Compiler
Medium       Function (standard, about one page)        Programmer
Large        Program / separate heavy-weight process    Programmer
Parallelization Procedure
The steps from a sequential computation to processors:
Sequential Computation → Decomposition → Tasks → Assignment → Process Elements → Orchestration → Mapping → Processors
Sample Sequential Program
FDM (Finite Difference Method)
…
loop {
  for (i=0; i<N; i++) {
    for (j=0; j<N; j++) {
      a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j] + a[i][j]);
    }
  }
}
…
Parallelize the Sequential Program
• Decomposition
…
loop {
  for (i=0; i<N; i++) {
    for (j=0; j<N; j++) {
      a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j] + a[i][j]);   ← a task
    }
  }
}
…
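One way to realize the decomposition and assignment with OpenMP is sketched below. It is only a sketch: it assumes a Jacobi-style variant that writes into a second array b, because the in-place update above carries dependences between neighbouring grid points within one sweep; the array size and halo are illustrative, not from the slides.

#define N 1024
static double a[N + 2][N + 2], b[N + 2][N + 2];   /* halo of width 1, sizes illustrative */

void sweep(void)
{
    #pragma omp parallel for            /* assignment: rows of the grid divided among threads */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)    /* each update of one grid point is one fine-grained task */
            b[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j] + a[i][j]);
}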
Parallelize the Sequential Program
• Assignment
Divide the tasks equally among the process elements.
[Figure: the task set split evenly across four PEs]
Parallelize the Sequential Program
• Orchestration
The process elements need to communicate and to synchronize.
[Figure: four PEs connected by communication/synchronization arrows]
Parallelize the Sequential Program
• Mapping
[Figure: the four PEs mapped onto the processors of a multiprocessor]
Parallel Programming Models
• Sequential Programming Model
• Shared Memory Model (Shared Address Space Model)
  • DSM
  • Threads/OpenMP (enabled for clusters)
  • Cilk
  • Java threads
• Message Passing Model
  • PVM
  • MPI
• Functional Programming
  • MapReduce
Parallel Programming Models
• Partitioned Global Address Space (PGAS) Languages
  • UPC, Coarray Fortran, Titanium
• Languages and Paradigms for Hardware Accelerators
  • CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL

Trends
[Figure: evolution of HPC programming models: scalar applications → vector → MPP systems with message passing (MPI, distributed memory) → multi-core nodes (OpenMP, shared memory) → accelerators (GPGPU, FPGA: CUDA, OpenCL) → hybrid codes]
Sequential Programming Model
• Functional
• Naming: can name any variable in the virtual address space
• Hardware (and perhaps the compiler) translates these to physical addresses
• Operations: Loads and Stores
• Ordering: Sequential program order
Sequential Programming Model
• Performance
• Relies (mostly) on dependences through individual locations: dependence order
• Compiler: reordering and register allocation
• Hardware: out of order, pipeline bypassing,
write buffers
• Transparent replication in caches
SAS (Shared Address Space)
Programming Model
[Figure: two threads (processes) in one system sharing a variable X; one performs write(X), the other read(X)]
Shared Address Space Programming
Model
• Naming
  • Any process can name any variable in the shared space
• Operations
  • Loads and stores, plus those needed for ordering
• Simplest Ordering Model
  • Within a process/thread: sequential program order
  • Across threads: some interleaving (as in time-sharing)
  • Additional orders through synchronization
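A minimal sketch of this model in OpenMP C (variable names are illustrative): every thread can name the shared array x, ordinary loads and stores access it, and the barrier supplies the extra ordering needed before reading another thread's write.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int x[4] = {0};                        /* shared: every thread can name x[i] */
    #pragma omp parallel num_threads(4)
    {
        int me = omp_get_thread_num();
        x[me] = me * me;                   /* ordinary store into the shared space */
        #pragma omp barrier                /* additional ordering through synchronization */
        /* ordinary load of data written by another thread */
        printf("thread %d sees x[%d] = %d\n", me, (me + 1) % 4, x[(me + 1) % 4]);
    }
    return 0;
}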
Synchronization
• Mutual exclusion (locks)
• No ordering guarantees
• Event synchronization
• Ordering of events to preserve dependences
• e.g., producer → consumer of data
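As a sketch (OpenMP C, illustrative names): the lock gives mutual exclusion around a shared counter without guaranteeing any order, while the implicit barrier at the end of the single construct is an event synchronization that orders the producer's write before every consumer's read.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int counter = 0, data = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);

    #pragma omp parallel num_threads(4)
    {
        /* Mutual exclusion: one thread at a time, no ordering guarantee. */
        omp_set_lock(&lock);
        counter++;
        omp_unset_lock(&lock);

        /* Event synchronization: the implicit barrier after "single"
           preserves the producer -> consumer dependence. */
        #pragma omp single
        data = 42;                                                   /* producer */
        printf("thread %d consumes %d\n", omp_get_thread_num(), data);  /* consumers */
    }

    omp_destroy_lock(&lock);
    printf("counter = %d\n", counter);
    return 0;
}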
MP Programming Model
[Figure: a process on Node A executes send(Y) and a process on Node B executes receive(Y'); a message carries the value of Y from A's address space into Y' in B's address space]
Message-Passing Programming
Model
[Figure: process P issues Send X, Q, t and process Q issues Receive Y, P, t; the send and receive are matched on process and tag, and X and Y live in the two processes' local address spaces]
• Send specifies the data buffer to be transmitted and the receiving process
• Recv specifies the sending process and the storage into which the data are placed
• A user process can name only local variables and entities in its own address space
• There are many overheads: copying, buffer management, protection
Message Passing Programming Model
• Naming
– A process can name local variables directly
– There is no shared address space
• Operations
– Explicit communication: send and receive
– Send transfers data from the private address space to another process
– Receive copies data into the private address space
– Processes must be able to name one another
Message Passing Programming Model
• Ordering
• Within a process, order is given by the program
• Send and receive provide point-to-point synchronization between processes
• A global address space can be constructed
  • e.g., process id + address within that process's address space
  • but there are no direct operations on it
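A minimal point-to-point sketch with MPI in C (variable names and tag values are illustrative): each process names only its own variable y, the send names the receiving process, and the receive names the sending process.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, y = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        y = 123;                                              /* local variable of process 0 */
        MPI_Send(&y, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);       /* names the receiving process */
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* names the sending process */
        printf("process 1 received %d into its own address space\n", y);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes, e.g. mpirun -np 2 ./a.out; the matched send/receive pair is also the point-to-point synchronization mentioned above.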
Functional Programming
• Functional operations do not modify data structures; they create new ones
• The original data are left unchanged
• The flow of data is not specified explicitly in the program design
• The order of operations does not matter
Functional Programming
fun foo(l: int list) = sum(l) + mul(l) + length(l)
The order of sum(), mul(), etc. does not matter; they do not modify l.
GPU
• Graphics Processing Unit
• A GPU consists of a large number of cores, e.g., hundreds
• A typical CPU, by contrast, has 2, 4, 8, or 12 cores
• Cores? – processing units on one chip that share at least memory or the L1 cache
• General-purpose computation using the GPU in applications other than 3D graphics
• The GPU accelerates the critical path of the application
CPU vs. GPU
[Figure: CPU and GPU architectures compared]
GPU and CPU
• Typically GPU and CPU coexist in a heterogeneous
setting
• “Less” computationally intensive part runs on CPU
(coarse-grained parallelism), and more intensive parts run
on GPU (fine-grained parallelism)
• NVIDIA’s GPU architecture is called CUDA (Compute
Unified Device Architecture) architecture, accompanied by
CUDA programming model, and CUDA C language
What is CUDA?
CUDA: Compute Unified Device Architecture.
A parallel computing architecture developed by NVIDIA.
The computing engine in the GPU.
CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs.
Processing Flow
The CUDA processing flow:
1. Copy data from main memory to GPU memory
2. The CPU launches the compute kernel on the GPU
3. The GPU executes it in parallel on each core
4. Copy the results from GPU memory back to main memory
CUDA Programming Model
Definitions:
Device = GPU
Host = CPU
Kernel = function that runs on the device
CUDA Programming Model
A kernel is executed by a grid of thread blocks.
A thread block is a batch of threads that can cooperate with each other by:
  sharing data through shared memory
  synchronizing their execution
Threads from different blocks cannot cooperate.
CUDA Kernels and Threads
Parallel portions of an application are executed on the device as kernels.
One kernel is executed at a time.
Many threads execute each kernel.
Differences between CUDA and CPU threads:
  CUDA threads are extremely lightweight
    very little creation overhead
    instant switching
  CUDA uses thousands of threads to achieve efficiency
    multi-core CPUs can use only a few
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads.
All threads run the same code.
Each thread has an ID that it uses to compute memory addresses and make control decisions.
Minimal Kernels
Manage memory
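The code on these two slides is not reproduced above; the following is a hedged sketch in CUDA C of a minimal kernel plus the host-side memory management, with the numbered comments tracing the processing-flow steps listed earlier (names such as vec_add are illustrative).

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vec_add(const float *x, const float *y, float *z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* thread ID -> memory address */
    if (i < n)
        z[i] = x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes), *hz = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy, *dz;                                  /* manage device memory */
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes); cudaMalloc(&dz, bytes);

    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);    /* 1. main memory -> GPU memory */
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(dx, dy, dz, n);     /* 2. host launches the kernel */
    cudaDeviceSynchronize();                              /* 3. GPU cores execute in parallel */

    cudaMemcpy(hz, dz, bytes, cudaMemcpyDeviceToHost);    /* 4. GPU memory -> main memory */
    printf("hz[0] = %f\n", hz[0]);

    cudaFree(dx); cudaFree(dy); cudaFree(dz);
    free(hx); free(hy); free(hz);
    return 0;
}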
Partitioned Global Address Space
• Most parallel programs are written using either:
• Message passing with an SPMD model (MPI)
• Usually for scientific applications with C++/Fortran
• Scales easily
• Shared memory with threads in OpenMP, Threads+C/C++/F or Java
• Usually for non-scientific applications
• Easier to program, but less scalable performance
• Partitioned Global Address Space (PGAS) Languages take the
best of both
• SPMD parallelism like MPI
• Local/global distinction, i.e., layout matters
• Global address space like threads (programmability)
How does PGAS compare to other models?
[Figure: how processes/threads map to address spaces under message passing (MPI), shared memory (OpenMP), and PGAS (UPC, CAF, X10)]
• Computation is performed at multiple places
• A place contains data that can be operated on by remote processes
• Data live, for their whole lifetime, in the place where they were created
• Data in one place may point to data in another place
• Data structures (e.g., arrays) may be distributed across multiple places
• A place expresses locality.
PGAS Overview
• “Partitioned Global View” (or PGAS)
  • Global Address Space: every thread can see all of the data, so there is no need to replicate data
  • Partitioned: the global address space is divided, so the programmer is aware of which data are shared among threads
• Implementations
  • GA Library from PNNL
  • Unified Parallel C (UPC), Coarray Fortran (Fortran 2008)
  • X10, Chapel
• Concepts
  • Memories and structures
  • Partition and mapping
  • Threads and affinity
  • Local and non-local accesses
  • Collective operations and “owner computes”
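A rough sketch of these concepts in UPC (array name and size are illustrative): the shared array is distributed across threads, upc_forall applies the owner-computes rule through its affinity expression, and any thread can still read non-local elements directly through the global address space.

#include <upc.h>
#include <stdio.h>
#define N 1024

shared int a[N];                     /* one global array, distributed across threads */

int main(void)
{
    int i;
    /* "Owner computes": each thread updates the elements that have affinity to it. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = MYTHREAD;
    upc_barrier;                     /* collective synchronization */
    if (MYTHREAD == 0)               /* thread 0 reads a non-local element directly */
        printf("a[N-1] is owned by thread %d\n", a[N - 1]);
    return 0;
}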
Memories and Distributions
• Software Memory
• Distinct logical storage area in a computer
program (e.g., heap or stack)
• For parallel software, we use multiple
memories
• Structure
• Collection of data created by program
execution (arrays, trees, graphs, etc.)
• Partition
• Division of structure into parts
• Mapping
• Assignment of structure parts to memories
Software Memory Examples
• Executable image (figure omitted)
  • “Program linked, loaded and ready to run”
• Memories
• Static memory
• data segment
• Heap memory
• Holds allocated structures
• Explicitly managed by
programmer (malloc, free)
• Stack memory
• Holds function call records
• Implicitly managed by
runtime during execution
Affinity and Nonlocal Access
• Affinity is the association of a thread with a memory
• If a thread has affinity with a memory, it can access that memory's structures
• Such memories are called local memories
• Non-local access
  • Thread 0 needs part B
  • Part B is in memory 1
  • Thread 0 has no affinity with memory 1
• Non-local access is usually implemented by communication between processes, and is therefore expensive
Threads and Memories for Different Programming Methods

Method         Thread Count            Memory Count        Nonlocal Access
Sequential     1                       1                   N/A
OpenMP         Either 1 or p           1                   N/A
MPI            p                       p                   No. Message required.
CUDA           1 (host) + p (device)   2 (host + device)   No. DMA required.
UPC, FORTRAN   p                       p                   Supported.
X10            n                       p                   Supported.
Hybrid (MPI+OpenMP+CUDA+…)
• Takes the best of all models
• Exploits the memory hierarchy
• Many HPC applications are adopting this model
  • Mainly due to developer inertia
  • Hard to rewrite millions of lines of source code
Hybrid parallel programming
Python: ensemble simulations
MPI: domain partitioning
OpenMP: outer-loop partitioning
CUDA: inner-loop iterations assigned to GPU threads
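A minimal hybrid sketch in C, assuming MPI across nodes and OpenMP within each node (the reduction loop is a placeholder for real per-iteration work):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)   /* cores within a node share this rank's loop */
    for (int i = 0; i < 1000000; i++)
        local += 1.0e-6;                          /* placeholder for real work */

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);   /* combine across nodes */
    if (rank == 0)
        printf("global = %f\n", global);

    MPI_Finalize();
    return 0;
}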
Design Issues Apply at All Layers
• Programming model’s position provides
constraints/goals for system
• In fact, each interface between layers
supports or takes a position on
– Naming model
– Set of operations on names
– Ordering model
– Replication
– Communication performance
Naming and Operations
Naming and operations in the programming model can be directly supported by lower levels, or translated by the compiler, libraries, or the OS.
Example: shared virtual address space in the programming model
  Hardware interface supports a shared physical address space
    Direct support by hardware through virtual-to-physical mappings, no software layers
Naming and Operations (Cont’d)
Hardware supports independent physical address spaces
  System/user interface: can provide SAS through the OS
    virtual-to-physical mappings only for data that are local
    remote data accesses incur page faults, brought in via page-fault handlers
  Or through compilers or the runtime, i.e., above the sys/user interface
Naming and Operations (Cont’d)
Example: implementing message passing
  Direct support at the hardware interface
  Support at the sys/user interface or above in software (almost always)
    Hardware interface provides basic data transport
    Send/receive built in software for flexibility (protection, buffering)
    Or lower interfaces provide SAS, and send/receive are built on top with buffers and loads/stores
Naming and Operations (Cont’d)
• Need to examine the issues and tradeoffs at every layer
  • Frequencies and types of operations, costs
• Message passing
  • No assumptions on ordering across processes except those imposed by send/receive pairs
• SAS
  • How processes see the order of other processes' references defines the semantics of SAS
  • Ordering is very important and subtle
Ordering model
• Uniprocessors play tricks with orders to gain
parallelism or locality
• These are more important in multiprocessors
• Need to understand which old tricks are valid,
and learn new ones
• How programs behave, what they rely on, and
hardware implications
Parallelization Paradigms
• Task-Farming/Master-Worker
• Single-Program Multiple-Data (SPMD)
• Pipelining
• Divide and Conquer
• Speculation.
Master Worker/Slave Model
• The master decomposes the problem into small tasks, distributes the tasks to the workers for execution, and then gathers the partial results to form the final result.
• Mapping / load balancing
  • Static
  • Dynamic
[Figure: static master/worker organization]
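A sketch of the dynamic (task-farming) variant with MPI; the task and result types are placeholders for a real application.

#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                /* master: split, distribute, collect */
        int sent = 0, received = 0, result;
        MPI_Status st;
        for (int w = 1; w < size; w++) {            /* give every worker an initial task (or stop it) */
            int tag = (sent < NTASKS) ? TAG_WORK : TAG_STOP;
            MPI_Send(&sent, 1, MPI_INT, w, tag, MPI_COMM_WORLD);
            if (tag == TAG_WORK) sent++;
        }
        while (received < sent) {                   /* dynamic load balancing: refill whoever finishes */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            received++;
            int tag = (sent < NTASKS) ? TAG_WORK : TAG_STOP;
            MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
            if (tag == TAG_WORK) sent++;
        }
    } else {                                        /* worker: compute until told to stop */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = task * task;                   /* placeholder for the real computation */
            MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}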
Single-Program Multiple-Data
• Every process executes the same code but operates on different data.
• Domain decomposition, data parallelism.
Pipelining
• Suitable for
  • fine-grained parallelism
  • applications that execute in multiple stages
Divide and Conquer
• The problem is divided into subproblems; each subproblem is solved independently and the results are then merged (a sketch follows below)
• Three operations: split, compute, and join
• Master-worker/task-farming is similar to divide and conquer: the master performs the split and join operations
• Structurally it resembles a hierarchical master/worker
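A sketch of split/compute/join using OpenMP tasks in C (the cutoff value and names are illustrative):

#include <stdio.h>

static long dc_sum(const int *v, int lo, int hi)    /* sums v[lo..hi) */
{
    if (hi - lo < 1000) {                           /* compute: small enough, solve directly */
        long s = 0;
        for (int i = lo; i < hi; i++) s += v[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;                   /* split */
    long left, right;
    #pragma omp task shared(left)
    left = dc_sum(v, lo, mid);
    right = dc_sum(v, mid, hi);
    #pragma omp taskwait                            /* join */
    return left + right;
}

int main(void)
{
    enum { N = 1 << 20 };
    static int v[N];
    for (int i = 0; i < N; i++) v[i] = 1;

    long total = 0;
    #pragma omp parallel
    #pragma omp single                              /* one thread starts the recursion; tasks spread it */
    total = dc_sum(v, 0, N);

    printf("total = %ld\n", total);                 /* expect N */
    return 0;
}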
Speculative Parallelism
• Suitable when there are complex dependences between subproblems
• Uses “look-ahead” execution
• Multiple algorithms may be used to solve the same problem