PPT - Introduction to Concurrency in Programming Languages

advertisement
Introduction to Concurrency in Programming Languages:
Chapter 11: Data Parallelism
Matthew J. Sottile
Timothy G. Mattson
Craig E Rasmussen
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
1
Objectives
• Introduce the data parallel algorithmic pattern.
• Demonstrate application of this pattern to:
– Matrix multiplication
– Cellular automaton
• Discuss limitations of this pattern for general-purpose
programming.
• Discuss use of task parallel constructs to express data
parallelism.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
2
The Algorithm Design Patterns
Start with a basic concurrency decomposition
•
•
•
A problem decomposed into a set of tasks
A data decomposition aligned with the set of tasks … designed to minimize
interactions between tasks and make concurrent updates to data safe.
Dependencies and ordering constraints between groups of tasks.
Specialist
Parallelism
Pipeline
Event Based
Coordination
Agenda
Parallelism
Task
Parallelism
Recursive
algorithms
Embarrassingly
Parallel
Separable
Dependencies
Result
Parallelism
Geometric
Decomposition
Data
Parallelism
Recursive
Data
Patterns covered in this lecture
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
3
Result Parallelism: Algorithm Patterns
• The core idea is to define the algorithm in terms of the data
structures within the problem and how they are
decomposed.
– Data parallelism:
• A broadly applicable pattern in which the parallelism is expressed as
streams of instructions applied concurrently to the elements of a data
structure (e.g., arrays).
– Geometric decomposition:
• A data parallel pattern where the data structures at the center of the
problem are broken into sub-regions or tiles that are distributed about
the threads or processes involved in the computation. The algorithm
consists of updates to local or interior points, exchange of boundary
regions, and update of the edges.
– Recursive data:
• A data parallel pattern used with recursively defined data structures.
Extra work (relative to the serial version of the algorithm) is expended
to traverse the data structure and define the concurrent tasks, but this
is compensated for by the potential for parallel speedup.
Not covered in this lecture
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
4
Supporting Patterns
•
Fork-join
–
•
SPMD
–
•
A process or thread (the master) sets up a task queue and manages other threads (the workers)
as they grab a task from the queue, carry out the computation, and then return for their next task.
This continues until the master detects that a termination condition has been met, at which point
the master ends the computation.
SIMD
–
•
Parallelism is expressed in terms of loops that execute concurrently.
Master-worker
–
•
Multiple copies of a single program are launched typically with their own view of the data. The
path through the program is determined in part base don a unique ID (a rank). This is by far the
most commonly used pattern with message passing APIs such as MPI.
Loop parallelism
–
•
A computation begins as a single thread of control. Additional threads are created as needed
(forked) to execute functions and then when complete terminate (join). The computation
continues as a single thread until a later time when more threads might be useful.
The computation is a single stream of instructions applied to the individual components of a data
structure (such as an array).
Functional parallelism
–
Concurrency is expressed as a distinct set of functions that execute concurrently. This pattern
may be used with an imperative semantics in which case the way the functions execute are
defined in the source code (e.g., event based coordination). Alternatively, this pattern can be
used with declarative semantics, such as within a functional language, where the functions are
defined but how (or when) they execute is dictated by the interaction of the data with the
language model.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
5
Outline
• Data parallel algorithms and the SIMD Pattern
• Case studies:
– Matrix multiplication
– Cellular automaton
• Beyond SIMD
• Geometric decomposition
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
6
Data Parallelism Pattern
• Use when:
– Your problem is defined in terms of collections of data
elements operated on by a similar (if not identical)
sequence of instructions; i.e. the concurrency is in the
data.
• Solution
– Define collections of data elements that can be updated
in parallel.
– Define computation as a sequence of collective
operations applied together to each data element.
Tasks
Data 1
Data 2
Data 3
……
Data n
7
Source: Mattson and Keutzer, UCB CS294
SIMD: Single Instruction Multiple Data
• A supporting pattern for
data parallelism.
• Definition: A single
instruction stream is
applied to multiple data
elements.
• One program text
• One instruction counter
• Distinct data streams per
Processing Element (PE)
PE
PE
PE
PE
Source: Mattson and Keutzer, UCB CS294
8
Pattern Example: SIMD
• Definition: A single stream of program instructions execute in parallel
for different lanes in a data structure. There is only one program
counter for a SIMD program.
PU Index
Time
PU[1]
PU[2]
PU[3]
Prev Instruction
Prev Instruction
Prev Instruction
Load A[1]
Load A[2]
Load A[n]
Load B[1]
Load B[2]
Add(A[1], B[1])
Add(A[2], B[2])
Add(A[n], B[n])
Store C[1]
Store C[2]
Store C[n]
Next Instruction
Next Instruction
Next Instruction
……………
Load B[n]
SIMD operation of adding vector A and B into C
9
Source: Mattson and Keutzer, UCB CS294
Data Parallel Collective operations
Communication:
Element-wise Ops
Reductions
+ + + + + + + +
+ + + + + + + +
+
+
+
+
+
+
+
Arbitrary
Nested Parallelism
trees, graphs, nested control …
[[
][
][[
][
]]]
Linearized form
Source: Ali Adl-Tabatabai, Intel
10
Outline
• Data parallel algorithms and the SIMD Pattern
• Case studies:
– Matrix multiplication
– Cellular automaton
• Limitations of SIMD data parallel programming
• Beyond SIMD
• Geometric decomposition
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
11
Matrix Multiplication
• Matrices are rectangular arrays of numbers.
• Linear Algebra is the branch of mathematics that deals with
matrices.
• Linear algebra is used for a wide range of applications
including:
–
–
–
–
–
–
–
–
Oil reservoir modeling
Quantum Chemistry
Image processing
Least squares and other statistics calculations
Seismic Signal processing
Financial analytics
Business analytics (operations research)
Graphics
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
12
Introduction to Linear Algebra
• Definition:
– Linear algebra: the branch of mathematics concerned with the study of
vectors, vector spaces, linear maps (also called linear transformations), and
systems of linear equations.
• Example: Consider the following system of linear equations
x + 2y + z = 1
x + 3y + 3z = 2
x + y + 4z = 6
– This system can be represented in terms of vectors and a matrix as
the classic “Ax = b” problem.
1
2
1
x
1
3
3
y
1
1
4
z
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
1
=
2
6
Source: Mattson and Keutzer, UCB CS294
13
Linear transformations
• A matrix, A ∈ RMxP multiplies a vector, x ∈ RP to define a linear
transformation of that vector into RM. By “linear” we mean that
A(βx+y) = βAx + Ay, where β is a scalar.
• Matrix multiplication defines the composition of two linear
transformations on that vector space
for (i=0; i<M; i++){
for (j=0; j<N; j++){
C[i][j] = 0.0;
for(k=0; k<P; k++){
C[i][j] += A[i][k] * B[k][j];
}
}
}
Compute C = A*B where
C ∈ RMxN
A ∈ RMxP
B ∈ RPxN
• Matrix multiplication is the core building block of Linear Algebra
– O(N2) memory accesses; O(N3) compute
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
14
Other key linear algebra operations
• LU Decomposition
– A = LU where L is a lower triangular matrix and U is an
upper triangular matrix.
• Cholesky Decomposition
– A = L*LT, symmetric positive definite A and L is lower
triangular
• QR Decomposition
– A=Q*R, Q is orthogonal and R is upper triangular
• Symmetric Eigenvalue problem
– For symmetric matrix A, find T such that
• T-1 A T = D
where D is a diagonal
• And many more
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
15
Matrix Multiplication (serial)
! three 2D arrays : a,b,c
real, dimension(100,100) :: a,b,c
! loop counters
integer
:: i,j,k
do i=1,100
do j=1,100
c(i,j) = 0.0
do k=1,100
c(i,j) = c(i,j) + a(i,k) * b(k,j)
end do
end do
end do
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
This operation is called a
dot product of the two
vectors a and b to produce
a scalar value for c(i,j)
16
Matrix Multiplication (Simple Data Parallel)
! three 2D arrays : a,b,c
real, dimension(100,100) :: a,b,c
! loop counters
integer :: i,j
Uses fine grained SIMD style
do i=1,100
parallelism to compute the dot-product
do j=1,100
c(i,j) = sum(a(i,:) * b(:,j))
end do
Array slice notation for
end do
the ith row of and the jth
column of b.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
17
Array slice notation
• Designating different slices of an array.
A(:,:)
A(2,3)
A(3,:)
A(1:3,:)
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
A(:,3)
A(1:3,2:4)
18
Outline
• Data parallel algorithms and the SIMD Pattern
• Case studies:
– Matrix multiplication
– Cellular automaton
• Limitations of SIMD data parallel programming
• Beyond SIMD
• Geometric decomposition
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
19
Cellular Automaton
• An illustration of the Rule 30: 1D cellular
automaton over 200 time steps.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
20
Initialize the cellular automaton
integer :: time = 0, finished = 100
integer :: cells(NUMCELLS), next(NUMCELLS)
integer :: left(NUMCELLS), right(NUMCELLS)
! ... initialization of all elements to zero,
! except for center cell
cells = 0
cells(NUMCELLS/2) = 1
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
21
Cellular automaton Time loop for 1D rule 30
do while (time < finished)
Circular shift of
! ... make working copies
array by 1 in each
next = cells
direction.
right = cshift(cells, 1)
left = cshift(cells, -1)
! ... apply rules
where (left == 1 .and. cells == 1 .and. right == 1) next = 0
where (left == 1 .and. cells == 1 .and. right == 0) next = 0
where (left == 1 .and. cells == 0 .and. right == 1) next = 0
where (left == 1 .and. cells == 0 .and. right == 0) next = 1
where (left == 0 .and. cells == 1 .and. right == 1) next = 1
where (left == 0 .and. cells == 1 .and. right == 0) next = 1
where (left == 0 .and. cells == 0 .and. right == 1) next = 1
where (left == 0 .and. cells == 0 .and. right == 0) next = 0
! ... obtain results from working copy
cells = next
time = time + 1
end do
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Check each
element of the
three arrays in
parallel, and
update the next
state of the array
based on the Rule
30 transition rules.
22
Outline
• Data parallel algorithms and the SIMD Pattern
• Case studies:
– Matrix multiplication
– Cellular automaton
• Beyond SIMD
• Geometric decomposition
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
23
Limitations of SIMD
• SIMD parallelism for data parallel algorithms is powerful
– Sequential, deterministic semantics for expressing parallel
algorithms.
• Writing the parallel program is easier
• Debugging is vastly simplified since a good supporting SIMD programming
environment is deterministic
• … but Strict SIMD parallelism is very limited
– Even naturally “data parallel’ algorithms where a single stream of
instructions operate on multiple data elements, the model breaks
down at boundaries
• Solution:
– Keep a single program text but let the instructions vary slightly
based on the ID of each processing element … i.e. the SPMD or
Single Program multiple data model
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
24
Data-par SPMD style
• SPMD approach for data parallel algorithms:
– Define an abstract index space that appropriately spans the
problem domain.
– Data structures in the problem are aligned to this index space.
– Tasks (e.g. work-items in OpenCL or “threads” in CUDA) operate
on these data structures for each point in the index space.
• This approach was popularized for graphics applications
where the index space mapped onto the pixels in an
image. More recently, It’s been extended to General
Purpose GPU (GPGPU) programming.
Note: This is basically a fine grained extreme form of the SPMD
pattern.
25
The BIG idea behind OpenCL
•OpenCL execution model … execute a kernel at each point in a
problem domain.
–E.g., process a 1024 x 1024 image with one kernel
invocation per pixel or 1024 x 1024 = 1,048,576
kernel executions
Traditional loops
void
trad_mul(int n,
const float *a,
const float *b,
float *c)
{
int i;
for (i=0; i<n; i++)
c[i] = a[i] * b[i];
}
Data Parallel OpenCL
kernel void
dp_mul(global const float *a,
global const float *b,
global float *c)
{
int id = get_global_id(0);
c[id] = a[id] * b[id];
} // execute over “n” work-items
Source: Khronos Group , GDC’2010 OpenCL overview
OpenCL:
An N-dimension domain of work-items
•Define an N-dimensioned index space that is “best” for
your algorithm
– Global Dimensions: 1024 x 1024 (whole problem space)
– Local Dimensions: 128 x 128
(work group … executes together)
1024
1024
Synchronization between work-items
possible only within workgroups:
barriers and memory fences
Cannot synchronize outside
of a workgroup
Source: Khronos Group , GDC’2010 OpenCL overview
OpenCL (SPMD) example: Binary thresholding
int x[m][n];
int i,j;
for (i=0; i<m; i++)
for (j=0; j<n; j++)
if (x[i][j] < t)
x[i][j] = 0;
else
x[i][j] = 1;
Serial Program
Create data parallel by singling out
the elemental function (kernel) …
int threshold(int x, int t) {
if (x < t) return 0;
else return 1;
}
Then use with a language that
applies the kernel to every point in x
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
28
Binary Threshold (OpenCL Kernel)
• OpenCL Host
program (not shown)
– sets up the platform,
– defines the data (2D
array X),
– loads data onto the
compute device
– Launches the kernel for
each point in x
– Collects results back
from compute device
__kernel threshold(
__global int* x,
const
int t)
{
int i_id = get_global_id(0);
int j_id = get_global_id(1) ;
if (x[i_id][j_id] < t)
x[i_id][j_id] = 0;
else
x[i_id][j_id] = 1;
}
OpenCL kernel function
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
29
Using task parallelism (OpenMP) to express data
parallelism
for i=1:n
x(i) = a(i) * b(i);
x = a * b;
#pragma omp parallel for private(i)
for (i=0; i<n; i++)
x[i] = a[i] * b[i];
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Serial Vector
multiplication
Expressed with a data
parallel array notation
Mapping loop
iterations (tasks) onto
threads using
OpenMP
30
Outline
• Data parallel algorithms and the SIMD Pattern
• Case studies:
– Matrix Multiplication
– Cellular Automaton
• Beyond SIMD
• Geometric decomposition
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
31
Geometric Decomposition Pattern
• Name
– Geometric decomposition
• Problem
– How can an algorithm be organized around a data structures that
has been decomposed into concurrently updatable chunks?
• Context
– Consider a class of problems built around a core data structure
that can be decomposed into chunks that can be updated
concurrently.
– Updates of chunks in two phases: (1) local update computation
and (2) input from a subset of neighboring chunks.
• Forces
– Standard tradeoff … small chunks to increase load balancing
options vs. large chunks to reduce scheduling overhead.
– Performance sensitive to locality of data to UEs that will use it.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
32
Geometric Decomposition Pattern: solution
• Solution involves the following steps:
– Decompose data … structured or unstructured. Manage granularity
to balance load.
– Implicit methods, linear algebra:
• See the linear algebra dwarf (I won’t repeat that content here)
– Explicit methods (image processing)
• Define Exchange operations to share data needed for local
update.
• Define a local update operation.
• Manage dual … data distribution and task scheduling
– Frequently uses ghost cells to hold all the data required for
a local update
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
33
2u  0
Geometric decomposition example:
solving partial differential equations
• Laplace’s equation:
– u is a potential field (gravitational, electrostatic, etc.)
in free space (no sinks, no sources)
• Poisson’s equation:
– u is a potential field with sources of sinks (f(x,y,z)).
• Diffusion problems:
– E.g. u is non-steady-state temperature or
concentration of a diffusing substance … α2 is the
diffusion constant.
• Wave equation:
– u is displacement from equilibrium for oscillatory
medium or field … ν is speed of propagation
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
 2u  0
2u  f ( x, y, z)
1 u
 u 2
 t
2
2
1

u
2
u 2 2
 t
Source: Mattson and Keutzer, UCB CS294
34
Boundary and Initial Conditions
• A given problem is defined by the PDE and
the key constraints:
– Initial conditions: starting point for
propagation problems
– Boundary conditions: specified on
domain boundaries to provide the interior
solution in computational domain
R
R
s
n
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
35
Solving PDEs
• In simplest cases, they can be solved analytically.
• For “interesting” problems they must be solved numerically.
• Numerical solution consists of the following steps:
– Define the problem:
• Define the domain
• Define functions to compute values within the domain
• Define Boundary conditions.
– Discretize the domain … turn the continuous problem into a
discrete problem … i.e. superimpose a mesh over the domain and
find values at points on the mesh
– Define methods to update values at a mesh point based on other
values in the mesh.
– Propagate a solution as dynamic variables (often time) evolves.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
36
Example: finite difference methods
• Solve the heat diffusion equation in 1 D:
– u(x,t) describes the temperature field
– We set the heat diffusion constant to one
– Boundary conditions, constant u at endpoints.
•
map onto a mesh with stepsize h and k
• Central difference approximation for spatial
derivative (at fixed time)
• Time derivative at t = tn+1
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
 2u
u

x 2
t
xi  x0 + ih
ti  t0 + ik
u j +1  2u j + u j 1
 2u

2
x
h2
du
u n +1  u n

dt
k
Source: Mattson and Keutzer, UCB CS294
37
Example: Explicit finite differences
• Combining time derivative expression using spatial derivative at t = tn
u nj+1  u nj
k

u nj+1  2u nj + u nj1
h2
• Solve for u at time n+1 and step j
u nj+1  (1  2r)u nj + ru nj1 + ru nj+1
rk
h2
• The solution at t = tn+1 is determined explicitly from the solution at t = tn
(assume u[t][0] = u[t][N] = Constant for all t).
for(int t=0; t<K-1; t++)
for (int x=1; x<N-1; x++)
u[t+1][x] = (1-2*r)*u[t][x] + r*u[t][x-1] + r*u[t][x+1];
• Explicit methods are easy to compute … each point updated based on
nearest neighbors. Converges for r<1/2.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
38
Explicit methods: A Stencil problem in 2D
• Stencil methods: replace each Grid point with linear combination of
neighbors.
• Stencil methods extensively used in image processing as well.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
39
Geometric Decomposition Pattern: solution
• Example: finite difference solver with a 5 point
stencil
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
40
Geometric Decomposition Pattern: solution
Decompose into “chunks”
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
41
Geometric Decomposition Pattern: solution
Add ghost cells (1/2
width of stencil)
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
42
Geometric Decomposition Pattern: solution
Update values in ghost cells,
Stencil has everything it
needs to update local values.
Parallel computation in
phases:
•Update ghost cells
•Move stencil over local
block
•Go to next iteration
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
Source: Mattson and Keutzer, UCB CS294
43
Download