Shared Memory Programming

Starting Parallel Algorithm
Design
David Monismith
Based on notes from Introduction to
Parallel Computing, 2nd Edition by
Grama, Gupta, Karypis, and Kumar
Outline
• Decomposition
• Tasks and Interaction
• Load Balancing
• Managing Overhead
• Parallel Models
Decomposition
• Decomposition - dividing a computation into
parts that may be executed in parallel
• Tasks - programmer defined units of
computation into which the main
computation is subdivided
• Task-dependency graphs - abstraction used to
express dependencies between tasks and their
relative order of execution
Granularity
• Granularity - number/size of tasks that a
computation can be divided into
• Fine grained - task divided into many small tasks
• Coarse grained - task divided into few large tasks
• Degree of concurrency - maximum number of
tasks that can be executed in parallel in a
program at any time
• The average degree of concurrency – the average number of tasks that can run concurrently over the entire execution – is often more useful, as it provides a better indication of achievable performance
Example
• Matrix-Vector Multiplication
– A figure will be drawn on the board; an OpenMP sketch follows below
– Generally considered fine grained if parallelizing
based upon each dot product
– Could be considered coarse-grained if using a dual
core processor and each task computes half of the
dot products
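• A minimal OpenMP sketch of the fine-grained version (one task per dot product) follows; the names A, x, and y and the row-major N x N layout are assumptions for illustration:
#pragma omp parallel for private(i, j) shared(A, x, y)
for(i = 0; i < N; i++) {
    /* task i: dot product of row i of A with vector x */
    y[i] = 0;
    for(j = 0; j < N; j++)
        y[i] += A[i*N + j] * x[j];
}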
Task Graphs
• Critical path - longest directed path between a pair of
start and finish nodes in the task graph
• Critical path length – sum of the weights of the nodes
along a critical path
• Weight of a node is the size of the task or amount of
work associated with the task
• In addition to these factors, interaction between tasks running on different processors may add to the runtime
• An example of a task dependency graph will be drawn
in class to aid in the understanding of these concepts
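• As a small worked example (weights assumed for illustration): if a critical path passes through tasks of weight 10, 6, and 8, the critical path length is 10 + 6 + 8 = 24, and if the total work over all tasks in the graph is 63, the average degree of concurrency is 63 / 24 ≈ 2.6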
Processes and Threads vs. Processors
• Mapping - mechanism by which tasks are assigned to processes and/or threads for execution
• Threads and processes are logical units that perform tasks
• Processors physically perform the computations
• This distinction is important because a computation may have multiple stages
• For example, internode communication vs. shared memory communication
• Drawing a task dependency or task interaction graph may help us understand how tasks interact with one another and will aid in developing a parallel algorithm
Decomposition Techniques
• Embarrassingly Parallel
• Recursive decomposition
• Data Decomposition
• Exploratory decomposition
• Speculative decomposition
Embarrassingly Parallel Tasks
• Some tasks lend themselves to direct parallelization
• Such tasks are said to be embarrassingly parallel and
can be directly mapped to processes or threads
• A subset of these tasks represents the map pattern
• Note that the map pattern represents a function that can be “replicated and applied to all elements in a collection” – source: https://software.intel.com/en-us/blogs/2009/06/10/parallel-patterns-3-map
• Map operations occur in independent loop iterations
Embarrassingly Parallel (Map)
• Performing array (or matrix) addition is a
straightforward example that is easily parallelized
• The serial example of this follows:
for(i = 0; i < N; i++)
C[i] = A[i] + B[i];
• Three OpenMP parallel versions follow on the
next slides
OpenMP First Try
• We could parallelize the loop on the last slide directly as follows:
#pragma omp parallel private(i) shared(A,B,C)
{
    int nthreads = omp_get_num_threads();
    int tid = omp_get_thread_num();
    int chunk = N / nthreads;
    int start = tid * chunk;
    /* the last thread picks up the remainder when N is not
       evenly divisible by the number of threads */
    int end = (tid == nthreads - 1) ? N : start + chunk;
    for(i = start; i < end; i++)
        C[i] = A[i] + B[i];
}
• Notice that i is declared private because it is not shared between threads – each thread gets its own copy of i
• Arrays A, B, and C are declared shared because they are shared between threads
• Note that manual partitioning must handle the remainder when N is not evenly divisible by the number of threads; the for directive on the next slide handles this automatically
OpenMP for Directive
• It is preferred to allow OpenMP to parallelize loops directly using the for directive as follows
#pragma omp parallel private(i) shared(A,B,C)
{
#pragma omp for
for(i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
• Notice that the loop can be written in a serial fashion; its iterations will be automatically partitioned among the threads
Shortened OpenMP for
• When using a single for loop, the parallel and for directives may be combined
#pragma omp parallel for private(i) \
shared(A,B,C)
for(i = 0; i < N; i++)
C[i] = A[i] + B[i];
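• By default, OpenMP chooses how the iterations are divided among threads; a schedule clause (e.g., schedule(static) or schedule(dynamic)) may be added to control that division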
Recursive Decomposition
• Used to introduce concurrency into problems that can be solved with divide-and-conquer
• Such a problem is solved by dividing it into
independent sub-problems
• A special type of this decomposition is the Reduction Pattern, wherein elements of a collection are combined with a binary associative operator (e.g., +, *, min, max) – source: https://software.intel.com/en-us/blogs/2009/07/23/parallel-pattern-7-reduce
Example
• To find the minimum of an array A of size N serially, use the following algorithm:
min = A[0];
for(i = 1; i < N; i++)
    if(A[i] < min)
        min = A[i];
Example
• One way to decompose this task for parallelism is a recursive, divide-and-conquer solution
/* min(a,b) is assumed to be a helper, e.g.
   #define min(a,b) ((a) < (b) ? (a) : (b)) */
int findMinRec(int A[], int i, int n)
{
    if(n == 1)
        return A[i];
    else
    {
        int lmin = findMinRec(A, i, n/2);        /* minimum of the left half  */
        int rmin = findMinRec(A, i+n/2, n-n/2);  /* minimum of the right half */
        return min(lmin, rmin);
    }
}
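• A minimal sketch of how this recursion could run in parallel with OpenMP tasks follows (OpenMP 3.0 or later is assumed; the helper name findMinRecPar and the cutoff of 1000 are illustrative assumptions):
int findMinRecPar(int A[], int i, int n)
{
    if(n < 1000)                  /* illustrative cutoff: run small subproblems serially */
        return findMinRec(A, i, n);
    int lmin, rmin;
    #pragma omp task shared(lmin) /* left half runs as a task */
    lmin = findMinRecPar(A, i, n/2);
    #pragma omp task shared(rmin) /* right half runs as a task */
    rmin = findMinRecPar(A, i+n/2, n-n/2);
    #pragma omp taskwait          /* wait for both halves to finish */
    return min(lmin, rmin);
}
/* called from inside a parallel region, e.g.:
   #pragma omp parallel
   #pragma omp single
   result = findMinRecPar(A, 0, N); */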
OpenMP Implementation
/* fill A with random values in [0, 99] */
for(i = 0; i < N; i++)
    A[i] = rand() % 100;
small = A[0];
/* the min reduction requires OpenMP 3.1 or later */
#pragma omp parallel for reduction(min:small)
for(i = 0; i < N; i++) {
    if(A[i] < small)
        small = A[i];
}
OpenMP Sum Reduction
for(i = 0; i < N; i++)
    A[i] = i+1;
sum = 0;
#pragma omp parallel for reduction(+:sum)
for(i = 0; i < N; i++)
    sum += A[i];
printf("The sum is %d\n", sum);
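• Under the hood, each thread accumulates into its own private copy of sum (initialized to 0, the identity for +), and the private copies are combined into the shared sum when the loop completes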
Data Decomposition
• Commonly used on algorithms that operate
on large data structures
• Involves two steps
– Data is partitioned
– The data partitioning then induces a partitioning of the computation into tasks
• Operations on different data partitions are
typically similar or are chosen from a small set
of operations
Partitioning
• Partitioning output data – each element of the output is computed independently of the others as a function of the input
– Example – matrix multiplication can be partitioned into submatrices (see the sketch after this list)
• Partitioning input data – task is created for each
partition of the input data
– Example – finding a minimum or maximum
• Partitioning input and output – combination of
the two cases above
• Partitioning intermediate data – tasks are created from partitions of the intermediate results of a multi-stage computation
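• A minimal sketch of output-data partitioning for matrix multiplication follows (square N x N row-major matrices are assumed; for simplicity each task computes one row of C rather than a submatrix):
/* each output row of C is computed independently as a
   function of the inputs A and B */
#pragma omp parallel for private(i, j, k) shared(A, B, C)
for(i = 0; i < N; i++)
    for(j = 0; j < N; j++) {
        C[i*N + j] = 0;
        for(k = 0; k < N; k++)
            C[i*N + j] += A[i*N + k] * B[k*N + j];
    }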
Next Time
• More decompositions
– Exploratory Decomposition
– Speculative Decomposition
• Tasks and Interactions
• Load balancing
• Handling overhead
• Parallel Algorithm Models