High Performance Computing Lecture 1

Virtues of Good (Parallel) Software
Concurrency
  Able to exploit concurrency in the algorithm/problem/hardware
Scalability
  Resilient to increasing processor count
Locality
  More frequent access to local data than to remote data
Modularity
  Employ abstraction and modular design
1
Two Basic Requirements for a Parallel Program
Safety: Produce correct results
  The result computed on P processors and on 1 processor must be IDENTICAL.
Liveness: Able to proceed and finish; free of deadlock.
2
Sources of Overhead
Execution time
  The time that elapses from when the first processor starts executing on the problem to when the last processor completes execution
  Execution time = computation time + communication time + idle time
Communication / interprocess interaction: usually the main source of overhead
  T_comm = t_s + t_w*L
  Minimize the volume and frequency of communications; overlap computation and communication
Idling: lack of computation or lack of data
  Load imbalance
  Synchronization
  Presence of serial components
  Waiting on remote data
Replicated computation
  Communicate or replicate
3
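
The linear cost model T_comm = t_s + t_w*L above can be made concrete with a short sketch in Python; the latency t_s and per-word time t_w used here are illustrative assumptions, not measured machine parameters.

# Minimal sketch of the linear communication-cost model from the slide:
# T_comm = t_s + t_w * L, where t_s is the startup (latency) cost,
# t_w is the per-word transfer time, and L is the message length in words.
# The numbers below are illustrative assumptions, not measured values.

def comm_time(L, t_s=1e-6, t_w=1e-9):
    """Time to send one message of L words under the t_s + t_w*L model."""
    return t_s + t_w * L

# Sending the same data in many small messages pays the startup cost repeatedly,
# which is why the slide advises minimizing the frequency of communication.
total_words = 10_000
for num_msgs in (1, 10, 100, 1000):
    t = num_msgs * comm_time(total_words / num_msgs)
    print(f"{num_msgs:5d} messages: {t:.2e} s")
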
Speedup & Efficiency
Relative speed-up: the factor by which the execution time is reduced on multiple processors
  S(p) = T_1/T_p
  T_1 is the execution time on one processor
  T_p is the execution time on p processors
Absolute speed-up: where T_1 is the uniprocessor time for the best-known (sequential) algorithm
S(p) <= p
Embarrassingly parallel (EP): no communication among cpus.
Superlinear speedup: exists in reality
Efficiency: the fraction of time that processors spend doing useful work.
  E = S/p = T_1/(p*T_p)
  Parallel cost: p*T_p
  Parallel overhead: T_o = p*T_p - T_1
4
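
A minimal Python sketch of these definitions; the T_1 and T_p timings below are made-up values used only to exercise the formulas.

# Relative speedup, efficiency, parallel cost, and parallel overhead,
# computed from the definitions on the slide. The timings are made up.

def metrics(T1, Tp, p):
    S = T1 / Tp                # speedup S(p) = T_1 / T_p
    E = S / p                  # efficiency E = S / p = T_1 / (p * T_p)
    cost = p * Tp              # parallel cost
    To = p * Tp - T1           # parallel overhead T_o = p*T_p - T_1
    return S, E, cost, To

T1 = 100.0                     # assumed single-processor time (seconds)
for p, Tp in [(2, 52.0), (4, 28.0), (8, 16.0)]:   # assumed parallel times
    S, E, cost, To = metrics(T1, Tp, p)
    print(f"p={p}: S={S:.2f}  E={E:.2f}  cost={cost:.1f}  overhead={To:.1f}")
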
Amdahl's Law
S = 1/((1-alpha) + alpha/P)
Alpha – fraction of operations in the serial code that can be parallelized
P – number of processors
This is for a fixed problem size
T_p = alpha*T_1/P + (1-alpha)*T_1
S -> 1/(1-alpha) as P -> infinity
Alpha = 90%: S -> 10
Alpha = 99%: S -> 100
Alpha = 99.9%: S -> 1000
“Mental block”
5
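
A small Python sketch of Amdahl's law, reproducing the limits quoted above; the processor count used for the sample evaluation is an arbitrary choice.

# Amdahl's law for a fixed problem size: S(P) = 1 / ((1 - alpha) + alpha / P).
# Shows how slowly the speedup approaches its 1/(1-alpha) limit.

def amdahl_speedup(alpha, P):
    return 1.0 / ((1.0 - alpha) + alpha / P)

for alpha in (0.90, 0.99, 0.999):
    limit = 1.0 / (1.0 - alpha)
    print(f"alpha={alpha}: S(1000)={amdahl_speedup(alpha, 1000):.1f}, "
          f"limit as P->infinity = {limit:.0f}")
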
Gustafson’s Law
S = (1-alpha) + alpha*P
Alpha – fraction of time spent on parallel operations in the parallel program
This is for a scaled problem size, or constant run time.
T_1 = (1-alpha)*T_p + P*alpha*T_p
As the problem size increases, the fraction of parallel operations increases
6
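
A matching sketch for Gustafson's scaled speedup; the value alpha = 0.99 is an assumed illustration.

# Gustafson's law for a scaled problem size: S = (1 - alpha) + alpha * P,
# where alpha is the fraction of the parallel run time spent in parallel work.

def gustafson_speedup(alpha, P):
    return (1.0 - alpha) + alpha * P

for P in (10, 100, 1000):
    print(f"P={P}: scaled speedup = {gustafson_speedup(0.99, P):.1f}")
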
Iso-Efficiency Function
For a fixed problem size N, as P increases, the increase in speedup S slows down or levels off, and efficiency E decreases
For fixed P, as N increases, S increases and efficiency E increases
As P increases, one can increase the problem size N such that the efficiency is kept constant
This N(p) for fixed efficiency is called the iso-efficiency function
The rate of increase of N(p), dN/dp, measures the scalability of a parallel program
Smaller rate of increase -> more scalable
7
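
As a concrete illustration of an iso-efficiency function, assume T_1 = N (work grows linearly with problem size) and a hypothetical total overhead T_o = p*log2(p). From E = T_1/(p*T_p) and p*T_p = T_1 + T_o, holding E constant requires T_1 = (E/(1-E))*T_o, so N(p) grows like p*log2(p). A sketch under these assumptions:

import math

# Iso-efficiency sketch under an assumed model:
#   T_1 = N                      (work grows linearly with problem size)
#   T_o = p * log2(p)            (hypothetical total parallel overhead)
# From E = T_1 / (p * T_p) and p * T_p = T_1 + T_o,
# keeping E fixed requires T_1 = (E / (1 - E)) * T_o.

def iso_problem_size(p, E=0.8):
    """Problem size N(p) needed to hold efficiency E under the assumed model."""
    K = E / (1.0 - E)
    return K * p * math.log2(p)

for p in (2, 4, 8, 16, 32):
    print(f"p={p:3d}: N(p) ~ {iso_problem_size(p):8.1f}")
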
Parallel Program Design
PCAM Model (I. Foster): Partitioning, Communication, Agglomeration, Mapping
Partitioning and Communication address concurrency and scalability
Agglomeration and Mapping address locality and other performance-related issues
8
Partitioning
Decompose the computation to be performed and the data operated on by this computation into small tasks
Purpose: expose opportunities for parallel execution
  Ignore practical issues such as the number of processors in the target machine
  Avoid replicating computation and data
Focus: define a large number of small tasks in order to yield a fine-grained decomposition of the problem
  A fine-grained decomposition provides the greatest flexibility in terms of potential parallel algorithms
  Maximize concurrency
9
Partitioning
A good partition divides both the computation associated with a problem and the data this computation operates on
Domain/data decomposition: focus first on the data
  Partition the data associated with the problem
  Associate computations with the partitioned data
Functional decomposition: focus first on the computation
  Decompose the computations to be performed
  Then deal with the data the decomposed computations work on
10
Domain Decomposition
Decompose the data first, and then the associated computations
  "Owner computes"
  Outcome: tasks comprising some data and a set of operations on that data
  Some operations may require data from several tasks -> communication
Data can be input data, output data, intermediate data, or all of them.
  Rule of thumb: focus first on the largest data structure or the data structure accessed most frequently
Mesh-based problems:
  Structured mesh: 1D, 2D, 3D decompositions
  Unstructured mesh: graph partitioning tools such as METIS
Favor the most aggressive decomposition possible at this stage
11
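
A minimal sketch of a 1D block decomposition in the "owner computes" spirit: N data items split as evenly as possible into contiguous ranges, one per task. The function name and the even-split rule are illustrative choices, not part of the lecture.

# 1D block domain decomposition: task `rank` (0..p-1) owns a contiguous
# range of the N data items, with the remainder spread over the first tasks.

def block_range(N, p, rank):
    base, rem = divmod(N, p)
    lo = rank * base + min(rank, rem)
    hi = lo + base + (1 if rank < rem else 0)
    return lo, hi          # task owns indices [lo, hi)

N, p = 10, 4
for rank in range(p):
    lo, hi = block_range(N, p, rank)
    print(f"task {rank}: indices {lo}..{hi - 1}")
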
Functional Decomposition
Focus first on the computation to be performed; divide the computations into disjoint tasks
Then consider the data associated with each sub-task
  Data requirements may be disjoint -> done
  Data may overlap significantly -> communication; may just as well try domain decomposition
Provides an alternative way of thinking about the problem; a hybrid decomposition may be best
  E.g. multi-physics simulations: functional decomposition overall, domain decomposition within each component
12
Partitioning: Questions to Ask
Does your partition define more tasks (an order of magnitude more?) than the number of processors of the target machine?
  No -> reduced flexibility in subsequent stages
Does your partition avoid redundant computation and storage requirements?
  No -> may not be scalable to large problems
Are tasks of comparable size?
  No -> hard to allocate equal amounts of work to cpus -> load imbalance
Does the number of tasks scale with problem size?
  Ideal: increased problem size -> increase in the number of tasks
  No -> may not be able to solve larger problems with more processors
Have you identified alternative partitions?
  Maximize flexibility; try both domain and functional decompositions
13
Communication
Purpose: determine the interaction among tasks
  Distribute communication operations among many tasks
  Organize communication operations in a way that permits concurrent execution
4 categories of communications:
Local/global communication:
  Local: each task communicates with a small set of other tasks (its neighbors)
  Global: communicate with many or all other tasks
14
Communication
Structured/unstructured communication:
  Structured: a task and its neighbors form a regular structure, e.g. a grid or tree
  Unstructured: communication represented by arbitrary graphs
Static/dynamic communication:
  Static: the identity of communication partners does not change over time
  Dynamic: the identity of partners is determined by data computed at runtime and may be highly variable
Synchronous/asynchronous communication:
  Synchronous: requires coordination between communication partners
  Asynchronous: proceeds without such cooperation
15
Task Dependency Graph
Task dependencies: one task cannot start until some other task(s) finish.
  E.g. the output of one task is the input to another task
Represented by the task dependency graph:
  A directed acyclic graph
  Nodes: tasks (task size as the weight of the node)
  Directed edges: dependencies among tasks
16
Task Dependency Graph
Degree of concurrency: the number of tasks that can run concurrently
  Maximum degree of concurrency: the maximum number of tasks that can be executed simultaneously at any given time
  Average degree of concurrency: the average number of tasks that can run concurrently over the duration of the program
Critical path: the longest vertex-weighted directed path between any pair of start and finish nodes
  Critical path length: the sum of vertex weights along the critical path
Average degree of concurrency = total amount of work / critical path length
17
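
These metrics are easy to compute for a concrete graph; a Python sketch for a small hypothetical weighted DAG (task names, weights, and dependencies are made up), finding the critical path length by a longest-path pass in topological order.

# Critical path length and average degree of concurrency for a small
# hypothetical task dependency graph. Node weights are task sizes (work).

work = {"a": 10, "b": 6, "c": 4, "d": 8}                   # assumed task weights
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}  # d depends on b and c

# Longest vertex-weighted path ending at each node, in topological order.
order = ["a", "b", "c", "d"]   # a valid topological order of `deps`
longest = {}
for t in order:
    longest[t] = work[t] + max((longest[p] for p in deps[t]), default=0)

critical_path_length = max(longest.values())
total_work = sum(work.values())
print("critical path length:", critical_path_length)        # 10 + 6 + 8 = 24
print("average degree of concurrency:", total_work / critical_path_length)
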
Task Interaction Graph
Even independent tasks may need to interact, e.g. to share data
Interaction graph: captures the interaction patterns among tasks
  Nodes: tasks
  Edges: communications / interactions
  [Figure: example task interaction graph]
Usually contains the task dependency graph as a sub-graph
18
Communication: Questions to Ask
Do all tasks perform the same number of communication operations?
  Unbalanced communication -> poor scalability
  Distribute communications equitably
Does each task communicate only with a small number of neighbors?
  May need to re-formulate global communication in terms of local communication structures
Can communications proceed concurrently?
Can computations associated with different tasks proceed concurrently?
  No -> may need to re-order computations / communications
19
Agglomeration
Improve performance: combine tasks to reduce the task interaction strength, increase locality, and increase the computation and communication granularity. Also determine whether it is worthwhile to replicate data/computation.
  Dependent tasks will be combined
  Independent tasks may also be agglomerated to increase granularity
Goals: reduce communication cost; retain flexibility w.r.t. scalability and mapping decisions
20
Increasing Granularity
A coarse-grained decomposition usually performs better:
  Send less data (reduce the volume of communication)
  Use fewer messages when sending the same amount of data (reduce the frequency of communication)
Surface-to-volume effects:
  Communication cost is usually proportional to the surface area of the domain
  Computation cost is usually proportional to the volume of the domain
  As task size increases, the amount of communication per unit of computation decreases
  Higher-dimensional decompositions are usually more efficient than lower-dimensional ones, due to the reduced surface area for a given volume.
Replicated computation:
  May trade replicated computation for reduced communication or execution time.
21
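
The surface-to-volume argument can be quantified for an N x N grid split among p tasks; a sketch comparing the communication-to-computation ratio of a 1D strip decomposition with a 2D block decomposition, assuming one layer of ghost cells per subdomain edge (an illustrative choice).

import math

# Surface-to-volume illustration for an N x N grid split among p tasks,
# assuming one layer of ghost cells is exchanged per subdomain edge.

def ratios(N, p):
    work = N * N / p                       # cells ("volume") per task
    halo_1d = 2 * N                        # strip of size N x (N/p): two edges of length N
    halo_2d = 4 * (N / math.sqrt(p))       # block of side N/sqrt(p): four shorter edges
    return halo_1d / work, halo_2d / work  # communication per unit computation

N, p = 1024, 64
r1, r2 = ratios(N, p)
print(f"1D strips: comm/comp = {r1:.4f}")
print(f"2D blocks: comm/comp = {r2:.4f}")   # smaller ratio: higher-D decomposition wins
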
Agglomeration: Questions to Ask
Has agglomeration reduced communication costs by increasing locality?
If computation is replicated, have you verified that the benefits of replication outweigh its costs for a range of problem sizes and processor counts?
If data is replicated, have you verified that it does not compromise scalability?
Do the tasks have similar computation and communication costs after agglomeration?
  Load balance
Does the number of tasks still scale with problem size?
22
Mapping
Map tasks to processors or processes.
If the number of tasks is larger than the number of processors, more than one task may need to be placed on a single processor
Goal: minimize total execution time
  Place tasks that can execute concurrently on different processors
  Place tasks that communicate frequently on the same processor
In the general case there is no computationally tractable algorithm for the mapping problem; it is NP-complete.
If SPMD-style, one task per processor
23
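
Since the general mapping problem is NP-complete, practical mappings rely on heuristics. As one hypothetical illustration (not an algorithm from the lecture), a greedy "largest task to least-loaded processor" sketch:

import heapq

# Greedy mapping heuristic: assign the largest remaining task to the
# currently least-loaded processor. A simple illustration, not optimal.

def greedy_map(task_costs, p):
    loads = [(0.0, proc) for proc in range(p)]     # (current load, processor id)
    heapq.heapify(loads)
    mapping = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(loads)
        mapping[task] = proc
        heapq.heappush(loads, (load + cost, proc))
    return mapping

tasks = {"t0": 9, "t1": 7, "t2": 6, "t3": 5, "t4": 3, "t5": 2}  # assumed task costs
print(greedy_map(tasks, p=2))
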
Parallel Algorithm Models
Data parallel model: processors perform similar operations on different data
Work/task pool model (replicated workers):
  A pool of tasks and a number of processors
  A processor can remove a task from the pool and work on it
  A processor may generate a new task during computation and add it to the pool
Master-slave/manager-worker model: master processors generate work and allocate it to worker processors
Pipeline/producer-consumer model: a stream of data passes through a succession of processors, each performing some task on it.
Hybrid model: a combination of two or more models
24
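
A minimal task-pool sketch using Python's multiprocessing module: a fixed set of replicated workers repeatedly pulls tasks from a shared queue until a stop sentinel is seen. The square() work function and the task list are placeholders.

import multiprocessing as mp

def square(x):
    """Placeholder work function; stands in for a real task."""
    return x * x

def worker(tasks, results):
    # Replicated worker: repeatedly take a task from the pool until told to stop.
    for x in iter(tasks.get, None):        # None is the stop sentinel
        results.put((x, square(x)))

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    nworkers = 4
    procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(nworkers)]
    for proc in procs:
        proc.start()
    for x in range(10):                    # fill the task pool
        tasks.put(x)
    for _ in procs:                        # one sentinel per worker
        tasks.put(None)
    out = dict(results.get() for _ in range(10))
    for proc in procs:
        proc.join()
    print(out)
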