CMPS 5433 – Parallel Algorithms
Dr. Ranette Halverson
TEXT: Structured Parallel Programming: Patterns for Efficient
Computation
M. McCool, A. Robison, J. Reinders
Chapter 1
1
Introduction
• Parallel? Sequential?
• In general, what does Sequential mean? Parallel?
• Is Parallel always better than Sequential?
• Examples??
• Parallel resources?
• How Efficient is Parallel?
• How can you measure Efficiency?
2
Parallel Computing Features
• Vector instructions
• Multithreaded cores
• Multicore processors
• Multiple processors
• Graphics engines
• Parallel co-processors
• Pipelining
• Multi-tasking
Automatic Parallelization vs. Explicit Parallel Programming
3
All Modern Computers are Parallel
• All programmers must be able to program for parallel
computers!!
• Design & Implement
• Efficient, Reliable, Maintainable
• Scalable
• Will try to avoid hardware issues…
4
Patterns for Efficient Computation
• Patterns: valuable algorithmic structures commonly
seen in efficient parallel programs
• AKA: algorithm skeletons
• Examples of Patterns in Sequential Programming?
• Linear search
• Recursion ~~ Divide-and-Conquer
• Greedy Algorithm
5
Goals of This Book
• Capture Algorithmic Intent
• Avoid mapping algorithms to particular hardware
• Focus on
• Performance
• Low-overhead implementation
• Achieving efficiency & scalability
• Think Parallel
6
Scalable
• Scalable: ability to efficiently apply an algorithm to ‘any’
number of processors
• Considerations:
• Bottleneck: situation in which productive work is delayed
due to limited resources relative to amount of work;
congestion occurring when work arrives more quickly than it
can be processed
• Overhead: cost of implementation of a (parallel) algorithm
that is not part of the solution itself – often due to the
distribution or collection of data & results
7
Serialization (1.1)
• The act of putting a set of operations into a specific order
• Serial Semantics & Serial Illusion
• But were computers really serial?
• Have we become overly dependent on sequential strategy?
• Pipelining, Multi-tasking
8
Can we ignore parallel strategies?
• Improved Performance
• The Past
• The Future
• Why is my new faster computer not faster??
• Deterministic algorithms
• End result vs. Intermediate results
• Timing issues – Relative vs. Absolute
9
Approach of THIS Book
• Structured Patterns for parallelism
• Eliminate non-determinism as much as possible
10
Serial Traps
• Programmers think serial
(sequential)
• List of instructions,
executed in serial order
• Standard programming
training
• Serial Trap: Assumption of
serial ordering
Consider:
A = B + 7
Read Z
C = X * 2
Does order matter??
A = B + 7
Read A
B = X * 2
11
Example:
Search web for a particular search phrase
What does parallel_for imply? Are both correct? Will the
results differ? What about time?
for (i = 0; i < num_web_sites; ++i)
    search(search_phrase, website[i]);

parallel_for (i = 0; i < num_web_sites; ++i)
    search(search_phrase, website[i]);
12
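As a concrete illustration of the parallel_for idea, here is a minimal sketch using Intel TBB (which the slides introduce later); the search function and the website list are hypothetical stand-ins for the slide's pseudocode, not code from the text:

#include <string>
#include <vector>
#include <tbb/parallel_for.h>

// Hypothetical stand-in for the slide's search(): scan one "site" for the phrase.
void search(const std::string& phrase, const std::string& site)
{
    if (site.find(phrase) != std::string::npos) { /* record a hit */ }
}

void search_all(const std::string& phrase, const std::vector<std::string>& website)
{
    // Each iteration is independent, so the runtime may execute them
    // in any order and on any number of worker threads.
    tbb::parallel_for(std::size_t(0), website.size(), [&](std::size_t i) {
        search(phrase, website[i]);
    });
}

Because the iterations do not depend on each other, both loops produce the same set of hits; only the order in which sites are visited, and the total time, may differ.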
Can we apply “parallel” to previous
example?? Results?
parallel_do
{
A = B + 7
Read Z
C = X * 2
}
parallel_do
{
A = B + 7
Read A
B = X * 2
}
13
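One way to answer the questions on the last two slides: in the first block, A = B + 7, Read Z, and C = X * 2 read and write disjoint variables, so they can execute in any order, or all at once, with the same result. In the second block every pair of statements shares a variable that at least one of them writes: A is written by the first statement and touched by Read A, and B is read by the first statement but overwritten by B = X * 2. Reordering or running these three statements in parallel can therefore change the result, so parallel_do is only safe for the first block.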
Performance (1.2)
Complexity of a Sequential Algorithm
• Time complexity
• Big Oh
• What do we mean?
• Why do we measure performance?
Complexity of a Parallel Algorithm
• Time
• How can we compare?
• Work
• How do we split problems across processors?
14
Time Complexity
• Doesn’t really mean time
• Assumptions about measuring “time”
• All instructions take same amount of time
• Computers are same speed
• Are these assumptions true?
• So, what is Time Complexity, really?
15
Example: Payroll Checks
Suppose we want to calculate & print 1000 payroll checks.
Each employee’s pay is independent of others.
parallel_for (i = 0; i < num_employees; ++i)
    Calc_Pay(empl[i]);
Any problems with this?
16
Payroll Checks (cont’d)
• What if all employee data is stored in an array? Can it be
accessed in parallel?
• What about printing* the checks?
• What if computer has 2000 processors?
• Is this application a good candidate for a parallel
program?
• How much faster can we do the job?
17
Example: Sum 1000 Integers
• How can you possibly sum a single list of integers in
parallel?
• What is the optimal number of processors to use?
• Consider number of operations.
18
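One possible answer to how a single list can be summed in parallel (a sketch, not the text's solution): let each worker add up a chunk of the list and then combine the partial sums. With Intel TBB this can be written as a parallel reduction:

#include <vector>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

long parallel_sum(const std::vector<int>& v)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, v.size()),
        0L,                                               // identity for +
        [&](const tbb::blocked_range<std::size_t>& r, long partial) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                partial += v[i];                          // each worker sums its chunk
            return partial;
        },
        [](long x, long y) { return x + y; });            // combine partial sums
}

Counting operations: the total work is still about 999 additions, but the longest chain of dependent additions shrinks to roughly log2(1000), about 10, which is what bounds how much faster the sum can get no matter how many processors are used.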
Actual Time vs. Number of Instructions
Considerations of parallel processing
• Communication
• How is data distributed? Results compiled?
• Shared memory vs. Local memory
• One computer vs. Network (e.g. SETI Project)
• Work
19
SPAN = Critical Path
• The time required to perform the longest chain of tasks
that must be performed sequentially
• SPAN limits speed up that can be accomplished via
parallelism
• See Figures 1.1 & 1.2
• Provides a basis for optimization
20
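A small worked example of the limit the span imposes (numbers invented for illustration): if a program performs 1000 units of work in total but its longest chain of sequentially dependent tasks, its span, is 50 units, then even with unlimited processors the run time is at least 50 units, so the speed-up can never exceed 1000 / 50 = 20.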
Shared Memory & Locality (of reference)
• Shared memory model – book assumption
• Communication is simplified
• Causes bottlenecks ~~ Why? Locality
• Locality
• Memory (data) accesses tend to be close together
• Accesses near each other (time & space) tend to be cheaper
• Communication – best when none
• More processors is not always better
21
Amdahl’s Law (Argument)
• Proposed in 1967
• Provides upper bound on speed-up attainable via
parallel processing
• Sequential vs. Parallel parts of a program
• Sequential portion limits parallelism
22
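Stated as a formula (a standard form of the argument, not quoted from the text): if a fraction s of the program must run sequentially and the remainder parallelizes perfectly over N processors, then speed-up is at most 1 / (s + (1 - s)/N). For example, with s = 0.1 (10% sequential), the speed-up is about 6.4 on 16 processors and can never exceed 10, however many processors are added.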
Work-Span Model
3 components to consider when parallelizing:
• Total Work
• Span
• Communication
Example: Consider a program with 1000 operations
• Are ops independent?
• OFTEN, parallelizing adds to the total number of operations to be performed
23
Reduction
• Communication strategy for “reducing” information held by n workers down to 1 worker
• O(log p) steps, where p is the number of workers/processors
• Example: 16 processors hold 1 integer each & the integers
need to be added together.
• How can we effectively do this? (See the sketch after this slide.)
24
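A minimal sketch of the tree idea for the 16-processor example (not from the text; here the "processors" are simulated by positions in an array): pair values up and add them, halving the number of partial sums in each of the log2(16) = 4 rounds.

#include <vector>

// Each round adds pairs that are 'stride' apart; after log2(n) rounds
// vals[0] holds the total. The additions within a round are independent,
// so each round could be executed in parallel.
int tree_reduce(std::vector<int> vals)          // e.g. 16 values, one per "processor"
{
    std::size_t n = vals.size();                // assume n is a power of two
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 0; i + stride < n; i += 2 * stride)
            vals[i] += vals[i + stride];
    return vals[0];
}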
Broadcast
• The “opposite” of Reduction
• Communication strategy for distributing data from 1
worker to n workers
• Example: 1 processor holds 1 integer & the integer needs to
be distributed to the other 15 processors
• How can we effectively do this?
25
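The same tree idea works in reverse (a worked example of the question above): in the first round the processor holding the integer sends it to one other processor, in the second round both holders send to two more, and so on, doubling the number of holders each round. All 16 processors have the value after log2(16) = 4 rounds, instead of the 15 rounds a one-at-a-time scheme would need.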
Load Balancing
• A process that ensures all processors have approximately the same amount of work/instructions
• To ensure processors are not idle
26
Granularity
• A measure of the degree to which a problem is broken
down into smaller parts, particularly for assigning each
part to a separate processor in a parallel computer
• Coarse granularity: Few parts, each relatively large
• Fine granularity: Many parts, each relatively small
27
Pervasive Parallelism &
Hardware Trends (1.3.1)
Moore’s Law – 1965, Gordon Moore of Intel
• Number of transistors on a silicon chip doubles
every 2 years (approximately)
• Until 2004 – increased transistor switching speeds →
increased clock rate → increased performance
• Note Figures 1.3 & 1.4 in text
• BUT around 2003 no more increase in clock rate – 3 GHz
28
Walls limiting single processor performance
• Power wall: power consumption grows nonlinearly with clock speed
• Instruction-level parallelism wall: little additional low-level parallelism
left to extract (automatically) – only constant-factor improvement
• Superscalar instructions
• Very Large Instruction Word (by compiler)
• Pipelining – max 10 stages
• Memory wall: difference between processor & memory speeds
• Latency
• Bandwidth
29
Historical Trends (1.3.2)
• Early Parallelism – WW2
• Newer HW features
• Larger word size
• Superscalar
• Vector
• Multithreading
• Parallel ALUs
• Pipelines
• GPUs
• Virtual memory
• Prefetching
• Cache
• Benchmarks
• Standard programs used to compare performance on
different computers or to demonstrate a computer’s capabilities
30
Explicit Parallel Programming (1.3.3)
• Serial Traps: unnecessary assumptions deriving from
serial programming & execution
• Programmer assumes serial execution, so gives no
consideration to possible parallelism
• Later, it is not possible to parallelize
• Automatic Parallelism ~ Different strategies
• Absolute automatic – no help from programmer – compiler
• Parallel constructs – “optional”
31
Examples of Automatic Parallelism
Consider the following
simple program:
1. A = B + C
2. D = 5 * X + Y
3. E = P + Z / 7
4. F = B + X
5. G = A * 10
Rule 1: Parallelize a block of sequential instructions
with no repeated variables.
Rule 2: Parallelize any block of sequential instructions
if the repeated variables don’t change (never appear on
the left-hand side).
What about instruction 5? What can we do? (See the sketch below.)
32
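A sketch of the same dependence analysis in ordinary C++ (not the text's notation; the initial values are invented for illustration): statements 1–4 share only variables that are never written, so they may run concurrently, while statement 5 reads A and must wait for statement 1 to finish.

#include <future>

int main()
{
    int B = 1, C = 2, X = 3, Y = 4, P = 5, Z = 14;

    // Statements 1-4: no variable written by one is used by another.
    auto a = std::async(std::launch::async, [&] { return B + C; });      // 1: A = B + C
    auto d = std::async(std::launch::async, [&] { return 5 * X + Y; });  // 2: D = 5*X + Y
    auto e = std::async(std::launch::async, [&] { return P + Z / 7; });  // 3: E = P + Z/7
    auto f = std::async(std::launch::async, [&] { return B + X; });      // 4: F = B + X

    int A = a.get();      // statement 5 needs A, so wait for statement 1
    int G = A * 10;       // 5: G = A * 10
    int D = d.get(), E = e.get(), F = f.get();
    (void)D; (void)E; (void)F; (void)G;
    return 0;
}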
Parallelization Problems
• Pointers allow a data structure to be distributed across
memory
• Parallel analysis is very difficult
• Loops can accommodate or restrict or hide possible
parallelization
void addme(int n, double a[n], double b[n], double c[n])
{
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
33
Examples
void addme(int n, double a[n], double b[n], double c[n])
{
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

double a[10];
a[0] = 1;
addme(9, a+1, a, a);
34
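Note on the call above: because a+1, a, and a alias the same array, iteration i effectively computes a[i+1] = a[i] + a[i]. Each iteration depends on the result of the previous one (the call fills a with powers of 2), so the loop cannot safely be parallelized even though the function body looks data-parallel. This is exactly the kind of dependence that pointer aliasing hides from an automatic parallelizer.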
Examples
// Serial version
void callme()
{
    foo();
    bar();
}

// Cilk Plus version: foo() may run in parallel with bar()
void callme()
{
    cilk_spawn foo();
    bar();
}
Mandatory Parallelism vs. Optional Parallelism
“Explicit parallel programming
constructs allow algorithms to
be expressed without
specifying unintended &
unnecessary serial constraints”
35
Structured Pattern-based Programming (1.4)
• Patterns: commonly recurring strategies for dealing with
particular problems
• Tools for Parallel problems
• Goals: parallel scalability, good data locality, reduced overhead
• Common in CS: OOP, Natural language proc., data
structures, SW engineering
36
Patterns
• Abstractions: strategies or approaches which help to hide
certain details & simplify problem solving
• Implementation Patterns – low-level, system specific
• Design Patterns – high-level abstraction
• Algorithm Strategy Patterns – Algorithm Skeletons
• Semantics – pattern = building block, task arrangement, data
dependencies, abstract
• Implementation – for real machine, granularity, cache
• Ideally, treat separately, but not always possible
37
Figure 1.1 – Overview of Parallel Patterns
38
Patterns – 3 most common
• Nesting
• Composability
• Map
• Regular parallelism, Embarrassing Parallelism
• Fork-Join
• Divide & Conquer
39
Parallel Programming Models (1.5)
Desired Properties (1.5.1)
• Contemporary, popular languages – not designed for parallelism
• Need transition to parallel
• Desired Properties of Parallel Language (p. 21)
• Performance – achievable, scalable, predictable, tunable
• Productivity – expressive, composable, debuggable,
maintainable
• Portability – functionality & performance across systems
(compilers & OS)
40
Programming – C, C++
Textbook – Focus on
• C++ & parallel support
• Intel support
• Intel Threading Building Blocks (TBB) – Appendix C
• C++ Template Library, open source & commercial
• Intel Cilk Plus – Appendix B
• Compiler extensions for C & C++, open source & commercial
• Other products available – Figure 1.2
41
Abstractions vs. Mechanisms (1.5.2)
• Avoid HW Mechanisms
• Particularly vectors & threads
• Focus on
• TASKS – opportunities for parallelism
• DECOMPOSITION – breaking problem down
• DESIGN of ALGORITHM – overall strategy
42
Abstractions, not Mechanisms
“(Parallel) programming should focus on decomposition
of problem & design of algorithm rather than specific
mechanisms by which it will be parallelized.” p. 23
Reasons to avoid HW specifics (mechanisms)
• Reduced portability
• Difficult to manage nested parallelism
• Mechanisms vary by machine
43
Regular Data Parallelism (1.5.3)
• Key to scalability – Data Parallelism
• Divide up DATA not CODE!
• Data Parallelism: any form of parallelism in which the
amount of work grows with the size of the problem
• Regular Data Parallelism: subcategory of D.P. which
maps efficiently to vector instructions
• Parallel languages contain constructs for D.P.
44
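For example, the earlier addme loop is regular data parallelism: every iteration performs the same operation on adjacent elements, so it maps directly onto vector instructions. Two ways to express this are sketched below; the first uses Cilk Plus array notation (covered in the book's Appendix B), the second a plain vectorization hint. Both are illustrations only, assuming a, b, c, and n are already defined:

// Cilk Plus array notation: a[start:length]
a[0:n] = b[0:n] + c[0:n];

// Or, with an OpenMP vectorization hint in plain C/C++:
#pragma omp simd
for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];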
Composability (1.5.4)
The ability to use a feature in a program without regard
to other features being used elsewhere
• Issues:
• Incompatibility
• Inability to support hierarchical composition (nesting)
• Oversubscription: situation in nested parallelism in
which a very large number of threads are created
• Can lead to failure, inefficiency, inconsistency
45
Thread (p. 387)
• Smallest sequence of
program instructions that
can be managed
• Program → Processes → Threads
• “Cheap” Context Switch
• Multiple Processors
[Diagram: one process containing shared data and three threads (Thr1, Thr2, Thr3), each with its own stack]
46
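A minimal sketch of the picture above in standard C++ (not from the text): one process, data shared by all threads, and each thread with its own stack.

#include <thread>
#include <vector>
#include <atomic>

int main()
{
    std::atomic<int> shared_counter{0};            // data shared by all threads in the process

    std::vector<std::thread> workers;
    for (int t = 0; t < 3; ++t)                    // Thr1, Thr2, Thr3
        workers.emplace_back([&shared_counter] {
            int local = 42;                        // lives on this thread's own stack
            shared_counter += local;               // update the shared data
        });

    for (auto& w : workers)
        w.join();                                  // wait for all threads to finish
    return 0;
}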
Portability & Performance (1.5.5 & 1.5.6)
• Portable: the ability to run on a variety of HW with little adjustment
• Very desirable; C, C++, Java are portable languages
• Performance Portability: the ability to maintain
performance levels when run on a variety of HW
• Trade-offs
General/Portable ↔ Specific/Performance
47
Issues (1.5.7)
• Determinism vs.
Non-determinism
• Safety – ensuring only
correct orderings occur
• Serially Consistent
• Maintainability
48