Introduction to Parallel Processing
Shantanu Dutt
University of Illinois at Chicago
Acknowledgements
• Ashish Agrawal, IIT Kanpur, “Fundamentals of Parallel Processing” (slides), w/ some modifications and augmentations by Shantanu Dutt
• John Urbanic, “Parallel Computing: Overview” (slides), w/ some modifications and augmentations by Shantanu Dutt
• John Mellor-Crummey, “COMP 422 Parallel Computing: An Introduction”, Department of Computer Science, Rice University (slides), w/ some modifications and augmentations by Shantanu Dutt
Outline
• The need for explicit multi-core/processor parallel processing:
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Applications for parallel processing:
  – Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary
Moore’s Law & the Need for Parallel Processing
• Chip performance doubles every 18-24 months.
• Power consumption is proportional to frequency.
• Limits of serial computing:
  – Heating issues
  – Limits to transmission speeds
  – Leakage currents
  – Limits to miniaturization
• Multi-core processors are already commonplace.
• Most high-performance servers are already parallel.
Quest for Performance
• Pipelining
• Superscalar architecture
• Out-of-order execution
• Caches
• Instruction set design advancements
• Parallelism:
  – Multi-core processors
  – Clusters
  – Grids
  This is the future.
Pipelining
• Illustration of a pipeline using the fetch, load, execute, and store stages.
• At the start of execution – wind-up (filling the pipeline).
• At the end of execution – wind-down (draining the pipeline).
• Pipeline stalls due to data dependencies (RAW, WAR), resource conflicts, or incorrect branch prediction hurt performance and speedup.
• Pipeline depth – the number of stages, i.e., the number of instructions that can be in execution simultaneously.
• Intel Pentium 4 – 35 stages.
Pipelining (cont'd)
• T_pipe(n), the pipelined time to process n instructions, = fill-time + n*max{t_i} ≈ n*max{t_i} for large n (since fill-time is a constant w.r.t. n), where t_i = execution time of the i'th stage.
• The pipelined throughput = 1/max{t_i}.
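A small numeric sketch of the formula above (written in C; the stage times and instruction count are made-up values, and fill-time is taken here as the sum of the stage times):

#include <stdio.h>

int main(void) {
    double t[] = {1.0, 1.5, 2.0, 1.0};   /* hypothetical per-stage times t_i (ns) */
    int k = (int)(sizeof t / sizeof t[0]);
    long n = 1000000;                    /* number of instructions */

    double sum = 0.0, tmax = 0.0;
    for (int i = 0; i < k; i++) {
        sum += t[i];
        if (t[i] > tmax) tmax = t[i];
    }

    double t_seq  = n * sum;             /* unpipelined: every instr. takes sum{t_i} */
    double t_pipe = sum + n * tmax;      /* fill-time + n*max{t_i}, as in the formula above */
    printf("speedup = %.2f (asymptotic bound = sum{t_i}/max{t_i} = %.2f)\n",
           t_seq / t_pipe, sum / tmax);
    return 0;
}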
Cache
• Desire for fast, cheap, and non-volatile memory.
• Memory speed grows at ~7% per annum while processor speed grows at ~50% per annum.
• Cache – a fast, small memory.
• L1 and L2 caches.
• Retrieval from main memory takes several hundred clock cycles; retrieval from the L1 cache takes on the order of one clock cycle, and from the L2 cache on the order of 10 clock cycles.
• Cache ‘hit’ and ‘miss’.
• Prefetching is used to avoid cache misses at the start of program execution.
• Cache lines are used to amortize the latency of a cache miss.
• Order of search: L1 cache -> L2 cache -> RAM -> Disk.
• Cache coherency – correctness of shared data; important for distributed parallel computing.
• Limit to cache improvement: improving cache performance can at most bring memory-access efficiency up to match processor efficiency.
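A minimal sketch (in C, not from the slides) of why cache lines and access order matter: the first loop nest walks the matrix in row-major order, so each fetched cache line is fully used before it is evicted; the second walks it column-by-column with a large stride and misses far more often, typically running several times slower.

#include <stdio.h>

#define N 2048

static double a[N][N];   /* ~32 MB, zero-initialized */

int main(void) {
    double sum = 0.0;

    /* cache-friendly: innermost loop touches consecutive addresses */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* cache-unfriendly: innermost loop strides by N doubles per access */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);  /* prevents the compiler from removing the loops */
    return 0;
}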
[Slides 10–16: figures on uni-processor performance techniques. Surviving annotations: instruction-level parallelism (degree generally low and dependent on how the sequential code has been written, so not very effective); single-instruction multiple-data (SIMD) units; examples of limited data parallelism; examples of limited, low-level functional parallelism; simultaneous multithreading (SMT); multi-threading.]
Thus …: Two Fundamental Issues in Future High Performance
• Microprocessor performance improvement via various implicit and explicit parallelism schemes and technology improvements is reaching (has reached?) a point of diminishing returns.
• Thus we need the development of explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and that exploit that parallelism with minimum interaction/communication between the parallel parts.
Outline
• The need for explicit multi-core/processor parallel processing:
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Applications for parallel processing:
  – Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary
Computing and Design/CAD
• Designs of complex to very complex systems have almost become the norm in many areas of engineering, from chips with billions of transistors, to aircraft of various levels of sophistication (large fly-by-wire passenger aircraft to fighter planes), to complex engines, to buildings and bridges.
• An effective design process needs to explore the design space in smart ways (without being exhaustive, but also without leaving out useful design points) to optimize some metric (e.g., minimizing the power consumption of a chip) while satisfying tens to hundreds of constraints on others (e.g., on the speed and temperature profile of the chip). This is an extremely time-intensive process for large and complex designs and can benefit significantly from parallel processing.
Applications of Parallel Processing
[Slides 22–29: figures illustrating applications of parallel processing.]
Outline
• The need for explicit multi-core/processor parallel processing:
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Applications for parallel processing:
  – Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary and future advances


Parallelism – A Simplistic Understanding
• Multiple tasks at once.
• Distribute work into multiple execution units.
• A classification of parallelism:
  – Data parallelism
  – Functional or control parallelism
• Data parallelism – divide the dataset and solve each sector “similarly” on a separate execution unit.
• Functional parallelism – divide the 'problem' into different tasks and execute the tasks on different units. What would functional parallelism look like for the example on the right?
• Hybrid: can do both: say, first partition by data, and then for each data block, partition by functionality.
[Figure: a sequential computation and its data-parallel breakup across execution units.]
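A minimal OpenMP sketch (in C, not from the slides) contrasting the two kinds of parallelism above; the arrays and the two "functions" (a sum and a max over the result) are made up for illustration:

#include <stdio.h>
#include <omp.h>

#define N 1000000
static double a[N], b[N], c[N];

int main(void) {
    /* Data parallelism: the same operation, with the dataset split across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Functional (control) parallelism: two different tasks run concurrently,
       each on its own thread. */
    double sum = 0.0, maxval = c[0];
    #pragma omp parallel sections
    {
        #pragma omp section
        {                                   /* task 1: accumulate a sum */
            for (int i = 0; i < N; i++) sum += c[i];
        }
        #pragma omp section
        {                                   /* task 2: find the maximum */
            for (int i = 0; i < N; i++) if (c[i] > maxval) maxval = c[i];
        }
    }
    printf("sum = %f, max = %f\n", sum, maxval);
    return 0;
}

Compile with, e.g., gcc -fopenmp. A hybrid breakup would further split each section's loop across threads as well.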
Functional Parallelism
[Figure: an Earth weather model as an example of functional parallelism.]
• Q: What would a data-parallel breakup look like for this problem?
• Q: How can a hybrid breakup be done?
Flynn’s Classification
• Flynn's classical taxonomy is based on the number of instruction/task streams and data streams:
  – Single Instruction, Single Data streams (SISD): your single-core uni-processor PC.
  – Single Instruction, Multiple Data streams (SIMD): special-purpose, low-granularity multi-processor machines with a single control unit relaying the same instruction to all processors (with different data) every clock cycle (e.g., an nVIDIA graphics co-processor with 1000’s of simple cores).
  – Multiple Instruction, Single Data streams (MISD): pipelining is a major example.
  – Multiple Instruction, Multiple Data streams (MIMD): the most prevalent model. SPMD (Single Program Multiple Data) is a very useful subset. Note that this is very different from SIMD. Why?
• Data vs. control parallelism is a classification independent of Flynn's.
Flynn’s Classification (cont’d)
[Figures: block diagrams of the Flynn architecture classes.]
• SIMD example machines: Thinking Machines CM 2000, nVIDIA GPUs.
• MIMD example machines: various current multicomputers (see the most recent list at http://www.top500.org/) and multi-core processors like the Intel i3, i5, i7 processors (quad-core: 4 processors on a single chip).
Flynn’s Classification (cont’d)
• Data parallelism: SIMD and SPMD fall into this category.
• Functional parallelism: MISD falls into this category.
• MIMD can incorporate both data and functional parallelism (the latter either at the instruction level, with different instructions being executed across the processors at any time, or at the level of high-level functions).
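Since SPMD means every processor runs the same program and differentiates its behavior by its process id (unlike SIMD, where one control unit drives all processing elements in lockstep), a minimal MPI sketch (not from the slides) looks like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my id: 0 .. size-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0)
        printf("I am the master of %d processes\n", size);  /* master-only branch */
    else
        printf("I am worker %d\n", rank);                   /* worker branch */

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with, e.g., mpirun -np 4 ./a.out, all four processes execute this same binary; the branch on the rank is what lets them do different work.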
Outline
• The need for explicit multi-core/processor parallel processing:
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Applications for parallel processing:
  – Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary
Parallel Arch. Classification
Multi-processor architectures:
• Distributed memory with message passing—the most prevalent architecture model for # processors > 8. Two flavors of interconnection:
  – Indirect interconnection networks
  – Direct interconnection networks
• Shared memory:
  – Uniform Memory Access (UMA)
  – Non-Uniform Memory Access (NUMA)—distributed shared memory
Distributed Memory—Message-Passing Architectures
• Each processor P (with its own local cache C) is connected to exclusive local memory, i.e., no other CPU has direct access to it.
• Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
• On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.
• Blocking vs. non-blocking communication:
  – Blocking: computation stalls until the communication occurs/completes.
  – Non-blocking: if no communication has occurred/completed at the calling point, computation proceeds to the next instruction/statement (later calls to the communication primitive will be required until the communication occurs).
• Direct vs. indirect communication/interconnection networks.
[Figure: a 2x4 mesh network, an example of a direct interconnection network.]
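A small MPI sketch (not from the slides) of the blocking vs. non-blocking distinction: rank 1 sends a value, rank 0 posts a non-blocking receive, keeps computing, and only waits when it actually needs the data.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double b = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        b = 3.14;
        MPI_Send(&b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);        /* blocking send */
    } else if (rank == 0) {
        MPI_Request req;
        MPI_Irecv(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req); /* non-blocking receive */
        /* ... computation that does not need b can proceed here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                        /* now stall until b arrives */
        printf("got b = %f\n", b);
    }
    MPI_Finalize();
    return 0;
}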
The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)
• Has 56 compute nodes/computers and a master node.
  – “Master” here has a different meaning—generally a system front-end where you log in and perform various tasks before submitting your parallel code to run on several compute nodes—than the “master” node in a parallel algorithm (e.g., the one we saw for the finite-element heat distribution problem), which would actually be one of the compute nodes, and generally distributes data to the other compute nodes, monitors progress of the computation, determines the end of the computation, etc., and may also additionally perform a part of the computation.
• Compute nodes are divided among 14 zones, each zone containing 4 nodes connected as a ring network. Zones are connected to each other by a higher-level network.
• Each node (compute or master) has 2 processors. The processors on some nodes are single-core, and dual-core on others; see http://accc.uic.edu/service/arg/nodes
System Computational Actions in a Message-Passing Program
(a) Two basic parallel processes X and Y, and their data dependency (X needs the value of “b” produced by Y):
  Proc. X: a := b+c;
  Proc. Y: b := x*y;
(b) Their mapping to a message-passing multicomputer: X runs on processor/core P(X) (= P1), Y runs on P(Y) (= P2), and data item “b” is message-passed over the direct or indirect link(s) between the two processors:
  Proc. X: recv(P2, b);  /* blocking */
           a := b+c;
  Proc. Y: b := x*y;
           send(P1, b);  /* non-blocking */
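The same X/Y example can be written directly in MPI; this hedged sketch (ranks, tags, and values are illustrative assumptions) maps process Y to rank 1 and process X to rank 0:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double a, b, c = 1.0, x = 2.0, y = 3.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                              /* process Y */
        b = x * y;
        MPI_Send(&b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);     /* send b to X's processor */
    } else if (rank == 0) {                       /* process X */
        MPI_Recv(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);              /* blocking: waits until b has arrived */
        a = b + c;
        printf("a = %f\n", a);
    }
    MPI_Finalize();
    return 0;
}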
Shared Memory Arch.: UMA
• Flat memory model.
• Memory bandwidth and latency are the same for all processors and all memory locations.
• Simplest example – a dual-core processor.
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
• Cache-coherent UMA: consistent cache values of the same data item in the different processor/core caches.
[Figure: dual-core and quad-core chips, showing L1 and L2 caches.]
System Computational Actions in a Shared-Memory Program
(a) Two basic parallel processes X and Y, and their data dependency:
  Proc. X: a := b+c;
  Proc. Y: b := x*y;
(b) Their mapping to a shared-memory multiprocessor: P(X) and P(Y) both access “b” in the shared memory.
Possible actions by the O.S. for Proc. Y (the writer of “b”):
(i) Since “b” is a shared data item (e.g., designated by the compiler or programmer), check “b”’s location to see if it can be written to (all reads done: read_cntr for “b” = 0).
(ii) If so, write “b” to its location and mark its status bit as written by “Y” (or increment its write counter if “b” will be written to multiple times by “Y”).
(iii) Initialize the read_cntr for “b” to a pre-determined value.
Possible actions by the O.S. for Proc. X (the reader of “b”):
(i) Since “b” is a shared data item (e.g., designated by the compiler or programmer), check “b”’s status bit to see if it has been written to (or, more generally, check a write counter to see if it has a new value since the last read).
(ii) If so, {read “b” and decrement the read_cntr for “b”}; else go to (i) and busy-wait (check periodically).
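The status-bit/busy-wait protocol above can be sketched in C11 (using <stdatomic.h> and <threads.h>, where available; the same idea works with pthreads). The flag below plays the role of the status bit, and the variable names are assumptions for illustration.

#include <stdio.h>
#include <stdatomic.h>
#include <threads.h>

static double b;                       /* the shared data item */
static atomic_int b_ready = 0;         /* plays the role of "b"'s status bit */

static int writer(void *arg) {         /* process/thread Y */
    double x = 2.0, y = 3.0;
    b = x * y;
    atomic_store_explicit(&b_ready, 1, memory_order_release);  /* mark "written" */
    return 0;
}

static int reader(void *arg) {         /* process/thread X */
    double c = 1.0;
    while (!atomic_load_explicit(&b_ready, memory_order_acquire))
        ;                              /* busy-wait until b has been written */
    printf("a = %f\n", b + c);
    return 0;
}

int main(void) {
    thrd_t tw, tr;
    thrd_create(&tr, reader, NULL);
    thrd_create(&tw, writer, NULL);
    thrd_join(tw, NULL);
    thrd_join(tr, NULL);
    return 0;
}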
Distributed Shared Memory Arch.: NUMA
• Memory is physically distributed but logically shared.
• The physical layout is similar to the distributed-memory (message-passing) case.
• The aggregated memory of the whole system appears as one single address space.
• Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory (“local” vs. “remote” access).
• Example: two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing architectures, only here these links are used by the O.S., not by the programmer, to transmit read/write non-local data to/from a processor/non-local memory).
• Advantage – scalability (compared to UMAs).
• Disadvantage – (a) locality problems and connection congestion; (b) not a natural parallel programming/algorithm model (it is easier to partition data among processors than to think of all of it occupying a large monolithic address space that each processor can access).
[Figure: locality domains with an all-to-all (complete-graph) connection via a combination of direct and indirect connections.]
Outline
• The need for explicit multi-core/processor parallel processing:
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Applications for parallel processing:
  – Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary
An example parallel algorithm for a finite-element computation
• Easy parallel situation – each data part is independent. No communication is required between the execution units solving two different parts. E.g., matrix multiplication.
• Next level: simple, structured, and sparse communication is needed.
  – Example: the heat equation (more generally, a Poisson equation solver).
  – The initial temperature is zero on the boundaries and high in the middle.
  – The boundary temperature is held at zero.
  – The calculation of an element is dependent upon its neighbor elements.
• Q: Is the partition in the figure a good data partition for N data elements (grid points) and P processors? Analysis?

Serial code:
repeat
  do y = 2, N-1
    do x = 2, M-1
      u2(x,y) = u1(x,y) + cx*[u1(x+1,y) + u1(x-1,y) - 2*u1(x,y)]
                        + cy*[u1(x,y+1) + u1(x,y-1) - 2*u1(x,y)]   /* cx, cy are const. */
    enddo
  enddo
  u1 = u2
until convergence (u1 ~ u2)

[Figure: a 1-D partition of the grid into blocks data1, data2, ..., dataP; data communication is needed between processes working on adjacent data blocks.]
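A runnable C rendering of the serial sweep above (a sketch only: the grid size, the values of cx and cy, the initial hot region, and the convergence tolerance are illustrative assumptions):

#include <stdio.h>
#include <math.h>

#define M 64
#define N 64

static double u1[M+2][N+2], u2[M+2][N+2];   /* rows/cols 0 and M+1/N+1 are the zero boundary */

int main(void) {
    const double cx = 0.25, cy = 0.25, tol = 1e-6;

    /* initial temperature: zero on the boundaries, high in the middle */
    for (int x = M/4; x <= 3*M/4; x++)
        for (int y = N/4; y <= 3*N/4; y++)
            u1[x][y] = 100.0;

    double diff;
    do {                                      /* "repeat ... until convergence" */
        diff = 0.0;
        for (int x = 1; x <= M; x++)
            for (int y = 1; y <= N; y++) {
                u2[x][y] = u1[x][y]
                         + cx * (u1[x+1][y] + u1[x-1][y] - 2.0*u1[x][y])
                         + cy * (u1[x][y+1] + u1[x][y-1] - 2.0*u1[x][y]);
                diff = fmax(diff, fabs(u2[x][y] - u1[x][y]));
            }
        for (int x = 1; x <= M; x++)          /* u1 = u2 */
            for (int y = 1; y <= N; y++)
                u1[x][y] = u2[x][y];
    } while (diff > tol);                     /* converged when u1 ~ u2 */

    printf("converged; center value = %f\n", u1[M/2][N/2]);
    return 0;
}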
SPMD master/worker pseudocode for the heat-equation solver (the master can be one of the workers):

 1. find out if I am MASTER or WORKER
 2. if I am MASTER
 3.   initialize array
 4.   send each WORKER starting info and subarray
 5.   do until all WORKERS converge
 6.     gather from all WORKERS convergence data
 7.     broadcast to all WORKERS convergence signal
 8.   end do
 9.   receive results from each WORKER
10. else if I am WORKER
11.   receive from MASTER starting info and subarray
12.   do until solution converged {
13.     send (non-blocking?) neighbors my border info
14.     receive (non-blocking?) neighbors border info
15.     update interior of my portion of solution array (see the computation given in the serial code)
16.     wait for any incomplete non-blocking receive to complete, by busy waiting or a blocking receive
17.     update border of my portion of solution array
18.     determine if my solution has converged
19.     if so {send MASTER convergence signal; recv. from MASTER convergence signal}
20.   end do }
21.   send MASTER results
22. endif

[Figure: the problem grid partitioned into subarrays among the workers.]
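Below is a hedged MPI sketch of the WORKER loop above (lines 12–20) for a 1-D row-block partition. It is not the course's actual program: the master's data distribution and result collection are omitted, the convergence check uses an MPI_Allreduce rather than an explicit gather/broadcast by the master, and all sizes and constants are illustrative assumptions.

#include <mpi.h>
#include <math.h>
#include <string.h>

#define NR 64      /* rows owned per process (local rows 1..NR; 0 and NR+1 are ghost rows) */
#define NC 256     /* columns (0 and NC-1 are the fixed zero boundary) */

static double u1[NR+2][NC], u2[NR+2][NC];
static const double CX = 0.25, CY = 0.25, TOL = 1e-6;

static double update_row(int x) {            /* one row of the serial computation */
    double d = 0.0;
    for (int y = 1; y <= NC-2; y++) {
        u2[x][y] = u1[x][y] + CX*(u1[x+1][y] + u1[x-1][y] - 2.0*u1[x][y])
                            + CY*(u1[x][y+1] + u1[x][y-1] - 2.0*u1[x][y]);
        d = fmax(d, fabs(u2[x][y] - u1[x][y]));
    }
    return d;
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* (receiving the starting info/subarray from the MASTER is omitted here) */
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  /* MPI_PROC_NULL turns the  */
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;  /* edge exchanges into no-ops */

    int done = 0;
    while (!done) {
        MPI_Request rq[4];
        /* lines 13-14: exchange border rows with the two neighbors, non-blocking */
        MPI_Irecv(u1[0],    NC, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &rq[0]);
        MPI_Irecv(u1[NR+1], NC, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &rq[1]);
        MPI_Isend(u1[1],    NC, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &rq[2]);
        MPI_Isend(u1[NR],   NC, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &rq[3]);

        double diff = 0.0;
        for (int x = 2; x <= NR-1; x++)               /* line 15: interior rows need no ghosts */
            diff = fmax(diff, update_row(x));

        MPI_Waitall(4, rq, MPI_STATUSES_IGNORE);      /* line 16: wait for the halo exchange */

        diff = fmax(diff, update_row(1));             /* line 17: border rows use ghost rows */
        diff = fmax(diff, update_row(NR));

        memcpy(u1[1], u2[1], NR * NC * sizeof(double));   /* u1 = u2 for the owned rows */

        /* lines 18-19, simplified: a global max of the local changes decides convergence */
        double gdiff;
        MPI_Allreduce(&diff, &gdiff, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        done = (gdiff < TOL);
    }
    /* (sending the results back to the MASTER is omitted here) */
    MPI_Finalize();
    return 0;
}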
An example of an SPMD message-passing parallel program
[Figure: SPMD message-passing code listing.]

SPMD message-passing parallel program (cont'd)
[Figure: continuation of the code listing; surviving annotation: "node xor D".]
How to interconnect the multiple cores/processors is a major consideration in a parallel architecture.
Summary
• Serial computers/microprocessors will probably not get much faster, so parallelization is unavoidable:
  – Pipelining, caches, and other optimization strategies for serial computers are reaching a plateau.
  – The heat wall has also been reached.
• Application examples.
• Data and functional parallelism.
• Flynn's taxonomy: SIMD, MISD, MIMD/SPMD.
• Parallel architectures intro:
  – Distributed memory (message passing)
  – Shared memory: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA, i.e., distributed shared memory)
• Parallel program/algorithm examples.
Additional References
• Computer Organization and Design – Patterson and Hennessy
• Modern Operating Systems – Tanenbaum
• Concepts of High Performance Computing – Georg Hager and Gerhard Wellein
• “Cramming More Components onto Integrated Circuits” – Gordon Moore, 1965
• Introduction to Parallel Computing – https://computing.llnl.gov/tutorials/parallel_comp
• “The Landscape of Parallel Computing Research: A View from Berkeley”, 2006