CSE 431
Computer Architecture
Spring 2016
Chapter 6B: Introduction to
Message Passing Multiprocessors
Mahmut Taylan Kandemir ( www.cse.psu.edu/~kandemir )
[Adapted from Computer Organization and Design, 5th Edition,
Patterson & Hennessy, © 2014, MK]
Review: Shared Memory Multiprocessors (SMP)
Q1 – Single address space shared by all cores
Q2 – Cores coordinate/communicate through shared
variables in memory (via loads and stores)
Use of shared data must be coordinated via synchronization
primitives (locks) that allow access to data to only one core at a
time
[Figure: multiple cores, each with a private cache, connected by an interconnection network to a shared memory and I/O]
SMPs come in two styles
Uniform memory access (UMA) multiprocessors
Nonuniform memory access (NUMA) multiprocessors
Fork-Join Computation Model
The master thread “forks” into a number of threads which
execute blocks of code in parallel, and then join back into
the master thread when done.
[Figure: the master thread forks into a team of parallel threads, which later join back into the master]
http://en.wikipedia.org/wiki/OpenMP
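A minimal sketch of the fork-join model using OpenMP in C (illustrative only; the thread count and printed messages are whatever the runtime provides):

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      printf("master thread, before the fork\n");

      #pragma omp parallel        /* fork: a team of threads starts here */
      {
          printf("hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }                           /* implicit join: the team ends here   */

      printf("master thread, after the join\n");
      return 0;
  }

Compile with an OpenMP-capable compiler (e.g., gcc -fopenmp); the block between the fork and the join is exactly the "blocks of code in parallel" in the figure above.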
Spin-Locks on Bus Connected ccUMAs
ccUMA = cache-coherent UMA
With a bus based cache coherency protocol (e.g., MSI,
MESI), joins are done via spin-locks which allow cores to
wait on a local copy of the lock variable in their caches
Reduces bus traffic – once the core with the lock releases the
lock (e.g., writes a 0) all other caches see that write and
invalidate their old copy of the lock variable. Unlocking restarts
the ll-sc race to get the lock. The winning core gets the bus
and writes the lock back to 1. The other caches then invalidate
their copy of the lock and on the next lock read fetch the new
lock value (1) from memory.
This scheme has problems scaling up to many cores
because of the communication traffic when the lock is
released and contested
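A rough sketch of this spin-on-a-cached-copy idea, written with C11 atomics rather than the raw ll/sc sequence (the compare-and-swap below typically compiles to an ll/sc or equivalent atomic instruction):

  #include <stdatomic.h>

  static atomic_int lock = 0;                  /* 0 = free, 1 = held */

  void acquire(void) {
      for (;;) {
          /* spin on the locally cached copy: these plain loads generate
             no bus traffic while some other core holds the lock        */
          while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
              ;
          /* the lock looks free: race to grab it atomically            */
          int expected = 0;
          if (atomic_compare_exchange_weak_explicit(&lock, &expected, 1,
                  memory_order_acquire, memory_order_relaxed))
              return;                          /* this core won the race */
      }
  }

  void release(void) {
      /* writing 0 invalidates the other caches' copies, restarting the race */
      atomic_store_explicit(&lock, 0, memory_order_release);
  }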
Message Passing Multiprocessors (MPP)
Each core has its own private address space
Q1 – Cores share data by explicitly sending and
receiving information (message passing)
Q2 – Coordination is built into message passing
primitives (message send and message receive)
[Figure: multiple nodes, each with cores, a cache, and a private memory, connected by an interconnection network]
SMP vs MPP
[Figure: SMP (shared memory, single address space) vs. MPP (distributed memory, one address space per node)]
Communication in Network Connected Multi’s
Implicit communication via loads and stores (SMP)
easy to use (uniform interface to the programmer)
hardware architects have to provide coherent caches and
process (thread) synchronization primitives (like ll and sc)
lower communication overhead
harder to overlap computation with communication
Explicit communication via sends and receives (MPP)
simplest solution for hardware architects
higher communication overhead
easier to overlap computation with communication
communication exposed to the programmer (optimizing it may
be difficult though)
MPP
Local Memory vs Remote Memory
A processor can directly access only its local memory
Accessing remote memory involves explicit message passing (Send and Receive)
Also called multicomputers or clusters
MPP
A cluster of computers
Each with its own processor and memory
An interconnect to pass messages between them
Producer-Consumer Scenario:
- P1 produces data D, uses a SEND to send it to P2
- The network routes the message to P2
- P2 calls a RECEIVE to get the message
Two types of send primitives
- Synchronous: P1 stops until P2 confirms receipt of message
- Asynchronous: P1 sends its message and continues
Standard libraries for message passing:
Most common is MPI – Message Passing Interface
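A minimal MPI sketch of this producer-consumer exchange, assuming two ranks where rank 0 plays P1 and rank 1 plays P2 (run with, e.g., mpirun -np 2):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {                      /* P1: produce data D and SEND it */
          int d = 42;
          MPI_Send(&d, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {               /* P2: RECEIVE the message        */
          int d;
          MPI_Recv(&d, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("P2 received %d\n", d);
      }

      MPI_Finalize();
      return 0;
  }

MPI_Send is the standard (potentially blocking) send; MPI_Isend is the asynchronous variant that lets P1 continue while the message is in flight.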
Pros and Cons of Message Passing
Message sending and receiving is much slower than addition,
for example
But, message passing multiprocessors are much easier for
hardware architects to design
Don’t have to worry about cache coherency for example
The advantage for programmers is that communication is
explicit, so there are fewer “performance surprises” than with
the implicit communication in cache-coherent SMPs
Message passing standard MPI-2.2 (www.mpi-forum.org)
However, it's harder to port a sequential program to a message
passing multiprocessor since every communication must be
identified in advance
With cache-coherent shared memory the hardware figures out what
data needs to be communicated
Who generates the communication code (programmer?
compiler?)
Message Passing Libraries (1)
Many “message passing libraries” were once available
Chameleon, from ANL.
CMMD, from Thinking Machines.
Express, commercial.
MPL, native library on IBM SP-2.
NX, native library on Intel Paragon.
Zipcode, from LLL.
PVM, Parallel Virtual Machine, public, from ORNL/UTK.
Others...
MPI, Message Passing Interface, now the industry
standard.
Need standards to write portable code.
Message Passing Libraries (2)
All communication and synchronization require subroutine calls
No shared variables
The program runs on a single processor just like any uniprocessor
program, except for calls to the message passing library
Subroutines for
Communication
- Pairwise or point-to-point: Send and Receive
- Collectives: all processors get together to
– Move data: Broadcast, Scatter/gather
– Compute and move: sum, product, max, … of data on many
processors
Synchronization
- Barrier
- No locks because there are no shared variables to protect
Enquiries
- How many processes? Which one am I? Any messages waiting?
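A small illustrative MPI fragment covering the synchronization and enquiry calls above (barrier, process count, own rank, pending-message probe):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, size, waiting;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes?   */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which one am I?       */

      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                 &waiting, MPI_STATUS_IGNORE);  /* any messages waiting? */

      MPI_Barrier(MPI_COMM_WORLD);              /* barrier synchronization */
      printf("rank %d of %d passed the barrier (message waiting: %d)\n",
             rank, size, waiting);

      MPI_Finalize();
      return 0;
  }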
Aside: Quick Summary of MPI
The MPI Standard (http://www.mpi-forum.org/docs/docs.html) describes
point-to-point message-passing
collective communications
group and communicator concepts
process topologies
environmental management
process creation and management
one-sided communications
extended collective operations
external interfaces
I/O functions
a profiling interface
Language bindings for C, C++ and Fortran are defined
Collective Communications
[Figure: collective communication patterns among processors P0–P3 – Broadcast (one value A copied to every processor), Scatter/Gather (values A, B, C, D dealt out to, or collected from, the processors), and All-to-all (every processor exchanges a distinct element with every other processor). Adapted from Demmel and Yelick.]
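A hedged MPI sketch of the broadcast, scatter, and compute-and-move (reduce) collectives pictured above; the values are arbitrary and rank 0 is assumed to be the root:

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int a = (rank == 0) ? 7 : 0;
      MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* Broadcast: all ranks now hold a */

      int *all = NULL;
      if (rank == 0) {                                /* root prepares one value per rank */
          all = malloc(size * sizeof(int));
          for (int i = 0; i < size; i++) all[i] = i * i;
      }
      int piece;
      MPI_Scatter(all, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* Scatter */

      int total;
      MPI_Reduce(&piece, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);   /* sum across ranks */
      if (rank == 0) printf("broadcast value %d, sum of pieces %d\n", a, total);

      free(all);                                      /* free(NULL) is a no-op on non-root ranks */
      MPI_Finalize();
      return 0;
  }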
Concurrency and Parallelism
Programs are designed to be sequential or concurrent
Sequential – only one activity, behaving in the “usual” way
Concurrent – multiple, simultaneous activities, designed as
independent operations or as cooperating threads or processes
- The various parts of a concurrent program need not execute
simultaneously, or in a particular sequence, but they do need to
coordinate their activities by exchanging information in some way
A key challenge is to build parallel (concurrent) programs
that have high performance on multiprocessors as the
number of cores increases – programs that scale
Problems that arise
- Scheduling threads on cores close to the memory space where their
data primarily resides
- Load balancing threads on cores and dealing with thermal hot-spots
- Time for synchronization of threads
- Overhead for communication of threads
Examples of Concurrency and Parallelism
Many operations have “inherent data level parallelism” –
multiple independent operations that can be described in
one compound instruction in a suitable language
Matrix computations – e.g., addition
int A[m][n], B[m][n], C[m][n];   // dimensions m × n
for (i = 0; i < m; i++)
  for (j = 0; j < n; j++)
    C[i][j] = A[i][j] + B[i][j];
Database search
find an item with a given property by examining all items
memcached http://memcached.org/
redis.io http://redis.io
Web search
Google’s MapReduce algorithm
http://labs.google.com/papers/mapreduce.html
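As an illustrative sketch (assuming OpenMP, as in the fork-join model earlier), the matrix addition above can exploit its data-level parallelism by splitting the independent iterations across threads:

  #include <omp.h>

  /* m, n and the arrays are assumed to be set up by the caller */
  void matrix_add(int m, int n, int A[m][n], int B[m][n], int C[m][n]) {
      #pragma omp parallel for collapse(2)   /* every (i, j) addition is independent */
      for (int i = 0; i < m; i++)
          for (int j = 0; j < n; j++)
              C[i][j] = A[i][j] + B[i][j];
  }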
Encountering Amdahl’s Law
Speedup due to enhancement E is
Speedup w/ E = Exec time w/o E / Exec time w/ E
Suppose that enhancement E accelerates a fraction F
(F <1) of the task by a factor S (S>1) and the remainder
of the task is unaffected
ExTime w/ E = ExTime w/o E × ((1-F) + F/S)
Speedup w/ E = 1 / ((1-F) + F/S)
Example 1: Amdahl’s Law
Speedup w/ E = 1 / ((1-F) + F/S)
Consider an enhancement which runs 20 times faster
but which is only usable 25% of the time.
Speedup w/ E = 1/(.75 + .25/20) = 1.31
What if it's usable only 15% of the time?
Speedup w/ E = 1/(.85 + .15/20) = 1.17
Amdahl’s Law tells us that to achieve linear speedup
with 100 cores (that is, 100 times faster), none of the
original computation can be scalar!
To get a speedup of 90 from 100 cores, the
percentage of the original program that could be scalar
would have to be 0.1% or less
Speedup w/ E = 1/(.001 + .999/100) = 90.99
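A tiny C check of these numbers, directly coding the formula Speedup = 1 / ((1-F) + F/S):

  #include <stdio.h>

  static double amdahl(double F, double S) {   /* F = fraction enhanced, S = its speedup */
      return 1.0 / ((1.0 - F) + F / S);
  }

  int main(void) {
      printf("F=0.25,  S=20  -> %.2f\n", amdahl(0.25, 20.0));    /* ~1.31  */
      printf("F=0.15,  S=20  -> %.2f\n", amdahl(0.15, 20.0));    /* ~1.17  */
      printf("F=0.999, S=100 -> %.2f\n", amdahl(0.999, 100.0));  /* ~90.99 */
      return 0;
  }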
Moral of the Story
The performance of any system is constrained by
the speed or capacity of the slowest point.
The impact of an effort to improve the performance
of a program is primarily constrained by the amount
of time that the program spends in parts of the
program not targeted by the effort.
Amdahl's Law is a statement of the maximum
theoretical speed-up you can ever hope to achieve.
The actual speed-ups are always less than the
speed-up predicted by Amdahl's Law. Why?
Is superlinear speedup possible?
Multiprocessor Scaling
Getting good speedup on a multiprocessor while keeping
the problem size fixed is harder than getting good speedup
by increasing the size of the problem
Strong scaling – when good speedup is achieved on
a multiprocessor without increasing the size of the
problem
Weak scaling – when good speedup is achieved on a
multiprocessor by increasing the size of the problem
proportionally to the increase in the number of cores
and the total size of memory
Multiprocessor Benchmarks
Benchmark          Scaling?         Reprogram?          Description
LINPACK            Weak             Yes                 Dense matrix linear algebra
                                                        (http://www.top500.org/project/linpack/)
SPECrate           Weak             No                  Parallel SPEC programs for job-level parallelism
SPLASH 2           Strong           No                  Independent job parallelism (both kernels and
                                                        applications, from high-performance computing)
NAS Parallel       Weak             Yes (C or Fortran)  Five kernels, mostly from computational fluid dynamics
PARSEC             Weak             No                  Multithreaded programs that use Pthreads and OpenMP;
                                                        nine applications and 3 kernels – 8 with data
                                                        parallelism, 3 with pipelined parallelism
Berkeley Patterns  Strong or Weak   Yes                 13 design patterns implemented by frameworks or kernels
DGEMM Scaling: Thread Count, Matrix Size
Multiprocessor Basics
Q1 – How do they share data?
A single physical address space shared by all cores or message
passing
Q2 – How do they coordinate?
Through atomic operations on shared variables in memory (via loads
and stores) or via message passing
Q3 – How scalable is the architecture? How many cores?
                                          # of Cores
Communication model   Message passing     8 to 2048+
                      SMP NUMA            8 to 256+
                      SMP UMA             2 to 32
Physical connection   Network             8 to 256+
                      Bus                 2 to 8
Yet More Parallel Approaches
An alternate classification
                                  Data Streams
                                  Single                    Multiple
Instruction Streams   Single      SISD: Intel Pentium 4     SIMD: SSE instructions of x86
                      Multiple    MISD: No examples today   MIMD: SMPs (IBM Power 8); MPPs (Intel Phi)
SPMD: Single Program Multiple Data
A parallel program running on a MIMD computer
With conditional code for different cores
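A minimal SPMD sketch in MPI C: every core runs the same program, and conditional code on the rank selects each core's role (the roles shown are made up for illustration):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0)                     /* same binary everywhere; the rank picks the role */
          printf("rank 0: coordinating the work\n");
      else
          printf("rank %d: computing a worker's share\n", rank);

      MPI_Finalize();
      return 0;
  }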