Chap. 4 Multiprocessors and Thread-Level Parallelism

Uniprocessor Performance

[Figure: performance vs. VAX-11/780, 1978–2006, log scale. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006.]

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From ILP to TLP & DLP

• (Almost) all microprocessor companies moving to multiprocessor systems
• Single processors gain performance by exploiting instruction-level parallelism (ILP)
• Multiprocessors exploit either:
  – Thread-level parallelism (TLP), or
  – Data-level parallelism (DLP)
• What’s the problem?
From ILP to TLP & DLP (cont.)

• We’ve got tons of infrastructure for single-processor systems
  – Algorithms, languages, compilers, operating systems, architectures, etc.
  – These don’t exactly scale well
• Multiprocessor design: not as simple as creating a chip with 1000 CPUs
  – Task scheduling/division
  – Communication
  – Memory issues
  – Even programming → moving from 1 to 2 CPUs is extremely difficult
Why Multiprocessors?

• Slowdown in uniprocessor performance arising from diminishing returns in exploiting ILP, combined with growing concern over power
• Growth in data-intensive applications
  – Databases, file servers, …
• Growing interest in servers and server performance
• Increasing desktop performance less important
  – Outside of graphics
• Improved understanding of how to use multiprocessors effectively
  – Especially servers, where there is significant natural TLP
Multiprocessing

• Flynn’s Taxonomy of Parallel Machines
  – How many instruction streams?
  – How many data streams?
• SISD: Single I Stream, Single D Stream
  – A uniprocessor
• SIMD: Single I, Multiple D Streams
  – Each “processor” works on its own data
  – But all execute the same instructions in lockstep
  – E.g., a vector processor or MMX
  ⇒ Data-Level Parallelism
Flynn’s Taxonomy

• MISD: Multiple I, Single D Stream
  – Not used much
• MIMD: Multiple I, Multiple D Streams
  – Each processor executes its own instructions and operates on its own data
  – This is your typical off-the-shelf multiprocessor (made using a bunch of “normal” processors)
  – Includes multi-core processors, clusters, SMP servers
  ⇒ Thread-Level Parallelism
• MIMD popular because:
  – Flexible: can run N separate programs, or work on 1 multithreaded program together
  – Cost-effective: same processor used in desktops and in MIMD machines
Back to Basics

• “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
• Parallel Architecture = Computer Architecture + Communication Architecture
• 2 classes of multiprocessors with respect to memory:
  1. Centralized-Memory Multiprocessor
     – At most a few dozen processor chips (and fewer than 100 cores) in 2006
     – Small enough to share a single, centralized memory
  2. Physically Distributed-Memory Multiprocessor
     – More than about 100 chips and cores
     – BW demands → memory distributed among processors

[Figure: Centralized Shared-Memory Multiprocessor]
[Figure: Distributed-Memory Multiprocessor]
Centralized-Memory Machines

• Also called “Symmetric Multiprocessors” (SMP)
• “Uniform Memory Access” (UMA)
  – All memory locations have similar latencies
  – Data sharing through memory reads/writes
  – P1 can write data to a physical address A; P2 can then read physical address A to get that data (see the sketch after this list)
• Problem: memory contention
  – All processors share the one memory
  – Memory bandwidth becomes the bottleneck
  – Used only for smaller machines
    · Most often 2, 4, or 8 processors
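To make the reads/writes-based sharing concrete, here is a minimal sketch in C with pthreads (the names data, ready, and the value 42 are illustrative, not from the slides): one thread plays P1 and stores to a shared location, the other plays P2 and loads it. A C11 atomic flag stands in for whatever synchronization tells P2 the data is ready.

```c
/* SMP-style sharing through loads/stores: P1 writes "address A",
 * P2 then reads it. Build with: cc -std=c11 -pthread share.c */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int data;                       /* the "physical address A" of the slide */
atomic_int ready = 0;

void *p1(void *arg) {
    data = 42;                  /* P1 writes data to address A */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

void *p2(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                       /* spin until P1's write is visible */
    printf("P2 read %d\n", data);   /* P2 reads address A */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, p1, NULL);
    pthread_create(&b, NULL, p2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```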
Shared Memory Pros and Cons

• Pros
  – Communication happens automatically
  – More natural way of programming
    · Easier to write correct programs and gradually optimize them
  – No need to manually distribute data (but it can help if you do)
• Cons
  – Needs more hardware support
  – Easy to write correct but inefficient programs (remote accesses look the same as local ones)
Distributed-Memory Machines

• Two kinds
• Distributed Shared-Memory (DSM)
  – All processors can address all memory locations
  – Data sharing like in SMP
  – Also called NUMA (non-uniform memory access)
  – Latencies of different memory locations can differ (local access faster than remote access)
• Message-Passing
  – A processor can directly address only local memory
  – To communicate with other processors, must explicitly send/receive messages
  – Also called multicomputers or clusters
  – Most accesses local, so less memory contention (can scale to well over 1000 processors)
Message-Passing Machines

• A cluster of computers
  – Each with its own processor and memory
  – An interconnect to pass messages between them
• Producer-consumer scenario:
  – P1 produces data D, uses a SEND to send it to P2
  – The network routes the message to P2
  – P2 then calls a RECEIVE to get the message
• Two types of send primitives
  – Synchronous: P1 stops until P2 confirms receipt of the message
  – Asynchronous: P1 sends its message and continues
• Standard libraries for message passing: most common is MPI – the Message Passing Interface (see the sketch below)
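Here is a minimal MPI sketch of the producer-consumer scenario (the variable d and the value 42 are placeholders): rank 0 plays P1 and SENDs, rank 1 plays P2 and RECEIVEs.

```c
/* Producer-consumer with MPI.
 * Build: mpicc pc.c -o pc    Run: mpirun -np 2 ./pc */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* P1: produce data D and SEND it */
        int d = 42;
        /* MPI_Send may behave synchronously or asynchronously depending
         * on buffering; MPI_Ssend would force the synchronous variant. */
        MPI_Send(&d, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* P2: RECEIVE the message */
        int d = 0;
        MPI_Recv(&d, 1, MPI_INT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("P2 received %d\n", d);
    }

    MPI_Finalize();
    return 0;
}
```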
Message Passing Pros and Cons

• Pros
  – Simpler and cheaper hardware
  – Explicit communication makes programmers aware of costly (communication) operations
• Cons
  – Explicit communication is painful to program
  – Requires manual optimization
    · If you want a variable to be local and accessible via LD/ST, you must declare it as such
    · If other processes need to read or write this variable, you must explicitly code the needed sends and receives
Challenges of Parallel Processing

• First challenge is the % of a program that is inherently sequential (limited parallelism available in programs)
• Suppose we want 80× speedup from 100 processors. What fraction of the original program can be sequential?
  a. 10%
  b. 5%
  c. 1%
  d. <1%
Amdahl’s Law Answers

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

$$80 = \frac{1}{(1 - \text{Fraction}_{\text{parallel}}) + \dfrac{\text{Fraction}_{\text{parallel}}}{100}}$$

$$80 \times \left(1 - \text{Fraction}_{\text{parallel}} + \frac{\text{Fraction}_{\text{parallel}}}{100}\right) = 1$$

$$79 = 80 \times \text{Fraction}_{\text{parallel}} - 0.8 \times \text{Fraction}_{\text{parallel}} = 79.2 \times \text{Fraction}_{\text{parallel}}$$

$$\text{Fraction}_{\text{parallel}} = 79 / 79.2 = 99.75\%$$

So at most 1 − 99.75% = 0.25% of the original program can be sequential: answer (d). The small check below confirms the arithmetic.
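A quick numeric check of the result as a C sketch (function and variable names are mine):

```c
/* Amdahl's Law: overall speedup for parallel fraction f on p processors. */
#include <stdio.h>

double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / (double)p);
}

int main(void) {
    /* f = 0.9975 (99.75% parallel), p = 100 -> roughly 80x. */
    printf("speedup = %.2fx\n", amdahl_speedup(0.9975, 100));
    return 0;
}
```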
Challenges of Parallel Processing

• Second challenge is the long latency to remote memory (high cost of communication)
  – Delay ranges from 50 to 1,000 clock cycles
• Suppose a 32-CPU MP at 2 GHz with 200 ns remote memory access, where all local accesses hit in the memory hierarchy and the base CPI is 0.5. (Remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles.)
• What is the performance impact if 0.2% of instructions involve a remote access?
  a. 1.5×
  b. 2.0×
  c. 2.5×
CPI Equation

CPI = Base CPI + Remote request rate × Remote request cost

⇒ CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3

⇒ The machine with no communication is 1.3/0.5 = 2.6× faster than the one in which 0.2% of instructions involve a remote access (see the check below)
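The same arithmetic as a small C check (variable names are mine; the numbers come from the slide):

```c
/* CPI impact of remote accesses. */
#include <stdio.h>

int main(void) {
    double base_cpi    = 0.5;
    double remote_rate = 0.002;              /* 0.2% of instructions */
    double cycle_ns    = 0.5;                /* 2 GHz clock */
    double remote_cost = 200.0 / cycle_ns;   /* 200 ns -> 400 cycles */

    double cpi = base_cpi + remote_rate * remote_cost;   /* 1.3 */
    printf("CPI = %.1f, slowdown = %.1fx\n", cpi, cpi / base_cpi);
    return 0;
}
```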
Challenges of Parallel Processing

• Application parallelism → primarily addressed via new algorithms that have better parallel performance
• Long remote-latency impact → addressed both by the architect and by the programmer
• For example, reduce the frequency of remote accesses either by:
  1. Caching shared data (HW)
  2. Restructuring the data layout to make more accesses local (SW)
Cache Coherence Problem

• Shared memory is easy with no caches
  – P1 writes, P2 can read
  – Only one copy of the data exists (in memory)
• Caches store their own copies of the data
  – Those copies can easily get inconsistent
  – Classic example: adding to a sum (sketched in code after this list)
    · P1 loads allSum, adds its mySum, stores the new allSum
    · P1’s cache now has dirty data, but memory is not updated
    · P2 loads allSum from memory, adds its mySum, stores allSum
    · P2’s cache also has dirty data
    · Eventually P1’s and P2’s cached data will go to memory
    · Regardless of write-back order, the final value ends up wrong
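A minimal pthreads sketch of the allSum race (the thread setup and the contributions 1 and 2 are made up; the slide names only allSum and mySum). With no synchronization, one thread’s read-modify-write can overlap the other’s and an update is lost:

```c
/* Lost-update race on a shared sum.
 * Build with: cc -pthread race.c */
#include <pthread.h>
#include <stdio.h>

int allSum = 0;                 /* shared sum, no lock protecting it */

void *add_mySum(void *arg) {
    int mySum = *(int *)arg;
    /* Three separate steps: load allSum, add mySum, store allSum.
     * If both threads load the same old value, one update is lost. */
    allSum = allSum + mySum;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int sum1 = 1, sum2 = 2;
    pthread_create(&t1, NULL, add_mySum, &sum1);
    pthread_create(&t2, NULL, add_mySum, &sum2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 3; a lost update would print 1 or 2. */
    printf("allSum = %d\n", allSum);
    return 0;
}
```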
Small-Scale—Shared Memory

• Caches serve to:
  – Increase bandwidth versus bus/memory
  – Reduce latency of access
  – Valuable for both private data and shared data
• What about cache consistency?

Time | Event                 | $A | $B | X (memory)
-----|-----------------------|----|----|-----------
  0  |                       |    |    |     1
  1  | CPU A reads X         |  1 |    |     1
  2  | CPU B reads X         |  1 |  1 |     1
  3  | CPU A stores 0 into X |  0 |  1 |     0

• Read and write of a single memory location (X) by two processors (A and B)
• Assume a write-through cache
Example Cache Coherence Problem

[Figure: processors P1, P2, P3, each with a cache ($), on a bus with memory and I/O devices; memory initially holds u:5. Events: 1: P1 reads u (5); 2: P3 reads u (5); 3: P3 writes u = 7; 4: P1 reads u; 5: P2 reads u.]

• Processors see different values for u after event 3
• With write-back caches, the value written back to memory depends on which cache flushes or writes back its value
  – Processes accessing main memory may see a very stale value
• Unacceptable for programming, and it happens frequently!
Cache Coherence Definition

• A memory system is coherent if:
  1. A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
     → preserves program order
  2. If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2’s read returns the value written by P1’s write.
     → any write to an address must eventually be seen by all processors
  3. Writes to the same location are serialized: two writes to location X are seen in the same order by all processors.
     → preserves causality
Maintaining Cache Coherence

• Hardware schemes
  – Shared caches
    · Trivially enforce coherence
    · Not scalable (the shared L1 cache quickly becomes a bottleneck)
  – Snooping
    · Every cache with a copy of the data also has a copy of the block’s sharing status, but no centralized state is kept
    · Needs a broadcast network (like a bus) to enforce coherence
  – Directory
    · The sharing status of a block of physical memory is kept in just one location, the directory
    · Can enforce coherence even with a point-to-point network
Snoopy Cache-Coherence Protocols

[Figure: processors P1 … Pn, each with a cache ($), on a shared bus with memory and I/O devices; each cache line carries state, address (tag), and data, and each cache snoops the bus’s cache-memory transactions.]

• The cache controller “snoops” all transactions on the shared medium (bus or switch)
  – A transaction is relevant if it is for a block the cache contains
  – If so, take action to ensure coherence: invalidate, update, or supply the value
    · Which action depends on the state of the block and the protocol
• Either get exclusive access before a write (write invalidate) or update all copies on a write (write update); a minimal sketch of the invalidate rule follows
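A software sketch of the write-invalidate snoop rule (types and names are mine; a real snooping controller is hardware, roughly one finite-state machine per cache line): when a write by another cache to a block we hold appears on the bus, invalidate our copy.

```c
/* Write-invalidate snooping, software model. */
typedef enum { INVALID, VALID } line_state;

typedef struct {
    unsigned   tag;      /* block address this line holds */
    line_state state;
    int        data;
} cache_line;

/* Called for every bus write each cache observes ("snoops"). */
void snoop_bus_write(cache_line *line, unsigned block_addr, int own_write) {
    if (!own_write && line->state == VALID && line->tag == block_addr)
        line->state = INVALID;   /* someone else wrote: drop our copy */
}
```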
Example: Write-thru Invalidate

[Figure: the same three-processor scenario. With write-through invalidate, P3’s write u = 7 goes through to memory and invalidates the other cached copies of u, so subsequent reads of u miss and fetch 7.]

• Must invalidate before step 3
• Write update uses more broadcast-medium BW
  → all recent MPUs use write invalidate
Write-through invalidate example (CPUs A and B, location X):

Processor activity    | Bus activity       | $A | $B | X (memory)
----------------------|--------------------|----|----|-----------
CPU A reads X         | Cache miss for X   |  0 |    |     0
CPU B reads X         | Cache miss for X   |  0 |  0 |     0
CPU A stores 1 into X | Invalidation for X |  1 |    |     1
CPU B reads X         | Cache miss for X   |  1 |  1 |     1