Scalable Multiprocessors
Read Dubois/Annavaram/Stenström Chapter 5.5-5.6
(COMA architectures could be a paper topic)
Read Dubois/Annavaram/Stenström Chapter 6
 What is a scalable design? (7.1)
 Realizing programming models (7.2)
 Scalable communication architectures (SCAs)
 Message-based SCAs (7.3-7.5)
 Shared-memory based SCAs (7.6)
4/8/2015 slide 1
PCOD: Scalable Parallelism (ICs)
Per Stenström (c) 2008, Sally A. McKee (c) 2011
Scalability
Goals (P is number of processors)
 Bandwidth: scale linearly with P
 Latency: short and independent of P
 Cost: low fixed cost and scale linearly with P
Example: A bus-based multiprocessor
 Bandwidth: constant
 Latency: short and constant
 Cost: high for infrastructure and then linear
Organizational Issues
Dance-hall memory organization: all memory modules (M) sit on one side of the scalable network, with caches and processors ($, P) behind switches on the other side.
Distributed memory organization: each node keeps M, $, P, and a communication assist (CA) together behind its own switch.
[Figure: both organizations built from switches attached to a scalable network.]
 Network composed of switches for performance and cost
 Many concurrent transactions allowed
 Distributed memory can bring down bandwidth demands
Bandwidth scaling:
 no global arbitration and ordering
 broadcast bandwidth fixed and expensive
Scaling Issues
Latency scaling:
 T(n) = Overhead + Channel Time + Routing Delay
 Channel Time is a function of bandwidth
 Routing Delay is a function of number of hops in network
Cost scaling:
 Cost(p,m) = Fixed cost + Incremental Cost (p,m)
 Design is cost-effective if speedup(p,m) > costup(p,m)
Physical Scaling
 Chip-, board-, and system-level partitioning has a big impact on scaling
 However, there is little consensus on how to do it
[Figure: the Thinking Machines CM-5 organization. A processing partition of SPARC nodes (each with FPU, $, SRAM, NI on the MBUS, vector units, and DRAM with DRAM controllers), a partition of control processors, and an I/O partition, tied together by three separate networks: data, control, and diagnostics.]
Network Transaction Primitives
Primitives to implement the programming model on a
scalable machine
 One-way transfer between source and destination
 Resembles a bus transaction, but much richer in variety
[Figure: a serialized message travels from the source node's output buffer across the communication network to the destination node's input buffer.]
Examples:
 A message send transaction
 A write transaction in a SAS machine
Bus vs. Network Transactions
Design Issues               Bus Transactions           Network Transactions
Protection                  V->P address translation   Done at multiple points
Format                      Fixed                      Flexible
Output buffering            Simple                     Support flexible format
Media arbitration           Global                     Distributed
Destination name & routing  Direct                     Via several switches
Input buffering             One source                 Several sources
Action                      Response                   Rich diversity
Completion detection        Simple                     Response transaction
Transaction ordering        Global order               No global order
SAS Transactions
A remote read proceeds in seven steps:
At the source:
(1) Initiate memory access: Load r <- [global address]
(2) Address translation
(3) Local/remote check
(4) Request transaction: read request sent into the network (source waits)
At the destination:
(5) Remote memory access performed
(6) Reply transaction: read response sent back
At the source:
(7) Complete memory access
Issues:
 Fixed or variable size transfers
 Deadlock avoidance and handling of full input buffers
Sequential Consistency
P1:                 P2:
A = 1;              while (flag == 0);
flag = 1;           print A;

[Figure: P1, P2, P3 with their memories on an interconnection network; A (initially 0) and flag (0->1) live in different memories. (1) P1's write A=1 is delayed on a congested path, (2) its write flag=1 completes first, so (3) P2's load A can return the stale value 0.]
Issues:
 Writes need acks to signal completion
 SC may cause extreme waiting times
Message Passing
Multiple flavors of synchronization semantics
Blocking versus non-blocking:
 Blocking send/recv returns when the operation completes
 Non-blocking returns immediately (a probe function tests completion)
Synchronous:
 Send completes after the matching receive has executed
 Receive completes after the data transfer from the matching send completes
Asynchronous (buffered, in MPI terminology):
 Send completes as soon as the send buffer may be reused
Synchronous MP Protocol
The destination posts Recv(Psrc, local VA, len); at the source:
(1) Initiate send: Send(Pdest, local VA, len)
(2) Address translation on Psrc
(3) Local/remote check
(4) Send-ready request (send-rdy req); source waits
At the destination:
(5) Remote check for posted receive: tag check (assume success)
(6) Reply transaction: recv-rdy reply
Then:
(7) Bulk data transfer: data-xfer req, source VA -> dest VA or ID
Alternative: Keep match table at the sender,
enabling a two-phase receive-initiated protocol
Asynchronous Optimistic
MP Protocol
At the source:
(1) Initiate send: Send(Pdest, local VA, len)
(2) Address translation
(3) Local/remote check
(4) Send data: data-xfer req
At the destination:
(5) Remote check for posted receive (Recv(Psrc, local VA, len)): on a tag match the data is delivered; on a miss a data buffer is allocated and the data is held until the receive is posted
Issues:
 Copying overhead at the receiver, from the temporary buffer to user space
 Huge buffer space needed at the receiver to cope with the worst case
Asynchronous Robust MP
Protocol
At the source:
(1) Initiate send: Send(Pdest, local VA, len)
(2) Address translation on Pdest
(3) Local/remote check
(4) Send-ready request (send-rdy req); source returns and computes
At the destination:
(5) Remote check for posted receive (assume fail); record the send-ready
(6) Receive-ready request: when Recv(Psrc, local VA, len) is posted, the tag check matches the recorded send-ready and a recv-rdy req is sent
Back at the source:
(7) Bulk data reply: data-xfer reply, source VA -> dest VA or ID
Note: after handshake, send and recv buffer
addresses are known, so data transfer can be
performed with little overhead
Active Messages
[Figure: a request message invokes a request handler at the destination; the reply message invokes a reply handler back at the source.]
User-level analog of network transactions: transfer a data packet and invoke a handler that extracts it from the network and integrates it with the ongoing computation.
Challenges Common to
SAS and MP
 Input buffer overflow: how to signal buffer space is exhausted
Solutions:
 ACK at protocol level
 back pressure flow control
 special ACK path or drop packets (requires time-out)
Fetch deadlock (revisited): a request often generates a response, and such
request-response dependences can form cycles in the network
Solutions:
 two logically independent request/response networks
 NACK requests at receiver to free space
Spectrum of Designs
(ordered by increasing HW support, specialization, intrusiveness, performance?)

Network interpretation       Communication assist       Example machines
None, physical bit stream    Blind, physical DMA        nCUBE, iPSC, . . .
User/system                  User-level port            CM-5, *T
                             User-level handler         J-Machine, Monsoon, . . .
Remote virtual address       Processing, translation    Paragon, Meiko CS-2
Global physical address      Proc + memory controller   RP3, BBN, T3D
Cache-to-cache               Cache controller           Dash, KSR, Flash
MP Architectures
[Figure: nodes (P, M, CA) connected by a scalable network. The communication assist handles output processing of a message (checks, translation, formatting, scheduling) on the way out, and input processing (checks, translation, buffering, action) on the way in.]
Design tradeoff: how much processing in CA vs P,
and how much interpretation of network transaction
Physical DMA (7.3)
User-level access (7.4)
Dedicated message processing (7.5)
Physical DMA
Example: nCUBE/2, IBM SP1
[Figure: per-node DMA channels, each with addr/length/rdy registers, move data between memory and the network; the processor issues commands (cmd, dest, data) and is notified via status/interrupt.]
Node processor packages messages in user/system mode
DMA used to copy between network and system buffers
Problem: no way to distinguish user from system messages, so the node
processor must be involved in every transfer, which adds much overhead
User-Level Access
Example: CM-5
[Figure: network-interface FIFOs (dest, data, plus a user/system tag) are mapped into each node's memory; the processor observes status and takes interrupts.]
 Network interface mapped into user address space
 Communication assist does protection checks, translation, etc.
No intervention by kernel except for interrupts
Dedicated Message Processing
[Figure: each node attaches to the network a compute processor (P) and a dedicated message processor (MP) that share memory (Mem) and the network interface (NI); both user and system messages pass through the MP.]
The MP:
 interprets messages
 supports message operations
 off-loads P with a clean message abstraction
Issues:
 P and MP communicate via shared memory: coherence traffic
 The MP can become a bottleneck, since all concurrent actions pass through it
Shared Physical Address Space
[Figure: each node pairs a pseudo-memory module with a pseudo-processor on the scalable network; a remote access leaves the requesting node through its pseudo-memory and is performed against the home node's M by that node's pseudo-processor.]
Remote read/write performed by pseudo processors
Cache coherence issues treated in Ch. 8