Scalable Multiprocessors
PCOD: Scalable Parallelism (ICs), Per Stenström (c) 2008, Sally A. McKee (c) 2011

- Read Dubois/Annavaram/Stenström Chapter 5.5-5.6 (COMA architectures could be a paper topic)
- Read Dubois/Annavaram/Stenström Chapter 6
- What is a scalable design? (7.1)
- Realizing programming models (7.2)
- Scalable communication architectures (SCAs)
  - Message-based SCAs (7.3-7.5)
  - Shared-memory based SCAs (7.6)

Scalability Goals (P is the number of processors)
- Bandwidth: scales linearly with P
- Latency: short and independent of P
- Cost: low fixed cost, then scales linearly with P
Example: a bus-based multiprocessor
- Bandwidth: constant
- Latency: short and constant
- Cost: high fixed cost for the infrastructure, then linear

Organizational Issues
[Figure: dance-hall memory organization (memory modules on one side of a scalable network of switches, processors with caches on the other) vs. distributed memory organization (each node has a processor, cache, communication assist, and local memory attached to a switch).]
- Network composed of switches, for both performance and cost
- Many concurrent transactions allowed
- Distributed memory can bring down bandwidth demands
- Bandwidth scaling: no global arbitration and ordering; broadcast bandwidth is fixed and expensive

Scaling Issues
- Latency scaling: T(n) = Overhead + Channel Time + Routing Delay
  - Channel Time is a function of bandwidth
  - Routing Delay is a function of the number of hops in the network
- Cost scaling: Cost(p,m) = Fixed cost + Incremental cost(p,m)
  - A design is cost-effective if speedup(p,m) > costup(p,m) (a small worked example appears after the Bus vs. Network Transactions slide)

Physical Scaling
- Chip-, board-, and system-level partitioning has a big impact on scaling
- However, there is little consensus
[Figure: machine organization with separate data, control, and diagnostics networks connecting processing, control, and I/O partitions; node organization with a SPARC processor, FPU, caches and SRAM, network interface (NI) on the MBUS, and vector units with DRAM controllers.]

Network Transaction Primitives
- Primitives used to implement the programming model on a scalable machine
- One-way transfer between a source node and a destination node
- Resembles a bus transaction, but much richer in variety
[Figure: a serialized message travels from the output buffer of the source node across the communication network to the input buffer of the destination node.]
- Examples: a message send transaction; a write transaction in a SAS machine

Bus vs. Network Transactions
Design issue                Bus transaction            Network transaction
Protection                  V->P address translation   Done at multiple points
Format                      Fixed                      Flexible
Output buffering            Simple                     Must support flexible format
Media arbitration           Global                     Distributed
Destination name & routing  Direct                     Via several switches
Input buffering             One source                 Several sources
Action                      Simple                     Rich diversity
Completion detection        Response                   Response transaction
Transaction ordering        Global order               No global order
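To make the table concrete, here is a purely illustrative sketch of what a network transaction envelope might carry. The field names and widths are invented rather than taken from any real machine, but each field corresponds to one of the design issues listed above.

```c
#include <stdint.h>

/* Illustrative network-transaction envelope (not any real machine's format). */
typedef struct {
    uint16_t dest_node;      /* destination name: resolved by routing through switches   */
    uint16_t src_node;       /* needed because input buffers see several sources          */
    uint32_t protection_key; /* protection checked at multiple points, not via V->P alone */
    uint8_t  kind;           /* flexible format: read req, write req, msg, ack, ...       */
    uint8_t  needs_reply;    /* completion detected by an explicit response transaction   */
    uint16_t payload_len;    /* fixed- or variable-size transfer                          */
    uint64_t addr_or_tag;    /* global address (SAS) or message tag (MP)                  */
    uint8_t  payload[64];    /* serialized data placed in the output buffer               */
} net_txn_t;
```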
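The latency and cost models from the Scaling Issues slide can also be turned into a rough worked example. The C sketch below plugs T(n) and the speedup-vs-costup test into code; all parameter values (overhead, per-byte channel time, per-hop routing delay, fixed and per-node cost) are invented placeholders, and the memory parameter m is dropped for simplicity.

```c
#include <stdio.h>

/* Hypothetical machine parameters -- illustrative values only. */
#define OVERHEAD_US    2.0      /* fixed sender/receiver overhead (us)      */
#define US_PER_BYTE    0.01     /* channel time per byte = 1/bandwidth (us) */
#define US_PER_HOP     0.1      /* routing delay per switch hop (us)        */
#define FIXED_COST     50000.0  /* infrastructure cost (arbitrary units)    */
#define COST_PER_NODE  2000.0   /* incremental cost per processor + memory  */

/* T(n) = Overhead + Channel Time + Routing Delay */
static double latency_us(double n_bytes, int hops) {
    return OVERHEAD_US + n_bytes * US_PER_BYTE + hops * US_PER_HOP;
}

/* Cost(p) = Fixed cost + Incremental cost(p); costup(p) = Cost(p) / Cost(1) */
static double costup(int p) {
    double cost1 = FIXED_COST + COST_PER_NODE;
    double costp = FIXED_COST + p * COST_PER_NODE;
    return costp / cost1;
}

int main(void) {
    int p = 64;
    double speedup = 40.0;  /* assumed speedup measured on p nodes */
    printf("T(1 KB, 5 hops) = %.2f us\n", latency_us(1024, 5));
    /* A design is cost-effective when speedup(p) > costup(p). */
    printf("costup(%d) = %.2f -> %s\n", p, costup(p),
           speedup > costup(p) ? "cost-effective" : "not cost-effective");
    return 0;
}
```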
SAS Transactions (remote read timeline)
Source:
(1) Initiate memory access: load r <- [global address]
(2) Address translation
(3) Local/remote check
(4) Request transaction: read request sent to the destination; source waits
Destination:
(5) Remote memory access: memory read
(6) Reply transaction: read response sent back
Source:
(7) Complete memory access
Issues:
- Fixed or variable size transfers
- Deadlock avoidance and handling a full input buffer

Sequential Consistency
  P1: A = 1; flag = 1;
  P2: while (flag == 0); print A;
[Figure: P1, P2, P3 with distributed memories (A:0 on one node, flag:0->1 on another) connected by an interconnection network; the write A=1 is delayed on a congested path while flag=1 and P2's load of A take a fast path.]
Because the write of A can be delayed in the network while flag=1 arrives first, P2 may read the stale value A=0 unless writes are acknowledged before later accesses proceed.
Issues:
- Writes need acks to signal completion
- SC may cause extreme waiting times

Message Passing
- Multiple flavors of synchronization semantics
- Blocking vs. non-blocking
  - A blocking send/recv returns when the operation completes
  - A non-blocking send/recv returns immediately (a probe function tests completion)
- Synchronous
  - Send completes after the matching receive has executed
  - Receive completes after the data transfer from the matching send completes
- Asynchronous (buffered, in MPI terminology)
  - Send completes as soon as the send buffer may be reused
(These flavors map onto MPI's send modes; see the MPI sketch after the Active Messages slide.)

Synchronous MP Protocol
Destination: posts Recv(Psrc, local VA, len)
Source:
(1) Initiate send: Send(Pdest, local VA, len)
(2) Address translation on Psrc
(3) Local/remote check
(4) Send-ready request (send-rdy req); source waits
Destination:
(5) Remote check for a posted receive (assume success): tag check
(6) Reply transaction: recv-rdy reply
Source:
(7) Bulk data transfer (source VA -> dest VA or ID): data-xfer req
Alternative: keep the match table at the sender, enabling a two-phase receive-initiated protocol

Asynchronous Optimistic MP Protocol
Source:
(1) Initiate send: Send(Pdest, local VA, len)
(2) Address translation
(3) Local/remote check
(4) Send data: data-xfer req
Destination:
(5) Remote check for a posted receive; on failure, allocate a data buffer (tag match, allocate buffer); the data is copied to user space when Recv(Psrc, local VA, len) is eventually posted
Issues:
- Copying overhead at the receiver, from the temporary buffer to user space
- Huge buffer space needed at the receiver to cope with the worst case

Asynchronous Robust MP Protocol
Source:
(1) Initiate send: Send(Pdest, local VA, len)
(2) Address translation on Pdest
(3) Local/remote check
(4) Send-ready request (send-rdy req); source returns and computes
Destination:
(5) Remote check for a posted receive (assume failure); record the send-ready (tag check)
Later, the receive is posted: Recv(Psrc, local VA, len)
(6) Receive-ready request (recv-rdy req) back to the source
Source:
(7) Bulk data reply (source VA -> dest VA or ID): data-xfer reply
Note: after the handshake, the send and recv buffer addresses are known, so the data transfer can be performed with little overhead

Active Messages
[Figure: a request message invokes a request handler at the destination; the reply invokes a reply handler back at the source.]
- User-level analog of network transactions
- Transfer a data packet and invoke a handler to extract it from the network and integrate it with the ongoing computation
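As a concrete, deliberately machine-independent sketch of the active-message idea: each packet names a handler and carries a small payload, and the receiving node's poll loop extracts packets from its input buffer and invokes the named handler to fold the data into the ongoing computation. All types and names below are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical active-message packet: handler index + small inline payload. */
typedef struct {
    uint32_t handler_id;
    uint32_t len;
    uint8_t  payload[56];
} am_packet_t;

typedef void (*am_handler_t)(const void *payload, uint32_t len);

/* Example request handler: accumulate a remotely supplied value. */
static double remote_sum = 0.0;
static void sum_handler(const void *payload, uint32_t len) {
    double v;
    if (len >= sizeof v) {
        memcpy(&v, payload, sizeof v);
        remote_sum += v;   /* integrate the data into the ongoing computation */
    }
}

/* Handler table indexed by the id carried in each packet. */
static am_handler_t handlers[] = { sum_handler };

/* Dispatch loop: extract each packet from the input buffer, run its handler. */
static void am_poll(const am_packet_t *inbuf, int npackets) {
    for (int i = 0; i < npackets; i++)
        handlers[inbuf[i].handler_id](inbuf[i].payload, inbuf[i].len);
}

int main(void) {
    /* Simulate one incoming request carrying the value 3.5. */
    am_packet_t pkt = { .handler_id = 0, .len = sizeof(double) };
    double v = 3.5;
    memcpy(pkt.payload, &v, sizeof v);
    am_poll(&pkt, 1);
    printf("remote_sum = %.1f\n", remote_sum);
    return 0;
}
```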
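The message-passing flavors above map directly onto MPI's standard send modes, so a minimal MPI sketch can make the distinctions concrete: MPI_Ssend is synchronous (completes only once the matching receive is in progress), MPI_Bsend is buffered/asynchronous (completes once the user buffer may be reused), and MPI_Isend with MPI_Test is non-blocking (returns immediately; completion is probed later). The two-rank program below only sketches the semantics, not a protocol implementation.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = 42, recv = 0;
    if (rank == 0) {
        /* Synchronous: blocks until the matching receive has been posted. */
        MPI_Ssend(&data, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);

        /* Buffered (asynchronous): completes once the user buffer has been
           copied into MPI's attached buffer and may be reused. */
        int bufsize = MPI_BSEND_OVERHEAD + (int)sizeof(int);
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(&data, 1, MPI_INT, 1, /*tag=*/1, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);

        /* Non-blocking: returns immediately; MPI_Test probes completion. */
        MPI_Request req;
        int done = 0;
        MPI_Isend(&data, 1, MPI_INT, 1, /*tag=*/2, MPI_COMM_WORLD, &req);
        while (!done) MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        for (int tag = 0; tag < 3; tag++)
            MPI_Recv(&recv, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d three times\n", recv);
    }

    MPI_Finalize();
    return 0;
}
```

(Intended to be run with exactly two ranks, e.g. mpirun -np 2.)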
Challenges Common to SAS and MP
- Input buffer overflow: how to signal that buffer space is exhausted
  Solutions:
  - ACK at the protocol level
  - back-pressure flow control
  - a special ACK path, or dropping packets (requires a time-out)
- Fetch deadlock (revisited): a request often generates a response, which can form dependence cycles in the network
  Solutions:
  - two logically independent request/response networks
  - NACK requests at the receiver to free space (see the sketch at the end of this section)

Spectrum of Designs
Increasing HW support, specialization, intrusiveness, performance (???)
- None (physical bit stream): blind, physical DMA (nCUBE, iPSC, ...)
- User/System: user-level port (CM-5, *T); user-level handler (J-Machine, Monsoon, ...)
- Remote virtual address: processing, translation (Paragon, Meiko CS-2)
- Global physical address: processor + memory controller (RP3, BBN, T3D)
- Cache-to-cache: cache controller (Dash, KSR, Flash)

MP Architectures
[Figure: node architecture with processor (P), memory (M), and communication assist (CA) attached to a scalable network.]
- Message output processing: checks, translation, formatting, scheduling
- Message input processing: checks, translation, buffering, action
- Design tradeoff: how much processing happens in the CA vs. in P, and how much of the network transaction is interpreted
- Physical DMA (7.3)
- User-level access (7.4)
- Dedicated message processing (7.5)

Physical DMA
Examples: nCUBE/2, IBM SP1
[Figure: DMA channels with address, length, and ready registers between memory and the network; status/interrupt and command paths to the node processor.]
- The node processor packages messages in user/system mode
- DMA is used to copy between the network and system buffers
- Problem: there is no way to distinguish user from system messages, so the node processor must be involved, which adds much overhead

User-Level Access
Example: CM-5
[Figure: network interface with user/system data and destination registers, and a status/interrupt path to the processor.]
- The network interface is mapped into the user address space
- The communication assist does protection checks, translation, etc.
- No kernel intervention except for interrupts

Dedicated Message Processing
[Figure: each node has memory, a network interface (NI), a compute processor P (user), and a message processor MP (system), connected to the network.]
The MP:
- interprets messages
- supports message operations
- off-loads P by providing a clean message abstraction
Issues:
- P and MP communicate via shared memory: coherence traffic
- The MP can become a bottleneck because all concurrent actions funnel through it

Shared Physical Address Space
[Figure: each node's memory and processor are fronted by a pseudo-memory and a pseudo-processor attached to the scalable network.]
- Remote reads/writes are performed by the pseudo processors
- Cache coherence issues are treated in Ch. 8
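A small sketch of how a communication assist might decode a global physical address into a home node and a local offset, performing the local/remote check and, for remote addresses, the read that the pseudo-memory/pseudo-processor pair would turn into request and reply transactions. The address layout and the in-process "remote" memory are toy assumptions, not any particular machine's format.

```c
#include <stdint.h>
#include <stdio.h>

/* Invented address layout: high bits name the home node,
   low bits give the offset within that node's local memory. */
#define NODE_SHIFT 16u   /* 64 KB of local memory per node in this toy model */

static uint32_t home_node(uint64_t gpa)    { return (uint32_t)(gpa >> NODE_SHIFT); }
static uint64_t local_offset(uint64_t gpa) { return gpa & ((1ull << NODE_SHIFT) - 1); }

/* Toy "machine": 4 nodes, each with a small local memory array. */
enum { NODES = 4, WORDS = (1u << NODE_SHIFT) / sizeof(uint64_t) };
static uint64_t mem[NODES][WORDS];

/* Stand-in for the request/reply transaction pair that the pseudo-processor
   and pseudo-memory would perform on a real machine. */
static uint64_t remote_read(uint32_t node, uint64_t offset) {
    return mem[node][offset / sizeof(uint64_t)];
}

/* Load through the communication assist: local/remote check, then either a
   direct local access or a remote read transaction. */
static uint64_t sas_load(uint64_t gpa, uint32_t my_node) {
    uint32_t home = home_node(gpa);
    uint64_t off  = local_offset(gpa);
    if (home == my_node)
        return mem[my_node][off / sizeof(uint64_t)];
    return remote_read(home, off);
}

int main(void) {
    mem[2][5] = 99;   /* word 5 in node 2's local memory */
    uint64_t gpa = ((uint64_t)2 << NODE_SHIFT) | (5 * sizeof(uint64_t));
    printf("node 0 loads global address 0x%llx -> %llu\n",
           (unsigned long long)gpa, (unsigned long long)sas_load(gpa, 0));
    return 0;
}
```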
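Finally, one way to picture the "NACK requests at the receiver" option from the Challenges slide: the receiving assist admits a request only if an input-buffer slot is free, otherwise it bounces the request back as a NACK and the sender retries later; the incoming transaction is consumed either way, so no dependence cycle forms. The sketch below uses invented names and collapses sender and receiver into one toy program.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy fixed-size input buffer at the receiving communication assist. */
#define INBUF_SLOTS 4

typedef struct { uint32_t src; uint32_t payload; } request_t;

static request_t inbuf[INBUF_SLOTS];
static int inbuf_count = 0;

/* Invented reply codes: a real machine would send these as transactions on a
   (logically) separate response network. */
typedef enum { REPLY_ACK, REPLY_NACK } reply_t;

/* Receiver side: accept the request if a slot is free, otherwise NACK it.
   Either outcome removes the request from the network. */
static reply_t accept_request(request_t req) {
    if (inbuf_count == INBUF_SLOTS)
        return REPLY_NACK;            /* buffer full: bounce back to sender */
    inbuf[inbuf_count++] = req;
    return REPLY_ACK;
}

int main(void) {
    /* Sender side: retry on NACK (a real sender would also back off). */
    for (uint32_t i = 0; i < 6; i++) {
        request_t req = { .src = 0, .payload = i };
        reply_t r = accept_request(req);
        printf("request %u -> %s\n", i,
               r == REPLY_ACK ? "ACK" : "NACK (retry later)");
    }
    return 0;
}
```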