Master Degree Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing
Multiprocessor architectures

• Ref: Sections 10.1, 10.2, 15, 16, 17 (except 17.5), 18.
• Background (Appendix): firmware, firmware communications, memories, caching, process management; see also Section 11 for memory and caching.

Contents
• Shared memory architecture = multiprocessor
• Multicore technology (Chip MultiProcessor – CMP)
• Functional and performance features of external memory and caching in multiprocessors
• Interconnection networks
• Multiprocessor taxonomy
• Local I/O
Sections 15, 16, 17, 18 contain several 'descriptive-style' parts (classifications, technologies, products, etc.), which can easily be read by the students on their own. During the lectures we'll concentrate on the most critical conceptual and technical issues, through examples and exercises.

Abstract architecture and physical architecture
• Abstract Processing Elements (PEs), having all the main features of real PEs (processor, assembler, memory hierarchy, external memory, I/O, etc.); one PE for each process. Used for the evaluation of the calculation times Tcalc.
• Abstract interconnection network: all the needed direct links corresponding to the interprocess communication channels. It is an abstraction of the physical interconnect, memory hierarchy, I/O, process run-time support, process mapping onto PEs, etc. All the physical details are condensed into a small number of parameters used to evaluate Lcom.
• Together they constitute the cost model of the specific parallel program executed on the specific parallel architecture.

Multiple Instruction Stream (MIMD) architectures
• Parallelism between processes.

Shared memory vs distributed memory
• Multiprocessor: currently, the main technology for multicore and multicore-based systems. We'll start with shared memory multiprocessors.
• Multicomputer: currently, the main technology for clusters and data-centres. Processing nodes are multiprocessors.

Levels and shared memory
• Levels: Applications / Processes / Assembler / Firmware / Hardware.
• Shared physical memory (multiprocessor): any processor can address (thus can access directly) any location of the physical memory, which contains all instructions and data (both private and shared) of the processes.

Different shared data at different levels
• At the process level: graphs of cooperating processes expressed by a concurrent language, either message-passing (e.g. LC), with no sharing at this level, or with shared data.
• At the assembler/firmware level: the RTS of the concurrent language is based on shared data structures (communication channel descriptors, process descriptors, etc.), implemented by exploiting the shared physical memory.
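To make the last point concrete, here is a minimal sketch, in C, of how a message-passing channel descriptor could be laid out as a shared data structure in the physical memory; all names and fields are invented for illustration and are not the course's actual RTS format.

/* Hypothetical sketch (names invented for illustration): how an LC-style
 * message-passing channel could be represented by the RTS as a shared
 * data structure in the physical memory of a multiprocessor. */
#include <stddef.h>

#define MSG_WORDS 8   /* assumed message size, in words */

typedef struct {
    int           lock;            /* for indivisible update sequences   */
    int           sender_wait;     /* sender PCB reference, if blocked   */
    int           receiver_wait;   /* receiver PCB reference, if blocked */
    size_t        buffered;        /* number of buffered messages        */
    unsigned long msg[MSG_WORDS];  /* message buffer (single slot here)  */
} channel_descriptor_t;

/* Every processor can address the descriptor directly: a send or receive
 * primitive compiles into ordinary LOAD/STOREs on this shared structure,
 * protected by an indivisible lock (see the INDIV mechanism later on). */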
Generic scheme of multiprocessor
[Figure: external shared main memory modules M0 … Mj … Mm-1; an N × m interconnection network; Processing Elements PE0 … PEi … PEN-1, each consisting of a CPU plus a PE interface unit W.]
• PE interface unit W: decouples the CPUs from the interconnect technology ('wrapping' unit); performs routing and flow control.
• CPU = processor units, MMUs, caches, plus local I/O.

Typical PE
[Figure: processing node (PE) with interface unit W towards the interconnection network (to/from the external memory modules and the other PEs); external memory interface; CPU with secondary cache, primary cache (instructions + data), MMUs and processor units; I/O interface with interrupt arbiter and local I/O units (UC).]

Shared memory basics - 1

Example: an elementary multiprocessor
Just to understand / review basic concepts and techniques, which will be extended to real multiprocessor architectures.
[Figure: a single memory module M (cycle time tM) connected to two PEs, each with interface unit W, secondary cache C2 and primary cache C1 (clock cycle t); abstract PE0 runs process P, abstract PE1 runs process Q.]
Question: which kinds of requests are sent from a PE to M, and which reply is sent from M to a PE? [true/false]
1. Copy a message from P_msg to Q_vtg
2. Request a message to be assigned to Q_vtg
3. A single-word read
4. A single-word write: yes, if Write-Through
5. A C1-block read: yes
6. A C1-block write: yes, if Write-Back
Question: what is the format (configuration of bits) of a request PE → M and of a reply M → PE?
Question: what happens if a request from PE0 and a request from PE1 arrive 'simultaneously' at M?

Behavior of the memory unit M
• Processing module as a unifying concept at the various levels: processing unit at the firmware level, process at the process level.
• All the mechanisms studied for process cooperation (LC) are applied at the firmware level too, though with different implementations and performances.
• Communication through RDY-ACK interfaces.
• Nondeterminism: test simultaneously, in the same clock cycle, all the RDYs of the input interfaces; select one of the ready requests, possibly applying a fair priority strategy.
• Nondeterminism may be implemented as real parallelism in the same clock cycle: if the input requests are compatible and the memory bandwidth is sufficient, more requests can be served simultaneously.

Behavior of the memory unit M: indivisible sequences
• A further feature can be defined for a shared memory unit: indivisible sequences of memory accesses.
• An additional bit (INDIV) is associated with each memory request: if it is set to 1, once the associated request is accepted by M, the other requests are left pending in the input interfaces (simple waiting-queue mechanism), until INDIV is reset to 0.
• During an indivisible sequence of memory accesses, the behavior of M is deterministic.
• At the end, the nondeterministic/parallel behavior is resumed (possibly by serving a waiting request).
• This mechanism is provided by some machines through proper instructions (e.g. TEST_AND_SET) or through an annotation in the LOAD and STORE instructions.
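As an illustration of the two 'Behavior of the memory unit M' slides above, here is a small software model, in C, of the nondeterministic request selection with a fair round-robin priority plus the INDIV lock-out; it is a sketch of the idea, not the actual firmware microcode.

/* Illustrative sketch only (not from the course text): a software model of
 * the memory unit's nondeterministic request selection with a fair
 * (round-robin) priority, plus the INDIV lock-out described above. */
#include <stdbool.h>

#define N_IF 2                     /* number of PE input interfaces */

typedef struct { bool rdy; bool indiv; /* request fields... */ } iface_t;

static int last_served = N_IF - 1; /* round-robin pointer                  */
static int locked_by   = -1;       /* interface holding an indivisible seq.*/

/* One 'clock cycle' of selection: returns the interface to serve, or -1. */
int select_request(iface_t ifc[N_IF]) {
    if (locked_by >= 0)                       /* indivisible sequence open: */
        return ifc[locked_by].rdy ? locked_by : -1;   /* deterministic      */
    for (int k = 1; k <= N_IF; k++) {         /* fair scan from last served */
        int i = (last_served + k) % N_IF;
        if (ifc[i].rdy) {
            last_served = i;
            if (ifc[i].indiv) locked_by = i;  /* open indivisible sequence  */
            return i;
        }
    }
    return -1;                                /* no pending request         */
}

void end_indivisible(void) { locked_by = -1; } /* INDIV reset to 0 */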
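A companion sketch of how an RTS lock can be built on an indivisible read-modify-write such as TEST_AND_SET; here the GCC builtin stands in for the machine instruction, which on the course's abstract machine would be a LOAD/STORE sequence annotated with INDIV = 1.

/* Sketch of an RTS spin lock on top of an indivisible read-modify-write.
 * The GCC builtin below stands in for a TEST_AND_SET instruction or for
 * an INDIV-annotated LOAD/STORE pair. */
static volatile int lock_word = 0;

void lock(void)   { while (__sync_lock_test_and_set(&lock_word, 1)) /* spin */; }
void unlock(void) { __sync_lock_release(&lock_word); }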
In general
[Figure: memory modules M0 … Mj … Mm-1, interconnection network(s), PEs with interface units W.]
Nondeterminism and parallelism in the behavior of the memory units and of the network switching units.

Technology overview and multicore

CPU technology
• Pipelined and multithreaded processor technology: general view (Sections 12, 13). External memory interface (MINF), MMU, C2, C1, processor P.
• In the simplified cost model adopted for this course, this internal structure is invisible and is abstracted by the equivalent service time per instruction Tinstr (e.g. 2t).

Pipelined / vectorized Execution Unit
[Figure: EU_Master handling short operations and the distribution stage; general, floating-point and LOAD/vector register files (RG, RFP); pipelined FP Add/Sub, integer and FP Mul/Div units with vectorization facilities; collector stage; data paths from DM and from/to the IU.]

Multithreading ('hardware' multithreading)
• Example of a 2-thread CPU (e.g. Hyperthreading).
[Figure: on a single CPU chip, duplicated IU0/IU1 and EU_Master0/EU_Master1 sharing the instruction C1 (IM), the data C1 (DM), C2, a switch towards the functional units FU0 … FU3, and the external memory and I/O interfaces.]
• Ideally, an equivalent number q·N of PEs is available, where q is the multithreading degree. In practice the equivalent number is a·q·N, with a < 1.

Multicore technology: Chip MultiProcessor (CMP)
[Figure: single-chip CMP with memory interfaces (MINF), I/O interfaces (I/O INF), an internal interconnect, and PEs 0 … N-1; each PE/core contains W, C2, C1 (instructions + data), a pipelined processor and local I/O coprocessors.]
• For our purposes, the terms 'multicore' and 'manycore' are synonymous. We use the more general and rigorous term 'Chip MultiProcessor' (CMP).

Internal interconnect examples for CMP
• Ring, 2D toroidal mesh, crossbar.
• Switching unit (or, simply, switch): routing and flow control.

Example of single-CMP system
[Figure: CMP chip with internal interconnect and PEs, connected through the MINFs and interleaved-memory interface units (IM) to high-bandwidth main memory modules M0 … M7, and through the I/O interfaces to I/O and networking interconnects: I/O chassis, Ethernet, Fibre Channel, graphics/video, routers, SCSI RAID subsystems, storage servers, LANs / WANs / other subnets.]

Example of multiple-CMP system
[Figure: several CMPs (CMP0 … CMPm-1), each with its own PEs, connected by an external interconnect to a high-bandwidth shared main memory.]

Intel Xeon (4-16 PEs) and Tilera Tile64 (64 PEs)

Intel Xeon Phi (64 PEs)
• Bidirectional ring interconnect.
• Internal local memory (GDDR5 technology), up to 16 GB (3rd-level cache-like).
• PE: pipelined, in-order, vectorized arithmetic, 4-thread, 2-level cache, ring interface.
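A tiny worked example of the a·q·N estimate from the multithreading slide above; the efficiency factor a is an invented number, chosen only to show the order of magnitude.

/* Worked example (illustrative numbers): equivalent parallelism of a
 * multithreaded CMP, using the a*q*N estimate given above. */
#include <stdio.h>

int main(void) {
    int    N = 64;     /* physical PEs, e.g. a Xeon Phi-class CMP    */
    int    q = 4;      /* hardware multithreading degree (4 threads) */
    double a = 0.6;    /* assumed efficiency factor, a < 1           */

    printf("ideal:     %d equivalent PEs\n", q * N);        /* 256  */
    printf("realistic: %.0f equivalent PEs\n", a * q * N);  /* ~154 */
    return 0;
}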
Shared memory basics - 2

Memory bandwidth and latency
High bandwidth of M (BM) is needed to:
1. minimize the latency of cache-block transfers (Tfault);
2. minimize the contention of PEs for memory accesses.

Minimize the latency of cache-block transfers
• If BM = 1 word/tM, the cache is quite useless, even for programs characterized by locality only (or mainly by locality).
• BM = s1 words/tM is the best offered bandwidth: exploitable if the remaining subsystems (interconnect, PEs) are able to sustain it.
• Solutions:
1. Interleaved macro-modules (hopefully m = s1, e.g. = 8): interleaved macro-module 0 = M0 … M7, interleaved macro-module 1 = M8 … M15, and so on.
2. High-bandwidth firmware communications from M to the interconnect and the PEs. Notice: s1-wide links are not realistic; 1-word links are used.
• Pipelined communications and wormhole flow control: next week.

Cost model of FW communications (Sections 10.1, 10.2)
Tid = max(Tcalc, Ts-com)
• Single buffering: communication latency Lcom = 2(t + Ttr); this also expresses the communication service time: Ts-com = Lcom = 2(t + Ttr).
• Double buffering (alternate use of the two interfaces): Lcom = 2(t + Ttr), while Ts-com = Lcom/2 = t + Ttr.
Sender:: wait ACK; write msg into OUT, set RDY, reset ACK, …
Receiver:: wait RDY; use IN, set ACK, reset RDY, …
• Tcom = communication time NOT overlapped with (i.e. not masked by) the internal calculation: Tid = Tcalc + Tcom.
• On chip Ttr = 0; if Tcalc ≥ Lcom then Tcom = 0.
[Timing diagram: service time Tid vs calculation time Tcalc; RDY/ACK exchanges; transmission latency Ttr (link only); Lcom = 2(Ttr + t).]

Elementary system example: memory latency
[Figure: interleaved memory M0 … M7 (cycle time tM) with memory interface unit IM (clock cycle t), double-buffered links through the network towards two PEs (W, C2, C1).]
• The block is pipelined word-by-word: the request (e.g. 48-112 bits, sent in parallel) reaches M, then a stream of s1 words flows back, each word taking (t + Ttr).
• MEMORY ACCESS LATENCY per block: RQ0 = tM + (s1 + 3)·Ttr + (s1 + 6)·t
• RQ0 is the BASE memory access latency per block, i.e. without the impact of contention: optimistic, or valid for a single PE.
• A more general and accurate, and easier-to-use, cost model will be studied for fully pipelined communications.
• Possible optimization (not so popular): the processor could re-start as soon as the word whose address generated the fault arrives, if it is the first word of the stream.

Elementary system example: bottlenecks
• For all units except M: Tcalc = t, hence Tid = Ts-com = t + Ttr.
• Memory service time per word: TM = tM/s1.
• If Tid ≤ TM, M (the stream generator) is the bottleneck.
  Example: tM = 32t, s1 = 8, Ttr = 2t: RQ0 = tM + (s1 + 3)·Ttr + (s1 + 6)·t = 68t.
• If Tid > TM, the IM-network-PE path is the bottleneck, and the s1-word stream is paced at rate (t + Ttr).
  Example: tM = 32t, s1 = 8, Ttr = 4t: RQ0 = s1·(t + Ttr) + (s1 + 3)·Ttr + (s1 + 6)·t = 98t.
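The two worked examples above can be checked with a few lines of C implementing the RQ0 formulas, including the bottleneck test Tid ≤ TM; all times are expressed in units of the clock cycle t.

/* Sketch: computing the base block-transfer latency RQ0 from the slides'
 * formulas, reproducing the two worked examples (68t and 98t).
 * All times are in units of the clock cycle t. */
#include <stdio.h>

/* Base latency per block of s1 words, memory cycle tM, link delay Ttr. */
double rq0(double tM, int s1, double Ttr) {
    double Tid = 1.0 + Ttr;      /* pipeline-stage service time: t + Ttr */
    double TM  = tM / s1;        /* memory service time per word         */
    if (Tid <= TM)               /* M is the bottleneck                  */
        return tM + (s1 + 3) * Ttr + (s1 + 6);
    else                         /* IM-network-PE path is the bottleneck */
        return s1 * (1.0 + Ttr) + (s1 + 3) * Ttr + (s1 + 6);
}

int main(void) {
    printf("RQ0 = %.0f t\n", rq0(32, 8, 2));  /* 68 t */
    printf("RQ0 = %.0f t\n", rq0(32, 8, 4));  /* 98 t */
    return 0;
}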
Tfault
[Figure: interleaved memory M0 … M7 (tM) with memory interface unit IM (t) and two PEs (W, C2, C1); abstract PE0 runs P, abstract PE1 runs Q.]
• From now on: RQ = UNDER-LOAD memory access latency per block, RQ ≥ RQ0.
• Tfault = Nfault · RQ
• Initially, we assume RQ = RQ0.
Example of the first week, for process Q with stream length M = 5 Mega (no reuse can be exploited):
  TQ0 = 8·M·t, RQ = 68t, Nfault = M/s1
  Tfault = Nfault · RQ = 8.5·M·t
  Tcalc = TQ0 + Tfault = 16.5·M·t

Elementary system example
The evaluation of RQ0 (M bottleneck) is optimistic, because the request part of the timing diagram contains a rough simplification: the request too will be pipelined word-by-word, the request link being 1-word wide.
Exercises:
1. Why is Tcalc = t for all units except M?
2. Explain the RQ0 evaluation in the example where M is not the bottleneck.

In general
[Figure: interleaved macro-modules 0, 1, 2, … (M0 … M7, M8 … M15, …), each with its interface unit IM, connected through the network to the W-C2-C1 chain of each PE.]
• s1 words are read in parallel by the macro-module, and sent in pipeline one word at a time through IM - network - W - C2 - C1. Reverse path for block writing.
• Target: s1/tM ≤ Bnetwork, and in general Bnetwork ≤ Bcache.

Memory bandwidth and contention
• Single internally interleaved macro-module: BM = 1/tM blocks/sec; only one PE at a time can be served.
• Two externally interleaved macro-modules (i.e. interleaved with each other), each macro-module internally interleaved (inside the macro-module): BM-max = 2/tM blocks/sec; two PEs at a time can be served, if they are not in conflict for the same macro-module.
• W selects the macro-module M(j) according to the index j contained in the physical address.
[Timing diagrams: with a conflict the two block transfers are serialized on the same macro-module; without a conflict they proceed in parallel on distinct macro-modules.]

In general: m macro-modules
• The destination macro-module name belongs to the routing information (inserted by W).
• N Processing Elements, m interleaved macro-modules.

A first idea of contention effect
For an (externally) interleaved memory, the probability that a generic processor accesses any given (macro-)module is approximated by 1/m. With this assumption, the number of PEs in conflict for the same macro-module is distributed according to the binomial law. We can find (Section 17.3.5):
[Plot: interleaved memory bandwidth vs number of processors N (0-64), one curve for each m = 4, 8, 16, 32, 64.]
Simplified evaluation:
• only a subclass of multiprocessor architectures (SMP),
• no network effect on latency and conflicts,
• no impact of parallel program structures.

A more general client-server model will be derived
• Contention in memory AND in the network: hence the importance of high-bandwidth and low-latency networks.
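As a rough companion to the binomial-law evaluation mentioned above (Section 17.3.5), here is one common closed-form approximation: the expected number of distinct macro-modules hit by N independent, uniformly distributed requests. The course's own derivation may differ in its refinements.

/* Sketch (not necessarily the course's exact derivation): under the 1/m
 * uniform-access assumption above, the expected number of distinct
 * macro-modules hit by N independent requests, hence the effective
 * bandwidth in blocks per tM, is m * (1 - (1 - 1/m)^N). */
#include <stdio.h>
#include <math.h>

double expected_bandwidth(int m, int N) {
    return m * (1.0 - pow(1.0 - 1.0 / m, N));
}

int main(void) {
    int ms[] = {4, 8, 16, 32, 64};
    for (int i = 0; i < 5; i++)   /* one curve per m, as in the plot */
        printf("m = %2d, N = 64: B ~ %.2f blocks/tM\n",
               ms[i], expected_bandwidth(ms[i], 64));
    return 0;
}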
Caching
• Caching is even more important in multiprocessors,
– for latency and contention reduction,
– provided that reuse is intensively exploited.
• For shared data, intensive reuse can exist, with a proper design of the process RTS.
• However, the CACHE COHERENCE problem arises (studied in the second part of the semester).

Multiprocessor taxonomy

SMP vs NUMA architectures
• Symmetric MultiProcessor (SMP): the base latency is independent of the specific PE and memory macro-module. Also called UMA (Uniform Memory Access).
[Figure: memory modules M0 … Mm-1 above a single interconnection network, with all CPU/W pairs below it.]
• Non Uniform Memory Access (NUMA): the base latency depends (heavily) on the specific PE and on the referenced macro-module. Local memories are shared: each of them can be interleaved internally, but they are sequentially organized with respect to each other. Local accesses have lower latency than remote ones. All private information is allocated in the local memory.
[Figure: each CPU/W pair has its own local memory Mi; the interconnection network links the nodes.]
• Target of NUMA: contention is reduced, at the expense of the base latency for shared data (optimizations are needed). An illustrative latency computation follows the mapping slide below.

SMP-like single-CMP architecture
[Figure: a CMP whose PEs access, through the MINFs and IM units, a set of external interleaved memory modules M0 … M7: same base latency for every PE.]

SMP and NUMA multiple-CMP architectures
a) Multiple-CMP SMP architecture: [Figure: CMPs CMP0 … CMPN-1 connected by an external interconnect to shared interleaved memory macro-modules IM0 … IMm-1, equidistant from all CMPs.]
b) Multiple-CMP NUMA architecture: [Figure: each CMP has its own local memory modules and IM unit; the external interconnect links the CMP nodes.]

Process-to-processor mapping
• Anonymous processors: dynamic mapping (low-level scheduling); originally SMP.
– Multiprogrammed mapping: more processes share dynamically the same PE; context-switch overhead. 'Traditional' computing servers, data-centres (?), cloud (?).
• Dedicated processors: static mapping; originally NUMA.
– Exclusive mapping: one-to-one. Parallel applications dedicated to specific domains.
Exercise: give an approximate evaluation of the context-switch calculation time.
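The NUMA latency sketch announced above: an illustrative computation, with invented numbers, of the average base latency seen by a NUMA node as a function of the fraction of local accesses; it shows why allocating private information locally and mapping processes carefully pays off.

/* Illustrative sketch (invented numbers, not from the course text): the
 * average base memory-access latency seen by a NUMA node, given the
 * fraction of references satisfied by the local memory. */
#include <stdio.h>

int main(void) {
    double L_local  = 70.0;    /* assumed local base latency, in t  */
    double L_remote = 300.0;   /* assumed remote base latency, in t */
    double f_local  = 0.9;     /* fraction of local accesses: depends on
                                  allocating private data locally and on
                                  a good process mapping */
    double L_avg = f_local * L_local + (1.0 - f_local) * L_remote;
    printf("average base latency = %.0f t\n", L_avg);   /* 93 t */
    return 0;
}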
Interconnection networks

Two extreme cases of networks
• Old-style bus: no longer applicable to highly parallel systems. Cheap, but no parallelism in memory accesses: minimum bandwidth and maximum latency.
• Crossbar = fully interconnected structure with N² dedicated links: maximum parallelism and bandwidth, minimum latency, but applicable to limited parallelism only (e.g. N = 8), because of link cost and pin-count reasons.
• Limited-degree networks for highly parallel systems: much lower cost than a crossbar, obtained by reducing the number of links and interfaces (pin-count), at the expense of latency; the maximum bandwidth, however, can equally be achieved.

'High-performance' networks
• Many of the limited-degree networks studied for multiprocessors are used in distributed-memory systems and in high-performance multicomputers too.
– The firmware level is the same, or very similar, for the different architectures.
• The main difference lies in the implementation of the routing and flow-control protocols.
– Notable industrial examples: Infiniband, Myrinet, QS-net, etc.
• In multiprocessors and high-performance multicomputers, the primitive protocols at the firmware level are (can be) used directly in the RTS of applications, without additional software layers like the TCP-IP of traditional networks.
– The overhead imposed by traditional TCP-IP implementations amounts to several orders of magnitude (e.g. msecs vs nsecs of latency!):
• no/scarce firmware support (the NIC is used for the physical layers only),
• execution in kernel mode on top of the operating system (e.g., Linux).
• We'll see that the modern network systems cited above render the primitive firmware protocols visible too,
– for high-performance distributed applications, unless TCP-IP is forced by binary-portability reasons of 'old'/legacy products.
– Moreover, such networks implement TCP-IP with intensive firmware support (mainly in the NIC) and in user mode: 1-2 orders of magnitude of overhead are saved.

Firmware messages as streams
• Messages are packets transmitted as streams of elementary data units, typically words.
• Example: a cache block transmitted from the main memory as a stream of s1 words.

Evaluation metrics
At least evaluated as orders of magnitude O(f(N)).
• Cost of links: bus O(1); crossbar O(N²), the absolute maximum; typical limited-degree networks O(1), O(N), O(N lg N).
• Maximum bandwidth: bus O(1); crossbar O(N), the absolute maximum; typical limited-degree networks achieve O(N) as well.
• Complexity of design to achieve the maximum bandwidth (nondeterminism vs parallelism): bus O(1); crossbar O(c^N) with a monolithic design, the absolute maximum, but O(1) for any N with a modular design (each elementary building block has fixed complexity O(c²)).
• Latency (≈ distance): bus O(N); crossbar O(1), the absolute minimum; typical limited-degree networks O(N), O(√N), O(lg N); the latter is the best except O(1).

From crossbars to limited-degree networks
[Figure: a monolithic (single-unit) N × N crossbar with N bidirectional interfaces; a monolithic 2 × 2 crossbar with input/output interfaces and MUXes, assumed as the elementary building block for N × N modular designs.]
Exercise: describe the firmware behavior of the 2 × 2 switch, and prove that the maximum bandwidth is given by 1/(2(t + Ttr)) with single buffering, or by 1/(t + Ttr) with double buffering.

Modular design for limited-degree networks
• A 4 × 4 limited-degree network implemented by the limited-degree interconnection of 2 × 2 elementary crossbars.
• Binary butterfly with dimension n = lg2 N (n = 2 in the example, N = 4): a notable example of multi-stage network (2 stages in the example); the network dimension n = number of stages.

Modular crossbar as a butterfly
[Figure: butterflies for n = 1, 2, 3 built from 2 × 2 switches.]
• 'Straight' links: to the next stage, same level.
• 'Oblique' links: to the next stage; the base-2 representations of the source and destination levels differ only in the bit indexed by the source stage.

k-ary n-fly networks
Arity k, dimension n; the butterfly is the binary case (k = 2). Extendable to any arity k, though k must be 'low' for limited-degree networks. Typical utilization: SMP.
• Number of processing nodes = 2N, with N = k^n (N PEs on one side, N memory modules on the other, as in the figure).
• Node degree = 2k.
• Latency ≈ distance = n = lg_k N.
• Number of links and switches = O(N lg N): respectively (n - 1)·2^n and n·2^(n-1) in the binary case.
• Maximum bandwidth = O(N).
• Complexity for maximum bandwidth = O(1), once the elementary crossbar is available.
• Simple deterministic routing, based on the binary representations of the sender and destination, the current stage index, and the straight/oblique link choice (see the sketch below).
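The deterministic butterfly routing just described can be sketched in a few lines of C for the binary case; the bit-ordering and stage-numbering conventions are one possible choice, not necessarily the ones used in the course.

/* Sketch of the deterministic k-ary n-fly routing described above, for the
 * binary case (k = 2): at stage s, compare the stage-s bit of the current
 * level with the destination level's bit, and take the straight link if
 * they match, the oblique link otherwise. */
#include <stdio.h>

/* Route from source level 'src' to destination level 'dst' in a binary
 * butterfly of dimension n; prints the link taken at each stage. */
void route(unsigned src, unsigned dst, int n) {
    unsigned level = src;
    for (int s = 0; s < n; s++) {
        unsigned bit = 1u << (n - 1 - s);      /* bit fixed at stage s */
        if ((level & bit) == (dst & bit)) {
            printf("stage %d: straight (level %u)\n", s, level);
        } else {
            level ^= bit;                       /* flip the stage-s bit */
            printf("stage %d: oblique  (level %u)\n", s, level);
        }
    }
}

int main(void) {
    route(5u, 2u, 3);   /* 101 -> 010 in a dimension-3 butterfly */
    return 0;
}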
Fat tree (typical for NUMA)
• A tree structure; the routing algorithm is common-ancestor based.
• It has logarithmic mean latency (e.g. n or 2n, with n = lg2 N the number of tree levels), and other properties similar to those of butterflies.
• In NUMA, the process mapping must be chosen properly, in order to minimize distances.
• However, contention in the switches is too high with simple trees. In order to minimize contention, the link and switch bandwidth increases from level to level, e.g. it doubles: the fat tree.
• Problem: the cost and complexity of the switches also increase from level to level! Modular crossbars cannot be used, otherwise the latency increases.

Generalized fat tree
[Figure: PEs at the leaves; first-, second- and third-level crossbars.]
• Modest increase of contention.
• Suitable both for NUMA and for SMP, if the switches behave according to the butterfly routing or to the tree routing.

k-ary n-cubes
Toroidal structures (rings): 4-ary 1-cube, 4-ary 2-cube, 4-ary 3-cube, with a switch unit at each node.
• Number of processing nodes N = k^n.
• Node degree = 2n.
• Latency ≈ distance = O(k·n) = O(n·N^(1/n)): O(√N) for small n, O(lg_k N) for large n. However, process mapping is critical.
• Number of links and switches = O(N).
• Maximum bandwidth = O(N).
• Complexity for maximum bandwidth = O(c^n) for minimum latency, otherwise O(1).
• Simple deterministic routing (dimensional), sketched below.
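A sketch of the dimensional routing mentioned in the last bullet, for a 4-ary 2-cube; the coordinate conventions and the shorter-direction tie-break are invented for illustration.

/* Sketch of dimensional (dimension-order) routing on a k-ary n-cube:
 * correct one coordinate at a time, moving along the ring of that
 * dimension in the shorter direction. Conventions invented. */
#include <stdio.h>

#define K 4   /* arity */
#define N 2   /* dimension: a 4-ary 2-cube (4x4 torus) */

void route(const int src[N], const int dst[N]) {
    int cur[N];
    for (int d = 0; d < N; d++) cur[d] = src[d];
    for (int d = 0; d < N; d++) {               /* one dimension at a time  */
        int fwd = (dst[d] - cur[d] + K) % K;    /* hops going 'up' the ring */
        int step = (fwd <= K - fwd) ? 1 : -1;   /* shorter direction        */
        while (cur[d] != dst[d]) {
            cur[d] = (cur[d] + step + K) % K;
            printf("dim %d -> node (%d,%d)\n", d, cur[0], cur[1]);
        }
    }
}

int main(void) {
    int src[N] = {0, 0}, dst[N] = {3, 2};
    route(src, dst);    /* 1 hop down in dim 0 (0 -> 3), 2 hops up in dim 1 */
    return 0;
}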
Local Input-Output

Interprocessor communications
• In a multiprocessor, the main mode of processor cooperation for the process RTS is via shared memory.
• However, there are cases in which asynchronous events are needed and are more efficiently signaled through direct interprocessor communications, i.e. via Input-Output.
• Examples:
– processor synchronization (locking, notify),
– low-level scheduling (process wake-up),
– cache coherence strategies, etc.
• In such cases, signaling and testing the presence of asynchronous events via shared memory is very time consuming in terms of latency, bandwidth and contention.

Local I/O
• Each PE contains an on-chip local I/O unit (UC), to send and receive interprocessor event messages. The same, or a dedicated, interconnection structure is used.
• A traditional I/O bus makes no sense for performance reasons: instead, dedicated on-chip links connect the UC with the CPU and W.
[Figure: PE (core) internals: W towards the internal interconnect, carrying input/output interprocessor messages and Load/Store requests; C2; instruction and data C1 (IM, DM); IU and EU; interrupt interface (Int, Ackint); local I/O unit UC with its local I/O memory MUC.]
• To start an interprocessor communication, a CPU uses the I/O instructions: Memory Mapped I/O.
• The associated UC forwards the event message to the UC of the destination PE, in the form of a word stream, through the Ws and the interconnect.
• W is able to distinguish memory access requests/replies from interprocessor communications.
• The receiving UC uses the interrupt mechanism to forward the event message to the destination CPU.
• There is no request-reply behavior; instead, it is a purely asynchronous mechanism.

Example 1
Assume that the event message is composed of the event_code and two data words (data_1, data_2), and that the process running on the destination PE inserts the tuple (event_code, data_1, data_2) into a queue associated with the event.
The source CPU executes the following Memory Mapped I/O instructions:
STORE RUC, 0, PE_dest
STORE RUC, 1, event_code
STORE RUC, 2, data_1
STORE RUC, 3, data_2
where RUC means ...
Interrupt message from UC to CPU: (event, parameter_1, parameter_2).
The destination CPU executes the following interrupt handler:
HANDLER: STORE Rbuffer_ev, Rbuffer_pointer, Revent
         STORE Rbuffer_1, Rbuffer_pointer, Rparameter_1
         STORE Rbuffer_2, Rbuffer_pointer, Rparameter_2
         …
         GOTO Rret_interrupt
Exercise:
1. What happens in a Memory Mapped I/O instruction if the I/O unit doesn't contain a physical local memory?
2. Can the STORE instructions executed by the source CPU be replaced by LOAD instructions?

Example 2
Alternative behavior: the process running on the destination PE is in a busy-waiting condition for the event message, executing the special instruction:
WAITINT Rmask, Revent, Rparameter_1, Rparameter_2
or, if the WAITINT instruction is not primitive, a simple busy-waiting loop like:
       MASKINT Rmask
       EI
WAIT:  GOTO WAIT
(no real handler).

Synchronous vs asynchronous event notification
[Diagram: along the stream of process instructions, an interrupt triggers the interrupt handler, which performs the event registration (asynchronous wait, Example 1); alternatively, the process performs a synchronous wait (Example 2).]
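Finally, a C-level rendering of Example 1's send side, for readers more comfortable with C than with Memory Mapped I/O STOREs; the UC base address and register layout are invented here, the real mechanism being exactly the four STOREs shown above.

/* Hedged C-level sketch of Example 1's send side (illustrative only: the
 * base address and register layout of UC are invented). */
#include <stdint.h>

#define UC_BASE ((volatile uint32_t *)0xF0000000u)  /* assumed MMIO window */

/* Send an interprocessor event message through the local I/O unit UC. */
static inline void send_event(uint32_t pe_dest, uint32_t event_code,
                              uint32_t data_1, uint32_t data_2) {
    UC_BASE[0] = pe_dest;      /* STORE RUC, 0, PE_dest    */
    UC_BASE[1] = event_code;   /* STORE RUC, 1, event_code */
    UC_BASE[2] = data_1;       /* STORE RUC, 2, data_1     */
    UC_BASE[3] = data_2;       /* STORE RUC, 3, data_2     */
}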