Synchronization and Communication in the T3E Multiprocessor

Background
• The T3E is the second of Cray's massively scalable multiprocessors (after the T3D)
• Both scale up to 2048 processing elements
• Shared-memory systems, but programmable with message passing (PVM or MPI, "more portable") or shared memory (HPF)

Challenges
• The T3E (and T3D) attempted to overcome the inherent limitations of using commodity microprocessors in very large multiprocessors
• Memory interface: a cache-line-based system makes single-word references inefficient
• Typical address spaces are too small for very large systems
• Non-cached references are often desirable (e.g. a message to another processor)

T3D Strengths (carried into the T3E)
• An external structure in each PE to expand the address space
• Shared address space
• 3D torus interconnect
• Pipelined remote memory access with a prefetch queue and non-cached stores

T3D: Room for Improvement
• Over-engineered barrier network
• Only one outstanding cache-line fill at a time (low load bandwidth)
• Too many different ways to access remote memory
• Low single-node performance
• Unoptimized special hardware features (block transfer engine, DTB Annex, dedicated message queues and registers)

T3E Overview
• Each PE contains an Alpha 21164, local memory, and control and routing chips
• Network links are time-multiplexed at 5x the system frequency
• Self-hosted, running UNICOS/mk
• No remote caching and no board-level caches

E-Registers
• Extend the physical address space
• Increase attainable memory pipelining
• Enable high single-word bandwidth
• Provide mechanisms for data distribution, messaging, and atomic memory operations
• In general, they improve on the T3D's inefficient individual special-purpose structures

Operations with E-Registers
• The processor first stores the appropriate operands into the appropriate E-registers
• The processor then issues another store to initiate the operation (see the sketch after the Remote Reads/Writes section)
  – The address specifies the command and the source or destination E-register
  – The data specifies a pointer to the already-stored operands and a remote address index

Address Translation
• Global virtual addresses and virtual PE numbers are formed outside the processor
• A "centrifuge" unit supports efficient data distribution by separating PE-number bits from address-offset bits
• Placing the target memory location on the data bus enables a larger address space than the processor itself supports

Remote Reads/Writes
• All operations are performed by reading memory into E-registers (Gets) or writing from E-registers to memory (Puts)
• Vector forms transfer 8 words with arbitrary stride (e.g. every 3rd word)
• The large number of E-registers allows substantial pipelining of Gets and Puts
  – Limited by the bus interface (256 B / 26.7 ns)
• Single-word load bandwidth is high: scattered words can be gathered into contiguous E-registers and then moved into cache, instead of fetching a full cache line per word
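The store-triggered command interface above is easier to see in code. Below is a minimal C sketch of issuing a single-word Get, assuming a hypothetical memory-mapped E-register region; the base address, command code, and bit layout are invented for illustration and are not the actual T3E encodings.

```c
/* Hedged sketch of the store-triggered E-register interface.
 * EREG_BASE, CMD_GET, and the bit layout below are hypothetical;
 * the real T3E encodings are not given in these slides. */
#include <stdint.h>

#define EREG_BASE  0x4000000000ull   /* hypothetical memory-mapped E-register region */
#define CMD_GET    0x1ull            /* hypothetical command code: single-word Get */

static inline void ereg_get(unsigned dst_ereg,      /* E-register to receive the word */
                            uint64_t operand_block, /* E-register block holding base/mask/stride */
                            uint64_t addr_index)    /* index into the remote address space */
{
    /* The *address* of the store selects the command and destination E-register. */
    volatile uint64_t *cmd =
        (volatile uint64_t *)(uintptr_t)(EREG_BASE | (CMD_GET << 12) | ((uint64_t)dst_ereg << 3));

    /* The *data* of the store carries the pointer to the pre-loaded operands
     * plus the remote address index, as described in the slides. */
    *cmd = (operand_block << 16) | (addr_index & 0xffff);
}
```

The operands themselves (base address, centrifuge mask, stride) would have been written into the operand E-register block by ordinary stores before this trigger store is issued.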
Atomic Memory Operations
• Fetch_&_Inc, Fetch_&_Add, Compare_&_Swap, and Masked_Swap
• Can be performed on any memory location
• Performed like any other E-register operation (a usage sketch appears at the end of this summary)
  – Operands are placed in E-registers
  – The operation is triggered via a store and sent over the network
  – The result is sent back and stored in the specified E-register

Messaging
• T3D: a single message queue at a fixed location, of fixed size
• T3E: an arbitrary number of queues, mapped into normal memory, of any size up to 128 MB
• T3D: every incoming message generated an interrupt, adding significant penalties
• T3E: three options – interrupt, no interrupt (messages detected by polling), or interrupt only after a threshold number of messages

Messaging Specifics
• Message queues are described by Message Queue Control Words (MQCWs)
• A message is assembled in 8 E-registers, and a SEND is issued with the address of the target MQCW
• The message queue is managed in software, which avoids the OS entirely when polling is used

Synchronization
• Support for barriers and eurekas (a message from one processor to a group)
• 32 barrier synchronization units (BSUs) at each processor, accessed as memory-mapped registers
• Synchronization packets use a dedicated high-priority virtual channel
  – Propagated through a logical tree embedded in the 3D torus interconnect
• A simple barrier operation involves 2 states (a sketch of the memory-mapped interface appears at the end of this summary)
  – First, every processor in the group arms the barrier (S_ARM)
  – Once all are armed, the network notifies everyone of completion and the processors return to S_BAR
• A eureka requires 3 states to ensure one eureka is received before the next is issued
  – A eureka notification is immediately followed by a barrier

Performance
• Increasing the number of E-registers allows greater pipelining and bandwidth (ultimately limited by the control logic)
• Effective bandwidth increases greatly with larger transfer sizes because of per-transfer overhead and startup latency
• Transfer bandwidth is independent of stride, except when the data happens to fall into the same memory bank(s) (strides that are multiples of 4 or 8)
• It takes several million AMOs per second to saturate the memory system and drive latency up
• Very high message bandwidth is sustained without an increase in latency
• The hardware barrier is many times faster than an efficient software barrier (roughly 15x at 1024 PEs)

Conclusions
• E-registers allow a highly pipelined memory system and provide a common interface for all global memory operations
• Both messages and standard shared-memory operations are supported
• A fast hardware barrier is provided at almost no extra cost
• The absence of remote caching eliminates the need for bulky coherence mechanisms and helps make 2048-PE systems feasible
• The paper provides no means of quantitative comparison with alternative systems
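To make the atomic-memory-operation flow concrete, here is a hedged sketch of Fetch_&_Inc used as a global work counter. The wrapper amo_fetch_inc() is hypothetical; it stands in for the E-register sequence described above (operands stored to E-registers, operation triggered by a store, result read back from the designated E-register) and is not a real T3E or library call.

```c
/* Hedged sketch: Fetch_&_Inc as a global work-queue index.
 * amo_fetch_inc() is hypothetical; it stands in for the E-register
 * AMO sequence described in the slides. */
#include <stdint.h>

extern int64_t amo_fetch_inc(int pe, int64_t *remote_addr);  /* hypothetical wrapper */

static int64_t next_task;   /* the shared counter lives in PE 0's memory */

void process_tasks(int64_t ntasks, void (*do_task)(int64_t))
{
    for (;;) {
        /* AMOs may target any memory location, so every PE can atomically
         * claim the next unclaimed task index from PE 0. */
        int64_t t = amo_fetch_inc(0, &next_task);
        if (t >= ntasks)
            break;
        do_task(t);
    }
}
```

Each call returns a unique index no matter how many PEs compete, and the performance bullets note that several million such operations per second are needed before the memory system saturates.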
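Finally, a hedged sketch of how the two-state barrier might look to software through a memory-mapped BSU. The base address, register layout, and numeric values of S_ARM and S_BAR are invented; the slides say only that each PE has 32 memory-mapped BSUs that return from S_ARM to S_BAR once every processor in the group has arrived.

```c
/* Hedged sketch of a memory-mapped barrier synchronization unit (BSU).
 * BSU_BASE and the state encodings are hypothetical. */
#include <stdint.h>

#define BSU_BASE  0x5000000000ull   /* hypothetical mapping of the 32 BSUs */
#define S_BAR     0x0ull            /* hypothetical encoding: barrier satisfied */
#define S_ARM     0x1ull            /* hypothetical encoding: this PE has armed */

static inline void hw_barrier(unsigned bsu)   /* bsu in [0, 31] */
{
    volatile uint64_t *reg =
        (volatile uint64_t *)(uintptr_t)(BSU_BASE + 8ull * bsu);

    *reg = S_ARM;            /* arm this PE's leaf of the embedded barrier tree */
    while (*reg != S_BAR)    /* hardware reports completion by returning to S_BAR */
        ;
}
```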