WildFire: A Scalable Path for SMPs
Erik Hagersten and Michael Koster
Presented by Andrew Waterman
ECE259 Spring 2008

Insight and Motivation
• SMP was abandoned for more scalable cc-NUMA
• But SMP bandwidth has scaled faster than CPU speed
• cc-NUMA is scalable but more complicated
  – Program/OS specialization necessary
  – Communication to remote memory is slow
  – May not be optimal for real access patterns
• SMP (UMA) is simpler
  – More straightforward programming model
  – Simpler scheduling, memory management
  – No slow remote memory access
• Why not leverage SMP to the extent that it scales?

Multiple SMP (MSMP)
• Connect a few large SMPs (nodes) together
• Distributed shared memory
  – Weren't we just NUMA-bashing?
• Several CPUs per node => many local memory refs
• Few nodes => an unscalable coherence protocol is OK

WildFire Hardware
• MSMP with 112 UltraSPARCs
  – Four unmodified Sun E6000 SMPs
• GigaPlane bus (3.2 GB/s within a node)
• 16 two-CPU or I/O cards per node
• WildFire Interface (WFI) is just another I/O board (cool!)
  – SMPs (UMA) connected via WFI == cc-NUMA (!)
• But this is OK... few, large nodes
• Full cache coherence, both intra- and inter-node

WildFire from 30,000 ft
[Figure: system overview, with emphasis on a single SMP node]

WildFire Software
• ABI-compatible with Sun SMPs
  – It's the software, stupid!
• Slightly (allegedly) modified Solaris 2.6
• Threads in the same process are grouped onto the same node
• Hierarchical Affinity Scheduler (HAS)
• Coherent Memory Replication (CMR)
  – OK, so this isn't purely a software technique

Coherent Memory Replication (CMR)
• S-COMA with fixed home locations for each block
  – For those keeping score, that means it's not COMA
• Local physical pages “shadow” remote physical pages
  – Keep frequently-read pages close: lower average latency
• Implementation: hardware counters
  – CMR page allocation handled within the OS
  – Coherence still in hardware at block granularity
• Enabled/disabled at page granularity
• CMR memory allocation adjusts with memory pressure

Hierarchical Affinity Scheduling (HAS)
• Exploit locality by scheduling a process on the last node on which it executed
• Only reschedule onto another node when load imbalance exceeds a threshold
• Works particularly well when combined with CMR (see the sketch below)
  – Frequently-accessed remote pages are still shadowed locally after a context switch
• Lagniappe locality
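To make CMR concrete, here is a minimal C sketch, not taken from the paper, of what the OS half of CMR might look like: a periodic scan of the per-page access counters maintained in hardware that shadows hot remote pages into local memory and backs off under memory pressure. Every identifier here (cmr_scan, CMR_THRESHOLD, the helper functions) is hypothetical.

```c
/*
 * Hypothetical sketch of a CMR replication pass (not WildFire's actual
 * code): the OS periodically scans per-page remote-access counters
 * maintained in hardware and shadows "hot" remote pages into local memory.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CMR_THRESHOLD 64   /* remote accesses per scan interval (assumed value) */

struct remote_page {
    uint64_t global_addr;  /* address at the page's fixed home node */
    uint32_t access_count; /* bumped by the WFI's hardware counters */
    bool     shadowed;     /* already replicated locally? */
};

/*
 * Hypothetical helpers: allocate a local frame, copy the page, and
 * install the LPA2GA/GA2LPA translations so that hardware coherence
 * keeps working at block granularity on the shadow copy.
 */
extern uint64_t alloc_local_shadow_frame(void);
extern void copy_page(uint64_t dst_lpa, uint64_t src_ga);
extern void install_lpa2ga_mapping(uint64_t lpa, uint64_t ga);
extern bool local_memory_pressure_high(void);

void cmr_scan(struct remote_page pages[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct remote_page *p = &pages[i];
        if (!p->shadowed &&
            p->access_count >= CMR_THRESHOLD &&
            !local_memory_pressure_high()) {
            uint64_t lpa = alloc_local_shadow_frame();
            copy_page(lpa, p->global_addr);
            install_lpa2ga_mapping(lpa, p->global_addr);
            p->shadowed = true;  /* future reads of this page hit local memory */
        }
        p->access_count = 0;     /* start a fresh sampling interval */
    }
}
```

In this sketch the hardware only counts; the policy, the threshold, and the shadow-page allocation live in the OS, matching the slide's split between hardware counters and OS-managed page allocation.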
WildFire Implementation
[Figure: a single Sun E6000 with WFI. Recall: the WildFire Interface is just one of the 16 standard cards on the GigaPlane bus.]

WildFire Implementation
• Network Interface Address Controller (NIAC) + Network Interface Data Controller (NIDC) == WFI
• NIAC interfaces with the GigaPlane bus and handles inter-node coherence
• Four NIDCs talk to the point-to-point interconnect between nodes
  – Three ports per NIDC (one for each remote node)
  – 800 MB/s in each direction to each remote node

WildFire Cache Coherence
• Intra-node coherence: bus + snooping
• Inter-node (global) coherence: directory
  – Directory state kept at a block's home node
• Directory cache (SRAM) backed by memory
  – Home node determined by high-order address bits
  – MOSI
  – Nothing special, since scalability is not an issue
• Blocking directory, 3-stage writeback => no corner cases
• NIAC sits on the bus and asserts the “ignore” signal for requests that global coherence must attend to (sketched in the backup at the end)
  – NIAC intervenes if the block's state is inadequate or the block resides in remote memory

WildFire Cache Coherence
• Coherent Memory Replication complicates matters
  – A local shadow page has a different physical address from its corresponding remote page
  – If a block's state is insufficient, the WFI must look up the global address in order to issue a remote request
    • Stored in the LPA2GA SRAM
    • Also cache the reverse lookup (GA2LPA)

Coherence Example
[Figure: coherence example]

WildFire Memory Latency
• WildFire compared to the SGI Origin (2x R10K per node) and Sequent NUMA-Q (4x Xeon per node)
• WF's remote memory latency is mediocre (2.5x Origin's, similar to NUMA-Q's), but it matters less because remote accesses are less frequent (1/14 as many as Origin, 1/7 as many as NUMA-Q)

Evaluating WildFire
• Clever performance-evaluation methodology: isolate WildFire's contribution by comparing a single-node, 16-CPU system with a two-node, 8-CPUs-per-node system
  – Pure SMP vs. WildFire
• Also compare with NUMA-fat
  – Basically WF with no OS support, i.e., no CMR, no HAS, no locality-aware memory allocation, no kernel replication
• And compare with NUMA-thin
  – NUMA-fat but with small (2-CPU) nodes
• Finally, turn off HAS and CMR to evaluate their contributions to WF's performance

Evaluating WildFire
• WF with HAS+CMR comes within 13% of pure SMP
• Speedup(HAS+CMR) >> Speedup(HAS) * Speedup(CMR)
• Locality-aware allocation and large nodes are important

Evaluating WildFire
• Performance trends correlate with locality of reference
• HAS + CMR + kernel replication + initial allocation improve locality of access from 50% (i.e., a uniform distribution between two nodes) to 87%

Summary
• WildFire = a few large SMP nodes + directory-based coherence between nodes + fast point-to-point interconnect + clever scheduling and replication techniques
• Pretty good performance (unfortunately, no numbers for 112 CPUs)
• Good idea?
  – I think so, but I doubt there's much room for further scalability
    • Then again, that wasn't the point
• Criticisms?
  – The authors are very proud of their slow directory protocol
  – Kernel modifications may not be so slight

Questions?
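Backup: a minimal C sketch, not WildFire's actual implementation, of the global-coherence decision path from the cache-coherence slides: the home node comes from high-order address bits, a MOSI directory entry is consulted, and a local shadow address is translated through LPA2GA before a remote request is issued. All identifiers and parameter values here are hypothetical.

```c
/*
 * Hypothetical sketch of the NIAC's decision path (illustrative only):
 * home node from high-order address bits, a MOSI directory check, and
 * LPA2GA translation before a remote request is issued.
 */
#include <stdbool.h>
#include <stdint.h>

#define NODE_BITS 2    /* four nodes => two high-order address bits */
#define ADDR_BITS 40   /* assumed physical address width */

enum mosi { MOSI_M, MOSI_O, MOSI_S, MOSI_I };

/* High-order address bits select a block's fixed home node. */
static unsigned home_node(uint64_t global_addr)
{
    return (unsigned)((global_addr >> (ADDR_BITS - NODE_BITS)) &
                      ((1u << NODE_BITS) - 1));
}

/* Hypothetical lookups into the WFI's SRAM structures. */
extern enum mosi directory_state(uint64_t global_addr);
extern uint64_t lpa2ga(uint64_t local_phys_addr); /* shadow page -> global address */
extern void send_remote_read(unsigned node, uint64_t global_addr);

/*
 * Called when the NIAC snoops a read on the GigaPlane bus; returns true
 * if it asserted "ignore" and took over the request because the local
 * state is inadequate or the block lives in remote memory.
 */
bool niac_handle_read(uint64_t local_phys_addr, unsigned this_node)
{
    uint64_t ga = lpa2ga(local_phys_addr); /* may equal the input for non-shadow pages */
    unsigned home = home_node(ga);

    if (home == this_node && directory_state(ga) != MOSI_I)
        return false;              /* local memory/caches can supply the block */

    send_remote_read(home, ga);    /* the directory at the home resolves MOSI state */
    return true;                   /* "ignore" asserted; the reply arrives later */
}
```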