AlphaServer GS320 Architecture & Design
Gharachorloo, Sharma, Steely, and Van Doren
Compaq Research & High-Performance Servers
Published in 2000 (ASPLOS-IX)
Presented by Matt Johnson
CPS221/ECE259, Advanced Computer Arch. II
Duke University, 1/30/08
(Sold by HP until 2004, now discontinued)

Overview
- Design Goals
- Architecture (from 10E+3 ft.)
- Coherence Protocol
- Memory Consistency
- Performance
- Analysis/Questions

Design Goals
- Targeting small/medium multiprocessors
- Exploit the known (and limited) system size to implement ideas that don't scale well
  - e.g. protocol optimizations (limited queue sizes)
- Avoid the high latency and protocol overhead of traditional directory protocols, and the bandwidth/scalability problems of snooping

Design Goals (cont.)
- RAS (reliability, availability, serviceability)
  - Modularity (QBBs, we'll get to them in a moment)
  - Hardware partitions (failure containment)
- Efficiency
  - Tight integration with CPUs (Alpha 21264)
  - CPU support for coherence/consistency operations
  - Directory protocol avoids NACKs and stalls

Architecture
- Between 4 and 32 Alpha 21264 CPUs
- Arranged in Quad-processor Building Blocks (QBBs)
- 7M+ ASIC gates
- Per QBB: 4 CPUs, 32GB memory, 8 PCI slots, 10-port switch

Architecture (cont.)
- 10-port local switch (per QBB): 4 processor ports, 4 memory ports, 1 I/O (PCI) port, 1 global port
- 2 QBBs can be connected directly, up to 8 with a global switch

Hardware Coherence Support
- DIRectory: 14 bits per 64-byte cache line store owner/sharer info
- Duplicate TAG store: copies the CPUs' L2 cache tags
- Transactions-in-Transit Table: keeps track of outstanding transactions from a node (48 entries)
- All implemented in ASICs, some supported by the 21264

Coherence Protocol
- 4 types of requests:
  - Read (not writing, don't need an exclusive copy)
  - Read-Exclusive (don't have it, want to write to it)
  - Exclusive (have a shared copy, want to write to it)
  - Exclusive-Without-Data
    - Used when you want to write an entire cache line (64B)
    - Don't need to transfer the old data in this case

Coherence Protocol (cont.)
- Satisfies all requests without NACKs, retries, or blocking at the home node (saves bandwidth)
- Accomplishes this by "doing the right thing" on the requestee side, transparently to the requester
- State machines at the nodes can be simple, fast, and small
- Dependencies are resolved on the outskirts of the system, not by clogging up the core with a heavy protocol

Coherence Protocol (cont.)
- Deadlock is prevented with three virtual lanes: Q0 for requests, Q1 for local responses, Q2 for remote responses (plus QIO for I/O (PCI) transactions)
- Total ordering required on Q1, point-to-point ordering on Q0/QIO, no requirements on Q2
- Responses are split into 2 parts (↓latency, ↑perf.):
  - Commit ("yeah, I heard ya")
  - Data (except for exclusive-without-data requests)

Coherence Protocol (cont.)
- Instead of building their protocol to handle the general case, they optimize it for a specific one
  - e.g. the crossbar local and global switches lend themselves to meeting the ordering requirements
  - they can delay certain responses because the latency is bounded by a reasonable time

Memory Consistency
- (u,v) = (1,0) should be disallowed (it would violate sequential consistency)

Performance
- 33.5 Gflops on the Linpack workload
- Supports 2720 users on the SAP benchmark
- Higher I/O bandwidth, but commercial-workload performance similar to the IBM RS/6000 S80 (24-CPU) and Sun 10000 (64-CPU) systems
- NUMA + lightweight protocol → ↓ memory latency
  - Much better for applications where this matters

Analysis/Questions
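Backup: the slides say the directory spends 14 bits per 64-byte line on owner/sharer state. A minimal sketch of how such an entry could pack into 14 bits for an 8-QBB system; the field layout here is an illustrative assumption, not the actual GS320 encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 14-bit directory entry for a 64-byte line in an
 * 8-node (QBB) system. Assumed packing (NOT the real GS320 format):
 *   bits [13:12] state (invalid / shared / exclusive)
 *   bits [11:4]  sharer bitmask, one bit per QBB
 *   bits [3:0]   owner node id
 */
enum dir_state { DIR_INVALID = 0, DIR_SHARED = 1, DIR_EXCLUSIVE = 2 };

static uint16_t dir_pack(enum dir_state s, uint8_t sharers, uint8_t owner) {
    return (uint16_t)(((s & 0x3u) << 12) | ((uint16_t)sharers << 4) | (owner & 0xFu));
}

static enum dir_state dir_state_of(uint16_t e) { return (enum dir_state)((e >> 12) & 0x3u); }
static uint8_t dir_sharers(uint16_t e) { return (uint8_t)((e >> 4) & 0xFFu); }
static uint8_t dir_owner(uint16_t e)   { return (uint8_t)(e & 0xFu); }
```

The point of the exercise: with at most 8 nodes, a full sharer bitmask plus an owner id fits in well under two bytes per line, which is the kind of "exploit the limited system size" trick the design goals call out.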
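Backup: the four request types differ in whether the line's current data must be shipped at all, which is exactly what Exclusive-Without-Data exploits. A sketch of that distinction (the enum and helper names are mine, not from the paper):

```c
#include <assert.h>
#include <stdbool.h>

/* The four GS320 request types from the slides. */
enum req_type {
    REQ_READ,                   /* shared copy, no intent to write */
    REQ_READ_EXCLUSIVE,         /* no copy held, want to write     */
    REQ_EXCLUSIVE,              /* shared copy held, want to write */
    REQ_EXCLUSIVE_WITHOUT_DATA  /* will overwrite the full 64B line */
};

/* Must the home/owner send the line's current contents? */
static bool needs_data_transfer(enum req_type t) {
    switch (t) {
    case REQ_READ:
    case REQ_READ_EXCLUSIVE:
        return true;   /* requester has no valid copy of the data  */
    case REQ_EXCLUSIVE:
        return false;  /* ownership upgrade only; data already held */
    case REQ_EXCLUSIVE_WITHOUT_DATA:
        return false;  /* whole line will be rewritten anyway       */
    }
    return true;
}
```

Exclusive and Exclusive-Without-Data both skip the 64-byte payload, but for different reasons: the former because the requester already has the data, the latter because the old data is about to be thrown away.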
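Backup: the "(u,v) = (1,0) should be disallowed" remark is the standard two-processor litmus test; I'm assuming the usual form, P1 executes A = 1 then B = 1 while P2 reads B into u then A into v. Enumerating every sequentially consistent interleaving shows why (1,0) can never appear:

```c
#include <assert.h>
#include <stdbool.h>

/* Enumerate all SC interleavings of
 *   P1: A = 1; B = 1;        P2: u = B; v = A;
 * (variable names assumed) and report whether any yields (u,v) == (1,0).
 * Under SC, seeing u == 1 means P1's earlier write A = 1 is also visible,
 * so v must be 1. */
static bool sc_allows_u1_v0(void) {
    /* Each row is one merge of the two program orders: 1 = P1 op, 2 = P2 op. */
    int orders[6][4] = {
        {1,1,2,2}, {1,2,1,2}, {1,2,2,1},
        {2,1,1,2}, {2,1,2,1}, {2,2,1,1}
    };
    for (int i = 0; i < 6; i++) {
        int A = 0, B = 0, u = -1, v = -1, p1 = 0, p2 = 0;
        for (int s = 0; s < 4; s++) {
            if (orders[i][s] == 1) {
                if (p1++ == 0) A = 1; else B = 1;
            } else {
                if (p2++ == 0) u = B; else v = A;
            }
        }
        if (u == 1 && v == 0) return true;
    }
    return false;
}
```

This is what the split commit/data responses must preserve: the early commit event may race ahead of the data, but the system still has to rule out outcomes like (1,0) that no single interleaving can produce.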