AlphaServer GS320 Architecture & Design Gharachorloo, Sharma, Steely, and Van Doren

AlphaServer GS320 Architecture & Design


Gharachorloo, Sharma, Steely, and Van Doren

Compaq Research & High-Performance Servers

Published in 2000 (ASPLOS-IX)
Presented by Matt Johnson

CPS221/ECE259, Advanced Computer Arch. II

Duke University, 1/30/08
AlphaServer GS320 Architecture & Design

Sold by HP until 2004, now discontinued
Overview

Design Goals

Architecture (from 10E+3 ft.)

Coherence Protocol

Memory Consistency

Performance

Analysis/Questions
Design Goals

Targeting small/medium multiprocessors

Exploit known (and limited) system size to implement ideas that don't scale well

e.g. protocol optimizations (limited queue sizes)

Avoid the high latency and protocol overhead of traditional directory protocols, and the bandwidth/scalability problems of snooping
Design Goals


RAS (reliability, availability, serviceability)

Modularity (QBBs, we'll get to them in a moment)

Hardware partitions (failure containment)
Efficiency

Tight integration with CPUs (Alpha 21264)


CPU support for coherence/consistency operations
Directory Protocol avoids NACKs and stalls
Architecture

Between 4 and 32 Alpha 21264 CPUs

Arranged in Quad-processor Building Blocks

7M+ ASIC Gates

4 CPUs

32GB Memory

8 PCI Slots

10-Port Switch
Architecture

10-Port Local Switch (per QBB)

4 Processor, 4 Memory, 1 I/O (PCI), 1 Global

2 QBBs can be connected directly, up to 8 with a global switch

Hardware Coherence Support

DIRectory: 14 bits per 64-byte cache line, storing owner/sharer info

Duplicate TAG Store copies CPUs' L2 cache tags


Transactions-in-Transit Table keeps track of outstanding transactions from a node (48 entries)

All implemented in ASICs, some supported by 21264
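A quick back-of-the-envelope check of what the directory costs, using only the numbers on these slides (14 bits per 64-byte line, 32GB per QBB). This is a sketch, not a figure from the paper: it assumes "32GB" means 32 GiB and that every line of memory has a directory entry. The function names are made up for illustration.

```python
LINE_BYTES = 64                 # cache line size from the slide
DIR_BITS_PER_LINE = 14          # directory bits per line from the slide
QBB_MEMORY_BYTES = 32 * 2**30   # assumption: "32GB" per QBB means 32 GiB


def directory_overhead_fraction(dir_bits=DIR_BITS_PER_LINE, line_bytes=LINE_BYTES):
    """Directory bits as a fraction of the data bits they describe."""
    return dir_bits / (line_bytes * 8)


def directory_bytes(memory_bytes=QBB_MEMORY_BYTES,
                    dir_bits=DIR_BITS_PER_LINE,
                    line_bytes=LINE_BYTES):
    """Total directory storage, assuming one entry per line of memory."""
    lines = memory_bytes // line_bytes
    return lines * dir_bits // 8


print(f"overhead: {directory_overhead_fraction():.2%}")          # 14/512, about 2.73%
print(f"directory per QBB: {directory_bytes() / 2**20} MiB")     # 896.0 MiB
```

So under these assumptions the directory adds a bit under 3% on top of main memory.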
Architecture
Coherence Protocol

4 Types of Requests

Read (not writing, don't need an exclusive copy)

Read-Exclusive (don't have it, want to write to it)

Exclusive (have a shared copy, want to write to it)

Exclusive-Without-Data

Used when you want to write an entire cache line (64B)

Don't need to transfer the old data in this case
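The choice among the four request types can be sketched as a small decision function. This is only an illustration of the bullets above, not the 21264's actual logic; the names `Request` and `choose_request`, and the rule that a full-line write takes priority over holding a shared copy, are assumptions.

```python
from enum import Enum


class Request(Enum):
    READ = "Read"
    READ_EXCLUSIVE = "Read-Exclusive"
    EXCLUSIVE = "Exclusive"
    EXCLUSIVE_WITHOUT_DATA = "Exclusive-Without-Data"


def choose_request(is_write, has_shared_copy=False, writes_full_line=False):
    """Pick a request type from the requester's point of view (hypothetical helper)."""
    if not is_write:
        return Request.READ                      # not writing, no exclusive copy needed
    if writes_full_line:
        return Request.EXCLUSIVE_WITHOUT_DATA    # whole 64B line overwritten, old data unused
    if has_shared_copy:
        return Request.EXCLUSIVE                 # already have the data, need ownership only
    return Request.READ_EXCLUSIVE                # need both the data and ownership


print(choose_request(is_write=True, has_shared_copy=True).value)
```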
Coherence Protocol

Satisfies all requests w/o NACKs or retries

Blocks at the host

Saves bandwidth



Accomplishes this by "doing the right thing" on the requestee side, transparently to the requester

State machines at nodes can be simple, fast, small

Dependencies are resolved on the outskirts of the system, not by clogging up the core w/ a heavy protocol
Coherence Protocol

Deadlock is prevented by using 3 virtual lanes, plus a separate lane for I/O

Q0 for requests, Q1 for local responses, Q2 for remote responses, QIO for I/O (PCI) transactions

Total ordering required on Q1, Point-to-Point ordering on Q0/QIO, no requirements on Q2
Split responses into 2 parts (↓Latency,↑Perf.)

Commit (yeah, I heard ya)

Data (except exclusive-without-data requests)
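One common way to see why separate lanes prevent deadlock: if a message arriving on one lane can only cause messages on a strictly "later" lane, no cycle of waiting can form. The sketch below encodes the lane table from this slide and checks that property. The ranking Q0 < Q1 < Q2 and the helper name `only_flows_forward` are assumptions for illustration; QIO is left out of the ranking because the slides do not say how it relates to the other lanes.

```python
# Ordering requirement per lane, as listed on the slide.
LANE_ORDERING = {
    "Q0": "point-to-point",
    "QIO": "point-to-point",
    "Q1": "total",
    "Q2": "none",
}

# Assumed ranking: a request (Q0) may trigger Q1 traffic, which may trigger Q2 traffic.
LANE_RANK = {"Q0": 0, "Q1": 1, "Q2": 2}


def only_flows_forward(dependencies):
    """True if every (lane_in, lane_out) dependency moves to a strictly later lane.

    This is a sufficient condition for the lane dependency graph to be acyclic.
    """
    return all(LANE_RANK[src] < LANE_RANK[dst] for src, dst in dependencies)


print(only_flows_forward([("Q0", "Q1"), ("Q1", "Q2")]))  # True
print(only_flows_forward([("Q2", "Q0")]))                # False: a response would wait on a request
```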
Coherence Protocol

Instead of building their protocol to handle the general case, they optimize it for a specific case

e.g. the crossbar local and global switches lend themselves to meeting the ordering requirements

they can delay certain responses because they can bound the latency by a reasonable time
Coherence Protocol

(u,v)=(1,0) should be disallowed (would violate sequential consistency)
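The figure for this slide is not reproduced here, so the exact program is an assumption. A classic test that matches the (u,v) notation is: P1 writes A=1 then B=1, while P2 reads u=B then v=A, with A and B initially 0. The sketch below enumerates every interleaving that respects program order (which is what sequential consistency allows) and shows that (1,0) never appears.

```python
from itertools import combinations

# Assumed litmus test; A and B start at 0.
P1 = [("write", "A", 1), ("write", "B", 1)]
P2 = [("read", "B", "u"), ("read", "A", "v")]


def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program's own order."""
    n = len(p1) + len(p2)
    for slots in combinations(range(n), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        yield [next(it1) if i in slots else next(it2) for i in range(n)]


def sc_outcomes():
    """Set of (u, v) results reachable under sequential consistency."""
    outcomes = set()
    for schedule in interleavings(P1, P2):
        mem = {"A": 0, "B": 0}
        regs = {}
        for op, addr, arg in schedule:
            if op == "write":
                mem[addr] = arg
            else:
                regs[arg] = mem[addr]
        outcomes.add((regs["u"], regs["v"]))
    return outcomes


print(sorted(sc_outcomes()))       # [(0, 0), (0, 1), (1, 1)]
print((1, 0) in sc_outcomes())     # False
```

Seeing u=1 means P2 observed the write to B, which P1 issued after its write to A, so v must also be 1 in any sequentially consistent execution.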
Performance

33.5 Gflops on Linpack workload

Supports 2720 users on SAP Benchmark


Higher I/O bandwidth, but similar commercial workload performance to IBM RS/6000 S80 (24-CPU) and Sun 10000 (64-CPU) systems

NUMA + Lightweight Protocol → ↓Mem. Latency

Much better for applications where this matters
Analysis/Questions