AMD OPTERON ARCHITECTURE
Omar Aragon
Abdel Salam Sayyad

This presentation is missing the references used

Outline
• Features
• Block diagram
• Microarchitecture
• Pipeline
• Cache
• Memory controller
• HyperTransport
• InterCPU Connections

Features
• 64-bit x86-based microprocessor
• On-chip double-data-rate (DDR) memory controller [low memory latency]
• Three HyperTransport links [connect to other devices without support chips]
• Out-of-order, superscalar processor
• Adds 64-bit addressing (48-bit virtual and 40-bit physical) and expands the number of registers
• Supports legacy 32-bit applications without modification or recompilation

Features (continued)
• Double the number of registers
  • Integer general-purpose registers (GPRs): 16
  • Streaming SIMD Extensions (SSE) registers: 16
• Satisfies the register allocation needs of more than 80% of functions appearing in a typical program
• Connected to memory through an integrated memory controller
• High-performance I/O subsystem via the HyperTransport bus

Block diagram
[Figure: block diagram]

Microarchitecture
• Works with fixed-length micro-ops and dispatches them into two independent schedulers: one for integer, and one for floating point and multimedia (MMX, 3DNow!, SSE and SSE2)
• Load and store micro-ops go to the load/store unit
• The schedulers can dispatch up to 11 micro-ops each cycle to the following execution resources:
  • Three integer execution units
  • Three address generation units
  • Three floating-point and multimedia units
  • Two loads/stores to the data cache

Microarchitecture
[Figure: microarchitecture]

Pipeline
• Long enough for high frequency and short enough for good IPC (instructions per cycle)
• Fully integrated from instruction fetch through DRAM access
• The execution pipeline is typically:
  • 12 stages for integer
  • 17 stages for floating point
• Data cache access occurs in stage 11
• On an L1 cache miss, the pipeline accesses the L2 cache and, in parallel, the request goes to the system request queue
• The pipeline in the DRAM controller runs at the same frequency as the core

Pipeline
[Figure: pipeline]

Memory, Cache, and HyperTransport

Cache
• Separate L1 instruction and data caches
  • Each is 64 Kbytes, 2-way set associative, with a 64-byte cache line
• L2 cache (data and instructions)
  • Size: 1 Mbyte, 16-way set associative
  • Uses a pseudo-least-recently-used (pseudo-LRU) replacement policy
• Independent L1 and L2 translation look-aside buffers (TLBs)
  • The L1 TLB is fully associative and stores thirty-two 4-Kbyte page translations and eight 2-Mbyte/4-Mbyte page translations
  • The L2 TLB is four-way set associative with 512 4-Kbyte entries

Onboard Memory Control
• 128-bit memory bus
• Latency reduced and bandwidth doubled
• Multiprocessor systems: each processor has its own memory interface and its own memory
• Available memory scales with the number of processors
• DDR-SDRAM only
• Up to 8 registered DDR DIMMs per processor
• Memory bandwidth of up to 5.3 Gbytes/s per processor

HyperTransport
• Bidirectional, serial/parallel, scalable, high-bandwidth, low-latency bus
• Packet based
  • 32-bit words regardless of physical width
• Facilitates power management and low latencies

HyperTransport in the Opteron
• 16 CAD HyperTransport (16 bits wide; CAD = Command, Address, Data)
  • Processor-to-processor and processor-to-chipset
  • Bandwidth of up to 6.4 Gbytes/s (per HT port)
• 8-bit-wide HyperTransport for components such as normal I/O hubs

InterCPU Connections
• Multiple CPUs are connected through a proprietary extension running on additional HyperTransport interfaces
• Allows support of a cache-coherent, Non-Uniform Memory Access (NUMA), multi-CPU memory access protocol
• Non-Uniform Memory Access
  • Separate cache memory for each processor
  • Memory access time depends on memory location (i.e., local is faster than non-local)
• Cache coherence
  • Integrity of data stored in the local caches of a shared resource
• Each CPU can access the main memory of another processor, transparent to the programmer
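
Worked example: checking the cache and bandwidth figures

The cache geometry, TLB sizes, and peak bandwidths quoted on the Cache, Onboard Memory Control, and HyperTransport slides can be sanity-checked with a little arithmetic. The Python sketch below is only an illustration, not part of the original deck. The class and function names are hypothetical, and three inputs are assumptions that do not appear on the slides: a 64-byte line for the L2 cache, a DDR333 transfer rate (about 333 million transfers/s) for the memory bus, and 1600 million transfers/s per direction for a 16-bit HyperTransport link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cache:
    """Set-associative cache geometry (hypothetical helper for illustration)."""
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        # Number of sets = capacity / (associativity * line size)
        return self.size_bytes // (self.ways * self.line_bytes)

    def split(self, addr: int) -> tuple:
        """Split an address into (tag, set index, byte offset within the line)."""
        offset = addr % self.line_bytes
        index = (addr // self.line_bytes) % self.sets
        tag = addr // (self.line_bytes * self.sets)
        return tag, index, offset


def peak_bytes_per_s(width_bits: int, transfers_per_s: int) -> int:
    """Peak bandwidth of a bus: bytes moved per transfer times transfers per second."""
    return width_bits * transfers_per_s // 8


# Figures taken from the Cache slide.
L1 = Cache(size_bytes=64 * 1024, ways=2, line_bytes=64)
# ASSUMPTION: the slides do not give the L2 line size; 64 bytes is assumed here.
L2 = Cache(size_bytes=1024 * 1024, ways=16, line_bytes=64)

# TLB reach with 4-Kbyte pages (entry counts from the Cache slide).
PAGE = 4 * 1024
l1_tlb_reach = 32 * PAGE    # thirty-two 4-Kbyte entries
l2_tlb_reach = 512 * PAGE   # 512 4-Kbyte entries

# ASSUMPTION: DDR333 memory, roughly 333 million transfers/s on the 128-bit bus.
memory_bw = peak_bytes_per_s(128, 333_333_333)
# ASSUMPTION: 1600 million transfers/s per direction on a 16-bit HT link.
ht_one_way = peak_bytes_per_s(16, 1_600_000_000)
ht_both_ways = 2 * ht_one_way  # the link is bidirectional

if __name__ == "__main__":
    print("L1 sets:", L1.sets)
    print("L2 sets:", L2.sets)
    print("L1 TLB reach (Kbytes):", l1_tlb_reach // 1024)
    print("L2 TLB reach (Mbytes):", l2_tlb_reach // (1024 * 1024))
    print("Memory bandwidth (Gbytes/s):", round(memory_bw / 1e9, 1))
    print("HT bandwidth per port (Gbytes/s):", round(ht_both_ways / 1e9, 1))
```

Under these assumptions the memory bus comes out at about 5.3 Gbytes/s and a 16-bit HyperTransport port at 6.4 Gbytes/s counting both directions, which matches the slides. The L1 caches work out to 512 sets each, and the L2 TLB covers 2 Mbytes of 4-Kbyte pages.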