POWER5
Ewen Cheslack-Postava, Case Taintor, Jake McPadden

POWER5 Lineage
- IBM 801 – widely considered the first true RISC processor
- POWER1 – three chips wired together (branch, integer, floating point)
- POWER2 – improved POWER1: a second FPU, additional cache, and 128-bit math
- POWER3 – moved to a 64-bit architecture
- POWER4…

POWER4
- Dual-core
- High-speed connections to up to three other pairs of POWER4 CPUs
- Ability to turn off a pair of CPUs to increase throughput
- The Apple G5 uses a single-core derivative of POWER4 (the PowerPC 970)
- POWER5 was designed to allow POWER4 optimizations to carry over

Pipeline Requirements
- Maintain binary compatibility
- Maintain structural compatibility, so optimizations for POWER4 carry forward
- Improved performance
- Enhancements for server virtualization
- Improved reliability, availability, and serviceability at chip and system levels

Pipeline Improvements
- Enhanced thread-level parallelism: two threads per processor core, a.k.a. simultaneous multithreading (SMT)
- 2 threads/core x 2 cores/chip = 4 threads/chip
- Each thread has independent access to the L2 cache
- Dynamic power management
- Reliability, availability, and serviceability

POWER5 Chip Stats
- Copper interconnects decrease wire resistance and reduce delays in wire-dominated chip timing paths
- 8 levels of metal
- 389 mm2 die

POWER5 Chip Stats
- Silicon-on-insulator (SOI) devices: a thin layer of silicon (50 nm to 100 µm) on an insulating substrate, usually sapphire or silicon dioxide (80 nm)
- Reduces the electrical charge a transistor has to move during a switching operation (compared to bulk CMOS):
  - increased speed (up to 15%)
  - reduced switching energy (up to 20%)
  - allows higher clock frequencies (> 5 GHz)
- Reduces soft errors
- SOI chips cost more to produce and are therefore used for high-end applications

Pipeline
- The pipeline is identical to POWER4's
- All latencies, including the branch misprediction penalty and the load-to-use latency on an L1 D-cache hit, are the same as POWER4's

POWER5 Pipeline
- Stage legend: IF – instruction fetch, IC – instruction cache, BP – branch predict, Dn – decode stage, Xfer – transfer, GD – group dispatch, MP – mapping, ISS – instruction issue, RF – register file read, EX – execute, EA – compute address, DC – data cache, F6 – six-cycle floating-point unit, Fmt – data format, WB – write back, CP – group commit

Instruction Data Flow
- Unit legend: LSU – load/store unit, FXU – fixed-point execution unit, FPU – floating-point unit, BXU – branch execution unit, CRL – condition register logical execution unit

Instruction Fetch
- Up to 8 instructions fetched per cycle from the instruction cache
- The instruction cache and instruction translation are shared between threads
- Only one thread fetches per cycle

Branch Prediction
- Three branch history tables, shared by the two threads:
  - one bimodal predictor
  - one path-correlated predictor
  - one selector that predicts which of the first two is correct
- Can predict all branches, even if every instruction fetched is a branch

Branch Prediction
- Branch-to-link-register (bclr) and branch-to-count-register targets are predicted using a return-address stack and a count-cache mechanism
- Absolute and relative branch targets are computed directly in the branch-scan function
- Branches are entered in the branch information queue (BIQ) and deallocated in program order
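A minimal Python sketch of how such a tournament scheme works: a bimodal table indexed by branch address, a path-correlated (gshare-style) table indexed by address XOR global history, and a selector trained toward whichever component was right when the two disagree. The table size, 2-bit counters, and XOR indexing here are illustrative assumptions, not POWER5's actual table organization.

    SIZE = 16384                   # illustrative; not the real table size
    bimodal  = [1] * SIZE          # 2-bit saturating counters, weakly not-taken
    gshare   = [1] * SIZE
    selector = [2] * SIZE          # >= 2 means "trust the gshare table"
    history  = 0                   # global branch history register

    def predict(pc):
        b = bimodal[pc % SIZE] >= 2
        g = gshare[(pc ^ history) % SIZE] >= 2
        return g if selector[pc % SIZE] >= 2 else b

    def update(pc, taken):
        global history
        bi, gi = pc % SIZE, (pc ^ history) % SIZE
        b_ok = (bimodal[bi] >= 2) == taken
        g_ok = (gshare[gi] >= 2) == taken
        if b_ok != g_ok:           # train the selector only on disagreement
            s = selector[pc % SIZE]
            selector[pc % SIZE] = min(3, s + 1) if g_ok else max(0, s - 1)
        for table, idx in ((bimodal, bi), (gshare, gi)):
            table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
        history = ((history << 1) | int(taken)) & (SIZE - 1)

Training the selector only when the two components disagree is what lets the third table learn, per branch, which predictor to trust.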
Instruction Grouping
- Separate instruction buffers for each thread, 24 instructions per buffer
- Five instructions are fetched from one thread's buffer to form an instruction group
- All instructions in a group are decoded in parallel

Group Dispatch & Register Renaming
- When all resources needed by a group are available, the group is dispatched (GD)
- From D0 through GD, instructions stay in program order
- MP: register renaming; architected registers are mapped to physical registers
- Register files are shared dynamically by the two threads; in single-threaded (ST) mode, all registers are available to the one thread
- Instructions are then placed in shared issue queues

Group Tracking
- Instructions are tracked as a group to simplify tracking logic
- Control information is placed in the global completion table (GCT) at dispatch
- Entries are allocated in program order, but the two threads' entries may be intermingled
- A GCT entry is deallocated when its group commits

Load/Store Reorder Queues
- The load reorder queue (LRQ) and store reorder queue (SRQ) maintain the program order of loads and stores within a thread
- They allow checking for address conflicts between loads and stores

Instruction Issue
- No distinction is made between instructions from different threads, and neither thread has priority
- Issue is independent of an instruction's GCT group
- Up to 8 instructions can issue per cycle
- Instructions then flow through the execution units and the write-back stage

Group Commit
- A group commits (CP) when all of its instructions have executed without exceptions and it is the oldest group in its thread
- One group can commit per cycle from each thread

Enhancements to Support SMT
- The instruction and data caches are the same size as POWER4's, but associativity doubles to 2-way and 4-way, respectively
- I-cache and D-cache entries can be fully shared between threads

Enhancements to Support SMT
- Two-step address translation:
  - effective address -> virtual address, using a 64-entry segment lookaside buffer (SLB)
  - virtual address -> physical address, using a hashed page table, cached in a 1024-entry, 4-way set-associative TLB
- Two first-level translation tables (one for instructions, one for data); the SLB and TLB are consulted only on a first-level miss
- First-level data translation table: fully associative, 128 entries
- First-level instruction translation table: 2-way set-associative, 128 entries
- Entries in both first-level tables are tagged with the thread number and are not shared between threads; TLB entries can be shared between threads
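The two-step walk can be made concrete with a toy model: a per-thread first-level table caches complete effective-to-physical translations, and a miss falls back to the shared SLB (step 1) and TLB (step 2). The table shapes follow the bullets above, but the 4 KB pages, 256 MB segments, and dictionary-based lookup are simplifying assumptions.

    PAGE_BITS, SEG_BITS = 12, 28   # assumed 4 KB pages, 256 MB segments

    class Translator:
        def __init__(self):
            self.first_level = {}  # (thread, effective page) -> physical page
            self.slb = {}          # effective segment -> virtual segment (shared)
            self.tlb = {}          # virtual page -> physical page (shared)

        def translate(self, thread, ea):
            epage, offset = ea >> PAGE_BITS, ea & ((1 << PAGE_BITS) - 1)
            key = (thread, epage)              # first-level entries are tagged
            if key not in self.first_level:    # with the thread number
                vseg = self.slb[ea >> SEG_BITS]          # step 1: EA -> VA
                vpage = (vseg << (SEG_BITS - PAGE_BITS)) | (epage & 0xFFFF)
                self.first_level[key] = self.tlb[vpage]  # step 2: VA -> PA
            return (self.first_level[key] << PAGE_BITS) | offset

    t = Translator()
    t.slb[0] = 0x42                    # effective segment 0 -> virtual segment
    t.tlb[(0x42 << 16) | 1] = 0x999    # virtual page -> physical page
    print(hex(t.translate(0, 0x1ABC))) # -> 0x999abc

A KeyError on the slb or tlb dictionary stands in for the hardware table walk (and eventual page fault) that a real miss would trigger.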
Enhancements to Support SMT
- One LRQ and one SRQ per thread, 16 entries each
- Threads can still run out of queue space, so 32 virtual entries were added (16 per thread)
- A virtual entry contains enough information to identify the instruction, but not the load/store address
- A low-cost way to extend the LRQ/SRQ without stalling instruction dispatch

Enhancements to Support SMT
- Branch information queue (BIQ): 16 entries (same as POWER4), split in half for SMT mode; performance modeling suggested this was a sufficient solution
- Load miss queue (LMQ): 8 entries (same as POWER4), with a thread bit added to allow dynamic sharing

Enhancements to Support SMT
- Dynamic resource balancing: resource-balancing logic monitors shared resources (e.g., the GCT and LMQ) to determine whether one thread has exceeded a threshold
- The offending thread can be throttled back so its sibling can continue to make progress
- Methods of throttling:
  - reduce the thread's priority (used when it holds too many GCT entries)
  - inhibit the thread's instruction decoding until the congestion clears (used when it incurs too many L2 cache misses)
  - flush all of the thread's instructions waiting for dispatch and stop it from decoding until the congestion clears (used when it is executing an instruction that takes a long time to complete)

Enhancements to Support SMT
- Thread priority: eight levels per thread, where 0 = not running, 1 = lowest, 7 = highest
- The thread with higher priority is given additional decode cycles
- Both threads at lowest priority puts the core in a power-saving mode

Single-Threaded Mode
- All rename registers, issue queues, and LRQ/SRQ entries are available to the one active thread
- Allows higher performance than POWER4 at equivalent frequencies
- Software can switch the processor dynamically between single-threaded and SMT mode

RAS of POWER4
- High availability through minimized component failure rates
- Designed using techniques that permit hard- and soft-failure detection, recovery, isolation, repair deferral, and component replacement while the system is operating
- Fault-tolerant techniques used for the array, logic, storage, and I/O systems
- Fault isolation and recovery

RAS of POWER5
- Same techniques as POWER4, with new emphasis on reducing scheduled outages to further improve system availability
- Firmware can be upgraded on a running machine
- ECC on all system interconnects; single-bit interconnect failures are corrected dynamically, and deferred repair is scheduled for persistent failures
- The source of an error can be determined: on a non-recoverable error the system is taken down, the book containing the fault is taken offline, and the system is rebooted, all without human intervention
- Thermal protection sensors

Dynamic Power Management
- Reduce switching power: clock gating
- Reduce leakage power: minimal use of low-threshold transistors
- Low-power mode
- Two-stage response to excess heat:
  - stage 1: alternate stalls and execution until the chip cools
  - stage 2: clock throttling

[Figure: Effects of dynamic power management with and without simultaneous multithreading enabled. Photographs were taken with a heat-sensitive camera while a prototype POWER5 chip was undergoing tests in the laboratory.]

Memory Subsystem
- The memory controller and L3 directory moved on-chip
- Interfaces with DDR1 or DDR2 memory
- Error detection and correction handled by ECC
- Memory scrubbing for "soft errors": errors are corrected while the memory is idle

Cache Sizes
- L1 I-cache: 64 KB
- L1 D-cache: 32 KB
- L2: 1.92 MB
- L3: 36 MB

Cache Hierarchy
- Reads from memory are written into L2
- L2 and L3 are shared between the two cores
- The L3 (36 MB) acts as a victim cache for L2
- A cache line is reloaded into L2 on an L3 hit
- A line evicted from L3 is written back to main memory if it is dirty
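The L2/L3 victim relationship can be sketched as follows; the toy capacities, FIFO eviction, and dictionary-based lookup are illustrative assumptions (the real caches are set-associative and far larger).

    from collections import OrderedDict

    L2_LINES, L3_LINES = 4, 8      # toy sizes standing in for 1.92 MB / 36 MB
    l2, l3 = OrderedDict(), OrderedDict()   # line address -> dirty bit

    def fill_l2(addr, dirty):
        if len(l2) >= L2_LINES:
            vaddr, vdirty = l2.popitem(last=False)  # L2 victim drops into L3
            if len(l3) >= L3_LINES:
                waddr, wdirty = l3.popitem(last=False)
                if wdirty:                          # dirty L3 victim goes to memory
                    print(f"write back line {waddr:#x} to memory")
            l3[vaddr] = vdirty
        l2[addr] = dirty

    def access(addr, write=False):
        if addr in l2:                      # L2 hit
            l2[addr] = l2[addr] or write
        elif addr in l3:                    # L3 hit: reload the line into L2
            fill_l2(addr, l3.pop(addr) or write)
        else:                               # miss: read from memory into L2
            fill_l2(addr, write)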
Important Notes on Diagram
- Three buses run between the memory controller and the SMI chips:
  - an address/command bus
  - a unidirectional write data bus (8 bytes)
  - a unidirectional read data bus (16 bytes)
- Each bus operates at twice the DIMM speed

Important Notes on Diagram
- Two or four SMI chips can be used; each SMI chip interfaces with two DIMMs
- 2-SMI mode: 8-byte read, 2-byte write interfaces
- 4-SMI mode: 4-byte read, 2-byte write interfaces

Size does matter
- Die-photo comparison: POWER5 vs. Pentium III. …compensating?

Possible Configurations
- DCM (dual-chip module): one POWER5 chip and one L3 chip
- MCM (multi-chip module): four POWER5 chips and four L3 chips
- Communication is handled by the Fabric Bus Controller (FBC), a "distributed switch"

Typical Configurations
- Two MCMs form a "book": a 16-way symmetric multiprocessor (appears as 32-way with SMT)
- Books built from DCMs are also used

Fabric Bus Controller
- "buffers and sequences operations among the L2/L3, the functional units of the memory subsystem, the fabric buses that interconnect POWER5 chips on the MCM, and the fabric buses that interconnect multiple MCMs"
- Separate address and data buses facilitate split transactions
- Each transaction is tagged to allow out-of-order replies

[Figure: 16-way system built with eight dual-chip modules.]

Address Bus
- Addresses are broadcast from MCM to MCM over a ring structure
- Each chip forwards the address down the ring and to the other chips in its MCM
- Forwarding ends when the originating chip receives its own address back

Response Bus
- Carries coherency information gleaned from memory-subsystem "snooping"
- One chip in each MCM combines the other three chips' snoop responses with the previous MCM's snoop response and forwards the result

Response Bus
- When the originating chip receives the responses, it transmits a "combined response" that details the actions to be taken
- Early combined-response mechanism: each MCM determines whether to send a cache line from its L2/L3 based on the previous snoop responses, which reduces cache-to-cache latency

Data Bus
- Services all data-only transfers (such as cache interventions)
- Services the data portion of address operations: cache evictions ("cast-outs"), snoop pushes, and DMA writes
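A rough sketch of the address-ring broadcast and snoop combining described above: the two-MCM ring and four chips per MCM follow the slides, while the response encoding and "strongest response wins" combine rule are illustrative assumptions.

    ORDER = {"miss": 0, "shared": 1, "modified": 2}   # toy snoop encoding

    def broadcast(mcms, origin, addr):
        combined = "miss"
        hop = (origin + 1) % len(mcms)
        while hop != origin:                 # forward the address around the ring
            for chip in mcms[hop]:           # one chip per MCM combines its four
                resp = chip.get(addr, "miss")      # chips' snoop responses with
                if ORDER[resp] > ORDER[combined]:  # the accumulated response
                    combined = resp
            hop = (hop + 1) % len(mcms)
        return combined          # the originator now issues the "combined response"

    mcms = [[{} for _ in range(4)] for _ in range(2)]   # 2 MCMs, 4 chips each
    mcms[1][2][0x80] = "modified"            # chip 2 on MCM 1 holds the line
    print(broadcast(mcms, 0, 0x80))          # -> "modified"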
eFuse Technology
- IBM has created a method of morphing processors to increase efficiency
- Chips physically alter their design prior to or while functioning
- Electromigration, previously a serious liability, is detected and used to determine the best way to improve the chip

The Method
- The idea is similar to traffic control on a busy highway: a lane can be reassigned to the direction with the most traffic
- Fuses and a software algorithm detect circuits that are experiencing various levels of traffic

Method Cont.
- The chip contains millions of microfuses
- The fuses act autonomously to reduce voltage on underused circuits and to share the load of overused circuits
- The chip can also "repair" itself in the case of design flaws or physical circuit failures

Method Cont.
- The idea is not new; it has been attempted before, but earlier fuses affected or damaged the processor

Cons of eFuse
- Being able to reallocate circuitry throughout a processor implies significant redundancy
- Production costs are increased because extra circuitry is required
- Overclocking is countered, because the processor automatically detects it as a flaw

Pros of eFuse
- Significant savings in optimization cost
- The need to counter electromigration is no longer as important
- The same processor architecture can be altered in the factory for a more specialized task
- Self-repair and self-upgrading

Benchmarks: SPEC Results
[Figure: SPEC benchmark results chart]

Benchmarks
[Figure: benchmark results chart]

Notes on Benchmarks
- IBM's 64-processor system beat the previous leader (HP's 64-processor system) by 3x in tpmC
- IBM's 32-processor system beat HP's 64-processor system by about 1.5x
- Both IBM systems maintain a lower price per tpmC

POWER5 Hypervisor
- A hypervisor virtualizes a machine, allowing multiple operating systems to run on a single system
- The hypervisor (HV) is not new, but IBM has made many important improvements to its design for POWER5

Hypervisor Cont.
- Its main purpose is to abstract the hardware and divide it logically between tasks
- Hardware considerations: processor time, memory, I/O

Hypervisor – Processor
- The POWER5 hypervisor virtualizes processors: each LPAR is given a virtual processor
- A virtual processor receives a percentage of processing time on one or more physical processors
- Processor time is determined by preset importance values and by excess time

HV Processor Cont.
- Processor time is measured in processing units; one processing unit (PU) = 1/100 of a CPU
- An LPAR is given an entitlement in processing units and a weight
- The weight, a number between 1 and 256, indicates the LPAR's importance and its priority in receiving excess PUs (e.g., leftover capacity is divided among competing LPARs in proportion to their weights)

HV Processor Cont.
- Dynamic micro-partitioning: resource distribution can be changed further for dynamic LPARs (DLPARs)
- These LPARs are put under the control of the Partition Load Manager (PLM), which can actively change their entitlements
- In place of a weight, DLPARs are given a "share" value that indicates their relative importance
- If a partition is very busy, it will be given additional resources

Hypervisor – I/O
- IBM chose not to give the HV direct control over I/O devices
- Instead, the HV delegates responsibility for I/O devices to specific LPARs
- An LPAR can serve as a virtual device for the other LPARs on the system
- The HV can also give an LPAR sole control over an I/O device

HV – I/O Cont.
[Figure: http://www.research.ibm.com/journal/rd/494/armst3.gif]

Hypervisor – SCSI
- Storage devices are virtualized using SCSI
- The hypervisor acts as a messenger between partitions
- A queue is implemented, and the respective partition is informed of increases in its queue length
- The OS running on the controlling partition is in complete control of SCSI execution

Hypervisor – Network
- The HV operates a VLAN among the partitions it controls
- The VLAN is simple, secure, and efficient
- Each partition is made aware of a virtual network adapter
- Each partition is labeled with a virtual MAC address, and the HV acts as the switch

HV – Network Cont.
- If a MAC address is unknown to the HV, it delivers the packet to a partition with a physical network adapter (see the sketch at the end of this section)
- That partition deals with external LAN issues such as address translation, synchronization, and packet size

Hypervisor – Console
- Virtual consoles are handled in much the same way
- A console can also be assigned by the controlling partition as a monitor on an extended network
- The controlling partition can completely simulate the monitor's presence through the HV for mundane requirements
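To tie the networking slides together, here is a small sketch of the hypervisor-as-switch behavior: frames addressed to a known virtual MAC are delivered within the VLAN, and frames for unknown MACs are handed to the partition that owns the physical adapter. The class and partition names are hypothetical, not the real hypervisor interface.

    class Partition:
        def __init__(self, name):
            self.name = name
        def receive(self, frame):
            print(f"{self.name} received {frame!r}")

    class VirtualSwitch:
        def __init__(self, bridge):
            self.table = {}        # virtual MAC -> partition
            self.bridge = bridge   # partition owning the physical adapter
        def attach(self, mac, partition):
            self.table[mac] = partition
        def deliver(self, dst_mac, frame):
            # Unknown MACs leave the VLAN via the bridging partition.
            self.table.get(dst_mac, self.bridge).receive(frame)

    vios = Partition("io-hosting")           # hypothetical hosting partition
    switch = VirtualSwitch(vios)
    lpar1 = Partition("lpar1")
    switch.attach("02:00:00:00:00:01", lpar1)
    switch.deliver("02:00:00:00:00:01", "intra-VLAN frame")   # -> lpar1
    switch.deliver("aa:bb:cc:dd:ee:ff", "outbound frame")     # -> io-hosting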