Computer Architecture 2011 – Advanced Topics

Performance per Watt

Mobile: a smaller form factor decreases the power budget
– Power generates heat, which must be dissipated to keep transistors within the allowed temperature
– This limits the processor's peak power consumption
Change the target
– Old target: get maximum performance
– New target: get maximum performance within a given power envelope

Performance via frequency increase
– Power = CV²f, but increasing f also requires increasing V
– X% more performance costs roughly 3X% more power (assuming performance scales linearly with frequency)
A power-efficient feature
– Does better than a 1:3 performance-to-power ratio
– Otherwise it is better to just increase frequency (and voltage)
– Micro-architectural performance features should be power efficient

Higher Performance vs. Longer Battery Life

Processor average power is <10% of the platform
– The processor reduces power in periods of low processor activity
– The processor enters lower power states in idle periods
Average power includes low-activity periods and idle time
– Typical: 1W – 3W
Max power is limited by heat dissipation
– Typical: 20W – 100W
[Pie chart: platform power breakdown – Display (panel + inverter) 33%, CPU 10%, Intel® MCH 10%, power supply 9%, HDD 8%, GFX 8%, Misc. 8%, CLK 5%, ICH 3%, Intel® LAN 2%, Fan 2%, DVD 2%]
Decision
– Optimize for performance when active
– Optimize for battery life when idle

Leakage Power

The power consumed by a processor consists of
– Active power: used to switch transistors
– Leakage power: leakage of transistors under voltage
Leakage power is a function of
– The number of transistors and their size
– Operating voltage
– Die temperature
Leakage power reduction
– The LLC (Last Level Cache) is built with low-leakage transistors (2/3 of the die transistors)
  Low-leakage transistors are slower, increasing cache access latency
  The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology reduces voltage and frequency on low processor activity

Enhanced SpeedStep® Technology

The "basic" SpeedStep® technology had
– 2 operating points
– A non-transparent switch
The "enhanced" version provides
– Multiple voltage/frequency operating points
– A transparent switch
– Frequent switches
For example, the 1.6GHz Pentium M processor operates from 600MHz @ 0.956V up to 1.6GHz @ 1.484V
[Graph: voltage, frequency, and typical power across the operating range; efficiency ratio up to 6.1X]
Benefits
– Higher power efficiency: at 2.7X lower frequency, a 2X performance loss buys a >2X energy gain
– Outstanding battery life

2nd Generation Intel® Core™: Sandy Bridge

2nd Gen Intel® Core™ Microarchitecture: Overview

Integrates CPU, Graphics, Memory Controller, and PCI Express* on a single chip
– Next-generation Intel® Turbo Boost Technology
– High-bandwidth Last Level Cache
– Next-generation processor graphics and media
– High-BW / low-latency modular core/GFX interconnect
– Substantial performance improvement
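The frequency/voltage rule from the Performance-per-Watt slides earlier can be sketched numerically. This is a hedged back-of-the-envelope model, not a real power estimator: it assumes voltage must scale in proportion to frequency, and the capacitance/voltage/frequency values are made up for illustration.

```python
# Sketch of the dynamic power model from the slides: P = C * V^2 * f,
# where raising f also requires raising V roughly in proportion
# (the "X% performance costs ~3X% power" rule).

def dynamic_power(c, v, f):
    """Dynamic (switching) power: P = C * V^2 * f."""
    return c * v * v * f

# Baseline, with illustrative (made-up) unit values.
p0 = dynamic_power(1.0, 1.0, 1.0)

# Raise frequency 10%; assume voltage must scale with frequency.
p1 = dynamic_power(1.0, 1.1, 1.1)

# Power grows ~cubically: +10% frequency costs ~33% power.
print(round(p1 / p0, 3))   # ~1.331

# At a 2.7X lower operating point the same cubic model predicts a large
# power drop; even if runtime doubles (2X performance loss), energy
# (power * time) still falls, consistent with the ">2X energy gain"
# claim on the SpeedStep slide.  (Real silicon gains less, since V
# cannot scale all the way down with f.)
p_low = dynamic_power(1.0, 1 / 2.7, 1 / 2.7)
energy_gain = (p0 * 1.0) / (p_low * 2.0)
assert energy_gain > 2
```

The cubic relationship is exactly why the slides demand that a micro-architectural feature beat a 1:3 performance-to-power ratio before it earns its place.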
[Die diagram: System Agent (Display, IMC with 2ch DDR3, PCIe x16, DMI, PECI interface to embedded controller), four Core + LLC slices, Graphics, connected to the PCH; feature call-outs: Intel® Advanced Vector Extensions (Intel® AVX), integrated memory controller, embedded DisplayPort, discrete graphics support 1x16 or 2x8, Intel® Hyper-Threading Technology with 4 cores / 8 threads or 2 cores / 4 threads]

Core Block Diagram

[Block diagram, front to back:]
– Front End (IA instructions → uops), in order: 32k L1 instruction cache, pre-decode, instruction queue, branch prediction, 4 decoders, 1.5k uop cache
– Allocate/Rename/Retire, in order: reorder buffer, load buffers, store buffers, zeroing idioms
– Out-of-order uop scheduling across six execution ports: Port 0 (ALU, VI MUL, DIV, FP MUL), Port 1 (ALU, VI ADD, FP ADD), Port 5 (ALU, Branch, FP Shuffle), Ports 2 and 3 (load / store address), Port 4 (store data)
– Memory: 32k L1 data cache, L2 cache (MLC), fill buffers, 48 bytes/cycle

Front End

Instruction fetch and decode
• 32KB 8-way associative I-cache
• 4 decoders, up to 4 instructions / cycle
• Micro-fusion – bundle multiple instruction events into a single uop
• Macro-fusion – fuse instruction pairs into a complex uop
• The decode pipeline supports 16 bytes per cycle

Decoded Uop Cache

• An instruction cache holding uops instead of instruction bytes (~1.5 Kuops)
– ~80% hit rate for most applications
• Higher instruction bandwidth and lower latency
– The decoded uop cache can deliver 32 bytes / cycle
– More cycles sustaining 4 instructions / cycle
– Able to 'stitch' across taken branches in the control flow
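The bandwidth benefit of the decoded uop cache can be estimated from the figures on these slides: 16 bytes/cycle through the legacy decode path, 32 bytes/cycle from the uop cache, and roughly an 80% hit rate. The weighted average below is deliberately simplistic (it ignores queueing and branch effects), but it shows why the hit rate matters.

```python
# Back-of-the-envelope effective front-end fetch bandwidth with the
# decoded uop cache, using the numbers quoted on the slides.

HIT_RATE = 0.80     # ~80% of fetches hit the uop cache
UOP_CACHE_BW = 32   # bytes/cycle when hitting the uop cache
DECODE_BW = 16      # bytes/cycle through the legacy decoders

effective_bw = HIT_RATE * UOP_CACHE_BW + (1 - HIT_RATE) * DECODE_BW
print(effective_bw)   # 28.8 bytes/cycle on average
```

Nearly doubling the average fetch bandwidth is what lets the machine sustain 4 instructions/cycle more often, and (as the next slides note) the hits also let the legacy decode pipeline power down.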
Branch Prediction Unit

A new branch predictor
• Twice as many targets
• Much more effective storage for history
• Much longer history for data-dependent behaviors

Front End (power)

• The decoded uop cache lets the normal front end sleep
– Decode one time instead of many times
• Branch mispredictions are reduced substantially
– The correct path is also the most efficient path
Save power while increasing performance

"Out of Order" Part of the Machine

• Receives uops from the front end
• Sends them to execution units when they are ready
• Retires them in program order
• Increases performance by finding more instruction-level parallelism
– Increasing the depth and width of the machine implies larger buffers
More data storage, more data movement, more power

Sandy Bridge Out-of-Order (OOO) Cluster

• Method: a physical register file (PRF) instead of a centralized retirement register file
– A single copy of every datum
– No movement after calculation
• Allows a significant increase in buffer sizes
– Dataflow window ~33% larger
The PRF gives better than linear performance/power, and is a key enabler for Intel® AVX

                          Nehalem   Sandy Bridge
  Load Buffers               48          64
  Store Buffers              32          36
  RS (Scheduler) Entries     36          54
  PRF Integer               N/A         160
  PRF Floating Point        N/A         144
  ROB Entries               128         168

Intel® Advanced Vector Extensions

• Vectors are a natural data type for many applications
• Extends the SSE FP instruction set to 256-bit operand size
– XMM0 is 128 bits; YMM0 is 256 bits (AVX)
– Intel AVX extends all 16 XMM registers to 256 bits
• New, non-destructive source syntax
– VADDPS ymm1, ymm2, ymm3
• New operations to enhance vectorization
– Broadcasts
– Masked load & store
Wider vectors and non-destructive sources specify more work with fewer instructions; extending the existing state is area and power efficient

Execution Cluster

The scheduler sees a matrix: 3 "ports" to 3 "stacks" of execution units
• General-purpose integer (GPR)
• SIMD (vector) integer
• SIMD floating point
[Port 0: ALU, VI MUL, VI Shuffle, DIV, FP MUL, Blend; Port 1: ALU, VI ADD, VI Shuffle, FP ADD; Port 5: ALU, JMP, FP Shuffle, FP Boolean, Blend]
• Challenge: double the output of one of these stacks in a manner that is invisible to the others

Execution Cluster (cont.)

Solution:
• Repurpose existing data paths to dual use
• SIMD integer and legacy SIMD FP use the legacy stack style
• Intel® AVX utilizes both 128-bit execution stacks
• Double the FLOPs
– 256-bit multiply + 256-bit add + 256-bit load per clock

Memory Cluster

• The memory unit can service two memory requests per cycle
– A 16-byte load and a 16-byte store per cycle (32 bytes/cycle into the 32KB 8-way L1 data cache)
• Goal: maintain the historic bytes/flop ratio of SSE for Intel® AVX
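The bytes/flop goal above can be made concrete with some rough bookkeeping. The per-cycle flop counts below are my own illustrative assumptions (a 4-wide single-precision multiply plus add for SSE, doubled for 256-bit AVX), not figures from the slides; the point is only the ratio.

```python
# Rough bytes-per-flop bookkeeping behind the memory-cluster change.
# 256-bit Intel AVX doubles the flops per instruction, so keeping the
# historic load-bytes/flop ratio of SSE needs twice the load bandwidth.

def load_bytes_per_flop(load_bytes_per_cycle, flops_per_cycle):
    return load_bytes_per_cycle / flops_per_cycle

# SSE-era: one 16-byte load port; assume 8 single-precision
# flops/cycle (a 4-wide multiply plus a 4-wide add).
sse = load_bytes_per_flop(16, 8)

# AVX with only one load port: flops double, bandwidth does not.
avx_one_port = load_bytes_per_flop(16, 16)

# AVX with a second, symmetric load port: the ratio is restored.
avx_two_ports = load_bytes_per_flop(32, 16)

assert avx_two_ports == sse        # 2.0 bytes/flop in both cases
assert avx_one_port == sse / 2     # starved at 1.0 bytes/flop
```

This halved ratio is exactly the problem the next slide's second load port solves.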
Memory Cluster (cont.)

[Diagram: two load pipes and a store pipe into the 32KB 8-way L1 data cache, 48 bytes/cycle; 256KB L2 cache (MLC), fill buffers, store buffers, memory control]
• Solution: dual-use the existing connections
– Make the load/store pipes symmetric
• The memory unit services three data accesses per cycle
– 2 read requests of up to 16 bytes AND 1 store of up to 16 bytes
– An internal sequencer deals with queued requests
• The second load port is one of the highest-value performance features
– Required to keep Intel® AVX fed – linear power/performance

Putting It Together: Sandy Bridge Microarchitecture

[Full-core diagram: 32k L1 instruction cache, pre-decode, instruction queue, branch prediction, 4 decoders, 1.5k uop cache; in-order allocate/rename/retire with reorder, load, and store buffers plus zeroing idioms; out-of-order scheduler over Port 0 (ALU, VI MUL, VI Shuffle, DIV, AVX FP MUL, AVX FP Blend), Port 1 (ALU, VI ADD, VI Shuffle, AVX FP ADD), Port 5 (ALU, JMP, AVX/FP Shuffle, AVX/FP Boolean, AVX FP Blend), Ports 2/3 (load / store address), Port 4 (store data); memory control, 48 bytes/cycle, L2 cache (MLC), fill buffers, 32k L1 data cache]

Other Architectural Extensions

• Cryptography instruction throughput enhancements
– Increased throughput for the AES instructions
• Arithmetic throughput enhancements
– ADC (add with carry) throughput doubled
– Multiply (64-bit multiplicands with a 128-bit product): ~25% speedup on existing RSA binaries
• State save/restore enhancements
– New state added for Intel® AVX
– HW monitors the features used by applications, and only saves/restores state that is used

2nd Gen Intel® Core™ Microarchitecture

System agent, ring architecture, and other innovations in the 2nd Generation Intel® Core™ microarchitecture, formerly codenamed Sandy Bridge
[Die diagram: System Agent (PCI Express x16, DMI, Display, IMC with 2ch DDR3), four Core + LLC slices, Graphics, PECI interface to embedded controller, Notebook DP port, PCH]
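The widened multiplier called out under "Other Architectural Extensions" computes a full 64×64 → 128-bit product, the primitive that big-number code such as RSA is built from. Python integers are arbitrary precision, so this sketch only illustrates the operation and its split into 64-bit halves; it is not how the hardware is implemented.

```python
# The 64x64 -> 128-bit multiply primitive: multiply two 64-bit
# operands and split the 128-bit product into high and low halves.

MASK64 = (1 << 64) - 1

def mul_64x64(a, b):
    """Return the (high, low) 64-bit halves of the 128-bit product."""
    assert 0 <= a <= MASK64 and 0 <= b <= MASK64
    product = a * b
    return (product >> 64) & MASK64, product & MASK64

# Largest possible operands still fit the 128-bit result exactly.
hi, lo = mul_64x64(MASK64, MASK64)
assert (hi << 64) | lo == MASK64 * MASK64
```

RSA-style modular exponentiation is a long chain of such multiplies (with ADC propagating carries between limbs), which is why speeding up these two instructions translates into the ~25% gain on existing binaries quoted above.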
Integration: Optimization Opportunities

• Dynamically redistribute power between cores & graphics
• Tight power-management control of all components, providing better granularity and deeper idle/sleep states
• Three separate power/frequency domains: System Agent (fixed), Cores + Ring, Graphics (variable)
• High-BW Last Level Cache, shared among cores and graphics
– Significant performance boost; saves memory bandwidth and power
• Integrated memory controller and PCI Express ports
– Tightly integrated with the Core/Graphics/LLC domain
– Provides low latency & low power – removes intermediate buses
• Bandwidth is balanced across the whole machine, from Core/Graphics all the way to the memory controller
• Modular uArch for optimal cost/power/performance
– Derivative products done with minimal effort/time

Scalable Ring On-die Interconnect

• Ring-based interconnect between the cores, graphics, Last Level Cache (LLC), and System Agent domain
• Composed of 4 rings
– A 32-byte data ring, a request ring, an acknowledge ring, and a snoop ring
– Fully pipelined at core frequency/voltage: bandwidth, latency, and power scale with the cores
• The massive ring wire routing runs over the LLC with no area impact
• An access on the ring always picks the shortest path – minimizing latency
• Distributed arbitration; the ring protocol handles coherency, ordering, and the core interface
• Scalable to servers with a large number of processors
High bandwidth, low latency, modular

Cache Box

• Interface block
– Between Core/Graphics/Media and the ring
– Between the cache controller and the ring
– Implements the ring logic, arbitration, and the cache controller
– Communicates with the System Agent for LLC misses, external snoops, and non-cacheable accesses
• A full cache pipeline in each cache box
– Physical addresses are hashed at the source to prevent hot spots and increase bandwidth
– Each box maintains coherency and ordering for the addresses mapped to it
– The LLC is fully inclusive, with "core valid bits" that eliminate unnecessary snoops to the cores
• Runs at core voltage/frequency, scales with the cores
Distributed coherency & ordering; scalable bandwidth, latency & power

LLC Sharing

• The LLC is shared among all cores, graphics, and media
– The graphics driver controls which streams are cached/coherent
– Any agent can access all data in the LLC, independent of who allocated the line, after memory-range checks
• A controlled LLC way-allocation mechanism prevents thrashing between cores and graphics
• Multiple coherency domains
– IA domain (fully coherent via cross-snoops)
– Graphics domain (graphics virtual caches, flushed to the IA domain by the graphics engine)
– Non-coherent domain (display data, flushed to memory by the graphics engine)
Much higher graphics performance, DRAM power savings, and more DRAM bandwidth available for the cores

System Agent

• Contains PCI Express, DMI, the memory controller, the display engine…
• Contains the Power Control Unit
– A programmable microcontroller that handles all power-management and reset functions in the chip
• Smart integration with the ring
– Provides cores/graphics/media with high-BW, low-latency access to DRAM/IO for best performance
– Handles IO-to-cache coherency
• Separate voltage and frequency from the ring/cores; display integration for better battery life
• Extensive power and thermal management for PCI Express and DDR
Smart I/O integration
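The source hashing described on the Cache Box slide can be sketched as follows. The actual Sandy Bridge hash function is not disclosed, so the XOR-folding below is purely illustrative; the point is only that some hash of the upper address bits picks one of the four LLC slices, scattering consecutive lines so no slice becomes a hot spot and all four slice pipelines contribute bandwidth.

```python
# Hedged sketch: map a physical address to one of four LLC slices by
# hashing the cache-line address.  The real hash is undocumented;
# XOR-folding pairs of line-address bits is an invented stand-in.

NUM_SLICES = 4
LINE_BITS = 6          # 64-byte cache lines

def llc_slice(phys_addr):
    line = phys_addr >> LINE_BITS        # discard offset within line
    h = 0
    while line:                          # XOR-fold the line address
        h ^= line & (NUM_SLICES - 1)
        line >>= 2
    return h

# Consecutive lines scatter across slices instead of piling up on one,
# while all bytes of one line always map to the same slice.
slices = [llc_slice(a) for a in range(0, 64 * 16, 64)]
assert set(slices) == {0, 1, 2, 3}
assert llc_slice(0) == llc_slice(63)
```

Because every address deterministically owns one slice, each cache box can maintain coherency and ordering for its own addresses without consulting the others — the "distributed coherency" the slide advertises.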
Hyper-Threading Technology

Thread-Level Parallelism

Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
Software trends
– Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
Thread-level parallelism (TLP) – threads can be from
– The same application
– Different applications running simultaneously
– Operating system services
Increasing single-thread performance becomes harder
– And is less and less power efficient
Chip multi-processing (CMP)
– Two (or more) processors are put on a single die

Multi-Threading

Multi-threading: a single processor executes multiple threads
Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
Switch-on-event multithreading
– Switch threads on long-latency events such as cache misses
– Works well for server applications that have many cache misses
A deficiency of both time-slice MT and switch-on-event MT
– They do not cover branch mispredictions and long dependencies
Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously, without switching
– Makes the most effective use of processor resources
Maximizes performance vs. transistor count and power
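The comparison between switch-on-event multithreading and SMT can be caricatured with a toy utilization model. All the fractions below are invented for illustration; the only claim taken from the slides is the structural one: switch-on-event MT hides only the long stalls it can see (cache misses), while SMT lets the second thread use any issue slot the first leaves idle, including slots lost to mispredictions and dependencies.

```python
# Toy issue-slot utilization model (all fractions invented).

BUSY = 0.60         # fraction of cycles a single thread issues uops
LONG_STALL = 0.25   # long-latency (miss) stalls: visible to SoE-MT
SHORT_STALL = 0.15  # mispredicts/dependencies: invisible to SoE-MT

single_thread = BUSY

# Switch-on-event MT: the other thread runs during long stalls, and is
# itself only BUSY-productive while it does.
switch_on_event = BUSY + LONG_STALL * BUSY

# SMT: the second thread can fill both kinds of idle slot.
smt = min(1.0, BUSY + (LONG_STALL + SHORT_STALL) * BUSY)

assert single_thread < switch_on_event < smt
```

The model is crude, but it captures why the slides call SMT "the most effective use of processor resources": it is the only scheme whose upper bound is limited by total idle slots rather than by one class of stall.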
Hyper-Threading (HT) Technology

HT is SMT
– It makes a single processor appear as 2 logical processors = threads
Each thread keeps its own architectural state
– General-purpose registers
– Control and machine state registers
Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
The OS views logical processors (threads) as physical processors
– It schedules threads to logical processors as in a multiprocessor system
From a micro-architecture perspective
– The threads share a single set of physical resources: caches, execution units, branch predictors, control logic, and buses

Two Important Goals

When one thread is stalled, the other thread can continue to make progress
– Independent progress is ensured by either
  Partitioning buffering queues and limiting the number of entries each thread can use, or
  Duplicating buffering queues
A single active thread running on a processor with HT runs at the same speed as without HT
– Partitioned resources are recombined when only one thread is active

Front End

Each thread manages its own next-instruction pointer
Threads arbitrate uop cache access every cycle (ping-pong)
– If both want to access the uop cache, access is granted in alternating cycles
– If one thread is stalled, the other thread gets the full uop cache bandwidth
Uop cache entries are tagged with a thread ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other
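The per-cycle "ping-pong" arbitration just described can be sketched as a tiny model: alternate between the two threads when both want access, and give a stalled thread's slot to the other. This is a minimal illustration, not the actual hardware arbiter.

```python
# Ping-pong arbitration sketch for the shared uop cache port.

def arbitrate(wants_access, cycles):
    """wants_access[t] is a per-thread predicate on the cycle number;
    returns which thread (0/1, or None) is granted each cycle."""
    grants, last = [], 1
    for c in range(cycles):
        t0, t1 = wants_access[0](c), wants_access[1](c)
        if t0 and t1:
            last = 1 - last                 # both ready: alternate
            grants.append(last)
        elif t0 or t1:
            grants.append(0 if t0 else 1)   # only one ready: full BW
        else:
            grants.append(None)             # neither ready this cycle
    return grants

always = lambda c: True
never = lambda c: False

# Both threads always ready: strict alternation.
assert arbitrate([always, always], 4) == [0, 1, 0, 1]

# Thread 1 stalled: thread 0 gets every cycle.
assert arbitrate([always, never], 3) == [0, 0, 0]
```

The second assertion is the key HT property from the "Two Important Goals" slide: a lone active thread sees the full bandwidth, as if HT were absent.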
Front End (cont.)

Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked for each thread
– The large global history array is shared; entries are tagged with a logical processor ID
Each thread has its own ITLB
Both threads share the same decoder logic
– If only one thread needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoders is duplicated
The uop queue is hard partitioned
– Allows both logical processors to make independent forward progress regardless of front-end stalls (e.g., a uop cache miss) or execution stalls

Out-of-order Execution

The ROB and MOB are hard partitioned
– Enforces fairness and prevents deadlocks
The allocator ping-pongs between the threads
– A thread is selected for allocation if
  Its uop queue is not empty
  Its buffers (ROB, RS) are not full
  It is the thread's turn, or the other thread cannot be selected

Out-of-order Execution (cont.)

Registers are renamed to a shared physical register pool
– It stores results until retirement
After allocation and renaming, uops are placed in one of 2 queues
– A memory instruction queue and a general instruction queue
The two queues are hard partitioned
– Uops are read from the queues and sent to the scheduler using ping-pong
The schedulers are oblivious to threads
– They schedule uops based on dependencies and execution resource availability, regardless of thread
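The allocator's thread-selection rules above can be written out as a small decision function. The field names are invented for the illustration; the rules themselves (non-empty uop queue, buffers not full, ties broken by turn) are the ones on the slide.

```python
# Sketch of the allocation "ping-pong" selection rules.

def eligible(t):
    return t["queue_len"] > 0 and not t["buffers_full"]

def pick_thread(t0, t1, turn):
    """Return 0, 1, or None: which thread allocates this cycle."""
    e0, e1 = eligible(t0), eligible(t1)
    if e0 and e1:
        return turn                   # both eligible: honour the turn
    if e0 or e1:
        return 0 if e0 else 1         # only one eligible: it goes
    return None                       # nobody can allocate

ready = {"queue_len": 4, "buffers_full": False}
empty = {"queue_len": 0, "buffers_full": False}
full = {"queue_len": 4, "buffers_full": True}

assert pick_thread(ready, ready, turn=1) == 1
assert pick_thread(ready, empty, turn=1) == 0   # turn is overridden
assert pick_thread(full, empty, turn=0) is None
```

The "turn or the other thread cannot be selected" clause is what makes the scheme both fair and work-conserving: neither thread can starve the other, yet no allocation slot is wasted while any thread has work.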
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness, the number of active entries a thread can have in each scheduler's queue is limited
Forwarding logic compares physical register numbers
– Results are forwarded to other uops without thread knowledge

Out-of-order Execution (cont.)

Memory is largely oblivious to threads
– The L1 data cache, L2 cache, and L3 cache are thread oblivious; all use physical addresses
– The DTLB is shared; each DTLB entry includes a thread ID as part of the tag
Retirement ping-pongs between threads
– If one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread

Single-task And Multi-task Modes

MT mode (multi-task mode)
– Two active threads, with some resources partitioned as described earlier
ST mode (single-task mode)
– There are two flavors of ST mode
  Single-task thread 0 (ST0) – only thread 0 is active
  Single-task thread 1 (ST1) – only thread 1 is active
– Resources that were partitioned in MT mode are recombined to give the single active logical processor use of all of the resources
Moving the processor between modes:
– MT → ST0 when thread 1 executes HALT, and MT → ST1 when thread 0 executes HALT; an interrupt to the halted thread returns to MT
– ST0/ST1 → low power when the remaining thread also executes HALT

Operating System And Applications

An HT processor appears to the OS and application SW as 2 processors
– The OS manages logical processors as it does physical processors
The OS should implement two optimizations:
Use HALT if only one logical processor is active
– Allows the processor to transition to either ST0 or ST1 mode
– Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do
– This so-called "idle loop" can consume significant execution resources that could otherwise be used by the other active logical processor
On a multi-processor system,
– Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
– This allows SW threads to use different physical resources when possible
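The ST/MT mode transitions described on the "Single-task And Multi-task Modes" slide can be summarized as a small state machine. The event names (`halt0`, `int1`, …) are invented labels for "thread N executes HALT" and "interrupt arrives for thread N"; the real hardware involves more states and events than this sketch shows.

```python
# Hedged sketch of the HT mode transitions: HALT on one thread drops
# MT mode to the surviving thread's ST mode, HALT on the last active
# thread enters low power, and an interrupt to a halted thread wakes
# it back up.

TRANSITIONS = {
    ("MT", "halt0"): "ST1",
    ("MT", "halt1"): "ST0",
    ("ST0", "halt0"): "LOW_POWER",
    ("ST1", "halt1"): "LOW_POWER",
    ("ST0", "int1"): "MT",
    ("ST1", "int0"): "MT",
    ("LOW_POWER", "int0"): "ST0",
    ("LOW_POWER", "int1"): "ST1",
}

def run(state, events):
    """Apply a sequence of events; unknown events leave state alone."""
    for e in events:
        state = TRANSITIONS.get((state, e), state)
    return state

# Both threads halt in turn: the processor reaches low power.
assert run("MT", ["halt1", "halt0"]) == "LOW_POWER"
# Interrupts wake thread 1, then thread 0 rejoins: back to MT mode.
assert run("LOW_POWER", ["int1", "int0"]) == "MT"
```

This is also why the OS guidance above matters: only by executing HALT on an idle logical processor (rather than spinning in an idle loop) does the machine ever leave MT mode and recombine the partitioned resources for the remaining thread.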