Introduction

Motivation
• A multi-core on our desks
• A new microarchitecture to replace NetBurst

Intel Core 2 Duo
• A dual-core CPU
• ISA with SIMD extensions
• Intel Core microarchitecture
• Memory hierarchy system

Instruction Set Architecture
• Base: x86-64
• No VLIW (Itanium)
• SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
  - Wolfdale, SSE4.1, Sep 2006
  - Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
  - Prescott, SSE3, 2004: DSP-oriented math, process management
  - Pentium 4, SSE2, 2001: double precision, 128-bit register
  - Pentium III, SSE, 1999: 8 new registers, support for floating-point operations
  - Pentium MMX, 1996: 8 new registers, packed data type, integer operations

Streaming SIMD Extension (SSE) 4.1
• Beginning with the 45 nm processors
• 47 instructions that improve performance of media data manipulation
• e.g. fast and efficient bit-width conversions

Example: convert single-byte values to word (16-bit) values
• Each source byte is padded with 00000000 to form a 16-bit word
• SSE2 code:
    MOVQ XMM0, M64
    PXOR XMM1, XMM1
    PUNPCKLBW XMM0, XMM1
• SSE4.1 code:
    PMOVZXBW XMM0, M64
• Operation of PMOVZXBW:
    DEST[15:0] <-- ZeroExtend(SRC[7:0]);
    DEST[31:16] <-- ZeroExtend(SRC[15:8]);
    DEST[47:32] <-- ZeroExtend(SRC[23:16]);
    DEST[63:48] <-- ZeroExtend(SRC[31:24]);
    DEST[79:64] <-- ZeroExtend(SRC[39:32]);
    DEST[95:80] <-- ZeroExtend(SRC[47:40]);
    DEST[111:96] <-- ZeroExtend(SRC[55:48]);
    DEST[127:112] <-- ZeroExtend(SRC[63:56]);

Benefits
• Reduced instruction count (3 → 1)
• Better performance (~40% speedup per loop)
• Reduced register pressure (2 → 1)

Microarchitecture

The Cores
• Single die (107 mm²)
• Two identical cores (L1 cache 64 KB × 2)
• Shared 6 MB L2 cache
• No Hyper-Threading, no L3 cache
• Keeps the front-side bus
• Larger L2 cache

Microarchitecture
• 14-stage pipeline
• 4-wide decode
• 4-wide retire
• Macro-fusion
• Enhanced ALUs
• Deeper buffers

Another View

Decode Hardware
• 128-bit fetch bandwidth
• 18-entry IQ
• Complex decode
  - produces 1-4 micro-ops
• Microcode sequencer

Macro-fusion
New micro-op
• Represents an instruction pair as a single micro-op
Enhanced ALUs
• Execute the new compare-and-jump (CMPJCC) micro-op in one clock

Out-of-Order Execution
• 96-entry ROB
• 32-entry reservation station

Execution Units
• 6 dispatch ports (1 load, 2 store, 3 universal ports)
• 3 integer ALUs, 2 floating-point ALUs

Branch Predictor
• Loop detector
  - tracks the number of loop iterations for future reference
• The branch prediction unit (BPU) selects among the following for every branch:
  - bimodal predictor
  - global predictor
  - loop detector

Cache Organization
• Private L1 DCache and ICache: 32 KB/core, 8-way, 64 B line size, write-back (directory-based coherence)
• Shared L2 cache: 8-way, 64 B line size (E8xxx)
  - pros: could be less bus traffic
  - cons: longer access latency than a private L2 cache; potential conflict between threads
• FSB 1333 MHz (E8xxx)

Memory Disambiguation
• Aggressive memory dependence speculation, based on a hash table indexed by the load's EIP address
• Watchdog mechanism

Prediction Implementation
• History table indexed by instruction pointer
• Each entry in the history array has a saturating counter
• Once the counter saturates: disambiguation is possible on this load (takes effect from the next iteration)
  - the load is allowed to go even if it meets unknown store addresses
• When a particular load fails disambiguation: reset its counter
• Each time a particular load is correctly disambiguated: increment its counter

Predictor Lookup
• When a load is sent from the RS, set its disambiguation bit

Load Dispatch
• If the load meets an older unknown store address, set "update"
• If the prediction is "go", dispatch and set "done"
• Else the load is blocked

Prediction Verification
• A store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set
• When the load commits, update the history
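
The counter rules above can be sketched as a small Python model. This is only an illustration of the scheme as the slides describe it, not Intel's implementation: the class name, the table size, the saturation threshold, and the modulo hash are all assumptions chosen to keep the example short.

```python
class DisambiguationPredictor:
    """Toy model of an IP-indexed history table of saturating counters."""

    def __init__(self, table_size=256, saturation=15):
        # table_size and saturation are illustrative assumptions,
        # not documented hardware parameters.
        self.table_size = table_size
        self.saturation = saturation
        self.counters = [0] * table_size

    def _index(self, load_ip):
        # Hypothetical hash: the load's instruction pointer modulo table size.
        return load_ip % self.table_size

    def predict_go(self, load_ip):
        """True if the load may pass older stores with unknown addresses."""
        return self.counters[self._index(load_ip)] >= self.saturation

    def update(self, load_ip, conflicted):
        """Update the history when the load commits."""
        i = self._index(load_ip)
        if conflicted:
            # The load failed disambiguation: reset its counter.
            self.counters[i] = 0
        else:
            # The load was correctly disambiguated: increment, saturating.
            self.counters[i] = min(self.counters[i] + 1, self.saturation)


# A load that never conflicts becomes eligible once its counter saturates.
predictor = DisambiguationPredictor(saturation=3)
load_ip = 0x401000
decisions = []
for _ in range(4):
    decisions.append(predictor.predict_go(load_ip))
    predictor.update(load_ip, conflicted=False)
print(decisions)

# A single failed disambiguation sends the load back to the blocked state.
predictor.update(load_ip, conflicted=True)
print(predictor.predict_go(load_ip))
```

With a threshold of 3, the load is held back for its first three executions and allowed to go on the fourth, which matches the "takes effect from the next iteration" behavior: the prediction read at dispatch always reflects history from earlier commits.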

Execute Disable Bit Support
• Counterparts: AMD Enhanced Virus Protection; ARM eXecute Never
• Helps prevent buffer overflow attacks
• No need for software patches against buffer overflow attacks
• Segregates memory into areas that store code and areas that store data
• The processor disables code execution when malicious worms try to insert code into data buffers (with OS support)

Instruction Pointer Based Prefetcher
• L1 DCache: 2 IP prefetchers/core
• L1 ICache: 1 traditional prefetcher
• L2 cache: 2 IP prefetchers
• Predicts what memory address will be used and delivers the data in time
• Records every load's history using the instruction pointer
  - IP history array
• Parameters for prefetch traffic control are fine-tuned for different platforms
  - prefetch monitor

References
• David Kanter, "Intel's Next Generation Microarchitecture Unveiled", Real World Technologies
• Stephen Smith and Bob Valentine, "Intel Core Microarchitecture Briefing", Intel
• Ofri Wechsler, "Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance", Technology@Intel Magazine
• Alan Zeichick, "Intel Core: A Next-Generation Microarchitecture", DevX
• too many…

Questions?