Intel Core 2 Duo - University of Virginia

Introduction
• Motivation
  - A multi-core on our desks
  - A new microarchitecture to replace NetBurst
• Intel Core 2 Duo
  - A dual-core CPU
  - ISA with SIMD extensions
  - Intel Core microarchitecture
  - Memory hierarchy system

Instruction Set Architecture
• Base: x86-64
• No VLIW (Itanium)
• SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
  - Pentium MMX, MMX, 1996: 8 new registers, packed data type, integer operations
  - Pentium III, SSE, 1999: 8 new registers, support for floating-point operations
  - Pentium 4, SSE2, 2001: double precision, 128-bit register management
  - Prescott, SSE3, 2004: DSP-oriented math
  - Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
  - Wolfdale, SSE4.1, Sep 2006

Streaming SIMD Extension (SSE) 4.1
• Beginning with the 45 nm processors
• 47 instructions that improve the performance of media data manipulation
• e.g. fast and efficient bit-width conversions
  - Convert single-byte values to word (16-bit) values.
• SSE2 code
    MOVQ XMM0, M64
    PXOR XMM1, XMM1
    PUNPCKLBW XMM0, XMM1
• SSE4.1 code
    PMOVZXBW XMM0, M64
    DEST[15:0] <-- ZeroExtend(SRC[7:0]);
    DEST[31:16] <-- ZeroExtend(SRC[15:8]);
    DEST[47:32] <-- ZeroExtend(SRC[23:16]);
    DEST[63:48] <-- ZeroExtend(SRC[31:24]);
    DEST[79:64] <-- ZeroExtend(SRC[39:32]);
    DEST[95:80] <-- ZeroExtend(SRC[47:40]);
    DEST[111:96] <-- ZeroExtend(SRC[55:48]);
    DEST[127:112] <-- ZeroExtend(SRC[63:56]);
• Benefits
  - Reduced instruction count (3 → 1)
  - Better performance (~40% speedup per loop)
  - Reduced register pressure (2 → 1)

Microarchitecture
• The cores
  - Single die (107 mm²)
  - Two identical cores (L1 cache 64K x 2)
  - Shared L2 cache, 6M
  - No Hyper-Threading, no L3 cache
  - Keeps the front-side bus
  - Larger L2 cache

Microarchitecture
• 14-stage pipeline
• 4-wide decode
• 4-wide retire
• Macro-fusion
• Enhanced ALUs
• Deeper buffers

Another View (figure)

Decode Hardware
• 128-bit fetch bandwidth
• 18-entry IQ
• Complex decode
  - produces 1-4 micro-ops
• Microcode sequencer

Macro-fusion
• New micro-op
  - Represents an instruction pair as a single micro-op
• Enhanced ALUs
  - Execute the new compare-and-jump (CMPJCC) micro-op in one clock

Out of Order Execution
• 96-entry ROB
• 32-entry reservation station

Execution Units
• 6 dispatch ports (1 load, 2 store, 3 universal ports)
• 3 integer ALUs, 2 floating-point ALUs

Branch Predictor
• Loop detector
  - Tracks the number of loop iterations for future reference
• The branch prediction unit (BPU) selects among the following for every branch:
  - bimodal predictor
  - global predictor
  - loop detector

Cache Organization
• Private L1 DCache and ICache, 32K/core, 8-way, 64B line size, write-back (directory-based coherence)
• Shared L2 cache, 8-way, 64B line size (E8xxx)
  - pros: could mean less bus traffic
  - cons: longer access latency than a private L2 cache; potential conflict between threads
• FSB 1333 MHz (E8xxx)

Memory Disambiguation
• Aggressive memory dependence speculation based on a hash table indexed by the load's EIP address
• Watchdog mechanism

Prediction Implementation
• History table indexed by instruction pointer
• Each entry in the history array has a saturating counter
• Once the counter saturates, disambiguation is possible on this load (takes effect from the next iteration)
  - the load is allowed to go even if it meets unknown store addresses
• When a particular load fails disambiguation: reset its counter
• Each time a particular load is correctly disambiguated: increment its counter

Predictor Lookup
• When the load is sent from the RS, set the disambiguation bit

Load Dispatch
• If the load meets an older unknown store address, set "update"
• If the prediction is "go", dispatch and set "done"
• Else the load is blocked

Prediction Verification
• A store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set.
• When the load commits, update the history.
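The counter rules on the Prediction Implementation slide can be sketched as a small software model. This is only an illustration, not Intel's hardware design: the class name, the table size, the counter width, and the modulo hash are all assumptions made for this example.

```python
class DisambiguationPredictor:
    """Toy model of an IP-indexed memory disambiguation predictor."""

    def __init__(self, table_size=256, counter_max=15):
        # table_size and counter_max are illustrative assumptions,
        # not figures taken from the slides.
        self.table_size = table_size
        self.counter_max = counter_max
        self.counters = [0] * table_size

    def _index(self, load_ip):
        # Assumed hash: low bits of the load's instruction pointer.
        return load_ip % self.table_size

    def may_bypass(self, load_ip):
        """True once the counter has saturated: the load may go ahead
        of older stores whose addresses are still unknown."""
        return self.counters[self._index(load_ip)] == self.counter_max

    def update(self, load_ip, conflicted):
        """Called when the load commits."""
        i = self._index(load_ip)
        if conflicted:
            # Failed disambiguation: reset the counter.
            self.counters[i] = 0
        elif self.counters[i] < self.counter_max:
            # Correct disambiguation: increment, saturating at the maximum.
            self.counters[i] += 1


pred = DisambiguationPredictor(table_size=16, counter_max=3)
ip = 0x401000
for _ in range(3):
    pred.update(ip, conflicted=False)   # three clean commits saturate the counter
bypass_after_training = pred.may_bypass(ip)
pred.update(ip, conflicted=True)        # one conflict resets it
bypass_after_conflict = pred.may_bypass(ip)
```

Resetting to zero on a single conflict while requiring a full run of clean commits before speculating again keeps the model conservative, which matches the reset and increment rules listed above.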

Execute Disable Bit Support
• Similar features: AMD Enhanced Virus Protection; ARM eXecute Never
• Helps prevent buffer overflow attacks
• No need for software patches against buffer overflow attacks
• Segregates memory by whether it stores code or data
• With OS support, the processor disables code execution when malicious worms try to insert code into data buffers

Instruction Pointer Based Prefetcher
• L1 DCache: 2 IP prefetchers/core
• L1 ICache: 1 traditional prefetcher
• L2 cache: 2 IP prefetchers
• Predicts what memory address will be used and delivers the data in time
• Records every load's history using the instruction pointer
• IP history array
• Parameters for prefetch traffic control, fine-tuned for different platforms
• Prefetch monitor

References
• David Kanter, "Intel's Next Generation Microarchitecture Unveiled", Real World Technologies
• Stephen Smith and Bob Valentine, "Intel Core Microarchitecture Briefing", Intel
• Ofri Wechsler, "Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance", Technology@Intel Magazine
• Alan Zeichick, "Intel Core: A Next-Generation Microarchitecture", DevX
• too many…

Questions?
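As a rough illustration of the Instruction Pointer Based Prefetcher slide, the sketch below keeps a per-load history keyed by instruction pointer and predicts the next address from a repeated stride. The stride scheme, the class name, and the table size are assumptions made for this example; the slides state only that each load's history is recorded by IP in a history array.

```python
class IPStridePrefetcher:
    """Toy IP-indexed stride prefetcher (assumed scheme, for illustration)."""

    def __init__(self, table_size=256):
        # table_size is an illustrative assumption.
        self.table_size = table_size
        self.history = {}  # table index -> (last address, last stride)

    def access(self, load_ip, addr):
        """Record one load; return an address to prefetch, or None."""
        idx = load_ip % self.table_size
        prediction = None
        if idx in self.history:
            last_addr, last_stride = self.history[idx]
            stride = addr - last_addr
            # Predict only after the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == last_stride:
                prediction = addr + stride
            self.history[idx] = (addr, stride)
        else:
            self.history[idx] = (addr, 0)
        return prediction


pf = IPStridePrefetcher()
ip = 0x400
results = [pf.access(ip, a) for a in (1000, 1064, 1128, 5000)]
```

Here the third access confirms a 64-byte stride, so only that access yields a prefetch candidate; the jump to an unrelated address breaks the pattern and no prefetch is issued.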
