Intel Core 2 Duo - University of Virginia

Introduction
 Motivation
 A multi-core CPU on our desks
 A new microarchitecture to replace NetBurst
 Intel Core 2 Duo
 A dual-core CPU
 ISA with SIMD extensions
 Intel Core microarchitecture
 Memory hierarchy system
Instruction Set Architecture
 Base: x86-64
 No VLIW (unlike Itanium)
 SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
 SSE timeline:
 Pentium MMX (MMX, 1996): 8 new registers, packed data types, integer operations
 Pentium III (SSE, 1999): 8 new registers, floating-point operations
 Pentium 4 (SSE2, 2001): double precision, 128-bit register support
 Prescott (SSE3, 2004): DSP-oriented math
 Core 2 (SSSE3, July 2006): e.g. permuting bytes in a word
 Wolfdale (SSE4.1, 2008)
Streaming SIMD Extensions (SSE) 4.1
 Beginning with the 45 nm processors
 47 instructions that improve the performance of media data manipulation
 e.g. fast and efficient bit-width conversions
 Convert byte (8-bit) values to word (16-bit) values
SSE2 Code
 MOVDQU XMM0, M64      ; load the packed source bytes
 PXOR XMM1, XMM1       ; zero XMM1
 PUNPCKLBW XMM0, XMM1  ; interleave low bytes with zeros -> 16-bit words
SSE4.1 Code
 PMOVZXBW XMM0, M64    ; zero-extends 8 packed bytes to 8 words in one instruction
 DEST[15:0] <-- ZeroExtend(SRC[7:0]);
 DEST[31:16] <-- ZeroExtend(SRC[15:8]);
 DEST[47:32] <-- ZeroExtend(SRC[23:16]);
 DEST[63:48] <-- ZeroExtend(SRC[31:24]);
 DEST[79:64] <-- ZeroExtend(SRC[39:32]);
 DEST[95:80] <-- ZeroExtend(SRC[47:40]);
 DEST[111:96] <-- ZeroExtend(SRC[55:48]);
 DEST[127:112] <-- ZeroExtend(SRC[63:56]);
 Benefits
 Reduced instruction count (3 instructions -> 1)
 Better performance (~40% speedup per loop)
 Reduced register pressure (2 registers -> 1); see the intrinsics sketch below
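For reference, the same byte-to-word widening can be written with C compiler intrinsics. This is only a sketch to mirror the assembly above; the helper names widen_sse2/widen_sse41 are invented here, and SSE4.1 support (compile with -msse4.1) is assumed.

#include <stdio.h>
#include <emmintrin.h>   /* SSE2: _mm_unpacklo_epi8, _mm_setzero_si128 */
#include <smmintrin.h>   /* SSE4.1: _mm_cvtepu8_epi16 (PMOVZXBW) */

/* SSE2 path: 3 instructions, 2 registers (MOVQ + PXOR + PUNPCKLBW) */
static __m128i widen_sse2(const unsigned char *p)
{
    __m128i bytes = _mm_loadl_epi64((const __m128i *)p);  /* load 8 bytes  */
    __m128i zero  = _mm_setzero_si128();                   /* zero register */
    return _mm_unpacklo_epi8(bytes, zero);                 /* bytes -> words */
}

/* SSE4.1 path: 1 instruction, 1 register (PMOVZXBW) */
static __m128i widen_sse41(const unsigned char *p)
{
    return _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)p));
}

int main(void)
{
    unsigned char src[8] = { 1, 2, 3, 255, 5, 6, 7, 8 };
    short a[8], b[8];
    _mm_storeu_si128((__m128i *)a, widen_sse2(src));
    _mm_storeu_si128((__m128i *)b, widen_sse41(src));
    printf("%d %d\n", a[3], b[3]);   /* both print 255: the byte was zero-extended */
    return 0;
}

Both paths produce identical results; the SSE4.1 version simply folds the zero register and the unpack into the single PMOVZXBW.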
Microarchitecture
 The Cores
 Single die (107 mm²)
 Two identical cores (L1 cache 64 KB × 2)
 Shared L2 cache, 6 MB
 No Hyper-Threading, no L3 cache
 Keeps the front-side bus
 Larger L2 cache
Microarchitecture
• 14-stage pipeline
• 4-wide decode
• 4-wide retire
• Macro-fusion
• Enhanced ALUs
• Deeper Buffers
Another View
Decode Hardware
• 128-bit fetch bandwidth
• 18-entry instruction queue (IQ)
• Complex decoder
- produces 1-4 micro-ops
• Microcode sequencer
Macro-fusion
• Represents an instruction pair as a single new micro-op (example below)
Enhanced ALUs
• Execute the new compare-and-jump (CMPJCC) micro-op in one clock
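Macro-fusion happens in the decoder, so it is invisible at the source level; still, a compare-and-branch pair is easiest to picture from the code that produces it. The loop below is only an illustration: the assembly in the comments is typical compiler output, not taken from the slides.

/* The loop's exit tests compile to a CMP followed by a conditional jump;
   on Core the decoder can fuse such a pair into one CMPJCC micro-op, which
   the enhanced ALU then executes in a single clock. */
int sum_until(const int *a, int n, int limit)
{
    int s = 0;
    for (int i = 0; i < n; i++) {   /* back-edge: typically  cmp i,n  +  jl .loop  (fused) */
        s += a[i];
        if (s > limit)              /* body: typically  cmp s,limit  +  jg .done  (fused)  */
            break;
    }
    return s;
}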
Out of Order Execution
• 96-entry ROB (reorder buffer)
• 32-entry reservation station
Execution Units
• 6 dispatch ports (1 load, 2 store, 3 universal ports)
• 3 integer ALUs, 2 floating-point ALUs
Branch Predictor
• Loop detector
- tracks the number of loop iterations for future reference
• The branch prediction unit (BPU) selects among three predictors for every branch (a bimodal-counter sketch follows this list):
- bimodal predictor
- global predictor
- loop detector
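As a rough illustration of the bimodal component, the sketch below keeps a table of 2-bit saturating counters indexed by branch address. The table size and indexing are arbitrary choices for the example, not the Core BPU's actual parameters.

#include <stdint.h>
#include <stdbool.h>

/* 2-bit saturating counters indexed by low branch-address bits.
   Values 0/1 predict not-taken, 2/3 predict taken. */
#define BIMODAL_ENTRIES 4096
static uint8_t counters[BIMODAL_ENTRIES];

static bool predict(uint64_t branch_ip)
{
    return counters[branch_ip % BIMODAL_ENTRIES] >= 2;
}

static void train(uint64_t branch_ip, bool taken)
{
    uint8_t *c = &counters[branch_ip % BIMODAL_ENTRIES];
    if (taken  && *c < 3) (*c)++;   /* saturate high */
    if (!taken && *c > 0) (*c)--;   /* saturate low  */
}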
 Cache Organization
 Private L1 DCache and ICache, 32 KB/core, 8-way, 64 B line size, write-back (directory-based coherence)
 Shared L2 cache, 8-way, 64 B line size (E8xxx); address breakdown sketched below
 Pros: potentially less bus traffic
 Cons: longer access latency than a private L2 cache; potential conflicts between threads
 FSB 1333 MHz (E8xxx)
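To make the cache parameters concrete, the sketch below splits an address into offset, set index, and tag using the L1 DCache numbers from this slide (32 KB, 8-way, 64 B lines, i.e. 64 sets). The example address is arbitrary.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  64                                /* bytes per line            */
#define WAYS       8                                 /* associativity             */
#define CACHE_SIZE (32 * 1024)                       /* 32 KB L1 DCache per core  */
#define SETS       (CACHE_SIZE / (WAYS * LINE_SIZE)) /* = 64 sets                 */

int main(void)
{
    uint64_t addr   = 0x7ffd1234abcdULL;             /* arbitrary example address */
    uint64_t offset = addr % LINE_SIZE;              /* bits [5:0]                */
    uint64_t set    = (addr / LINE_SIZE) % SETS;     /* bits [11:6]               */
    uint64_t tag    = addr / (LINE_SIZE * SETS);     /* remaining upper bits      */
    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}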
 Memory disambiguation
 aggressive memory dependence speculation based on a hash table indexed by the load's EIP address
 watchdog mechanism
Prediction Implementation
• History table indexed by the instruction pointer
• Each entry in the history array holds a saturating counter
• Once the counter saturates, disambiguation is possible for this load (taking effect from the next iteration): the load is allowed to go even when it meets unknown store addresses
• When a particular load fails disambiguation, its counter is reset
• Each time a particular load is correctly disambiguated, its counter is incremented
(A minimal counter sketch follows.)
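A minimal sketch of the counter scheme just described, assuming an illustrative table size, hash, and saturation threshold (the real parameters are not published):

#include <stdint.h>
#include <stdbool.h>

#define DISAMBIG_ENTRIES 256   /* illustrative table size           */
#define SATURATE_AT       15   /* illustrative saturation threshold */
static uint8_t history[DISAMBIG_ENTRIES];   /* saturating counter per load EIP */

static bool may_speculate(uint64_t load_eip)              /* "go" once saturated */
{
    return history[load_eip % DISAMBIG_ENTRIES] == SATURATE_AT;
}

static void on_correct_disambiguation(uint64_t load_eip)  /* increment counter */
{
    uint8_t *c = &history[load_eip % DISAMBIG_ENTRIES];
    if (*c < SATURATE_AT) (*c)++;
}

static void on_failed_disambiguation(uint64_t load_eip)   /* reset counter */
{
    history[load_eip % DISAMBIG_ENTRIES] = 0;
}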
Predictor Lookup
 When the load is sent from the RS, set the disambiguation bit
 If it meets an older unknown store address, set the "update" bit
Load Dispatch
 If the prediction is "go", dispatch the load and set the "done" bit
 Else the load is blocked
Prediction Verification
 A store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set
 When the load commits, update the history
 Execute Disable Bit Support
 cf. AMD Enhanced Virus Protection; ARM eXecute Never
 Helps prevent buffer overflow attacks
 No need for software patches against buffer overflow attacks
 Segregates memory into areas that store either code or data
 The processor disables code execution when malicious worms try to insert code into data buffers (with OS support); a usage sketch follows
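As a usage sketch (POSIX, not from the slides): with the XD/NX bit, the OS maps data buffers without execute permission, so injected code faults rather than runs; mprotect is how legitimate code (e.g. a JIT) opts back in.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* A writable data buffer: readable and writable, but NOT executable. */
    unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    unsigned char payload[] = { 0xC3 };   /* x86 RET: stands in for injected code */
    memcpy(buf, payload, sizeof payload);

    /* Jumping into buf here would fault on an XD/NX-enabled system, because the
       page lacks execute permission.  Code that legitimately needs to execute
       from this memory must request it explicitly: */
    if (mprotect(buf, 4096, PROT_READ | PROT_EXEC) != 0)
        perror("mprotect");

    munmap(buf, 4096);
    return 0;
}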
 Instruction Pointer Based Prefetcher
 L1 DCache: 2 IP prefetchers/core
 L1 ICache: 1 traditional prefetcher
 L2 cache: 2 IP prefetchers
 Predicts what memory address will be used and delivers the data in time
 Records every load's history using the instruction pointer (IP history array)
 Parameters for prefetch traffic control are fine-tuned for different platforms (prefetch monitor)
 (A stride-detection sketch follows.)
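A minimal sketch of IP-indexed stride prefetching as described above; the table size, confidence threshold, and issue_prefetch hook are illustrative assumptions, not the actual Core prefetcher.

#include <stdint.h>

#define PREFETCH_ENTRIES 64
#define CONFIDENT         2

struct ip_entry {
    uint64_t last_addr;   /* address of the previous access by this load IP */
    int64_t  stride;      /* last observed stride                           */
    int      confidence;  /* how many times the same stride repeated        */
};
static struct ip_entry table[PREFETCH_ENTRIES];

static void issue_prefetch(uint64_t addr) { (void)addr; /* placeholder for a line fetch */ }

/* Called (conceptually) on every load: learn the stride for this IP and,
   once confident, fetch the next expected address ahead of demand. */
static void observe_load(uint64_t load_ip, uint64_t addr)
{
    struct ip_entry *e = &table[load_ip % PREFETCH_ENTRIES];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride == e->stride && stride != 0) {
        if (e->confidence < CONFIDENT) e->confidence++;
    } else {
        e->stride = stride;
        e->confidence = 0;
    }
    e->last_addr = addr;

    if (e->confidence >= CONFIDENT)
        issue_prefetch(addr + (uint64_t)e->stride);   /* deliver the next address in time */
}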
Questions?