Circuit Technologies for Multi-Core Processor Design
Stefan Rusu
Intel Corporation, Santa Clara, CA
stefan.rusu@intel.com
Xeon, Itanium, Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be Copyright © 2006, Intel Corporation
Outline
• Dual-core architectural directions
• Interconnect trends
• Power and leakage reduction
  – Cache Sleep and Shut-off Modes
  – Long-Le Transistor Usage
• Voltage domains
• Clock distribution
• Package details
• DFT/DFM features
• Thermal management
• Summary
Dual Core Processors Everywhere!
Laptop: “Intel tips Viiv, Yonah in consumer push,” Mark LaPedus, 01/06/2006. SAN JOSE, Calif. - Making a big splash in the consumer market, Intel Corp. on Thursday unveiled a dual-core processor, PC platform and several content alliances that are said to provide the foundation for digital entertainment and wireless laptops.
Server: “Intel unwraps dual-core Xeon server processors,” Tom Krazit, 10/10/2005. At an event Monday in San Francisco, Intel unveiled its first dual-core Xeon chips for two-processor and four-processor servers, previously known by the Paxville code name.
Desktop: “Intel ready to ship dual-core processors,” Daniel A. Begun, 3/15/2005. Earlier this year, Intel announced that desktop processors using dual-core technology will be available by the end of June, but the company recently hinted at March's Intel Developer Forum (IDF) that dual-core processors could be available even sooner than that.
Tulsa – Xeon® Processor
[Block diagram: Core 0 with 1MB L2, Core 1 with 1MB L2, shared 16MB L3 cache, bus interface]
[Die photo labels: FSB TOP, Core 1, TAG, 1MB L2, Control Logic, 1MB L2, Core 0, TAG, FSB BOT, 16MB L3]
• Process technology: 65 nm, 8 Cu interconnect layers
• Transistor count: 1.328 Billion
• Die area: 435 mm²
Montecito – Itanium® 2 Processor
[Block diagram: Core 0 with 1MB L2I, 256kB L2D and 12MB L3; Core 1 with 1MB L2I, 256kB L2D and 12MB L3]
[Die photo labels: 12MB L3, 1MB L2I, Core 1, Bus Interface, Core 0, 1MB L2I, 12MB L3]
• Process technology: 90 nm, 7 Cu interconnect layers
• Transistor count: 1.72 Billion
• Die area: 596 mm²
Yonah – Mobile/Desktop/Blade Server
[Block diagram: Core 0, Core 1, shared 2MB L2 cache, bus interface]
[Die photo labels: Core 0, Core 1, Bus, 2 MB L2 Cache]
• Process technology: 65 nm, 8 Cu interconnect layers
• Transistor count: 151 Million
• Die area: 90.3 mm²
Moore’s Law for Multi-Core Processors
[Figure: log-scale vertical axis from 1.E+05 to 1.E+10 versus year (1980 to 2010). Data points: 286, 386, 486, Pentium®, Pentium® II, Pentium® III, Pentium® 4, Itanium®, Itanium® 2, Dual-core Xeon®, Dual-core Itanium® 2]
Process technology scaling enables the number of cores to double every two years
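The doubling claim above amounts to simple exponential growth. The following is a minimal sketch of that arithmetic, assuming a hypothetical starting point of 2 cores in 2006 (the dual-core parts covered in this talk). The function name and its defaults are illustrative and do not represent a product roadmap.

```python
def projected_cores(year, base_year=2006, base_cores=2, doubling_period_years=2.0):
    """Core count if it doubles every `doubling_period_years` (illustrative only)."""
    return base_cores * 2 ** ((year - base_year) / doubling_period_years)


if __name__ == "__main__":
    for year in (2006, 2008, 2010, 2012):
        # Hypothetical projection under the stated doubling assumption
        print(year, projected_cores(year))
```

Changing `doubling_period_years` shows how sensitive the projection is to the assumed cadence of process generations.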
Server Processors L3 Cache Trend
[Figure: on-die L3 cache [MB] (axis 0 to 28) for Itanium® and Xeon® processors across the 180nm, 130nm, 90nm and 65nm process generations]
Cache size doubles with every process generation
SRAM Cell Size Scaling
[Figure: SRAM cell size (log axis, 0.1 to 100) versus process generation]
Process generation   Cell size
0.5                  44
0.35                 21
0.25                 10.6
0.18                 5.6
0.13                 2.4
90n                  1.0
65n                  0.57
SRAM cell size scales ~0.5x per generation
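The "~0.5x per generation" claim can be checked directly against the seven data points on the chart. The sketch below computes each generation-to-generation ratio and their geometric mean. The pairing of values with process generations follows the order in which they appear on the slide, which is an assumption about the chart layout.

```python
from math import prod

# (process generation, SRAM cell size) pairs in the order shown on the chart
cell_sizes = [("0.5", 44), ("0.35", 21), ("0.25", 10.6), ("0.18", 5.6),
              ("0.13", 2.4), ("90n", 1.0), ("65n", 0.57)]

# Shrink factor between each pair of consecutive generations
ratios = [nxt[1] / cur[1] for cur, nxt in zip(cell_sizes, cell_sizes[1:])]

# Geometric mean is the right average for multiplicative scaling
geo_mean = prod(ratios) ** (1 / len(ratios))

if __name__ == "__main__":
    for (gen, _), r in zip(cell_sizes[1:], ratios):
        print(f"{gen}: {r:.2f}x")
    print(f"geometric mean: {geo_mean:.2f}x")
```

Every individual step falls between 0.4x and 0.6x, so the slide's rounded figure is a fair summary of the data.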
Server Processors L3 Cache Trend
Dual-core Processor   Process   L3 cache per core [MB]   L3 cache per thread [MB]
Itanium®              90nm      12                       6
Xeon®                 65nm      8                        4
Large per-core and per-thread caches enable server-class performance
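The per-core and per-thread figures above follow from the total L3 capacity of each die. The sketch below reproduces that arithmetic. The value of two threads per core is an assumption inferred from the ratio of the two columns, and the function name is illustrative.

```python
def l3_per_core_and_thread(total_l3_mb, cores, threads_per_core):
    """Split a die's total L3 capacity evenly across cores, then across threads."""
    per_core = total_l3_mb / cores
    per_thread = per_core / threads_per_core
    return per_core, per_thread


# Itanium (Montecito): one 12MB L3 per core, two cores.
# Xeon (Tulsa): one 16MB L3 shared by two cores.
# threads_per_core=2 is an assumption inferred from the per-thread column.
montecito = l3_per_core_and_thread(total_l3_mb=2 * 12, cores=2, threads_per_core=2)
tulsa = l3_per_core_and_thread(total_l3_mb=16, cores=2, threads_per_core=2)

if __name__ == "__main__":
    print("Montecito:", montecito)
    print("Tulsa:", tulsa)
```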
On-chip Interconnect Trend
[Figure: relative delay (log axis, 0.1 to 100) versus feature size (250, 180, 130, 90, 65, 45, 32 nm). Curves: gate delay (FO4), local interconnect (M1,2), global interconnect with repeaters, global interconnect without repeaters. Source: ITRS]
• Local interconnects scale with gate delay
• Global interconnects do not scale
Layout Techniques to Reduce Coupling
• Interleave bussed signals with other busses that switch at a different time
• Eliminates capacitive coupling and reduces inductive noise
[Figure: Bus_A and Bus_B voltage (V) versus time (t)]
• Switch bit order of bussed signals at every turn
• Noise will not be additive across the entire bus route
[Figure: Swizzled Bus Route, showing bit orders 3 0 4 1 5 2 and 0 1 2 3 4 5]
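The bit-order switch above can be expressed as a permutation. The sketch below assumes the figure shows a six-bit bus ordered 0 1 2 3 4 5 on one segment and 3 0 4 1 5 2 on the other. The interleaving rule and the helper names are illustrative choices that reproduce this order, not a documented routing algorithm.

```python
def swizzle(order):
    """Interleave the second half of the bus with the first half.

    Assumes an even bus width; an odd trailing bit would be dropped.
    """
    half = len(order) // 2
    first, second = order[:half], order[half:]
    out = []
    for hi, lo in zip(second, first):
        out.extend([hi, lo])
    return out


def shares_neighbors(before, after):
    """True if any pair adjacent in `before` is still adjacent in `after`."""
    pairs = {frozenset(p) for p in zip(before, before[1:])}
    return any(frozenset(p) in pairs for p in zip(after, after[1:]))


bus = [0, 1, 2, 3, 4, 5]
turned = swizzle(bus)

if __name__ == "__main__":
    print(turned)
    print("same neighbors after the turn:", shares_neighbors(bus, turned))
```

Because no wire keeps the same neighbor after the turn, coupling noise picked up on one segment does not keep accumulating from the same aggressor on the next.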
Layout Techniques (Cont)
• Interleave narrow Vcc/Vss lines through bus
[Figure: bus wires interleaved with Vcc and Vss lines]
• Use staggered inverting buffers [7]
L3 Cache Sleep and Shut-off Modes
[Figure: one sub-array per mode on a 1.1V supply, with block select, sleep bias and shut-off controls on the virtual VSS node]
Mode            Virtual VSS   Leakage
Active Mode     0V
Sleep Mode      250mV         2x lower leakage
Shut-off Mode   520mV         2x lower leakage
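The virtual VSS levels above determine how much voltage remains across each sub-array. The sketch below tabulates that remaining voltage and a relative leakage figure per mode. It assumes each "2x lower leakage" label is relative to the previous mode, so the steps compound; that reading and all names are assumptions.

```python
VCC = 1.1  # supply voltage shown on the slide, in volts

# mode -> (virtual VSS in volts, leakage relative to active mode)
# Relative leakage assumes the two "2x lower leakage" steps compound.
MODES = {
    "active": (0.00, 1.0),
    "sleep": (0.25, 1.0 / 2),
    "shut-off": (0.52, 1.0 / 4),
}


def cell_voltage(mode):
    """Voltage left across the sub-array once virtual VSS is raised."""
    virtual_vss, _ = MODES[mode]
    return VCC - virtual_vss


def relative_leakage(mode):
    """Leakage of a sub-array in `mode`, normalized to active mode."""
    return MODES[mode][1]


if __name__ == "__main__":
    for mode in MODES:
        print(f"{mode}: {cell_voltage(mode):.2f} V across array, "
              f"{relative_leakage(mode):.2f}x leakage")
```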
Leakage Shut-off Infrared Images
[Infrared images: 16MB SKU with all 16MB in sleep mode; 8MB SKU with 8MB in sleep mode and 8MB in shut-off mode]
Shut-off feature reduces the leakage of the 8MB disabled sub-arrays by about 3W
Dynamic Intel® Smart Cache Sizing
• First implementation in Yonah dual-core mobile processor
  – Dynamic implementation of the shut-off mode
• HW based algorithm predicts cache usage requirements
  – Considers the % of time the CPU is in Active state compared to the various sleep states
• During periods of low activity or inactivity the processor dynamically adapts its effective cache size
  – Cache content is gradually flushed to system memory
  – Cache ways are gradually turned off (physically as well as logically), thus reducing power
• Cache ways are re-powered on demand to deliver full performance when needed
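The behavior described above can be illustrated with a toy software model. The sketch below is not the hardware algorithm: the number of ways, the activity threshold, the minimum way count and every name are hypothetical, chosen only to show the flush-then-power-off order and the on-demand restore.

```python
class CacheSizer:
    """Toy model of way-by-way cache shrinking; all parameters are hypothetical."""

    def __init__(self, total_ways=8, shrink_below=0.2, min_ways=2):
        self.total_ways = total_ways
        self.active_ways = total_ways
        self.shrink_below = shrink_below  # active-time fraction that triggers shrinking
        self.min_ways = min_ways

    def flush_way(self, way):
        """Placeholder for writing the way's contents back to system memory."""
        pass

    def on_interval(self, active_fraction):
        """Called once per monitoring interval with the share of time in Active state."""
        if active_fraction < self.shrink_below and self.active_ways > self.min_ways:
            self.flush_way(self.active_ways - 1)  # flush first
            self.active_ways -= 1                 # then power the way off
        return self.active_ways

    def on_demand(self):
        """Re-power all ways when full performance is needed."""
        self.active_ways = self.total_ways
        return self.active_ways


if __name__ == "__main__":
    sizer = CacheSizer()
    for fraction in (0.05, 0.05, 0.9, 0.01):
        print(fraction, sizer.on_interval(fraction))
    print("demand:", sizer.on_demand())
```

Shrinking one way per interval mirrors the "gradually" wording on the slide; a real predictor would likely weigh the cost of flushing against the expected idle time.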
Leakage Mitigation: Long-Le Transistors
[Figure: Nominal Le and Long Le (Nom+10%) transistors]
• All transistors can be either nominal or long-Le
• Most library cells are available in both flavors
• Long-Le transistors are about 10% slower, but have 3x lower leakage
• All paths with timing slack use long-Le transistors
• Initial design uses only long channel devices
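The flow implied above starts with long-Le devices everywhere and swaps in nominal devices only where timing fails. The sketch below shows that decision at path granularity, which is a simplification (a real flow would likely swap individual cells). The path names, delays and cycle time are hypothetical.

```python
LONG_LE_SLOWDOWN = 1.10  # long-Le devices are about 10% slower than nominal


def assign_devices(long_le_delays_ps, cycle_time_ps):
    """Start all-long-Le; swap only paths that miss the cycle time to nominal."""
    return {
        name: ("long-Le" if delay <= cycle_time_ps else "nominal")
        for name, delay in long_le_delays_ps.items()
    }


def nominal_delay(long_le_delay_ps):
    """Estimated delay of the same path after swapping to nominal devices."""
    return long_le_delay_ps / LONG_LE_SLOWDOWN


# Hypothetical path delays measured with long-Le devices, in picoseconds
paths = {"decode": 198.0, "bypass": 275.0, "alu": 319.0}
assignment = assign_devices(paths, cycle_time_ps=300.0)

if __name__ == "__main__":
    for name, kind in assignment.items():
        print(name, kind)
    print("alu after swap:", round(nominal_delay(paths["alu"]), 1), "ps")
```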
Long-Le Transistors Usage Map
[Figure: die map of long-Le usage on a 0% to 100% scale, covering Core 1, Control, Core 0 and L3 Cache]
Long-Le Transistors Summary
Percentage of Long-Le device width excluding RAM arrays:
          Nominal   Long-Le
Cores     46%       54%
Uncore    24%       76%
To reduce sub-threshold leakage, most devices will be slower and only a handful of transistors will be fast
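The width percentages above can be combined with the 3x leakage figure quoted earlier for long-Le devices to estimate the saving. The sketch below assumes leakage is proportional to device width and that nominal and long-Le devices differ only by that 3x factor; both are simplifying assumptions.

```python
LONG_LE_LEAKAGE = 1 / 3  # long-Le leaks about 3x less than nominal


def relative_leakage(long_le_fraction):
    """Leakage relative to an all-nominal design, weighted by device width."""
    nominal_fraction = 1 - long_le_fraction
    return nominal_fraction + long_le_fraction * LONG_LE_LEAKAGE


cores = relative_leakage(0.54)
uncore = relative_leakage(0.76)

if __name__ == "__main__":
    print(f"cores:  {cores:.2f}x of all-nominal leakage")
    print(f"uncore: {uncore:.2f}x of all-nominal leakage")
```

Under these assumptions the non-RAM logic leaks roughly two thirds (cores) and one half (uncore) of what an all-nominal design would.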
Stack Forcing
[Figure: a single transistor of width w replaced by a two-transistor stack with upper width w_u and lower width w_l. Equal loading: w_u + w_l = w, with w_u = w_l = ½ w]
[Figure: leakage reduction (log axis, 1 to 10) versus performance loss. Labels: two-stack high-Vt, two-stack low-Vt, high-Vt, low-Vt]
[Narendra, et al – ISLPED 2001]
• Force one transistor into a two transistor stack with the same input load
• Can be applied to gates with timing slack
• Trade-off between transistor leakage and speed
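The equal-loading constraint above fixes the stack widths, and a first-order model shows why speed suffers. The sketch below splits one device into two half-width devices and estimates the drive strength of the stack with a simple series-resistance model (resistance inversely proportional to width). That model is an assumption added for illustration and does not capture the stack effect that produces the leakage reduction.

```python
def force_stack(w):
    """Split one device of width w into a two-device stack with equal input load."""
    w_u = w_l = w / 2
    return w_u, w_l


def series_drive_width(w_u, w_l):
    """Equivalent width of two series devices, assuming resistance ~ 1 / width."""
    return 1 / (1 / w_u + 1 / w_l)


w_u, w_l = force_stack(1.0)

if __name__ == "__main__":
    print("input load:", w_u + w_l)
    print("equivalent drive width:", series_drive_width(w_u, w_l))
```

With the input load held at w, the stack drives like a device of about a quarter of the width in this model, which is why the technique is limited to gates with timing slack.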
Montecito Voltage Domains [2]
[Figure: die map of voltage domains with the following labels]
Core 0, frequency tracks voltage, 40W
Bus Arbiter, fixed frequency, 2.5W
Core 1, frequency tracks voltage, 40W
Foxton Control, 1GHz fixed frequency
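Tulsa Voltage Domains
[Figure: die floor plan (FSB TOP, Core 1, 1MB L2, Control Logic, TAG, 1MB L2, Core 0, TAG, FSB BOT, 16MB L3) shaded by voltage domain. Legend: Cores, Uncore, I/O, Core PLL]
[Figure: voltage profile along the cut line. Labels: 1.25V; Cores; Ctrl + Tag 1.10V; 16MB array 0.25V; Virtual VSS]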
Tulsa Clock Domains [4]
[Figure: die floor plan (FSB TOP, Core 1, 1MB L2, TAG, 16MB L3, 1MB L2, TAG, Core 0, FSB BOT) shaded by clock domain, with the System Clock (BCLK). Legend: Core, Uncore, I/O, PLL]
Tulsa Uncore Clock Distribution [4]
[Figure: horizontal clock spines. Labels: Un-Core pre-global ZCLK spine, Un-Core sparse SCLK grid, Un-Core pre-global MCLK spine]
Montecito Clock System [5]
[Block diagram labels: Variable Supply, Fixed Supply, Pins, Fuses, Frequency Translation Table, Divisors (1/M, 1/N, 1/1), Bus Clock, PLL, Core0, Core1, Foxton, I/Os, Bus Logic, DFD, RVD, RAD, SLCB, Phase Aligner, CVD, Gater, Balanced Tree Clock Distribution]
Montecito Clock Distribution [6]
[Figure labels: L0 route, L1 route, L2 route, L3 route; Bus Clock, PLL, repeaters, DFD, RAD, SLCB, CVD, gaters, latches; core0, core1, Foxton, IOs, Bus Logic]
Legend: fixed frequency vs. variable frequency; low voltage swings vs. full rail transitions; differential vs. single ended
Single-Ended vs. Differential Clocks
[Figure: differential routing with CLK+ and CLK- between GND lines, and single-ended routing with CLK between VDD and GND, over alternating GND/VDD lines]
• Differential clock
  – Lower skew
  – High power
  – Longer distance between repeaters
• Single-ended clock
  – Lower power
  – Need sharp edges to control skew
Dense vs. Sparse Grid Tiles
[Figure: Un-Core Dense Tile, Core Full Grid Tile, Un-Core Sparse Tile]
Xeon® Processor Package
• 12-layer organic package (53.3 mm/side)
• 4-4-4 stacking
• Integrated heat spreader (38.5 mm/side)
• 604 total pins
• 366 signal I/Os
• System management components and decoupling capacitors on package
Itanium® 2 Processor Package
[Package photo labels: power delivery connector, heat spreader, decoupling caps, system management components]
Intel® Core™ Duo Processor Package
Design for Test and Debug Features
Die-level DFT/DFM
• Parallel structural core test with XOR
• Scan and observability registers (scan-out)
• Three TAP controllers (core0, core1, uncore)
• Within-die process monitors
• On-die clock shrink
L3 cache DFT/DFM
• Built-in pattern generator (PBIST)
• Programmable weak-write test
• Low-yield analysis
• Stability test mode
• 32-entry cache line disable (Pellston)
FSB DFT/DFM
• I/O loopback
• I/O test generator
Itanium® 2 Thermal Map
• Two thermal sensors per core
• Mux thermal diodes into VCOs to measure temp
Xeon® Infra-red Emission Image
[Image labels: Temperature Sensors, Thermal Diode]
Potential Multi-Core Thermal Control
[Figure: eight cores, each with two thermal sensors (TS), plus cache blocks, two I/O blocks and a TMU. TS = Thermal Sensor, TMU = Thermal Management Unit]
• Multiple core designs require a central thermal management unit and an efficient mechanism for transmitting thermal sensor measurements [8]
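The central thermal management unit described above has to collect readings from every sensor and decide which cores need attention. The sketch below shows one possible aggregation policy: take the hottest sensor per core and flag cores above a limit. The data layout, the 90 °C limit and the function name are hypothetical and are not taken from the cited patent.

```python
from dataclasses import dataclass


@dataclass
class SensorReading:
    core: int
    sensor: int
    temp_c: float


def cores_to_throttle(readings, limit_c=90.0):
    """Return the cores whose hottest sensor exceeds the (hypothetical) limit."""
    hottest = {}
    for r in readings:
        hottest[r.core] = max(hottest.get(r.core, float("-inf")), r.temp_c)
    return sorted(core for core, temp in hottest.items() if temp > limit_c)


if __name__ == "__main__":
    sample = [
        SensorReading(core=0, sensor=0, temp_c=71.0),
        SensorReading(core=0, sensor=1, temp_c=78.5),
        SensorReading(core=1, sensor=0, temp_c=88.0),
        SensorReading(core=1, sensor=1, temp_c=93.2),
    ]
    print(cores_to_throttle(sample))
```

Using the per-core maximum rather than an average keeps a single hot spot from being masked by a cooler sensor on the same core.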
Summary
• Dual-core processors cover the entire compute spectrum, from laptops to desktops and servers
• Moore’s Law applied to multi-core processors:
  – Process technology scaling enables number of cores and cache size to double every two years
• Power and leakage reduction circuit techniques are essential for multi-core processors
  – Massive Long-Le usage → many slow transistors and a few fast devices
  – Cache sleep and shut-off modes (static and dynamic)
• Multiple on-die clock and voltage domains are required to control active power and leakage
  – Need new verification tools and automated checks
References
[1] S. Rusu, et al., “A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache,” ISSCC Dig. Tech. Papers, Paper 5.3, Feb. 2006.
[2] S. Naffziger, et al., “The Implementation of a 2-Core, Multi-Threaded Itanium® Family Processor,” IEEE J. Solid-State Circuits, pp. 197-209, Jan. 2006.
[3] R. Korner, “Yonah and Sossaman Processor Briefing,” Intel Developer Forum, Fall 2005.
[4] S. Tam, et al., “Clock Generation and Distribution of a Dual-Core Xeon® Processor with 16MB L3 Cache,” ISSCC Dig. Tech. Papers, Paper 21.2, 2006.
[5] T. Fischer, et al., “A 90-nm variable frequency clock system for a power managed Itanium® architecture processor,” IEEE J. Solid-State Circuits, pp. 217-227, Jan. 2006.
[6] P. Mahoney, et al., “Clock Distribution on a Dual-core Multi-threaded Itanium™-Family Microprocessor,” ISSCC Dig. Tech. Papers, 2005.
[7] S. Rusu, “Timing analysis of high-speed VLSI designs - Trends and challenges,” International Workshop on Timing Issues (TAU), 1997.
[8] S. Rusu and S. Tam, “Apparatus for thermal management of multiple core microprocessors,” US patent 6,908,272, issued 7/21/2005.