Verdana Bold 30

advertisement
A Look Inside Intel®:
The Core (Nehalem)
Microarchitecture
Intel®
Beeman Strong
Core™ microarchitecture
(Nehalem) Architect
Intel Corporation
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts
2
Scalable Cores
Same core for
all segments
Common software
optimization
Intel® Core™
Microarchitecture
(Nehalem)
Common feature set
Servers/Workstations
Energy Efficiency,
Performance, Virtualization,
Reliability, Capacity, Scalability
45nm
Desktop
Performance, Graphics, Energy
Efficiency, Idle Power, Security
Mobile
Battery Life, Performance,
Energy Efficiency, Graphics,
Security
Optimized cores to meet all market segments
3
The First Intel® Core™ Microarchitecture
(Nehalem) Processor
Memory Controller
M
i
s
c
I
O
Q
P
I
0
Core
Core
Q
u
e
u
e
Core
Shared L3 Cache
Core
M
i
s
c
I
O
Q
P
I
1
QPI: Intel® QuickPath
Interconnect (Intel® QPI)
A Modular Design for Flexibility
4
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts
5
Intel® Core™ Microarchitecture Recap
• Wide Dynamic Execution
– 4-wide decode/rename/retire
• Advanced Digital Media Boost
– 128-bit wide SSE execution units
• Intel HD Boost
– New SSE4.1 Instructions
• Smart Memory Access
– Memory Disambiguation
– Hardware Prefetching
• Advanced Smart Cache
– Low latency, high BW shared L2 cache
Nehalem builds on the great Core microarchitecture
6
Designed for Performance
New SSE4.2
Instructions
Improved Lock
Support
L1 Data Cache
Execution
Units
Deeper
Buffers
Out-of-Order
Scheduling &
Retirement
Simultaneous
Multi-Threading
Memory Ordering
& Execution
Instruction
Decode &
Microcode
Faster
Virtualization
7
Additional Caching
Hierarchy
L2 Cache
& Interrupt
Servicing
Paging
Branch Prediction
Improved
Loop
Streaming
Instruction Fetch
& L1 Cache
Better Branch
Prediction
Macrofusion
• Introduced in Intel® Core™2 microarchitecture
• TEST/CMP instruction followed by a conditional branch treated
as a single instruction
– Decode/execute/retire as one instruction
• Higher performance & improved power efficiency
– Improves throughput/Reduces execution latency
– Less processing required to accomplish the same work
• Support all the cases in Intel Core 2 microarchitecture PLUS
– CMP+Jcc macrofusion added for the following branch conditions
–
–
–
–
JL/JNGE
JGE/JNL
JLE/JNG
JG/JNLE
– Intel® Core™ microarchitecture (Nehalem) supports macrofusion
in both 32-bit and 64-bit modes
– Intel Core2 microarchitecture only supports macrofusion in 32-bit
mode
Increased macrofusion benefit on Intel®
Core™ microarchitecture (Nehalem)
8
Intel® Core™ Microarchitecture
(Nehalem) Loop Stream Detector
• Loop Stream Detector identifies software loops
– Stream from Loop Stream Detector instead of normal path
– Disable unneeded blocks of logic for power savings
– Higher performance by removing instruction fetch limitations
• Higher performance: Expand the size of the loops detected (vs Core 2)
• Improved power efficiency: Disable even more logic (vs Core 2)
Intel Core Microarchitecture (Nehalem) Loop Stream Detector
Branch
Prediction
Fetch
Decode
Loop
Stream
Detector
28
Micro-Ops
9
Branch Prediction Improvements
• Focus on improving branch prediction accuracy
each CPU generation
– Higher performance & lower power through more
accurate prediction
• Example Intel® Core™ microarchitecture (Nehalem)
improvements
– L2 Branch Predictor
– Improve accuracy for applications with large code size (ex.
database applications)
– Advanced Renamed Return Stack Buffer (RSB)
– Remove branch mispredicts on x86 RET instruction (function
returns) in the common case
Greater Performance through Branch Prediction
10
Execution Unit Overview
Unified Reservation Station
• Schedules operations to Execution units
• Single Scheduler for all Execution Units
• Can be used by all integer, all FP, etc.
Execute 6 operations/cycle
• 3 Memory Operations
• 1 Load
• 1 Store Address
• 1 Store Data
• 3 “Computational” Operations
Unified Reservation Station
FP Multiply
FP Add
Divide
Complex
SSE Integer ALU
SSE Integer
Integer Shuffles
Multiply
Port 5
LEA
Load
Port 4
Shift
Port 3
Integer ALU &
Port 2
Port 1
Port 0
Integer ALU &
Store
Store
Integer ALU &
Address
Data
Shift
Branch
FP Shuffle
Integer
SSE Integer ALU
Integer Shuffles
11
Increased Parallelism
Concurrent uOps Possible
• Goal: Keep powerful
execution engine fed
• Nehalem increases size of out
of order window by 33%
• Must also increase other
corresponding structures
1
128
112
96
80
64
48
32
16
0
Dothan
Merom
Nehalem
Structure
Intel® Core™
microarchitecture
(formerly Merom)
Intel® Core™
microarchitecture
(Nehalem)
Comment
Reservation Station
32
36
Dispatches
operations to
execution units
Load Buffers
32
48
Tracks all load
operations allocated
Store Buffers
20
32
Tracks all store
operations allocated
Increased Resources for Higher Performance
1Intel®
Pentium® M processor (formerly Dothan)
Intel® Core™ microarchitecture (formerly Merom)
Intel® Core™ microarchitecture (Nehalem)
12
Enhanced Memory Subsystem
• Responsible for:
– Handling of memory operations (loads/stores)
• Key Intel® Core™2 Features
– Memory Disambiguation
– Hardware Prefetchers
– Advanced Smart Cache
• New Intel® Core™ Microarchitecture (Nehalem)
Features
– New TLB Hierarchy (new, low latency 2nd level unified TLB)
– Fast 16-Byte unaligned accesses
– Faster Synchronization Primitives
13
Intel® Hyper-Threading Technology
• Also known as Simultaneous MultiThreading (SMT)
w/o SMT
SMT
– Run 2 threads at the same time per core
– Keep it fed with multiple threads
– Hide latency of a single thread
• Most power efficient performance feature
– Very low die area cost
– Can provide significant performance benefit
depending on application
– Much more efficient than adding an entire
core
• Intel® Core™ microarchitecture (Nehalem)
advantages
– Larger caches
– Massive memory BW
Simultaneous multi-threading enhances
performance and energy efficiency
14
Time (proc. cycles)
• Take advantage of 4-wide execution engine
Note: Each box
represents a
processor
execution unit
SMT Performance Chart
Performance Gain SMT enabled vs disabled
40%
34%
35%
29%
30%
25%
20%
16%
13%
15%
10%
10%
7%
5%
0%
Floating Point
3 dsMax*
Integer
Cinebench* 10 POV-Ray* 3.7 3 DMark*
beta 25 Vantage* CPU
Intel®Core™i7
Floating Point is based on SPECfp_rate_base2006* estimate
Integer is based on SPECint_rate_base2006* estimate
SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation.
For more information on SPEC benchmarks, see: http://www.spec.org
Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured
using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any
difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information
to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the
performance of Intel products, visit http://www.intel.com/performance/
15
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts
16
Designed For Modularity
C
O
R
E
C
O
R
E
…
C
O
R
E
Core
L3 Cache
DRAM
…
IM
C
Intel
®
QPI
…
Intel
®
QPI
Power
&
Clock
Uncore
…
Intel QPI
Differentiation in the “Uncore”:
# cores
# mem
channels
#QPI
Links
Size of
cache
Type of
Memory
Intel® QPI: Intel® QuickPath
Interconnect (Intel® QPI)
Power
Management
Integrated
graphics
2008 – 2009 Servers & Desktops
Optimal price / performance / energy efficiency
for server, desktop and mobile products
17
Intel® Smart Cache – 3rd Level Cache
• Shared across all cores
• Size depends on # of cores
– Quad-core: Up to 8MB (16-ways)
– Scalability:
– Built to vary size with varied core
counts
– Built to easily increase L3 size in
future parts
• Perceived latency depends on
frequency ratio between core &
uncore
• Inclusive cache policy for best
performance
– Address residing in L1/L2 must be
present in 3rd level cache
18
Core
Core
Core
L1 Caches L1 Caches
L1 Caches
…
L2 Cache
L2 Cache
L3 Cache
L2 Cache
Why Inclusive?
• Inclusive cache provides benefit of an on-die snoop filter
• Core Valid Bits
– 1 bit per core per cache line
– If line may be in a core, set core valid bit
– Snoop only needed if line is in L3 and core valid bit is set
– Guaranteed that line is not modified if multiple bits set
• Scalability
– Addition of cores/sockets does not increase snoop traffic seen by
cores
• Latency
– Minimize effective cache latency by eliminating cross-core snoops
in the common case
– Minimize snoop response time for cross-socket cases
19
Intel® Core™ Microarchitecture
(Nehalem-EP) Platform Architecture
• Integrated Memory Controller
– 3 DDR3 channels per socket
– Massive memory bandwidth
– Memory Bandwidth scales with # of
processors
– Very low memory latency
Nehalem
EP
Nehalem
EP
Tylersburg
EP
• Intel® QuickPath Interconnect
(Intel® QPI)
–
–
–
–
–
–
New point-to-point interconnect
Socket to socket connections
Socket to chipset connections
Build scalable solutions
Up to 6.4 GT/sec (12.8 GB/sec)
Bidirectional (=> 25.6 GB/sec)
IOH
memory
Significant performance
leap from new platform
memory
CPU
CPU
CPU
CPU
IOH
Intel® Core™ microarchitecture (Nehalem-EP)
Intel® Next Generation Server Processor Technology (Tylersburg-EP)
20
memory
memory
Non-Uniform Memory Access (NUMA)
• FSB architecture
– All memory in one location
Nehalem
EP
Intel®
• Starting with
Core™
microarchitecture (Nehalem)
Nehalem
EP
– Memory located in multiple
places
Ensure software is NUMAoptimized for best performance
Intel® Core™ microarchitecture (Nehalem-EP)
Intel® Next Generation Server Processor Technology (Tylersburg-EP)
21
Relative Memory Latency Comparison
Relative Memory Latency
• Latency to memory dependent
on location
• Local memory has highest BW,
lowest latency
• Remote Memory still very fast
Tylersburg
EP
1.00
0.80
0.60
0.40
0.20
0.00
Harpertown (FSB 1600)
Nehalem (DDR3-1067)
Nehalem (DDR3-1067)
Local
Remote
Memory Bandwidth – Initial Intel® Core™
Microarchitecture (Nehalem) Products
•
•
•
•
3 memory channels per socket
≥ DDR3-1066 at launch
Massive memory BW
Scalability
Stream Bandwidth – Mbytes/Sec (Triad)
33376
3.4X
– Design IMC and core to take
advantage of BW
– Allow performance to scale with
cores
– Core enhancements
9776
– Support more cache misses per
core
– Aggressive hardware prefetching w/
throttling enhancements
6102
HTN 3.16/ BF1333/
667 MHz mem
– Example IMC Features
– Independent memory channels
– Aggressive Request Reordering
HTN 3.00/ SB1600/
800 MHz mem
Source: Intel Internal measurements – August 20081
Massive memory BW provides performance and scalability
1HTN:
NHM 2.66/ 6.4 QPI/
1066 MHz mem
Intel® Xeon® processor 5400 Series (Harpertown)
NHM: Intel® Core™ microarchitecture (Nehalem)
22
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts
23
Virtualization
• To get best virtualized
performance
Round Trip Virtualization Latency 1
– Have best native performance
– Reduce transitions to/from
virtual machine
– Reduce latency of transitions
• Intel® Core™ microprocessor
(Nehalem) virtualization
features
– Reduced latency for transitions
– Virtual Processor ID (VPID) to
reduce effective cost of
transitions
– Extended Page Table (EPT) to
reduce # of transitions
24
Relative Latency
100%
80%
60%
40%
20%
0%
Merom
Penryn
Nehalem
EPT Solution
EPT
CR3
Base Pointer
Guest
Linear
Address
Intel® 64
Page Tables
Guest
Physical
Address
EPT
Page Tables
Host
Physical
Address
• Intel® 64 Page Tables
– Map Guest Linear Address to Guest Physical Address
– Can be read and written by the guest OS
• New EPT Page Tables under VMM Control
– Map Guest Physical Address to Host Physical Address
– Referenced by new EPT base pointer
• No VM Exits due to Page Faults, INVLPG or CR3
accesses
25
Extending Performance and Energy Efficiency
- Intel® SSE4.2 Instruction Set Architecture (ISA) Leadership in 2008
Accelerated
String and Text
Processing
Faster XML parsing
Faster search and pattern
matching
Novel parallel data
matching and comparison
operations
Accelerated Searching
& Pattern Recognition
of Large Data Sets
Improved performance for
Genome Mining,
Handwriting recognition.
Fast Hamming distance /
Population count
New Communications
Capabilities
Hardware based CRC
instruction
Accelerated Network
attached storage
Improved power efficiency
for Software I-SCSI, RDMA,
and SCTP
SSE4
STTNI
(45nm CPUs)
SSE4.1
(Penryn Core)
ATA
SSE4.2
(Nehalem Core)
STTNI
e.g. XML
acceleration
ATA
(Application
Targeted
Accelerators)
POPCNT
e.g. Genome
Mining
What should the applications, OS and VMM vendors do?:
Understand the benefits & take advantage of new instructions in 2008.
Provide us feedback on instructions ISV would like to see for
next generation of applications
26
CRC32
e.g. iSCSI
Application
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts
27
Intel® Core™ Microarchitecture (Nehalem)
Design Goals
World class performance combined with superior energy efficiency – Optimized for:
Dynamically scaled
performance when
Single Thread
Multi-threads
Existing Apps
Emerging Apps
All Usages
needed to maximize
energy efficiency
A single, scalable, foundation optimized across each segment and power envelope
Workstation / Server
Desktop / Mobile
A Dynamic and Design Scalable
Microarchitecture
28
Power Control Unit
Vcc
BCLK
Core
PLL
Vcc
Freq.
Sensors
Integrated proprietary
microcontroller
PLL
Core
Vcc
Freq.
Sensors
PLL
Core
PCU
Vcc
Freq.
Sensors
PLL
Core
Uncore,
LLC
Vcc
Freq.
Sensors
PLL
29
Shifts control from
hardware to embedded
firmware
Real time sensors for
temperature, current,
power
Flexibility enables
sophisticated
algorithms, tuned for
current operating
conditions
Minimizing Idle Power Consumption
• Operating system notifies CPU when no tasks are
ready for execution
– Execution of MWAIT instruction
• MWAIT arguments hint at expected idle duration
Idle Power (W)
– Higher numbered C-states
lower power, but also
longer exit latency
• CPU idle states referred
to as “C-States”
C0
C1
Cn
Exit Latency (us)
30
C6 on Intel® Core™ Microarchitecture
(Nehalem)
31
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Core Power
0
Cores 0, 1, 2,
and 3 running
applications.
Core 3
Core 2
0
Core 1
0
Core 0
0
Time
32
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Core Power
Task completes. No work
waiting. OS executes
MWAIT(C6) instruction.
Core 3
0
Core 2
0
Core 1
0
Core 0
0
Time
33
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Core Power
0
Execution stops. Core
architectural state saved.
Core clocks stopped. Cores
0, 1, and 3 continue
execution undisturbed.
Core 3
Core 2
0
Core 1
0
Core 0
0
Time
34
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Core Power
0
Core power gate turned off.
Core voltage goes to 0.
Cores 0, 1, and 3 continue
execution undisturbed.
Core 3
Core 2
0
Core 1
0
Core 0
0
Time
35
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Core Power
Core 3
0
Core 2
0
0
Task completes. No work waiting.
OS executes MWAIT(C6)
instruction. Core 0 enters C6.
Cores 1 and 3 continue
execution undisturbed.
0
Time
36
Core 1
Core 0
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Core Power
0
Interrupt for Core 2 arrives. Core
2 returns to C0, execution
resumes at instruction following
MWAIT(C6). Cores 1 and 3
continue execution undisturbed.
Core 3
Core 2
0
Core 1
0
Core 0
0
Time
37
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Core Power
Core 3
0
0
0
Interrupt for Core 0 arrives. Power
gate turns on, core clock turns on,
core state restored, core resumes
execution at instruction following
MWAIT(C6). Cores 1, 2, and 3
continue execution undisturbed.
Core 2
Core 1
Core 0
0
Time
Core independent C6 on Intel Core
microarchitecture (Nehalem) extends benefits
38
Intel® Core™ Microarchitecture (Nehalem)based Processor
I
O
Q
P
I
0
Core
Core
Q
u
e
u
e
Core
Core
M
i
s
c
I
O
Q
P
I
Shared L3 Cache
1
• Significant logic outside core
– Integrated memory controller
– Large shared cache
– High speed interconnect
– Arbitration logic
QPI = Intel® QuickPath Interconnect (Intel® QPI)
39
Total CPU Power
Consumption
Cores (x N)
M
i
s
c
Memory Controller
Core
Clocks
and Logic
Core Clock
Distribution
Core
Leakage
Uncore Logic
I/O
Uncore Clock
Distribution
Uncore
Leakage
Intel® Core™ Microarchitecture
(Nehalem) Package C-State Support
Active CPU Power
Core
Clocks
and Logic
• All cores in C6 state:
Cores (x N)
– Core power to ~0
Core Clock
Distribution
Core
Leakage
Uncore Logic
I/O
Uncore Clock
Distribution
Uncore
Leakage
40
Intel® Core™ Microarchitecture
(Nehalem) Package C-State Support
Active CPU Power
• All cores in C6 state:
– Core power to ~0
• Package to C6 state:
– Uncore logic stops toggling
Uncore Logic
I/O
Uncore Clock
Distribution
Uncore
Leakage
41
Intel® Core™ Microarchitecture
(Nehalem) Package C-State Support
Active CPU Power
• All cores in C6 state:
– Core power to ~0
• Package to C6 state:
– Uncore logic stops toggling
– I/O to lower power state
I/O
Uncore Clock
Distribution
Uncore
Leakage
42
Intel® Core™ Microarchitecture
(Nehalem) Package C-State Support
Active CPU Power
• All cores in C6 state:
– Core power to ~0
• Package to C6 state:
– Uncore logic stops toggling
– I/O to lower power state
– Uncore clock grids stopped
Substantial reduction in
idle CPU power
43
I/O
Uncore Clock
Distribution
Uncore
Leakage
Managing Active Power
• Operating system changes frequency as needed to
meet performance needs, minimize power
– Enhanced Intel SpeedStep® Technology
– Referred to as processor P-States
• PCU tunes voltage for given frequency, operating
conditions, and silicon characteristics
PCU automatically optimizes operating voltage
44
Turbo Mode: Key to Scalability Goal
• Intel® Core™ microarchitecture
(Nehalem) is a scalable architecture
Nehalem
– High frequency core for performance in
less constrained form factors
– Retain ability to use that frequency in
very small form factors
– Retain ability to use that frequency when
running lightly threaded or lower power
workloads
• Turbo utilizes available frequency:
– Maximizes both single-thread and multithread performance in the same part
Turbo Mode provides performance
when you need it
45
Turbo Mode Enabling
• Turbo Mode exposed as additional Enhanced Intel
SpeedStep® Technology operating point
– Operating system treats as any other P-state, requesting
Turbo Mode when it needs more performance
– Performance benefit comes from higher operating
frequency – no need to enable or tune software
• Turbo Mode is transparent to system
– Frequency transitions handled completely in hardware
– PCU keeps silicon within existing operating limits
– Systems designed to same specs, with or without Turbo
Mode
Performance benefits with
existing applications and operating systems
46
Summary
• Intel® Core™ microarchitecture (Nehalem) – The 45nm Tock
• Designed for
– Power Efficiency
– Scalability
– Performance
• Key Innovations:
– Enhanced Processor Core
– Brand New Platform Architecture
– Sophisticated Power Management
High Performance When You Need It
Lower Power When You Don’t
47
Q&A
48
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF
INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN
MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to
change without notice.
• Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which
may cause the product to deviate from published specifications. Current characterized errata are available on
request.
• Merom, Penryn, Hapertown, Nehalem, Dothan, Westmere, Sandy Bridge, and other code names featured are
used internally within Intel to identify products that are in development and not yet publicly announced for
release. Customers, licensees and other third parties are not authorized by Intel to use code names in
advertising, promotion or marketing of any product or services and any such use of Intel's internal code
names is at the sole risk of the user
• Performance tests and ratings are measured using specific computer systems and/or components and reflect
the approximate performance of Intel products as measured by those tests. Any difference in system
hardware or software design or configuration may affect actual performance.
• Intel, Intel Inside, Intel Core, Pentium, Intel SpeedStep Technology, and the Intel logo are trademarks of Intel
Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2008 Intel Corporation.
49
Risk Factors
This presentation contains forward-looking statements that involve a number of risks and uncertainties. These
statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other
similar transactions that may be completed in the future. The information presented is accurate only as of
today’s date and will not be updated. In addition to any factors discussed in the presentation, the important
factors that could cause actual results to differ materially include the following: Demand could be different
from Intel's expectations due to factors including changes in business and economic conditions, including
conditions in the credit market that could affect consumer confidence; customer acceptance of Intel’s and
competitors’ products; changes in customer order patterns, including order cancellations; and changes in the
level of inventory at customers. Intel’s results could be affected by the timing of closing of acquisitions and
divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of
costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and
difficult to forecast. Revenue and the gross margin percentage are affected by the timing of new Intel product
introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's
competitors, including product offerings and introductions, marketing programs and pricing pressures and
Intel’s response to such actions; Intel’s ability to respond quickly to technological developments and to
incorporate new features into its products; and the availability of sufficient supply of components from
suppliers to meet demand. The gross margin percentage could vary significantly from expectations based on
changes in revenue levels; product mix and pricing; capacity utilization; variations in inventory valuation,
including variations related to the timing of qualifying products for sale; excess or obsolete inventory;
manufacturing yields; changes in unit costs; impairments of long-lived assets, including manufacturing,
assembly/test and intangible assets; and the timing and execution of the manufacturing ramp and associated
costs, including start-up costs. Expenses, particularly certain marketing and compensation expenses, vary
depending on the level of demand for Intel's products, the level of revenue and profits, and impairments of
long-lived assets. Intel is in the midst of a structure and efficiency program that is resulting in several actions
that could have an impact on expected expense levels and gross margin. Intel's results could be impacted by
adverse economic, social, political and physical/infrastructure conditions in the countries in which Intel, its
customers or its suppliers operate, including military conflict and other security risks, natural disasters,
infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel's results could be
affected by adverse effects associated with product defects and errata (deviations from published
specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer,
antitrust and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. A
detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings,
including the report on Form 10-Q for the quarter ended June 28, 2008.
50
Backup Slides
51
Intel® Core™ Microarchitecture (Nehalem)
Design Goals
World class performance combined with superior energy efficiency – Optimized for:
Dynamically scaled
performance when
Single Thread
Multi-threads
Existing Apps
Emerging Apps
All Usages
needed to maximize
energy efficiency
A single, scalable, foundation optimized across each segment and power envelope
Workstation / Server
Desktop / Mobile
A Dynamic and Design Scalable
Microarchitecture
52
Tick-Tock Development Model
Merom1
Penryn
Nehalem
Westmere
Sandy
Bridge
NEW
Microarchitecture
NEW
Process
NEW
Microarchitecture
NEW
Process
NEW
Microarchitecture
65nm
45nm
45nm
32nm
32nm
TOCK
TICK
TOCK
TICK
TOCK
Forecast
1Intel®
Core™ microarchitecture (formerly Merom)
45nm next generation Intel® Core™ microarchitecture (Penryn)
Intel® Core™ Microarchitecture (Nehalem)
Intel® Microarchitecture (Westmere)
53
Intel®
Microarchitecture (Sandy Bridge)
All dates, product descriptions, availability and plans are forecasts and subject to
change without notice.
53
Enhanced Processor Core
Instruction Fetch and
Pre Decode
ITLB
32kB
Instruction Cache
Instruction Queue
Front End
Execution
Engine
Memory
Decode
4
2nd Level TLB
Rename/Allocate
Retirement Unit
(ReOrder Buffer)
2nd
4
Reservation Station
6
Execution Units
DTLB
32kB
Data Cache
54
256kB
Level Cache
L3 and beyond
Front-end
• Responsible for feeding the compute engine
– Decode instructions
– Branch Prediction
• Key Intel® Core™2 microarchitecture Features
– 4-wide decode
– Macrofusion
– Loop Stream Detector
Instruction Fetch and
Pre Decode
ITLB
Instruction Queue
Decode
55
32kB
Instruction Cache
Loop Stream Detector Reminder
• Loops are very common in most software
• Take advantage of knowledge of loops in HW
– Decoding the same instructions over and over
– Making the same branch predictions over and over
• Loop Stream Detector identifies software loops
– Stream from Loop Stream Detector instead of normal path
– Disable unneeded blocks of logic for power savings
– Higher performance by removing instruction fetch limitations
Intel® Core™2 Loop Stream Detector
Branch
Prediction
Loop
Fetch
Stream
Detector
18
Instructions
56
Decode
Branch Prediction Reminder
• Goal: Keep powerful compute engine fed
• Options:
– Stall pipeline while determining branch direction/target
– Predict branch direction/target and correct if wrong
• Minimize amount of time wasted correcting from
incorrect branch predictions
– Performance:
– Through higher branch prediction accuracy
– Through faster correction when prediction is wrong
– Power efficiency: Minimize number of
speculative/incorrect micro-ops that are executed
Continued focus on branch
prediction improvements
57
L2 Branch Predictor
• Problem: Software with a large code footprint not
able to fit well in existing branch predictors
– Example: Database applications
• Solution: Use multi-level branch prediction scheme
• Benefits:
– Higher performance through improved branch prediction
accuracy
– Greater power efficiency through less mis-speculation
58
Advanced Renamed Return Stack Buffer (RSB)
• Instruction Reminder
– CALL: Entry into functions
– RET: Return from functions
• Classical Solution
– Return Stack Buffer (RSB) used to predict RET
– RSB can be corrupted by speculative path
• The Renamed RSB
– No RET mispredicts in the common case
59
Execution Engine
• Responsible for:
– Scheduling operations
– Executing operations
• Powerful Intel® Core™2 microarchitecture
execution engine
– Dynamic 4-wide Execution
– Intel® Advanced Digital Media Boost
– 128-bit wide SSE
– Super Shuffler (45nm next generation Intel® Core™
microarchitecture (Penryn))
60
Intel® Smart Cache – Core Caches
• New 3-level Cache Hierarchy
• 1st level caches
– 32kB Instruction cache
– 32kB, 8-way Data Cache
Core
– Support more L1 misses in parallel than
Intel® Core™2 microarchitecture
32kB L1
32kB L1
Data Cache Inst. Cache
• 2nd level Cache
– New cache introduced in Intel® Core™
microarchitecture (Nehalem)
– Unified (holds code and data)
– 256 kB per core (8-way)
– Performance: Very low latency
– 10 cycle load-to-use
– Scalability: As core count increases,
reduce pressure on shared cache
61
256kB
L2 Cache
New TLB Hierarchy
• Problem: Applications continue to grow in data size
• Need to increase TLB size to keep the pace for performance
• Nehalem adds new low-latency unified 2nd level TLB
# of Entries
1st Level Instruction TLBs
Small Page (4k)
128
Large Page (2M/4M)
7 per thread
1st Level Data TLBs
Small Page (4k)
64
Large Page (2M/4M)
32
New 2nd Level Unified TLB
Small Page Only
512
62
Fast Unaligned Cache Accesses
• Two flavors of 16-byte SSE loads/stores exist
– Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary
– Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement
• Prior to Intel® Core™ microarchitecture (Nehalem)
– Optimized for Aligned instructions
– Unaligned instructions slower, lower throughput -- Even for aligned accesses!
– Required multiple uops (not energy efficient)
– Compilers would largely avoid unaligned load
– 2-instruction sequence (MOVSD+MOVHPD) was faster
• Intel Core microarchitecture (Nehalem) optimizes Unaligned instructions
– Same speed/throughput as Aligned instructions on aligned accesses
– Optimizations for making accesses that cross 64-byte boundaries fast
– Lower latency/higher throughput than Core 2
– Aligned instructions remain fast
• No reason to use aligned instructions on Intel Core microarchitecture (Nehalem)!
• Benefits:
– Compiler can now use unaligned instructions without fear
– Higher performance on key media algorithms
– More energy efficient than prior implementations
63
Faster Synchronization Primitives
LOCK CMPXCHG Perform ance
Relative Latency
• Multi-threaded software
becoming more prevalent
• Scalability of multi-thread
applications can be limited by
synchronization
• Synchronization primitives:
LOCK prefix, XCHG
• Reduce synchronization
latency for legacy software
1
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
Pentium 4
Core 2
Nehalem
Greater thread scalability with Nehalem
1Intel®
Pentium® 4 processor
Intel® Core™2 Duo processor
Intel® Core™ microarchitecture (Nehalem)-based processor
64
Intel® Core™ Microarchitecture (Nehalem) SMT
Implementation Details
Policy
Description
Intel® Core™
Microarchitecture
(Nehalem) Examples
Replicated
Duplicate logic per thread
Register State
Renamed RSB
Large Page ITLB
Partitioned
Statically allocated
between threads
Load Buffer
Store Buffer
Reorder Buffer
Small Page ITLB
Competitively Shared
Depends on thread’s
dynamic behavior
Reservation Station
Caches
Data TLB
2nd level TLB
Unaware
No SMT impact
Execution units
SMT efficient due to
minimal replication of logic
65
Feeding the Execution Engine
• Powerful 4-wide dynamic execution engine
• Need to keep providing fuel to the execution engine
• Intel® Core™ Microarchitecture (Nehalem) Goals
– Low latency to retrieve data
– Keep execution engine fed w/o stalling
– High data bandwidth
– Handle requests from multiple cores/threads seamlessly
– Scalability
– Design for increasing core counts
• Combination of great cache hierarchy and new platform
Intel® Core™ microarchitecture (Nehalem)
designed to feed the execution engine
66
Inclusive vs. Exclusive Caches –
Cache Miss
Exclusive
Inclusive
L3 Cache
Core
0
Core
1
Core
2
L3 Cache
Core
0
Core
3
Core
1
Core
2
Core
3
Data request from Core 0 misses Core 0’s L1 and L2
Request sent to the L3 cache
67
Inclusive vs. Exclusive Caches –
Cache Miss
Exclusive
Core
0
Inclusive
L3 Cache
L3 Cache
MISS!
MISS!
Core
1
Core
2
Core
0
Core
3
Core 0 looks up the L3 Cache
Data not in the L3 Cache
68
Core
1
Core
2
Core
3
Inclusive vs. Exclusive Caches –
Cache Miss
Exclusive
Core
0
Inclusive
L3 Cache
L3 Cache
MISS!
MISS!
Core
1
Core
2
Core
0
Core
3
Core
1
Core
2
Core
3
Guaranteed data is not on-die
Must check other cores
Greater scalability from inclusive approach
69
Inclusive vs. Exclusive Caches –
Cache Hit
Exclusive
Inclusive
L3 Cache
L3 Cache
HIT!
HIT!
Core
0
Core
1
Core
2
Core
0
Core
3
No need to check other cores
Core
1
Core
2
Core
3
Data could be in another core BUT
Intel® CoreTM microarchitecture
(Nehalem) is smart…
70
Inclusive vs. Exclusive Caches –
Cache Hit
Inclusive
• Maintain a set of “core
valid” bits per cache line
in the L3 cache
• Each bit represents a core
• If the L1/L2 of a core may
contain the cache line,
then core valid bit is set to
“1”
• No snoops of cores are
needed if no bits are set
• If more than 1 bit is set,
line cannot be in Modified
state in any core
L3 Cache
HIT! 0 0 0 0
Core
0
Core
1
Core
2
Core valid bits limit
unnecessary snoops
71
Core
3
Inclusive vs. Exclusive Caches –
Read from other core
Exclusive
Core
0
Inclusive
L3 Cache
L3 Cache
MISS!
HIT! 0 0 1 0
Core
1
Core
2
Core
0
Core
3
Core
1
Core
2
Core
3
Only need to check the core
whose core valid bit is set
Must check all other cores
72
Local Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from its DRAM
– CPU0 snoops CPU1 to check if data is present
• Step 2:
– DRAM returns data
– CPU1 returns snoop response
• Local memory latency is the maximum latency of the two responses
• Intel® Core™ microarchitecture (Nehalem) optimized to keep key latencies
close to each other
DRAM
CPU0
Intel®
QPI
Intel® QPI = Intel® QuickPath Interconnect
73
CPU1
DRAM
Remote Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from CPU1
– Request sent over Intel® QuickPath Interconnect (Intel® QPI) to
CPU1
– CPU1’s IMC makes request to its DRAM
– CPU1 snoops internal caches
– Data returned to CPU0 over Intel QPI
• Remote memory latency a function of having a low latency
interconnect
DRAM
CPU0
Intel®
QPI
74
CPU1
DRAM
Hardware Prefetching (HWP)
• HW Prefetching critical to hiding memory latency
• Structure of HWPs similar as in Intel® Core™2
microarchitecture
– Algorithmic improvements in Intel® Core™ microarchitecture
(Nehalem) for higher performance
• L1 Prefetchers
– Based on instruction history and/or load address pattern
• L2 Prefetchers
– Prefetches loads/RFOs/code fetches based on address pattern
– Intel Core microarchitecture (Nehalem) changes:
– Efficient Prefetch mechanism
– Remove the need for Intel® Xeon® processors to disable HWP
– Increase prefetcher aggressiveness
– Locks on address streams quicker, adapts to change faster, issues more
prefetchers more aggressively (when appropriate)
75
Today’s Platform Architecture
76
Intel® QuickPath Interconnect
• Intel® Core™ microarchitecture
(Nehalem) introduces new
Intel® QuickPath Interconnect
(Intel® QPI)
• High bandwidth, low latency
point to point interconnect
• Up to 6.4 GT/sec initially
– 6.4 GT/sec -> 12.8 GB/sec
– Bi-directional link -> 25.6
GB/sec per link
– Future implementations at even
higher speeds
• Highly scalable for systems
with varying # of sockets
Nehalem
EP
Nehalem
EP
IOH
memory
memory
CPU
CPU
CPU
CPU
IOH
Intel® CoreTM microarchitecture (Nehalem-EP)
77
memory
memory
Integrated Memory Controller (IMC)
• Memory controller optimized per
market segment
• Initial Intel® Core™
microarchitecture (Nehalem)
products
–
–
–
–
–
–
Native DDR3 IMC
Up to 3 channels per socket
Massive memory bandwidth
Designed for low latency
Support RDIMM and UDIMM
RAS Features
DDR3
• Future products
Nehalem
EP
Nehalem
EP
DDR3
Tylersburg
EP
– Scalability
– Vary # of memory channels
– Increase memory speeds
– Buffered and Non-Buffered solutions
– Market specific needs
– Higher memory capacity
– Integrated graphics
Significant performance through new IMC
Intel® Core™ microarchitecture (Nehalem-EP)
Intel® Next Generation Server Processor Technology (Tylersburg-EP)
78
Memory Latency Comparison
•
•
•
•
Low memory latency critical to high performance
Design integrated memory controller for low latency
Need to optimize both local and remote memory latency
Intel® Core™ microarchitecture (Nehalem) delivers
– Huge reduction in local memory latency
– Even remote memory latency is fast
• Effective memory latency depends per application/OS
– Percentage of local vs. remote accesses
– Intel Core microarchitecture (Nehalem) has lower latency regardless of mix
Relative Memory Latency
Relative Memory Latency Comparison
1.00
0.80
0.60
0.40
0.20
0.00
Harpertow n (FSB 1600)
1Next
1
Nehalem (DDR3-1067) Local
generation Quad-Core Intel® Xeon® processor (Harpertown)
Intel® CoreTM microarchitecture (Nehalem)
79
Nehalem (DDR3-1067) Remote
Latency of Virtualization Transitions
• Microarchitectural
Round Trip Virtualization Latency 1
– Huge latency reduction
generation over generation
– Nehalem continues the
trend
Relative Latency
100%
• Architectural
– Virtual Processor ID (VPID)
added in Intel® Core™
microarchitecture
(Nehalem)
– Removes need to flush
TLBs on transitions
80%
60%
40%
20%
0%
Merom
Penryn
Nehalem
Higher Virtualization Performance Through
Lower Transition Latencies
1Intel®
Core™ microarchitecture (formerly Merom)
45nm next generation Intel® Core™ microarchitecture (Penryn)
Intel® Core™ microarchitecture (Nehalem)
80
Extended Page Tables (EPT) Motivation
• A VMM needs to protect
physical memory
VM1
Guest OS
• Multiple Guest OSs share
the same physical memory
• Protections are
implemented through
page-table virtualization
CR3
Guest Page Table
Guest page table changes
cause exits into the VMM
VMM
CR3
• Page table virtualization
accounts for a
significant portion of
virtualization overheads
Active Page Table
VMM maintains the active
page table, which is used
by the CPU
• VM Exits / Entries
• The goal of EPT is to
reduce these overheads
81
STTNI - STring & Text New Instructions
Operates on strings of bytes or words (16b)
Equal Each Instruction
True for each character in Src2 if
same position in Src1 is equal
Src1:
Test\tday
Src2:
tad tseT
Mask:
01101111
STTNI MODEL
Source2 (XMM / M128)
Equal Any Instruction
Ranges Instruction
True for each character in Src2 if
any character in Src1 matches
Src1:
Example\n
Src2:
atad tsT
Mask:
10100000
True if a character in Src2 is in
at least one of up to 8 ranges
in Src1
Src1:
AZ’0’9zzz
Src2:
taD tseT
Mask:
00100001
Equal Ordered Instruction
Finds the start of a substring (Src1)
within another string (Src2)
Src1:
ABCA0XYZ
Src2:
S0BACBAB
Mask:
00000010
Source1 (XMM)
Bit 0
T
e
s
t
\t
d
a
y
t ad t s eT
xxxxxxxT
xxxxxxTx
xxxxxTxx
xxxxTxxx
xxxFxxxx
xxTxxxxx
xTxxxxxx
Fxxxxxxx
01101111
Projected 3.8x kernel speedup on XML parsing &
2.7x savings on instruction cycles
82
Bit 0
Check
each bit
in the
diagonal
IntRes1
STTNI Model
EQUAL EACH
EQUAL ANY
Source2 (XMM / M128)
Source1 (XMM)
Bit 0
E
x
a
m
p
l
e
\n
a
F
F
T
F
F
F
F
F
t ad
FFFF
FFFF
FTFF
FFFF
FFFF
FFFF
FFFF
FFFF
t
F
F
F
F
F
F
F
F
sT
FF
FF
FF
FF
FF
FF
FF
FF
Bit 0
OR results down
each column
Bit 0
Source1 (XMM)
Source2 (XMM / M128)
10100000
T
e
s
t
\t
d
a
y
IntRes1
RANGES
t aD t s e T
A FFTFFFTT
Z FFTTFFFT
‘0’ T T T F T T T T
9 FFFTFFFF
z FFFFFFFF
z TTTTTTTT
z FFFFFFFF
z TTTTTTTT
00100001
Check
each bit in
the
diagonal
IntRes1
Source2 (XMM / M128)
Bit 0
S0BACBAB
Bit 0
•First Compare
does GE, next
does LE
Bit 0
•AND GE/LE pairs of results
•OR those results
IntRes1
83
Source1 (XMM)
Source1 (XMM)
Bit 0
Bit 0
EQUAL ORDERED
Source1 (XMM)
Source2 (XMM / M128)
t ad t seT
xxxxxxxT
xxxxxxTx
xxxxxTxx
xxxxTxxx
xxxFxxxx
xxTxxxxx
xTxxxxxx
Fxxxxxxx
01101111
A
B
C
A
0
X
Y
Z
fF fF F T F
fF fF T F F
fF fF F F T
fF fF F T F
fTfTfTfT x
fTfTfT x x
fTfT x x x
F
T
F
x
x
x
x
T
F
x
x
x
x
x
F
x
x
x
x
x
x
fT x x x x x x x
00000010
Bit 0
AND the results
AND the along
results
each
along each
diagonal
diagonal
IntRes1
ATA - Application Targeted Accelerators
CRC32
POPCNT
POPCNT determines the number of nonzero
Accumulates a CRC32 value using the iSCSI polynomial
bits in the source.
63
DST
SRC
32 31
X
63
0
0 1 0
Old CRC
10
Bit
... 0 0 1 1
RAX
Data 8/16/32/64 bit

0
63
DST
32 31
0
0x3
0
RBX
New CRC
=? 0
One register maintains the running CRC value as a
software loop iterates over data.
Fixed CRC polynomial = 11EDC6F41h
0
ZF
POPCNT is useful for speeding up fast matching in
data mining workloads including:
• DNA/Genome Matching
• Voice Recognition
Replaces complex instruction sequences for CRC in
Upper layer data protocols:
• iSCSI, RDMA, SCTP
ZFlag set if result is zero. All other flags (C,S,O,A,P)
reset
Enables enterprise class data assurance with high data rates
in networked storage in any user environment.
84
Tools Support of New Instructions
•
Intel® Compiler 10.x supports the new instructions
–
–
–
–
•
Intel® XML Software Suite
–
–
–
•
SSE4.2 supported via intrinsics
Inline assembly supported on IA-32 only
Necessary to include required header files in order to access intrinsics
VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions
Sun Studio Express* 7/08
–
–
–
•
High performance C++ and Java runtime libraries
Version 1.0 (C++), version 1.01 (Java) available now
Version 1.1 w/SSE4.2 optimizations planned for September 2008
Microsoft Visual Studio* 2008 VC++
–
–
–
–
•
Nehalem specific compiler optimizations
SSE4.2 supported via vectorization and intrinsics
Inline assembly supported on both IA-32 and Intel® 64 architecture targets
Necessary to include required header files in order to access intrinsics
Supports Intel® CoreTM microarchitecture (Merom), 45nm next generation Intel® Core™ microarchitecture (Penryn), Intel® CoreTM
microarchitecture (Nehalem)
SSE4.1, SSE4.2 through intrinsics
Nehalem specific compiler optimizations
GCC* 4.3.1
–
–
–
Support Intel Core microarchitecture (Merom), 45nm next generation Intel Core microarchitecture (Penryn), Intel Core
microarchitecture (Nehalem)
via –mtune=generic.
Support SSE4.1 and SSE4.2 through vectorizer and intrinsics
Broad Software Support for Intel®
Core™ Microarchitecture (Nehalem)
85
Software Optimization Guidelines
• Most optimizations for Intel® Core™
microarchitecture still hold
• Examples of new optimization guidelines:
– 16-byte unaligned loads/stores
– Enhanced macrofusion rules
– NUMA optimizations
• Intel® Core™ microarchitecture (Nehalem) SW
Optimization Guide will be published
• Intel® Compiler will support settings for Intel Core
microarchitecture (Nehalem) optimizations
86
Example Code For strlen()
string equ
[esp + 4]
mov
ecx,string
; ecx -> string
short
test ecx,3
; test if string isjealigned
onbyte_2
32 bits
test eax,0ff000000h
je
short main_loop
; is it byte 3
str_misaligned:
je
short byte_3
; simple byte loop until string is aligned
jmp
short main_loop
mov
al,byte ptr [ecx]
; taken if bits 24-30 are clear and bit
add
ecx,1
; 31 is set
test al,al
byte_3:
je
short byte_3
lea
eax,[ecx - 1]
test ecx,3
mov ecx,string
jne
short str_misaligned
sub
add
eax,dword ptr 0
; 5 byte nop to
aligneax,ecx
label below
ret
align 16
; should be redundant
byte_2:
main_loop:
lea
eax,[ecx - 2]
mov
eax,dword ptr [ecx]
; read 4 bytes
mov
ecx,string
mov
edx,7efefeffh
sub
eax,ecx
add
edx,eax
ret
xor
eax,-1
byte_1:
xor
eax,edx
lea
eax,[ecx - 3]
add
ecx,4
mov
ecx,string
test eax,81010100h
sub
eax,ecx
je
short main_loop
ret
; found zero byte in the loop
byte_0:
mov
eax,[ecx - 4]
lea
eax,[ecx - 4]
test al,al
; is it byte 0
mov
ecx,string
je
short byte_0
sub
eax,ecx
test ah,ah
; is it byte 1
ret
je
short byte_1
test eax,00ff0000h ; is it byte 2 strlen endp
end
Current Code:
STTNI Version
int sttni_strlen(const char * src)
{
char eom_vals[32] = {1, 255, 0};
__asm{
mov
eax, src
movdqu
xor
xmm2, eom_vals
ecx, ecx
topofloop:
add
eax, ecx
movdqu
xmm1, OWORD PTR[eax]
pcmpistri
xmm2, xmm1, imm8
jnz
topofloop
endofstring:
add
eax, ecx
sub
ret
eax, src
}
Minimum of 11 instructions; Inner loop processes 4 bytes with 8 instructions
}
STTNI Code: Minimum of 10 instructions; A single inner loop processes 16 bytes with only 4
instructions
87
CRC32 Preliminary Performance
CRC32 optimized Code
 Preliminary tests involved Kernel code implementing
CRC algorithms commonly used by iSCSI drivers.
crc32c_sse42_optimized_version(uint32 crc, unsigned
char const *p, size_t len)
 32-bit and 64-bit versions of the Kernel under test
{ // Assuming len is a multiple of 0x10
 32-bit version processes 4 bytes of data using
1 CRC32 instruction
asm("pusha");
asm("mov %0, %%eax" :: "m" (crc));
asm("mov %0, %%ebx" :: "m" (p));
 64-bit version processes 8 bytes of data using
1 CRC32 instruction
asm("mov %0, %%ecx" :: "m" (len));
asm("1:");
 Input strings of sizes 48 bytes and 4KB used for the
test
// Processing four byte at a time: Unrolled four times:
asm("crc32 %eax, 0x0(%ebx)");
asm("crc32 %eax, 0x4(%ebx)");
32 - bit
64 - bit
Input
Data
Size =
48 bytes
6.53 X
9.85 X
Input
Data
Size = 4
KB
9.3 X
18.63 X
asm("crc32 %eax, 0x8(%ebx)");
asm("crc32 %eax, 0xc(%ebx)");
asm("add $0x10, %ebx")2;
asm("sub $0x10, %ecx");
asm("jecxz 2f");
asm("jmp 1b");
asm("2:");
asm("mov %%eax, %0" : "=m" (crc));
asm("popa");
return crc;
}}
Preliminary Results show CRC32 instruction out-performing
the fastest CRC32C software algorithm by a big margin
88
Idle Power Matters
• Data center operating costs1
– 41M physical servers by 2010, average utilization < 10%
– $0.50 spent on power and cooling for every $1 spent on
server hardware
• Regulatory requirements affect all segments
– ENERGY STAR* and related requirements
• Environmental responsibility
Idle power consumption not just mobile concern
1.
IDC’s Datacenter Trends Survey, January 2007
89
CPU Core Power Consumption
• High frequency processes are leaky
– Reduced via high-K metal gate process,
design technologies, manufacturing
optimizations
90
Leakage
CPU Core Power Consumption
• High frequency designs require high
performance global clock distribution
• High frequency processes are leaky
– Reduced via high-K metal gate process,
design technologies, manufacturing
optimizations
91
Clock
Distribution
Leakage
CPU Core Power Consumption
Total Core Power
Consumption
• Remaining power in logic, local clocks
– Power efficient microarchitecture, good
clock gating minimize waste
• High frequency designs require high
performance global clock distribution
• High frequency processes are leaky
– Reduced via high-K metal gate process,
design technologies, manufacturing
optimizations
Local
Clocks
and Logic
Clock
Distribution
Leakage
Challenge – Minimize power when idle
92
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
Active Core Power
• C0: CPU active state
Local
Clocks
and Logic
Clock
Distribution
Leakage
93
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
• Stop core pipeline
• Stop most core clocks
Active Core Power
Local
Clocks
and Logic
Clock
Distribution
Leakage
94
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
Active Core Power
• Stop core pipeline
• Stop most core clocks
• C3 state (mid 1990s):
• Stop remaining core clocks
Clock
Distribution
Leakage
95
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
Active Core Power
• Stop core pipeline
• Stop most core clocks
• C3 state (mid 1990s):
• Stop remaining core clocks
• C4, C5, C6 states (mid 2000s):
• Drop core voltage, reducing leakage
• Voltage reduction via shared VR
Leakage
Existing C-states significantly reduce idle power
96
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• Cores share a single voltage plane
– All cores must be idle before voltage reduced
– Independent VR’s per core prohibitive from cost and form
factor perspective
• Deepest C-states have relatively long exit latencies
– System / VR handshake, ramp voltage, restore state,
restart pipeline, etc.
Deepest C-states available in mobile products
97
Intel® Core™ Microarchitecture
(Nehalem) Core C-State Support
Active Core Power
• C0: CPU active state
Local
Clocks
and Logic
Clock
Distribution
Leakage
98
Intel® Core™ Microarchitecture
(Nehalem) Core C-State Support
Active Core Power
• C0: CPU active state
• C1 state:
• Stop core pipeline
• Stop most core clocks
Local
Clocks
and Logic
Clock
Distribution
Leakage
99
Intel® Core™ Microarchitecture
(Nehalem) Core C-State Support
Active Core Power
• C0: CPU active state
• C1 state:
• Stop core pipeline
• Stop most core clocks
• C3 state:
• Stop remaining core clocks
Clock
Distribution
Leakage
100
Intel® Core™ Microarchitecture
(Nehalem) Core C-State Support
Active Core Power
• C0: CPU active state
• C1 state:
• Stop core pipeline
• Stop most core clocks
• C3 state:
• Stop remaining core clocks
• C6 state:
• Processor saves architectural state
• Turn off power gate, eliminating
leakage
Core idle power goes to ~0
101
Leakage
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)
102
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)
Core Power
Cores running
applications.
Core 1
0
Core 0
0
Time
103
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)
Core Power
Task completes. No work
waiting. OS executes
MWAIT(C6) instruction.
Core 1
0
Core 0
0
Time
104
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)
Core Power
Execution stops. Core
architectural state saved.
Core clocks stopped. Core
0 continues execution
undisturbed.
Core 1
0
Core 0
0
Time
105
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)
Core Power
Core 1
0
Task completes. No work
waiting. OS executes
MWAIT(C6) instruction.
Core enters C6.
Core 0
0
Time
106
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)
Core Power
Core 1
0
VR voltage
reduced. Power
drops.
Core 0
0
Time
107
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)
Core Power
Interrupt for Core 1 arrives. VR
voltage increased. Core 1 clocks
turn on, core state restored, and
core resumes execution at
instruction following MWAIT(C6).
Cores 0 remains idle.
Core 1
0
Core 0
0
Time
108
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)
Core Power
Core 1
0
Interrupt for Core 0 arrives. Core 0
returns to C0 and resumes execution at
instruction following MWAIT(C6). Core
1 continues execution undisturbed.
Core 0
0
C6 significantly reducesTime
idle power consumption
109
Reducing Platform Idle Power
• Dramatic improvements in CPU idle power increase
importance of platform improvements
• Memory power:
– Memory clocks stopped between requests at low utilization
– Memory to self refresh in package C3, C6
• Link power:
– Intel® QuickPath Interconnect links to lower power states
as CPU becomes less active
– PCI Express* links on chipset have similar behavior
• Hint to VR to reduce phases during periods of low
current demand
Intel® Core™ microarchitecture (Nehalem)
reduces CPU and platform power
110
Intel® Core™ Microarchitecture
(Nehalem): Integrated Power Gate
VCC
• Integrated power switch
between VR output and
core voltage supply
–
–
–
•
Very low on-resistance
Very high off-resistance
Much faster voltage ramp
than external VR
Core0
Core1
Core2
Memory System, Cache, I/O
Enables per core C6 state
–
–
Individual cores transition to
~0 power state
Transparent to other cores,
platform, software, and VR
Close collaboration with process technology
to optimize device characteristics
111
111
Core3
VTT
Agenda
• Intel® Core™ microarchitecture (Nehalem)
power management overview
• Minimizing idle power consumption
• Performance when you need it
112
Turbo Mode Before Intel® Core™
Microarchitecture (Nehalem)
Clock Stopped
Power reduction in
inactive cores
113
113
Core 0
Core 1
Workload Lightly Threaded
Frequency (F)
Core 0
Core 1
Frequency (F)
No Turbo
Turbo Mode Before Intel® Core™
Microarchitecture (Nehalem)
Clock Stopped
Turbo Mode
Power reduction in
inactive cores
In response to workload
adds additional performance
bins within headroom
114
114
Core 0
Workload Lightly Threaded
Frequency (F)
Core 0
Core 1
Frequency (F)
No Turbo
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode
Power Gating
Zero power for inactive
cores
115
115
Core 0
Core 1
Core 2
Core 3
Workload Lightly Threaded
or < TDP
Frequency (F)
Core 0
Core 1
Core 2
Core 3
Frequency (F)
No Turbo
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode
Power Gating
Turbo Mode
Zero power for inactive
cores
In response to workload
adds additional performance
bins within headroom
116
116
Core 0
Core 1
Workload Lightly Threaded
or < TDP
Frequency (F)
Core 0
Core 1
Core 2
Core 3
Frequency (F)
No Turbo
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode
Power Gating
Turbo Mode
Zero power for inactive
cores
In response to workload
adds additional performance
bins within headroom
117
117
Core 0
Core 1
Workload Lightly Threaded
or < TDP
Frequency (F)
Core 0
Core 1
Core 2
Core 3
Frequency (F)
No Turbo
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode
Active cores running
workloads < TDP
118
118
Core 0
Core 1
Core 2
Core 3
Workload Lightly Threaded
or < TDP
Frequency (F)
Core 0
Core 1
Core 2
Core 3
Frequency (F)
No Turbo
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode
Active cores running
workloads < TDP
Turbo Mode
In response to workload
adds additional performance
bins within headroom
119
119
Core 0
Core 1
Core 2
Core 3
Workload Lightly Threaded
or < TDP
Frequency (F)
Core 0
Core 1
Core 2
Core 3
Frequency (F)
No Turbo
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode
Power Gating
Turbo Mode
Zero power for inactive
cores
In response to workload
adds additional performance
bins within headroom
Core 0
Core 1
Core 2
Core 3
Workload Lightly Threaded
or < TDP
Frequency (F)
Core 0
Core 1
Core 2
Core 3
Frequency (F)
No Turbo
Dynamically Delivering Optimal Performance
and Energy Efficiency
120
120
Additional Sources of
Information on This Topic:
• Other Sessions / Chalk Talks / Labs:
– TCHS001: Next Generation Intel® Core™ Microarchitecture (Nehalem) Family of
Processors: Screaming Performance, Efficient Power (8/19, 3:00 – 3:50)
– DPTS001: High End Desktop Platform Design Overview for the Next Generation
Intel® Microarchitecture (Nehalem) Processor (8/20, 2:40 – 3:30)
– NGMS001: Next Generation Intel® Microarchitecture (Nehalem) Family:
Architectural Insights and Power Management (8/19, 4:00 – 5:50)
– NGMC001: Chalk Talk: Next Generation Intel® Microarchitecture (Nehalem)
Family (8/19, 5:50 – 6:30)
– NGMS002: Tuning Your Software for the Next Generation Intel®
Microarchitecture (Nehalem) Family (8/20, 11:10 – 12:00)
– PWRS003: Power Managing the Virtual Data Center with Windows Server* 2008
/ Hyper-V and Next Generation Processor-based Intel® Servers Featuring Intel®
Dynamic Power Technology (8/19, 3:00 – 3:50)
– PWRS005: Platform Power Management Options for Intel® Next Generation
Server Processor Technology (Tylersburg-EP) (8/21, 1:40 – 2:30)
– SVRS002: Overview of the Intel® QuickPath Interconnect (8/21, 11:10 –
12:00)
121
Session Presentations - PDFs
The PDF for this Session presentation is
available from our IDF Content Catalog at the
end of the day at:
www.intel.com/idf
or
https://intel.wingateweb.com/US08/scheduler/public.jsp
122
Please Fill out the
Session Evaluation Form
Place form in evaluation box
at the back of session room
Thank you for your input, we use it to improve
future Intel Developer Forum events
123
Download