Intel Pentium M

Intel Pentium M
Outline



History
P6 Pipeline in detail
New features





Improved Branch Prediction
Micro-ops fusion
SpeedStep technology
Thermal Throttle 2
Power and Performance
Quick Review of x86










8080 - 8-bit
8086/8088 - 16-bit (8088 had 8-bit external data bus)
- segmented memory model
286
- introduction of protected mode, which included:
segment limit checking, privilege levels, read-only and execute-only segment options
386 - 32-bit
- segmented and flat memory model
- paging
486 - first pipeline
- expanded the 386's ID and EX units into five-stage pipeline
- first to include on-chip cache
- integrated x87 FPU (before it was a coprocessor)
Pentium (586) - first superscalar
- included two pipelines, u and v
- virtual-8086 mode extensions (virtual-8086 mode itself arrived with the 386)
- MMX soon after
Pentium Pro (686 or P6) - three-way superscalar
- dynamic execution - out-of-order execution, branch prediction, speculative execution
- very successful micro-architecture
Pentium 2 and 3 - both P6
Pentium 4 - new NetBurst architecture
Pentium M - enhanced P6
Pentium Pro Roots

NexGen 586 (1994)

Decomposes IA32 instructions into simpler RISC-like operations (R-ops or micro-ops)


NexGen bought by AMD


Decoupled Approach
AMD K5 (1995) – also used micro-ops
Intel Pentium Pro

Intel’s first use of decoupled architecture
Pentium-M Overview






Introduced March 12, 2003
Initially called Banias
Created by Israeli team
Missed deadline by less than 5 days
Marketed with Intel’s Centrino Initiative
Based on P6 microarchitecture
P6 Pipeline in a Nutshell

Divided into three clusters (front, middle, back)
In-order Front-End
 Out-of-order Execution Core
 Retirement


Each cluster is independent

E.g. if a mispredicted branch is detected in the front-end, the front-end will flush and refetch from the corrected branch target, all while the execution core continues working on previous instructions
P6 Pipeline in a Nutshell
P6 Front-End



Major units: IFU, ID, RAT, Allocator, BTB, BAC
Fetching (IFU)
  Includes I-cache, I-streaming cache, ITLB, ILD
  No pre-decoding
  Boundary markings by instruction-length decoder (ILD)
Branch Prediction
  Predicted (speculative) instructions are marked
Decoding (ID)
  Conversion of instructions (macro-ops) into micro-ops
Allocation of Buffer Entries: RS, ROB, MOB
P6 Execution Core

Reservation Station (RS)



Waiting micro-ops ready to go
Scheduler
Out-of-order Execution of micro-ops


Independent execution units (EU)
Must be careful about out-of-order memory access


Memory ordering buffer (MOB) interfaces to the memory subsystem
Requirements for execution


Available operands, EU, and write-back bus
Optimal performance
P6 Retirement

In-order updating of architected machine state


Micro-op retirement – “all or none”


Re-order buffer (ROB)
Architecturally illegal to retire only part of an IA-32 instruction
In-order handling of exceptions

Legal to handle mid-execution, but illegal
to handle mid-retirement
PM Changes to P6





Most changes made in P6 front-end
Added and expanded on P4 branch predictor
Micro-ops fusion
Addition of dedicated stack engine
Pipeline length
Longer than P3, shorter than P4
 Accommodates extra features above

PM Changes to P6, cont.



Intel has not released the exact length of the pipeline.
Known to be somewhere between the P4 (20 stages) and the P3 (10 stages). Rumored to be 12 stages.
Trades slightly lower clock frequencies (than P4) for better performance per clock, smaller branch misprediction penalties, …
Blue Man Group Commercial Break
Banias







1st version
77 million transistors, 23 million more than P4
1 MB on-die Level 2 cache
400 MHz FSB (quad-pumped 100 MHz)
130 nm process
Frequencies between 1.3 and 1.7 GHz
Thermal Design Point of 24.5 watts
http://www.intel.com/pressroom/archive/photos/centrino.htm
Dothan






Launched May 10, 2004
140 million transistors
2 MB Level 2 cache
400 or 533 MHz FSB
Frequencies from 1.0 to 2.26 GHz
Thermal Design Point of 21 (400 MHz FSB) to 27 watts
http://www.intel.com/pressroom/archive/photos/centrino.htm
Dothan cont.





90 nm process technology on 300 mm wafers
Provides twice the capacity of 200 mm wafers, while the process dimensions double the transistor density
Gate dimensions are 50 nm, or approximately half the diameter of the influenza virus
P and n gate voltages are reduced by enhancing the carrier mobility of the Si lattice by 10-20%
Draws less than 1 W average power
Bus




Utilizes a split-transaction, deferred-reply protocol
64-bit width
Delivers up to 3.2 GB/s (Banias) or 4.2 GB/s (Dothan) in and out of the processor
Utilizes source-synchronous transfer of addresses and data
  Data transferred 4 times per bus clock
  Addresses can be delivered 2 times per bus clock


Bus update in Dothan

http://www.intel.com/technology/itj/2005/volume09issue01/art05_perf_power
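The peak bus figures follow from the bus width, the bus clock, and the four data transfers per clock. A minimal sketch of that arithmetic, assuming a 100 MHz bus clock for the 400 MHz FSB and roughly 133 MHz for the 533 MHz FSB (the function name is illustrative, not from the slides):

```python
def fsb_peak_bandwidth_gb_per_s(bus_clock_mhz, width_bits=64, transfers_per_clock=4):
    """Peak data bandwidth of a quad-pumped front-side bus, in GB/s."""
    bytes_per_transfer = width_bits // 8
    bytes_per_second = bus_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer
    return bytes_per_second / 1e9

banias_bw = fsb_peak_bandwidth_gb_per_s(100)  # 400 MHz FSB
dothan_bw = fsb_peak_bandwidth_gb_per_s(133)  # 533 MHz FSB (bus clock assumed ~133 MHz)
print(f"Banias: {banias_bw:.2f} GB/s, Dothan: {dothan_bw:.2f} GB/s")
```

With a 64-bit bus the results come out in gigabytes per second: 3.2 for the 400 MHz FSB and a little over 4.2 for the 533 MHz FSB.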
L1 Cache

64KB total
32 K instruction
 32 K data (4 times P4M)




Write-back (vs. write-through on P4)
In a write-through cache, data is written to both L1 and main memory simultaneously
In a write-back cache, data can be written to the cache without immediately writing to main memory, increasing speed by reducing the number of slow memory writes
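A rough way to see the difference is to count how many writes leave the cache for a burst of stores. This is only a sketch: it assumes every store hits in the cache, a 64-byte line, and that each dirty line is written back exactly once when it is eventually evicted. `memory_writes` and `LINE_SIZE` are illustrative names.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)

def memory_writes(store_addresses, policy):
    """Count writes that reach the next memory level for a sequence of stores."""
    if policy == "write-through":
        # Every store is forwarded to memory as well as to the cache.
        return len(store_addresses)
    if policy == "write-back":
        # Stores only mark lines dirty; each dirty line is written once on eviction.
        dirty_lines = {address // LINE_SIZE for address in store_addresses}
        return len(dirty_lines)
    raise ValueError(f"unknown policy: {policy}")

stores = [0, 4, 8, 64, 68]  # five stores touching two cache lines
print(memory_writes(stores, "write-through"), memory_writes(stores, "write-back"))
```

Five stores that fall into two lines cost five memory writes under write-through but only two under write-back.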
L2 cache






1 – 2 MB
8-way set associative
Each set is divided into 4 separate power quadrants
Each individual power quadrant can be set to a sleep mode, shutting off power to that quadrant
Allows for only 1/32 of the cache to be powered at any time
Increased latency vs. improved power consumption
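One reading of the 1/32 figure, stated here as an assumption, is that it is the product of the 8 ways and the 4 power quadrants: only one quadrant of one way is awake for a given access. The constant and function names are illustrative.

```python
WAYS = 8                # from "8-way set associative"
QUADRANTS_PER_WAY = 4   # from "4 separate power quadrants" (interpretation assumed)

def powered_bytes(cache_bytes):
    """Bytes of L2 that stay powered if only one quadrant of one way is awake."""
    return cache_bytes // (WAYS * QUADRANTS_PER_WAY)

print(powered_bytes(1 * 1024 * 1024), powered_bytes(2 * 1024 * 1024))
```

Under that assumption, 32 KB of a 1 MB cache (or 64 KB of a 2 MB cache) is powered during an access.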
Prefetch



Prefetch logic fetches data to the level 2 cache before L1 cache requests occur
Reduces compulsory misses due to an increase of valid data in cache
Reduces bus cycle penalties
Schedule

P6 Pipeline in detail
Front-End
 Execution Core
 Back-End


Power Issues


Intel SpeedStep
Testing the Features
x86 system registers
 Performance Testing

P6 Front-end: Instruction Fetching

IA-32 Memory Management

Classic segmented model (cannot be disabled in protected mode)


Separation of code, data, and stack into "segments"
Optional paging


Segments divided into pages (typically 4KB)
Additional protection to segment-protection


E.g. provides read-write protection on a page-by-page basis
Stage 11 (stage 1) - Selection of address for next I-cache access


Speculation – address chosen from competing sources (e.g. BTB, BAC, loop detector)
Calculation of linear address from logical (segment selector + offset)


Segment selector – index into a table of segment descriptors, which
include base address, size, type, and access right of the segment
Remember: only six segment selectors, so only six usable at a time


32-bit code nowadays uses flat model, so OS can make do with only a few
(typically four) segments
IFU chooses address with highest priority and sends it to stage two
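The logical-to-linear step can be sketched as a table lookup followed by an add. This is a simplified model, not the hardware's exact behavior: it ignores granularity, access-right checks, and the LDT/GDT distinction, and `SegmentDescriptor` and `linear_address` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class SegmentDescriptor:
    base: int    # linear address where the segment starts
    limit: int   # highest valid offset within the segment

def linear_address(selector, offset, descriptor_table):
    """Translate (segment selector, offset) into a 32-bit linear address."""
    index = selector >> 3  # low 3 bits of a selector hold the table indicator and RPL
    descriptor = descriptor_table[index]
    if offset > descriptor.limit:
        raise ValueError("segment limit violation")
    return (descriptor.base + offset) & 0xFFFFFFFF

# Flat model: one segment with base 0 covering the whole 4 GB space.
flat_table = [SegmentDescriptor(0, 0), SegmentDescriptor(0, 0xFFFFFFFF)]
# Segmented example: a 64 KB segment starting at 0x10000.
seg_table = [SegmentDescriptor(0, 0), SegmentDescriptor(0x10000, 0xFFFF)]

print(hex(linear_address(0x08, 0x1234, flat_table)))
print(hex(linear_address(0x08, 0x1234, seg_table)))
```

In the flat model the linear address equals the offset, which is why modern 32-bit code gets by with only a few segments.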
P6 Front-end: Instruction Fetching

Stage 12-13 - Accessing of caches

Accesses instruction caches with address calculated in stage one


With paging, consults ITLB to determine physical page number (tag bits)



Without paging, linear address from stage one becomes physical address
Obtains branch prediction from branch target buffer (BTB)


Includes standard cache, victim cache, and streaming buffer
BTB takes two cycles to complete one access
Instruction boundary (ILD) and BTB markings
Stage 14 - Completion of instruction cache access

Instructions and their marks are sent to the instruction buffer or steered to the ID
P6 Front-end: Instruction Fetching
P6 Front-end: Instruction Decoding

Stage 15-16 - Decoding of IA32 Instructions




Alignment of instruction bytes
Identification of the ends of up to three instructions
Conversion of instructions into micro-ops
Stage 17 - Branch Decoding

If the ID notices a branch that went unpredicted by the BTB (e.g. because the BTB had never seen the branch before), it flushes the in-order pipe and re-fetches from the branch target



Early catch saves speculative instructions from being sent through the pipeline
Stage 21 - Register Allocation and Renaming



Coincides with stage 17 (a reminder that the units work independently)
Allocator used to allocate required entries in ROB, RS, LB, and SB
Register Alias Table (RAT) consulted


Branch target calculated by BAC
Maps logical sources/destinations to physical entries in the ROB (or sometimes RRF)
Stage 22 – Completion of Front-End

Marked micro-ops are forwarded to RS and ROB, where they
await execution and retirement, respectively.
P6 Front-end: Instruction Decoding
Register Alias Table Introduction




Provides register renaming of integer and floating-point registers and flags
Maps logical (architected) entries to physical entries, usually in the re-order buffer (ROB)
Physical entries are actually allocated by the Allocator
The physical entry pointers become a part of the micro-op’s overall state as it travels through the pipeline
RAT Details


P6 is 3-way super-scalar, so the RAT must be able to
rename up to six logical sources per cycle
Any data dependences must be handled


Ex:
op1) ADD EAX, EBX, ECX (dest. = EAX)
op2) ADD EAX, EAX, EDX
op3) ADD EDX, EAX, EDX
Instead of making op2 wait for op1 to retire, the RAT
provides data forwarding

Same case for op3, but RAT must make sure that it gets the result
from op2 and not op1
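The renaming in the example above can be sketched with a small table that remembers, for each logical register, the ROB entry of its most recent in-flight writer. This is an illustrative model only: it renames one micro-op at a time, whereas the real RAT handles three per cycle and must resolve dependences within the group. `rename` and the tuple layout are hypothetical.

```python
def rename(uops):
    """Rename (dest, src1, src2) micro-ops; each result lives in its own ROB entry."""
    rat = {}       # logical register -> ROB entry of the latest in-flight producer
    renamed = []
    for rob_index, (dest, src1, src2) in enumerate(uops):
        # A source reads from the ROB if a producer is in flight,
        # otherwise from the retired register file (RRF).
        sources = tuple(
            ("ROB", rat[src]) if src in rat else ("RRF", src)
            for src in (src1, src2)
        )
        renamed.append((("ROB", rob_index), sources))
        rat[dest] = rob_index  # later readers of dest now see this micro-op
    return renamed

uops = [("EAX", "EBX", "ECX"),   # op1
        ("EAX", "EAX", "EDX"),   # op2
        ("EDX", "EAX", "EDX")]   # op3
for entry in rename(uops):
    print(entry)
```

op2 picks up EAX from op1's ROB entry, and op3 picks it up from op2's entry rather than op1's, because the table always points at the latest writer.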
RAT Implementation Difficulties

Speculative Renaming
  Since speculative micro-ops flow by, the RAT must be able to undo its mappings in the case of a branch misprediction
Partial-width register reads and writes
  Consider a partial-width write followed by a larger-width read
  Data required by the read is a combination of multiple previous writes to the register; to be safe, the RAT must stall the pipeline
Retirement Overrides
  Common interaction between RAT and ROB
  When a micro-op retires, its ROB entry is removed and its result may be latched into an architected destination register
  If any active micro-ops source the retired op’s destination, they must not reference the outdated ROB entry
Mismatch stalls
  Associated with flag renaming
The Allocator


Works in conjunction with RAT to allocate required entries
  In each cycle, assumes three ROB, RS, and LB entries and two SB entries will be needed
  Once micro-ops arrive, it determines how many entries are really needed
ROB Allocation
  If three entries aren’t available, the allocator will stall
RS Allocation
  A bitmap is used to determine which entries are free
  If the RS is full, the pipeline is stalled
  RS must make sure valid entries are not overwritten
MOB Allocation
  Allocation of LB and SB entries also done by allocator
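Bitmap-based RS allocation can be sketched as follows, assuming a 20-entry reservation station where a set bit means "free". The policy of taking the lowest-numbered free entries is an assumption for illustration, and `allocate_rs` is a hypothetical name.

```python
RS_ENTRIES = 20

def allocate_rs(free_bitmap, needed):
    """Return (new_bitmap, entries), or (free_bitmap, None) to signal a stall."""
    free_entries = [i for i in range(RS_ENTRIES) if (free_bitmap >> i) & 1]
    if len(free_entries) < needed:
        return free_bitmap, None          # not enough room: stall the pipeline
    chosen = free_entries[:needed]
    for i in chosen:
        free_bitmap &= ~(1 << i)          # mark as occupied so it is not overwritten
    return free_bitmap, chosen

all_free = (1 << RS_ENTRIES) - 1
bitmap_after, entries = allocate_rs(all_free, 3)
print(entries, hex(bitmap_after))

nearly_full = (1 << 5) | (1 << 7)         # only entries 5 and 7 are free
print(allocate_rs(nearly_full, 3))
```

Clearing the bit at allocation time is what keeps a valid entry from being handed out twice. The entry's bit would be set again when its micro-op dispatches.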
PM Changes to P6 Front-End




Micro-op fusion
Dedicated Stack Engine
Enhanced branch prediction
Additional stages
Intel’s secret
 Most likely required for extra functionality above

Micro-ops Fusion

Fusion of multiple micro-ops into one micro-op
  Less contention for buffer entries
  Similarity to SIMD data packing
Two examples of fusion from Intel documentation:
  IA32 load-and-operate and store instructions
  Not known for certain whether these are the only cases of fusion
Possibly inspired by MacroOps used in K7 (Athlon)
Dedicated Stack Engine


Traditional out-of-order implementations update the Stack Pointer Register (ESP) by sending a µop to update the ESP register with every stack-related instruction
Pentium M implementation
  A delta register (ESPD) is maintained in the front end
  A historic ESP (ESPO) is then kept in the out-of-order execution core
  Dedicated logic was added to update the ESP by adding the ESPO to the ESPD
Improvements
  The ESPO value kept in the out-of-order machine is not changed during a sequence of stack operations; this allows more parallelism opportunities to be realized
  Since ESPD updates are now done by a dedicated adder, the execution unit is free to work on other µops and the ALUs are freed to work on more complex operations
  Decreased power consumption, since large adders are not used for small operations and the eliminated µops do not toggle through the machine
  Approximately 5% of the µops have been eliminated
Complications


Since the new adder lives in the front end, all of its calculations are speculative. This necessitates the addition of a recovery table for all values of ESPO and ESPD
If the architectural value of ESP is needed inside the out-of-order machine, the decode logic needs to insert a µop that carries out the ESP calculation
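The ESPO/ESPD split can be modeled in a few lines. This is a behavioral sketch only: it leaves out the speculative recovery table, and the rule that a synchronizing µop is inserted only when the delta is non-zero is an assumption. `StackEngine` and its methods are hypothetical names.

```python
class StackEngine:
    def __init__(self, esp):
        self.esp_o = esp      # historic ESP kept in the out-of-order core
        self.esp_d = 0        # delta maintained in the front end
        self.sync_uops = 0    # ESP-calculation µops inserted by the decoder

    def push(self, size=4):
        """PUSH: adjust only the delta; return the address the store uses."""
        self.esp_d -= size
        return self.esp_o + self.esp_d

    def pop(self, size=4):
        """POP: return the address the load uses, then adjust the delta."""
        address = self.esp_o + self.esp_d
        self.esp_d += size
        return address

    def read_esp(self):
        """Explicit use of ESP: fold the delta into ESP_O with one inserted µop."""
        if self.esp_d != 0:
            self.esp_o += self.esp_d
            self.esp_d = 0
            self.sync_uops += 1
        return self.esp_o

engine = StackEngine(0x1000)
addresses = [engine.push(), engine.push(), engine.pop()]
print([hex(a) for a in addresses], hex(engine.esp_o), engine.esp_d, engine.sync_uops)
print(hex(engine.read_esp()), engine.sync_uops)
```

Across the push/push/pop sequence ESPO never changes and no ESP-update µop is issued. Only the explicit read of ESP costs one synchronizing µop.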
Branch Prediction


Longer pipelines mean higher penalties for
mispredicted branches
Improvements result in added performance and
hence less energy spent per instruction retired
Branch Prediction in Pentium M


Enhanced version of Pentium 4 predictor
Two branch predictors added that run in tandem with P4 predictor:
  Loop detector
  Indirect branch detector
20% lower misprediction rate than PIII, resulting in up to 7% gain in real performance
Branch Prediction
Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php
Loop Detector



A predictor that always predicts a loop branch as taken will always mispredict the last iteration
Detector analyzes branches for loop behavior
Benefits a wide variety of program types
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
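The idea behind a loop detector can be illustrated with a toy predictor that learns how many times a loop branch is taken before it falls through, then predicts not-taken at that count. This is not Intel's actual design, only a sketch of the principle; `LoopDetector` and the helper functions are hypothetical.

```python
class LoopDetector:
    def __init__(self):
        self.learned_count = None  # taken-run length seen at the last loop exit
        self.current = 0           # taken branches seen in the current run

    def predict(self):
        """Predict taken unless we have reached the learned trip count."""
        if self.learned_count is not None and self.current == self.learned_count:
            return False
        return True

    def update(self, taken):
        if taken:
            self.current += 1
        else:
            self.learned_count = self.current
            self.current = 0

def mispredictions_always_taken(outcomes):
    return sum(1 for taken in outcomes if not taken)

def mispredictions_loop_detector(outcomes):
    predictor, wrong = LoopDetector(), 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong

# A 10-iteration loop executed 5 times: 9 taken branches, then one not-taken exit.
outcomes = ([True] * 9 + [False]) * 5
print(mispredictions_always_taken(outcomes), mispredictions_loop_detector(outcomes))
```

The always-taken predictor misses every loop exit (5 mispredictions), while the toy detector misses only the first exit, before it has learned the trip count.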
Indirect Branch Predictor


Picks targets based on global flow control history
Benefits programs compiled to branch to calculated addresses
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
Reservation Station




Used as a store for µops to wait for their operands and execution units to become available
Consists of 20 entries
Control portion of the entry can be written to from one of three ports
Data portion can be written to from one of 6 available ports
  3 for ROB
  3 for EU write-backs
Scheduler then uses this to schedule up to 5 µops at a time
During pipeline stage 31, entries that are ready for dispatch are sent to stage 32
Cancellation



Reservation Station assumes that all cache accesses will be hits
In the case of a cache miss, micro-ops that are dependent on the write-back data need to be cancelled and rescheduled at a later time
Can also occur due to a future resource conflict
Retirement



Takes 2 clock cycles to complete
Utilizes reorder buffer (ROB) to control retirement or completion of µops
ROB is a multi-ported register file with separate ports for
  Allocation-time writes of µop fields needed at retirement
  Execution Unit write-backs
  ROB reads of sources for the Reservation Station
  Retirement logic reads of speculative result data
Consists of 40 entries with each entry 157 bits wide
The ROB participates in



Speculative execution
Register renaming
Out-of-order execution
Speculative Execution



Buffers results of the execution unit before commit
Allows maximum rate for fetch and execute by assuming that branch prediction is perfect and no exceptions have occurred
If a misprediction occurs:
  Speculative results stored in the ROB are immediately discarded
  Microengine will restart by examining the committed state in the ROB
Register Renaming



Entries in the ROB that will hold the results of speculative µops are allocated during stage 21 of the pipeline
In stage 22 the sources for the µops are delivered based upon the allocation in stage 21
Data is written to the ROB by the Execution Unit into the renamed register during stage 83
Out-of-order Execution



Allows µops to complete and write back their results without concern for other µops executing simultaneously
The ROB reorders the completed µops into the original sequence and updates the architectural state
Entries in ROB are treated as FIFO during retirement
µops are originally allocated in sequential order, so retirement will also follow the original program order
Happens during pipeline stages 92 and 93
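The FIFO retirement rule can be sketched with a queue whose head must be complete before anything leaves. The limit of three retirements per cycle is an assumption based on the 3-way design, and `retire` and the dictionary layout are illustrative.

```python
from collections import deque

def retire(rob, max_per_cycle=3):
    """Retire completed µops from the head of the ROB, in program order."""
    retired = []
    while rob and rob[0]["done"] and len(retired) < max_per_cycle:
        retired.append(rob.popleft()["name"])
    return retired

# Allocated in program order; u1 and u2 finished executing before u0.
rob = deque([
    {"name": "u0", "done": False},
    {"name": "u1", "done": True},
    {"name": "u2", "done": True},
])
first_cycle = retire(rob)       # nothing retires: the oldest µop is not done
rob[0]["done"] = True           # u0 writes back
second_cycle = retire(rob)      # now all three retire in original order
print(first_cycle, second_cycle)
```

Execution order (u1, u2, then u0) never becomes visible: the architectural state is updated strictly as u0, u1, u2.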
Exception Handling







Events are sent to the ROB by the EU during stage 83
Results sent to the ROB from the Execution Unit are speculative results, therefore any exceptions encountered may not be real
If the ROB determines that branch prediction was incorrect, it inserts a clear signal at the point just before the retirement of this operation and then flushes all the speculative operations from the machine
If speculation is correct, the ROB will invoke the correct microcode exception handler
All event records are saved to allow the handler to repair the result or invoke the correct macro handler
Pointers for the macro and micro instructions are also needed to allow the program to resume after completion by the event handler
If the ROB retires an operation that faults, both the in-order and out-of-order sections are cleared. This happens during pipeline stages 93 and 94
Memory Subsystem

Memory Ordering Buffer (MOB)



Execution is out-of-order, but memory accesses cannot just be done in any order
Contains mainly the LB and the SB
Speculative loads and stores
  Not all loads can be speculative
    E.g. a memory-mapped I/O load could have unrecoverable side effects
  Stores are never speculative (can’t get back overwritten bits)
    But to improve performance, stores are queued in the store buffer (SB) to allow pending loads to proceed
    Similar to a write-back cache
Power Issues

Power use = α * C * V^2 * F
  α = activity factor
  C = effective capacitance
  V = voltage
  F = operating frequency
Power use can be reduced linearly by lowering frequency and capacitance and quadratically by scaling voltage
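To see how much the quadratic voltage term matters, the formula can be evaluated at the highest and lowest operating points listed in this deck for the Pentium M 1.6 GHz (1.6 GHz at 1.484 V and 600 MHz at 0.956 V). The sketch assumes the activity factor and capacitance are the same at both points, so they cancel in the ratio. It covers dynamic power only, not leakage.

```python
def dynamic_power(alpha, capacitance, voltage, frequency_hz):
    """Dynamic power: alpha * C * V^2 * F."""
    return alpha * capacitance * voltage ** 2 * frequency_hz

def relative_power(v_low, f_low, v_high, f_high):
    """Power at the low operating point as a fraction of the high one."""
    return dynamic_power(1, 1, v_low, f_low) / dynamic_power(1, 1, v_high, f_high)

ratio = relative_power(0.956, 600e6, 1.484, 1.6e9)
frequency_only = 600e6 / 1.6e9
print(f"voltage and frequency scaled: {ratio:.2f}, frequency only: {frequency_only:.3f}")
```

Dropping frequency alone would cut dynamic power to 37.5% of the maximum. Lowering the voltage as well brings it to roughly 16%.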
Mobile Use


Mobile is bursty – full power is only necessary
for brief periods
Intel developed SpeedStep technology to take
advantage of this fact and reduce power
consumption during periods of inactivity
http://www.intel.com/technology/itj/2003/volume07issue02/art05_power/p05_thermal.htm
SpeedStep I and II

SpeedStep I and II used in previous generations
Only two states:
  High performance (High frequency mode)
  Lower power use (Low frequency mode)
Problems
  Slow transition times
  Limited opportunity for optimization

Pentium M Goals


Optimize for performance when plugged in
Optimize for long battery-life when unplugged
Model                                 Frequency (max / min)   Vcore (max / min)
Pentium M 1.6 GHz                     1.6 GHz / 600 MHz       1.484 V / 0.956 V
Pentium M 1.5 GHz                     1.5 GHz / 600 MHz       1.484 V / 0.956 V
Pentium M 1.4 GHz                     1.4 GHz / 600 MHz       1.484 V / 0.956 V
Pentium M 1.3 GHz                     1.3 GHz / 600 MHz       1.388 V / 0.956 V
Pentium M 1.1 GHz Low Voltage         1.1 GHz / 600 MHz       1.180 V / 0.956 V
Pentium M 900 MHz Ultra Low Voltage   900 MHz / 600 MHz       1.004 V / 0.844 V
SpeedStep III


Optimized to fix limitations of previous generations
Three innovations:
  Voltage-Frequency switching separation
  Clock partitioning and recovery
  Event blocking

The 6 states of the Pentium M 1.6 GHz:
Freq.     Volt.
1.6 GHz   1.484 V
1.4 GHz   1.42 V
1.2 GHz   1.276 V
1 GHz     1.164 V
800 MHz   1.036 V
600 MHz   0.956 V
Voltage-Frequency switching separation
Voltage scaling is stepped up and down incrementally
This prevents clock noise and allows the processor to remain responsive during transition
Once voltage target is reached, frequency is throttled
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
Clock partitioning and recovery


During transition, only the core clock and phase-locked loop are stopped
This keeps logic active even while the clock is stopped
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
Event blocking


To prevent loss of events during frequency and voltage scaling when the core clock is stopped, interrupts, pin events, and snoop requests are sampled and saved
These events are retransmitted once the core clock becomes available
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
Leakage


Transistors in the off state still draw current
As transistors shrink and clock speeds increase, transistors leak more current, causing higher temperatures and more power use
Strained Silicon
http://www.research.ibm.com/resources/press/strainedsilicon/
Benefits of Strained Silicon



Electrons flow up to 70% faster due to reduced resistance
This leads to chips which are up to 35% faster, without a decrease in chip size
Intel’s "uni-axial" strained silicon process reduces leakage by at least five times without reducing performance – the 65nm process will realize another reduction of at least four times
High-K Transistor Gate Dielectric
(coming soon)



The dielectric used since the 1960s, silicon dioxide, is so thin now that leakage is a significant problem
A high-k (high dielectric constant) material has been developed by Intel to replace silicon dioxide
This high-k material reduces leakage by a factor of 100 below silicon dioxide
More Advances to Expect


Continued lowering of capacitance has helped reduce power consumption
Tri-gate transistors decrease leakage by increasing the amount of surface area for electrons to flow through
x86 System Registers

EFLAGS
  Various system flags
CPUID
  Exposes type and available features of processor
Model Specific Registers (MSRs)
  rdmsr and wrmsr
  Examples
    Enabling/Disabling SpeedStep
    Determining and changing voltage/frequency points
    More
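As an illustration of what "determining voltage/frequency points" involves, the sketch below decodes a 16-bit performance-state value into a frequency and a core voltage. The encoding is an assumption, recalled from the Linux speedstep-centrino driver for Banias parts rather than taken from these slides: bits 15:8 hold the bus ratio (times an assumed 100 MHz bus clock) and bits 7:0 hold a voltage ID, with Vcore = 700 mV + 16 mV × VID. Check Intel's documentation before relying on it. Reading the register itself needs a privileged rdmsr and is not shown.

```python
BUS_CLOCK_MHZ = 100  # assumed bus clock for a 400 MHz FSB part

def decode_perf_state(value):
    """Decode an assumed (ratio << 8 | vid) value into (MHz, mV)."""
    ratio = (value >> 8) & 0xFF
    vid = value & 0xFF
    return ratio * BUS_CLOCK_MHZ, 700 + 16 * vid

def encode_perf_state(mhz, millivolts):
    """Inverse of decode_perf_state under the same assumed encoding."""
    return ((mhz // BUS_CLOCK_MHZ) << 8) | ((millivolts - 700) // 16)

print(decode_perf_state(0x1031), decode_perf_state(0x0610))
```

Under this assumed encoding, 0x1031 corresponds to 1600 MHz at 1484 mV and 0x0610 to 600 MHz at 956 mV, matching the highest and lowest states listed for the Pentium M 1.6 GHz.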
Performance Testing

P4 2.2GHz vs. PM 1.6GHz
Columns: Asus L3C | Pentium-M Notebook
Display Size: 15.1" | 14.1"
Display Resolution: 1400x1050 | 1024x768
CPU: P4-M 2.2 GHz | Pentium-M 1.6 GHz
Memory Type: PC2100 DDR SDRAM | PC2100 DDR SDRAM
Amount of Memory: 256 MB | 256 MB
Chipset Northbridge: 845MP | "Odem" 855PM
Chipset Southbridge: ICH3-M | ICH4-M
Graphics Controller: Ati Mobility Radeon 7500 (LW)/M7 32MB DDR | NVIDIA GeForce4 440 Go 64MB DDR
CD/DVD ROM: Toshiba SDR2102 (ATA-2) 8x/8x8x24x DVD/CDRW Combo | XX-XXXX (ATA-2) 8x/8x8x24x DVD/CDRW Combo
Harddisc: IBM Travelstar IC25N020ATCS05-0 ATA-5 20GB/5400rpm/8MB | IBM Travelstar IC25N020ATCS05-0 ATA-5 20GB/5400rpm/8MB
Hard drive bay: 2.5", 12.5 mm height | 2.5", 12.5 mm height
Ethernet: Realtek RTL8139 (10/100 Mbit) | 3Com 3C920 (10/100 Mbit)
Modem: HSP 56MR | LT56 ATW
Audio: Intel AC97 | Crystal AC97
Battery Capacity: 59 Wh | 49 Wh
Benchmark
Battery Life
Pentium M vs AMD Turion
Specifications
Columns: Ferrari 4005 | TravelMate 8104
Processor: AMD Turion 64 Mobile ML-37 (2.0 GHz, 1MB L2 Cache) | Intel Pentium M Processor 760 (2.0 GHz, 2MB L2 Cache)
FSB/HTT: 1600 MHz | 533 MHz
Chipset: ATI Radeon Xpress 200M | Intel 915 PM Express
Wireless LAN: Broadcom 802.11b/g with SpeedBooster, Bluetooth Wireless, IrDA | Intel PRO/Wireless 2915ABG (802.11a/b/g), Bluetooth Wireless, IrDA
LCD: 15.4” WSXGA+ TFT LCD (1680x1050) | 15.4” WSXGA+ TFT LCD (1680x1050)
Hard Drive: 100GB Seagate Momentus 5400RPM 8MB Cache (ST9100823A) | 100GB Seagate Momentus 5400RPM 8MB Cache (ST9100823A)
Memory: 1GB DDR400 SDRAM (2 x 512MB) on Single-Channel Mode, 2.5-3-3-7 | 1GB DDR2-533 SDRAM (2 x 512MB) on Dual-Channel Mode, 4-4-4-12
Graphics: ATI Mobility Radeon X700 128MB PCI-E (358 core/345 mem), Driver version 6.14.10.6546 | ATI Mobility Radeon X700 128MB PCI-E (358 core/345 mem), Driver version 6.14.10.6546
Graphics Interface: S-Video/TV-out/DVI-D | S-Video/TV-out/DVI-D
Optical Drive: Slot-Load DVD-RW Super-Multi Double Layer | Tray-Load DVD-RW Super-Multi Double Layer
Audio: Realtek AC'97 | Realtek High Definition
Audio Interface: Microphone, two stereo speakers, headphone/line-out with SPDIF support | Microphone, two stereo speakers, headphone/line-out with SPDIF support
Weight: 6.3 lbs. with 8-cell battery | 6.3 lbs. with 8-cell battery
Size (W x D x H): 14.3” x 10.5” x 1.2”-1.4” | 14.3” x 10.5” x 1.2”-1.4”
Operating System: Windows XP Professional w/SP2 | Windows XP Professional w/SP2
Battery: 4,800 mAh | 4,800 mAh
Gaming
Battery Life
Future Processors

Yonah






Dual-core processor
Manufactured on a 65 nm process
Starting at 2.16GHz with a 667 MHz FSB (166MHz quad-pumped)
Shared 2MB L2 cache
Increased floating point performance with SSE3 instructions
Merom



Based on EM64T ISA
Consumes ~0.5 W of power, half of what the Dothan consumes
Possibility of laptops with 10 hours of battery life