L12_advance_2011

advertisement
Computer Architecture
Advanced Topics
1
Computer Architecture 2011 – Advanced Topics
Performance per Watt
 Mobile smaller form-factor decreases power budget
– Power generates heat, which must be dissipated to keep transistors
within allowed temperature
– Limits the processor’s peak power consumption
 Change the target
– Old target: get max performance
– New target: get max performance at a given power envelope

Performance per Watt
 Performance via frequency increase
– Power = CV2f, but increasing f also requires increasing V
– X% performance costs 3X% power

Assume performance linear with frequency
 A power efficient feature – better than 1:3 performance : power
– Otherwise it is better to just increase frequency (and voltage)
– U-arch performance features should be power efficient
2
Computer Architecture 2011 – Advanced Topics
Higher Performance vs.
Longer Battery Life
 Processor average power is <10% of
the platform
– The processor reduces power in periods of
low processor activity
– The processor enters lower power states in
idle periods
 Average power includes low-activity
periods and idle-time
– Typical: 1W – 3W
 Max power limited by heat dissipation
– Typical: 20W – 100W
Intel®
LAN Fan
DVD
ICH
2% 2%
2%
3%
CLK
5%
Display
(panel + inverter)
33%
HDD
8%
GFX
8%
Misc.
8%
CPU
10%
Intel® MCH Power Supply
10%
9%
 Decision
– Optimize for performance when Active
– Optimize for battery life when idle
3
Computer Architecture 2011 – Advanced Topics
Leakage Power
 The power consumed by a processor consists of
– Active power: used to switch transistors
– Leakage power: leakage of transistors under voltage
 Leakage power is a function of
– Number of transistors and their size
– Operating voltage
– Die temperature
 Leakage power reduction
– The LLC (Last Level Cache) is built with low-leakage transistors (2/3 of the
die transistors)

Low-leakage transistors are slower, increasing cache access latency

The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology

4
Reduces voltage and frequency on low processor activity
Computer Architecture 2011 – Advanced Topics
Enhanced SpeedStep™ Technology
 The “Basic” SpeedStep™ Technology had
– 2 operating points
– Non-transparent switch
 The “Enhanced” version provides
– Multi voltage/frequency operating points
For example, 1.6GHz Pentium M
processor operation ranges:
6.1X
18
16
2.8
14
Efficiency
ratio = 2.3
2.4
12
2.0
10
1.2
2.7X
)
1.6
8
6
0.8
4
0.4
2
0.0
0
0.8
5
20
3.2
Frequency
– Higher power efficiency
2.7X lower frequency 
2X performance loss 
>2X energy gain
– Outstanding battery life
Freq (GHz)
Power (Watts)
3.6
– Transparent switch
– Frequent switches
 Benefits
Voltage, Frequency, Power
4.0
1.0
1.2
Voltage (Volt)
1.4
Typical Power
 From 600MHz @ 0.956V
 To 1.6GHz @ 1.484V
(GHz

1.6
Computer Architecture 2011 – Advanced Topics
2nd Generation Intel® CoreTM
Sandy Bridge
6
Computer Architecture 2011 – Advanced Topics
2nd Gen Intel® Core™ Microarchitecture: Overview
Integrates CPU, Graphics, MC,
PCI Express* On Single Chip
Next Generation Intel® Turbo
Boost Technology
High Bandwidth
Last Level Cache
Next Generation Processor
Graphics and Media
DMI
High BW/low-latency modular
core/GFX interconnect
PCI Express*
x16
PCIe
System
Agent
IMC
Substantial performance
improvement
Display
Core
LLC
Core
LLC
Core
LLC
Core
LLC
2ch
DDR3
Intel® Advanced Vector
Extension (Intel® AVX)
Integrated Memory Controller
2ch DDR3
Embedded DisplayPort
Graphics
Discrete Graphics Support:
1x16 or 2x8
PECI Interface
To Embedded
Controller
Notebook
DP Port
Intel® Hyper-Threading
Technology
4 Cores / 8 Threads
2 Cores / 4 Threads
PCH
7
Computer Architecture 2011 – Advanced Topics
Core Block Diagram
32k L1 Instruction Cache
Pre decode
Instruction
Front End
(IA instructions  Uops)
Branch Pred
Queue
Decoders
Decoders
Decoders
Decoders
1.5k uOP cache
Reord
Zeroing Idioms
Store
Allocate/Rename/Retire
In Order
Allocation, Rename,
Retirement
erBuff
Buffe
Load
Buffe
rs
rs
ers
Scheduler
Port 0
Out of Order “Uop” Scheduling
Port 1
Port 5
ALU, SIMUL, ALU, SIALU, ALU, Branch,
DIV, FP MUL
Port 3
Load
Load
Six Execution
Ports
FP ADD
FP Shuffle Store Address Store Address
L2 Cache (MLC)
8
Port 2
Fill
Buffers
In order
Out-oforder
Port 4
Store
Data
Data
Cache
32k L1Unit
Data Cache
48
bytes/cycle
Computer Architecture 2011 – Advanced Topics
Front End
32KB L1 I-Cache
Pre decode
Instruction
Queue
Decoders
Decoders
Decoders
Decoders
Branch Prediction Unit
Instruction Fetch and Decode
• 32KB 8-way Associative ICache
• 4 Decoders, up to 4 instructions / cycle
• Micro-Fusion
– Bundle multiple instruction events into a single “Uops”
• Macro-Fusion
– Fuse instruction pairs into a complex “Uop”
• Decode Pipeline supports 16 bytes per cycle
9
Computer Architecture 2011 – Advanced Topics
Decoded Uop Cache
32KB L1 I-Cache
Branch Prediction Unit
Pre decode
Instruction
Queue
Decoders
Decoders
Decoders
Decoders
Decoded Uop Cache ~1.5 Kuops
Decoded Uop Cache
• Instruction Cache for Uops instead of Instruction Bytes
– ~80% hit rate for most applications
• Higher Instruction Bandwidth and Lower Latency
– Decoded Uop Cache can represent 32-byte / cycle
 More Cycles sustaining 4 instruction/cycle
– Able to ‘stitch’ across taken branches in the control flow
10
Computer Architecture 2011 – Advanced Topics
Branch Prediction Unit
32k L1 Instruction Cache
Branch Prediction Unit
Pre decode
Instruction
Queue
Decoders
Decoders
Decoders
Decoders
Decoded Uop Cache ~1.5 Kuops
New Branch Predictor
• Twice as many targets
• Much more effective storage for history
• Much longer history for data dependent behaviors
11
Computer Architecture 2011 – Advanced Topics
Front End
32k L1 Instruction Cache
Branch Prediction Unit
Zzzz
Pre decode
Instruction
Queue
Decoders
Decoders
Decoders
Decoders
Decoded Uop Cache ~1.5 Kuops
• Decoded Uop Cache lets the normal front end sleep
– Decode one time instead of many times
• Branch-Mispredictions reduced substantially
– The correct path is also the most efficient path
Save Power while Increasing Performance
12
Computer Architecture 2011 – Advanced Topics
“Out of Order” Part of the machine
Zeroing Idioms
Allocate/Rename/Retire
Reord
Store
In Order
Allocation,
Rename,
Retirement
erBuff
Buffe
Load
Buffe
rs
rs
Scheduler
Port 0
In order
ers
OutPort
of1Order
“Uop” Port
Scheduling
Port 5
2
Port 3
Out-of-order
Port 4
• Receives Uops from the Front End
• Sends them to Execution Units when they are ready
• Retires them in Program Order
• Increase Performance by finding more Instruction Level
Parallelism
– Increasing Depth and Width of machine implies larger buffers
 More Data Storage, More Data Movement, More Power
13
Computer Architecture 2011 – Advanced Topics
Sandy Bridge Out-of-Order (OOO) Cluster
Load
Buffers
Store Reorder
Buffers Buffers
Allocate/Rename/Retire Zeroing Idioms
In order
Out-oforder
Scheduler
FP/INT Vector PRF
Int PRF
• Method: Physical Reg File (PRF) instead
of centralized Retirement Register File
– Single copy of every data
– No movement after calculation
• Allows significant increase in buffer sizes
– Dataflow window ~33% larger
PRF has better than linear
performance/power
Key enabler for Intel® AVX
14
Nehalem
Sandy
Bridge
Load Buffers
48
64
Store Buffers
32
36
RS Scheduler
Entries
36
54
PRF integer
N/A
160
PRF floatpoint
N/A
144
ROB Entries
128
168
Computer Architecture 2011 – Advanced Topics
Intel® Advanced Vector Extensions
• Vectors are a natural data-type
for many applications
• Extend SSE FP instruction set
to 256 bits operand size
XMM0
YMM0 128 bits
256 bits (AVX)
– Intel AVX extends all
16 XMM registers to 256bits
• New, non-destructive source syntax
– VADDPS ymm1, ymm2, ymm3
• New Operations to enhance vectorization
– Broadcasts
– Masked load & store
Wider vectors and non-destructive source specify more work with fewer
instructions
Extending the existing state is area and power efficient
15
Computer Architecture 2011 – Advanced Topics
Execution Cluster
Scheduler sees matrix:
• 3 “ports” to 3 “stacks”
of execution units
ALU
Port 0
VI MUL
FP MUL
VI Shuffle
Blend
DIV
• General Purpose Integer
GPR
– SIMD (Vector) Integer
– SIMD Floating Point
• Challenge: double the
output of one of these
stacks in a manner that
is invisible to the others
ALU
Port 1
Port 5
SIMD INT
VI ADD
SIMD FP
FP ADD
VI Shuffle
ALU
FP Shuf
JMP
FP Bool
Blend
16
Computer Architecture 2011 – Advanced Topics
Execution Cluster
Solution:
• Repurpose existing data
paths to dual-use
• SIMD integer and legacy
SIMD FP use legacy
stack style
MultiplyFP MUL
FPBlend
Blend
VI MULFP
ALU
Port 0
VI Shuffle
GPR
SIMD INT
Port 1
17
VI ADD
VI Shuffle
• Double FLOPs
– 256-bit Multiply +
256-bit ADD +
256-bit Load
per clock
SIMD FP
FP ADDFP ADD
ALU
• Intel® AVX utilizes both
128-bit execution stacks
DIV
Port 5
ALU
FP Shuf
FP Shuffle
JMP
FP Bool
FP Boolean
FP BlendBlend
Computer Architecture 2011 – Advanced Topics
Memory Cluster
Load
256KB L2 Cache (MLC)
Fill
Buffers
Store
Address
Store
Data
Memory Control
Store
Buffers
32 bytes/cycle
32KB 8-way L1 Data Cache
• Memory Unit can service two memory requests per cycle
– 16 bytes load and 16 bytes store per cycle
• Goal:
Maintain the historic bytes/flop ratio of SSE for Intel® AVX
18
Computer Architecture 2011 – Advanced Topics
Memory Cluster
Load
Load
Store Address
Store Address
Store
Data
Memory Control
256KB L2 Cache (MLC)
Fill
Buffers
Store
Buffers
48 bytes/cycle
32KB 8-way L1 Data Cache
• Solution : Dual-Use the existing connections
– Make load/store pipes symmetric
• Memory Unit services three data accesses per cycle
– 2 read requests of up to 16 bytes AND 1 store of up to 16 bytes
– Internal sequencer deals with queued requests
• Second Load Port is one of highest performance features
– Required to keep Intel® AVX fed
– linear power/performance
19
Computer Architecture 2011 – Advanced Topics
Putting it together
Sandy Bridge Microarchitecture
32k L1 Instruction Cache
Pre decode
Instruction
Branch Pred
Load
Buffers
Store
Buffers
Decoders
Decoders
Decoders
Decoders
Queue
1.5k uOP cache
Zeroing Idioms
Allocate/Rename/Retire
Reorder
Buffers
In order
Out-oforder
Scheduler
Port 0
ALU
VI MUL
VI Shuffle
DIV
AVX FP MUL
Port 1
ALU
VI ADD
VI Shuffle
AVX FP ADD
Port 5
Port 2
Port 3
Load
Load
Port 4
ALU
JMP
AVX/FP Shuf
Store Address Store Address
AVX/FP Bool
AVX FP Blend
AVX FP Blend
Store
Data
Memory Control
48 bytes/cycle
L2 Data Cache (MLC)
Fill
Buffers
32k L1 Data Cache
20
Computer Architecture 2011 – Advanced Topics
Other Architectural Extensions
• Cryptography Instruction Throughput Enhancements
– Increased throughput for AES instructions
• Arithmetic Throughput Enhancements
– ADC (Add with Carry) throughput doubled
– Multiply (64-bit multiplicands with 128-bit product)
 ~25% speedup on existing RSA binaries
• State Save/Restore Enhancements
– New state added in Intel® AVX
– HW monitors features used by applications
 Only saves/restores state that is used
21
Computer Architecture 2011 – Advanced Topics
2nd Gen Intel® Core™
Microarchitecture
x16
PCIe
DMI
PCI Express
System
Display Agent
IMC
Core
LLC
Core
LLC
Core
LLC
Core
LLC
2ch
DDR3
Graphics
PECI Interface
To Embedded
Controller
System Agent, Ring
Architecture and Other
Innovations in 2nd
Generation Intel® Core™
Microarchitecture
formerly codenamed
Sandy Bridge
Notebook
DP Port
DMI
2011 PCH
22
Computer Architecture 2011 – Advanced Topics
Integration: Optimization Opportunities
• Dynamically redistribute power between Cores & Graphics
• Tight power management control of all components,
providing better granularity and deeper idle/sleep states
• Three separate power/frequency domains:
System Agent (Fixed), Cores+Ring, Graphics (Variable)
• High BW Last Level Cache, shared among Cores and Graphics
– Significant performance boost, saves memory bandwidth and power
• Integrated Memory Controller and PCI Express ports
– Tightly integrated with Core/Graphics/LLC domain
– Provides low latency & low power – remove intermediate busses
• Bandwidth is balanced across the whole machine,
from Core/Graphics all the way to Memory Controller
• Modular uArch for optimal cost/power/performance
– Derivative products done with minimal effort/time
23
Computer Architecture 2011 – Advanced Topics
Scalable Ring On-die Interconnect
• Ring-based interconnect between Cores, Graphics,
Last Level Cache (LLC) and System Agent domain
• Composed of 4 rings
– 32 Byte Data ring, Request ring,
Acknowledge ring and Snoop ring
– Fully pipelined at core frequency/voltage:
bandwidth, latency and power scale with cores
• Massive ring wire routing runs over the LLC
with no area impact
• Access on ring always picks the shortest path –
minimize latency
• Distributed arbitration, ring protocol handles
coherency, ordering, and core interface
• Scalable to servers with large number of
processors
DMI
PCI Express*
System
Display Agent
IMC
Core
LLC
Core
LLC
Core
LLC
Core
LLC
Graphics
High Bandwidth, Low Latency, Modular
24
Computer Architecture 2011 – Advanced Topics
Cache Box
• Interface block
–
–
–
–
Between Core/Graphics/Media and the Ring
Between Cache controller and the Ring
Implements the ring logic, arbitration, cache controller
Communicates with System Agent for LLC misses,
external snoops, non-cacheable accesses
• Full cache pipeline in each cache box
– Physical Addresses are hashed at the source
to prevent hot spots and increase bandwidth
– Maintains coherency and ordering for the
addresses that are mapped to it
– LLC is fully inclusive with “Core Valid Bits” –
eliminates unnecessary snoops to cores
• Runs at core voltage/frequency, scales
with Cores
Distributed coherency & ordering; Scalable
Bandwidth, Latency & Power
25
PCI Express*
DMI
System
Agent
Display
IMC
Core
LLC
Core
LLC
Core
LLC
Core
LLC
Graphics
Computer Architecture 2011 – Advanced Topics
LLC Sharing
• LLC shared among all Cores, Graphics and Media
– Graphics driver controls which streams are cached/coherent
– Any agent can access all data in the LLC, independent of who
allocated the line, after memory range checks
• Controlled LLC way allocation mechanism to prevent
thrashing between Core/graphics
• Multiple coherency domains
– IA Domain (Fully coherent via cross-snoops)
– Graphic domain (Graphics virtual caches,
flushed to IA domain by graphics engine)
– Non-Coherent domain (Display data,
flushed to memory by graphics engine)
Much higher Graphics performance,
DRAM power savings, more DRAM BW
available for Cores
26
PCI Express*
DMI
Display
System
Agent
IMC
Core
LLC
Core
LLC
Core
LLC
Core
LLC
Graphics
Computer Architecture 2011 – Advanced Topics
System Agent
• Contains PCI Express, DMI, Memory
Controller, Display Engine…
DMI
• Contains Power Control Unit
– Programmable uController, handles all power
management and reset functions in the chip
• Smart integration with the ring
– Provides cores/Graphics /Media with high BW,
low latency to DRAM/IO for best performance
– Handles IO-to-cache coherency
• Separate voltage and frequency from
ring/cores, Display integration for better
battery life
PCI Express*
System
Display Agent
IMC
Core
LLC
Core
LLC
Core
LLC
Core
LLC
Graphics
• Extensive power and thermal
management for PCI Express and DDR
Smart I/O Integration
27
Computer Architecture 2011 – Advanced Topics
Hyper Threading Technology
28
Computer Architecture 2011 – Advanced Topics
Thread-Level Parallelism
 Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
 Software trends
– Applications consist of multiple threads or processes that can be
executed in parallel on multiple processors
 Thread-level parallelism (TLP) – threads can be from
– the same application
– different applications running simultaneously
– operating system services
 Increasing single thread performance becomes harder
– and is less and less power efficient
 Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die
29
Computer Architecture 2011 – Advanced Topics
Multi-Threading
 Multi-threading: a single processor executes multiple threads
 Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
 Switch-on-event multithreading
– Switch threads on long latency events such as cache misses
– Works well for server applications that have many cache misses
 A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mis-predictions and long dependencies
 Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously w/o switching
– Makes the most effective use of processor resources

30
Maximizes performance vs. transistor count and power
Computer Architecture 2011 – Advanced Topics
Hyper-threading (HT) Technology
 HT is SMT
– Makes a single processor appear as 2 logical processors = threads
 Each thread keeps a its own architectural state
– General-purpose registers
– Control and machine state registers
 Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
 OS views logical processors (threads) as physical processors
– Schedule threads to logical processors as in a multiprocessor system
 From a micro-architecture perspective
– Thread share a single set of physical resources

31
caches, execution units, branch predictors, control logic, and buses
Computer Architecture 2011 – Advanced Topics
Two Important Goals
 When one thread is stalled the other thread can continue to
make progress
– Independent progress ensured by either


Partitioning buffering queues and limiting the number of entries each
thread can use
Duplicating buffering queues
 A single active thread running on a processor with HT runs at
the same speed as without HT
– Partitioned resources are recombined when only one thread is active
32
Computer Architecture 2011 – Advanced Topics
Front End
 Each thread manages its own next-instruction-pointer
 Threads arbitrate Uop cache access every cycle (Ping-Pong)
– If both want to access the UC – access granted in alternating cycles
– If one thread is stalled, the other thread gets the full UC bandwidth
 TC entries are tagged with thread-ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other
Uop
Cache
33
Computer Architecture 2011 – Advanced Topics
Front End (cont.)
 Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked for each thread
– The large global history array is a shared

Entries are tagged with a logical processor ID
 Each thread has its own ITLB
 Both threads share the same decoder logic
– if only one needs the decode logic, it gets the full decode bandwidth
– The state needed by the decodes is duplicated
 Uop queue is hard partitioned
– Allows both logical processors to make independent forward progress
regardless of FE stalls (e.g., TC miss) or EXE stalls
34
Computer Architecture 2011 – Advanced Topics
Out-of-order Execution
 ROB and MOB are hard partitioned
– Enforce fairness and prevent deadlocks
 Allocator ping-pongs between the thread
– A thread is selected for allocation if



35
Its uop-queue is not empty
its buffers (ROB, RS) are not full
It is the thread’s turn, or the other thread cannot be selected
Computer Architecture 2011 – Advanced Topics
Out-of-order Execution (cont)
 Registers renamed to a shared physical register pool
– Store results until retirement
 After allocation and renaming uops are placed in one of 2 Qs
– Memory instruction queue and general instruction queue

The two queues are hard partitioned
– Uops are read from the Q’s and sent to the scheduler using ping-pong
 The schedulers are oblivious to threads
– Schedule uops based on dependencies and exe. resources availability

Regardless of their thread
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness

Limit the number of active entries a thread can have in each
scheduler’s queue
 Forwarding logic compares physical register numbers
– Forward results to other uops without thread knowledge
36
Computer Architecture 2011 – Advanced Topics
Out-of-order Execution (cont)
 Memory is largely oblivious
– L1 Data Cache, L2 Cache, L3 Cache are thread oblivious

All use physical addresses
– DTLB is shared

Each DTLB entry includes a thread ID as part of the tag
 Retirement ping-pongs between threads
– If one thread is not ready to retire uops all retirement bandwidth is
dedicated to the other thread
37
Computer Architecture 2011 – Advanced Topics
Single-task And Multi-task Modes
 MT-mode (Multi-task mode)
– Two active threads, with some resources partitioned as described earlier
 ST-mode (Single-task mode)
– There are two flavors of ST-mode

single-task thread 0 (ST0) – only thread 0 is active

single-task thread 1 (ST1) – only thread 1 is active
– Resources that were partitioned in MT-mode are re-combined to give the
single active logical processor use of all of the resources
 Moving the processor from between modes
Thread 0 executes HALT
Interrupt
ST0
Thread 1 executes HALT
38
Low
Power
Thread 1 executes HALT
ST1
MT
Thread 0 executes HALT
Computer Architecture 2011 – Advanced Topics
Operating System And Applications
 An HT processor appears to the OS and application SW as 2
processors
– The OS manages logical processors as it does physical processors
The OS should implement two optimizations:
 Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute on the idle logical processor a sequence of
instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that
could otherwise be used by the other active logical processor
 On a multi-processor system,
– Schedule threads to logical processors on different physical processors
before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible
39
Computer Architecture 2011 – Advanced Topics
Download