Computer Structure
Advanced Topics
Lihu Rappoport and Adi Yoaz
Hyper Threading Technology
Thread-Level Parallelism
 Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
 Software trends
– Applications consist of multiple threads or processes that can be
executed in parallel on multiple processors
 Thread-level parallelism (TLP) – threads can be from
– the same application
– different applications running simultaneously
– operating system services
 Increasing single thread performance becomes harder
– and is less and less power efficient
 Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die
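A minimal software-side illustration of thread-level parallelism (my own sketch, not from the slides): a C program whose work is split across two POSIX threads, which the OS may schedule on two CMP cores or on the two logical processors of an SMT core.

    /* Hypothetical TLP sketch: two software threads, each summing half an array. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N], partial[2];

    static void *sum_half(void *arg) {
        long id = (long)arg;                    /* which half this thread owns */
        double s = 0.0;
        for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            s += a[i];
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < N; i++) a[i] = 1.0;
        for (long id = 0; id < 2; id++)         /* the OS may place each thread on its own core */
            pthread_create(&t[id], NULL, sum_half, (void *)id);
        for (long id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("sum = %.0f\n", partial[0] + partial[1]);
        return 0;
    }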
Multi-Threading
 Multi-threading: a single processor executes multiple threads
 Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
 Switch-on-event multithreading
– Switch threads on long latency events such as cache misses
– Works well for server applications that have many cache misses
 A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mis-predictions and long dependencies
 Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously w/o switching
– Makes the most effective use of processor resources

 Maximizes performance vs. transistor count and power
Hyper-threading (HT) Technology
 HT is SMT
– Makes a single processor appear as 2 logical processors = threads
 Each thread keeps its own architectural state
– General-purpose registers
– Control and machine state registers
 Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
 OS views logical processors similar to physical processors
– But can still differentiate, and prefer to schedule a thread on a new physical processor rather than on the 2nd logical processor in the same physical processor
 From a micro-architecture perspective
– Threads share a single set of physical resources:
 caches, execution units, branch predictors, control logic, and buses
Two Important Goals
 When one thread is stalled the other thread can continue to
make progress
– Independent progress is ensured by either:
 Partitioning buffering queues and limiting the number of entries each thread can use
 Duplicating buffering queues
 A single active thread running on a processor with HT runs at
the same speed as without HT
– Partitioned resources are recombined when only one thread is active
Front End
 Each thread manages its own next-instruction-pointer
 Threads arbitrate Uop cache access every cycle (Ping-Pong)
– If both want to access the UC – access granted in alternating cycles
– If one thread is stalled, the other thread gets the full UC bandwidth
 Uop Cache entries are tagged with thread-ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other
Front End (cont.)
 Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked for each thread
– The large global history array is shared
 Entries are tagged with a logical processor ID
 Each thread has its own ITLB
 Both threads share the same decoder logic
– if only one needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoders is duplicated
 Uop queue is hard partitioned
– Allows both logical processors to make independent forward progress
regardless of front-end stalls (e.g., a Uop Cache miss) or execution stalls
Out-of-order Execution
 ROB and MOB are hard partitioned
– Enforce fairness and prevent deadlocks
 The allocator ping-pongs between the threads
– A thread is selected for allocation if
 Its uop-queue is not empty
 Its buffers (ROB, RS) are not full
 It is the thread’s turn, or the other thread cannot be selected
Out-of-order Execution (cont)
 Registers renamed to a shared physical register pool
– Store results until retirement
 After allocation and renaming, uops are placed in one of 2 queues
– Memory instruction queue and general instruction queue
 The two queues are hard partitioned
– Uops are read from the queues and sent to the scheduler using ping-pong
 The schedulers are oblivious to threads
– Schedule uops based on dependencies and exe. resources availability
 Regardless of their thread
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness
 Limit the number of active entries a thread can have in each scheduler’s queue
 Forwarding logic compares physical register numbers
– Forward results to other uops without thread knowledge
Out-of-order Execution (cont)
 The memory subsystem is largely thread-oblivious
– L1 Data Cache, L2 Cache, and L3 Cache are thread oblivious
 All use physical addresses
– DTLB is shared
 Each DTLB entry includes a thread ID as part of the tag
 Retirement ping-pongs between threads
– If one thread is not ready to retire uops all retirement bandwidth is
dedicated to the other thread
Single-task And Multi-task Modes
 MT-mode (Multi-task mode)
– Two active threads, with some resources partitioned as described earlier
 ST-mode (Single-task mode)
– There are two flavors of ST-mode
 single-task thread 0 (ST0) – only thread 0 is active
 single-task thread 1 (ST1) – only thread 1 is active
– Resources that were partitioned in MT-mode are re-combined to give the
single active logical processor use of all of the resources
 Moving the processor between modes:
[Mode-transition diagram: MT, ST0, ST1 and Low Power states; transitions occur when thread 0 or thread 1 executes HALT, or when an interrupt arrives for a halted thread]
Operating System And Applications
 An HT processor appears to the OS and application SW as 2
processors
– The OS manages logical processors as it does physical processors
The OS should implement two optimizations:
 Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute on the idle logical processor a sequence of
instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that
could otherwise be used by the other active logical processor
 On a multi-processor system,
– Schedule threads to logical processors on different physical processors
before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible
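A rough Linux-flavored sketch of the second optimization (my own illustration; the assumption that CPU 0 and CPU 1 belong to different physical cores is platform dependent and should really be read from the CPU topology):

    /* Hypothetical sketch: spread two worker threads across different physical
     * cores before doubling up on the two logical processors of one core. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *work(void *arg) {
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; i++) x += 1.0;   /* placeholder work */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int cpus[2] = {0, 1};   /* ASSUMPTION: CPUs 0 and 1 are distinct physical cores */
        for (int i = 0; i < 2; i++) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpus[i], &set);
            pthread_create(&t[i], NULL, work, NULL);
            pthread_setaffinity_np(t[i], sizeof(set), &set);  /* pin to that CPU */
        }
        for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
        puts("done");
        return 0;
    }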
2nd Generation Intel® Core™
Sandy Bridge
3rd Generation Intel Core™ Processors
 22nm process
 Quad core die, with Intel HD Graphics 4000
 1.4 Billion transistors
 Die size: 160 mm²
Sandy Bridge Processor Core Overview
• Build upon Nehalem microarchitecture processor core
– Converged building block for mobile, desktop, and server
• Add “Cool” microarchitecture enhancements
– Features that are better than linear performance/power
• Add “Really Cool” microarchitecture enhancements
– Features which gain performance while saving power
• Extend the architecture for important new applications
– Floating Point and Throughput
 Intel® Advanced Vector Extensions (Intel® AVX) –
significant boost for selected compute intensive applications
– Security
 AES (Advanced Encryption Standard) throughput enhancements
 Large Integer RSA speedups
– OS/VMM and server related features
 State save/restore optimizations
Foil taken from IDF 2011
Core Block Diagram
[Block diagram: Front End (32KB L1 instruction cache, pre-decode, instruction queue, branch prediction, 4 decoders, 1.5K-μop cache) turning IA instructions into μops; in-order Allocate/Rename/Retire with load, store and reorder buffers and zeroing idioms; out-of-order "Uop" scheduling over six execution ports – Ports 0/1/5: ALU, SIMD, DIV, FP MUL, FP ADD, FP Shuffle, Branch; Ports 2/3: Load / Store Address; Port 4: Store Data; 32KB L1 data cache, fill buffers and L2 cache (MLC), 48 bytes/cycle]
Foil taken from IDF 2011
Front End
[Diagram: 32KB L1 I-Cache → pre-decode → instruction queue → 4 decoders, with the Branch Prediction Unit steering fetch]
Instruction Fetch and Decode
• 32KB 8-way Associative ICache
• 4 Decoders, up to 4 instructions / cycle
• Micro-Fusion
– Bundle multiple instruction events into a single “Uop”
• Macro-Fusion
– Fuse instruction pairs into a complex “Uop”
• Decode Pipeline supports 16 bytes per cycle
Foil taken from IDF 2011
Branch Prediction Unit
New Branch Predictor
• Twice as many targets
• Much more effective storage for history
• Much longer history for data dependent behaviors
Foil taken from IDF 2011
Decoded Uop Cache
[Diagram: the front end, now with a ~1.5 Kμop Decoded Uop Cache fed by the decoders]
Decoded Uop Cache
• Instruction Cache for Uops instead of Instruction Bytes
– ~80% hit rate for most applications
• Higher Instruction Bandwidth and Lower Latency
– Decoded Uop Cache can represent 32 bytes/cycle
 More cycles sustaining 4 instructions/cycle
– Able to ‘stitch’ across taken branches in the control flow
Foil taken from IDF 2011
Decoded Uop Cache
[Diagram: the same front end with the legacy fetch and decode pipeline asleep (“Zzzz”) while μops stream from the ~1.5 Kμop Decoded Uop Cache]
• Decoded Uop Cache lets the normal front end sleep
– Decode one time instead of many times
• Branch-Mispredictions reduced substantially
– The correct path is also the most efficient path
Save Power while Increasing Performance
Foil taken from IDF 2011
Instruction Decode
 Four decoding units decode instruction into μops
– The first can decode all instructions up to four μops in size
– The remaining 3 decoders handle common single μop instructions
 μops emitted by the decoders are directed to the μop queue and
to the Decoded uop cache
 Instructions with >4 μops generate their μops from the MSROM
– The MSROM bandwidth is four μops per cycle
– Instructions whose μops come from the MSROM can start from
either the legacy decode pipeline or from the Decoded uop-cache
From the Optimization Manual
Micro-Fusion
 Fuse multiple μops from same instruction into a single μop
– The micro-fused μop is dispatched multiple times in the OOO
 As it would if it were not micro-fused
– Instructions which decode into a single micro-fused μop can be handled by all decoders
– Improves instruction bandwidth delivered from decode to retirement
and saves power
 Micro-fused instructions
– All stores to memory, including store immediate
 Execute internally as two μops: store-address and store-data
– Load + op instruction
 e.g., FADD DOUBLE PTR [RDI+RSI*8]
– Load and jump
 e.g., JMP [RDI+200]
From the Optimization Manual
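A hedged C illustration (the compiler output described here is an assumption, not stated in the slides): a read-modify expression typically compiles to a single load+op instruction, which decodes into one micro-fused μop and is split into its load and add parts only at dispatch.

    /* "s += a[i]" usually compiles to something like "add rax, [rdi+rsi*8]":
     * one instruction, one micro-fused uop through decode, rename and retire. */
    long sum_array(const long *a, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }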
Macro-Fusion
 Merge two instructions into a single μop
– A macro-fused instruction executes with a single dispatch
 Reduces latency and frees execution resources
– Increased decode, rename and retire bandwidth
– Power savings from representing more work in fewer bits
 The first instruction of a macro-fused pair modifies flags
– CMP, TEST, ADD, SUB, AND, INC, DEC
 The 2nd inst. of a macro-fusible pair is a conditional branch
– For each first instruction, some branches can fuse with it
 These pairs are common in many types of applications
From the Optimization Manual
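An illustrative C loop (my example; the exact instruction selection is compiler dependent): the loop condition normally becomes a flag-producing compare followed immediately by a conditional branch, exactly the kind of pair the decoders can macro-fuse.

    /* "i < n" typically compiles to "cmp reg, reg ; jl .loop" (or a DEC/JNZ
     * form): a flag-setting instruction plus a conditional branch, which
     * macro-fuse into a single uop. */
    long count_positive(const int *v, long n) {
        long count = 0;
        for (long i = 0; i < n; i++)
            if (v[i] > 0)
                count++;
        return count;
    }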
Decoded Uop-Cache
 The UC is an accelerator of the legacy decode pipeline
– Caches the μops coming out of the instruction decoder
– Next time μops are taken from the UC
– The UC can ideally hold up to 1536 μops
 32 sets, 8 ways per set, up to 6 μops per way
– Average hit rate of 80% of the μops
 Skips fetch and decode for the cached μops
– Reduces latency on branch mispredictions
– Increases μop delivery bandwidth to the OOO engine
– Reduces front end power consumption
 In each cycle, provides μops for instructions mapped to 32 bytes
– Keeps the front end ahead of the back end
 Fills the large scheduler window, to find instruction-level parallelism
 The UC is virtually included in the IC and in the ITLB
– Flushed on a context switch
From the Optimization Manual
Loop Stream Detector (LSD)
 LSD detects small loops that fit in the μop queue
– The loop streams from the μop queue, with no more fetching,
decoding, or reading μops from any of the caches
– until a branch mis-prediction inevitably ends it
 The loops with the following attributes qualify for LSD replay
– Up to eight chunk fetches of 32-instruction-bytes
– Up to 28 μops (~28 instructions)
– All μops are also resident in the UC
– No more than eight taken branches
– No CALL or RET
– No mismatched stack operations (e.g., more PUSH than POP)
 Many calculation-intensive loops, searches and software string
moves match these characteristics
– For high-performance code, loop unrolling is generally preferable, even when it overflows the LSD capability
From the Optimization Manual
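A hedged example of a loop that typically qualifies for LSD replay (whether it actually does depends on how the compiler encodes it; the μop count here is an assumption):

    /* A tight software "string move" style loop: a handful of instructions,
     * no CALL/RET, no mismatched stack operations, well under 28 uops -
     * a typical candidate to stream from the uop queue via the LSD. */
    void copy_bytes(char *dst, const char *src, long n) {
        for (long i = 0; i < n; i++)
            dst[i] = src[i];
    }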
“Out of Order” Part of the machine
[Diagram: in-order Allocate/Rename/Retire (load, store and reorder buffers, zeroing idioms) feeding the out-of-order “Uop” scheduler and Ports 0–5]
• Receives Uops from the Front End
• Sends them to Execution Units when they are ready
• Retires them in Program Order
• Increase Performance by finding more Instruction Level
Parallelism
– Increasing Depth and Width of machine implies larger buffers
 More Data Storage, More Data Movement, More Power
Foil taken from IDF 2011
Sandy Bridge Out-of-Order Cluster
[Diagram: load, store and reorder buffers with in-order Allocate/Rename/Retire and zeroing idioms, feeding the out-of-order scheduler, FP/INT Vector PRF and Int PRF]
• Method: Physical Reg File (PRF) instead
of centralized Retirement Register File
– A single copy of the data
– No movement after calculation
• Allows significant increase in buffer sizes
– Dataflow window ~33% larger
PRF has better than linear
performance/power
Key enabler for Intel® AVX
Foil taken from IDF 2011
Buffer sizes – Nehalem → Sandy Bridge:
– Load Buffers: 48 → 64
– Store Buffers: 32 → 36
– Scheduler (RS) Entries: 36 → 54
– PRF Integer: N/A → 160
– PRF Floating Point: N/A → 144
– ROB Entries: 128 → 168
Double the FLOPs in a “Cool” Manner
• Intel® Advanced Vector Extensions (Intel® AVX)
– Vectors are a natural data-type for many applications
– Extend the SSE FP instruction set to 256-bit operand size
– Extend all 16 XMM registers to 256 bits
XMM0 – 128 bits
YMM0 – 256 bits (AVX)
• New, non-destructive source syntax
– VADDPS ymm1, ymm2, ymm3
• New Operations to enhance vectorization
– Broadcasts, masked load & store
Wide vectors + non-destructive source: more work with fewer instructions
Extending the existing state is area and power efficient
Foil taken from IDF 2011
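A minimal intrinsics sketch of the wider, non-destructive form (illustrative; _mm256_add_ps and _mm256_broadcast_ss are standard AVX intrinsics that compile to VADDPS and VBROADCASTSS):

    #include <immintrin.h>

    /* c = a + b on 8 single-precision floats: compiles to
     * VADDPS ymm_c, ymm_a, ymm_b - the destination is distinct from both
     * sources, so neither input is overwritten. */
    __m256 add8(__m256 a, __m256 b) {
        return _mm256_add_ps(a, b);
    }

    /* Broadcast one scalar to all 8 lanes of a YMM register. */
    __m256 splat(const float *p) {
        return _mm256_broadcast_ss(p);
    }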
Execution Cluster
• 3 Execution Ports
• Max throughput of 8 floating point operations per cycle
– Port 0 : packed SP multiply
– Port 1 : packed SP add
[Diagram: Scheduler dispatching to Port 0 (ALU, VI MUL, VI Shuffle, DIV, FP MUL, Blend), Port 1 (ALU, VI ADD, VI Shuffle, FP ADD), and Port 5 (ALU, JMP, FP Shuffle, FP Boolean, Blend)]
Foil taken from IDF 2011
Execution Cluster
Scheduler sees matrix:
• 3 “ports” to 3 “stacks” of execution units
– General Purpose Integer (GPR)
– SIMD (Vector) Integer
– SIMD Floating Point
• Challenge: double the output of one of these stacks in a manner that is invisible to the others
[Diagram: Ports 0, 1 and 5, each spanning the GPR, SIMD INT and SIMD FP stacks]
Foil taken from IDF 2011
Execution Cluster
Solution:
• Repurpose existing data paths to dual-use
• SIMD integer and legacy SIMD FP use the legacy stack style
• Intel® AVX utilizes both 128-bit execution stacks
• Double FLOPs
– 256-bit Multiply + 256-bit ADD + 256-bit Load per clock
[Diagram: Port 0/1/5 execution stacks, with the FP units paired across both 128-bit stacks to form 256-bit AVX datapaths]
“Cool” Implementation of Intel AVX
256-bit Multiply + 256-bit ADD + 256-bit Load per clock…
Double your FLOPs with great energy efficiency
Foil taken from IDF 2011
The Renamer
 Moves ≤ 4 μops/cycle from the μop queue to the OOO engine
– Renames architectural sources and destinations of the μops to
micro-architectural sources and destinations
– Allocates resources to the μops, e.g., load or store buffers
– Binds the μop to an appropriate dispatch port
 Some μops can execute to completion during rename,
effectively costing no execution bandwidth
– Zero idioms (dependency breaking idioms)
– NOP
– VZEROUPPER
– FXCHG
– Ivy Bridge: a subset of register-to-register MOV
 The renamer can allocate two branches each cycle
From the Optimization Manual
Dependency Breaking Idioms
 Instruction parallelism can be improved by zeroing register content
 Zero idiom Examples
– XOR REG,REG
– SUB REG,REG
 Zero idioms are detected and removed by the renamer
– Have zero execution latency
– They do not consume any execution resource
 There is another dependency breaking idiom – the "ones idiom"
– CMPEQ XMM1, XMM1; the "ones idiom" sets all elements to all "ones"
 Regardless of the input data, the output data is always "all ones"
 No μop dependency on its sources, as with the zero idiom
 Can execute as soon as it finds a free execution port
– As opposed to the 0-idiom, the "ones idiom" μop must execute
From the Optimization Manual
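An illustrative snippet contrasting the two idioms (my own; the generated assembly is compiler dependent):

    #include <emmintrin.h>

    /* Returning 0 typically compiles to "XOR EAX,EAX" - a zero idiom the
     * renamer resolves without consuming an execution port. */
    int zero_value(void) {
        return 0;
    }

    /* Ones idiom: PCMPEQD xmm,xmm produces all-ones bits regardless of the
     * (possibly stale) input, so it has no dependence on its source - but,
     * unlike the zero idiom, the uop still has to execute. */
    __m128i all_ones(__m128i x) {
        return _mm_cmpeq_epi32(x, x);
    }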
Memory Cluster
[Diagram: one load port and one store-address + store-data path between the 32KB 8-way L1 Data Cache, the fill buffers, the store buffers, memory control, and the 256KB L2 Cache (MLC), at 32 bytes/cycle]
• Memory Unit can service two memory requests per cycle
– 16 bytes load and 16 bytes store per cycle
• Goal:
Maintain the historic bytes/flop ratio of SSE for Intel® AVX
Foil taken from IDF 2011
Memory Cluster
[Diagram: two load/store-address ports plus a store-data port between the 32KB 8-way L1 Data Cache, the fill buffers, the store buffers, memory control, and the 256KB L2 Cache (MLC), at 48 bytes/cycle]
• Solution : Dual-Use the existing connections
– Make load/store pipes symmetric
• Memory Unit services three data accesses per cycle
– 2 read requests of up to 16 bytes AND 1 store of up to 16 bytes
– Internal sequencer deals with queued requests
• Second Load Port is one of highest performance features
– Required to keep Intel® AVX fed
– Linear power/performance
Foil taken from IDF 2011
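An illustrative kernel (mine, not from the foil): each element needs two loads and one store, which is exactly the access pattern the symmetric load/store-address pipes plus the store-data port are sized to sustain.

    /* Streaming "triad": 2 loads + 1 store per element. When vectorized with
     * AVX, the two 16-byte load ports plus the store port are what keep the
     * 256-bit datapath fed from the L1 D$. */
    void triad(float *c, const float *a, const float *b, long n) {
        for (long i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }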
Cache Hierarchy
Level            Capacity  Ways    Line Size  Write Update  Inclusive  Latency    Bandwidth
                                   (bytes)    Policy                   (cycles)   (bytes/cyc)
L1 Data          32KB      8       64         Write-back    -          4          2×16 + 16
L1 Instruction   32KB      8       64         N/A           -          -          -
L2 (Unified)     256KB     8       64         Write-back    No         12         1×32
LLC              Varies    Varies  64         Write-back    Yes        26-31      1×32
 The LLC is inclusive of all cache levels above it
– Data contained in the core caches must also reside in the LLC
– Each LLC cache line holds an indication of the cores that may have
this line in their L2 and L1 caches
 Fetching data from LLC when another core has the data
– Clean hit – data is not modified in the other core – 43 cycles
– Dirty hit – data is modified in the other core – 60 cycles
From the Optimization Manual
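A rough way to observe these latencies (a sketch under my own assumptions; the timing harness and the randomization of the chain are omitted): dependent loads expose the load-to-use latency of whichever level the working set fits in.

    /* Pointer chasing: every load depends on the previous one, so the loop
     * runs at roughly one load latency per iteration (~4 cycles in the L1 D$,
     * ~12 in the L2, ~26-31 in the LLC per the table above). */
    long chase(const long *next, long start, long iters) {
        long p = start;
        for (long i = 0; i < iters; i++)
            p = next[p];          /* serialized, dependent loads */
        return p;                 /* returned so the loop is not optimized away */
    }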
Stores
 Stores to memory are executed in two phases
 At exe: fill store buffers with linear+phy address and with data
– Once store address and data are known,
store data can be forwarded to load operations that need it
 After the store retires – completion phase
– First, the line must be in L1 D$, in Exclusive or Modified MESI state
– Otherwise fetch it using a Read for Ownership request
 Look up the line in the following locations, in the following order:
L1 D$ → L2$ → LLC → L2 and L1 D$ in other cores → Memory
– Then, read the data from the store buffers and write it to the L1 D$
 Mark the cache line as Modified
– All this is done at retirement – preserving the order of memory writes
 Store latency usually does not affect the store itself
– However, a burst of stores missing the L1 D$ may hurt performance
– As long as the store is not complete, it occupies a store buffer entry
– When the store buffer becomes full, allocation stalls
From the Optimization Manual
L1 Data Cache
 Handles two 16 byte loads and one 16 byte store per cycle
 Maintains requests which cannot be completed
– Cache misses
– Unaligned access that splits across cache lines
– Data not ready to be forwarded from a preceding store
– Loads experiencing bank collisions
– Loads blocked due to cache line replacement
 L1-D$ split into 8 × 8 byte banks (bits [4:2] define the banks)
– An unaligned 16-byte load can cover up to 3 banks
 With 2 loads/cycle, up to six banks may be accessed per cycle
– A bank conflict happens when two load accesses need the same bank
 Handles up to 10 outstanding cache misses in the LFBs
– Continues to service incoming stores and loads
 The L1 D$ is a write-back write-allocate cache
– Stores that hit in the L1-D$ do not update the L2/LLC/Mem
– Stores that miss the L1-D$ allocate a cache line
From the Optimization Manual
Load Buffer and Store Buffer
 64 entry load buffer
– Keeps load μops from allocation until retirement
– Re-dispatch blocked loads
 36 entry store buffer
– Keeps stores from allocation until the store value is written to L1-D$
(or written to the line fill buffers – for non-temporal stores)
From the Optimization Manual
Store Forwarding
 Forward data directly from a store to a load that needs it
– Instead of having the store write the data to the DCU and then the
load read the data from the DCU
 Rules that must be met to enable store to load forwarding
– The store must be the last store to that address, prior to the load
– The store must contain all data being loaded
– The load is from a write-back memory type, and neither the load nor the store is a non-temporal access
 Stores cannot forward to loads in the following cases
– Four byte and eight byte loads that cross eight byte boundary,
relative to the preceding 16- or 32-byte store
– Any load that crosses a 16-byte boundary of a 32-byte store
From the Optimization Manual
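An illustrative fragment (the layout and sizes are my own choices): the first reload satisfies the forwarding rules, while the wider reload does not, because the store does not contain all the data being loaded.

    #include <stdint.h>
    #include <string.h>

    uint64_t forwarding_example(void) {
        uint8_t buf[16] = {0};
        uint32_t v = 0x11223344;

        memcpy(&buf[0], &v, 4);          /* 4-byte store to buf[0..3] */

        uint32_t ok;
        memcpy(&ok, &buf[0], 4);         /* 4-byte load fully contained in the
                                            store above: forwardable */
        uint64_t blocked;
        memcpy(&blocked, &buf[0], 8);    /* 8-byte load needs bytes the store did
                                            not write: forwarding is blocked and
                                            the load waits for the store to
                                            complete in the L1 D$ */
        return ok + blocked;
    }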
Memory Disambiguation
 A load may depend on a preceding store
– A load needs to block until all preceding store addresses are known
 Predict which loads do not depend on any previous stores
– Get data from the L1 D$ even when previous store address is unknown
– The prediction is verified, and if a conflict is detected
 The load and all succeeding instructions are re-executed
– Always assumes dependency between loads and earlier stores that
have the same address bits 0:11
 The following loads are not disambiguated
– Loads that cross the 16-byte boundary
– 32-byte Intel AVX loads that are not 32-byte aligned
 Execution is stalled until the addresses of all previous stores are known
From the Optimization Manual
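A small illustration of the conservative bits 0:11 check (my own example; the exact 4 KB spacing is the assumption that triggers it): a load that follows a store whose address differs by a multiple of 4 KB is treated as dependent even though the accesses never overlap.

    /* ASSUMPTION: b == a + 1024, i.e., b starts exactly 4 KB after a.
     * The load of b[i] then shares address bits 0:11 with the older store to
     * a[i], so the disambiguator assumes a dependence and delays the load. */
    float false_dependence(float *a, float *b, long n) {
        float s = 0.0f;
        for (long i = 0; i < n; i++) {
            a[i] = 1.0f;      /* store at  a + 4*i        */
            s   += b[i];      /* load  at  a + 4096 + 4*i */
        }
        return s;
    }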
Data Prefetching
 Two hardware prefetchers load data to the L1 D$
 DCU prefetcher (the streaming prefetcher)
– Triggered by an ascending access to very recently loaded data
– Fetches the next line, assuming a streaming load
 Instruction pointer (IP)-based stride prefetcher
– Tracks individual load instructions, detecting a regular stride
 Prefetch address = current address + stride
 Detects strides of up to 2K bytes, both forward and backward
From the Optimization Manual
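An access pattern the IP-based stride prefetcher handles well (my example; the 256-byte stride is simply chosen to stay under the 2 KB limit):

    /* The same load instruction advances by a fixed 64*sizeof(int) = 256-byte
     * stride each iteration; the prefetcher learns the stride and issues
     * prefetches to current address + stride ahead of the demand loads. */
    long strided_sum(const int *a, long n) {
        long s = 0;
        for (long i = 0; i < n; i += 64)
            s += a[i];
        return s;
    }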
Putting it together
Sandy Bridge Microarchitecture
[Full pipeline diagram: front end (32KB L1 instruction cache, pre-decode, instruction queue, 4 decoders, branch prediction, 1.5K-μop cache); in-order Allocate/Rename/Retire with load, store and reorder buffers and zeroing idioms; out-of-order scheduler feeding Port 0 (ALU, VI MUL, VI Shuffle, DIV, AVX FP MUL), Port 1 (ALU, VI ADD, VI Shuffle, AVX FP ADD), Port 5 (ALU, JMP, AVX/FP Shuffle, AVX/FP Boolean, AVX FP Blend), Ports 2/3 (Load / Store Address) and Port 4 (Store Data); memory control, fill buffers, 32KB L1 data cache and L2 cache (MLC), 48 bytes/cycle]
Foil taken from IDF 2011
Haswell Execution Units Overview