Haswell
Thomas Shull
Bhargava Reddy Gopi Reddy
Raghavendra Pradyumna Pothukuchi
RISC-y
• Find each instruction, i.e. decode its length.
• Each x86 instruction (macro-op) is chopped into “µOps”.
• Some macro-op combinations can be treated as one instruction and are packed together (macro-fusion).
• CMP <–> JUMP IF; ± <–> TEST
• Some µOps are packed into one µOp and are later implicitly broken up (micro-fusion).
• ADD [EBX] EAX -> MOV ECX [EBX]; ADD ECX EAX -> ADD [EBX] EAX
www.realworldtech.com/haswell-cpu/
Fetch and Decode
• Multicycle power hungry decode.
• µOps are cached.
• 32 sets, 8 ways, 6 µOps per line.
• A 32B window (18 µOps at maximum) is inserted at once.
Why? AVX!
• If a 32B window has more than 18 µOps, it is not inserted.
• Delivers at most 4 µOps on a “full hit”.
• Double bandwidth (32B vs. 16B) on a hit.
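The µOp cache figures above imply a total capacity and a per-window line budget. A minimal sketch of that arithmetic follows; the helper names `can_cache` and `lines_needed` are hypothetical, and dense packing of µOps into lines is an assumption rather than something the slide states.

```python
# Figures taken from the slide: 32 sets, 8 ways, 6 µOps per line.
SETS = 32
WAYS = 8
UOPS_PER_LINE = 6
MAX_UOPS_PER_WINDOW = 18  # limit per 32B fetch window

total_lines = SETS * WAYS                      # lines in the µOp cache
total_uops = total_lines * UOPS_PER_LINE       # µOp capacity
lines_per_window = MAX_UOPS_PER_WINDOW // UOPS_PER_LINE


def can_cache(uops_in_window: int) -> bool:
    """A 32B window is inserted only if it decodes to at most 18 µOps."""
    return uops_in_window <= MAX_UOPS_PER_WINDOW


def lines_needed(uops_in_window: int) -> int:
    """Lines used by a window, assuming µOps are packed densely (assumption)."""
    return -(-uops_in_window // UOPS_PER_LINE)  # ceiling division


print(total_lines, total_uops, lines_per_window)
```

Under these numbers the cache holds 1536 µOps, and one 32B window can occupy up to three lines of a set.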
Renaming and Oh Oh Oh!
• Renaming – Map from logical registers to physical
registers (PRF) and allocate resources.
• ROB is a placeholder.
• Break the fused µOps into simpler µOps.
Scheduler
• 8 Issue Ports
• 1 WB per Port
• INT, FP, SIMD networks + MEM
• Forwarding data between networks incurs an extra penalty.
• Register-register moves are folded by just changing the PRF map.
• Extra pipeline stage for dereferencing links
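Folding a register-register move by editing the PRF map can be sketched with a toy rename table. `RenameTable`, `rename_dest`, and `fold_move` are hypothetical names for illustration; the sketch ignores how shared physical registers are reclaimed, which real hardware has to track.

```python
class RenameTable:
    """Toy logical-to-physical register map (illustrative, not Haswell's design)."""

    def __init__(self, logical_regs, num_physical):
        # Each logical register starts mapped to its own physical register.
        self.map = {reg: i for i, reg in enumerate(logical_regs)}
        # Remaining physical registers are free for new results.
        self.free = list(range(len(logical_regs), num_physical))

    def rename_dest(self, reg):
        """Allocate a fresh physical register for a result written to reg."""
        phys = self.free.pop(0)
        self.map[reg] = phys
        return phys

    def fold_move(self, dst, src):
        """MOV dst, src: no execution unit needed, dst simply aliases src."""
        self.map[dst] = self.map[src]
        return self.map[dst]


table = RenameTable(["EAX", "EBX", "ECX"], num_physical=8)
table.rename_dest("EAX")        # e.g. ADD EAX, ... writes a new physical register
table.fold_move("EBX", "EAX")   # MOV EBX, EAX consumes no free register
print(table.map)
```

After the move, both logical registers point at the same physical register, so the free list shrinks only for the ADD.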
Execution Units
[Figure: 60-entry unified scheduler feeding Ports 0 to 7; the units shown include Int, Vector, FMA, Div, Branch, Mem, and Store.]
Did we forget something?
Branch Predictor !!
• More entries in BTB (fewer bits per entry!)
• Entries with fewer offset bits
• Use the space saved for global branch prediction
• 2-level global predictor? 1-bit entries?
• 14 to 17 cycles of misprediction penalty.
• 56 entry µOp buffer for identifying small loops
Big Picture:
14 stage pipeline
Memory Hierarchy – For Data
• Unified scheduler, Load Buffer, Store Buffer
• Port 2 and Port 3: 64-bit AGU each
• Port 4: Store Data
• Port 7: Store AGU
• 32 KB L1 D Cache (8-way): 2x32B loads and 32B store
• 256KB L2 Cache (8-way): 64B path to L1 D
• L3/LLC
• L1 TLB (4-way): 4K pages – 64 entries, 2M/4M pages – 32 entries, 1G pages – 4 entries
• L2 TLB: 1024 entries, shared, 8-way
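The cache and TLB parameters above translate directly into set counts and TLB reach. A minimal sketch of that arithmetic, assuming 64B cache lines (the line size is not stated on the slide) and using the hypothetical helper `num_sets`:

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30
LINE_BYTES = 64  # assumption: 64B cache lines, not given on the slide


def num_sets(capacity_bytes, ways, line_bytes=LINE_BYTES):
    """Sets = capacity / (associativity x line size)."""
    return capacity_bytes // (ways * line_bytes)


l1d_sets = num_sets(32 * KB, ways=8)
l2_sets = num_sets(256 * KB, ways=8)

# TLB reach = number of entries x page size.
l1_tlb_reach = {
    "4K": 64 * 4 * KB,
    "2M": 32 * 2 * MB,
    "1G": 4 * 1 * GB,
}
l2_tlb_reach_4k = 1024 * 4 * KB

print(l1d_sets, l2_sets)
print({size: reach // KB for size, reach in l1_tlb_reach.items()}, "KB")
print(l2_tlb_reach_4k // MB, "MB")
```

With 4K pages the L1 TLB covers 256 KB, the same size as the L2 cache, while the 1024-entry L2 TLB covers 4 MB.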
L3 (Also Last Level Cache)
• Banked structure, one bank per core
• Shared and fully inclusive
• Separate tag arrays
• One for data requests
• One for prefetches and coherency requests
• Point of coherence
• Separate frequency domain from CPU
• Helps to run CPU, GPU and LLC at different speeds as necessary
[Figure: System Agent, Core0 to Core3 each with an L3 bank, and GPU]
The Ring
• Ring stops
• Core/L3 bank (Cachebox) can send/receive two packets on the ring each cycle
• Up direction
• Down direction
• GPU and System Agent can send only one per cycle
• Ring actually consists of 4 rings
[Figure: ring connecting System Agent, Core0 to Core3 with their L3 banks, and GPU]
Memory Controller
• 2 Clock Domains
• DCLK – DDR command clock
• QCLK – DDR data clock
• The requested 32B are returned first
• Maintains a table of open DRAM pages and the corresponding requests
• Page hits are given priority, which increases bandwidth
• Reads are given priority
• Write Data Buffer holds pending writes
• Write merging can happen in the Write Data Buffer
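The two priority rules above can be sketched as a sort key over pending requests. This is an illustration, not the controller's actual policy: the `Request` fields, the `schedule` helper, and the choice to rank reads above page hits are assumptions, and the open-page table is treated as a fixed snapshot.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int     # arrival order
    is_write: bool
    bank: int
    row: int         # DRAM page (row) the request targets


def schedule(requests, open_rows):
    """Order requests: reads first, then page hits, then oldest first.

    open_rows maps bank -> currently open row (a static snapshot here).
    """
    def key(req):
        page_hit = open_rows.get(req.bank) == req.row
        return (req.is_write, not page_hit, req.arrival)

    return sorted(requests, key=key)


pending = [
    Request(arrival=0, is_write=True, bank=0, row=5),
    Request(arrival=1, is_write=False, bank=0, row=7),
    Request(arrival=2, is_write=False, bank=0, row=5),
]
for req in schedule(pending, open_rows={0: 5}):
    print(req)
```

Here the youngest read is served first because it hits the open page, and the write waits behind both reads.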
System Agent
• Contains
• Memory Controller
• PCI Express Controller
• DMI Controller
• Display Engine
• Power Control Unit
[Figure: System Agent (Display Engine, PCIE, DMI, PCU, I/O, Memory Controller) alongside Core0 to Core3 with their L3 banks, and GPU]
Multithreading
• Use atomic operations to control access to items
used by multiple threads
• Obtain and release locks for critical sections
• Intel currently supports making the following operations atomic by adding a “LOCK” prefix:
• ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG
• MOV is also atomic on aligned accesses (LEAL only computes an address and does not access memory)
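A critical section guarded by a lock can be sketched as follows. Python does not expose the LOCK-prefixed instructions, so `threading.Lock` stands in for the hardware mechanism; the function name `locked_count` and the thread and iteration counts are hypothetical.

```python
import threading


def locked_count(num_threads=4, iterations=10_000):
    """Increment a shared counter from several threads under one lock."""
    counter = 0
    mutex = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(iterations):
            with mutex:          # obtain the lock, release it on exit
                counter += 1     # critical section: read-modify-write

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


print(locked_count())
```

Because every read-modify-write happens while holding the lock, no increment is lost and the result equals the number of threads times the iteration count.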
Transactional Memory
• Main idea: try to run critical sections without locks and
monitor for conflicts
• Use Read and Write Sets to log memory accesses in
transactional sections
• If conflicts occur, abort and revert register state to the
beginning of transaction
• If successful, commit the changes to memory so they are
visible to other threads
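The read-set and write-set bookkeeping can be modelled in software. The sketch below is a simplified model, not how the hardware works: it validates the read set at commit time using per-address version counters, whereas hardware is generally described as detecting conflicts through cache coherence. `Memory`, `Transaction`, and `Conflict` are hypothetical names, and blind write-write conflicts are not detected.

```python
class Conflict(Exception):
    """Raised when a transaction must abort."""


class Memory:
    def __init__(self):
        self.data = {}      # address -> committed value
        self.version = {}   # address -> count of committed writes


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_set = {}   # address -> version observed at first read
        self.write_set = {}  # address -> buffered (uncommitted) value

    def read(self, addr):
        if addr in self.write_set:          # read our own buffered write
            return self.write_set[addr]
        self.read_set.setdefault(addr, self.mem.version.get(addr, 0))
        return self.mem.data.get(addr, 0)

    def write(self, addr, value):
        self.write_set[addr] = value        # invisible to others until commit

    def commit(self):
        # Abort if anything we read was committed by someone else meanwhile.
        for addr, seen in self.read_set.items():
            if self.mem.version.get(addr, 0) != seen:
                raise Conflict(addr)        # buffered writes are simply dropped
        for addr, value in self.write_set.items():
            self.mem.data[addr] = value
            self.mem.version[addr] = self.mem.version.get(addr, 0) + 1


mem = Memory()
t1, t2 = Transaction(mem), Transaction(mem)
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 1)
t1.commit()                      # succeeds, x becomes visible as 1
try:
    t2.commit()                  # conflicts with t1's committed write
except Conflict:
    print("t2 aborted; x =", mem.data["x"])
```

An aborted transaction leaves memory untouched because its writes only ever lived in its write set.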
Restricted Transactional Memory
• Haswell is the first Intel mainstream processor to include Transactional Memory
• Added Transactional Synchronization eXtensions (TSX)
• New instructions for Restricted Transactional Memory
• XBEGIN – indicates start of transaction
• XEND – indicates end of transaction
• XABORT – used for testing; aborts transaction
• XTEST – indicates whether executing in a transactional region
• Must have pointer to code that runs upon an abort
• Requires code to be rewritten using transactional sections
Integrated Graphics
• Supports 3 simultaneous displays, HDMI
• Scalable architecture: different versions of processor (GT1, GT2, GT3) offer different numbers of Execution Units (EUs), among other upgrades
Figure taken from “Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell” presentation, Intel Developer Forum, San Francisco, 2012
• Multiple video encoding and decoding formats supported in hardware.
• Supported encodings include MPEG4, MPEG2, SVC
• Supports OpenCL 1.1, OpenGL 4.0
Power Management
• Three voltage domains
• Allows the screen to be updated while the processor is turned off
• Voltage regulators are on chip
• Power gating
• New power-saving states
• S0ix idle states
• Recommends power levels and response times for vendors
• Uses 20x less power than previous S0 state
Figure taken from “Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations” presentation, Intel Developer Forum, San Francisco, 2012
Recap:
• 14 stage pipeline
• 4 cores, SMT machine
• In order issue, Out of Order execution, In order commit.
• Wider data paths and extra Store AGU to provide more bandwidth in AVX2 computations
• LLC/Ring is the point of coherence and distributed arbitration of requests.
• Intel TSX
• Added support for Restricted Transactional Memory
• Integrated Graphics and Improved Power Management
• Power Efficiency is a huge emphasis
Resources
General Information
• Technology Insight: Intel Next Generation Microarchitecture
Code Name Haswell. Presented at IDF 2012 by Tom Piazza,
Hong Jiang, Per Hammarlund, Ronak Singhal
• Intel Next Generation Micro Architecture Codename Haswell:
New Processor Innovations. Presented at IDF 2012 by Robert
Chappell, Bret Toll, Ronak Singhal
• Kanter, David. Intel’s Haswell CPU Microarchitecture.
November 13, 2012. www.realworldtech.com/haswell-cpu/
• Kanter, David. Analysis of Haswell’s Transactional Memory.
February 15, 2012. www.realworldtech.com/haswell-tm/
• Shimpi, Anand Lal. Intel’s Haswell Architecture Analyzed:
Building a New PC and a New Intel. October 5, 2012.
www.anandtech.com/show/6355/intels-haswell-architecture
• Introducing SandyBridge. Presented at IDF 2010 by Bob
Valentine.
• Sandy Bridge Spans Generations. Microprocessor Report.
September 2010
Resources
Processor Core
• Fog, Agner. The microarchitecture of Intel, AMD and VIA
CPUs, An optimization guide for assembly programmers and
compiler makers. Copenhagen University College of
Engineering
• Intel 64 and IA-32 Architectures Optimization Reference
Manual. Order Number: 248966-026. April 2012
Transactional Memory
• Intel Transactional Synchronization Extensions. Presented at
IDF 2012 by Ravi Rajwar, Martin Dixon
• Intel Architecture Instruction Set Extensions Programming
Reference Manual. Order Number: 319433-012A. February
2012
• Gelas, J and Hamm, C. Making Sense of the Intel Haswell
Transactional Synchronization eXtensions. September 15, 2012.
www.anandtech.com/show/6290/making-sense-of-intelhaswell-transactional-synchronization-extensions
Extra Slides
Current Locking Strategies
acquire_lock(mutex)
release_lock(mutex)
Scalability Issues
Figure taken from “Making Sense of the Intel Haswell Transactional Synchronization eXtensions”.
As core count increases, efficiency is drastically reduced!
Lock Elision
• Idea introduced by Ravi Rajwar and James R. Goodman in 2001
• Remove locks, run code as a transaction
• If there are conflicts, abort and rerun code with locks intact
• On success, commit the transaction’s writes to memory
• To other threads the lock still remains available
• Reduces execution time if conflicts do not occur
• Guarantees correctness by using the transactional memory
• New instructions implement Lock Elision
• XACQUIRE: denotes start of lock elision section
• XRELEASE: denotes end of lock elision section
• These are added as prefixes to existing instructions
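The control flow of lock elision (try without the lock, fall back to the lock on abort) can be sketched in software. This is a model of the idea, not of the XACQUIRE/XRELEASE hardware: `ElidedLock`, its `run` method, and the `try_transaction` callable are hypothetical, and the stand-in transaction here aborts a fixed number of times instead of detecting real conflicts.

```python
import threading


class ElidedLock:
    """Toy model: run the critical section speculatively, lock only on abort."""

    def __init__(self):
        self._lock = threading.Lock()
        self.elided = 0      # sections that committed without taking the lock
        self.fallbacks = 0   # sections rerun with the lock held

    def run(self, critical_section, try_transaction):
        # try_transaction(section) returns True on commit, False on abort.
        # Speculation is only attempted while the lock looks free.
        if not self._lock.locked() and try_transaction(critical_section):
            self.elided += 1
            return
        with self._lock:     # abort path: rerun with the lock intact
            self.fallbacks += 1
            critical_section()


def make_flaky_transaction(abort_first_n):
    """Stand-in for hardware: aborts the first N attempts, then commits."""
    remaining = {"aborts": abort_first_n}

    def try_transaction(section):
        if remaining["aborts"] > 0:
            remaining["aborts"] -= 1
            return False     # aborted: the section's effects never happened
        section()
        return True

    return try_transaction


total = [0]
lock = ElidedLock()
attempt = make_flaky_transaction(abort_first_n=1)
for _ in range(3):
    lock.run(lambda: total.__setitem__(0, total[0] + 1), attempt)
print(total[0], lock.elided, lock.fallbacks)
```

The caller only ever sees `run`, which mirrors the point that elision can live inside a locking library without changing the programming model.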
Lock Elision
acquire_lock(mutex)
release_lock(mutex)
Changes can be made in library functions; the user does not have to adopt a new programming paradigm.
Performance Benefits
Intel says using TSX Helps!
Figure taken from “Intel Transactional Synchronization Extensions” presentation, Intel Developer Forum, San Francisco, 2012
Software Transactional Memory has been researched, but
the overhead in software negated performance benefits