Haswell
Thomas Shull, Bhargava Reddy Gopi Reddy, Raghavendra Pradyumna Pothukuchi

RISC-y
• Find each instruction, i.e. decode its length.
• Each x86 instruction (Macro Op) is chopped into “µOps”.
• Some Macro Op combos can be treated as 1 instruction. Pack them together.
  • CMP <–> JUMP IF; ± <–> TEST
• Some µOps are packed into 1 µOp and are later implicitly broken up.
  • ADD [EBX] EAX -> MOV ECX [EBX] ADD ECX EAX -> ADD [EBX] EAX
www.realworldtech.com/haswell-cpu/

Fetch and Decode
• Multicycle, power-hungry decode.
• µOps are cached.
• 32 sets, 8 ways, 6 µOps per line.
• A 32B window (18 µOps at maximum) is inserted at once. Why? AVX!
• If a 32B window has more than 18 µOps, do not insert it.
• Deliver at most 4 µOps on a “full hit”.
• Double bandwidth (32B vs. 16B) on a hit.

Renaming and Oh Oh Oh!
• Renaming – Map from logical registers to physical registers (PRF) and allocate resources.
• ROB is a placeholder.
• Break the fused µOps into simpler Ops.
www.realworldtech.com/haswell-cpu/

Scheduler
• 8 Issue Ports
• 1 WB per Port
• INT, FP, SIMD networks + MEM
• More penalty for inter-network data forwarding.
• Register-Register moves are folded by just changing the PRF map.
• Extra pipeline stage for dereferencing links

Execution Units
[Figure: 60 Entry Unified Scheduler issuing to Port 0 through Port 7; unit labels shown include FMA, Vector, Div, Int, Branch, Mem and Store]

Did we forget something? Branch Predictor !!
• More entries in BTB (less per entry!)
• Entries with fewer offset bits
• Use the space saved for global branch prediction
• 2 level global predictor? 1-bit entries?
• 14-17 cycles of misprediction penalty.
• 56 entry µOp buffer for identifying small loops

Big Picture: 14 stage pipeline
www.realworldtech.com/haswell-cpu/

Memory Hierarchy – For Data
[Figure: data path from the unified scheduler, Load Buffer and Store Buffer through the caches and TLBs. Recoverable labels:]
• Port 2 and Port 3: 64-bit AGU each
• Port 4: Store Data
• Port 7: Store AGU
• 32 KB L1 D Cache (8-way): 2x32B load path, 32B store path
• 256KB L2 Cache (8-way): 64B path to the L1 D Cache
• L3/LLC
• 4-way L1 TLB: 4K pages – 64 entries, 2M/4M pages – 32 entries, 1G pages – 4 entries
• 1024 Entry Shared 8-way L2 TLB

L3 (Also Last Level Cache)
[Figure: ring connecting System Agent, Core0 to Core3 (each with its L3 bank) and GPU]
• Banked Structure, One bank per core
• Shared and Fully inclusive
• Separate tag arrays
  • One for Data Requests
  • One for Prefetches and Coherency Requests
• Point of Coherence
• Separate Frequency domain from CPU
  • Helps to run CPU, GPU and LLC at different speeds as necessary

The Ring
• Ring stops
  • Core/L3 bank (Cachebox) can send/receive two packets on the ring each cycle
    • Up direction
    • Down direction
  • GPU and System Agent can send only one per cycle
• Ring actually consists of 4 Rings

Memory Controller
• 2 Clock Domains
  • DCLK – DDR command clock
  • QCLK – DDR data clock
• Requested 32B are returned first
• Maintains a table of open DRAM pages and the corresponding requests
  • Page Hits are given priority -> increases the bandwidth
• Reads are given priority
• Write Data Buffer to maintain writes
  • Write Merging can happen in the Write Data Buffer

System Agent
• Contains
  • Memory Controller
  • PCI Express Controller
  • DMI Controller
  • Display Engine
  • Power Control Unit
  • I/O

Multithreading
• Use atomic operations to control access to items used by multiple threads
• Obtain and release locks for critical sections
• Intel currently supports making the following operations atomic by prepending a “LOCK” prefix:
  • ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG
• MOV and LEAL are also atomic on aligned accesses
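
The semantics of two of the LOCK-prefixed operations listed above, XADD and CMPXCHG, can be sketched in software. This is only a toy model under stated assumptions: Python has no LOCK-prefixed instructions, so a private mutex stands in for the hardware guarantee that the read-modify-write is indivisible. The names AtomicCell, cas_increment and run_workers are hypothetical and do not come from the slides.

```python
import threading


class AtomicCell:
    """Toy model of one memory word that supports LOCK-prefixed operations.

    Assumption: the private mutex only stands in for the hardware guarantee
    that a LOCK-prefixed instruction is indivisible. It is not how a CPU
    implements atomicity.
    """

    def __init__(self, value=0):
        self._value = value
        self._hw = threading.Lock()

    def load(self):
        with self._hw:
            return self._value

    def xadd(self, delta):
        """Model of LOCK XADD: add delta and return the previous value."""
        with self._hw:
            old = self._value
            self._value = old + delta
            return old

    def cmpxchg(self, expected, new):
        """Model of LOCK CMPXCHG: store new only if the word equals expected.

        Returns (success, value_seen).
        """
        with self._hw:
            seen = self._value
            if seen == expected:
                self._value = new
                return True, seen
            return False, seen


def cas_increment(cell):
    """Increment the cell with a compare-and-swap retry loop."""
    expected = cell.load()
    while True:
        ok, seen = cell.cmpxchg(expected, expected + 1)
        if ok:
            return
        expected = seen  # another thread won the race; retry with its value


def run_workers(cell, op, n_threads=4, n_iters=2000):
    """Apply op(cell) n_iters times on each of n_threads threads."""
    def worker():
        for _ in range(n_iters):
            op(cell)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()
```

With these helpers, `run_workers(AtomicCell(), cas_increment)` loses no updates, because every successful cmpxchg replaces exactly the value it observed. A plain, unsynchronized read-add-write sequence gives no such guarantee.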

Transactional Memory
• Main idea: try to run critical sections without locks and monitor for conflicts
• Use Read and Write Sets to log memory accesses in transactional sections
• If conflicts occur, abort and revert register state to the beginning of the transaction
• If successful, commit the changes to memory so they are visible to other threads

Restricted Transactional Memory
• Haswell is the first Intel mainstream processor to include Transactional Memory
• Added Transactional Synchronization Extensions (TSX)
• New instructions for Restricted Transactional Memory
  • XBEGIN – indicates start of transaction
  • XEND – indicates end of transaction
  • XABORT – used for testing; aborts transaction
  • XTEST – indicates whether executing in a transactional region
• Must have a pointer to code that runs upon an abort
• Requires code to be rewritten using transactional sections

Integrated Graphics
• Supports 3 simultaneous displays, HDMI
• Scalable Architecture: different versions of the processor (GT1, GT2, GT3) offer different numbers of Execution Units (EUs), among other upgrades
• Multiple Video Encoding and Decoding Support in Hardware.
  • Supported encodings include MPEG4, MPEG2, SVC
• Supports OpenCL 1.1, OpenGL 4.0
Figure taken from “Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell” Presentation. Intel Developers Forum, San Francisco, 2012

Power Management
• Three Voltage Domains
  • Allows for the screen to be updated while the processor is turned off
• Voltage Regulators are on chip
• Power Gating
• New Power Saving States
  • S0ix idle states
  • Recommends power levels and response times for vendors
  • Uses 20x less power than the previous S0 state
Figure taken from “Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations” Presentation. Intel Developers Forum, San Francisco, 2012

Recap:
• 14 stage pipeline
• 4 cores, SMT machine
• In order issue, Out of Order execution, In order commit.
• Wider data paths and extra Store AGU to provide more bandwidth in AVX2 computations
• LLC/Ring is the point of coherence and distributed arbitration of requests.
• Intel TSX
  • Added support for Restricted Transactional Memory
• Integrated Graphics and Improved Power Management
  • Power Efficiency is a huge emphasis

Resources
General Information
• Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell. Presented at IDF 2012 by Tom Piazza, Hong Jiang, Per Hammarlund, Ronak Singhal
• Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations. Presented at IDF 2012 by Robert Chappell, Bret Toll, Ronak Singhal
• Kanter, David. Intel’s Haswell CPU Microarchitecture. November 13, 2012. www.realworldtech.com/haswell-cpu/
• Kanter, David. Analysis of Haswell’s Transactional Memory. February 15, 2012. www.realworldtech.com/haswell-tm/
• Lal Shimpi, Anand. Intel’s Haswell Architecture Analyzed: Building a New PC and a New Intel. October 5, 2012. www.anandtech.com/show/6355/intels-haswell-architecture
• Introducing Sandy Bridge. Presented at IDF 2010 by Bob Valentine.
• Sandy Bridge Spans Generations. Microprocessor Report. September 2010

Resources
Processor Core
• Fog, Agner. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. Copenhagen University College of Engineering
• Intel 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-026. April 2012
Transactional Memory
• Intel Transactional Synchronization Extensions. Presented at IDF 2012 by Ravi Rajwar, Martin Dixon
• Intel Architecture Instruction Set Extensions Programming Reference Manual. Order Number: 319433-012A. February 2012
• Gelas, J. and Hamm, C. Making Sense of the Intel Haswell Transactional Synchronization eXtensions. September 15, 2012.
www.anandtech.com/show/6290/making-sense-of-intel-haswell-transactional-synchronization-extensions

Extra Slides

Current Locking Strategies
[Figure: critical section bracketed by acquire_lock(mutex) and release_lock(mutex)]

Scalability Issues
• As core count increases, efficiency is drastically reduced!
Figure taken from “Making Sense of the Intel Haswell Transactional Synchronization eXtensions”.

Lock Elision
• Idea introduced by Ravi Rajwar and James R. Goodman in 2001
  • Remove locks, run code as a transaction
  • If there are conflicts, abort and rerun code with locks intact
  • On success, commit the transaction’s writes to memory
• To other threads the lock still remains available
• Reduces execution time if conflicts do not occur
• Guarantees correctness by using the transactional memory
• Has new instructions to implement Lock Elision
  • XACQUIRE: denotes start of lock elision section
  • XRELEASE: denotes end of lock elision section
• These options are added as prefixes to existing instructions

Lock Elision
[Figure: critical section bracketed by acquire_lock(mutex) and release_lock(mutex)]
• Changes can be made in library functions.
• User does not have to adopt a new programming paradigm

Performance Benefits
• Intel says using TSX helps!
• Software Transactional Memory has been researched, but the overhead in software negated the performance benefits
Figure taken from “Intel Transactional Synchronization Extensions” Presentation. Intel Developers Forum, San Francisco, 2012
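
The lock elision flow described in these slides (run the critical section as a transaction, abort on conflict, rerun with the lock held) can be modeled in software. The following is a minimal sketch and not how TSX works internally: it assumes per-address version counters and validates the read and write sets at commit time, whereas the hardware tracks conflicts as they happen. Memory, Transaction, Abort, run_elided and the interfere hook are hypothetical names introduced only for this illustration.

```python
import threading


class Abort(Exception):
    """Raised when a transaction detects a conflict and must roll back."""


class Memory:
    """Shared memory with one version counter per address (an assumption)."""

    def __init__(self, **initial):
        self.data = dict(initial)
        self.version = {addr: 0 for addr in initial}

    def read(self, addr):
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value
        self.version[addr] = self.version.get(addr, 0) + 1


class Transaction:
    """Buffers writes and logs every address touched, then validates at commit."""

    def __init__(self, mem):
        self.mem = mem
        self.read_set = set()
        self.write_set = {}   # addr -> buffered value, invisible until commit
        self.observed = {}    # addr -> version seen at first access

    def _track(self, addr):
        self.observed.setdefault(addr, self.mem.version.get(addr, 0))

    def read(self, addr):
        if addr in self.write_set:
            return self.write_set[addr]
        self._track(addr)
        self.read_set.add(addr)
        return self.mem.read(addr)

    def write(self, addr, value):
        self._track(addr)
        self.write_set[addr] = value

    def commit(self):
        # Conflict check: any tracked address changed since first access?
        for addr, seen in self.observed.items():
            if self.mem.version.get(addr, 0) != seen:
                raise Abort(addr)
        for addr, value in self.write_set.items():
            self.mem.write(addr, value)


def run_elided(mem, lock, critical_section, interfere=None):
    """Try the critical section without the lock; fall back to the lock on abort."""
    tx = Transaction(mem)
    try:
        critical_section(tx)
        if interfere is not None:
            interfere(mem)  # test hook: models another thread's write
        tx.commit()
        return "elided"
    except Abort:
        with lock:
            critical_section(mem)  # buffered writes were discarded; rerun
        return "locked"


def deposit(ctx):
    ctx.write("balance", ctx.read("balance") + 10)
```

For example, `run_elided(Memory(balance=100), threading.Lock(), deposit)` commits without ever taking the lock. One simplification to keep in mind: in real lock elision the lock word itself is part of the transaction's read set, so a thread that takes the fallback lock aborts concurrent elided sections. This sketch does not model that interaction.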