18-742 Spring 2011
Parallel Computer Architecture
Lecture 11: Core Fusion and Multithreading
Prof. Onur Mutlu
Carnegie Mellon University

Announcements
- No class Monday (Feb 14)
- Interconnection Networks lectures on Wed-Fri (Feb 16, 18)

Reviews
- Due Today (Feb 11), midnight
  - Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993.
- Due Tuesday (Feb 15), midnight
  - Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979.
  - Dally, “Route Packets, Not Wires: On-Chip Interconnection Networks,” DAC 2001.
  - Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip Networks,” ISCA 2010.

Last Lecture
- Speculative Lock Elision (SLE)
- SLE vs. Accelerated Critical Sections (ACS)
- Data Marshaling

Today
- Dynamic core combining (Core Fusion)
- Maybe start multithreading

How to Build a Dynamic ACMP
- Frequency boosting / DVFS
- Core combining: Core Fusion
  - Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007.
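The frequency-boosting / DVFS option above can be sketched as a simple power-budget policy: during a serial phase, clock-gate all but one core and spend the whole chip power budget on that core. The cubic power model (P ~ f^3, assuming supply voltage is scaled along with frequency) and all numbers below are illustrative assumptions, not from the lecture.

```python
def core_power(freq_ghz):
    # Dynamic power relative to a 1 GHz, 1-unit baseline: P ~ C * f * V^2,
    # with V tracking f, gives P ~ f^3 (a textbook approximation).
    return freq_ghz ** 3

def max_boost_freq(power_budget, step=0.1):
    # Highest frequency one core can sustain while the other cores are
    # clock-gated (assumed ~0 power) and the chip stays within its budget.
    f = 1.0
    while core_power(f + step) <= power_budget:
        f += step
    return round(f, 1)

# Four small cores, each budgeted 1 power unit at 1 GHz => 4 units total.
print(max_boost_freq(4 * core_power(1.0)))  # 1.5 (just under 4 ** (1/3) ~= 1.59)
```

Note the sublinear payoff: quadrupling one core's power budget buys only about 1.6x frequency under this model, which is one motivation for combining cores instead.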
  - Idea: Dynamically fuse multiple small cores to form a single large core

Core Fusion: Motivation
- Programs are incrementally parallelized in stages
- Each parallelization stage is best executed on a different “type” of multi-core

Core Fusion Idea
- Combine multiple simple cores dynamically to form a larger, more powerful core

Core Fusion Microarchitecture
- Concept: Add enveloping hardware to make cores combinable

How to Make Multiple Cores Operate Collectively as a Single Core
- Reconfigurable I-cache

Collective Fetch
- Each core fetches two instructions from its own i-cache
- The Fetch Management Unit (FMU) controls fetch redirection
- Two-cycle bubble per taken branch (+1 if the target is on a misaligned core)
- Core “zero” provides the RAS
- Cores process branches locally and communicate predictions to the FMU
- The FMU communicates outcomes and GHR updates
- Two-cycle interconnect
- Misaligned targets re-align in one cycle
- A return encountered on another core gets its prediction from Core 0’s RAS
- The FMU updates i-TLBs on a miss

Branching in Fused Mode
[Diagrams: per-core branch predictors (BPred), BTBs, GHRs, and RASes, and how they are used in fused mode]

Centralized Renaming and Steering
- Centralized structure: cores send predecoded info to the Steering Management Unit (SMU)
- The SMU steers and dispatches regular and copy instructions
- Max. two regular + two copy instructions per core, per cycle
- Eight extra pipeline stages (fused mode only)

Operand Communication via Copy Instructions
[Diagram: copy-in, issue, and copy-out stages moving operands into and out of each core]

Collective Commit (No Blocking Case)
[Diagram: pre-commit ROB head and conventional ROB head advancing over instructions i0–i7 interleaved across the cores]

Collective Commit (Blocked Case)
[Diagram: the same ROB heads when commit is blocked]

Collective Load/Store Queue
- LD/ST instructions are bank-assigned to cores based on their effective addresses
- PC-based steering prediction of which bank the ld/st should access
- Distributed disambiguation; re-steer on a misprediction
- Core-fusion-aware indexing (address split into Tag | Bank ID | Index)
- Full utilization in both fused and split modes
- Cache coherence avoids flushing or shuffling

Dynamic Reconfiguration
- Run-time control of granularity
- Mechanism: fusion and fission instructions in the ISA
  - Serial vs. parallel sections
  - Variable granularity in parallel sections
  - Typically encapsulated in macros or directives (e.g., OpenMP sections)
  - Can be safely ignored (single execution model)
- Reconfiguration actions
  - Flush pipelines and i-caches
  - Reconfigure i-cache tags
  - Transfer architectural state as needed

Core Fusion Evaluation
- Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007.
[Figures: evaluation methodology and results]

Single Thread Performance
[Figure]

Parallel Application Performance
[Figure]

Core Fusion vs. Tile-Small Symmetric CMP
- Core Fusion advantages
  + Better single-thread performance when needed
- Core Fusion disadvantages
  - Possibly lower parallel throughput (area spent on glue logic)
  - Reconfiguration overhead to dynamically create a large core (can reduce performance)
  - More complex design: glue logic between cores, reconfigurable I-cache

Core Fusion vs. Asymmetric CMP
- Core Fusion advantages
  + Cores are not fixed at design time: more adaptive
  + Possibly better parallel throughput (assuming the ACMP does not use SMT)
  + Potentially higher-frequency design: all cores are the same
- Core Fusion disadvantages
  - Reconfiguration overhead to dynamically create a large core (can reduce performance)
  - Single-thread performance on the fused core is lower than on a large core statically optimized for single-thread execution
  - Additional pipeline stages, fine-grained operand communication between cores, collective operation constraints
  - Potentially complex design: glue logic between cores, reconfigurable I-cache

Core Fusion vs. Clustered Superscalar/OoO
- Core Fusion: build small cores, add glue logic to combine them dynamically
- Clustered superscalar/OoO: build a large superscalar core in a scalable fashion (clustered scheduling windows, register files, and execution units)
  - Can use SMT to execute multiple threads in different clusters
- Both require:
  - Steering instructions to different clusters (cores)
  - Operand communication between clusters
  - Memory disambiguation

Core Fusion vs. Clustered Superscalar/OoO
- Some Core Fusion advantages
  + No resource contention between threads in non-fused mode
  + No need to build a wide fetch engine
- Some Core Fusion disadvantages
  - Single-thread performance can be lower due to additional communication latencies in fetch and commit
  - The I-cache is not shared between threads in non-fused mode

Review: Performance Asymmetry
- What to do with it?
  - Improve serial performance (accelerate the sequential bottleneck)
  - Reduce energy consumption: adapt to phase behavior
  - Optimize energy-delay: adapt to phase behavior
  - Improve parallel performance (accelerate critical sections)
- How to build it?
  - Static: multiple different core microarchitectures or frequencies
  - Dynamic: combine cores, adapt frequency

Research in Asymmetric Multi-Core
- How to design asymmetric cores: static vs. dynamic
- How to divide the program to best take advantage of asymmetry?
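One way to make the "accelerate the sequential bottleneck" argument concrete is a back-of-the-envelope Amdahl-style model of an asymmetric CMP. The sqrt(area) performance scaling (Pollack's rule) and all numbers below are illustrative assumptions, not results from the lecture.

```python
import math

def acmp_speedup(serial_frac, total_area, big_core_area):
    # Amdahl-style model: one big core of `big_core_area` units runs the
    # serial fraction; the remaining area, built as 1-unit small cores, runs
    # the parallel fraction. Core performance is assumed to scale as
    # sqrt(area) (Pollack's rule); the baseline is one small core.
    big_perf = math.sqrt(big_core_area)
    small_cores = total_area - big_core_area
    return 1.0 / (serial_frac / big_perf + (1.0 - serial_frac) / small_cores)

# 16 area units, 10% serial code: a tile-small symmetric CMP (serial code
# stuck on one small core) vs. devoting 4 units to a single big core.
print(round(acmp_speedup(0.10, 16, 1), 2))  # 6.25
print(round(acmp_speedup(0.10, 16, 4), 2))  # 8.0
```

Even with fewer cores for the parallel phase, spending area on one big core wins under this model once the serial fraction is non-trivial, which is the case for both a static ACMP and a dynamically fused core.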
- Can you fuse in-order cores easily to build an OoO core?
- Explicit vs. transparent asymmetry
- How to match arbitrary program phases to the best-fitting core? Staged execution models.
- How to minimize code/data migration overhead?
- How to satisfy the shared resource requirements of different cores?

Multithreading

Readings: Multithreading
- Required
  - Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session, 2005.
  - Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004.
  - Tullsen et al., “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” ISCA 1996.
  - Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007.
- Recommended
  - Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992.
  - Smith, “A Pipelined, Shared Resource MIMD Computer,” ICPP 1978.
  - Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006.
  - Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990.

Multithreading (Outline)
- Multiple hardware contexts
- Purpose
- Initial incarnations: CDC 6600, HEP, Tera
- Levels of multithreading
  - Fine-grained (cycle-by-cycle)
  - Coarse-grained (multitasking)
  - Switch-on-event
  - Simultaneous
- Uses: traditional + creative (now that we have multiple contexts, why do we not do …)
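The fine-grained vs. switch-on-event levels in the outline above can be illustrated with two toy thread-selection policies. The cycle-level model below is a sketch, not any specific machine; the miss cycles are given explicitly rather than modeled.

```python
def fine_grained_schedule(num_threads, cycles):
    # Fine-grained multithreading: select a different hardware context every
    # cycle, round-robin (as in HEP/Tera-style barrel processors).
    return [cycle % num_threads for cycle in range(cycles)]

def switch_on_event_schedule(miss_cycles, cycles, num_threads):
    # Switch-on-event multithreading: keep issuing from one thread and switch
    # contexts only when it hits a long-latency event (e.g., a cache miss),
    # given here as a set of cycle numbers.
    schedule, thread = [], 0
    for cycle in range(cycles):
        if cycle in miss_cycles:
            thread = (thread + 1) % num_threads  # context switch on the miss
        schedule.append(thread)
    return schedule

print(fine_grained_schedule(4, 8))            # [0, 1, 2, 3, 0, 1, 2, 3]
print(switch_on_event_schedule({3, 6}, 8, 4)) # [0, 0, 0, 1, 1, 1, 2, 2]
```

The contrast shows the trade-off the lecture sets up: fine-grained interleaving hides latency every cycle but needs many ready contexts, while switch-on-event preserves single-thread locality and switches only when progress stalls.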