43rd International Symposium on Microarchitecture
Erasing Core Boundaries for Robust and Configurable Performance
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke
University of Michigan, Ann Arbor
December 7, 2010
University of Michigan Electrical Engineering and Computer Science

Multicore Architectures
• Industry wide move to multicores: 2 – 16 cores on a single die
► Sun Niagara 2
► IBM Cell
► Intel 4 Core Nehalem
• Multiple challenges confront them:
► Single-thread performance
► Reliability
► Power density
► Memory bandwidth ….
• Our hypothesis: A highly configurable architecture can handle these issues in a unified manner.

Multicore Performance Challenge
1. Good throughput / parallel performance with more cores
2. Stagnating sequential performance
[Figure: CPU Performance (log scale) versus year, 1985 to 2015, for the 486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, Core 2 Quad and Core i7, with the power wall marked]
Spectrum of Applications: from Parallel workloads (scientific computing, newer web browsers, video decoding) to Sequential workloads (legacy workloads, most mobile/desktop apps)
Need flexibility to provide both Parallel and Sequential performance

Solution: Configurable Performance
• Assign resources where they are needed…
• In an N core chip:
► Use all N cores for best Parallel Performance
► Group M cores together for Serial Performance (M < N)
Figure panels: Parallel / Throughput, Serial / Sequential. Source: Mark D.
Hill
• Core Fusion, ISCA’07; Composable Lightweight Processors, MICRO’07

Multicore Reliability Challenge
Need mechanisms for in-field silicon failures
[Figure, from Todd Austin, GSRC Sep 08. Labels: Parametric Variability; Intra-die variations in ILD thickness; Hard Faults; Electromigration (EM); Oxide breakdown (OBD); Oxide; Negative Bias Threshold Inversion; Thermal Runaway; Increased Heating; Higher Power Dissipation; Higher Transistor Leakage; Manufacturing Defects That Escape Testing (Inefficient Burn-in Testing)]

Solution: Isolate Broken Resources
CORE level
• ElastIC, DT’06
• Configurable Isolation, ISCA’07
MODULE level
• Online Diagnosis of Hard Faults, MICRO’05
• Ultra Low-Cost Defect Protection, ASPLOS’06
STAGE level
• StageNet, MICRO’08
• Core Cannibalization, PACT’08
- StageNet decouples the pipeline stages
- Regular fabric, no global interconnections
- Any set of stages can be connected to form a pipeline
[Figure: columns of Stage1, Stage2, Stage3 ... StageN]

Point Solutions: Summary and Limitations
• Configurable Performance: Fuse cores for higher single-thread performance
• Reliability: Stage level isolation
[Figure: columns of Stage1, Stage2, Stage3 ... StageN]

Point Solutions: Summary and Limitations
1. Solve only one challenge at a time
2. Incur additive overheads, no resource overlap
3. Are incompatible with one another
Our Goal: Design an architectural solution, which
1. Simultaneously targets configurable performance and reliability
2.
Overlaps hardware changes, and
3. Resolves any conflicting requirements
Fuse cores for higher single-thread performance:
• Tightly coupled resources
• Centralized structures for data and control management
Stage level isolation:
• Decoupled resources
• Distributed data and control management

The CoreGenesis (CG) Architecture
• Regular grid of pipeline stages. No explicit core boundary.
• Stages interconnected by full crossbars
• Distributed structures for data and control management
[Figure: rows of Fetch, Decode, Issue and Ex/Mem stages joined by Crossbar Switches, with Distributed Structures. Mode shown: 1. Throughput]

The CoreGenesis (CG) Architecture
Advantages:
1. Unified performance / reliability solution
2. Overlaps hardware overheads
3. Regular fabric
4. No centralized resources for fetch, issue, operand copying
[Figure: the grid of Fetch, Decode, Issue and Ex/Mem stages configured as a Single pipeline processor and as a Conjoined pipelines processor]
Modes shown: 1. Throughput; 3. Configurable Performance; 2.
Reliability

CG – Microarchitectural Hurdles
[Table: hurdles (Control Flow, Register Data Flow, Memory Data Flow, Instruction Issue) for Single Pipeline and Conjoined Pipelines; Memory Data Flow and Instruction Issue are N/A for the Single Pipeline]
• Single Pipeline: Solved by the StageNet design, MICRO’08
► Stream Identification bits for Control Flow
► Bypass cache (inside EXEC.-stage) for Register Data Flow
• Conjoined Pipelines:
Control Flow
- Instruction sequence needs to be managed across fetch stages
Register and Memory Data Flow
- Detection of cross pipeline register and memory data flow violations
- Recovery to a consistent architectural state
Instruction Issue
- Segregate data flow chains between pipelines

CG – Overview
1. Control Flow
2. Register Data Flow tracking
3. Memory Data Flow tracking
4. Replay Mechanism
5. Instruction Issue
[Figure: Conjoined pipelines processor built from two Fetch, Decode, Issue, Ex/Mem pipelines. Labels: Distributed fetch; Distributed decode; Detection of data flow violations; In-order Writeback; Decentralized Instruction issue (broadcasted)]

CG – Control Flow
Distributed Fetch.
- Pipelines fetch alternate instructions
- Branch predictors are kept in sync.
[Figure: one Fetch/Decode pair receives instructions 9..7..5..3..1, the other receives 10..8..6..4..2]
Advantages
- Evenly splits the work (fetch, decode, issue) between two pipelines
- No explicit communication required for control decisions
- Consistent control decisions due to mirrored branch predictors

CG – Data Flow
Across pipeline dependencies are tricky….
[Figure steps: Compare notes at commit time; Replay if any dependency was violated]
Register Data Flow
1. Issue stages locally maintain a table of source registers
2.
Issue stages monitor write-backs, and detect if any other pipeline updates a source for an outstanding instruction
3. A missed dependency initiates a light-weight replay
[Figure steps: Instruction stream; Split instruction stream; Local decisions and execution]

CG – Register Data Flow: Example
Scenario A
1. R1 = ….
2. R2 = ….
3. … = R1
4. … = R2
Scenario B
1. R1 = ….
2. R2 = ….
3. … = R2
4. … = R2
[Figure: one Issue/Execute pair handles instructions 3, 1 and the other handles instructions 4, 2]
Data flow violation! Pipeline 1 used a stale value of R2
How can we avoid these violations?

CG – Instruction Issue
• Instructions can be:
► Straight steered
► Cross steered
• Objective: match producers and consumers
• Mismatch → Data Flow violation → Replay
• Solution: Use static compiler analysis to generate steering hints
[Figure: two Issue stages feeding two Ex/Mem stages, with the straight path labeled]

CG – Instruction Issue: Example
[Figure: data flow graph of instructions 1 to 10 with a critical cross dependency; fetch order 9..7..5..3..1 and 10..8..6..4..2 into two Issue and Ex/Mem stages, straight path labeled]
Always straight steering
• Ignores data dependencies
• Number of replays = 5
Compiler orchestrated steering
• Use clustering algorithms
• Accounts for dependencies and communication delays
• Number of replays = 0

CG – Design Summary
[Figure: two Fetch, Decode, Issue, Ex/Mem pipelines]
1. Control Flow
- Pipelines fetch alternate instructions
- Branch predictors kept in sync
2. Register Data Flow
- Maintain local data flow information
- Check the decisions at writeback
3. Memory Data Flow tracking
5. Instruction Issue
- Steer consumers to producers
- Leverage static compiler analysis
4.
Replay Mechanism

Evaluation Methodology
• Liberty Simulation Infrastructure
► For cycle accurate simulations
• Trimaran Compilation System
► For instruction steering hints
Microarchitectural Parameters
Branch predictor: Global, 16-bit, gshare predictor
Level 1 I/D cache: 4-way, 16KB, 1 cycle latency
Level 2 unified cache: 8-way, 64KB, 5 cycle latency
• Experiments:
► Single-thread performance gain from conjoining
► Throughput improvement from conjoining (at low utilizations)
► Throughput sustainability (in face of failures)

Sequential Performance
[Figure: Normalized IPC for Baseline (1 issue), CG Single Pipeline (1 issue), CG Conjoint Pipelines (2 issue) and Baseline (2 issue)]

Throughput at varying utilization
[Figure: Throughput (IPC) versus System utilization (number of threads / number of cores) for an 8-core traditional multicore and an 8-core CoreGenesis chip]

Throughput Sustainability (Reliability)
[Figure: Throughput (IPC) versus Time (years) for an 8-core traditional multicore and an 8-core CoreGenesis chip]

Conclusions
• Architectural flexibility can tackle multiple multicore challenges
• CoreGenesis is our attempt at a unified performance and reliability solution
► Decentralized instruction flow management to combine resources for higher single-thread performance
► Decoupled pipeline architecture to allow stage level reconfiguration
• Results:
► Combining two single issue pipelines gives 40% speedup
► Sustains the same throughput for up to 70% longer
► Overheads: 20% area, 17% power

Thank you
Erasing Core Boundaries for Robust and
Configurable Performance
http://cccp.eecs.umich.edu

Back up slides

Traditional Solutions and CoreGenesis (CG)
Traditional point solutions
(A) Dynamic Multicore. Cores can fuse together when sequential performance is needed.
(B) Core Disabling. Isolates broken cores (red). Sustains throughput only at low failure rates.
(C) Heterogeneous CMP. Maintains a variety of cores to offer power-proportional computing.
CoreGenesis Vision
The architecture is composed of a sea of building blocks (B). These blocks can be configured for:
• Throughput computing: By forming single-issue pipelines
• Single-thread performance: By forming wider-issue pipelines
• Fault-tolerance: By decommissioning broken blocks.
• Customized processing: Heterogeneous building blocks can be introduced in the fabric to form customized pipelines.
[Figure labels: Throughput; Sequential]

CG Instance: A Unified Performance-Reliability Solution
Provides..
1. Configurable Performance: By merging a varying number of stages
2. Reliability: By isolating broken stages
Design Characteristics
• Elementary pipeline stages form the building blocks
• Stages interconnected using full crossbars.
• No global flush, stall or forwarding signals.
• No modifications to the cache hierarchy
Summary of Challenges
[Table: Control Flow, Register Data Flow, Memory Data Flow and Instruction Steering for Single Pipeline and Conjoint Pipelines; cell contents not recoverable]

[Figure labels: Bypass $; Stream ID (SID); double buffers around the Decode, Issue and Ex/Mem stages]
1. Control Flow
2. Data Flow
3.
Transmission Delays
Decoupling Stages in a Pipeline [MICRO’08]
Stream ID (SID)
• Control flow handling
• Eliminates flush signals
Bypass $
• Stores previous results
• Fully associative structure
• Emulates data forwarding
Macro-Ops
• Send instruction bundles
• Amortizes transfer delay
• Increases system utilization
[Figure: Fetch, Decode, Issue and Ex/Mem stages with double buffers, SID registers, Bypass $, Register File, Branch Predictor, Gen PC and Macro-op Generator; example macro-ops built from >>, <<, LD, ST, +, / and & operations]

Replays
[Figure: breakdown of execution cycles, 0% to 100%, into MemFlow replay cycles, RegFlow replay cycles and Normal operation cycles]

Area

Power
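
Register data flow check: illustrative sketch

The CG – Register Data Flow example (Scenario A versus Scenario B) can be replayed with a few lines of Python. This is only a sketch under simplifying assumptions, not the CoreGenesis hardware mechanism: the names `straight_steering` and `find_violations` are hypothetical, timing is ignored, and every consumer whose most recent producer ran on the other pipeline is pessimistically counted as a data flow violation.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# One instruction: (destination register or None, source registers).
Instr = Tuple[Optional[str], Tuple[str, ...]]


def straight_steering(n: int) -> List[int]:
    """Alternate fetch: instructions 1, 3, 5.. go to pipeline 0; 2, 4, 6.. to pipeline 1."""
    return [i % 2 for i in range(n)]


def find_violations(program: Sequence[Instr], steering: Sequence[int]) -> List[int]:
    """Return the 1-based numbers of instructions that would need a replay.

    Assumption (not from the slides): a consumer is flagged whenever the
    latest producer of one of its sources executed on the other pipeline.
    """
    last_writer: Dict[str, int] = {}  # register -> pipeline of latest producer
    violations: List[int] = []
    for number, ((dest, srcs), pipe) in enumerate(zip(program, steering), start=1):
        # A register with no recorded producer is treated as locally available.
        if any(last_writer.get(reg, pipe) != pipe for reg in srcs):
            violations.append(number)
        if dest is not None:
            last_writer[dest] = pipe
    return violations


# Scenario A and Scenario B from the example slide.
scenario_a: List[Instr] = [("R1", ()), ("R2", ()), (None, ("R1",)), (None, ("R2",))]
scenario_b: List[Instr] = [("R1", ()), ("R2", ()), (None, ("R2",)), (None, ("R2",))]

if __name__ == "__main__":
    print(find_violations(scenario_a, straight_steering(4)))
    print(find_violations(scenario_b, straight_steering(4)))
    # A steering hint that cross steers instruction 3 to the producer of R2.
    print(find_violations(scenario_b, [0, 1, 1, 1]))
```

Under these assumptions, straight steering leaves Scenario A clean and flags instruction 3 in Scenario B, while cross steering instruction 3 to the pipeline that produced R2 removes the violation, which is the effect the compiler steering hints aim for.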