A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors
Sandeep Navada, Niket K. Choudhary, Salil Wadhavkar, Eric Rotenberg
Department of Electrical and Computer Engineering
North Carolina State University
Sandeep Navada © 2013

Single-ISA HCMP
• Same ISA
• Different microarchitectures
  – Superscalar width
  – Structure sizes
  – Frequency
• Cores have different performance and power
• New run-time optimization lever

Monotonic HCMP
• Cores can be ranked independent of application
• Core 1 faster than Core 2 for any application
[Figure: performance of Core 1 and Core 2 across applications A to D]

Monotonic HCMP example
[Figure]

HCMP literature
• Focus
  – Monotonic cores
  – Cores are preordained
  – Scheduling
• Single thread
  – Minimize energy for a given performance degradation threshold w.r.t. the highest-ranked core
• Multiple threads
  – Maximize throughput/Watt/mm²

Going beyond monotonic HCMP
• Cores can’t be ranked independent of application
• Cores designed from the ground up, not pre-existing
[Figure: performance of Core 1 and Core 2 across applications A to D]

Non-monotonic HCMP
• High-contention scenario (optimize throughput): Kumar, et al., Core Architecture Optimization for Single-ISA Heterogeneous Multiprocessors
• Low-contention scenario (optimize latency): our work

Optimize latency
• Performance = IPC × frequency
• Complexity↑ => IPC↑, frequency↓
[Figure: IPC, frequency, and performance versus complexity for App A and App B]
This tradeoff plays out differently for different apps and depends on the ILP characteristics of the app.

Non-monotonic HCMP challenges
• Core selection: How to pick the core types comprising the heterogeneous design?
• Application steering: How to steer the applications to the best core?

CORE SELECTION

Core design space
Parameter                      Value range                                               Number
Front end width                2, 3, 4, 5, 6, 7, 8                                       7
Issue width                    2, 3, 4, 5, 6, 7, 8                                       7
Physical register file size    64, 128, 192, 256, 384, 512                               6
Issue queue size               16, 24, 32, 48, 64, 96, 128                               7
Load queue/Store queue size    8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64      8
L1 I$ size                     8, 16, 32, 64, 128 KB                                     5
L1 D$ size                     8, 16, 32, 64, 128 KB                                     5
L2$ size                       2 MB                                                      1
Clock period                   0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns                 8

Core selection
[Flow diagram]
• Core design space → pruning script → pruned design space
• SPEC bench → SimPoint tool → 39 10M phases
• FabScalar toolset → IPC, freq, power → performance of every phase on every design point
• Search N=1, 2, 3, 4 HCMP → optimal 1-, 2-, 3-, 4-core-type HCMP
• N: number of core types

BIPS of each phase on each core type
Phase    A      B      C      D      E      F      G      H
1        1.5    3.2    1.3    2.2    1.6    1.7    1.3    2.0
2        0.5    2.3    2.5    1.9    3.1    1.8    2.0    1.2

Search for optimal 4-core-type HCMP
Core 1    Core 2    Core 3    Core 4    Performance
A         B         C         D         HMEAN(3.2, 2.5) = 2.81
E         B         C         D         HMEAN(3.2, 3.1) = 3.15
A         F         C         D         HMEAN(2.2, 2.5) = 2.34
E         F         C         D         HMEAN(2.2, 3.1) = 2.57
E         F         G         H         HMEAN(2.0, 3.1) = 2.43
…

Kiviat diagram
• Visualize core parameters
  – Frequency axis: higher frequency
  – Width axis: increased superscalar width
  – Window axis: larger structures

Optimal 1-core-type HCMP
[Figure: Kiviat diagram (Frequency, Width, Window) of core A]
“A” core is an average core which strikes a good balance between IPC and frequency.

Optimal 2-core-type HCMP
[Figure: Kiviat diagram (Frequency, Width, Window) of cores A and LW]
“A” core is still selected!
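
The search illustrated by the BIPS example above can be sketched in a few lines. This is only a minimal sketch under stated assumptions: it uses the two-phase, eight-core-type example table from the slide rather than the real pruned design space, it assumes each phase runs on whichever selected core type gives it the highest BIPS, and it assumes a brute-force enumeration of all N-core-type combinations (the slides do not say how the actual search is implemented). The names `BIPS`, `hcmp_performance`, and `best_hcmp` are hypothetical.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS of each phase on each core type, copied from the slide's example table.
BIPS = {
    1: {"A": 1.5, "B": 3.2, "C": 1.3, "D": 2.2,
        "E": 1.6, "F": 1.7, "G": 1.3, "H": 2.0},
    2: {"A": 0.5, "B": 2.3, "C": 2.5, "D": 1.9,
        "E": 3.1, "F": 1.8, "G": 2.0, "H": 1.2},
}


def hcmp_performance(cores):
    """Harmonic mean, over phases, of the best BIPS among the chosen cores."""
    best_per_phase = [max(row[c] for c in cores) for row in BIPS.values()]
    return harmonic_mean(best_per_phase)


def best_hcmp(n):
    """Brute-force search (an assumption): try every n-core-type combination."""
    core_types = sorted(next(iter(BIPS.values())))
    return max(combinations(core_types, n), key=hcmp_performance)


if __name__ == "__main__":
    for n in (1, 2, 4):
        chosen = best_hcmp(n)
        print(n, chosen, round(hcmp_performance(chosen), 2))
```

On this toy table, ties between equally good combinations are broken by enumeration order, so the specific 4-core-type set printed is one of several with the same score.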
“LW” (larger, wider) core targets window and width bottlenecks in “A” core.

Optimal 3-core-type HCMP
[Figure: Kiviat diagram (Frequency, Width, Window) of cores A, LW, and N]
“A” core is still selected!!
“LW” core is still selected.
“N” core targets frequency bottleneck.

Optimal 4-core-type HCMP
[Figure: Kiviat diagram (Frequency, Width, Window) of cores A, L, W, and N]
“A” and “N” are selected, again.
“LW” got split into “L” and “W”, addressing each bottleneck better!

LW split
[Figure: Kiviat diagram (Frequency, Width, Window) of cores A, LW, L, and W]

Optimal HCMP
Core type    Clock period (ns)    ILP-extracting buffers    Widths    Caches
A            0.6                  32, 128, 128              3, 4      64, 64
N            0.5                  32, 64, 64                2, 2      16, 16
L            0.7                  48, 128, 384              4, 4      128, 128
W            0.7                  32, 128, 128              6, 6      128, 32
The optimal HCMP consists of
1. Average core which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core

APPLICATION STEERING

Bottleneck-driven steering
• Application is continuously diagnosed for bottlenecks on the current core using performance counters
• Migrate to a different core when bottlenecks change
  – To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
  – To the average core if no accelerator meets this condition, or if there are no bottlenecks

Bottleneck-driven steering
Track performance counters → Diagnose bottlenecks → Steer phase

Track performance counters
Counter      Description
Width_ctr    Ready instruction not issued due to limited issue width.
Window_ctr   Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr       Instruction stalled due to instruction cache miss.
D$_ctr       Load instruction stalled due to data cache miss.
Misp_ctr     Mispredicted branch.
L2_ctr       Instruction stalled due to L2 cache miss.
Cycle_ctr    Number of cycles.

Diagnose bottlenecks
• Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
• Performance counters are normalized with respect to the cycle count
• If the normalized performance counter value is above threshold, then the corresponding resource is a bottleneck

Diagnose bottlenecks
Bottleneck        Expression
bool Width        Width = (Width_ctr > Width_thresh)
bool Window       Window = (Window_ctr > Window_thresh)
bool Frequency    Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$           I$ = (I$_ctr > I$_thresh)
bool D$           D$ = (D$_ctr > D$_thresh)
Thresholds are determined empirically using a training process.

Steer phase
Core    Bottlenecks relieved    Bottlenecks worsened    Steering logic
W       Width                   Frequency               if (Width && !Frequency) W
L       Window                  Frequency               else if (Window && !Frequency) L
N       Frequency               Width, Window           else if (Frequency && !(Width || Window)) N
A       n/a                     n/a                     else A
Paper shows full steering logic with I$ and D$ bottlenecks included.
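
The diagnosis expressions and the simplified steering logic above can be sketched as follows. This is a minimal illustration, not the implementation from the paper: it covers only the Width, Window, and Frequency bottlenecks shown on the slide (the I$ and D$ cases are omitted, as they are there), the `Counters` class and function names are hypothetical, and the threshold values are made-up placeholders, since the slides say only that thresholds are determined empirically by training.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Counter values accumulated over one 10K-instruction interval."""
    width: int    # Width_ctr
    window: int   # Window_ctr
    misp: int     # Misp_ctr
    l2: int       # L2_ctr
    cycles: int   # Cycle_ctr, assumed to be > 0

# Placeholder thresholds (assumption): the real values come from training.
THRESH = {"width": 0.30, "window": 0.30, "misp": 0.01, "l2": 0.05}


def diagnose(c, thresh=THRESH):
    """Normalize each counter by the cycle count and compare to its threshold."""
    width = c.width / c.cycles > thresh["width"]
    window = c.window / c.cycles > thresh["window"]
    frequency = (c.misp / c.cycles > thresh["misp"]
                 or c.l2 / c.cycles > thresh["l2"])
    return width, window, frequency


def steer(width, window, frequency):
    """Simplified steering logic from the slide; returns the target core type."""
    if width and not frequency:
        return "W"
    if window and not frequency:
        return "L"
    if frequency and not (width or window):
        return "N"
    return "A"


if __name__ == "__main__":
    interval = Counters(width=4000, window=100, misp=10, l2=10, cycles=10000)
    print(steer(*diagnose(interval)))
```

Note that the order of the checks matters: when both Width and Window are bottlenecks and Frequency is not, this simplified chain picks W because that branch is tested first.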

RESULTS

Methodology
• Benchmarks: SPEC 2000
  – Simulate first 4 billion instructions
• Metrics
  – Performance: BIPS
  – Efficiency: BIPS³/Watt
• Migration overhead
  – Default: 100 cycles
  – Sensitivity study: 1K, 10K cycles

Steering algorithms
Algorithm     Description
Baseline      Run the entire 4B instructions on the average core
Sampling      Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck    Run current 10K instruction segment based on the bottlenecks of the prior 10K segment
Optimal       Run every 10K instruction segment on the best core type of the prior 10K segment
Oracle        Run every 10K instruction segment on the best core type

4-core-type HCMP
• 4-core-type HCMP outperforms homogeneous CMP by up to 76%, and by 15% on average
• Our steering algorithm is able to capture most of this gain

Sampling vs. bottleneck steering
• Sampling performs 8.9% better than the average core
• Bottleneck steering performs 12% better than the average core

Occupancy
• Occupancy pattern varies dramatically across different applications

Efficiency
• Sampling performs 25% better than the average core
• Bottleneck steering performs 33% better than the average core

SUMMARY

Summary
• First proposal to architect and orchestrate multiple core types for latency reduction.
• With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.
• In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type which relieves the bottlenecks.

Future work
• HCMPs open up a whole new direction of microarchitecture research.
• Many microarchitecture optimizations don’t provide universal benefits.
• As each core type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.