TimeCube: A Manycore Embedded Processor with Interference-agnostic Progress Tracking
Anshuman Gupta, Jack Sampson, Michael Bedford Taylor – University of California, San Diego

Multicore Processors in Embedded Systems
• Examples: Intel Atom, Qualcomm Snapdragon, Apple A6, Applied Micro Green Mamba
• Standard in domains such as smartphones
• Higher energy-efficiency
• Higher area-efficiency

Towards Manycore Embedded Systems
• Unicore → dualcore (shared memory) → quadcore (shared cache, shared memory) → many(64)core (shared OCN, shared cache, shared memory), and so on
• The number of cores in a processor is increasing – and so is sharing!

What's Great About Manycores
• Lots of resources:
                      Tile-Gx 8072    Xeon Phi 7120X
  Cores               72              61
  Caches              23 MB           30.5 MB
  DDR channels        4               16
  Memory bandwidth    100 GB/s        352 GB/s

What's Not So Great: Sharing
• Low per-core resources:
                      Tile-Gx 8072    Intel Xeon 4650
  Cache / core        327 KB          2.5 MB      (> 7X gap)
  Memory BW / core    1.16 B/cyc      4.26 B/cyc  (> 3X gap)
• The applications fight with each other over the limited resources.

Sharing at its Worst
• SPEC2K, SPEC2K6 + an I/O-centric suite
• 32 cores, 16 MB L2 cache, 96 Gb/s DRAM bandwidth, 32 GB DDR3
• Up to 12X worst-case slowdowns!

Key Problems With Sharing
• I know how I'd run by myself, but how much are others slowing me down?
• How do I get guarantees of how much performance I'll get?
• How do we allocate the resources for the good of the many, but without punishing the few, or the one?

I know how I'd run by myself, but how much are others slowing me down?
Solution: We introduce a new metric – Progress-Time: the time the application would have taken, had it been allocated all CPU resources.
• This paper: With the right hardware, we can calculate Progress-Time in real time.
• Useful because: It is a key building block for the hardware, the operating system, and the application to create guarantees about execution quality.

How do I get guarantees of how much performance I'll get?
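The Progress-Time metric above can be illustrated with a minimal sketch (names are made up for illustration; this is not TimeCube's hardware interface): per execution interval, record the real elapsed time and the time the same work would have taken with all CPU resources allocated to the application, then compare the two to expose interference.

```python
# Illustrative sketch of the Progress-Time metric (hypothetical names,
# not TimeCube's actual hardware interface).

def progress_time(intervals):
    """Progress-Time: total time the work done so far would have taken
    on an unshared machine (all CPU resources allocated to the app)."""
    return sum(iv["ideal_time"] for iv in intervals)

def slowdown(intervals):
    """Interference-induced slowdown: real time / Progress-Time."""
    return sum(iv["real_time"] for iv in intervals) / progress_time(intervals)

# An app that needed 1.0 ms alone but took 2.5 ms while sharing:
intervals = [{"real_time": 1.0, "ideal_time": 0.5},
             {"real_time": 1.5, "ideal_time": 0.5}]
print(progress_time(intervals))  # 1.0
print(slowdown(intervals))       # 2.5
```

A slowdown of 2.5 here corresponds directly to the "how much are others slowing me down?" question: the app made 1.0 ms of progress in 2.5 ms of real time.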
Solution: We introduce a new hardware-generated data structure – Progress Tables (Ptables): for each application, how much Progress-Time it gets for every possible resource allocation – and we extend the hardware to dynamically partition resources.
• This paper: With a little more hardware, we can compute the Progress Tables accurately and partition resources accordingly to guarantee performance, in real time.
• Useful because: We can determine exactly how many resources are required to attain a given level of performance.

Sneak Preview
• Graphical images of real incremental Progress Tables, generated in real time by our hardware, for specrand, hmmer, and astar (cache allocation 64 KB–16 MB on one axis, bandwidth share 0–75% on the other)
• Red = attaining the full 1 ms of Progress-Time in 1 ms of real time

How do we allocate the resources for the good of the many, but without punishing the few, or the one*?
Solution: We introduce a new hardware-generated data structure – SPOT (Simultaneous Performance Optimization Table): for each application, how many resources should be allocated to maximize the geomean of Progress-Times across the system.
• This paper: With 3% more hardware, we can find near-optimal resource allocations, in real time.
• Useful because: It greatly improves system performance and fairness.
* Star Trek reference.
TimeCube: A Demonstration Vehicle for These Ideas
• (floorplan: a tiled manycore – C = core, D = L2 cache block – flanked by four memory controllers, each with its own DIMMs)
• Scalable manycore architecture, in-order memory system
• Critical resources spatially distributed over tiles

Outline
• Introduction
• Measuring Execution Quality: Progress-Time
• Enforcing Execution Guarantees: Progress-Table
• Allocating Execution Resources: SPOT
• Conclusion

Measuring Execution Progress: Progress-Time
• What do we need to compute Progress-Time?
• Two views of the machine: the current universe and an ideal (shadow) universe in which the application runs alone
• Execution counters record events in the current universe
• Shadow counters record the events the ideal (shadow) universe would have seen

Shadow Structures
• Shadow Tags
  • Measure cache miss rates for the full cache allocation
  • Set-sampling reduces overhead
• Shadow Prefetchers
  • Measure prefetches issued and the prefetch hit rate
  • Track the cache miss stream from the Shadow Tags
  • Launch fake prefetches, with no data buffers
• Shadow Banking
  • Measures DRAM page hits, misses, and conflicts
  • Tracks the current state of the DRAM row buffers using the DDR protocol

A Shadow Performance Model for Progress-Time
• Analytical model to estimate Progress-Time
• Takes into account the critical memory resources
• Assumes no change in core pipeline execution cycles
• Uses events collected from the shadow structures
• Reuses average latencies for accessing individual resources under the current allocation:
  ExecutionTime = CoreCycles + L2Hits x L2HitLatency + PrefHits x PrefHitLatency + PageHits x PageHitLatency + PageMisses x PageMissLatency + PageConflicts x PageConflictLatency

Accounting for Bandwidth Stalls
• L2 misses and prefetcher statistics determine the required bandwidth
• No bandwidth stall is assumed if bandwidth is sufficient
• If bandwidth is insufficient, performance (IPC) degrades proportionally

Evaluation Methodology
• Evaluate a 32-core instance similar to modern manycore processors
• 26 benchmarks from SPEC2K, SPEC2K6, and an I/O-centric suite
• Near-unlimited combinations of simultaneous runs
• Compress the run-space by classifying applications into streams, cliffs, and slopes based on cache sensitivity (plots: requests per kilo-instructions vs. L2 cache size, 1 KB–256 KB, with and without prefetching, for a representative stream, cliff, and slope benchmark)

Shadow Performance Model and Shadow Structures Accurately Compute Progress-Time
• ~99% average estimation accuracy across stream/slope compositions (0–100% mixes)
• TimeCube tracks Progress-Times with ~1% error
• No latency overheads

Outline
• Introduction
• Measuring Execution Quality: Progress-Time
• Enforcing Execution Guarantees: Progress-Table
• Allocating Execution Resources: SPOT
• Conclusion

Progress-Tables in TimeCube
• One Progress-Table (Ptable) per application; each cell holds the Execution-Time for application i with cache allocation c and bandwidth allocation b
• Memory bandwidth is binned in 1% increments (0–100%)
• Last-level cache arrays are allocated in powers of two
• Progress-Time is accumulated over intervals using the last cell

Shadow Structures 2.0
• Shadow Tags
  • Measure cache miss rates for all power-of-two cache allocations
  • LRU-stacking reduces overhead
• Shadow Prefetchers
  • One instance for each cache allocation
• Shadow Banking
  • One instance for each cache allocation
• The same performance model is used as for Progress-Time.
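The shadow performance model above can be sketched as follows. The event names, latency values, and the choice to apply the bandwidth penalty to the whole estimate are illustrative assumptions, not the paper's calibrated model.

```python
# Illustrative sketch of the shadow performance model (hypothetical
# event names and latencies). Execution time = core pipeline cycles
# plus each shadow event count times its average latency under the
# current allocation; insufficient bandwidth degrades IPC
# proportionally.

def shadow_execution_time(core_cycles, events, latencies,
                          required_bw, allocated_bw):
    # Core pipeline cycles are assumed unchanged across allocations.
    cycles = core_cycles + sum(events[e] * latencies[e] for e in events)
    # Bandwidth stall: if the allocation cannot supply the required
    # bandwidth, performance (IPC) degrades proportionally.
    if required_bw > allocated_bw:
        cycles *= required_bw / allocated_bw
    return cycles

events = {"l2_hit": 1000, "pref_hit": 500, "page_hit": 200,
          "page_miss": 100, "page_conflict": 50}
latencies = {"l2_hit": 12, "pref_hit": 4, "page_hit": 90,
             "page_miss": 150, "page_conflict": 250}
print(shadow_execution_time(50_000, events, latencies,
                            required_bw=1.0, allocated_bw=2.0))  # 109500
```

Because the event counts come from shadow structures (full allocation) while the latencies come from the current allocation, the same routine serves both the Progress-Time estimate and the per-cell Ptable entries.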
Progress-Tables Examples
• (maps for specrand, hmmer, and astar: cache 64 KB–16 MB vs. bandwidth 0–75%, shading = relative Progress-Time)
• Ptables provide an accurate mapping from resource allocation to slowdown
• TimeCube can use these maps to guarantee QoS for applications
• Overall as well as per-interval QoS control

Outline
• Introduction
• Measuring Execution Quality: Progress-Time
• Enforcing Execution Guarantees: Progress-Table
• Allocating Execution Resources: SPOT
• Conclusion

Allocating Execution Resources: SPOT
• Key idea: run an optimization algorithm over the application Progress-Tables to maximize an objective function
• Objective function: the mean of the Progress-Times of all applications, accumulated over all intervals so far plus the upcoming one
• The geometric mean balances throughput and fairness, and can be approximated for efficient optimization

Implementation: Maximizing the Mean Progress-Time
• The Simultaneous Performance Optimization Table (SPOT) is indexed by application i, cache allocation j, and bandwidth allocation k; each cell holds the best achievable mean Progress-Time
• Bin-packing: distribute resources among the applications to maximize the mean
• A clever algorithm allows an optimal solution in pseudo-polynomial time
• The <All, All, All> corner gives the maximum mean and the corresponding allocation

Real-Time TimeCube Resource Allocation
• Interval-based TimeCube execution; statistics are collected during execution
• Every interval: estimate Progress-Times (create Ptables), allocate resource partitions, reconfigure the partitions
• Done in parallel with execution

Progress-Based Allocation Improves Throughput
• (bar chart: normalized system throughput, TimeCube vs. baseline, across stream/slope compositions)
• Allocating resources simultaneously increases throughput
• As much as a 77% increase, and a 36% improvement on average

Maximizing the Geometric Mean Provides Fairness
• (bar chart: normalized performance of the slowest application, progress-based vs. miss-based allocation, across stream/slope compositions)
• Worst-case performance improves by 19% on average
• As much as a 57% worst-case improvement

TimeCube's Mechanisms are Energy-Efficient
• Energy breakdown: pipeline 34.51%, L2 evict 26.84%, L1 access 12.96%, prefetcher 12.52%, memory access 11.16%, L1 evict 1.06%, L2 access 0.50%, others 0.45%, pTables 0.01%
• Progress-Time mechanisms consume < 0.5% of energy
  • Shadow structures consume 0.23%
  • Ptable calculation consumes just 0.01%
  • SPOT calculation consumes 0.18%

TimeCube's Mechanisms are Area-Efficient
• Progress-Time mechanisms consume < 7% of area
  • Shadow Tags consume 1.40%
  • Ptables consume 1.11%
  • SPOT consumes 3.20%

Related Work
• Measuring execution quality [Progress-Time]
  • Analytical: Solihin [SC'99], Kaseridis [HPCA'10]
  • Regression: Eyerman [ISPASS'11]
  • Sampling: Yang [ISCA'13]
• Enforcing execution guarantees [Progress-Tables]
  • RT systems: Lipari [RTTAS'00], Bernat [RTS'02], Beccari [RTS'05]
  • Offline: Mars [ISCA'13], Fedorova [ATC'05]
• Allocating execution resources [SPOT]
  • Adaptive: Hsu [PACT'06], Guo [MICRO'07]
  • Offline: Bitirgen [MICRO'08], Liu [HPCA'04]

Conclusions
• Problem: interference on multicore processors can lead to large, unpredictable slowdowns.
• How to measure execution quality: Progress-Time
  • We can track live application progress with high accuracy (~1% error) and low overheads (0.5% performance, < 0.5% energy, < 7% area).
• How to enforce execution guarantees: Progress-Tables
  • We can use Progress-Tables to precisely control the QoS provided, on the fly.
• How to allocate execution resources: SPOT
  • We can use SPOT to improve both throughput and fairness (36% and 19% on average; 77% and 57% in the best case).
• Multicore processors can employ these three mechanisms, demonstrated through TimeCube, to make them more attractive for embedded systems.

Thank You – Questions?

Backup Slides

Problem: Resource Sharing Causes Interference
• (bar chart: execution time normalized to standalone execution for 175.vpr, 181.mcf, 300.twolf, 401.bzip2, 429.mcf, 458.sjeng, 462.libquantum, and 470.lbm, each with 0–3 background applications)
• Unpredictable slowdown during concurrent execution
• Can lead to failed QoS guarantees

Progress-Tables
• Progress-Time for a spectrum of resource allocations: Execution-Time[res0][res1]…[resn] for application i, across resources RESOURCE1 … RESOURCEn
• Provide information for resource management at the right granularity

Dynamic Execution Isolation Reduces Interference
• TimeCube partitions shared resources for dynamic execution isolation
• Last-level cache partitioning
  • Associative cache partitioning allocates cache ways to applications
  • Virtual Private Caches [Nesbit, ISCA 2007]
• Memory bandwidth partitioning
  • Memory bandwidth is dynamically allocated between applications
  • Fair Queuing Arbiter [Nesbit, MICRO 2006] for memory scheduling
• DRAM capacity partitioning
  • DRAM memory banks are split between applications
  • The row buffers fronting these banks are partitioned as a result
  • OS page management maintains the physical memory bank allocation

Prefetcher Throttling Increases Bandwidth Utilization
• (datapath: L2 misses feed a stream tracker and the shadow prefetcher; the shadow prefetcher reports the required bandwidth with and without prefetching; the aggression controller compares it against the allocated bandwidth and sets an aggression level; the throttler's prefetch filter drops prefetches before the prefetch buffer and memory)
• Filter a fixed ratio of prefetches, based on the aggression level, such that the required bandwidth is just above the allocated bandwidth
• The Shadow Performance Model is augmented to report the required bandwidth

Prefetcher Throttling Chooses the Right Level
• (plots: throughput per application vs. bits per cycle per core, comparing NoThrottling vs. Throttling and NoPrefetching vs. ThrottledPrefetching)
• Nine aggression levels are used
• The throttler chooses the right level to give a pareto-optimal curve
• Prefetcher throttling efficiently utilizes the available bandwidth

Multicore Processors Share Resources
• (die photo: low-power Intel "Haswell" architecture)
• Sharing leads to increased utilization
• Per-core resources are lower on manycore processors
• Increasing pressure to share resources

Shadow Performance Model and Shadow Structures Accurately Compute Progress-Time
• TimeCube tracks Progress-Times with ~1% error
• Performance overheads due to reconfiguration are < 0.5%

Objective: Maximizing Mean Progress-Time
• TimeCube allocates resources between applications to maximize the mean of their Progress-Times
• The geometric mean balances throughput and fairness, and can be approximated for efficient optimization

Measuring Execution Progress: Progress-Time
• What do we need to compute Progress-Time?
• Execution stats are collected for both the current universe and the ideal (shadow) universe

Solution: Track Live Application Progress
• (timeline: App1 and App2 each running standalone on a processor vs. running concurrently on a shared processor)
• Determine and control the QoS provided to applications "online"
• We quantify application progress using Progress-Time: the amount of time required for an application to complete the same amount of work it has done so far, were it to have been allocated all CPU resources.

TimeCube: A Progress-Tracking Processor
• TimeCube is a manycore processor, augmented to track and use live Progress-Times
• Embedded domains can use TimeCube to guarantee QoS

TimeCube Periodically Estimates Progress-Times
• Concurrent execution on dynamically isolated resources
  • The critical shared resources (last-level cache, memory bandwidth, DRAM banks) are dynamically partitioned
  • Fine-grained QoS control
• The shadow performance model estimates Progress-Time
  • Uses execution statistics from the shadow structures (shadow cache, shadow prefetcher, shadow banking)
• Progress-Time estimates (Progress-Time tables) are used for shared-resource management

Isolation Can't Remove Performance Interference
• (Progress-Time tables for astar, hmmer, and specrand: cache 2^0–2^10 vs. bandwidth 0–80%, shading = relative Progress-Time)
• Isolation removes resource interference only
• Performance is not linearly related to resource allocation; the same resource allocation can lead to different performance
• TimeCube uses shadow performance modeling to estimate the performance impact of different resource allocations
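One way to read the throttling rule on the prefetcher slides (filter prefetches so that the required bandwidth sits just above the allocation) is the sketch below. The linear bandwidth model, the function names, and the level-selection policy are my assumptions, not the deck's actual controller.

```python
# Sketch of the prefetch-throttling controller (illustrative). Assumed
# model: required bandwidth grows linearly with the aggression level
# between the no-prefetch demand and the full-prefetch demand (both
# reported by the shadow performance model); the controller picks the
# lowest of nine levels whose required bandwidth still reaches the
# allocation, i.e. "required BW just above allocated BW".

LEVELS = 9  # nine aggression levels: 0 (no prefetching) .. 8 (all prefetches)

def required_bw(level, demand_bw, prefetch_bw):
    """Bandwidth needed at a given aggression level (bits/cycle/core)."""
    return demand_bw + (level / (LEVELS - 1)) * prefetch_bw

def choose_aggression(demand_bw, prefetch_bw, allocated_bw):
    for level in range(LEVELS):
        if required_bw(level, demand_bw, prefetch_bw) >= allocated_bw:
            return level            # just saturates the allocation
    return LEVELS - 1               # everything fits: be fully aggressive

print(choose_aggression(demand_bw=1.0, prefetch_bw=2.0, allocated_bw=2.0))  # 4
print(choose_aggression(demand_bw=1.0, prefetch_bw=2.0, allocated_bw=5.0))  # 8
```

Re-running this selection every interval is what lets the throttler trace the pareto-optimal throughput-versus-bandwidth curve shown on the backup slides.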