Controlled Resource and Data Sharing in Multi-Core Platforms*
Sandhya Dwarkadas
Department of Computer Science, University of Rochester
*Joint work with Arrvindh Shriraman, Hemayet Hossain, Xiao Zhang, Hongzhou Zhao, Rongrong Zhong, Michael L. Scott, Michael Huang, Kai Shen

The University of Rochester
• Small private research university
• 4400 undergraduates
• 2800 graduate students
• Set on the Genesee River in Western NY State, near the south shore of Lake Ontario
• 250km by road from Toronto; 590km from New York City

The Computer Science Dept.
• 15 tenure-track faculty; 45 Ph.D. students
• Specializing in AI, theory, and parallel and distributed systems
• Among the best small departments in the US

The Hardware-Software Interface
[Diagram: the group's projects across the hardware-software interface, including TreadMarks, Cashmere-2L, InterWeave, RTM, FlexTM, ARMCO, DDCache, Sentry, and others, spanning concurrency (coherence, synchronization, consistency), distributed systems, memory systems, multi-core protection support, power-aware computing, and resource-aware OS scheduling]

The Implications of Technology Scaling
• Many more transistors for compute power
• Energy constraints
• Large volumes of data
• High-speed communication
• Concurrency (parallel or distributed)
• Need support for
  – Scalable sharing
  – Reliability
  – Protection and security
  – Performance isolation

Multi-Core Challenges
• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing

Current Projects
• CoSyn: Communication and Synchronization Mechanisms for Emerging Multi-Core Processors
  – Collaboration with Professors Michael Scott and Michael Huang
  – Arrvindh Shriraman, Hemayet Hossain, Hongzhou Zhao
• Operating System-Level Resource Management in the Multi-Core Era
  – Collaboration with Professor Kai Shen
  – Xiao Zhang and Rongrong Zhong
See http://www.cs.rochester.edu/research/cosyn and http://www.cs.rochester.edu/~sandhya

Multi-Core Challenges: Performance Isolation

Resource Sharing is (and will be) Ubiquitous!
• Floating point, integer, state, and cache shared by multiple threads on a core
• Second-level cache shared by multiple cores on a chip
• Interconnect bandwidth shared on multiprocessors

Resource Sharing on Multicore Chip
• Memory bandwidth and the last-level cache are commonly shared by sibling cores sitting on the same chip
  – Intel's 6-core (12-thread), Sun UltraSPARC T1, AMD's 12-core, …

Resource Management To Date
• Capitalistic: generating more requests results in more resource usage
  – Performance: resource contention can result in significantly reduced overall performance
  – Fairness: an equal time slice does not necessarily guarantee equal progress

Poor Performance Due to Uncontrolled Resource Contention
[Figure: co-running slowdowns under uncontrolled sharing versus a partitioned "win-win situation"; experiments were conducted on a 3GHz Intel Core 2 Duo processor with a shared 4MB L2 cache]

Fluctuating Performance Due to Uncontrolled Resource Contention
[Figure: performance of art when co-running with different applications on an Intel dual-core processor with a 4MB shared L2 cache]

Fairness and Security Concerns
• Priority inversion
• Poor fairness among competing applications
• Information leakage at chip level
• Denial of service attack at chip level

Big Picture
• Control resource usage of co-running applications: page coloring or hardware throttling [EuroSys'09] [USENIX'09]
• Select which applications run together: resource-aware scheduling [USENIX'10]

Existing Mechanism (I): Software-Based Page Coloring
• Classic technique to reduce cache misses, now used by the OS to manage cache partitioning
• Partitions the cache at coarse granularity
• No need for hardware support
[Figure: memory pages of different colors map to disjoint regions of the shared cache (Way-1 … Way-n); giving Thread A and Thread B pages of different colors partitions the cache between them]

Drawbacks of Page Coloring
• Expensive re-coloring cost
  – Prohibitive in a dynamic environment where frequent re-coloring may be necessary
• Complex memory management
  – Introduces artificial memory pressure

Toward Practical Page Coloring
• Hotness-based page coloring
  – Efficiently find a small group of hot pages
  – Restrict page coloring and re-coloring to hot pages
  – Pay less re-coloring overhead while achieving most of the cache partitioning benefit (separate competing applications' most frequently accessed pages)
• Key challenge
  – An efficient way to track page hotness

Methods to Track Page Hotness
• Using page protection
  – Capture page accesses by triggering page faults
  – Microseconds of overhead per page fault
• Using access bits
  – A single bit stored in each Page Table Entry (PTE), generally available on x86 and automatically set by hardware upon page access
  – Tens of cycles per page table entry check
  – Recycle spare bits in the PTE as a hotness counter; the counter is aged to reflect recency and frequency

Sampling of Access Bits
• Decouple sampling frequency and window
  – Hotness sampling accuracy is determined by the sampling time window T
  – Hotness sampling overhead is determined by the sampling frequency N
[Timeline: clear all access bits at times 0, N, 2N, 3N, 4N, …; check all access bits at times T, N+T, 2N+T, 3N+T, 4N+T, …]
• In our experiments, T = 2 milliseconds and N = 100 or 10 milliseconds
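To make the sampling schedule concrete, here is a small user-space sketch of the clear/check cycle and the aged hotness counter. It is an illustration rather than the authors' kernel implementation: the `accessed[]` array, the `pte_*` helpers, the 8-bit counter width, and the aging rule are all assumptions standing in for the real page-table machinery described above.

```c
/*
 * Illustrative user-space sketch of the access-bit sampling schedule above
 * (not the authors' kernel code): every N ms the accessed bits are cleared,
 * T ms later they are read back, and the result is folded into an aged
 * per-page hotness counter.  The accessed[] array and the pte_* helpers are
 * hypothetical stand-ins for the real page-table walk, and the 8-bit counter
 * width is an assumption; in the real system the counter lives in spare PTE
 * bits.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NPAGES  4096
#define HOT_MAX 255                      /* assumed 8-bit hotness counter */

static uint8_t hotness[NPAGES];          /* aged hotness counter per page */
static bool    accessed[NPAGES];         /* stand-in for the PTE accessed bit */

static void pte_clear_accessed(size_t p) { accessed[p] = false; }
static bool pte_test_accessed(size_t p)  { return accessed[p]; }

/* Called at the start of each sampling period (every N ms). */
void hotness_sample_begin(void)
{
    for (size_t p = 0; p < NPAGES; p++)
        pte_clear_accessed(p);           /* opens the T-ms observation window */
}

/* Called T ms after hotness_sample_begin(). */
void hotness_sample_end(void)
{
    for (size_t p = 0; p < NPAGES; p++) {
        /* Exponential aging: halve the old value, then add a fixed boost if
         * the page was touched in this window, so the counter reflects both
         * recency and frequency of access. */
        unsigned aged = hotness[p] >> 1;
        if (pte_test_accessed(p))
            aged += HOT_MAX / 2;
        hotness[p] = (uint8_t)(aged > HOT_MAX ? HOT_MAX : aged);
    }
}

int main(void)
{
    /* Two simulated periods: page 7 is touched in both, page 9 only in the
     * first, so page 7 ends up with the larger hotness counter. */
    hotness_sample_begin();
    accessed[7] = accessed[9] = true;
    hotness_sample_end();

    hotness_sample_begin();
    accessed[7] = true;
    hotness_sample_end();

    printf("hotness: page 7 = %u, page 9 = %u\n", hotness[7], hotness[9]);
    return 0;
}
```

In the deployed system the begin/end calls would be driven by timers, with N = 10 or 100 ms and T = 2 ms as noted above.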
Miss-Ratio-Curve Driven Cache Partition Policy
• Choose the partition point that optimizes the system optimization metric, subject to Cache Size = ∑A,B Cache Allocation
[Figure: Thread A's and Thread B's miss ratios as a function of cache allocation (0 to 4MB), with the optimal partition point marked]

Hot Page Coloring
• Budget control of page re-coloring overhead
  – A percentage of the time slice, e.g. 5%
• Recolor from the hottest page until the budget is reached
  – Maintain a set of hotness bins during sampling
    • bin[i][j] = number of pages in color i with normalized hotness in range [j, j+1]
  – Given a budget K, the K-th hottest page's hotness value is estimated in constant time by searching the hotness bins
  – Make sure hot pages are uniformly distributed among colors

Re-coloring Procedure
[Figure: pages of each color (red, blue, green, gray) sorted in ascending hotness-counter order; when a thread's cache share decreases, only the hottest pages are moved, up to the budget (3 pages in the example)]
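The constant-time budget check described above can be sketched as follows. The bin dimensions and the helper name `kth_hottest_level` are illustrative assumptions, not the authors' code; the point is that locating the K-th hottest page's hotness level costs a scan of colors x bins entries, independent of how many pages the application maps.

```c
/*
 * Sketch of the budgeted re-coloring policy above, under assumed sizes and
 * data structures (kth_hottest_level is a hypothetical helper, not the
 * authors' code).  bin[i][j] counts pages of color i whose normalized
 * hotness falls in bin j; scanning NCOLORS x NBINS entries finds the hotness
 * level of the K-th hottest page in time independent of the number of pages.
 */
#include <stdio.h>

#define NCOLORS 16
#define NBINS   32                       /* normalized hotness levels 0..31 */

/* Return the hotness level of the bin containing the K-th hottest page. */
int kth_hottest_level(const unsigned bin[NCOLORS][NBINS], unsigned K)
{
    unsigned seen = 0;
    for (int j = NBINS - 1; j >= 0; j--) {        /* walk from hottest bin down */
        for (int i = 0; i < NCOLORS; i++)
            seen += bin[i][j];
        if (seen >= K)
            return j;                    /* bin j contains the K-th hottest page */
    }
    return 0;                            /* fewer than K pages are tracked */
}

int main(void)
{
    unsigned bin[NCOLORS][NBINS] = { { 0 } };

    /* Hypothetical distribution: color 0 has 2 pages at level 31 and 5 at
     * level 30; color 1 has 4 pages at level 29. */
    bin[0][31] = 2;
    bin[0][30] = 5;
    bin[1][29] = 4;

    /* With a budget of K = 6 pages, pages at or above the returned level are
     * re-colored hottest-first until the budget is spent. */
    printf("re-color pages with hotness level >= %d\n",
           kth_hottest_level(bin, 6));
    return 0;
}
```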
Performance Comparison
[Figure: four SPECcpu2k benchmarks (art, equake, mcf, and twolf) running on two sibling cores of an Intel Core 2 Duo that share a 4MB L2 cache]

Additional Benefit of Hotness-Based Page Coloring
• Page coloring introduces artificial memory pressure
  – An app's footprint may be larger than the pages of its entitled memory colors, even though the system still has an abundance of memory pages
• Allow the app to "steal" another's colors, but preferentially copy cold pages into the other's memory colors

Big Picture (recap)
• Control resource usage of co-running applications: page coloring or hardware throttling
• Select which applications run together: resource-aware scheduling

Existing Mechanism (II): Scheduling Quantum Adjustment
• Shorten the time slice of the app that overuses the cache
• May leave a core idle if there is no other active thread available
[Timeline: Core 0 runs Thread A continuously; Core 1 alternates Thread B with idle periods]

Drawback of Scheduling Quantum Adjustment
• Coarse-grained control at scheduling-quantum granularity may result in fluctuating service delays for individual transactions

New Mechanism: Hardware Execution Throttling [USENIX'09]
• Instead of directly controlling resource allocation, throttle the execution speed of the app that overuses the resource
• Available throttling knobs
  – Duty-cycle modulation
    • The CPU works only during duty cycles and stalls during non-duty cycles
    • Different from dynamic voltage/frequency scaling: per-core rather than per-processor control, and originally intended for thermal rather than power management
  – Frequency/voltage scaling
  – Enabling/disabling cache prefetchers
    • L1 prefetchers: IP keeps track of the instruction pointer for load history; DCU prefetches the next line when it detects multiple loads from the same line within a time limit
    • L2 prefetchers: Adjacent line prefetches the line adjacent to the requested data; Stream looks for regular patterns in streams of data

Comparison of Hardware Execution Throttling to the Other Two Mechanisms
• Comparison to page coloring
  – Little kernel complexity: 40 lines of code in a single file; as a reference, our page coloring implementation takes 700+ lines of code across 10+ files
  – Lightweight to configure: a register read plus write costs 265 + 350 cycles for duty cycle and 298 + 2065 cycles for the prefetchers, i.e. less than 1 microsecond; as a reference, re-coloring a single page takes 3 microseconds
• Comparison to scheduling quantum adjustment
  – More fine-grained control
  [Timeline: quantum adjustment idles Core 1 for whole quanta, while hardware execution throttling slows Thread B's core continuously]
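For concreteness, the duty-cycle knob quoted above (a register read plus write) can be driven with very little code. The sketch below assumes a Linux host with the msr driver loaded and root privileges, and writes the per-core IA32_CLOCK_MODULATION MSR directly from user space; the mechanism evaluated above lives inside the kernel, so this is only an illustration of the knob, not the authors' implementation.

```c
/*
 * User-level sketch (assuming Linux with the msr driver loaded, run as root)
 * of the duty-cycle knob.  It writes the per-core IA32_CLOCK_MODULATION MSR
 * (0x19a): bit 4 enables on-demand clock modulation and bits 3:1 select the
 * duty-cycle level in 1/8 steps, so level 8 means full speed and level 4
 * means the core runs in only half of its cycles.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19a

int set_duty_cycle(int cpu, unsigned level /* 1..8, 8 = no throttling */)
{
    char path[64];
    uint64_t val;

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    if (level >= 8)
        val = 0;                                  /* modulation off: full speed */
    else
        val = (1u << 4) | ((level & 0x7u) << 1);  /* enable bit + duty-cycle field */

    int ok = pwrite(fd, &val, sizeof val, IA32_CLOCK_MODULATION) == sizeof val;
    close(fd);
    return ok ? 0 : -1;
}

int main(int argc, char **argv)
{
    /* Example: "./throttle 2 4" caps core 2 at a 4/8 duty cycle. */
    if (argc == 3 && set_duty_cycle(atoi(argv[1]), (unsigned)atoi(argv[2])) == 0)
        return 0;
    fprintf(stderr, "usage: %s <cpu> <level 1-8> (needs root and the msr driver)\n",
            argv[0]);
    return 1;
}
```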
Fairness Comparison
• Unfairness factor: coefficient of variation (deviation-to-mean ratio, σ/μ) of the co-running apps' normalized performances (the normalization base is the execution time/throughput when the application monopolizes the whole chip)
• On average, all three mechanisms are effective in improving fairness
• The {swim, SPECweb} case illustrates a limitation of page coloring

Performance Comparison
• System efficiency: geometric mean of the co-running apps' normalized performances
• On average, all three mechanisms achieve system efficiency comparable to default sharing
• Cases with severe inter-thread cache conflicts favor segregation, e.g. {swim, mcf}
• Cases with well-interleaved cache accesses favor sharing, e.g. {mcf, mcf}

Policies for Hardware-Throttling-Based Multicore Management
• User-defined service level agreements (SLAs)
  – Proportional progress among competing threads
    • Unfairness metric: coefficient of variation of the threads' performance
  – Quality-of-service guarantee for high-priority application(s)
• Key challenge
  – The throttling configuration space grows exponentially as the number of cores increases
  – Quickly determining optimal or close-to-optimal throttling configurations is challenging

Model-Driven Iterative Framework
• Customizable performance estimation model
  – Reference configuration set and linear approximation
  – Currently incorporates duty cycle modulation and frequency/voltage scaling
• Iterative refinement
  – Prediction accuracy improves over time as more configurations are added to the reference set

Online Deployment: Hill-Climbing Search Acceleration
• For an m-throttling-level, n-core system, predicting a "best" configuration by brute force requires evaluating the model m^n times
• Hill climbing searches along the best child rather than all children
• Prunes the computation space to (m-1)n^2
[Figure: iterative refinement pattern; from a configuration (X,Y,Z,U), candidate children lower one core's throttling level at a time, e.g. (X-1,Y,Z,U), (X,Y-1,Z,U), …]

Accuracy Evaluation
• Test platform
  – A quad-core Nehalem processor with an 8MB shared L3 cache
  – Search space from full CPU speed (duty cycle level 8) to half CPU speed (duty cycle level 4), giving 369 configurations for each test
• Benchmarks: SPECCPU2k
  – Set-1: {mesa, art, mcf, equake}
  – Set-2: {swim, mgrid, mcf, equake}
  – Set-3: {swim, art, equake, twolf}
  – Set-4: {swim, applu, equake, twolf}
  – Set-5: {swim, mgrid, art, equake}

Capability of Satisfying SLAs
• Service level agreements (SLAs)
  – Fairness-oriented: keep the unfairness below a threshold
  – QoS-oriented: keep the QoS core's performance above a QoS threshold
• 4 different unfairness/QoS thresholds for each of the 5 sets
• Optimization goal: satisfy SLAs while optimizing performance or power efficiency

            # Passing tests   Avg. num of samples   Avg. performance of picked configs that pass tests
  Oracle    39/40             0                     100%
  Model     39/40             4.1                   99.4%
  Random    25/40             15                    91.1%

(Recall that the search space has 369 configurations.)

Accuracy of Performance Estimation
• Error Rate = |Prediction - Actual| / Actual
[Figure: prediction error rates]

Big Picture (recap)
• Control resource usage of co-running applications: page coloring or hardware throttling
• Select which applications run together: resource-aware scheduling

Resource-Aware Scheduling
• The scheduling decision can significantly affect performance

Similarity Grouping Scheduling
• Group applications with similar cache miss ratios on the same chip
  – Separate high- and low-miss-ratio apps onto different chips
• Benefits
  – Mitigates cache thrashing
  – Avoids over-saturating memory bandwidth
  – Enables per-chip DVFS-based power savings
    • A single voltage setting applies to all sibling cores on existing multicore chips
    • The high-miss-ratio chip runs at low frequency while the low-miss-ratio chip runs at high frequency

Frequency-to-Performance Model
• Objective: explore power savings with bounded performance loss
• Assumptions
  – An application's performance is linearly determined by cache and memory access latencies
  – Frequency scaling only affects on-chip accesses
  – The miss ratio does not vary across frequencies
• Normalized performance at frequency f = T(F) / T(f), where T(.) is execution time and F is the full frequency
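One plausible way to instantiate this model, under the stated assumptions, is sketched below: only the on-chip portion of execution time stretches as the clock slows, so the predicted normalized performance at frequency f follows T(F)/T(f). The `mem_fraction` parameter, the candidate frequency list, and the selection helper are illustrative assumptions, not the exact model from the paper; in the real system the off-chip share would be derived from the sampled miss ratio.

```c
/*
 * A plausible instantiation of the frequency-to-performance model under the
 * assumptions listed above (only on-chip time stretches as the clock slows,
 * off-chip time stays constant); a sketch, not the authors' exact model.
 * mem_fraction, the share of full-speed execution time spent on off-chip
 * accesses, would in practice be derived from the sampled cache miss ratio.
 */
#include <stdio.h>

/* Predicted normalized performance T(F)/T(f) when scaling from the full
 * frequency F down to frequency f. */
static double predicted_perf(double f, double F, double mem_fraction)
{
    double cpu = (1.0 - mem_fraction) * (F / f);  /* on-chip time stretches  */
    double mem = mem_fraction;                    /* memory time is constant */
    return 1.0 / (cpu + mem);
}

/* Pick the lowest frequency whose predicted slowdown stays within the
 * degradation threshold (e.g. 0.10 for 10%). */
double pick_frequency(const double *freqs, int n, double F,
                      double mem_fraction, double max_degradation)
{
    double best = F;
    for (int i = 0; i < n; i++)
        if (predicted_perf(freqs[i], F, mem_fraction) >= 1.0 - max_degradation
            && freqs[i] < best)
            best = freqs[i];
    return best;
}

int main(void)
{
    double freqs[] = { 1.6e9, 1.8e9, 2.0e9, 2.4e9, 2.8e9 };
    double full = 2.8e9;

    /* A phase spending 60% of its full-speed time off-chip, with a 10%
     * performance degradation budget. */
    printf("chosen frequency: %.1f GHz\n",
           pick_frequency(freqs, 5, full, 0.6, 0.10) / 1e9);
    return 0;
}
```

Running this example picks 2.4 GHz: a phase that spends 60% of its time off-chip can tolerate a modest frequency drop while staying within a 10% degradation budget.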
Model Accuracy
• Error Rate = (Prediction - Actual) / Actual
[Figure: prediction error across frequencies and benchmarks]

Model-Based Dynamic Frequency Setting
• Dynamically adjust the CPU frequency based on the currently running application's behavior
  – Collect the cache miss ratio every 10 milliseconds
  – Calculate an appropriate frequency setting from the performance estimation model
    • Guided by a performance degradation threshold (e.g. 10%)

Hardware Counter-Based Power Containers: An OS Resource
• Cross-core activity influence
• Online calibration with actual measurement
• Application-transparent online request context tracking

Power Conditioning Using Power Containers
[Figure: power conditioning using power containers]

Power Conditioning Achieved Using Targeted Throttling
[Figure: power conditioning achieved using targeted throttling]

Ongoing Work
• Variation-directed information and management
  – Using behavior fluctuation to trigger monitoring
  – Supporting fine-grain resource accounting
  – Developing policies to reshape behavior for high dependability and low jitter
  – Request-level power attribution, modeling, and management

Arch/App: Shared Memory ++: DIMM [TRANSACT'06, ISCA'07, ISCA'08, ICS'09]
• Data Isolation (DI)
  – Provides control over the propagation of writes
  – Buffers writes and allows group undo or propagation
  – Applications: sand-boxing, transactional programming, speculation
• Memory Monitoring (MM)
  – Monitors memory at summary or individual cache-line granularity
  – Applications: synchronization/event notification, reliability, security, watchpoints/debugging
See http://www.cs.rochester.edu/research/cosyn and http://www.cs.rochester.edu/~sandhya

Arch/App/OS: Protection: Separation of Privileges
• Reality: today's programs often consist of multiple modules written by different programmers
• Reliability and composability require developing access and interface conventions

Sentry: Light-Weight Auxiliary Memory Access Control [ISCA'10]
• Access checks performed on an L1 miss
  – 90x energy savings
  – Simplifies implementation
• A metadata cache (M-cache) is accessed in parallel with the L2 to speed up the checks

The Indirection Problem [PACT 2008]
[Figure: in a directory protocol, a load of block A by P0 is forwarded to the home node's directory and only then to the cache currently holding A before the data is returned]
• Longer distance means longer latency

Fine-Grain Data Sharing
• Simultaneous access to the same data by more than one core while the data still resides in some L1 cache
• Fine-grain sharing can be leveraged to localize communication

Goal: Localize Shared Data Communication
[Figure: P0 loads A; with direct cache-to-cache transfer the data reaches P0 in 2 physical hops versus 10 hops through the home node]
• Key idea: data availability at P0 in 2 vs. 10 physical hops, whether P4 holds A in M or S state

Summary
• Harnessing 50B transistors requires a fresh look at conventional hardware-software boundaries and interfaces, with support for
  – Scalable coherence design
  – Controlled data sharing via architectural support for memory monitoring, isolation, and protection
  – Controlled resource sharing via operating-system-level policies for performance isolation
• We have examined coherence protocol additions that allow
  – Fast event-based communication
  – Fine-grain access control
  – Programmable support for isolation
  – Low-latency access for fine-grain data sharing
  – Software to determine policy decisions in a flexible manner
• A combined hardware/software approach to supporting concurrency with improved performance and scalability