Contention Awareness in the Cloud

Mary Lou Soffa
Computer Science Department, University of Virginia
151 Engineer’s Way, Charlottesville, VA 22904
Email: soffa@cs.virginia.edu
Tele: (434) 982 2277
URL: http://www.cs.virginia.edu/ soffa

1. Motivation and Introduction

1.1. Motivation. Web companies, such as Google, use off-the-shelf commodity components to build datacenters because they are cheap, abundant, and easily replaceable. Current state-of-the-art commodity processors are primarily multicore chips. However, when multiple processes execute simultaneously, contention for shared on-chip resources can cause significant performance degradation and low machine utilization. When the performance of an application is negatively affected due to contention with a co-running application on a neighboring core, we call this cross-core performance interference. Cross-core interference is particularly undesirable for latency sensitive applications such as Search and Maps. For instance, in the latency sensitive domain of web search, cross-core interference can cause unexpected slowdowns, negatively impacting the QoS of a search query. A commonly used solution is simply to disallow the co-location of latency sensitive applications with other applications on a single machine, which results in low machine utilization and higher energy costs.

To date, very little research has investigated deployable solutions to these challenges in the current commodity multicore environment. Effectively addressing these challenges can significantly improve machine utilization, lower power consumption, improve energy efficiency, decrease latency, and ultimately lower the cost of running the datacenter, potentially saving many millions of dollars.

Figure 1 illustrates the potential cross-core interference that can occur when multiple co-running applications execute on current multicore architectures.
We perform this experiment using two representative examples of the state-of-the-art multicore chip designs that can be found in a datacenter: the Intel Core i7 Quad Core chip and AMD’s Phenom X4 Quad Core. The figure shows the performance slowdown when co-locating each of the SPEC2006 benchmarks with the lbm application (known to be cache intensive). As this graph shows, there are severe performance degradations (of up to 35%) due to cross-core interference on most of the applications.

1.2. Research. The vision for our Adaptive Cloud Computing Systems Lab (ACCS Lab) is to develop infrastructure and techniques to maximize utilization in a datacenter without sacrificing latency. We focus on innovation in application software, system software, and micro-architectural design. As we form our ACCS Lab, one of our first initiatives is a holistic contention aware computing environment to address the challenges of on-chip resource contention in the memory subsystem and the resulting cross-core performance interference. This computing environment includes a contention aware compilation framework, programming model, operating system, and micro-architectural design. As shown in Figure 2, this proposal focuses on the investigation, design, and evaluation of the contention aware (CA) Compilation Framework. State-of-the-art compilation techniques on current multicore architectures are oblivious to contention. We believe that CA compilation may prove critical to fully utilizing the parallelization power of these architectures.
Our CA compilation framework includes three inter-related research directions, as shown in Figure 2: 1) the formation of a profiling and characterization framework that will be able to quantify the inherent cross-core interference sensitivity of an application and of specific code regions, 2) novel static compilation techniques that will allow us to specialize application code layout based on the code regions’ cross-core interference characteristics provided by our profiler, and 3) sophisticated online adaptive and managed runtime techniques that will allow dynamic responses to contention as it occurs.

These three components of our CA Compilation framework are highly inter-related and necessary for realizing the vision of contention-aware compilation. Because contention only occurs at runtime, static analysis alone is not sufficient to characterize cross-core interference, which necessitates the design of novel characterization and profiling techniques. Our profiling analysis will be able to reveal an application’s dynamic phases of contention and identify contentious code regions. Our compilation techniques can then perform code transformations on these contentious regions. The compilation framework will also incorporate other knowledge gained during profiling through feedback directed optimizations. Novel compilation techniques can also help enable the online adaptive detect-and-respond system. In addition, the phase-level knowledge gained during profiling can assist the online system.

[Figure 1. Slowdown when co-located with lbm. Series: Intel Core i7 Quad, AMD Phenom X4. Y-axis: slowdown, 1x to 1.4x.]

1.3. Preliminary Work. We have investigated the challenges posed by contention on multicore architectures and gained many insights during our preliminary work and our collaboration experience with Google.
We have also demonstrated the potential of contention aware and adaptive approaches through our work on Scenario Based Optimization (SBO) [1], our Contention Aware Execution Runtime (CAER) [3], and our preliminary work on a characterization and profiling methodology [2]. The SBO and CAER works [1, 3] were done in collaboration with Robert Hundt at Google while my PhD student, Jason Mars, interned there for the last two summers. In summary, using SBO we were able to apply aggressive loop unrolling and software cache prefetching optimizations dynamically and adaptively, gaining a 12% performance improvement on average over traditional static approaches. Meanwhile, our CAER engine provides improved performance isolation for latency sensitive applications on current commodity hardware. Using hardware performance counter information, the runtime detects contention in the shared on-chip cache and responds by staggering or halting the execution of the latency insensitive application. With CAER, we bring the overhead due to contention from 17% down to 5% on average, while gaining close to 60% more utilization of the processor over running the latency-sensitive application alone.

These results demonstrate the great promise of contention-aware and adaptive approaches. We have only scratched the surface with these prior technologies, and we believe that by addressing the problem of contention appropriately, performance and utilization can be significantly improved. Please refer to the full publications for more in-depth details of our preliminary work. Also note that we have continued our collaboration with the compiler team at Google and provided our prototype implementations to their team via Google Code.

[Figure 2. Contention Aware Computing Environment]

2. Proposed Research

To date, there is no compilation framework that is contention aware. As the first initiative of our Adaptive Cloud Computing Systems Lab, we propose a holistic contention aware compilation framework.
To achieve this we propose a number of research projects.

2.1. Characterizing and Profiling Cross Core Interference Sensitivity. An application’s cross-core interference sensitivity is determined by its intrinsic reliance on the shared memory resources and by the underlying abundance and management of those resources in the micro-architecture. Also note that, as our preliminary experiments show, the cross-core interference sensitivity of an application on average also indicates its aggressiveness, since the sensitivity of an application hinges on its demand on, and usage of, the shared resource. We seek to characterize this sensitivity and aggressiveness as they relate to the entire application, its phases, and source-level code regions. To date, there has been no methodology for identifying and extracting this information on current real-world multicore architectures.

To address these challenges, we propose an online empirical characterization approach. To assess an application’s sensitivity to cross-core interference due to contention, we plan to synthesize contention. As an application executes on our profiling framework, a carefully designed contention synthesis engine will be spawned on a neighboring core to run alongside the application. This contention synthesis engine is continually controlled by our profiler and run in a bursty fashion. The resulting performance impact on the host application is monitored, analyzed, and profiled by the profiling framework. Using the ubiquitous on-chip hardware performance monitors, our profiler will also monitor specific application code regions as they relate to this performance impact. This performance impact will be used to generate a quantitative metric for cache contentiousness that we will be able to associate with an application, its individual phases, and source-level code regions. Our preliminary contention synthesis mechanisms are presented in [2].
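As a rough illustration of how bursty contention synthesis could yield a per-region metric, the following Python sketch compares a performance reading taken while the synthesis engine is running against one taken while it is idle. The sample format, the `sensitivity_by_region` name, and the use of a relative slowdown as the metric are assumptions made for illustration only; they are not the mechanism of [2].

```python
from collections import defaultdict
from statistics import mean


def sensitivity_by_region(samples):
    """Estimate a contention sensitivity score per code region.

    `samples` is an iterable of (region, contended, perf) tuples, where
    `region` names a code region (hypothetical labels), `contended` says
    whether the contention synthesis engine was running during the
    sampling interval, and `perf` is any higher-is-better performance
    reading for that interval (for example, instructions per cycle).

    The score is the relative performance lost under contention:
    1 - mean(perf while contended) / mean(perf while uncontended).
    """
    contended_perf = defaultdict(list)
    baseline_perf = defaultdict(list)
    for region, contended, perf in samples:
        target = contended_perf if contended else baseline_perf
        target[region].append(perf)

    scores = {}
    # Only regions observed in both states can be scored.
    for region in sorted(contended_perf.keys() & baseline_perf.keys()):
        baseline = mean(baseline_perf[region])
        if baseline > 0:
            scores[region] = 1.0 - mean(contended_perf[region]) / baseline
    return scores


if __name__ == "__main__":
    # Made-up readings, for demonstration only.
    demo = [
        ("loop_a", False, 2.0), ("loop_a", True, 1.0),
        ("loop_b", False, 1.0), ("loop_b", True, 1.0),
    ]
    print(sensitivity_by_region(demo))
```

Under this assumed metric, a region whose performance halves during contention bursts scores 0.5, while a region that is unaffected scores 0.0. The same computation could be applied per phase or to the whole application by changing what `region` labels.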
In addition to what is presented in this preliminary work, we propose application phase detection and characterization, and source-level code region characterization.

2.2. Contention Aware Compilation Techniques. For co-running applications, our prior work [3] indicates that a small amount of throttling down for one application can greatly reduce contention, resulting in less performance degradation. Based on this observation, we propose a new way of thinking about compiler optimization. Instead of optimizing an application while only considering its standalone performance, we propose optimizing an application considering its performance when co-running with others.

[Figure 1 axes. X-axis: SPEC2006 benchmarks astar, bzip2, dealII, gcc, gobmk, h264ref, hmmer, lbm, libquantum, mcf, milc, namd, omnetpp, perlbench, povray, sjeng, soplex, sphinx3, xalancbmk, and the mean. Y-axis: Slowdown.]

We will first investigate how existing optimizations affect a program’s behavior in the presence of contention. Various existing optimizations, including software cache prefetching and other optimizations that modify the rate or order of memory accesses, such as loop transformations, affect how an application interacts with the shared memory system. Therefore, they may have either positive or negative effects on a program’s sensitivity to cross-core interference. We will investigate these optimizations, then design heuristics and provide the option to apply them based on the new optimization objectives: optimizing for overall performance and/or accommodating the performance of latency sensitive applications. An example could be heuristics that restrict the application of software cache prefetching to code regions that are identified as not aggressive.

We will also design novel code transformation techniques. For example, one technique we wish to investigate is to slow down the data access rate of particular loops that are identified as contentious, using a loop padding technique. These code regions and particular loops will be identified using our profiling framework.
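The loop padding idea described above can be sketched as follows. This is only a toy Python illustration of the shape of the transformation, assuming that padding means inserting a fixed amount of non-memory work after every few data accesses. The function name and the `pad_every` and `pad_iters` parameters are hypothetical, and a real implementation would be a compiler transformation applied to native code rather than Python.

```python
def padded_sum(data, pad_every=0, pad_iters=0):
    """Sum `data`, optionally padding the loop to lower its access rate.

    After every `pad_every` element accesses, the loop performs
    `pad_iters` iterations of work that touches no input data, which
    spreads the same memory accesses over a longer period of time.
    With pad_every == 0 the loop is left unpadded.

    Returns (total, pad_work), where pad_work counts the padding
    iterations that were executed.
    """
    total = 0
    pad_work = 0
    for i, value in enumerate(data):
        total += value  # the original, memory-touching loop body
        if pad_every and (i + 1) % pad_every == 0:
            for _ in range(pad_iters):
                pad_work += 1  # padding: work with no data access
    return total, pad_work


if __name__ == "__main__":
    print(padded_sum(range(10)))        # unpadded loop
    print(padded_sum(range(10), 5, 3))  # pad 3 iterations every 5 accesses
```

The padded and unpadded loops compute the same result; only the spacing of data accesses changes. In this sketch, the profiler's contentiousness score for a loop would be the natural input for choosing `pad_every` and `pad_iters`.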
2.3. Contention Aware Online Adaptive Techniques and Managed Runtime Systems. To enable online adaptive approaches, in previous research [1] we have suggested a Scenario Based Optimization framework (Google Patent Pending) that can apply application code changes online only when contention is detected. We believe we have just scratched the surface and would like to investigate a novel framework that includes a continuous GCC versioning server that can continuously evaluate and provide code re-layout alternatives for dynamic selection.

In addition to these compilation techniques, we believe there is groundbreaking work to be done in the domain of managed runtime environments such as the Java VM and Microsoft’s Common Language Runtime (CLR). One of our long term goals is the design of a Contention Aware VM. The VM provides a broader design space and thus more opportunities to address the challenges of contention in a datacenter. In this proposed research, we will focus our investigation on restructuring and applying the novel techniques we have proposed at a finer granularity in the managed dynamic compilation domain. We also plan to investigate how to use the virtual execution and garbage collection capabilities of managed runtimes to provide a harness for novel dynamic memory re-layout techniques to address contention.

3. Expected Outcomes and Results

Our prior work has shown that contention-aware approaches in the datacenter are promising for improving utilization and reducing energy consumption while guaranteeing the QoS of latency sensitive applications. Our proposed research is to further realize this promise and to develop a more advanced and comprehensive contention-aware compiler infrastructure.
The research result will be an end-to-end compilation framework that provides capabilities including profiling and characterization of an application’s contention sensitivity, compiler optimization and code transformation techniques to address the performance impact of contention, and online approaches and a managed runtime to adapt and respond to contention dynamically. Our detailed expected outcomes and results are as follows:

• A profiling and characterization framework that effectively captures an application’s inherent sensitivity to cross-core interference caused by contention. The system will also pinpoint hot code regions that are contentious.
• New compiler heuristics and code transformations that are designed, implemented, and evaluated. We will investigate how existing optimization techniques should be used in the contention-conscious environment, and we will design new code transformation techniques to reduce contention and improve overall performance.
• Online adaptive and managed runtime techniques that enable applications to dynamically detect occurrences of contention and respond.
• We will provide our contention aware compilation framework to Google as open source profiling systems and open source GCC and JVM extensions. All of our deliverables will work on real programs and real systems, translating into real performance boosts.

4. Budget Plan and Google Contact

Considering the scope of this project and the promise demonstrated in our publications [1, 2, 3], we are requesting one year of funding for two students, one of whom is Jason Mars, who interned at Google and, under the supervision of Robert Hundt, designed and implemented the prior work mentioned. The total amount of funds being requested is $128,000. These funds will be used to support two PhD students for a period of one year ($54K per student per year) and to provide one month of summer salary for the PI, Mary Lou Soffa ($20,000).
Robert Hundt will serve as the sponsor of this project. The results and infrastructure of this research work will be regularly shared with Robert Hundt and his compiler team.

References

[1] J. Mars and R. Hundt. Scenario based optimization: A framework for statically enabling online optimizations. In CGO ’09: Proceedings of the 2009 International Symposium on Code Generation and Optimization, pages 169–179, Washington, DC, USA, 2009. IEEE Computer Society.
[2] J. Mars and M. L. Soffa. Synthesizing contention. In 2009 Workshop on Binary Instrumentation and Applications (WBIA), New York, NY, USA, December 2009.
[3] J. Mars, N. Vachharajani, M. L. Soffa, and R. Hundt. Contention aware execution: Online contention detection and response. In CGO ’10: Proceedings of the 2010 International Symposium on Code Generation and Optimization, Toronto, Canada, April 2010.

Biography: Mary Lou Soffa

Department of Computer Science
University of Virginia
Charlottesville, VA 22904
(434) 298-2277
soffa@virginia.edu

Professional Preparation

University of Pittsburgh, B.S., Mathematics
Ohio State University, M.S., Mathematics
University of Pittsburgh, Ph.D., Computer Science

Appointments

University of Virginia, Owen R Cheatham Prof. and Chair, 2004-
University of Pittsburgh, Professor, 1990–2004
University of Pittsburgh, Graduate Dean in Arts & Sciences, 1991–1996
University of California, Berkeley, Visiting Associate Professor, 1987–1988
University of Pittsburgh, Assist, Associate Professor, 1977–1989

Honors and Awards (selected)

• Invited speaker, International Conference on Software Testing, April, 2009
• Computing Research Association’s Nico Habermann Award, 2006
• Keynote Speaker, Fifth International Conference on Quality Software, Melbourne, Australia, September, 2005
• Keynote Speaker, International Compiler Construction Conference, Barcelona, March, 2004.
• ACM Fellow, 1999
• Presidential Award for Excellence in Mentoring in Science, Mathematics, and Engineering, 1999: given by the White House for excellence in mentoring under-represented students and encouraging their significant achievement in science, mathematics and engineering.
• Girl Scout Woman of Distinction for 2003.
• Invited Speaker, Grace Hopper Celebrating Women Conference, October 2002
• Invited Speaker, National Symposium on the Advancement of Women in Science, Harvard, April 2003.
• Most Influential papers of 20 years in ACM/SIGPLAN Programming Languages Design and Implementation (PLDI), “Complete Removal of Redundant Expressions” (co-authored with R. Bodik and R. Gupta), 40 out of 550 papers selected and appeared in a PLDI Anniversary issue, 2003.

Five Publications Most Closely Related to the Proposed Project

1. Jason Mars, Mary Lou Soffa, Neil Vachharajani, Robert Hundt, “Contention Aware Execution: Online Contention Detection and Response,” to appear in Proceedings of the ACM/IEEE International Symposium on Code Generation and Optimization (CGO), 2010.
2. Jason Mars and Mary Lou Soffa, “Mats: MultiCore Adaptive Trace Selection,” Third workshop on Software Tools for Multicore Systems, collocated with CGO, April, 2009.
3. Min Zhao, Bruce R. Childers, Mary Lou Soffa, “A Framework for Exploring Optimization Properties,” Compiler Conference, March, 2009.
4. Jing Yang, Shukang Zhou, and Mary Lou Soffa, “Dimension: An Instrumentation Tool for Virtual Execution Environments,” Second International Conference on Virtual Execution Environments (VEE '06), Ottawa, Canada, June 14, 2006.
5. N. Kumar, B. R. Childers, D. Williams, J. W. Davidson and M.L. Soffa, “Compile-time Planning for Overhead Reduction in Software Dynamic Translators,” International Journal on Parallel Programming, December 2004.

Five Other Significant Publications

1. M. Zhao, B. Childers and M.L.
Soffa, “Predicting the Impact of Optimizations for Embedded Systems,” 2003 ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems, San Diego, CA, pp. 1-11, 2003.
2. J. Misurda, J. Clause, J. Reed, P. Gandra, B.R. Childers and M.L. Soffa, “Demand-Driven Structural Testing with Dynamic Instrumentation,” International Conference on Software Engineering, St. Louis, May, 2005.
3. Min Zhao, Bruce R. Childers and Mary Lou Soffa, “A Model-based Framework: An Approach for Profit-driven Optimization,” ACM SIGMICRO Int'l. Conference on Code Generation and Optimization (CGO'05), San Jose, California, March 2005.
4. Bruce Childers, Jack W. Davidson and Mary Lou Soffa, “Continuous Compilation: A New Approach to Aggressive and Adaptive Code Transformation,” Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'03), Nice, 2003.
5. J. Misurda, J. Clause, J.L. Reed, P. Gandra, B.R. Childers and M.L. Soffa, “Jazz: A Tool for Demand-Driven Structural Testing,” 14th ETAPS International Conference on Compiler Construction (CC'05), Edinburgh, Scotland, April 2005.

Five Synergistic Activities

1. Chair/Vice Chair of Professional Organizations: Vice Chair of Computing Research Association (CRA) (1998-2001); Co-Chair of CRA-W, Committee on Status of Women in Computing (1990-2002); Chair of ACM/SIGPLAN (1997-1999).
2. Conference Chair: Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2009; Conference on Code Generation and Optimization (CGO), March, 2008; SIGSOFT Eighth International Symposium on the Foundations of Software Engineering, 2002; ACM SIGPLAN Programming Languages Design and Implementation, 1995.
3. Program Chair: ACM/IEEE International Conference on Software Engineering, 2006; ACM SIGPLAN Programming Languages Design and Implementation, June 2001; Parallel Architectures and Compiler Techniques, October, 2000.
4.
Member on Editorial Board: ACM Transactions on Software Engineering and Methodology (2003-present); Journal of Parallel Programming (1995-present); IEEE Transactions on Software Engineering (1994-2000); South African Journal of Computing (1996-present); ACM Transactions on Programming Languages and Systems (1993-2000).
5. Diversity/Student Activities: CRA Career Workshops, Snowbird Conference panels, OOPSLA Doctoral Symposium, ICSE Doctoral Symposium Chair, Presidential Award for Mentoring Underrepresented groups.

Advisor/Co-advisor: advisor to 55 Master Students, over half of whom were women.
PhD students (Graduated): Naveen Kumar (VMware), 2008, Greg Kapfhammer (Allegheny College), 2007, Min Zhao (HP Research Labs), Atif Memon (University of Maryland), Clara Jaramillo (Chatham College), Rastislav Bodik (University of California at Berkeley), Neelam Gupta (University of Arizona), Evelyn Duesterwald (IBM Research Labs), Jodi Tims (St. Francis College), Tia Watts (Indiana University of Pennsylvania), David Berson (Motorola, Inc.), Tarun Nakra (IBM), Chy Ren Dow (Feng-Chia University), Pat Pineo (Edinboro University), Deborah Whitfield (Slippery Rock University), Brian Malloy (Clemson University), Ravi Sharma (Bell Labs), Mary Jean Harrold (Georgia Tech), Mary Bivens (Allegheny College), Lori Pollock (University of Delaware), Rajiv Gupta (U. of Arizona), George Logothetis (AT&T), Ching-Chy Wang (Leverage Design Acceleration Corp.), Fernando Lafora-Garcia (DEC Corp).
PhD students (Current): Apala Guha, Jing Yang, Wei Le, Kristen Walcott, Jason Mars, Lingjia Tang, Wei Wang, Tanima Dey (University of Virginia).

Budget 2010-2011 (Soffa)

Salary support for PI Soffa (1 month)                $ 20,000
2 graduate students for 1 year ($54K per student)    $108,000
Total                                                $128,000

Breakdown of charges for one student

Salary                  $17,615.00
Tuition                 $13,670.00
Insurance               $ 2,092.00
Overhead                $10,641.78
Student fees            $    64.00
Out of state fees       $10,000.00
Foreign student fees    $   100.00
Total                   $54,182.78