Debloating Software

Guoqing (Harry) Xu
UC Irvine

Overview

Performance and scalability issues are becoming increasingly critical due to the pervasive use of object-oriented programming languages. The inefficiencies inherent in the implementation of an object-oriented language and the design and implementation principles commonly adopted in the object-oriented community often combine to hurt performance. The community-wide recognition of the importance of abstraction and reuse results in an increased emphasis on modular design, declaration of general interfaces, and use of models and patterns. Programmers are taught to focus first and foremost on these principles, taking it for granted that compilers and run-time systems can remove the resulting inefficiencies. In a large program, which is typically built on top of many layers of frameworks and libraries, a small set of inefficiencies can multiply and quickly be magnified until it slows down the whole system. When the call stack grows deep, the usefulness of the dataflow analyses in a dynamic compiler becomes limited and the optimizer can no longer remove these inefficiencies. As a result, many applications suffer from chronic run-time problems that significantly affect performance and scalability. This is a serious problem for real-world software systems used every day by thousands of businesses.

The pressing need for new optimization techniques is especially evident as object orientation finds its way into systems of every size. The extensive use of object-oriented languages in the development of memory-constrained applications such as smartphone apps (e.g., Java used in Android and C# used in Windows phones) and data-intensive systems (e.g., Hadoop, Giraph, and Hyracks) introduces numerous research challenges: these systems have small memory budgets but large amounts of data to process, so their inefficiencies can be significantly exacerbated. The burden of reducing unnecessary work should not rest solely on the shoulders of hardware designers, especially in the modern era when Moore's dividend is becoming less obvious. This situation strongly calls for high-level performance optimization techniques that can detect and remove inefficiencies in all categories of object-oriented applications.

We envision the following categories of techniques that need to be developed to improve the performance of the new generation of object-oriented applications.

1. Effective testing techniques that can find performance problems. Performance problems are notoriously difficult to find during development and in-house testing; many such problems in modern applications are scalability issues that only manifest when the input data is sufficiently large. Traditional testing focuses on the detection of functional bugs, and developers often do not have large, real-world input data with which to test a program. Very often, developers are not aware of the problems until the software is released and users observe that its performance cannot meet their expectations. Novel run-time techniques need to be developed to amplify performance problems so that they manifest even when the program is exercised with small inputs.

2. Semantics-aware adaptive optimizations. Adaptive optimization (such as feedback-directed optimization in a JIT compiler) has been extensively researched during the past decade. However, recent studies show that most of the severe performance problems in a modern application are caused by developers' mistakes (e.g., inappropriate choices of algorithms, data structures, etc.) closely related to the semantics of the application. Traditional (dataflow-based) optimizations are semantics-agnostic and thus cannot effectively remove today's semantic redundancies. New optimization techniques should be developed to complement the existing dataflow analyses performed in the JIT compiler. For example, it would be interesting to develop an automated tuning framework that can select and switch data structure implementations in object-oriented programs, as sketched below.
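As a minimal illustration of what such a tuning framework might do (the class name, threshold, and switching policy below are hypothetical and not taken from any existing system), consider a collection that starts with a compact, array-backed representation that is cheap for small sets and migrates to a hash-based representation once the set grows large enough that linear scans become expensive:

import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;

// Hypothetical sketch of a self-tuning collection. It begins with a
// memory-cheap array-backed representation and switches to a hash-based
// one when the set grows past a threshold, amortizing the one-time
// conversion cost over subsequent fast lookups.
public class AdaptiveSet<E> extends AbstractSet<E> {
    private static final int SWITCH_SIZE = 64;       // illustrative threshold
    private ArrayList<E> small = new ArrayList<>();   // compact representation
    private HashSet<E> large = null;                  // fast-lookup representation

    private Collection<E> backing() {
        return large != null ? large : small;
    }

    @Override
    public boolean add(E e) {
        if (backing().contains(e)) {
            return false;                             // preserve set semantics
        }
        boolean added = backing().add(e);
        // The "semantic" decision: once linear scans start to hurt,
        // migrate the elements into a hash table.
        if (large == null && small.size() > SWITCH_SIZE) {
            large = new HashSet<>(small);
            small = null;
        }
        return added;
    }

    @Override
    public boolean contains(Object o) {
        return backing().contains(o);
    }

    @Override
    public Iterator<E> iterator() {
        return backing().iterator();
    }

    @Override
    public int size() {
        return backing().size();
    }
}

A real tuning framework would base this decision on richer runtime profiles (operation mix, element counts, memory pressure) and on the program's semantics rather than a fixed size threshold, and it would have to guarantee that the observable behavior of the collection is preserved across the switch.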
3. Optimization of Big Data applications. Modern computing has entered the era of Big Data. Analyzing information from Twitter, Google, Facebook, Wikipedia, or the Human Genome Project requires scalable platforms that can quickly process massive amounts of data. Such frameworks often utilize large numbers of machines in a cluster or in the cloud to process data in a scalable manner. An object-oriented programming language such as Java is often the developer's choice for implementing data-processing frameworks; in fact, the Java community has already become the home of many data-intensive computing infrastructures, such as Hadoop, Hyracks, Storm, and Giraph. Despite the many development benefits provided by Java, these applications commonly suffer from severe memory bloat, which stems primarily from the inefficient memory usage inherent in the runtime of a managed language combined with the processing of huge volumes of data, which can exacerbate the already-existing inefficiencies by orders of magnitude (a concrete sketch of this kind of bloat appears after this list). Novel optimization techniques (based either on human effort or on compiler and run-time system support) should be developed to optimize bloat away in the presence of massive amounts of data, so that Big Data developers can enjoy the many benefits of object-oriented programming as well as high performance.

4. Novel program analysis techniques to interpret performance problems. Once a performance problem is observed, developers often have to perform manual tuning in order to understand its root cause. This is a daunting task: modern software often has an extremely large code base, creates many millions of objects, and runs for long periods of time. Manual tuning is very difficult because human experts have to find useful information in an ocean of objects and other executed program entities. There is thus a pressing need for automated program analysis techniques that can pinpoint the problematic areas and assist developers in fixing the problems to improve performance.

5. Optimizing memory-constrained systems. Memory-constrained systems, typified by smartphone apps, are much more vulnerable to inefficiencies than regular server or desktop applications. Identifying common inefficiency patterns in smartphone apps and developing techniques to optimize them away is a highly interesting future research direction that may lead to fundamental changes in the way such applications are developed.
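To make the memory bloat discussed in items 3 and 5 concrete, the sketch below contrasts two ways of storing a large collection of simple (user id, score) records in Java. The class and method names are hypothetical, and the byte counts in the comments are rough figures for a typical 64-bit JVM with compressed references, given only for illustration:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of per-record memory bloat in a managed language,
// and of a bloat-aware, column-oriented layout that avoids it.
public class RecordLayouts {

    // Object-oriented layout: one heap object per record. On a typical
    // 64-bit JVM each Record costs roughly 12 bytes of header plus 8 bytes
    // of fields plus padding (about 24 bytes), plus a reference slot in the
    // list, all for only 8 bytes of actual data.
    static final class Record {
        final int userId;
        final int score;
        Record(int userId, int score) {
            this.userId = userId;
            this.score = score;
        }
    }

    static List<Record> buildObjects(int n) {
        List<Record> records = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            records.add(new Record(i, i % 100));
        }
        return records;
    }

    // Bloat-aware layout: two primitive arrays, about 8 bytes per record in
    // total, and only two objects for the whole data set, which also
    // relieves garbage-collection pressure.
    static int[][] buildColumns(int n) {
        int[] userIds = new int[n];
        int[] scores = new int[n];
        for (int i = 0; i < n; i++) {
            userIds[i] = i;
            scores[i] = i % 100;
        }
        return new int[][] { userIds, scores };
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        // The object layout allocates five million small objects; the
        // column layout allocates two arrays holding the same information.
        List<Record> objects = buildObjects(n);
        int[][] columns = buildColumns(n);
        System.out.println(objects.size() + " records vs. " + columns[0].length + " rows");
    }
}

In a real data-processing system the packed layout would be hidden behind accessor methods or, better, generated automatically, which is exactly where the compiler and run-time support called for in item 3 comes in; the same per-object overhead is also one of the common inefficiency patterns that matter for the memory-constrained apps discussed in item 5.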