Towards Practical Software Bloat Removal with Assurance

Scott D. Stoller and Yanhong Liu
Stony Brook University

Software bloat removal can significantly reduce software inefficiencies and vulnerabilities, and is becoming increasingly necessary for applications that are resource constrained and mission critical. For effective bloat removal in practical applications, bloat removal tools must satisfy three challenging requirements:

• Correctness. Bloat removal tools must correctly manipulate many program components, often written in different languages or available only as binaries, that interact with each other in complicated ways. In fact, a key reason that software bloat exists is the difficulty of removing it correctly.
• Productivity. Bloat removal tools need to be developed and maintained with limited time and expense, and must support reuse, improvements, modifications, and extensions. This is necessary because the software technologies that must be handled by bloat removal tools constantly evolve.
• Efficiency. Bloat removal tools need to be efficient, especially for manipulating large applications, or if used at runtime. Because bloat removal is nontrivial, repeated applications of bloat removal, testing, and problem fixing should be expected, and thus each iteration must be efficient.

These requirements apply to all critical applications, but especially to bloat removal tools, because they are aimed at manipulating all applications in nontrivial ways.

We advocate a logic-based method for building bloat removal tools, where all program information and analysis results are expressed directly as logic facts, and all program analysis and transformations are expressed declaratively as logic rules. Logic inference is used to produce the analysis results as well as transformed programs. This method provides significant advantages in addressing the challenging requirements:

• High assurance of correctness.
Logic rules and facts are the most fundamental, direct, semantic forms for expressing complex relationships and reasoning about them, for different languages at all levels (source, bytecode, binary). They make correctness of the analysis and transformations drastically easier to attain and prove than if these analyses and transformations are written as imperative code or using different frameworks.
• Significantly increased productivity. Expressing the analysis and transformations at the very high level of facts and rules is significantly easier and faster than writing low-level code or using different frameworks, and similarly better supports maintenance tasks.
• Guarantee of sufficient efficiency. The biggest challenge in the past has been efficient implementation of logic inference. However, significant progress in recent years has made such an approach feasible, e.g., [3, 1]. In fact, for Datalog rules, which are particularly suitable for complex program analysis and transformations, efficient implementations can be generated with better complexity guarantees than previously manually developed and implemented algorithms [1, 2].

Advancing the state of the art in software bloat removal using a logic-based method would also provide other significant benefits:

• Besides program information and analysis results about components written in different languages, logic facts and rules can also easily express any additional knowledge from external sources and use it for bloat removal. Also, expressing all information relationally, as facts, allows easy interfacing with many other program analysis and program verification tools, such as SMT solvers, and with other data analysis and data mining tools, including tools based on big data techniques.
• Logic-based methods and tools for analysis and transformation studied for bloat removal, especially if designed with appropriate abstractions, can provide a solid infrastructure for other program analysis and manipulations.
This is because effective bloat removal requires deep and sophisticated program flow and dependence analysis and manipulation.

References

[1] Y. A. Liu and S. D. Stoller. From Datalog rules to efficient programs with time and space guarantees. ACM Transactions on Programming Languages and Systems, 31(6):1–38, 2009.
[2] K. T. Tekle and Y. A. Liu. More efficient Datalog queries: Subsumptive tabling beats magic sets. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 661–672, 2011.
[3] J. Whaley and M. S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 131–144, 2004.
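
To make the logic-based method advocated above more concrete, the following is a minimal sketch of how program information might be written as facts and a simple bloat analysis as rules. It is an illustration only, not the authors' tooling. The function names, the call facts, and the three rules (reachable, calls, bloat) are all hypothetical, and the naive fixpoint loop stands in for the efficient generated implementations discussed in [1, 2]. A real analysis would also have to account for indirect calls, reflection, and cross-language interactions, which this sketch ignores.

```python
# Hypothetical facts about a tiny program (all names are made up for illustration).
functions = {"main", "parse", "render", "legacy_export", "legacy_helper"}
calls = {
    ("main", "parse"),
    ("main", "render"),
    ("legacy_export", "legacy_helper"),
}
entry_points = {"main"}


def reachable_functions(entry_points, calls):
    """Least fixpoint of two Datalog-style rules:

        reachable(F) :- entry(F).
        reachable(G) :- reachable(F), calls(F, G).

    This is naive evaluation: re-scan all call facts until nothing changes.
    """
    reachable = set(entry_points)
    changed = True
    while changed:
        changed = False
        for caller, callee in calls:
            if caller in reachable and callee not in reachable:
                reachable.add(callee)
                changed = True
    return reachable


def removable_functions(functions, entry_points, calls):
    """bloat(F) :- function(F), not reachable(F)."""
    return functions - reachable_functions(entry_points, calls)


if __name__ == "__main__":
    # Candidate functions for removal under the stated assumptions.
    print(sorted(removable_functions(functions, entry_points, calls)))
```

The point of the sketch is the separation it shows: the facts describe the program, the rules describe the analysis, and the evaluation strategy can be swapped out without changing either.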