BOX: Icing the APIs

Zhenyu GUO, Xi WANG*, Xuezheng LIU, Wei LIN, and Zheng ZHANG
Microsoft Research Asia
Tsinghua University*
{zhenyu.guo, xueliu, weilin, zzhang}@microsoft.com
wangxi01@mails.tsinghua.edu.cn

Abstract

This paper presents BOX, an API-centric debugging and testing platform that uses the API calling boundary as the manipulation surface to install and extend a variety of important debugging tools transparently for legacy applications. We deal with the problem of instrumenting a large number of APIs by using annotation-aware code generation, and we carefully design the runtime to eliminate interference from hosted tools with the target application. The framework is highly extensible, thanks to the signal-slot model used to process intercepted APIs. We demonstrate the power of BOX by prototyping a number of tools for monitoring, logging, dependency tracking, time-suspending-debugging and deterministic replay. We have successfully replayed several large and complex software packages, including MySQL and Apache, with low overhead. Our experience has validated the main design points of BOX.

1. Introduction

Developing, debugging and deploying a correct and high-performing system is challenging. The task is made all the more difficult by the high degree of concurrency, as such a system is typically multi-threaded and distributed. Since the end-to-end process has multiple stages, there has been a plethora of tools, each tailored to a particular aspect of the process. Yet many of these tools are still ill-suited, and they typically work in isolation.

In this paper, we propose BOX, an API-centric debugging and testing platform that enables many important debugging and testing tools under one unified framework. BOX intercepts API calls from the target application and uses them as the manipulation points. Borrowing the methodology from Qt [14], each API invocation is modeled as a signal-slot process and can therefore be transparently extended, as extensions are processed as slots in a linked list. These extensions are typically components of a tool, and they can be dynamically added or removed. The platform is implemented as a shared library that runs in the same address space as the target application. To eliminate interference from the tools with the target application's internal state while retaining the efficiency of such an in-address-space solution, the BOX runtime establishes a clean separation by using two separate subspaces, one for the target application and another for the tools. Finally, we use annotation-based code generation to deal with the large API surface on Windows, a platform on which frameworks such as BOX are known to be difficult to install. We believe that these methodologies are applicable to other platforms as well.

We choose the API as the controlling interface for a variety of reasons. API compatibility ensures that BOX-based tools can benefit all legacy applications with unmodified binary executables. APIs are also well-designed specifications of the services that the low-level system and libraries provide to support user applications. From the perspective of a developer, the implementation below an API can be considered correct and robust. While we start with system call APIs, our methodology is equally applicable to application modules or layers, to perform incremental debugging and testing as we gradually reduce the modules that are being BOX-ed. The API calling boundary is also coarse-grained, yet it does not give up transparency and control. All these make it attractive compared with fully transparent yet heavyweight approaches that use virtual machines [4] or kernel modifications [8] [9].

We envision many useful tools being developed on top of BOX. A BOX logger can log semantic-rich information, which is made available through API annotation. A BOX-based profiling and monitoring tool can optionally output the dependency flow of resources (e.g. locks), captured by an extension that tracks producer-consumer relationships. BOX can host a simulator to inject faults and study an application's sensitivity to different hardware settings. API invocation points can serve as scheduling points for model checking and performance tuning. A deterministic replay tool can be developed to root out subtle bugs that arise only when the application is executed. We believe that many useful tools covering the major development stages can be easily constructed and refined under the BOX framework, allowing the toolset to evolve and extend in response to new demands. Such attributes are missing in many existing tools, such as Strace [15] and Replay Debugger [3], that focus on certain aspects of the development and debugging process.

BOX is the latest addition to the WiDS family [6]. Using its own APIs, the WiDS toolkit can perform debugging, simulation, deterministic replay and bug checking [7] with the same code base. The value of these results compels us to propagate the WiDS methodology to legacy applications.

The rest of the paper is organized as follows. We highlight the challenges and our contributions in Section 2. Our approach is briefly described in Section 3. We go through potential BOX-based tools in Section 4, and those already prototyped in Section 5. Section 6 presents our preliminary results and we conclude in Section 7.

2. Challenges and Contributions

BOX aims to be a generic debugging, testing and tuning platform that is transparent to legacy applications without surrendering extensibility. In order to achieve these goals, we need to address a number of challenges.

Systematic instrumentation. An API-centric platform such as BOX requires extensive instrumentation of APIs. The instrumentation needs not only the prototype of each API but also more semantics, such as in/out directions, the pairing of a buffer pointer with its length, and the success predicate, to name but a few. Yet a modern OS platform has a wide API surface; on Windows, we have encountered more than seven hundred APIs for the applications that we have experimented with. Manual instrumentation is both tedious and error-prone. This is a difficult problem in general. Our first contribution is annotation-aware automatic code generation, which we believe is a step in the right direction.

Tools that need low overheads include basic monitoring and profiling. We also share the view of [3] that logging should have low overhead as well, so that it can be always on. Online checking of deployed distributed systems, an ongoing effort that continues our WiDS Checker project [7], also calls for the least possible perturbation caused by the tool itself.

3. Methodology

Annotation-aware code generation. We construct our code generator based on annotations on APIs. Annotations are concise attributes of parameters, such as in/out directions, a buffer and its paired length specification, etc. Most Windows APIs are well annotated in the Standard Annotation Language, or SAL [5] [13], in the recent Windows SDK. For those that are not, we use annotation inference tools such as SALInfer to annotate them automatically. Our code generator takes the prototype of an API along with its annotations and generates a wrapper (described shortly) that replaces the original API in the modified shared library. Extension slots are generated in the same way. However, they typically require additional annotations that need to be added manually. For instance, a block directive indicates that the API may block the current thread, making it a candidate for scheduling.

Controlled isolation. When an API is invoked, control transfers into the BOX runtime. An efficient implementation, such as the one that we adopted, runs whatever tools BOX hosts in the same address space. Such an in-address-space solution is known to be vulnerable in that the state and logic of the tools themselves can interfere with the application that the tools are applied to.
One example is memory management for the log and replay tool. Any address in the application can potentially be part of the application's internal state. Inconsistent memory footprints across different runs are one major problem that plagues earlier attempts. Addressing this problem requires careful delineation of the application state machine from the rest of the environment that it depends on. Our second contribution is the technique to perform such separation, which enables us to log and replay MySQL, Apache and a variety of other popular and complex software packages.

BOX runtime. One fundamental design decision is that the BOX runtime separates the execution of the target application into two disjoint subspaces, system space and application space. An intercepted API can be called from both subspaces. The wrapper code (Figure 1) distinguishes the source of the invocation via a thread-specific flag (line 3). Invocations from system space are dispatched to the native implementation of the API (line 4). When application space invokes an intercepted API, the thread switches to system space (line 6), and the execution is dispatched to an extension that we call SignalEx (line 7). When the execution of the SignalEx object finishes, the thread switches back to application space (line 8).

     1  int __wrapper_foo(int param1, int param2)
     2  {
     3      if (IsThreadInSystemSpace())
     4          return CallNativeFoo(param1, param2);
     5      else {
     6          SetThreadInSystemSpace();
     7          int retVal = SignalExOfFoo.Execute(param1, param2);
     8          SetThreadInApplicationSpace();
     9          return retVal;
    10      }
    11  }

Figure 1 Pseudo code of a wrapper

Extensibility. Developing and debugging a complex system takes many steps, each of which may require different tools. While each of these needs can be (and sometimes has been) addressed by an individual tool, what is called for is a platform on which different debugging tools can be easily developed and, ideally, can reuse components of earlier tools.
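To make the interplay between the wrapper of Figure 1 and an extensible slot list concrete, the following is a minimal single-file sketch, not the BOX implementation. The names g_inSystemSpace, NativeFoo, SignalExFoo and WrapperFoo are hypothetical, and passing each slot the result produced so far is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>

// Thread-specific flag that distinguishes system space from application
// space (the role of IsThreadInSystemSpace() in Figure 1). Hypothetical name.
thread_local bool g_inSystemSpace = false;

// Toy stand-in for the native implementation of an intercepted API.
int NativeFoo(int param1, int param2) { return param1 + param2; }

// A signal whose slots run one by one. Passing each slot the result produced
// so far is an assumption of this sketch, not something the paper specifies.
class SignalExFoo {
public:
    using Slot = std::function<int(int param1, int param2, int resultSoFar)>;

    SignalExFoo() {
        // By default the list holds a single slot that calls the native API.
        slots_.push_back([](int a, int b, int) { return NativeFoo(a, b); });
    }

    // Extensions can be attached before or after the slots already present.
    void AddSlotFront(Slot slot) { slots_.push_front(std::move(slot)); }
    void AddSlotBack(Slot slot) { slots_.push_back(std::move(slot)); }

    int Execute(int param1, int param2) const {
        int result = 0;
        for (const Slot& slot : slots_) result = slot(param1, param2, result);
        return result;
    }

private:
    std::list<Slot> slots_;
};

SignalExFoo g_signalExOfFoo;

// Mirrors the control flow of Figure 1.
int WrapperFoo(int param1, int param2) {
    if (g_inSystemSpace) return NativeFoo(param1, param2);  // call came from a tool
    g_inSystemSpace = true;                                  // enter system space
    int retVal = g_signalExOfFoo.Execute(param1, param2);
    assert(g_inSystemSpace);  // slots must not leave system space on their own
    g_inSystemSpace = false;                                 // back to application space
    return retVal;
}
```

In this sketch, a slot that itself calls WrapperFoo takes the system-space branch and reaches NativeFoo directly, so a tool can use an intercepted API without re-triggering its own extensions.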
The third contribution of our work is our technique to achieve such flexibility, demonstrated by a number of tools that we have already prototyped.

Low overhead (when required). Debugging and developing a complex system is a multi-stage process. As such, some of the steps can tolerate larger overheads than others. However, some of the tools need to have low overheads.

A SignalEx object treats an API invocation as a signal-slot process. In such a process, a signal is an event publisher that is connected to several slots kept in a linked list, and each slot is an event subscriber. When the event triggers, the event subscribers are executed one by one. The linked list can be dynamically reconfigured. By default, there is one native slot in the linked list, which is a reference to the native implementation of the API.

In a nutshell, space separation deals with the challenge of controlled isolation, and the signal-slot mechanism is how the framework provides extensibility.

Building tools in BOX. The process of adding and refining BOX-based tools is straightforward. First, the prototypes of the APIs that should be BOX-ed need to be prepared; these are typically supplied by header files or obtained with binary dumping tools. Then, the APIs are annotated, either with annotation inference tools or by hand. Next, our annotation-aware code generator produces code snippets as slots, and the templates of these slots can be modified if necessary to fit specific requirements. Finally, the generated code is compiled and automatically inserted into the corresponding APIs' signal-slot linked lists. This prepares the shared library that the tool is embedded in.

4. Tool Examples

By containing the state machine of an application entirely within its API calling boundaries, BOX establishes a lightweight runtime environment that transparently enables many useful debugging and tuning tools. These tools are all built upon components that are themselves developed using the SignalEx API extension model. We broadly classify such tools into the following categories, roughly aligned with the development process. Some of the implemented tools are described in the next section.

Monitoring and profiling. The first set of tools performs lightweight, non-intrusive monitoring while the application runs. Traditional profiling tools are already useful for presenting a bird's-eye view of program execution, giving the call graph and timing distribution. A BOX-based runtime monitoring and profiling tool extends them in several important aspects. First, the logger can be semantic-rich. Second, profiling can contain dependency flow, showing how system resources (e.g. locks) are used. The Logger and depTracker are exemplary tools for these two aspects, as we will discuss in the next section.

Debugging and checking. BOX virtualizes the runtime environment, and virtualization is a powerful tool. Network virtualization, for instance, allows a distributed system to be debugged on one physical machine. It is now possible to investigate a program's sensitivity to different hardware settings (e.g. slow/fast network/storage I/O subsystems). We can inject faults and selectively delay certain APIs (e.g. network I/O) to exercise different code paths, look for corner cases and investigate odd performance problems. The next section describes a simple debugging utility that we have developed by virtualizing time.

Finding corner-case bugs in the presence of nondeterminism is one of the most tedious debugging tasks. One approach to deal with them is deterministic replay. In the next section, we will describe such a tool that works for multithreaded and distributed applications. An alternative is to exhaustively explore the state space by model checking the application [2]; this can be accommodated by exposing scheduling points to the model checker.

Performance tuning. As BOX can subsume part of the runtime responsibilities, runtime optimization is also possible. One particular example is to control how many threads an application should fire, which is a tradeoff between concurrency and overhead. This is possible by inserting a customized scheduler as a SignalEx slot.

There are other applications of BOX that fall outside the scope of debugging and tuning but can nonetheless be quite useful. For instance, as BOX gains control of various I/O APIs, it is possible to perform on-demand virus scans. We can duplicate storage I/O requests transparently to backup services, as an extension of the BOX-based logging facility. For APIs that mutate system state (e.g. the registry and the file system), the updates can be redirected to establish a per-process virtual machine, as is done in the Featherweight Virtual Machine [9].

5. Implemented Tools

5.1 depTracker

For many critical resources such as files, sockets, and various synchronization objects, depTracker tracks their life-spans as well as producer-consumer relationships as related APIs manipulate such resources. The information is made available for the Logger to persist to logs. In order to do that, the code generator produces a tracking slot in the SignalEx object for each API that manipulates such resources. The annotation covers the handles involved in an API and their operation types, including allocation, close and access. The BOX runtime allocates a shadow memory block behind each resource handle to store the resource state and the signature and time of the last access operation, updating them appropriately when APIs accessing these handles are invoked. We have used depTracker to isolate a subtle bug that arises from an incorrect assumption about socket lifetime.

depTracker tracks not only intra-process dependencies but also inter-process dependencies, by forwarding logical clock values associated with socket handles to different processes.
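The bookkeeping described for depTracker can be pictured with the following minimal sketch. It assumes integer handles, a per-process logical clock and the usual increment-on-event, maximum-on-receive clock rules. The DepTracker class, the Shadow layout and the OnSend/OnReceive hooks are hypothetical names, and how the clock value actually travels between processes is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

enum class HandleState { Open, Closed };

// Shadow record kept behind one resource handle (hypothetical layout).
struct Shadow {
    HandleState state = HandleState::Open;
    std::string lastOp;          // signature of the last operation on the handle
    std::uint64_t lastTime = 0;  // logical time of that operation
};

class DepTracker {
public:
    using Handle = int;

    // Allocation creates (or resets) the shadow record of the handle.
    void OnAllocate(Handle h, const std::string& api) {
        Shadow& s = shadows_[h];
        s.state = HandleState::Open;
        s.lastOp = api;
        s.lastTime = Tick();
    }

    void OnAccess(Handle h, const std::string& api) {
        Shadow& s = Lookup(h);
        s.lastOp = api;
        s.lastTime = Tick();
    }

    void OnClose(Handle h, const std::string& api) {
        Shadow& s = Lookup(h);
        s.state = HandleState::Closed;
        s.lastOp = api;
        s.lastTime = Tick();
    }

    // Sending on a socket is an access; the returned clock value is what
    // would be forwarded to the receiving process.
    std::uint64_t OnSend(Handle socket, const std::string& api) {
        OnAccess(socket, api);
        return clock_;
    }

    // Receiving first merges the sender's clock, then records the access,
    // so the receive is ordered after the matching send.
    void OnReceive(Handle socket, const std::string& api, std::uint64_t senderClock) {
        clock_ = std::max(clock_, senderClock);
        OnAccess(socket, api);
    }

    const Shadow& Get(Handle h) const { return shadows_.at(h); }
    std::uint64_t Clock() const { return clock_; }

private:
    std::uint64_t Tick() { return ++clock_; }

    Shadow& Lookup(Handle h) {
        auto it = shadows_.find(h);
        assert(it != shadows_.end());  // the handle must have been allocated
        return it->second;
    }

    std::unordered_map<Handle, Shadow> shadows_;
    std::uint64_t clock_ = 0;
};
```

With two DepTracker instances standing in for two processes, passing the value returned by OnSend in one to OnReceive in the other gives the receive a strictly larger logical time than the send.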
This implements a Lamport clock, so that a partial order among processes communicating over sockets can be established. The mechanism is similar to [3], and is completely transparent to applications by using LSP (Layered Service Provider) from Windows [11].

During replay, the feeder reads from the log files to generate inputs to the replayed application. Applications on Windows typically make extensive use of asynchronous APIs such as IOCP (I/O Completion Port) to improve performance. For asynchronous APIs, the logger records the completion points, which the feeder uses to present the results correctly. We use our customized scheduler with one runnable token so that only one thread is running at a time. While this limits concurrency, it completely avoids data races that result from conflicting memory accesses occurring outside the protection of synchronization objects. This is the same approach as adopted in [3]; alternatives such as logging memory accesses [1] are known to be too expensive.

5.2 Logger

The BOX logger offers semantic-rich logging capability. Since APIs are annotated, the log contains information such as in/out directions and the pairing of a buffer pointer with its length. If an API returns a name, the logger presents it as a text string rather than a data pointer. These semantic-rich logs are very useful in aiding the debugging effort.

To make logging efficient, log files are per thread, so there is no contention among threads. We use a pair of memory buffers for each thread. The logger inserts a log slot into the APIs that we are interested in logging. The slot writes logs into the primary buffer, while the secondary buffer is flushed to the log file asynchronously. Once the secondary buffer has finished its I/O, it is ready to switch roles with the primary buffer. This process stalls only if the primary buffer is full and the secondary buffer has not finished its I/O. With sufficiently large buffers (e.g. 64KB), such cases rarely happen.
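The role switching between the two per-thread buffers can be sketched as follows. This is a single-threaded model under stated assumptions, not the logger's code: the asynchronous writer is left out, and its completion is represented by a hypothetical OnFlushComplete call. DoubleBufferLog, TryAppend and PendingChunk are illustrative names as well.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Per-thread pair of log buffers. The primary receives new records; the
// secondary holds the chunk that a background writer is persisting.
class DoubleBufferLog {
public:
    explicit DoubleBufferLog(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Appends a record to the primary buffer. Returns false when the caller
    // would have to stall: the primary is full and the secondary has not
    // finished its I/O yet.
    bool TryAppend(const std::string& record) {
        if (primary_.size() + record.size() > capacity_) {
            if (flushInProgress_) return false;
            SwapRoles();
        }
        primary_ += record;
        return true;
    }

    bool FlushInProgress() const { return flushInProgress_; }

    // The chunk that the background writer should currently be persisting.
    const std::string& PendingChunk() const { return secondary_; }

    // Called once the write of PendingChunk() has completed; the secondary
    // is then ready to switch roles with the primary again.
    void OnFlushComplete() {
        secondary_.clear();
        flushInProgress_ = false;
    }

private:
    // The full primary becomes the secondary (to be flushed), and the empty
    // secondary becomes the new primary.
    void SwapRoles() {
        primary_.swap(secondary_);
        flushInProgress_ = !secondary_.empty();
    }

    std::size_t capacity_;
    std::string primary_;
    std::string secondary_;
    bool flushInProgress_ = false;
};
```

A caller that gets false from TryAppend would wait for the writer and retry. With a capacity that is large relative to the write latency, that branch should rarely be taken, which matches the observation about 64KB buffers.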
5.3 Customizable scheduler

We have developed a customized scheduler that plugs into the SignalEx of an API extension as a slot. This scheduler gives away n tokens and lets the runnable threads compete for them; a thread that owns a token runs until the next scheduling point. As such, one limitation is that the application should not have a tight loop whose termination depends on a global variable that is set only by another thread. It is easy to see that changing n effectively sizes the number of concurrent threads in the system. When n is one, this becomes a traditional cooperative scheduler, which we use in log and replay.

5.4 Replay

Many subtle bugs surface only when an application is actually executed. Deterministic replay builds a time machine and thus proactively exerts control over the non-deterministic factors in the original run, making it possible to check those bugs and to further identify their root causes. We have built a log and replay tool that works for multithreaded and distributed applications.

Our replay facility builds upon the logger, the customized scheduler and the part of depTracker that implements the Lamport clock, and relies on one more component called the feeder. The feeder replaces the native API at replay with a feed slot. It does the reverse of the logger, using essentially the same dual-buffer mechanism.

We should point out that separating the application execution into the BOX application and system subspaces is critical to making replay possible. This gives us the option to implement two heap managers for the two subspaces. These heap managers are called by the application and the tools separately, they do not interfere with each other, and they have no random factors, so the same request sequence will produce the same address footprint. This is one reason that we are able to replay many large and complex applications, as reported in Section 6. Inconsistent memory footprints are one major problem that plagues earlier attempts [3].

5.5 Time-Suspending-Debugging

Hijacking the APIs gives us an opportunity to present different time illusions to the applications. We have built a distributed system debugging utility called Time-Suspending-Debugging (TSD). Under TSD, the time presented to the application is logical, and this logical clock can be paused. This allows a developer to virtually freeze an application in the middle of a run, effectively establishing a distributed breakpoint. What is important is that all time-sensitive APIs that have already been fired, such as sleep and various timers, will correctly skip the pause duration.

6. Preliminary Results

In this section, we give some preliminary results on the BOX platform and the tools on top of it. The experiments were all performed on machines with a 2.8GHz Pentium 4 CPU and 512 MB of memory.

Code generation. Table 1 lists the statistics of the APIs that BOX has intercepted for all the experiments here, which are spread across five different modules. The table also lists the number of APIs that are asynchronous and the number annotated to track dependencies. The replay column shows the subset of APIs that our replay tool logs and replays; the rest (e.g. thread creation and console output) are natively executed during replay. Also shown are the average number of parameters and, among them, those annotated with in, out, and opt.

module     API   async  tracker  replay  avg params  avg in  avg out  avg opt
ntdll       62     6      39       35      4.34       3.58    0.87     1.03
kernel32   509     0      98      442      2.62       2.15    0.58     0.05
advapi32    78     0       6       78      4.19       3.29    1.18     0.82
ws2_32      95     5      51       95      3.31       2.76    0.88     0.12
mswsock      7     5       6        7      6.29       5.29    1.29     0.14
Total      751    16     200      657      3.05       2.50    0.71     0.22

Table 1 Statistics of code generation

Wrapper performance. For every intercepted API, a wrapper takes charge of space dispatching and signal-slot processing. On the x86 platform, the wrapper and the signal-slot processing produce 38 and 67 instructions on average, respectively. We can see that the wrapper overhead for every API is fairly low.

Apache. We test and replay the Apache server to evaluate the overall performance of our platform. We install the latest prebuilt version of the Apache server (without SSL) from [10] and test it with the default configuration (250 threads per child) under three different platform configurations. We put an HTML file of 64KB on the server and download it with the ApacheBench tool shipped in the same install package. Figure 2 gives the overall performance for varying numbers of concurrent clients. The monitor configuration has only the native slot in the SignalEx object and shows an average slowdown of 2.6%. The log configuration includes both the cooperative scheduler and the BOX logger, and its average slowdown is 6.0%.

[Figure 2: requests per second (y-axis) against concurrency level (x-axis: 1, 2, 5, 10, 20, 50) for the native, monitor and log configurations]

Figure 2 Overall performance on Apache

We also experimented with the MySQL server [12], using the SQL benchmark shipped in the same install package, and it is again successfully replayed. The result is approximately the same as for Apache: the slowdown is about 2.4% for monitor and 6.9% for log.

Both Apache and MySQL can be successfully replayed based on the logs, and replay generally costs less time because all timeouts and thread blocking are removed. Other packages that we have replayed include wget and the Lua interpreter, as well as two distributed systems of ours that are under development. One of them is a semi-structured storage system and the other is a social network computation engine. Our experience is that as long as the APIs are completely and correctly annotated, replay is straightforward.

7. Conclusion

The design principle of BOX is based on our conviction that the API is the most natural boundary at which to install various debugging tools. This choice gives us legacy application compatibility without giving up control. The challenges are to deal with the large API surface that needs to be systematically instrumented, to exert controlled isolation to eliminate interference of tools with the target applications, and to provide sufficient extensibility so that tools can be incrementally developed and evolved. We have addressed each of these challenges, and our experience so far has validated this choice.

Our future work will continue on a number of fronts. We will continue to instrument more APIs for completeness, develop more BOX-based tools, and combine BOX with our ongoing online bug checking research.

References

[1] S. Bhansali, W. Chen, S. d. Jong, A. Edwards, R. Murray, M. Drinic, D. Mihocka, and J. Chau, "Framework for Instruction-level Tracing and Analysis of Program Executions", Second International Conference on Virtual Execution Environments, June 2006
[2] H. Chen, D. Dean, and D. Wagner, "Model checking one million lines of C code", In Symposium on Network and Distributed Systems Security, 2004
[3] D. Geels, G. Altekar, S. Shenker, and I. Stoica, "Replay Debugging for Distributed Applications", USENIX Annual Technical Conference, May 2006
[4] T. Garfinkel and M. Rosenblum, "A Virtual Machine Introspection Based Architecture for Intrusion Detection", Proceedings of Network and Distributed Systems Security Symposium, 2003
[5] B. Hackett, M. Das, D. Wang, and Z. Yang, "Modular checking for buffer overflows in the large", 28th International Conference on Software Engineering, May 2006
[6] S. Lin, A. Pan, Z. Zhang, R. Guo, and Z. Guo, "WiDS: An Integrated Toolkit for Distributed System Development", Tenth Workshop on Hot Topics in Operating Systems, June 2005
[7] X. Liu, W. Lin, A. Pan and Z. Zhang, "WiDS Checker: Combating Bugs in Distributed Systems", to appear in 4th Symposium on Networked Systems Design and Implementation, April 2007
[8] S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou, "Flashback: A lightweight extension for rollback and deterministic replay for software debugging", In USENIX Annual Technical Conference, General Track, pages 29–44, 2004
[9] Y. Yu, F. Guo, S. Nanda, L. Lam, and T. Chiueh, "A Feather-weight Virtual Machine for Windows Applications", VEE, Ottawa, June 2006
[10] Apache 2.2.3, http://httpd.apache.org/download.cgi
[11] Layered Protocols and Provider Chains, http://msdn2.microsoft.com/en-us/library/aa925739.aspx
[12] MySQL-5.0.27, http://www.mysql.org/downloads/mysql/5.0.html
[13] SAL Annotations, http://msdn2.microsoft.com/en-us/library/ms235402(vs.80).aspx
[14] Signals and Slots, http://en.wikipedia.org/wiki/Signals_and_slots
[15] Strace, http://sourceforge.net/projects/strace/