tr-2008-03

BOX: Icing the APIs
Zhenyu GUO, Xi WANG*, Xuezheng LIU, Wei LIN, and Zheng ZHANG
Microsoft Research Asia
Tsinghua University*
{zhenyu.guo, xueliu, weilin, zzhang}@microsoft.com wangxi01@mails.tsinghua.edu.cn
We choose the API as the controlling interface for a variety
of reasons. API compatibility ensures that BOX-based
tools can benefit all legacy applications with unmodified
binary executables. APIs are also well-designed
specifications of the services that the low-level system and
libraries provide to support user applications. From the
perspective of a developer, the implementation below
an API can be considered correct and robust. While
we start with system call APIs, our methodology is
equally applicable to application modules or layers to
perform incremental debugging and testing, as we
gradually reduce the modules that are being BOX-ed. The
API calling boundary is also coarse-grained while not
giving up transparency and control. All these make it
attractive compared with fully transparent yet too
heavyweight approaches that use virtual machines [4]
or kernel modifications [8] [9].
Abstract
This paper presents BOX, an API-centric debugging and testing
platform that uses the API calling boundary as the manipulation
surface to install and extend a variety of important debugging
tools transparently for legacy applications. We deal
with the problem of instrumenting a large number of APIs by
using annotation-aware code generation; we carefully design
the runtime to eliminate interference from hosted tools with the
target application. The framework is highly extensible, thanks
to the signal-slot model used to process intercepted APIs. We
demonstrate the power of BOX by prototyping a number of
tools, including monitoring, logging, dependency tracking,
time-suspending-debugging and deterministic replay. We
have successfully replayed several large and complex software
packages, including MySQL and Apache, with low overhead.
Our experience has validated the main design points of
BOX.
1. Introduction
We envision many useful tools being developed on top
of BOX. A BOX logger can log semantic-rich information,
which is made available through API annotation.
A BOX-based profiling and monitoring tool can optionally
output the dependency flow of resources (e.g. locks)
captured by an extension that tracks producer-consumer
relationships. BOX can host a simulator to inject faults
and study an application's sensitivity to different hardware
settings. API invocation points can serve as scheduling
points for model checking and performance tuning.
A deterministic replay tool can be developed to root out
subtle bugs that arise only when the application is executed.
We believe that many useful tools covering the major
development stages can be easily constructed and
refined under the BOX framework, allowing the toolset
to evolve and extend in response to new demands. Such
attributes are missing in many existing tools, such as
Strace [15] and Replay Debugger [3], that focus on certain
aspects of the development and debugging process.
Developing, debugging and deploying correct and high-performing
systems is challenging. The task is made all
the more difficult by the high degree of concurrency,
as such systems are typically multi-threaded and distributed.
Since the end-to-end process has multiple stages, there
has been a plethora of tools, each tailored to a
particular aspect of the process. Yet many of the tools
are still ill-suited, and they typically work in isolation.
In this paper, we propose BOX, an API-centric debugging
and testing platform that enables many important
debugging and testing tools under one unified framework.
BOX intercepts API calls from the target application
and uses them as the manipulation points. Borrowing the
methodology from Qt [14], each API invocation is modeled as a
signal-slot process, and can therefore be transparently
extended, as extensions are processed as slots in a linked
list. These extensions are typically components of a tool,
and they can be dynamically added or removed. The
platform is implemented as a shared library that runs in
the same address space as the target application. To
eliminate interference from the tools with the target
application's internal state while retaining the efficiency
of such an in-address solution, the BOX runtime establishes
clean separation by using two separate subspaces,
one for the target application and another for the tools.
Finally, we use annotation-based code generation to
deal with the large API surface on Windows, a platform
on which a framework such as BOX is known to be difficult
to install. We believe that these methodologies are
applicable to other platforms as well.
BOX is the latest addition to the WiDS family [6]. Using its own APIs, the WiDS toolkit can perform debugging, simulation, deterministic replay and bug checking
[7] with the same code base. The value of these results
compels us to propagate the WiDS methodology to
legacy applications.
The rest of the paper is organized as follows. We highlight the challenges and our contributions in Section 2.
Our approach is briefly described in Section 3. We go
through potential BOX-based tools in Section 4, and
those already prototyped in Section 5. Section 6 presents
our preliminary results and we conclude in Section 7.
low overheads. Examples include basic monitoring and
profiling. We also share the view of [3] that logging
should have low overhead as well, so that it can
be always on. Online checking of deployed distributed
systems, an ongoing effort that continues our WiDS
Checker project [7], also requires the least possible
perturbation from the tool itself.
2. Challenges and Contributions
BOX aims to be a generic debugging, testing and tuning
platform transparent to legacy applications without surrendering extensibility. In order to achieve these goals,
we need to address a number of challenges.
3. Methodology
Systematic instrumentation. An API-centric platform
such as BOX requires extensive instrumentation of
APIs. The instrumentation needs not only the prototype
of each API, but also more semantics such as in/out, the
pairing of a buffer pointer and its length, and the success
predicate, to name but a few. Yet a modern OS platform has a
wide API surface; on Windows, we have encountered
more than seven hundred APIs for the applications that
we have experimented with. Manual instrumentation
is both tedious and error-prone. This is a difficult problem
in general. Our first contribution is annotation-aware
automatic code generation, which we believe is a
step in the right direction.
Annotation-aware code generation. We construct our
code generator based on annotations on APIs. Annotations
are concise attributes of parameters, such as in/out,
a buffer and its paired length specification, etc. Most
Windows APIs are well annotated in the Standard Annotation
Language, or SAL [5] [13], in recent Windows
SDKs. For those that are not, we use annotation
inference tools such as SALInfer to annotate them
automatically.
Our code generator takes the prototype of an API along
with its annotations, and generates a wrapper (described
shortly) that replaces the original API in the modified
shared library. Extension slots are generated in the
same way. However, they typically require additional
annotations that need to be added manually. For
instance, a block directive indicates that the API may
block the current thread, making it a candidate for
scheduling.
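As an illustration of what annotation-aware generation might look like, the following C++ sketch emits the source text of a wrapper from an annotated prototype. The ParamSpec and ApiSpec structures and the GenerateWrapper function are hypothetical names introduced here for illustration; the actual input format and output of our generator are not described in this paper, and the sketch assumes a non-void return type.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one annotated parameter.
struct ParamSpec {
    std::string type;        // e.g. "int"
    std::string name;        // e.g. "param1"
    std::string annotation;  // e.g. "in", "out", "opt"
};

// Hypothetical description of an API prototype plus its annotations.
struct ApiSpec {
    std::string returnType;  // assumed to be non-void in this sketch
    std::string name;
    std::vector<ParamSpec> params;
};

// Emits the text of a wrapper shaped like the one in Figure 1.
std::string GenerateWrapper(const ApiSpec& api) {
    std::ostringstream decl, args;
    for (std::size_t i = 0; i < api.params.size(); ++i) {
        const ParamSpec& p = api.params[i];
        if (i > 0) { decl << ", "; args << ", "; }
        // Keep the annotation as a comment next to the parameter.
        decl << "/*" << p.annotation << "*/ " << p.type << " " << p.name;
        args << p.name;
    }
    std::ostringstream out;
    out << api.returnType << " __wrapper_" << api.name
        << "(" << decl.str() << ")\n"
        << "{\n"
        << "    if (IsThreadInSystemSpace())\n"
        << "        return CallNative_" << api.name
        << "(" << args.str() << ");\n"
        << "    SetThreadInSystemSpace();\n"
        << "    " << api.returnType << " retVal = SignalExOf_" << api.name
        << ".Execute(" << args.str() << ");\n"
        << "    SetThreadInApplicationSpace();\n"
        << "    return retVal;\n"
        << "}\n";
    return out.str();
}
```

Feeding this function the prototype of foo with two in parameters yields text with the same control flow as the wrapper in Figure 1. A real generator would also consult the annotations when emitting extension slots, for example to log an out buffer together with its paired length.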
Controlled isolation. When an API is invoked,
control transfers into the BOX runtime. An efficient
implementation, such as the one we adopted, runs whatever
tools BOX hosts in the same address space as the application.
Such an in-address solution is known to be vulnerable to
interference: the state and logic of the tools themselves
can perturb the application that the tools are
applied to. One example is memory management for
a log and replay tool. Any address in the application can
potentially be part of the application's internal state.
Inconsistent memory footprints across different runs are one
major problem that plagues earlier attempts.
1  int __wrapper_foo(int param1, int param2)
2  {
3      if (IsThreadInSystemSpace())
4          return CallNativeFoo(param1, param2);
5      else {
6          SetThreadInSystemSpace();
7          int retVal = SignalExOfFoo.Execute(param1, param2);
8          SetThreadInApplicationSpace();
9          return retVal;
10     }
11 }
Addressing the above problem requires careful delineation
of the application state machine and the rest of the
environment that it depends on. Our second contribution is
the technique to perform such separation, which enables
us to log and replay MySQL, Apache and a variety of
other popular and complex software packages.
Figure 1 Pseudo code of a wrapper
BOX runtime. One fundamental design decision is that
the BOX runtime separates the execution of the target
application into two disjoint subspaces, system space and
application space. An API, if intercepted, can be called
from both subspaces. The wrapper code (Figure 1) correctly distinguishes the source of the invocation via a
thread-specific flag (line 3). Invocations from system
space of the API are dispatched to the native implementation of the API (line 4). When the application space
invokes an intercepted API, the thread switches to system space (line 6), and the execution is dispatched to an
extension which we call SignalEx (line 7). When the
execution of a SignalEx object is finished, the thread
switches back to application space (line 8).
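The dispatch described above can be sketched as compilable C++. This is a minimal illustration under stated assumptions, not the BOX implementation: the thread-specific flag is modeled with a thread_local variable, NativeFoo stands in for the native API, and g_signalExOfFoo is a hypothetical replaceable hook standing in for the SignalEx object.

```cpp
#include <functional>

// Thread-specific flag: true while this thread runs in system space.
thread_local bool g_inSystemSpace = false;

// Hypothetical stand-in for the native implementation of the API.
int NativeFoo(int a, int b) { return a + b; }

// Hypothetical stand-in for SignalExOfFoo; by default it only calls
// the native implementation.
std::function<int(int, int)> g_signalExOfFoo =
    [](int a, int b) { return NativeFoo(a, b); };

int WrapperFoo(int a, int b) {
    if (g_inSystemSpace) {
        // Called from a tool (system space): bypass all extensions.
        return NativeFoo(a, b);
    }
    g_inSystemSpace = true;            // enter system space
    int retVal = g_signalExOfFoo(a, b);
    g_inSystemSpace = false;           // back to application space
    return retVal;
}
```

The point of the flag is re-entrancy: if an extension itself calls the intercepted API while handling a call, that inner call goes straight to the native implementation instead of being processed by the extensions again.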
Extensibility. Developing and debugging a complex
system takes many steps, each of which may require
different tools. While each of these needs can be (and
sometimes has been) addressed by an individual tool,
what is called for is a platform on which different
debugging tools can be easily developed and, ideally,
reuse components of earlier tools. The third contribution
of our work is our technique to achieve such flexibility,
demonstrated by a number of tools that we have already
prototyped.
Low overhead (when required). Debugging and developing a complex system is a multi-stage process. As
such, some of the steps can tolerate larger overheads
than others. However, some of the tools need to have
A SignalEx object treats an API invocation as a signal-slot
process. In such a process, a signal is an event publisher,
which is connected to several slots kept in a
linked list, and each slot is an event subscriber. As the
event triggers, the event subscribers are executed one
by one. The linked list can be dynamically reconfigured.
By default, there is one native slot in the linked list
which is a reference to the native implementation of the
API.
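A signal with a reconfigurable slot list can be sketched as follows. The class below is a hypothetical illustration rather than the BOX SignalEx itself; in particular, it assumes that extension slots are inserted in front of the native slot and that the value returned by the last slot in the list is the result of the call, neither of which is specified in this paper.

```cpp
#include <functional>
#include <list>
#include <utility>

// Sketch of a signal whose slots (subscribers) live in a linked list.
template <typename R, typename... Args>
class SignalEx {
public:
    using Slot = std::function<R(Args...)>;
    using SlotId = typename std::list<Slot>::iterator;

    // The list starts with one native slot, as described in the text.
    explicit SignalEx(Slot nativeSlot) {
        slots_.push_back(std::move(nativeSlot));
    }

    // Adds an extension slot ahead of the existing slots.
    SlotId AddSlotFront(Slot slot) {
        slots_.push_front(std::move(slot));
        return slots_.begin();
    }

    // Removes a previously added slot; list iterators stay valid
    // across insertions, so the id can be kept by the tool.
    void RemoveSlot(SlotId id) { slots_.erase(id); }

    // Runs every slot in list order. Assumption of this sketch: the
    // result of the last slot is returned to the caller.
    R Execute(Args... args) {
        R result{};
        for (auto& slot : slots_) result = slot(args...);
        return result;
    }

private:
    std::list<Slot> slots_;
};
```

A tool would call AddSlotFront when it is attached and RemoveSlot when it is detached, which is one way to realize the dynamic reconfiguration mentioned above.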
section describes a simple debugging utility that we
have developed by virtualizing time.
Finding corner-case bugs in the presence of non-determinism
is one of the most tedious debugging
tasks. One approach to deal with them is deterministic
replay. In the next section, we will describe such a tool that
works for multithreaded and distributed applications.
An alternative is to exhaustively explore the state
space by model checking the application [2]; this can be
accommodated by exposing scheduling points to the
model checker.
In a nutshell, space separation deals with the challenge
of controlled isolation, and the signal-slot mechanism is
how the framework provides extensibility.
Building tools in BOX. The process of adding and
refining BOX-based tools is straightforward. First, the
prototypes of the APIs that should be BOX-ed need to be
prepared; these are typically obtained from header files
or by using binary dumping tools. Then, the APIs are
annotated, either with annotation inference tools or by
hand. Next, our annotation-aware code generator produces
code snippets as slots, and the templates of these
slots can be modified if necessary to fit specific
requirements. Finally, the generated code is compiled
and automatically inserted into the signal-slot linked
lists of the corresponding APIs. This prepares the shared
library that the tool is embedded in.
Performance tuning. As BOX can subsume part of the
runtime responsibilities, runtime optimization is also
possible. One particular example is controlling how
many threads an application should run concurrently,
which is a tradeoff between concurrency and overhead.
This is possible by inserting a customized scheduler as
a SignalEx slot.
There are other applications of BOX that fall outside the
scope of debugging and tuning, but they can nonetheless
be quite useful. For instance, as BOX gains control
of various I/O APIs, it is possible to perform on-demand
virus scans. We can duplicate storage I/O requests
transparently to backup services, as an extension
of the BOX-based logging facility. For APIs that mutate
system state (e.g. registry and file system), the updates
can be redirected to establish a per-process
virtual machine, as is done in the Featherweight Virtual
Machine [9].
4. Tool Examples
By containing the state machine of an application entirely within its API calling boundaries, BOX establishes a lightweight runtime environment to transparently
enable many useful debugging and tuning tools. These
tools are all built upon components that are themselves
developed using the SignalEx API extension model. We
broadly classify such tools into the following categories,
roughly aligned with the development process. Some of
the implemented tools are described in the next section.
5. Implemented Tools
5.1 depTracker
For many critical resources such as files, sockets, and
various synchronization objects, depTracker tracks their
life-spans as well as producer-consumer relationships as
related APIs manipulate such resources. The information
is made available for the Logger to persist to
logs. To do that, the code generator produces a
tracking slot in the SignalEx object of each API that
manipulates such resources. The annotation covers the
handles involved in an API and their operation types,
including allocation, close and access. The BOX
runtime allocates a shadow memory block behind each
resource handle to store the resource state and the
signature and time of the last access operation, updating
it appropriately when APIs accessing these handles are invoked.
We have used depTracker to isolate a subtle bug that
arises from an incorrect assumption about socket lifetime.
Monitoring and profiling. The first set of tools performs
lightweight, non-intrusive monitoring while the
application runs. Traditional profiling tools are already
useful for presenting a bird's eye view of program execution,
giving the call graph and timing distribution. A BOX-based
runtime monitoring and profiling tool extends this in
several important respects. First, the logger can be
semantic-rich. Second, profiling can include the dependency
flow of how system resources (e.g. locks) are used. The
Logger and depTracker are exemplary tools for these
two aspects, as we will discuss in the next section.
Debugging and checking. BOX virtualizes the runtime
environment, and virtualization is a powerful tool.
Network virtualization, for instance, allows a distributed system to be debugged on one physical machine. It is
now possible to investigate the program’s sensitivity to
different hardware settings (e.g. slow/fast network/storage I/O subsystems). We can inject faults and
selectively delay certain APIs (e.g. network I/O) to exercise different code paths and look for corner cases
and investigate odd performance problems. The next
depTracker tracks not only intra-process dependencies
but also inter-process dependencies, by forwarding logical
clock values associated with socket handles to different
processes. This implements a Lamport clock, so
that a partial order among processes communicating
through sockets can be established. The mechanism is similar
to [3], and is completely transparent to applications
by using LSP (Layered Service Provider) from Windows [11].
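The clock rule behind this mechanism can be sketched in a few lines of C++. The class and method names are illustrative only; how depTracker attaches the clock value to socket traffic through the LSP is not shown here.

```cpp
#include <algorithm>
#include <cstdint>

// Minimal Lamport logical clock, one instance per process.
class LamportClock {
public:
    // A local event advances the clock by one.
    std::uint64_t Tick() { return ++time_; }

    // Sending is an event; the returned value is the timestamp
    // that travels with the message.
    std::uint64_t OnSend() { return ++time_; }

    // On receipt, jump past both the local time and the sender's
    // timestamp, so the receive is ordered after the send.
    std::uint64_t OnReceive(std::uint64_t remoteStamp) {
        time_ = std::max(time_, remoteStamp) + 1;
        return time_;
    }

    std::uint64_t Now() const { return time_; }

private:
    std::uint64_t time_ = 0;
};
```

For example, if process A performs two local events and then sends a message stamped 3, a receiver whose clock is still at 0 moves to 4, so every event it logs afterwards is ordered after the send.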
read from the log files and generate inputs to the replayed
application. Windows applications typically make extensive
use of asynchronous APIs such as IOCP (I/O Completion
Port) to improve performance. For asynchronous APIs,
the logger records the completion points,
which the feeder uses to present the results correctly.
5.2 Logger
The BOX logger offers semantic-rich logging capability.
Since APIs are annotated, the logs contain information
such as in, out, and the pairing of a buffer pointer and
its length. If an API returns a name, the logger presents
it as a text string rather than a data pointer. These
semantic-rich logs are very useful in aiding the debugging effort.
We use our customized scheduler with one runnable
token so that only one thread is running at a time. While this
limits concurrency, it completely avoids data races
that result from conflicting memory accesses
occurring outside the protection of synchronization objects.
This is the same approach as adopted in [3];
alternatives such as logging memory accesses [1] are known
to be too expensive.
To make logging efficient, log files are per thread so that there is
no contention among threads. We use a pair of memory
buffers for each thread. The logger inserts a log slot into
the APIs that we are interested in logging. The slot
writes logs into the primary buffer, while the secondary
buffer is flushed to the log file asynchronously. Once the
secondary buffer is done with its I/O, it is ready to
switch roles with the primary buffer. This process stalls
only if the primary is full and the secondary has not
finished its I/O. With sufficiently large buffers (e.g. 64KB),
such cases rarely happen.
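The role switch between the two buffers can be sketched as a small state machine. This is a single-threaded model written for illustration: the asynchronous write is simulated by an explicit CompleteFlush call, the class name is hypothetical, and records are assumed to be smaller than the buffer capacity.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Per-thread log with a primary (being filled) and a secondary
// (being flushed) buffer.
class DualBufferLog {
public:
    explicit DualBufferLog(std::size_t capacity) : capacity_(capacity) {}

    // Returns false when the caller would have to stall: the primary
    // is full and the secondary has not finished its I/O.
    bool Append(const std::string& record) {
        if (primary_.size() + record.size() > capacity_) {
            if (flushing_) return false;
            std::swap(primary_, secondary_);  // switch roles
            flushing_ = true;  // a real logger would start async I/O here
        }
        primary_ += record;
        return true;
    }

    // Simulates completion of the asynchronous write to the log file.
    void CompleteFlush(std::string& logFile) {
        if (!flushing_) return;
        logFile += secondary_;
        secondary_.clear();
        flushing_ = false;
    }

private:
    std::size_t capacity_;
    std::string primary_;
    std::string secondary_;
    bool flushing_ = false;
};
```

With a capacity of 64KB the false return should be rare, matching the observation above; in a real logger that branch would wait for the pending write instead of returning.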
We should point out that separating the application
execution into BOX application and system subspaces is
critical to making replay possible. It gives us the option
to implement two heap managers for the two subspaces.
These heap managers are called by the application
and the tools separately; they do not interfere with
each other, and they have no random factors, so that the
same request sequence produces the same address
footprint. This is one reason that we are able to replay
many large and complex applications, as reported in
Section 6. Inconsistent memory footprint is one major
problem that plagues earlier attempts [3].
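To make the determinism argument concrete, the sketch below uses a simple bump allocator as a stand-in for each heap manager. This is a deliberately swapped-in, simplified technique (no free, offsets instead of addresses, hypothetical names); it only illustrates why separate, randomness-free heaps give the application the same footprint regardless of what the tools allocate.

```cpp
#include <cstddef>
#include <vector>

// Deterministic bump allocator over a private arena. Offsets stand in
// for addresses; a real heap would place the arena at a fixed address.
class BumpHeap {
public:
    static constexpr std::size_t kFail = static_cast<std::size_t>(-1);

    explicit BumpHeap(std::size_t bytes) : arena_(bytes), next_(0) {}

    // Returns the offset of the new block, or kFail when exhausted.
    std::size_t Allocate(std::size_t size) {
        if (size > arena_.size()) return kFail;
        std::size_t aligned = (size + 15) & ~std::size_t{15};  // 16-byte units
        if (aligned > arena_.size() - next_) return kFail;
        std::size_t offset = next_;
        next_ += aligned;
        return offset;
    }

private:
    std::vector<unsigned char> arena_;
    std::size_t next_;
};
```

If the application and the tools each own such a heap, a tool that allocates a different amount of memory during replay than during logging cannot shift the offsets handed to the application.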
5.3 Customizable scheduler
We have developed a customized scheduler which
plugs into the SignalEx of an API extension as a slot.
This scheduler gives away n tokens and lets the runnable
threads compete for them; a thread that owns a token runs
until the next scheduling point. As such, one limitation is that
the application should not have a tight loop whose
termination depends on a global variable that is only
set by another thread. It is easy to see that changing n
effectively sets the number of concurrent threads
in the system. When n is one, this becomes a traditional
cooperative scheduler, which we use in log and
replay.
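A token scheduler of this kind can be sketched with a mutex and a condition variable. The class below is an illustrative assumption about the structure, not our implementation; the method names, including SchedulingPoint and TryAcquire, are hypothetical.

```cpp
#include <condition_variable>
#include <mutex>

// Hands out at most n tokens; a thread must hold a token to run.
class TokenScheduler {
public:
    explicit TokenScheduler(int n) : tokens_(n) {}

    // Blocks until a token is available, then takes it.
    void Acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return tokens_ > 0; });
        --tokens_;
    }

    // Non-blocking variant: returns false if no token is free.
    bool TryAcquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tokens_ == 0) return false;
        --tokens_;
        return true;
    }

    // Gives the token back and wakes one waiting thread.
    void Release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++tokens_;
        }
        available_.notify_one();
    }

    // Called at a scheduling point (e.g. an API annotated as blocking):
    // give up the token and compete for one again.
    void SchedulingPoint() {
        Release();
        Acquire();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    int tokens_;
};
```

Constructing the scheduler with n equal to one gives the cooperative behavior used for log and replay: only the token holder runs, and control can change hands only at scheduling points.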
5.5 Time-Suspending-Debugging
Hijacking the APIs gives us an opportunity to present
different time illusions to applications. We have
built a distributed system debugging utility called
Time-Suspending-Debugging (TSD). Under TSD, the
time presented to the application is logical, and this
logical clock can be paused. This allows a developer to
virtually freeze an application in the middle of a run,
effectively establishing a distributed breakpoint. What
is important is that all time-sensitive APIs that have
already been fired, such as sleep and various timers,
correctly skip the pause duration.
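The pausable logical clock can be sketched as follows. Physical time is passed in explicitly (in milliseconds) to keep the example deterministic; the class and its methods are hypothetical names, and the way TSD coordinates the pause across machines is not modeled.

```cpp
#include <cstdint>

// Logical clock that stands still while paused. All times are in
// milliseconds of a monotonic physical clock supplied by the caller.
class LogicalClock {
public:
    void Pause(std::int64_t physicalNow) {
        if (paused_) return;
        paused_ = true;
        pauseStart_ = physicalNow;
    }

    void Resume(std::int64_t physicalNow) {
        if (!paused_) return;
        pausedTotal_ += physicalNow - pauseStart_;
        paused_ = false;
    }

    // Logical time = physical time minus all time spent paused.
    std::int64_t Now(std::int64_t physicalNow) const {
        std::int64_t effective = paused_ ? pauseStart_ : physicalNow;
        return effective - pausedTotal_;
    }

    // A timer or sleep issued earlier fires only once logical time
    // reaches its deadline, so it skips the pause duration.
    bool Expired(std::int64_t logicalDeadline,
                 std::int64_t physicalNow) const {
        return Now(physicalNow) >= logicalDeadline;
    }

private:
    bool paused_ = false;
    std::int64_t pauseStart_ = 0;
    std::int64_t pausedTotal_ = 0;
};
```

For instance, a 50 ms sleep issued at logical time 100 has deadline 150; if the clock is paused at 120 for a full second, the sleep completes 30 ms after the clock resumes rather than immediately.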
5.4 Replay
Many of the subtle bugs will only surface when an application is actually executed. Deterministic replay
builds a time machine, and thus proactively exerts control over non-deterministic factors in the original run,
making it possible to check those bugs and to further
identify their root causes. We have built a log and replay tool that works for multithreaded and distributed
applications.
6. Preliminary Results
In this section, we give some preliminary results for the
BOX platform and the tools on top of it. The experiments
were all performed on machines with a 2.8GHz
Pentium 4 CPU and 512 MB of memory.
Code Generation. Table 1 lists the statistics of the APIs
that BOX has intercepted for all the experiments here,
which are spread across five different modules. The table
also lists the number of APIs that are asynchronous, and
the number annotated to track dependencies. The replay column
shows the subset of APIs that our replay tool logs and
replays; the rest (e.g. thread creation and console output)
are natively executed during replay. Shown also
Our replay facility builds upon the logger, the customized
scheduler and the part of depTracker that implements
the Lamport clock, and relies on one more component
called the feeder. At replay, the feeder replaces the native
API with a feed slot. It does the reverse of the logger,
using essentially the same dual-buffer mechanism to
are the average number of parameters, and among them
those annotated with in, out, and opt.
module     API  async  tracker  replay  avg params  avg in  avg out  avg opt
ntdll       62      6       39      35        4.34    3.58     0.87     1.03
kernel32   509      0       98     442        2.62    2.15     0.58     0.05
advapi32    78      0        6      78        4.19    3.29     1.18     0.82
ws2_32      95      5       51      95        3.31    2.76     0.88     0.12
mswsock      7      5        6       7        6.29    5.29     1.29     0.14
Total      751     16      200     657        3.05    2.50     0.71     0.22
tems being developed. One of them is a semi-structured
storage system and another is a social network computation engine. Our experience is that as long as the APIs
are completely and correctly annotated, replay is
straightforward.
7. Conclusion
The design principle of BOX is based on our conviction
that the API is the most natural boundary at which to install
various debugging tools. This choice gives us legacy
application compatibility without giving up control. The
challenges are to deal with a large API surface that needs to
be systematically instrumented, to exert controlled isolation
that eliminates interference of tools with the target
applications, and to provide sufficient extensibility so that
tools can be incrementally developed and evolved. We
have addressed each of these challenges, and our experience
so far has validated this choice. Our future work
will continue on a number of fronts. We will continue
to instrument more APIs for completeness, develop
more BOX-based tools, and combine BOX with our
ongoing online bug checking research.
Table 1 Statistics of code generation
Wrapper performance. For every intercepted API, a
wrapper takes charge of space dispatching and signal-slot
processing. On the x86 platform, the wrapper and the
signal-slot processing amount to 38 and 67 instructions on
average, respectively. The wrapper overhead for
every API is therefore fairly low.
Apache. We test and replay the Apache server to evaluate
the overall performance of our platform. We install the
latest prebuilt version of the Apache server (without SSL)
from [10], and test it with the default configuration (250
threads per child) under three different platform
configurations. We serve an HTML file of 64KB and
download it with the ApacheBench tool shipped in the same
install package. Figure 2 gives the overall performance
for a varying number of concurrent clients. The monitor
configuration has only the native slot in each SignalEx object,
and shows an average slowdown of 2.6%. The log
configuration includes both the cooperative scheduler and
the BOX logger, and its average slowdown is 6.0%.
Figure 2 Overall performance on Apache (plot omitted: requests per second against concurrency levels 1, 2, 5, 10, 20 and 50 for the native, monitor and log configurations)
We also experimented with the MySQL server [12], using the
SQL benchmark shipped in the same install package,
which is again successfully replayed. The result is
approximately the same as for Apache: the
slowdown is about 2.4% for monitor and 6.9% for log. Both
Apache and MySQL can be successfully replayed based
on the logs, and replay generally takes less time because
all timeouts and thread blockings are removed. Other
packages that we have replayed include wget, the Lua interpreter, as well as our two ongoing distributed sys-
References
[1] S. Bhansali, W. Chen, S. d. Jong, A. Edwards, R. Murray, M. Drinic, D. Mihocka, and J. Chau, "Framework for Instruction-level Tracing and Analysis of Program Executions", Second International Conference on Virtual Execution Environments, June 2006
[2] H. Chen, D. Dean, and D. Wagner, "Model checking one million lines of C code", Symposium on Network and Distributed Systems Security, 2004
[3] D. Geels, G. Altekar, S. Shenker, and I. Stoica, "Replay Debugging for Distributed Applications", USENIX Annual Technical Conference, May 2006
[4] T. Garfinkel and M. Rosenblum, "A Virtual Machine Introspection Based Architecture for Intrusion Detection", Proceedings of Network and Distributed Systems Security Symposium, 2003
[5] B. Hackett, M. Das, D. Wang, and Z. Yang, "Modular checking for buffer overflows in the large", 28th International Conference on Software Engineering, May 2006
[6] S. Lin, A. Pan, Z. Zhang, R. Guo, and Z. Guo, "WiDS: An Integrated Toolkit for Distributed System Development", Tenth Workshop on Hot Topics in Operating Systems, June 2005
[7] X. Liu, W. Lin, A. Pan and Z. Zhang, "WiDS Checker: Combating Bugs in Distributed Systems", to appear in 4th Symposium on Networked Systems Design and Implementation, April 2007
[8] S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou, "Flashback: A lightweight extension for rollback and deterministic replay for software debugging", USENIX Annual Technical Conference, General Track, pages 29-44, 2004
[9] Y. Yu, F. Guo, S. Nanda, L. Lam, and T. Chiueh, "A Feather-weight Virtual Machine for Windows Applications", VEE, Ottawa, June 2006
[10] Apache 2.2.3, http://httpd.apache.org/download.cgi
[11] Layered Protocols and Provider Chains, http://msdn2.microsoft.com/en-us/library/aa925739.aspx
[12] MySQL-5.0.27, http://www.mysql.org/downloads/mysql/5.0.html
[13] SAL Annotations, http://msdn2.microsoft.com/en-us/library/ms235402(vs.80).aspx
[14] Signals and Slots, http://en.wikipedia.org/wiki/Signals_and_slots
[15] Strace, http://sourceforge.net/projects/strace/