Resource-Aware Application State Monitoring
Shicong Meng, Student Member, IEEE, Srinivas Raghav Kashyap,
Chitra Venkatramani, and Ling Liu, Senior Member, IEEE
Abstract—The increasing popularity of large-scale distributed applications in datacenters has led to the growing demand of distributed
application state monitoring. These application state monitoring tasks often involve collecting values of various status attributes from a
large number of nodes. One challenge in such large-scale application state monitoring is to organize nodes into a monitoring overlay
that achieves monitoring scalability and cost effectiveness at the same time. In this paper, we present REMO, a REsource-aware
application state MOnitoring system, to address the challenge of monitoring overlay construction. REMO distinguishes itself from
existing works in several key aspects. First, it jointly considers intertask cost-sharing opportunities and node-level resource constraints.
Furthermore, it explicitly models the per-message processing overhead which can be substantial but is often ignored by previous
works. Second, REMO produces a forest of optimized monitoring trees through iterations of two phases. One phase explores cost-sharing opportunities between tasks, and the other refines the tree with resource-sensitive construction schemes. Finally, REMO also
employs an adaptive algorithm that balances the benefits and costs of overlay adaptation. This is particularly useful for large systems
with constantly changing monitoring tasks. Moreover, we enhance REMO in terms of both performance and applicability with a series
of optimization and extension techniques. We perform extensive experiments including deploying REMO on a BlueGene/P rack
running IBM’s large-scale distributed streaming system—System S. Using REMO in the context of collecting over 200 monitoring tasks
for an application deployed across 200 nodes results in a 35-45 percent decrease in the percentage error of collected attributes
compared to existing schemes.
Index Terms—Resource-aware, state monitoring, distributed monitoring, datacenter monitoring, adaptation, data-intensive
1 INTRODUCTION
Recently, we have witnessed a fast growing set of large-scale distributed applications ranging from stream processing [1] to applications [2] running in Cloud datacenters. Correspondingly, the demand for monitoring the
functioning of these applications also increases substantially.
Typical monitoring of such applications involves collecting
values of metrics, e.g., performance related metrics, from a
large number of member nodes to determine the state of the
application or the system. We refer to such monitoring tasks
as application state monitoring. Application state monitoring is
essential for the observation, analysis, and control of
distributed applications and systems. For instance, data
stream applications may require monitoring the data receiving/sending rate, captured events, tracked data entities,
signature of internal states, and any number of application-specific attributes on participating computing nodes to
ensure stable operation in the face of highly bursty workloads
[3] [4]. Application provisioning may also require continuously collecting performance attribute values such as CPU
usage, memory usage, and packet size distributions from
application-hosting servers [5].
One central problem in application state monitoring is
organizing nodes into a certain topology where metric
values from different nodes can be collected and delivered.
In many cases, it is useful to collect detailed performance
attributes at a controlled collection frequency. As an
example, fine-grained performance characterization information is required to construct various system models and to
test hypotheses on system behavior [1]. Similarly, the data
rate and buffer occupancy in each element of a distributed
application may be required for diagnosis purposes when
there is a perceived bottleneck [3]. However, the overhead of
collecting monitoring data grows quickly as the scale and
complexity of monitoring tasks increase. Hence, it is crucial
that the monitoring topology should ensure good monitoring scalability and cost effectiveness at the same time.
While a set of monitoring-topology planning approaches
have been proposed in the past, we find that these approaches
often have the following drawbacks in general. First of all,
existing works either build monitoring topologies for each
individual monitoring task (TAG [6], SDIMS [7], PIER [8], join
aggregations [9], REED [10], operator placement [11]), or use a
static monitoring topology for all monitoring tasks [11]. These
two approaches, however, often produce suboptimal monitoring topologies. For example, if two monitoring tasks both
collect metric values over the same set of nodes, using one
monitoring tree for monitoring data transmission is more
efficient than using two, as nodes can merge updates for both
tasks and reduce per-message processing overhead. Hence,
multimonitoring-task-level topology optimization is crucial
for monitoring scalability.
Second, for many data-intensive environments, monitoring overhead grows substantially with the increase of
monitoring tasks and deployment scale [7] [12]. It is
important that the monitoring topology should be resource
sensitive, i.e., it should avoid monitoring nodes spending
excessive resources on collecting and delivering attribute
values. Unfortunately, existing works do not take node-level
resource consumption as a first-class consideration. This
may result in overload on certain nodes which eventually
leads to monitoring data loss. Moreover, some assumptions
in existing works do not hold in real-world scenarios. For
example, many works assume that the cost of update
messages is only related to the number of values within
the message, while we find that a fixed per-message
overhead is not negligible.
Last but not least, application state monitoring tasks
are often subject to change in real-world deployments [13].
Some tasks are short term by nature, e.g., ad hoc tasks
submitted to check the current system usage [14]. Other
tasks may be frequently modified for debugging, e.g., a user
may specify different attributes for one task to understand
which attribute provides the most useful information [13].
Nevertheless, existing works often consider monitoring
tasks to be static and perform one-time topology optimization [10] [11]. With little support for efficient topology
adaptation, these approaches would either produce suboptimal topologies when using a static topology regardless
of changes in tasks, or introduce high adaptation cost when
performing comprehensive topology reconstruction for any
change in tasks [15].
In this paper, we present REMO, a resource-aware application state monitoring system that aims at addressing the above issues. REMO takes node-level available resources as the first-class factor for building a monitoring topology. It optimizes the monitoring topology to achieve the best scalability and ensures that no node is assigned a monitoring workload that exceeds its available resources.
REMO employs three key techniques to deliver cost-effective monitoring topologies under different environments. In the conference version [15] of this paper, we
introduced a basic topology planning algorithm. This algorithm
produces a forest of carefully optimized monitoring trees for
a set of static monitoring tasks. It iteratively explores cost-sharing opportunities among monitoring tasks and refines
the monitoring trees to achieve the best performance given
the resource constraints on each node. One limitation of the
basic approach is that it explores the entire search space for
an optimal topology whenever the set of monitoring tasks is
changed. This could lead to significant resource consumption for monitoring environments where tasks are subject to
change. In this journal paper, we present an adaptive topology
planning algorithm which continuously optimizes the monitoring topology according to the changes of tasks. To
achieve cost effectiveness, it maintains a balance between the
topology adaptation cost and the topology efficiency, and
employs cost-benefit throttling to avoid trivial adaptation.
To ensure the efficiency and applicability of REMO, we also
introduce a set of optimization and extension techniques in this
paper. These techniques further improve the efficiency of
the resource-sensitive monitoring tree construction scheme, and allow REMO to support popular monitoring features such as in-network aggregation and reliability enhancements.
We undertake an experimental study of our system and
present results including those gathered by deploying
REMO on a BlueGene/P rack (using 256 nodes booted into
Linux) running IBM’s large-scale distributed streaming
system—System S [3]. The results show that our resource-aware approach for application state monitoring consistently
outperforms the current best known schemes. For instance,
in our experiments with a real application that spanned up to
200 nodes and about as many monitoring tasks, using REMO
to collect attributes resulted in a 35-45 percent reduction in
the percentage error of the attributes that were collected.
To the best of our knowledge, REMO is the first system that promotes a resource-aware methodology to support and scale multiple application state monitoring tasks in large-scale distributed systems. We make three contributions in
this paper.
. We identify three critical requirements for large-scale application state monitoring: the sharing of message processing cost among attributes, meeting node-level resource constraints, and efficient adaptation toward
monitoring task changes. Existing approaches do not
address these requirements well.
. We propose a framework for communication-efficient
application state monitoring. It allows us to optimize
monitoring topologies to meet the above three
requirements under a single framework.
. We develop techniques to further improve the
applicability of REMO in terms of runtime efficiency
and supporting new monitoring features.
The rest of the paper is organized as follows: Section 2
identifies challenges in application state monitoring. Section
3 illustrates the basic monitoring topology construction
algorithm, and Section 4 introduces the adaptive topology
construction algorithm. We optimize the efficiency of REMO
and extend it for advanced features in Sections 5 and 6. We
present our experimental results in Section 7. Section 8
describes related works and Section 9 concludes this paper.
2 SYSTEM OVERVIEW
In this section, we introduce the concept of application state
monitoring and its system model. We also demonstrate the
challenges in application state monitoring, and point out
the key questions that an application state monitoring
approach must address.
2.1 Application State Monitoring
Users and administrators of large-scale distributed applications often employ application state monitoring for observation, debugging, analysis, and control purposes. Each
application state monitoring task periodically collects values
of certain attributes from the set of computing nodes over
which an application is running. We use the term attribute
and metric interchangeably in this paper. As we focus on
monitoring topology planning rather than the actual production of attribute values [16], we assume values of attributes
are made available by application-specific tools or management services. In addition, we target datacenter-like monitoring environments where any two nodes can communicate with similar cost (more details in Section 3.3). Formally,
we define an application state monitoring task t as follows:
Definition 1. A monitoring task $t = (A_t, N_t)$ is a pair of sets, where $A_t \subseteq \bigcup_{i \in N_t} A_i$ is a set of attributes and $N_t \subseteq N$ is a set of nodes. In addition, $t$ can also be represented as a list of node-attribute pairs $(i, j)$, where $i \in N_t, j \in A_t$.

Fig. 1. A high-level system model.

Fig. 2. CPU usage versus increasing message number/size.
2.2 The Monitoring System Model
Fig. 1 shows the high-level model of REMO, a system we
developed to provide application state monitoring functionality. REMO consists of several fundamental components:
Task manager takes state monitoring tasks and removes duplication among monitoring tasks. For instance, monitoring tasks $t_1 = (\{\text{cpu utilization}\}, \{a, b\})$ and $t_2 = (\{\text{cpu utilization}\}, \{b, c\})$ have a duplicated monitored attribute, cpu utilization, on node $b$. With such duplication, node $b$ has to send cpu utilization information twice for each update, which is clearly unnecessary. Therefore, given a set of monitoring tasks, the task manager transforms this set of tasks into a list of node-attribute pairs and eliminates all duplicated node-attribute pairs. For instance, $t_1$ and $t_2$ are equivalent to the lists {a-cpu utilization, b-cpu utilization} and {b-cpu utilization, c-cpu utilization}, respectively. In this case, the node-attribute pair b-cpu utilization is duplicated, and thus, is eliminated from the output of the task manager.
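To make the deduplication step concrete, the following minimal Python sketch (our own illustration; the function name dedup_tasks is not part of REMO) flattens a set of monitoring tasks into unique node-attribute pairs:

```python
def dedup_tasks(tasks):
    """Flatten monitoring tasks (attribute_set, node_set) into a set of
    unique (node, attribute) pairs, dropping duplicates across tasks."""
    pairs = set()
    for attributes, nodes in tasks:
        for node in nodes:
            for attr in attributes:
                pairs.add((node, attr))
    return pairs

# t1 and t2 from the example above; node b's cpu utilization appears only once.
t1 = ({"cpu_utilization"}, {"a", "b"})
t2 = ({"cpu_utilization"}, {"b", "c"})
print(sorted(dedup_tasks([t1, t2])))
```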
Management core takes deduplicated tasks as input and
schedules these tasks to run. One key subcomponent of the
management core is the monitoring planner which determines the interconnection of monitoring nodes. For simplicity, we also refer to the overlay connecting monitoring
nodes as the monitoring topology. In addition, the management core also provides important support for reliability
enhancement and failure handling. Data collector provides
a library of functions and algorithms for efficiently collecting attribute values from the monitoring network. It also
serves as the repository of monitoring data and provides
monitoring data access to users and high-level applications.
Result processor executes the concrete monitoring operations including collecting and aggregating attribute values,
triggering warnings, etc.
In this paper, we focus on the design and implementation of the monitoring planner. We next introduce monitoring overhead in application state monitoring which drives
the design principles of the monitoring planner.
2.3 Monitoring Overhead and Monitoring Planning
On a high level, a monitoring system consists of n
monitoring nodes and one central node, i.e., data collector.
Each monitoring node has a set of observable attributes
$A_i = \{a_j \mid j \in [1, m]\}$. Attributes at different nodes but with
the same subscript are considered as attributes of the
same type. For instance, monitored nodes may all have
locally observable CPU utilization. We consider an attribute
as a continuously changing variable which outputs a new
value in every unit time. For simplicity, we assume all
attributes are of the same size a and it is straightforward to
extend our work to support attributes with different sizes.
Each node i, the central node or a monitoring node, has a
capacity bi (also referred to as the resource constraint of
node i) for receiving and transmitting monitoring data. In
this paper, we consider CPU as the primary resource for
optimization. We associate each message transmitted in the
system with a per-message overhead $C$, and the cost of transmitting a message with $x$ values is $C + ax$. This cost
model is motivated by our observations of monitoring
resource consumption on a real-world system which we
introduce next.
Our cost model considers both per-message overhead
and the cost of payload. Although other models may
consider only one of these two, our observation suggests
that both costs should be captured in the model. Fig. 2
shows how significant the per-message processing overhead
is. The measurements were performed on a BlueGene/P
node which has a quad core 850 MHz PowerPC processor.
The figure shows an example monitoring task where nodes
are configured in a star network where each node
periodically transmits a single fixed small message to a root
node over TCP/IP. The CPU utilization of the root node
grows roughly linearly from around 6 percent for 16 nodes
(the root receives 16 messages periodically) to around
68 percent for 256 nodes (the root receives 256 messages
periodically). Note that this increased overhead is due to the
increased number of messages at the root node and not due
to the increase in the total size of messages. Furthermore,
the cost incurred to receive a single message increases from
0.2 to 1.4 percent when we increase the number of values in
the message from 1 to 256. Hence, we also model the cost
associated with message size as a message may contain a
large number of values relayed for different nodes.
In other scenarios, the per-message overhead could be
transmission or protocol overhead. For instance, a typical
monitoring message delivered via TCP/IP protocol has a
message header of at least 78 bytes, not including application-specific headers, while an integer monitoring value is just 4 bytes.
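As a rough illustration of this cost model, the sketch below (our own simplification; the constants are assumed, not measured values) computes the per-message cost $C + ax$ and the load it implies on the root of a star topology:

```python
# Assumed constants for illustration only; real values depend on the platform.
PER_MESSAGE_OVERHEAD = 4.0   # C, per-message processing overhead
PER_VALUE_COST = 0.05        # a, processing cost per attribute value

def message_cost(num_values, C=PER_MESSAGE_OVERHEAD, a=PER_VALUE_COST):
    """Cost of processing one message carrying num_values attribute values."""
    return C + a * num_values

def star_root_load(num_children, values_per_child=1):
    """A star root receives one message per child each period, so the
    per-message overhead C dominates as the fan-in grows."""
    return sum(message_cost(values_per_child) for _ in range(num_children))

print(star_root_load(16), star_root_load(256))  # load grows with message count
```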
Fig. 3. An example of monitoring planning.

As Fig. 3 shows, given a list of node-attribute pairs, the monitoring planner organizes monitoring nodes into a forest of monitoring trees where each node collects values for a set of attributes. The planner considers the aforementioned per-message overhead as well as the cost of attribute transmission (as illustrated by the black and white bar in the left monitoring tree) to avoid overloading certain monitoring nodes in the generated monitoring topology. In
addition, it also optimizes the monitoring topology to
achieve maximum monitoring data delivery efficiency. As
a result, one monitoring node may connect to multiple trees
(as shown in Figs. 3 and 4c). Within a monitoring tree T ,
each node i periodically sends an update message to its
parent. As application state monitoring requires collecting
values of certain attributes from a set of nodes, such update
messages include both values locally observed by node i
and values sent by i’s children, for attributes monitored by
T . Thus, the size of a message is proportional to the number
of monitoring nodes in the subtree rooted at node i. This
process continues upward in the tree until the message
reaches the central data collector node.
2.4 Challenges in Monitoring Planning
From the users’ perspective, monitoring results should be
as accurate as possible, suggesting that the underlying
monitoring network should maximize the number of node-attribute pairs received at the central node. In addition,
such a monitoring network should not cause the excessive
use of resource at any node. Accordingly, we define the
monitoring planning problem (MP) as follows:
Problem Statement 1. Given a set of node-attribute pairs for monitoring $\{\omega_1, \omega_2, \ldots, \omega_p\}$, where $\omega_q = (i, j)$, $i \in N$, $j \in A$, $q \in [1, p]$, and a resource constraint $b_i$ for each associated node, find a parent $f(i, j), \forall i, j$, where $j \in A_i$, such that node $i$ forwards attribute $j$ to node $f(i, j)$, the total number of node-attribute pairs received at the central node is maximized, and the resource demand $d_i$ of node $i$ satisfies $d_i \leq b_i, \forall i \in N$.
NP-completeness. When restricting all nodes to only
monitor the same attribute j, we obtain a special case of the
monitoring planning problem where each node has at most
one attribute to monitor. As shown by Kashyap et al. [17],
this special case is an NP-complete problem. Consequently,
the monitoring planning problem is an NP-complete problem, since each instance of MP can be restricted to this special case. Therefore, in REMO, we primarily focus on efficient approaches that can deliver reasonably good monitoring plans.

Fig. 4. Motivating examples for the topology planning problem.
We now use some intuitive examples to illustrate the
challenges and the key questions that need to be addressed
in designing a resource-aware monitoring planner. Fig. 4
shows a monitoring task involving 6 monitoring nodes
where each node has a set of attributes to deliver (as
indicated by the letters on the nodes). The four examples (a), (b),
(c), and (d) demonstrate different approaches to fulfill this
monitoring task. Example (a) shows a widely used topology
in which every node sends its updates directly to the central
node. Unfortunately, this topology has poor scalability,
because it requires the central node to have a large amount of
resources to account for per-message overhead. We refer to
the approach used in example (a) as the star collection.
Example (b) organizes all monitoring nodes in a single tree
which delivers updates for all attributes. While this monitoring plan reduces the resource consumption (per-message
overhead) at the central node, the root node now has to relay
updates for all node-attribute pairs, and again faces
scalability issues due to limited resources. We refer to this
approach as one-set collection. These two examples suggest
that achieving a certain degree of load balancing is important
for a monitoring network.
However, load balance alone does not lead to a good
monitoring plan. In example (c), to balance the traffic
among nodes, the central node uses three trees, each of
which delivers only one attribute, and thus achieves a more
balanced workload compared with example (b) (one-set
collection) because updates are relayed by three root nodes.
However, since each node monitors at least two attributes,
nodes have to send out multiple update messages instead of
one as in example (a) (star collection). Due to per-message
overhead, this plan leads to higher resource consumption at
almost every node. As a result, certain nodes may still fail to
deliver all updates and fewer resources will be left over for
additional monitoring tasks. We refer to the approach in
example (c) as singleton-set collection.
The above examples reveal two fundamental aspects of
the monitoring planning problem. First, how to determine the
number of monitoring trees and the set of attributes on each? This
is a nontrivial problem. Example (d) shows a topology which
uses one tree to deliver attribute a, b, and another tree to
deliver attribute c. It introduces less per-message overhead
compared with example (c) (singleton-set collection) and is a
more load-balanced solution compared with example (b)
(one-set collection). Second, how to determine the topology for
nodes in each monitoring tree under node-level resource constraints? Constructing monitoring trees subject to resource
constraints at nodes is also a nontrivial problem and the
choice of topology can significantly impact node resource
usage. Example (e) shows three different trees. The star
topology (upper left), while introducing the least relaying
cost, causes significant per-message overhead at its root. The
chain topology (upper right), on the contrary, distributes the
per-message overhead among all nodes, but causes the most
relaying cost. A “mixed” tree (bottom) might achieve a good
tradeoff between relaying cost and per-message overhead,
but it is nontrivial to determine its optimal topology.
3 THE BASIC REMO APPROACH
The basic REMO approach promotes the resource aware
multitask optimization framework, consisting of a two-phase iterative process and a suite of multitask optimization
techniques. At a high level, REMO operates as a guided local
search approach, which starts with an initial monitoring
network composed of multiple independently constructed
monitoring trees, and iteratively optimizes the monitoring
network until no further improvements are possible. When
exploring various optimization directions, REMO employs
cost estimation to guide subsequent improvement so that
the search space can be restricted to a small size. This
guiding feature is essential for the scalability of large-scale
application state monitoring systems.
Concretely, during each iteration, REMO first runs a
partition augmentation procedure which generates a list of
most promising candidate augmentations for improving the
current distribution of monitoring workload among monitoring trees. While the total number of candidate augmentations is very large, this procedure can trim down the size
of the candidate list for evaluation by selecting the most
promising ones through cost estimation. Given the generated candidate augmentation list, the resource-aware evaluation procedure further refines candidate augmentations by
building monitoring trees accordingly with a resource-aware tree construction algorithm. We provide more details on these two procedures in the following discussion.
3.1 Partition Augmentation
The partition augmentation procedure is designed to
produce the attribute partitions that can potentially reduce
message processing cost through a guided iterative process.
These attribute partitions determine the number of monitoring trees in the forest and the set of attributes each tree
delivers. To better understand the design principles of our
approach, we first briefly describe two simple but most
popular schemes, which essentially represent the state-of-the-art in multiple application state monitoring.
Recall that among example schemes in Fig. 4, one scheme
(example (c)) delivers each attribute in a separate tree, and
the other scheme (example (b)) uses a single tree to deliver
updates for all attributes. We refer to these two schemes as the
Singleton-set partition scheme (SP) and the One-set partition
(OP) scheme, respectively. We use the term “partition”
because these schemes partition the set of monitored
attributes into a number of nonoverlapping subsets and
assign each subset to a monitoring tree.
Singleton-set partition. Specifically, given a set of
attributes for collection A, singleton-set partition scheme
divides $A$ into $|A|$ subsets, each of which contains a distinct
attribute in A. Thus, if a node has m attributes to monitor, it
is associated with m trees. This scheme is widely used in
previous work, e.g., PIER [8], which constructs a routing
tree for each attribute collection. While this scheme
provides the most balanced load among trees, it is not
efficient, as nodes have to send update messages for each
individual attribute.
One-set partition. The one-set partition scheme uses the
set A as the only partitioned set. This scheme is also used in a
number of previous work [11]. Using OP, each node can
send just one message which includes all the attribute
values, and thus, saves per-message overhead. Nevertheless, since the size of each message is much larger compared
with messages associated with SP, the corresponding
collecting tree cannot grow very large, i.e., it contains a limited number of nodes.
3.1.1 Exploring Partition Augmentations
REMO seeks a middle ground between these extreme
solutions—one where nodes pay lower per-message overhead compared to SP while being more load-balanced and
consequently more scalable than OP. Our partition augmentation scheme explores possible augmentations to a given
attribute partition P by searching for all partitions that are
close to P in the sense that the resulting partition can be
created by modifying P with certain predefined operations.
We define two basic operations that are used to modify
attribute set partitions.
Definition 2. Given two attribute sets $A_i^P$ and $A_j^P$ in partition $P$, a merge operation over $A_i^P$ and $A_j^P$, denoted as $A_i^P \bowtie A_j^P$, yields a new set $A_k^P = A_i^P \cup A_j^P$. Given one attribute set $A_i^P$ and an attribute $\alpha$, a split operation on $A_i^P$ with regard to $\alpha$, denoted as $A_i^P - \alpha$, yields two new sets $A_k^P = A_i^P - \{\alpha\}$ and $\{\alpha\}$.
A merge operation is simply the union of two attribute sets. A split operation essentially removes one attribute from an existing attribute set. As it is a special case of the set difference operation, we use the set difference sign ($-$) here to define split. Furthermore, there is no
restriction on the number of attributes that can be involved
in a merge or a split operation. Based on the definition of
merge and split operations, we now define neighboring
solution as follows:
Definition 3. For an attribute set partition $P$, we say partition $P'$ is a neighboring solution of $P$ if and only if either $\exists A_i^P, A_j^P \in P$ so that $P' = P - A_i^P - A_j^P + (A_i^P \bowtie A_j^P)$, or $\exists A_i^P \in P, \alpha \in A_i^P$ so that $P' = P - A_i^P + (A_i^P - \alpha) + \{\alpha\}$.
A neighboring solution is essentially a partition obtained by making a “one-step” modification (either one merge or one split operation) to the existing partition.
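The merge and split operations translate directly into code. The sketch below (our own illustration; names are not from REMO) enumerates the neighboring solutions of a partition represented as a list of frozensets:

```python
from itertools import combinations

def neighbors(partition):
    """Yield partitions that are one merge or one split away from `partition`."""
    # Merge: replace two attribute sets with their union.
    for i, j in combinations(range(len(partition)), 2):
        rest = [s for k, s in enumerate(partition) if k not in (i, j)]
        yield rest + [partition[i] | partition[j]]
    # Split: pull one attribute out of a set into its own singleton set.
    for i, s in enumerate(partition):
        if len(s) > 1:
            for attr in s:
                rest = [t for k, t in enumerate(partition) if k != i]
                yield rest + [s - {attr}, frozenset({attr})]

P = [frozenset({"a", "b"}), frozenset({"c"})]
for candidate in neighbors(P):
    print([sorted(s) for s in candidate])
```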
Guided partition augmentation. Exploring all neighboring augmentations of a given partition and evaluating the
performance of each augmentation is practically infeasible,
since the evaluation involves constructing resource-constrained monitoring trees. To mitigate this problem, we use
a guided partition augmentation scheme which greatly
reduces the number of candidate partitions for evaluation.
The basic idea of this guided scheme is to rank candidate
partitions according to the estimated reduction in the total
capacity usage that would result from using the new
partition. The rationale is that a partition that provides a
large decrease in capacity usage will free up capacity for
more attribute value pairs to be aggregated. Following this,
we evaluate neighboring partitions in decreasing order
of their estimated capacity reduction so that we can find a
good augmentation without evaluating all candidates. We
provide details of the gain estimation in the appendix
which can be found on the Computer Society Digital
Library at http://doi.ieeecomputersociety.org/10.1109/
TPDS.2012.82. This guided local-search heuristic is essential
to ensuring the practicality of our scheme.
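A compact sketch of this guided search is shown below (our own paraphrase; estimate_gain and evaluate stand in for the gain estimation described in the appendix and the resource-aware evaluation of Section 3.2, and neighbors is the generator from the previous sketch):

```python
def guided_augmentation(partition, estimate_gain, evaluate, max_evals=10):
    """Evaluate neighboring partitions in decreasing order of estimated
    capacity reduction and return the first one that improves the plan."""
    current_score = evaluate(partition)
    ranked = sorted(neighbors(partition), key=estimate_gain, reverse=True)
    for candidate in ranked[:max_evals]:
        if evaluate(candidate) > current_score:
            return candidate
    return partition  # no improving neighbor among the top-ranked candidates
```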
3.2 Resource-Aware Evaluation
To evaluate the objective function for a given candidate
partition augmentation m, the resource-aware evaluation
procedure evaluates m by constructing trees for nodes
affected by $m$ and measures the total number of node-attribute value pairs that can be collected using these trees.
This procedure primarily involves two tasks. One is
constructing a tree for a given set of nodes without exceeding
resource constraints at any node. The other is for a node
connected to multiple trees to allocate its resources to
different trees.
3.2.1 Tree Construction
The tree construction procedure constructs a collection tree
for a given set of nodes D such that no node exceeds its
resource constraints while trying to include as many nodes
as possible into the constructed tree. Formally, we define
the tree construction problem as follows:
Problem Statement 2. Given a set of $n$ vertices, each of which has $x_i$ attributes to monitor and resource constraint $b_i$, find a parent vertex $p(i), \forall i$, so that the number of vertices in the constructed tree is maximized, subject to the following constraints, where $u_i$ is the resource consumed at vertex $i$ for sending update messages to its parent:

1. For any vertex $i$ in the tree, $\sum_{p(j)=i} u_j + u_i \leq b_i$.
2. Let $y_i$ be the number of all attribute values transmitted by vertex $i$. We have $y_i = x_i + \sum_{p(j)=i} x_j$.
3. According to our definition, $u_i = C + y_i a$.
The first constraint requires that the resource spent on
node i for sending and receiving updates should not exceed
its resource constraint bi . The second constraint requires a
node to deliver its local monitored values as well as values
received from its children. The last constraint states that the
cost of processing an outgoing message is the combination of
per-message overhead and value processing cost. The tree
construction problem, however, is also NP-complete [17]
and so we present heuristics for it. We first discuss two simple tree construction heuristics.
Star. This scheme forms “star”-like trees by giving
priority to increasing the breadth of the tree. Specifically, it
adds nodes into the constructed tree in decreasing order of
available capacity, and attaches a new node to the node with
the lowest height and sufficient available capacity, until no
such nodes exist. STAR creates bushy trees and consequently
pays low relay cost. However, owing to large node degrees,
the root node suffers from higher per-message overhead, and
consequently, the tree cannot grow very large.
Chain. This scheme gives priority to increasing the
height of the tree, and constructs “chain”-like trees. CHAIN
adds nodes to the tree in the same way as STAR does except
that it tries to attach nodes to the node with the highest height
and sufficient available capacity. CHAIN creates long trees
that achieve very good load balance, but due to the number
of hops each message has to travel to reach the root, most
nodes pay a high relay cost.
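The two heuristics can be sketched as follows (our own rendering under the cost model of Section 2.3; for brevity it only charges each parent for receiving its children's messages and omits the propagation of relay cost up the tree):

```python
def build_tree(nodes, mode="star", C=4.0, a=0.05):
    """Greedy STAR/CHAIN sketch. `nodes` is a list of (node_id, capacity,
    num_values); returns {child: parent}. STAR attaches each node to the
    shallowest feasible parent, CHAIN to the deepest one."""
    nodes = sorted(nodes, key=lambda n: -n[1])          # decreasing capacity
    root_id, root_cap, _ = nodes[0]
    depth, load, cap, parent = {root_id: 0}, {root_id: 0.0}, {root_id: root_cap}, {}
    for node_id, capacity, num_values in nodes[1:]:
        cost = C + a * num_values                        # child's update message
        feasible = [p for p in depth if load[p] + cost <= cap[p]]
        if not feasible:
            continue                                     # tree saturated for this node
        key = (lambda p: depth[p]) if mode == "star" else (lambda p: -depth[p])
        p = min(feasible, key=key)
        parent[node_id] = p
        load[p] += cost
        depth[node_id], load[node_id], cap[node_id] = depth[p] + 1, 0.0, capacity
    return parent

print(build_tree([("r", 10, 1), ("n1", 6, 2), ("n2", 6, 2), ("n3", 6, 2)], mode="chain"))
```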
STAR and CHAIN reveal two conflicting factors in
collection tree construction—resource efficiency and scalability. Minimizing tree height achieves resource efficiency,
i.e., minimum relay cost, but causes poor scalability, i.e.,
small tree size. On the other hand, maximizing tree height
achieves good scalability, but degrades resource efficiency.
The adaptive tree construction algorithm seeks a middle ground between the STAR and CHAIN procedures in this
regard. It tries to minimize the total resource consumption,
and can trade off overhead cost for relay cost, and vice
versa, if it is possible to accommodate more nodes by
doing so.
Before we describe the adaptive tree construction
algorithm, we first introduce the concept of saturated trees
and congested nodes as follows:
Definition 4. Given a set of nodes $N$ for tree construction and the corresponding tree $T$ which contains a set of nodes $N' \subseteq N$, we say $T$ is saturated if no more nodes $d \in (N - N')$ can be added to $T$ without causing the resource constraint to be violated for at least one node in $T$. We refer to nodes whose resource constraint would be violated if $d \in (N - N')$ were added to $T$ as congested nodes.
The adaptive tree construction algorithm iteratively
invokes two procedures, the construction procedure and the
adjusting procedure. The construction procedure runs the
STAR scheme which attaches new nodes to low-level existing
tree nodes. As we mentioned earlier, STAR causes the
capacity consumption at low-level tree nodes to be much
heavier than that at other nodes. Thus, as low-level tree
nodes become congested and the tree becomes saturated, the
construction procedure terminates and returns all congested
nodes. The algorithm then invokes the adjusting procedure,
which tries to relieve the workload of low level tree nodes by
reducing the degree of these nodes and increasing the height
of the tree (similar to CHAIN). As a result, the adjusting
procedure reduces congested nodes and makes a saturated
tree unsaturated. The algorithm then repeats the constructing-adjusting iteration until no more nodes can be added to
the tree or all nodes have been added.
3.3 Discussion
REMO targets datacenter-like environments where the underlying infrastructure allows any two nodes in the network to communicate with similar cost, and focuses on
the resource consumption on computing nodes rather than
that of the underlying network. We believe this setting fits many distributed computing environments, even when
computing nodes are not directly connected. For instance,
communication packets between hosts located in the same
rack usually pass through only one top-of-rack switch,
while communication packets between hosts located in
different racks may travel through a longer communication
path consisting of multiple switches or routers. The
corresponding overhead on communication endpoints,
however, is similar in these two cases as packet forwarding
overhead is outsourced to network devices. As long as
networks are not saturated, REMO can be directly applied
for monitoring topology planning.
When the resource consumption on network devices needs to be considered, e.g., when networks are bottleneck resources, REMO cannot be directly applied. Similarly, for
environments where internode communication requires
nodes to actively forward messages, e.g., peer-to-peer
overlay networks and wireless sensor networks, the
assumption of similar cost on communication endpoints
does not hold as longer communication paths also incur
higher forwarding cost. However, REMO can be extended
to handle such changes. For example, its local search
process can incorporate the forwarding cost in the resource
evaluation of a candidate plan. We consider such extension
for supporting such networks as our future work.
4 RUNTIME TOPOLOGY ADAPTATION
The basic REMO approach works well for a static set of
monitoring tasks. However, in many distributed computing environments, monitoring tasks are often added,
modified or removed on the fly for better information
collection or debugging. Such changes necessitate the
adaptation of monitoring topology. In this section, we
study the problem of runtime topology adaptation for
changes in monitoring tasks.
4.1 Efficient Adaptation Planning
One may search for an optimal topology by invoking the
REMO planning algorithm every time a monitoring task is
added, modified or removed, and update the monitoring
topology accordingly. We refer to such an approach as
REBUILD. REBUILD, however, may incur significant
resource consumption due to topology planning computation as well as topology reconstruction cost (e.g., messages
used for notifying nodes to disconnect or connect),
especially in datacenter-like environments with a massive
number of mutable monitoring tasks undergoing relatively
frequent modification.
An alternative approach is to introduce minimum
changes to the topology to fulfill the changes of monitoring
tasks. We refer to such an approach as DIRECT-APPLY or
D-A for brevity. D-A also has its limitation as it may result
in topologies with poor performance over time. For
instance, when we continuously add attributes to a task
for collection, D-A simply instructs the corresponding tree
to deliver these newly added attribute values until some
nodes become saturated due to increased relay cost.
To address such issues, we propose an efficient adaptive
approach that strikes a balance between adaptation cost and
topology performance. The basic idea is to look for new
topologies with good performance and small adaptation
cost (including both searching and adaptation-related
communication cost) based on the modification to monitoring tasks. Our approach limits the search space to
topologies that are close variants of the current topology
in order to achieve efficient adaptation. In addition, it ranks
candidate adaptation operations based on their estimated
cost-benefit ratios so that it always performs the most
worthwhile adaptation operation first. We refer to this
scheme as ADAPTIVE for brevity.
When monitoring tasks are added, removed or modified,
we first apply D-A by building the corresponding trees with
the tree building algorithm introduced in Section 3 (no
changes in the attribute partition). We consider the resulting
monitoring topology after invoking D-A as the base topology,
which is then optimized by our ADAPTIVE scheme. Note
that the base topology is only a virtual topology plan stored in
memory. The actual monitoring topology is updated only
when the ADAPTIVE scheme produces a final topology plan.
As with the algorithm in Section 3, the ADAPTIVE
scheme performs two operations, merging and splitting,
over the base topology in an iterative manner. For each
iteration, the ADAPTIVE scheme first lists all candidate
adaptation operations for merging and splitting, respectively, and ranks the candidate operations based on
estimated cost effectiveness. It then evaluates merging
operations in the order of decreasing cost effectiveness until
it finds a valid merging operation. It also evaluates splitting
operations in the same way until it finds a valid splitting
operation. From these two operations, it chooses the one with
the largest improvement to apply to the base topology.
Let T be the set of reconstructed trees. To ensure
efficiency, the ADAPTIVE scheme considers only merging
operations involving at least one tree in T as candidate
merging operations. This is because merging two trees that
are not in T is unlikely to improve the topology. Otherwise,
previous topology optimization process would have
adopted such a merging operation. The evaluation of a
merging operation involves computationally expensive tree
building. As a result, evaluating only merging operations
involving trees in T greatly reduces the search space and
ensures the efficiency and responsiveness (changes of
monitoring tasks should be quickly applied) of the
ADAPTIVE scheme. For a monitoring topology with $n$ trees, the number of possible merging operations is $C_n^2 = \frac{n(n-1)}{2}$, while the number of merging operations involving trees in $T$ is $|T| \cdot (n - 1)$, which is usually significantly smaller than $\frac{n(n-1)}{2}$ as $|T| \ll n$ for most monitoring task updates. Similarly, the ADAPTIVE scheme
considers a splitting operation as a candidate operation only
if the tree to be split is in T .
The ADAPTIVE scheme also minimizes the number of
candidate merging/splitting operations it evaluates for
responsiveness. It ranks all candidate operations and
always evaluates the one with the greatest potential gain
first. To rank candidate operations, the ADAPTIVE scheme
needs to estimate the cost effectiveness of each operation.
We estimate the cost effectiveness of an operation based on
its estimated benefit and estimated adaptation cost. The
estimated benefit is the same as $g(m)$ we introduced in
Section 3. The estimated adaptation cost refers to the cost of
applying the merging operation to the existing monitoring
topology. This cost is usually proportional to the number of
edges modified in the topology. To estimate this cost, we
use the lower bound of the number of edges that would
have to be changed.
4.2 Cost-Benefit Throttling
The ADAPTIVE scheme must ensure that the benefit of
adaption justifies the corresponding cost. For example, a
monitoring topology undergoing frequent modification of
monitoring tasks may not be suitable for frequent topology
adaptation unless the corresponding gain is substantial. We
employ cost-benefit throttling to apply only topology
adaptations whose gain exceeds the corresponding cost.
Concretely, when the ADAPTIVE scheme finds a valid
merging or splitting operation, it estimates the adaptation
cost by measuring the volume of control messages needed
for adaptation, denoted by Madapt . The algorithm considers
the operation cost effective if Madapt is less than a threshold
defined as follows:
T hresholdðAm Þ ¼ ðTcur minfTadj;i ; i 2 Am gÞ ðCcur Cadj Þ;
where Am is the set of trees involved in the operation, Tadj;i
is the last time tree i being adjusted, Tcur is the current time,
Ccur is the volume of monitoring messages delivered in unit
time in the trees of the current topology, and Cadj is the
volume of monitoring messages delivered in unit time in
the trees after adaptation. ðTcur minfTadj;i ; i 2 Am gÞ essentially captures how frequently the corresponding trees are
adjusted, and ðCcur Cadj Þ measures the efficiency gain of
the adjustment. Note that the threshold will be large if
either the potential gain is large, i.e., ðCcur Cadj Þ is large, or
the corresponding trees are unlikely to be adjusted due to
monitoring task updates, i.e., ðTcur minfTadj;i ; i 2 Am gÞ is
large. Cost-benefit throttling also reduces the number of
iterations. Once the algorithm finds that a merging or
splitting is not cost-effective, it can terminate immediately.
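The throttling test itself is a one-line comparison; the sketch below is our own paraphrase of the threshold rule above:

```python
def is_cost_effective(m_adapt, t_cur, t_adj_list, c_cur, c_adj):
    """Accept an adaptation only if its control-message volume M_adapt is
    below (T_cur - min T_adj,i) * (C_cur - C_adj)."""
    threshold = (t_cur - min(t_adj_list)) * (c_cur - c_adj)
    return m_adapt < threshold

# Trees last adjusted at t=700 and t=900; the adaptation saves 2 messages per
# unit time, so at t=1000 it may spend up to 600 control messages.
print(is_cost_effective(m_adapt=450, t_cur=1000, t_adj_list=[700, 900],
                        c_cur=10, c_adj=8))  # True
```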
5 OPTIMIZATION
The basic REMO approach can be further optimized to
achieve better efficiency and performance. In this section,
we present two techniques, efficient tree adjustment and
ordered resource allocation, to improve the efficiency of
the REMO tree construction algorithm and its planning
performance, respectively.
5.1 Efficient Tree Adjustment
The tree building algorithm introduced in Section 3
iteratively invokes a construction procedure and an adjusting procedure to build a monitoring tree for a set of nodes.
One issue of this tree building algorithm is that it incurs high computation cost, especially in its adjusting procedure. To increase the available resource of a congested node $d_c$, the adjusting procedure tries to reduce its resource consumption on per-message overhead by reducing the number of its branches. Specifically, the procedure first removes the branch of $d_c$ with the least resource consumption. We use $b_{d_c}$ to denote this branch. It then tries to reattach nodes in $b_{d_c}$ to other nodes in the tree except $d_c$. It considers the reattaching successful if all nodes of $b_{d_c}$ are attached to the tree. As a result, the complexity of the adjusting procedure is $O(n^2)$, where $n$ is the number of nodes in the tree.
We next present two techniques that reduce the complexity of the adjusting procedure. The first one, branch-based reattaching, reduces the complexity to $O(n)$ by reattaching the entire branch $b_{d_c}$ instead of individual nodes in $b_{d_c}$. It trades off a small chance of failing to reattach $b_{d_c}$ for considerable efficiency improvement. The second technique, subtree-only searching, reduces the reattaching scope to the subtree of $d_c$, which considerably reduces searching time in practice (the complexity is still $O(n)$). The subtree-only searching also theoretically guarantees search completeness.
5.1.1 Branch-Based Reattaching
The above adjusting procedure breaks branches into nodes
and moves one node at a time. This per-node-reattaching
scheme is quite expensive. To reduce the time complexity, we
adopt a branch-based reattaching scheme. As its name
suggests, this scheme removes a branch from the congested
node and attaches it entirely to another node, instead of
breaking the branch into nodes and performing the reattaching. Performing reattaching on a branch basis effectively
reduces the complexity of the adjusting procedure.
One minor drawback of branch-based reattaching is that
it diminishes the chance of finding a parent to reattach the
branch when the branch consists of many nodes. However,
the impact of this drawback is quite limited in practice. As
the adjusting procedure removes and reattaches the
smallest branch first, failing to reattach the branch suggests
that nodes of the tree all have limited available resource. In
this case, node-based reattaching is also likely to fail.
5.1.2 Subtree-Only Searching
The tree adjusting procedure tries reattaching the pruned
branch to all nodes in the tree except the congested node
denoted as $d_c$. This adjustment scheme is not efficient, as it enumerates almost every node in the tree to test whether the pruned branch can be reattached to it. It turns out that testing nodes outside $d_c$'s subtree is often unnecessary. The following theorem suggests that testing nodes within $d_c$'s subtree is sufficient as long as the resource demand of the node that failed to be added is no higher than that of the pruned branch.
Theorem 1. Given a saturated tree $T$ output by the construction procedure, $d_f$ the node that failed to be added to $T$, a congested node $d_c$, and one of its branches $b_{d_c}$, attaching $b_{d_c}$ to any node outside the subtree of $d_c$ causes overload, given that the resource demand of $d_f$ is no larger than that of $b_{d_c}$, i.e., $u_{d_f} \leq u_{b_{d_c}}$.

Proof. If there exists a node outside the subtree of $d_c$, namely $d_o$, that can be attached with $b_{d_c}$ without causing overload, then adding $d_f$ to $d_o$ should have succeeded in the previous execution of the construction procedure, as $u_{d_f} \leq u_{b_{d_c}}$. However, $T$ is a saturated tree when adding $d_f$, which leads to a contradiction. $\square$
Hence, we improve the efficiency of the original tree building algorithm by testing all nodes for reattaching only when the resource demand of the node that failed to be added is higher than that of the pruned branch. For all other cases, the algorithm performs the reattaching test only within the subtree of $d_c$.
5.2 Ordered Resource Allocation
For a node associated with multiple trees, determining how
much resource it should assign to each of its associated trees is
necessary. Unfortunately, finding the optimal resource
allocation is difficult because it is not clear how much
resource a node will consume until the tree it connects to is
built. Exploring all possible allocations to find the optimal one
is clearly not an option as the computation cost is intractable.
To address this issue, REMO employs an efficient on-demand allocation scheme. Since REMO builds the monitoring topology on a tree-by-tree sequential basis, the on-demand allocation scheme defers the allocation decision until necessary and only allocates capacity to the tree that is about to be constructed. Given a node, the on-demand allocation scheme assigns all currently available capacity to the tree that is currently under construction. Specifically, given a node $i$ with resource $b_i$ and a list of trees each with resource demand $d_{ij}$, the available capacity assigned to tree $j$ is $b_i - \sum_{k=1}^{j-1} d_{ik}$. Our
experiment results suggest that our on-demand allocation
scheme outperforms several other heuristics.
The on-demand allocation scheme has one drawback
that may limit its performance when building trees with
very different sizes. As on-demand allocation encourages
the trees constructed first to consume as much resource as
necessary, the construction of these trees may not invoke
the adjusting procedure which saves resource consumption
on parent nodes by reducing their branches. Consequently,
resources left for constructing the rest of the trees are limited.
We employ a slightly modified on-demand allocation
scheme that relieves this issue with little additional
planning cost. Instead of leaving the construction order unspecified, the new scheme constructs trees in the order
of increasing tree size. The idea behind this modification is
that small trees are more cost efficient in the sense that they
are less likely to consume much resource for relaying cost.
By constructing trees from small ones to large ones, the
construction algorithm pays more relaying cost for better
scalability only after small trees are constructed. Our
experiment results suggest the ordered scheme outperforms
the on-demand scheme in various settings.
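The sketch below (our own illustration) captures the ordered on-demand allocation: trees are processed from smallest to largest, and each tree is offered whatever capacity remains on the nodes it touches:

```python
def ordered_on_demand_allocation(node_capacity, trees):
    """node_capacity: {node: capacity}; trees: list of (tree_id, {node: demand}).
    Smaller trees are built first; each is offered all capacity still available
    on its nodes, and its demand is then deducted."""
    remaining = dict(node_capacity)
    offered = {}
    for tree_id, demands in sorted(trees, key=lambda t: len(t[1])):
        offered[tree_id] = {n: remaining.get(n, 0.0) for n in demands}
        for n, d in demands.items():
            remaining[n] = max(0.0, remaining.get(n, 0.0) - d)
    return offered

caps = {"n1": 10.0, "n2": 10.0, "n3": 10.0}
trees = [("t_big", {"n1": 6.0, "n2": 6.0, "n3": 6.0}), ("t_small", {"n1": 3.0})]
print(ordered_on_demand_allocation(caps, trees))
```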
6 EXTENSIONS
Our description of REMO so far is based on a simple
monitoring scenario where tasks collect distributed values
without aggregation or replication under a uniform value
updating frequency. Real-world monitoring, however, often
poses diverse requirements. In this section, we present three
important techniques to support such requirements in
REMO. The in-network-aggregation-aware planning technique allows REMO to accurately estimate per-node resource
consumption when monitoring data can be aggregated
before being passed to parent nodes. The reliability
enhancement technique provides additional protection to
monitoring data delivery by producing topologies that
replicate monitoring data and pass them through distinct
paths. The heterogeneous-update-frequency supporting
technique enables REMO to plan topologies for monitoring
tasks with different data updating frequencies by correctly
estimating per-node resource consumption of such mixed
workloads. These three techniques introduce little to no extra
planning cost. Moreover, they can be incorporated into
REMO as plugins when certain functionality is required by
the monitoring environment without modifying the REMO
framework.
6.1 Supporting In-Network Aggregation
In-network aggregation is important to achieve efficient
distributed monitoring. Compared with holistic collection,
i.e., collecting individual values from nodes, in-network
aggregation allows individual monitoring values to be
combined into aggregate values during delivery. For
example, if a monitoring task requests the SUM of certain
metric $m$ on a set of nodes $N$, with in-network aggregation,
a node can locally aggregate values it receives from other
nodes into a single partial sum and pass it to its parent,
instead of passing each individual value it receives.
REMO can be extended to build monitoring topology for
monitoring tasks with in-network aggregation. We first introduce a funnel function to capture the changes of resource consumption caused by in-network aggregation. Specifically, a funnel function on node $i$, $\mathrm{fnl}_i^m(g_m, n_m)$, returns the number of outgoing values on node $i$ for metric $m$ given the in-network aggregation type $g_m$ and the number of incoming values $n_m$. The corresponding resource consumption of node $i$ in tree $k$ for sending an update message to its parent is

$$u_{ik} = C + \sum_{m \in A_i \cap A_k^P} a \cdot \mathrm{fnl}_i^m(g_m, n_m), \qquad (1)$$

where $A_i \cap A_k^P$ is the set of metrics node $i$ needs to collect and report in tree $k$, $a$ is the per-value overhead, and $C$ is the per-message overhead. For SUM aggregation, the corresponding funnel function is $\mathrm{fnl}_i^m(\mathrm{SUM}_m, n_m) = 1$ because the result of SUM aggregation is always a single value. Similarly, for TOP10 aggregation, the funnel function is $\mathrm{fnl}_i^m(\mathrm{TOP10}_m, n_m) = \min\{10, n_m\}$. For the holistic aggregation we discussed earlier, $\mathrm{fnl}_i^m(\mathrm{HOLISTIC}_m, n_m) = n_m$. Hence, (1) can be used to calculate per-node resource consumption for both holistic aggregation and in-network aggregation in the aforementioned monitoring tree building algorithm. Note that it also supports the situation where one tree performs both in-network and holistic aggregation for different metrics.
Some aggregation functions such as DISTINCT, however,
are data dependent in terms of the result size. For example,
applying DISTINCT on a set $X$ of 10 values results in a set with size ranging from 1 to 10, depending on how many repeated values $X$ contains. For these aggregations, we
simply employ the funnel function of holistic aggregation for
an upper bound estimation in the current implementation of
REMO. Accurate estimation may require sampling-based
techniques which we leave as our future work.
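The funnel functions and (1) map directly to code. The sketch below (our own, with data-dependent aggregates such as DISTINCT falling back to the holistic upper bound as described above) computes a node's per-tree sending cost:

```python
def funnel(agg_type, num_incoming):
    """Number of outgoing values for one metric, given its aggregation type."""
    if agg_type == "SUM":
        return 1
    if agg_type == "TOP10":
        return min(10, num_incoming)
    # HOLISTIC and data-dependent aggregates (e.g., DISTINCT) use the
    # holistic upper bound in this sketch.
    return num_incoming

def sending_cost(metrics, C=4.0, a=0.05):
    """metrics: list of (agg_type, num_incoming) pairs this node reports in one
    tree; returns the cost of its update message, as in (1)."""
    return C + sum(a * funnel(g, n) for g, n in metrics)

print(sending_cost([("SUM", 50), ("TOP10", 50), ("HOLISTIC", 50)]))
```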
6.2 Reliability Enhancements
Enhancing reliability is important for certain mission critical
monitoring tasks. REMO supports two modes of reliability
enhancement, same source different paths (SSDP) and different
sources different paths (DSDP), to minimize the impact of link
and node failure. The most distinct feature of the reliability
enhancement in REMO is that the enhancement is done by
rewriting monitoring tasks and requires little modification
to the original approach.
The SSDP mode deals with link failures by duplicating the
transmission of monitored values in different monitoring
trees. Specifically, for a monitoring task $t = (a, N_t)$ requiring SSDP support, REMO creates a monitoring task $t' = (a', N_t)$ where $a'$ is an alias of $a$. In addition, REMO restricts that $a$ and $a'$ never occur in the same set of a partition $P$ during partition augmentation, which makes sure messages updating $a$ and $a'$ are transmitted within different monitoring trees, i.e., different paths. Note that the degree of reliability can be
adjusted through different numbers of duplications.
When a certain metric value is observable at multiple
nodes, REMO also supports the DSDP mode. For example,
computing nodes sharing the same storage observe the same
storage performance metric values. Under this mode, users
submit monitoring tasks in the form of $t = (a, N_{identical})$ where $N_{identical} = \{N(v_1), N(v_2), \ldots, N(v_n)\}$. $N(v_i)$ denotes the set of nodes that observe the same value $v_i$ and $N_{identical}$ is a set of node groups each of which observes the same value. Let $k = \min\{|N(v_i)|, i \in [1, n]\}$. REMO rewrites $t$ into $k$ monitoring tasks so that each task collects values for metric $a$ with a distinct set of nodes drawn from $N(v_i), i \in [1, n]$. Similar to the SSDP mode, REMO then constructs the topology by ensuring that no two of the $k$ monitoring tasks are clustered into one tree. In this way, REMO ensures values of metrics can be
collected from distinct sets of nodes and delivered through
different paths.
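The DSDP rewriting can be sketched in the same spirit; again, the data structures and naming below are illustrative assumptions, not REMO's actual interfaces.

```python
# Hypothetical sketch of DSDP task rewriting.

def dsdp_rewrite(attr, groups):
    """groups: node sets N(v_1), ..., N(v_n), each observing the same value.
    Produce k = min_i |N(v_i)| tasks, each drawing one distinct node from
    every group, so the same values reach the root along different paths."""
    k = min(len(g) for g in groups)
    ordered = [sorted(g) for g in groups]           # fix an order within each group
    return [(f"{attr}#copy{j}", {nodes[j] for nodes in ordered}) for j in range(k)]

# Example: two nodes share storage A, three nodes share storage B.
print(dsdp_rewrite("disk_io", [{"n1", "n2"}, {"n3", "n4", "n5"}]))
# -> 2 rewritten tasks; as with SSDP, the planner keeps them in separate trees.
```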
6.3 Supporting Heterogeneous Update Frequencies
Monitoring tasks may collect values from nodes with
different update frequencies. REMO supports heterogeneous update frequencies by grouping nodes based on their
metric collecting frequencies and constructing per-group
monitoring topologies. When a node has a single metric a
with the highest update frequency, REMO considers the
node as having only one metric to update as other metrics
piggyback on a. When a node has a set of metrics updated at the same highest frequency, denoted by A_h, it evenly assigns the other metrics to piggyback on metrics in A_h. Similarly, REMO considers the node as having a set of metrics A_h to monitor, as other metrics piggyback on A_h. We estimate the cost of updating with piggybacked metrics for node i by $u_i = C + a \sum_j freq_j / freq_{max}$, where $freq_j$ is the frequency of one metric collected on node i and $freq_{max}$ is the highest update frequency on node i.
Sometimes metric piggybacking cannot achieve the
precise update frequency defined by users. For example, if
the highest update frequency on a node is 1/5 (msg/sec), a
metric updated at 1/22 can at best be monitored at either 1/20 or 1/25. If users are not satisfied with such an approximation, our current implementation separates these metrics out
and builds individual monitoring trees for each of them.
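As a rough illustration of the cost estimate and the piggyback rounding discussed above, the following sketch (hypothetical helper names) computes u_i and the achievable update period for a metric riding on the fastest metric's messages.

```python
# Illustrative sketch of the piggyback cost estimate and achievable period.

def piggyback_cost(freqs, C, a):
    """u_i = C + a * sum_j freq_j / freq_max for one update message."""
    freq_max = max(freqs)
    return C + a * sum(f / freq_max for f in freqs)

def achievable_period(desired_period, base_period):
    """A piggybacked metric can only ride on every m-th message of the
    fastest metric, so its period is rounded to a multiple of base_period."""
    m = max(1, round(desired_period / base_period))
    return m * base_period

print(piggyback_cost([1/5, 1/22], C=2.0, a=0.5))  # cost of one update message
print(achievable_period(22, 5))                    # 20: a 1/22 metric is served at ~1/20
```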
7 EXPERIMENTAL EVALUATION
We undertake an experimental study of our system and
present results including those gathered by deploying
REMO on a BlueGene/P rack (using 256 nodes booted into
Linux) running IBM’s large-scale distributed streaming
system—System S.
Synthetic data set experiments. For our experiments on
synthetic data, we assign a random subset of attributes to
each node in the system. For monitoring tasks, we generate them by randomly selecting |A_t| attributes and |N_t| nodes with uniform distribution, for a given size of attribute set A and node set N. We also classify monitoring tasks into two categories: 1) small-scale monitoring tasks that are for a small set of attributes from a small set of nodes, and 2) large-scale monitoring tasks that involve either many nodes or many attributes.
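The synthetic workload described above can be sketched as follows; the parameter names and the concrete sizes are assumptions chosen only to illustrate the uniform sampling of attributes and nodes per task.

```python
# Illustrative sketch of the synthetic task generator described above.
import random

def generate_tasks(num_tasks, all_attrs, all_nodes, attrs_per_task, nodes_per_task):
    """Each task draws |A_t| attributes and |N_t| nodes uniformly at random."""
    return [(random.sample(all_attrs, attrs_per_task),
             random.sample(all_nodes, nodes_per_task))
            for _ in range(num_tasks)]

attrs = [f"attr{i}" for i in range(100)]
nodes = [f"node{i}" for i in range(200)]
small_tasks = generate_tasks(50, attrs, nodes, attrs_per_task=5, nodes_per_task=10)
large_tasks = generate_tasks(10, attrs, nodes, attrs_per_task=60, nodes_per_task=150)
```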
To evaluate the effectiveness of different topology
construction schemes, we measure the percentage of
attribute values collected at the root node with the
monitoring topology produced by a scheme. Note that this
value should be 100 percent when the monitoring workload
is trivial or each monitoring node is provisioned with
abundant monitoring resources. For comparison purposes,
we apply relatively heavy monitoring workloads to keep
this value below 100 percent for all schemes. This allows us
to easily compare the performance of different schemes by
looking at their percentage of collected values. Schemes with a higher percentage of collected values not only achieve better monitoring coverage when monitoring resources are limited, but also use monitoring resources more efficiently.
Real system experiments. Through experiments in a real
system deployment, we also show that the error in attribute
value observations (due to either stale or dropped attribute
values) introduced by REMO is small. Note that this error
can be measured in a meaningful way only for a real system
and is what any “user” of the monitoring system would
perceive when using REMO.
System S is a large-scale distributed stream processing
middleware. Applications are expressed as dataflow graphs
that specify analytic operators interconnected by data
streams. These applications are deployed in the System S
runtime as processes executing on a distributed set of hosts,
and interconnected by stream connections using transports
such as TCP/IP. Each node that runs application processes
can observe attributes at various levels such as at the
analytic operator level, System S middleware level, and the
OS level. For these experiments, we deployed one such System S application, called YieldMonitor [18], which monitors
a chip manufacturing test process and uses statistical
stream processing to predict the yield of individual chips
across different electrical tests. This application consisted of
over 200 processes deployed across 200 nodes, with 30-50
attributes to be monitored on each node, on a BlueGene/P
cluster. The BlueGene is very communication rich and all
compute nodes are interconnected by a 3D Torus mesh
network. Consequently, for all practical purposes, we have
a fully connected network where all pairs of nodes can
communicate with each other at almost equal cost.
7.1 Result Analysis
We present a small subset of our experimental results to
highlight the following observations amongst others. First,
REMO can collect a larger fraction of node-attribute pairs to
serve monitoring tasks presented to the system compared to
simple heuristics (which are essentially the state-of-the-art).
REMO adapts to the task characteristics and outperforms
each of these simple heuristics for all types of tasks and
system characteristics; e.g., for small-scale tasks, a collection mechanism with fewer trees is better, while for large-scale tasks, a collection mechanism with more trees is better. Second, in a real application scenario, REMO also significantly reduces the percentage error in the observed values of the node-attribute pairs required by monitoring tasks when compared to simple heuristics.
Fig. 5. Comparison of attribute set partition schemes under different workload characteristics.
Fig. 6. Comparison of attribute set partition schemes under different system characteristics.
Varying the scale of monitoring tasks. Fig. 5 compares
the performance of different attribute set partition schemes
under different workload characteristics. In Fig. 5a, where
we increase the number of attributes in monitoring tasks,
i.e., increasing |A_t|, our partition augmentation scheme (REMO) performs consistently better than the singleton-set (SINGLETON-SET) and one-set (ONE-SET) schemes. In addition, ONE-SET outperforms SINGLETON-SET when |A_t| is relatively small. As each node only sends out one message which includes all its own attributes and those received from its children, ONE-SET incurs the minimum per-message overhead. Thus, when each node monitors a relatively small number of attributes, it can efficiently deliver attributes without suffering from its scalability problem. However, when |A_t| increases, the capacity demand of the low-level nodes, i.e., nodes that are close to the root, increases significantly, which in turn limits the size of the tree and causes poor performance. In Fig. 5b, where we set |A_t| = 100 and increase |N_t| to create extremely heavy workloads, REMO gradually converges to SINGLETON-SET, as SINGLETON-SET achieves the best load balance under heavy workloads, which in turn results in the best performance.
Varying the number of monitoring tasks. We observe
similar results in Figs. 5c and 5d, where we increase the
total number of small-scale and large-scale monitoring
tasks, respectively.
Varying nodes in the system. Fig. 6 illustrates the
performance of different attribute set partition schemes
with changing system characteristics. In Figs. 6a and 6b,
where we increase the number of nodes in the system given
small- and large-scale monitoring tasks, respectively, we
can see that SINGLETON-SET is better for large-scale tasks while ONE-SET is better for small-scale tasks, and REMO performs much better than both in either case, collecting around 90 percent more node-attribute pairs.
Varying per-message processing overhead. To study the
impact of per-message overhead, we vary the C/a ratio under both small- and large-scale monitoring tasks in Figs. 6c and 6d. As expected, increased per-message overhead hits the SINGLETON-SET scheme hard since it constructs a large number of trees and, consequently, incurs the largest overhead cost, while the performance of the ONE-SET scheme, which constructs just a single tree, degrades more gracefully. However, having a single tree is not the best solution either, as shown by REMO, which outperforms both schemes as C/a is increased because it can reduce the number of trees formed when C/a is increased.
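A small worked example, with made-up numbers, shows why a growing C/a ratio penalizes schemes that split attributes across many trees: the per-message term C is paid once per tree a node reports into.

```python
# Worked example (illustrative numbers): a node forwarding 20 attribute
# values either sends one large message (ONE-SET style) or 20 single-value
# messages (SINGLETON-SET style).

def upload_cost(num_messages, values_per_message, C, a):
    return num_messages * (C + a * values_per_message)

a = 1.0
for C in (1, 5, 20):                        # increasing C/a ratio
    one_big = upload_cost(1, 20, C, a)       # one tree, one message
    many_small = upload_cost(20, 1, C, a)    # 20 trees, 20 messages
    print(f"C/a={C}: one message -> {one_big}, twenty messages -> {many_small}")
# As C/a grows, per-message overhead dominates, which is why REMO reduces
# the number of trees it forms.
```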
Comparison of tree-construction schemes. In Fig. 7, we
study the performance of different tree construction
algorithms under different workloads and system characteristics. Our comparison also includes a new algorithm,
namely MAX_AVB, a heuristic algorithm used in TMON [17], which always attaches a new node to the existing node
with the most available capacity. While we vary different
workloads and system characteristics in the four figures,
our adaptive tree construction algorithm (ADAPTIVE)
always performs the best in terms of percentage of collected
values. Among all the other tree construction schemes,
STAR performs well when workload is heavy, as suggested
by Figs. 7a and 7b. This is because STAR builds trees with
minimum height, and thus pays minimum cost for relaying, which can be considerable when workloads are heavy.
Fig. 7. Comparison of tree construction schemes under different workload and system characteristics.
Fig. 8. Comparison of average percentage error.
CHAIN performs the worst in almost all cases.
While CHAIN provides good load balance by distributing per-message overhead across CHAIN-like trees, nodes have to pay a high cost for relaying, which seriously degrades the performance of CHAIN when workloads are heavy (it performs the best when workloads are light, as indicated by the left portions of Figs. 7a and 7b). The MAX_AVB scheme outperforms both STAR and CHAIN given light workloads, as it avoids overstretching a tree in breadth or height by growing trees from nodes with the most available capacity. However, its performance quickly degrades with increasing workload as a result of relaying cost.
Real-world performance. To evaluate the performance of REMO in a real-world application, we measure the average percentage error of received attribute values for synthetically generated monitoring tasks. Specifically, we measure the average percentage error between the snapshot of values observed by our scheme and the snapshot of "actual" values (which can be obtained by combining local log files at the end of the experiment).
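A sketch of this error metric is given below; treating a dropped value as a 100 percent error is our own simplifying assumption for the illustration, and the exact definition used in the evaluation may differ.

```python
# Illustrative sketch of the average-percentage-error metric.

def avg_percentage_error(observed, actual):
    """observed, actual: dicts keyed by (node, attribute) -> value."""
    errors = []
    for key, true_val in actual.items():
        obs = observed.get(key)
        if obs is None:
            errors.append(100.0)          # dropped value: counted as 100% error here
        elif true_val != 0:
            errors.append(abs(obs - true_val) / abs(true_val) * 100.0)
    return sum(errors) / len(errors) if errors else 0.0

print(avg_percentage_error({("n1", "cpu"): 48.0},
                           {("n1", "cpu"): 50.0, ("n2", "cpu"): 40.0}))  # 52.0
```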
Fig. 8a compares the achieved percentage error between
different partition schemes given increasing number of
nodes. Recall that our system can deploy the application
over any number of nodes. The figure shows that our
partition augmentation scheme in REMO outperforms the
other partition schemes. The percentage error achieved by
REMO is around 30-50 percent less than that achieved by
SINGLETON-SET and ONE-SET. Interestingly, the percentage error achieved by REMO clearly decreases as the number of nodes in the system increases. However,
according to our previous results, the number of nodes
has little impact on the coverage of collected attributes. The
reason is that as the number of nodes increases, monitoring
tasks are more sparsely distributed among nodes. Thus,
each message is relatively small and each node can have
more children. As a result, the monitoring trees constructed
by our schemes are “bushier,” which in turn reduces the
percentage error caused by latency. Similarly, we can see
that REMO gains significant error reduction compared with
the other two schemes in Fig. 8b, where we compare the performance of different partition schemes under an increasing number of monitoring tasks.
Runtime adaptation. To emulate a dynamic monitoring
environment with a small portion of changing tasks, we
continuously update (modify) the set of running tasks with
increasing update frequency. Specifically, we randomly select 5 percent of the monitoring nodes and replace 50 percent
of their monitoring attributes. We also vary the frequency
of task updates to evaluate the effectiveness of our adaptive
techniques.
We compare the performance and cost of four different schemes: 1) the DIRECT-APPLY (D-A) scheme, which directly applies the changes in the monitoring tasks to the monitoring topology; 2) the REBUILD scheme, which always performs a full-scale search from the initial topology with the techniques we introduced in Section 3; 3) the NO-THROTTLE scheme, which searches for an optimized topology close to the current one with the techniques we introduced in Section 4; and 4) the ADAPTIVE scheme, the complete technique set described in Section 4, which improves NO-THROTTLE by applying cost-benefit throttling to avoid frequent topology adaptation when tasks are frequently updated (a sketch of this decision rule follows below).
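The decision rule behind cost-benefit throttling can be sketched roughly as follows; the actual criterion is defined in Section 4, and the quantities below (per-round cost rates, expected stable period) are only assumptions for illustration.

```python
# Rough sketch of a cost-benefit throttling decision (illustrative only).

def should_adapt(current_cost_rate, candidate_cost_rate,
                 adaptation_msgs, expected_stable_updates):
    """Adopt the candidate topology only if its per-round saving, accumulated
    over the expected stable period, outweighs the one-off adaptation cost."""
    saving = (current_cost_rate - candidate_cost_rate) * expected_stable_updates
    return saving > adaptation_msgs

# Frequent task updates shrink the expected stable period, throttling adaptation.
print(should_adapt(1000, 900, adaptation_msgs=500, expected_stable_updates=10))  # True
print(should_adapt(1000, 900, adaptation_msgs=500, expected_stable_updates=3))   # False
```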
Fig. 9a shows the CPU time of running different
planning schemes given increasing task updating frequency. The X-axis shows the number of task update
batches within a time window of 10 value updates. The Y-axis shows the CPU time (measured on an Intel Core 2 Duo 2.26 GHz CPU) consumed by each scheme. We can see that D-A takes less than 1 second to finish since it performs only a single round of tree construction. REBUILD consumes the most CPU time among all schemes as it always explores the entire search space. When cost-benefit throttling is not applied, the NO-THROTTLE scheme consumes less CPU time than REBUILD. However, its CPU time grows with task update frequency, which is not desirable for large-scale monitoring. With throttling, the ADAPTIVE scheme incurs even less CPU time (1-3 s) as it avoids unnecessary topology optimization
for frequent updates. Note that while the CPU time
consumed by ADAPTIVE is higher than that of D-A, it is
fairly acceptable.
Fig. 9b illustrates the percentage of adaptation cost over
the total cost for each scheme. Here, the adaptation cost is
measured by the total number of messages used to notify monitoring nodes to change the monitoring topology (e.g., one such message may inform a node to disconnect from its current parent node and connect to another parent node). Similarly, the total cost of a scheme is the total number of messages the scheme used for both adaptation and delivering monitoring data.
Fig. 9. Performance comparison of different adaptation schemes given increasing task updating frequencies.
Fig. 10. Speedup of optimization schemes.
Fig. 11. Comparison between resource allocation schemes.
REBUILD introduces
the highest adaptation cost because it always pursues the
optimal topology which can be quite different from the
current one. Similar to what we observed in Fig. 9a, NO-THROTTLE achieves much lower adaptation cost than REBUILD. ADAPTIVE further reduces the adaptation cost, which becomes very close to that of D-A, because cost-benefit throttling allows it to avoid unnecessary topology optimization when the task updating frequency grows.
Fig. 9c shows the scheme-wise difference of the total cost
(including both adaptation and data delivery messages).
The Y-axis shows the ratio (percentage) of the total cost associated with one scheme over that associated with D-A. REBUILD initially outperforms D-A as it produces an optimized topology, which in turn saves monitoring communication cost. Nevertheless, as the task updating frequency increases, the communication cost of adaptation messages generated by REBUILD increases quickly, and eventually the extra cost in adaptation surpasses the monitoring communication cost it saves. NO-THROTTLE shows similar growth of total cost with increasing task updating frequency. ADAPTIVE, however, consistently outperforms D-A due to its ability to avoid unnecessary optimization.
Fig. 9d shows the performance of schemes in terms of
collected monitoring attribute values. The Y-axis shows the percentage of collected values of one scheme over that of D-A. Note that the result we show in Fig. 9c is the generated traffic volume. As each node cannot process traffic beyond its capacity, the more traffic is generated, the more likely values are missed. With increasing task updating frequency, the performance of REBUILD degrades faster than that of D-A due to its quickly growing cost in topology optimization (see Figs. 9b and 9c). On the contrary,
both NO-THROTTLE and ADAPTIVE gain an increasing
performance advantage over D-A. This is because the
monitoring topology can still be optimized with relatively
low adaptation cost with NO-THROTTLE and ADAPTIVE,
but continuously degrades with D-A, especially with high
task updating frequency.
Overall, ADAPTIVE produces monitoring topologies
with the best value collection performance (Fig. 9d), which
is the ultimate goal of monitoring topology planning. It
achieves this by minimizing the overall cost of the topology
(Fig. 9c) by only adopting adaptations whose gain outweighs their cost. Its search time and adaptation cost, although slightly higher than those of schemes such as D-A, are fairly small for all practical purposes.
Optimization. Figs. 10a and 10b show the speedup of
our optimization techniques for the monitoring tree adjustment procedure, where the Y-axis shows the speedup of one technique over the basic adjustment procedure, i.e., the ratio between the CPU time of the basic adjustment procedure and that of an optimized procedure. Because the basic adjustment procedure reattaches a branch by first breaking up the branch into individual nodes and performing a per-node-based reattaching, it takes considerably more CPU time compared with our branch-based reattach and subtree-only reattach techniques. With both techniques combined, we observe a speedup of up to 11 times, which is especially important for large distributed systems. We also find that these two optimization techniques introduce little performance penalty in terms of the percentage of values collected from the resulting monitoring topology (<2 percent).
Figs. 11a and 11b compare the performance of different tree-wise capacity allocation schemes, where UNIFORM divides the capacity of one node equally among all trees it participates in, PROPORTIONAL divides the capacity proportionally according to the size of each tree, and ON-DEMAND and ORDERED are our allocation techniques introduced in Section 5.2. We can see that both ON-DEMAND and ORDERED consistently outperform UNIFORM and PROPORTIONAL. Furthermore, ORDERED gains an increasing advantage over ON-DEMAND with a growing number of nodes and tasks. This is because large numbers of nodes and tasks cause one node to participate in trees with very different sizes, where ordered allocation is useful for avoiding improper node placement, e.g., putting one node as the root of one tree (consuming much of its capacity) while it still needs to participate in other trees.
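For reference, the two baseline allocation schemes compared above reduce to the simple splits sketched below (illustrative code; ON-DEMAND and ORDERED from Section 5.2 are not reproduced here).

```python
# Sketch of the two baseline tree-wise capacity allocation schemes.

def uniform_allocation(capacity, tree_sizes):
    """UNIFORM: split a node's capacity equally among its trees."""
    return [capacity / len(tree_sizes)] * len(tree_sizes)

def proportional_allocation(capacity, tree_sizes):
    """PROPORTIONAL: split capacity in proportion to each tree's size."""
    total = sum(tree_sizes)
    return [capacity * s / total for s in tree_sizes]

print(uniform_allocation(100, [5, 50, 200]))       # [33.33.., 33.33.., 33.33..]
print(proportional_allocation(100, [5, 50, 200]))  # [~2.0, ~19.6, ~78.4]
```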
Extension. Fig. 12a compares the efficiency of the basic REMO with that of the extended REMO when given tasks that involve In-network aggregation and heterogeneous update frequencies. Specifically, we apply MAX In-network aggregation to tasks so that one node only needs to send the
largest value to its parent node. In addition, we randomly choose half of the tasks and reduce their value update frequency by half. The Y-axis shows values collected by REMO enhanced with one extension technique, normalized by values collected by the basic REMO.
Fig. 12. Performance of extension techniques.
Note that collected
values for MAX In-network aggregation refer to values
included in the MAX aggregation, and are not necessarily
collected by the root node.
The basic REMO approach is oblivious to In-network aggregation. Hence, it tends to overestimate the communication cost of the monitoring topology, and prefers SINGLETON-SET-like topologies where each tree delivers one or a few attributes. As we mentioned earlier, such topologies introduce high per-message overhead. On the contrary, REMO with aggregation awareness employs funnel functions to correctly estimate communication cost and produces a more efficient monitoring topology. We observe similar results between the basic REMO and REMO with update-frequency awareness. When both extension techniques are combined, they can provide an improvement close to 50 percent in terms of collected values.
Fig. 12b compares the efficiency of REMO with replication support against two alternative techniques. The SINGLETON-SET-2 scheme uses two SINGLETON-SET trees to deliver the values of one attribute separately. The ONE-SET-2 scheme creates two ONE-SET trees, each of which connects all nodes and delivers the values of all attributes separately. REMO-2 is our same source different paths (SSDP) technique with a replication factor of 2, i.e., values of each attribute are delivered through two different trees. Compared with the two alternative schemes, REMO-2 achieves both replication and efficiency by combining multiple attributes into one tree to reduce per-message overhead. As a result, it consistently outperforms both alternatives given an increasing number of monitoring tasks.
8 RELATED WORK
Much of the early work addressing the design of distributed
query systems mainly focuses on executing single queries
efficiently. As the focus of our work is to support multiple queries, we omit discussing these works. Research on processing multiple queries over a centralized data stream [19], [13], [20], [21] is not directly related to our work either, as the context of our work is distributed streaming, where the number of messages exchanged between the nodes is of concern.
A large body of work studies query optimization and
processing for distributed databases (see [22] for a survey).
Although our problem bears a superficial resemblance to
these distributed query optimization problems, our problem is fundamentally different because, in our setting, individual nodes are capacity constrained. There is also much work on multiquery optimization techniques for
continuous aggregation queries over physically distributed
data streams [21], [23], [24], [25], [19], [13], [20], [26]. These
schemes assume that the routing trees are provided as part
of the input. In our setting where we are able to choose from
many possible routing trees, solving the joint problem of
optimizing resource-constrained routing tree construction
and multitask optimization provides significant benefit
over solving only one of these problems in isolation as
evidenced by our experimental results.
More recently, several works study efficient data collection mechanisms. CONCH [27] builds a spanning forest with minimal monitoring costs for continuously collecting readings from a sensor network by utilizing temporal and spatial suppression. However, it does not consider the resource limitation at each node and the per-message overhead as we do, which may limit its applicability in real-world applications. PIER [8] suggests using distinct routing trees for each query in the system in order to balance the network load, which is essentially the SINGLETON-SET scheme we discussed. This scheme, though it achieves the best load balance, may incur significant communication cost due to per-message overhead.
9 CONCLUSIONS
We developed REMO, a resource-aware multitask optimization framework for scaling application state monitoring in
large-scale distributed systems. The unique contribution of
the REMO approach is its techniques for generating the
network of monitoring trees that optimizes multiple monitoring tasks and balances the resource consumption at
different nodes. We also proposed adaptive techniques to
efficiently handle continuous task updates, optimization techniques that speed up the search process by up to a factor of 10, and techniques extending REMO to support advanced
monitoring requirements such as In-network aggregation.
We evaluated REMO through extensive experiments including deploying REMO in a real-world stream processing
application hosted on BlueGene/P. Our experimental results
show that REMO significantly and consistently outperforms
existing approaches.
ACKNOWLEDGMENTS
This work is partially supported by NSF grants from CISE
CyberTrust program, NetSE program, and CyberTrust
Cross Cutting program, and an IBM faculty award and an
Intel ISTC grant on Cloud Computing.
REFERENCES
[1] N. Jain, L. Amini, H. Andrade, R. King, Y. Park, P. Selo, and C. Venkatramani, "Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2006.
[2] B. Hayes, "Cloud Computing," Comm. ACM, vol. 51, no. 7, pp. 9-11, 2008.
[3] L. Amini, N. Jain, A. Sehgal, J. Silber, and O. Verscheure, "Adaptive Control of Extreme-Scale Stream Processing Systems," Proc. IEEE 26th Int'l Conf. Distributed Computing Systems (ICDCS), 2006.
[4] J. Borkowski, D. Kopanski, and M. Tudruj, "Parallel Irregular Computations Control Based on Global Predicate Monitoring," Proc. Int'l Symp. Parallel Computing in Electrical Eng. (PARELEC), 2006.
[5] K. Park and V.S. Pai, "Comon: A Mostly-Scalable Monitoring System for Planetlab," Operating Systems Rev., vol. 40, no. 1, pp. 65-74, 2006.
[6] S. Madden, M.J. Franklin, J.M. Hellerstein, and W. Hong, "Tag: A Tiny Aggregation Service for Ad-Hoc Sensor Networks," Proc. Fifth Symp. Operating Systems Design and Implementation (OSDI), 2002.
[7] P. Yalagandula and M. Dahlin, "A Scalable Distributed Information Management System," Proc. SIGCOMM, pp. 379-390, 2004.
[8] R. Huebsch, B.N. Chun, J.M. Hellerstein, B.T. Loo, P. Maniatis, T. Roscoe, S. Shenker, I. Stoica, and A.R. Yumerefendi, "The Architecture of Pier: An Internet-Scale Query Processor," Proc. Second Conf. Innovative Data Systems Research (CIDR), 2005.
[9] G. Cormode and M.N. Garofalakis, "Sketching Streams through the Net: Distributed Approximate Query Tracking," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB), pp. 13-24, 2005.
[10] D.J. Abadi, S. Madden, and W. Lindner, "Reed: Robust, Efficient Filtering, and Event Detection in Sensor Networks," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB), 2005.
[11] U. Srivastava, K. Munagala, and J. Widom, "Operator Placement for In-Network Stream Query Processing," Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), pp. 250-258, 2005.
[12] C. Olston, B.T. Loo, and J. Widom, "Adaptive Precision Setting for Cached Approximate Values," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '01), 2001.
[13] S. Krishnamurthy, C. Wu, and M.J. Franklin, "On-the-Fly Sharing for Streamed Aggregation," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '06), pp. 623-634, 2006.
[14] S. Ko and I. Gupta, "Efficient On-Demand Operations in Dynamic Distributed Infrastructures," Proc. Second Workshop Large-Scale Distributed Systems and Middleware (LADIS), 2008.
[15] S. Meng, S.R. Kashyap, C. Venkatramani, and L. Liu, "Remo: Resource-Aware Application State Monitoring for Large-Scale Distributed Systems," Proc. IEEE 29th Int'l Conf. Distributed Computing Systems (ICDCS), pp. 248-255, 2009.
[16] K. Marzullo and M.D. Wood, Tools for Constructing Distributed Reactive Systems. Cornell Univ., 1991.
[17] S.R. Kashyap, D. Turaga, and C. Venkatramani, "Efficient Trees for Continuous Monitoring," 2008.
[18] D.S. Turaga, M. Vlachos, O. Verscheure, S. Parthasarathy, W. Fan, A. Norfleet, and R. Redburn, "Yieldmonitor: Real-Time Monitoring and Predictive Analysis of Chip Manufacturing Data," 2008.
[19] R. Zhang, N. Koudas, B.C. Ooi, and D. Srivastava, "Multiple Aggregations over Data Streams," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2005.
[20] J. Li, D. Maier, K. Tufte, V. Papadimos, and P.A. Tucker, "No Pane, No Gain: Efficient Evaluation of Sliding-Window Aggregates over Data Streams," ACM SIGMOD Record, vol. 34, no. 1, pp. 39-44, 2005.
[21] S. Madden, M.A. Shah, J.M. Hellerstein, and V. Raman, "Continuously Adaptive Continuous Queries over Streams," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2002.
[22] D. Kossmann, "The State of the Art in Distributed Query Processing," ACM Computing Surveys, vol. 32, no. 4, pp. 422-469, 2000.
[23] R. Huebsch, M.N. Garofalakis, J.M. Hellerstein, and I. Stoica, "Sharing Aggregate Computation for Distributed Queries," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2007.
[24] A. Silberstein and J. Yang, "Many-to-Many Aggregation for Sensor Networks," Proc. 27th Int'l Conf. Data Eng. (ICDE), pp. 986-995, 2007.
[25] S. Xiang, H.-B. Lim, K.-L. Tan, and Y. Zhou, "Two-Tier Multiple Query Optimization for Sensor Networks," Proc. 27th Int'l Conf. Distributed Computing Systems (ICDCS), p. 39, 2007.
[26] J. Borkowski, "Hierarchical Detection of Strongly Consistent Global States," Proc. Third Int'l Symp. Parallel and Distributed Computing/Third Int'l Workshop Algorithms, Models, and Tools for Parallel Computing on Heterogeneous Networks (ISPDC/HeteroPar), pp. 256-261, 2004.
[27] A. Silberstein, R. Braynard, and J. Yang, "Constraint Chaining: On Energy-Efficient Continuous Monitoring in Sensor Networks," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2006.
Shicong Meng is currently working toward the PhD degree in the College of Computing at Georgia Tech. He is affiliated with the Center for
Experimental Research in Computer Systems
(CERCS) where he works with professor Ling
Liu. His research focuses on performance,
scalability, and security issues in large-scale
distributed systems such as cloud datacenters.
He is a student member of the IEEE.
Srinivas Raghav Kashyap received the PhD
degree in computer science from the University of Maryland, College Park, in 2007. Currently,
he is a software engineer at Google Inc. and was
a postdoctoral researcher at IBM T.J. Watson
Research Center before moving to Google.
Chitra Venkatramani received the PhD degree
in computer science from SUNY at Stony Brook
in 1997, and was also engaged in various
projects around multimedia data streaming and
content distribution at IBM’s Watson labs. She is
a research staff member and manager of the
Distributed Streaming Systems Group at the
IBM T.J. Watson Research Center. The focus of
her team is on research related to System S, a
software platform for large scale, distributed
stream processing. Her team explores issues in high performance,
large-scale distributed computing systems, including advanced computation and communication infrastructures, fault tolerance, scheduling,
and resource management for data-intensive applications. She holds
numerous patents and publications and is also a recipient of IBM’s
Outstanding Technical Achievement Award.
Ling Liu is a full professor in the School of Computer Science at the Georgia Institute of Technology. There she directs the research programs in the Distributed Data Intensive Systems Lab (DiSL), examining various aspects of data-intensive systems with a focus on performance, availability, security, privacy, and energy efficiency. She has served as general chair and PC chair of numerous IEEE and ACM conferences in the data engineering, distributed computing, service computing, and cloud computing fields and is a coeditor-in-chief of the five-volume Encyclopedia of Database Systems (Springer). Currently, she is on the editorial board of several international journals. She has published more than 300 international journal and conference articles in the areas of databases, distributed systems, and internet computing. She is a recipient of the Best Paper Award of ICDCS 2003, WWW 2004, the 2005 Pat Goldberg Memorial Best Paper Award, and the 2008 International Conference on Software Engineering and Data Engineering. Her current research is primarily sponsored by NSF, IBM, and Intel. She is a senior member of the IEEE.