Update-Linearizability: A Consistency Concept for the
Chronological Ordering of Events in MANETs
Jörg Hähner∗
Universität Stuttgart, IPVS
Universitätsstrasse 38
70569 Stuttgart, Germany
Kurt Rothermel
Universität Stuttgart, IPVS
Universitätsstrasse 38
70569 Stuttgart, Germany
Christian Becker
Universität Stuttgart, IPVS
Universitätsstrasse 38
70569 Stuttgart, Germany
haehner@informatik.uni-stuttgart.de
rothermel@informatik.uni-stuttgart.de
becker@informatik.uni-stuttgart.de
Abstract
MANETs are used in situations where networks need to be deployed immediately but no network infrastructure is available. If MANET nodes have sensing capabilities, they can capture and communicate the state of their surroundings, including environmental conditions or objects in their proximity. If the sensed state information is propagated to a database to build a consistent model of the real world, a variety of promising context-aware applications becomes possible. In this paper, we introduce a novel consistency concept that preserves the chronological ordering of sensed state transition events. Based on this concept, we propose a data replication algorithm for MANETs that guarantees the consistency concept without relying on synchronized clocks, and we show its correctness. Our simulation experiments show that replicated copies are updated regularly even if the network load in the system is high.

1. Introduction

Devices with integrated sensors and wireless communication technology may be able to detect their position and orientation, as well as measure changes in their environmental conditions, including hazardous emissions, temperature, smoke, or dust. Although context-aware applications can already profit from context collected by the sensors of a single device, the benefit is higher if nodes propagate locally captured context information to a distributed database, which maintains a consistent model of reality.

In this paper, we investigate how state information captured by a collection of distributed sensors can be maintained in a database that is replicated over a number of nodes in a MANET in order to provide a current model of the operational environment. Mobile database nodes store the most current state of each real-world object according to the chronological ordering in which it was perceived by sensors. Note that this ordering may differ from the ordering in which state changes are applied to the database, e.g., due to message reordering within the network. Whenever a sensor reports a significant state change of an object, the database copies are updated to reflect the new state.

The major contributions of this paper are as follows. We propose a novel consistency concept that preserves the chronological ordering of update operations on the replicated database caused by state transition events of real-world objects: roughly speaking, the database state of an object is never replaced by a state that was captured by any sensor before the current state. Moreover, based on this consistency concept, we introduce a novel replication algorithm suitable for MANETs that allows for multiple independent sources updating the state information of objects. If only a single source of updates exists for each object, the most recently sensed state can easily be determined by a simple versioning scheme. However, in scenarios with mobile objects and/or mobile sensors, an object may be perceived by multiple sensors consecutively or concurrently. This complicates the task of maintaining consistency, especially if synchronized clocks are not available. Simulation experiments with our replication algorithm show that the synchronization overhead for guaranteeing the chronological ordering of state transition events is low, i.e., it is dominated by the underlying data dissemination algorithm. Replicated copies are updated regularly even if messages are lost due to network congestion caused by high network load.

The remainder of this paper is structured as follows: In Section 2 we discuss related work. Section 3 presents our consistency concept. Our system model is presented in Section 4. In Sections 5 and 6 we introduce our replication algorithm and prove that it is correct with respect to the consistency model. Simulations showing the performance of our algorithm in a broad range of experiments are presented in Section 7. Section 8 concludes the paper.
∗ Funded by the Gottlieb Daimler and Carl Benz Foundation, Germany.
2. Related Work
The ordering of events in distributed systems has been the subject of research for many decades. The seminal work
of Lamport [11] introduces means for ordering events that
can be specified and observed within a given system. The
happened-before relation as defined by Lamport utilizes the
order between sending and receiving of a given message at
different processes in the system by merging the causal history of the sender and receiver processes at the receiver. Our
algorithm maintains the chronological ordering of update
operations caused by events external to the system rather
than their causal ordering within the system.
Strong consistency based on the concept of serializability [5] has been addressed extensively in the domain of distributed databases. Since the level of consistency trades off against availability [3], strong consistency may result in poor availability in the presence of frequent node and network failures.
Weaker consistency levels have been proposed to increase the availability of data. The authors of [4] propose epidemic algorithms to update copies of replicated data in fixed networks; their concept of consistency ensures that all copies converge to a common state. Consistency issues for data replication have also been addressed in the context of sensor networks and MANETs. The adaptive broadcast replication protocol (ABR) [15] explicitly covers the trade-off between consistency and transmission costs and ensures weak consistency, limited to a single update source per logical object. Deno [10] presents an epidemic replication algorithm based on weighted voting for weakly connected ad hoc networks, ensuring that each copy commits updates in the same order. In [9] a quorum-based system is presented to provide access to the most current information about objects; however, it differs from our approach in the assumption that there exists a single update source for each object. The authors of [12] present a collection of protocols for probabilistic quorum-based data storage in MANETs, where read operations return the result of the latest update operation independent of the order in which these updates have been executed.
In contrast to the approaches described above, our consistency model explicitly enforces the chronological ordering of update operations from multiple update sources.
Additionally, our algorithm guarantees consistency without requiring synchronized clocks.
3. Update Linearizability

The major components of our consistency model are perceivable objects, observers, database nodes (DB nodes), and clients. A perceivable object (or object for short) is a mobile or stationary real-world object that can be perceived by appropriate sensor technology. We assume objects have a unique identifier and may be associated with state information, such as location or temperature, which may change over time. The object identifiers may, for example, be obtained by using radio frequency identification (RFID) tags attached to objects, such as the electronic product code (EPC) [1]. An observer senses perceivable objects that are within its observation range. Observers transmit the sensed state information, e.g., the location of an object, by means of update requests to the database (DB). Each update request includes the identifier and state of the observed object; for a given object, an update request may be generated when it enters the observation range of the observer or when its state changes significantly. The DB is maintained by a number of (mobile) DB nodes. For each observed object the DB stores a so-called state record that specifies the state of that object. When the first update request for an object arrives, a new state record is created in the DB, and subsequent update requests are applied to this record. Clients are applications that only read state information from the DB. This restriction makes sense because the DB contains state information of physical objects, which can only be captured by observers. Therefore, in our model the DB is updated exclusively by observers and read exclusively by clients. The same object may be observed by multiple observers, either sequentially or concurrently. Consequently, there may be multiple observers transmitting update requests for the same object. This leads to the question of how these update requests are to be ordered.

3.1. Ordering of Update Requests

Since observers capture real-world events, the ordering of state changes according to their occurrence in real time is essential for many applications. For example, for an application tracking the movement of a physical object, the real-time ordering of the observed location changes is essential to determine the direction in which the object is moving. Due to the lack of global time in distributed systems, the real-time ordering of observation events occurring at different observers can be captured only with limited accuracy. We therefore define the occurred-before relationship for update requests, which is a relaxation of real-time ordering.

Definition 1 (occurred-before): Let u and u′ be two update requests. Then u occurred-before u′, written u < u′, iff t_obs(u′) − t_obs(u) > δ, where δ > 0 and t_obs(u) denotes the real time at which the observation leading to the generation of u occurred.

Parameter δ is a system parameter that defines how accurately the real-time ordering of observation events is captured by the underlying system. This parameter is important for applications because it defines the minimum temporal distance any pair of observation events needs in order for their correct real-time ordering to be determined. If neither u < u′ nor u′ < u, then u and u′ are said to be concurrent, denoted u ∥ u′. For concurrent update requests it cannot be guaranteed that the correct real-time ordering is captured.
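As a minimal illustration of Definition 1, the relation can be written down directly; this is a sketch in Python with names of our own choosing, and note that in the target system the observation times t_obs are precisely what nodes do not have, so the code only makes the relation itself concrete:

    # Sketch of Definition 1; t_obs values and function names are illustrative.
    def occurred_before(t_obs_u, t_obs_v, delta):
        """True iff u occurred-before v: their observation times differ
        by more than the ordering precision delta."""
        return t_obs_v - t_obs_u > delta

    def concurrent(t_obs_u, t_obs_v, delta):
        """Neither u < v nor v < u: the real-time order is not captured."""
        return abs(t_obs_v - t_obs_u) <= delta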
3.2. Consistency Definition
Update-linearizability is a weaker consistency model than linearizability [7], in which the ordering of both update and read operations must be consistent with the real times at which the operations occurred in the actual execution. With update-linearizability, read operations of a single client on a single object are only ordered according to their program order. While linearizability ensures that each read operation returns the most current value of an object, update-linearizability only guarantees that a client never reads a value that is older than any value it has read before for the same object.
Definition 2 (update-linearizability): An execution of the read and update operations issued by clients and observers is said to be update-linearizable if there exists some serialization S of this execution that satisfies the following conditions:

(C1) All read operations of a single client on a single object in S are ordered according to the program order of the client. For each object x and each pair of update requests u[x] and u′[x] on x in S: u′[x] is a (direct or indirect) successor of u[x] in S iff u[x] < u′[x] or u[x] ∥ u′[x].

(C2) For each object x in the database (DB), S meets the specification of a single copy of x.
Definition 2 captures the idea that for an execution of operations there exists a serialization against a single logical image of the DB objects, and that each client sees a view of the (physical) objects that is consistent with the logical image. It guarantees that update requests are only performed in the occurred-before order. Once a client c has executed a read r1 that returned the result of an update request u1 on a specific object x, condition (C1) guarantees that the next read operation r2 of c on x returns at least the same result as r1, or some result written by an update request u2 with u1 < u2 or u1 ∥ u2.

3.3. Examples of Executions

Figure 1 shows examples of valid and invalid executions according to Definition 2. For the examples we use the notation u[x]1 to indicate an update request for object x that writes the state 1, and r[y]2 for a read operation that reads object y and returns state 2. The time axis runs from left to right. The execution in Figure 1(a) is valid: client C1 reads the state of object x as 1 even though state 2 has already been written by observer O2. This is allowed because C1 has never read object x before, which allows C1's program for object x to start anywhere in the serialization. Client C2 reads state 2 at the same time as C1 reads state 1. This is valid because executions of different clients may be interleaved in the serialization. In contrast, Figure 1(b) shows an invalid execution, because C2 reads state 1 after it has already read state 2. The example in Figure 1(c) is a valid execution with two objects, because update-linearizability is an object-local property and both clients read each object only once. C1 reads state 2 of object x before reading state 1 of object y, and C2 vice versa.

Figure 1. Sample executions: executions (a) and (c) are valid, while (b) is invalid
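Condition (C1) induces a simple monotonicity property on the values a single client reads for a single object. The following sketch is our own illustration, with occurred_before standing in for an oracle deciding the relation of Definition 1:

    # Sketch: 'reads' lists the update requests whose results one client's
    # reads on one object returned, in program order.
    def reads_admissible(reads, occurred_before):
        for earlier, later in zip(reads, reads[1:]):
            # A later read may return the same, a newer, or a concurrent
            # update, but never one that occurred strictly before a
            # previously read one.
            if occurred_before(later, earlier):
                return False
        return True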
4. System Model

In this section, we define the system model of the MANETs in which we apply our replication algorithm presented in Section 5. DB nodes and observers may be mobile and form a MANET. Observers create update requests and propagate them to DB nodes. If an observer and a DB node are co-located on the same device, requests can be forwarded locally. Otherwise, wireless communication is used to transmit update requests to the DB nodes in the transmission range of the observer. We do not require the nodes' and observers' clocks to be synchronized. Our algorithm orders update requests sent by observers according to their arrival at DB nodes in the observers' transmission range, i.e., in single-hop distance. Therefore, parameter δ in the definition of the occurred-before relation is twice the communication jitter on the single-hop link between DB nodes and observers. Note that the jitter includes any delays that may be introduced by randomization on lower protocol layers. For the sake of simplicity, we assume here that the time from capturing a sensor signal to sending an update request is the same for all observers.

To illustrate the choice of δ, consider the example where a DB node receives two update requests from two distinct observers. Let the time difference between the two receive events be t_diff. On the one hand, whenever t_diff is greater than twice the maximum communication jitter, it is guaranteed that the order in which the update requests are received does not differ from the order in which they were sent by the respective observers. This means that the DB node may safely accept both update requests. On the other hand, if t_diff is less than twice the maximum jitter, the order in which the update requests are received may be arbitrary, and therefore the DB node must not accept the second update request it receives.

We further assume that each DB node in the MANET stores a copy of the entire DB, i.e., each DB entry is replicated on every DB node. Clients reside on the nodes in the MANET and read only their local DB copy. In many scenarios, total replication of data is highly desirable. With the consistency level introduced in Definition 2, a client may read an object's state record as long as a single copy is available. Consequently, with total replication an operational client may always read its local copy, i.e., from a client's point of view the data is always available, independent of node and communication failures. While read operations are always local, write operations might be expensive depending on the number of nodes. This is acceptable for scenarios where reads occur more frequently than updates or where the number of nodes is limited.

Our work addresses MANETs with frequent topology changes and an unknown number of hosts, and therefore our replication algorithm is based on message flooding to update the copies of an object. Message flooding is a well-understood, frequently used, and very robust technique for broadcasting in networks with unknown topology. Numerous broadcast algorithms in MANETs are based on flooding (e.g., [13, 14]). Besides plain flooding, variants like probabilistic or counter-based schemes help to avoid so-called broadcast storms. Hyper-flooding helps to increase the packet delivery ratio in case of frequent network partitions and high node mobility [8]. It has been shown that the communication overhead and the packet delivery ratio of a chosen flooding scheme depend heavily on the characteristics of the underlying MANET [6, 8]; i.e., different MANETs require different flooding schemes, which can be combined flexibly with our algorithm depending on the application scenario. Therefore, we assume the existence of a communication primitive f-forward, which sends a given message to the one-hop neighbors of a given sender with best-effort communication semantics and a known upper bound for the delay jitter on single-hop communication links, defining the ordering precision δ in Definition 1. Note that we do not use any multi-hop routing algorithm.
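The acceptance rule implied by this choice of δ can be sketched as follows (names are ours; e.g., with a maximum jitter of 5 ms, two receive events 12 ms apart can safely be ordered, while events 8 ms apart cannot):

    # Sketch of the delta rule: the receive order of two directly received
    # requests provably matches their send order only if the receive times
    # differ by more than twice the maximum single-hop jitter.
    def order_is_reliable(t_first, t_second, max_jitter):
        return (t_second - t_first) > 2 * max_jitter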
5. A Replication Algorithm that Guarantees Update-Linearizability

In this section we first present the data structures needed for our replication algorithm: the state record and the ordering graph. The state record includes state information about a particular object together with the origin of the information, i.e., which observer created the record. The ordering graph is necessary to maintain ordering information between state records created by independent observers. Next, we present the algorithm, which is divided into the observer-node and the node-node protocol. The former is used to transmit state changes from observers to single-hop neighbor DB nodes, whereas the latter propagates state changes together with ordering information among DB nodes to update the copies of the database.

5.1. State Record
Whenever an observer detects a significant state change, it transmits an update request to its single-hop neighbor DB nodes according to the observer-node protocol. Each update request includes a state record consisting of a tuple (Obj, State, Obs, SN), where Obj is the unique identifier of an object, State is the new state information for Obj, Obs is the unique identifier of the observer that created the update request, and SN is the sequence number of the update request. The sequence number is unique and strictly monotonically increasing for each observer. We use the notation u[x]^Obs_SN and db[x]^Obs_SN to denote an update request and a database entry for object x with sequence number SN transmitted by observer Obs. For ease of exposition, the state information of the object is omitted from this notation, because it is not needed to maintain consistency.
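For illustration, the state record tuple could be represented as follows (a sketch; field names and types are our own assumptions):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class StateRecord:
        obj: str      # unique object identifier, e.g., an EPC read from an RFID tag
        state: bytes  # sensed state information, e.g., an encoded location
        obs: str      # identifier of the observer that created the update request
        sn: int       # per-observer, strictly monotonically increasing sequence number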
5.2. Ordering Graph
In cases where two update requests were created by the same observer, their ordering can be determined by comparing sequence numbers. Whenever an ordering decision must be made for two update requests issued by independent observers, additional ordering information is necessary. Therefore, each DB node maintains an ordering graph for each object x in its DB, which specifies the locally known ordering of update requests.

Definition 3 (ordering graph): Let G_x = (V_x, E_x) denote the ordering graph for object x. The set of vertices V_x contains the locally known update requests. The set of directed edges E_x reflects the locally known ordering relationships between update requests, where an edge (u[x]^o_i, u[x]^p_k) exists in E_x iff u[x]^o_i < u[x]^p_k.

The remainder of this section defines the semantics of the ordering graph and the transformation functions add and join. The former adds a new vertex to the graph and is used by the observer-node protocol to include new ordering information, while the latter merges two graphs into a new graph and is used by the node-node protocol to combine ordering information available from different nodes. Finally, we define the predicate occurredBefore, which is used to decide whether the state record of an object may be overwritten by an update request. Given the local ordering graph of a DB node, it evaluates to true if the state record currently stored in the database copy of the DB node occurred before a given update request.
5.2.1. Adding New Ordering Information to the Graph

Our algorithm derives the global ordering of updates from the order in which DB nodes receive update requests directly, i.e., single-hop, from observers. When a node receives an update request u[x]^Obs_SN from some observer Obs, it concludes, with respect to the discussion of δ in Section 4, that v < u[x]^Obs_SN holds for each v in V_x. Now, assume that u[x]^k_i and u[x]^m_j are transmitted at times t and t′ > t + δ, respectively. Then it is guaranteed that u[x]^m_j does not arrive before u[x]^k_i at any DB node.

When a DB node directly receives u[x]^Obs_SN from some observer Obs, it modifies its local ordering graph G_x according to the following operation:

add(G_x, u[x]^Obs_SN) :
    V_x = V_x ∪ {u[x]^Obs_SN}
    E_x = E_x ∪ {(u, u[x]^Obs_SN) | u ∈ V_x \ {u[x]^Obs_SN}}
    G_x = reduce(V_x, E_x)
    return(G_x)

This operation adds a vertex u[x]^Obs_SN to V_x. Moreover, edges are added to indicate that all other update requests in G_x occurred before u[x]^Obs_SN. After adding u[x]^Obs_SN, the ordering graph may contain two vertices that both represent update requests from observer Obs. To save memory and communication bandwidth, we reduce the graph in such a way that it contains at most one vertex per observer, i.e., the most recent one known from that observer. The reduce operation is described below.

5.2.2. Joining Two Ordering Graphs

Whenever the ordering graph of a DB node is modified, it is propagated to the other DB nodes according to the node-node protocol. When a node receives an ordering graph G′_x, it joins its local ordering graph G_x with G′_x. Operation join is defined as:

join(G_x, G′_x) :
    E = E_x ∪ E′_x
    V = V_x ∪ V′_x
    G = reduce(V, E)
    return(G)

The set union of the sets of vertices and edges of two graphs results in a new graph with at most two vertices for each observer, provided the two joined graphs each contained at most one vertex per observer. Informally, the reduce operation applied next eliminates occurrences where two vertices of the same observer exist by removing the older of the two. Let u and high(u) denote a pair of vertices of the same observer, where high(u) has a greater sequence number than u. Moreover, let V_D be the subset of V that contains all vertices u for which a vertex high(u) exists in V. Operation reduce is then defined as follows:

reduce(V, E) :
    V_D = {u[x]^o_i | u[x]^o_i, u[x]^o′_j ∈ V ∧ o = o′ ∧ i < j}
    E′ = E ∪ {(u′, u′′) | (u′, u), (u, u′′) ∈ E ∧ u ∈ V_D}
    E′ = E′ \ {(u, u′), (u′, u) ∈ E′ | u ∈ V_D}
    E′ = E′ ∪ {(u′, high(u)) | (u′, u) ∈ E ∧ u ∈ V_D}
    V′ = V \ V_D
    return G(V′, E′)

This operation removes all vertices in V_D together with their outgoing edges. Moreover, for each u ∈ V_D the in-going edges of u are "re-directed" to high(u). Since u < high(u), edge (u′, high(u)) can be added if (u′, u) is in E. Figure 2 shows examples of the operations add and join for two graphs G_x^1 and G_x^2.

Figure 2. Examples for the graphs G_x^1 : u[x]^1_1 → u[x]^1_2 → u[x]^1_3 and G_x^2 : u[x]^1_4 → u[x]^2_2, showing (a) join(G_x^1, G_x^2) and (b) add(G_x^1, u[x]^2_2)

5.2.3. Ordering of Update Requests

DB nodes have to decide whether or not to accept received update requests. Consider the case where the local DB includes state record db[x]^k_i and update request u[x]^m_j is received. To preserve consistency, u[x]^m_j may only be accepted if the update request that wrote db[x]^k_i occurred-before u[x]^m_j or if both requests are concurrent. If both requests are from the same observer (k = m), the update request can be accepted if i < j. If the two update requests come from different observers, G_x has to be evaluated to decide whether the update is to be accepted: u[x]^m_j has to be accepted if occurredBefore(G_x, db[x]^k_i, u[x]^m_j) evaluates to true:

occurredBefore(G, u[x]^k_i, u[x]^m_j) :
    k ≠ m ∧ ∃ u_1, …, u_n ∈ V :
        {(u_1, u_2), …, (u_{n−1}, u_n)} ⊆ E ∧
        u_1 = u[x]^k_q : i ≤ q ∧
        u_n = u[x]^m_s : s ≤ j

In other words, if the ordering graph includes a path leading from u[x]^k_q to u[x]^m_s, we can conclude that u[x]^k_q occurred before u[x]^m_s. Moreover, due to the total ordering of requests transmitted by the same observer by means of sequence numbers, we can conclude that u[x]^k_q occurred before u[x]^m_s for all s ≤ j and i ≤ q.
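The following Python sketch mirrors the set-based definitions of add, join, reduce, and occurredBefore given above. Representing updates as (observer, sequence number) pairs, the class name, and the breadth-first path search are our own illustration, not the authors' implementation:

    from itertools import product

    class OrderingGraph:
        def __init__(self, vertices=None, edges=None):
            self.V = set(vertices or [])   # vertices: (observer, seq) pairs
            self.E = set(edges or [])      # directed edges (u, v): u occurred-before v

        def add(self, update):
            """Observer-node protocol: the new update succeeds every known vertex."""
            self.E |= {(v, update) for v in self.V}
            self.V.add(update)
            self._reduce()

        def join(self, other):
            """Node-node protocol: merge ordering information from another node."""
            self.V |= other.V
            self.E |= other.E
            self._reduce()

        def _reduce(self):
            # Keep only the highest sequence number per observer.
            high = {}
            for obs, seq in self.V:
                high[obs] = max(high.get(obs, seq), seq)
            dominated = {(o, s) for (o, s) in self.V if s < high[o]}
            # Bridge edges that pass through a dominated vertex.
            self.E |= {(u, w) for (u, v1), (v2, w) in product(self.E, self.E)
                       if v1 == v2 and v1 in dominated}
            # Redirect in-going edges of a dominated vertex v to high(v).
            self.E |= {(u, (v[0], high[v[0]])) for (u, v) in self.E if v in dominated}
            self.V -= dominated
            self.E = {(u, v) for (u, v) in self.E
                      if u not in dominated and v not in dominated and u != v}

        def occurred_before(self, db_entry, update):
            """True iff the graph contains a path as in the occurredBefore predicate."""
            (k, i), (m, j) = db_entry, update
            if k == m:
                return False
            # Start from any vertex of observer k with sequence number >= i ...
            frontier = [v for v in self.V if v[0] == k and v[1] >= i]
            seen = set(frontier)
            while frontier:
                u = frontier.pop()
                if u[0] == m and u[1] <= j:   # ... reach observer m with seq <= j
                    return True
                for (a, b) in self.E:
                    if a == u and b not in seen:
                        seen.add(b)
                        frontier.append(b)
            return False

Like the definition above, the reduce pass performs a single step of edge bridging and keeps at most one vertex per observer, which bounds both the graph size and the message size.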
5.3. Observer-Node Protocol

When an observer Obs captures the new state of an object x, it f-forwards an update request in an O-Update message to its single-hop neighbors, where sequence number SN is the value of a local counter that is incremented whenever Obs sends a request. According to our assumptions in Section 4, an update request received directly from an observer did not occur before the state record currently stored in the DB, i.e., an O-Update can be accepted if the time at which the last update request from another observer was received is more than δ ago. Additionally, the accepted update request is added to the local ordering graph using operation add. Finally, the modified ordering graph and the accepted update request are f-forwarded to the neighbor DB nodes in an N-Update message. After the O-Update message has been received by a DB node, the reasoning about the ordering of update requests in the node-node protocol is done solely on the basis of ordering graphs and sequence numbers, and not based on the order in which messages are received.
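A sketch of the resulting DB-node behavior on a direct O-Update, under the stated assumptions; all names, the Receipt bookkeeping, and the numeric jitter bound are our own, and OrderingGraph refers to the sketch in Section 5.2:

    import time
    from collections import namedtuple

    Receipt = namedtuple("Receipt", "time observer")
    MAX_JITTER = 0.005              # assumed bound on single-hop delay jitter (s)
    DELTA = 2 * MAX_JITTER          # ordering precision delta from Section 4

    def on_o_update(node, req, f_forward):
        """Handle an update request received directly from an observer.
        req carries (obj, state, obs, sn); node holds db, graphs, last_receipt."""
        now = time.monotonic()
        prev = node.last_receipt.get(req.obj)
        # Reject if a request from a different observer arrived within DELTA:
        # the receive order would not reflect the observation order.
        if prev and prev.observer != req.obs and now - prev.time <= DELTA:
            return
        node.last_receipt[req.obj] = Receipt(now, req.obs)
        g = node.graphs.setdefault(req.obj, OrderingGraph())
        g.add((req.obs, req.sn))         # new request succeeds all known ones
        node.db[req.obj] = req           # accept the new state
        f_forward(("N-Update", req, g))  # propagate with ordering information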
5.4. Node-Node Protocol

A DB node receiving an N-Update message from another DB node updates its ordering graph and decides whether to accept the update request according to the algorithm in Figure 3. A received update request, say u[x]^Obs_SN, is only accepted if

• the local copy of the DB does not yet contain a state record for x (case N-1 in Figure 3), or
• the local copy of the DB already contains a state record for x, and the stored state record is also from observer Obs, and its sequence number is less than SN (case N-2-1-1), or
• the local copy of the DB already contains a state record for object x, and this state record is not from Obs, and the received update request did not occur before the last update of this state record (case N-2-2-1).

On N-Update(u[x]^Obs_SN, G′_x) :
    G_old ← G_x
    G_x = join(G_x, G′_x)                           // join ordering graphs
    if db[x]^O_S = empty                            // (case N-1)
        // no state record in DB for object x
        db[x]^O_S ← u[x]^Obs_SN
        f-forward(N-Update(u[x]^Obs_SN, G_x))
    else                                            // (case N-2)
        // object x stored in DB: db[x]^O_S
        if Obs = O                                  // (case N-2-1)
            // state record and update request are from the same observer
            if SN > S                               // (case N-2-1-1)
                // update request has a higher sequence number
                db[x]^O_S ← u[x]^Obs_SN
                f-forward(N-Update(u[x]^Obs_SN, G_x))
            else                                    // (case N-2-1-2)
                // DB record is more recent or the same
                if G_x ≠ G_old
                    f-forward(N-Update(db[x]^O_S, G_x))
        else                                        // (case N-2-2)
            // different observers
            if occurredBefore(G_x, db[x]^O_S, u[x]^Obs_SN)   // (case N-2-2-1)
                db[x]^O_S ← u[x]^Obs_SN
                f-forward(N-Update(u[x]^Obs_SN, G_x))
            else                                    // (case N-2-2-2)
                // update request not accepted
                if G_x ≠ G_old
                    f-forward(N-Update(u[x]^Obs_SN, G_x))

Figure 3. Node-node protocol
If either the DB or the local ordering graph is changed, the received update request and the ordering graph are f-forwarded to the neighbor nodes. An exception is made in case N-2-1-2, where the received update request occurred before the current state record in the DB. In this case, the state record together with the ordering graph is f-forwarded instead. Note that the comparisons of graphs in cases N-2-1-2 and N-2-2-2 of the node-node protocol are separated for clarity. In practice they can be integrated into the graph operation join without increasing its complexity.
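For illustration, the case analysis of Figure 3 translates directly into the following sketch (our own rendering, reusing the OrderingGraph sketch from Section 5.2):

    def on_n_update(node, req, remote_graph, f_forward):
        """req identifies u[x] by (obs, sn); node.db maps obj -> accepted request."""
        g = node.graphs.setdefault(req.obj, OrderingGraph())
        old_edges, old_vertices = set(g.E), set(g.V)
        g.join(remote_graph)                      # join ordering graphs
        changed = (g.E != old_edges) or (g.V != old_vertices)
        cur = node.db.get(req.obj)
        if cur is None:                           # case N-1: no state record yet
            node.db[req.obj] = req
            f_forward(("N-Update", req, g))
        elif cur.obs == req.obs:                  # case N-2-1: same observer
            if req.sn > cur.sn:                   # case N-2-1-1: newer request
                node.db[req.obj] = req
                f_forward(("N-Update", req, g))
            elif changed:                         # case N-2-1-2: forward own record
                f_forward(("N-Update", cur, g))
        else:                                     # case N-2-2: different observers
            if g.occurred_before((cur.obs, cur.sn), (req.obs, req.sn)):
                node.db[req.obj] = req            # case N-2-2-1: accept
                f_forward(("N-Update", req, g))
            elif changed:                         # case N-2-2-2: reject, share graph
                f_forward(("N-Update", req, g))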
5.5. Removing Database Records

If the event that an object under observation leaves the entire system cannot be detected, the database may contain many obsolete state records. If the system tracks people in a building, for example, the state record associated with a person should be automatically removed after he or she has left the building. For this purpose, we adopt a soft-state approach, which can easily be integrated into the above algorithm. Each state record is associated with a time-to-live (TTL) timer greater than the interval between state changes. Observers are responsible for refreshing the TTL by means of O-Update messages for every object in their observation range, i.e., within a TTL period at least one O-Update must be issued. Nodes refresh the TTL timer of a state record whenever they accept an update for that record. If the timer expires, the corresponding state record is removed.
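A minimal soft-state sketch; the TTL value, names, and the explicit sweep are our own assumptions:

    import time

    TTL = 10.0  # seconds; assumed to exceed the interval between O-Updates

    def refresh(node, obj):
        # Called whenever an update for obj is accepted.
        node.expiry[obj] = time.monotonic() + TTL

    def purge_expired(node):
        now = time.monotonic()
        for obj in [o for o, t in node.expiry.items() if t <= now]:
            node.db.pop(obj, None)        # remove obsolete state record
            node.expiry.pop(obj, None)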
5.6. Space and Time Complexity

Let O be the set of observers that observe an object x and s the size of the object's state. The size of each vertex and each edge in the ordering graph is constant. The graph contains |O| vertices, and the number of edges is in O(|O|²). Temporarily, the graph operations allocate additional space depending on the implementation, e.g., for the set union in join.
For each object, a DB node stores a state record. The space needed to store a state record is s plus a constant overhead, e.g., for the object id. For the time complexity we consider the example of the graph being stored as a list of vertices and an adjacency matrix. Given two ordering graphs G_x = (V_x, E_x) and G′_x = (V′_x, E′_x), the time complexity of join is bounded by O((|V_x| + |V′_x|)³). Operation add applied to graph G_x is bounded by O(|V_x|²), because only one vertex is added that has to be taken into account by reduce. Both bounds include the complexity of reduce. The time complexity of occurredBefore is bounded by the complexity of the algorithm chosen to find a path in a graph, e.g., Kruskal's algorithm to find a spanning tree in O(|E| log |E|).
6. Correctness Arguments

In this section, we first show that our algorithm is safe, i.e., that it achieves update-linearizability according to Definition 2. Next, we show that our algorithm is live, i.e., that every DB copy of an object converges to the most recently propagated state.

6.1. Safety

First, we show that condition C1 of Definition 2 is fulfilled for a single copy. Let S_{x,n} denote the sequence of update and read operations executed on the copy of object x stored on node n.

Claim: S_{x,n} meets condition C1 of Definition 2.

Proof: All read operations in S_{x,n} are executed by local synchronous database calls. Therefore, they are executed in the client's program order.

Next, we show that all update operations in S_{x,n} are performed in occurred-before order. More precisely, we show that once node n has accepted u[x]^k_j, it never accepts an update u[x]^m_i if u[x]^m_i < u[x]^k_j.

For the observer-node protocol, we assumed that δ is twice the maximum communication delay jitter of a single-hop communication link, as mentioned in Section 4. Therefore, it is guaranteed that u[x]^m_i is not accepted after u[x]^k_j at any node. Since nodes perform update requests issued by observers in the order of their arrival, the observer-node protocol preserves the occurred-before order.

For the node-node protocol we have to consider two cases. If k = m, then u[x]^m_i and u[x]^k_j were created by the same observer. In this case, our sequence numbering scheme ensures that u[x]^m_i is not accepted once u[x]^k_j has been accepted, as i < j (case N-2-1-2, Figure 3). If k ≠ m, then u[x]^m_i would only be accepted if the local ordering graph included a path from u[x]^k_j to u[x]^m_i (case N-2-2-2). However, since no node receives u[x]^k_j before u[x]^m_i, no ordering graph will ever include such a path. Consequently, the node-node protocol also preserves the occurred-before order.

To show that our algorithm also fulfills condition C2 of Definition 2, we have to consider serializations of the read and write operations performed on all copies of a given object.

Claim: For each object x, there exists a serialization S_x of all read and update operations on x that fulfills condition C1.

Proof: Let SS_x = {S_{x,n} | n is a DB node}. Each S_{x,n} in SS_x can be divided into segments, one for each update operation in S_{x,n} together with its succeeding read operations. W.l.o.g., let S_{x,n} include the following sequence: ⋯ u[x]_k; r[x]_{k+1}; ⋯; r[x]_{k+m}; u[x]_{k+m+1}; ⋯. Then the segment of u[x]_k is defined to be u[x]_k; r[x]_{k+1}; ⋯; r[x]_{k+m} (k ≥ 0). S_x can be constructed by merging the segments of the sequences in SS_x according to the occurred-before order. In other words, for any two segments seg(u[x]) and seg(u′[x]) in SS_x, seg(u[x]) must precede seg(u′[x]) in S_x if u[x] < u′[x], and the two may appear in any order if u[x] ∥ u′[x].

6.2. Liveness

In this section we show that the DB copies of an object converge to the most recently observed state.

Claim: For each object x, all of x's copies eventually receive and accept an update operation u[x] such that ¬∃ u_k[x] : u[x] < u_k[x].

Proof: We start by assuming that all f-forwarded N-Update messages reach each node; later we drop this assumption. First, let m be the only observer for x and u[x]^m_j the youngest update request observed by m. Since message N-Update(u[x]^m_j, G_x) reaches every node, and j > i for all copies db[x]^m_i, all copies will accept u[x]^m_j. Secondly, assume that there exists a pair of observers, say m and n, and that u[x]^m_i and u[x]^n_j are the latest updates transmitted by m and n, with u[x]^m_i < u[x]^n_j. If at least one DB node receives u[x]^n_j from n after having learned about u[x]^m_i, this node adds edge (u[x]^m_i, u[x]^n_j) to its local ordering graph and f-forwards it. Therefore, each node eventually accepts u[x]^n_j. However, messages may be lost, and therefore a node that has edge (u[x]^m_i, u[x]^n_j) in its ordering graph may not exist. Then, a portion of the nodes may end up with db[x]^m_i and another portion with db[x]^n_j, and each node's ordering graph G_x eventually includes vertices u[x]^m_i and u[x]^n_j without an ordering relationship defined between them. W.l.o.g., assume that update requests u[x]^k_j < u[x]^m_i < u[x]^n_l are f-forwarded and that all nodes accept u[x]^k_j. Further assume that node n1 accepts u[x]^m_i and misses u[x]^n_l, while node n2 misses u[x]^m_i and accepts u[x]^n_l. This condition, in which different nodes hold different state information for the same object, is resolved with the next update request u[x]^j_h, which will be ordered as the youngest update by the node receiving it directly from observer j and adding it to its graph (see Section 5.5 for the discussion of periodic update requests).
7. Performance Evaluation
In this section we first define a set of metrics that characterize the performance of our algorithm. In Section 7.2 we describe the model used to create load in the system. Next, we introduce the system parameters that were systematically varied throughout the experiments. In Section 7.4 we present the simulation results obtained.
7.1. Performance Metrics
The update latency at a node is defined as the time difference between sending an update request at an observer node and accepting it at a DB node. For example, consider an update request that is passed from an observer to node n1 first and then from n1 to node n2. Assume further that it takes time t1 for processing and sending the request from the observer to n1 and time t2 for the communication between n1 and n2. The update latency accounted for that update request is then t1 at node n1 and t1 + t2 at node n2. The update latency is only measured if an update request is accepted by a DB node.

Three reasons can lead to situations in which DB nodes do not accept an update request: the update request is lost due to network congestion, the update request is rejected because of message reordering, or the update request is rejected because the available ordering information is not sufficient to decide about the order. The update miss ratio and the gap length are used to take these effects into account. The update miss ratio is defined for each DB node and each object x as the ratio between the number of missed updates for x and the total number of updates sent by all observers for x.

The gap length is a metric used to measure the staleness of state records at individual DB nodes. Informally, the gap length is the number of update requests for an object x that a DB node missed between two accepted update requests. With regard to the gap length, a set of concurrent operations is treated as a single request, which is defined to be accepted if at least one of the concurrent requests is accepted; i.e., a gap occurs only if the entire set is not accepted.

To determine the message overhead of our algorithm, we measure the number and the size of the messages sent by DB nodes. The number of messages is given as the average number of messages per DB node and update request. Using, for example, plain flooding as the f-forward primitive results in at most one message per DB node and update request.
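For illustration, both per-node metrics can be computed from a trace as follows (a sketch with assumed inputs, not the authors' evaluation tooling):

    # 'sent' is the total number of update requests all observers sent for
    # object x; 'accepted' lists the indices (0-based, in occurred-before
    # order) of the requests this node accepted. Concurrent requests would
    # first have to be collapsed into single logical requests.
    def update_miss_ratio(sent, accepted):
        return (sent - len(accepted)) / sent

    def gap_lengths(accepted):
        """Requests missed between consecutive accepted requests; gaps of
        length 0 are not counted, matching Section 7.4.2."""
        gaps = [b - a - 1 for a, b in zip(accepted, accepted[1:])]
        return [g for g in gaps if g > 0]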
7.2. Network Load Model
The network load model is characterized by the pattern of object observations and by the amount of background traffic in the network. The observation pattern of an object is determined by the relative speed between observer nodes and the object, the number of observers that observe an object, and the frequency of state changes at an object. For our experiments we used the following random movement pattern for objects, which are observed by stationary observers arranged on a regular grid. Throughout a simulation run, an object moves randomly from a given active observer to one of its neighbors with a given frequency. Additionally, neighbors of the active observer may make concurrent observations randomly with probability p_n within the observation interval [t − t_obs, t + t_obs], where t is the time at which the active observer makes its observation.

The presence of background traffic is simulated as follows. Each observer node randomly selects a time t_bg in every observation interval at which it broadcasts a message to all DB nodes with probability p_bg. The message is broadcast using plain flooding. This means that the average number of background messages that have to be received, processed, and sent by every DB node is kept constant for a given number of observers and a given value of p_bg. By choosing p_bg = (messages per node) / (number of observers), we vary the average number of background messages per DB node for a given number of observers.
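The choice of p_bg can be sketched as follows (illustrative names; e.g., 15 background messages per node with 100 observers yields p_bg = 0.15):

    import random

    def p_bg(msgs_per_node, n_observers):
        # Each observer's flooded broadcast reaches every DB node, so the
        # expected background messages per DB node is n_observers * p_bg.
        return msgs_per_node / n_observers

    def background_senders(observers, msgs_per_node):
        p = p_bg(msgs_per_node, len(observers))
        return [o for o in observers if random.random() < p]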
7.3. System Parameters
The forwarding primitive f-forward is realized as plain flooding, as an upper bound with respect to message overhead and network load. Randomization is used to reduce the probability of collisions; the randomization period is selected uniformly at random from the interval [0, 5] ms. Messages are not acknowledged and are not retransmitted upon collisions. The database is replicated on n_db DB nodes. The area for all experiments is 875 m × 875 m. On this area we varied the number of observer nodes in an evenly arranged grid of 18, 36, and 100 observers. Observer nodes remain stationary throughout the simulation. The number of mobile DB nodes was varied between 50, 100, and 250. All DB nodes used the random-waypoint mobility model with speeds chosen uniformly between 0 and 5 m/s. The observation pattern used is the generic random movement pattern for objects, where a new active observer was chosen every 5 seconds. The probability p_n in all experiments is 0.5, to simulate concurrent observations due to overlapping observation ranges of observers. Parameter t_obs, determining the size of the observation interval, was 0.25 s. The simulated duration of a single experiment was 275 seconds. Each experiment included at least 55 observations and was repeated 5 times; the results were averaged.
7.4. Simulation Results

We evaluated our algorithm with the network simulator ns-2 [2]. For the simulation experiments, the MAC implementation of IEEE 802.11 supplied with the simulator was used for both DB nodes and observer nodes. The implementation of our algorithm uses only MAC-layer broadcast messages sent to one-hop neighbor nodes within a transmission range of 250 m, i.e., no additional routing protocol is used. In the figures, the level of background traffic is denoted as "bg msg" and the number of observers as "obs".

7.4.1. Update Latency

Figure 4 shows the simulation results for the update latency as a function of the number of DB nodes. The results only account for update requests that were accepted by the algorithm, ignoring update requests that were received but rejected. The update latency depends mainly on the level of background traffic, i.e., the largest growth of the latency can be observed between different levels of background traffic. Figure 5 shows the distribution of latencies for 250 DB nodes, 100 observers, and 15 messages of background traffic. In this scenario the average latency is 0.26 s. The distribution shows that approximately 50% of all latencies are less than the average (the sum of the first two bars in Figure 5). This indicates that the average taken for the update latency is a representative aggregation.

Figure 4. Average update latency
Figure 5. Distribution of update latencies for 250 DB nodes, 100 observers, and background traffic of 15 messages per update request and DB node
7.4.2. Gap Length and Update Miss Ratio

The simulation results for the gap length and the update miss ratio are presented in Figure 6 and Figure 7. The jitter of the transmission latency between observer and DB node is close to 0 under the assumption that messages are not delayed in the interface queue at observers (within the observation interval each observer sends at most two messages). Under the additional assumption that the variance of the time to make an observation at observers is 0, i.e., if the time to fetch raw data from a sensor as well as the processing time are constant, we can conclude that the state propagation delay varies only within the randomization interval of [0, 5] ms.

In the evaluation of the simulation results, a gap was counted as soon as at least one update request was missed by a DB node, neglecting the possible presence of concurrent update requests. The gap length does not quantify gaps of length 0, i.e., those cases where two consecutive update requests were accepted. This leads to a minimum gap length of 1, unless no gaps are encountered at all. The presented results therefore give an upper bound for the gap length according to the definition in Section 7.1. The maximum average gap length encountered was 1.9, even for a high overall update miss ratio due to heavy background traffic and concurrent update requests. The object state information available to clients in the local copies of the DB is therefore on average at most 1.9 update requests older than the most current update request in the whole system. This shows that DB nodes are updated regularly, providing recent information to client applications.

Figure 6. Average gap length
Figure 7. Average update miss ratio
7.4.3. Message Overhead

The message overhead is determined by the number of messages sent by each node and the length of the individual messages. A DB node f-forwards a received update request if it either updates its local state record or if the local ordering graph has been changed. The number of messages sent per update request and per DB node is, depending on the update miss ratio, between 0.7 and 0.95 messages per DB node per observation; these values were measured for high and low update miss ratios, respectively. The message length is directly related to the number of vertices in the ordering graph, which in turn depends linearly on the number of observers that send update requests for an object. The only other quantity that influences the message size is the size of the state of an object, which was kept constant at 10 bytes. The average message sizes measured during the experiments were between 126 and 370 bytes.
8. Conclusion and Future Work

In this paper we have introduced update-linearizability, a consistency concept that preserves the chronological ordering of update requests caused by state transitions of real-world objects which are captured by sensors. Further, we have presented a replication algorithm suitable for MANETs which guarantees update-linearizability even in the presence of multiple independent update sources for a given object. Our algorithm does not require synchronized clocks on any node. Even in networks with a high network load, replicated copies of information objects are updated regularly. Simulation experiments show that the replicated state information of objects available to client applications by local read operations is on average at most 1.9 update requests older than the most current update request in the system.

Future work includes the extension of our algorithm in such a way that the chronological ordering of update requests can be derived for multiple objects, e.g., to decide whether the state change of object x was sensed before the state change of object y or vice versa. Moreover, we will investigate the impact of other flooding schemes on the performance of our algorithm.
References

[1] Electronic product code (EPC). http://www.epcglobalinc.org.
[2] The network simulator ns-2. http://www.isi.edu/nsnam/.
[3] S. B. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in a partitioned network: A survey. ACM Computing Surveys, 17(3):341-370, 1985.
[4] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proc. of the 6th ACM PODC, pages 1-12, 1987.
[5] T. Haerder and A. Reuter. Principles of transaction-oriented database recovery. ACM Computing Surveys, 15(4):287-317, 1983.
[6] W. Heinzelman, J. Kulik, and H. Balakrishnan. Adaptive protocols for information dissemination in wireless sensor networks. In Proc. of the 5th ACM/IEEE MobiCom, pages 174-185, 1999.
[7] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. on Programming Languages and Systems, 12(3):463-492, 1990.
[8] C. Ho, K. Obraczka, G. Tsudik, and K. Viswanath. Flooding for reliable multicast in multi-hop ad hoc networks. In Proc. of the 3rd ACM DIALM Workshop, pages 64-71, 1999.
[9] G. Karumanchi, S. Muralidharan, and R. Prakash. Information dissemination in partitionable mobile ad hoc networks. In Proc. of the 18th IEEE Symposium on Reliable Distributed Systems, pages 4-13, 1999.
[10] P. J. Keleher and U. Cetintemel. Consistency management in Deno. ACM/Kluwer MONET, 5(4):299-309, 2000.
[11] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Comm. of the ACM, 21(7):558-565, 1978.
[12] J. Luo, J.-P. Hubaux, and P. Th. Eugster. PAN: Providing reliable storage in mobile ad hoc networks with probabilistic quorum systems. In Proc. of the 4th ACM MobiHoc, pages 1-12, 2003.
[13] S.-Y. Ni, Y.-C. Tseng, Y.-S. Chen, and J.-P. Sheu. The broadcast storm problem in a mobile ad hoc network. In Proc. of the 5th ACM/IEEE MobiCom, pages 151-162, 1999.
[14] B. Williams and T. Camp. Comparison of broadcasting techniques for mobile ad hoc networks. In Proc. of the 3rd ACM MobiHoc, pages 194-205, 2002.
[15] B. Xu, O. Wolfson, and S. Chamberlain. Spatially distributed databases on sensors. In Proc. of the 8th ACM GIS, pages 153-160, 2000.