Update-Linearizability: A Consistency Concept for the Chronological Ordering of Events in MANETs Jörg Hähner∗ Universität Stuttgart, IPVS Universitätsstrasse 38 70569 Stuttgart, Germany Kurt Rothermel Universität Stuttgart, IPVS Universitätsstrasse 38 70569 Stuttgart, Germany Christian Becker Universität Stuttgart, IPVS Universitätsstrasse 38 70569 Stuttgart, Germany haehner@informatik.uni-stuttgart.de rothermel@informatik.uni-stuttgart.de becker@informatik.uni-stuttgart.de Abstract the most current state of each real-world object according to the chronological ordering in which it was perceived by sensors. Note that this ordering may be different from the ordering in which state changes are applied to the database, e.g., due to message reordering within the network. Whenever a sensor reports a significant state change of an object, the database copies are updated to reflect the new state. The major contributions of this paper are as follows: We propose a novel consistency concept that preserves the chronological ordering of update operations on the replicated database caused by state transition events of real world objects: Roughly speaking, the database state of an object is never replaced by a state that has been captured by any sensor before the current state. Moreover, based on this consistency concept, we introduce a novel replication algorithm suitable for MANETs that allows for multiple independent sources that update state information of objects. If only a single source of updates exists for each object, the most recently sensed state can easily be determined by a simple versioning scheme. However, in scenarios with mobile objects and/or mobile sensors, an object may be perceived by multiple sensors consecutively or concurrently. This complicates the task of maintaining consistency, especially if synchronized clocks are not available. Simulation experiments with our replication algorithm show that the synchronization overhead for guaranteeing the chronological ordering of state transition events is low, i.e. it is dominated by the underlying data dissemination algorithm. Replicated copies are updated regularly even if messages are lost due to network congestion caused by high network load. The remainder of this paper is structured as follows: In Section 2 we discuss the related work. Section 3 presents our consistency concept. Our system model is presented in Section 4. In Sections 5 and 6 we introduce our replication algorithm and prove that it is correct with respect to the consistency model. Simulations showing the performance of our algorithm in a broad range of experiments are presented in Section 7. Section 8 concludes the paper. MANETs are used in situations where networks need to be deployed immediately but no network infrastructure is available. If MANET nodes have sensing capabilities, they can capture and communicate the state of their surroundings, including environmental conditions or objects in their proximity. If the sensed state information is propagated to a database to build a consistent model of the real world, a variety of promising context-aware applications becomes possible. In this paper, we introduce a novel consistency concept that preserves the chronological ordering of sensed state transition events. Based on this concept, we propose a data replication algorithm for MANETs that guarantees the consistency concept without relying on synchronized clocks and show its correctness. Our simulation experiments show that replicated copies are updated regularly even if the network load in the system is high. 1. Introduction Devices with integrated sensors and wireless communication technology may be able to detect their position and orientation, as well as measure changes in their environmental conditions, including hazardous emissions, temperature, smoke, or dust. Although context-aware applications can already profit from context collected by sensors of a single device, the benefit is higher if nodes propagate context information that is captured locally to a distributed database, which maintains a consistent model of reality. In this paper, we investigate how state information captured by a collection of distributed sensors can be maintained in a database that is replicated over a number of nodes in a MANET in order to provide a current model of the operational environment. Mobile database nodes store ∗ Funded by the Gottlieb Daimler and Carl Benz Foundation, Germany 0-7803-8815-1/04/$20.00 ©2004 IEEE 1 2. Related Work by appropriate sensor technology. We assume objects have a unique identifier and may be associated with state information, such as location or temperature, which may change over time. The object identifiers may for example be obtained by using radio frequency identification (RFID) tags attached to objects, such as the electronic product code (EPC) [1]. An observer senses perceivable objects that are within its observation range. Observers transmit the sensed state information, e.g., the location of the object, by means of update requests to the database (DB). Each update request includes the identifier and state of the observed object, so that for a given object an update request may be generated when it enters the observation range of the observer or its state changes significantly. The DB is maintained by a number of (mobile) DB nodes. For each observed object the DB stores a so-called state record that specifies the state of a given object. When the first update request for an object arrives, a new state record is created in the DB, and subsequent update requests are applied to this record. Clients are applications that only read state information from the DB. This restriction makes sense because the DB contains state information of physical objects which can only be captured by observers. Therefore, in our model the DB is only updated by observers and read exclusively by clients. The same object may be observed by multiple observers, either sequentially or concurrently. Consequently, there may be multiple observers transmitting update requests for the same object. This leads to the question of how these update requests are to be ordered. The ordering of events in distributed systems has been subject of research for many decades. The seminal work of Lamport [11] introduces means for ordering events that can be specified and observed within a given system. The happened-before relation as defined by Lamport utilizes the order between sending and receiving of a given message at different processes in the system by merging the causal history of the sender and receiver processes at the receiver. Our algorithm maintains the chronological ordering of update operations caused by events external to the system rather than their causal ordering within the system. Strong consistency based on the concept of serializability [5] has been addressed in the domain of distributed databases extensively. Since the level of consistency is a trade-off to availability [3], strong consistency may result in poor availability in the presence of frequent node and network failures. Weaker consistency levels have been proposed to increase the availability of data. The authors of [4] propose epidemic algorithms to update copies of replicated data in fixed networks. Their concept of consistency ensures that all copies converge to a common state. Consistency issues for data replication have also been addressed in the context of sensor networks and MANETs. The adaptive broadcast replication protocol (ABR) [15] covers explicitly the tradeoff between consistency and transmission costs and ensures weak consistency limited to a single update source per logical object. Deno [10] presents an epidemic replication algorithm based on weighted-voting for weakly connected ad hoc networks ensuring that each copy commits updates in the same order. In [9] a quorum-based system is presented to provide access to the most current information about objects. However, it differs from our approach in the assumption that there exists a single update source for each object. The authors of [12] present a collection of protocols for probabilistic quorum based data storage in MANETs. Read operations will return the result of the latest update operation independent of the order in which these updates have been executed. In contrast to the approaches described above, our consistency model explicitly enforces the chronological ordering of update operations from multiple update sources. Additionally, our algorithm guarantees consistency without synchronized clocks being available. 3.1. Ordering of Update Requests Since observers capture real world events, the ordering of state changes according to their occurrence in real-time is essential for many applications. For example, for an application tracking the movement of a physical object, the real-time ordering of the observed location changes is essential to determine the direction the object is moving to. Due to the lack of global time in distributed systems, the real-time ordering of observation events occurring at different observers can be captured only with limited accuracy. We define the occurred-before relationship for update requests, which is a relaxation of real-time ordering. Definition 1 (occurred-before): Let u and u be two update requests. Then u occurred-before (<) u iff tobs (u ) − tobs (u) > δ, where δ > 0 and tobs (u) denotes the real time at which the observation leading to the generation of u occurred. 3. Update Linearizability Parameter δ is a system parameter that defines how accurate the real-time ordering of observation events is captured by the underlying system. This parameter is important for applications because it defines the minimum temporal distance of any pair of observation events needed to The major components of our consistency model are perceivable objects, observers, database nodes (DB nodes) and clients. A perceivable object (or object for short) is a mobile or stationary real world object that can be perceived 2 a) O1: u[x]1 O2: u[x]2 C1: r[x]1 r[x]2 C2: r[x]2 b) O1: u[x]1 O2: u[x]2 C1: r[x]1 r[x]2 C2: r[x]2 r[x]1 c) O1: u[x]1 u[y]2 O2: u[y]1 u[x]2 C1: r[x]2 r[y]1 C2: r[y]2 r[x]1 Figure 1. Sample executions: execution (a) and (c) are valid, while (b) is invalid determine their correct real-time ordering. If neither u < u nor u < u, then u and u are said to be concurrent, denoted as uu . For concurrent update requests it cannot be guaranteed that correct real-time ordering is captured. tation u[x]1 to indicate an update request for object x that writes the state 1 and r[y]2 for a read operation that reads object y and returns state 2. The time axis runs from left to right. The execution in Figure 1(a) is correct because client C1 reads the state of object x as 1 even though state 2 has already been written by observer O2. This is allowed because C1 has never read object x before, allowing to start C1’s program for object x anywhere in the serialization. Client C2 reads state 2 at the same time than C1 reads state 1. This is valid because executions of different clients may be interleaved in the serialization. In contrast to that, Figure 1(b) shows an invalid execution, because C2 reads state 1 after it has already read state 2. The example in Figure 1(c) is a valid execution with two objects, because update-linearizability is an object-local property and both clients read each object only once. C1 reads state 2 of object x before reading state 1 of object y and C2 vice versa. 3.2. Consistency Definition Update-linearizability is a weaker consistency model than linearizability [7], where the ordering of both update and read operations is to be consistent with the real times at which the operations occurred in the actual execution. With update-linearizability read operations of a single client for a single object are only ordered according to their program order. While linearizability ensures that each read operation returns the most current value of an object, updatelinearizability only guarantees that a client never reads a value that is older than any value it has read before for the same object. Definition 2 (update-linearizability): An execution of the read and update operations issued by clients and observers is said to be update-linearizable if there exists some serialization S of this execution that satisfies the following conditions: (C1) All read operations of a single client on a single object in S are ordered according to the program order of the client. For each object x and each pair of update requests u[x] and u [x] on x in S: u [x] is a (direct or indirect) successor of u[x] in S iff u[x] < u [x] or u[x]u [x]. (C2) For each object x in the database (DB) S meets the specification of a single copy of x. 4. System Model In this section, we define the system model of MANETs in which we apply our replication algorithm presented in Section 5. DB nodes and observers may be mobile and form a MANET. Observers create update requests and propagate them to DB nodes. If an observer and a DB node are colocated on the same device, requests can be forwarded locally. Otherwise, wireless communication is used to transmit update requests to the DB nodes in the transmission range of the observer. We do not require the nodes’ and observers’ clocks to be synchronized. Our algorithm orders update requests sent by observers according to their arrival at DB nodes in the observers’ transmission range, i.e. in single-hop distance. Therefore parameter δ in the definition of the occurred-before-relation is two times the communication jitter on the single-hop link between DB nodes and observers. Note that the jitter includes any delays that may be introduced by randomization on lower protocol layers. For the sake of simplicity, we assume here that the time from capturing a sensor signal to sending an update request is the same for all observers. To illustrate the choice of δ, consider the example where a DB node receives two update requests from two distinct observers. Let the time difference between the two receive events be tdiff . On one hand, whenever tdiff is greater than twice the maximum communication jitter, it can be guaranteed that the order in which the update requests are received does not differ from the order in which Definition 2 captures the idea that for an execution of operations there exists a serialization against a single logical image of the DB objects and each client sees a view of the (physical) objects that is consistent with the logical image. It guarantees that update requests are only performed in the occurred-before order. Once a client c has executed a read r1 that returned the result of an update request u1 on a specific object x, condition (C1) guarantees that the next read operation r2 of c on x at least returns the same result as r1 or some result written by an update request u2 , with u1 < u2 or u1 u2 . 3.3. Examples of Executions Figure 1 shows examples of valid and invalid executions according to Definition 2. For the examples we use the no- 3 tween state records created by independent observers. Next, we present the algorithm which is divided into the observernode and node-node protocol. The former is used to transmit state changes from the observers to single-hop neighbor DB nodes, whereas the latter propagates state changes together with ordering information among DB nodes to update the copies of the database. they were sent by the respective observers. This means that the DB node may safely accept both update requests. On the other hand, if tdiff is less than twice the maximum jitter, the order in which the update requests are received may be arbitrary and therefore the DB node must not accept the second update request it receives. We further assume that each DB node in the MANET stores a copy of the entire DB, i.e. each DB entry is replicated on every DB node. Clients reside on the nodes in the MANET and read their local DB copy only. In many scenarios total replication of data is highly desirable. With the consistency level introduced in Definition 2, a client may read an object’s state record as long as a single copy is available. Consequently, with total replication an operational client may always read its local copy, i.e. from a client’s point of view the data is always available, independent of node and communication failures. While read operations are always local, write operations might be expensive depending on the number of nodes. This is acceptable for scenarios where reads occur more frequently than updates or the number of nodes is limited. Our work addresses MANETs with frequent topology changes and an unknown number of hosts and therefore our replication algorithm is based on message flooding to update the copies of an object. Message flooding is a well understood, frequently used, and very robust technique for broadcasting in networks with unknown topology. Numerous broadcast algorithms in MANETs are based on flooding (e.g., in [13, 14]). Besides plain flooding, variants like probabilistic or counter-based schemes help to avoid socalled broadcast storms. Hyper-flooding helps to increase the packet delivery ratio in case of frequent network partitions and high node mobility [8]. It has been shown that the communication overhead and the packet delivery ratio of a chosen flooding scheme heavily depends on the characteristics of the underlying MANET [6,8], i.e. different MANETs require different flooding schemes and can be combined flexibly with our algorithm depending on the application scenario. Therefore, we assume the existence of a communication primitive f-forward, which sends a given message to one-hop neighbors of a given sender with best-effort communication semantics and a known upper bound for the delay jitter on single-hop communication links, defining the ordering precision δ in Definition 1. Note, that we do not use any multi-hop routing algorithm. 5.1. State Record Whenever an observer detects a significant state change, it transmits an update request to its single-hop neighbor DB nodes according to the observer-node protocol. Each update request includes a state record consisting of a tuple (Obj , State, Obs, SN ), where Obj is the unique identifier of an object, State is the new state information for Obj , Obs is the unique identifier of the observer that created the update request, and SN is the sequence number of the update request. The sequence number is unique and strictly monotonic increasing for each observer. We use the notaObs tion u[x]Obs SN and db[x]SN to denote an update request and database entry for object x with sequence number SN transmitted by observer Obs. For the ease of exposition, the state information of the object is omitted from the state record, because it is not needed to maintain consistency. 5.2. Ordering Graph In cases where two update requests were created by the same observer, their ordering can be determined by comparing sequence numbers. Whenever the ordering decision must be made for two update requests issued by independent observers, additional ordering information is necessary. Therefore, each DB node maintains an ordering graph for each object x in its DB, which specifies the locally known ordering of update requests. Definition 3 (ordering graph): Let Gx = (Vx , Ex ) denote the ordering graph for object x. The set of vertices Vx contains locally known update requests. The set of directed edges Ex reflects the locally known ordering relationship between update requests, where an edge (u[x]oi , u[x]pk ) exists in Ex iff u[x]oi < u[x]pk . The remainder of this section defines the semantics of an ordering graph and the definition of the transformation functions add and join. The former adds a new vertex to the graph and is used by the observer-node protocol to include new ordering information, while the latter merges two graphs into a new graph and is used by the node-node protocol to combine ordering information available from different nodes. Finally, we define the predicate occurredBefore which is used to decide whether the state record of an object may be overwritten by an update request. Given the local ordering graph of a DB node, it evaluates to true if the 5. A Replication Algorithm that Guarantees Update-Linearizability In this section we first present the data structures needed for our replication algorithm: the state record and the ordering graph. The state record includes state information about a particular object together with the origin of the information, i.e. which observer created the record. The ordering graph is necessary to maintain ordering information be- 4 state record currently stored in the database copy of a DB node occurred before a given update request. - u[x]3 1 Z ~ u[x]2 Z 2 1 u[x]41 u[x]11 5.2.1. Adding New Ordering Information to the Graph Our algorithm derives the global ordering of updates from the order in which DB nodes receive update requests directly, i.e., single-hop, from observers. When a node receives an update request from some observer Obs, it concludes, with respect to the discussion of δ in Section 4, that for each v in Vx it holds that v < u[x]Obs SN . Now, assume that u[x]ki and u[x]m are transmitted at time t and t > t + δ, rej spectively. Then it is guaranteed that u[x]m j does not arrive before u[x]ki at any DB node. When a DB node directly receives u[x]Obs SN from some observer Obs, it modifies its local ordering graph Gx according to the following operation: Z (a) join(G1x , G2x ) u[x]11 H HH j u[x]2 H 2 1 ?3 u[x]1 (b) add(G1x , u[x]22 ) Figure 2. Examples for the graphs G1x : u[x]11 → u[x]21 → u[x]31 and G2x : u[x]41 → u[x]22 reduce(V, E) : VD = {u[x]oi | u[x]oi , u[x]oj ∈ V ∧ o = o ∧ i < j} E = E ∪ {(u , u ) | (u , u), (u, u ) ∈ E ∧ u ∈ VD } E = E \ {(u, u ), (u , u) ∈ E | u ∈ VD } E = E ∪ {(u, high(u)) | (u, v) ∈ E ∧ v ∈ VD } V = V \ VD return G(V , E ) add (Gx , u[x]Obs SN ) : Vx = Vx ∪ {u[x]Obs SN } Obs Ex = Ex ∪ {(u, u[x]Obs SN ) | u ∈ Vx \ {u[x]SN }} Gx = reduce(Vx , Ex ) return(Gx ) This operation removes all vertices in VD including their outgoing edges. Moreover, for each u ∈ VD the in-going edges of u are ”re-directed” to high(u). Since u < high(u), edge (u , high(u)) can be added if (u , u) is in Vx . Figure 2 shows examples for the operations add and join for two graphs G1x and G2x . This operation adds a vertex u[x]Obs SN to Vx . Moreover, edges are added to indicate that all other update requests in Gx ocObs curred before u[x]Obs SN . After adding u[x]SN , the ordering graph may contain two vertices both representing update requests from observer Obs. To save memory and communication bandwidth, we reduce the graph in such way that it contains at most one vertex per observer, i.e. the most recent one known from that observer. The reduce operation is described below. 5.2.3. Ordering of Update Requests DB nodes have to decide whether or not to accept received update requests. Consider the case where the local DB includes state record db[x]ki and update request u[x]m j is received. To preserve consistency u[x]m j may only be accepted if the update request that wrote db[x]ki occurred-before u[x]m j or both requests are concurrent. If both requests are from the same observer (k = m) the update request can be accepted if i < j. If both update requests come from different observers, Gx has to be evaluated to decide whether the update has to be accepted: u[x]m j has to be accepted if occurredBefore(Gx , db[x]ki , u[x]m j ) evaluates to true: 5.2.2. Joining Two Ordering Graphs Whenever the ordering graph of a DB node is modified, it is propagated to the other DB nodes according to the node-node protocol. When a node receives an ordering graph Gx it joins its local ordering graph Gx with Gx . Operation join is defined as: join(Gx , Gx ) : E = Ex ∪ Ex V = Vx ∪ Vx G = reduce(V, E) return(G) occurredBefore(G, u[x]ki , u[x]m j ) : k = m ∧ ∃u1 , · · · , un ∈ V : {(u1 , u2 ), · · · , (un−2 , un−1 ), (un−1 , un )} ⊆ E ∧ u1 = u[x]kq : i ≤ q ∧ un = u[x]m s : s ≤ j The set union of the sets of vertices and edges of two graphs results in a new graph with at most two vertices for each observer, if the two joined graphs contained at most one vertex per observer each. Informally, the reduce operation applied next eliminates occurrences where two vertices of the same observer exist by removing the older one of both. Let u and high(u) denote a pair of vertices of the same observer, where high(u) has a greater sequence number than u. Moreover, let VD be a subset of Vx that contains all vertices u for which a vertex high(u) exists in Vx . Operation reduce is then defined as follows: In other words, if the ordering graph includes a path leadk ing from u[x]kq to u[x]m s , we can conclude that u[x]q ocm curred before u[x]s . Moreover, due to the total ordering of requests transmitted by the same observer by means of sequence numbers, we can conclude that u[x]kq occurred before u[x]m s for all s ≤ j and i ≤ q. 5.3. Observer-Node Protocol When an observer Obs captures the new state of an object x it f-forwards an update request in an O-Update mes- 5 sage to its single-hop neighbors, where sequence number SN is the value of a local counter that is incremented whenever Obs sends a request. According to our assumptions in Section 4, an update request received from an observer did not occur before the state record currently stored in the DB, i.e. O-Updates can be accepted if the time at which the last update request from another observer was received is more than δ ago. Additionally, the accepted update request is added to the local ordering graph using operation add . Finally, the modified ordering graph and the accepted update request are f-forwarded to the neighbor DB nodes in an N-Update message. After the O-Update message has been received by a DB node, the reasoning about the ordering of update requests in the node-node protocol is solely done on the basis of ordering graphs and sequence numbers and not based on the order in which messages are received. On N-Update ( u[x]Obs SN , Gx ) : GOld ← Gx x Gx = join(Gx , Gx ) / / j o i n o r d e r i n g g r a p h s i f db[x]O S = empty / / (case N-1) / / no s t a t e r e c o r d i n DB f o r o b j e c t x Obs db[x]O S ← u[x]SN f-forward ( N-Update ( u[x]Obs SN , Gx ) ) e l s e / / (case N-2) / / o b j e c t x s t o r e d i n DB : db[x]O S i f Obs = O / / (case N-2-1) / / s t a t e r e c o r d and u p d a t e r e q u e s t a r e / / f r o m t h e same o b s e r v e r i f SN > S / / (case N-2-1-1) / / u p d a t e r e q u e s t h a s h i g h e r s e q . num . Obs db[x]O S ← u[x]SN f-forward ( N-Update ( u[x]Obs SN , Gx ) ) e l s e / / (case 2-1-2) / / db r e c o r d i s more r e c e n t o r t h e same i f Gx = GOld x f-forward ( N-Update ( db[x]O S , Gx ) ) fi fi e l s e / / (case N-2-2) / / d i f f e r e n t observers Obs i f occurredBefore(Gx , db[x]O S , u[x]SN ) / / (case 2-2-1) Obs db[x]O S ← u[x]SN f-forward ( N-Update ( u[x]Obs SN , Gx ) ) e l s e / / (case N-2-2-2) / / update request not accepted i f Gx = GOld x f-forward ( N-Update ( u[x]Obs SN , Gx ) ) fi fi fi fi 5.4. Node-Node Protocol A DB node receiving an N-Update message from another DB node updates its ordering graph and decides whether to accept the update request or not according to the algorithm in Figure 3. A received update request, say u[x]Obs SN , is only accepted if • the local copy of the DB does not yet contain a state record for x (case N-1 in Figure 3), or • the local copy of the DB already contains a state record for x, and the stored state record is also from observer Obs, and its sequence number is less than SN (case N-2-1-1), or Figure 3. Node-node protocol • the local copy of the DB already contains a state record for object x, and this state record is not from Obs, and the received update request did not occur-before the last update of this state record (case N-2-2-1). building, for example, the state record associated with a person should be automatically removed after he or she left the building. For this purpose, we adopt a soft state approach, which can be easily integrated into the above algorithm. Each state record is associated with a time to live (TTL) timer greater than the rate of state changes. Observers are responsible for refreshing the TTL by means of O-Update messages for every object in their observation range, i.e., within a TTL period at least one O-Update must be issued. Nodes refresh the TTL timer of a state record whenever they accept an update for that record. If the timer expires, the corresponding state record is removed. If either the DB or the local ordering graph is changed, the received update request and the ordering graph are fforwarded to the neighbor nodes. An exception is made in case N-2-1-2, where the received update request occurredbefore the current state record in the DB. In this case, the state record together with the ordering graph is f-forwarded instead. Note that the comparison of graphs in cases N-2-12 and N-2-2-2 of the node-node protocol are separated for clarity. In practice they can be integrated into the graph operation join without increasing its complexity. 5.6. Space and Time Complexity Let O be the set of observers that observe an object x and s the size of the object’s state. The size of each vertex and each edge in the ordering graph is constant. The graph contains |O| vertices and the number of edges is in O(|O|2 ). Temporarily, the graph operations allocate additional space depending on the implementation, e.g. for the 5.5. Removing Database Records If the event that an object under observation leaves the entire system cannot be detected, the database may contain many obsolete state records. If the system tracks people in a 6 set union in join. For each object a DB node stores a state record. The space needed to store a state record is s plus a constant overhead, e.g. for the object id. For the time complexity we consider the example of the graph being stored as a list of vertices and an adjacency matrix. Given two ordering graphs Gx = (Vx , Ex ) and Gx = (Vx , Ex ), the time complexity of join is bounded by O((|Vx | + |Vx |)3 ). Add applied to graph Gx is bounded by O(|Vx |2 ), because only one vertex is added that has to be taken into account by reduce. Both considerations include the complexity of reduce. The time complexity of occurredBefore is bounded by the complexity of the algorithm chosen to find a path in a graph, e.g. Kruskal’s algorithm to find a spanning tree in O(|E| log|E|). the node-node protocol also preserves the occurred-before order. To show that our algorithm also fulfills condition C2 of Definition 2, we have to consider serializations of the read and write operations performed on all copies of a given object. Claim: For each object x, there exists a serialization Sx of all read and update operations on x that fulfills condition C1. Proof: Let SSx = {Sx,n | n is a DB node}. Each Sx,n in SSx can be divided into segments, one for each update operation in Sx,n and the succeeding read operations. W.l.o.g., let Sx,n include the following sequence: · · · u[x]k ; r[x]k+1 ; · · · ; r[x]k+m ; u[x]k+m+1 ; · · ·. Then the segment of u[x]k is defined to be u[x]k ; r[x]k+1 ; · · · ; r[x]k+m (k ≥ 0). Sx can be constructed by merging the segments of sequences in SSx according to the occurred-before order. In other words, for any two segments seg(u[x]) and seg(u [x]) in SSx seg(u[x]) must have occurred-before seg(u [x]) in Sx if u[x] < u [x], and in any order if u[x]||u [x]. 6. Correctness Arguments In this section, we first show that our algorithm is safe, i.e. we achieve update-linearizability according to Definition 2. Next, we show that our algorithm is live, i.e., every DB copy of an object converges to the most recently propagated state. 6.2. Liveness 6.1. Safety In this section we show that the DB copies of an object converge into the most recently observed state. Claim: For each object x all of x’s copies eventually receive and accept an update operation u[x], where ¬∃uk [x] : u[x] < uk [x]. Proof: We start by assuming that all of the the f-forwarded N-Update messages reach each node. Later we drop this assumption. First, let m be the only observer for x and u[x]m j be the youngest update request observed by m. Since message N-Update(u[x]m j , Gx ) reaches every node, and j > i m for all copies db[x]m i , all copies will accept u[x]j . Secondly, assume that there exists a pair of observers, say m n and n, and u[x]m i and u[x]j are the latest updates transn mitted by m and n with u[x]m i < u[x]j . If at least one n DB node receives u[x]j from n after having learned about m n u[x]m i , this node adds edge (u[x]i , u[x]j ) to its local ordering graph and f-forwards it. Therefore, each node eventually accepts u[x]nj . However, messages may be lost and n therefore a node that has edge (u[x]m i , u[x]j ) in its ordering graph may not exist. Then, a portion of the nodes n may end up with db[x]m i and another one with db[x]j , and each node’s ordering graph Gx eventually includes vern tices u[x]m i and u[x]j without an ordering relationship defined between them. W.l.o.g., assume that update requests n u[x]kj < u[x]m i < u[x]l are f-forwarded and that all nodes k accept u[x]j . Further assume that node n1 accepts u[x]m i and misses u[x]nl while node n2 misses u[x]m i and accepts u[x]nl . This condition, where different nodes hold different state information for the same object, is resolved with the next update request u[x]jh which will be ordered as the First, we show that condition C1 of Definition 2 is fulfilled for a single copy. Let Sx,n denote the sequence of update and read operations executed on the copy of object x stored on node n. Claim: Sx,n meets condition C1 of Definition 2. Proof: All read operations in Sx,n are executed by local synchronous database calls. Therefore, they are executed in the clients program order. Next, we show that all update operations in Sx,n are performed in occurred-before order. More precisely, we show that once node n has accepted u[x]kj , it never accepts an upm k date u[x]m i if u[x]i < u[x]j . For the observer-node protocol, we assumed that δ is defined by twice the maximum communication delay jitter of a single-hop communication link as mentioned in Section 4. Therefore, it is guaranteed that u[x]m i is not accepted after u[x]kj at any node. Since nodes perform update requests issued by observers in the order of their arrival, the observernode protocol preserves the occurred-before order. For the node-node protocol we have to consider two k cases. If k = m, then u[x]m i and u[x]j were created by the same observer. In this case, our sequence numbering k scheme ensures that u[x]m i is not accepted once u[x]j has been accepted as i < j (case N-2-1-2, Figure 3). If k = m, then u[x]m i would only be accepted if the local ordering graph includes a path from u[x]kj to u[x]m i (case N-2-2-2). However, since no node receives u[x]kj before u[x]m i , no ordering graph will ever include such a path. Consequently, 7 0.3 0 bg msg, 18 obs 0 bg msg, 36 obs 0 bg msg, 100 obs 7.5 bg msg, 18 obs 7.5 bg msg, 36 obs 7.5 bg msg, 100 obs 15 bg msg, 18 obs 15 bg msg, 36 obs 15 bg msg, 100 obs Average update latency 0.25 0.2 0.15 0.1 0.05 0 50 100 150 200 250 DB Nodes Figure 4. Average update latency Figure 5. Distribution of update latencies for 250 DB nodes, 100 observers, and background traffic of 15 messages per update request and DB node youngest update by the node receiving it from j and adding it to its graph (see Section 5.5 for the discussion on periodic update request). 7. Performance Evaluation The gap length is a metric used to measure the staleness of state records at individual DB nodes. Informally, the gap length defines the number of update requests for an object x at a DB node that were missed between two accepted update requests. With regard to the gap length, a set of concurrent operations is treated as a single request, which is defined to be accepted if at least one of the concurrent requests is accepted, i.e. only if the entire set is not accepted a gap occurs. To determine the message overhead of our algorithm we measure the number and the size of the messages sent by DB nodes. The number of messages is given as the average number of messages per DB node and update request. Using, for example, plain flooding as f-forward primitive this will result in at most 1 message per DB node and update request. In this section we first define a set of metrics that allow for characterizing the performance of our algorithm. In Section 7.2 we describe the model that was used to create load in the system. Next, we introduce a set of the system parameters that were systematically varied throughout the experiments. In Section 7.4 we present the simulation results obtained. 7.1. Performance Metrics The update latency at a node is defined as the time difference between sending an update request at an observer node and accepting it at a DB node. For example, consider an update request that is passed from an observer to node n1 first and then from n1 to node n2 . Assume further that it takes time t1 for processing and sending the request from the observer to n1 and time t2 for the communication between n1 and n2 . The update latency accounted for that update request will be t1 at node n1 and t1 + t2 at node n2 . The update latency is only measured if an update request is accepted by a DB node. Three reasons can lead to situations in which DB nodes do not accept an update request: the update request is lost due to network congestion, the update request is rejected because of message re-ordering, or the update request is rejected because the ordering information available is not sufficient to decide about the order. The update miss ratio and the gap length are used to take these effects into account. The update miss ratio is defined for each DB node and each object x as the ratio between the number of missed updates for an object x and the total number of updates sent by all observers for x. 7.2. Network Load Model The network load model is characterized by the pattern of object observations and by the amount of background traffic in the network. The observation pattern of an object is determined by the relative speed between observer nodes and the object, the number of observers that observe an object, and the frequency of state changes at an object. For our experiments we used the following random movement pattern for objects which are observed by stationary observers that are arranged on a regular grid. Throughout a simulation run an object moves randomly from a given active observer to one of its neighbors with a given frequency. Additionally, neighbors of the active observer may make concurrent observations randomly with a probability pn within the 8 observation interval [t − tobs , t + tobs ]. The time t specifies the time at which the active observer makes its observation. The presence of background traffic is simulated as follows. Each observer node randomly selects a time tbg in every observation interval at which it broadcasts a message to all DB nodes with a probability pbg . The message is broadcast using plain flooding. This means that the average number of background messages that have to be received, processed, and sent for every DB node is kept constant for a given number of observers and a given value of msg per node pbg . By choosing pbg = number of observers we vary the average number of background messages per DB node for a given number of observer. 0 bg msg, 18 obs 0 bg msg, 36 obs 0 bg msg, 100 obs 7.5 bg msg, 18 obs 7.5 bg msg, 36 obs 7.5 bg msg, 100 obs 15 bg msg, 18 obs 15 bg msg, 36 obs 15 bg msg, 100 obs 2.2 Average gap length 2 1.8 1.6 1.4 1.2 1 50 100 150 200 250 DB Nodes Figure 6. Average gap length 7.3. System Parameters 0.5 The forwarding primitive f-forward is realized as plain flooding as an upper bound with respect to message overhead and network load. Randomization is used to reduce the probability of collisions. The randomization period is randomly selected from the interval [0, 5] ms with uniform distribution. Messages are not acknowledged and are not repeated upon collisions. The database is replicated on ndb DB nodes. The area for all experiments is 875 m x 875 m. On this area we varied the number of observer nodes in an evenly arranged grid of 18, 36, and 100 observers. Observer nodes remain stationary throughout the simulation. The number of mobile DB nodes was varied between 50, 100, and 250. All DB nodes used the random-waypoint mobility model with speeds chosen uniformly between 0 and 5 m/s. The observation pattern used is the generic random movement pattern for objects where a new active observer was chosen every 5 seconds. The probability pn in all experiment is 0.5 to simulate concurrent observations due to overlapping observation rages of observers. Parameter tobs , determining the size of the observation interval, was 0.25 s. The simulated duration of a single experiment was 275 seconds. Each experiment included at least 55 observations and was repeated 5 times. The results were averaged. Average update miss ratio 0.4 0.3 0.2 0 bg msg, 18 obs 0 bg msg, 36 obs 0 bg msg, 100 obs 7.5 bg msg, 18 obs 7.5 bg msg, 36 obs 7.5 bg msg, 100 obs 15 bg msg, 18 obs 15 bg msg, 36 obs 15 bg msg, 100 obs 0.1 0 50 100 150 200 250 DB Nodes Figure 7. Average update miss ratio 7.4.1. Update Latency Figure 4 shows the simulation results for the update latency as a function of the number of DB nodes. The results only account for update requests that were accepted by the algorithm ignoring update requests that were received but rejected. The update latency is depending mainly on the level of background traffic, i.e., the largest growth of the latency can be observed between different levels of background traffic. Figure 5 shows the distribution of latencies for 250 DB nodes, 100 observers, and 15 messages background traffic. In this scenario the average latency is 0.26 s. The distribution shows that approximately 50% of all latencies are less than the average (the sum of the first two bars in Figure 5). This indicates that the average taken for the update latency is a representative aggregation. 7.4. Simulation Results We evaluated our algorithm with the network simulator ns2 [2]. For the simulation experiments the MAC implementation of IEEE 802.11 supplied with the network simulator was used for both DB nodes and observer nodes. The implementation of our algorithm uses only MAC-layer broadcast messages sent to one hop neighbor nodes that are within transmission range of 250 m, i.e. no additional routing protocol is used. In the figures, the level of background traffic is denoted as ”bg msg” and the number of observers is denoted as ”obs”. 7.4.2. Gap Length and Update Miss Ratio The simulation results for the gap length and update miss ratio are presented in Figure 6 and Figure 7. The jitter of the transmission latency between observer and DB node is close to 0 under the assumption that messages are not delayed in the interface queue at observers (within the observation interval each observer sends at most two messages). Under the ad- 9 state information of objects available to client applications by local read operations is on average 1.9 update requests older than the most current update request in the system. Future work includes the extension of our algorithm in such way that the chronological ordering of update requests can be derived for multiple objects, e.g., to decide whether the state change for object x was sensed before the state change of object y or vice versa. Moreover, we will investigate the impact of other flooding schemes on the performance of our algorithm. ditional assumption that the variance of the time to make an observation at observers is 0, i.e. if the time to fetch raw data from a sensor as well as the processing time are constant, we can conclude that the state propagation delay varies only within the randomization interval of [0,5] ms. In the evaluation of the simulation results a gap was counted as soon as at least one update request was missed by a DB node, neglecting the possible presence of concurrent update requests. The gap length does not quantify gaps of length 0, i.e. those cases where two consecutive update requests were accepted. This leads to a minimum gap length of 1, unless no gaps are encountered at all. The presented results therefore give an upper bound for the gap length according to the definition in Section 7.1. The maximum average gap length encountered was 1.9, even for a high overall update miss ratio due to heavy background traffic and concurrent update requests. The object state information available for clients in the local copies of the DB is therefore on average 1.9 update requests older than the most current update request in the whole system. This shows that DB nodes are updated regularly providing recent information to client applications. References [1] Electronic product code (epc). http://www. epcglobalinc.org. [2] Simulator ns2. http://www.isi.edu/nsnam/. [3] S. B. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in a partitioned network: A survey. ACM Computing Surveys (CSUR), 17(3):341–370, 1985. [4] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proc. of the 6th ACM PODC, pages 1–12, 1987. [5] T. Haerder and A. Reuter. Principles of transaction-oriented database recovery. ACM Computing Surveys (CSUR), 15(4):287–317, 1983. [6] W. Heinzelman, J. Kulik, and H. Balakrishnan. Adaptive protocols for information dissemination in wireless sensor networks. In Proc. of the 5th ACM/IEEE MobiCom, pages 174–185, 1999. [7] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. on Programming Languages and Systems, 12(3):463–492, 1990. [8] C. Ho, K. Obraczka, G. Tsudik, and K. Viswanath. Flooding for reliable multicast in multi-hop ad hoc networks. In Proc. of the 3rd ACM DIALM Workshop, pages 64–71, 1999. [9] G. Karumanchi, S. Muralidharan, and R. Prakash. Information dissemination in partitionable mobile ad hoc networks. In Proc. of 18th IEEE Symposium on Reliable Distributed Systems, pages 4–13. IEEE Computer Society, 1999. [10] P. J. Keleher and U. Cetintemel. Consistency management in deno. ACM/Kluwer MONET, 5(4):299–309, 2000. [11] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Comm. of the ACM, 21(7):558–565, 1978. [12] J. Luo, J.-P. Hubaux, and P. Th. Eugster. Pan: Providing reliable storage in mobile ad hoc networks with probabilistic quorum systems. In Proc. of the 4th ACM MobiHoc, pages 1–12, 2003. [13] S.-Y. Ni, Y.-C. Tseng, Y.-S. Chen, and J.-P. Sheu. The broadcast storm problem in a mobile ad hoc network. In Proc. of the 5th ACM/IEEE MobiCom, pages 151–162, 1999. [14] B. Williams and T. Camp. Comparison of broadcasting techniques for mobile ad hoc networks. In Proc. of the 3rd ACM MobiHoc, pages 194–205, 2002. [15] B. Xu, O. Wolfson, and S. Chamberlain. Spatially distributed databases on sensors. In Proc. of the 8th ACM GIS, pages 153–160, 2000. 7.4.3. Message Overhead The message overhead is determined by the number of messages sent by each node and the length of the individual messages sent by the nodes. A DB node f-forwards a received update request if it either updates its local state record or if the local ordering graph has been changed. The number of messages sent per update request and per DB node is, depending on the update miss ratio, between 0.7 and 0.95 messages per DB node per observation. These values were measured for high and low update miss ratios, respectively. The message length is directly related to the number of vertices in the ordering graph which in turn depends linearly on the number of observers that send update requests for an object. The only other quantity that influences the message size is the size of the state of an object which was kept constant at 10 bytes. The average values for the message size measured during the experiments were between 126 and 370 bytes. 8. Conclusion and Future Work In this paper we have introduced update-linearizability, a consistency concept that preserves the chronological ordering of update requests caused by state transitions of real world objects which are captured by sensors. Further, we have presented a replication algorithm suitable for MANETs, which guarantees update-linearizability even in the presence of multiple independent update sources for a given object. Our algorithm does not require synchronized clocks on any node. Even in networks with a high network load, replicated copies of information objects are updated regularly. Simulation experiments show that the replicated 10