Survey of Multiversion and Replicated Databases
Xinyan Feng
December 1999

1. Introduction

Concurrency control protocols are the part of a database system that ensures the correctness of the database. Conventional single-version concurrency control mechanisms basically fall into one of the following three categories:

- Locking-based protocols, such as Two-phase Locking (2PL)
- Timestamp-based protocols, such as Timestamp Ordering (TSO)
- Optimistic protocols, such as Kung & Robinson Optimism (OP)

In a single-version system, the same data is used for both read and write operations, so the conflict rate among executing transactions can be fairly high. Many protocols have been proposed to increase the level of concurrency. Among the many choices, multiversioning and data replication have a wide range of variants.

Multiversioning and replication address different problems. Multiversion databases provide slightly older versions of data items for transactions to read, while write operations create new versions; this can be used in a centralized or a distributed environment. Replicated data, on the other hand, provides a system with higher data availability by keeping the same data at multiple sites, so this mechanism is used in a distributed database environment to increase concurrency and data availability. However, from the point of view that both mechanisms try to increase the level of concurrency by providing data redundancy, they have a lot in common. This survey looks into developments in both. Also, since both multiversion databases and distributed databases with data replication raise many other issues related to concurrency control, some of those issues are addressed as well.

This survey is organized into the following sections: Section 2 looks back at the concepts of a single-version centralized database, its concurrency control mechanisms in particular. Section 3 discusses multiversion databases, including their concurrency control mechanisms and performance, the interaction with the object-oriented paradigm, the interaction with the notion of time validity in real-time systems, and version control issues. Section 4 discusses replicated databases; issues such as concurrency control and replication control are addressed. Section 5 gives a short summary of the survey.

2. Single Version Concurrency Control Protocols

2.1 Two-phase Locking

Locking-based concurrency control algorithms are the most popular ones in the commercial world. A lock is a variable associated with a data object that describes the status of the object with respect to the operations that can be applied to it. Common locks include read locks and write locks. Two-phase Locking was developed to guarantee serializability of locking-based executions. Two-phase locking has a growing phase and a shrinking phase. During the growing phase, new locks on database items can be acquired but no lock is released. During the shrinking phase, previously acquired locks are released and no new locks are granted. Two-phase locking preserves serializability. Nonetheless, deadlocks can still occur, and they have to be resolved by deadlock detection or prevention mechanisms. Another disadvantage of two-phase locking is that it limits the level of concurrency in the system.

2.2 Timestamp Ordering

In Timestamp Ordering, each data item has associated with it a read timestamp and a write timestamp, which are used to schedule conflicting data accesses. Deadlocks cannot happen, thanks to the nature of the conflict resolution.
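As a concrete illustration of these timestamp rules, the sketch below (Python, with hypothetical names such as DataItem and TimestampScheduler) applies the basic read and write tests and restarts a transaction whose operation is rejected. It is a minimal illustration of the idea, not a faithful implementation of any published protocol.

```python
class DataItem:
    """A single data item with basic timestamp-ordering metadata."""
    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0   # timestamp of the youngest accepted reader
        self.write_ts = 0  # timestamp of the youngest accepted writer

class RestartTransaction(Exception):
    """Raised when the TSO rule rejects an operation; the transaction restarts."""

class TimestampScheduler:
    def __init__(self):
        self.clock = 0

    def start(self):
        """Assign a unique startup timestamp to a new (or restarted) transaction."""
        self.clock += 1
        return self.clock

    def read(self, ts, item):
        # A transaction may not read a value written by a younger transaction.
        if ts < item.write_ts:
            raise RestartTransaction()
        item.read_ts = max(item.read_ts, ts)
        return item.value

    def write(self, ts, item, value):
        # A transaction may not overwrite an item already read or written
        # by a younger transaction.
        if ts < item.read_ts or ts < item.write_ts:
            raise RestartTransaction()
        item.write_ts = ts
        item.value = value

# Example: an older writer is rejected after a younger reader has seen the item.
sched = TimestampScheduler()
x = DataItem(10)
t1, t2 = sched.start(), sched.start()
sched.read(t2, x)            # the younger transaction reads x
try:
    sched.write(t1, x, 99)   # the older transaction now tries to write x
except RestartTransaction:
    print("t1 must be restarted with a fresh timestamp")
```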
Starvation might occur, however: a transaction may be started and aborted many times before it finishes. Another disadvantage of Timestamp Ordering is cascading aborts. When a transaction aborts, all the transactions that have read data values updated by that transaction have to abort too.

2.3 Optimistic Concurrency Control

Optimistic Concurrency Control is designed to reduce the overhead of locking. No locks are involved in this mechanism, and it is therefore deadlock free. The protocol consists of three phases. The first is a read phase, during which a transaction can read database items and perform preparatory writes (in the local workspace of the transaction). Then comes the validation phase, in which the serializability of the transaction is checked. If the transaction passes the check, all of its updates are applied to the database items in the write phase; otherwise, the local updates are discarded (and the transaction restarted). During the validation phase for transaction Ti, the system checks Ti against each transaction Tj that is either committed or in its validation phase. The transaction is validated if one of the following three conditions holds for every such Tj:

1. Ti and Tj do not overlap in time;
2. Read_Set(Ti) ∩ Write_Set(Tj) = ∅, and their write phases do not overlap in time;
3. (Read_Set(Ti) ∪ Write_Set(Ti)) ∩ Write_Set(Tj) = ∅, otherwise.

Otherwise, the transaction is aborted and restarted later. The efficiency of Optimistic Concurrency Control largely depends on the level of conflict between transactions. It is easy to see that efficiency will be good under a low conflict rate, since the overhead is smaller than that of other algorithms. When conflict rates are high, however, throughput will be low because of the higher rate of restarts.

3. Multiple Versions

Multiple versions of data are used in a database system to support higher transaction concurrency and system recovery. The higher concurrency results because read requests can be serviced with older versions of data. These algorithms are particularly effective for long queries, which otherwise might never finish due to the high probability of conflict with other transactions. Notice, however, that a long-running query may still have to be aborted when some of the versions it needs have been garbage collected prematurely. Serializability theory has been extended to this area, and research has shown that multiversion algorithms can provide strictly more concurrency than single-version algorithms.

The basic idea behind multiversion concurrency control is to maintain one or more old versions of data in the database in order to allow work to proceed using both the latest version of the data and some old versions. Some algorithms propose two versions, a current version and an older one. Other algorithms keep many old versions around. We put the emphasis on the latter.

In a multiversion database, each write operation on an object produces a new version. A read operation on the object is performed by returning the value from an appropriate version in the list. One thing worth pointing out is that the existence of multiple versions is visible only to the scheduler, not to the user transactions that refer to the object. In other words, it is transparent to user applications.
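To make the versioning idea just described concrete, the sketch below (Python, with a hypothetical MultiversionObject class) keeps a list of committed versions per object: each write appends a new version stamped with the writer's transaction number, and a read returns the latest version whose stamp does not exceed the reader's number. It is only meant to show how multiple versions can be kept transparent to the transactions that use them.

```python
class MultiversionObject:
    """Keeps all versions of one logical data item, ordered by version number."""
    def __init__(self, initial_value, initial_version=0):
        self.versions = [(initial_version, initial_value)]  # kept sorted by version number

    def write(self, txn_number, value):
        # Every write creates a new version stamped with the writer's number;
        # nothing is overwritten, so the notion of a "final write" disappears.
        self.versions.append((txn_number, value))
        self.versions.sort(key=lambda v: v[0])

    def read(self, txn_number):
        # Return the newest version whose stamp does not exceed the reader's number.
        candidates = [value for number, value in self.versions if number <= txn_number]
        return candidates[-1]

# Example: a reader with number 5 sees version 3, not the later version 8.
x = MultiversionObject("v0")
x.write(3, "v3")
x.write(8, "v8")
print(x.read(5))   # -> "v3"
print(x.read(9))   # -> "v8"
```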
3.1 Multiversion History

In a multiversion database, an object x has versions xi, xj, ..., where the subscripts i, j are version numbers, which often correspond to the index or transaction number of the transaction that wrote the version. A multiversion (MV) history H represents a sequence of operations on these versions. Each write operation wi[x] in an MV history is mapped into wi[xi], and each read ri[x] into ri[xj] for some j, where xj is a version of x created by transaction Tj. The notion of final writes can be omitted, since every write results in a new entity being created in the database.

Correctness of a multiversion database is determined by one-copy (1C) serializability. An MV history is one-copy serializable if it is equivalent to a serial history over the same set of transactions executed over a single-version database. The serialization graph SG(H) of an MV history H is a directed graph whose nodes represent transactions and which has an edge Ti → Tj whenever one of Ti's operations precedes and conflicts with one of Tj's operations. Unlike in a single-version database, SG(H) by itself does not contain enough information to determine whether H is one-copy serializable. To determine whether an MV history is one-copy serializable, a modified serialization graph is used. Given an MV history H, a multiversion serialization graph MVSG(H) is SG(H) with additional edges such that the following conditions hold:

1. for each object x, MVSG(H) has a total order (denoted <<x) on all transactions that write x, and
2. for each object x, if Tj reads x from Ti and Ti <<x Tk, then MVSG(H) has an edge from Tj to Tk (i.e., Tj → Tk); otherwise, if Tk <<x Ti, then MVSG(H) has an edge from Tk to Ti (i.e., Tk → Ti).

The additional edges are called version order edges. An MV history H is one-copy serializable if MVSG(H) is acyclic.

3.2 Multiversion Concurrency Control Protocols

To support this versioning capability and improve database performance, multiversion concurrency control protocols have been developed as extensions of the basic single-version protocols. For single-version databases we have Two-phase Locking, Timestamp Ordering and Optimistic Concurrency Control; correspondingly, for multiversion databases there are Multiversion Two-phase Locking (MV2PL), Multiversion Timestamp Ordering (MVTSO), and Multiversion Optimistic Concurrency Control.

3.2.1 Multiversion Two-phase Locking

MV2PL is an extension of single-version two-phase locking. One implementation variant is the CCA version pool algorithm. It works as follows. Each transaction T is assigned a startup timestamp S-TS(T) when it begins running and a commit timestamp C-TS(T) when it reaches its commit point. Transactions are classified as read-only or update. When an update transaction reads or writes a data item, it locks the item, and it reads or writes the most recent version. When an item is written, a new version of the item is created. Every version of an item is stamped with the commit timestamp of its creator. Since the timestamp associated with a version is the commit timestamp of its writer, a read-only transaction T is made to read versions written by transactions that committed before T started. T is thus serialized after all transactions that committed prior to its startup, but before all transactions that are active during its lifetime. Another variant keeps a completed transaction list, which records all update transactions that have committed successfully up to that time.
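A minimal sketch of the read-only path of the version-pool scheme described above (Python, hypothetical names): versions are stamped with the commit timestamps of their creators, and a read-only transaction with startup timestamp S-TS(T) reads, for each item, the version with the largest commit timestamp smaller than S-TS(T). Update transactions are assumed to be handled by ordinary two-phase locking and are omitted here.

```python
class VersionPoolItem:
    """One data item whose committed versions are stamped with their creators'
    commit timestamps, as in the CCA version pool algorithm."""
    def __init__(self, initial_value):
        self.versions = [(0, initial_value)]   # (commit_ts, value), kept sorted

    def install(self, commit_ts, value):
        # Called when an update transaction commits: the new version is stamped
        # with the writer's commit timestamp C-TS(T).
        self.versions.append((commit_ts, value))
        self.versions.sort(key=lambda v: v[0])

    def read_as_of(self, startup_ts):
        # A read-only transaction with startup timestamp S-TS(T) sees the version
        # with the largest commit timestamp smaller than S-TS(T), i.e. it is
        # serialized after everything that committed before it started.
        eligible = [value for commit_ts, value in self.versions if commit_ts < startup_ts]
        return eligible[-1]

# Example: a query that started at time 15 does not see the update committed at 20.
x = VersionPoolItem("x0")
x.install(10, "x@10")
x.install(20, "x@20")
print(x.read_as_of(15))   # -> "x@10"
```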
One drawback of this algorithm is the maintenance and use of the completed transaction list. The execution of a read operation of a read-only transaction involves finding the version of the object with the largest timestamp smaller than the startup timestamp of the transaction, and ensuring that the creator of that version appears in the transaction's copy of the completed transaction list. This approach is cumbersome and complex to deal with. Another drawback appears in the extension to distributed databases. Although the protocol guarantees a consistent view to a read-only transaction, it does not guarantee global serializability for read-only transactions. The protocol also requires that a read-only transaction have a priori knowledge of the set of sites where it is going to perform reads, because a global completed transaction list must be constructed from copies of the local completed transaction lists at the respective sites before the read-only transaction begins execution. Thus, the complexity of the protocol increases when it is used in a distributed database environment.

MV2PL is proven to preserve serializability. However, it can still cause deadlocks. The deadlocks can be detected with the same techniques as in a single-version database, but to break deadlocks some transactions must be aborted, and cascading aborts may occur.

3.2.2 Partitioned Two-phase Locking

Multiversion Two-phase Locking gives decent performance in many situations. Notice, however, that these algorithms only distinguish read-only transactions from update transactions. Among update transactions, some perform read-only access to certain portions of the database and write into other portions. A concurrency control algorithm that only distinguishes update transactions from read-only transactions treats such transactions as regular update transactions and subjects them to the usual two-phase locking protocol.

Hierarchical timestamps (HTS) were proposed to take advantage of such a decomposition. HTS allows a transaction that writes into one data partition but reads from other, higher data partitions (because of read-write dependencies) to read from the latter using a timestamp different from the transaction's initiation timestamp. In essence, this ensures that the transaction will not interfere with concurrent updates on the higher data partitions performed by other transactions. Building on the idea behind HTS, partitioned two-phase locking extends the version pool method. Concurrency control for synchronizing update transactions in partitioned two-phase locking is composed of two protocols: one for intraclass access and one for read-only access to higher data partitions. The former is equivalent to the multiversion two-phase locking method, where both read and write accesses result in locking of the accessed data element. The latter grants a particular version of the data to the requesting transaction and requires no locking. However, although the paper mentions MV2PL, it is not clear why multiple versions are involved, because the available analysis seems to address the scenario of a partitioned distributed database, and the emphasis does not seem to be related to versioning issues.

3.2.3 Multiversion Timestamp Ordering

In multiversion timestamp ordering, old versions of data are used to speed up the processing of read requests. Different variants of the algorithm treat write requests differently; however, all of them are based on basic TSO.
Write requests are synchronized using basic timestamp ordering on the most recent versions of data items. Read requests, on the other hand, are always granted, possibly using older versions of objects. Each transaction T has a startup timestamp S-TS(T), which is issued when T begins execution. The most recent version of an item X has a read timestamp R-TS(X) and a write timestamp W-TS(X), which record the startup timestamps of the latest reader and of the writer of this version. A write request is granted if S-TS(T)>=R-TS(X) and S-TS(T)>=W-TS(X). Transactions whose write requests are not granted are restarted. When a read or write request is made for an object with a pending write request from an older transaction, the read or write request is blocked until the older transaction either commits or aborts. Another algorithm creates a new version for every write; its read mechanism is similar to the one described above.

MVTSO ensures serializability. It shares TSO's potential for the cyclic restart problem, and cascading aborts may happen. A variant of multiversion timestamp ordering in the literature avoids cascading aborts by using a realistic recovery protocol. Read requests receive better service because they are never rejected. However, several drawbacks exist. First, read operations issued by read-only transactions may still be blocked by a pending write. Second, read-only operations incur a significant concurrency control overhead, since they must update certain information associated with versions. Two-phase commit may be needed for distributed read-only transactions, and a read-only transaction may even cause an update transaction to abort.

3.2.4 Multiversion Optimistic Concurrency Control

One mechanism is Multiversion Serial Validation (MVSV). Each transaction is assigned a startup timestamp S-TS(T) at startup time and a commit timestamp C-TS(T) when it enters its commit phase. A write timestamp TS(X) is maintained for each data item X, which is the commit timestamp of the most recent writer of X. A transaction T is validated at commit time if S-TS(T)>TS(X) for each object X in its read set. It then sets TS(X) equal to C-TS(T) for all data items in its write set.

Another example of multiversion optimistic concurrency control is Time Warp (TW). Transactions communicate through timestamped messages. Active processes (transactions) are allowed to process incoming messages without concern for conflicts, much as in the optimistic mechanism. A conflict occurs when a process receives a message from another process whose timestamp is less than the timestamp of the process; whenever this happens, the transaction is rolled back to a time earlier than the timestamp of the message. Comparing these two mechanisms, the difference lies in the unit of rollback. In MVSV, the unit is the whole transaction. In contrast, TW does not roll back the whole transaction but just one step. As a result, the total time lost in the rollback process is smaller.
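The MVTSO write test and the MVSV commit-time check described in this section lend themselves to a short sketch (Python, hypothetical names; one possible reading of the rules rather than a faithful implementation). The first function applies the MVTSO condition S-TS(T)>=R-TS(X) and S-TS(T)>=W-TS(X) against the most recent version; the second performs the MVSV check S-TS(T)>TS(X) for every item in the read set.

```python
class Version:
    """Most recent version of an item, with MVTSO read/write timestamps."""
    def __init__(self):
        self.r_ts = 0   # startup timestamp of the latest reader, R-TS(X)
        self.w_ts = 0   # startup timestamp of the writer, W-TS(X)

def mvtso_write_allowed(s_ts, version):
    # A write is granted only if no younger transaction has read or written
    # the most recent version; otherwise the writer is restarted.
    return s_ts >= version.r_ts and s_ts >= version.w_ts

def mvsv_validate(s_ts, read_set, write_ts):
    # Multiversion Serial Validation: the transaction commits only if every
    # object it read is still current, i.e. S-TS(T) > TS(X) for all X read.
    return all(s_ts > write_ts[x] for x in read_set)

def mvsv_install(c_ts, write_set, write_ts):
    # On successful validation, TS(X) becomes the commit timestamp C-TS(T)
    # for every item in the write set.
    for x in write_set:
        write_ts[x] = c_ts

# MVTSO example: a younger reader (timestamp 6) blocks an older writer (timestamp 4).
v = Version()
v.r_ts = 6
print(mvtso_write_allowed(s_ts=4, version=v))   # False -> restart the writer

# MVSV example: T read {"a"}, but another writer committed "a" at time 7.
write_ts = {"a": 7, "b": 2}
print(mvsv_validate(s_ts=5, read_set={"a"}, write_ts=write_ts))        # False -> restart
print(mvsv_validate(s_ts=9, read_set={"a", "b"}, write_ts=write_ts))   # True
mvsv_install(c_ts=12, write_set={"b"}, write_ts=write_ts)              # TS(b) becomes 12
```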
3.3 Experiments

3.3.1 About Simulation Models and Results

There are three major parts of a concurrency control performance model: a database system model, a user model, and a transaction model. The database system model captures the relevant characteristics of the system's hardware and software, including the physical resources and their associated schedulers, the characteristics of the database, the load control mechanism for controlling the number of active transactions, and the concurrency control algorithm itself. The user model captures the arrival processes for users and the nature of users' transactions (for example, interactive or batch). Finally, the transaction model captures the reference behavior and processing requirements of the transactions in the workload. A concurrency control performance model that fails to include any of these three major parts is in some sense incomplete. In addition, if there is more than one class of transactions in the workload with different features, the transaction model must specify the mix of transaction classes.

Besides the model structure, there are some common assumptions that simulation experiments make about the system. The first is the assumption of infinite resources. Some studies compare concurrency control algorithms assuming that transactions progress at a rate independent of the number of active transactions. In other words, transactions proceed in parallel, which is only possible in a system with enough resources that transactions never have to wait for CPU or I/O service. The second assumption is the fake restart assumption. Several models assume that a restarted transaction is replaced by a new, independent transaction, rather than the same transaction being run again. In particular, this model is nearly always used in analytical models in order to make the modeling of restarts tractable. The third assumption concerns write-lock acquisition. A number of studies that distinguish between read and write locks assume that read locks are set on read-only items and that write locks are set on the items to be updated when they are first read. In reality, however, transactions often acquire a read lock on an item, examine it, and then request an upgrade to a write lock only when they want to change the value. These simplified assumptions give researchers more flexibility to focus on whatever they want to investigate. Nonetheless, they make it harder to compare one simulation to another, and even harder to apply the results to a real-life scenario.

3.3.2 Experiment Results

Many experiments have been conducted, and results published, comparing single-version and multiversion concurrency control performance. However, because of the many different simulation models and assumptions used, the results turn out to conflict with one another. Simulation studies of multiversion databases evaluate the performance of various concurrency control algorithms. They also address questions such as the CPU and I/O costs associated with locating and accessing old versions and the overall cost of maintaining multiple versions of data in the database. A centralized setting is used in order to isolate the effects of multiple versions on performance. Several metrics are applied to analyze system performance; the most common are throughput, average response time, number of disk accesses per read, work wasted due to restarts, and space required for old versions.

One experiment compares the performance of 2PL, TSO, OP, MV2PL, MVTSO and TW. In its simulation model, all reads, writes, aborts and commits are uniformly distributed.
A transaction is immediately restarted after it is aborted, and a new transaction immediately replaces a committed transaction. System variables include the concurrency level, the read-write ratio and the transaction size. The number of aborts, blocks and partial rollbacks, the mean number of transactions in execution, the mean wait time, the conflict probability, and the throughput of the system are examined as performance measures. The results show that all multiversion protocols outperform their single-version counterparts. In order of high-to-low performance, we have TW, MV2PL, MVTSO, TSO, 2PL, and OP. The improvement is particularly impressive for MV2PL: under medium workload, there is a 277% increase in throughput over the single-version case. However, MVTSO outperforms MV2PL in the high-write scenario, and TW is more sensitive to the write ratio than the other multiversion concurrency control algorithms. The difference in performance between multiversion and single-version protocols grows as the concurrency level increases, which makes multiversion protocols desirable candidates for heavily loaded systems. For a system operating under a low-write scenario, TW would be the best choice. For a system operating under a medium or high write ratio, TW can also be the choice if the concurrency level is not too high. MV2PL is a good concurrency control protocol under the high-concurrency, medium-write scenario, whereas MVTSO performs better under the high-write situation. The study also showed that the relative performance of the different protocols varies with parameters such as the concurrency level and the read-write ratio.

Other work compared the performance of 2PL, TSO, OP, MV2PL, MVTSO and MVSV, and it leads to similar conclusions. The results of that experiment show that for both read-only transactions and small update transactions, the three multiversion algorithms provide almost identical throughput. The reason is that, given the small size of the update transactions, almost all conflicts are between read-only and update transactions. All three multiversion algorithms eliminate this source of conflicts, allowing read-only transactions to execute on older versions of objects and requiring update transactions to compete only among themselves for access to the objects in the database. On the other hand, the three single-version algorithms provide quite different performance tradeoffs between the two transaction classes. 2PL provides good throughput for the large read-only transactions, but the worst throughput for the update transactions. TSO and OP provide better performance for update transactions at the expense of the large read-only transactions. Both TSO and OP are biased against large read-only transactions because of their conflict resolution mechanisms. OP restarts a transaction at the end of its execution if any of the objects in its read set have been updated during its lifetime, which is likely for large transactions; they might be restarted over and over again before they can commit. Similarly, TSO restarts a transaction any time it attempts to read an object with a timestamp newer than its startup timestamp, meaning that the object has been updated by a transaction that started running after this transaction did. Again, this becomes very likely as the read-only transaction size increases. In contrast, 2PL has the opposite problem.
If the transaction sizes are large, read-only transactions can set and hold locks on a considerable number of objects in the database for a long time before releasing any of them. Update transactions that wish to update locked objects must then wait a long time to lock and update those objects. This is why the throughput of update transactions decreases significantly as the sizes of the read-only transactions increase. Besides the overall performance improvement provided by the multiversion concurrency control algorithms, MV2PL remedies the problem that 2PL has: for medium to large read-only transaction sizes, the response time of the update transactions degrades quickly as the read-only transaction size increases. Similarly, MVSV and MVTSO alleviate the problem that both OP and TSO have: as the read-only transaction sizes increase, the large read-only transactions quickly begin to starve because of updates made by the update transactions in the workload. As a result, all of them give more even performance to both kinds of transactions.

Other research work examined the CPU, I/O, and storage costs that result from the use of multiple versions. Multiple versions cause additional disk accesses for reading old data from disk. The storage overhead for maintaining old versions is not very large under most circumstances. However, it is important for version maintenance to be efficient, as otherwise the maintenance cost could outweigh the concurrency benefits. Read-only transactions incur no additional concurrency control costs in MV2PL and MVSV, but read-only transactions in MVTSO, and update transactions in all three algorithms, incur costs for setting locks, checking timestamps, or performing validation tests, depending on the algorithm considered.

3.4 Multiversion Concurrency Control in a Distributed and Partially Replicated Environment

The research results above use relatively simplified simulation models. They do not consider data distribution, and communication delay is simplified by lumping together CPU processing time, communication delays and I/O processing time for each transaction. In a distributed database system, however, the message overhead, the data distribution and the transaction type and size are among the most important parameters affecting performance. Some research work has investigated the behavior of the multiversion concurrency control algorithms over a partitioned and partially replicated database. A read is scheduled on only one copy of a particular version, while a write creates new versions of all copies of the data item. When a transaction enters the system, it is divided into subtransactions, each of which is sent to the relevant node. All requests of a subtransaction are satisfied at one node, and if a subtransaction fails, the entire transaction is rolled back. The system implements a read-one-write-all policy, and the model uses the two-phase commit protocol. The simulation results show that the effect of message overhead declines with increasing MPL (multiprogramming level). This research work seems rather preliminary: not much in-depth discussion of the interaction between multiversioning and data distribution and replication is given, and it is not convincing why the simulation model is designed the way it is for the purpose it wants to achieve.

3.5 Version Control

Notice that each proposed multiversion concurrency control protocol employs a different approach to integrating multiple versions of data with the desired concurrency control protocol.
Version control components are thus very closely tied to the concurrency control units. In contrast, protocols for replicated data are clearly divided into two units: the concurrency control component and the replication control component. A similar division here, however, would allow modular development of new protocols as well as simplify the task of correctness proofs. It should also make it easier to extend these protocols to a distributed environment.

Version control mechanisms have been proposed that can be integrated with any conflict-based concurrency control protocol. One such modular version control mechanism can be integrated with an abstract concurrency control component. The basic assumptions about the system are as follows. First, transactions are classified into read-only and read-write (update) transactions, with update as the default. Second, the execution of update transactions is assumed to be synchronized by a concurrency control protocol that guarantees some serial order. A read-write transaction T is assigned a transaction number tn(T), which is unique and corresponds to the serial order. (It can be shown that conflict-based concurrency control protocols can be modified to assign such numbers to transactions.)

The most interesting feature of this mechanism is that read-only transactions are independent of the underlying concurrency control protocol. These transactions do not interact with the concurrency control module at all, and make only one call to the version control module at the beginning; afterwards, their existence remains transparent to both the concurrency control and version control modules. Therefore, the overhead associated with read-only transactions in this scheme is almost negligible. Unlike multiversion timestamp ordering, the version control mechanism guarantees that a read-only transaction cannot delay or abort read-write transactions. The execution of read-only transactions is relatively simple compared to that in multiversion two-phase locking. The mechanism is also easy to integrate with garbage collection of old versions. In this scenario, the concurrency control, version control and garbage collection units are relatively independent of each other: the garbage collection scheme does not interact with the read-write transactions, and the concurrency control component does not interact with the read-only transactions. However, in order to achieve the advantages mentioned above, the version control mechanism trades off some system performance efficiency. Several techniques have been proposed to rectify this problem.

3.6 Interaction with the Object Oriented Paradigm

Multiversion Object Oriented Databases (MOODB) support all traditional database functions, such as concurrency control, recovery and user authorization, as well as the object-oriented paradigm, including object encapsulation and identification, class inheritance, and object version derivation. In an inheritance subtree, there might be a schema change or a query. Examples of schema changes include adding or deleting an attribute or a method to or from a particular class, changing the domain of an attribute, or changing the inheritance hierarchy. In other words, a schema change works at the meta-information level. In a multiversion object-oriented database, after the creation of a class instance, i.e., an object, versions of it may be created. New versions may be derived, which in turn may become the parents of further new versions. Versions of an object are organized in a hierarchy, called the version derivation hierarchy.
It captures the evolution of the object and partially orders its versions. The problem of concurrency control in a multiversion object-oriented database cannot be solved with traditional methods, because traditional concurrency control mechanisms do not address the semantic relationships between classes, objects and object versions, i.e., class instantiation, inheritance, and version derivation. In an object-oriented paradigm, a transaction accessing a particular class virtually accesses all its instances and all its subclasses. Correspondingly, virtual access to an object by a transaction does not mean that the object is read or modified by it; it simply means that a concurrent update of the object by another transaction must be excluded. Another problem relates to the update of an object version that is not a leaf of the derivation hierarchy: if the update has to be propagated to all the derived object versions, access to the whole derivation subtree by other transactions has to be excluded.

Early work in this area considered the inheritance hierarchy but not the version derivation hierarchy. One approach applies hierarchical locking to two types of granules only: a class and its instances. If an access to a class also concerns its subclasses, all of them have to be locked explicitly. Another approach applies hierarchical locking to the class-instance hierarchy and the class inheritance hierarchy. Before the basic locking of a class, or of a class with its subclasses, all its superclasses have to be intention locked. When a transaction accesses the leaves of the inheritance hierarchy, this leads to a great number of intention locks being set. One can imagine that it would be hard to apply this hierarchical locking mechanism to the version hierarchy, because the number of levels can be very large.

Another way to deal with the problem is stamp locking, which covers both the inheritance and the version derivation hierarchies. The main idea is to extend the notion of a lock so that it contains information about the position of the locked subtree in the whole hierarchy. A stamp lock is a pair SL = (lm, ns), where lm denotes a lock mode and ns denotes a node stamp. Lock modes describe the properties of stamp locks; shared and exclusive locks are used. A node stamp is a sequence of numbers constructed in such a way that all of a node's ancestors are identifiable: if a node is the n-th child of its parent whose node stamp is p, then the child node has stamp p.n, and the root node is stamped 0. In this scheme, two stamp locks are compatible if and only if their lock modes are compatible or their scopes have no common nodes. To examine the relationship between the scopes of two stamp locks, it is sufficient to compare the node stamps.

So far only one hierarchy has been considered. The concept, however, may be extended to more than one hierarchy, provided that they are orthogonal to each other. To enable simultaneous locking in many orthogonal hierarchies, it is necessary to extend the notion of the stamp lock so that it contains multiple node stamps for the different hierarchies; the logic for determining the compatibility of stamp locks is extended accordingly. In particular, when both the version and the class hierarchies are considered, the stamp lock definition is extended to the tuple SL = (lm, ns1, ns2), where lm is a stamp lock mode and ns1, ns2 are node stamps for the first and the second hierarchy.
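A small sketch of the stamp-lock compatibility test (Python, hypothetical names): node stamps are represented as dotted paths such as "0.2.1", the scopes of two locks are disjoint exactly when neither stamp identifies the other node or one of its ancestors, and two stamp locks are compatible when their modes are compatible or their scopes share no node. The second hierarchy of the extended tuple (lm, ns1, ns2) would be handled analogously, so it is omitted here.

```python
def is_ancestor_or_equal(a, b):
    """True if node stamp a identifies b or one of b's ancestors ('0.2' covers '0.2.1')."""
    return b == a or b.startswith(a + ".")

def scopes_overlap(ns1, ns2):
    # The subtrees rooted at two nodes share a node iff one stamp is equal to,
    # or a prefix of, the other.
    return is_ancestor_or_equal(ns1, ns2) or is_ancestor_or_equal(ns2, ns1)

def modes_compatible(lm1, lm2):
    # Only shared ('S') locks are mutually compatible; an exclusive ('X') lock conflicts.
    return lm1 == "S" and lm2 == "S"

def stamp_locks_compatible(lock1, lock2):
    """lock = (lock_mode, node_stamp); compatible if modes agree or scopes are disjoint."""
    (lm1, ns1), (lm2, ns2) = lock1, lock2
    return modes_compatible(lm1, lm2) or not scopes_overlap(ns1, ns2)

# Example: an exclusive lock on subtree 0.1 conflicts with a shared lock on node 0.1.3,
# but not with a shared lock on the sibling subtree 0.2.
print(stamp_locks_compatible(("X", "0.1"), ("S", "0.1.3")))   # False
print(stamp_locks_compatible(("X", "0.1"), ("S", "0.2")))     # True
```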
In some multiversion databases, updates of data items lead to a new version of the whole database. To create a new object version, a new database version has to be created, in which the new object version appears in the context of versions of all the other objects and respects the consistency constraints imposed. A stamp locking method can then be applied explicitly to two hierarchies, the database version derivation hierarchy and the inheritance hierarchy, and implicitly to the object version derivation hierarchy. The inheritance hierarchy, composed of classes, is orthogonal to both the database version derivation hierarchy and the object version derivation hierarchy, which are composed of database and object versions. Ten kinds of granules are used, ranging from the whole multiversion database down to a single version of a single object. A stamp lock set on the multiversion database is defined as a triple (lm, vs, is), while a stamp lock set on a multiversion object is defined as a pair (lm, vs). Here vs is a version stamp of the database version subtree or object version subtree, and is is an inheritance stamp of the class all or particular instances of which are locked. Depending on the lock mode lm, stamp locks may be grouped in four ways. Altogether, there are 12 kinds of stamp locks that can be set on a multiversion database, and a compatibility matrix among the 12 kinds of locks is also proposed.

3.7 Application in the Real Time Paradigm

Unlike traditional database transactions, real-time jobs have deadlines, beyond which their results have little value. Furthermore, many real-time jobs have to produce timely results as well, where more recent data are preferred over older data. Traditional database models do not take time or timeliness into consideration. Consequently, single-version concurrency control algorithms tend to push a job too far into the future by blocking or aborting it due to conflicts, causing it to miss its deadline. At the same time, multiversion concurrency control may push a job too far into the past (by reading an old version, for example) in an effort to avoid conflicts, thus producing obsolete results. In order to deal with the timeliness of transactions, real-time transactions may tolerate some limited amount of inconsistency, which to a degree sacrifices the serializability of the database system, although transactions prefer to read data versions that are still valid. In these situations, classic multiversion concurrency control algorithms become inadequate, since they do not support the notion of data timeliness.

One way to deal with the tradeoff between timeliness and data consistency is conservative scheduling of data access: data consistency is maintained by scheduling transactions under serializability (SR), while data timeliness is maintained explicitly by the application programs. In addition, the system must schedule the jobs carefully to guarantee that real-time transactions meet their deadlines. But this conservative approach is too restrictive, since SR may not be required in many real-time applications, where sufficiently close versions may be more important than strict SR. In other words, queries may read inconsistent data as long as the data versions being read are close enough to a serializable state in data value and timeliness.

3.7.1 Multiversion Epsilon Serializability (ESR)

Epsilon Serializability (ESR) has been proposed to manage and control inconsistency. It relaxes SR's consistency constraints. In ESR, each epsilon transaction (ET) has a specification of the inconsistency (fuzziness) allowed in its execution.
ESR increases transaction system concurrency by tolerating a bounded amount of inconsistency. Fuzziness is formally defined as the distance between a potentially inconsistent state and a known consistent state. So, for example, the time fuzziness of a non-SR execution of a query operation is defined as the distance in time between the version a query ET reads in an MVESR execution and the version that would have been read in an SR execution. In ESR, each transaction has a specification of the fuzziness allowed in its execution, called its epsilon specification (ε-spec). When the ε-spec is 0, an ET (epsilon transaction) reduces to a classic transaction and ESR reduces to SR. When ESR is applied to a database system that maintains multiple versions of data, it is called multiversion epsilon serializability (MVESR). Multiversion divergence control algorithms guarantee epsilon serializability of a multiversion database. MVESR is well suited to the use of multiversion databases for real-time applications that may trade a limited degree of data inconsistency for more data recency. The addition of the time dimension to multiversion concurrency control bounds both the value fuzziness of query results and the timeliness of the data. Non-serializable executions make it possible for queries to access more recent versions. As a result, version management can be greatly simplified, since only the most recent few versions need to be maintained.

3.7.2 Multiversion Divergence Control for Time and Value Fuzziness

Two MVESR algorithms were presented in this research work: one based on timestamp ordering multiversion concurrency control and the other based on two-phase locking multiversion concurrency control. In the ESR model, update transactions are serializable with each other, while queries (read-only transactions) need not be serializable with update transactions. Fuzziness is allowed only between update ETs (U_ET) and read-only ETs (Q_ET). The ε-spec of a U_ET refers to the amount of fuzziness it is allowed to export, while the ε-spec of a Q_ET refers to the amount of fuzziness it is allowed to import. The time interval [ts(Ti) - ε_spec, ts(Ti) + ε_spec] defines all the legitimate versions accessible by Ti, where ts(Ti) denotes the timestamp of transaction Ti: a version can be accessed by Ti if its timestamp is within Ti's ε-spec.

Accumulating time fuzziness is not a simple summation of two time intervals, which may underestimate or overestimate the total time fuzziness. A new operation on time intervals, TimeUnion, is therefore proposed: TimeUnion([a1, a2], [b1, b2]) = [min(a1, b1), max(a2, b2)]. The TimeUnion operator can be used to accurately accumulate the total time fuzziness of different intervals. Meanwhile, import_time_fuzziness(Qi) denotes the accumulated amount of time fuzziness that has been imported by a query Qi, and export_time_fuzziness(Uj) is the accumulated time fuzziness exported to queries by Uj. The objective of an MVESR algorithm is to maintain the following invariants for all Qi and Uj:

import_time_fuzziness(Qi) <= import_time_limit(Qi);
export_time_fuzziness(Uj) <= export_time_limit(Uj).
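Under the assumption that time fuzziness is represented as closed intervals and that the limit bounds the width of the accumulated interval, the TimeUnion operation and the bookkeeping of the import/export invariants above can be sketched as follows (Python, hypothetical names; an illustration of the bookkeeping, not of any complete divergence control algorithm).

```python
def time_union(interval_a, interval_b):
    """TimeUnion([a1, a2], [b1, b2]) = [min(a1, b1), max(a2, b2)].
    Unlike a simple summation of interval lengths, this neither under- nor
    over-estimates the accumulated time fuzziness."""
    return (min(interval_a[0], interval_b[0]), max(interval_a[1], interval_b[1]))

class EpsilonTransaction:
    """Tracks accumulated time fuzziness against an epsilon specification."""
    def __init__(self, time_limit):
        self.time_limit = time_limit   # assumed: allowed width of the accumulated fuzziness
        self.fuzziness = None          # accumulated interval, None until the first non-SR op

    def try_accumulate(self, new_interval):
        # Tentatively accumulate the fuzziness of a non-SR operation; allow the
        # operation only if the invariant (accumulated fuzziness <= limit) still holds.
        if self.fuzziness is None:
            candidate = new_interval
        else:
            candidate = time_union(self.fuzziness, new_interval)
        if candidate[1] - candidate[0] <= self.time_limit:
            self.fuzziness = candidate
            return True        # non-SR operation admitted
        return False           # would violate the epsilon specification: reject

# Example: a query ET that tolerates at most 10 time units of fuzziness.
q = EpsilonTransaction(time_limit=10)
print(q.try_accumulate((100, 104)))   # True, accumulated [100, 104]
print(q.try_accumulate((102, 108)))   # True, accumulated [100, 108]
print(q.try_accumulate((95, 120)))    # False, width 25 would exceed the limit
```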
Several tradeoffs are involved in the choice of an appropriate version of data for a given situation. One could assume that ε_spec = 0 and choose the version that makes the transaction serializable (a generalization of the version selection function of a traditional multiversion concurrency control algorithm). Or one could always choose the most recently committed version of the data item being accessed. A third choice tries to minimize the fuzziness by approximating classic multiversion concurrency control: it chooses the newest version xj of data item x such that ts(xj) <= ts(Ti). If such an xj is no longer available because it has been garbage collected, then an available version xk is chosen such that ts(xj) < ts(xk) <= ts(Ti) + ε_spec. If xj is unavailable for any other reason, a version xm may be chosen such that ts(xm) < ts(xj) and ts(Ti) - ε_spec <= ts(xm) <= ts(Ti).

3.7.3 Timestamp Ordering Multiversion Divergence Control Algorithm

A timestamp ordering multiversion divergence control algorithm has been proposed with an emphasis on bounding time fuzziness; in this model, the ε-specs of the other dimensions are assumed to be infinitely large. The extension stage of the multiversion divergence control algorithm identifies the two conditions under which a non-SR operation may be allowed. The first is the situation where the serializable version does not exist or cannot satisfy the recency requirement. The second is that a late-arriving update transaction may still create a new version that is non-SR with respect to some previously arrived query transactions. Note also that for a query transaction a read operation can be processed by choosing a proper version, whereas for an update transaction a read operation is restricted to a serializable data version in order to maintain overall database consistency. In the relaxation stage, time fuzziness is accumulated using the TimeUnion operation for Qi and Uj. If the resulting accumulated time fuzziness does not exceed the corresponding ε-spec, then the non-SR operation is allowed; otherwise it is rejected. Furthermore, the time fuzziness incurred when a write operation wi(x) (of Ui) is processed even after a read qj(xk) (from Qj) with ts(xk) < ts(Ui) < ts(Qj) has completed is the time distance between ts(Uk) and ts(Ui), where Uk is the update transaction that created xk. This time fuzziness is accumulated into Ui's and Qj's time fuzziness using the TimeUnion operation.

3.7.4 Two-phase Locking Multiversion Divergence Control Algorithm

This algorithm tries to bound both time and value fuzziness. Value fuzziness is calculated as the difference between the values of the non-SR version and the SR version, and it is accumulated by adding together the value fuzziness caused by each non-SR operation. Thus, instead of the two requirements on time fuzziness mentioned in the previous section, four requirements need to be met here:

import_value_fuzziness(Qi) <= import_value_limit(Qi);
export_value_fuzziness(Uj) <= export_value_limit(Uj);
import_time_fuzziness(Qi) <= import_time_limit(Qi);
export_time_fuzziness(Uj) <= export_time_limit(Uj).

In applications where the rates of change of data values are constant, the value distance can be calculated from the timestamps of the data versions. When an epsilon transaction has successfully set all its locks and is ready to certify, the two corresponding certification conditions are extended as follows. First, any data version returned by the version selection function may be accepted for a query read, but the data version for an update transaction is restricted to the latest certified one.
Second, for each data item x that an update transaction wrote, all the update transactions that have read a certified data version of x have to be certified before this update transaction can be certified. Each query transaction has associated with it an import_time_fuzziness and an import_value_fuzziness, and each update transaction has an export_time_fuzziness and an export_value_fuzziness. Each uncertified data version, when created by a write operation, is associated with a Conflict_Q list, which is initialized to null. The Conflict_Q list is used to remember all queries that have read that uncertified data version; the time and value fuzziness for these queries are calculated by the time the data version is certified. However, for applications that cannot afford to delay certification, one can use external information to estimate the timestamp and value of the data version that is serializable with respect to it, and then accumulate the fuzziness into the total time and value fuzziness of the corresponding epsilon transaction.

3.7.5 Discussion

The multiversion divergence control algorithms were not combined with any CPU scheduling algorithms, such as EDF or RM. More research is needed to understand the interaction between divergence control and transaction scheduling.

3.8 Other Notions of Versioning

Three kinds of explicit versioning are common in the literature. The first is historical versions: users create historical versions when they want to keep a record of the history of their data. Second, the changes to data can also be viewed as revisions, and a user may want to examine earlier versions of the data. The only difference between a historical version and a revised version is that the former models logical time while the latter models physical time. Finally, variables can have different values based on different assumptions or design criteria; these different values are alternative versions. All three kinds of versioning can be modeled with the notion of annotations. The advantages of doing so include higher generality, extensibility and flexibility: versions can be combined arbitrarily with other annotations and with other versions, and a new kind of versioning can be implemented by defining an appropriate annotation class.

One aspect of related research involves multiversioning techniques for concurrency control; however, the relationship between concurrency control and annotations is not clear. There, the system creates versions for the purpose of ensuring serializability, and these versions are subsequently removed when they are no longer needed. This form of versioning is different from the three types of versioning mentioned in this section, because the versions in a multiversion system are visible only to the system. It is unclear whether multiversioning can be integrated with the annotation framework.

4. Replicated Data

Data are often replicated to improve transaction performance and system availability. Replication reduces the need for expensive remote accesses, thus enhancing performance. A replicated database can also tolerate failures, hence providing greater availability than a single-site database. An important issue of replicated data management is to ensure correct data access in spite of concurrency, failures, and update propagation. Data copies should be accessed by different transactions in a serializable way, both locally and globally. Therefore, two conflicting transactions should access replicated data in the same serialization order at all sites.
Common practice in the field is to use concurrency control protocols to ensure data consistency, replication control protocols to ensure mutual consistency, and atomic commitment protocols to ensure that changes to either none or all copies of replicated data are made permanent. One popular combination is Two-phase Locking (2PL) for concurrency control, read-one-write-all (ROWA) for replication control and Two-phase Commit (2PC) for atomic commitment.

4.1 The Structure of Distributed Transactions

An example distributed database works as follows. Each transaction has a master (coordinator) process running at the site where it originated. The coordinator, in turn, sets up a collection of cohort processes to perform the actual processing involved in running the transaction. Transactions access data at the sites where the data reside instead of accessing it remotely, so there is at least one cohort at each site where data will be accessed by the transaction. Each cohort that updates a replicated data item is assumed to have one or more remote update processes associated with it at other sites; it communicates with its remote update processes both for concurrency control purposes and for propagating new values.

The most common commit protocol used in a distributed environment is the two-phase commit protocol, with the coordinator process controlling the protocol. With no data replication, the protocol works as follows. When a cohort finishes executing its portion of a query, it sends an execution complete message to the coordinator. When the coordinator has received this message from every cohort, it initiates the commit protocol by sending a prepare message to all cohorts. Assuming that a cohort wishes to commit, it responds to this message by forcing a prepare record to its log and then sending a prepared message back to the coordinator. After receiving prepared messages from all cohorts, the coordinator forces a commit record to its log and sends a commit message to each cohort. Upon receipt of this message, each cohort forces a commit record to its log and sends a committed message back to the coordinator. Finally, once the coordinator has collected committed messages from every cohort, it records the completion of the protocol by logging a transaction end record. If any cohort is unable to commit, it logs an abort record and returns an abort message instead of a prepared message in the first phase, causing the coordinator to log an abort record and to send abort instead of commit messages in the second phase of the protocol. Each aborted cohort reports back to the coordinator once the abort procedure has been completed. When remote updates are present, the commit protocol becomes a hierarchical two-phase commit protocol.
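A compact sketch of the (non-hierarchical) two-phase commit exchange described above, using hypothetical Cohort and Coordinator classes in Python; messages and forced log writes are reduced to method calls and list appends, so this only illustrates the message flow, not a recoverable implementation.

```python
class Cohort:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.log = name, can_commit, []

    def prepare(self):
        # Phase 1: force a prepare (or abort) record to the local log and vote.
        if self.can_commit:
            self.log.append("prepare")
            return "prepared"
        self.log.append("abort")
        return "abort"

    def finish(self, decision):
        # Phase 2: force the coordinator's decision to the log and acknowledge it.
        self.log.append(decision)
        return "committed" if decision == "commit" else "aborted"

class Coordinator:
    def __init__(self, cohorts):
        self.cohorts, self.log = cohorts, []

    def run(self):
        # Phase 1: once all cohorts report execution complete, send prepare to each.
        votes = [c.prepare() for c in self.cohorts]
        decision = "commit" if all(v == "prepared" for v in votes) else "abort"
        self.log.append(decision)                       # forced decision record
        # Phase 2: propagate the decision and collect acknowledgements.
        acks = [c.finish(decision) for c in self.cohorts]
        self.log.append("transaction end")              # completion record
        return decision, acks

# Example: one unwilling cohort forces the whole transaction to abort.
print(Coordinator([Cohort("A"), Cohort("B", can_commit=False)]).run())
```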
4.2 Replicated Data History

A replicated database is considered to be correct if it is one-copy serializable. Here we have two concepts: the replicated data history and the one-copy history. Replicated data histories reflect the database execution, while a one-copy history is the interpretation of the transaction execution from a user's single-copy point of view. An execution of transactions is correct if the replicated data history is equivalent to a serial one-copy history. A complete replicated data (RD) history H over T = {T0, ..., Tn} is a partial order with ordering relation < where:

1. H = h(T0 ∪ ... ∪ Tn) for some translation function h;
2. for each Ti and all operations pi, qi in Ti, if pi <i qi, then every operation in h(pi) is related by < to every operation in h(qi);
3. for every read rj[xA], there is at least one write wi[xA] such that wi[xA] < rj[xA];
4. all pairs of conflicting operations are related by <, where two operations conflict if they operate on the same copy and at least one of them is a write; and
5. if wi[x] <i ri[x] and h(ri[x]) = ri[xA], then wi[xA] ∈ h(wi[x]).

An RD history H over T is equivalent to a 1C history H1C over T if:

1. H and H1C have the same reads-from relationships on data items, i.e., Tj reads from Ti in H if and only if the same holds in H1C; and
2. for each final write wi[x] in H1C, wi[xA] is a final write in H for some copy xA of data item x.

4.3 Replicated Concurrency Control Protocols

For the protocols listed here, we assume a read-one-copy, write-all-copies scenario.

4.3.1 Distributed Two-phase Locking

To read an item, it suffices to set a read lock on any copy of the item. To update an item, write locks are required on all copies. Write locks are obtained as the transaction executes, with the transaction blocking on the write request until all of the copies to be updated have been locked. All locks are held until the transaction has committed or aborted. Deadlocks are possible. Global deadlock detection is done by a snooping process, which periodically requests wait-for information from all sites and then checks for and resolves any global deadlocks, using the same victim selection criteria as for local deadlocks. The snooping responsibility can be dedicated to a particular site or rotated among the sites in a round-robin fashion.

A variation of the above algorithm is called wound-wait (WW). The difference between WW and 2PL is the way they deal with deadlock. Rather than maintaining wait-for information and checking for local and global deadlocks, deadlocks are prevented by the use of timestamps: every transaction in the system has a startup timestamp, and younger transactions are prevented from making older transactions wait. Whenever there is a data conflict, the younger transaction is aborted, unless it is in the second phase of its commit protocol.

There is also a notion of Optimistic Two-phase Locking (O2PL). The difference between O2PL and normal 2PL is that O2PL handles replicated data the way OP does. When a cohort updates a data item, it immediately requests a write lock on the local copy of the item, but it defers requesting the write locks on the remote copies until the beginning of the commit phase. The idea behind O2PL is thus to set locks immediately within a site, where doing so is cheap, while taking a more optimistic, less message-intensive approach across site boundaries. Since O2PL waits until the end of the transaction to obtain write locks on remote copies, both blocking and deadlocks happen late. Nonetheless, if deadlocks occur, some transactions eventually have to be aborted.

4.3.2 Distributed Timestamp Ordering

The basic notion behind distributed timestamp ordering is the same as that of centralized basic timestamp ordering. With replicated data, however, a read request is processed using the local copy of the requested data item, while a write request must be approved at all copies before the transaction proceeds. Writers keep their updates in a private workspace until commit time. Granted writes for a given data item are queued in timestamp order, without blocking the writers, until they are ready to commit, at which point the writes are dequeued and processed in order.
Accepted read requests for such a pending write must be queued as well, thus blocking readers, since readers cannot be permitted to proceed until the update becomes visible. Effectively, a write request locks out any subsequent read requests with later timestamps until the corresponding write actually takes place, which happens when the writing transaction is ready to commit and the write is dequeued and processed.

4.3.3 Distributed Optimistic Concurrency Control

Some variants of distributed optimistic concurrency control borrow the idea of timestamps to exchange certification information during the commit phase. Each data item has a read timestamp and a write timestamp. Transactions may read and update data items freely; updates are stored in a local workspace until commit time. Read operations must remember the write timestamp of the version they read. When all of a transaction's cohorts have completed their work and reported to the coordinator, the transaction is assigned a unique timestamp. This timestamp is sent to each cohort and is used to locally certify all of the cohort's reads and writes. A read request is locally certified if the version that the cohort read is still the current version; a write request is certified if no later reads have been certified. A transaction is certified globally only if local certification passes at all cohorts. In the replicated case, the remote updaters are also involved in certification: they must locally certify the set of writes they receive at commit time, and the necessary communication can be accomplished by piggybacking the information on the messages of the commit protocol.

4.3.4 Quorum Based Concurrency Control Algorithms

When copies of the same data reside on several computers, the system becomes more reliable and available; however, it becomes much harder to keep the copies mutually consistent. Several popular methods for replicated data concurrency control are based on the quorum consensus (QC) class of algorithms. The common feature of the QC algorithm family is that each site is assigned a vote. To perform a read or a write operation, a transaction must assemble a read or write quorum of sites such that the votes of all of the sites in the quorum add up to a predefined read or write threshold. The basic principle is that the sum of these two thresholds must exceed the total sum of all votes, and the write threshold must be strictly larger than half of the sum of all votes. These two conditions are known as the quorum intersection invariants. The former ensures that a read operation and a write operation do not take place simultaneously, while the latter prevents two simultaneous write operations on the same object. It is important to note that these two invariants do not force unique values upon the sizes of the read and write quorums, or even upon the individual site votes. The read-one-write-all method can be viewed as a special case of the QC method with each site assigned a vote of 1, the read quorum set to 1, and the write quorum set to the sum of all votes. This assignment leads to better performance for queries at the expense of poorer performance for updates.
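The quorum intersection invariants can be checked mechanically. The sketch below (Python, hypothetical names) validates a static vote and threshold assignment and shows read-one-write-all as the special case just mentioned; the quorum sizes used here are only illustrative.

```python
def quorum_assignment_valid(votes, read_quorum, write_quorum):
    """Check the two quorum intersection invariants for a static assignment.

    votes: mapping from site name to its (non-negative) vote.
    """
    total = sum(votes.values())
    # 1. A read quorum and a write quorum must intersect: r + w > total votes.
    reads_see_writes = read_quorum + write_quorum > total
    # 2. Two write quorums must intersect: w > half of the total votes.
    writes_exclude_writes = 2 * write_quorum > total
    return reads_see_writes and writes_exclude_writes

votes = {"site1": 1, "site2": 1, "site3": 1}

# Read-one-write-all as a special case: r = 1, w = sum of all votes.
print(quorum_assignment_valid(votes, read_quorum=1, write_quorum=3))   # True

# Majority quorums: r = w = 2 out of 3 also satisfies both invariants.
print(quorum_assignment_valid(votes, read_quorum=2, write_quorum=2))   # True

# r = 1, w = 2 violates the first invariant (a read may miss the latest write).
print(quorum_assignment_valid(votes, read_quorum=1, write_quorum=2))   # False
```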
In the missing writes method, the read-one-write-all policy is used when all of the sites in the network are up; however, if any site goes down, the size of the read quorum is increased and that of the write quorum is decreased. After the failure is repaired, the method reverts to the original scheme. In the virtual partition algorithm, each site maintains a view consisting of all the sites with which it can communicate, and within this view the read-one-write-all method is performed. Another dynamic algorithm is the available copies method. There, query transactions can read from any single site, while update transactions must write to all of the sites that are up. Each transaction must perform two additional steps called missing writes validation and access validation.

The primary copy algorithm is based on designating one copy as the primary copy and requiring each transaction to update it before committing. The update is subsequently spooled to the other copies. If the primary copy fails, then a new primary is selected.

Network partitioning due to site or communication link failures is addressed by some protocols. Some algorithms ensure consistency at the price of reduced availability: they permit at most one partition to process updates at any given time. Dynamic algorithms permit updates in a partition provided it contains more than half of the up-to-date copies of the replicated data. The approach is dynamic because the order in which past partitions were created plays a role in the selection of the next distinguished partition. The basic operation of dynamic voting is as follows. When a site S receives an update request, it sends a message to all other sites. Those sites belonging to the partition P to which S currently belongs (that is, those sites with which S can communicate at the moment) lock their copies of the data and reply to the inquiry sent by S. From the replies, S learns the largest (most recent) version number VN among the copies in partition P, together with the update sites cardinality SC recorded by the copies carrying that version number. Partition P is the distinguished partition if it contains more than half of the SC sites with version number VN. If partition P is the distinguished partition, then S commits the update and sends a message to the other sites in P, telling them to commit the update and unlock their copies of the data. A two-phase commit protocol is used to ensure that transactions are atomic.

The vote assignment method is important to this family of algorithms, because the assignment of votes to sites and the settings of the quorum values greatly affect transaction overhead and availability. In some research work, techniques are developed to optimize the assignment of votes and quorums in a static algorithm. Three optimization models are addressed, whose primary objective is to minimize cost during regular operation. These three models are decomposed into several problems, some of which are solved optimally and others heuristically. Processing costs are likely to be reduced by these methods. The reason is that in QC algorithms, a major component of the processing cost is proportional to the number of sites participating in a transaction's quorum. When the unit communication costs are equal, the overall communication cost is directly related to the size of a transaction's quorum; hence, minimizing communication cost is equivalent to minimizing the number of sites that participate in the transaction (a toy illustration follows).
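The point can be made with a toy cost model; the site names and unit costs below are purely illustrative assumptions, not figures from the surveyed work.

    def quorum_cost(quorum, unit_cost):
        # Communication cost of contacting a quorum, given a per-site unit cost.
        return sum(unit_cost[site] for site in quorum)

    equal_costs   = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1}
    unequal_costs = {"A": 1, "B": 5, "C": 1, "D": 2, "E": 1}

    # With equal unit costs, the cost is simply proportional to quorum size,
    # so the cheapest quorum is the smallest one.
    assert quorum_cost({"A", "B", "C"}, equal_costs) == 3
    # With unequal unit costs, two quorums of the same size can differ widely.
    assert quorum_cost({"A", "C", "E"}, unequal_costs) == 3
    assert quorum_cost({"A", "B", "D"}, unequal_costs) == 8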
If the unit communication costs are unequal, on the other hand, the relationship between the number of sites and the total communication cost is more complex, depending on other factors such as the variance in the unit costs.

4.4 Experiments

In a distributed environment, besides throughput as an important evaluation metric, the number of messages sent is also worth considering. Some work focuses its attention on the update cost of distributed concurrency control. Various system loads and different levels of data replication were chosen to test the performance of the replicated concurrency control algorithms. Among all the replicated distributed concurrency control protocols, 2PL and O2PL provide the best performance, followed by WW (wound-wait) and TSO; OP performs the worst. In this case, 2PL and O2PL perform similarly because they differ only in their handling of replicated data. The reason behind the difference in performance is that 2PL and O2PL have the lowest restart ratios and as a result waste the smallest fraction of the system's resources. WW and TSO have higher restart ratios. Between these two, although WW has a higher ratio of restarts to commits, it always selects a younger transaction to abort, making its individual restarts less costly than those of TSO. OP has the highest restart ratio among these protocols. These results indicate the importance of restart ratios and of the age of aborted transactions as performance determinants.

When the number of copies increases, both the amount of I/O involved in updating the database and the level of synchronization-related message traffic increase. However, the differences between the algorithms decrease as the level of replication increases. The reason is again restart-related. Successfully completing a transaction in the presence of replication involves the work of updating remote copies of the data. Since remote updates occur only after a successful commit, the relative cost of a restart decreases as the number of copies increases. Still, 2PL, WW and TSO suffer a bit more as the level of replication is increased. One interesting result is that O2PL actually performs a little worse than 2PL because of the way conflicts are detected and resolved in O2PL. Unlike the situation in 2PL, it is possible in O2PL that, for a specific data item, an update transaction at each copy site obtains its locks locally and discovers the existence of competing update transactions at other sites only after executing successfully at its own site; it then has to be aborted or to abort the others. Hence O2PL has a higher transaction restart ratio than the 2PL scheme. In contrast, OP is insensitive to the number of active copies: it only checks for conflicts at the end of a transaction, and it matters little whether the conflicts occur at one copy site or at multiple copy sites, since all copies are involved in the validation process anyway. In fact, the restart cost becomes less serious for OP (as well as for O2PL): since remote copy updates are only performed after a successful commit, they are not done when OP restarts a transaction.

As far as message cost goes, O2PL and OP require significantly fewer messages than the other algorithms, because they delay the necessary inter-site conflict check until just before transactions commit. Between these two, O2PL retains its performance advantage over OP due to its much lower reliance on restarts to resolve conflicts.
If message costs are assumed to be large, OP might actually outperform all three of the others (2PL, TSO, WW). Among the subset of algorithms that communicate with remote copies on a per-write basis, however, 2PL is still the best performer, followed by WW and TSO.

4.5 Increasing Concurrency in a Replicated System

Serializability is the traditional standard of correctness in a database system. For a distributed replicated system, the corresponding notion of correctness is one-copy serializability. However, it might not be practical in high-performance applications, since the restriction it imposes on concurrency control is too stringent. In classical serializability theory, two operations on a data item conflict unless they are both reads. One technique for improving concurrency utilizes the semantics of transaction operations; the purpose is to restrict the notion of conflict. For instance, as in an airline ticketing system, two commutative operations do not conflict even if both update the same data item (whether or not two operations commute might, however, depend on the state of the database). More research work needs to be done to further exploit the inherent concurrency in such situations.

When serializability is relaxed, the integrity constraints describing the data may be violated. By allowing bounded violation of the integrity constraints, however, it is possible to increase the concurrency of transactions in a replicated environment. In one solution, transactions are partitioned into disjoint subsets called types. Each transaction is a sequence of atomic steps. Each type, y, is associated with a compatibility set whose elements are the types, y', such that the atomic steps of y and y' may be interleaved arbitrarily without violating consistency. Interleaved schedules are not necessarily serializable. This idea has also been generalized in different ways to give a specification of the allowable interleavings at each transaction breakpoint with respect to other transactions.

Most approaches in the literature are concerned with maintaining the integrity constraints of the database, even though serializability is not preserved. Some other work is interested in application scenarios that have stringent performance requirements but can permit bounded violations of the integrity constraints. A notion of set-wise serializability is introduced, where the database is decomposed into atomic data sets and serializability is guaranteed within a single data set. For applications where replicas are allowed to diverge in a controlled fashion, an up-to-date central copy is kept, which is used to detect divergence in the other copies and thus to trigger appropriate operations. However, it is not clear how this method could be extended to a completely distributed environment. Another example is the notion of k-completeness introduced in the SHARD system. A total order is imposed on the transactions executed in a run. A transaction is k-complete if it sees the results of all but at most k of the preceding transactions in the order (a small sketch of this condition is given below). Because no algorithm was proposed to enforce a particular value of k, the extent of ignorance can only be estimated in a probabilistic sense. Also, a transaction which has executed at one site may be unknown to other sites even after they have executed a large number of subsequent transactions. This is a major drawback, since it implies that a particular site might remain indefinitely unaware of an earlier transaction at another site.
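Read purely as a predicate, the k-completeness condition is easy to state. The sketch below uses hypothetical transaction identifiers and is not drawn from the SHARD work itself.

    def is_k_complete(position, seen, order, k):
        # position: index of the transaction in the imposed total order
        # seen:     identifiers of preceding transactions whose results it saw
        # order:    the total order of transaction identifiers
        missed = [t for t in order[:position] if t not in seen]
        return len(missed) <= k

    order = ["t1", "t2", "t3", "t4", "t5"]
    # A transaction at position 4 that saw t1, t2 and t4 missed only t3,
    # so it is 1-complete (and hence k-complete for every k >= 1).
    assert is_k_complete(4, {"t1", "t2", "t4"}, order, k=1)
    assert not is_k_complete(4, {"t1"}, order, k=1)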
Epsilon serializability has also been proposed for replicated database systems. An epsilon transaction is allowed to import and export a limited amount of inconsistency. It is not clear, however, how the global integrity constraints of the database are affected if transactions can both import and export inconsistency.

4.5.1 N-ignorance

N-ignorance utilizes the relaxation of correctness in order to increase concurrency. An N-ignorant transaction is a transaction that may be ignorant of the results of at most N prior transactions. A system in which all transactions are N-ignorant can have an (N+1)-fold increase in concurrency over serializable systems, at the expense of bounded violations of its integrity constraints. Quorum locking and gossip messages are integrated to obtain a control algorithm for an N-ignorant system. The messages propagated in the system as part of the quorum algorithm are lock request, lock grant, lock release, and lock deny. Gossip messages are piggy-backed onto these quorum messages; they may also be transmitted periodically from a site, independently of the quorum messages. Gossip messages are point-to-point background messages which can be used to disseminate information. In the database model here, a gossip message sent by site i to site j may contain updates known to i (these need not be only updates of transactions initiated at site i). Gossip messages ensure that a site learns of updates in happened-before order. At most one of the following can occur at a time t at a particular site i: the execution of a transaction, the sending of a gossip message, or the receipt of a gossip message.

Consider a transaction T submitted at site i. If site j is in Q_T, then j is said to participate in T. Once Q_T is established, the sites in Q_T cannot participate in another transaction until the execution of T is completed. However, the number of sites in the quorum (|Q|) need not be greater than half of the total number of sites (M/2). When T is initiated, site i knows of all updates known to each quorum site before that quorum site sent its lock grant response to i. When T terminates, each quorum site learns of all updates that i knows about. Hence, a tight coupling exists among these sites. The condition for non-quorum sites is more relaxed: site i is allowed to initiate T as long as the non-quorum sites are ignorant of only a bounded number of the updates that i knows about (a minimal sketch of this bookkeeping is given at the end of this subsection). By adjusting |Q| and the amount of allowed difference between the knowledge states of the initiator site and the non-quorum sites, different algorithms with different values of N are developed.

Two algorithms are actually introduced. Algorithm A performs the timetable check first and then locks a quorum; algorithm B performs the operations in the reverse order. In algorithm B, replicated data at the quorum sites is therefore locked while a site is waiting for a more restrictive timetable condition to be satisfied, so less concurrency is permitted. If serializability can be relaxed, |Q| can be reduced and performance improves, both because it takes less time to gather a quorum and because conflicting transactions can execute concurrently. In the presence of site crashes, message delays, or communication link failures, the algorithms are fault tolerant, provided a standard commit protocol is used to abort transactions interrupted by failures of quorum sites. They can tolerate up to |Q|-1 non-quorum site failures, and transactions can still be initiated and executed successfully at the sites that have not failed.
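As a minimal sketch of the bookkeeping referred to above: each site tracks the set of updates it knows about, gossip merges that knowledge, and a transaction may be initiated only when no non-quorum site lags the initiator by more than N updates. The names are hypothetical, and the actual algorithms keep this information as timetables piggy-backed on the quorum lock messages rather than as explicit sets.

    class Site:
        # One replica site; known_updates holds identifiers of the updates it knows.
        def __init__(self, name):
            self.name = name
            self.known_updates = set()

        def apply_update(self, update_id):
            self.known_updates.add(update_id)

        def send_gossip(self):
            # A gossip message simply carries the updates known to the sender.
            return set(self.known_updates)

        def receive_gossip(self, gossip):
            self.known_updates |= gossip

    def may_initiate(initiator, non_quorum_sites, n):
        # The initiator may start a transaction only if every non-quorum site is
        # ignorant of at most N of the updates the initiator already knows about.
        return all(len(initiator.known_updates - s.known_updates) <= n
                   for s in non_quorum_sites)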
4.5.2 Discussion of N-ignorance

The major contributions of this work are the following two points. First, they formalize the violation of the integrity constraints as a function of the number, N, of conflicting transactions that can execute concurrently. In order to compute the extent to which the constraints can be violated, they also provide a systematic analysis of the reachable states of the system. Second, they deal with a replicated database. The emphasis is on reducing the number of sites a site has to communicate with; the aim is to increase the autonomy of sites in executing transactions, which results in shorter response times.

N-ignorance has the property that in any run R, the update of a transaction T is unknown to at most N transactions. This property is referred to as the locality of N-ignorance. N-ignorance is most useful when many transaction types conflict. To improve concurrency when few transaction types conflict, the algorithms are extended to permit the use of a matrix of ignorance, so that the allowable ignorance of each transaction type with respect to each other type can be individually specified.

Several problems need further research effort. One of them is the question of how to fit compensating transactions into this model. A compensating transaction is normally used to bring a system that does not satisfy the constraints back to a good state. It is important to develop a lucid theory of compensation in an N-ignorant system, since compensation is integral to such a system. Another thing worth addressing is how to determine the value of N (or the ignorance matrix) given the original and relaxed constraints. The actual performance of the algorithms used to implement N-ignorance, however, depends largely on how frequently gossip messages are propagated and how much time is involved in processing these messages. Any system that is based on ignorance will have to be tuned to determine the appropriate frequency of gossip messages.

4.6 Increasing Availability in a Replicated Database

System semantics has been exploited to enhance availability. Lazy replication is intended for an environment where failures of individual nodes and of the network are not Byzantine. It preserves consistency with a weaker, causal ordering, which leads to better performance. This method is intended for applications in which most update operations are causal. An operation call is executed at just one replicated copy. A replicated application is implemented by a service consisting of replicated copies running at different nodes of a network. To call an operation, a client makes a local call that sends a call message to one of the copies; the copy executes the requested operation and sends a reply message back. Replicas communicate new information among themselves by lazily exchanging gossip messages. When requested to perform an update, the front end returns to the client immediately and communicates with the service in the background. To perform a query, the front end waits for the response from a replica and then returns the result to the client.

Applications that use causally ordered operations may occasionally require a stronger ordering. In addition to the causal operations, there are forced and immediate operations. Forced operations are performed in the same order (relative to one another) at all copies. Their ordering relative to causal operations may differ at different copies, but it is consistent with the causal order at all replicated copies.
Immediate operations are performed at all replicated copies in the same order relative to all other operations. They have the effect of being performed immediately when the operation returns, and they are ordered consistently with external events. The forced and immediate operations are important because they increase the applicability of the approach, allowing it to be used for applications in which some updates require a stronger ordering than causality.

4.7 Performance in Large Scale Data Replication

Research work such as the primary copy, dynamic voting, and general quorum consensus methods mentioned in the previous sections was developed for small systems with a small number of replicas, and almost all of these protocols require the participation and consensus of all sites. The number of participating sites, therefore, is almost linear in the system size. In very large systems, these solutions are impractical. More recent work emphasizes performance optimization in addition to availability. The optimization common to these algorithms comes from the savings in communication cost obtained by decreasing the number of participants needed to ensure mutual consistency. Published work includes the tree quorum algorithm and the multidimensional voting algorithms.

4.7.1 Multiview Access Protocol

Again, the observation behind the idea is that presenting both consistent and current database views to all transactions is not necessary in many replicated database applications. It is often sufficient to guarantee that all replicated copies will eventually be updated and that certain replicated copies are kept up-to-date at all times. A multiview access (MVA) strategy addresses the high transaction abort rate and poor overall performance problems in large-scale replicated database systems. The main idea is to reduce the number of sites that run a distributed commit protocol. In this approach, concurrent data accesses still follow protocols such as 2PL, ROWA and 2PC, but in several stages rather than all at the same time as in the 1SR approach.

In this framework, sites are first grouped into a tree-structured cluster architecture, and replicated data copies are organized into a corresponding hierarchical multiview structure. There is a designated primary copy of each replicated data item in the cluster in which it resides. Replicated data can be read at any level and from any copy. Updating replicated data, however, requires updating all the copies in a top-down fashion, starting from the root of the structure tree (a sketch of this propagation is given below). Primary copies of clusters at the same level within a parent cluster are updated at the same time by one transaction, whereas the other copies of each cluster are updated by subsequent separate transactions. The main idea of the proposed strategy is to break a large update transaction into a set of related update transactions, each operating on a smaller scope, so that each can commit independently. The advantage is that transactions do not have to wait until all replicated copies have been updated. Only a failed transaction needs to be re-executed, and the strategy still guarantees the consistency of the replicated data; the transaction abort rate is much lower. Primary copies of same-level clusters (within the same parent cluster) have the same database view. Non-primary copies within each cluster also share the same view, at another level. These views, however, can differ at a given time, as they are updated by separate transactions.
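The top-down propagation just described can be sketched as follows. This is only an illustrative reading of the strategy, with hypothetical names; run_transaction stands for whatever distributed transaction mechanism the clusters actually use (2PL with 2PC in the setting assumed above), and each call to it commits independently.

    class Cluster:
        # A node in the tree-structured cluster architecture.
        def __init__(self, primary_copy, other_copies=(), children=()):
            self.primary_copy = primary_copy      # primary copy held by this cluster
            self.other_copies = list(other_copies)
            self.children = list(children)

    def update_top_down(root, value, run_transaction):
        # The first transaction updates the root-level primary copy; the update
        # then descends the tree, one independently committing transaction per step.
        run_transaction([root.primary_copy], value)
        _descend(root, value, run_transaction)

    def _descend(parent, value, run_transaction):
        if parent.children:
            # Primary copies of same-level clusters within this parent cluster
            # are updated together by a single transaction ...
            run_transaction([c.primary_copy for c in parent.children], value)
        if parent.other_copies:
            # ... while the non-primary copies are refreshed by a subsequent,
            # separate transaction.
            run_transaction(parent.other_copies, value)
        for child in parent.children:
            _descend(child, value, run_transaction)

Because the top-level call commits before the recursion has reached the leaves, a reader of the top-level primary copies sees the new value before all lower-level copies have caught up, which is exactly the behaviour discussed next.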
A transaction can see a globally consistent and up-to-date view of replicated data by reading only the top-level primary copies. A transaction can also see a cluster-wide consistent but possibly out-of-date view of replicated data by reading non-primary copies in the cluster. A transaction can even see a consistency-relaxed view within a cluster by blindly reading a mix of primary and non-primary copies of different data items.

4.7.2 Discussion

Compared to other similar protocols, MVA offers better overall performance. The performance is improved for two reasons. First, transaction throughput is improved due to a reduction of the transaction abort rate. Second, user response time is also improved, as the top-level transactions commit before all replicated copies have been updated; the system takes care of update propagation, even for sites that are not operational at the time. Also, read-only queries have the flexibility to see database views of different degrees of consistency and data currency, ranging from global, most up-to-date, and consistent views, to local, consistent, but potentially old views, to views that are local and nearest to users but potentially inconsistent.

The autonomy of the clusters is preserved, as no specific protocol is required to update replicated copies within clusters; clusters are therefore free to use any valid replication or concurrency control protocols. Meanwhile, the protocol maintains its scalability by allowing dynamic system reconfiguration: as the system grows, a cluster can be split into two or more smaller clusters, and when two or more clusters in the same parent cluster become too small, they can be combined into a single cluster within that parent. Although 2PL, ROWA, and 2PC are assumed as the basic concurrency control, replication, and commitment protocols, the strategy could be implemented on top of other, similar protocols. In general, any distributed concurrency control algorithm that correctly maintains data consistency can be used for accessing primary copies and local data copies. The ROWA protocol can be replaced by other protocols such as quorum consensus, and the 2PC protocol can likewise be replaced by, for example, 3PC. The multiview access protocol also supports independent adaptability: each cluster view can be maintained by independent and possibly different protocols, and adaptation to other mechanisms is transparent to the other views.

Overall, the common objective of these research efforts is to increase concurrency and availability. Even though the various works approach the problem from different perspectives, they are related to each other and similar in some respects. In the end, the objectives of higher concurrency and better availability are not opposed to each other; on the contrary, they sometimes boost one another.

5. Summary

This survey paper started from a very simple thought: both multiple versions and replicated data try to achieve performance improvement by providing more than one copy of the same data. A multiversion database keeps track of the history of a piece of data "vertically" and lets read-only transactions use relatively old information (there are restrictions on how old it can be, though) to get their work done. A replicated database puts the same data at multiple physical locations so that it is more available in the system. However, this brings with it the whole range of complexity from concurrency control to transaction commit.
Also notice that multiversion and replicated databases have different application realms. Nonetheless, the author has tried to look into developments in both, mostly their concurrency control mechanisms as well as some other related issues. The available material seems to indicate that these two areas were quite active in the late 80s and early 90s but have died down lately (papers mention concepts from multiversioning or data replication every now and then, without these being the topic or focus of the work). However, the author thinks these two areas deserve more attention, perhaps not in the sense of developing more concurrency control algorithms for the sake of a 1% performance improvement, but rather in terms of how to integrate them with other concepts (such as real-time or QoS research). Due to the limited resources available to the author, the survey is so far very incomplete. If the topic is found interesting, the author will try to research more of the literature and put more flesh into this survey paper.

Bibliography:

A Quorum-Consensus Replication Method for Abstract Data Types: M. Herlihy
Adaptive Commitment for Distributed Real-Time Transactions: Nandit Soparkar, Eliezer Levy, Henry F. Korth, Avi Silberschatz
An Algorithm for Concurrency Control and Recovery in Replicated Distributed Databases: P. Bernstein, T. A. Joseph
Apologizing Versus Asking Permission: Optimistic Concurrency Control for Abstract Data Types: Maurice Herlihy
A Simulation Model for Distributed Real-Time Database Systems: Ozgur Ulusoy, Geneva G. Belford
Bounded Ignorance: A Technique for Increasing Concurrency in a Replicated System: Narayanan Krishnakumar, Arthur J. Bernstein
Concurrency Control for High Contention Environments: Peter A. Franaszek, John T. Robinson, Alexander Thomasian
Concurrency Control in Distributed Database Systems: P. Bernstein, N. Goodman
Concurrency Control Performance Modeling: Alternatives and Implications: Rakesh Agrawal, Michael J. Carey, Miron Livny
Conflict Detection Tradeoffs for Replicated Data: Michael J. Carey, Miron Livny
Cost and Availability Tradeoffs in Replicated Data Concurrency Control: Akhil Kumar, Arie Segev
Dual Properties of Replicated Atomic Data: Maurice Herlihy
Dynamic Voting Algorithms for Maintaining the Consistency of a Replicated Database: Sushil Jajodia, David Mutchler
Fundamental Algorithms for Concurrency Control in Distributed Database Systems: P. Bernstein, N. Goodman
Integrated Concurrency Control and Recovery Mechanisms: Design and Performance Evaluation: Rakesh Agrawal, David J. DeWitt
Locking Objects and Classes in Multiversion Object-Oriented Databases: Wojciech Cellary, Waldemar Wieczerzycki
Maintaining Availability in Partitioned Replicated Databases: A. El Abbadi, S. Toueg
Modeling and Evaluation of Database Concurrency Control Algorithms: M. Carey
Modular Synchronization in Multiversion Databases: Version Control and Concurrency Control: Divyakant Agrawal, Soumitra Sengupta
Multiview Access Protocol for Large Scale Replication: Xiangning Liu, Abdelsalam Helal, Weimin Du
Multiversion Divergence Control of Time Fuzziness: Calton Pu, Miu K. Tsang, Kun-Lung Wu, Philip S. Yu
Partitioned Two-phase Locking: Meichun Hsu, Arvola Chan
Performance of Multiversion Concurrency Control Mechanism in Partitioned and Partially Replicated Databases: Albert Burger, Vijay Kumar
Providing High Availability Using Lazy Replication: Rivka Ladin, Barbara Liskov, Liuba Shrira, Sanjay Ghemawat
References to Remote Mobile Objects in Thor: Mark Day, Barbara Liskov, Umesh Maheshwari, Andrew C. Myers
Serializability with Constraints: Toshihide Ibaraki, Tiko Kameda, Toshimi Minoura
The Performance of Multiversion Concurrency Control Algorithms: Michael J. Carey, Waleed A. Muhanna
Timestamp-Based Algorithms for Concurrency Control in Distributed Database Systems: P. Bernstein, N. Goodman
Transaction Chopping: Algorithms and Performance Studies: Dennis Shasha, Francois Llirbat, Eric Simon, Patrick Valduriez