Multiversion ROCC for Concurrency Control¹

Victor Shi, Shan Duanmu and William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105

¹ Patents on ROCC and MVROCC technology are pending.

Abstract: This paper discusses a multi-version method for concurrency control. Multi-version Read-commit Order Concurrency Control (MVROCC) uses three timestamps (Start, Commit and Combined) for transaction validation. The Start timestamp records the time when the transaction reads its first data object. The Commit timestamp records the time when the transaction requests to commit. The Combined timestamp is set either to the Start timestamp or to the Commit timestamp, depending on the result of a conflict check using the proposed “intervening interval validation” algorithm. MVROCC is fully serializable, and it is more permissive than Snapshot Isolation with respect to write-write conflicts because it allows one-sided write-write conflicts. Our preliminary simulation results showed that the throughput of MVROCC is 5% higher than that of Snapshot Isolation. More importantly, MVROCC guarantees execution correctness while achieving good performance. The ease of implementing MVROCC may attract attention from the database industry, because the major change is to replace the first-committer-wins policy with the proposed intervening validation algorithm in a system that has already implemented Snapshot Isolation mechanisms.

Key Words: Concurrency control, High performance, Database systems.

1 Introduction

High system throughput is a desirable feature in database systems. In traditional database management systems, two-phase locking is used to support concurrent access while maintaining correctness. With increasing pressure for higher system throughput, researchers have explored numerous mechanisms for better performance with regard to system throughput and transaction response time.
The mechanisms that have demonstrated better performance can be classified into three categories: optimistic, multi-version and deadlock avoidance.

In general, concurrency control methods that use an optimistic mechanism allow data access without restrictions and validate the data when the transaction commits. Probably the most successful commercially used optimistic method is the field call and its Escrow variant [1, 2]. The field-call method tests predicates at the time the transaction requests access to data items. A REDO log record is generated if the predicate test succeeds. All updates are deferred to transaction commit time. When the transaction commits, all its predicates are tested again to validate that all required conditions still hold. The drawback is that the commit may return an error if the update predicate has become false since the field call was issued. The Escrow method overcomes this drawback by introducing an Escrow record. The Escrow record preserves the truth of the predicate between the time the transaction first makes the field call and the time the predicate is reevaluated at commit time. The truth preservation works well when the field is an aggregate.

With the fast evolution of technology and storage no longer being a problem, multi-version concurrency control methods have become attractive. A successful example of multi-version concurrency control is the Snapshot Isolation method. It was introduced in 1995 and then adopted by Oracle, PostgreSQL, Microsoft SQL Server and Exchange as a method for “serializable” isolation level transactions. Although Snapshot Isolation suffers from the “write skew” problem and thus is not truly serializable [5], it has an obvious advantage: read-only transactions always succeed.

A representative method of deadlock-avoidance concurrency control (CC) is the WDL (Wait-Depth Limit) method proposed by IBM researchers [3]. It limits the length of the waiting list so that no deadlock can form.
They then proposed a hybrid two-phase CC method [3, 4], which uses an optimistic CC method in the first execution phase and conservative 2PL in the restart phase [4].

In [6] and [9], we proposed methods for improving system performance. Our simulation study showed that ROCC (Read-commit Order Concurrency Control [6]) significantly outperforms WDL and 2PL and performs comparably to Snapshot Isolation. Since Snapshot Isolation does not provide a full serializability guarantee, ROCC seems very promising.

In this paper we explore a multi-version ROCC method (MVROCC). Our intentions are twofold: to keep the advantage of Snapshot Isolation and to fix its “write skew” problem at the system level. We expect that MVROCC may have better performance while maintaining correctness.

In the next section, we summarize related work. Section 3 presents the proposed multi-version ROCC. In Section 4 we analyze and discuss the possible performance gain. Section 5 concludes the paper and discusses future work.

2 Related work

2.1 ROCC

The Read-commit Order Concurrency Control method (ROCC [6]) is a deadlock-free, serializable concurrency control method based on optimistic mechanisms. It maintains a Read-Commit queue (RC-queue) that records the access order of transactions. Along with the RC-queue, an “intervening” validation algorithm is developed for execution validation. In addition to the traditional operation conflict, element conflict is introduced to reduce transaction restarts. Through the intervening element conflict check, transaction restarts and validation complexity are reduced significantly.

The key contribution of the ROCC method is the intervening validation algorithm. This algorithm checks only for conflicts between the validating transaction’s elements and the elements of other transactions that lie in its “intervening interval” (between its read elements and its commit element).
An element in the RC queue contains the identifiers of the transaction, the data items to be accessed in one request, and the read/write operations in the request. Element conflict is a conflict between a transaction’s operations represented by one element and the operations represented by another element that belongs to a different transaction. Four types of elements are defined: Read element, Commit element, Validated element and Restart element. A Read element represents the read/write request message a transaction submits. A Commit element represents a commit request message. A Validated element corresponds to a transaction that has been validated, or a transaction that does not need validation. A Restart element contains all the identifiers of the data items and the operations that a failed transaction intends to perform.

The scheduler’s intervening validation algorithm works as follows. When a commit request message arrives, the scheduler generates a Commit element and posts it to the RC queue, which starts the validation process for that transaction. The validation process traverses the queue from the Commit element back to the first Read element (called “first” below) of the validating transaction. If “first” conflicts with no elements in the intervening interval between “first” and the next element of the validating transaction, the validation process combines “first” with that next element and renames the combination “first”. This is iterated until the Commit element is reached. If “first” conflicts with no intervening elements in any step, the validation passes. On validation success, the scheduler sends a commit request message to the execution queue so that the data manager can perform the write operations; otherwise it sends out a restart request message. If a conflict is found, let “second” be the Commit element. Check whether “second” conflicts with any elements in the intervening interval between “first” and “second”, as above.
If both “first” and “second” have conflicts in their intervening intervals, the validation fails. Otherwise the validation passes. Note that this check for a second conflict allows one write-write conflict to exist in the interval.

If the data manager performs operations following their order in the RC queue when conflicts occur, and the transaction scheduler validates execution using the above “intervening” validation, then the execution of transactions is serializable.

2.2 Snapshot isolation

Snapshot Isolation is an optimistic multi-version concurrency control method in which the scheduler uses the first-committer-wins policy to avoid lost-update problems. A transaction always reads data from a snapshot of the (committed) data as of the time the transaction starts, called its Start-Timestamp. Reads from a transaction are never blocked, provided the snapshot data can be maintained (an assumed Data Manager function). The transaction’s writes (updates, inserts and deletes) are also recorded in this snapshot, so that they can be read again if the transaction accesses the data a second time. Updates of other transactions that begin after the transaction’s Start-Timestamp are invisible to the transaction.

Suppose we have a transaction T1. When T1 is ready to commit, it gets a Commit-Timestamp, which is larger than any existing Start-Timestamp or Commit-Timestamp. T1 commits only if no other transaction T2 with a Commit-Timestamp in T1’s interval [Start-Timestamp, Commit-Timestamp] wrote data that T1 also wrote. Otherwise, T1 aborts. This is the so-called first-committer-wins policy. As pointed out in [5][7], the first-committer-wins policy does not prevent write skew, so systems that use Snapshot Isolation (Oracle, PostgreSQL, Microsoft SQL Server and Exchange) have the write-skew problem. The system errors caused by write-skew problems could be “thousands per day” and “quite damaging” [7]. The policy allows the following history:
H1: r1[x=50] r1[y=50] r2[x=50] r2[y=50] w1[y=-40] w2[x=-40] c1 c2

To fix the write-skew problem, A. Fekete et al. [7] proposed promoting read operations to write operations at the application level, so that the first-committer-wins policy can be used to force one of the transactions in the cycle to abort. Such approaches require all possible operations to be known in advance. In addition, the promotion of reads may significantly degrade system performance.

3 Multi-version ROCC

Multi-version ROCC can be described as follows:
a. Read committed data (the latest committed version) to avoid cascading aborts.
b. Set a timestamp for each read element and each write element.
c. When the transaction commits, check whether there are two or more intervening element conflicts based on the read/write timestamps. If not, the transaction commits; otherwise it aborts.
d. Unlike ROCC, deferred writes are not necessary, because multi-versioning allows immediate writes without causing the cascading-abort problem.

An interesting question is what the timestamps for read and write operations should be, and which committed data version should be read when a read request arrives. A simple option is to follow the Snapshot idea, i.e., all reads read the latest data version committed before the Start-timestamp. This Start-timestamp could be any time before the transaction’s first read. Thus the timestamp for all reads in a transaction is the Start-timestamp. Likewise, the timestamp for all writes can be set to the Commit-timestamp to avoid the cascading-abort problem. Thus, unlike ROCC, in which there may be many intervals for conflict checking, the interval [Start-timestamp, Commit-timestamp] is the only interval left for validation. Intervening validation checking is thus simplified significantly.
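The snapshot read rule described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `VersionStore` (the class, its method names and the integer timestamps are illustrative and do not come from the paper): every committed write is stored as a version tagged with the writer's Commit-timestamp, and a read returns the latest version committed before the reader's Start-timestamp.

```python
class VersionStore:
    """Hypothetical in-memory multi-version store (illustrative only)."""

    def __init__(self):
        # item -> list of (commit_ts, value) pairs, one per committed version
        self._versions = {}

    def install(self, item, commit_ts, value):
        """Record a committed version tagged with the writer's Commit-timestamp."""
        self._versions.setdefault(item, []).append((commit_ts, value))

    def read(self, item, start_ts):
        """Return the latest version committed before the reader's Start-timestamp."""
        visible = [(ts, v) for ts, v in self._versions.get(item, []) if ts < start_ts]
        if not visible:
            raise KeyError(f"no version of {item!r} committed before {start_ts}")
        return max(visible, key=lambda version: version[0])[1]


store = VersionStore()
store.install("x", commit_ts=0, value=50)   # initial committed state
store.install("y", commit_ts=0, value=50)

# Assumed timestamps: T1 starts at 1, T2 starts at 2, T1 commits at 3.
store.install("y", commit_ts=3, value=-40)  # T1's write carries its Commit-timestamp

t2_view_of_y = store.read("y", start_ts=2)     # T2 reads from its own snapshot
later_view_of_y = store.read("y", start_ts=5)  # a transaction that starts after T1 commits
print(t2_view_of_y, later_view_of_y)
```

Because T1's write is tagged with timestamp 3, a reader whose Start-timestamp is 2 is unaffected by it, which is why reads never need to block under this rule.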
The difference between MVROCC and Snapshot Isolation is the scheduler algorithm: we use our intervening validation algorithm to replace the “first-committer-wins” policy to achieve serializability and better performance. Before we formalize the description of the intervening interval validation algorithm, we use two examples to explain how our algorithm avoids “write skew” and why it will produce better performance than Snapshot Isolation.

Example 1: avoiding “write skew”

We rewrite the history H1 with timestamps to help explain how our intervening validation algorithm avoids “write skew”. In this history, x and y are two accounts shared by a couple, with initial values x=50 and y=50. The integrity constraint is x + y > 0, i.e., the combined balance of the accounts may not be overdrawn. Transaction T2 withdraws 90 from x while transaction T1 concurrently withdraws 90 from y. Although Snapshot Isolation allows H1, the execution result x + y = (-40) + (-40) = -80 < 0 clearly violates the constraint (x + y > 0).

H1: r1[x=50] r1[y=50] r2[x=50] r2[y=50] w1[y=-40] w2[x=-40] c1 c2

Figure 1 Validation when T1 commits (T1’s Start-timestamp is at r1[x=50], T2’s Start-timestamp is at r2[x=50], and T1’s Commit-timestamp is at c1)

MVROCC checks the element conflicts in T1’s interval [Start-timestamp, Commit-timestamp] when T1 commits. The read element with T1’s Start-timestamp is r1[x=50] r1[y=50], and the commit element with T1’s Commit-timestamp is w1[y=-40] c1. The intervening element between the read element and the commit element is r2[x=50] r2[y=50] (note that w2[x=-40] is not included, since its timestamp is T2’s Commit-timestamp, which has not yet been assigned when T1 commits). Thus there is only one element conflict (T1’s commit element with the intervening element on y). T1 can commit, and T1’s read element and commit element merge into a combined element whose timestamp equals its Commit-timestamp, as shown in Figure 2.
This is because T1’s read element has no conflict with its intervening element and thus can freely move down to merge with its commit element.

r2[x=50] r2[y=50] r1[x=50] r1[y=50] w1[y=-40] c1 w2[x=-40] c2

Figure 2 Validation when T2 commits (T2’s Start-timestamp is at r2[x=50], T1’s Commit-timestamp is at c1, and T2’s Commit-timestamp is at c2)

When T2 commits, its read element is r2[x=50] r2[y=50], with its Start-timestamp. T2’s commit element is w2[x=-40] c2. The intervening element in T2’s interval [Start-timestamp, Commit-timestamp] is r1[x=50] r1[y=50] w1[y=-40] c1. We find a read-write conflict on y between T2’s read element and the intervening element, and a read-write conflict on x between the intervening element and T2’s commit element. Thus T2 has two-sided element conflicts and has to be aborted according to the criterion set by the intervening validation algorithm. “Write skew” is therefore avoided.

In the above example we introduced some new concepts: read element, commit element, intervening element and combined element. We give their formal definitions as follows.

Definition #1 (Read element): A transaction T’s read element is the set of all T’s reads, with a timestamp equal to the time when T reads its first data item. A transaction can have only one read element because of the snapshot read principle imposed on the MVROCC mechanism (the snapshot read principle requires that all reads be on data versions committed before the Start-timestamp).

Definition #2 (Commit element): A transaction T’s commit element is the set of all T’s writes, with a timestamp equal to the time when T requests to commit. A transaction can have only one commit element because of the snapshot write principle imposed on the MVROCC mechanism (the snapshot write principle requires all writes to be invisible until the transaction commits).

Definition #3 (Intervening element): A transaction T’s intervening element is the set of all reads/writes of other transactions with timestamps within T’s timestamp interval [Start-timestamp, Commit-timestamp].
A transaction can have only one intervening element because it has only one interval.

Definition #4 (Combined element): A transaction T’s combined element is the set of all T’s reads/writes, with a timestamp determined by the intervening validation algorithm. A transaction can have only one combined element after it commits. Otherwise it has to abort.

With these four element definitions, we define upper-sided and lower-sided element conflicts.

Definition #5 (Upper-sided element conflict): A transaction T has an upper-sided element conflict if operations in its read element conflict with operations in its intervening element.

Definition #6 (Lower-sided element conflict): A transaction T has a lower-sided element conflict if operations in its commit element conflict with operations in its intervening element.

With Definitions #1 to #6, we give the following intervening validation algorithm.

When transaction T requests to commit:
    check if T has an upper-sided element conflict;
    if T has an upper-sided element conflict {
        check if T has a lower-sided element conflict;
        if T has a lower-sided element conflict
            T aborts;
        else {
            T commits;
            set T’s combined element’s timestamp to its Start-timestamp;
        }
    } else {
        T commits;
        set T’s combined element’s timestamp to its Commit-timestamp;
    }

Figure 3 The intervening validation algorithm in MVROCC

From the above algorithm, we can see that MVROCC aborts only transactions that have both an upper-sided and a lower-sided element conflict (two-sided conflicts). We can prove that MVROCC produces a serializable execution history.

Proof: Suppose MVROCC produces a history that contains a cycle Ti … Tj … Ti, and MVROCC allows Ti to commit. Then Ti must have two timestamps after it commits: one before Tj’s timestamp and the other after Tj’s timestamp.
This contradicts the fact that Ti can have only one timestamp after it commits (note that we move an element to merge it only when there is no element conflict, so such a movement does not affect the conflict history).

4 Discussions

The following are the goals we try to achieve in developing MVROCC:
1. Keep the advantage of Snapshot Isolation. In particular, we want to guarantee that read-only transactions (queries) always succeed.
2. Be more permissive than Snapshot Isolation, and thus provide better performance.
3. Guarantee correctness.
4. Keep the modification to Snapshot Isolation to a minimum, so that it can be easily implemented in commercial products that have already implemented the Snapshot Isolation mechanism.
5. Allow easy recovery when the system crashes.

Goals #1 and #4 are achieved by reusing as much of Snapshot Isolation as possible. MVROCC simply replaces the first-committer-wins policy with its own intervening validation algorithm described in Figure 3. This algorithm makes the execution serializable, as proved in Section 3. It is also more permissive than the first-committer-wins policy, since it aborts only transactions that have two-sided conflicts. As A. Adya et al. pointed out in [8], write-write conflicts do not always cause non-serializable executions in multi-version database systems. For example, consider the history H2, in which transaction T2 increments the salaries of all employees for which “Dept = Sales”, and transaction T1 adds two employees, x and y, to the Sales department.

H2: w1[x1] r2[Dept=Sales: z] w1[y1] w2[z] c2 c1

Transaction T1 will be aborted by Snapshot Isolation because it has a predicate write-write conflict on y with w2[z] (y is in z since it satisfies the predicate “Dept = Sales”, and w2[z] is in T1’s interval [Start-timestamp, Commit-timestamp]).
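The contrast between the two commit policies can be made concrete with a small sketch. It is not the authors' implementation: the `Txn` and `Element` types, the integer timestamps, and the modelling of the predicate “Dept = Sales” as a single coarse item named `"Sales"` are all assumptions made for illustration. `first_committer_wins` follows the description in Section 2.2, and `intervening_validation` follows Figure 3. The sketch is run on H2 and, for comparison, on the write-skew history H1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    start_ts: int
    commit_ts: int
    reads: frozenset
    writes: frozenset


@dataclass(frozen=True)
class Element:
    """Operations of another transaction, placed at a single timestamp."""
    owner: str
    ts: int
    reads: frozenset
    writes: frozenset


def first_committer_wins(t, committed):
    """Snapshot Isolation: t may commit unless a transaction that committed
    inside t's interval wrote an item that t also wrote."""
    return not any(
        t.start_ts < c.commit_ts < t.commit_ts and (c.writes & t.writes)
        for c in committed
    )


def intervening_validation(t, elements):
    """Figure 3: returns (commits, combined_timestamp)."""
    inside = [e for e in elements
              if e.owner != t.name and t.start_ts < e.ts < t.commit_ts]
    other_reads = frozenset().union(*(e.reads for e in inside))
    other_writes = frozenset().union(*(e.writes for e in inside))
    upper = bool(t.reads & other_writes)                   # read element vs. intervening
    lower = bool(t.writes & (other_reads | other_writes))  # commit element vs. intervening
    if upper and lower:
        return False, None                                 # two-sided conflict: abort
    return True, (t.start_ts if upper else t.commit_ts)


# H2 with assumed timestamps: T1 = [1, 4], T2 = [2, 3].
sales = frozenset({"Sales"})
t1 = Txn("T1", 1, 4, reads=frozenset(), writes=sales)
t2 = Txn("T2", 2, 3, reads=sales, writes=sales)
# T2 commits first; the still-active T1 contributes only its (empty) read element.
t2_ok, t2_ts = intervening_validation(t2, [Element("T1", t1.start_ts, t1.reads, frozenset())])
# T2's combined element now sits at t2_ts, inside T1's interval.
t1_ok, t1_ts = intervening_validation(t1, [Element("T2", t2_ts, t2.reads, t2.writes)])
si_t1_ok = first_committer_wins(t1, committed=[t2])

# H1 (write skew) with assumed timestamps: T1 = [1, 3], T2 = [2, 4].
xy = frozenset({"x", "y"})
h1_t1 = Txn("T1", 1, 3, reads=xy, writes=frozenset({"y"}))
h1_t2 = Txn("T2", 2, 4, reads=xy, writes=frozenset({"x"}))
h1_t1_ok, h1_t1_ts = intervening_validation(h1_t1, [Element("T2", 2, xy, frozenset())])
h1_t2_ok, _ = intervening_validation(h1_t2, [Element("T1", h1_t1_ts, h1_t1.reads, h1_t1.writes)])
si_h1_t2_ok = first_committer_wins(h1_t2, committed=[h1_t1])

print("H2:", t2_ok, t1_ok, si_t1_ok)
print("H1:", h1_t1_ok, h1_t2_ok, si_h1_t2_ok)
```

Under these assumptions, T1 in H2 has an empty read element, so it cannot have an upper-sided conflict and passes validation, whereas the first-committer-wins check rejects it. On H1 the outcome is reversed: the first-committer-wins check accepts T2, while the validation of Figure 3 finds conflicts on both sides and aborts it.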
MVROCC will permit the history because T1 has only a lower-sided predicate conflict on x and y (x and y are both in z, so w2[z] conflicts with w1[x] w1[y]). The equivalent serial history is T2 T1 (r2[Dept=Sales: y1] w2[z] c2 w1[x] w1[y] c1; w1[x] w1[y] occur with c1). In general, MVROCC allows a write-write conflict within the interval as long as it is one-sided (a transaction commits when it has only a lower-sided element conflict).

Goal #5 (easy recovery) can be achieved by removing all the versions written by transactions that have not yet committed.

5 Conclusions and future work

In this paper we discussed MVROCC, a multi-version concurrency control method for database systems. Compared to ROCC, the implementation of MVROCC is surprisingly simple. No special technique is needed to store the RC queue for recovery purposes. No over-declaration is required as in [9]. Compared to Snapshot Isolation as used in Oracle 9i, PostgreSQL, Microsoft SQL Server and Exchange, MVROCC may have fewer restarts and still guarantees execution correctness (our preliminary simulation results showed that the throughput of MVROCC was 5% higher than that of Snapshot Isolation). It is also surprising to find that, to implement MVROCC, we only need to replace the first-committer-wins policy with intervening validation, a simple algorithm described in Figure 3. Our future work is to construct a prototype for further evaluation of MVROCC against Snapshot Isolation.

References

[1] J. Gray and A. Reuter, “Transaction Processing: Concepts and Techniques”, Morgan Kaufmann, 1993.
[2] P. O’Neil, “The Escrow transactional method”, ACM Transactions on Database Systems, Vol. 11, No. 4, 1986, pp. 405-429.
[3] P. Franaszek, J. Robinson and A. Thomasian, “Concurrency control for high performance environments”, ACM Transactions on Database Systems, Vol. 17, No. 2, 1992, pp. 304-345.
[4] A.
Thomasian, “Distributed optimistic concurrency control methods for high-performance transaction processing”, IEEE Transactions on Knowledge and Data Engineering, Vol. 10, No. 1, 1998, pp. 173-189.
[5] H. Berenson, P. Bernstein, J. Gray, E. O’Neil and P. O’Neil, “A critique of ANSI SQL isolation levels”, ACM SIGMOD, 1995, pp. 1-10.
[6] Victor T.S. Shi and William Perrizo, “A New Method for Concurrency Control in Centralized Database Systems”, ISCA CATA-2002.
[7] A. Fekete, E. O’Neil, P. O’Neil and D. Shasha, “Making Snapshot Isolation Serializable”, in print.
[8] A. Adya, B. Liskov and P. O’Neil, “Generalized Isolation Level Definitions”, Proceedings of the IEEE International Conference on Data Engineering, 2000.
[9] W. Perrizo, “Request Order Linked List (ROLL): A Concurrency Control Object”, Proceedings of the IEEE International Conference on Data Engineering, 1991.