Reliability Protocols in Distributed Database Systems
Chang Won Choi
CS632 – Advanced Database Systems
Cornell University
February 5, 2016

1 Introduction
In the past, distributed database systems were considered impractical to implement because network technology was too unreliable or too immature and because computers were too expensive to deploy in large numbers. As networks have become more reliable and computers much cheaper, however, interest in distributed database systems has grown. There are five main reasons for using a distributed database system [2]:

1. Many organizations are distributed in nature.
2. Multiple databases can be accessed transparently.
3. The database can be expanded incrementally – as needs arise, additional computers can be connected to the distributed database system.
4. Reliability and availability are increased – a distributed database can replicate data among several sites, so even if one site fails, the redundancy in the data increases the availability and reliability of the data as a whole.
5. Performance is improved – query processing can be performed at multiple sites, so in a high-performance network a distributed database system can mimic a parallel database system.

Even with these benefits, distributed database systems have not been widely used because of the many problems that arise in designing a distributed database management system (DDBMS).

1.1 Descriptions of DDBMS

1.1.1 Homogeneous Distributed DBMS vs. Heterogeneous Distributed DBMS
A distributed DBMS may be homogeneous or heterogeneous [2, 6]. In a homogeneous system, every site runs the same DBMS software, whereas in a heterogeneous system the sites may run different DBMS software. While homogeneous systems are easier to implement, heterogeneous systems are often preferable because organizations may already have different DBMSs installed at different sites and want to access them transparently.

1.1.2 Replication
A distributed DBMS can keep multiple copies of a relation at different sites or only a single copy [6]. The benefit of replication is increased reliability: if one site fails, other sites can still answer queries over the relation. Performance also improves, since a transaction can query a local copy and avoid network problems. The drawback is decreased performance under heavy update loads, because the distributed DBMS must keep every replica consistent with each transaction, which adds communication costs to update all copies of the data together.

1.1.3 Fragmentation
Implementations of a distributed DBMS can also differ in how they fragment data [6, 8]. Fragmentation breaks a relation into smaller relations that are stored at different sites. A relation can be fragmented horizontally or vertically: horizontal fragmentation splits the relation into subsets of its tuples, while vertical fragmentation splits its columns and stores them at different sites. Fragmentation is preferable when locality can be exploited. For example, it is faster for users in Ithaca to query the Ithaca site and for Chicago users to query the Chicago site; only when the entire relation is needed are both sites accessed.
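As a rough, self-contained illustration of the two fragmentation styles – not drawn from any particular DBMS, with the relation, predicate, and column names invented for the example – a relation can be modeled in Python as a list of rows and split either by a predicate (horizontally) or by columns (vertically):

# Illustrative only: a relation as a list of row dictionaries.
# The table, the city predicate, and the column choices are hypothetical.
employees = [
    {"emp_id": 1, "name": "Ann",  "city": "Ithaca",  "salary": 50000},
    {"emp_id": 2, "name": "Bob",  "city": "Chicago", "salary": 60000},
    {"emp_id": 3, "name": "Cara", "city": "Ithaca",  "salary": 55000},
]

# Horizontal fragmentation: split the rows by a predicate (here, by city),
# so each site stores the tuples its local users query most often.
ithaca_fragment  = [row for row in employees if row["city"] == "Ithaca"]
chicago_fragment = [row for row in employees if row["city"] == "Chicago"]

# Vertical fragmentation: split the columns; both fragments keep the key
# (emp_id) so the full relation can be rebuilt by joining on it.
ids_and_names    = [{"emp_id": r["emp_id"], "name": r["name"]}     for r in employees]
ids_and_salaries = [{"emp_id": r["emp_id"], "salary": r["salary"]} for r in employees]

# Rebuilding the relation takes a union of the horizontal fragments,
# or a key join of the vertical ones.
assert sorted(ithaca_fragment + chicago_fragment, key=lambda r: r["emp_id"]) == employees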
1.1.4 Local control
The last option in implementing a distributed DBMS is how much local control to give each site [6]. In some systems, the distributed DBMS may force sites to perform certain operations, and the administrator of a site has no control over them. In other systems, such as Mariposa, each site has total control and may refuse to perform operations [7]. Greater local control is preferable for load balancing: if a site is overloaded, it can refuse further operations and, if the data is replicated, refer those operations to other sites.

1.2 Problems in Designing Distributed DBMS
Designing a distributed DBMS raises the same problems as designing a traditional DBMS: query optimization and concurrency control must still be addressed. However, a distributed DBMS requires different solutions because of its distributed nature.

1.2.1 Query Optimization
Query optimization is harder in a distributed DBMS than in a centralized database system because of the additional variables of fragmentation, replication, and network transmission cost [6, 8]. These additional variables expand the search tree of candidate query plans, making it harder to choose an optimal one. For example, it is not readily apparent to the optimizer whether to ship a relation to another site and perform a join there, or to perform other operations first, after which the join inputs may have become smaller.

1.2.2 Concurrency Control & Recovery Method
As with centralized database systems, a DDBMS has to guarantee the ACID properties: atomicity, consistency, isolation, and durability [6, 8]. In a distributed DBMS, problems arise when transactions span multiple sites. For example, when a transaction commits, the distributed DBMS must ensure that all of its sub-transactions commit; and if some sites fail, how will the recovery process work? The purpose of this survey is to look at various protocols that enforce atomicity and durability, and at what future work could be done in this area.

2 Reliability Protocol Overview
One of the critical functions a database system must provide is correctness together with availability. During operation, the database system may stop running, or some transactions may have to be aborted before they commit. In those situations, atomicity and durability would be compromised if a committed transaction's effects were not written to disk or if an aborted transaction's effects were. It is the role of the transaction manager to guarantee correctness: all actions of a transaction happen or none of them do, and if a transaction commits then its effects persist. Furthermore, the recovery manager has to be efficient enough that the downtime of the database is minimized and its availability is maximized [2, 6].

2.1 No Steal-Force Method
A trivial way to ensure correctness is the no-steal/force method [6]. The effects of a transaction are held in the buffer pool until the transaction commits, and after it commits the effects are forced (written) to disk. There are two problems with this method. First, for a long transaction, no-steal prevents the contents of the buffer pool from being replaced until the transaction commits, which leads to poor throughput: fewer items can be placed in the buffer pool and the disk has to be accessed continuously. Second, force requires writes to disk at the end of every transaction, which leads to poor response time.
For example, if a page is updated frequently, it has to be forced to disk at every update; it would be more efficient to keep the page in the buffer pool and write it to disk after most of the updates are finished. However, force is needed to guarantee durability: what would happen if the database crashed before the data were written to disk? So the desired approach is a steal/no-force method that still guarantees atomicity and durability.

2.2 Write Ahead Logging
The solution is to log every action and outcome of every transaction [6]. For example, if a transaction decides to abort, the database system can look at the log and undo all of its actions. If the database crashes before committed writes reach the disk, then once the database comes back up, the recovery manager reads the log and redoes the actions of committed transactions. Write-ahead logging (WAL) is used to ensure atomicity and durability: log records are forced to stable storage before the corresponding updates take place and before a transaction commits. A more detailed explanation can be found in the ARIES paper [5].

3 Issues in Recovery Protocols for Distributed Databases
With distributed databases, guaranteeing atomicity and durability becomes more complicated [2]. Transactions usually span more than one site, so if a transaction commits, all sites involved in the transaction have to commit, and if the transaction aborts, all of its sub-transactions have to abort. The problem is how to do this efficiently. The two-phase commit protocol has been the most popular technique among implemented distributed DBMSs, and variations have been made to it to improve efficiency in some cases [1, 2, 4, 6]. A more radical variation is the three-phase commit protocol (3PC), in which an additional round of messages is exchanged and which in some cases leads to better performance [2].

3.1 Two-Phase Commit
The two-phase commit protocol (2PC) has been the most popular technique for commit processing in distributed databases: R*, Distributed INGRES, and DDM all use 2PC or variations of it [1, 2, 4, 6]. In its basic form there is a coordinator and a set of subordinates, where the coordinator is the site that initiated the transaction. In the first phase, the coordinator tries to obtain a unanimous decision (commit or abort) from the subordinates; in the second phase, it relays the decision back to them. The protocol goes as follows:

1. The coordinator writes a "prepare" log record to stable storage and sends "prepare" messages to the subordinates.
2. After a subordinate receives the "prepare" message, it decides whether it can commit and sends its response back to the coordinator. If the subordinate decides to commit, a "ready" log record is written to stable storage at the subordinate's site.
3. If the coordinator receives an abort reply from any subordinate, it sends an abort message to all subordinates. If all subordinates send "ready" messages, the coordinator writes a "global commit" log record to stable storage and sends a message telling everyone that all sites have voted to commit.
4. If a subordinate receives the message that everyone has decided to commit, it commits, writes a commit log record to its local stable storage, and sends an acknowledgement back to the coordinator. If a subordinate receives an "abort" message, it aborts.
5. When the coordinator receives acknowledgements from all subordinates, it records "complete" in the log and the transaction is finished.

If any subordinate fails to respond within a set time before the "commit" message has been sent, the coordinator sends abort to all subordinates.
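To make the message flow concrete, the following is a minimal, self-contained Python sketch of basic 2PC, with the network and stable storage replaced by in-memory stand-ins. The Subordinate class, the log lists, and the function names are assumptions made for illustration; they are not taken from R*, Distributed INGRES, or DDM, and timeouts and site failures are omitted.

# Hedged sketch of basic 2PC with in-memory stand-ins for messages and logs.
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.log = name, can_commit, []

    def on_prepare(self):
        # Phase 1 at the subordinate: vote, recording "ready" first.
        if self.can_commit:
            self.log.append("ready")        # stands in for a forced log write
            return "ready"
        return "abort"

    def on_decision(self, decision):
        # Phase 2 at the subordinate: obey the global decision and acknowledge.
        self.log.append(decision)
        return "ack"

def two_phase_commit(subordinates):
    coordinator_log = ["prepare"]           # recorded before any message is sent
    votes = [s.on_prepare() for s in subordinates]           # phase 1

    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append("global " + decision)             # the commit point

    acks = [s.on_decision(decision) for s in subordinates]   # phase 2
    if all(a == "ack" for a in acks):
        coordinator_log.append("complete")                   # transaction finished
    return decision, coordinator_log

# Example: one subordinate votes abort, so the whole transaction aborts.
print(two_phase_commit([Subordinate("A"), Subordinate("B", can_commit=False)]))

The append of the "global commit" (or "global abort") record stands in for the forced log write that, in the real protocol, marks the commit point before the decision is delivered.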
3.2 Resilience of 2PC
The two-phase commit protocol can handle failures at any of these steps and keep the database consistent. The cases below list the possible failures and how 2PC handles them. An important assumption is that each site has its own local recovery and concurrency control mechanisms to recover from local crashes and failures.

1. The coordinator sends the "prepare" messages.
   - If the coordinator fails: nothing happens at the other sites, since the coordinator has not sent out any messages yet.
   - If a subordinate fails: the subordinate never receives the "prepare" message, so after some predetermined time the coordinator aborts the transaction and sends abort to all subordinates.
2. The subordinates send either a "ready" or an "abort" message.
   - If the coordinator fails: all subordinates that replied "ready" wait until the coordinator recovers and sends back a reply; until then they keep sending messages to the coordinator requesting the decision.
   - If a subordinate fails: same as above – the coordinator times out, aborts the transaction, and notifies the subordinates.
3. The coordinator sends the "commit" or "abort" message.
   - If the coordinator fails: same as above – the "ready" subordinates wait and keep asking for the decision.
   - If a subordinate fails: the coordinator waits until the subordinate recovers and then sends the commit (or abort) message to it.
4. The subordinates send their acknowledgements.
   - If the coordinator fails: same as above.
   - If a subordinate fails: same as above – the coordinator waits for the subordinate to recover and resends the decision until it is acknowledged.

3.2.1 2PC Variations – Presumed Abort
The designers of the R* system improved the performance of 2PC by presuming abort [2, 4, 9]. In presumed abort, when the coordinator receives an "abort" vote from any subordinate, it sends "abort" messages to the other subordinates and then forgets about the entire transaction. If a subordinate fails during the process, then after it recovers it asks the coordinator what happened; since the coordinator's log holds no record of the transaction, the transaction is presumed aborted and the subordinate aborts. The advantage of this variation is a reduction in the number of control messages exchanged between the coordinator and the subordinates. However, presumed abort becomes less efficient when the network is unstable, in which case it is better for the subordinates and the coordinator to wait for the network to become reliable again.

3.3 2PC Issues & Non-blocking Commitment Protocol: 3PC
The main issue with traditional 2PC is blocking [2]. Blocking occurs when a failure forces the other sites to wait until the failure is resolved. For example, if there is a network or site failure after the first phase, the subordinates and/or the coordinator have to wait until communication is restored or the failed site recovers. Since recovery can take anywhere from a few minutes to hours or even days, it is very inefficient for sites to wait and hold their locks until a response is received. The problem is most acute when the coordinator fails, since all subordinates have to wait for it to recover; when a single subordinate fails, only the coordinator has to wait. This problem led to non-blocking solutions. The goal of a non-blocking protocol is to let the transaction continue at all operational sites, or abort it, even if the coordinator fails. The 3PC protocol works as follows:
1. The coordinator writes a "prepare" log record to stable storage and sends "prepare" messages to the subordinates.
2. After a subordinate receives the "prepare" message, it decides whether it can commit and sends its response back to the coordinator. If the subordinate decides to commit, a "ready" log record is written to stable storage at the subordinate's site.
3. If the coordinator receives an abort reply from any subordinate, it sends a "prepare abort" message to all subordinates. If all subordinates send ready-to-commit messages, the coordinator sends a "prepare commit" message to the subordinates and writes a "global commit" log record to stable storage.
4. If a subordinate receives a "prepare commit" message, it enters the "ready to commit" state, writes the appropriate log record to its local stable storage, and sends an "okay" message back to the coordinator. If a subordinate receives a "prepare abort" message, it does the same for abort.
5. When the coordinator receives "okay" messages back from all subordinates, it writes the appropriate log record to stable storage and sends a "final commit" or "final abort" message to the subordinates.

The difference between 3PC and 2PC shows when the coordinator fails before sending the "prepare commit" message. In 2PC, the subordinates would wait indefinitely until the coordinator comes back; in 3PC, a new coordinator is elected, and it behaves as follows, depending on the state it was in:

1. The new coordinator is in "prepare commit" mode.
   - Action: it sends "prepare commit" to the subordinates and waits for "okay" messages; once all "okay" messages are received, it sends "commit" to all subordinates.
   - Why this is correct: nothing has yet been decided at the other sites, so the new coordinator determines their states and acts accordingly.
2. The new coordinator has committed.
   - Action: "commit" messages are sent to all subordinates.
   - Why this is correct: if the coordinator has committed, every other site must have at least received "prepare commit" and sent an "okay" back to the old coordinator, so the other sites have no problem committing.
3. The new coordinator has aborted.
   - Action: "abort" messages are sent to all subordinates.
   - Why this is correct: if the coordinator has aborted, then in the worst case the other sites are at "prepare commit," so there is no inconsistency in aborting at the other sites.
4. The new coordinator is in "prepare" mode.
   - Action: "abort" messages are sent to all subordinates.
   - Why this is correct: same as above – no site can have committed yet, so aborting everywhere is safe.
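These recovery rules can be read as a small decision procedure. The sketch below is a hypothetical Python rendering of them, written only for illustration; the state labels and the send() callback are assumptions and not code from any surveyed system.

# Hedged sketch of the termination rule applied by a newly elected 3PC
# coordinator, based on its own last recorded state. Illustrative names only.
def new_coordinator_action(own_state, subordinates, send):
    if own_state == "committed":
        # Every site reached at least "prepare commit", so commit is safe.
        for sub in subordinates:
            send(sub, "commit")
    elif own_state == "aborted":
        # No site can have committed, so abort is safe.
        for sub in subordinates:
            send(sub, "abort")
    elif own_state == "prepare commit":
        # Re-run the last round: move everyone to "ready to commit" first.
        for sub in subordinates:
            send(sub, "prepare commit")
        # ...then, once all "okay" replies arrive, "commit" would be sent (not shown).
    else:  # still in "prepare": nothing has been decided anywhere
        for sub in subordinates:
            send(sub, "abort")

# Example: the new coordinator found itself in "prepare commit" mode.
new_coordinator_action("prepare commit", ["site1", "site2"],
                       lambda site, msg: print(site, "<-", msg))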
3.3.1 3PC Issues – Quorum-Based Protocols
Surprisingly, 3PC in its basic form does not guarantee consistency [2]. Consider the following case: let x and y be two groups of operational sites, and suppose that because of a network failure, communication between the two groups is completely lost. Following 3PC, each group elects its own coordinator and continues the protocol, so x and y may act differently. For example, x might elect a coordinator that was in "prepare commit" and later decide to commit, while y elects a coordinator that was in "prepare" mode and therefore uniformly aborts. The solution is to require a quorum of the subordinates and the coordinator [2]. If the coordinator fails and a new coordinator is chosen, the new coordinator starts collecting state information from the subordinates. An addition to 3PC is the "prepare to abort" state; unlike in the original 3PC, sites that respond to "prepare to abort" can still commit later on. "Prepare to abort" means that the site is willing to abort, but it does not have to and can still commit if the situation requires it. This is possible because the quorum is built after all sites have reached the "prepare" state, and in that state the sites do not care whether they commit or abort.

The quorum-based protocol also introduces two constants, V_A (the abort quorum) and V_C (the commit quorum), which are fixed for the transaction. If V is the total number of sites participating in the transaction, then V_A + V_C = V + 1, which ensures that a commit quorum and an abort quorum cannot both be formed by disjoint groups of sites. The new coordinator behaves as follows under the quorum-based protocol:

1. At least one of the sites has committed.
   - Action: send "commit" messages to the subordinates.
   - Reason: since at least one site has committed, the other sites must be at "prepare commit" or have committed, so eventually all sites commit.
2. At least one of the sites has aborted.
   - Action: send "abort" messages to the subordinates.
   - Reason: since at least one site has aborted, the other sites are at most in the "prepare" state, so there is no possibility that any other site commits.
3. The number of sites in the prepared-to-commit state is greater than V_C.
   - Action: send "commit" messages to the subordinates.
   - Reason: a commit quorum exists, and it is more likely that the remaining sites want to commit, so commit is sent.
4. The number of sites in the prepared-to-abort state is greater than V_A.
   - Action: send "abort" messages to the subordinates.
   - Reason: an abort quorum exists, and it is more likely that the remaining sites want to abort, so abort is sent.
5. The number of sites in the prepared-to-commit state plus the number of sites in unknown states is greater than V_C.
   - Action: send a "prepare commit" message to the sites in unknown states and wait.
   - Reason: same as above – a commit quorum may still be assembled.
6. The number of sites in the prepared-to-abort state plus the number of sites in unknown states is greater than V_A.
   - Action: send a "prepare abort" message to the sites in unknown states and wait.
   - Reason: same as above – an abort quorum may still be assembled.
7. In all other cases, no decision can be made and the coordinator waits.
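As a rough illustration of these quorum rules, the following sketch encodes them as a single decision function. The state labels, the tally dictionary, and the example values of V, V_C, and V_A are assumptions made for this example rather than details taken from [2].

# Hedged sketch of the quorum-based termination rule. `tally` maps each state
# to the number of sites reporting it; v_commit and v_abort are the commit and
# abort quorums with V_C + V_A = V + 1. All names are illustrative.
def quorum_decision(tally, v_commit, v_abort):
    committed = tally.get("committed", 0)
    aborted   = tally.get("aborted", 0)
    pc        = tally.get("prepare-to-commit", 0)
    pa        = tally.get("prepare-to-abort", 0)
    unknown   = tally.get("unknown", 0)

    if committed > 0:                 # someone committed: everyone must commit
        return "send commit"
    if aborted > 0:                   # someone aborted: no one can have committed
        return "send abort"
    if pc > v_commit:                 # a commit quorum is already prepared
        return "send commit"
    if pa > v_abort:                  # an abort quorum is already prepared
        return "send abort"
    if pc + unknown > v_commit:       # a commit quorum may still be reachable
        return "send prepare-commit to unknown sites and wait"
    if pa + unknown > v_abort:        # an abort quorum may still be reachable
        return "send prepare-abort to unknown sites and wait"
    return "wait"                     # no safe decision yet

# Example with V = 5 sites and V_C = 3, V_A = 3 (so V_C + V_A = V + 1):
print(quorum_decision({"prepare-to-commit": 2, "unknown": 2, "prepare-to-abort": 1}, 3, 3))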
3.4 Replication Issues & Miscellaneous Issues
Recovery protocols also have to account for replication if the distributed database system supports it [2]. For the copies of a data item to be consistent, all copies have to stay the same. This is achieved by updating every copy, using essentially the same protocol as 2PC or 3PC: one copy acts as the coordinator, sends the update commands to every other copy, and tries to commit them all. Another issue with 2PC, 3PC, or any other protocol is how to resolve commission errors [2]. Commission errors occur when messages are corrupted in transmission, when the coordinator sends wrong commands to the subordinates, or when the subordinates do not perform correctly. Since possible solutions to commission errors lie in network communication, they are not discussed in this survey.

4 Future Research
There are many possibilities for research on recovery protocols in distributed database systems. While the performance of the recovery protocol may be a minor concern in a centralized database system, communication costs give the recovery protocol a larger role in the overall performance of a distributed database. It would be valuable to study the performance of recovery protocols in different situations. Furthermore, research is needed on how the recovery protocol affects the performance of other components of a distributed database. For example, how will 3PC affect locking mechanisms for replicated data? Does presumed abort work well in cases where commits are frequent, and how quickly will locks be acquired and released? These are some of the questions that need to be answered before an efficient distributed database system can be implemented.

5 Bibliography
1. J. Bacon. Concurrent Systems: An Integrated Approach to Operating Systems, Database, and Distributed Systems. Addison-Wesley, 1993.
2. W. Cellary, E. Gelenbe, and T. Morzy. Concurrency Control in Distributed Database Systems. North-Holland, 1988.
3. S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill, 1984.
4. C. Mohan, B. Lindsay, and R. Obermarck. Transaction Management in the R* Distributed Database Management System. ACM Transactions on Database Systems, 11(4): 378-396, 1986.
5. C. Mohan et al. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Transactions on Database Systems, 17(1): 94-162, 1992.
6. R. Ramakrishnan. Database Management Systems. McGraw-Hill, 1998.
7. M. Stonebraker et al. Mariposa: A Wide-Area Distributed Database System.
8. P. Valduriez and M. Ozsu. Distributed and Parallel Database Systems. ACM Computing Surveys, 28(1), March 1996.
9. R. Williams et al. R*: An Overview of the Architecture. Technical Report RJ3325, IBM Research Laboratory, San Jose, CA, December 1981.
10. O. Wolfson. The Overhead of Locking (and Commit) Protocols in Distributed Database. ACM Transactions on Database Systems, 12(3): 453-471, September 1987.