In the name of God

Distributed Database Systems (Technical Report 1)
University of Tehran, Electrical and Computer Engineering Dept.
Directed by: Dr. M. Rahgozar
By: Samira Tasharofi, Reza Basseda
Summer 2005

Abstract
A distributed database system is a database whose relations reside at different sites, are replicated at different sites, or are split between different sites. The advantages that can be achieved by a distributed database system have motivated a great deal of research aimed at extending its use in different environments. This paper provides an overview of the different aspects of distributed database systems and the state of the art for each of those aspects. It also points out the areas that need further work.

1. Introduction
Distributed Database System (DDS) technology lies at the intersection of two technologies, namely database systems and computer networks. A distributed computing system consists of a number of processing elements (not necessarily homogeneous) that are interconnected by a computer network and that cooperate in performing their assigned tasks. A distributed database can be defined as a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system is a software system that permits the management of the distributed database and makes the distribution transparent to the users.

A typical distributed database has the following features:
- The entire database is distributed over a number of distant sites, potentially with replication of some portion of the data.
- Every site has the means to access data situated at a remote site.
- The records kept at a location are accessed more frequently by locally submitted transactions than by remote transactions.

A distributed database provides a number of advantages, such as:
- Local autonomy
- Improved performance (by proper fragmentation)
- Improved reliability/availability (by replication)
- Greater expandability
- Greater shareability

It also has some disadvantages:
- Higher complexity
- Higher software and hardware cost
- Synchronization and coordination among the sites
- Higher maintenance overhead in the case of replication
- Greater security problems

Section 2 of this paper deals with distributed data storage. Distributed transactions and their commit protocols are described in Sections 3 and 4, respectively. In Section 5 we consider concurrency control in distributed database systems. Section 6 focuses on the availability of data and on approaches that improve availability. Sections 7 and 8 investigate query evaluation and heterogeneous distributed databases. Finally, we provide a conclusion in Section 9.

2. Distributed Data Storage
Data allocation is a critical aspect of distributed database systems: a poorly designed data allocation can lead to inefficient computation, high access costs, and high network loads, whereas a well-designed data allocation can enhance data availability, diminish access time, and minimize the overall usage of resources. It is thus very important to provide distributed database systems with an efficient means of achieving effective data allocation. Two important issues in distributed data storage are fragmentation and replication. Some advanced features have also been proposed and will be described.

2.1 Fragmentation
In order to distribute data in a distributed database, we need to fragment the information. The goal of fragmentation is to minimize the total data transfer cost incurred in executing a set of queries.
There are three kinds of fragmentation, horizontal, vertical and mixed, as described below [1]:

Vertical fragments are created by dividing a global relation R on its attributes by applying the project operator: Rj = Π{Aj},key(R), where 1 ≤ j ≤ m, {Aj} is a set of attributes not in the primary key upon which the vertical fragment is defined, and m is the maximum number of fragments. A vertical fragmentation schema is complete when every attribute in the original global relation can be found in some vertical fragment defined on that relation. The reconstruction rule is then satisfied by a join on the primary key(s): for Rj ∈ {R1, R2, …, Rm}: R = ⋈key Rj. The disjointness rule does not apply in a strict sense to vertical fragmentation, since the reconstruction rule can only be satisfied when the primary key is included in each fragment.

Horizontal fragmentation divides a global relation R on its tuples by use of the selection operator: Rj = σPj(R), where 1 ≤ j ≤ m, Pj is the selection condition as a simple predicate, and m is the maximum number of fragments. The horizontal fragmentation schema satisfies the completeness rule if the selection predicates are complete. Furthermore, if a horizontal fragmentation schema is complete, the reconstruction rule is satisfied by a union operation over all the fragments: for Rj ∈ {R1, R2, …, Rm}: R = ∪ Rj. Disjointness is ensured when the selection predicates defining the fragments are mutually exclusive.

Mixed fragmentation is a hybrid fragmentation schema; it is a combination of horizontal and vertical fragments. If the correctness and disjointness rules are satisfied for the comprising fragments, they are implicitly satisfied for the entire hybrid schema. Reconstruction is achieved by applying the reconstruction operators in reverse order of fragment definition.

2.2 Replication
A relation or fragment of a relation is replicated if it is stored redundantly at two or more sites. Reliability and performance are the two major purposes of data replication. Data replication improves availability, especially for distributed databases, but it also creates a significant problem in keeping the copies consistent. Replication can be complete (replicated at all sites) or partial. Fully redundant databases are those in which every site contains a copy of the entire database [1, 8].

2.3 Dynamic Fragmentation
Traditionally, fragmentation in distributed databases has been determined by offline analysis and optimization; however, some enterprises have users accessing their databases under changing access patterns, which requires an approach to dynamic fragmentation, i.e., an algorithm that can reallocate data while the database is online. In [1], one of the approaches for dynamic fragmentation and for improving the performance of data access in distributed database systems is the RBy (bound) algorithm, which is based on horizontal fragmentation with partial replication, as described below:
1. For each query requested, a slave computer increments a counter (ctr) for the user that made the request.
2. If ctr reaches the bound number (a parameter of this algorithm), then this computer is a candidate to have a set of records replicated and needs to follow steps 3 and 4; otherwise go to step 5.
3. Request the set of records that the user is asking for and save this information into the slave database.
4. Reset the user's local counter to zero.
5. End.
This approach allows database availability even when the connection between the slave and master databases is broken.
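To make the counting-and-replication idea concrete, the following is a minimal Python sketch of a slave node in the spirit of the RBy algorithm. The class and method names (SlaveNode, handle_query, fetch) and the master interface are illustrative assumptions, not taken from [1].

    # Minimal sketch of the counter-based partial replication idea behind RBy.
    # Names (SlaveNode, bound, master.fetch) are illustrative, not from [1].

    class SlaveNode:
        def __init__(self, local_db, master, bound):
            self.local_db = local_db          # records already replicated locally
            self.master = master              # interface to the master database
            self.bound = bound                # replication threshold (algorithm parameter)
            self.counters = {}                # per-user query counters

        def handle_query(self, user, record_keys):
            # Step 1: count the user's requests at this slave.
            self.counters[user] = self.counters.get(user, 0) + 1

            # Step 2: once the bound is reached, replicate the requested records.
            if self.counters[user] >= self.bound:
                # Steps 3-4: fetch the records from the master, store them locally,
                # and reset the user's counter.
                for key in record_keys:
                    self.local_db[key] = self.master.fetch(key)
                self.counters[user] = 0

            # Serve the query, answering from local copies when possible.
            return {k: (self.local_db[k] if k in self.local_db else self.master.fetch(k))
                    for k in record_keys}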
In order to reduce the search space, the slave-master search (SMS) technique is proposed, in which data access is based on searching the slave (local) database before sending the query to the master computer.

2.4 Data Relocation
In a distributed system, long-running distributed services are often confronted with changes to the configuration of the underlying hardware. If the service is required to be highly available, it needs to adapt to these changes while still providing its service, and it must make the adaptation transparent to client processes. The problem becomes harder if multiple changes can occur concurrently. In [2], an approach for data relocation is presented to address these problems, as explained in this section.

In this approach, it is assumed that the service is implemented by a set of server processes, with each server located on a different machine and managing (or hosting) a disjoint subset of the records (no replication is assumed). Every record is always hosted by a single server, which is called the record's hosting server. The distributed service provides its services to external client processes. The load distribution policy is captured by a data structure called the mapping, which defines the hosting server of each record and can be used for redirecting requests to the appropriate servers. Mappings are managed by a separate service, called the configuration service, which is responsible for building an updated mapping that includes a configuration change and for providing the new mapping to all the servers when such a change occurs. To invoke an operation, a client arbitrarily picks a server and submits its request to it. The selected server then looks in its copy of the mapping to determine the record's hosting server and forwards the request accordingly. When a server receives a new mapping, it starts relocating records; this includes the transfer of mappings between a server and the configuration service, as well as the relocation of records between servers. The algorithm is proposed for multiple data redistribution situations, described in the following.

Single redistribution: For a single redistribution, the solution follows three steps:
Initialization: Initially, all the servers have a local copy of the authoritative mapping, M, which is used for forwarding requests to the proper hosting server. When the configuration service receives the notification of a configuration change, it computes a new mapping M' that reflects the change and distributes M' to all the servers of the distributed service.
Record relocation: Each server receives the new mapping M' and ships its own records where needed. During the record relocation step, servers continue to forward client requests, using the authoritative mapping M for not-yet-shipped records and M' for already-shipped records. Requests for already-shipped records are forwarded to the record's new hosting server as dictated by M'.
Termination: As soon as a server completes its record relocation step, it notifies the configuration service. When the configuration service has received completion notifications from all servers, it in turn notifies all the servers that the termination step can start, in which each server discards the mapping M and replaces it with the new mapping M' as the new authoritative mapping.
Different mappings are serviced sequentially.
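The forwarding rule used during a single redistribution can be sketched as follows. This is a simplified illustration under the assumptions above (one record per key, no failures); the names Server, route, and shipped are ours, not code from [2].

    # Minimal sketch of request forwarding during a single redistribution,
    # assuming each server tracks which of its records it has already shipped.

    class Server:
        def __init__(self, server_id, authoritative):
            self.server_id = server_id
            self.authoritative = authoritative   # mapping M: record key -> hosting server
            self.new_mapping = None              # mapping M' during a redistribution
            self.shipped = set()                 # records this server has already shipped

        def install_new_mapping(self, new_mapping):
            self.new_mapping = new_mapping       # the record relocation step may now begin

        def route(self, key):
            """Return the server that should handle a request for `key`."""
            if self.new_mapping is not None and key in self.shipped:
                return self.new_mapping[key]     # already shipped: use M'
            return self.authoritative[key]       # not yet shipped: use M

        def terminate_redistribution(self):
            # Called once the configuration service reports that all servers finished:
            # M' becomes the new authoritative mapping.
            self.authoritative = self.new_mapping
            self.new_mapping = None
            self.shipped.clear()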
To make the record relocation step more efficient, a server does not discard a record after it has been shipped to its new host. Instead, the server keeps handling lookup requests for such a record, but only for as long as that record remains consistent with the copy at its new hosting server, that is, until the first update request for that record is made.

Alternative design considerations
There are two strategies to deal with requests for already-shipped data while redistribution is in progress:
1. Reject the request and let the client keep trying until the redistribution is completed. This contradicts the transparency goal.
2. Always handle the request locally, independent of whether it is a lookup or an update request. In the case of an update request for a record that has already been relocated, the authoritative hosting server propagates the record's modified value to its new hosting server in order to keep the two copies of the record consistent. This solution has the advantage that update requests are processed slightly faster, but it introduces additional complexity for keeping the records consistent.
Also, there are two approaches for the initial forwarding of requests:
1. Forwarding a request to the record's authoritative hosting server and having it forwarded further if the record has already been shipped.
2. Forwarding the request to the record's new hosting server and having the record fetched on demand.
The first strategy favors frequent lookups and rare updates; the second strategy favors more frequent updates, as it eliminates the extra forwarding of every request for a shipped record that has been updated.

Overlapping redistributions: In the following approaches, efficiency is improved by introducing concurrency (let R1, R2, …, Rn be the sequence of upcoming redistributions and M1, M2, …, Mn their respective mappings; M is the current authoritative mapping of the distributed service as a whole).

Approach I: Per-server sequential redistribution
In this case, the configuration service generates a new mapping and distributes it to the servers as soon as it receives a notification of a new configuration change. The servers themselves are responsible for locally queuing incoming mappings and processing them one at a time in the order received. Each server maintains a queue of mappings, which always contains at least one mapping. A server that has relocated all records for redistribution R1 can start carrying out the record relocation for the next redistribution R2 before all other servers have completed redistribution R1. The authoritative mapping as known to the server (i.e., M) is removed from the server's queue only upon receiving a notification from the configuration service stating that redistribution R1 has been completed by all servers. Forwarding is based on the virtual mapping with first preference.

Approach II: Per-server mixed but ordered redistributions
The main idea in this case is that there are cases where a server does not need to complete a redistribution before starting to work on the next one. Assume a server is currently going through its set of records, checking which ones are to be shipped based on redistribution Ri, and it comes across a record that is not remapped by Ri. The server can then ship this record based on a successive redistribution Rj (j > i), even if it has not finished Ri yet. As in Approach I, forwarding of requests is based on the virtual mapping with first preference.
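One plausible way to realize the per-server mapping queue and first-preference forwarding of Approaches I and II is sketched below. The queue handling and the first_preference_host rule are our reading of the scheme, not code from [2].

    # Minimal sketch of a per-server mapping queue with "first preference" forwarding:
    # route using the oldest queued mapping under which the record is still valid here,
    # otherwise fall back to the newest remapping. Names are illustrative.

    from collections import deque

    class QueuedServer:
        def __init__(self, server_id, authoritative):
            self.server_id = server_id
            self.queue = deque([authoritative])  # head = authoritative mapping M
            self.shipped = set()                 # records already shipped by this server

        def receive_mapping(self, mapping):
            # New mappings are queued locally and processed in order (Approach I)
            # or opportunistically (Approach II); the forwarding logic is the same.
            self.queue.append(mapping)

        def first_preference_host(self, key):
            """Forward using the earliest queued mapping still valid for this record."""
            for mapping in self.queue:
                host = mapping[key]
                if host != self.server_id or key not in self.shipped:
                    return host
            # All queued mappings point here but the record was shipped:
            # fall back to the newest mapping's target.
            return self.queue[-1][key]

        def redistribution_completed(self):
            # Notification from the configuration service: drop the old authoritative mapping.
            if len(self.queue) > 1:
                self.queue.popleft()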
Approach III: Direct shipping to the final destination
The optimization introduced in Approach III entails that a record is shipped directly to the record's hosting server according to the last known redistribution. This policy keeps a record from being shipped from server to server when it is already known that it needs to be shipped further. Instead, the record is sent directly to the last server in the chain of servers it is mapped to. This policy prevents unnecessary network traffic and redistribution delay. The main difference between this approach and Approaches I and II is that a server ships records based on the virtual mapping with last preference among all the mappings in its queue. Records are thus relocated directly to the proper hosting server. However, servers still use the virtual mapping with first preference among these mappings to forward requests that cannot be handled locally.

3. Distributed Transactions
A multi-database system (MDBS) is a collection of autonomous local database systems (LDBSs) that is viewed as a single unit. There are two types of transactions in an MDBS. A local transaction, which accesses a local database only, is submitted directly to the LDBS. A global transaction is a set of sub-transactions, where each sub-transaction is a transaction accessing the data items at a single local site, and the component local database systems do not support a prepare-to-commit stage. There are three types of sub-transactions:
Retriable: it is guaranteed to commit after a finite number of submissions when executed from any consistent database state.
Compensatable: the effect of its execution can be semantically undone after commitment by executing a compensating sub-transaction at its local site.
Pivot: it is neither retriable nor compensatable. In each global transaction, at most one sub-transaction can be a pivot.
Global transaction management requires cooperation from local sites to ensure the consistent and reliable execution of global transactions in a distributed database system. In a heterogeneous distributed database (or multi-database) environment, the various local sites make conflicting assertions of autonomy over the execution of global transactions. Global serializability is an accepted correctness criterion for the execution of (non-flexible) global and local transactions in the HDDBS environment. A global schedule S is globally serializable if the committed projection from S of both the global transactions in the HDDBS environment and the transactions that run independently at local sites is conflict-equivalent to some serial execution of those transactions.

3.1 Flexible Transactions
A flexible transaction model for the specification of global transactions makes it possible to deal robustly with these conflicting requirements. In heterogeneous database systems, flexible transactions can increase the failure resilience of global transactions by allowing alternate (but in some sense equivalent) executions to be attempted when a local database system fails or some sub-transactions of the global transaction abort. The flexible transaction model supports flexible execution control flow by specifying two types of dependencies among the sub-transactions of a global transaction: execution ordering dependencies between two sub-transactions, and alternative dependencies between two subsets of sub-transactions.
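The sub-transaction types and dependency kinds above can be captured in a small data model. The following Python sketch is illustrative only; its class and field names are not taken from the flexible transaction literature.

    # Illustrative data model for a flexible transaction, assuming the three
    # sub-transaction types and the two dependency kinds described above.

    from dataclasses import dataclass, field
    from enum import Enum

    class SubTxnType(Enum):
        RETRIABLE = "retriable"          # guaranteed to commit after finitely many retries
        COMPENSATABLE = "compensatable"  # can be semantically undone after commit
        PIVOT = "pivot"                  # neither retriable nor compensatable

    @dataclass
    class SubTransaction:
        name: str
        site: str
        kind: SubTxnType

    @dataclass
    class FlexibleTransaction:
        subs: dict                                        # name -> SubTransaction
        ordering: list = field(default_factory=list)      # (before, after) execution-order pairs
        alternatives: list = field(default_factory=list)  # (preferred set, fallback set) pairs

        def validate(self):
            # At most one pivot sub-transaction is allowed in a global transaction.
            pivots = [s for s in self.subs.values() if s.kind is SubTxnType.PIVOT]
            assert len(pivots) <= 1, "a flexible transaction may contain at most one pivot"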
3.1.1 Semi-Atomicity in Flexible Transactions
In [3], semi-atomicity is presented as a weaker form of atomicity for flexible transactions that allows local sites to autonomously maintain serializability and recoverability. Let T = {t1, t2, …, tn} be a repertoire of sub-transactions and P(T) the collection of all subsets of T. Let ti, tj ∈ T and let Ti ⊆ T be a subset of T with a precedence relation <i defined on Ti. (Ti, <i) is a representative partial order, abbreviated <-rpo, if the execution of the sub-transactions in Ti represents the execution of the entire flexible transaction T. The execution of a flexible transaction T preserves the property of semi-atomicity if one of the following conditions is satisfied:
- All sub-transactions in one <-rpo commit, and all attempted sub-transactions not in the committed <-rpo are either aborted or have their effects undone.
- No partial effects of its sub-transactions remain permanent in the local databases.
The preservation of the weaker property of semi-atomicity renders flexible transactions more resilient to failures than traditional global transactions. This property is preserved through a combination of compensation and retry approaches. The construction of recoverable flexible transactions that are executable in the error-prone MDBS environment demonstrates that the flexible transaction model indeed enhances the scope of global transaction management beyond that offered by the traditional global transaction model. Using flexible transactions, the blocking that may be caused by the 2PC protocol can be prevented. Compensating sub-transactions may be subject to fewer restrictions in such an environment.

3.1.2 Global Serializability
Some sub-transactions of a flexible transaction that do not belong to the committed <-rpo (as noted above) may have committed and had their effects compensated. Such sub-transactions, called invalid sub-transactions, together with their compensating transactions, are termed surplus transactions. In the case of flexible transactions, a global schedule S is serializable if the projection of committed local, flexible, and surplus transactions is conflict-equivalent to some serial execution of these transactions.

3.1.3 F-Serializability
F-serializability is a concurrency control criterion for flexible and local transactions that is stricter than global serializability, in that it prevents transactions that are serialized between a flexible transaction and its compensating sub-transactions from affecting any data items that have been updated by the flexible transaction, as described in [4]. A global schedule S is compensation-interference free if for any sub-transaction tj that is serialized between a sub-transaction ti and its compensating transaction cti in S, WC(ti) ∩ AC(tj) = ∅, where WC(t) denotes the set of data items that t writes and commits and AC(t) denotes the set of data items that t accesses and commits. Let S be a global schedule of a set of well-formed flexible transactions and local transactions. S is F-serializable if it is globally serializable and compensation-interference free. If we consider the traditional definition of global serializability, in which all sub-transactions and their compensating sub-transactions of a flexible transaction at a local site are treated as a logically atomic sub-transaction, then the set of F-serializable schedules is a superset of the globally serializable schedules.
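The compensation-interference condition WC(ti) ∩ AC(tj) = ∅ can be checked mechanically once the relevant sets are known. The following sketch is a simplified illustration with made-up inputs, not the scheduling protocol of [4].

    # Hedged check for the compensation-interference condition above:
    # for every sub-transaction tj serialized between ti and its compensator cti,
    # WC(ti) and AC(tj) must not intersect. Inputs and names are illustrative.

    def compensation_interference_free(between, wc, ac):
        """
        between: dict mapping a sub-transaction ti to the transactions serialized
                 between ti and its compensating transaction cti
        wc:      dict mapping each transaction t to WC(t), its written-and-committed items
        ac:      dict mapping each transaction t to AC(t), its accessed-and-committed items
        """
        for ti, middle in between.items():
            for tj in middle:
                if wc.get(ti, set()) & ac.get(tj, set()):
                    return False   # tj touched an item that ti wrote: interference
        return True

    # Example: t2 reads item 'x' that t1 wrote before ct1 runs, so the schedule
    # is not compensation-interference free.
    print(compensation_interference_free(
        between={"t1": ["t2"]},
        wc={"t1": {"x"}},
        ac={"t2": {"x", "y"}}))   # -> False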
Scheduling protocol
To obtain a GTM scheduling protocol (which assumes that each global sub-transaction and compensating transaction predeclares its read and write sets) that ensures F-serializability for the execution of local and flexible transactions and avoids cascading aborts, the Stored Sub-transaction Execution Graph (SSEG) is maintained. The SSEG of a set of flexible transactions in a global schedule S is a directed graph whose nodes are the global sub-transactions and compensating sub-transactions of those flexible transactions, and whose edges ti → tj indicate that tj must serialize before ti due to preference, precedence, or conflict. Proper insertion and deletion rules for nodes and edges are also defined on this graph.

4. Commit Protocols
Distributed database systems implement a transaction commit protocol to ensure transaction atomicity. In a distributed database system, a single transaction may have to execute at many sites, depending on the location of the data that it needs to process. In such cases, some sites of a transaction may decide to commit while others decide to abort it, resulting in a violation of transaction atomicity. To address this problem, distributed database systems use transaction commit protocols. The job of the commit protocol is to make sure that all participating sites of a transaction agree upon the final outcome (commit or abort) of the transaction. Most importantly, this assurance has to hold in the presence of network failures. In this section an overview of the commit protocols presented in [5, 6] is given.

4.1 General Commit Protocols
A variety of commit protocols has been proposed (for non-real-time database systems) for distributed database systems, including:
Two-Phase Commit (2PC): It operates in two phases. In the first phase, called the "voting phase", the master reaches a global decision (commit or abort) based on the local decisions of the cohorts. In the second phase, called the "decision phase", the master conveys this decision to the cohorts. In this protocol, cohorts use logging mechanisms so that their actions can be undone if the transaction aborts.
Presumed Abort (PA): It is a variant of the 2PC protocol that behaves identically to 2PC for committing transactions but has reduced message and logging overhead for aborted transactions. It is not necessary for cohorts to send ACKs for ABORT messages from the master, or to force-write the abort record to the log.
Presumed Commit (PC): This protocol is another variant of 2PC, in which cohorts do not send ACKs for a commit decision sent by the master and do not force-write a commit log record. In addition, the master does not write an end log record. In other words, the overhead is reduced for committing transactions rather than for aborted transactions.
Three-Phase Commit (3PC): A fundamental problem with all of the above protocols is that cohorts may become blocked in the event of a site failure and remain blocked until the failed site recovers. This protocol addresses the blocking problem of the above protocols by inserting an extra phase, called the "precommit phase", between the two phases of the 2PC protocol, at the price of increased communication (messaging) and logging overhead.
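To make the voting and decision phases of the basic 2PC protocol of Section 4.1 concrete, here is a minimal coordinator-side sketch. The cohort interface (prepare/commit/abort) and the force_log helper are assumptions made for illustration; real implementations add timeouts, failure handling, and recovery from the log.

    # Minimal sketch of the two phases of 2PC from the master's point of view.

    def force_log(record):
        # Stand-in for force-writing a log record to stable storage.
        print("LOG(forced):", record)

    def two_phase_commit(transaction_id, cohorts):
        # Phase 1 (voting phase): collect local decisions from all cohorts.
        votes = []
        for cohort in cohorts:
            votes.append(cohort.prepare(transaction_id))   # True = "yes" vote

        # Phase 2 (decision phase): the master decides and conveys the decision.
        if all(votes):
            force_log(("commit", transaction_id))
            for cohort in cohorts:
                cohort.commit(transaction_id)
            return "committed"
        else:
            force_log(("abort", transaction_id))
            for cohort in cohorts:
                cohort.abort(transaction_id)
            return "aborted"

PA and PC can be read against this baseline as removing some of the forced log writes and acknowledgements, as described above.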
4.2 Real-Time Commit Protocols
Distributed commit processing can have considerably more effect than distributed data processing on real-time performance. It is therefore important to use commit protocols that take the constraints of real-time database systems into account.

The semantics of a firm deadline in distributed real-time database systems is that a transaction should either be committed before its deadline or be killed when the deadline expires. A distributed firm-deadline real-time transaction is said to be committed if the master has reached the commit decision (that is, forced the commit log record to disk) before the expiry of the deadline at its site. This definition applies irrespective of whether the cohorts have also received and recorded the commit decision by the deadline.

Permits Reading Of Modified Prepared-data for Timeliness (PROMPT)
It is reported as the best-performing two-phase protocol in DRTDBS. This protocol allows transactions to "optimistically" borrow, in a controlled manner, the updated data of transactions currently in their commit phase. This controlled borrowing reduces the data inaccessibility and the priority inversion that are inherent in distributed real-time commit processing. To further improve its real-time performance, three additional features are included in the PROMPT protocol:
1) Active Abort: cohorts inform the master as soon as they decide to abort locally, rather than only upon explicit request by the master.
2) Silent Kill: there is no need for the master to invoke the abort protocol, since the cohorts of the transaction can independently detect that the deadline has been missed (assuming global clock synchronization).
3) Healthy Lending: the health factor associated with each transaction is computed at the point in time when the master is ready to send PREPARE messages and is defined as the ratio TimeLeft/MinTime, where TimeLeft is the time left until the transaction's deadline and MinTime is the minimum time required for commit processing. In this scheme, a transaction is allowed to lend its data only if its health factor is greater than a (system-specified) minimum value MinHF, since lending by a transaction that is close to its deadline results in the abort of all the associated borrowers.

Early Prepare (EP) Commit Protocol (one-phase protocol)
This protocol uses the PC protocol to eliminate one round of messages for a distributed transaction that executes in the absence of failures. It also reduces the communication overhead further by making each cohort enter the prepared state after it performs its work and before it replies to the master with the WORKDONE message. The master may have to force multiple MEMBERSHIP records, because the transaction membership may grow as transaction execution progresses. Also, the master must record a cohort's identity in its stable log before sending a work request to that cohort. The steps of EP's execution are as below:
- The master forces one or more MEMBERSHIP log records and sends a STARTWORK request to each cohort.
- Each cohort executes its work request, forces a PREPARE log record, and replies to the master with a WORKDONE message.
- A commit decision is reached if all cohorts have performed their jobs successfully and are thereby ready to commit. The master forces a COMMIT log record, sends a COMMIT message to each cohort, and forgets about the transaction.
- Each cohort appends to its log (but need not force) a COMMIT log record and then forgets about the transaction.
In contrast to 2PC or PC, when EP is used and several requests are sent to each cohort, each cohort forces only one PREPARE log record regardless of the number of work requests it receives. Also, a cohort using EP synchronously forces each PREPARE record as early as possible, while a cohort using PC synchronously forces the PREPARE record as late as possible.
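PROMPT's Healthy Lending rule reduces to a simple ratio test. The sketch below illustrates it with hypothetical helper names and example numbers; it is not code from the PROMPT papers.

    # Small sketch of PROMPT's Healthy Lending test, using the ratio defined above.

    def health_factor(deadline, now, min_commit_time):
        """Health factor = TimeLeft / MinTime at the instant PREPAREs are sent."""
        time_left = deadline - now
        return time_left / min_commit_time

    def may_lend(deadline, now, min_commit_time, min_hf):
        # A transaction may lend its prepared data only if its health factor
        # exceeds the system-specified threshold MinHF.
        return health_factor(deadline, now, min_commit_time) > min_hf

    # Example: 800 ms left until the deadline, 200 ms minimum commit time,
    # MinHF = 1.2 -> health factor 4.0, so lending is allowed.
    print(may_lend(deadline=1000, now=200, min_commit_time=200, min_hf=1.2))  # True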
Comparison of real-time protocols
EP implicitly incorporates one strong feature of PROMPT, namely active abort. As a result, EP outperforms PROMPT when the cohorts of a distributed transaction execute in parallel. However, the performance of EP is rather poor in environments where the cohorts of a distributed transaction execute sequentially. Only under very high workloads does EP, even with sequential execution, perform extremely well compared to any two-phase protocol in the presence of both resource and data contention.

4.3 A Two-Phase Commit Protocol for Mobile Wireless Environments (M-2PC)
As noted in [7], if traditional 2PC is executed in a mobile environment, disconnections will increase the number of (possibly unnecessary) decisions to abort transactions, because if a Fixed Host (FH) tries to communicate with a disconnected Mobile Host (MH), the attempt is treated as a failure. The more frequent disconnections are, the more frequent transaction aborts become. This is not acceptable in mobile environments, because frequent disconnections are not exceptions but rather part of the normal mode of operation, so they should not be treated as failures. The M-2PC protocol is therefore proposed for mobile environments. In this protocol, the coordinator must reside in the fixed part of the network so that it is directly reachable by the fixed participants. The situations considered by this protocol are the following:
The case of a mobile client and fixed servers: To handle disconnections, the client delegates its commit duties to the coordinator, which is always available during protocol execution. The client sends the commit request to the coordinator along with its logs. Afterwards, the client can disconnect. The coordinator sends vote messages to all participants and decides whether to commit or abort according to the traditional 2PC principle. After receiving the acknowledgements, the coordinator informs the client, which may be in another cell, about the result. The coordinator waits for the client's acknowledgement before forgetting about the transaction (releasing resources). To mitigate unforeseeable breakdowns, the client must force-write the identity and location information of the coordinator just before sending the commit request.
The case of a mobile client and mobile servers: In this situation the mobile server is called a participant. A representation agent for the mobile server, called the participant-agent, works on behalf of the mobile server, which is free to disconnect from the moment it delegates its commitment duties to its representation agent. The participant-agent is responsible for transmitting the result to the participant at reconnection time and also for keeping logs and, if necessary, recovering in the case of failure. The participant is free to move to another cell during protocol execution. When it registers with a new base station (BS), the participant MH (or mobile participant) informs its participant-agent about its new location. Again, the workload is shifted to the fixed part of the network, thus preserving processing power and communication resources and minimizing traffic cost over the wireless links.

5. Concurrency Control
Concurrency control has been actively investigated for the past several years, and the problem for non-distributed DBMSs is well understood.
A broad mathematical theory has been developed to analyze the problem, and one approach, called two-phase locking, has been accepted as a standard solution. Current research on non-distributed concurrency control is focused on evolutionary improvements to two-phase locking, detailed performance analysis and optimization, and extensions to the mathematical theory. The concurrency control problem is exacerbated in a distributed DBMS (DDBMS) because (1) users may access data stored in many different computers in a distributed system, and (2) a concurrency control mechanism at one computer cannot instantaneously know about interactions at other computers. More than 20 concurrency control algorithms have been proposed for DDBMSs, and several have been, or are being, implemented. These algorithms are usually complex, hard to understand, and difficult to prove correct (indeed, many are incorrect). Because they are described in different terminologies and make different assumptions about the underlying DDBMS environment, it is difficult to compare the many proposed algorithms, even in qualitative terms. In fact, the subalgorithms used by all practical DDBMS concurrency control algorithms are variations of just three basic techniques:
- Two-phase locking
- Timestamp ordering
- Optimistic methods
Thus the state of the art is far more coherent than a review of the literature would seem to indicate. Well-known centralized concurrency control techniques can be extended to solve the problem of concurrency control in distributed databases, but not all concurrency control techniques are suitable for a distributed database. One example is serialization graph testing, which works well in a centralized database system given relatively powerful processors compared to I/O speed. But in a distributed environment, keeping the graph updated at all times is prohibitively expensive because of the communication costs. In recent years, several distributed database systems have been realized. Usually, concurrency control in these systems has been done by some kind of two-phase locking, but as processor speed increases relative to I/O and communication speed, it is expected that timestamp ordering will be able to compete with two-phase locking in performance. In theory, timestamp ordering scheduling should be capable of good performance in distributed systems. It is deadlock free and avoids much of the communication needed for synchronization and lock management. The work that has been done has been mostly theoretical, but some interesting simulation models have been developed and simulated at the University of Wisconsin.

Bernstein and Goodman review many of the proposed algorithms and describe how additional algorithms may be synthesized by combining basic mechanisms from the locking and timestamp classes [17]. They present a framework for the design and analysis of distributed database concurrency control algorithms. The framework has two main components: (1) a system model that provides common terminology and concepts for describing a variety of concurrency control algorithms, and (2) a problem decomposition that decomposes concurrency control algorithms into read-write and write-write synchronization subalgorithms. They consider synchronization subalgorithms outside the context of specific concurrency control algorithms. Virtually all known database synchronization algorithms are variations of two basic techniques: two-phase locking (2PL) and timestamp ordering (T/O).
They describe the principal variations of each technique, though they do not claim to have exhausted all possible variations. In addition, they describe ancillary problems (e.g., deadlock resolution) that must be solved to make each variation effective. They show how to integrate the described techniques to form complete concurrency control algorithms. They list 47 concurrency control algorithms, describing 25 in detail:

5.1 Basic 2PL
An implementation of 2PL amounts to building a 2PL scheduler, a software module that receives lock requests and lock releases and processes them according to the 2PL specification. The basic way to implement 2PL in a distributed database is to distribute the schedulers along with the database, placing the scheduler for data item x at the DM where x is stored. In this implementation, readlocks may be implicitly requested by dm-reads and writelocks may be implicitly requested by prewrites. If the requested lock cannot be granted, the operation is placed on a waiting queue for the desired data item. (This can produce a deadlock.) Writelocks are implicitly released by dm-writes. However, to release readlocks, special lock-release operations are required. These lock releases may be transmitted in parallel with the dm-writes, since the dm-writes signal the start of the shrinking phase. When a lock is released, the operations on the waiting queue of that data item are processed in first-in/first-out (FIFO) order.

5.2 Primary Copy 2PL
Primary copy 2PL is a 2PL technique that pays attention to data redundancy. One copy of each logical data item is designated the primary copy; before accessing any copy of the logical data item, the appropriate lock must be obtained on the primary copy. For readlocks this technique requires more communication than basic 2PL. Suppose x1 is the primary copy of logical data item X, and suppose transaction T wishes to read some other copy, x2, of X. To read X, T must communicate with two DMs: the DM where x1 is stored (so T can lock x1) and the DM where x2 is stored (so T can read x2). By contrast, under basic 2PL, T would only communicate with x2's DM. For writelocks, however, primary copy 2PL does not incur extra communication. Suppose T wishes to update X. Under basic 2PL, T would issue prewrites to all copies of X (thereby requesting writelocks on these data items) and then issue dm-writes to all copies. Under primary copy 2PL the same operations would be required, but only the prewrite on x1 would request a writelock. That is, prewrites would be sent for x1, …, xm, but the prewrites for x2, …, xm would not implicitly request writelocks.

5.3 Voting 2PL
Voting 2PL (or majority consensus 2PL) is another 2PL implementation that exploits data redundancy. Voting 2PL is derived from the majority consensus technique of Thomas and is only suitable for ww synchronization. To understand voting, we must examine it in the context of two-phase commit. Suppose transaction T wants to write into X. Its TM sends prewrites to each DM holding a copy of X. For the voting protocol, the DM always responds immediately. It acknowledges receipt of the prewrite and says "lock set" or "lock blocked". (In the basic implementation it would not acknowledge at all until the lock is set.) After the TM receives acknowledgments from the DMs, it counts the number of "lock set" responses: if the number constitutes a majority, then the TM behaves as if all locks were set.
Otherwise, it waits for "lock set" responses from the DMs that originally said "lock blocked". Deadlocks aside, it will eventually receive enough "lock set" responses to proceed. Since only one transaction can hold a majority of locks on X at a time, only one transaction writing into X can be in its second commit phase at any time. All copies of X thereby have the same sequence of writes applied to them. A transaction's locked point occurs when it has obtained a majority of its writelocks on each data item in its writeset. When updating many data items, a transaction must obtain a majority of locks on every data item before it issues its dm-writes.

5.4 Centralized 2PL
Instead of distributing the 2PL schedulers, one can centralize the scheduler at a single site. Before accessing data at any site, appropriate locks must be obtained from the central 2PL scheduler. So, for example, to perform dm-read(x) where x is not stored at the central site, the TM must first request a readlock on x from the central site, wait for the central site to acknowledge that the lock has been set, and then send dm-read(x) to the DM that holds x. (To save some communication, one can have the TM send both the lock request and dm-read(x) to the central site and let the central site directly forward dm-read(x) to x's DM; the DM then responds to the TM when dm-read(x) has been processed.) Like primary copy 2PL, this approach tends to require more communication than basic 2PL, since dm-reads and prewrites usually cannot implicitly request locks.

5.5 Basic T/O Implementation
An implementation of T/O amounts to building a T/O scheduler, a software module that receives dm-reads and dm-writes and outputs these operations according to the T/O specification. In practice, prewrites must also be processed through the T/O scheduler for two-phase commit to operate properly. As was the case with 2PL, the basic T/O implementation distributes the schedulers along with the database.

5.6 The Thomas Write Rule
For ww synchronization the basic T/O scheduler can be optimized using the following observation. Let W be a dm-write(x), and suppose ts(W) < W-ts(x). Instead of rejecting W, we can simply ignore it. We call this the Thomas Write Rule (TWR). Intuitively, TWR applies to a dm-write that tries to place obsolete information into the database. The rule guarantees that the effect of applying a set of dm-writes to x is identical to what would have happened had the dm-writes been applied in timestamp order. If TWR is used, there is no need to incorporate two-phase commit into the ww synchronization algorithm; the ww scheduler always accepts prewrites and never buffers dm-writes. For rw synchronization the basic T/O scheduler can be improved using multiversion data items. For each data item x there is a set of R-ts's and a set of (W-ts, value) pairs, called versions. The R-ts's of x record the timestamps of all executed dm-read(x) operations, and the versions record the timestamps and values of all executed dm-write(x) operations.

5.7 Conservative Timestamp Ordering
Conservative timestamp ordering is a technique for eliminating restarts during T/O scheduling. When a scheduler receives an operation O that might cause a future restart, the scheduler delays O until it is sure that no future restarts are possible. Conservative T/O requires that each scheduler receive dm-reads (or dm-writes) from each TM in timestamp order. For example, if scheduler Sj receives dm-read(x) followed by dm-read(y) from TMi, then ts(dm-read(x)) ≤ ts(dm-read(y)).
Since the network is assumed to be a FIFO channel, this timestamp ordering is accomplished by requiring that TMi send dm-reads (or dm-writes) to Sj in timestamp order. Conservative T/O buffers dm-reads and dm-writes as part of its normal operation. When a scheduler buffers an operation, it remembers the TM that sent it. Let min-R-ts(TMi) be the minimum timestamp of any buffered dm-read from TMi, with min-R-ts(TMi) = −∞ if no such dm-read is buffered. Define min-W-ts(TMi) analogously. Conservative T/O performs rw synchronization as follows:
1. Let R be a dm-read(x). If ts(R) > min-W-ts(TM) for any TM in the system, R is buffered; else R is output.
2. Let W be a dm-write(x). If ts(W) > min-R-ts(TM) for any TM, W is buffered; else W is output.
3. When R or W is output or buffered, this may increase min-R-ts(TM) or min-W-ts(TM); buffered operations are retested to see if they can now be output.
The effect is that R is output if and only if (a) the scheduler has a buffered dm-write from every TM, and (b) ts(R) is less than the minimum timestamp of any buffered dm-write. Similarly, W is output if and only if (a) there is a buffered dm-read from every TM, and (b) ts(W) is less than the minimum timestamp of any buffered dm-read. Thus R (or W) is output if and only if the scheduler has received every dm-write (or dm-read) with a smaller timestamp that it will ever receive. Ww synchronization is accomplished as follows:
1. Let W be a dm-write(x). If ts(W) > min-W-ts(TM) for any TM in the system, W is buffered; else it is output.
2. When W is buffered or output, this may increase min-W-ts(TM); buffered dm-writes are retested accordingly.
The effect is that the scheduler waits until it has a buffered dm-write from every TM and then outputs the dm-write with the smallest timestamp. Two-phase commit need not be tightly integrated into conservative T/O, because dm-writes are never rejected. Although prewrites must be issued for all data items updated, the conservative T/O schedulers do not process these operations.

5.8 Certifier
In the certification approach, dm-reads and prewrites are processed by DMs first-come/first-served, with no synchronization whatsoever. DMs do maintain summary information about rw and ww conflicts, which they update every time an operation is processed. However, dm-reads and prewrites are never blocked or rejected on the basis of the discovery of such a conflict. Synchronization occurs when a transaction attempts to terminate. When a transaction T issues its END, the DBMS decides whether or not to certify, and thereby commit, T. To understand how this decision is made, we must distinguish between "total" and "committed" executions. A total execution of transactions includes the execution of all operations processed by the system up to a particular moment. The committed execution is the portion of the total execution that only includes dm-reads and dm-writes processed on behalf of committed transactions. That is, the committed execution is the total execution that would result from aborting all active transactions (and not restarting them). When T issues its END, the system tests whether the committed execution augmented by T's execution is serializable, that is, whether after committing T the resulting committed execution would still be serializable. If so, T is committed; otherwise T is restarted. There are two properties of certification that distinguish it from other approaches.
First, synchronization is accomplished entirely by restarts, never by blocking. Second, the decision to restart or not is made after the transaction has finished executing. No concurrency control method discussed above satisfies both of these properties. A certification concurrency control method must include a summarization algorithm for storing information about dm-reads and prewrites when they are processed, and a certification algorithm for using that information to certify transactions when they terminate. The main problem in the summarization algorithm is avoiding the need to store information about already-certified transactions. The main problem in the certification algorithm is obtaining a consistent copy of the summary information. To do so, the certification algorithm often must perform some synchronization of its own, the cost of which must be included in the cost of the entire method.

5.9 Thomas' Majority Consensus Algorithm
Thomas' algorithm assumes a fully redundant database, with every logical data item stored at every site. Each copy carries the timestamp of the last transaction that wrote into it. Transactions execute in two phases. In the first phase each transaction executes locally at one site, called the transaction's home site. Since the database is fully redundant, any site can serve as the home site for any transaction. The transaction is assigned a unique timestamp when it begins executing. During execution it keeps a record of the timestamp of each data item it reads and, when it executes a write on a data item, processes the write by recording the new value in an update list. Note that each transaction must read a copy of a data item before it writes into it. When the transaction terminates, the system augments the update list with the list of data items read and their timestamps at the time they were read. In addition, the timestamp of the transaction itself is added to the update list. This completes the first phase of execution. In the second phase the update list is sent to every site. Each site (including the site that produced the update list) votes on the update list. Intuitively speaking, a site votes yes on an update list if it can certify the transaction that produced it. After a site votes yes, the update list is said to be pending at that site. To cast the vote, the site sends a message to the transaction's home site, which, when it receives a majority of yes or no votes, informs all sites of the outcome. If a majority voted yes, then all sites are required to commit the update, which is then installed using TWR. If a majority voted no, all sites are told to discard the update, and the transaction is restarted. The rule that determines when a site may vote "yes" on a transaction is pivotal to the correctness of the algorithm. To vote on an update list U, a site compares the timestamp of each data item in the readset of U with the timestamp of that same data item in the site's local database. If any data item has a timestamp in the database different from that in U, the site votes no. Otherwise, the site compares the readset and writeset of U with the readset and writeset of each pending update list at that site, and if there is no rw conflict between U and any of the pending update lists, it votes yes. If there is an rw conflict between U and one of those pending requests, the site votes pass (abstain) if U's timestamp is larger than that of all pending update lists with which it conflicts.
If there is an rw conflict but U's timestamp is smaller than that of the conflicting pending update list, then the site sets U aside on a wait queue and tries again when the conflicting request has either been committed or aborted at that site.

5.10 Ellis' Ring Algorithm
Ellis' algorithm solves the distributed concurrency control problem with the following restrictions:
1. The database must be fully redundant.
2. The communication medium must be a ring, so each site can only communicate with its successor on the ring.
3. Each site-to-site communication link is pipelined.
4. Each site can supervise no more than one active update transaction at a time.
5. To update any copy of the database, a transaction must first obtain a lock on the entire database at all sites.
The effect of restriction 5 is to force all transactions to execute serially; no concurrent processing is ever possible. For this reason alone, the algorithm is fundamentally impractical. To execute, an update transaction migrates around the ring, (essentially) obtaining a lock on the entire database at each site. However, the lock conflict rules are nonstandard. A lock request from a transaction that originated at site A conflicts at site C with a lock held by a transaction that originated from site B if B = C and either A = B or A's priority < B's priority. The daisy-chain communication induced by the ring, combined with this locking rule, produces a deadlock-free algorithm that does not require deadlock detection and never induces restarts. There are several problems with this algorithm in a distributed database environment. First, as mentioned above, it forces transactions to execute serially. Second, it only applies to a fully redundant database. And third, the daisy-chain communication requires that each transaction obtain its lock at one site at a time, which causes communication delay to be (at least) linearly proportional to the number of sites in the system.

This list includes almost all concurrency control algorithms described previously in the literature, plus several new ones. This extreme consolidation of the state of the art is possible in large part because of the framework set up earlier. The focus of [17] is primarily the structure and correctness of synchronization techniques and concurrency control algorithms. A very important issue, namely performance, is left open.

5.11 Modeling Concurrency Control in Distributed Databases
The main performance metrics for concurrency control algorithms are system throughput and transaction response time. Four cost factors influence these metrics: intersite communication, local processing, transaction restarts, and transaction blocking. The impact of each cost factor on system throughput and response time varies from algorithm to algorithm, system to system, and application to application. This impact is not understood in detail, and a comprehensive quantitative analysis of performance is beyond the state of the art. They provide a model of the database, express serializability conditions in that model, describe the algorithms within it, and argue their correctness.

Figure 1: Hierarchical transaction structure

In [18], the concurrency control problem is expressed in terms of three questions:
1. How do the performance characteristics of the various basic algorithm classes compare under alternative assumptions about the nature of the database, the workload, and the computational environment?
2. How does the distributed nature of transactions affect the behavior of the various classes of concurrency control algorithms?
3. How much of a performance penalty must be incurred for synchronization and updates when data is replicated for availability or query performance reasons?
The first of these questions remains unanswered due to shortcomings of past studies that have examined multiple algorithm classes. The most comprehensive of these studies suffer from unrealistic modeling assumptions. The work reported in [18] is the first phase of a study aimed at addressing the questions raised above. Four concurrency control algorithms are examined in this study: two locking algorithms, a timestamp algorithm, and an optimistic algorithm. The algorithms considered span a wide range of characteristics in terms of how conflicts are detected and resolved. They use a hierarchical structure for transactions, shown in Figure 1, and briefly describe the algorithms under investigation. They then provide a model that comprises four main parts, the source, the transaction manager, the resource manager, and the concurrency control component, and they test workload and database parameters on it. Figure 2 shows the model in detail.

Figure 2: The model of the database

The distributed database is modeled as a collection of sites, each comprising these components, as shown in Figure 3. Finally, they present their initial performance results for the four concurrency control algorithms mentioned above under various assumptions about data replication, CPU cost for sending and receiving messages, transaction locality, and sequential versus parallel execution. The simulator used to obtain these results was written in the DeNet simulation language, which allowed them to preserve the modular structure of their model when implementing it. They describe the performance experiments and results following a discussion of the performance metrics of interest and the parameter settings used. They conduct four experiments. In the first experiment, they evaluate the algorithms with respect to replication; the purpose of this experiment is to investigate the performance of the four algorithms as the system load varies, and to see how different levels of data replication impact performance. In the second experiment, they examine the impact of message cost on the performance of the algorithms; the data layout, workload, and transaction execution pattern used here are identical to those of the first experiment. In the third experiment, they consider a situation where a transaction may access non-local data; the data layout and transaction execution pattern used here are the same as in the first and second experiments, and all of the files needed by a given transaction still reside on a single site.

Figure 3: The database in the methodology of [18]

The purpose of the fourth experiment is to investigate performance under a parallel transaction execution pattern. In this case, the data layout is different and a bit more complex. In [21], Carey and Livny describe a distributed DBMS model, an extension of their centralized model. Different simulation parameters are examined through simulations. Several papers about concurrency control have also been written by Thomasian et al. [19, 20]. The most important difference between the object-oriented approach and the earlier approaches is that it focuses on data-shipping page-server OODBs, while the earlier work was done in the context of query-shipping relational database systems.
Also, inter-operation and inter-transaction times are expected to be much smaller in this kind of system.

5.12 Simulating Concurrency Control in Distributed Databases
In [22], Norvag, Sandsta and Bratbergsengen provide a simulator for simulating a distributed database. While distributed relational database systems usually use query shipping, data shipping is most common in object-oriented database systems; that is, instead of sending the queries to the data, data is sent to the queries. The most popular data granularity is pages. This is the easiest to implement, the most common in today's object-oriented DBMSs, and also the granularity that gives the best performance. In [22], they give a brief view of DBsim's architecture. In addition to simulating and comparing schedulers, one of the main goals in the development of the DBsim simulator was that it should be useful as a framework for the simulation of schedulers and easy to extend with new schedulers. The DBsim architecture is object oriented; all the major components are implemented as classes in C++. The program consists of a collection of cooperating objects, the most important of which are the event controller (the main loop), the transaction manager (TM), the scheduler, the data manager (DM), and the bookkeeper. Extending the simulator with new scheduler strategies is easy; e.g., if someone wants to test a new scheduler, it can be implemented as a subtype of the generic base scheduler class defined in DBsim. The generic scheduler defines the necessary methods to cooperate with the transaction and data managers. The simulation is event driven, and each transaction can be thought of as a thread. In the main loop an event from the event queue is picked and executed. Events in the queue consist of an event type and the time for the event to be executed. If the event is a TM event, the TM is called, and if the event is a DM event, the DM is called. Possible reasons for events are a transaction requesting an operation, or the data manager finishing a read or write operation on disk. They introduce three new concepts into the distributed simulation model:
1. The number of sites over which the database is distributed.
2. The locality of data.
3. The type of network.
The main architecture of their simulator can be seen in Figure 4, which shows the simulator as a collection of cooperating objects. For each simulated node there is one data manager object and one scheduler object.

Figure 4: DBsim architecture

In their model they have only one global transaction manager object. This object is responsible for creating transactions and issuing operations to the underlying scheduler and data manager objects. A bookkeeper object is used to collect statistics about the ongoing activities. They evaluated many parameters, listed below:
Number of Nodes: The number of nodes in the simulated system. These nodes are connected by a network.
Number of Transactions: The number of transactions to complete after the initial warm-up phase. The reason for having a warm-up phase is to make sure that the start-up phase of the simulated system is not taken into account when they start collecting statistics about the simulation. In their simulations they used a warm-up phase consisting of 2000 transactions before they start collecting data.
Size of Address Space: The number of elements (pages) in the address space. In their simulations they set the address space to 20000 elements.
Data Declustering: The distribution of data elements to nodes, i.e., the percentage of the database elements located at a particular node.
Data Locality: The probability that a transaction accesses a data element located on the same node as the transaction.
Non-Local Data Access: When a transaction accesses a data element not located at its home node, this is the probability that the remote access goes to a particular node.
Hot Spot Probability: The probability that an operation addresses the hot-spot part of the address space. In their model, the hot-spot area is the first 10% of the address space at each node.
Multiprogramming Level: The number of concurrently executing transactions.
Transaction Distribution: The probability that a new transaction is started on a particular node.
Short Transaction Probability: The probability of a transaction being short; the remaining transactions are long.
Abort Probability: The probability of a transaction requesting an abort before committing; the same for both long and short transactions.
Read Probability: The probability of a data access operation being a read. In their simulations they use a value of 80%, which gives a write probability of 20%.
Burst Probability: The probability of a transaction asking for operations in a burst; the time between operations in a burst is shorter than normal. The lengths of short and long transactions, the time between transactions, the time between operation requests from a transaction, and the number of operations in a burst are drawn from uniform distributions with parameters as shown in Table 1.
Restart Delay: With a timestamp ordering scheduler, transactions may be aborted because of conflicts. If a transaction is restarted immediately, the probability of the same conflict recurring is quite high; to avoid this, the restart is delayed by a value drawn from a uniform distribution and multiplied by the number of retries.
Network: The kind of network to simulate: cluster, LAN, or WAN.
Scheduler Type: The scheduler used for concurrency control; in the current version this is either two-phase locking or timestamp ordering.
Disk Operation Time: The time taken for one disk access.
With these parameters, they evaluate factors such as throughput, number of nodes, multiprogramming level, response time, abort frequency, and different data placements. Their results can be summarized as follows: with a mix of long and short transactions, the TO scheduler has a higher throughput than the 2PL scheduler, whereas with only short transactions the two schedulers perform almost identically. The TO scheduler has much higher abort probabilities than the 2PL scheduler. The 2PL scheduler favors long transactions, and the number of long transactions that manage to finish successfully is much higher under 2PL. The network is not the bottleneck for a reasonable load; only under heavy load with a slow network does it severely affect performance.
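As promised above, here is a minimal illustration of the event-driven main loop, with one global transaction manager and per-node scheduler and data manager objects; the names and structure are our own simplification, not the DBsim code.

# Sketch of a DBsim-style event-driven simulator loop (illustrative only).
import heapq

class Event:
    def __init__(self, time, kind, payload):
        self.time, self.kind, self.payload = time, kind, payload
    def __lt__(self, other):
        return self.time < other.time

class StubManager:
    def handle(self, event, sim):
        print(event.kind, 'event handled at time', event.time)

class Simulator:
    def __init__(self, tm, schedulers, data_managers):
        self.queue = []                 # event queue ordered by event time
        self.tm = tm                    # one global transaction manager
        self.schedulers = schedulers    # one scheduler object per node
        self.dms = data_managers        # one data manager object per node

    def post(self, event):
        heapq.heappush(self.queue, event)

    def run(self, end_time):
        while self.queue:
            event = heapq.heappop(self.queue)
            if event.time > end_time:
                break
            if event.kind == 'TM':      # e.g. a transaction requests an operation
                self.tm.handle(event, self)
            elif event.kind == 'DM':    # e.g. a disk read or write has finished
                self.dms[event.payload['node']].handle(event, self)

sim = Simulator(StubManager(), {0: StubManager()}, {0: StubManager()})
sim.post(Event(1.0, 'TM', {'node': 0}))
sim.post(Event(2.0, 'DM', {'node': 0}))
sim.run(end_time=10.0)

A new concurrency control strategy would be plugged in by giving each node's scheduler object a different subclass, mirroring the extension mechanism DBsim provides through its generic scheduler class.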
6. Availability:
Availability is one of the most important aspects in designing a distributed database. To provide high availability for services such as mail or bulletin boards, data must be replicated.
6.1 Providing Availability using Lazy Operations:
The availability of data in distributed databases can be increased by replication. If the data is replicated on several sites, it may still be available after site failures. However, implementing an object with several copies residing on different sites may introduce inconsistencies between copies of the same object. To be consistent, a system should be one-copy equivalent; that is, it should behave, as far as the user can tell, as if each object had only one copy. A replica control protocol is one that ensures that the database is one-copy equivalent. In [25], El Abbadi and Toueg consider database systems that are prone to both site and link failures: sites may fail by crashing or by failing to send or receive messages, and links may fail by crashing, delaying, or failing to deliver messages. Several replica control protocols have been proposed that tolerate different types of failures. In [25] they present a replica control protocol that allows data to be accessed even when the database is partitioned; it can be combined with any available concurrency control protocol to ensure the correctness of the database. They describe a formal database model and their correctness criteria, propose a replica control protocol, prove it correct, and finally present several optimizations to the protocol. Their replica control protocol assumes two types of transactions: user transactions, issued by the users of the database, and update transactions, issued by the protocol itself. All transactions are assumed to follow a conflict-preserving concurrency control protocol, for example two-phase locking. Such a protocol ensures that logs are CP-serializable only at the level of copies (but not at the object level). They present a replica control protocol that ensures that all logs are one-copy serializable, and hence that transactions are serializable at the object level. A user transaction t that is initiated at a site with view v is said to execute in v. Informally, view v determines which objects t can read and write, as well as which copies it can access. Views are totally ordered according to a unique view-id assigned to each view, and two sites are said to have the same view if they have identical view-ids. Their protocol ensures one-copy serializability by (1) ensuring that all transactions executed in one view are one-copy serializable, and (2) ensuring that all transactions executing in a "lower" view are serialized before transactions executing in a "higher" view. Satisfying conditions (1) and (2) enforces a serialization of all transactions executing in all views. With each object x, read and write accessibility thresholds Ar[x] and Aw[x] are associated. An object x is read (write) accessible in a view only if at least Ar[x] (Aw[x]) copies of x reside on sites in that view. The accessibility thresholds must satisfy
Ar[x] + Aw[x] > n[x],
where n[x] is the total number of copies of x. This relationship ensures that a set of copies of x of size Aw[x] has at least one copy in common with any set of copies of x of size Ar[x]. In each view v, every object x is assigned a read quorum qr[x, v] and a write quorum qw[x, v]; these specify how many physical copies must be accessed to read and write x in view v. Let n[x, v] be the number of copies of x that reside on sites in view v (formally, n[x, v] = |sites[x] ∩ v|).
For each view v, the quorums of object x must satisfy the following relations:
qr[x, v] + qw[x, v] > n[x, v]
2 qw[x, v] > n[x, v]
Ar[x] + qw[x, v] > n[x, v]
qr[x, v] ≤ n[x, v]
These relations ensure that, in a view v, a set of copies of x of size qw[x, v] has at least one copy in common with any set of copies of x of size qr[x, v], qw[x, v], and Ar[x]. Read operations use the version numbers associated with each copy to identify (and read) the most up-to-date copy accessed. In their protocol, version numbers consist of two fields (v-id, k): if a copy has version number (v-id, k), then it was last written by a transaction t executing in a view v with view-id v-id, and t is the kth transaction to write x in view v. A version number (v1-id, k1) is less than (v2-id, k2) if v1-id < v2-id, or v1-id = v2-id and k1 < k2. Initially, all sites have a common view v0 with view-id v0-id, and all copies have version number (v0-id, 0). A user transaction t executing in view v can read (write) an object x only if x is read (write) accessible in view v. (Note that a site can determine whether an object is read or write accessible from its local view only, i.e., without accessing any copies.) Furthermore, t can only read or write copies of x that reside on sites with view v (this restriction is relaxed in Section 5.1 of [25]). If object x is read accessible in view v, t executes the logical operation r[x] by 1. physically accessing qr[x, v] copies of x residing on sites in v (with view v), 2. determining vnmax, the maximum version number of the accessed copies, and 3. reading the accessed copy with version number vnmax. If object x is write accessible in view v with view-id v-id, t executes the logical operation w[x] by 1. selecting qw[x, v] copies of x residing on sites in v (with view v), 2. determining vnmax, the maximum version number of the selected copies, and 3. writing all the selected copies and updating their version numbers to (v-id, l), where l >= 1 is the smallest integer such that (v-id, l) is greater than vnmax. If a user transaction tries to access a copy that resides on a site with a view different from the view of the site where the transaction was initiated, that transaction is aborted. The first two quorum relations ensure that all logically conflicting operations issued by user transactions executing in the same view also physically conflict. Furthermore, since all transactions use version numbers and a conflict-preserving concurrency control protocol, one can show that all transactions executing in the same view are one-copy serializable. There are trade-offs between their protocol and the quorum consensus protocol in terms of costs. The quorum consensus protocol is designed for multi-version databases; it must maintain a quorum assignment table and ensure that this table always satisfies the quorum intersection invariant. This overhead allows transactions to run at increasingly higher levels (by a process called inflation) without incurring update costs. However, to satisfy the quorum intersection invariant, read quorums must increase monotonically with the level number, which makes read operations more expensive at higher levels.
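The following small check (our own example, with arbitrarily chosen values; not code from [25], and relying on the reconstruction of the quorum relations given above) verifies that a threshold and quorum assignment satisfies the intersection requirements, and shows how a logical read picks the copy with the highest version number.

# Illustrative check of the accessibility-threshold and view-quorum constraints.
# Example values are ours, not from [25].

def thresholds_ok(n_x, Ar, Aw):
    # Any Aw-sized set of copies must intersect any Ar-sized set of copies.
    return Ar + Aw > n_x

def quorums_ok(n_xv, Ar, qr, qw):
    # In a view with n_xv copies of x: a write quorum must intersect read
    # quorums, other write quorums, and any Ar-sized set of copies.
    return (qr + qw > n_xv and
            2 * qw > n_xv and
            Ar + qw > n_xv and
            qr <= n_xv)

def read_value(accessed_copies):
    # A logical read returns the value of the accessed copy whose version
    # number (v-id, k) is largest.
    return max(accessed_copies, key=lambda c: c['version'])['value']

# Example: 5 copies in total, Ar = 2, Aw = 4; the current view holds 4 copies.
assert thresholds_ok(5, Ar=2, Aw=4)
assert quorums_ok(4, Ar=2, qr=2, qw=3)
accessed = [{'version': (1, 2), 'value': 'new'},      # qr = 2 copies accessed
            {'version': (0, 3), 'value': 'old'}]
print(read_value(accessed))   # prints 'new'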
One way to guarantee the consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. For some applications, a weaker causal operation order can preserve consistency while providing better performance. In [23], Ladin et al. describe a new way of implementing causal operations. Lazy replication is intended for an environment in which individual computers, or nodes, are connected by a communication network. Both the nodes and the network may fail, but failures are assumed not to be Byzantine: the nodes are fail-stop processors, the network can partition, and messages can be lost, delayed, duplicated, and delivered out of order. The configuration of the system can change; nodes can leave and join the network at any time. Nodes are assumed to have loosely synchronized clocks; practical protocols such as NTP synchronize clocks in geographically distributed networks at low cost. A replicated application is implemented by a service consisting of replicas running at different nodes of a network. To hide replication from clients, the system also provides front-end code that runs at client nodes. To call an operation, a client makes a local call to the front end, which sends a call message to one of the replicas; the replica executes the requested operation and sends a reply message back to the front end. Replicas communicate new information (e.g., about updates) among themselves by lazily exchanging gossip messages. There are two kinds of operations: update operations modify but do not observe the application state, while query operations observe the state but do not modify it. (Operations that both update and observe the state can be treated as an update followed by a query.) When asked to perform an update, the front end returns to the client immediately and communicates with the service in the background; to perform a query, the front end waits for the response from a replica and then returns the result to the client. They prove the correctness of their solution. Many solutions use laziness to increase availability, and there is a trade-off between consistency and availability: increasing replication in different ways improves availability but makes concurrency control harder.
6.2 Providing Availability using Agent-Based Design:
Using heuristics to design intelligent agents that increase availability is common. It is important for each agent in such a system to possess its own DBMS, which is autonomous in all respects; depending on a distributed DBMS to gain access to data on another agent's database would reduce the autonomy of the agents. The system being modelled is essentially a multi-agent environment in which each agent has local data that is unique to its environment and cannot be found on another agent. This also means that there is no replication of data, unlike in traditional synchronous or asynchronous distributed databases. Data is partitioned vertically in this model, i.e., columns (parts of a table) are placed on agents based on relevance. For example, a relation for a machine can be vertically partitioned so that the columns used primarily by the manufacturing department are placed on its computer, while the rest of the columns go to the engineering department's computer. In a multi-user database, multiple statements inside different transactions could attempt to update the same data, which could make the data inconsistent; this is undesirable, and commercial systems use rather complex techniques to handle concurrent transactions. Another issue is determining the location of remote data.
Agents are responsible for obtaining information about the data distribution and for planning how to acquire locks on data and make it available.
6.3. Availability through Improved Fault Tolerance:
In this approach, given an ordered set of nodes, one can usually devise a rule that unambiguously imposes a desired logical structure on this set. The read and write operations can then rely on this rule, rather than on knowledge of a statically structured network, to determine which replica sets constitute quorums. If, in addition, at any time all operations can agree on the set of replicas from which the quorums are drawn, then the protocol can dynamically adjust this set to reflect detected failures and repairs while still guaranteeing consistency. The protocol assumes that each node is assigned a name and that all names are linearly ordered [14]. Among all nodes replicating the data item, a set of nodes is identified as the current epoch. At any time, the data item may have only one current epoch associated with it; originally, all replicas of the data item form the current epoch. The system periodically runs a special operation, epoch checking, that polls all replicas of the data item. If any members of the current epoch are not accessible (failures detected), or any replicas outside the current epoch have been successfully contacted (repairs detected), an attempt is made to form a new epoch. (Epochs are distinguished by epoch numbers, with later epochs assigned greater numbers.) For this attempt to be successful, the new epoch must contain a write quorum of the previous epoch, and the list of the new epoch members (the epoch list), along with the new epoch number, must be recorded on every member of the new epoch. Then, due to the intersection property of the quorums, it can be guaranteed that if the network partitions, the attempt to form a new epoch will succeed in at most one partition, and hence the uniqueness of the current epoch is preserved. For the same reason, any successful read or write operation must contact at least one member of the current epoch and therefore obtains the current epoch list. Hence, the operation can reconstruct the logical structure of the current epoch and use it to identify read or write quorums. As with dynamic voting, the system remains available as long as some small number of nodes (the number depends on the specific protocol) is up and connected.
7. Query Evaluation:
In [26], Yu and Chang describe a notation for queries and use it to show query evaluation plans. They point out that the performance of a distributed query-processing algorithm depends to a significant extent on the estimation algorithm used to evaluate the expected sizes of intermediate relations, so the choice of a reasonable estimation algorithm is extremely important. They categorize queries into two main classes, tree queries and cyclic queries; optimal strategies exist for simple queries and tree queries, and they present heuristics for obtaining strategies for general queries.
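To illustrate why size estimation matters, the sketch below uses the common textbook estimate for join cardinality, |R ⋈ S| ≈ |R|·|S| / max(V(A,R), V(A,S)), where V(A,R) is the number of distinct values of the join attribute A in R; this is our own example, not the estimation algorithm of [26]. A planner can compare such an estimate against the operand sizes to decide what to ship over the network.

# Rough intermediate-result size estimation for a distributed join plan.
# Uses the common textbook estimate; this is not the algorithm of [26].

def estimate_join_size(card_r, card_s, distinct_a_in_r, distinct_a_in_s):
    # |R join S| ~= |R| * |S| / max(V(A,R), V(A,S))
    return (card_r * card_s) // max(distinct_a_in_r, distinct_a_in_s)

def ship_result_or_operands(card_r, card_s, join_est):
    # If the estimated result is smaller than shipping both operands to the
    # query site, it is cheaper to join remotely and ship only the result.
    return 'join remotely, ship result' if join_est < card_r + card_s else 'ship operands'

est = estimate_join_size(card_r=10000, card_s=2000,
                         distinct_a_in_r=500, distinct_a_in_s=400)
print(est, ship_result_or_operands(10000, 2000, est))
# 40000 'ship operands' -- the estimated result is larger than both inputs together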
7.1. Query Broker:
In [27], Vu and Collet present their work on supporting flexible query evaluation over large, distributed, heterogeneous, and autonomous sources. Flexibility means that the query evaluation process can be configured according to the application context and resource constraints, and can also interact with its execution environment. Their query evaluation is based on Query Brokers as basic units, which allow query processing to interact with its environment. Queries are evaluated under query broker contexts that define the constraints of the evaluation task, and Query Brokers behave dynamically with respect to their execution environment. The paper focuses on the definition and role of query brokers in the query evaluation task in large-scale mediation systems, and also shows how query brokers ensure the flexibility of this task. Generally speaking, in mediation systems queries are formulated over a global schema, also called the mediation schema. These queries, called global queries, are then rewritten on local schemas, i.e., the schemas of the component sources, and decomposed into sub-queries, called local queries. The sub-queries are evaluated by the appropriate sources, and mediators assemble the intermediate results. Query Brokers are proposed as the basic units for evaluating queries: the query evaluation process can be viewed as a set of query brokers, each of them ensuring the evaluation of a sub-query. The execution context is specified through Query Brokers in order to take specific requirements into consideration while processing queries and to respect the available system resources. This provides a means to adapt the query evaluation process to its context. Moreover, the adaptivity of the query evaluation task is enabled by the interaction of a Query Broker with its execution environment, i.e., users and execution circumstances, during the evaluation phase.
Figure 5. Hierarchical Mediation Architecture
They assume a mediator-based system, i.e., a set of wrapped sources and a mediator whose task is to respond to queries formulated on its global schema by using the underlying sources. Mediators can be organized hierarchically, as shown in Figure 5; following this approach, a mediator is built on other mediators and/or wrappers, and many mediators can work together to respond to different queries. This approach is suitable for building large-scale systems where mediators can be distributed over the network; however, communication between mediators must be taken into consideration while processing queries. As a result, query processing is distributed through the mediator hierarchy. This hypothesis generalizes the mediation architecture, and also the query processing architecture. As mentioned previously, the static optimization approach is not suitable for processing queries in distributed and scalable mediation systems because of the lack of statistics and the unpredictability of the execution environment. They consider a query-processing architecture (cf. Figure 6) with the following phases: 1. a parsing phase (parser), which syntactically and semantically analyses queries and is similar to that of traditional query processing; 2. a rewriting phase (rewriter), which translates the global query into local queries on the source schemas and depends on the way mappings between schemas are defined; 3. a preparing phase (preparation), which generates a query evaluation plan (QEP), whose form is described below; 4. an evaluation phase (evaluator), which communicates with the other components, i.e., mediators or wrappers, to evaluate the sub-queries. (A small sketch of this four-phase pipeline is given after this passage.)
Figure 6. Query Processing Architecture
Figure 7 presents Query Brokers (QBs), which wrap one or several query operators; in other words, a QB corresponds to a sub-query and is the basic unit of query evaluation.
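Here is a minimal sketch of such a parse/rewrite/prepare/evaluate pipeline; the function names and the shape of the plan are our own assumptions for illustration, not the implementation of [27].

# Illustrative four-phase mediation pipeline (parser, rewriter, preparation,
# evaluator). Names and data structures are assumptions, not code from [27].

def parse(global_query):
    # Syntactic/semantic analysis; here we simply keep the query text.
    return {'text': global_query}

def rewrite(parsed, mappings):
    # Translate the global query into local queries using schema mappings.
    return [{'source': src, 'query': rule(parsed['text'])}
            for src, rule in mappings.items()]

def prepare(local_queries):
    # Build a query evaluation plan: one broker per sub-query plus a merge step.
    return {'brokers': local_queries, 'combine': 'union'}

def evaluate(plan, sources):
    # Ask each component (mediator or wrapper) for its part and combine results.
    partial = [sources[b['source']](b['query']) for b in plan['brokers']]
    return [row for part in partial for row in part]

# Tiny usage example with two fake wrapped sources.
mappings = {'s1': lambda q: q + ' @s1', 's2': lambda q: q + ' @s2'}
sources = {'s1': lambda q: [('s1', q)], 's2': lambda q: [('s2', q)]}
plan = prepare(rewrite(parse('SELECT * FROM machines'), mappings))
print(evaluate(plan, sources))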
As a result, a QEP is represented as a graph of QBs such as the one in Figure 7. The hierarchy of QBs fits the hierarchical mediation architecture presented in Figure 5: each mediator corresponds to one or several query brokers processing sub-queries. A Query Broker is defined by: 1. a context, which determines the constraints for executing the query and the meta-information driving the evaluation tasks, i.e., optimization, execution of the sub-query wrapped by this broker, and communication with other QBs; examples of constraints are limits on execution time, acceptance of partial results, limits on economic access cost, etc.; 2. operator(s), which determine how the QB processes data; operators can be built-in, i.e., pre-defined operators such as algebraic operators (e.g., selection, projection, join, union), communication operators (e.g., send, receive), or user-defined (external) operators; 3. buffer(s), which separate the data streams between two QBs; buffers can have different forms, from simple buffers synchronizing two QBs operating at different speeds to more or less complicated caches materializing intermediate results (buffer management is discussed in more detail in [27]); 4. rules (E-C-A), which define the behavior of the QB towards changes in the execution environment, e.g., delayed data arrival, inaccessible data, or query refinement. Using rules, QBs can change evaluation strategies, e.g., re-schedule or re-optimize sub-queries, change the implementation of certain operators such as join, and react to query refinements during the execution phase. Figure 8 gives an overview of the functional architecture of a QB; the main modules are a Buffer Manager, a Context Manager, a Rule Manager, an Evaluator, and a Monitor. The QB context consists of a set of parameters in four categories: parameters related to user requirements, e.g., a limit on execution time (timeout), the type of partial result (partial-result), a limit on the economic cost of processing queries (cost), and data of interest (preference); availability of resources, e.g., memory-size, CPU-time, etc.; meta-information, e.g., arrive-data-rate, source-access, etc.; and other query variables that are specified during query execution. To achieve a flexible query evaluation framework, they adopt a rule-based approach for defining QB behavior: E-C-A rules allow Query Broker behavior to be specified with respect to execution circumstances, and techniques for re-scheduling and re-optimizing queries can be integrated into QBs as rules.
Figure 7. Interconnected Query Brokers
In [28], Evrendilek and Dogac provide a method for optimizing a query over a distributed database. They assume that three steps are necessary to process a global query: first, the global query is decomposed into sub-queries such that the data needed by each sub-query are available from one local database; next, each sub-query is translated into a query or queries of the corresponding local database system and sent to that system for execution; third, the results returned by the sub-queries are combined into the answer. They consider the optimization of query decomposition in the presence of data replication and the optimization of inter-site joins, that is, the joins of the results returned by the sub-queries.
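The sketch below illustrates this three-step processing of a global query in toy form (decompose by site, execute locally, combine with an inter-site join); it is not the optimization algorithm of [28], and all names and data are our own.

# Toy illustration of three-step global query processing: decompose per site,
# execute each sub-query locally, then combine the partial results with an
# inter-site join. Not the algorithm of [28].

def decompose(global_query, catalog):
    # catalog maps each relation to the site that stores it.
    return {catalog[rel]: rel for rel in global_query['relations']}

def execute_locally(site_db, relation, predicate):
    # Each site evaluates its sub-query on its own data.
    return [row for row in site_db[relation] if predicate(row)]

def inter_site_join(left, right, key):
    # Combine the sub-query results shipped back from the sites.
    index = {row[key]: row for row in right}
    return [dict(l, **index[l[key]]) for l in left if l[key] in index]

# Example: EMP is stored at site A, DEPT at site B.
catalog = {'EMP': 'A', 'DEPT': 'B'}
sites = {'A': {'EMP':  [{'dno': 1, 'name': 'alice'}, {'dno': 2, 'name': 'bob'}]},
         'B': {'DEPT': [{'dno': 1, 'dept': 'db-lab'}]}}
sub_queries = decompose({'relations': ['EMP', 'DEPT']}, catalog)
results = {rel: execute_locally(sites[site], rel, lambda r: True)
           for site, rel in sub_queries.items()}
print(inter_site_join(results['EMP'], results['DEPT'], key='dno'))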
The optimization algorithm presented for inter-site joins can easily be generalized to any operation required for inter-site query processing. The algorithm presented in [28] is distributed and takes the federated nature of the problem into account. Nowadays, P2P systems are becoming more common. In such systems there is no global knowledge: neither a global schema nor information about data distribution or indexes. The only information a participating peer has is information about its neighbors, i.e., which peers are reachable and which data they provide. The suitability of this approach has already been demonstrated by the success of well-known file-sharing systems like Napster or Gnutella. Furthermore, in such systems one cannot assume the existence of a global schema, not even as the sum of the schemas of the neighboring peers, because adding a new peer could trigger schema modifications for all other peers of the system. A schema-based P2P system has several obvious advantages. A main advantage is that adding a new source (peer) is simplified, because it only requires defining correspondences to one peer that is already part of the system; through this neighbor the new peer becomes accessible from all other peers. Of course, such advantages are not free.
7.2. Incomplete Schema and P2P Query Evaluation:
In [29], Karnstedt, Hose and Sattler address this problem by investigating different strategies for processing queries on incomplete schemas. Their contribution is (1) dealing with the problem of incomplete schema information and (2) a detailed comparison of different query processing strategies in this context. They describe a system based on XML data. Having no complete schema information, they must deal with two issues: first, expressing correspondences between two schemas, and second, formulating queries without complete schema information. In distributed systems, the general question is whether to execute the query at the initiator's side or at the peers that store the relevant data. In the first case, the data is moved to the initiator and all operations are executed there; this is called data shipping. The second approach is called query shipping, because in this case the query is moved to the data and only the part of the data satisfying the query is shipped to the requestor for further processing. This strategy reduces the amount of data moved through the network, because only the necessary data that a queried peer cannot process itself is sent to other peers. Query and data shipping are the two general approaches to distributed query processing, but neither is the best policy in all situations; other techniques that try to combine the advantages of both have been developed, an example being hybrid shipping. In the query shipping approach, the first step is to decompose the query into sub-queries according to the known peers and their querying capabilities, so that each peer receives the part of the query that it (or the peers connected to it) is expected to support. After decomposition, the peer computes the corresponding result, or forwards the query to other peers if it does not provide all the queried data. Another technique that has evolved is called Mutant Query Plans: an execution plan constructed from the original query is sent as a whole to other peers, and each peer decides by itself whether it can deliver any data.
If it can, it writes the data into the plan, replacing the corresponding part of the query. Using such mutating plans also provides the opportunity to optimize (parts of) the plan in a decentralized way. The query shipping technique they implement is a variant of mutating query plans: the query plan is shipped to the connected peers in parallel, and each peer inserts the data it can provide. Besides the general approaches based on flooding, additional approaches using global knowledge (all data at all peers is known to each peer) have been tested in order to outline the benefits of query shipping even more; the difference is that the flooding-based approach generates more control messages than the global-knowledge approach. The approaches mentioned above, suitable for distributed systems, are not suitable for real P2P systems without modification, since one cannot assume that all defined correspondences are known to each peer. The general query processing techniques must be modified so that query transformation and data collection are accomplished step by step, with each peer responsible for querying its local neighbors using only the locally defined mappings. If the processed query is formulated in a schema unknown to the processing peer, the simplest way to process it is to query all neighbors, which is called flooding. This strategy is applicable even if no correspondences are defined, because by sending the query to each peer in the network (up to a certain horizon) it will eventually reach the peers that know the schema used. They also provide methods to route a query in the network despite the restrictions encountered in P2P systems. The routing problem is to decide which of the known peers is most suitable for answering the query; a strategy adapted to their needs has to use the partial information that is available. A possible approach is to use the defined correspondences for routing, since the defined mappings give each peer information about what data is stored at its neighbors. To reduce the number of messages, routing indexes can be quite useful: a routing index is a data structure that allows queries to be routed only to peers that may store the queried data, which requires that the data stored at each peer somehow be associated with data identifiers. Routing indexes may also be used to generate a list of priorities for querying a peer's neighbors.
7.3. Dynamic Query Evaluation:
In [31], Jim and Suciu provide a method that evaluates queries despite topology changes. They propose a new paradigm for distributed query evaluation on the Web, presented in terms of a simple query language called dynamically distributed datalog (d3log). The simple addition of dynamic site discovery changes the character of query evaluation and motivates novel evaluation techniques. The main paradigm change is the introduction of an intensional answer, in which a server responds to a query not with a table but with a set of rules. They develop a framework for studying possible query evaluation techniques in the presence of dynamic site discovery. They started by studying query evaluation with site discovery for a non-recursive query language similar to SQL, but soon discovered that query evaluation in this new setting is deeply and intimately connected to recursion, for three reasons.
First, recursion may be impossible to prohibit: the Web is not under centralized control, and one cannot constrain what links other sites may have. Second, it is difficult to detect: sites are unknown in advance, so recursion must be detected dynamically. Finally, there are applications, such as security infrastructures, where recursion arises naturally and inevitably.
8. Heterogeneous Distributed Databases:
Heterogeneous distributed databases link the topics of distribution to legacy systems and to the design of heterogeneous systems. Nowadays, using mediators to integrate various types of systems is very common. In [24], Stephen and Huhns describe a technique for creating mediators: information system components that operate among distributed applications and distributed databases. The databases are distributed logically and physically, typically residing on different platforms and having different semantics. A mediator automates the process of merging widely differing databases into a seamless, unified whole presented to the user. The system's underlying strength is its homogeneity: all components are modeled as Java agents that communicate with each other through a common protocol (KQML) and with a common semantics (the ontology). The mediator-based information system consists of a mediator-reasoning core, a universal ontology, wrappers to align the databases semantically with the ontology, and agents that handle connectivity issues; each component is modeled as an agent. The mediator agent operates as a forward-chaining, rule-based system written declaratively. It is multithreaded to support simultaneous queries and interactions. The mediator's intelligence is embedded in a fact and rule base residing in a text file that is easily edited, so by changing only the rule base the mediator can be customized for any application. Mediators mitigate the heterogeneity and distribution of information sources, but they are difficult to construct; the major problems involve semantics and communications. The structure of a system for querying and updating heterogeneous databases is shown in Figure 8. The information flow in the system is as follows. Database and ontology agents communicate their semantics to the mediator agent. A user formulates database commands in terms of the common ontology, and the user's agent sends these commands to the mediator agent. The mediator reasons about schemas and ontologies to determine the relevant databases or other information resources, and communicates with each information resource's "wrapper" agent, which maps terms of the common ontology to the resource's schema. The common ontology of concepts includes both entities and relationships among the entities. Each database is accessed through a wrapper that maps terms of the common ontology to the database schema. The mediator takes the ontology and database schema information and maps the user command to the appropriate database wrappers. The database wrappers are agents that translate the command into the local database schema and return results to the mediator in terms of the common ontology. For query commands, the mediator gathers the results and passes them to the user.
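As a rough illustration of this flow (our own simplification, not the system of [24]), the sketch below shows wrappers translating a common-ontology term into their local schemas and a mediator routing a command to the relevant wrappers and gathering the results.

# Toy sketch of ontology-based mediation. Not the system of [24]; the classes
# and the 'ontology term -> local name' mapping are assumptions.

class Wrapper:
    def __init__(self, name, term_map, local_db):
        self.name = name
        self.term_map = term_map    # ontology term -> local table/column name
        self.local_db = local_db    # here, just rows stored per local table

    def covers(self, term):
        return term in self.term_map

    def query(self, term):
        local_name = self.term_map[term]        # translate to the local schema
        return self.local_db.get(local_name, [])

class Mediator:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, ontology_term):
        # Decide (very naively) which sources are relevant, then gather results.
        relevant = [w for w in self.wrappers if w.covers(ontology_term)]
        results = []
        for w in relevant:
            results.extend(w.query(ontology_term))
        return results

# Example: two databases expose the ontology concept 'Machine' under
# different local names.
w1 = Wrapper('manufacturing', {'Machine': 'mach_tbl'},
             {'mach_tbl': [{'id': 1, 'status': 'running'}]})
w2 = Wrapper('engineering', {'Machine': 'equipment'},
             {'equipment': [{'id': 1, 'spec': 'v2'}]})
print(Mediator([w1, w2]).query('Machine'))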
The mediator is based on a uniform interaction among the components of the information system, with each component modeled as an agent: there are separate agents for each user interface, for the ontology, and for each database wrapper.
Figure 8. Agent-based Mediator Structure
In the agent-based mediator of [24], for simplicity, all inter-agent communication occurs through a special agent called the router. The agent router allows Java applets to exchange messages with any registered agent on the Internet (Netscape's security restrictions prohibit a given agent from communicating with an agent not spawned on the same host). The agent router allows any registered agent to send messages to any other registered agent by making a single socket connection to the router. Messages are forwarded without the sending agent having to know the receiving agent's absolute address or having to make a separate socket connection, as with the usual Agent Name Server (ANS) infrastructure. Like an e-mail server, the agent router buffers all messages so that they are not lost due to transient network problems; if an individual agent goes down or logs out, it may return for its messages at a later time. The mediator contains rules that logically connect the input ontology and database schemas. When these rules fire, a mediator agent is instantiated for the given input of ontology and schemas; the resulting agent has a mapping between the domain ontology and the database schemas, and this mapping is the basis for the query mediation. At run time, an application program submits an SQL command to the interface agent, which forwards the command to the mediator agent. The mediator agent parses the query, reasons about which databases may have relevant information and about any necessary decomposition of the query, and then sends the decomposed query to the resource agents of the relevant databases. These agents accept the SQL commands and, using JDBC, connect to, open, and query the databases.
9. Conclusion:
A lot of theoretical work has been done on the design of distributed database systems and transaction processing. Many concepts, models, and algorithms have been developed for concurrency control, crash recovery, communication, query optimization, nested transactions, site failures, network partitions, and replication. Simultaneously, a major effort has been put into implementing these ideas in experimental and commercial systems, covering a wide spectrum of distributed database systems, distributed operating systems, and programming languages. The implementers have discarded many ideas while adopting others, such as two-phase locking, two-phase commit, and remote procedure calls. They have built workable systems but still wrestle with the problems of reliability and replication. The true benefits of distributed database systems, such as high availability and good response time, appear to be within reach. In this paper, we provided a general view of recent work in distributed database systems and a survey of related topics. Many new approaches are being used to improve distributed database systems. Using intelligent mediators is very common in managing heterogeneous databases, and an agent-based tool is well suited to the problem of providing inference prevention capability in a distributed database system. There are many advantages to using agents as DDBMS components; some of them are listed below:
1. Since the agents work in parallel and are local to the databases, the performance benefit of distribution is not lost; there is no bottleneck through which all queries must pass.
2. Similarly, the survivability benefit of distribution is not lost; the potential single point of failure represented by a centralized Rational Downgrader is avoided.
3. The compartmentalization provided by a distributed scheme is preserved: databases can prevent the inference of sensitive data in other databases without knowing exactly what the nature of that data is.
4. Interoperability is ensured: heterogeneous databases can participate in the inference prevention effort as long as they are compliant with the SQL standard.
5. A separation of concerns is maintained: changes to the inference prevention scheme do not require changes to the database management systems, and vice versa.
New kinds of distributed databases, such as P2P systems, raise new requirements for query evaluation, concurrency control algorithms, and so on. The Web, as a distributed system that can be viewed as a data source, requires its own line of research related to distributed database systems. In this paper, we considered some recent approaches to designing distributed databases, such as agent-based approaches.
10. References
[1] D. Pinto and G. Torres, "On Dynamic Fragmentation of Distributed Databases Using Partial Fragmentation Replication", Facultad de Ciencias de la Computación, Benemérita Universidad Autónoma de Puebla, 2002.
[2] S. Voulgaris, M. V. Steen, A. Baggio and G. Ballintjn, "Transparent Data Relocation in Highly Available Distributed Systems", Studia Informatica Universalis, 2002.
[3] A. Zhang, M. Nodine, B. Bhargava and O. Bukhres, "Ensuring Relaxed Atomicity for Flexible Transactions in Multi-database Systems", ACM SIGMOD Conference, 1994.
[4] A. Zhang, M. Nodine and B. Bhargava, "Global Scheduling for Flexible Transactions in Heterogeneous Distributed Database Systems".
[5] J. H. Haritsa, K. Ramamritham and R. Gupta, "The PROMPT Real-Time Commit Protocol", IEEE, 1999.
[6] P. Saha, "One-Phase Real-Time Commit Protocols", Master of Engineering thesis in Computer Science and Engineering, Indian Institute of Science, 1999.
[7] N. Nouali, A. Doucet and H. Drias, "A Two-Phase Commit Protocol for Mobile Wireless Environment", 6th Australasian Database Conference (ADC 2005), Australian Computer Society, 2005.
[8] A. Silberschatz, H. F. Korth and S. Sudarshan, "Database System Concepts", 4th edition, McGraw-Hill, 2002.
[9] V. Goenka, "Peterson's Algorithm in a Multi Agent Database System", Knowledge Representation and Reasoning Journal, July 2003.
[10] R. J. Peris, M. P. Martynez and S. Arevalo, "An Algorithm for Deterministic Scheduling for Transactional Multithreaded Replicas and its Correctness", IEEE SRDS, 2000.
[11] R. J. Peris and M. P. Martynez, "Deterministic Scheduling and Online Recovery for Replicated Multithreaded Transactional Servers", Workshop on Dependable Middleware-Based Systems, 2002.
[12] B. K. S. Khoo and S. Chandramouli, "An Agent-Based Approach to Collaborative Schema Design", Workshop on Agent Based Information Systems at the Autonomous Agents Conference, May 1999.
[13] B. Bhargava, "Building Distributed Database Systems", 26th VLDB Conference, Cairo, Egypt, 2003.
[14] M. Rabinovich and E. D. Lazowska, "Improving Fault Tolerance and Supporting Partial Writes in Structured Coterie Protocols for Replicated Objects", ACM SIGMOD, USA, 1992.
[15] L. M. Haas, C. Mohan, R. F. Wilms and R. A. Yost, "Computation and Communication in R*: A Distributed Database Manager", ACM Transactions on Computer Systems, Vol. 2, No. 1, 1984.
[16] M. Blakey, "Models a Very Large Distributed Database", ACM Transactions on Computer Systems, Vol. 10, No. 6, 1992.
[17] P. A. Bernstein and N. Goodman, "Concurrency Control in Distributed Database Systems", ACM Computing Surveys, Vol. 13, No. 2, June 1981.
[18] M. J. Carey and M. Livny, "Distributed Concurrency Control Performance: A Study of Algorithms, Distribution, and Replication", 14th VLDB Conference, Los Angeles, California, 1988.
[19] P. A. Franaszek, J. T. Robinson and A. Thomasian, "Concurrency Control for High Contention Environments", ACM Transactions on Database Systems, 1992.
[20] A. Thomasian, "Performance Limits of Two-Phase Locking", IEEE International Conference on Data Engineering, 1991.
[21] M. J. Carey and M. Livny, "Parallelism and Concurrency Control Performance in Distributed Database Machines", ACM SIGMOD, 1989.
[22] K. Norvag, O. Sandsta and K. Bratbergsengen, "Concurrency Control in Distributed Object-Oriented Database Systems", Advances in Databases and Information Systems, 1997.
[23] R. Ladin, B. Liskov, L. Shrira and S. Ghemawat, "Providing High Availability Using Lazy Replication", ACM Transactions on Computer Systems, Vol. 10, No. 4, 1992.
[24] L. M. Stephen and M. N. Huhns, "Database Connectivity Using an Agent-Based Mediator System", Nov 2004.
[25] A. El Abbadi and S. Toueg, "Maintaining Availability in Partitioned Replicated Databases", ACM Transactions on Database Systems, Vol. 14, No. 2, 1989.
[26] C. T. Yu and C. C. Chang, "Distributed Query Processing", ACM Computing Surveys, Vol. 14, No. 2, Dec 1984.
[27] T. T. Vu and C. Collet, "Query Brokers for Distributed and Flexible Query Evaluation", ACM Transactions on Database Systems, 2004.
[28] C. Evrendilek and A. Dogac, "Query Decomposition, Optimization and Processing in Multidatabase Systems", ACM Transactions on Computer Systems, Vol. 11, No. 4, 2003.
[29] M. Karnstedt, K. Hose and K. U. Sattler, "Distributed Query Processing in P2P Systems with Incomplete Schema Information", ACM Transactions on Database Systems, 2004.
[30] J. Smith, A. Gounaris, P. Watson, N. W. Paton, A. A. A. Fernandes and R. Sakellariou, "Distributed Query Processing on the Grid", International Journal of High Performance Computing Applications, Vol. 17, No. 4, Winter 2003.
[31] T. Jim and D. Suciu, "Dynamically Distributed Query Evaluation", ACM Transactions on Database Systems, 2003.