Distributed database overview

A distributed database consists of a collection of data whose different parts are under the control of separate DBMSs running on independent computer systems. All the computers are interconnected, and each system has autonomous processing capability serving its own local applications. Each system also participates in the execution of one or more global applications, which require data from more than one site. The distributed nature of the database is hidden from users, and this transparency manifests itself in a number of ways. Although there are a number of advantages to using a distributed DBMS, there are also a number of problems and implementation issues. Finally, data in a distributed DBMS can be partitioned, replicated, or both.

http://www.compapp.dcu.ie/databases/f449.html

Distributed database transparency

A distributed DBMS should provide a number of features which make the distributed nature of the DBMS transparent to the user. These include the following:

Location transparency
Replication transparency
Performance transparency
Transaction transparency
Catalog transparency

The distributed database should look like a centralised system to the users; the problems arising from distribution are dealt with at the internal level.

Distributed database advantages

There are a number of advantages to using a distributed DBMS. These include the following:

Capacity and incremental growth
Reliability and availability
Efficiency and flexibility
Sharing

Distributed database issues

There are a number of issues or problems which are peculiar to a distributed database and which require novel solutions. These include the following:

Distributed query optimisation
Distributed update propagation
Distributed concurrency control
Distributed catalog management

Distributed query optimisation

In a distributed database, the optimisation of queries by the DBMS itself is critical to the efficient performance of the overall system. Query optimisation must take into account the extra communication costs of moving data from site to site, but it can use whichever replicated copies of the data are closest when executing a query. It is therefore a more complex operation than query optimisation in a centralised database.

Distributed catalog management

In addition to the information held in the system catalog of a centralised DBMS, the catalog entries of a distributed database must specify the site(s) at which each piece of data is stored. This extra information is needed because of data partitioning and replication. There are a number of approaches to implementing a distributed database catalog:

Centralised - Keep one master copy of the catalog
Fully replicated - Keep one copy of the catalog at each site
Partitioned - Partition and replicate the catalog as usage patterns demand
Centralised/partitioned - Combination of the above

Distributed update propagation

Update propagation in a distributed database is problematic because there may be more than one copy of a piece of data due to replication, and the data may be split up due to partitioning. Any update to data performed by any user must be propagated to all copies throughout the database. The use of snapshots is one technique for implementing this; a minimal sketch of snapshot-style propagation is given below.
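The following is only an illustrative sketch of the snapshot idea, not the mechanism of any particular DBMS: each replica site holds a read-only copy of a table owned by a master site, and the copies are refreshed on a schedule rather than on every update. The table contents and site names are invented for the example.

import copy

# Hypothetical master table at the owning site: primary key -> row.
master_parts = {
    101: {"name": "bolt", "qty": 500},
    102: {"name": "nut",  "qty": 750},
}

# Read-only snapshot copies held at other sites (site names are made up).
snapshots = {"dublin_site": {}, "cork_site": {}}

def refresh_snapshot(site):
    """Replace a site's snapshot with the current state of the master table."""
    snapshots[site] = copy.deepcopy(master_parts)

def refresh_all_snapshots():
    """Propagate the master's updates to every snapshot site; in practice this
    would run on a schedule (e.g. nightly) rather than after every update."""
    for site in snapshots:
        refresh_snapshot(site)

# Updates are applied only at the master copy ...
master_parts[101]["qty"] = 450
# ... and the replica sites catch up at the next scheduled refresh.
refresh_all_snapshots()
print(snapshots["dublin_site"][101])   # {'name': 'bolt', 'qty': 450}

The design choice illustrated is that replicas are allowed to lag behind the master between refreshes, trading strict consistency for much cheaper update propagation, which is why snapshots suit read-mostly data.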
Query optimisation overview

Query optimisation is essential if a DBMS is to achieve acceptable performance and efficiency. Relational database systems, based on the relational model and relational algebra, have the advantage that relational expressions are at a sufficiently high level for query optimisation to be feasible in the first place; in non-relational systems, user requests are expressed at a low level and optimisation is done manually by the user, so the system cannot help. Hence systems which implement optimisation have several advantages over systems that do not. The optimisation process itself involves several stages, which include selecting implementations of the relational operators.

A different approach to query optimisation, called semantic optimisation, has recently been suggested. This technique may be used in combination with the other optimisation techniques and uses constraints specified on the database schema. Consider the SQL query:

SELECT  E.LNAME
FROM    EMPLOYEE E, EMPLOYEE M
WHERE   E.SUPERSSN = M.SSN AND E.SALARY > M.SALARY

This query retrieves the names of employees who earn more than their supervisors. Suppose there is a constraint on the database schema stating that no employee can earn more than their supervisor. If the semantic query optimiser checks for the existence of this constraint, it need not execute the query at all. This can save considerable time if constraint checking can be done efficiently; however, searching through many constraints to find those applicable to a given query can itself be quite time consuming.

Distributed concurrency control

Concurrency control in distributed databases can be done in several ways. Locking and timestamping are two techniques which can be used, but timestamping is generally preferred. The problems of concurrency control in a distributed DBMS are more severe than in a centralised DBMS because data may be replicated and partitioned. If a user requires unique access to a piece of data, for example to perform an update or a consistent read, the DBMS must be able to guarantee that access, which is difficult if there are copies throughout the sites of the distributed database.

Timestamping

Timestamping is a method of concurrency control in which every transaction is given a timestamp, a unique date/time/site combination, and the DBMS uses one of a number of protocols to schedule transactions that require access to the same piece of data. While more complex to implement than locking, timestamping avoids deadlock occurring in the first place rather than having to detect and resolve it. A minimal sketch of the basic timestamp-ordering checks is given below.
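The following is a minimal sketch of one such protocol, basic timestamp ordering, using invented names and in-memory structures rather than any real DBMS interface: a read is rejected if the item has already been written by a younger transaction, and a write is rejected if the item has already been read or written by a younger transaction; a rejected transaction restarts with a fresh timestamp. In a distributed DBMS the timestamp would combine local time with a site identifier to make it globally unique.

import itertools

# Monotonically increasing timestamps; a distributed DBMS would append a site id.
_next_ts = itertools.count(1)

class RollbackError(Exception):
    """Signals that the transaction must restart with a fresh timestamp."""
    def __init__(self, txn):
        super().__init__(f"transaction {txn.ts} rolled back")

class DataItem:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0    # largest timestamp of any transaction that read the item
        self.write_ts = 0   # largest timestamp of any transaction that wrote the item

class Transaction:
    def __init__(self):
        self.ts = next(_next_ts)

def read(txn, item):
    """Reject the read if a younger transaction has already written the item."""
    if txn.ts < item.write_ts:
        raise RollbackError(txn)
    item.read_ts = max(item.read_ts, txn.ts)
    return item.value

def write(txn, item, value):
    """Reject the write if a younger transaction has already read or written the item."""
    if txn.ts < item.read_ts or txn.ts < item.write_ts:
        raise RollbackError(txn)
    item.value = value
    item.write_ts = txn.ts

# Example: the older transaction t1 arrives too late and is rolled back.
x = DataItem(10)
t1, t2 = Transaction(), Transaction()   # t2 is younger (larger timestamp)
write(t2, x, 20)                        # ok: x.write_ts becomes t2.ts
try:
    read(t1, x)                         # t1.ts < x.write_ts, so t1 is rejected
except RollbackError as e:
    print(e)

Because conflicting operations are rejected and restarted rather than made to wait for locks, no transaction ever waits for another, so deadlock cannot arise; the price is that long-running transactions may be restarted repeatedly.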