Distributed database issues

Distributed database overview
A distributed database is a collection of data whose parts are under the control of
separate DBMSs running on independent computer systems. All the computers are
interconnected, and each system has autonomous processing capability to serve local
applications. Each system also participates in the execution of one or more global
applications, which require data from more than one site. The distributed nature of the
database is hidden from users, and this transparency manifests itself in a number of ways.
Although a distributed DBMS offers a number of advantages, it also raises a number of
problems and implementation issues. Data in a distributed DBMS can be partitioned,
replicated, or both.
http://www.compapp.dcu.ie/databases/f449.html
Distributed database transparency
A distributed DBMS should provide a number of features which make the distributed
nature of the DBMS transparent to the user. These include the following:
- Location transparency
- Replication transparency
- Performance transparency
- Transaction transparency
- Catalog transparency
The distributed database should look like a centralised system to its users. The problems
of distribution are handled at the internal level, out of the users' sight.
Distributed database advantages
There are a number of advantages to using a distributed DBMS. These include the
following:
- Capacity and incremental growth
- Reliability and availability
- Efficiency and flexibility
- Sharing
Distributed database issues
There are a number of issues or problems which are peculiar to a distributed database and
these require novel solutions. These include the following:
- Distributed query optimisation
- Distributed update propagation
- Distributed concurrency control
- Distributed catalog management
Distributed query optimisation
In a distributed database, the optimisation of queries by the DBMS itself is critical to the
efficient performance of the overall system. The optimiser must take into account the
extra communication cost of moving data from site to site, but where data is replicated it
is free to execute a query against whichever copies are closest. Distributed query
optimisation is therefore a more complex operation than query optimisation in a
centralised database.
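The choice between replicated copies can be sketched as a simple cost comparison. This is a minimal illustration only: the `Replica` class, the per-row `transfer_cost` figure, and the site names are assumptions made for the example, and a real optimiser would also weigh local processing cost, join order, and so on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str             # site holding a copy of the relation (hypothetical names)
    transfer_cost: float  # assumed cost of shipping one row to the querying site


def cheapest_replica(replicas, rows_needed):
    """Return (replica, total_cost) for the copy that is cheapest to read from."""
    if not replicas:
        raise ValueError("relation has no replicas")
    best = min(replicas, key=lambda r: r.transfer_cost * rows_needed)
    return best, best.transfer_cost * rows_needed


replicas = [
    Replica("dublin", 0.0),  # local copy: nothing to ship
    Replica("cork", 2.5),
    Replica("galway", 4.0),
]

best, cost = cheapest_replica(replicas, rows_needed=1000)
# If the local copy were unavailable, the next cheapest site would be chosen.
fallback, fallback_cost = cheapest_replica(replicas[1:], rows_needed=1000)
```

Here the local copy wins because it incurs no communication cost at all, and the remote copies are only considered when no local copy exists.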
Distributed catalog management
In addition to the information held in the system catalog of a centralised DBMS, the
catalog entries of a distributed database must specify the site or sites at which each piece
of data is stored. This extra information is needed because data may be partitioned and
replicated. There are a number of approaches to implementing a distributed database
catalog:
- Centralised: keep one master copy of the catalog
- Fully replicated: keep one copy of the catalog at each site
- Partitioned: partition and replicate the catalog as usage patterns demand
- Centralised/partitioned: a combination of the above
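Whichever approach is used, the extra information the catalog has to hold is essentially a mapping from each fragment of a relation to the sites that store it. The sketch below shows one possible shape for that mapping. The class name, method names, fragment labels, and site names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class DistributedCatalog:
    """Maps each fragment of a relation to the sites that store a copy of it."""

    def __init__(self):
        # (relation, fragment) -> set of site names
        self._locations = defaultdict(set)

    def register(self, relation, fragment, site):
        """Record that `site` stores a copy of `fragment` of `relation`."""
        self._locations[(relation, fragment)].add(site)

    def sites_for(self, relation, fragment):
        """Return the sites holding a fragment, or an empty list if unknown."""
        return sorted(self._locations.get((relation, fragment), set()))

    def fragments_of(self, relation):
        """Return the names of all known fragments of a relation."""
        return sorted(f for (r, f) in self._locations if r == relation)


catalog = DistributedCatalog()
catalog.register("EMPLOYEE", "dept_1_to_5", "dublin")
catalog.register("EMPLOYEE", "dept_1_to_5", "cork")    # replicated fragment
catalog.register("EMPLOYEE", "dept_6_to_9", "galway")  # stored at one site only
```

A centralised catalog would keep a single instance of such a structure at one site, while a fully replicated catalog would keep an identical instance at every site.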
Distributed update propagation
Update propagation in a distributed database is problematic because replication means
there may be more than one copy of a piece of data, and partitioning means the data may
be split across sites. Any update performed by any user must be propagated to all copies
throughout the database. The use of snapshots is one technique for implementing this.
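The text names snapshots without describing them, so the following is only one plausible reading: a snapshot is taken here to be a read-only copy of a primary copy, held at another site and refreshed from time to time. The class names and the version counter are assumptions made for this sketch.

```python
class PrimaryCopy:
    """The copy of the data that accepts updates (assumed design)."""

    def __init__(self):
        self.data = {}
        self.version = 0  # incremented on every update

    def update(self, key, value):
        self.data[key] = value
        self.version += 1


class Snapshot:
    """Read-only copy held at a remote site, refreshed on demand."""

    def __init__(self, primary):
        self._primary = primary
        self.data = {}
        self.version = -1  # -1 means "never refreshed"

    def refresh(self):
        # Copy the primary's current state; a real system would ship it over the network.
        self.data = dict(self._primary.data)
        self.version = self._primary.version

    def is_stale(self):
        return self.version != self._primary.version


primary = PrimaryCopy()
snapshot = Snapshot(primary)
snapshot.refresh()

primary.update("E1", 30000)
stale_before = snapshot.is_stale()  # the update has not reached the snapshot yet
snapshot.refresh()
stale_after = snapshot.is_stale()
```

The trade-off in this reading is that readers of a snapshot may see slightly out-of-date data between refreshes, in exchange for updates not having to reach every copy immediately.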
Query optimisation overview
Query optimisation is essential if a DBMS is to achieve acceptable performance and
efficiency. Relational database systems have the strength that relational expressions are
at a sufficiently high level for automatic optimisation to be feasible in the first place. In
non-relational systems, user requests are low level and optimisation must be done
manually by the user; the system cannot help. Systems which implement optimisation
therefore have a clear advantage over systems that do not.
The optimisation process itself involves several stages, including the choice of
implementation for each of the relational operators. A different approach to query
optimisation, called semantic optimisation, has also been suggested. This technique may
be used in combination with the other optimisation techniques and uses constraints
specified on the database schema. Consider the SQL query:
SELECT E.LNAME
FROM   EMPLOYEE E, EMPLOYEE M
WHERE  E.SUPERSSN = M.SSN AND E.SALARY > M.SALARY
This query retrieves the names of employees who earn more than their supervisors.
Suppose the database schema carries a constraint stating that no employee can earn more
than their supervisor. If the semantic query optimiser checks for the existence of this
constraint, it need not execute the query at all, because the result is known to be empty.
This may save considerable time if the constraint check can be done efficiently. However,
searching through many constraints to find the ones applicable to a given query can itself
be quite time consuming.
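The idea can be sketched as a check made before execution. This is a deliberately simplified illustration: predicates are compared as plain strings, whereas a real optimiser would compare parsed expressions, and the join predicate `E.SUPERSSN = M.SSN` uses an assumed column name for the supervisor link. The names `FORBIDDEN_COMBINATIONS` and `run_query` are hypothetical.

```python
# Each entry is a set of predicates that the schema constraints say can never
# all hold at once. Here: an employee joined to their supervisor cannot have
# the higher salary. (Column name SUPERSSN is an assumption.)
FORBIDDEN_COMBINATIONS = [
    frozenset({"E.SUPERSSN = M.SSN", "E.SALARY > M.SALARY"}),
]


def run_query(where_conjuncts, execute):
    """Run `execute` unless the WHERE clause contradicts a schema constraint."""
    conjuncts = set(where_conjuncts)
    for forbidden in FORBIDDEN_COMBINATIONS:
        if forbidden <= conjuncts:
            return []  # result is provably empty, so skip execution
    return execute()


calls = []


def execute():
    # Stand-in for real query execution; records that it was called.
    calls.append("executed")
    return [("Smith",)]


skipped = run_query(["E.SUPERSSN = M.SSN", "E.SALARY > M.SALARY"], execute)
calls_after_skip = list(calls)
executed = run_query(["E.SUPERSSN = M.SSN", "E.SALARY > 50000"], execute)
```

The loop over `FORBIDDEN_COMBINATIONS` is also where the cost noted above shows up: with many constraints, the search for an applicable one is itself work that has to be paid for on every query.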
Distributed concurrency control
Concurrency control in distributed databases can be done in several ways. Locking and
timestamping are two techniques which can be used. Timestamping is often favoured in
this setting because it cannot deadlock.
The problems of concurrency control in a distributed DBMS are more severe than in a
centralised DBMS because data may be replicated and partitioned. If a user wants
exclusive access to a piece of data, for example to perform an update or a read, the DBMS
must be able to guarantee exclusive access to that data, which is difficult when copies of
it are held at several sites in the distributed database.
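One way to see the difficulty is to sketch what exclusive access means when every copy must be locked. The example below assumes a simple scheme in which a transaction has to obtain an exclusive lock at each site holding a copy, and gives up everything it has acquired if any site refuses. The class and function names are hypothetical, and other schemes exist.

```python
class SiteLockTable:
    """Exclusive locks held at one site (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self._locked = {}  # item -> id of the transaction holding the lock

    def try_lock(self, item, txn):
        holder = self._locked.get(item)
        if holder is None or holder == txn:
            self._locked[item] = txn
            return True
        return False

    def unlock(self, item, txn):
        if self._locked.get(item) == txn:
            del self._locked[item]


def lock_all_copies(sites, item, txn):
    """Lock `item` at every site, or release everything and report failure."""
    acquired = []
    for site in sites:
        if site.try_lock(item, txn):
            acquired.append(site)
        else:
            for held in acquired:  # back out so no partial locks remain
                held.unlock(item, txn)
            return False
    return True


sites = [SiteLockTable("dublin"), SiteLockTable("cork")]
t1_ok = lock_all_copies(sites, "EMPLOYEE:E1", "T1")
t2_ok = lock_all_copies(sites, "EMPLOYEE:E1", "T2")  # blocked by T1

for site in sites:
    site.unlock("EMPLOYEE:E1", "T1")
t2_retry_ok = lock_all_copies(sites, "EMPLOYEE:E1", "T2")
```

In a real system each `try_lock` call is a network round trip, so the cost of exclusive access grows with the number of copies.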
Timestamping
Timestamping is a method of concurrency control in which every transaction is given a
timestamp, a unique date/time/site combination, and the database management system
uses one of a number of protocols to schedule transactions which require access to the
same piece of data.
While more complex to implement than locking, timestamping avoids deadlock
altogether: transactions never wait for one another, and a transaction that arrives too
late is rolled back and restarted instead.
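The text does not say which protocol is meant, so the following sketch uses basic timestamp ordering as one well-known example. Timestamps are modelled as `(counter, site)` tuples, and a single shared counter stands in for the clocks that each site would really maintain. All names here are illustrative.

```python
import itertools


class Abort(Exception):
    """Raised when a transaction arrives too late and must be restarted."""


class Item:
    def __init__(self, value=None):
        self.value = value
        self.read_ts = (0, "")   # largest timestamp that has read this item
        self.write_ts = (0, "")  # largest timestamp that has written this item


_counter = itertools.count(1)  # simplification: one counter shared by all sites


def new_timestamp(site):
    # The counter orders transactions; the site name breaks ties between sites.
    return (next(_counter), site)


def read(item, ts):
    if ts < item.write_ts:  # a younger transaction has already written
        raise Abort("read arrived too late")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item, ts, value):
    if ts < item.read_ts or ts < item.write_ts:
        raise Abort("write arrived too late")
    item.value = value
    item.write_ts = ts


x = Item(10)
t1 = new_timestamp("A")  # older transaction
t2 = new_timestamp("B")  # younger transaction

write(x, t2, 20)  # the younger transaction writes first
try:
    read(x, t1)   # the older transaction now tries to read
    outcome = "ok"
except Abort:
    outcome = "aborted"
```

Neither `read` nor `write` ever blocks, which is why deadlock cannot arise. The price is that a late transaction is aborted and has to start again with a new timestamp.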