LU13 – D ISTRIBUTED D ATABASES AND DDBMS
L EARNING O BJECTIVES
This week we’ll learn about distributed data, distributed databases, and distributed database management systems. Our learning objectives are:
Describe various DDBMS implementations
Explain how database design affects the DDBMS environment
Apply DDBMS principles to solve problems
If you look at our methodology below we will be looking at activities that take place in the design and implementation phases of our methodology:
In this learning unit we will explore the fundamentals of distributed data, distributed databases, and the features of the DBMS that facilitate data distribution.
P ART 1: A LL ABOUT D ISTRIBUTED D ATABASES
W HAT ARE THE DRIVING FORCES FOR DISTRIBUT ING DATA ?
There is little question that we are in the information age as well as firmly entrenched in a global economy. With these factors as our backdrop it leaves little doubt that organizations are becoming more and more geographically dispersed. If this is the case, how will we provide decision makers with the data they need to make business decisions? How then will organizations get data timely? One
Page 1
solution for getting data to where it is needed is by placing it closer to where it is used. This week we’ll explore the notion of distributed databases which is the one of our solutions to getting data closer to the people that need it.
W HAT IS A DISTRIBUTED DATABASE ?
A distributed database is a single logical view of the database (as seen by the user) of a database that is physically spread over multiple computing locations connected by a network. The important thing to remember is that the distributed database is truly a database and not a loose collection of files and applications. A distributed database requires multiple database management systems with at least one DBMS running on a node of the network where a database is located.
What is the difference between a distributed database and a decentralized database? A decentralized database is also stored on multiple computers in multiple locations but it is a collection of independent databases. The databases are not viewed as a single logical entity and the data in these individual databases can not be shared.
In today’s day and age, distributed databases come in many shapes and forms, and serve a wide variety of purposes. Let’s look at some of the conditions that fostered the growth of distributed databases:
Distribution and autonomy of the business - result of globalization, mergers and acquisitions modern organizations have created geographically dispersed autonomous business units. Each business unit has demanded local control over its data.
Data Sharing - Most modern business decisions, independent of the organization’s complexity, are requiring crossfunctional data. That is data from business units have become increasingly more important.
Data communications costs - the cost to continually move high volumes of data to remote locations can be high. While costs have decreased, thereby reducing the need for this form of data distribution, it still remains a factor in the overall database design.
Data communications reliability - moving large volumes of data can be risky
Data communications timing - moving large volumes of data takes time. Having the data stored closer to where it is used decreases the delivery time to the user.
Purchased software - many organizations are relying on purchased software packages to help solve their business problems. Often times these independent software packages use different databases and different DBMSs. So it is not unusual for an organization to have a variety of databases and DBMSs. The development focus then shifts from design and implementation to how these applications can be integrated with each other.
Data recovery - Many organizations use distributed databases as part of their data recovery strategy. By having duplicate databases geographically distributed, an organization can spread the risk as well as having copies of their data available if one of the databases fails.
Multiple uses - A distributed database gives an organization the option to use their databases for multiple purposes. One database can support the operational decision making and processing needs of the organization while another database can provide the decision support needs of the tactical and strategic decision makers.
T HREE PRIMARY OBJECTI VES OF THE DISTRIBUT ED DATABASE
Before a distributed database can provide the capabilities necessary for the user to derive value from the database it must first meet three objectives. These objectives make it possible for users to have confidence that the database maintains data integrity, meets high performance thresholds and has the capabilities for easy data access. To meet these criteria a distributed data database must have:
1.
Location transparency - even though the data is physically dispersed the database has to perform as if all of the data were stored in one physical location. Location transparency ensures to the end user the applications that use the database will work regardless of the user’s and data’s location.
2.
Replication transparency - even though a piece of data is physically stored in more than one location the database treats the data as if it were stored in only one location. Many distributed databases rely on the storage of copies of data in several
Page 2
locations to improve performance. This fact must be abstracted so that the perception to the end user is that there’s only one data source.
3.
Failure transparency - even though a transaction is successful in one location it must also be successful at all locations. Once a transaction occurs it must survive all opportunities for failure, it the transaction fails to commit at one site it must be rolled back at all sites. This is critical for maintaining consistent and accurate data at multiple locations.
P ART 2: O PTIONS FOR D ISTRIBUTING D ATABASES
D ISTRIBUTED DATABA SE DESIGN FACTORS
The determining factor for the method chosen will be based on the situation in which the organization plans to use the data and the resources they have to support the chosen method. There are five factors that influence the selection of a distributed database design strategy:
1.
Organizational forces - funding availability, autonomy of organizational units, and the need for security.
2.
Frequency and location or clustering of reference to data - In general, data should be located close to the applications that use those data.
3.
Need for growth and expansion - The availability of processors on the network will influence where data may be located and applications may be run, and may indicate the need for expansion of the network.
4.
Technological capabilities - Capabilities at each node and for DBMSs coupled with the costs for acquiring and managing technology must be considered. Storage costs tend to be low, but the costs for managing complex technology can be great.
5.
Need for reliable service - Mission-critical applications and very frequently required data encourage replication schemes.
There are a number of ways that databases can be distributed. The three most common are data replication, horizontal partitioning and vertical partitioning.
D ATA R EPLICATION
Data replication is a very popular method of data distribution. This method provides a fault tolerant way of distributing a copy
(complete of partial) of the database in more than one location. There are many different replication models including push replication, pull replication and store
Page 3
Here are some advantages:
Reliability - in the event that one database node fails there are other nodes with a copy of the same data available for use
Faster response - each site has a copy of the data close to where it is used minimizing data movement across the network or contention with other user sites for the same data
Node decoupling - transactions can be processed without coordination across a network. Each user’s request can be handled independently. When the updates occur, data synchronization of all database copies can take place independently
Reduced network traffic - network traffic for data replication or synchronization can take place in off-peak times to reduce network congestion
Some disadvantages:
Increase storage requirements - each copy of the database is going to require additional disk space to store “replication metadata”, such as which row is newer and where it was updated from.
Complexity of synchronization - there are costs and complexities to keep all copies of the database current especially updating and synchronizing in near-real time
H ORIZONTAL P ARTITIONING
Imagine a scenario by which the sales team, customer records and orders for each particular location are stored in a DBMS at that location. This makes sense when that majority of CRUD operations will take place within a specific scope. For example, a North
American sales office would seldom place an order for a European customer, but could do so thanks to distribution transparency.
Horizontal partitioning is a method for distributing data by row. Each database node on the network gets a subset of the rows of the database and the total of all of the rows from all of the nodes on the network comprise the entire database
Advantages of horizontal partitioning:
Efficiency - Data are stored close to where they are used and separate from other data used by other users or applications.
Local optimization - Data can be stored to optimize performance for local access.
Security - Data not relevant to usage at a particular site are made unavailable.
Ease of querying - Combining data across horizontal partitions is easy since rows are simply merged by unions across the partitions.
Page 4
Ease of combining - a horizontal database can be combined by performing a UNION of the databases making it much easier to combine rows from multiple tables versus the more complex JOINS.
Disadvantages horizontal partitioning:
Inconsistent access speed - When data from several partitions are required, the access time can be significantly different from local-only data access.
Backup vulnerability - Since data are not replicated, when data at one site become inaccessible or damaged, usage cannot switch to another site where a copy exists; data may be lost if proper backup is not performed at each site.
V ERTICAL P ARTITIONING
Vertical partitioning is a method for distributing data by column. A subset of a database’s columns is distributed to multiple sites.
Each set of data needs a combination of primary and foreign keys in order to recreate the entire database. The advantages and disadvantages of vertical partitioning are identical to those of horizontal partitioning, with the exception that combining data across vertical partitions is more difficult than across horizontal partitions. This difficulty arises from the need to match primary keys or other qualifications to join rows across partitions.
H OW DO YOU SELECT THE RIGHT DATA DISTRIBUT ION STRATEGY ?
Now that you understand the basic options for distributed databases, you might be wondering how you choose the right strategy for a given situation. There are a variety of ways to distribute data, and the rationale for making the decision is based on your organization’s unique data needs and the resources your organization is willing to invest. Of the following methods of distribution there is no one best approach, and oftentimes a combination of these approaches is used.
Various distribution strategies at a glance.
Complete centralization - the database is physically stored in one location and is accessed remotely by the geographically dispersed sites.
Replication w/ Snapshots - each remote location has a complete or partial copy of the database with each copy periodically updated with snapshots of reflecting changes to the data. Snapshots occur on a routine interval.
Page 5
Replication w/Synchronization - each remote site has a complete or partial copy of the database with each copy receiving near real-time updates via synchronization – like remote triggers on a table.
Integrated Partitioning - where each site’s database is viewed as one logical piece of the entire database, either through horizontal or vertical partitioning.
Independent Partitioning - where each site’s database is independent and non-integrated with all of the other database segments
The table below contrasts the various database distribution strategies:
taken from "Modern Database Management" 7th ed., Hoffer, Prescott, McFadden, Prentice-Hall 2007.
D ISTRIBUTION : H OMOGENEOUS VS .
HETEROGENEOUS
When distributing databases it is easier and less complicated to distribute homogeneous databases. The idea is that the same vendor’s DBMS (with the same major release) will be used to manage the data at each node. Often times this is impractical especially in a packaged software environment or in a situation where you are distributing databases acquired through mergers or acquisitions. The figures below demonstrate the difference between a homogeneous and a heterogeneous database
Page 6
Page 7