UNIT - I
Parallel Databases
Syllabus :
Introduction, Parallel database architecture, I/O Parallelism, Inter-query and Intraquery Parallelism, Inter-operational and Intra-operational Parallelism, Design of
parallel systems.
1.1
Introduction :
Fifteen years ago, parallel database systems had been nearly written off, even by some
of their staunchest advocates. Today, they are successfully marketed by practically every
database system vendor. Several trends fueled this transition.

The transaction requirements of organizations have grown with increasing use of
computers. Moreover, the growth of the World Wide Web has created many sites with
millions of viewers, and the increasing amount of data collected from these viewers has
produced extremely large databases at many companies.

Organizations are using these increasingly large volumes of data, such as data about what
items people buy, what Web links users click on, or when people make telephone
calls, to plan their activities and pricing. Queries used for such purposes are called
decision support queries, and the data requirements for such queries may run into
terabytes. Single-processor systems are not capable of handling such large volumes of
data at the required rates.

The set-oriented nature of database queries naturally lends itself to parallelization. A
number of commercial and research systems have demonstrated the power and
scalability of parallel query processing.

As microprocessors have become cheap, parallel machines have become common and
relatively inexpensive.
A parallel database system seeks to improve performance through parallelization of
various operations, such as loading data, building indexes and evaluating queries. Although
data may be stored in a distributed fashion in such a system, the distribution is governed solely by
performance considerations.
1.2
Parallel Systems :
 [ University Exam – Dec. 2006, Dec. 2007 !!! ]

Parallel systems improve processing and I/O speeds by using multiple CPUs and disks
in parallel. Parallel machines are becoming increasingly common, making the study of
parallel database systems correspondingly more important.

The driving force behind parallel database systems is the demands of applications that
have to query extremely large databases or that have to process a large number of
transactions per second (of the order of thousands of transactions per second).

Centralized and client-server database systems are not powerful enough to handle such
applications.

In parallel processing, many operations are performed simultaneously, as opposed to
serial processing, in which the computational steps are performed sequentially.

A coarse-grain parallel machine consists of a small number of powerful processors; a
massively parallel or fine-grain parallel machine uses thousands of smaller processors.

Most high-end machines today offer some degree of coarse-grain parallelism: machines
with two or more processors are common.
1.2.1 Measures of Performance of Database Systems :
There are two main measures of performance of a database system :
1. Throughput, the number of tasks that can be completed in a given time interval.
2. Response time, the amount of time it takes to complete a single task from the time it is
submitted.
A system that processes a large number of small transactions can improve
throughput by processing many transactions in parallel. A system that processes large
transactions can improve response time as well as throughput by performing subtasks of
each transaction in parallel.
1.2.2 Speedup and Scaleup :
 [ University Exam – May 2007, Dec. 2007 !!! ]
Two important issues in studying parallelism are :
(1) Speedup :
Running a given task in less time by increasing the degree of parallelism is called
speedup.
(2) Scaleup :
Handling larger tasks by increasing the degree of parallelism is scaleup.
Speedup :

Consider a database application running on a parallel system with a certain number of
processors and disks. Now suppose that we increase the size of the system by increasing
the number of processors, disks, and other components of the system.

The goal is to process the task in time inversely proportional to the number of
processors and disks allocated.

The parallel system is said to demonstrate linear speedup if the speedup is N when the
larger system has N times the resources (CPU, disk, and so on) of the smaller system.

If the speedup is less than N, the system is said to demonstrate sublinear speedup.
Fig. 1.1 illustrates linear and sublinear speedup.
Fig. 1.1 : Speedup
Scaleup :

Scaleup relates to the ability to process larger tasks in the same amount of time by
providing more resources.

Let Q be a task and QN be a task that is N times bigger than Q. Suppose execution time
of task Q on machine MS is TS and the execution time of task QN on parallel machine ML
which is N times larger than MS is TL.
Scaleup is defined as TS / TL,
where
TL : execution time of the larger task QN on the larger machine ML
TS : execution time of task Q on the smaller machine MS
The parallel system ML is said to demonstrate linear scaleup on task Q if TL = TS.
If TL > TS, the system is said to demonstrate sublinear scaleup.
Fig. 1.2 : Scaleup
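To make the two definitions concrete, here is a small worked example; the numbers are invented for illustration and are not taken from the text or the figures. Using the standard measures speedup = (time on the small system) / (time on the large system) and scaleup = TS / TL :
Speedup : a query takes 100 seconds on the small system and 25 seconds on a system with N = 4 times the resources, so speedup = 100 / 25 = 4 = N, which is linear speedup.
Scaleup : task Q takes TS = 100 seconds on MS, and the N-times-larger task QN takes TL = 125 seconds on the N-times-larger machine ML, so scaleup = TS / TL = 100 / 125 = 0.8 < 1, which is sublinear scaleup.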
1.3
Architectures for Parallel Databases :
 [ University Exam – May 2007, Dec. 2007, Dec. 2009 !!! ]
There are several architectural models for parallel machines. Among the most prominent
ones are those shown in Fig. 1.3 (in the figure, M denotes memory, P denotes a processor, and
disks are shown as cylinders).

Shared memory : All the processors share a common memory (Fig. 1.3(a)).
Fig. 1.3(a) : Shared memory
o Shared disk : All the processors share a common set of disks (Fig. 1.3(b)). Shared-disk systems are sometimes called clusters.
Fig. 1.3(b) : Shared disk
o Shared nothing : The processors share neither a common memory nor a common
disk (Fig. 1.3(c)).
Fig. 1.3(c) : Shared nothing
o Hierarchical : This model is a hybrid of the preceding three architectures
(Fig. 1.3(d)).
Techniques used to speed up transaction processing on data-server systems, such as
lock caching and lock de-escalation, can also be used in shared-disk parallel databases as well as in
shared-nothing parallel databases. In fact, they are very important for efficient transaction
processing in such systems.
1.3.1 Shared Memory :
In a shared-memory architecture, the processors and disks have access to a common
memory, typically via a bus or through an interconnection network.
Advantages :

The benefit of shared memory is extremely efficient communication between processors.
Data in shared memory can be accessed by any processor without being moved with
software.

A processor can send messages to other processors much faster by using memory writes
(which usually take less than a microsecond) than by sending a message through a
communication mechanism.
Disadvantages :

The downside of shared memory is that the architecture is not scalable beyond 32 or 64
processors because the bus or interconnection network becomes a bottleneck (since it is
shared by all processors).

Adding more processors does not help, since the processors spend most of their time
waiting for their turn on the bus to access memory.

Shared-memory architectures usually have large memory caches at each processor so
that referencing of the shared memory is avoided whenever possible.

However, at least some of the data will not be in the cache and accesses will have to go
to the shared memory. Moreover, the caches need to be kept coherent.

Maintaining cache coherency becomes an increasing overhead with an increasing
number of processors.

Consequently, shared memory machines are not capable of scaling up beyond a point;
current shared-memory machines cannot support more than 64 processors.
1.3.2 Shared Disk :
In the shared-disk model, all processors can access all disks directly via an
interconnection network, but the processors have private memories.
Advantages :

Since each processor has its own memory, the memory bus is not a bottleneck.

It offers a cheap way to provide a degree of fault tolerance.

If a processor (or its memory) fails, the other processors can take over its tasks, since the
database is resident on disks that are accessible from all processors.

We can make the disk subsystem itself fault tolerant by using a RAID architecture.

The shared-disk architecture has found acceptance in many applications.
Disadvantages :

The main problem with a shared-disk system is again scalability.

Although the memory bus is no longer a bottleneck, the interconnection to the disk
subsystem is now a bottleneck; it is particularly so in a situation where the database
makes a large number of accesses to disks.

Compared to shared memory systems, shared-disk systems can scale to a somewhat
larger number of processors, but communication across processors is slower, since it has
to go through a communication network.
Example :
DEC clusters running Rdb were one of the early commercial users of the shared-disk
database architecture. (Rdb is now owned by Oracle, and is called Oracle Rdb. Digital
Equipment Corporation (DEC) is now owned by Compaq.)
1.3.3
Shared Nothing :

In a shared-nothing system, each node of the machine consists of a processor, memory,
and one or more disks.

The processors at one node may communicate with processors at other nodes through
a high-speed interconnection network.

A node functions as the server for the data on the disk or disks that the node owns;
local disk references are thus serviced by the local disks at each node.
Advantages :

The shared-nothing model overcomes the disadvantage of requiring all I/O to go
through a single interconnection network; only queries, accesses to nonlocal disks, and
result relations pass through the network.

Moreover, the interconnection networks for shared nothing systems are usually designed
to be scalable, so that their transmission capacity increases as more nodes are added.

Consequently, shared-nothing architectures are more scalable, and can easily support a
large number of processors.
Disadvantage :
The main drawback of shared nothing systems is the costs of communication and of
nonlocal disk access, which are higher than in a shared memory or shared-disk architecture
since sending data involves software interaction at both ends.
Applications :

The Teradata database machine was among the earliest commercial systems to use the
shared-nothing database architecture.

The Grace and the Gamma research prototypes also used shared-nothing architectures.
1.3.4 Hierarchical :

The hierarchical architecture combines the characteristics of shared-memory, shared-disk, and shared-nothing architectures.

At the top level, the system consists of nodes connected by an interconnection network;
the nodes do not share disks or memory with one another. Thus, the top level is a shared-nothing architecture.

Each node of the system could actually be a shared-memory system with a few
processors. Alternatively, each node could be a shared-disk system, and each of the
systems sharing a set of disks could be a shared-memory system.

Thus, a system could be built as a hierarchy, with shared-memory architecture with a
few processors at the base, and a shared-nothing architecture at the top, with possibly
shared-disk architecture in the middle.

Fig. 1.3(d) illustrates a hierarchical architecture with shared-memory nodes connected
together in a shared-nothing architecture.

Commercial parallel database systems today run on several of these architectures.

Attempts to reduce the complexity of programming such systems have yielded
distributed virtual memory architectures, where logically there is a single shared
memory, but physically there are multiple disjoint memory systems; the virtual-memory-mapping hardware, coupled with system software, allows each processor to
view the disjoint memories as a single virtual memory.

Since access speeds differ, depending on whether the page is available locally or not, such
an architecture is also referred to as nonuniform memory architecture (NUMA).
Fig. 1.3(d)
1.3.5 Parallel Query Evaluation :
 [ University Exam – Dec. 2006 !!! ]
Now we try to understand parallel evaluation of a relational query in a DBMS with a
shared-nothing architecture. While it is possible to consider parallel execution of multiple
queries, it is hard to identify in advance which queries will run concurrently. So the emphasis
has been on parallel execution of a single query.

A relational query execution plan is a graph of relational algebra operators, and the
operators in a graph can be executed in parallel. If one operator consumes the output of
a second operator, we have pipelined parallelism (the output of the second operator is
worked on by the first operator as soon as it is generated).

If not, the two operators can proceed essentially independently. An operator is said to
block if it produces no output until it has consumed all inputs. Pipelined parallelism is
limited by the presence of operators that block.

To evaluate different operators in parallel, we can evaluate each individual operator in a
query plan in a parallel fashion. The key to evaluating an operator in parallel is to partition
the input data; we can then work on each partition in parallel and combine the results. This
approach is called partitioned parallel evaluation.

An important observation, which explains why shared-nothing parallel database systems
have been very successful, is that database query evaluation is very amenable to data-partitioned parallel evaluation.

The goal is to minimize data shipping by partitioning the data and structuring the
algorithms to do most of the processing at individual processors.
1.4
I/O Parallelism :
 [ University Exam – Dec. 2007, Dec. 2009 !!! ]
Definition : I/O parallelism refers to reducing the time required to retrieve relations
from disk by partitioning the relations on multiple disks. The most common form of data
partitioning in a parallel database environment is horizontal partitioning.
In horizontal partitioning, the tuples of a relation are divided (or declustered) among
many disks, so that each tuple resides on one disk. Several partitioning strategies have been
proposed.
1.4.1 Partitioning Techniques :
There are three basic data-partitioning strategies. Assume that there are n disks
D0, D1, ..., Dn – 1, across which the data are to be partitioned.

Round-robin : This strategy scans the relation in any order and sends the ith tuple to
disk number Di mod n. The round-robin scheme ensures an even distribution of tuples
across disks; that is, each disk has approximately the same number of tuples as the
others.

Hash partitioning : This declustering strategy designates one or more attributes from
the given relation's schema as the partitioning attributes. A hash function is chosen
whose range is {0, 1, ..., n – 1}. Each tuple of the original relation is hashed on the
partitioning attributes. If the hash function returns i, then the tuple is placed on disk Di.

Range partitioning : This strategy distributes contiguous attribute-value ranges to each
disk. It chooses a partitioning attribute A and a partitioning vector. The relation is partitioned
as follows. Let {v0, v1, ..., vn – 2} denote the partitioning vector, such that, if i < j, then
vi < vj. Consider a tuple t such that t[A] = x. If x < v0, then t goes on disk D0.
If x >= vn – 2, then t goes on disk Dn – 1. If vi <= x < vi + 1, then t goes on disk Di + 1.

For example, range partitioning with three disks numbered 0, 1, and 2 may assign
tuples with values less than 5 to disk 0, values between 5 and 40 to disk 1, and values greater
than 40 to disk 2.
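As a rough illustration of the three strategies, the following Python sketch simulates the disk assignments; the number of disks, the partition vector and the sample tuples are invented for this example and are not part of the text above.

# Minimal sketch of the three data-partitioning strategies described above.
# The disk count, partition vector and sample tuples are hypothetical.

NUM_DISKS = 4                       # n disks, numbered 0 .. n-1
PARTITION_VECTOR = [5, 40, 100]     # v0, v1, v2 : n-1 values for range partitioning

def round_robin(i, tup):
    # Send the ith tuple (in scan order) to disk i mod n.
    return i % NUM_DISKS

def hash_partition(tup, attr):
    # Hash the partitioning attribute into the range 0 .. n-1.
    return hash(tup[attr]) % NUM_DISKS

def range_partition(tup, attr):
    # Disk 0 if x < v0; disk i+1 if vi <= x < vi+1; disk n-1 if x >= v(n-2).
    x = tup[attr]
    for i, v in enumerate(PARTITION_VECTOR):
        if x < v:
            return i
    return len(PARTITION_VECTOR)

employees = [{"salary": 3}, {"salary": 25}, {"salary": 70}, {"salary": 150}]
print([range_partition(t, "salary") for t in employees])    # [0, 1, 2, 3]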
1.4.2 Comparison of Partitioning Techniques :
 [ University Exam – Dec. 2006, May 2007 !!! ]

Once a relation has been partitioned among several disks, we can retrieve it in parallel,
using all the disks.

Similarly, when a relation is being partitioned, it can be written to multiple disks in
parallel.

Thus, the transfer rates for reading or writing an entire relation are much faster with I/O
parallelism than without it.

However, reading an entire relation is only one kind of access to data. Access to data
can be classified as follows :
1. Scanning the entire relation.
2. Locating a tuple associatively (for example, employee-name = "Campbell"); these
queries, called point queries, seek tuples that have a specified value for a specific
attribute.
3. Locating all tuples for which the value of a given attribute lies within a specified
range (for example, 10000 < salary < 20000); these queries are called range
queries.
The different partitioning techniques support these types of access at different levels of
efficiency :
1.4.2.1
Round-robin :
The scheme is ideally suited for applications that wish to read the entire relation
sequentially for each query. With this scheme, both point queries and range queries are
complicated to process, since each of the n disks must be used for the search.
1.4.2.2 Hash Partitioning :

This scheme is best suited for point queries based on the partitioning attribute.

For example, if a relation is partitioned on the telephone-number attribute, then we can
answer the query “Find the record of the employee with telephone-number = 555-3333”
by applying the partitioning hash function to 555-3333 and then searching that disk.
Directing a query to a single disk saves the startup cost of initiating a query on multiple
disks and leaves the other disks free to process other queries.

Hash partitioning is also useful for sequential scans of the entire relation.

If the hash function is a good randomizing function, and the partitioning attributes form
a key of the relation, then the number of tuples in each of the disks is approximately the
same, without much variance.

Hence, the time taken to scan the relation is approximately 1/n of the time required to
scan the relation in a single disk system.

The scheme, however, is not well suited for point queries on nonpartitioning attributes.

Hash-based partitioning is also not well suited for answering range queries, since,
typically, hash functions do not preserve proximity within a range. Therefore, all the
disks need to be scanned for range queries to be answered.
1.4.2.3
Range Partitioning :
 [ University Exam – Dec. 2009 !!! ]





This scheme is well suited for point and range queries on the partitioning attribute. For
point queries, we can consult the partitioning vector to locate the disk where the tuple
resides.
For range queries, we consult the partitioning vector to find the range of disks on which
the tuples may reside. In both cases, the search narrows to exactly those disks that might
have any tuples of interest.
An advantage of this feature is that, if there are only a few tuples in the queried range,
then the query is typically sent to one disk, as opposed to all the disks.
Since other disks can be used to answer other queries, range partitioning results in
higher throughput while maintaining good response time. On the other hand, if there are
many tuples in the queried range (as there are when the queried range is a larger fraction
of the domain of the relation), many tuples have to be retrieved from a few disks,
resulting in an I/O bottleneck (hot spot) at those disks.
In this example of execution skew, all processing occurs in one or only a few partitions.
In contrast, hash partitioning and round-robin partitioning would engage all the disks for
such queries, giving a faster response time for approximately the same throughput.

The type of partitioning also affects other relational operations, such as joins. Thus, the
choice of partitioning technique also depends on the operations that need to be executed.
In general, hash partitioning or range partitioning are preferred to round-robin
partitioning.

In a system with many disks, the number of disks across which to partition a relation can
be chosen in this way; if a relation contains only a few tuples that will fit into a single
disk block, then it is better to assign the relation to a single disk.

Large relations are preferably partitioned across all the available disks. If a relation
consists of m disk blocks and there are n disks available in the system, then the relation
should be allocated min(m, n) disks.
1.4.3 Handling of Skew :
 [ University Exam – Dec. 2006, May 2007, Dec. 2007 !!! ]
When a relation is partitioned (by a technique other than round-robin), there may be a
skew in the distribution of tuples, with a high percentage of tuples placed in some partitions
and fewer tuples in other partitions. The ways that skew may appear are classified as:
1.
Attribute-value skew
2.
Partition skew
1.4.3.1
Attribute-value Skew :

It refers to the fact that some values appear in the partitioning attributes of many tuples.
All the tuples with the same value for the partitioning attribute end up in the same
partition, resulting in skew. Partition skew refers to the fact that there may be load
imbalance in the partitioning, even when there is no attribute skew.

Attribute-value skew can result in skewed partitioning regardless of whether range
partitioning or hash partitioning is used. If the partition vector is not chosen carefully,
range partitioning may result in partition skew. Partition skew is less likely with hash
partitioning, if a good hash function is chosen.

Skew becomes an increasing problem with a higher degree of parallelism.

For example, if a relation of 1000 tuples is divided into 10 parts, and the division is
skewed, then there may be some partitions of size less than 100 and some partitions of
size more than 100.

If even one partition happens to be of size 200, the speedup that we would obtain by
accessing the partitions in parallel is only 5, instead of the 10 for which we would have
hoped.
If the same relation has to be partitioned into 100 parts, a partition will have 10 tuples
on average. If even one partition has 40 tuples (which is possible, given the large
number of partitions) the speedup that we would obtain by accessing them in parallel
would be 25, rather than 100. Thus the loss of speedup due to skew increases with
parallelism.
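The arithmetic behind these figures can be written out explicitly: the speedup achievable under skew is governed by the largest partition, so, for the numbers used above,
speedup = (total tuples) / (largest partition) = 1000 / 200 = 5 with 10 partitions, and 1000 / 40 = 25 with 100 partitions,
instead of the ideal speedups of 10 and 100 respectively.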
1.4.3.2 Balanced Range-Partitioning Vector :
A balanced range partitioning vector can be constructed by sorting the relation. The
relation is first sorted on the partitioning attributes.

The relation is then scanned in sorted order. After every 1/n of the relation has been read,
the value of the partitioning attribute of the next tuple is added to the partition vector.

Here, n denotes the number of partitions to be constructed. In case there are many tuples
with the same value for the partitioning attribute, the technique can still result in some
skew.
Disadvantage :
The extra I/O overhead incurred in doing the initial sort.
How is the I/O overhead avoided ?

The I/O overhead for constructing balanced range-partition vectors can be reduced by
constructing and storing a frequency table, or histogram, of the attribute values for each
attribute of each relation. Fig. 1.4 shows a histogram for an integer-valued attribute that
takes values in the range 1 to 25.

A histogram takes up only a little space, so histograms on several different attributes can
be stored in the system catalog.
Fig. 1.4 : Example of histogram

It is straightforward to construct a balanced range-partitioning function given a
histogram on the partitioning attributes.

If the histogram is not stored, it can be computed approximately by sampling the
relation, using only tuples from a randomly chosen subset of the disk blocks of the
relation.
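A minimal Python sketch of how a balanced range-partition vector might be derived from such a histogram; the bucket boundaries, counts and number of partitions below are invented and are not the values shown in Fig. 1.4.

# Sketch: derive a balanced range-partition vector from a histogram.
# Buckets, counts and the number of partitions are hypothetical.

buckets = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25)]   # attribute-value ranges
counts  = [50, 150, 120, 80, 100]                           # tuples per bucket
n_partitions = 4

total = sum(counts)
target = total / n_partitions          # tuples we would like per partition
vector, running = [], 0
for (lo, hi), c in zip(buckets, counts):
    running += c
    # Emit a boundary each time another 1/n of the relation has been covered.
    while running >= target * (len(vector) + 1) and len(vector) < n_partitions - 1:
        vector.append(hi)              # approximate boundary at the bucket's upper end

print(vector)   # [10, 15, 20] -> partitions: <= 10, 11-15, 16-20, > 20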
Another approach to minimize the effect of skew :

Another approach to minimize the effect of skew, particularly with range partitioning, is
to use virtual processors.

In the virtual processor approach, we pretend there are several times as many virtual
processors as there are real processors.

Any of the partitioning techniques and query evaluation techniques that we study later in
this chapter can be used, but they map tuples and work to virtual processors instead of
to real processors.

Virtual processors, in turn, are mapped to real processors, usually by round-robin
partitioning.

The idea is that even if one range had many more tuples than the others because of
skew, these tuples would get split across multiple virtual processor ranges.

Round robin allocation of virtual processors to real processors would distribute the extra
work among multiple real processors, so that one processor does not have to bear all
the burden.
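The mapping can be made concrete with a small sketch; the counts of virtual and real processors and the range boundaries below are arbitrary choices, not values from the text.

# Sketch: tuples are range-partitioned over virtual processors, and virtual
# processors are allocated to real processors by round-robin.
N_VIRTUAL, N_REAL = 16, 4              # arbitrary example sizes

def virtual_partition(key, boundaries):
    # Range-partition a key value over the virtual processors.
    for v, b in enumerate(boundaries):
        if key < b:
            return v
    return len(boundaries)

def real_processor(virtual_id):
    # Round-robin allocation of virtual processors to real processors.
    return virtual_id % N_REAL

boundaries = list(range(10, 10 * N_VIRTUAL, 10))    # 15 boundaries -> 16 ranges
for key in (3, 12, 27, 95, 141):
    v = virtual_partition(key, boundaries)
    print(key, "-> virtual processor", v, "-> real processor", real_processor(v))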
1.5
Parallelizing Individual Operations :
 [ University Exam – Dec. 2007 !!! ]

This section shows how various operations can be implemented in parallel in a shared-nothing
architecture.

We assume that each relation is horizontally partitioned across several disks, although
this partitioning may or may not be appropriate for a given query.

The evaluation of a query must take the initial partitioning criteria into account and
repartition if necessary. We consider the following techniques :
1.
Bulk loading and scanning
2.
Sorting
1.5.1 Bulk Loading and Scanning :

We begin with two simple operations: scanning a relation and loading a relation.

Pages can be read in parallel while scanning a relation, and the retrieved tuples can then
be merged, if the relation is partitioned across several disks.

More generally, the idea also applies when retrieving all tuples that meet a selection
condition. If hashing or range partitioning is used, selection queries can be answered by
going to just those processors that contain relevant tuples.

A similar observation holds for bulk loading. Further, if a relation has associated indexes,
any sorting of data entries required for building the indexes during bulk loading can
also be done in parallel.
1.5.2 Sorting :

A simple idea is to let each CPU sort the part of the relation that is on its local disk and
then merge these sorted sets of tuples. The degree of parallelism is likely to be limited by
the merging phase.

A better idea is to first redistribute all tuples in the relation using range partitioning.

For example, if we want to sort a collection of employee tuples by salary, salary values
range from 10 to 210, and we have 20 processors, we could send all tuples with salary
values in the range 10 to 20 to the first processor, all in the range 21 to 30 to the second
processor, and so on.

Each processor then sorts the tuples assigned to it, using some sequential sorting
algorithm.

For example, a processor can collect tuples until its memory is full, then sort these tuples
and write them out as a run, until all incoming tuples have been written to such sorted runs on the
local disk.

These runs can then be merged to create the sorted version of the set of tuples assigned to this
processor. The entire sorted relation can be retrieved by visiting the processors in an order
corresponding to the ranges assigned to them and simply scanning the tuples.
1.5.2.1
Splitting Vector :

The basic challenge in parallel sorting is to do the range partitioning so that each processor
receives roughly the same number of tuples; otherwise, a processor that receives a
disproportionately large number of tuples to sort becomes a bottleneck and limits the
scalability of the parallel sort.

One good approach to range partitioning is to obtain a sample of the entire relation by
taking a sample at each processor that initially contains part of the relation.

The (relatively small) sample is sorted and used to identify ranges with equal numbers of
tuples. This set of range values, called a splitting vector, is then distributed to all
processors and used to range-partition the entire relation.
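The following Python sketch simulates the whole sequence (sample, build a splitting vector, range-partition, sort locally, concatenate); the "processors" are simply list indexes, and the sample size and data are invented.

# Sketch of a range-partitioned parallel sort, simulated with plain lists.
import random

def parallel_range_sort(relation, key, n_procs, sample_size=64):
    # 1. Sample the relation and derive a splitting vector with
    #    n_procs - 1 boundaries that give roughly equal ranges.
    sample = sorted(random.sample(relation, min(sample_size, len(relation))), key=key)
    step = max(1, len(sample) // n_procs)
    splitting_vector = [key(sample[i]) for i in range(step, len(sample), step)][: n_procs - 1]

    # 2. Range-partition every tuple to a "processor".
    partitions = [[] for _ in range(n_procs)]
    for t in relation:
        i = sum(key(t) >= b for b in splitting_vector)   # index of the range holding t
        partitions[i].append(t)

    # 3. Each processor sorts its partition locally (data parallelism);
    #    the sorted partitions are simply concatenated in range order.
    return [t for part in partitions for t in sorted(part, key=key)]

emps = [{"salary": random.randint(10, 210)} for _ in range(1000)]
out = parallel_range_sort(emps, key=lambda t: t["salary"], n_procs=4)
assert out == sorted(emps, key=lambda t: t["salary"])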
1.5.2.2 Application of Sorting :
A particularly important application of parallel sorting is sorting the data entries in tree-structured indexes. Sorting data entries can significantly speed up the process of bulk-loading
an index.
1.6
Interquery Parallelism :
 [ University Exam – Dec. 2006, May 2007, Dec. 2009 !!! ]
Definition :

In interquery parallelism, different queries or transactions execute in parallel with one
another.

Transaction throughput can be increased by this form of parallelism.

However, the response times of individual transactions are no faster than they would be
if the transactions were run in isolation.
1.6.1 Working of Interquery Parallelism :

The primary use of interquery parallelism is to scale up a transaction-processing system
to support a larger number of transactions per second.

Interquery parallelism is the easiest form of parallelism to support in a database system,
particularly in a shared-memory parallel system.

Database systems designed for single-processor systems can be used with few or no
changes on a shared-memory parallel architecture.

Since even sequential database systems support concurrent processing, transactions that
would have operated in a time-shared concurrent manner on a sequential machine
operate in parallel in the shared-memory parallel architecture.

Supporting interquery parallelism is more complicated in a shared-disk or shared-nothing
architecture.

Processors have to perform some tasks, such as locking and logging, in a coordinated
fashion, and that requires that they pass messages to each other.

A parallel database system must also ensure that two processors do not update the same
data independently at the same time.

Further, when a processor accesses or updates data, the database system must ensure
that the processor has the latest version of the data in its buffer pool. The problem of
ensuring that the version is the latest is known as the cache-coherency problem.
1.6.2 Protocols used in Shared Disk System :
Various protocols are available to guarantee cache coherency; often, cache-coherency
protocols are integrated with concurrency-control protocols so that their overhead is reduced.
One such protocol for a shared-disk system is this :
1.
Before any read or write access to a page, a transaction locks the page in shared or
exclusive mode, as appropriate. Immediately after the transaction obtains either a shared
or exclusive lock on a page, it also reads the most recent copy of the page from the
shared disk.
2.
Before a transaction releases an exclusive lock on a page, it flushes the page to the
shared disk; then, it releases the lock. This protocol ensures that, when a transaction sets
a shared or exclusive lock on a page, it gets the correct copy of the page.
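A minimal sketch of this protocol, written as Python-style pseudocode; LockManager and SharedDisk are invented placeholder interfaces, not a real DBMS API.

# Sketch of the shared-disk cache-coherency protocol described above.
class Transaction:
    def __init__(self, lock_mgr, shared_disk):
        self.locks, self.disk = lock_mgr, shared_disk
        self.buffer_pool = {}                     # page_id -> page contents

    def access_page(self, page_id, mode):
        # mode is "shared" for reads, "exclusive" for writes.
        self.locks.acquire(page_id, mode)
        # Rule 1 : right after locking, read the most recent copy from the shared disk.
        self.buffer_pool[page_id] = self.disk.read(page_id)
        return self.buffer_pool[page_id]

    def release_exclusive(self, page_id):
        # Rule 2 : flush the page to the shared disk before releasing an exclusive lock.
        self.disk.write(page_id, self.buffer_pool[page_id])
        self.locks.release(page_id)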
1.6.3 Advantages of Complex Protocols :

More complex protocols avoid the repeated reading and writing to disk required by the
preceding protocol. Such protocols do not write pages to disk when exclusive locks are
released.

When a shared or exclusive lock is obtained, if the most recent version of a page is in the
buffer pool of some processor, the page is obtained from there.

The protocols have to be designed to handle concurrent requests.

The shared-disk protocols can be extended to shared-nothing architectures by this
scheme.
Each page has a home processor Pi and is stored on disk Di.


When other processors want to read or write the page, they send requests to the home
processor Pi of the page, since they cannot directly communicate with the disk. The
other actions are the same as in the shared-disk protocols.

The Oracle 8 and Oracle Rdb systems are examples of shared-disk parallel database
systems that support interquery parallelism.
1.7
Intraquery Parallelism :
 [ University Exam – Dec. 2006, May 2007 !!! ]
Definition : Intraquery parallelism refers to the execution of a single query in parallel
on multiple processors and disks. Using intraquery parallelism is important for speeding up
long-running queries. Interquery parallelism does not help in this task, since each query is
run sequentially.
1.7.1 Working of Intraquery Parallelism :

To illustrate the parallel evaluation of a query, consider a query that requires a relation
to be sorted. Suppose that the relation has been partitioned across multiple disks by
range partitioning on some attribute, and the sort is requested on the partitioning
attribute.

The sort operation can be implemented by sorting each partition in parallel, then
concatenating the sorted partitions to get the final sorted relation.

Thus, we can parallelize a query by parallelizing individual operations. There is another
source of parallelism in evaluating a query: The operator tree for a query can contain
multiple operations.

We can parallelize the evaluation of the operator tree by evaluating in parallel some of
the operations that do not depend on one another. We may be able to pipeline the output
of one operation to another operation.

The two operations can be executed in parallel on separate processors, one generating
output that is consumed by the other, even as it is generated.
In summary, the execution of a single query can be parallelized in two ways :
1.
Intraoperation parallelism : We can speed up processing of a query by
parallelizing the execution of each individual operation, such as sort, select,
project, and join.
2.
Interoperation parallelism : We can speed up processing of a query by executing
in parallel the different operations in a query expression.
1.7.2 Importance of Parallelism :

The two forms of parallelism are complementary and can be used simultaneously on a
query. Since the number of operations in a typical query is small, compared to the
number of tuples processed by each operation, the first form of parallelism can scale
better with increasing parallelism. However, with the relatively small number of
processors in typical parallel systems today, both forms of parallelism are important.

In the following discussion of parallelization of queries, we assume that the queries are
read only.

The choice of algorithms for parallelizing query evaluation depends on the machine
architecture. Rather than presenting algorithms for each architecture separately, we use a
shared-nothing architecture model in our description. Thus, we explicitly
describe when data have to be transferred from one processor to another. We can
simulate this model easily by using the other architectures, since transfer of data can be
done via shared memory in a shared-memory architecture, and via shared disks in a
shared-disk architecture. Hence, algorithms for shared-nothing architectures can be used
on the other architectures too. We mention occasionally how the algorithms can be
further optimized for shared-memory or shared-disk systems.

To simplify the presentation of the algorithms, assume that there are n processors, P0, P1,
..., Pn – 1, and n disks D0, D1, ..., Dn – 1, where disk Di is associated with processor Pi.
A real system may have multiple disks per processor. It is not hard to extend the
algorithms to allow multiple disks per processor. We simply allow Di to be a set of
disks. However, for simplicity, we assume here that Di is a single disk.
1.8
Intraoperation Parallelism :
 [ University Exam – Dec. 2007 !!! ]

Since relational operations work on relations containing large sets of tuples, we can
parallelize the operations by executing them in parallel on different subsets of the
relations.

Since the number of tuples in a relation can be large, the degree of parallelism is
potentially enormous.

Thus, intraoperation parallelism is natural in a database system.
1.8.1 Parallel Sort :
 [ University Exam – May 2007, Dec. 2007 !!! ]

Suppose that we wish to sort a relation that resides on n disks D0, D1, ..., Dn – 1.

If the relation has been range partitioned on the attributes on which it is to be sorted,
then we can sort each partition separately, and can concatenate the results to get the full
sorted relation.

Since the tuples are partitioned on n disks, the time required for reading the entire
relation is reduced by the parallel access.

If the relation has been partitioned in any other way, we can sort it in one of two ways :
1.
We can range partition it on the sort attributes, and then sort each partition
separately.
2.
We can use a parallel version of the external sort-merge algorithm.
1.8.1.1
Range-Partitioning Sort :

Range-partitioning sort works in two steps : first range-partitioning the relation, then
sorting each partition separately.

When we sort by range partitioning the relation, it is not necessary to range-partition the
relation on the same set of processors or disks as those on which that relation is stored.
Suppose that we choose processors P0, P1, ..., Pm, where m < n, to sort the relation.
There are two steps involved in this operation:
1.
Redistribute the tuples in the relation, using a range-partition strategy, so that all
tuples that lie within the ith range are sent to processor Pi, which stores the relation
temporarily on disk Di.
To implement range partitioning, every processor reads the tuples from its disk in
parallel and sends the tuples to their destination processors. Each processor
P0, P1, ..., Pm also receives tuples belonging to its partition, and stores them locally.
This step requires disk I/O and communication overhead.
2.
Each of the processors sorts its partition of the relation locally, without interaction
with the other processors. Each processor executes the same operation namely,
sorting on a different data set. (Execution of the same operation in parallel on
different sets of data is called data parallelism.)
The final merge operation is trivial, because the range partitioning in the first phase
ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in
Pj.
We must do range partitioning with a good range-partition vector, so that each partition
will have approximately the same number of tuples. Virtual processor partitioning can also be
used to reduce skew.
1.8.1.2
Parallel External Sort Merge :
 [ University Exam – Dec. 2009 !!! ]
Parallel external sort-merge is an alternative to range partitioning. Suppose that a
relation has already been partitioned among disks D0, D1, ..., Dn – 1 (it does not matter how the
relation has been partitioned). Parallel external sort-merge then works this way :
1.
Each processor Pi locally sorts the data on disk Di.
2.
The system then merges the sorted runs on each processor to get the final sorted output.
The merging of the sorted runs in step 2 can be parallelized by this sequence of actions :
1.
The system range-partitions the sorted partitions at each processor Pi (all by the same
partition vector) across the processors P0, P1, ..., Pm – 1. It sends the tuples in sorted order,
so that each processor receives the tuples in sorted streams.
2.
Each processor Pi performs a merge on the streams as they are received, to get a single
sorted run.
3.
The system concatenates the sorted runs on processors P0, P1, ..., Pm – 1 to get the final
result.

As described, this sequence of actions results in an interesting form of execution
skew, since at first every processor sends all blocks of partition 0 to P0, then every
processor sends all blocks of partition 1 to P1, and so on.

Thus, while sending happens in parallel, receiving tuples becomes sequential: first only
P0 receives tuples, then only P1 receives tuples, and so on.

To avoid this problem, each processor repeatedly sends a block of data to each partition.
In other words, each processor sends the first block of every partition, then sends the second
block of every partition, and so on. As a result, all processors receive data in parallel.

Some machines, such as the Teradata DBC series machines, use specialized hardware to
perform merging.

The Y-net interconnection network in the Teradata DBC machines can merge output
from multiple processors to give a single sorted output.
1.8.2 Parallel Join :
 [ University Exam – Dec. 2006, May 2007, Dec. 2007 !!! ]

The join operation requires that the system test pairs of tuples to see whether they satisfy
the join condition; if they do, the system adds the pair to the join output. Parallel join
algorithms attempt to split the pairs to be tested over several processors.

Each processor then computes part of the join locally.

Then, the system collects the results from each processor to produce the final result.
1.8.2.1


Partitioned Join :
For certain kinds of joins, such as equijoins and natural joins, it is possible to partition
the two input relations across the processors, and to compute the join locally at each
processor. Suppose that we are using n processors, and that the relations to be joined are
r and s. Partitioned join then works this way :
The system partitions the relations r and s each into n partitions, denoted r0, r1, ..., rn – 1
and s0, s1, ..., sn – 1. The system sends partitions ri and si to processor Pi, where their join is
computed locally.

The partitioned join technique works correctly only if the join is an equijoin and if we
partition r and s by the same partitioning function on their join attributes.

The idea of partitioning is exactly the same as that behind the partitioning step of hash-join.

In a partitioned join, however, there are two different ways of partitioning r and s :
1.
Range partitioning on the join attributes
2.
Hash partitioning on the join attributes

In either case, the same partitioning function must be used for both relations. For range
partitioning, the same partition vector must be used for both relations.

For hash partitioning, the same hash function must be used on both relations. Fig. 1.5
depicts the partitioning in a partitioned parallel join.

Once the relations are partitioned, we can use any join technique locally at each
processor Pi to compute the join of ri and si.

For example, hash-join, merge-join, or nested-loop join could be used. Thus, we can use
partitioning to parallelize any join technique.
Fig. 1.5 : Partitioned parallel join
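A small Python simulation of a partitioned parallel join for an equijoin, using hash partitioning on the join attribute; the relations, attribute names and processor count are invented, and the local join here happens to be a hash join, though any local join technique would do.

# Sketch: partitioned parallel (equi)join; "processors" are list indexes.
N = 4   # number of processors

def partition(relation, attr):
    # Hash-partition a relation on the join attribute across the N processors.
    parts = [[] for _ in range(N)]
    for t in relation:
        parts[hash(t[attr]) % N].append(t)
    return parts

def local_hash_join(ri, si, attr):
    # Any local join technique works; a simple hash join is used here.
    index = {}
    for t in si:
        index.setdefault(t[attr], []).append(t)
    return [{**tr, **ts} for tr in ri for ts in index.get(tr[attr], [])]

r = [{"dept_id": i % 3, "emp": "e%d" % i} for i in range(6)]
s = [{"dept_id": d, "dept": "d%d" % d} for d in range(3)]

r_parts, s_parts = partition(r, "dept_id"), partition(s, "dept_id")
result = []
for i in range(N):                      # each processor joins its own partitions
    result += local_hash_join(r_parts[i], s_parts[i], "dept_id")
print(len(result))                      # 6 joined tuples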

If one or both of the relations r and s are already partitioned on the join attributes (by
either hash partitioning or range partitioning), the work needed for partitioning is reduced
greatly. If the relations are not partitioned, or are partitioned on attributes other than the
join attributes, then the tuples need to be partitioned.

Each processor Pi reads in the tuples on disk Di , computes for each tuple t the partition
j to which t belongs, and sends tuple t to processor Pj. Processor Pj stores the tuples on
disk Dj.

We can optimize the join algorithm used locally at each processor to reduce I/O by
buffering some of the tuples to memory, instead of writing them to disk.

Skew presents a special problem when range partitioning is used, since a partition vector
that splits one relation of the join into equal-sized partitions may split the other relation
into partitions of widely varying size.

The partition vector should be such that | ri | + | si | (that is, the sum of the sizes of ri and
si) is roughly equal over all i = 0, 1, ..., n – 1. With a good hash function, hash
partitioning is likely to have a smaller skew, except when there are many tuples with the
same values for the join attributes.
1.8.2.2



Fragment and Replicate Join :
Partitioning is not applicable to all types of joins. For instance, if the join condition is an
inequality, such as r ⋈ s with join condition r.a < s.b, it is possible that all tuples in r join with some
tuple in s (and vice versa).
Thus, there may be no easy way of partitioning r and s so that tuples in partition ri join
with only tuples in partition si.
We can parallelize such joins by using a technique called fragment and replicate. We
first consider a special case of fragment and replicate join as :
Asymmetric fragment and replicate join :
It works as follows :
1.
The system partitions one of the relations, say r. Any partitioning technique can be used
on r, including round-robin partitioning.
2. The system replicates the other relation, s, across all the processors.
3. Processor Pi then locally computes the join of ri with all of s, using any join technique.
The asymmetric fragment-and-replicate scheme appears in Fig. 1.6(a). If r is already
stored by partitioning, there is no need to partition it further in step 1. All that is required is to
replicate s across all processors.
Fig. 1.6(a) : Asymmetric fragment and replicate
Fragment and replicate join :
The general case of fragment and replicate join appears in Fig. 1.6 (b), it works as
follows :

The system partitions relation r into n partitions r0, r1, ..., rn – 1 and partitions s into m
partitions, s0, s1, ..., sm – 1. As before, any partitioning technique may be used on r and on s.

The values of m and n do not need to be equal, but they must be chosen so that there are
at least m * n processors.

Asymmetric fragment and replicate is simply a special case of general fragment and
replicate, where m = 1. Fragment and replicate reduces the sizes of the relations at each
processor, compared to asymmetric fragment and replicate.
Fig. 1.6(b) : Fragment and replicate schemes

Fragment and replicate works with any join condition, since every tuple in r can be
tested with every tuple in s. Thus, it can be used where partitioning cannot be.

Fragment and replicate usually has a higher cost than partitioning when both relations
are of roughly the same size, since at least one of the relations has to be replicated.

However, if one of the relations, say s, is small, it may be cheaper to replicate s across all
processors, rather than to repartition r and s on the join attributes. In such a case,
asymmetric fragment and replicate is preferable, even though partitioning could be used.
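A minimal sketch of asymmetric fragment and replicate for a non-equijoin; the relations, the inequality condition and the processor count are invented for illustration.

# Sketch: asymmetric fragment-and-replicate join for an inequality condition.
# r keeps its existing partitioning; the small relation s is replicated everywhere.
N = 3   # number of processors

r_partitions = [                      # r already partitioned somehow (hypothetical)
    [{"a": 1}, {"a": 7}],
    [{"a": 4}, {"a": 9}],
    [{"a": 2}],
]
s = [{"b": 5}, {"b": 8}]              # small relation, replicated to all processors

def local_join(ri, s_copy):
    # Any join condition works; here the inequality r.a < s.b.
    return [(tr, ts) for tr in ri for ts in s_copy if tr["a"] < ts["b"]]

result = []
for i in range(N):                    # each processor joins its fragment of r with all of s
    result += local_join(r_partitions[i], s)
print(len(result))                    # 7 result pairs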
1.8.2.3
Partitioned Parallel Hash-Join :
The partitioned hash-join can be parallelized. Suppose that we have n processors, P0,
P1, ..., Pn – 1, and two relations r and s, such that the relations r and s are partitioned across
multiple disks. If the size of s is less than that of r, the parallel hash-join algorithm proceeds
this way :
1.
Choose a hash function, say h1, that takes the join attribute value of each tuple in r and s
and maps the tuple to one of the n processors. Let ri denote the tuples of relation r that
are mapped to processor Pi; similarly, let si denote the tuples of relation s that are
mapped to processor Pi. Each processor Pi reads the tuples of s that are on its disk Di and
sends each tuple to the appropriate processor on the basis of hash function
h1.
2.
As the destination processor Pi receives the tuples of si, it further partitions them by
another hash function, h2, which the processor uses to compute the hash-join locally.
The partitioning at this stage is exactly the same as in the partitioning phase of the
sequential hash-join algorithm. Each processor Pi executes this step independently from
the other processors.
3.
Once the tuples of s have been distributed, the system redistributes the larger relation
r across the n processors by the hash function h1, in the same way as before. As it
receives each tuple, the destination processor repartitions it by the function h2, just as
the probe relation is partitioned in the sequential hash-join algorithm.
4.
Each processor Pi executes the build and probe phases of the hash-join algorithm on the
local partitions ri and si of r and s to produce a partition of the final result of the hash-join.
The hash-join at each processor is independent of that at other processors, and receiving
the tuples of ri and si is similar to reading them from disk. We can use hybrid hash join to
cache some of the incoming tuples in memory, thus avoiding the cost of writing them and of reading
them back in.
1.8.2.4
Parallel Nested-Loop Join :

To illustrate the use of fragment and replicate-based parallelization, consider the case
where the relation s is much smaller than relation r.

Suppose that relation r is stored by partitioning; the attribute on which it is partitioned
does not matter. Suppose too that there is an index on a join attribute of relation r at each
of the partitions of relation r.

We use asymmetric fragment and replicate, with relation s being replicated and with the
existing partitioning of relation r.

Each processor Pj at which a partition of relation s is stored reads the tuples of relation s
stored on disk Dj, and replicates the tuples to every other processor Pi. At the end of this
phase, relation s is replicated at all sites that store tuples of relation r.

Now, each processor Pi performs an indexed nested-loop join of relation s with the ith
partition of relation r. We can overlap the indexed nested-loop join with the distribution
of tuples of relation s, to reduce the costs of writing the tuples of relation s to disk, and
of reading them back.
However, the replication of relation s must be synchronized with the join so that there is
enough space in the in-memory buffers at each processor Pi to hold the tuples of relation
s that have been received but that have not yet been used in the join.
1.8.3 Other Relational Operations :
 [ University Exam – Dec. 2006 !!! ]
The evaluation of other relational operations also can be parallelized :
Selection :

Let the selection be σθ(r). Consider first the case where θ is of the form ai = v, where ai
is an attribute and v is a value. If the relation r is partitioned on ai, the selection proceeds
at a single processor.

If θ is of the form l <= ai <= u (that is, θ is a range selection) and the relation has been
range-partitioned on ai, then the selection proceeds at each processor whose partition
overlaps with the specified range of values. In all other cases, the selection proceeds in
parallel at all the processors.
Duplicate elimination :

Duplicates can be eliminated by sorting; either of the parallel sort techniques can be
used, optimized to eliminate duplicates as soon as they appear during sorting.

We can also parallelize duplicate elimination by partitioning the tuples (by either range
or hash partitioning) and eliminating duplicates locally at each processor.
Projection :

Projection without duplicate elimination can be performed as tuples are read in from
disk in parallel.

If duplicates are to be eliminated, either of the techniques just described can be used.
Aggregation :

Consider an aggregation operation. We can parallelize the operation by partitioning the
relation on the grouping attributes, and then computing the aggregate values locally at
each processor. Either hash partitioning or range partitioning can be used.

If the relation is already partitioned on the grouping attributes, the first step can be
skipped.

We can reduce the cost of transferring tuples during partitioning by partly computing
aggregate values before partitioning, at least for the commonly used aggregate
functions.

Consider an aggregation operation on a relation r, using the sum aggregate function on
attribute B, with grouping on attribute A. The system can perform the operation at each
processor Pi on those r tuples stored on disk Di.

This computation results in tuples with partial sums at each processor; there is one tuple
at Pi for each value for attribute A present in r tuples stored on Di.

The system partitions the result of the local aggregation on the grouping attribute A, and
performs the aggregation again (on tuples with the partial sums) at each processor Pi to
get the final result.

As a result of this optimization, fewer tuples need to be sent to other processors during
partitioning.
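A small Python sketch of this optimization (local partial sums, then repartitioning on the grouping attribute and a second aggregation); the per-disk fragments of r and the attribute names are made up.

# Sketch: parallel sum(B) grouped by A, with local partial aggregation first.
r_on_disk = [   # tuples of r as stored on disks D0, D1, D2 (hypothetical data)
    [{"A": "x", "B": 1}, {"A": "y", "B": 2}, {"A": "x", "B": 3}],
    [{"A": "y", "B": 4}, {"A": "z", "B": 5}],
    [{"A": "x", "B": 6}, {"A": "z", "B": 7}],
]
N = len(r_on_disk)

# Step 1 : each processor Pi computes partial sums over its own tuples.
partials = []
for fragment in r_on_disk:
    local = {}
    for t in fragment:
        local[t["A"]] = local.get(t["A"], 0) + t["B"]
    partials.append(local)

# Step 2 : partition the partial results on the grouping attribute A
# (hash partitioning here) and aggregate again at the destination processor.
final = [{} for _ in range(N)]
for local in partials:
    for a, partial_sum in local.items():
        dest = hash(a) % N
        final[dest][a] = final[dest].get(a, 0) + partial_sum
print(final)   # each group's total appears at exactly one processor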
1.8.4 Cost of Parallel Evaluation of Operations :
 [ University Exam – May 2007 !!! ]

We achieve parallelism by partitioning the I/O among multiple disks, and partitioning
the CPU work among multiple processors.

If such a split is achieved without any overhead, and if there is no skew in the splitting
of work, a parallel operation using n processors will take 1/n times as long as the same
operation on a single processor.

We already know how to estimate the cost of an operation such as a join or a selection.
The time cost of parallel processing would then be 1/n of the time cost of sequential
processing of the operation.
We must also account for the following costs :
o Startup costs for initiating the operation at multiple processors.

o Skew in the distribution of work among the processors, with some processors
getting a larger number of tuples than others.
o Contention for resources—such as memory, disk, and the communication
network—resulting in delays.
o Cost of assembling the final result by transmitting partial results from each
processor.
The time taken by a parallel operation can be estimated as :
Tpart + Tasm + max(T0, T1, ..., Tn – 1)
where
Tpart is the time for partitioning the relations,
Tasm is the time for assembling the results, and
Ti is the time taken for the operation at processor Pi.
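As a purely illustrative calculation (all times below are invented), suppose partitioning takes 2 seconds, assembling the result takes 1 second, and skew makes the per-processor times on three processors T0 = 10, T1 = 12 and T2 = 20 seconds :
estimated time = Tpart + Tasm + max(T0, T1, T2) = 2 + 1 + max(10, 12, 20) = 23 seconds.
With the same total work split evenly (14 seconds at every processor), the estimate would be 2 + 1 + 14 = 17 seconds, illustrating that the slowest step governs the query as a whole.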

Assuming that the tuples are distributed without any skew, the number of tuples sent to
each processor can be estimated as 1/n of the total number of tuples.

The preceding estimate will be an optimistic estimate, since skew is common. Even
though breaking down a single query into a number of parallel steps reduces the size of
the average step, it is the time for processing the single slowest step that determines the
time taken for processing the query as a whole.

A partitioned parallel evaluation, for instance, is only as fast as the slowest of the
parallel executions.

Thus, any skew in the distribution of the work across processors greatly affects
performance.

The problem of skew in partitioning is closely related to the problem of partition
overflow in sequential hash-joins.

We can use overflow resolution and avoidance techniques developed for hash-joins to
handle skew when hash partitioning is used. We can use balanced range partitioning and
virtual processor partitioning to minimize skew due to range partitioning.
1.9
Interoperation Parallelism :
 [ University Exam – Dec. 2007, Dec. 2009 !!! ]
There are two forms of interoperation parallelism :
1.
Pipelined parallelism
2.
Independent parallelism.
1.9.1 Pipelined Parallelism :

Pipelining forms an important source of economy of computation for database query
processing. In pipelining, the output tuples of one operation, A, are consumed by a
second operation, B, even before the first operation has produced the entire set of tuples
in its output.

The major advantage of pipelined execution in a sequential evaluation is that we can
carry out a sequence of such operations without writing any of the intermediate results to
disk.

Parallel systems use pipelining primarily for the same reason that sequential systems do.

However, pipelines are a source of parallelism as well, in the same way that instruction
pipelines are a source of parallelism in hardware design.

It is possible to run operations A and B simultaneously on different processors, so that
B consumes tuples in parallel with A producing them. This form of parallelism is called
pipelined parallelism.
Consider a join of four relations :
r1 ⋈ r2 ⋈ r3 ⋈ r4

We can set up a pipeline that allows the three joins to be computed in parallel. Suppose
processor P1 is assigned the computation of temp1 ← r1 ⋈ r2 and P2 is assigned the
computation of r3 ⋈ temp1.

As P1 computes tuples in r1 ⋈ r2, it makes these tuples available to processor P2.
Thus, P2 has available to it some of the tuples in r1 ⋈ r2 before P1 has finished its
computation. P2 can use those tuples that are available to begin computation of
temp1 ⋈ r3, even before r1 ⋈ r2 is fully computed by P1.

Likewise, as P2 computes tuples in (r1 ⋈ r2) ⋈ r3, it makes these tuples available to P3, which computes the join of these tuples with r4.
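The same pipeline can be sketched in a few lines of Python. This is only an illustration with toy relations and a naive nested-loop join, not the join algorithm of any particular system; the producer thread plays the role of P1 and the consumer thread the role of P2, connected by a queue that carries tuples as soon as they are produced.

# Minimal sketch of pipelined parallelism (toy data, hypothetical attribute "k").
import threading
import queue

SENTINEL = object()   # marks the end of the tuple stream

def produce_r1_join_r2(r1, r2, out_q):
    """Processor P1 : computes r1 join r2 and streams each result tuple to P2."""
    for a in r1:
        for b in r2:
            if a["k"] == b["k"]:
                out_q.put({**a, **b})
    out_q.put(SENTINEL)

def consume_and_join_r3(in_q, r3, results):
    """Processor P2 : joins incoming tuples with r3 as they arrive."""
    while True:
        t = in_q.get()
        if t is SENTINEL:
            break
        for c in r3:
            if t["k"] == c["k"]:
                results.append({**t, **c})

r1 = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
r2 = [{"k": 1, "b": "p"}, {"k": 2, "b": "q"}]
r3 = [{"k": 1, "c": "m"}, {"k": 2, "c": "n"}]

q12, results = queue.Queue(), []
p1 = threading.Thread(target=produce_r1_join_r2, args=(r1, r2, q12))
p2 = threading.Thread(target=consume_and_join_r3, args=(q12, r3, results))
p1.start(); p2.start()
p1.join(); p2.join()
print(results)        # tuples of r1 join r2 join r3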

Pipelined parallelism is useful with a small number of processors, but does not scale up well.

First, pipeline chains generally don’t attain sufficient length to provide a high degree of
parallelism.

Second, it is not possible to pipeline relational operators that do not produce output until
all inputs have been accessed, such as the set-difference operation.

Third, only marginal speedup is obtained for the frequent cases in which one operator's execution cost is much higher than those of the others.

All things considered, when the degree of parallelism is high, pipelining is a less important source of parallelism than partitioning.

The real reason for using pipelining is that pipelined executions can avoid writing
intermediate results to disk.
1.9.2 Independent Parallelism :
Operations in a query expression that do not depend on one another can be executed in
parallel. This form of parallelism is called independent parallelism.
Consider the join r1 ⋈ r2 ⋈ r3 ⋈ r4. Clearly, we can compute temp1 ← r1 ⋈ r2 in parallel with temp2 ← r3 ⋈ r4. When these two computations complete, we compute temp1 ⋈ temp2.
To obtain further parallelism, we can pipeline the tuples in temp1 and temp2 into the computation of temp1 ⋈ temp2, which is itself carried out by a pipelined join.
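A minimal sketch of this idea, again with hypothetical toy relations : the two independent joins are submitted to separate workers, and only the final join waits for both results.

# Illustrative sketch of independent parallelism; the relations and the simple
# nested-loop join below are assumptions made purely for the example.
from concurrent.futures import ThreadPoolExecutor

def nl_join(r, s, key):
    """Naive nested-loop equi-join on a common attribute name."""
    return [{**a, **b} for a in r for b in s if a[key] == b[key]]

r1 = [{"k": 1, "a": 1}]; r2 = [{"k": 1, "b": 2}]
r3 = [{"k": 1, "c": 3}]; r4 = [{"k": 1, "d": 4}]

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(nl_join, r1, r2, "k")    # temp1 = r1 join r2
    f2 = pool.submit(nl_join, r3, r4, "k")    # temp2 = r3 join r4 (independent)
    temp1, temp2 = f1.result(), f2.result()

print(nl_join(temp1, temp2, "k"))             # temp1 join temp2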
Like pipelined parallelism, independent parallelism does not provide a high degree of
parallelism and is less useful in a highly parallel system, although it is useful with a lower
degree of parallelism.
1.9.3 Query Optimization :
 [ University Exam – Dec. 2006 !!! ]
Query optimizers account in large measure for the success of relational technology.
Recall that a query optimizer takes a query and finds the cheapest execution plan among the
many possible execution plans that give the same answer.
Query optimizers for parallel query evaluation are more complicated than query optimizers for sequential query evaluation. First, the cost models are more complicated, since partitioning costs have to be accounted for, and issues such as skew and resource contention must be taken into account. More important is the issue of how to parallelize a query.
To evaluate an operator tree in a parallel system, we must make the following
decisions :

How to parallelize each operation, and how many processors to use for it.

What operations to pipeline across different processors, what operations to execute
independently in parallel, and what operations to execute sequentially, one after the
other.
These decisions constitute the task of scheduling the execution tree.

Determining the resources of each kind such as processors, disks, and memory that
should be allocated to each operation in the tree is another aspect of the optimization
problem. For instance, it may appear wise to use the maximum amount of parallelism available, but it is a good idea not to execute certain operations in parallel.

Operations whose computational requirements are significantly smaller than the
communication overhead should be clustered with one of their neighbors. Otherwise, the
advantage of parallelism is negated by the overhead of communication.

One concern is that long pipelines do not lend themselves to good resource utilization. Unless the operations are coarse grained, the final operation of the pipeline may wait for
a long time to get inputs, while holding precious resources, such as memory. Hence,
long pipelines should be avoided.
The number of parallel evaluation plans from which to choose is much larger than the
number of sequential evaluation plans. Optimizing parallel queries by considering all
alternatives is therefore much more expensive than optimizing sequential queries. Hence, we
usually adopt heuristic approaches to reduce the number of parallel execution plans that we
have to consider. We describe two popular heuristics here.
1.9.3.1 Types of Heuristics for Evaluation :

The first heuristic is to consider only evaluation plans that parallelize every operation
across all processors, and that do not use any pipelining. This approach is used in the Teradata DBC series. Finding the best such execution plan is like doing query optimization
in a sequential system. The main differences lie in how the partitioning is performed
and what cost-estimation formula is used.

The second heuristic is to choose the most efficient sequential evaluation plan, and then
to parallelize the operations in that evaluation plan. The Volcano parallel database
popularized a model of parallelization called the exchange-operator model. This model
uses existing implementations of operations, operating on local copies of data, coupled
with an exchange operation that moves data around between different processors.
Exchange operators can be introduced into an evaluation plan to transform it into a parallel evaluation plan.

Another dimension of optimization is the design of physical-storage organization to
speed up queries. The optimal physical organization differs for different queries. The
database administrator must choose a physical organization that appears to be good for
the expected mix of database queries. Thus, the area of parallel query optimization is
complex, and it is still an area of active research.
1.10 Design of Parallel Systems :
 [ University Exam – Dec. 2006, May 2007 !!! ]

We have studied the concepts of data storage and query processing. Since large-scale
parallel database systems are used primarily for storing large volumes of data and for
processing decision-support queries on those data, these topics are the most important in
a parallel database system.

Parallel loading of data from external sources is an important requirement if we are to handle large volumes of incoming data.

A large parallel database system should focus on the following availability issues :
o Resilience to failure of some processors or disks
o Online reorganization of data and schema changes
1.10.1 Resilience to Failure of Some Processors or Disks :
Failure rate :

With a large number of processors and disks, the probability that at least one processor
or disk will malfunction is significantly greater than in a single-processor system with
one disk.

A poorly designed parallel system will stop functioning if any component (processor or
disk) fails.

Assuming that the probability of failure of a single processor or disk is small, the
probability of failure of the system goes up linearly with the number of processors and
disks.

If a single processor or disk fails once every 5 years, a system with 100 processors would have a failure every 18 days.
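The arithmetic behind this figure is easy to verify, assuming that component failures are independent and that the system stops whenever any single component fails :

# Back-of-the-envelope check of the failure-rate claim (illustrative only).
single_component_mtbf_days = 5 * 365      # one failure every 5 years
n_components = 100                        # processors and disks
system_mtbf_days = single_component_mtbf_days / n_components
print(round(system_mtbf_days, 1))         # about 18 days between system failures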
For example

Large-scale parallel database systems, such as Compaq Himalaya, Teradata, and Informix XPS (now a division of IBM), are designed to operate even if a processor or disk fails.

Data are replicated across at least two processors. If a processor fails, the data that it
stored can still be accessed from the other processors.

The system keeps track of failed processors and distributes the work among functioning
processors.

Requests for data stored at the failed site are automatically routed to the backup sites
that store a replica of the data. If all the data of a processor A are replicated at a single
processor B, B will have to handle all the requests to A as well as those to itself, and that
will result in B becoming a bottleneck. Therefore, the replicas of the data of a processor
are partitioned across multiple other processors.
1.10.2 Problems of Large Databases :

When we are dealing with large volumes of data (ranging into the terabytes), simple operations, such as creating indices, and changes to schema, such as adding a column to a relation, can take a long time, perhaps hours or even days.

Therefore, it is unacceptable for the database system to be unavailable while such
operations are in progress.

Many parallel database systems, such as the Compaq Himalaya systems, allow such
operations to be performed online, that is, while the system is executing other
transactions.
1.10.3 Online Index Construction :

Online index construction is a feature that allows insertions, deletions, and updates on a relation even while an index is being built on that relation.

The index-building operation therefore cannot lock the entire relation in shared mode, as
it would have done otherwise.

Instead, the process keeps track of updates that occur while it is active, and incorporates
the changes into the index being constructed.
1.11 Effective Guidelines :
Parallel database machine architectures have evolved from the use of exotic hardware to
a software parallel dataflow architecture based on conventional shared-nothing hardware.
These new designs provide impressive speedup and scaleup when processing relational
database queries.
Introduction :

Highly parallel database systems are beginning to displace traditional mainframe
computers for the largest database and transaction processing tasks.

Ten years ago the future of highly-parallel database machines seemed gloomy, even to
their staunchest advocates.

Most database machine research had focused on specialized, often trendy, hardware
such as CCD memories, bubble memories, head-per-track disks, and optical disks.

None of these technologies fulfilled their promises; so there was a sense that
conventional CPUs, electronic RAM, and moving-head magnetic disks would dominate
the scene for many years to come.

At that time, disk throughput was predicted to double while processor speeds were
predicted to increase by much larger factors.

Consequently, critics predicted that multi-processor systems would soon be I/O limited
unless a solution to the I/O bottleneck were found.
While these predictions were fairly accurate about the future of hardware, the critics
were certainly wrong about the overall future of parallel database systems. Over the last
decade Teradata, Tandem, and a host of startup companies have successfully developed and
marketed highly parallel database machines.
Why have parallel database systems become more than a research curiosity ? One
explanation is the widespread adoption of the relational data model. In 1983 relational
database systems were just appearing in the marketplace; today they dominate it.
Relational queries are ideally suited to parallel execution; they consist of uniform
operations applied to uniform streams of data.
Each operator produces a new relation, so the operators can be composed into highly
parallel dataflow graphs.
By streaming the output of one operator into the input of another operator, the two
operators can work in series giving pipelined parallelism. By partitioning the input data
among multiple processors and memories, an operator can often be split into many
independent operators each working on a part of the data. This partitioned data and
execution gives partitioned parallelism (Fig. 1.7).
The dataflow approach to database system design needs a message-based client-server
operating system to interconnect the parallel processes executing the relational
operators.
This in turn requires a high-speed network to interconnect the parallel processors. Such
facilities seemed exotic a decade ago, but now they are the mainstream of computer
architecture. The client-server paradigm using high-speed LANs is the basis for most
PC, workstation, and workgroup software. Those same client-server mechanisms are an
excellent basis for distributed database technology.
Fig. 1.7

Mainframe designers have found it difficult to build machines powerful enough to meet
the CPU and I/O demands of relational databases serving large numbers of simultaneous
users or searching terabyte databases.

Meanwhile, multi-processors based on fast and inexpensive microprocessors have
become widely available from vendors including Encore, Intel, NCR, nCUBE, Sequent,
Tandem, Teradata, and Thinking Machines.

These machines provide more total power than their mainframe counterparts at a lower
price.

Their modular architectures enable systems to grow incrementally, adding MIPS, memory, and disks either to speed up the processing of a given job, or to scale up the system to process a larger job in the same time. In retrospect, special-purpose database machines have indeed failed, but parallel database systems are a big success.

The successful parallel database systems are built from conventional processors,
memories, and disks.

They have emerged as major consumers of highly parallel architectures, and are in an
excellent position to exploit massive numbers of fast-cheap commodity disks,
processors, and memories promised by current technology forecasts.

A consensus on parallel and distributed database system architecture has emerged.

This architecture is based on a shared-nothing hardware design [STON86] in which
processors communicate with one another only by sending messages via an
interconnection network.

In such systems, tuples of each relation in the database are partitioned (declustered)
across disk storage units attached directly to each processor. Partitioning allows multiple
processors to scan large relations in parallel without needing any exotic I/O devices.
Such architectures were pioneered by Teradata in the late seventies and by several
research projects.

This design is now used by Teradata, Tandem, NCR, Oracle-nCUBE, and several other
products currently under development. The research community has also embraced this
shared-nothing dataflow architecture in systems like Arbre, Bubba, and Gamma.
1.12 Basic Techniques for Parallel Database Machine Implementation :
 [ University Exam – Dec. 2007 !!! ]
1.12.1 Parallelism Goals and Metrics: Speedup and Scaleup :
The ideal parallel system demonstrates two key properties: (1) linear speedup: Twice as
much hardware can perform the task in half the elapsed time, and (2) linear scaleup:
Twice as much hardware can perform twice as large a task in the same elapsed time (see
Fig. 1.8).
Fig. 1.8 : Speedup and scaleup
A speedup design performs a one-hour job four times faster when run on a four-times larger system. With a scaleup design, a ten-times bigger job is done in the same time by a ten-times bigger system.
More formally, given a fixed job run on a small system, and then run on a larger system,
the speedup given by the larger system is measured as:
Speedup = small_system_elapsed_time / big_system_elapsed_time
Speedup is said to be linear if an N-times larger or more expensive system yields a speedup of N.
Speedup holds the problem size constant, and grows the system. Scaleup measures the
ability to grow both the system and the problem. Scaleup is defined as the ability of an N-times
larger system to perform an N-times larger job in the same elapsed time as the original system.
The scaleup metric is :
Scaleup = small_system_elapsed_time_on_small_problem / big_system_elapsed_time_on_big_problem
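Both metrics can be written as one-line functions. The elapsed times below are hypothetical measurements, used only to show what linear speedup and linear scaleup look like :

# Illustrative sketch of the speedup and scaleup metrics (hypothetical timings).
def speedup(small_system_elapsed_time, big_system_elapsed_time):
    return small_system_elapsed_time / big_system_elapsed_time

def scaleup(small_system_time_on_small_problem, big_system_time_on_big_problem):
    return small_system_time_on_small_problem / big_system_time_on_big_problem

# A four-times larger system runs the same one-hour job in 15 minutes :
print(speedup(60.0, 15.0))    # 4.0  -> linear speedup for N = 4

# A ten-times larger system runs a ten-times larger job in the same time :
print(scaleup(60.0, 60.0))    # 1.0  -> linear scaleup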

If this scaleup equation evaluates to 1, then the scaleup is said to be linear. There are
two distinct kinds of scaleup, batch and transactional.

If the job consists of performing many small independent requests submitted by many
clients and operating on a shared database, then scaleup consists of N-times as many
clients, submitting N-times as many requests against an N-times larger database.

This is the scaleup typically found in transaction processing systems and timesharing
systems. This form of scaleup is used by the Transaction Processing Performance
Council to scale up their transaction processing benchmarks [GRAY91]. Consequently,
it is called transaction-scaleup.

Transaction scaleup is ideally suited to parallel systems since each transaction is
typically a small independent job that can be run on a separate processor. A second form
of scaleup, called batch scaleup, arises when the scaleup task is presented as a single
large job.

This is typical of database queries and is also typical of scientific simulations. In these
cases, scaleup consists of using an N-times larger computer to solve an N times larger
problem.

For database systems batch scaleup translates to the same query on an N times larger
database; for scientific problems, batch scaleup translates to the same calculation on an
N-times finer grid or on an N-times longer simulation.

The generic barriers to linear speedup and linear scaleup are the triple threats of startup, interference, and skew.
Startup :

The time needed to start a parallel operation. If thousands of processes must be started, this startup time can easily dominate the actual computation time.
Interference :

The slowdown each new process imposes on all the others when accessing shared resources.
Skew :

As the number of parallel steps increases, the average size of each step decreases, but the variance can well exceed the mean.

The service time of a job is the service time of the slowest step of the job.

When the variance dominates the mean, increased parallelism improves elapsed time only slightly.
Fig. 1.9
1.12.2 Hardware Architecture, the Trend to Shared-Nothing Machines :

The ideal database machine would have a single infinitely fast processor with an infinite
memory with infinite bandwidth and it would be infinitely cheap (free).

Given such a machine, there would be no need for speedup, scaleup, or parallelism.
Unfortunately, technology is not delivering such machines but it is coming close.
Technology is promising to deliver fast one-chip processors, fast high-capacity disks,
and high-capacity electronic RAM memories.

It also promises that each of these devices will be very inexpensive by today's standards,
costing only hundreds of dollars each.

So, the challenge is to build an infinitely fast processor out of infinitely many processors
of finite speed, and to build an infinitely large memory with infinite memory bandwidth
from infinitely many storage units of finite speed.

This sounds trivial mathematically; but in practice, when a new processor is added to most computer designs, it slows every other processor down just a little bit.

If this slowdown (interference) is 1%, then the maximum speedup is about 37, and a thousand-processor system has only about 4% of the effective power of a single-processor system.
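These figures can be reproduced with a small calculation, under the simple (assumed) model in which every added processor slows all the others by 1%, so an n-processor system delivers n × 0.99^(n−1) single-processor units of work :

# Illustrative check of the interference claim (assumed 1% slowdown model).
def effective_power(n, slowdown=0.01):
    return n * (1.0 - slowdown) ** (n - 1)

best = max(range(1, 2001), key=effective_power)
print(best, round(effective_power(best), 1))   # about 100 processors, about 37x peak
print(round(effective_power(1000), 3))         # about 0.04 -> roughly 4% of one processor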
How can we build scaleable multi-processor systems? Stonebraker suggested the following simple taxonomy for the spectrum of designs :

shared-memory: All processors share direct access to a common global memory and to all disks. The IBM/370, Digital VAX, and Sequent Symmetry multi-processors typify this design.

shared-disks : Each processor has a private memory but has direct access to all disks.
The IBM Sysplex and original Digital VAXcluster typify this design.

shared-nothing : Each memory and disk is owned by some processor that acts as a
server for that data. Mass storage in such an architecture is distributed among the
processors by connecting one or more disks. The Teradata, Tandem, and nCUBE
machines typify this design.

Shared-nothing architectures minimize interference by minimizing resource sharing.

They also exploit commodity processors and memory without needing an incredibly
powerful interconnection network.

As Fig. 1.10 suggests, the other architectures move large quantities of data through the
interconnection network.

The shared-nothing design moves only questions and answers through the network. Raw
memory accesses and raw disk accesses are performed locally in a processor, and only
the filtered (reduced) data is passed to the client program.

This allows a more scaleable design by minimizing traffic on the interconnection network. Shared-nothing characterizes the database systems being used by Teradata [TERA83], Gamma [DEWI86, DEWI90], Tandem [TAND88], Bubba [ALEX88], Arbre [LORI89], and nCUBE [GIBB91]. Significantly, Digital’s VAXcluster has evolved to this design. DOS and UNIX workgroup systems from 3Com, Borland, Digital, HP, Novell, Microsoft, and Sun also adopt a shared-nothing client-server architecture.

The actual interconnection networks used by these systems vary enormously. Teradata
employs a redundant tree-structured communication network.

Tandem uses a three-level duplexed network, two levels within a cluster, and rings
connecting the clusters. Arbre, Bubba, and Gamma are independent of the underlying
interconnection network, requiring only that the network allow any two nodes to
communicate with one another. Gamma operates on an Intel Hypercube.

The Arbre prototype was implemented using IBM 4381 processors connected to one
another in a point-to-point network. Workgroup systems are currently making a
transition from Ethernet to higher speed local networks.

The main advantage of shared-nothing multi-processors is that they can be scaled up to hundreds and probably thousands of processors that do not interfere with one another.

Teradata, Tandem, and Intel have each shipped systems with more than 200 processors.
Intel is implementing a 2000 node Hypercube.

The largest shared-memory multi-processors currently available are limited to about 32 processors. Massively parallel fine-grain machines have application in simulation, pattern matching, and mathematical search, but they do not seem to be appropriate for the multiuser, I/O-intensive, dataflow paradigm of database systems.

These shared-nothing architectures achieve near-linear speedups and scaleups on
complex relational queries and on online-transaction processing workloads [DEWI90,
TAND88, ENGL89].

Given such results, database machine designers see little justification for the hardware
and software complexity associated with shared-memory and shared-disk designs. Shared-memory and shared-disk systems do not scale well on database applications.

Interference is a major problem for shared-memory multi-processors. The
interconnection network must have the bandwidth of the sum of the processors and
disks. It is difficult to build such networks that can scale to thousands of nodes.

To reduce network traffic and to minimize latency, each processor is given a large
private cache. Measurements of shared-memory multiprocessors running database
workloads show that loading and flushing these caches considerably degrades processor
performance [THAK90].

As parallelism increases, interference on shared resources limits performance. Multi-processor systems often use an affinity scheduling mechanism to reduce this interference, giving each process an affinity to a particular processor.

This is a form of data partitioning; it represents an evolutionary step toward the shared-nothing design.

Partitioning a shared-memory system creates many of the skew and load balancing
problems faced by a shared-nothing machine, but reaps none of the simpler hardware
interconnect benefits.

Based on this experience, we believe high-performance shared-memory machines will
not economically scale beyond a few processors when running database applications.

To ameliorate the interference problem, most shared-memory multi-processors have
adopted a shared-disk architecture.

This is the logical consequence of affinity scheduling. If the disk interconnection
network can scale to thousands of disks and processors, then a shared-disk design is
adequate for large read-only databases and for databases where there is no concurrent
sharing.

The shared-disk architecture is not very effective for database applications that read and
write a shared database. A processor wanting to update some data must first obtain the
current copy of that data.

Since others might be updating the same data concurrently, the processor must declare
its intention to update the data.

Once this declaration has been honored and acknowledged by all the other processors,
the updater can read the shared data from disk and update it.

The processor must then write the shared data out to disk so that subsequent readers and
writers will be aware of the update.

There are many optimizations of this protocol, but they all end up exchanging
reservation messages and exchanging large physical data pages. This creates processor
interference and delays. It creates heavy traffic on the shared interconnection network.

For shared database applications, the shared-disk approach is much more expensive than
the shared-nothing approach of exchanging small high-level logical questions and
answers among clients and servers.

One solution to this interference has been to give data a processor affinity; other
processors wanting to access the data send messages to the server managing the data.

This has emerged as a major application of transaction processing monitors that
partition the load among partitioned servers, and is also a major application for remote
procedure calls.

Again, this trend toward the partitioned data model and shared-nothing architecture on a
shared disk system reduces interference.

Since the shared-disk system interconnection network is difficult to scale to thousands
of processors and disks, many conclude that it would be better to adopt the shared-nothing architecture from the start.

Given the shortcomings of shared-memory and shared-disk architectures, why have computer architects been slow to adopt the shared-nothing approach?
The first answer is simple : high-performance, low-cost commodity components have only recently become available. Traditionally, commodity components were relatively low performance and low quality.

Today, old software is the most significant barrier to the use of parallelism. Old
software written for uni-processors gets no speedup or scaleup when put on any kind of
multiprocessor.

It must be rewritten to benefit from parallel processing and multiple disks. Database
applications are a unique exception to this.

Today, most database programs are written in the relational language SQL that has been
standardized by both ANSI and ISO.

It is possible to take standard SQL applications written for uni-processor systems and
execute them in parallel on shared-nothing database machines. Database systems can
automatically distribute data among multiple processors.

Teradata and Tandem routinely port SQL applications to their system and demonstrate
near-linear speedups and scaleups. The next section explains the basic techniques used
by such parallel database systems.
Fig. 1.10 : A parallel data flow approach

Terabyte online databases, consisting of billions of records, are becoming common as
the price of online storage decreases.

These databases are often represented and manipulated using the SQL relational model.

A relational database consists of relations (files in COBOL terminology) that in turn
contain tuples (records in COBOL terminology). All the tuples in a relation have the
same set of attributes (fields in COBOL terminology).

Relations are created, updated, and queried by writing SQL statements. These statements are syntactic sugar for a simple set of operators chosen from the relational algebra. Select-project, here called scan, is the simplest and most common operator. It produces a row-and-column subset of a relational table.

A scan of relation R using predicate P and attribute list L produces a relational data
stream as output. The scan reads each tuple, t, of R and applies the predicate P to it.

If P(t) is true, the scan discards any attributes of t not in L and inserts the resulting tuple
in the scan output stream. Expressed in SQL, a scan of a telephone book relation to find
the phone numbers of all people named Smith would be written:
SELECT telephone_number /* the output attribute(s) */
FROM telephone_book /* the input relation */
WHERE last_name = ’Smith’; /* the predicate */

A scan's output stream can be sent to another relational operator, returned to an
application, displayed on a terminal, or printed in a report. Therein lies the beauty and
utility of the relational model.

The uniformity of the data and operators allows them to be arbitrarily composed into
dataflow graphs.

The output of a scan may be sent to a sort operator that will reorder the tuples based on
an attribute sort criteria, optionally eliminating duplicates.

SQL defines several aggregate operators to summarize attributes into a single value, for
example, taking the sum, min, or max of an attribute, or counting the number of distinct
values of the attribute.

The insert operator adds tuples from a stream to an existing relation. The update and
delete operators alter and delete tuples in a relation matching a scan stream.

The relational model defines several operators to combine and compare two or more
relations.

It provides the usual set operators union, intersection, difference, and some more exotic
ones like join and division.

Discussion here will focus on the equi-join operator (here called join). The join operator
composes two relations, A and B, on some attribute to produce a third relation.

For each tuple, ta, in A, the join finds all tuples, tb, in B whose attribute values are equal
to that of ta. For each matching pair of tuples, the join operator inserts into the output
stream a tuple built by concatenating the pair.

Application programs, which combine conventional code with SQL statements, interact with clients, perform data display, and provide high-level direction of the SQL dataflow.

The SQL data model was originally proposed to improve programmer productivity by
offering a non-procedural database language.

Data independence was an additional benefit; since the programs do not specify how the
query is to be executed, SQL programs continue to operate as the logical and physical
database schema evolves.

Parallelism is an unanticipated benefit of the relational model. Since relational queries
are really just relational operators applied to very large collections of data, they offer
many opportunities for parallelism.

Since the queries are presented in a non-procedural language, they offer considerable
latitude in executing the queries.

Relational queries can be executed as a dataflow graph. As mentioned in the
introduction, these graphs can use both pipelined parallelism and partitioned parallelism.

If one operator sends its output to another, the two operators can execute in parallel, giving a potential speedup of two.

The benefits of pipeline parallelism are limited because of three factors :
(1) Relational pipelines are rarely very long - a chain of length ten is unusual.
(2) Some relational operators do not emit their first output until they have consumed
all their inputs. Aggregate and sort operators have this property. One cannot
pipeline these operators.
(3) Often, the execution cost of one operator is much greater than the others (this is an
example of skew). In such cases, the speedup obtained by pipelining will be very
limited.

Partitioned execution offers much better opportunities for speedup and scaleup. By
taking the large relational operators and partitioning their inputs and outputs, it is
possible to use divide-and-conquer to turn one big job into many independent little ones.

This is an ideal situation for speedup and scaleup. Partitioned data is the key to
partitioned execution.
Data partitioning :
Partitioning a relation involves distributing its tuples over several disks. Data
partitioning has its origins in centralized systems that had to partition files, either
because the file was too big for one disk, or because the file access rate could not be
supported by a single disk.
Distributed databases use data partitioning when they place relation fragments at
different network sites. Data partitioning allows parallel database systems to exploit the
I/O bandwidth of multiple disks by reading and writing them in parallel.
This approach provides I/O bandwidth superior to RAID-style systems without needing
any specialized hardware.
The simplest partitioning strategy distributes tuples among the fragments in a round-robin fashion.
This is the partitioned version of the classic entry-sequence file. Round robin
partitioning is excellent if all applications want to access the relation by sequentially
scanning all of it on each query.
The problem with round-robin partitioning is that applications frequently want to
associatively access tuples, meaning that the application wants to find all the tuples
having a particular attribute value.
The SQL query looking for the Smith’s in the phone book is an example of an
associative search.
Hash partitioning is ideally suited for applications that want only sequential and
associative access to the data.
Tuples are placed by applying a hashing function to an attribute of each tuple. The
function specifies the placement of the tuple on a particular disk.
Associative access to the tuples with a specific attribute value can be directed to a single
disk, avoiding the overhead of starting queries on multiple disks.
Hash partitioning mechanisms are provided by Arbre, Bubba, Gamma, and Teradata.
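As a minimal sketch (with a hypothetical phone-book relation and four disks), hash partitioning can be expressed in a few lines; the point is that every tuple with a given attribute value lands in the same fragment, so an associative query touches a single disk.

# Illustrative sketch of hash partitioning (relation, attribute, and disk count
# are hypothetical).
def hash_partition(tuples, attr, n_disks):
    """Distribute tuples over n_disks fragments by hashing one attribute."""
    fragments = [[] for _ in range(n_disks)]
    for t in tuples:
        fragments[hash(t[attr]) % n_disks].append(t)
    return fragments

phone_book = [{"last_name": "Smith", "tel": "111"},
              {"last_name": "Jones", "tel": "222"},
              {"last_name": "Smith", "tel": "333"}]

frags = hash_partition(phone_book, "last_name", 4)

# Associative access : all 'Smith' tuples live on exactly one disk.
smith_disk = hash("Smith") % 4
print([t for t in frags[smith_disk] if t["last_name"] == "Smith"])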
Fig. 1.11
Database systems pay considerable attention to clustering related data together in
physical storage.
If a set of tuples are routinely accessed together, the database system attempts to store
them on the same physical page.
For example, if the Smiths of the phone book are routinely accessed in alphabetical order, then they should be stored on pages in that order, and these pages should be clustered together on disk to allow sequential prefetching and other optimizations.
Clustering is very application specific. For example, tuples describing nearby streets should be clustered together in geographic databases, while tuples describing the line items of an invoice should be clustered with the invoice tuple in an inventory control application.
Hashing tends to randomize data rather than cluster it.
Range partitioning clusters tuples with similar attributes together in the same partition.
It is good for sequential and associative access, and is also good for clustering data.
Fig. 1.11 shows range partitioning based on lexicographic order, but any clustering
algorithm is possible.
Range partitioning derives its name from the typical SQL range queries such as latitude BETWEEN 37 AND 39. Arbre, Bubba, Gamma, Oracle, and Tandem provide range partitioning. The problem with range partitioning is that it risks data skew, where all the data is placed in one partition, and execution skew, in which all the execution occurs in one partition.
Hashing and round-robin are less susceptible to these skew problems. Range
partitioning can minimize skew by picking non-uniformly-distributed partitioning
criteria.
Bubba uses this concept by considering the access frequency (heat) of each tuple when partitioning a relation; the goal is to balance the frequency with which each partition is accessed (its temperature) rather than the actual number of tuples on each disk (its volume) [COPE88].
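For comparison, the sketch below shows range partitioning driven by a split vector; the attribute, split points, and tuples are hypothetical. Tuples with nearby values end up in the same partition, which helps range queries but also illustrates how a badly chosen vector concentrates data in one partition.

# Illustrative sketch of range partitioning with a hypothetical split vector.
import bisect

def range_partition(tuples, attr, split_vector):
    """A split vector of k values defines k + 1 partitions in sorted order."""
    partitions = [[] for _ in range(len(split_vector) + 1)]
    for t in tuples:
        partitions[bisect.bisect_right(split_vector, t[attr])].append(t)
    return partitions

cities = [{"name": "Chennai", "latitude": 13.1},
          {"name": "Pune", "latitude": 18.5},
          {"name": "Delhi", "latitude": 28.6}]

# Split points 15 and 25 give three partitions :
#   latitude below 15, latitude from 15 up to 25, latitude 25 and above.
parts = range_partition(cities, "latitude", [15.0, 25.0])
print([[c["name"] for c in p] for p in parts])   # [['Chennai'], ['Pune'], ['Delhi']]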
While partitioning is a simple concept that is easy to implement, it raises several new
physical database design issues.
Each relation must now have a partitioning strategy and a set of disk fragments.
Increasing the degree of partitioning usually reduces the response time for an individual
query and increases the overall throughput of the system.
For sequential scans, the response time decreases because more processors and disks are
used to execute the query.
For associative scans, the response time improves because fewer tuples are stored at
each node and hence the size of the index that must be searched decreases.
There is a point beyond which further partitioning actually increases the response time
of a query. This point occurs when the cost of starting a query on a node becomes a
significant fraction of the actual execution time.
Parallelism within relational operators :

Data partitioning is the first step in partitioned execution of relational dataflow graphs.

The basic idea is to use parallel data streams instead of writing new parallel operators
(programs).

This approach enables the use of unmodified, existing sequential routines to execute the
relational operators in parallel. Each relational operator has a set of input ports on which
input tuples arrive and an output port to which the operator’s output stream is sent.

The parallel dataflow works by partitioning and merging data streams into these
sequential ports. This approach allows the use of existing sequential relational operators
to execute in parallel.

Consider a scan of a relation, A, that has been partitioned across three disks into
fragments A0, A1, and A2.

This scan can be implemented as three scan operators that send their output to a
common merge operator.

The merge operator produces a single output data stream to the application or to the next
relational operator.

The parallel query executor creates the three scan processes shown in Fig. 1.12 and
directs them to take their inputs from three different sequential input streams (A0, A1,
A2).

It also directs them to send their outputs to a common merge node. Each scan can run on
an independent processor and disk. So the first basic parallelizing operator is a merge
that can combine several parallel data streams into a single sequential stream.
Fig. 1.12
Fig. 1.13

The merge operator tends to focus data on one spot. If a multi-stage parallel operation is
to be done in parallel, a single data stream must be split into several independent
streams.

A split operator is used to partition or replicate the stream of tuples produced by a
relational operator. A split operator defines a mapping from one or more attribute values
of the output tuples to a set of destination processes (Fig. 1.13).
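A split operator of this kind is easy to imitate in Python. The routing table below is modelled loosely on Fig. 1.15; the attribute name, key ranges, and (cpu, process, port) destinations are illustrative assumptions, not the format used by any actual system.

# Illustrative sketch of a split operator routing scan output to join processes.
ROUTING_TABLE = {              # key range -> destination (cpu, process, port)
    ("A", "H"): (5, 3, 0),
    ("I", "Q"): (7, 8, 0),
    ("R", "Z"): (2, 2, 0),
}

def split(tuple_stream, attr):
    """Yield (destination, tuple) pairs for downstream merge operators."""
    for t in tuple_stream:
        first = t[attr][0].upper()
        for (lo, hi), dest in ROUTING_TABLE.items():
            if lo <= first <= hi:
                yield dest, t
                break

scan_output = [{"x": "Anderson"}, {"x": "Kumar"}, {"x": "Smith"}]
for dest, t in split(scan_output, "x"):
    print(dest, t)
# (5, 3, 0) {'x': 'Anderson'}
# (7, 8, 0) {'x': 'Kumar'}
# (2, 2, 0) {'x': 'Smith'}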
Fig. 1.14 : A simple SQL query and the associated relational query graph

The query specifies that a join is to be performed between relations A and B by
comparing the x attribute of each tuple from the A relation with the y attribute value of
each tuple of the B relation.

For each pair of tuples that satisfy the predicate, a result tuple is formed from all the
attributes of both tuples.

This result tuple is then added to the result relation C. The associated logical query
graph (as might be produced by a query optimizer) shows a tree of operators, one for the
join, one for the insert, and one for scanning each input relation. As an example,
consider the two split operators shown in Fig.1.15 in conjunction with the SQL query
shown in Fig. 1.14.

Assume that three processes are used to execute the join operator, and that five other processes execute the two scan operators : three scanning partitions of relation A while two scan partitions of relation B. Each of the three relation A scan nodes will have the same split operator, sending all tuples between “A-H” to port 0 of join process 0, all between “I-Q” to port 0 of join process 1, and all between “R-Z” to port 0 of join process 2.

Similarly the two relation B scan nodes have the same split operator except that their
outputs are merged by port 1 (not port 0) of each join process. Each join process sees a
sequential input stream of A tuples from the port 0 merge (the left scan nodes) and
another sequential stream of B tuples from the port 1 merge (the right scan nodes).

The outputs of each join are, in turn, split into three streams based on the partitioning criterion of relation C.

Relation A scan split operator                 Relation B scan split operator
Predicate   Destination process                Predicate   Destination process
“A-H”       (cpu #5, process #3, port #0)      “A-H”       (cpu #5, process #3, port #1)
“I-Q”       (cpu #7, process #8, port #0)      “I-Q”       (cpu #7, process #8, port #1)
“R-Z”       (cpu #2, process #2, port #0)      “R-Z”       (cpu #2, process #2, port #1)
Fig. 1.15 : Sample split operators

Each split operator maps tuples to a set of output streams (ports of other processes)
depending on the range value (predicate) of the input tuple.

The split operator on the left is for the relation A scan in Fig. 1.12, while the table on the
right is for the relation B scan. The tables above partition the tuples among three data
streams.

To clarify this example, consider the first join process (processor 5, process 3, ports 0 and 1 in Fig. 1.15).

It will receive all the relation A “A-H” tuples from the three relation A scan operators
merged as a single stream on port 0, and will get all the “A-H” tuples from relation B
merged as a single stream on port 1.

It will join them using a hash-join, sort merge join, or even a nested join if the tuples
arrive in the proper order.
Fig. 1.16

If each of these processes is on an independent processor with an independent disk,
there will be little interference among them. Such dataflow designs are a natural
application for shared-nothing machine architectures.

The split operator in Fig. 1.16 is just an example. Other split operators might duplicate
the input stream, or partition it round-robin, or partition it by hash.

The partitioning function can be an arbitrary program. Gamma, Volcano, and Tandem
use this approach [GRAE90]. It has several advantages including the automatic
parallelism of any new operator added to the system, plus support for many kinds of
parallelism.

The split and merge operators have flow control and buffering built into them. This
prevents one operator from getting too far ahead in the computation.

When a split-operator’s output buffers fill, it stalls the relational operator until the data
target requests more output.

For simplicity, these examples have been stated in terms of an operator per process. But
it is entirely possible to place several operators within a process to get coarser grained
parallelism.

The fundamental idea though is to build a self-pacing dataflow graph and distribute it in
a shared-nothing machine in a way that minimizes interference.
Specialized parallel relational operators :

Some algorithms for relational operators are especially appropriate for parallel
execution, either because they minimize data flow, or because they better tolerate data
and execution skew.

Improved algorithms have been found for most of the relational operators. The evolution
of join operator algorithms is sketched here as an example of these improved algorithms.

Recall that the join operator combines two relations, A and B, to produce a third relation
containing all tuple pairs from A and B with matching attribute values.

The conventional way of computing the join is to sort both A and B into new relations
ordered by the join attribute.

These two intermediate relations are then compared in sorted order, and matching tuples
are inserted in the output stream. This algorithm is called sort-merge join.

Many optimizations of sort-merge join are possible, but since sort has execution cost n log(n), sort-merge join has an n log(n) execution cost. Sort-merge join works well in a parallel dataflow environment unless there is data skew.

In case of data skew, some sort partitions may be much larger than others. This in turn
creates execution skew and limits speedup and scaleup. These skew problems do not
appear in centralized sort-merge joins.

Hash-join is an alternative to sort-merge join. It has linear execution cost rather than n log(n) execution cost, and it is more resistant to data skew.

It is superior to sort-merge join unless the input streams are already in sorted order.
Hash join works as follows. Each of the relations A and B is first hash partitioned on
the join attribute.

A hash partition of relation A is hashed into memory. The corresponding partition of relation B is scanned, and each tuple is compared against the main-memory hash table for the A partition.

If there is a match, the pair of tuples are sent to the output stream. Each pair of hash
partitions is compared in this way. The hash join algorithm breaks a big join into many
little joins.
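The build-and-probe step for one pair of co-partitioned fragments can be sketched as follows; the attribute names and tuples are hypothetical, and a real system would of course spill oversized partitions to disk.

# Illustrative sketch of hash join on one pair of partitions (toy data).
from collections import defaultdict

def hash_join_partition(a_part, b_part, a_key, b_key):
    """Join one partition of A with the corresponding partition of B."""
    table = defaultdict(list)
    for ta in a_part:                  # build phase : hash the A fragment
        table[ta[a_key]].append(ta)
    out = []
    for tb in b_part:                  # probe phase : scan the B fragment
        for ta in table.get(tb[b_key], []):
            out.append({**ta, **tb})   # matching pair -> output stream
    return out

A = [{"x": 1, "a": "p"}, {"x": 2, "a": "q"}]
B = [{"y": 1, "b": "r"}, {"y": 1, "b": "s"}]
print(hash_join_partition(A, B, "x", "y"))
# [{'x': 1, 'a': 'p', 'y': 1, 'b': 'r'}, {'x': 1, 'a': 'p', 'y': 1, 'b': 's'}]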

If the hash function is good and if the data skew is not too bad, then there will be little
variance in the hash bucket size.

In these cases hash-join is a linear-time join algorithm with linear speedup and scaleup.
Many optimizations of the parallel hash-join algorithm have been discovered over the
last decade.

In pathological skew cases, when many or all tuples have the same attribute value, one
bucket may contain all the tuples. In these cases no algorithm is known to speedup or
scaleup.

The hash-join example shows that new parallel algorithms can improve the performance
of relational operators.

This is a fruitful research area [BORA90, DEWI86, KITS83, KITS90, SCHN89,
SCHN90, WOLF90, ZELL90].

Even though parallelism can be obtained from conventional sequential relational
algorithms by using split and merge operators, we expect that many new algorithms will
be discovered in the future.
1.13 Parallel Query Optimization :

Current database query optimizers do not consider all possible plans when optimizing a
relational query.

While cost models for relational queries running on a single processor are now well understood [SELI79], they still depend on cost estimators that are a guess at best.

Some dynamically select from among several plans at run time depending on, for
example, the amount of physical memory actually available and the cardinalities of the
intermediate results [GRAE89].

To date, no query optimizers consider all the parallel algorithms for each operator and
all the query tree organizations. More work is needed in this area.

Another optimization problem relates to highly skewed value distributions.

Data skew can lead to high variance in the size of intermediate relations, leading to both
poor query plan cost estimates and sub-linear speedup.

Solutions to this problem are an area of active research [KITS90, WOLF90, HUA91, WALT91].
1.14 Application Program Parallelism :

The parallel database systems offer parallelism within the database system. Missing are
tools to structure application programs to take advantage of parallelism inherent in these
parallel systems.

While automatic parallelization of applications programs written in Cobol may not be
feasible, library packages to facilitate explicitly parallel application programs are
needed.

Ideally the SPLIT and MERGE operators could be packaged so that applications could
benefit from them.
1.15 Physical Database Design :

For a given database and workload there are many possible indexing and partitioning
combinations.

Database design tools are needed to help the database administrator select among these
many design options.

Such tools might accept as input a description of the queries comprising the workload,
their frequency of execution, statistical information about the relations in the database,
and a description of the processors and disks.

The resulting output would suggest a partitioning strategy for each relation plus the
indices to be created on each relation. Steps in this direction are beginning to appear.

Current algorithms partition relations using the values of a single attribute. For example,
geographic records could be partitioned by longitude or latitude.

Partitioning on longitude allows selections for a longitude range to be localized to a limited number of nodes, while selections on latitude must be sent to all the nodes.

While this is acceptable in a small configuration, it is not acceptable in a system with
thousands of processors. Additional research is needed on multidimensional partitioning
and search algorithms.
1.16 On-line Data Reorganization and Utilities :

Loading, reorganizing, or dumping a terabyte database at a megabyte per second takes
over twelve days and nights.

Clearly parallelism is needed if utilities are to complete within a few hours or days.

Even then, it will be essential that the data be available while the utilities are operating.
In the SQL world, typical utilities create indices, add or drop attributes, add constraints,
and physically reorganize the data, changing its clustering.

One unexplored and difficult problem is how to process database utility commands
while the system remains operational and the data remains available for concurrent reads
and writes by others.

The fundamental properties of such algorithms are that they must be online (operate
without making data unavailable), incremental (operate on parts of a large database),
parallel (exploit parallel processors), and recoverable (allow the operation to be
canceled and return to the old state).
Fig. 1.17

Users run data-parallel applications across the Internet (Fig. 1.17).

The data stores are on server clusters, which, compared to monolithic machines, flexibly
support different kinds of concurrent workloads, are easier to upgrade, and have the
potential to support independent node faults.

Our client machines, in contrast to the traditional view, are active collaborators with the
clusters in providing the end-result to the user.
Other systems :

Other parallel database system prototypes include XPRS [STON88], Volcano [GRAE90], Arbre [LORI89], and the PERSIST project under development at IBM Research labs in Hawthorne and Almaden. While both Volcano and XPRS are implemented on shared-memory multi-processors, XPRS is unique in its exploitation of the availability of massive shared memory in its design. In addition, XPRS is based on several innovative techniques for obtaining extremely high performance and availability.

Recently, the Oracle database system has been implemented atop a 64-node nCUBE
shared-nothing system. The resulting system is the first to demonstrate more than 1000
transactions per second on the industry-standard TPC-B benchmark.

This is far in excess of Oracle’s performance on conventional mainframe systems - both
in peak performance and in price/performance [GIBB91].

NCR has announced the 3600 and 3700 product lines that employ shared-nothing
architectures running System V R4 of Unix on Intel 486 and 586 processors.

The interconnection network for the 3600 product line uses an enhanced Y-Net licensed
from Teradata while the 3700 is based on a new multistage interconnection network
being developed jointly by NCR and Teradata.

Two software offerings have been announced. The first, a port of the Teradata software
to a Unix environment, is targeted toward the decision-support marketplace. The second,
based on a parallelization of the Sybase DBMS, is intended primarily for transaction
processing workloads.
1.17 Database Machines and Grosch’s Law :

Today shared-nothing database machines have the best peak performance and best price
performance available.

When compared to traditional mainframes, the Tandem system scales linearly well
beyond the largest reported mainframes on the TPC-A transaction processing
benchmark.

Its price/performance on these benchmarks is three times cheaper than the comparable
mainframe numbers.

Oracle on an nCUBE has the highest reported TPC-B numbers, and has very
competitive price performance [GRAY91, GIBB91]. These benchmarks demonstrate
linear scaleup on transaction processing benchmarks.

Gamma, Tandem, and Teradata have demonstrated linear speedup and scaleup on complex relational database benchmarks. They scale well beyond the size of the largest mainframes. Their performance and price performance is generally superior to mainframe systems.
Summary :

Parallel databases have gained significant commercial acceptance in the past 15 years.

In I/O parallelism, relations are partitioned among available disks so that they can be
retrieved faster. Three commonly used partitioning techniques are round-robin
partitioning, hash partitioning, and range partitioning.

Skew is a major problem, especially with increasing degrees of parallelism. Balanced
partitioning vectors, using histograms, and virtual processor partitioning are among the
techniques used to reduce skew.

In interquery parallelism, we run different queries concurrently to increase throughput.

Intraquery parallelism attempts to reduce the cost of running a query. There are two types of intraquery parallelism : intraoperation parallelism and interoperation parallelism.

We use intraoperation parallelism to execute relational operations, such as sorts and joins, in parallel. Intraoperation parallelism is natural for relational operations, since they are set oriented.

There are two basic approaches to parallelizing a binary operation such as a join.

In partitioned parallelism, the relations are split into several parts, and tuples in ri are joined only with tuples from si. Partitioned parallelism can only be used for natural joins and equi-joins.

In fragment and replicate, both relations are partitioned and each partition is replicated. In asymmetric fragment-and-replicate, one of the relations is replicated while the other is partitioned. Unlike partitioned parallelism, fragment-and-replicate and asymmetric fragment-and-replicate can be used with any join condition.

Both parallelization techniques can work in conjunction with any join technique.

In independent parallelism, different operations that do not depend on one another are
executed in parallel.

In pipelined parallelism, processors send the results of one operation to another operation as those results are computed, without waiting for the entire operation to finish.

Query optimization in parallel databases is significantly more complex than query
optimization in sequential databases.
Review Questions
Q. 1
Discuss the different motivations behind parallel and distributed databases.
Q. 2
Describe the three main architectures for parallel DBMSs. Explain why the shared-memory and shared-disk approaches suffer from interference.
Q. 3
What can you say about the speed-up and scale-up of the shared-nothing
architecture ?
Q. 4
Describe and differentiate pipelined parallelism and data-partitioned parallelism.
Q. 5
Discuss the following techniques for partitioning data : round-robin, hash, and range.
Q. 6
Explain how existing code can be parallelized by introducing split and merge operators.
Q. 7
Discuss how each of the following operators can be parallelized using data partitioning : scanning, sorting, join. Compare the use of sorting versus hashing for partitioning.
Q. 8
What do we need to consider in optimizing queries for parallel execution ? Discuss
interoperation parallelism, left-deep trees versus bushy trees, and cost estimation.

Contents :

1.1     Introduction : .......................................................... 1
1.2     Parallel Systems : ...................................................... 2
1.2.1   Measures of Performance of Database Systems : ........................... 2
1.2.2   Speedup and Scaleup : .................................................... 3
1.3     Architectures for Parallel Databases : ................................... 4
1.3.1   Shared Memory : .......................................................... 5
1.3.2   Shared Disk : ............................................................ 6
1.3.3   Shared Nothing : ......................................................... 7
1.3.4   Hierarchical : ........................................................... 8
1.3.5   Parallel Query Evaluation : .............................................. 9
1.4     I/O Parallelism : ........................................................ 9
1.4.1   Partitioning Techniques : ................................................ 9
1.4.2   Comparison of Partitioning Techniques : ................................. 10
1.4.2.1 Round-robin : ........................................................... 10
1.4.2.2 Hash Partitioning : ..................................................... 11
1.4.2.3 Range Partitioning : .................................................... 11
1.4.3   Handling of Skew : ...................................................... 12
1.4.3.1 Attribute-value Skew : .................................................. 12
1.4.3.2 A Balanced Range-Partitioning Skew : .................................... 13
1.5     Parallelizing Individual Operations : ................................... 14
1.5.1   Bulk Loading and Scanning : ............................................. 14
1.5.2   Sorting : ............................................................... 15
1.5.2.1 Splitting Vector : ...................................................... 15
1.5.2.2 Application of Sorting : ................................................ 16
1.6     Interquery Parallelism : ................................................ 16
1.6.1   Working of Interquery Parallelism : ..................................... 16
1.6.2   Protocols used in Shared Disk System : .................................. 17
1.6.3   Advantages of Complex Protocols : ....................................... 17
1.7     Intraquery Parallelism : ................................................ 17
1.7.1   Working of Intraquery Parallelism : ..................................... 18
1.7.2   Importance of Parallelism : ............................................. 18
1.8     Intraoperation Parallelism : ............................................ 19
1.8.1   Parallel Sort : ......................................................... 19
1.8.1.1 Range-Partitioning Sort : ............................................... 19
1.8.1.2 Parallel External Sort Merge : .......................................... 20
1.8.2   Parallel Join : ......................................................... 21
1.8.2.1 Partitioned Join : ...................................................... 21
1.8.2.2 Fragment and Replicate Join : ........................................... 23
1.8.2.3 Partitioned Parallel Hash-Join : ........................................ 24
1.8.2.4 Parallel Nested-Loop Join : ............................................. 25
1.8.3   Other Relational Operations : ........................................... 26
1.8.4   Cost of Parallel Evaluation of Operations : ............................. 27
1.9     Interoperation Parallelism : ............................................ 28
1.9.1   Pipelined Parallelism : ................................................. 28
1.9.2   Independent Parallelism : ............................................... 29
1.9.3   Query Optimization : .................................................... 30
1.9.3.1 Types of Heuristics for Evaluation : .................................... 31
1.10    Design of Parallel System : ............................................. 31
1.10.1  Resilience to Failure of Some Processors or Disks : ..................... 32
1.10.2  Problems of Large Databases : ........................................... 32
1.10.3  Online Index Construction : ............................................. 33
1.11    Effective Guidelines : .................................................. 33
1.12    Basic Techniques for Parallel Database Machine Implementation : ......... 35
1.13    Parallel Query Optimization : ........................................... 50
1.14    Application Program Parallelism : ....................................... 51
1.15    Physical Database Design : .............................................. 51
1.16    On-line Data Reorganization and Utilities : ............................. 52
1.17    Database Machines and Grosch’s Law : .................................... 53
