UNIT - I

Parallel Databases

Syllabus : Introduction, Parallel database architecture, I/O Parallelism, Inter-query and Intra-query Parallelism, Inter-operational and Intra-operational Parallelism, Design of parallel systems.

1.1 Introduction :

Fifteen years ago, parallel database systems had been nearly written off, even by some of their staunchest advocates. Today, they are successfully marketed by practically every database system vendor. Several trends fueled this transition.

The transaction requirements of organizations have grown with the increasing use of computers. Moreover, the growth of the World Wide Web has created many sites with millions of viewers, and the increasing amount of data collected from these viewers has produced extremely large databases at many companies.

Organizations are using these increasingly large volumes of data, such as data about what items people buy, what Web links users click on, or when people make telephone calls, to plan their activities and pricing. Queries used for such purposes are called decision support queries, and the data requirements for such queries may run into terabytes. Single-processor systems are not capable of handling such large volumes of data at the required rates.

The set-oriented nature of database queries naturally lends itself to parallelization. A number of commercial and research systems have demonstrated the power and scalability of parallel query processing. As microprocessors have become cheap, parallel machines have become common and relatively inexpensive.

A parallel database system seeks to improve performance through parallelization of various operations, such as loading data, building indexes and evaluating queries. Although data may be stored in a distributed fashion in such a system, the distribution is governed solely by performance considerations.

1.2 Parallel Systems : [ University Exam – Dec. 2006, Dec. 2007 !!! ]

Parallel systems improve processing and I/O speeds by using multiple CPUs and disks in parallel. Parallel machines are becoming increasingly common, making the study of parallel database systems correspondingly more important.

The driving force behind parallel database systems is the demands of applications that have to query extremely large databases or that have to process a large number of transactions per second (of the order of thousands of transactions per second). Centralized and client-server database systems are not powerful enough to handle such applications.

In parallel processing, many operations are performed simultaneously, as opposed to serial processing, in which the computational steps are performed sequentially. A coarse-grain parallel machine consists of a small number of powerful processors; a massively parallel or fine-grain parallel machine uses thousands of smaller processors. Most high-end machines today offer some degree of coarse-grain parallelism : machines with two or more processors are common.

1.2.1 Measures of Performance of Database Systems :

There are two main measures of performance of a database system :

1. Throughput : the number of tasks that can be completed in a given time interval.
2. Response time : the amount of time it takes to complete a single task from the time it is submitted.

A system that processes a large number of small transactions can improve throughput by processing many transactions in parallel.
A system that processes large transactions can improve response time as well as throughput by performing subtasks of each transaction in parallel.

1.2.2 Speedup and Scaleup : [ University Exam – May 2007, Dec. 2007 !!! ]

Two important issues in studying parallelism are :

(1) Speedup : Running a given task in less time by increasing the degree of parallelism is called speedup.
(2) Scaleup : Handling larger tasks by increasing the degree of parallelism is called scaleup.

Speedup :

Consider a database application running on a parallel system with a certain number of processors and disks. Now suppose that we increase the size of the system by increasing the number of processors, disks, and other components of the system. The goal is to process the task in time inversely proportional to the number of processors and disks allocated. Speedup is defined as TS / TL, where TS is the execution time of the task on the smaller machine and TL is its execution time on the larger machine.

The parallel system is said to demonstrate linear speedup if the speedup is N when the larger system has N times the resources (CPU, disk, and so on) of the smaller system. If the speedup is less than N, the system is said to demonstrate sublinear speedup. Fig. 1.1 illustrates linear and sublinear speedup.

Fig. 1.1 : Speedup

Scaleup :

Scaleup relates to the ability to process larger tasks in the same amount of time by providing more resources. Let Q be a task and QN be a task that is N times bigger than Q. Suppose the execution time of task Q on machine MS is TS, and the execution time of task QN on a parallel machine ML, which is N times larger than MS, is TL.

Scaleup is defined as TS / TL, where
TL : execution time of the larger task QN on the larger machine ML
TS : execution time of the task Q on the smaller machine MS

The parallel system ML is said to demonstrate linear scaleup on task Q if TL = TS. If TL > TS, the system is said to demonstrate sublinear scaleup.

Fig. 1.2 : Scaleup

1.3 Architectures for Parallel Databases : [ University Exam – May 2007, Dec. 2007, Dec. 2009 !!! ]

There are several architectural models for parallel machines. Among the most prominent ones are those in Fig. 1.3 (in the figure, M denotes memory, P denotes a processor, and disks are shown as cylinders) :

o Shared memory : All the processors share a common memory (Fig. 1.3(a)).

Fig. 1.3(a) : Shared memory

o Shared disk : All the processors share a common set of disks (Fig. 1.3(b)). Shared-disk systems are sometimes called clusters.

Fig. 1.3(b) : Shared disk

o Shared nothing : The processors share neither a common memory nor a common disk (Fig. 1.3(c)).

Fig. 1.3(c) : Shared nothing

o Hierarchical : This model is a hybrid of the preceding three architectures (Fig. 1.3(d)).

Techniques used to speed up transaction processing on data-server systems, such as lock caching and lock de-escalation, can also be used in shared-disk parallel databases as well as in shared-nothing parallel databases. In fact, they are very important for efficient transaction processing in such systems.

1.3.1 Shared Memory :

In the shared-memory architecture, the processors and disks have access to a common memory, typically via a bus or through an interconnection network.

Advantages :

The benefit of shared memory is extremely efficient communication between processors. Data in shared memory can be accessed by any processor without being moved by software.
A processor can send messages to other processors much faster by using memory writes (which usually take less than a microsecond) than by sending a message through a communication mechanism.

Disadvantages :

The downside of shared memory is that the architecture is not scalable beyond 32 or 64 processors, because the bus or interconnection network becomes a bottleneck (since it is shared by all processors). Adding more processors does not help beyond this point, since the processors spend most of their time waiting for their turn on the bus to access memory.

Shared-memory architectures usually have large memory caches at each processor, so that referencing of the shared memory is avoided whenever possible. However, at least some of the data will not be in the cache, and accesses will have to go to the shared memory. Moreover, the caches need to be kept coherent, and maintaining cache coherency becomes an increasing overhead with an increasing number of processors. Consequently, shared-memory machines are not capable of scaling up beyond a point; current shared-memory machines cannot support more than 64 processors.

1.3.2 Shared Disk :

In the shared-disk model, all processors can access all disks directly via an interconnection network, but the processors have private memories.

Advantages :

Since each processor has its own memory, the memory bus is not a bottleneck.

It offers a cheap way to provide a degree of fault tolerance. If a processor (or its memory) fails, the other processors can take over its tasks, since the database is resident on disks that are accessible from all processors. We can make the disk subsystem itself fault tolerant by using a RAID architecture. The shared-disk architecture has found acceptance in many applications.

Disadvantages :

The main problem with a shared-disk system is again scalability. Although the memory bus is no longer a bottleneck, the interconnection to the disk subsystem is now a bottleneck; it is particularly so in a situation where the database makes a large number of accesses to disks. Compared to shared-memory systems, shared-disk systems can scale to a somewhat larger number of processors, but communication across processors is slower, since it has to go through a communication network.

Example : DEC clusters running Rdb were one of the early commercial users of the shared-disk database architecture. (Rdb is now owned by Oracle, and is called Oracle Rdb. Digital Equipment Corporation (DEC) is now owned by Compaq.)

1.3.3 Shared Nothing :

In a shared-nothing system, each node of the machine consists of a processor, memory, and one or more disks. The processors at one node may communicate with processors at other nodes by a high-speed interconnection network. A node functions as the server for the data on the disk or disks that the node owns, so local disk references are serviced by the local disks at each processor.

Advantages :

The shared-nothing model overcomes the disadvantage of requiring all I/O to go through a single interconnection network; only queries, accesses to nonlocal disks, and result relations pass through the network.

Moreover, the interconnection networks for shared-nothing systems are usually designed to be scalable, so that their transmission capacity increases as more nodes are added. Consequently, shared-nothing architectures are more scalable, and can easily support a large number of processors.
Disadvantage :

The main drawback of shared-nothing systems is the cost of communication and of nonlocal disk access, which is higher than in a shared-memory or shared-disk architecture, since sending data involves software interaction at both ends.

Applications :

The Teradata database machine was among the earliest commercial systems to use the shared-nothing database architecture. The Grace and the Gamma research prototypes also used shared-nothing architectures.

1.3.4 Hierarchical :

The hierarchical architecture combines the characteristics of shared-memory, shared-disk, and shared-nothing architectures.

At the top level, the system consists of nodes connected by an interconnection network; the nodes do not share disks or memory with one another. Thus, the top level is a shared-nothing architecture. Each node of the system could actually be a shared-memory system with a few processors. Alternatively, each node could be a shared-disk system, and each of the systems sharing a set of disks could be a shared-memory system.

Thus, a system could be built as a hierarchy, with a shared-memory architecture with a few processors at the base, a shared-nothing architecture at the top, and possibly a shared-disk architecture in the middle. Fig. 1.3(d) illustrates a hierarchical architecture with shared-memory nodes connected together in a shared-nothing architecture. Commercial parallel database systems today run on several of these architectures.

Attempts to reduce the complexity of programming such systems have yielded distributed virtual-memory architectures, where logically there is a single shared memory, but physically there are multiple disjoint memory systems; the virtual-memory-mapping hardware, coupled with system software, allows each processor to view the disjoint memories as a single virtual memory. Since access speeds differ depending on whether the page is available locally or not, such an architecture is also referred to as a nonuniform memory architecture (NUMA).

Fig. 1.3(d)

1.3.5 Parallel Query Evaluation : [ University Exam – Dec. 2006 !!! ]

Now we try to understand parallel evaluation of a relational query in a DBMS with a shared-nothing architecture. While it is possible to consider parallel execution of multiple queries, it is hard to identify in advance which queries will run concurrently. So the emphasis has been on parallel execution of a single query.

A relational query execution plan is a graph of relational algebra operators, and the operators in a graph can be executed in parallel. If one operator consumes the output of a second operator, we have pipelined parallelism (the output of the second operator is worked on by the first operator as soon as it is generated). If not, the two operators can proceed essentially independently. An operator is said to block if it produces no output until it has consumed all its inputs. Pipelined parallelism is limited by the presence of operators that block.

In addition to evaluating different operators in parallel, we can evaluate each individual operator in a query plan in a parallel fashion. The key to evaluating an operator in parallel is to partition the input data; we can then work on each partition in parallel and combine the results. This approach is called partitioned parallel evaluation.
An important observation, which explains why shared-nothing parallel database systems have been very successful, is that database query evaluation is very amenable to data-partitioned parallel evaluation. The goal is to minimize data shipping by partitioning the data and by structuring the algorithms to do most of the processing at individual processors.

1.4 I/O Parallelism : [ University Exam – Dec. 2007, Dec. 2009 !!! ]

Definition : I/O parallelism refers to reducing the time required to retrieve relations from disk by partitioning the relations over multiple disks.

The most common form of data partitioning in a parallel database environment is horizontal partitioning. In horizontal partitioning, the tuples of a relation are divided (or declustered) among many disks, so that each tuple resides on one disk. Several partitioning strategies have been proposed.

1.4.1 Partitioning Techniques :

There are three basic data-partitioning strategies; a short code sketch of the three follows their descriptions. Assume that there are n disks D0, D1, ..., Dn-1 across which the data are to be partitioned.

Round-robin : This strategy scans the relation in any order and sends the i-th tuple to disk number Di mod n. The round-robin scheme ensures an even distribution of tuples across disks; that is, each disk has approximately the same number of tuples as the others.

Hash partitioning : This declustering strategy designates one or more attributes from the given relation's schema as the partitioning attributes. A hash function is chosen whose range is {0, 1, ..., n - 1}. Each tuple of the original relation is hashed on the partitioning attributes. If the hash function returns i, then the tuple is placed on disk Di.

Range partitioning : This strategy distributes contiguous attribute-value ranges to each disk. It chooses a partitioning attribute, A, and a partitioning vector. The relation is partitioned as follows. Let [v0, v1, ..., vn-2] denote the partitioning vector, such that, if i < j, then vi < vj. Consider a tuple t such that t[A] = x. If x < v0, then t goes on disk D0. If x >= vn-2, then t goes on disk Dn-1. If vi <= x < vi+1, then t goes on disk Di+1.

For example, range partitioning with three disks numbered 0, 1, and 2 may assign tuples with values less than 5 to disk 0, values between 5 and 40 to disk 1, and values greater than 40 to disk 2.
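To make the three strategies concrete, here is a minimal sketch in Python. All names are illustrative rather than part of any real system : n_disks plays the role of n, and key stands for the value of a tuple's partitioning attribute.

```python
def round_robin(i, n_disks):
    """Send the i-th tuple (in scan order) to disk i mod n."""
    return i % n_disks

def hash_partition(key, n_disks):
    """Hash the partitioning-attribute value into the range 0 .. n-1."""
    return hash(key) % n_disks

def range_partition(key, vector):
    """vector = [v0, v1, ..., v(n-2)], sorted ascending; result is 0 .. n-1."""
    for i, v in enumerate(vector):
        if key < v:
            return i          # x < v0 -> D0;  vi <= x < v(i+1) -> D(i+1)
    return len(vector)        # x >= v(n-2) -> D(n-1)

# The running example: values < 5 -> disk 0, 5..40 -> disk 1, > 40 -> disk 2.
vector = [5, 41]
assert range_partition(3, vector) == 0
assert range_partition(20, vector) == 1
assert range_partition(50, vector) == 2
```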
1.4.2 Comparison of Partitioning Techniques : [ University Exam – Dec. 2006, May 2007 !!! ]

Once a relation has been partitioned among several disks, we can retrieve it in parallel, using all the disks. Similarly, when a relation is being partitioned, it can be written to multiple disks in parallel. Thus, the transfer rates for reading or writing an entire relation are much faster with I/O parallelism than without it. However, reading an entire relation is only one kind of access to data. Access to data can be classified as follows :

1. Scanning the entire relation.
2. Locating a tuple associatively (for example, employee-name = "Campbell"); these queries, called point queries, seek tuples that have a specified value for a specific attribute.
3. Locating all tuples for which the value of a given attribute lies within a specified range (for example, 10000 < salary < 20000); these queries are called range queries.

The different partitioning techniques support these types of access at different levels of efficiency.

1.4.2.1 Round-robin :

The scheme is ideally suited for applications that wish to read the entire relation sequentially for each query. With this scheme, both point queries and range queries are complicated to process, since each of the n disks must be used for the search.

1.4.2.2 Hash Partitioning :

This scheme is best suited for point queries based on the partitioning attribute. For example, if a relation is partitioned on the telephone-number attribute, then we can answer the query "Find the record of the employee with telephone-number = 555-3333" by applying the partitioning hash function to 555-3333 and then searching that disk. Directing a query to a single disk saves the startup cost of initiating a query on multiple disks, and leaves the other disks free to process other queries.

Hash partitioning is also useful for sequential scans of the entire relation. If the hash function is a good randomizing function, and the partitioning attributes form a key of the relation, then the number of tuples on each of the disks is approximately the same, without much variance. Hence, the time taken to scan the relation is approximately 1/n of the time required to scan the relation in a single-disk system.

The scheme, however, is not well suited for point queries on nonpartitioning attributes. Hash-based partitioning is also not well suited for answering range queries, since, typically, hash functions do not preserve proximity within a range. Therefore, all the disks need to be scanned for range queries to be answered.

1.4.2.3 Range Partitioning : [ University Exam – Dec. 2009 !!! ]

This scheme is well suited for point and range queries on the partitioning attribute. For point queries, we can consult the partitioning vector to locate the disk where the tuple resides. For range queries, we consult the partitioning vector to find the range of disks on which the tuples may reside. In both cases, the search narrows to exactly those disks that might have any tuples of interest.

An advantage of this feature is that, if there are only a few tuples in the queried range, then the query is typically sent to one disk, as opposed to all the disks. Since other disks can be used to answer other queries, range partitioning results in higher throughput while maintaining good response time.

On the other hand, if there are many tuples in the queried range (as there are when the queried range is a large fraction of the domain of the relation), many tuples have to be retrieved from a few disks, resulting in an I/O bottleneck (hot spot) at those disks. In this example of execution skew, all processing occurs in one or only a few partitions. In contrast, hash partitioning and round-robin partitioning would engage all the disks for such queries, giving a faster response time for approximately the same throughput.

The type of partitioning also affects other relational operations, such as joins. Thus, the choice of partitioning technique also depends on the operations that need to be executed. In general, hash partitioning or range partitioning is preferred to round-robin partitioning.

In a system with many disks, the number of disks across which to partition a relation can be chosen in this way : if a relation contains only a few tuples that will fit into a single disk block, then it is better to assign the relation to a single disk. Large relations are preferably partitioned across all the available disks. If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
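To see why range partitioning narrows point and range queries to just a few disks, consider the following sketch. disk_for_point and disks_for_range are hypothetical helpers built on the standard-library bisect module; they use the same partition vector as before.

```python
import bisect

def disk_for_point(x, vector):
    # bisect_right returns the number of vector entries <= x, which is
    # exactly the disk index assigned by the range-partitioning rule.
    return bisect.bisect_right(vector, x)

def disks_for_range(lo, hi, vector):
    # Only the disks whose value ranges overlap [lo, hi] need be searched.
    return list(range(disk_for_point(lo, vector), disk_for_point(hi, vector) + 1))

vector = [5, 41]                       # three disks, as in the running example
print(disk_for_point(20, vector))      # 1: a point query touches one disk
print(disks_for_range(3, 50, vector))  # [0, 1, 2]: a wide range touches them all
```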
1.4.3 Handling of Skew : [ University Exam – Dec. 2006, May 2007, Dec. 2007 !!! ]

When a relation is partitioned (by a technique other than round-robin), there may be a skew in the distribution of tuples, with a high percentage of tuples placed in some partitions and fewer tuples in other partitions. The ways that skew may appear are classified as :

1. Attribute-value skew
2. Partition skew

1.4.3.1 Attribute-value Skew :

Attribute-value skew refers to the fact that some values appear in the partitioning attributes of many tuples. All the tuples with the same value for the partitioning attribute end up in the same partition, resulting in skew. Partition skew refers to the fact that there may be a load imbalance in the partitioning, even when there is no attribute skew.

Attribute-value skew can result in skewed partitioning regardless of whether range partitioning or hash partitioning is used. If the partition vector is not chosen carefully, range partitioning may result in partition skew. Partition skew is less likely with hash partitioning, if a good hash function is chosen.

Skew becomes an increasing problem with a higher degree of parallelism. For example, if a relation of 1000 tuples is divided into 10 parts, and the division is skewed, then there may be some partitions of size less than 100 and some partitions of size more than 100. If even one partition happens to be of size 200, the speedup that we would obtain by accessing the partitions in parallel is only 5, instead of the 10 for which we would have hoped. If the same relation has to be partitioned into 100 parts, a partition will have 10 tuples on average. If even one partition has 40 tuples (which is possible, given the large number of partitions), the speedup that we would obtain by accessing them in parallel would be 25, rather than 100. Thus the loss of speedup due to skew increases with parallelism.

1.4.3.2 A Balanced Range-Partitioning Vector :

A balanced range-partitioning vector can be constructed by sorting the relation. The relation is first sorted on the partitioning attributes. The relation is then scanned in sorted order. After every 1/n of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector. Here, n denotes the number of partitions to be constructed. In case there are many tuples with the same value for the partitioning attribute, the technique can still result in some skew.

Disadvantage : The extra I/O overhead incurred in doing the initial sort.

How is the I/O overhead avoided ?

The I/O overhead for constructing balanced range-partition vectors can be reduced by constructing and storing a frequency table, or histogram, of the attribute values for each attribute of each relation. Fig. 1.4 shows a histogram for an integer-valued attribute that takes values in the range 1 to 25. A histogram takes up only a little space, so histograms on several different attributes can be stored in the system catalog.

Fig. 1.4 : Example of histogram

It is straightforward to construct a balanced range-partitioning function given a histogram on the partitioning attributes. If the histogram is not stored, it can be computed approximately by sampling the relation, using only tuples from a randomly chosen subset of the disk blocks of the relation.
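The construction just described reduces to picking every (1/n)-th value from data in sorted order. A minimal sketch, assuming we already have a sorted sample of partitioning-attribute values (whether obtained from a full sort, a histogram, or block sampling) :

```python
def balanced_partition_vector(sorted_values, n_partitions):
    """Pick n-1 splitters so that each partition gets roughly len/n values."""
    step = len(sorted_values) / n_partitions
    return [sorted_values[int(i * step)] for i in range(1, n_partitions)]

sample = sorted([17, 3, 42, 8, 99, 25, 61, 5, 73, 31, 12, 54])
print(balanced_partition_vector(sample, 3))   # [17, 54]: 4 values per partition
```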
Another approach to minimize the effect of skew, particularly with range partitioning, is to use virtual processors. In the virtual processor approach, we pretend there are several times as many virtual processors as the number of real processors. Any of the partitioning techniques and query evaluation techniques that we study later in this chapter can be used, but they map tuples and work to virtual processors instead of to real processors. Virtual processors, in turn, are mapped to real processors, usually by round-robin partitioning.

The idea is that even if one range had many more tuples than the others because of skew, these tuples would get split across multiple virtual processor ranges. Round-robin allocation of virtual processors to real processors would distribute the extra work among multiple real processors, so that one processor does not have to bear all the burden.

1.5 Parallelizing Individual Operations : [ University Exam – Dec. 2007 !!! ]

This section shows how various operations can be implemented in parallel in a shared-nothing architecture. We assume that each relation is horizontally partitioned across several disks, although this partitioning may or may not be appropriate for a given query. The evaluation of a query must take the initial partitioning criteria into account and repartition if necessary. We consider the following operations :

1. Bulk loading and scanning
2. Sorting

1.5.1 Bulk Loading and Scanning :

We begin with two simple operations : scanning a relation and loading a relation. Pages can be read in parallel while scanning a relation, and the retrieved tuples can then be merged, if the relation is partitioned across several disks. More generally, the idea also applies when retrieving all tuples that meet a selection condition. If hashing or range partitioning is used, selection queries can be answered by going to just those processors that contain relevant tuples.

A similar observation holds for bulk loading. Further, if a relation has associated indexes, any sorting of data entries required for building the indexes during bulk loading can also be done in parallel.

1.5.2 Sorting :

A simple idea is to let each CPU sort the part of the relation that is on its local disk and then merge these sorted sets of tuples. However, the degree of parallelism is likely to be limited by the merging phase.

A better idea is to first redistribute all tuples in the relation using range partitioning. For example, if we want to sort a collection of employee tuples by salary, salary values range from 10 to 210, and we have 20 processors, we could send all tuples with salary values in the range 10 to 20 to the first processor, all in the range 21 to 30 to the second processor, and so on.

Each processor then sorts the tuples assigned to it, using some sequential sorting algorithm. For example, a processor can collect tuples until its memory is full, then sort these tuples and write them out as a run, repeating until all incoming tuples have been written to such sorted runs on the local disk. These runs can then be merged to create the sorted version of the set of tuples assigned to this processor. The entire sorted relation can be retrieved by visiting the processors in an order corresponding to the ranges assigned to them and simply scanning the tuples.
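The whole scheme can be simulated in a few lines. The sketch below models processors as plain Python lists and runs their work in a loop; a real system would execute the local sorts truly in parallel, and the partition vector would be chosen as discussed in the next section.

```python
def parallel_sort(tuples, vector):
    """Simulate range-partitioning sort; vector has one splitter per boundary."""
    n = len(vector) + 1
    partitions = [[] for _ in range(n)]
    for t in tuples:                      # redistribution step
        i = sum(v <= t for v in vector)   # destination chosen by the vector
        partitions[i].append(t)
    for p in partitions:                  # each "processor" sorts locally,
        p.sort()                          # independently of the others
    result = []
    for p in partitions:                  # visiting processors in range order
        result.extend(p)                  # yields the fully sorted relation
    return result

print(parallel_sort([42, 7, 99, 13, 56, 4, 77], vector=[20, 60]))
# [4, 7, 13, 42, 56, 77, 99]
```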
1.5.2.1 Splitting Vector :

The basic challenge in parallel sorting is to do the range partitioning so that each processor receives roughly the same number of tuples; otherwise, a processor that receives a disproportionately large number of tuples to sort becomes a bottleneck and limits the scalability of the parallel sort.

One good approach to range partitioning is to obtain a sample of the entire relation by taking samples at each processor that initially contains part of the relation. The (relatively small) sample is sorted and used to identify ranges with equal numbers of tuples. This set of range values, called a splitting vector, is then distributed to all processors and used to range-partition the entire relation.

1.5.2.2 Application of Sorting :

A particularly important application of parallel sorting is sorting the data entries in tree-structured indexes. Sorting data entries can significantly speed up the process of bulk-loading an index.

1.6 Interquery Parallelism : [ University Exam – Dec. 2006, May 2007, Dec. 2009 !!! ]

Definition : In interquery parallelism, different queries or transactions execute in parallel with one another.

Transaction throughput can be increased by this form of parallelism. However, the response times of individual transactions are no faster than they would be if the transactions were run in isolation.

1.6.1 Working of Interquery Parallelism :

The primary use of interquery parallelism is to scale up a transaction-processing system to support a larger number of transactions per second.

Interquery parallelism is the easiest form of parallelism to support in a database system, particularly in a shared-memory parallel system. Database systems designed for single-processor systems can be used with few or no changes on a shared-memory parallel architecture. Since even sequential database systems support concurrent processing, transactions that would have operated in a time-shared concurrent manner on a sequential machine operate in parallel in the shared-memory parallel architecture.

Supporting interquery parallelism is more complicated in shared-disk or shared-nothing architectures. Processors have to perform some tasks, such as locking and logging, in a coordinated fashion, and that requires that they pass messages to each other. A parallel database system must also ensure that two processors do not update the same data independently at the same time. Further, when a processor accesses or updates data, the database system must ensure that the processor has the latest version of the data in its buffer pool. The problem of ensuring that the version is the latest is known as the cache-coherency problem.

1.6.2 Protocols used in Shared-Disk Systems :

Various protocols are available to guarantee cache coherency; often, cache-coherency protocols are integrated with concurrency-control protocols so that their overhead is reduced. One such protocol for a shared-disk system is this :

1. Before any read or write access to a page, a transaction locks the page in shared or exclusive mode, as appropriate. Immediately after the transaction obtains either a shared or exclusive lock on a page, it also reads the most recent copy of the page from the shared disk.
2. Before a transaction releases an exclusive lock on a page, it flushes the page to the shared disk; then, it releases the lock.

This protocol ensures that, when a transaction sets a shared or exclusive lock on a page, it gets the correct copy of the page.
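The protocol can be sketched as follows. The classes are illustrative single-threaded stand-ins, not a real lock manager or buffer manager; no lock contention or concurrency is modeled. The point is only the ordering of steps : read the newest copy immediately after locking, and flush before releasing an exclusive lock.

```python
class SharedDisk:
    """Stand-in for the shared disk: a dict of page_id -> contents."""
    def __init__(self):
        self.pages = {}
    def read(self, pid):
        return self.pages.get(pid)
    def write(self, pid, data):
        self.pages[pid] = data

class LockManager:
    """Stand-in lock table; contention between transactions is not modeled."""
    def __init__(self):
        self.table = {}
    def acquire(self, pid, mode):
        self.table[pid] = mode
    def release(self, pid):
        self.table.pop(pid, None)

class Processor:
    def __init__(self, locks, disk):
        self.locks, self.disk, self.buffer = locks, disk, {}

    def read_page(self, pid):
        self.locks.acquire(pid, mode="shared")
        self.buffer[pid] = self.disk.read(pid)   # step 1: re-read after locking
        return self.buffer[pid]

    def update_page(self, pid, data):
        self.locks.acquire(pid, mode="exclusive")
        self.buffer[pid] = self.disk.read(pid)   # step 1 applies to writes too
        self.buffer[pid] = data

    def release_exclusive(self, pid):
        self.disk.write(pid, self.buffer[pid])   # step 2: flush, then release
        self.locks.release(pid)

disk, locks = SharedDisk(), LockManager()
p1, p2 = Processor(locks, disk), Processor(locks, disk)
p1.update_page("page7", "new contents")
p1.release_exclusive("page7")
print(p2.read_page("page7"))                     # "new contents"
```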
1.6.3 Advantages of Complex Protocols :

More complex protocols avoid the repeated reading and writing to disk required by the preceding protocol. Such protocols do not write pages to disk when exclusive locks are released. When a shared or exclusive lock is obtained, if the most recent version of the page is in the buffer pool of some processor, the page is obtained from there. The protocols have to be designed to handle concurrent requests.

The shared-disk protocols can be extended to shared-nothing architectures by this scheme : each page has a home processor Pi and is stored on disk Di. When other processors want to read or write the page, they send requests to the home processor Pi of the page, since they cannot directly communicate with the disk. The other actions are the same as in the shared-disk protocols.

The Oracle 8 and Oracle Rdb systems are examples of shared-disk parallel database systems that support interquery parallelism.

1.7 Intraquery Parallelism : [ University Exam – Dec. 2006, May 2007 !!! ]

Definition : Intraquery parallelism refers to the execution of a single query in parallel on multiple processors and disks.

Using intraquery parallelism is important for speeding up long-running queries. Interquery parallelism does not help in this task, since each query is run sequentially.

1.7.1 Working of Intraquery Parallelism :

To illustrate the parallel evaluation of a query, consider a query that requires a relation to be sorted. Suppose that the relation has been partitioned across multiple disks by range partitioning on some attribute, and the sort is requested on the partitioning attribute. The sort operation can be implemented by sorting each partition in parallel, then concatenating the sorted partitions to get the final sorted relation. Thus, we can parallelize a query by parallelizing individual operations.

There is another source of parallelism in evaluating a query : the operator tree for a query can contain multiple operations. We can parallelize the evaluation of the operator tree by evaluating in parallel some of the operations that do not depend on one another. Further, we may be able to pipeline the output of one operation into another operation. The two operations can be executed in parallel on separate processors, one generating output that is consumed by the other, even as it is generated.

In summary, the execution of a single query can be parallelized in two ways :

1. Intraoperation parallelism : We can speed up processing of a query by parallelizing the execution of each individual operation, such as sort, select, project, and join.
2. Interoperation parallelism : We can speed up processing of a query by executing in parallel the different operations in a query expression.

1.7.2 Importance of Parallelism :

The two forms of parallelism are complementary and can be used simultaneously on a query. Since the number of operations in a typical query is small, compared to the number of tuples processed by each operation, the first form of parallelism can scale better with increasing parallelism. However, with the relatively small number of processors in typical parallel systems today, both forms of parallelism are important. In the following discussion of parallelization of queries, we assume that the queries are read only.
The choice of algorithms for parallelizing query evaluation depends on the machine architecture. Rather than presenting algorithms for each architecture separately, we use a shared-nothing architecture model in our description. Thus, we explicitly describe when data have to be transferred from one processor to another.

We can simulate this model easily on the other architectures, since transfer of data can be done via shared memory in a shared-memory architecture, and via shared disks in a shared-disk architecture. Hence, algorithms for shared-nothing architectures can be used on the other architectures too. We mention occasionally how the algorithms can be further optimized for shared-memory or shared-disk systems.

To simplify the presentation of the algorithms, assume that there are n processors, P0, P1, ..., Pn-1, and n disks D0, D1, ..., Dn-1, where disk Di is associated with processor Pi. A real system may have multiple disks per processor. It is not hard to extend the algorithms to allow multiple disks per processor : we simply allow Di to be a set of disks. However, for simplicity, we assume here that Di is a single disk.

1.8 Intraoperation Parallelism : [ University Exam – Dec. 2007 !!! ]

Since relational operations work on relations containing large sets of tuples, we can parallelize the operations by executing them in parallel on different subsets of the relations. Since the number of tuples in a relation can be large, the degree of parallelism is potentially enormous. Thus, intraoperation parallelism is natural in a database system.

1.8.1 Parallel Sort : [ University Exam – May 2007, Dec. 2007 !!! ]

Suppose that we wish to sort a relation that resides on n disks D0, D1, ..., Dn-1. If the relation has been range-partitioned on the attributes on which it is to be sorted, then we can sort each partition separately, and can concatenate the results to get the full sorted relation. Since the tuples are partitioned on n disks, the time required for reading the entire relation is reduced by the parallel access.

If the relation has been partitioned in any other way, we can sort it in one of two ways :

1. We can range-partition it on the sort attributes, and then sort each partition separately.
2. We can use a parallel version of the external sort-merge algorithm.

1.8.1.1 Range-Partitioning Sort :

Range-partitioning sort works in two steps : first range-partitioning the relation, then sorting each partition separately. When we sort by range partitioning the relation, it is not necessary to range-partition the relation on the same set of processors or disks as those on which the relation is stored. Suppose that we choose processors P0, P1, ..., Pm, where m < n, to sort the relation. There are two steps involved in this operation :

1. Redistribute the tuples in the relation, using a range-partition strategy, so that all tuples that lie within the i-th range are sent to processor Pi, which stores the relation temporarily on disk Di. To implement range partitioning, in parallel every processor reads the tuples from its disk and sends the tuples to their destination processor. Each processor P0, P1, ..., Pm also receives tuples belonging to its partition, and stores them locally. This step requires disk I/O and communication overhead.
2. Each of the processors sorts its partition of the relation locally, without interaction with the other processors.
Each processor executes the same operation, namely sorting, on a different data set. (Execution of the same operation in parallel on different sets of data is called data parallelism.)

The final merge operation is trivial, because the range partitioning in the first phase ensures that, for i < j, the key values in processor Pi are all less than the key values in Pj.

We must do range partitioning with a good range-partition vector, so that each partition will have approximately the same number of tuples. Virtual processor partitioning can also be used to reduce skew.

1.8.1.2 Parallel External Sort-Merge : [ University Exam – Dec. 2009 !!! ]

Parallel external sort-merge is an alternative to range-partitioning sort. Suppose that a relation has already been partitioned among disks D0, D1, ..., Dn-1 (it does not matter how the relation has been partitioned). Parallel external sort-merge then works this way :

1. Each processor Pi locally sorts the data on disk Di.
2. The system then merges the sorted runs on each processor to get the final sorted output.

The merging of the sorted runs in step 2 can be parallelized by this sequence of actions :

1. The system range-partitions the sorted partitions at each processor Pi (all by the same partition vector) across the processors P0, P1, ..., Pm-1. It sends the tuples in sorted order, so that each processor receives the tuples in sorted streams.
2. Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
3. The system concatenates the sorted runs on processors P0, P1, ..., Pm-1 to get the final result.

As described, this sequence of actions results in an interesting form of execution skew, since at first every processor sends all blocks of partition 0 to P0, then every processor sends all blocks of partition 1 to P1, and so on. Thus, while sending happens in parallel, receiving tuples becomes sequential : first only P0 receives tuples, then only P1 receives tuples, and so on. To avoid this problem, each processor repeatedly sends a block of data to each partition. In other words, each processor sends the first block of every partition, then the second block of every partition, and so on. As a result, all processors receive data in parallel.

Some machines, such as the Teradata DBC series machines, use specialized hardware to perform merging. The Y-net interconnection network in the Teradata DBC machines can merge output from multiple processors to give a single sorted output.
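The block-interleaved send order is easy to express. In the sketch below (illustrative names only), blocks_per_partition[j] holds the blocks a given sender has destined for processor Pj; emitting one block per destination per round keeps every receiver busy, instead of completing partition 0 before partition 1.

```python
def send_order(blocks_per_partition):
    """Return (destination, block) pairs in block-interleaved order."""
    order = []
    rounds = max(len(b) for b in blocks_per_partition)
    for r in range(rounds):                 # round r: one block per partition
        for j, blocks in enumerate(blocks_per_partition):
            if r < len(blocks):
                order.append((j, blocks[r]))
    return order

# Two destination partitions, with three and two blocks respectively:
print(send_order([["b00", "b01", "b02"], ["b10", "b11"]]))
# [(0, 'b00'), (1, 'b10'), (0, 'b01'), (1, 'b11'), (0, 'b02')]
```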
1.8.2 Parallel Join : [ University Exam – Dec. 2006, May 2007, Dec. 2007 !!! ]

The join operation requires that the system test pairs of tuples to see whether they satisfy the join condition; if they do, the system adds the pair to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. Then, the system collects the results from each processor to produce the final result.

1.8.2.1 Partitioned Join :

For certain kinds of joins, such as equijoins and natural joins, it is possible to partition the two input relations across the processors, and to compute the join locally at each processor. Suppose that we are using n processors, and that the relations to be joined are r and s. Partitioned join then works this way : the system partitions the relations r and s each into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1, and sends partitions ri and si to processor Pi, where their join is computed locally.

The partitioned join technique works correctly only if the join is an equijoin and if we partition r and s by the same partitioning function on their join attributes. The idea of partitioning is exactly the same as that behind the partitioning step of hash-join. In a partitioned join, however, there are two different ways of partitioning r and s :

1. Range partitioning on the join attributes
2. Hash partitioning on the join attributes

In either case, the same partitioning function must be used for both relations. For range partitioning, the same partition vector must be used for both relations. For hash partitioning, the same hash function must be used on both relations. Fig. 1.5 depicts the partitioning in a partitioned parallel join.

Once the relations are partitioned, we can use any join technique locally at each processor Pi to compute the join of ri and si. For example, hash-join, merge-join, or nested-loop join could be used. Thus, we can use partitioning to parallelize any join technique.

Fig. 1.5 : Partitioned parallel join

If one or both of the relations r and s are already partitioned on the join attributes (by either hash partitioning or range partitioning), the work needed for partitioning is reduced greatly. If the relations are not partitioned, or are partitioned on attributes other than the join attributes, then the tuples need to be repartitioned. Each processor Pi reads in the tuples on disk Di, computes for each tuple t the partition j to which t belongs, and sends tuple t to processor Pj. Processor Pj stores the tuples on disk Dj. We can optimize the join algorithm used locally at each processor to reduce I/O by buffering some of the tuples in memory, instead of writing them to disk.

Skew presents a special problem when range partitioning is used, since a partition vector that splits one relation of the join into equal-sized partitions may split the other relation into partitions of widely varying size. The partition vector should be such that |ri| + |si| (that is, the sum of the sizes of ri and si) is roughly equal over all i = 0, 1, ..., n - 1. With a good hash function, hash partitioning is likely to have a smaller skew, except when there are many tuples with the same values for the join attributes.

1.8.2.2 Fragment and Replicate Join :

Partitioning is not applicable to all types of joins. For instance, if the join condition is an inequality, such as r ⋈ r.a < s.b s, it is possible that all tuples in r join with some tuple in s (and vice versa). Thus, there may be no easy way of partitioning r and s so that tuples in partition ri join with only tuples in partition si. We can parallelize such joins by using a technique called fragment and replicate. We first consider a special case of fragment and replicate join :

Asymmetric fragment and replicate join : It works as follows :

1. The system partitions one of the relations, say r. Any partitioning technique can be used on r, including round-robin partitioning.
2. The system replicates the other relation, s, across all the processors.
3. Processor Pi then locally computes the join of ri with all of s, using any join technique.

The asymmetric fragment-and-replicate scheme appears in Fig. 1.6(a). If r is already stored by partitioning, there is no need to partition it further in step 1; all that is required is to replicate s across all processors.

Fig. 1.6(a) : Asymmetric fragment and replicate
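A small simulation of the asymmetric scheme for an inequality join such as r ⋈ r.a < s.b s. Processors are again modeled as a loop over fragments, and the local join is plain nested loops for brevity; any local join technique would do.

```python
def asymmetric_fragment_replicate(r_fragments, s, condition):
    """r stays fragmented as stored; s is replicated to every processor."""
    result = []
    for r_i in r_fragments:            # one iteration per processor Pi
        s_copy = list(s)               # s is replicated to Pi
        for tr in r_i:
            for ts in s_copy:          # nested-loop join on the local fragment
                if condition(tr, ts):
                    result.append((tr, ts))
    return result

r_parts = [[1, 4], [7, 10]]            # r already partitioned across 2 processors
s = [5, 8]
print(asymmetric_fragment_replicate(r_parts, s, lambda a, b: a < b))
# [(1, 5), (1, 8), (4, 5), (4, 8), (7, 8)]
```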
Fragment and replicate join : The general case of fragment and replicate join appears in Fig. 1.6(b). It works as follows : the system partitions relation r into n partitions, r0, r1, ..., rn-1, and partitions s into m partitions, s0, s1, ..., sm-1. As before, any partitioning technique may be used on r and on s. The values of m and n do not need to be equal, but they must be chosen so that there are at least m * n processors.

Asymmetric fragment and replicate is simply a special case of general fragment and replicate, where m = 1. Fragment and replicate reduces the sizes of the relations at each processor, compared to asymmetric fragment and replicate.

Fig. 1.6(b) : Fragment and replicate schemes

Fragment and replicate works with any join condition, since every tuple in r can be tested with every tuple in s. Thus, it can be used where partitioning cannot be.

Fragment and replicate usually has a higher cost than partitioning when both relations are of roughly the same size, since at least one of the relations has to be replicated. However, if one of the relations, say s, is small, it may be cheaper to replicate s across all processors, rather than to repartition r and s on the join attributes. In such a case, asymmetric fragment and replicate is preferable, even though partitioning could be used.

1.8.2.3 Partitioned Parallel Hash-Join :

The partitioned hash-join can be parallelized. Suppose that we have n processors, P0, P1, ..., Pn-1, and two relations r and s, such that the relations r and s are partitioned across multiple disks. If the size of s is less than that of r, the parallel hash-join algorithm proceeds this way :

1. Choose a hash function, say h1, that takes the join-attribute value of each tuple in r and s and maps the tuple to one of the n processors. Let ri denote the tuples of relation r that are mapped to processor Pi; similarly, let si denote the tuples of relation s that are mapped to processor Pi. Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor on the basis of hash function h1.
2. As the destination processor Pi receives the tuples of si, it further partitions them by another hash function, h2, which the processor uses to compute the hash-join locally. The partitioning at this stage is exactly the same as in the partitioning phase of the sequential hash-join algorithm. Each processor Pi executes this step independently from the other processors.
3. Once the tuples of s have been distributed, the system redistributes the larger relation r across the n processors by the hash function h1, in the same way as before. As it receives each tuple, the destination processor repartitions it by the function h2, just as the probe relation is partitioned in the sequential hash-join algorithm.
4. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s to produce a partition of the final result of the hash-join.

The hash-join at each processor is independent of that at other processors, and receiving the tuples of ri and si is similar to reading them from disk. We can use hybrid hash-join to cache some of the incoming tuples in memory, and thus avoid the costs of writing them and of reading them back in.
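The four steps collapse naturally into a short simulation. In the sketch below (hypothetical names), h1 is an ordinary hash modulo n, and the local h2 build/probe phases are folded into one Python dictionary per processor; key_r and key_s extract the join-attribute value from a tuple.

```python
def parallel_hash_join(r, s, n, key_r, key_s):
    """Simulate partitioned parallel hash-join across n processors."""
    h1 = lambda k: hash(k) % n
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in s:                              # step 1: distribute the smaller s
        s_parts[h1(key_s(t))].append(t)
    for t in r:                              # step 3: redistribute r the same way
        r_parts[h1(key_r(t))].append(t)
    result = []
    for i in range(n):                       # step 4: local join at processor Pi
        build = {}
        for t in s_parts[i]:                 # build phase on the local s-partition
            build.setdefault(key_s(t), []).append(t)
        for t in r_parts[i]:                 # probe phase on the local r-partition
            for m in build.get(key_r(t), []):
                result.append((t, m))
    return result

r = [("r1", 10), ("r2", 20), ("r3", 10)]
s = [("s1", 10), ("s2", 30)]
print(parallel_hash_join(r, s, n=3, key_r=lambda t: t[1], key_s=lambda t: t[1]))
# the two r-tuples with key 10 join the matching s-tuple
```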
1.8.2.4 Parallel Nested-Loop Join :

To illustrate the use of fragment-and-replicate-based parallelization, consider the case where the relation s is much smaller than relation r. Suppose that relation r is stored by partitioning; the attribute on which it is partitioned does not matter. Suppose too that there is an index on a join attribute of relation r at each of the partitions of relation r.

We use asymmetric fragment and replicate, with relation s being replicated and with the existing partitioning of relation r. Each processor Pj at which a partition of relation s is stored reads the tuples of relation s stored on disk Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.

Now, each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r. We can overlap the indexed nested-loop join with the distribution of tuples of relation s, to reduce the costs of writing the tuples of relation s to disk, and of reading them back. However, the replication of relation s must be synchronized with the join so that there is enough space in the in-memory buffers at each processor Pi to hold the tuples of relation s that have been received but that have not yet been used in the join.

1.8.3 Other Relational Operations : [ University Exam – Dec. 2006 !!! ]

The evaluation of other relational operations also can be parallelized :

Selection : Let the selection be σθ(r). Consider first the case where θ is of the form ai = v, where ai is an attribute and v is a value. If the relation r is partitioned on ai, the selection proceeds at a single processor. If θ is of the form l <= ai <= u, that is, θ is a range selection, and the relation has been range-partitioned on ai, then the selection proceeds at each processor whose partition overlaps with the specified range of values. In all other cases, the selection proceeds in parallel at all the processors.

Duplicate elimination : Duplicates can be eliminated by sorting; either of the parallel sort techniques can be used, optimized to eliminate duplicates as soon as they appear during sorting. We can also parallelize duplicate elimination by partitioning the tuples (by either range or hash partitioning) and eliminating duplicates locally at each processor.

Projection : Projection without duplicate elimination can be performed as tuples are read in from disk in parallel. If duplicates are to be eliminated, either of the techniques just described can be used.

Aggregation : Consider an aggregation operation. We can parallelize the operation by partitioning the relation on the grouping attributes, and then computing the aggregate values locally at each processor. Either hash partitioning or range partitioning can be used. If the relation is already partitioned on the grouping attributes, the first step can be skipped.

We can reduce the cost of transferring tuples during partitioning by partly computing aggregate values before partitioning, at least for the commonly used aggregate functions. Consider an aggregation operation on a relation r, using the sum aggregate function on attribute B, with grouping on attribute A. The system can perform the operation at each processor Pi on those r tuples stored on disk Di. This computation results in tuples with partial sums at each processor; there is one tuple at Pi for each value for attribute A present in r tuples stored on Di. The system partitions the result of the local aggregation on the grouping attribute A, and performs the aggregation again (on tuples with the partial sums) at each processor Pi to get the final result. As a result of this optimization, fewer tuples need to be sent to other processors during partitioning.
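The partial-sum optimization for the sum aggregate can be sketched directly. local_partial_sums is the per-processor first phase, and merge_partials stands for the reaggregation performed after the (small) partial results have been repartitioned on the grouping attribute; both names are illustrative.

```python
from collections import defaultdict

def local_partial_sums(tuples):             # phase 1, run at each processor Pi
    partial = defaultdict(int)
    for group, value in tuples:             # tuples are (A, B) pairs
        partial[group] += value
    return dict(partial)

def merge_partials(partials):               # phase 2, after repartitioning
    final = defaultdict(int)
    for p in partials:
        for group, subtotal in p.items():
            final[group] += subtotal
    return dict(final)

disk0 = [("x", 1), ("y", 2), ("x", 3)]      # r tuples on disk D0
disk1 = [("x", 5), ("z", 7)]                # r tuples on disk D1
print(merge_partials([local_partial_sums(disk0), local_partial_sums(disk1)]))
# {'x': 9, 'y': 2, 'z': 7}
```

Only one partial tuple per group leaves each processor, rather than one tuple per input row, which is the saving the text describes.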
1.8.4 Cost of Parallel Evaluation of Operations : [ University Exam – May 2007 !!! ]

We achieve parallelism by partitioning the I/O among multiple disks, and by partitioning the CPU work among multiple processors. If such a split is achieved without any overhead, and if there is no skew in the splitting of work, a parallel operation using n processors will take 1/n times as long as the same operation on a single processor. We already know how to estimate the cost of an operation such as a join or a selection. The time cost of parallel processing would then be 1/n of the time cost of sequential processing of the operation. We must also account for the following costs :

o Startup costs for initiating the operation at multiple processors.
o Skew in the distribution of work among the processors, with some processors getting a larger number of tuples than others.
o Contention for resources, such as memory, disk, and the communication network, resulting in delays.
o Cost of assembling the final result by transmitting partial results from each processor.

The time taken by a parallel operation can be estimated as

Tpart + Tasm + max(T0, T1, ..., Tn-1)

where Tpart is the time for partitioning the relations, Tasm is the time for assembling the results, and Ti is the time taken for the operation at processor Pi.

Assuming that the tuples are distributed without any skew, the number of tuples sent to each processor can be estimated as 1/n of the total number of tuples. This estimate will be optimistic, since skew is common. Even though breaking down a single query into a number of parallel steps reduces the size of the average step, it is the time for processing the single slowest step that determines the time taken for processing the query as a whole. A partitioned parallel evaluation, for instance, is only as fast as the slowest of the parallel executions. Thus, any skew in the distribution of the work across processors greatly affects performance.

The problem of skew in partitioning is closely related to the problem of partition overflow in sequential hash-joins. We can use overflow resolution and avoidance techniques developed for hash-joins to handle skew when hash partitioning is used. We can use balanced range partitioning and virtual processor partitioning to minimize skew due to range partitioning.
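The estimate is simple to evaluate, and even a toy calculation (with assumed per-processor times) shows how skew dominates : one overloaded processor sets the cost of the whole operation.

```python
def parallel_cost(t_part, t_asm, t_i):
    """Tpart + Tasm + max(T0, ..., Tn-1), with all times in the same units."""
    return t_part + t_asm + max(t_i)

print(parallel_cost(5, 3, [25, 25, 25, 25]))   # balanced: 33
print(parallel_cost(5, 3, [10, 10, 10, 70]))   # skewed: 78, slowest step dominates
```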
1.9 Interoperation Parallelism : [ University Exam – Dec. 2007, Dec. 2009 !!! ]

There are two forms of interoperation parallelism :

1. Pipelined parallelism
2. Independent parallelism

1.9.1 Pipelined Parallelism :

Pipelining forms an important source of economy of computation for database query processing. In pipelining, the output tuples of one operation, A, are consumed by a second operation, B, even before the first operation has produced the entire set of tuples in its output. The major advantage of pipelined execution in a sequential evaluation is that we can carry out a sequence of such operations without writing any of the intermediate results to disk.

Parallel systems use pipelining primarily for the same reason that sequential systems do. However, pipelines are a source of parallelism as well, in the same way that instruction pipelines are a source of parallelism in hardware design. It is possible to run operations A and B simultaneously on different processors, so that B consumes tuples in parallel with A producing them. This form of parallelism is called pipelined parallelism.

Consider a join of four relations :

r1 ⋈ r2 ⋈ r3 ⋈ r4

We can set up a pipeline that allows the three joins to be computed in parallel. Suppose processor P1 is assigned the computation of temp1 = r1 ⋈ r2, and P2 is assigned the computation of temp1 ⋈ r3. As P1 computes tuples in r1 ⋈ r2, it makes these tuples available to processor P2. Thus, P2 has available to it some of the tuples in r1 ⋈ r2 before P1 has finished its computation. P2 can use those tuples that are available to begin computation of temp1 ⋈ r3, even before r1 ⋈ r2 is fully computed by P1. Likewise, as P2 computes tuples in (r1 ⋈ r2) ⋈ r3, it makes these tuples available to P3, which computes the join of these tuples with r4.

Pipelined parallelism is useful with a small number of processors, but does not scale up well. First, pipeline chains generally do not attain sufficient length to provide a high degree of parallelism. Second, it is not possible to pipeline relational operators that do not produce output until all inputs have been accessed, such as the set-difference operation. Third, only marginal speedup is obtained for the frequent cases in which one operator's execution cost is much higher than those of the others.

All things considered, when the degree of parallelism is high, pipelining is a less important source of parallelism than partitioning. The real reason for using pipelining is that pipelined executions can avoid writing intermediate results to disk.

1.9.2 Independent Parallelism :

Operations in a query expression that do not depend on one another can be executed in parallel. This form of parallelism is called independent parallelism.

Consider the join r1 ⋈ r2 ⋈ r3 ⋈ r4. Clearly, we can compute temp1 = r1 ⋈ r2 in parallel with temp2 = r3 ⋈ r4. When these two computations complete, we compute temp1 ⋈ temp2. To obtain further parallelism, we can pipeline the tuples in temp1 and temp2 into the computation of temp1 ⋈ temp2, which is itself carried out by a pipelined join.

Like pipelined parallelism, independent parallelism does not provide a high degree of parallelism and is less useful in a highly parallel system, although it is useful with a lower degree of parallelism.
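Pipelining of this kind is conveniently modeled with Python generators : each join stage probes with input tuples as they arrive, rather than waiting for its full input. This only models the tuple-at-a-time dataflow; real pipelined parallelism runs the stages on different processors, and the function names here are illustrative.

```python
def scan(relation):
    """Stream a stored relation tuple by tuple."""
    for t in relation:
        yield t

def pipelined_join(left, right_relation, key_l, key_r):
    """Build a hash table on the stored relation, probe with streamed tuples."""
    build = {}
    for t in right_relation:                  # build on the stored input
        build.setdefault(key_r(t), []).append(t)
    for t in left:                            # probe tuples as they stream in
        for m in build.get(key_l(t), []):
            yield t + m                       # emit joined tuples immediately

r1 = [(1, "a"), (2, "b")]
r2 = [(1, "x"), (2, "y")]
r3 = [(1, "p")]
temp1 = pipelined_join(scan(r1), r2, key_l=lambda t: t[0], key_r=lambda t: t[0])
result = pipelined_join(temp1, r3, key_l=lambda t: t[0], key_r=lambda t: t[0])
print(list(result))    # [(1, 'a', 1, 'x', 1, 'p')]
```

Because temp1 is a generator, the second join starts consuming its tuples before the first join has finished, which is exactly the behaviour described for P1 and P2 above. An operator such as set difference could not be written this way, since it must see all its input before producing any output.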
To evaluate an operator tree in a parallel system, we must make the following decisions :

o How to parallelize each operation, and how many processors to use for it.
o What operations to pipeline across different processors, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.

These decisions constitute the task of scheduling the execution tree. Determining the resources of each kind, such as processors, disks, and memory, that should be allocated to each operation in the tree is another aspect of the optimization problem. For instance, it may appear wise to use the maximum amount of parallelism available, but it is a good idea not to execute certain operations in parallel. Operations whose computational requirements are significantly smaller than the communication overhead should be clustered with one of their neighbors; otherwise, the advantage of parallelism is negated by the overhead of communication.

One concern is that long pipelines do not lend themselves to good resource utilization. Unless the operations are coarse grained, the final operation of the pipeline may wait for a long time to get inputs, while holding precious resources, such as memory. Hence, long pipelines should be avoided.

The number of parallel evaluation plans from which to choose is much larger than the number of sequential evaluation plans. Optimizing parallel queries by considering all alternatives is therefore much more expensive than optimizing sequential queries. Hence, we usually adopt heuristic approaches to reduce the number of parallel execution plans that we have to consider. We describe two popular heuristics here.

1.9.3.1 Types of Heuristics for Evaluation :

The first heuristic is to consider only evaluation plans that parallelize every operation across all processors, and that do not use any pipelining. This approach is used in the Teradata DBC series of machines. Finding the best such execution plan is like doing query optimization in a sequential system; the main differences lie in how the partitioning is performed and what cost-estimation formula is used.

The second heuristic is to choose the most efficient sequential evaluation plan, and then to parallelize the operations in that plan. The Volcano parallel database popularized a model of parallelization called the exchange-operator model. This model uses existing implementations of operations, operating on local copies of data, coupled with an exchange operation that moves data around between different processors. Exchange operators can be introduced into an evaluation plan to transform it into a parallel evaluation plan; a small sketch appears at the end of this subsection.

Another dimension of optimization is the design of physical-storage organization to speed up queries. The optimal physical organization differs for different queries, so the database administrator must choose a physical organization that appears to be good for the expected mix of database queries. Thus, the area of parallel query optimization is complex, and it is still an area of active research.
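The exchange-operator idea mentioned above can be sketched in a few lines of Python. This is our own illustration, not Volcano's actual interface : an exchange repartitions a tuple stream by hash, so that an unmodified sequential operator can then be run on each partition independently :

    import itertools

    def split(stream, n, key):
        # exchange, step 1 : route each tuple to one of n partitions by hash
        partitions = [[] for _ in range(n)]
        for t in stream:
            partitions[hash(key(t)) % n].append(t)
        return partitions

    def merge(streams):
        # exchange, step 2 : combine partition streams into a single stream
        return itertools.chain.from_iterable(streams)

    def distinct(stream):
        # an unchanged sequential operator (duplicate elimination); it knows
        # nothing about parallelism -- the exchange does all the data movement
        return set(stream)

    names = ["Smith", "Jones", "Smith", "Patel"]
    parts = split(names, n=3, key=lambda t: t)
    print(sorted(merge(distinct(p) for p in parts)))   # ['Jones', 'Patel', 'Smith']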
1.10 Design of Parallel System : [ University Exam – Dec. 2006, May 2007 !!! ]

We have studied the concepts of data storage and query processing. Since large-scale parallel database systems are used primarily for storing large volumes of data and for processing decision-support queries on those data, these topics are the most important in a parallel database system.

Parallel loading of data from external sources is an important requirement if we are to handle large volumes of incoming data. A large parallel database system should focus on the following availability issues :

o Resilience to failure of some processors or disks
o Online reorganization of data and schema changes

1.10.1 Resilience to Failure of Some Processors or Disks :

Failure rate : With a large number of processors and disks, the probability that at least one processor or disk will malfunction is significantly greater than in a single-processor system with one disk. A poorly designed parallel system will stop functioning if any component (processor or disk) fails. Assuming that the probability of failure of a single processor or disk is small, the probability of failure of the system goes up linearly with the number of processors and disks. If a single processor or disk would fail once every 5 years, a system with 100 processors would have a failure every 18 days.

For example, the large-scale parallel database systems, such as Compaq Himalaya, Teradata, and Informix XPS (now a division of IBM), are designed to operate even if a processor or disk fails. Data are replicated across at least two processors. If a processor fails, the data that it stored can still be accessed from the other processors. The system keeps track of failed processors and distributes the work among functioning processors. Requests for data stored at the failed site are automatically routed to the backup sites that store a replica of the data. If all the data of a processor A were replicated at a single processor B, B would have to handle all the requests to A as well as those to itself, and B would become a bottleneck. Therefore, the replicas of the data of a processor are partitioned across multiple other processors.

1.10.2 Problems of Large Databases :

When we are dealing with large volumes of data (ranging into the terabytes), simple operations, such as creating indices, and changes to schema, such as adding a column to a relation, can take a long time, perhaps hours or even days. Therefore, it is unacceptable for the database system to be unavailable while such operations are in progress. Many parallel database systems, such as the Compaq Himalaya systems, allow such operations to be performed online, that is, while the system is executing other transactions.

1.10.3 Online Index Construction :

A system that supports online index construction allows insertions, deletions, and updates on a relation even while an index is being built on that relation. The index-building operation therefore cannot lock the entire relation in shared mode, as it would have done otherwise. Instead, the process keeps track of updates that occur while it is active, and incorporates the changes into the index being constructed.

1.11 Effective Guidelines :

Parallel database machine architectures have evolved from the use of exotic hardware to a software parallel dataflow architecture based on conventional shared-nothing hardware. These new designs provide impressive speedup and scaleup when processing relational database queries.

Introduction : Highly parallel database systems are beginning to displace traditional mainframe computers for the largest database and transaction processing tasks. Ten years ago the future of highly parallel database machines seemed gloomy, even to their staunchest advocates.
Most database machine research had focused on specialized, often trendy, hardware such as CCD memories, bubble memories, head-per-track disks, and optical disks. None of these technologies fulfilled their promises, so there was a sense that conventional CPUs, electronic RAM, and moving-head magnetic disks would dominate the scene for many years to come. At that time, disk throughput was predicted to double while processor speeds were predicted to increase by much larger factors. Consequently, critics predicted that multi-processor systems would soon be I/O limited unless a solution to the I/O bottleneck were found.

While these predictions were fairly accurate about the future of hardware, the critics were certainly wrong about the overall future of parallel database systems. Over the last decade Teradata, Tandem, and a host of startup companies have successfully developed and marketed highly parallel database machines.

Why have parallel database systems become more than a research curiosity ? One explanation is the widespread adoption of the relational data model. In 1983 relational database systems were just appearing in the marketplace; today they dominate it. Relational queries are ideally suited to parallel execution; they consist of uniform operations applied to uniform streams of data. Each operator produces a new relation, so the operators can be composed into highly parallel dataflow graphs. By streaming the output of one operator into the input of another operator, the two operators can work in series, giving pipelined parallelism. By partitioning the input data among multiple processors and memories, an operator can often be split into many independent operators, each working on a part of the data. This partitioned data and execution gives partitioned parallelism (Fig. 1.7).

The dataflow approach to database system design needs a message-based client-server operating system to interconnect the parallel processes executing the relational operators. This in turn requires a high-speed network to interconnect the parallel processors. Such facilities seemed exotic a decade ago, but now they are the mainstream of computer architecture. The client-server paradigm using high-speed LANs is the basis for most PC, workstation, and workgroup software, and those same client-server mechanisms are an excellent basis for distributed database technology.

Fig. 1.7

Mainframe designers have found it difficult to build machines powerful enough to meet the CPU and I/O demands of relational databases serving large numbers of simultaneous users or searching terabyte databases. Meanwhile, multi-processors based on fast and inexpensive microprocessors have become widely available from vendors including Encore, Intel, NCR, nCUBE, Sequent, Tandem, Teradata, and Thinking Machines. These machines provide more total power than their mainframe counterparts at a lower price. Their modular architectures enable systems to grow incrementally, adding MIPS, memory, and disks either to speed up the processing of a given job, or to scale up the system to process a larger job in the same time.

In retrospect, special-purpose database machines have indeed failed; but parallel database systems are a big success. The successful parallel database systems are built from conventional processors, memories, and disks.
They have emerged as major consumers of highly parallel architectures, and are in an excellent position to exploit the massive numbers of fast, cheap commodity disks, processors, and memories promised by current technology forecasts.

A consensus on parallel and distributed database system architecture has emerged. This architecture is based on a shared-nothing hardware design [STON86] in which processors communicate with one another only by sending messages via an interconnection network. In such systems, the tuples of each relation in the database are partitioned (declustered) across disk storage units attached directly to each processor. Partitioning allows multiple processors to scan large relations in parallel without needing any exotic I/O devices. Such architectures were pioneered by Teradata in the late seventies and by several research projects. This design is now used by Teradata, Tandem, NCR, Oracle-nCUBE, and several other products currently under development. The research community has also embraced this shared-nothing dataflow architecture in systems like Arbre, Bubba, and Gamma.

1.12 Basic Techniques for Parallel Database Machine Implementation : [ University Exam – Dec. 2007 !!! ]

1.12.1 Parallelism Goals and Metrics : Speedup and Scaleup :

The ideal parallel system demonstrates two key properties :
(1) Linear speedup : Twice as much hardware can perform the task in half the elapsed time, and
(2) Linear scaleup : Twice as much hardware can perform twice as large a task in the same elapsed time (see Fig. 1.8).

Fig. 1.8 : Speedup and scaleup

A speedup design performs a one-hour job four times faster when run on a four-times larger system. A scaleup design runs a ten-times bigger job in the same time on a ten-times bigger system.

More formally, given a fixed job run on a small system and then on a larger system, the speedup given by the larger system is measured as :

Speedup = small_system_elapsed_time / big_system_elapsed_time

Speedup is said to be linear if an N-times larger or more expensive system yields a speedup of N. Speedup holds the problem size constant and grows the system.

Scaleup measures the ability to grow both the system and the problem. Scaleup is defined as the ability of an N-times larger system to perform an N-times larger job in the same elapsed time as the original system. The scaleup metric is :

Scaleup = small_system_elapsed_time_on_small_problem / big_system_elapsed_time_on_big_problem

If this scaleup equation evaluates to 1, the scaleup is said to be linear.

There are two distinct kinds of scaleup, batch and transactional. If the job consists of performing many small independent requests submitted by many clients and operating on a shared database, then scaleup consists of N-times as many clients submitting N-times as many requests against an N-times larger database. This is the scaleup typically found in transaction processing systems and timesharing systems. This form of scaleup is used by the Transaction Processing Performance Council to scale up their transaction processing benchmarks [GRAY91]; consequently, it is called transaction scaleup. Transaction scaleup is ideally suited to parallel systems, since each transaction is typically a small independent job that can be run on a separate processor.
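Both metrics are simple ratios, so a measured configuration can be checked against the linear ideal in a few lines of Python; the timings below are invented for illustration :

    def speedup(small_elapsed, big_elapsed):
        # speedup = small_system_elapsed_time / big_system_elapsed_time
        return small_elapsed / big_elapsed

    def scaleup(small_on_small, big_on_big):
        # scaleup = small_system_elapsed_time_on_small_problem /
        #           big_system_elapsed_time_on_big_problem
        return small_on_small / big_on_big

    # A one-hour job runs in 16 minutes on a four-times larger system :
    print(speedup(60.0, 16.0))      # 3.75 -- sublinear (linear would be 4.0)
    # The four-times larger system runs a four-times larger job in one hour :
    print(scaleup(60.0, 60.0))      # 1.0 -- linear scaleup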
A second form of scaleup, called batch scaleup, arises when the scaleup task is presented as a single large job. This is typical of database queries and of scientific simulations. In these cases, scaleup consists of using an N-times larger computer to solve an N-times larger problem. For database systems, batch scaleup translates to the same query on an N-times larger database; for scientific problems, it translates to the same calculation on an N-times finer grid or on an N-times longer simulation.

The generic barriers to linear speedup and linear scaleup are the triple threats of :

o Startup : The time needed to start a parallel operation. If thousands of processes must be started, this can easily dominate the actual computation time.
o Interference : The slowdown each new process imposes on all others when accessing shared resources.
o Skew : As the number of parallel steps increases, the average size of each step decreases, but the variance can well exceed the mean. The service time of a job is the service time of its slowest step. When the variance dominates the mean, increased parallelism improves elapsed time only slightly.

Fig. 1.9

1.12.2 Hardware Architecture, the Trend to Shared-Nothing Machines :

The ideal database machine would have a single infinitely fast processor with an infinite memory of infinite bandwidth, and it would be infinitely cheap (free). Given such a machine, there would be no need for speedup, scaleup, or parallelism. Unfortunately, technology is not delivering such machines, but it is coming close. Technology is promising to deliver fast one-chip processors, fast high-capacity disks, and high-capacity electronic RAM memories. It also promises that each of these devices will be very inexpensive by today's standards, costing only hundreds of dollars each.

So, the challenge is to build an infinitely fast processor out of infinitely many processors of finite speed, and to build an infinitely large memory with infinite memory bandwidth from infinitely many storage units of finite speed. This sounds trivial mathematically; but in practice, when a new processor is added to most computer designs, it slows every other processor down just a little bit. If this slowdown (interference) is 1%, then the maximum speedup is 37, and a thousand-processor system has about 4% of the effective power of a single-processor system.

How can we build scaleable multi-processor systems ? Stonebraker suggested the following simple taxonomy for the spectrum of designs :

o Shared-memory : All processors share direct access to a common global memory and to all disks. The IBM/370, Digital VAX, and Sequent Symmetry multi-processors typify this design.
o Shared-disk : Each processor has a private memory but has direct access to all disks. The IBM Sysplex and the original Digital VAXcluster typify this design.
o Shared-nothing : Each memory and disk is owned by some processor that acts as a server for that data. Mass storage in such an architecture is distributed among the processors by connecting one or more disks to each. The Teradata, Tandem, and nCUBE machines typify this design.

Shared-nothing architectures minimize interference by minimizing resource sharing. They also exploit commodity processors and memory without needing an incredibly powerful interconnection network. As Fig. 1.10 suggests, the other architectures move large quantities of data through the interconnection network; the shared-nothing design moves only questions and answers through the network.
Raw memory accesses and raw disk accesses are performed locally in a processor, and only the filtered (reduced) data is passed to the client program. This allows a more scaleable design by minimizing traffic on the interconnection network.

Shared-nothing characterizes the database systems being used by Teradata [TERA83], Gamma [DEWI86, DEWI90], Tandem [TAND88], Bubba [ALEX88], Arbre [LORI89], and nCUBE [GIBB91]. Significantly, Digital's VAXcluster has evolved to this design. DOS and UNIX workgroup systems from 3Com, Borland, Digital, HP, Novell, Microsoft, and Sun also adopt a shared-nothing client-server architecture.

The actual interconnection networks used by these systems vary enormously. Teradata employs a redundant tree-structured communication network. Tandem uses a three-level duplexed network, two levels within a cluster, and rings connecting the clusters. Arbre, Bubba, and Gamma are independent of the underlying interconnection network, requiring only that the network allow any two nodes to communicate with one another. Gamma operates on an Intel Hypercube. The Arbre prototype was implemented using IBM 4381 processors connected to one another in a point-to-point network. Workgroup systems are currently making a transition from Ethernet to higher-speed local networks.

The main advantage of shared-nothing multi-processors is that they can be scaled up to hundreds and probably thousands of processors that do not interfere with one another. Teradata, Tandem, and Intel have each shipped systems with more than 200 processors, and Intel is implementing a 2000-node Hypercube. The largest shared-memory multi-processors currently available are limited to about 32 processors; such machines have application in simulation, pattern matching, and mathematical search, but they do not seem to be appropriate for the multiuser, I/O-intensive, dataflow paradigm of database systems.

Shared-nothing architectures achieve near-linear speedups and scaleups on complex relational queries and on online transaction processing workloads [DEWI90, TAND88, ENGL89]. Given such results, database machine designers see little justification for the hardware and software complexity associated with shared-memory and shared-disk designs.

Shared-memory and shared-disk systems do not scale well on database applications. Interference is a major problem for shared-memory multi-processors : the interconnection network must have the bandwidth of the sum of the processors and disks, and it is difficult to build such networks that can scale to thousands of nodes. To reduce network traffic and to minimize latency, each processor is given a large private cache. Measurements of shared-memory multi-processors running database workloads show that loading and flushing these caches considerably degrades processor performance [THAK90]. As parallelism increases, interference on shared resources limits performance.

Multi-processor systems often use an affinity scheduling mechanism to reduce this interference, giving each process an affinity to a particular processor. This is a form of data partitioning; it represents an evolutionary step toward the shared-nothing design. Partitioning a shared-memory system creates many of the skew and load balancing problems faced by a shared-nothing machine, but reaps none of the simpler hardware interconnect benefits.
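The interference figures quoted above (a 1% slowdown per added processor capping speedup at about 37) follow from a simple model that can be checked in a few lines of Python. The model n × 0.99^(n−1) is our reading of the claim, not a formula given in the text :

    # Effective power of an n-processor system if each added processor
    # slows every processor down by 1% (assumed interference model).
    def effective_power(n, slowdown=0.01):
        return n * (1.0 - slowdown) ** (n - 1)

    print(round(effective_power(100), 1))    # 37.0 -- the maximum, near n = 100
    print(round(effective_power(1000), 3))   # 0.044 -- a thousand processors
                                             # deliver roughly 4% of one processor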
Based on this experience, we believe high-performance shared-memory machines will not economically scale beyond a few processors when running database applications.

To ameliorate the interference problem, most shared-memory multi-processors have adopted a shared-disk architecture; this is the logical consequence of affinity scheduling. If the disk interconnection network can scale to thousands of disks and processors, then a shared-disk design is adequate for large read-only databases and for databases where there is no concurrent sharing.

The shared-disk architecture is not very effective, however, for database applications that read and write a shared database. A processor wanting to update some data must first obtain the current copy of that data. Since others might be updating the same data concurrently, the processor must declare its intention to update the data. Once this declaration has been honored and acknowledged by all the other processors, the updater can read the shared data from disk and update it. The processor must then write the shared data out to disk so that subsequent readers and writers will be aware of the update. There are many optimizations of this protocol, but they all end up exchanging reservation messages and large physical data pages. This creates processor interference and delays, and it creates heavy traffic on the shared interconnection network. For shared database applications, the shared-disk approach is much more expensive than the shared-nothing approach of exchanging small, high-level logical questions and answers among clients and servers.

One solution to this interference has been to give data a processor affinity; other processors wanting to access the data send messages to the server managing the data. This has emerged as a major application of transaction processing monitors that partition the load among partitioned servers, and is also a major application for remote procedure calls. Again, this trend toward the partitioned data model and shared-nothing architecture on a shared-disk system reduces interference. Since the shared-disk interconnection network is difficult to scale to thousands of processors and disks, many conclude that it would be better to adopt the shared-nothing architecture from the start.

Given the shortcomings of shared-memory and shared-disk architectures, why have computer architects been slow to adopt the shared-nothing approach ? The first answer is simple : high-performance, low-cost commodity components have only recently become available. Traditionally, commodity components were relatively low performance and low quality.

Today, old software is the most significant barrier to the use of parallelism. Old software written for uni-processors gets no speedup or scaleup when put on any kind of multi-processor; it must be rewritten to benefit from parallel processing and multiple disks. Database applications are a unique exception to this. Today, most database programs are written in the relational language SQL, which has been standardized by both ANSI and ISO. It is possible to take standard SQL applications written for uni-processor systems and execute them in parallel on shared-nothing database machines, because database systems can automatically distribute data among multiple processors. Teradata and Tandem routinely port SQL applications to their systems and demonstrate near-linear speedups and scaleups.
The next section explains the basic techniques used by such parallel database systems.

Fig. 1.10 : A parallel data flow approach

Terabyte online databases, consisting of billions of records, are becoming common as the price of online storage decreases. These databases are often represented and manipulated using the SQL relational model. A relational database consists of relations (files in COBOL terminology) that in turn contain tuples (records in COBOL terminology). All the tuples in a relation have the same set of attributes (fields in COBOL terminology).

Relations are created, updated, and queried by writing SQL statements. These statements are syntactic sugar for a simple set of operators chosen from the relational algebra. Select-project, here called scan, is the simplest and most common operator. It produces a row-and-column subset of a relational table. A scan of relation R using predicate P and attribute list L produces a relational data stream as output. The scan reads each tuple, t, of R and applies the predicate P to it. If P(t) is true, the scan discards any attributes of t not in L and inserts the resulting tuple into the scan output stream. Expressed in SQL, a scan of a telephone book relation to find the phone numbers of all people named Smith would be written :

SELECT telephone_number /* the output attribute(s) */
FROM telephone_book /* the input relation */
WHERE last_name = 'Smith'; /* the predicate */

A scan's output stream can be sent to another relational operator, returned to an application, displayed on a terminal, or printed in a report. Therein lies the beauty and utility of the relational model : the uniformity of the data and operators allows them to be arbitrarily composed into dataflow graphs. The output of a scan may be sent to a sort operator that reorders the tuples based on an attribute sort criterion, optionally eliminating duplicates. SQL defines several aggregate operators to summarize attributes into a single value, for example, taking the sum, min, or max of an attribute, or counting the number of distinct values of the attribute. The insert operator adds tuples from a stream to an existing relation. The update and delete operators alter and delete tuples in a relation matching a scan stream.

The relational model defines several operators to combine and compare two or more relations. It provides the usual set operators union, intersection, and difference, and some more exotic ones like join and division. Discussion here will focus on the equi-join operator (here called join). The join operator composes two relations, A and B, on some attribute to produce a third relation. For each tuple, ta, in A, the join finds all tuples, tb, in B whose attribute values are equal to that of ta. For each matching pair of tuples, the join operator inserts into the output stream a tuple built by concatenating the pair.

Application programs, which combine conventional code with SQL statements, interact with clients, perform data display, and provide high-level direction of the SQL dataflow. The SQL data model was originally proposed to improve programmer productivity by offering a non-procedural database language. Data independence was an additional benefit : since the programs do not specify how the query is to be executed, SQL programs continue to operate as the logical and physical database schema evolves. Parallelism is an unanticipated benefit of the relational model.
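The scan operator has a direct procedural reading. A minimal Python sketch that mirrors the phone-book query above (the relation is represented as a list of dictionaries; the data values are invented) :

    def scan(R, P, L):
        # select-project : apply predicate P, keep only the attributes in L
        for t in R:
            if P(t):
                yield {a: t[a] for a in L}

    telephone_book = [
        {"last_name": "Smith", "telephone_number": "555-0101"},
        {"last_name": "Jones", "telephone_number": "555-0199"},
    ]
    # SELECT telephone_number FROM telephone_book WHERE last_name = 'Smith';
    for row in scan(telephone_book,
                    lambda t: t["last_name"] == "Smith",
                    ["telephone_number"]):
        print(row)                       # {'telephone_number': '555-0101'}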
Since relational queries are really just relational operators applied to very large collections of data, they offer many opportunities for parallelism. Since the queries are presented in a non-procedural language, they also offer considerable latitude in how they are executed. Relational queries can be executed as a dataflow graph, and, as mentioned in the introduction, these graphs can use both pipelined parallelism and partitioned parallelism. If one operator sends its output to another, the two operators can execute in parallel, giving a potential speedup of two.

The benefits of pipeline parallelism are limited because of three factors :
(1) Relational pipelines are rarely very long - a chain of length ten is unusual.
(2) Some relational operators do not emit their first output until they have consumed all their inputs. Aggregate and sort operators have this property; one cannot pipeline them.
(3) Often, the execution cost of one operator is much greater than that of the others (this is an example of skew). In such cases, the speedup obtained by pipelining will be very limited.

Partitioned execution offers much better opportunities for speedup and scaleup. By taking the large relational operators and partitioning their inputs and outputs, it is possible to use divide-and-conquer to turn one big job into many independent little ones. This is an ideal situation for speedup and scaleup. Partitioned data is the key to partitioned execution.

Data partitioning :

Partitioning a relation involves distributing its tuples over several disks. Data partitioning has its origins in centralized systems that had to partition files, either because the file was too big for one disk, or because the file access rate could not be supported by a single disk. Distributed databases use data partitioning when they place relation fragments at different network sites. Data partitioning allows parallel database systems to exploit the I/O bandwidth of multiple disks by reading and writing them in parallel. This approach provides I/O bandwidth superior to RAID-style systems without needing any specialized hardware.

The simplest partitioning strategy distributes tuples among the fragments in a round-robin fashion. This is the partitioned version of the classic entry-sequence file. Round-robin partitioning is excellent if all applications want to access the relation by sequentially scanning all of it on each query. The problem with round-robin partitioning is that applications frequently want to access tuples associatively, meaning that the application wants to find all the tuples having a particular attribute value. The SQL query looking for the Smiths in the phone book is an example of an associative search.

Hash partitioning is ideally suited for applications that want only sequential and associative access to the data. Tuples are placed by applying a hashing function to an attribute of each tuple; the function specifies the placement of the tuple on a particular disk. Associative access to the tuples with a specific attribute value can be directed to a single disk, avoiding the overhead of starting queries on multiple disks. Hash partitioning mechanisms are provided by Arbre, Bubba, Gamma, and Teradata.

Fig. 1.11
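The placement strategies named here reduce to a few lines each. A minimal Python sketch, with the disk count and range boundaries invented for illustration (range partitioning is discussed next) :

    def round_robin(tuples, n_disks):
        # the i-th tuple in arrival order goes to disk i mod n
        return [tuples[i::n_disks] for i in range(n_disks)]

    def hash_partition(tuples, n_disks, attr):
        disks = [[] for _ in range(n_disks)]
        for t in tuples:
            disks[hash(t[attr]) % n_disks].append(t)
        return disks

    def range_partition(tuples, boundaries, attr):
        # boundaries ["H", "Q"] give three ranges : A-H, I-Q, R-Z
        disks = [[] for _ in range(len(boundaries) + 1)]
        for t in tuples:
            i = sum(t[attr] > b for b in boundaries)
            disks[i].append(t)
        return disks

    book = [{"last_name": n} for n in ("Smith", "Jones", "Adams")]
    print(range_partition(book, ["H", "Q"], "last_name"))
    # [[{'last_name': 'Adams'}], [{'last_name': 'Jones'}], [{'last_name': 'Smith'}]]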
Database systems pay considerable attention to clustering related data together in physical storage. If a set of tuples is routinely accessed together, the database system attempts to store them on the same physical page. For example, if the Smiths of the phone book are routinely accessed in alphabetical order, then they should be stored on pages in that order, and these pages should be clustered together on disk to allow sequential prefetching and other optimizations. Clustering is very application specific : tuples describing nearby streets should be clustered together in geographic databases, while tuples describing the line items of an invoice should be clustered with the invoice tuple in an inventory control application. Hashing tends to randomize data rather than cluster it.

Range partitioning clusters tuples with similar attribute values together in the same partition. It is good for sequential and associative access, and is also good for clustering data. Fig. 1.11 shows range partitioning based on lexicographic order, but any clustering algorithm is possible. Range partitioning derives its name from typical SQL range queries such as :

latitude BETWEEN 37 AND 39

Arbre, Bubba, Gamma, Oracle, and Tandem provide range partitioning. The problem with range partitioning is that it risks data skew, where all the data is placed in one partition, and execution skew, in which all the execution occurs in one partition. Hashing and round-robin are less susceptible to these skew problems. Range partitioning can minimize skew by picking non-uniformly-distributed partitioning criteria. Bubba uses this concept by considering the access frequency (heat) of each tuple when creating the partitions of a relation; the goal is to balance the frequency with which each partition is accessed (its temperature) rather than the actual number of tuples on each disk (its volume) [COPE88].

While partitioning is a simple concept that is easy to implement, it raises several new physical database design issues. Each relation must now have a partitioning strategy and a set of disk fragments. Increasing the degree of partitioning usually reduces the response time for an individual query and increases the overall throughput of the system. For sequential scans, the response time decreases because more processors and disks are used to execute the query. For associative scans, the response time improves because fewer tuples are stored at each node, and hence the size of the index that must be searched decreases. There is a point, however, beyond which further partitioning actually increases the response time of a query : this point occurs when the cost of starting a query on a node becomes a significant fraction of the actual execution time.

Parallelism within relational operators :

Data partitioning is the first step in partitioned execution of relational dataflow graphs. The basic idea is to use parallel data streams instead of writing new parallel operators (programs). This approach enables the use of unmodified, existing sequential routines to execute the relational operators in parallel. Each relational operator has a set of input ports on which input tuples arrive and an output port to which the operator's output stream is sent. The parallel dataflow works by partitioning and merging data streams into these sequential ports.

Consider a scan of a relation, A, that has been partitioned across three disks into fragments A0, A1, and A2.
This scan can be implemented as three scan operators that send their output to a common merge operator. The merge operator produces a single output data stream that flows to the application or to the next relational operator. The parallel query executor creates the three scan processes shown in Fig. 1.12 and directs them to take their inputs from three different sequential input streams (A0, A1, A2). It also directs them to send their outputs to a common merge node. Each scan can run on an independent processor and disk. So the first basic parallelizing operator is a merge, which combines several parallel data streams into a single sequential stream.

Fig. 1.12

Fig. 1.13

The merge operator tends to focus data on one spot. If a multi-stage parallel operation is to be done in parallel, a single data stream must be split into several independent streams. A split operator is used to partition or replicate the stream of tuples produced by a relational operator. A split operator defines a mapping from one or more attribute values of the output tuples to a set of destination processes (Fig. 1.13).

Fig. 1.14 : A simple SQL query and the associated relational query graph

The query specifies that a join is to be performed between relations A and B by comparing the x attribute of each tuple from the A relation with the y attribute of each tuple of the B relation. For each pair of tuples that satisfies the predicate, a result tuple is formed from all the attributes of both tuples; this result tuple is then added to the result relation C. The associated logical query graph (as might be produced by a query optimizer) shows a tree of operators : one for the join, one for the insert, and one for scanning each input relation.

As an example, consider the two split operators shown in Fig. 1.15 in conjunction with the SQL query shown in Fig. 1.14. Assume that three processes are used to execute the join operator, and that five other processes execute the two scan operators - three scanning partitions of relation A, while two scan partitions of relation B. Each of the three relation A scan nodes has the same split operator, sending all tuples between "A-H" to port 0 of join process 0, all between "I-Q" to port 0 of join process 1, and all between "R-Z" to port 0 of join process 2. Similarly, the two relation B scan nodes have the same split operator, except that their outputs are merged by port 1 (not port 0) of each join process. Each join process thus sees a sequential input stream of A tuples from the port 0 merge (the left scan nodes) and another sequential stream of B tuples from the port 1 merge (the right scan nodes). The outputs of each join are, in turn, split into three streams based on the partitioning criterion of relation C.

Relation A scan split operator :
Predicate    Destination process
"A-H"        (cpu #5, process #3, port #0)
"I-Q"        (cpu #7, process #8, port #0)
"R-Z"        (cpu #2, process #2, port #0)

Relation B scan split operator :
Predicate    Destination process
"A-H"        (cpu #5, process #3, port #1)
"I-Q"        (cpu #7, process #8, port #1)
"R-Z"        (cpu #2, process #2, port #1)

Fig. 1.15 : Sample split operators

Each split operator maps tuples to a set of output streams (ports of other processes) depending on the range value (predicate) of the input tuple.
The first split operator above is for the relation A scan, while the second is for the relation B scan; each partitions the tuples among three data streams. To clarify this example, consider the first join process in Fig. 1.15 (cpu #5, process #3, ports 0 and 1). It will receive all the relation A "A-H" tuples from the three relation A scan operators, merged as a single stream on port 0, and will get all the "A-H" tuples from relation B, merged as a single stream on port 1. It will join them using a hash join, a sort-merge join, or even a nested-loop join if the tuples arrive in the proper order.

Fig. 1.16

If each of these processes is on an independent processor with an independent disk, there will be little interference among them. Such dataflow designs are a natural application for shared-nothing machine architectures.

The split operator in Fig. 1.16 is just an example. Other split operators might duplicate the input stream, partition it round-robin, or partition it by hash; the partitioning function can be an arbitrary program. Gamma, Volcano, and Tandem use this approach [GRAE90]. It has several advantages, including the automatic parallelism of any new operator added to the system, plus support for many kinds of parallelism.

The split and merge operators have flow control and buffering built into them. This prevents one operator from getting too far ahead in the computation. When a split operator's output buffers fill, it stalls the relational operator until the data target requests more output. For simplicity, these examples have been stated in terms of one operator per process, but it is entirely possible to place several operators within a process to get coarser-grained parallelism. The fundamental idea, though, is to build a self-pacing dataflow graph and distribute it in a shared-nothing machine in a way that minimizes interference.

Specialized parallel relational operators :

Some algorithms for relational operators are especially appropriate for parallel execution, either because they minimize data flow, or because they better tolerate data and execution skew. Improved algorithms have been found for most of the relational operators. The evolution of join operator algorithms is sketched here as an example of these improved algorithms.

Recall that the join operator combines two relations, A and B, to produce a third relation containing all tuple pairs from A and B with matching attribute values. The conventional way of computing the join is to sort both A and B into new relations ordered by the join attribute. These two intermediate relations are then compared in sorted order, and matching tuples are inserted into the output stream. This algorithm is called sort-merge join. Many optimizations of sort-merge join are possible, but since sort has an n log(n) execution cost, sort-merge join has an n log(n) execution cost. Sort-merge join works well in a parallel dataflow environment unless there is data skew : in that case, some sort partitions may be much larger than others, which in turn creates execution skew and limits speedup and scaleup. These skew problems do not appear in centralized sort-merge joins.

Hash join is an alternative to sort-merge join. It has linear execution cost rather than n log(n) execution cost, and it is more resistant to data skew. It is superior to sort-merge join unless the input streams are already in sorted order.
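Before the prose description below, here is the idea as a minimal Python sketch (our illustration; in-memory lists stand in for the disk-resident partitions a real system would use) :

    def hash_join(A, B, attr_a, attr_b, n_partitions=4):
        # step 1 : hash-partition both relations on the join attribute
        parts_a = [[] for _ in range(n_partitions)]
        parts_b = [[] for _ in range(n_partitions)]
        for t in A:
            parts_a[hash(t[attr_a]) % n_partitions].append(t)
        for t in B:
            parts_b[hash(t[attr_b]) % n_partitions].append(t)
        # step 2 : one little join per partition pair
        for pa, pb in zip(parts_a, parts_b):
            table = {}                      # build an in-memory hash table on A
            for ta in pa:
                table.setdefault(ta[attr_a], []).append(ta)
            for tb in pb:                   # probe it with the B partition
                for ta in table.get(tb[attr_b], []):
                    yield {**ta, **tb}      # matching pair -> output stream

    A = [{"x": 1, "u": "a"}, {"x": 2, "u": "b"}]
    B = [{"y": 1, "v": "c"}]
    print(list(hash_join(A, B, "x", "y")))  # [{'x': 1, 'u': 'a', 'y': 1, 'v': 'c'}]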
In more detail, hash join works as follows. Each of the relations A and B is first hash-partitioned on the join attribute. A hash partition of relation A is then hashed into a main-memory table. The corresponding partition of relation B is scanned, and each of its tuples is compared against the main-memory hash table for the A partition; if there is a match, the pair of tuples is sent to the output stream. Each pair of hash partitions is compared in this way.

The hash join algorithm breaks a big join into many little joins. If the hash function is good and the data skew is not too bad, there will be little variance in the hash bucket sizes; in these cases hash join is a linear-time join algorithm with linear speedup and scaleup. Many optimizations of the parallel hash join algorithm have been discovered over the last decade. In pathological skew cases, when many or all tuples have the same attribute value, one bucket may contain all the tuples; in these cases no algorithm is known that speeds up or scales up.

The hash join example shows that new parallel algorithms can improve the performance of relational operators. This is a fruitful research area [BORA90, DEWI86, KITS83, KITS90, SCHN89, SCHN90, WOLF90, ZELL90]. Even though parallelism can be obtained from conventional sequential relational algorithms by using split and merge operators, we expect that many new algorithms will be discovered in the future.

1.13 Parallel Query Optimization :

Current database query optimizers do not consider all possible plans when optimizing a relational query. While cost models for relational queries running on a single processor are now well understood [SELI79], they still depend on cost estimators that are a guess at best. Some systems dynamically select from among several plans at run time depending on, for example, the amount of physical memory actually available and the cardinalities of the intermediate results [GRAE89]. To date, no query optimizer considers all the parallel algorithms for each operator and all the query tree organizations. More work is needed in this area.

Another optimization problem relates to highly skewed value distributions. Data skew can lead to high variance in the size of intermediate relations, leading to both poor query plan cost estimates and sub-linear speedup. Solutions to this problem are an area of active research [KITS90, WOLF90, HUA91, WALT91].

1.14 Application Program Parallelism :

Parallel database systems offer parallelism within the database system. Missing are tools to structure application programs to take advantage of the parallelism inherent in these systems. While automatic parallelization of application programs written in Cobol may not be feasible, library packages to facilitate explicitly parallel application programs are needed. Ideally, the SPLIT and MERGE operators could be packaged so that applications could benefit from them.

1.15 Physical Database Design :

For a given database and workload there are many possible indexing and partitioning combinations. Database design tools are needed to help the database administrator select among these many design options. Such tools might accept as input a description of the queries comprising the workload, their frequency of execution, statistical information about the relations in the database, and a description of the processors and disks. The resulting output would suggest a partitioning strategy for each relation, plus the indices to be created on each relation.
Steps in this direction are beginning to appear. Current algorithms partition relations using the values of a single attribute. For example, geographic records could be partitioned by longitude or latitude. Partitioning on longitude allows selections for a longitude range to be localized to a limited number of nodes, but selections on latitude must then be sent to all the nodes. While this is acceptable in a small configuration, it is not acceptable in a system with thousands of processors. Additional research is needed on multi-dimensional partitioning and search algorithms.

1.16 On-line Data Reorganization and Utilities :

Loading, reorganizing, or dumping a terabyte database at a megabyte per second takes over twelve days and nights. Clearly, parallelism is needed if utilities are to complete within a few hours or days. Even then, it will be essential that the data be available while the utilities are operating.

In the SQL world, typical utilities create indices, add or drop attributes, add constraints, and physically reorganize the data, changing its clustering. One unexplored and difficult problem is how to process database utility commands while the system remains operational and the data remains available for concurrent reads and writes by others. The fundamental properties of such algorithms are that they must be online (operate without making data unavailable), incremental (operate on parts of a large database), parallel (exploit parallel processors), and recoverable (allow the operation to be canceled and return to the old state).

Fig. 1.17 : Users running data-parallel applications across the Internet

The data stores are on server clusters which, compared to monolithic machines, flexibly support different kinds of concurrent workloads, are easier to upgrade, and have the potential to support independent node faults. The client machines, in contrast to the traditional view, are active collaborators with the clusters in providing the end result to the user.

Other systems : Other parallel database system prototypes include XPRS [STON88], Volcano [GRAE90], Arbre [LORI89], and the PERSIST project under development at the IBM Research labs in Hawthorne and Almaden. While both Volcano and XPRS are implemented on shared-memory multi-processors, XPRS is unique in its exploitation of the availability of massive shared memory in its design. In addition, XPRS is based on several innovative techniques for obtaining extremely high performance and availability.

Recently, the Oracle database system has been implemented atop a 64-node nCUBE shared-nothing system. The resulting system is the first to demonstrate more than 1000 transactions per second on the industry-standard TPC-B benchmark. This is far in excess of Oracle's performance on conventional mainframe systems - both in peak performance and in price/performance [GIBB91].

NCR has announced the 3600 and 3700 product lines that employ shared-nothing architectures running System V R4 of Unix on Intel 486 and 586 processors. The interconnection network for the 3600 product line uses an enhanced Y-Net licensed from Teradata, while the 3700 is based on a new multistage interconnection network being developed jointly by NCR and Teradata. Two software offerings have been announced. The first, a port of the Teradata software to a Unix environment, is targeted toward the decision-support marketplace.
The second, based on a parallelization of the Sybase DBMS, is intended primarily for transaction processing workloads.

1.17 Database Machines and Grosch's Law :

Today, shared-nothing database machines have the best peak performance and best price/performance available. When compared to traditional mainframes, the Tandem system scales linearly well beyond the largest reported mainframes on the TPC-A transaction processing benchmark. Its price/performance on these benchmarks is three times cheaper than the comparable mainframe numbers. Oracle on an nCUBE has the highest reported TPC-B numbers and very competitive price/performance [GRAY91, GIBB91]. These benchmarks demonstrate linear scaleup on transaction processing workloads. Gamma, Tandem, and Teradata have demonstrated linear speedup and scaleup on complex relational database benchmarks; they scale well beyond the size of the largest mainframes, and their performance and price/performance are generally superior to those of mainframe systems.

Summary :

Parallel databases have gained significant commercial acceptance in the past 15 years. In I/O parallelism, relations are partitioned among available disks so that they can be retrieved faster. Three commonly used partitioning techniques are round-robin partitioning, hash partitioning, and range partitioning. Skew is a major problem, especially with increasing degrees of parallelism. Balanced partitioning vectors, using histograms, and virtual processor partitioning are among the techniques used to reduce skew.

In interquery parallelism, we run different queries concurrently to increase throughput. Intraquery parallelism attempts to reduce the cost of running a query. There are two types of intraquery parallelism : intraoperation parallelism and interoperation parallelism. We use intraoperation parallelism to execute relational operations, such as sorts and joins, in parallel. Intraoperation parallelism is natural for relational operations, since they are set oriented.

There are two basic approaches to parallelizing a binary operation such as a join. In partitioned parallelism, the relations are split into several parts, and tuples in ri are joined only with tuples from si. Partitioned parallelism can be used only for natural joins and equi-joins. In fragment and replicate, both relations are partitioned and each partition is replicated. In asymmetric fragment-and-replicate, one of the relations is replicated while the other is partitioned. Unlike partitioned parallelism, fragment-and-replicate and asymmetric fragment-and-replicate can be used with any join condition. Both parallelization techniques can work in conjunction with any join technique.

In independent parallelism, different operations that do not depend on one another are executed in parallel. In pipelined parallelism, processors send the results of one operation to another operation as those results are computed, without waiting for the entire operation to finish. Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.

Review Questions

Q. 1 Discuss the different motivations behind parallel and distributed databases.
Q. 2 Describe the three main architectures for parallel DBMSs. Explain why the shared-memory and shared-disk approaches suffer from interference.
Q. 3 What can you say about the speed-up and scale-up of the shared-nothing architecture ?
Q. 4 Describe and differentiate pipelined parallelism and data-partitioned parallelism.
Q. 5 Discuss the following techniques for partitioning data : round-robin, hash, and range.
Q. 6 Explain how existing code can be parallelized by introducing split and merge operators.
Q. 7 Discuss how each of the following operators can be parallelized using data partitioning : scanning, sorting, join. Compare the use of sorting versus hashing for partitioning.
Q. 8 What do we need to consider in optimizing queries for parallel execution ? Discuss interoperation parallelism, left-deep trees versus bushy trees, and cost estimation.