Course Name: Business Intelligence
Year: 2009
The Value of Parallelism
16th Meeting
Source of this Material: Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide, Chapter 11.
Bina Nusantara University

The Business Case
Maintaining large amounts of transaction data is one thing, but integrating and subsequently transforming that data into an analytical environment (such as a data warehouse or any multidimensional analytical framework) requires a large amount of both storage space and processing capability. Unfortunately, the kinds of processing needed for BI applications cannot be scaled linearly: with most BI processing, doubling the amount of data can dramatically increase the amount of processing required.

Parallelism and Granularity
Whenever we talk about parallelism, we need to assess the size of the problem as well as the way the problem can be decomposed into parallelizable units. The metric of unit size with respect to concurrency is called granularity. Large problems that decompose into a relatively small number of large tasks have coarse granularity, whereas a decomposition into a large number of very small tasks has fine granularity. In this section we look at different kinds of parallelism, ranging from coarse-grained to fine-grained.

• Scalability
Scalability refers to the situation in which speedup increases linearly as the number of resources is increased.

• Task Parallelism
Task parallelism (Figure 16-1) is a coarse-grained parallelism. A high-level process is decomposed into a collection of discrete tasks, each of which performs some set of operations and results in some output or side effect.
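The task-parallel pattern above can be sketched in a few lines of Python. The task functions, their names, and the thread-pool executor are illustrative assumptions, not from the source; the point is only that each discrete task runs on its own worker.

```python
# Hypothetical sketch: three independent tasks of a BI load run concurrently.
from concurrent.futures import ThreadPoolExecutor

def extract_customers():   # illustrative task, not from the source
    return ["alice", "bob"]

def extract_orders():      # illustrative task
    return [101, 102, 103]

def refresh_index():       # illustrative task
    return "index refreshed"

tasks = [extract_customers, extract_orders, refresh_index]

# Each discrete task is handed to its own worker: coarse-grained parallelism.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    futures = [pool.submit(t) for t in tasks]
    results = [f.result() for f in futures]

print(results)  # one result (or side effect) per task
```

Because the tasks are fully separable, no coordination is needed beyond collecting their results at the end.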
Figure 16-1

• Pipeline Parallelism
Pipelining is an example of medium-grained parallelism, because the tasks are not fully separable (i.e., the completion of a single stage does not result in a finished product); however, the amount of work is large enough that it can be cordoned off and assigned as operational tasks (Figure 16-2).

Figure 16-2

• Data Parallelism
Data parallelism is a different kind of parallelism. Here, instead of identifying a collection of operational steps to be allocated to a process or task, the parallelism is related to both the flow and the structure of the information. For data parallelism, the goal is to scale the throughput of processing based on the ability to decompose the data set into concurrent processing streams, all performing the same set of operations.

• Vector Parallelism
Vector parallelism refers to an execution framework in which a collection of objects is treated as an array, or vector, and the same set of operations is applied to all elements of the set. A vector parallel platform can be used to implement both pipeline and data parallel applications.

• Combinations
We can embed pipelined processing within coarsely grained tasks or even decompose a pipe stage into a set of concurrent processes. The value of each of these kinds of parallelism is bounded by the system's ability to support the overhead of managing those different levels.

Parallel Processing Systems
In this section we look at some popular parallel processing architectures. Systems employing these architectures either are configured by system manufacturers (such as symmetric multiprocessor [SMP] or massively parallel processing [MPP] systems) or can be homebrewed by savvy technical personnel (such as by use of a network of workstations).
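The data-parallel pattern described above can be sketched as one data set decomposed into concurrent streams, each running the same operations. The records, the chunking scheme, and the use of threads (rather than separate processes or nodes, as a real deployment would use) are illustrative assumptions.

```python
# Hypothetical sketch: one data set split into concurrent streams,
# every stream applying the same set of operations.
from concurrent.futures import ThreadPoolExecutor

def clean_record(r):
    # the same operations are applied to every element of the data set
    return r.strip().lower()

records = ["  Alice ", "BOB", " Carol", "DAVE "]
n_streams = 2
# decompose the data set into concurrent processing streams
chunks = [records[i::n_streams] for i in range(n_streams)]

with ThreadPoolExecutor(max_workers=n_streams) as pool:
    partials = pool.map(lambda chunk: [clean_record(r) for r in chunk], chunks)

cleaned = sorted(r for part in partials for r in part)
print(cleaned)  # ['alice', 'bob', 'carol', 'dave']
```

Doubling the number of streams (given enough processors) is what lets throughput scale with the data, which is the goal the text describes.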
• Symmetric Multiprocessing
An SMP system is a hardware configuration that combines multiple processors within a single architecture. In an SMP system, multiple processes can be allocated to different CPUs, which makes an SMP machine a good platform for coarse-grained parallelism.

• Massively Parallel Processing
An MPP system consists of a large number of small homogeneous processors interconnected via a high-speed network. The processors in an MPP machine are independent: they do not share memory, and each typically runs its own instance of an operating system, although there may be a systemic controller application hosted on a leader processor that instructs the individual processors in the MPP configuration on what tasks to perform.

• Network of Workstations
A network of workstations is a more loosely coupled version of an MPP system; the workstations are likely to be configured as individual machines connected via a network. The communication latencies (i.e., delays in exchanging data) are likely to be an order of magnitude greater in this kind of configuration, and the machines in the network may be heterogeneous (i.e., involving different kinds of systems).

• Hybrid Architectures
A hybrid architecture is one that combines or adapts one of the previously discussed systems. The use of a hybrid architecture may depend heavily on the specific application, because some systems may be better suited to the concurrency specifics associated with each application.

Dependence
The key to exploiting parallelism is the ability to analyze dependence constraints within any process. In this section we explore both control dependence and data dependence, as well as issues associated with analyzing the dependence constraints within a system that prevent an organization from exploiting parallelism.
• Control Dependence
Control refers to the logic that determines whether a particular task is performed. If one task cannot be initiated until a condition is set by another task, or until that other task completes, then the first task is control dependent on the second task.

• Input/Output Data Dependence
A data dependence between two processing stages represents a situation in which some information that is "touched" by one process is read or written by a subsequent process, and the order of execution of these processes requires that the first process execute before the second. There are potentially four kinds of data dependences:

Read After Read (RAR): the second process's read of a data item occurs after the first process's read. (Strictly, because neither process modifies the data, this ordering imposes no real constraint on parallel execution.)

Read After Write (RAW): the second process's read of a data item must occur after the first process's write of that data.

Write After Read (WAR): the second process's write of a data item must occur after the first process's read of that data item.

Write After Write (WAW): the second process's write of a data item must occur after the first process's write of that data item.

• Dependence Analysis
Dependence analysis is the process of examining an application's use of data and control structures to identify true dependences within the application. The goal is first to identify the dependence chain within the system and then to look for opportunities where independent tasks can be executed in parallel.

Parallelism and Business Intelligence
At this point it makes sense to review the value of parallelism with respect to a number of BI-related applications. In each of these applications, a significant speedup can be achieved by exploiting parallelism.
• Query Processing
A relational database management system will most likely have taken advantage of internal query optimization as well as user-defined indexes to speed up a query. But even in that situation there are different ways that the query can be parallelized, and this will still result in speedup.

• Data Profiling
The column analysis component of data profiling is another good example of an application that can benefit from parallelism. Every column in a table is subject to a collection of analyses: frequency of values, value cardinality, distribution of value ranges, and so on. Because the analysis of one column is distinct from that applied to all other columns, we can exploit parallelism by treating the set of analyses applied to each column as a separate task and then instantiating separate tasks to analyze a collection of columns simultaneously.

• Extract, Transform, Load
The extract, transform, load (ETL) component of data warehouse population is actually well suited to all the kinds of parallelism we have discussed. The ETL process itself consists of a sequence of stages that can be configured as a pipeline, propagating the results of each stage to its successor, yielding some medium-grained parallelism.

Management Issues
In this section we discuss some of the management issues associated with the use of parallelism in a BI environment.

• Training and Technical Management Requirements
Significant technical training is required to understand both system management and how best to take advantage of parallel systems.

• Minimal Software Support
Not many off-the-shelf application software packages take advantage of parallel systems, although a number of RDBMS products do, and an increasing number of high-end ETL tools exploit parallelism.

• Scalability Issues
Scalability is not just a function of the size of the data.
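The column-profiling case described above lends itself to a simple sketch: the analyses for each column form one independent task, so all columns can be profiled simultaneously. The sample table and the two statistics computed here are illustrative assumptions.

```python
# Hypothetical sketch: each column of a table is profiled as an independent task.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

table = {  # illustrative data, not from the source
    "city":   ["Jakarta", "Bandung", "Jakarta", "Jakarta"],
    "status": ["new", "new", "closed", "new"],
}

def profile_column(name, values):
    # the same collection of analyses is applied to every column
    freq = Counter(values)
    return name, {
        "cardinality": len(freq),               # number of distinct values
        "most_common": freq.most_common(1)[0],  # top value and its frequency
    }

# No column's analysis depends on any other column, so the tasks
# can all run at the same time.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(profile_column, n, v) for n, v in table.items()]
    profiles = dict(f.result() for f in futures)

print(profiles["city"])  # {'cardinality': 2, 'most_common': ('Jakarta', 3)}
```

With many columns and many rows, the same structure lets the profiling workload spread across however many processors are available.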
Any use of parallelism in an analytical environment should be evaluated in the context of data size, the amount of query processing expected, the number of concurrent users, and the kinds and complexity of the analytical processing.

• Need for Expertise
Ultimately, control and data dependence analysis are tasks that need to be incorporated into the systems analysis role when migrating to a parallel infrastructure, because the absence of valid dependence analysis will prevent proper exploitation of concurrent resources.