Course Name: Business Intelligence
Year: 2009
The Value of Parallelism
16th Meeting
Source of this Material
Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide, Chapter 11.
The Business Case
Maintaining large amounts of transaction data is one thing, but integrating and
subsequently transforming that data into an analytical environment (such as a
data warehouse or any multidimensional analytical framework) requires a large
amount of both storage space and processing capability. Unfortunately, the kinds of
processing needed for BI applications do not scale linearly: with most BI processing,
doubling the amount of data can dramatically increase the amount of processing required.
Parallelism and Granularity
Whenever we talk about parallelism, we need to assess the size of the problem
as well as the way the problem can be decomposed into parallelizable units.
The metric of the unit size with respect to concurrency is called granularity.
Large problems that decompose into a relatively small number of large tasks have
coarse granularity, whereas a decomposition into a large number of very small tasks
has fine granularity. In this section we look at different kinds of parallelism,
ranging from coarse-grained to fine-grained.
• Scalability
Scalability refers to the situation in which speedup increases linearly as the number
of resources is increased.
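As a point of reference (a standard definition, not taken from the source text), this can be made precise with the speedup ratio:

\[ S(p) = \frac{T(1)}{T(p)} \]

where T(p) is the elapsed time using p processors. Linear scalability means S(p) grows roughly in proportion to p, so doubling the resources roughly halves the running time.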
• Task Parallelism
Task parallelism (Figure 16-1) is a coarsely grained form of parallelism. A high-level
process is decomposed into a collection of discrete tasks, each of which performs some
set of operations and produces some output or side effect.
Parallelism and Granularity (cont…)
[Figure 16-1: task parallelism]
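The sketch below (illustrative only; it is not from Loshin's text, and the task names are hypothetical) shows the idea in Python: a set of discrete, independent tasks is handed to a pool of worker processes.

from concurrent.futures import ProcessPoolExecutor

# Three discrete, independent tasks; the names are hypothetical placeholders.
def extract_customers():
    return "customers extracted"

def extract_orders():
    return "orders extracted"

def build_summary_report():
    return "summary report built"

if __name__ == "__main__":
    # Coarse-grained parallelism: each independent task runs as its own process.
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(task) for task in
                   (extract_customers, extract_orders, build_summary_report)]
        for f in futures:
            print(f.result())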
• Pipeline Parallelism
Pipelining is an example of medium-grained parallelism, because the tasks are not
fully separable (i.e., the completion of a single stage does not result in a finished
product); however, the amount of work is large enough that it can be cordoned off
and assigned as operational tasks (Figure 16-2).
Parallelism and Granularity (cont…)
[Figure 16-2: pipeline parallelism]
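A minimal Python sketch of a pipeline (illustrative only, not from the source): two worker threads form the transform and load stages, the main thread acts as the extract stage, and records flow between stages through queues.

import threading, queue

SENTINEL = None  # marks the end of the record stream

def stage(in_q, out_q, work):
    # Generic pipeline stage: read a record, apply this stage's work, pass it on.
    while True:
        item = in_q.get()
        if item is SENTINEL:
            if out_q is not None:
                out_q.put(SENTINEL)   # propagate shutdown to the next stage
            break
        result = work(item)
        if out_q is not None:
            out_q.put(result)
        else:
            print(result)             # final stage: deliver the finished product

q1, q2 = queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=stage, args=(q1, q2, lambda r: r.strip().upper())),  # transform stage
    threading.Thread(target=stage, args=(q2, None, lambda r: "loaded: " + r)),   # load stage
]
for t in stages:
    t.start()
for record in [" alpha ", " beta ", " gamma "]:   # hypothetical records fed by the extract stage
    q1.put(record)
q1.put(SENTINEL)
for t in stages:
    t.join()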
• Data Parallelism
Data parallelism is a different kind of parallelism. Here, instead of identifying a
collection of operational steps to be allocated to a process or task, the parallelism is
related to both the flow and the structure of the information. For data parallelism, the
goal is to scale the throughput of processing based on the ability to decompose the
data set into concurrent processing streams, all performing the same set of operations.
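A minimal sketch of data parallelism in Python (illustrative; the data and operations are hypothetical): the data set is split into partitions, and the same set of operations is applied to each partition concurrently.

from multiprocessing import Pool

def clean_and_aggregate(partition):
    # The same set of operations is applied to every partition of the data.
    cleaned = [x for x in partition if x is not None]   # e.g., drop null values
    return sum(cleaned)

if __name__ == "__main__":
    data = list(range(1000))                      # hypothetical data set
    partitions = [data[i::4] for i in range(4)]   # decompose into 4 concurrent streams
    with Pool(processes=4) as pool:
        partial_results = pool.map(clean_and_aggregate, partitions)
    print(sum(partial_results))                   # combine the partial results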
Parallelism and Granularity (cont…)
• Vector Parallelism
Vector parallelism refers to an execution framework where a collection of objects is
treated as an array, or vector, and the same set of operations is applied to all
elements of the set. A vector parallel platform can be used to implement both pipeline
and data parallel applications.
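A small sketch using NumPy (the choice of NumPy is an assumption of this example, not something the source prescribes): one operation is expressed over a whole array, and the library applies it to every element using optimized vectorized routines.

import numpy as np

prices = np.array([10.0, 25.5, 7.25, 99.0])   # hypothetical values
discounted = prices * 0.9                     # one vector operation applied to every element
print(discounted)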
• Combinations
We can embed pipelined processing within coarsely grained tasks or even
decompose a pipeline stage into a set of concurrent processes. The value of each of
these kinds of parallelism is bounded by the system's ability to support the overhead
of managing those different levels.
Parallel Processing Systems
In this section we look at some popular parallel processing architectures.
Systems employing these architectures are either configured by system
manufacturers (such as symmetric multiprocessor [SMP] or massively parallel
processing [MPP] systems) or can be homebrewed by savvy technical
personnel (such as by using a network of workstations).
• Symmetric Multiprocessing
An SMP system is a hardware configuration that combines multiple processors within
a single architecture. In an SMP system, multiple processes can be allocated to
different CPUs within the system, which makes an SMP machine a good platform for
coarse-grained parallelism.
• Massively Parallel Processing
An MPP system consists of a large number of small homogeneous processors
interconnected via a high-speed network. The processors in an MPP machine are
independent: they do not share memory, and typically each processor runs its
own instance of an operating system, although there may be a system controller
application hosted on a leader processor that instructs the individual processors in the
MPP configuration on what tasks to perform.
Parallel Processing Systems (cont…)
• Network of Workstations
A network of workstations is a more loosely coupled version of an MPP system; the
workstations are likely to be configured as individual machines that are connected via
a network. The communication latencies (i.e., delays in exchanging data) are likely to
be an order of magnitude greater in this kind of configuration, and it is also possible
for the machines in the network to be heterogeneous (i.e., involving different kinds of
systems).
• Hybrid Architectures
A hybrid architecture is one that combines or adapts one of the previously discussed
systems. The use of a hybrid architecture may depend heavily on the specific
application, because some systems are better suited than others to the concurrency
characteristics of each application.
Dependence
The key to exploiting parallelism is the ability to analyze dependence
constraints within any process. In this section we explore both control
dependence and data dependence, as well as issues associated with analyzing
the dependence constraints within a system that prevent an organization from
exploiting parallelism.
• Control Dependence
Control refers to the logic that determines whether a particular task is performed.
When one task can only run after a condition is set by another task, processing of the
first task cannot be initiated until that other task sets the condition or completes. In
this case, the first task is control dependent on the second task.
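A tiny sketch of control dependence (illustrative only; the function names are hypothetical): the second task runs only after the first task has set the condition.

def validate_feed(records):
    # Sets the condition on which the dependent task waits.
    return all(r is not None for r in records)

def load_warehouse(records):
    print("loaded", len(records), "records")

records = ["r1", "r2", "r3"]       # hypothetical input
# load_warehouse is control dependent on validate_feed: it cannot run
# until validate_feed has completed and set the condition.
if validate_feed(records):
    load_warehouse(records)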
• Input/Output Data Dependence
A data dependence between two processing stages represents a situation in which
some information that is “touched” by one process is read or written by a subsequent
process, and the order of execution of these processes requires that the first process
execute before the second. There are potentially four kinds of data dependencies,
illustrated in the sketch after the list below.
Dependence (cont…)
 Read After Read (RAR)
In which the second process’s read of data occurs after the first process’s read.
 Read After Write (RAW)
In which the second process’s read of data must occur after the first process’s write of
that data.
 Write After Read (WAR)
In which the second process’s write of data must occur after the first process’s read of
that data item.
 Write After Write (WAW)
In which the second process’s write of data must occur after the first process’s write of
that data item.
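A small sketch of these dependencies (illustrative only, not from the source), using one shared data item touched by consecutive processing steps:

# 'balance' is the shared data item touched by two consecutive steps.
balance = 100

# RAW (read after write): step 2 must read the value step 1 wrote.
balance = balance + 50       # step 1 writes balance
report = balance             # step 2 reads it; reordering would read a stale value

# WAR (write after read): step 2 must not overwrite balance before step 1 reads it.
report = balance             # step 1 reads balance
balance = 0                  # step 2 writes it; reordering would corrupt the read

# WAW (write after write): the final value depends on the second write happening last.
balance = 10                 # step 1 writes
balance = 20                 # step 2 writes; reordering would leave the wrong final value

# RAR (read after read): the second step's read occurs after the first step's read.
a = balance
b = balance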
• Dependence Analysis
Dependence analysis is the process of examining an application's use of data and
control structures to identify true dependencies within the application. The goal is first
to identify the dependence chain within the system and then to look for opportunities
where independent tasks can be executed in parallel.
Parallelism and Business Intelligence
At this point it makes sense to review the value of parallelism with respect to a
number of the BI-related applications. In each of these applications, a
significant speedup can be achieved by exploiting parallelism.
• Query Processing
Relational database management systems will most likely have taken advantage of
internal query optimization as well as user-defined indexes to speed up query
execution. But even in that situation there are different ways that a query can be
parallelized, and this will still result in additional speedup.
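One illustrative sketch (not from the source; it mimics what a parallel database engine does internally rather than using any real RDBMS feature): a simple filter-and-count query is evaluated by scanning partitions of a table in parallel and combining the partial counts.

from concurrent.futures import ProcessPoolExecutor

def scan_partition(rows):
    # Worker evaluates the WHERE clause over its partition and returns a partial count.
    return sum(1 for region, _ in rows if region == "EU")

if __name__ == "__main__":
    table = [("EU", 10), ("US", 20), ("EU", 5), ("APAC", 7)] * 1000  # hypothetical rows
    partitions = [table[i::4] for i in range(4)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(sum(pool.map(scan_partition, partitions)))             # combine partial counts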
• Data Profiling
The column analysis component of data profiling is another good example of an
application that can benefit from parallelism. Every column in a table is subject to a
collection of analyses: frequency of values, value cardinality, distribution of value
ranges, etc. But because the analysis of one column is distinct from that applied to all
other columns, we can exploit parallelism by treating the set of analyses applied to
each column as a separate task and then instantiating separate tasks to analyze a
collection of columns simultaneously.
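A minimal sketch (illustrative only; the table and analyses are hypothetical): each column's profile is computed as its own task, and the columns are analyzed concurrently.

from concurrent.futures import ProcessPoolExecutor
from collections import Counter

def profile_column(name_and_values):
    # The same set of analyses is applied to each column independently.
    name, values = name_and_values
    freq = Counter(values)                    # frequency of values
    return name, {"cardinality": len(freq),   # value cardinality
                  "top_value": freq.most_common(1)[0][0]}

if __name__ == "__main__":
    table = {                                 # hypothetical columns
        "country": ["US", "US", "FR", "DE", "US"],
        "status":  ["open", "closed", "open", "open", "open"],
    }
    with ProcessPoolExecutor() as pool:
        for name, profile in pool.map(profile_column, table.items()):
            print(name, profile)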
Parallelism and Business Intelligence (cont…)
• Extract, Transform, Load
The extract, transform, load (ETL) component of data warehouse population is
actually well suited to all the kinds of parallelism we have discussed. The ETL
process itself consists of a sequence of stages that can be configured as a pipeline,
propagating the results of each stage to its successor and yielding some medium-grained
parallelism.
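A minimal sketch of ETL structured as a sequence of stages (illustrative only; the records are hypothetical, and in a parallel deployment each stage would run as its own task, as in the earlier pipeline sketch):

def extract():
    for raw in [" 10 ", " 20 ", " bad ", " 30 "]:   # hypothetical source records
        yield raw

def transform(records):
    for r in records:
        r = r.strip()
        if r.isdigit():                             # drop records that fail validation
            yield int(r)

def load(records):
    for r in records:
        print("loaded", r)

# Each stage propagates its results to its successor.
load(transform(extract()))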
Management Issues
In this section we discuss some of the management issues associated with the
use of parallelism in a BI environment.
• Training and Technical Management Requirements
Significant technical training is required to understand both system management and
how best to take advantage of parallel systems.
• Minimal Software Support
Not many off-the-shelf application software packages take advantage of parallel
systems, although a number of RDBMS products do, and there is an increasing
number of high-end ETL tools that exploit parallelism.
• Scalability Issues
Scalability is not just a function of the size of the data. Any use of parallelism in an
analytical environment should be evaluated in the context of data size, the amount of
query processing expected, the number of concurrent users, and the kinds and
complexity of the analytical processing.
Management Issues (cont…)
• Need for Expertise
Ultimately, control and data dependence analysis are tasks that need to be
incorporated into the systems analysis role when migrating to a parallel infrastructure,
because the absence of valid dependence analysis will prevent proper exploitation of
concurrent resources.