Paper - World Academy of Science, Engineering and

advertisement
World Academy of Science, Engineering and Technology
International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering Vol:1, No:3, 2007
Selecting Materialized Views
Using Two-Phase Optimization with
Multiple View Processing Plan
Jiratta Phuboon-ob, and Raweewan Auepanwiriyakul
International Science Index, Economics and Management Engineering Vol:1, No:3, 2007 waset.org/Publication/2030
Abstract—A data warehouse (DW) is a system which has value
and role for decision-making by querying. Queries to DW are critical
regarding to their complexity and length. They often access millions
of tuples, and involve joins between relations and aggregations.
Materialized views are able to provide the better performance for
DW queries. However, these views have maintenance cost, so
materialization of all views is not possible. An important challenge of
DW environment is materialized view selection because we have to
realize the trade-off between performance and view maintenance
cost. Therefore, in this paper, we introduce a new approach aimed at
solve this challenge based on Two-Phase Optimization (2PO), which
is a combination of Simulated Annealing (SA) and Iterative
Improvement (II), with the use of Multiple View Processing Plan
(MVPP). Our experiments show that our method provides a further
improvement in term of query processing cost and view maintenance
cost.
Keywords—Data warehouse, materialized views, view selection
problem, two-phase optimization.
I. INTRODUCTION
A
data warehouse (DW) can be defined as subject-oriented,
integrated, nonvolatile, and time-variant collection of
data in support of management’s decision [1]. It can bring
together selected data from multiple database or other
information sources into a single repository [2]. To avoid
accessing base table and increase the speed of queries posed to
a DW, we can use some intermediate results from the query
processing stored in the DW called materialized views.
Although materialized views speed up query processing, they
have to be refreshed when changes occur to the base tables.
Therefore, materialized view selection involved query
processing cost and materialized view maintenance cost. So,
many literatures try to make the sum of that cost minimal. For
all of operation, i.e., select, project, join, order, group-by and
aggregation operation; join operation has the most impact on
query processing cost. In addition, some researchers consider
only join order optimization or aggregation operation, or both.
The existing algorithms solving query optimization, multiple
query optimizations, and materialized view selection can be
classified into four categories, i.e., deterministic algorithm,
randomized algorithm, evolutionary algorithm and hybrid
algorithm [3].
Our previous work, in [4], we analyzed and compared only
three types of algorithm; deterministic algorithm, evolutionary
algorithm and hybrid algorithm. In [5], we proposed TwoPhase Optimization (2PO) algorithm, which is a combination
of Simulated Annealing (SA) and Iterative Improvement (II),
to the materialized view selection problem with Multiple View
Processing Plan (MVPP) techniques compared to [6] and [7].
However in this experimental study, we show that, comparing
to [6] – [9] our method achieve substantial improvements in
term of query processing cost and view maintenance cost.
The rest of the paper is organized as follows. Section 2, we
describe Multiple View Processing Plan (MVPP). Section 3,
we focus on Iterative Improvement and Simulated Annealing.
Section 4, we propose our Two-Phase Optimization approach
which aimed at solve the materialized view selection problem.
Section 5, deals with our experimental studies, and is
concluded in section 6.
II. MULTIPLE VIEW PROCESSING PLAN (MVPP)
[160240]
[967519280]
[36276]
Tmp12
[160240]
Tmp14
[967519280]
[24036000000]
[32048000000]
[36276]
Tmp8[160240]
[150000]
Tmp13
[150000]
[1602400000]
Tmp11[36276]
Tmp6[2003]
Manuscript received February 21, 2007.
Jiratta Phuboon-ob, Ph.D. student is with School of Applied Statistics,
National Institute of Development Administration (NIDA), Bangkok,
Thailand (e-mail: jiratta.p@ msu.ac.th).
Raweewan Auepanwiriyakul is an Assistant Professor with School of
Applied Statistics, National Institute of Development Administration (NIDA),
Bangkok, Thailand (e-mail: raweewan@as.nida.ac.th).
International Scholarly and Scientific Research & Innovation 1(3) 2007
[160240]
[967519280]
Tmp4[5]
[25]
Tmp2[1]
[1]
[25]
Tmp3
Tmp1[1]
[5]
[150000]
[5]
[25]
[10000]
Tmp5
[25]
[10000]
[800000]
Tmp7
[10000]
[800000]
[800000]
Fig. 1 An example MVPP plan
53
[7255200000]
[50000]
scholar.waset.org/1999.10/2030
Tmp10[9069]
[200000]
[200000]
Tmp9
[200000]
[200000]
World Academy of Science, Engineering and Technology
International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering Vol:1, No:3, 2007
According to [6], they use multiple query processing
technique (MQP) to build multiple views processing plan
(MVPP) in order to identify views to be materialized.
An MVPP is a directed acyclic graph that represents a
query processing of DW views. An example MVPP is shown
in Fig. 1. The leaf nodes correspond to the base relations, and
the root nodes represent the queries. Any vertex which is an
intermediate or a final result of a query is denoted as a view.
The cost for each operation node is labeled at the right hand
side of each node. The query access frequencies are labeled on
the top of each query node
1. random starting point
2. at the starting point, a random neighbor is selected
3. compare neighbor’s cost and current’s cost
3.1 if the neighbor’s cost is lower than current’s cost
then move is carried out and a neighbor is sought
3.2 otherwise moving to neighbor’s state or staying
in current’s state with probability
4. performs random series of move and accepts both of
downhill and uphill until it reaches a local minimum
Fig. 3 Simulated Annealing (SA) algorithm
IV. OUR APPROACH: TWO-PHASE OPTIMIZATION FOR
SELECTING MATERIALIZED VIEW
International Science Index, Economics and Management Engineering Vol:1, No:3, 2007 waset.org/Publication/2030
III. ITERATIVE IMPROVEMENT AND SIMULATED ANNEALING
A. Iterative Improvement (II)
[10] exposed II algorithm to the large join query
optimization problem. II is a simple hill-climbing algorithm.
This algorithm performs a large number of local
optimizations. A local optimization starts at a random state,
and seeks minimum cost point using a strategy-like hillclimbing. At the starting point, a random neighbor is selected.
If the neighbor’s cost is lower than current’s cost, the move is
carried out and a new neighbor with the lower cost is sought.
II performs random series of move and accepts only downhill
ones until it reaches a local minimum. This algorithm is
repeated until a time limit is exceeded or a predetermined
number of starting points is processed, then the lowest local
minimum encountered is the result. The II algorithm present
in Fig. 2.
1. random starting point
2. at the starting point, a random neighbor is selected
3. if the neighbor’s cost is lower than current’s cost
then move is carried out and a neighbor is sought
4. performs random series of move and accepts only
downhill ones until it reaches a local minimum
Fig. 2 Iterative Improvement (II) algorithm
B. Simulated Annealing (SA)
[11] invented SA algorithm, and used it on traveling sale
man problem. SA follows a procedure similar to II, but it
accepts uphill move with some probability, while II performs
only downhill move. At each step, the SA considers
neighbor’s cost and the current’s cost, and probabilistically
decides among moving the system to neighbor’s state or
staying in current’s state. The probabilities are chosen, so the
system ultimately tends to move to states of the lower cost.
This step is repeated until the time becomes zero, or until the
system reaches a state which is good enough for the
application. [12] applied this algorithm to the optimization of
some recursive queries.
In [7], they introduced a new approach for materialized
view selection based on SA in conjunction with the use of a
MVPP. Fig. 3 show SA algorithm.
International Scholarly and Scientific Research & Innovation 1(3) 2007
Ioannidis and Kang [13] inspired Two-Phase Optimization
(2PO) algorithm to the optimization of project-select-join
queries. Our approach is designed based on 2PO with MVPP
for solving the materialized view selection problem. 2PO
combines both SA and II. It begins by running II to find a
good local minimum, and then applies SA to search for the
global minimum from the state found by II. Our algorithm
present in Fig. 4.
1. Input a MVPP represented by a DAG
2. Use width-first searching method to search through
all of the nodes in the DAG and produce an ordered
sequence of these nodes into a binary string
3. Call Iterative Improvement algorithm
4. Call Simulated Annealing algorithm
5. Present set of views to materialized with minimum cost
Fig. 4 Our Two-Phase Optimization (2PO) algorithm
In the following subsection, we give the details of
representation of solutions and define the cost model of
materialized view selection.
A. Representation of Solutions
Output from MVPP is a DAG. We map a DAG into a
binary string. For example, searching through the DAG,
shown in Fig. 1, using width-first, we obtain the mapping
array, i.e. [result3,0], [result1,0], [result2,0], [tmp14,0],
[tmp12,0], [tmp11,0], [tmp13,0], [tmp8,0], [tmp10,0],
[tmp6,0], [tmp7,0], [tmp9,0], [tmp4,0], [tmp5,0], [tmp2,0],
[tmp3,0], [tmp1,0]. A binary string of {0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0} indicates that none of node is materialized.
B. Cost Model of Materialized View Selection
According to [5], a linear cost model is used to calculate the
cost of query Q. The cost of answering Q is the number of
rows in the table that query Qi used to construct Q.
Let M be a set of materialized views, Cq (M) be the cost to
i
compute qi from the set of materialized views M, Cm (v) be the
cost of maintenance when v is materialized, and fq, fu are query
and updating frequency, respectively.
Then the total query processing cost is:
54
scholar.waset.org/1999.10/2030
World Academy of Science, Engineering and Technology
International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering Vol:1, No:3, 2007
∑q ∈Q f q Cq (M )
i
i
i
The total maintenance cost is:
∑v∈M fuCm(v)
The total cost of the materialized views M is:
∑q ∈Q f q Cq (M ) + ∑v∈M fuCm(v)
i
i
i
Our goal is to find the set M, if the members of M are
materialized then the value of total cost will be minimal
among all feasible sets of materialized view.
International Science Index, Economics and Management Engineering Vol:1, No:3, 2007 waset.org/Publication/2030
V. EXPERIMENTAL STUDIES
In our experiment, we employ MQP technique to all of five
algorithms and use the same cost model proposed by [6] to
compute query processing cost, materialized view
maintenance cost and total cost. We do not consider any
constraints. We use the TPC-H database of size 1GB as a
running example throughout our paper. It has 22 read-only
queries. Most of them are large and complex, and perform
different operations on the database table. For more details on
this benchmark refers to [14]. In this paper, we cover all of
regular aggregate functions; MIN, MAX, SUM, AVG,
COUNT, VARIANCE and STDDEV as listed in Fig. 5.
Query 1 (MIN)
select n_name, min(ps_supplycost)
from part, partsupp, supplier, nation, region
where p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA '
group by n_name;
Query 2 (MAX)
select n_name, max(o_totalprice)
from customer, orders, lineitem, nation, region
where c_custkey = o_custkey
and o_orderkey = l_orderkey
and c_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA'
and o_orderdate >= '1994-01-01'
and o_orderdate < '1995-01-01'
group by n_name;
Query 4 (AVG)
select n_name, avg(c_acctbal)
from partsupp, supplier, customer, nation, region
where ps_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA'
group by n_name;
Query 5 (COUNT)
select p_size, count(ps_suppkey)
from partsupp, part
where p_partkey = ps_partkey
and p_brand <> 'Brand#45'
and not p_type like '%BRASS%'
and p_size in (9, 19, 49)
group by p_size;
Query 6 (VARIANCE)
select p_size, variance(ps_supplycost)
from supplier, partsupp, part
where s_suppkey = ps_suppkey
and p_partkey = ps_partkey
and p_brand <> 'Brand#45'
Query 3 (SUM)
and not p_type like '%BRASS%'
select n_name, sum(l_quantity)
and p_size in (9, 19, 49)
from orders, lineitem, supplier, nation, region
group by p_size;
where l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA'
and o_orderdate >= '1994-01-01'
and o_orderdate < '1995-01-01'
group by n_name;
Query 7 (STDDEV)
select o_orderpriority, stddev(l_tax)
from customer, orders, lineitem
where c_custkey = o_custkey
and o_orderkey = l_orderkey
and o_orderdate >= '1994-01-01'
and o_orderdate < '1995-01-01'
group by o_orderpriority;
Fig. 5 Our Queries
We observe that these queries are defined over overlapping
portion of the base data or intermediate query result. For
example Query 5 and Query 6 can share the intermediate
result of joining part and partsupp. We assume that base table
is updated once a time and the frequencies of Query 1 to
International Scholarly and Scientific Research & Innovation 1(3) 2007
Query 7 are 4, 6, 7, 2, 5, 9, and 3 respectively.
We use the algorithm to generate MVPP proposed by [6].
Fig. 6 gives accessing plan for each of the above queries,
denoted as op1, op2, op3, op4, op5, op6, op7 respectively. As
a consequence, first of all, we push up all select, project, and
group-by operation. Second, we create a list of query and
order them based on the result of query access frequency
multiplied by query processing cost. Therefore the initial list
is {op4, op7, op3, op2, op1, op6, op5}.Third, we pick up op4,
and merge the rest of the queries with it in the order of that in
the list. Then we get the first MVPP. Fourth, the first element
of list is moved to the end of the list, so the list becomes {op7,
op3, op2, op1, op6, op5, op4}. We generate the second MVPP
by using the third step. We repeat the third and fourth step to
generate all 7 MVPPs. Next, we push down select, project,
and group-by operations respectively for all of MVPPs.
Finally, we select the cheapest one; shown in Fig. 7. In our
MVPP, we assume that methods for implementing select and
join operation are linear search and nested loop approach.
Before comparing the cost, we compute query processing cost,
materialized view maintenance cost and total cost of allvirtual-views and all-materialized-view, demonstrated in
Table I. Table II gives the selected views and their cost from
each algorithm. We compare these costs between five
algorithms as following:
In deterministic algorithm, given a MVPP, we execute the
view selection algorithm proposed by [6] to select
materialized view. The view selected are Tmp11, Tmp15,
Tmp17, Tmp21 and Tmp24. Based on these results, it would
be benefit to materialize them, reducing the cost from
7,688,720,739,017 to 6,184,919,079,222.
According to simulated annealing algorithm, we use an
existing simulated annealing package [15] and define this
problem based on [7]. We first search through the DAG,
shown in Fig. 7, using width-first, in order to map MVPP into
a binary string of 1s and 0s. We set SA parameters like [7] the
result of SA is {0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,
0,0,1,1,0,0,1,0,0,0,0,0,0} means that nodes named Tmp5,
Tmp11, Tmp15, Tmp17, Tmp18, Tmp21 and Tmp24 are
materialized, but the others are not. Based on these results; it
would be benefit to materialize them, reducing the cost from
7,688,720,739,017 to 6,184,918,609,222.
For evolutionary algorithm, we follow the genetic algorithm
(GA) proposed by [8] to solve this problem. We first map a
DAG into a binary string using the same method as used by
SA. We adopt the concept of ranking selection as selection
operator, and set GA parameters according to [8]. The
selected views are Tmp11, Tmp15, Tmp17, Tmp18, Tmp21
and Tmp24. Based on these results; it would be benefit to
materialize them, reducing the cost from 7,688,720,739,017 to
6,184,918,679,222.
As Hybrid algorithm, we use the view selection algorithm
presented by [9] to select materialized view. We divide this
method into two phases. In the first phase, we run GA
regarding the third experiment. The output of this phase is the
input used in the next phase. For the second phase, we run
55
scholar.waset.org/1999.10/2030
International Science Index, Economics and Management Engineering Vol:1, No:3, 2007 waset.org/Publication/2030
World Academy of Science, Engineering and Technology
International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering Vol:1, No:3, 2007
deterministic algorithm, similar to the first experiment. The
selected views are Tmp11, Tmp15, Tmp17, Tmp21 and
Tmp24, which are the same views processed in the first
experiment. So the total cost of these results is equal to the
total cost of deterministic algorithm.
For our two-phase optimization algorithm, we map a DAG
into a binary string using the same method as used by SA.
Then we run II and then followed by SA. The selected views
are Tmp5, Tmp11, Tmp12, Tmp15, Tmp17, Tmp18, Tmp21
and Tmp24. Based on these results; it would be benefit to
materialize them, reducing the cost from 7,688,720,739,017 to
6,184,918,159,222.
Table II compares our 2PO algorithm result to the
deterministic algorithm, SA algorithm, GA and hybrid
algorithm for materialized view selection. This table shows
that our 2PO algorithm approach provides the best result.
Although our maintenance cost is the most expensive,
however, our query processing cost is the cheapest one. So
our total cost is minimal. Consider the result of deterministic
algorithm and hybrid algorithm, theirs maintenance cost are
the cheapest, however, theirs query processing cost are the
most expensive, leading to theirs total cost the most expensive
consequently. For GA and SA algorithm, their query
processing cost and maintenance cost are moderate, so theirs
total cost are moderate too.
VI. CONCLUSION
In this paper, we introduce Two-Phase Optimization (2PO)
algorithm, which is a combination of Simulated Annealing
(SA) and Iterative Improvement (II), to materialized view
selection with Multiple View Processing Plan (MVPP)
proposed by [6]. Comparing to deterministic algorithm
exposed by [6], Simulated Annealing proposed by [7], Genetic
Algorithm introduced by [8], and hybrid algorithm invented
by [9], our approach provides a better result than the other
algorithm. Two-Phase Optimization finds the best solution,
and avoids unnecessary large uphill moves at the early stages
of Simulated Annealing.
ACKNOWLEDGMENT
We are grateful to Wanida Prapavasit, Roozbeh
Derakhshan and Dyke Stiles for helping by way of discussion
and provide material for this paper.
Fig. 6 Individual Optimal Query Processing Plan
International Scholarly and Scientific Research & Innovation 1(3) 2007
56
scholar.waset.org/1999.10/2030
World Academy of Science, Engineering and Technology
International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering Vol:1, No:3, 2007
TABLE I
THE QUERY PROCESSING, MAINTENANCE AND TOTAL COST
All-virtual-views
All-materialized-views
Cost of query processing
Cost of maintenance
Total Cost
8,494,509,321,063
1,941,298,714
7,686,779,440,303
8,494,509,321,063
7,688,720,739,017
TABLE II
THE QUERY PROCESSING, MAINTENANCE AND TOTAL COST FOR EACH ALGORITHM
Algorithm
International Science Index, Economics and Management Engineering Vol:1, No:3, 2007 waset.org/Publication/2030
Deterministic
Simulated Annealing
Genetic Algorithm
Hybrid Algorithm
Two-Phase Optimization
Cost of query
processing
591,205,328,714
591,204,438,714
591,204,528,714
591,205,328,714
591,203,688,714
Selected views
Tmp11, Tmp15, Tmp17, Tmp21, Tmp24
Tmp5, Tmp11, Tmp15, Tmp17, Tmp18, Tmp21, Tmp24
Tmp11, Tmp15, Tmp17, Tmp18, Tmp21, Tmp24
Tmp11, Tmp15, Tmp17, Tmp21, Tmp24
Tmp5, Tmp11, Tmp12, Tmp15, Tmp17, Tmp18, Tmp21,
Tmp24
Cost of
maintenance
5,593,713,750,508
5,593,714,170,508
5,593,714,150,508
5,593,713,750,508
5,593,714,470,508
Total Cost
6,184,919,079,222
6,184,918,609,222
6,184,918,679,222
6,184,919,079,222
6,184,918,159,222
Fig. 7 An MVPP
REFERENCES
[1]
[2]
[3]
[4]
W.H.Inmon, Building the Data Warehouse, John Wiley and Sons, 2002.
J. Widom, “Research Problems in Data Warehousing,” Int. Conf. on
Information and Knowledge Management, 1995, pp. 25-30.
C. Zhang, X. Yao, and J. Yang, “An Evolutionary Approach to
Materialized Views Selection in a Data Warehouse Environment,” IEEE,
2001, 31, pp. 282-294.
J. Phuboon-ob, and R. Auepanwiriyakul, “Analysis and Comparison of
Algorithm for Selecting Materialized Views in a Data Warehousing
Environment” APDSI, 2006, 392-395.
International Scholarly and Scientific Research & Innovation 1(3) 2007
[5]
[6]
[7]
57
J. Phuboon-ob, and R. Auepanwiriyakul, “Two-Phase Optimization for
Selecting Materialized Views in a Data Warehouse,” Enformatika
Transactions on Engineering, Computing and Technology, vol. 19
pp.277-281, Jan. 2007.
J. Yang, K. Karlapalem, and Q. Li, “Algorithms for Materialized View
Design in Data Warehousing Environment,” VLDB Conference, 1997,
136-145.
R. Derakhshan, F. Dehne, O. Korn and B. Stantic, “Simulated Annealing
for Materialized View Seletion in Data Warehousing Environment,”
DBA, 2006, 89-94.
scholar.waset.org/1999.10/2030
World Academy of Science, Engineering and Technology
International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering Vol:1, No:3, 2007
[8]
[9]
[10]
[11]
[12]
[13]
[14]
International Science Index, Economics and Management Engineering Vol:1, No:3, 2007 waset.org/Publication/2030
[15]
C. Zhang, and J. Yang, “Genetic Algorithm for Materialized View
Selection in Data Warehouse Environements,” Proc. Int’l Conf. Data
Warehousing and Knowledge Discovery (DaWaK), 1999, pp.116-125.
C. Zhang, X. Yao and J. Yang, “An Evolutionary Approach to
Materialized Views Selection in a Data Warehouse Environment,” IEEE
Transactions on Systems, Man, and Cybernetics, Part C, Vol. 31, pp.
282-294, 2001.
S. Nahar, S. Sahni, and E. Shragowitz “Simulated Annealing and
Combinatorial Optimization,” Design Automation Conference, 1986,
293-299.
S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, “Optimization by
Simulated Annealing,” Science, 1983, 671-680.
Y. Ioannidis, and E. Wong, “Query optimization by simulated
annealing,” ACM SIGMOD, 1987, 2-22.
Y. E. Ioannidis and Y. C. Kang, “Randomized Algorithm for
Optimizating Large Join Queries,” ACM SIGMOD, 1990, 312-321.
Transaction Processing Performance Council. “TPC benchmarks-H
Revision 2.3.0,” Available: www.tpc.org, 2005.
D. Stiles, “A ‘C’ Robust Simulated Annealing Package”. Available:
http://www.engineering.usu.edu/ece/research/rtpc/projects/comb/robust,
2006.
International Scholarly and Scientific Research & Innovation 1(3) 2007
Jiratta Phuboon-ob received the B.Sc. (Hons.) degree in Statistics from
Srinakharinwirot University, Mahasarakham Campus, Thailand in 1992 and
the M.S. degree in Applied Statistics from School of Applied Statistics,
National Institute of Development Administration (NIDA), Bangkok,
Thailand in 1997.
She is currently pursuing the Ph.D. degree in Computer Science at NIDA,
Bangkok, Thailand. Her research interests include databases, data warehouse
and data mining.
Raweewan Auepanwiriyakul received the B.Sc. degree in Radiological
Technology from Mahidol University, Thailand, in 1982 and the M.S. and
Ph.D. degree in Computer Science from University of North Texas, U.S.A., in
1985 and 1989, respectively.
Currently, she is an Assistant Professor with School of Applied Statistics,
National Institute of Development Administration (NIDA), Bangkok,
Thailand. Her research interests include databases, object-oriented analysis
and design, and data warehouse.
58
scholar.waset.org/1999.10/2030
Download