Aggregate Function Computation and Iceberg Querying in Vertical Database
Yue Cui
Computer Science Department
North Dakota State University
Fargo, ND, 58102, U.S.A.
yue.cui@ndsu.edu

William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND, 58102, U.S.A.
william.perrizo@ndsu.edu

ABSTRACT

Aggregate function computation and iceberg querying are important and common in many applications of data mining and data warehousing, because people are usually interested in finding anomalies or unusual patterns by computing aggregate functions across many attributes or by finding aggregate values above some specified threshold. In this paper, we present efficient algorithms to compute aggregate functions and evaluate iceberg queries using P-trees, an innovative data structure for vertical databases. Our methods do well for most aggregate functions, such as Count, Sum, Average, Min, Max, Median, Rank, and Top-k, and in some situations they are the best of all. Our algorithms use very little memory and run quickly because, in most cases, we pass over the data only once and use the logical operations And/Or to implement the entire task.

1. INTRODUCTION

1.1. Background

Aggregation functions [1] across many attributes are widely used in data mining and data warehousing queries. The commonly used aggregation functions include Count, Sum, Average, Min, Max, Median, Rank, and Top-k. A common class of queries in data mining and data warehousing is the iceberg query [2], which computes an aggregate function across attributes and then eliminates aggregate values that fall below some specified threshold. Iceberg queries are so called because the number of above-threshold results is often very small (the tip of an iceberg) relative to the large amount of input data (the iceberg). Iceberg queries are also common in many other applications, including information retrieval, clustering, and copy detection [2].

Fast implementation of aggregate functions and iceberg queries is always a challenging task in data warehousing, where a dataset may contain millions of records. One method that improves the speed of iceberg queries in data warehousing is to pre-compute the iceberg cube [3]. The iceberg cube stores all the above-threshold aggregate information of a dataset. When a customer initiates an iceberg query, the system looks into the iceberg cube for the answer instead of computing the result from the original dataset, which accelerates on-line processing. Efficient algorithms for computing the iceberg cube have been widely studied [3].

Recently, a new data structure, the Peano Tree (P-tree) [4], has been introduced for large datasets. The P-tree is a lossless, quadrant-based, compressed data structure that provides fast logical operations. One of the P-tree variations, the predicate P-tree, is used to reduce data accesses by filtering out “bit holes,” which consist of consecutive 0’s [4].

In this paper, we investigate how to efficiently implement aggregate functions and iceberg queries using P-trees. We make two contributions: first, we develop efficient algorithms that compute the various aggregate functions using P-trees; second, we illustrate the procedure for implementing iceberg queries using P-trees. In our algorithms and procedures, the main operations are counting the number of 1s in P-trees and performing logic operations, And/Or, of
P-trees. These operations can be executed quickly by hardware, so their cost is very low. As the experiments section shows, our methods and procedure are superior in computing aggregate functions and iceberg queries in many cases.

This paper is organized as follows. In Section 2, we review the literature and introduce the concepts of aggregate functions and iceberg queries. In Section 3, we review the P-tree technology. In Section 4, we give the algorithms for the aggregate functions using P-trees. In Section 5, we give an example of how to implement iceberg queries using P-trees. In Section 6, we describe our experimental results, where we compare our method with the bitmap index. Section 7 concludes the paper.

2. LITERATURE REVIEW

2.1. Aggregate Functions

Frequently, people are interested in summarizing data to determine trends or produce top-level reports. For example, a purchasing manager may not be interested in a listing of all computer hardware sales, but may simply want to know the number of Notebooks sold in a specific month. Aggregate functions assist with the summarization of large volumes of data.

Three types of aggregate functions were identified in [1]. Consider aggregating a set of tuples, T. Let $\{S_i \mid i = 1 \ldots n\}$ be any complete set of disjoint subsets of T such that $\bigcup_i S_i = T$ and $\bigcap_i S_i = \{\}$.

Distributive: An aggregate function F is distributive if there is a function G such that $F(T) = G(\{F(S_i) \mid i = 1 \ldots n\})$. SUM, MIN, and MAX are distributive with G = F. Count is distributive with G = SUM.

Algebraic: An aggregate function F is algebraic if there is an M-tuple valued function G and a function H such that $F(T) = H(\{G(S_i) \mid i = 1 \ldots n\})$. Average, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic. For Average, the function G records the sum and count. The H function adds these two components and then divides to produce the global average. Similar techniques apply to finding the N largest values, the center of mass of a group of objects, and other algebraic functions. The key to algebraic functions is that a fixed-size result (an M-tuple) can summarize the sub-aggregation.

Holistic: An aggregate function F is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. That is, there is no constant M such that an M-tuple characterizes the computation of F. Median, MostFrequent (also called the Mode), and Rank are common examples of holistic functions.

Efficient computation of all these aggregate functions is required in most large database applications. There are many efficient techniques for computing distributive and algebraic functions [2], but only a few for holistic functions such as Median. In this paper, our method to compute Median is one of the best among all the available techniques.

2.2. Iceberg Queries

Iceberg queries are a class of queries that compute aggregate functions across attributes to find aggregate values above some specified threshold. Given a relation R with attributes a1, a2, …, an, and m, an aggregate function AggF, and a threshold T, an iceberg query has the following form:

    SELECT   R.a1, R.a2, …, R.an, AggF (R.m)
    FROM     relation R
    GROUP BY R.a1, R.a2, …, R.an
    HAVING   AggF (R.m) >= T

The number of tuples that satisfy the threshold in the HAVING clause is relatively small compared to the large amount of input data. The output can be seen as the “tip of the iceberg,” where the input data is the “iceberg.”
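To make the query form above concrete, the following is a minimal Python sketch of an iceberg query evaluated in the conventional horizontal (row-by-row) way. It is only an illustration, not the method proposed in this paper: the function name `iceberg_sum` is hypothetical, the rows are those of the Sales relation listed later in Table 1, and the threshold of 15 is the one used in the example of Section 5.

```python
from collections import defaultdict

# Rows of the Sales relation (location, product type, number of products),
# taken from Table 1 of this paper.
SALES = [
    ("New York", "Notebook", 10),
    ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6),
    ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11),
    ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]


def iceberg_sum(rows, threshold):
    """GROUP BY (location, type), SUM(products), HAVING SUM >= threshold.

    Hypothetical helper for illustration: it scans every row and keeps one
    running total per group, which is the baseline the paper improves on.
    """
    totals = defaultdict(int)
    for location, product_type, count in rows:
        totals[(location, product_type)] += count
    # Keep only the groups at or above the threshold (the "tip").
    return {group: total for group, total in totals.items() if total >= threshold}


result = iceberg_sum(SALES, threshold=15)
print(result)
```

Every group is materialized before the threshold is applied, which is exactly the intermediate cost that pruning strategies try to avoid.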
Suppose a purchasing manager is given a sales transaction dataset and wants to know, for every type of product in each local store, the total number of products sold, provided that this total is above a certain threshold T. To answer this question, we can use the iceberg query below:

    SELECT   Location, Product Type, Sum (# Product)
    FROM     Relation Sales
    GROUP BY Location, Product Type
    HAVING   Sum (# Product) >= T

To implement an iceberg query, a common strategy in a horizontal database is first to apply hashing or sorting to all the data in the dataset, then to count all of the Location and Product Type pair groups, and finally to eliminate those groups that do not pass the threshold T. These algorithms can generate significant I/O for intermediate results and require large amounts of main memory, so they leave much room for improvement in efficiency. One method is to prune groups using an Apriori-like [5] [6] method. However, the Apriori-like method is not always simple to use for all aggregate functions. For instance, it is efficient in calculating Sum but has little use in implementing Average [4]. In our example, instead of counting the number of tuples in every Location and Product Type pair group in the first step, we can do the following.

Generate Location-list: a list of local stores that sell at least T products. For example,

    SELECT   Location, Sum (# Product)
    FROM     Relation Sales
    GROUP BY Location
    HAVING   Sum (# Product) >= T

Generate Product Type-list: a list of categories that sell at least T products. For example,

    SELECT   Product Type, Sum (# Product)
    FROM     Relation Sales
    GROUP BY Product Type
    HAVING   Sum (# Product) >= T

From the knowledge generated above, we can eliminate many of the Location and Product Type pair groups: we only generate candidate pairs for local stores and product types that are in Location-list and Product Type-list. This approach improves efficiency by pruning many groups beforehand. Then, by performing the And operation on value P-trees, we can calculate the results easily. We illustrate our method in detail by example in Section 5.

3. REVIEW OF P-TREES

A Peano tree (P-tree) is a lossless, bitwise tree. A P-tree can be 1-dimensional, 2-dimensional, 3-dimensional, etc. In this section, we focus on 1-dimensional P-trees. We give a brief review of the structure and operations of P-trees, and introduce two variations of predicate P-trees that are used in our iceberg query algorithm.

3.1. Structure of P-trees

Given a data set with d attributes, $X = (A_1, A_2, \ldots, A_d)$, and the binary representation of the j-th attribute $A_j$ as $b_{j,m} b_{j,m-1} \ldots b_{j,i} \ldots b_{j,1} b_{j,0}$, we decompose each attribute into bit files, one file for each bit position [1]. To build a P-tree, a bit file is recursively partitioned into halves and each half into sub-halves until the sub-half is pure (entirely 1-bits or entirely 0-bits).

The detailed construction of P-trees is illustrated by an example in Figure 1. The transaction set is shown in a). For simplicity, assume each transaction has one attribute. We represent the attribute as binary values, e.g., $(7)_{10} = (111)_2$, and then vertically decompose the values into three separate bit files, one file for each bit, as shown in b). The corresponding basic P-trees, P1, P2 and P3, are constructed by recursive partition and are shown in c), d) and e).

As shown in c) of Figure 1, the root of the P1 tree is 3, which is the 1-bit count of the entire bit file. The second level of P1 contains the 1-bit counts of the two halves, 0 and 3. Since the first half is pure, there is no
need to partition it. The second half is further partitioned recursively.

[Figure 1. Construction of 1-D P-trees: a) transaction set (010, 011, 010, 010, 101, 010, 111, 111); b) 3 bit files; c) P1; d) P2; e) P3.]

3.2. P-tree Operations

AND, OR, and NOT logic operations are the most frequently used P-tree operations. For efficient implementation, we use a variation of P-trees called Pure-1 trees (P1-trees). A tree is pure-1 if all the values in the sub-tree are 1’s. A node in a P1-tree is a 1-bit if and only if that half is pure-1. Figure 2 shows the P1-trees corresponding to the P-trees in c), d), and e) of Figure 1.

[Figure 2. P1-trees for the Transaction Set: a) P1,1; b) P1,2; c) P1,3.]

The P-tree logic operations are performed level by level, starting from the root level. They are commutative and distributive, since they are simply pruned bit-by-bit operations. For instance, ANDing a pure-0 node with anything results in a pure-0 node, and ORing a pure-1 node with anything results in a pure-1 node. In Figure 3, a) is the result of ANDing P1,1 and P1,2, b) is the result of ORing P1,1 and P1,3, and c) is the result of NOT P1,3 (or P1,3’), where P1,1, P1,2 and P1,3 are shown in Figure 2.

[Figure 3. AND, OR and NOT Operations: a) P1,1 AND P1,2; b) P1,1 OR P1,3; c) P1,3’.]

3.3. Predicate P-trees

There are many variations of predicate P-trees, such as value P-trees, tuple P-trees, and inequality P-trees [8][9][10][11]. In this section, we describe value P-trees and inequality P-trees, which are used in evaluating range predicates.

3.3.1. Value P-trees

A value P-tree represents a data set X related to a specified value v, denoted by $P_{x=v}$, where $x \in X$. Let $v = b_m b_{m-1} \ldots b_0$, where $b_i$ is the i-th binary bit value of v. There are two steps to calculate $P_{x=v}$:

1) Get the bit P-tree $P_{b,i}$ for each bit position of v according to the bit value: if $b_i = 1$, $P_{b,i} = P_i$; otherwise $P_{b,i} = P_i'$.
2) Calculate $P_{x=v}$ by ANDing all the bit P-trees of v, i.e., $P_{x=v} = P_{b,1} \wedge P_{b,2} \wedge \ldots \wedge P_{b,m}$, where $\wedge$ denotes the AND operation.

For example, to get the value P-tree satisfying x = 101 in Figure 1, we have $P_{x=101} = P_{b,3} \wedge P_{b,2} \wedge P_{b,1} = P_3 \wedge P_2' \wedge P_1$.

3.3.2. Inequality P-trees

An inequality P-tree represents the data points within a data set X satisfying an inequality predicate, such as
$x > v$, $x \ge v$, $x < v$, and $x \le v$. Without loss of generality, we discuss two inequality P-trees, for $x \ge v$ and $x \le v$, denoted by $P_{x \ge v}$ and $P_{x \le v}$, where $x \in X$ and v is a specified value. The calculation of $P_{x \ge v}$ and $P_{x \le v}$ is as follows.

Calculation of $P_{x \ge v}$: Let x be a data item within a data set X, x be m-bit data, and $P_m, P_{m-1}, \ldots, P_0$ be the P-trees for the vertical bit files of X. Let $v = b_m \ldots b_i \ldots b_0$, where $b_i$ is the i-th binary bit value of v, and let $P_{x \ge v}$ be the predicate tree for the predicate $x \ge v$. Then

$P_{x \ge v} = P_m \; op_m \; \ldots \; P_i \; op_i \; P_{i-1} \; \ldots \; op_1 \; P_0$, $i = 0, 1, \ldots, m$,

where 1) $op_i$ is $\wedge$ if $b_i = 1$ and $\vee$ otherwise, and 2) the operators are right binding. Here, $\wedge$ means the AND operation and $\vee$ means the OR operation. Right binding means that operators are associated from right to left, e.g., $P_2 \; op_2 \; P_1 \; op_1 \; P_0$ is equivalent to $(P_2 \; op_2 \; (P_1 \; op_1 \; P_0))$. For example, the inequality tree $P_{x \ge 101} = (P_2 \wedge (P_1 \vee P_0))$.

Calculation of $P_{x \le v}$: The calculation of $P_{x \le v}$ is similar to that of $P_{x \ge v}$. Let x be a data item within a data set X, x be m-bit data, and $P'_m, P'_{m-1}, \ldots, P'_0$ be the complement P-trees for the vertical bit files of X. Let $v = b_m \ldots b_i \ldots b_0$, where $b_i$ is the i-th binary bit value of v, and let $P_{x \le v}$ be the predicate tree for the predicate $x \le v$. Then

$P_{x \le v} = P'_m \; op_m \; \ldots \; P'_i \; op_i \; P'_{i-1} \; \ldots \; op_{k+1} \; P'_k$, $k \le i \le m$,

where 1) $op_i$ is $\wedge$ if $b_i = 0$ and $\vee$ otherwise, 2) k is the rightmost bit position with value 0, i.e., $b_k = 0$ and $b_j = 1$ for all $j < k$, and 3) the operators are right binding. For example, the inequality tree $P_{x \le 101} = (P'_2 \vee P'_1)$.

We will frequently use value P-trees and inequality P-trees in our calculation of aggregation functions and iceberg queries.

4. AGGREGATE FUNCTION COMPUTATION USING P-TREES

In this section, we give algorithms that show how to use P-trees to compute various aggregation functions. We illustrate some of these algorithms by examples. For simplicity, the examples are computed on a single attribute. As will be seen, our algorithms can be easily extended to evaluate aggregate functions over multiple attributes.

First, we give our example dataset in Table 1. Second, we illustrate the procedure to convert the sample dataset into P-tree bit files in Table 2. Suppose we have a relation S with information about every transaction of a company. First we transform relation S into binary representation. For numerical attributes, this step is simple: we just change the decimal values into binary numbers. For categorical attributes, there are two steps: first, we translate categorical values into integers; second, we convert the integers into binary numbers. By changing categorical attribute values into integers, we save memory space and make processing much easier, since our algorithms no longer need to process string values.

    Id  Mon  Loc          Type      On line  # Product
    1   Jan  New York     Notebook  Y        10
    2   Jan  Minneapolis  Desktop   N        5
    3   Feb  New York     Printer   Y        6
    4   Mar  New York     Notebook  Y        7
    5   Mar  Minneapolis  Notebook  Y        11
    6   Mar  Chicago      Desktop   Y        9
    7   Apr  Minneapolis  Fax       N        3

Table 1. Relation Sales.

    Id  Mon              Loc                   Type            On line  # Product
        P0,3 ... P0,0    P1,4 ... P1,0         P2,2 ... P2,0   P3,0     P4,3 ... P4,0
    1   0001             00001                 001             1        1010
    2   0001             00101                 010             0        0101
    3   0010             00001                 100             1        0110
    4   0011             00001                 001             1        0111
    5   0011             00101                 001             1        1011
    6   0011             00110                 010             1        1001
    7   0100             00101                 101             0        0011

Table 2. Binary Form of Sales.

Next we vertically decompose the binary transaction table into bit files, one file for each bit position. In relation S, there are 17 bit files in total. Then we build 17 basic P-trees for relation S. The detailed decomposition process is discussed in Section
3. For convenience, we only use uncompressed P-trees to illustrate the calculation process that follows.

4.1. Count Aggregate

The COUNT function is probably the simplest and most useful of all these aggregate functions. It is not necessary to write a special function for Count because the P-tree RootCount function already provides the mechanism to implement it. Given a P-tree Pi, RootCount (Pi) returns the number of 1s in Pi. For example, if we want to know how many transactions in relation S are conducted on-line, we just need to count the number of 1s in P3,0:

    Count (# Transaction on line) = RootCount (P3,0) = RootCount (1011110) = 5.

Thus we know that 5 out of 7 transactions in our dataset are conducted on line.

4.2. Sum Aggregate

The Sum function totals a field of numerical values. The algorithm is given in Algorithm 4.1.

    Algorithm 4.1. Evaluating Sum () with P-tree.
    total = 0.00;
    For i = 0 to n {
        total = total + 2^i * RootCount (Pi);
    }
    Return total;

Algorithm 4.1. Sum Aggregate.

4.3. Average Aggregate

The Average function returns the average value in a field. It can be calculated from the functions Count and Sum:

    Average () = Sum () / Count ().

4.4. Max Aggregate

The Max function returns the largest value in a field. For example, if we want to know the maximum number of products sold in one transaction in relation S, we can use Algorithm 4.2.

    Algorithm 4.2. Evaluating Max () with P-tree.
    max = 0.00;
    c = 0;
    Pc is set all 1s
    For i = n to 0 {
        c = RootCount (Pc AND Pi);
        If (c >= 1) {
            Pc = Pc AND Pi;
            max = max + 2^i;
        }
    }
    Return max;

Algorithm 4.2. Max Aggregate.

4.5. Min Aggregate

The Min function returns the smallest value in a field. For example, if we want to know the minimum number of products sold in one transaction, we can use Algorithm 4.3.

    Algorithm 4.3. Evaluating Min () with P-tree.
    min = 0.00;
    c = 0;
    Pc is set all 1s
    For i = n to 0 {
        c = RootCount (Pc AND NOT (Pi));
        If (c >= 1)
            Pc = Pc AND NOT (Pi);
        Else
            min = min + 2^i;
    }
    Return min;

Algorithm 4.3. Min Aggregate.

4.6. Median/Rank Aggregate

The Median function returns the median value in a field. The Rank (K) function returns the value that is the k-th largest value in a field. For example, if we want to know the median of the number of products sold in all the transactions, we can use Algorithm 4.4. The detailed steps follow the algorithm.
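The bit-slice procedures of this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' C++ implementation: uncompressed P-trees are modelled as plain lists of 0/1 values, and the helper names (`bit_slices`, `root_count`, `p_and`, `p_not`, `p_sum`, `p_max`, `p_min`, `p_rank`) are hypothetical. The data are the `# Product` values of Table 1, so `slices[3]` to `slices[0]` play the role of P4,3 to P4,0.

```python
# Values of attribute "# Product" from Table 1, one per transaction.
VALUES = [10, 5, 6, 7, 11, 9, 3]
NBITS = 4  # bit positions 3..0, matching P4,3 .. P4,0


def bit_slices(values, nbits):
    """Vertical decomposition: slices[i][t] is bit i of values[t]."""
    return [[(v >> i) & 1 for v in values] for i in range(nbits)]


def root_count(bits):
    """Number of 1s in a bit vector (RootCount of an uncompressed P-tree)."""
    return sum(bits)


def p_and(a, b):
    return [x & y for x, y in zip(a, b)]


def p_not(a):
    return [1 - x for x in a]


def p_sum(slices):
    """Sum as in Algorithm 4.1: weight each slice's 1-count by 2^i."""
    return sum((1 << i) * root_count(s) for i, s in enumerate(slices))


def p_max(slices):
    """Max as in Algorithm 4.2: keep bit i if any remaining value has it set."""
    pc, result = [1] * len(slices[0]), 0
    for i in reversed(range(len(slices))):
        candidate = p_and(pc, slices[i])
        if root_count(candidate) >= 1:
            pc, result = candidate, result + (1 << i)
    return result


def p_min(slices):
    """Min as in Algorithm 4.3: clear bit i if any remaining value has it clear."""
    pc, result = [1] * len(slices[0]), 0
    for i in reversed(range(len(slices))):
        candidate = p_and(pc, p_not(slices[i]))
        if root_count(candidate) >= 1:
            pc = candidate
        else:
            result += 1 << i
    return result


def p_rank(slices, k):
    """Value of the k-th largest element (k = 1 is the maximum)."""
    pc, result, pos = [1] * len(slices[0]), 0, k
    for i in reversed(range(len(slices))):
        candidate = p_and(pc, slices[i])
        c = root_count(candidate)
        if c >= pos:   # the k-th largest has bit i set
            pc, result = candidate, result + (1 << i)
        else:          # skip the c larger values and look among the rest
            pos -= c
            pc = p_and(pc, p_not(slices[i]))
    return result


slices = bit_slices(VALUES, NBITS)
print(p_sum(slices), p_max(slices), p_min(slices), p_rank(slices, 4))
```

With seven values, `p_rank(slices, 4)` corresponds to the median. Each function touches every bit slice once and never reads the original values again after the decomposition.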
    Algorithm 4.4. Evaluating Median () with P-tree.
    median = 0.00;
    pos = N/2 (rounded up); for Rank, pos = K;
    c = 0;
    Pc is set all 1s for single attribute
    For i = n to 0 {
        c = RootCount (Pc AND Pi);
        If (c >= pos) {
            median = median + 2^i;
            Pc = Pc AND Pi;
        }
        Else {
            pos = pos - c;
            Pc = Pc AND NOT (Pi);
        }
    }
    Return median;

Algorithm 4.4. Median/Rank (K) Aggregates.

Step 1: pos = 4; we decide the position of the median in the dataset.
    Pc = (1111111)
    Median = 0

Step 2 (i = 3): RootCount (Pc AND P4,3) = 3. Because c < pos (3 < 4),
    pos = 4 - 3 = 1
    Pc = Pc AND NOT (P4,3) = (0111001)
    Median = 0

Step 3 (i = 2): RootCount (Pc AND P4,2) = 3. Because c >= pos (3 > 1),
    pos = 1
    Pc = Pc AND (P4,2) = (0111000)
    Median = 0 + 2^2 = 4

Step 4 (i = 1): RootCount (Pc AND P4,1) = 2. Because c >= pos (2 > 1),
    pos = 1
    Pc = Pc AND (P4,1) = (0011000)
    Median = 4 + 2^1 = 6

Step 5 (i = 0): RootCount (Pc AND P4,0) = 1. Because c >= pos (1 = 1),
    pos = 1
    Pc = Pc AND (P4,0) = (0001000)
    Median = 6 + 2^0 = 7

i = -1: the program stops.

The median of attribute # Product is 7. Similarly, if we want to know the 5th largest number of products sold in all the transactions, we just need to change the initial value of pos from 4 to 5. Then, with the same procedure, we get the 5th largest number in attribute # Product.

4.7. Top-k Aggregate

The Top-k (K) function is very useful in various clustering and classification algorithms. To get the largest k values in a field, we first find the rank-k value Vk using the function Rank (K). Second, we find all the tuples whose values are greater than or equal to Vk. In Section 3, we illustrated how to calculate range predicates using inequality P-trees; we use the same method here to find all the values that are greater than or equal to Vk.

5. ICEBERG QUERY OPERATION USING P-TREES

Besides the computation of aggregate functions, another important part of iceberg queries is the implementation of the Group By operation. By using value P-trees, we can make the calculation of Group By efficient. With efficient algorithms to compute aggregate functions and to implement the Group By operation, iceberg queries can be executed quickly using P-trees. We give an example to illustrate how to implement an iceberg query using P-trees.

We described the iceberg query algorithm in Section 2. We demonstrate the procedure with the following example:

    SELECT   Loc, Type, Sum (# Product)
    FROM     Relation S
    GROUP BY Loc, Type
    HAVING   Sum (# Product) >= 15

5.1. Step One
We build value P-trees for every value of attribute Loc. Attribute Loc has three values, {Loc | New York, Minneapolis, Chicago}. Their value P-trees are shown in Figure 4.

[Figure 4. Value P-trees of Attribute Loc: PNY, PMN, PCH.]

Since we described how to calculate a value P-tree from basic P-trees in Section 3, we only illustrate the procedure of generating the value P-tree PNY as an example. In Table 2, we can see that the binary value of New York is 00001. We process the bits one by one from left to right. If the bit is 0 in the current bit position, we use the complement (primed) P-tree of this bit in our formula. If the bit is 1, we use the basic P-tree of this bit in our formula. Finally we obtain formula 1.

    PNY = P’1,4 AND P’1,3 AND P’1,2 AND P’1,1 AND P1,0        (1)

Figure 5 illustrates the calculation procedure of value P-tree PNY.

[Figure 5. Procedure of Calculating PNY: columns P1,4 to P1,0, the complements P’1,4 to P’1,1, and the result PNY.]

After getting all the value P-trees for each location, we calculate the total number of products sold in each place. We still use the value New York as our example. Formula 2 illustrates how we obtain the total number of products sold in New York.

    Sum (# Product | New York)
      = 2^3 × RootCount (P4,3 AND PNY) + 2^2 × RootCount (P4,2 AND PNY)
      + 2^1 × RootCount (P4,1 AND PNY) + 2^0 × RootCount (P4,0 AND PNY)
      = 8 × 1 + 4 × 2 + 2 × 3 + 1 × 1 = 23        (2)

Table 3 shows the total number of products sold in each of the three locations. Since our threshold is 15, and since, according to the Apriori principle, if a superset cannot pass the threshold then none of its subsets can pass it either, we eliminate the city Chicago.

    Loc Values    Sum (# Product)  Threshold
    New York      23               Y
    Minneapolis   19               Y
    Chicago       9                N

Table 3. Summary Table of Attribute Loc.

5.2. Step Two

Similarly, we build value P-trees for every value of attribute Type. Attribute Type has four values, {Type | Notebook, Desktop, Printer, Fax}.

[Figure 6. Value P-trees of Attribute Type: PNotebook, PDesktop, PPrinter, PFAX.]

Figure 6 shows the value P-trees of the four values of attribute Type. Similarly, we get the summary table for each value of attribute Type in Table 4. According to the threshold, only the value P-tree of Notebook will be used in the next step.

    Type Values   Sum (# Product)  Threshold
    Notebook      28               Y
    Desktop       14               N
    FAX           3                N
    Printer       6                N

Table 4. Summary Table of Attribute Type.
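Steps One and Two, together with the pairing step that Section 5.3 describes, can be sketched as follows. This is a minimal Python sketch under stated assumptions rather than the paper's implementation: value P-trees are modelled as uncompressed 0/1 lists built directly from the categorical columns of Table 1 (instead of by ANDing basic P-trees as in formula 1), and the names `value_mask`, `masked_sum` and `frequent_values` are hypothetical.

```python
# Table 1 columns (Loc, Type, # Product), one entry per transaction.
LOC = ["New York", "Minneapolis", "New York", "New York",
       "Minneapolis", "Chicago", "Minneapolis"]
TYPE = ["Notebook", "Desktop", "Printer", "Notebook",
        "Notebook", "Desktop", "Fax"]
PRODUCT = [10, 5, 6, 7, 11, 9, 3]
THRESHOLD = 15
NBITS = 4  # "# Product" is stored in 4 bit slices, P4,3 .. P4,0


def value_mask(column, value):
    """Uncompressed value P-tree: bit t is 1 iff column[t] == value."""
    return [int(v == value) for v in column]


def masked_sum(values, mask, nbits=NBITS):
    """Sum of the values selected by mask, computed slice by slice
    as 2^i * RootCount(P_i AND mask), in the style of formula 2."""
    total = 0
    for i in range(nbits):
        slice_i = [(v >> i) & 1 for v in values]
        total += (1 << i) * sum(b & m for b, m in zip(slice_i, mask))
    return total


def frequent_values(column):
    """Values whose total # Product reaches the threshold, with their masks."""
    masks = {v: value_mask(column, v) for v in dict.fromkeys(column)}
    return {v: m for v, m in masks.items()
            if masked_sum(PRODUCT, m) >= THRESHOLD}


loc_masks = frequent_values(LOC)     # Step One: prune locations
type_masks = frequent_values(TYPE)   # Step Two: prune product types

# Candidate pairs are formed only from surviving values, by ANDing masks.
pair_sums = {}
for loc, loc_mask in loc_masks.items():
    for typ, type_mask in type_masks.items():
        pair_mask = [a & b for a, b in zip(loc_mask, type_mask)]
        pair_sums[(loc, typ)] = masked_sum(PRODUCT, pair_mask)

answer = {pair: s for pair, s in pair_sums.items() if s >= THRESHOLD}
print(sorted(loc_masks), sorted(type_masks), pair_sums, answer)
```

Only two candidate pairs are ever evaluated here, instead of the six Loc and Type combinations present in the data, which is the benefit of pruning on single attributes first.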
5.3. Step Three

We only generate candidate Loc and Type pairs for the local stores and product types that pass the threshold given before. By performing the And operation on PNY and PNotebook, we obtain the value P-tree PNY AND Notebook, as shown in Figure 7. We calculate the total number of notebooks sold in New York by formula 3.

[Figure 7. Procedure of Calculating PNY AND Notebook: PNY AND PNotebook = PNY AND Notebook.]

    Sum (# Product | New York AND Notebook)
      = 2^3 × RootCount (P4,3 AND PNY AND Notebook) + 2^2 × RootCount (P4,2 AND PNY AND Notebook)
      + 2^1 × RootCount (P4,1 AND PNY AND Notebook) + 2^0 × RootCount (P4,0 AND PNY AND Notebook)
      = 8 × 1 + 4 × 1 + 2 × 2 + 1 × 1 = 17        (3)

By performing the And operation on PMN and PNotebook, we obtain the value P-tree PMN AND Notebook, as shown in Figure 8. We calculate the total number of notebooks sold in Minneapolis by formula 4.

[Figure 8. Procedure of Calculating PMN AND Notebook: PMN AND PNotebook = PMN AND Notebook.]

    Sum (# Product | Minneapolis AND Notebook)
      = 2^3 × RootCount (P4,3 AND PMN AND Notebook) + 2^2 × RootCount (P4,2 AND PMN AND Notebook)
      + 2^1 × RootCount (P4,1 AND PMN AND Notebook) + 2^0 × RootCount (P4,0 AND PMN AND Notebook)
      = 8 × 1 + 4 × 0 + 2 × 1 + 1 × 1 = 11        (4)

Finally, we obtain the summary table, Table 5. According to the threshold T = 15 set before, we can see that only the value P-tree PNY AND Notebook satisfies the threshold. We get tuples 1 and 4 as the results for this iceberg query example.

    Loc and Type Values         Sum (# Product)  Threshold
    New York And Notebook       17               Y
    Minneapolis And Notebook    11               N

Table 5. Summary Table of Our Example.

All the value P-trees generated during the procedure, along with their root counts, are stored together with the basic P-trees. In the future, when we answer another iceberg query that requires the same value P-tree, we will not need to repeat the logic operations to obtain it. We only need to read the desired value P-trees from disk directly, which decreases the response time steadily as the system keeps working.

6. PERFORMANCE ANALYSIS

In this section, we report our performance analysis on computing aggregate functions and iceberg queries. All experiments are implemented in the C++ language on a 1GHz Pentium PC with 1GB of main memory running Red Hat Linux. The test data includes aerial TIFF images, moisture maps, nitrate maps, and yield maps of the Oakes Irrigation Test Area in North Dakota. The data sets are available in [11].

As shown in [8], the performance of horizontal databases for computing aggregate functions and iceberg queries is much weaker than that of the bitmap index [13], which is a set of vectors: one vector of bits per distinct attribute value; each bit of the vector is mapped to a tuple; the associated bit is set if and only if the tuple’s value fulfills the property in focus, typically
that the value occurs in that tuple. The advantages of basic P-trees are that, first, there are fewer of them by a log2 factor; second, it is easy to combine them into a single-attribute bitmap as in [13] and [14]; and third, the same simple AND computation provides multi-attribute bitmaps, which are the typical need in data mining. Basic P-trees also provide scalable computations of total variation, neighborhood masks, aggregations, and range queries without file scans [4]. These advantages make basic P-trees superior in speed and accuracy to bitmap indexes. Therefore, we concentrate on the performance comparison between bitmap indexes and basic P-trees.

In this comparison, we give the bitmap approach every possible advantage, to the point of being very unfair to the P-tree approach. First, it is not really feasible to bitmap every numeric attribute (over which aggregation could be requested) of every file. Even in the data sets used in this study, that would mean thousands of separate bitmaps. Second, when the query involves a composite attribute, it is totally infeasible to prepare bitmaps ahead of time to accelerate the query. We concentrate on the “worst case” for P-trees and the “best case” for bitmaps, namely, single numeric attribute aggregation queries in which it is assumed (albeit unreasonably) that the full set of bitmaps has been maintained for all values involved. Still we find that the P-tree approach proves to be substantially faster, as shown in the following section.

At the end of the next section we also give one example where the competition is set more realistically, namely an iceberg query involving two attributes, not just one. We still assume that there are bitmaps for both attributes, which is still quite a radical assumption favoring the bitmap approach. In this case P-trees prove to be a great advantage.

In summary, we have done everything we can to give the bitmap approach an advantage over the P-tree approach in this performance work. Still, P-trees are shown to be a superior approach in every case, and in the most realistic case, the multi-attribute iceberg query, P-trees are shown to be far superior. Therefore, we believe this study confirms that the P-tree approach is definitely superior to the bitmap approach for aggregation of all types.

Finally, it might be claimed that we have picked particular data sets where this conclusion holds, and that it might not hold on other data sets. To address that issue, we have chosen many related, real-life numeric data sets and combined them. We are evaluating the full complement of aggregations over many numeric attributes. Aggregation speed, by its very nature, does not depend upon the distribution of the data values or any other characteristic of the data. The very same program execution thread is performed regardless of the specific distribution, correlation, or any other statistical characteristic of the data values. For example, in calculating the sum over an attribute, the same program thread is used and the same result is obtained regardless of the nature and arrangement of the specific data values. Therefore, we feel that we have given a very fair and unbiased assessment of the P-tree aggregation method relative to other methods by choosing a fairly large sampling of real-life numeric attributes for a wide range of data set sizes.

6.1. Computing Aggregate Functions

Figure 9 shows the runtime of the Sum aggregate for the two methods, basic P-trees and the bitmap index, with respect to the number of tuples in the file.

[Figure 9. Sum Aggregate Performance Time Comparison. Runtime (seconds) versus number of tuples (k) for P-tree and Bitmap Index.]

The counterpart SQL query of Figure 9 is below:
SELECT
N1, Sum (Y)
Figure 12 shows the runtime of Max aggregate
FROM
Relation M
for the methods, basic Ptrees and Bitmap index, with
GROUPBY
N1
respect to the number of tuples.
P-tree
Figure 10 shows the runtime of Average aggregate
Runtime (Second)
for the methods, basic Ptrees and Bitmap index, with
respect to the number of tuples.
P-tree
Bitmap Index
60
40
35
30
25
20
15
10
5
0
50
Runtime (Second)
Bitmap Index
100
200
400
500
600
Number of tuples (k)
40
30
Figure 12. Max Aggregate Performance Time
Comparison.
20
10
0
100
200
400
500
The counterpart SQL query of Figure 12 is
600
Number of tuples (k)
below:
Figure 10. Average Aggregate Performance Time
Comparison.
SELECT
N1, Max (Y)
FROM
Relation M
GROUPBY
N1
The counterpart SQL query of Figure 10 is below:
SELECT
N1, Average (Y)
Figure 13 shows the runtime of Median and
FROM
Relation M
Rank-k aggregate for the methods, basic P-trees and
GROUPBY
N1
Bitmap index, with respect to the number of tuples.
Figure 11 shows the runtime of Min aggregate for
P-tree
the methods, basic Ptrees and Bitmap index, with respect
P-tree
Runtime (Second)
to the number of tuples.
Bitmap Index
Runtime (Second)
50
40
Bitmap Index
45
40
35
30
25
20
15
10
5
0
100
30
200
400
500
600
Number of tuples (k)
20
Figure 13. Median or Rank-k
Performance Time Comparison.
10
0
100
200
400
500
600
Number of tuples (k)
Figure 11. Min
Comparison.
Aggregate
Aggregate
The counterpart SQL query of Figure 13is
Performance
Time
below:
SELECT
N1, Median (Y)
The counterpart SQL query of Figure 11 is below:
FROM
Relation M
SELECT
N1, Min (Y)
GROUPBY
N1
FROM
Relation M
GROUPBY
N1
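The single-attribute grouped queries above can be illustrated with a small sketch. It is not the authors' implementation: plain Python integers stand in for uncompressed vertical bit slices rather than compressed P-trees, and the names `bit_slices`, `value_masks`, `root_count`, `masked_sum`, `masked_max`, and `group_by` are hypothetical. It only assumes that Sum can be obtained by weighting the 1-counts of each bit slice by powers of two, and that Max can be found by narrowing a row mask from the most significant bit downward.

```python
from collections import defaultdict

def bit_slices(values, width):
    """Vertical decomposition: slice i has bit r set when bit i of values[r] is 1."""
    slices = [0] * width
    for row, v in enumerate(values):
        for i in range(width):
            if (v >> i) & 1:
                slices[i] |= 1 << row
    return slices

def value_masks(column):
    """One row mask per distinct value of the grouping attribute."""
    masks = defaultdict(int)
    for row, v in enumerate(column):
        masks[v] |= 1 << row
    return dict(masks)

def root_count(mask):
    """Number of 1-bits, standing in for the root count of a P-tree."""
    return bin(mask).count("1")

def masked_sum(slices, mask):
    """Sum of the selected rows: weight each slice's 1-count by 2**i."""
    return sum((1 << i) * root_count(s & mask) for i, s in enumerate(slices))

def masked_max(slices, mask):
    """Max of the selected rows: keep rows with a 1 at each bit, highest bit first."""
    result = 0
    for i in reversed(range(len(slices))):
        narrowed = mask & slices[i]
        if narrowed:          # at least one remaining row has bit i set
            mask = narrowed
            result |= 1 << i
    return result

def group_by(n1, y, width, aggregate):
    """Evaluate aggregate(Y) for every distinct value of N1."""
    slices = bit_slices(y, width)
    return {v: aggregate(slices, m) for v, m in value_masks(n1).items()}

# Toy relation (hypothetical data): N1 is the grouping attribute, Y the measure.
n1 = ["a", "b", "a", "b", "a"]
y = [3, 7, 5, 2, 6]
print(group_by(n1, y, 3, masked_sum))  # {'a': 14, 'b': 9}
print(group_by(n1, y, 3, masked_max))  # {'a': 6, 'b': 7}
```

Only bitwise ANDs and 1-counts touch the data in this sketch, which is the property the performance comparison relies on; the original rows are never rescanned once the slices exist.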
Figure 14 shows the runtime of the Top-k aggregate for the two methods, basic P-trees and bitmap index, with respect to the number of tuples.

Figure 14. Top-k Aggregate Performance Time Comparison. (Plot: runtime in seconds versus number of tuples in thousands, for P-tree and Bitmap Index.)

The counterpart SQL query of Figure 14 is below:

SELECT N1, Top-k(Y)
FROM Relation M
GROUP BY N1

Figures 9 to 13 show that the aggregate function algorithms for both basic P-trees and bitmap indexes on a single attribute are scalable with respect to the number of tuples, but the P-tree method is somewhat quicker, especially when it deals with large datasets. The advantages of basic P-tree representations of files are that, first, there is no need for redundant, auxiliary structures, and, second, basic P-trees are good at calculating multi-attribute aggregations and are fair to all attributes. Bitmap indexes select individual attributes of a file. It is difficult for a bitmap index to deal with composite attributes that are not bitmap indexed because no one advocates “fully inverting every composite attribute with a separate bitmap index – that would require totally unacceptable space and time.” The indexes would take up an order of magnitude more space than the file itself, and any update to the original file would require changes to every one of them.

The real strong advantage of basic P-trees over bitmap indexes comes to light when multiple attributes are involved. In that situation, basic P-trees do not need to scan the original data file, while the bitmap index approach either needs to scan the data file or combine the bitmap indexes in some way.

6.2. Computing the Example of Iceberg Query

In this section, we show the real advantage of the basic P-tree representation of a data set over the use of auxiliary bitmapped indexes on that data set. In most data mining tasks (iceberg querying and others), it cannot be predicted in advance exactly which single attribute or attributes are going to be aggregated over. Therefore, it is impossible to pre-construct and maintain auxiliary bitmapped indexes on all composite attributes that might be involved in an iceberg query, and even more so for other data mining operations (e.g., classification or clustering). We demonstrate this important point by showing just one composite attribute aggregation comparison in which there is no bitmap index on the composite attribute. In our experiment below, we suppose that attribute N2 is not bitmap indexed.

Figure 15 shows the runtime of executing an iceberg query for both methods, P-tree and bitmap, with respect to the number of tuples. From Figure 15, we can see that when the number of transactions increases, the running time of both methods increases at a similar rate. Overall, the P-tree method is better than bitmap indexes due to the inner optimization of P-trees. The experimental results show that the P-tree method is more scalable than the bitmap index in terms of the number of transactions and the number of values in each attribute.

Figure 15. Iceberg Query with Multi-attribute Aggregation Performance Time Comparison. (Plot: runtime in seconds versus number of tuples in thousands, for P-tree and Bitmap Index.)
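As a rough illustration of the multi-attribute case discussed in this section, the sketch below evaluates a two-attribute iceberg query by ANDing per-value row masks and summing the measure from its bit slices. It is a sketch under stated assumptions, not the authors' code: Python integers stand in for uncompressed value P-trees and basic P-trees, the helper names (`bit_slices`, `value_masks`, `iceberg_sum`) and the toy data are hypothetical, and the threshold is simply a parameter.

```python
from collections import defaultdict

def bit_slices(values, width):
    """Slice i has bit r set when bit i of values[r] is 1."""
    slices = [0] * width
    for row, v in enumerate(values):
        for i in range(width):
            if (v >> i) & 1:
                slices[i] |= 1 << row
    return slices

def value_masks(column):
    """One row mask per distinct attribute value (a stand-in for value P-trees)."""
    masks = defaultdict(int)
    for row, v in enumerate(column):
        masks[v] |= 1 << row
    return dict(masks)

def iceberg_sum(n1, n2, y, width, threshold):
    """Return {(v1, v2): Sum(Y)} for the groups whose sum reaches the threshold."""
    slices = bit_slices(y, width)
    masks1, masks2 = value_masks(n1), value_masks(n2)
    result = {}
    for v1, m1 in masks1.items():
        for v2, m2 in masks2.items():
            both = m1 & m2                 # rows where N1 = v1 AND N2 = v2
            if both == 0:
                continue                   # combination does not occur
            total = sum((1 << i) * bin(s & both).count("1")
                        for i, s in enumerate(slices))
            if total >= threshold:         # keep only above-threshold groups
                result[(v1, v2)] = total
    return result

# Toy relation (hypothetical data).
n1 = ["x", "x", "y", "y", "x"]
n2 = [1, 1, 1, 2, 2]
y = [9, 8, 4, 15, 3]
print(iceberg_sum(n1, n2, y, 4, 15))  # {('x', 1): 17, ('y', 2): 15}
```

No structure is built for the composite attribute itself in this sketch: the composite mask is derived on demand from the single-attribute masks, which is why the choice of grouping attributes does not need to be known in advance.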
The counterpart SQL query of Figure 15 is below:

SELECT N1, N2, Sum (Y)
FROM Relation M
GROUP BY N1, N2
HAVING Sum (Y) >= 15

The real advantage of basic P-trees comes with respect to composite queries and the fact that basic P-tree methods are “fair” to all composite attributes, which means that no matter what the composite aggregation attributes might be, the attribute or attributes have approximately the same advantage. Bitmap indexes, by contrast, make a very selective choice between attributes which will have accelerated aggregation and attributes which will not (usually including all composites).

Conclusion

In the first part of this paper, we present algorithms to implement various aggregate functions using basic P-trees. P-trees are lossless, vertical bitwise data structures, which are developed to facilitate data compression and processing. Although our algorithms are designed for different functions, they are all made up of two major operations: logic operations of P-trees and the RootCount function of P-trees. Both the logic operations and the RootCount function are single instruction loops, which computers can execute quickly. Those characteristics make our algorithms superior in speed and accuracy to many other methods.

We use an example to demonstrate how an iceberg query on multiple attributes can be implemented using basic P-trees. We believe that our method makes the calculation of a multi-attribute Group By as efficient as possible. Traditional methods need to pre-calculate all the aggregates or all the bitmap indexes in a relation, which not only loses flexibility in data aggregation but also wastes memory space and time, especially when there are too many aggregates, attributes, and combinations of attributes. Our method is especially flexible in the following way: we can either generate the most frequently used aggregates and value P-trees beforehand or calculate the aggregates and value P-trees in real time. In many cases, during the procedure of pre-calculating aggregates and value P-trees, we will get many useful value P-trees and their root counts, which can be used to calculate aggregates that are required in real time later.

Reference

[1] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, pages 29-53, 1997.
[2] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB Conf., pages 299-310, 1998.
[3] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD, pages 359-370, 1999.
[4] W. Perrizo. Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1, 2001.
[5] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules Between Sets of Items in Large Databases. ACM SIGMOD Conf. Management of Data, pages 207-216, 1993.
[6] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB, pages 487-499, 1994.
[7] J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. SIGMOD, pages 1-12, 2001.
[8] B. Wang, F. Pan, D. Ren, Y. Cui, Q. Ding, and W. Perrizo. Efficient OLAP Operations for Spatial Data Using Peano Trees. DMKD, pages 28-34, 2003.
[9] B. Wang, F. Pan, Y. Cui, and W. Perrizo. Efficient Quantitative Frequent Pattern Mining Using Predicate Trees. Int. Journal of Computers and Their Applications, 2006.
[10] M. Khan, Q. Ding, and W. Perrizo. k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees. PAKDD, pages 517-528, 2002.
[11] Q. Ding, Q. Ding, and W. Perrizo. Association Rule Mining on Remotely Sensed Images Using P-trees. PAKDD, pages 66-79, 2002.
[12] TIFF image data sets. Available at http://midas-10cs.ndsu.nodak.edu/data/images/.
[13] P. O’Neil and D. Quass. Improved query performance
with variant indexes. SIGMOD, pages 38-49, 1997.
[14] P. O'Neil, Informix and Indexing Support for Data
Warehouses, Database and Programming Design, 1997.