Aggregate Function Computation and Iceberg Querying in Vertical Database

Yue Cui
Computer Science Department
North Dakota State University
Fargo, ND, 58102, U.S.A.
yue.cui@ndsu.edu

William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND, 58102, U.S.A.
william.perrizo@ndsu.edu

ABSTRACT

Aggregate function computation and iceberg querying are important and common in many applications of data mining and data warehousing because people are usually interested in looking for anomalies or unusual patterns by computing aggregate functions across many attributes or finding aggregate values above some specified thresholds. In this paper, we present efficient algorithms to compute aggregate functions and evaluate iceberg queries using P-trees, an innovative data structure for vertical databases. Our methods do well for most of the aggregate functions, such as Count, Sum, Average, Min, Max, Median, Rank, and Top-k, and in some situations they are the best of all. Our algorithms use very little memory and perform quickly because, in most cases, we pass over the data only once and use the logical operations And/Or to implement the entire task.

1. INTRODUCTION

1.1 Background

Aggregation functions [1] across many attributes are widely used in queries of data mining and data warehousing. The commonly used aggregation functions include Count, Sum, Average, Min, Max, Median, Rank, and Top-k. The commonly used queries in data mining and data warehousing are iceberg queries [2], which perform an aggregate function across attributes and then eliminate aggregate values that are below some specified threshold. Iceberg queries are so called because the number of above-threshold results is often very small (the tip of an iceberg) relative to the large amount of input data (the iceberg). Iceberg queries are also common in many other applications including information retrieval, clustering, and copy detection [2].

Fast implementation of aggregate functions and iceberg queries is always a challenging task in data warehousing, where a dataset may contain millions of records. One method in data warehousing that improves the speed of iceberg queries is to pre-compute the iceberg cube [3]. The iceberg cube stores all the above-threshold aggregate information of a dataset. When a customer initiates an iceberg query, the system, instead of computing the result from the original dataset, looks into the iceberg cube for the answer. This helps to accelerate the on-line processing speed. Efficient algorithms for computing the iceberg cube are widely studied, and a lot of research has been done in this area [3].

Recently, a new data structure, the Peano Tree (P-tree) [4], has been introduced for large datasets. The P-tree is a lossless, quadrant-based, compression data structure. It provides fast and efficient logical operations. One of the P-tree variations, the Predicate P-tree, is used to efficiently reduce data accesses by filtering out "bit holes," which consist of consecutive 0's [4].

In this paper, we investigate how to efficiently implement aggregate functions and iceberg queries using P-trees. We have two contributions: first, we develop efficient algorithms that compute the various aggregate functions by using P-trees; second, we illustrate the procedure to implement an iceberg query by using P-trees. In our algorithms and procedures, the main operations are counting the number of 1s in P-trees and performing the logic operations And/Or on P-trees. These operations can be executed quickly by hardware, so their cost is low. As you will see in our experiments section, our methods and procedure are superior in computing aggregate functions and iceberg queries in many cases.

This paper is organized as follows. In Chapter 2, we do a literature review, where we introduce the concepts of aggregate functions and iceberg queries. In Chapter 3, we review the P-tree technology. In Chapter 4, we give the algorithms of the aggregate functions using P-trees. In Chapter 5, we give an example of how to implement iceberg queries using P-trees. In Chapter 6, we describe our experiment results, where we compare our method with the bitmap index. Chapter 7 is the concluding section of this paper.

2. LITERATURE REVIEW

2.1. Aggregate Functions

Frequently, people are interested in summarizing data to determine trends or produce top-level reports. For example, the purchasing manager may not be interested in a listing of all computer hardware sales, but may simply want to know the number of Notebooks sold in a specific month. Aggregate functions can assist with the summarization of large volumes of data.

Three types of aggregate functions were identified [1]. Consider aggregating a set of tuples, $T$. Let $\{S_i \mid i = 1 \dots n\}$ be any complete set of disjoint subsets of $T$ such that $\bigcup_i S_i = T$ and $\bigcap_i S_i = \{\}$.

Distributive: An aggregate function $F$ is distributive if there is a function $G$ such that $F(T) = G(\{F(S_i) \mid i = 1 \dots n\})$. SUM, MIN, and MAX are distributive with $G = F$. Count is distributive with $G = \mathrm{SUM}$.

Algebraic: An aggregate function $F$ is algebraic if there is an M-tuple valued function $G$ and a function $H$ such that $F(T) = H(\{G(S_i) \mid i = 1 \dots n\})$. Average, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic. For Average, the function $G$ records the sum and count. The $H$ function adds these two components and then divides to produce the global average. Similar techniques apply to finding the N largest values, the center of mass of a group of objects, and other algebraic functions. The key to algebraic functions is that a fixed size result (an M-tuple) can summarize the sub-aggregation.

Holistic: An aggregate function $F$ is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. That is, there is no constant $M$ such that an M-tuple characterizes the computation of $F$. Median, MostFrequent (also called the Mode), and Rank are common examples of holistic functions.

Efficient computation of all these aggregate functions is required in most large database applications. There are many efficient techniques for computing distributive and algebraic functions [2] but only a few for holistic functions such as Median. In this paper, our method to compute Median is one of the best among all the available techniques.

2.2. Iceberg Queries

Iceberg queries refer to a class of queries which compute aggregate functions across attributes to find aggregate values above some specified threshold. Given a relation R with attributes a1, a2, …, an, and m, an aggregate function AggF, and a threshold T, an iceberg query has the following form:

    SELECT R.a1, R.a2, …, R.an, AggF (R.m)
    FROM relation R
    GROUPBY R.a1, R.a2, …, R.an
    HAVING AggF (R.m) >= T

The number of tuples that satisfy the threshold in the HAVING clause is relatively small compared to the large amount of input data. The output can be seen as the "tip of the iceberg," where the input data is the "iceberg."

Suppose a purchasing manager is given a sales transaction dataset and wants to know, for every type of product in each local store, the total number of products sold, keeping only the totals that reach a certain threshold T. This question can be answered with an iceberg query of the form above that groups by Location and Product Type and keeps the groups whose Sum (# Product) is at least T.

Many of the Location and Product Type pair groups can be eliminated before any pair is aggregated. Given a Location-list (the local stores whose own total passes the threshold) and a Product Type-list (the product types whose own total passes it), we generate candidate Location and Product Type pairs only for local stores and product types that are in the Location-list and the Product Type-list. This approach improves efficiency by pruning many groups beforehand.
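The pruning idea can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it works on ordinary horizontal tuples rather than P-trees, the helper names `group_sums` and `iceberg_sum` are hypothetical, and the sample rows mirror the Sales relation that the paper uses in its later example. It assumes the measure is non-negative, so a location or product type whose own total is below T cannot belong to a pair that passes.

```python
from collections import defaultdict

# Sample rows (location, product_type, n_products), mirroring the paper's
# Sales relation; the tuple layout itself is an assumption of this sketch.
SALES = [
    ("New York", "Notebook", 10),
    ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6),
    ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11),
    ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]


def group_sums(rows, key):
    """Return SUM(n_products) per group, where key(row) names the group."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


def iceberg_sum(rows, threshold):
    """Return {(location, type): total} for pairs with total >= threshold."""
    # Single-attribute passes build the Location-list and Product Type-list.
    loc_list = {k for k, s in group_sums(rows, lambda r: r[0]).items()
                if s >= threshold}
    type_list = {k for k, s in group_sums(rows, lambda r: r[1]).items()
                 if s >= threshold}
    # Only rows whose location and type both survived can form a candidate pair.
    candidates = [r for r in rows if r[0] in loc_list and r[1] in type_list]
    pair_sums = group_sums(candidates, lambda r: (r[0], r[1]))
    return {pair: s for pair, s in pair_sums.items() if s >= threshold}


if __name__ == "__main__":
    print(iceberg_sum(SALES, 15))  # {('New York', 'Notebook'): 17}
```

With T = 15 only New York and Minneapolis survive as locations and only Notebook survives as a product type, so two candidate pairs are aggregated instead of all seven location and type combinations present in the data.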
To implement an iceberg query such as

    SELECT Location, Product Type, Sum (# Product)
    FROM Relation Sales
    GROUPBY Location, Product Type
    HAVING Sum (# Product) >= T

a common strategy in a horizontal database is first to apply hashing or sorting to all the data in the dataset, then to count all of the Location and Product Type pair groups, and finally to eliminate those groups which do not pass the threshold T. But these algorithms can generate significant I/O for intermediate results and require large amounts of main memory. They leave much room for improvement in efficiency. One method is to prune groups using the Apriori-like [5] [6] method. But the Apriori-like method is not always simple to use for all the aggregate functions. For instance, the Apriori-like method is efficient in calculating Sum but has little use in implementing Average [4]. In our example, instead of counting the number of tuples in every Location and Product Type pair group at the first step, we can do the following:

Generate Location-list: a list of local stores which sell at least T products. For example,

    SELECT Location, Sum (# Product)
    FROM Relation Sales
    GROUPBY Location
    HAVING Sum (# Product) >= T

Generate Product Type-list: a list of categories which sell at least T products. For example,

    SELECT Product Type, Sum (# Product)
    FROM Relation Sales
    GROUPBY Product Type
    HAVING Sum (# Product) >= T

Then, by performing the And operation on value P-trees, we can calculate the results easily. We will illustrate our method in detail by example in Chapter 5.

3. REVIEW OF P-TREES

A Peano tree (P-tree) is a lossless, bitwise tree. A P-tree can be 1-dimensional, 2-dimensional, 3-dimensional, etc. In this section, we focus on 1-dimensional P-trees. We give a brief review of the structure and operations of P-trees, and introduce two variations of predicate P-trees which will be used in our iceberg query algorithm.

3.1 Structure of P-trees

Given a data set with d attributes, $X = (A_1, A_2, \dots, A_d)$, and the binary representation of the j-th attribute $A_j$ as $b_{j,m} b_{j,m-1} \dots b_{j,i} \dots b_{j,1} b_{j,0}$, we decompose each attribute into bit files, one file for each bit position [1]. To build a P-tree, a bit file is recursively partitioned into halves and each half into sub-halves until the sub-half is pure (entirely 1-bits or entirely 0-bits).

The detailed construction of P-trees is illustrated by an example in Figure 1. For simplicity, assume each transaction has one attribute. The transaction set is shown in a). We represent the attribute as binary values, e.g., $(7)_{10} = (111)_2$. Then we vertically decompose them into three separate bit files, one file for each bit, as shown in b). The corresponding basic P-trees, $P_1$, $P_2$ and $P_3$, are constructed by recursive partition and are shown in c), d) and e). As shown in c) of Figure 1, the root of the $P_1$ tree is 3, which is the 1-bit count of the entire bit file. The second level of $P_1$ contains the 1-bit counts of the two halves, 0 and 3. Since the first half is pure, there is no need to partition it. The second half is further partitioned recursively.

[Figure 1. Construction of 1-D P-trees. a) Transaction set: 010, 011, 010, 010, 101, 010, 111, 111. b) 3 bit files. c) $P_1$ (root count 3). d) $P_2$ (root count 7). e) $P_3$ (root count 4).]

3.2 P-tree Operations

AND, OR, and NOT logic operations are the most frequently used P-tree operations. For efficient implementation, we use a variation of P-trees, called Pure-1 trees (P1-trees). A tree is pure-1 if all the values in the sub-tree are 1's. A node in a P1-tree is a 1-bit if and only if that half is pure-1. Figure 2 shows the P1-trees corresponding to the P-trees in c), d), and e) of Figure 1.

[Figure 2. P1-trees for the Transaction Set. a) $P_{1,1}$. b) $P_{1,2}$. c) $P_{1,3}$.]

The P-tree logic operations are performed level-by-level starting from the root level. They are commutative and distributive, since they are simply pruned bit-by-bit operations. For instance, ANDing a pure-0 node with anything results in a pure-0 node, and ORing a pure-1 node with anything results in a pure-1 node. In Figure 3, a) is the ANDing result of $P_{1,1}$ and $P_{1,2}$, b) is the ORing result of $P_{1,1}$ and $P_{1,3}$, and c) is the result of NOT $P_{1,3}$ (or $P_{1,3}'$), where $P_{1,1}$, $P_{1,2}$ and $P_{1,3}$ are shown in Figure 2.

[Figure 3. AND, OR and NOT Operations. a) $P_{1,1} \wedge P_{1,2}$. b) $P_{1,1} \vee P_{1,3}$. c) $P_{1,3}'$.]

3.3. Predicate P-trees

There are many variations of predicate P-trees, such as value P-trees, tuple P-trees, inequity P-trees [8][9][10][11], etc. In this section, we describe value P-trees and inequity P-trees, which are used in evaluating range predicates.

3.3.1. Value P-trees

A value P-tree represents a data set X related to a specified value v, denoted by $P_{x=v}$, where $x \in X$. Let $v = b_m b_{m-1} \dots b_0$, where $b_i$ is the i-th binary bit value of v. There are two steps to calculate $P_{x=v}$. 1) Get the bit P-tree $P_{b,i}$ for each bit position of v according to the bit value: if $b_i = 1$, $P_{b,i} = P_i$; otherwise $P_{b,i} = P_i'$. 2) Calculate $P_{x=v}$ by ANDing all the bit P-trees of v, i.e. $P_{x=v} = P_{b,m} \wedge P_{b,m-1} \wedge \dots \wedge P_{b,0}$. Here, $\wedge$ means the AND operation. For example, to get a value P-tree satisfying x = 101 from the three P-trees of Figure 1, we have $P_{x=101} = P_{b,3} \wedge P_{b,2} \wedge P_{b,1} = P_3 \wedge P_2' \wedge P_1$.

3.3.2. Inequity P-trees

An inequity P-tree represents data points within a data set X satisfying an inequity predicate, such as $x > v$, $x \ge v$, $x < v$, and $x \le v$. Without loss of generality, we discuss two inequity P-trees, for $x \ge v$ and $x \le v$, denoted by $P_{x \ge v}$ and $P_{x \le v}$, where $x \in X$ and v is a specified value. The calculation of $P_{x \ge v}$ and $P_{x \le v}$ is as follows.

Calculation of $P_{x \ge v}$: Let x be a data item within a data set X, x be m-bit data, and $P_m, P_{m-1}, \dots, P_0$ be the P-trees for the vertical bit files of X. Let $v = b_m \dots b_i \dots b_0$, where $b_i$ is the i-th binary bit value of v, and $P_{x \ge v}$ be the predicate tree for the predicate $x \ge v$. Then $P_{x \ge v} = P_m\ op_m \dots P_i\ op_i\ P_{i-1} \dots op_1\ P_0$, $i = 0, 1, \dots, m$, where 1) $op_i$ is $\wedge$ if $b_i = 1$ and $op_i$ is $\vee$ otherwise, and 2) the operators are right binding. Here, $\wedge$ means the AND operation, $\vee$ means the OR operation, and right binding means operators are associated from right to left, e.g., $P_2\ op_2\ P_1\ op_1\ P_0$ is equivalent to $(P_2\ op_2\ (P_1\ op_1\ P_0))$. For example, the inequity tree $P_{x \ge 101} = (P_2 \wedge (P_1 \vee P_0))$.

Calculation of $P_{x \le v}$: The calculation of $P_{x \le v}$ is similar to the calculation of $P_{x \ge v}$. Let x be a data item within a data set X, x be m-bit data, and $P_m', P_{m-1}', \dots, P_0'$ be the complement P-trees for the vertical bit files of X. Let $v = b_m \dots b_i \dots b_0$, where $b_i$ is the i-th binary bit value of v, and $P_{x \le v}$ be the predicate tree for the predicate $x \le v$. Then $P_{x \le v} = P_m'\ op_m \dots P_i'\ op_i\ P_{i-1}' \dots op_{k+1}\ P_k'$, $k \le i \le m$, where 1) $op_i$ is $\wedge$ if $b_i = 0$ and $op_i$ is $\vee$ otherwise, 2) k is the rightmost bit position with value "0," i.e., $b_k = 0$ and $b_j = 1$ for all $j < k$, and 3) the operators are right binding. For example, the inequity tree $P_{x \le 101} = (P_2' \vee P_1')$.

We will frequently use value P-trees and inequity P-trees in our calculation of aggregation functions and iceberg queries.

4. AGGREGATE FUNCTION COMPUTATION USING P-TREES

In this section, we give algorithms which show how to use P-trees to compute various aggregation functions. We illustrate some of these algorithms by examples. For simplicity, the examples are computed on a single attribute. As you will see, our algorithms can be easily extended to evaluate aggregate functions over multiple attributes.

First, we give our example dataset in Table 1. Second, we illustrate the procedure to convert the sample dataset into P-tree bit files in Table 2. Suppose we have a relation S with information about every transaction of a company. First we transform relation S into binary representation. For numerical attributes, this step is simple: we just need to change the decimal values into binary numbers. For categorical attributes, there are two steps: first, we translate categorical values into integers; second, we convert the integers into binary numbers. By changing categorical attribute values into integers, we save a lot of memory space and make the processing procedure much easier at the same time. We do not need to process string values anymore in our algorithms.

    Id  Mon  Loc          Type      On line  # Product
    1   Jan  New York     Notebook  Y        10
    2   Jan  Minneapolis  Desktop   N        5
    3   Feb  New York     Printer   Y        6
    4   Mar  New York     Notebook  Y        7
    5   Mar  Minneapolis  Notebook  Y        11
    6   Mar  Chicago      Desktop   Y        9
    7   Apr  Minneapolis  Fax       N        3

Table 1. Relation Sales.

    Id  Mon            Loc            Type           On line  # Product
        (P0,3 … P0,0)  (P1,4 … P1,0)  (P2,2 … P2,0)  (P3,0)   (P4,3 … P4,0)
    1   0001           00001          001            1        1010
    2   0001           00101          010            0        0101
    3   0010           00001          100            1        0110
    4   0011           00001          001            1        0111
    5   0011           00101          001            1        1011
    6   0011           00110          010            1        1001
    7   0100           00101          101            0        0011

Table 2. Binary Form of Sales.

Next we vertically decompose the binary transaction table into bit files: one file for each bit position. In relation S, there are 17 bit files in total. Then we build 17 basic P-trees for relation S. The detailed decomposition process is discussed in Chapter 3. For convenience, we only use uncompressed P-trees to illustrate the calculation process that follows.

4.1. Count Aggregate

The COUNT function is probably the simplest and most useful of all these aggregate functions. It is not necessary to write a special function for Count because the P-tree RootCount function already provides the mechanism to implement it. Given a P-tree $P_i$, RootCount($P_i$) returns the number of 1s in $P_i$. For example, if we want to know how many transactions in relation S are conducted on-line, we just need to count the number of 1s in $P_{3,0}$:

    Count (# Transaction on line) = RootCount (P3,0) = RootCount (1011110) = 5

Thus we know that 5 out of 7 transactions in our dataset are conducted on line.

4.2. Sum Aggregate

The Sum function totals a field of numerical values. We illustrate the algorithm in Algorithm 4.1.

    Evaluating sum () with P-tree.
    total = 0.00;
    For i = 0 to n {
        total = total + 2^i * RootCount (Pi);
    }
    Return total

Algorithm 4.1. Sum Aggregate.

4.3. Average Aggregate

The Average function returns the average value in a field. It can be calculated from the functions Count and Sum: Average () = Sum () / Count ().

4.4. Max Aggregate

The Max function returns the largest value in a field. For example, if we want to know the maximum number of products which were sold in one transaction in relation S, we can use Algorithm 4.2.

    Evaluating max () with P-tree.
    max = 0.00;
    c = 0;
    Pc is set all 1s
    For i = n to 0 {
        c = RootCount (Pc AND Pi);
        If (c >= 1) {
            Pc = Pc AND Pi;
            max = max + 2^i;
        }
    }
    Return max;

Algorithm 4.2. Max Aggregate.

4.5. Min Aggregate

The Min function returns the smallest value in a field. For example, if we want to know the minimum number of products which were sold in one transaction, we can use Algorithm 4.3.

    Evaluating Min () with P-tree.
    min = 0.00;
    c = 0;
    Pc is set all 1s
    For i = n to 0 {
        c = RootCount (Pc AND NOT (Pi));
        If (c >= 1)
            Pc = Pc AND NOT (Pi);
        Else
            min = min + 2^i;
    }
    Return min;

Algorithm 4.3. Min Aggregate.

4.6. Median/Rank Aggregate

The Median function returns the median value in a field. The Rank (K) function returns the value that is the k-th largest value in a field. For example, if we want to know the median of the number of products which were sold in all the transactions, we can use Algorithm 4.4. The detailed steps follow the algorithm; for our example, the median of attribute # Product is 7.
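The bit-slice procedures of Algorithms 4.1 to 4.3, together with the Rank (K) function defined in Section 4.6, can be tried out with a short Python sketch. It is an illustration under stated assumptions and not the authors' C++ code: an uncompressed P-tree is modelled as a Python integer whose bit r stands for row r, `root_count` stands in for RootCount, and all function names are hypothetical.

```python
def bit_slices(values, nbits):
    """Vertical decomposition: slices[i] has bit r set iff bit i of values[r] is 1."""
    slices = [0] * nbits
    for row, value in enumerate(values):
        for i in range(nbits):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


def root_count(mask):
    """Number of 1 bits in an uncompressed bit vector (stand-in for RootCount)."""
    return bin(mask).count("1")


def p_sum(slices):
    """Sum = sum over bit positions of 2^i * RootCount(P_i)."""
    return sum((1 << i) * root_count(p) for i, p in enumerate(slices))


def p_max(slices, n_rows):
    pc = (1 << n_rows) - 1              # every row starts as a candidate
    result = 0
    for i in reversed(range(len(slices))):
        hit = pc & slices[i]
        if root_count(hit) >= 1:        # some candidate has a 1 here: keep those
            pc = hit
            result += 1 << i
    return result


def p_min(slices, n_rows):
    pc = (1 << n_rows) - 1
    result = 0
    for i in reversed(range(len(slices))):
        hit = pc & ~slices[i]
        if root_count(hit) >= 1:        # some candidate has a 0 here: keep those
            pc = hit
        else:                           # every candidate has a 1 at this bit
            result += 1 << i
    return result


def p_rank(slices, n_rows, k):
    """k-th largest value; k = 1 gives the maximum."""
    pc = (1 << n_rows) - 1
    pos, result = k, 0
    for i in reversed(range(len(slices))):
        hit = pc & slices[i]
        c = root_count(hit)
        if c >= pos:                    # answer lies among rows with a 1 at bit i
            result += 1 << i
            pc = hit
        else:                           # skip those c rows, continue among the rest
            pos -= c
            pc &= ~slices[i]
    return result


if __name__ == "__main__":
    products = [10, 5, 6, 7, 11, 9, 3]  # the # Product column of Table 1
    s, n = bit_slices(products, 4), len(products)
    print(p_sum(s), p_max(s, n), p_min(s, n), p_rank(s, n, 4))  # 51 11 3 7
```

Each function makes a single pass over the bit slices and uses only AND, NOT and a 1-bit count, which is the property the algorithms above rely on. Average follows as `p_sum(s) / n`.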
    Evaluating Median () with P-tree
    median = 0.00;
    pos = N/2;            (for Rank, pos = K)
    c = 0;
    Pc is set all 1s for single attribute
    For i = n to 0 {
        c = RootCount (Pc AND Pi);
        If (c >= pos) {
            median = median + 2^i;
            Pc = Pc AND Pi;
        } Else {
            pos = pos - c;
            Pc = Pc AND NOT (Pi);
        }
    }
    Return median;

Algorithm 4.4. Median/Rank(K) Aggregates.

Step 1: We decide the position of the median in the dataset: pos = 4. Pc = (1111111), median = 0.

Step 2 (i = 3): RootCount (Pc AND P4,3) = 3. Because c < pos (3 < 4), pos = 4 – 3 = 1, Pc = Pc AND NOT (P4,3) = (0111001), and median = 0.

Step 3 (i = 2): RootCount (Pc AND P4,2) = 3. Because c >= pos (3 > 1), pos = 1, Pc = Pc AND P4,2 = (0111000), and median = 0 + 2^2 = 4.

Step 4 (i = 1): RootCount (Pc AND P4,1) = 2. Because c >= pos (2 > 1), pos = 1, Pc = Pc AND P4,1 = (0011000), and median = 4 + 2^1 = 6.

Step 5 (i = 0): RootCount (Pc AND P4,0) = 1. Because c >= pos (1 = 1), pos = 1, Pc = Pc AND P4,0 = (0001000), and median = 6 + 2^0 = 7.

At i = -1 the program stops.

If we want to know the 5th largest number of products sold in all the transactions, we just need to change the initial value of pos from 4 to 5. Then, with the same procedure, we get the 5th largest number in attribute # Product.

4.7. Top-k Aggregate

The Top-k (K) function is very useful in various algorithms of clustering and classification. In order to get the largest k values in a field, first, we find the rank k value Vk using the function Rank (K). Second, we find all the tuples whose values are greater than or equal to Vk. In Chapter 3, we illustrated how to calculate range predicates using inequity P-trees; we use the same method here to find all the values which are greater than or equal to Vk.

5. ICEBERG QUERY OPERATION USING P-TREES

Besides the computation of aggregate functions, another important part of iceberg queries is the implementation of the Group By operation. By using value P-trees, we can make the calculation of Group By efficient. With efficient algorithms to compute aggregate functions and to implement the Group By operation, an iceberg query can be executed quickly by using P-trees. We give an example to illustrate how to implement an iceberg query using P-trees.

We have described the iceberg query algorithm in Chapter 2. We demonstrate the procedure with the following example:

    SELECT Loc, Type, Sum (# Product)
    FROM Relation S
    GROUPBY Loc, Type
    HAVING Sum (# Product) >= 15

5.1. Step One

We build value P-trees for every value of attribute Loc. Attribute Loc has three values {Loc | New York, Minneapolis, and Chicago}. Their counterpart value P-trees can be seen in Figure 4.

    Id  PNY  PMN  PCH
    1   1    0    0
    2   0    1    0
    3   1    0    0
    4   1    0    0
    5   0    1    0
    6   0    0    1
    7   0    1    0

Figure 4. Value P-trees of Attribute Loc.

Since we have described how to calculate a value P-tree from basic P-trees in Chapter 3, we only illustrate the procedure of generating value P-tree $P_{NY}$ as an example. In Table 2, we can see that the binary value of New York is 00001. We process the bits one by one from left to right. If the bit is 0 in the current bit position, we use the complement (primed) P-tree of this bit in our formula. If the bit is 1, we use the basic P-tree of this bit in our formula. Finally we obtain formula 1.

$P_{NY} = P_{1,4}'\ \mathrm{AND}\ P_{1,3}'\ \mathrm{AND}\ P_{1,2}'\ \mathrm{AND}\ P_{1,1}'\ \mathrm{AND}\ P_{1,0}$   (1)

The total number of products sold in New York is then obtained from the bit P-trees of attribute # Product:

$\mathrm{Sum}(\#\,\mathrm{Product} \mid \mathrm{New\ York}) = 2^3 \times \mathrm{RootCount}(P_{4,3}\ \mathrm{AND}\ P_{NY}) + 2^2 \times \mathrm{RootCount}(P_{4,2}\ \mathrm{AND}\ P_{NY}) + 2^1 \times \mathrm{RootCount}(P_{4,1}\ \mathrm{AND}\ P_{NY}) + 2^0 \times \mathrm{RootCount}(P_{4,0}\ \mathrm{AND}\ P_{NY}) = 8 \times 1 + 4 \times 2 + 2 \times 3 + 1 \times 1 = 23$   (2)

Table 3 shows the total number of products sold in each of the three locations. Since our threshold is 15, and by the Apriori principle a group that cannot pass the threshold cannot contain any subgroup that passes it, we eliminate the city Chicago.

    Loc Values   Sum (# Product)  Threshold
    New York     23               Y
    Minneapolis  19               Y
    Chicago      9                N

Table 3. Summary Table of Attribute Loc.

5.2. Step Two

Similarly we build value P-trees for every value of attribute Type. Attribute Type has four values {Type | Notebook, Desktop, Printer, Fax}.
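The value P-tree evaluation of the example query can be sketched as follows. This is a minimal sketch, assuming uncompressed bit vectors stored as Python integers (bit r stands for row r, so the printed bit order is the reverse of the paper's left-to-right listing); the binary codes are copied from Table 2, while `value_mask`, `masked_sum` and `iceberg` are hypothetical names introduced only for this illustration.

```python
def bit_slices(values, nbits):
    """slices[i] has bit r set iff bit i of values[r] is 1."""
    slices = [0] * nbits
    for row, value in enumerate(values):
        for i in range(nbits):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


def root_count(mask):
    return bin(mask).count("1")


def value_mask(slices, value, n_rows):
    """Value P-tree P_{x=value}: AND of each bit slice or its complement."""
    full = (1 << n_rows) - 1
    mask = full
    for i, p in enumerate(slices):
        mask &= p if (value >> i) & 1 else (~p & full)
    return mask


def masked_sum(measure_slices, mask):
    """Sum of the measure over the rows selected by mask."""
    return sum((1 << i) * root_count(p & mask)
               for i, p in enumerate(measure_slices))


# Codes and counts from Table 2 (rows 1 to 7).
LOC = [0b00001, 0b00101, 0b00001, 0b00001, 0b00101, 0b00110, 0b00101]
TYPE = [0b001, 0b010, 0b100, 0b001, 0b001, 0b010, 0b101]
PRODUCT = [10, 5, 6, 7, 11, 9, 3]
LOC_NAMES = {0b00001: "New York", 0b00101: "Minneapolis", 0b00110: "Chicago"}
TYPE_NAMES = {0b001: "Notebook", 0b010: "Desktop", 0b100: "Printer", 0b101: "Fax"}


def iceberg(threshold):
    n = len(PRODUCT)
    loc_s, type_s = bit_slices(LOC, 5), bit_slices(TYPE, 3)
    prod_s = bit_slices(PRODUCT, 4)
    # Steps one and two: one value mask per attribute value, pruned by threshold.
    loc_ok = {}
    for code, name in LOC_NAMES.items():
        mask = value_mask(loc_s, code, n)
        if masked_sum(prod_s, mask) >= threshold:
            loc_ok[name] = mask
    type_ok = {}
    for code, name in TYPE_NAMES.items():
        mask = value_mask(type_s, code, n)
        if masked_sum(prod_s, mask) >= threshold:
            type_ok[name] = mask
    # Step three: AND the surviving masks and aggregate each candidate pair.
    result = {}
    for loc, lmask in loc_ok.items():
        for typ, tmask in type_ok.items():
            total = masked_sum(prod_s, lmask & tmask)
            if total >= threshold:
                result[(loc, typ)] = total
    return result


if __name__ == "__main__":
    print(iceberg(15))  # {('New York', 'Notebook'): 17}
```

Group By is never materialised here: each group is just the AND of two value masks, and its sum is read off the # Product bit slices with masked 1-bit counts.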
Figure 5 illustrates the calculation procedure of value P-tree $P_{NY}$ in Step One. After getting all the value P-trees for each location, we calculate the total number of products sold in each place. We still use the value New York as our example: formula 2 shows how we obtain the total number of products sold in New York.

    Id  P1,4 P1,3 P1,2 P1,1 P1,0   P'1,4 P'1,3 P'1,2 P'1,1 P1,0   PNY
    1   0    0    0    0    1      1     1     1     1     1      1
    2   0    0    1    0    1      1     1     0     1     1      0
    3   0    0    0    0    1      1     1     1     1     1      1
    4   0    0    0    0    1      1     1     1     1     1      1
    5   0    0    1    0    1      1     1     0     1     1      0
    6   0    0    1    1    0      1     1     0     0     0      0
    7   0    0    1    0    1      1     1     0     1     1      0

Figure 5. Procedure of Calculating PNY.

    Id  PNotebook  PDesktop  PPrinter  PFax
    1   1          0         0         0
    2   0          1         0         0
    3   0          0         1         0
    4   1          0         0         0
    5   1          0         0         0
    6   0          1         0         0
    7   0          0         0         1

Figure 6. Value P-trees of Attribute Type.

Figure 6 shows the value P-trees of the four values of attribute Type. Similarly we get the summary table for each value of attribute Type in Table 4. According to the threshold, only the value P-tree of Notebook will be used in the following step.

    Type Values  Sum (# Product)  Threshold
    Notebook     28               Y
    Desktop      14               N
    Printer      6                N
    Fax          3                N

Table 4. Summary Table of Attribute Type.

5.3. Step Three

We only generate candidate Loc and Type pairs for the local stores and product types which pass the threshold given before. By performing the And operation on $P_{NY}$ with $P_{Notebook}$, we obtain value P-tree $P_{NY\ \mathrm{AND}\ Notebook}$ as shown in Figure 7. We calculate the total number of notebooks sold in New York by formula 3.

    Id  PNY  PNotebook  PNY AND Notebook
    1   1    1          1
    2   0    0          0
    3   1    0          0
    4   1    1          1
    5   0    1          0
    6   0    0          0
    7   0    0          0

Figure 7. Procedure of Calculating PNY AND Notebook.

$\mathrm{Sum}(\#\,\mathrm{Product} \mid \mathrm{New\ York\ AND\ Notebook}) = 2^3 \times \mathrm{RootCount}(P_{4,3}\ \mathrm{AND}\ P_{NY\ \mathrm{AND}\ Notebook}) + 2^2 \times \mathrm{RootCount}(P_{4,2}\ \mathrm{AND}\ P_{NY\ \mathrm{AND}\ Notebook}) + 2^1 \times \mathrm{RootCount}(P_{4,1}\ \mathrm{AND}\ P_{NY\ \mathrm{AND}\ Notebook}) + 2^0 \times \mathrm{RootCount}(P_{4,0}\ \mathrm{AND}\ P_{NY\ \mathrm{AND}\ Notebook}) = 8 \times 1 + 4 \times 1 + 2 \times 2 + 1 \times 1 = 17$   (3)

By performing the And operation on $P_{MN}$ with $P_{Notebook}$, we obtain value P-tree $P_{MN\ \mathrm{AND}\ Notebook}$ as shown in Figure 8. We calculate the total number of notebooks sold in Minneapolis by formula 4.

    Id  PMN  PNotebook  PMN AND Notebook
    1   0    1          0
    2   1    0          0
    3   0    0          0
    4   0    1          0
    5   1    1          1
    6   0    0          0
    7   1    0          0

Figure 8. Procedure of Calculating PMN AND Notebook.

$\mathrm{Sum}(\#\,\mathrm{Product} \mid \mathrm{Minneapolis\ AND\ Notebook}) = 2^3 \times \mathrm{RootCount}(P_{4,3}\ \mathrm{AND}\ P_{MN\ \mathrm{AND}\ Notebook}) + 2^2 \times \mathrm{RootCount}(P_{4,2}\ \mathrm{AND}\ P_{MN\ \mathrm{AND}\ Notebook}) + 2^1 \times \mathrm{RootCount}(P_{4,1}\ \mathrm{AND}\ P_{MN\ \mathrm{AND}\ Notebook}) + 2^0 \times \mathrm{RootCount}(P_{4,0}\ \mathrm{AND}\ P_{MN\ \mathrm{AND}\ Notebook}) = 8 \times 1 + 4 \times 0 + 2 \times 1 + 1 \times 1 = 11$   (4)

Finally, we obtain the summary table, Table 5. According to the threshold T = 15 we set before, we can see that only value P-tree $P_{NY\ \mathrm{AND}\ Notebook}$ satisfies our threshold. We get tuples 1 and 4 as the results for this iceberg query example.

    Loc AND Type Values       Sum (# Product)  Threshold
    New York AND Notebook     17               Y
    Minneapolis AND Notebook  11               N

Table 5. Summary Table of Our Example.

All the value P-trees generated during the procedure, along with their root counts, are stored together with the basic P-trees. In the future, when we answer another iceberg query which requires generation of the same value P-tree, we will not need to repeat the logic operations to get it. We only need to read the desired value P-trees from the disk directly, which will decrease the response time steadily as the system keeps working.

6. PERFORMANCE ANALYSIS

In this section, we report our performance analysis on computing aggregate functions and iceberg queries. All experiments are implemented in the C++ language on a 1GHz Pentium PC with 1GB main memory running Red Hat Linux. The test data includes aerial TIFF images, moisture maps, nitrate maps, and yield maps of the Oakes Irrigation Test Area in North Dakota. The data sets are available in [11].

As shown in [8], the performance of horizontal databases for computing aggregate functions and iceberg queries is much weaker than that of the bitmap index [13], which is a set of vectors: one vector of bits per distinct attribute value; each bit of the vector is mapped to a tuple; the associated bit is set if and only if the tuple's value fulfills the property in focus, typically that the value occurs in that tuple. The advantages of basic P-trees are that, first, there are fewer of them by a log2 factor; second, it is easy to combine them for a single attribute bitmap as in [13] and [14]; and, third, the same simple AND computation provides multi-attribute bitmaps, which are the typical need in data mining. Basic P-trees also serve up scalable computations of total variation, neighborhood masks, aggregations, and range queries without file scans [4]. These advantages make basic P-trees superior to bitmap indexes in speed and accuracy. Therefore, we concentrate on the performance comparison between bitmap indexes and basic P-trees.

In this comparison, we give the bitmap approach every possible advantage to compete, to the point of being very unfair to the P-tree approach. First, it is rarely feasible to bitmap every numeric attribute (over which aggregation could be requested) of every file. Even in the data sets used in this study, that would be thousands of separate bitmaps. Second, when the query involves a composite attribute, it is totally infeasible to prepare bitmaps ahead of time to accelerate the query. We concentrate on the "worst case" for P-trees and the "best case" for bitmaps, namely, single numeric attribute aggregation queries in which it is assumed (albeit unreasonably) that the full set of bitmaps has been maintained for all values involved. Still we find that the P-tree approach proves to be substantially faster, as shown in the following section.

At the end of the next section we also give one example where the competition is set more realistically, namely an iceberg query involving two attributes, not just one. We still assume that there are bitmaps for both attributes, which is still quite a radical assumption favoring the bitmap approach. In this case P-trees prove to be a great advantage.

In summary, then, we have done everything we can to give the bitmap approach an advantage over the P-tree approach in this performance work. Still P-trees are shown to be a superior approach in every case. And in the most realistic case, the multi-attribute iceberg query, P-trees are shown to be far superior. Therefore, we believe this study confirms that the P-tree approach is definitely superior to the bitmap approach for aggregation of all types.

Finally, it might be claimed that we have picked particular data sets where this conclusion holds, and that it might not hold on other data sets. To address that issue, we have chosen many related, real-life numeric data sets and combined them. We are evaluating the full complement of aggregations over many numeric attributes. Aggregation speed, by its very nature, does not depend upon the distribution of the data values or any other characteristic of the data. The very same program execution thread is performed regardless of the specific distribution, correlation, or any other statistical characteristic of the data values. For example, in calculating the sum over an attribute, the same sequence of operations is executed regardless of the nature and arrangement of the specific data values. Therefore, we feel that we have given a very fair and unbiased assessment of the P-tree aggregation method relative to other methods by choosing a fairly large sampling of real-life numeric attributes for a wide range of data set sizes.

6.1. Computing Aggregate Functions

Figure 9 shows the runtime of the Sum aggregate for both methods, basic P-trees and the bitmap index, with respect to the number of tuples in the file.

[Figure 9. Sum Aggregate Performance Time Comparison. Runtime (seconds) versus number of tuples (k), for P-tree and Bitmap Index.]

The counterpart SQL query of Figure 9 is below:

    SELECT N1, Sum (Y)
    FROM Relation M
    GROUPBY N1

Figure 10 shows the runtime of the Average aggregate for both methods, basic P-trees and the bitmap index, with respect to the number of tuples.

[Figure 10. Average Aggregate Performance Time Comparison. Runtime (seconds) versus number of tuples (k), for P-tree and Bitmap Index.]

The counterpart SQL query of Figure 10 is below:

    SELECT N1, Average (Y)
    FROM Relation M
    GROUPBY N1

Figure 11 shows the runtime of the Min aggregate for both methods, basic P-trees and the bitmap index, with respect to the number of tuples.

[Figure 11. Min Aggregate Performance Time Comparison. Runtime (seconds) versus number of tuples (k), for P-tree and Bitmap Index.]

The counterpart SQL query of Figure 11 is below:

    SELECT N1, Min (Y)
    FROM Relation M
    GROUPBY N1

Figure 12 shows the runtime of the Max aggregate for both methods, basic P-trees and the bitmap index, with respect to the number of tuples.

[Figure 12. Max Aggregate Performance Time Comparison. Runtime (seconds) versus number of tuples (k), for P-tree and Bitmap Index.]

The counterpart SQL query of Figure 12 is below:

    SELECT N1, Max (Y)
    FROM Relation M
    GROUPBY N1

Figure 13 shows the runtime of the Median and Rank-k aggregates for both methods, basic P-trees and the bitmap index, with respect to the number of tuples.

[Figure 13. Median or Rank-k Performance Time Comparison. Runtime (seconds) versus number of tuples (k), for P-tree and Bitmap Index.]

The counterpart SQL query of Figure 13 is below:

    SELECT N1, Median (Y)
    FROM Relation M
    GROUPBY N1

Figure 14 shows the runtime of the Top-k aggregate for both methods, basic P-trees and the bitmap index, with respect to the number of tuples.

[Figure 14. Top-k Aggregate Performance Time Comparison. Runtime (seconds) versus number of tuples (k), for P-tree and Bitmap Index.]

The counterpart SQL query of Figure 14 is below:

    SELECT N1, Top-k (Y)
    FROM Relation M
    GROUPBY N1

Figure 9 to Figure 13 show that the algorithms of aggregate functions for both basic P-trees and the bitmap index on a single attribute are scalable with respect to the number of tuples, but the P-tree method is somewhat quicker, especially when it deals with large datasets. The advantages of basic P-tree representations of files are that, first, there is no need for redundant, auxiliary structures, and, second, basic P-trees are good at calculating multi-attribute aggregations and are fair to all attributes. Bitmap indexes select individual attributes of a file. It is difficult for a bitmap index to deal with composite attributes that are not bitmap indexed because no one advocates "fully inverting every composite attribute with a separate bitmap index – that would require totally unacceptable space and time." The indexes would take up an order of magnitude more space than the file itself, and any update to the original file would require changes to every one of them. The real strong advantage of basic P-trees over bitmap indexes comes to light when multiple attributes are involved. In that situation, basic P-trees do not need to scan the original data file, while the bitmap index approach either needs to scan the data file or combine the bitmap indexes in some way.

6.2. Computing the Example of Iceberg Query

In this section, we show the real advantage of the basic P-tree representation of a data set over the use of auxiliary bitmapped indexes on that data set. In most data mining tasks (iceberg querying and others), it cannot be predicted in advance exactly which single attribute or attributes are going to be aggregated over. Therefore, it is impossible to pre-construct and maintain auxiliary bitmapped indexes on all composite attributes that might be involved in an iceberg query, and even more so for other data mining operations (e.g., classification or clustering). We demonstrate this important point by showing just one composite attribute aggregation comparison in which there is no bitmap index on the composite attribute. In our experiment below, we suppose that attribute N2 is not bitmap indexed.

Figure 15 shows the runtime of executing an iceberg query for both methods, P-tree and bitmap, with respect to the number of tuples. From Figure 15, we can see that when the number of transactions increases, the running time of both methods increases at a similar rate. Overall, the P-tree method is better than bitmap indexes due to the inner optimization of P-trees. The experiment results show that the P-tree method is more scalable than the bitmap index in terms of the number of transactions and the number of values in each attribute.

[Figure 15. Iceberg Query with multi-attributes aggregation Performance Time Comparison. Runtime (seconds) versus number of tuples (k), for P-tree and Bitmap Index.]

The counterpart SQL query of Figure 15 is below:

    SELECT N1, N2, Sum (Y)
    FROM Relation M
    GROUPBY N1, N2
    HAVING Sum (Y) >= 15

The real advantage of basic P-trees comes with

aggregates and value P-trees, we will get many useful value P-trees and their root counts, which can be used to calculate aggregates that are required in real time later.

Reference

[1] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh.
respect to composite queries and the fact that basic Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, pages 29-53, 1997. P-trees methods are “fair” to all composite attributes, which means no matter what the composite aggregation have [2] M. Fang, N. Shivakumar, H. Garcia-Molina, R. approximately the same advantage. Whereas, bitmap Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB Conf., pages 299-310, 1998. attributes might be, attribute or attributes indexes make a very selective choice between attributes which will have accelerated aggregation and attributes [3] K. Beyer and R. Ramakrishnan. Bottom-up which will not (usually including all composites). computation of sparse and iceberg cubes. SIGMOD, pages 359-370, 1999. Conclusion [4] W., Perrizo, Peano Count Tree Technology, In the first part of this paper, we present algorithms Technical Report NDSU-CSOR-TR-01-1, 2001. to implement various aggregate functions using basic P-trees. P-trees are lossless, vertical bitwise data [5] R. Agrawal, T. Imielinski, and A. Swami, Mining structures, which are developed to facilitate data Association Rules Between Sets of Items in Large Databases. ACM SIGMOD Conf. Management of Data, pages 207-216, 1993 compression and processing. Although our algorithms are designated for different functions, they are all made up of two major operations: logic operations of P-trees and [6] R. Agrawal and R. Srikant. Fast algorithms for RootCount function of P-trees. Both logic operations and mining association rules. In VLDB, pages 487-499, 1994. RootCount function are single instruction loops, which computers can execute quickly. Those characters make [7] J. Han, J. Pei, G. Dong, and K. Wang. Efficient our algorithms superior in speed and accuracy than many computation of iceberg cubes with complex measures. SIGMOD, pages 1-12, 2001. other methods. 
We use an example to demonstrate how iceberg [8] B. Wang, F. Pan, D. Ren, Y. Cui, Q. Ding, and W. query on multi-attribute using basic P-trees can be Perrizo. Efficient OLAP Operations for Spatial Data Using Peano Trees. DMKD, pages 28-34, 2003. implemented. We believe that our method makes the calculation of a multi-attribute Group By as efficient as [9] Wang, B., Pan, F., Cui, Y., and Perrizo, W. possible. Traditional methods need to pre-calculate all the Efficient Quantitative Frequent Pattern Mining Using Predicate Trees. Int. Journal of Computers and Their Applications, 2006 aggregates or all the bitmap indexes in a relation, which not only lose the flexibility in data aggregation but also waste memory space and time, especially, when there are [10] M., Khan, Q., Ding, W., Perrizo, k-Nearest too many aggregates, attributes, and combination of Neighbor Classification on Spatial Data Streams Using P-Trees, PAKDD, pages 517-528, 2002. attributes. Our method is especially flexible in the following way: we can either generate the most [11] Q. Ding, Q. Ding and W. Perrizo "Association frequently used aggregates and value P-trees before-hand Rule Mining on Remotely Sensed Images Using P-trees," PAKDD, pages 66-79, 2002. or calculate the aggregates and value P-trees in real time. In many cases, during the procedure of pre-calculating 13 [12] TIFF image data sets. Available http://midas-10cs.ndsu.nodak.edu/data/images/. at [13] P. O’Neil and D. Quass. Improved query performance with variant indexes. SIGMOD, pages 38-49, 1997. [14] P. O'Neil, Informix and Indexing Support for Data Warehouses, Database and Programming Design, 1997. 14
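As an illustration of the Conclusion's point that the aggregates reduce to logical operations plus RootCount, the two-attribute iceberg query used for Figure 15 can be sketched in Python. This is only a minimal sketch under stated assumptions, not the implementation that was measured: plain Python integers stand in for uncompressed vertical bit vectors rather than real P-trees, and the names bit_slices, value_mask, root_count, masked_sum, and iceberg_sum are hypothetical. The sum over a group is taken as the weighted total of per-bit-slice counts, that is, the sum over j of 2^j times the number of 1s in (slice j AND group mask).

```python
def bit_slices(values, width):
    """Vertical layout: bit `row` of slices[j] is bit j of values[row]."""
    slices = [0] * width
    for row, value in enumerate(values):
        for j in range(width):
            if (value >> j) & 1:
                slices[j] |= 1 << row
    return slices


def value_mask(column, wanted):
    """Bit vector marking the rows where column == wanted
    (a stand-in for a value P-tree)."""
    mask = 0
    for row, value in enumerate(column):
        if value == wanted:
            mask |= 1 << row
    return mask


def root_count(bits):
    """Number of 1 bits in the vector (stand-in for RootCount)."""
    return bin(bits).count("1")


def masked_sum(slices, mask):
    """Sum of the sliced attribute over the rows selected by mask,
    using only AND and counting of 1s."""
    return sum(root_count(s & mask) << j for j, s in enumerate(slices))


def iceberg_sum(n1, n2, y, width, threshold):
    """SELECT N1, N2, Sum(Y) ... GROUP BY N1, N2 HAVING Sum(Y) >= threshold.
    Assumes y holds non-negative integers that fit in `width` bits."""
    y_slices = bit_slices(y, width)
    result = {}
    for a in sorted(set(n1)):
        mask_a = value_mask(n1, a)
        for b in sorted(set(n2)):
            group = mask_a & value_mask(n2, b)  # AND of the two value masks
            if group == 0:
                continue  # no tuple has this (N1, N2) combination
            total = masked_sum(y_slices, group)
            if total >= threshold:
                result[(a, b)] = total
    return result


# Toy data, made up for illustration only.
n1 = [1, 1, 2, 2, 1]
n2 = [0, 0, 1, 1, 1]
y = [7, 9, 3, 4, 20]
print(iceberg_sum(n1, n2, y, width=5, threshold=15))
```

On this toy data the groups (1, 0) and (1, 1) have sums 16 and 20 and are kept, while group (2, 1) has sum 7 and is filtered out by the threshold of 15. Note that no composite index on (N1, N2) is needed: the group mask is formed on demand by ANDing the two single-attribute masks.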