Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets

Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
{taufik.abidin, amal.perera, md.serazi, william.perrizo}@ndsu.edu

Abstract

In this paper, we introduce the vertical set square distance (VSSD) technique, designed to efficiently and scalably measure the total variation of a set about a fixed point in large datasets. The set can be any projected subspace of any vector space, including oblique subspaces (not just dimensional subspaces). VSSD can determine the closeness of a point to a set of points in a dataset, which is useful for classification, clustering and outlier detection tasks. The technique employs a vertical data structure called the Predicate-tree (P-tree)¹. Performance evaluations on both synthetic and real-world datasets show that the VSSD technique is fast, accurate and scales well to very large datasets, compared to similar techniques that use a horizontal record-based data structure.

Keywords: Vertical Set Square Distance, P-trees, Total Variation.

1. INTRODUCTION

The determination of the closeness or distance of a point to another point, or to a set of points, is often required in data mining tasks. For example, in partitioning clustering, in which similar points (objects) are grouped together, the closeness of points is used to express similarity, e.g., in k-means and k-medoids [2][3]; the same is true for classification tasks [4][6]. Similarly, in outlier detection the closeness of a point to the other points in a dataset is often used as a basis for identifying outliers [5]. One way to measure the closeness of a point to a predefined collection of points is to compute the total variation of that set of points about the point in question. The analysis of total variation can reveal how well or how poorly the point is distinguished from that set of points [3].

In this paper, we introduce a new technique called the vertical set square distance (VSSD) that scalably and accurately computes the total variation of a set of points about a fixed point. The scalability of the technique to very high cardinality datasets is our main concern, because data mining tasks frequently deal with large datasets. The VSSD technique employs the P-tree vertical representation as its underlying data structure, which makes it very fast and scalable when computing total variation in large datasets, compared to a similar technique that uses horizontal record-based data, as shown in our performance evaluations. The horizontal record-based set square distance technique used for comparison is called the horizontal set square distance (HSSD) in this paper.

The rest of the paper is organized as follows. Section 2 provides a short review of the P-tree vertical representation. Section 3 presents the vertical set formulas and a simple example. Section 4 discusses performance evaluations. Section 5 presents conclusions and some directions for future work.

¹ Patents are pending on the P-tree technology. This work was partially supported by GSA Grant ACT#: K96130308.

2. VERTICAL REPRESENTATION

Vertical data representation consists of a set of structures representing the data column-by-column rather than row-by-row (as in relational data). P-trees are one choice of vertical data representation, which can be used for data mining instead of the more common sets of relational records. P-trees were initially developed for mining spatial data [4][7].
However, since then this vertical data representation has been adapted for mining many other types of data [6][8]. P-tree vertical data structures are particularly well suited for data mining without sub-sampling.

The creation of P-trees typically starts by converting a relational table of horizontal records into a set of vertical, compressed P-trees, decomposing each attribute in the table into separate bit vectors (e.g., one for each bit position of a numeric attribute, or one bitmap for each category of a categorical attribute). Such vertical partitioning guarantees that no information is lost. The complete set of (uncompressed) vertical bit slices (for numeric attributes) or bitmaps (for categorical attributes) is called the set of 0-dimensional P-trees for the given table. 1-dimensional P-trees apply a binary tree compression to those 0-dimensional P-trees, as shown in Figure 1. 2-dimensional, 3-dimensional or multi-dimensional P-trees are also possible, depending upon the nature of the data and the compression achieved for the particular dataset. Roughly, the q-dimensional P-tree set that losslessly represents a tabular dataset is constructed by recursively partitioning into q pieces and recording the truth of the given predicate for each piece in the appropriate node of a fan-out = q tree. In this paper, we focus on 1-dimensional P-trees, in which each piece is equi-width (recursive halving, i.e., partitioning into 1/2^1 pieces). However, everything done with 1-dimensional P-trees could equally be done with 2-dimensional P-trees (in which each equi-width piece is recursively quartered, i.e., partitioned into 1/2^2 pieces), etc.

The logical operations AND (∧), OR (∨) and NOT or complement (') are the main operations used to horizontally process these vertical P-trees. The operations are performed level-by-level starting from the root level. They are associative, commutative and distributive, since they are simply pruned bit-by-bit operations. Refer to [1] for an overview of the P-tree algebra and its logical operations. We note here only that the AND operation, for instance, can be accelerated by observing that any operand P-tree with a pure-0 terminal node at any level forces the result P-tree to have a pure-0 terminal node at the same position, regardless of the other operands. This and other operational speedups contribute strongly to the effectiveness of the approach.

For example, let R be a relational table consisting of three numeric attributes R(A1, A2, A3), with values in binary representation (e.g., (5)10 in binary is (101)2). For 0-dimensional P-trees, each bit position is treated as a vertical bit slice and is stored in a separate file. For 1-dimensional Predicate-trees using the universal predicate (purely 1-bits), each bit slice is converted into a compressed tree structure by recursively halving and recording the truth of the predicate in a binary tree. Figure 1 depicts the conversion of a numeric attribute, A1, into 1-dimensional P-trees. The count of 1-bits in the P-tree resulting from logical operations is called the root count. It can be computed quickly by summing from the bottom up. For example, the root count of P11 ∧ P12 is equal to 7, computed as 1·2^0 + 1·2^1 + 1·2^2, since there is only a single pure-1 node found at each level.
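To make the construction and the root count operation concrete, here is a minimal Python sketch (ours, not the authors' implementation; all names are illustrative). It builds a 1-dimensional P-tree from a bit slice by recursive halving, ANDs two trees level-by-level with pure-0 pruning, and computes the root count bottom-up. The last line but one reproduces the 1·2^0 + 1·2^1 + 1·2^2 = 7 arithmetic from the text, using a bit slice chosen so that exactly one pure-1 node occurs at each level.

```python
# Minimal 1-dimensional P-tree sketch (illustrative; not the paper's code).
# A node is 1 (pure-1), 0 (pure-0), or a (left, right) pair of subtrees.

def build(bits):
    """Recursively halve a bit slice, stopping a branch as soon as it is pure."""
    if all(b == 1 for b in bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))

def p_and(t1, t2):
    """Level-by-level AND: a pure-0 operand prunes the result immediately."""
    if t1 == 0 or t2 == 0:
        return 0
    if t1 == 1:
        return t2
    if t2 == 1:
        return t1
    return (p_and(t1[0], t2[0]), p_and(t1[1], t2[1]))

def root_count(tree, width):
    """Sum 1-bits bottom-up: a pure-1 node covering 2^k leaves contributes 2^k."""
    if tree == 1:
        return width
    if tree == 0:
        return 0
    return root_count(tree[0], width // 2) + root_count(tree[1], width // 2)

# Attribute A1 of the running example, 3 bits wide: 4, 2, 2, 7, 5, 1, 6, 3.
values = [4, 2, 2, 7, 5, 1, 6, 3]
slices = {j: [(v >> j) & 1 for v in values] for j in (2, 1, 0)}
trees = {j: build(s) for j, s in slices.items()}

# One pure-1 node per level: root count = 1*2^0 + 1*2^1 + 1*2^2 = 7.
print(root_count(build([1, 1, 1, 1, 1, 1, 0, 1]), 8))        # 7
print(root_count(p_and(trees[2], trees[1]), len(values)))    # AND of two slices
```

In practice each bit slice would be stored as a compressed file and the AND performed directly on the compressed nodes, which is where the pure-0 pruning pays off.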
[Figure 1: conversion of attribute A1 = (4, 2, 2, 7, 5, 1, 6, 3) into bit slices P11, P12, P13 and their compressed binary trees, with levels L0 (root) through L3 (leaves); the figure itself is not reproducible from the extracted text.]

Figure 1. 1-dimensional P-tree of attribute A1.

The P-trees are built from the top down, stopping any branch as soon as purity (either purely 1-bits or purely 0-bits) is reached; this is what gives the compression.

3. VERTICAL SET FORMULAS

In this section, we define the vertical set formulas, performed on a training set with n dimensions, each represented in b bits. At the end of the section, the horizontal set square distance formula is also provided.

3.1 Binary Representation

Binary representation is a fundamental concept in vertical data structures. Let x_1 be a numeric value of attribute A1. Then the representation of x_1 in b bits is written as

x_1 = x_{1(b-1)} \cdots x_{10} = \sum_{j=0}^{b-1} 2^j x_{1j}

where x_{1(b-1)} and x_{10} are the highest- and lowest-order bits, respectively.

3.2 Vertical Set Inner Product (VSIP)

Let X be any set of vectors in R(A1, ..., An) with P-tree class mask PX, where each x ∈ X is represented in b bits as

x = (x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{n(b-1)} \cdots x_{n0})

and let

a = (a_{1(b-1)} \cdots a_{10},\; a_{2(b-1)} \cdots a_{20},\; \ldots,\; a_{n(b-1)} \cdots a_{n0})

be a target vector. Then the vertical set inner product (X ∘ a) is defined as

X \circ a = \sum_{x \in X} \sum_{i=1}^{n} x_i a_i = \sum_{i=1}^{n} \sum_{j=0}^{b-1} rc(P_X \wedge P_{ij}) \sum_{k=0}^{b-1} 2^{j+k} a_{ik}

Suppose there is a dataset with three attributes A1, A2 and A3, each with a numeric domain, and a fourth attribute Rank with a categorical domain, as illustrated in Table 1. P-tree class masks (bitmaps) PXi are obtained by creating one vertical bit vector per class, where bit 1 is assigned to every tuple containing that class and bit 0 to all other tuples. Since the class attribute Rank contains two distinct values, two P-tree class masks are created, one for each value.

Table 1. Training set example.

A1   A2   A3   Rank
 9   31    6   Low
11   20    5   Low
11   21    4   High
 7   23    3   High
 7   27    1   High
 8   31    0   High

Subsequently, we convert each numeric domain into base-two representation with a uniform width of b bits. The maximum width is determined by the largest value in the training set. After that, we create the P-tree class masks, as illustrated in Table 2. We pad with zeros to obtain a uniform bit width for any attribute whose values can be represented in fewer than b bits; zero padding is a prerequisite for the formulas to produce correct results. In Table 2, attribute A3 has been padded with two additional zero bits to reach a uniform width of 5 bits, because 31 = (11111)2 is the largest value found in the training set (attribute A2 of tuples 1 and 6).

Table 2. P-tree class masks of attribute Rank.

A1      A2      A3      Rank: PX1 (High)   PX2 (Low)
01001   11111   00110   0                  1
01011   10100   00101   0                  1
01011   10101   00100   1                  0
00111   10111   00011   1                  0
00111   11011   00001   1                  0
01000   11111   00000   1                  0

As mentioned above, a root count is the total number of 1-bits in the result of operations on P-tree operands. The root count of PX1 ∧ P13, written rc(PX1 ∧ P13), is equal to 2, where PX1 is the P-tree class mask of class High and P13 is the P-tree of attribute A1 at bit position j = 3 (the fourth bit, counting from the low-order bit). Let a target vector a = (14, 10, 19) = (01110, 01010, 10011) in binary; then the vertical set inner product for class High is (X1 ∘ a) = 1,634 and for class Low is (X2 ∘ a) = 999.

3.3 Vertical Set Difference (VSD)

The vertical set difference, denoted (X − a), computes the sum of the vector differences of a set of vectors X about a target vector a. Let x ∈ X be the vectors belonging to the same class, and let X be any set of vectors in R(A1, ..., An) with class mask PX. The vertical set difference is defined as

X - a = (v_1, v_2, \ldots, v_i, \ldots, v_n), \quad 1 \le i \le n

where

v_i = \sum_{j=0}^{b-1} 2^j \left( rc(P_X \wedge P_{ij}) - rc(P_X)\, a_{ij} \right)

The formula returns a single vector representing the cumulative difference of a set of vectors from a. However, since vectors have direction, components of opposite sign cancel in the summation, so the result can misrepresent the actual separation. Squaring avoids this problem; we describe the squared formula thoroughly in the next subsection.
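As a quick numeric check of the VSIP and VSD formulas before moving on, the following Python sketch (ours; the helper names bit, rc, vsip and vsd are not from the paper) evaluates both from root counts alone on the training set of Tables 1 and 2, reproducing (X1 ∘ a) = 1,634 and (X2 ∘ a) = 999.

```python
# Root-count evaluation of VSIP and VSD on the Table 1/2 training set.
# Uncompressed bit slices stand in for P-trees; names are ours.

b = 5                                        # uniform bit width after padding
A = [(9, 31, 6), (11, 20, 5), (11, 21, 4),   # rows of Table 1
     (7, 23, 3), (7, 27, 1), (8, 31, 0)]
rank = ["Low", "Low", "High", "High", "High", "High"]
a = (14, 10, 19)                             # target vector

def bit(v, j):                               # j-th bit of v (j = 0 is lowest)
    return (v >> j) & 1

P = {(i, j): [bit(row[i], j) for row in A]   # bit slices P_ij
     for i in range(3) for j in range(b)}
PX = {c: [int(r == c) for r in rank] for c in ("High", "Low")}

def rc(*slices):                             # root count of the AND of slices
    return sum(all(bits) for bits in zip(*slices))

def vsip(cls):
    # X o a = sum_i sum_j rc(PX ^ Pij) * sum_k 2^(j+k) a_ik
    return sum(rc(PX[cls], P[i, j]) * sum((1 << (j + k)) * bit(a[i], k)
                                          for k in range(b))
               for i in range(3) for j in range(b))

def vsd(cls):
    # v_i = sum_j 2^j * (rc(PX ^ Pij) - rc(PX) * a_ij)
    return tuple(sum((1 << j) * (rc(PX[cls], P[i, j]) - rc(PX[cls]) * bit(a[i], j))
                     for j in range(b))
                 for i in range(3))

print(vsip("High"), vsip("Low"))   # 1634 999
print(vsd("High"))                 # (-23, 62, -68): opposite signs cancel
                                   # when summed, motivating the squared form
```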
3.4 Vertical Set Square Distance (VSSD)

To alleviate the cancellation that occurs when a set of vector differences is summed (because vectors have direction), we introduce the vertical set square distance (VSSD). The VSSD measures the total variation, and from it the variance, of a set of points about a fixed point. The VSSD is defined as follows:

(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a)
 = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2
 = \sum_{x \in X} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i a_i + a_i^2 \right)
 = T_1 + T_2 + T_3

where

T_1 = \sum_{x \in X} \sum_{i=1}^{n} x_i^2
 = \sum_{i=1}^{n} \left( \sum_{j=0}^{b-1} 2^{2j}\, rc(P_X \wedge P_{ij})
 + \sum_{j=1}^{b-1} \sum_{l=0}^{j-1} 2^{j+l+1}\, rc(P_X \wedge P_{ij} \wedge P_{il}) \right)

T_2 = -2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i
 = -2 \sum_{i=1}^{n} a_i \sum_{j=0}^{b-1} 2^j\, rc(P_X \wedge P_{ij}),
 \quad a_i = \sum_{j=0}^{b-1} 2^j a_{ij}

T_3 = \sum_{x \in X} \sum_{i=1}^{n} a_i^2 = rc(P_X) \sum_{i=1}^{n} a_i^2

The quantity \frac{(X-a) \circ (X-a)}{N}, where N is the total number of vectors in X, measures the variance of X about a. Notice that N can be computed easily as rc(PX), the total number of 1-bits in the P-tree class mask of X.

The advantage of the VSSD is that the root counts are independent of a, which allows us to pre-compute them in advance and store them. These root counts are the root counts of the P-tree class masks PX, of PX ∧ Pij, and of PX ∧ Pij ∧ Pil, where Pij and Pil are the corresponding P-trees of the training set.

We calculate the total variation and variance using the same training example of Table 2 and target vector a = (14, 10, 19). Using the VSSD, the total variation of X1 about a is (X1 − a) ∘ (X1 − a) = T1 + T2 + T3, with T1 = 2,969, T2 = −3,268 and T3 = 2,628. Therefore, the total variation of X1 about a is (X1 − a) ∘ (X1 − a) = 2,329, and the variance of X1 about a is

\frac{(X_1 - a) \circ (X_1 - a)}{N_1} = \frac{T_1 + T_2 + T_3}{rc(P_{X_1})} = \frac{2969 - 3268 + 2628}{4} = 582.25

Similarly, the total variation and variance of X2 about a are 940 and 470, respectively. Since the variance of X1 about a is greater than that of X2, we conclude that the vector a is closer to X2 than to X1.

3.5 Horizontal Set Square Distance (HSSD)

Let X be a set of vectors in R(A1, ..., An), let x = (x1, x2, ..., xn) be a vector belonging to class X, and let a = (a1, a2, ..., an) be a target vector. Then the horizontal set square distance is defined as

(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2
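To connect the two definitions, the sketch below (ours; same toy data and conventions as the previous listing) computes T1, T2 and T3 from root counts alone and checks the totals against a direct HSSD scan of the horizontal records, reproducing the numbers of Section 3.4.

```python
# VSSD from root counts vs. a direct HSSD scan over horizontal records.

b = 5
A = [(9, 31, 6), (11, 20, 5), (11, 21, 4),
     (7, 23, 3), (7, 27, 1), (8, 31, 0)]
rank = ["Low", "Low", "High", "High", "High", "High"]
a = (14, 10, 19)

def bit(v, j):
    return (v >> j) & 1

P = {(i, j): [bit(row[i], j) for row in A] for i in range(3) for j in range(b)}
PX = {c: [int(r == c) for r in rank] for c in ("High", "Low")}

def rc(*slices):
    return sum(all(bits) for bits in zip(*slices))

def vssd(cls):
    px = PX[cls]
    t1 = (sum((1 << (2 * j)) * rc(px, P[i, j])
              for i in range(3) for j in range(b))
          + sum((1 << (j + l + 1)) * rc(px, P[i, j], P[i, l])
                for i in range(3) for j in range(1, b) for l in range(j)))
    t2 = -2 * sum(a[i] * sum((1 << j) * rc(px, P[i, j]) for j in range(b))
                  for i in range(3))
    t3 = rc(px) * sum(ai * ai for ai in a)
    return t1, t2, t3

def hssd(cls):
    # Horizontal approach: visit every record and accumulate (x_i - a_i)^2.
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(row, a))
               for row, r in zip(A, rank) if r == cls)

t1, t2, t3 = vssd("High")
print(t1, t2, t3, t1 + t2 + t3)          # 2969 -3268 2628 2329
print((t1 + t2 + t3) / rc(PX["High"]))   # variance: 582.25
print(hssd("High"), hssd("Low"))         # 2329 940
```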
4. EXPERIMENTAL RESULTS

This section reports experiments performed to evaluate the VSSD algorithm. The experiments were conducted using both real and synthetic datasets. The objective was to compare the execution time and scalability of our algorithm, which employs a vertical approach (vertical data structure and horizontal bitwise AND operations), with a horizontal approach (horizontal data structure and vertical scan operations). We show the results of execution-time experiments with respect to scalability.

The performance of both algorithms was observed on machines with different specifications, including an SGI Altix CC-NUMA machine. Table 3 summarizes the machines used in the experiments.

Table 3. The specification of machines used.

Machine     Specification
AMD1GB      AMD Athlon K7 1.4GHz, 1GB RAM
P42GB       Intel P4 2.4GHz, 2GB RAM
SGI Altix   SGI Altix CC-NUMA, 12-processor shared memory (12 x 4GB RAM)

4.1 Datasets

The experimental data was generated from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota (longitude 97°42'18"W), taken in 1998. The image contains three bands: red, green and blue reflectance values. We use the original image of size 1024x1024 pixels (a cardinality of 1,048,576). Corresponding synchronized data for soil moisture, soil nitrate and crop yield were also used in the experimental evaluations. Combining all bands and the synchronized data, we obtained a dataset with 6 dimensions.

Additional datasets of different sizes were synthetically generated from the original dataset to study the timing and scalability of the VSSD technique presented in this paper. Both timing and scalability were evaluated with respect to data size. Because the cardinality of the original dataset is relatively small (1,048,576 records), we super-sampled it using a simple image processing tool to produce five larger datasets, with cardinalities of 2,097,152, 4,194,304 (2048x2048 pixels), 8,388,608, 16,777,216 (4096x4096 pixels) and 25,160,256 (5016x5016 pixels).

4.2 Timing and Scalability Results

The first performance evaluation was done on the P42GB machine. We used the synthetic datasets of 4.1 and 8.3 million rows to evaluate the execution time of the algorithms when computing total variation for 100 different test cases. Datasets larger than 8.3 million rows could not be processed on this machine because HSSD runs out of memory. Figures 2 and 3 depict the execution time comparison between VSSD and HSSD.

[Figure 2: per-test-case execution time (seconds) of VSSD and HSSD over 100 test case IDs on the 4,194,304-row dataset.]

Figure 2. Time comparison on 4.1 million rows dataset.

[Figure 3: per-test-case execution time (seconds) of VSSD and HSSD over 100 test case IDs on the 8,388,608-row dataset.]

Figure 3. Time comparison on 8.3 million rows dataset.
The figures show that both algorithms scale up to 8.3 million rows; however, VSSD is significantly faster than HSSD. It requires only 0.0003 and 0.0004 seconds on average to complete the calculation on the two datasets, far less than HSSD, which needs 7.3800 and 15.1600 seconds on average, respectively. These significant disparities are due to the ability of the VSSD algorithm to reuse the same root count values, pre-computed and stored during P-tree creation, no matter how many different test case vectors are fed into the calculation. Referring back to the VSSD formula defined in Section 3.4, the test case vector a appears only in the calculation of T2 and T3 and is independent of the root count operations rc(PX ∧ Pij). This allows us to pre-compute the root counts and reuse them repeatedly, regardless of how many different target vectors are used, as long as the dataset and the set of classes remain unchanged.

Notice also that VSSD tends to have a constant execution time as the dataset size increases, whereas the execution time of HSSD increases significantly. One may argue that the pre-calculation of root counts makes this comparison fallacious. However, compare the time required by VSSD to load the vertical data structures into memory and perform the one-time root count operations with the time required by HSSD to load the horizontal records into memory, given in Table 4. The loading-time performance of VSSD is comparable to that of HSSD: loading horizontal records takes slightly longer than loading P-trees and computing the root counts. This illustrates the ability of the P-tree data structure to load efficiently and to compute the simple counts efficiently. These timings were obtained on the P42GB machine.

Table 4. Time for loading and computing root counts (seconds).

Cardinality of Dataset   VSSD: P-trees Loading and        HSSD: Horizontal
                         Root Count Pre-Computation       Dataset Loading
1,048,576                 3.900                            4.974
2,097,152                 8.620                           10.470
4,194,304                18.690                           19.914
8,388,608                38.450                           39.646

Our next experiment observed the timing and scalability of the algorithms when executed on machines with different specifications, especially for HSSD, which is very sensitive to the amount of memory available. This sensitivity was evident when we ran HSSD on the AMD1GB machine: HSSD successfully completed the total variation computation on the datasets with cardinalities 1,048,576, 2,097,152 and 4,194,304, yet ran out of memory on the datasets with cardinality greater than 4.1 million. Similarly, on the P42GB machine, HSSD scaled only up to the 8.3 million row dataset. HSSD scaled better on the SGI Altix, successfully computing total variation for all datasets, but it too ran out of memory when trying to load a dataset of more than 25 million rows. However, the timing of HSSD on this machine degrades significantly compared to the timing of HSSD on the P42GB machine, because our implementation does not utilize the shared memory 12-processor parallel architecture of the machine, which is beyond the scope of this paper. This machine, with 12 x 4GB of RAM, was used in the performance study because it was the only machine capable of loading the entire dataset for HSSD at the larger dataset sizes.

On the other hand, our VSSD technique was successful with respect to both time and scalability. There were no memory problems when computing total variation on the large datasets of more than 25 million rows; we were able to compute the VSSD for the largest dataset on the smallest machine (AMD1GB). Even faster results are evident when running VSSD on the more powerful machines. We report only the average time for VSSD, which was extremely stable, around 0.0003 to 0.0004 seconds even on the machine with the least compute power and memory.
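The stability of the per-test-case time follows directly from the structure of the formulas: once the root counts are cached, only T2 and T3 depend on the target vector, so each query touches O(n·b) pre-computed numbers regardless of dataset cardinality. A minimal sketch of this caching (ours; same toy data as the Section 3 listings):

```python
# Root-count caching: everything except T2 and T3 is independent of the
# target vector a, so per-query VSSD work does not grow with dataset size.

b, n = 5, 3
A = [(9, 31, 6), (11, 20, 5), (11, 21, 4),
     (7, 23, 3), (7, 27, 1), (8, 31, 0)]
rank = ["Low", "Low", "High", "High", "High", "High"]

def bit(v, j):
    return (v >> j) & 1

def rc(*slices):
    return sum(all(bits) for bits in zip(*slices))

P = {(i, j): [bit(row[i], j) for row in A] for i in range(n) for j in range(b)}
px = [int(r == "High") for r in rank]

# One-time pre-computation, done while the P-trees are built.
n_x = rc(px)
rc1 = {(i, j): rc(px, P[i, j]) for i in range(n) for j in range(b)}
t1 = (sum((1 << (2 * j)) * rc1[i, j] for i in range(n) for j in range(b))
      + sum((1 << (j + l + 1)) * rc(px, P[i, j], P[i, l])
            for i in range(n) for j in range(1, b) for l in range(j)))

def vssd_query(a):
    """Per-query work: only T2 and T3 depend on the target vector a."""
    t2 = -2 * sum(a[i] * sum((1 << j) * rc1[i, j] for j in range(b))
                  for i in range(n))
    t3 = n_x * sum(ai * ai for ai in a)
    return t1 + t2 + t3

print(vssd_query((14, 10, 19)))   # 2329, as in Section 3.4
print(vssd_query((0, 0, 0)))      # 2969 = T1 alone
```

On a real dataset the cache would be filled once at P-tree load time, which is exactly the one-time cost reported in Table 4.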
Table 5 presents the average times for executing the two techniques on the different machines, and Figure 4 further illustrates the performance with respect to scalability.

Table 5. Average time under different machines (seconds; — = out of memory).

                         Average Running Time (Seconds)
Cardinality of Dataset   HSSD        HSSD      HSSD SGI Altix   VSSD
                         AMD1GB      P42GB     12x4GB           AMD1GB
 1,048,576                2.2000      1.8400     5.4800         0.0003
 2,097,152                4.4100      3.6400     8.3200         0.0003
 4,194,304                8.5800      7.3800    15.8640         0.0004
 8,388,608                  —        15.1600    33.9000         0.0004
16,777,216                  —           —       66.5400         0.0004
25,160,256                  —           —      115.2040         0.0004

[Figure 4: average running time (seconds) vs. number of tuples (in units of 1024^2) for VSSD on AMD-1G and for HSSD on AMD-1G, P4-2G and SGI-48G, with out-of-memory points marked.]

Figure 4. Average time under different machines.

5. CONCLUSION

In this paper we have presented, defined and evaluated the performance of the vertical set square distance (VSSD), a new approach to computing total variation. The experiments clearly indicate that VSSD is fast, accurate and scales well to very large datasets, compared to HSSD. The calculation is very fast due to the ability of VSSD to pre-compute the root count values in advance, store them, and reuse them throughout the calculation, independently of the point in question. The scalability is also due to the ability to load the data and compute the initial root counts efficiently using the P-tree vertical data structure.

We believe that VSSD could be very useful in clustering and outlier detection tasks, as it measures the closeness of a group of feature vectors about a target. Presumably, in a nearest neighbor classification task, using VSSD in the voting phase would greatly accelerate class assignment, because the calculation over the correlated points can be done entirely in one computation, without visiting each individual point as in the horizontal-based approach. VSSD thus opens opportunities to develop scalable classification, clustering and outlier detection applications for very large datasets. However, these hypotheses require careful observation and testing, which we will pursue in future work.

6. REFERENCES

[1] Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree Algebra, Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, 2001.
[3] J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, NY, 1975.
[4] M. Khan, Q. Ding, and W. Perrizo, K-Nearest Neighbor Classification of Spatial Data Streams using P-trees, Proceedings of the PAKDD, pp. 517-528, 2002.
[5] E. M. Knorr and R. T. Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), pp. 392-403, 1998.
[6] A. Perera, A. Denton, P. Kotala, W. Jockheck, W. V. Granda, and W. Perrizo, P-tree Classification of Yeast Gene Deletion Data, SIGKDD Explorations, 4(2), pp. 108-109, 2002.
[7] W. Perrizo, Peano Count Tree Technology, Technical Report NDSU-CSOR-TR-01-1, 2001.
[8] I. Rahal and W. Perrizo, An Optimized Approach for KNN Text Categorization using P-Trees, Proceedings of the ACM Symposium on Applied Computing, pp. 613-617, 2004.