Vertical Set Square Distance:
A Fast and Scalable Technique to Compute Total Variation in Large Datasets
Taufik Abidin, Amal Perera, Masum Serazi, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
{taufik.abidin, amal.perera, md.serazi, william.perrizo}@ndsu.edu
Abstract
In this paper, we introduce the vertical set square
distance (VSSD) technique that is designed to efficiently
and scalably measure the total variation of a set about a
fixed point in large datasets. The set can be any projected
subspace of any vector space, including oblique subspaces
(not just dimensional subspaces). VSSD can determine the
closeness of a point to a set of points in a dataset, which can
be very useful for classification, clustering and outlier
detection tasks. The technique employs a vertical data
structure called the Predicate-tree (P-tree)¹. Performance
evaluations based on both synthetic and real-world datasets
show that the VSSD technology is fast, accurate and scales well
to very large datasets, as compared to similar techniques
utilizing a horizontal record-based data structure.
Keywords
Vertical Set Square Distance, P-trees, Total Variation.
1. INTRODUCTION
The determination of the closeness or distance of a point to
another point or to a set of points is often required in data
mining tasks. For example, in partitioning clustering, where
similar points (objects) are grouped together, the closeness of
points is used to express similarity (e.g., k-means and
k-medoids [2][3]), and the same is true for classification tasks
[4][6]. Similarly, in outlier detection the closeness of a point
to the other points in a dataset is often used as a basis for
determining outliers [5].

One way to measure the closeness of a point to some
other predefined collection of points is to compute the
total variation of that set of points about the point in question.
The analysis of total variation can reveal how well or how poorly
the point is distinguished from that set of points [3].
In this paper, we introduce a new technique called the
vertical set square distance (VSSD) that scalably and
accurately computes the total variation of a set of points about
a fixed point. The scalability of the technique to very large,
high-cardinality datasets is our main concern, because data
mining tasks frequently deal with large datasets. The VSSD
technique employs the P-tree vertical representation as its
underlying data structure, which makes it very fast and allows it
to scale well when computing total variation in large datasets,
compared to a similar technique that uses horizontal record-based
data, as shown in our performance evaluations. The horizontal
record-based set square distance technique used for comparison is
called the horizontal set square distance (HSSD) in this paper.

The rest of the paper is organized as follows. Section 2
provides a short review of the P-tree vertical representation.
Section 3 presents the vertical set formulas and a simple
example. Section 4 discusses the performance evaluations.
Section 5 presents conclusions along with some directions for
future work.

¹ Patents are pending on the P-tree technology. This work was partially
supported by GSA Grant ACT#: K96130308.

2. VERTICAL REPRESENTATION
Vertical data representation consists of set structures
representing the data column-by-column rather than
row-by-row (relational data). P-trees are one choice of vertical
data representation, which can be used for data mining
instead of the more common sets of relational records. P-trees
were initially developed for mining spatial data [4][7].
However, since then this vertical data representation has
been adapted for mining many types of data [6][8].

P-tree vertical data structures are particularly well
suited for data mining without sub-sampling. The creation
of P-trees typically starts by converting a relational table
of horizontal records to a set of vertical, compressed P-trees,
decomposing each attribute in the table into separate bit
vectors (e.g., one for each bit position of a numeric attribute
or one bitmap for each category in a categorical attribute).
Such vertical partitioning guarantees that no information is
lost.
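To make the decomposition concrete, here is a minimal illustrative sketch (not the authors' implementation; the function name bit_slices and the fixed bit width b are our own assumptions) that splits one unsigned integer column into its b vertical bit slices, using the A1 values that appear later in Figure 1:

```python
def bit_slices(column, b):
    """Decompose a column of unsigned integers into b vertical bit vectors.
    slices[0] holds the most significant bit of every value."""
    return [[(value >> (b - 1 - j)) & 1 for value in column]
            for j in range(b)]

# Attribute A1 from Figure 1, represented in 3 bits.
A1 = [4, 2, 2, 7, 5, 1, 6, 3]
P11, P12, P13 = bit_slices(A1, 3)   # uncompressed 0-dimensional P-trees
print(P11)                          # [1, 0, 0, 1, 1, 0, 1, 0]
```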
The complete set of (uncompressed) vertical bit slices
(for numeric attributes) or bitmaps (for categorical
attributes) will be called the 0-dimensional P-trees for the
given table. 1-dimensional P-trees involve a binary-tree
compression of those 0-dimensional P-trees, as shown in the
next figure. 2-dimensional, 3-dimensional or multi-dimensional
P-trees are also possible, depending upon the
nature of the data and the compression achieved for the
particular dataset. Roughly, the q-dimensional P-tree set,
which losslessly represents a tabular dataset, is constructed
by recursively partitioning into q pieces and recording the
truth of the given predicate regarding each piece in the
appropriate node of a fan-out = q tree. In this paper, we will
focus on 1-dimensional P-trees, in which each piece is
equi-width (recursively halving, i.e., partitioning into 1/2¹
pieces). However, everything done with 1-dimensional P-trees
could as well be done with 2-dimensional P-trees (e.g.,
in which each equi-width piece is recursively quartered, i.e.,
partitioning into 1/2² pieces), etc.
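As an illustration of the recursive-halving construction for the universal predicate "purely 1-bits", a 1-dimensional P-tree can be sketched as follows (a hypothetical node layout chosen for readability, not the layout of the actual P-tree implementation):

```python
def build_ptree(bits):
    """Build a 1-dimensional P-tree node for the predicate "purely 1-bits".
    A pure segment becomes a leaf ('pure', bit, length); a mixed segment is
    halved and both halves are built recursively (giving the compression)."""
    if all(b == 1 for b in bits):
        return ('pure', 1, len(bits))
    if all(b == 0 for b in bits):
        return ('pure', 0, len(bits))
    mid = len(bits) // 2
    return ('node', build_ptree(bits[:mid]), build_ptree(bits[mid:]))
```

Stopping a branch as soon as it becomes pure is exactly what yields the compression described for Figure 1 below.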
Logical operations AND (∧), OR (∨) and NOT or
complement (') are the main operations used to horizontally
process these vertical P-trees. The operations are performed
level-by-level starting from the root level. They are
associative, commutative and distributive, since they are
simply pruned bit-by-bit operations. Refer to [1] for an
excellent overview of the P-tree algebra and its logical
operations. We note here only that the AND operation, for
instance, can be accelerated by realizing that any operand
P-tree that has a pure-0 terminal node at any level will
cause the result P-tree to have a pure-0 terminal node at the
same position, regardless of the other operands. This and
other operational speedups contribute strongly to the
effectiveness of the approach.
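Continuing the same hypothetical node layout (again only a sketch, not the authors' code), a level-by-level AND that exploits the pure-0 shortcut just described might look like this:

```python
def ptree_and(p, q):
    """AND two P-trees built over the same bit positions.  A pure-0 node in
    either operand immediately yields a pure-0 result of the same size,
    regardless of the other operand (the speedup noted above)."""
    if p[0] == 'pure' and p[1] == 0:
        return p
    if q[0] == 'pure' and q[1] == 0:
        return q
    if p[0] == 'pure':          # p is pure-1, so the result is simply q
        return q
    if q[0] == 'pure':          # q is pure-1, so the result is simply p
        return p
    return ('node', ptree_and(p[1], q[1]), ptree_and(p[2], q[2]))
```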
For example, let R be a relational table consisting of three
numeric attributes, R(A1, A2, A3), and assume a binary
representation (e.g., (5)₁₀ in binary is (101)₂). In
0-dimensional P-trees, each bit position is treated as a vertical
bit slice and is stored in a separate file. For 1-dimensional
Predicate-trees using the universal predicate (purely 1-bits),
each bit slice is converted into a compressed tree structure
by recursively halving and recording the truth of the
predicate in a binary tree. Figure 1 depicts the conversion
of a numerical attribute, A1, into 1-dimensional P-trees.

The count of 1-bits resulting from P-tree logical
operations is called the root count. It can be computed quickly
by summing from the bottom up. For example, the root count
of P11 ∨ P12 is equal to 7, computed from 1·2⁰ + 1·2¹ + 1·2²,
as there is only a single pure-1 node found at each level.
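Reusing build_ptree and the same hypothetical node layout from the sketches above (and assuming P11 and P12 denote the two highest-order bit slices of A1), the bottom-up root count and an OR operation reproduce the count of 7 quoted here:

```python
def root_count(p):
    """Count the 1-bits a P-tree represents by summing from the bottom up."""
    if p[0] == 'pure':
        return p[1] * p[2]                 # bit value times segment length
    return root_count(p[1]) + root_count(p[2])

def ptree_or(p, q):
    """OR two P-trees: a pure-1 operand decides the result, pure-0 defers."""
    if p[0] == 'pure':
        return p if p[1] == 1 else q
    if q[0] == 'pure':
        return q if q[1] == 1 else p
    return ('node', ptree_or(p[1], q[1]), ptree_or(p[2], q[2]))

P11 = build_ptree([1, 0, 0, 1, 1, 0, 1, 0])   # most significant slice of A1
P12 = build_ptree([0, 1, 1, 1, 0, 0, 1, 1])   # second slice of A1
print(root_count(ptree_or(P11, P12)))         # 7
```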
Figure 1. 1-dimensional P-trees of attribute A1 (the eight A1 values 4, 2, 2, 7, 5, 1, 6, 3 and the corresponding compressed trees P11, P12 and P13).

The P-trees are built from the top down, stopping any
branch as soon as purity (either purely 1-bits or purely
0-bits) is reached (giving the compression).

3. VERTICAL SET FORMULAS
In this section, we define the vertical set formulas,
performed on a training set with n dimensions whose values
are represented in b bits. At the end of the section, the
horizontal set square distance formula is also provided.
3.1 Binary Representation

Binary representation is intrinsically a fundamental
concept in vertical data structures. Let x be a numeric value
of attribute A1. Then the representation of x in b bits is
written as:

$$x = x_{1(b-1)} \cdots x_{10} = \sum_{j=b-1}^{0} 2^{j}\, x_{1j}$$

where $x_{b-1}$ and $x_{0}$ are the highest and lowest order bits,
respectively.

3.2 Vertical Set Inner Product (VSIP)
Let X be any set of vectors in R(A1…An) with P-tree class
mask PX, where each $x \in X$ is represented in b bits as
$x = (x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{n(b-1)} \cdots x_{n0})$,
and let
$a = (a_{1(b-1)} \cdots a_{10},\; a_{2(b-1)} \cdots a_{20},\; \ldots,\; a_{n(b-1)} \cdots a_{n0})$
be a target vector. Then the vertical set inner product
$(X \circ a)$ is defined as:
$$X \circ a = \sum_{x \in X} x \circ a = \sum_{i=1}^{n} \sum_{j=b-1}^{0} rc(PX \wedge P_{ij}) \sum_{k=b-1}^{0} 2^{j+k}\, a_{ik}$$
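As a sanity check of this definition (a hypothetical sketch with our own names, not the paper's code), the factored equivalent of the formula above, Σᵢ aᵢ · Σⱼ 2ʲ rc(PX ∧ Pij), can be coded directly from a dictionary of pre-computed root counts; applied to the class-High rows of the example that follows, it reproduces the value 1,634 quoted below:

```python
def vsip(rc_PX_P, a, n, b):
    """Vertical set inner product (X o a) from root counts.
    rc_PX_P[(i, j)] = rc(PX ^ Pij), with j the bit position (0 = least
    significant).  a is the target vector of n integer components."""
    total = 0
    for i in range(n):
        col_sum = sum((1 << j) * rc_PX_P[(i, j)] for j in range(b))
        total += col_sum * a[i]            # (sum of x_i over X) times a_i
    return total

# Class-High rows of Table 1 (below), target a = (14, 10, 19), b = 5 bits.
X1 = [(11, 21, 4), (7, 23, 3), (7, 27, 1), (8, 31, 0)]
a, n, b = (14, 10, 19), 3, 5
rc = {(i, j): sum((row[i] >> j) & 1 for row in X1)
      for i in range(n) for j in range(b)}
print(vsip(rc, a, n, b))                   # 1634, i.e. (X1 o a) in the text
```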
Suppose there is a dataset with three attributes A1, A2,
and A3, each of which has a numerical domain, and another
attribute Rank with a single categorical domain, as illustrated
in Table 1. P-tree class masks (bitmaps), PXi, are obtained
by creating a vertical bit vector, one for each class, where
bit 1 is assigned to every tuple containing that class and bit
0 to all other tuples. Let attribute Rank be the class attribute;
since it contains two distinct values, two P-tree class masks
are created, one for each value.
Table 1. Training set example.

A1    A2    A3    Rank
 9    31     6    Low
11    20     5    Low
11    21     4    High
 7    23     3    High
 7    27     1    High
 8    31     0    High
Subsequently, we convert each numerical domain into a
base-two representation with a uniform width of b bits.
The maximum width is determined by the largest value in
the training set. After that, we create the P-tree class masks,
as illustrated in Table 2.
Table 2. P-tree class masks of attribute Rank.

A1       A2       A3       PX1    PX2
01001    11111    00110    0      1
01011    10100    00101    0      1
01011    10101    00100    1      0
00111    10111    00011    1      0
00111    11011    00001    1      0
01000    11111    00000    1      0
2
j
 (rc ( PX  Pij )  rc ( PX )  aij )
j b 1
The formula returns a single vector that represents a
cumulative length of a set of vectors about a. However,
since a vector has direction, the final summation may
mislead the actual separation, especially when negative
vectors are involved in the summation. Therefore, squaring
the formula can avoid this misleading. We will describe this
new formula thoroughly in the next sub section.
3.4 Vertical Set Square Distance (VSSD)

To alleviate the problem of cancellation when a set of
vector differences is summed, due to the presence of
direction in a vector, we introduce the vertical set square
distance (VSSD). VSSD measures the total variation and
the variance of a set of points about a fixed point. The
VSSD is defined as follows:
$$(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2$$

$$= \sum_{x \in X} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i a_i + a_i^2 \right) = \sum_{x \in X} \sum_{i=1}^{n} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{n} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{n} a_i^2 = T_1 + T_2 + T_3$$
where
$$T_1 = \sum_{x \in X} \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} \left( \sum_{j=b-1}^{0} 2^{2j}\, rc(PX \wedge P_{ij}) \; + \sum_{\substack{k = 2j \,\ldots\, (j+1),\; j \neq 0 \\ l = (j-1) \,\ldots\, 0,\; j \neq 0}} 2^{k}\, rc(PX \wedge P_{ij} \wedge P_{il}) \right)$$
T2  2   xi ai
x X i 1
n
 2 
 0

2 j  x ij

j

b

1

 
xX i 1
n
  xi2  2  xi ai   ai2

2 j  a ij 

j b 1

0

xX i 1
n
0
 2  
 2
j
i 1 j b 1 xX
n
 2  
0
2
i 1 j b 1
 a
0
2
j
j b 1
 rc( PX  Pij ) 
n
2
i
xX i i
4. EXPERIMENTAL RESULTS
 aij
0
2
j b 1
 0

    2 j  aij 
xX i 1  j b 1

n
T3 
j
 xij 
j
 aij
2
2
n
 0

 rc( PX )     2 j  aij   rc ( PX )   a i2
i 1
i 1  j b 1

n
The quantity $\frac{(X - a) \circ (X - a)}{N}$, where N refers to the total
number of vectors in X, measures the variance of X about a.
Notice that N can be computed easily as rc(PX), the total
number of 1-bits counted in the P-tree class mask of X.
The advantage of VSSD is that the root counts are obviously
independent of a and can therefore be pre-computed and stored
in advance. These root counts include the root counts of the
P-tree class masks PX, the root counts of PX ∧ Pij, and the root
counts of PX ∧ Pij ∧ Pil, where Pij and Pil are the
corresponding P-trees of the training set.
We calculate the total variation and the variance using
the same training example found in Table 2 and a target
vector a = (14, 10, 19). Using the VSSD, the total variation
of X1 about a is $(X_1 - a) \circ (X_1 - a) = T_1 + T_2 + T_3$. We get
T1 = 2,969, T2 = −3,268 and T3 = 2,628. Therefore, the total
variation of X1 about a is $(X_1 - a) \circ (X_1 - a) = T_1 + T_2 + T_3$ =
2,329 and the variance of X1 about a is:

$$\frac{(X_1 - a) \circ (X_1 - a)}{N_1} = \frac{T_1 + T_2 + T_3}{rc(PX_1)} = \frac{2{,}969 - 3{,}268 + 2{,}628}{4} = 582.25$$

Similarly, the total variation and the variance of X2
about a are 940 and 470, respectively. Since the variance of
X1 is greater than that of X2, we conclude that vector a is closer
to X2 than to X1.
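For illustration, a hypothetical sketch (our own names and data layout, not the authors' implementation) that assembles T1, T2 and T3 from pre-computed root counts reproduces the numbers above for class High:

```python
def vssd(rc_PX, rc_P, rc_PP, a, n, b):
    """Total variation (X - a) o (X - a) = T1 + T2 + T3 from root counts.
    rc_PX          : rc(PX), the number of vectors in the class
    rc_P[(i, j)]   : rc(PX ^ Pij), j = bit position, 0 = least significant
    rc_PP[(i,j,l)] : rc(PX ^ Pij ^ Pil) for j > l"""
    T1 = sum((1 << (2 * j)) * rc_P[(i, j)] for i in range(n) for j in range(b)) \
       + sum((1 << (j + l + 1)) * rc_PP[(i, j, l)]
             for i in range(n) for j in range(1, b) for l in range(j))
    T2 = -2 * sum(a[i] * (1 << j) * rc_P[(i, j)]
                  for i in range(n) for j in range(b))
    T3 = rc_PX * sum(ai * ai for ai in a)
    return T1, T2, T3

# Root counts for class High of Table 1 (in practice pre-computed once).
X1 = [(11, 21, 4), (7, 23, 3), (7, 27, 1), (8, 31, 0)]
a, n, b = (14, 10, 19), 3, 5
rc_P = {(i, j): sum((r[i] >> j) & 1 for r in X1)
        for i in range(n) for j in range(b)}
rc_PP = {(i, j, l): sum((r[i] >> j) & (r[i] >> l) & 1 for r in X1)
         for i in range(n) for j in range(1, b) for l in range(j)}
T1, T2, T3 = vssd(len(X1), rc_P, rc_PP, a, n, b)
print(T1, T2, T3, T1 + T2 + T3, (T1 + T2 + T3) / len(X1))
# 2969 -3268 2628 2329 582.25
```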
3.5 Horizontal Set Square Distance (HSSD)

Let X be a set of vectors in R(A1…An), let x = (x1, x2, …,
xn) be a vector belonging to class X, and let a = (a1, a2, …, an)
be a target vector. Then the horizontal set square distance is
defined as:

$$(X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{n} (x_i - a_i)^2$$
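By contrast with VSSD, the horizontal approach must scan every record for every target vector. A minimal sketch (names are our own) of the computation HSSD performs:

```python
def hssd(X, a):
    """Horizontal set square distance: scan every record and accumulate its
    squared Euclidean distance to the target vector a."""
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)

# Class High of Table 1 against a = (14, 10, 19): prints 2329, matching the
# total variation that VSSD obtained from root counts in Section 3.4.
print(hssd([(11, 21, 4), (7, 23, 3), (7, 27, 1), (8, 31, 0)], (14, 10, 19)))
```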
4. EXPERIMENTAL RESULTS

This section reports experiments performed to evaluate
the VSSD algorithm. The experiments were conducted
using both real and synthetic datasets. The objective was to
compare the execution time and scalability of our algorithm,
which employs a vertical approach (vertical data structure and
horizontal bit-wise AND operations), with a horizontal
approach (horizontal data structure and vertical scan
operations). We report execution-time results with respect to
scalability. The performance of both algorithms was observed
under different machine specifications, including an SGI Altix
CC-NUMA machine. Table 3 summarizes the machines used
for the experiments.
Table 3. The specification of machines used.

Machine      Specification
AMD1GB       AMD Athlon K7 1.4GHz, 1GB RAM
P42GB        Intel P4 2.4GHz processor, 2GB RAM
SGI Altix    SGI Altix CC-NUMA, 12 processors, shared memory (12 x 4GB RAM)
4.1 Datasets

The experimental data was generated from a set of
aerial photographs of the Best Management Plot (BMP)
of the Oakes Irrigation Test Area (OITA) near Oakes, North
Dakota (longitude 97°42'18"W), taken in 1998. The image
contains three bands: red, green, and blue reflectance values.
We use the original image of size 1024x1024 pixels (a
cardinality of 1,048,576). Corresponding synchronized data
for soil moisture, soil nitrate and crop yield were also used
for the experimental evaluations. Combining all bands and the
synchronized data, we obtained a dataset with 6 dimensions.
Additional datasets of different sizes were
synthetically generated from the original dataset to
study the timing and scalability of the VSSD technique
presented in this paper. Both timing and scalability were
evaluated with respect to data size. Because of the relatively
small cardinality of the original dataset (1,048,576
records), we super-sampled it using a simple image-processing
tool to produce five larger datasets, with cardinalities
of 2,097,152, 4,194,304 (2048x2048 pixels), 8,388,608,
16,777,216 (4096x4096 pixels) and 25,160,256
(5016x5016 pixels).
4.2 Timing and Scalability Results

The first performance evaluation was done using the P4
with 2 GB of RAM. We used the synthetic datasets having 4.1
and 8.3 million rows to evaluate the execution time of the
algorithms when computing the total variation for 100 different
test cases. Datasets larger than 8.3 million rows could not
be processed on this machine because HSSD ran out of memory.
Figures 2 and 3 depict the execution time comparison between
VSSD and HSSD.
Figure 2. Time comparison on the 4.1 million row dataset (execution time in seconds for each of the 100 test cases, VSSD vs. HSSD).

Figure 3. Time comparison on the 8.3 million row dataset (execution time in seconds for each of the 100 test cases, VSSD vs. HSSD).
The figures show that, up to 8.3 million rows, both
algorithms apparently scale; however, VSSD is significantly
faster than HSSD. It requires only 0.0003 and 0.0004
seconds on average to complete the calculation on each
dataset, far less than HSSD, which needs 7.3800
and 15.1600 seconds on average, respectively. These
significant disparities are due to the ability of the VSSD
algorithm to reuse the same root count values that have
been pre-computed and stored during P-tree creation, even
though many different test case vectors are fed into the
calculation. Referring back to the VSSD formula defined
in Section 3.4, the test case vector a appears only in the
calculation of T2 and T3 and is independent of the root count
rc(PX ∧ Pij) operations. This allows us to pre-compute
those operations and reuse the values repeatedly, regardless of
how many different target vectors are used, as long as the
dataset and the set of classes remain unchanged.

Notice also that VSSD tends to have a constant
execution time even as the dataset size increases,
whereas HSSD requires significantly more
execution time. One may argue that the pre-calculation of root
counts makes this comparison fallacious. However, compare
the time required to load the vertical data structures into
memory plus the one-time root count operations for VSSD with
the time required to load the horizontal records into memory for
HSSD, given in Table 4. The performance of VSSD with respect
to time is comparable to that of HSSD; in fact, loading the
horizontal records takes slightly longer than loading the
P-trees and computing the root counts. This illustrates
the ability of the P-tree data structure to load efficiently and
to compute the simple counts quickly. These timings were obtained
on the P4 with 2 GB of memory.

Table 4. Time for loading and computing root count.

Cardinality     VSSD: root count pre-computation     HSSD: horizontal
of Dataset      and P-tree loading (s)               dataset loading (s)
1,048,576        3.900                                4.974
2,097,152        8.620                               10.470
4,194,304       18.690                               19.914
8,388,608       38.450                               39.646
Our next experiment observed the algorithms' timing
and scalability when executing on machines with different
specifications, especially for HSSD, which is very sensitive
to the amount of memory available in order to execute
successfully. This sensitivity was evident when we ran HSSD on
the AMD machine with 1 GB of memory. HSSD successfully
completed the total variation computation for the datasets with
cardinalities of 1,048,576, 2,097,152, and 4,194,304, yet ran
out of memory when computing the total variation for datasets
with cardinalities greater than 4.1 million. Similarly, when we
ran HSSD on the P42GB machine, HSSD scaled only for datasets
with cardinalities below 8.3 million. Nevertheless, HSSD
performed better in terms of scalability on the SGI Altix and
successfully computed the total variation for all datasets, but
it also ran out of memory when trying to load a dataset with
more than 25 million rows. However, the timing of HSSD on this
machine degrades significantly compared to the timing of HSSD
on the P4 with 2 GB of RAM. This is because the full capability
of the machine's shared-memory, 12-processor parallel
architecture was not utilized, which is beyond the scope of this
paper. This machine, with 12 x 4 GB of RAM, was used in the
performance study because it was the only machine capable of
loading the entire dataset for HSSD on the larger datasets.
On the other hand, our VSSD technique was successful
with respect to both time and scalability. There were no memory
problems, and it computed the total variation effectively even
for the large datasets with more than 25 million rows. We were
able to compute the VSSD for the largest dataset on the
smallest machine (AMD1GB). Much faster results are
evident when running VSSD on the other, more powerful
machines. We report only the average time for VSSD, which
was extremely stable, around 0.0003 to 0.0004
seconds, even on the machine with the least compute
power and memory. Table 5 presents the average times for the
two techniques on the different machines, and Figure 4 further
illustrates the performance with respect to scalability.
Table 5. Average time under different machines.

                  Average Running Time (Seconds)
Cardinality       HSSD                                        VSSD
of Dataset        AMD1GB      P42GB       SGI Altix 12x4GB    AMD1GB
 1,048,576        2.2000       1.8400       5.4800            0.0003
 2,097,152        4.4100       3.6400       8.3200            0.0003
 4,194,304        8.5800       7.3800      15.8640            0.0004
 8,388,608        *           15.1600      33.9000            0.0004
16,777,216        *            *           66.5400            0.0004
25,160,256        *            *          115.2040            0.0004
*: Out of memory
Figure 4. Average time under different machines (average running time in seconds vs. number of tuples in units of 1024², for VSSD on AMD-1GB and for HSSD on AMD-1GB, P4-2GB and the SGI Altix; out-of-memory cases are marked).

5. CONCLUSION

In this paper we have presented, defined and evaluated
the performance of the vertical set square distance (VSSD), a
new concept for computing total variation. Experiments clearly
indicate that VSSD is fast, accurate and scales well to very
large datasets, as compared to HSSD. The time to complete
the calculation is very short because VSSD pre-computes and
stores the root count values in advance and reuses them
throughout the calculation, independently of the point in
question. The scalability is also due to the ability to load the
data and compute the initial root counts efficiently with the
P-tree vertical data structure.

We believe that VSSD could be very useful in
clustering tasks and outlier detection, as it measures the
closeness of a group of feature vectors about a target.
Presumably, in a nearest-neighbor classification task, using
VSSD in the voting phase would greatly accelerate the
assignment of a class, because the calculation over the
correlated points can be done entirely in one computation,
without having to visit each individual point as in the
horizontal-based approach. VSSD opens a window of
opportunity for developing scalable classification, clustering
and outlier detection applications for very large datasets.
However, these hypotheses require careful observation and
testing, which we will pursue in future work.
6. REFERENCES
[1] Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree
Algebra," Proceedings of the ACM Symposium on Applied
Computing, pp. 426-431, 2002.

[2] J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Morgan Kaufmann, San Francisco, CA, 2001.

[3] J. A. Hartigan, Clustering Algorithms, John Wiley &
Sons, New York, NY, 1975.

[4] M. Khan, Q. Ding, and W. Perrizo, "K-Nearest Neighbor
Classification of Spatial Data Streams using P-trees,"
Proceedings of the PAKDD, pp. 517-528, 2002.

[5] E. M. Knorr and R. T. Ng, "Algorithms for Mining
Distance-Based Outliers in Large Datasets," Proceedings of the
24th International Conference on Very Large Data Bases (VLDB),
pp. 392-403, 1998.

[6] A. Perera, A. Denton, P. Kotala, W. Jockheck, W. V.
Granda, and W. Perrizo, "P-tree Classification of Yeast Gene
Deletion Data," SIGKDD Explorations, 4(2), pp. 108-109, 2002.

[7] W. Perrizo, "Peano Count Tree Technology," Technical
Report NDSU-CSOR-TR-01-1, 2001.

[8] I. Rahal and W. Perrizo, "An Optimized Approach for KNN
Text Categorization using P-Trees," Proceedings of the ACM
Symposium on Applied Computing, pp. 613-617, 2004.