
Decision Tree Classification of Spatial Data Streams Using Peano Count Tree 1, 2
Qiang Ding, Qin Ding and William Perrizo
Computer Science Department, North Dakota State University
Fargo, ND 58105, USA
{Qiang_Ding, Qin_Ding, William_Perrizo}@ndsu.nodak.edu
Abstract
Many organizations have large quantities of spatial data collected in various application areas,
including remote sensing, geographical information systems (GIS), astronomy, computer cartography,
environmental assessment and planning, etc. Many of these data collections are growing rapidly and can
therefore be considered as spatial data streams [15, 16]. For data stream classification, time is a major
issue. However, many of these spatial data sets are too large to be classified effectively in a reasonable
amount of time using existing methods. In this paper, we develop a P-tree method for decision tree
classification.
The Peano Count Tree (P-tree) is a spatial data organization that provides a lossless
compressed representation of a spatial data set and facilitates classification and other data mining
techniques. We compare P-tree decision tree induction classification and a classical decision tree induction
method with respect to the speed at which the classifier can be built (and rebuilt when substantial amounts
of new data arrive). We show that the P-tree method is significantly faster than existing training methods,
making it the preferred method for working with spatial data streams.
Keywords
Data mining, Classification, Decision Tree Induction, Spatial Data, Peano Count Tree, Data Stream
1. Introduction
Data streams are very common today. Classifying open-ended data streams brings challenges and
opportunities since traditional techniques often cannot complete the work as quickly as the data is arriving
in the stream [15, 16]. Spatial data collected from sensor platforms in space, from airplanes or other platforms are typically updated periodically. For example, AVHRR (Advanced Very High Resolution Radiometer) data is updated every hour or so (8 times each day during daylight hours). Such data sets can be very large (multiple gigabytes) and are often archived in deep storage before valuable information can be obtained from them. An objective of spatial data stream mining is to mine such data in near real time prior to deep storage archiving. In this paper we focus on classification data mining.
1 Patents are pending on the bSQ and P-tree technology.
2 This work is partially supported by GSA Grant ACT#: K96130308 CIT:142.0.K00PA 131.50.25.H40.516, DARPA Grant DAAH0496-1-0329, NSF Grant OSR-9553368.
In classification a training or learning set is identified for the construction of a classifier. Each record
in the learning set has several attributes, one of which, the goal or class label attribute, indicates the class to
which each record belongs. The classifier, once built and tested, is used to predict the class label of new
records which do not yet have a class label attribute value.
A test set is used to test the accuracy of the classifier once it has been developed. The classifier, once
certified, is used to predict the class label of future unclassified data. Different models have been proposed for classification, such as decision trees, neural networks, Bayesian belief networks, fuzzy sets, and genetic models. Among these models, decision trees are widely used for classification. We focus on inducing
decision trees in this paper. ID3 (and its variants such as C4.5) [1] [2] and CART [4] are among the best
known classifiers which use decision trees. Other decision tree classifiers, such as Interval Classifier [3] and SPRINT [5], concentrate on making it possible to mine databases that do not fit in main memory by requiring only sequential scans of the data. Classification has been applied in many fields, such as retail
target marketing, customer retention, fraud detection and medical diagnosis [14].
Spatial data is a
promising area for classification. In this paper, we propose a decision tree based model to perform
classification on spatial data.
The Peano Count Tree (P-tree) [17] is used to build the decision tree classifier. P-trees represent
spatial data bit-by-bit in a recursive quadrant-by-quadrant arrangement. With the information in P-trees,
we can rapidly build the decision tree. Each new component in a spatial data stream is converted to P-trees
and then added to the training set as soon as possible. Typically, a window of data components from the
stream is used to build (or rebuild) the classifier. There are many ways to define the window, depending on
the data and application. In this paper, we focus on a fast classifier-building algorithm.
The rest of the paper is organized as follows. In section 2, we briefly introduce the data formats of
spatial data and describe the P-tree data structure and algebra. In Section 3, we detail our decision tree
induction classifier using P-trees. A performance analysis is given in Section 4. Section 5 discusses some related work. Finally, Section 6 concludes the paper.
2. Data Structures
There are several formats used in spatial data such as Band Sequential (BSQ), Band Interleaved by
Line (BIL) and Band Interleaved by Pixel (BIP). In previous work [17] we proposed a new format called bit Sequential (bSQ). Each bit of each band is stored separately as a bSQ file. Each bSQ file can be reorganized into a quadrant-based tree (P-tree). The example in Figure 1 shows a bSQ file and its basic P-tree.
Figure 1. An 8-by-8 bSQ file and its P-tree.

The bit file is:

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Its P-tree (root at depth 0 / level 3, leaves at depth 3 / level 0) has root count 55; the quadrant counts at level 2 are 16, 8, 15 and 16; the second quadrant (count 8) has sub-quadrant counts 3, 0, 4, 1 with mixed leaves 1110 and 0010, and the third quadrant (count 15) has sub-quadrant counts 4, 4, 3, 4 with mixed leaf 1101.
In this example, 55 is the count of 1's in the entire image; the numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits, we do not need sub-trees for these two quadrants. This pattern is continued recursively. Recursive
raster ordering is called Peano or Z-ordering in the literature. The process terminates at the leaf level
(level-0) where each quadrant is a 1-row-1-column quadrant. If we were to expand all sub-trees, including
those for quadrants that are pure 1-bits, then the leaf sequence is just the Peano space-filling curve for the
original raster image.
For each band (assuming 8-bit data values), we get 8 basic P-trees, one for each bit position. For band Bi we label the basic P-trees Pi,1, Pi,2, …, Pi,8; thus, Pi,j is a lossless representation of the jth bits of the values from the ith band. However, Pi,j provides more information and is structured to facilitate data mining processes. Some of the useful features of P-trees are collected below (a small construction sketch follows the list); details on these features can be found later in this paper or in our earlier work [17].
- P-trees contain the 1-count for every quadrant of every dimension.
- The P-tree for any sub-quadrant at any level is simply the sub-tree rooted at that sub-quadrant.
- A P-tree leaf sequence (depth-first) is a partial run-length compressed version of the original bit-band.
- Basic P-trees can be combined to reproduce the original data (P-trees are lossless representations).
- P-trees can be partially combined to produce upper and lower bounds on all quadrant counts.
- P-trees can be used to smooth data by bottom-up quadrant purification (bottom-up replacement of mixed counts with their closest pure counts - see Section 3 below).
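To make the structure concrete, here is a minimal sketch (ours, not the authors' implementation) in Python of building such a quadrant count tree from a 2^n-by-2^n bit array; the class and function names are illustrative, and children are materialized only for mixed quadrants, mirroring the pure-quadrant pruning described above.

class PTreeNode:
    def __init__(self, count, size, children=None):
        self.count = count          # number of 1-bits in this quadrant
        self.size = size            # the quadrant covers size x size pixels
        self.children = children    # None for pure quadrants, else four sub-trees

def build_ptree(bits, row=0, col=0, size=None):
    """bits: a list of 0/1 rows with 2^n rows and 2^n columns."""
    if size is None:
        size = len(bits)
    count = sum(bits[r][c] for r in range(row, row + size)
                           for c in range(col, col + size))
    if count == 0 or count == size * size or size == 1:
        return PTreeNode(count, size)              # pure (or single-pixel) quadrant
    half = size // 2
    children = [build_ptree(bits, row,        col,        half),   # quadrant 0
                build_ptree(bits, row,        col + half, half),   # quadrant 1
                build_ptree(bits, row + half, col,        half),   # quadrant 2
                build_ptree(bits, row + half, col + half, half)]   # quadrant 3
    return PTreeNode(count, size, children)

For the 8-by-8 bit file of Figure 1, the root's count is 55 and its children's counts are 16, 8, 15 and 16.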
For a spatial data set with 2^n rows and 2^n columns, there is a convenient mapping from raster coordinates (x,y) to Peano coordinates (or quadrant id). If x and y are expressed as n-bit strings, x1x2…xn and y1y2…yn, then the mapping is (x,y) = (x1x2…xn, y1y2…yn) → x1y1.x2y2…xnyn. Thus, in an 8 by 8 image, the pixel at (3,6) = (011,110) has quadrant id 01.11.10 = 1.3.2. To locate all quadrants containing (3,6) in the tree, start at the root; follow sub-tree pointer 1, then sub-tree pointer 3, then sub-tree pointer 2 (where the 4 sub-tree pointers are 0, 1, 2, 3 at each node). Finally, we note that these P-trees can be generated quite quickly [18] and can be viewed as a "data mining ready", lossless format for storing spatial data.
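A small sketch (ours) of this coordinate mapping; the function name and the returned list-of-pointers format are illustrative.

def quadrant_id(x, y, n):
    """Peano quadrant id of pixel (x, y) in a 2^n x 2^n image, returned as the
    list of sub-tree pointers (each 0-3) from the root down to the leaf."""
    path = []
    for level in range(n - 1, -1, -1):      # most significant bit first
        xi = (x >> level) & 1
        yi = (y >> level) & 1
        path.append(2 * xi + yi)            # interleave the bit pair (xi, yi)
    return path

# Example from the text: in an 8-by-8 image, pixel (3, 6) = (011, 110)
# has quadrant id 01.11.10, i.e. sub-tree pointers 1, 3, 2.
assert quadrant_id(3, 6, 3) == [1, 3, 2]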
The basic P-trees defined above can be combined using simple logical operations (AND, NOT, OR, COMPLEMENT) to produce P-trees for the original values (at any level of precision: 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band b and value v, where v can be expressed in 1-bit, 2-bit, …, or 8-bit precision. Using 8-bit precision, for instance, Pb,11010011 can be constructed from the basic P-trees as:

Pb,11010011 = Pb,1 AND Pb,2 AND Pb,3' AND Pb,4 AND Pb,5' AND Pb,6' AND Pb,7 AND Pb,8

where ' indicates the bit-complement (which is simply the count complement in each quadrant). This is called a value P-tree. The AND operation is simply the pixel-wise AND of the bits. Before going further, we note that the process of converting the BSQ (BIL, BIP) data for a Landsat TM satellite image (approximately 60 million pixels) to its basic P-trees can be done in just a few seconds on a high-performance PC. This is a one-time cost. We also note that we store the basic P-trees in a "depth-first" data structure, which specifies the pure-1 quadrants only. Using this data structure, each AND can be completed in a few milliseconds and the result counts can be computed easily once the AND and COMPLEMENT program has completed.
Finally, we note that the data in the relational format can also be represented as P-trees. For any
combination of values, (v1,v2,…,vn), where vi is from band-i, the quadrant-wise count of occurrences of this
tuple of values is given by: P(v1,v2,…,vn) = P1,v1 AND P2,v2 AND … AND Pn,vn . This is called a tuple
P-tree.
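To illustrate the AND composition, here is a minimal sketch (ours). Each basic P-tree is stood in for by an integer bit mask over the pixels; since P-trees are lossless, the root counts computed this way equal those of the corresponding value and tuple P-trees. All names are illustrative.

def value_ptree_mask(basic_masks, value, bits, n_pixels):
    """basic_masks[j] holds the j-th bit (most significant first) of the band
    for every pixel; returns the mask of pixels whose band value equals value."""
    all_ones = (1 << n_pixels) - 1
    result = all_ones
    for j in range(bits):
        bit_j = (value >> (bits - 1 - j)) & 1
        mask = basic_masks[j] if bit_j else (all_ones & ~basic_masks[j])  # complement
        result &= mask                      # AND of basic (or complemented) P-trees
    return result

def root_count(mask):
    return bin(mask).count("1")             # root count of the resulting P-tree

# A tuple P-tree P(v1, ..., vn) is then the AND of the per-band value masks, e.g.
#   root_count(value_ptree_mask(b1_masks, v1, 8, N) & value_ptree_mask(b2_masks, v2, 8, N))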
3. The Classifier
Classification is a data mining technique which typically involves three phases: a learning phase, a testing phase and an application phase. A learning model or classifier is built during the learning phase. It may be in the form of classification rules, a decision tree or a mathematical formula. Since the class label of each training sample is provided, this approach is known as supervised learning. In unsupervised learning (clustering), the class labels are not known in advance.
In the testing phase, test data are used to assess the accuracy of the classifier. If the classifier passes the test phase, it is used for the classification of new, unclassified data tuples. This is the application phase. The classifier predicts the class label for these new data samples.
In this paper, we consider the classification of spatial data in which the resulting classifier is a decision tree (decision tree induction). Our contributions include:
- a set of classification-ready data structures called Peano Count trees, which are compact, rich in information and facilitate classification;
- a data structure for organizing the inputs to decision tree induction, the Peano count cube;
- a fast decision tree induction algorithm, which employs these structures.
We point out that the classifier is precisely the classifier built by the ID3 decision tree induction algorithm [4]. The point of the work is to reduce the time it takes to build and rebuild the classifier as new data continues to arrive in the data stream.
3.1 Data Smoothing and Attribute Relevance
In the overall classification effort, as in most data mining approaches, there is a data preparation stage
in which the data is prepared for classification. Data preparation can involve cleaning (noise reduction by
applying smoothing techniques and missing value management techniques).
The P-tree data structure
facilitates a proximity-based data smoothing method, which can reduce the data classification time
considerably. The smoothing method is called bottom-up purity shifting. By replacing 3 counts with 4
and 1 counts with 0 at level-1 (and making resultant changes on up the tree), the data is smoothed and the
P-tree is compressed. More drastic smoothing can be effected: the user can select which set of counts to replace with pure-1 and which set of counts to replace with pure-0. The most important thing to note is that this smoothing can be done almost instantaneously once P-trees are constructed. With this method it is feasible to smooth data from the data stream before mining.
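A sketch (ours) of this level-1 purity shifting, reusing the PTreeNode representation from the earlier construction sketch; thresholds other than shifting 3-to-4 and 1-to-0 would follow the user's selection described above.

def purity_shift(node):
    """Bottom-up purity shifting at level-1: a 2x2 quadrant with three 1-bits
    becomes pure-1 (count 4) and one with a single 1-bit becomes pure-0
    (count 0); counts further up the tree are then recomputed."""
    if node.children is None:                  # already pure: nothing to do
        return node.count
    if node.size == 2:                         # level-1 quadrant (2x2 pixels)
        if node.count == 3:
            node.count, node.children = 4, None    # shift 3 -> 4 (pure-1)
        elif node.count == 1:
            node.count, node.children = 0, None    # shift 1 -> 0 (pure-0)
        return node.count
    node.count = sum(purity_shift(child) for child in node.children)
    if node.count in (0, node.size * node.size):
        node.children = None                   # quadrant became pure: prune it
    return node.count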
Another important pre-classification step is relevance analysis (selecting only a subset of the feature
attributes, so as to improve algorithm efficiency). This step can involve removal of irrelevant attributes or
removal of redundant attributes. An effective way to determine relevance is to rollup the P-cube to the
class label attribute and each other potential decision attribute in turn. If any of these rollups produce
counts that are uniformly distributed, then that attribute is not going to be effective in classifying the class
label attribute. The rollup can be computed from the basic P-trees without necessitating the actual creation
of the P-cube. This can be done by ANDing the class label attribute P-trees with the P-trees of the potential
decision attribute. Only an estimate of uniformity in the root counts is needed. By ANDing down to a fixed depth of the P-trees, ever better estimates can be obtained. For instance, ANDing to depth=1 provides the roughest distribution information, ANDing to depth=2 provides better distribution information, and so forth. Again, the point is that P-trees facilitate simple real-time relevance analysis, which makes it feasible for data streams.
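A sketch (ours) of this rollup-style relevance check, reusing the bit-mask stand-ins and the root_count helper from the earlier sketch; the uniformity test itself is an illustrative choice, not the authors' criterion.

def rollup_counts(class_masks, attr_masks):
    """class_masks[c] / attr_masks[a] are pixel masks (value P-trees) for class
    value c and attribute value a; returns the joint root counts RC(Pc ^ Pa)."""
    return {a: {c: root_count(class_mask & attr_mask)
                for c, class_mask in class_masks.items()}
            for a, attr_mask in attr_masks.items()}

def looks_irrelevant(joint_counts, tolerance=0.1):
    """Crude uniformity check: if, for every attribute value, the class
    proportions are within `tolerance` of one another, the attribute is
    unlikely to separate the classes."""
    for counts in joint_counts.values():
        total = sum(counts.values())
        if total == 0:
            continue
        proportions = [c / total for c in counts.values()]
        if max(proportions) - min(proportions) > tolerance:
            return False
    return True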
3.2 Classification by Decision Tree Induction Using P-tree
A decision tree is a flowchart-like structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and the leaf nodes represent classes or class distributions. Unknown samples can be classified by testing attributes against the tree. The path traced from root to leaf
holds the class prediction for that sample. The basic algorithm for inducing a decision tree from the learning or training sample set is as follows ([2] [13]):
- Initially the decision tree is a single node representing the entire training set.
- If all samples are in the same class, this node becomes a leaf and is labeled with that class label.
- Otherwise, an entropy-based measure, "information gain", is used as a heuristic for selecting the attribute which best separates the samples into individual classes (the "decision" attribute).
- A branch is created for each value of the test attribute and samples are partitioned accordingly.
- The algorithm advances recursively to form the decision tree for the sub-sample set at each partition. Once an attribute has been used, it is not considered in descendent nodes.
- The algorithm stops when all samples for a given node belong to the same class or when there are no remaining attributes (or some other stopping condition). A compact sketch of this loop follows the list.
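The referenced sketch (ours, in Python) of the induction loop operates on an ordinary list of sample records; the information gain function is passed in as a parameter (a P-tree-based version is sketched after Figure 2), and all names are illustrative.

from collections import Counter

def build_tree(samples, attributes, class_attr, info_gain):
    """samples: list of dicts; attributes: candidate attribute names;
    info_gain(samples, attr, class_attr) supplies the heuristic."""
    classes = [s[class_attr] for s in samples]
    if len(set(classes)) == 1:                        # all samples in one class
        return ("leaf", classes[0])
    if not attributes:                                # no remaining attributes
        return ("leaf", Counter(classes).most_common(1)[0][0])
    best = max(attributes, key=lambda a: info_gain(samples, a, class_attr))
    branches = {}
    for value in set(s[best] for s in samples):       # one branch per value
        subset = [s for s in samples if s[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, remaining, class_attr, info_gain)
    return ("node", best, branches)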
The attribute selected at each decision tree level is the one with the highest information gain. The
information gain of an attribute is computed by using the following algorithm.
Assume B[0] is the class attribute; the others are non-class attributes. We store the decision path for each node. For example, in the decision tree below (Figure 2), the decision path for node N09 is "Band2, value 0011, Band3, value 1000". We use RC to denote the root count of a P-tree. Given node N's decision path B[1], V[1], B[2], V[2], …, B[t], V[t], let P = PB[1],V[1] ^ PB[2],V[2] ^ … ^ PB[t],V[t]. We can calculate node N's information I(P) as

I(P) = - Σ(i=1..n) pi * log2(pi),   where pi = RC(P ^ PB[0],V0[i]) / RC(P)

and V0[1]..V0[n] are the possible B[0] values if classified by B[0] at node N. If N is the root node, then P is the full P-tree (its root count is the total number of transactions).
Now if we want to evaluate the information gain of attribute A at node N, we can use the formula Gain(A) = I(P) - E(A), where the entropy

E(A) = Σ(i=1..n) I(P ^ PA,VA[i]) * RC(P ^ PA,VA[i]) / RC(P)

and VA[1]..VA[n] are the possible A values if classified by attribute A at node N.
Figure 2. A decision tree example (the root node N01 tests attribute B2, node N03 tests B3, and the leaf nodes are labeled with B1 class values).
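The following sketch (ours) expresses the formulas above in code, reusing the bit-mask stand-ins and the root_count helper from the earlier sketch: value_ptrees[B][v] is assumed to be the mask (value P-tree) of band B and value v, and path_mask plays the role of P for the current decision path.

from math import log2

def info(path_mask, value_ptrees, class_band):
    """I(P) for the node whose decision path is represented by path_mask."""
    total = root_count(path_mask)
    if total == 0:
        return 0.0
    ent = 0.0
    for class_mask in value_ptrees[class_band].values():
        p = root_count(path_mask & class_mask) / total      # RC(P ^ P_B[0],v) / RC(P)
        if p > 0:
            ent -= p * log2(p)
    return ent

def gain(path_mask, attr_band, value_ptrees, class_band):
    """Gain(A) = I(P) - E(A), accumulating E(A) over the values of A."""
    total = root_count(path_mask)
    e = 0.0
    for attr_mask in value_ptrees[attr_band].values():
        sub = path_mask & attr_mask                          # P ^ P_A,a
        rc = root_count(sub)
        if rc > 0:
            e += info(sub, value_ptrees, class_band) * rc / total
    return info(path_mask, value_ptrees, class_band) - e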
3.3 Example
In this example the data is a remotely sensed image (e.g., satellite image or aerial photo) of an
agricultural field and the soil moisture levels for the field, measured at the same time. We use the whole
data set for mining so as to get the best accuracy we can. These data are divided to make up the learning and test data sets. The goal is to classify the data using soil moisture as the class label attribute and then to
use the resulting classifier to predict the soil moisture levels for future time (e.g., to determine capacity to
buffer flooding or to schedule crop planting).
Branches are created for each value of the selected attribute and subsets are partitioned accordingly.
The following training set contains 4 bands of 4-bit data values (expressed in decimal and binary). B1
stands for soil-moisture. B2, B3, and B4 stand for the channel 3, 4, and 5 of AVHRR, respectively.
FIELD COORDS | CLASS LABEL | REMOTELY SENSED REFLECTANCES
  X , Y      |     B1      |   B2  |  B3  |  B4
  0 , 0      |      3      |    7  |   8  |  11
  0 , 1      |      3      |    3  |   8  |  15
  0 , 2      |      7      |    3  |   4  |  11
  0 , 3      |      7      |    2  |   5  |  11
  1 , 0      |      3      |    7  |   8  |  11
  1 , 1      |      3      |    3  |   8  |  11
  1 , 2      |      7      |    3  |   4  |  11
  1 , 3      |      7      |    2  |   5  |  11
  2 , 0      |      2      |   11  |   8  |  15
  2 , 1      |      2      |   11  |   8  |  15
  2 , 2      |     10      |   10  |   4  |  11
  2 , 3      |     15      |   10  |   4  |  11
  3 , 0      |      2      |   11  |   8  |  15
  3 , 1      |     10      |   11  |   8  |  15
  3 , 2      |     15      |   10  |   4  |  11
  3 , 3      |     15      |   10  |   4  |  11

FIELD COORDS | CLASS LABEL | REMOTELY SENSED REFLECTANCES
  X , Y      |     B1      |   B2  |  B3  |  B4
  0 , 0      |    0011     |  0111 | 1000 | 1011
  0 , 1      |    0011     |  0011 | 1000 | 1111
  0 , 2      |    0111     |  0011 | 0100 | 1011
  0 , 3      |    0111     |  0010 | 0101 | 1011
  1 , 0      |    0011     |  0111 | 1000 | 1011
  1 , 1      |    0011     |  0011 | 1000 | 1011
  1 , 2      |    0111     |  0011 | 0100 | 1011
  1 , 3      |    0111     |  0010 | 0101 | 1011
  2 , 0      |    0010     |  1011 | 1000 | 1111
  2 , 1      |    0010     |  1011 | 1000 | 1111
  2 , 2      |    1010     |  1010 | 0100 | 1011
  2 , 3      |    1111     |  1010 | 0100 | 1011
  3 , 0      |    0010     |  1011 | 1000 | 1111
  3 , 1      |    1010     |  1011 | 1000 | 1111
  3 , 2      |    1111     |  1010 | 0100 | 1011
  3 , 3      |    1111     |  1010 | 0100 | 1011
This learning dataset is converted to bSQ format. We display the bSQ bit-bands values in their spatial
positions, rather than displaying them in 1-column files. The Band-1 bit-bands are:
B11      B12      B13      B14
0000     0011     1111     1111
0000     0011     1111     1111
0011     0001     1111     0001
0111     0011     1111     0011
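A small sketch (ours) of this conversion: each band value is split into its bit positions, and bit j of every pixel goes into bit-band Bj. The function name is illustrative.

def to_bsq(band_values, bits=4):
    """band_values: 2-D list of integers; returns bit_bands where
    bit_bands[j][x][y] is bit j+1 (most significant first) of band_values[x][y]."""
    rows, cols = len(band_values), len(band_values[0])
    return [[[(band_values[x][y] >> (bits - 1 - j)) & 1 for y in range(cols)]
             for x in range(rows)]
            for j in range(bits)]

# B1 of the example (decimal values, rows indexed by x); its first bit-band
# matches the B11 pattern shown above (0000 / 0000 / 0011 / 0111).
B1 = [[3, 3, 7, 7], [3, 3, 7, 7], [2, 2, 10, 15], [2, 10, 15, 15]]
assert to_bsq(B1)[0] == [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]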
Thus, the Band-1 basic P-trees are as follows (tree pointers are omitted):

P1,1:  root count 5;   quadrant counts 0 0 1 4;  mixed-quadrant leaf 0001
P1,2:  root count 7;   quadrant counts 0 4 0 3;  mixed-quadrant leaf 0111
P1,3:  root count 16 (pure-1)
P1,4:  root count 11;  quadrant counts 4 4 0 3;  mixed-quadrant leaf 0111
We can use the AND and COMPLEMENT operations to calculate all the value P-trees of Band-1 as below (e.g., P1,0011 = P1,1' AND P1,2' AND P1,3 AND P1,4):

P1,0000 = 0
P1,0001 = 0
P1,0010:  root count 3;  quadrant counts 0 0 3 0;  leaf 1110
P1,0011:  root count 4;  quadrant counts 4 0 0 0
P1,0100 = 0
P1,0101 = 0
P1,0110 = 0
P1,0111:  root count 4;  quadrant counts 0 4 0 0
P1,1000 = 0
P1,1001 = 0
P1,1010:  root count 2;  quadrant counts 0 0 1 1;  leaves 0001 and 1000
P1,1011 = 0
P1,1100 = 0
P1,1101 = 0
P1,1110 = 0
P1,1111:  root count 3;  quadrant counts 0 0 0 3;  leaf 0111
Then we generate basic P-trees and value P-trees similarly for B2, B3 and B4.
The basic algorithm for inducing a decision tree from the learning set is as follows.
1. The tree starts as a single node representing the set of training samples, S.
2. If all samples are in the same class (same B1-value), S becomes a leaf with that class label.
3. Otherwise, use the entropy-based measure, information gain, as the heuristic for selecting the attribute which best separates the samples into individual classes (the test or decision attribute).
Start with A = B2 to classify S into {A1..Av}, where Aj = { t | t(B2) = aj } and aj ranges over those B2 values, v', such that the root count of P2,v' is non-zero. The symbol sij counts the number of samples of class Ci in subset Aj, that is, the root count of P1,v AND P2,v', where v ranges over those B1 values such that the root count of P1,v is non-zero. Thus, the sij (the probability that a sample belongs to Ci) are as follows (i = row and j = column).
       B2=0010  B2=0011  B2=0111  B2=1010  B2=1011
C1        0        0        0        0        3
C2        0        2        2        0        0
C3        2        2        0        0        0
C4        0        0        0        1        1
C5        0        0        0        3        0
Let attribute A have v distinct values, {a1..av}. A could be used to classify S into {S1..Sv}, where Sj is the set of samples having value aj. Let sij be the number of samples of class Ci in a subset Sj. The expected information needed to classify the sample is

I(s1..sm) = - Σ(i=1..m) pi * log2(pi)

(here m = 5, si = 3, 4, 4, 2, 3, and pi = si/s = 3/16, 1/4, 1/4, 1/8, 3/16, respectively). Thus,

I = - ( 3/16*log2(3/16) + 4/16*log2(4/16) + 4/16*log2(4/16) + 2/16*log2(2/16) + 3/16*log2(3/16) ) = 2.281

The entropy based on the partition into subsets by B2 is

E(A) = Σ(j=1..v) ((s1j + … + smj)/s) * I(s1j..smj),   where I(s1j..smj) = - Σ(i=1..m) pij * log2(pij) and pij = sij / |Aj|.
We finish the computations leading to the gain for B2 in detail.
                             B2=0010  B2=0011  B2=0111  B2=1010  B2=1011
|Aj| (divisors)                 2        4        2        4        4
p1j                             0        0        0        0       .75
p2j                             0       .5        1        0        0
p3j                             1       .5        0        0        0
p4j                             0        0        0       .25      .25
p5j                             0        0        0       .75       0
I(s1j..s5j)                     0        1        0       .811     .811
(s1j+..+s5j)*I(s1j..s5j)/16     0       .25       0       .203     .203

E(B2) = .656     I(s1..sm) = 2.281     Gain(B2) = I(s1..sm) - E(B2) = 1.625

Likewise, the gains of B3 and B4 are computed: Gain(B3) = 1.084 and Gain(B4) = 0.568.
Thus, B2 is selected as the first level decision attribute.
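As a check of the numbers above, the following short script (ours) recomputes I(s1..sm), E(B2) and Gain(B2) directly from the 16 training tuples (B1 is the class label); it prints 2.281, 0.656 and 1.625.

from math import log2
from collections import Counter

# (B1, B2) pairs of the 16 training samples, in decimal, in raster order
samples = [(3, 7), (3, 3), (7, 3), (7, 2), (3, 7), (3, 3), (7, 3), (7, 2),
           (2, 11), (2, 11), (10, 10), (15, 10), (2, 11), (10, 11), (15, 10), (15, 10)]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

I = entropy([b1 for b1, _ in samples])                         # I(s1..sm)
E = sum(entropy([b1 for b1, b2 in samples if b2 == v]) *
        sum(1 for _, b2 in samples if b2 == v) / len(samples)
        for v in set(b2 for _, b2 in samples))                 # E(B2)
print(round(I, 3), round(E, 3), round(I - E, 3))               # 2.281 0.656 1.625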
4. Branches are created for each value of B2 and samples are partitioned accordingly (if a partition is empty, generate a leaf and label it with the most common class, C2, labeled with 0011).
B2=0010 → Sample_Set_1
B2=0011 → Sample_Set_2
B2=0111 → Sample_Set_3
B2=1010 → Sample_Set_4
B2=1011 → Sample_Set_5
Sample_Set_1
X-Y  B1   B3   B4
0,3  0111 0101 1011
1,3  0111 0101 1011

Sample_Set_2
X-Y  B1   B3   B4
0,1  0011 1000 1111
0,2  0111 0100 1011
1,1  0011 1000 1011
1,2  0111 0100 1011

Sample_Set_3
X-Y  B1   B3   B4
0,0  0011 1000 1011
1,0  0011 1000 1011

Sample_Set_4
X-Y  B1   B3   B4
2,2  1010 0100 1011
2,3  1111 0100 1011
3,2  1111 0100 1011
3,3  1111 0100 1011

Sample_Set_5
X-Y  B1   B3   B4
2,0  0010 1000 1111
2,1  0010 1000 1111
3,0  0010 1000 1111
3,1  1010 1000 1111
5. The algorithm advances recursively to form a decision tree for the samples at each partition. Once an
attribute is the decision attribute at a node, it is not considered further.
6. The algorithm stops when:
a. all samples for a given node belong to the same class, or
b. there are no remaining attributes (label the leaf with the majority class among the samples).
All samples belong to the same class for Sample_Set_1 and Sample_Set_3. Thus, the decision tree is:
B2=0010 → B1=0111
B2=0011 → Sample_Set_2
B2=0111 → B1=0011
B2=1010 → Sample_Set_4
B2=1011 → Sample_Set_5
Advancing the algorithm recursively, it is unnecessary to rescan the learning set to form these sub-sample sets, since the P-trees for those samples have already been computed.
For Sample_Set_2, we compute all P2,0011 ^ P1,v ^ P3,w to calculate Gain(B3), and we compute P2,0011 ^ P1,v ^ P4,w to calculate Gain(B4). The results are Gain(B3) = 1.000 and Gain(B4) = 0.179.
Thus, B3 is the decision attribute at this level in the decision tree.
For Sample_Set_4 (label B2=1010), the P-trees

P1010,1010,, :  root count 1;  quadrant counts 0 0 0 1;  leaf 1000
P1111,1010,, :  root count 3;  quadrant counts 0 0 0 3;  leaf 0111
P,1010,0100, :  root count 4;  quadrant counts 0 0 0 4
P,1010,,1011 :  root count 4;  quadrant counts 0 0 0 4

tell us that there is only one B3 value and one B4 value in the sample set. Therefore we can tell there will be no gain by using either B3 or B4 at this level in the tree. We can come to that conclusion without producing and scanning Sample_Set_4.
The same conclusion can be reached regarding Sample_Set_5 (label B2=1011) simply by examining

P0010,1011,, :  root count 3;  quadrant counts 0 0 3 0;  leaf 1110
P1010,1011,, :  root count 1;  quadrant counts 0 0 1 0;  leaf 0001
P,1011,1000, :  root count 4;  quadrant counts 0 0 4 0
P,1011,,1111 :  root count 4;  quadrant counts 0 0 4 0
Here, we are using an additional stopping condition at step 3, namely: any candidate decision attribute which has a single value over the entire sample need not be considered, since its information gain will be zero. If all candidate decision attributes are of this type, the algorithm stops for this sub-sample.
Thus, we use the majority class label for Sample_Sets 4 and 5:
B2=0010 → B1=0111
B2=0011 → Sample_Set_2
B2=0111 → B1=0011
B2=1010 → B1=1111
B2=1011 → B1=0010
The best test attribute for Sample_Set_2 (of B3 or B4) is found in the same way. The result is that B3 maximizes the information gain. Therefore, the decision tree becomes:
B2=0010 → B1=0111
B2=0011 →
    B3=0100 → Sample_Set_2.1
    B3=1000 → Sample_Set_2.2
B2=0111 → B1=0011
B2=1010 → B1=1111
B2=1011 → B1=0010

Sample_Set_2.1
X-Y  B1   B3   B4
0,2  0111 0100 1011
1,2  0111 0100 1011

Sample_Set_2.2
X-Y  B1   B3   B4
0,1  0011 1000 1111
1,1  0011 1000 1011
In both cases, there is one class label in the set. Therefore the algorithm terminates with the decision tree:
B2=0010 → B1=0111
B2=0011 → B3=0100 → B1=0111
          B3=1000 → B1=0011
B2=0111 → B1=0011
B2=1010 → B1=1111
B2=1011 → B1=0010
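For illustration, the final decision tree above can be applied to a new pixel as follows (a small sketch of ours; the B2/B3 values are the 4-bit strings used in the example).

def classify(b2, b3):
    """Predict the B1 class label of a pixel from its B2 and B3 values."""
    first_level = {"0010": "0111", "0111": "0011", "1010": "1111", "1011": "0010"}
    if b2 in first_level:
        return first_level[b2]                     # leaf reached on B2 alone
    if b2 == "0011":                               # second-level test on B3
        return {"0100": "0111", "1000": "0011"}.get(b3)
    return None                                    # B2 value not seen in training

assert classify("0011", "1000") == "0011"          # e.g. pixel (0,1) of the training set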
4. Performance Analysis
Often prediction accuracy is used as a basis of comparison for different classification methods.
However, for stream data mining, speed is a significant issue. In this paper, we use ID3 with the P-tree
data structure to improve the speed of the algorithm. The important performance issue in this paper is
computation speed relative to ID3.
In our method, we only build and store basic P-trees. All the AND operations are performed on the fly
and only the corresponding root counts are needed. Our experimental results show that the larger the data size, the more significant the speed improvement (see Figure 3).
The reason is clear if we analyze the cost. First, let us look at the cost of calculating information gain each time. In ID3, to test whether all the samples are in the same class, one scan over the entire sample set is needed, while using P-trees we only need to calculate the root counts of the AND of some P-trees. These AND operations can also be performed in parallel. Figure 4 shows our experimental results comparing the cost of scanning the entire data set with the cost of all the P-tree ANDings.
Second, using P-trees, the creation of sub-sample sets is not necessary. If B is a candidate for the current decision attribute with kB basic P-trees, we need only AND the P-trees of the class label defining the sub-sample set with each of the kB basic P-trees. If, for example, the P-tree of the current sample set is P2,0100 ^ P3,0001 and the current attribute is B1 (with, say, 2-bit values), then P2,0100 ^ P3,0001 ^ P1,00, P2,0100 ^ P3,0001 ^ P1,01, P2,0100 ^ P3,0001 ^ P1,10 and P2,0100 ^ P3,0001 ^ P1,11 identify the partition of the current sample set. To generate the sij, only P-tree ANDings are required.
Figure 3. Classification cost (total cost, ID3 vs. P-tree) with respect to the dataset size (MB).

Figure 4. Cost comparison between the scan in ID3 and the P-tree ANDings: cost in ms (base-2 log) with respect to the dataset size (KB).
5. Related Work
Concepts related to the P-tree data structure include quadtrees [8, 9, 10] and their variants (such as point quadtrees [7] and region quadtrees [8]), and HH-codes [12].
Quadtrees decompose the universe by means of iso-oriented hyperplanes. These partitions do not
have to be of equal size, although that is often the case. For example, for two-dimensional quadtrees, each
interior node has four descendants, each corresponding to a NW/NE/SW/SE quadrant. The decomposition
into subspaces is usually continued until the number of objects in each partition is below a given threshold.
Quadtrees have many variants, such as point quadtrees and region quadtrees [8].
HH-codes, or Helical Hyperspatial Codes, are binary representations of the Riemannian diagonal. The binary division of the diagonal forms the node point from which eight sub-cubes are formed. Each sub-cube has its own diagonal, generating new sub-cubes. These cubes are formed by interlacing one-dimensional values encoded as HH bit codes. When sorted, they cluster in groups along the diagonal. The clustering gives fast (binary) searches along the diagonal. The clusters are ordered in a helical pattern, which is why they are called "Helical Hyperspatial". They are sorted in a Z-ordered fashion.
The similarities among P-trees, quadtrees and HH-codes are that they are all quadrant based. The difference is that P-trees focus on the count. P-trees are not indexes; rather, they are representations of the datasets themselves. P-trees are particularly useful for data mining because they contain the aggregate information needed for data mining.
6. Conclusion
In this paper, we propose a new approach to decision tree induction that is especially useful for the classification of streaming spatial data. We use the bit Sequential (bSQ) data organization and a lossless, data-mining-ready data structure, the Peano Count tree (P-tree), to represent the information
needed for classification in an efficient and ready-to-use form. The rich and efficient P-tree storage
structure and fast P-tree algebra facilitate the development of a fast decision tree induction classifier. The
P-tree decision tree induction classifier is shown to improve classifier development time significantly. This
makes classification of open-ended streaming datasets feasible in near real time.
References
[1] J. R. Quinlan and R. L. Rivest, "Inferring decision trees using the minimum description length principle", Information and Computation, 80, 227-248, 1989.
[2] Quinlan, J. R., “C4.5: Programs for Machine Learning”, Morgan Kaufmann, 1993.
[3] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. “An interval classifier for database
mining applications”, VLDB 1992.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees", Wadsworth, Belmont, 1984.
[5] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining", VLDB 1996.
[6] S. M. Weiss and C. A. Kulikowski, "Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems", Morgan Kaufmann, 1991.
[7] Volker Gaede and Oliver Gunther, "Multidimensional Access Methods", Computing Surveys, 30(2),
1998.
[8] H. Samet, “The quadtree and related hierarchical data structure”. ACM Computing Survey, 16, 2,
1984.
[9] H. Samet, “Applications of Spatial Data Structures”, Addison-Wesley, Reading, Mass., 1990.
[10] H. Samet, “The Design and Analysis of Spatial Data Structures”, Addison-Wesley, Reading, Mass.,
1990.
[11] R. A. Finkel and J. L. Bentley, "Quad trees: A data structure for retrieval on composite keys", Acta Informatica, 4, 1, 1974.
[12] http://www.statkart.no/nlhdb/iveher/hhtext.htm
[13] Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.
[14] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, “Machine Learning, Neural and Statistical
Classification”, Ellis Horwood, 1994.
[15] P. Domingos and G. Hulten, "Mining high-speed data streams", Proceedings of ACM SIGKDD 2000.
[16] P. Domingos and G. Hulten, "Catching Up with the Data: Research Issues in Mining Data Streams", DMKD 2001.
[17] W. Perrizo, Q. Ding, Q. Ding, and A. Roy, "On Mining Satellite and Other Remotely Sensed Images", DMKD 2001.
[18] William Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.