Bayesian Classification on Spatial Data Streams Using P-Trees 1,2
Mohammad Hossain, Amal Shehan Perera and William Perrizo
Computer Science Department, North Dakota State University
Fargo, ND 58105, USA
{Mohammad_Hossain, Amal_Perera, William_Perrizo}@ndsu.nodak.edu
Abstract:
Classification of spatial data can be difficult with existing methods due to the large numbers and
sizes of spatial data sets. The task becomes even more difficult when we consider continuous
spatial data streams. Data streams require a classifier that can be built and rebuilt repeatedly in
near real time. This paper presents an approach to deal with this challenge. Our approach uses
the Peano Count Tree (P-tree), which provides a lossless, compressed and data-mining-ready
representation (data structure) for spatial data. This data structure can be applied in many
classification techniques.
In this paper we focus on Bayesian classification.
A Bayesian classifier is a statistical classifier that uses Bayes' theorem to predict class membership as a conditional probability that a given data sample falls into a particular class. In this paper we
demonstrate how P-trees can improve the classification of spatial data when using a Bayesian
classifier. We also introduce the use of information gain calculations with Bayesian classification
to improve its accuracy. The use of a P-tree based Bayesian classifier can not only make
classification more effective on spatial data, but also can reduce the build time of the classifier
considerably. This improvement in build time makes it feasible for use with streaming data.
Keywords: Data Mining, Bayesian Classification, P-tree, Spatial Data
1 Patents are pending on the bSQ and P-tree technology.
2 This work is partially supported by NSF Grant OSR-9553368, DARPA Grant DAAH04-96-1-0329 and GSA Grant ACT#: K96130308.
1. Introduction:
Classification is an important data mining technique which predicts the class of a given data
sample. Consider a relation R(k, A1, …, An, C), where k is the key of the relation R and A1, …, An, C are attributes, among which C is the class label attribute [1]. Given an unclassified
data sample (having a value for all attributes except C), a classification technique will predict the
C-value for the given sample and thus determine its class. In the case of spatial data, the key, k
in R, usually represents some location (or pixel) over a space and each Ai is a descriptive
attribute of the locations. A typical example of such spatial data is an image of the earth
collected as a satellite image or aerial photograph.
The attributes may be different reflectance
bands such as red, green, blue, infra-red, near infra-red, thermal infra-red, etc. [2]. The attributes may also include ground measurements such as yield, soil type, zoning category, weather attributes, etc. A classifier may predict the yield value from different reflectance band values
extracted from a satellite image.
Many classification techniques have been proposed by statisticians and machine learning
experts [3]. They include decision trees, neural networks, Bayesian classification, and k-nearest neighbor classification. Bayesian classification is a statistical classifier based on Bayes' theorem [4], as follows. Let X be a data sample whose class label is unknown, and let H be a hypothesis (i.e., X belongs to class C). P(H|X) is the posterior probability of H given X, and P(H) is the prior probability of H. Then

P(H|X) = P(X|H)P(H) / P(X)

where P(X|H) is the posterior probability of X given H and P(X) is the prior probability of X.
Bayesian classification uses this theorem in the following way. Each data sample is represented by a feature vector, X = (x1, ..., xn), depicting the measurements made on the sample for attributes A1, ..., An, respectively. Given classes C1, ..., Cm, the Bayesian classifier predicts that an unknown data sample X (with no class label) belongs to the class Cj having the highest posterior probability conditioned on X:

P(Cj|X) > P(Ci|X) for all i != j    (1)

Since P(X) is constant for all classes, it suffices to maximize P(X|Cj)P(Cj).
Bayesian classification can be naïve or based on Bayesian belief networks. Naive Bayesian classification makes the assumption of 'class conditional independence of values' to reduce the computational complexity of calculating all the P(X|Ci)'s: it assumes that the value of an attribute is independent of that of all others. Thus,

P(X|Ci) = P(x1|Ci) * ... * P(xn|Ci)    (2)

For categorical attributes, P(xk|Ci) = si,xk / si, where si is the number of samples in class Ci and si,xk is the number of training samples of class Ci having Ak-value xk.
A Bayesian belief network is a graphical model [5]. It represents the dependencies among subsets of attributes in a directed acyclic graph, together with a table called the conditional probability table (CPT). From the CPT, the probabilities needed to estimate P(X|C) are obtained.
Both Bayesian classification techniques have drawbacks. Naive Bayesian classification depends on the assumption of class conditional independence, which is not satisfied in all cases. For a belief network, it is computationally very complex to build the network and to create the CPT [4,5]. Calculating the probabilities in traditional ways is also time consuming because the whole database must be scanned again and again, which often makes these methods computationally expensive on large spatial data sets and on streaming spatial data.
Our approach is to use the P-tree data structure to reduce the computational expense of classifying a new sample. P-trees give us a simple solution to the problem of repeatedly scanning
the database. Using a simple P-tree algebra, we can determine the number of occurrences of a
sample and the number of co-occurrences of a sample and a class label, providing the necessary
components of the probability calculation.
2. Bayesian Classification Using P-tree:
In the case of spatial data, the Ai's and C in the relation R(k, A1,...,An, C) can be viewed as different bands (denoted by B1,...,Bn in this paper). For example, in a remotely sensed image they may be the red, green and blue reflectance values of the pixel identified by k. These bands are converted to P-trees and stored in that form.
Most spatial data comes in a format called BSQ for Band Sequential (or can be easily converted
to BSQ). BSQ data has a separate file for each attribute or band. The ordering of the data values
within a band is raster ordering with respect to the spatial area represented in the dataset. This
ordering is assumed and therefore is not explicitly indicated as a key attribute in each band (bands
have just one column). In this paper, we divided each BSQ band into several files, one for each
bit position of the data values. We call this format bit Sequential or bSQ. A Landsat Thematic
Mapper satellite image, for example, is in BSQ format with 7 bands, B1,…,B7, (Landsat-7 has 8)
and ~40,000,000 8-bit data values. In this case, the bSQ format will consist of 56 separate files,
B11,…,B78, each containing ~40,000,000 bits. A typical aerial digital photograph stored as a TIFF image is in what is called Band Interleaved by Bit (BIP) format, in which there is one file containing ~24,000,000 bits ordered by bit position, then band, and then raster-ordered pixel location. A simple transform can be used to convert TIFF images to bSQ format.
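As a concrete illustration of the bSQ idea, the following minimal Python sketch splits one BSQ band, assumed to be available as a square 2-D numpy array of 8-bit values, into eight bit planes (one per bit position). The function name and array layout are our own assumptions for illustration, not part of the original system.

import numpy as np

def band_to_bsq(band: np.ndarray, n_bits: int = 8):
    # Split one BSQ band (2-D array of unsigned ints) into bit planes,
    # ordered from the most significant bit to the least significant bit.
    planes = []
    for bit in range(n_bits - 1, -1, -1):
        planes.append(((band >> bit) & 1).astype(np.uint8))
    return planes

# Example: a tiny 4x4 "band" of 8-bit reflectance values; each of the eight
# returned bit planes would be written to its own bSQ file.
band = np.array([[200, 201, 64, 64],
                 [199, 200, 65, 63],
                 [ 12,  13, 64, 64],
                 [ 11,  12, 65, 63]], dtype=np.uint8)
bsq_planes = band_to_bsq(band)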
We organize each bSQ bit file, Bij, into a tree structure called a Peano Count Tree (P-tree). A P-tree is a quadrant-based tree. The root of a P-tree contains the 1-bit count of the entire bit-band. The next level of the tree contains the 1-bit counts of the four quadrants in raster order. At the next level, each quadrant is partitioned into sub-quadrants and their 1-bit counts, in raster order, constitute the children of the quadrant node. This construction is continued recursively down each tree path until the sub-quadrant is pure (entirely 1-bits or entirely 0-bits), which may or may not be at the leaf level (the 1-by-1 sub-quadrant level). P-trees are related in various ways to many other data structures in the literature (see the appendix for details).
For example, the P-tree for an 8-row-8-column bit-band is shown below.
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

                     55                          (level 3 / depth 0)
        ____________/ / \ \____________
       /             /     \           \
     16           __8__    _15__        16       (level 2 / depth 1)
                 / / | \   / | \ \
                3 0  4  1  4 4 3 4               (level 1 / depth 2)
              //|\    //|\     //|\
              1110    0010     1101              (level 0 / depth 3)
In this example, 55 is the count of 1's in the entire image, and the numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits, we do not need sub-trees for these two quadrants. This pattern is continued recursively. Recursive raster ordering is called the Peano or Z-ordering in the literature, hence the name Peano Count Tree. The recursive process terminates at the "leaf" level (level 0), where each quadrant is a 1-row-1-column quadrant. If we were to expand all sub-trees, including those for the pure quadrants, the leaf sequence would be just the Peano space-filling curve traversal of the original raster image.
A P-tree is a lossless data structure that maintains the raster spatial order of the original data. There are many variations of P-trees. P-trees built from bSQ files are called basic P-trees; the symbol Pi,j denotes a P-tree built from band i and bit j. The function COMP(Pi,j) gives the complement of the P-tree Pi,j. We can build a value P-tree Pi,v, denoting band i and value v, which gives us the P-tree for a specific value v in the original band data. The tuple P-tree Px1..xn denotes the P-tree of the tuple (x1, ..., xn). The function RootCount gives the number of occurrences of a particular pattern in a P-tree: for a basic P-tree it gives the number of 1's in the bit-band, and for a value or tuple P-tree it gives the number of occurrences of that value or tuple. A detailed description of P-trees and the related operations is included in the appendix.
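To make the structure concrete, the following is a minimal, count-only Python sketch of a Peano Count Tree with AND, COMPLEMENT and RootCount, built from a bit plane such as those produced by the bSQ sketch above. The class name, the numpy-based construction and the pure-quadrant handling are illustrative assumptions on our part; the authors' implementation also maintains PM-tree masks and supports variable fan-out.

import numpy as np

class PTree:
    # Simplified Peano Count Tree over a 2^k x 2^k bit array.
    # Each node stores the 1-bit count of its quadrant; pure quadrants
    # (all 0s or all 1s) are not expanded further.
    def __init__(self, size, count, children=None):
        self.size = size            # number of cells covered by this node
        self.count = count          # number of 1-bits in the quadrant
        self.children = children    # four sub-quadrants in Peano order, or None if pure

    @classmethod
    def from_bits(cls, bits):
        count = int(bits.sum())
        if count == 0 or count == bits.size or bits.shape[0] == 1:
            return cls(bits.size, count)                      # pure (or 1x1) quadrant
        h = bits.shape[0] // 2                                # split into 4 quadrants
        quads = [bits[:h, :h], bits[:h, h:], bits[h:, :h], bits[h:, h:]]
        return cls(bits.size, count, [cls.from_bits(q) for q in quads])

    def _child(self, i):
        if self.children is not None:
            return self.children[i]
        # pure node: every sub-quadrant is pure with a proportional count
        return PTree(self.size // 4, self.count // 4)

    def AND(self, other):
        if self.count == 0 or other.count == 0:               # pure-0 dominates
            return PTree(self.size, 0)
        if self.count == self.size:                           # pure-1: result is the other tree
            return other
        if other.count == other.size:
            return self
        kids = [self._child(i).AND(other._child(i)) for i in range(4)]
        cnt = sum(k.count for k in kids)
        return PTree(self.size, cnt, None if cnt in (0, self.size) else kids)

    def complement(self):
        kids = [c.complement() for c in self.children] if self.children else None
        return PTree(self.size, self.size - self.count, kids)

    def root_count(self):
        return self.count

# Usage on the 8x8 example bit-band from the text (root count 55):
bits = np.array([[1,1,1,1,1,1,0,0],
                 [1,1,1,1,1,0,0,0],
                 [1,1,1,1,1,1,0,0],
                 [1,1,1,1,1,1,1,0],
                 [1,1,1,1,1,1,1,1],
                 [1,1,1,1,1,1,1,1],
                 [1,1,1,1,1,1,1,1],
                 [0,1,1,1,1,1,1,1]])
p = PTree.from_bits(bits)
print(p.root_count())   # prints 55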
2.1 Calculating the probabilities using P-tree:
We showed earlier that to classify a tuple X = (x1, ..., xn) we need the values P(xk|Ci) = si,xk / si, where si is the number of samples in class Ci and si,xk is the number of training samples of class Ci having Ak-value xk. To find si,xk and si we need two value P-trees, Pk,xk (the value P-tree of band k, value xk) and PC,ci (the value P-tree of the class label band C, value ci). Then

si,xk = RootCount[(Pk,xk) AND (PC,ci)]    (3)
si = RootCount[PC,ci]
In this way we find the value of all probabilities in (2) and can use (1) to classify a new tuple, X. To improve the accuracy of the classifier, we can find the value in (2) directly by calculating the tuple P-tree Px1..xn (the tuple P-tree of values x1, ..., xn), which is simply

(P1,x1) AND ... AND (Pn,xn)

Now P(X|Ci) = si,x1..n / si, where

si,x1..n = RootCount[(Px1..xn) AND (PC,ci)]    (4)
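As a hedged illustration of equations (3) and (4), the helpers below compute si, si,xk and si,x1..n from a hypothetical dictionary value_ptrees mapping (band, value) pairs to value P-trees built with the PTree sketch given earlier. The function names and data layout are ours, for illustration only.

def class_count(value_ptrees, class_band, ci):
    # si = RootCount[P_{C,ci}]  (equation 3, second line)
    return value_ptrees[(class_band, ci)].root_count()

def cond_count(value_ptrees, band, xk, class_band, ci):
    # si,xk = RootCount[(P_{k,xk}) AND (P_{C,ci})]  (equation 3)
    return value_ptrees[(band, xk)].AND(value_ptrees[(class_band, ci)]).root_count()

def tuple_cond_count(value_ptrees, x, class_band, ci):
    # si,x1..n = RootCount[(P_{x1..xn}) AND (P_{C,ci})]  (equation 4)
    # x is a dict {band: value} for the attribute bands of the unknown sample.
    bands = list(x)
    p = value_ptrees[(bands[0], x[bands[0]])]
    for b in bands[1:]:
        p = p.AND(value_ptrees[(b, x[b])])          # tuple P-tree P_{x1..xn}
    return p.AND(value_ptrees[(class_band, ci)]).root_count()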
By doing this we are finding the probability of occurrence of the whole tuple in the data sample, so we do not need to worry about the inter-dependency between different bands (attributes). In this way we can find the value of P(X|Ci) without the naive assumption of class conditional independence, which improves the accuracy of the classifier. Thus we keep the simplicity of the naive Bayesian approach while obtaining higher accuracy.
One problem with this approach is that if the tuple X that we want to classify is not present in our data set, the root count of the tuple P-tree Px1..xn will be zero and the value of P(X|Ci) will be zero as well. In that case we would not be able to classify the tuple. To deal with this problem we introduce the information gain of the attributes into our classification.
2.2 Introduction of information gain.
Not all attributes carry the same amount of information useful for the classification task. For example, in the case of a bank loan, the attribute INCOME might carry more information than the attribute AGE of the client. This significance can be measured mathematically by calculating the information gain of an attribute.
2.2.1 Calculation of Information Gain:
In a data sample, let the class label attribute C have m different values or classes, Ci, i = 1..m, and let si be the number of samples in class Ci. The information needed to classify a given sample is [4]:

I(s1..sm) = -SUM(i=1..m)[ pi * log2(pi) ]

where pi = si/s is the probability that a sample belongs to Ci. Let attribute A have v distinct values, {a1, ..., av}. The entropy, or expected information based on the partitioning into subsets by A, is:

E(A) = SUM(j=1..v)[ (SUM(i=1..m)[sij] / s) * I(s1j..smj) ]

where sij is the number of samples of class Ci in subset j. The information gain is then:

Gain(A) = I(s1..sm) - E(A)

(the expected reduction of entropy caused by knowing the value of A).
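The short sketch below computes Gain(A) exactly as defined above, working on plain Python sequences of attribute values and class labels over the training samples (the same counts could equally be obtained from P-tree root counts). The function name is illustrative, not the authors' code.

import math
from collections import Counter

def information_gain(attr_values, class_labels):
    # Gain(A) = I(s1..sm) - E(A) for one attribute A.
    s = len(class_labels)

    def info(labels):
        # I(s1..sm) = -sum(p_i * log2(p_i)) over the class distribution of `labels`
        counts = Counter(labels)
        return -sum((c / len(labels)) * math.log2(c / len(labels))
                    for c in counts.values())

    i_all = info(class_labels)                       # I(s1..sm)
    e_a = 0.0
    for a in set(attr_values):                       # partition by the attribute value
        subset = [c for v, c in zip(attr_values, class_labels) if v == a]
        e_a += (len(subset) / s) * info(subset)      # E(A)
    return i_all - e_a                               # Gain(A)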
2.2.2 Use of information gain when the tuple P-tree RootCount = 0
Returning to the problem in our classification technique: when the root count of Px1..xn is zero, we can form a smaller tuple P-tree Px1..xi..xn over the bands i = 1..n with i != k, where band k has lower information gain than any other band. Then (2) becomes

P(X|Ci) = P(x1..xn excluding xk | Ci) * P(xk|Ci)    (5)

The first factor can be calculated by (4) without using band k, and P(xk|Ci) can be calculated by (3). If the root count of the reduced tuple P-tree is still zero, we remove the band with the second lowest information gain and proceed in the same way. The general form of (5) is

P(X|Ci) = P(x1..xn excluding the dropped bands | Ci) * PRODUCT(j)[ P(xj|Ci) ]    (6)

where the first factor is calculated from (4) with the maximum number of bands for which RootCount[Px1..xk..xn] != 0, and j ranges over the minimum number of dropped bands, namely those with the lowest information gain. Each P(xj|Ci) is calculated from (3).
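Putting (4), (5) and (6) together, the sketch below shows one plausible way to implement the fallback loop, reusing the hypothetical helpers from Section 2.1. The overall structure (dropping the lowest-gain band until a nonzero count appears) follows the text, but the function and its arguments are our own illustration, not the authors' code.

def classify(value_ptrees, sample, bands_by_gain, class_band, classes):
    # sample: dict {band: value} for the unknown tuple X
    # bands_by_gain: attribute bands sorted from lowest to highest information gain
    dropped = []                                     # bands moved to the naive product term
    kept = list(sample)                              # bands still in the tuple P-tree
    while kept:
        scores = {}
        for ci in classes:
            si = class_count(value_ptrees, class_band, ci)
            # P(x_kept | Ci) from the tuple P-tree, equation (4)
            p = tuple_cond_count(value_ptrees, {b: sample[b] for b in kept},
                                 class_band, ci) / si
            # times P(x_j | Ci) for each dropped band, equation (3)
            for b in dropped:
                p *= cond_count(value_ptrees, b, sample[b], class_band, ci) / si
            scores[ci] = p * si                      # proportional to P(X|Ci)P(Ci)
        if any(v > 0 for v in scores.values()):
            return max(scores, key=scores.get)
        # All class-conditioned counts were zero: drop the band with the
        # lowest information gain and try again (Section 2.2.2).
        drop = next(b for b in bands_by_gain if b in kept)
        kept.remove(drop)
        dropped.append(drop)
    return None                                      # unclassifiable sample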
2.3 Example:
Consider the training set, S, where B1 is the class label attribute.
S (each band is shown as a 4-row-4-column grid of 4-bit values in raster order):

B1 (class label):       B2:                     B3:                     B4:
0011 0011 0111 0111     0111 0011 0011 0010     1000 1000 0100 0101     1011 1111 1011 1011
0011 0011 0111 0111     0111 0011 0011 0010     1000 1000 0100 0101     1010 1011 1011 1011
0010 0010 1010 1111     1011 1011 1010 1010     1000 1000 0100 0100     1111 1111 1011 1011
0010 1010 1111 1111     1011 1011 1010 1010     1000 1000 0100 0100     1111 1111 1011 1011
So we have 5 distinct classes: C1 = 0010, C2 = 0011, C3 = 0111, C4 = 1010, C5 = 1111. The value P-trees (root count followed by the four quadrant counts) are:

Value P-trees for B1 (the class label band):
  P1,0010 (C1): 3  [0 0 3 0]
  P1,0011 (C2): 4  [4 0 0 0]
  P1,0111 (C3): 4  [0 4 0 0]
  P1,1010 (C4): 2  [0 0 1 1]
  P1,1111 (C5): 3  [0 0 0 3]

Value P-trees for B2:
  P2,0010: 2  [0 2 0 0]
  P2,0011: 4  [2 2 0 0]
  P2,0111: 2  [2 0 0 0]
  P2,1010: 4  [0 0 0 4]
  P2,1011: 4  [0 0 4 0]

Value P-trees for B3:
  P3,0100: 6  [0 2 0 4]
  P3,0101: 2  [0 2 0 0]
  P3,1000: 8  [4 0 4 0]

Value P-trees for B4:
  P4,1010: 1   [1 0 0 0]
  P4,1011: 10  [2 4 0 4]
  P4,1111: 5   [1 0 4 0]
Now consider an unknown sample X = (0011, 1000, 1011). The tuple P-tree PX2X3X4 = P0011,1000,1011 has root count 1, quadrant counts [1 0 0 0] and leaf 0001. Then:
RootCount(PX2X3X4 AND P1,C1) = 0
RootCount(PX2X3X4 AND P1,C2) = 1
RootCount(PX2X3X4 AND P1,C3) = 0
RootCount(PX2X3X4 AND P1,C4) = 0
RootCount(PX2X3X4 AND P1,C5) = 0
So the sample belongs to class C2
Now we examine with another sample:
X = 0011 1000 1010
In this case:
P0011 1000 1010 = 0
So we calculate the information gain of different bands.
G2 = 1.65, G3 = 1.59 and G4 = 1.31
We first remove the band B4 from the tuple P-tree and build P0011 1000 =
2
2 0 0 0
0110
RootCount(PX2X3 AND P1,C1) * RootCount(P4,X4 AND P1,C1) = 0
RootCount(PX2X3 AND P1,C2) * RootCount(P4,X4 AND P1,C2) = 1
RootCount(PX2X3 AND P1,C3) * RootCount(P4,X4 AND P1,C3) = 0
RootCount(PX2X3 AND P1,C4) * RootCount(P4,X4 AND P1,C4) = 0
RootCount(PX2X3 AND P1,C5) * RootCount(P4,X4 AND P1,C5) = 0
So the sample belongs to class C2
3. Experimental Results and Performance Analysis
In this section we present a comparative analysis of the proposed method. We first discuss the advantage of using P-trees in our approach and then show experimental results that indicate the comparative success rates of the proposed method. In the final subsection we discuss the qualities of this approach that make it suitable for the classification of data streams.
3.1 Performance of P-trees
The use of P-trees in any application depends on the algebra that we refer to as the P-tree algebra. The operations of the P-tree algebra are AND, OR, XOR and COMPLEMENT. In our application we used AND and COMPLEMENT; of these, AND is the most critical operation. It takes two P-trees as operands and gives a resultant P-tree which is equivalent to the P-tree built from the pixel-wise logical AND of the corresponding bit-bands. The performance of our application therefore depends on the performance of the AND operation. Studies show [6] that the AND operation on different bit positions of any band takes between 8 and 52 milliseconds in a distributed parallel system. Within a band, the most significant bit positions take less time than the least significant bit positions, an inherent characteristic of spatial image data [6]. Figure 1 shows the variation of the AND time over the different significant bits for a 1320x1320 TIFF image (1,742,400 pixels).
Figure 1 Performance of the P-tree AND operation
3.2 Probability calculations using P-trees
The advantage of using P-trees for the classification technique described in this paper is clearly evident when we consider the effort required to calculate the probability figures. If we try to do the same classification without P-trees, we have two options:
1. Scan the training data and obtain probability counts for each classification.
2. Store all the possible probability counts for the training data.
Assume we have n attributes, m bits per attribute and c classes. For the first approach above we need at least one scan of the training data for each classification; each scan goes through all the pixels in the training image, looking for the specific bit patterns and updating the counts. If we use P-trees, we instead need nm AND operations and, on average, m/2 COMPLEMENT operations on the basic P-trees to come up with the same count.
In the second approach, we scan the training data once and keep all the possible probability counts. As the number of bits and the number of attributes increase, the number of possible combinations grows rapidly. This requires a capable data structure to store the probability counts and a mechanism to retrieve them. The required number of counters is c(2^(mn) + 2^(m(n-1)) + ... + 2^m).
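To give a sense of scale with hypothetical values: for c = 4 classes, n = 3 attributes and m = 3 bits per attribute, this formula gives 4 * (2^9 + 2^6 + 2^3) = 2,336 counters, whereas the P-tree approach needs only nm = 9 basic P-trees and c = 4 value P-trees for the class label band.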
Figure 2. Increase in the space requirement (space in MB versus number of significant bits, for stored counts versus P-trees)
If we use P-trees, we need only the nm basic P-trees and the c value P-trees for the class label, and we can compute the required count for any pattern from them. The computation is sequential, and if we keep the intermediate values we can reuse them when one or more attributes (those with the least information gain) are removed during the classification process; since we know the order in which attributes are dropped, we can compute the counts in that order for the given pattern. Figure 2 compares the two techniques with respect to space; the values are computed for a 1320x1320 image with 4 attributes and 4 classification classes.
It has been shown in similar applications that the computational overhead incurred by P-trees scales well on large image data sets [6]. This makes the approach usable for data streams.
3.3 Classification Accuracy
To evaluate accuracy, we compared the classification accuracy of the proposed Bayesian classifier using information gain (BCIG) with two widely accepted classification algorithms: k-nearest neighbor (KNN) and naïve Bayesian classification (NBC). Actual aerial TIFF images with a synchronized yield band were used for the evaluation; the data is available on the SMILEY web site [http://rock.cs.ndsu.nodak.edu/smiley/]. The data set has 4 bands (red, green, blue and yield), each with 8 bits of information. Table 1 shows the respective success rates for each technique. The training sample size was set at 1024x1024. The objective of the classification is to generate approximate yield values for a given aerial TIFF image of a crop field; the data was classified into 4 yield classes using the RGB attribute values.
Comparison of performance for 4 classification classes (BCIG with threshold .45)

  Bits   KNN Succ   NvBC Succ   BCIG Succ   IG Use   IG Succ
   2       .12         .14         .48        .40      .36
   3       .16         .19         .50        .34      .38
   4       .21         .27         .51        .32      .37
   5       .37         .26         .52        .31      .40
   6       .46         .24         .51        .27      .41
   7       .43         .23         .51        .20      .40

Table 1
IG Use - proportion of the number of times information gain was used for classification.
IG Succ - proportion of the number of times the above was successful.
A classification is not accepted if the classification ratio (posterior probability) is not above the threshold value; in that case the classifier falls back to the new technique employing information gain. The algorithm only tries to use information gain when a positive classification of that particular data item is not possible with the available training sample. The threshold should be picked so that the use of information gain does not dominate the classification, since that could lead to an overall decline in classification accuracy. Experimental results showed that a threshold of 0.45 is appropriate.
Figure 3. Classification success rate comparison (success rate versus number of significant bits for KNN, NBC and BCIG)
Up to seven significant bits were used for the comparison (the last few least significant bits may contain random noise). Figure 3 shows that our solution achieves much better classification with less information than the other two techniques, which is a significant advantage when applying classification to data streams. The techniques display their peak classification accuracy at different numbers of bits. The relatively low absolute accuracy can be attributed to the nature of the data used for the classification. This is further shown in Figure 4, where around 25% of the test cases are unclassifiable with all three techniques. Figure 4 also shows that the proposed classifier performs better with less information than the other two techniques: at the peak classification level of 5 bits for our method, the number of pixels classifiable by the other techniques but missed by ours is relatively small compared to the corresponding values at the peak levels of the other techniques.
Figure 4. Classification correlation for the three techniques (proportion of classified pixels versus number of significant bits, for KNN, NBC and BCIG)
Table 2 shows the information gain values computed for the training data set. The information gain of the blue band is the lowest, so in the proposed algorithm it is the first attribute to be removed when classification is not possible using all three attributes. The difference in the levels of information gain indicates the importance of each attribute in the classification process; blue is the least important band, as would be expected for crop field image data. This further justifies our decision to use information gain to try to classify pixels that are not classifiable using all the available attributes.
          Information gain for bit:
  Band      2      3      4      5      6      7
  Green   .140   .164   .173   .175   .176   .176
  Red     .128   .149   .168   .174   .175   .176
  Blue    .026   .076   .088   .092   .093   .093

Table 2. Variation of information gain for different bands
All of the above results are for a single data image; experimental results showed similar outcomes for other TIFF images with RGB and synchronized yield data.
3.4 Performance in data stream applications.
A typical data stream mining algorithm should satisfy the following criteria [7,8]:
i. It must require a small constant time per record.
ii. It must use only a fixed amount of main memory.
iii. It must be able to build a model using at most one scan of the data.
iv. It must make a usable model available at any point in time.
v. It should produce a model that is equivalent to the one that would be obtained by the corresponding database-mining algorithm.
vi. When the data-generating phenomenon is changing over time, the model at any time should be up-to-date but also include the past information.
We now illustrate how our new approach of using a P-tree based Bayesian classifier meets the stringent design criteria above. For the first and second points, very little time is needed to build the basic P-trees from the data arriving on the data stream, and the size of the classifier is constant for a fixed image size; from the upper bound on the size of each P-tree we can determine the fixed amount of main memory required for the application. With respect to the third and fourth points, the P-trees are built in a single pass over the data, and the collection of P-trees created for an image is classification-ready at any point in time. Fifth, the P-tree model lets us build the classifier quickly and conveniently. Since it is a lossless representation of the original data, the classifier built using P-trees contains all the information of the original training data, and the classification is equivalent to any traditional classification technique; the P-trees simply allow us to calculate the required probability values accurately and efficiently for the Bayesian classification. Sixth, if the data-generating phenomenon changes, the classification process is not adversely affected: the newly built P-trees provide the adaptability required of the classifier, and past information is not lost as long as we keep the earlier P-trees. The choice of the training image window size over the data stream can introduce the historical aspects as well as the adaptability required of the classifier.
4. Conclusion:
In this paper we used the P-tree data structure efficiently in Bayesian classification and introduced a new variation of Bayesian classification based on information gain. Our new method thus has the simplicity of naïve Bayesian classification as well as the accuracy of other classifiers. We have also shown that our method is well suited to data stream mining, a newer concept compared with traditional database mining. P-trees play an important role in achieving these advantages by providing a way to calculate the different probabilities easily, quickly and accurately. P-tree technology can also be used efficiently in other data mining techniques such as association rule mining, k-nearest neighbor classification and decision tree classification.
5. References:
[1] W. Perrizo, "Peano Count Tree Lab Notes", Technical Report NDSU-CSOR-TR-01-1, 2001.
[2] SMILEY project. Available at http://midas.cs.ndsu.nodak.edu/~smiley
[3] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Menlo Park, CA, 1996.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.
[5] D. Heckerman, "Bayesian networks for knowledge discovery", in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 273-305, MIT Press, Cambridge, MA, 1996.
[6] A. Roy, Master's thesis on P-trees and the AND operation, Department of Computer Science, NDSU. Available at http://www.cs.ndsu.nodak.edu/~amroy/paper/paper.pdf
[7] P. Domingos and G. Hulten, "Mining high-speed data streams", ACM SIGKDD 2000.
[8] P. Domingos and G. Hulten, "Catching Up with the Data: Research Issues in Mining Data Streams", DMKD 2001.
[9] H. Samet, "The Quadtree and Related Hierarchical Data Structures", ACM Computing Surveys, 16(2), 1984.
[10] HH-codes. Available at http://www.statkart.no/nlhdb/iveher/hhtext.htm
Appendix: Peano Count Tree (P-tree) and the P-tree Algebra
Most spatial data comes in a format called BSQ for Band Sequential (or can be easily converted
to BSQ). BSQ data has a separate file for each attribute or band. The ordering of the data values
within a band is raster ordering with respect to the spatial area represented in the dataset. This
order is assumed and therefore is not explicitly indicated as a key attribute in each band (bands
have just one column). In this paper, we divided each BSQ band into several files, one for each
bit position of the data values. We call this format bit Sequential or bSQ.
Each bit file of the bSQ format is converted into a tree structure called a Peano Count Tree (P-tree). A P-tree is a quadrant-based tree: it recursively divides the entire image into quadrants and records the count of 1-bits for each quadrant, thus forming a quadrant count tree. P-trees are somewhat similar in construction to other data structures in the literature (e.g., Quadtrees [9] and HH-codes [10]).
For example, given an 8-row-8-column image of single bits, its P-tree is as shown in Figure 5.
bit-band:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

P-tree:
                     55
        ____________/ / \ \____________
       /             /     \           \
     16           __8__    _15__        16
                 / / | \   / | \ \
                3 0  4  1  4 4 3 4
              //|\    //|\     //|\
              1110    0010     1101

PM-tree:
                      m
        ____________/ / \ \____________
       /             /     \           \
      1           __m__    __m__        1
                 / / | \   / | \ \
                m 0  1  m  1 1 m 1
              //|\    //|\     //|\
              1110    0010     1101

Figure 5. P-tree and PM-tree for an 8x8 image
In this example, 55 is the number of 1’s in the entire image. This root level is labeled level 0.
The numbers 16, 8, 15, and 16 found at the next level (level 1) are the 1-bit counts for the four
major quadrants in raster order (upper-left, upper-right, lower-left, lower-right). Since the first
and last level-1 quadrants are composed entirely of 1-bits (called pure-1 quadrants), sub-trees are
not needed and these branches terminate. Similarly, quadrants composed entirely of 0-bits are
called pure-0 quadrants, which also cause termination of tree branches. This pattern is continued
recursively using the Peano or Z-ordering (recursive raster ordering) of the four sub-quadrants at
each new level. Eventually, every branch terminates (since, at the "leaf" level, all quadrants are
pure). If we were to expand all sub-trees, including those for pure quadrants, then the leaf
sequence would be the Peano-ordering of the image. Thus, we use the name Peano Count Tree.
A variation of the P-tree data structure, the Peano Mask Tree (PM-tree), is a similar structure in which masks rather than counts are used. In a PM-tree, we use a 3-value logic to represent pure-1, pure-0 and mixed quadrants (1 denotes pure-1, 0 denotes pure-0 and m denotes mixed). The PM-tree for the previous example is also given in Figure 5. Since a PM-tree is just an alternative
implementation for a Peano Count tree, for simplicity we will use the same term “P-tree” for
Peano Mask Tree.
We note that the fan-out of a P-tree need not be fixed at four. It can be any power of 4
(effectively skipping levels in the tree). Also, the fan-out at any one level need not coincide with
the fan-out at another level. The fan-out pattern can be chosen to produce maximum compression
for each bSQ file. More discussion can be found in [1].
For simplicity, let us assume that the fan-out is four. For each band (assuming 8-bit data values, though the model applies to data with any number of bits), there are eight P-trees as currently defined, one for each bit position. We will call these P-trees the basic P-trees of the spatial dataset. We
will use the notation, Pb,i to denote the basic P-tree for band, b and bit position, i. There are
always 8n basic P-trees for a dataset with n bands. Each basic P-tree has a natural complement.
The complement of a basic P-tree can be constructed directly from the P-tree by simply
complementing the counts at each level (subtracting from the pure-1 count at that level), as shown
in the example below (Figure 6). Note that the complement of a P-tree provides the 0-bit counts
for each quadrant. P-tree AND/OR operations are also illustrated in Figure 6.
P-tree (count form):
  55 [ 16, 8[3(1110), 0, 4, 1(0010)], 15[4, 4, 3(1101), 4], 16 ]

PM-tree (mask form of the same tree):
  m [ 1, m[m(1110), 0, 1, m(0010)], m[1, 1, m(1101), 1], 1 ]

Complement (count form):
  9 [ 0, 8[1(0001), 4, 0, 3(1101)], 1[0, 0, 1(0010), 0], 0 ]

Complement (mask form):
  m [ 0, m[m(0001), 1, 0, m(1101)], m[0, 0, m(0010), 0], 0 ]

P-tree-1:
  m [ 1, m[m(1110), 0, 1, m(0010)], m[1, 1, m(1101), 1], 1 ]

P-tree-2:
  m [ 1, 0, m[1, 1, 1, m(0100)], 0 ]

AND-Result (P-tree-1 AND P-tree-2):
  m [ 1, 0, m[1, 1, m(1101), m(0100)], 0 ]

OR-Result (P-tree-1 OR P-tree-2):
  m [ 1, m[m(1110), 0, 1, m(0010)], 1, 1 ]

(Each tree is written as root[child1, child2, child3, child4] with the children in Peano order; the four leaf bits of a mixed 2x2 sub-quadrant are given in parentheses.)

Figure 6. P-tree Algebra (Complement, AND and OR)
By performing the AND operation on the appropriate subset of the basic P-trees and their
complements, we can construct P-trees for values with more than one bit. This kind of P-tree is
called a value P-tree and is denoted by Pb,v, where b is the band number and v is a specified
value. For example, value P-tree P1,110 gives the count of pixels with band-1 bit 1 equal to 1, bit 2
equal to 1 and bit 3 equal to 0, i.e., with band-1 value in the range of [192, 224). In the very same
way, we can construct tuple P-trees where P(v1,…,vn) denotes the P-tree in which each node
number is the count of pixels in that quadrant having the value, vi, in band i, for i = 1,..,n. Any
tuple P-tree can also be constructed directly from basic P-trees and their complements by the
AND operation. The process of ANDing basic P-trees and their complements to produce value P-trees or tuple P-trees can be done at any level of precision: 1-bit precision, 2-bit precision, ..., 8-bit precision. For example, using the full 8-bit precision, Pb,11010011 can be constructed from basic
P-trees by the following AND operation, where ’ indicates the complement.
Pb,11010011 = Pb,1 AND Pb,2 AND Pb,3’ AND Pb,4 AND Pb,5’ AND Pb,6’ AND Pb,7 AND Pb,8
If only 3-bit precision is used, the value P-tree Pb,110, would be constructed by Pb,110 = Pb,1 AND
Pb,2 AND Pb,3’. The tuple P-tree, P001, 010, 111, 011, 001, 110, 011, 101 , would be constructed by:
P001,010,111,011,001,110,011,101 = P1,001 AND P2,010 AND P3,111 AND P4,011 AND P5,001 AND P6,110
AND P7,011 AND P8,101
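As a sketch of this construction, the helpers below build value and tuple P-trees from a hypothetical dictionary basic mapping (band, bit position) pairs to basic P-trees (bit position 1 being the most significant), using the AND and complement operations of the PTree sketch given earlier. The names and data layout are illustrative assumptions, not the authors' implementation.

def value_ptree(basic, band, value_bits):
    # P_{b,v}: AND the basic P-tree (or its complement) for every bit of v,
    # e.g. value_bits = "110" uses Pb,1 AND Pb,2 AND Pb,3'.
    result = None
    for i, bit in enumerate(value_bits, start=1):
        p = basic[(band, i)]
        p = p if bit == "1" else p.complement()
        result = p if result is None else result.AND(p)
    return result

def tuple_ptree(basic, values_per_band):
    # P_{(v1,...,vn)}: AND the value P-trees of every band.
    # values_per_band is a dict {band: bit string}, e.g. {1: "001", 2: "010"}.
    result = None
    for band, bits in values_per_band.items():
        p = value_ptree(basic, band, bits)
        result = p if result is None else result.AND(p)
    return result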
Basic P-trees (e.g., P1,1, P1,2, ..., P2,1, ..., P8,8)  --AND-->  Value P-trees (e.g., P1,110)  --AND-->  Tuple P-trees (e.g., P001,010,111,011,001,110,011,101)

Figure 7. Basic P-trees, Value P-trees (for 3-bit values) and Tuple P-trees
The detailed algorithm and further experimental results for the AND operation can be found in [6].