IEEE International Multi-topic Conference (INMIC), Dec. 24-26, 2004, NUCES FAST, Lahore, Pakistan
Bayesian Classification for Spatial Data Using P-tree

Mohammad Kabir Hossain¹, Rajibul Alam¹, Abu Ahmed Sayeem Reaz¹, and Dr. William Perrizo²
¹Department of Computer Science and Engineering, North South University, Dhaka-1213, Bangladesh
²Department of Computer Science, North Dakota State University, Fargo, ND 58105, USA
mkhossain@northsouth.edu, ronim@agni.com, sayeem_reaz@yahoo.com, william.perrizo@ndsu.nodak.edu
Abstract
Classification of spatial data can be difficult with existing methods because spatial data sets are both numerous and large, and such volumes of data require a huge amount of memory and/or time. The task becomes even more difficult when we consider continuous spatial data streams.

In this paper, we address this challenge using the Predicate tree (also known as the Peano Count Tree, or P-tree), which provides a lossless, compressed, and data-mining-ready representation (data structure) for spatial data. We demonstrate how P-trees can improve the classification of spatial data when using a Bayesian classifier. We also introduce the use of information gain calculations with Bayesian classification to improve its accuracy. A P-tree based Bayesian classifier not only makes classification more effective on spatial data but also reduces the build time of the classifier considerably. This improvement in build time makes it feasible for use with streaming data.
1. Introduction

Classification is an important data-mining technique. Given an unclassified data sample, it predicts the class to which that sample belongs. Suppose we are given a relation R(k1, …, kn, A1, …, An, C), where the ki's are the structural attributes of R and A1, …, An, C are feature attributes, with C being the class label attribute. Given an unclassified data sample (having a value for all attributes except C), a classification technique will predict the C-value for the sample and thus determine its class. In the case of spatial data, the ki's in R usually represent a location (or pixel) over a space, and each Ai is a descriptive attribute of the locations.

In this paper, we explore the technique of Bayesian classification, which is based on Bayes theorem, and we introduce a novel technique called non-naïve Bayesian classification. To improve the accuracy of the classifier, we combine the naïve and non-naïve techniques and call the result partial-naïve classification. We introduce the concept of information gain for selecting the attributes that contribute most to deciding the class label of an unclassified sample. We also use a new data-mining-ready data structure to implement our idea, which reduces the cost of scanning the database. In fact, using the P-tree [1, 2] lets us compute the various counts without scanning the database at all (patents are pending at North Dakota State University on the P-tree technology).

2. Related Concepts
There are two types of Bayesian classification
techniques, naïve Bayesian and Bayesian belief
network. Bayesian classification is a statistical
classifier based on Bayes theorem [3]. It assigns to a given sample a probability of belonging to each class. In naïve Bayesian classification, the "class conditional independency" assumption is made to reduce the complexity of calculating the probabilities. This technique works well when the attributes are independent of each other but performs poorly if this assumption does not hold. On the other hand,
the Bayesian belief network is a graphical model [4]. It
represents the dependencies among subsets of
attributes in a directed acyclic graph. A table called
the “conditional probability table” (CPT) is
constructed. From the CPT different probabilities are
calculated to predict the class label of an unknown
sample. In the case of belief network, it is
computationally very complex to build the network
and to create the CPT [3, 4]. It is time consuming to
calculate the probabilities in traditional ways because
of the need to scan the entire database again and again.
In general, two techniques, bagging and boosting, are widely used to increase classifier accuracy [3]. In these methods, there is a series of
classifiers C1,…,Cn which are already learned, and
they are combined to create an improved classifier C.
We have found many techniques to increase the
accuracy of naïve Bayesian classification. In [5], the
authors showed that boosting cannot improve the
accuracy of the naïve Bayesian classifier, on average,
in a set of natural domains. They proposed to
introduce tree structures into the naïve Bayesian
classifier to improve its performance. Although they
claimed that their technique achieved lower average
error than the naïve Bayesian classifier, they admitted
that it increases the average error for individual
models. In [6], the authors introduced a lazy semi-naïve Bayesian classifier learning technique to
overcome the problem of “class conditional
independency.” Their method is very efficient when a
single classifier is to be applied to classify a few cases.
In [7], the authors introduced a technique where they
perform attribute elimination and test case selection
using cross-validation to generate the rule for each test
case.
3. Proposed System
3.1. Introduction to P-Tree
We explain the P-tree in terms of how it organizes spatial data. A P-tree is a quadrant-based tree. It recursively divides an entire image into quadrants and records the count of 1-bits for each quadrant, thus forming a quadrant count tree. P-trees are somewhat similar in construction to other data structures in the literature (e.g., Quad-trees [8] and HH-codes [10]). For example, given an 8-row-by-8-column image of single bits, its P-tree is as shown in Figure 1.
In this example, 55 is the number of 1's in the entire
image. This root level is labeled level 0. The numbers
16, 8, 15, and 16 found at the next level (level 1) are the 1-bit counts for the four major quadrants in raster order,
or Z order (upper left, upper right, lower left and lower
right). Since the first and last level-1 quadrants are
composed entirely of 1-bits (called pure-1 quadrants),
sub-trees are not needed, and these branches terminate.
Figure 1: 8x8 image and its P-tree
Similarly, quadrants composed entirely of 0-bits are
called pure-0 quadrants, which also cause termination
of tree branches. This pattern is continued recursively
using Peano, or Z-ordering (recursive raster ordering),
of the four sub-quadrants at each new level.
Eventually, every branch terminates (since, at the
"leaf" level, all quadrants are pure). If we were to
expand all sub-trees, including those for pure
quadrants, then the leaf sequence would be the Peano ordering of the image. Thus we use the name Peano Count Tree [1]. If the internal nodes of the tree simply record the minimal information, namely the truth of the universal quadrant predicate "purely 1-bits," then
the P-tree is an absolutely minimal lossless
representation of the data called a Predicate tree (also
a P-tree). We note further that one can easily construct
the count tree from the universal predicate tree by
summing from the bottom up in one in-order scan of
the tree.
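To make the quadrant count construction concrete, here is a minimal Python sketch (not from the original system; the class and function names are illustrative) that builds a Peano count tree from a square bit image whose side is a power of two, terminating branches at pure-0 and pure-1 quadrants:

```python
# Minimal sketch of Peano Count Tree (P-tree) construction over a 2-D list of
# 0/1 values. Assumes a square image whose side is a power of two.

class PTreeNode:
    def __init__(self, count, size, children=None):
        self.count = count        # number of 1-bits in this quadrant
        self.size = size          # side length of the quadrant
        self.children = children  # None for pure (all-0 or all-1) quadrants

def build_ptree(bits, row=0, col=0, size=None):
    if size is None:
        size = len(bits)
    count = sum(bits[r][c] for r in range(row, row + size)
                           for c in range(col, col + size))
    # Pure-0 and pure-1 quadrants terminate the branch (no children stored).
    if count == 0 or count == size * size:
        return PTreeNode(count, size)
    half = size // 2
    # Children in Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    children = [build_ptree(bits, row,        col,        half),
                build_ptree(bits, row,        col + half, half),
                build_ptree(bits, row + half, col,        half),
                build_ptree(bits, row + half, col + half, half)]
    return PTreeNode(count, size, children)
```

For the image of Figure 1, the root node's count would be 55 and its four children would carry the level-1 counts 16, 8, 15, and 16.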
A spatial image can be viewed as a 2-dimensional array of pixels. Associated with each pixel are various descriptive attributes, called "bands". For example, a BMP image has three bands, namely Blue, Green, and Red. An image can also be viewed as a relational table in which each pixel is a tuple and each band is an attribute. The primary key can be expressed as the x-y coordinates.
3.2. Bit Sequential Format
Data in a relational table must be mapped to bit sequential (bSQ) format for P-tree generation. Suppose that, in spatial data, a reflectance value in a band is a number in the range 0-255 and is represented by a byte. We split each band into eight separate files, one for each bit position. In Figure 2 we give a very simple illustrative example of the bSQ format with only two bands in a scene having only four pixels (two rows and two columns). P-trees are basically Z-order, run-length-compressed representations of each bSQ file. So, we store the database not in relational tables but in bSQ files.

Figure 2: bSQ formats for a two-band 2x2 image
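As an illustration of the bSQ mapping, the following sketch (with illustrative names, not the original implementation) splits one 8-bit band into its eight bit planes, one per bit position:

```python
# Sketch of bSQ (bit sequential) conversion: one 8-bit band becomes eight
# bit planes, each of which would then be compressed into a basic P-tree.

def band_to_bsq(band):
    """band: 2-D list of byte values (0-255). Returns eight 2-D bit arrays,
    with index 0 holding the most significant bit of every pixel."""
    return [[[(value >> bit) & 1 for value in row] for row in band]
            for bit in range(7, -1, -1)]

# For the two-band 2x2 scene of Figure 2, applying this to each band would
# produce sixteen tiny bSQ files in total.
```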
3.3. Basic P-Trees and Operations
We reorganize each bit file of the bSQ format into a basic P-tree. The definition of a basic P-tree is given in [1] as follows.

Definition 1: A basic P-tree Pi,j is the P-tree for the jth bit of the ith band.

Basic P-trees can be combined using simple logical operations (AND, OR and COMPLEMENT) to produce P-trees for the original values at any level of precision.

3.4. Value and Tuple P-Trees

Definition 2: A value P-tree Pi(v) is the P-tree of value v at band i. Value v can be expressed in 1-bit up to 8-bit precision [8]. Value P-trees can be constructed by ANDing basic P-trees or their complements. For example,

Pi(110) = Pi,1 AND Pi,2 AND Pi,3'

Definition 3: A tuple P-tree P(v1, v2, …, vn) is the P-tree of value vi at band i, for all i from 1 to n [6]. We have

P(v1, v2, …, vn) = P1(v1) AND P2(v2) AND … AND Pn(vn)

3.5. Root Count

The total number of 1's in a P-tree is called the root count. The root count of a P-tree indicates the total number of 1's in the image from which the P-tree was built.

3.6. Bayesian Classification

A Bayesian classifier is a statistical classifier based on Bayes theorem, which is as follows. Let X be a data sample whose class label is unknown, and let H be a hypothesis (e.g., X belongs to class C). P(H|X) is the posterior probability of H given X, and P(H) is the prior probability of H. Then

P(H|X) = P(X|H) P(H) / P(X)

where P(X|H) is the posterior probability of X given H and P(X) is the prior probability of X.
Bayesian classification uses this theorem in the following way. Each data sample is represented by a feature vector X = (x1, …, xn) depicting the measurements made on the sample for attributes A1, …, An, respectively. Given classes C1, …, Cm, the Bayesian classifier will predict that an unknown data sample X (with no class label) belongs to the class Cj having the highest posterior probability conditioned on X, that is,

P(Cj | X) > P(Ci | X) ……………………………(1)

for all i ≠ j. Now, according to Bayes theorem,

P(Cj | X) = P(X | Cj) P(Cj) / P(X) ………………(2)

P(X) is constant for all classes, so it suffices to maximize P(X | Cj) P(Cj). In naïve Bayesian classification, the naïve assumption of "class conditional independence of values" is made to reduce the computational complexity of calculating all the P(X | Ci)'s. It assumes that the value of an attribute is independent of the values of all others. Therefore,

P(X | Ci) = Πk=1..n P(xk | Ci) ……………………(3)
For categorical attributes, P(xk|Ci) = si(xk)/si,
where si = number of samples in class Ci and si(xk) =
number of training samples of class Ci having Ak value
xk.
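The decision rule implied by (1)-(3) can be summarized in a short sketch (illustrative only; the priors and conditional probabilities are assumed to have been estimated already, e.g., from the counts above):

```python
# Sketch of the naive Bayesian decision rule of equations (1)-(3).
# priors[j] is P(C_j); cond[j][k] is P(x_k | C_j) for the sample X being classified.

from math import prod

def naive_bayes_predict(priors, cond):
    scores = [priors[j] * prod(cond[j]) for j in range(len(priors))]
    return max(range(len(scores)), key=scores.__getitem__)  # index j maximizing P(X|C_j)P(C_j)
```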
3.7. Calculating Probabilities using P-Trees
We showed earlier that, to classify a tuple X = (x1, …, xn), we need to find the values P(xk|Ci) = si(xk)/si. To find si(xk) and si, we need two value P-trees: Pk,xk (the value P-tree of band k for value xk) and PC,Ci (the value P-tree of the class label band C for value Ci). From these two value P-trees, the following equations give us si(xk) and si:

si(xk) = RootCount[ Pk,xk AND PC,Ci ] ……………(4)
si = RootCount[ PC,Ci ]

In this way, we can find the value of all the probabilities in (3) and can use (1) and (2) to classify a new tuple, X.
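A small sketch of equation (4) follows; for brevity, plain Python sets of pixel identifiers stand in for the compressed value P-trees, with set intersection playing the role of the P-tree AND and the set size playing the role of the root count:

```python
# Sketch of P(x_k | C_i) from root counts, equation (4). Sets of pixel ids
# stand in for value P-trees; the example values below are hypothetical.

def conditional_probability(value_ptree, class_ptree):
    """P(x_k | C_i) = RootCount[P_k,xk AND P_C,Ci] / RootCount[P_C,Ci]."""
    s_i = len(class_ptree)                   # RootCount[P_C,Ci]
    s_i_xk = len(value_ptree & class_ptree)  # RootCount of the ANDed P-tree
    return s_i_xk / s_i if s_i else 0.0

P_k_xk = {0, 2, 3, 5}   # pixels where band k has value x_k
P_C_Ci = {2, 3, 4, 5}   # pixels belonging to class C_i
print(conditional_probability(P_k_xk, P_C_Ci))   # 0.75
```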
3.8. Avoiding the Naïve Assumption of "Class Conditional Independency"
The naïve assumption of class conditional independency introduces some error into the classification, because the assumption that the value of one attribute is independent of the values of the other attributes does not hold in all cases. The assumption is needed only to simplify the calculation of the probability in equation (2) using (3). To improve the accuracy of the classifier, we can instead compute the quantity in (3) directly by building the tuple P-tree Px1…xn, which is simply

Px1…xn = P1,x1 AND … AND Pn,xn

Then

P(X|Ci) = si(x1…xn) / si ……………………………(5)

where, consistent with (4),

si(x1…xn) = RootCount[ Px1…xn AND PC,Ci ]
By calculating the above tuple P-tree, we are
finding the probability of occurrence of the entire tuple
in the data sample. We do not need to care about the
interdependency between different bands (attributes).
In this way, we can find the value of P(X|Ci) without the naïve assumption of "class conditional independency," which improves the accuracy of the classifier. Thus, we keep the simplicity of naïve Bayesian classification while obtaining higher accuracy. We call this process the non-naïve approach.
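Under the same set-based stand-in for P-trees, the non-naïve probability of equation (5) can be sketched as follows:

```python
# Sketch of the non-naive probability of equation (5). A zero result signals
# that the exact tuple does not occur in the training data for class C_i,
# which is the case handled in Section 3.10.

from functools import reduce

def non_naive_probability(value_ptrees, class_ptree):
    """P(X | C_i) = RootCount[P_1,x1 AND ... AND P_n,xn AND P_C,Ci] / RootCount[P_C,Ci]."""
    tuple_ptree = reduce(lambda a, b: a & b, value_ptrees)  # tuple P-tree P_x1...xn
    s_i = len(class_ptree)
    return len(tuple_ptree & class_ptree) / s_i if s_i else 0.0
```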
One problem of this approach is that, if the tuple
X that we want to classify is not present in our data set,
the root count of Px1…xn will be zero and the value of
P(X|Ci) will be zero as well. In that case we will not be
able to classify the tuple. To deal with that problem,
we introduce the measure of information gain for an
attribute of the data in our classification.
3.9. Calculation of Information Gain
In a data sample, the class label attribute C has m different values or classes Ci, i = 1, …, m. Let si be the number of samples in Ci. The information needed to classify a given sample is

I(s1, …, sm) = − Σi=1..m pi log2(pi) …………………(6)

where pi = si/s is the probability that a sample belongs to Ci and s is the total number of samples.

Let attribute A have v distinct values {a1, …, av}. The entropy, or expected information based on the partition into subsets by A, is

E(A) = Σj=1..v ((s1j + … + smj)/s) I(s1j, …, smj) ……(7)

where sij is the number of samples in class Ci that have value aj for attribute A, and

I(s1j, …, smj) = − Σi=1..m pij log2(pij) ………………(8)

where pij = sij/sj and sj is the number of samples having value aj for A.
Now, the information gain, i.e., the expected reduction in entropy caused by knowing the value of A, can be calculated by

Gain(A) = I(s1, …, sm) − E(A) …………………………(9)

We can find the values of si, sj, and sij with the following P-tree algebra:

si = RootCount[ PC,Ci ]
sj = RootCount[ PA,aj ]
sij = RootCount[ PC,Ci AND PA,aj ]

where PC,Ci is the value P-tree of the class label attribute with value Ci and PA,aj is the value P-tree of attribute A with value aj.
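Putting equations (6)-(9) together with these root counts, a sketch of the information-gain computation (again using pixel-id sets in place of compressed P-trees, with all names illustrative) looks like this:

```python
# Sketch of Gain(A) from equations (6)-(9). attribute_ptrees are the value
# P-trees P_A,aj for the v values of A; class_ptrees are the value P-trees
# P_C,Ci for the m classes. Zero-probability terms are skipped, as usual.

import math

def entropy(counts, total):
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(attribute_ptrees, class_ptrees):
    s = sum(len(p) for p in class_ptrees)                  # total number of samples
    base = entropy([len(p) for p in class_ptrees], s)      # I(s1, ..., sm)
    expected = 0.0                                         # E(A)
    for p_aj in attribute_ptrees:
        s_j = len(p_aj)                                    # RootCount[P_A,aj]
        if s_j == 0:
            continue
        s_ij = [len(p_aj & p_ci) for p_ci in class_ptrees] # RootCount[P_C,Ci AND P_A,aj]
        expected += (s_j / s) * entropy(s_ij, s_j)         # (s1j+...+smj)/s * I(s1j,...,smj)
    return base - expected                                 # Gain(A) = I(s1,...,sm) - E(A)
```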
3.10. Use of Information Gain in Our Method
Returning to the problem in our classification technique: when the root count of Px1…xn is zero, we can form a tuple P-tree over a reduced tuple. To do that, we remove one attribute Ak from the tuple P-tree and form the tuple P-tree over the remaining bands. We select k such that band Ak has a lower information gain than any other band, so (3) becomes

P(X|Ci) = P(x1…xn excluding xk | Ci) * P(xk|Ci) ……(10)

where the first factor can be calculated by (5) without using band k, and P(xk|Ci) can be calculated by (4).
Here we are treating the attributes Ai with i ≠ k non-naïvely and Ak naïvely, and we call this partial-naïve classification. Now, if it happens that the root count of the reduced tuple P-tree is still zero, we remove another attribute, the one with the second lowest information gain, and proceed in the same way. The general form of (10) is

P(X|Ci) = P(x1…xn excluding the xj | Ci) * Πj P(xj|Ci) ……(11)

where the first factor is calculated from (5) over the maximum number of bands whose reduced tuple P-tree has a nonzero root count, j ranges over the minimal set of removed bands (those with the lowest information gain), and each P(xj|Ci) is calculated from (4). A sketch of this fallback procedure is given below.
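The following Python sketch of equations (10) and (11) again uses pixel-id sets as stand-ins for P-trees; gains[k] is assumed to hold Gain(Ak) as computed in Section 3.9, and the class is assumed to be non-empty:

```python
# Sketch of the partial-naive fallback of equations (10)-(11): attributes are
# dropped in order of increasing information gain until the reduced tuple
# P-tree has a nonzero root count; dropped attributes are then treated naively.

from functools import reduce

def _root_count_and(ptrees):
    return len(reduce(lambda a, b: a & b, ptrees))

def partial_naive_probability(value_ptrees, class_ptree, gains):
    s_i = len(class_ptree)                                            # RootCount[P_C,Ci]
    order = sorted(range(len(value_ptrees)), key=lambda k: gains[k])  # lowest Gain(A_k) first
    kept, dropped = set(range(len(value_ptrees))), []
    p = _root_count_and([value_ptrees[k] for k in kept] + [class_ptree]) / s_i
    for k in order:
        if p > 0 or len(kept) == 1:
            break
        kept.remove(k)
        dropped.append(k)
        p = _root_count_and([value_ptrees[k] for k in kept] + [class_ptree]) / s_i
    for k in dropped:  # naive factors for the dropped attributes, as in (11)
        p *= len(value_ptrees[k] & class_ptree) / s_i
    return p
```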
Figure 4: Classification success rate comparisons (success rate versus number of significant bits for NBC, NNBC, and PNBC)
4. Experiment Results
4.1. Performance of P-trees
In our application, we used the AND [2] and COMPLEMENT operations. Of these two, AND is the more critical. It takes two P-trees as operands and produces a resultant P-tree equivalent to the one that would be built from the pixel-wise logical AND of the underlying basic P-tree data. The performance of our application depends on the performance of the AND operation; refer to Figure 3.
4.2. Classification Accuracy
To evaluate accuracy, we compared the classification accuracy of the proposed partial-naïve Bayesian classifier using information gain (PNBC) with that of the naïve Bayesian (NBC) and non-naïve Bayesian (NNBC) classifiers.
Figure 3: Performance of the P-tree AND operation (time required, in ms, versus bit number)
Actual aerial TIFF images with a synchronized yield band were used for the evaluation. The data are available on the SMILEY [13] web site [http://rock.cs.ndsu.nodak.edu/smiley/]. The data set has four bands, red, green, blue, and yield, each with eight bits of information. Table 1 shows the respective success rates for each technique. The training sample size was set at 1024x1024. The objective of the classification was to generate the approximate yield values for a given aerial TIFF image of a crop field. The data are classified into four yield classes by considering the RGB attribute values.
Up to seven significant bits were used for the comparison (the last few least significant bits may contain random noise). Figure 4 shows that our solution achieves much better classification than the naïve Bayesian technique, which is a significant advantage in the application of classification to data streams. Each technique reaches its peak classification accuracy at a different number of bits. The generally low accuracy levels can be attributed to the nature of the data used: around 36-38% of the test cases are unclassifiable with all three techniques. The results also show that the proposed classifier outperforms the naïve technique, since it correctly classifies almost everything that can be classified with naïve Bayesian.
Table 1: Comparison of performance for four classification classes

Bits    2     3     4     5     6     7
NBC    .14   .19   .27   .26   .24   .23
NNBC   .28   .33   .34   .35   .37   .41
PNBC   .48   .50   .51   .52   .51   .51

NBC: Naïve Bayesian, NNBC: Non-naïve Bayesian, and PNBC: Partial-naïve Bayesian (using information gain).
4.4. Performance in Data Stream Applications
Earlier, we said that any data-mining technique applied to stream data needs special care. Here, we define those requirements more precisely. A typical data stream mining algorithm should satisfy the following criteria [3, 9]:
1. It must require a small constant time per record.
2. It must use only a fixed amount of main memory.
3. It must be able to build a model with at most one
scan of the data
4. It must make a usable model available at any point
in time.
5. It should produce a model that is equivalent to the
one that would be obtained by the corresponding
database-mining algorithm.
6. When the data-generating phenomenon is
changing over time, the model should, at any time,
be up-to-date but also include the past
information.
Now, we will illustrate how our new approach of
using a P-tree based Bayesian classification meets the
stringent design criteria above. For the first and second
points, it takes a short time to build the basic P-trees
from the data of the data stream. The size of the
classifier is constant for a fixed image size. Considering the upper bound on the size of each P-tree, we can determine the fixed amount of main memory required for the application. With
respect to the third and fourth criteria, the P-trees are
built by doing only one pass over the data. The
collection of P-trees created for an image is
classification ready at any point of time. Fifth, a P-tree
model helps us to build the classifier quickly and
conveniently. As it is a lossless representation of the
original data, the classifier built using P-trees contains
all the information of the original training data. The
classification will be equivalent to any traditional
classification technique. In this approach, the P-trees
allow us the capability to calculate the required
probability values accurately and efficiently for the
Bayesian classification. Sixth, if the data-generating phenomenon changes, it will not adversely affect the classification process: the newly built P-trees give the classifier the required adaptability, and past information is not lost as long as we keep it in our P-trees. The training image window size over the data stream, and how it is selected, controls how much historical information is retained as well as how adaptable the classifier is.
5. Future Work
In this paper, we focused on eliminating the class conditional independency assumption in naïve Bayesian classification. For that purpose, we used a non-naïve technique when an exact tuple match exists in the training sample. When no exact match is found, we use the partial-naïve technique, separating out one or more attributes. To select the attribute to be separated, we used the measure of information gain.
The same thing can happen in other classification
techniques, for example, in k-nearest neighbor (KNN)
classification. In KNN classification, if there is no exact match in the training sample, the neighborhood is expanded up to k neighbors and the class label is decided by the majority vote of those k neighbors. Instead of using the
information gain, other techniques can also be applied
to select the attribute to be separated.
6. Conclusion
In this paper, we efficiently used the P-tree data
structure in the Bayesian classification. We also
applied a new method in Bayesian classification by
introducing the information gain. Thus, our new
method has the simplicity of naïve Bayesian but higher
accuracy. Experimental results show that our new
approach increases the accuracy in naïve Bayesian
classification significantly. Also, we have shown that our method is well suited to data stream mining, a newer area of data mining compared with traditional database mining. P-trees play an important role in
achieving these advantages by providing a way to
calculate the different probabilities easily, quickly, and
accurately. P-tree technology can be efficiently used in
other data mining techniques.
7. References

[1] W. Perrizo, Peano Count Tree Lab Notes, Technical Report NDSU-CSOR-TR-01-1, 2001.
[2] Amalendu, Master's thesis on P-trees and the AND operation, Department of Computer Science, NDSU. Available at http://www.cs.ndsu.nodak.edu/~am10/paper/paper.pdf
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[4] D. Heckerman, Bayesian networks for knowledge discovery, in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 273-305, MIT Press, Cambridge, MA, 1996.
[5] K. M. Ting and Z. Zheng, Improving the performance of boosting for naïve Bayesian classification, Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Beijing, China, 1999.
[6] Z. Zheng, G. I. Webb, and K. M. Ting, Lazy Bayesian Rules: A Lazy Semi-Naïve Learning Technique Competitive to Boosting Decision Trees, Proceedings of ICML '99, Morgan Kaufmann, 1999.
[7] K. M. Ting, Z. Zheng, and G. I. Webb, Learning Lazy Rules to Improve the Performance of Classifiers, available at www.deakin.edu.au/~zij
[8] H. Samet, The quadtree and related hierarchical data structures, ACM Computing Surveys, 1984.
[9] P. Domingos and G. Hulten, Catching Up with the Data: Research Issues in Mining Data Streams, DMKD 2001.
[10] HH-codes. Available at http://www.statkart.no/nlhdb/iveher/hhtext.htm
[11] P. Domingos and G. Hulten, Mining high-speed data streams, ACM SIGKDD 2000.
[12] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Menlo Park, CA, 1996.
[13] SMILEY project. Available at http://midas.cs.ndsu.nodak.edu/~smiley