
Performance Improvement for Bayesian Classification on Spatial Data with P-Trees
Amal S. Perera
Dep. of Computer Science
North Dakota State University
Fargo, ND 58105, USA
Amal.Perera@ndsu.nodak.edu
Masum H. Serazi
Dep. of Computer Science
North Dakota State University
Fargo, ND 58105, USA
Md.Serazi@ndsu.nodak.edu
William Perrizo
Dep. of Computer Science
North Dakota State University
Fargo, ND 58105, USA
William.Perrizo@ndsu.nodak.edu

1 Patents are pending on the bSQ and P-Tree technology.
2 This work is partially supported by GSA Grant ACT# K96130308.
Abstract
Accuracy is one of the major issues for a classifier. Currently there exists a range of classifiers whose degrees of accuracy are directly related to their computational complexity. In this paper we present an approach to improve the classification accuracy of an existing P-Tree based Bayesian classification technique. The new approach increases the granularity of the conditional probability calculations by using a bit-based approach rather than the existing band-based approach, which enables the complete elimination of the naïve assumption while maintaining the same computational cost as the previous method. The new approach outperforms the existing P-Tree based Bayesian classifier, a Bayesian belief network and a Euclidean distance based KNN classifier in terms of accuracy for a particular set of spatial data collected for precision agriculture.
Keywords
Data mining, Bayesian Classification, bSQ, P-Tree
1 INTRODUCTION
In general, classification is a form of data analysis, or data mining technique, that can be used to extract models describing important data classes or to predict future data trends. There is a broad range of techniques for data classification, from decision tree induction, Bayesian classification, neural networks, k-nearest neighbor, case-based reasoning, genetic algorithms and rough sets to fuzzy logic techniques [3]. A Bayesian classifier is a statistical classifier that uses Bayes' theorem to predict class membership as the conditional probability that a given data sample falls into a particular class. The complexity of computing the conditional probability values can become prohibitive for most applications with a large data set and a large attribute space. Bayesian belief networks relax many constraints and use information about the domain to build a conditional probability table.
Naïve Bayesian classification is a lazy classifier: computational cost is reduced by using the naïve assumption of class conditional independence to calculate the conditional probabilities when required [3]. Bayesian belief networks require build time and domain knowledge, whereas the naïve approach loses accuracy if the assumption is not valid. The P-Tree data structure allows us to compute the Bayesian probability values efficiently, without the naïve assumption, by building P-Trees for the training data. Calculating the probability values requires a set of P-Tree AND operations that yield the respective counts for a given pattern. Bayesian classification with P-Trees has been used successfully on remotely sensed image data to predict yield in precision agriculture [1]. To avoid situations where the required pattern does not exist in the training data, that approach partially employs the naïve assumption. To completely eliminate the assumption, and thereby increase the accuracy, we propose a bit-based Bayesian classification instead of the band-based approach in [1].
This paper is organized as follows. We first discuss related work in Section 2, where we describe the P-Tree data structure and how it can be used for a Bayesian classifier. In Section 3 we propose our algorithm for improving the performance of an existing Bayesian classifier that uses P-Trees on spatial data. In Section 4 we evaluate the performance of our algorithm in terms of accuracy and its scalability with the size of the data set. Finally, we offer our conclusions in Section 5.
2 RELATED WORK
Many studies have been conducted in spatial data
classification using P-Trees [1],[5],[6],[7]. In this section we discuss the P-Tree data structure and how P-Trees are applied in [1] to calculate the counts required for the Bayesian probability values while partially eliminating the use of the naïve assumption.
2.1 P-Tree
Most spatial data comes in a format called BSQ for Band
Sequential (or can be easily converted to BSQ). BSQ
data has a separate file for each band. The ordering of the
data values within a band is raster ordering with respect to
the spatial area represented in the dataset. This order is
assumed and therefore is not explicitly indicated as a key
attribute in each band (bands have just one column). For P-Trees, each BSQ band is divided into several files, one for each bit position of the data values. This
format is called ‘bit Sequential’ or bSQ. A simple
transform can be used to convert image files to BSQ and
then to bSQ format. Each bSQ bit file, Bij (the file constructed from the jth bits of the ith band), is organized into a tree structure called a Peano Count Tree (P-Tree). A P-Tree is a quadrant-based tree. The root of a P-Tree
contains the 1-bit count of the entire bit-band. The next
level of the tree contains the 1-bit counts of the four
quadrants in raster order. At the next level, each quadrant
is partitioned into sub-quadrants and their 1-bit counts in
raster order constitute the children of the quadrant node.
This construction is continued recursively down each tree
path until the sub-quadrant is ‘pure’ (entirely 1-bits or
entirely 0-bits), which may or may not be at the leaf level.
For example, the P-Tree for an 8-row-by-8-column bit-band is shown in Figure 1.
[Figure 1. An 8x8 bit-band and its PM-tree; 1 denotes a pure-1 quadrant, 0 a pure-0 quadrant and m a mixed quadrant.]

For efficient implementation, we use a variation of the basic P-Tree called the PM-tree (Pure Mask tree). In the PM-tree we use a 3-valued logic, in which '11' represents a quadrant of pure 1-bits (a pure1 quadrant), '00' represents a quadrant of pure 0-bits (a pure0 quadrant) and '01' represents a mixed quadrant. To simplify the exposition, we use 1 instead of 11 for pure1, 0 for pure0, and m for mixed. The PM-tree for the example bit-band is given in Figure 1 [4].
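For illustration only, the following Python sketch (our own; the name build_ptree and the example bit-band are assumptions, not part of the P-Tree implementation) builds the quadrant-count structure described above. A real P-Tree/PM-tree would additionally compress pure quadrants and store mask values rather than plain dictionaries.

```python
import numpy as np

def build_ptree(bits):
    """Build a Peano-count-tree node for a square 0/1 array.

    Each node records the 1-bit count of its quadrant; a quadrant that is
    entirely 0s or entirely 1s ('pure') becomes a leaf, otherwise it is
    split into its four sub-quadrants in raster order.
    """
    count = int(bits.sum())
    n = bits.shape[0]
    if count == 0 or count == n * n:           # pure quadrant -> leaf
        return {"count": count, "children": None}
    half = n // 2
    quads = [bits[:half, :half], bits[:half, half:],   # raster order of
             bits[half:, :half], bits[half:, half:]]   # the 4 sub-quadrants
    return {"count": count, "children": [build_ptree(q) for q in quads]}

if __name__ == "__main__":
    # arbitrary 8x8 bit-band, used only to exercise the construction
    band_bit = np.array([[1, 1, 1, 1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1, 1, 1, 0],
                         [1, 1, 1, 1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1, 1, 1, 1],
                         [0, 1, 1, 1, 1, 1, 1, 1]])
    root = build_ptree(band_bit)
    print(root["count"])   # 1-bit count of the entire bit-band
```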
2.2 P-Tree Algebra

Basic P-Trees can be combined using logical operations to produce P-Trees for the original values at any level of bit precision. We let Pb,v denote the Peano Count Tree for band b and value v, where v can be expressed in n-bit precision. Using 8-bit precision, the value P-Tree Pb,11010011 can be constructed from the basic P-Trees as:

Pb,11010011 = Pb1 AND Pb2 AND Pb3' AND Pb4 AND Pb5' AND Pb6' AND Pb7 AND Pb8,

where ' indicates the NOT operation. The AND operation is simply the pixel-wise AND of the bits. Similarly, any data set in the relational format can be represented as P-Trees. For any combination of values (v1,v2,...,vn), where vi is from band i, the quadrant-wise count of occurrences of this combination of values is given by:

P(v1,v2,...,vn) = P1,v1 AND P2,v2 AND ... AND Pn,vn [2]

2.3 Bayesian Classifier

Given a relation R(k, A1, ..., An, C), where k is the key of the relation R and A1, ..., An, C are attributes among which C is the class label attribute, and given an unclassified data sample (having a value for all attributes except C), a classification technique will predict the C-value for the given sample and thus determine its class. In the case of spatial data, the key k in R usually represents some location (or pixel) over a space and each Ai is a descriptive attribute of the locations. A typical example of such spatial data is an image of the earth's surface, collected as a satellite image or an aerial photograph. The attributes may be different reflectance bands such as red, green, blue, infra-red, near infra-red, thermal infra-red, etc. The attributes may also include synchronized ground measurements such as yield, soil type, zoning category, a weather attribute, etc. A classifier may, for example, predict the yield value from the reflectance band values extracted from a satellite image.

In Bayesian classification, let X be a data sample whose class label is unknown, and let H be a hypothesis (i.e., X belongs to class C). P(H|X) is the posterior probability of H given X and P(H) is the prior probability of H. Then

P(H|X) = P(X|H)P(H) / P(X)

where P(X|H) is the posterior probability of X given H and P(X) is the prior probability of X. Bayesian classification uses this theorem in the following way. Each data sample is represented by a feature vector X = (x1,...,xn) depicting the measurements made on the sample from A1,...,An, respectively. Given classes C1,...,Cm, the Bayesian classifier will predict that an unknown data sample X (with no class label) belongs to the class Cj having the highest posterior probability conditioned on X:

P(Cj|X) > P(Ci|X), for all i not equal to j ... (1)

P(X) is constant for all classes, so P(X|Cj)P(Cj) is maximized. Bayesian classification can be naïve or based on Bayesian belief networks. In naïve Bayesian classification, the naïve assumption of 'class conditional independence of values' is made to reduce the computational complexity of calculating all the P(X|Cj) values. It assumes that the value of an attribute is independent of that of all others. Thus,

P(X|Ci) = P(x1|Ci) * ... * P(xn|Ci) ... (2)

For categorical attributes,

P(xk|Ci) = sixk / si

where si is the number of training samples in class Ci and sixk is the number of training samples of class Ci having Ak-value xk [3].
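To make equations (1) and (2) concrete, a minimal count-based naïve Bayesian classifier might look as follows. This is our own illustrative sketch (the name naive_bayes_predict and the toy data are assumptions), not the P-Tree method of this paper.

```python
from collections import Counter

def naive_bayes_predict(train, labels, x):
    """Pick the class maximizing P(X|Ci)P(Ci) under the naive assumption.

    train  : list of attribute tuples (training samples)
    labels : list of class labels aligned with train
    x      : the unclassified attribute tuple
    """
    classes = Counter(labels)                       # s_i for each class Ci
    n = len(labels)
    best_class, best_score = None, -1.0
    for c, s_i in classes.items():
        score = s_i / n                             # P(Ci)
        for k, x_k in enumerate(x):
            s_ixk = sum(1 for t, lab in zip(train, labels)
                        if lab == c and t[k] == x_k)
            score *= s_ixk / s_i                    # P(x_k|Ci) = s_ixk / s_i
        if score > best_score:
            best_class, best_score = c, score
    return best_class

if __name__ == "__main__":
    train = [(1, 0), (1, 1), (0, 1), (0, 0)]
    labels = ["high", "high", "low", "low"]
    print(naive_bayes_predict(train, labels, (1, 1)))   # -> "high"
```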
2.4 Bayesian Classifier Using P-Tree
A P-Tree is a lossless, compressed, data-mining-ready data structure which maintains the raster spatial order of the original data. There are many variations of P-Trees. P-Trees built from bSQ files are called basic P-Trees. The symbol Pi,j denotes a P-Tree built from band i and bit j. A function COMP(Pi,j), defined for P-Tree Pi,j, gives the complement of the P-Tree Pi,j. We can build a value P-Tree Pi,v, denoting band i and value v, which gives us the P-Tree for a specific value v in the original band data. The tuple P-Tree Px1..xn denotes the P-Tree of the tuple x1...xn. The function RootCount gives the number of occurrences of a particular pattern in a P-Tree: for a basic P-Tree it gives the number of 1s in the tree, and for a value P-Tree or a tuple P-Tree it gives the number of occurrences of that value or tuple [1].
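The operations of Sections 2.2 and 2.4 can be mimicked with flat bit vectors. The sketch below is our own simplification (names such as basic_ptrees, value_ptree and root_count are assumptions): each P-Tree is treated as an uncompressed 0/1 array, so AND, COMP and RootCount become element-wise operations rather than quadrant-tree traversals.

```python
import numpy as np

def basic_ptrees(band, nbits=8):
    """Split one band (uint8 values) into bit planes P_i,1 .. P_i,nbits (MSB first)."""
    return [((band >> (nbits - 1 - j)) & 1).astype(np.uint8) for j in range(nbits)]

def value_ptree(bit_planes, v, nbits=8):
    """P_i,v: AND of P_i,j where bit j of v is 1, and COMP(P_i,j) where it is 0."""
    result = np.ones_like(bit_planes[0])
    for j in range(nbits):
        bit = (v >> (nbits - 1 - j)) & 1
        plane = bit_planes[j] if bit else bit_planes[j] ^ 1   # COMP(P_i,j)
        result &= plane
    return result

def root_count(ptree):
    """RootCount: the number of 1s, i.e. occurrences of the pattern."""
    return int(ptree.sum())

if __name__ == "__main__":
    band = np.array([0b11010011, 0b11010011, 0b00000001], dtype=np.uint8)
    planes = basic_ptrees(band)
    print(root_count(value_ptree(planes, 0b11010011)))   # -> 2 pixels match
```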
2.5 Calculating the probabilities using P-Trees
We showed that to classify a tuple X = (x1,...,xn) we need to find the value of P(xk|Ci). To find the values of sixk and si we need the value P-Trees Pk,xk (value P-Tree of band k, value xk) and PC,ci (value P-Tree of the class label band C, value ci):

sixk = RootCount[(Pk,xk) AND (PC,ci)],  si = RootCount[PC,ci] ... (3)

In this way we find the value of all the probabilities in (2) and can use (1) to classify a new tuple X. To improve the accuracy, we can find the value in (2) directly by calculating the tuple P-Tree Px1..xn (tuple P-Tree of the values x1...xn), which is simply (P1,x1) AND ... AND (Pn,xn). Now P(X|Ci) = si(x1..xn) / si, where

si(x1..xn) = RootCount[(Px1..xn) AND (PC,ci)] ... (4)

By doing this we are finding the probability of occurrence of the whole tuple in the training data. We do not need to care about the inter-dependency between different bands (attributes). In this way, we can find the value of P(X|Ci) without the naïve assumption of class conditional independence, which will improve the accuracy of the classifier. Thus we can keep the simplicity of naïve Bayesian classification as well as obtain higher accuracy [1].
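Continuing the flat-bit-vector simplification from Section 2.4, equations (3) and (4) amount to AND-ing value P-Trees and taking root counts. The helpers below (tuple_ptree, class_conditional) and the toy masks are our own illustrative assumptions.

```python
import numpy as np

def tuple_ptree(value_ptrees):
    """P_x1..xn = P_1,x1 AND ... AND P_n,xn (one 0/1 mask per band)."""
    result = value_ptrees[0].copy()
    for p in value_ptrees[1:]:
        result &= p
    return result

def class_conditional(tuple_pt, class_pt):
    """P(X|Ci) = RootCount(P_x1..xn AND P_C,ci) / RootCount(P_C,ci)."""
    s_i = int(class_pt.sum())                       # s_i, eq. (3)
    if s_i == 0:
        return 0.0
    s_ix = int((tuple_pt & class_pt).sum())         # s_i(x1..xn), eq. (4)
    return s_ix / s_i

if __name__ == "__main__":
    # toy masks, one entry per training pixel
    p1_x1 = np.array([1, 1, 0, 1], dtype=np.uint8)  # pixels with value x1 in band 1
    p2_x2 = np.array([1, 0, 0, 1], dtype=np.uint8)  # pixels with value x2 in band 2
    p_c   = np.array([1, 1, 0, 1], dtype=np.uint8)  # pixels labelled class ci
    print(class_conditional(tuple_ptree([p1_x1, p2_x2]), p_c))   # -> 2/3
```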
3 PROPOSED APPROACH
3.1 Band based approach
The existing algorithm uses a band-based approach to calculate the Bayesian probability values for each class. As long as there are matching patterns in the training data, the Bayesian conditional probability values can be calculated very easily with the use of P-Trees. In the event that no matching patterns are found in the training data for the given pattern, the existing approach removes from the given pattern the attribute with the least information gain relative to the class label, and then tries to find a match in the training set. This is continued until a set of matching patterns is found for the given pattern. The Bayesian probabilities are then calculated from the partial probabilities for this partial pattern and for the attributes discarded earlier in order to find a match, and these probabilities are combined using the naïve assumption. As shown in the previous work [1], this is a better approach than using the naïve assumption throughout. One experimental observation is the decrease in accuracy with an increase in the use of the naïve assumption; this observation is shown in the performance evaluation in Section 4. Another option is to ignore the naïve assumption completely and do the classification with the conditional probability values for the partial pattern alone; this improves the performance slightly, as shown in the next section. This prompts us to look for an approach that does not use the naïve assumption and also keeps most of the information of the given pattern while trying to find a match, i.e. a better mechanism for relaxing the search for the given pattern.
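As a rough sketch of the band-based fallback just described, the following code (our own, using plain tuple scans instead of P-Tree ANDs; the name band_based_conditional and the toy data are assumptions) drops the least informative attributes until the remaining partial pattern occurs in the training data, then re-attaches the dropped attributes with the naïve assumption.

```python
def band_based_conditional(x, c, train, labels, info_gain_order):
    """Approximate P(X|c) with the band-based fallback.

    x      : tuple of attribute values for the unknown sample
    c      : the class whose conditional probability is wanted
    train  : list of training attribute tuples
    labels : class labels aligned with train
    info_gain_order : attribute indices, least informative first
    """
    keep = set(range(len(x)))
    dropped = []

    def matches(t):
        return all(t[k] == x[k] for k in keep)

    # relax the pattern until at least one training tuple matches it
    for k in info_gain_order:
        if any(matches(t) for t in train):
            break
        keep.discard(k)
        dropped.append(k)

    s_i = sum(1 for lab in labels if lab == c)
    if s_i == 0:
        return 0.0
    s_ix = sum(1 for t, lab in zip(train, labels) if lab == c and matches(t))
    p = s_ix / s_i
    for k in dropped:                    # naive assumption for dropped bands
        s_ixk = sum(1 for t, lab in zip(train, labels) if lab == c and t[k] == x[k])
        p *= s_ixk / s_i
    return p

if __name__ == "__main__":
    train = [(1, 0, 0), (1, 1, 2), (0, 1, 0)]
    labels = ["a", "a", "b"]
    # (1, 1, 0) never occurs, so attribute 2 is dropped before a match is found
    print(band_based_conditional((1, 1, 0), "a", train, labels, [2, 1, 0]))   # -> 0.25
```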
3.2 Bit based approach
This approach is motivated by the requirement to completely avoid the use of the naïve assumption in the probability calculations. We propose to transform the problem of calculating the probability values for a given unknown pattern into calculating the probability values for a known, similar pattern. This similar pattern is selected by removing the least significant bits from the attribute space. The order in which the attribute bits are removed is determined by calculating the information gain for the respective bit of each band.

Consider the following example of calculating the Bayesian conditional probability value for the pattern [10,01] in a two-attribute space. Assume that the information gain value for the first significant bit of band R is less than that of band G, and that for the second significant bit the information gain value of band G is less than that of band R. Initially the algorithm searches for the pattern as shown in Figure 2a. If the pattern is not found, the search is repeated for [1_,01], considering the information gain values for the second significant bit; the search space is increased as shown in Figure 2b. The search then continues for the pattern [1_,0_] as shown in Figure 2c and, by considering the information gain value for the first significant bit, for the pattern [1_,__] as shown in Figure 2d.

[Figure 2. Expansion of the search space in the two-attribute (R, G) space as least significant bits are dropped: (a) [10,01], (b) [1_,01], (c) [1_,0_], (d) [1_,__].]

In the algorithm, the above process is done in reverse order. We start building a value P-Tree for the given pattern from the most significant bits of each attribute and compute the conditional probabilities for each class. We continue to add bits to the value P-Tree until we obtain a conditional probability value for a class that is greater than a given threshold probability. If such a class is found, it is a clear winner against all the other classes provided the probability threshold is greater than 0.5; in our experiments we used a value of 0.85. If the algorithm is unable to find a winner above the given threshold, the class that recorded the highest probability in a preceding step is used as the class
label. We do not use any weighting on this with respect to the number of bits in the calculation. This is one area that can be modified to further improve the classification.
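The bit-based procedure can be sketched as follows. This is our own illustrative code (names such as bit_based_classify are assumptions), using plain bit arrays rather than P-Tree ANDs, and taking the bit-level information gain ordering as a precomputed input.

```python
import numpy as np

def bit_based_classify(x_bits, train_bits, labels, bit_order, threshold=0.85):
    """Add attribute bits, most informative first, until a class dominates.

    x_bits     : dict (band, bit) -> 0/1, the unknown sample's bits
    train_bits : dict (band, bit) -> array of 0/1 over all training pixels
    labels     : array of class labels, one per training pixel
    bit_order  : (band, bit) keys sorted by decreasing information gain
    threshold  : stop as soon as some class probability exceeds this value
    """
    n = len(labels)
    mask = np.ones(n, dtype=bool)        # pixels matching the bits used so far
    classes = np.unique(labels)
    best_class, best_prob = classes[0], 0.0
    for key in bit_order:                # add one more significant bit each step
        mask &= (train_bits[key] == x_bits[key])
        total = mask.sum()
        if total == 0:                   # pattern vanished; keep the previous best
            break
        for c in classes:
            prob = (mask & (labels == c)).sum() / total   # P(class | partial X)
            if prob > best_prob:
                best_class, best_prob = c, prob
        if best_prob > threshold:        # clear winner, stop adding bits
            break
    return best_class

if __name__ == "__main__":
    labels = np.array(["hi", "hi", "lo", "lo"])
    train_bits = {("R", 1): np.array([1, 1, 0, 0]),   # most significant bit of R
                  ("G", 1): np.array([1, 0, 0, 1])}
    x_bits = {("R", 1): 1, ("G", 1): 1}
    print(bit_based_classify(x_bits, train_bits, labels, [("R", 1), ("G", 1)]))
```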
4 PERFORMANCE EVALUATION
In this section we present a comparative analysis of the proposed method for increasing the accuracy of classification. The performance of the classifier is compared with the previous technique that partially uses the naïve assumption, with the band based approach without the naïve assumption, and with a KNN classifier that uses a Euclidean distance metric. It is also compared with a Bayesian belief network classifier. Finally, we examine how the classification time varies with the training sample size.
4.1 Sample data
The experimental data was extracted from two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, United States. The images were taken in 1997 and 1998. Each image contains three bands: red, green and blue reflectance values. Three other files contain synchronized soil moisture, nitrate and yield values.

4.2 Classification Accuracy
The motivation for this paper is to improve the classification accuracy of the existing P-Tree based Bayesian classification approach.

Table 1 Comparison of the band based and the bit based classification

Sig. Bits   Band based, Naïve Ass.            Bit based, Threshold 0.85
            Acc.     Use of Naïve Ass.        Acc.     Avg. # of bits used
4           72 %     0 %                      76 %     3.89
5           63 %     5 %                      74 %     4.52
6           26 %     28 %                     74 %     5.05

The above table shows the comparison between the band based classification and the bit based classification. The percentage of attempts that required the naïve assumption grows with the increase in the number of significant bits: when there are more significant bits in the attributes, it is less likely that a match is found among the training patterns, which increases the number of times the naïve assumption has to be made. It is clear from the accuracy statistics that the accuracy of the band based classification with the naïve assumption (the previous approach) falls as the percentage of attempts using the naïve assumption increases. As indicated in the previous section, the values in the table clearly indicate the necessity of another mechanism. The table shows a clear difference in accuracy between the band based mechanism and the bit based mechanism, while the effect of different bin sizes (numbers of significant bits) on the accuracy of the bit based method is minimal. The average number of bits used per classification indicates how many bits were needed before the threshold was satisfied. Note that the average number of bits required is less than the number of significant bits for the last test case. This shows the ability of the bit based method to avoid situations in which no matching pattern is found for the given pattern in the training set, by looking for a pattern without the least significant bit in most of the attributes. It is clear that the new approach is better.

It is also important to compare the classification accuracy with other classification techniques. The following diagram shows the comparison of the new approach with a Euclidean distance based KNN, the band based approach and the band based approach without the naïve assumption.

[Figure 3. Classification accuracy for the '97 data (0-90 %) versus training data size (1K, 4K, 16K, 65K and 260K pixels) for the Bit, Band, Naïve-Band and KNN-Euc. classifiers.]
Figure 3 shows the comparison of the classification accuracy. It is clearly seen that the new approach outperforms the other approaches for both data sets. This evaluation was done with 5 significant bits of information for each attribute band. The two band based approaches show a similar degree of accuracy because the use of the naïve assumption is minimal with 5 significant bits. With an increase in the training data size, the possibility of finding a set of matching patterns increases, which contributes to an improvement in the classification accuracy for the band based approach, as seen in the figure. The KNN approach shows a steady degree of accuracy with respect to the training data size. The new approach shows a similar trend at a higher level of accuracy, which demonstrates the ability of the classifier to perform at a reasonable level even with a smaller training data set.
The accuracy of the approach was also compared with an existing Bayesian belief network classifier, J. Cheng's Bayesian belief network available at [9], which was the winning entry in the KDD Cup 2001 data mining competition. The developer claims that the classifier can perform with or without domain knowledge; domain knowledge is a very important factor in most Bayesian belief network implementations. For this comparison, smaller training data sets ranging from 4K to 16K pixels were used due to the inability of the implementation to handle larger data sets on a Pentium II machine.
Training Size (pixels)   P-Tree Based   Bayesian Belief
4000                     66 %           26 %
16000                    67 %           51 %

Table 2 Comparison with a Bayesian belief network
The above table shows the comparison between the P-Tree based approach and the Bayesian belief network approach for the 1997 data set. It is clearly seen that the P-Tree based approach is better. The belief network was built without using any domain knowledge, to make it comparable with the P-Tree approach, and it was allowed to use the maximum required amount of storage space to keep the conditional probability values. Attribute data was broken into 30 equal-sized bins to compare with the 5 significant bits (32 bins) used for the P-Tree approach.
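For reference, keeping the 5 most significant bits of an 8-bit reflectance value is a one-line operation (illustrative Python; the helper name to_bin is our own):

```python
def to_bin(value, sig_bits=5, total_bits=8):
    """Keep the sig_bits most significant bits of an 8-bit value: 2**5 = 32 bins."""
    return value >> (total_bits - sig_bits)

print(to_bin(0b11010011))   # 0b11010 = 26, one of 32 possible bins
```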
4.3 Classification Time
The following figure shows the classification time for a single classification. The P-Tree based Bayesian classifier is a lazy classifier, so, like any other lazy classifier, the P-Tree approach does not require any build time. As with most lazy classifiers, the classification time per tuple varies with the number of items in the training set because the training data has to be scanned. The P-Tree approach does not require a traditional data scan, but the effect on the computation time can still be observed. The data for the figure was collected using 5 significant bits and a threshold probability of 0.85. The time is given for scalability comparison purposes; other work related to P-Trees has shown the space-time efficiency of using them for data mining [1],[5],[6],[7].

[Figure. Variation of classification time with training sample size: time (ms) versus training sample size (pixels).]
5 CONCLUSIONS
In this paper, we have shown that the use of the
naïve assumption reduces the accuracy of the
classification in this particular application domain. We
present an approach to increase the accuracy of a P-Tree
based Bayesian classifier by completely eliminating the
naïve assumption. The new approach has a better
accuracy than the existing P-Tree based Bayesian
classifier. It was also shown that it is better than a
Bayesian belief network implementation and a Euclidean distance based KNN approach. The new approach has the same computational cost as the previous approach in terms of the P-Tree operations used, and it is scalable with respect to the size of the data set.
REFERENCES
[1] Md. Hossain, 'Bayesian Classification on Spatial Data Streams Using P-Trees', Dept. of Computer Science, NDSU, Dec. 2002.
[2] Q. Ding, M. Khan, A. Roy, W. Perrizo, 'P-Tree Algebra', ACM SAC, Madrid, pp. 426-431, Mar. 2002.
[3] J. Han, M. Kamber, 'Data Mining: Concepts and Techniques', Morgan Kaufmann, 2001.
[4] W. Perrizo, 'Peano Count Tree Technology', Technical Report NDSU-CSOR-TR-01-1, 2001.
[5] M. Khan, Q. Ding, W. Perrizo, 'K-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees', PAKDD 2002, Taipei, pp. 517-528, May 2002.
[6] Q. Ding, Qia. Ding, W. Perrizo, 'ARM on RSI Using P-Trees', PAKDD 2002, Taipei, pp. 66-79, May 2002.
[7] Q. Ding, Qia. Ding, W. Perrizo, 'Decision Tree Classification of Spatial Data Streams Using P-Trees', ACM SAC, Madrid, pp. 426-431, Mar. 2002.
[8] DataSURG web site, http://datasurg.ndsu.edu/, May 2002.
[9] J. Cheng, 'J. Cheng's Bayesian Belief Network', http://www.cs.ualberta.ca/~jcheng/, May 2002.