IEEE International Multi-topic Conference (INMIC), Dec. 24-26, 2004, NUCES FAST, Lahore, Pakistan

Bayesian Classification for Spatial Data Using P-tree*

Mohammad Kabir Hossain(1), Rajibul Alam(1), Abu Ahmed Sayeem Reaz(1) and Dr. William Perrizo(2)
(1) Department of Computer Science and Engineering, North South University, Dhaka-1213, Bangladesh
(2) Department of Computer Science, North Dakota State University, Fargo, ND 58105, USA
mkhossain@northsouth.edu, ronim@agni.com, sayeem_reaz@yahoo.com, william.perrizo@ndsu.nodak.edu

* Patents are pending at North Dakota State University on the P-tree technology.

Abstract

Classification of spatial data can be difficult with existing methods because spatial data sets are both numerous and large, and such large volumes of data require a huge amount of memory and/or time. The task becomes even more difficult when we consider continuous spatial data streams. In this paper, we address this challenge using the Predicate tree (also known as the Peano Count Tree, or P-tree), which provides a lossless, compressed, and data-mining-ready representation (data structure) for spatial data. We demonstrate how P-trees can improve the classification of spatial data when using a Bayesian classifier. We also introduce the use of information gain calculations with Bayesian classification to improve its accuracy. A P-tree based Bayesian classifier not only makes classification more effective on spatial data but also reduces the build time of the classifier considerably. This improvement in build time makes it feasible for use with streaming data.

1. Introduction

Classification is an important data-mining technique. Given an unclassified data sample, it predicts the class to which the sample belongs. Suppose we are given a relation R(k1..kn, A1, ..., An, C), where the ki's are the structural attributes of relation R and A1, ..., An, C are feature attributes, with C the class label attribute. Given an unclassified data sample (having a value for every attribute except C), a classification technique will predict the C-value for the sample and thus determine its class. In the case of spatial data, the ki's in R usually represent a location (or pixel) over a space, and each Ai is a descriptive attribute of the locations.

In this paper, we explore the technique of Bayesian classification, which is based on Bayes' theorem, and we introduce a novel technique called non-naïve Bayesian classification. To improve the accuracy of the classifier, we combine the naïve and non-naïve techniques and call the result partial-naïve classification. We introduce the concept of information gain for selecting the attributes that contribute most to deciding the class label of an unclassified sample. We also use a new data-mining-ready data structure to implement our idea. It reduces the cost of scanning the database; in fact, the P-tree [1, 2] allows us to compute the various counts without scanning the database at all (patents are pending on the P-tree technology).

2. Related Concepts

There are two types of Bayesian classification techniques: naïve Bayesian and the Bayesian belief network. Bayesian classification is a statistical classifier based on Bayes' theorem [3]. It assigns to a given sample a probability of belonging to each class. In naïve Bayesian classification, the "class conditional independence" assumption is made to reduce the complexity of calculating the probabilities.
This technique works well when the attributes are independent of each other and performs poorly when this assumption does not hold. The Bayesian belief network, on the other hand, is a graphical model [4]. It represents the dependencies among subsets of attributes in a directed acyclic graph, and a table called the conditional probability table (CPT) is constructed; from the CPT, the probabilities needed to predict the class label of an unknown sample are calculated. For a belief network, building the network and creating the CPT are computationally very complex [3, 4]. Calculating the probabilities in the traditional way is also time consuming, because the entire database must be scanned again and again.

In general, two techniques are widely used to increase classifier accuracy [3]: bagging and boosting. In these methods, a series of already-learned classifiers C1, ..., Cn is combined to create an improved classifier C. Many techniques have been proposed to increase the accuracy of naïve Bayesian classification. In [5], the authors showed that boosting cannot improve the accuracy of the naïve Bayesian classifier, on average, over a set of natural domains. They proposed introducing tree structures into the naïve Bayesian classifier to improve its performance. Although they claimed that their technique achieved a lower average error than the naïve Bayesian classifier, they admitted that it increases the average error of individual models. In [6], the authors introduced a lazy semi-naïve Bayesian learning technique to overcome the problem of class conditional independence; their method is very efficient when a single classifier is applied to classify only a few cases. In [7], the authors introduced a technique that performs attribute elimination and test-case selection using cross-validation to generate a rule for each test case.

3. Proposed System

3.1. Introduction to P-trees

We explain the P-tree in terms of how it organizes spatial data. A P-tree is a quadrant-based tree. It recursively divides an entire image into quadrants and records the count of 1-bits in each quadrant, thus forming a quadrant count tree. P-trees are somewhat similar in construction to other data structures in the literature (e.g., Quad-trees [8] and HH-codes [10]). For example, given an 8-row, 8-column image of single bits, its P-tree is as shown in Figure 1. In this example, 55 is the number of 1's in the entire image; this root level is labeled level 0. The numbers 16, 8, 15, and 16 found at the next level (level 1) are the 1-bit counts for the four major quadrants in raster order, or Z order (upper left, upper right, lower left, lower right). Since the first and last level-1 quadrants are composed entirely of 1-bits (called pure-1 quadrants), sub-trees are not needed, and these branches terminate.

Figure 1: An 8x8 image and its P-tree.

Similarly, quadrants composed entirely of 0-bits are called pure-0 quadrants and also cause tree branches to terminate. This pattern continues recursively using Peano, or Z, ordering (recursive raster ordering) of the four sub-quadrants at each new level. Eventually, every branch terminates (since, at the leaf level, all quadrants are pure). If we were to expand all sub-trees, including those for pure quadrants, the leaf sequence would be the Peano ordering of the image. Thus we use the name Peano Count Tree [1].
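As an illustration of this construction, the following minimal Python sketch (our own illustration, not the authors' implementation; the PTreeNode class and build_ptree function are hypothetical names) builds a quadrant count tree from an 8x8 bit array, terminating branches at pure-0 and pure-1 quadrants.

```python
# A minimal sketch of Peano Count Tree construction (names are illustrative).
class PTreeNode:
    def __init__(self, count, size, children=None):
        self.count = count          # number of 1-bits in this quadrant
        self.size = size            # quadrant covers size x size pixels
        self.children = children    # None for pure (terminated) quadrants

def build_ptree(bits, row=0, col=0, size=8):
    """Recursively build a quadrant count tree from a 2-D list of 0/1 values."""
    count = sum(bits[r][col:col + size].count(1) for r in range(row, row + size))
    if count == 0 or count == size * size or size == 1:
        return PTreeNode(count, size)           # pure-0, pure-1, or single pixel: stop
    half = size // 2
    children = [build_ptree(bits, r, c, half)   # four sub-quadrants in Z (Peano) order
                for r, c in ((row, col), (row, col + half),
                             (row + half, col), (row + half, col + half))]
    return PTreeNode(count, size, children)

# Example: an 8x8 bit plane with 1's in the left half gives a root count of 32
# and pure children for all four level-1 quadrants.
image = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(8)]
root = build_ptree(image)
print(root.count, [c.count for c in root.children])   # 32 [16, 0, 16, 0]
```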
If the internal nodes of the tree simply record the minimum of information, namely the truth of the universal quadrant predicate "purely 1-bits," then the P-tree is an absolutely minimal lossless representation of the data, called a Predicate tree (also a P-tree). We note further that one can easily construct the count tree from the universal predicate tree by summing from the bottom up in one in-order scan of the tree.

A spatial image can be viewed as a 2-dimensional array of pixels. Associated with each pixel are various descriptive attributes, called "bands." For example, a BMP image has three bands, namely Blue, Green, and Red. An image can also be viewed as a relational table in which each pixel is a tuple and each band is an attribute; the primary key can be expressed as the x-y coordinates.

3.2. Bit Sequential Format

Data in a relational table must be mapped to bit Sequential (bSQ) format for P-tree generation. Suppose that, in spatial data, a reflectance value in a band is a number in the range 0-255 and is represented by a byte. We split each band into eight separate files, one for each bit position. Figure 2 gives a very simple illustrative example of bSQ format with only two bands in a scene having only four pixels (two rows and two columns). P-trees are basically Z-order, run-length-compressed representations of the bSQ files. Thus, we store the database not in relational tables but in bSQ files.

Figure 2: bSQ format for a two-band 2x2 image.

3.3. Basic P-trees and Operations

We reorganize each bit file of the bSQ format into a basic P-tree. The definition of a basic P-tree is given in [1] as follows.

Definition 1: A basic P-tree Pi,j is the P-tree for the j-th bit of the i-th band.

The basic P-trees can be combined using simple logical operations (AND, OR and COMPLEMENT) to produce P-trees for the original values at any level of precision.

3.4. Value and Tuple P-trees

Definition 2: A value P-tree Pi(v) is the P-tree of value v at band i. Value v can be expressed in 1-bit up to 8-bit precision [8]. Value P-trees can be constructed by ANDing basic P-trees or their complements. For example,

Pi(110) = Pi,1 AND Pi,2 AND Pi,3'

Definition 3: A tuple P-tree P(v1, v2, ..., vn) is the P-tree of value vi at band i, for all i from 1 to n [6]. We have

P(v1, v2, ..., vn) = P1(v1) AND P2(v2) AND ... AND Pn(vn)

3.5. Root Count

The total number of 1's in a P-tree is called its root count. The root count of a P-tree indicates the total number of 1's in the image from which the P-tree was built.
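To make Sections 3.2-3.5 concrete, the following Python sketch (illustrative only; for brevity each "P-tree" is held as an uncompressed bit list rather than the compressed quadrant tree, and all function names are ours) splits a band into bSQ bit planes, builds a value P-tree by ANDing basic P-trees or their complements, and computes its root count.

```python
# Illustrative sketch of bSQ decomposition, value P-trees, and root counts.
# A "P-tree" is held here as a flat list of bits; the real structure would be
# the compressed quadrant count tree of Section 3.1.

def to_bsq(band_values, bits=8):
    """Split one band (a list of 0-255 pixel values) into 'bits' bit planes."""
    return [[(v >> (bits - 1 - j)) & 1 for v in band_values] for j in range(bits)]

def complement(p):
    return [1 - b for b in p]

def p_and(*ptrees):
    return [min(bs) for bs in zip(*ptrees)]

def root_count(p):
    return sum(p)

def value_ptree(bit_planes, value, precision):
    """Value P-tree for 'value' at the given precision (Definition 2):
    AND the basic P-trees, complementing those where the value bit is 0."""
    planes = []
    for j in range(precision):
        bit = (value >> (precision - 1 - j)) & 1
        planes.append(bit_planes[j] if bit else complement(bit_planes[j]))
    return p_and(*planes)

# Example: a band of four pixels; P(110) at 3-bit precision marks the pixels
# whose top three bits are 110 (i.e., values in the range 192-223).
band = [200, 65, 210, 130]
bsq = to_bsq(band)
p110 = value_ptree(bsq, 0b110, 3)
print(p110, root_count(p110))   # [1, 0, 1, 0] 2
```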
3.6. Bayesian Classification

A Bayesian classifier is a statistical classifier based on Bayes' theorem, which can be stated as follows. Let X be a data sample whose class label is unknown, and let H be a hypothesis (e.g., that X belongs to class C). P(H|X) is the posterior probability of H given X and P(H) is the prior probability of H; then

P(H|X) = P(X|H) P(H) / P(X)

where P(X|H) is the posterior probability of X given H and P(X) is the prior probability of X.

Bayesian classification uses this theorem in the following way. Each data sample is represented by a feature vector X = (x1, ..., xn) depicting the measurements made on the sample for attributes A1, ..., An, respectively. Given classes C1, ..., Cm, the Bayesian classifier predicts that an unknown data sample X (with no class label) belongs to the class having the highest posterior probability conditioned on X, i.e., the class Cj such that

P(Cj|X) > P(Ci|X)   for all i != j    (1)

Now, according to Bayes' theorem,

P(Cj|X) = P(X|Cj) P(Cj) / P(X)    (2)

Since P(X) is constant for all classes, only P(X|Cj) P(Cj) needs to be maximized. In naïve Bayesian classification, the naïve assumption of "class conditional independence" is made to reduce the computational complexity of calculating all the P(X|Cj)'s: the value of an attribute is assumed to be independent of the values of all the others. Therefore,

P(X|Cj) = PRODUCT over k = 1..n of P(xk|Cj)    (3)

For categorical attributes, P(xk|Ci) = si(xk)/si, where si is the number of training samples in class Ci and si(xk) is the number of training samples of class Ci having the value xk for attribute Ak.

3.7. Calculating Probabilities Using P-trees

We showed above that, to classify a tuple X = (x1, ..., xn), we need the values P(xk|Ci) = si(xk)/si. To find si(xk) and si, we need two value P-trees: Pk,xk (the value P-tree of band k for value xk) and PC,Ci (the value P-tree of the class label band C for value Ci). From these two value P-trees,

si(xk) = RootCount[Pk,xk AND PC,Ci]
si = RootCount[PC,Ci]    (4)

In this way, we can find all the probabilities in (3) and use (1) and (2) to classify a new tuple X.

3.8. Avoiding the Naïve Assumption of "Class Conditional Independence"

The naïve assumption of class conditional independence introduces some error into the classification, because the assumption that the value of one attribute is independent of the values of the others does not hold in all cases. The assumption is made only to simplify the calculation of the probability in equation (2) by means of (3). To improve the accuracy of the classifier, we can instead compute the value in (3) directly by building the tuple P-tree Px1..xn (the tuple P-tree of values (x1, ..., xn)), which is simply

Px1..xn = P1,x1 AND ... AND Pn,xn

Then

P(X|Ci) = si(x1..xn) / si    (5)

where si(x1..xn) = RootCount[P1,x1 AND ... AND Pn,xn AND PC,Ci].

By computing this tuple P-tree, we obtain the probability of occurrence of the entire tuple in the data sample, so we do not need to worry about interdependency between different bands (attributes). In this way, we can find the value of P(X|Ci) without the naïve assumption of class conditional independence, which improves the accuracy of the classifier: we keep the simplicity of naïve Bayesian and obtain higher accuracy. We call this process the non-naïve approach.

One problem with this approach is that, if the tuple X we want to classify is not present in our data set, the root count of Px1..xn will be zero and the value of P(X|Ci) will be zero as well; in that case we cannot classify the tuple. To deal with this problem, we introduce the measure of information gain for the attributes of the data into our classification.
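Before turning to information gain, the following Python sketch (illustrative; value P-trees are again plain bit lists, and the function names are ours) contrasts the naïve estimate of equations (3)-(4) with the non-naïve tuple estimate of equation (5), both computed purely from root counts of ANDed value P-trees.

```python
# Illustrative sketch of equations (3)-(5): class-conditional probabilities
# from root counts. Value P-trees are represented as uncompressed bit lists.

def p_and(*ptrees):
    return [min(bs) for bs in zip(*ptrees)]

def root_count(p):
    return sum(p)

def naive_likelihood(value_ptrees, class_ptree):
    """Equations (3)-(4): product over bands of s_i(x_k) / s_i."""
    s_i = root_count(class_ptree)
    if s_i == 0:
        return 0.0
    prob = 1.0
    for p_k in value_ptrees:                      # one value P-tree per band
        prob *= root_count(p_and(p_k, class_ptree)) / s_i
    return prob

def non_naive_likelihood(value_ptrees, class_ptree):
    """Equation (5): root count of the whole tuple P-tree ANDed with the class."""
    s_i = root_count(class_ptree)
    if s_i == 0:
        return 0.0
    tuple_ptree = p_and(*value_ptrees)            # P_{x1..xn}
    return root_count(p_and(tuple_ptree, class_ptree)) / s_i

# Toy example: 6 pixels, the value P-trees of the two band values in the sample
# being classified, and the value P-tree of one class.
p_band1 = [1, 1, 0, 0, 1, 1]      # pixels whose band-1 value matches x1
p_band2 = [1, 1, 0, 0, 1, 0]      # pixels whose band-2 value matches x2
p_class = [1, 1, 1, 1, 0, 0]      # pixels belonging to class Ci

print(naive_likelihood([p_band1, p_band2], p_class))      # (2/4) * (2/4) = 0.25
print(non_naive_likelihood([p_band1, p_band2], p_class))  # 2/4 = 0.5
```

The gap between 0.25 and 0.5 in this toy example is exactly the dependence between the two bands that the naïve assumption ignores.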
3.9. Calculation of Information Gain

Suppose the class label attribute C has m distinct values (classes) Ci, i = 1, ..., m, and let si be the number of samples in class Ci. The information needed to classify a given sample is

I(s1, ..., sm) = - SUM over i = 1..m of pi log2(pi)    (6)

where pi = si/s is the probability that a sample belongs to Ci. Let attribute A have v distinct values {a1, ..., av}. The entropy, or expected information, based on the partition into subsets by A is

E(A) = SUM over j = 1..v of [(s1j + ... + smj)/s] I(s1j, ..., smj)    (7)

where sij is the number of samples in class Ci that have the value aj for attribute A, and

I(s1j, ..., smj) = - SUM over i = 1..m of pij log2(pij)    (8)

where pij = sij/sj and sj is the number of samples having the value aj for A. The information gain is then

Gain(A) = I(s1, ..., sm) - E(A)    (9)

i.e., the expected reduction of entropy caused by knowing the value of A. The values of si, sj, and sij can all be found with P-tree algebra:

si = RootCount(PC,Ci)
sj = RootCount(PA,aj)
sij = RootCount(PC,Ci AND PA,aj)

where PC,Ci is the value P-tree of the class label attribute with value Ci and PA,aj is the value P-tree of attribute A with value aj.

3.10. Use of Information Gain in Our Method

Returning to the problem in our classification technique: when the root count of Px1..xn is zero, we can form a tuple P-tree over a reduced tuple. To do that, we remove one attribute Ak from the tuple P-tree and form Px1..xi..xn, where i = 1 to n but i != k. We select k such that band Ak has a lower information gain than any other band, so (3) becomes

P(X|Ci) = P(x1..xi..xn|Ci) * P(xk|Ci),   i != k    (10)

P(x1..xi..xn|Ci) can be calculated by (5) without using band k, and P(xk|Ci) can be calculated by (4). Here we are treating the attributes A1..Ai..An (i != k) non-naïvely and Ak naïvely, and we call this partial-naïve classification. If the root count of Px1..xi..xn is still zero, we remove the attribute with the next lowest information gain and proceed in the same way. The general form of (10) is

P(X|Ci) = P(x1..xk..xn|Ci) * PRODUCT over j of P(xj|Ci)    (11)

where P(x1..xk..xn|Ci) is calculated from (5) over the maximum number of bands for which RootCount[Px1..xk..xn] != 0 (the retained bands k, none of which appear among the j), and j ranges over the minimum number of removed bands, chosen as those having the lowest information gain. Each P(xj|Ci) is calculated from (4).
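A rough Python sketch of Sections 3.9-3.10 follows (illustrative only; P-trees are uncompressed bit lists and the helper names are ours). It computes Gain(A) from root counts and then applies the attribute-dropping strategy: the lowest-gain bands are removed one by one until the reduced tuple P-tree has a nonzero root count, and the removed bands are treated naïvely.

```python
# Illustrative sketch of Sections 3.9-3.10: information gain from root counts
# and the partial-naive back-off when the full tuple P-tree has root count zero.
from math import log2

def p_and(*ptrees):
    return [min(bs) for bs in zip(*ptrees)]

def root_count(p):
    return sum(p)

def entropy(counts):
    """I(s_1,..,s_m) of equations (6) and (8)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(class_ptrees, attr_value_ptrees):
    """Gain(A) of equation (9), using only root counts of ANDed value P-trees."""
    s_i = [root_count(p) for p in class_ptrees]      # samples per class
    s = sum(s_i)                                     # total number of samples
    gain = entropy(s_i)                              # I(s_1,..,s_m), eq. (6)
    for p_aj in attr_value_ptrees:                   # one value P-tree per value a_j
        s_ij = [root_count(p_and(p_c, p_aj)) for p_c in class_ptrees]
        s_j = sum(s_ij)
        if s_j:
            gain -= (s_j / s) * entropy(s_ij)        # contribution to E(A), eq. (7)
    return gain

def partial_naive_likelihood(value_ptrees, gains, class_ptree):
    """Equations (10)-(11): drop the lowest-information-gain bands until the
    reduced tuple P-tree has a nonzero root count; treat dropped bands naively."""
    s_i = root_count(class_ptree)
    if s_i == 0:
        return 0.0
    kept = sorted(range(len(value_ptrees)), key=lambda b: gains[b], reverse=True)
    dropped = []
    while kept:
        joint = root_count(p_and(*(value_ptrees[b] for b in kept), class_ptree))
        if joint > 0:
            prob = joint / s_i                       # non-naive part, eq. (5)
            for b in dropped:                        # naive factors, eq. (4)
                prob *= root_count(p_and(value_ptrees[b], class_ptree)) / s_i
            return prob
        dropped.append(kept.pop())                   # drop the lowest-gain band next
    return 0.0

# Toy example with three bands and two classes. Each band is treated here as a
# binary attribute (its value matches the sample's value or not) when computing
# the gains that decide which band is dropped first.
p_c = [[1, 1, 0, 0], [0, 0, 1, 1]]                   # class value P-trees
bands = [[1, 0, 1, 0], [1, 1, 0, 1], [0, 1, 1, 0]]   # value P-trees for the sample
gains = [info_gain(p_c, [b, [1 - x for x in b]]) for b in bands]
print(partial_naive_likelihood(bands, gains, p_c[0]))   # 0.25 for these toy P-trees
```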
4. Experimental Results

4.1. Performance of P-trees

In our application, we used the AND [2] and COMPLEMENT operations. Of these two operations, AND is the more critical. It takes two P-trees as operands and gives a resultant P-tree that is equivalent to the P-tree built from the pixel-wise logical AND of the underlying bit data. The performance of our application therefore depends on the performance of the AND operation; refer to Figure 3.

Figure 3: Performance of the P-tree AND operation (time in ms versus bit number).

4.2. Classification Accuracy

To evaluate accuracy, we compared the classification accuracy of the proposed Bayesian classifier using information gain (BCIG) with that of the naïve Bayesian classifier (NBC). Actual aerial TIFF images with a synchronized yield band were used for the evaluation. The data are available on the SMILEY [13] web site (http://rock.cs.ndsu.nodak.edu/smiley/). The data set has four bands (red, green, blue, and yield), each with eight bits of information. The training sample size was set at 1024x1024. The objective of the classification was to generate the approximate yield value for a given aerial TIFF image of a crop field; the data are classified into four yield classes by considering the RGB attribute values. Table 1 shows the respective success rates for each technique. Seven significant bits were used for the comparison (the last few least significant bits may contain random noise).

Figure 4: Classification success rate comparisons (success rate versus number of significant bits for NBC, NNBC, and PNBC).

Figure 4 shows that our solution is capable of a much better classification than the naïve Bayesian technique, which is a significant advantage in the application of classification to data streams. All the techniques display a peak level of classification accuracy at different numbers of bits. The low absolute level of classification accuracy can be attributed to the nature of the data used, in which around 36-38% of the test cases are unclassifiable by all three techniques. The results also show that the proposed classifier performs better than the naïve technique in that it can classify almost everything that can be classified with naïve Bayesian.

Table 1: Comparison of performance for four classification classes.

Bits    2     3     4     5     6     7
NBC    .14   .19   .27   .26   .24   .23
NNBC   .28   .33   .34   .35   .37   .41
PNBC   .48   .50   .51   .52   .51   .51

NBC: naïve Bayesian; NNBC: non-naïve Bayesian; PNBC: partial-naïve Bayesian (using information gain).

4.4. Performance in Data Stream Applications

Earlier, we noted that any data-mining technique applied to stream data needs special care. Here, we define those requirements more precisely. A typical data stream mining algorithm should satisfy the following criteria [3, 9]:

1. It must require a small constant time per record.
2. It must use only a fixed amount of main memory.
3. It must be able to build a model with at most one scan of the data.
4. It must make a usable model available at any point in time.
5. It should produce a model that is equivalent to the one that would be obtained by the corresponding database-mining algorithm.
6. When the data-generating phenomenon changes over time, the model should, at any time, be up-to-date but also include the past information.

We now illustrate how our approach of P-tree based Bayesian classification meets these stringent design criteria. For the first and second points, it takes only a short time to build the basic P-trees from the data of the data stream, and the size of the classifier is constant for a fixed image size; considering the upper bound on the size of each P-tree, we can determine the fixed amount of main memory required for the application. With respect to the third and fourth criteria, the P-trees are built in a single pass over the data, and the collection of P-trees created for an image is classification-ready at any point in time. Fifth, the P-tree model lets us build the classifier quickly and conveniently. Since it is a lossless representation of the original data, a classifier built from P-trees contains all the information in the original training data, and the classification is equivalent to that of any traditional classification technique; the P-trees simply allow us to calculate the required probability values accurately and efficiently for the Bayesian classification. Sixth, if the data-generating phenomenon changes, the classification process is not adversely affected: the newly built P-trees provide the adaptability required of the classifier, and past information is not lost as long as we keep it in our P-trees. The choice of the training image window size over the data stream introduces both the historical aspect and the adaptability required of the classifier.
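As a rough illustration of how these criteria can be met, the following Python sketch (our own illustrative outline, not the authors' implementation; the class name, window policy, and helper functions are assumptions) maintains one set of basic P-trees per image over a sliding window of a stream, built in a single pass per image and evicted when the image leaves the window, so that root-count queries (and hence the Bayesian probabilities) are available after every update.

```python
# Illustrative sketch of a one-pass, fixed-memory update loop for a P-tree
# based stream classifier. Basic "P-trees" are kept as bit lists for brevity.
from collections import deque

def to_bsq(band_values, bits=8):
    """One pass over a band: split it into 'bits' basic bit planes."""
    return [[(v >> (bits - 1 - j)) & 1 for v in band_values] for j in range(bits)]

class StreamingPTreeClassifier:
    def __init__(self, window_size=4):
        self.window = deque(maxlen=window_size)   # fixed memory: bounded window

    def add_image(self, bands):
        """Criteria 1-3: one scan per arriving image, constant work per pixel.
        'bands' maps a band name to its list of pixel values."""
        ptrees = {name: to_bsq(values) for name, values in bands.items()}
        self.window.append(ptrees)                # oldest image is evicted automatically

    def root_count(self, band, bit):
        """Criterion 4: counts (and hence probabilities) are available at any
        time by summing root counts over the current window."""
        return sum(sum(img[band][bit]) for img in self.window)

# Each arriving image updates the model in one pass; the classifier can answer
# root-count queries immediately afterwards.
clf = StreamingPTreeClassifier(window_size=2)
clf.add_image({"red": [200, 65], "yield": [3, 1]})
clf.add_image({"red": [210, 130], "yield": [2, 0]})
print(clf.root_count("red", 0))   # 1's in the most significant red bit plane
```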
5. Future Work

In this paper, we focused on eliminating the class conditional independence assumption in naïve Bayesian classification. For that purpose, we used a non-naïve technique when an exact tuple match exists in the training sample. In the case where no exact match was found, we used the partial-naïve approach, separating out one or more attributes, and we selected the attribute to be separated using the measure of information gain. The same situation arises in other classification techniques, for example k-nearest-neighbor (KNN) classification: if there is no exact match in the training sample, KNN expands its neighborhood up to k and decides the class label by majority vote of the k neighbors. Instead of information gain, other techniques could also be applied to select the attribute to be separated.

6. Conclusion

In this paper, we have used the P-tree data structure efficiently in Bayesian classification, and we have applied a new method in Bayesian classification by introducing information gain. Our new method thus has the simplicity of naïve Bayesian but higher accuracy. Experimental results show that our approach increases the accuracy of naïve Bayesian classification significantly. We have also shown that our method is suitable for data stream mining, a newer concept compared with traditional database mining. P-trees play an important role in achieving these advantages by providing a way to calculate the different probabilities easily, quickly, and accurately, and the P-tree technology can be used efficiently in other data mining techniques as well.

7. References

[1] W. Perrizo, "Peano Count Tree Lab Notes," Technical Report NDSU-CSOR-TR-01-1, North Dakota State University, 2001.
[2] Amalendu, "P-trees and the AND Operation," Master's thesis, Department of Computer Science, NDSU. Available at http://www.cs.ndsu.nodak.edu/~am10/paper/paper.pdf
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[4] D. Heckerman, "Bayesian Networks for Knowledge Discovery," in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, pp. 273-305, MIT Press, Cambridge, MA, 1996.
[5] K. M. Ting and Z. Zheng, "Improving the Performance of Boosting for Naïve Bayesian Classification," Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Beijing, China, 1999.
[6] Z. Zheng, G. I. Webb, and K. M. Ting, "Lazy Bayesian Rules: A Lazy Semi-Naïve Learning Technique Competitive to Boosting Decision Trees," Proceedings of ICML '99, Morgan Kaufmann, 1999.
[7] K. M. Ting, Z. Zheng, and G. Webb, "Learning Lazy Rules to Improve the Performance of Classifiers," available at www.deakin.edu.au/~zij
[8] H. Samet, "The Quadtree and Related Hierarchical Data Structures," ACM Computing Surveys, 1984.
[9] P. Domingos and G. Hulten, "Catching Up with the Data: Research Issues in Mining Data Streams," Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), 2001.
[10] HH-codes. Available at http://www.statkart.no/nlhdb/iveher/hhtext.htm
[11] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," ACM SIGKDD, 2000.
[12] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Menlo Park, CA, 1996.
[13] SMILEY project. Available at http://midas.cs.ndsu.nodak.edu/~smiley