Performance Improvement for Bayesian Classification on Spatial Data with P-Trees 1,2

Amal S. Perera, Dept. of Computer Science, North Dakota State University, Fargo, ND 58105, USA, Amal.Perera@ndsu.nodak.edu
Masum H. Serazi, Dept. of Computer Science, North Dakota State University, Fargo, ND 58105, USA, Md.Serazi@ndsu.nodak.edu
William Perrizo, Dept. of Computer Science, North Dakota State University, Fargo, ND 58105, USA, William.Perrizo@ndsu.nodak.edu

1 Patents are pending on the bSQ and P-Tree technology.
2 This work is partially supported by GSA Grant ACT# K96130308.

Abstract

Accuracy is one of the major issues for a classifier. A range of classifiers currently exists, with degrees of accuracy directly related to computational complexity. In this paper we present an approach that improves the classification accuracy of an existing P-Tree based Bayesian classification technique. The new approach increases the granularity of the conditional probability calculations by using a bit-based rather than the existing band-based approach, which enables the complete elimination of the naïve assumption while maintaining the same computational cost as the previous method. The new approach outperforms the existing P-Tree based Bayesian classifier, a Bayesian belief network, and a Euclidean distance based KNN classifier, in terms of accuracy, on a set of spatial data collected for precision agriculture.

Keywords: Data mining, Bayesian classification, bSQ, P-Tree

1 INTRODUCTION

Classification is a form of data analysis or data mining that can be used to extract models describing important data classes or to predict future data trends. There is a broad range of techniques for data classification, from decision tree induction, Bayesian classification, neural networks, k-nearest neighbor, case-based reasoning, and genetic algorithms to rough set and fuzzy logic techniques [3]. A Bayesian classifier is a statistical classifier that uses Bayes' theorem to predict class membership as the conditional probability that a given data sample falls into a particular class. The complexity of computing the conditional probability values can become prohibitive for applications with a large data set and a large attribute space. Bayesian belief networks relax many constraints and use information about the domain to build a conditional probability table. Naïve Bayesian classification is a lazy classifier: computational cost is reduced by using the naïve assumption of class conditional independence to calculate the conditional probabilities when required [3]. Bayesian belief networks require build time and domain knowledge, whereas the naïve approach loses accuracy if the assumption is not valid. The P-Tree data structure allows us to compute the Bayesian probability values efficiently, without the naïve assumption, by building P-Trees for the training data. Calculating the probability values requires a set of P-Tree AND operations that yield the respective counts for a given pattern. Bayesian classification with P-Trees has been used successfully on remotely sensed image data to predict yield in precision agriculture [1]. To avoid situations where the required pattern does not exist in the training data, that approach partially employs the naïve assumption. To eliminate the assumption completely and thereby increase accuracy, we propose a bit-based Bayesian classification instead of the band-based approach of [1].
This paper is organized as follows. We discuss related work in Section 2, where we describe the P-Tree data structure and how it can be used for a Bayesian classifier. In Section 3 we propose our algorithm to improve the performance of an existing Bayesian classifier using P-Trees on spatial data. In Section 4 we evaluate the accuracy of our algorithm and its scalability with the size of the data set. Finally, we offer our conclusions in Section 5.

2 RELATED WORK

Many studies have been conducted in spatial data classification using P-Trees [1],[5],[6],[7]. In this section we describe the P-Tree data structure and how P-Trees were used in [1] to calculate the counts required to compute the Bayesian probability values while partially eliminating the use of the naïve assumption.

2.1 P-Tree

Most spatial data comes in a format called BSQ, for Band Sequential (or can easily be converted to BSQ). BSQ data has a separate file for each band. The ordering of the data values within a band is raster ordering with respect to the spatial area represented in the dataset. This order is assumed and therefore is not explicitly indicated as a key attribute in each band (bands have just one column). For P-Trees, each BSQ band is divided into several files, one for each bit position of the data values. This format is called 'bit Sequential' or bSQ. A simple transform can be used to convert image files to BSQ and then to bSQ format. Each bSQ bit file Bij (the file constructed from the jth bits of the ith band) is organized into a tree structure called a Peano Count Tree (P-Tree). A P-Tree is a quadrant-based tree. The root of a P-Tree contains the 1-bit count of the entire bit-band. The next level of the tree contains the 1-bit counts of the four quadrants in raster order. At the next level, each quadrant is partitioned into sub-quadrants and their 1-bit counts, in raster order, constitute the children of the quadrant node. This construction is continued recursively down each tree path until the sub-quadrant is 'pure' (entirely 1-bits or entirely 0-bits), which may or may not be at the leaf level. For example, the P-Tree for an 8-row-8-column bit-band is shown in Figure 1.

[Figure 1: An 8x8 bit-band and its corresponding PM-tree.]
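To make the quadrant-based construction concrete, the following minimal Python sketch builds a Peano count tree from a square bit matrix. It is our illustration only, not the authors' implementation; the class and function names are hypothetical, and compression of pure quadrants is handled simply by not expanding them.

    class PTreeNode:
        def __init__(self, count, size, children=None):
            self.count = count        # number of 1-bits in this quadrant
            self.size = size          # the quadrant covers size x size pixels
            self.children = children  # None for a pure quadrant, else four children

    def build_ptree(bits, r0=0, c0=0, size=None):
        # Recursively build the node for the size x size quadrant at (r0, c0).
        if size is None:
            size = len(bits)
        count = sum(bits[r][c] for r in range(r0, r0 + size)
                               for c in range(c0, c0 + size))
        if size == 1 or count in (0, size * size):
            return PTreeNode(count, size)          # pure quadrant: stop expanding
        half = size // 2
        children = [build_ptree(bits, r0,        c0,        half),   # children in
                    build_ptree(bits, r0,        c0 + half, half),   # raster order
                    build_ptree(bits, r0 + half, c0,        half),
                    build_ptree(bits, r0 + half, c0 + half, half)]
        return PTreeNode(count, size, children)

The root node then holds the 1-bit count of the whole bit-band and each level holds the quadrant counts in raster order, as described above; a PM-tree would store only pure-1, pure-0 or mixed flags instead of the counts.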
For efficient implementation, we use a variation of the basic P-Tree called the PM-tree (Pure Mask tree). In the PM-tree we use a 3-valued logic, in which '11' represents a quadrant of pure 1-bits (pure1 quadrant), '00' represents a quadrant of pure 0-bits (pure0 quadrant), and '01' represents a mixed quadrant. To simplify the exposition, we use 1 instead of 11 for pure1, 0 for pure0, and m for mixed. The PM-tree for the example bit-band is given in Figure 1 [4].

2.2 P-Tree Algebra

Basic P-Trees can be combined using logical operations to produce P-Trees for the original values at any level of bit precision. We let Pb,v denote the Peano Count Tree for band b and value v, where v can be expressed in n-bit precision. Using 8-bit precision, the value P-Tree Pb,11010011 can be constructed from the basic P-Trees as

Pb,11010011 = Pb1 AND Pb2 AND Pb3' AND Pb4 AND Pb5' AND Pb6' AND Pb7 AND Pb8,

where ' indicates the NOT operation. The AND operation is simply the pixel-wise AND of the bits. Similarly, any data set in relational format can be represented as P-Trees. For any combination of values (v1,v2,...,vn), where vi is from band i, the quadrant-wise count of occurrences of this combination of values is given by

P(v1,v2,...,vn) = P1,v1 AND P2,v2 AND ... AND Pn,vn [2].

2.3 Bayesian Classifier

Given a relation R(k, A1, ..., An, C), where k is the key of the relation R and A1, ..., An, C are attributes with C the class label attribute, and given an unclassified data sample (having a value for every attribute except C), a classification technique predicts the C-value of the sample and thus determines its class. In the case of spatial data, the key k in R usually represents some location (or pixel) over a space and each Ai is a descriptive attribute of the locations. A typical example of such spatial data is an image of the earth's surface, collected as a satellite image or an aerial photograph. The attributes may be different reflectance bands such as red, green, blue, infra-red, near infra-red, thermal infra-red, etc. The attributes may also include synchronized ground measurements such as yield, soil type, zoning category, a weather attribute, etc. A classifier may, for example, predict the yield value from the reflectance band values extracted from a satellite image.

In Bayesian classification, let X be a data sample whose class label is unknown and let H be a hypothesis (i.e., X belongs to class C). P(H|X) is the posterior probability of H given X and P(H) is the prior probability of H; then

P(H|X) = P(X|H)P(H) / P(X),

where P(X|H) is the posterior probability of X given H and P(X) is the prior probability of X. Bayesian classification uses this theorem in the following way. Each data sample is represented by a feature vector X = (x1, ..., xn) depicting the measurements made on the sample for A1, ..., An, respectively. Given classes C1, ..., Cm, the Bayesian classifier will predict that an unknown data sample X (with no class label) belongs to the class Cj having the highest posterior probability conditioned on X:

P(Cj|X) > P(Ci|X), for all i not equal to j. ... (1)

P(X) is constant for all classes, so it suffices to maximize P(X|Cj)P(Cj). Bayesian classification can be naïve or based on Bayesian belief networks. In naïve Bayesian classification, the naïve assumption of class conditional independence of values is made to reduce the computational complexity of calculating all the P(X|Cj); it assumes that the value of an attribute is independent of the values of all others. Thus,

P(X|Ci) = P(x1|Ci) * ... * P(xn|Ci). ... (2)

For categorical attributes, P(xk|Ci) = si_xk / si, where si is the number of samples in class Ci and si_xk is the number of training samples of class Ci having Ak-value xk [3].

2.4 Bayesian Classifier Using P-Trees

A P-Tree is a lossless, compressed, data-mining-ready data structure which maintains the raster spatial order of the original data. There are many variations of P-Trees. P-Trees built from bSQ files are called basic P-Trees. The symbol Pi,j denotes a P-Tree built from band i and bit j. The function COMP(Pi,j), defined for a P-Tree Pi,j, gives the complement of Pi,j. We can build a value P-Tree Pi,v, denoting band i and value v, which gives us the P-Tree for a specific value v in the original band data. The tuple P-Tree Px1..xn denotes the P-Tree of the tuple (x1, ..., xn). The function RootCount gives the number of occurrences of a particular pattern in a P-Tree: for a basic P-Tree it gives the number of 1s in the tree, and for a value P-Tree or a tuple P-Tree it gives the number of occurrences of that value or tuple [1].
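To illustrate how value and tuple P-Trees are formed by ANDing basic P-Trees and how RootCount yields pattern counts, here is a minimal Python sketch. For simplicity it assumes each basic P-Tree is held as a flat bit vector (a Python integer with one bit per pixel) rather than as a compressed quadrant tree; all function names are our own.

    def value_ptree(basic_bits, v, nbits, npix):
        # AND the basic bit P-Trees of one band to select value v.
        # basic_bits[j] is the bit vector for bit position j (0 = most significant);
        # 0-bits of v use the complement, as in Pb,11010011 = Pb1 AND Pb2 AND Pb3' AND ...
        mask = (1 << npix) - 1
        result = mask
        for j in range(nbits):
            bit = (v >> (nbits - 1 - j)) & 1
            result &= basic_bits[j] if bit else (~basic_bits[j] & mask)
        return result

    def tuple_ptree(value_ptrees):
        # P(v1,...,vn) = P1,v1 AND P2,v2 AND ... AND Pn,vn
        result = value_ptrees[0]
        for p in value_ptrees[1:]:
            result &= p
        return result

    def root_count(p):
        # RootCount: the number of pixels (1-bits) matching the pattern.
        return bin(p).count("1")

With this flat representation, RootCount of a value or tuple P-Tree is simply the population count of the ANDed bit vectors; the compressed tree form obtains the same counts while exploiting pure quadrants.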
2.5 Calculating the Probabilities Using P-Trees

We showed that to classify a tuple X = (x1, ..., xn) we need to find the values P(xk|Ci). To find the values of si_xk and si we need the value P-Trees Pk,xk (the value P-Tree of band k, value xk) and Pc,ci (the value P-Tree of the class label band C, value ci):

si_xk = RootCount[(Pk,xk) AND (Pc,ci)], si = RootCount[Pc,ci]. ... (3)

In this way we find the values of all the probabilities in (2) and can use (1) to classify a new tuple X. To improve the accuracy, we can instead find the value in (2) directly by calculating the tuple P-Tree Px1..xn (the tuple P-Tree of values x1, ..., xn), which is simply (P1,x1) AND ... AND (Pn,xn). Now P(X|Ci) = si_(x1..xn) / si, where

si_(x1..xn) = RootCount[(Px1..xn) AND (Pc,ci)]. ... (4)

By doing this we are finding the probability of occurrence of the whole tuple in the training data, so we do not need to be concerned with the inter-dependencies between different bands (attributes). In this way we can find the value of P(X|Ci) without the naïve assumption of class conditional independence, which improves the accuracy of the classifier. Thus we can keep the simplicity of the naïve Bayesian approach as well as obtain higher accuracy [1].

3 PROPOSED APPROACH

3.1 Band-based approach

The existing algorithm uses a band-based approach to calculate the Bayesian probability values for each class. As long as there are matching patterns in the training data, the Bayesian conditional probability values can be calculated very easily with the use of P-Trees. When no matching pattern is found in the training data for the given pattern, the existing approach removes from the given pattern the attribute with the least information gain relative to the class label and then tries again to find a match in the training set. This is continued until a set of matching patterns is found. The Bayesian probabilities are then calculated by combining the partial probabilities of this partial pattern with those of the attributes discarded earlier, using the naïve assumption. This is better than using the naïve assumption throughout, as shown in the previous work [1]. One experimental observation is a decrease in accuracy as the use of the naïve assumption increases; this is shown in the performance evaluation in the next section. Another option is to avoid the naïve assumption entirely and classify using only the conditional probability values of the partial pattern; this improves the performance slightly, as shown in the next section. This prompts us to look for an approach that does not use the naïve assumption and also keeps most of the information of the given pattern while searching for a match, i.e., a better mechanism for relaxing the search for the given pattern.

3.2 Bit-based approach

This approach is motivated by the requirement to completely avoid the use of the naïve assumption in the probability calculations. We propose to transform the problem of calculating the probability values for a given unknown pattern into calculating the probability values for a known, similar pattern. This similar pattern is selected by removing the least significant bits from the attribute space. The order in which attribute bits are removed is determined by the information gain of the respective bit of each band.
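The per-bit information gain that determines this ordering can be computed directly from RootCounts. The following Python sketch (our own, with hypothetical helper names, reusing the flat bit-vector representation of the earlier sketches) shows one way to do it, assuming every training pixel carries exactly one class label.

    import math

    def root_count(p):
        return bin(p).count("1")

    def entropy(counts):
        # Shannon entropy of a list of class counts.
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0) if total else 0.0

    def bit_information_gain(bit_ptree, class_ptrees, npix):
        # Information gain of one attribute bit with respect to the class label.
        # bit_ptree: basic bit P-Tree of one band/bit; class_ptrees: one bit vector per class.
        mask = (1 << npix) - 1
        gain = entropy([root_count(pc) for pc in class_ptrees])
        for side in (bit_ptree, ~bit_ptree & mask):          # pixels with bit = 1, bit = 0
            counts = [root_count(side & pc) for pc in class_ptrees]
            gain -= (sum(counts) / npix) * entropy(counts)
        return gain

Bits with lower gain are dropped first, so the most informative bits are retained longest when the pattern has to be relaxed.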
Consider the following example of calculating the Bayesian conditional probability value for the pattern [10,01] in a two-attribute space with bands R and G. Assume that the information gain value of the first significant bit of band R is less than that of band G, and that for the second significant bit the information gain value of band G is less than that of band R. Initially the algorithm searches for the pattern [10,01], as shown in Figure 2a. If the pattern is not found, the search is repeated for [1_,01], taking into account the information gain values of the second significant bits, and the search space grows as shown in Figure 2b. The search then continues with the pattern [1_,0_], as shown in Figure 2c, and finally with the pattern [1_,__], taking into account the information gain values of the first significant bits, as shown in Figure 2d.

[Figure 2: Growth of the search space in the R-G value space as bits are dropped from the pattern: (a) [10,01], (b) [1_,01], (c) [1_,0_], (d) [1_,__].]

In the algorithm, the above process is carried out in reverse order. We start building a value P-Tree for the given pattern from the most significant bits of each attribute and compute the conditional probabilities for each class. We continue to add bits to the value P-Tree until we obtain a conditional probability value for some class that is greater than a given threshold probability. If such a class is found, it is a clear winner against all the other classes provided the probability threshold is greater than 0.5; in our experiments we used a threshold of 0.85. If the algorithm is unable to find a winner above the given threshold, the class that recorded the highest probability in a preceding step is used as the class label. We do not apply any weighting with respect to the number of bits used in the calculation; this is one area that could be modified to further improve the classification.
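The following Python sketch summarizes our reading of this bit-based procedure, using the flat bit-vector representation and helper names of the earlier sketches (all hypothetical); a real implementation operates on compressed P-Trees and may differ in details such as the exact bit ordering and the fallback rule.

    def root_count(p):
        return bin(p).count("1")

    def classify_bit_based(sample_bits, bit_ptrees, class_ptrees, bit_order,
                           npix, threshold=0.85):
        # sample_bits[(band, j)] : bit j of the unknown sample in the given band
        # bit_ptrees[(band, j)]  : basic bit P-Tree of the training data (bit vector)
        # class_ptrees[c]        : class value P-Tree for class label c
        # bit_order              : (band, j) keys, most significant/informative bits first
        mask = (1 << npix) - 1
        pattern = mask                       # initially every training pixel matches
        best_class, best_post = None, -1.0
        for key in bit_order:                # refine the pattern one bit at a time
            p = bit_ptrees[key] if sample_bits[key] else (~bit_ptrees[key] & mask)
            pattern &= p
            matches = root_count(pattern)
            if matches == 0:
                break                        # no matching training tuple; keep previous best
            for c, pc in class_ptrees.items():
                post = root_count(pattern & pc) / matches    # P(Ci | partial pattern)
                if post > best_post:
                    best_class, best_post = c, post
                if post >= threshold:
                    return c                 # clear winner above the threshold
        return best_class                    # fall back to the highest probability seen

Each refinement step adds one AND for the pattern and one AND per class for the counts, so the number of P-Tree operations per classification remains of the same order as in the band-based method.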
4 PERFORMANCE EVALUATION

The motivation for this paper is to improve the classification accuracy of the existing P-Tree based Bayesian classification approach. In this section we present a comparative analysis of the proposed method. Its classification performance is compared with the previous technique that partially uses the naïve assumption, with the band-based approach without the naïve assumption, and with KNN using a Euclidean distance metric. It is also compared with a Bayesian belief network classifier. Finally, we examine how the classification time varies with the training sample size.

4.1 Sample data

The experimental data was extracted from two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, United States. The images were taken in 1997 and 1998. Each image contains three bands: red, green and blue reflectance values. Three other files contain synchronized soil moisture, nitrate and yield values.

4.2 Classification Accuracy

Table 1. Comparison of the band-based and the bit-based classification.

                Band based, Naïve Ass.          Bit based, Threshold 0.85
    Sig. Bits   Acc.    Use of Naïve Ass.       Acc.    Avg. # of bits used
    4           72 %    0 %                     76 %    3.89
    5           63 %    5 %                     74 %    4.52
    6           26 %    28 %                    74 %    5.05

Table 1 shows the comparison between the band-based and the bit-based classification. The percentage of attempts that required the naïve assumption grows with the number of significant bits: with more significant bits per attribute, it is less likely that a match is found among the training patterns, which increases the number of times the naïve assumption has to be made. The accuracy statistics make it clear that the accuracy of the band-based classification with the naïve assumption (the previous approach) falls as the percentage of attempts using the naïve assumption increases. As indicated in the previous section, the values in the table clearly show the need for another mechanism. The table shows a clear difference in accuracy between the band-based and the bit-based mechanism, while the effect of different bin sizes (numbers of significant bits) on the accuracy of the bit-based method is minimal. The average number of bits used per classification indicates how many bits were needed before the threshold was satisfied; note that for the last test case it is less than the number of significant bits. This shows the ability of the bit-based method to avoid situations where no matching pattern is found for the given pattern in the training set, by searching for a pattern without the least significant bits of most of the attributes. It is clear that the new approach is better.

It is also important to compare the classification accuracy with other classification techniques. Figure 3 compares the classification accuracy of the new approach with a Euclidean distance based KNN, the band-based approach, and the band-based approach without the naïve assumption.

[Figure 3: Classification accuracy for the '97 data; accuracy (%) versus training data size (1K to 260K pixels) for the band-based (with and without the naïve assumption), Euclidean-distance KNN and bit-based classifiers.]

It is clearly seen that the new approach outperforms the other approaches for both data sets. This evaluation was done with 5 significant bits of information for each attribute band. The two band-based approaches show a similar degree of accuracy because the use of the naïve assumption is minimal for 5 significant bits. With an increase in the training data size, the possibility of finding a set of matching patterns increases, which contributes to the improvement in the classification accuracy of the band-based approach seen in the figure. The KNN approach shows a steady degree of accuracy with respect to the training data size. The new approach shows a similar trend at a higher level of accuracy, which demonstrates the ability of the classifier to perform at a reasonable level even with a smaller training data set.

The accuracy of the approach was also compared to an existing Bayesian belief network classifier, J. Cheng's Bayesian Belief Network, available at [9]. This classifier was the winning entry of the KDD Cup 2001 data mining competition. The developer claims that the classifier can perform with or without domain knowledge; domain knowledge is a very important factor in most Bayesian belief network implementations. For the comparison, smaller training data sets ranging from 4K to 16K pixels were used due to the inability of the implementation to handle larger data sets on a Pentium II.

Table 2. Comparison with a Bayesian belief network.

    Training Size (pixels)   P-Tree Based   Bayesian Belief
    4000                     66 %           26 %
    16000                    67 %           51 %

Table 2 shows the comparison between the P-Tree based approach and the Bayesian belief approach for the 1997 data set. It is clearly seen that the P-Tree based approach is better. The belief network was built without using any domain knowledge, to make it comparable with the P-Tree approach.
The belief network was also allowed to use the maximum required amount of storage space to hold the conditional probability values. Attribute data was broken into 30 equal-sized bins, to be comparable with the 5 significant bits (32 bins) used for the P-Tree approach.

4.3 Classification Time

The figure below shows the classification time for a single classification. The P-Tree based Bayesian classifier is a lazy classifier, so the P-Tree approach does not require any build time. As with most lazy classifiers, the classification time per tuple varies with the number of items in the training set, because the training data has to be consulted. The P-Tree approach does not require a traditional data scan, but the growth in computation time can still be observed. The data for the figure were collected using 5 significant bits and a threshold probability of 0.85. The times are given for scalability comparison purposes; other work related to P-Trees has shown the space-time efficiency of using them for data mining [1],[5],[6],[7].

[Figure: Variation of classification time (ms) with training sample size (pixels).]

5 CONCLUSIONS

In this paper, we have shown that the use of the naïve assumption reduces the accuracy of classification in this particular application domain. We presented an approach that increases the accuracy of a P-Tree based Bayesian classifier by completely eliminating the naïve assumption. The new approach has better accuracy than the existing P-Tree based Bayesian classifier, and it was also shown to be better than a Bayesian belief network implementation and a Euclidean distance based KNN approach. The new approach has the same computational cost, in terms of P-Tree operations, as the previous approach, and it is scalable with respect to the size of the data set.

REFERENCES

[1] M. D. Hossain, 'Bayesian Classification on Spatial Data Streams Using P-Trees', Dept. of Computer Science, North Dakota State University, Dec. 2002.
[2] Q. Ding, M. Khan, A. Roy, W. Perrizo, 'P-Tree Algebra', ACM SAC, Madrid, pp. 426-431, Mar. 2002.
[3] J. Han, M. Kamber, 'Data Mining: Concepts and Techniques', Morgan Kaufmann, 2001.
[4] W. Perrizo, 'Peano Count Tree Technology', Technical Report NDSU-CSOR-TR-01-1, 2001.
[5] M. Khan, Q. Ding, W. Perrizo, 'K-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees', PAKDD 2002, Taipei, pp. 517-528, May 2002.
[6] Q. Ding, Qiang Ding, W. Perrizo, 'Association Rule Mining on Remotely Sensed Images Using P-Trees', PAKDD 2002, Taipei, pp. 66-79, May 2002.
[7] Q. Ding, Qiang Ding, W. Perrizo, 'Decision Tree Classification of Spatial Data Streams Using P-Trees', ACM SAC, Madrid, pp. 426-431, Mar. 2002.
[8] DataSURG web site, http://datasurg.ndsu.edu/, May 2002.
[9] J. Cheng, 'J. Cheng's Bayesian Belief Network', http://www.cs.ualberta.ca/~jcheng/, May 2002.