Bayesian Classification on Spatial Data Streams Using P-Trees 1,2

Mohammad Hossain, Amal Shehan Perera and William Perrizo
Computer Science Department, North Dakota State University
Fargo, ND 58105, USA
{Mohammad_Hossain, Amal_Perera, William_Perrizo}@ndsu.nodak.edu

Abstract: Classification of spatial data can be difficult with existing methods due to the large number and size of spatial data sets. The task becomes even more difficult when we consider continuous spatial data streams. Data streams require a classifier that can be built and rebuilt repeatedly in near real time. This paper presents an approach to deal with this challenge. Our approach uses the Peano Count Tree (P-tree), which provides a lossless, compressed and data-mining-ready representation (data structure) for spatial data. This data structure can be applied in many classification techniques. In this paper we focus on Bayesian classification. A Bayesian classifier is a statistical classifier which uses Bayes theorem to predict class membership as a conditional probability that a given data sample falls into a particular class. In this paper we demonstrate how P-trees can improve the classification of spatial data when using a Bayesian classifier. We also introduce the use of information gain calculations with Bayesian classification to improve its accuracy. The use of a P-tree based Bayesian classifier not only makes classification more effective on spatial data, but also reduces the build time of the classifier considerably. This improvement in build time makes it feasible for use with streaming data.

Keywords: Data Mining, Bayesian Classification, P-tree, Spatial Data

1 Patents are pending on the bSQ and P-tree technology.
2 This work is partially supported by NSF Grant OSR-9553368, DARPA Grant DAAH04-96-1-0329 and GSA Grant ACT#: K96130308.

1. Introduction:

Classification is an important data mining technique which predicts the class of a given data sample. Assume a relation, R(k, A1, ..., An, C), where k is the key of the relation R, A1, ..., An are descriptive attributes and C is the class label attribute [1]. Given an unclassified data sample (having a value for all attributes except C), a classification technique will predict the C-value for the given sample and thus determine its class. In the case of spatial data, the key, k, in R usually represents some location (or pixel) over a space and each Ai is a descriptive attribute of the locations. A typical example of such spatial data is an image of the earth collected as a satellite image or aerial photograph. The attributes may be different reflectance bands such as red, green, blue, infra-red, near infra-red, thermal infra-red, etc. [2]. The attributes may also include ground measurements such as yield, soil type, zoning category, weather attributes, etc. A classifier may predict the yield value from different reflectance band values extracted from a satellite image.

Many classification techniques have been proposed by statisticians and machine learning experts [3]. They include decision trees, neural networks, Bayesian classification, and k-nearest neighbor classification. Bayesian classification is a statistical classifier based on Bayes theorem [4], as follows. Let X be a data sample whose class label is unknown. Let H be a hypothesis (i.e., X belongs to class C). P(H|X) is the posterior probability of H given X.
P(H) is the prior probability of H. Then, by Bayes theorem,

P(H|X) = P(X|H) P(H) / P(X),

where P(X|H) is the posterior probability of X given H and P(X) is the prior probability of X. Bayesian classification uses this theorem in the following way. Each data sample is represented by a feature vector, X = (x1, ..., xn), depicting the measurements made on the sample for A1, ..., An, respectively. Given classes C1, ..., Cm, the Bayesian classifier will predict the class label Cj of an unknown data sample X (with no class label) as the one having the highest posterior probability conditioned on X:

P(Cj|X) > P(Ci|X) for all i, i not equal to j.    (1)

Since P(X) is constant for all classes, it is sufficient to maximize P(X|Cj)P(Cj).

Bayesian classification can be naive or based on Bayesian belief networks. In naive Bayesian classification, the naive assumption of 'class conditional independence of values' is made to reduce the computational complexity of calculating all the P(X|Cj)'s. It assumes that the value of an attribute is independent of the values of all other attributes. Thus,

P(X|Ci) = P(x1|Ci) * ... * P(xn|Ci).    (2)

For categorical attributes, P(xk|Ci) = sixk/si, where si is the number of samples in class Ci and sixk is the number of training samples of class Ci having Ak-value xk.

A Bayesian belief network is a graphical model [5]. It represents the dependencies among subsets of attributes in a directed acyclic graph, and a table called the 'conditional probability table' (CPT) is constructed. From the CPT, the probabilities needed to estimate P(X|C) are established.

The Bayesian classification techniques have some problems. Naive Bayesian classification depends on the assumption of class conditional independence, which is not satisfied in all cases. In the case of a belief network, it is computationally very complex to build the network and to create the CPT [4,5]. It is also time consuming to calculate the probabilities in the traditional way because the whole database must be scanned again and again. This often makes their use with large spatial data sets and streaming spatial data computationally expensive. Our approach is to use the P-tree data structure to reduce the computational expense of classifying a new sample. The use of P-trees gives us a simple solution to the problem of scanning the database. Using a simple P-tree algebra, we can determine the number of occurrences of a sample and the number of co-occurrences of a sample and a class label, providing the necessary components of the probability calculation.

2. Bayesian Classification Using P-tree:

In the case of spatial data, the Ai's and C in the relation R(k, A1, ..., An, C) can be viewed as different bands (denoted by B1, ..., Bn in this paper). For example, in a remotely sensed image they may be the red, green and blue reflectance values for the particular pixel in the space identified by k. These bands are converted to P-trees and stored in that form. Most spatial data comes in a format called BSQ, for Band Sequential (or can be easily converted to BSQ). BSQ data has a separate file for each attribute or band. The ordering of the data values within a band is raster ordering with respect to the spatial area represented in the dataset. This ordering is assumed and therefore is not explicitly indicated as a key attribute in each band (bands have just one column). In this paper, we divide each BSQ band into several files, one for each bit position of the data values. We call this format bit Sequential, or bSQ.
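The bSQ decomposition itself is straightforward. The following is only a minimal sketch of the idea, assuming NumPy and an in-memory band array; the names band_to_bsq and bsq_planes are illustrative, and a real implementation would write one bit file per position rather than keep the planes in memory.

```python
import numpy as np

def band_to_bsq(band, bits=8):
    """Split a band of unsigned 8-bit pixel values into bit planes.

    Returns a list of 2-D 0/1 arrays, one per bit position, with the
    most significant bit first (a sketch of the bSQ idea only).
    """
    band = np.asarray(band, dtype=np.uint8)
    planes = []
    for j in range(bits):
        shift = bits - 1 - j          # bit 1 = most significant bit
        planes.append((band >> shift) & 1)
    return planes

# Toy 4x4 "band" of 8-bit reflectance values.
band = np.array([[200, 201, 17, 16],
                 [199, 202, 18, 15],
                 [ 60,  61, 62, 63],
                 [ 58,  59, 64, 65]], dtype=np.uint8)
bsq_planes = band_to_bsq(band)
print(bsq_planes[0])                  # bit plane for the most significant bit
```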
A Landsat Thematic Mapper satellite image, for example, is in BSQ format with 7 bands, B1, ..., B7 (Landsat-7 has 8), and ~40,000,000 8-bit data values. In this case, the bSQ format will consist of 56 separate files, B11, ..., B78, each containing ~40,000,000 bits. A typical aerial digital photograph in TIFF format is in what is called Band Interleaved by Bit (BIP) format, in which there is one file containing ~24,000,000 bits ordered by bit position, then band, and then raster-ordered pixel location. A simple transform can be used to convert TIFF images to bSQ format.

We organize each bSQ bit file, Bij, into a tree structure, called a Peano Count Tree (P-tree). A P-tree is a quadrant-based tree. The root of a P-tree contains the 1-bit count of the entire bit-band. The next level of the tree contains the 1-bit counts of the four quadrants in raster order. At the next level, each quadrant is partitioned into sub-quadrants and their 1-bit counts in raster order constitute the children of the quadrant node. This construction is continued recursively down each tree path until the sub-quadrant is pure (entirely 1-bits or entirely 0-bits), which may or may not be at the leaf level (1-by-1 sub-quadrant level). P-trees are related in various ways to many other data structures in the literature (see the appendix for details). For example, the P-tree for the following 8-row-by-8-column bit-band is shown below.

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

                 55                        (level 3, depth 0)
       _________/ / \ \_________
      /      ___/     \___      \
    16     __8__      _15__      16        (level 2, depth 1)
          / / \ \     / / \ \
         3 0  4  1   4  4  3  4            (level 1, depth 2)
       //|\    //|\        //|\
       1110    0010        1101            (level 0, depth 3)

In this example, 55 is the count of 1's in the entire image, and the numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits, we do not need sub-trees for these two quadrants. This pattern is continued recursively. Recursive raster ordering is called the Peano or Z-ordering in the literature, hence the name Peano Count Tree. The recursive process is guaranteed to terminate at the "leaf" level (level 0), where each quadrant is a 1-row-by-1-column quadrant. If we were to expand all sub-trees, including those for the pure quadrants, then the leaf sequence would be just the Peano space-filling curve ordering of the original raster image.

A P-tree is a lossless data structure which maintains the raster spatial order of the original data. There are many variations of P-trees. P-trees built from bSQ files are called basic P-trees. The symbol Pi,j denotes the basic P-tree built from band i and bit j. The function COMP(Pi,j), defined for P-tree Pi,j, gives the complement of Pi,j. We can also build a value P-tree, Pi,v, for band i and value v, which gives us the P-tree for a specific value v in the original band data. The tuple P-tree Px1,...,xn denotes the P-tree of the tuple (x1, ..., xn). The function RootCount gives the number of occurrences of a particular pattern in a P-tree: for a basic P-tree it gives the number of 1's in the bit file, while for a value P-tree or tuple P-tree it gives the number of occurrences of that value or tuple. A detailed description of P-trees and the related algebra is included in the appendix.
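To make the construction concrete, the following is a minimal sketch that builds a count-based P-tree for an in-memory 0/1 matrix and reproduces the counts of the example above. The names build_ptree and root_count are illustrative only and do not reflect the compressed, file-based storage a real implementation would use.

```python
def build_ptree(bits, r0=0, c0=0, size=None):
    """Recursively build a Peano count tree node for a square 0/1 matrix.

    Each node is (count, children); children is None for pure quadrants
    (all 0-bits, all 1-bits) and for 1-by-1 quadrants.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c] for r in range(r0, r0 + size)
                           for c in range(c0, c0 + size))
    if count == 0 or count == size * size or size == 1:
        return (count, None)                               # pure quadrant: stop here
    half = size // 2
    children = [build_ptree(bits, r0, c0, half),           # upper-left
                build_ptree(bits, r0, c0 + half, half),    # upper-right
                build_ptree(bits, r0 + half, c0, half),    # lower-left
                build_ptree(bits, r0 + half, c0 + half, half)]  # lower-right
    return (count, children)

def root_count(ptree):
    return ptree[0]

# The 8x8 bit-band from the example above.
band_bits = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]
tree = build_ptree(band_bits)
print(root_count(tree))                     # 55
print([child[0] for child in tree[1]])      # [16, 8, 15, 16]
```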
2.1 Calculating the probabilities using P-trees:

We showed earlier that to classify a tuple X = (x1, ..., xn) we need the values P(xk|Ci) = sixk/si, where si is the number of samples in class Ci and sixk is the number of training samples of class Ci having Ak-value xk. To find the values of sixk and si we need two value P-trees, Pk,xk (the value P-tree of band k, value xk) and PC,ci (the value P-tree of the class label band C, value ci):

sixk = RootCount[Pk,xk AND PC,ci],    si = RootCount[PC,ci].    (3)

In this way we find the values of all the probabilities in (2) and can use (1) to classify a new tuple X. To improve the accuracy of the classifier, we can find the value in (2) directly by calculating the tuple P-tree Px1,...,xn (the tuple P-tree of values x1, ..., xn), which is simply P1,x1 AND ... AND Pn,xn. Then

P(X|Ci) = six1..xn/si, where six1..xn = RootCount[Px1,...,xn AND PC,ci].    (4)

By doing this we are finding the probability of occurrence of the whole tuple in the data sample, so we do not need to be concerned about the inter-dependency between different bands (attributes). In this way we can find the value of P(X|Ci) without the naive assumption of class conditional independence, which improves the accuracy of the classifier. Thus we keep the simplicity of naive Bayesian while obtaining higher accuracy. One problem with this approach is that if the tuple X that we want to classify is not present in our data set, the root count of Px1,...,xn will be zero and the value of P(X|Ci) will be zero as well. In that case we will not be able to classify the tuple. To deal with that problem we introduce the measure of information gain of an attribute into our classification.

2.2 Introduction of information gain

Not all attributes carry the same amount of information useful for the classification task. For example, in the case of a bank loan, the attribute INCOME might carry more information than the attribute AGE of the client. This significance can be measured mathematically by calculating the information gain of an attribute.

2.2.1 Calculation of Information Gain:

In a data sample, the class label attribute C has m different values or classes, Ci, i = 1, ..., m. Let si be the number of samples in Ci and s the total number of samples. The information needed to classify a given sample is [4]:

I(s1, ..., sm) = -SUM(i=1..m)[ pi * log2(pi) ],

where pi = si/s is the probability that a sample belongs to Ci. Let attribute A have v distinct values, {a1, ..., av}, and let sij be the number of samples of class Ci in the subset with A-value aj. The entropy, or expected information based on the partitioning into subsets by A, is:

E(A) = SUM(j=1..v)[ (SUM(i=1..m)[sij] / s) * I(s1j, ..., smj) ].

The information gain can then be calculated as:

Gain(A) = I(s1, ..., sm) - E(A),

the expected reduction of entropy caused by knowing the value of A.

2.2.2 Use of information gain when the tuple P-tree RootCount is 0

Returning to the problem in our classification technique: when the root count of Px1,...,xn is zero, we can form a smaller tuple P-tree. We form the tuple P-tree over the bands xi, i = 1, ..., n with i not equal to k, where band k has lower information gain than any other band. Then (2) becomes

P(X|Ci) = P(x1..xn without xk | Ci) * P(xk|Ci),    (5)

where the first factor is calculated by (4) without using band k and P(xk|Ci) is calculated by (3). If the root count of the reduced tuple P-tree is still zero, we remove the band with the second-lowest information gain and proceed in the same way. The general form of (5) is

P(X|Ci) = P(x1..xn without the dropped bands | Ci) * PRODUCT(j)[ P(xj|Ci) ],    (6)

where the first factor is calculated from (4) using the maximum number of bands for which the tuple P-tree RootCount is non-zero, j ranges over the minimum number of dropped bands (those with the lowest information gain, none of which appear in the first factor), and each P(xj|Ci) is calculated from (3).
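To summarize the procedure, the following is a rough sketch of the classification loop with the information-gain fallback. It is not the authors' implementation: value_ptrees, class_ptrees, info_gain, ptree_and and root_count are hypothetical inputs and helpers standing in for the P-tree operations described above.

```python
def classify(value_ptrees, class_ptrees, sample, info_gain, ptree_and, root_count):
    """Sketch of P-tree Bayesian classification with the info-gain fallback.

    value_ptrees[b][v] -- value P-tree for band b, value v   (assumed given)
    class_ptrees[c]    -- value P-tree for class label value c
    sample             -- dict {band: value} for the unclassified tuple
    info_gain          -- dict {band: gain}, used to pick which band to drop
    """
    bands = sorted(sample, key=lambda b: info_gain[b], reverse=True)
    dropped = []                       # bands removed, lowest information gain first
    while bands:
        # Tuple P-tree for the remaining bands: AND of their value P-trees.
        tuple_pt = None
        for b in bands:
            pt = value_ptrees[b][sample[b]]
            tuple_pt = pt if tuple_pt is None else ptree_and(tuple_pt, pt)
        scores = {}
        for c, cpt in class_ptrees.items():
            s_i = root_count(cpt)
            if s_i == 0:
                continue
            p = root_count(ptree_and(tuple_pt, cpt)) / s_i       # eq. (4)
            for b in dropped:                                    # eq. (3) factors in (6)
                p *= root_count(ptree_and(value_ptrees[b][sample[b]], cpt)) / s_i
            scores[c] = p * s_i        # times the class count, proportional to P(X|Ci)P(Ci)
        if scores and max(scores.values()) > 0:
            return max(scores, key=scores.get)
        dropped.append(bands.pop())    # drop the remaining band with the lowest gain
    return None                        # unclassifiable with the available training data
```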
2.3 Example:

Consider the training set, S, below, a 4x4 image with four bands, where B1 is the class label attribute. Each band is shown as a 4x4 grid of 4-bit values in raster order.

B1:                      B2:                      B3:                      B4:
0011 0011 0111 0111      0111 0011 0011 0010      1000 1000 0100 0101      1011 1111 1011 1011
0011 0011 0111 0111      0111 0011 0011 0010      1000 1000 0100 0101      1010 1011 1011 1011
0010 0010 1010 1111      1011 1011 1010 1010      1000 1000 0100 0100      1111 1111 1011 1011
0010 1010 1111 1111      1011 1011 1010 1010      1000 1000 0100 0100      1111 1111 1011 1011

So we have 5 distinct classes: C1 = 0010, C2 = 0011, C3 = 0111, C4 = 1010 and C5 = 1111. The value P-trees are listed below as the root count followed by the four quadrant counts in brackets.

Value P-trees for the class label band B1:
  P1,0010 (C1): 3 [0 0 3 0]
  P1,0011 (C2): 4 [4 0 0 0]
  P1,0111 (C3): 4 [0 4 0 0]
  P1,1010 (C4): 2 [0 0 1 1]
  P1,1111 (C5): 3 [0 0 0 3]

Value P-trees for B2:
  P2,0010: 2 [0 2 0 0]
  P2,0011: 4 [2 2 0 0]
  P2,0111: 2 [2 0 0 0]
  P2,1010: 4 [0 0 0 4]
  P2,1011: 4 [0 0 4 0]

Value P-trees for B3:
  P3,0100: 6 [0 2 0 4]
  P3,0101: 2 [0 2 0 0]
  P3,1000: 8 [4 0 4 0]

Value P-trees for B4:
  P4,1010: 1 [1 0 0 0]
  P4,1011: 10 [2 4 0 4]
  P4,1111: 5 [1 0 4 0]

Now consider an unknown sample X = (0011, 1000, 1011), giving values for B2, B3 and B4. The tuple P-tree is

PX2X3X4 = P0011,1000,1011 = 1 [1 0 0 0], leaf 0001.

Then
RootCount(PX2X3X4 AND P1,C1) = 0
RootCount(PX2X3X4 AND P1,C2) = 1
RootCount(PX2X3X4 AND P1,C3) = 0
RootCount(PX2X3X4 AND P1,C4) = 0
RootCount(PX2X3X4 AND P1,C5) = 0

So the sample belongs to class C2.

Now we examine another sample, X = (0011, 1000, 1010). In this case P0011,1000,1010 = 0, so we calculate the information gain of the different bands: G2 = 1.65, G3 = 1.59 and G4 = 1.31. We first remove band B4, which has the lowest information gain, from the tuple P-tree and build P0011,1000 = 2 [2 0 0 0], leaf 0110. Then

RootCount(PX2X3 AND P1,C1) * RootCount(P4,X4 AND P1,C1) = 0
RootCount(PX2X3 AND P1,C2) * RootCount(P4,X4 AND P1,C2) = 1
RootCount(PX2X3 AND P1,C3) * RootCount(P4,X4 AND P1,C3) = 0
RootCount(PX2X3 AND P1,C4) * RootCount(P4,X4 AND P1,C4) = 0
RootCount(PX2X3 AND P1,C5) * RootCount(P4,X4 AND P1,C5) = 0

So the sample belongs to class C2.

3. Experimental Results and Performance Analysis

In this section we present a comparative analysis of the proposed method. Initially we discuss the advantage of using P-trees in our approach, and then show some experimental results that indicate the comparative success rates of the proposed method. In the final subsection we discuss the qualities of this approach that make it suitable for the classification of data streams.

3.1 Performance of P-trees

Use of the P-tree in any application depends on the algebra that we refer to as the P-tree algebra. The operations in the P-tree algebra are AND, OR, XOR and COMPLEMENT. In our application we used AND and COMPLEMENT. Of these two, AND is the most critical operation. It takes two P-trees as operands and gives a resultant P-tree which is equivalent to the P-tree built from the pixel-wise logical AND of the underlying bit data. The performance of our application therefore depends on the performance of the AND operation. Studies show [6] that the AND operation on different bit positions of any band can be done in 8 to 52 milliseconds in a distributed parallel system. In a particular band, the most significant bit positions take less time than the least significant bit positions; this is an inherent characteristic of spatial image data [http://www.cs.ndsu.nodak.edu/~amroy/paper/paper.pdf].
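As an illustration of why these operations are cheap, the following is a minimal sketch of AND and COMPLEMENT on the mask (PM-tree) form of a P-tree described in the appendix; pure-1 and pure-0 quadrants let entire sub-trees be resolved without descending into them. The node encoding ('1', '0', or a list of four children) is an assumption made only for this sketch.

```python
def pm_and(a, b):
    """AND of two PM-tree nodes: '1' = pure-1, '0' = pure-0, list = mixed quadrant."""
    if a == '0' or b == '0':
        return '0'                 # anything AND a pure-0 quadrant is pure-0
    if a == '1':
        return b                   # a pure-1 quadrant is the identity for AND
    if b == '1':
        return a
    kids = [pm_and(x, y) for x, y in zip(a, b)]
    if all(k == '0' for k in kids):
        return '0'                 # children collapsed to a pure-0 quadrant
    return kids

def pm_complement(t):
    """COMPLEMENT of a PM-tree node: swap pure-1 and pure-0, recurse on mixed nodes."""
    if t == '0':
        return '1'
    if t == '1':
        return '0'
    return [pm_complement(c) for c in t]
```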
Figure 1 shows the variation of the AND time over the different significant bits for a 1320x1320 TIFF image (1,742,400 pixels).

[Figure 1: Performance of the P-tree AND operation (time versus bit significance)]

3.2 Probability calculations using P-trees

The advantage of using P-trees for the classification technique described in this paper is clearly evident when we consider the effort needed to calculate the required probability figures. If we try to do the same classification without P-trees, we have two options:
1. Scan the training data and obtain the probability counts for each classification.
2. Store all the possible probability counts for the training data.

Assume we have n attributes, m bits per attribute and c classes. With the first approach we need at least one scan of the training data for each classification. Each scan goes through all the pixels in the training image, looking for the specific bit patterns and updating the counts. If we use P-trees, we instead need nm AND operations and on average m/2 COMPLEMENT operations on the basic P-trees to come up with the count.

In the second approach, we need to scan the training data once and keep all the possible probability counts. When the number of bits and the number of attributes increase, the number of all possible combinations increases. This requires a capable data structure to store the probability counts and a mechanism to retrieve them. The required number of counters is c(2^(mn) + 2^(m(n-1)) + ... + 2^m). If we use P-trees, we need only the nm basic P-trees and c value P-trees, from which we can compute the required count for any pattern. This is a sequential process, and if we keep the intermediate values we can reuse them for the required counts if we decide to remove one or more attributes (those with the least amount of information) in the classification process. Since we know the order in which we need to drop the attributes, we can compute the count in that order for the given pattern.

Figure 2 shows the comparison of the two techniques with respect to space. The values are computed for a 1320x1320 image with 4 attributes and 4 classification classes.

[Figure 2: Increase in the space requirement (space in MB versus number of significant bits) for stored counts versus P-trees]

It has been shown in similar applications that the computational overhead incurred by the use of P-trees remains proportionately small when used on large image data sets. This makes the approach usable for data streams.

3.3 Classification Accuracy

To evaluate accuracy, we compared the classification accuracy of the proposed Bayesian classifier using information gain (BCIG) with two other widely accepted classification algorithms, k-nearest neighbor (KNN) and naive Bayesian (NBC). Actual aerial TIFF images with a synchronized yield band were used for the evaluation. The data is available on the SMILEY web site [http://rock.cs.ndsu.nodak.edu/smiley/]. The data set has 4 bands (red, green, blue and yield), each with 8 bits of information. Table 1 shows the respective success rates for each technique. The training sample size was set at 1024x1024. The objective of the classification is to generate the respective approximate yield values for a given aerial TIFF image of a crop field. The data was classified into 4 yield classes by considering the RGB attribute values.
Table 1. Comparative performance for 4 classification classes

        KNN     NvBC    BCIG (threshold 0.45)
Bits    Succ    Succ    Succ    IG Use    IG Succ
2       .12     .14     .48     .40       .36
3       .16     .19     .50     .34       .38
4       .21     .27     .51     .32       .37
5       .37     .26     .52     .31       .40
6       .46     .24     .51     .27       .41
7       .43     .23     .51     .20       .40

IG Use - proportion of the classifications in which information gain was used.
IG Succ - proportion of those classifications that were successful.

A classification is not accepted if the classification ratio (posterior probability) is not above the threshold value; this leads the classifier to use the new technique employing information gain. The algorithm will only try to use information gain if it is not possible to do a positive classification for that particular data item with the available training sample. The value of the threshold should be picked so as to not let the use of information gain dominate the classification, which could lead to an overall decline in accuracy. Experimental results showed that a threshold of 0.45 for the use of information gain is appropriate.

[Figure 3: Classification success rate versus number of significant bits for KNN, NBC and BCIG]

Seven significant bits were used for the comparison (the last few least significant bits may contain random noise). Figure 3 shows that our solution is capable of a much better classification with less information compared to the other two techniques. This is a significant advantage in the application of classification to data streams. All the techniques display a peak level of classification at different numbers of bits. The low level of classification accuracy can be attributed to the nature of the data used for the classification. This is further shown in Figure 4, where around 25% of the test cases are unclassifiable with all three techniques. It also shows that the proposed classifier performs better with less information compared to the other two techniques. At the peak classification level of 5 bits for our method, the set of pixels that are classifiable with the other techniques but fall outside our classified set is relatively small compared to the corresponding values at the peak levels of the other techniques.

[Figure 4: Classification correlation for the three techniques (proportion of classified pixels versus number of significant bits for KNN, NBC and BCIG)]

Table 2 shows the information gain values computed for the training data set. It can be observed that the information gain value for the blue band is the lowest. In the proposed algorithm this will be the first attribute to be removed if classification is not possible when using all three attributes. The difference in the level of information gain indicates the importance of each attribute in the classification process. The attribute blue is the least important band, as it should be for any crop field image data file. This further justifies our decision to use information gain to try to classify pixels that are not classifiable using all the available attributes.

Table 2. Variation of information gain for different bands

           Information gain for bit:
Band       2       3       4       5       6       7
Green      .140    .164    .173    .175    .176    .176
Red        .128    .149    .168    .174    .175    .176
Blue       .026    .076    .088    .092    .093    .093

All the above information is for one data image.
Experimental results showed the outcome to be similar for other TIFF images with RGB and synchronized yield data.

3.4 Performance in data stream applications

A typical data stream mining algorithm should satisfy the following criteria [7,8]:
i. It must require a small constant time per record.
ii. It must use only a fixed amount of main memory.
iii. It must be able to build a model using at most one scan of the data.
iv. It must make a usable model available at any point in time.
v. It should produce a model that is equivalent to the one that would be obtained by the corresponding database-mining algorithm.
vi. When the data-generating phenomenon is changing over time, the model at any time should be up-to-date but also include past information.

We now illustrate how our new approach of using a P-tree based Bayesian classifier meets these stringent design criteria. For the first and second criteria, we need very little time to build the basic P-trees from the data that comes through the data stream, and the size of the classifier is constant for a fixed image size. Considering the upper bound on the size of each P-tree, we can determine the fixed amount of main memory required for the respective application. With respect to the third and fourth criteria, the P-trees are built in only one pass over the data, and the collection of P-trees created for an image is classification-ready at any point in time. Fifth, the P-tree model helps us to build the classifier quickly and conveniently. Since it is a lossless representation of the original data, the classifier built using P-trees contains all the information of the original training data, and the classification is equivalent to that of the corresponding traditional classification technique. In this approach the P-trees give us the capability to calculate the required probability values for the Bayesian classification accurately and efficiently. Sixth, if the data-generating phenomenon changes over time, it will not adversely affect the classification process. The new P-trees built from incoming data provide the adaptability required of the classifier. Past information is also not lost as long as we keep the older P-trees. The size of the training image window over the data stream, and how it is selected, can provide the historical aspects as well as the adaptability required of the classifier.

4. Conclusion:

In this paper we efficiently used the P-tree data structure in Bayesian classification. We also applied a new method in Bayesian classification by introducing information gain. Thus our new method has the simplicity of naive Bayesian as well as the accuracy of other classifiers. We have also shown that our method is fit for data stream mining, which is a new concept of data mining compared with traditional database mining. The P-tree plays an important role in achieving these advantages by providing a way to calculate the different probabilities easily, quickly and accurately. P-tree technology can be used efficiently in other data mining techniques as well, such as association rule mining, k-nearest neighbor classification and decision tree classification.

5. References:

[1] W. Perrizo, "Peano Count Tree Lab Notes", Technical Report NDSU-CSOR-TR-01-1, 2001.
[2] SMILEY project. Available at http://midas.cs.ndsu.nodak.edu/~smiley
[3] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Menlo Park, CA, 1996.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.
[5] D. Heckerman,
Bayesian networks for knowledge discovery. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 273-305. MIT Press, Cambridge, MA, 1996.
[6] A. Roy, Master's thesis on P-trees and the AND operation, Department of Computer Science, NDSU. Available at http://www.cs.ndsu.nodak.edu/~amroy/paper/paper.pdf
[7] P. Domingos and G. Hulten, "Mining high-speed data streams", ACM SIGKDD 2000.
[8] P. Domingos and G. Hulten, "Catching Up with the Data: Research Issues in Mining Data Streams", DMKD 2001.
[9] H. Samet, "The quadtree and related hierarchical data structures", ACM Computing Surveys, 16(2), 1984.
[10] HH-codes. Available at http://www.statkart.no/nlhdb/iveher/hhtext.htm

Appendix: Peano Count Tree (P-tree) and the P-tree Algebra

Most spatial data comes in a format called BSQ, for Band Sequential (or can be easily converted to BSQ). BSQ data has a separate file for each attribute or band. The ordering of the data values within a band is raster ordering with respect to the spatial area represented in the dataset. This order is assumed and therefore is not explicitly indicated as a key attribute in each band (bands have just one column). In this paper, we divide each BSQ band into several files, one for each bit position of the data values. We call this format bit Sequential, or bSQ. Each bit file of the bSQ format is converted into a tree structure, called a Peano Count Tree (P-tree). A P-tree is a quadrant-based tree. It recursively divides the entire image into quadrants and records the count of 1-bits for each quadrant, thus forming a quadrant count tree. P-trees are somewhat similar in construction to other data structures in the literature (e.g., quadtrees [9] and HH-codes [10]). For example, given the following 8-row-by-8-column image of single bits, its P-tree and PM-tree are as shown in Figure 5.

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

P-tree:
                 55
       _________/ / \ \_________
      /      ___/     \___      \
    16     __8__      _15__      16
          / / \ \     / / \ \
         3 0  4  1   4  4  3  4
       //|\    //|\        //|\
       1110    0010        1101

PM-tree:
                  m
       _________/ / \ \_________
      /      ___/     \___      \
     1     __m__       _m__      1
          / / \ \     / / \ \
         m 0  1  m   1  1  m  1
       //|\    //|\        //|\
       1110    0010        1101

Figure 5. P-tree and PM-tree for an 8x8 image

In this example, 55 is the number of 1's in the entire image. This root level is labeled level 0. The numbers 16, 8, 15, and 16 found at the next level (level 1) are the 1-bit counts for the four major quadrants in raster order (upper-left, upper-right, lower-left, lower-right). Since the first and last level-1 quadrants are composed entirely of 1-bits (called pure-1 quadrants), sub-trees are not needed and these branches terminate. Similarly, quadrants composed entirely of 0-bits are called pure-0 quadrants, which also cause termination of tree branches. This pattern is continued recursively using the Peano or Z-ordering (recursive raster ordering) of the four sub-quadrants at each new level. Eventually, every branch terminates (since, at the "leaf" level, all quadrants are pure). If we were to expand all sub-trees, including those for pure quadrants, then the leaf sequence would be the Peano ordering of the image. Thus, we use the name Peano Count Tree. A variation of the P-tree data structure, the Peano Mask Tree (PM-tree), is a similar structure in which masks rather than counts are used. In a PM-tree, we use a 3-value logic to represent pure-1, pure-0 and mixed quadrants (1 denotes pure-1, 0 denotes pure-0 and m denotes mixed).
The PM-tree for the previous example is also given in Figure 5. Since a PM-tree is just an alternative implementation of the Peano Count Tree, for simplicity we will use the same term, "P-tree", for the Peano Mask Tree as well. We note that the fan-out of a P-tree need not be fixed at four. It can be any power of 4 (effectively skipping levels in the tree). Also, the fan-out at any one level need not coincide with the fan-out at another level. The fan-out pattern can be chosen to produce maximum compression for each bSQ file. More discussion can be found in [1]. For simplicity, let us assume that the fan-out is four.

For each band (assuming 8-bit data values, though the model applies to data with any number of bits), there are eight P-trees as currently defined, one for each bit position. We will call these P-trees the basic P-trees of the spatial dataset, and use the notation Pb,i to denote the basic P-tree for band b and bit position i. There are always 8n basic P-trees for a dataset with n bands. Each basic P-tree has a natural complement. The complement of a basic P-tree can be constructed directly from the P-tree by simply complementing the counts at each level (subtracting from the pure-1 count at that level), as shown in the example below (Figure 6). Note that the complement of a P-tree provides the 0-bit counts for each quadrant. P-tree AND and OR operations are also illustrated in Figure 6. In the figure, each tree is written compactly with the root first, the children of a mixed node in parentheses, and the leaf bit patterns of mixed 2x2 sub-quadrants in brackets.

P-tree:      55 ( 16, 8 ( 3[1110], 0, 4, 1[0010] ), 15 ( 4, 4, 3[1101], 4 ), 16 )
PM-tree:     m ( 1, m ( m[1110], 0, 1, m[0010] ), m ( 1, 1, m[1101], 1 ), 1 )

Complement:  9 ( 0, 8 ( 1[0001], 4, 0, 3[1101] ), 1 ( 0, 0, 1[0010], 0 ), 0 )
             m ( 0, m ( m[0001], 1, 0, m[1101] ), m ( 0, 0, m[0010], 0 ), 0 )

P-tree-1:    m ( 1, m ( m[1110], 0, 1, m[0010] ), m ( 1, 1, m[1101], 1 ), 1 )
P-tree-2:    m ( 1, 0, m ( 1, 1, 1, m[0100] ), 0 )
AND result:  m ( 1, 0, m ( 1, 1, m[1101], m[0100] ), 0 )
OR result:   m ( 1, m ( m[1110], 0, 1, m[0010] ), 1, 1 )

Figure 6. P-tree algebra (Complement, AND and OR)

By performing the AND operation on the appropriate subset of the basic P-trees and their complements, we can construct P-trees for values with more than one bit. This kind of P-tree is called a value P-tree and is denoted by Pb,v, where b is the band number and v is a specified value. For example, the value P-tree P1,110 gives the count of pixels with band-1 bit 1 equal to 1, bit 2 equal to 1 and bit 3 equal to 0, i.e., with band-1 value in the range [192, 224). In the very same way, we can construct tuple P-trees, where P(v1,...,vn) denotes the P-tree in which each node count is the number of pixels in that quadrant having the value vi in band i, for i = 1, ..., n. Any tuple P-tree can also be constructed directly from basic P-trees and their complements by the AND operation. The process of ANDing basic P-trees and their complements to produce value P-trees or tuple P-trees can be done at any level of precision: 1-bit precision, 2-bit precision, ..., 8-bit precision. For example, using the full 8-bit precision, Pb,11010011 can be constructed from basic P-trees by the following AND operation, where ' indicates the complement.
Pb,11010011 = Pb,1 AND Pb,2 AND Pb,3' AND Pb,4 AND Pb,5' AND Pb,6' AND Pb,7 AND Pb,8

If only 3-bit precision is used, the value P-tree Pb,110 would be constructed by

Pb,110 = Pb,1 AND Pb,2 AND Pb,3'.

The tuple P-tree P001,010,111,011,001,110,011,101 would be constructed by

P001,010,111,011,001,110,011,101 = P1,001 AND P2,010 AND P3,111 AND P4,011 AND P5,001 AND P6,110 AND P7,011 AND P8,101

[Figure 7: Basic P-trees (e.g., P1,1, P1,2, ..., P8,8) are ANDed to produce value P-trees (e.g., P1,110 for 3-bit values), which are ANDed to produce tuple P-trees (e.g., P001,010,111,011,001,110,011,101)]

The detailed algorithm and other experimental results for the ANDing operation can be found in [6].
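To tie the pieces together, the following is a small sketch of how value P-trees and tuple P-trees could be assembled from basic P-trees by ANDing, reusing the pm_and and pm_complement sketches from Section 3.1; value_ptree, tuple_ptree and the variable names in the usage comment are illustrative assumptions, not part of any existing P-tree library.

```python
# pm_and and pm_complement are the PM-tree sketches from Section 3.1.

def value_ptree(bit_ptrees, value_bits):
    """Pb,v as the AND of basic P-trees (for 1-bits) and their complements (for 0-bits).

    bit_ptrees[j] holds the basic PM-tree for bit position j+1 of band b;
    value_bits is a string such as '110' for 3-bit precision.
    """
    result = '1'                       # the pure-1 tree is the identity for AND
    for pt, bit in zip(bit_ptrees, value_bits):
        result = pm_and(result, pt if bit == '1' else pm_complement(pt))
    return result

def tuple_ptree(per_band_value_ptrees):
    """P(v1,...,vn) as the AND of the per-band value P-trees."""
    result = '1'
    for pt in per_band_value_ptrees:
        result = pm_and(result, pt)
    return result

# For example, at 3-bit precision: Pb,110 = Pb,1 AND Pb,2 AND Pb,3'
# p_b_110 = value_ptree([p_b1, p_b2, p_b3], '110')
```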