Workshop on Data Mining and Knowledge Discovery 2001, Santa Barbara, CA

Association Rule Mining on Remotely Sensed Images Using Peano Count Trees

William Perrizo, Qin Ding, North Dakota State University

Abstract

Association Rule Mining, originally proposed for market basket data, has potential applications in many areas. Remotely Sensed Imagery (RSI) data is one of the areas in which association rule mining can be very beneficial. In RSI data, each pixel is considered a transaction; thus there can be large numbers of transactions. Mining implicit relations among different bands of RSI data can be very useful. In this paper, we propose a new model for deriving association rules from RSI data. In our model, a new lossless data structure, the Peano Count Tree (Ptree), is used to represent all the needed information. Ptrees represent RSI data bit-by-bit in a recursive quadrant-by-quadrant arrangement. We propose a fast algorithm for logically ANDing Ptrees as a way to efficiently calculate the support of itemsets, and a P-mining algorithm for rule mining using Ptrees. In addition, we propose several pruning techniques, including bit-based pruning and band-based pruning, to improve the efficiency of the rule mining process. Our approach is equally applicable to the various methods for partitioning spatial data, including equi-length partitioning, equi-depth partitioning, and customized partitioning. We implemented our model and compared it with the FP-growth and basic Apriori algorithms; the results show that our model is very efficient for data mining on RSI spatial data.

Keywords: Data Mining, Association Rule Mining, Remotely Sensed Imagery (RSI)

1. Introduction

Association rule mining (ARM) [1,2,3,4,12], proposed by Agrawal, Imielinski and Swami in 1993, is one of the important advances in data mining. The initial application of association rule mining was on market basket data.
A typical result might be a rule of the form, "customers who purchase one item are very likely to purchase another item at the same time." There are two primary measures, support and confidence, used to assess the quality of such rules. The goal of association rule mining is to find all the rules with support and confidence exceeding user-specified thresholds. The first step in the basic ARM algorithms (e.g., Apriori [1] and DHP [4]) is to use the downward closure property of support to find all frequent itemsets (those itemsets with support above the minimal threshold). After obtaining all frequent itemsets, the high-confidence rules supported by them are found in a straightforward way.

Subsequent applications include association rule mining on spatial data, a very promising field. Huge amounts of spatial data have been collected in various ways, but the discovery of the knowledge contained therein has just begun. In this paper, we consider a space to be represented by a 2-dimensional array of pixel locations. Associated with each pixel are various attributes, or bands, such as visible reflectance intensities (blue, green and red), infrared reflectance intensities (e.g., NIR, MIR1, MIR2 and TIR) and possibly other value bands (e.g., yield quantities, quality measures, soil attributes and radar reflectance intensities). The pixel coordinates in raster order constitute the key attribute of the spatial dataset and the other bands are the non-key attributes. These spatial datasets are not usually organized in relational format; instead, they are organized, or can easily be re-organized, into what is called Band Sequential or BSQ format (each attribute or band is stored as a separate file). In this paper, the association rules of interest are the relationships among the reflectance bands and the other bands (e.g., the yield band).
We propose a new format, bSQ (bit Sequential), to organize images in a spatial-data-mining-ready format (a separate file for each bit position of each band). We use a rich, new, lossless data structure, the Peano Count Tree (Ptree), to record quadrant-wise aggregate information (e.g., counts) for each bSQ file. With these quadrant counts, we can perform association rule mining very efficiently on the entire space or on a specific subspace. Efficient pruning techniques are very important in association rule mining. In our model, we apply two kinds of pruning techniques, bit-based pruning and band-based pruning. Though we focus on quantitative discrete data, partitioning can be used to further reduce the complexity [3]. There are several ways to partition spatial data, such as equi-length partitioning, equi-depth partitioning and customized partitioning. Our model works equally well in conjunction with any of these partitioning approaches.

The paper is organized as follows. Section 2 gives some background on spatial data and describes the new bit Sequential format, bSQ. Section 3 describes the Peano Count Tree data structure and its variations. Section 4 details how to derive association rules using the Peano Count Tree and two pruning techniques. Experimental results and performance analysis are given in Section 5. Section 6 discusses related work; conclusions and future work are given in Section 7.

2. Formats of Spatial Data

There are vast amounts of spatial data on which one can perform data mining to obtain useful information [5]. Spatial data are collected in different ways and are organized in different formats. BSQ, BIL and BIP are three typical formats used for Remotely Sensed Images (RSI). A remotely sensed image typically contains several bands, or columns, of reflectance intensities. For example, TM6 (Thematic Mapper) scenes contain six bands and TM7 scenes contain seven bands (Blue, Green, Red, NIR, MIR, TIR, MIR2).
Each band contains a relative reflectance intensity value in the range 0-255 for each pixel location in the scene. An image can be organized into a relational table in which each tuple corresponds to a pixel and each spectral band is an attribute. The primary key consists of the pixel location and can be expressed as x-y coordinates or as latitude-longitude pairs.

The Band Sequential (BSQ) format is similar to the relational format. In BSQ each band is stored as a separate file. Each individual band uses the same raster order, so that the primary key attribute values are calculable and need not be included. Landsat satellite Thematic Mapper (TM) scenes are in BSQ format. The Band Interleaved by Line (BIL) format stores the data in line-major order; the image scan line constitutes the organizing base. That is, BIL organizes all the bands in one file and interleaves them by row (the first row of all bands is followed by the second row of all bands, and so on). SPOT data, which comes from French satellite sensors, are in the Band Interleaved by Pixel (BIP) format, based on a pixel-consecutive scheme in which the banded data is stored in pixel-major order. That is, BIP organizes all bands in one file and interleaves them by pixel. Standard TIFF images are in BIP format. Figure 2 gives an example of the BSQ, BIL and BIP formats.

In this paper, we propose a new format, called bit Sequential (bSQ), to organize spatial data. A reflectance value in a band is a number in the range 0-255, represented as an 8-bit byte. We split each band into eight separate files, one for each bit position. Figure 2 also gives an example of the bSQ format. There are several reasons why we use the bSQ format. First, different bits contribute to the value to different degrees; in some applications, the high-order bits alone provide the necessary information. Second, the bSQ format facilitates the representation of a precision hierarchy.
Third, and most importantly, the bSQ format facilitates the creation of an efficient, rich data structure, the Ptree, and accommodates algorithm pruning based on a one-bit-at-a-time approach.

We give a very simple illustrative example with only 2 data bands in a scene having only 4 pixels (2 rows and 2 columns). Both decimal and binary reflectance values are shown in Figure 1.

    BAND-1                                BAND-2
    254 (1111 1110)   127 (0111 1111)     37 (0010 0101)   240 (1111 0000)
     14 (0000 1110)   193 (1100 0001)    200 (1100 1000)    19 (0001 0011)

    Figure 1. Two bands of a 2-row-2-column image

The BSQ, BIL, BIP and bSQ formats are given below.

    BSQ format (2 files):
      Band 1: 254 127 14 193
      Band 2: 37 240 200 19

    BIL format (1 file):
      254 127 37 240 14 193 200 19

    BIP format (1 file):
      254 37 127 240 14 200 193 19

    bSQ format (16 files, bits in raster order):
      B11: 1001   B12: 1101   B13: 1100   B14: 1100
      B15: 1110   B16: 1110   B17: 1110   B18: 0101
      B21: 0110   B22: 0110   B23: 1100   B24: 0101
      B25: 0010   B26: 1000   B27: 0001   B28: 1001

    Figure 2. BSQ, BIL, BIP and bSQ formats

3. Data Structures

3.1 Basic Ptrees

We reorganize each bit file of the bSQ format into a tree structure, called a Peano Count Tree (Ptree). A Ptree is a quadrant-based tree. The idea is to recursively divide the entire image into quadrants and record the count of 1-bits for each quadrant, thus forming a quadrant count tree. Ptrees are somewhat similar in construction to other data structures in the literature (e.g., Quadtrees [6] and HHcodes [10]). For example, given an 8-row-8-column image of single bits, its Ptree is as shown in Figure 3.

    1 1 1 1 1 1 0 0               55                       -- level 0
    1 1 1 1 1 0 0 0          /   |    |    \
    1 1 1 1 1 1 0 0        16   _8_  _15_   16             -- level 1
    1 1 1 1 1 1 1 0           / | | \    / | | \
    1 1 1 1 1 1 1 1          3  0 4  1  4  4 3  4          -- level 2
    1 1 1 1 1 1 1 1        //|\    //|\      //|\
    1 1 1 1 1 1 1 1        1110    0010      1101          -- level 3
    0 1 1 1 1 1 1 1

    Figure 3. 8x8 image and its Ptree

In this example, 55 is the number of 1's in the entire image. This root level is labeled level 0.
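To make the two steps above concrete, here is a small Python sketch of ours (not from the paper): it splits a band into its eight bit planes as in the bSQ format, then builds a fan-out-4 quadrant count tree from a bit matrix. Simple nested lists stand in for the bSQ files, and the 4x4 bit matrix at the end is a made-up example, not the image of Figure 3.

```python
def to_bsq(band):
    """Split a 2-D band of 8-bit values into 8 bit planes (bit 1 = MSB)."""
    return [[[(v >> shift) & 1 for v in row] for row in band]
            for shift in range(7, -1, -1)]

def build_ptree(bits):
    """Peano count tree (fan-out 4) of a 2^n x 2^n bit matrix.
    A node is (count, children); children is None for pure quadrants."""
    n = len(bits)
    count = sum(map(sum, bits))
    if count == 0 or count == n * n:        # pure-0 or pure-1: branch terminates
        return (count, None)
    h = n // 2
    quads = [[row[:h] for row in bits[:h]],   # 0: upper left
             [row[h:] for row in bits[:h]],   # 1: upper right
             [row[:h] for row in bits[h:]],   # 2: lower left
             [row[h:] for row in bits[h:]]]   # 3: lower right
    return (count, [build_ptree(q) for q in quads])

# Band 1 of the 2x2 example in Figure 1:
planes = to_bsq([[254, 127], [14, 193]])
print(planes[0])                # bit-band B11: [[1, 0], [0, 1]]

# a made-up 4x4 bit band:
tree = build_ptree([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [1, 0, 1, 1],
                    [0, 0, 1, 1]])
print(tree[0], [c[0] for c in tree[1]])   # 9 [4, 0, 1, 4]
```

The root count is the total number of 1-bits; the four child counts are the 1-bit counts of the four quadrants, and pure quadrants carry no subtree.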
The numbers 16, 8, 15, and 16 found at the next level (level 1) are the 1-bit counts for the four major quadrants in raster order (upper-left, upper-right, lower-left, lower-right). Since the first and last level-1 quadrants are composed entirely of 1-bits (pure-1 quadrants), subtrees are not needed (the subtrees would also be entirely pure-1) and these branches terminate. Similarly, quadrants composed entirely of 0-bits are called pure-0 quadrants and also cause termination of tree branches. This pattern is continued recursively using the Peano or Z-ordering (recursive raster ordering) of the four sub-quadrants at each new level. Eventually, every branch terminates, since at the "leaf" level all quadrants are pure. If we were to expand all subtrees, including those for pure quadrants, then the leaf sequence would be the well-known Peano ordering of the image. Thus, we use the name Peano Count Tree.

We note that the fan-out of this Ptree construction need not be fixed at four. It can, in fact, be any power of 4 (effectively skipping levels in the tree). Also, the fan-out at any one level need not coincide with the fan-out at another level; the fan-out pattern can be chosen to produce maximum compression for each bSQ file. In this paper we will fix the fan-out at 4, for simplicity.

For each band (assuming 8-bit data values), there are eight Ptrees as currently defined, one for each bit position. We will call these Ptrees the basic Ptrees of the spatial dataset. We will use the notation Pb,i to denote the basic Ptree for band b and bit position i. There are always 8n basic Ptrees for a dataset with n bands. Each basic Ptree has a natural complement, the Ptree of the bit complement of the bSQ file. The complement of a basic Ptree can be constructed directly from the Ptree by simply complementing the counts at each level (subtracting from the pure-1 count at that level), as shown in the example below (Figure 4).
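The count-complement rule just described can be sketched in a few lines (our illustration; nodes are (count, children) tuples, with None children for pure or unexpanded quadrants, and the hand-built tree below is a made-up 4x4 example):

```python
def complement(node, width):
    """Complement a Ptree node over a width x width quadrant: each count is
    subtracted from the pure-1 count at that level, recursively."""
    count, children = node
    if children is not None:
        children = [complement(c, width // 2) for c in children]
    return (width * width - count, children)

# a hand-built 4x4 Ptree: root count 9, level-1 counts 4, 0, 1, 4
tree = (9, [(4, None),
            (0, None),
            (1, [(1, None), (0, None), (0, None), (0, None)]),
            (4, None)])
comp = complement(tree, 4)
print(comp[0], [c[0] for c in comp[1]])   # 7 [0, 4, 3, 0]
```

No bit data is touched: the complement is obtained purely from the counts, which is why it is essentially free to compute.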
Note that the complement of a Ptree provides the 0-bit counts for each quadrant.

               55                                          9
          /   |    |    \                             /   |   |   \
        16   _8_  _15_   16       complement         0   _8_  _1_   0
           / | | \    / | | \                          / | | \   / | | \
          3  0 4  1  4  4 3  4                        1  4 0  3 0  0 1  0
        //|\    //|\      //|\                      //|\    //|\     //|\
        1110    0010      1101                      0001    1101     0010

    Figure 4. Basic Ptree and its complement

By performing a simple pixel-wise logical AND operation on the appropriate subset of the basic Ptrees and their complements, we can construct the value Ptrees. The value Ptree for a reflectance value v from a band b (denoted Pb,v) is the Ptree in which each node number is the count of pixels in that quadrant having value v in band b. In the very same way, we can construct tuple Ptrees, where P(v1,…,vn) denotes the Ptree in which each node number is the count of pixels in that quadrant having value vi in band i, for i = 1,…,n. Any tuple Ptree can also be constructed directly from basic Ptrees and their complements by the AND operation. We will describe how this AND operation can be efficiently performed. The basic Ptree Pi,j represents all the information in the jth bit-column of the ith band in a lossless way (in fact, with far more information than the bit-band itself contains, since Pi,j provides the 1-bit count for every quadrant of every dimension). Finally, we note that Ptrees are a data-mining-ready, lossless format for storing spatial data. The process of ANDing basic Ptrees and their complements to produce value Ptrees or tuple Ptrees can be done at any level of precision (1-bit precision, 2-bit precision, …, 8-bit precision) by simply considering only the appropriate number of high-order bits from the value(s) and truncating the rest.
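Which operands enter the AND follows directly from the bit pattern of the chosen value: the basic Ptree for each 1-bit, and its complement for each 0-bit. A tiny sketch of this selection (ours, for illustration; the Ptree names are just strings here):

```python
def value_ptree_operands(band, value_bits):
    """Name the basic Ptrees (or complements, marked with ') to AND for the
    value Ptree P_band,value: the basic Ptree for each 1-bit of the value,
    its complement for each 0-bit."""
    return [f"P{band},{i}" + ("" if bit == "1" else "'")
            for i, bit in enumerate(value_bits, start=1)]

print(value_ptree_operands(1, "110"))
# ['P1,1', 'P1,2', "P1,3'"]
```

Truncating the value to fewer high-order bits simply shortens the operand list, which is how the precision hierarchy falls out of the bSQ representation.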
E.g., using the full 8-bit precision (all 8 bits of each byte), the value Ptree Pb,11010011 can be constructed from basic Ptrees by ANDing the basic Ptree for each 1-bit and its complement for each 0-bit, where ' indicates the complement (which, again, is simply the Ptree with each count replaced by its count complement):

    Pb,11010011 = Pb,1 AND Pb,2 AND Pb,3' AND Pb,4 AND Pb,5' AND Pb,6' AND Pb,7 AND Pb,8

If only 3-bit precision is used, the value Ptree Pb,110 would be constructed by:

    Pb,110 = Pb,1 AND Pb,2 AND Pb,3'

The tuple Ptree P001,010,111,011,001,110,011,101 would be constructed by:

    P001,010,111,011,001,110,011,101 = P1,001 AND P2,010 AND P3,111 AND P4,011 AND P5,001 AND P6,110 AND P7,011 AND P8,101

    Basic (bit) Ptrees (e.g., P1,1, P1,2, …, P2,1, …, P8,8)
      --AND-->  Value Ptrees (e.g., P1,110)
      --AND-->  Tuple Ptrees (e.g., P001,010,111,011,001,110,011,101)

    Figure 5. Basic Ptrees, Value Ptrees (for 3-bit values) and Tuple Ptrees

The AND operation can be viewed as simply the pixel-wise AND of bits from bSQ files. However, since such files can contain hundreds of millions of bits, shortcut methods are needed. We discuss such methods later in the paper. The process of converting data to Ptrees is also time-consuming unless special methods are used. For example, our methods can convert a TM satellite image (approximately 60 million pixels) to its basic Ptrees in just a few seconds on a high-performance PC; this is a one-time process. We also note that we store the basic Ptrees in a data structure which specifies only the pure-1 quadrants, and does so in a lossless way. Using this data structure, each AND can be completed in a few milliseconds.

3.2 Variations of the Ptree Structure

In order to optimize the AND operation, we use a variation of the Ptree data structure which uses masks rather than counts. It is called the PM-tree (Pure Mask tree) structure.
In a PM-tree, we use a 3-value logic to represent pure-1, pure-0 and mixed quadrants (1 denotes pure-1, 0 denotes pure-0 and m denotes mixed). Thus, the PM-tree for the previous example is:

    1 1 1 1 1 1 0 0               m
    1 1 1 1 1 0 0 0          /   |    |   \
    1 1 1 1 1 1 0 0         1   _m_  _m_   1
    1 1 1 1 1 1 1 0           / | | \   / | | \
    1 1 1 1 1 1 1 1          m  0 1  m  1 1 m  1
    1 1 1 1 1 1 1 1        //|\    //|\     //|\
    1 1 1 1 1 1 1 1        1110    0010     1101
    0 1 1 1 1 1 1 1

    Figure 6. 8x8 image and its PM-tree

The PM-tree specifies the location of the pure-1 quadrants of the operands. The pure-1 quadrants of the AND result can be easily identified as the coincidence of pure-1 quadrants in both operands, and pure-0 quadrants of the result occur wherever a pure-0 quadrant is found in at least one of the operands. At each level, bit-vector masks can be used to further simplify the construction.

3.3 The Ptree ANDing Algorithm

We begin this section with a description of the AND algorithm, which is used to calculate the root counts of value Ptrees and tuple Ptrees. The approach is to store only the basic Ptrees and then generate the value and tuple Ptree root counts "on-the-fly" as needed. In this algorithm we assume Ptrees are coded in a compact, depth-first ordering of the paths to each pure-1 quadrant. We use a hierarchical quadrant id (qid) scheme: at each level we append a sub-quadrant id number (0 means upper left, 1 means upper right, 2 means lower left, 3 means lower right). Thus the four level-1 quadrants have qids 0, 1, 2 and 3; the sub-quadrants of quadrant 1 have qids 10, 11, 12 and 13; the sub-quadrants of quadrant 10 have qids 100, 101, 102 and 103; and so on.

    Figure 7. Quadrant id (qid)

We consider the following example first.
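Before turning to the example, the 3-value AND logic just described can be summarized in a small sketch (our illustration, not code from the paper):

```python
def pm_and(a, b):
    """AND two PM-tree node values under the 3-value logic:
    '1' pure-1, '0' pure-0, 'm' mixed."""
    if a == '0' or b == '0':
        return '0'        # a pure-0 operand forces a pure-0 result
    if a == '1' and b == '1':
        return '1'        # pure-1 only where both operands are pure-1
    if a == '1':
        return b          # AND with pure-1 leaves the other operand unchanged
    if b == '1':
        return a
    return 'm'            # mixed AND mixed must be examined recursively

for x, y in [('1', '1'), ('1', 'm'), ('0', 'm'), ('m', 'm')]:
    print(x, 'AND', y, '=', pm_and(x, y))
```

Only the mixed-AND-mixed case requires descending into the subtrees; every other case is decided at the current node, which is what makes the mask representation fast.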
    1 1 1 1 1 1 0 0               55                  m
    1 1 1 1 1 0 0 0          /   |    |    \       (PM-tree: m at the root;
    1 1 1 1 1 1 0 0        16   _8_  _15_   16      1, m, m, 1 at level 1;
    1 1 1 1 1 1 1 0           / | | \    / | | \    m 0 1 m and 1 1 m 1 at
    1 1 1 1 1 1 1 1          3  0 4  1  4  4 3  4   level 2; leaves 1110,
    1 1 1 1 1 1 1 1        //|\    //|\      //|\   0010, 1101)
    1 1 1 1 1 1 1 1        1110    0010      1101
    0 1 1 1 1 1 1 1

The sequence of pure-1 qids (in left-to-right, depth-first order) is: 0, 100, 101, 102, 12, 132, 20, 21, 220, 221, 223, 23, 3.

Now we need a second operand for the AND. We will use the following Ptree as the second operand:

    1 1 1 1 0 0 0 0               29                  m
    1 1 1 1 0 0 0 0          /   |    |   \        (PM-tree: m at the root;
    1 1 1 1 0 0 0 0        16    0  _13_   0        1, 0, m, 0 at level 1;
    1 1 1 1 0 0 0 0                / | | \          1 1 1 m at level 2;
    1 1 1 1 0 0 0 0               4  4 4  1         leaf 0100)
    1 1 1 1 0 0 0 0                     //|\
    1 1 0 1 0 0 0 0                     0100
    1 1 0 0 0 0 0 0

with pure-1 qid sequence: 0, 20, 21, 22, 231.

Since a quadrant will be pure-1 in the result only if it is pure-1 in both operands (or all operands, in the case there are more than 2), the AND is done by scanning the operands and outputting the matching pure-1 sequences:

    operand 1:  0  100  101  102  12  132  20  21  220  221  223  23  3
    operand 2:  0  20  21  22  231

      operand 1 | operand 2 | output
      ----------+-----------+-------
         0      |    0      |  0
         20     |    20     |  20
         21     |    21     |  21
         220    |    22     |  220
         221    |    22     |  221
         223    |    22     |  223
         23     |    231    |  231

Therefore the result is:

    1 1 1 1 0 0 0 0               28                  m
    1 1 1 1 0 0 0 0          /   |    |   \        (PM-tree: m at the root;
    1 1 1 1 0 0 0 0        16    0  _12_   0        1, 0, m, 0 at level 1;
    1 1 1 1 0 0 0 0                / | | \          1 1 m m at level 2;
    1 1 1 1 0 0 0 0               4  4 3  1         leaves 1101, 0100)
    1 1 1 1 0 0 0 0                //|\  //|\
    1 1 0 1 0 0 0 0                1101  0100
    0 1 0 0 0 0 0 0

The pseudo-code for the ANDing algorithm is given below in Figure 8.

    Ptree_ANDing(P1, P2, Presult)
    // pos1, pos2, pos3 record the current pure-1 quadrant (qid) position of P1, P2, Presult
    1. pos1 := 0; pos2 := 0; pos3 := 0;
    2. DO WHILE (pos1 <> END of P1 AND pos2 <> END of P2)
       (a) IF P1.pos1 = P2.pos2 THEN BEGIN
              Presult.pos3 := P1.pos1; pos1 := pos1+1; pos2 := pos2+1; pos3 := pos3+1; END
       (b) ELSE IF P1.pos1 is a prefix of P2.pos2 THEN BEGIN
              Presult.pos3 := P2.pos2; pos2 := pos2+1; pos3 := pos3+1; END
       (c) ELSE IF P2.pos2 is a prefix of P1.pos1 THEN BEGIN
              Presult.pos3 := P1.pos1; pos1 := pos1+1; pos3 := pos3+1; END
       (d) ELSE IF P1.pos1 < P2.pos2 THEN pos1 := pos1+1;
       (e) ELSE pos2 := pos2+1;
       END IF
       END DO

    Figure 8. Ptree ANDing algorithm

4. Association Rule Mining on Spatial Data Using Ptrees

4.1 Discretization (Partitioning)

The reflectance values are quantitative data (typically 8-bit values). It is common to partition the data before performing association rule mining. There are several ways to partition the data, including equi-length partitioning, equi-depth partitioning and customized partitioning. Equi-length partitioning is a simple but very useful method. By truncating some of the right-most bits of the values (the low-order or least significant bits) we can reduce the size of the itemset dramatically without losing too much information (the low-order bits show only subtle differences). For example, we can truncate the right-most 6 bits, resulting in the set of values {00, 01, 10, 11} (in decimal, {0, 1, 2, 3}). Each of these values represents a partition of the original 8-bit value space (i.e., 00 represents the values in [0,64), 01 represents the values in [64,128), etc.).

Further pruning can be done by understanding the kinds of rules that are of interest to the user and focusing on those only. For instance, for an agricultural producer using precision techniques, there is little interest in rules of the type, Red>48 ⇒ Green<134.
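The ANDing algorithm of Figure 8 (Section 3.3) can be rendered as runnable Python (a sketch of ours; operands are sorted depth-first lists of pure-1 qid strings, as in the worked example):

```python
def ptree_and(p1, p2):
    """AND two Ptrees given as sorted depth-first lists of pure-1 qids.
    A result quadrant is pure-1 iff it lies inside a pure-1 quadrant of
    both operands (qid equality or prefix containment)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        a, b = p1[i], p2[j]
        if a == b:
            result.append(a); i += 1; j += 1
        elif b.startswith(a):        # a covers b: emit b, keep a for more matches
            result.append(b); j += 1
        elif a.startswith(b):        # b covers a: emit a, keep b for more matches
            result.append(a); i += 1
        elif a < b:
            i += 1
        else:
            j += 1
    return result

# the two operands of the worked example in Section 3.3:
p1 = ["0", "100", "101", "102", "12", "132", "20", "21",
      "220", "221", "223", "23", "3"]
p2 = ["0", "20", "21", "22", "231"]
print(ptree_and(p1, p2))
# ['0', '20', '21', '220', '221', '223', '231']
```

Note that lexicographic comparison of qid strings matches the depth-first order, so a single merge-style scan suffices.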
A scientist might be interested in color relationships (both antecedent and consequent from color bands), but the producer is interested only in relationships between color antecedents and consequents from, for example, a yield band (i.e., when do observed color combinations predict high yield or foretell low yield?). Therefore, for precision agriculture applications, it makes sense to restrict attention to those rules whose consequent comes from the yield band. We will refer to restrictions on the types of itemsets allowed as antecedent and consequent as being "of interest", as distinct from the notion of rules that are "interesting". Of-interest rules can be interesting or not interesting, depending on such measures as support and confidence.

In some cases, it is better to allow users to partition the value space into uneven partitions. User knowledge can be applied in the partitioning. E.g., band Bi can be partitioned into {[0,32), [32,64), [64,96), [96,256)} if it is known that there will be only a few values between 96 and 255. Applying the user's domain knowledge increases accuracy and data mining efficiency. This type of partitioning will be referred to as user-defined partitioning. Equi-depth partitioning (each partition has approximately the same number of pixels) can be done by setting the endpoints so that there are approximately the same number of values in each partition.

Whether the partitioning is equi-length, equi-depth or user-defined, it can be characterized as follows. For each band, choose partition points v0 = 0, v1, …, vn+1 = 256; then the partitions are { [vi, vi+1) : i = 0..n } and are identified with the values { vi : i = 0..n }. The items to be used in the data mining algorithms are then pairs (bi, vj).

4.2 Deriving Rules Using Ptrees

For RSI data, we can formulate the association rule mining model as follows. Let I be the set of all items and T be the set of all transactions.
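The two standard partitioning schemes above can be sketched in a few lines of Python (our illustration; for equi-length partitioning the partition code is obtained simply by keeping the high-order bits):

```python
def equi_length_code(value, bits=2):
    """Equi-length partition code of an 8-bit value: keep the top `bits` bits."""
    return value >> (8 - bits)

def equi_depth_points(values, parts):
    """Approximate equi-depth partition points: endpoints chosen so each
    partition holds roughly the same number of pixels."""
    s = sorted(values)
    return [s[len(s) * k // parts] for k in range(1, parts)]

# 2-bit codes: [0,64) -> 0, [64,128) -> 1, [128,192) -> 2, [192,256) -> 3
print([equi_length_code(v) for v in (37, 127, 200, 254)])   # [0, 1, 3, 3]
```

User-defined partitioning would replace the computed endpoints with a hand-chosen list such as [32, 64, 96] and a lookup against it.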
I = { (b,v) | b = band, v = value (1-bit, 2-bit, …, or 8-bit) }, T = { pixels }. Admissible Itemsets (Asets) are itemsets of the form Int1 x Int2 x … x Intn = Πi=1..n Inti, where Inti is an interval of values in Bandi (some of which may be the full value range). Modeled on the Apriori algorithm [2], we first find all itemsets which are frequent and of interest. (For example, if B1 = Yield, the user may wish to restrict attention to those Asets for which Int1 is not all of B1, so either high yield or low yield. For 1-bit data values, this means either Yield < 128 or Yield ≥ 128; other threshold values can be selected using the user-defined partitioning concept described above. The user may then want to restrict attention to those rules for which the consequent is Int1.)

For a frequent Aset, B = Πi=1..n Inti, rules are created by partitioning {1..n} into two disjoint sets, Â = {i1,…,im} and Ĉ = {j1,…,jq}, q+m = n, and then forming the rule A ⇒ C, where A = Πk∈Â Intk and C = Πk∈Ĉ Intk. As noted above, users may be interested only in rules where q = 1 and the consequent comes from a specified band (e.g., B1 = Yield). Then there is just one rule of interest for each frequent set found, and it need only be checked for high confidence.

For the restricted-interest case described above, in which q = 1 and C = Int1 (e.g., the Yield band), support(A ⇒ C) = support{ p | p is a pixel such that p(i) is in Inti for every i = 1..n }. The confidence of a rule A ⇒ C is its support divided by the support of A. In the restricted-interest case, with B1 = Yield, B2 = blue, B3 = green, B4 = red, we need to calculate support(A ⇒ C) = support(Int1 x Int2 x Int3 x Int4) and support(A) = support(Int2 x Int3 x Int4). If support(B) ≥ minsup (the specified minimum support threshold) and support(B)/support(A) ≥ minconf (the specified minimum confidence threshold), then A ⇒ C is a strong rule.
A k-band Aset (kAset) is an Aset in which k of the Inti intervals are non-full (i.e., in k of the bands the intervals are not the fully unrestricted interval of all values). We start by finding all frequent 1Asets. Then the candidate 2Asets are those whose every 1Aset subset is frequent, and so on; the candidate kAsets are those whose every (k-1)Aset subset is frequent.

Next we look for a pruning technique based on the value concept hierarchy. Once we find all 1-bit frequent kAsets, we can use the fact that a 2-bit kAset cannot be frequent if its enclosing 1-bit kAset is infrequent. A 1-bit Aset encloses a 2-bit Aset if, when the endpoints of the 2-bit Aset are shifted right one bit position, it is a subset of the 1-bit Aset (e.g., [1,1] encloses [10,11], [10,10] and [11,11]).

The algorithm for mining association rules on RSI data using Ptrees is given in Figure 9.

    Procedure P-mining
    {
      (1) Data discretization;
      (2) F1 = { frequent 1Asets };
      (3) For (k = 2; Fk-1 ≠ ∅; k++) do begin
      (4)   Ck = p-gen(Fk-1);
      (5)   Forall candidate Asets c ∈ Ck do
      (6)     c.count = AND_rootcount(c);
      (7)   Fk = { c ∈ Ck | c.count >= minsup };
      (8) end
      (9) Answer = ∪k Fk
    }

    Figure 9. P-mining algorithm

The P-mining algorithm assumes a fixed value precision, for example 3-bit precision in all bands. The p-gen function differs from the apriori-gen function ([1]) in the way pruning is done. We use band-based pruning: since any itemset consisting of two or more intervals from the same band will have zero support (no value can be in both intervals simultaneously), the kind of joining done in [1] is not necessary. The AND_rootcount function is used to calculate Aset counts directly by ANDing the appropriate basic Ptrees instead of scanning the transaction databases. For example, for the Aset {B1[0,64), B2[64,128)}, the count is the root count of P1,00 AND P2,01. To obtain the rules at other precision levels, we can apply the P-mining algorithm again.
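The band-based pruning in p-gen can be sketched as follows (our reconstruction from the description above, not the authors' code; Asets are represented as frozensets of (band, interval) items):

```python
def p_gen(f_prev, f1):
    """Generate candidate kAsets from the frequent (k-1)Asets: never join two
    intervals from the same band (band-based pruning), and require every
    (k-1)-subset of a candidate to be frequent."""
    cands = set()
    for aset in f_prev:
        bands = {b for b, _ in aset}
        for one in f1:
            (b, interval), = one          # each 1Aset holds a single item
            if b in bands:
                continue                  # same band: support would be zero
            c = aset | one
            if all(c - {item} in f_prev for item in c):
                cands.add(c)
    return cands

# 1-bit frequent 1Asets from the example of Section 4.3:
# [0,0] in band 1 and [1,1] in band 4
f1 = {frozenset({(1, (0, 0))}), frozenset({(4, (1, 1))})}
c2 = p_gen(f1, f1)
print(c2)   # the single candidate pairing band 1 with band 4
```

Because intervals from the same band are never combined, the expensive self-join of apriori-gen is avoided entirely.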
There is a special bit-based pruning technique which can be applied in this case. This bit-based pruning is essentially a matter of noting, e.g., that if the Aset [1,1]2 (the interval [1,1] in band 2) is not frequent, then the Asets [10,10]2 and [11,11]2, which are covered by [1,1]2, cannot possibly be frequent either.

4.3 An Example

The following data in relational format will be used to illustrate the method. The data contains four bands with 4-bit precision in the data values.

Decimal values:

    X,Y | Yield | Blue | Green | Red
    0,0 |   3   |   7  |   8   |  11
    0,1 |   3   |   3  |   8   |  15
    0,2 |   7   |   3  |   4   |  11
    0,3 |   7   |   2  |   5   |  11
    1,0 |   3   |   7  |   8   |  11
    1,1 |   3   |   3  |   8   |  11
    1,2 |   7   |   3  |   4   |  11
    1,3 |   7   |   2  |   5   |  11
    2,0 |   2   |  11  |   8   |  15
    2,1 |   2   |  11  |   8   |  15
    2,2 |  10   |  10  |   4   |  11
    2,3 |  15   |  10  |   4   |  11
    3,0 |   2   |  11  |   8   |  15
    3,1 |  10   |  11  |   8   |  15
    3,2 |  15   |  10  |   4   |  11
    3,3 |  15   |  10  |   4   |  11

Binary values:

    X,Y | Yield | Blue | Green | Red
    0,0 | 0011  | 0111 | 1000  | 1011
    0,1 | 0011  | 0011 | 1000  | 1111
    0,2 | 0111  | 0011 | 0100  | 1011
    0,3 | 0111  | 0010 | 0101  | 1011
    1,0 | 0011  | 0111 | 1000  | 1011
    1,1 | 0011  | 0011 | 1000  | 1011
    1,2 | 0111  | 0011 | 0100  | 1011
    1,3 | 0111  | 0010 | 0101  | 1011
    2,0 | 0010  | 1011 | 1000  | 1111
    2,1 | 0010  | 1011 | 1000  | 1111
    2,2 | 1010  | 1010 | 0100  | 1011
    2,3 | 1111  | 1010 | 0100  | 1011
    3,0 | 0010  | 1011 | 1000  | 1111
    3,1 | 1010  | 1011 | 1000  | 1111
    3,2 | 1111  | 1010 | 0100  | 1011
    3,3 | 1111  | 1010 | 0100  | 1011

The data is first converted to bSQ format. We display the bSQ bit-band values in their spatial positions, rather than in columnar files. The Band 1 bit-bands are:

    B11     B12     B13     B14
    0000    0011    1111    1111
    0000    0011    1111    1111
    0011    0001    1111    0001
    0111    0011    1111    0011

The Band 1 basic Ptrees are as follows (tree pointers are omitted; each Ptree is written as root count | level-1 counts | mixed-quadrant leaves). The Band 2, 3 and 4 Ptrees are similar.

    P1,1:  5 | 0 0 1 4 | 0001
    P1,2:  7 | 0 4 0 3 | 0111
    P1,3: 16
    P1,4: 11 | 4 4 0 3 | 0111

The value Ptrees are created as needed.
The creation process for P1,0011 is shown as an example:

    P1,0011 = P1,1' AND P1,2' AND P1,3 AND P1,4

    P1,1': 11 | 4 4 3 0 | 1110
    P1,2':  9 | 4 0 4 1 | 1000
    P1,3 : 16
    P1,4 : 11 | 4 4 0 3 | 0111

The pure-1 paths of P1,1' are 0, 1, 20, 21, 22; the pure-1 paths of P1,2' are 0, 2, 31; P1,3 is entirely pure-1; and the pure-1 paths of P1,4 are 0, 1, 31, 32, 33. Since 0 is the only pure-1 path appearing in all operands,

    P1,0011: 4 | 4 0 0 0

The other value Ptrees are calculated in this same way. The nonzero ones are:

    P1,0010:  3 | 0 0 3 0 | 1110         P2,0010: 2 | 0 2 0 0 | 0101
    P1,0011:  4 | 4 0 0 0                P2,0011: 4 | 2 2 0 0 | 0101 1010
    P1,0111:  4 | 0 4 0 0                P2,0111: 2 | 2 0 0 0 | 1010
    P1,1010:  2 | 0 0 1 1 | 0001 1000    P2,1010: 4 | 0 0 0 4
    P1,1111:  3 | 0 0 0 3 | 0111         P2,1011: 4 | 0 0 4 0
    P3,0100:  6 | 0 2 0 4 | 1010         P4,1011: 11 | 3 4 0 4 | 1011
    P3,0101:  2 | 0 2 0 0 | 0101         P4,1111:  5 | 1 0 4 0 | 0100
    P3,1000:  8 | 4 0 4 0

All other value Ptrees have root count 0.

Assume the minimum support is 60% (requiring a count of 10) and the minimum confidence is 60%. First, we find all 1Asets for 1-bit values from B1. There are two possibilities for Int1, [1,1] and [0,0]. From P1,1 (5 | 0 0 1 4 | 0001), support([1,1]1) = 5 (infrequent) and support([0,0]1) = 16 - 5 = 11 (frequent). Similarly, there are two possibilities for Int2, with support([1,1]2) = 8 (infrequent) and support([0,0]2) = 8 (infrequent); two possibilities for Int3, with support([1,1]3) = 8 (infrequent) and support([0,0]3) = 8 (infrequent); and two possibilities for Int4, with support([1,1]4) = 16 (frequent) and support([0,0]4) = 0 (infrequent).
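These counts can be double-checked by brute force against the bSQ bit-bands of the example (a verification sketch of ours, scanning pixel by pixel; the Ptree method of Section 3.3 obtains the same numbers from root counts without any scan):

```python
def root_count(planes, value_bits):
    """Count pixels whose band value matches value_bits, i.e. the root count
    of the corresponding value Ptree, computed pixel by pixel."""
    n = len(planes[0])
    return sum(all(planes[i][r][c] == int(b) for i, b in enumerate(value_bits))
               for r in range(n) for c in range(n))

# Band-1 bSQ bit-bands B11..B14 of the 4x4 example:
B11 = [[0,0,0,0], [0,0,0,0], [0,0,1,1], [0,1,1,1]]
B12 = [[0,0,1,1], [0,0,1,1], [0,0,0,1], [0,0,1,1]]
B13 = [[1,1,1,1], [1,1,1,1], [1,1,1,1], [1,1,1,1]]
B14 = [[1,1,1,1], [1,1,1,1], [0,0,0,1], [0,0,1,1]]
planes = [B11, B12, B13, B14]
print(root_count(planes, "0011"))   # 4, the root count of P1,0011
print(root_count(planes, "0111"))   # 4, the root count of P1,0111
```

The point of the Ptree approach is precisely that these counts come out of a few node comparisons instead of the full scan performed here.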
The set of 1-bit frequent 1Asets, 1L1, is { [0,0]1, [1,1]4 }. The set of 1-bit candidate 2Asets, 1C2, is { [0,0]1 x [1,1]4 } (support = root count of P1,0 AND P4,1 = 11), and therefore 1L2 = { [0,0]1 x [1,1]4 }. The set of 1-bit candidate 3Asets, 1C3, is empty.

We consider only those frequent sets which involve Yield (B1) as candidates for forming rules, and we use B1 as the consequent of those rules (assuming this is the user's choice). The rule which can be formed with B1 as the consequent is [1,1]4 ⇒ [0,0]1 (rule support = 11). The support of the antecedent is support([1,1]4) = 16, giving confidence([1,1]4 ⇒ [0,0]1) = 11/16. Thus, this is a strong rule.

The frequent 1-bit 1Asets were [0,0]1 and [1,1]4; the other 1-bit 1Asets are infrequent, which means all their enclosed 2-bit subintervals are infrequent. The interval [00,01]1 is identical to [0,0]1 in terms of the full values included, and [00,10]1 is a superset of [0,0]1, so both are frequent. Others in band 1 to consider are: [00,00], [01,01], [01,10] and [01,11]. [00,00] is infrequent (using P1,00, count = 7); [01,01] is infrequent (using P1,01, count = 4). For [01,10] we use P1,01 OR P1,10; if it is frequent, then [01,11] is frequent, otherwise for [01,11] we use P1,01 OR P1,10 OR P1,11. The OR operation is very similar to the AND operation except that the roles of 0 and 1 interchange: a result quadrant is pure-0 if both operand quadrants are pure-0, and if either operand quadrant is pure-1, the result is pure-1. The root count of P1,01 OR P1,10 is 6, and therefore [01,10] is infrequent. The root count of P1,01 OR P1,10 OR P1,11 is 9, and therefore [01,11] is infrequent. The only new frequent 2-bit band-1 1Aset is [00,10]1, which does not form the support set of a rule. Thus, the algorithm terminates.

5.
Experimental Results and Performance Analysis

In this section, we compare our work with the classical frequent itemset generation algorithm, Apriori [1], and a recently proposed efficient algorithm, FP-growth [12], which does not have a candidate generation step. The experiments were performed on a 900-MHz PC with 256 megabytes of main memory, running Windows 2000. For fairness, we set our algorithm to find all the frequent itemsets, not just those of interest (e.g., those containing Yield). The images we used as data are actual aerial TIFF images with a synchronized yield band and can be found at [17]. In our performance study each dataset has 4 bands {Blue, Green, Red, Yield}. We use different image sizes up to 1320x1320 pixels. For the 1320x1320 images, the total number of pixels is 1,742,400.

We store only the basic Ptrees for each dataset. All other Ptrees (value Ptrees and tuple Ptrees) are created in real time, as needed, by the ARM algorithm, saving considerable space. The ANDing operation which produces the value and tuple Ptrees from basic Ptree operands is very fast. Table 1 gives the storage needs of our approach (the sizes of the 8 basic Ptrees per band for a 1320x1320 TIFF-Yield synchronized image). The table (a typical example of a TIFF image) shows that the basic Ptrees for the high-order bits are small.

    Basic Ptree sizes in bytes:

    Bit position | Band 1 (Red) | Band 2 (Green) | Band 3 (Blue)
         1       |    154032    |      27884     |     91722
         2       |    234803    |     128747     |    200845
         3       |    254477    |     250175     |    247900
         4       |    254751    |     254555     |    254715
         5       |    254795    |     254785     |    254799
         6       |    254797    |     254819     |    254823
         7       |    254807    |     254809     |    254821
         8       |    254807    |     254811     |    254841

    Table 1. Basic Ptree Sizes

5.1 Comparison of P-mining with Apriori

We implemented the Apriori algorithm [2] for the TIFF-Yield datasets using equi-length partitioning. P-mining is more scalable than Apriori in two ways. First, P-mining is more scalable for lower support thresholds.
The reason is that for low support thresholds the number of candidate itemsets becomes extremely large, so the performance of candidate frequent-itemset generation degrades markedly. Figure 10 gives the results of the comparison of the P-mining algorithm (P-tree run time) and Apriori for different support thresholds.
Figure 10. Run time versus support threshold
The second conclusion we can draw is that the P-mining algorithm is more scalable to large spatial datasets. The reason is that in the Apriori algorithm we need to scan the entire database each time a support is calculated, which is very costly for large databases. In P-mining, however, since we calculate the count directly from the root count of a basic Ptree AND operation, doubling the dataset size adds only one more level to each basic Ptree. The additional cost is relatively small compared to the Apriori algorithm, as shown in Figure 11 below.
Figure 11. Scalability with number of transactions
5.2 Comparison of the P-mining algorithm and the FP-growth algorithm
FP-growth is a very efficient algorithm for association rule mining, which uses a data structure called the frequent pattern tree (FP-tree) to store compressed information about frequent patterns. For a dataset of 100K bytes, FP-growth runs very fast. But when we run the FP-growth algorithm on the TIFF image of size 1320 x 1320 pixels, the performance falls off markedly. For large datasets and low support thresholds, FP-growth takes much longer to run than P-mining. Figure 12 shows the experimental result of running the P-mining and FP-growth algorithms on a 1320 x 1320 pixel TIFF-Yield dataset (in which the total number of transactions is ~1,700,000).
In these experiments we used 2-bit precision and equi-length partitioning.
Figure 12. Scalability with support threshold
The results show that both P-mining and FP-growth run faster than Apriori. For large image datasets, the P-mining algorithm runs faster than the FP-growth algorithm when the support threshold is low. Our tests suggest that the FP-growth algorithm runs quite fast for datasets with fewer than 500,000 transactions (|D| < 500K); for larger datasets, the P-mining algorithm gives much better performance. This result is presented in Figure 13 (where the support threshold was set at 10%).
Figure 13. Scalability with the number of transactions
Figure 14 shows how the number of precision bits used affects the performance of the P-mining algorithm: the more precision bits used, the greater the number of items.
Figure 14. Performance of P-mining with respect to the number of bits used for partitioning
6. Related work
Remotely Sensed Imagery data belongs to the category of spatial data. There has been some work on spatial data mining [13,14,15], including association rule mining on spatial data [16]. A spatial association rule is a rule indicating a certain association relationship among a set of spatial and possibly some nonspatial predicates [16], for example, a rule such as "most big cities in Canada are close to the Canada-U.S. border". In these works, spatial data mining is performed from the perspective of spatial locality, i.e., the mined patterns involve objects that are close in space.
In our work deriving rules among spectral bands and yield, the patterns need not occur on nearby pixels; they can exist in any part of the image, and the rules generated in this way are very useful to the farmer.
As to our Ptree data structure, there is some related work, including the quadtree [6,7,9] and its variants (such as the point quadtree [9] and the region quadtree [6]), and the HHcode [10]. Quadtrees decompose the universe by means of iso-oriented hyperplanes. These partitions do not have to be of equal size, although that is often the case. For example, in a two-dimensional quadtree, each interior node has four descendants, each corresponding to a rectangle, referred to as the NW/NE/SW/SE quadrants. The decomposition into subspaces is usually continued until the number of objects in each partition is below a given threshold.
HHcodes, or Helical Hyperspatial Codes, are binary representations of the Riemannian diagonal. The binary division of the diagonal forms the node point from which eight sub-cubes are formed. Each sub-cube has its own diagonal, generating new sub-cubes. These cubes are formed by interlacing one-dimensional values encoded as HH bit codes. When sorted, they cluster in groups along the diagonal, and this clustering gives fast (binary) searches along the diagonal. The clusters are ordered in a helical pattern, which is why they are called "Helical Hyperspatial". They are sorted in a Z-ordered fashion.
The similarity among the Ptree, the quadtree and the HHcode is that all are quadrant based; the difference is that the Ptree is focused on the count. The Ptree is not only beneficial for storing the data, it is particularly useful for association rule mining because it directly provides much of the information that the mining process needs.
7. Conclusion and future work
In this paper, we propose a new model to derive association rules on Remotely Sensed Imagery data.
In our model, the images are organized in bit-sequential (bSQ) format. For data mining of image data, we introduce a new data structure for bSQ files called the Peano Count tree, or Ptree. Using the Peano ordering, each bSQ bit array is organized into a tree structure that efficiently captures all the information of the bit array plus the value histograms of each and every quadrant in the space. These Peano Count trees are space-efficient, lossless, data-mining-ready structures for the association rule mining of bSQ spatial datasets.
Algorithm pruning is almost always required in order to make data mining feasible. Ptrees facilitate new pruning techniques for association rule mining based on a high-order-bit-first approach and a single-attribute-first approach. These data structures also provide early-exit advantages for fast identification of high-confidence, low-support association rules. Ptrees have the potential to revolutionize data mining in applications ranging from flood prediction and monitoring, community and regional planning, precision agriculture, virtual archeology, mineral exploration, gene mapping, and VLSI design to environmental analysis and control. In this paper we have described the new structures and shown that they can be very effective in facilitating association rule mining on large Remotely Sensed Imagery data. Our future work includes applying Ptrees to sequential pattern mining on RSI data.
References
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases", ACM SIGMOD 93, Washington, DC, May 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. of Int'l Conf. on VLDB, Santiago, Chile, September 1994.
[3] R. Srikant and R. Agrawal, "Mining Quantitative Association Rules in Large Relational Tables", ACM SIGMOD 96, Montreal, Canada.
[4] Jong Soo Park, Ming-Syan Chen and Philip S.
Yu, "An Effective Hash-Based Algorithm for Mining Association Rules", ACM SIGMOD 95, CA, 1995.
[5] Volker Gaede and Oliver Gunther, "Multidimensional Access Methods", ACM Computing Surveys, 30(2), 1998.
[6] H. Samet, "The Quadtree and Related Hierarchical Data Structures", ACM Computing Surveys, 16(2), 1984.
[7] H. Samet, "Applications of Spatial Data Structures", Addison-Wesley, Reading, Mass., 1990.
[8] H. Samet, "The Design and Analysis of Spatial Data Structures", Addison-Wesley, Reading, Mass., 1990.
[9] R. A. Finkel and J. L. Bentley, "Quad Trees: A Data Structure for Retrieval on Composite Keys", Acta Informatica, 4(1), 1974.
[10] http://www.statkart.no/nlhdb/iveher/hhtext.htm
[11] S. W. Golomb, "Run-Length Encodings", IEEE Trans. on Information Theory, 12(3), July 1966.
[12] J. Han, J. Pei and Y. Yin, "Mining Frequent Patterns without Candidate Generation", ACM SIGMOD 2000, Dallas, Texas, May 2000.
[13] M. Ester, H.-P. Kriegel, J. Sander, "Spatial Data Mining: A Database Approach", SSD 1997.
[14] K. Koperski, J. Adhikary, J. Han, "Spatial Data Mining: Progress and Challenges", DMKD 1996.
[15] M. Ester, A. Frommelt, H.-P. Kriegel, J. Sander, "Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support", Data Mining and Knowledge Discovery, 4(2/3).
[16] K. Koperski, J. Han, "Discovery of Spatial Association Rules in Geographic Information Databases", SSD 1995.
[17] SMILEY (Spatial Miner & Interface Language for Earth Yield), http://midas.cs.ndsu.nodak.edu/~smiley