Workshop on Data Mining and Knowledge Discovery 2001, Santa Barbara, CA
Association Rule Mining on Remotely Sensed Images
Using Peano Count Trees
William Perrizo, Qin Ding, North Dakota State University
Abstract
Association Rule Mining, originally proposed for Market Basket data, has potential applications in many
areas. Remotely Sensed Imagery (RSI) data is one of the areas in which association rule mining can be
very beneficial. In RSI data, each pixel is considered a transaction; thus there can be large numbers of
transactions. Mining implicit relations among different bands on RSI data can be very useful. In this
paper, we propose a new model to derive association rules on RSI data. In our model, a new lossless data
structure, the Peano Count Tree (Ptree), is used to represent all the needed information. Ptrees represent
RSI data bit-by-bit in a recursive quadrant-by-quadrant arrangement. We propose a fast algorithm for
logically ANDing Ptrees as a way to efficiently calculate the support of itemsets and a P-mining
algorithm for rule mining using Ptrees. In addition, we propose several pruning techniques, including
bit-based pruning and band-based pruning, to improve the efficiency of the rule mining process. Our
approach is equally applicable to the various methods for partitioning spatial data, including equi-length
partitioning, equi-depth partitioning, and customized partitioning. We implemented our model and
compared it with FP-growth and basic Apriori algorithms; the results showed that our model is very
efficient for data mining on RSI spatial data.
Keywords:
Data Mining, Association Rule Mining, Remote Sensing Imagery (RSI)
1. Introduction
Association rule mining (ARM) [1,2,3,4,12], proposed by Agrawal, Imielinski and Swami in 1993, is
one of the important advances for data mining. The initial application of association rule mining was on
market basket data. A typical result might be a rule of the form, “customers who purchase one item are
very likely to purchase another item at the same time.” There are two primary measures, support and
confidence, which are used to assess the accuracy of the rules. The goal of association rule mining is to
find all the rules with support and confidence exceeding some user specified thresholds. The first step in
the basic ARM algorithms (e.g., Apriori [1] and DHP [4]) is to use the downward closure property of
support to find all frequent itemsets (those itemsets with support above the minimal threshold). After
obtaining all frequent itemsets, high confidence rules supported by them are found in a very
straightforward way. Subsequent applications include association rule mining on spatial data, a very
promising field. Huge amounts of spatial data have been collected in various ways, but the discovery of
the knowledge contained therein has just begun.
In this paper, we consider a space to be represented by a 2-dimensional array of pixel locations.
Associated with each pixel are various attributes or bands, such as visible reflectance intensities (blue,
green and red), infrared reflectance intensities (e.g., NIR, MIR1, MIR2 and TIR) and possibly other
value bands (e.g., yield quantities, quality measures, soil attributes and radar reflectance intensities).
The pixel coordinates in raster order constitute the key attribute of the spatial dataset and the other bands
are the non-key attributes. These spatial datasets are not usually organized in the relational format;
instead, they are organized or can be easily re-organized in what is called Band Sequential or BSQ
format (each attribute or band is stored as a separate file). In this paper, the association rules of interest
are the relationships among reflectance bands and the other bands (e.g., the yield band). We propose a
new format, bSQ (bit Sequential), to organize images in a spatial-data-mining-ready format (a separate
file for each bit position of each band). We use a rich, new, lossless data structure, the Peano Count
Tree (Ptree), to record quadrant-wise aggregate information (e.g., counts) for each bSQ file. With these
quadrant counts, we can perform association rule mining very efficiently on the entire space or on a
specific subspace.
Efficient pruning techniques, which reduce algorithm complexity, are very important in association rule mining. In
our model, we apply two kinds of pruning techniques, bit-based pruning and band-based pruning.
Though we focus on quantitative discrete data, partitioning can be used to further reduce the complexity
[3]. There are several ways to partition spatial data, such as equi-length partitioning, equi-depth
partitioning and customized partitioning. Our model works equally well in conjunction with any of these
partitioning approaches.
The paper is organized as follows. Section 2 gives some background for spatial data and describes the
new bit Sequential format, bSQ. Section 3 describes the Peano Count tree data structure and its
variations. Section 4 details how to derive association rules using the Peano Count tree and two pruning
techniques. Experiment results and performance analysis are given in Section 5. Section 6 gives the
related work and the conclusions and future work are given in Section 7.
2. Formats of Spatial Data
There are vast amounts of spatial data on which one can perform data mining to obtain useful
information [5]. Spatial data are collected in different ways and are organized in different formats. BSQ,
BIL and BIP are three typical formats used for Remotely Sensed Images (RSI).
A remotely sensed image typically contains several bands or columns of reflectance intensities. For
example, TM6 (Thematic Mapper) scenes contain six bands and TM7 scenes contain seven bands (Blue,
Green, Red, NIR, MIR, TIR, MIR2). Each band contains a relative reflectance intensity value in the
range 0-255 for each pixel location in the scene.
An image can be organized into a relational table in which each tuple corresponds to a pixel and each
spectral band is an attribute. The primary key consists of the pixel location and can be expressed as x-y
coordinates or as latitude-longitude pairs.
The Band Sequential (BSQ) format is similar to the Relational format. In BSQ each band is stored as a
separate file. Each individual band uses the same raster order so that the primary key attribute values are
calculable and need not be included. Landsat satellite Thematic Mapper (TM) scenes are in BSQ format.
The Band Interleaved by Line (BIL) format stores the data in line-major order; the image scan line
constitutes the organizing base. That is, BIL organizes all the bands in one file and interleaves them by
row (the first row of all bands is followed by the second row of all bands, and so on). SPOT data, which
comes from French satellite sensors, are in the Band Interleaved by Pixel (BIP) format, based on a pixel-consecutive scheme where the banded data is stored in pixel-major order. That is, BIP organizes all
bands in one file and interleaves them by pixel. Standard TIFF images are in BIP format. Figure 2 gives
an example of the BSQ, BIL and BIP formats.
In this paper, we propose a new format, called bit Sequential (bSQ), to organize spatial data. A
reflectance value in a band is a number in the range 0-255 and is represented as an 8-bit byte. We split
each band into eight separate files, one for each bit position. Figure 2 also gives an example of the bSQ
format.
There are several reasons why we use the bSQ format. First, different bits have different degrees of
contribution to the value. In some applications, the high-order bits alone provide the necessary
information. Second, the bSQ format facilitates the representation of a precision hierarchy. Third, and
most importantly, the bSQ format facilitates the creation of an efficient, rich data structure, the Ptree,
and accommodates algorithm pruning based on a one-bit-at-a-time approach.
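As a sketch of the bSQ split (Python, using the four Band-1 values from the example below; this is an illustration, not the paper's implementation):

```python
def to_bsq(band):
    """Split a band of 8-bit values (row-major list) into 8 bit-sequences.

    Returns bits[i] = the (i+1)-th most significant bit of each pixel,
    i.e. one bSQ 'file' per bit position.
    """
    return [[(v >> (7 - i)) & 1 for v in band] for i in range(8)]

# The four Band-1 values of Figure 1, in raster order:
band1 = [254, 127, 14, 193]
bits = to_bsq(band1)
# bits[0] is file B11 (the most significant bit of each pixel)
```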
We give a very simple illustrative example with only 2 data bands in a scene having only 4 pixels (2
rows and 2 columns). Both decimal and binary reflectance values are shown in Figure 1.
BAND-1                                 BAND-2
254 (1111 1110)   127 (0111 1111)       37 (0010 0101)   240 (1111 0000)
 14 (0000 1110)   193 (1100 0001)      200 (1100 1000)    19 (0001 0011)

Figure 1. Two bands of a 2-row-2-column image
The BSQ, BIL, BIP and bSQ formats are given below.
BSQ format (2 files)
  Band 1: 254 127 14 193
  Band 2: 37 240 200 19

BIL format (1 file)
  254 127 37 240
  14 193 200 19

BIP format (1 file)
  254 37 127 240
  14 200 193 19

bSQ format (16 files), one file per bit position, bits in pixel raster order
  B11 1001   B12 1101   B13 1100   B14 1100   B15 1110   B16 1110   B17 1110   B18 0101
  B21 0110   B22 0110   B23 1100   B24 0101   B25 0010   B26 1000   B27 0001   B28 1001

Figure 2. BSQ, BIP, BIL and bSQ formats
3. Data Structures
3.1 Basic Ptrees
We reorganize each bit file of the bSQ format into a tree structure, called a Peano Count Tree (Ptree).
A Ptree is a quadrant-based tree. The idea is to recursively divide the entire image into quadrants and
record the count of 1-bits for each quadrant, thus forming a quadrant count tree. Ptrees are somewhat
similar in construction to other data structures in the literature (e.g., Quadtrees[6] and HHcodes [10]).
For example, given an 8-row-8-column image of single bits, its Ptree is as shown in Figure 3.
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

            55                     -- level 0
    /    /      \     \
  16     8       15     16         -- level 1
      / / \ \   / / \ \
     3 0 4 1   4 4 3 4             -- level 2
   //|\   //|\      //|\
   1110   0010      1101           -- level 3

Figure 3. 8*8 image and its Ptree
In this example, 55 is the number of 1’s in the entire image. This root level is labeled level 0. The
numbers 16, 8, 15, and 16 found at the next level (level 1) are the 1-bit counts for the four major
quadrants in raster order (upper-left, upper-right, lower-left, lower-right). Since the first and last level-1
quadrants are composed entirely of 1-bits (pure-1 quadrants), subtrees are not needed (the subtrees
would also be entirely pure-1) and these branches terminate. Similarly, quadrants composed entirely of
0-bits are called pure-0 quadrants, and also cause termination of tree branches. This pattern is continued
recursively using the Peano or Z-ordering (recursive raster ordering) of the four subquadrants at each
new level. Eventually, every branch terminates (since, at the “leaf” level, all quadrants are pure). If we
were to expand all subtrees, including those for pure quadrants, then the leaf sequence would be the
well-known Peano-ordering of the image. Thus, we use the name Peano Count Tree.
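The recursive construction just described can be sketched as follows (a minimal Python illustration over an in-memory bit matrix, not the paper's implementation; the `(count, children)` tuple representation is an assumption):

```python
def ptree(bits, size):
    """Build a Peano Count Tree from a size x size bit matrix (size a power of 2).

    Each node is (count, children); children is None for pure-0 / pure-1
    quadrants, whose branches terminate.
    """
    count = sum(sum(row) for row in bits)
    if count == 0 or count == size * size:
        return (count, None)                       # pure quadrant: stop here
    h = size // 2
    quads = [[row[c:c + h] for row in bits[r:r + h]]  # NW, NE, SW, SE order
             for r in (0, h) for c in (0, h)]
    return (count, [ptree(q, h) for q in quads])

# The 8x8 image of Figure 3:
img = [list(map(int, row)) for row in [
    "11111100", "11111000", "11111100", "11111110",
    "11111111", "11111111", "11111111", "01111111"]]
t = ptree(img, 8)   # root count 55, level-1 counts 16, 8, 15, 16
```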
We note that the fan-out of this Ptree construction need not be fixed at four. It can, in fact, be any
power of 4 (effectively skipping levels in the tree). Also, the fan-out at any one level need not coincide
with the fan-out at another level. The fan-out pattern can be chosen to produce maximum compression
for each bSQ file. In this paper we will fix the fan-out at 4, for simplicity.
For each band (assuming 8-bit data values), there are eight Ptrees as currently defined - one for each
bit position. We will call these Ptrees the basic Ptrees of the spatial dataset. We will use the notation, Pb,i
to denote the basic Ptree for band, b and bit position, i. There are always 8n basic Ptrees for a dataset
with n bands. Each basic Ptree has a natural complement, the Ptree of the bit complement bSQ file. The
complement of a basic Ptree can be constructed directly from the Ptree by simply complementing the
counts at each level (subtracting from the pure-1 count at that level), as shown in the example below
(Figure 4). Note that the complement of a Ptree provides the 0-bit counts for each quadrant.
            55
    /    /      \     \
  16     8       15     16
      / / \ \   / / \ \
     3 0 4 1   4 4 3 4
   //|\   //|\      //|\
   1110   0010      1101

        complement

             9
    /    /      \     \
   0     8        1      0
      / / \ \   / / \ \
     1 4 0 3   0 0 1 0
   //|\   //|\      //|\
   0001   1101      0010

Figure 4. Basic Ptree and its complement
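The count-complement step can be sketched as follows (Python; a node is a hypothetical `(count, children)` tuple, not the paper's actual storage format). At each level the pure-1 count of a `size x size` quadrant is `size * size`:

```python
def complement(node, size):
    """Complement a Ptree node for a size x size quadrant:
    replace each count by (pure-1 count at that level) - count."""
    count, children = node
    if children is None:
        return (size * size - count, None)   # pure-1 becomes pure-0 and vice versa
    return (size * size - count,
            [complement(c, size // 2) for c in children])

# The tree of Figure 4 (leaves at the bit level omitted for brevity):
t = (55, [(16, None),
          (8, [(3, None), (0, None), (4, None), (1, None)]),
          (15, [(4, None), (4, None), (3, None), (4, None)]),
          (16, None)])
c = complement(t, 8)   # root 64 - 55 = 9, level-1 counts 0, 8, 1, 0
```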
By performing a simple pixel-wise logical AND operation on the appropriate subset of the basic
Ptrees and their complements, we can construct the value Ptrees. The value Ptree for a reflectance value,
v, from a band, b, (denoted Pb,v) is the Ptree in which each node number is the count of pixels in that
quadrant having value, v, in band, b. In the very same way, we can construct tuple Ptrees where P(v1,…,vn)
denotes the Ptree in which each node number is the count of pixels in that quadrant having the value, vi,
in band i, for i = 1,…,n. Any tuple Ptree can also be constructed directly from basic Ptrees and their
complements by the AND operation. We will describe how this AND operation can be efficiently
performed.
The basic Ptree, Pi,j , represents all the information in the j th bit-column of the ith band in a lossless
way (in fact, with far more information than the bit-band itself contains, since Pi,j provides the 1-bit
count for every quadrant of every dimension). Finally, we note that Ptrees are a data-mining-ready,
lossless format for storing spatial data.
The process of ANDing basic Ptrees and their complements to produce value Ptrees or tuple Ptrees
can be done at any level of precision -- 1-bit precision, 2-bit precision, …, 8-bit precision, by simply
considering only the appropriate number of high-order bits from the value(s) and truncating the rest.
E.g., using the full 8-bit precision (all 8 bits of each byte), the value Ptree Pb,11010011 can be
constructed from basic Ptrees by ANDing the basic Ptree for each 1-bit and the complement Ptree for
each 0-bit, where ’ indicates the complement (which again is simply the Ptree with each count replaced
by its count complement):
Pb,11010011 = Pb,1 AND Pb,2 AND Pb,3’ AND Pb,4 AND Pb,5’ AND Pb,6’ AND Pb,7 AND Pb,8
If only 3-bit precision is used, the value Ptree Pb,110 would be constructed by:
Pb,110 = Pb,1 AND Pb,2 AND Pb,3’
The tuple Ptree, P001, 010, 111, 011, 001, 110, 011, 101 , would be constructed by:
P001,010,111,011,001,110,011,101 = P1,001 AND P2,010 AND P3,111 AND P4,011 AND P5,001 AND P6,110 AND P7,011 AND P8,101
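The operand-selection rule (a 1-bit selects the basic Ptree, a 0-bit its complement) can be illustrated with a small helper (Python; the string labels are just names for the operands, not the paper's API):

```python
def value_ptree_operands(band, value_bits):
    """List the Ptree operands whose AND gives the value Ptree P_band,value.

    value_bits is the value's bit string at the chosen precision; a 1-bit
    selects the basic Ptree, a 0-bit its complement (marked with ').
    """
    return ["P%d,%d%s" % (band, i + 1, "" if b == "1" else "'")
            for i, b in enumerate(value_bits)]

# 3-bit precision example from the text: P_b,110 = P_b,1 AND P_b,2 AND P_b,3'
```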
Basic (bit) Ptrees (i.e., P1,1, P1,2, …, P2,1, …, P8,8)
        |  AND
        v
Value Ptrees (i.e., P1,110)
        |  AND
        v
Tuple Ptrees (i.e., P001,010,111,011,001,110,011,101)

Figure 5. Basic Ptrees, Value Ptrees (for 3-bit values) and Tuple Ptrees
The AND operation can be viewed as simply the pixel-wise AND of bits from bSQ files. However,
since such files can contain hundreds of millions of bits, shortcut methods are needed. We discuss such
methods later in the paper. The process of converting data to Ptrees is also time consuming unless
special methods are used. For example, our methods can convert a TM satellite image (approximately
60 million pixels) to its basic Ptrees in just a few seconds using a high-performance PC; this is
a one-time process. We also note that we are storing the basic Ptrees in a data structure which specifies
only the pure-1 quadrants, and does so in a lossless way. Using this data structure each AND can be
completed in a few milliseconds.
3.2 Variations of the Ptree Structure
In order to optimize the AND operation, we use a variation of the Ptree data structure which uses
masks rather than counts, called the PM-tree (Pure Mask tree). In a PM-tree, we use a 3-value logic to
represent pure-1, pure-0 and mixed quadrants (1 denotes pure-1, 0 denotes pure-0 and m
denotes mixed). Thus, the PM-tree for the previous example is:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

             m
    /    /      \     \
   1     m        m      1
      / / \ \   / / \ \
     m 0 1 m   1 1 m 1
   //|\   //|\      //|\
   1110   0010      1101

Figure 6. 8*8 image and its PM-tree
The PM-tree specifies the location of the pure-1 quadrants of the operands. The pure-1 quadrants of
the AND result can be easily identified by the coincidence of pure-1 quadrants in both operands. Pure-0
quadrants of the ANDing result occur wherever a pure-0 quadrant is found on at least one of the
operands. At each level, bit vector masks can be used to further simplify the construction.
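This 3-value AND logic can be sketched directly (Python; a node is `'1'`, `'0'`, or a list of four child nodes for a mixed quadrant — a hypothetical representation, not the paper's mask encoding):

```python
def pm_and(a, b):
    """AND two PM-tree nodes in the 3-value logic ('1' pure-1, '0' pure-0,
    a 4-element list for a mixed quadrant)."""
    if a == '0' or b == '0':
        return '0'                         # pure-0 in either operand forces pure-0
    if a == '1':
        return b                           # pure-1 is the identity for AND
    if b == '1':
        return a
    out = [pm_and(x, y) for x, y in zip(a, b)]
    if all(c == '0' for c in out):
        return '0'                         # collapse an all-zero quadrant
    if all(c == '1' for c in out):
        return '1'                         # collapse an all-one quadrant
    return out
```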
3.3 The Ptree ANDing Algorithm
We begin this section with a description of the AND algorithm. This algorithm is used to calculate the
root counts of value Ptrees and tuple Ptrees. The approach is to store only the basic Ptrees and then
generate the value and tuple Ptree root counts “on-the-fly” as needed. In this algorithm we will assume
Ptrees are coded in a compact, depth-first ordering of the paths to each pure-1 quadrant. We use a
hierarchical quadrant id (qid) scheme. At each level we append a subquadrant id number (0 means upper
left, 1 means upper right, 2 means lower left, 3 means lower right):
+----+----+---------+----+
| 00 | 01 | 100 101 | 11 |
|    |    | 102 103 |    |
+----+----+---------+----+
| 02 | 03 |   12    | 13 |
+----+----+---------+----+
| 20 | 21 |   30    | 31 |
+----+----+---------+----+
| 22 | 23 |   32    | 33 |
+----+----+---------+----+
(level-1 quadrants: 0 upper left, 1 upper right, 2 lower left, 3 lower right;
quadrant 10 is shown subdivided to level 3)

Figure 7. Quadrant id (qid)
We consider the following example first.
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Ptree:      55                     -- level 0
    /    /      \     \
  16     8       15     16         -- level 1
      / / \ \   / / \ \
     3 0 4 1   4 4 3 4             -- level 2
   //|\   //|\      //|\
   1110   0010      1101           -- level 3

PM-tree:     m
    /    /      \     \
   1     m        m      1
      / / \ \   / / \ \
     m 0 1 m   1 1 m 1
   //|\   //|\      //|\
   1110   0010      1101
The sequence of pure-1 qids (left-to-right depth-first order) is 0, 100, 101, 102, 12, 132, 20, 21, 220, 221,
223, 23, 3.
Now we need a second operand for the AND. We will use the following Ptree for that second operand
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 0 1 0 0 0 0
1 1 0 0 0 0 0 0

Ptree:      29
    /    /      \     \
  16     0       13      0
              / / \ \
             4 4 4 1
                  //|\
                  0100

PM-tree:     m
    /    /      \     \
   1     0        m      0
              / / \ \
             1 1 1 m
                  //|\
                  0100
with sequence of pure-1 qids: 0, 20, 21, 22, 231. Since a quadrant will be pure 1’s in the result only if it
is pure-1’s in both operands (or all operands, in the case there are more than 2), the AND is done by the
following: scan the operands; output matching pure-1 sequence.
operand 1:  0  100  101  102  12  132  20  21  220  221  223  23  3
operand 2:  0  20  21  22  231

matches:  0 & 0                -> 0
          20 & 20              -> 20
          21 & 21              -> 21
          220, 221, 223 & 22   -> 220, 221, 223
          23 & 231             -> 231

RESULT:  0, 20, 21, 220, 221, 223, 231
Therefore the result is:
PM-tree:     m
    /    /      \     \
   1     0        m      0
              / / \ \
             1 1 m m
              //|\ //|\
              1101 0100

Ptree:      28
    /    /      \     \
  16     0       12      0
              / / \ \
             4 4 3 1
              //|\ //|\
              1101 0100

This result corresponds to the image:

1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 0 1 0 0 0 0
0 1 0 0 0 0 0 0
The pseudo code for the ANDing algorithm is given below in Figure 8.
Ptree_ANDing(P1, P2, Presult)
// pos1, pos2, pos3 record the current pure-1 quadrant path position of P1, P2, Presult
1. pos1 := 0; pos2 := 0; pos3 := 0;
2. DO WHILE (pos1 <> END of P1 AND pos2 <> END of P2)
   (a) IF P1.pos1 = P2.pos2 THEN BEGIN
         Presult.pos3 := P1.pos1; pos1 := pos1+1; pos2 := pos2+1; pos3 := pos3+1; END
   (b) ELSE IF P1.pos1 is a prefix of P2.pos2 THEN BEGIN
         Presult.pos3 := P2.pos2; pos2 := pos2+1; pos3 := pos3+1; END
   (c) ELSE IF P2.pos2 is a prefix of P1.pos1 THEN BEGIN
         Presult.pos3 := P1.pos1; pos1 := pos1+1; pos3 := pos3+1; END
   (d) ELSE IF P1.pos1 < P2.pos2 THEN pos1 := pos1+1;
   (e) ELSE pos2 := pos2+1;
   END IF
   END DO
Figure 8. Ptree ANDing algorithm
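The merge of Figure 8 can be sketched in Python over sorted lists of pure-1 qid strings (a sketch, not the paper's implementation; cases (b) and (c) of the pseudocode become prefix tests):

```python
def ptree_and(paths1, paths2):
    """AND two Ptrees given as sorted lists of pure-1 quadrant ids (qid strings).

    A result quadrant is pure-1 exactly when it lies inside a pure-1
    quadrant of both operands, so we merge the two sorted path lists.
    """
    out, i, j = [], 0, 0
    while i < len(paths1) and j < len(paths2):
        p, q = paths1[i], paths2[j]
        if p == q:
            out.append(p); i += 1; j += 1
        elif q.startswith(p):      # operand-2 quadrant inside a pure-1 quadrant of operand 1
            out.append(q); j += 1
        elif p.startswith(q):      # operand-1 quadrant inside a pure-1 quadrant of operand 2
            out.append(p); i += 1
        elif p < q:
            i += 1
        else:
            j += 1
    return out

# The worked example from the text:
op1 = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
op2 = "0 20 21 22 231".split()
```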
4. Association Rule Mining on Spatial Data Using Ptrees
4.1 Discretization (partition)
The reflectance values are quantitative data (typically 8-bit data values). It is common to partition
the data before performing association rule mining. There are several ways to partition the data,
including equi-length partitioning, equi-depth partitioning and customized partitioning. Equi-length
partitioning is a simple but very useful method. By truncating some of the right-most bits of the values (low
order or least significant bits) we can reduce the size of the itemset dramatically without losing too
much information (the low order bits show only subtle differences). For example, we can truncate the
right-most 6 bits, resulting in the set of values {00, 01, 10, 11} (in decimal, {0, 1, 2, 3}). Each of these
values represents a partition of the original 8-bit value space (i.e., 00 represents the values in [0,64), 01
represents the values in [64,128), etc.).
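This truncation amounts to a right shift (a minimal Python sketch; `keep_bits` is a hypothetical parameter name):

```python
def equi_length_partition(value, keep_bits):
    """Map an 8-bit value to its equi-length partition id by truncating
    the (8 - keep_bits) low-order bits."""
    return value >> (8 - keep_bits)

# Keeping the 2 high-order bits gives partitions {0, 1, 2, 3}:
# partition 0 covers [0, 64), partition 1 covers [64, 128), etc.
```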
Further pruning can be done by understanding the kinds of rules that are of interest to the user and
focusing on those only. For instance, for an agricultural producer using precision techniques, there is
little interest in rules of the type, Red>48 → Green<134. A scientist might be interested in color
relationships (both antecedent and consequent from color bands), but the producer is interested only in
relationships with color antecedents and consequents from, for example, a yield band (i.e., when do
observed color combinations predict high yield or foretell low yield?).
Therefore, for precision
agriculture applications, it makes sense to restrict to those rules that have consequent from the yield
band. We will refer to restrictions in the type of itemsets allowed for antecedent and consequent based
on interest as of interest, as distinct from the notion of rules that are "interesting". Of-interest rules can
be interesting or not interesting, depending on such measures as support and confidence, etc. In some
cases, it would be better to allow users to partition the value space into uneven partitions. User
knowledge can be applied in partitioning. E.g., band Bi can be partitioned into {[0,32), [32,64), [64,96),
[96,256)}, if it is known that there will be only a few values between 96 and 255. Applying the user's
domain knowledge increases accuracy and data mining efficiency. This type of partitioning will be referred
to as user-defined partitioning. Equi-depth partitioning (each partition has approximately the same
number of pixels) can be done by setting the endpoints so that there are approximately the same number
of values in each partition.
Whether partitioning is equi-length, equi-depth or user-defined, it can be characterized as follows. For
each band, choose partition-points, v0 = 0, v1, …, vn+1 = 256; then the partitions are { [vi, vi+1) : i =
0..n } and are identified by the values { vi : i = 0..n }. The items to be used in the data mining
algorithms are then pairs, (bi, vj).
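Whichever way the partition-points are chosen, mapping a value to its partition label can be sketched as follows (Python; the `points` list encodes v0..vn+1, and the example points come from the user-defined partitioning above):

```python
import bisect

def partition_of(value, points):
    """Return the partition label v_i with value in [v_i, v_{i+1}),
    given partition-points points = [v_0 = 0, v_1, ..., v_{n+1} = 256]."""
    return points[bisect.bisect_right(points, value) - 1]

# The user-defined example from the text: {[0,32), [32,64), [64,96), [96,256)}
points = [0, 32, 64, 96, 256]
```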
4.2 Deriving Rules Using Ptrees
For RSI data, we can formulate the association rule mining model as follows. Let I be the set of all
items and T be the set of all transactions. I = { (b,v) | b=band, v=value (1-bit or 2-bit or … 8-bit) },
T = {pixels}.
Admissible Itemsets (Asets) are itemsets of the form, Int1 x Int2 x ... x Intn = Π i=1..n Inti , where Inti is
an interval of values in Bandi (some of which may be the full value range). Modeled on the Apriori
algorithm [2], we first find all itemsets which are frequent and of-interest (e.g., if B1 = Yield, the user
may wish to restrict attention to those Asets for which Int1 is not all of B1 – so either high-yield or
low-yield; for 1-bit data values, this means either Yield < 128 or Yield ≥ 128, though other threshold values
can be selected using the user-defined partitioning concept described above). Then the user may want to
restrict interest to those rules for which the rule consequent is Int1.
For a frequent Aset, B = Π i=1..n Inti, rules are created by partitioning {1..n} into two disjoint sets,
Â = {i1..im} and Ĉ = {j1..jq}, q+m = n, and then forming the rule, A→C, where A = Π k∈Â Intk and
C = Π k∈Ĉ Intk. As noted above, users may be interested only in rules where q=1 and therefore the
consequents come from a specified band (e.g., B1 = Yield). Then there is just one rule of interest for each
frequent set found and it need only be checked as to whether it is high-confidence or not.
For the restricted interest case described above, in which q=1 and C = Int1 (e.g., the Yield band),
support{A→C} = support{p | p is a pixel such that p(i) is in Inti for all i=1..n}. The confidence of a rule,
A→C, is its support divided by the support of A. In the restricted interest case, with B1=Yield, B2=blue,
B3=green, B4=red, we need to calculate support(A→C) = support(Int1 x Int2 x Int3 x Int4) and support(A)
= support(Int2 x Int3 x Int4). If support(B) ≥ minsup (the specified minimum support threshold) and
supp(B)/supp(A) ≥ minconf (the specified minimum confidence threshold), then A→C is a strong rule.
A k-band Aset ( kAset ) is an Aset in which k of the Inti intervals are non-full (i.e., in k of the bands the
intervals are not the fully unrestricted intervals of all values).
We start by finding all frequent 1Asets. Then, the candidate 2Asets are those whose every 1Aset subset
is frequent, etc. The candidate kAsets are those whose every (k-1)Aset subset is frequent. Next we look
for a pruning technique based on the value concept hierarchy. Once we find all 1-bit frequent kAsets,
we can use the fact that a 2-bit kAset cannot be frequent if its enclosing 1-bit kAset is infrequent. A 1-bit
Aset encloses a 2-bit Aset if, when the endpoints of the 2-bit Aset are shifted right 1 bit position, the result
is a subset of the 1-bit Aset (e.g., [1,1] encloses [10,11], [10,10] and [11,11]). The following algorithms
are given for mining association rules on RSI data using Ptrees in Figure 9.
Procedure P-mining
{
(1) Data Discretization;
(2) F1 = {frequent 1-Asets};
(3) For (k=2; Fk-1 ≠ ∅; k++) do begin
(4)   Ck = p-gen(Fk-1);
(5)   Forall candidate Asets c ∈ Ck do
(6)     c.count = AND_rootcount(c);
(7)   Fk = {c ∈ Ck | c.count >= minsup}
(8) end
(9) Answer = ∪k Fk
}
Figure 9. P-mining algorithm
The P-mining algorithm assumes a fixed value precision, for example, 3-bit precision in all bands. The
p-gen function differs from the apriori-gen function ([1]) in the way pruning is done. We use band-based
pruning. Since any itemsets consisting of two or more intervals from the same band will have zero
support (no value can be in both intervals simultaneously), the kind of joining done in [1] is not
necessary. The AND_rootcount function is used to calculate Aset counts directly by ANDing the
appropriate basic Ptrees instead of scanning the transaction databases. For example, for the Aset
{B1[0,64), B2[64,128)}, the count is the root count of P1,00 AND P2,01.
To obtain the rules at other precision levels, we can apply the P-mining algorithm again. There is a
special bit-based pruning technique which can be applied in this case. This bit-based pruning is
essentially a matter of noting, e.g., that if Aset [1,1]2 (the interval [1,1] in band 2) is not frequent, then the
Asets [10,10]2 and [11,11]2, which are covered by [1,1]2, cannot possibly be frequent either.
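The enclosure test behind this pruning can be sketched as follows (Python; representing intervals as `(lo, hi)` pairs of integer endpoints is an assumption for illustration):

```python
def encloses(coarse, fine):
    """Check whether interval `coarse` (at k-bit precision) encloses interval
    `fine` (at (k+1)-bit precision): shifting fine's endpoints right one bit
    must give a subinterval of coarse. If coarse is infrequent, every
    enclosed fine interval can be pruned without computing its support."""
    lo, hi = coarse
    return lo <= fine[0] >> 1 and fine[1] >> 1 <= hi

# e.g. [1,1] encloses [10,11] (binary 2..3), [10,10] and [11,11]
```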
4.3 An Example
The following data in relational format will be used to illustrate the method. The data contains four
bands with 4-bit precision in the data values.
FIELD  | CLASS |  REMOTELY SENSED
COORDS | LABEL |  REFLECTANCES
 X , Y | YIELD | Blue | Green | Red
 0,0   |   3   |   7  |   8   |  11
 0,1   |   3   |   3  |   8   |  15
 0,2   |   7   |   3  |   4   |  11
 0,3   |   7   |   2  |   5   |  11
 1,0   |   3   |   7  |   8   |  11
 1,1   |   3   |   3  |   8   |  11
 1,2   |   7   |   3  |   4   |  11
 1,3   |   7   |   2  |   5   |  11
 2,0   |   2   |  11  |   8   |  15
 2,1   |   2   |  11  |   8   |  15
 2,2   |  10   |  10  |   4   |  11
 2,3   |  15   |  10  |   4   |  11
 3,0   |   2   |  11  |   8   |  15
 3,1   |  10   |  11  |   8   |  15
 3,2   |  15   |  10  |   4   |  11
 3,3   |  15   |  10  |   4   |  11

The same data in binary:

 X , Y | YIELD | Blue | Green| Red
 0,0   | 0011  | 0111 | 1000 | 1011
 0,1   | 0011  | 0011 | 1000 | 1111
 0,2   | 0111  | 0011 | 0100 | 1011
 0,3   | 0111  | 0010 | 0101 | 1011
 1,0   | 0011  | 0111 | 1000 | 1011
 1,1   | 0011  | 0011 | 1000 | 1011
 1,2   | 0111  | 0011 | 0100 | 1011
 1,3   | 0111  | 0010 | 0101 | 1011
 2,0   | 0010  | 1011 | 1000 | 1111
 2,1   | 0010  | 1011 | 1000 | 1111
 2,2   | 1010  | 1010 | 0100 | 1011
 2,3   | 1111  | 1010 | 0100 | 1011
 3,0   | 0010  | 1011 | 1000 | 1111
 3,1   | 1010  | 1011 | 1000 | 1111
 3,2   | 1111  | 1010 | 0100 | 1011
 3,3   | 1111  | 1010 | 0100 | 1011
The data is first converted to bSQ format. We display the bSQ bit-band values in their spatial
positions, rather than in columnar files. The Band1 bit-bands are:
B11     B12     B13     B14
0000    0011    1111    1111
0000    0011    1111    1111
0011    0001    1111    0001
0111    0011    1111    0011
The Band-1 basic Ptrees are as follows (tree pointers are omitted). The Band 2, 3 and 4 Ptrees
are similar.

P1,1:  5;  children 0 0 1 4;  leaf 0001
P1,2:  7;  children 0 4 0 3;  leaf 0111
P1,3: 16   (pure-1)
P1,4: 11;  children 4 4 0 3;  leaf 0111
The value Ptrees are created as needed. The creation process for P1,0011 is shown as an
example:

P1,0011 = P1,1’ AND P1,2’ AND P1,3 AND P1,4

  P1,1’: 11;  children 4 4 3 0;  leaf 1110   (pure-1 paths: 0, 1, 20, 21, 22)
  P1,2’:  9;  children 4 0 4 1;  leaf 1000   (pure-1 paths: 0, 2, 30)
  P1,3:  16                                  (entirely pure-1)
  P1,4:  11;  children 4 4 0 3;  leaf 0111   (pure-1 paths: 0, 1, 31, 32, 33)

  P1,0011: 4;  children 4 0 0 0              (0 is the only pure-1 path in all operands)
The other value Ptrees are calculated in this same way.
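The root count of any Ptree stored as pure-1 paths can be recovered directly from the path lengths; a sketch (Python, assuming a 2^levels x 2^levels image, so a pure-1 quadrant at depth d contains 4^(levels - d) pixels):

```python
def root_count(paths, levels):
    """Root count of a Ptree stored as pure-1 qid paths, for a
    2**levels x 2**levels image."""
    return sum(4 ** (levels - len(p)) for p in paths)

# P1,0011 above (a 4x4 image, levels=2) has the single pure-1 path "0",
# giving root count 4**(2-1) = 4.
```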
Band 1:  P1,0010:  3 (0 0 3 0; leaf 1110)      P1,0011:  4 (4 0 0 0)
         P1,0111:  4 (0 4 0 0)                 P1,1010:  2 (0 0 1 1; leaves 0001, 1000)
         P1,1111:  3 (0 0 0 3; leaf 0111)      all other P1,v: 0

Band 2:  P2,0010:  2 (0 2 0 0; leaf 0101)      P2,0011:  4 (2 2 0 0; leaves 0101, 1010)
         P2,0111:  2 (2 0 0 0; leaf 1010)      P2,1010:  4 (0 0 0 4)
         P2,1011:  4 (0 0 4 0)                 all other P2,v: 0

Band 3:  P3,0100:  6 (0 2 0 4; leaf 1010)      P3,0101:  2 (0 2 0 0; leaf 0101)
         P3,1000:  8 (4 0 4 0)                 all other P3,v: 0

Band 4:  P4,1011: 11 (3 4 0 4; leaf 1011)      P4,1111:  5 (1 0 4 0; leaf 0100)
         all other P4,v: 0
Assume the minimum support is 60% (requiring a count of 10) and the minimum confidence is
60%. First, we find all 1Asets for 1-bit values from B1. There are two possibilities for Int1, [1,1]
and [0,0]. Since the root count of P1,1 is 5, support([1,1]1) = 5 (infrequent) and
support([0,0]1) = 11 (frequent).
Similarly, there are two possibilities for Int2, with support([1,1]2) = 8 (infrequent) and
support([0,0]2) = 8 (infrequent); two possibilities for Int3, with support([1,1]3) = 8 (infrequent)
and support([0,0]3) = 8 (infrequent); and two possibilities for Int4, with support([1,1]4) = 16
(frequent) and support([0,0]4) = 0 (infrequent).
The set of 1-bit frequent 1Asets, 1L1, is { [0,0]1 , [1,1]4 }
The set of 1-bit candidate 2Asets, 1C2, is { [0,0]1 x [1,1]4 } (support=root-count P1,0 & P4,1 = 11)
and therefore, 1L2 = { [0,0]1 x [1,1]4 }
The set of 1-bit candidate 3Asets, 1C3 is empty.
We consider only those frequent sets which involve Yield (B1) as candidates for forming rules
and we use B1 as the consequent of those rules (assuming this is the user’s choice). The rule
which can be formed with B1 as the consequent is [1,1]4 → [0,0]1 (rule support = 11). The
support of the antecedent is support([1,1]4) = 16, giving confidence([1,1]4 → [0,0]1) = 11/16.
Thus, this is a strong rule.
The frequent 1-bit 1Asets were [0,0]1 and [1,1]4 and the other 1-bit 1Asets are infrequent.
This means all their enclosed 2-bit subintervals are infrequent. The interval [00,01]1 is identical
to [0,0]1 in terms of the full 8-bit values that are included, and [00,10]1 is a superset of [0,0]1, so
both are frequent.
Others in band-1 to consider are: [00,00], [01,01], [01,10] and [01,11]. [00,00] is infrequent
(using P1,00, count=7), and [01,01] is infrequent (using P1,01, count=4). For [01,10] we use P1,01 OR
P1,10. If it is frequent, then [01,11] is frequent; otherwise, for [01,11] we use P1,01 OR P1,10 OR
P1,11. The OR operation is very similar to the AND operation except that the roles of 0 and 1
interchange. A result quadrant is pure-0 if both operand quadrants are pure-0. If either operand
quadrant is pure-1, the result is pure-1. The root count of P1,01 OR P1,10 is 6 and therefore
[01,10] is infrequent. The root count of P1,01 OR P1,10 OR P1,11 is 9 and therefore [01,11] is
infrequent.
The only new frequent 2-bit band1 1Aset is [00,10]1 , which does not form the support set of a
rule. Thus, the algorithm terminates.
5. Experiment Results and Performance Analysis
In this section, we compare our work with the classical frequent itemsets generation algorithm,
Apriori [1], and a recently proposed efficient algorithm, FP-growth [12], which does not have the
candidate generation step.
The experiments were performed on a 900-MHz PC with 256 MB of main
memory, running Windows 2000. We set our algorithm to find all the frequent
itemsets, not just those of interest (e.g., those containing Yield), to keep the comparison fair. The images we used
as data are actual aerial TIFF images with a synchronized yield band and can be found at [17]. In
our performance study each dataset has 4 bands {Blue, Green, Red, Yield}. We use different
image sizes up to 1320 × 1320 pixels. For the 1320 × 1320 images, the total number of pixels is
1,742,400.
We only store the basic Ptrees for each dataset. All other Ptrees (value Ptrees and tuple Ptrees)
are created in real time as needed in the ARM algorithm, saving considerable space. The
ANDing operation which produces the value and tuple Ptrees as needed from basic Ptree
operands is very fast. Table 1 shows the storage needs of our approach (the sizes of the
8 basic Ptrees per band for a 1320 × 1320 TIFF-Yield synchronized image). As this table
(a typical example for a TIFF image) shows, the basic Ptrees for the high-order bits are small.
Basic Ptree sizes in bytes:

Bit position   Band 1 (Red)   Band 2 (Green)   Band 3 (Blue)
     1            154032           27884            91722
     2            234803          128747           200845
     3            254477          250175           247900
     4            254751          254555           254715
     5            254795          254785           254799
     6            254797          254819           254823
     7            254807          254809           254821
     8            254807          254811           254841

Table 1. Basic Ptree Sizes
5.1 Comparison of the P-mining with Apriori
We implemented the Apriori algorithm [2] for the TIFF-Yield datasets using equi-length
partitioning. P-mining is more scalable than Apriori in two ways. First, P-mining is more
scalable for lower support thresholds, because at low support thresholds the number of
candidate itemsets becomes extremely large, so the performance of candidate frequent
itemset generation degrades markedly. Figure 10 gives the results of the comparison of the
P-mining algorithm (P-tree runtime) and Apriori for different support thresholds.
[Figure omitted: run time (sec., 0 to 800) versus support threshold (90% down to 10%) for P-tree runtime and Apriori.]
Figure 10. Run time versus support threshold
The second conclusion we can draw is that the P-mining algorithm is more scalable to large
spatial datasets. The reason is that the Apriori algorithm must scan the entire database each
time a support is calculated, which is very costly for large databases. In P-mining, however,
we calculate the count directly from the root count of a basic Ptree AND operation, so
when we double the dataset size, only one more level is added to each basic Ptree. The additional
cost is relatively small compared to that of the Apriori algorithm, as shown in Figure 11 below.
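The level-count arithmetic behind this can be sketched quickly (this sketch assumes the image side is padded up to a power of two, so that quadrupling the pixel count, i.e. doubling each side, adds exactly one level):

```python
import math

def ptree_levels(side):
    """Levels in a Ptree over a side x side image, with the side padded up
    to the next power of two (an assumption of this sketch)."""
    padded = 1 << (side - 1).bit_length()
    return int(math.log2(padded)) + 1  # root level plus one per halving

print(ptree_levels(1320))   # 1320 pads to 2048 -> 12 levels
print(ptree_levels(2640))   # doubling each side (4x the pixels) -> 13 levels
```

So tree depth, and hence the cost of an AND, grows only logarithmically with image size.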
[Figure omitted: run time (sec., 0 to 1200) versus number of transactions (100K to 1700K) for Apriori and P-tree.]
Figure 11. Scalability with number of transactions
5.2 Comparison of the P-mining algorithm and the FP-growth algorithm
FP-growth is a very efficient algorithm for association rule mining, which uses a data structure
called the frequent pattern tree (FP-tree) to store compressed information about frequent patterns.
For a dataset of 100K bytes, FP-growth runs very fast. But when we run the FP-growth algorithm
on the TIFF image of size 1320 × 1320 pixels, its performance falls off markedly. For large
datasets and low support thresholds, FP-growth takes much longer to run than P-mining.
Figure 12 shows the experimental results of running the P-mining and FP-growth algorithms on
a 1320 × 1320 pixel TIFF-Yield dataset (in which the total number of transactions is ~1,700,000).
In these experiments we used 2-bit precision and equi-length partitioning.
[Figure omitted: run time (sec., 0 to 800) versus support threshold (90% down to 10%) for P-mining and FP-growth.]
Figure 12. Scalability with support threshold
The results show that both P-mining and FP-growth run faster than Apriori. For large image
datasets, the P-mining algorithm runs faster than the FP-tree algorithm when the support
threshold is low.
Our tests suggest that the FP-growth algorithm runs quite fast for datasets with fewer than
500,000 transactions ( |D| < 500K ). For larger datasets, the P-mining algorithm gives
much better performance. This result is presented in Figure 13 (where the support threshold was
set at 10%).
[Figure omitted: run time (sec., 0 to 1200) versus number of transactions (100K to 1700K) for FP-growth and P-mining.]
Figure 13. Scalability with the number of transactions
Figure 14 shows how the number of precision bits used affects the performance of the
P-mining algorithm. The more precision bits used, the greater the number of items.
[Figure omitted: run time (sec., 0 to 180) versus support threshold (10% to 90%) for 1-bit, 2-bit, 3-bit, and 4-bit precision.]
Figure 14. Performance of P-mining with respect to the number of bits used for partitioning
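The growth in the number of items with precision is easy to see: under equi-length partitioning, each band contributes one item per interval, i.e. 2^b intervals for b precision bits. A quick illustrative sketch (the function name is ours, not from the paper):

```python
def num_items(bands, bits):
    """Total items under equi-length partitioning: each band contributes
    2**bits intervals, so the itemset search space grows exponentially
    in the precision."""
    return bands * (2 ** bits)

# The 4-band {Blue, Green, Red, Yield} datasets used in the experiments:
for b in (1, 2, 3, 4):
    print(b, num_items(4, b))  # 8, 16, 32, 64 items
```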
6. Related work
Remotely Sensed Imagery data belongs to the category of spatial data. There has been some work
on spatial data mining [13,14,15], including association rule mining on spatial data [16]. A spatial
association rule is a rule indicating a certain association relationship among a set of spatial and
possibly some nonspatial predicates [16]; for example, "most big cities in Canada are
close to the Canada-U.S. border". In these works, spatial data mining is performed from the
perspective of spatial locality, i.e., the mined patterns involve objects that are close in space. In our
work, which derives rules among spectral bands and yield, the patterns need not involve pixels that
are close together; they can occur in any part of the image, and the rules generated in this way are
very useful to the farmer.
As to our data structure, the Ptree, related work includes the quadtree [6,7,9] and its
variants (such as the point quadtree [9] and the region quadtree [6]), and the HHcode [10].
Quadtrees decompose the universe by means of iso-oriented hyperplanes. These partitions do not
have to be of equal size, although that is often the case. For example, in a two-dimensional
quadtree, each interior node has four descendants, each corresponding to a rectangle, referred to as
the NW/NE/SW/SE quadrants. The decomposition into subspaces is usually continued until the
number of objects in each partition is below a given threshold. The quadtree has many variants, such
as the point quadtree and the region quadtree [6].
HHcodes, or Helical Hyperspatial Codes, are binary representations of the Riemannian
diagonal. The binary division of the diagonal forms the node point from which eight sub-cubes
are formed. Each sub-cube has its own diagonal, generating new sub-cubes. These cubes are
formed by interlacing one-dimensional values encoded as HH bit codes. When sorted, they
cluster in groups along the diagonal, which permits fast (binary) searches along it. The
clusters are ordered in a helical pattern, which is why they are called "Helical
Hyperspatial"; they sort in a Z-ordered fashion.
Ptrees, quadtrees, and HHCodes are similar in that all are quadrant based; the difference is
that the Ptree focuses on counts. The Ptree is not only an efficient storage structure; it is
particularly useful for association rule mining because it directly provides much of the count
information that the mining process needs.
7. Conclusion and future work
In this paper, we propose a new model to derive association rules on Remotely Sensed Imagery
data. In our model, the images are organized in bit-Sequential or bSQ format. For data mining of
image data, we introduce a new data structure for bSQ files called Peano Count trees or Ptrees.
Using the Peano ordering, each bSQ bit array is organized into a tree structure that efficiently
captures all the information of the bit array plus the value-histograms of each and every quadrant
in the space. These Peano Count trees are space efficient, lossless, data mining ready structures
for the association rule mining of bSQ spatial datasets.
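As a concrete illustration of the bSQ idea, the following sketch (illustrative names and representation, not the authors' code) splits an 8-bit band into bit planes and then summarizes one plane's 1-bit counts quadrant by quadrant, which is the histogram information a Peano Count tree captures:

```python
def bit_plane(band, bit):
    """Extract bit plane `bit` (7 = most significant) from a 2-D list of
    8-bit values, giving one bSQ file's worth of bits."""
    return [[(v >> bit) & 1 for v in row] for row in band]

def peano_counts(plane, r, c, side):
    """Return nested quadrant 1-bit counts of a side x side region in
    Peano (NW, NE, SW, SE) order; pure quadrants collapse to a bare count."""
    total = sum(plane[r + i][c + j] for i in range(side) for j in range(side))
    if side == 1 or total == 0 or total == side * side:
        return total  # pure quadrant: the count alone determines the subtree
    half = side // 2
    return (total, [peano_counts(plane, r + dr, c + dc, half)
                    for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))])

band = [[255, 255, 0, 0],
        [255, 255, 0, 0],
        [128, 0, 0, 0],
        [0, 0, 0, 0]]
plane7 = bit_plane(band, 7)            # most significant bit plane
print(peano_counts(plane7, 0, 0, 4))   # (5, [4, 0, (1, [1, 0, 0, 0]), 0])
```

Note how the pure NW quadrant collapses to a single count; this collapsing of pure quadrants is what makes the structure compact for the high-order bit planes in Table 1.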
Algorithm pruning is almost always required in order to make data mining feasible. Ptrees
facilitate new pruning techniques for association rule mining based on a high-order bit first
approach and a single attribute first approach. Also, these data structures provide early algorithm
exit advantages for fast high-confidence, low-support association rule identification.
Ptrees have the potential to revolutionize data mining in applications ranging from flood
prediction and monitoring, community and regional planning, precision agriculture, virtual
archeology, mineral exploration, gene mapping, VLSI design and environmental analysis and
control.
In this paper we have described the new structures and shown that they can be very effective in
facilitating association rule mining on large Remotely Sensed Imagery data. Our future work
includes applying Ptrees for Sequential Pattern mining on RSI data.
References
[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items
in Large Databases”, ACM-SIGMOD 93, Washington, DC, May 1993.
[2] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. of Int’l
Conf. on VLDB, Santiago, Chile, September 1994.
[3] R. Srikant and R. Agrawal, "Mining Quantitative Association Rules in Large Relational
Tables", ACM SIGMOD 96, Montreal Canada.
[4] Jong Soo Park, Ming-Syan Chen and Philip S. Yu, “An effective Hash-Based Algorithm for
Mining Association Rules,” ACM SIGMOD 95, CA, 1995.
[5] Volker Gaede and Oliver Gunther, "Multidimensional Access Methods", Computing Surveys,
30(2), 1998.
[6] H. Samet, “The quadtree and related hierarchical data structure”. ACM Computing Survey,
16, 2, 1984.
[7] H. Samet, “Applications of Spatial Data Structures”, Addison-Wesley, Reading, Mass., 1990.
[8] H. Samet, “The Design and Analysis of Spatial Data Structures”, Addison-Wesley, Reading,
Mass., 1990.
[9] R. A. Finkel and J. L. Bentley, “Quad trees: A data structure for retrieval of composite keys”,
Acta Informatica, 4, 1, 1974.
[10] http://www.statkart.no/nlhdb/iveher/hhtext.htm
[11] S. W. Golomb, “Run-length encoding”, IEEE Trans. On Information Theory, 12(3), July
1966.
[12] J. Han, J. Pei and Y. Yin, “Mining Frequent Patterns without Candidate Generation”,
ACM SIGMOD 2000, Dallas, Texas, May 2000.
[13] M. Ester, H.-P. Kriegel, J. Sander, "Spatial Data Mining: A Database Approach", SSD 1997.
[14] K. Koperski, J. Adhikary, J. Han, "Spatial Data Mining: Progress and Challenges", DMKD
1996.
[15]M. Ester, A. Frommelt, H.-P. Kriegel, J. Sander, "Spatial Data Mining: Database Primitives,
Algorithms and Efficient DBMS Support", Data Mining and Knowledge Discovery 4(2/3).
[16] K. Koperski, J. Han, "Discovery of Spatial Association Rules in Geographic Information
Databases", SSD 1995.
[17] SMILEY (Spatial Miner & Interface Language for Earth Yield), http://midas.cs.ndsu.nodak.edu/~smiley