CMPUT 695
FALL 2004
2004-11-28
ASSOCIATIVE CLASSIFIER
WITH REOCCURRING ITEMS
PROJECT REPORT
RAFAL RAK
WOJCIECH STACH
1 INTRODUCTION
The classification problem is one of the most common tasks in the Data Mining and
Machine Learning fields. In general, it boils down to using a training set, which consists
of labelled objects, to build a classifier that is able to predict the labels of objects for
which they are unknown.
Among the numerous classification methods [4], we can single out those that provide
a model in the form of rule sets, e.g. decision trees, rule learning, or naïve Bayes. These
approaches have several benefits that come from having rules that describe the model.
One of the most important advantages is that such a model is transparent, i.e. experts in
the application domain are able to understand it. This allows them to manipulate and add
rules in order to increase the confidence and accuracy of the classifier.
Associative classification is a relatively new method. Its main objective is to discover
strong patterns that are associated with the class labels. The data consist of transactions,
which include items. As a final model, one obtains a set of association rules. A literature
survey indicates that a few classifiers based on this idea have been introduced, i.e. CBA
[6], CMAR [5], and ARC-AC/ARC-BC [11]. The substantial limitation of all these
algorithms is that they do not handle data in which the same item appears multiple times
in a single transaction.
There is a group of mining areas, such as multimedia, medicine, or text mining, in
which considering the number of occurrences of items (e.g. similar objects in the same
medical image) might be more beneficial than considering only the presence or absence
of items. A few approaches to mining rules with reoccurring items have been proposed,
e.g. MaxOccur [12], FP'-tree [8], and WAR [9].
The main goal of this project is to devise, implement, and test an associative classifier
with reoccurring items. It exploits, combines, and extends the ideas mentioned above,
especially the ARC-AC/ARC-BC and MaxOccur algorithms.
2 PROBLEM STATEMENT
The original approach to classification using association rules, named class
association rules (CAR), was introduced in [6]. The main idea was to modify the form of
transactions known from the traditional approach to the form $\langle condset, c \rangle$, where
$condset$ is a set of items and $c$ is a class label. As a result, each item is represented by an
(attribute, value) pair. Hence, we can treat a transaction as a set of (attribute, value) pairs
and a class label, e.g. $\langle \{(A_1, 3), (A_2, 2)\}, c \rangle$, where $A_1$ and $A_2$ are attributes. Different
notations for this type of transaction can be used. In this project, we will be using the
following one: $\langle \{3A_1, 2A_2\}, c \rangle$.
All the rules generated from frequent itemsets are of the form $condset \Rightarrow c$. Once
the classifier (in this case: a set of rules) is found, it can be used to predict the class of new
objects. However, two main problems might occur. One of them is that two or more
contradictory rules might exist, i.e. rules that have the same $condset$, yet different class
labels. The other one concerns the situation in which there is no exact rule, i.e. one having
the same $condset$, for the object being classified. Different strategies can be applied to
handle these problems; see Section 4.2.
Our task is to combine associative classification with the problem of recurrent
items. More formally, our goal is to modify the original approach by changing
transactions from the form $\langle \{i_1, i_2, \ldots, i_n\}, c \rangle$, where $i_i$ is an item in a
transaction (e.g. a word in a document text) and $c$ is a class label, to the form
$\langle \{o_1 i_1, o_2 i_2, \ldots, o_n i_n\}, c \rangle$, where $o_i$ is the number of occurrences of the item $i_i$ in the
transaction. The rules generated from this set of transactions have the form $\langle condset, c \rangle$
and are used for the classification of new objects. The hypothesis that we will try to
examine is that this representation of data is more beneficial, since it maintains more
information about both objects and rules.
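For illustration, such a transaction with reoccurring items can be represented as a simple data structure. The following C++ sketch uses our own type and field names; it is not code from the implementation described in Section 5:

#include <map>
#include <string>

// A transaction <{o1 i1, ..., on in}, c>: each item identifier is
// mapped to its number of occurrences, plus the class label c.
struct Transaction {
    std::map<std::string, int> items;  // item -> occurrence count o_i
    std::string label;                 // class label c
};

// Example: the transaction <{3 A1, 2 A2}, c> from the text above
// would be written as: Transaction t{{{"A1", 3}, {"A2", 2}}, "c"};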
3 RELATED WORK
Association rules have been recognized as a useful tool for finding interesting hidden
patterns in transactional databases. Several different techniques have been introduced to
tackle this problem effectively. The most important methods are those based on either the
Apriori or the FP-growth approach. However, less research has been done on transactions
with reoccurring items. In [9] the authors assign weights to items in transactions and
introduce the WAR algorithm to mine the rules. This method is twofold: in the first step,
frequent itemsets are generated without considering weights; then weighted association
rules (WARs) are derived from each of these itemsets. The MaxOccur algorithm [12] is an
efficient Apriori-based method for discovering association rules with recurrent items. It
reduces the search space by effective use of joining and pruning techniques. The FP'-tree
approach presented in [8] extends the FP-tree idea and includes several steps. For every
distinct number of occurrences of a given item, a separate node is created. When a new
transaction is inserted into the tree, it might also increase the support count along different
path(s) of the tree; this is based on the intersection between the two itemsets. Given the
complete tree, the enumeration process used to find frequent patterns is similar to that of
the FP-tree approach.
One of the most interesting and promising applications of association rules is the
classification task. Several classifiers have been introduced so far, i.e. CBA, CMAR,
ARC-AC, and ARC-BC. However, they use rules without reoccurrence of items in a
single transaction. CBA [6], an Apriori-like algorithm, labels new objects based on the
confidence of the matched rules. CMAR [5], which uses the FP-growth algorithm,
produces strong rules by pruning more specific and less confident rules in favour of more
general and more confident ones. If more than one rule matches a given object, advanced
tests are performed to select the strongest group of rules with the same label. ARC-AC
and ARC-BC [11] are based on the Apriori algorithm. The ARC-BC approach treats each
group of transactions labelled with the same category separately, while ARC-AC
considers all categories combined. In order to manage the selection of rules, the notion of
a dominance factor is introduced, defined as the proportion of rules of the most dominant
category among the rules applicable to the document being classified. ARC-AC and
ARC-BC were originally developed for classifying text [11] and were later used to
classify objects in images [2][3].
4 PROPOSED APPROACH
Based on the work presented in Section 3 we introduce the new classification
method. Our approach, ACRI (Associative Classifier with Reoccurring Items), consists of
two modules; see Figure 1. We decided to base our algorithm for mining reoccurrence
items on Apriori property. The rationale behind this selection was well knowledge about
basic Apriori algorithm, its efficiency, and our own C++ implementation of this method.
The second choice was to following ARC-BC approach. It was based on the efficiency of
this method in the case of not evenly distributed class labels. This is the core of rule
generator module, see subsection 4.1. It mines the set of rules with reoccurring items that
are the main element of the other, i.e. classifier, module. The classification process might
be considered as matching the rules to the object; see subsection 4.2.
[Figure: the rule generator takes a set of transactions and produces a set of rules; the
classifier applies these rules to objects without labels and outputs labeled objects.]
Figure 1 High-level diagram of ACRI
4.1 RULE GENERATOR
This module is designed for finding all frequent rules of the form
$\langle \{o_1 i_1, o_2 i_2, \ldots, o_n i_n\}, c \rangle$ in a given set of transactions, i.e. rules whose support is
equal to or greater than a user-defined min_support.
The general framework of this part is based on the ARC-BC approach [11]. This means
that the initial set of transactions is divided by category and rules are generated for each
category independently; see Figure 2.
[Figure: a transaction divider splits the set of transactions into subsets by class
(transactions with class C1, C2, ..., CN); a rule generator for each class Cx mines rules
with class Cx; a rule merger collects them into the final set of rules.]
Figure 2 High-level diagram of rule generator module
The main components of this module are the following:
• Transaction divider – this block scans the set of transactions once and creates N
subsets of this set, one for each category (N is equal to the number of categories);
• Rule generator for class Cx – this is an Apriori-like algorithm for mining frequent
itemsets that extends the original method by taking into account reoccurrences of
items in a single transaction. In order to deal with this, the support count has to be
redefined. The main difference is that a single transaction may increase the support
of a given itemset by more than one. We say that a transaction supports an itemset
by a given number. The formal definition of this approach, which was proposed in
[12] as part of the MaxOccur algorithm, is given below; a code sketch of this
support count follows the list.

Transaction $T = \langle \{o_1 i_1, o_2 i_2, \ldots, o_n i_n\}, c \rangle$ supports itemset
$I = \{l_1 i_1, l_2 i_2, \ldots, l_n i_n\}$ if and only if
$l_1 \le o_1 \wedge l_2 \le o_2 \wedge \ldots \wedge l_n \le o_n$. The number $t$ by which $T$
supports $I$ is calculated according to the formula:

$$ t = \min \left\{ \left\lfloor \frac{o_i}{l_i} \right\rfloor : i \in 1 \ldots n,\ l_i \neq 0,\ o_i \neq 0 \right\} $$
Example

Let us take the following transaction: $T = \langle \{2i_1, 4i_2, 6i_7\}, 1 \rangle$. The support
given by this transaction to several itemsets is shown in Table 1.

Table 1 Example of support counting

  Itemset I          Support t
  {2i1, 3i2, 5i7}    1
  {i1, 2i2, 3i7}     2
  {i2, 2i7}          3
  {i1, i2, i3}       0
Thus, the rule generator module finds all itemsets that are frequent according to
the above definition of support.
• Rule merger – this block collects the rule sets for the different classes and merges
them in order to generate the complete set of mined rules. In this part we do not
perform any pruning; thus, the merging is nothing but collecting the rules together.
4.2 CLASSIFIER
This module labels objects based on the set of mined rules obtained from the rule
generator. An associative classifier is a rule-based classification system, which means that
an object is labelled on the basis of a matched rule (or a set of rules in the case of
multi-classification). This task is simple if there is an exact match between a rule and the
object, i.e. the antecedent of the rule and the object are identical. Often, though, the model
does not include any rule that matches a given object exactly. In such a case, in order to
make the classification, all rules are ranked according to a given scenario and the best one
(or several) is matched to the given object. Rule ranking might be performed following
different strategies, which associate each rule with a number that reflects its similarity to a
given object. These strategies may be used either separately or in different combinations.
We have tested the following ones: cosine measure, distance measure, coverage,
confidence, support, and dominant class, which are characterized below.
Let us consider the rule $\langle \{o_1 i_1, o_2 i_2, \ldots, o_n i_n\}, c \rangle$ and the object to be classified
$\{l_1 i_1, l_2 i_2, \ldots, l_n i_n\}$. The corresponding n-dimensional vectors can be denoted as
$\vec{o} = [o_1, o_2, \ldots, o_n]$ and $\vec{l} = [l_1, l_2, \ldots, l_n]$. The following three measures
are based on this representation.
• Cosine measure (CM) – assigns a value equal to the angle between the two vectors,
i.e. $CM = \arccos \frac{\vec{o} \cdot \vec{l}}{\|\vec{o}\| \, \|\vec{l}\|}$. The smaller the CM
value, the smaller the angle and the closer the two vectors are in the n-dimensional
space. It is equal to zero if the vectors have the same direction, which, roughly
speaking, means that they have the same "proportions" of items.
• Distance measure (DM) – assigns a value equal to the distance between the ending
points of the two vectors according to a given distance norm, i.e.
$DM = \mathrm{distance}(\vec{o}, \vec{l})$. We have tested the L1 and L2 norms as
distance functions. In general, this measure seems to be useless on its own, since the
vectors might have various lengths. Its only rational usage might be as "fine tuning",
i.e. once the set of rules has already been pruned. The smaller the distance, the closer
the ending points of the two vectors are.
• Coverage (CV) – assigns a value equal to the ratio of the number of items common
to the object and the rule to the number of items in the rule (ignoring reoccurrences).
The larger the CV ratio, the more items the rule and the object have in common.
CV = 1 means that the rule is entirely contained in the object.
The two following ranking methods refer to properties of the rule only and do not depend
on the classified object. Thus, they have to be used together with other measures that
prune the rule set:
• Confidence
• Support
In the last examined classification scenario, called dominant class, the class label
assigned to the object is the one most frequent among the rules matching the object.
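The three vector-based measures can be sketched in C++ as follows (our own illustrative code, assuming both vectors span the same item space; the acric module may compute them differently):

#include <algorithm>
#include <cmath>
#include <vector>

// Cosine measure: the angle between the rule vector o and the object
// vector l; zero when the vectors point in the same direction.
double cosineMeasure(const std::vector<double>& o,
                     const std::vector<double>& l) {
    double dot = 0.0, no = 0.0, nl = 0.0;
    for (std::size_t i = 0; i < o.size(); ++i) {
        dot += o[i] * l[i];
        no  += o[i] * o[i];
        nl  += l[i] * l[i];
    }
    double c = dot / (std::sqrt(no) * std::sqrt(nl));
    return std::acos(std::min(1.0, std::max(-1.0, c)));  // clamp against rounding
}

// Distance measure with the L2 norm (the L1 variant would sum |o_i - l_i|).
double distanceL2(const std::vector<double>& o,
                  const std::vector<double>& l) {
    double s = 0.0;
    for (std::size_t i = 0; i < o.size(); ++i)
        s += (o[i] - l[i]) * (o[i] - l[i]);
    return std::sqrt(s);
}

// Coverage: fraction of the rule's items (ignoring occurrence counts)
// that also appear in the object; CV = 1 means the rule is entirely
// contained in the object.
double coverage(const std::vector<double>& o,
                const std::vector<double>& l) {
    int ruleItems = 0, common = 0;
    for (std::size_t i = 0; i < o.size(); ++i)
        if (o[i] > 0) { ++ruleItems; if (l[i] > 0) ++common; }
    return ruleItems > 0 ? static_cast<double>(common) / ruleItems : 0.0;
}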
5 SOFTWARE IMPLEMENTATION
5.1 GENERAL OVERVIEW
The ACRI program suite consists of three modules; see Figure 3. acri (the rule generator)
and acric (the classifier) are written in C++. The acri module, however, produces a set of
rules that needs to be completed with support and confidence values. Owing to the
project's time constraints, this operation is performed by a separate script written in Perl
(count.pl). In the future this module will be integrated into the acri application.
[Figure: acri turns the training set into rules; count.pl completes these rules with support
and confidence; acric applies them to objects without labels and outputs labeled objects.]
Figure 3 Software modules of ACRI
Refer to Appendix A for the usage of the programs.
6 EXPERIMENTS
6.1 EXPERIMENTAL SETUP
We used the Reuters-21578 text collection to perform experiments. We chose the
"ready-to-use" top 10 topics [14] from this dataset. The total of 9980 documents is split
into two sets: 7193 documents for the training set and 2787 for the test set. First, we
preprocessed the data by extracting the text from a bunch of XML documents. To
normalize the words, we stemmed them using Porter's algorithm [13]. We also pruned
stop words, i.e., words that appear very frequently and contribute nothing to the results.
The list of stop words was a combination of the list used in [11] and words from our own
observations while performing tests (e.g., an obvious stop word is the word reuters, as it
appears in every document and should be treated as noise rather than useful information).
6.2 COMPARISON TESTS
In order to compare our implementation of the algorithm for classifying documents
with recurrent items to the ARC-BC approach, we provided the application with
"switches" that allow it to produce rules regardless of the reoccurrence of words in
documents, which is a sort of emulation of the ARC-BC algorithm. Although the ACRI
program suite thus contains the ARC-BC approach, from now on the term ACRI will be
used only to denote the method with recurrent items.
We produced several different sets of rules to be used in the classifier. For ARC-BC we
chose support thresholds ranging from 10 to 30% with a step of 5%, and from 15 to 65%
with the same step for our approach. The difference between the support thresholds lies
in the definition of support for mining rules with recurrent items. As we mentioned in
Section 4.1, a single transaction (document) can support an itemset (a set of words) more
than once. Therefore, if we define support as the ratio of the support count to the total
number of transactions, as introduced in [8], we may encounter support greater than
100% for some itemsets. On the other hand, if we choose the definition presented in [12],
i.e., the ratio of the support count to the number of distinct items (words), the support will
never reach 100% as long as there is more than one distinct item in the dataset (which is
quite obvious). In practice, the latter definition requires setting very small thresholds to
obtain reasonable results. Hence, we decided to use the first one, as it is more similar to
the "classical" definition of support. It is important to note that no matter which definition
we choose, it eventually comes down to setting the same support count (i.e., the absolute
number of times the transactions support an itemset).
For each support threshold we set three different confidence thresholds: 0, 35, and 70%.
The last threshold was used in [11] as the minimum reasonable threshold for producing
rules; the first (no threshold) was introduced to observe how the classifier deals with
large numbers of rules; and the threshold of 35% is simply the middle value between the
other two.
For each experiment we tried to keep the fraction of classified objects above 98%, which
required varying CV (see Section 4.2) from 0.3 to 1. We discarded cases for which it was
not possible to set CV so as to satisfy this minimum number of classified objects. More
than 90% of the remaining results had CV = 1.
We also performed experiments without specifying CV (using different methods of
choosing applicable rules); however, they eventually turned out to produce much lower
accuracy than those with CV > 0.3.
We used different classification techniques for choosing the most applicable rule
matching an object. The best confidence and dominant class matching methods were used
for both the ARC-BC and ACRI approaches. Additionally, ACRI was provided with the
cosine measure technique, and with a combination of the cosine measure, best
confidence, and dominant class techniques. A combination of matching techniques is
realized as a kind of scenario with a different tolerance factor for each technique. An
exemplary scenario would look like this: (1) choose the top 20% of rules with the best
cosine measure, then (2) choose the 50% of the remaining rules with the highest
confidence, and then (3) choose the rule based on the dominant class technique.
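Such a cascading scenario can be sketched in C++ as follows (a hypothetical helper of ours with the example tolerance factors hard-coded; it assumes at least one matching rule):

#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct Rule { std::string label; double cosine; double confidence; };

// The exemplary scenario above: keep the top 20% of rules by cosine
// measure (smaller angle = better), then the top 50% of those by
// confidence, and finally pick the dominant class among the survivors.
std::string classifyByScenario(std::vector<Rule> rules) {
    std::sort(rules.begin(), rules.end(),
              [](const Rule& a, const Rule& b) { return a.cosine < b.cosine; });
    rules.resize(std::max<std::size_t>(1, rules.size() / 5));   // top 20%
    std::sort(rules.begin(), rules.end(),
              [](const Rule& a, const Rule& b) { return a.confidence > b.confidence; });
    rules.resize(std::max<std::size_t>(1, rules.size() / 2));   // top 50%
    std::map<std::string, int> votes;                           // dominant class
    for (const Rule& r : rules) ++votes[r.label];
    std::string best;
    int bestCount = -1;
    for (const auto& v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;
}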
6.3 RESULTS
Table 2 and Table 3 show the results in terms of the number of rules and the accuracy for
the ACRI and ARC-BC approaches, respectively. The rows correspond to different levels
of support and confidence, given in the form SxCy, meaning that the support threshold
was x% and the confidence threshold was y%. The remaining columns give the accuracy
obtained with each rule-selection strategy.
Table 2 Number of rules and accuracy [%] of ACRI

          Number of rules   Confidence   Dominant class   Cosine   Cosine & confidence
S15C0     13899             76.42        75.88            73.19    82.77
S15C35    11242             76.43        81.17            79.95    83.90
S20C0      5481             74.54        75.01            71.38    81.54
S20C35     4253             74.59        80.42            78.29    82.72
S25C0      2693             74.92        73.81            71.58    81.10
S25C35     2011             75.00        80.12            78.28    82.32
S30C0      1564             74.86        72.99            72.41    81.51
S30C35     1131             74.95        79.24            80.18    82.64
S35C0      1006             74.25        72.77            71.51    80.13
S35C35      716             74.53        78.74            79.28    81.42
S40C0       696             73.05        71.28            70.95    78.78
S40C35      488             73.33        77.96            78.99    80.52
Table 3 Number of rules and accuracy [%] of ARC-BC

          Number of rules   Confidence   Dominant class
S10C0     8231              75.74        78.00
S10C35    5764              75.78        84.09
S10C70    1402              49.10        74.82
S15C0     1817              75.29        77.77
S15C35    1164              75.33        83.67
S15C70     290              47.85        71.48
S20C0      643              74.96        76.55
S20C35     401              75.52        83.24
S25C0      316              75.14        76.30
S25C35     213              62.57        71.88
S30C0      182              74.87        71.80
In both cases the best confidence level was 35%. For the ARC-BC classifier, the best
strategy was to use the dominant class, whereas in the case of ACRI the combination of
the cosine measure and confidence factors worked best.
Based on these tables, Figure 4 shows the relationship between support and accuracy for
the two approaches. Comparing the best results found, ARC-BC slightly outperforms
ACRI; however, it seems to be more sensitive to changes in the support threshold. The
accuracy of ACRI hardly depends on the support threshold, whereas for ARC-BC it
decreases significantly when the threshold exceeds 20%.
[Figure: plot of accuracy [%] (60-90) against support threshold [%] (10-40) at confidence
= 35%, with four series: ARC-BC (dominant class), ARC-BC (confidence), ACRI
(cosine + confidence), and ACRI (cosine + conf. + dominant c.).]
Figure 4 Accuracy of the ACRI and ARC-BC
Figure 5 and Figure 6 show the number of generated rules with and without recurrent
items. As can be observed, considering recurrences results in more rules, which has its
origin in the different support definition. Another interesting relationship is that when the
confidence threshold is increased from 0% to 35%, the difference between the numbers of
rules decreases more rapidly for ACRI.
[Figures: two plots of the number of rules against the support threshold [%] (0-70), at
confidence = 0% and confidence = 35%, each with two series: Single Occurrence and
Multiple Occurrences.]
Figure 5 Number of rules with confidence = 0%
Figure 6 Number of rules with confidence = 35%
Figure 7 shows the running times of the rule generator with and without considering
recurrent items. The algorithm with recurrences is slower, since it has to search a larger
space, yet the difference becomes smaller as the support threshold increases.
[Figure: plot of average rule-generation time per class [s] (0-7) against support threshold
(0-70), with two series: single support and multiple support.]
Figure 7 Algorithm efficiency
The authors of ARC-BC [11] claim to have obtained the best results for confidence
thresholds greater than 70%. In our experiments, however, both the ARC-BC and ACRI
approaches performed better at lower confidence thresholds. This dissimilarity may be
explained by the use of a different method of counting support and confidence and/or by
the fact that our classifier is slightly different from the original one.
7 CONCLUSIONS AND FUTURE WORK
In this document we explained the idea of associative classification and mining frequent
itemsets with recurrent items. Then, we combined these two and presented the new
approach of association classification with recurrent items. In our project we developed
ACRI, the suite of modules for generating rules and classification. Our implementation
allowed us to compare the classification with recurrent items to the ARC-BC
classification method. We performed several series of experiments and presented the
results. For given definitions of support and confidence we observed that ACRI performs
slightly worse than the ARC-BC classifier. What is more, ACRI needed more rules to
classify objects with accuracy comparable to ARC-BC. However, ACRI seems to be less
sensitive, with respect to accuracy, to the support threshold. Though, we must be aware
that our definition of support in mining frequent itemsets with recurrent items is different
from the original one, and to obtain the same absolute value of support (a support count)
we have to set the higher relative value of the support.
As we were limited by time constraints we did not perform enough experiments to
properly assess performance of our approach. We anticipate that using the different
definitions of counting support and confidence for rules assigned to different classes may
improve the results. We would also like to try different techniques for pruning rules such
as removing contradictory rules or rule generalization.
APPENDIX A ACRI PROGRAM SUITE
All three modules of the ACRI program suite are command-line applications.
A.1 acri
The usage of acri is described in Table 4. inputfilename is the name of a file
containing transactions (documents), minsupp is the support threshold, and
outputfilename is the name of the file that will contain the rules. The formats of
inputfilename and outputfilename are presented in Table 5 and Table 6,
respectively.
Optionally, a user can choose the --verbose option to see the steps of the mining
process; --suppressrecurr to produce rules regardless of reoccurrences of items
(simple Apriori); and --tmpdir to specify the folder for temporary files.
Table 4 Command line usage annotation for acri
$ acri
Usage: acri [OPTIONS] inputfilename minsupp outputfilename
OPTIONS:
--verbose
--suppressrecurr
--tmpdir string
Table 5 acri input file format
class_label [class_label [...]] : word [word [...]]
[...]
Table 6 acri output file format
class_label : word(recurrence) [word(recurrence) [...]]
[...]
A.2 count.pl
count.pl requires three file names to be specified (see Table 7): the one containing the
rules generated by acri, the one containing the training set, and the output file for the
rules with support and confidence (see Table 8). Optionally, a user can set a confidence
threshold (--minconf) and/or suppress counting item recurrences (--suppressrecurr).
Table 7 Command line usage annotation for count.pl
$ count.pl
Usage: count.pl [--minconf float] [--suppressrecurr] rule_file transaction_file
output_file
Table 8 count.pl output file format
class_label : word(recurrence) [word(recurrence) [...]] : support : confidence
[...]
A.3 acric
The usage of acric is presented in Table 9. The application requires the name of a file
containing the rules from the previous step (rulesfilename) and the name of a file
containing the test set (objectsfilename). --rulecoverage, --cosinefactor,
--distancefactor, --supportfactor, and --confidencefactor allow for specifying
the tolerance factors for CV, CM, DM, support, and confidence, respectively. The
dominant class technique can be turned on with the --dominantclass option.
Executing acric produces results containing a confusion matrix and several different
measures of the correctness of the method (see Table 10).
Table 9 Command line usage annotation for acric
$ acric
Usage: acric [OPTIONS] rulesfilename objectsfilename
OPTIONS:
--verbose
--rulecoverage double
--cosinefactor double
--distancefactor double
--supportfactor double
--confidencefactor double
--dominantclass
Table 10 Example of acric usage
$ ./acric data/rules/rmS25C35 data/source/reuters_test.data --rulecoverage 1 \
--supportfactor 0.1 --cosinefactor 0.4 --dominantclass
Confusion Matrix
(rows: true classes; columns: predicted classes)
[confusion matrix of raw counts for the nine classes]

Accuracy[%]: 79.37
Classified[%]: 99.46

Class ID   | Class label  | Precision[%] | Recall[%] | F1[%]
0          | 00_acq       |         81.1 |      88.1 |  84.4
1          | 01_corn      |         50.0 |      14.2 |  22.2
2          | 02_crude     |         73.2 |      76.7 |  74.9
3          | 03_earn      |         95.1 |      91.9 |  93.4
4          | 04_grain     |         47.1 |      66.4 |  55.1
5          | 05_interest  |         58.9 |      78.4 |  67.3
6          | 06_money-fx  |         61.4 |      57.6 |  59.4
7          | 07_ship      |         84.0 |      23.5 |  36.8
8          | 08_trade     |         60.8 |      86.3 |  71.3
*MacroAvg  |              |         61.1 |      58.3 |  59.7
*MicroAvg  |              |         79.3 |      79.3 |  79.3
REFERENCES:
[1] Agrawal R., Srikant R., Fast Algorithms for Mining Association Rules, International Conference on Very Large Data Bases, pp. 487-499, 1994
[2] Antonie M.-L., Zaïane O., Coman A., Associative Classifiers for Medical Images, Lecture Notes in Artificial Intelligence 2797, Mining Multimedia and Complex Data, Springer-Verlag, pp. 68-83, 2003
[3] Antonie M.-L., Zaïane O., Coman A., Mammography Classification by an Association Rule-Based Classifier, Third International ACM SIGKDD Workshop on Multimedia Data Mining (MDM/KDD'2002) in conjunction with Eighth ACM SIGKDD, pp. 62-69, 2002
[4] Han J., Kamber M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
[5] Li W., Han J., Pei J., CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, IEEE International Conference on Data Mining, pp. 369-376, 2001
[6] Liu B., Hsu W., Ma Y., Integrating Classification and Association Rule Mining, Knowledge Discovery and Data Mining, pp. 80-86, 1998
[7] Liu B., Hsu W., Ma Y., Integrating Classification and Association Rule Mining, Knowledge Discovery and Data Mining, pp. 80-86, 1998
[8] Ong K.-L., Ng W.-K., Lim E.-P., Mining Multi-Level Rules with Recurrent Items Using FP'-Tree, ICICS, 2001
[9] Wang W., Yang J., Yu P., WAR: Weighted Association Rules for Item Intensities, Knowledge and Information Systems, vol. 6, pp. 203-229, 2004
[10] Zaïane O., Han J., Finding Spatial Associations in Images, Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, SPIE 14th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, pp. 138-147, 2000
[11] Zaïane O. R., Antonie M.-L., Classifying Text Documents by Associating Terms with Text Categories, Thirteenth Australasian Database Conference (ADC'02), pp. 215-222, 2002
[12] Zaïane O. R., Han J., Zhu H., Mining Recurrent Items in Multimedia with Progressive Resolution Refinement, International Conference on Data Engineering (ICDE'2000), pp. 461-470, 2000

WEB PAGE REFERENCES:
[13] Porter's Stemming Algorithm. http://www.tartarus.org/~martin/PorterStemmer/
[14] Reuters-21578 Top 10 Topics Collection. http://www.jihe.net/datasets.htm - rttop10