Distributed Classification Based on Association rules (CBA) algorithm

Paper-1: Topic Description and Discussion
By Shuanghui Luo
Introduction
Data mining refers to extracting useful information or knowledge from large amounts of
data. In recent years, data mining has attracted a great deal of attention because of its
applications in business, market analysis, scientific exploration, and other areas.
Classification rule mining and association rule mining are two important data mining
techniques.
Classification rule mining aims to find a small set of rules in the database that form an
accurate classifier. There is one and only one pre-determined target: the class attribute.
Data classification is a two-step process. The first step is learning: analyze the data
using a classification algorithm and build a model represented in the form of classification
rules, decision trees, or mathematical formulae. The second step is classification: test
the accuracy of the model; if the accuracy is acceptable, the rules or the model is used
for the classification of new data.
Association rule mining aims to find all rules in the database that satisfy user-specified
minimum support and minimum confidence constraints. For association rule mining, the
target of discovery is not pre-determined. Association rule mining is also a two-step
process: find all frequent itemsets, then generate strong association rules from those
frequent itemsets (Jiawei Han and Micheline Kamber, 2000).
Association rule mining and classification rule mining can be viewed as complementary
approaches. Association mining is an unbiased way to find connections: it requires no
external knowledge and can uncover all understandable and unexpected rules. Classification
mining is a biased approach that, guided by external knowledge, attends only to a small
set of rules. In general, for the same dataset, the association rules can be numerous and
indiscriminate, while the classification rules are lean but yield little new knowledge. Data
mining, however, is like searching for hidden treasure in a sea of sand: it is highly
desirable to obtain new and unexpected knowledge through a focused search of that vast
sea. Naturally, researchers have sought to combine the virtues of classification rule
mining and association rule mining.
Current research on data mining is based on simple transaction data models. Given an
item set {itemi} and a transaction set {transi}, an association rule is defined as an
implication of the form X → Y, where X and Y are non-overlapping subsets of {itemi}.
Obviously, the number of possible association rules scales exponentially with the size of
the item set. Two important quantities here are the confidence c, the percentage of
transactions containing both X and Y among the transactions containing X, and the
support s, the percentage of transactions containing both X and Y among all
transactions. In a classification dataset, items are viewed as (attribute, value) pairs. A
classification association rule (CAR) is then X → ci, where ci is a class label.
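As a concrete illustration, support and confidence of a CAR can be computed over a toy set of records. All attribute names, values, and records below are invented for illustration; this is a sketch of the definitions, not any particular system's code.

```python
# records are ((attribute, value) item set, class label) pairs; all invented
records = [
    ({("outlook", "sunny"), ("windy", "no")}, "play"),
    ({("outlook", "sunny"), ("windy", "yes")}, "stay"),
    ({("outlook", "rain"), ("windy", "no")}, "play"),
    ({("outlook", "sunny"), ("windy", "no")}, "play"),
]

def support_confidence(X, c, records):
    """Return (support, confidence) of the CAR X -> c."""
    n = len(records)
    n_x = sum(1 for items, _ in records if X <= items)              # records containing X
    n_xc = sum(1 for items, cls in records if X <= items and cls == c)
    return n_xc / n, (n_xc / n_x if n_x else 0.0)

s, conf = support_confidence({("outlook", "sunny"), ("windy", "no")}, "play", records)
print(s, conf)  # 0.5 1.0
```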
In most scenarios, only a small portion of all association rules carries valuable
information. One common practice to identify this valuable small subset of association
rules is to disqualify rules that do not meet certain standards by setting minimal
requirements. In practice, the common minimum requirements are a minimum support smin
and a minimum confidence cmin. One of the most important algorithms under this practice is
the Apriori algorithm, which finds all valuable rules in two steps:
(1) Find all the frequent itemsets that satisfy smin,
(2) Generate association rules that satisfy cmin using the itemsets found in (1).
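The level-wise search behind step (1) can be sketched as follows. This is a simplified illustration of the Apriori idea, not Liu's implementation; the function and variable names are our own, and the candidate generation is deliberately naive.

```python
from itertools import combinations

def apriori_frequent(transactions, smin):
    """Level-wise search for all itemsets whose support (fraction of
    transactions containing them) is at least smin."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # level 1: frequent single items
    singles = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in singles if support(s) >= smin}
    frequent, k = {}, 1
    while level:
        frequent.update({s: support(s) for s in level})
        # join step: combine frequent k-itemsets into (k+1)-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # prune step (Apriori property): every k-subset must itself be frequent
        level = {c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k))
                 and support(c) >= smin}
        k += 1
    return frequent
```

Each pass over the data only considers candidates all of whose subsets survived the previous pass, which is what keeps the exponential search space manageable.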
The Apriori algorithm works on transaction item sets. A classification dataset is different:
it is often viewed as a relational table of distinct attributes in which each data record
carries a class label. Classification often requires a small smin, which causes
combinatorial explosion. Many CAR algorithms are derived from the Apriori algorithm and
address this problem in one way or another.
Bing Liu et al. proposed the Classification Based on Association rules (CBA) algorithm as
an integration of classification rule mining and association rule mining (Bing Liu et al.,
1998). The integration was achieved by focusing on mining a special subset of association
rules called class association rules (CARs).
The problem for the CBA algorithm can be stated as follows:
(1) Assume a relational table D with n attributes. An attribute can be discrete or
continuous. (CBA also works with transactional data.)
(2) There is a discrete class attribute (for classification).
(3) Each continuous attribute is discretized into intervals.
(4) An item is an (attribute, value) pair.
(5) Let I be the set of all items in D, and Y be the set of class labels.
(6) A class association rule (CAR) is an implication of the form X → y, where X ⊆ I
and y ∈ Y.
(7) A rule X → y holds in D with a confidence and a support (as in normal association
rule mining).
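The mapping from a relational record to items implied by this setup might look like the following sketch. The attribute names and interval bins are hypothetical, and the half-open-interval discretization is just one simple choice.

```python
def discretize(value, bins):
    """Map a continuous value to the label of the first half-open
    interval [lo, hi) that contains it."""
    for label, (lo, hi) in bins.items():
        if lo <= value < hi:
            return label
    return "out_of_range"

# hypothetical bins for a continuous "age" attribute
age_bins = {"young": (0, 30), "middle": (30, 60), "old": (60, 200)}

# a relational record becomes a set of (attribute, value) items
record = {"age": 42, "outlook": "sunny"}
items = {("age", discretize(record["age"], age_bins)),
         ("outlook", record["outlook"])}
print(sorted(items))  # [('age', 'middle'), ('outlook', 'sunny')]
```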
The CBA algorithm is carried out in three stages:
(1) Find the CARs set using the CBA-RG algorithm, which is based on the Apriori algorithm
(2) Build a classifier based on the CARs set using training data
(3) Apply the classifier for data mining (predict the class of a new record)
Therefore, the main directions for improving the CBA algorithm are:
(1) How to better choose the CARs set using association rule mining
(2) How to generate a more accurate classifier
Bing Liu et al. have addressed both fronts. As we know, the key parameter in
association rule mining is smin. It controls how many rules and what kinds of rules are
generated. The earlier CBA system follows the original association rule model and uses a
single smin in its rule generation. However, this is inadequate for mining CARs, since
many practical classification datasets have uneven class frequency distributions. Using a
single smin will result in one of the following two problems:
1. If we set the smin value too high, we may not find sufficient rules for infrequent
classes.
2. If we set the smin value too low, we will find many useless and overfitting rules for
frequent classes.
Bing Liu suggested using multiple smin values to solve the above problem with the
following scheme:
smin,i: each class ci is assigned a different minimum class support. The user gives only
a total smin, denoted t_smin, which is distributed to each class according to the class
distribution as follows:
smin,i = t_smin × freqDistr(ci)
The formula gives frequent classes higher smin values and infrequent classes lower ones.
This ensures that sufficient rules for infrequent classes will be generated without
producing too many overfitting rules for frequent classes. Regarding cmin, it has less
impact on classifier quality as long as it is not set too high, since we always choose the
most confident rules.
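The assignment of per-class minimum supports can be sketched in a few lines (the function and variable names are illustrative):

```python
from collections import Counter

def per_class_min_support(class_labels, t_smin):
    """Distribute a total minimum support t_smin over the classes in
    proportion to their frequency: smin_i = t_smin * freqDistr(c_i)."""
    n = len(class_labels)
    freq = Counter(class_labels)
    return {c: t_smin * (count / n) for c, count in freq.items()}

labels = ["a"] * 8 + ["b"] * 2            # 80% / 20% class distribution
print({c: round(v, 4) for c, v in per_class_min_support(labels, 0.05).items()})
# {'a': 0.04, 'b': 0.01}
```

With t_smin = 5%, the frequent class "a" must clear a 4% support bar while the rare class "b" only needs 1%, so rules for "b" are not drowned out.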
With a sound set of CARs, an accurate classifier can be obtained by applying the
highest-precedence rules to a training data set. Various algorithms are available for this
job. Liu proposed one of the simplest:
(1) Sort the set of generated rules R according to the relation ">", defined as follows.
a. Given two rules ri and rj, ri > rj (also read: ri precedes rj, or ri has
higher precedence than rj) if
i. the confidence of ri is greater than that of rj, or
ii. their confidences are equal, but the support of ri is greater than
that of rj, or
iii. both the confidences and supports of ri and rj are the same, but ri
was generated earlier than rj.
(2) Use the rules in R to cover the training data (in sorted sequence). After each
rule is applied, the cases it covers are removed. A set of rules C is selected from
R that covers all training data, where C = <r1, r2, …, rn, default_class>.
(3) Discard those rules in C that do not improve accuracy.
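The sort-and-cover procedure in steps (1) and (2) can be sketched as follows. This is a simplified illustration with an invented rule representation (dicts with `lhs`, `cls`, `conf`, `supp`, `order` keys); the accuracy-pruning step (3) is omitted for brevity.

```python
from collections import Counter

def precedence_key(rule):
    # higher confidence first, then higher support, then earlier generation
    return (-rule["conf"], -rule["supp"], rule["order"])

def build_classifier(rules, training_data):
    """Greedy cover: walk the rules in precedence order, keep each rule
    that covers at least one remaining case, and remove the cases it covers."""
    remaining = list(training_data)
    classifier = []
    for rule in sorted(rules, key=precedence_key):
        covered = [rec for rec in remaining if rule["lhs"] <= rec["items"]]
        if covered:
            classifier.append(rule)
            remaining = [rec for rec in remaining
                         if not rule["lhs"] <= rec["items"]]
        if not remaining:
            break
    # majority class of uncovered cases (or of all data) is the default class
    pool = remaining or list(training_data)
    default = Counter(rec["cls"] for rec in pool).most_common(1)[0][0]
    return classifier, default
```

A new record is then classified by the first rule in the returned list whose left-hand side it satisfies, falling back to the default class.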
This simple algorithm has several drawbacks and was later improved by intelligently
choosing a minimal subset of R to cover the training data and by combining CBA with the
decision tree method. The combination method uses one classifier to segment the training
data and chooses the best classifier for each segment of the training data (Bing Liu et
al., 2000).
Discussions
Liu’s CBA algorithm addresses several issues in association rule data mining:
(1) It proposes CBA to integrate classification and association mining:
1. mine all classification rules,
2. produce an accurate classifier (compared with C4.5),
3. mine normal association rules.
(2) It presents a new way to construct classifiers.
(3) It helps to solve a number of problems in classification systems.
(4) It builds a classifier with data on disk rather than in memory.
CBA itself is a compromise between the subjective and the indiscriminate. The CBA
algorithm framework has been established, but there is still plenty of room for
improvement, such as the following:
(1) Liu’s CBA algorithm includes a discretization step before data items are admitted
for CAR generation. This can be undesirable because the choice of discretization
algorithm can be arbitrary, which can affect the final outcome.
(2) Recently, attention has been paid to the intrinsic correlations among CARs.
Understandably, some of the long CARs are weaker rules with respect to their
subset CARs. Rejecting a strong rule automatically invalidates its derived weaker
rules. Organizing the CARs in an ordered way, for example in a tree structure, can
help address the scalability problems of candidate CARs, since the classifier
builder need not traverse all candidate CARs. Bing Liu has developed the
association data tree (ADT) algorithm (Bing Liu et al., 2000, 2001).
(3) The classifier could be made more flexible than the strict precedence-order
approach.
(4) There is also strong interest in a distributed CBA algorithm. As computing
infrastructure evolves, data are increasingly distributed. The idea of using
bagging or boosting has been proposed in Liu’s group.
(5) Many current CBA algorithms assume the data under study are well organized in a
relational dataset. Unfortunately, in reality, the vast majority of information is
not stored in such a graceful way.
The strength of CBA is its ability to use the most accurate rules for classification.
However, existing techniques based on exhaustive search face a challenge with huge
amounts of data due to their computational complexity. CBA deals with centralized
databases, but in today’s Internet environment databases may be scattered over different
locations and heterogeneous. We will combine CBA with distributed techniques to develop
a distributed CBA algorithm for mining distributed and heterogeneous databases. Since
CBA does not require all data to reside in memory, the algorithm can run block by block;
there is therefore no technical obstacle to running CBA on top of a meta-database layer
that hides the distributed nature of the underlying data sources.
Reference:
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan
Kaufmann Publishers, San Francisco, CA, 2000.
2. Bing Liu, Wynne Hsu, Yiming Ma, "Integrating Classification and Association Rule
Mining," Proceedings of the Fourth International Conference on Knowledge Discovery
and Data Mining (KDD-98, full paper), New York, USA, 1998.
3. Bing Liu, Yiming Ma, C-K Wong, "Classification Using Association Rules:
Weaknesses and Enhancements," to appear in Vipin Kumar et al. (eds.), Data Mining for
Scientific Applications, 2001.
4. Bing Liu, Yiming Ma, Ching Kian Wong, "Improving an Association Rule Based
Classifier," Proceedings of the Fourth European Conference on Principles and Practice
of Knowledge Discovery in Databases (PKDD-2000), Lyon, France, September 13-16,
2000.