Paper-1: Topic Description and Discussion

Distributed Classification Based on Association Rules (CBA) Algorithm

By Shuanghui Luo

Introduction

Data mining refers to extracting useful information or knowledge from large amounts of data. In recent years, data mining has attracted a great deal of attention because of its applications in business, market analysis, scientific exploration, and other fields. Classification rule mining and association rule mining are two important data mining techniques.

Classification rule mining aims to find a small set of rules in the database that forms an accurate classifier. There is one and only one pre-determined target: the class. Data classification is a two-step process. The first step is learning: analyze the data using a classification algorithm and build a model represented in the form of classification rules, a decision tree, or mathematical formulae. The second step is classification: test the accuracy of the classification rules; if the accuracy is acceptable, the rules or the model are used to classify new data.

Association rule mining aims to find all rules in the database that satisfy user-specified minimum support and minimum confidence constraints. For association rule mining, the target of discovery is not pre-determined. It is also a two-step process: find all frequent itemsets, then generate strong association rules from those frequent itemsets (Jiawei Han and Micheline Kamber, 2000).

We can view association rule mining and classification rule mining as complementary approaches. Association rule mining is unbiased: it requires no external knowledge and finds all understandable, and sometimes unexpected, rules. Classification rule mining is biased: with the help of external knowledge, it attends only to a small set of rules. In general, for the same dataset, the association rules can be numerous and indiscriminate, while the classification rules are lean but yield less new knowledge.
But data mining is like searching for hidden treasure in a sea of sand: it is very desirable to obtain new and unexpected knowledge through a meaningful search of that vast sea. Naturally, people have turned to combining the virtues of classification rule mining and association rule mining.

Current research on data mining is based on simple transactional data models. Given an item set {itemi} and a transaction set {transi}, an association rule is defined as an implication of the form X → Y, where X and Y are non-overlapping subsets of {itemi}. Obviously, the number of possible association rules scales exponentially with the size of the item set. Two important quantities here are the confidence c, the percentage of transactions containing both X and Y among the transactions containing X, and the support s, the percentage of transactions containing both X and Y among all transactions. In a classification dataset, items are viewed as (attribute, value) pairs. A classification association rule (CAR) is then X → ci, where ci is a class.

In most scenarios, only a small portion of all association rules carries valuable information. One common practice to identify this valuable small subset is to disqualify rules that do not meet certain standards by setting minimal requirements. In practice, the common minimum requirements are a minimum support smin and a minimum confidence cmin. One of the most important algorithms under this practice is the Apriori algorithm, which finds all valuable rules in two steps:
(1) Find all the frequent itemsets that satisfy smin.
(2) Generate association rules that satisfy cmin using the itemsets found in (1).

The Apriori algorithm works on transaction item sets. A classification dataset is different: it is often viewed as a relational table of distinct attributes, where each data record carries a class label. Classification often requires a small smin, which causes combinatorial explosion. Many CAR algorithms are derived from the Apriori algorithm and address this problem in one way or another.
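As a concrete illustration of these definitions, the following is a minimal brute-force sketch of CAR mining. It is not Liu's optimized CBA-RG: it enumerates candidate condition sets directly rather than growing them level-wise, so it is only practical for tiny examples. The dataset, attribute names, and thresholds below are invented for illustration.

```python
from itertools import combinations
from collections import Counter

def mine_cars(records, labels, min_sup, min_conf, max_len=2):
    """Brute-force enumeration of class association rules X -> y.

    records: list of frozensets of (attribute, value) items
    labels:  parallel list of class labels
    sup  = |rows matching X with class y| / |all rows|
    conf = |rows matching X with class y| / |rows matching X|
    """
    n = len(records)
    items = sorted({item for r in records for item in r})
    cars = []
    for k in range(1, max_len + 1):
        for condset in combinations(items, k):
            cond = frozenset(condset)
            # Count the rows matching the condition set, split by class.
            by_class = Counter(y for r, y in zip(records, labels) if cond <= r)
            matched = sum(by_class.values())
            for y, cnt in by_class.items():
                sup, conf = cnt / n, cnt / matched
                if sup >= min_sup and conf >= min_conf:
                    cars.append((cond, y, sup, conf))
    return cars

# Tiny made-up weather table (attribute names are illustrative only).
records = [frozenset([("outlook", "sunny"), ("windy", "no")]),
           frozenset([("outlook", "sunny"), ("windy", "yes")]),
           frozenset([("outlook", "rain"), ("windy", "no")])]
labels = ["play", "play", "stay"]
cars = mine_cars(records, labels, min_sup=0.3, min_conf=0.8)
for cond, y, sup, conf in cars:
    print(dict(cond), "->", y, f"(sup={sup:.2f}, conf={conf:.2f})")
```

Note how the minimum-confidence filter discards the ambiguous condition set {(windy, no)}, which matches one "play" row and one "stay" row.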
Bing Liu et al. proposed the Classification Based on Association rules (CBA) algorithm as an integration of classification rule mining and association rule mining (Bing Liu et al., 1998). The integration was achieved by focusing on mining a special subset of association rules called class association rules (CARs). The problem for the CBA algorithm can be stated as follows:
(1) Assume a relational table D with n attributes. An attribute can be discrete or continuous. (CBA also works with transactional data.)
(2) There is a discrete class attribute (for classification).
(3) Each continuous attribute is discretized into intervals.
(4) An item is an (attribute, value) pair.
(5) Let I be the set of all items in D, and Y be the set of class labels.
(6) A class association rule (CAR) is an implication of the form X → y, where X ⊆ I and y ∈ Y.
(7) A rule X → y holds in D with a confidence and a support (as in normal association rule mining).

The CBA algorithm is carried out in three stages:
(1) Find the set of CARs using the CBA-RG algorithm, which is based on the Apriori algorithm.
(2) Build a classifier from the CAR set using training data.
(3) Apply the classifier for data mining (predict which class a new record belongs to).

Therefore, the main questions in improving the CBA algorithm are:
(1) How can we better choose the CAR set using association rule mining?
(2) How can we generate a more accurate classifier?

Bing Liu et al. have attacked both fronts. As we know, the key parameter in association rule mining is smin: it controls how many rules, and what kinds of rules, are generated. The earlier CBA system followed the original association rule model and used a single smin in its rule generation. However, this is inadequate for mining CARs, since many practical classification datasets have uneven class frequency distributions. Using a single smin will result in one of the following two problems:
1. If we set the smin value too high, we may not find sufficient rules for infrequent classes.
2. If we set the smin value too low, we will find many useless and overfitting rules for frequent classes.

Bing Liu suggested using multiple minimum supports to solve this problem, as follows. For each class ci, a different minimum class support smin,i is assigned. The user gives only a total minimum support, denoted t_smin, which is distributed to each class according to the class distribution:

    smin,i = t_smin × freqDistr(ci)

The formula gives frequent classes a higher smin and infrequent classes a lower smin. This ensures that sufficient rules for infrequent classes are generated without producing too many overfitting rules for frequent classes. As for cmin, it has less impact on classifier quality as long as it is not set too high, since we always choose the most confident rules.

With a sound set of CARs, an accurate classifier can be obtained by applying the highest-precedence rules to a training dataset. Various algorithms are available for this job. Liu proposed a simple one:
(1) Sort the set of generated rules R according to the relation ">", defined as follows: given two rules ri and rj, ri > rj (also read ri precedes rj, or ri has higher precedence than rj) if
  i. the confidence of ri is greater than that of rj, or
  ii. their confidences are equal, but the support of ri is greater than that of rj, or
  iii. both the confidences and supports of ri and rj are the same, but ri was generated earlier than rj.
(2) Use the rules in R, in sorted order, to cover the training data. After each rule is applied, the cases it covers are removed. A set of rules is thus selected from R that covers all the training data, giving the classifier C = <r1, r2, …, rn, default_class>.
(3) Discard those rules in C that do not improve accuracy.

This simple algorithm has several drawbacks and was later improved by intelligently choosing a minimal subset of R to cover the training data, and by combining CBA with the decision tree method.
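The multiple-minimum-support formula and the basic classifier builder can be sketched together as follows. This is a simplified reading of steps (1) and (2): the accuracy-based pruning of step (3) is omitted, and the rule representation (plain dicts with "cond", "cls", "conf", and "sup" fields) is an assumption of this sketch, not CBA's actual data structures.

```python
from collections import Counter

def per_class_minsup(labels, total_minsup):
    """smin,i = t_smin * freqDistr(ci): distribute the user-given total
    minimum support to each class in proportion to its frequency."""
    n = len(labels)
    return {c: total_minsup * cnt / n for c, cnt in Counter(labels).items()}

def sort_rules(rules):
    """Precedence '>': higher confidence first, then higher support,
    then the rule generated earlier (its position in the input list)."""
    return [r for _, r in sorted(enumerate(rules),
                                 key=lambda ir: (-ir[1]["conf"], -ir[1]["sup"], ir[0]))]

def build_classifier(rules, records, labels):
    """Cover the training data with the sorted rules; each selected rule
    removes the cases it covers. Leftover cases set the default class."""
    remaining = set(range(len(records)))
    selected = []
    for rule in sort_rules(rules):
        covered = {i for i in remaining if rule["cond"] <= records[i]}
        if covered:
            selected.append(rule)
            remaining -= covered
        if not remaining:
            break
    pool = [labels[i] for i in remaining] or labels
    default_class = Counter(pool).most_common(1)[0][0]
    return selected, default_class

# A frequent class receives the higher per-class threshold:
print(per_class_minsup(["a", "a", "a", "b"], total_minsup=0.08))

# Hypothetical rules and training rows (all names invented for illustration):
rules = [{"cond": frozenset({("f", "1")}), "cls": "a", "conf": 0.9, "sup": 0.5},
         {"cond": frozenset({("f", "2")}), "cls": "b", "conf": 0.8, "sup": 0.25}]
records = [frozenset({("f", "1")}), frozenset({("f", "1")}),
           frozenset({("f", "2")}), frozenset({("g", "3")})]
labels = ["a", "a", "b", "a"]
selected, default_class = build_classifier(rules, records, labels)
print([r["cls"] for r in selected], default_class)
```

In this toy run, both rules cover at least one remaining case and are kept, and the single uncovered row determines the default class.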
The combination method used one classifier to segment the training data and chose the best classifier for each segment of the training data (Bing Liu et al., 2000).

Discussions

Liu's CBA algorithm addresses several issues in association rule data mining:
(1) It proposes CBA to integrate classification and association mining:
  1. mine all classification rules,
  2. produce an accurate classifier (comparable with C4.5),
  3. mine normal association rules.
(2) It presents a new way to construct classifiers.
(3) It helps solve a number of problems in classification systems.
(4) It can build a classifier with data on disk rather than in memory.

CBA itself is a compromise between the subjective and the indiscriminate. The CBA algorithm framework has been established, but there is still plenty of room for improvement, for example:
(1) Liu's CBA algorithm includes a step that discretizes continuous attributes before CARs are generated. This can be undesirable because the choice of discretization algorithm can be arbitrary, which can affect the final outcome.
(2) Recently, attention has been paid to the intrinsic correlations among the CARs. Understandably, some of the long CARs are weaker rules with respect to their subset CARs, and rejecting a strong rule automatically invalidates the weaker rules derived from it. Organizing the CARs in an ordered way, for example in a tree structure, can help address the scalability problem of candidate CARs, since the classifier builder need not traverse all candidate CARs. Bing Liu's group has developed the association data tree (ADT) algorithm along these lines (Bing Liu et al., 2000, 2001).
(3) The classifier could be made more flexible than strict precedence order.
(4) There is also broad interest in distributed CBA algorithms. As computing infrastructure evolves, data are more likely to be distributed. The idea of using bagging or boosting has been proposed in Liu's group.
(5) Many current CBA algorithms assume the data under study is well organized in a relational dataset.
Unfortunately, in reality, the vast majority of information is not stored in such a graceful way.

The strength of CBA is its ability to use the most accurate rules for classification. However, the existing techniques, based on exhaustive search, face a challenge with huge amounts of data due to their computational complexity. CBA also deals with centralized databases, whereas in today's Internet environment databases may be scattered over different locations and heterogeneous. We will combine CBA with distributed techniques to develop a distributed CBA algorithm that mines distributed and heterogeneous databases. Since CBA does not require all data to reside in memory, the algorithm can be run block by block, so there is no technical obstacle to having CBA run on top of a meta-database layer that hides the distributed nature of the underlying data sources.

References:
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 2000.
2. Bing Liu, Wynne Hsu, Yiming Ma, "Integrating Classification and Association Rule Mining." Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, USA, 1998.
3. Bing Liu, Yiming Ma, C-K Wong, "Classification Using Association Rules: Weaknesses and Enhancements." To appear in Vipin Kumar et al. (eds.), Data Mining for Scientific Applications, 2001.
4. Bing Liu, Yiming Ma, Ching Kian Wong, "Improving an Association Rule Based Classifier." Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2000), September 13-16, 2000, Lyon, France.