Paper-2 CS590L WS2002 Topic: Distributed Classification Based on Association rules (DCBA) algorithm By Shuanghui Luo This paper is organized as follows Introduction 1. Trend and significance of data mining 2. Basic concept about association rule mining and classification rule mining - Association rule mining: i. Basic concept, transaction model ii. Apriori algorithm, iii. FP-growth - Classification rule mining - Associative classification 3. Recent researches on the integration of Association rule mining and Classification rule mining - Classification Based on Associations (CBA) algorithm - Classification Based on Multiple Association Rules (CMAR) algorithm 4. Problem and high-level solution Related work 1. CBA algorithm 2. CMAR algorithm 3. Distributed boosting algorithm 4. Parallel computing 5. Our approach: - Distributed CBA (DCBA) algorithm and Distributed CMAR (DCMAR) algorithm - Issues: accuracy, efficiency, and scalability - Feasibility Introduction Trend and significance of data mining Data mining refers to search for valuable information hidden in large quantities of data using statistical and machine learning techniques. It is a multidisciplinary field with a major impact in scientific and commercial environment. The major reason that data mining has attracted a great deal of attention is due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Data mining has been used in market data analysis, catalog design, web log sequence, DNA analysis, high throughput drug design, and etc. J. Han and M. Kamber have discussed the concepts and various techniques in their recent book Data Mining: Concepts and Techniques[1]. Association rule mining and classification rule mining are two important data mining techniques. This paper will focus on the integration of association rule mining and classification rule mining. Basic concept about association rule mining and classification rule mining Current researches on data mining are based on simple transaction data models. Given an item set {itemi} and a transaction set {transi}, an association rule is defined as an implication of the form, XY, where X and Y are non-overlap subsets of {itemi}. In classification data set, an item can be viewed as {attribute, value} pair. Two important related quantities are confidence c, which is the percentage of transactions including X and Y to transactions including X, and support s, which is the percentage of transactions including X and Y to all transactions. Classification association rule (CAR) is then Xci where ci is a class label. A training data set is such a set of data items that for each item, there exists a class label associated with it. A classifier is a function that maps attributes to class labels. In general, given a training data set, classification is to build a class model from the training data set such that it can be used to predict the class labels of unknown items with high accuracy. Association rule mining finds interesting association or correlation relationships among a large set of data items. It first discovers frequent itemsets satisfying user-defined minimum support, and then from which generates strong association rules satisfying user-defined minimum confidence. The most famous algorithm for association rule mining is Apriori algorithm.[2] Most of the previous studies on association rule mining adopt the Apriori-like candidate set generationand-test approach. Apriori algorithm uses frequent (k – 1)-itemsets to generate candidate frequent k-itemsets and use database scan and pattern matching to collect counts for the candidate itemsets. Recently, J. Han et al critiqued that the bottleneck of Apriori algorithm is the cost of the candidate generation and multiple scans of database. Han’s group developed another influential method for discovering frequent pattern without candidate generation, which is called frequent pattern growth (FP-growth). It adopts divide-and-conquer strategy and constructs a highly compact data structure (FP-tree) to compress the original transaction database. It focuses on the frequent pattern (fragment) growth and eliminate repeated database scan. The performance study by Han’s group shows that FP-growth is more efficient than Apriori algorithm.[3] Classification rule mining is to build a class model or classifier by analyzing predetermining training data and apply the model to predict the future cases. Besides other techniques for data classification such as decision tree induction, Bayesian classification, neural network, classification based on data warehousing technology, and etc, the associative classification or classification based on association rules is an integrated technique that applies the methods of association rule mining to the classification. It typically consists of two steps. The first step finds the subset of association rules that are both frequent and accurate using association rule techniques. The second step employs the rules for classification. We can view association rule mining and classification rule mining as the complementary approaches. Association rule mining is to find all rules in the data that satisfy some user-specified constraints such as minimum support and minimum confidence. For association rule mining, the target of discovery is not pre-determined. Classification rule mining is to find a small set of rules in the database to build an accurate class model (classifier) and apply the model to classify new data. There is one and only one pre-determined target: class for classification rule mining. Association is an unbiased way to obtain all the correlations between data items with no external knowledge needed, while classification is biased ways to pay attentions only to the small set of rules with the help of external knowledge. Naturally, people turn to combine the virtues of classification rule mining and associate rule mining. Recent researches on the integration of association rule mining and classification rule mining Recently, Bing Liu et al proposed Classification Based on Association rules (CBA) algorithm as an integration of classification rule mining and association rule mining.[4] The integration was done by finding a special subset of association rules called class association rules (CARs) and building a classifier from the CARs. The main strength of CBA algorithm is its ability to use the most accurate rules for classification, which explains its better performance compared with some original classification algorithms such as C4.5. Liu’s research group also proposed some methods to deal with the problems of the original CBA algorithm such as single minimum support and not being able to generate long rules for many datasets. The performance of the algorithm was improved by using multiple minimum support (Smin) instead of a single Smin, and combining CBA algorithm with other techniques such as decision tree method.[5,6] More recently, Wenmin Li et al critiqued some weakness of Liu’s approach as follows: (1) simply selection a rule with a maximal user-defined measure may affect the classification accuracy, (2) the efficiency problem of storing, retrieve, pruning, and sorting a large number of rules for classification when there exist a huge number of rules, large training data sets, and long pattern rules. They proposed a new associative classification algorithm: Classification based on Multiple Association Rules (CMAR). The experimental results shows that CMAR provides better efficiency and accuracy compared with CBA algorithm. The accuracy of CMAR is achieved by using multiple association rules for classification. The efficiency of CMAR is achieved by extension of efficient frequent pattern method, FP-growth, construction of a class distribution-associated FP-tree, and applying a CRtree structure to store and retrieve mined association rules.[7] (Both CBA algorithm and CMAR algorithm will be discussed in detail later in the section of related work.) Problem and high-level solution The main issues on the integration of association and classification are: (1) efficiency and accuracy: how to efficiently find out the high quality rules using association rule mining and how to generate more accurate classifier, (2) scalability: it is important when there exist large training data sets, huge number of rules and long pattern rules. The efficiency and accuracy typically affect each other. We need to balance these two issues. CBA algorithm adopts Apriori-like algorithm to find association rules. The challenging problem of combinational explosion of rules still exists when there are huge amounts of data items and huge number of rules. Although CMAR algorithm has improved the performance, the scalability problem still exists. Moreover, in today’s Internet environment, the databases may be scattered over different locations. This research is to combine the existing data mining algorithms and distributed techniques to develop a distributed CBA algorithm and distributed CMAR algorithm and apply them to mine very large and distributed databases. The need for parallelization of the Apriori algorithm or FPgrowth method comes from the transaction database being too large, or the possible number of frequent itemsets being too large (because of the high dimensionality of the database), or both. Correspondingly, in order to achieve concurrency either the candidates need to be counted in parallel or they need to be generated in parallel, or both these phases need to be done in parallel. Not only is this distributed nature of data but also there are some other considerations, which we have to look into for taking the parallel computing approaches. Many practical applications of association rules involve huge transaction databases, which contain a large number of distinct items. In such situations the serial algorithms like Apriori running on single processor machines may take unacceptably large time. The primary reasons are memory, CPU speed and IO bandwidth limitations faced by a single processor machines. As an example, in the Apriori algorithm if the number of candidate itemsets become too large, then they might not all fit in the main memory and multiple database passes would be required within each iteration, incurring expensive IO cost. This implies that, even with the highly efficient pruning method of Apriori, the task of finding all association rules can require a lot of computational and memory resources, especially when the data is enormous and high dimensional (large number of distinct items). This motivates the development of parallel computing. Use Parallel computing to improve the performance For efficient parallel data mining, it requires to minimize the interprocess communication. For example, CBA algorithm takes two steps to construct classifiers. The first step is to find the CARs. CARs are built on top of the global data space. Direct computing of support and confidence of each candidate rule requires formidable interprocess communicates. Therefore, construction of CARs based on the CARs sets selected by each process is very desirable. The second is the building of classifier based on the CARs set. This is basically a sorting algorithm that can be relatively easily parallelized. Here is the high-level proposed distributed CBA algorithm (1) Construct CARs using the same requirements of minimum support and confidence for each process. CARs satisfy the global requirements should fall into the sum set of all individual CARs (2) Prune the unqualified CARs on each individual set. This is the most difficult part of distributed CBA algorithm. Simple method: each process chooses the CAR of the lowest support/confidence and broadcast to all process; each process returns the support/confidence; delete the CAR with unqualified overall support/confidence. (3) Parallel sorting to build the classifier. For distributed CMAR algorithm, the basic idea is similar. We can apply the divide-and-conquer strategy to the two phases of CMAR especially. For example, scan each database or segment of training data and build a FP-tree for each database or segment, and then combine the FP-tree and broadcast to all the segments. Related works CBA algorithm As mentioned earlier, Bing Liu’s group proposed CBA algorithm as the integration of association rule mining and classification rule mining.[4] The Problem for CBA algorithm can be stated as follows: (1) Assume a relational table D with n attributes. An attribute can be discrete or continuous. (CBA also works with transactional data) (2) There is a discrete class attribute (for classification). (3) For a continuous attribute, it is discretized into intervals. (4) Item: (attribute, value) (5) Let I be the set of all items in D, and Y be the set of class labels. (6) A class association rule (CAR) is an implication of the form: X y, where X I, and y Y. (7) A rule X y holds in D with confidence c and support s (similar to normal association rule mining). CBA algorithm was carried out in two main stages: (1) Find the complete set of CARs based on Aprior-like algorithm. (2) Build a classifier using CARs set based on training data and apply the classifier for new data cases. Therefore, the main focus to improve CBA algorithm is to (1) How can we better choose the CARs set using association rule mining (2) How can we generate more accurate classifier Liu’s group has attacked on both frontiers. One of the key parameters in association rule mining is the smin, which controls how many rules and what kinds of rules will be generated. The original CBA system uses a single smin in its rule generation. A single smin is inadequate for unbalanced class frequency distribution since using a single s min will result in one of the following two problems: if the smin value is set too high, sufficient rules of infrequent classes may not be discovered; if the smin value is set too low, many useless and overfitting rules for frequent classes may be produced. To solve the problems, Liu et al adopted multiple smin as follows: For each class ci, a different minimum class support (s min,i) is assigned. The user-defined total minimum support, t_ smin, is distributed to each class according to their class distributions: smin,i = t_ smin × freqDistr(ci) The formula gives frequent classes higher smin and infrequent classes lower smin. This ensures that sufficient rules for infrequent classes will be generated and too many overfitting rules for frequent classes will not be produced. Liu’s group consider that cmin has less impact on the classifier quality as long as it is not set too high since they always choose the most confident rules. To produce the best classifier out of the whole set of rules is infeasible since the number of possible subsets of rules exponentially grows with the number of rules. Liu’s group presented a heuristic algorithm for building a classifier using CARs. They defined a total order (>) on the generated rules and used it in selecting the rules for the classifier. Given two rules, ri and rj, ri > rj (also called ri precedes rj or ri has a higher precedence than rj) if (1) the confidence of ri is greater than that of rj, or (2) their confidences are equal, but the support of ri is greater than that of rj, or (3) both the confidences and supports of ri and rj are the same, but ri is generated earlier than rj. Classifier (C) = < r1, r2, …, rn, default_class) Liu et al proposed several versions of algorithms to build a classifier. The simplest version based on naive algorithm includes three steps: (1) Sort the set of generated rules R according to the relation “>”, (2) Select rules for C from rule set R following the sorted sequence. For each rule r, go through the database to find those cases covered by r, (3) Discard those rules in C that do not improve the accuracy. This simple algorithm needs to make many passes over the database, which is inefficient when the database is not resident in the main memory and when the database is huge. They also presented an improved version of the algorithm, which finds the best rule in R to cover each data case instead of making one pass over the remaining data for each rule. Liu’s original CBA algorithm is not suitable for huge number of rules or long rules due to combination explosion problem. T deal with this problem, Liu’s group has developed the association data tree (ADT) algorithm. They divide the training data into segments and intelligently choose different classifier for each segment according to a set of conditions.[5, 6] Liu’s CBA algorithm is a good example that applies association rule mining techniques to classification tasks. It adopted Apriori algorithm to generate the complete set of classification rules (CARs). They have presented a new way to construct classifiers and built a more accurate classifier compared with C4.5. It helps to solve some problems in the existing classification systems such as understandability problem, discovery of interesting and useful rules, and the residency of database on disk rather than in main memory. However, CBA algorithm itself is a compromise between subjective and indiscriminative. CBA algorithm framework has been established but there is still plenty of room for improvement in the CBA such as following: (1) In Liu’s CBA algorithm there is a step of discretion of data item to be admitted for CARs generation. This can be undesirable because the choice of discretion algorithm can be arbitrary which can affect the final outcome. (2) Classifier can be more flexible other than in precedent order way. (3) Scalability problem still exist when there are large number of datasets and large number of rules or long rules due to the intrinsic combination explosion problem. (4) In today’s Internet environment, as computer infrastructure evolves, data are more likely to be distributed. There is also large interest in distributed CBA algorithm. (5) Many current CBA algorithms assume the under studied data is well organized in relational data set. Unfortunately, in reality, the vast majority information does not store in such a graceful way. CMAR algorithm W. Li et al critiqued some weakness of CBA algorithm such as: (1) simply selection a rule with a maximal user-defined measure may affect the classification accuracy, (2) how to efficiently store, retrieve, prune, and sort a large number of associative rules for classification is still a challenging problem. To solve first problem, they determine a class label by selecting a small set of related rules instead of a single rule. To overcome the second problem, they use a CR-tree data structure to store and retrieve large number of rules. They proposed a novel associative classification algorithm: Classification based on Multiple Association Rules (CMAR). They adopted a variant FP-growth method to find complete set of CARs, which is more efficient than Apriori-like method. CMAR consists of two phases: rule generation and classification. In the first phase, rule generation, it finds the complete set of rules in the form of R: P → c, where P is a frequent pattern passing support threshold, and c is a class label passing confidence threshold. It further prunes some rules and selects a subset of high quality rules for classification. In the second phase, classification, it extracts a subset of rules matching new data object and predicts the class label based on the subset of rules. The experimental results on performance study shows that CMAR is faster and has better classification accuracy than other classification methods. First phase: Rule Generation (1) It first scans the training data set and finds all the frequent item set. Then, sorts the attribute values in support descending order, F-list, and scans the training data again to construct a FP-tree. Third, based on the F-list, generates the subset of CARs without overlap. Fourth, prunes the FP-tree according to class label distribution. (2) Sort the rules in CR-tree. CR-tree is a compact tree structure. It explores potential sharing among rules and thus saves space since the rules that have common frequent items share the part of path. CR-tree can be used as an index of rules. Once a CR-tree is built, rule retrieval becomes efficient. (3) Prune the rules. Use general and high confidence rules to prune more specfic and lower confidence ones. Select only positively corrected rules. Select a subset of rules based on database coverage. Second phase: classification based on multiple rules. CMAR divides the rules into groups according to class labels. Compare the strength of the groups by measuring the combined effect of each group. Theoretically, it is hard to verify the effect of measures on strength of groups of rules, CMAR adopted weighted X2 measure. Parallel algorithm for mining association rules Several parallel mining algorithms for association rules exist in literatures. T. Shintani et al suggested a parallel algorithm for mining association rules with classification hierarchy on a shared-nothing parallel machine.[8] It partitions the candidate itemsets over the processors. It uses a hierarchic structure to minimize inter-processor communication. In the classification hierarchy, all the candidate itemsets whose root items are identical will be allocated to the identical node, which eliminates communication of the ancestor items. It identifies the frequently occurring candidate itemsets and copy them over all the processors, through which frequent itemsets can be processed locally without communication, which saves memory space. Distributed algorithms for data mining To solve the problem of learning from very large and distributed databases, some researches have proposed some distributed learning techniques. P. Chan et al proposed some meta-learning techniques in the JAM project.[9] Meta learning is a technique that combines independent classifiers generated from distributed databases into a single global classifier. The classifiers generated from individual databases are integrated so as to improve the overall predictive accuracy of the combined classifier. Recently, A. Lazarevic et al proposed a distributed boosting algorithm for classification rule mining.[10] Boosting uses adaptive sampling of patterns to generate an accurate ensemble of many weak classifiers whose individual global accuracy is moderate. First, the classifiers are learned from disjoint datasets at each boosting round, and then exchanged amongst the distributed sites. The exchanged classifiers are then combined, and finally their weighted voting ensemble is constructed on each disjoint data set. This algorithm can be applied to very large and distributed homogenous databases that cannot fit into the computer memory. We will develop a distributed algorithm for the associative classification by combing the existing associative classification algorithms such as CBA and CMAR and with distributed techniques. To improve the efficiency and scalability of the existing algorithm, parallel computing is necessary for huge databases or huge number of rules. To develop a distributed algorithm for very large distributed and heterogeneous databases is still a challenging task. Association rules are global entities. In other words, generating an accurate classifier requires information from each process. To make an efficient parallel algorithm, interprocess communication has to be minimized. Reference: 1. Data Mining: Concepts and Techniques, Jiawei han and Micheline Kamber, Morgan Kaufmann Publishers, San Francisco, CA, 2000. 2. Rakesh Agrawal, Ramakrishnan Srikant: “Fast Algorithms for Mining Association Rules in Large Databases.” VLDB 1994. 3. J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation'', SIGMOD, 2000. 4. Bing Liu, Wynne Hsu, Yiming Ma, "Integrating Classification and Association Rule Mining." KDD-98, 1998. 5. Bing Liu, Yiming Ma, C-K Wong, "Classification Using Association Rules: weaknesses and Enhancements." To appear in Vipin Kumar, et al, (eds), Data mining for scientific applications, 2001. 6. Bing Liu, Yiming Ma, Ching Kian Wong, "Improving an Association Rule Based Classifier", PKDD-2000, 2000. 7. W. Li, J. Han, and J. Pei, “CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules”, ICDM'01, 2001. 8. Takahiko Shintani and Masaru Kitsuregawa “Parallel Mining Algorithm for Generalized Association Rules with Classification Hierarchy”, SIGMOD, 1998. 9. “Meta-Learning in Distributed Data Mining Systems: Issues and Approaches”, Andreas L. Prodromidis, Salvatore J. Stolfo and Philip Chan, "Advances of Distributed Data Mining" book, editors Hillol Kargupta and Philip Chan, AAAI press, August 2000. 10. Aleksandar Lazarevic and Zoran Obradovic “The Distributed Boosting Algorithm”, KDD01, 2001.