Implementation of classifier tool in Twister Magesh Vadivelu, Shivaraman Janakiraman {magevadi, shivjana}@indiana.edu School of Informatics and Computing Indiana University Bloomington Abstract Mining frequent item-sets from large-scale databases has emerged as an important problem in the data mining and knowledge discovery research community. To overcome this problem, we have proposed to implement Apriori algorithm, a classification algorithm, in Twister, a distributed framework, that makes use of MapReduce. MapReduce is a programming model and an associated implementation for processing and generating large data sets. We specify a map function that processes a keyvalue pair to generate a set of intermediate keyvalue pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines. Our implementation of apriori algorithm in runs on a large cluster of machines and is highly scalable. On an application level, we can use this apriori algorithm to identify the pattern in which customers buy products in a supermarket. Apriori Property: Any subset of frequent itemset must be frequent. Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1with itself. With Hadoop [7], programmers can more focus on the business logic of their application without care too much about fault tolerance, task scheduling, and workload balancing issues. However Hadoop do not have built-in support for iterative programs, which is a common approach for many applications: K-means, PageRank, Markov chain. Twister [9] is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. It uses a publish/subscribe messaging infrastructure for communication and data transfers, and supports long running map/reduce tasks, which can be used in ‘configure once and use many times’ approach. These improvements allow Twister to support iterative MapReduce computations highly efficiently compared to other MapReduce runtimes. . Apriori Algorithm Key Words: Twister, MapReduce, Parallel Computing, Apriori Algorithm. Introduction The Apriori Algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. Frequent Itemsets: The sets of item which has minimum support (denoted by Li for ith -Itemset). Find the frequent itemsets: the sets of items that have minimum support. A subset of a frequent itemset must also be a frequent itemsets i.e., if {AB} isa frequent itemset, both {A} and {B} should be a frequent itemset. Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset). Use the frequent itemsets to generate association rules. Architecture Sp = T1/Tp Where: p is the number of processors T1 is the execution time of the sequential algorithm Tp is the execution time of the parallel algorithm with p processors References Implementation Timeline Time Task Week 11/07/2011 Discuss the algorithm and design the coding methodology sequentially Week 11/14/2011 Complete coding the algorithm sequentially Week 11/21/2011 Complete coding the algorithm sequentially Week 11/28/2011 Discuss the design an implementation in twister Project Review Discuss the design with partial implementation in twister Week 12/05/2011 Complete the implementation in twister Week 12/12/2011 Do validation Project Review Presentation Validation The performance of the application in Twister framework is validated by determining speed-up while running the application in twister. This is done by drawing a speed-up chart between the input data-set and the speed-up. The speed-up is calculated as: [1] J. Han, J. Pei and Y. Yin, “Mining frequent patterns without candidate generation,” in ACM SIGMOD Int. Conf. on Management of Data,2000 [2] S.K. Tanbeer, C.F. Ahmed, B-S Jeong and Y-K Lee, “Efficient single-pass frequent pattern mining using a prefix-tree,” Information Sciences,2009 [3] G. Grahne and J. Zhu, “Fast algorithms for frequent itemset mining using FP-Trees,” IEEE Trans. On Knowledge and Data Engineering, 2005 [4] R. Agrawal, T. Imielinski and A.N. Swami, “Mining association rules between sets of items in large databases,” in ACM SIGMOD Int. Conf. on Management of Data, 1993 [5]http://labs.google.com/papers/mapreduceosdi04.pdf [6]http://developer.yahoo.com/hadoop/tutorial/ module3.html [7] Hadoop official website http://hadoop.apache.org/ [8] http://www.michael-noll.com/ [9]Twister Official website: http://www.iterativemapreduce.org/