International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 9 – Sep 2013
ISSN: 2231-5381, http://www.ijettjournal.org, Pages 4046-4051

Mine High Utility Itemset using UP-Tree and FP-Growth

NS JAGADEESH#1, B JYOTHSNA*2, KN DHARANIDHAR#3, A ANANTHA BIPIN*4
# Assistant Professor, Dept of CSE, Kuppam Engineering College, Kuppam, India.

Abstract: Data mining is the process of extracting new, non-trivial, previously unknown and potentially useful information from large databases. Traditional mining techniques have focused largely on detecting statistical correlations between the items that occur most frequently in transaction databases, a task termed frequent itemset mining. In this paper, we propose strategies for UP-Growth from the emerging area of utility mining, which considers not only the frequency of itemsets but also the utility associated with them. The term utility refers to the importance or usefulness of an itemset in transactions, quantified in terms such as profit, sales or any other user preference. The objective is to identify itemsets whose utility values are above a given utility threshold, using the pattern growth methodology to mine the set of utility patterns.

Keywords: candidate pruning, frequent itemset, high utility itemset, utility mining, UP-tree, FP-Growth.

I. INTRODUCTION

Discovering useful patterns hidden in a database plays an essential role in several data mining tasks, such as frequent pattern mining, weighted frequent pattern mining and high utility pattern mining. Among them, frequent pattern mining is a fundamental research topic that has been applied to different kinds of databases, such as transactional databases. It is used in the analysis of customer transactions in retail research, where it is termed market basket analysis, and it has also been used to identify the purchase patterns of consumers.

Over the last two decades data mining has emerged as a significant research area. This is primarily due to the interdisciplinary nature of the subject and the diverse range of application domains in which data mining based products and techniques are being employed. These include bioinformatics, genetics, medicine, clinical research, education, retail and marketing research. Data mining is the process of revealing previously unknown and potentially useful information from large databases. The primary goal is to discover hidden patterns and unexpected trends in the data. The term is frequently misused to mean any form of large-scale data or information processing. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns. Data mining activities use a combination of techniques from database technologies, statistics, artificial intelligence and machine learning.

II. LITERATURE SURVEY

Extensive studies have been proposed for mining frequent patterns [1, 2, 3, 4, 6]. Among the issues of frequent pattern mining, the most famous are association rule mining [1, 3, 4, 6] and sequential pattern mining. One of the well-known algorithms for mining association rules is Apriori [1], which is the pioneer for efficiently mining association rules from large databases. Pattern growth based association rule mining algorithms [4, 6] such as FP-Growth [4] were proposed afterward. It is widely recognised that FP-Growth achieves better performance than Apriori based algorithms, since it finds frequent itemsets without generating any candidate itemset and scans the database just twice.

Frequent Itemset Mining

An itemset can be defined as a non-empty set of items. An itemset with k different items is termed a k-itemset. For example,
{bread, butter, milk} may denote a 3-itemset in a supermarket transaction. The notion of frequent itemsets was introduced by Agrawal et al. [1]. Frequent itemsets are the itemsets that appear frequently in the transactions. The goal of frequent itemset mining is to identify all the frequent itemsets in a transaction dataset [6]. Frequent itemset mining plays an essential role in the theory and practice of many important data mining tasks, such as mining association rules [1, 2], long patterns [5], emerging patterns and dependency rules. It has been applied in the fields of telecommunications [3], census analysis [6] and text analysis.

The criterion of being frequent is expressed in terms of the support value of an itemset. The support value of an itemset is the percentage of transactions that contain the itemset.

1) EXAMPLE 1:
Consider a small example of a transaction database representing the sales data and the profit associated with the sale of each unit of the items. Table I presents the sales figures for three items (Item A, Item B and Item C) over ten transactions. Each cell gives the number of units of an item sold in that transaction. Table II gives the unit profit associated with the sale of the individual items.

TABLE I
TRANSACTION DATABASE
(Quantity of each item sold in each transaction)

Transaction ID   Item A   Item B   Item C
T1               2        0        1
T2               4        0        2
T3               4        1        0
T4               0        1        1
T5               5        1        2
T6               10       1        5
T7               4        0        2
T8               1        0        0
T9               3        0        0
T10              5        0        0

TABLE II
UNIT PROFIT ASSOCIATED WITH ITEMS

Item Name   Unit Profit (in USD)
Item A      5
Item B      100
Item C      40

Now consider the itemset AB. Since only 3 transactions (T3, T5 and T6) out of the overall 10 transactions contain this itemset, its support is

Support(AB) = 3 / 10 * 100 = 30 %

Since T3 contains 4 units of item A and 1 unit of item B, the profit earned by the sale of the itemset AB in transaction T3 is given by

profit(AB, T3) = 4 * profit(A) + 1 * profit(B) = 4*5 + 1*100 = 120

Since AB appears in transactions T3, T5 and T6, the total profit associated with itemset AB over the complete set of 10 transactions is

Profit(AB) = profit(AB, T3) + profit(AB, T5) + profit(AB, T6) = (4*5 + 1*100) + (5*5 + 1*100) + (10*5 + 1*100) = 395

Similarly, we can calculate the support values of the other itemsets and the profit obtained by their sale over all ten transactions, as indicated in Table III.

TABLE III
SUPPORT AND PROFIT FOR ALL ITEMSETS

Itemset   Support (%)   Profit (USD)
A         90            190
B         40            400
C         60            520
AB        30            395
AC        50            605
BC        30            620
ABC       20            555

If we consider a minimum support of 40 %, we observe that 4 itemsets (A, B, C and AC) qualify as frequent itemsets because their support is at least the minimum support threshold. But if we consider the associated profit, we find that out of the 4 most profitable itemsets (C, AC, BC and ABC) only two are also frequent. Itemsets BC and ABC are not frequent, yet they fetch more profit than some of the frequent itemsets such as A or B. This is inherently because of the variation in the unit profits of the items. As we can see, one unit of item B fetches much more profit than one unit of item A or item C.

This example illustrates the fact that the frequent itemset mining approach may not always satisfy a sales manager's goal. In this case the support measure of the itemsets reflects the statistical correlation of items, but it does not reflect their semantic significance, which in this example is the associated profit.

In reality a retail business may be interested in identifying its most valuable customers (customers who contribute a major fraction of the profits to the business). These are the customers who may buy full priced items or high margin items, which may be absent from a large number of transactions because most customers do not buy these items frequently.

Utility Mining

The limitations of frequent or rare itemset mining motivated researchers to conceive a utility based mining approach, which allows a user to conveniently express his or her perspectives concerning the usefulness of itemsets as utility values and then find itemsets with utility values higher than a threshold. In utility based mining the term utility refers to the quantitative representation of user preference, i.e. the utility value of an itemset is the measurement of the importance of that itemset from the user's perspective. For example, if a sales analyst involved in some retail research needs to find out which itemsets earn the maximum sales revenue for the stores, he or she will define the utility of any itemset as the monetary profit that the store earns by selling each unit of that itemset.

Note that the sales analyst is not interested in the number of transactions that contain the itemset; he or she is only concerned with the revenue generated collectively by all the transactions containing the itemset. In practice the utility value of an itemset can be profit, popularity, page-rank, a measure of some aesthetic aspect such as beauty or design, or some other measure of the user's preference.

Formally, an itemset S is useful to a user if it satisfies a utility constraint, i.e. any constraint of the form u(S) >= min_util, where u(S) is the utility value of the itemset and min_util is a utility threshold defined by the user [32]. In our example, if we take the utility of an itemset to be the profit associated with its sale and set the utility threshold min_util = 500, then the itemset ABC, with a utility value of 555, is of interest to the user even though its support value is just 20%. Since the total utility of an itemset S is obtained by multiplying the utility values of the individual items of S by the corresponding frequencies of those items in the transactions that contain S, the utility based mining approach can be said to measure the significance of an itemset along two dimensions. The first dimension is the support value of the itemset, i.e. its frequency, and the second dimension is the semantic significance of the itemset as measured by the user.

III. PROPOSED METHODS

The framework of the proposed method consists of two steps: (1) Scan the database twice to construct a global UP-Tree with the first two strategies (given in Subsection III.A). (2) Recursively generate potential high utility itemsets (abbreviated as PHUIs) from the global UP-Tree and local UP-Trees by UP-Growth+ with the last two strategies (given in Subsection III.B).

A. The Proposed Data Structure: UP-Tree

To improve mining performance and avoid scanning the original database repeatedly, we use a compact tree structure, named UP-Tree (Utility Pattern Tree), to maintain the information of transactions and high utility itemsets.
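As a quick check before the UP-Tree is described in detail, the support and profit figures of Example 1 (Tables I to III) and the utility constraint u(S) >= min_util can be reproduced by brute force. The following is a minimal sketch in Python; the names QUANTITIES, UNIT_PROFIT, support, utility and high_utility_itemsets are illustrative assumptions and are not part of the proposed method. Exhaustive enumeration of all itemsets, as done here, is the cost that the tree-based approach is meant to avoid.

```python
from itertools import combinations

# Table I: units of each item sold per transaction (0 means the item is absent).
QUANTITIES = {
    "T1": {"A": 2, "B": 0, "C": 1},
    "T2": {"A": 4, "B": 0, "C": 2},
    "T3": {"A": 4, "B": 1, "C": 0},
    "T4": {"A": 0, "B": 1, "C": 1},
    "T5": {"A": 5, "B": 1, "C": 2},
    "T6": {"A": 10, "B": 1, "C": 5},
    "T7": {"A": 4, "B": 0, "C": 2},
    "T8": {"A": 1, "B": 0, "C": 0},
    "T9": {"A": 3, "B": 0, "C": 0},
    "T10": {"A": 5, "B": 0, "C": 0},
}

# Table II: unit profit in USD.
UNIT_PROFIT = {"A": 5, "B": 100, "C": 40}


def transactions_containing(itemset):
    """IDs of the transactions in which every item of the itemset was sold."""
    return [tid for tid, qty in QUANTITIES.items()
            if all(qty[item] > 0 for item in itemset)]


def support(itemset):
    """Percentage of transactions that contain the itemset."""
    return 100 * len(transactions_containing(itemset)) / len(QUANTITIES)


def utility(itemset):
    """Total profit of the itemset over all transactions that contain it."""
    return sum(QUANTITIES[tid][item] * UNIT_PROFIT[item]
               for tid in transactions_containing(itemset)
               for item in itemset)


def high_utility_itemsets(min_util):
    """Brute-force search for every itemset S with u(S) >= min_util."""
    items = sorted(UNIT_PROFIT)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            u = utility(combo)
            if u >= min_util:
                result["".join(combo)] = u
    return result


if __name__ == "__main__":
    for name in ("A", "B", "C", "AB", "AC", "BC", "ABC"):
        print(name, support(name), utility(name))
    print(high_utility_itemsets(500))
```

Itemsets are passed as strings of single-letter item names (for example "AB"), which is only convenient because the example has three one-letter items.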
Two strategies are applied to minimise the overestimated utilities stored in the nodes of the global UP-Tree. In the following subsections, the elements of UP-Tree are first defined. Next, two strategies for decreasing the overestimated utility of each item during the construction of a global UP-Tree are introduced.

1) The Elements in UP-Tree
In a UP-Tree, each node N consists of N.name, N.count, N.nu, N.parent, N.hlink and a set of child nodes. N.name is the node's item name. N.count is the node's support count. N.nu is the node's node utility, i.e., the overestimated utility of the node. N.parent records the parent node of N. N.hlink is a node link which points to a node whose item name is the same as N.name.

A table named the header table is employed to facilitate the traversal of the UP-Tree. In the header table, each entry records an item name, an overestimated utility, and a link. The link points to the last occurrence of the node which has the same item as the entry in the UP-Tree. By following the links in the header table and the nodes in the UP-Tree, the nodes having the same name can be traversed efficiently.

2) Strategy DGU: Discarding Global Unpromising Items
The construction of a global UP-Tree can be performed with two scans of the original database. In the first scan, the Transaction Utility (abbreviated as TU) of each transaction is computed. At the same time, the Transaction-Weighted Utility (abbreviated as TWU) of each single item is accumulated. By the transaction-weighted downward closure (abbreviated as TWDC) property, an item and its supersets cannot be high utility itemsets if the item's TWU is less than the minimum utility threshold. An item is called a promising item if its TWU >= min_util; otherwise, it is called an unpromising item. Without loss of generality, an item is also called a promising item if its overestimated utility is no less than min_util, and an unpromising item otherwise. Under strategy DGU, unpromising items and their utilities are discarded from the transactions before the global UP-Tree is built in the second scan.

3) Strategy DGN: Decreasing Global Node Utilities
During the construction of the global UP-Tree, the global node utilities can be decreased by the actual utilities of descendant nodes. By applying strategy DGN, the utilities of the nodes that are closer to the root of a global UP-Tree are further reduced. DGN is especially suitable for databases containing many long transactions. In other words, the more items a transaction contains, the more utilities can be discarded by DGN. In contrast, the traditional TWU mining model is not suitable for such databases, since the more items a transaction contains, the higher its TWU is.

B. The Proposed Mining Method: UP-Growth+

In UP-Growth+, minimal node utilities (abbreviated as MNUs) in each path are used to make the estimated pruning values closer to the real utility values of the pruned items in the database. The MNU of each node can be acquired during the construction of a global UP-Tree. First, we add an element, namely N.mnu, to each node of the UP-Tree. N.mnu is the minimal node utility of N. When N is traced, N.mnu keeps track of the minimal value of N.name's utility in different transactions. If N.mnu is larger than u(N.name, Tcurrent), N.mnu is set to u(N.name, Tcurrent).

Fig. 1 A block diagram of the proposed system

1) Strategy ENU: Eliminating local unpromising items and their estimated Node Utilities from the paths and path utilities
ENU can be regarded as a local version of DGU. It provides a simple but useful scheme for reducing overestimated utilities locally without an extra scan of the original database.
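The global construction steps described in Subsections III.A and III.B (the node elements, the header table, DGU, DGN and the N.mnu bookkeeping) can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the names Node, build_global_up_tree, DB and PROFIT are hypothetical, the data reuse Tables I and II with absent items omitted, items are assumed to be inserted in descending TWU order, and DGN is read as adding to each node only the utility of its own item plus the items before it in the reorganised transaction (so the utilities of descendant items are left out).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)
class Node:
    """One UP-Tree node with the elements N.name, N.count, N.nu, N.mnu, N.parent, N.hlink."""
    name: Optional[str]
    count: int = 0
    nu: int = 0                      # node utility (overestimated utility)
    mnu: Optional[int] = None        # minimal node utility seen so far
    parent: Optional["Node"] = None
    hlink: Optional["Node"] = None   # another node with the same item name
    children: Dict[str, "Node"] = field(default_factory=dict)


# Tables I and II; items with quantity 0 are simply left out (assumed encoding).
DB = {
    "T1": {"A": 2, "C": 1}, "T2": {"A": 4, "C": 2}, "T3": {"A": 4, "B": 1},
    "T4": {"B": 1, "C": 1}, "T5": {"A": 5, "B": 1, "C": 2},
    "T6": {"A": 10, "B": 1, "C": 5}, "T7": {"A": 4, "C": 2},
    "T8": {"A": 1}, "T9": {"A": 3}, "T10": {"A": 5},
}
PROFIT = {"A": 5, "B": 100, "C": 40}


def build_global_up_tree(db, profit, min_util):
    # First scan: transaction utility (TU) and transaction-weighted utility (TWU).
    twu = {}
    for items in db.values():
        tu = sum(qty * profit[item] for item, qty in items.items())
        for item in items:
            twu[item] = twu.get(item, 0) + tu

    # DGU: only promising items (TWU >= min_util) are kept.
    promising = {item for item, value in twu.items() if value >= min_util}

    root = Node(None)
    header = {}  # item name -> most recently created node for that item

    # Second scan: insert each reorganised transaction into the tree.
    for items in db.values():
        kept = sorted((i for i in items if i in promising),
                      key=lambda i: (-twu[i], i))  # assumed ordering
        node, prefix_utility = root, 0
        for item in kept:
            u = items[item] * profit[item]
            prefix_utility += u  # DGN: utilities of later items are excluded
            child = node.children.get(item)
            if child is None:
                child = Node(item, parent=node, hlink=header.get(item))
                node.children[item] = child
                header[item] = child
            child.count += 1
            child.nu += prefix_utility
            # MNU rule: keep the smallest utility of this item along the path.
            child.mnu = u if child.mnu is None else min(child.mnu, u)
            node = child
    return root, header, twu


if __name__ == "__main__":
    root, header, twu = build_global_up_tree(DB, PROFIT, min_util=900)
    print(twu)
    for name, child in root.children.items():
        print(name, child.count, child.nu, child.mnu)
```

With the hypothetical threshold min_util = 900, chosen only so that one item is pruned, item B has a TWU of 815 and is discarded by DGU, while A (TWU 970) and C (TWU 945) remain. The node for A directly under the root then holds count 9, node utility 190 and minimal node utility 5.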
2) Strategy DNN: Decreasing local Node utilities for nodes of local UP-Tree by estimated utilities of descendant Nodes
DNN can likewise be regarded as a local version of DGN, described in the earlier section. With these two strategies, the overestimated utilities of itemsets can be reduced locally to a certain degree without losing any actual high utility itemset.

IV. CONCLUSION

In this paper, we have presented strategies for UP-Growth that use a tree structure to store essential information about frequent patterns for mining high utility itemsets. We have used the concepts of standard frequent itemset mining to mine the complete set of frequent patterns by means of pattern growth. Higher efficiency in mining high utility patterns can be realized by implementing two important concepts: one is the construction of the UP-Tree, and the other is the mining of utility itemsets from the UP-Tree. The proposed UP-Tree based pattern mining uses the pattern growth method to avoid the costly generation of a large number of candidate sets and to reduce the search space dramatically.

REFERENCES

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. of the 20th VLDB Conf., pp. 487-499, 1994.
[2] R. Agrawal and R. Srikant, "Mining Sequential Patterns," in Proc. of the 11th Int'l Conference on Data Engineering, pp. 3-14, Mar. 1995.
[3] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," in Proc. of the 21st VLDB Conf., pp. 420-431, Sep. 1995.
[4] J. Han, J. Pei and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. of the ACM-SIGMOD Int'l Conf. on Management of Data, pp. 1-12, 2000.
[5] V. S. Tseng, C. J. Chu and T. Liang, "Efficient Mining of Temporal High Utility Itemsets from Data streams," in Proc. of ACM KDD Workshop on Utility-Based Data Mining (UBDM'06), USA, Aug. 2006.
[6] R. Martinez, N. Pasquier and C. Pasquier, "GenMiner: mining non-redundant association rules from integrated gene expression data and annotations," Bioinformatics, Vol. 24, pp. 2643-2644, 2008.
[7] S. J. Yen, Y. S. Lee, C. K. Wang, C. W. Wu and L.-Y. Ouyang, "The studies of mining frequent patterns based on frequent pattern tree," in Proc. of the 13th PAKDD, LNCS, Vol. 5476, pp. 232-241, 2009.

AUTHORS DESCRIPTION

N.S. Jagadeesh is currently working as Assistant Professor in Kuppam Engineering College, Kuppam. He received B.Tech (Information Technology) and M.Tech (Computer Science and Engineering) from JNTU, Anantapur. His research interest areas are Data Warehousing and Mining & Software Engineering.

B. Jyothsna is currently working as Assistant Professor in Sir Vishveshwaraiah Institute of Science & Technology, Madanapalle. She received B.Tech and M.Tech (Computer Science and Engineering) from JNTU, Anantapur. Her research interest areas are Data Warehousing and Mining & Software Engineering.

KN Dharanidhar is currently working as Assistant Professor in Kuppam Engineering College, Kuppam. He received B.Tech (Information Technology) and M.Tech (Computer Science and Engineering) from JNTU, Anantapur. His research interest areas are Data Warehousing and Mining & Mobile Computing.

A. Anantha Bipin is currently working as Assistant Professor in Kuppam Engineering College, Kuppam. He received B.E (Computer Science and Engineering) and M.E (Computer Science and Engineering) from Anna University, Chennai. His research interest areas are Data Warehousing and Mining & Networks.