advertisement

Fault Tolerant Weighted Frequent Itemsets Mining using WIT-Trees Ambily Mohan1, Uday Babu P.2,Visakh R.3 Department Of Computer Science & Engineering Rajagiri School of Engineering & Technology Cochin, India ambili.mohan@gmail.com1 udaybabup@gmail.com2 visakh_vishnu@gmail.com3 Abstract-Data mining is the collection of techniques for the creative, automatic discovery of previously unknown, fitting, new, helpful and understandable patterns in large databases. Frequent pattern mining has emerged as a vital task in data mining. Frequent patterns are those that occur frequently in a data set. In traditional frequent pattern mining, patterns and items within the patterns are treated the same with the assumption that all items have same significance. However, real items have different importance as some of the items are more beneficial in terms of profit. This thought resulted in assigning weights to the items thereby taking benefits into account. Real world data tends to be diverse and dirty as it may be inferred with noise. This calls for fault tolerant mining over large data. In this paper, a new method for mining fault tolerant frequent patterns is proposed Keywords—Data Mining, Frequent Patterns, Fault Tolerant Patterns, Weighted Items I. INTRODUCTION Data mining [1] is known for discovering previously unknown, valid, new, fitting, useful and understandable patterns in large databases. Due to the availability of huge amount of data and the need to discover knowledge from such data, data mining has grown to be the most widely used technique. Thus data mining can be used for applications ranging from market analysis, fraud detection and customer retention, to production control and science exploration. Frequent pattern mining has become an important task in data mining. Frequent patterns are those patterns that occur frequently in a data set. For example, a set of items, such as milk and egg usually appears frequently together in a transaction data set. So it is treated as a frequent itemset. Frequent pattern mining therefore can be used for mining associations, correlations, and many other interesting relationships among data. Market basket analysis is one such application of frequent pattern mining. In this process, the retailers will analyze customer’s buying habits in terms of purchased items to find out the items that are frequently bought together. As a result they can develop several marketing strategies that will help them in improving their sales. For instance, if customers are buying milk, how likely they will also buy bread on the same trip to the market? Extracting such information can lead to increased sales. Frequent pattern mining algorithms like Apriori [2], FPgrowth [3] are simply based on the count of the occurrence of item sets in a transactional dataset. The support for an itemset is the number of transactions that contain the itemset. A pattern is said to be frequent if its count is above a minimum support threshold. Above all it treats all the items the same regardless of their significance in real application. Real data have different importance as some of the items are more beneficial in terms of the profit they yield per unit sale. For example, sale of egg may incur a profit of 20 cents, while a bottle of milk might incur a profit of 40 cents. Even if latter holds a low support value, it will be more beneficial than the other in terms of the profit. Traditional frequent pattern mining algorithms ignored this difference. With this thought, the idea of assigning weights to the items came into existence. In weighted approach, a weight factor is assigned to each of the items present in the transactions based on their benefits. The concept of weighted support is used instead of support used in traditional pattern mining which was simply the count of occurrence of item sets in each transaction. Weighted support calculated by making use of the weights of items resulted in the selection of important patterns. An item set is significant if its weighted support is above a minimum weighted support. Fault tolerant frequent pattern mining [4] deals with the discovery of more general, interesting and useful knowledge. For example, consider the rule R1 to study student's performance in course: Though R1 has high prediction accuracy, this rule covers only a very small subset of cases since there may not be that many students who are well versed in all four subjects to be good in data mining. Another rule R2 can be formulated as: R2=”A student good in at least three out of four courses: (data structure, algorithm, AI and DBMS), is also good at data mining". This also has high confidence. Rule R2 is faulttolerant as it requires the data to match only part of its left hand side and it is based on a fault tolerance factor δ. Instead of looking for exact patterns in data, approximate and more general patterns are looked upon. A. Related Work Zi-guo Huai and Ming-he Huang in [5] proposed an effective algorithm for mining weighted frequent itemsets. It was based on the hash table data structure. This technique could deal with the problems that arise when weight value of an item is changed and also when the database is changed. its weighting attribute denoted as w(i) = f(a) where A = a1 , a 2 , . . , a n are the attributes used for calculating weights. Given a Database that has 6 transactions and five items I = {A, B, C, D, E} in Table 1. Table 1. Approximate weighted frequent pattern mining was proposed by Unil Yun and Keun Ho Ryu in [6]. They proposed this concept thinking that real applications would contain huge amount of data, and inherent noise in the data would result in its inaccurate processing. Noisy data due to uncertainty of processing data and measurement errors can have negative effects on mining results. Result sets will be affected greatly even if there is only a small change in items weight or items support in noisy environment. As a result, an approximate factor was proposed that relaxed the requirement of weighted supports being exactly matched with the minimum threshold. This led to the discovery of approximate weighted frequent patterns. WAF (approximate weighted frequent pattern mining) algorithm is used to extract robust important patterns that are not affected by noise. Weighted frequent item set mining with a weight range and a minimum weight [7] proposed by Unil Yun and John J. Leggett focused on pushing weight constraints into pattern growth algorithms while maintaining the downward closure property. By adjusting the minimum weight and a weight range, WFIM generates more brief and important weighted frequent itemsets. It can be applied to dense databases with a low minimum support. Items are given different weights within the weight range. The number of weighted frequent items can be adjusted using different weight ranges. Bay Vo, F Coenen and Bac Le in [8] came up with an algorithm for fast mining of frequent weighted itemsets from weighted transaction databases. This algorithm was based on WIT-trees (Weighted Itemset-Tidset tree) data structure. J Pei, H. Tung and J. Han proposed fault tolerant frequent pattern mining, which is a fruitful direction for future data mining research. They have developed FT-Apriori for efficient fault tolerant frequent pattern mining. II. PROBLEM DEFINITION Let I = i1 , i2 , . . , in be a set of items. Let D be a transactional database which denotes a set of database transactions where each transaction T is a set of items such that T ⊆ I. Each transaction has an identifier associated with it, called TID. Consider an itemset X such that X ⊆ I. A transaction T is said to contain X if X ⊆ T. A set of items forms an itemset. An itemset that has k items in it is referred to as a k-itemset. The set {computer, antivirus_ software} is a 2itemset. A set of non negative real numbers is denoted by W. A pair (x,w) is called a weighted item where w is the weight associated with x. Item weight [9] denoted by w (i) is the value of weight factor assigned to an item based on its significance. Weighting attributes such as items price in a supermarket to visitor page dwelling time in a web log mining domain are used for the calculation of weights. Therefore, item weight is a function of Transaction Database 1 A,B,D,E 2 B,C,E 3 A,B,D,E 4 A,B,C,E 5 A,B,C,D,E,F 6 B,C,D The weights associated with the items are given in Table 2. Table 2. Weighted items Table A 0.6 B 0.1 C 0.3 D 0.9 E 0.2 The transaction weight (tw) of a transaction tk is given by (1). tw(t k ) = ∑ij ∈tk wj |t k | (1) The weighted support of an itemset is given by (2). ws(X) = ∑tk∈t(X) tw(t k ) ∑tk ∈T tw(t k ) (2) where T is the list of transactions in the database. Weighted support computed above satisfies the downward closure property. i.e., if X ⊆ Y then ws(X) ≥ ws(Y ) according to the properties of Galois Connection [10]. A. Fault Tolerant Frequent Patterns Given a fault tolerance δ (δ>0). Let P be an itemset such that|๐| > ๐ฟ. A transaction T = (TID, X) is said to FT-contain an itemset P iff there exists P ′ ⊆ P such that P ′ ⊆ X and |๐′| ≥ ฬ represents the set of transactions FT(|๐| − ๐ฟ). Let B(X) containing an itemset X. It is termed as FT-body of X. The fault tolerant weighted support is given by ฬ= ๐ค๐ (๐) ∑tk∈B(X) ฬ tw(t k ) ∑tk∈T tw(t k ) (3) An itemset X is called a Fault Tolerant Frequent pattern iff 1. 2. ฬ ≥ min_wsup FT and ๐ค๐ (๐) item for each item x ∈ ๐, wsupB(X) ฬ ≥ min _wsup where wsupB(X) ฬ = ∑t ∈B(X)โx∈t tw(tk ) ฬ k k ∑t ∈B(X) ฬ tw(tk ) ( 4) k Fault tolerant patterns always contains only global frequent items whose ws > min_wsupitem. Fault tolerant pattern mining contributes to longer and more frequent patterns. III. METHODOLOGY This algorithm uses a WIT-trees (Weighted Itemset-Tidset tree) data structure which represents the input data. Each node of the WIT-tree includes two fields: 1) X: an itemset 2) B(X): the set of transactions that FT contains X. The root node of the WIT-tree contains all global frequent Items whose ws(X) > min_wsupitem. These items form the Level 1 nodes of the tree. Each level 1 node belongs to the same equivalence class with prefix {} or [วพ] and will become a new equivalence class by using its item as the prefix and it will join with all nodes following it to create a new equivalence class. New equivalence classes in higher levels can be created by running this process recursively. All item sets satisfy downward closure property. Downward closure property helps in pruning an equivalence class if its ๐ค๐ ฬ value does not satisfy min_wsupFT. Given a fault tolerance δ. Only those patterns whose length is greater than fault tolerance factor δ and those that satisfy the fault tolerant conditions will be treated as an FT-pattern. For example, if value of δ=1, then the nodes representing 2-itemset i.e. level 2 nodes and the nodes following it will be considered for fault tolerance test. Level 1 nodes are not fault tolerant as their length is not greater than δ value which is 1 in this example. Only those patterns that satisfy both conditions 1 and 2 will be fault tolerant and others are pruned. Thus WIT-tree representing fault tolerant frequent patterns can be generated. This algorithm requires only single scan of the database as it is based on the intersection of Tid-sets. Thus it runs faster. IV. PERFORMANCE EVALUATION V. This algorithm is implemented using python. The experiments were conducted on a pc with an Intel(R) Pentium(R) Dual CPU 2.16 Ghz and 2.00 GB memory. The datasets were downloaded from http://fimi.cs.helsinki.fi/data/to perform the test. This method can find much longer patterns than the algorithm based on wit-trees [8] using similar support thresholds. This means that, it can find more general knowledge. The length of the longest pattern that this method generates is upto twice that is found by [8]. It can be seen that as min_wsupitem is low, the number of patterns increases exponentially and the run time increases as well. This method scales better when min_ wsupFT is high as it prunes the FTpatterns. CONCLUSION An efficient method for the discovery of fault tolerant frequent weighted itemsets is proposed. Fault tolerant pattern mining deals with the discovery of more general, interesting and useful knowledge. This method requires only single scan of the database which is made possible by taking the intersection of TID-sets. It also results in faster computation of weighted support. This algorithm makes use of WIT-tree data structure. Weight value assigned to the items helps in mining important itemsets as real data have different significance. REFERENCES [1] J. Han and M. Kamber, Data Mining: Concepts and Techniques (San Francisco, CA, Morgan Kaufmann Publishers, 2004). [2] [3] [4] [5] [6] [7] [8] Agrawal and R. Srikant, Fast algorithms for mining association rules , Proc. 20th Intâ€™l Conf. Very Large Data Base(VLDB â€˜94), pp.207216,1994. J. Han, J. Pie and Y. Yin, Mining frequent patterns without candidate generation, Proc. ACM SIGMOD, 1995. J. Pei, A. K. H. Tung, and J. Han, â€œFault-Tolerant Frequent Pattern Mining: Problems and Challenges â€•, Proceedings of ACM-SIGMOD International Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'01), Santa Barbara, CA, May 2001. Z. Huai and M. Huang, A weighted frequent itemsets incremental updating algorithm base on hash table, IEEE,2011. [6]Unil Yun and K. H. Ryu, Approximate weighted frequent pattern mining with/without noisy environments, Knowledge-Based Systems 24 (2011) 73-82. U. Yun, J.J. Leggett, WFIM: weighted frequent itemset mining with a weight range and a minimum weight, in: Proceedings of the Fourth SIAM International Conference on Data Mining, CA, USA, 2005, pp. 636-640. Bay Vo, F. Coenen and Bac Le, A new method for mining frequent weighted itemsets based on WIT-trees, Expert Systems with Applications 40 (2013) 1256-1264. [9] [10] [11] [12] [13] [14] [15] Tao, F. Murtagh and M. Farid, Weighted association rule mining using weighted support and significance framework. In: SIGKDDâ€™03(pp.661-666). [10]Zaki, M. J. (2004). Mining non-redundant association rules. Data Mining and Knowledge Discovery, 9(3), 223-248. C. H. Cai, A. W. Fu, C.H. Cheng, W.W. Kwong, Mining association rules with weighted items, in: Proceedings of International Database Engineering and Applications Symposium, IDEAS 98, Cardiff, Wales, UK, 1998, pp.68-77. G. Grahne, J. Zhu, Fast algorithms for frequent itemset mining using FPtrees, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 1347-1362. R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases. in: Proceedings of 12th ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, 1993, pp. 207-216. Li Tong Yan and Chen Chao, Efficient mining of weighted frequent itemsets using MLWFI, IEEE 2011. Preetham Kumar and Ananthnarayana V S, Parallel method for discovering frequent itemsets using weighted tree approach, International Conference on Computer Engineering and Technology, 2009.. 3, no. 4, pp. 262–294, 2000