View/Open

advertisement
Fault Tolerant Weighted Frequent Itemsets
Mining using WIT-Trees
Ambily Mohan1, Uday Babu P.2,Visakh R.3
Department Of Computer Science & Engineering
Rajagiri School of Engineering & Technology
Cochin, India
ambili.mohan@gmail.com1 udaybabup@gmail.com2 visakh_vishnu@gmail.com3
Abstract-Data mining is the collection of techniques for the
creative, automatic discovery of previously unknown, fitting, new,
helpful and understandable patterns in large databases. Frequent
pattern mining has emerged as a vital task in data mining.
Frequent patterns are those that occur frequently in a data set. In
traditional frequent pattern mining, patterns and items within the
patterns are treated the same with the assumption that all items
have same significance. However, real items have different
importance as some of the items are more beneficial in terms of
profit. This thought resulted in assigning weights to the items
thereby taking benefits into account. Real world data tends to be
diverse and dirty as it may be inferred with noise. This calls for
fault tolerant mining over large data. In this paper, a new method
for mining fault tolerant frequent patterns is proposed
Keywords—Data Mining, Frequent Patterns, Fault Tolerant
Patterns, Weighted Items
I.
INTRODUCTION
Data mining [1] is known for discovering previously
unknown, valid, new, fitting, useful and understandable
patterns in large databases. Due to the availability of huge
amount of data and the need to discover knowledge from such
data, data mining has grown to be the most widely used
technique. Thus data mining can be used for applications
ranging from market analysis, fraud detection and customer
retention, to production control and science exploration.
Frequent pattern mining has become an important task in data
mining. Frequent patterns are those patterns that occur
frequently in a data set. For example, a set of items, such as
milk and egg usually appears frequently together in a
transaction data set. So it is treated as a frequent itemset.
Frequent pattern mining therefore can be used for mining
associations, correlations, and many other interesting
relationships among data.
Market basket analysis is one such application of frequent
pattern mining. In this process, the retailers will analyze
customer’s buying habits in terms of purchased items to find
out the items that are frequently bought together. As a result
they can develop several marketing strategies that will help
them in improving their sales. For instance, if customers are
buying milk, how likely they will also buy bread on the same
trip to the market? Extracting such information can lead to
increased sales.
Frequent pattern mining algorithms like Apriori [2], FPgrowth [3] are simply based on the count of the occurrence of
item sets in a transactional dataset. The support for an itemset
is the number of transactions that contain the itemset. A pattern
is said to be frequent if its count is above a minimum support
threshold. Above all it treats all the items the same regardless
of their significance in real application. Real data have
different importance as some of the items are more beneficial
in terms of the profit they yield per unit sale. For example, sale
of egg may incur a profit of 20 cents, while a bottle of milk
might incur a profit of 40 cents. Even if latter holds a low
support value, it will be more beneficial than the other in terms
of the profit. Traditional frequent pattern mining algorithms
ignored this difference. With this thought, the idea of assigning
weights to the items came into existence.
In weighted approach, a weight factor is assigned to each of
the items present in the transactions based on their benefits.
The concept of weighted support is used instead of support
used in traditional pattern mining which was simply the count
of occurrence of item sets in each transaction. Weighted
support calculated by making use of the weights of items
resulted in the selection of important patterns. An item set is
significant if its weighted support is above a minimum
weighted support.
Fault tolerant frequent pattern mining [4] deals with the
discovery of more general, interesting and useful knowledge.
For example, consider the rule R1 to study student's
performance in course:
Though R1 has high prediction accuracy, this rule covers
only a very small subset of cases since there may not be that
many students who are well versed in all four subjects to be
good in data mining. Another rule R2 can be formulated as:
R2=”A student good in at least three out of four courses: (data
structure, algorithm, AI and DBMS), is also good at data
mining". This also has high confidence. Rule R2 is faulttolerant as it requires the data to match only part of its left hand
side and it is based on a fault tolerance factor δ. Instead of
looking for exact patterns in data, approximate and more
general patterns are looked upon.
A. Related Work
Zi-guo Huai and Ming-he Huang in [5] proposed an
effective algorithm for mining weighted frequent itemsets. It
was based on the hash table data structure. This technique
could deal with the problems that arise when weight value of
an item is changed and also when the database is changed.
its weighting attribute denoted as w(i) = f(a) where A = a1 ,
a 2 , . . , a n are the attributes used for calculating weights. Given a
Database that has 6 transactions and five items I = {A, B, C, D,
E} in Table 1.
Table 1.
Approximate weighted frequent pattern mining was
proposed by Unil Yun and Keun Ho Ryu in [6]. They proposed
this concept thinking that real applications would contain huge
amount of data, and inherent noise in the data would result in
its inaccurate processing. Noisy data due to uncertainty of
processing data and measurement errors can have negative
effects on mining results. Result sets will be affected greatly
even if there is only a small change in items weight or items
support in noisy environment. As a result, an approximate
factor was proposed that relaxed the requirement of weighted
supports being exactly matched with the minimum threshold.
This led to the discovery of approximate weighted frequent
patterns. WAF (approximate weighted frequent pattern mining)
algorithm is used to extract robust important patterns that are
not affected by noise.
Weighted frequent item set mining with a weight range and
a minimum weight [7] proposed by Unil Yun and John J.
Leggett focused on pushing weight constraints into pattern
growth algorithms while maintaining the downward closure
property. By adjusting the minimum weight and a weight
range, WFIM generates more brief and important weighted
frequent itemsets. It can be applied to dense databases with a
low minimum support. Items are given different weights within
the weight range. The number of weighted frequent items can
be adjusted using different weight ranges.
Bay Vo, F Coenen and Bac Le in [8] came up with an
algorithm for fast mining of frequent weighted itemsets from
weighted transaction databases. This algorithm was based on
WIT-trees (Weighted Itemset-Tidset tree) data structure.
J Pei, H. Tung and J. Han proposed fault tolerant frequent
pattern mining, which is a fruitful direction for future data
mining research. They have developed FT-Apriori for efficient
fault tolerant frequent pattern mining.
II.
PROBLEM DEFINITION
Let I = i1 , i2 , . . , in be a set of items. Let D be a
transactional database which denotes a set of database
transactions where each transaction T is a set of items such
that T ⊆ I. Each transaction has an identifier associated with it,
called TID. Consider an itemset X such that X ⊆ I. A
transaction T is said to contain X if X ⊆ T. A set of items
forms an itemset. An itemset that has k items in it is referred to
as a k-itemset. The set {computer, antivirus_ software} is a 2itemset. A set of non negative real numbers is denoted by W. A
pair (x,w) is called a weighted item where w is the weight
associated with x.
Item weight [9] denoted by w (i) is the value of weight
factor assigned to an item based on its significance. Weighting
attributes such as items price in a supermarket to visitor page
dwelling time in a web log mining domain are used for the
calculation of weights. Therefore, item weight is a function of
Transaction Database
1
A,B,D,E
2
B,C,E
3
A,B,D,E
4
A,B,C,E
5
A,B,C,D,E,F
6
B,C,D
The weights associated with the items are given in Table 2.
Table 2. Weighted items Table
A
0.6
B
0.1
C
0.3
D
0.9
E
0.2
The transaction weight (tw) of a transaction tk is given by
(1).
tw(t k ) =
∑ij ∈tk wj
|t k |
(1)
The weighted support of an itemset is given by (2).
ws(X) =
∑tk∈t(X) tw(t k )
∑tk ∈T tw(t k )
(2)
where T is the list of transactions in the database. Weighted
support computed above satisfies the downward closure
property. i.e., if X ⊆ Y then ws(X) ≥ ws(Y ) according to the
properties of Galois Connection [10].
A. Fault Tolerant Frequent Patterns
Given a fault tolerance δ (δ>0). Let P be an itemset such
that|๐‘ƒ| > ๐›ฟ. A transaction T = (TID, X) is said to FT-contain
an itemset P iff there exists P ′ ⊆ P such that P ′ ⊆ X and |๐‘ƒ′| ≥
ฬƒ represents the set of transactions FT(|๐‘ƒ| − ๐›ฟ). Let B(X)
containing an itemset X. It is termed as FT-body of X. The
fault tolerant weighted support is given by
ฬƒ=
๐‘ค๐‘ (๐‘‹)
∑tk∈B(X)
ฬƒ tw(t k )
∑tk∈T tw(t k )
(3)
An itemset X is called a Fault Tolerant Frequent pattern iff
1.
2.
ฬƒ ≥ min_wsup FT and
๐‘ค๐‘ (๐‘‹)
item
for each item x ∈ ๐‘‹, wsupB(X)
ฬƒ ≥ min _wsup
where
wsupB(X)
ฬƒ =
∑t ∈B(X)โ‹€x∈t
tw(tk )
ฬƒ
k
k
∑t ∈B(X)
ฬƒ tw(tk )
( 4)
k
Fault tolerant patterns always contains only global frequent
items whose ws > min_wsupitem. Fault tolerant pattern mining
contributes to longer and more frequent patterns.
III.
METHODOLOGY
This algorithm uses a WIT-trees (Weighted Itemset-Tidset
tree) data structure which represents the input data. Each node
of the WIT-tree includes two fields:
1) X: an itemset
2) B(X): the set of transactions that FT contains X.
The root node of the WIT-tree contains all global frequent
Items whose ws(X) > min_wsupitem. These items form the
Level 1 nodes of the tree. Each level 1 node belongs to the
same equivalence class with prefix {} or [วพ] and will become a
new equivalence class by using its item as the prefix and it will
join with all nodes following it to create a new equivalence
class. New equivalence classes in higher levels can be created
by running this process recursively. All item sets satisfy
downward closure property. Downward closure property helps
in pruning an equivalence class if its ๐‘ค๐‘ 
ฬƒ value does not satisfy
min_wsupFT.
Given a fault tolerance δ. Only those patterns whose length
is greater than fault tolerance factor δ and those that satisfy the
fault tolerant conditions will be treated as an FT-pattern. For
example, if value of δ=1, then the nodes representing 2-itemset
i.e. level 2 nodes and the nodes following it will be considered
for fault tolerance test. Level 1 nodes are not fault tolerant as
their length is not greater than δ value which is 1 in this
example. Only those patterns that satisfy both conditions 1 and
2 will be fault tolerant and others are pruned. Thus WIT-tree
representing fault tolerant frequent patterns can be generated.
This algorithm requires only single scan of the database as it is
based on the intersection of Tid-sets. Thus it runs faster.
IV.
PERFORMANCE EVALUATION
V.
This algorithm is implemented using python. The
experiments were conducted on a pc with an Intel(R)
Pentium(R) Dual CPU 2.16 Ghz and 2.00 GB memory. The
datasets were downloaded from http://fimi.cs.helsinki.fi/data/to
perform the test. This method can find much longer patterns
than the algorithm based on wit-trees [8] using similar support
thresholds. This means that, it can find more general
knowledge. The length of the longest pattern that this method
generates is upto twice that is found by [8]. It can be seen that
as min_wsupitem is low, the number of patterns increases
exponentially and the run time increases as well. This method
scales better when min_ wsupFT is high as it prunes the FTpatterns.
CONCLUSION
An efficient method for the discovery of fault tolerant frequent
weighted itemsets is proposed. Fault tolerant pattern mining
deals with the discovery of more general, interesting and useful
knowledge. This method requires only single scan of the
database which is made possible by taking the intersection of
TID-sets. It also results in faster computation of weighted
support. This algorithm makes use of WIT-tree data structure.
Weight value assigned to the items helps in mining important
itemsets as real data have different significance.
REFERENCES
[1]
J. Han and M. Kamber, Data Mining: Concepts and Techniques (San
Francisco, CA, Morgan Kaufmann Publishers, 2004).
[2]
[3]
[4]
[5]
[6]
[7]
[8]
Agrawal and R. Srikant, Fast algorithms for mining association rules ,
Proc. 20th Int’l Conf. Very Large Data Base(VLDB ‘94), pp.207216,1994.
J. Han, J. Pie and Y. Yin, Mining frequent patterns without candidate
generation, Proc. ACM SIGMOD, 1995.
J. Pei, A. K. H. Tung, and J. Han, “Fault-Tolerant Frequent Pattern
Mining: Problems and Challenges ―, Proceedings of ACM-SIGMOD
International Workshop on Research Issues on Data Mining and
Knowledge Discovery (DMKD'01), Santa Barbara, CA, May 2001.
Z. Huai and M. Huang, A weighted frequent itemsets incremental
updating algorithm base on hash table, IEEE,2011.
[6]Unil Yun and K. H. Ryu, Approximate weighted frequent pattern
mining with/without noisy environments, Knowledge-Based Systems 24
(2011) 73-82.
U. Yun, J.J. Leggett, WFIM: weighted frequent itemset mining with a
weight range and a minimum weight, in: Proceedings of the Fourth
SIAM International Conference on Data Mining, CA, USA, 2005, pp.
636-640.
Bay Vo, F. Coenen and Bac Le, A new method for mining frequent
weighted itemsets based on WIT-trees, Expert Systems with
Applications 40 (2013) 1256-1264.
[9]
[10]
[11]
[12]
[13]
[14]
[15]
Tao, F. Murtagh and M. Farid, Weighted association rule mining using
weighted
support
and
significance
framework.
In:
SIGKDD’03(pp.661-666).
[10]Zaki, M. J. (2004). Mining non-redundant association rules. Data
Mining and Knowledge Discovery, 9(3), 223-248.
C. H. Cai, A. W. Fu, C.H. Cheng, W.W. Kwong, Mining association
rules with weighted items, in: Proceedings of International Database
Engineering and Applications Symposium, IDEAS 98, Cardiff, Wales,
UK, 1998, pp.68-77.
G. Grahne, J. Zhu, Fast algorithms for frequent itemset mining using FPtrees, IEEE Transactions on Knowledge and Data Engineering 17 (2005)
1347-1362.
R. Agrawal, T. Imielinski, A. Swami, Mining association rules between
sets of items in large databases. in: Proceedings of 12th ACM SIGMOD
International Conference on Management of Data, Washington, DC,
USA, 1993, pp. 207-216.
Li Tong Yan and Chen Chao, Efficient mining of weighted frequent
itemsets using MLWFI, IEEE 2011.
Preetham Kumar and Ananthnarayana V S, Parallel method for
discovering frequent itemsets using weighted tree approach,
International Conference on Computer Engineering and Technology,
2009.. 3, no. 4, pp. 262–294, 2000
Download