Mine High Utility Itemset using UP-Tree and FP-Growth

International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 9- Sep 2013
NS JAGADEESH#1, B JYOTHSNA*2 , KN DHARANIDHAR#3, A ANANTHA BIPIN*4
# Assistant Professor, Dept. of CSE, Kuppam Engineering College, Kuppam, India.
Abstract— Data Mining is defined as a process that extracts some
new, non-trivial, previously unknown and potentially useful
information contained in large databases. Traditional mining
techniques have focused largely on detecting the statistical
correlations between the items that are more frequent in the
transaction databases, an approach also termed frequent itemset
mining. In this paper, we propose strategies for UP-Growth from
the emerging area called utility mining, which considers not only
the frequency of the itemsets but also the utility associated
with them. The term utility refers to the importance or
usefulness of an itemset in transactions, quantified in terms
such as profit, sales or any other user preference. The objective
is to identify itemsets whose utility values are above a given
utility threshold, using the pattern growth methodology to mine
the set of utility patterns.
Keywords— candidate pruning, frequent itemset, high utility
itemset, utility mining, UP-tree, FP-Growth.
I. INTRODUCTION
Discovering useful patterns hidden in a database
plays an essential role in several data mining tasks,
such as frequent pattern mining, weighted frequent
pattern mining and high utility pattern mining.
Among them, frequent pattern mining is a
fundamental research topic that has been applied to
different kinds of databases, such as transactional
databases. It is used in the analysis of customer
transactions in retail research, where it is termed
market basket analysis, and has also been used to
identify the purchase patterns of consumers.
II. LITERATURE SURVEY
Over the last two decades data mining has
emerged as a significant research area. This is
primarily due to the interdisciplinary nature of the
subject and the diverse range of application
domains in which data mining based products and
techniques are being employed. These include
bioinformatics, genetics, medicine, clinical research,
education, retail and marketing research.
Data mining is the process of revealing
previously unknown and potentially useful
information from large databases. The primary goal
is to discover hidden patterns and unexpected trends
in the data. The term is frequently misused to mean
any form of large-scale data or information
processing. The actual data mining task is the
automatic or semi-automatic analysis of large
quantities of data to extract previously unknown
interesting patterns.
Data mining activities use a combination of
techniques from database technologies, statistics,
artificial intelligence and machine learning.
Extensive studies have been carried out on mining
frequent patterns [1, 2, 3, 4, 6]. Among the issues of
frequent pattern mining, the most famous are
association rule mining [1, 3, 4, 6] and sequential
pattern mining. One of the well-known algorithms
for mining association rules is Apriori [1], which is
the pioneer for efficiently mining association rules
from large databases. Pattern growth based
association rule mining algorithms [4, 6] such as
FP-Growth [4] were proposed afterwards. It is
widely recognised that FP-Growth achieves better
performance than Apriori based algorithms, since it
finds frequent itemsets without generating any
candidate itemset and scans the database just twice.
ISSN: 2231-5381
http://www.ijettjournal.org
Frequent Itemset Mining
An itemset can be defined as a non-empty set of
items. An itemset with k different items is termed
a k-itemset. For example, {bread, butter, milk} may
denote a 3-itemset in a supermarket transaction.
The notion of frequent itemsets was introduced by
Agrawal et al. [1]. Frequent itemsets are the itemsets
that appear frequently in the transactions. The goal
of frequent itemset mining is to identify all the
frequent itemsets in a transaction dataset [6].
Frequent itemset mining plays an essential role
in the theory and practice of many important data
mining tasks, such as mining association rules [1, 2],
long patterns [5], emerging patterns and
dependency rules. It has been applied in the fields of
telecommunications [3], census analysis [6] and text
analysis.
The criterion of being frequent is expressed in
terms of the support value of the itemsets. The support
value of an itemset is the percentage of transactions
that contain the itemset.
1) EXAMPLE 1:
Consider the small example of a transaction
database representing the sales data and the profit
associated with the sale of each unit of the items.
Table I represents the sales figures for three items
(Item A, B and C) over ten transactions. Each cell
gives the number of units of an item sold in that
transaction. Table II represents the unit profit
associated with the sale of individual items.
TABLE I
TRANSACTION DATABASE
(quantity of item sold in transaction)
Transaction ID   Item A   Item B   Item C
T1               2        0        1
T2               4        0        2
T3               4        1        0
T4               0        1        1
T5               5        1        2
T6               10       1        5
T7               4        0        2
T8               1        0        0
T9               3        0        0
T10              5        0        0
TABLE II
UNIT PROFIT ASSOCIATED WITH ITEMS
Item Name   Unit Profit (in USD)
Item A      5
Item B      100
Item C      40
Now consider the itemset AB. Since only
3 transactions (T3, T5 and T6) out of the overall
10 transactions contain this itemset, the
support for this itemset is
Support(AB) = 3 / 10 * 100 = 30 %
Since T3 contains 4 units of item A and 1 unit of
item B, the profit earned by the sale of the
itemset AB in transaction T3 is given by
profit(AB, T3) = 4 * profit(A) + 1 * profit(B)
               = 4*5 + 1*100 = 120
Since AB appears in transactions T3, T5 and T6, the
total profit associated with itemset AB over the
complete set of 10 transactions is
Profit(AB) = profit(AB,T3) + profit(AB,T5) + profit(AB,T6)
           = (4*5+1*100) + (5*5+1*100) + (10*5+1*100)
           = 395
Similarly we can calculate the support values for
the different itemsets and also the profit obtained by
the sale of those itemsets over all ten transactions, as
indicated in Table III.
If we consider minimum support = 40 % then we
observe that there are 4 itemsets (A, B, C and AC)
which qualify as frequent itemsets because their
support is no less than the minimum support threshold
value. But if we consider the associated profit, we
find that out of the 4 most profitable itemsets, i.e. C,
AC, BC and ABC, only two are also frequent itemsets.
Itemsets BC and ABC are not frequent, but they still
fetch more profit than some of the frequent itemsets
like A or B. This is
inherently because of the variation in the unit profits
of the items. As we can see, one unit of item B when
sold fetches much more profit than one unit of
item A or item C.
TABLE III
SUPPORT AND PROFIT FOR ALL ITEMSETS
Itemset   Support(%)   Profit(USD)
A         90           190
B         40           400
C         60           520
AB        30           395
AC        50           605
BC        30           620
ABC       20           555
This example illustrates the fact that the frequent
itemset mining approach may not always satisfy a
sales manager's goal. In this case the support
measure of the itemsets reflects the statistical
correlation of items, but it does not reflect their
semantic significance, which in this example was
the associated profit.
In reality a retail business may be interested in
identifying its most valuable customers (customers
who contribute a major fraction of the profits to the
business). These are the customers who may buy
full priced items or high margin items, which may
be absent from a large number of transactions
because most customers do not buy these items
frequently.
Utility Mining
The limitations of frequent or rare itemset mining
motivated researchers to conceive a utility based
mining approach, which allows a user to
conveniently express his or her perspectives
concerning the usefulness of itemsets as utility
values and then find itemsets with utility
values higher than a threshold. In utility based
mining the term utility refers to the quantitative
representation of user preference, i.e. the utility
value of an itemset is the measurement of the
importance of that itemset from the user's perspective.
For example, if a sales analyst involved in some retail
research needs to find out which itemsets in the
stores earn the maximum sales revenue for the
stores, he or she will define the utility of any itemset
as the monetary profit that the store earns by selling
each unit of that itemset.
Note that the sales analyst is not interested in
the number of transactions that contain the itemset;
he or she is only concerned about the revenue
generated collectively by all the transactions
containing the itemset. In practice the utility value
of an itemset can be profit, popularity, page-rank,
a measure of some aesthetic aspect such as beauty or
design, or some other measure of the user's preference.
Formally, an itemset S is useful to a user if it
satisfies a utility constraint, i.e. any constraint of the
form u(S) >= min_util, where u(S) is the utility value
of the itemset and min_util is a utility threshold
defined by the user [32]. In our example, if we take
the utility of an itemset to be the profit associated
with the sale of that itemset, then with utility
threshold min_util = 500 the itemset ABC has
a utility value of 555, which means that this itemset
is of interest to the user even though its support
value is just 20%. While computing the total
utility of an itemset S, we multiply the utility values
of the individual items of S by
the corresponding quantities of those
items in the transactions that contain S, so the
utility based mining approach can be said to
measure the significance of an itemset along two
dimensions. The first dimension is the support
value of the itemset, i.e. the frequency of the
itemset, and the second dimension is the semantic
significance of the itemset as measured by the user.
III. PROPOSED METHODS
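Before the two steps are described, the support and utility measures defined in Section II can be checked against Tables I and II with a short script. This is only a minimal sketch and not part of the proposed method: the function names (`containing`, `support`, `utility`, `frequent_itemsets`, `high_utility_itemsets`) are illustrative assumptions, and the itemsets are enumerated by brute force, which is feasible only because the example has three items.

```python
from itertools import combinations

# Table I: quantity of each item sold in transactions T1..T10.
QUANTITIES = {
    "A": [2, 4, 4, 0, 5, 10, 4, 1, 3, 5],
    "B": [0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
    "C": [1, 2, 0, 1, 2, 5, 2, 0, 0, 0],
}
# Table II: unit profit of each item in USD.
UNIT_PROFIT = {"A": 5, "B": 100, "C": 40}
N_TRANSACTIONS = 10


def containing(itemset):
    """Indices of the transactions in which every item of the itemset was sold."""
    return [t for t in range(N_TRANSACTIONS)
            if all(QUANTITIES[i][t] > 0 for i in itemset)]


def support(itemset):
    """Percentage of transactions that contain the itemset."""
    return 100 * len(containing(itemset)) / N_TRANSACTIONS


def utility(itemset):
    """u(S): profit of the itemset summed over the transactions containing it."""
    return sum(QUANTITIES[i][t] * UNIT_PROFIT[i]
               for t in containing(itemset) for i in itemset)


def all_itemsets():
    """Every non-empty itemset over the items of Table I (brute force)."""
    items = sorted(QUANTITIES)
    return [c for k in range(1, len(items) + 1) for c in combinations(items, k)]


def frequent_itemsets(min_support):
    """Itemsets whose support is no less than min_support (in percent)."""
    return ["".join(s) for s in all_itemsets() if support(s) >= min_support]


def high_utility_itemsets(min_util):
    """Itemsets satisfying the utility constraint u(S) >= min_util."""
    return ["".join(s) for s in all_itemsets() if utility(s) >= min_util]


if __name__ == "__main__":
    for s in all_itemsets():
        print("".join(s), support(s), utility(s))
    print("frequent (40%):", frequent_itemsets(40))
    print("high utility (500):", high_utility_itemsets(500))
```

With a minimum support of 40 % the sketch reports A, B, C and AC as frequent, and with min_util = 500 it reports C, AC, BC and ABC as high utility itemsets, which matches the discussion of Example 1.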
The framework of the proposed method consists of
two steps:
(1) Scan the database twice to construct a global
UP-Tree with the first two strategies (given
in Subsection III.A).
(2) Recursively generate potential high utility
itemsets (abbreviated as PHUIs) from the global
UP-Tree and local UP-Trees by UP-Growth+
with the last two strategies (given
in Subsection III.B).
A. The Proposed Data Structure: UP-Tree
To improve mining performance and avoid
scanning the original database repeatedly, we use a
compact tree structure, named UP-Tree (Utility
Pattern Tree), to maintain the information of
transactions and high utility itemsets. Two
strategies are applied to minimise the overestimated
utilities stored in the nodes of the global UP-Tree. In
the following subsections, the elements of the UP-Tree
are first defined. Next, the two strategies are introduced.
1) The Elements in UP-Tree
In a UP-Tree, each node N consists of N.name,
N.count, N.nu, N.parent, N.hlink and a set of child
nodes. N.name is the node's item name. N.count is
the node's support count. N.nu is the node's node
utility, i.e., the overestimated utility of the node.
N.parent records the parent node of N. N.hlink is a
node link which points to a node whose item name
is the same as N.name.
A table named the header table is employed to
facilitate the traversal of the UP-Tree. In the header
table, each entry records an item name, an overestimated
utility, and a link. The link points to the last
occurrence of the node which has the same item as
the entry in the UP-Tree. By following the links in
the header table and the nodes in the UP-Tree, the nodes
having the same name can be traversed efficiently.
In the following subsections, two strategies for
decreasing the overestimated utility of each item
during the construction of a global UP-Tree are
introduced.
2) Strategy DGU: Discarding Global Unpromising Items
The construction of a global UP-Tree can be
performed with two scans of the original database.
In the first scan, the Transaction Utility (abbreviated
as TU) of each transaction is computed.
At the same time, the Transaction-Weighted Utility
(abbreviated as TWU) of each single item is
accumulated. By the transaction-weighted
downward closure (abbreviated as TWDC)
property, an item and its supersets cannot be
high utility itemsets if the item's TWU is less
than the minimum utility threshold. Such an item is
called an unpromising item.
An item is called a promising item if TWU >=
min_util. Otherwise, it is called an unpromising
item. Without loss of generality, an item is also
called a promising item if its overestimated utility is
no less than min_util. Otherwise, it is called an
unpromising item.
3) Strategy DGN: Decreasing Global Node Utilities
Global node utilities can be decreased by the actual
utilities of descendant nodes during the
construction of the global UP-Tree. By applying strategy
DGN, the utilities of the nodes that are closer to the
root of a global UP-Tree are further reduced. DGN is
especially suitable for databases containing many
long transactions. In other words, the more items
a transaction contains, the more utilities can be
discarded by DGN. By contrast, the traditional
TWU mining model is not suitable for such
databases, since the more items a transaction
contains, the higher its TWU is.
B. The Proposed Mining Method: UP-Growth+
In UP-Growth+, minimal node utilities (abbreviated
as MNUs) in each path are used to
make the estimated pruning values closer to the real
utility values of the pruned items in the database.
The MNU of each node can be acquired during the
construction of a global UP-Tree. First, we add an
element, namely N.mnu, to each node of the UP-Tree.
N.mnu is the minimal node utility of N. When N is
traced, N.mnu keeps track of the minimal value of
N.name's utility in different transactions. If N.mnu
is larger than u(N.name, Tcurrent), N.mnu is set to
u(N.name, Tcurrent).
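The construction described in Subsection III.A, together with the N.mnu bookkeeping above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `Node`, `build_up_tree`, `DB` and `PROFIT` are hypothetical, promising items are assumed to be inserted in descending TWU order (the text does not state the ordering), and DGN is assumed to add to each node only the utilities of the items up to and including that node in the reorganised transaction. The sample data are taken from Tables I and II.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Transaction = Dict[str, int]  # item name -> quantity (only items actually sold)


@dataclass(eq=False)
class Node:
    name: Optional[str]            # N.name (None for the root)
    count: int = 0                 # N.count: support count
    nu: int = 0                    # N.nu: overestimated node utility
    mnu: Optional[int] = None      # N.mnu: minimal node utility
    parent: Optional["Node"] = None
    hlink: Optional["Node"] = None  # next node carrying the same item name
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_up_tree(db: List[Transaction], profit: Dict[str, int],
                  min_util: int) -> Tuple[Node, Dict[str, dict]]:
    def u(item: str, t: Transaction) -> int:
        return t[item] * profit[item]

    # First scan: TU of each transaction and TWU of each single item.
    twu: Dict[str, int] = {}
    for t in db:
        tu = sum(u(i, t) for i in t)
        for i in t:
            twu[i] = twu.get(i, 0) + tu

    # DGU: keep only promising items (TWU >= min_util); assumed TWU-descending order.
    order = sorted((i for i in twu if twu[i] >= min_util),
                   key=lambda i: (-twu[i], i))
    root = Node(None)
    header = {i: {"twu": twu[i], "link": None} for i in order}

    # Second scan: insert each reorganised transaction.
    for t in db:
        node, prefix_utility = root, 0
        for i in (i for i in order if i in t):
            item_utility = u(i, t)
            # DGN: utilities of the items that follow i are left out of i's node.
            prefix_utility += item_utility
            child = node.children.get(i)
            if child is None:
                child = Node(i, parent=node)
                child.hlink = header[i]["link"]  # chain to the previous occurrence
                header[i]["link"] = child        # header points to the last occurrence
                node.children[i] = child
            child.count += 1
            child.nu += prefix_utility
            child.mnu = item_utility if child.mnu is None else min(child.mnu, item_utility)
            node = child
    return root, header


PROFIT = {"A": 5, "B": 100, "C": 40}
DB = [
    {"A": 2, "C": 1}, {"A": 4, "C": 2}, {"A": 4, "B": 1}, {"B": 1, "C": 1},
    {"A": 5, "B": 1, "C": 2}, {"A": 10, "B": 1, "C": 5}, {"A": 4, "C": 2},
    {"A": 1}, {"A": 3}, {"A": 5},
]

if __name__ == "__main__":
    root, header = build_up_tree(DB, PROFIT, min_util=500)
    a = root.children["A"]
    print(a.count, a.nu, a.mnu)
```

With min_util = 500 all three items are promising, and the node for item A directly under the root ends with count 9, node utility 190 and minimal node utility 5. Raising min_util to 900 makes item B unpromising, so it is dropped from the header table and from every transaction before insertion.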
Fig. 1 A Block diagram of the proposed system
1) Strategy ENU: Eliminating local unpromising
items and their estimated Node Utilities from the
paths and path utilities
ENU can be regarded as a local version of DGU.
It provides a simple but useful scheme to reduce
overestimated utilities locally without an extra scan
of the original database.
2) Strategy DNN: Decreasing local Node utilities
for nodes of local UP-Tree by estimated utilities of
descendant Nodes
DNN can likewise be regarded as a local
version of DGN mentioned in the earlier sections.
By these two strategies, overestimated utilities for
itemsets can be locally reduced to a certain degree
without losing any actual high utility itemset.
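The two local strategies can be illustrated on a conditional pattern base, i.e. the list of prefix paths collected for one item. The sketch below is an assumption-laden illustration, not the authors' code: the text gives no formulas, so it assumes that ENU subtracts mnu * count for every local unpromising item from the path utility, and that DNN subtracts mnu * count for every item that follows a node on the path. The functions `apply_enu` and `dnn_increments`, the per-item `mnu` table (the paper stores N.mnu per node) and the sample paths are all hypothetical.

```python
from typing import Dict, List, Tuple

# A path of a conditional pattern base: (items on the path, path utility, support count).
Path = Tuple[List[str], int, int]


def apply_enu(cpb: List[Path], mnu: Dict[str, int], min_util: int) -> List[Path]:
    """ENU sketch: drop local unpromising items and their estimated utilities."""
    # Local overestimated utility of an item: sum of the utilities of the paths holding it.
    local: Dict[str, int] = {}
    for items, pu, _ in cpb:
        for i in items:
            local[i] = local.get(i, 0) + pu
    unpromising = {i for i, value in local.items() if value < min_util}

    reduced = []
    for items, pu, count in cpb:
        removed = [i for i in items if i in unpromising]
        kept = [i for i in items if i not in unpromising]
        # Assumed estimate of a removed item's utility on this path: mnu * count.
        pu -= sum(mnu[i] * count for i in removed)
        reduced.append((kept, pu, count))
    return reduced


def dnn_increments(path: Path, mnu: Dict[str, int]) -> List[Tuple[str, int]]:
    """DNN sketch: utility added to each local node when the path is inserted."""
    items, pu, count = path
    increments = []
    for k, item in enumerate(items):
        # Estimated utilities of the descendant items are left out of this node.
        later = sum(mnu[j] * count for j in items[k + 1:])
        increments.append((item, pu - later))
    return increments


if __name__ == "__main__":
    # Hypothetical conditional pattern base and minimal node utilities.
    cpb = [(["A", "C"], 555, 2), (["A"], 120, 1), (["C"], 140, 1)]
    mnu = {"A": 5, "C": 40}
    print(apply_enu(cpb, mnu, min_util=680))
    print(dnn_increments(cpb[0], mnu))
```

In this hypothetical example item A has a local utility of 675, so with min_util = 680 it is removed from every path and the path utilities shrink by 5 per supporting transaction, without rescanning the original database.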
IV. CONCLUSION
In this paper, we have presented strategies
for UP-Growth that use a tree structure to
store essential information about frequent patterns
for mining high utility itemsets. We have used the
concepts of standard frequent itemset mining for
mining the complete set of frequent patterns by
means of pattern growth.
Higher efficiency in mining high utility patterns
can be realized by implementing two
important components. One is the construction of the
UP-tree and the other is the mining of utility
itemsets from the UP-tree. The proposed UP-tree
based pattern mining uses the pattern growth
method to avoid the costly generation of a large
number of candidate sets and reduces the search
space dramatically.
REFERENCES
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining
association rules," in Proc. of the 20th VLDB Conf., pp.
487-499, 1994.
[2] R. Agrawal and R. Srikant, "Mining Sequential
Patterns," in Proc. of the 11th Int'l Conference on Data
Engineering, pp. 3-14, Mar. 1995.
[3] J. Han and Y. Fu, "Discovery of multiple-level
association rules from large databases," in Proc. of the 21st
VLDB Conf., pp. 420-431, Sep. 1995.
[4] J. Han, J. Pei and Y. Yin, "Mining frequent patterns without
candidate generation," in Proc. of the ACM-SIGMOD
Int'l Conf. on Management of Data, pp. 1-12, 2000.
[5] V. S. Tseng, C. J. Chu and T. Liang, "Efficient Mining
of Temporal High Utility Itemsets from Data streams,"
in Proc. of ACM KDD Workshop on Utility-Based Data
Mining (UBDM'06), USA, Aug. 2006.
[6] R. Martinez, N. Pasquier and C. Pasquier, "GenMiner:
mining non-redundant association rules from integrated
gene expression data and annotations," Bioinformatics,
Vol. 24, pp. 2643-2644, 2008.
[7] S. J. Yen, Y. S. Lee, C. K. Wang, C. W. Wu and L.-Y.
Ouyang, "The studies of mining frequent patterns based
on frequent pattern tree," in Proc. of the 13th PAKDD,
LNCS Vol. 5476, pp. 232-241, 2009.
AUTHORS DESCRIPTION
N.S. Jagadeesh is currently working as Assistant Professor in
Kuppam Engineering College, Kuppam. He received B.Tech
(Information Technology) and M.Tech (Computer Science and
Engineering) from JNTU, Anantapur. His research interest
areas are Data warehousing and Mining & Software Engineering.
B. Jyothsna is currently working as Assistant Professor in
Sir Vishveshwaraiah Institute of Science & Technology,
Madanapalle. She received B.Tech and M.Tech (Computer Science
and Engineering) from JNTU, Anantapur. Her research interest
areas are Data warehousing and Mining & Software Engineering.
KN Dharanidhar is currently working as Assistant Professor in
Kuppam Engineering College, Kuppam. He received B.Tech
(Information Technology) and M.Tech (Computer Science and
Engineering) from JNTU, Anantapur. His research interest
areas are Data warehousing and Mining & Mobile Computing.
A. Anantha Bipin is currently working as Assistant Professor in
Kuppam Engineering College, Kuppam. He received B.E (Computer
Science and Engineering) and M.E (Computer Science and
Engineering) from Anna University, Chennai. His research
interest areas are Data warehousing and Mining & Networks.