Daniel Yu

Mining Top-K High Utility Itemsets - Report
Daniel Yu, yuda@student.ethz.ch
Computer Science Bsc., ETH Zurich, Switzerland
May 29, 2015
This report gives an overview of the main aspects of mining
top-k high utility itemsets, based on the paper "Mining Top-K High Utility
Itemsets" written by Cheng Wei Wu et al. from the National Cheng
Kung University in 2012 [1].
1 Introduction
Utility mining, which refers to the discovery of itemsets with utilities
higher than a user-specified minimum utility threshold, is an important
task with a wide range of applications, especially in e-commerce. But
setting an appropriate minimum utility threshold is a difficult problem.
If the threshold is set too low, too many high utility itemsets
are generated and the computation takes a long time, while setting the
threshold too high yields too few results. Finding an
appropriate threshold by trial and error is inefficient. This report discusses how this can be done better.
The report starts with a small example to give a basic understanding
of general high utility itemset mining. Then we look at a
naive approach and extend it with the increasing threshold mechanism,
which uses the so-called "transaction-weighted downward closure" (TWDC),
one of the most important foundations of most optimisation mechanisms.
At the end, we give a very short introduction to the "UP-Tree", which
is the state-of-the-art data structure for mining high utility itemsets.
2 Problem Definition
In top-k high utility itemset mining, we want to compute the top-k high
utility itemsets in D for the system (I, p, D, q):
• I is a finite set of distinct items I = {i_1, i_2, ..., i_m}.
• p is a function p : (i_j, D) → N, which associates each item i_j ∈ I
with a positive number, called the external utility.
• D is a transactional database, which consists of a set of transactions
{T_1, T_2, ..., T_n}. Each transaction T_c ∈ D is a subset of I and has
a unique identifier c, called its Tid.
• q is a function q : (i_j, T_c) → N, which associates each item
i_j ∈ T_c with a positive number, called the internal
utility.
The profit of an itemset X in a transaction T_c, denoted s(X, T_c), is
defined as:
\[ s(X, T_c) = \sum_{i_j \in X} p(i_j, D) \cdot q(i_j, T_c) \]
The utility of an itemset X in D, denoted u(X, D), is defined as:
\[ u(X, D) = \sum_{T_c \in D,\; X \subseteq T_c} s(X, T_c) \]
3 Example
TID   Purchase
P1    (3, A), (5, B)
P2    (2, B)
P3    (2, A), (1, C)
P4    (2, A), (3, B), (1, C)
Table 1: Example purchase database

Item     A    B    C    D
Profit   2$   1$   3$   2$
Table 2: Example price table
Suppose we are a shop that sells fruit: Apple, Banana, Cherry and
Date, so I = {A, B, C, D}. We collected today's purchases and stored them
in a so-called transactional database D = {P_1, ..., P_n}. Each purchase
consists of several items and their purchased quantities. E.g. the first customer purchased 3 Apples and 5 Bananas (see Table 1). A second
table stores the price of each item, which we treat as its profit per unit (see Table 2).
With high utility itemset mining, we can answer the following question:
Which product set yields the highest profit in this data? To answer
this question we define the price as the external utility and the quantity as the
internal utility.
With this mapping, the utility of itemset {A} in our example is 14$:
\begin{align*}
u(\{A\}, D) &= s(\{A\}, P_1) + s(\{A\}, P_3) + s(\{A\}, P_4) \\
&= 3 \cdot 2\$ + 2 \cdot 2\$ + 2 \cdot 2\$ \\
&= 14\$
\end{align*}
and the itemset {B, C} has a utility of 6$:
\begin{align*}
u(\{B, C\}, D) &= s(\{B, C\}, P_4) \\
&= 3 \cdot 1\$ + 1 \cdot 3\$ \\
&= 6\$
\end{align*}
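As a quick check of the two results above, the following minimal Python sketch encodes Table 1 and Table 2 as dictionaries and evaluates s and u directly from their definitions. The names PROFIT, DB, s and u are our own choices for illustration and are not taken from the paper [1].

```python
# External utility p per item (Table 2) and purchased quantities q (Table 1).
PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DB = {
    "P1": {"A": 3, "B": 5},
    "P2": {"B": 2},
    "P3": {"A": 2, "C": 1},
    "P4": {"A": 2, "B": 3, "C": 1},
}


def s(itemset, transaction):
    """Profit of an itemset in one transaction: sum of p(i) * q(i, T)."""
    return sum(PROFIT[i] * transaction[i] for i in itemset)


def u(itemset, db):
    """Utility of an itemset: its profit summed over all transactions containing it."""
    return sum(
        s(itemset, t) for t in db.values() if all(i in t for i in itemset)
    )


print(u({"A"}, DB))       # 14
print(u({"B", "C"}, DB))  # 6
```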
In classical high utility itemset mining, we choose the threshold as a parameter, and every itemset with a higher utility is in the result set.
Without any knowledge about the database D, it is quite hard to choose
the threshold: if we choose it too low, let's say 2$,
we get too many itemsets, and if we choose it too high, let's say
20$, no high utility itemset is found. We would rather have an algorithm
which takes the number of results we want as a parameter k. Setting k is
more intuitive than setting the threshold, because k represents the number of itemsets that the user wants to find. We call this problem
Top-K High Utility Itemset Mining.
In our example, the top-3 high utility itemsets are:
• itemset {A} with a utility of 14$
• itemset {A, C} with a utility of 14$
• and itemset {A, B} with a utility of 18$
Please keep these example tables in mind, since they are used for all the
examples throughout this report.
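The top-3 result can be reproduced by scoring every non-empty subset of I and keeping the three best. The sketch below does exactly that on the data of Table 1 and Table 2. The function names and the alphabetical tie-breaking are our own assumptions; the report does not prescribe how ties are resolved.

```python
from itertools import combinations

PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DB = [
    {"A": 3, "B": 5},          # P1
    {"B": 2},                  # P2
    {"A": 2, "C": 1},          # P3
    {"A": 2, "B": 3, "C": 1},  # P4
]


def utility(itemset, db):
    """Sum the profit of the itemset over all transactions that contain it."""
    return sum(
        sum(PROFIT[i] * t[i] for i in itemset)
        for t in db
        if all(i in t for i in itemset)
    )


def top_k_brute_force(db, k):
    # Enumerate every non-empty subset of I.
    subsets = [
        c for r in range(1, len(PROFIT) + 1)
        for c in combinations(sorted(PROFIT), r)
    ]
    # One full scan of the database per subset.
    scored = [(utility(x, db), list(x)) for x in subsets]
    # Highest utility first; ties are broken alphabetically (our assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


top3 = top_k_brute_force(DB, 3)
print(top3)  # [(18, ['A', 'B']), (14, ['A']), (14, ['A', 'C'])]
```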
4 Basic Algorithm
The Basic Algorithm solves the top-k high utility itemset problem in three
steps:
1. It first generates all the possible subsets of I.
2. Then it computes the utility of every subset.
3. Finally, it chooses the top-k high utility itemsets out of all itemsets.
Figure 1: Basic Algorithm (Generate all subsets of I → Compute utility → Choose top-k)
4.1 Analysis: Basic Algorithm
We now compute the complexity of this naive algorithm for
finding the top-k high utility itemsets.
1. The number of subsets of I is by definition equal to the size of
the power set of I. Therefore:
# subsets of I = |P(I)| = 2^n,
where n = |I|. Generating all the subsets has a complexity of O(2^n),
which is exponential in the number of items.
2. For each subset, we have to calculate its utility, which can be done
with a complete table scan. The table has size O(nm), where
m is the number of transactions, because every
transaction is a subset of I and therefore contains at most n items.
3. Choosing the top-k high utility itemsets is basically a simple scan
of all subsets, which can be done in O(2^n).
In total we get: O(2^n nm + 2^n) = O(2^n nm)
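To get a feeling for this bound, the following small sketch evaluates the step count 2^n · n · m + 2^n for a few values of n. The function name and the chosen values of n and m are purely illustrative, and constant factors are ignored.

```python
def basic_algorithm_steps(n, m):
    """Step count 2^n * n * m + 2^n for n items and m transactions."""
    return 2**n * n * m + 2**n


# m = 4 matches the example database; the values of n are arbitrary.
for n in (4, 10, 20, 30):
    print(n, basic_algorithm_steps(n, m=4))
# first line printed: 4 272
```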
Calculating the utility of all itemsets is therefore very expensive. The
problem is that the utility function is neither monotone nor anti-monotone:
calculating the utility of an itemset does not give us any information
about the utility of its supersets or subsets. We would like to have
some mechanism to prune the search space, since the search space grows
exponentially in the number of items, as we have seen above. We
discuss this problem extensively in the next two chapters.
5 Transaction-weighted downward closure (TWDC)
One of the major challenges is that the utility function of an itemset is
neither monotone nor anti-monotone. In other words, the utility of an
itemset might be equal to, greater than or lower than the utility of its supersets
and subsets. This makes it hard to prune the search space, since the exact
utility of an itemset does not give us information about its supersets or
subsets.
In 2005, Liu et al. proposed the Two-Phase algorithm in their paper [4],
which uses the so-called "transaction-weighted downward closure". First
we need the definition of the transaction-weighted utility of an itemset X:
\[ \mathrm{TWU}(X) = \sum_{T_i \in D,\; X \subseteq T_i} s(T_i, T_i) \]
If we take the same example from chapter 3, the transaction-weighted
utility of the itemset {A} is 28$:
\begin{align*}
\mathrm{TWU}(\{A\}) &= \sum_{T_i \in D,\; \{A\} \subseteq T_i} s(T_i, T_i) \\
&= s(\{A, B\}, P_1) + s(\{A, C\}, P_3) + s(\{A, B, C\}, P_4) \\
&= 11\$ + 7\$ + 10\$ \\
&= 28\$
\end{align*}
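The same number can be obtained mechanically. The sketch below first computes the total profit s(T, T) of every transaction of Table 1 and then sums it over the transactions that contain the itemset. The helper names transaction_utility and twu are our own.

```python
PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DB = {
    "P1": {"A": 3, "B": 5},
    "P2": {"B": 2},
    "P3": {"A": 2, "C": 1},
    "P4": {"A": 2, "B": 3, "C": 1},
}


def transaction_utility(t):
    """s(T, T): the profit of all items of a transaction in that transaction."""
    return sum(PROFIT[i] * q for i, q in t.items())


def twu(itemset, db):
    """Sum s(T, T) over all transactions T that contain the itemset."""
    return sum(
        transaction_utility(t)
        for t in db.values()
        if all(i in t for i in itemset)
    )


print({tid: transaction_utility(t) for tid, t in DB.items()})
# {'P1': 11, 'P2': 2, 'P3': 7, 'P4': 10}
print(twu({"A"}, DB))  # 28
```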
If we compare this to the actual utility of 14$, we see that the TWU
is an upper bound for the utility: every transaction T_i that contains X satisfies s(X, T_i) ≤ s(T_i, T_i), and therefore u(X) ≤ TWU(X).
The TWU also has the nice property of downward closure, which means:
If Y is a subset of X ⊆ I, then the transaction-weighted utility of Y is
at least the transaction-weighted utility of X. Consequently, if TWU(Y) lies below a threshold, the TWU of every superset of Y lies below it as well.
We now prove the downward closure property:
To prove: Y ⊂ X ⇒ TWU(X) ≤ TWU(Y)
Proof: We assume Y ⊂ X. Every transaction that contains X also contains Y, so the transaction-weighted
utility of Y is at least the transaction-weighted utility of X:
\[ \mathrm{TWU}(X) = \sum_{T_j \in D,\; X \subseteq T_j} s(T_j, T_j) \le \sum_{T_i \in D,\; Y \subseteq T_i} s(T_i, T_i) = \mathrm{TWU}(Y), \]
since the collection of transactions containing X is a subset of the
collection of transactions containing Y, and every term s(T_i, T_i) is positive.
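Both properties can be checked exhaustively on the example database. Every transaction that contains an itemset X also contains each subset Y of X, so the TWU can only shrink when items are added: TWU(Y) ≥ TWU(X). The sketch below verifies this, together with u(X) ≤ TWU(X), for all 15 non-empty itemsets. All helper names are our own.

```python
from itertools import combinations

PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DB = [
    {"A": 3, "B": 5},          # P1
    {"B": 2},                  # P2
    {"A": 2, "C": 1},          # P3
    {"A": 2, "B": 3, "C": 1},  # P4
]


def covering(x):
    """Transactions that contain every item of x."""
    return [t for t in DB if all(i in t for i in x)]


def utility(x):
    return sum(PROFIT[i] * t[i] for t in covering(x) for i in x)


def twu(x):
    return sum(PROFIT[i] * q for t in covering(x) for i, q in t.items())


ITEMSETS = [frozenset(c) for r in range(1, 5) for c in combinations("ABCD", r)]

# TWU is an upper bound of the exact utility.
upper_bound_holds = all(utility(x) <= twu(x) for x in ITEMSETS)
# Downward closure: a subset never has a smaller TWU than its superset.
closure_holds = all(
    twu(y) >= twu(x) for x in ITEMSETS for y in ITEMSETS if y <= x
)

print(twu({"A"}), twu({"A", "B"}), twu({"A", "B", "C"}))  # 28 21 10
print(upper_bound_holds, closure_holds)                   # True True
```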
6 Increasing Threshold
We have learnt that the transaction-weighted utility (TWU) is an upper bound for
the utility function and has the downward closure property. But how
can we use it to prune the search space?
In 2012, Wu et al. proposed the following idea with their algorithm TKU Base [1]:
The proposed algorithm uses an internal variable named border minimum utility threshold (denoted as min_util). We only want to consider
itemsets with a utility higher than this threshold. The algorithm initially
sets the threshold to 0 and gradually raises it to prune the
search space by using the TWDC. We can raise the threshold once
k itemsets whose utility is guaranteed to be higher have been captured.
For the algorithm, we need a lower and an upper bound for the utility of an
itemset. For the upper bound, the TWU can be used. For the lower bound,
we use the minimum item utility of an item a, denoted as
miu(a):
\[ \mathrm{miu}(a) = \min_{T \in D,\; a \in T} s(\{a\}, T) \]
and the minimum itemset utility of an itemset X = {a_1, ..., a_m}, denoted as
MIU(X):
\[ \mathrm{MIU}(X) = \sum_{i=1}^{m} \mathrm{miu}(a_i) \cdot \mathrm{SC}(X), \]
where SC(X) is the support count of the itemset, i.e. the number of
transactions in D containing X. This is clearly a lower bound for the utility
function, because each of the SC(X) transactions containing X contributes at least miu(a_i) for every item a_i ∈ X.
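If we take the data of chapter 3, the minimum itemset utility of itemset
{A, B} is 12$:
\begin{align*}
\mathrm{MIU}(\{A, B\}) &= \mathrm{miu}(A) \cdot \mathrm{SC}(\{A, B\}) + \mathrm{miu}(B) \cdot \mathrm{SC}(\{A, B\}) \\
&= 4\$ \cdot 2 + 2\$ \cdot 2 \\
&= 12\$
\end{align*}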
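The lower bound can be computed in the same style as the utility. In the following sketch, miu takes the minimum only over transactions that actually contain the item, and returns 0 for an item that is never purchased (an assumption of ours, needed for item D). The function names are illustrative.

```python
PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DB = [
    {"A": 3, "B": 5},          # P1
    {"B": 2},                  # P2
    {"A": 2, "C": 1},          # P3
    {"A": 2, "B": 3, "C": 1},  # P4
]


def miu(item):
    """Smallest profit of the item over all transactions that contain it."""
    return min((PROFIT[item] * t[item] for t in DB if item in t), default=0)


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in DB if all(i in t for i in itemset))


def miu_itemset(itemset):
    """MIU(X): sum of miu(a) over the items of X, times the support count of X."""
    return sum(miu(a) for a in itemset) * support_count(itemset)


print(miu("A"), miu("B"), support_count({"A", "B"}))  # 4 2 2
print(miu_itemset({"A", "B"}))                        # 12
```

The exact utility of {A, B} is 18$, so the bound 12$ ≤ 18$ holds for this itemset.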
For the algorithm, we need to distinguish between three different cases
(cf. Figure 2). For an itemset X:
I. MIU(X) ≤ min_util ≤ TWU(X)
II. MIU(X) ≤ TWU(X) < min_util
III. min_util ≤ MIU(X) ≤ TWU(X)
Figure 2: Three cases for min_util, MIU and TWU (I. min_util lies between MIU and TWU; II. min_util lies above TWU; III. min_util lies below MIU)
These cases are complete, because all the other possible orderings would
violate the following fact:
MIU(X) ≤ u(X) ≤ TWU(X) ⇒ MIU(X) ≤ TWU(X)
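The case distinction translates into a small function. The sketch below takes the two bounds of an itemset as plain numbers. Since cases I and III overlap when MIU(X) equals min_util, we assign this boundary to case I, which is our own choice. The bounds used in the calls are computed from Table 1 and Table 2, and the thresholds 13 and 10 are hypothetical.

```python
def classify(miu_x, twu_x, min_util):
    """Return 'I', 'II' or 'III' for an itemset with the given bounds."""
    if miu_x > twu_x:
        raise ValueError("MIU(X) must not exceed TWU(X)")
    if twu_x < min_util:
        return "II"    # even the upper bound is below the threshold
    if miu_x > min_util:
        return "III"   # even the lower bound is above the threshold
    return "I"         # the threshold lies between the two bounds


# {A, B}: MIU = 12, TWU = 21.   {B, C}: MIU = 5, TWU = 10.
print(classify(12, 21, min_util=13))  # I
print(classify(5, 10, min_util=13))   # II
print(classify(12, 21, min_util=10))  # III
```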
Let's analyze these three cases in detail:
I. We call such an itemset a potential itemset, since its utility might be
higher than the threshold min_util. We have to keep these itemsets,
because they are candidates for high utility itemsets.
II. Such an itemset X is definitely not part of the top-k high utility
itemsets and can be safely discarded (the proof is given below in III.),
since its exact utility is certainly below the threshold min_util:
u(X) ≤ TWU(X) < min_util
By applying the TWDC property of the TWU, we can also prune all
its supersets X′, which are even less promising because their TWU
cannot exceed TWU(X):
u(X′) ≤ TWU(X′) ≤ TWU(X) < min_util
III. Such an itemset X is also a candidate for the high utility itemsets, so
we have to keep it. Here MIU(X) can be used to raise the
border threshold min_util. For this purpose we need a proof:
To prove: Assume we are mining the top-k high utility itemsets. Let C = {X_1, X_2, ..., X_m} be an ordered set of itemsets,
where m ≥ k, X_i is the i-th itemset in C and MIU(X_i) ≥
MIU(X_j) for all i < j (ordered by MIU).
For any itemset Y, if TWU(Y) < min{MIU(X_i) | X_i ∈ C, 1 ≤
i ≤ k}, then Y is not a top-k high utility itemset.
Proof: By the definitions of TWU and MIU we know
that:
u(Y) ≤ TWU(Y) < MIU(X_i) ≤ u(X_i),
where X_i ∈ C, 1 ≤ i ≤ k. Hence there already exist k itemsets
whose utilities are higher than the utility of Y, so by the definition
of top-k high utility itemsets, Y is not a top-k high utility
itemset.
It also follows from this proof that we can safely set the
threshold min_util to min{MIU(X_i) | X_i ∈ C, 1 ≤ i ≤ k}, because
there is no point in considering itemsets which are definitely not part
of the top-k high utility itemsets.
How do we keep track of the itemsets to update min_util efficiently?
We use a min-heap structure L to maintain the k highest MIUs of the
candidate itemsets found so far. Once k MIUs are found, min_util is raised
to the k-th highest MIU, which is the smallest element of L. Each time a new candidate X is found whose
MIU is higher than min_util, MIU(X) is added to L and the lowest MIU in L is
removed. After that, min_util is raised to the new k-th highest MIU in L.
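A heap of size k whose root is its smallest element gives direct access to the k-th highest MIU seen so far. The following sketch uses Python's heapq module for this bookkeeping. The function name record_miu is hypothetical, and the stream of MIU values is the one we obtain from Table 1 and Table 2 for the itemsets {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}.

```python
import heapq


def record_miu(heap, k, miu_value, min_util):
    """Record one candidate's MIU and return the (possibly raised) threshold."""
    if len(heap) < k:
        heapq.heappush(heap, miu_value)
    elif miu_value > heap[0]:
        # Replace the smallest of the k stored MIUs by the new, larger one.
        heapq.heapreplace(heap, miu_value)
    if len(heap) == k:
        min_util = max(min_util, heap[0])  # root = k-th highest MIU so far
    return min_util


heap, min_util, trace = [], 0, []
for miu_value in [12, 6, 6, 12, 14, 5, 9]:
    min_util = record_miu(heap, 3, miu_value, min_util)
    trace.append(min_util)

print(trace)  # [0, 0, 6, 6, 12, 12, 12]
```

The final threshold of 12$ stays below the third-highest exact utility of the example (14$), so no top-3 itemset is lost.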
7 Advanced Algorithm
Figure 3: The TKU Base algorithm (Generate all the subsets of I → Calculate MIU and TWU → case I: save the candidate; case II: discard it and all its supersets; case III: save the candidate and update the threshold → Calculate utility and choose top-k)
The new algorithm consists of three parts:
1. Generate all the itemsets.
2. Choose all the potential candidates for high utility itemsets with
the increasing threshold method, which we discussed extensively in the
last chapter. We initialize the threshold with 0. For
each itemset, we check to which case it belongs (I., II. or III.)
and act appropriately. To keep track of the MIUs and update
min_util efficiently, we use a min-heap L as discussed before. At the end
we get a list of candidates stored in C.
3. Choose the top-k high utility itemsets, which is basically a simple scan
of C.
Algorithm 1: Advanced Algorithm
    // Initialization
    L ← empty min-heap;
    C ← empty set;
    min_util ← 0;
    // Generate all the subsets of I
    M ← subset(I);
    // Calculate MIU and TWU, case distinction
    while M is not empty do
        X ← take one itemset from M and remove it from M;
        if MIU(X) ≤ min_util and min_util ≤ TWU(X) then
            // Case I.
            C ← C ∪ {X};
        else if TWU(X) < min_util then
            // Case II.
            M ← M − superset(X);
        else
            // Case III.
            C ← C ∪ {X};
            insert MIU(X) into L;
            update min_util;
        end
    end
    // Check the candidates in C
    Calculate the exact utility for each itemset in C;
    Output the top-k high utility itemsets in C;
Note:
The subset(I) function generates all the subsets of I, and the superset(X) function generates all the supersets of X within I.
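The following Python sketch runs the increasing threshold method on the example of chapter 3. It is our own simplified reading of the idea and not the implementation from the paper [1]. Itemsets are visited from small to large, so that an itemset discarded in case II is seen before its supersets, and the supersets are skipped lazily instead of being removed from M. All function names are our own.

```python
import heapq
from itertools import combinations

PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DB = [
    {"A": 3, "B": 5},          # P1
    {"B": 2},                  # P2
    {"A": 2, "C": 1},          # P3
    {"A": 2, "B": 3, "C": 1},  # P4
]


def covering(x, db):
    """Transactions of db that contain every item of x."""
    return [t for t in db if all(i in t for i in x)]


def utility(x, db):
    return sum(PROFIT[i] * t[i] for t in covering(x, db) for i in x)


def twu(x, db):
    return sum(PROFIT[i] * q for t in covering(x, db) for i, q in t.items())


def miu_itemset(x, db):
    def miu(a):
        return min((PROFIT[a] * t[a] for t in db if a in t), default=0)
    return sum(miu(a) for a in x) * len(covering(x, db))


def top_k_increasing_threshold(db, k):
    # Smaller itemsets first, so a discarded itemset precedes its supersets.
    pending = [frozenset(c) for r in range(1, len(PROFIT) + 1)
               for c in combinations(sorted(PROFIT), r)]
    heap, candidates, discarded, min_util = [], [], [], 0
    for x in pending:
        if any(d <= x for d in discarded):
            continue                       # superset of a case II itemset
        if twu(x, db) < min_util:          # case II: discard X
            discarded.append(x)
            continue
        candidates.append(x)               # cases I and III: keep X
        lower = miu_itemset(x, db)
        if lower > min_util:               # case III: record MIU, raise threshold
            if len(heap) < k:
                heapq.heappush(heap, lower)
            else:
                heapq.heapreplace(heap, lower)
            if len(heap) == k:
                min_util = heap[0]
    # Exact utilities are computed only for the surviving candidates.
    ranked = sorted(((utility(x, db), sorted(x)) for x in candidates),
                    key=lambda pair: (-pair[0], pair[1]))
    return ranked[:k], len(candidates), min_util


top3, n_candidates, final_threshold = top_k_increasing_threshold(DB, 3)
print(top3)             # [(18, ['A', 'B']), (14, ['A']), (14, ['A', 'C'])]
print(n_candidates)     # 5
print(final_threshold)  # 12
```

On this small example, only 5 of the 15 non-empty itemsets need an exact utility computation.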
7.1 Analysis: Advanced Algorithm
This new algorithm has a considerable overhead for calculating all the
TWUs. Is it at least guaranteed to perform better than the basic
algorithm? The answer is sadly no. The TWDC with the increasing
threshold does not give us any guarantee of performing better. In fact,
it could be slower. As a simple and short example, think of a database D
with just one transaction D_1, which contains all the items (D_1 = I), and assume
p = q = 1.
With such a database, the TWU of all itemsets is equal, since
every itemset is contained in the single transaction D_1. For any X ⊆ I:
\begin{align*}
\mathrm{TWU}(X) &= \sum_{D_i \in D,\; X \subseteq D_i} s(D_i, D_i) \\
&= s(D_1, D_1) = s(I, D_1) \\
&= \sum_{i_j \in I} p(i_j, D) \cdot q(i_j, D_1) \\
&= \sum_{i_j \in I} 1 = |I| = n
\end{align*}
With such a database, we do not get any additional information about
the utility of the itemsets. We cannot prune the search space with the
TWDC, which means that we still have to check the utility of all possible subsets of I. In practice, however, for an online store like Amazon, which
offers millions of products, it is very unlikely that a customer buys
millions of products in one purchase.
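The degenerate case is easy to reproduce. The sketch below builds a database with a single transaction that contains all items, with p = q = 1, and computes both bounds for every non-empty itemset. The item set of size 5 and all names are hypothetical choices for illustration.

```python
from itertools import combinations

ITEMS = "ABCDE"                    # hypothetical item set, n = 5
PROFIT = {i: 1 for i in ITEMS}     # p = 1 for every item
DB = [{i: 1 for i in ITEMS}]       # one transaction with every item, q = 1


def twu(x):
    return sum(
        sum(PROFIT[i] * q for i, q in t.items())
        for t in DB
        if all(i in t for i in x)
    )


def miu_itemset(x):
    support = sum(1 for t in DB if all(i in t for i in x))
    return sum(min(PROFIT[a] * t[a] for t in DB if a in t) for a in x) * support


all_itemsets = [c for r in range(1, len(ITEMS) + 1)
                for c in combinations(ITEMS, r)]
twu_values = {twu(x) for x in all_itemsets}
highest_threshold = max(miu_itemset(x) for x in all_itemsets)

print(len(all_itemsets), twu_values)  # 31 {5}
print(highest_threshold)              # 5
```

Since min_util is always one of the MIU values, it can never exceed 5 here, so the condition TWU(X) < min_util is never true and no itemset is discarded.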
For the datasets which the authors used for performance testing, the transaction size was quite small. They did not have to consider this problem,
since they only used "real world datasets", where the transaction size is
small relative to the number of items. For example, the Foodmart dataset
has 1559 items and an average transaction size of 4.4, and the Chainstore dataset has 46086 items and an average transaction size of
7.2.
8 UP-Tree
In this chapter, we briefly introduce the structure of the UP-Tree.
We need this structure for the baseline approach for mining top-k
high utility itemsets.
In a UP-Tree, each node N consists of the following elements: name (the
item name of N), count (the support count of N), nu (the node utility of
N), parent (the parent node of N) and link (a node link which
points to a node whose item name is the same as name).
Due to time constraints, this data structure cannot be discussed in detail. For
details about the UP-Tree, readers can refer to the paper [2]. In
short, the UP-Tree can be constructed with only two table scans of D.
It is a data structure which can delete an itemset and all its supersets very
efficiently. Also, calculating the TWU and the support count, which we
use for the upper and lower bound, is just a traversal
of the UP-Tree.
In the algorithm, the UP-Tree is used for generating the next itemset to
analyze. In cases II and III, the UP-Tree is updated.
For illustration, this is the UP-Tree for our example from chapter 3:
Header table:
Item   TWU   Link
A      28
B      23
C      17
D      0
Tree (each node is written as name, count, nu; indentation marks the parent):
Root
    A,3,14
        B,2,18
            C,1,10
        C,1,7
    B,1,2
Figure 4: Example UP-Tree for min_util = 0
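The node values of the figure can be reproduced with a short construction sketch. It assumes that the items of each transaction are inserted in descending order of their TWU and that the node utility nu accumulates the profit of the path from the root down to and including the node. This is our simplified reading of the construction in [2] for min_util = 0, not the original implementation. The children field is a helper of ours and not one of the node elements listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DB = [
    {"A": 3, "B": 5},          # P1
    {"B": 2},                  # P2
    {"A": 2, "C": 1},          # P3
    {"A": 2, "B": 3, "C": 1},  # P4
]


@dataclass(eq=False)
class Node:
    name: str
    count: int = 0
    nu: int = 0
    parent: Optional["Node"] = field(default=None, repr=False)
    link: Optional["Node"] = field(default=None, repr=False)
    # Helper for building the tree (our addition).
    children: Dict[str, "Node"] = field(default_factory=dict, repr=False)


def build_up_tree(db, profit):
    # First scan: TWU of every item.
    twu = dict.fromkeys(profit, 0)
    for t in db:
        tu = sum(profit[i] * q for i, q in t.items())
        for i in t:
            twu[i] += tu
    # Second scan: insert every transaction, items in descending TWU order.
    root, header, last = Node("Root"), {}, {}
    for t in db:
        node, prefix = root, 0
        for i in sorted(t, key=lambda item: (-twu[item], item)):
            prefix += profit[i] * t[i]
            child = node.children.get(i)
            if child is None:
                child = Node(i, parent=node)
                node.children[i] = child
                if i in last:
                    last[i].link = child   # chain nodes with the same name
                else:
                    header[i] = child      # first node of this item
                last[i] = child
            child.count += 1
            child.nu += prefix
            node = child
    return root, twu, header


root, twu, header = build_up_tree(DB, PROFIT)
a = root.children["A"]
print(a)                              # Node(name='A', count=3, nu=14)
print(a.children["B"])                # Node(name='B', count=2, nu=18)
print(a.children["B"].children["C"])  # Node(name='C', count=1, nu=10)
print(a.children["C"])                # Node(name='C', count=1, nu=7)
print(root.children["B"])             # Node(name='B', count=1, nu=2)
```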
9 Conclusion
We have seen two algorithms for mining top-k high utility itemsets: the
basic one and the advanced one. The advanced one uses the increasing
threshold mechanism to filter the candidates by means of the transaction-weighted
downward closure.
We have also learned that the increasing threshold method is not an
improvement for all databases, since it relies heavily on the additional
information gained by calculating the transaction-weighted utility, and this
information is not always available.
The authors should also have tested TKU Base on databases other
than typical "real world" commerce data, since high utility mining is not
limited to commerce datasets.
References
[1] C. W. Wu, B.-E. Shie, P. S. Yu and V. S. Tseng. Mining Top-K High
Utility Itemsets. In Proc. of Int'l Conf. in ACM SIGKDD, pp. 78-86,
2012.
[2] V. S. Tseng, C.-W. Wu, B.-E. Shie and P. S. Yu. UP-Growth: an efficient algorithm for high utility itemset mining. In Proc. of Int'l Conf.
in ACM SIGKDD, pp. 253-262, 2010.
[3] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong and Y.-K. Lee. Efficient Tree
Structures for High-utility Pattern Mining in Incremental Databases.
In IEEE Transactions on Knowledge and Data Engineering, Vol. 21,
Issue 12, pp. 1708-1721, 2009.
[4] Y. Liu, W. Liao and A. Choudhary. A fast high-utility itemsets mining
algorithm. In Proc. of the Utility-Based Data Mining Workshop, 2005.
[5] Y. Liu, J. Li, W.-K. Liao, A. Choudhary and Y. Shi. High Utility
Itemsets Mining. In Int'l Journal of Information Technology and Decision Making, pp. 905-934, 2010.