Mining Top-K High Utility Itemsets - Report

Daniel Yu, yuda@student.ethz.ch
Computer Science BSc, ETH Zurich, Switzerland
May 29, 2015

This report gives an overview of the main aspects of mining top-k high utility itemsets, based on the paper "Mining Top-K High Utility Itemsets" by Cheng Wei Wu et al. from the National Cheng Kung University, published in 2012 [1].

1 Introduction

Utility mining, which refers to the discovery of itemsets with utilities higher than a user-specified minimum utility threshold, is an important task with a wide range of applications, especially in e-commerce. Setting an appropriate minimum utility threshold, however, is a difficult problem. If the threshold is set too low, too many high utility itemsets are generated and the computation takes a long time; if it is set too high, too few results are found. Finding an appropriate threshold by trial and error is not very efficient. This report discusses how this can be done better.

The report starts with a small example to give a basic understanding of general high utility itemset mining. Then we look at a naive approach and extend it with the increasing threshold mechanism, which uses the so-called "transaction-weighted downward closure" (TWDC), one of the most important foundations of most optimisation mechanisms. At the end, we give a very short introduction to the "UP-Tree", which is the state-of-the-art data structure for mining high utility itemsets.

2 Problem Definition

In top-k high utility itemset mining we want to compute the k itemsets with the highest utility in D, given the system (I, p, D, q):

• I is a finite set of distinct items, I = {i_1, i_2, ..., i_m}.
• p is a function p : (i_j, D) → N, which associates each item i_j ∈ I with a positive number, called the external utility.
• D is a transactional database, which consists of a set of transactions {T_1, T_2, ..., T_n}.
  Each transaction T_c ∈ D is a subset of I and has a unique identifier c, called Tid.
• q is a function q : (i_j, T_c) → N, which associates each item i_j ∈ T_c with a positive number, called the internal utility.

The profit of an itemset X in a transaction T_c, denoted s(X, T_c), is defined as:

\[ s(X, T_c) = \sum_{i_j \in X} p(i_j, D) \cdot q(i_j, T_c) \]

The utility of an itemset X in D, denoted u(X, D) or u(X) for short, is defined as:

\[ u(X, D) = \sum_{T_c \in D,\; X \subseteq T_c} s(X, T_c) \]

3 Example

TID   Purchase
P1    (3, A), (5, B)
P2    (2, B)
P3    (2, A), (1, C)
P4    (2, A), (3, B), (1, C)

Table 1: Example purchase database

Item   Profit
A      2$
B      1$
C      3$
D      2$

Table 2: Example price table

Suppose we are a shop that sells fruit: Apple, Banana, Cherry and Date, so I = {A, B, C, D}. We collected today's purchases in a so-called transactional database D = {P1, ..., Pn}. Each purchase consists of several items and their purchased quantities. For example, the first customer purchased 3 apples and 5 bananas (see Table 1). A second table stores the profit per unit of each item (see Table 2).

With high utility itemset mining, we can answer the following question: which product set yields the highest profit in this data? To answer this question we define the unit profit as the external utility and the quantity as the internal utility. With this mapping the utility of the itemset {A} is 14$ in our example:

\[ u(\{A\}, D) = s(\{A\}, P_1) + s(\{A\}, P_3) + s(\{A\}, P_4) = 3 \cdot 2\$ + 2 \cdot 2\$ + 2 \cdot 2\$ = 14\$ \]

and the itemset {B, C} has a utility of 6$:

\[ u(\{B, C\}, D) = s(\{B, C\}, P_4) = 3 \cdot 1\$ + 1 \cdot 3\$ = 6\$ \]

In classical high utility itemset mining, we choose the threshold as a parameter, and every itemset with a higher utility is part of the result set. Without any knowledge about the database D, it is quite hard to choose the threshold: if it is too low, say 2$, we get too many itemsets, and if it is too high, say 20$, no high utility itemset is found.
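The two definitions can be checked on the example with a few lines of Python. This is only a minimal sketch: the dictionaries `PROFIT` and `DATABASE` and the function names `s` and `u` are illustrative choices that mirror Table 1, Table 2 and the notation above, not code from the paper [1].

```python
# Example data from Table 1 and Table 2 (names are illustrative).
PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}   # external utility p
DATABASE = {                                 # internal utility q per purchase
    "P1": {"A": 3, "B": 5},
    "P2": {"B": 2},
    "P3": {"A": 2, "C": 1},
    "P4": {"A": 2, "B": 3, "C": 1},
}


def s(itemset, transaction):
    """Profit of `itemset` within one transaction: quantity times unit profit."""
    return sum(PROFIT[item] * transaction[item] for item in itemset)


def u(itemset):
    """Utility of `itemset` in the database: summed over transactions containing it."""
    items = set(itemset)
    return sum(s(items, t) for t in DATABASE.values() if items.issubset(t))


print(u({"A"}))        # 14
print(u({"B", "C"}))   # 6
print(u({"A", "B"}))   # 18
```

An itemset that occurs in no purchase, such as {D}, simply gets utility 0 in this sketch.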
We would rather have an algorithm that takes the number of results we want as a parameter k. Setting k is more intuitive than setting the threshold, because k represents the number of itemsets that the user wants to find. We call this problem Top-K High Utility Itemset Mining. In our example the top-3 high utility itemsets are:

• itemset {A, B} with a utility of 18$
• itemset {A} with a utility of 14$
• itemset {A, C} with a utility of 14$

Please keep these example databases in mind, since they are used for all examples throughout the report.

4 Basic Algorithm

The basic algorithm solves the top-k high utility itemset problem in three steps:

1. It first generates all possible subsets of I.
2. Then it computes the utility of every subset.
3. Finally, it chooses the top-k high utility itemsets out of all itemsets.

Figure 1: Basic Algorithm (flowchart: generate all subsets of I, compute utility, choose top-k)

4.1 Analysis: Basic Algorithm

We now compute the complexity of this naive algorithm for finding the top-k high utility itemsets.

1. The number of subsets of I equals, by definition, the size of the power set of I. Therefore: # subsets of I = |P(I)| = 2^n, where n = |I|. Generating all subsets has a complexity of O(2^n), which grows exponentially with the number of items.
2. For each subset, we have to calculate its utility, which can be done with a complete table scan. The size of the table is O(nm), where m is the number of transactions: every transaction is a subset of I and therefore contains at most n items.
3. Choosing the top-k high utility itemsets is basically a simple scan over all subsets, which can be done in O(2^n) for a fixed k.

In total we get:

\[ O(2^n \cdot nm + 2^n) = O(2^n \cdot nm) \]

Calculating the utility of all itemsets is very expensive. The problem is that the utility function is neither monotone nor anti-monotone.
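The three steps analyzed above can be sketched in Python on the example of chapter 3. This is a minimal brute-force sketch under stated assumptions: the names `PROFIT`, `DATABASE`, `utility` and `basic_top_k` are illustrative, the empty itemset is skipped, ties are broken alphabetically, and the final selection uses a sort instead of a linear scan for brevity.

```python
from itertools import combinations

PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DATABASE = {
    "P1": {"A": 3, "B": 5},
    "P2": {"B": 2},
    "P3": {"A": 2, "C": 1},
    "P4": {"A": 2, "B": 3, "C": 1},
}


def utility(itemset):
    """Exact utility: one full scan of the database per itemset."""
    items = set(itemset)
    return sum(
        sum(PROFIT[i] * t[i] for i in items)
        for t in DATABASE.values()
        if items.issubset(t)
    )


def basic_top_k(k):
    items = sorted(PROFIT)
    # Step 1: enumerate all non-empty subsets of I (2^n - 1 itemsets).
    subsets = [c for r in range(1, len(items) + 1) for c in combinations(items, r)]
    # Step 2: compute the utility of every subset.
    scored = [(utility(x), x) for x in subsets]
    # Step 3: keep the k best; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


print(basic_top_k(3))
# [(18, ('A', 'B')), (14, ('A',)), (14, ('A', 'C'))]

# Utility along the chain {A}, {A, B}, {A, B, C} first rises, then falls.
print(utility(("A",)), utility(("A", "B")), utility(("A", "B", "C")))
# 14 18 10
```

The last line shows why no pruning rule can be read off the utility alone: adding an item raised the utility once and lowered it the next time.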
Calculating the utility of an itemset does not give us any information about the utility of its supersets or subsets. We would also like to have a mechanism to prune the search space, since it grows exponentially with the number of items, as we have seen above. We discuss this problem extensively in the next two chapters.

5 Transaction-weighted downward closure (TWDC)

One of the major challenges is that the utility function of an itemset is neither monotone nor anti-monotone. In other words, the utility of an itemset might be equal to, greater than or lower than the utility of its supersets and subsets. This makes it hard to prune the search space, since the exact utility of an itemset gives us no information about its supersets or subsets.

In 2005, Liu et al. proposed in their paper [4] the Two-Phase algorithm, which uses the so-called "transaction-weighted downward closure". First we need the definition of the transaction-weighted utility of an itemset X:

\[ TWU(X) = \sum_{T_i \in D,\; X \subseteq T_i} s(T_i, T_i) \]

If we take the example from chapter 3, the transaction-weighted utility of the itemset {A} is 28$:

\[ TWU(\{A\}) = \sum_{T_i \in D,\; \{A\} \subseteq T_i} s(T_i, T_i) = s(\{A, B\}, P_1) + s(\{A, C\}, P_3) + s(\{A, B, C\}, P_4) = 11\$ + 7\$ + 10\$ = 28\$ \]

If we compare this to the actual utility, we see that the TWU function is an upper bound for the utility function, because s(X, T_i) ≤ s(T_i, T_i) whenever X ⊆ T_i. The function also has the nice property of downward closure: if Y is a subset of X ⊆ I, then the transaction-weighted utility of X is at most the transaction-weighted utility of Y. In the example, TWU({A, B}) = 11$ + 10$ = 21$ ≤ 28$ = TWU({A}). We now prove the downward closure property.

To prove: Y ⊂ X ⇒ TWU(X) ≤ TWU(Y)

Proof: We assume Y ⊂ X. We can show that the transaction-weighted utility of X is at most the transaction-weighted utility of Y.
\[ TWU(X) = \sum_{T_j \in D,\; X \subseteq T_j} s(T_j, T_j) \;\le\; \sum_{T_i \in D,\; Y \subseteq T_i} s(T_i, T_i) = TWU(Y), \]

since every transaction containing X also contains Y (because Y ⊂ X). The transactions containing X therefore form a subset of the transactions containing Y, and every summand is non-negative.

6 Increasing Threshold

We have learnt that the transaction-weighted utility (TWU) is an upper bound for the utility function and has the downward closure property. But how can we use it to prune the search space? In 2012, Wu et al. proposed the following idea with their algorithm TKU Base [1]: the algorithm uses an internal variable named border minimum utility threshold (denoted as min_util). We only want to consider itemsets with a utility higher than this threshold. The algorithm initially sets the threshold to 0 and gradually raises it to prune the search space by using the TWDC. The threshold can be raised once enough itemsets whose utility is guaranteed to be higher have been captured.

For the algorithm, we need a lower and an upper bound for the utility of an itemset. For the upper bound the TWU can be used. For the lower bound we use the minimum item utility of an item a, denoted as miu(a):

\[ miu(a) = \min_{T \in D,\; a \in T} s(\{a\}, T) \]

and the minimum itemset utility of an itemset X = {a_1, ..., a_m}, denoted as MIU(X):

\[ MIU(X) = \sum_{i=1}^{m} miu(a_i) \cdot SC(X), \]

where SC(X) is the support count of the itemset, that is, the number of transactions in D containing X. This is clearly a lower bound for the utility function, since each of the SC(X) transactions containing X contributes at least miu(a_i) for every item a_i ∈ X. If we take the data of chapter 3, the minimum itemset utility of the itemset {A, B} is 12$:

\[ MIU(\{A, B\}) = miu(A) \cdot SC(\{A, B\}) + miu(B) \cdot SC(\{A, B\}) = 4\$ \cdot 2 + 2\$ \cdot 2 = 12\$ \]

For the algorithm, we need to distinguish between three different cases (cf. Figure 2). For an itemset X:

I. MIU(X) ≤ min_util ≤ TWU(X)
II. MIU(X) ≤ TWU(X) < min_util
III. min_util < MIU(X) ≤ TWU(X)
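The two bounds and the three-way case distinction can be sketched in Python on the example data of chapter 3. This is a minimal sketch, not code from [1]: all names (`twu`, `miu`, `classify`, ...) are illustrative, an itemset that occurs in no transaction is given MIU 0, and the boundary MIU(X) = min_util is assigned to case I.

```python
PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DATABASE = {
    "P1": {"A": 3, "B": 5},
    "P2": {"B": 2},
    "P3": {"A": 2, "C": 1},
    "P4": {"A": 2, "B": 3, "C": 1},
}


def profit(itemset, t):
    """s(X, T): profit of the items of X inside transaction T."""
    return sum(PROFIT[i] * t[i] for i in itemset)


def containing(itemset):
    """All transactions that contain every item of the itemset."""
    return [t for t in DATABASE.values() if set(itemset).issubset(t)]


def utility(itemset):
    return sum(profit(itemset, t) for t in containing(itemset))


def twu(itemset):
    """Upper bound: total profit of every transaction containing the itemset."""
    return sum(profit(t, t) for t in containing(itemset))


def miu(itemset):
    """Lower bound: sum of per-item minimum utilities times the support count."""
    support = containing(itemset)
    if not support:
        return 0  # assumption: unsupported itemsets get the trivial bound 0
    minima = (
        min(profit((a,), t) for t in DATABASE.values() if a in t) for a in itemset
    )
    return sum(minima) * len(support)


def classify(itemset, min_util):
    """Return 'I', 'II' or 'III' for the three cases of the text."""
    if twu(itemset) < min_util:
        return "II"
    if miu(itemset) <= min_util:
        return "I"
    return "III"


x = ("A", "B")
print(miu(x), utility(x), twu(x))                          # 12 18 21
print(classify(x, 15), classify(x, 25), classify(x, 10))   # I II III
```

For {A, B} the exact utility 18$ indeed lies between the lower bound 12$ and the upper bound 21$.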
Figure 2: Three cases for min_util, MIU and TWU

These cases are complete, because all other possible cases would violate the following fact:

\[ MIU(X) \le u(X) \le TWU(X) \;\Rightarrow\; MIU(X) \le TWU(X) \]

Let us analyze the three cases in detail:

I. We call such an itemset a potential itemset, since its utility might be higher than the threshold min_util. We have to keep these itemsets, because they are candidates for high utility itemsets.

II. Such an itemset X is definitely not part of the top-k high utility itemsets and can be safely discarded (the proof is below in III.), since its exact utility is certainly below the threshold min_util:

\[ u(X) \le TWU(X) < min\_util \]

By applying the TWDC property of the TWU, we can also prune all its supersets X', whose TWU cannot be higher:

\[ u(X') \le TWU(X') \le TWU(X) < min\_util \]

III. Such an itemset X is also a candidate for the high utility itemsets, so we have to keep it. Here MIU(X) can be used to raise min_util. For this purpose we need a proof.

To prove: Assume we are mining for the top-k high utility itemsets. Let C = {X_1, X_2, ..., X_m} be an ordered set of itemsets, where m ≥ k, X_i is the i-th itemset in C and MIU(X_i) ≥ MIU(X_j) for all i < j (ordered by MIU). For any itemset Y, if TWU(Y) < min{MIU(X_i) | X_i ∈ C, 1 ≤ i ≤ k}, then Y is not a top-k high utility itemset.

Proof: According to the definitions of TWU and MIU we know that

\[ u(Y) \le TWU(Y) < MIU(X_i) \le u(X_i) \quad \text{for all } X_i \in C,\; 1 \le i \le k. \]

Since there already exist k itemsets whose utilities are higher than the utility of Y, Y is not a top-k high utility itemset by definition.

It also follows from this proof that we can safely set the threshold min_util to min{MIU(X_i) | X_i ∈ C, 1 ≤ i ≤ k}, because there is no point in considering itemsets that are definitely not part of the top-k high utility itemsets.

How do we keep track of the itemsets to efficiently update min_util?
We use a min-heap structure L to maintain the k highest MIUs of the candidate itemsets found so far. Once k MIUs are found, min_util is raised to the k-th highest MIU, which is the lowest value in L. Each time a new candidate X is found whose MIU is higher than min_util, its MIU is added to L and, if L then holds more than k values, the lowest MIU in L is removed. After that, min_util is raised to the k-th highest MIU in L.

7 Advanced Algorithm

Figure 3: The TKU Base algorithm (flowchart: generate all the subsets of I; calculate MIU and TWU; case I: save the candidate; case II: discard it and all its supersets; case III: save the candidate and update the threshold; finally calculate utility and choose top-k)

The new algorithm consists of three parts:

1. Generate all the itemsets.
2. Choose all the potential candidates for high utility itemsets with the increasing threshold method, which we discussed extensively in the last chapter. We initialize the threshold with 0. For each itemset, we check to which case it belongs (I., II. or III.) and act appropriately. To keep track of the MIUs and to efficiently update min_util, we use a min-heap L as discussed before. At the end we get a list of candidates stored in C.
3. Choosing the top-k high utility itemsets is basically a simple scan of C.

Algorithm 1: Advanced Algorithm

    // Initialization
    L ← empty min-heap;
    C ← empty set;
    min_util ← 0;
    // Generate all the subsets of I
    M ← subset(I);
    // Calculate MIU and TWU, case distinction
    while M is not empty do
        X ← take and remove one itemset from M;
        if TWU(X) < min_util then
            // Case II.
            M ← M − superset(X);
        else if MIU(X) ≤ min_util then
            // Case I.
            C ← C ∪ {X};
        else
            // Case III.
            C ← C ∪ {X};
            insert MIU(X) into L;
            update min_util;
        end
    end
    // Check the candidates in C
    Calculate the exact utility for each itemset in C;
    Output the top-k high utility itemsets in C;

Note: The subset(I) function generates all the subsets of I, and superset(X) denotes all the supersets of X.

7.1 Analysis: Advanced Algorithm

This new algorithm seems to have quite some overhead for calculating all the TWUs.
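To make the extra work per itemset concrete, Algorithm 1 can be sketched in Python on the example of chapter 3. The sketch rests on assumptions that are not part of the report: itemsets are enumerated by increasing size so that an itemset is seen before its supersets, a Case II itemset prunes its supersets through a plain list, and Python's `heapq` min-heap holds the k highest MIUs. All names are illustrative.

```python
import heapq
from itertools import combinations

PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DATABASE = {
    "P1": {"A": 3, "B": 5},
    "P2": {"B": 2},
    "P3": {"A": 2, "C": 1},
    "P4": {"A": 2, "B": 3, "C": 1},
}


def profit(itemset, t):
    return sum(PROFIT[i] * t[i] for i in itemset)


def containing(itemset):
    return [t for t in DATABASE.values() if set(itemset).issubset(t)]


def utility(itemset):
    return sum(profit(itemset, t) for t in containing(itemset))


def twu(itemset):
    return sum(profit(t, t) for t in containing(itemset))


def miu(itemset):
    support = containing(itemset)
    if not support:
        return 0
    minima = (
        min(profit((a,), t) for t in DATABASE.values() if a in t) for a in itemset
    )
    return sum(minima) * len(support)


def advanced_top_k(k):
    heap = []         # min-heap L with the k highest MIUs seen so far
    candidates = []   # C
    pruned = []       # Case II itemsets; their supersets are skipped
    min_util = 0
    items = sorted(PROFIT)
    for size in range(1, len(items) + 1):
        for x in combinations(items, size):
            if any(set(p).issubset(x) for p in pruned):
                continue                      # superset of a discarded itemset
            if twu(x) < min_util:             # Case II: discard
                pruned.append(x)
                continue
            candidates.append(x)              # Case I or III: keep
            if miu(x) > min_util:             # Case III: try to raise threshold
                heapq.heappush(heap, miu(x))
                if len(heap) > k:
                    heapq.heappop(heap)
                if len(heap) == k:
                    min_util = heap[0]        # k-th highest MIU
    # Final phase: exact utilities only for the surviving candidates.
    scored = sorted(((utility(x), x) for x in candidates),
                    key=lambda pair: (-pair[0], pair[1]))
    return scored[:k], candidates


top3, kept = advanced_top_k(3)
print(top3)   # [(18, ('A', 'B')), (14, ('A',)), (14, ('A', 'C'))]
print(kept)   # [('A',), ('B',), ('C',), ('A', 'B'), ('A', 'C')]
```

On this small example only five of the fifteen non-empty itemsets survive as candidates, but every visited itemset still costs one TWU and one MIU computation on top of the final exact utility.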
Does it at least guarantee to perform better than the basic algorithm? The answer is sadly no. The TWDC with the increasing threshold gives us no guarantee to perform better at all. In fact it could be slower.

As a simple and short example, think of a database D with just one transaction D_1 that contains all the items, D_1 = I, and assume p = q = 1. With such a database, the TWU is equal for all itemsets, since the TWU is an overestimation. For any X ⊆ I:

\[ TWU(X) = \sum_{D_i \in D,\; X \subseteq D_i} s(D_i, D_i) = s(D_1, D_1) = s(I, D_1) = \sum_{i_j \in I} p(i_j, D) \cdot q(i_j, D_1) = \sum_{i_j \in I} 1 = |I| = n \]

With such a system, we do not get any additional information about the utility of the itemsets. We cannot prune the search space with the TWDC, which means that we still have to check the utility of all possible subsets of I.

In practice, however, for an online store like Amazon, which offers millions of products, it is very unlikely that a customer buys millions of products in one purchase. In the datasets which the authors used for performance testing, the transaction size was quite small. They did not have to consider this problem, since they only used "real world datasets", where the transaction size is small relative to the number of items. For example, the Foodmart dataset has 1559 items with an average transaction size of 4.4, and the Chainstore dataset has 46086 items with an average transaction size of 7.2.

8 UP-Tree

In this section, we briefly introduce the structure of the UP-Tree. We need this structure for the baseline approach for mining top-k high utility itemsets. In the UP-Tree, each node N consists of the following elements: name (the item name of N), count (the support count of N), nu (the node utility of N), parent (the parent node of N) and link (a node link which points to a node whose item name is the same as name). Due to time constraints, this data structure cannot be discussed in detail.
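The node layout listed above could be sketched as a small Python class. Only the five fields name, count, nu, parent and link come from the text. The `children` map, the `insert` helper and the rule that a node accumulates the utility of the transaction prefix ending at it are assumptions made for illustration; the actual construction is described in [2]. The `link` field is declared but not maintained here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)
class UPNode:
    name: Optional[str]                      # item name, None for the root
    count: int = 0                           # support count of the node
    nu: int = 0                              # node utility
    parent: Optional["UPNode"] = field(default=None, repr=False)
    link: Optional["UPNode"] = field(default=None, repr=False)  # not maintained
    children: Dict[str, "UPNode"] = field(default_factory=dict)  # assumption


def insert(root, transaction, profit, order):
    """Insert one transaction, visiting its items in the given global order.

    Assumption: each node on the path adds the utility of the prefix that
    ends at this node to its node utility.
    """
    node, prefix_utility = root, 0
    for item in order:
        if item not in transaction:
            continue
        prefix_utility += profit[item] * transaction[item]
        child = node.children.get(item)
        if child is None:
            child = UPNode(name=item, parent=node)
            node.children[item] = child
        child.count += 1
        child.nu += prefix_utility
        node = child


PROFIT = {"A": 2, "B": 1, "C": 3, "D": 2}
DATABASE = [
    {"A": 3, "B": 5},
    {"B": 2},
    {"A": 2, "C": 1},
    {"A": 2, "B": 3, "C": 1},
]

root = UPNode(name=None)
for purchase in DATABASE:
    insert(root, purchase, PROFIT, order=["A", "B", "C", "D"])

a = root.children["A"]
print(a.count, a.nu)                                # 3 14
print(a.children["B"].count, a.children["B"].nu)    # 2 18
```

Items are visited in the order A, B, C, D, which is the example's order of decreasing TWU; D never occurs and therefore creates no node.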
For the details about the UP-Tree, readers can refer to the paper [2]. In short, the UP-Tree can be constructed with only two table scans of D. It is a data structure from which an itemset and all its supersets can be deleted very efficiently. Calculating the TWU and the support count, which we use for the upper and lower bound, is also just a traversal of the UP-Tree. In the algorithm, the UP-Tree is used for generating the next itemset to analyze. For cases II and III, the UP-Tree is updated. For illustration, this is the UP-Tree for our example from chapter 3:

Figure 4: Example UP-Tree for min_util = 0 (header table, item: TWU: A: 28, B: 23, C: 17, D: 0; tree nodes as (name, count, nu): (B, 1, 2), (A, 3, 14), (B, 2, 18), (C, 1, 7), (C, 1, 10))

9 Conclusion

We have seen two algorithms to mine top-k high utility itemsets: the basic one and the advanced one. The advanced one uses the increasing threshold mechanism to filter the candidates by means of the transaction-weighted downward closure. We have also learned that the increasing threshold method is not an improvement for all databases, since it relies heavily on the additional information gained by calculating the transaction-weighted utility, and this information is not always available. The authors should also have tested TKU Base on databases other than typical "real world" commerce data, since high utility mining does not refer only to commerce datasets.

References

[1] C. W. Wu, B.-E. Shie, P. S. Yu and V. S. Tseng. Mining Top-K High Utility Itemsets. In Proc. of Int'l Conf. on ACM SIGKDD, pp. 78-86, 2012.
[2] V. S. Tseng, C.-W. Wu, B.-E. Shie and P. S. Yu. UP-Growth: an efficient algorithm for high utility itemset mining. In Proc. of Int'l Conf. on ACM SIGKDD, pp. 253-262, 2010.
[3] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong and Y.-K. Lee. Efficient Tree Structures for High-utility Pattern Mining in Incremental Databases. In IEEE Transactions on Knowledge and Data Engineering, Vol. 21, Issue 12, pp. 1708-1721, 2009.
[4] Y. Liu, W. Liao, and A. Choudhary.
A fast high-utility itemsets mining algorithm. In Proc. of the Utility-Based Data Mining Workshop, 2005.
[5] Y. Liu, J. Li, W.-K. Liao, A. Choudhary and Y. Shi. High Utility Itemsets Mining. In Int'l Journal of Information Technology and Decision Making, pp. 905-934, 2010.