USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns
Authors: Junfu Yin, Zhigang Zheng, Longbing Cao
In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012. p. 660-668.
Presenter: 0356069 江怡蕙, 0356027 薛筑軒

Outline
• Introduction
• Related work
• Problem Statement
• USpan algorithm
• Experiment
• Conclusions & Discussions

Outline: Introduction
• Background
• Definition
• Challenges

Introduction
• Sequential pattern mining has proven essential for handling order-based critical business problems.
• Example: structures and functions of molecular or DNA sequences

Background
• The selection of interesting sequences is generally based on the frequency/support framework: sequences of high frequency are treated as significant.
• Under this framework, the downward closure property (also known as the Apriori property) plays a fundamental role.

Definition
• Utility
  • Internal utility = quantity; external utility = quality
• High utility pattern mining
• Minimum utility
• The utility of <ea> in sequence 2 is {(6 × 1 + 1 × 2), (6 × 1 + 2 × 2)} = {8, 10}

Definition
• Define the concept of sequence utility by considering the quality and quantity associated with each item in a sequence, and define the problem of mining high utility sequential patterns.
• A complete lexicographic quantitative sequence tree (LQS-tree) is used to construct utility-based sequences; two concatenation mechanisms, I-Concatenation and S-Concatenation, generate newly concatenated sequences.
• Two pruning methods, width pruning and depth pruning, substantially reduce the search space in the LQS-tree.
• USpan traverses the LQS-tree and outputs all the high utility sequential patterns.
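
The utility example from the Definition slide (<ea> in sequence 2) can be checked with a short sketch. The quality table `QUALITY`, the contents of `S2`, and the helper name `match_utilities` are assumptions made for illustration; they are chosen so that the result agrees with the {8, 10} quoted on the slide (only the qualities of a and e are implied there).

```python
from typing import Dict, List

QItemset = Dict[str, int]      # item -> quantity (internal utility)
QSequence = List[QItemset]     # ordered list of q-itemsets

# Assumed quality (external utility) per item; the slide only implies a=2, e=1.
QUALITY: Dict[str, int] = {"a": 2, "b": 5, "c": 4, "d": 3, "e": 1, "f": 1}

# Assumed contents of sequence 2, consistent with the slide's arithmetic.
S2: QSequence = [
    {"a": 2, "e": 6},
    {"a": 1, "b": 1, "c": 2},
    {"a": 2, "d": 3, "e": 3},
]


def match_utilities(pattern: List[List[str]], qseq: QSequence, start: int = 0) -> List[int]:
    """Return the utility of every ordered match of `pattern` in `qseq`.

    `pattern` is a list of itemsets (each a list of items). Each itemset must
    be matched inside one q-itemset, and later itemsets must match strictly
    later q-itemsets.
    """
    if not pattern:
        return [0]  # empty pattern: one match with zero utility
    head, tail = pattern[0], pattern[1:]
    utilities: List[int] = []
    for pos in range(start, len(qseq)):
        itemset = qseq[pos]
        if all(item in itemset for item in head):
            # utility of the matched q-items = quantity * quality
            here = sum(itemset[item] * QUALITY[item] for item in head)
            for rest in match_utilities(tail, qseq, pos + 1):
                utilities.append(here + rest)
    return utilities


v_ea_s2 = match_utilities([["e"], ["a"]], S2)
print(sorted(v_ea_s2))  # [8, 10]
print(max(v_ea_s2))     # 10, the largest utility of <ea> in this sequence
```

The recursion enumerates every match, which is fine for a slide-sized example but is not how USpan itself computes utilities.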
Outline: Related work
• Utility Itemset/Pattern Mining
• Utility-based Sequential Pattern Mining

Utility Itemset/Pattern Mining
• Mining high utility itemsets is much more challenging than discovering frequent itemsets, because the fundamental downward closure property of frequent itemset mining does not hold for utility itemsets.
• The addition of ordering information in sequences makes mining high utility sequences fundamentally different from, and much more challenging than, mining utility itemsets.

Utility-based Sequential Pattern Mining
• Mining frequent sequences results in many patterns being mined.
• Patterns with frequencies lower than the minimum support are filtered out.

Sequence Utility Framework
• I = {i1, i2, ..., in} is a set of distinct items.
• Each item ik ∈ I (1 ≤ k ≤ n) is associated with a quality (or external utility), denoted p(ik).
• A quantitative item, or q-item, is an ordered pair (i, q), where i ∈ I is an item and q is a positive number representing the quantity (or internal utility).

Sequence Utility Framework
• A quantitative itemset, or q-itemset, consists of more than one q-item and is denoted l = [(ij1, q1)(ij2, q2)...(ijn, qn)].
• A quantitative sequence, or q-sequence, is an ordered list of q-itemsets, denoted s = <l1 l2 ... lm>.
• A q-sequence database S consists of sets of tuples <sid, s>.

Sequence Utility Framework - Definitions
• Example: (a, 4), [(a, 4)(e, 2)] and [(a, 4)(b, 1)(e, 2)] are contained in [(a, 4)(b, 1)(e, 2)], but [(a, 2)(e, 2)] and [(a, 4)(c, 1)] are not.
• <(b, 2)>, <[(b, 2)(e, 3)]>, <[(b, 2)][(e, 3)](a, 2)>

Sequence Utility Framework - Definitions
• <(e, 5)[(c, 2)(f, 1)](b, 2)> is a 4-q-sequence with size 3.
• <ea> is a 2-sequence with size 2.
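
The containment, length and size definitions above can be sketched as follows. The helper names (`length`, `size`, `itemset_contained`, `sequence_contained`) are hypothetical, and the q-subsequence `<(e, 5)(b, 2)>` is an extra illustration that does not come from the slides.

```python
from typing import List, Tuple

QItem = Tuple[str, int]        # (item, quantity)
QItemset = List[QItem]
QSequence = List[QItemset]


def length(qseq: QSequence) -> int:
    """Number of q-items in the q-sequence (the k in 'k-q-sequence')."""
    return sum(len(itemset) for itemset in qseq)


def size(qseq: QSequence) -> int:
    """Number of q-itemsets in the q-sequence."""
    return len(qseq)


def itemset_contained(inner: QItemset, outer: QItemset) -> bool:
    """A q-itemset is contained only if item AND quantity both match."""
    return set(inner) <= set(outer)


def sequence_contained(inner: QSequence, outer: QSequence) -> bool:
    """Greedy left-to-right check that `inner` is a q-subsequence of `outer`."""
    pos = 0
    for itemset in inner:
        while pos < len(outer) and not itemset_contained(itemset, outer[pos]):
            pos += 1
        if pos == len(outer):
            return False
        pos += 1  # the next q-itemset must match strictly later
    return True


s = [[("e", 5)], [("c", 2), ("f", 1)], [("b", 2)]]
print(length(s), size(s))  # 4 3

big = [("a", 4), ("b", 1), ("e", 2)]
print(itemset_contained([("a", 4), ("e", 2)], big))  # True
print(itemset_contained([("a", 2), ("e", 2)], big))  # False: quantity of a differs
print(itemset_contained([("a", 4), ("c", 1)], big))  # False: no (c, 1) in the q-itemset
print(sequence_contained([[("e", 5)], [("b", 2)]], s))  # True
```

Representing q-items as `(item, quantity)` tuples makes the quantity part of the identity, which is exactly why `[(a, 2)(e, 2)]` fails the containment test.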
Sequence Utility Framework - Definitions
• t = <ea>
• t's utility in the q-sequence s4 in Table 2 is v(t, s4) = {u(<(e, 2)(a, 7)>), u(<(e, 2)(a, 4)>)} = {16, 10}.
• t's utility in S is v(t) = {v(t, s2), v(t, s4), v(t, s5)} = {{8, 10}, {16, 10}, {15, 7}}.

High Utility Sequential Pattern Mining
• Definition 10. (High Utility Sequential Pattern) Because a sequence may have multiple utility values in the q-sequence context, we choose the maximum utility as the sequence's utility. The maximum utility of a sequence t is denoted umax(t); it is obtained by summing, over the q-sequences that contain t, the largest utility of t in each.
• The utility of sequence <ea> is umax(<ea>) = 10 + 16 + 15 = 41.
• If the minimum utility is ξ = 40, then sequence s = <ea> is a high utility sequential pattern since umax(s) = 41 ≥ ξ.
• Sequence t is a high utility sequential pattern if and only if umax(t) ≥ ξ, where ξ is the user-specified minimum utility.

Outline: USpan algorithm
• Lexicographic Q-Sequence Tree
• Concatenations
• Width Pruning
• Depth Pruning
• USpan Algorithm

USpan Algorithm
• USpan is composed of
  • a lexicographic q-sequence tree
  • two concatenation mechanisms
  • two pruning strategies

Lexicographic Q-Sequence Tree
• Adapts the concept of the Lexicographic Sequence Tree.
• Given a k-sequence t, appending a new item to the end of t to form a (k+1)-sequence is called concatenation.
• If the size of t does not change, the operation is called I-Concatenation. Otherwise, if the size increases by one, it is called S-Concatenation.
• I-Concatenating and S-Concatenating <ea> with b result in <e(ab)> and <eab>, respectively.
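
The two concatenation operations can be sketched on plain (non-quantitative) sequences. The tuple-of-tuples representation and the names `i_concatenate`, `s_concatenate` and `show` are assumptions made for this illustration, not the paper's data structures.

```python
from typing import Tuple

Itemset = Tuple[str, ...]
Sequence = Tuple[Itemset, ...]


def i_concatenate(t: Sequence, item: str) -> Sequence:
    """Append `item` to the LAST itemset: size unchanged, length + 1."""
    if not t:
        raise ValueError("I-Concatenation needs a non-empty sequence")
    return t[:-1] + (t[-1] + (item,),)


def s_concatenate(t: Sequence, item: str) -> Sequence:
    """Append `item` as a NEW itemset: size + 1, length + 1."""
    return t + ((item,),)


def show(t: Sequence) -> str:
    """Render in the slide notation, e.g. <e(ab)>."""
    parts = [x[0] if len(x) == 1 else "(" + "".join(x) + ")" for x in t]
    return "<" + "".join(parts) + ">"


ea: Sequence = (("e",), ("a",))
print(show(i_concatenate(ea, "b")))  # <e(ab)>
print(show(s_concatenate(ea, "b")))  # <eab>
print(len(i_concatenate(ea, "b")), len(s_concatenate(ea, "b")))  # sizes: 2 3
```

Both operations increase the length by one; only S-Concatenation changes the size, which is what distinguishes the two kinds of children in the tree.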
Lexicographic Q-Sequence Tree
• Assume two k-sequences ta and tb are concatenated from sequence t. Then ta < tb if
  • i) ta is I-Concatenated from t and tb is S-Concatenated from t, or
  • ii) both ta and tb are I-Concatenated, or both are S-Concatenated, from t, but the concatenated item in ta is alphabetically smaller than that in tb.
• Examples: <(ab)> < <(ab)b>, <(abc)> < <(ab)b>, <(ab)c> < <(ab)d> and <(ab)(de)> < <(ab)(df)>

Lexicographic Q-Sequence Tree
• Definition 11. (Lexicographic Q-sequence Tree) A lexicographic q-sequence tree (LQS-Tree) T is a tree structure satisfying the following rules:
  • Each node in T is a sequence along with the utility of the sequence, while the root is empty.
  • Any node's child is either an I-Concatenated or an S-Concatenated sequence node of the node itself.
  • All the children of any node in T are listed in an incremental and alphabetical order.

Lexicographic Q-Sequence Tree
• v(<ea>) = {{8, 10}, {16, 10}, {15, 7}} and umax(<ea>) = 41.
• "Can the maximum utility of any child of <ea> be calculated by simply adding the highest utility of the q-items after <ea> to umax(<ea>)?" Answer: no.

Lexicographic Q-Sequence Tree
• Depth-first search
• How can we generate the utilities of a node's children by concatenating the corresponding items? Answer: Concatenations
• How can we avoid checking unpromising children? Answer: Width pruning
• When should USpan stop searching deeper nodes? Answer: Depth pruning

Concatenations
• Utility matrix
• Each entry is a pair (utility, remaining utility)

Concatenations: I-Concatenation
• I-Concatenation for the sequence <b> in s4
• v(<(be)>, s4) = {10 + 2, 5 + 2} = {12, 7}

Concatenations: S-Concatenation
• S-Concatenation for the sequence <(be)> in s4
• Candidates: <(be)a>, <(be)b>, <(be)d>, <(be)e>
• v(<(be)a>, s4) = {12 + 14, 12 + 8} = {26, 20}

Concatenations
• To calculate the utilities of the children of a given sequence (e.g. <(be)> in s4), two things are needed:
  • The positions of the last q-items of the q-subsequences that match the sequence (e.g. e1, e3)
    • Pivot: e1 (the first place where <(be)> ends)
    • Ending q-items: e3 (other places where <(be)> ends)
  • The utilities of the sequence (e.g. 12, 7)
• These values are stored in the LQS-Tree.

Width Pruning
• Avoid constructing unpromising items into the LQS-Tree
• Sequence-weighted utilization (SWU) of sequence t: $SWU(t) = \sum_{s' \sim t \,\wedge\, s' \subseteq s \,\wedge\, s \subseteq S} u(s)$
• SWU(<f>) = u(s1) = 24
• SWU(<ea>) = u(s2) + u(s4) + u(s5) = 41 + 50 + 37 = 128

Width Pruning
• Sequence-weighted Downward Closure Property (SDCP)
  • Given a utility-based sequence database S and two sequences t1 and t2, where t2 contains t1, then SWU(t2) ≤ SWU(t1).
• SWU(<ea>) = u(s2) + u(s4) + u(s5) = 41 + 50 + 37 = 128
• SWU(<e(ab)>) = u(s2) + u(s4) + u(s5) = 41 + 50 + 37 = 128
• SWU(<e(ad)>) = u(s2) + u(s4) = 41 + 50 = 91
• Item i is a promising item to t if, after concatenating i to t, the new sequence t' satisfies SWU(t') ≥ ξ.

Depth Pruning
• Stop USpan from going deeper in the LQS-Tree
• Even if all the utilities of the remaining q-items are counted into the utility of the sequence, the cumulative utility still cannot satisfy ξ.

Depth Pruning
• Given a sequence t and S, the maximum utilities of t and t's offspring are no more than $\sum_{i \in s' \,\wedge\, s' \sim t \,\wedge\, s' \subseteq s \,\wedge\, s \subseteq S} \left( u_{rest}(i, s) + u(s') \right)$
  • i: the pivot of t in s
  • urest: the remaining utility at q-item i in q-sequence s
• E.g. t = <ea>

USpan Algorithm (annotations on the pseudocode)
• includes depth pruning strategy
• generate candidates
• width pruning strategy
• deal with I-Concatenation
• deal with S-Concatenation

Outline: Experiment
• Settings
• Results

Experimental Settings
• Data Sets
  • DS1: C10 T2.5 S4 I2.5 DB10k N1k
  • DS2: C8 T2.5 S6 I2.5 DB10k N10k
  • The average number of elements in a sequence is 10 for DS1 (8 for DS2).
  • The average number of items in an element is 2.5 (2.5).
  • A maximal pattern has 4 (6) elements on average, and each element has 2.5 (2.5) items on average.
  • The data set contains 10k (10k) sequences.
  • The number of items is 1k (10k).

Experimental Settings
• Data Sets
  • DS3: online shopping transactions
    • There are 811 distinct products, 350,241 transactions and 59,477 customers.
    • The average number of elements in a sequence is 5.
    • The maximum length of a customer's sequence is 82.
    • The most popular product has been ordered 2176 times.
  • DS4: mobile communication transactions
    • The dataset contains a history of 100,000 mobile calls.
    • There are 67,420 customers in the dataset.
    • The maximum length of a sequence is 152.

Experimental Results (figures)
• Execution Time & (#Patterns)
• Distribution in Terms of Length
• Pruning
• Scalability
• Utility vs Frequent

Conclusions
• Provides a systematic statement of a generic framework for high utility sequential pattern mining.
• Proposes an efficient algorithm, USpan
  • I-Concatenation, S-Concatenation
  • Width pruning, depth pruning
• USpan can efficiently identify high utility sequences in large-scale data with low minimum utility.

Discussions
• Strongest parts of this paper
  • USpan grows the tree by DFS and does not need to store the whole LQS-Tree in memory.
  • Two pruning strategies are proposed and work well in the experiments.
  • The tables only need to be calculated once, at the beginning.
• Weak points of this paper
  • Each sequence needs a table to store its values, and all the tables are stored in memory.
  • Each tree node contains much information.
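
To make the per-sequence table mentioned in the weak points concrete, here is a minimal sketch of a utility matrix whose entries are (utility, remaining utility) pairs, as named on the Concatenations slides. The quality table `QUALITY`, the contents of `S4`, and the function names are assumptions for illustration; they are chosen to agree with the numbers quoted on the slides (u(s4) = 50 and v(<(be)>, s4) = {12, 7}).

```python
from typing import Dict, List, Tuple

# Assumed quality (external utility) per item, consistent with the slide numbers.
QUALITY: Dict[str, int] = {"a": 2, "b": 5, "c": 4, "d": 3, "e": 1, "f": 1}

# Assumed contents of q-sequence s4: a list of q-itemsets of (item, quantity).
S4: List[List[Tuple[str, int]]] = [
    [("b", 2), ("e", 2)],
    [("a", 7), ("d", 3)],
    [("a", 4), ("b", 1), ("e", 2)],
]

Matrix = List[Dict[str, Tuple[int, int]]]


def utility_matrix(qseq: List[List[Tuple[str, int]]]) -> Matrix:
    """One row per q-itemset; each entry is (utility, remaining utility).

    Remaining utility = total utility of all q-items that come after this
    q-item in the q-sequence (later items in the same q-itemset included).
    """
    remaining = sum(q * QUALITY[i] for itemset in qseq for i, q in itemset)
    matrix: Matrix = []
    for itemset in qseq:
        row: Dict[str, Tuple[int, int]] = {}
        for item, qty in itemset:
            utility = qty * QUALITY[item]
            remaining -= utility
            row[item] = (utility, remaining)
        matrix.append(row)
    return matrix


def i_concat_utilities(matrix: Matrix, base: str, item: str) -> List[int]:
    """Utilities of <(base item)> for a single-item `base`, read off the matrix."""
    return [row[base][0] + row[item][0] for row in matrix if base in row and item in row]


m = utility_matrix(S4)
print(m[0])                             # {'b': (10, 40), 'e': (2, 38)}
print(m[2])                             # {'a': (8, 7), 'b': (5, 2), 'e': (2, 0)}
print(i_concat_utilities(m, "b", "e"))  # [12, 7]
```

Each q-sequence gets one such matrix, built once up front; this is the memory cost the weak-point bullet refers to, since every matrix holds two integers per q-item.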
Discussions
• Possible improvements
  • Design algorithms for even bigger datasets and better pruning strategies.
  • Shrink the number of tables or the number of elements in a table.
• Possible extensions
  • The metric of "utility"
  • Items with positive and negative unit profits
  • Time constraints (as in GSP)
• Possible applications
  • Business decision-making
  • Analysis of game records of experts
  • But "item" and "utility" need to be specified first

END & Thanks for your attention