Laboratory Research and Results Presentation
Content and Knowledge Management Laboratory (B), Data Mining Part
Director: Anthony J. T. Lee
Presenter: Wan-chuen Lin

Outline
– Introduction of basic data mining concepts related to our research topics
– Brief description of doctoral research
– Topic 1: Mining frequent itemsets with multi-dimensional constraints
– Topic 2: Mining the inter-transactional association rules of multi-dimensional interval patterns
– Topic 3: Inter-sequence association rules mining
– Topic 4: Mining association rules among time-series data

Introduction of Data Mining
– Data mining is the task of discovering knowledge from large amounts of data.
– One of the fundamental data mining problems, frequent itemset mining, covers a broad spectrum of mining topics, including association rules, sequential patterns, etc.
– Frequent itemset mining discovers all the itemsets whose supports in the database exceed a user-specified threshold.

Introduction of Association Rules
– An association rule is of the form X→Y, where X and Y are both frequent itemsets in the given database and X∩Y = ∅.
– The support of X→Y is the percentage of transactions in the given database that contain both X and Y, i.e., P(X∪Y).
– The confidence of X→Y is the percentage of transactions in the given database containing X that also contain Y, i.e., P(Y|X).

Introduction of Sequential Patterns
– A sequence is an ordered list of itemsets, denoted by <s1 s2 … sl>, where each sj is an itemset. sj is also called an element of the sequence and is denoted as (x1 x2 … xm), where each xk is an item.
– The support of a sequence α in a sequence database is the number of tuples containing α.
– A sequence α is called a sequential pattern if support(α) ≥ min-support.

Algorithm for Mining Frequent Itemsets: Apriori
– Candidate set generation-and-test.
– Level-wise: it iteratively generates candidate k-itemsets (Ck) from the previously found frequent (k-1)-itemsets (Lk-1), and then checks the supports of the candidates to form the frequent k-itemsets (Lk).
– Lk-1 → (join) → Ck → (support check) → Lk
– (A rough Python sketch of this loop follows the PrefixSpan slide below.)

Algorithm for Mining Frequent Itemsets (cont'd): FP-growth
– The method constructs a compressed frequent pattern tree, called the FP-tree.
– A divide-and-conquer strategy recursively decomposes the mining task into a set of smaller tasks over conditional databases, and concatenates the suffix itemset with the frequent itemsets generated from each conditional FP-tree.

Algorithm for Mining Sequential Patterns: PrefixSpan
– It first finds the length-1 sequential patterns in the target database, and then partitions the database into smaller projected databases, one for each sequential pattern found so far, using that pattern as the prefix.
– The sequential patterns are mined by constructing the corresponding projected databases and mining each of them recursively.
– It preserves the element order of each tuple during the mining process.
– (A sketch of the prefix-projection step is also given below.)
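To make the Apriori generate-and-test loop above concrete, here is a minimal Python sketch. The tiny transaction database, the threshold, and the simple join/prune implementation are illustrative assumptions, not code or data from this work.

from itertools import combinations

# Minimal Apriori sketch: level-wise candidate generation and support checking.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Join step: build candidate k-itemsets from frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Support check: keep candidates contained in enough transactions
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support}
        frequent |= current
        k += 1
    return frequent

if __name__ == "__main__":
    # Made-up transactions and threshold, for illustration only.
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"},
          {"b", "c", "d"}, {"a", "b", "c", "d"}]
    print(apriori(db, min_support=3))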
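The prefix-projection idea of PrefixSpan can likewise be sketched in a few lines. The sketch below handles only sequences of single items (itemset elements such as (ab) and other refinements are omitted), and the sample database, names, and threshold are assumptions for illustration.

# Minimal PrefixSpan-style sketch for sequences of single items.
def prefixspan(sequences, min_support, prefix=None):
    prefix = prefix or []
    patterns = []
    # Count, per item, the number of sequences that can extend the current prefix
    counts = {}
    for seq in sequences:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, count in counts.items():
        if count < min_support:
            continue
        new_prefix = prefix + [item]
        patterns.append((new_prefix, count))
        # Projected database: suffixes after the first occurrence of the item
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        patterns.extend(prefixspan(projected, min_support, new_prefix))
    return patterns

if __name__ == "__main__":
    # Made-up sequence database, for illustration only.
    db = [list("abcb"), list("acbc"), list("cabc"), list("abcc")]
    for pattern, support in prefixspan(db, min_support=3):
        print(pattern, support)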
Brief Description of Doctoral Research
– Mining calling path patterns in GSM networks.
– Two problems of mining calling path patterns: mining PMFCPs and mining periodic PMFCPs.
– Graph structures [the (periodic) frequent calling path graph] and graph-based mining algorithms, based on a depth-first search strategy.
– No candidate paths are generated, and the database is scanned only once if the whole graph structure can be held in main memory.

Brief Description of Doctoral Research (cont'd)
– Bioinformatic data mining:
  – Gene clustering.
  – Sequence comparison, alignment and compression (DNA sequences, protein sequences).
  – Applications: phylogenetic trees to predict the function of a new protein; the relationship between DNA sequences and disease.

Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints
– Frequent itemset mining often generates a very large number of frequent itemsets, yet only a subset of the frequent itemsets and association rules is of interest to users, who need additional post-processing to find the useful ones.
– Constraint-based mining pushes user-specified constraints deep inside the mining process to improve performance.
– With multi-dimensional items, constraints can be imposed on multiple dimensional attributes.

Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints (cont'd)
– Each multi-dimensional item has an itemID and m dimensional attributes a1, a2, …, am.
– An item ik is represented as (k1, k2, …, km); for an item A = (A1, A2, …, Am), the value A1 can also be written as A.a1.

Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints (cont'd)
– Multi-dimensional constraints can be categorized according to constraint properties: anti-monotone, monotone, convertible and inconvertible.
– They can also be classified according to the number of sub-constraints included:
  – A single constraint over multiple dimensions, e.g., max(S.cost) ≤ min(S.price).
  – A conjunction and/or disjunction of multiple sub-constraints, e.g., (C1: S.cost ≤ v1) ∧ (C2: S.price ≥ v2).

Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints (cont'd)
– We extend constraints to range over multi-dimensional itemsets and develop algorithms for mining frequent itemsets with multi-dimensional constraints by extending CFG (Constrained Frequent Pattern Growth).
– Overview of our algorithm:
  – Phase 1: frequency check
  – Phase 2: constraint check (a small sketch of this check is given after the Topic 2 example below)
  – Phase 3: conditional database construction

Example: Cam ≡ max(S.cost) ≤ min(S.price)
– Database: BECA, BEA, DA, BDA, BDE, BDECA, BEC, BDEC, DEC, BDC
  Frequent items: B, D, E, C, A
  Constraint checks: C(BDECA) = false; C(B) = C(D) = C(E) = C(C) = C(A) = true
– A-conditional database: BEC, BE, D, BD, BDEC
  Frequent items: B, D, E, C
  Constraint checks: C(BDECA) = false; C(BA) = false, C(EA) = true, C(DA) = true, C(CA) = false
– EA-conditional database: D
  Frequent items: (none)

Topic 2: Mining Inter-transactional Association Rules of Multi-dimensional Interval Patterns
– A transaction could be the items bought by the same customer, the events that happened on the same day, and so on.
– Intra-transactional association rules: associations among items within the same transaction. Ex: buy(X, diapers) => buy(X, beer) [support = 80%].
– Inter-transactional association rules: associations among items in different transactions. Ex: if the prices of IBM and SUN go up, Microsoft's will most likely [80%] go up the next day.

Topic 2: Mining Inter-transactional Association Rules of Multi-dimensional Interval Patterns (cont'd)
– Interval data differ from point data in that they occupy regions of non-zero size. Multi-dimensional intervals can be represented as line segments (1-D), rectangles (2-D), hyper-cubes (n-D), etc.
– Extended item: denoted as (Location)<Size>.
– Reference point: the smallest (Location) among all (Location)<Size>.
– Maxspan: a sliding window; only associations covered by it are considered.
– (A small normalization sketch is given after the algorithm example below.)

Example
– There are two cubes in the 3-dimensional space: (0,2,1)<1,1,1> and (1,1,0)<2,2,1>.
– Reference point: (0,1,0).
– Relative to the reference point, the two items are denoted as (0,1,1)<1,1,1> and (1,0,0)<2,2,1>.

Algorithm (Apriori-like): Example
– Support threshold: 10% (10% × 20 = 2); maxspan: 4.
– L1: (0,0)<1,1>, (0,0)<1,2>, (0,0)<1,3>, (0,0)<2,1>

Algorithm (Apriori-like): Example (cont'd)
– Reminder: the Apriori-like algorithm iterates Lk-1 → (join) → Ck → (support check) → Lk.
– L2: {(0,0)<1,1>, (1,1)<2,1>}, {(1,0)<1,1>, (0,1)<1,2>}, {(0,0)<1,2>, (2,0)<2,1>}, {(0,0)<1,3>, (3,0)<1,2>}
– L3: {(3,0)<1,1>, (2,1)<1,2>, (0,3)<1,3>}, {(1,0)<1,1>, (0,1)<1,2>, (2,1)<2,1>}, {(3,0)<1,1>, (0,3)<1,3>, (4,1)<2,1>}, {(2,0)<1,2>, (0,2)<1,3>, (4,0)<2,1>}
– L4: {(0,3)<1,3>, (4,1)<2,1>, (2,1)<1,2>, (3,0)<1,1>}
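As an illustration of the Phase 2 constraint check for Topic 1, the sketch below evaluates the anti-monotone constraint Cam ≡ max(S.cost) ≤ min(S.price) over multi-dimensional items. The cost and price values are not given in the slides; they are invented here, but chosen so that the checks reproduce the truth values shown in the example above.

# Hypothetical multi-dimensional items: itemID -> attribute values.
# The cost/price numbers are invented for illustration only.
items = {
    "A": {"cost": 2, "price": 4},
    "B": {"cost": 5, "price": 9},
    "C": {"cost": 6, "price": 7},
    "D": {"cost": 3, "price": 8},
    "E": {"cost": 1, "price": 6},
}

def c_am(itemset):
    """Anti-monotone constraint Cam: max(S.cost) <= min(S.price)."""
    return (max(items[i]["cost"] for i in itemset)
            <= min(items[i]["price"] for i in itemset))

# Because Cam is anti-monotone, an itemset that violates it cannot be repaired
# by adding more items, so the corresponding branch can be pruned immediately.
print(c_am({"D", "A"}))  # True  -> C(DA) holds, keep growing
print(c_am({"B", "A"}))  # False -> C(BA) fails, prune this branch
print(c_am({"C", "A"}))  # False -> C(CA) fails, prune this branch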
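For Topic 2, the sketch below normalizes a set of extended items against their reference point, as in the cube example above. Representing an extended item as a (location, size) pair, the function names, and the per-dimension reading of the maxspan window are my own assumptions for illustration, not the exact definitions used in this work.

# An extended item (Location)<Size> is modeled as a pair of tuples (location, size).
def normalize(extended_items):
    """Express all items relative to the reference point, i.e. the
    component-wise smallest location among all items."""
    locations = [loc for loc, _ in extended_items]
    reference = tuple(min(coords) for coords in zip(*locations))
    shifted = [(tuple(l - r for l, r in zip(loc, reference)), size)
               for loc, size in extended_items]
    return reference, shifted

def within_maxspan(extended_items, maxspan):
    """Keep only associations covered by the sliding window: every relative
    offset must be at most maxspan (checked per dimension here)."""
    _, shifted = normalize(extended_items)
    return all(offset <= maxspan for loc, _ in shifted for offset in loc)

# The two cubes of the 3-D example: (0,2,1)<1,1,1> and (1,1,0)<2,2,1>.
cubes = [((0, 2, 1), (1, 1, 1)), ((1, 1, 0), (2, 2, 1))]
print(normalize(cubes))        # reference (0, 1, 0); items become (0,1,1)<1,1,1>, (1,0,0)<2,2,1>
print(within_maxspan(cubes, maxspan=4))  # True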
Topic 3: Inter-sequence Association Rules Mining
– Inter-sequence model: each transaction has a transaction ID (1-10) and a transaction time (1-10) and is associated with a sequence (the full database is listed in the example below).

Topic 3: Inter-sequence Association Rules Mining (cont'd)
– Extended sequence (denoted as Δt<s1 s2 … sl>): a sequence s = <s1 s2 … sl> at time point Δt.
– Algorithm:
  – Step 1: use PrefixSpan to find all sequential patterns.
  – Step 2: use an Apriori-like method to check whether an extended-sequence set is large.
– L-buckets (list buckets) and C-buckets (candidate buckets) are used to improve mining efficiency.
– (A rough support-counting sketch for extended-sequence sets is given at the end of this part.)

Example (min_support = 3, maxspan = 2)
– The database:
  Tran. ID  Tran. Time  Sequence
  1         1           <c(ab)d(ad)>
  2         2           <(bc)cb>
  3         3           <e(ac)bac>
  4         4           <b(ab)cc>
  5         5           <(ab)c>
  6         6           <dd(ac)bd>
  7         7           <bc>
  8         8           <acc>
  9         9           <ab>
  10        10          <ceacc(ce)>
– PrefixSpan sequential patterns:
  – <a>, <b>, <c>
  – <ab>, <(ab)>, <ac>, <ba>, <bc>, <cb>, <cc>
  – <acc>

Example (cont'd)
– L1: {Δ0<a>}, {Δ0<b>}, {Δ0<c>}
– Candidates C2:
  {Δ0<a>, Δ1<a>}, {Δ0<a>, Δ2<a>},
  {Δ0<a>, Δ1<b>}, {Δ0<b>, Δ1<a>}, {Δ0<a>, Δ2<b>}, {Δ0<b>, Δ2<a>},
  {Δ0<a>, Δ1<c>}, {Δ0<c>, Δ1<a>}, {Δ0<a>, Δ2<c>}, {Δ0<c>, Δ2<a>},
  {Δ0<b>, Δ1<b>}, {Δ0<b>, Δ2<b>},
  {Δ0<b>, Δ1<c>}, {Δ0<c>, Δ1<b>}, {Δ0<b>, Δ2<c>}, {Δ0<c>, Δ2<b>},
  {Δ0<c>, Δ1<c>}, {Δ0<c>, Δ2<c>}

Example (cont'd)
– Apriori-like iteration: Lk-1 → (join) → Ck → (support check) → Lk.
– C2: the candidate sets listed above.
– L2: {Δ0<ab>}, {Δ0<(ab)>}, {Δ0<ac>}, {Δ0<ba>}, {Δ0<bc>}, {Δ0<cb>}, {Δ0<cc>}, together with the candidates in C2 that pass the support check.

Topic 4: Mining Association Rules among Time-series Data
– A line is an ordered and continuous list of the form {t1, t2, …, tm} describing a property of the subject over time.
– Step 1: find the frequent lines and points in each line-set (Apriori-like algorithm).
– Step 2: use the combinations of those frequent sets to find the associations among them (inter-transactional association rules).

Time-series Data Approximation
– For the algorithm's efficiency, equally partition the fluctuation rate into several classes.

Step 1: Line Discovery (Apriori-like)
Step 2: Association Rule Mining

Data Mining Part
Thank You!
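A closing note on Step 2 of Topic 3: the support of a set of extended sequences can be counted roughly as sketched below. Itemset elements such as (ab) are flattened to consecutive single items, and the dictionary representation, function names, and containment test are simplifying assumptions; this is not the algorithm (with its L-buckets and C-buckets) described in the slides.

# Transaction time -> sequence, flattened to single items (a simplification:
# itemset elements such as (ab) are treated as consecutive items here).
db = {1: "cabdad", 2: "bccb", 3: "eacbac", 4: "babcc", 5: "abc",
      6: "ddacbd", 7: "bc", 8: "acc", 9: "ab", 10: "ceaccce"}

def contains(sequence, pattern):
    """True if pattern occurs as a subsequence of sequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(extended_set, maxspan):
    """Count reference times t such that, for every (offset, pattern) in the
    set, the transaction at time t + offset contains the pattern."""
    count = 0
    for t in db:
        if all(offset <= maxspan
               and (t + offset) in db
               and contains(db[t + offset], pattern)
               for offset, pattern in extended_set):
            count += 1
    return count

# {Δ0<a>, Δ1<c>}: <a> at the reference time and <c> one time unit later.
print(support([(0, "a"), (1, "c")], maxspan=2))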