advertisement

A Short Introduction to Sequential Data Mining Koji IWANUMA Hidetomo NABESHIMA University of Yamanashi The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence Two Main Frameworks of Sequential Mining Sequential pattern mining for multiple data sequences Sequence ID Purchase data record 1 <bread, cheese> 2 <(wheat, milk), bread, (berry, sausage)> 3 <(bread, pumpkin, sausage)> 4 <bread, cheese, sausage> 5 <cheese> Sequential pattern mining for a single data sequence Data sequence <S1 S2 S3 S4 S5 S6 S7 … … Sn> 2 J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji What Is Sequential Pattern Mining? Given a set of sequences, find the complete set of frequent subsequences A sequence : < (ef) (ab) (df) c b > A sequence database SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> An element may contain a set of items. Items within an element are unordered and we list them alphabetically. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> Given support threshold min_sup =2, <(ab)c> is a sequential pattern 3 Challenges on Sequential Pattern Mining A huge number of possible sequential patterns are hidden in databases A mining algorithm should find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold be highly efficient, scalable, involving only a small number of database scans be able to incorporate various kinds of user-specific constraints J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji 4 Sequential Pattern Mining Algorithms for Multiple Data Sequences Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96) Pattern-growth methods: FreeSpan & PrefixSpan (Han et al.@KDD’00; Pei, et al.@ICDE’01) Vertical format-based mining: SPADE (Zaki@Machine Leanining’00) Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim@VLDB’99; Pei, Han, Wang @ CIKM’02) Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @SDM’03) J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji 5 Mining Sequential Patterns from a Very-Long Single Sequence A series of daily news paper articles < > typhoon flood, landslide typhoon flood, landslide <typhoon (flood, landslide)> 6 Sequential Pattern Mining Algorithms for a Single data Sequence Discovery of frequent episodes in event sequences, based on a sliding window system [Mannila 1998]： The frequency measure becomes anti-monotonic, but has a problem, i.e., a duplicate counting of an occurrence. Asynchronous periodic pattern mining [Yang et.al 2000, Huang 2004]： Any anti-monotonic frequency measures are not investigated. On-line approximation algorithm for mining frequent items, not for frequent subsequences Lossy counting algorithm [Manku and Motwani, VLDB’02] 7 Research in Our Laboratory Sequential Data Mining from a very-large single data sequence. Main target: sequential textual data, especially, newspaper-articles corpora Objectives: to generate a robust and useful large-scale event-sequences corpus. Application 1： topic tracking/detection in information retrieval. Application 2： automated content-tracking in WEB. Application 3: scenario/story semi-automatic creation Ordinary temporal data analysis: various log data in computer systems, genetic information, etc. 8 Technical Topics (1/2) A new framework for extracting frequent subsequences from a single long data sequence: in IEEE Inter. Conf. on Data Mining 2005 (ICDM2005): A new rational frequency measures, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting. A fast on-line algorithm for a some limited case 9 Technical Topics (1/2) On-going current works and future work On-line rational filters based on confidence criteria and/or information-gain for eliminating redundant valueless sequences from system output Methods for finding meta-structures embedded in huge amount of frequent sequences generated by a system A method using compression based on context-free grammarinference/learning More fast extraction algorithm based on a method for simultaneously searching multiple strings over compressed data. 10 References: Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/~hanj 11 Thanks for your attention!! 12