A Short Introduction to Sequential
Data Mining
Koji IWANUMA
Hidetomo NABESHIMA
University of Yamanashi
The First Franco-Japanese Symposium on Knowledge Discovery in
System Biology, September 17, Aix-en-Provence
Two Main Frameworks of Sequential
Mining
Sequential pattern mining for multiple data sequences
Sequence ID   Purchase data record
1             <bread, cheese>
2             <(wheat, milk), bread, (berry, sausage)>
3             <(bread, pumpkin, sausage)>
4             <bread, cheese, sausage>
5             <cheese>
Sequential pattern mining for a single data sequence
Data sequence
<S1 S2 S3 S4 S5 S6 S7 … Sn>
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete
set of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
A sequence database
SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
An element may contain a set of items.
Items within an element are unordered
and we list them alphabetically.
<a(bc)dc> is a subsequence
of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern: it is a subsequence of sequences 10 and 30
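The two claims above can be checked with a minimal sketch. It assumes each sequence is stored as a list of item sets; the helper names `parse`, `is_subsequence` and `support` are illustrative and do not come from the slides.

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def parse(elements: List[str]) -> Sequence:
    """Turn ["a", "abc", "ac"] into a list of item sets, one per element."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """True if every pattern element is contained in a distinct element
    of `sequence`, with the original order preserved."""
    pos = 0
    for element in pattern:
        # Greedy matching: move to the earliest element that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def support(pattern: Sequence, database: Dict[int, Sequence]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# The sequence database from the slide (SID -> sequence).
database = {
    10: parse(["a", "abc", "ac", "d", "cf"]),
    20: parse(["ad", "c", "bc", "ae"]),
    30: parse(["ef", "ab", "df", "c", "b"]),
    40: parse(["e", "g", "af", "c", "b", "c"]),
}

print(is_subsequence(parse(["a", "bc", "d", "c"]), database[10]))  # <a(bc)dc>
print(support(parse(["ab", "c"]), database))                       # <(ab)c>
```

Greedy earliest matching is sufficient here because taking the earliest possible match for one element never rules out a later match for the next.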
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden
in databases
A mining algorithm should
find the complete set of patterns, when possible,
satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small
number of database scans
be able to incorporate various kinds of user-specific
constraints
Sequential Pattern Mining Algorithms
for Multiple Data Sequences
Apriori-based method: GSP (Generalized Sequential Patterns:
Srikant & Agrawal @ EDBT’96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et
al. @ KDD’00; Pei, et al. @ ICDE’01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning’00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis,
Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar
@SDM’03)
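To give a feel for the pattern-growth idea, here is a heavily simplified sketch in the spirit of PrefixSpan. It is not the published algorithm: it assumes every element is a single item (no item sets) and copies the projected suffixes instead of using pseudo-projection. The function name `prefixspan` and the toy data are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def prefixspan(database: List[List[str]], min_sup: int) -> Dict[Tuple[str, ...], int]:
    """Return every frequent sequential pattern with its support.
    Simplification: each sequence element is a single item."""
    results: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[List[str]]) -> None:
        # Count each item at most once per projected suffix.
        counts: Dict[str, int] = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue  # infrequent extension: prune this branch
            new_prefix = prefix + (item,)
            results[new_prefix] = count
            # Project: keep only what follows the first occurrence of item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(new_prefix, new_projected)

    grow((), database)
    return results


# Hypothetical toy database of single-item sequences.
toy_db = [list("abcb"), list("acb"), list("ab"), list("bc")]
patterns = prefixspan(toy_db, min_sup=3)
for pattern, sup in sorted(patterns.items()):
    print(pattern, sup)
```

Each recursive call scans only the projected suffixes, so the data to be examined shrinks as the prefix grows.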
Mining Sequential Patterns from a
Very-Long Single Sequence
A series of daily newspaper articles
[Figure: one long sequence of articles in which "typhoon" is repeatedly followed by "flood, landslide"]
<typhoon (flood, landslide)>
Sequential Pattern Mining Algorithms
for a Single Data Sequence
Discovery of frequent episodes in event sequences, based
on a sliding window system [Mannila 1998]:
The frequency measure is anti-monotonic, but it can count a
single occurrence more than once (duplicate counting).
Asynchronous periodic pattern mining [Yang et al. 2000,
Huang 2004]:
No anti-monotonic frequency measures are
investigated.
On-line approximation algorithm for mining frequent items,
not for frequent subsequences
Lossy counting algorithm [Manku and Motwani, VLDB’02]
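The duplicate-counting problem of window-based frequency can be seen in a small sketch. It assumes one event per time step and counts only windows that lie fully inside the sequence, which simplifies the original sliding-window definition. The names `contains_serial` and `window_frequency` and the event data are hypothetical.

```python
from typing import Sequence


def contains_serial(window: Sequence[str], episode: Sequence[str]) -> bool:
    """True if the events of `episode` occur in `window` in the given order."""
    it = iter(window)
    # `event in it` consumes the iterator up to the match, enforcing order.
    return all(event in it for event in episode)


def window_frequency(events: Sequence[str], episode: Sequence[str], width: int) -> int:
    """Number of width-sized windows that contain the episode."""
    return sum(
        contains_serial(events[i:i + width], episode)
        for i in range(len(events) - width + 1)
    )


# Hypothetical event sequence: "typhoon" followed by "flood" occurs once.
events = ["sports", "typhoon", "flood", "election", "weather"]

# The single occurrence falls into two overlapping windows.
print(window_frequency(events, ("typhoon", "flood"), width=3))

# A sub-episode is never less frequent than the episode containing it.
print(window_frequency(events, ("typhoon",), width=3))
```

Every window that contains an episode also contains each of its sub-episodes, which is why the measure is anti-monotonic even though it overcounts.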
Research in Our Laboratory
Sequential Data Mining from a very-large single
data sequence.
Main target: sequential textual data, especially,
newspaper-articles corpora
Objectives: to generate a robust and useful large-scale
event-sequences corpus.
Application 1: topic tracking/detection in information retrieval.
Application 2: automated content tracking on the Web.
Application 3: semi-automatic scenario/story creation.
Ordinary temporal data analysis: various log
data in computer systems, genetic information,
etc.
Technical Topics (1/2)
A new framework for extracting frequent
subsequences from a single long data
sequence:
in IEEE International Conference on Data Mining 2005 (ICDM 2005):
A new rational frequency measure, which
satisfies the Apriori (anti-monotonic) property
and has no duplicate counting.
A fast on-line algorithm for some limited
cases
Technical Topics (2/2)
Ongoing and future work
On-line rational filters based on confidence criteria and/or
information-gain for eliminating redundant valueless
sequences from system output
Methods for finding meta-structures embedded in the huge
number of frequent sequences generated by a system
A method using compression based on context-free grammar inference/learning
A faster extraction algorithm based on a method for
simultaneously searching multiple strings over compressed
data.
References:
Jiawei Han and Micheline Kamber. Data
Mining: Concepts and Techniques
(Chapter 8). www.cs.uiuc.edu/~hanj
Thanks for your attention!!