# A short introduction to sequential data mining

```
A Short Introduction to Sequential Data Mining
Koji IWANUMA
Hidetomo NABESHIMA
University of Yamanashi
The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence
Two Main Frameworks of Sequential Mining
Sequential pattern mining for multiple data sequences
Sequence ID (1 to 5) | Purchase data record (e.g. <cheese>)
Sequential pattern mining for a single data sequence
Data sequence
<S1 S2 S3 S4 S5 S6 S7 … Sn>
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: <(ef)(ab)(df)cb>
A sequence database:
  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
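A worked check of the two claims above, as an illustration that uses only the sequences listed on this slide:
  <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> because its elements map, in order, into distinct elements of the longer sequence:
    a ⊆ a, (bc) ⊆ (abc), d ⊆ d, c ⊆ (cf)
  Support of <(ab)c>, i.e. the number of database sequences that contain it:
    <a(abc)(ac)d(cf)>: contains it, since (ab) ⊆ (abc) and then c ⊆ (ac)
    <(ef)(ab)(df)cb>: contains it, since (ab) ⊆ (ab) and then c ⊆ c
    <eg(af)cbc>: does not contain it, since no single element holds both a and b
  Two of the listed sequences already contain <(ab)c>, so its support reaches min_sup = 2.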
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases.
A mining algorithm should:
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient and scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints
Sequential Pattern Mining Algorithms for Multiple Data Sequences
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning’00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03)
Mining Sequential Patterns from a Very Long Single Sequence
A series of daily newspaper articles:
<… typhoon … flood, landslide … typhoon … flood, landslide …>
<typhoon (flood, landslide)>
Sequential Pattern Mining Algorithms for a Single Data Sequence
Discovery of frequent episodes in event sequences, based on a sliding-window system [Mannila 1998]:
  The frequency measure becomes anti-monotonic, but it has a problem: a single occurrence can be counted more than once.
Asynchronous periodic pattern mining [Yang et al. 2000, Huang 2004]:
  No anti-monotonic frequency measure is investigated.
On-line approximation algorithms for mining frequent items, not frequent subsequences:
  Lossy counting algorithm [Manku and Motwani, VLDB’02]
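An illustration of the duplicate-counting problem in window-based frequency. This is a simplified sketch: the event sequence and the window width of 3 are assumed for illustration, and only windows lying fully inside the sequence are counted.
  Data sequence: <c a b d e>
  Windows of width 3: <c a b>, <a b d>, <b d e>
  The episode <a b> occurs once in the data sequence, but it is contained in two windows, <c a b> and <a b d>, so a window-based frequency counts that single occurrence twice.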
Research in Our Laboratory
Sequential data mining from a very large single data sequence.
  Main target: sequential textual data, especially newspaper-article corpora
  Objectives: to generate a robust and useful large-scale event-sequence corpus
    Application 1: topic tracking/detection in information retrieval
    Application 2: automated content tracking on the Web
    Application 3: semi-automatic scenario/story creation
Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.
Technical Topics (1/2)
A new framework for extracting frequent subsequences from a single long data sequence, in IEEE Inter. Conf. on Data Mining 2005 (ICDM2005):
  A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting
  A fast on-line algorithm for a limited case
Technical Topics (2/2)
Ongoing and future work:
  On-line rational filters based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from the system output
  Methods for finding meta-structures embedded in the huge number of frequent sequences generated by a system
  A method using compression based on context-free grammar inference/learning
  A faster extraction algorithm based on a method for simultaneously searching for multiple strings over compressed data
References:
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/~hanj