Laboratory Research and Results Briefing
Content and Knowledge
Management Laboratory (B)
Data Mining Part
Director: Anthony J. T. Lee
Presenter: Wan-chuen Lin
Outline
 Introduction of basic data mining concepts about our research topics
 Brief description of doctoral research
 Topic 1: Mining frequent itemsets with multi-dimensional constraints
 Topic 2: Mining the inter-transactional association rules of multi-dimensional interval patterns
 Topic 3: Inter-sequence association rules mining
 Topic 4: Mining association rules among time-series data
Introduction of Data Mining
 Data mining is the task of discovering
knowledge from large amounts of data.
 One of the fundamental data mining problems,
frequent itemset mining, covers a broad
spectrum of mining topics, including
association rules, sequential patterns, etc.
 Frequent itemset mining aims to discover all the itemsets whose support in the database exceeds a user-specified threshold.
Introduction of Association Rules
 An association rule is of the form X ⇒ Y, where X and Y are both frequent itemsets in the given database and X ∩ Y = ∅.
 The support of X ⇒ Y is the percentage of transactions in the given database that contain both X and Y, i.e., P(X ∪ Y).
 The confidence of X ⇒ Y is the percentage of transactions in the given database containing X that also contain Y, i.e., P(Y|X).
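The two measures can be sketched in a few lines of Python. The transaction list and the function names `support` and `confidence` are illustrative assumptions, not taken from the slides.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]

def support(db: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(db: List[Transaction], x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """P(Y | X), computed as support(X ∪ Y) / support(X)."""
    return support(db, x | y) / support(db, x)

# Hypothetical toy database of four transactions.
db = [frozenset(t) for t in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
)]
x, y = frozenset({"diapers"}), frozenset({"beer"})

rule_support = support(db, x | y)        # 2 of the 4 transactions hold both items
rule_confidence = confidence(db, x, y)   # 2 of the 3 transactions with diapers
```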
Introduction of Sequential Patterns
 A sequence is an ordered list of itemsets, denoted by <s1s2…sl>, where each sj is an itemset.
 sj is also called an element of the sequence, and is denoted as (x1x2…xm), where each xk is an item.
 The support of a sequence α in a sequence database is the number of tuples containing α.
 A sequence α is called a sequential pattern if support(α) ≥ min-support.
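To make "a tuple contains α" concrete, here is a minimal sketch. It assumes single-letter items and the usual subsequence reading: each element of α must be a subset of a distinct element of the tuple, in the same order. The helper names `parse`, `contains` and `seq_support` are hypothetical.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]

def parse(text: str) -> Sequence:
    """Parse '<c(ab)d>' into [{'c'}, {'a', 'b'}, {'d'}] (single-letter items)."""
    seq, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()
        elif ch == ")":
            seq.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)
        else:
            seq.append(frozenset(ch))
    return seq

def contains(seq: Sequence, pattern: Sequence) -> bool:
    """True if every pattern element fits, in order, into a later element of seq."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def seq_support(db: List[Sequence], pattern: Sequence) -> int:
    """Number of sequences in db that contain the pattern."""
    return sum(1 for seq in db if contains(seq, pattern))

db = [parse(s) for s in ("<b(ab)cc>", "<acc>", "<ceacc(ce)>", "<ab>")]
acc_support = seq_support(db, parse("<acc>"))          # a, then c, then c
ab_element_support = seq_support(db, parse("<(ab)>"))  # a and b in one element
```

With min-support = 3, `<acc>` would count as a sequential pattern in this small database, while `<(ab)>` would not.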
Algorithm for Mining Frequent
Itemsets
 Apriori
 Candidate set generation-and-test
 Level-wise: it iteratively generates
candidate k-itemsets from previously found
frequent (k-1)-itemsets, and then checks
the supports of candidates to form frequent
k-itemsets.
 Lk-1 → (Join) → Ck → (Support Check) → Lk
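The level-wise loop can be sketched as follows. This is a simplified illustration rather than a tuned implementation: it uses an absolute count threshold (`min_count`, an assumed parameter name) and treats an itemset as frequent when its count is at least that threshold.

```python
from itertools import combinations

def apriori(db, min_count):
    """Level-wise frequent itemset mining; returns {itemset: support count}."""
    db = [frozenset(t) for t in db]
    level = {frozenset([i]) for t in db for i in t}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Support check: count each candidate and keep the frequent ones (Lk).
        counts = {c: sum(1 for t in db if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        # Join: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(db, min_count=3)
```

Here every single item and every pair is frequent, but `{a, b, c}` appears in only two transactions and is dropped at the third level.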
Algorithm for Mining Frequent
Itemsets (cont’d)
 FP-growth
 The method constructs a compressed
frequent pattern tree, called FP-tree.
 It uses a divide-and-conquer strategy to recursively decompose the mining task into a set of smaller tasks in conditional databases, and concatenates the suffix itemset with the frequent itemsets generated from each conditional FP-tree.
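The divide-and-conquer idea can be illustrated without the tree compression. The sketch below is not FP-growth itself: it keeps each conditional database as a plain list of transactions instead of an FP-tree, and the name `pattern_growth` is an assumption. It only shows how a suffix itemset is extended by recursing into the conditional database of each frequent item.

```python
from collections import Counter

def pattern_growth(db, min_count, suffix=()):
    """Yield (itemset, count) pairs by recursing into conditional databases."""
    counts = Counter(item for t in db for item in set(t))
    # A fixed item order guarantees that each itemset is produced exactly once.
    for item in sorted(i for i, n in counts.items() if n >= min_count):
        pattern = (item,) + suffix
        yield frozenset(pattern), counts[item]
        # Conditional database of `item`: the transactions that contain it,
        # restricted to the items that come before it in the order.
        conditional = [[i for i in t if i < item] for t in db if item in t]
        yield from pattern_growth(conditional, min_count, pattern)

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
patterns = dict(pattern_growth(db, min_count=3))
```

A real FP-growth implementation would store each conditional database as a prefix tree with shared counts, which is where the compression comes from.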
Algorithm for Mining Sequential
Patterns: PrefixSpan
 It first finds the length-1 sequential patterns in the target database, and then partitions the database into smaller projected databases, one for each sequential pattern found so far, used as a prefix.
 The sequential patterns can be mined by constructing the corresponding projected databases and mining each recursively.
 It preserves the element order of each tuple
in the mining process.
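A compact sketch of the prefix-projection idea follows. To stay short it makes a simplifying assumption that the slides do not: every element is a single item, so a sequence is just a string of letters. The function name `prefixspan` and the toy database are illustrative.

```python
from collections import Counter

def prefixspan(db, min_count, prefix=()):
    """Yield (pattern, support) pairs for sequences of single items."""
    counts = Counter()
    for seq in db:
        counts.update(set(seq))          # count each item once per sequence
    for item, n in sorted(counts.items()):
        if n < min_count:
            continue
        pattern = prefix + (item,)
        yield pattern, n
        # Projected database: what follows the first occurrence of `item`.
        # Slicing keeps the original order of the remaining elements.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        yield from prefixspan(projected, min_count, pattern)

db = ["acc", "abcc", "ceacc", "ab"]
patterns = dict(prefixspan(db, min_count=3))
```

Handling elements with several items, such as `(ab)`, needs an extra case for extending the last element of the prefix, which this sketch leaves out.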
Brief Description of Doctoral Research
 Mining calling path patterns in GSM networks
 Two problems of mining calling path patterns
   Mining PMFCPs
   Mining periodic PMFCPs
 Graph structures [(periodic) frequent calling path graph] and graph-based mining algorithms
   Based on a depth-first
   No candidate paths are generated, and the database is scanned only once if the whole graph structure can be held in main memory.
Brief Description of Doctoral Research
(cont’d)
 Bioinformatics data mining
 Gene clustering
 Sequence comparisons, alignments, and compression
   DNA sequence
   Protein sequence
 Application
   Phylogenetic tree to predict the function of a new protein
   Relationship between DNA sequence and disease
Topic 1: Mining Frequent Itemsets
with Multi-dimensional Constraints
 Frequent itemset mining often generates a very large number of frequent itemsets.
   Only a subset of the frequent itemsets and association rules is of interest to users.
   Users need additional post-processing to find the useful ones.
 Constraint-based mining pushes user-specified constraints deep inside the mining process to improve performance.
 With multi-dimensional items, constraints can be imposed on multiple dimensional attributes.
Topic 1: Mining Frequent Itemsets
with Multi-dimensional Constraints
Multi-dimensional Constraints
[Figure: a table of items, one row per itemID and one column per attribute (dimension) a1, a2, …, am]
ik = (k1, k2, …, km)
A = iA = (A1, A2, …, Am), where A1 = A.a1
Topic 1: Mining Frequent Itemsets
with Multi-dimensional Constraints
 Multi-dimensional constraints can be categorized according to their constraint properties:
   anti-monotone, monotone, convertible, and inconvertible
 They can also be classified according to the number of sub-constraints included:
   A single constraint against multiple dimensions,
    Ex: max(S.cost) ≤ min(S.price)
   A conjunction and/or disjunction of multiple sub-constraints,
    Ex: (C1: S.cost ≤ v1) ∧ (C2: S.price ≥ v2)
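A small sketch can show what a single constraint over two dimensions looks like, and why it behaves as an anti-monotone constraint. It assumes the comparison in the first example is ≤, and the item table with its (cost, price) values is invented for illustration.

```python
from itertools import combinations

# Hypothetical multi-dimensional items: name -> (cost, price).
ITEMS = {"A": (3, 9), "B": (5, 6), "C": (7, 8), "D": (2, 4)}

def satisfies(itemset) -> bool:
    """Check the assumed constraint max(S.cost) <= min(S.price)."""
    costs = [ITEMS[i][0] for i in itemset]
    prices = [ITEMS[i][1] for i in itemset]
    return max(costs) <= min(prices)

def is_anti_monotone(constraint, universe) -> bool:
    """Brute-force check: once an itemset fails, every superset fails too."""
    subsets = [frozenset(c) for k in range(1, len(universe) + 1)
               for c in combinations(universe, k)]
    return all(not constraint(big)
               for small in subsets if not constraint(small)
               for big in subsets if small < big)

ab_ok = satisfies({"A", "B"})       # max cost 5 <= min price 6
bc_ok = satisfies({"B", "C"})       # max cost 7 >  min price 6
anti_monotone = is_anti_monotone(satisfies, sorted(ITEMS))
```

Adding an item can only raise the maximum cost and lower the minimum price, so a violated itemset can never be repaired by growing it. That is what lets a miner discard it early.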
Topic 1: Mining Frequent Itemsets
with Multi-dimensional Constraints
 We extend constraints to apply over multi-dimensional itemsets, and develop algorithms for mining frequent itemsets with multi-dimensional constraints by extending CFG (Constrained Frequent Pattern Growth).
 Overview of our algorithm
   Phase 1: Frequency check
   Phase 2: Constraint check
   Phase 3: Conditional database construction
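One way the three phases could fit together is sketched below. This is not the authors' CFG-based algorithm, whose details the slides do not give. It is a plain pattern-growth loop that handles only an anti-monotone constraint, and the item table, the constraint and the name `constrained_growth` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical multi-dimensional items: name -> (cost, price).
ITEMS = {"A": (3, 9), "B": (5, 6), "C": (7, 8), "D": (2, 4)}

def satisfies(itemset) -> bool:
    """Assumed anti-monotone constraint: max(S.cost) <= min(S.price)."""
    return (max(ITEMS[i][0] for i in itemset)
            <= min(ITEMS[i][1] for i in itemset))

def constrained_growth(db, min_count, constraint, suffix=frozenset()):
    """Yield (itemset, count) for frequent itemsets that satisfy the constraint."""
    counts = Counter(i for t in db for i in set(t))
    # Phase 1: frequency check.
    frequent = [i for i, n in counts.items() if n >= min_count]
    # Phase 2: constraint check on the suffix extended by each frequent item.
    valid = sorted(i for i in frequent if constraint(suffix | {i}))
    for item in valid:
        pattern = suffix | {item}
        yield pattern, counts[item]
        # Phase 3: conditional database, limited to surviving earlier items.
        keep = {i for i in valid if i < item}
        conditional = [[i for i in t if i in keep] for t in db if item in t]
        yield from constrained_growth(conditional, min_count, constraint, pattern)

db = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
result = dict(constrained_growth(db, 2, satisfies))
```

In this toy run `{B, C}` is frequent (two transactions) but fails the constraint, so it is pruned in phase 2 and never reaches a conditional database.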
Example: Cam ≡ max(S.cost) ≤ min(S.price)

Database: BECA, BEA, DA, BDA, BDE, BDECA, BEC, BDEC, DEC, BDC
Frequent items: B, D, E, C, A
Constraint check: C(BDECA)=false; C(B)=true, C(D)=true, C(E)=true, C(C)=true, C(A)=true

A-conditional Database: BEC, BE, D, BD, BDEC
Frequent items: B, D, E, C
Constraint check: C(BDECA)=false; C(BA)=false, C(EA)=true, C(DA)=true, C(CA)=false

EA-conditional Database: D
Frequent items: ∅
Topic 2: Mining Inter-transactional Association
Rules of Multi-dimensional Interval Patterns
 A transaction could be the items bought by the same customer, the events that happened on the same day, and so on.
 Intra-transactional association rules: associations among items within the same transaction.
   Ex: buy (X, diapers) => buy (X, beer) [support=80%]
 Inter-transactional association rules: association relations among items in different transactions.
   Ex: If the prices of IBM and SUN go up, Microsoft's will most likely [80%] go up the next day.
Topic 2: Mining Inter-transactional Association
Rules of Multi-dimensional Interval Patterns
 Interval data differ from point data in that they occupy regions of non-zero size.
 Multi-dimensional intervals can be represented as line segments (1-D), rectangles (2-D), hyper-cubes (n-D), etc.
 Extended item: denoted as (Location)<Size>
 Reference point: the smallest (Location) among all (Location)<Size>.
 Maxspan: a sliding window; only associations covered by it are considered.
Example
 There are two cubes in the 3-dimensional space: 0,2,1<1,1,1> and 1,1,0<2,2,1>.
 Reference point: (0,1,0)
 Relative to the reference point, the two items are denoted as 0,1,1<1,1,1> and 1,0,0<2,2,1>.
0,2,1<1,1,1>
1,1,0<2,2,1>
Algorithm (Apriori-like) Example
 Support: 10% (10% × 20 = 2)
 Maxspan: 4
 L1: 0,0<1,1>, 0,0<1,2>, 0,0<1,3>, 0,0<2,1>
Algorithm (Apriori-like) Example
(cont’d)
 Reminder: Apriori-like algorithm
   Lk-1 → (Join) → Ck → (Support Check) → Lk
 L2:
  {0,0<1,1>, 1,1<2,1>}, {1,0<1,1>, 0,1<1,2>},
  {0,0<1,2>, 2,0<2,1>}, {0,0<1,3>, 3,0<1,2>}
 L3:
  {3,0<1,1>, 2,1<1,2>, 0,3<1,3>}
  {1,0<1,1>, 0,1<1,2>, 2,1<2,1>}
  {3,0<1,1>, 0,3<1,3>, 4,1<2,1>}
  {2,0<1,2>, 0,2<1,3>, 4,0<2,1>}
 L4: {0,3<1,3>, 4,1<2,1>, 2,1<1,2>, 3,0<1,1>}
Topic 3: Inter-sequence Association
Rules Mining
 Inter-sequence model
[Figure: ten sequences laid out along a time axis, with Transaction ID 1 to 10 and Transaction Time 1 to 10; for example, <c(ab)d(ad)> at time 1 and <(bc)cb> at time 2]
Topic 3: Inter-sequence Association
Rules Mining (cont’d)
 Extended sequence (denoted as Δt<s1s2…sl>): a sequence s = <s1s2…sl> at time point Δt.
 Algorithm:
   Step 1: Use PrefixSpan to find all sequential patterns.
   Step 2: Use an Apriori-like method to check whether an extended sequence set is large (frequent).
 Use an L-bucket (list-bucket) and a C-bucket (candidate-bucket) to improve mining efficiency.
Example
 min_support = 3
 maxspan = 2
PrefixSpan
Sequential Patterns:
–<a>, <b>, <c>
–<ab>, <(ab)>, <ac>,
<ba>, <bc>, <cb>, <cc>
–<acc>
The database
Tran. ID   Tran. Time   Sequence
1          1            <c(ab)d(ad)>
2          2            <(bc)cb>
3          3            <e(ac)bac>
4          4            <b(ab)cc>
5          5            <(ab)c>
6          6            <dd(ac)bd>
7          7            <bc>
8          8            <acc>
9          9            <ab>
10         10           <ceacc(ce)>
Example (cont’d)
PrefixSpan Result
<a>, <b>, <c>
<ab>, <(ab)>, <ac>,
<ba>, <bc>, <cb>,
<cc>
<acc>
L1
{Δ0<a>}
{Δ0<b>}
{Δ0<c>}
Candidates C2
{Δ0<a>, Δ1<a>}, {Δ0<a>, Δ2<a>}
{Δ0<a>, Δ1<b>}, {Δ0<b>, Δ1<a>},
{Δ0<a>, Δ2<b>}, {Δ0<b>, Δ2<a>}
{Δ0<a>, Δ1<c>}, {Δ0<c>, Δ1<a>},
{Δ0<a>, Δ2<c>}, {Δ0<c>, Δ2<a>}
{Δ0<b>, Δ1<b>}, {Δ0<b>, Δ2<b>}
{Δ0<b>, Δ1<c>}, {Δ0<c>, Δ1<b>},
{Δ0<b>, Δ2<c>}, {Δ0<c>, Δ2<b>}
{Δ0<c>, Δ1<c>}, {Δ0<c>, Δ2<c>}
Example (cont’d)
PrefixSpan Result
<a>, <b>, <c>
<ab>, <(ab)>, <ac>,
<ba>, <bc>, <cb>,
<cc>
<acc>
C2
Apriori-like: Lk-1 → Ck → Lk
L2
{Δ0<ab>}, {Δ0<(ab)>}, {Δ0<ac>},
{Δ0<ba>}, {Δ0<bc>},
{Δ0<cb>}, {Δ0<cc>},
{Δ0<a>, Δ1<a>}, {Δ0<a>, Δ2<a>},
{Δ0<a>, Δ1<b>}, {Δ0<b>, Δ1<a>},
{Δ0<a>, Δ2<b>}, {Δ0<b>, Δ2<a>},
{Δ0<a>, Δ1<c>}, {Δ0<c>, Δ1<a>},
{Δ0<a>, Δ2<c>}, {Δ0<c>, Δ2<a>},
{Δ0<b>, Δ1<b>}, {Δ0<b>, Δ2<b>},
{Δ0<b>, Δ1<c>}, {Δ0<c>, Δ1<b>},
{Δ0<b>, Δ2<c>}, {Δ0<c>, Δ2<b>},
{Δ0<c>, Δ1<c>}, {Δ0<c>, Δ2<c>}
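The support check for an extended sequence set can be sketched on the ten-sequence example database. The sketch assumes one plausible reading of the model: a set such as {Δ0<a>, Δ1<b>} is supported at time t when the sequence at time t contains <a> and the sequence at time t+1 contains <b>. The helper names are hypothetical, and the L-bucket and C-bucket structures are not modeled.

```python
def parse(text):
    """Parse '<c(ab)d>' into [{'c'}, {'a', 'b'}, {'d'}] (single-letter items)."""
    seq, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()
        elif ch == ")":
            seq.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)
        else:
            seq.append(frozenset(ch))
    return seq

def contains(seq, pattern):
    """True if every pattern element fits, in order, into a later element of seq."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def extended_support(db, pattern):
    """db: {time: sequence}; pattern: {offset: sequence pattern}.

    Counts the reference times t at which, for every offset d, a sequence
    exists at time t + d and contains pattern[d].
    """
    return sum(
        1 for t in db
        if all(t + d in db and contains(db[t + d], p) for d, p in pattern.items())
    )

raw = ["<c(ab)d(ad)>", "<(bc)cb>", "<e(ac)bac>", "<b(ab)cc>", "<(ab)c>",
       "<dd(ac)bd>", "<bc>", "<acc>", "<ab>", "<ceacc(ce)>"]
db = {time: parse(text) for time, text in enumerate(raw, start=1)}

a_then_b = extended_support(db, {0: parse("<a>"), 1: parse("<b>")})
acc_then_ab = extended_support(db, {0: parse("<acc>"), 1: parse("<ab>")})
```

Under this reading {Δ0<a>, Δ1<b>} clears min_support = 3, which agrees with its appearance in L2, while {Δ0<acc>, Δ1<ab>} is supported only at time 8.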
Topic 4: Mining Association Rules
among Time-series Data
 A line is an ordered and continuous list in the form {t1, t2, …, tm} describing a property of the subject over time.
 Step 1: find the frequent lines and points in each line-set (Apriori-like algorithm).
 Step 2: use combinations of those frequent sets to find the associations among them (inter-transaction association rules).
Time-series Data Approximation
 For the algorithm's efficiency
 Equally partition the fluctuation rate into several classes.
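A possible form of this approximation is sketched below. The slides do not define the fluctuation rate, so the sketch assumes it is the relative change between consecutive points, and it assumes a fixed range [low, high] split into equal-width classes. All names and numbers are illustrative.

```python
def fluctuation_classes(series, n_classes, low, high):
    """Map each step's relative change to one of n equal-width classes.

    Class 0 covers the lowest rates and class n_classes - 1 the highest;
    rates outside [low, high] are clamped to the nearest class.
    """
    width = (high - low) / n_classes
    classes = []
    for prev, cur in zip(series, series[1:]):
        rate = (cur - prev) / prev               # assumed definition of the rate
        index = int((rate - low) / width)
        classes.append(min(max(index, 0), n_classes - 1))
    return classes

# Rates here are +5%, -20% and +25%, with four classes over [-20%, +20%].
prices = [100, 105, 84, 105]
classes = fluctuation_classes(prices, n_classes=4, low=-0.2, high=0.2)
```

Replacing raw values with a handful of class labels turns each series into a short symbolic list, which is the kind of input an Apriori-like line discovery step can count.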
Step 1: Line Discovery (Apriori-like)
Step 2: Association Rule Mining
Data Mining Part
Thank You!