Data Mining and Knowledge Discovery
Mining Frequent Patterns II:
Mining Sequential & Navigational
Patterns
Bamshad Mobasher
DePaul University
Sequential Pattern Mining
• Association rule mining does not consider the order of transactions.
• In many applications such orderings are significant. For example:
  • In market basket analysis, it is interesting to know whether people buy some items in sequence, e.g., buying a bed first and then bed sheets some time later.
  • In Web usage mining, it is useful to find the navigational patterns of users in a Web site from their sequences of page visits.
Sequential Patterns: Extending Frequent Itemsets
• Sequential patterns add an extra dimension to frequent itemsets and association rules: time.
  • Items can appear before, after, or at the same time as each other.
  • General form: "x% of the time, when A appears in a transaction, B appears within z transactions."
  • Note that other items may appear between A and B, so sequential patterns do not necessarily imply consecutive appearances of items (in terms of time).
• Examples:
  • Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi" in that order
  • A collection of ordered events within an interval
• Most sequential pattern discovery algorithms are based on extensions of the Apriori algorithm for discovering itemsets.
• Navigational patterns
  • These can be viewed as a special form of sequential patterns which capture the navigational behavior of the users of a site.
  • In this case, a session is a consecutive sequence of pageview references for a user over a specified period of time.
Objective
Given a set S of input data sequences (or a sequence database), the problem of mining sequential patterns is to find all the sequences that have a user-specified minimum support.
Each such sequence is called a frequent sequence, or a sequential pattern.
The support of a sequence is the fraction of the data sequences in S that contain it.
Sequence Databases
A sequence database consists of an ordered list of elements or events.
Each element can be a set of items or a single item (a singleton set).

Transaction databases vs. sequence databases:

A transaction database:
  TID    itemset
  10     a, b, d
  20     a, c, d
  30     a, d, e
  40     b, e, f

A sequence database:
  SID    sequence
  10     <a(abc)(ac)d(cf)>
  20     <(ad)c(bc)(ae)>
  30     <(ef)(ab)(df)cb>
  40     <eg(af)cbc>

Elements in (…) are sets.
Subsequence vs. Supersequence
• A sequence is an ordered list of events, denoted <e1 e2 … el>.
• Given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>, α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn.
• Examples:
  • <(ab), d> is a subsequence of <(abc), (de)>
  • <3, (4,5), 8> is contained in (i.e., is a subsequence of) <6, (3,7), 9, (4,5,8), (3,8)>
  • <a.html, c.html, f.html> ⊆ <a.html, b.html, c.html, d.html, e.html, f.html, g.html>
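A minimal sketch of this containment test in Python, assuming each sequence is represented as a list of sets (one set per element); the function name is_subsequence is our own:

    def is_subsequence(alpha, beta):
        # True if alpha is a subsequence of beta: each element of alpha
        # must be a subset of a strictly later element of beta (gaps allowed).
        j = 0  # current position in beta
        for a in alpha:
            # advance through beta until an element containing a is found
            while j < len(beta) and not a <= beta[j]:
                j += 1
            if j == len(beta):
                return False
            j += 1  # the next element of alpha must match further right
        return True

    # First example above: <(ab), d> is a subsequence of <(abc), (de)>
    print(is_subsequence([{'a','b'}, {'d'}], [{'a','b','c'}, {'d','e'}]))  # True

Greedy earliest matching suffices here: if any embedding of α into β exists, the leftmost one does too.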
What Is Sequential Pattern Mining?
Given a set of sequences and a support threshold, find the complete set of frequent subsequences.

A sequence: <(ef)(ab)(df)cb>

A sequence database:
  SID    sequence
  10     <a(abc)(ac)d(cf)>
  20     <(ad)c(bc)(ae)>
  30     <(ef)(ab)(df)cb>
  40     <eg(af)cbc>

An element may contain a set of items; items within an element are unordered, and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern, as checked below.
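Using the is_subsequence sketch from earlier, with each sequence written as a list of sets, the support count is direct:

    db = [
        [{'a'}, {'a','b','c'}, {'a','c'}, {'d'}, {'c','f'}],  # SID 10
        [{'a','d'}, {'c'}, {'b','c'}, {'a','e'}],             # SID 20
        [{'e','f'}, {'a','b'}, {'d','f'}, {'c'}, {'b'}],      # SID 30
        [{'e'}, {'g'}, {'a','f'}, {'c'}, {'b'}, {'c'}],       # SID 40
    ]
    pattern = [{'a','b'}, {'c'}]  # <(ab)c>
    support = sum(is_subsequence(pattern, s) for s in db)
    print(support)  # 2 (contained in SIDs 10 and 30), so frequent at min_sup = 2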
Another Example
[Table: transactions sorted by customer ID]
Example (continued)
[Table: sequences produced from the transactions]
[Table: final sequential patterns]
The GSP Mining Algorithm
• Very similar to the Apriori algorithm: it makes repeated passes over the sequence database, generating length-(k+1) candidate sequences from the frequent length-k sequences and keeping those that meet the minimum support. A simplified sketch follows.
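A simplified sketch of the level-wise idea, restricted to sequences of single items (no itemset elements) to keep it short; all names are our own:

    def contains(seq, pat):
        # True if pat occurs in seq as a (possibly gapped) subsequence.
        it = iter(seq)
        return all(item in it for item in pat)

    def gsp(db, min_sup):
        # Level-wise candidate generation and support counting, GSP-style.
        items = sorted({x for s in db for x in s})
        candidates = [(x,) for x in items]
        frequent = []
        while candidates:
            # count each candidate's support in one pass over the database
            level = [c for c in candidates
                     if sum(contains(s, c) for s in db) >= min_sup]
            frequent += level
            # join step: c1 extends to c1+(x,) when c1's suffix equals
            # some frequent sequence c2's prefix
            candidates = [c1 + (c2[-1],) for c1 in level for c2 in level
                          if c1[1:] == c2[:-1]]
        return frequent

    sessions = [('a','b','c'), ('a','b'), ('a','c'), ('b','c'), ('a','b','c')]
    print(gsp(sessions, min_sup=2))
    # [('a',), ('b',), ('c',), ('a','b'), ('a','c'), ('b','c'), ('a','b','c')]

The full algorithm also prunes candidates that have an infrequent subsequence and handles itemset elements and time constraints; this sketch keeps only the generate-and-count skeleton.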
Sequential Pattern Mining Algorithms
• Apriori-based method: GSP (Generalized Sequential Patterns; Srikant & Agrawal, 1996)
• Pattern-growth methods: FreeSpan and PrefixSpan (Han et al., 2000; Pei et al., 2001)
• Vertical format-based mining: SPADE (Zaki, 2000)
• Constraint-based sequential pattern mining (SPIRIT; Garofalakis et al., 1999; Pei et al., 2002)
• Mining closed sequential patterns: CloSpan (Yan, Han & Afshar, 2003)
From: J. Han and M. Kamber, Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
Mining Navigational Patterns
• Each session induces a user trail through the site.
  • A trail is a sequence of web pages followed by a user during a session, ordered by time of access.
• A sequential pattern in this context is a frequent trail.
• Sequential pattern mining can help identify common navigational sequences, which in turn helps in understanding common user behavioral patterns.
• If the goal is to make predictions about future user actions based on past behavior, approaches such as Markov models (e.g., Markov chains) can be used.
Mining Navigational Patterns
• Another approach: Markov chains
  • The idea is to model the navigational sequences through the site as a state-transition diagram without cycles (a directed acyclic graph).
  • A Markov chain consists of a set of states (the pages or pageviews in the site), S = {s1, s2, …, sn}, and a set of transition probabilities, P = {p1,1, …, p1,n, p2,1, …, p2,n, …, pn,1, …, pn,n}.
  • A path r from a state si to a state sj is a sequence of states in which the transition probabilities between all consecutive states are greater than 0.
  • The probability of reaching a state sj from a state si via a path r = <si, r2, …, rk-1, sj> is the product of all the transition probabilities along the path: Pr(r) = p(si, r2) × p(r2, r3) × … × p(rk-1, sj).
  • The probability of reaching sj from si is the sum over all paths r from si to sj: Pr(sj | si) = Σr Pr(r).
Constructing a Markov Chain from Web Navigational Data
• Add a unique start state.
  • The start state has a transition to the first page in each session (representing the start of the session).
  • Alternatively, it could have a transition to every state, assuming that every page can potentially be the start of a session.
• Add a unique final state.
  • The last page in each trail has a transition to the final state (representing the end of the session).
• The transition probabilities are obtained by counting click-throughs, as sketched below.
• The Markov chain built this way is called absorbing, since we always end up in the final state.
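A sketch of this construction in Python, assuming each session is a list of page names; the 'Start'/'Final' markers and the function name are our own:

    from collections import defaultdict

    def build_markov_chain(sessions):
        # Count click-throughs, including Start -> first page and
        # last page -> Final, then normalize each state's counts.
        counts = defaultdict(lambda: defaultdict(int))
        for session in sessions:
            trail = ['Start'] + list(session) + ['Final']
            for cur, nxt in zip(trail, trail[1:]):
                counts[cur][nxt] += 1
        chain = {}
        for state, outs in counts.items():
            total = sum(outs.values())
            chain[state] = {nxt: n / total for nxt, n in outs.items()}
        return chain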
A Hypothetical Markov Chain
[Figure: an example Markov chain with states including Home, Search, Cat, RS, PD, and a purchase state $]
• What is the probability that a user who visits the Home page purchases a product?
  • Home → Search → PD → $: 1/3 × 1/2 × 1/2 = 1/12 ≈ 0.083
  • Home → Cat → PD → $: 1/3 × 1/3 × 1/2 = 1/18 ≈ 0.056
  • Home → Cat → $: 1/3 × 1/3 = 1/9 ≈ 0.111
  • Home → RS → PD → $: 1/3 × 2/3 × 1/2 = 1/9 ≈ 0.111
  • Sum ≈ 0.361
Markov Chain Example
Calculating conditional probabilities for transitions

[Figure: Web site hyperlink graph over pages A–E; the edge B→C is labeled 0.57]

Transition B→C:
  Total occurrences of B: 14
  Total occurrences of B followed by C: 8
  Pr(C|B) = 8/14 ≈ 0.57
Sessions:
A, B
A, B
A, B, C
A, B, C
A, B, C, D
A, B, C, E
A, C, E
A, C, E
A, B, D
A, B, D
A, B, D, E
B, C
B, C
B, C, D
B, C, E
B, D, E
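Feeding these sixteen sessions to the build_markov_chain sketch from earlier reproduces this estimate:

    sessions = [['A','B'], ['A','B'], ['A','B','C'], ['A','B','C'],
                ['A','B','C','D'], ['A','B','C','E'], ['A','C','E'],
                ['A','C','E'], ['A','B','D'], ['A','B','D'],
                ['A','B','D','E'], ['B','C'], ['B','C'], ['B','C','D'],
                ['B','C','E'], ['B','D','E']]
    chain = build_markov_chain(sessions)
    print(round(chain['B']['C'], 2))  # 0.57, i.e. 8/14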
Markov Chain Example (cont.)

[Figure: the full Markov chain built from the sessions above. Transitions: Start→A 0.69, Start→B 0.31; A→B 0.82, A→C 0.18; B→C 0.57, B→D 0.29, B→Final 0.14; C→D 0.20, C→E 0.40, C→Final 0.40; D→E 0.33, D→Final 0.67; E→Final 1.00]

Probability that someone will visit page C?
Start→B→C + Start→A→C + Start→A→B→C:
(0.31 × 0.57) + (0.69 × 0.18) + (0.69 × 0.82 × 0.57) ≈ 0.62
(Sanity check: 10 of the 16 sessions contain C, i.e., 0.625.)

Probability that someone who has visited B will visit E?
B→D→E + B→C→E + B→C→D→E:
(0.29 × 0.33) + (0.57 × 0.40) + (0.57 × 0.20 × 0.33) ≈ 0.36
Probability that someone visiting page C will leave the site? C→Final = 0.40 = 40%
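These sums over paths can be computed mechanically by a depth-first traversal of the chain; the dict-of-dicts layout and the function name below are our own, and the code assumes the chain is acyclic, as in this example:

    def reach_probability(chain, src, dst):
        # Sum, over all paths src -> dst, of the product of the
        # transition probabilities along each path (DAG assumed).
        if src == dst:
            return 1.0
        return sum(p * reach_probability(chain, nxt, dst)
                   for nxt, p in chain.get(src, {}).items())

    chain = {'Start': {'A': 0.69, 'B': 0.31},
             'A': {'B': 0.82, 'C': 0.18},
             'B': {'C': 0.57, 'D': 0.29, 'Final': 0.14},
             'C': {'D': 0.20, 'E': 0.40, 'Final': 0.40},
             'D': {'E': 0.33, 'Final': 0.67},
             'E': {'Final': 1.00}}

    print(round(reach_probability(chain, 'Start', 'C'), 2))  # 0.62
    print(round(reach_probability(chain, 'B', 'E'), 2))      # 0.36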
Mining Frequent Trails Using Markov Chains
• Support s ∈ [0, 1): accept only trails whose initial probability is above s.
• Confidence c ∈ [0, 1): accept only trails whose trail probability is above c.
  • Recall: the probability of a trail is obtained by multiplying the transition probabilities of the links in the trail.
• Mining for patterns:
  • Find all trails whose initial probability is higher than s and whose trail probability is above c (a sketch follows this list).
  • Use depth-first search on the Markov chain to compute the trails.
  • The average time needed to find the frequent trails is proportional to the number of web pages in the site.
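A depth-first enumeration sketch under these definitions; the initial probabilities are passed in as a separate mapping, and all names are our own:

    def frequent_trails(chain, initial, s, c):
        # Trails whose first page has initial probability > s and whose
        # trail probability (product of link probabilities) stays > c.
        results = []

        def dfs(state, trail, prob):
            if len(trail) >= 2:
                results.append((trail, prob))
            for nxt, p in chain.get(state, {}).items():
                if nxt != 'Final' and prob * p > c:
                    dfs(nxt, trail + [nxt], prob * p)

        for page, p0 in initial.items():
            if p0 > s:
                dfs(page, [page], 1.0)
        return results

    # e.g., with the A-E chain sketched above:
    # frequent_trails(chain, {'A': 0.69, 'B': 0.31}, s=0.1, c=0.3)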
Markov Chains: Another Example

  ID    Session Trail
  1     A1 > A2 > A3
  2     A1 > A2 > A3
  3     A1 > A2 > A3 > A4
  4     A5 > A2 > A4
  5     A5 > A2 > A4 > A6
  6     A5 > A2 > A3 > A6
Frequent Trails From Example
Support = 0.1 and Confidence = 0.3

  Trail           Probability
  A1 > A2 > A3    0.67
  A5 > A2 > A3    0.67
  A2 > A3         0.67
  A1 > A2 > A4    0.33
  A5 > A2 > A4    0.33
  A2 > A4         0.33
  A4 > A6         0.33
Frequent Trails From Example
Support = 0.1 and Confidence = 0.5

  Trail           Probability
  A1 > A2 > A3    0.67
  A5 > A2 > A3    0.67
  A2 > A3         0.67
Efficient Management of Navigational Trails
• Approach: store sessions in an aggregated sequence tree (a sketch of the construction follows).
  • Initially introduced in the Web Utilization Miner (WUM); Spiliopoulou, 1998.
  • For each occurrence of a sequence, start a new branch or increase the frequency counts of the matching nodes.
  • In the example, note that session s6 contains "b" twice; hence its sequence is <(b,1),(d,1),(b,2),(e,1)>.
[Figure: example sessions and the corresponding aggregated sequence tree]
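A minimal trie-style sketch of the aggregation in Python; keying each node by (page, occurrence number) mirrors the (b,1)/(b,2) distinction above, and all names are our own:

    class Node:
        def __init__(self):
            self.count = 0
            self.children = {}

    def build_aggregate_tree(sessions):
        root = Node()
        for session in sessions:
            root.count += 1          # root count = number of sessions
            seen = {}                # per-session occurrence counter
            node = root
            for page in session:
                seen[page] = seen.get(page, 0) + 1
                key = (page, seen[page])   # e.g. ('b', 2) for a repeat
                node = node.children.setdefault(key, Node())
                node.count += 1      # one more session passing through here
        return root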
Mining Navigational Patterns
The aggregated sequence tree can be used directly to determine the support and confidence of navigational patterns. Note that each node represents a navigational path ending at that node.

Support = count at the node / count at the root
Confidence = count at the node / count at the parent

Navigation pattern: a → b
  Support = 11/35 = 0.31
  Confidence = 11/21 = 0.52
Navigation pattern: a → b → e
  Support = 11/35 = 0.31
  Confidence = 11/11 = 1.00
Navigation pattern: a → b → e → f
  Support = 3/35 = 0.086
  Confidence = 3/11 = 0.27
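With the build_aggregate_tree sketch above, these ratios can be read directly off the node counts; the helper below and its path notation are our own:

    def support_confidence(root, path):
        # path is a list of (page, occurrence) keys, e.g. [('a',1), ('b',1)].
        node, parent = root, None
        for key in path:
            parent, node = node, node.children[key]
        return node.count / root.count, node.count / parent.count

    # Given the session data behind the figure, support_confidence(root,
    # [('a',1), ('b',1)]) would return the 11/35 and 11/21 shown above.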