Frequent Pattern Mining

CIS4930
Introduction to Data Mining
Frequent Pattern Mining
Peixiang Zhao
Tallahassee, Florida, 2016
What is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis,
Web log (click stream) analysis, and DNA sequence analysis
1
Why Frequent Patterns?
• Frequent patterns
– An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering
– Broad applications
2
Association Rule Mining
• Given a set of transactions, find rules that will predict
the occurrence of an item based on the occurrences of
other items in the transaction
– Examples:
• {Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
– Implication means co-occurrence, not causality!
3
Basic Concepts
• Itemset
– A set of one or more items
– k-itemset X = {x1, …, xk}
• (absolute) support of X
– Frequency or occurrence count of the itemset X
• (relative) support of X
– The fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if
– X's support is no less than a minsup threshold
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
4
Basic Concepts
• Association rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Support, s: the fraction of transactions that contain both X and Y (the probability that a transaction contains X ∪ Y)
– Confidence, c: how often items in Y appear in transactions that contain X (the conditional probability that a transaction having X also contains Y)
• Goal: find all association rules X → Y with minimum support and confidence
• Example: let minsup = 50%, minconf = 50%
– Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
– Association rules: (many more!)
• Beer → Diaper (60%, 100%)
• Diaper → Beer (60%, 75%)
5
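The support and confidence numbers on this slide can be reproduced directly from the definitions; a minimal Python sketch (illustrative names, not part of the course material), using the stated counts over 5 transactions:

```python
# A minimal sketch of the support/confidence definitions, using the counts
# from this slide (5 transactions in total).
n_transactions = 5
sup = {
    frozenset({"Beer"}): 3,
    frozenset({"Diaper"}): 4,
    frozenset({"Beer", "Diaper"}): 3,
}

def rule_metrics(X, Y):
    """Return (relative support, confidence) of the rule X -> Y."""
    union = frozenset(X) | frozenset(Y)
    s = sup[union] / n_transactions     # P(a transaction contains X and Y)
    c = sup[union] / sup[frozenset(X)]  # P(Y | X)
    return s, c

print(rule_metrics({"Beer"}, {"Diaper"}))   # (0.6, 1.0)  -> Beer => Diaper (60%, 100%)
print(rule_metrics({"Diaper"}, {"Beer"}))   # (0.6, 0.75) -> Diaper => Beer (60%, 75%)
```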
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100}
contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27×10^30 sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if
– X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a maximal pattern if
– X is frequent and there exists no frequent super-pattern Y ⊃ X
• Closed pattern is a lossless compression of freq. patterns
– Reducing the # of patterns and rules
6
Closed and Max-Patterns: An Example
• Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
– Min_sup = 1.
• What is the set of closed itemsets?
– <a1, …, a100>: 1
– <a1, …, a50>: 2
• What is the set of maximal patterns?
– <a1, …, a100>: 1
• What is the set of all frequent patterns?
– All 2^100 - 1 nonempty subsets of {a1, …, a100}: far too many to enumerate!
7
Computational Complexity
• How many itemsets are potentially to be generated in the worst
case?
– The number of frequent itemsets is sensitive to the minsup threshold
– When minsup is low, there exist an exponential number of frequent itemsets
– The worst case: M^N, where M = # distinct items and N = max length of transactions
• The worst-case complexity vs. the expected probability
– Ex. Suppose Walmart has 10^4 kinds of products
• The chance to pick up one product: 10^-4
• The chance to pick up a particular set of 10 products: ~10^-40
• What is the chance that this particular set of 10 products is frequent
10^3 times in 10^9 transactions?
8
Association Rule Mining
• Given a set of transactions T, the goal of association
rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
9
Mining Association Rules
• Observations
– All the association rules below are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
• Example transactions:
TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
• Example rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
10
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Mining
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
11
Frequent Itemset Mining
• Given d items, there are 2^d possible candidate itemsets
[Figure: the itemset lattice over five items {A, B, C, D, E}, from the null set at the top down to ABCDE at the bottom]
12
Frequent Itemset Mining
• Brute-force approach
1. Each itemset in the lattice is a candidate frequent itemset
2. Count the support of each candidate by scanning the database
• Match each transaction against every candidate
– Time complexity: O(N·M·w) ⇒ expensive, since M = 2^d
(N = # transactions, M = # candidates, w = max transaction width)
[Figure: the N example transactions matched against the list of M candidates]
13
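To make the cost concrete, here is a brute-force sketch in Python (purely illustrative): every non-empty itemset over the items that actually occur is a candidate, and each candidate is matched against every transaction, exactly the O(N·M·w) scan described above. It uses the five example transactions shown earlier in these slides.

```python
from itertools import combinations

# The five example transactions used earlier in these slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))   # the d distinct items
minsup = 3                                   # absolute support threshold

# Brute force: every non-empty subset of the items is a candidate (M = 2^d - 1),
# and each candidate is matched against every transaction: O(N * M * w).
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        c = frozenset(candidate)
        count = sum(1 for t in transactions if c <= t)
        if count >= minsup:
            frequent[c] = count

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```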
Computational Complexity of Association Rule Mining
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:
R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1
– If d = 6, R = 602 rules
14
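The closed form can be checked numerically; a tiny sketch:

```python
from math import comb

# R = sum_{k=1}^{d-1} [ C(d,k) * sum_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1
def num_rules(d):
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6))        # 602
print(3**6 - 2**7 + 1)     # 602, via the closed form
```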
Frequent Itemset Mining Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by vertical-based mining algorithms
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
15
Methods for Frequent Itemset Mining
• Traversal of Itemset Lattice
– General-to-specific vs Specific-to-general
[Figure: the frequent itemset border in the lattice from null to {a1, a2, ..., an} under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional traversal]
16
Methods for Frequent Itemset Mining
• Traversal of Itemset Lattice
– Breadth-first vs Depth-first
[Figure: (a) breadth-first vs. (b) depth-first traversal of the itemset lattice]
17
Methods for Frequent Itemset Mining
• Traversal of Itemset Lattice
– Equivalent Classes
[Figure: the itemset lattice over {A, B, C, D} partitioned into equivalence classes, organized as (a) a prefix tree and (b) a suffix tree]
18
Methods for Frequent Itemset Mining
• Representation of Database
– horizontal vs vertical data layout
Horizontal data layout (TID → items):
TID | Items
1   | A, B, E
2   | B, C, D
3   | C, E
4   | A, C, D
5   | A, B, C, D
6   | A, E
7   | A, B
8   | A, B, C
9   | A, C, D
10  | B

Vertical data layout (item → TID list):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
19
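Converting between the two layouts is a one-pass operation; a small sketch over the table above:

```python
from collections import defaultdict

# Convert the horizontal layout above (TID -> items) into the vertical layout
# (item -> sorted TID list).
horizontal = {
    1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"},
    4: {"A", "C", "D"}, 5: {"A", "B", "C", "D"}, 6: {"A", "E"},
    7: {"A", "B"}, 8: {"A", "B", "C"}, 9: {"A", "C", "D"}, 10: {"B"},
}

vertical = defaultdict(list)
for tid in sorted(horizontal):
    for item in horizontal[tid]:
        vertical[item].append(tid)

for item in sorted(vertical):
    print(item, vertical[item])
# A [1, 4, 5, 6, 7, 8, 9]
# B [1, 2, 5, 7, 8, 10]  ... matching the vertical table above
```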
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
– Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach
• ECLAT: Frequent Pattern Mining with Vertical Data
Format
20
The Downward Closure Property
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
• i.e., every transaction having {beer, diaper, nuts} also contains
{beer, diaper}
• Apriori principle holds due to the following property of the
support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
21
Illustrating Apriori Principle
[Figure: the itemset lattice over {A, B, C, D, E}; once an itemset such as AB is found to be infrequent, all of its supersets are pruned from the search]
22
Apriori:
A Candidate Generation & Test Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
• Algorithm
1. Initially, scan DB once to get frequent 1-itemset
2.1 Generate length (k+1) candidate itemsets from length k frequent
itemsets
2.2 Test the candidates against DB
3. Terminate when no frequent or candidate set can be generated
23
The Apriori Algorithm—An Example
[Figure: running Apriori on an example transaction database TDB with min_sup = 2; the 1st scan produces candidates C1 and frequent itemsets L1, the 2nd scan produces C2 and L2, and the 3rd scan produces C3 and L3]
24
The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_support;
end
return ∪k Lk;
25
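A compact, runnable Python sketch of this loop (illustrative names; candidate generation here is a simple self-join followed by the downward-closure prune described on the following slides):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: count}; a minimal sketch of the loop above."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(Lk)

    k = 1
    while Lk:
        # Generate C(k+1) by self-joining Lk, pruning any candidate that has an
        # infrequent k-subset (downward closure).
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                        frozenset(sub) in Lk for sub in combinations(union, k)):
                    candidates.add(union)

        # Count the candidates with one scan of the database for this level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        Lk = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

db = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
print(apriori(db, min_support=3))
```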
Implementation
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
26
Candidate Generation: SQL
• SQL Implementation of candidate generation
– Suppose the items in Lk-1 are listed in an order
– Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
– Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
27
Frequency Counting
• Hash tree
– Suppose you have 15 candidate itemsets of length 3:
• {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
– You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if
number of candidate itemsets exceeds max leaf size, split the
node)
[Figure: the resulting hash tree; the hash function sends items 1, 4, 7 to the left branch, 2, 5, 8 to the middle branch, and 3, 6, 9 to the right branch, so the 15 candidates end up distributed across the leaves]
28
Subset Operation
Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?
[Figure: level-wise enumeration of the 3-item subsets of t: fix the first item (1, 2, or 3), then the second, then the third, yielding 123, 125, 126, 135, 136, 156, 235, 236, 256, 356]
29
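Without a hash tree, the subset operation can be carried out directly with itertools; this plain-Python sketch enumerates the C(5,3) = 10 size-3 subsets of t and keeps those that appear among the 15 candidates from the previous slide (in Apriori each match would increment that candidate's support count):

```python
from itertools import combinations

# The 15 candidate 3-itemsets from the hash-tree slide.
candidates = {frozenset(c) for c in [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
    (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7),
    (6, 8, 9), (3, 6, 7), (3, 6, 8)]}

t = (1, 2, 3, 5, 6)   # the transaction

# Enumerate all size-3 subsets of t and keep those that are candidates; in
# Apriori, each match would increment that candidate's support count.
matched = [set(s) for s in combinations(t, 3) if frozenset(s) in candidates]
print(matched)        # [{1, 2, 5}, {1, 3, 6}, {3, 5, 6}]
```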
Subset Operation Using Hash Tree
[Figure: matching transaction {1, 2, 3, 5, 6} against the hash tree; the transaction is split at the root as 1+{2 3 5 6}, 2+{3 5 6}, 3+{5 6}, then further (12+{3 5 6}, 13+{5 6}, 15+{6}, ...), following the hash function (1,4,7 / 2,5,8 / 3,6,9) at each level until candidate leaves are reached]
30
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs
may also increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with
number of transactions
• Average transaction width
– transaction width increases with denser data sets
– This may increase max length of frequent itemsets, and the number of subsets
in a transaction increases with its width
31
Maximal Frequent Itemset
• An itemset is maximal frequent if none of its immediate
supersets is frequent
[Figure: the itemset lattice over {A, B, C, D, E} with the border between frequent and infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying immediately inside the border]
32
Closed Itemset
• An itemset is closed if none of its immediate supersets
has the same support as the itemset
TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset | Support      Itemset   | Support
{A}     | 4            {A,B,C}   | 2
{B}     | 5            {A,B,D}   | 3
{C}     | 3            {A,C,D}   | 2
{D}     | 4            {B,C,D}   | 3
{A,B}   | 4            {A,B,C,D} | 2
{A,C}   | 2
{A,D}   | 3
{B,C}   | 3
{B,D}   | 4
{C,D}   | 3
33
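Applying the two definitions to the support table above; a short sketch (the min_support value used for the maximal check is an assumption, since this slide does not fix a threshold):

```python
# Supports copied from the table above (all non-empty subsets of {A, B, C, D}).
support = {frozenset(k): v for k, v in {
    "A": 4, "B": 5, "C": 3, "D": 4,
    "AB": 4, "AC": 2, "AD": 3, "BC": 3, "BD": 4, "CD": 3,
    "ABC": 2, "ABD": 3, "ACD": 2, "BCD": 3, "ABCD": 2}.items()}
items = frozenset("ABCD")

def immediate_supersets(X):
    return [X | {i} for i in items - X]

# Closed: no immediate superset has the same support.
closed = [X for X in support
          if all(support.get(Y, 0) != support[X] for Y in immediate_supersets(X))]

# Maximal frequent: no immediate superset is frequent (min_support = 3 is an
# assumed threshold for illustration).
min_support = 3
maximal = [X for X in support
           if support[X] >= min_support
           and all(support.get(Y, 0) < min_support for Y in immediate_supersets(X))]

def fmt(sets):
    return sorted("".join(sorted(X)) for X in sets)

print("closed: ", fmt(closed))    # ['AB', 'ABCD', 'ABD', 'B', 'BCD', 'BD']
print("maximal:", fmt(maximal))   # ['ABD', 'BCD'] for min_support = 3
```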
Maximal vs. Closed Itemsets
TID | Items
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE
[Figure: the itemset lattice annotated with the transaction ids supporting each itemset; itemsets not supported by any transaction are marked]
34
Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Figure: the same annotated lattice with minimum support = 2; itemsets that are closed but not maximal and itemsets that are both closed and maximal are highlighted]
# Closed = 9
# Maximal = 4
35
Maximal vs Closed Itemsets
[Figure: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets]
36
Further Improvement of Apriori
• Major computational challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates
37
Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be frequent
in at least one of the partitions of DB
– Scan 1: partition database and find local frequent patterns
– Scan 2: consolidate global frequent patterns
sup1(i) < σ|DB1|  +  sup2(i) < σ|DB2|  +  …  +  supk(i) < σ|DBk|  ⇒  sup(i) < σ|DB|
(an itemset that is infrequent in every partition is infrequent in the whole DB)
38
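A rough sketch of the two-scan idea, assuming a brute-force miner for the in-memory partitions (purely for illustration; a real implementation would use a level-wise or pattern-growth method per partition):

```python
from itertools import combinations

def local_frequent(partition, rel_minsup):
    """Brute-force mining of one in-memory partition (for illustration only)."""
    items = sorted(set().union(*partition))
    local = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            c = frozenset(cand)
            if sum(1 for t in partition if c <= t) >= rel_minsup * len(partition):
                local.add(c)
    return local

def partition_mine(db, rel_minsup, num_partitions=2):
    # Scan 1: mine each partition with a proportionally scaled threshold.
    # Any globally frequent itemset must be locally frequent in at least one
    # partition, so the union of local results is a superset of the answer.
    size = (len(db) + num_partitions - 1) // num_partitions
    candidates = set()
    for p in range(num_partitions):
        part = db[p * size:(p + 1) * size]
        if part:
            candidates |= local_frequent(part, rel_minsup)

    # Scan 2: count the global support of the candidates, keep the frequent ones.
    result = {}
    for c in candidates:
        cnt = sum(1 for t in db if c <= t)
        if cnt >= rel_minsup * len(db):
            result[c] = cnt
    return result

db = [frozenset(t) for t in [
    {"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"}]]
print(partition_mine(db, rel_minsup=0.6))
```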
FP-Growth:
A Frequent Pattern-Growth Approach
• Bottlenecks of the Apriori approach
– Breadth-first (i.e., level-wise) search
– Candidate generation and test
• Often generates a huge number of candidates
• The FPGrowth Approach
– Depth-first search
– Avoid explicit candidate generation
• Major philosophy: Grow long patterns from short ones using
local frequent items only
– “abc” is a frequent pattern
– Get all transactions having “abc”, i.e., project DB on abc: DB|abc
– “d” is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
39
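The grow-from-short-patterns idea can be illustrated without the FP-tree by projecting a plain transaction list (the FP-tree, introduced next, is a compressed way of doing the same work); a simplified sketch that orders items lexicographically rather than by the f-list:

```python
from collections import Counter

def grow(db, minsup, prefix=()):
    """Pattern growth on a plain transaction list: for every locally frequent
    item i, report prefix + {i} and recurse on the projected database DB|i."""
    counts = Counter(item for t in db for item in t)
    results = {}
    for item, cnt in sorted(counts.items()):
        if cnt >= minsup:
            pattern = prefix + (item,)
            results[pattern] = cnt
            # Project: keep only the transactions containing `item`, restricted to
            # items that sort after `item` (so each itemset is found exactly once).
            projected = [[x for x in t if x > item] for t in db if item in t]
            results.update(grow(projected, minsup, pattern))
    return results

db = [sorted(t) for t in [
    {"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"}]]
print(grow(db, minsup=3))
```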
General Idea
• The FP-growth method indexes the database for fast support
computation via an augmented prefix tree: the frequent pattern
tree (FP-tree)
– Each node in the tree is labeled with a single item, with the
support information for the itemset comprising the items on
the path from the root to that node
• FP-tree construction
– Initially the tree contains as root the null item ∅
– For each tuple X ∈ D, we insert X into the FP-tree, incrementing the count
of all nodes along the path that represents X
– If X shares a prefix with some previously inserted transaction, then X will
follow the same path until the common prefix. For the remaining items in X,
new nodes are created under the common prefix, with counts initialized to 1
– The FP-tree is complete when all transactions have been inserted
40
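A minimal FP-tree construction sketch following this description (a node class with item, count, parent, and children; a simple header table of node lists stands in for the node-links; all names are illustrative):

```python
from collections import Counter

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                        # item -> FPNode

def build_fptree(transactions, minsup):
    # Pass 1: find the frequent items and sort them by descending support (f-list).
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, c in freq.most_common() if c >= minsup]
    rank = {item: r for r, item in enumerate(flist)}

    # Pass 2: insert each transaction, keeping only its frequent items in f-list order.
    root = FPNode()
    header = {item: [] for item in flist}         # item -> nodes labeled with it
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:         # new branch after the shared prefix
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                       # increment counts along the path
    return root, header, flist

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

db = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
root, header, flist = build_fptree(db, minsup=3)
show(root)
```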
FP-Tree: Preprocessing
• Procedure
1. Scan the DB once, find frequent 1-itemset (single item pattern)
2. Sort frequent items in frequency descending order: f-list
3. Scan DB again, construct FP-tree
• This way, frequent patterns can be partitioned into subsets according to the f-list
(e.g., f-list = b-e-a-c-d):
– Patterns containing d
– Patterns having c but no d
– …
– Patterns having b but none of e, a, c, d
• Completeness and non-redundancy
41
Frequent Pattern Tree
• The FP-tree is a prefix compressed representation of D, and the
frequent items are sorted in descending order of support
42
FP-Growth Algorithm
• Given an FP-tree R, projected FP-trees are built for each frequent
item i in R in increasing order of support, in a recursive manner
• To project R on item i, we find all the occurrences of i in the tree,
and for each occurrence, we determine the corresponding path
from the root to i
– The count of item i on a given path is recorded in cnt(i) and the
path is inserted into the new projected tree
– While inserting the path, the count of each node along the given
path is incremented by the path count cnt(i)
• The base case for the recursion happens when the input FP-tree
R is a single path
– Basic enumeration
43
Projected Frequent Pattern Tree for D
44
Frequent Pattern Tree Projection
45
FP-Growth Algorithm
[Figure: FP-Growth pseudocode. Annotations: P: the projected items; X: the new projected items, including i; the support of X is the minimum support of its items; the computation recurses on each projected tree]
46
Benefits of FP-Tree
• Completeness
– Preserve complete information for frequent pattern mining
– Never break a long pattern of any transaction
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more frequently
occurring, the more likely to be shared
– Never larger than the original database (not counting node-links and the count fields)
47
Advantages of the Pattern Growth Approach
• Divide-and-conquer
– Decompose both the mining task and DB according to the frequent patterns
obtained so far
– Lead to focused search of smaller databases
• Other factors
– No candidate generation, no candidate test
– Compressed database: FP-tree structure
– No repeated scan of entire database
– Basic ops: counting local freq items and building sub FP-tree, no pattern search
and matching
48
ECLAT: Mining by Exploring Vertical Data Format
• For each item, store a list of transaction ids (tids)
49
ECLAT
• Determine the support of any k-itemset by intersecting
tid-lists of two of its (k-1) subsets
A: {1, 3, 4, 5}  ∩  B: {1, 2, 3, 4, 5, 6}  →  AB: {1, 3, 4, 5}
• Eclat intersects the tidsets only if the frequent itemsets
share a common prefix
– It traverses the prefix search tree in a DFS-like manner, processing
a group of itemsets that have the same prefix, also called a prefix
equivalence class
50
Eclat Algorithm
[Figure: Eclat pseudocode. Annotations: F: the set of frequent itemsets; P: the frequent items and their tid-lists; Xa: one frequent itemset; Pa: the set of frequent itemsets generated from Xa (they form the same prefix equivalence class); recursive computation]
51
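A small recursive Eclat sketch over the vertical layout shown earlier: each call processes one prefix equivalence class, extending an itemset with every later itemset in the class and intersecting their tid-sets (illustrative code, not the textbook pseudocode):

```python
def eclat(class_items, minsup, results):
    """class_items: list of (itemset_tuple, tidset) pairs sharing a common prefix."""
    for i, (itemset, tids) in enumerate(class_items):
        results[itemset] = len(tids)
        # Build the next prefix equivalence class: extend `itemset` with every
        # later itemset in this class, intersecting their tid-sets.
        next_class = []
        for other, other_tids in class_items[i + 1:]:
            new_tids = tids & other_tids
            if len(new_tids) >= minsup:
                next_class.append((itemset + other[-1:], new_tids))
        if next_class:
            eclat(next_class, minsup, results)

# Vertical layout from the earlier slide (item -> tid-set), minsup = 3.
vertical = {"A": {1, 4, 5, 6, 7, 8, 9}, "B": {1, 2, 5, 7, 8, 10},
            "C": {2, 3, 4, 8, 9}, "D": {2, 4, 5, 9}, "E": {1, 3, 6}}
minsup = 3
initial = [((item,), tids) for item, tids in sorted(vertical.items())
           if len(tids) >= minsup]
results = {}
eclat(initial, minsup, results)
for itemset, sup in sorted(results.items()):
    print(itemset, sup)
```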
ECLAT: Tidlist Intersections
52
Generating Association Rules
• How to efficiently generate rules from frequent
itemsets?
– In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
– But confidence of rules generated from the same itemset has an
anti-monotone property
• e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
of the rule
53
Example
• If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
• If |L| = k, then there are 2^k - 2 candidate association rules
(ignoring L → ∅ and ∅ → L)
54
Generating Association Rules
• Given a frequent itemset Z, we look at all proper subsets X ⊂ Z to
compute rules of the form X → Y, where Y = Z \ X
• The rule must be frequent
– s = sup(X ∪ Y) = sup(Z) ≥ minsup
• We compute the confidence as
– c = sup(X ∪ Y)/sup(X) = sup(Z)/sup(X)
– If c ≥ minconf, this rule is a strong association rule
– Otherwise, conf(W → Z\W) ≤ c < minconf for all subsets W ⊂ X, because
sup(W) ≥ sup(X); we can thus avoid checking the subsets of X
55
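A sketch of this procedure for a single frequent itemset Z, reusing the support table from the closed-itemset example earlier in these slides; subsets whose rules fail minconf are recorded so that their own subsets are skipped:

```python
from itertools import combinations

def gen_rules(Z, support, minconf):
    """Generate strong rules X -> Z\\X from a frequent itemset Z (simple sketch)."""
    Z = frozenset(Z)
    rules, failed = [], []
    # Visit proper subsets X from largest to smallest so the pruning below applies.
    for size in range(len(Z) - 1, 0, -1):
        for X in map(frozenset, combinations(Z, size)):
            # If a superset of X already failed, X fails too:
            # sup(X) >= sup(superset), so the confidence can only be lower.
            if any(X < bad for bad in failed):
                continue
            conf = support[Z] / support[X]
            if conf >= minconf:
                rules.append((X, Z - X, conf))
            else:
                failed.append(X)
    return rules

# Support counts copied from the closed-itemset example earlier in these slides.
support = {frozenset(k): v for k, v in {
    "A": 4, "B": 5, "C": 3, "D": 4, "AB": 4, "AC": 2, "AD": 3, "BC": 3,
    "BD": 4, "CD": 3, "ABC": 2, "ABD": 3, "ACD": 2, "BCD": 3, "ABCD": 2}.items()}

for X, Y, c in gen_rules("ABD", support, minconf=0.8):
    print(f"{sorted(X)} -> {sorted(Y)}  (conf = {c:.2f})")
# ['A', 'D'] -> ['B']  (conf = 1.00)
```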
Example
[Figure: the lattice of rules generated from {A,B,C,D}, from ABCD → {} down to single-item antecedents; once CD → AB is found to be a low-confidence rule, the rules below it whose LHS is a subset of CD (e.g., D → ABC, C → ABD) are pruned]
56
Association Rule Mining Algorithm
[Figure: rule-generation pseudocode. Annotations: F: the set of frequent itemsets; A: all proper subsets of Z; X: start from the largest subset; a rule passing the confidence test is a strong association rule; if X fails, all its subsets fail as well]
57
Pattern Evaluation
• Association rule algorithms tend to produce too many
rules
– many of them are uninteresting or redundant
– e.g., redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
• Interestingness measures can be used to prune/rank
the derived patterns
• In the original formulation of association rules, support
& confidence are the only measures used
– Anything else?
58
Computing Interestingness Measure
• Given a rule X  Y, information needed to compute
rule interestingness can be obtained from a
contingency table
Contingency table for X → Y:
      |  Y   |  ¬Y  |
  X   | f11  | f10  | f1+
 ¬X   | f01  | f00  | f0+
      | f+1  | f+0  | |T|
f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y
These counts are used to define various measures: support, confidence, lift, Gini, J-measure, …
59
Drawback of Confidence
        | Coffee | ¬Coffee |
  Tea   |   15   |    5    |  20
 ¬Tea   |   75   |    5    |  80
        |   90   |   10    | 100
Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75
but P(Coffee) = 0.9
⇒ Although confidence is high, the rule is misleading
⇒ P(Coffee | ¬Tea) = 75/80 = 0.9375
60
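Comparing the confidence against P(Coffee) is essentially what the lift measure (listed on the previous slide) does; a tiny sketch with the counts above, where a lift below 1 reveals the negative correlation that confidence hides:

```python
# Contingency counts for Tea -> Coffee from the table above.
n = 100
tea_and_coffee, tea, coffee = 15, 20, 90

confidence = tea_and_coffee / tea        # P(Coffee | Tea) = 0.75
lift = confidence / (coffee / n)         # 0.75 / 0.9 ≈ 0.83 < 1: negatively correlated
print(confidence, lift)
```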
Properties of a Good Interestingness Measure
• Three principal properties a good measure M must
satisfy:
1. M(A,B) = 0 if A and B are statistically independent
• P(AB) = P(A)P(B)
2. M(A,B) increases monotonically with P(A,B) when P(A) and P(B)
remain unchanged
3. M(A,B) decreases monotonically with P(A) [or P(B)] when
P(A,B) and P(B) [or P(A)] remain unchanged
61
Other Considerations
• Is the measure symmetric?
– Does M(A,B) = M(B,A)?
• Is the measure insensitive to row/column scaling?
        | Male | Female |
  High  |  2   |   3    |  5
  Low   |  1   |   4    |  5
        |  3   |   7    | 10

        | Male | Female |
  High  |  4   |  30    | 34
  Low   |  2   |  40    | 42
        |  6   |  70    | 76
(the second table scales the Male column by 2x and the Female column by 10x)
• Is the measure non-invariant?
      |  A  | ¬A  |            |  A  | ¬A  |
  B   |  p  |  r  |     vs.    |  p  |  r  |
 ¬B   |  q  |  s  |            |  q  | s+k |
62
63
Summary
• Basic concepts: association rules, the support-confidence
framework, closed and max-patterns
• Scalable frequent pattern mining methods
– Apriori (Candidate generation & Frequency counting)
– Projection-based: FP-Growth
– Vertical format approach: ECLAT
• Which patterns are interesting?
– Pattern evaluation methods
64
65
Research on Pattern Mining: A Road Map
Mining Multiple-Level Association Rules
• Items often form hierarchies
• Flexible support settings
– Items at the lower level are expected to have lower support
– Example: Milk [support = 10%] with children 2% Milk [support = 6%] and Skim Milk [support = 4%]
• Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
• Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
66
Mining Multiple-Level Association Rules
• Flexible min-support thresholds: Some items are more valuable
but less frequent
– Use non-uniform, group-based min-support
– E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …
• Redundancy Filtering: Some rules may be redundant due to
“ancestor” relationships between items
– milk → wheat bread [support = 8%, confidence = 70%]
– 2% milk → wheat bread [support = 2%, confidence = 72%]
– A rule is redundant if its support is close to the “expected” value, based on the
rule’s ancestor
67
Negative and Rare Patterns
• Rare patterns: Very low support but interesting
– E.g., buying Rolex watches
– Setting individual-based or special group-based support threshold
for valuable items
• Negative patterns
– Since it is unlikely that one buys Ford Expedition (an SUV car)
and Toyota Prius (a hybrid car) together, Ford Expedition and
Toyota Prius are likely negatively correlated patterns
• Negatively correlated patterns that are infrequent tend to be
more interesting than those that are frequent
68
Constraint-based (Query-Directed) Mining
• Finding all the patterns in a database autonomously? —
unrealistic!
– The patterns could be too many but not focused!
• Data mining should be an interactive process
– User directs what to be mined using a data mining query language (or a
graphical user interface)
• Constraint-based mining
– User flexibility: provides constraints on what to be mined
– Optimization: explores such constraints for efficient mining
– Note: still find all the answers satisfying constraints, not finding some answers
in “heuristic search”
69
Meta-Rule Guided Mining
• Meta-rule can be in the rule form with partially instantiated predicates and
constants
P1(X, Y) ^ P2(X, W) => buys(X, “iPad”)
• The resulting rule derived can be
age(X, “15-25”) ^ profession(X, “student”) => buys(X, “iPad”)
• In general, it can be in the form of
P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr
• Method to find meta-rules
– Find frequent (l+r) predicates (based on min-support threshold)
– Push constants deeply when possible into the mining
– Use confidence, correlation, and other measures when possible
70