Chapter 5. Association Rule
Introduction
In data mining, association rule learners are used to discover elements that co-occur frequently within a data set[1] consisting of multiple independent selections of elements (such as purchasing transactions), and to discover rules, such as implication or correlation, that relate co-occurring elements. Questions such as "If a customer purchases product A, how likely is he to purchase product B?" and "What products will a customer buy if he buys products C and D?" are answered by association-finding algorithms. This application of association rule learners is also known as market basket analysis. As with most data mining techniques, the task is to reduce a potentially huge amount of information to a small, understandable set of statistically supported statements.
A famous story about association rule mining is the "beer and diaper"
story. A purported survey of behavior of supermarket shoppers
discovered that customers (presumably young men) who buy diapers
tend also to buy beer. This anecdote became popular as an example of
how unexpected association rules might be found from everyday data.
Definition
Association Rule Mining:
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Association Rule:
An implication expression of the form X → Y, where X and Y are itemsets.
Example: Market-Basket Transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Itemset
A collection of one or more items
– Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X
Example:
{Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4 = 40%
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67 = 67%
(Here σ(·) denotes the number of transactions containing the itemset, and |T| the total number of transactions.)
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Frequent pattern
A pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set
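To make these definitions concrete, here is a minimal Python sketch (illustrative code, not part of the original material; the helper names support and confidence are invented here) that computes the two metrics on the market-basket transactions above and lists the 2-itemsets that are frequent at minsup = 0.4.

from itertools import combinations

# Toy market-basket data from the example above (one set of items per transaction).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support of (lhs union rhs) divided by the support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))      # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...

# Frequent 2-itemsets for minsup = 0.4 (i.e. contained in at least 2 of the 5 transactions):
items = {"Bread", "Milk", "Diaper", "Beer", "Coke", "Eggs"}
print([set(c) for c in combinations(sorted(items), 2)
       if support(c, transactions) >= 0.4])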
Motivation
Finding inherent regularities in data:
– What products were often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
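All six rules above are partitions of the single frequent itemset {Milk, Diaper, Beer}. The sketch below (illustrative code, reusing the transactions list and support helper from the earlier sketch) enumerates those partitions and keeps the rules whose confidence meets a minconf threshold.

from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """Split a frequent itemset into every antecedent/consequent pair and
    keep the rules whose confidence meets minconf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            s = support(itemset, transactions)
            c = s / support(lhs, transactions)
            if c >= minconf:
                rules.append((set(lhs), set(rhs), s, c))
    return rules

# With minconf = 0.6 the first four rules listed above survive; the last two
# ({Diaper} -> {Milk, Beer} and {Milk} -> {Diaper, Beer}) have c = 0.5 and are dropped.
for lhs, rhs, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, 0.6):
    print(lhs, "->", rhs, f"(s={s:.2f}, c={c:.2f})")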
Methods for Mining Frequent Patterns
The downward closure property of frequent patterns:
– Any subset of a frequent itemset must be frequent
– If {beer, diaper, nuts} is frequent, so is {beer, diaper}; i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: three major approaches
– Apriori (Agrawal & Srikant @VLDB'94)
– Frequent pattern growth (FP-growth; Han, Pei & Yin @SIGMOD'00)
– Vertical data format approach (CHARM; Zaki & Hsiao @SDM'02)
Apriori pruning principle
If any itemset is infrequent, its supersets need not be generated or tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
Method:
– Initially, scan the DB once to get the frequent 1-itemsets
– Generate length-(k+1) candidate itemsets from length-k frequent itemsets
– Test the candidates against the DB
– Terminate when no frequent or candidate set can be generated
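A minimal sketch of this level-wise loop (illustrative, simplified code rather than the original authors' implementation; minsup is given here as an absolute transaction count):

from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise search: start from frequent 1-itemsets, then repeatedly
    generate (k+1)-candidates from the frequent k-itemsets, count them
    against the DB, and stop when nothing new can be generated."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= minsup_count}
    frequent = set(Lk)
    while Lk:
        # Join pairs of frequent k-itemsets that differ in a single item ...
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == len(a) + 1}
        # ... and prune any candidate with an infrequent k-subset (Apriori pruning).
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, len(c) - 1))}
        # One DB scan per level: keep the candidates that meet minsup_count.
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= minsup_count}
        frequent |= Lk
    return frequent

For the market-basket transactions above, apriori(transactions, 3) returns every itemset contained in at least three of the five transactions.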
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:
  R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1
If d = 6, R = 3^6 - 2^7 + 1 = 729 - 128 + 1 = 602 rules. Expensive!
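The closed form can be sanity-checked by brute-force enumeration; the short sketch below (illustrative code) counts every rule X → Y with non-empty, disjoint X and Y drawn from d items.

from itertools import combinations

def rule_count(d):
    """Count all rules X -> Y where X and Y are non-empty, disjoint subsets of d items."""
    items = range(d)
    count = 0
    for k in range(1, d):                      # size of the antecedent X
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            for j in range(1, len(rest) + 1):  # size of the consequent Y
                count += sum(1 for _ in combinations(rest, j))
    return count

print(rule_count(6))        # 602
print(3**6 - 2**7 + 1)      # 602, matching the closed form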
Frequent Itemset Generation
[Figure: the itemset lattice over d = 5 items {A, B, C, D, E}, from the null set at the top to ABCDE at the bottom. Given d items, there are 2^d possible candidate itemsets. Figure after Tan, Steinbach & Kumar, Introduction to Data Mining.]
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
The Apriori principle holds due to the following property of the support measure:
– ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle
[Figure: the same itemset lattice over {A, B, C, D, E}. Once an itemset (for example {A, B}) is found to be infrequent, all of its supersets can be pruned from the search space without being counted.]
The Apriori Algorithm: An Example
Sup_min = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 (frequent 1-itemsets):
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2 (frequent 2-itemsets):
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (generated from L2): {B, C, E}

3rd scan → L3:
Itemset    sup
{B, C, E}  2
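As a usage check, running the apriori sketch from the previous section on this database reproduces the same result (this assumes the apriori function defined earlier):

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset in sorted(apriori(tdb, minsup_count=2), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
# The last line printed is ['B', 'C', 'E'], the only frequent 3-itemset, matching L3 above.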
Important Details of Apriori
How to generate candidates?
– Step 1: self-joining L_k
– Step 2: pruning
Example of candidate generation:
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining L3 * L3:
  – abcd from abc and abd
  – acde from acd and ace
– Pruning:
  – acde is removed because ade is not in L3
– C4 = {abcd}
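A minimal sketch of the two-step candidate generation (illustrative code; itemsets are kept as sorted tuples so the self-join on the first k-1 items is easy to express):

from itertools import combinations

def apriori_gen(Lk):
    """C_{k+1} from L_k: self-join on the first k-1 items, then prune any
    candidate that has an infrequent k-subset."""
    Lk = sorted(set(Lk))
    k = len(Lk[0])
    candidates = []
    for a, b in combinations(Lk, 2):
        if a[:k - 1] == b[:k - 1]:                        # self-join step
            c = tuple(sorted(set(a) | set(b)))
            if all(s in Lk for s in combinations(c, k)):  # pruning step
                candidates.append(c)
    return candidates

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))   # [('a', 'b', 'c', 'd')]; acde is pruned because ade is not in L3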
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets
is frequent
[Figure: itemset lattice over {A, B, C, D, E} with the border between the frequent and infrequent itemsets marked. The maximal frequent itemsets are the frequent itemsets that sit directly on this border, i.e., frequent itemsets none of whose immediate supersets is frequent.]
Closed Itemset
An itemset is closed if none of its immediate supersets has the same
support as the itemset
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, B, C, D}
4    {A, B, D}
5    {A, B, C, D}

Itemset  Support
{A}      4
{B}      5
{C}      3
{D}      4
{A, B}   4
{A, C}   2
{A, D}   3
{B, C}   3
{B, D}   4
{C, D}   3

Itemset       Support
{A, B, C}     2
{A, B, D}     3
{A, C, D}     2
{B, C, D}     3
{A, B, C, D}  2
Maximal vs Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice over {A, B, C, D, E} in which every itemset is annotated with the IDs of the transactions that contain it (for example, A is contained in transactions 1, 2, 4 and AB in transactions 1, 2). Itemsets near the bottom of the lattice, such as ABCE and ABCDE, are not supported by any transaction.]
Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Figure: the same annotated lattice with minimum support 2. Frequent itemsets that are closed but not maximal are distinguished from those that are both closed and maximal.]
# Closed = 9
# Maximal = 4
Maximal vs Closed Itemsets
Every maximal frequent itemset is closed, and every closed frequent itemset is frequent:
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
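A short sketch (illustrative code with invented helper names) that derives all three families for the five-transaction example above with minimum support 2 and reproduces the counts # Closed = 9 and # Maximal = 4:

from itertools import combinations

db = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
minsup = 2
items = sorted({i for t in db for i in t})

def count(itemset):
    """Number of transactions in db containing every item of the itemset."""
    return sum(1 for t in db if set(itemset) <= t)

frequent = {frozenset(c): count(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if count(c) >= minsup}

# Closed: no immediate superset has the same support.
closed = {s for s, sup in frequent.items()
          if not any(count(s | {i}) == sup for i in items if i not in s)}

# Maximal: no immediate superset is frequent at all.
maximal = {s for s in frequent
           if not any(count(s | {i}) >= minsup for i in items if i not in s)}

print(len(frequent), len(closed), len(maximal))   # 14 9 4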
Alternative Methods for Frequent Itemset Generation
Representation of the database
– horizontal vs vertical data layout

Horizontal Data Layout:
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical Data Layout (each item is stored with the list of TIDs that contain it):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
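The point of the vertical layout is that the support of an itemset can be obtained by intersecting TID-lists instead of rescanning every transaction. A small sketch of that idea (illustrative code for the representation only, not the CHARM algorithm itself):

# Horizontal layout from the table above: TID -> items.
horizontal = {
    1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"}, 4: {"A", "C", "D"},
    5: {"A", "B", "C", "D"}, 6: {"A", "E"}, 7: {"A", "B"}, 8: {"A", "B", "C"},
    9: {"A", "C", "D"}, 10: {"B"},
}

# Convert to the vertical layout: item -> set of TIDs containing it.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

# Support count of an itemset = size of the intersection of its items' TID-lists.
tids_ab = vertical["A"] & vertical["B"]
print(sorted(tids_ab), len(tids_ab))   # [1, 5, 7, 8] -> {A, B} occurs in 4 transactions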
Frequent Pattern-growth (FP-growth)
FP-growth uses a compressed representation of the database called an FP-tree.
Once an FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets.
FP-tree Construction

Transaction Database:
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, C, D, E}
4    {A, D, E}
5    {A, B, C}
6    {A, B, C, D}
7    {B, C}
8    {A, B, C}
9    {A, B, D}
10   {B, C, E}

After reading TID=1 the tree is a single path:
null → A:1 → B:1

After reading TID=2 a second branch is added from the root:
null → A:1 → B:1
null → B:1 → C:1 → D:1
FP-Tree Construction (after all ten transactions)

Header table: one entry per item (A, B, C, D, E), each with a pointer to the first tree node carrying that item.

Resulting FP-tree (each node is item:count):
null
├─ A:7
│  ├─ B:5
│  │  ├─ C:3
│  │  │  └─ D:1
│  │  └─ D:1
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:3
   └─ C:3
      ├─ D:1
      └─ E:1

The header-table pointers chain together all nodes carrying the same item and are used to assist frequent itemset generation.
FP-growth

To mine the itemsets ending in D, collect the prefix path of every D node in the FP-tree, together with that node's count. This gives the conditional pattern base for D:
P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}

Recursively apply FP-growth on P.

Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
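A compact sketch of the two pieces just described (illustrative, simplified code: FP-tree insertion assuming items are already listed in the fixed order A < B < C < D < E, plus extraction of the conditional pattern base; the header-table pointer chains are omitted):

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def insert(root, transaction):
    """Insert one transaction, incrementing counts along the shared prefix path."""
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
        child.count += 1
        node = child

def prefix_paths(root, target):
    """Conditional pattern base of `target`: (prefix path, count) for every
    node in the tree that carries `target`."""
    paths = []
    def walk(node):
        if node.item == target:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            paths.append((path[::-1], node.count))
        for child in node.children.values():
            walk(child)
    walk(root)
    return paths

db = [["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"], ["A", "D", "E"],
      ["A", "B", "C"], ["A", "B", "C", "D"], ["B", "C"], ["A", "B", "C"],
      ["A", "B", "D"], ["B", "C", "E"]]
root = FPNode(None)
for t in db:
    insert(root, t)

print({c.item: c.count for c in root.children.values()})   # {'A': 7, 'B': 3}
print(prefix_paths(root, "D"))
# [(['A','B','C'],1), (['A','B'],1), (['A','C'],1), (['A'],1), (['B','C'],1)]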
Exercises:
1. Find the frequent itemsets of the following data sets using the Apriori and FP-growth methods. Also determine the closed itemsets and the maximal itemsets.

a.
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

b.
ID  Temperature  Wind   Humidity  Play
1   Warm         Calm   Dry       Yes
2   Cold         Calm   Dry       Yes
3   Cold         Windy  Raining   No
4   Cold         Gale   Dry       No
5   Cold         Windy  Raining   Yes