Association Analysis (2)
Example
Min_sup_count = 2

Transaction database D:

TID | List of item IDs
----|------------------
T1  | I1, I2, I5
T2  | I2, I4
T3  | I2, I3
T4  | I1, I2, I4
T5  | I1, I3
T6  | I2, I3
T7  | I1, I3
T8  | I1, I2, I3, I5
T9  | I1, I2, I3

Scan D to count each candidate 1-itemset in C1; every candidate meets the minimum support count, so F1 = C1:

Itemset | Sup. count
--------|-----------
{I1}    | 6
{I2}    | 7
{I3}    | 6
{I4}    | 2
{I5}    | 2
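The first pass is easy to express in code. Below is a minimal Python sketch (the variable names and the dictionary-of-sets layout are illustrative, not from the slides) that scans the database once to build C1 and filters it into F1:

```python
from collections import Counter

# Transaction database from the example slide.
DB = {
    "T1": {"I1", "I2", "I5"}, "T2": {"I2", "I4"}, "T3": {"I2", "I3"},
    "T4": {"I1", "I2", "I4"}, "T5": {"I1", "I3"}, "T6": {"I2", "I3"},
    "T7": {"I1", "I3"}, "T8": {"I1", "I2", "I3", "I5"},
    "T9": {"I1", "I2", "I3"},
}
MIN_SUP_COUNT = 2

# C1: count every item; F1: keep the items meeting the minimum support count.
c1 = Counter(item for items in DB.values() for item in items)
f1 = {frozenset([item]): cnt for item, cnt in c1.items() if cnt >= MIN_SUP_COUNT}
print(sorted((tuple(k), v) for k, v in f1.items()))
# [(('I1',), 6), (('I2',), 7), (('I3',), 6), (('I4',), 2), (('I5',), 2)]
```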
Generate C2 from F1F1
Min_sup_count = 2
TID
List of item ID’s
T1
I1, I2, I5
T2
I2, I4
T3
I2, I3
T4
I1, I2, I4
T5
I1, I3
T6
I2, I3
T7
F1
Itemset
C2
Sup.
count
Itemset
Itemset
Sup. C
{I1,I2}
{I1,I2}
4
{I1}
6
{I1,I3}
{I1,I3}
4
{I2}
7
{I1,I4}
{I1,I4}
1
{I3}
6
{I1,I5}
{I1,I5}
2
{I4}
2
{I2,I3}
{I2,I3}
4
{I5}
2
{I2,I4}
{I2,I4}
2
I1, I3
{I2,I5}
{I2,I5}
2
T8
I1, I2, I3, I5
{I3,I4}
{I3,I4}
0
T9
I1, I2, I3
{I3,I5}
{I3,I5}
1
{I4,I5}
{I4,I5}
0
Generate C3 from F2F2
Min_sup_count = 2
F2
Prune
TID
List of item ID’s
Itemset
Sup. C
Itemset
Itemset
T1
I1, I2, I5
{I1,I2}
4
{I1,I2,I3}
{I1,I2,I3}
T2
I2, I4
{I1,I3}
4
{I1,I2,I5}
{I1,I2,I5}
T3
I2, I3
{I1,I5}
2
{I1,I3,I5}
{I1,I3,I5}
T4
I1, I2, I4
{I2,I3}
4
{I2,I3,I4}
{I2,I3,I4}
T5
I1, I3
{I2,I4}
2
{I2,I3,I5}
{I2,I3,I5}
T6
I2, I3
{I2,I5}
2
{I2,I4,I5}
{I2,I4,I5}
T7
I1, I3
T8
I1, I2, I3, I5
T9
I1, I2, I3
F3
Itemset
Sup. C
{I1,I2,I3}
2
{I1,I2,I5}
2
Generate C4 from F3F3
Min_sup_count = 2
C4
TID
List of item ID’s
Itemset
T1
I1, I2, I5
{I1,I2,I3,I5} 2
T2
I2, I4
T3
I2, I3
T4
I1, I2, I4
T5
I1, I3
T6
I2, I3
T7
I1, I3
T8
I1, I2, I3, I5
T9
I1, I2, I3
Sup. C
{I1,I2,I3,I5} is pruned because {I2,I3,I5} is
infrequent
F3
Itemset
Sup. C
{I1,I2,I3}
2
{I1,I2,I5}
2
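The join-and-prune step that produced C3 and C4 above can be sketched as follows. This is a simplified Python illustration, not the deck's pseudocode: it generates candidates as unions of two frequent k-itemsets rather than by the lexicographic F_k ⋈ F_k prefix join, relying on the prune step to discard the extras.

```python
from itertools import combinations

def apriori_gen(freq_k):
    """Candidate generation: join pairs of frequent k-itemsets whose union
    has k+1 items, then prune candidates with an infrequent k-subset."""
    freq_k = set(freq_k)
    k = len(next(iter(freq_k)))
    candidates = {a | b for a in freq_k for b in freq_k if len(a | b) == k + 1}
    return {c for c in candidates
            if all(frozenset(s) in freq_k for s in combinations(c, k))}

F2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
print(sorted(sorted(c) for c in apriori_gen(F2)))
# [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]  -- C3 after pruning, as above.
# Applied to F3, the join yields {I1,I2,I3,I5}, which the prune step rejects
# because {I1,I3,I5} (and {I2,I3,I5}) is not frequent, so C4 is empty.
```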
Candidate support counting
• Scan the database of transactions to determine the support of
each candidate itemset.
• Brute force: match each transaction against every candidate
(sketched below).
– Too many comparisons!
• Better method: store the candidate itemsets in a hash structure,
so that a transaction is matched only against the candidates
contained in a few buckets.
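For reference, the brute-force baseline is a few lines of Python (a hedged sketch; `db` mapping TIDs to item sets is an assumed layout). It performs one subset test per (transaction, candidate) pair, which is exactly the cost the hash structure reduces:

```python
def count_support_brute_force(db, candidates):
    """Match every transaction against every candidate: O(N * M) subset
    tests for N transactions and M candidates."""
    counts = {c: 0 for c in candidates}
    for items in db.values():
        for c in candidates:
            if c <= items:          # is the candidate contained in the transaction?
                counts[c] += 1
    return counts

db = {1: {"Bread", "Milk"}, 2: {"Bread", "Diaper", "Beer", "Eggs"},
      3: {"Milk", "Diaper", "Beer", "Coke"}, 4: {"Bread", "Milk", "Diaper", "Beer"},
      5: {"Bread", "Milk", "Diaper", "Coke"}}
cands = [frozenset(c) for c in [("Bread", "Milk"), ("Milk", "Diaper"), ("Beer", "Diaper")]]
print(count_support_brute_force(db, cands))
# each of the three candidate pairs has support 3 in this database
```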
Transactions (N = 5):

TID | Items
----|---------------------------
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

(Figure: the N transactions are matched against the candidate k-itemsets stored in the buckets of a hash structure.)
Generate Hash Tree

You need:
• A hash function (e.g., p mod 3, which routes items 1,4,7 / 2,5,8 / 3,6,9 to the left / middle / right branch)
• A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a node exceeds the max leaf size, split the node)

Suppose you have 15 candidate itemsets of length 3 and the max leaf size is 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

(Figure: the resulting hash tree with 9 leaves; the next slides build it step by step.)
Generate Hash Tree (cont.)

Split nodes with more than 3 candidates using the second item.

(Figure: hashing on the first item leaves two overfull groups. The 1,4,7 group
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6} is split on the second item:
1,4,7 → {1 4 5}; 2,5,8 → {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}; 3,6,9 → {1 3 6}.
The 2,5,8 group {2 3 4}, {5 6 7} already fits in one leaf.)
Generate Hash Tree (cont.)

Now split nodes using the third item.

(Figure: the still-overfull node {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9} is split on the third item:
1,4,7 → {1 2 4}, {4 5 7}; 2,5,8 → {1 2 5}, {4 5 8}; 3,6,9 → {1 5 9}.)
Generate Hash Tree (cont.)

Now split the remaining overfull group, the 3,6,9 first-item group {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}, similarly: on the second item it splits into
1,4,7 → {3 4 5}; 2,5,8 → {3 5 6}, {3 5 7}, {6 8 9}; 3,6,9 → {3 6 7}, {3 6 8}.
The finished tree has 9 leaves, none holding more than 3 candidates.
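A Python sketch of such a hash tree follows. The class layout, the `MAX_LEAF` constant, and the `match` traversal are illustrative, not from the slides. Note that `match` hashes every remaining item at each node instead of only those positions that can still complete a 3-subset, so in general it may visit a few extra buckets; on this example it reaches exactly the 11 candidates reported on the later subset-operation slide.

```python
MAX_LEAF = 3

class HashTreeNode:
    def __init__(self, depth=0):
        self.depth = depth        # which item position this node hashes on
        self.children = None      # dict: branch -> HashTreeNode (interior node)
        self.candidates = []      # itemsets stored here (leaf node)

    def insert(self, itemset):
        if self.children is not None:                     # interior: route down
            branch = itemset[self.depth] % 3
            self.children.setdefault(branch, HashTreeNode(self.depth + 1)).insert(itemset)
        else:
            self.candidates.append(itemset)
            # Split an overfull leaf, unless we have run out of items to hash on.
            if len(self.candidates) > MAX_LEAF and self.depth < len(itemset):
                pending, self.candidates, self.children = self.candidates, [], {}
                for c in pending:
                    self.insert(c)

    def match(self, items, start=0):
        """Yield every stored candidate reachable for the (sorted)
        transaction suffix items[start:], as in the subset operation."""
        if self.children is None:
            yield from self.candidates      # compare the whole bucket
            return
        seen = set()
        for i in range(start, len(items)):
            branch = items[i] % 3
            if branch in seen:              # don't descend the same branch twice
                continue
            seen.add(branch)
            child = self.children.get(branch)
            if child is not None:
                yield from child.match(items, i + 1)

candidates = [(1,4,5),(1,2,4),(4,5,7),(1,2,5),(4,5,8),(1,5,9),(1,3,6),(2,3,4),
              (5,6,7),(3,4,5),(3,5,6),(3,5,7),(6,8,9),(3,6,7),(3,6,8)]
root = HashTreeNode()
for c in candidates:
    root.insert(c)

transaction = (1, 2, 3, 5, 6)
reached = set(root.match(transaction))
print(len(reached))   # 11 of the 15 candidates are compared
print(sorted(c for c in reached if set(c) <= set(transaction)))
# [(1, 2, 5), (1, 3, 6), (3, 5, 6)]  -- the candidates the transaction contains
```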
Subset Operation

Given a (lexicographically ordered) transaction t, say {1,2,3,5,6}, how can we enumerate its subsets of size 3?

Transaction t: 1 2 3 5 6
Level 1 (fix the first item):  1 | 2 3 5 6    2 | 3 5 6    3 | 5 6
Level 2 (fix the second item): 12 | 3 5 6   13 | 5 6   15 | 6   23 | 5 6   25 | 6   35 | 6
Level 3 (subsets of 3 items):  123 125 126 135 136 156 235 236 256 356
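The same level-wise enumeration in Python (a minimal sketch, equivalent to itertools.combinations but written recursively to mirror the prefix levels above):

```python
def k_subsets(items, k, prefix=()):
    """Enumerate k-subsets of a sorted transaction by fixing one item per
    level, exactly as in the level-wise picture above."""
    if len(prefix) == k:
        yield prefix
        return
    need = k - len(prefix)
    # Leave enough items after position i to complete the subset.
    for i in range(len(items) - need + 1):
        yield from k_subsets(items[i + 1:], k, prefix + (items[i],))

print(list(k_subsets((1, 2, 3, 5, 6), 3)))
# [(1, 2, 3), (1, 2, 5), (1, 2, 6), (1, 3, 5), (1, 3, 6), (1, 5, 6),
#  (2, 3, 5), (2, 3, 6), (2, 5, 6), (3, 5, 6)]
```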
Subset Operation Using Hash Tree

(Figure: the transaction 1 2 3 5 6 is split at the root of the hash tree built above into 1+ 2356, 2+ 356, and 3+ 56, with the hash function routing each leading item down the 1,4,7 / 2,5,8 / 3,6,9 branch.)
(Figure: one level down, 1+ 2356 expands into 12+ 356, 13+ 56, and 15+ 6, each hashed on its second item.)
(Figure: continuing the recursion down to the leaves visits only part of the tree.)

In the end, the transaction is matched against 11 out of the 15 candidates.
Rule Generation
• An association rule can be extracted by partitioning a frequent itemset Y into
two nonempty subsets, X and Y - X, such that
X → Y - X
satisfies the confidence threshold.
• Each frequent k-itemset Y can produce up to 2^k - 2 association rules
– ignoring rules that have empty antecedents or consequents.
Example
Let Y = {1, 2, 3} be a frequent itemset.
Six candidate association rules can be generated from Y:
{1, 2} → {3}, {1, 3} → {2}, {2, 3} → {1}, {1} → {2, 3}, {2} → {1, 3}, {3} → {1, 2}.

Computing the confidence of an association rule does not require additional scans of the transactions. Consider {1, 2} → {3}. Its confidence is
σ({1, 2, 3}) / σ({1, 2}).
Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent too, and we already know the supports of all frequent itemsets.
Confidence-Based Pruning I
Theorem.
If a rule X → Y - X does not satisfy the confidence threshold, then no rule
X' → Y - X', where X' is a subset of X, can satisfy the confidence threshold either.
Proof.
Consider the two rules X' → Y - X' and X → Y - X, where X' ⊂ X.
Their confidences are σ(Y) / σ(X') and σ(Y) / σ(X), respectively.
Since X' is a subset of X, σ(X') ≥ σ(X).
Therefore, the former rule cannot have a higher confidence than the latter.
Confidence-Based Pruning II
• Observe that:
X' ⊂ X implies that Y - X' ⊃ Y - X

(Figure: Venn diagram of X' ⊂ X ⊂ Y.)
Confidence-Based Pruning III
• Initially, all the high-confidence rules that have only one item in
the rule consequent are extracted.
• These rules are then used to generate new candidate rules
(see the sketch below).
• For example, if
– {acd} → {b} and {abd} → {c} are high-confidence rules, then the
candidate rule {ad} → {bc} is generated by merging the
consequents of both rules.
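A compact Python sketch of this level-wise rule generation, applied to the {Bread, Milk, Diaper} example of the next two slides (the function name is illustrative; `support` is assumed to be the dictionary of support counts collected by Apriori):

```python
def gen_rules(freq_itemset, support, min_conf):
    """Level-wise rule generation for one frequent itemset Y: start with
    1-item consequents, keep the high-confidence rules, and merge the
    surviving consequents to build the next level (confidence-based pruning)."""
    y = frozenset(freq_itemset)
    rules, consequents = [], [frozenset([i]) for i in y]
    while consequents and len(consequents[0]) < len(y):
        survivors = []
        for h in consequents:
            conf = support[y] / support[y - h]
            if conf >= min_conf:
                rules.append((y - h, h, conf))
                survivors.append(h)
        # Merge surviving consequents that differ by one item, as on the slide.
        consequents = list({a | b for a in survivors for b in survivors
                            if len(a | b) == len(a) + 1})
    return rules

support = {frozenset(k): v for k, v in {
    ("Bread",): 4, ("Milk",): 4, ("Diaper",): 4,
    ("Bread", "Milk"): 3, ("Bread", "Diaper"): 3, ("Milk", "Diaper"): 3,
    ("Bread", "Milk", "Diaper"): 3}.items()}
for lhs, rhs, conf in gen_rules(("Bread", "Milk", "Diaper"), support, 0.5):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
# prints all six rules, e.g. ['Bread', 'Milk'] -> ['Diaper'] 1.0
# and ['Bread'] -> ['Diaper', 'Milk'] 0.75
```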
Confidence-Based Pruning IV
Confidence threshold = 50%

Items (1-itemsets):

Item   | Count
-------|------
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets):

Itemset        | Count
---------------|------
{Bread,Milk}   | 3
{Bread,Beer}   | 2
{Bread,Diaper} | 3
{Milk,Beer}    | 2
{Milk,Diaper}  | 3
{Beer,Diaper}  | 3

Triplets (3-itemsets):

Itemset             | Count
--------------------|------
{Bread,Milk,Diaper} | 3

Rules with one-item consequents from {Bread,Milk,Diaper}:
{Bread,Milk} → {Diaper} (confidence = 3/3)
{Bread,Diaper} → {Milk} (confidence = 3/3)
{Diaper,Milk} → {Bread} (confidence = 3/3)
Confidence-Based Pruning V
Merge:
{Bread,Milk} → {Diaper}
{Bread,Diaper} → {Milk}
to obtain the candidate rule
{Bread} → {Diaper,Milk} (confidence = 3/4)
…
Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support to
their supersets.

(Table: 15 transactions over 30 items A1–A10, B1–B10, C1–C10.
Transactions 1–5 contain exactly {A1, ..., A10}, transactions 6–10 exactly
{B1, ..., B10}, and transactions 11–15 exactly {C1, ..., C10}.)
10 
 3    
k 1  k 
 435,456
10
• Number of frequent itemsets
• Need a compact representation
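A quick Python check of the count above (assuming Python 3.8+ for math.comb): three disjoint blocks of 10 items, so the frequent itemsets are the nonempty subsets of each block.

```python
from math import comb

n = 3 * sum(comb(10, k) for k in range(1, 11))
print(n)   # 3069 == 3 * (2**10 - 1)
```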
Maximal Frequent Itemsets
An itemset is maximal frequent if none of its immediate supersets is frequent.

(Figure: the itemset lattice over {A, B, C, D, E}, from the null set down to ABCDE. A border separates the frequent itemsets from the infrequent ones; the maximal frequent itemsets are the frequent itemsets sitting directly on the border.)

Maximal frequent itemsets form the smallest set of itemsets from which all frequent itemsets can be derived.
Maximal Frequent Itemsets
• Despite providing a compact representation, maximal frequent
itemsets do not contain the support information of their subsets.
– For example, the supports of the maximal frequent itemsets
{a, c, e}, {a, d}, and {b, c, d, e} do not provide any hint about the
supports of their subsets.
• An additional pass over the data set is therefore needed to
determine the support counts of the non-maximal frequent
itemsets.
• It might be desirable to have a minimal representation of
frequent itemsets that preserves the support information.
– Such a representation is the set of closed frequent itemsets.
Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support
as the itemset.
– Put another way, an itemset X is not closed if at least one of its immediate
supersets has the same support count as X.
• An itemset is a closed frequent itemset if it is closed and its support is
greater than or equal to minsup.

TID | Items
----|----------
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset | Support
--------|--------
{A}     | 4
{B}     | 5
{C}     | 3
{D}     | 4
{A,B}   | 4
{A,C}   | 2
{A,D}   | 3
{B,C}   | 3
{B,D}   | 4
{C,D}   | 3

Itemset   | Support
----------|--------
{A,B,C}   | 2
{A,B,D}   | 3
{A,C,D}   | 2
{B,C,D}   | 3
{A,B,C,D} | 2
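Both definitions are mechanical to check once all supports are known. A brute-force Python sketch for the five-transaction database above (illustrative, and feasible only because the database is tiny):

```python
from itertools import combinations

DB = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"}, {"A","B","D"}, {"A","B","C","D"}]
MINSUP = 2

# Support of every itemset with nonzero support.
items = sorted(set().union(*DB))
support = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        s = sum(1 for t in DB if set(c) <= t)
        if s:
            support[frozenset(c)] = s

frequent = {x for x, s in support.items() if s >= MINSUP}
# Closed: no immediate superset with the same support.
closed = {x for x in frequent
          if not any(support.get(x | {i}, 0) == support[x] for i in set(items) - x)}
# Maximal: no frequent immediate superset.
maximal = {x for x in frequent
           if not any((x | {i}) in frequent for i in set(items) - x)}
print(len(frequent), len(closed), len(maximal))
# 15 6 1  -- at minsup = 2, all 15 listed itemsets are frequent,
#           6 are closed, and only {A,B,C,D} is maximal.
```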
Lattice

TID | Items
----|------
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE

(Figure: the itemset lattice annotated with transaction IDs. Each node carries the TIDs of the transactions that contain it:
A: 124, B: 123, C: 1234, D: 245, E: 345;
AB: 12, AC: 124, AD: 24, AE: 4, BC: 123, BD: 2, BE: 3, CD: 24, CE: 34, DE: 45;
ABC: 12, ABD: 2, ACD: 24, ACE: 4, ADE: 4, BCD: 2, BCE: 3, CDE: 4;
ABCD: 2, ACDE: 4.
Nodes with no TIDs, such as ABE, are not supported by any transaction.)
Maximal vs. Closed Itemsets
Minimum support = 2

(Figure: the TID-annotated lattice from the previous slide. The frequent itemsets are A, B, C, D, E, AB, AC, AD, BC, CD, CE, DE, ABC, ACD. Of these, C, D, E, AC, BC, CE, DE, ABC, ACD are closed; CE, DE, ABC, ACD are both closed and maximal, while the rest of the closed itemsets are closed but not maximal.)

# Closed = 9
# Maximal = 4
Maximal vs. Closed Itemsets

(Figure: nested sets, maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)

All maximal frequent itemsets are closed, because none of the maximal frequent itemsets can have the same support count as their immediate supersets.
Deriving Frequent Itemsets From
Closed Frequent Itemsets
• Consider {a, d}.
– It is frequent because {a, b, d} is.
– Since it is not closed, its support count must be identical to that of one of its
immediate supersets.
– The key is to determine which superset among {a, b, d}, {a, c, d}, and {a, d, e}
has exactly the same support count as {a, d}.
• The Apriori principle states that:
– Any transaction that contains a superset of {a, d} must also contain {a, d}.
– However, a transaction that contains {a, d} does not have to contain any
superset of {a, d}.
– So the support of {a, d} must equal the largest support among its supersets.
– Since {a, c, d} has a larger support than both {a, b, d} and {a, d, e}, the support
of {a, d} must be identical to the support of {a, c, d}.
Algorithm
Let C denote the set of closed frequent itemsets.
Let kmax denote the maximum length of the closed frequent itemsets.

Fkmax = {f | f ∈ C, |f| = kmax}    {find all frequent itemsets of size kmax}
for k = kmax - 1 downto 1 do
  Set Fk to all sub-itemsets of length k of the frequent itemsets in Fk+1,
  plus the closed frequent itemsets of size k.
  for each f ∈ Fk do
    if f ∉ C then
      f.support = max{f'.support | f' ∈ Fk+1, f ⊂ f'}
    end if
  end for
end for
Example
C = {ABC:3, ACD:4, CE:6, DE:7}, kmax = 3
F3 = {ABC:3, ACD:4}
F2 = {AB:3, AC:4, BC:3, AD:4, CD:4, CE:6, DE:7}
F1 = {A:4, B:3, C:6, D:7, E:7}
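A direct Python transcription of the algorithm, reproducing this example (the function name and the per-level dictionary are illustrative):

```python
def frequents_from_closed(closed):
    """Derive all frequent itemsets and their supports from the closed
    frequent itemsets, following the algorithm above."""
    closed = {frozenset(k): v for k, v in closed.items()}
    kmax = max(len(f) for f in closed)
    levels = {kmax: {f: s for f, s in closed.items() if len(f) == kmax}}
    for k in range(kmax - 1, 0, -1):
        fk = {f - {i} for f in levels[k + 1] for i in f}        # sub-itemsets
        fk |= {f for f in closed if len(f) == k}                # closed of size k
        levels[k] = {f: closed[f] if f in closed else
                     max(s for g, s in levels[k + 1].items() if f < g)
                     for f in fk}
    return {f: s for lvl in levels.values() for f, s in lvl.items()}

C = {("A","B","C"): 3, ("A","C","D"): 4, ("C","E"): 6, ("D","E"): 7}
for f, s in sorted(frequents_from_closed(C).items(),
                   key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(f)), s)
# A 4, B 3, C 6, D 7, E 7, then AB 3, AC 4, AD 4, BC 3, CD 4, CE 6, DE 7,
# then ABC 3, ACD 4 -- matching F1, F2, and F3 above.
```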
Computing Frequent Closed Itemsets
• How?
• Use the Apriori algorithm.
• After computing, say, Fk and Fk+1, check whether some itemset in Fk
has support equal to the support of one of its supersets in Fk+1.
Purge all such itemsets from Fk.
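The purge step might look like this in Python (a hedged sketch; `fk` and `fk1` are assumed to map itemsets to support counts):

```python
def purge_non_closed(fk, fk1):
    """Keep only the itemsets in Fk whose support differs from that of all
    of their supersets in Fk+1, i.e. drop the non-closed ones."""
    return {f: s for f, s in fk.items()
            if not any(f < g and s == s1 for g, s1 in fk1.items())}

F1 = {frozenset("A"): 4, frozenset("B"): 3, frozenset("C"): 6}
F2 = {frozenset("AC"): 4, frozenset("BC"): 3, frozenset("CE"): 6}
print(purge_non_closed(F1, F2))
# {} -- in the example above, A, B, and C are all non-closed.
```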