
Huffman Codes and Association
Rules (II)
Prof. Sin-Min Lee
Department of Computer Science
Huffman Code Example
• Given the symbols and their frequencies:
  A = 3, B = 1, C = 2, D = 4, E = 6
• Sorting the symbols in increasing order of frequency (smallest to largest) gives:
  B = 1, C = 2, A = 3, D = 4, E = 6
Huffman Code Example – Step 1
• Because B and C have the two lowest frequencies, they are merged into a single node BC. The new frequency is 1 + 2 = 3.
Huffman Code Example – Step 2
• Reordering in increasing order of frequency again gives:
  BC = 3, A = 3, D = 4, E = 6
Huffman Code Example – Step 3
• Another merge combines BC (3) and A (3) into a node (BC)A with frequency 6.
Huffman Code Example – Step 4
• From the initial ordering B, C, A, D, E, the working set is now D = 4, (BC)A = 6, E = 6.
[Figure: equivalent arrangements of the three remaining nodes D, (BC)A, and E; the tie between the two frequency-6 nodes may be broken either way.]
Huffman Code Example – Step 5
• Merging the two smallest nodes from the previous step, D (4) and (BC)A (6), gives a node D(BC)A with frequency 10; E (6) remains.
[Figure: the resulting partial trees.]
Huffman Code Example – Step 6
• Merging the last two nodes, E (6) and D(BC)A (10), produces the root with total frequency 16. The tree is now complete.
[Figure: the completed Huffman tree.]
Huffman Code Example – Step 7
• After the previous step, we map a 1 to each right branch and a 0 to each left branch. For the tree built above, the resulting codes are:
  E = 0, D = 10, A = 110, B = 1110, C = 1111
• Breaking the frequency ties differently yields other optimal codes with the same total cost (35 bits for this input), for example:
  B = 000, C = 001, A = 01, D = 10, E = 11
  A = 00, B = 010, C = 011, D = 10, E = 11
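To make the construction concrete, here is a minimal Python sketch of the same greedy procedure using a heap. The helper names are mine, and the tie-breaking between equal frequencies, which decides which of the equally good codes above you obtain, is an arbitrary choice of the sketch.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code by repeatedly merging the two lowest-frequency nodes."""
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # smallest remaining frequency
        f2, _, right = heapq.heappop(heap)   # second smallest
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))  # merged node
        counter += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, str):            # leaf: record its code word
            codes[tree] = prefix or "0"
        else:                                # 0 = left branch, 1 = right branch
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

freqs = {"A": 3, "B": 1, "C": 2, "D": 4, "E": 6}
codes = huffman_codes(freqs)
print(codes)
print(sum(len(c) * freqs[s] for s, c in codes.items()))  # total cost: 35 bits
```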
Example
• Items={milk, coke, pepsi, beer, juice}.
• Minimum support = 3 baskets.
B1 = {m, c, b}
B2 = {m, p, j}
B3 = {m, b}
B4 = {c, j}
B5 = {m, p, b}
B6 = {m, c, b, j}
B7 = {c, b, j}
B8 = {b, c}
• Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
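These counts are easy to verify mechanically. Below is a brute-force sketch in Python that recomputes the frequent itemsets for the eight baskets above; the variable names and the loop over sizes 1 and 2 are my choices, not part of the slide.

```python
from itertools import combinations

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
min_count = 3  # support threshold: 3 baskets

items = sorted(set().union(*baskets))
for k in (1, 2):                             # check 1-itemsets, then 2-itemsets
    for combo in combinations(items, k):
        support = sum(set(combo) <= b for b in baskets)  # baskets containing combo
        if support >= min_count:
            print(set(combo), support)
# Prints exactly {m}, {c}, {b}, {j}, {m, b}, {c, b}, {c, j}; p is infrequent.
```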
Association Rules
• Association rule R : Itemset1 => Itemset2
– Itemset1, 2 are disjoint and Itemset2 is nonempty
– meaning: if a transaction includes Itemset1, then it also includes Itemset2
• Examples
– A,B => E,C
– A => B,C
Example
+ B1 = {m, c, b}     B2 = {m, p, j}
- B3 = {m, b}        B4 = {c, j}
- B5 = {m, p, b}   + B6 = {m, c, b, j}
  B7 = {c, b, j}     B8 = {b, c}
(The four baskets containing {m, b} are marked: “+” if they also contain c, “-” if they do not.)
• An association rule: {m, b} → c.
– Confidence = 2/4 = 50%.
From Frequent Itemsets to
Association Rules
• Q: Given frequent set {A,B,E}, what are
possible association rules?
– A => B, E
– A, B => E
– A, E => B
– B => A, E
– B, E => A
– E => A, B
– __ => A, B, E (empty rule), or true => A, B, E
Classification vs Association Rules
Classification Rules:
• Focus on one target field
• Specify class in all cases
• Measures: Accuracy
Association Rules:
• Many target fields
• Applicable in some cases
• Measures: Support, Confidence, Lift
Rule Support and Confidence
• Suppose R : I => J is an association rule
  – sup(R) = sup(I ∪ J) is the support count
    • support of the itemset I ∪ J (a transaction must contain both I and J)
  – conf(R) = sup(I ∪ J) / sup(I) is the confidence of R
    • fraction of transactions containing I that also contain J
• Association rules meeting minimum support and minimum confidence are sometimes called “strong” rules
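The two definitions translate directly into code; this is a minimal sketch (the function names are mine), reusing the basket data from the earlier example:

```python
def support_count(itemset, transactions):
    """sup(itemset): number of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions)

def confidence(I, J, transactions):
    """conf(I => J) = sup(I u J) / sup(I): among transactions containing I,
    the fraction that also contain all of J."""
    return support_count(I | J, transactions) / support_count(I, transactions)

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(confidence({"m", "b"}, {"c"}, baskets))  # 0.5, as in the earlier example
```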
Association Rules Example:
• Q: Given frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%?
A, B => E : conf=2/4 = 50%
A, E => B : conf=2/2 = 100%
B, E => A : conf=2/2 = 100%
E => A, B : conf=2/2 = 100%
Don’t qualify:
A => B, E : conf = 2/6 = 33% < 50%
B => A, E : conf = 2/7 = 28% < 50%
__ => A, B, E : conf = 2/9 = 22% < 50%
TID | List of items
 1  | A, B, E
 2  | B, D
 3  | B, C
 4  | A, B, D
 5  | A, C
 6  | B, C
 7  | A, C
 8  | A, B, C, E
 9  | A, B, C
Find Strong Association Rules
• A rule has the parameters minsup and
minconf:
– sup(R) >= minsup and conf (R) >= minconf
• Problem:
– Find all association rules with given minsup
and minconf
• First, find all frequent itemsets
Finding Frequent Itemsets
• Start by finding one-item sets (easy)
• Q: How?
• A: Simply count the frequencies of all items
Finding itemsets: next level
• Apriori algorithm (Agrawal & Srikant)
• Idea: use one-item sets to generate two-item sets,
two-item sets to generate three-item sets, …
– If (A, B) is a frequent itemset, then (A) and (B) have to be frequent itemsets as well!
– In general: if X is a frequent k-itemset, then all (k-1)-item subsets of X are also frequent
– Compute candidate k-itemsets by merging (k-1)-itemsets
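One way to realize the "merge (k-1)-itemsets" step is the classic join-and-prune; the sketch below is a generic version of that idea (the function and variable names are mine, not the paper's notation):

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Candidate k-itemsets from the frequent (k-1)-itemsets: join pairs whose
    union has size k, then prune any candidate with an infrequent (k-1)-subset,
    since all subsets of a frequent itemset must themselves be frequent."""
    prev = set(prev_frequent)                 # frozensets of size k-1
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(       # join step, then prune step
                    frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(apriori_gen(L2, 3))  # {A,B,C} survives; {B,C,D} is pruned ({C,D} not in L2)
```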
Finding Association Rules
• A typical question: “find all association rules
with support ≥ s and confidence ≥ c.”
– Note: “support” of an association rule is the support
of the set of items it mentions.
• Hard part: finding the high-support (frequent) itemsets.
– Checking the confidence of association rules
involving those sets is relatively easy.
Naïve Algorithm
• A simple way to find frequent pairs is:
– Read file once, counting in main memory the
occurrences of each pair.
• Expand each basket of n items into its n(n-1)/2 pairs.
• Fails if #items-squared exceeds main
memory.
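In Python the naive pass is only a few lines; the Counter below is exactly the in-memory table of pair counts that becomes infeasible when the number of items squared exceeds main memory (the baskets here are illustrative):

```python
from collections import Counter
from itertools import combinations

def count_pairs(baskets):
    """One pass over the file: expand each basket of n items into its
    n(n-1)/2 pairs and count every pair in main memory."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):  # canonical item order
            counts[pair] += 1
    return counts

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"}]
print(count_pairs(baskets))
```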
[Figure: the level-wise pipeline. Candidates C1 are filtered on the first pass into the frequent set L1, from which C2 is constructed; C2 is filtered on the second pass into L2, from which C3 is constructed; and so on.]
[Agrawal, Srikant 94] Fast Algorithms for Mining Association Rules, by Rakesh Agrawal and Ramakrishnan Srikant, IBM Almaden Research Center
Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

C^1 (candidate 1-itemsets present in each transaction):
TID | Set-of-itemsets
100 | { {1}, {3}, {4} }
200 | { {2}, {3}, {5} }
300 | { {1}, {2}, {3}, {5} }
400 | { {2}, {5} }

L1:
Itemset | Support
{1} | 2
{2} | 3
{3} | 3
{5} | 3

C2 (candidate 2-itemsets):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C^2 (candidate 2-itemsets present in each transaction):
TID | Set-of-itemsets
100 | { {1 3} }
200 | { {2 3}, {2 5}, {3 5} }
300 | { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }
400 | { {2 5} }

L2:
Itemset | Support
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

C3 (candidate 3-itemsets):
{2 3 5}

C^3 (candidate 3-itemsets present in each transaction):
TID | Set-of-itemsets
200 | { {2 3 5} }
300 | { {2 3 5} }

L3:
Itemset | Support
{2 3 5} | 2
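Putting the passes together, here is a compact end-to-end sketch that reproduces L1, L2, and L3 for this four-transaction database; the loop structure and names are mine, not the paper's pseudocode:

```python
from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 2

def frequent(candidates):
    """Filter candidates by their support count over the database D."""
    counts = {c: sum(c <= t for t in D) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_sup}

# First pass: the candidate 1-itemsets are all single items.
L = frequent({frozenset([i]) for t in D for i in t})
k = 2
while L:
    print(f"L{k - 1}:", L)
    prev = set(L)
    # Construct Ck by joining L(k-1) with itself, pruning infrequent subsets.
    Ck = {a | b for a in prev for b in prev
          if len(a | b) == k
          and all(frozenset(s) in prev for s in combinations(a | b, k - 1))}
    L = frequent(Ck)
    k += 1
```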
Dynamic Programming Approach
• We want a proof of the principle of optimality and of overlapping subproblems.
• Principle of optimality: the optimal solution to Lk includes the optimal solution of Lk-1. (Proof by contradiction.)
• Overlapping subproblems: lemma: every subset of a frequent itemset is a frequent itemset. (Proof by contradiction.)
The Apriori Algorithm:
Example
TID  | List of Items
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3
• Consider a database, D ,
consisting of 9 transactions.
• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).
• Let the minimum confidence required be 70%.
• We have to first find out the
frequent itemset using Apriori
algorithm.
• Then, Association rules will be
generated using min. support &
min. confidence.
Step 1: Generating 1-itemset
Frequent Pattern
Scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

C1:
Itemset | Sup. Count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2

L1:
Itemset | Sup. Count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2
• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
Step 2: Generating 2-itemset
Frequent Pattern
Generate C2 candidates from L1, scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

C2:
Itemset | Sup. Count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

L2:
Itemset | Sup. Count
{I1, I2} | 4
{I1, I3} | 4
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
Step 2: Generating 2-itemset Frequent
Pattern [Cont.]
• To discover the set of frequent 2-itemsets, L2 , the
algorithm uses L1 Join L1 to generate a candidate set
of 2-itemsets, C2.
• Next, the transactions in D are scanned and the
support count for each candidate itemset in C2 is
accumulated (as shown in the middle table).
• The set of frequent 2-itemsets, L2 , is then
determined, consisting of those candidate 2-itemsets
in C2 having minimum support.
• Note: We haven’t used Apriori Property yet.
Step 3: Generating 3-itemset
Frequent Pattern
Generate C3 candidates from L2, scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

C3:
Itemset | Sup. Count
{I1, I2, I3} | 2
{I1, I2, I5} | 2

L3:
Itemset | Sup. Count
{I1, I2, I3} | 2
{I1, I2, I5} | 2
• The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori Property.
• In order to find C3, we compute L2 Join L2.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3,
I4}, {I2, I3, I5}, {I2, I4, I5}}.
• Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The Prune step helps avoid heavy computation due to a large Ck.
Step 3: Generating 3-itemset Frequent
Pattern
[Cont.]
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.
BUT {I3, I5} is not a member of L2, and hence it is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the join result for pruning. A sketch of this pruning test follows below.
Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
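The pruning test just described only needs the 2-item subsets of each joined candidate. Here is a minimal sketch that recomputes C3 from this example's L2; the function and variable names are mine:

```python
from itertools import combinations

def has_infrequent_subset(candidate, prev_L):
    """True if any (k-1)-subset of the candidate k-itemset is not frequent."""
    return any(frozenset(s) not in prev_L
               for s in combinations(candidate, len(candidate) - 1))

L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
joined = [frozenset(c) for c in [("I1", "I2", "I3"), ("I1", "I2", "I5"),
                                 ("I1", "I3", "I5"), ("I2", "I3", "I4"),
                                 ("I2", "I3", "I5"), ("I2", "I4", "I5")]]
C3 = [c for c in joined if not has_infrequent_subset(c, L2)]
print(C3)  # keeps only {I1, I2, I3} and {I1, I2, I5}, as derived above
```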
Step 4: Generating 4-itemset Frequent
Pattern
• The algorithm uses L3 Join L3 to generate a candidate
set of 4-itemsets, C4. Although the join results in {{I1,
I2, I3, I5}}, this itemset is pruned since its subset {{I2,
I3, I5}} is not frequent.
• Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.
• What’s Next ?
These frequent itemsets will be used to generate
strong association rules ( where strong association
rules satisfy both minimum support & minimum
confidence).
Step 5: Generating Association Rules
from Frequent Itemsets
• Procedure:
• For each frequent itemset l, generate all nonempty proper subsets of l.
• For every nonempty proper subset s of l, output the rule "s => (l - s)" if
  support_count(l) / support_count(s) >= min_conf,
  where min_conf is the minimum confidence threshold.
• Back To Example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5},
{I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let's take l = {I1, I2, I5}.
– All its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Step 5: Generating Association Rules
from Frequent Itemsets [Cont.]
• Let the minimum confidence threshold be, say, 70%.
• The resulting association rules are shown below, each
listed with its confidence.
– R1: I1 ^ I2  I5
• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
• R1 is Rejected.
– R2: I1 ^ I5  I2
• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
• R2 is Selected.
– R3: I2 ^ I5  I1
• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
• R3 is Selected.
Step 5: Generating Association
Rules from Frequent Itemsets
[Cont.]
– R4: I1  I2 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
• R4 is Rejected.
– R5: I2  I1 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%
• R5 is Rejected.
– R6: I5  I1 ^ I2
• Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%
• R6 is Selected.
In this way, we have found three strong association rules. A sketch of this rule-generation procedure follows below.
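The whole Step 5 procedure fits in a few lines of Python. This sketch recomputes R1 through R6 for l = {I1, I2, I5} directly from the nine transactions; the variable names are mine:

```python
from itertools import combinations

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
min_conf = 0.70

def sc(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(itemset <= t for t in D)

l = frozenset(["I1", "I2", "I5"])
for r in range(1, len(l)):                    # all nonempty proper subsets s
    for s in map(frozenset, combinations(sorted(l), r)):
        conf = sc(l) / sc(s)                  # conf(s => l - s)
        verdict = "Selected" if conf >= min_conf else "Rejected"
        print(f"{sorted(s)} => {sorted(l - s)}: {conf:.0%} ({verdict})")
```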
Example
• Large itemset: ABCDE; find rules with minimum support and confidence.
[Figure: rule-generation trees for ABCDE. The simple algorithm examines candidate rules such as ACDE => B, CDE => AB, ADE => BC, ABCE => D, BCE => AD, ACD => BE, and ACE => BD; the fast algorithm prunes this tree and examines far fewer candidates, e.g. ACDE => B, ABCE => D, ABE => CD, and ACE => BD.]