
Association Mining (Chapter 6)

What Is Association Mining?
• Association rule mining:
  – Association rule mining searches for interesting relationships among items in a given data set.
  – Example: 98% of people who purchase tires and auto accessories also get automotive services done.
• Applications:
  – Basket data analysis, sales, catalog design, store layout, customer segmentation based on buying patterns.
Association Rule: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  – Home Electronics ⇒ * (What other products should the store stock up on?)
  – Attached mailing in direct marketing
Format of Rules
• Rule form: “Body ⇒ Head [support, confidence]”
• Examples:
  – buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
  – major(x, “CS”) ^ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]
Basic Concepts
• Problem: Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively.
• Find all rules X & Y ⇒ Z with minimum confidence and support
  – support, s: probability that a transaction contains {X ∪ Y ∪ Z} = % of transactions that contain {X ∪ Y ∪ Z}
  – confidence, c: conditional probability that a transaction containing {X ∪ Y} also contains Z
    = support({X ∪ Y ∪ Z}) / support({X ∪ Y})
Rule Measures: Support and Confidence
(Venn diagram omitted: customers who buy beer, customers who buy diapers, and customers who buy both)

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support = 50% and minimum confidence = 50%. Then we have:
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
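As a quick illustration (not part of the original slides; the function names are ours), both measures can be computed directly from the toy database above:

    # Toy transaction database from the slide (TID -> items bought)
    transactions = {
        2000: {"A", "B", "C"},
        1000: {"A", "C"},
        4000: {"A", "D"},
        5000: {"B", "E", "F"},
    }

    def support(itemset):
        """Fraction of transactions containing every item in `itemset`."""
        itemset = set(itemset)
        hits = sum(1 for items in transactions.values() if itemset <= items)
        return hits / len(transactions)

    def confidence(lhs, rhs):
        """support(lhs ∪ rhs) / support(lhs)."""
        return support(set(lhs) | set(rhs)) / support(lhs)

    print(support({"A", "C"}))        # 0.5    -> A => C has 50% support
    print(confidence({"A"}, {"C"}))   # 0.666  -> A => C has 66.6% confidence
    print(confidence({"C"}, {"A"}))   # 1.0    -> C => A has 100% confidence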
Confidence of a rule
• The confidence of a rule provides an accurate prediction on the association of the items in the rule.
• The support of a rule indicates how frequent the rule is in the transactions.
• Rules that have small support are uninteresting, since they do not describe large populations.
Association Rule Mining
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
  – age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
Mining Association Rules — An Example

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, Min. confidence 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: Any subset of a frequent itemset must be frequent.
Mining Frequent Itemsets: the Key Step
• A set of items is referred to as an itemset.
• An itemset that contains k items is a k-itemset.
  – Example: The set {Computer, financial software} is a 2-itemset.
• Find the frequent itemsets: the sets of items that have minimum support.
  – A subset of a frequent itemset must also be a frequent itemset,
    i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
• Use the frequent itemsets to generate association rules.
Association rule mining
• It is a two-step process:
• Step 1: Find all frequent itemsets.
  – Each of these itemsets will occur at least as frequently as a predetermined minimum support count.
• Step 2: Generate association rules from the frequent itemsets.
• The second step is easier than the first.
The algorithm
• The Apriori algorithm uses prior knowledge of frequent itemset properties.
• Apriori employs a level-wise approach:
  – k-itemsets are used to explore (k+1)-itemsets.
  – First, the set of frequent 1-itemsets (L1) is found.
  – L1 is used to find L2.
  – L2 is used to find L3, and so on.
• Finding each Lk requires one full scan of the database.
Apriori property
• All nonempty subsets of a frequent itemset must also be frequent.
  – If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, i.e., support(I) < min_sup.
  – If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur more frequently than I.
• Monotonic functions are functions that move in only one direction.
• This property is called anti-monotone.
• If a set cannot pass a test, all its supersets will fail the same test as well.
• The property is monotone in the sense of failing the test.
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning (the Apriori property is used)
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
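A Python rendering of the same join and prune steps (an illustrative sketch, not part of the original slides), assuming the frequent (k-1)-itemsets are given as a set of frozensets over comparable items:

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Join + prune: from the frequent (k-1)-itemsets `L_prev`,
        build the candidate k-itemsets Ck."""
        # Step 1: self-join -- sort each itemset so the condition
        # "first k-2 items equal, last item of p smaller than last item of q"
        # can be applied directly.
        prev = sorted(tuple(sorted(s)) for s in L_prev)
        Ck = set()
        for p in prev:
            for q in prev:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    Ck.add(frozenset(p + (q[-1],)))
        # Step 2: prune candidates that have an infrequent (k-1)-subset
        return {c for c in Ck
                if all(frozenset(s) in L_prev
                       for s in combinations(sorted(c), k - 1))}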
The Apriori Algorithm — Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to count the 1-itemsets (C1):
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1 (frequent 1-itemsets):
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2 (frequent 2-itemsets):
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2): {2 3 5}

Scan D to count C3, giving L3:
itemset   sup
{2 3 5}   2
Example of Generating Candidates
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– acde is removed because ade is not in L3
• C4={abcd}
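For instance, feeding this slide's L3 into the apriori_gen sketch given earlier reproduces C4 (single-character strings stand in for the items a–e):

    L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
    print(apriori_gen(L3, 4))   # {frozenset({'a', 'b', 'c', 'd'})} -- abcd survives, acde is pruned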
Generating association rules from frequent itemsets
• confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
• Procedure
  – For each frequent itemset l, generate all non-empty subsets of l.
  – For every non-empty subset s of l, output the rule s ⇒ (l - s) if its confidence is at least minconf.
• Example: see the sketch below.
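A minimal sketch of this procedure (not from the slides), assuming `frequent` maps each frequent itemset to its support count, which is exactly what the Apriori sketch earlier returns:

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        """`frequent` maps each frequent itemset (frozenset) to its support count.
        Yields rules (s, l - s, confidence) with confidence >= min_conf."""
        for l, count_l in frequent.items():
            if len(l) < 2:
                continue
            for r in range(1, len(l)):                 # all non-empty proper subsets s of l
                for s in map(frozenset, combinations(l, r)):
                    conf = count_l / frequent[s]       # support_count(l) / support_count(s)
                    if conf >= min_conf:
                        yield set(s), set(l - s), conf

    # Frequent itemsets (support counts out of 4) from the earlier example:
    freq = {frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
            frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3,
            frozenset({3, 5}): 2, frozenset({2, 3, 5}): 2}
    for lhs, rhs, conf in generate_rules(freq, min_conf=0.7):
        print(lhs, "=>", rhs, round(conf, 2))   # e.g. {2, 3} => {5} with confidence 1.0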
Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
• Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
• Sampling: mining on a subset of the given data, using a lower support threshold plus a method to determine the completeness.
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
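As an illustration of one of these ideas, here is a short sketch of transaction reduction (illustrative names, not from the slides): once the frequent k-itemsets Lk are known, transactions containing none of them cannot contribute to any longer frequent itemset and can be dropped from later scans.

    def reduce_transactions(transactions, Lk):
        # Transaction reduction: keep only transactions that contain at least
        # one frequent k-itemset; the rest are useless in subsequent scans.
        return [t for t in transactions
                if any(itemset <= t for itemset in Lk)]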
Is Apriori Fast Enough? — Performance Bottlenecks
• The core of the Apriori algorithm:
  – Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern
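A quick arithmetic check of the numbers above (illustrative):

    from math import comb

    # Candidate 2-itemsets from 10^4 frequent 1-itemsets: C(10^4, 2)
    print(comb(10_000, 2))   # 49995000, on the order of 10^7
    # Candidate subsets implied by a single frequent 100-itemset
    print(2**100)            # 1267650600228229401496703205376, about 1.27 x 10^30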
Presentation of Association Rules (Table Form)
Visualization of Association Rule Using Plane Graph
Visualization of Association Rule Using Rule Graph