
AMCS/CS 340: Data Mining
Association Rules
Xiangliang Zhang
King Abdullah University of Science and Technology
Outline: Mining Association Rules
• Motivation and Definition
• High computational complexity
• Frequent itemsets mining
  - Apriori algorithm – reduce the number of candidates
  - Frequent-Pattern tree (FP-tree)
• Rule Generation
• Rule Evaluation
• Mining rules with multiple minimum supports
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
2
The Market-Basket Model
• A large set of items
e.g., things sold in a supermarket
• A large set of baskets, each of which is a small
set of the items
e.g., each basket is the set of things one customer buys at once
• Goal: Find “interesting” connections between items
• Can be used to model any
many‐many relationship,
not just in the retail setting
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
3
The Frequent Itemsets
• Simplest question: Find sets of items that
appear together “frequently” in baskets
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
4
Definition: Frequent Itemset
• Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
• Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
• Support
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
An itemset whose support is greater than or
equal to a minsup threshold
5
Example:
Set minsup = 0.5
The frequent 2-itemsets:
{Bread, Milk},
{Bread, Diaper},
{Milk, Diaper},
{Diaper, Beer}
6
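To make the definitions concrete, the support counts above can be reproduced with a few lines of Python. This is a minimal sketch over the toy five-transaction dataset; the names `transactions`, `support`, and `min_sup` are ours, not from any particular library.

```python
from itertools import combinations

# Toy market-basket data from the slides (TIDs 1-5)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate all 2-itemsets over the distinct items and keep the frequent ones
items = sorted(set().union(*transactions))
min_sup = 0.5
frequent_2 = [set(pair) for pair in combinations(items, 2)
              if support(pair, transactions) >= min_sup]
print(frequent_2)   # the four frequent 2-itemsets listed above
```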
Definition: Association Rule
Association Rule
– If-then rules: an implication
expression of the form X → Y,
where X and Y are itemsets
– Example:
{Milk, Diaper} → {Beer}
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Rule Evaluation Metrics
– Support (s)
  Fraction of transactions that contain both X and Y
– Confidence (c)
  Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
7
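The support and confidence of the example rule can be checked directly; a small sketch that reuses `transactions` and `support` from the snippet above (the helper name `confidence` is ours):

```python
def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X ∪ Y) / sigma(X)."""
    both = sum((set(X) | set(Y)) <= t for t in transactions)
    ante = sum(set(X) <= t for t in transactions)
    return both / ante

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y, transactions)      # 2/5 = 0.4
c = confidence(X, Y, transactions)    # 2/3 ≈ 0.67
print(s, round(c, 2))
```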
Application of Association Rule
1. Items = products;
   Baskets = sets of products a customer bought;
   many people buy beer and diapers together
   ⇒ run a sale on diapers; raise the price of beer
2. Items = documents;
   Baskets = documents containing a similar sentence;
   ⇒ items that appear together too often could represent plagiarism
3. Items = words;
   Baskets = Web pages;
   ⇒ co-occurrence of relatively rare words may indicate an interesting relationship
8
Outline: Mining Association Rules
• Motivation and Definition
• High computational complexity
• Frequent itemsets mining
  - Apriori algorithm – reduce the number of candidates
  - Frequent-Pattern tree (FP-tree)
• Rule Generation
• Rule Evaluation
• Mining rules with multiple minimum supports
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
9
Association Rule Mining Task
• Given a set of transactions T, the goal of association
rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
• Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
10
Mining Association Rules
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Example of Rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support
but can have different confidence
• Thus, we may decouple the support and confidence requirements
11
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
Frequent itemset generation is still computationally
expensive
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
12
Computational Complexity
Given d unique items in all transactions:
• Total number of itemsets = 2^d − 1
• Total number of possible association rules:

  R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1
If d=6, R = 602 rules
d (#items) can be 100K (Wal‐Mart)
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
13
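A quick sanity check of this count in Python (a sketch; `math.comb` requires Python 3.8+, and the function name `num_rules` is ours):

```python
from math import comb

def num_rules(d):
    """Number of association rules over d items: double sum vs. closed form."""
    brute = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                for k in range(1, d))
    closed = 3**d - 2**(d + 1) + 1
    assert brute == closed
    return closed

print(num_rules(6))    # 602, as on the slide
print(num_rules(10))   # 57002 -- already explosive for small d
```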
Outline: Mining Association Rules
• Motivation and Definition
• High computational complexity
• Frequent itemsets mining
  - Apriori algorithm – reduce the number of candidates
  - Frequent-Pattern tree (FP-tree)
• Rule Generation
• Rule Evaluation
• Mining rules with multiple minimum supports
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
14
Reduce the number of candidates
Complete search over all candidate itemsets examines 2^d − 1 itemsets.
Apriori principle: for any itemsets X ⊆ Y,
- If an itemset is frequent, then all of its subsets must also be frequent:
  if Y is frequent, every X ⊆ Y is frequent
- If an itemset is infrequent, then all of its supersets must also be infrequent:
  if X is infrequent, every Y ⊇ X is infrequent
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
15
Illustrating Apriori Principle
Database D
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Minimum Support = 3/5 (support count ≥ 3)

Items (1-itemsets), Scan D:
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Eliminate infrequent items (Coke, Eggs):
Item     Count
Bread    4
Milk     4
Beer     3
Diaper   4

Generate Pairs (2-itemsets):
{Bread,Milk}, {Bread,Beer}, {Bread,Diaper}, {Milk,Beer}, {Milk,Diaper}, {Beer,Diaper}

Scan D:
Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

Eliminate infrequent pairs ({Bread,Beer}, {Milk,Beer}):
Itemset          Count
{Bread,Milk}     3
{Bread,Diaper}   3
{Milk,Diaper}    3
{Beer,Diaper}    3

Generate Triplets (3-itemsets):
{Bread,Milk,Diaper}, {Bread,Diaper,Beer}, {Milk,Diaper,Beer}

Prune candidates containing an infrequent 2-subset ({Bread,Beer} or {Milk,Beer}):
{Bread,Milk,Diaper}

Scan D:
Itemset               Count
{Bread,Milk,Diaper}   2

Not a frequent 3-itemset
Apriori Algorithm
• Let k=1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified
1. Generate length (k+1) candidate itemsets from length k
frequent itemsets
2. Prune candidate itemsets containing subsets of length
k that are infrequent
3. Count the support of each candidate by scanning the
DB
4. Eliminate candidates that are infrequent, leaving only
those that are frequent
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
17
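The loop above can be sketched compactly in Python. This is an illustrative, unoptimized version (the function name `apriori` and parameter `min_count` are ours); candidates are generated with the standard F(k-1) × F(k-1) join on sorted tuples rather than the join drawn in the worked example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset itemset: support count} for all frequent itemsets."""
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result, k = dict(frequent), 1
    while frequent:
        # 1. Generate (k+1)-candidates by joining frequent k-itemsets
        prev = sorted(tuple(sorted(s)) for s in frequent)
        candidates = {frozenset(prev[i]) | frozenset(prev[j])
                      for i in range(len(prev)) for j in range(i + 1, len(prev))
                      if prev[i][:k - 1] == prev[j][:k - 1]}   # shared (k-1)-prefix
        # 2. Prune candidates that have an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k))}
        # 3. Count the support of surviving candidates in one scan of the DB
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # 4. Keep only the frequent ones and continue
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# On the toy data with min_count = 3 this reproduces the worked example,
# including the elimination of {Bread, Milk, Diaper} (count 2 < 3).
```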
Factors Affecting Complexity
Choice of minimum support threshold
• lowering support threshold results in more frequent itemsets
• this may increase number of candidates and max length of frequent
itemsets
Dimensionality (number of items) of the data set
• more space is needed to store support count of each item
• if number of frequent items also increases, both computation and I/O
costs may also increase
Size of database
• since Apriori makes multiple passes, run time of algorithm may
increase with number of transactions
Average transaction width
• transaction width increases with denser data sets
• This may increase max length of frequent itemsets
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
18
Outline: Mining Association Rules
• Motivation and Definition
• High computational complexity
• Frequent itemsets mining
  - Apriori algorithm – reduce the number of candidates
  - Frequent-Pattern tree (FP-tree)
• Rule Generation
• Rule Evaluation
• Mining rules with multiple minimum supports
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
19
Mining Frequent Patterns Without
Candidate Generation
• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
- highly condensed, but complete for frequent pattern
mining
- avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining
method
- A divide-and-conquer methodology: decompose mining
tasks into smaller ones
- Avoid candidate generation: sub-database test only!
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
20
FP-tree construction
Header Table
Item   frequency
a      8
b      7
c      6
d      5
e      3
Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Scan DB again, construct FP-tree
(One transaction → one path in the FP-tree)
21
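A minimal FP-tree construction sketch in Python, mirroring the two scans described above. The class and function names (`FPNode`, `build_fptree`) are ours; the header table is kept as a dict from item to its list of tree nodes, which plays the role of the node-link pointers.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> FPNode

def build_fptree(transactions, min_count):
    # Scan 1: count single items, keep frequent ones, order by descending support
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    ranked = sorted((i for i, c in freq.items() if c >= min_count),
                    key=lambda i: -freq[i])
    order = {item: rank for rank, item in enumerate(ranked)}

    # Scan 2: insert each transaction (frequent items only, in global order) as a path
    root = FPNode(None, None)
    header = defaultdict(list)          # item -> nodes holding that item
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1             # shared prefixes accumulate counts
    return root, header
```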
FP-tree construction: pointers
Header table: items a, b, c, d, e, each with a pointer to its nodes in the FP-tree
Pointers are used to assist
frequent itemset generation
22
Benefits of the FP-tree Structure
Completeness:
• never breaks a long pattern of any transaction
• preserves complete information for frequent pattern mining
Compactness
• One transaction → one path in the FP-tree
• Paths may overlap:
  the more the paths overlap, the more compression is achieved
• Never larger than the original database (not counting node-links and count fields)
• Best case: the whole tree is a single path of nodes
• The size of an FP-tree depends on how the items are ordered
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
23
Different FP-tree by ordering items
Header Table (items ordered by descending frequency)
Item   frequency
a      8
b      7
c      6
d      5
e      3

Header Table (items ordered by ascending frequency)
Item   frequency
e      3
d      5
c      6
b      7
a      8
24
Generating frequent itemset
FP-Growth Algorithm:
Bottom-up fashion to derive the frequent
itemsets ending with a particular item
Paths can be accessed rapidly using
the pointers associated with e
25
Generating frequent itemset
FP-Growth Algorithm:
Decompose the frequent itemset generation
problem into multiple sub-problems
Example: find frequent itemsets including e
How to solve the sub-problem?
Start from paths containing node e
Construct conditional tree for e
(if e is frequent)
Prefix Paths ending in {de}
conditional tree for {de}
Prefix Paths ending in {ce}
conditional tree for {ce}
Prefix Paths ending in {ae}
conditional tree for {ae}
27
Example: find frequent itemsets including e
How to solve the sub-problem?
1. Check e is frequent or not
2. Convert the prefix paths into
conditional FP-tree
1) Update support counts
2) Remove node e
3) Ignore infrequent node b
Count(e) =3 > Minsup=2
Frequent Items:
{e}
28
Example: find frequent itemsets including e
How to solve the sub-problem?
(prefix paths and conditional tree with {de})
1. Check d is frequent or not in the
prefix paths ending in {de}
Count(d,e) =2
(Minsup=2)
Frequent Items:
{e} {d,e}
29
Example: find frequent itemsets including e
How to solve the sub-problem?
(prefix paths and conditional tree with {de})
2. Convert the prefix paths (ending in {de})
into conditional FP-tree for {de}
1) Update support counts
2) Remove node d
3) Ignore infrequent node c
Frequent Items:
{e} {d,e}
30
Example: find frequent itemsets including e
How to solve the sub-problem?
(prefix paths and conditional tree with {de})
1. Check a is frequent or not in the
prefix paths ending in {ade}
Count(a) =2
(Minsup=2)
Frequent Items:
{e} {d,e} {a,d,e}
31
Example: find frequent itemsets including e
How to solve the sub-problem?
(prefix paths and conditional tree with {ce})
1. Check c is frequent or not in the
prefix paths ending in {e}
Count(c,e) =2
(Minsup=2)
Frequent Items:
{e} {d,e} {a,d,e} {c,e}
32
Example: find frequent itemsets including e
How to solve the sub-problem?
(prefix paths and conditional tree with {ce})
2. Convert the prefix paths (ending in {ce})
into conditional FP-tree for {ce}
1) Update support counts
2) Remove node c
Frequent Items:
{e} {d,e} {a,d,e} {c,e}
33
Example: find frequent itemsets including e
How to solve the sub-problem?
(prefix paths and conditional tree with {ce})
1. Check a is frequent or not in the
prefix paths ending in {ace}
Count(a) =1
(Minsup=2)
Frequent Items:
{e} {d,e} {a,d,e} {c,e}
34
Example: find frequent itemsets including e
How to solve the sub-problem?
(prefix paths and conditional tree with {ae})
1. Check a is frequent or not in the
prefix paths ending in {ae}
Count(a,e) =2
(Minsup=2)
Frequent Items:
{e} {d,e} {a,d,e} {c,e}
{a,e}
35
Mining Frequent Itemset using FP-tree
General idea (divide-and-conquer)
• Recursively grow frequent pattern path using the FP-tree
• At each step, a conditional FP-tree is constructed by updating the
frequency counts along the prefix paths and removing all infrequent
items
Properties
• Sub-problems are disjoint ⇒ no duplicate itemsets
• FP-growth is an order of magnitude faster than the Apriori algorithm
  (depending on the compaction factor of the data)
- No candidate generation, no candidate test
- Use compact data structure
- Eliminate repeated database scan
- Basic operation is counting and FP-tree building
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
36
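The divide-and-conquer recursion can be sketched on top of `build_fptree` from the earlier snippet. For each item in the header table it collects the prefix paths (the conditional pattern base), treats them as a small database, and recurses; this is only an illustrative version (the name `fp_growth` is ours), not an optimized implementation.

```python
def fp_growth(transactions, min_count, suffix=()):
    """Recursively mine frequent itemsets; returns {tuple itemset: support count}."""
    root, header = build_fptree(transactions, min_count)
    result = {}
    for item, nodes in header.items():
        count = sum(n.count for n in nodes)          # support of suffix + item
        itemset = tuple(sorted(suffix + (item,)))
        result[itemset] = count
        # Conditional pattern base: prefix paths of every node holding `item`
        cond_db = []
        for n in nodes:
            path, p = [], n.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * n.count)         # replicate each path by its count
        # Recurse on the conditional database (builds the conditional FP-tree)
        result.update(fp_growth(cond_db, min_count, suffix + (item,)))
    return result

# On the toy transactions with min_count = 3 this returns the same frequent
# itemsets (with the same counts) as the Apriori sketch, without candidate generation.
```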
Compacting the Output of Frequent Itemsets --Maximal vs Closed Itemsets
• Maximal Frequent itemsets: no immediate superset is
frequent
• Closed itemsets: no immediate superset has the same
count (> 0).
Closed frequent itemsets store not only which itemsets are frequent, but their exact counts.
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
51
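Given a dictionary mapping each frequent itemset (as a frozenset) to its support count, e.g. the output of the Apriori sketch above, maximal and closed frequent itemsets can be identified with a brute-force check (quadratic in the number of frequent itemsets; the function name is ours):

```python
def maximal_and_closed(freq_counts):
    """freq_counts maps every frequent itemset (frozenset) to its support count."""
    maximal, closed = [], []
    for s, c in freq_counts.items():
        super_counts = [ct for t, ct in freq_counts.items() if s < t]  # proper supersets
        if not super_counts:                      # no frequent superset -> maximal
            maximal.append(s)
        if all(ct < c for ct in super_counts):    # no superset with equal count -> closed
            closed.append(s)
    return maximal, closed
```

Every maximal itemset comes out closed as well, which matches the containment shown above.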
Example: Maximal vs Closed Itemsets
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
52
Outline: Mining Association Rules
• Motivation and Definition
• High computational complexity
• Frequent itemsets mining
  - Apriori algorithm – reduce the number of candidates
  - Frequent-Pattern tree (FP-tree)
• Rule Generation
• Rule Evaluation
• Mining rules with multiple minimum supports
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
53
Rule Generation
• Given a frequent itemset L, find all non-empty
subsets f ⊂ L such that the rule f → L – f satisfies the
minimum confidence requirement
If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D,   ABD → C,   ACD → B,   BCD → A,
AB → CD,   AC → BD,   AD → BC,   BC → AD,   BD → AC,   CD → AB,
A → BCD,   B → ACD,   C → ABD,   D → ABC
• If |L| = k, then there are 2^k – 2 candidate
association rules (ignoring L → ∅ and ∅ → L)
54
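Enumerating these 2^k − 2 candidate rules is a short loop over non-empty proper subsets (a sketch; `candidate_rules` is our own helper, not a library function):

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield every rule f -> (L - f) for non-empty proper subsets f of L."""
    L = frozenset(itemset)
    for r in range(1, len(L)):
        for f in combinations(sorted(L), r):
            yield frozenset(f), L - frozenset(f)

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))    # 2**4 - 2 = 14 candidate rules, as listed above
```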
Rule Generation
How to efficiently generate rules from frequent itemsets?
⇒ In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)

    c(ABC → D) = s(ABCD) / s(ABC)        c(AB → D) = s(ABD) / s(AB)

  We only know s(ABC) ≤ s(AB) and s(ABCD) ≤ s(ABD) (support is anti-monotone),
  so the ratio can move either way.
⇒ But confidence of rules generated from the same itemset does have an
  anti-monotone property
  e.g., for L = {A,B,C,D}:  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Computing the confidence of a rule does not require additional scans of
data
55
Rule Generation for Apriori Algorithm
Apriori algorithm:
• level-wise approach for generating association rules
• each level: same number of items in rule consequent
• rules with t consequent items are used to generate rules with
t+1 consequent items
Lattice of candidate rules for the frequent itemset {A,B,C,D}
(the consequent grows by one item per level):
  ABCD => { }
  BCD => A,  ACD => B,  ABD => C,  ABC => D
  CD => AB,  BD => AC,  BC => AD,  AD => BC,  AC => BD,  AB => CD
  D => ABC,  C => ABD,  B => ACD,  A => BCD
56
Rule Generation for Apriori Algorithm
• Candidate rule is generated by merging two
rules that share the same prefix
in the rule consequent
• Join(CD => AB, BD => AC) would produce the candidate rule D => ABC
• Prune rule D => ABC if its subset AD => BC does not have high confidence
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
57
Rule Generation for Apriori Algorithm
Low-confidence rule: BCD => A
Pruned rules (all rules containing item A in the consequent):
  CD => AB,  BD => AC,  BC => AD,  D => ABC,  C => ABD,  B => ACD,  A => BCD
58
Outline: Mining Association Rules
• Motivation and Definition
• High computational complexity
• Frequent itemsets mining
  - Apriori algorithm – reduce the number of candidates
  - Frequent-Pattern tree (FP-tree)
• Rule Generation
• Rule Evaluation
• Mining rules with multiple minimum supports
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
59
Interesting rules ?
• Rules with high support and confidence may be useful, but
not “interesting”
The "then" part is a frequent action in "if-then" rules
Example:
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
Prob(Diaper) = 4/5 = 0.8
• high interest suggests a cause
that might be worth investigating
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
60
Computing Interestingness Measure
Given a rule X  Y, information needed to compute rule
interestingness can be obtained from a contingency table
Contingency table for X  Y
         Y      ¬Y
X       f11    f10    f1+
¬X      f01    f00    f0+
        f+1    f+0    |T|

f11: the # of transactions containing both X and Y
f10: the # of transactions containing X but not Y
f01: the # of transactions containing Y but not X
f00: the # of transactions containing neither X nor Y
Used to define various measures
⇒ support (f11 / |T|), confidence (f11 / f1+), lift, Gini, J-measure, etc.
62
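From the four cells of the contingency table, the basic measures follow directly. A sketch (our own helper name), using the Tea/Coffee counts from the next slide as the usage example:

```python
def measures(f11, f10, f01, f00):
    """Support, confidence and lift of X -> Y from contingency-table counts."""
    n = f11 + f10 + f01 + f00              # |T|
    support = f11 / n
    confidence = f11 / (f11 + f10)         # P(Y | X)
    lift = confidence / ((f11 + f01) / n)  # P(Y | X) / P(Y)
    return support, confidence, lift

# Tea -> Coffee: f11=15, f10=5, f01=75, f00=5
print(measures(15, 5, 75, 5))   # (0.15, 0.75, 0.833...), i.e. lift < 1
```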
Drawback of Confidence
         Coffee   ¬Coffee
Tea        15        5       20
¬Tea       75        5       80
           90       10      100
Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75
but P(Coffee) = 90/100 = 0.9
⇒ Although confidence is high, the rule is misleading:
⇒ P(Coffee | ¬Tea) = 75/80 = 0.9375
63
Example: Lift/Interest
         Coffee   ¬Coffee
Tea        15        5       20
¬Tea       75        5       80
           90       10      100
Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75
but P(Coffee) = 90/100 = 0.9
⇒ Lift = P(Coffee | Tea) / P(Coffee) = 0.75/0.9 = 0.8333
   (< 1, therefore Tea and Coffee are negatively associated)
66
There are lots of measures proposed in the literature
(see Tan, Steinbach, Kumar, Introduction to Data Mining)
Outline: Mining Association Rules
• Motivation and Definition
• High computational complexity
• Frequent itemsets mining
  - Apriori algorithm – reduce the number of candidates
  - Frequent-Pattern tree (FP-tree)
• Rule Generation
• Rule Evaluation
• Mining rules with multiple minimum supports
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
71
Threshold for rules ?
• How to set the appropriate minsup threshold?
⇒ If minsup is set too high, we could miss itemsets
  involving interesting rare items (e.g., expensive products, jewelry)
⇒ If minsup is set too low,
  - it is computationally expensive
  - the number of itemsets is very large
  - we extract spurious patterns (cross-support patterns having
    weak correlations, e.g. milk (s=0.7) and caviar (s=0.0004))
• Using a single minimum support threshold may not
be effective
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
72
Problems with the single minsup
• Single minsup: It assumes that all items in the
data are of the same nature and/or have similar
frequencies.
• Not true: In many applications, some items
appear very frequently in the data, while others
rarely appear.
E.g., in a supermarket, people buy food processor and
cooking pan much less frequently than they buy bread and
milk.
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
73
Multiple minsups model
• Each item i can have a minimum item support MIS(i)
• The minimum support of a rule R is expressed as
the lowest MIS value of the items that appear in the
rule.
i.e., a rule R: a1, a2, …, ak → ak+1, …, ar satisfies its minimum support if
its actual support is ≥
min(MIS(a1), MIS(a2), …, MIS(ar)).
• By providing different MIS values for different items,
the user effectively expresses different support
requirements for different rules.
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
74
Multiple Minimum Support
How to apply multiple minimum supports?
MIS(i): minimum support for item i
e.g.: MIS(Milk)=5%,
MIS(Coke) = 3%,
MIS(Broccoli)=0.1%,
MIS(Salmon)=0.5%
MIS({Milk, Broccoli}) = min (MIS(Milk), MIS(Broccoli))
= 0.1%
Challenge: Support is no longer anti-monotone
Suppose:
Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%
{Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is
frequent
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
75
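A minimal sketch of the multiple-minsup check, using the MIS values above (the helper name and dictionary layout are ours):

```python
MIS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

def satisfies_minsup(itemset, actual_support, mis=MIS):
    """Frequent under multiple minsups: support >= the lowest MIS among its items."""
    return actual_support >= min(mis[i] for i in itemset)

print(satisfies_minsup({"Milk", "Coke"}, 0.015))              # False: 0.015 < 0.03
print(satisfies_minsup({"Milk", "Coke", "Broccoli"}, 0.005))  # True:  0.005 >= 0.001
# The superset is frequent while its subset is not -- support is no longer anti-monotone.
```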
Summary
• Association rule mining has been extensively
studied in the data mining community.
• There are many efficient algorithms and model
variations
• Other related work includes
  - Multi-level or generalized rule mining
  - Constrained rule mining
  - Incremental rule mining
  - Maximal frequent itemset mining
  - Numeric association rule mining
  - Rule interestingness and visualization
  - Parallel algorithms
  - …
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
79
Related Resources
• Tools, Software
  - A set of software for frequent pattern mining: Apriori, Eclat,
    FPgrowth, RElim, SaM, etc. http://www.borgelt.net/fpm.html
  - Frequent Itemset Mining Implementations Repository
    http://fimi.cs.helsinki.fi/src/
  - arules: Mining association rules and frequent itemsets
    http://cran.r-project.org/web/packages/arules/index.html
• Annotated Bibliography on Association Rule Mining
http://michael.hahsler.net/research/bib/association_rules/
• Apriori Demo in Silverlight, http://codeding.com/?article=13
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
80
References: Frequent-pattern Mining
• R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
• C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
• R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
• R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
• H. Toivonen. Sampling large databases for association rules. VLDB'96.
• M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. KDD'97.
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
81
References: Performance Improvements
• S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 1997.
• D. W. Cheung, J. Han, V. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96.
• T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, Montreal, Canada.
• E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, Tucson, Arizona.
• G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.
• K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
• M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
What you should know
• What is the motivation of association rule mining?
• What are the basic steps for mining association
rules?
• How does Apriori algorithm work?
• What is the issue of Apriori algorithm? How to solve
it?
• How does Frequent-Pattern tree work?
• How to generate rules from frequent itemsets?
• How to mine rules with multiple minimum supports?
• How to evaluate the Rules?
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
83
Demo
1. Apriori on “census” data
http://www.borgelt.net/apriori.html
2. Apriori Demo in Silverlight,
http://codeding.com/?article=13
Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
84