Apriori Algorithm - The Institute of Finance Management

Association Rule Mining:
Apriori Algorithm
CIT366: Data Mining & Data Warehousing
Bajuna Salehe
The Institute of Finance Management:
Computing and IT Dept.
Brief About Association Rule Mining
• The results of Market Basket Analysis allowed
companies to more fully understand purchasing
behaviour and, as a result, to better target their
market audiences.
• Association mining is user-centric as the
objective is the elicitation of useful (or
interesting) rules from which new knowledge
can be derived.
Brief About Association Rule Mining
• Association mining has been applied to many
different domains, including market basket and
risk analysis in commercial environments,
epidemiology, clinical medicine, fluid dynamics,
astrophysics, crime prevention, and
counter-terrorism: all areas in which the
relationship between objects can provide useful
knowledge.
Example of Association Rule
• For example, an insurance company that finds a
strong correlation between two policies A and B,
of the form A → B, indicating that customers who
held policy A were also likely to hold policy B,
could more efficiently target the marketing of
policy B by marketing to those clients who held
policy A but not B.
Brief About Association Rule Mining
• Association mining analysis is a two-part
process.
– First, the identification of sets of items, or
itemsets, within the dataset.
– Second, the subsequent derivation of inferences
from these itemsets.
Why Use Support and Confidence?
• Support reflects the statistical significance of a
rule. Rules that have very low support are
rarely observed, and thus, are more likely to
occur by chance. For example, the rule A → B
may not be significant because both items are
present together only once in the table from
last week's lecture.
Why Use Support and Confidence?
• Additionally, low support rules may not be
actionable from a marketing perspective
because it is not profitable to promote items
that are seldom bought together by
customers.
• For these reasons, support is often used as a
filter to eliminate uninteresting rules.
Why Use Support and Confidence?
• Confidence is another useful metric because it
measures how reliable the inference made by a
rule is.
– For a given rule A → B, the higher the confidence,
the more likely it is for itemset B to be present in
transactions that contain A. In a sense, confidence
provides an estimate of the conditional probability
for B given A.
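To make these two measures concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that computes the support of an itemset and the confidence of a rule from a list of transactions; the helper names and the sample transactions are assumptions.

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimate of P(consequent | antecedent) = support(A ∪ B) / support(A)
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Hypothetical transactions for illustration
transactions = [
    {"Bread", "Cheese", "Milk"},
    {"Bread", "Milk"},
    {"Cheese", "Milk"},
    {"Bread", "Cheese"},
]
print(support({"Bread", "Milk"}, transactions))       # 0.5
print(confidence({"Bread"}, {"Milk"}, transactions))  # 0.666...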
Causality & Association Rule
• Finally, it is worth noting that the inference
made by an association rule does not
necessarily imply causality.
• Instead, the implication indicates a strong
co-occurrence relationship between items in the
antecedent and consequent of the rule.
Causality & Association Rule
• Causality, on the other hand, requires a
distinction between the causal and effect
attributes of the data and typically involves
relationships occurring over time (e.g., ozone
depletion leads to global warming).
More About Support and Confidence
• The support values of the following candidate rules
– {Bread, Cheese} → {Milk}, {Bread, Milk} → {Cheese},
{Cheese, Milk} → {Bread}, {Bread} → {Cheese, Milk},
{Milk} → {Bread, Cheese}, {Cheese} → {Bread, Milk}
are identical, since they all correspond to the same
itemset, {Bread, Cheese, Milk}.
• If the itemset is infrequent, then all six candidate
rules can be immediately pruned without having
to compute their confidence values.
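As a side illustration (mine, not the lecture's), the short snippet below enumerates every candidate rule obtainable by splitting a single itemset into a non-empty antecedent and consequent; for a 3-itemset this yields exactly the six rules listed above, all sharing the support of {Bread, Cheese, Milk}.

from itertools import combinations

# Enumerate all candidate rules X -> (S \ X) derived from one itemset S.
# Every such rule has the same support, namely the support of S itself.
S = {"Bread", "Cheese", "Milk"}
for r in range(1, len(S)):
    for antecedent in combinations(sorted(S), r):
        consequent = S - set(antecedent)
        print(set(antecedent), "->", consequent)   # six rules in total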
More About Support and Confidence
• Therefore, a common strategy adopted by many
association rule mining algorithms is to
decompose the problem into two major subtasks:
– Frequent Itemset Generation. Find all itemsets that
satisfy the minsup threshold. These itemsets are
called frequent itemsets.
– Rule Generation. Extract high-confidence association
rules from the frequent itemsets found in the previous
step. These rules are called strong rules.
Frequent Itemset Generation
• A lattice structure can be used to enumerate
the list of possible itemsets.
• For example, the figure below illustrates all
itemsets derivable from the set {A,B,C,D}.
[Figure: lattice of all itemsets derivable from {A, B, C, D}]
Frequent Itemset Generation
• In general, a data set that contains d items
may generate up to 2^d − 1 possible itemsets,
excluding the null set.
• Because d can be very large in many
commercial databases, frequent itemset
generation is an exponentially expensive task.
Frequent Itemset Generation
• A naïve approach for finding frequent itemsets
is to determine the support count for every
candidate itemset in the lattice structure.
• To do this, we need to match each candidate
against every transaction.
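A brute-force sketch of this naïve approach (my own, assuming transactions are represented as Python sets) is given below; it enumerates all 2^d − 1 candidate itemsets and matches each candidate against every transaction, which illustrates why the approach does not scale.

from itertools import chain, combinations

# Naive frequent-itemset search: enumerate every non-empty candidate
# itemset and count its support with a full pass over the transactions.
def brute_force_frequent(transactions, minsup):
    items = sorted(set(chain.from_iterable(transactions)))
    all_candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))  # 2^d - 1 candidates
    frequent = {}
    for cand in all_candidates:
        count = sum(1 for t in transactions if set(cand) <= t)
        if count / len(transactions) >= minsup:
            frequent[cand] = count
    return frequent

# Hypothetical usage on a tiny database
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(brute_force_frequent(tdb, minsup=0.5))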
Apriori Algorithm
• This algorithm is among the algorithms grouped as
candidate generation algorithms, which are used to
identify candidate itemsets.
• The common data structures used in the Apriori
algorithm are tree data structures.
• Two common types of tree data structures used in
Apriori are (see the sketch below):
– Enumeration Set Tree
– Prefix Tree
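The lecture names these structures but gives no code; purely as a rough sketch under my own naming assumptions, a minimal prefix tree (trie) for storing itemsets with support counts might look as follows. Sharing common prefixes keeps many overlapping candidate itemsets compact.

class PrefixTreeNode:
    # Minimal prefix-tree (trie) node for itemsets and their support counts;
    # a hypothetical sketch, not the exact structure used in the lecture.
    def __init__(self):
        self.children = {}   # item -> PrefixTreeNode
        self.count = 0       # support count of the itemset ending at this node

    def insert(self, itemset):
        node = self
        for item in sorted(itemset):   # items kept in a canonical order
            node = node.children.setdefault(item, PrefixTreeNode())
        return node

# Usage: store candidate itemsets, then bump counts while scanning transactions
root = PrefixTreeNode()
root.insert(("A", "C")).count += 1
root.insert(("A", "C", "E")).count += 1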
Data Structure for Apriori Algorithm
Apriori Algorithm
• Frequent itemsets (also called large itemsets)
are those itemsets whose support is at least
minSupp (the minimum support threshold).
• The Apriori property (downward closure property)
says that every subset of a frequent itemset is
also a frequent itemset.
• The use of support for pruning candidate
itemsets is guided by the following principle
(the Apriori principle):
– If an itemset is frequent, then all of its subsets
must also be frequent.
Reminder: Steps of Association Rule
Mining
• The major steps in association rule
mining are:
– Frequent Itemset generation
– Rules derivation
Apriori Algorithm
• Any subset of a frequent itemset must be
frequent
– If {beer, nappy, nuts} is frequent, so is {beer,
nappy}
– Every transaction having {beer, nappy, nuts} also
contains {beer, nappy}
• Apriori pruning principle: if an itemset is
infrequent, none of its supersets should be
generated or tested!
Apriori Algorithm
• The Apriori algorithm uses the downward
closure property to prune unnecessary
branches from further consideration. It needs
two parameters, minSupp and minConf: minSupp
is used for generating frequent itemsets and
minConf is used for rule derivation.
The Apriori Algorithm: An Example
Database TDB (minimum support count = 2):

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1 (candidate 1-itemsets with support counts):
{A}: 2   {B}: 3   {C}: 3   {D}: 1   {E}: 3

L1 (frequent 1-itemsets):
{A}: 2   {B}: 3   {C}: 3   {E}: 3

C2 (candidate 2-itemsets generated from L1):
{A, B}   {A, C}   {A, E}   {B, C}   {B, E}   {C, E}

2nd scan → C2 with support counts:
{A, B}: 1   {A, C}: 2   {A, E}: 1   {B, C}: 2   {B, E}: 3   {C, E}: 2

L2 (frequent 2-itemsets):
{A, C}: 2   {B, C}: 2   {B, E}: 3   {C, E}: 2

C3 (candidate 3-itemsets generated from L2):
{B, C, E}

3rd scan → C3 with support count:
{B, C, E}: 2

L3 (frequent 3-itemsets):
{B, C, E}: 2
Important Details Of The Apriori
Algorithm
• There are two crucial questions in
implementing the Apriori algorithm:
– How to generate candidates?
– How to count supports of candidates?
Generating Candidates
• There are 2 steps to generating candidates:
– Step 1: Self-joining Lk
– Step 2: Pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
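To make the two steps concrete, here is a small Python sketch of candidate generation (my own code, with the F(k−1) × F(k−1) self-join on the first k−2 items followed by subset pruning); running it on the L3 above reproduces C4 = {abcd}.

from itertools import combinations

def apriori_gen(freq_k_minus_1):
    # Candidate generation: self-join frequent (k-1)-itemsets that agree on
    # their first k-2 items, then prune candidates with an infrequent subset.
    prev = sorted(tuple(sorted(s)) for s in freq_k_minus_1)
    prev_set = set(prev)
    k = len(prev[0]) + 1
    joined = set()
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:                       # agree on first k-2 items
            joined.add(tuple(sorted(set(a) | set(b))))
    return {c for c in joined                      # pruning step
            if all(sub in prev_set for sub in combinations(c, k - 1))}

# The slide's example: L3 = {abc, abd, acd, ace, bcd}
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}, i.e. C4 = {abcd}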
Apriori Algorithm
k = 1.
Fk = { i | i ∈ I ∧ σ({i})/N ≥ minsup }.   {Find all frequent 1-itemsets}
repeat
    k = k + 1.
    Ck = apriori-gen(Fk−1).   {Generate candidate itemsets}
    for each transaction t ∈ T do
        Ct = subset(Ck, t).   {Identify all candidates that belong to t}
        for each candidate itemset c ∈ Ct do
            σ(c) = σ(c) + 1.   {Increment support count}
        end for
    end for
    Fk = { c | c ∈ Ck ∧ σ(c)/N ≥ minsup }.   {Extract the frequent k-itemsets}
until Fk = ∅
Result = ∪k Fk.
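The following is a direct Python translation of this pseudocode, offered as a sketch under my own data-format assumptions (transactions as sets of items, itemsets as frozensets); running it on the TDB example from the earlier slide with minsup = 2/4 reproduces L1, L2 and L3.

from itertools import combinations

def apriori(transactions, minsup):
    # Sketch of the Apriori algorithm: returns {itemset: support count} for
    # every itemset whose relative support is at least `minsup`.
    n = len(transactions)
    counts = {}
    for t in transactions:                     # F1: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c / n >= minsup}
    result = dict(current)
    k = 2
    while current:
        candidates = apriori_gen(current.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:                 # one database scan per level
            for c in candidates:
                if c <= t:                     # candidate contained in transaction
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c / n >= minsup}
        result.update(current)
        k += 1
    return result

def apriori_gen(frequent_prev, k):
    # Simplified join: union any two frequent (k-1)-itemsets whose union has
    # size k, then prune candidates that have an infrequent (k-1)-subset.
    prev = set(map(frozenset, frequent_prev))
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

# The TDB example, minimum support count 2 out of 4 transactions
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(tdb, 0.5).items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)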
How to Count Supports Of
Candidates?
• Why is counting the supports of candidates a
problem?
– The total number of candidates can be huge
– One transaction may contain many candidates
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets
and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained
in a transaction
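The hash tree itself is not spelled out in these slides; as a simplified stand-in (my own sketch), the subset function below enumerates the k-subsets of each transaction and looks them up in an ordinary hash table of candidates, which captures the same idea of hashing candidates instead of scanning them one by one.

from itertools import combinations

# Simplified stand-in for the hash tree: candidates sit in a hash table
# (dict); the subset function hashes each k-subset of a transaction to
# find which candidates it contains and increments their counts.
def count_supports(candidates, transactions, k):
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):   # all k-subsets of t
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

# Hypothetical usage with the 2-itemset candidates from the earlier example
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
C2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
print(count_supports(C2, tdb, 2))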
Generating Association Rules
• Once all frequent itemsets have been found,
association rules can be generated.
• Strong association rules from a frequent
itemset are generated by calculating the
confidence in each possible rule arising from
that itemset and testing it against a minimum
confidence threshold
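A minimal rule-generation sketch (my own code, assuming the {itemset: support count} dictionary produced by an Apriori run) is shown below; for each frequent itemset it tests every non-empty proper subset as an antecedent against the minimum confidence threshold.

from itertools import combinations

def generate_rules(frequent, minconf):
    # For each frequent itemset S and each non-empty proper subset A of S,
    # emit the rule A -> (S \ A) when confidence = support(S)/support(A)
    # meets the minimum confidence threshold.
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = count / frequent[a]
                if conf >= minconf:
                    rules.append((set(a), set(itemset - a), conf))
    return rules

# Hypothetical support counts taken from the earlier TDB example
frequent = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
for a, c, conf in generate_rules(frequent, minconf=0.7):
    print(a, "->", c, round(conf, 2))   # e.g. {'A'} -> {'C'} 1.0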
Example
TID     List of item_IDs
T100    Juice_Can, Crisps, Milk
T200    Crisps, Bread
T300    Crisps, Nappies
T400    Juice_Can, Crisps, Bread
T500    Juice_Can, Nappies
T600    Crisps, Nappies
T700    Juice_Can, Nappies
T800    Juice_Can, Crisps, Nappies, Milk
T900    Juice_Can, Crisps, Nappies

ID    Item
I1    Juice_Can
I2    Crisps
I3    Nappies
I4    Bread
I5    Milk
Example
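As one concrete illustration over the transaction table above (the choice of rule is mine, not necessarily the one worked in the original example), the snippet below computes the support and confidence of the rule {Juice_Can} → {Nappies}.

# Support and confidence of the illustrative rule {Juice_Can} -> {Nappies}
tdb = [
    {"Juice_Can", "Crisps", "Milk"},              # T100
    {"Crisps", "Bread"},                          # T200
    {"Crisps", "Nappies"},                        # T300
    {"Juice_Can", "Crisps", "Bread"},             # T400
    {"Juice_Can", "Nappies"},                     # T500
    {"Crisps", "Nappies"},                        # T600
    {"Juice_Can", "Nappies"},                     # T700
    {"Juice_Can", "Crisps", "Nappies", "Milk"},   # T800
    {"Juice_Can", "Crisps", "Nappies"},           # T900
]
both = sum(1 for t in tdb if {"Juice_Can", "Nappies"} <= t)    # 4 transactions
antecedent = sum(1 for t in tdb if "Juice_Can" in t)           # 6 transactions
print("support =", both / len(tdb))        # 4/9 ≈ 0.44
print("confidence =", both / antecedent)   # 4/6 ≈ 0.67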
Challenges Of Frequent Pattern Mining
• Challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for
candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates
Bottleneck Of Frequent-Pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many passes of
scanning and generates lots of candidates
– To find the frequent itemset {i1, i2, …, i100}:
• # of scans: 100
• # of candidates: C(100, 1) + C(100, 2) + … + C(100, 100)
  = 2^100 − 1 ≈ 1.27 × 10^30
• Bottleneck: candidate-generation-and-test
Mining Frequent Patterns Without
Candidate Generation
• Techniques for mining frequent itemsets
which avoid candidate generation include:
– FP-growth
• Grow long patterns from short ones using local
frequent items
– ECLAT (Equivalence CLASS Transformation)
algorithm
• Uses a data representation in which transactions are
associated with items, rather than the other way
around (vertical data format)
• These methods can be much faster than the
Apriori algorithm