Rule Learning: Itemset Mining, Apriori Algorithm & Association Rules

Rule Learning
Module Introduction
Objectives
By the end of this module, you will be prepared to:
• Sketch itemset mining
• Sketch association rule learning
• Formulate the a-priori algorithm
• Define support
• Define how empirically determined probabilities are computed for rules and
itemsets
Rule Learning Identifies Patterns in Data
• Often considered “unsupervised” because there is no
target class
• Provides analytic insights into data
• Explainability
Introduction to
Rule Learning and Itemset Mining
Objectives
By the end of this module, you will be prepared to:
• Sketch itemset mining
• Define support
• Define how empirically determined probabilities are
computed for rules and itemsets
Gaining insights into data
• Rule and itemset learning are explainable methods
that show how their conclusions were reached
• These methods are used to identify unique patterns
in data
• They do not rely on ground truth, and are sometimes
considered “unsupervised” methods
Source: BBC https://www.bbc.com/news/technology-33804287
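Basic Setup
• A transaction database holds n transactions over m
different items; each transaction is a set of items
Itemsets
• An itemset is any set of items; a transaction contains
an itemset when the itemset is a subset of the
transaction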
Example: Market basket Analysis
Consider the following transaction database
{apple, banana, bread}
{bread, apple, peanut_butter}
{bread, jelly, peanut_butter}
{salmon, capers, cheese, orange_juice}
{salmon, oil, parsley}
{milk, peanut_butter, bread}
{eggs, bread, oil}
Support
• The support for a given itemset is simply the
number of times* that the itemset appears in
the transaction database
• For a given itemset X, this will be annotated
sup(X) or #X
*In some implementations this may be the fraction of times.
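As a minimal sketch of the definition above, support can be counted with a subset test against each transaction. The function names support and relative_support are illustrative choices, not names from the slides; the database is the one from the example slide.

```python
# Transaction database copied from the market basket example above.
transactions = [
    {"apple", "banana", "bread"},
    {"bread", "apple", "peanut_butter"},
    {"bread", "jelly", "peanut_butter"},
    {"salmon", "capers", "cheese", "orange_juice"},
    {"salmon", "oil", "parsley"},
    {"milk", "peanut_butter", "bread"},
    {"eggs", "bread", "oil"},
]


def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in database if itemset <= t)


def relative_support(itemset, database):
    """Fraction-of-transactions variant mentioned in the footnote."""
    return support(itemset, database) / len(database)


print(support({"bread"}, transactions))                   # 5
print(support({"bread", "peanut_butter"}, transactions))  # 3
```

The count form and the fraction form rank itemsets identically; they differ only in whether minSupt is stated as a number of transactions or as a proportion.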
Association Rules and Confidence
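• An association rule X 🡪 Y says that transactions
containing the itemset X tend to also contain the
itemset Y
• The confidence of a rule is the fraction of transactions
containing X that also contain Y:
conf(X 🡪 Y) = sup(X ∪ Y) / sup(X)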
Example: Market basket Analysis
Consider the following transaction
database
• {apple, banana, bread}
• {bread, apple, peanut_butter, jelly}
• {bread, jelly, peanut_butter}
• {salmon, capers, cheese,
orange_juice}
• {salmon, oil, parsley}
• {milk, peanut_butter, jelly}
• {eggs, bread, oil}
Consider rule
{peanut_butter, jelly} 🡪 {bread}
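The rule above can be scored with a short sketch. It assumes the usual definition conf(X 🡪 Y) = sup(X ∪ Y) / sup(X), which agrees with the confidence values listed later in this module; the helper names are illustrative.

```python
# Transaction database from the slide above.
transactions = [
    {"apple", "banana", "bread"},
    {"bread", "apple", "peanut_butter", "jelly"},
    {"bread", "jelly", "peanut_butter"},
    {"salmon", "capers", "cheese", "orange_juice"},
    {"salmon", "oil", "parsley"},
    {"milk", "peanut_butter", "jelly"},
    {"eggs", "bread", "oil"},
]


def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in database if itemset <= t)


def confidence(antecedent, consequent, database):
    """sup(X u Y) / sup(X); assumes the antecedent occurs at least once."""
    antecedent, consequent = set(antecedent), set(consequent)
    return support(antecedent | consequent, database) / support(antecedent, database)


conf = confidence({"peanut_butter", "jelly"}, {"bread"}, transactions)
print(round(conf, 2))  # 0.67
```

Three transactions contain both peanut_butter and jelly, and two of those also contain bread, giving 2/3.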
Itemset Mining and the
A-Priori Algorithm
Objectives
By the end of this module, you will be prepared to:
• Formulate the a-priori algorithm
Problem: Finding Itemsets
Given a transaction database, can we find all
itemsets that meet a certain level of minimum
support?
This is called mining frequent itemsets
How Do You Find Itemsets? – Brute Force
• Brute force method to find itemsets in a
transaction database of n transactions over m
different items for a given level of support minSupt
• Result = {}
• For i in 1,…,m
• For each combination C of items of size i
• Check each transaction t to see if C is a subset
of t
• If C is a subset of at least minSupt transactions,
then add C to Result
• Return Result
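The brute force procedure can be sketched in Python roughly as follows. The function name and the choice to return supports alongside the itemsets are illustrative assumptions; the database is the seven-transaction example used elsewhere in this module.

```python
from itertools import combinations

transactions = [
    {"apple", "banana", "bread"},
    {"bread", "apple", "peanut_butter", "jelly"},
    {"bread", "jelly", "peanut_butter"},
    {"salmon", "capers", "cheese", "orange_juice"},
    {"salmon", "oil", "parsley"},
    {"milk", "peanut_butter", "jelly"},
    {"eggs", "bread", "oil"},
]


def brute_force_frequent_itemsets(database, min_supt):
    """Try every non-empty combination of items and keep the frequent ones."""
    items = sorted(set().union(*database))  # the m different items
    result = {}
    for size in range(1, len(items) + 1):         # i = 1, ..., m
        for combo in combinations(items, size):   # each combination C of size i
            candidate = frozenset(combo)
            # Count the transactions t for which C is a subset of t.
            count = sum(1 for t in database if candidate <= t)
            if count >= min_supt:
                result[candidate] = count
    return result


frequent = brute_force_frequent_itemsets(transactions, min_supt=2)
print(len(frequent))  # 11
```

With 13 different items this already tests 2^13 - 1 = 8191 combinations against every transaction, although only 11 of them are frequent.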
Problems with Brute Force Approach
• Exponential runtime: m items give 2^m - 1 non-empty
candidate itemsets
• Needlessly examines itemsets that are not
present in any transaction
• Needlessly examines itemsets that we already
know cannot meet the minimum support
Downward Closure Property
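• If an itemset meets the minimum support, then so
does every subset of it
• Equivalently, if an itemset does not meet the minimum
support, no superset of it can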
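The downward closure property (every subset of an itemset occurs at least as often as the itemset itself) can be checked on the example database with a small sketch. The helper names are illustrative, not from the slides.

```python
from itertools import combinations

transactions = [
    {"apple", "banana", "bread"},
    {"bread", "apple", "peanut_butter", "jelly"},
    {"bread", "jelly", "peanut_butter"},
    {"salmon", "capers", "cheese", "orange_juice"},
    {"salmon", "oil", "parsley"},
    {"milk", "peanut_butter", "jelly"},
    {"eggs", "bread", "oil"},
]


def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in database if itemset <= t)


def subsets_at_least_as_frequent(itemset, database):
    """True if every non-empty proper subset has support >= the itemset's."""
    full = support(itemset, database)
    return all(
        support(subset, database) >= full
        for size in range(1, len(itemset))
        for subset in combinations(sorted(itemset), size)
    )


itemset = {"bread", "peanut_butter", "jelly"}
print(support(itemset, transactions))                       # 2
print(subsets_at_least_as_frequent(itemset, transactions))  # True
```

Read in the other direction, {apple, peanut_butter} has support 1, so with a minimum support of 2 no itemset containing both apple and peanut_butter needs to be counted.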
The Apriori Algorithm
• Leverages Downward Closure for a more efficient
process
• Iterative, level-wise search
• At each iteration k, only consider candidate itemsets
of size k whose subsets of size k-1 are all frequent
• Join step: generate candidates of length k from the
frequent itemsets of length k-1
• Prune step: remove those candidates that cannot be
frequent (as they contain a non-frequent subset)
A Priori Algorithm: Pseudocode
•
candidateGen subroutine
•
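The slide pseudocode did not survive extraction, so what follows is only a sketch of the level-wise search described above, not the original listing. The names apriori and candidate_gen are assumptions, and the join step here simply unions pairs of frequent (k-1)-itemsets rather than using the prefix-based join found in many textbook versions.

```python
from itertools import combinations

transactions = [
    {"apple", "banana", "bread"},
    {"bread", "apple", "peanut_butter", "jelly"},
    {"bread", "jelly", "peanut_butter"},
    {"salmon", "capers", "cheese", "orange_juice"},
    {"salmon", "oil", "parsley"},
    {"milk", "peanut_butter", "jelly"},
    {"eggs", "bread", "oil"},
]


def candidate_gen(frequent_prev, k):
    """Join and prune: size-k candidates built from frequent (k-1)-itemsets."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue  # join step keeps only unions of exactly k items
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def apriori(database, min_supt):
    """Return {itemset: support} for all itemsets with support >= min_supt."""
    # First pass: count single items.
    counts = {}
    for t in database:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_supt}
    result = dict(frequent)
    k = 2
    while frequent:  # stop once a level yields no frequent itemsets
        candidates = candidate_gen(frequent.keys(), k)
        counts = {c: sum(1 for t in database if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_supt}
        result.update(frequent)
        k += 1
    return result


result = apriori(transactions, min_supt=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this database the search counts 13 single items, 15 pairs and 1 triple, instead of the 8191 combinations the brute force approach would test.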
Example: Market basket Analysis
First Pass
Consider the following transaction
database
• {apple, banana, bread}
• {bread, apple, peanut_butter, jelly}
• {bread, jelly, peanut_butter}
• {salmon, capers, cheese,
orange_juice}
• {salmon, oil, parsley}
• {milk, peanut_butter, jelly}
• {eggs, bread, oil}
Find all itemsets with support of at least 2
First pass:
• {apple}
• {bread}
• {peanut_butter}
• {jelly}
• {salmon}
• {oil}
Example: Market basket Analysis
k=2
Consider the following transaction
database
• {apple, banana, bread}
• {bread, apple, peanut_butter, jelly}
• {bread, jelly, peanut_butter}
• {salmon, capers, cheese,
orange_juice}
• {salmon, oil, parsley}
• {milk, peanut_butter, jelly}
• {eggs, bread, oil}
Find all itemsets with support of at least 2
Candidates:
• {apple, bread}
• {apple, peanut_butter}
• {apple, jelly}
• {apple, salmon}
• {apple, oil}
• {bread, peanut_butter}
• {bread, jelly}
• {bread, salmon}
• {bread, oil}
• {peanut_butter, jelly}
• {peanut_butter, salmon}
• {peanut_butter, oil}
• {jelly, salmon}
• {jelly, oil}
• {salmon, oil}
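To see which of the fifteen candidates survive, each one can be counted against the database. This sketch assumes a minimum support of 2, which is what the first-pass results imply; the variable names are illustrative.

```python
from itertools import combinations

transactions = [
    {"apple", "banana", "bread"},
    {"bread", "apple", "peanut_butter", "jelly"},
    {"bread", "jelly", "peanut_butter"},
    {"salmon", "capers", "cheese", "orange_juice"},
    {"salmon", "oil", "parsley"},
    {"milk", "peanut_butter", "jelly"},
    {"eggs", "bread", "oil"},
]

# Frequent single items from the first pass.
frequent_items = ["apple", "bread", "peanut_butter", "jelly", "salmon", "oil"]
min_supt = 2  # assumed threshold

# Every pair of frequent items is a candidate at k=2.
candidate_counts = {
    pair: sum(1 for t in transactions if set(pair) <= t)
    for pair in combinations(frequent_items, 2)
}
frequent_pairs = {pair: n for pair, n in candidate_counts.items() if n >= min_supt}

print(len(candidate_counts))  # 15
for pair, n in frequent_pairs.items():
    print(pair, n)
```

Only four pairs reach the threshold: {apple, bread}, {bread, peanut_butter}, {bread, jelly} and {peanut_butter, jelly}.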
Example: Market basket Analysis
k=3
Consider the following transaction
database
• {apple, banana, bread}
• {bread, apple, peanut_butter, jelly}
• {bread, jelly, peanut_butter}
• {salmon, capers, cheese,
orange_juice}
• {salmon, oil, parsley}
• {milk, peanut_butter, jelly}
• {eggs, bread, oil}
Find all itemsets with support of at least 2
Candidates:
• {bread, peanut_butter, jelly}
Termination criteria met
(no more candidates at k=4)
Result:
• {apple}
• {bread}
• {peanut_butter}
• {jelly}
• {salmon}
• {oil}
• {apple, bread}
• {bread, peanut_butter}
• {bread, jelly}
• {peanut_butter, jelly}
• {bread, peanut_butter, jelly}
A Priori
• Typically, the size of the largest frequent itemset is
bounded at much less than m (usually ~10)
• Very fast algorithm; under certain conditions it can
run in linear time
• Setting minSupt=1 will make A Priori perform
poorly, because every itemset that occurs in any
transaction is then frequent
• Key: higher support yields a sparsity that A Priori
leverages
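The effect of the support threshold can be illustrated by counting how many itemsets qualify at different values of minSupt on the example database. This sketch enumerates the subsets of each transaction directly, which is a counting shortcut for a tiny database and not the A Priori search itself; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"apple", "banana", "bread"},
    {"bread", "apple", "peanut_butter", "jelly"},
    {"bread", "jelly", "peanut_butter"},
    {"salmon", "capers", "cheese", "orange_juice"},
    {"salmon", "oil", "parsley"},
    {"milk", "peanut_butter", "jelly"},
    {"eggs", "bread", "oil"},
]


def itemset_supports(database):
    """Support of every itemset that occurs in at least one transaction."""
    counts = Counter()
    for t in database:
        for size in range(1, len(t) + 1):
            for combo in combinations(sorted(t), size):
                counts[frozenset(combo)] += 1
    return counts


def count_frequent(database, min_supt):
    """How many itemsets have support >= min_supt."""
    return sum(1 for n in itemset_supports(database).values() if n >= min_supt)


for min_supt in (1, 2, 3):
    print(min_supt, count_frequent(transactions, min_supt))
```

Even on seven transactions, lowering the threshold from 2 to 1 raises the number of frequent itemsets from 11 to 49, and the gap widens quickly as transactions get longer.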
Association Rule Mining
Objectives
By the end of this module, you will be prepared to:
• Sketch association rule learning
Problem
• Given a transaction database, suppose we
have a set of frequent itemsets, all having a
support of at least minSupt
• How do we then find association rules that
meet some minimum level of confidence?
Simple Extension to A Priori
•
Example: Market basket Analysis
Association Rules
Consider the following transaction
database
• {apple, banana, bread}
• {bread, apple, peanut_butter, jelly}
• {bread, jelly, peanut_butter}
• {salmon, capers, cheese,
orange_juice}
• {salmon, oil, parsley}
• {milk, peanut_butter, jelly}
• {eggs, bread, oil}
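Find all itemsets with support of at least 2
Itemsets (support):
• {apple} (2)
• {bread} (4)
• {peanut_butter} (3)
• {jelly} (3)
• {salmon} (2)
• {oil} (2)
• {apple, bread} (2)
• {bread, peanut_butter} (2)
• {bread, jelly} (2)
• {peanut_butter, jelly} (3)
• {bread, peanut_butter, jelly} (2)
Rules (confidence):
• apple 🡪 bread 1.0
• bread 🡪 apple 0.5
• bread 🡪 peanut_butter 0.5
• peanut_butter 🡪 bread 0.67
• bread 🡪 jelly 0.5
• jelly 🡪 bread 0.67
• peanut_butter 🡪 jelly 1.0
• jelly 🡪 peanut_butter 1.0
• {bread, peanut_butter} 🡪 jelly 1.0
• {bread, jelly} 🡪 peanut_butter 1.0
• {peanut_butter, jelly} 🡪 bread 0.67
• bread 🡪 {peanut_butter, jelly} 0.5
• peanut_butter 🡪 {bread, jelly} 0.67
• jelly 🡪 {bread, peanut_butter} 0.67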
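Rule generation from a table of frequent itemsets can be sketched as below. The supports are the counts for the transaction database above, including {bread, peanut_butter, jelly}, which occurs in two transactions. The function name generate_rules and the threshold min_conf=0.6 are illustrative assumptions.

```python
from itertools import combinations

# Frequent itemsets and their supports for the example database.
supports = {
    frozenset({"apple"}): 2,
    frozenset({"bread"}): 4,
    frozenset({"peanut_butter"}): 3,
    frozenset({"jelly"}): 3,
    frozenset({"salmon"}): 2,
    frozenset({"oil"}): 2,
    frozenset({"apple", "bread"}): 2,
    frozenset({"bread", "peanut_butter"}): 2,
    frozenset({"bread", "jelly"}): 2,
    frozenset({"peanut_butter", "jelly"}): 3,
    frozenset({"bread", "peanut_butter", "jelly"}): 2,
}


def generate_rules(supports, min_conf):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                # The antecedent is frequent too (downward closure),
                # so its support is already in the table.
                conf = sup / supports[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, conf))
    return rules


for antecedent, consequent, conf in generate_rules(supports, min_conf=0.6):
    print(sorted(antecedent), "->", sorted(consequent), round(conf, 2))
```

No extra pass over the transaction database is needed: every confidence is a ratio of two supports that the itemset mining step has already counted.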
Class Association Rules (CAR)
• This is where a certain item or items is a “target
class” and appears only in the consequent
• Set of class labels is disjoint from the set of items
• Key idea to mine:
• Find itemsets that meet minimum support
• Compute confidence for the itemset as an antecedent
based on the fraction of transactions containing the
itemset that appear with the target class
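A rough sketch of the key idea follows. The labeled transactions, the class labels (snack, breakfast, dinner), the thresholds and the function name are all hypothetical and made up for illustration. Itemsets are enumerated by brute force up to a small size to keep the sketch short; an A Priori search could be substituted.

```python
from itertools import combinations

# Hypothetical labeled data: (items, class label). Labels are disjoint from items.
labeled = [
    ({"bread", "peanut_butter", "jelly"}, "snack"),
    ({"bread", "peanut_butter"}, "snack"),
    ({"bread", "eggs"}, "breakfast"),
    ({"eggs", "milk"}, "breakfast"),
    ({"bread", "eggs", "milk"}, "breakfast"),
    ({"salmon", "oil"}, "dinner"),
]


def class_association_rules(data, min_supt, min_conf, max_size=2):
    """Rules itemset -> class label meeting minimum support and confidence."""
    items = sorted(set().union(*(t for t, _ in data)))
    rules = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            antecedent = frozenset(combo)
            # Labels of the transactions that contain the itemset.
            covered = [label for t, label in data if antecedent <= t]
            if len(covered) < min_supt:
                continue  # itemset does not meet minimum support
            for label in sorted(set(covered)):
                conf = covered.count(label) / len(covered)
                if conf >= min_conf:
                    rules.append((antecedent, label, conf))
    return rules


for antecedent, label, conf in class_association_rules(labeled, min_supt=2, min_conf=0.8):
    print(sorted(antecedent), "->", label, conf)
```

In this made-up data {bread} meets the support threshold but splits evenly between two classes, so it yields no rule, while {eggs} always appears with the breakfast label.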