Association Rule Mining: Apriori Algorithm
CIT366: Data Mining & Data Warehousing
Bajuna Salehe
The Institute of Finance Management: Computing and IT Dept.

Brief About Association Rule Mining
• The results of Market Basket Analysis allowed companies to understand purchasing behaviour more fully and, as a result, to target market audiences more precisely.
• Association mining is user-centric: the objective is the elicitation of useful (or interesting) rules from which new knowledge can be derived.

Brief About Association Rule Mining
• Association mining has been applied to many different domains, including market basket and risk analysis in commercial environments, epidemiology, clinical medicine, fluid dynamics, astrophysics, crime prevention, and counter-terrorism, all of which are areas in which the relationships between objects can provide useful knowledge.

Example of Association Rule
• For example, an insurance company that finds a strong correlation between two policies A and B, of the form A -> B, indicating that customers who held policy A were also likely to hold policy B, could target the marketing of policy B more efficiently by marketing to those clients who held policy A but not B.

Brief About Association Rule Mining
• Association mining analysis is a two-part process:
  – First, the identification of sets of items (itemsets) within the dataset.
  – Second, the subsequent derivation of inferences from these itemsets.

Why Use Support and Confidence?
• Support reflects the statistical significance of a rule. Rules that have very low support are rarely observed and are therefore more likely to occur by chance. For example, the rule A → B may not be significant if both items are present together only once in the table from last week's lecture.

Why Use Support and Confidence?
• Additionally, low-support rules may not be actionable from a marketing perspective, because it is not profitable to promote items that are seldom bought together by customers.
• For these reasons, support is often used as a filter to eliminate uninteresting rules.

Why Use Support and Confidence?
• Confidence is another useful metric because it measures how reliable the inference made by a rule is.
  – For a given rule A → B, the higher the confidence, the more likely it is for itemset B to be present in transactions that contain A. In a sense, confidence provides an estimate of the conditional probability of B given A.

Causality & Association Rule
• Finally, it is worth noting that the inference made by an association rule does not necessarily imply causality.
• Instead, the implication indicates a strong co-occurrence relationship between items in the antecedent and consequent of the rule.

Causality & Association Rule
• Causality, on the other hand, requires a distinction between the causal and effect attributes of the data and typically involves relationships occurring over time (e.g., ozone depletion leads to global warming).

More About Support and Confidence
• The support values for the following candidate rules
  – {Bread, Cheese} → {Milk}, {Bread, Milk} → {Cheese}, {Cheese, Milk} → {Bread}, {Bread} → {Cheese, Milk}, {Milk} → {Bread, Cheese}, {Cheese} → {Bread, Milk}
  are identical, since they all correspond to the same itemset, {Bread, Cheese, Milk}.
• If the itemset is infrequent, then all six candidate rules can be immediately pruned without having to compute their confidence values.
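To make the support and confidence calculations concrete, here is a minimal Python sketch. The transaction list, item names, and helper functions are illustrative assumptions for this note, not data or code from the lecture.

    # Minimal sketch of support and confidence over an illustrative
    # list of transactions (not the table from the previous lecture).
    transactions = [
        {"Bread", "Cheese", "Milk"},
        {"Bread", "Milk"},
        {"Cheese", "Milk"},
        {"Bread", "Cheese"},
        {"Milk"},
    ]

    def support(itemset):
        """Fraction of transactions that contain every item in itemset."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        """Estimate of P(consequent | antecedent) = sup(A ∪ B) / sup(A)."""
        return support(set(antecedent) | set(consequent)) / support(antecedent)

    # All six rules derived from {Bread, Cheese, Milk} share this support value:
    print(support({"Bread", "Cheese", "Milk"}))        # 0.2
    # but their confidence values differ, e.g. for {Bread, Cheese} -> {Milk}:
    print(confidence({"Bread", "Cheese"}, {"Milk"}))   # 0.5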
More About Support and Confidence
• Therefore, a common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:
  – Frequent itemset generation: find all itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
  – Rule generation: extract high-confidence association rules from the frequent itemsets found in the previous step. These rules are called strong rules.

Frequent Itemset Generation
• A lattice structure can be used to enumerate the list of possible itemsets.
• For example, the figure below illustrates all itemsets derivable from the set {A, B, C, D}.

Frequent Itemset Generation
(Figure: itemset lattice over {A, B, C, D}.)

Frequent Itemset Generation
• In general, a data set that contains d items may generate up to 2^d − 1 possible itemsets, excluding the null set.
• Because d can be very large in many commercial databases, frequent itemset generation is an exponentially expensive task.

Frequent Itemset Generation
• A naïve approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure.
• To do this, we need to match each candidate against every transaction.

Apriori Algorithm
• This algorithm belongs to the group of candidate-generation algorithms, which explicitly generate and test candidate itemsets.
• The common data structures used in the Apriori algorithm are trees.
• Two common types of tree data structure used in Apriori are:
  – Enumeration set tree
  – Prefix tree

Data Structure for Apriori Algorithm
(Figure: tree data structures used by Apriori.)

Apriori Algorithm
• Frequent itemsets (also called large itemsets) are those itemsets whose support is greater than minSupp (the minimum support).
• The Apriori property (downward closure property) says that all subsets of a frequent itemset are also frequent itemsets.
• The use of support for pruning candidate itemsets is guided by the following principle (the Apriori principle):
  – If an itemset is frequent, then all of its subsets must also be frequent.

Reminder: Steps of Association Rule Mining
• The major steps in association rule mining are:
  – Frequent itemset generation
  – Rule derivation

Apriori Algorithm
• Any subset of a frequent itemset must be frequent.
  – If {beer, nappy, nuts} is frequent, so is {beer, nappy}.
  – Every transaction containing {beer, nappy, nuts} also contains {beer, nappy}.
• Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested!

Apriori Algorithm
• The Apriori algorithm uses the downward closure property to prune unnecessary branches from further consideration. It needs two parameters, minSupp and minConf: minSupp is used for generating frequent itemsets, and minConf is used for rule derivation.

The Apriori Algorithm: An Example
• Transaction database TDB (minimum support count = 2):
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E
• 1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3  →  L1 = {A}:2, {B}:3, {C}:3, {E}:3
• 2nd scan: C2 = {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2  →  L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
• 3rd scan: C3 = {B,C,E}  →  L3 = {B,C,E}:2

Important Details of the Apriori Algorithm
• There are two crucial questions in implementing the Apriori algorithm:
  – How to generate candidates?
  – How to count the supports of candidates?
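As a sketch of how the level-wise counting in this example could be carried out in code, the snippet below runs over the toy database above with a minimum support count of 2. The brute-force candidate enumeration is a stand-in for apriori-gen (which would build candidates only from the previous level's frequent itemsets), and the helper names are illustrative.

    # Level-wise counting on the toy database TDB above; brute-force candidate
    # enumeration is used as a stand-in for apriori-gen (illustration only).
    from itertools import combinations

    transactions = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
                    30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
    minsup_count = 2
    items = sorted(set().union(*transactions.values()))

    def support_count(itemset):
        """Number of transactions that contain every item in itemset."""
        return sum(set(itemset) <= t for t in transactions.values())

    frequent = {}
    for k in range(1, len(items) + 1):
        # Candidate k-itemsets; Apriori proper would build these only from L(k-1).
        Lk = {c: support_count(c) for c in combinations(items, k)
              if support_count(c) >= minsup_count}
        if not Lk:
            break
        frequent.update(Lk)

    for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
        print(set(itemset), count)   # reproduces L1, L2 and L3 from the slide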
Generating Candidates
• There are two steps to generating candidates:
  – Step 1: self-joining Lk
  – Step 2: pruning
• Example of candidate generation:
  – L3 = {abc, abd, acd, ace, bcd}
  – Self-joining: L3 * L3
    • abcd from abc and abd
    • acde from acd and ace
  – Pruning:
    • acde is removed because ade is not in L3
  – C4 = {abcd}

Apriori Algorithm
• Pseudocode for frequent itemset generation:
    k = 1
    Fk = { i | i ∈ I ∧ σ({i})/N ≥ minsup }    {Find all frequent 1-itemsets}
    repeat
      k = k + 1
      Ck = apriori-gen(Fk−1)                  {Generate candidate itemsets}
      for each transaction t ∈ T do
        Ct = subset(Ck, t)                    {Identify all candidates that belong to t}
        for each candidate itemset c ∈ Ct do
          σ(c) = σ(c) + 1                     {Increment support count}
        end for
      end for
      Fk = { c | c ∈ Ck ∧ σ(c)/N ≥ minsup }   {Extract the frequent k-itemsets}
    until Fk = ∅
    Result = ∪k Fk

How to Count the Supports of Candidates?
• Why is counting the supports of candidates a problem?
  – The total number of candidates can be huge.
  – One transaction may contain many candidates.
• Method:
  – Candidate itemsets are stored in a hash tree.
  – A leaf node of the hash tree contains a list of itemsets and counts.
  – An interior node contains a hash table.
  – A subset function finds all the candidates contained in a transaction.

Generating Association Rules
• Once all frequent itemsets have been found, association rules can be generated.
• Strong association rules are generated from a frequent itemset by calculating the confidence of each possible rule arising from that itemset and testing it against a minimum confidence threshold.

Example
• Transaction database:
  TID    List of item_IDs
  T100   Juice_Can, Crisps, Milk
  T200   Crisps, Bread
  T300   Crisps, Nappies
  T400   Juice_Can, Crisps, Bread
  T500   Juice_Can, Nappies
  T600   Crisps, Nappies
  T700   Juice_Can, Nappies
  T800   Juice_Can, Crisps, Nappies, Milk
  T900   Juice_Can, Crisps, Nappies
• Item IDs:
  ID   Item
  I1   Juice_Can
  I2   Crisps
  I3   Nappies
  I4   Bread
  I5   Milk

Challenges of Frequent Pattern Mining
• Challenges:
  – Multiple scans of the transaction database
  – Huge number of candidates
  – Tedious workload of support counting for candidates
• Improving Apriori, general ideas:
  – Reduce the number of passes over the transaction database
  – Shrink the number of candidates
  – Facilitate support counting of candidates

Bottleneck of Frequent-Pattern Mining
• Multiple database scans are costly.
• Mining long patterns needs many passes of scanning and generates lots of candidates.
  – To find the frequent itemset i1i2…i100:
    • Number of scans: 100
    • Number of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30
• Bottleneck: candidate generation and test

Mining Frequent Patterns Without Candidate Generation
• Techniques for mining frequent itemsets that avoid candidate generation include:
  – FP-growth
    • Grows long patterns from short ones using local frequent items
  – ECLAT (Equivalence CLASS Transformation) algorithm
    • Uses a data representation in which transactions are associated with items, rather than the other way around (a vertical data format)
• These methods can be much faster than the Apriori algorithm.
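As a closing illustration of the rule generation step, here is a small Python sketch that derives strong rules from a single frequent itemset. The transactions are the four-transaction toy database from the earlier example; the minimum confidence of 0.7 is an assumed value chosen for illustration, and the function names are not from any particular library.

    # Sketch of rule derivation from one frequent itemset; minconf = 0.7 is an
    # assumed threshold and the transactions are the earlier toy database.
    from itertools import combinations

    transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    minconf = 0.7

    def support(itemset):
        """Fraction of transactions that contain every item in itemset."""
        return sum(set(itemset) <= t for t in transactions) / len(transactions)

    def rules_from(itemset, minconf):
        """Yield strong rules (antecedent -> consequent) from one frequent itemset."""
        itemset = frozenset(itemset)
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = support(itemset) / support(antecedent)
                if conf >= minconf:
                    yield sorted(antecedent), sorted(itemset - antecedent), conf

    for antecedent, consequent, conf in rules_from({"B", "C", "E"}, minconf):
        print(f"{antecedent} -> {consequent}  (confidence = {conf:.2f})")
    # Prints the rules ['B', 'C'] -> ['E'] and ['C', 'E'] -> ['B'], each with confidence 1.00.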