Narrative - Spatial Database Group

advertisement
“Fast Algorithms for Mining Association Rules”
By Rakesh Agarwal, Ramakrishnan Srikant
Presented by:
Muhammad Aurangzeb Ahmad
Nupur Bhatnagar
Motivation:
With the increased dissemination of bar code scanning technologies it was possible to
accumulate vast amounts of Market-basket dataset containing millions of transactions. Thus the
need arose to study the patterns of consumer consumption that could help in improving the
marketing infrastructure and related disciplines like targeted marketing.
Association Rule Mining is a data mining technique which is well suited for mining Marketbasket dataset. The research described in the current paper came out during the early days of
data mining research and was also meant to demonstrate the feasibility of fast scalable data
mining algorithms. Although a few algorithms for mining association rules existed at the time,
the Apriori and Apriori TID algorithms greatly reduced the overhead costs associated with
generating association rules.
Problem Statement:
Given a set of transactions, find rules that will predict the occurrence of an item based on the
occurrences of other items in the transaction. The following is an illustrative example of market
basket dataset.
TID
Transactions
1
books, stationary
2
Books,bags,grocery,utensils,
3
Stationary,bags,grocery,coke
4
books,stationary,bags,grocery
5
Books,stationary,bags,coke
Market based transaction set
For the above dataset the following rule holds good.
{books}->{stationary}
The rule can be read as, “customers who buy books also tend to buy stationary.”
Formal Definition:
Let I = {i1, i2, ...., im} be a set of literals, called items. Let D be a set of transactions, where each
transaction T is a set of items such that T subset I. The problem of mining association rules is to
generate all association rules that have support and confidence greater than the user-specified
minimum support (called minsup) and minimum confidence (called minconf) respectively.
Major Contributions:
Two new algorithms for Association Rule Mining, Apriori and AprioriTID, along with a hybrid
of the two algorithms, are described in the paper. The performance of these algorithms is shown
to be from many times better for smaller datasets to many orders of magnitude better than the
then current algorithms. The algorithms described in the paper represent a huge improvement
over the state of the art in Association Rule Mining at the time. The new algorithms improve
upon the existing algorithms by employing the following.
 Apriori and AprioriTID reduces the number of itemsets to be generated each pass by
reducing the number of candidate itemsets.
 AprioriTID uses an encoding of the database for each pass instead of using the complete
database. This greatly reduces the overhead cost associated with making passes over the
dataset especially in later passes.
 Apriori Hybrid combines the features of the two aforementioned algorithms. It employs
the Apriori algorithm for mining for earlier passes but switches to AprioriTID when the
size of the encoding is expected to fit into memory.
Key Concepts:
 Itemset
A collection of one or more items in a market based transactions.
Consider a set of literals I = {i1, i2, ...., im} then I is called itemset.
 Association Rule
An association rule is an implication of the form X->Y where X and Y are the itemsets.
 Support
Support measures fraction of transactions that contain both X and Y. Given a rule X->Y:
Support = (X union Y )/ N
where N is the total number of transactions.
 Confidence
Confidence measures how often item in Y appear in transactions that contain X. Given the
rule X->Y :
Confidence = X union Y / X
 Large Itemset
Itemsets with minimum support (minsup) and minimum confidence (minconf) are called as
large itemsets,while others that do not cross minimum support and minimum confidence
values are known as small itemsets.
 k Itemset
An itemset with k number of items is referred to as k-itemset.
 Candidate Itemsets
A set of itemsets which are generated from a seed of itemsets which were found to be large
in the previous pass. Large itemsets for the next iteration are selected from the candidate
itemsets if the support of the candidate itemsets is equal to or larger than minsup and
minconf.
 Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Proposed Algorithms
APRIORI ALGORITHM:
Input
The market base transaction dataset.
Procedure
 The first pass of the algorithm counts item occurrences to determine large 1-itemsets.
 This process is repeat until no new large 1-itemsets are identified.
 (k+1) length candidate itemsets are generated from length k large itemsets.
 Candidate itemsets containing subsets of length k that are are not large are pruned.
 Support of each candidate itemset is counted by scanning the database.
 Eliminate candidate itemsets that are small.
Output
Itemsets that are “large” and qualify the min support and min confidence thresholds.
APRIORI-TID ALGORITHM:
Apriori TID has the same candidate generation function as apriori. The interesting feature is that
it does not use database for counting support after the first pass. An encoding of the candidate
itemsets used in the previous pass is used. In later passes the size of encoding can become much
smaller than the database,thus saving reading effort.
APRIORI-Hybrid Algorithm:
Apriori and AprioriTid use the same candidate generation procedure and therefore count the
same itemsets Apriori examines every transaction in the database. On the other hand, rather than
scanning the database, AprioriTid scans candidate itemsets used in the previous pass for
obtaining support counts. Apriori Hybrid uses Apriori in the initial passes and switches to
AprioriTid when it expects that the candidate itemsets at the end of the pass will be in memory.
Validation Methodology
The author has generated synthetic data sets involving transactions to evaluate the performance
of algorithms. The transactions mimic the transactions in the “real” world. Transaction may
contain more than one large itemsets. Transaction sizes are typically clustered around a mean
and a few transactions have many items. To model the phenomenon that large itemsets often
have common items, some fraction of items in subsequent itemsets are chosen from the previous
itemset generated. Each itemset in a transaction has a weight associated with it, which
corresponds to the probability that this itemset will be picked.
Assumptions
 The dataset is in the form of market basket dataset.
 The synthetic dataset is created through a detailed mediated process. The assumption
is that the performance of the algorithm in the synthetic dataset is indicative of its
performance on a real world dataset. The data might be too synthetic as to not give any
valuable information about real world datasets and solving those problems via
association rule mining.
 All the items in the data are in a lexicographical order.
 It is assumed that all the data is present in the same site or table and there are no cases
which there would be a requirement to make joins. It is possible that the overhead cost
of making joins could adversely effect the performance of the algorithms.
Revision
The following list gives a list of possible revisions to the paper.
 As stated above, at least some real world datasets should be used to perform the
experiments.
 Details regarding temporary storage should be added to the paper.
 Given that the size of databases has grown substantially over the last decade or so,
the experiments for scaling factors should be extended to include large datasets.
 Modification in the representation structure:
 Low support threshold can result in more large itemsets and increase the
number of candidate itemsets.
 Apriori makes multiple passes, run time of algorithm may increase with number
of transactions.
To solve these problems a more compact representation of the itemsets was suggested after the
apriori algorithm was formulated. We would consider more compact representation of the
itemsets if we have to rewrite the paper again.
Download