Survey on Data Mining -- Association Rules
Chen-jung Stella Li
csl15@cornell.edu
Introduction
Data mining has become a research area with increasing importance due to its
capability of helping end users extract useful information from large databases.
With the rapid growth in adopting high-technology tools, businesses can now generate
and collect massive amounts of data, which they could not have done before. From
this myriad collection of data, businesses would like to "discover" certain useful and
interesting information, which could assist them in marketing strategies, decision
making, etc. This can all be accomplished with various data mining techniques, such
as association rules, characterization, classification, clustering, and so forth. This
survey paper will present the method of mining association rules and explore various
algorithms that have been studied in this field.
Mining association rules is basically the task of finding important associations among
items in a given database of sales transactions such that the presence of some items
will imply the presence of other items in the same transaction. A formal model is
introduced in [AIS93]. Let I = {i1, i2, … , im} be a set of binary attributes, called
items. Let D be a set of transactions, where each transaction T is a set of items such that
T ⊆ I. Let X be a set of items. A transaction T is said to contain X if and only if
X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I,
and X ∩ Y = ∅. (X is called the antecedent of the rule, while Y is called the
consequent of the rule.) Furthermore, the rule X ⇒ Y is said to hold in the
transaction set D with confidence c if c% of the transactions in D that contain
X also contain Y. The rule X ⇒ Y is said to have support s in the transaction set
D if s% of the transactions in D contain X ∪ Y. The confidence factor
indicates the strength of the implication rule, whereas the support factor indicates the
frequency of the pattern occurring in the rule. The goal of mining association
rules is then to find "strong" rules, which are rules with high confidence and strong
support, in a large database.
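To make these definitions concrete, the following minimal sketch shows how support and confidence can be computed for a given rule (an illustration in Python; the function names, the tiny example database, and the set-based representation of transactions are assumptions of this sketch, not part of [AIS93]):

def support(itemset, transactions):
    # fraction of the transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # confidence of X => Y is support(X ∪ Y) / support(X)
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# tiny illustrative database of three transactions
transactions = [frozenset({"bread", "milk"}),
                frozenset({"bread", "butter"}),
                frozenset({"bread", "milk", "butter"})]
X, Y = frozenset({"bread"}), frozenset({"milk"})
print(support(X | Y, transactions))    # 2/3: {bread, milk} occurs in 2 of 3 transactions
print(confidence(X, Y, transactions))  # 2/3: 2 of the 3 transactions containing bread also contain milk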
The problem of mining association rules can further be decomposed into two
sub-problems using the above formulation: 1) discovering the large itemsets, and 2)
using the large itemsets to generate the association rules for the database. The first
sub-problem, discovering the large itemsets, is to generate all combinations of
items whose fractional transaction support is above a certain threshold. All other
combinations that fall below the support threshold are called small itemsets. The
second sub-problem, generating association rules, uses the large itemsets found in the
first sub-problem and generates the desired rules. A straightforward way to generate
rules is the following: for every large itemset X, find all non-empty subsets of X,
and for every such subset Y, output a rule of the form Y ⇒ (X - Y) if the ratio of
the support of X to the support of Y exceeds c, the minimum confidence factor. Since
the second sub-problem can be solved in this straightforward way, the major emphasis
has been placed on finding efficient algorithms for discovering the large itemsets. The
algorithms that will be discussed in the following sections are the AIS algorithm
[AIS93], Apriori and AprioriTid [AS94], DHP [PCY95], Partition Algorithm
[SON95], and SETM [HS95].
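Before turning to these algorithms, the straightforward rule-generation step for the second sub-problem can be sketched as follows (a minimal Python sketch; it assumes that the large itemsets and a support table covering all of their subsets are already available, which holds because every subset of a large itemset is also large):

from itertools import combinations

def generate_rules(large_itemsets, support, min_conf):
    # large_itemsets: iterable of frozensets; support: dict mapping itemset -> support value
    rules = []
    for X in large_itemsets:
        for r in range(1, len(X)):                      # every non-empty proper subset Y of X
            for Y in map(frozenset, combinations(X, r)):
                # confidence of Y => (X - Y) is support(X) / support(Y)
                if support[X] / support[Y] >= min_conf:
                    rules.append((Y, X - Y))
    return rules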
AIS Algorithm
Before introducing the AIS algorithm, two terms need to be defined.
The "frontier set" of a pass is the set of itemsets that are extended during
the pass. During each pass, a measurement of the support for certain itemsets is
taken. These itemsets are called "candidate itemsets." They are derived from the
tuples in the database and the itemsets contained in the frontier set. The AIS
algorithm makes multiple passes over the database. Initially, the frontier set
contains only the empty itemset. During each pass over the database, the candidate sets are
generated by extending the itemsets in the frontier set with items from the tuples in the database. At the end of
each pass, if the support for a candidate set is equal to or above the minimum support
required, this candidate set is kept and considered a large itemset, which will then
be used in the following passes. Meanwhile, it is determined whether this itemset
should be added to the frontier set for the next pass. That is, those candidate
itemsets that were expected to be small but turned out to be large in the current pass
would be included in the frontier set for the next pass. The entire algorithm
terminates when the frontier set becomes empty. At the end, the remaining
candidate itemsets are the large itemsets that we are supposed to discover.
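A highly simplified sketch of a single AIS pass is given below (an illustration in Python; the real algorithm in [AIS93] also estimates the expected support of extensions to decide how far each frontier itemset should be extended, which is omitted here, and only one-item extensions are shown):

def ais_pass(transactions, frontier, min_support):
    # frontier: set of frozensets (initially containing only the empty itemset)
    counts = {}
    for t in transactions:
        for f in frontier:
            if f <= t:                          # frontier itemset is contained in the tuple
                for item in t - f:              # extend it with items from the tuple
                    cand = frozenset(f | {item})
                    counts[cand] = counts.get(cand, 0) + 1
    n = len(transactions)
    large = {c for c, cnt in counts.items() if cnt / n >= min_support}
    return large, counts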
Apriori, AprioriTid, and AprioriHybrid Algorithms
The Apriori algorithm first counts occurrences of items to determine the large
1-itemsets (itemsets of cardinality 1). Then there are two phases in the
subsequent passes.
First of all, the large itemsets Lk-1 found in the (k-1)th pass are
used to generate the candidate itemsets Ck, using the "apriori-gen" function. Next,
the support of candidates in Ck is counted by scanning the database. The final
answer is obtained by taking the union of all Lk itemsets. The "apriori-gen" function
takes Lk-1 as argument and returns a superset of the set of all large k-itemsets.
First, the join step joins Lk-1 with itself. The next step is the prune
step, which deletes all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1.
The intuition behind the "apriori-gen" function is that every subset of a large itemset
must be large; thus we can combine almost-matching pairs of large (k-1)-itemsets, i.e.
Lk-1, and then prune out those with non-large (k-1)-subsets.
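A minimal sketch of the "apriori-gen" function is shown below (illustrative Python; representing each itemset as a sorted tuple of items is an implementation choice of this sketch, not something mandated by [AS94]):

from itertools import combinations

def apriori_gen(L_prev):
    # L_prev: set of large (k-1)-itemsets, each a sorted tuple of items
    # Join step: combine pairs of (k-1)-itemsets that agree on their first k-2 items.
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # Prune step: drop candidates that have a (k-1)-subset not in L_prev.
    return {c for c in candidates
            if all(sub in L_prev for sub in combinations(c, len(c) - 1))}

# e.g. apriori_gen({("a","b"), ("a","c"), ("b","c")}) yields {("a","b","c")}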
The AprioriTid algorithm is a variation of the Apriori algorithm. The
AprioriTid algorithm also uses the "apriori-gen" function to determine the candidate
itemsets before the pass begins. The main difference from the Apriori algorithm is
that the AprioriTid algorithm does not use the database for counting support after the
first pass. Instead, the set <TID, {Xk}> is used for counting. (Each Xk is a
potentially large k-itemset in the transaction with identifier TID.) The benefit of
using this scheme for counting support is that at each pass other than the first pass, the
scanning of the entire database is avoided. But the downside of this is that the set
<TID, {Xk}> that would have been generated at each pass may be huge. Another
algorithm, called AprioriHybrid, is introduced in [AS94]. The basic idea of the
AprioriHybrid algorithm is to run the Apriori algorithm initially, and then switch to
the AprioriTid algorithm when the generated database (i.e. <TID, {Xk}>) would fit in
the memory.
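The AprioriTid counting scheme on the transformed database can be sketched roughly as follows (illustrative Python; representing the <TID, {Xk}> set as a list of (tid, set of itemsets) pairs, with itemsets as sorted tuples, is an assumption of this sketch):

def count_on_transformed_db(transformed_prev, candidates_k):
    # transformed_prev: list of (tid, set of (k-1)-itemsets present in that transaction)
    # candidates_k: candidate k-itemsets produced by apriori-gen, each a sorted tuple
    counts = {c: 0 for c in candidates_k}
    transformed_k = []
    for tid, prev_sets in transformed_prev:
        present = set()
        for c in candidates_k:
            # c is contained in the transaction iff the two (k-1)-subsets obtained by
            # dropping its last item and its second-to-last item are both present
            s1, s2 = c[:-1], c[:-2] + c[-1:]
            if s1 in prev_sets and s2 in prev_sets:
                counts[c] += 1
                present.add(c)
        if present:
            transformed_k.append((tid, present))
    return counts, transformed_k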
DHP Algorithm
The DHP (Direct Hashing and Pruning) algorithm is an effective hash-based
algorithm for the candidate set generation. The DHP algorithm consists of three
steps. The first step gets the set of large 1-itemsets and constructs a hash table for
2-itemsets. (The hash table keeps, for each bucket, a count that serves as an upper bound
on the support of the 2-itemsets hashed into that bucket.) The second step generates the set of candidate itemsets Ck, but it only adds
the k-itemset into Ck if that k-itemset is hashed into a hash entry whose value is
greater than or equal to the minimum transaction support. The third part is
essentially the same as the second part except it does not use the hash table in
determining whether to include a particular itemset into the candidate itemsets. The
second part is designed for use in the early iterations, whereas the third part should be
used in the later iterations, when the number of hash buckets with a support count greater
than or equal to s (the minimum transaction support required) is less than a
pre-defined threshold. Note that the DHP algorithm has two major features: one is
its efficiency in generating large itemsets; the other is its effectiveness in reducing the
transaction database size. The generation of smaller candidate sets is the key to
effectively trimming the transaction database size at the earlier iterations, so
that the computational cost of the later iterations is significantly reduced. Therefore,
the DHP algorithm is particularly powerful for determining large itemsets in the early stages,
and it relieves the performance bottleneck considerably.
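A rough sketch of the hash-table idea for 2-itemsets is given below (illustrative Python; the hash function, the number of buckets, and the function names are assumptions of this sketch rather than details taken from [PCY95]):

from itertools import combinations

def dhp_first_pass(transactions, min_count, num_buckets=1009):
    item_counts = {}
    buckets = [0] * num_buckets
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        # hash every 2-itemset of the transaction into a bucket and bump its count
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    L1 = {i for i, c in item_counts.items() if c >= min_count}
    return L1, buckets

def dhp_candidate_2_itemsets(L1, buckets, min_count, num_buckets=1009):
    # keep a pair only if its bucket count could possibly reach the minimum support
    return {pair for pair in combinations(sorted(L1), 2)
            if buckets[hash(pair) % num_buckets] >= min_count}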
Partition Algorithm
The Partition algorithm logically partitions the database D into n partitions, and
only reads the entire database at most two times to generate the association rules.
The reason for using the partition scheme is that any potential large itemset would
appear as a large itemset in at least one of the partitions. The algorithm consists of
two phases. In the first phase, the algorithm iterates n times, and during each
iteration, only one partition is considered. At any given iteration, the function
"gen_large_itemsets" takes a single partition and generates local large itemsets of all
lengths from this partition. The local large itemsets of the same lengths from all
n partitions are then merged to generate the global candidate itemsets.
In the second phase, the algorithm counts the support of each global candidate
itemset and generates the global large itemsets. Note that the database is read at most twice
during the process: once in the first phase and once in the second phase, in which the
support counting requires a scan of the entire database. Making this minimal number
of passes over the entire database drastically reduces the time spent on I/O.
SETM Algorithm
The SETM algorithm is a set-oriented algorithm for mining association rules.
The algorithms introduced previously are mostly tuple-oriented. The SETM algorithm differs
from them a great deal; in particular, it relies on simple
database primitives such as sorting and merge-scan joins. In the first iteration, the
SETM algorithm sorts R1 (the relation used in the 1st iteration) on item, and C1
is the count relation generated for R1. In the following iterations, there are two sort
operations and one merge-scan join performed. After the first sort operation, a
temporary relation, R'k, in the kth iteration is generated by performing merge-scan
join on Rk-1 and R1. Then, the second sort operation is to sort R'k on items, which
is used for efficient generation of the support counts. The generation of counts
involves a sequential scan over R'k. At the end of each iteration, Rk is generated by
deleting from R'k those tuples that do not meet the minimum support. Note that this is done via
table look-ups on the count relation Ck.
Another major contribution of the paper [HS95] is that it proposes ways to
express the generation of association rules using SQL queries. Thus, besides
introducing the SETM algorithm, the paper [HS95] also presents two formulations of
their set-oriented data mining strategy that naturally lead to nested-loop joins
and sort-merge joins, respectively. Suppose that there is a relation SALES(trans_id, item).
The first step in the formulation leading to nested-loop joins is to generate the
count for each item and to store the result in the relation C1(item, count).
INSERT INTO C1
SELECT r1.item, COUNT(*)
FROM SALES r1
GROUP BY r1.item
HAVING COUNT(*) >= :min_support;
Then the next step is to generate all patterns (x, y) and check whether they meet the
minimum support criterion. For ease of exposition, let us take a specific item 'A'. All
the patterns (A, y) can be generated using a self-join of SALES.
SELECT r1.item, r2.item, COUNT(*)
FROM SALES r1, SALES r2
WHERE r1.trans_id = r2.trans_id AND r1.item = 'A' AND r2.item <> 'A'
GROUP BY r1.item, r2.item
HAVING COUNT(*) >= :min_support;
Another formulation leads to sort-merge joins. The first step is to generate ordered
patterns of length k:
INSERT INTO R'k
SELECT p.trans_id, p.item1, …, p.item[k-1], q.item
FROM Rk-1 p, SALES q
WHERE q.trans_id = p.trans_id AND q.item > p.item[k-1];
The next step is to generate the count relation Ck for those patterns in R'k that meet the
minimum support criterion.
INSERT INTO Ck
SELECT p.item1, …, p.item[k], COUNT(*)
FROM R'k p
GROUP BY p.item1, …, p.item[k]
HAVING COUNT(*) >= :min_support;
Then, those tuples of R'k that meet the minimum support are selected into Rk, so that a
further extension of the relation can be done in the next iteration.
INSERT INTO Rk
SELECT p.trans_id, p.item1, …, p.item[k]
FROM R'k p, Ck q
WHERE p.item1 = q.item1 AND … AND p.item[k] = q.item[k]
ORDER BY p.trans_id, p.item1, …, p.item[k];
This process is repeated until Rk = ∅ (i.e., Rk becomes empty).
Therefore, the paper [HS95] makes the case that "at least some aspects of data
mining can be carried out by using general query languages such as SQL, rather than
by developing specialized black box algorithms."
Conclusion
The goal of mining association rules is to discover important associations among
items in a database of transactions such that the presence of some items will imply the
presence of other items. The problem of mining association rules has been
decomposed into two sub-problems: discovering the large itemsets, and then generating
rules based on these large itemsets. Attention has been placed on the first
sub-problem since the second sub-problem is quite straightforward. Thus, there
have been several algorithms proposed to solve the first sub-problem. This
research on algorithms for mining association rules is basically motivated by the
fact that the amount of data processed in mining association rules is huge; thus it
is crucial to devise efficient algorithms for mining such data. The
classical algorithm is introduced in [AIS93] and is often referred to as the AIS algorithm. It
first lays out the framework of mining association rules and introduces the notion of
candidate itemsets, which is used throughout the later work. However, the
Apriori and AprioriTid algorithms [AS94] are the most well known ones in the field.
Both of them use the "apriori-gen" function in generating the candidate itemsets.
The DHP algorithm [PCY95] is very similar to the Apriori algorithm except that it
employs a hash-table scheme to make the candidate generation more efficient and effective.
The Partition algorithm [SON95] reduces the I/O overhead, in comparison with the
Apriori algorithm, by first dividing the database into partitions and generating large
itemsets from those partitions. The last algorithm being looked at is the SETM
algorithm [HS95]. The SETM algorithm is inherently different from the others
because it is a set-oriented algorithm. It was developed from the realization that there are
certain formulations in SQL statements that could lead to nested-loop joins and
sort-merge joins. Since the efficiency of mining association rules is application
dependent, there is, up to this point, no clear performance model to determine
which algorithm ultimately performs better than the others.
References
[AIS93] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules
between Sets of Items in Large Databases," Proc. ACM SIGMOD, May
1993.
[AS94] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules
in Large Databases," Proc. 20th Int'l Conf. of Very Large Data Bases, Sept.
1994.
[PCY95] J.-S Park, M.-S. Chen, and P.S. Yu, "An Effective Hash Based Algorithm
for Mining Association Rules," Proc. ACM SIGMOD, May 1995.
[SON95] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for
Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. of Very
Large Data Bases, 1995.
[HS95] M. Houtsma and A. Swami, "Set-Oriented Mining for Association Rules in
Relational Databases," Proc. 11th IEEE Int'l Conf. on Data Engineering, 1995.