Apriori and Association Rules

Association Rules
Association rule mining is the process of finding patterns, associations, and
correlations among sets of items in a database. Each association rule generated has an
antecedent and a consequent. An association rule is a pattern of the form X & Y ⇒ Z
[support, confidence], where X, Y, and Z are items in the dataset. The left-hand side of
the rule, X & Y, is called the antecedent and the right-hand side, Z, is called the
consequent. The rule states that given X and Y there is some association with Z.
Support and confidence are the two measures used to determine the certainty or
usefulness of each rule. Support is the probability that a transaction in the dataset
contains both the antecedent and the consequent of the rule, or P(X ∪ Y ∪ Z). Confidence is
the probability that a transaction containing the antecedent also contains the consequent,
or P(Z | X ∪ Y). Typically an association rule is called strong if it satisfies both a
minimum support threshold and a minimum confidence threshold determined by
the user (Han).
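These two measures can be sketched directly in code. The following is a minimal illustration (the function names and the representation of each transaction as a Python set are choices made here for clarity, not notation from Han):

    def support(transactions, itemset):
        """Fraction of transactions that contain every item in `itemset`."""
        matches = sum(1 for t in transactions if itemset <= t)
        return matches / len(transactions)

    def confidence(transactions, antecedent, consequent):
        """P(consequent | antecedent): fraction of transactions containing the
        antecedent that also contain the consequent."""
        both = sum(1 for t in transactions if (antecedent | consequent) <= t)
        ante = sum(1 for t in transactions if antecedent <= t)
        return both / ante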
Assume a sports equipment store would like to determine whether there are any
association rules among the items bought together during a single purchase. The store
collected this data:
Transaction   Items
1             jersey, mouth guard, pants
2             jersey, pants
3             jacket, jersey
4             hat, mouth guard, stick
Table 1: Transactions of sports equipment
A rule you might find is jersey ⇒ pants. Support for this rule is the
probability that a transaction contains jersey and pants together. Here that occurs in two of
the four transactions (2/4), which is 50%. The confidence is the probability that a transaction
containing jersey also contains pants. There are three transactions containing jersey and
two containing both jersey and pants, so the confidence is 2/3, or 66%. Therefore this rule
written in complete form is jersey ⇒ pants [50%, 66%].
If instead pants were chosen as the antecedent we would have the rule
pants ⇒ jersey [50%, 100%]. In this case the confidence is 100% because every
transaction containing pants also contains jersey. By setting minimums for both
support and confidence we can limit the rules to only those of significance. These
association rules are easy to establish on such a small dataset, but finding them becomes
more difficult as the dataset grows.
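Using the support and confidence helpers sketched above, both rules can be checked directly against the transactions of Table 1 (item names are written out as strings purely for illustration):

    transactions = [
        {"jersey", "mouth guard", "pants"},
        {"jersey", "pants"},
        {"jacket", "jersey"},
        {"hat", "mouth guard", "stick"},
    ]

    print(support(transactions, {"jersey", "pants"}))       # 0.5   -> 50%
    print(confidence(transactions, {"jersey"}, {"pants"}))   # 0.66... -> 66%
    print(confidence(transactions, {"pants"}, {"jersey"}))   # 1.0   -> 100%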
Some basic terms and concepts need to be understood before discussing data
mining on large datasets. An itemset is defined as a set of items. An
itemset containing k items is a k-itemset; the set {A, B}, for example, is a 2-itemset. The
occurrence frequency of an itemset is simply the number of transactions that contain the
itemset. It is sometimes referred to as the frequency, support count, or count of the
itemset. The minimum support count is defined as the number of transactions required
for an itemset to satisfy minimum support; it is equal to the
product of the total number of transactions in the dataset and the user-defined minimum
support. Any itemset satisfying the minimum support is considered a frequent itemset,
and the set of frequent k-itemsets is commonly denoted Lk (Han).
Going back to the example from Table 1 we can obtain the following values.
The occurrence frequency of the itemset {jersey} is 3 because it occurs in 3
of the transactions in the database. The minimum support count equals the total number of
transactions times the minimum support, or in this case 4 x 50% = 2, because we
previously chose a minimum support threshold of 50%. The 1-itemset
{jersey} therefore satisfies the minimum support, so it is considered a
frequent itemset and is contained in L1.
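The same quantities are easy to compute in code. A small sketch using the Table 1 transactions (occurrence_frequency is an illustrative helper name):

    transactions = [
        {"jersey", "mouth guard", "pants"},
        {"jersey", "pants"},
        {"jacket", "jersey"},
        {"hat", "mouth guard", "stick"},
    ]

    def occurrence_frequency(transactions, itemset):
        """Support count: the number of transactions containing the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    min_support = 0.5
    min_support_count = min_support * len(transactions)      # 0.5 * 4 = 2

    count = occurrence_frequency(transactions, {"jersey"})    # 3
    print(count >= min_support_count)                         # True: {jersey} belongs to L1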
Association rule mining on large datasets is broken down into two steps. First,
find all the itemsets that occur at least as frequently as the pre-determined minimum
support count. This step will produce k lists of frequent itemsets from L1 to Lk. The
second step is to generate strong association rules from the frequent itemsets. An
association rule is considered strong if it satisfies both a minimum support and minimum
confidence. The overall performance of mining association rules is determined by the
first step, finding the frequent itemsets. An important but simple algorithm for
finding frequent itemsets is the Apriori algorithm (Han).
Apriori Algorithm
The Apriori algorithm is a basic algorithm for finding frequent itemsets from a set
of data by using candidate generation. Apriori uses an iterative approach known as a
level-wise search because the frequent k-itemsets are used to determine the frequent
(k + 1)-itemsets. The search begins by finding the set of frequent 1-itemsets, denoted L1.
L1 is then used to find the set of frequent 2-itemsets, L2, which is in turn used to find L3,
and so on. This continues until no more frequent k-itemsets can be found (Han).
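The level-wise search can be sketched compactly in Python. This is a simplified illustration rather than Han's pseudocode: here candidates for the next level are formed by simply unioning pairs of frequent itemsets, whereas the lexicographic join and the prune step that Apriori actually uses are described next.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return the sets of frequent itemsets L1, L2, ... for the given
        transactions (each transaction is a set of items)."""
        min_count = min_support * len(transactions)
        items = {item for t in transactions for item in t}
        # L1: frequent 1-itemsets.
        current = {frozenset([i]) for i in items
                   if sum(1 for t in transactions if i in t) >= min_count}
        levels = []
        while current:
            levels.append(current)
            k = len(next(iter(current))) + 1
            # Candidate k-itemsets: unions of pairs of frequent (k-1)-itemsets.
            candidates = {a | b for a, b in combinations(current, 2)
                          if len(a | b) == k}
            # Keep only candidates that meet the minimum support count.
            current = {c for c in candidates
                       if sum(1 for t in transactions if c <= t) >= min_count}
        return levels   # [L1, L2, ..., Lk]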
To improve the efficiency of the level-wise generation, the Apriori algorithm uses the
Apriori property. The Apriori property states that all nonempty subsets of a frequent
itemset are also frequent itemsets. So, if {A, B} is a frequent itemset then the subsets {A}
and {B} are also frequent itemsets. The level-wise search uses this property
when stepping from one level to the next. If an itemset I does not satisfy the minimum support
then I is not a frequent itemset. If item A is added to the itemset I, the new itemset
I ∪ {A} cannot occur more frequently than the original itemset I. Therefore, if an
itemset fails the frequency test, all supersets of that itemset will also
fail that same test. The Apriori algorithm uses this property to decrease the number of
itemsets in the candidate list, thereby reducing search time. As the Apriori algorithm
steps from finding Lk-1 to finding Lk it uses a two-step process consisting of the Join Step
and the Prune Step (Han).
The first step is the Join Step, which generates a set of candidate
k-itemsets, denoted Ck, from Lk-1. It does this by joining Lk-1 with itself. Apriori assumes
that items within a transaction or itemset are sorted in lexicographic order. The join of
Lk-1 with itself is only performed between itemsets that have their first (k-2) items in common.
Suppose itemsets I1 and I2 are members of Lk-1. They are joined with
each other if (I1[1] = I2[1] and I1[2] = I2[2] and ... and I1[k-2] = I2[k-2] and
I1[k-1] < I2[k-1]), where I1[1] is the first item in itemset I1 and I1[k-1] is the last item
in I1, and likewise for I2. The condition checks that the first k-2 items are equal and that the
last items differ, in order to eliminate duplicate candidate k-itemsets. The
new candidate k-itemset generated from joining I1 with I2 is I1[1] I1[2] ... I1[k-1] I2[k-1] (Han).
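A sketch of the join step in Python (the names are illustrative): each itemset is handled as a lexicographically sorted tuple, and candidates are produced only from pairs whose first k-2 items agree.

    def apriori_join(L_prev, k):
        """Join step: build candidate k-itemsets Ck from the frequent
        (k-1)-itemsets in L_prev."""
        candidates = set()
        sorted_sets = sorted(tuple(sorted(s)) for s in L_prev)
        for i, a in enumerate(sorted_sets):
            for b in sorted_sets[i + 1:]:
                # First k-2 items must be equal; the last items must differ,
                # with a[-1] < b[-1], to avoid duplicate candidates.
                if a[:k - 2] == b[:k - 2] and a[-1] < b[-1]:
                    candidates.add(frozenset(a + (b[-1],)))
        return candidates

    L2 = [frozenset("AC"), frozenset("BC"), frozenset("BE")]
    print(apriori_join(L2, 3))   # a single candidate: {B, C, E}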
The second step is the Prune Step, which converts Ck to Lk. The candidate list Ck
contains all of the frequent k-itemsets, but it also contains k-itemsets that do not satisfy
the minimum support count. A scan of the database determines the occurrence
frequency of every candidate k-itemset to check whether it satisfies the minimum support,
which becomes very costly as Ck grows large. To reduce the size of Ck the Apriori
property is used: any (k-1)-itemset that is not frequent
cannot be a subset of a frequent k-itemset. Therefore, if any (k-1)-subset of a candidate
k-itemset is not in Lk-1, the candidate can be removed from Ck because it too cannot be frequent.
This check can be done rather quickly by utilizing a hash tree containing all frequent itemsets.
The resulting set of frequent k-itemsets is denoted Lk (Han).
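A matching sketch of the prune step (again with illustrative names; a plain Python set stands in for the hash tree Han describes):

    from itertools import combinations

    def has_infrequent_subset(candidate, L_prev):
        """Apriori property: every (k-1)-subset of a frequent k-itemset
        must itself appear in L_prev."""
        k = len(candidate)
        return any(frozenset(sub) not in L_prev
                   for sub in combinations(candidate, k - 1))

    def prune(transactions, candidates, L_prev, min_support_count):
        """Prune step: drop candidates that have an infrequent (k-1)-subset,
        then scan the database and keep those meeting the minimum support count."""
        L_prev = set(L_prev)
        survivors = {c for c in candidates if not has_infrequent_subset(c, L_prev)}
        return {c for c in survivors
                if sum(1 for t in transactions if c <= t) >= min_support_count}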
Assume the same store owner wishes to find associations among the items
bought from the store. Letters will be substituted for the items from Table 1 for better
understanding and clarity. Given the data in Table 2, let us run step by step
through the Apriori algorithm with the minimums for both confidence and support set to 50%.
Transaction   Item List
1             A, C, D
2             B, C, E
3             B, E
4             A, B, C, E
Table 2: Transaction list in lexicographic order
For the first iteration of the algorithm each item becomes a member of the set of
candidate 1-itemsets C1 (Fig. 1). The algorithm then scans all of the transactions to count
the number of occurrences of each item. Using the predetermined minimum support of
50%, the minimum support count is 50% x 4 transactions, which equals 2.
Only candidate 1-itemsets satisfying minimum support are kept to form the set of
frequent 1-itemsets L1.
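As a quick check, C1 and L1 can be computed directly from Table 2 (a throwaway snippet; the data is encoded as sets of single-letter items):

    from collections import Counter

    transactions = [set("ACD"), set("BCE"), set("BE"), set("ABCE")]
    counts = Counter(item for t in transactions for item in t)
    print(counts)                                    # counts: A=2, B=3, C=3, D=1, E=3
    L1 = {item for item, c in counts.items() if c >= 2}
    print(sorted(L1))                                # ['A', 'B', 'C', 'E']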
On the second iteration, the algorithm attempts to generate the set of frequent
2-itemsets L2. It does this by joining L1 with itself to generate C2, a candidate set of
2-itemsets. Again, the transactions in the dataset are scanned to determine the support
count for each candidate 2-itemset in C2. All candidate 2-itemsets in C2 that satisfy
minimum support are included in the set of frequent 2-itemsets L2 (Han).
The next iteration begins by generating the set of candidate 3-itemsets, C3, by
joining L2 with itself, combining all itemsets that share the same first item.
This occurs with the itemsets {B, C} and {B, E} because B = B and C < E. The resulting
candidate is therefore {B, C, E}. Its occurrence frequency is counted,
and if it satisfies the minimum support count it is included in the set of frequent
3-itemsets, L3 (Han).
C1 (initial candidate 1-itemsets):
  {A} 50%   {B} 75%   {C} 75%   {D} 25%   {E} 75%

L1 (prune to eliminate anything that doesn't have minimum support):
  {A} 50%   {B} 75%   {C} 75%   {E} 75%

C2 (join L1 with itself):
  {A, B} 25%   {A, C} 50%   {A, E} 25%   {B, C} 50%   {B, E} 75%   {C, E} 50%

L2 (prune again):
  {A, C} 50%   {B, C} 50%   {B, E} 75%   {C, E} 50%

C3 (join L2 with itself):
  {B, C, E} 50%

L3 (prune):
  {B, C, E} 50%

Figure 1: The Apriori algorithm applied to the transactions of Table 2
The algorithm attempts another iteration, but because it cannot generate any
candidate 4-itemsets, C4 is an empty set. At this point the algorithm terminates,
having found all of the frequent itemsets (Han).
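Running the simplified apriori sketch from earlier on the Table 2 transactions with a 50% minimum support reproduces this progression:

    transactions = [set("ACD"), set("BCE"), set("BE"), set("ABCE")]
    for k, Lk in enumerate(apriori(transactions, min_support=0.5), start=1):
        print(f"L{k}:", sorted("".join(sorted(s)) for s in Lk))
    # L1: ['A', 'B', 'C', 'E']
    # L2: ['AC', 'BC', 'BE', 'CE']
    # L3: ['BCE']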
The generation of strong association rules is rather straightforward now that
all of the frequent itemsets have been found for the database. To be considered strong,
an association rule must satisfy both minimum support and minimum confidence. The
minimum support is automatically satisfied because all of the association rules are
generated from the frequent itemsets. The confidence must be computed for each rule to
ensure it satisfies the minimum confidence. The confidence of the rule A ⇒ B is the
conditional probability P(B | A), which is equal to the support count of (A ∪ B) divided
by the support count of (A) (Han).
Based on this equation we can generate association rules. For every frequent
itemset l, generate all nonempty proper subsets of l. For each such subset s of l,
output the rule s ⇒ (l - s) only if the support count of l divided by the support count of s
satisfies the minimum confidence previously determined for the association rules. This is done for
every subset s of l and for every frequent itemset l (Han).
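This procedure can be sketched as follows (illustrative names; for brevity the support counts are recomputed directly from the transactions rather than cached):

    from itertools import combinations

    def generate_rules(transactions, frequent_itemsets, min_confidence):
        """For every frequent itemset l and nonempty proper subset s of l,
        emit the rule s => (l - s) when count(l) / count(s) meets min_confidence."""
        def count(itemset):
            return sum(1 for t in transactions if itemset <= t)
        rules = []
        for l in frequent_itemsets:
            for r in range(1, len(l)):
                for s in map(frozenset, combinations(l, r)):
                    conf = count(l) / count(s)
                    if conf >= min_confidence:
                        rules.append((s, l - s, conf))
        return rules

In a real implementation the support counts gathered while finding the frequent itemsets would be reused instead of rescanning the database for each rule.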
Using the frequent itemsets found in Figure 1, we can now generate the association rules
for this example, compute the confidence of each rule, and eliminate any
rule that does not meet the minimum confidence, previously set at 50%.
AC
CA
BC
CB
BE
EB
B&CE
B&EC
C&EB
confidence = 2/2 = 100%
confidence = 2/3 = 66%
confidence = 2/3 = 66%
confidence = 2/3 = 66%
confidence = 3/3 = 100%
confidence = 3/3 = 100%
confidence = 2/2 = 100%
confidence = 2/3 = 60%
confidence = 2/2 = 100%
All of these association rules are considered strong because they meet the
minimum confidence and minimum support, both of which were set at 50%.
Bibliography
Han, Jiawei, and Micheline Kamber. 2001. Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers.