Association Rules (Market Basket Analysis)

Market basket: collection of items purchased by a customer in a single transaction (e.g.
supermarket, web)
Association rules:
 Unsupervised learning
 Used for pattern discovery
 Each rule has form: A -> B, or Left -> Right
For example: “70% of customers who purchase 2% milk will also purchase whole wheat bread.”
Data mining using association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e. most frequent combinations of items)
Most frequently used algorithm: Apriori algorithm.
2. Generate association rules for the above itemsets.
How to measure the strength of an association rule?
1. Using support/confidence
2. Using dependence framework
Support/confidence
Support shows the frequency of the patterns in the rule; it is the percentage of transactions that
contain both A and B, i.e.
Support = Probability(A and B)
Support = (# of transactions involving A and B) / (total number of transactions).
Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B
if they contain A, i.e.
Confidence = Probability(B if A) = P(B | A)
Confidence =
(# of transactions involving A and B) / (total number of transactions that have A).
Example:

Customer   Item purchased   Item purchased
1          pizza            beer
2          salad            soda
3          pizza            soda
4          salad            tea
If A is “purchased pizza” and B is “purchased soda” then
Support = P(A and B) = ¼
Confidence = P(B | A) = ½
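To make the two measures concrete, here is a minimal Python sketch (the helper names support and confidence are mine, not from the notes) that reproduces the numbers for the pizza/soda example above:

transactions = [
    {"pizza", "beer"},   # customer 1
    {"salad", "soda"},   # customer 2
    {"pizza", "soda"},   # customer 3
    {"salad", "tea"},    # customer 4
]

def support(itemset, transactions):
    # fraction of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(B | A) = support(A and B) / support(A)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"pizza", "soda"}, transactions))        # 0.25  (= 1/4)
print(confidence({"pizza"}, {"soda"}, transactions))   # 0.5   (= 1/2)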
Confidence does not measure if the association between A and B is random or not.
For example, if milk occurs in 30% of all baskets, information that milk occurs in 30% of all
baskets with bread is useless. But if milk is present in 50% of all baskets that contain coffee, that is
significant information.
Support allows us to weed out most infrequent combinations – but sometimes we should not ignore
them, for example, if the transaction is valuable and generates a large revenue, or if the products
repel each other.
Ex. We measure the following:
P(Coke in a basket) = 50%
P(Pepsi in a basket) = 50%
P(Coke and Pepsi in a basket) = 0.001%
What does this mean? If Coke and Pepsi were independent, we would expect that
P(Coke and Pepsi in a basket) = 0.5 * 0.5 = 0.25, i.e. 25%.
The fact that the joint probability is much smaller says that the products are dependent and that they
repel each other.
In order to exploit this information, work with the dependence framework.
Dependence framework
Ex. To continue the previous example:
Actual(Coke and Pepsi in a basket) = 0.001%
Expected(Coke and Pepsi in a basket) = 50% * 50% = 25%
If items are statistically dependent, the presence of one of the items in the basket gives us a lot of
information about the other items. How to determine the threshold of statistical dependence? Use:
 Chi-square
 Impact
 Lift
Chi_square = (ExpectedCooccurrence − ActualCooccurrence) / ExpectedCooccurrence
Pick a small alpha (e.g. 5% or 10%). The number of degrees of freedom equals the number of items
minus 1.
Ex.
Chi_square(Pepsi and Coke) = (25 − 0.001)/25 = 0.999
Degrees of freedom = 2 − 1 = 1
Alpha = 5%
From the tables, the critical value is chi-square = 3.84. The actual co-occurrence (0.001%) falls far
short of the 25% expected under independence, so Pepsi and Coke are dependent: they repel each
other.
Impact = ActualCooccurrence/ExpectedCooccurrence
Impact = 1 if the products are independent, ≠ 1 if the products are dependent.
Ex. Impact(Pepsi on Coke) = 0.001/25 = 0.00004, far below 1, i.e. strong repulsion.
Lift(A on B) =
(ActualCooccurrence − ExpectedCooccurrence) / (Frequency of occurrence of A)
−1 ≤ Lift ≤ 1
Lift is similar to correlation: it is 0 if A and B are independent, and approaches +1 or −1 if they are
dependent. +1 indicates attraction, and −1 indicates repulsion.
Ex. Lift(Coke on Pepsi) = (0.001 − 25)/50 ≈ −0.5, i.e. repulsion.
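The three measures, as defined in these notes (the chi-square here is the notes' simplified version, not the textbook statistic), can be written down directly. A small Python sketch applied to the Coke/Pepsi numbers; the function names are mine:

def expected_cooccurrence(p_a, p_b):
    # expected joint probability if A and B were independent
    return p_a * p_b

def chi_square(expected, actual):
    # the notes' simplified statistic: (Expected - Actual) / Expected
    return (expected - actual) / expected

def impact(expected, actual):
    # Actual / Expected: 1 if independent, far from 1 if dependent
    return actual / expected

def lift(expected, actual, p_a):
    # the notes' definition: (Actual - Expected) / P(A), in [-1, 1]
    return (actual - expected) / p_a

p_coke = p_pepsi = 0.50
actual = 0.00001                                      # 0.001% of baskets
expected = expected_cooccurrence(p_coke, p_pepsi)     # 0.25

print(chi_square(expected, actual))    # ~0.999  (far from 0)
print(impact(expected, actual))        # 0.00004 (far below 1: repulsion)
print(lift(expected, actual, p_coke))  # ~-0.5   (negative: repulsion)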
Why do two items repel or attract – are they substitutes? Or are they complementary and a third
product is needed? Or do they address different market segments?
Product Triangulation strategy examines cross-purchase skews to answer the above questions. If
the most significant skew occurs when triangulating with respect to promotion or pricing, the
products are substitutes.
Ex. Orange juice and soda repel each other (so are they substitutes?). They each exhibit a different
profile when compared with whole bread and potato chips, so they are not substitutes, they have
two different market segments.
Pepsi and Coke repel and show no cross-purchase patterns.
1. FIND FREQUENT ITEMSETS: Apriori Algorithm
Finds all frequent itemsets in a database.
Definitions:
 Itemset: a set of items
 k-itemset: an itemset which consists of k items
 Frequent itemset (i.e. large itemset): an itemset with sufficient support
 Lk or Fk: a set of large (frequent) k-itemsets
 ck: a set of candidate k-itemsets
 Apriori property: if an itemset X is joined with an itemset Y,
Support(X U Y) ≤ min(Support(X), Support(Y))
Negative border: an itemset is in the negative border if it is infrequent but all of its subsets are
frequent.
Interesting rules: strong rules for which antecedent and consequent are dependent
Apriori algorithm:
//Perform iterations, bottom up:
// iteration 1: find L1, all single items with Support > threshold
// iteration 2: using L1, find L2
// iteration i: using Li-1, find Li
// … until no more frequent k-itemsets can be found.
//Each iteration i consists of two phases:
1. candidate generation phase:
//Construct a candidate set of large itemsets, i.e. find all the itemsets that could qualify
for further consideration by examining only candidates in the set Li-1*Li-1
2. candidate counting and selection
//Count the number of occurrences of each candidate itemset
//Determine large itemsets based on the predetermined support, i.e. select only
candidates with sufficient support
The set Lk is defined as the set containing the frequent k-itemsets, i.e. those which satisfy
Support > threshold.
Lk*Lk is defined as:
Lk*Lk = {X U Y, where X, Y belong to Lk and |X ∩ Y| = k−1}.
Apriori algorithm, in more detail:
//find all frequent itemsets
Apriori(database D of transactions, min_support) {
    F1 = {frequent 1-itemsets}
    k = 2
    while Fk-1 ≠ EmptySet {
        Ck = AprioriGeneration(Fk-1)
        for each transaction t in the database D {
            Ct = Subset(Ck, t)
            for each candidate c in Ct {
                count_c++
            }
        }
        Fk = {c in Ck such that count_c ≥ min_support}
        k++
    }
    F = Union over k ≥ 1 of Fk
}
//generate and prune the candidate itemsets
AprioriGeneration(Fk-1) {
    //Insert into Ck all combinations of elements in Fk-1 obtained by self-joining
    //itemsets in Fk-1. Self-joining means that all but the last item of the two
    //itemsets overlap, i.e. join itemsets p, q from Fk-1 to make the candidate
    //k-itemset p1 p2 … pk-2 pk-1 qk-1, such that pi = qi for i = 1, 2, …, k-2
    //and pk-1 < qk-1.
    //Delete all itemsets c in Ck such that some (k-1)-subset of c is not in Fk-1
}
//find all subsets of candidates contained in t
Subset(Ck, t) {
…
}
Example: http://magna.cs.ucla.edu/~hxwang/axl_manual/node20.html
Find association rules, given min_support = 2 and the database of transactions:
1 2 3 4
2 3 4
3 4 5
1 2 5
2 4 5
Apply Apriori algorithm:
F1 = { 1, 2, 3, 4, 5} because all singles show up with frequency >=2.
k = 2
C2 = AprioriGeneration(F1):
Insert into C2 pairs 1 2, 1 3, 1 4, 1 5, 2 3, 2 4, 2 5, 3 4, 3 5, 4 5
F2 = {1 2, 2 3, 2 4, 2 5, 3 4, 4 5} because 1 3, 1 4, 1 5, 3 5 are not frequent
k = 3:
C3 = AprioriGeneration(F2):
Insert into C3: 2 3 4, 2 3 5, 2 4 5
Delete from C3: 2 3 5 (because 3 5 is not in F2)
F3 = {2 3 4} (because 2 4 5 shows up only once)
k = 4:
C4 = AprioriGeneration(F3)
Insert into C4: none
Since we cannot generate any more candidate sets by self-joining, the algorithm stops here. The
frequent itemsets are F1, F2, and F3. The negative border contains all the pairs deleted from C2,
plus 2 4 5.
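A compact Python sketch of the Apriori procedure described above (the names apriori and apriori_gen are mine, not a library API); run on the five toy transactions, it reproduces F1, F2 and F3:

from itertools import combinations

def apriori_gen(prev_frequent, k):
    # Self-join the frequent (k-1)-itemsets, then prune every candidate
    # that has an infrequent (k-1)-subset (the Apriori property).
    candidates = set()
    for p in prev_frequent:
        for q in prev_frequent:
            union = tuple(sorted(set(p) | set(q)))
            if len(union) == k:
                candidates.add(union)
    return {c for c in candidates
            if all(sub in prev_frequent for sub in combinations(c, k - 1))}

def apriori(transactions, min_support):
    # Returns {k: set of frequent k-itemsets, represented as sorted tuples}.
    items = {i for t in transactions for i in t}
    current = {(i,) for i in items
               if sum(1 for t in transactions if i in t) >= min_support}
    frequent, k = {1: set(current)}, 2
    while current:
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in apriori_gen(current, k)}
        current = {c for c, n in counts.items() if n >= min_support}
        if current:
            frequent[k] = current
        k += 1
    return frequent

db = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}, {1, 2, 5}, {2, 4, 5}]
print(apriori(db, min_support=2))
# -> F1: all five singletons, F2: {(1,2),(2,3),(2,4),(2,5),(3,4),(4,5)}, F3: {(2,3,4)}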
2. GENERATE ASSOCIATION RULES FROM FREQUENT ITEMSETS
For all pairs of frequent itemsets (assume we call them A and B) such that A U B is also frequent,
calculate c, the confidence of the rule:
c = support(A U B) / support(A).
If c >= min_confidence, the rule is strong and should be included.
Example: continuing the previous example: we can generate the rules involving any combination
of:
1, 2, 3, 4, 5, 1 2, 2 3, 2 4, 2 5, 3 4, 4 5, 2 3 4.
For example, rule 1 2 -> 2 5 is not a strong rule because 1 2 5 is not a frequent itemset.
Rule 2 3 -> 4 could be a strong rule, because 2 3 4 and 2 3 are frequent itemsets, and c = 2/2 = 1.
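A minimal sketch of this rule-generation step in Python (helper names are mine); the transactions and frequent itemsets are the ones found in the worked example above, and min_confidence is a parameter introduced here only for illustration:

from itertools import combinations

db = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}, {1, 2, 5}, {2, 4, 5}]
frequent = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5), (2, 3, 4)]

def support_count(itemset):
    return sum(1 for t in db if set(itemset) <= t)

def strong_rules(min_confidence):
    # Split every frequent itemset into antecedent A and consequent B and
    # keep the rule A -> B if support(A U B) / support(A) >= min_confidence.
    rules = []
    for itemset in frequent:
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                c = support_count(itemset) / support_count(antecedent)
                if c >= min_confidence:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, c))
    return rules

for a, b, c in strong_rules(min_confidence=0.75):
    print(a, "->", b, "confidence =", round(c, 2))
# includes (2, 3) -> (4,) with confidence 2/2 = 1.0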
Applications
A: sales of item A
B: sales of item B
Rule A -> … tells you what products will be affected if A is affected
Rule … -> B tells you what needs to be done so that B is affected
Rule A … -> B tells you what to combine with A to affect B.
The next step: sequential pattern discovery (i.e. “association rules in time”). For example:
college_degree-> professional_job -> high_salary.
Example: http://www.icaen.uiowa.edu/~comp/Public/Apriori.pdf
Assume min_support = 40% = 2/5, min_confidence = 70%. Five transactions are recorded in a
supermarket:
#   Transaction                                    Code
1   Beer, diaper, baby powder, bread, umbrella     BDPRU
2   Diaper, baby powder                            DP
3   Beer, diaper, milk                             BDM
4   Diaper, beer, detergent                        DBG
5   Beer, milk, cola                               BMC
1. Find frequent itemsets
F1 = {B, D, P, M}
k = 2: C2 = {BD, BP, BM, DP, DM, PM} -> eliminate infrequent BP, DM, PM
F2 = {BD, BM, DP}
k = 3: the self-join of F2 gives C3 = {BDM} (the cited web page erroneously also includes BDP),
but BDM is pruned because its subset DM is not in F2, and its support (1/5) is below min_support
anyway, so F3 is empty.
2. Generate strong rules out of the frequent itemsets B, D, P, M, BD, BM, DP

Rule      Support(X U Y)   Support(X)   Confidence   Strong rule?
B -> D    3/5              4/5          3/4          yes
B -> M    2/5              4/5          2/4          no
B -> P    1/5              4/5          1/4          no
D -> B    3/5              4/5          3/4          yes
D -> P    2/5              4/5          2/4          no
M -> B    2/5              2/5          1            yes
P -> D    2/5              2/5          1            yes
…         …                …            …            …
Interesting! The rules say that a customer who buys diapers or milk is very likely to also buy beer.
Does that rule make sense?
Example p.170:
Assume min_support = 0.4, min_confidence = 0.6. Contingency table for high-school students (the
row and column totals are the derived quantities):

                          Eat cereal
                          Y        N        Total
Play basketball    Y      2000     1000     3000
                   N      1750     250      2000
Total                     3750     1250     5000
“Play basketball -> eat cereal” is a strong rule because
support (play basketball and eat cereal) = 2000/5000,
support(play basketball) = 3000/5000
c(rule) = 2/3 > min_confidence.
BUT: support(eat cereal) = 3750/5000 = 0.75 > c,
so the rule is misleading: eating cereal is even more common among all students than among those
who play basketball.
Lift(bball on cereal) = (2000 − (3000*3750)/5000)/3000 = (2000 − 2250)/3000 ≈ −0.08, so eating
cereal has (almost) nothing to do with playing bball; if anything, there is a very slight repulsion.
Improving Apriori algorithm
 Group items into higher conceptual groups, e.g. white and brown bread become “bread.”
 Reduce the number of scans of the entire database (Apriori needs n+1 scans, where n is the
length of the longest pattern):
o Partition-based Apriori
o Take a subset from the database, generate candidates for frequent itemsets; then confirm
the hypothesis on the entire database.
Alternative to Apriori, using fewer scans of database:
Frequent Pattern (FP) Growth Method
Used to find the frequent itemsets using only two scans of database.
Algorithm:
1. Scan the database and find the items with frequency greater than or equal to a threshold T
2. Order the frequent items in decreasing order of frequency
3. Construct a tree which has only the root
4. Scan the database again; for each sample:
a. add the items from the sample to the existing tree, using only the frequent items (i.e.
items discovered in step 1)
b. repeat a. until all samples have been processed
5. Enumerate all frequent itemsets by examining the tree: the frequent itemsets are present in
those paths for which every node is represented with frequency ≥ T.
Example p.173: Assume threshold T = 3.
Steps 1 and 2: The item frequencies in the five input itemsets are f:4, a:3, c:4, b:3, m:3, p:3 (all
other items occur fewer than 3 times); sorted in decreasing order of frequency: f:4, c:4, a:3, b:3,
m:3, p:3.

Input itemset       Frequent input sequence   Sorted frequent input sequence
f a c d g i m p     f a c m p                 f c a m p
a b c f l m o       a b c f m                 f c a b m
b f h j o           b f                       f b
b c k s p           b c p                     c b p
a f c e l p m n     a f c p m                 f c a m p
Steps 3 and 4: Construct the tree using the last column of the table above, inserting one sorted
sequence at a time (f c a m p, then f c a b m, f b, c b p, and f c a m p again). After all five
sequences have been inserted, the tree is:

root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Step 5: The frequent itemsets are contained in those paths, starting from the root, which have
frequency ≥ T. Therefore, at each branching, see if any item from the branches can be added to the
frequent “path,” i.e. if the total of branches gives frequency ≥ T for this item.
In the above tree, the frequent itemsets are: f c a m, and c p.
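A minimal Python sketch of steps 3 and 4 (FP-tree construction only; the Node class and function names are mine). Run on the five sample itemsets with the item ordering f c a b m p from steps 1 and 2, it prints the same node counts as the tree above:

from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(samples, item_order):
    # Steps 3 and 4: start from a bare root, then insert every sample,
    # keeping only the frequent items and sorting them by item_order.
    rank = {item: r for r, item in enumerate(item_order)}
    root = Node(None)
    for sample in samples:
        node = root
        for item in sorted((i for i in sample if i in rank), key=rank.get):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def show(node, depth=0):
    # print each node as item:count, indented by depth in the tree
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

samples = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
# Steps 1 and 2 with threshold T = 3 give the ordered frequent items f, c, a, b, m, p
freq = Counter(i for s in samples for i in s)
item_order = "fcabmp"
assert all(freq[i] >= 3 for i in item_order)
show(build_fp_tree(samples, item_order))
# prints f:4, c:3, a:3, m:2, p:2, b:1, m:1, b:1 and the c:1, b:1, p:1 branch, as above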
WEB MINING
Mine for: content, structure, or usage
Tools are for analyzing on-line or off-line behavior.
Efficiency of the web site = number of purchases / number of visits
A web site is a network of related pages.
Web pages can be:
 Authorities (provide the best source of information)
 Hubs (links to authorities)
HITS algorithm
Searches for authorities and hubs.
Algorithm:
Use search engines to search for a given term and collect a root set of pages
Expand the root set by including all the pages that the root set links to, up to a cutoff point (e.g.
1000-5000 pages including links). Assume that now we have n pages total.
Construct the adjacency matrix A such that aij = 1 if page i links to page j.
Associate an authority weight ap and a hub weight hp with each page, and set them to a uniform
constant initially.
a = Transpose{a1, a2, …, an}
h = Transpose{h1, h2, …, hn}
Then update a and h iteratively:
a = Transpose(A)*h = (Transpose(A)*A)*a
h = A*a = (A * Transpose(A))*h
Example p.181:
Assume a network of n = 6 pages in which page 1 links to pages 4, 5 and 6, page 2 links to pages 4
and 5, page 3 links to page 5, and page 6 links to page 3. The adjacency matrix is:

A =
0 0 0 1 1 1
0 0 0 1 1 0
0 0 0 0 1 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 1 0 0 0

Assume initial:
a = Transpose{.1, .1, .1, .1, .1, .1}
h = Transpose{.1, .1, .1, .1, .1, .1}
After one update:
a = Transpose(A)*A*a = Transpose(0 0 .1 .5 .6 .3)
h = A*Transpose(A)*h = Transpose(.6 .5 .3 0 0 .1)
Seems like document 5 is the best authority and document 1 is the best hub.
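A small numpy sketch of the HITS updates (variable names are mine); one round of updates from the uniform starting vectors reproduces the authority and hub scores above:

import numpy as np

# One round of the HITS updates a = A^T*A*a and h = A*A^T*h
# on the 6-page example above.

A = np.array([[0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0]], dtype=float)

a = np.full(6, 0.1)   # authority weights
h = np.full(6, 0.1)   # hub weights

a = A.T @ A @ a       # -> [0.  0.  0.1 0.5 0.6 0.3]
h = A @ A.T @ h       # -> [0.6 0.5 0.3 0.  0.  0.1]
# In practice the updates are repeated (with normalization) until the weights converge.

print("authorities:", a)   # page 5 is the best authority
print("hubs:       ", h)   # page 1 is the best hub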
LOGSOM Algorithm
For finding users’ navigation behavior – which pages they visit the most.
For a given set of URLs, urli, i = 1, …, n, and a set of user transactions tj, j = 1, …, m, assign 1 to
urli if transaction tj involved visiting this page. Make a table of all transactions:
     url1   url2   …   urln
t1   1      0      …   1
t2   0      1      …   1
…
tm   0      0      …   1
Use K-means clustering to group the users into k transaction groups, and then record the number of
hits of each group. For example:
          url1   url2   …   urln
group1    16     0      …   12
group2    0      5      …   3
…
groupk    20     10     …   0
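A short Python sketch of this grouping step, using scikit-learn's KMeans for the K-means clustering the notes call for; the 0/1 transaction-by-URL matrix below is randomly generated for illustration in place of real log data:

import numpy as np
from sklearn.cluster import KMeans

# Cluster the binary transaction-by-URL matrix into k transaction groups,
# then count the hits on each URL within each group.
rng = np.random.default_rng(0)
X = (rng.random((100, 6)) < 0.3).astype(int)   # 100 transactions x 6 URLs, 0/1 visits

k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

for g in range(k):
    print(f"group{g + 1}:", X[labels == g].sum(axis=0))   # hits per URL in this group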
Mining path-traversal patterns (sequence mining)
Given a collection of sequences ordered in time, where each sequence contains a set of Web pages,
find the longest and most frequent sequences of navigation patterns.
1. Find the longest traversal pattern
2. Draw it out as a tree (keep track of backward links)
3. Find the longest consecutive sequences
4. Find the most frequent ones
Example p.186
Path = A B C D C B E G H G W A O U O V, assume threshold of 40% (2/5).
Tree:
A–B–C–D
B–E–G
G–H
G–W
A–O
O–U
O–V
Maximum forward references: ABCD, ABEGH, ABEGW, AOU, AOV
Frequent forward references: AB, BE, AO, EG, ABE, BEG, ABEG
Maximal reference sequence: ABEG, AO (by heuristics).
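A minimal Python sketch (the function name is mine) of step 1, extracting the maximal forward references from the traversal path of the example; a backward move, i.e. revisiting a page already on the current forward path, closes the current forward reference:

def maximal_forward_references(path):
    # A backward move (revisiting a page already on the current forward path)
    # closes the current forward reference; the path then shrinks back to that page.
    refs, current, extending = [], [], False
    for page in path:
        if page in current:
            if extending:
                refs.append("".join(current))
            current = current[:current.index(page) + 1]
            extending = False
        else:
            current.append(page)
            extending = True
    if extending:
        refs.append("".join(current))
    return refs

path = "ABCDCBEGHGWAOUOV"
print(maximal_forward_references(path))
# -> ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']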
Text Mining
Document = a vector of tokens (each word is a token)
Calculate the Hamming distance between the token vectors of documents
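A tiny Python sketch of this comparison; the vocabulary and documents are made up for illustration:

vocabulary = ["data", "mining", "association", "rule", "web", "cluster"]

def to_vector(document, vocabulary):
    # 1 if the vocabulary word occurs in the document, 0 otherwise
    tokens = set(document.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

def hamming(u, v):
    # number of positions in which the two vectors differ
    return sum(a != b for a, b in zip(u, v))

d1 = to_vector("data mining association rule", vocabulary)
d2 = to_vector("web mining cluster", vocabulary)
print(d1, d2, hamming(d1, d2))   # [1, 1, 1, 1, 0, 0] [0, 1, 0, 0, 1, 1] 5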