
COMP 578
Data Warehousing & Data Mining
Ch 2 Discovering Association Rules
Keith C.C. Chan
Department of Computing
The Hong Kong Polytechnic University
The AR Mining Problem

Given a database of transactions.
  Each transaction is a list of items,
  e.g., the items purchased by a customer in one visit.
Find all rules that correlate the presence of one set of items with that of another set of items.
  E.g., 30% of people who buy diapers also buy beer.
2
Motivation & Applications (1)

If we can find such associations, we will be able to answer:
  ??? => beer
  (What should the company do to boost beer sales?)
  Diapers => ???
  (What other products should the store stock up on?)
Attached mailing in direct marketing.
3
Motivation & Applications (2)

Originally used in marketing to understand purchasing trends:
  What products or services do customers tend to purchase at the same time, or later on?
Use market basket analysis to plan:
  Coupons and discounting:
    Do not offer simultaneous discounts on beer and diapers if they tend to be bought together.
    Discount one to pull in sales of the other.
  Product placement:
    Place products that have a strong purchasing relationship close together.
    Or place such products far apart to increase traffic past other items.
4
Measure of Interestingness

For a data mining algorithm to mine for interesting association rules, users have to define a measure of "interestingness".
Two popular interestingness measures have been proposed:
  Support and confidence
  Lift ratio (interest)
MineSet from SGI uses the terms predictability and prevalence instead of support and confidence.
5
The Support and Confidence

Given a rule X & Y => Z:

  Support, S = P(X ∪ Y ∪ Z)
  where X ∪ Y ∪ Z indicates that a transaction contains X, Y and Z together
  (the union of the itemsets X, Y and Z).
  [# of tuples containing X, Y & Z / total # of tuples]

  Confidence, C = P(Z | X ∪ Y)
  P(Z | X ∪ Y) is the conditional probability that a transaction containing {X, Y} also contains Z.
  [# of tuples containing X, Y & Z / # of tuples containing X & Y]
6
The Support and Confidence

(Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both.)

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support be 50% and minimum confidence be 50%; find the S and C of:
  1. A => C
  2. C => A
Answer:
  A => C (50%, 66.6%)
  C => A (50%, 100%)
7
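A minimal Python sketch (not part of the original slides) of how these two measures can be computed for the four-transaction database above; the helper names support and confidence are purely illustrative. Running it reproduces the (50%, 66.6%) and (50%, 100%) figures quoted in the answer.

transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    # P(rhs | lhs) = support(lhs and rhs together) / support(lhs).
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))        # 0.5    -> S = 50% for both rules
print(confidence({"A"}, {"C"}))   # 0.666  -> C of A => C
print(confidence({"C"}, {"A"}))   # 1.0    -> C of C => A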
How Good is a Predictive Model?
Response curves
- How does the response rate of a targeted selection compare to a
random selection?
8
What is A Lift Ratio? (1)

Consider the rule:
  When people buy diapers they also buy beer 50 percent of the time.
  It states an explicit percentage (50% of the time).
Consider this other rule:
  People who purchase a VCR are three times more likely to also purchase a camcorder.
  This rule uses the comparative phrase "three times more likely".
9
What is A Lift Ratio? (2)

The probability is compared to the baseline likelihood.
The baseline likelihood is the probability of the event occurring independently.
  E.g., if people normally buy beer 5% of the time, then the first rule could have said "10 times more likely."
The ratio in this kind of comparison is called lift.
A key goal of an association rule mining exercise is to find rules that have the desired lift.
10
Lift Ratio As Interestingness

An example (8 transactions):

  X: 1 1 1 1 0 0 0 0
  Y: 1 1 0 0 0 0 0 0
  Z: 0 1 1 1 1 1 1 1

  Rule     Support   Confidence
  X => Y   25%       50%
  X => Z   37.50%    75%

X and Y are positively correlated; X and Z are negatively correlated,
yet the support and confidence of X => Z dominate.
11
Lift Ratio As Interestingness

Lift is a measure of dependent or correlated events.
The lift of a rule X => Y:

  lift(X, Y) = P(Y | X) / P(Y) = P(X ∪ Y) / (P(X) P(Y))

  where P(Y) is the a priori probability and P(Y | X) is the confidence.

Lift = 1 means X and Y are independent events.
Lift < 1 means X and Y are negatively correlated.
Lift > 1 means X and Y are positively correlated (better than random).
12
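As a quick illustration (not from the slides), the lift formula above can be evaluated directly from transaction counts. The sketch below, with an assumed helper name lift, applies it to the 8-transaction X/Y/Z example on the previous slide.

def lift(n_total, n_x, n_y, n_xy):
    p_x, p_y, p_xy = n_x / n_total, n_y / n_total, n_xy / n_total
    return p_xy / (p_x * p_y)   # equivalently (p_xy / p_x) / p_y = P(Y|X) / P(Y)

print(lift(8, 4, 2, 2))   # X => Y: lift = 2.0   (positively correlated)
print(lift(8, 4, 7, 3))   # X => Z: lift ~ 0.86  (negatively correlated)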
AR Mining with Lift Ratio (1)

To understand what the lift ratio is, consider the following:
  500,000 transactions
  20,000 transactions contain diapers (4 percent)
  30,000 transactions contain beer (6 percent)
  10,000 transactions contain both diapers and beer (2 percent)
Confidence measures how much a particular item is dependent on another.
  When people buy diapers, they also buy beer 50% of the time (10,000/20,000).
  The confidence for this rule is 50%.
13
AR Mining with Lift Ratio (2)

The inverse rule could be stated as:
  When people buy beer they also buy diapers 1/3 of the time (Conf = 33.33% = 10,000/30,000).
In the absence of any knowledge about what else was bought, the following can be computed:
  People buy diapers 4 percent of the time.
  People buy beer 6 percent of the time.
4% and 6% are called the expected confidence (or baseline likelihood, or a priori probability) of buying diapers or beer.
14
AR Mining with Lift Ratio (3)

Lift measures the difference between the confidence of a rule and the expected confidence.
Lift is one measure of the strength of an effect.
If people who bought diapers also bought beer 8% of the time, the effect is small, since the expected confidence is 6%.
If the confidence is 50%, the lift (measured as a ratio) is more than 8, and the interaction between diapers and beer is very strong.
15
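The diapers/beer figures above can be verified with a few lines of Python; this is just a worked check of the definitions, not part of the slides.

n_total   = 500_000
n_diapers = 20_000    # 4%
n_beer    = 30_000    # 6%
n_both    = 10_000    # 2%

conf     = n_both / n_diapers   # 0.50 : confidence of diapers => beer
exp_conf = n_beer / n_total     # 0.06 : expected confidence = a priori P(beer)
lift     = conf / exp_conf      # 8.33 : the rule's lift ratio
print(conf, exp_conf, lift)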
AR Mining with Lift Ratio: An Example

Consider itemsets with three items:
  10,000 transactions contain wipes (2 percent).
  8,000 transactions contain wipes and diapers (80% of wipes transactions).
  220 transactions contain wipes and beer (2.2% of wipes transactions).
  200 transactions contain wipes, diapers and beer (2% of wipes transactions).
The complete set of 12 rules is presented in the table on the next slide, along with their confidence, support and lift.
16
AR Mining with Lift Ratio: An Example

 #   LHS               RHS               Conf (%)   Exp Conf (%)   Lift Ratio   Supp (%)
 1   Diapers           Beer                 50.00           6.00         8.33       2.00
 2   Beer              Diapers              33.33           4.00         8.33       2.00
 3   Diapers           Wipes                40.00           2.00        20.00       1.60
 4   Wipes             Diapers              80.00           4.00        20.00       1.60
 5   Wipes             Beer                  2.20           6.00         0.37       0.04
 6   Beer              Wipes                 0.73           2.00         0.37       0.04
 7   Diapers & Wipes   Beer                  2.50           6.00         0.42       0.04
 8   Diapers & Beer    Wipes                 2.00           2.00         1.00       0.04
 9   Beer & Wipes      Diapers              90.91           4.00        22.73       0.04
10   Diapers           Beer & Wipes          1.00           0.044       22.73       0.04
11   Wipes             Diapers & Beer        2.00           2.00         1.00       0.04
12   Beer              Diapers & Wipes       0.67           1.60         0.42       0.04
17
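The table can be recomputed from the raw counts given on the two preceding slides. The Python sketch below is illustrative only; the itemset counts are those assumed in the example, and the rule-enumeration code simply spells out all 12 LHS/RHS splits.

from itertools import permutations

n_total = 500_000
count = {                                     # transaction counts per itemset
    frozenset({"Diapers"}): 20_000,
    frozenset({"Beer"}): 30_000,
    frozenset({"Wipes"}): 10_000,
    frozenset({"Diapers", "Beer"}): 10_000,
    frozenset({"Diapers", "Wipes"}): 8_000,
    frozenset({"Beer", "Wipes"}): 220,
    frozenset({"Diapers", "Beer", "Wipes"}): 200,
}

def p(itemset):
    return count[frozenset(itemset)] / n_total

names = ["Diapers", "Beer", "Wipes"]
rules = [({a}, {b}) for a, b in permutations(names, 2)]                   # rules 1-6
rules += [({a, b}, {c}) for a, b, c in permutations(names, 3) if a < b]   # pair => single
rules += [({c}, {a, b}) for a, b, c in permutations(names, 3) if a < b]   # single => pair

for lhs, rhs in rules:
    conf = p(lhs | rhs) / p(lhs)   # confidence of LHS => RHS
    exp = p(rhs)                   # expected confidence = a priori P(RHS)
    print(sorted(lhs), "=>", sorted(rhs),
          f"conf={conf:.2%} exp={exp:.3%} lift={conf / exp:.2f} supp={p(lhs | rhs):.2%}")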
AR Mining with Lift Ratio: An Example

The greatest amount of lift, measured as a ratio, is found in the 9th and 10th rules.
  Both have a lift greater than 22, computed as 90.91/4 and 1/0.044.
For the 9th rule, the lift of 22 means:
  People who purchase wipes and beer are 22 times more likely to also purchase diapers than customers in general.
Note the negative lift (lift ratio less than 1) in the 5th, 6th, 7th and last rules.
  The latter two rules both have a lift ratio of approximately 0.42.
18
AR Mining with Lift Ratio: An Example

Negative lift on the 7th rule means that people who buy diapers and wipes are less likely to buy beer than one would expect.
Rules with very high or very low confidence may simply model an anomaly.
  E.g., a rule says, with a confidence of 1 (100%), that whenever people bought pet food they also bought pet supplies.
  Further investigation shows this held for one day only: there was a special giveaway.
19
AR Mining with Lift Ratio: An Example

Most rules have dairy on the right-hand side.
  Milk and eggs are so commonly purchased that "dairy" is quite likely to show up in many rules.
  The ability to exclude specific items is therefore very useful.
Interesting rules:
  Have a very high or very low lift.
  Do not involve items that appear in most transactions.
  Have support that exceeds a threshold.
    Low support might simply be due to a statistical anomaly.
  Rules that are more general are frequently desirable.
    But it is sometimes interesting to differentiate between, say, diapers sold in boxes vs. diapers sold in bulk.
20
Lift Ratio and Sample Size

Consider the association A => B.
The lift ratio can be very large even if the number of transactions containing A and B, together or separately, is very small.
To take sample size into consideration, one can use support and confidence as additional interestingness measures.
21
Complexity of AR Mining Algorithms

An association rule algorithm is essentially a counting algorithm.
Probabilities are computed by taking ratios among the various counts.
If item hierarchies are in use, then some translation (or lookup) is needed.
One must carefully control the sizes of the itemsets because of the combinatorial explosion problem.
22
Complexity of AR Mining Algorithms

Large grocery stores sell more than 100,000 different items.
  That gives about 5 billion possible item pairs, and 1.7 x 10^14 sets of three items.
An item hierarchy can be used to reduce this number to a manageable size.
  There is unlikely to be a specific relationship between Pampers in the 30-count box and Blue Ribbon in 12-oz cans.
23
Complexity of AR Mining Algorithms

If there is such a relationship, it is probably subsumed by the more general relationship between diapers and beer.
Using an item hierarchy reduces the number of combinations.
It also helps to find more general, higher-level relationships, such as those between any kind of diapers and any kind of beer.
24
Complexity of AR Mining Algorithms

The combinatorial explosion problem:
  Suppose an item hierarchy is used to group items together so that the average group size is 50,
  reducing 100,000 items to 2,000 item groups.
  With 2,000 item groups there are still almost 2 million paired itemsets.
  An algorithm might require up to 2 million counting registers.
  There are about 1.3 billion three-item itemsets!
  Many combinations will never occur, so some sort of dynamic memory or counter allocation and addressing scheme is needed.
25
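Quick arithmetic behind the figures on the last three slides, using exact binomial coefficients (a side calculation, not part of the slides):

from math import comb

print(comb(100_000, 2))   # 4,999,950,000  ~ 5 billion item pairs
print(comb(100_000, 3))   # ~1.67e14 three-item sets
print(comb(2_000, 2))     # 1,999,000      ~ 2 million pairs of item groups
print(comb(2_000, 3))     # 1,331,334,000  ~ 1.3 billion three-group sets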
The Apriori Algorithm

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

For the rule A => C, with min. support 50% and min. confidence 50%:

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle:
  Any subset of a frequent itemset must be frequent.
26
Applying the Apriori Algorithm

Database D (min. support count = 2):
  TID 100: 1 3 4
  TID 200: 2 3 5
  TID 300: 1 2 3 5
  TID 400: 2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (from L2): {2 3 5}
Scan D -> L3: {2 3 5}:2
27
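A compact level-wise Apriori sketch in Python (an illustrative implementation, not the course's own code) that reproduces the L1/L2/L3 results of the trace above for this four-transaction database:

from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_count = 2                                  # 50% of 4 transactions

def frequent(candidates):
    # Scan D and keep candidates that reach the minimum support count.
    counts = {c: sum(c <= t for t in D) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_count}

C = {frozenset({i}) for t in D for i in t}     # C1: all single items
L = frequent(C)                                # L1
k = 2
while L:
    print(f"L{k-1}:", {tuple(sorted(s)): n for s, n in L.items()})
    # Candidate generation: join L(k-1) with itself, keep only k-itemsets
    # whose (k-1)-subsets are all frequent (the Apriori principle).
    C = {a | b for a in L for b in L if len(a | b) == k}
    C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k - 1))}
    L = frequent(C)
    k += 1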
Improving Apriori's Efficiency

Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine the completeness.
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
28
Is Apriori Fast Enough?

The core of the Apriori algorithm:
  Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets.
  Use database scans and pattern matching to collect counts for the candidate itemsets.
The bottleneck of Apriori: candidate generation.
  Huge candidate sets:
    10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets.
    To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  Multiple scans of the database:
    Needs (n + 1) scans, where n is the length of the longest pattern.
29
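The two figures quoted above are easy to sanity-check (a side calculation, not from the slides):

from math import comb

print(comb(10**4, 2))   # 49,995,000 candidate 2-itemsets, i.e. on the order of 10^7
print(2**100)           # ~1.27 x 10^30 candidate patterns over 100 items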
Multiple-Level ARs

Items often form hierarchies.
Items at the lower levels are expected to have lower support.
Rules regarding itemsets at the appropriate levels could be quite useful.
The transaction database can be encoded based on dimensions and levels.
It is smart to explore shared multi-level mining (Han & Fu, VLDB'95).
30
Mining Multi-Level Association

A top-down, progressive deepening approach:
  First find high-level strong rules:
    milk => bread [20%, 60%].
  Then find their lower-level "weaker" rules:
    2% milk => wheat bread [6%, 50%].
Variations at mining multiple-level association rules:
  Level-crossed association rules:
    2% milk => Wonder wheat bread
  Association rules with multiple, alternative hierarchies:
    2% milk => Wonder bread
31
Multi-level Association:
Uniform Support vs. Reduced Support (1)

Uniform support: the same minimum support for all levels.
  + One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
  - Lower-level items do not occur as frequently. If the support threshold is
      too high => miss low-level associations;
      too low => generate too many high-level associations.
32
Multi-level Association:
Uniform Support vs. Reduced Support (2)

Reduced support: reduced minimum support at lower levels.
There are 4 search strategies:
  Level-by-level independent
  Level-cross filtering by k-itemset
  Level-cross filtering by single item
  Controlled level-cross filtering by single item
33
Uniform Support

Multi-level mining with uniform support:

  Level 1, min_sup = 5%:  Milk [support = 10%]
  Level 2, min_sup = 5%:  2% Milk [support = 6%],  Skim Milk [support = 4%]
34
Reduced Support

Multi-level mining with reduced support:

  Level 1, min_sup = 5%:  Milk [support = 10%]
  Level 2, min_sup = 3%:  2% Milk [support = 6%],  Skim Milk [support = 4%]
35
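A tiny sketch of the pruning decision illustrated by the two figures above; the supports are the ones shown in the figures, while the names support, children_of and frequent_at_level2 are hypothetical helpers.

support = {"Milk": 0.10, "2% Milk": 0.06, "Skim Milk": 0.04}
children_of = {"Milk": ["2% Milk", "Skim Milk"]}

def frequent_at_level2(min_sup_level2):
    return [c for c in children_of["Milk"] if support[c] >= min_sup_level2]

print(frequent_at_level2(0.05))   # uniform support 5%: ['2% Milk']  (Skim Milk pruned)
print(frequent_at_level2(0.03))   # reduced support 3%: ['2% Milk', 'Skim Milk']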
Multi-level Association: Redundancy Filtering

Some rules may be redundant due to "ancestor" relationships between items.
Example:
  milk => wheat bread      [support = 8%, confidence = 70%]
  2% milk => wheat bread   [support = 2%, confidence = 72%]
  We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
36
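A hedged sketch of this redundancy test, using the example above. The 25% share of 2% milk within all milk transactions is a hypothetical figure introduced only for illustration, as are the tolerance thresholds.

ancestor_supp, ancestor_conf = 0.08, 0.70   # milk => wheat bread
child_supp,    child_conf    = 0.02, 0.72   # 2% milk => wheat bread
child_share = 0.25                          # assumed: 2% milk is 25% of milk transactions

expected_supp = ancestor_supp * child_share # 0.02: what the ancestor alone predicts
expected_conf = ancestor_conf               # 0.70

redundant = (abs(child_supp - expected_supp) < 0.005 and
             abs(child_conf - expected_conf) < 0.05)
print(redundant)   # True: the lower-level rule adds little beyond its ancestor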
Multi-Level Mining: Progressive Deepening

A top-down, progressive deepening approach:
  First mine high-level frequent items:
    milk (15%), bread (10%)
  Then mine their lower-level "weaker" frequent itemsets:
    2% milk (5%), wheat bread (4%)
Different min_support thresholds across the levels lead to different algorithms:
  If adopting the same min_support across all levels,
    then toss t if any of t's ancestors is infrequent.
  If adopting reduced min_support at lower levels,
    then examine only those descendants whose ancestors' support is frequent/non-negligible.
37
AR Representation Scheme

In words:
  60% of people who buy diapers also buy beer, and 0.5% buy both.
In a first-order logic or PROLOG-like statement:
  buys(x, "diapers") -> buys(x, "beers") [0.5%, 60%]
Also represented as if-then rules:
  If diapers in Itemset THEN beers in Itemset [0.5%, 60%]
  If people buy diapers, they also buy beer 60% of the time; 0.5% of the people buy both.
38
Presentation of Association Rules (Tabular)
39
Visualization of Association Rule Using Plane Graph
40
Visualization of Association Rule Using Rule Graph
41
Sequential Apriori Algorithm (1)

The problem of mining sequential patterns can be split into the following phases:
1. Sort Phase. This step implicitly converts the original transaction database into a database of customer sequences.
2. Litemset Phase. In this phase we find the set of all litemsets L. We are also simultaneously finding the set of all large 1-sequences.
3. Transformation Phase. We need to repeatedly determine which of a given set of large sequences are contained in a customer sequence, so we transform each customer sequence into an alternative representation.
4. Sequence Phase. Use the set of litemsets to find the desired sequences. Algorithms for this phase are given below.
5. Maximal Phase. Find the maximal sequences among the set of large sequences. In some algorithms this phase is combined with the sequence phase to reduce the time wasted counting non-maximal sequences.

REFERENCE:
Mining Sequential Patterns
42
Sequential Apriori Algorithm (2)

There are two families of algorithms: count-all and count-some.
The count-all algorithms count all the large sequences, including non-maximal sequences. The non-maximal sequences must then be pruned out (in the maximal phase).
  AprioriAll is a count-all algorithm, based on the Apriori algorithm for finding large itemsets.
  AprioriSome is a count-some algorithm. The intuition behind these algorithms is that since we are only interested in maximal sequences, we can avoid counting sequences which are contained in a longer sequence if we count longer sequences first.
43
AprioriAll Algorithm (1)

(Worked example figure: Steps 1-3, with minimum support = 25%.)
44
AprioriAll Algorithm (2)

Step 4

L1 = large 1-sequences;   // Result of litemset phase
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
  Ck = new candidates generated from Lk-1 (see next slide);
  foreach customer-sequence c in the database do
    Increment the count of all candidates in Ck that are contained in c;
  Lk = candidates in Ck with minimum support;
end
Answer = maximal sequences in ∪k Lk;
45
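A minimal Python sketch of this count-all loop (illustrative only; the names apriori_all, is_contained and generate_candidates are assumptions, and customer sequences are taken to be in the transformed representation, i.e. lists of sets of litemset IDs). The generate_candidates helper is sketched after the next slide.

def is_contained(candidate, customer_seq):
    # True if the candidate (a tuple of litemset IDs) occurs, in order, in the sequence.
    pos = 0
    for element in customer_seq:          # each element is a set of litemset IDs
        if candidate[pos] in element:
            pos += 1
            if pos == len(candidate):
                return True
    return False

def apriori_all(database, L1, min_count, generate_candidates):
    L, k, answer = {1: dict(L1)}, 2, dict(L1)
    while L[k - 1]:
        Ck = generate_candidates(L[k - 1])                 # join + prune (next slide)
        counts = {c: sum(is_contained(c, s) for s in database) for c in Ck}
        L[k] = {c: n for c, n in counts.items() if n >= min_count}
        answer.update(L[k])
        k += 1
    return answer   # the maximal phase then keeps only the maximal sequences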
AprioriAll Algorithm (3)

Apriori Candidate Generation

The apriori-generate function takes as argument Lk-1, the set of all large (k-1)-sequences. It works as follows. First, join Lk-1 with Lk-1:

  insert into Ck
  select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
  from Lk-1 p, Lk-1 q
  where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;

Next, delete all sequences c ∈ Ck such that some (k-1)-subsequence of c is not in Lk-1.
46
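A Python sketch of this join-and-prune step, with large (k-1)-sequences represented as tuples of litemset IDs (again illustrative; it plugs into the apriori_all loop sketched after the previous slide):

def generate_candidates(L_prev):
    # L_prev maps large (k-1)-sequences (tuples of litemset IDs) to their counts.
    prev = set(L_prev)
    # Join: p and q must agree on their first k-2 litemsets. Order matters for
    # sequences, so every matching (p, q) pair yields p extended by q's last litemset.
    Ck = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    # Prune: drop any candidate with a (k-1)-subsequence that is not large.
    return {c for c in Ck
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}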
Count Operation in Sequential Apriori

A hash tree is used for fast search of candidate occurrences.
Similar to association rule discovery, except for the following differences:
  • Every event-timestamp pair in the timeline is hashed at the root.
  • Events eligible for hashing at the next level are determined by the maximum gap (xg), window size (ws), and span (ms) constraints.

REFERENCE:
Sequential Hash Tree for fast access
http://www-users.cs.umn.edu/~mjoshi/hpdmtut/sld144.htm
47
Exercises:

1. What is the difference between the Apriori and AprioriAll algorithms?
2. What happens if min. support and confidence are set too low / too high?
3. Give a short example to show that items in a strong association rule may actually be negatively correlated.
48
END OF CHAPTER 2
49