Machine learning algorithms:
classification and association rules

Classification rules
and how they can be learned from data
(Terminology: presbyope = aged, weakened eyesight; myope = short-sighted.)
Illustrative example: Contact lenses data

Person   Age             Spect. presc.  Astigm.  Tear prod.  Lenses
O1       young           myope          no       reduced     NONE
O2       young           myope          no       normal      SOFT
O3       young           myope          yes      reduced     NONE
O4       young           myope          yes      normal      HARD
O5       young           hypermetrope   no       reduced     NONE
O6-O13   ...             ...            ...      ...         ...
O14      pre-presbyopic  hypermetrope   no       normal      SOFT
O15      pre-presbyopic  hypermetrope   yes      reduced     NONE
O16      pre-presbyopic  hypermetrope   yes      normal      NONE
O17      presbyopic      myope          no       reduced     NONE
O18      presbyopic      myope          no       normal      NONE
O19-O23  ...             ...            ...      ...         ...
O24      presbyopic      hypermetrope   yes      normal      NONE

Classes: N (none), S (soft), H (hard) contact lenses
Decision tree for contact lenses recommendation

tear prod.
├─ reduced → NONE
└─ normal  → astigmatism
             ├─ no  → SOFT
             └─ yes → spect. pre.
                      ├─ myope        → HARD
                      └─ hypermetrope → NONE
Problems with decision trees?

- Decision trees can be transformed into rules, but to apply them we need "complete information" about the case.
- The resulting rule sets can be rather complex (1 rule = 1 branch of the tree) and difficult for a human user to understand.
- Sets of rules in DNF are sometimes easier to grasp:
  - If X then C1
  - If X and Y then C2
  - If not X and Z and Y then C3
  - If B then C2
- But learning such sets is more difficult!
Ordered or unordered sets of rules?

- A disjunction of 2 rules does not have to increase their performance!
- Example:
  - Take 1000 cases and 2 rules R1 and R2, each covering 100 cases and each correct on 90 of them.
  - What happens if the rules R1 and R2 are combined?
  - In the best case the incorrect cases are identical (and the correctly covered cases are disjoint), so the performance of R1 OR R2 is (90+90)/(90+90+10) ≈ 0.95.
  - In the worst case R1 and R2 are correct on the same cases and wrong on different ones. In such a case, the performance of R1 OR R2 is 90/(90+10+10) ≈ 0.82.
Ruleset representation

- A rule base is a disjunctive set of conjunctive rules.
- Standard forms of rules:
  IF Conditions THEN Class
  Class IF Conditions
  Class ← Conditions
- Examples:
  IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
  IF Outlook=Overcast THEN PlayTennis=Yes
  IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
- Form of CN2 rules:
  IF Conditions THEN MajClass [ClassDistr]
- Rule base: {R1, R2, R3, …, DefaultRule}
Decision tree vs. rule learning: splitting vs. covering

[Figure: the same set of + and − examples, partitioned exhaustively into regions by splitting vs. covered region by region by individual rules]

- Splitting (ID3, C4.5, J48, See5)
- Covering (AQ, CN2)
Classification Rule Learning

- Rule set representation
- Two rule learning approaches:
  - Learn a decision tree, convert it to rules
  - Learn a set/list of rules directly
    - Learning an unordered set of rules
    - Learning an ordered list of rules
- Heuristics, overfitting, pruning
PlayTennis: Training examples

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Weak    Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
PlayTennis: Using a decision tree for classification

Outlook
├─ Sunny    → Humidity
│             ├─ High   → No
│             └─ Normal → Yes
├─ Overcast → Yes
└─ Rain     → Wind
              ├─ Strong → No
              └─ Weak   → Yes

Is Saturday morning OK for playing tennis?
Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong
PlayTennis = No, because Outlook=Sunny ∧ Humidity=High
Contact lens: classification rules

tear production=reduced => lenses=NONE [S=0, H=0, N=12]
tear production=normal & astigmatism=no => lenses=SOFT [S=5, H=0, N=1]
tear production=normal & astigmatism=yes & spect. pre.=myope => lenses=HARD [S=0, H=3, N=2]
tear production=normal & astigmatism=yes & spect. pre.=hypermetrope => lenses=NONE [S=0, H=1, N=2]
DEFAULT lenses=NONE
Unordered rulesets

- A rule Class IF Conditions is learned by first determining Class and then Conditions.
- Ordered sequence of classes C1, …, Cn in RuleSet.
- But: unordered (independent) execution of rules when classifying a new instance: all rules are tried, the predictions of those covering the example are collected, and voting is used to obtain the final classification.
- If no rule fires, then DefaultClass (the majority class in E) is used.
Contact lens: decision list

Ordered (order-dependent) rules:

IF tear production=reduced THEN lenses=NONE
ELSE /* tear production=normal */
  IF astigmatism=no THEN lenses=SOFT
  ELSE /* astigmatism=yes */
    IF spect. pre.=myope THEN lenses=HARD
    ELSE /* spect. pre.=hypermetrope */
      lenses=NONE
Ordered set of rules: if-then-else decision lists

- A rule Class IF Conditions is learned by first determining Conditions and then Class.
- Notice: mixed sequence of classes C1, …, Cn in RuleBase.
- But: ordered execution when classifying a new instance: rules are sequentially tried and the first rule that 'fires' (covers the example) is used for classification.
- Decision list {R1, R2, R3, …, D}: the rules Ri are interpreted as if-then-else rules.
- If no rule fires, then DefaultClass (the majority class in E_cur) is used.
Original covering algorithm (AQ, Michalski 1969, 86)

Basic (bottom-up) covering algorithm:

for each class Ci do
    Ei := Pi ∪ Ni   (Pi positive, Ni negative examples)
    RuleBase(Ci) := empty
    repeat {learn-set-of-rules}
        learn-one-rule R covering some positive examples and no negatives
        add R to RuleBase(Ci)
        delete from Pi all positive examples covered by R
    until Pi = empty
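A minimal Python sketch may make the covering loop concrete. It is illustrative, not the original AQ implementation: examples are plain attribute-to-value dicts, a rule is a conjunction of attribute=value tests, and learn_one_rule is a hypothetical greedy stand-in for AQ's star generation (it assumes consistent data, so a rule excluding all negatives always exists).

```python
def covers(rule, example):
    """A rule is a dict of attribute=value tests; it covers an example
    (also a dict) if every test matches."""
    return all(example.get(a) == v for a, v in rule.items())

def learn_one_rule(pos, neg, attributes):
    """Greedily add the test that best separates positives from negatives
    until no negative example is covered (stand-in for AQ's search)."""
    rule = {}
    while any(covers(rule, n) for n in neg):
        candidates = [(a, v) for a in attributes if a not in rule
                      for v in {e[a] for e in pos}]
        best = max(candidates, key=lambda av:
                   sum(covers({**rule, av[0]: av[1]}, p) for p in pos)
                   - sum(covers({**rule, av[0]: av[1]}, n) for n in neg))
        rule[best[0]] = best[1]
    return rule

def covering(pos, neg, attributes):
    """Basic covering loop: learn one rule, remove the positives it
    covers, repeat until every positive example is covered."""
    rules = []
    while pos:
        rule = learn_one_rule(pos, neg, attributes)
        rules.append(rule)
        pos = [p for p in pos if not covers(rule, p)]
    return rules
```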
Learning an unordered set of rules (CN2, Clark and Niblett)

Clark and Niblett, 1989: top-down approach to search (specialization applies a beam search):

RuleBase := empty
for each class Ci do
    Ei := Pi ∪ Ni, RuleSet(Ci) := empty
    repeat {learn-set-of-rules}
        R := Class = Ci IF Conditions, Conditions := true
        repeat {learn-one-rule}
            R' := Class = Ci IF Conditions AND Cond
            (general-to-specific beam search for the best R')
        until stopping criterion is satisfied (no negatives covered
              or Performance(R') < Threshold_R)
        add R' to RuleSet(Ci)
        delete from Pi all positive examples covered by R'
    until stopping criterion is satisfied (all positives covered or
          Performance(RuleSet(Ci)) < Threshold_RS)
    RuleBase := RuleBase ∪ RuleSet(Ci)
Learn-one-rule: greedy vs. beam search

- learn-one-rule by greedy general-to-specific search: at each step select the 'best' descendant, with no backtracking.
- beam search: maintain a list of the k best candidates at each step; descendants (specializations) of each of these k candidates are generated, and the resulting set is again reduced to the k best candidates. (See the sketch below.)

Recommended reading on search in AI:
V. Mařík: Řešení úloh a využívání znalostí, chapter in Mařík et al.: UI(1), Academia 1993, 2003.
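Reusing covers() from the covering sketch above, here is a small beam-search variant of learn-one-rule. It is a sketch only: CN2's Laplace/entropy heuristic and significance test are simplified to a covered-positives-minus-covered-negatives score, and the beam width k is arbitrary.

```python
def specializations(rule, attributes, examples):
    """All ways to extend a rule by one new attribute=value test."""
    return [{**rule, a: v} for a in attributes if a not in rule
            for v in {e[a] for e in examples}]

def beam_learn_one_rule(pos, neg, attributes, k=3):
    """General-to-specific beam search keeping the k best candidates."""
    def score(rule):
        return (sum(covers(rule, p) for p in pos)
                - sum(covers(rule, n) for n in neg))
    beam, best = [{}], {}
    while beam:
        candidates = [s for r in beam
                      for s in specializations(r, attributes, pos + neg)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:k]          # keep only the k best rules
        if score(beam[0]) > score(best):
            best = beam[0]
        if not any(covers(best, n) for n in neg):
            break                      # best rule is consistent: stop
    return best
```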
Illustrative example: Contact lenses data (repeated for reference)

Person   Age             Spect. presc.  Astigm.  Tear prod.  Lenses
O1       young           myope          no       reduced     NONE
O2       young           myope          no       normal      SOFT
O3       young           myope          yes      reduced     NONE
O4       young           myope          yes      normal      HARD
O5       young           hypermetrope   no       reduced     NONE
O6-O13   ...             ...            ...      ...         ...
O14      pre-presbyopic  hypermetrope   no       normal      SOFT
O15      pre-presbyopic  hypermetrope   yes      reduced     NONE
O16      pre-presbyopic  hypermetrope   yes      normal      NONE
O17      presbyopic      myope          no       reduced     NONE
O18      presbyopic      myope          no       normal      NONE
O19-O23  ...             ...            ...      ...         ...
O24      presbyopic      hypermetrope   yes      normal      NONE
Learn-one-rule as heuristic search

Lenses = hard IF true [S=5, H=4, N=15]
├─ IF Astigmatism = no      [S=5, H=0, N=7]
├─ IF Astigmatism = yes     [S=0, H=4, N=8]
├─ IF Tear prod. = reduced  [S=0, H=0, N=12]
├─ IF Tear prod. = normal   [S=5, H=4, N=3]
│   ├─ IF Tear prod. = normal AND Spect. pre. = myope    [S=2, H=3, N=1]
│   ├─ IF Tear prod. = normal AND Spect. pre. = hyperm.  [S=3, H=1, N=2]
│   ├─ IF Tear prod. = normal AND Astigmatism = no       [S=5, H=0, N=1]
│   └─ IF Tear prod. = normal AND Astigmatism = yes      [S=0, H=4, N=2]
└─ ...
Rule learning: summary

- Hypothesis construction: find a set of n rules
  - usually simplified into n separate rule constructions
- Rule construction: find a pair (Class, Cond)
  - select the rule head (class) and construct the rule body, or
  - construct the rule body and assign the rule head (in ordered algorithms)
- Body construction: find a set of m features
  - usually simplified by adding one feature at a time to the rule body
Associations and Frequent Item Analysis
Outline

- Transactions
- Frequent itemsets
- Subset property
- Association rules
- Applications
Transactions Example

TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
Transaction database: Example

ITEMS: A = milk, B = bread, C = cereal, D = sugar, E = eggs

TID  Products
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Instances = Transactions (the same transactions as in the produce table above).
Transaction database: Example

Attributes converted to binary flags:

TID  Products      →  TID  A  B  C  D  E
1    A, B, E           1   1  1  0  0  1
2    B, D              2   0  1  0  1  0
3    B, C              3   0  1  1  0  0
4    A, B, D           4   1  1  0  1  0
5    A, C              5   1  0  1  0  0
6    B, C              6   0  1  1  0  0
7    A, C              7   1  0  1  0  0
8    A, B, C, E        8   1  1  1  0  1
9    A, B, C           9   1  1  1  0  0
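The conversion is mechanical; a short Python sketch over the slide's transactions:

```python
# transaction database from the slide: TID -> set of items
transactions = {
    1: {"A", "B", "E"}, 2: {"B", "D"}, 3: {"B", "C"},
    4: {"A", "B", "D"}, 5: {"A", "C"}, 6: {"B", "C"},
    7: {"A", "C"}, 8: {"A", "B", "C", "E"}, 9: {"A", "B", "C"},
}
items = sorted({i for t in transactions.values() for i in t})  # A..E

print("TID", *items)
for tid, t in transactions.items():
    # one 0/1 flag per possible item
    print(tid, *(int(i in t) for i in items))
```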
Definitions

- Item: an attribute=value pair or simply a value
  - usually attributes are converted to binary flags, one per value; e.g. Product = "A" is written as "A"
- Itemset I: a subset of the possible items
  - Example: I = {A, B, E} (order unimportant)
- Transaction: (TID, itemset), where TID is the transaction ID
Support and Frequent Itemsets

- Support of an itemset I: sup(I) = the number of transactions t that support (i.e. contain) I
- In the example database: sup({A,B,E}) = 2, sup({B,C}) = 4
- A frequent itemset I is one with at least the minimum support count: sup(I) >= minsup

TID  A  B  C  D  E
1    1  1  0  0  1
2    0  1  0  1  0
3    0  1  1  0  0
4    1  1  0  1  0
5    1  0  1  0  0
6    0  1  1  0  0
7    1  0  1  0  0
8    1  1  1  0  1
9    1  1  1  0  0
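With the transactions dict from the previous sketch, counting support is a one-liner:

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(set(itemset) <= t for t in transactions.values())

print(support({"A", "B", "E"}, transactions))  # -> 2
print(support({"B", "C"}, transactions))       # -> 4
```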
SUBSET PROPERTY

Every subset of a frequent set is frequent!

- Q: Why is it so?
- Example: suppose {A,B} is frequent. Since each occurrence of {A,B} includes both A and B, both A and B must also be frequent.
- A similar argument applies to larger itemsets.
- Almost all association rule algorithms are based on this subset property!
Finding Frequent Itemsets

- Start by finding one-item frequent sets (easy)
- Q: How?
- A: Simply count the frequencies of all items
Finding itemsets: next level

Apriori algorithm (Agrawal & Srikant)

Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …

- If (A, B) is a frequent itemset, then (A) and (B) have to be frequent itemsets as well!
- In general: if X is a frequent k-itemset, then all (k−1)-item subsets of X are also frequent.
- Compute k-itemset candidates by merging (k−1)-itemsets. (A sketch follows below.)
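A compact sketch of this level-wise loop, reusing support() from above. It is illustrative only: real implementations add hash trees and counting optimizations, and the set-union merge below is a simplification of Apriori's lexicographic join.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise search: (k+1)-item candidates are generated from
    frequent k-itemsets and pruned via the subset property."""
    items = {i for t in transactions.values() for i in t}
    level = {frozenset([i]) for i in items
             if support({i}, transactions) >= minsup}
    frequent = set(level)
    while level:
        # merge k-itemsets that differ in a single item
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # prune candidates having an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, len(c) - 1))}
        level = {c for c in candidates
                 if support(c, transactions) >= minsup}
        frequent |= level
    return frequent
```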
An example

- Given: five frequent three-item sets
  (A B C), (A B D), (A C D), (A C E), (B C D)
- Lexicographic order improves efficiency.
- Which are candidates for four-item sets? (Checked in the snippet below.)
  - (A B C D) — Q: OK? A: Yes, because all of its 3-item subsets are frequent.
  - (A C D E) — Q: OK? A: No, because (C D E) is not frequent.
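The candidate step of the sketch above reproduces this example (combinations is the itertools import from that sketch):

```python
level3 = {frozenset(s) for s in ("ABC", "ABD", "ACD", "ACE", "BCD")}
candidates = {a | b for a in level3 for b in level3 if len(a | b) == 4}
pruned = {c for c in candidates
          if all(frozenset(s) in level3 for s in combinations(c, 3))}
print([sorted(c) for c in pruned])  # only ['A', 'B', 'C', 'D'] survives
```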
Classification vs. Association Rules

Classification Rules               Association Rules
Focus on one target field          Many target fields
Specify class in all cases         Applicable in some cases
Measures: Accuracy                 Measures: Support, Confidence, Lift
Association Rules

- Association rule R: Itemset1 => Itemset2
  - Itemset1 and Itemset2 are disjoint and Itemset2 is non-empty
  - "if a transaction includes Itemset1 then it also has Itemset2"
- Examples:
  A, B => C
  A, B => C, E
  A => B, C
  A, B => D

TID  A  B  C  D  E
1    1  1  0  0  1
2    0  1  0  1  0
3    0  1  1  0  0
4    1  1  0  1  0
5    1  0  1  0  0
6    0  1  1  0  0
7    1  0  1  0  0
8    1  1  1  0  1
9    1  1  1  0  0
From Frequent Itemsets to Association Rules

Q: Given the frequent set {A, B, E}, what are the possible association rules?

A => B, E
A, B => E
A, E => B
B => A, E
B, E => A
E => A, B
__ => A, B, E (empty rule), or true => A, B, E
Rule Support and Confidence

Suppose R: I => J is an association rule.

- sup(R) = sup(I ∪ J) is the support count
  - the support of the itemset I ∪ J
- conf(R) = sup(I ∪ J) / sup(I) is the confidence of R
  - the fraction of transactions with I that also have J
- Association rules with given minimum support and confidence are sometimes called "strong" rules.
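Both measures follow directly from support(); a sketch using the example database from the earlier snippets:

```python
def rule_support(I, J, transactions):
    """sup(R) = sup(I union J)"""
    return support(set(I) | set(J), transactions)

def rule_confidence(I, J, transactions):
    """conf(R) = sup(I union J) / sup(I)"""
    return rule_support(I, J, transactions) / support(I, transactions)

print(rule_support({"A", "B"}, {"E"}, transactions))     # -> 2
print(rule_confidence({"A", "B"}, {"E"}, transactions))  # -> 0.5
```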
Measures for the rule Ant => Suc

             Suc        Non(Suc)
Ant          a          b           r = a + b
Non(Ant)     c          d           s = c + d
             k = a + c  l = b + d   n = r + s

a is the total number of transactions containing the items of Ant ∪ Suc.

support = a/n
confidence = a/r
cover = a/k

4ft quantifiers in LispMiner:

"above average": a/r > (1+p)·k/n means "when comparing the relative frequency of transactions meeting Suc in the full dataset and among all transactions which meet Ant, one finds that it is at least 100·p % higher in the second set".
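The quantifier's condition transcribes directly from the contingency counts. This is a sketch of the test itself, not LispMiner's actual interface:

```python
def above_average(a, b, c, d, p):
    """4ft 'above average': Suc is at least 100*p % more frequent among
    transactions meeting Ant than in the whole dataset."""
    r, k, n = a + b, a + c, a + b + c + d
    return a / r > (1 + p) * k / n
```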
Association Rules Example

conf(I => J) = sup(I ∪ J) / sup(I)

Q: Given the frequent set {A, B, E}, which association rules have minsup = 2 and minconf = 50%?

A, B => E: conf = 2/4 = 50%
A, E => B: conf = 2/2 = 100%
B, E => A: conf = 2/2 = 100%
E => A, B: conf = 2/2 = 100%

Don't qualify:
A => B, E: conf = 2/6 = 33% < 50%
B => A, E: conf = 2/7 = 28% < 50%
__ => A, B, E: conf = 2/9 = 22% < 50%

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C
Find Strong Association Rules

- A rule has the parameters minsup and minconf: sup(R) >= minsup and conf(R) >= minconf.
- Problem: find all association rules with given minsup and minconf.
- First, find all frequent itemsets.
Generating Association Rules

Two-stage process:

- Determine the frequent itemsets, e.g. with the Apriori algorithm.
- For each frequent itemset I:
  - for each non-empty subset J of I,
    - determine all association rules of the form: I − J => J.

Main idea used in both stages: the subset property. (A sketch of stage 2 follows below.)
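A sketch of stage 2 on top of the apriori() and support() functions from the earlier snippets; the empty-antecedent rule true => J is included, matching the earlier slide:

```python
def generate_rules(frequent, transactions, minconf):
    """For each frequent itemset and each non-empty subset J of it,
    emit (itemset - J) => J when its confidence reaches minconf."""
    rules = []
    for itemset in frequent:
        for k in range(1, len(itemset) + 1):
            for J in map(frozenset, combinations(itemset, k)):
                I = itemset - J      # antecedent (may be empty)
                conf = (support(itemset, transactions)
                        / support(I, transactions))
                if conf >= minconf:
                    rules.append((set(I), set(J), conf))
    return rules

# e.g. all strong rules with minsup = 2 and minconf = 50%
rules = generate_rules(apriori(transactions, 2), transactions, 0.5)
```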
Example: Generating Rules from an Itemset

Frequent itemset from the golf/tennis data:
{Humidity = Normal, Windy = False, Play = Yes}

Is this a frequent itemset? Its support is 4.

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
Example: Generating rules from the frequent set {Humidity = Normal, Windy = False, Play = Yes}

Seven potential rules:

If Humidity = Normal and Windy = False then Play = Yes             4/4
If Humidity = Normal and Play = Yes then Windy = False             4/6
If Windy = False and Play = Yes then Humidity = Normal             4/6
If Humidity = Normal then Windy = False and Play = Yes             4/7
If Windy = False then Humidity = Normal and Play = Yes             4/8
If Play = Yes then Humidity = Normal and Windy = False             4/9
If True then Humidity = Normal and Windy = False and Play = Yes    4/14

(Each ratio is the support of the full set over the support of the rule's antecedent; the weather table is as on the previous slide.)
Rules for the weather data

Rules with support > 1 and confidence = 100%:

No.  Association rule                                    Sup.  Conf.
1    Humidity=Normal ∧ Windy=False => Play=Yes           4     100%
2    Temperature=Cool => Humidity=Normal                 4     100%
3    Outlook=Overcast => Play=Yes                        4     100%
4    Temperature=Cool ∧ Play=Yes => Humidity=Normal      3     100%
...  ...                                                 ...   ...
58   Outlook=Sunny ∧ Temperature=Hot => Humidity=High    2     100%

In total: 3 rules with support four, 5 with support three, 50 with support two.
Weka associations

File: weather.nominal.arff
MinSupport: 0.2

[Screenshot of the Weka associator setup not reproduced.]
Filtering Association Rules

- Problem: any large dataset can lead to a very large number of association rules, even with reasonable minimal confidence and support.
- Confidence by itself is not sufficient!
  - e.g. if all transactions include Z, then any rule I => Z will have confidence 100%.
- Other measures are needed to filter rules.
Further WEKA measures for the rule Ant => Suc

             Suc        Non(Suc)
Ant          a          b           r = a + b
Non(Ant)     c          d           s = c + d
             k = a + c  l = b + d   n = r + s

support = a/n
confidence = a/r
cover = a/k

lift = (a/r) / (k/n) = a·n / (r·k)
  "Lift estimates the increase in precision of the default prediction of Suc on the set of transactions meeting Ant, compared to that on the whole dataset."

leverage = (a − r·k/n) / n
  "The ratio of 'extra' transactions covered by the rule, compared to those that would be covered if Ant and Suc were independent."

conviction = r·l / (b·n)
  "Similar to lift, but it considers transactions which are not covered by Suc."
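All of these measures follow mechanically from the four contingency counts; a sketch (conviction is taken as infinite when b = 0):

```python
def rule_measures(a, b, c, d):
    """WEKA-style measures for Ant => Suc from the 2x2 table counts."""
    r, s = a + b, c + d          # row sums
    k, l = a + c, b + d          # column sums
    n = r + s                    # all transactions
    return {
        "support":    a / n,
        "confidence": a / r,
        "cover":      a / k,
        "lift":       a * n / (r * k),
        "leverage":   (a - r * k / n) / n,
        "conviction": r * l / (b * n) if b else float("inf"),
    }
```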
Weka associations: output

[Screenshot of the Weka associator output not reproduced.]
Association Rule LIFT

- The lift of an association rule I => J is defined as: lift = P(J|I) / P(J)
  - Note: P(I) = (support of I) / (number of transactions)
  - lift is the ratio of the confidence to the expected confidence
- Interpretation:
  - if lift > 1, then I and J are positively correlated
  - if lift < 1, then I and J are negatively correlated
  - if lift = 1, then I and J are independent
Other issues

- The ARFF format is very inefficient for typical market basket data:
  - attributes represent items in a basket, and most items are usually missing.
- Interestingness of associations:
  - find unusual associations: milk usually goes with bread, but soy milk does not.
Beyond Binary Data

- Hierarchies
  - drink → milk → low-fat milk → Stop&Shop low-fat milk → …
  - find associations on any level
- Sequences over time
- …
Applications

- Market basket analysis
  - Store layout, client offers
  - Bookstore: offers of similar titles (see e.g. Amazon)
  - "Diapers and beer" urban legend
- Recommendations concerning new services or new customers, e.g.
  - if (Car=Porsche & Gender=Male & Age < 20) then (Risk=high & Insurance=high)
- Finding unusual events
  - WSARE – What is Strange About Recent Events
- …
Summary

- Frequent itemsets
- Association rules
- Subset property
- Apriori algorithm
- Application difficulties