
Association Rule Mining (II)

Instructor: Qiang Yang

Thanks: J. Han and J. Pei


Bottleneck of Frequent-Pattern Mining

Multiple database scans are costly

Mining long patterns needs many passes of scanning and generates lots of candidates

To find the frequent itemset i1 i2 … i100:

# of scans: 100

# of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30 !

Bottleneck: candidate-generation-and-test

Can we avoid candidate generation?
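As a quick sanity check of the arithmetic above, a short Python sketch (the variable names are ours, not from the lecture) sums the binomial coefficients:

```python
import math

# Total number of candidate itemsets over 100 items:
# C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1
total = sum(math.comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.3e}")  # ~1.268e+30
```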


FP-growth: Frequent-Pattern Mining Without Candidate Generation

Heuristic: let P be a frequent itemset, let S be the set of transactions that contain P, and let x be a frequent item in S. Then {x} ∪ P must be a frequent itemset.

No candidate generation!

A compact data structure, the FP-tree, stores the information needed for frequent-pattern mining.

A recursive mining algorithm mines the complete set of frequent patterns.

Example

TID   Items Bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

Min Support = 3

Scan the database (first scan)

List of frequent items, sorted by descending support (item:support):
<(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>

The root of the tree is created and labeled with "{}".

Scan the database (second scan)

Scanning the first transaction leads to the first branch of the tree:
<(f:1), (c:1), (a:1), (m:1), (p:1)>

Items in each transaction are ordered according to descending frequency.

Scanning TID=100

Transaction database: TID 100 contains f, a, c, d, g, i, m, p.

Keeping only the frequent items and sorting them by descending frequency gives <f, c, a, m, p>, which is inserted as the first branch:

{} (root) → f:1 → c:1 → a:1 → m:1 → p:1

The header table lists each item with its count (f:1, c:1, a:1, m:1, p:1) and a node-link to its node in the tree.

Scanning TID=200

TID 200 contains a, b, c, f, l, m, o.

Frequent single items: F1 = <f, c, a, b, m, p>

Possible frequent items in TID 200 (its intersection with F1, in F1 order): f, c, a, b, m

This shares the prefix <f, c, a> with the first branch <f, c, a, m, p>, so the counts along the shared prefix are incremented and two new children are generated: <b> (under a) and <m> (under b).

Scanning TID=200

Transaction database: TID 200, filtered and sorted, is f, c, a, b, m.

The tree after inserting it:

{} (root) → f:2 → c:2 → a:2, with two branches below a:
  m:1 → p:1 (from TID 100)
  b:1 → m:1 (new, from TID 200)

Header table counts: f:2, c:2, a:2, b:1, m:2, p:1; the node-link for m now chains its two nodes together.

The final FP-tree

Transaction database:

TID   Items
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

Min support = 3

Frequent 1-items in frequency-descending order: f, c, a, b, m, p

The final tree (the header table links each item to its nodes):

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
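The construction can be condensed into a compact Python sketch of the two-scan algorithm. All names here (Node, build_fp_tree, and the optional order argument used to pin down the lecture's f-before-c tie-break) are illustrative, not from the lecture:

```python
from collections import defaultdict

class Node:
    """FP-tree node: item label, count, parent link, children by item."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_sup, order=None):
    """Two scans: (1) count single items, (2) insert sorted transactions."""
    counts = defaultdict(int)
    for t in transactions:                              # scan 1
        for item in t:
            counts[item] += 1
    freq = {i for i, c in counts.items() if c >= min_sup}
    if order is None:   # descending frequency, ties broken alphabetically
        order = sorted(freq, key=lambda i: (-counts[i], i))
    rank = {item: k for k, item in enumerate(order)}
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:                              # scan 2
        node = root
        for item in sorted((i for i in t if i in freq), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])   # node-link
            node = node.children[item]
            node.count += 1
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, min_sup=3, order="fcabmp")
print([(i, sum(n.count for n in header[i])) for i in "fcabmp"])
# [('f', 4), ('c', 4), ('a', 3), ('b', 3), ('m', 3), ('p', 3)]
```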

FP-Tree Construction

The database is scanned only twice.

All subsequent mining is based on the FP-tree.

How to Mine an FP-tree?

Step 1: form the conditional pattern base

Step 2: construct the conditional FP-tree

Step 3: recursively mine the conditional FP-trees

Conditional Pattern Base

Let {I} be a frequent item. {I}'s conditional pattern base is a sub-database consisting of the prefix paths in the FP-tree that have item {I} as a co-occurring suffix.

Example:
{m} is a frequent item
{m}'s conditional pattern base:
  <f, c, a>: support = 2
  <f, c, a, b>: support = 1

Mine recursively on such conditional databases.
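Continuing the earlier sketch, m's conditional pattern base can be read off the tree by following the header table's node-links for m and walking parent pointers up to the root (again a sketch, using the same assumed helper):

```python
# Collect m's prefix paths via its node-links.
cond_base = []
for node in header['m']:
    path, p = [], node.parent
    while p is not None and p.item is not None:   # stop at the root
        path.append(p.item)
        p = p.parent
    cond_base.append((path[::-1], node.count))
print(cond_base)  # [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]
```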

Conditional Pattern Tree

Let {I} be a suffix item and {DB|I} be its conditional pattern base. The FP-tree built on {DB|I}, written Tree_I, is known as {I}'s conditional pattern tree.

Example:
{m} is a frequent item
{m}'s conditional pattern base:
  <f, c, a>: support = 2
  <f, c, a, b>: support = 1
{m}'s conditional pattern tree (b is dropped, since its count 1 is below the minimum support of 3):

{} (root) → f:3 → c:3 → a:3

Composition of Patterns α and β

Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is frequent in DB if and only if β is frequent in B.

Example:
Starting with α = {p}
{p}'s conditional pattern base (read off the tree): B =
  (f, c, a, m): 2
  (c, b): 1
Let β = {c}. Then α ∪ β = {p, c}, with support = 3.

Single-Path Tree

Let P be a single-path FP-tree, and let {I1, I2, …, Ik} be an itemset whose items all lie on P. Let Ij be the item with the lowest support among them. Then support({I1, I2, …, Ik}) = support(Ij).

Example: in {m}'s conditional pattern tree, the single path {} → f:3 → c:3 → a:3, every combination of f, c, and a has support 3.

FP_growth Algorithm (Fig. 6.10)

A recursive algorithm.

Input: a transaction database and min_supp
Output: the complete set of frequent patterns

1. FP-tree construction
2. Mining the FP-tree by calling FP_growth(FP_tree, null)

Key idea: treat single-path FP-trees and multi-path FP-trees separately; continue to split until a single-path FP-tree is obtained.

Procedure FP_growth(Tree, α)

If Tree contains a single path P, then
  for each combination β of the nodes in the path P:
    generate pattern β ∪ α with support = the minimum support of the nodes in β;
else, for each item a_i in the header of Tree, do {
  generate pattern β = a_i ∪ α with support = a_i.support;
  construct (1) β's conditional pattern base and (2) β's conditional FP-tree Tree_β;
  if Tree_β is not empty, then call FP_growth(Tree_β, β);
}
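Below is a minimal Python sketch of this recursion, reusing the hypothetical build_fp_tree helper from the construction slide. For simplicity it always takes the general branch; the single-path case is an optimization, and the output is the same either way:

```python
def fp_growth(header, min_sup, suffix=()):
    """Yield every frequent (itemset, support) mined from the tree
    whose node-links are stored in `header`."""
    for item in list(header):
        support = sum(n.count for n in header[item])
        pattern = (item,) + suffix
        yield pattern, support
        # Conditional pattern base: each prefix path, repeated
        # node.count times (memory-naive, but fine for a sketch).
        cond_db = []
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path[::-1]] * node.count)
        # Conditional FP-tree; build_fp_tree keeps only frequent items.
        _, cond_header = build_fp_tree(cond_db, min_sup)
        yield from fp_growth(cond_header, min_sup, pattern)

results = dict(fp_growth(header, min_sup=3))
print(len(results))                          # 18 frequent itemsets
print(results[('p',)], results[('c', 'p')])  # 3 3
```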

FP-Growth vs. Apriori: Scalability with the Support Threshold

[Figure: run time in seconds (0-100) vs. support threshold in % (0-3) on data set T25I20D10K, comparing D1 FP-growth run time against D1 Apriori run time. As the support threshold decreases, Apriori's run time grows sharply while FP-growth's stays low.]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: run time in seconds (0-140) vs. support threshold in % (0-2) on data set T25I20D100K, comparing D2 FP-growth against D2 TreeProjection. FP-growth scales better as the support threshold decreases.]

Why Is FP-Growth the Winner?

Divide-and-conquer:
  decompose both the mining task and the DB according to the frequent patterns obtained so far
  this leads to a focused search of smaller databases

Other factors:
  no candidate generation, no candidate test
  compressed database: the FP-tree structure
  no repeated scan of the entire database
  the basic operations are counting and FP-tree building, not pattern search and matching

Implications of the Methodology: Papers by Han, et al.

Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)

Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)

Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)

Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

Visualization of Association Rules: Pane Graph

[Figure]

Visualization of Association Rules: Rule Graph

[Figure]

Mining Various Kinds of Rules or Regularities

Multi-level and quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity

Classification, clustering, iceberg cubes, etc.

Multiple-Level Association Rules

Items often form a hierarchy.

Flexible support settings: items at a lower level are expected to have lower support.

Transaction databases can be encoded based on dimensions and levels; explore shared multi-level mining.

Uniform support:
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced support:
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Quantitative Association Rules

Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.

2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat

Cluster "adjacent" association rules to form general rules using a 2-D grid.

Example: age(X, "34-35") ∧ income(X, "30K-50K") ⇒ buys(X, "high resolution TV")

Redundant Rules [SA95]

Which rule is redundant?
  milk ⇒ wheat bread [support = 8%, confidence = 70%]
  "skim milk" ⇒ wheat bread [support = 2%, confidence = 72%]

The first rule is more general than the second.

A rule is redundant if its support is close to the "expected" value derived from a more general rule, and its confidence is close to that of the general rule.

INCREMENTAL MINING [CHNW96]

The rules in DB have been found, and a set of new tuples db is then added to DB.

Task: find the rules in DB + db. Usually, DB is much larger than db.

Properties of itemsets:
  Frequent in DB + db if frequent in both DB and db.
  Infrequent in DB + db if infrequent in both DB and db.
  If frequent only in DB, merge its stored DB count with its count in db: no DB scan is needed! (A code sketch of this case follows below.)
  If frequent only in db, scan DB once to update its itemset count.

The same principle is applicable to distributed/parallel mining.
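A minimal Python sketch of the "frequent only in DB" case, with made-up numbers standing in for the stored statistics:

```python
def support_in_increment(itemset, db):
    """Count transactions of the increment db containing `itemset`."""
    s = set(itemset)
    return sum(1 for t in db if s <= set(t))

# The itemset's exact count in DB is already stored, so only the
# (small) increment db has to be scanned.
count_DB, n_DB = 3, 5                     # hypothetical stored statistics
db = [list("fcb"), list("acm")]           # newly added transactions
count_total = count_DB + support_in_increment(("c",), db)
min_sup = 0.5                             # relative minimum support
print(count_total, count_total >= min_sup * (n_DB + len(db)))  # 5 True
```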

CORRELATION RULES

Association does not measure correlation [BMS97, AY98].

Among 5000 students:
  3000 play basketball, 3750 eat cereal, 2000 do both
  play basketball ⇒ eat cereal [support = 40%, confidence = 66.7%]

The conclusion "basketball and cereal are correlated" is misleading, because the overall percentage of students eating cereal is 75%, higher than 66.7%.

Confidence does not always give the correct picture!

Correlation Rules

corr(A, B) = P(A ∧ B) / (P(A) P(B)) = P(B|A) P(A) / (P(A) P(B)) = P(B|A) / P(B)

P(A ∧ B) = P(A) P(B) if A and B are independent events. If the value is less than 1, A and B are negatively correlated; otherwise A and B are positively correlated.

P(B|A) / P(B) is known as the lift of the rule A ⇒ B (lift is symmetric in A and B). If it is less than one, then A and B are negatively correlated.

Basketball ⇒ Cereal:
2000 / (3000 × 3750 / 5000) = 2000 × 5000 / (3000 × 3750) ≈ 0.89 < 1
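The lift computation for the basketball/cereal numbers, as a one-screen Python check:

```python
n = 5000
basketball, cereal, both = 3000, 3750, 2000

confidence = both / basketball        # P(cereal | basketball) ~= 0.667
lift = confidence / (cereal / n)      # P(cereal | basketball) / P(cereal)
print(round(lift, 3))                 # 0.889 < 1: negatively correlated
```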

Chi-square Correlation [BMS97]

             Item2   not Item2   row sum
Item1          1         2          3
not Item1      4         2          6
column sum     5         4          9

χ² = (1 − 3·5/9)² / (3·5/9) + (2 − 3·(9−5)/9)² / (3·(9−5)/9)
   + (4 − (9−3)·5/9)² / ((9−3)·5/9) + (2 − (9−3)·(9−5)/9)² / ((9−3)·(9−5)/9)
   = 0.9

The cutoff value at the 95% significance level (one degree of freedom) is 3.84 > 0.9.

Thus, we do not reject the independence assumption.
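The same test in a few lines of Python, assuming SciPy is available (Yates' continuity correction is disabled to reproduce the raw χ² above):

```python
from scipy.stats import chi2, chi2_contingency

observed = [[1, 2],   # Item1:     with Item2, without Item2
            [4, 2]]   # not Item1: with Item2, without Item2
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2))                  # 0.9
print(round(chi2.ppf(0.95, df=1), 2))  # 3.84, the 95% cutoff
# 0.9 < 3.84, so independence is not rejected.
```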

Constraint-based Data Mining

Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many, and not focused.

Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface).

Constraint-based mining:
  User flexibility: the user provides constraints on what is to be mined.
  System optimization: the system exploits such constraints for efficient mining.

Constraints in Data Mining

Knowledge type constraint: classification, association, etc.

Data constraint (using SQL-like queries): find product pairs sold together in stores in Vancouver in Dec. '00.

Dimension/level constraint: in relevance to region, price, brand, customer category.

Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200).

Interestingness constraint: strong rules (min_support ≥ 3%, min_confidence ≥ 60%).

Constrained Mining vs. Constraint-Based Search

Constrained mining vs. constraint-based search/reasoning:
  Both aim at reducing the search space.
  Finding all patterns satisfying the constraints vs. finding some (or one) answer in AI constraint-based search.
  Constraint-pushing vs. heuristic search.
  How to integrate the two is an interesting research problem.

Constrained mining vs. query processing in DBMS:
  Database query processing requires finding all answers.
  Constrained pattern mining shares a similar philosophy with pushing selections deep into query processing.

Constrained Frequent Pattern Mining: A Mining Query Optimization Problem

Given a frequent-pattern mining query with a set of constraints C, the algorithm should be:
  sound: it only finds frequent sets that satisfy the given constraints C
  complete: all frequent sets satisfying the given constraints C are found

A naïve solution: first find all frequent sets, and then test them for constraint satisfaction.

More efficient approaches:
  Analyze the properties of the constraints comprehensively.
  Push them as deeply as possible inside the frequent-pattern computation.

Anti-Monotonicity in Constraint-Based Mining

Anti-monotonicity: when an itemset S satisfies the constraint, so does any of its subsets.
  sum(S.Price) ≤ v is anti-monotone
  sum(S.Price) ≥ v is not anti-monotone

Example. C: range(S.profit) ≤ 15 is anti-monotone:
  Itemset ab violates C (range = 40 − 0 = 40 > 15)
  So does every superset of ab

TDB (min_sup = 2):

TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item     a    b    c    d    e    f    g    h
Profit   40   0   −20   10  −30   30   20  −10
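A tiny Python illustration of the pruning this enables, using the profit table above (the function name is ours):

```python
from itertools import combinations

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}

def satisfies_C(S, v=15):
    """C: range(S.profit) <= v, an anti-monotone constraint."""
    vals = [profit[i] for i in S]
    return max(vals) - min(vals) <= v

print(satisfies_C(('a', 'b')))   # False: range = 40 - 0 = 40 > 15
# Anti-monotonicity guarantees every superset of ab also violates C,
# so a miner may prune them all without testing:
print(any(satisfies_C(('a', 'b') + e) for e in combinations('cdefgh', 1)))
# False
```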

Which Constraints Are Anti-Monotone?

Constraint                        Anti-monotone
v ∈ S                             no
S ⊇ V                             no
S ⊆ V                             yes
min(S) ≤ v                        no
min(S) ≥ v                        yes
max(S) ≤ v                        yes
max(S) ≥ v                        no
count(S) ≤ v                      yes
count(S) ≥ v                      no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no
range(S) ≤ v                      yes
range(S) ≥ v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    yes
support(S) ≤ ξ                    no

Monotonicity in Constraint-Based Mining

Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets.
  sum(S.Price) ≥ v is monotone
  min(S.Price) ≤ v is monotone

Example. C: range(S.profit) ≥ 15:
  Itemset ab satisfies C (range = 40 − 0 = 40 ≥ 15)
  So does every superset of ab

TDB (min_sup = 2):

TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item     a    b    c    d    e    f    g    h
Profit   40   0   −20   10  −30   30   20  −10

Which Constraints Are Monotone?

Constraint                        Monotone
v ∈ S                             yes
S ⊇ V                             yes
S ⊆ V                             no
min(S) ≤ v                        yes
min(S) ≥ v                        no
max(S) ≤ v                        no
max(S) ≥ v                        yes
count(S) ≤ v                      no
count(S) ≥ v                      yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        yes
range(S) ≤ v                      no
range(S) ≥ v                      yes
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    no
support(S) ≤ ξ                    yes

Succinctness, Convertible, and Inconvertible Constraints (in the Book)

We will not consider these in this course.

Associative Classification

Mine possible association rules of the form itemset → class:
  Itemset: a set of attribute-value pairs
  Class: a class label

Build the classifier: organize the rules by decreasing precedence, based on confidence and support.

B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD'98.
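A sketch of the precedence idea in Python; the rules and attribute names are invented for illustration, and the ordering (confidence first, ties broken by support) follows the CBA-style precedence described above:

```python
# Hypothetical mined rules: (itemset, class_label, support, confidence)
rules = [({'age<=30', 'student=yes'}, 'buys=yes', 0.12, 0.90),
         ({'income=high'},            'buys=yes', 0.20, 0.75),
         ({'age>40'},                 'buys=no',  0.08, 0.75)]

# Decreasing precedence: higher confidence first, ties broken by support.
rules.sort(key=lambda r: (-r[3], -r[2]))

def classify(example, default='buys=no'):
    """Return the class of the first (highest-precedence) matching rule."""
    for itemset, label, _, _ in rules:
        if itemset <= example:          # the rule's condition holds
            return label
    return default

print(classify({'age<=30', 'student=yes', 'income=high'}))  # buys=yes
```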

Classification by Aggregating Emerging Patterns

Emerging pattern (EP): a pattern frequent in one class of data but infrequent in the others.
  Example: age <= 30 is frequent in class "buys_computer = yes" and infrequent in class "buys_computer = no"
  Rule: age <= 30 ⇒ buys computer

G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD'99.
