
Association Rule Mining (II)

Instructor: Qiang Yang

Thanks: J.Han and J. Pei


Bottleneck of Frequent-pattern Mining









Multiple database scans are costly

Mining long patterns needs many passes of scanning and generates lots of candidates



To find the frequent itemset $i_1 i_2 \ldots i_{100}$:

# of scans: 100

# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$
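The candidate count above can be checked directly. This is a small sketch; `candidate_count` is an illustrative name, not something from the slides.

```python
from math import comb

def candidate_count(n: int) -> int:
    """Number of non-empty subsets of an n-item pattern: sum of C(n, k) for k = 1..n."""
    return sum(comb(n, k) for k in range(1, n + 1))

total = candidate_count(100)
print(total == 2**100 - 1)   # the closed form quoted on the slide
print(f"{total:.2e}")        # order of magnitude of the candidate set
```

Every non-empty subset of a frequent 100-itemset is itself frequent, so a candidate-generation method has to produce all of them.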

Bottleneck: candidate-generation-and-test

Can we avoid candidate generation?

Frequent-pattern mining methods

FP-growth: Frequent-pattern Mining Without Candidate Generation









Heuristic: let P be a frequent itemset, S be the set of transactions that contain P, and x be an item. If x is a frequent item in S, then {x} ∪ P must be a frequent itemset

No candidate generation!

A compact data structure, FP-tree, to store information for frequent pattern mining

Recursive mining algorithm for mining complete set of frequent patterns


Example

TID   Items Bought
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n

Min Support = 3


Scan the database







List of frequent items, sorted: (item:support)



<(f:4), (c:4), (a:3),(b:3),(m:3),(p:3)>
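The first scan can be sketched in a few lines of Python. The transactions and the threshold are taken from the example; `frequent_items` is an illustrative name. Ties in support (f and c, then a, b, m, p) can be ordered in any fixed way; the slides use f, c, a, b, m, p.

```python
from collections import Counter

# The five transactions of the example (TIDs 100 to 500).
TRANSACTIONS = [
    list("facdgimp"),
    list("abcflmo"),
    list("bfhjo"),
    list("bcksp"),
    list("afcelpmn"),
]
MIN_SUPPORT = 3

def frequent_items(transactions, min_support):
    """One pass over the database: count each item once per transaction."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: n for item, n in counts.items() if n >= min_support}

supports = frequent_items(TRANSACTIONS, MIN_SUPPORT)
print(sorted(supports.items(), key=lambda kv: -kv[1]))
```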

The root of the tree is created and labeled with “{}”

Scan the database





Scanning the first transaction leads to the first branch of the tree:

<(f:1),(c:1),(a:1),(m:1),(p:1)>

Order according to frequency


Scanning TID=100

Transaction Database

TID   Items
100   f,a,c,d,g,i,m,p

Header Table (item: count, each with a node-link into the tree): f:1, c:1, a:1, m:1, p:1

FP-tree after TID=100 (a single branch):

{} (root) -> f:1 -> c:1 -> a:1 -> m:1 -> p:1


Scanning TID=200

TID   Items Bought
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n





Frequent Single Items:



F1=<f,c,a,b,m,p>

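TID=200

Possible frequent items:

Intersect with F1 and order by frequency: f,c,a,b,m

Along the first branch <f,c,a,m,p>, the shared prefix is:

<f,c,a> (its counts are incremented to 2)

Generate two new nodes below a:

<b>, <m> (b:1 as a child of a, m:1 as a child of b)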


Scanning TID=200

Transaction Database

TID   Items
200   f,c,a,b,m

Header Table: items f, c, a, b, m, p, each with a node-link into the tree

FP-tree after TID=200:

{} (root)
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1


The final FP-tree

Transaction Database

TID   Items
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n

Min support = 3

Frequent 1-items in frequency descending order: f,c,a,b,m,p

Header Table: items f, c, a, b, m, p, each with a node-link into the tree

FP-tree:

{} (root)
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1


FP-Tree Construction





Scans the database only twice

Subsequent mining: based on the FP-tree
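The second scan can be sketched as follows, assuming the frequent-item order f, c, a, b, m, p found in the first scan. `Node`, `build_fp_tree` and `show` are illustrative names, not part of the slides.

```python
TRANSACTIONS = [
    list("facdgimp"),
    list("abcflmo"),
    list("bfhjo"),
    list("bcksp"),
    list("afcelpmn"),
]
ORDER = "fcabmp"  # frequent items in descending support, as on the slides

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> Node

def build_fp_tree(transactions, order):
    """Insert every transaction, restricted to `order` and sorted by it."""
    rank = {item: i for i, item in enumerate(order)}
    root = Node(None, None)
    header = {item: [] for item in order}  # item -> nodes carrying it (node-links)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # shared prefixes only increment counts
            node = child
    return root, header

def show(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines

root, header = build_fp_tree(TRANSACTIONS, ORDER)
print("\n".join(show(root)))
```

Transactions that share a prefix share a branch, which is where the compression comes from: 5 transactions with 19 frequent-item occurrences are stored in 11 nodes.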


How to Mine an FP-tree?



Step 1: form conditional pattern base



Step 2: construct conditional FP-tree



Step 3: recursively mine conditional FP-trees


Conditional Pattern Base





Let {I} be a frequent item

{I}'s conditional pattern base is a sub-database which

  consists of the set of prefix paths in the FP-tree

  with item {I} as a co-occurring suffix pattern

Example:







{m} is a frequent item

{m}’s conditional pattern base:





<f,c,a>: support =2

<f,c,a,b>: support = 1

Mine recursively on such databases

[Figure: the final FP-tree, from which the prefix paths of m are read]
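The conditional pattern bases quoted above can be recomputed without walking the tree: the prefix path of an item in the FP-tree is the part of each ordered transaction that precedes that item. The sketch below relies on this equivalence; `conditional_pattern_base` is an illustrative name and the item is assumed to be frequent.

```python
from collections import Counter

TRANSACTIONS = [
    list("facdgimp"),
    list("abcflmo"),
    list("bfhjo"),
    list("bcksp"),
    list("afcelpmn"),
]
ORDER = "fcabmp"  # frequent items in descending support, as on the slides

def conditional_pattern_base(transactions, order, item):
    """Map each prefix path of `item` to the number of transactions sharing it."""
    rank = {x: i for i, x in enumerate(order)}
    base = Counter()
    for t in transactions:
        if item not in t:
            continue
        ordered = sorted((x for x in set(t) if x in rank), key=rank.get)
        prefix = tuple(ordered[:ordered.index(item)])
        if prefix:  # an item directly below the root has no prefix path
            base[prefix] += 1
    return dict(base)

print(conditional_pattern_base(TRANSACTIONS, ORDER, "m"))
print(conditional_pattern_base(TRANSACTIONS, ORDER, "p"))
```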


Conditional Pattern Tree







Let {I} be a suffix item, {DB|I} be the conditional pattern base

The frequent pattern tree Tree_I built from {DB|I} is known as the conditional pattern tree

Example:







{m} is a frequent item

{m}’s conditional pattern base:





<f,c,a>: support =2

<f,c,a,b>: support = 1

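{m}'s conditional pattern tree: {} -> f:3 -> c:3 -> a:3 (b is dropped because its support in the base is 1 < 3)

[Figure: the source path {} -> f:4 -> c:3 -> a:3 -> m:2 in the FP-tree]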


Composition of patterns α and β

Let α be a frequent item in DB, B be α's conditional pattern base, and β be an itemset in B.

Then α ∪ β is frequent in DB if and only if β is frequent in B.

Example:

Starting with α = {p}

{p}'s conditional pattern base (from the tree) B =

  (f,c,a,m): 2

  (c,b): 1

Let β be {c}, with support 2 + 1 = 3 in B.

Then α ∪ β = {p,c}, with support = 3.
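The claim can be checked on the running example: the support of an itemset inside {p}'s conditional pattern base equals the support of that itemset together with p in the full database. The helper names below are illustrative.

```python
TRANSACTIONS = [
    set("facdgimp"),
    set("abcflmo"),
    set("bfhjo"),
    set("bcksp"),
    set("afcelpmn"),
]
# {p}'s conditional pattern base as given on the slide: prefix path -> count.
P_BASE = {("f", "c", "a", "m"): 2, ("c", "b"): 1}

def support_in_db(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support_in_base(itemset, base):
    """Weighted number of prefix paths that contain every item of `itemset`."""
    return sum(count for path, count in base.items() if itemset <= set(path))

for beta in ({"c"}, {"f"}, {"c", "b"}):
    print(sorted(beta), support_in_base(beta, P_BASE), support_in_db(beta | {"p"}, TRANSACTIONS))
```

With min support 3, only {c} is frequent in the base, so {p,c} is the only pattern grown from p.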


Single path tree











Let P be a single-path FP-tree

Let $\{I_1, I_2, \ldots, I_k\}$ be an itemset in the tree

Let $I_j$ have the lowest support

Then $support(\{I_1, I_2, \ldots, I_k\}) = support(I_j)$

Example:

[Figure: the final FP-tree of the running example, with nodes f:4, c:3, a:3, m:2, p:2, b:1, m:1, b:1, c:1, b:1, p:1]

FP_growth Algorithm Fig 6.10









Recursive Algorithm

Input: A transaction database, min_supp

Output: The complete set of frequent patterns





1. FP-Tree construction



2. Mining FP-Tree by calling FP_growth(FP_tree, null)

Key Idea: consider single path FP-tree and multi-path FP-tree separately

Continue to split until a single-path FP-tree is obtained


FP_Growth(tree, α)

If tree contains a single path P, then

  For each combination (denoted as β) of the nodes in the path P

    Generate pattern β ∪ α with support = minimum support of the nodes in β

Else for each a_i in the header of tree, do {

  Generate pattern β = a_i ∪ α with support = a_i.support;

  Construct (1) β's conditional pattern base and (2) β's conditional FP-tree Tree_β

  If Tree_β is not empty, then

    Call FP_growth(Tree_β, β);

}
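The procedure can be sketched in Python as below. This is one possible reading of the algorithm, not the reference implementation: each conditional FP-tree is rebuilt from its weighted conditional pattern base, ties in support are broken alphabetically, and all function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted, min_support):
    """Build an FP-tree from (items, weight) pairs, dropping infrequent items."""
    counts = defaultdict(int)
    for items, w in weighted:
        for i in items:
            counts[i] += w
    freq = {i: c for i, c in counts.items() if c >= min_support}
    rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    root, header = Node(None, None), defaultdict(list)
    for items, w in weighted:
        node = root
        for i in sorted((i for i in items if i in rank), key=rank.get):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += w
    return root, header

def single_path(root):
    """Return the nodes of the tree if it is one chain, else None."""
    path, node = [], root
    while node.children:
        if len(node.children) > 1:
            return None
        node = next(iter(node.children.values()))
        path.append(node)
    return path

def fp_growth(weighted, min_support, alpha=frozenset()):
    """Return {pattern: support} for all frequent patterns that extend alpha."""
    root, header = build_tree(weighted, min_support)
    patterns = {}
    path = single_path(root)
    if path is not None:  # single path: every combination of its nodes is a pattern
        for k in range(1, len(path) + 1):
            for combo in combinations(path, k):
                patterns[alpha | {n.item for n in combo}] = min(n.count for n in combo)
        return patterns
    for item, nodes in header.items():  # multi-path: recurse on each header item
        beta = alpha | {item}
        patterns[beta] = sum(n.count for n in nodes)
        base = []  # beta's conditional pattern base
        for n in nodes:
            prefix, p = [], n.parent
            while p.item is not None:
                prefix.append(p.item)
                p = p.parent
            if prefix:
                base.append((prefix, n.count))
        patterns.update(fp_growth(base, min_support, beta))
    return patterns

TRANSACTIONS = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
patterns = fp_growth([(t, 1) for t in TRANSACTIONS], min_support=3)
for pattern, support in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(pattern)), support)
```

On the running example this yields 18 frequent patterns, among them {f,c,a,m} and {c,p} with support 3. An empty conditional tree counts as a single path with no nodes, so the recursion stops there without a separate emptiness test.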


FP-Growth vs. Apriori: Scalability With the Support Threshold

Data set T25I20D10K

[Figure: runtime of D1 FP-growth and D1 Apriori plotted against support threshold (%), from 0 to 3; vertical axis ticks from 0 to 100]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

Data set T25I20D100K

[Figure: D2 FP-growth and D2 TreeProjection plotted against support threshold (%), from 0 to 2; vertical axis ticks from 0 to 140]

Why Is FP-Growth the Winner?





Divide-and-conquer:

  decompose both the mining task and the DB according to the frequent patterns obtained so far

  leads to focused search of smaller databases

Other factors

  no candidate generation, no candidate test

  compressed database: FP-tree structure

  no repeated scan of the entire database

  basic operations are counting and FP-tree building, not pattern search and matching


Implications of the Methodology:

Papers by Han, et al.



Mining closed frequent itemsets and max-patterns



CLOSET (DMKD’00)



Mining sequential patterns



FreeSpan (KDD’00), PrefixSpan (ICDE’01)



Constraint-based mining of frequent patterns



Convertible constraints (KDD’00, ICDE’01)



Computing iceberg data cubes with complex measures



H-tree and H-cubing algorithm (SIGMOD’01)


Visualization of Association Rules:

Pane Graph


Visualization of Association Rules:

Rule Graph


Mining Various Kinds of Rules or Regularities



Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity



Classification, clustering, iceberg cubes, etc.


Multiple-level Association Rules









Items often form a hierarchy

Flexible support settings: items at the lower level are expected to have lower support

The transaction database can be encoded based on dimensions and levels

Explore shared multi-level mining

Uniform support vs. reduced support

[Figure: item hierarchy annotated with supports. Level 1 (min_sup = 5%): Milk [support = 10%]; below it: 2% Milk [support = 6%] and Skim Milk [support = 4%]]

Quantitative Association Rules



Numeric attributes are dynamically discretized

  such that the confidence or compactness of the rules mined is maximized.

2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat

Cluster "adjacent" association rules to form general rules using a 2-D grid.

Example: age(X,"34-35") ∧ income(X,"30K - 50K") ⇒ buys(X,"high resolution TV")


Redundant Rules [SA95]







Which rule is redundant?

  milk ⇒ wheat bread [support = 8%, confidence = 70%]

  "skim milk" ⇒ wheat bread [support = 2%, confidence = 72%]

The first rule is more general than the second rule.

A rule is redundant if its support is close to the "expected" value, based on a general rule, and its confidence is close to that of the general rule.


INCREMENTAL MINING

[CHNW96]









Rules in DB were found, and a set of new tuples db is added to DB

Task: to find new rules in DB + db.

  Usually, DB is much larger than db.

Properties of itemsets:

  frequent in DB + db if frequent in both DB and db.

  infrequent in DB + db if infrequent in both DB and db.

  frequent only in DB: merge with counts in db.

    No DB scan is needed!

  frequent only in db: scan DB once to update their itemset counts.

Same principle applicable to distributed/parallel mining.
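The four cases can be sketched as a single merge step. The sketch assumes the frequent itemsets of DB and of db, with their counts, are already available under the same relative threshold; an itemset that is frequent in DB + db must be frequent in at least one of the two parts, so their union is a complete candidate set. The data and the names `count_in` and `merge_frequent` are hypothetical.

```python
def count_in(itemset, transactions):
    """Count the transactions that contain `itemset` (one scan)."""
    return sum(1 for t in transactions if itemset <= t)

def merge_frequent(old_frequent, new_frequent, old_db, new_db, min_ratio):
    """Combine per-part counts; scan a part only for itemsets not frequent there."""
    total = len(old_db) + len(new_db)
    merged = {}
    for itemset in set(old_frequent) | set(new_frequent):
        # Known count if the itemset was frequent in that part, otherwise count it.
        old = old_frequent[itemset] if itemset in old_frequent else count_in(itemset, old_db)
        new = new_frequent[itemset] if itemset in new_frequent else count_in(itemset, new_db)
        if old + new >= min_ratio * total:
            merged[itemset] = old + new
    return merged

# Hypothetical data: 8 old transactions, 2 new ones, relative threshold 50%.
OLD_DB = [{"a", "b"}] * 4 + [{"a", "c"}] * 2 + [{"c"}] * 2
NEW_DB = [{"c"}, {"c", "a"}]
OLD_FREQUENT = {frozenset("a"): 6, frozenset("b"): 4, frozenset("c"): 4, frozenset("ab"): 4}
NEW_FREQUENT = {frozenset("a"): 1, frozenset("c"): 2, frozenset("ac"): 1}

print(merge_frequent(OLD_FREQUENT, NEW_FREQUENT, OLD_DB, NEW_DB, 0.5))
```

Only {a,c}, frequent in db alone, forces a look at the large DB here; itemsets already frequent in DB need counting in the small db only.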

CORRELATION RULES





Association does not measure correlation

[BMS97, AY98].







Among 5000 students

  3000 play basketball, 3750 eat cereal, 2000 do both

  play basketball ⇒ eat cereal [support = 40%, confidence = 66.7%]

Conclusion: "basketball and cereal are correlated" is misleading

  because the overall percentage of students eating cereal is 75%, higher than 66.7%.

Confidence does not always give the correct picture!








Correlation Rules

$\frac{P(A \wedge B)}{P(A)\,P(B)} = \frac{P(B|A)\,P(A)}{P(A)\,P(B)} = \frac{P(B|A)}{P(B)}$

P(A∧B) = P(A)*P(B) if A and B are independent events, in which case the value is 1

A and B are negatively correlated if the value is less than 1;

A and B are positively correlated if the value is greater than 1.

P(B|A)/P(B) is known as the lift of rule A ⇒ B

If it is less than one, then A and B are negatively correlated.

Basketball ⇒ Cereal: 2000/(3000*3750/5000) = 2000*5000/(3000*3750) ≈ 0.89 < 1
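The basketball and cereal numbers can be recomputed from the raw counts. The function names below are illustrative.

```python
def confidence(n_a, n_ab):
    """P(B|A) estimated from counts."""
    return n_ab / n_a

def lift(n_total, n_a, n_b, n_ab):
    """P(A and B) / (P(A) * P(B)) estimated from counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# 5000 students: 3000 play basketball, 3750 eat cereal, 2000 do both.
conf = confidence(3000, 2000)
value = lift(5000, 3000, 3750, 2000)
print(f"confidence = {conf:.3f}, lift = {value:.3f}")
```

The confidence looks high, but the lift is below 1: playing basketball makes eating cereal slightly less likely than in the student population as a whole.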


Chi-square Correlation [BMS97]

              Item1   not Item1   row sum
Item2           1         2          3
not Item2       4         2          6
column sum      5         4          9

$\chi^2 = \frac{(1 - 3 \cdot 5/9)^2}{3 \cdot 5/9} + \frac{(2 - 3 \cdot (9-5)/9)^2}{3 \cdot (9-5)/9} + \frac{(4 - (9-3) \cdot 5/9)^2}{(9-3) \cdot 5/9} + \frac{(2 - (9-3)(9-5)/9)^2}{(9-3)(9-5)/9} = 0.9$

The cutoff value at the 95% significance level is 3.84 > 0.9

Thus, we do not reject the independence assumption.
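The statistic can be recomputed from the contingency table. In this sketch the expected count of each cell is (row sum × column sum) / total, as in the formula; `chi_square` is an illustrative name and 3.84 is the cutoff quoted on the slide.

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table given as rows of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            total += (observed - expected) ** 2 / expected
    return total

OBSERVED = [[1, 2],
            [4, 2]]
CUTOFF_95 = 3.84  # cutoff quoted on the slide

statistic = chi_square(OBSERVED)
print(f"chi-square = {statistic:.2f}, reject independence: {statistic > CUTOFF_95}")
```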


Constraint-based Data Mining







Finding all the patterns in a database autonomously ? — unrealistic!



The patterns could be too many but not focused!

Data mining should be an interactive process



User directs what to be mined using a data mining query language (or a graphical user interface)

Constraint-based mining



User flexibility: provides constraints on what to be mined



System optimization: explores such constraints for efficient mining— constraint-based mining


Constraints in Data Mining











Knowledge type constraint:

  classification, association, etc.

Data constraint, using SQL-like queries:

  find product pairs sold together in stores in Vancouver in Dec.'00

Dimension/level constraint:

  in relevance to region, price, brand, customer category

Rule (or pattern) constraint:

  small sales (price < $10) triggers big sales (sum > $200)

Interestingness constraint:

  strong rules: min_support ≥ 3%, min_confidence ≥ 60%


Constrained Mining vs. Constraint-Based Search

Constrained mining vs. constraint-based search/reasoning

  Both are aimed at reducing the search space

  Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI

  Constraint-pushing vs. heuristic search

  How to integrate them is an interesting research problem

Constrained mining vs. query processing in DBMS

  Database query processing requires finding all answers

  Constrained pattern mining shares a similar philosophy with pushing selections deeply in query processing


Constrained Frequent Pattern Mining: A Mining Query Optimization Problem







Given a frequent pattern mining query with a set of constraints C, the algorithm should be

  sound: it only finds frequent sets that satisfy the given constraints C

  complete: all frequent sets satisfying the given constraints C are found

A naïve solution



First find all frequent sets, and then test them for constraint satisfaction

More efficient approaches:





Analyze the properties of constraints comprehensively

Push them as deeply as possible inside the frequent pattern computation.






Anti-Monotonicity in Constraint-Based Mining

Anti-monotonicity

  When an itemset S satisfies the constraint, so does any of its subsets

  sum(S.Price) ≤ v is anti-monotone

  sum(S.Price) ≥ v is not anti-monotone

Example. C: range(S.profit) ≤ 15 is anti-monotone

  Itemset ab violates C

  So does every superset of ab

TDB (min_sup=2)

TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a       40
b        0
c      -20
d       10
e      -30
f       30
g       20
h      -10
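The pruning argument can be checked by brute force. The sketch assumes the constraint is range(S.profit) ≤ 15 (the comparison operator is lost in the extracted text) and the item profits a = 40, b = 0, c = -20, d = 10, e = -30, f = 30, g = 20, h = -10; the function names are illustrative.

```python
from itertools import combinations

# Item profits as read from the slide's table (assumed assignment, see above).
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def profit_range(itemset):
    values = [PROFIT[item] for item in itemset]
    return max(values) - min(values)

def satisfies_c(itemset, v=15):
    """Constraint C: range(S.profit) <= v."""
    return profit_range(itemset) <= v

def supersets(itemset, universe):
    """Yield every superset of `itemset` within `universe`, including itself."""
    rest = sorted(set(universe) - set(itemset))
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            yield set(itemset) | set(extra)

violating = [s for s in supersets("ab", PROFIT) if not satisfies_c(s)]
print(profit_range("ab"), len(violating))
```

Adding items can only widen the range, so once ab fails the constraint, none of its 64 supersets needs to be generated or counted.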

Which Constraints Are Anti-Monotone?

Constraint                        Antimonotone
v ∈ S                             no
S ⊇ V                             no
S ⊆ V                             yes
min(S) ≤ v                        no
min(S) ≥ v                        yes
max(S) ≤ v                        yes
max(S) ≥ v                        no
count(S) ≤ v                      yes
count(S) ≥ v                      no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no
range(S) ≤ v                      yes
range(S) ≥ v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    yes
support(S) ≤ ξ                    no


Monotonicity in Constraint-Based Mining

Monotonicity

  When an itemset S satisfies the constraint, so does any of its supersets

  sum(S.Price) ≥ v is monotone

  min(S.Price) ≤ v is monotone

Example. C: range(S.profit) ≥ 15

  Itemset ab satisfies C

  So does every superset of ab

TDB (min_sup=2)

TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a       40
b        0
c      -20
d       10
e      -30
f       30
g       20
h      -10

Which Constraints Are Monotone?

Constraint                        Monotone
v ∈ S                             yes
S ⊇ V                             yes
S ⊆ V                             no
min(S) ≤ v                        yes
min(S) ≥ v                        no
max(S) ≤ v                        no
max(S) ≥ v                        yes
count(S) ≤ v                      no
count(S) ≥ v                      yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        yes
range(S) ≤ v                      no
range(S) ≥ v                      yes
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    no
support(S) ≤ ξ                    yes

Succinctness, Convertible, Inconvertible Constraints in Book



We will not consider these in this course.


Associative Classification







Mine possible association rules in the form itemset ⇒ class





Itemset: a set of attribute-value pairs

Class: class label

Build Classifier



Organize rules according to decreasing precedence based on confidence and support
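The classifier-building step can be sketched as sorting the mined rules and letting the first matching rule decide. The rules, their support and confidence values, and the default class below are hypothetical, and the names `Rule`, `build_classifier` and `classify` are illustrative; this is a simplified reading of the idea, not the full method of the cited paper.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    itemset: frozenset   # attribute-value pairs on the left-hand side
    label: str           # class label on the right-hand side
    support: float
    confidence: float

def build_classifier(rules):
    """Order rules by decreasing confidence, then decreasing support."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.support))

def classify(ordered_rules, record, default):
    """Return the label of the first rule whose itemset is contained in the record."""
    for rule in ordered_rules:
        if rule.itemset <= record:
            return rule.label
    return default

# Hypothetical rules for illustration only.
RULES = [
    Rule(frozenset({"age<=30"}), "yes", 0.30, 0.80),
    Rule(frozenset({"student=no"}), "no", 0.40, 0.70),
    Rule(frozenset({"age<=30", "student=no"}), "no", 0.10, 0.90),
]
classifier = build_classifier(RULES)
print(classify(classifier, {"age<=30", "student=no"}, default="yes"))
print(classify(classifier, {"age<=30", "student=yes"}, default="yes"))
```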

B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD’98


Classification by Aggregating

Emerging Patterns





Emerging pattern (EP): A pattern frequent in one class of data but infrequent in others.





Age<=30 is frequent in class "buys_computer=yes" and infrequent in class "buys_computer=no"

Rule: age<=30 ⇒ buys computer

G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD’99

