Frequent Pattern and Association Analysis
(based on the slides for the book Data Mining: Concepts and Techniques)
SAD Tagus 2004/05, H. Galhardas
Frequent Pattern and Association Analysis
- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of association rule mining
Motivation
- Finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Examples: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc.
Basic Concepts (1)
- Item: a Boolean variable representing its presence or absence
- Basket: a Boolean vector of such variables
- Baskets are analyzed to discover patterns of items that are frequently associated (bought) together
- Association rules express such association patterns: X => Y
Basic Concepts (2)
  Transaction-id | Items bought
  10             | A, B, D
  20             | A, C, D
  30             | A, D, E
  40             | B, E, F
  50             | B, C, D, E, F

(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both)

- Itemset X = {x1, ..., xk}
- Find all the rules X => Y with minimum support and confidence
- Support, s: probability that a transaction contains X ∪ Y
  support = #tuples(X ∪ Y) / #total tuples = P(X ∪ Y)
- Confidence, c: conditional probability that a transaction having X also contains Y
  confidence = #tuples(X ∪ Y) / #tuples(X) = P(Y|X)
Basic Concepts (3)
  Transaction-id | Items bought
  10             | A, B, D
  20             | A, C, D
  30             | A, D, E
  40             | B, E, F
  50             | B, C, D, E, F

- Interesting (or strong) rules: rules that satisfy both a minimum support threshold and a minimum confidence threshold
- Let sup_min = 50%, conf_min = 50%
- Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
- Association rules:
  - A => D (60%, 100%): 60% of all analyzed transactions show that A and D are bought together; 100% of customers that bought A also bought D
  - D => A (60%, 75%)
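As a concrete check of the definitions above, here is a minimal Python sketch (not from the slides; the function names are illustrative) that recomputes support and confidence for the rules A => D and D => A on the five example transactions.

```python
# Minimal sketch: support and confidence on the example transaction DB.
transactions = [
    {"A", "B", "D"},           # 10
    {"A", "C", "D"},           # 20
    {"A", "D", "E"},           # 30
    {"B", "E", "F"},           # 40
    {"B", "C", "D", "E", "F"}, # 50
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "D"}))       # 0.6  -> A => D holds with 60% support
print(confidence({"A"}, {"D"}))  # 1.0  -> ... and 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D => A holds with 75% confidence
```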
Basic Concepts (4)
- Itemset (or pattern): a set of items (a k-itemset has k items)
- Occurrence frequency of an itemset: the number of transactions containing the itemset; also known as the frequency, support count, or count of the itemset
- Frequent itemset: an itemset that satisfies minimum support, i.e., frequency >= min_sup * #total transactions
- Confidence(A => B) = P(B|A) = frequency(A ∪ B) / frequency(A)
- The problem of mining association rules reduces to the problem of mining frequent itemsets
Mining frequent itemsets
- A long pattern contains a combinatorial number of sub-patterns
- Solution: mine closed patterns and max-patterns
- An itemset X is closed frequent if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X
- An itemset X is a max-itemset if X is frequent and there exists no frequent super-itemset Y ⊃ X
- The closed itemsets are a lossless compression of the frequent patterns: they reduce the number of patterns and rules
Example
- DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 1
- What is the set of closed itemsets?
  - <a1, ..., a100>: 1
  - <a1, ..., a50>: 2
- What is the set of max-patterns?
  - <a1, ..., a100>: 1
- What is the set of all patterns?
  - All 2^100 - 1 non-empty subsets of <a1, ..., a100> are frequent: far too many to enumerate!!
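To make the definitions concrete, here is a small sketch (illustrative names, not from the slides) that picks out the closed and max patterns from a dictionary of frequent itemsets and their support counts, using the frequent itemsets of the Basic Concepts example (min_sup = 3).

```python
# Sketch: closed and max patterns from a dict of frequent itemsets and counts.
frequent = {
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
    frozenset("E"): 3, frozenset("AD"): 3,
}

def closed_patterns(frequent):
    # X is closed if no frequent proper superset has the same support.
    return {x for x, sup in frequent.items()
            if not any(x < y and frequent[y] == sup for y in frequent)}

def max_patterns(frequent):
    # X is maximal if no frequent proper superset exists at all.
    return {x for x in frequent if not any(x < y for y in frequent)}

print(closed_patterns(frequent))  # {B}, {D}, {E}, {A, D}  ({A} is absorbed by {A, D})
print(max_patterns(frequent))     # {B}, {E}, {A, D}
```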
Frequent Pattern and Association Analysis
- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
Scalable Methods for Mining Frequent Patterns
- Downward closure (or Apriori) property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @VLDB'94)
  - Frequent-pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD'00)
  - Vertical data format approach (Charm: Zaki & Hsiao @SDM'02)
Apriori: A Candidate Generation-and-Test Approach
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
How to Generate Candidates?
- Suppose the items in L_{k-1} are listed in some order
- Step 1: self-joining L_{k-1}
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
- Step 2: pruning
    forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k
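The two steps above translate into a short Python sketch; this is only an illustration (the name apriori_gen is mine, not the book's), assuming each itemset in L_{k-1} is a tuple of items kept in sorted order. It reproduces the candidate-generation example given on the "Important Details of Apriori" slide further below.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Generate C_k from L_{k-1}; each itemset is a sorted tuple of items."""
    prev = set(prev_frequent)
    k_minus_1 = len(next(iter(prev)))
    candidates = set()
    # Step 1: self-join -- merge two itemsets that agree on their first k-2 items.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2: prune -- drop any candidate with an infrequent (k-1)-subset.
    return {c for c in candidates
            if all(s in prev for s in combinations(c, k_minus_1))}

# Example from the "Important Details of Apriori" slide: L3 = {abc, abd, acd, ace, bcd}.
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))  # {('a', 'b', 'c', 'd')} -- acde is pruned since ade is not in L3
```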
Example
Sup_min = 2

Database TDB:
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

C1 (candidate 1-itemsets, counted in the 1st scan):
  {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets):
  {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidate 2-itemsets generated from L1, counted in the 2nd scan):
  {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets):
  {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidate 3-itemsets generated from L2, counted in the 3rd scan):
  {B,C,E}:2
L3 (frequent 3-itemsets):
  {B,C,E}:2
The Apriori Algorithm
Pseudo-code:
  C_k: candidate itemsets of size k
  L_k: frequent itemsets of size k

  L_1 = {frequent items};
  for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in the database do
      increment the count of all candidates in C_{k+1} that are contained in t;
    L_{k+1} = candidates in C_{k+1} with min_support;
  end
  return ∪_k L_k;
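A compact, runnable Python rendering of this pseudo-code (a sketch, not the book's implementation): itemsets are frozensets, candidates are generated by joining L_k itemsets and pruning with the Apriori property, and the TDB from the trace example above is used to check the result.

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    all_frequent, k = dict(frequent), 1
    while frequent:
        # Generate C_{k+1} by joining L_k itemsets and pruning by the Apriori property.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count the surviving candidates in one pass over the database.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_support=2))
# expected: {A}:2 {B}:3 {C}:3 {E}:3 {A,C}:2 {B,C}:2 {B,E}:3 {C,E}:2 {B,C,E}:2
```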
Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining L_k
  - Step 2: pruning
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning:
    - acde is removed because ade is not in L3
  - C4 = {abcd}
Another example (1)
(figure-only slide; the worked example is not recoverable from the text extraction)

Another example (2)
(figure-only slide; continuation of the worked example)
Generation of association rules from frequent itemsets
- Confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
  - support_count(A ∪ B): number of transactions containing the itemset A ∪ B
  - support_count(A): number of transactions containing the itemset A
- Association rules can be generated as follows:
  - For each frequent itemset l, generate all nonempty subsets of l
  - For every nonempty subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf
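This procedure is a few lines of Python; the sketch below (illustrative names) assumes a dictionary mapping every frequent itemset to its support count, such as the one returned by the Apriori sketch earlier.

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Return (lhs, rhs, confidence) for each rule s => (l - s) meeting min_conf."""
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue
        # Every nonempty proper subset s of the frequent itemset l.
        for r in range(1, len(itemset)):
            for s in map(frozenset, combinations(itemset, r)):
                conf = count / support_counts[s]  # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((set(s), set(itemset - s), conf))
    return rules

# Using the Basic Concepts example: {A, D} has count 3, {A} count 3, {D} count 4.
counts = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
for lhs, rhs, conf in generate_rules(counts, min_conf=0.5):
    print(lhs, "=>", rhs, conf)   # A => D (conf 1.0) and D => A (conf 0.75)
```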
How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method based on hashing:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
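The hash-tree itself is not spelled out on the slides; as a simpler stand-in (my substitution, not the book's data structure), the sketch below uses a plain hash set of candidates and enumerates the k-subsets of each transaction, which captures the same idea of hashing subsets of a transaction to locate matching candidates.

```python
from collections import defaultdict
from itertools import combinations

def count_candidates(transactions, candidates, k):
    """Count k-itemset candidates by hashing each k-subset of every transaction."""
    candidate_set = set(map(frozenset, candidates))   # hash-based lookup
    counts = defaultdict(int)
    for t in transactions:
        # Probe the hash set with each k-subset of the transaction,
        # instead of testing every candidate against every transaction.
        for subset in combinations(sorted(t), k):
            fs = frozenset(subset)
            if fs in candidate_set:
                counts[fs] += 1
    return counts

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
c2 = [{"A", "B"}, {"A", "C"}, {"A", "E"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
print(count_candidates(tdb, c2, 2))  # {A,C}:2 {B,C}:2 {B,E}:3 {C,E}:2 {A,B}:1 {A,E}:1
```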
Challenges of Frequent Pattern Mining
- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Partitioning
  - Sampling
  - others
Partition: Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate the global frequent patterns
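A rough sketch of the two-scan idea (my illustration, reusing the apriori() function sketched after "The Apriori Algorithm" slide; the proportional local threshold is one reasonable choice, not prescribed by the slides).

```python
import math

def partition_mine(transactions, min_support, n_partitions=2):
    """Two scans: mine each partition locally, then verify the union of results globally."""
    size = max(1, len(transactions) // n_partitions)
    chunks = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: a globally frequent itemset must be locally frequent in >= 1 partition,
    # so the union of the local results is a superset of the true answer.
    candidates = set()
    for chunk in chunks:
        local_min = max(1, math.ceil(min_support * len(chunk) / len(transactions)))
        candidates |= set(apriori(chunk, local_min))   # apriori() from the earlier sketch
    # Scan 2: count every candidate over the whole database and keep the frequent ones.
    counts = {c: sum(c <= frozenset(t) for t in transactions) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_support}
```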
Sampling for Frequent Patterns
- Select a sample of the original database and mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample
- Scan the database again to find missed frequent patterns
- Use a lower support threshold to find frequent itemsets in the sample
- Trade-off: accuracy vs. efficiency
Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100:
    - # of scans: 100
    - # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 !
- Bottleneck: candidate generation and test
- Can we avoid candidate generation? Yes!
Visualization of Association Rules: Plane Graph
(figure-only slide)

Visualization of Association Rules: Rule Graph
(figure-only slide)

Visualization of Association Rules (SGI/MineSet 3.0)
(figure-only slide)
Frequent Pattern and Association Analysis
- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
Mining Various Kinds of Association Rules
- Mining multi-level associations
- Mining multi-dimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns
Example
(figure-only slide)
Mining Multiple-Level Association Rules
- Items often form a hierarchy
- Flexible support settings: items at the lower levels are expected to have lower support

Example hierarchy: Milk [support = 10%], with children 2% Milk [support = 6%] and Skim Milk [support = 4%]
- Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
- Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
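A tiny check (using only the numbers on this slide) of which level-2 items remain frequent under the uniform and the reduced support settings.

```python
# Which level-2 items pass the level-2 threshold under each setting?
support = {"Milk": 0.10, "2% Milk": 0.06, "Skim Milk": 0.04}
uniform = {"level 1": 0.05, "level 2": 0.05}
reduced = {"level 1": 0.05, "level 2": 0.03}

for name, thresholds in [("uniform", uniform), ("reduced", reduced)]:
    frequent = [item for item in ("2% Milk", "Skim Milk")
                if support[item] >= thresholds["level 2"]]
    print(name, frequent)
# uniform -> ['2% Milk']              (Skim Milk at 4% misses the 5% threshold)
# reduced -> ['2% Milk', 'Skim Milk']
```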
Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items. Example:
  - R1: milk => wheat bread [support = 8%, confidence = 70%]
  - R2: 2% milk => wheat bread [support = 2%, confidence = 72%]
- R1 is an ancestor of R2: R1 can be obtained from R2 by replacing items with their ancestors in the concept hierarchy
- R2 is redundant if its support is close to the "expected" value, based on the rule's ancestor
Mining Multi-Dimensional Associations
- Single-dimensional rules:
  buys(X, "milk") => buys(X, "bread")
- Multi-dimensional rules: >= 2 dimensions or predicates
  - Inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
Recall: categorical vs quantitative attributes
- Categorical (or nominal) attributes: finite number of possible values, no ordering among the values
- Quantitative attributes: numeric, with an implicit ordering among the values
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
- Static discretization based on predefined concept hierarchies (data cube methods)
- Dynamic discretization based on the data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
- Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97): one-dimensional clustering, then association
- Deviation (e.g., Aumann and Lindell @KDD'99):
  Sex = female => Wage: mean = $7/hr (overall mean = $9)
Static Discretization of Quantitative Attributes (1)
- Attributes are discretized prior to mining using a concept hierarchy
- Numeric values are replaced by ranges
- The frequent itemset mining algorithm must be modified so that frequent predicate sets are searched
  - Instead of searching only one attribute (e.g., buys), search through all relevant attributes (e.g., age, occupation, buys) and treat each attribute-value pair as an itemset
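A small sketch (illustrative ranges and names) of this preprocessing step: numeric values are replaced by ranges and each attribute-value pair becomes an "item", so the resulting baskets can be fed to an ordinary frequent-itemset miner.

```python
# Statically discretize a numeric attribute into (attribute, range) items.
age_bins = [(0, 18, "age:0-18"), (19, 25, "age:19-25"), (26, 120, "age:26+")]

def discretize_age(age):
    for lo, hi, label in age_bins:
        if lo <= age <= hi:
            return label

records = [
    {"age": 22, "occupation": "student", "buys": "coke"},
    {"age": 45, "occupation": "teacher", "buys": "bread"},
]
baskets = [{discretize_age(r["age"]), f"occupation:{r['occupation']}", f"buys:{r['buys']}"}
           for r in records]
print(baskets)
# [{'age:19-25', 'occupation:student', 'buys:coke'},
#  {'age:26+', 'occupation:teacher', 'buys:bread'}]
# These baskets can now be mined with the Apriori sketch shown earlier.
```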
Static Discretization of Quantitative Attributes (2)
- The data cube is well suited for mining and may already exist
- The cells of an n-dimensional cuboid correspond to the predicate sets and can be used to store the support counts
- Mining from data cubes can be much faster

(Figure: lattice of cuboids (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys))
Mining Various Kinds of Association Rules
- Mining multi-level associations
- Mining multi-dimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns
Mining interesting correlation patterns
- Strong association rules (with high support and confidence) can be uninteresting. Example:

              | Basketball | Not basketball | Sum (row)
  Cereal      | 2000       | 1750           | 3750
  Not cereal  | 1000       | 250            | 1250
  Sum (col.)  | 3000       | 2000           | 5000

- play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
Are All the Rules Found Interesting?
- The confidence of a rule only estimates the conditional probability of item B given item A; it does not measure the real correlation between A and B
- Another example: buy walnuts => buy milk [1%, 80%] is misleading if 85% of customers buy milk
- Support and confidence are not good measures of correlation
- Many other interestingness measures exist (Tan, Kumar, Srivastava @KDD'02)
Are All the Rules Found Interesting?
- Correlation rule: A => B [support, confidence, correlation measure]
- Measure of dependent/correlated events: the correlation coefficient, or lift
- The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A) P(B)

  lift = P(A ∪ B) / (P(A) P(B)) = P(B|A) / P(B) = conf(A => B) / sup(B)

- If lift > 1, A and B are positively correlated; if lift < 1, they are negatively correlated
Lift: example

Contingency table for the rule Coffee => Milk:
               | Milk  | No Milk | Sum (row)
    Coffee     | m, c  | ~m, c   | c
    No Coffee  | m, ~c | ~m, ~c  | ~c
    Sum (col.) | m     | ~m      |

    DB | m, c | ~m, c | m, ~c  | ~m, ~c  | lift
    A1 | 1000 | 100   | 100    | 10,000  | 9.26
    A2 | 100  | 1000  | 1000   | 100,000 | 8.44
    A3 | 1000 | 100   | 10,000 | 100,000 | 9.18
    A4 | 1000 | 1000  | 1000   | 1000    | 1
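A short sketch (illustrative function name) that recomputes the lift of Coffee => Milk from the contingency counts; it reproduces the values of rows A1 and A4.

```python
# Lift of Coffee => Milk from a 2x2 contingency table of counts.
def lift(n_both, n_lhs_only, n_rhs_only, n_neither):
    total = n_both + n_lhs_only + n_rhs_only + n_neither
    p_both = n_both / total                  # P(coffee and milk)
    p_lhs = (n_both + n_lhs_only) / total    # P(coffee)
    p_rhs = (n_both + n_rhs_only) / total    # P(milk)
    return p_both / (p_lhs * p_rhs)

print(round(lift(1000, 100, 100, 10_000), 2))   # A1 -> 9.26
print(round(lift(1000, 1000, 1000, 1000), 2))   # A4 -> 1.0 (independent)
```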
Mining Highly Correlated Patterns
- lift and χ² are not good measures for correlations in transactional DBs
- all-confidence or coherence could be good measures (Omiecinski @TKDE'03):

  all_conf(X) = sup(X) / max_item_sup(X)
  coh(X) = sup(X) / |universe(X)|

- Both all-confidence and coherence have the downward closure property
- Efficient algorithms can be derived for mining them (Lee et al. @ICDM'03 sub)
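A minimal sketch (illustrative name) of all-confidence computed from a support-count dictionary such as the one produced by the Apriori sketch earlier; coherence would be analogous once |universe(X)| is available.

```python
# all_conf(X) = sup(X) / max_item_sup(X)
def all_confidence(itemset, support_counts):
    """Support of X divided by the largest single-item support within X."""
    max_item_sup = max(support_counts[frozenset([i])] for i in itemset)
    return support_counts[frozenset(itemset)] / max_item_sup

counts = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
print(all_confidence({"A", "D"}, counts))   # 3 / 4 = 0.75
```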
Bibliography
- (Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chapter 6 of the 2001 book; Chapter 4 of the draft)