Association Mining
Data Mining
Spring 2012
Transactional Database
• Transactional database – a database of transactions
• Transaction – a row in the database
• e.g.: {Eggs, Cheese, Milk}
Transactional dataset:
T1: Eggs, Cheese, Milk
T2: Jam, Cheese, Bacon, Butter, Bread
T3: Bread, Butter, Milk
T4: Eggs, Cat food
T5: Eggs, Milk, Cheese
Items and Itemsets
• Item = {Milk}, {Cheese}, {Bread}, etc.
• Itemset = {Milk}, {Milk, Cheese}, {Bacon, Bread, Milk}
• Doesn't have to be in the dataset
• Can be of size 1 – n
The Support Measure
Support(X) = (# of transactions containing itemset X) / (total # of transactions)
Support Examples
Support({Eggs}) = 3/5 = 60%
Support({Eggs, Milk}) = 2/5 = 40%
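The support computation is easy to express in code. Below is a minimal Python sketch using the five-transaction dataset above; support() is an illustrative helper, not part of any particular library.

dataset = [
    {"Eggs", "Cheese", "Milk"},
    {"Jam", "Cheese", "Bacon", "Butter", "Bread"},
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Cat food"},
    {"Eggs", "Milk", "Cheese"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Eggs"}, dataset))          # 0.6 -> 60%
print(support({"Eggs", "Milk"}, dataset))  # 0.4 -> 40%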
Minimum Support
Minsup – the minimum support threshold for an itemset to be considered frequent (user defined).
Frequent itemset – an itemset in a database whose support is greater than or equal to minsup.
Support(X) ≥ minsup → frequent
Support(X) < minsup → infrequent
Minimum Support Examples
Minimum support = 50%
Support({Eggs}) = 3/5 = 60% → Pass
Support({Eggs, Milk}) = 2/5 = 40% → Fail
Association Rules
An association rule X => Y says that when itemset X occurs in a transaction, itemset Y tends to occur as well.
Confidence(X => Y) = Support(X ∪ Y) / Support(X)
Confidence Example 1
{Eggs} => {Bread}
Confidence = Sup({Eggs, Bread}) / Sup({Eggs})
Confidence = (0/5)/(3/5) = 0%
Confidence Example 2
{Milk} => {Eggs, Cheese}
Confidence = Sup({Milk, Eggs, Cheese}) / Sup({Milk})
Confidence = (2/5)/(3/5) = 66%
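Confidence follows directly from support. A sketch reusing the dataset and the support() helper above (function name is illustrative):

def confidence(X, Y, transactions):
    # Conf(X => Y) = Support(X u Y) / Support(X)
    return support(X | Y, transactions) / support(X, transactions)

print(confidence({"Milk"}, {"Eggs", "Cheese"}, dataset))  # 0.666... -> 66%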
Strong Association Rules
Minimum confidence (minconf) – a user-defined minimum bound on confidence.
Strong association rule – a rule X => Y whose confidence is greater than or equal to minconf.
- This is a potentially interesting rule for the user.
Conf(X => Y) ≥ minconf → strong
Conf(X => Y) < minconf → uninteresting
Minimum Confidence Example
Minconf = 50%
{Eggs} => {Bread}
Confidence = (0/5)/(3/5) = 0% → Fail
{Milk} => {Eggs, Cheese}
Confidence = (2/5)/(3/5) = 66% → Pass
Association Mining
Association mining finds the strong rules contained in a dataset from its frequent itemsets.
It can be divided into two major subtasks:
1. Finding frequent itemsets
2. Rule generation
Transactional Database Revisited
• Some algorithms map items to letters or numbers
• Numbers are more compact
• Easier to make comparisons
Transactional dataset (Eggs=1, Cheese=2, Milk=3, Cat food=4, Jam=5, Butter=6, Bacon=7, Bread=8):
T1: 1, 2, 3
T2: 5, 2, 7, 6, 8
T3: 8, 6, 3
T4: 1, 4
T5: 1, 3, 2
Basic Set Logic
Subset – an itemset X is a subset of an itemset Y if Y contains every item of X.
Superset – an itemset Y is a superset of an itemset X if Y contains every item of X.
Example:
X = {1,2}, Y = {1,2,3,5}
X is a subset of Y, and Y is a superset of X.
Apriori
• Arranges the database into a temporary lattice structure to find associations
• Apriori principle:
1. Itemsets in the lattice with support < minsup will only produce supersets with support < minsup.
2. The subsets of frequent itemsets are always frequent.
• Prunes the lattice structure of non-frequent itemsets using minsup
• Reduces the number of comparisons
• Reduces the number of candidate itemsets
Monotonicity
Monotone (upward closed) – a measure is monotone if, whenever X is a subset of Y, the measure of X cannot exceed that of Y.
Anti-monotone (downward closed) – a measure is anti-monotone if, whenever X is a subset of Y, the measure of Y cannot exceed that of X.
Support is anti-monotone, and Apriori uses this property to prune the lattice structure.
Itemset Lattice
[figure: lattice of all candidate itemsets]
Lattice Pruning
[figure: the lattice with infrequent itemsets and their supersets pruned]
Lattice Example
[figure: itemset lattice over the example database]
Count occurrences of each 1-itemset in the database and compute their support: Support = #occurrences / #rows in db
Prune anything less than minsup = 30%
Lattice Example
[figure: the lattice after counting 2-itemsets]
Count occurrences of each 2-itemset in the database and compute their support.
Prune anything less than minsup = 30%
Lattice Example
[figure: the lattice after counting the remaining 3-itemset]
Count occurrences of the last 3-itemset in the database and compute its support.
Prune anything less than minsup = 30%
Example - Results
[figure: the final pruned lattice]
Frequent itemsets: {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}
Apriori Algorithm
Frequent Itemset Generation
Transactional Database
T1: 1, 2, 3, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 4, 5

Itemset | Support | Frequent
{1}     | 75%     | Yes
{2}     | 50%     | No
{3}     | 75%     | Yes
{4}     | 25%     | No
{5}     | 100%    | Yes

Minsup = 70%
1. Generate all 1-itemsets
2. Calculate the support of each itemset
3. Determine whether or not each itemset is frequent
Frequent Itemset Generation
Transactional Database
T1: 1, 2, 3, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 4, 5

Itemset | Support | Frequent
{1,3}   | 50%     | Yes
{1,5}   | 75%     | Yes
{3,5}   | 75%     | Yes

Generate all 2-itemsets from the frequent 1-itemsets, minsup = 70%:
{1} U {3} = {1,3} , {1} U {5} = {1,5} , {3} U {5} = {3,5}
Frequent Itemset Generation
Transactional Database
T1: 1, 2, 3, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 4, 5

Itemset | Support | Frequent
{1,3,5} | 50%     | Yes

Generate all 3-itemsets, minsup = 70%:
{1,3} U {1,5} = {1,3,5}
Frequent Itemset Results
All frequent itemsets generated are output:
{1} , {3} , {5}
{1,3} , {1,5} , {3,5}
{1,3,5}
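A compact Python sketch of Apriori's level-wise generate-and-prune loop, with illustrative names. Note that a strict "support ≥ minsup" check at minsup = 70% would prune {1,3} (support 50%), so the demo call below uses minsup = 0.5, which keeps the 50% itemsets from the walkthrough (and also surfaces the {2} branch).

from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({i for t in transactions for i in t})
    frequent = []
    # Level 1: frequent 1-itemsets.
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    while level:
        frequent.extend(level)
        k = len(level[0]) + 1
        # Join step: union pairs of frequent (k-1)-itemsets, keep the k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= minsup]
    return frequent

db = [{1, 2, 3, 5}, {2, 3, 5}, {1, 3, 5}, {1, 4, 5}]
for itemset in sorted(apriori(db, 0.5), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))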
Apriori Rule Mining
Rule combinations:
1. From the 2-itemset {1,2}:
{1}=>{2} , {2}=>{1}
2. From the 3-itemset {1,2,3}:
{1}=>{2,3} , {2,3}=>{1}
{1,2}=>{3} , {3}=>{1,2}
{1,3}=>{2} , {2}=>{1,3}
Strong Rule Generation
Transactional Database
T1: 1, 2, 3, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 4, 5

Rule     | Confidence | Strong
{1}=>{3} | 66%        | No
{3}=>{1} | 66%        | No
{1}=>{5} | 100%       | Yes
{5}=>{1} | 75%        | No
{3}=>{5} | 100%       | Yes
{5}=>{3} | 75%        | No

1. Rules generated from the frequent 2-itemsets {1,3}, {1,5}, {3,5}
2. Rules have the form X => Y
3. Minconf = 80%
Strong Rule Generation
Transactional Database
T1: 1, 2, 3, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 4, 5

Rule       | Confidence | Strong
{2}=>{3,5} | 100%       | Yes
{3,5}=>{2} | 66%        | No
{2,3}=>{5} | 100%       | Yes
{5}=>{2,3} | 50%        | No
{2,5}=>{3} | 100%       | Yes
{3}=>{2,5} | 66%        | No

1. Rules generated from the 3-itemset {2,3,5}
2. Rules have the form X => Y
3. Minconf = 80%
Strong Rules Results
All strong rules generated are output:
{1}=>{5}
{3}=>{5}
{2}=>{3,5}
{2,3}=>{5}
{2,5}=>{3}
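A sketch of confidence-based rule generation from a single frequent itemset, using the same database as above (names are illustrative). Every non-empty proper subset X of the itemset yields a candidate rule X => itemset - X:

from itertools import combinations

def rules_from(itemset, transactions, minconf):
    n = len(transactions)
    def support(s):
        return sum(1 for t in transactions if s <= t) / n

    items = frozenset(itemset)
    strong = []
    for r in range(1, len(items)):
        for lhs in combinations(sorted(items), r):
            X = frozenset(lhs)
            conf = support(items) / support(X)   # Conf(X => items - X)
            if conf >= minconf:
                strong.append((set(X), set(items - X), conf))
    return strong

db = [{1, 2, 3, 5}, {2, 3, 5}, {1, 3, 5}, {1, 4, 5}]
for X, Y, conf in rules_from({2, 3, 5}, db, 0.8):
    print(X, "=>", Y, f"(confidence {conf:.0%})")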
Other Frequent Itemsets
Closed frequent itemset – a frequent itemset X that has no immediate superset with the same support count as X.
Maximal frequent itemset – a frequent itemset none of whose immediate supersets are frequent.
Itemset Relationships
Maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets
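Both definitions are easy to check mechanically. A small sketch, assuming `freq` maps every frequent itemset (as a frozenset) to its support count; checking only within `freq` suffices, because an immediate superset with the same support as a frequent itemset would itself be frequent:

def closed_and_maximal(freq):
    # Closed: no immediate superset has the same support count.
    closed = [X for X in freq
              if not any(len(Y) == len(X) + 1 and X < Y and freq[Y] == freq[X]
                         for Y in freq)]
    # Maximal: no immediate superset is frequent at all.
    maximal = [X for X in freq
               if not any(len(Y) == len(X) + 1 and X < Y for Y in freq)]
    return closed, maximal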
Targeted Association Mining
* Users may only be interested in specific results
* Potential for smaller, faster, and more focused results
* Examples:
1. A user wants to know how often bread and garlic cloves occur together.
2. A user wants to know which items occur with toilet paper.
Itemset Trees
* Itemset tree – a data structure which helps users query for a specific itemset and its support.
* Items within a transaction are mapped to integer values and ordered so that each transaction is in lexical order.
  {Bread, Onion, Garlic} = {1, 2, 3}
* Why use numbers?
  - They make the tree more compact
  - Numbers follow the ordering easily
Itemset Trees
An itemset tree T contains:
* A root pair (I, f(I)), where I is an itemset and f(I) is its count.
* A (possibly empty) set {T1, T2, ..., Tk}, each element of which is an itemset tree.
* If an item Ij is in the root, then it will also be in the root's children.
* If Ij is not in the root, then it might be in the root's children if:
  first_item(I) < first_item(Ij) and last_item(I) < last_item(Ij)
Building an Itemset Tree
Let ci be a node in the itemset tree.
Let I be a transaction from the dataset.
Loop:
Case 1: ci = I
- increment the count of ci
Case 2: ci is a child of I
- make I the parent node of ci
Case 3: ci and I share a common lexical overlap, e.g. {1,2,4} vs. {1,2,6}
- make a node for the overlap
- make I and ci its children
Case 4: ci is a parent of I
- loop to check ci's children
- otherwise, make I a child of ci
Note: {2,6} and {1,2,6} do not have a lexical overlap - the overlap must be a common prefix.
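A minimal Python sketch of this insertion procedure. The Node class and the count bookkeeping (each node's count covers every transaction routed through it) are assumptions consistent with the querying examples later in the deck, not the authors' exact implementation; transactions are pre-sorted lists of integers, and the demo dataset is chosen to match those examples.

class Node:
    def __init__(self, itemset, count=0, children=None):
        self.itemset = itemset            # sorted list of items
        self.count = count                # f(I): transactions routed through here
        self.children = children or []

def common_prefix(a, b):
    # The lexical overlap of two sorted itemsets is their common prefix.
    p = []
    for x, y in zip(a, b):
        if x != y:
            break
        p.append(x)
    return p

def insert(node, I):
    node.count += 1                       # every transaction below contains node's itemset
    for i, c in enumerate(node.children):
        p = common_prefix(c.itemset, I)
        if len(p) <= len(node.itemset):
            continue                      # no overlap beyond node itself
        if p == c.itemset == I:           # Case 1: node equals the transaction
            c.count += 1
        elif p == I:                      # Case 2: ci is a child of I
            node.children[i] = Node(I, c.count + 1, [c])
        elif p == c.itemset:              # Case 4: ci is a parent of I
            insert(c, I)
        else:                             # Case 3: proper lexical overlap
            node.children[i] = Node(p, c.count + 1, [c, Node(I, 1)])
        return
    node.children.append(Node(I, 1))      # no overlapping child: add a new leaf

tree = Node([])                           # empty root
for t in [[2, 4], [2], [2, 9], [1, 2, 3, 5], [1, 2, 6], [3, 6]]:
    insert(tree, t)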
Itemset Trees - Creation
[figures: the example dataset and a snapshot of the tree after each insertion - new transactions are added as child nodes, a lexical-overlap node is created when two transactions share only a common prefix, and a transaction becomes a parent node when it is a prefix of an existing node]
Itemset Trees – Querying
Let I be an itemset.
Let ci be a node in the tree.
Let totalSup be the total count for I in the tree.
For all ci such that first_item(ci) <= first_item(I):
Case 1: If I is contained in ci
- add ci's count to totalSup.
Case 2: If I is not contained in ci and last_item(ci) < last_item(I)
- proceed down the tree.
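A sketch of the querying procedure, reusing Node and the tree built in the insertion sketch above. Because each node's count already covers its whole subtree, Case 1 adds the count without descending further:

def query(node, I):
    # Total support of the (sorted) query itemset I under `node`.
    total = 0
    for c in node.children:
        if c.itemset[0] > I[0]:
            continue                     # needs first_item(ci) <= first_item(I)
        if set(I) <= set(c.itemset):     # Case 1: I is contained in ci
            total += c.count
        elif c.itemset[-1] < I[-1]:      # Case 2: I may still occur deeper
            total += query(c, I)
    return total

print(query(tree, [2]))     # 5, as in Example 1
print(query(tree, [2, 9]))  # 1, as in Example 2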
Example 1
Itemset Trees - Querying
Querying Example 1:
Query: {2}
Start: totalSup = 0
Node {2}: 2 = 2, so add its count to the support: totalSup = 3
Node {1,2}: {1,2} contains 2, so add its count: totalSup = 3 + 2 = 5
Next subtree: 3 > 2, and end of tree.
Return support: totalSup = 5
Example 2
Itemset Trees - Querying
Querying Example 2:
Query: {2,9}
Start: totalSup = 0
Node {2}: 2 <= 2 and 2 < 9, so continue down the subtree.
Node {2,4}: 4 < 9, but {2,4} doesn't contain {2,9} - go to the next sibling.
Node {2,9}: {2,9} = {2,9} - add to support: totalSup = 1
Node {1,2}: 1 <= 2 and 2 < 9, so continue.
Node {1,2,3,5}: 5 < 9, but {1,2,3,5} doesn't contain {2,9} - go to the next sibling.
Node {1,2,6}: 6 < 9, but {1,2,6} doesn't contain {2,9} - go to the next node.
Next subtree: 3 <= 2 fails, and 9 < 9 fails - end of tree.
Return support: totalSup = 1 (8 nodes visited)