Frequent Item Based Clustering

M.Sc. Student: Homayoun Afshar
Supervisor: Martin Ester
Contents
Introduction and motivation
Frequent item sets
Text data as transactional data
Cluster set definition
Our approach
Test data set, results, challenges
Related work
Conclusion
Introduction and Motivation
Huge amount of information online
Lots of this information is in text format
E.g., emails, web pages, newsgroup postings, …
Need to group related documents
Nontrivial task
Frequent Item Sets
Given a dataset D={t1,t2,…,tn}
Each ti is a transaction
ti ⊆ I, where I is the set of all items
Given a threshold min_sup
i ⊆ I such that
|{t : i ⊆ t and t ∈ D}| > min_sup
i is a frequent item set with respect to
minimum support min_sup
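
The definition above can be sketched in Python. This is a minimal illustration, assuming each transaction is a Python set and min_sup is an absolute count compared with the strict inequality from the definition. The names `support` and `is_frequent` and the toy dataset `D` are hypothetical.

```python
def support(itemset, transactions):
    """Count the transactions t in D that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def is_frequent(itemset, transactions, min_sup):
    """Apply the condition |{t : i ⊆ t and t ∈ D}| > min_sup."""
    return support(itemset, transactions) > min_sup


# Toy dataset D (assumed for illustration only).
D = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c"},
]

print(support({"a", "c"}, D))         # transactions containing both a and c
print(is_frequent({"a", "c"}, D, 1))  # frequent if more than 1 transaction contains it
```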
Text Data As Transactional Data
Assume each word as an item
And each document as a transaction
Using a minimum support find frequent
item sets (frequent word sets)
Frequent Word Sets ≡ Frequent Item Sets
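
As a rough sketch of this mapping, the snippet below treats each document as the set of its words and searches level by level for word sets contained in more than min_sup documents. It is an assumption-laden illustration, not the implementation used in the talk: the function names, the whitespace tokenizer, and the toy documents are hypothetical, and no stop-word removal or stemming is done.

```python
from itertools import combinations


def to_transaction(document):
    """Treat a document as the set of its lowercase whitespace-separated words."""
    return frozenset(document.lower().split())


def frequent_word_sets(documents, min_sup):
    """Level-wise search for word sets contained in more than min_sup documents."""
    transactions = [to_transaction(d) for d in documents]

    def support(word_set):
        return sum(1 for t in transactions if word_set <= t)

    vocabulary = sorted(set().union(*transactions))
    level = [frozenset([w]) for w in vocabulary if support(frozenset([w])) > min_sup]
    result = {}
    while level:
        for s in level:
            result[s] = support(s)
        # Join pairs of frequent k-sets into (k+1)-set candidates, keep the frequent ones.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) > min_sup]
    return result


# Toy documents (assumed for illustration only).
docs = [
    "net profit rose",
    "net loss widened",
    "profit and net income",
    "oil prices rose",
]
found = frequent_word_sets(docs, min_sup=1)
for word_set, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(word_set), count)
```

A practical miner would use Apriori-style candidate pruning [AS94] instead of recounting support for every candidate.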
Cluster Set Definition
f={X1,X2,…,Xn} is the set of all the
frequent item sets with respect to some
minimum support
c={C1,C2,…,Cm} is a cluster set, where
Ci is the set of documents covered
by some Xk ∈ f
And…
Cluster Set Definition …
Each optimal cluster set has to:
Cover the whole data set
Minimize the mutual overlap between the
clusters in the cluster set
Clusters should be roughly the same size
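
The three criteria can be checked with a small helper. This is only a sketch: it assumes a cluster is a set of document ids, takes overlap to mean the number of documents that fall in two or more clusters, and uses the difference between the largest and smallest cluster as a crude size measure. The function name and the returned keys are hypothetical.

```python
from collections import Counter


def evaluate_cluster_set(clusters, n_documents):
    """Summarize a cluster set given as a list of sets of document ids."""
    membership = Counter(doc for cluster in clusters for doc in cluster)
    coverage = len(membership)                              # documents in at least one cluster
    overlap = sum(1 for n in membership.values() if n > 1)  # documents in two or more clusters
    sizes = [len(c) for c in clusters]
    return {
        "covers_all": coverage == n_documents,
        "coverage": coverage,
        "overlap": overlap,
        "size_spread": max(sizes) - min(sizes) if sizes else 0,
    }


# Toy cluster set over documents 0..5 (assumed for illustration only).
summary = evaluate_cluster_set([{0, 1, 2}, {2, 3}, {4, 5}], n_documents=6)
print(summary)
```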
Our Approach:
Frequent-Item Based Clustering …
Find all the frequent word sets
Form cluster sets with just one cluster
Overlap is zero
Coverage is the support of the frequent
item set representing the cluster
Form cluster sets with two clusters
Find the overlap and coverage
Our Approach:
Frequent-Item Based Clustering …
Prune the candidate list for cluster sets
For candidates ci and cj in the same level:
if Cov(ci) ⊆ Cov(cj) and Overlap(ci) > Overlap(cj),
ci can be pruned
Remove ci if Overlap(ci) >= |Cov(ci)|
Generate the next level
Find Overlap and Coverage, Prune
Stop when there are no more candidates left
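
One possible reading of the pruning step is sketched below. It assumes a candidate is dropped either when its overlap is at least as large as its coverage, or when another candidate in the same level covers at least the same documents with strictly less overlap. The tuple layout and the name `prune` are hypothetical, and the pairwise comparison is deliberately naive.

```python
def prune(candidates):
    """Prune one level; candidates are (name, coverage_set, overlap_count) tuples."""
    kept = []
    for name, cov, ovl in candidates:
        if ovl >= len(cov):
            continue  # every covered document is shared between clusters
        dominated = any(
            cov <= other_cov and ovl > other_ovl
            for other_name, other_cov, other_ovl in candidates
            if other_name != name
        )
        if not dominated:
            kept.append((name, cov, ovl))
    return kept


# Toy level of candidates (assumed for illustration only).
level = [
    ("A", {1, 2, 3}, 1),
    ("B", {1, 2, 3, 4}, 0),
    ("C", {5, 6}, 2),
    ("D", {7, 8}, 0),
]
survivors = [name for name, _, _ in prune(level)]
print(survivors)
```

Comparing every candidate with every other one is quadratic in the number of candidates per level, which is one reason a step like this can dominate the running time.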
Our Approach:
Coverage And Overlap …
Using a bit matrix
Each column is a document
Each row is a frequent word set
Coverage: OR, counting the 1s
Overlap: XOR, OR, AND, counting 1s
Our Approach:
Coverage And Overlap …
10110010 (1st)
10001010 (2nd)
10101100 (3rd)
-----------
Coverage: OR all =
10111110
count 1s -> coverage = 6
cost = 2 ORs + counting 1s
cost for counting 1s = 8 (shifts, ANDs, Adds)
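
The coverage example can be reproduced with Python integers as bit rows. This is a sketch: the function name `coverage` is hypothetical, and counting the 1s via `bin(...).count("1")` stands in for the shift/AND/add loop mentioned above.

```python
def coverage(rows):
    """OR all bit rows together and count the 1s in the result."""
    combined = 0
    for row in rows:
        combined |= row
    return combined, bin(combined).count("1")


# The three rows from the example; each bit is one document.
rows = [0b10110010, 0b10001010, 0b10101100]
combined, count = coverage(rows)
print(format(combined, "08b"), count)  # 10111110 6
```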
Our Approach:
Coverage And Overlap …
Overlap:
10110010 (1st)
10001010 (2nd)
-----------
AND first two =
10000010 (i)
XOR first two =
00111000 (ii)
10101100 (3rd)
-----------
AND 3rd with (ii) =
00101000 (iii)
-----------
OR (i) and (iii) =
10101010
now count 1s for overlap -> Overlap = 4
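
A sketch of the overlap computation for any number of rows follows. Note the swapped-in technique: instead of the XOR step used above, it keeps a running OR of the rows seen so far and marks a document as overlapping when a new row hits a bit that is already set. For the three example rows it gives the same result. The function name `overlap` is hypothetical.

```python
def overlap(rows):
    """Return the bits set in two or more rows, and how many such bits there are."""
    seen = 0    # documents covered by at least one earlier row
    shared = 0  # documents covered by at least two rows
    for row in rows:
        shared |= seen & row
        seen |= row
    return shared, bin(shared).count("1")


# The three rows from the example; each bit is one document.
rows = [0b10110010, 0b10001010, 0b10101100]
shared, count = overlap(rows)
print(format(shared, "08b"), count)  # 10101010 4
```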
Test Data, Results, Challenges
Test data set
Reuters 21578
21578 Reuters news documents
8655 of them have exactly one topic
Remove stop words
Stem all the words
Number of frequent word sets
min_sup = 5%: 10678
min_sup = 10%: 1217
min_sup = 20%: 78
Test Data, Results, Challenges
With 20% min support
sample 2-cluster candidate set
{(said,reuter)(line,ct,vs)}
Overlap = 1
Coverage = 5259
sample 5-cluster candidate set
{(reuter)(vs)(net)(line,ct,net)(vs,net,shr)}
Overlap = 3303
Coverage = 8609
Test Data, Results, Challenges
More Results
With min_sup=10%
{(reuter)(includ)(mln,includ)(mln,profit)(year,ct)(year,mln,net)}
6-cluster cluster set
Coverage = 8616
Overlap = 2553
{(reuter)(loss)(profit)(year,1986)(mln,profit)(year,ct)(year,mln,net)}
7-cluster cluster set
Coverage = 8611
Overlap = 2705
{(reuter)(loss)(profit)(year,1986)(mln,includ)(mln,profit)(year,ct)(year,mln,net)}
8-cluster cluster set
Coverage = 8616
Overlap = 3033
Test Data, Results, Challenges
Lower support values
Pruning is very slow
2-cluster sets with min_sup = 20%
Creating= 0.010 seconds.
Updating= 1.853 seconds. (Overlap and Coverage)
Pruning= 11.767 seconds.
Sorting= 0.000 seconds.
Number of candidates
Before prune=3003
After prune=73
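
The candidate count before pruning equals the number of unordered pairs that can be formed from the 78 frequent word sets reported for min_sup = 20%. A quick check (the variable names are hypothetical):

```python
from math import comb

n_frequent_word_sets = 78  # reported for min_sup = 20%
pairs = comb(n_frequent_word_sets, 2)  # unordered pairs of frequent word sets
print(pairs)  # 3003
```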
Test Data, Results, Challenges
Hierarchical clustering
Clustering quality
In our test data set, entropy
Real data sets, classes are not known
Test the pruning more efficiently
Defining an upper threshold
Using the following ratios to prune candidates:
Overlap / Coverage or Coverage / Overlap
Using only max item sets
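
The entropy-based quality measure mentioned above can be sketched as follows, assuming the usual definition: the entropy of the known class labels inside each cluster, averaged over clusters weighted by cluster size. The function names and the toy labels are hypothetical. With overlapping clusters a document counts once per cluster it belongs to.

```python
from collections import Counter
from math import log2


def cluster_entropy(labels):
    """Entropy (in bits) of the class labels inside one cluster."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def weighted_entropy(clusters):
    """Size-weighted average entropy over clusters; lower means purer clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


# Toy clusters given as lists of known topic labels (assumed for illustration only).
clusters = [
    ["earn", "earn", "earn", "earn"],  # pure cluster, entropy 0
    ["acq", "earn"],                   # evenly mixed cluster, entropy 1 bit
]
quality = weighted_entropy(clusters)
print(round(quality, 4))
```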
Related Work
Similar idea
Frequent Term-Based Text Clustering [BEX02]
Florian Beil, Martin Ester, Xiaowei Xu
Focuses on finding one optimal clustering set
(non-overlapping): FTC
Hierarchical clustering (overlapping): HFTC
Conclusion
To get optimal clustering
Reduce minimum support
Reduce number of frequent items
Introduce maximum support
Use only max item sets
Better pruning (speed)
Hierarchical clustering
References
[AS94] R. Agrawal, R. Srikant. Fast Algorithms for Mining
Association Rules in Large Databases. In Proc. 1994 Int. Conf.
Very Large Data Bases (VLDB'94), pages 487-499, Santiago,
Chile, Sept. 1994.
[BEX02] F. Beil, M. Ester, X. Xu. Frequent Term-Based Text
Clustering.
J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2001.