Clustering methods

Part 1: Introduction
Clustering methods
Course code: 175314
Pasi Fränti
10.3.2014
Speech & Image Processing Unit
School of Computing
University of Eastern Finland
Joensuu, FINLAND
Sample data
[Figure: sources of the RGB vectors, and a red-green plot of the vectors]
Sample data
Employment statistics:

ID   Postal zone        Self-employed  Civil servants  Clerks  Manual workers
800  Munchen                    56750           57218  300201          242375
801  Munchen-land ost            7684            5790   20279           23491
802  Munchen-land sued           3780            1977   11058            7398
803  Munchen-land west           7226            5623   25571           20380
804  Munchen-land nord           2226            1305    9347           12432
805  Freising                    8187            5140   14632           24377
806  Dachau                      8165            2763   11638           24489
807  Ingolstadt                  5810            5212   15019           30532
Application example 1
Color reconstruction
[Figure: image with original colors vs. image with compression artifacts]
Application example 2
Speaker modeling for voice biometrics
[Figure: training data from speakers Tomi, Mikko and Matti goes through
feature extraction and clustering to produce speaker models; an unknown
speech sample goes through feature extraction and is matched against the
models; best match: Matti]
Application example 3
Image segmentation
[Figure: normalized color plots according to the red and green components;
image with 4 color clusters]
Application example 4
Quantization
Approximation of a continuous range of values (or a very large set of
possible discrete values) by a small set of discrete symbols or integer
values.
[Figure: original signal and its quantized version]
Color quantization of images
[Diagram: color image → RGB samples → clustering]
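As an illustration of the quantization step, a minimal sketch in Python: each pixel is replaced by the nearest color of a small palette. The palette and pixel values here are made-up examples; in color quantization the palette would come from clustering the RGB samples.

```python
# Hypothetical sketch: quantize RGB pixels to a small fixed palette by
# nearest-color mapping (squared Euclidean distance in RGB space).
def nearest(pixel, palette):
    """Return the index of the palette color closest to the pixel."""
    return min(range(len(palette)),
               key=lambda j: sum((p - q) ** 2 for p, q in zip(pixel, palette[j])))

# Hand-picked example palette; a real quantizer would cluster the image colors.
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]
pixels = [(10, 10, 10), (200, 30, 30), (20, 240, 25)]
quantized = [palette[nearest(p, palette)] for p in pixels]
# quantized → [(0, 0, 0), (255, 0, 0), (0, 255, 0)]
```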
Application example 5
Clustering of spatial data
Clustered locations of users
Timeline clustering
Clustering of photos
Clustering GPS trajectories
Mobile users, taxi routes, fleet management
Conclusions from clusters
[Figure: Cluster 1 = Office, Cluster 2 = Home]
Part I:
Clustering problem
Subproblems of clustering
1. Where are the clusters? (algorithmic problem)
2. How many clusters? (methodological problem: which criterion?)
3. Selection of attributes (application-related problem)
4. Preprocessing the data (practical problems: normalization, outliers)
Clustering result as partition
Partition of the data, illustrated by a Voronoi diagram.
Cluster prototypes, illustrated by convex hulls.
Duality of partition and centroids
Partition of the data: partition by nearest-prototype mapping.
Cluster prototypes: centroids as prototypes.
Challenges in clustering
• Incorrect cluster allocation
• Incorrect number of clusters: too many clusters, or clusters missing
How to solve?
Solve the clustering:
• Given input data (X) of N data vectors and the number of clusters (M),
find the clusters.
• The result is given as a set of prototypes, or as a partition.
Solve the number of clusters:
• Define an appropriate cluster validity function f.
• Repeat the clustering algorithm for several values of M.
• Select the best result according to f.
Solve the problem efficiently.
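The model-selection loop above can be sketched as follows. `cluster` and `f` are hypothetical placeholders for any clustering algorithm and any validity function (here assumed to be minimized):

```python
# Sketch of the model-selection loop: run the clustering algorithm for
# several cluster counts M and keep the result with the best validity value.
def select_model(X, cluster, f, m_range):
    """cluster(X, m) -> (C, P); f(X, C, P) -> validity value (lower is better)."""
    best = None
    for m in m_range:
        C, P = cluster(X, m)      # prototypes and partition for this M
        score = f(X, C, P)        # cluster validity value
        if best is None or score < best[0]:
            best = (score, m, C, P)
    return best[1:]               # (best M, prototypes, partition)
```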
Taxonomy of clustering
[Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.]
• One possible classification based on cost function.
• MSE is well defined and most popular.
Definitions and data
Set of N data points:
X = {x1, x2, …, xN}
Partition of the data:
P = {p1, p2, …, pN}, where pi ∈ [1, M]
Set of M cluster prototypes (centroids):
C = {c1, c2, …, cM}
Distance and cost function
Euclidean distance of data vectors:
d(x_i, x_j) = sqrt( Σ_{k=1..K} (x_i^k − x_j^k)² )
Mean square error:
MSE(C, P) = (1/N) Σ_{i=1..N} ‖x_i − c_{p_i}‖²
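Under these definitions, the distance and the MSE cost can be computed directly. A minimal sketch in Python, assuming data vectors as tuples and the partition P as a list of cluster indices:

```python
import math

def euclidean(x, y):
    """d(x_i, x_j): Euclidean distance over the K vector components."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mse(X, C, P):
    """MSE(C, P) = (1/N) * sum_i ||x_i - c_{p_i}||^2."""
    return sum(euclidean(x, C[p]) ** 2 for x, p in zip(X, P)) / len(X)
```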
Dependency of data structures
• Centroid condition: for a given partition (P), the optimal cluster
centroids (C) minimizing the MSE are the average vectors of the clusters:
c_j = ( Σ_{p_i = j} x_i ) / |{ i : p_i = j }|    ∀ j ∈ [1, M]
• Optimal partition: for given centroids (C), the optimal partition
assigns each vector to its nearest centroid:
p_i = arg min_{1 ≤ j ≤ M} d(x_i, c_j)²    ∀ i ∈ [1, N]
Complexity of clustering
• Number of possible clusterings is the Stirling number of the second kind:
S(N, M) = (1/M!) Σ_{j=1..M} (−1)^{M−j} C(M, j) j^N
• The clustering problem is NP-complete [Garey et al., 1982].
• Optimal solution by branch-and-bound in exponential time.
• Practical solutions by heuristic algorithms.
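The count can be evaluated directly from the formula above; a small sketch:

```python
from math import comb, factorial

def stirling2(N, M):
    """Stirling number of the second kind: the number of ways to partition
    N points into M non-empty clusters,
    S(N, M) = (1/M!) * sum_{j=1..M} (-1)^(M-j) * C(M, j) * j^N."""
    return sum((-1) ** (M - j) * comb(M, j) * j ** N
               for j in range(1, M + 1)) // factorial(M)

# Even small instances grow quickly, e.g. S(10, 3) = 9330.
```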
Cluster software
http://cs.joensuu.fi/sipu/soft/cluster2009.exe
• Main area: working space for the data
• Input area: inputs to be processed
• Output area: obtained results
• Menu Process: selection of the operation
Procedure to simulate k-means
[Figure: data set, clustering image, codebook, partition]
1. Open the data set (file *.ts) and move it into the Input area.
2. Process – Random codebook; select the number of clusters.
3. REPEAT:
   • Move the obtained codebook from the Output area into the Input area.
   • Process – Optimal partition; select the error function.
   • Move the codebook into the Main area and the partition into the Input area.
   • Process – Optimal codebook.
4. UNTIL the desired clustering is reached.
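The same alternation of optimal-partition and optimal-codebook steps can be sketched in Python. This is a toy sketch, not the Cluster software itself, and it assumes no cluster becomes empty during the iterations:

```python
import random

def kmeans(X, M, iters=20, seed=0):
    """Random codebook, then alternate optimal partition (nearest-centroid
    mapping) and optimal codebook (cluster averages) for a fixed number of
    iterations. Assumes every cluster stays non-empty."""
    rng = random.Random(seed)
    C = rng.sample(X, M)                       # random codebook
    for _ in range(iters):
        # Optimal partition: map each vector to its nearest centroid.
        P = [min(range(M),
                 key=lambda j: sum((a - b) ** 2 for a, b in zip(x, C[j])))
             for x in X]
        # Optimal codebook: replace each centroid by its cluster average.
        C = [tuple(sum(x[k] for x, p in zip(X, P) if p == j) / P.count(j)
                   for k in range(len(X[0])))
             for j in range(M)]
    return C, P
```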
XLMiner software
http://www.resample.com/xlminer/help/HClst/HClst_ex.htm
Example of data in XLMiner
Distance matrix & dendrogram
Conclusions
• Clustering is a fundamental tool needed in speech and image processing.
• Failing to do clustering properly may distort the analysis in the
application.
• A good clustering tool is needed so that researchers can focus on the
application requirements.
Literature
1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press,
   3rd edition, 2006.
2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
3. A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review,
   ACM Computing Surveys, 31(3): 264-323, September 1999.
4. M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the
   generalized Lloyd-Max problem, IEEE Transactions on Information Theory,
   28(2): 255-256, March 1982.
5. F. Aurenhammer, Voronoi diagrams: a survey of a fundamental geometric
   data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.