Course on Data Mining (581550-4)

[Course schedule diagram — topics: Intro/Ass. Rules, KDD Process, Clustering, Episodes, Text Mining, Applications/Summary, Home Exam; lecture dates 24./26.10.-28.11.; today's lecture: Clustering, 14.11.]
Course on Data Mining (581550-4)
Today 14.11.2001
• Today's subject:
  o Classification, clustering
• Next week's program:
  o Lecture: Data mining process
  o Exercise: Classification, clustering
  o Seminar: Classification, clustering
Classification and clustering
I. Classification and prediction
II. Clustering and similarity
Cluster analysis
Overview

• What is cluster analysis?
• Similarity and dissimilarity
• Types of data in cluster analysis
• Major clustering methods
• Partitioning methods
• Hierarchical methods
• Outlier analysis
• Summary
What is cluster analysis?
• Cluster: a collection of data
objects
o similar to one another within
the same cluster
o dissimilar to the objects in the
other clusters
• Aim of clustering: to group a set
of data objects into clusters
Typical uses of clustering

• As a stand-alone tool to get insight into the data distribution
• As a preprocessing step for other algorithms
Applications of clustering
• Marketing: discovering distinct customer groups in a purchase database
• Land use: identifying areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
What is good clustering?
• A good clustering method will produce high quality
clusters with
o high intra-class similarity
o low inter-class similarity
• The quality of a clustering result depends on
o the similarity measure used
o implementation of the similarity measure
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Requirements of clustering in
data mining (1)
• Scalability
• Ability to deal with different
types of attributes
• Discovery of clusters with
arbitrary shape
• Minimal requirements for
domain knowledge to determine
input parameters
Requirements of clustering in
data mining (2)
• Ability to deal with noise and
outliers
• Insensitivity to order of input
records
• Ability to handle high dimensionality
• Incorporation of user-specified
constraints
• Interpretability and usability
Similarity and dissimilarity
between objects (1)
• There is no single definition of
similarity or dissimilarity between
data objects
• The definition of similarity or
dissimilarity between objects
depends on
o the type of the data considered
o what kind of similarity we are
looking for
Similarity and dissimilarity
between objects (2)
• Similarity/dissimilarity between objects is
often expressed in terms of a distance
measure d(x,y)
• Ideally, every distance measure should be a metric, i.e., it should satisfy the following conditions:
  1. d(x, y) ≥ 0
  2. d(x, y) = 0 iff x = y
  3. d(x, y) = d(y, x)
  4. d(x, z) ≤ d(x, y) + d(y, z)
Type of data in cluster analysis
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio
variables
• Variables of mixed types
• Complex data types
Interval-scaled variables (1)
• Continuous measurements of a
roughly linear scale
• For example, weight, height and age
• The measurement unit can affect
the cluster analysis
• To avoid dependence on the
measurement unit, we should
standardize the data
Interval-scaled variables (2)
To standardize the measurements:
• calculate the mean absolute deviation
    s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|),
  where m_f = (1/n) (x_1f + x_2f + ... + x_nf), and
• calculate the standardized measurement (z-score)
    z_if = (x_if − m_f) / s_f
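A minimal Python/NumPy sketch of the standardization above (the function name and example data are made up for illustration):

```python
import numpy as np

def standardize(X):
    """Standardize each column (variable) of X using the mean
    absolute deviation instead of the standard deviation."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                  # per-variable mean m_f
    s = np.abs(X - m).mean(axis=0)      # mean absolute deviation s_f
    return (X - m) / s                  # z-scores z_if

# Example: weight (kg) and height (cm) measured on very different scales
X = [[70.0, 175.0], [80.0, 180.0], [60.0, 165.0]]
print(standardize(X))
```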
Interval-scaled variables (3)

• One group of popular distance measures for interval-scaled variables are the Minkowski distances
    d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q )^(1/q)
  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer
Interval-scaled variables (4)

• If q = 1, the distance measure is the Manhattan (or city block) distance
    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
• If q = 2, the distance measure is the Euclidean distance
    d(i, j) = sqrt( |x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2 )
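A small illustrative Python/NumPy sketch of the Minkowski family; q = 1 and q = 2 give the two special cases above (names and data are made up):

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional objects x and y.
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, q=1))   # Manhattan: 3 + 4 + 0 = 7
print(minkowski(x, y, q=2))   # Euclidean: sqrt(9 + 16 + 0) = 5
```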
Binary variables (1)

• A binary variable has only two states: 0 or 1
• A contingency table for binary data:

                     Object j
                    1      0     sum
  Object i    1     a      b     a+b
              0     c      d     c+d
             sum   a+c    b+d     p
Binary variables (2)

• Simple matching coefficient (invariant similarity, if the binary variable is symmetric):
    d(i, j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant similarity, if the binary variable is asymmetric):
    d(i, j) = (b + c) / (a + b + c)
Binary variables (3)

Example: dissimilarity between binary variables:
• a patient record table

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack     M        Y       N       P        N        N        N
  Mary     F        Y       N       P        N        P        N
  Jim      M        Y       P       N        N        N        N

• eight attributes, of which
  o gender is a symmetric attribute, and
  o the remaining attributes are asymmetric binary
Binary variables (4)

• Let the values Y and P be set to 1, and the value N be set to 0
• Compute distances between patients based on the asymmetric variables by using the Jaccard coefficient (a small code sketch follows):
    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
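A small Python sketch reproducing the Jaccard computations above; the patient vectors use the Y/P -> 1, N -> 0 coding over the six asymmetric attributes (Fever, Cough, Test-1..Test-4):

```python
def jaccard_dissimilarity(x, y):
    """Jaccard dissimilarity d(i, j) = (b + c) / (a + b + c) for two
    vectors of asymmetric binary attributes (1 = present, 0 = absent)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```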
Nominal variables
• A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
  o m: # of matches, p: total # of variables
      d(i, j) = (p − m) / p
• Method 2: use a large number of binary variables
  o create a new binary variable for each of the M nominal states
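A tiny Python sketch of Method 1 (simple matching); the example attribute values are made up:

```python
def nominal_dissimilarity(x, y):
    """Simple matching dissimilarity d(i, j) = (p - m) / p for two
    objects described by p nominal variables (m = number of matches)."""
    p = len(x)
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return (p - m) / p

print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))  # 1/3 ≈ 0.33
```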
Ordinal variables

• An ordinal variable can be discrete or continuous
• Order of values is important, e.g., rank
• Can be treated like interval-scaled variables:
  o replace x_if by its rank r_if ∈ {1, ..., M_f}
  o map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
      z_if = (r_if − 1) / (M_f − 1)
  o compute the dissimilarity using methods for interval-scaled variables (a small sketch of the mapping follows)
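A small illustrative Python sketch of the rank-to-[0, 1] mapping above (the ordinal states are made up):

```python
def ordinal_to_interval(values, order):
    """Map ordinal values to [0, 1]: replace each value by its rank
    r_if in {1, ..., M_f} and compute z_if = (r_if - 1) / (M_f - 1)."""
    rank = {state: r for r, state in enumerate(order, start=1)}
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]

# e.g. an ordinal attribute with states bronze < silver < gold
print(ordinal_to_interval(["gold", "bronze", "silver"],
                          order=["bronze", "silver", "gold"]))  # [1.0, 0.0, 0.5]
```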
Ratio-scaled variables
• A positive measurement on a nonlinear scale, approximately on an exponential scale
  o for example, Ae^(Bt) or Ae^(-Bt)
• Methods:
o treat them like interval-scaled variables — not a
good choice! (why?)
o apply logarithmic transformation yif = log(xif)
o treat them as continuous ordinal data and treat
their rank as interval-scaled
Variables of mixed types (1)

• A database may contain all the six types of variables
• One may use a weighted formula to combine their effects:
    d(i, j) = ( Σ_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )
  where δ_ij^(f) = 0 if x_if or x_jf is missing, or x_if = x_jf = 0; otherwise δ_ij^(f) = 1
Variables of mixed types (2)

Contribution of variable f to distance d(i, j):
• if f is binary or nominal:
    d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1
• if f is interval-based: use the normalized distance
• if f is ordinal or ratio-scaled:
  o compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
  o and treat z_if as interval-scaled
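A simplified, illustrative Python sketch of the combined formula; it omits the asymmetric-binary handling of δ, and the parameter names (types, ranges, levels) are made up, not part of the lecture:

```python
def mixed_dissimilarity(x, y, types, ranges=None, levels=None):
    """Weighted dissimilarity for objects with mixed variable types,
    following d(i, j) = sum(delta_f * d_f) / sum(delta_f).
    types[f] is 'binary', 'nominal', 'interval', or 'ordinal';
    ranges[f] is the value range of interval variable f (for normalization);
    levels[f] is the number of states M_f of ordinal variable f."""
    num, den = 0.0, 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        if xf is None or yf is None:          # delta = 0: missing value
            continue
        t = types[f]
        if t in ("binary", "nominal"):
            d_f = 0.0 if xf == yf else 1.0
        elif t == "interval":
            d_f = abs(xf - yf) / ranges[f]    # normalized distance
        elif t == "ordinal":                  # ranks mapped to [0, 1]
            zx = (xf - 1) / (levels[f] - 1)
            zy = (yf - 1) / (levels[f] - 1)
            d_f = abs(zx - zy)
        num += d_f
        den += 1.0
    return num / den if den else 0.0

# one binary, one interval (range 100) and one ordinal (3 levels) variable
print(mixed_dissimilarity([1, 20.0, 3], [0, 50.0, 1],
                          types=["binary", "interval", "ordinal"],
                          ranges={1: 100.0}, levels={2: 3}))
```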
Complex data types
• Not all objects considered in data mining are relational => complex types of data
  o examples of such data are spatial data, multimedia data, genetic data, time-series data, text data and data collected from the World-Wide Web
• Often totally different similarity or dissimilarity measures than the ones above are needed
  o this can, for example, mean using string and/or sequence matching, or methods of information retrieval
Major clustering methods

• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods (conceptual clustering, neural networks)
Partitioning methods
• A partitioning method: construct a
partition of a database D of n objects into a
set of k clusters such that
o each cluster contains at least one object
o each object belongs to exactly one
cluster
• Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
Criteria for judging
the quality of partitions
• Global optimum: requires exhaustively enumerating all partitions
• Heuristic methods:
o k-means (MacQueen’67): each cluster is
represented by the center of the cluster
(centroid)
o k-medoids (Kaufman & Rousseeuw’87):
each cluster is represented by one of the
objects in the cluster (medoid)
K-means clustering method (1)

• Input to the algorithm: the number of clusters k, and a database of n objects
• Algorithm consists of four steps:
  1. partition the objects into k nonempty subsets/clusters
  2. compute a seed point as the centroid (the mean of the objects in the cluster) for each cluster in the current partition
  3. assign each object to the cluster with the nearest centroid
  4. go back to Step 2; stop when there are no more new assignments
K-means clustering method (2)
Alternative algorithm also consists of four steps (a Python sketch is given below):
  1. arbitrarily choose k objects as the initial cluster centers (centroids)
  2. (re)assign each object to the cluster with the nearest centroid
  3. update the centroids
  4. go back to Step 2; stop when there are no more new assignments
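A minimal, illustrative Python/NumPy sketch of this alternative algorithm; the function name and toy data are made up, not from the lecture:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Pick k initial centroids, then alternate (re)assignment and
    centroid updates until the assignments no longer change."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: arbitrarily choose k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # Step 4: no new assignments
            break
        labels = new_labels
        # Step 3: update the centroids as the means of the assigned objects
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

X = [[1, 1], [1.5, 2], [8, 8], [9, 8.5], [1, 0.5], [8.5, 9]]
labels, centroids = k_means(X, k=2)
print(labels, centroids)
```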
K-means clustering method - Example

[Figure: four scatter plots (axes 0-10) illustrating successive k-means iterations: objects are assigned to the nearest centroid, centroids are recomputed, and the process repeats until the assignments stabilize.]
Strengths of
K-means clustering method
• Relatively scalable in processing large
data sets
• Relatively efficient: O(tkn), where n is
# objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
• Often terminates at a local optimum;
the global optimum may be found
using techniques such as genetic
algorithms
Weaknesses of
K-means clustering method
• Applicable only when the mean of
objects is defined
• Need to specify k, the number of
clusters, in advance
• Unable to handle noisy data and
outliers
• Not suitable to discover clusters with
non-convex shapes, or clusters of
very different size
Variations of
K-means clustering method (1)
• A few variants of k-means exist, which differ in
o selection of the initial k centroids
o dissimilarity calculations
o strategies for calculating cluster
centroids
Variations of
K-means clustering method (2)
• Handling categorical data: k-modes
(Huang’98)
o replacing means of clusters with
modes
o using new dissimilarity measures to
deal with categorical objects
o using a frequency-based method to
update modes of clusters
• A mixture of categorical and numerical
data: k-prototype method
K-medoids clustering method

• Input to the algorithm: the number of clusters k, and a database of n objects
• Algorithm consists of four steps (a simplified sketch follows):
  1. arbitrarily choose k objects as the initial medoids (representative objects)
  2. assign each remaining object to the cluster with the nearest medoid
  3. select a nonmedoid object and replace one of the medoids with it if this improves the clustering
  4. go back to Step 2; stop when there are no more new assignments
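A simplified Python/NumPy sketch of the swap-based idea; this is a naive variant for illustration, not the full PAM algorithm of Kaufman & Rousseeuw, and all names and data are made up:

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Clusters are represented by actual objects (medoids); a medoid is
    swapped with a nonmedoid whenever the swap lowers the total distance."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # each object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    for _ in range(max_iter):
        improved = False
        for m in list(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(candidate) < total_cost(medoids):
                    medoids = candidate          # accept the improving swap
                    improved = True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)        # assign to the nearest medoid
    return labels, medoids

X = [[1, 1], [1.5, 2], [8, 8], [9, 8.5], [1, 0.5], [25, 30]]
print(k_medoids(X, k=2))
```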
Hierarchical methods
• A hierarchical method: construct a
hierarchy of clustering, not just a single
partition of objects
• The number of clusters k is not required
as an input
• Use a distance matrix as the clustering criterion
• A termination condition can be used
(e.g., a number of clusters)
A tree of clusterings
• The hierarchy of clusterings is often given as a clustering tree, also called a dendrogram
o leaves of the tree represent the
individual objects
o internal nodes of the tree represent
the clusters
Two types of
hierarchical methods (1)
Two main types of hierarchical clustering techniques:
• agglomerative (bottom-up):
o place each object in its own cluster (a singleton)
o merge in each step the two most similar clusters
until there is only one cluster left or the termination
condition is satisfied
• divisive (top-down):
o start with one big cluster containing all the objects
o divide the most distinctive cluster into smaller
clusters and proceed until there are n clusters or the
termination condition is satisfied
Two types of
hierarchical methods (2)
[Figure: objects a, b, c, d, e. Agglomerative (Step 0 -> Step 4): a and b merge into ab, d and e into de, c and de into cde, ab and cde into abcde. Divisive (Step 0 -> Step 4): the same hierarchy read in the opposite direction.]
Inter-cluster distances

• Three widely used ways of defining the inter-cluster distance, i.e., the distance between two separate clusters, are (a code sketch follows):
  o single linkage method (nearest neighbor):
      d(i, j) = min_{x ∈ Ci, y ∈ Cj} { d(x, y) }
  o complete linkage method (furthest neighbor):
      d(i, j) = max_{x ∈ Ci, y ∈ Cj} { d(x, y) }
  o average linkage method (unweighted pair-group average):
      d(i, j) = avg_{x ∈ Ci, y ∈ Cj} { d(x, y) }
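A compact, illustrative Python/NumPy sketch of agglomerative (bottom-up) merging under these linkage definitions; function names and data are made up:

```python
import numpy as np

def agglomerative(X, num_clusters, linkage="single"):
    """Start with singleton clusters and repeatedly merge the two closest
    clusters, using one of the inter-cluster distances defined above."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # object distances
    reduce = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(X))]        # each object is a singleton

    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # inter-cluster distance over all pairs (x in Ci, y in Cj)
                dist = reduce(D[np.ix_(clusters[a], clusters[b])])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]     # merge the two closest clusters
        del clusters[b]
    return clusters

X = [[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 9]]
print(agglomerative(X, num_clusters=2, linkage="single"))
```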
Strengths of
hierarchical methods
• Conceptually simple
• Theoretical properties are well
understood
• When clusters are merged/split,
the decision is permanent =>
the number of different
alternatives that need to be
examined is reduced
Weaknesses of
hierarchical methods
• Merging/splitting of clusters is
permanent => erroneous decisions
are impossible to correct later
• Divisive methods can be computationally hard
• Methods are not (necessarily)
scalable for large data sets
Outlier analysis (1)
• Outliers
o are objects that are considerably
dissimilar from the remainder of the data
o can be caused by a measurement or
execution error, or
o are the result of inherent data variability
• Many data mining algorithms try
o to minimize the influence of outliers
o to eliminate the outliers
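As a rough illustration of the idea that outliers are objects far from the rest, a naive distance-based sketch in Python/NumPy; the threshold and names are arbitrary choices for illustration (see Knorr & Ng in the references for proper distance-based outlier mining):

```python
import numpy as np

def distance_outliers(X, threshold=2.0):
    """Flag objects whose mean distance to the other objects is much
    larger than is typical for the data set."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    avg_dist = D.sum(axis=1) / (len(X) - 1)     # mean distance to the others
    cutoff = threshold * np.median(avg_dist)
    return np.where(avg_dist > cutoff)[0]

X = [[1, 1], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [10, 12]]
print(distance_outliers(X))   # flags the last object as an outlier
```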
Outlier analysis (2)
• Minimizing the effect of outliers and/or
eliminating the outliers may cause
information loss
• Outliers themselves may be of interest
=> outlier mining
• Applications of outlier mining
o Fraud detection
o Customized marketing
o Medical treatments
Summary (1)
• Cluster analysis groups objects
based on their similarity
• Cluster analysis has wide
applications
• Measures of similarity can be computed for various types of data
• Selection of similarity measure
is dependent on the data used
and the type of similarity we
are searching for
Summary (2)
• Clustering algorithms can be
categorized into
o partitioning methods,
o hierarchical methods,
o density-based methods,
o grid-based methods, and
o model-based methods
• There are still lots of research
issues on cluster analysis
Seminar Presentations/Groups 7-8
Classification of spatial data
K. Koperski, J. Han, N. Stefanovic: "An Efficient Two-Step Method of Classification of Spatial Data", SDH'98
Seminar Presentations/Groups 7-8
WEBSOM
K. Lagus, T. Honkela, S. Kaski, T.
Kohonen: “Self-organizing Maps
of Document Collections: A New
Approach to Interactive
Exploration”, KDD’96
T. Honkela, S. Kaski, K. Lagus, T. Kohonen: "WEBSOM – Self-Organizing Maps of Document Collections", WSOM'97
Course on Data Mining
Thanks to
Jiawei Han from Simon Fraser University
for his slides which greatly helped
in preparing this lecture!
References - clustering

• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
• M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
• S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
• P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
• R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
• G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
• W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.