Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 11 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2009 Han, Kamber & Pei. All rights reserved.
Chapter 11. Cluster Analysis: Advanced Methods

Statistics-Based Clustering
Clustering High-Dimensional Data
Semi-Supervised Learning and Active Learning
Constraint-Based Clustering
Bi-Clustering and Co-Clustering
Collaborative Filtering
Spectral Clustering
Evaluation of Clustering Quality
Summary
Model-Based Clustering


What is model-based clustering?
 Assumption: a cluster is generated by a model such as a
probability distribution
 A model (e.g., Gaussian distribution) is determined by a
set of parameters
 Task: optimize the fit between the given data and some
mathematical models by learning the parameters of the
models
Typical methods
 Statistical approach
 EM (Expectation maximization), AutoClass
 Neural network approach
 SOM (Self-Organizing Feature Map)
Mixture Models



A cluster can be modeled as a probability distribution
  Practically, assume each distribution can be approximated well by a multivariate normal distribution
Multiple clusters form a mixture of different probability distributions
A data set is a set of observations drawn from a mixture of models
Object Probability


Suppose there are k clusters and a set X of m objects
  Let the j-th cluster have parameters θ_j = (μ_j, σ_j)
  The probability that a point belongs to the j-th cluster is w_j, where w_1 + … + w_k = 1
The probability of an object x is

  prob(x \mid \Theta) = \sum_{j=1}^{k} w_j\, p_j(x \mid \theta_j)

The probability of the data set X is

  prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j\, p_j(x_i \mid \theta_j)
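As a rough illustration (not part of the original slides), the mixture density above can be evaluated directly; the weights, means, and standard deviations below are illustrative:

```python
# A minimal sketch: evaluating prob(x | Theta) for a 1-D Gaussian mixture.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_prob(x, weights, params):
    # params: list of (mu_j, sigma_j); weights w_j must sum to 1
    return sum(w * gaussian_pdf(x, mu, sigma)
               for w, (mu, sigma) in zip(weights, params))

# Hypothetical two-cluster mixture, matching the example on the next slide
print(mixture_prob(0.0, [0.5, 0.5], [(-4, 2), (4, 2)]))
```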
Example

Each Gaussian cluster θ = (μ, σ) has density

  prob(x_i \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}

For θ_1 = (−4, 2) and θ_2 = (4, 2), the mixture density is

  prob(x \mid \Theta) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x + 4)^2}{8}} + \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x - 4)^2}{8}}
Maximum Likelihood Estimation

Maximum likelihood principle:
  If we know that a set of objects comes from one distribution but do not know its parameters, we can choose the parameter values that maximize the probability of the data

Maximize

  prob(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}

Equivalently, maximize the log-likelihood

  \log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma
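As a small supplementary sketch (not in the slides): for a single Gaussian, the log-likelihood above is maximized in closed form by the sample mean and the sample standard deviation.

```python
# A minimal sketch: closed-form maximum likelihood estimates for a 1-D Gaussian,
# plus the log-likelihood from the formula above.
import math

def gaussian_mle(xs):
    m = len(xs)
    mu = sum(xs) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / m)
    return mu, sigma

def log_likelihood(xs, mu, sigma):
    m = len(xs)
    return (-sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2)
            - 0.5 * m * math.log(2 * math.pi) - m * math.log(sigma))

xs = [3.1, 4.7, 4.0, 5.2, 3.5]          # illustrative data
mu, sigma = gaussian_mle(xs)
print(mu, sigma, log_likelihood(xs, mu, sigma))
```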
The EM (Expectation Maximization) Algorithm

Expectation-Maximization algorithm
  Select an initial set of model parameters
  Repeat
    Expectation step: for each object x_i, calculate the probability that it belongs to each distribution θ_j, i.e., prob(x_i | θ_j)
    Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
  Until the parameters are stable
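A minimal EM sketch for a 1-D mixture of two Gaussians (illustrative only; the initialization, iteration count, and data are assumptions, not the book's code):

```python
import math, random

def pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(xs, k=2, iters=50):
    random.seed(0)
    mus = random.sample(xs, k)                 # initial means picked from the data
    sigmas = [1.0] * k
    ws = [1.0 / k] * k                         # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [ws[j] * pdf(x, mus[j], sigmas[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp)
            ws[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(sum(r[j] * (x - mus[j]) ** 2
                                      for r, x in zip(resp, xs)) / nj) or 1e-6
    return ws, mus, sigmas

xs = [-4.2, -3.9, -4.5, -3.7, 4.1, 3.8, 4.4, 4.0]   # illustrative data
print(em_gmm(xs))
```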
Advantages and Disadvantages

Mixture models are more general than k-means
and fuzzy c-means

Clusters can be characterized by a small number
of parameters

The results may satisfy the statistical assumptions
of the generative models

Computationally expensive

Need large data sets

Hard to estimate the number of clusters
Neural Network Approaches


Neural network approaches
 Represent each cluster as an exemplar, acting as a
“prototype” of the cluster
 New objects are distributed to the cluster whose
exemplar is the most similar according to some
distance measure
Typical methods
  SOM (Self-Organizing Feature Map)
  Competitive learning
 Involves a hierarchical architecture of several units
(neurons)
 Neurons compete in a “winner-takes-all” fashion for
the object currently being presented
Self-Organizing Feature Map (SOM)

SOMs, also called topologically ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)

It maps all points in a high-dimensional source space onto a 2- or 3-D target space, such that distance and proximity relationships (i.e., the topology) are preserved as much as possible

Similar to k-means: cluster centers tend to lie in a low-dimensional
manifold in the feature space

Clustering is performed by having several units competing for the
current object

The unit whose weight vector is closest to the current object wins

The winner and its neighbors learn by having their weights adjusted

SOMs are believed to resemble processing that can occur in the brain

Useful for visualizing high-dimensional data in 2- or 3-D space
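A minimal SOM training sketch (illustrative; the grid size, learning-rate schedule, and neighborhood radius are assumed values, not from the text):

```python
import math, random

def train_som(data, rows=4, cols=4, iters=200, lr0=0.5, radius0=2.0):
    random.seed(0)
    # each grid unit (r, c) holds a weight vector in the source space
    weights = {(r, c): list(random.choice(data))
               for r in range(rows) for c in range(cols)}
    for t in range(iters):
        lr = lr0 * (1 - t / iters)             # decaying learning rate
        radius = radius0 * (1 - t / iters)     # shrinking neighborhood
        x = random.choice(data)
        # winner: the unit whose weight vector is closest to the current object
        bmu = min(weights, key=lambda u: sum((wi - xi) ** 2
                                             for wi, xi in zip(weights[u], x)))
        # the winner and its grid neighbors have their weights adjusted toward x
        for u, w in weights.items():
            d = math.hypot(u[0] - bmu[0], u[1] - bmu[1])
            if d <= radius:
                h = math.exp(-d * d / (2 * max(radius, 1e-6) ** 2))
                weights[u] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return weights

data = [[random.random() * 10, random.random() * 10] for _ in range(100)]
print(train_som(data)[(0, 0)])
```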
Web Document Clustering Using SOM

The result of SOM clustering of 12088 Web articles
The picture on the right: drilling down on the keyword "mining"
Based on the websom.hut.fi Web page
Chapter 11. Cluster Analysis: Advanced Methods

Statistics-Based Clustering
Clustering High-Dimensional Data
Semi-Supervised Learning and Active Learning
Constraint-Based Clustering
Bi-Clustering and Co-Clustering
Collaborative Filtering
Spectral Clustering
Evaluation of Clustering Quality
Summary
Clustering High-Dimensional Data


Clustering high-dimensional data
 Many applications: text documents, DNA micro-array data
 Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
Methods
 Feature transformation: only effective if most dimensions are
relevant
 PCA & SVD useful only when features are highly
correlated/redundant
 Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
 Subspace-clustering: find clusters in all the possible subspaces
 CLIQUE, ProClus, and frequent pattern-based clustering
The Curse of Dimensionality
(graphs adapted from Parsons et al. KDD Explorations 2004)


Data in only one dimension are relatively densely packed
Adding a dimension "stretches" the points across that dimension, pushing them further apart
Adding more dimensions makes the points still further apart—high-dimensional data is extremely sparse
Distance measures become meaningless—due to equi-distance
Why Subspace Clustering?
(adapted from Parsons et al. SIGKDD Explorations 2004)

Clusters may exist only in some subspaces

Subspace-clustering: find clusters in all the subspaces
CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)

Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
CLIQUE can be considered both density-based and grid-based
  It partitions each dimension into the same number of equal-length intervals
  It partitions an m-dimensional data space into non-overlapping rectangular units
  A unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter
  A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps

Partition the data space and find the number of points that lie inside each cell of the partition
Identify the subspaces that contain clusters using the Apriori principle
Identify clusters
  Determine dense units in all subspaces of interest
  Determine connected dense units in all subspaces of interest
Generate minimal descriptions for the clusters
  Determine maximal regions that cover a cluster of connected dense units for each cluster
  Determine a minimal cover for each cluster
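A rough sketch of the first two steps (grid partitioning, dense-unit counting, and an Apriori-style combination of dense 1-D units into 2-D candidates); the grid resolution xi and density threshold tau are illustrative, not values from the paper:

```python
from collections import Counter
from itertools import combinations

def dense_units(points, xi=5, tau=0.1):
    n, dims = len(points), len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]

    def cell(p, d):                       # interval index of point p along dimension d
        w = (hi[d] - lo[d]) / xi or 1.0
        return min(int((p[d] - lo[d]) / w), xi - 1)

    # 1-D dense units: (dimension, interval) pairs covering > tau of the points
    c1 = Counter((d, cell(p, d)) for p in points for d in range(dims))
    dense1 = {u for u, cnt in c1.items() if cnt / n > tau}

    # 2-D candidates: combine dense 1-D units from different dimensions (Apriori)
    c2 = Counter()
    for p in points:
        for (d1, i1), (d2, i2) in combinations(sorted(dense1), 2):
            if d1 != d2 and cell(p, d1) == i1 and cell(p, d2) == i2:
                c2[((d1, i1), (d2, i2))] += 1
    dense2 = {u for u, cnt in c2.items() if cnt / n > tau}
    return dense1, dense2

pts = [(20, 1), (22, 1.5), (21, 1.2), (45, 5), (46, 5.5), (44, 5.2), (30, 3)]
print(dense_units(pts, xi=4, tau=0.2))
```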
=3
30
40
Vacation
20
50
Salary
(10,000)
0 1 2 3 4 5 6 7
30
Vacation
(week)
0 1 2 3 4 5 6 7
age
60
20
30
40
50
age
60
50
age
20
Strength and Weakness of CLIQUE


Strength
 automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
 insensitive to the order of records in input and does not
presume some canonical data distribution
 scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
  The accuracy of the clustering result may be degraded in exchange for the simplicity of the method
Frequent Pattern-Based Approach

Clustering high-dimensional spaces (e.g., clustering text documents, microarray data)
Projected subspace clustering: which dimensions should be projected on?
  CLIQUE, ProClus
  Feature extraction: costly and may not be effective
Using frequent patterns as "features"
  Clustering by pattern similarity in micro-array data (p-Clustering) [H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets, SIGMOD'02]
Clustering by Pattern Similarity (p-Clustering)

Left figure: Micro-array “raw” data shows 3 genes and their
values in a multi-D space: Difficult to find their patterns

Right two: Some subsets of dimensions form nice shift and
scaling patterns

No globally defined similarity/distance measure

Clusters may not be exclusive

An object can appear in multiple clusters
Why p-Clustering?





Microarray data analysis may require
  Clustering on thousands of dimensions (attributes)
  Discovery of both shift and scaling patterns
Clustering with a Euclidean distance measure? — cannot find shift patterns
Clustering on derived attributes A_{ij} = a_i − a_j? — introduces N(N−1) dimensions
Bi-cluster (Y. Cheng and G. Church. Biclustering of expression data. ISMB'00) uses the mean-squared residue score of a submatrix (I, J):

  H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2

  where  d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij},\quad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij},\quad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}

A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
Problems with bi-clusters
  No downward closure property
  Due to averaging, a bi-cluster may contain outliers yet still stay within the δ-threshold
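A small sketch (illustrative, not the authors' code) of the mean-squared residue H(I, J) and the δ-cluster test:

```python
def mean_squared_residue(d, I, J):
    # d: full data matrix (list of rows); I, J: lists of row / column indices
    d_iJ = {i: sum(d[i][j] for j in J) / len(J) for i in I}
    d_Ij = {j: sum(d[i][j] for i in I) / len(I) for j in J}
    d_IJ = sum(d[i][j] for i in I for j in J) / (len(I) * len(J))
    return sum((d[i][j] - d_iJ[i] - d_Ij[j] + d_IJ) ** 2
               for i in I for j in J) / (len(I) * len(J))

def is_delta_cluster(d, I, J, delta):
    return mean_squared_residue(d, I, J) <= delta

matrix = [[1, 2, 3], [2, 3, 4], [5, 9, 2]]                # illustrative values
print(mean_squared_residue(matrix, [0, 1], [0, 1, 2]))    # a perfect shift pattern -> 0.0
```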
p-Clustering: Clustering by Pattern Similarity

pScore: the similarity between two objects r_x, r_y on two attributes a and b

  pScore\!\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|

δ-pCluster: a submatrix is a δ-pCluster if, for any 2×2 submatrix X in it, pScore(X) ≤ δ (δ > 0)
Properties of δ-pCluster
  Downward closure
  Clusters are more homogeneous than bi-clusters (thus the name: pair-wise Cluster)
MaPle (Pei et al. 2003): efficient mining of maximal p-clusters
For scaling patterns, taking the logarithm of (d_{xa}/d_{ya}) / (d_{xb}/d_{yb}) leads to the pScore form
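A minimal sketch (illustrative) of the pScore and a brute-force δ-pCluster check over all object pairs and attribute pairs:

```python
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(d, objects, attrs, delta):
    # d[o][a]: value of object o on attribute a
    return all(p_score(d[x][a], d[x][b], d[y][a], d[y][b]) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(attrs, 2))

data = {"g1": {"c1": 3, "c2": 5, "c3": 8},
        "g2": {"c1": 13, "c2": 15, "c3": 18}}   # a shift pattern
print(is_delta_pcluster(data, ["g1", "g2"], ["c1", "c2", "c3"], delta=0))   # True
```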
Chapter 11. Cluster Analysis: Advanced Methods

Statistics-Based Clustering
Clustering High-Dimensional Data
Semi-Supervised Learning and Active Learning
Constraint-Based Clustering
Bi-Clustering and Co-Clustering
Collaborative Filtering
Spectral Clustering
Evaluation of Clustering Quality
Summary
Why Constraint-Based Cluster Analysis?


Need user feedback: users know their applications best
Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacles & desired clusters
A Classification of Constraints in Cluster Analysis


Clustering in applications: desirable to have user-guided
(i.e., constrained) cluster analysis
Different constraints in cluster analysis:
  Constraints on individual objects (do selection first)
    Cluster on houses worth over $300K
  Constraints on distance or similarity functions
    Weighted functions, obstacles (e.g., rivers, lakes)
  Constraints on the selection of clustering parameters
    # of clusters, MinPts, etc.
  User-specified constraints
    Contain at least 500 valued customers and 5000 ordinary ones
  Semi-supervised: giving small training sets as "constraints" or hints
Clustering With Obstacle Objects





Tung, Hou, and Han. Spatial
Clustering in the Presence of
Obstacles, ICDE'01
K-medoids is preferable since k-means may locate the ATM center in the middle of a lake
Visibility graph and shortest path
Triangulation and micro-clustering
Two kinds of join indices (shortest paths) worth pre-computing
  VV index: indices for any pair of obstacle vertices
  MV index: indices for any pair of micro-cluster and obstacle vertices
An Example: Clustering With Obstacle Objects
Not Taking obstacles into account
Taking obstacles into account
User-Guided Clustering
[Figure: schema of a CS department database (relations Professor, Student, Course, Open-course, Register, Work-In, Group, Advise, Publish, Publication), with the user hint and the target of clustering marked]
X. Yin, J. Han, P. S. Yu, “Cross-Relational Clustering with User's Guidance”,
KDD'05
User usually has a goal of clustering, e.g., clustering students by research area
User specifies his clustering goal to CrossClus
Comparing with Classification
[Figure: the user hint (an attribute) vs. all tuples for clustering]
User-specified feature (in the form of an attribute) is used as a hint, not as class labels
  The attribute may contain too many or too few distinct values; e.g., a user may want to cluster students into 20 clusters instead of 3
Additional features need to be included in the cluster analysis
Comparing with Semi-Supervised Clustering


Semi-supervised clustering: the user provides a training set consisting of "similar" ("must-link") and "dissimilar" ("cannot-link") pairs of objects
User-guided clustering: the user specifies an attribute as a hint, and more relevant features are found for clustering
[Figure: user-guided clustering vs. semi-supervised clustering over all tuples for clustering]
Why Not Semi-Supervised Clustering?



Much information (in multiple relations) is needed to judge
whether two tuples are similar
A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint,
such as a student’s research area
Tuples to be compared          User hint
  Tom Smith    SC1211    TA
  Jane Chang   BI205     RA
CrossClus: An Overview

Measure similarity between features by how they group
objects into clusters

Use a heuristic method to search for pertinent features


Use tuple ID propagation to create feature values


Start from user-specified feature and gradually
expand search range
Features can be easily created during the expansion
of search range, by propagating IDs
Explore three clustering algorithms: k-means, k-medoids,
and hierarchical clustering
Multi-Relational Features


A multi-relational feature is defined by:
 A join path, e.g., Student → Register → OpenCourse → Course
 An attribute, e.g., Course.area
 (For numerical feature) an aggregation operator, e.g., sum or average
Categorical feature f = [Student → Register → OpenCourse → Course,
Course.area, null]
Areas of courses of each student (counts):

  Tuple   DB   AI   TH
  t1       5    5    0
  t2       0    3    7
  t3       1    5    4
  t4       5    0    5
  t5       3    3    4

Values of feature f (each row normalized to fractions):

  Tuple   DB    AI    TH
  t1      0.5   0.5   0
  t2      0     0.3   0.7
  t3      0.1   0.5   0.4
  t4      0.5   0     0.5
  t5      0.3   0.3   0.4
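A quick sketch (illustrative) of how the course counts above are turned into the fractional values of the categorical feature f:

```python
counts = {"t1": {"DB": 5, "AI": 5, "TH": 0},
          "t2": {"DB": 0, "AI": 3, "TH": 7},
          "t3": {"DB": 1, "AI": 5, "TH": 4}}

def feature_values(row):
    # normalize the counts of one tuple so they sum to 1
    total = sum(row.values())
    return {area: (c / total if total else 0.0) for area, c in row.items()}

f = {t: feature_values(row) for t, row in counts.items()}
print(f["t3"])    # {'DB': 0.1, 'AI': 0.5, 'TH': 0.4}
```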
Representing Features

Similarity between tuples t1 and t2 w.r.t. categorical feature f

Cosine similarity between the vectors f(t_1) and f(t_2):

  sim_f(t_1, t_2) = \frac{\sum_{k=1}^{L} f(t_1).p_k \cdot f(t_2).p_k}{\sqrt{\sum_{k=1}^{L} f(t_1).p_k^2}\; \sqrt{\sum_{k=1}^{L} f(t_2).p_k^2}}

Similarity vector V^f
Most important information of a
feature f is how f groups tuples into
clusters
f is represented by similarities
between every pair of tuples
indicated by f
The horizontal axes are the tuple
indices, and the vertical axis is the
similarity
This can be considered as a vector
of N x N dimensions
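A rough sketch (illustrative) of the tuple-level cosine similarity and the N × N similarity vector V^f:

```python
import math

f = {"t1": [0.5, 0.5, 0.0],
     "t2": [0.0, 0.3, 0.7],
     "t3": [0.1, 0.5, 0.4]}          # values from the table above (truncated)

def sim_f(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

tuples = sorted(f)
# V^f: similarity between every pair of tuples, N*N entries
V_f = [sim_f(f[ti], f[tj]) for ti in tuples for tj in tuples]
print(V_f)
```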
Similarity Between Features
Values of features f (course) and g (group):

  Tuple   f: DB   AI    TH      g: Info sys   Cog sci   Theory
  t1         0.5   0.5   0          1           0         0
  t2         0     0.3   0.7        0           0         1
  t3         0.1   0.5   0.4        0           0.5       0.5
  t4         0.5   0     0.5        0.5         0         0.5
  t5         0.3   0.3   0.4        0.5         0.5       0

Similarity between two features – cosine similarity of their similarity vectors V^f and V^g:

  sim(f, g) = \frac{V^f \cdot V^g}{|V^f|\, |V^g|}
Computing Feature Similarity
Similarity between feature values w.r.t. the tuples:

  sim(f_k, g_q) = \sum_{i=1}^{N} f(t_i).p_k \cdot g(t_i).p_q

Tuple similarities (hard to compute directly) reduce to feature-value similarities (easy to compute):

  V^f \cdot V^g = \sum_{i=1}^{N} \sum_{j=1}^{N} sim_f(t_i, t_j)\, sim_g(t_i, t_j) = \sum_{k=1}^{l} \sum_{q=1}^{m} sim(f_k, g_q)^2

Compute the similarity between each pair of feature values by one scan over the data
[Figure: feature values of f (DB, AI, TH) linked to feature values of g (Info sys, Cog sci, Theory)]
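A small sketch (illustrative) that checks the identity above on the example data, computing V^f · V^g both directly from tuple similarities and via the feature-value similarities sim(f_k, g_q); here sim_f(t_i, t_j) is taken as the unnormalized dot product:

```python
f = {"t1": [0.5, 0.5, 0.0], "t2": [0.0, 0.3, 0.7], "t3": [0.1, 0.5, 0.4],
     "t4": [0.5, 0.0, 0.5], "t5": [0.3, 0.3, 0.4]}
g = {"t1": [1, 0, 0], "t2": [0, 0, 1], "t3": [0, 0.5, 0.5],
     "t4": [0.5, 0, 0.5], "t5": [0.5, 0.5, 0]}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

tuples = sorted(f)

# direct: sum over all tuple pairs of sim_f * sim_g  (N^2 terms)
direct = sum(dot(f[ti], f[tj]) * dot(g[ti], g[tj]) for ti in tuples for tj in tuples)

# one-scan: sim(f_k, g_q) accumulated per pair of feature values, then squared
sim_kq = [[sum(f[t][k] * g[t][q] for t in tuples) for q in range(3)] for k in range(3)]
via_values = sum(s * s for row in sim_kq for s in row)

print(round(direct, 6) == round(via_values, 6))   # True
```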
Searching for Pertinent Features

Different features convey different aspects of information
  Academic performance: GPA, GRE score, number of papers
  Research area: research group area, conferences of papers, advisor
  Demographic info: permanent address, nationality
Features conveying the same aspect of information usually cluster tuples in more similar ways
  Research group areas vs. conferences of publications
Given the user-specified feature, find pertinent features by computing feature similarity
Heuristic Search for Pertinent Features
Overall procedure:
  1. Start from the user-specified feature
  2. Search in the neighborhood of existing pertinent features
  3. Expand the search range gradually
[Figure: the database schema, with the user hint, the target of clustering, and the gradually expanding search range marked]
Tuple ID propagation is used to create multi-relational features
  IDs of target tuples can be propagated along any join path, from which we can find the tuples joinable with each target tuple
Clustering with Multi-Relational Features

Given a set of L pertinent features f_1, …, f_L, the similarity between two tuples is

  sim(t_1, t_2) = \sum_{i=1}^{L} sim_{f_i}(t_1, t_2) \cdot f_i.weight


Weight of a feature is determined in feature search by
its similarity with other pertinent features
Clustering methods

CLARANS [Ng & Han 94], a scalable clustering
algorithm for non-Euclidean space

K-means

Agglomerative hierarchical clustering
Experiments: Compare CrossClus with



Baseline: Only use the user specified feature
PROCLUS [Aggarwal, et al. 99]: a state-of-the-art
subspace clustering algorithm
 Use a subset of features for each cluster
 We convert relational database to a table by
propositionalization
 User-specified feature is forced to be used in every
cluster
RDBC [Kirsten and Wrobel’00]
 A representative ILP clustering algorithm
 Use neighbor information of objects for clustering
 User-specified feature is forced to be used
Measure of Clustering Accuracy

Accuracy

Measured by manually labeled data


We manually assign tuples into clusters according
to their properties (e.g., professors in different
research areas)
Accuracy of clustering: percentage of pairs of tuples in the same cluster that share a common label


This measure favors many small clusters
We let each approach generate the same number of
clusters
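A minimal sketch (illustrative) of the pairwise accuracy measure described above:

```python
from itertools import combinations

def pairwise_accuracy(clusters, labels):
    # clusters: dict cluster_id -> list of tuple ids; labels: dict tuple id -> manual label
    same_cluster_pairs = [(a, b) for members in clusters.values()
                          for a, b in combinations(members, 2)]
    if not same_cluster_pairs:
        return 0.0
    agree = sum(labels[a] == labels[b] for a, b in same_cluster_pairs)
    return agree / len(same_cluster_pairs)

clusters = {0: ["p1", "p2", "p3"], 1: ["p4", "p5"]}
labels = {"p1": "DB", "p2": "DB", "p3": "AI", "p4": "AI", "p5": "AI"}
print(pairwise_accuracy(clusters, labels))   # 0.5: 2 of 4 same-cluster pairs agree
```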
DBLP Dataset
[Figure: clustering accuracy on the DBLP dataset for CrossClus (K-Medoids, K-Means, agglomerative), Baseline, PROCLUS, and RDBC, under different user hints (Word, Coauthor, Conf, and their combinations, up to all three)]
Summary

Cluster analysis groups objects based on their similarity
and has wide applications

Measure of similarity can be computed for various types
of data

Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods

Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches

There are still lots of research issues on cluster analysis
Problems and Challenges


Considerable progress has been made in scalable
clustering methods

Partitioning: k-means, k-medoids, CLARANS

Hierarchical: BIRCH, ROCK, CHAMELEON

Density-based: DBSCAN, OPTICS, DenClue

Grid-based: STING, WaveCluster, CLIQUE

Model-based: EM, Cobweb, SOM

Frequent pattern-based: pCluster

Constraint-based: COD, constrained-clustering
Current clustering techniques do not address all the
requirements adequately, still an active area of research
References

G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
P. Michaud. Clustering Techniques. Future Generation Computer Systems, 13, 1997.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
L. Parsons, E. Haque, and H. Liu. Subspace Clustering for High Dimensional Data: A Review. SIGKDD Explorations, 6(1), June 2004.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases. ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles. ICDE'01.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
X. Yin, J. Han, and P. S. Yu. Cross-Relational Clustering with User's Guidance. Proc. 2005 Int. Conf. on Knowledge Discovery and Data Mining (KDD'05), Chicago, IL, Aug. 2005.
Slides Not to Be Used in Class
Chapter 11. Cluster Analysis: Advanced Methods

Statistics-Based Clustering
  Model-Based Clustering: The Expectation-Maximization Method
  Neural Network Approach (SOM)
  Fuzzy and Non-Crisp Clustering
Clustering High-Dimensional Data
  Why Subspace Clustering?—Challenges on Clustering High-Dimensional Data
  PROCLUS: A Dimension-Reduction Subspace Clustering Method
  Frequent Pattern-Based Clustering Methods
Semi-Supervised Learning and Active Learning
  Semi-Supervised Clustering
  Classification of Partially Labeled Data
Constraint-Based and User-Guided Cluster Analysis
  Clustering with Obstacle Objects
  User-Constrained Cluster Analysis
  User-Guided Cluster Analysis
Bi-Clustering and Co-Clustering
Collaborative Filtering
  Clustering-Based Approach
  Classification-Based Approach
  Frequent Pattern-Based Approach
Spectral Clustering
Evaluation of Clustering Quality
Summary
Conceptual Clustering


Conceptual clustering
 A form of clustering in machine learning
 Produces a classification scheme for a set of unlabeled
objects
 Finds characteristic description for each concept (class)
COBWEB (Fisher'87)
  A popular and simple method of incremental conceptual learning
 Creates a hierarchical clustering in the form of a
classification tree
 Each node refers to a concept and contains a
probabilistic description of that concept
COBWEB Clustering Method
A classification tree
More on Conceptual Clustering

Limitations of COBWEB




The assumption that the attributes are independent of each other is
often too strong because correlation may exist
Not suitable for clustering large database data – skewed tree and
expensive probability distributions
CLASSIT

an extension of COBWEB for incremental clustering of continuous
data

suffers similar problems as COBWEB
AutoClass (Cheeseman and Stutz, 1996)

Uses Bayesian statistical analysis to estimate the number of
clusters

Popular in industry
69