
DMDW Module 5

UNIT-V
CLUSTER ANALYSIS
The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering. A cluster is a collection of data objects that are
 similar to one another within the same cluster
 dissimilar to the objects in other clusters
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Cluster analysis is finding similarities between data according to the characteristics found in
the data and grouping similar data objects into clusters. It is a form of unsupervised learning,
i.e., there are no predefined classes.
Applications include pattern recognition, spatial data analysis, image processing, economic
science, the World Wide Web, and many others.
Examples of cluster applications:
 Marketing: Help marketers discover distinct groups in their customer bases, and use
this knowledge to develop targeted marketing programs.
 Land use: Identification of areas of similar land use in an earth observation database.
 Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost.
 City-planning: Identifying groups of houses according to their house type, value, and
geographical location.
 Earth-quake studies: Observed earth quake epicenters should be clustered along
continent faults.
A good clustering method will produce high-quality clusters with
 high intra-class similarity
 Low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method
and its implementation. Quality can also be measured by its ability to discover some or all of
the hidden patterns.
The following are typical requirements of clustering in data mining:
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input parameters
 Ability to deal with noisy data
 Incremental clustering and insensitivity to the order of input records
 High dimensionality
 Constraint-based clustering
 Interpretability and usability
TYPES OF DATA IN CLUSTER ANALYSIS
Main memory-based clustering algorithms typically operate on either of the following two
data structures.
 Data matrix (or object-by-variable structure): This represents n objects, such as
persons, with p variables (also called measurements or attributes), such as age, height,
weight, gender, and so on. The structure is in the form of a relational table, or n-by-p
matrix (n objects × p variables), which is a two-mode matrix:

    [ x_11 ... x_1f ... x_1p ]
    [  ...      ...      ... ]
    [ x_i1 ... x_if ... x_ip ]
    [  ...      ...      ... ]
    [ x_n1 ... x_nf ... x_np ]

 Dissimilarity matrix (or object-by-object structure): This stores a collection of
proximities that are available for all pairs of n objects. It is often represented by an
n-by-n table, which is a one-mode matrix:

    [   0                               ]
    [ d(2,1)    0                       ]
    [ d(3,1)  d(3,2)    0               ]
    [   ...     ...    ...              ]
    [ d(n,1)  d(n,2)   ...   ...   0    ]

where d(i, j) is the measured difference or dissimilarity between objects i and j.
Data matrix versus dissimilarity matrix:
Data matrix
1. The rows and columns represent different entities.
2. It is called a two-mode matrix.
3. If the data are presented in the form of a data matrix, they can first be transformed into a
dissimilarity matrix before applying clustering algorithms that operate on dissimilarities.
Dissimilarity matrix
1. The rows and columns represent the same entity.
2. It is called a one-mode matrix.
3. Many clustering algorithms operate directly on this matrix.
Types of data in clustering analysis are as follows
1. Interval-scaled variables
2. Binary variables
3. Nominal, ordinal and ratio variables.
4. Variables of mixed types.
5. Vector Objects
1. Interval-scaled variables:
Interval-scaled variables are continuous measurements of a roughly linear scale.
Examples: weight and height, latitude and longitude coordinates (e.g., when clustering
houses), and weather temperature
To standardize measurements, one choice is to convert the original measurements to unitless
variables. Given measurements for a variable f, this can be performed as follows.
1. Calculate the mean absolute deviation, s_f:
   s_f = (1/n)( |x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f| ),
   where m_f = (x_1f + x_2f + ... + x_nf)/n is the mean value of f.
2. Calculate the standardized measurement, or z-score:
   z_if = (x_if - m_f) / s_f
Distances are normally used to measure the similarity or dissimilarity between two data
objects.
 Euclidean distance is defined as
   d(i, j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 )
 Manhattan (or city block) distance is defined as
   d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
Both the Euclidean distance and the Manhattan distance satisfy the following mathematical
requirements of a distance function:
 d(i, j) ≥ 0: distance is a nonnegative number.
 d(i, i) = 0: the distance of an object to itself is 0.
 d(i, j) = d(j, i): distance is symmetric.
 d(i, j) ≤ d(i, k) + d(k, j): the triangle inequality.
Example: Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in the figure. The Euclidean
distance between the two is sqrt(2^2 + 3^2) = sqrt(13) = 3.61. The Manhattan distance between
the two is 2 + 3 = 5.
 Minkowski distance is a generalization of both the Euclidean distance and the Manhattan
distance. It is defined as
   d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)
where q is a positive integer. Such a distance is also called the Lq norm in some
literature. It represents the Manhattan distance when q = 1 (i.e., the L1 norm) and the
Euclidean distance when q = 2 (i.e., the L2 norm).
 If each variable is assigned a weight according to its perceived importance, the
weighted Euclidean distance can be computed as
   d(i, j) = sqrt( w_1 (x_i1 - x_j1)^2 + w_2 (x_i2 - x_j2)^2 + ... + w_p (x_ip - x_jp)^2 )
Weighting can also be applied to the Manhattan and Minkowski distances.
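As a small illustration (not part of the original notes), the following Python sketch computes the
Minkowski-family distances for the example objects x1 = (1, 2) and x2 = (3, 5); the function names
are made up for this sketch.

import math

def minkowski(x, y, q):
    # q = 1 gives the Manhattan (L1) distance, q = 2 gives the Euclidean (L2) distance.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def weighted_euclidean(x, y, w):
    # Each variable f is weighted by w[f] according to its perceived importance.
    return math.sqrt(sum(wf * (a - b) ** 2 for wf, a, b in zip(w, x, y)))

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))                    # Manhattan distance: 5.0
print(round(minkowski(x1, x2, 2), 2))          # Euclidean distance: 3.61
print(weighted_euclidean(x1, x2, (1.0, 1.0)))  # equals the Euclidean distance when all weights are 1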
2. Binary variables:
A binary variable has only two states: 0 or 1, where 0 means that the variable is
absent, and 1 means that it is present. Treating binary variables as if they were interval-scaled
can lead to misleading clustering results. Therefore, methods specific to binary data are
necessary for computing dissimilarities.
One approach to computing the dissimilarity between two objects described by binary
variables starts from a 2-by-2 contingency table, where q is the number of variables that
equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but that
are 0 for object j, s is the number of variables that equal 0 for object i but equal 1 for object j,
and t is the number of variables that equal 0 for both objects i and j. The total number of
variables is p, where p = q + r + s + t.

                       object j
                     1        0       sum
   object i    1     q        r      q + r
               0     s        t      s + t
             sum   q + s    r + t      p
Binary variables are as follows
 Symmetric binary variable: A binary variable is symmetric if both of its states are
equally valuable and carry the same weight; that is, there is no preference on which
outcome should be coded as 0 or 1.
Example: attribute gender  male and female
 Symmetric binary dissimilarity: based on symmetric binary variables, the dissimilarity
between objects i and j is assessed as
   d(i, j) = (r + s) / (q + r + s + t)
 Asymmetric binary variable: A binary variable is asymmetric if the outcomes of the
states are not equally important.
Example: Disease test  positive and negative
 Asymmetric binary dissimilarity: the number of negative matches, t, is
considered unimportant and thus is ignored in the computation:
   d(i, j) = (r + s) / (q + r + s)
Distance between two binary variables can also be measured based on the notion of
similarity instead of dissimilarity.
For example, the asymmetric binary similarity between the objects i and j, or sim(i, j), can be
computed as
   sim(i, j) = q / (q + r + s) = 1 - d(i, j)
The coefficient sim(i, j) is called the Jaccard coefficient.
Example: Consider a patient record table containing the attributes name, gender, fever, cough,
test-1, test-2, test-3 and test-4, where name is an object identifier, gender is a symmetric
attribute, and the remaining attributes are asymmetric binary:

   name    gender  fever  cough  test-1  test-2  test-3  test-4
   Jack      M       Y      N      P       N       N       N
   Mary      F       Y      N      P       N       P       N
   Jim       M       Y      P      N       N       N       N

For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the
value N (no or negative) be set to 0.
Then
   d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
   d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
   d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
These measurements suggest that Jim and Mary are unlikely to have a similar disease because
they have the highest dissimilarity value among the three pairs of patients, whereas Jack and
Mary are the most likely to have a similar disease.
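A minimal Python sketch of the asymmetric binary dissimilarity above, using the patient values
assumed in the worked example (Y/P coded as 1 and N as 0); variable and function names are
illustrative only.

def asymmetric_binary_dissimilarity(i, j):
    # q, r, s as in the 2-by-2 contingency table; negative matches (t) are ignored.
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)   # d(i, j); sim(i, j) = 1 - d(i, j) is the Jaccard coefficient

# fever, cough, test-1, test-2, test-3, test-4 with Y/P = 1 and N = 0 (as in the example)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75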
3. Categorical, Ordinal, and Ratio-Scaled Variables:
Objects can be described by categorical, ordinal, and ratio-scaled variables are as follows
Categorical Variables: A categorical variable is a generalization of the binary variable in
that it can take on more than two states.
Example: map color is a categorical variable that may have, say, five states: red, yellow,
green, pink, and blue.
The dissimilarity between two objects i and j can be computed based on the ratio of
mismatches:
   d(i, j) = (p - m) / p
Where m is the number of matches (i.e., the number of variables for which i and j are in the
same state), and p is the total number of variables. Weights can be assigned to increase the
effect of m or to assign greater weight to the matches in variables having a larger number of
states.
Example: Consider the sample data of Table-I, using for now only the object identifier and the
variable test-1, which is categorical.

   Table-I: Sample data table containing variables of mixed type
   object identifier   test-1 (categorical)   test-2 (ordinal)   test-3 (ratio-scaled)
          1                  code-A               excellent             445
          2                  code-B               fair                   22
          3                  code-C               good                  164
          4                  code-A               excellent           1,210

Here we have one categorical variable, test-1. Setting p = 1, d(i, j) evaluates to 0 if objects i
and j match, and to 1 if the objects differ. The resulting dissimilarity matrix is

   0
   1   0
   1   1   0
   0   1   1   0
Ordinal Variables: A discrete ordinal variable resembles a categorical variable, except that
the M states of the ordinal value are ordered in a meaningful sequence. An ordinal variable
can be discrete or continuous. Order is important.
Example: Professional ranks enumerated in a sequential order, such as assistant, associate and
professors.
It can be treated like an interval-scaled variable as follows:
 Replace x_if by its rank, r_if ∈ {1, ..., M_f}.
 Map the range of each variable onto [0, 1] by replacing the rank of the ith object in the fth
variable by
   z_if = (r_if - 1) / (M_f - 1)
 Compute the dissimilarity using methods for interval-scaled variables.
Example: Consider the sample data of Table-I, this time using only the object identifier and the
ordinal variable test-2. There are three states for test-2, namely fair, good and excellent, i.e., M_f = 3.
 Replace each value for test-2 by its rank; the four objects are assigned the ranks 3, 1, 2 and
3, respectively.
 Normalize the ranking by mapping rank 1 to 0.0, rank 2 to 0.5 and rank 3 to 1.0.
 Use the Euclidean distance, which results in the following dissimilarity matrix:

   0
   1.0   0
   0.5   0.5   0
   0     1.0   0.5   0
Ratio-Scaled Variables: A ratio-scaled variable makes a positive measurement on a nonlinear
scale, approximately an exponential scale, such as
   A e^(Bt)  or  A e^(-Bt)
where A and B are positive constants, and t typically represents time.
Example: Growth of a bacteria population or decay of a radioactive element.
There are three methods to handle ratio-scaled variables for computing the
dissimilarity between objects as follows
 Treat ratio-scaled variables like interval-scaled variables
 Apply logarithmic transformation to a ratio-scaled variable f having value xif
for object i by using the formula yif = log(xif ).
 Treat xif as continuous ordinal data and treat their ranks as interval-valued.
Example: Consider Table-I, this time using only the object identifier and the ratio-scaled variable,
test-3. Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for
the objects 1 to 4, respectively. Using the Euclidean distance on the transformed values, we
obtain the following dissimilarity matrix:

   0
   1.31   0
   0.44   0.87   0
   0.43   1.74   0.87   0
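The two transformations above can be sketched in a few lines of Python; the test-2 and test-3
values are the ones assumed in Table-I, and the helper names are illustrative.

import math

test2 = ["excellent", "fair", "good", "excellent"]   # ordinal, M_f = 3 states
test3 = [445, 22, 164, 1210]                         # ratio-scaled

# Ordinal: replace each value by its rank, then map onto [0, 1] via (rank - 1)/(M_f - 1).
order = {"fair": 1, "good": 2, "excellent": 3}
Mf = len(order)
z = [(order[v] - 1) / (Mf - 1) for v in test2]       # 1.0, 0.0, 0.5, 1.0

# Ratio-scaled: apply a logarithmic transformation y = log10(x).
y = [round(math.log10(v), 2) for v in test3]         # 2.65, 1.34, 2.21, 3.08

def dissimilarity_matrix(values):
    # One-dimensional Euclidean distance reduces to the absolute difference.
    return [[round(abs(a - b), 2) for b in values] for a in values]

print(dissimilarity_matrix(z))
print(dissimilarity_matrix(y))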
4. Variables of Mixed Types:
A database may contain all six types of variables: symmetric binary, asymmetric binary,
nominal, ordinal, interval-scaled and ratio-scaled.
Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between
objects i and j is defined as
   d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f) / Σ_{f=1..p} δ_ij^(f)
where the indicator δ_ij^(f) = 0 if x_if or x_jf is missing, or if x_if = x_jf = 0 and variable f is
asymmetric binary; otherwise δ_ij^(f) = 1. The contribution d_ij^(f) of variable f is computed
according to its type: for binary or categorical variables, d_ij^(f) = 0 if x_if = x_jf and 1 otherwise;
for interval-based variables,
   d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf);
ordinal and ratio-scaled variables are first transformed as described above and then treated as
interval-scaled.
Example: Consider Table-I again, this time using all three variables. The dissimilarity matrices
for test-1 and test-2 are those computed above. For test-3, the log-transformed values are treated
as interval-scaled and normalized by max_h x_hf - min_h x_hf = 3.08 - 1.34 = 1.74. The resulting
dissimilarity matrix obtained for the data described by the three variables of mixed type is

   0
   0.92   0
   0.58   0.67   0
   0.08   1.00   0.67   0
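A minimal sketch of the mixed-type combination formula; the per-variable dissimilarities and
indicators are supplied by the caller, and the helper name is made up for this sketch.

def mixed_dissimilarity(per_variable_d, indicators):
    # d(i, j) = sum_f delta_ij^(f) * d_ij^(f) / sum_f delta_ij^(f), where delta_ij^(f) = 0
    # if a value is missing (or an asymmetric binary variable has a 0-0 match), else 1.
    num = sum(delta * d for delta, d in zip(indicators, per_variable_d))
    den = sum(indicators)
    return num / den if den else 0.0

# Objects 1 and 2 of Table-I: test-1 mismatch (1), test-2 (1.0), normalized test-3 (0.75).
print(round(mixed_dissimilarity([1.0, 1.0, 0.75], [1, 1, 1]), 2))  # 0.92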
5. Vector Objects: Keywords in documents, general features in micro-arrays etc.
Applications are information retrieval, biologic taxonomy etc..
There are several ways to define such a similarity function, s(x, y), to compare two vectors x
and y. One popular way is to define the similarity function as a cosine measure:
   s(x, y) = (x^t · y) / (||x|| ||y||)
where x^t is the transpose of vector x, ||x|| is the Euclidean norm of vector x, ||y|| is the
Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and
y.
Example: Nonmetric similarity between two objects using the cosine measure.
Given two vectors x = (1, 1, 0, 0) and y = (0, 1, 1, 0), the similarity between x and y is
   s(x, y) = 1 / (1.41 × 1.41) = 0.5
A simple variation of the above measure is
   s(x, y) = (x^t · y) / (x^t · x + y^t · y - x^t · y)
which is the ratio of the number of attributes shared by x and y to the number of attributes
possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto
distance, is frequently used in information retrieval and biology taxonomy.
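A short Python sketch of the cosine measure and the Tanimoto coefficient for the two example
vectors; the function names are illustrative.

import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))   # x^t . y
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

x = (1, 1, 0, 0)
y = (0, 1, 1, 0)
print(round(cosine_similarity(x, y), 2))   # 0.5
print(round(tanimoto(x, y), 2))            # 0.33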
A CATEGORIZATION OF MAJOR CLUSTERING METHODS
The major clustering methods can be classified into the following categories.
 Partitioning methods: Construct various partitions and then evaluate them by some
criterion, for example, minimizing the sum of squared errors.
Typical methods are k-means, k-medoids and CLARANS.
 Hierarchical methods: Create a hierarchical decomposition of the set of data (or
objects) using some criterion. These methods can be classified based on
 Agglomerative approach (bottom-up): starts with each object forming a
separate group. It successively merges the objects or groups that are close to
one another, until all of the groups are merged into one (the topmost level of
the hierarchy), or until a termination condition holds.
 Divisive approach (top-down): starts with all of the objects in the same
cluster. In each successive iteration, a cluster is split into smaller clusters,
until eventually each object is in one cluster, or until a termination condition
holds.
Typical methods are BIRCH, ROCK, Chameleon and so on.
 Density-based methods: Based on connectivity and density functions. Typical
methods are DBSCAN, OPTICS, DENCLUE, etc.
 Grid-based methods: Based on a multiple-level granularity structure. Typical
methods are STING, WaveCluster, CLIQUE, etc.
 Model-based methods: A model is hypothesized for each of the clusters, and the method
tries to find the best fit of the data to the given model. Typical methods are EM, SOM and
COBWEB.
Classes of clustering tasks are as follows
 Clustering high-dimensional data is a particularly important task in cluster analysis
because many applications require the analysis of objects containing a large number
of features or dimensions. Frequent pattern-based clustering, another clustering
methodology, extracts distinct frequent patterns among subsets of dimensions that
occur frequently.
 Constraint-based clustering is a clustering approach that performs clustering by
incorporating user-specified or application-oriented constraints. It focuses mainly on
 Spatial clustering
 Semi-supervised clustering
PARTITIONING METHODS
Partitioning methods construct various partitions and then evaluate them by some criterion;
typical methods include k-means, k-medoids, CLARA and CLARANS. Partitioning methods
generally find sphere-shaped clusters.
1. Classical Partitioning methods: The most common used are k-Means and k-Medoid
methods.
Centroid-Based Technique: The k-Means Method
Each cluster is represented by the center (mean) of the cluster. The clustering aims to minimize
the sum of squared distances (the square-error criterion):
   E = Σ_{i=1..k} Σ_{p ∈ Ci} |p - m_i|^2
where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and m_i is the mean of cluster Ci (both p and m_i are
multidimensional).
The k-means method is efficient for large data sets but sensitive to outliers.
The k-means algorithm for partitioning, where each cluster’s center is represented by the
mean value of the objects in the cluster is as follows
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition
(the centroid is the center, i.e., the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no more new assignments are made.
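A compact Python sketch of these four steps for low-dimensional points (an illustration, not an
optimized implementation); picking a random sample of the data as the initial seeds is just one
simple initialization choice.

import math, random

def kmeans(points, k, max_iter=100):
    # Step 1/2: start from k arbitrary seed points (here, a random sample of the data).
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Step 2 (again): recompute each center as the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when no center (and hence no assignment) changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, k=2)
print(centers)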
Example: The accompanying figure (omitted here) illustrates k-means clustering of a set of 2-D
objects: objects are assigned to the nearest of the current cluster means, the means are recomputed,
and the objects are reassigned, over several iterations, until the assignments stabilize.
The process of iteratively reassigning objects to clusters to improve the partitioning is
referred to as iterative relocation.
The method is relatively scalable and efficient in processing large data sets because
the computational complexity of the algorithm is O(nkt), where n is the total number
of objects, k is the number of clusters, and t is the number of iterations. Normally,
k ≪ n and t ≪ n. The method often terminates at a local optimum.
Weaknesses of k-means are:
 Applicable only when the mean of a cluster is defined; it is not defined for categorical data.
 Need to specify k, the number of clusters, in advance.
 Unable to handle noisy data and outliers well.
 Not suitable for discovering clusters with non-convex shapes.
The EM (Expectation-Maximization) algorithm extends the k-means paradigm in a
different way. Whereas the k-means algorithm assigns each object to a cluster, in EM
each object is assigned to each cluster according to a weight representing its
probability of membership. In other words, there are no strict boundaries between
clusters.
The k-means method can be extended to handle categorical data (the k-modes method) by
 replacing the means of clusters with modes,
 using new dissimilarity measures to deal with categorical objects, and
 using a frequency-based method to update the modes of clusters.
For a mixture of categorical and numerical data, the k-prototypes method is used.
The k-means algorithm is sensitive to outliers because an object with an extremely
large value may substantially distort the distribution of data.
Representative Object-Based Technique: The k-Medoids Method
Instead of taking the mean value of the objects in a cluster as a reference point, pick actual
objects to represent the clusters, using one representative object per cluster. The partitioning
method is then performed based on the principle of minimizing the sum of the dissimilarities
between each object and its corresponding reference point. That is, an absolute-error
criterion is used, defined as
   E = Σ_{j=1..k} Σ_{p ∈ Cj} |p - o_j|
where E is the sum of the absolute error for all objects in the data set; p is the point in space
representing a given object in cluster Cj; and o_j is the representative object of Cj.
To decide whether a nonrepresentative object is a good replacement for a current representative
object (medoid), the algorithm examines, for each remaining object, which of four cases applies:
the object may stay with its current representative, or be reassigned to another representative or
to the replacement object. The cost function adds up the resulting changes in absolute error
(the figure showing the four cases is omitted).
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms. It arbitrarily
selects k representative objects (medoids) and then repeatedly tries to replace a medoid by a
nonmedoid object whenever such a swap reduces the total absolute error E.
The complexity of each iteration is O(k(n - k)^2).
The k-medoids method is more robust
than k-means in the presence of noise and outliers, because a medoid is less influenced by
outliers or other extreme values than a mean.
(In the textbook's worked example, changing the medoid of cluster 2 did not change the
assignments of objects to clusters.)
Advantage of k-Medoid method is more robust than k-Means in the presence of noise and
outliers. Disadvantages of k-Medoids are
 It is more costly than the k-Means method.
 Like k-Means, k-Medoids requires the user to specify k.
 It does not scale well for large data sets.
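For illustration only, the following PAM-style Python sketch repeatedly tries swapping a medoid
with a non-medoid object and keeps the swap whenever the total absolute error E decreases;
the initialization and sample data are arbitrary.

import math

def total_absolute_error(points, medoids):
    # E = sum over all objects of the distance to their nearest representative object.
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    medoids = list(points[:k])                  # arbitrary initial representative objects
    improved = True
    while improved:
        improved = False
        best_error = total_absolute_error(points, medoids)
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                error = total_absolute_error(points, trial)
                if error < best_error:          # keep the swap only if it reduces E
                    medoids, best_error, improved = trial, error, True
    return medoids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (100, 100)]  # last point is an outlier
print(k_medoids(points, k=2))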
2. Partitioning Methods in Large Databases: From k-Medoids to CLARANS
CLARA:
 CLARA (Clustering LARge Applications) uses a sampling-based method to deal
with large data sets. A random sample should closely represent the original data. The
chosen medoids will likely be similar to what would have been chosen from the whole
data set.
 CLARA draws multiple samples of the data set. Apply PAM to each sample. Return
the best clustering.
 The complexity of each iteration is O(k s^2 + k(n - k)), where s is the size of the
sample, k is the number of clusters and n is the number of objects.
 PAM finds the best k-medoids among a given data set, whereas CLARA finds the best
k-medoids among the selected samples.
CLARA problems are as follows
 The best k-Medoids may not be selected during the sampling process, in this
case, CLARA will never find the best clustering.
 If the sampling is biased we cannot have a good clustering.
 There is a trade-off between efficiency and clustering quality, controlled by the sample size.
CLARANS:
 CLARANS (Clustering Large Applications based upon RANdomized Search)
was proposed to improve the quality and the scalability of CLARA.
 It combines sampling techniques with PAM.
 It does not confine itself to any sample at a given time.
 It draws a sample with some randomness in each step of the search.
 CLARANS is more effective than both PAM and CLARA. It handles outliers.
 Disadvantages: the computational complexity of CLARANS is O(n^2), where n is
the number of objects, and the clustering quality depends on the sampling method.
HIERARCHICAL METHODS
A hierarchical clustering method works by grouping data objects into a tree of clusters. These
methods are classified as follows
 Agglomerative and Divisive Hierarchical clustering
 BIRCH
 ROCK
 Chameleon.
1. Agglomerative and Divisive Hierarchical Clustering:
 Agglomerative (Bottom-Up / Merging): This bottom-up strategy starts by placing
each object in its own cluster and then merges these atomic clusters into larger and
larger clusters, until all of the objects are in a single cluster or until certain termination
conditions are satisfied.
 Divisive (Top-Down / Splitting): This top-down strategy does the reverse of
agglomerative hierarchical clustering by starting with all objects in one cluster. It
subdivides the cluster into smaller and smaller pieces, until each object forms a cluster
on its own or until it satisfies certain termination conditions, such as a desired number
of clusters is obtained or the diameter of each cluster is within a certain threshold.
Example: Below figure shows the application of AGNES (AGglomerative NESting), an
agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive
hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}. Initially, AGNES
places each object into a cluster of its own. The clusters are then merged step-by-step. In
DIANA, all of the objects are used to form one initial cluster.
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step. Dendrogram
representation for hierarchical clustering of data objects {a, b, c, d, e } is as follows
Where l= 0 shows the five objects as singleton clusters at level 0. At l = 1, objects a and b are
grouped together to form the first cluster, and they stay together at all subsequent levels.
Four widely used measures for the distance between clusters are as follows, where |p - p′|
is the distance between two objects or points, p and p′; m_i is the mean of cluster Ci; and
n_i is the number of objects in Ci.
 Minimum distance: d_min(Ci, Cj) = min_{p ∈ Ci, p′ ∈ Cj} |p - p′|
 Maximum distance: d_max(Ci, Cj) = max_{p ∈ Ci, p′ ∈ Cj} |p - p′|
 Mean distance: d_mean(Ci, Cj) = |m_i - m_j|
 Average distance: d_avg(Ci, Cj) = (1 / (n_i n_j)) Σ_{p ∈ Ci} Σ_{p′ ∈ Cj} |p - p′|
When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance
between clusters, it is sometimes called a nearest-neighbor clustering algorithm.
If the clustering process is terminated when the distance between nearest clusters
exceeds an arbitrary threshold, it is called a single-linkage algorithm.
An agglomerative hierarchical clustering algorithm that uses the minimum distance
measure is also called a minimal spanning tree algorithm.
When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance
between clusters, it is sometimes called a farthest-neighbor clustering algorithm.
If the clustering process is terminated when the maximum distance between nearest
clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
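A minimal Python sketch of a nearest-neighbor (single-linkage) agglomerative procedure using
d_min: it merges the two closest clusters until the requested number of clusters remains. The
function name and sample data are illustrative.

import math

def single_linkage(points, num_clusters):
    # Start with each object as a singleton cluster (agglomerative, bottom-up).
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # d_min(Ci, Cj): minimum distance between any pair of objects in the two clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # merge the two nearest clusters
        del clusters[j]
    return clusters

print(single_linkage([(1, 1), (1, 2), (2, 2), (8, 8), (9, 8)], num_clusters=2))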
Difficulties with hierarchical clustering are as follows
 Although it looks simple, difficulties are encountered regarding the selection of merge or
split points.
 If merge or split decisions are not chosen well at some step, they may lead to low-quality
clusters, because such decisions cannot be undone at later steps.
2. BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
 BIRCH is designed for clustering a large amount of numerical data by integration of
hierarchical clustering (at the initial microclustering stage) and other clustering
methods such as iterative partitioning (at the later macroclustering stage). It
overcomes two difficulties of agglomerative clustering methods: (1) scalability
and (2) the inability to undo what was done in a previous step.
 BIRCH introduces two concepts
 clustering feature and
 clustering feature tree (CF tree),
which are used to summarize cluster representations.
 Given n d-dimensional data objects or points in a cluster, define the centroid x0, radius
R, and diameter D of the cluster as follows:
   x0 = ( Σ_{i=1..n} x_i ) / n
   R  = sqrt( Σ_{i=1..n} (x_i - x0)^2 / n )
   D  = sqrt( Σ_{i=1..n} Σ_{j=1..n} (x_i - x_j)^2 / (n(n - 1)) )
where R is the average distance from member objects to the centroid, and D is the
average pairwise distance within a cluster. Both R and D reflect the tightness of the
cluster around the centroid.
A clustering feature (CF) is a three-dimensional vector summarizing information
about clusters of objects. Given n d-dimensional objects or points in a cluster, {x_i},
the CF of the cluster is defined as
   CF = (n, LS, SS)
where n is the number of points in the cluster, LS is the linear sum of the n points (i.e., Σ x_i),
and SS is the square sum of the data points (i.e., Σ x_i^2).
Example: Consider two disjoint clusters, C1 and C2, with clustering features CF1 and CF2,
respectively. The clustering feature of the cluster formed by merging C1 and C2 is simply
CF1 + CF2.
Example: Suppose that there are three points, (2, 5), (3, 2), and (4, 3), in a cluster, C1.
The clustering feature of C1 is
   CF1 = (3, (2 + 3 + 4, 5 + 2 + 3), (2^2 + 3^2 + 4^2, 5^2 + 2^2 + 3^2)) = (3, (9, 10), (29, 38))
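A small Python sketch of the clustering feature and its additivity property, reproducing the
example above; SS is kept per dimension here for simplicity.

def clustering_feature(points):
    # CF = (n, LS, SS): number of points, linear sum and square sum per dimension.
    n = len(points)
    dims = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dims))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))
    return n, ls, ss

def merge_cf(cf1, cf2):
    # Additivity: the CF of the merged cluster is simply CF1 + CF2 (component-wise).
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), tuple(a + b for a, b in zip(ss1, ss2))

cf1 = clustering_feature([(2, 5), (3, 2), (4, 3)])
print(cf1)                                        # (3, (9, 10), (29, 38))
print(merge_cf(cf1, clustering_feature([(6, 6)])))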
 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering. A CF tree has two parameters: branching factor, B, and threshold, T. The
branching factor specifies the maximum number of children per nonleaf node. The
threshold parameter specifies the maximum diameter of subclusters stored at the leaf
nodes of the tree.
 BIRCH applies a multiphase clustering technique; its primary phases are:
 Phase 1: BIRCH scans the database to build an initial in-memory CF tree,
which can be viewed as a multilevel compression of the data that tries to
preserve the inherent clustering structure of the data
 Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf
nodes of the CF tree, which removes sparse clusters as outliers and groups
dense clusters into larger ones.
 The computational complexity of the algorithm is O(n), where n is the number of
objects to be clustered.
3. ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes
ROCK (RObust Clustering using linKs) is a hierarchical clustering algorithm that explores
the concept of links (the number of common neighbors between two objects) for data with
categorical attributes. ROCK takes a more global approach to clustering by considering the
neighborhoods of individual pairs of points. If two similar points also have similar
neighborhoods, then the two points likely belong to the same cluster and so can be merged.
Two points, pi and pj, are neighbors if sim(pi, pj) ≥ θ, where sim is a similarity
function and θ is a user-specified threshold.
ROCK's concepts of neighbors and links can be illustrated with market-basket data, where the
similarity between two “points” or transactions, Ti and Tj, is defined with the Jaccard coefficient as
   sim(Ti, Tj) = |Ti ∩ Tj| / |Ti ∪ Tj|
Example: Suppose that a market basket database contains transactions regarding the items a,
b, . . . , g. Consider two clusters of transactions,C1 andC2. C1, which references the items (a, b,
c, d, e), contains the transactions {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}. C2 references the items {a, b, f , g}. It contains the
transactions {a, b, f }, {a, b, g}, {a, f , g}, {b, f , g}.
The Jaccard coefficient between the transactions {a, b, c} and {b, d, e} of C1 is 1/5 = 0.2. In
fact, the Jaccard coefficient between any pair of transactions in C1 ranges from 0.2 to 0.5
(e.g., {a, b, c} and {a, b, d}). The Jaccard coefficient between transactions belonging to
different clusters may also reach 0.5 (e.g., {a, b, c} of C1 with {a, b, f } or {a, b, g} of C2).
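A short Python sketch of the Jaccard coefficient between transactions and of ROCK's neighbor
relation for a threshold θ; the transactions and the threshold value are illustrative.

def jaccard(t1, t2):
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)      # |Ti ∩ Tj| / |Ti ∪ Tj|

def neighbors(transactions, theta):
    # Two transactions are neighbors if their Jaccard similarity is at least theta.
    return [(i, j) for i in range(len(transactions))
            for j in range(i + 1, len(transactions))
            if jaccard(transactions[i], transactions[j]) >= theta]

c1 = [{"a", "b", "c"}, {"b", "d", "e"}, {"a", "b", "d"}]
print(round(jaccard(c1[0], c1[1]), 1))      # {a,b,c} vs {b,d,e}: 1/5 = 0.2
print(neighbors(c1, theta=0.5))             # pairs with Jaccard similarity >= 0.5: [(0, 2), (1, 2)]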
The worst-case time complexity of ROCK is O(n^2 + n m_m m_a + n^2 log n), where m_m and m_a
are the maximum and average number of neighbors, respectively, and n is the number of
objects.
4. Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine
the similarity between pairs of clusters. It was derived based on the observed weaknesses of
two hierarchical clustering algorithms: ROCK and CURE.
It measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity and closeness (proximity)
between two clusters are high relative to the internal interconnectivity of the clusters
and closeness of items within the clusters.
 CURE ignores information about the interconnectivity of the objects.
 ROCK ignores information about the closeness of two clusters.
 It is a two-phase algorithm
 Use a graph partitioning algorithm: Cluster objects into a large number of
relatively small sub-clusters.
 Use an agglomerative hierarchical clustering algorithm: find the genuine
clusters by repeatedly combining these sub-clusters.
Chameleon: Hierarchical clustering based on k-nearest neighbors and dynamic modeling
The graph-partitioning algorithm partitions the k-nearest-neighbor graph such that it
minimizes the edge cut. That is, a cluster C is partitioned into subclusters Ci and Cj so as to
minimize the weight of the edges that would be cut should C be bisected into Ci and Cj. Edge
cut is denoted EC(Ci, Cj) and assesses the absolute interconnectivity between clusters Ci and
Cj. Chameleon determines the similarity between each pair of clusters Ci and Cj according to
their relative interconnectivity, RI(Ci, Cj), and their relative closeness, RC(Ci, Cj):
 The relative interconnectivity, RI(Ci, Cj), between two clusters, Ci and Cj, is defined
as the absolute interconnectivity between Ci and Cj, normalized with respect to the
internal interconnectivity of the two clusters, Ci and Cj. That is,
   RI(Ci, Cj) = |EC(Ci, Cj)| / ( (|EC(Ci)| + |EC(Cj)|) / 2 )
where EC(Ci) is the minimum-weight edge cut that bisects cluster Ci into two roughly equal parts.
 The relative closeness, RC(Ci, Cj), between a pair of clusters, Ci and Cj, is the
absolute closeness between Ci and Cj, normalized with respect to the internal
closeness of the two clusters, Ci and Cj. It is defined as
   RC(Ci, Cj) = S̄_EC(Ci, Cj) / ( (|Ci| / (|Ci| + |Cj|)) S̄_EC(Ci) + (|Cj| / (|Ci| + |Cj|)) S̄_EC(Cj) )
where S̄_EC(Ci, Cj) is the average weight of the edges that connect vertices in Ci to vertices in Cj,
and S̄_EC(Ci) is the average weight of the edges that belong to the min-cut bisector of cluster Ci.
The processing cost for high-dimensional data may require O(n2) time for n objects in the
worst case.
DENSITY-BASED METHODS
Clustering based on density (local cluster criterion), such as density-connected points. Major
features are as follows
 Discover clusters of arbitrary shape
 Handle noise.
 One scan
 Need density parameters as termination condition
Density based clustering algorithms as follows
 DBSCAN grows clusters according to a density-based connectivity analysis.
 OPTICS extends DBSCAN to produce a cluster ordering obtained from a wide range
of parameter settings.
 DENCLUE clusters objects based on a set of density distribution functions.
1. DBSCAN: A Density-Based Clustering Method Based on Connected Regions with
Sufficiently High Density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density based
clustering algorithm. The algorithm grows regions with sufficiently high density into clusters
and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as
a maximal set of density-connected points.
The basic ideas of density-based clustering involve a number of new definitions as
 The neighborhood within a radius ε of a given object is called the ε-neighborhood of
the object.
 If the ε-neighborhood of an object contains at least a minimum number, MinPts, of
objects, then the object is called a core object.
 Given a set of objects, D, say that an object p is directly density-reachable from
object q if p is within the ε-neighborhood of q, and q is a core object.
 An object p is density-reachable from object q with respect to ε and MinPts in a set
of objects, D, if there is a chain of objects p1, . . . , pn, where p1 = q and pn = p such
that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for 1 ≤ i ≤ n - 1,
pi ∈ D.
 An object p is density-connected to object q with respect to ε and MinPts in a set of
objects, D, if there is an object o є D such that both p and q are density-reachable
from o with respect to ε and MinPts.
Density reachability is the transitive closure of direct density reachability, and this
relationship is asymmetric. Density connectivity, however, is a symmetric relation.
Example: Consider the figure below for a given ε, represented by the radius of the circles, and
let MinPts = 3.
 Of the labeled points, m, p, o, and r are core objects because each is in an
ε-neighborhood containing at least three points.
 q is directly density-reachable from m. m is directly density-reachable from p and
vice versa.
 q is (indirectly) density-reachable from p because q is directly density-reachable from
m and m is directly density-reachable from p. However, p is not density-reachable
from q because q is not a core object. Similarly, r and s are density-reachable from o,
and o is density-reachable from r.
 o, r, and s are all density-connected.
A density-based cluster is a set of density-connected objects that is maximal with respect to
density-reachability. Every object not contained in any cluster is considered to be noise.
 DBSCAN searches for clusters by checking the ε -neighborhood of each point in the
database.
 If the ε -neighborhood of a point p contains more than MinPts, a new cluster with p
as a core object is created.
 DBSCAN then iteratively collects directly density-reachable objects from these core
objects, which may involve the merge of a few density-reachable clusters.
 The process terminates when no new point can be added to any cluster.
If a spatial index is used, the computational complexity of DBSCAN is O(nlogn), where n is
the number of database objects. Otherwise, it is O(n2).
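A compact Python sketch of the DBSCAN idea described above, scanning ε-neighborhoods
without a spatial index (i.e., the O(n^2) variant); the sample data and parameter values are
illustrative.

import math

def region_query(points, p, eps):
    # Indices of all points in the eps-neighborhood of point index p (including p itself).
    return [q for q in range(len(points)) if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] is not UNVISITED:
            continue
        neighborhood = region_query(points, p, eps)
        if len(neighborhood) < min_pts:
            labels[p] = NOISE                    # not a core object (may be relabeled as border later)
            continue
        labels[p] = cluster_id                   # p is a core object: start a new cluster
        seeds = list(neighborhood)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id           # border point: density-reachable but not core
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighborhood = region_query(points, q, eps)
            if len(q_neighborhood) >= min_pts:   # q is also a core object: expand from it
                seeds.extend(q_neighborhood)
        cluster_id += 1
    return labels

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (25, 25)]
print(dbscan(points, eps=2.0, min_pts=3))        # [0, 0, 0, 0, 1, 1, 1, -1]; the last point is noise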
2. OPTICS: Ordering Points To Identify the Clustering Structure
It is ordering points to identify the clustering structure.
 Produces a special ordering of the database with respect to its density-based clustering structure.
 This cluster-ordering contains information equivalent to the density-based clustering
corresponding to a broad range of parameter settings.
 Good for both automatic and interactive cluster analysis, including finding intrinsic
clustering structure.
 Can be represented graphically or using visualization techniques.
It uses two values to be stored for each object as follows
 The core-distance of an object p is the smallest ε′ value that makes {p} a core object.
If p is not a core object, the core-distance of p is undefined.
 The reachability-distance of an object q with respect to another object p is the
greater value of the core-distance of p and the Euclidean distance between p and q. If
p is not a core object, the reachability-distance between p and q is undefined.
Example:
The figure (omitted) illustrates core-distance and reachability-distance. Suppose that ε = 6 mm and
MinPts = 5. The core-distance of p is the distance, ε′, between p and the fourth-closest data object.
The reachability-distance of q1 with respect to p is the core-distance of p (i.e., ε′ = 3 mm), because
this is greater than the Euclidean distance from p to q1. The reachability-distance of q2 with respect
to p is the Euclidean distance from p to q2, because this is greater than the core-distance of p.
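A small Python sketch of just these two quantities (core-distance and reachability-distance), not
the full OPTICS cluster-ordering; whether p itself is counted inside its own neighborhood is a
convention assumed here, and the data are illustrative.

import math

def core_distance(points, p, eps, min_pts):
    # Smallest radius that makes p a core object: the distance to its min_pts-th nearest
    # point (counting p itself here; conventions differ), or None if that exceeds eps.
    dists = sorted(math.dist(points[p], q) for q in points)
    return dists[min_pts - 1] if dists[min_pts - 1] <= eps else None

def reachability_distance(points, q, p, eps, min_pts):
    # max(core-distance(p), dist(p, q)); undefined (None) if p is not a core object.
    cd = core_distance(points, p, eps, min_pts)
    return None if cd is None else max(cd, math.dist(points[p], points[q]))

points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
print(core_distance(points, 0, eps=2.0, min_pts=3))            # 1.0
print(reachability_distance(points, 4, 0, eps=2.0, min_pts=3)) # Euclidean distance, since it exceeds the core-distance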
The cluster ordering of a data set can be represented graphically. For example, below figure
is the reachability plot for a simple two-dimensional data set, which presents a general
overview of how the data are structured and clustered. The data objects are plotted in cluster
order (horizontal axis) together with their respective reachability-distance (vertical axis).
OPTICS has the runtime complexity as O(nlogn) if a spatial index is used, where n is the
number of objects.
3. DENCLUE: Clustering Based on Density Distribution Functions
DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of density
distribution functions. The method is built on the following ideas
(1) the influence of each data point can be formally modeled using a mathematical
function, called an influence function, which describes the impact of a data point
within its neighborhood;
(2) the overall density of the data space can be modeled analytically as the sum of the
influence function applied to all data points; and
(3) clusters can then be determined mathematically by identifying density attractors,
where density attractors are local maxima of the overall density function.
Let x and y be objects or points in F^d, a d-dimensional input space. The influence function of
data object y on x is a function f_B^y : F^d → R_0^+, which is defined in terms of a basic
influence function f_B:
   f_B^y(x) = f_B(x, y)
This can be used, for example, to compute a square wave influence function,
   f_Square(x, y) = 0 if d(x, y) > σ, and 1 otherwise,
or a Gaussian influence function,
   f_Gauss(x, y) = e^( -d(x, y)^2 / (2σ^2) )
The density function at an object or point x is defined as the sum of the influence functions of
all data points. Given n data objects, D = {x_1, ..., x_n}, the density function at x is defined as
   f_B^D(x) = Σ_{i=1..n} f_B^{x_i}(x)
Below figure shows a 2-D data set together with the corresponding overall density function
for a square wave and a Gaussian influence function
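A short Python sketch of a Gaussian influence function and the resulting overall density function
(the sum of the influences of all data points); the value of σ and the sample data are illustrative.

import math

def gaussian_influence(x, y, sigma):
    # f_Gauss(x, y) = exp(-d(x, y)^2 / (2 * sigma^2))
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    # Overall density at x: the sum of the influences of all data points on x.
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9)]
print(round(density((1.5, 1.5), data, sigma=1.0), 3))   # high density: x lies near a dense region
print(round(density((5.0, 5.0), data, sigma=1.0), 3))   # low density: x lies between the two groups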
Advantages of DENCLUE are as follows
 Solid mathematical foundation.
 Good for data sets with large amounts of noise.
 Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional
data sets.
 Significantly faster than existing algorithms (e.g., DBSCAN).
 However, it needs a large number of parameters.
MODEL-BASED CLUSTERING METHODS
Model-based clustering methods attempt to optimize the fit between the given data and some
mathematical model. Three examples of model-based clustering are as follows
 Expectation-Maximization – presents an extension of the k-Means partitioning
algorithm.
 Conceptual Clustering
 Neural Network Approach
1. Expectation-Maximization:
A popular iterative refinement algorithm. An extension to k-Means
 Assign each object to a cluster according to a weight (probability distribution).
 New means are computed based on weighted measures.
This process can be as follows
 Starts with an initial estimate of the parameter vector.
 Iteratively rescores the patterns against the mixture density produced by the parameter
vector.
 The rescored patterns are used to update the parameter estimates.
 Patterns are considered to belong to the same cluster if they are placed by their scores in a
particular component.
The EM algorithm is as follows
 Initially, randomly assign k cluster centers.
 Iteratively refine the clusters based on two steps
 Expectation step: Assign each object x_i to cluster C_k with probability
   P(x_i ∈ C_k) = p(C_k | x_i) = p(C_k) p(x_i | C_k) / p(x_i)
where p(x_i | C_k) = N(m_k, E_k(x_i)) follows the normal (i.e., Gaussian)
distribution around mean m_k, with expectation E_k.
 Maximization step: Estimate the model parameters using the probability estimates from
the E-step, for example,
   m_k = (1/n) Σ_{i=1..n} x_i P(x_i ∈ C_k) / Σ_j P(x_i ∈ C_j)
This step is the “maximization” of the likelihood of the distributions
given the data.
The computational complexity is linear in d (the number of input features), n (the number of
objects), and t (the number of iterations).
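A compact Python sketch of EM for a one-dimensional mixture of k Gaussians with equal priors,
following the E-step/M-step description above; the initialization and convergence handling are
deliberately simplistic, and the sample data are illustrative.

import math, random

def em_gaussian_mixture(data, k, iterations=50):
    means = random.sample(data, k)                 # initial cluster centers
    variances = [1.0] * k
    for _ in range(iterations):
        # E-step: probability (weight) of each object belonging to each cluster.
        weights = []
        for x in data:
            dens = [math.exp(-(x - means[j]) ** 2 / (2 * variances[j])) /
                    math.sqrt(2 * math.pi * variances[j]) for j in range(k)]
            total = sum(dens) or 1e-12
            weights.append([d / total for d in dens])
        # M-step: re-estimate means and variances from the weighted objects.
        for j in range(k):
            wj = sum(w[j] for w in weights) or 1e-12
            means[j] = sum(w[j] * x for w, x in zip(weights, data)) / wj
            variances[j] = sum(w[j] * (x - means[j]) ** 2 for w, x in zip(weights, data)) / wj or 1e-6
    return means, variances

data = [1.0, 1.2, 0.8, 1.1, 7.9, 8.1, 8.0, 7.8]
print(em_gaussian_mixture(data, k=2))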
Important questions are as follows
1. a. Discuss about the categorization of major clustering methods.
b. Explain in detail about partitioning methods and hierarchical methods.
2. a. What is outlier? Explain outlier with suitable example.
b.Write short notes on partitioning and model based clustering.
3. a. Explain about density based clustering.
b. Explain about outlier analysis.
4. a. Briefly outline how to compute the dissimilarity between objects described by the
following types of variables
i. Numerical (interval-scaled) variables
ii. Asymmetric binary variables
iii. Categorical variables
b. Why is outlier mining important? Briefly describe the different approaches behind
statistical-based outlier detection, distance-based outlier detection, density-based
local outlier detection, and deviation-based outlier detection.
5. a. Explain the different categories of clustering methods.
b. Explain agglomerative and divisive hierarchical clustering.
6. What are model-based clustering methods? Explain any two model-based clustering
methods in detail.
7. What are grid-based methods? Explain any two grid-based methods in detail.