UNIT-V CLUSTER ANALYSIS

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity. Cluster analysis finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters. It is unsupervised learning, i.e., there are no predefined classes. Applications include pattern recognition, spatial data analysis, image processing, economic science, the World Wide Web, and many others.

Examples of clustering applications:
Marketing: Help marketers discover distinct groups in their customer bases, and use this knowledge to develop targeted marketing programs.
Land use: Identification of areas of similar land use in an earth observation database.
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
City planning: Identifying groups of houses according to their house type, value, and geographical location.
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults.

A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. Quality can also be measured by the method's ability to discover some or all of the hidden patterns.

The following are typical requirements of clustering in data mining:
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Incremental clustering and insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability

TYPES OF DATA IN CLUSTER ANALYSIS

Main memory-based clustering algorithms typically operate on either of the following two data structures.
Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is a relational table, or n-by-p matrix (n objects x p variables). It is a two-mode structure.
Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table (a one-mode structure), where d(i, j) is the measured difference or dissimilarity between objects i and j.

Data matrix versus dissimilarity matrix:
Data matrix: 1. The rows and columns represent different entities. 2. It is called a two-mode matrix. 3. If the data are presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before applying clustering algorithms that operate on dissimilarities.
Dissimilarity matrix: 1. The rows and columns represent the same entity. 2. It is called a one-mode matrix. 3. Many clustering algorithms operate directly on this matrix.

Types of data in cluster analysis are as follows:
1. Interval-scaled variables
2. Binary variables
3. Nominal, ordinal, and ratio-scaled variables
4. Variables of mixed types
5. Vector objects
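The relationship between the two structures can be seen in a short sketch. The following minimal Python/NumPy example (the 4-by-2 data matrix values are made up purely for illustration) builds the one-mode dissimilarity matrix from a two-mode data matrix using the Euclidean distance defined in the next subsection.

import numpy as np

# Hypothetical two-mode data matrix: 4 objects (rows) x 2 variables (columns).
X = np.array([[1.0, 2.0],
              [3.0, 5.0],
              [2.0, 2.0],
              [8.0, 9.0]])

n = X.shape[0]
# One-mode dissimilarity matrix: n x n, d(i, j) = Euclidean distance between objects i and j.
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
        D[i, j] = D[j, i] = d   # the matrix is symmetric, with d(i, i) = 0

print(D.round(2))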
1. Interval-scaled variables: Interval-scaled variables are continuous measurements on a roughly linear scale. Examples: weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.

To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows. First compute the mean absolute deviation, sf:

sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|)

where mf is the mean value of f, that is, mf = (1/n)(x1f + x2f + ... + xnf). Then compute the standardized measurement, or z-score:

zif = (xif - mf) / sf

Distances are normally used to measure the similarity or dissimilarity between two data objects.

Euclidean distance is defined as

d(i, j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2)

Manhattan (or city block) distance is defined as

d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:
d(i, j) >= 0: distance is a nonnegative number.
d(i, i) = 0: the distance of an object to itself is 0.
d(i, j) = d(j, i): distance is symmetric.
d(i, j) <= d(i, k) + d(k, j): the triangle inequality.

Example: Let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Euclidean distance between the two is sqrt(2^2 + 3^2) = sqrt(13) = 3.61. The Manhattan distance between the two is 2 + 3 = 5.

Minkowski distance is a generalization of both the Euclidean distance and the Manhattan distance. It is defined as

d(i, j) = (|xi1 - xj1|^p + |xi2 - xj2|^p + ... + |xip - xjp|^p)^(1/p)

where p is a positive integer. Such a distance is also called the Lp norm in some literature. It represents the Manhattan distance when p = 1 (i.e., the L1 norm) and the Euclidean distance when p = 2 (i.e., the L2 norm).

If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as

d(i, j) = sqrt(w1|xi1 - xj1|^2 + w2|xi2 - xj2|^2 + ... + wp|xip - xjp|^2)

Weighting can also be applied to the Manhattan and Minkowski distances.
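As a quick check of the example above, a small Python sketch can compute the Euclidean, Manhattan, and Minkowski distances between x1 = (1, 2) and x2 = (3, 5); the helper name minkowski_distance is just an illustrative choice.

def minkowski_distance(x, y, p):
    # Lp norm: p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x1 = (1, 2)
x2 = (3, 5)

print(minkowski_distance(x1, x2, 1))   # Manhattan: 2 + 3 = 5.0
print(minkowski_distance(x1, x2, 2))   # Euclidean: sqrt(13), about 3.61
print(minkowski_distance(x1, x2, 3))   # L3 norm, about 3.27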
2. Binary variables: A binary variable has only two states, 0 or 1, where 0 means that the variable is absent and 1 means that it is present. Treating binary variables as if they were interval-scaled can lead to misleading clustering results; therefore, methods specific to binary data are necessary for computing dissimilarities.

To compute the dissimilarity between two objects described by binary variables, consider a 2-by-2 contingency table, where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j. The total number of variables is p, where p = q + r + s + t.

Binary variables are of two kinds:
Symmetric binary variable: A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. Example: the attribute gender, with the states male and female. The symmetric binary dissimilarity, based on symmetric binary variables, assesses the dissimilarity between objects i and j as

d(i, j) = (r + s) / (q + r + s + t)

Asymmetric binary variable: A binary variable is asymmetric if the outcomes of the states are not equally important. Example: the positive and negative outcomes of a disease test. In the asymmetric binary dissimilarity, the number of negative matches, t, is considered unimportant and is therefore ignored in the computation:

d(i, j) = (r + s) / (q + r + s)

Distance can also be measured between two objects based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between objects i and j, or sim(i, j), can be computed as

sim(i, j) = q / (q + r + s) = 1 - d(i, j)

The coefficient sim(i, j) is called the Jaccard coefficient.

Example: Consider a patient record table containing the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary. For the asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Computing the asymmetric binary dissimilarity for each pair of patients gives

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75

These measurements suggest that Mary and Jim are unlikely to have a similar disease, because they have the highest dissimilarity value among the three pairs of patients, whereas Jack and Mary are the most likely to have a similar disease.

3. Categorical, Ordinal, and Ratio-Scaled Variables: Objects can also be described by categorical, ordinal, and ratio-scaled variables, as follows.

Categorical variables: A categorical variable is a generalization of the binary variable in that it can take on more than two states. Example: map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue. The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state) and p is the total number of variables. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states.

Example: Suppose the sample data of Table-I are given, except that only the object identifier and the variable test-1 are available, where test-1 is categorical. Since there is only one categorical variable, test-1, we set p = 1, so that d(i, j) evaluates to 0 if objects i and j match and to 1 if the objects differ.

Ordinal variables: A discrete ordinal variable resembles a categorical variable, except that its Mf states are ordered in a meaningful sequence. An ordinal variable can be discrete or continuous; order is important. Example: professional ranks enumerated in a sequential order, such as assistant, associate, and full professor. Ordinal variables can be treated like interval-scaled variables as follows:
Replace each xif by its rank rif in {1, ..., Mf}.
Map the range of each variable onto [0, 1] by replacing the rank of the ith object in the fth variable by zif = (rif - 1) / (Mf - 1).
Compute the dissimilarity using methods for interval-scaled variables.

Example: Suppose the sample data are taken from Table-I, except that this time only the object identifier and the continuous ordinal variable test-2 are available. There are three states for test-2, namely fair, good, and excellent, i.e., Mf = 3. Replacing each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Normalizing the ranking maps rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. Using the Euclidean distance on these normalized values results in the following dissimilarity values: d(2, 1) = 1.0, d(3, 1) = 0.5, d(3, 2) = 0.5, d(4, 1) = 0.0, d(4, 2) = 1.0, d(4, 3) = 0.5.

Ratio-scaled variables: A ratio-scaled variable makes a positive measurement on a nonlinear scale, approximately an exponential scale, such as Ae^(Bt) or Ae^(-Bt), where A and B are positive constants and t typically represents time. Examples: the growth of a bacteria population or the decay of a radioactive element. There are three methods to handle ratio-scaled variables when computing the dissimilarity between objects:
Treat ratio-scaled variables like interval-scaled variables.
Apply a logarithmic transformation to a ratio-scaled variable f having value xif for object i by using the formula yif = log(xif).
Treat xif as continuous ordinal data and treat the ranks as interval-valued.

Example: Consider Table-I, except that only the object identifier and the ratio-scaled variable test-3 are available. Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for objects 1 to 4, respectively. Using the Euclidean distance on the transformed values, we obtain the following dissimilarity values: d(2, 1) = 1.31, d(3, 1) = 0.44, d(3, 2) = 0.87, d(4, 1) = 0.43, d(4, 2) = 1.74, d(4, 3) = 0.87.
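The dissimilarity measures for binary and categorical data translate directly into code. The sketch below is a minimal illustration; the two 0/1 vectors and the categorical values are made-up examples, not the values of Table-I. It computes the asymmetric binary dissimilarity and Jaccard coefficient from the contingency counts q, r, s, and t, and the categorical mismatch ratio (p - m)/p.

def asymmetric_binary_dissimilarity(a, b):
    # a, b are equal-length sequences of 0/1 values for asymmetric binary variables.
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)   # positive matches
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    # negative matches (t) are ignored for asymmetric binary variables
    return (r + s) / (q + r + s)

def categorical_dissimilarity(a, b):
    # ratio of mismatches: d(i, j) = (p - m) / p
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p

i = [1, 0, 1, 0, 0, 0]           # e.g., fever = Y, cough = N, test-1 = P, ...
j = [1, 0, 1, 0, 1, 0]

d = asymmetric_binary_dissimilarity(i, j)
print(d, 1 - d)                  # dissimilarity and the Jaccard similarity
print(categorical_dissimilarity(["red", "blue"], ["red", "green"]))   # 0.5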
4. Variables of Mixed Types: A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval-scaled, and ratio-scaled. Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

d(i, j) = ( Σf=1..p δij(f) dij(f) ) / ( Σf=1..p δij(f) )

where the indicator δij(f) = 0 if either xif or xjf is missing, or if xif = xjf = 0 and variable f is asymmetric binary; otherwise δij(f) = 1. The contribution dij(f) of variable f is computed according to its type, using the measures described above for interval-scaled, binary, categorical, ordinal, and ratio-scaled variables.

Example: For the data of Table-I, the per-variable dissimilarities computed above for test-1, test-2, and test-3 can be combined in this way to obtain a single dissimilarity matrix for the data described by the three variables of mixed type.

5. Vector Objects: Vector objects arise with keywords in documents, gene features in micro-arrays, and so on. Applications include information retrieval, biological taxonomy, and others. There are several ways to define a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure:

s(x, y) = (x^t . y) / (||x|| ||y||)

where x^t is the transpose of vector x, ||x|| is the Euclidean norm of vector x, ||y|| is the Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and y.

Example (nonmetric similarity between two objects using the cosine measure): Given two vectors x = (1, 1, 0, 0) and y = (0, 1, 1, 0), the similarity between x and y is

s(x, y) = 1 / (sqrt(2) * sqrt(2)) = 0.5

A simple variation of the above measure is

s(x, y) = (x^t . y) / (x^t . x + y^t . y - x^t . y)

which is the ratio of the number of attributes shared by x and y to the number of attributes possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and biological taxonomy.
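The cosine and Tanimoto measures for the vectors in the example above can be reproduced with a few lines of NumPy; this is just an illustrative sketch of the two formulas.

import numpy as np

def cosine_similarity(x, y):
    # s(x, y) = x.y / (||x|| ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def tanimoto_coefficient(x, y):
    # s(x, y) = x.y / (x.x + y.y - x.y)
    xy = np.dot(x, y)
    return xy / (np.dot(x, x) + np.dot(y, y) - xy)

x = np.array([1, 1, 0, 0])
y = np.array([0, 1, 1, 0])

print(cosine_similarity(x, y))      # 0.5
print(tanimoto_coefficient(x, y))   # 1/3, about 0.33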
A CATEGORIZATION OF MAJOR CLUSTERING METHODS

The major clustering methods can be classified into the following categories.

Partitioning methods: Construct various partitions and then evaluate them by some criterion, for example, minimizing the sum of squared errors. Typical methods are k-means, k-medoids, and CLARANS.

Hierarchical methods: Create a hierarchical decomposition of the set of data (or objects) using some criterion. These methods can be classified as follows.
Agglomerative approach (bottom-up): starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds.
Divisive approach (top-down): starts with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster, or until a termination condition holds.
Typical methods are BIRCH, ROCK, Chameleon, and so on.

Density-based methods: Based on connectivity and density functions. Typical methods are DBSCAN, OPTICS, DENCLUE, etc.

Grid-based methods: Based on a multiple-level granularity structure. Typical methods are STING, WaveCluster, CLIQUE, etc.

Model-based methods: A model is hypothesized for each of the clusters, and the method finds the best fit of the data to the given model. Typical methods are EM, SOM, and COBWEB.

Classes of clustering tasks are as follows.
Clustering high-dimensional data is a particularly important task in cluster analysis, because many applications require the analysis of objects containing a large number of features or dimensions.
Frequent pattern-based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently.
Constraint-based clustering is a clustering approach that performs clustering by incorporating user-specified or application-oriented constraints. It focuses mainly on spatial clustering and semi-supervised clustering.

PARTITIONING METHODS

Partitioning methods construct various partitions and then evaluate them by some criterion. They generally find sphere-shaped clusters.

1. Classical Partitioning Methods: The most commonly used are the k-means and k-medoids methods.

Centroid-Based Technique: The k-Means Method
Each cluster is represented by the center (mean) of the cluster. The method minimizes the sum of squared distances, the square-error criterion:

E = Σi=1..k Σp∈Ci |p - mi|^2

where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci (both p and mi are multidimensional).

The k-means method is efficient for large data sets but sensitive to outliers. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster, works as follows. Given k, the k-means algorithm is implemented in four steps (a sketch of the algorithm follows this section):
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no more reassignments occur.

The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation. The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. The method often terminates at a local optimum.

Weaknesses of k-means:
Applicable only when the mean is defined, i.e., not for categorical data.
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.

The EM (Expectation-Maximization) algorithm extends the k-means paradigm in a different way. Whereas the k-means algorithm assigns each object to one cluster, in EM each object is assigned to each cluster according to a weight representing its probability of membership. In other words, there are no strict boundaries between clusters.

The k-means method can be extended to handle categorical data (the k-modes method) by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects, and using a frequency-based method to update the modes of clusters. For a mixture of categorical and numerical data, the k-prototypes method is used. The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of the data.
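The four steps of the k-means algorithm can be sketched directly in Python. This is a minimal NumPy illustration of the iterative-relocation idea (the random data and k = 2 are chosen only for the example), not a production implementation.

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k objects as the initial cluster centers (seed points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each object to the cluster with the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each center as the mean of the objects assigned to it
        # (for simplicity this sketch assumes no cluster becomes empty).
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 4: stop when no center (and hence no assignment) changes.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centers = k_means(X, k=2)
print(centers)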
Representative Object-Based Technique: The k-Medoids Method
Instead of taking the mean value of the objects in a cluster as a reference point, we pick actual objects to represent the clusters, using one representative object per cluster. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

E = Σj=1..k Σp∈Cj |p - oj|

where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object (medoid) of Cj.

To decide whether a nonrepresentative object, orandom, should replace a current medoid, oj, the cost of the replacement is examined for every nonmedoid object p. Four cases arise for the cost function of k-medoids clustering:
Case 1: p currently belongs to medoid oj. If oj is replaced by orandom and p is closest to one of the other medoids, oi, then p is reassigned to oi.
Case 2: p currently belongs to medoid oj. If oj is replaced by orandom and p is closest to orandom, then p is reassigned to orandom.
Case 3: p currently belongs to another medoid, oi (i ≠ j). If oj is replaced by orandom and p is still closest to oi, then the assignment does not change.
Case 4: p currently belongs to another medoid, oi (i ≠ j). If oj is replaced by orandom and p is closest to orandom, then p is reassigned to orandom.

PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms; it performs the partitioning based on medoids, or central objects. The complexity of each iteration is O(k(n - k)^2). The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.

Advantage of the k-medoids method: it is more robust than k-means in the presence of noise and outliers.
Disadvantages of k-medoids: it is more costly than the k-means method; like k-means, it requires the user to specify k; and it does not scale well for large data sets.

2. Partitioning Methods in Large Databases: From k-Medoids to CLARANS

CLARA: CLARA (Clustering LARge Applications) uses a sampling-based method to deal with large data sets. A random sample should closely represent the original data, so the medoids chosen from the sample will likely be similar to those that would have been chosen from the whole data set. CLARA draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering. The complexity of each iteration is O(ks^2 + k(n - k)), where s is the size of the sample, k is the number of clusters, and n is the number of objects. PAM finds the best k medoids among a given data set, whereas CLARA finds the best k medoids among the selected samples.
Problems with CLARA are as follows: the best k medoids may not be selected during the sampling process, in which case CLARA will never find the best clustering; if the sampling is biased, we cannot obtain a good clustering; and there is a trade-off between efficiency and clustering quality.

CLARANS: CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed to improve the quality and scalability of CLARA. It combines sampling techniques with PAM, but it does not confine itself to any one sample at a given time: it draws a sample with some randomness in each step of the search. CLARANS is more effective than both PAM and CLARA, and it handles outliers. Its disadvantages are that its computational complexity is about O(n^2), where n is the number of objects, and that the clustering quality depends on the sampling method.
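A brief sketch can show how the absolute-error criterion drives a PAM-style medoid swap: for each candidate swap of a medoid with a nonmedoid object, the total cost (the sum of distances of every object to its nearest medoid) is recomputed, and the swap is kept only if it lowers the cost. This is a simplified illustration of the idea, not the full PAM algorithm.

import numpy as np

def total_cost(X, medoid_idx):
    # Absolute-error criterion: sum of distances from each object to its nearest medoid.
    dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam_like(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for j in range(k):                      # try replacing each medoid ...
            for o in range(len(X)):             # ... with each nonmedoid object
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids = candidate         # keep the swap: it reduces E
                    improved = True
    return medoids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(6, 1, (15, 2))])
print(pam_like(X, k=2))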
HIERARCHICAL METHODS

A hierarchical clustering method works by grouping data objects into a tree of clusters. These methods are classified as follows:
Agglomerative and divisive hierarchical clustering
BIRCH
ROCK
Chameleon

1. Agglomerative and Divisive Hierarchical Clustering:
Agglomerative (bottom-up / merging): This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
Divisive (top-down / splitting): This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.

Example: AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, can be applied to a data set of five objects, {a, b, c, d, e}. Initially, AGNES places each object into a cluster of its own; the clusters are then merged step by step. In DIANA, all of the objects are used to form one initial cluster, which is then split step by step.

A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together step by step. In the dendrogram for the data objects {a, b, c, d, e}, level l = 0 shows the five objects as singleton clusters. At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels.

Four widely used measures for the distance between clusters are as follows (a sketch of the minimum-distance measure follows this section), where |p - p'| is the distance between two objects or points p and p', mi is the mean of cluster Ci, and ni is the number of objects in Ci:
Minimum distance: dmin(Ci, Cj) = min over p ∈ Ci, p' ∈ Cj of |p - p'|
Maximum distance: dmax(Ci, Cj) = max over p ∈ Ci, p' ∈ Cj of |p - p'|
Mean distance: dmean(Ci, Cj) = |mi - mj|
Average distance: davg(Ci, Cj) = (1 / (ni nj)) Σp∈Ci Σp'∈Cj |p - p'|

When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm. If the clustering process is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm. An agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm. When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance between clusters, it is sometimes called a farthest-neighbor clustering algorithm. If the clustering process is terminated when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.

Difficulties with hierarchical clustering are as follows: although it looks simple, difficulties are encountered in selecting merge or split points, and if merge or split decisions are not chosen well at some step, they may lead to low-quality clusters.
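The sketch below illustrates agglomerative clustering with the minimum-distance (single-linkage) measure: starting from singleton clusters, it repeatedly merges the two clusters with the smallest dmin until the desired number of clusters remains. Swapping min for max inside cluster_distance would give the complete-linkage variant. This is a simplified illustration, not AGNES itself.

import numpy as np
from itertools import combinations

def cluster_distance(X, ci, cj):
    # dmin(Ci, Cj): smallest pairwise distance between members of the two clusters.
    return min(np.linalg.norm(X[p] - X[q]) for p in ci for q in cj)

def single_linkage(X, num_clusters):
    clusters = [[i] for i in range(len(X))]     # level 0: each object is its own cluster
    while len(clusters) > num_clusters:
        # find the closest pair of clusters and merge them
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: cluster_distance(X, clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9]])
print(single_linkage(X, num_clusters=2))        # e.g., [[0, 1, 2], [3, 4]]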
2. BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
BIRCH is designed for clustering a large amount of numerical data by integrating hierarchical clustering (at the initial microclustering stage) and other clustering methods such as iterative partitioning (at the later macroclustering stage). It overcomes two difficulties of agglomerative clustering methods: (1) scalability and (2) the inability to undo what was done in a previous step.

BIRCH introduces two concepts, the clustering feature and the clustering feature tree (CF tree), which are used to summarize cluster representations. Given n d-dimensional data objects or points xi in a cluster, the centroid x0, radius R, and diameter D of the cluster are defined as

x0 = (Σi xi) / n
R = sqrt( Σi (xi - x0)^2 / n )
D = sqrt( Σi Σj (xi - xj)^2 / (n(n - 1)) )

where R is the average distance from member objects to the centroid and D is the average pairwise distance within the cluster. Both R and D reflect the tightness of the cluster around the centroid.

A clustering feature (CF) is a three-dimensional vector summarizing information about a cluster of objects. Given n d-dimensional objects or points {xi} in a cluster, the CF of the cluster is defined as

CF = <n, LS, SS>

where n is the number of points in the cluster, LS is the linear sum of the n points (Σi xi), and SS is the square sum of the n points (Σi xi^2).

Clustering features are additive. Example: Consider two disjoint clusters, C1 and C2, with clustering features CF1 and CF2. The clustering feature of the cluster formed by merging C1 and C2 is simply CF1 + CF2.

Example: Suppose that there are three points, (2, 5), (3, 2), and (4, 3), in a cluster C1. The clustering feature of C1 is

CF1 = <3, (2 + 3 + 4, 5 + 2 + 3), (2^2 + 3^2 + 4^2, 5^2 + 2^2 + 3^2)> = <3, (9, 10), (29, 38)>

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. A CF tree has two parameters: the branching factor B and the threshold T. The branching factor specifies the maximum number of children per nonleaf node, and the threshold parameter specifies the maximum diameter of the subclusters stored at the leaf nodes of the tree.

BIRCH applies a multiphase clustering technique, whose primary phases are:
Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of the data.
Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.
The computational complexity of the algorithm is O(n), where n is the number of objects to be clustered.

3. ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes
ROCK (RObust Clustering using linKs) is a hierarchical clustering algorithm that explores the concept of links (the number of common neighbors between two objects) for data with categorical attributes. ROCK takes a more global approach to clustering by considering the neighborhoods of individual pairs of points. If two similar points also have similar neighborhoods, then the two points likely belong to the same cluster and so can be merged.

Two points pi and pj are neighbors if sim(pi, pj) >= θ, where sim is a similarity function and θ is a user-specified threshold. In ROCK, the similarity between two "points" or transactions, Ti and Tj, is defined with the Jaccard coefficient as

sim(Ti, Tj) = |Ti ∩ Tj| / |Ti ∪ Tj|

Example: Suppose that a market basket database contains transactions regarding the items a, b, ..., g. Consider two clusters of transactions, C1 and C2. C1, which references the items {a, b, c, d, e}, contains the transactions {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}. C2 references the items {a, b, f, g} and contains the transactions {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}. The Jaccard coefficient between the transactions {a, b, c} and {b, d, e} of C1 is 1/5 = 0.2. In fact, the Jaccard coefficient between any pair of transactions in C1 ranges from 0.2 to 0.5 (e.g., {a, b, c} and {a, b, d}). However, the Jaccard coefficient between transactions belonging to different clusters may also reach 0.5 (e.g., {a, b, c} of C1 with {a, b, f} or {a, b, g} of C2), which is why ROCK relies on links (common neighbors) rather than on pairwise similarity alone.

The worst-case time complexity of ROCK is O(n^2 + n*mm*ma + n^2 log n), where mm and ma are the maximum and average number of neighbors, respectively, and n is the number of objects.
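ROCK's neighbor definition can be checked with a short sketch that computes the Jaccard coefficient between the market-basket transactions above and counts common neighbors (links) for a chosen threshold θ; the threshold value 0.5 used here is only an illustrative choice.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

transactions = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "e"}, {"a", "c", "d"},
                {"a", "c", "e"}, {"a", "d", "e"}, {"b", "c", "d"}, {"b", "c", "e"},
                {"b", "d", "e"}, {"c", "d", "e"},                                    # cluster C1
                {"a", "b", "f"}, {"a", "b", "g"}, {"a", "f", "g"}, {"b", "f", "g"}]  # cluster C2

print(jaccard({"a", "b", "c"}, {"b", "d", "e"}))   # 1/5 = 0.2
print(jaccard({"a", "b", "c"}, {"a", "b", "d"}))   # 2/4 = 0.5

theta = 0.5                                        # illustrative similarity threshold
neighbors = [{j for j in range(len(transactions))
              if j != i and jaccard(transactions[i], transactions[j]) >= theta}
             for i in range(len(transactions))]

def link(i, j):
    # link(Ti, Tj) = number of common neighbors of Ti and Tj
    return len(neighbors[i] & neighbors[j])

print(link(0, 1))   # number of links between {a, b, c} and {a, b, d}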
4. Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters. It was derived based on the observed weaknesses of two hierarchical clustering algorithms, ROCK and CURE: CURE ignores information about the interconnectivity of the objects, while ROCK ignores information about the closeness of two clusters. In Chameleon, two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.

Chameleon is a two-phase algorithm:
1. Use a graph-partitioning algorithm to cluster the objects into a large number of relatively small subclusters.
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these subclusters.

Chameleon thus performs hierarchical clustering based on the k-nearest-neighbor graph and dynamic modeling. The graph-partitioning algorithm partitions the k-nearest-neighbor graph such that it minimizes the edge cut. That is, a cluster C is partitioned into subclusters Ci and Cj so as to minimize the weight of the edges that would be cut should C be bisected into Ci and Cj. The edge cut is denoted EC(Ci, Cj) and assesses the absolute interconnectivity between clusters Ci and Cj.

Chameleon determines the similarity between each pair of clusters Ci and Cj according to their relative interconnectivity, RI(Ci, Cj), and their relative closeness, RC(Ci, Cj):
The relative interconnectivity, RI(Ci, Cj), between two clusters Ci and Cj is the absolute interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two clusters.
The relative closeness, RC(Ci, Cj), between a pair of clusters Ci and Cj is the absolute closeness between Ci and Cj, normalized with respect to the internal closeness of the two clusters.
The processing cost for high-dimensional data may require O(n^2) time for n objects in the worst case.

DENSITY-BASED METHODS

Density-based methods cluster on the basis of density (a local cluster criterion), such as density-connected points. Their major features are as follows:
Discover clusters of arbitrary shape.
Handle noise.
Require one scan of the data.
Need density parameters as a termination condition.

Density-based clustering algorithms include the following:
DBSCAN grows clusters according to a density-based connectivity analysis.
OPTICS extends DBSCAN to produce a cluster ordering obtained from a wide range of parameter settings.
DENCLUE clusters objects based on a set of density distribution functions.

1. DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points. The basic ideas of density-based clustering involve a number of definitions:
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Given a set of objects D, an object p is directly density-reachable from object q if p is within the ε-neighborhood of q and q is a core object.
An object p is density-reachable from object q with respect to ε and MinPts in a set of objects D if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for 1 <= i <= n - 1, pi ∈ D.
An object p is density-connected to object q with respect to ε and MinPts in a set of objects D if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.
Density reachability is the transitive closure of direct density reachability, and this relationship is asymmetric. Density connectivity, however, is a symmetric relation.

Example: Consider a set of points for a given ε, represented by the radius of the circles around each point, and let MinPts = 3. Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighborhood containing at least three points. q is directly density-reachable from m, and m is directly density-reachable from p and vice versa. q is (indirectly) density-reachable from p, because q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q, because q is not a core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r. Therefore o, r, and s are all density-connected.

A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability. Every object not contained in any cluster is considered to be noise. DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merging of a few density-reachable clusters. The process terminates when no new point can be added to any cluster. If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects; otherwise, it is O(n^2).
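The core-object test and the cluster-growing step of DBSCAN can be sketched compactly. The following Python illustration is a simplification that scans neighborhoods naively in O(n^2) rather than using a spatial index; it labels each point with a cluster id or marks it as noise.

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = [None] * n                    # None = unvisited, -1 = noise, 0.. = cluster id
    def region(i):                         # indices of points in the eps-neighborhood of i
        return [j for j in range(n) if np.linalg.norm(X[i] - X[j]) <= eps]
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:       # not a core object
            labels[i] = -1                 # tentatively noise
            continue
        labels[i] = cluster_id             # start a new cluster from core object i
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] in (None, -1):
                labels[j] = cluster_id
                j_neighbors = region(j)
                if len(j_neighbors) >= min_pts:   # j is also a core object:
                    queue.extend(j_neighbors)     # grow the cluster through it
        cluster_id += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2)), [[10.0, 10.0]]])
print(dbscan(X, eps=1.0, min_pts=3))       # two clusters plus one noise point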
2. OPTICS: Ordering Points To Identify the Clustering Structure
OPTICS computes an ordering of the points that identifies the clustering structure. It produces a special ordering of the database with respect to its density-based clustering structure. This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure, and it can be represented graphically or with visualization techniques.

OPTICS stores two values for each object:
The core-distance of an object p is the smallest ε' value that makes p a core object. If p is not a core object, the core-distance of p is undefined.
The reachability-distance of an object q with respect to another object p is the greater of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined.

Example: Suppose ε = 6 mm and MinPts = 5. The core-distance of p is the distance ε' between p and its fourth-closest data object. The reachability-distance of q1 with respect to p is the core-distance of p (i.e., ε' = 3 mm), because this is greater than the Euclidean distance from p to q1. The reachability-distance of q2 with respect to p is the Euclidean distance from p to q2, because this is greater than the core-distance of p.

The cluster ordering of a data set can be represented graphically as a reachability plot, which presents a general overview of how the data are structured and clustered: the data objects are plotted in cluster order (horizontal axis) together with their respective reachability-distances (vertical axis). OPTICS has a runtime complexity of O(n log n) if a spatial index is used, where n is the number of objects.
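The two per-object values that OPTICS maintains are easy to compute directly. The sketch below is a minimal illustration using brute-force distances, not the full OPTICS ordering algorithm; it assumes an object counts itself within its own ε-neighborhood, matching the "fourth-closest object for MinPts = 5" convention in the example above.

import numpy as np

def core_distance(X, p, eps, min_pts):
    # Smallest radius eps' <= eps that puts at least min_pts objects in p's neighborhood;
    # returns None (undefined) if p is not a core object for the given eps.
    dists = np.sort(np.linalg.norm(X - X[p], axis=1))   # includes distance 0 to p itself
    return dists[min_pts - 1] if dists[min_pts - 1] <= eps else None

def reachability_distance(X, q, p, eps, min_pts):
    # max(core-distance of p, Euclidean distance from p to q); undefined if p is not core.
    cd = core_distance(X, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, np.linalg.norm(X[p] - X[q]))

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (30, 2))
print(core_distance(X, p=0, eps=2.0, min_pts=5))
print(reachability_distance(X, q=1, p=0, eps=2.0, min_pts=5))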
3. DENCLUE: Clustering Based on Density Distribution Functions
DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of density distribution functions. The method is built on the following ideas: (1) the influence of each data point can be formally modeled using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood; (2) the overall density of the data space can be modeled analytically as the sum of the influence functions applied to all data points; and (3) clusters can then be determined mathematically by identifying density attractors, where density attractors are local maxima of the overall density function.

Let x and y be objects or points in F^d, a d-dimensional input space. The influence function of data object y on x is a function fB^y(x) = fB(x, y), defined in terms of a basic influence function fB. For example, fB can be a square wave influence function,

fSquare(x, y) = 0 if d(x, y) > σ, and 1 otherwise,

or a Gaussian influence function,

fGauss(x, y) = e^(-d(x, y)^2 / (2σ^2))

The density function at an object or point x is defined as the sum of the influence functions of all data points. Given n data objects x1, ..., xn, the density function at x is

fB^D(x) = fB^x1(x) + fB^x2(x) + ... + fB^xn(x)

A 2-D data set can be visualized together with the corresponding overall density function for a square wave and for a Gaussian influence function.

Advantages of DENCLUE are as follows:
It has a solid mathematical foundation.
It is good for data sets with large amounts of noise.
It allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets.
It is significantly faster than existing algorithms (e.g., DBSCAN).
However, it needs a large number of parameters.

MODEL-BASED CLUSTERING METHODS

Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Three examples of model-based clustering are as follows:
Expectation-Maximization: an extension of the k-means partitioning algorithm.
Conceptual clustering.
The neural network approach.

1. Expectation-Maximization:
EM is a popular iterative refinement algorithm and an extension of k-means. Each object is assigned to a cluster according to a weight (probability distribution), and new means are computed based on these weighted measures. The process works as follows:
Start with an initial estimate of the parameter vector.
Iteratively rescore the patterns against the mixture density produced by the parameter vector.
Use the rescored patterns to update the parameter estimates.
The patterns belonging to the same cluster are those placed by their scores in the same component.

The EM algorithm is as follows:
Initially, randomly assign k cluster centers.
Iteratively refine the clusters based on two steps:
Expectation step: Assign each object xi to cluster Ck with a probability that follows the normal (i.e., Gaussian) distribution around the mean mk, with expectation Ek.
Maximization step: Re-estimate the model parameters using the probability-weighted assignments; this step is the "maximization" of the likelihood of the distributions given the data.
The computational complexity is linear in d (the number of input features), n (the number of objects), and t (the number of iterations).
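The two EM steps can be sketched for a simple one-dimensional Gaussian mixture with equal priors and a fixed variance; these simplifying assumptions (and the made-up sample data) are mine, chosen only to keep the illustration short.

import numpy as np

def em_1d(x, k, iters=50, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)          # random initial cluster centers
    for _ in range(iters):
        # Expectation step: weight of each object for each cluster, using a Gaussian
        # density around each mean (equal priors and fixed sigma for simplicity).
        dens = np.exp(-(x[:, None] - means[None, :]) ** 2 / (2 * sigma ** 2))
        weights = dens / dens.sum(axis=1, keepdims=True)  # P(object i belongs to cluster k)
        # Maximization step: re-estimate each mean as the weighted average of all objects.
        means = (weights * x[:, None]).sum(axis=0) / weights.sum(axis=0)
    return means

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])
print(em_1d(x, k=2))   # the two estimated means, roughly 0 and 6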
Important questions are as follows:
1. a. Discuss the categorization of the major clustering methods.
   b. Explain in detail partitioning methods and hierarchical methods.
2. a. What is an outlier? Explain outliers with a suitable example.
   b. Write short notes on partitioning and model-based clustering.
3. a. Explain density-based clustering.
   b. Explain outlier analysis.
4. a. Briefly outline how to compute the dissimilarity between objects described by the following types of variables:
      i. Numerical (interval-scaled) variables
      ii. Asymmetric binary variables
      iii. Categorical variables
   b. Why is outlier mining important? Briefly describe the different approaches behind statistical-based outlier detection, distance-based outlier detection, density-based local outlier detection, and deviation-based outlier detection.
5. a. Explain the different categories of clustering methods.
   b. Explain agglomerative and divisive hierarchical clustering.
6. What are model-based clustering methods? Explain any two model-based clustering methods in detail.
7. What are grid-based methods? Explain any two grid-based methods in detail.