GUJARAT UNIVERSITY
M.Sc. Applied Mathematical Science, Semester III
Name: PRAJAPATI RUTVIK
Roll No.: 22
Subject Code: AMS – 506
Subject Name: Research Methodology and Multivariate Stat
Topic: Cluster Analysis (Overview)

OVERVIEW
• What is Cluster Analysis?
• Why Cluster Analysis?
• Assumptions
• How does Cluster Analysis work?
• Clustering Methods
• Applications

What is Cluster Analysis?
• Cluster analysis is a multivariate data-mining technique whose goal is to group objects based on a set of user-selected characteristics.
• When plotted geometrically, objects within a cluster should be very close together, and the clusters themselves should be far apart.
• Clusters should exhibit high internal homogeneity and high external heterogeneity.

Why Cluster Analysis?
Cluster analysis is a major tool in a number of applications across business, engineering and other fields:
• Data reduction: convert large data sets into a manageable number of meaningful groups.
• Hypothesis generation: develop hypotheses about the nature of the data, or examine previously stated hypotheses.
• Hypothesis testing: a systematic procedure for deciding whether the results of a research study support a particular theory that applies to a population.
• Prediction based on groups: analyse future options on the basis of group membership.

Assumptions
• Sample size: the sample must be large enough to represent the population and its underlying structure, particularly any small groups within the population.
• Outliers: outliers can severely distort the results if they appear as structures (clusters) that are inconsistent with the research objectives.
• Representativeness of the sample: the sample must represent the population addressed by the research question.
• Impact of multicollinearity: input variables should be examined for substantial multicollinearity and, if it is present, the variables should be reduced to equal numbers in each set of correlated measures.

How does Cluster Analysis work?
• The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups. To accomplish this, we must address three basic questions:
  i. How do we measure similarity?
  ii. How do we form clusters?
  iii. How many groups do we form?

Measuring Similarity
• Similarity represents the degree of correspondence among objects across all the characteristics used in the analysis. It provides a set of rules that serve as criteria for grouping or separating items.
• Correlation measure: less frequently used; large values of r indicate similarity.
• Distance measure: most often used; higher values represent greater dissimilarity (distance between cases), not similarity. Common distance measures (illustrated in the code sketch below) are:
  i. Euclidean distance
  ii. Squared Euclidean distance
  iii. City-block (Manhattan) distance
  iv. Chebyshev distance
  v. Mahalanobis distance (D²)

How do we form Clusters?
• Identify the two most similar (closest) observations not already in the same cluster and combine them.
• We apply this rule repeatedly, starting with each observation as its own cluster and then combining two clusters at a time until all observations are in a single cluster. This process is termed a hierarchical procedure because it moves in a stepwise fashion to form an entire range of cluster solutions. It is also an agglomerative method because clusters are formed by combining existing clusters.
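The distance measures listed above can be computed directly. The following is a minimal sketch in Python using NumPy and SciPy; the choice of library and the sample values are illustrative assumptions, not part of the original slides.

# A minimal sketch of the distance measures listed above (library choice and
# sample data are assumptions; the slides do not prescribe a tool).
import numpy as np
from scipy.spatial import distance

# Two hypothetical observations measured on three variables
x = np.array([2.0, 4.0, 3.0])
y = np.array([5.0, 1.0, 2.0])

euclidean = distance.euclidean(x, y)            # sqrt(sum((x - y)^2))
squared_euclidean = distance.sqeuclidean(x, y)  # sum((x - y)^2)
manhattan = distance.cityblock(x, y)            # sum(|x - y|)
chebyshev = distance.chebyshev(x, y)            # max(|x - y|)

# Mahalanobis distance needs the inverse covariance matrix of the data;
# here it is estimated from a small hypothetical sample.
sample = np.array([[2.0, 4.0, 3.0],
                   [5.0, 1.0, 2.0],
                   [4.0, 3.0, 5.0],
                   [1.0, 2.0, 1.0]])
inv_cov = np.linalg.inv(np.cov(sample, rowvar=False))
mahalanobis = distance.mahalanobis(x, y, inv_cov)

print(euclidean, squared_euclidean, manhattan, chebyshev, mahalanobis)

Whichever measure is chosen, it is applied to every pair of observations, and the resulting distance (or similarity) matrix is what the clustering procedure actually works on.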
How do we form a group?
[Figure: nested groupings of observations and the corresponding dendrogram]

Methods of Clustering
• There are basically three types of methods that can be used to carry out a cluster analysis:
  i. Hierarchical Cluster Analysis
  ii. Non-hierarchical Cluster Analysis
  iii. Two-Step Cluster Analysis

Hierarchical Cluster Analysis
• This stepwise procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using either an agglomerative or a divisive algorithm, and results in a hierarchy or tree-like structure (dendrogram) depicting the formation of the clusters. It is one of the most straightforward methods.
• Hierarchical cluster analysis provides an excellent framework within which to compare any set of cluster solutions.
• It helps in judging how many clusters should be retained or considered.

Two Basic Types of HCA
• Agglomerative algorithm: begins with each object or observation in a separate cluster. In each subsequent step, the two clusters that are most similar are combined to build a new aggregate cluster. This is repeated until all points are combined into one cluster. Similarity decreases during successive steps, and clusters cannot be split once formed.
• Divisive algorithm: begins with all objects in a single cluster, which is then divided at each step into two clusters containing the most dissimilar objects. The single cluster is first divided into two clusters, then one of these is split, for a total of three clusters. This continues until every observation is a single-member cluster, i.e. from 1 cluster to n sub-clusters.

Non-hierarchical Cluster Analysis
• In contrast to hierarchical methods, non-hierarchical cluster analysis does not involve a tree-like construction process. Instead, objects are assigned to clusters once the number of clusters is specified.
• Two steps in non-hierarchical cluster analysis:
  i. Specify cluster seeds: identify the starting points.
  ii. Assignment: assign each observation to one of the cluster seeds.
• Non-hierarchical clustering algorithms include the sequential threshold method, the parallel threshold method and optimizing procedures. All of these belong to the group of clustering algorithms known as K-means (contrasted with the hierarchical approach in the code sketch below).

Two-Step Cluster Analysis
• The two-step cluster analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:
• Handling of categorical and continuous variables: by assuming the variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.
• Automatic selection of the number of clusters: by comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.
• Scalability: by constructing a cluster features (CF) tree that summarizes the records, the two-step algorithm allows you to analyse large data files.
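To make the contrast between the first two families of methods concrete, the following is a minimal sketch in Python, assuming SciPy for agglomerative (Ward) hierarchical clustering and scikit-learn for non-hierarchical K-means; the data, library choices and parameter values are illustrative assumptions, not part of the original slides.

# A minimal sketch contrasting hierarchical (agglomerative) clustering with
# non-hierarchical K-means on the same small, hypothetical data set.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Hypothetical observations on two variables
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # a tight group
              [8.0, 8.0], [8.3, 7.6], [7.8, 8.4],   # a second group
              [4.0, 5.0]])                           # an in-between point

# Agglomerative hierarchical clustering: start with each observation as its
# own cluster and repeatedly merge the two closest clusters (Ward linkage
# here, one of several possible linkage rules).
Z = linkage(X, method="ward")                         # full merge history (dendrogram data)
hier_labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters

# Non-hierarchical K-means: the number of clusters (seeds) is specified up
# front and observations are assigned to the nearest cluster centre.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("hierarchical:", hier_labels)
print("k-means:     ", km.labels_)

The hierarchical result retains the entire merge history (Z), so solutions with any number of clusters can be compared from one run, whereas K-means must be rerun if a different number of clusters is wanted.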
Real Life Applications
• Market segmentation: group people (with the willingness, purchasing power and authority to buy) according to their similarity in several dimensions related to the product under consideration.
• Sales segmentation: clustering can reveal which types of customers buy which products.
• Credit risk: segment customers based on their credit history.
• Operations: high-performer segmentation and promotions based on each person's performance.
• Insurance: identifying groups of motor insurance policyholders with a high average claim cost.
• City planning: identifying groups of houses according to their house type, value and geographical location.
• Geography: identification of areas of similar land use in an earth-observation database.

THANK YOU !!!