1. What is Cluster Analysis? (Introduction)

Cluster analysis is a technique for classifying data in which data elements are partitioned into groups called clusters: collections of data elements that are close to one another according to a distance or dissimilarity function. [1] It is an important tool in decision making and an effective creativity technique for generating ideas and obtaining solutions.

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of a similar kind into categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, how to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool that aims at sorting different objects into groups so that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. [2]

Cluster analysis is mainly a discovery tool: it often surfaces perceived problem areas, concerns or items that naturally belong together. Cluster analysis aims at [3]:
- classifying data into natural groupings on the basis of similar or related characteristics,
- identifying the most important characteristics to be considered in developing a problem specification,
- developing a more homogeneous group of items from a large list of dissimilar items,
- identifying differences among customer, employee or supplier groups with regard to quality perception and performance issues.

2. How is it implemented?

Types of clustering

Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once.
For example, in designing an effective menu system for in-vehicle mobile multimedia systems, one alternative is a hierarchical menu implemented on an integrated display/control unit such as a multi-function display (MFD). [4] In a top-down approach, the designer first identifies first-order (macro) categories, which are repeatedly divided into progressively smaller subcategories until menu items are represented at their lowest level. This approach is conceptually driven in that items are discriminated along categorical boundaries and conceptual dimensions. The top-down approach emphasizes the differences between functions rather than their similarities.

Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. [5]

Distance Measures

The joining or tree clustering method above uses the dissimilarities (or similarities), that is, the distances between objects, when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items. An important step in any clustering is to select a distance measure, which determines how the similarity of two elements is calculated. These distances can be based on a single dimension or on multiple dimensions, with each dimension representing a rule or condition for grouping objects. The choice influences the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. [2]

Common distance functions:
- The Euclidean distance
- The Mahalanobis distance [8]
- The Manhattan distance [9]
- The Hamming distance [10]
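
The distance functions and the agglomerative ("bottom-up") procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation taken from any of the cited sources: the function names, the single-linkage merge rule and the three sample points are all assumptions chosen for the example. The Mahalanobis distance is left out because it also needs an estimate of the covariance matrix of the data.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def hamming(p, q):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(p, q))


def agglomerate(points, distance, n_clusters):
    """Naive single-linkage agglomerative clustering (illustrative only).

    Starts with every point in its own cluster and repeatedly merges the
    two clusters whose closest members are nearest to each other, until
    only n_clusters clusters remain.
    """
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters (i < j) with the smallest
        # member-to-member distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                distance(a, b)
                for a in clusters[ij[0]]
                for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical sample points, chosen so that the two metrics disagree.
points = [(0, 0), (3, 3), (-5, 0)]

print(agglomerate(points, euclidean, 2))  # [[(0, 0), (3, 3)], [(-5, 0)]]
print(agglomerate(points, manhattan, 2))  # [[(0, 0), (-5, 0)], [(3, 3)]]
print(hamming("karolin", "kathrin"))      # 3
```

With these sample points the Euclidean distance merges (0, 0) with (3, 3), while the Manhattan distance merges (0, 0) with (-5, 0), which shows how the choice of distance measure changes the resulting clusters.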
Picture 1: The Euclidean distance
Source: http://www1.uni-hamburg.de/RRZ/Software/Statistica/Handbuch/stcluan.html#d

Types of data in cluster analysis:
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

The cluster analysis tool is best used after a brainstorming session, to organize data by subdividing different ideas, items or characteristics into relatively similar groups, each under a topical heading.

Consider a Horizontal Hierarchical Tree Plot (see graph below). On the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold for declaring two or more objects to be members of the same cluster. As a result, more and more objects are linked together into larger clusters of increasingly dissimilar elements until, in the last step, all objects are joined. The following tree diagram classifies 22 different car models and shows their linkage (connection) using the "Euclidean distance", which compares the cars according to certain characteristics (e.g., fuel consumption, cost, accessories).

3. What are the success factors? (Do / Do not)

Cluster analysis is not so much a typical statistical test as a "collection" of different algorithms that "put objects into clusters according to well defined similarity rules." Unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible." [2]

What Is Good Clustering?
High Quality:
- high intra-class similarity (similarity between objects belonging to the same class)
- low inter-class similarity (similarity between objects belonging to different classes)

Depends on:
- the similarity measure (how similar two or more objects are)
- the algorithm for searching
- the ability to discover hidden patterns

Clustering may not be the best way to discover interesting groups in a data set. Often visualisation methods work well, allowing the human expert to identify useful groups. However, as data set sizes increase to millions of entities, this becomes impractical, and clusters help to partition the data so that we can deal with smaller groups. Different algorithms deliver different clusterings [6].

4. Case study

Title: An Analysis of Industrial Clusters in Burnaby [7].

This case study presents the findings of an analysis of industrial clusters in the City of Burnaby, in the Greater Vancouver area of British Columbia, Canada, and draws conclusions based on this analysis. Specifically, Figure 1 shows both the ratio of GVRD (Greater Vancouver Regional District) employment compared to Canada as a whole and that of Burnaby compared to Canada. The Y-axis value in Figure 1 is the ratio of the percentage of employment within a specific industrial category for the GVRD or Burnaby to the national average. Not all industrial codes are shown. The figures show, as expected, that the GVRD and Burnaby have much smaller than average labour forces in agriculture and mining, and smaller than average employment in manufacturing and public administration. If either the GVRD or Burnaby has a ratio greater than one, it indicates that it has some competitive advantage compared to Canada as a whole.

FIGURE 1

What is perhaps more important is to look at those areas where Burnaby has a distinct advantage over the rest of the GVRD. There are four such areas: utilities, construction, wholesale, and information and culture. A similar type of analysis can be done by occupational codes.
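
The ratio plotted in the figures of this case study can be sketched as a short calculation. The function name and the employment counts below are hypothetical and are not taken from the Burnaby study; they only illustrate how a regional employment share is compared with the national share. A ratio of this kind is commonly called a location quotient.

```python
def employment_ratio(region_sector, region_total, nation_sector, nation_total):
    """Ratio of a region's employment share in one sector to the national share.

    A value above 1 means the sector employs a larger share of the workforce
    in the region than in the country as a whole.
    """
    region_share = region_sector / region_total
    nation_share = nation_sector / nation_total
    return region_share / nation_share


# Hypothetical counts for illustration only (NOT data from the study):
# 6,000 of 100,000 regional workers and 400,000 of 10,000,000 national
# workers are employed in the sector.
ratio = employment_ratio(6_000, 100_000, 400_000, 10_000_000)
print(round(ratio, 2))  # 1.5: regional share 6% against national share 4%
```

In this made-up case the region's share of employment in the sector is one and a half times the national share, which the case study would read as a competitive advantage.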
Figure 2 shows the ratios of the percentages of total employment in the GVRD and Burnaby in a particular aggregation of occupational codes compared to Canada. Neither Burnaby nor the GVRD has higher than average numbers of workers in primary and manufacturing occupations. The GVRD overall has average numbers of individuals in health and education occupations, but Burnaby falls behind the GVRD (probably due to the preponderance of health and educational facilities outside Burnaby). Burnaby has slight advantages in management and business. While it does not have as great an advantage in the arts as the GVRD as a whole, its advantage over the rest of Canada is still significant and bears further study.

FIGURE 2

5. List of References

Web sites:
[1] http://mathworld.wolfram.com/ClusterAnalysis.html
[2] http://www.statsoft.com/textbook/stcluan.html
[3] http://datamining.anu.edu.au/student/math3346_2006/algintro-2x3.pdf
[4] Mona L. Toms, Mark A. Cummings-Hill and David G. Curry (Delphi Delco Electronics Systems), Scott M. Cone (Veridian Engineering): Using Cluster Analysis for Deriving Menu Structures for Automotive Mobile Multimedia Applications, SAE 2001 World Congress, Detroit, Michigan, March 5-8, 2001
[5] en.wikipedia.org/wiki/Data_clustering
[6] datamining.anu.edu.au/student/math3346_2006/clusters-2x3.pdf
[7] www.sfu.ca/cprost/publications.htm
[8] http://mrw.interscience.wiley.com/emrw/9780470011812/eob/article/b2a13038/current/abstract
[9] http://www.camo.com/rt/Resources/Clustering.html
[10] http://en.wikipedia.org/wiki/Hamming_distance

6. Glossary

Algorithm: As opposed to heuristics (which contain general recommendations based on statistical evidence or theoretical reasoning), algorithms are completely defined, finite sets of steps, operations, or procedures that will produce a particular outcome. For example, with a few exceptions, all computer programs, mathematical formulas, and (ideally) medical and food recipes are algorithms.

Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Cluster analysis: grouping a set of data objects into clusters. Clustering is unsupervised classification: there are no predefined classes (descriptive data mining).

Heuristics: general recommendations or guides based on statistical evidence or theoretical reasoning.

7. Keywords

Cluster analysis, Types of clustering, Cluster algorithms, Cluster distances, Creativity technique, Creative.

8. Questions

1) What is cluster analysis and what is it used for?
2) Which are the factors used in a tree clustering method?
3) Which are the methods (codes) used in clustering the industries in Burnaby?
4) High Quality similarities in good Clustering are (please insert T for true and F for false):
a. high intra-class similarity
b. high inter-class similarity
c. low intra-class similarity
d. low inter-class similarity
Answer: a. T b. F c. F d. T