Cluster Analysis

1. What is Cluster Analysis? (Introduction)
Cluster analysis is a technique used for classification of data in which data elements
are partitioned into groups called clusters that represent collections of data elements
that are proximate based on a distance or dissimilarity function. [1] The cluster
analysis approach is an important tool in decision making and an effective creativity
technique in generating ideas and obtaining solutions.
The term cluster analysis (first used by Tryon, 1939) encompasses a number of
different algorithms and methods for grouping objects of similar kind into respective
categories. A general question facing researchers in many areas of inquiry is how to
organize observed data into meaningful structures, that is, to develop taxonomies. In
other words cluster analysis is an exploratory data analysis tool which aims at sorting
different objects into groups in a way that the degree of association between two
objects is maximal if they belong to the same group and minimal otherwise. [2]
Cluster analysis is mainly a discovery tool: it often surfaces perceived problem areas,
concerns or items that naturally belong together.
Cluster analysis aims at [3]:
- classifying data into natural groupings on the basis of similar or related
characteristics,
- identifying the most important characteristics to be considered in developing a
problem specification,
- developing a more homogeneous group of items from a large list of dissimilar
items,
- identifying differences among customer, employee or supplier groups with
regard to quality perception and performance issues.
2. How is it implemented?
Types of clustering
Data clustering algorithms can be hierarchical or partitional. Hierarchical
algorithms find successive clusters using previously established clusters, whereas
partitional algorithms determine all clusters at once. For example, when designing an
effective menu system for in-vehicle mobile multimedia applications, one alternative
is a hierarchical menu implemented on an integrated display/control unit such as a
multi-function display (MFD) [4]. In a top-down
approach, the designer identifies first-order (macro) categories that are repeatedly
divided into progressively smaller subcategories until menu items are represented at
their lowest level. This approach is conceptually driven in that items are discriminated
along categorical boundaries and conceptual dimensions. The top-down approach
emphasizes the differences between functions rather than their similarities.
Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down").
Agglomerative algorithms begin with each element as a separate cluster and merge
them into successively larger clusters. Divisive algorithms begin with the whole set
and proceed to divide it into successively smaller clusters. [5]
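The agglomerative ("bottom-up") procedure can be sketched in a few lines of Python.
This is a minimal illustration, not an implementation taken from the cited sources.
It assumes Euclidean distance and single linkage (the distance between two clusters
is the distance between their closest members); the function name agglomerate and
the sample points are invented for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def agglomerate(points):
    """Merge clusters bottom-up; return a list of (members, merge_distance)."""
    # Start with every element in a cluster of its own.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest (single linkage).
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        merges.append((merged, d))
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical sample data: two tight pairs and one outlying point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = agglomerate(points)
for members, height in merges:
    print(members, round(height, 2))
```

Each pass joins the two closest clusters, so the printed list reads like a tree from
the leaves upward. A divisive algorithm would produce a tree in the opposite
direction, starting from the whole set and splitting it.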
Distance Measures
The joining or tree clustering method above uses the dissimilarities (similarities) or
distances between objects when forming the clusters. Similarities are a set of rules
that serve as criteria for grouping or separating items. An important step in any
clustering is to select a distance measure, which will determine how the similarity of
two elements is calculated. These distances (similarities) can be based on a single
dimension or multiple dimensions, with each dimension representing a rule or
condition for grouping objects. This will influence the shape of the clusters, as some
elements may be close to one another according to one distance and further away
according to another. [2]
Common distance functions:
- The Euclidean distance
- The Mahalanobis distance [8]
- The Manhattan distance [9]
- The Hamming distance [10]
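Three of the distance functions listed above can be written directly from their
standard definitions. The sketch below is illustrative only: the function names and
sample values are chosen for this example, and the Mahalanobis distance is left out
because it additionally requires the covariance matrix of the data.

```python
from math import sqrt


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum of the absolute differences per dimension."""
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


p, q = (0, 0), (3, 4)
print(euclidean(p, q))                 # 5.0
print(manhattan(p, q))                 # 7
print(hamming("karolin", "kathrin"))   # 3
```

The same two points are 5 units apart under one measure and 7 under another, which
is why the choice of distance can change which elements end up in the same cluster.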
Picture 1: The Euclidean distance
Source: http://www1.uni-hamburg.de/RRZ/Software/Statistica/Handbuch/stcluan.html#d
Types of data in cluster analysis: interval-scaled variables; binary variables;
nominal, ordinal, and ratio variables; and variables of mixed types.
The cluster analysis tool is best used after a brainstorming session to organize data
by subdividing different ideas, items or characteristics into relatively similar groups,
each under a topical heading.
Consider a horizontal hierarchical tree plot (see graph below). On the left of the plot,
we begin with each object in a class by itself. Now imagine that, in very small steps,
we "relax" our criterion as to what is and is not unique. Put another way, we lower our
threshold regarding the decision when to declare two or more objects to be members
of the same cluster. The following tree diagram classifies 22 different car models and
their linkage (connection) using “Euclidean distance”, which compares the cars
according to certain characteristics (e.g. fuel consumption, cost, accessories etc.).
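The effect of relaxing the threshold can be imitated with a small Python sketch. It
is an assumption-laden illustration, not the procedure used to draw the tree
diagram: two objects are linked whenever their Euclidean distance is at most the
threshold, and linked objects form one cluster (this corresponds to cutting a
single-linkage tree at that height). The function name clusters_at and the five
"cars" with their two features are invented for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def clusters_at(points, threshold):
    """Group objects that are chained together by distances <= threshold."""
    parent = list(range(len(points)))  # union-find forest, one root per cluster

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)  # join the two clusters

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical cars: (fuel consumption, price), in made-up units.
cars = [(5.0, 15.0), (5.5, 16.0), (9.0, 40.0), (9.5, 42.0), (14.0, 90.0)]
for threshold in (0, 3, 30, 60):
    print(threshold, clusters_at(cars, threshold))
```

With the threshold at 0 every car is a class by itself; as the threshold grows, the
groups fuse until a single cluster remains, which is the left-to-right reading of
the tree plot.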
3. What are the success factors? (Do/ Do not)
Cluster analysis is not as much a typical statistical test as it is a "collection" of
different algorithms that "put objects into clusters according to well defined similarity
rules." The point here is that, unlike many other statistical procedures, cluster
analysis methods are mostly used when we do not have any a priori hypotheses,
but are still in the exploratory phase of our research. In a sense, cluster analysis finds
the "most significant solution possible." [2]
What Is Good Clustering?
High quality:
- high intra-class similarity (similarity between items that belong to the same
cluster)
- low inter-class similarity (similarity between items that belong to different
clusters)
Depends on:
- similarity measure (how similar two or more attributes are)
- algorithm for searching
- ability to discover hidden patterns
Clustering may not be the best way to discover interesting groups in a data set.
Often visualisation methods work well, allowing the human expert to identify
useful groups. However, as data set sizes increase to millions of entities, this
becomes impractical, and clustering helps to partition the data so that we can deal
with smaller groups. Different algorithms deliver different clusterings [6].
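Because different algorithms deliver different clusterings, it helps to have a
simple way to compare them. The sketch below is one possible check, assuming
Euclidean distance as the (dis)similarity measure: high intra-class similarity
shows up as a small average distance within clusters, low inter-class similarity
as a large average distance between clusters. The function name mean_distances,
the points and the two labelings are invented for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_distances(points, labels):
    """Return (mean distance within clusters, mean distance between clusters)."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # Same label: the pair counts as intra-class, otherwise as inter-class.
        (intra if labels[i] == labels[j] else inter).append(d)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
good = [0, 0, 0, 1, 1, 1]   # labels that follow the visible groups
bad = [0, 1, 0, 1, 0, 1]    # labels that mix the groups

print("good:", mean_distances(points, good))
print("bad: ", mean_distances(points, bad))
```

For the labeling that follows the visible groups, the within-cluster average is far
smaller than the between-cluster average; for the mixed labeling the two averages
are much closer together.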
4. Case study – Title: An Analysis of Industrial Clusters in Burnaby [7]
This case study presents the findings of an analysis of industrial clusters in the City
of Burnaby, in the Greater Vancouver area of British Columbia, Canada, and draws
conclusions based on this analysis.
Specifically, Figure 1 shows employment ratios for both the GVRD (Greater Vancouver
Regional District) and Burnaby, each compared to Canada as a whole. The Y-axis value
on Figure 1 is the ratio of the percentage of total employment that falls within a
specific industrial category in the GVRD or Burnaby to the corresponding national
percentage.
Not all industrial codes are shown: the figures show, as expected, that the GVRD and
Burnaby have much smaller than average labour forces in agriculture and mining, and
smaller than average employment in manufacturing and public administration.
If either the GVRD or Burnaby has a ratio greater than one, it indicates that it has
some competitive advantage compared to Canada as a whole.
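The ratio plotted in Figure 1 is simple arithmetic: the share of a region's
employment in one industrial category divided by the national share for the same
category. The sketch below only illustrates that calculation; the function name
employment_ratio and all employment figures are hypothetical and are not taken
from the case study.

```python
def employment_ratio(region_industry, region_total, nation_industry, nation_total):
    """Regional share of employment in an industry divided by the national share."""
    region_share = region_industry / region_total
    nation_share = nation_industry / nation_total
    return region_share / nation_share


# Hypothetical figures: 6% of regional jobs vs. 4% of national jobs in one industry.
ratio = employment_ratio(6_000, 100_000, 400_000, 10_000_000)
print(round(ratio, 2))

# A ratio above one means the industry is over-represented in the region.
print("above national average" if ratio > 1 else "at or below national average")
```

A ratio of one means the region matches the national average for that category;
values below one correspond to the under-represented sectors mentioned above, such
as agriculture and mining.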
FIGURE 1
What is perhaps more important is to look at those areas where Burnaby has a distinct
advantage over the rest of the GVRD. There are four such areas: utilities,
construction, wholesale, and information and culture. A similar type of analysis can be
done by occupational codes.
Figure 2 shows the ratios of the percentages of total employment in the GVRD and
Burnaby in a particular aggregation of occupational codes compared to Canada.
Neither Burnaby nor the GVRD has higher than average numbers of workers in
primary and manufacturing occupations. The GVRD overall has average numbers of
individuals in health occupations and education, but Burnaby falls behind the GVRD
(probably due to the preponderance of health and educational facilities outside
Burnaby).
Burnaby has slight advantages in management and business. While it does not have as
great an advantage in the arts as does the GVRD as a whole, its advantage over the
rest of Canada is still significant, and bears further study.
FIGURE 2
5. List of References
Web sites:
[1] http://mathworld.wolfram.com/ClusterAnalysis.html
[2] http://www.statsoft.com/textbook/stcluan.html
[3] http://datamining.anu.edu.au/student/math3346_2006/algintro-2x3.pdf
[4] Mona L. Toms, Mark A. Cummings-Hill and David G. Curry (Delphi Delco
Electronics Systems), Scott M. Cone (Veridian Engineering): Using Cluster Analysis
for Deriving Menu Structures for Automotive Mobile Multimedia Applications.
SAE 2001 World Congress, Detroit, Michigan, March 5-8, 2001
[5] en.wikipedia.org/wiki/Data_clustering
[6] datamining.anu.edu.au/student/math3346_2006/clusters-2x3.pdf
[7] www.sfu.ca/cprost/publications.htm
[8] http://mrw.interscience.wiley.com/emrw/9780470011812/eob/article/b2a13038/current/abstract
[9] http://www.camo.com/rt/Resources/Clustering.html
[10] http://en.wikipedia.org/wiki/Hamming_distance
6. Glossary
Algorithm: As opposed to heuristics (which contain general recommendations based
on statistical evidence or theoretical reasoning), algorithms are completely defined,
finite sets of steps, operations, or procedures that will produce a particular outcome.
For example, with a few exceptions, all computer programs, mathematical formulas,
and (ideally) medical and food recipes are algorithms.
Cluster: a collection of data objects that are similar to one another within the same
cluster and dissimilar to the objects in other clusters.
Cluster analysis: grouping a set of data objects into clusters. Clustering is
unsupervised classification (no predefined classes) and is a form of descriptive data
mining.
Heuristics: general recommendations or guides based on statistical evidence or
theoretical reasoning.
7. Keywords
Cluster analysis, Types of clustering, Cluster algorithms, Cluster distances, Creativity
technique, Creative.
8. Questions
1) What is cluster analysis and what is it used for?
2) Which are the factors used in a tree clustering method?
3) Which are the methods (codes) used in clustering the industries in Burnaby?
4) High-quality similarities in good clustering are (please insert T for true and F
for false):
a. high intra-class similarity
b. high inter-class similarity
c. low intra-class similarity
d. low inter-class similarity
Answer:
a. T
b. F
c. F
d. T