
CLUSTER ANALYSIS PPT

GUJARAT UNIVERSITY
M.Sc. Applied Mathematical Science
Semester III
Name: PRAJAPATI RUTVIK
Roll No.: 22
Subject Code: AMS – 506
Subject Name: Research Methodology and Multivariate Stat
Topic: Cluster Analysis (Overview)
OVERVIEW
• What is Cluster Analysis?
• Why Cluster Analysis?
• Assumptions
• How does Cluster Analysis work?
• Clustering Methods
• Applications
What is Cluster Analysis?
• Cluster analysis is a multivariate data mining technique whose goal is to group objects based on a set of user-selected characteristics.
• When plotted geometrically, objects within a cluster should be very close together, and the clusters themselves should be far apart.
• Clusters should exhibit high internal homogeneity and high external heterogeneity.
Why Cluster Analysis?
Cluster analysis is a major tool in many fields, including business and engineering.
Data Reduction: Reduces a large data set into meaningful, manageable groups.
Hypothesis Generation: Cluster analysis is useful for developing hypotheses about the nature of the data, which can then be examined against previously stated hypotheses.
Hypothesis Testing: A systematic procedure for deciding whether the results of a research study support a particular theory that applies to a population.
Prediction Based on Groups: Once groups are formed, new observations can be assigned to a group and predictions made from the group's characteristics.
Assumptions
• A sufficient sample size is needed to ensure representativeness of the population and its underlying structure, particularly the small groups within the population.
• Outliers can severely distort the representativeness of the results if they appear as structure (clusters) that is inconsistent with the research objectives.
• Representativeness of the sample: the sample must represent the population addressed by the research question.
• Impact of multicollinearity: input variables should be examined for substantial multicollinearity; if present, reduce the variables to equal numbers in each set of correlated measures.
How does Cluster Analysis work?
• The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups. To accomplish this, we must address three basic questions:
• How do we measure similarity?
• How do we form clusters?
• How many groups do we form?
Measuring Similarity
• Similarity represents the degree of correspondence among objects across all the characteristics used in the analysis. It is a set of rules that serve as criteria for grouping or separating items.
 Correlation Measures:
- Less frequently used; large values of r indicate similarity.
 Distance Measures:
- Most often used as the measure of similarity, with higher values representing greater dissimilarity (distance between cases), not similarity.
- Types of distance measures:
i. Euclidean Distance
ii. Squared Euclidean Distance
iii. City-block (Manhattan) Distance
iv. Chebyshev Distance
v. Mahalanobis Distance (D²)
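A minimal sketch of the distance measures listed above, computed in Python with SciPy on two made-up observations; the vectors x and y and the small sample X used for the Mahalanobis distance are illustrative assumptions, not data from the presentation.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 4.0, 3.0])
y = np.array([5.0, 1.0, 7.0])

print("Euclidean:        ", distance.euclidean(x, y))    # square root of the sum of squared differences
print("Squared Euclidean:", distance.sqeuclidean(x, y))  # sum of squared differences
print("City-block:       ", distance.cityblock(x, y))    # sum of absolute differences (Manhattan)
print("Chebyshev:        ", distance.chebyshev(x, y))    # largest single coordinate difference

# The Mahalanobis distance needs the inverse covariance matrix of the data;
# here it is estimated from a small illustrative sample X.
X = np.array([[2.0, 4.0, 3.0], [5.0, 1.0, 7.0], [4.0, 2.0, 5.0], [3.0, 3.0, 4.0]])
VI = np.linalg.pinv(np.cov(X, rowvar=False))
print("Mahalanobis:      ", distance.mahalanobis(x, y, VI))
```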
How do we form Clusters?
• Identify the two most similar (closest) observations not already in the same cluster and combine them.
• We apply this rule repeatedly to generate a number of cluster solutions, starting with each observation as its own "cluster" and then combining two clusters at a time until all observations are in a single cluster. This process is termed a hierarchical procedure because it moves in a stepwise fashion to form an entire range of cluster solutions. It is also called an agglomerative method because clusters are formed by combining existing clusters.
How do we form a group?
[Figure: nested grouping of observations and the corresponding dendrogram]
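A small sketch of the agglomerative procedure and the dendrogram it produces, using SciPy; the simulated data and the choice of Ward linkage are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Three loose groups of simulated points (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)),
               rng.normal(5, 1, (10, 2)),
               rng.normal((0, 5), 1, (10, 2))])

# Start with every observation as its own cluster and repeatedly merge the two
# closest clusters; each row of Z records one merge step.
Z = linkage(X, method="ward")

# The dendrogram depicts the entire nested range of cluster solutions.
dendrogram(Z)
plt.xlabel("Observation")
plt.ylabel("Merge distance")
plt.show()
```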
Methods of Clustering
• There are basically three types of methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
 Hierarchical Cluster Analysis
 Nonhierarchical Cluster Analysis
 Two-Step Cluster Analysis
Hierarchical Cluster Analysis
• This stepwise procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using either an agglomerative or a divisive algorithm, and results in a hierarchy or tree-like structure (dendrogram) depicting the formation of clusters. It is one of the most straightforward methods.
• Hierarchical cluster analysis provides an excellent framework within which to compare any set of cluster solutions.
• This method helps in judging how many clusters should be retained or considered.
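One illustrative way to use the hierarchy when judging how many clusters to retain is to cut the tree at several candidate cluster counts and compare the resulting solutions; the simulated data, the Ward linkage, and the candidate values of k below are assumptions, not a prescribed rule.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Simulated data with three loose groups (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)),
               rng.normal(5, 1, (10, 2)),
               rng.normal((0, 5), 1, (10, 2))])
Z = linkage(X, method="ward")

# Cut the hierarchy at several candidate numbers of clusters and compare.
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")  # membership for a k-cluster solution
    print(f"k = {k}: cluster sizes = {np.bincount(labels)[1:]}")
```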
Two Basic Types of HCA
Agglomerative Algorithm
• Hierarchical clustering begins with each object or observation in a separate cluster. In each subsequent step, the two clusters that are most similar are combined to build a new aggregate cluster. This procedure is repeated until all the points are combined into one cluster.
• Similarity decreases during successive steps. Clusters can't be split.
Divisive Algorithm
• Begins with all objects in a single cluster, which is then divided at each step into two clusters containing the most dissimilar objects. The single cluster is divided into two clusters, then one of these clusters is split, for a total of three clusters. This continues until all observations are in single-member clusters: from 1 cluster to n sub-clusters.
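Standard open-source toolkits ship the agglomerative algorithm directly (as in the SciPy sketch above); a true divisive hierarchy is less common, but the top-down splitting idea can be sketched with bisecting k-means, assuming scikit-learn 1.1 or newer. This is an illustrative stand-in, not the classical divisive algorithm itself.

```python
import numpy as np
from sklearn.cluster import BisectingKMeans  # available in scikit-learn >= 1.1 (assumption)

# Illustrative data: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# Start from one cluster and repeatedly split one cluster in two (top-down),
# the opposite direction of the agglomerative algorithm.
model = BisectingKMeans(n_clusters=3, random_state=0).fit(X)
print(model.labels_[:10])      # cluster assignment of the first few observations
print(model.cluster_centers_)  # centre of each of the three final clusters
```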
Nonhierarchical Cluster Analysis
• In contrast to hierarchical methods, nonhierarchical cluster analysis does not involve a tree-like construction process. Instead, objects are assigned to clusters once the number of clusters is specified.
• Two steps in nonhierarchical clustering:
i. Specify cluster seeds: identify the starting points.
ii. Assignment: assign each observation to one of the cluster seeds.
Nonhierarchical Clustering Algorithms
• Sequential Threshold Method
• Parallel Threshold Method
• Optimizing Procedures
All of these belong to the group of clustering algorithms known as K-means, as sketched below.
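A minimal sketch of the two nonhierarchical steps, specifying cluster seeds and then assigning observations, using scikit-learn's KMeans; the data and the explicit seed coordinates are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two loose groups of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(6, 1, (15, 2))])

# Step i: specify cluster seeds (starting points for each cluster).
seeds = np.array([[0.0, 0.0], [6.0, 6.0]])

# Step ii: assign each observation to the nearest seed, then iteratively refine.
km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)

print(km.labels_)           # cluster assigned to each observation
print(km.cluster_centers_)  # final cluster centres after refinement
```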
Two Step Cluster Analysis
• The TwoStep Cluster Analysis procedure is an exploratory tool designed to
reveal natural groupings (or clusters) within a dataset that would
otherwise not be apparent. The algorithm employed by this procedure has
several desirable features that differentiate it from traditional clustering
techniques:
• Handling of categorical and continuous variables. By assuming variables
to be independent, a joint multinomial-normal distribution can be placed
on categorical and continuous variables.
• Automatic selection of number of clusters. By comparing the values of a
model-choice criterion across different clustering solutions, the procedure
can automatically determine the optimal number of clusters.
• Scalability. By constructing a cluster features (CF) tree that summarizes the
records, the TwoStep algorithm allows you to analyze large data files.
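TwoStep itself is an SPSS procedure; as a rough open-source analogue of its scalability idea, scikit-learn's Birch also builds a cluster features (CF) tree that summarizes the records before a final clustering step. Unlike TwoStep it handles only continuous variables, and the data and parameters below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import Birch

# Illustrative data: many records in two loose groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])

# Birch first summarizes the records in a cluster features (CF) tree, then forms
# a final set of clusters from the tree's leaves.
model = Birch(threshold=0.5, n_clusters=2).fit(X)
print(np.bincount(model.predict(X)))  # number of records assigned to each cluster
```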
Real Life Applications
• Market Segmentation: Group people (with the willingness, purchasing power and authority to buy) according to their similarity in several dimensions related to a product under consideration.
• Sales Segmentation: Clustering can tell you which types of customers buy which products.
• Credit Risk: Segmenting customers based on their credit history.
• Operations: Segmenting high performers and targeting promotions based on a person's performance.
Real Life Applications
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
• City Planning: Identifying groups of houses according to their house type, value and geographical location.
• Geography: Identifying areas of similar land use in an earth observation database.
THANK YOU !!!