Uploaded by pari.jain.92560281

Clustering[12842]

advertisement
Cluster Analysis
1
• What is the nature of the group of customers that PMV Electric
are targeting ?
2
3
4
5
Size, shape , color, calyx
6
Cluster Analysis
• Unsupervised
• Aims to decompose or partition a data set in to clusters
such that each cluster is similar within itself but is as
dissimilar as possible to other clusters.
• Inter-cluster (between-groups) distance is maximized and
intra-cluster (within-group) distance is minimized.
• Different to classification as there are no predefined
classes, no training data, may not even know how many
clusters.
7
Distance metrics
8
How can we measure distance?
• Euclidean distance: Most common method to measure
dissimilarity between observations, when observations include
continuous variables.
• Let observations u = (u1, u2, . . . , uq) and v = (v1, v2, . . . , vq)
each comprise measurements of q variables.
• The Euclidean distance between observations u and v is
• 𝑑𝑢,𝑣 =
𝑢1 − 𝑣1
2
+ 𝑢2 − 𝑣2
2
+ ∙ ∙ ∙ + 𝑢𝑞 − 𝑣𝑞
2
9
• Construct the distance matrix for persons
• Which two customers are closest to each other ?
• What is the distance between person a and person f ?
10
Add a new column Age.
What is the distance between person a and person f now ?
11
Add a new column Age.
What is the distance between person a and person f now ?
12
Add three new columns to store standardized values.
Z-transform => x-mean/std.dev
What is the distance between person a and person f based on
standardized variables ?
13
Typical Steps in cluster analysis
Decide on the
clustering
variables
Decide on the
clustering
procedure
Hierarchical
methods
Select a distance
measure
Choose a
clustering
algorithm
Partitioning
methods
Decide on
the number
of clusters
A Concise Guide to Market Research: The Process, Data, and Methods Using IBM SPSS Statistics
Validate and
interpret the
cluster
solution
14
Steps in K-means
1. Initialisation : Select the number of clusters : k
2. Pick k seeds/observations as centroids of the k clusters.
Seeds are picked randomly
3. Calculate Euclidean distance of each object in dataset from
centroid
4. Allocate each object to the nearest cluster based on the
computed distances
5. Compute the new centroids
6. Check if the stopping criterion has been met (cluster remains
unchanged). If not, go to step 3.
16
Centroid = average of all data points in that specific cluster
17
Profile of the cluster is based on the centroid
The nature of each group is based on the centroid
18
19
Elbow Plot
https://uc-r.github.io/kmeans_clustering
20
Download