Cluster Analysis 1 • What is the nature of the group of customers that PMV Electric are targeting ? 2 3 4 5 Size, shape , color, calyx 6 Cluster Analysis • Unsupervised • Aims to decompose or partition a data set in to clusters such that each cluster is similar within itself but is as dissimilar as possible to other clusters. • Inter-cluster (between-groups) distance is maximized and intra-cluster (within-group) distance is minimized. • Different to classification as there are no predefined classes, no training data, may not even know how many clusters. 7 Distance metrics 8 How can we measure distance? • Euclidean distance: Most common method to measure dissimilarity between observations, when observations include continuous variables. • Let observations u = (u1, u2, . . . , uq) and v = (v1, v2, . . . , vq) each comprise measurements of q variables. • The Euclidean distance between observations u and v is • 𝑑𝑢,𝑣 = 𝑢1 − 𝑣1 2 + 𝑢2 − 𝑣2 2 + ∙ ∙ ∙ + 𝑢𝑞 − 𝑣𝑞 2 9 • Construct the distance matrix for persons • Which two customers are closest to each other ? • What is the distance between person a and person f ? 10 Add a new column Age. What is the distance between person a and person f now ? 11 Add a new column Age. What is the distance between person a and person f now ? 12 Add three new columns to store standardized values. Z-transform => x-mean/std.dev What is the distance between person a and person f based on standardized variables ? 13 Typical Steps in cluster analysis Decide on the clustering variables Decide on the clustering procedure Hierarchical methods Select a distance measure Choose a clustering algorithm Partitioning methods Decide on the number of clusters A Concise Guide to Market Research: The Process, Data, and Methods Using IBM SPSS Statistics Validate and interpret the cluster solution 14 Steps in K-means 1. Initialisation : Select the number of clusters : k 2. Pick k seeds/observations as centroids of the k clusters. Seeds are picked randomly 3. Calculate Euclidean distance of each object in dataset from centroid 4. Allocate each object to the nearest cluster based on the computed distances 5. Compute the new centroids 6. Check if the stopping criterion has been met (cluster remains unchanged). If not, go to step 3. 16 Centroid = average of all data points in that specific cluster 17 Profile of the cluster is based on the centroid The nature of each group is based on the centroid 18 19 Elbow Plot https://uc-r.github.io/kmeans_clustering 20