9.1
- In exploratory data analysis, it is helpful to check whether the data contain meaningful subgroups (clusters). Purposes include:
  - Generating new questions
  - Improving predictive analyses
- This chapter covers clustering with the K-means algorithm, including techniques for choosing the number of clusters

9.3
Clustering: a data analysis technique that separates a data set into subgroups of related data
- For example, clustering could separate a data set of online customers into groups that correspond to purchasing behaviours
Once the data are separated, we can (for example) use the subgroups to generate new questions about the data and follow up with a predictive modelling exercise
→ In DSCI 100, clustering will only be used for uncovering patterns in the data (exploratory)
Recall: Both classification and regression are supervised tasks: there is a response variable (a category label or value), and we have examples of past data with labels/values that help us predict those of future data.
- Can use a test data set to assess prediction performance
Clustering is different because it is an unsupervised task: we try to understand and examine the structure of the data without any response variable labels or values to assist
- Advantage: requires no additional annotation or input on the data (it would be impossible to annotate all articles on Wikipedia with human-made topic labels → we can simply cluster the articles without this information to find groupings corresponding to topics automatically)
- Disadvantage: there is no single good choice for evaluating a clustering → in DSCI 100 we use visualization to ascertain clustering quality; anything beyond that is too advanced for this course
- To cluster our observations and look for subgroups, we will use the K-means algorithm in this chapter

9.4
Goal: Use two variables from the penguins data set (bill length (mm) and flipper length (mm)) to see if there are distinct types of penguin in our data
- Note that the textbook reduces the data to 18 observations and 2 variables so that the visualizations illustrating how clustering works are clear
- The K-means clustering algorithm uses randomness when choosing a starting position for each cluster, so to ensure reproducibility set the seed:

    library(tidyverse)
    set.seed(1)

Now we can load and preview the penguins data.

    penguins <- read_csv("data/penguins.csv")
    penguins
    ## # A tibble: 18 × 2
    ##    bill_length_mm flipper_length_mm
    ##             <dbl>             <dbl>
    ##  1           39.2               196
    ##  2           36.5               182
    ##  3           34.5               187
    ##  4           36.7               187
    ##  5           38.1               181
    ##  6           39.2               190
    ##  7           36                 195
    ##  8           37.8               193
    ##  9           46.5               213
    ## 10           46.1               215
    ## 11           47.8               215
    ## 12           45                 220
    ## 13           49.1               212
    ## 14           43.3               208
    ## 15           46                 195
    ## 16           46.7               195
    ## 17           52.2               197
    ## 18           46.8               189

- Since K-means clustering depends on distances, we standardize the data before illustrating how it works (recall the Classification I textbook notes)

    library(tidymodels)
    uc_recipe <- recipe(~ ., data = penguins) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors()) |>
      prep()
    uc_recipe
    # Assumed step (not shown in the original notes): bake and rename to match the output below
    penguins_standardized <- uc_recipe |>
      bake(new_data = NULL) |>
      rename(bill_length_standardized = bill_length_mm, flipper_length_standardized = flipper_length_mm)
    penguins_standardized
    ## # A tibble: 18 × 2
    ##    bill_length_standardized flipper_length_standardized
    ##                       <dbl>                       <dbl>
    ##  1                   -0.641                      -0.190
    ##  2                   -1.14                       -1.33
    ##  3                   -1.52                       -0.922
    ##  4                   -1.11                       -0.922
    ##  5                   -0.847                      -1.41
    ##  6                   -0.641                      -0.678
    ##  7                   -1.24                       -0.271
    ##  8                   -0.902                      -0.434
    ##  9                    0.720                       1.19
    ## 10                    0.646                       1.36
    ## 11                    0.963                       1.36
    ## 12                    0.440                       1.76
    ## 13                    1.21                        1.11
    ## 14                    0.123                       0.786
    ## 15                    0.627                      -0.271
    ## 16                    0.757                      -0.271
    ## 17                    1.78                       -0.108
    ## 18                    0.776                      -0.759

Next we can create a scatter plot of this data set to see if we can detect subtypes or groups:

    ggplot(penguins_standardized,
           aes(x = flipper_length_standardized, y = bill_length_standardized)) +
      geom_point() +
      xlab("Flipper Length (standardized)") +
      ylab("Bill Length (standardized)") +
      theme(text = element_text(size = 12))

Three key groups:
1) Small flipper and bill length group
2) Small flipper length, but large bill length group
3) Large flipper and bill length group
- Finding groups by eye becomes more difficult as we increase the number of variables considered when clustering
- Clustering algorithms let us separate the data into groups rigorously: here, the K-means algorithm plus the elbow method to select the number of clusters
Now that these observations are made, we can use them to inform our species classifications or ask further questions about our data, like the relationship between flipper length
and bill length, which may differ depending on the type of penguin we have.

9.5 (K-Means)
9.5.1 Measuring cluster quality
The K-means algorithm groups data into K clusters
- Starts with an initial clustering of the data
- Iteratively improves it by adjusting the assignment of data to clusters until it cannot improve any further
Questions:
a) How can we measure the quality of a clustering?
b) What does it mean to improve it?
In K-means clustering, we measure the quality of a cluster by its within-cluster sum-of-squared-distances (WSSD):
1) Find the cluster center by computing the mean of each variable over the data points in the cluster
   a) E.g. for a cluster containing four observations (x_1, y_1), ..., (x_4, y_4), clustered on x and y, the center is
      \mu_x = \frac{1}{4}(x_1 + x_2 + x_3 + x_4), \quad \mu_y = \frac{1}{4}(y_1 + y_2 + y_3 + y_4)
2) Add up the squared distances between each point in the cluster and the cluster center, using the straight-line (Euclidean) distance formula. For the same four observations:
      S^2 = \sum_{i=1}^{4} \left[ (x_i - \mu_x)^2 + (y_i - \mu_y)^2 \right]
Note that the larger the S^2 value, the more spread out the cluster is (points are far from the cluster center)
- "Large" is relative to both the scale of the variables used for clustering and the number of points in the cluster
- A cluster whose points are all very close to the center might still have a large S^2 if it contains many data points
After calculating the WSSD for each cluster, we sum them to get the total WSSD (the textbook illustrates this by adding up all the squared distances for the 18 observations, denoted by black lines)
Since K-means uses the straight-line distance to measure the quality of a clustering, it is limited to clustering based on quantitative variables. Other variants of K-means, and other clustering algorithms, can use other distance metrics to allow non-quantitative data to be clustered, but these are beyond the scope of DSCI 100

9.5.2 The Clustering Algorithm
We begin the K-means algorithm by picking K and randomly assigning a roughly equal number of observations to each of the K clusters.
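As a rough illustration, the random initialization just described and the two update steps covered in this section can be sketched in base R. This is a minimal sketch under stated assumptions, not the textbook's or tidyclust's implementation: the function names simple_kmeans and total_wssd are hypothetical, base R's scale() stands in for the recipe-based standardization, and the data are the 18 observations listed in 9.4.

```r
# The 18 penguin observations listed in section 9.4 of these notes.
penguins <- data.frame(
  bill_length_mm = c(39.2, 36.5, 34.5, 36.7, 38.1, 39.2, 36, 37.8, 46.5,
                     46.1, 47.8, 45, 49.1, 43.3, 46, 46.7, 52.2, 46.8),
  flipper_length_mm = c(196, 182, 187, 187, 181, 190, 195, 193, 213,
                        215, 215, 220, 212, 208, 195, 195, 197, 189)
)

# Standardize so each variable has mean 0 and standard deviation 1.
x <- scale(as.matrix(penguins))

# Total WSSD: squared straight-line distance from each point to the
# center (column means) of its own cluster, summed over all clusters.
total_wssd <- function(x, labels) {
  sum(sapply(unique(labels), function(k) {
    pts <- x[labels == k, , drop = FALSE]
    sum(sweep(pts, 2, colMeans(pts))^2)
  }))
}

# Hypothetical helper: plain K-means with random initial labels.
simple_kmeans <- function(x, k, max_iter = 100) {
  n <- nrow(x)
  # Random initial labels with roughly equal cluster sizes.
  labels <- sample(rep_len(seq_len(k), n))
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(x))
  for (iter in seq_len(max_iter)) {
    # Center update: mean of each variable within each cluster.
    # A cluster that has lost all its points keeps its previous center.
    for (j in seq_len(k)) {
      members <- labels == j
      if (any(members)) {
        centers[j, ] <- colMeans(x[members, , drop = FALSE])
      }
    }
    # Label update: assign each point to the nearest center.
    d2 <- matrix(0, nrow = n, ncol = k)
    for (j in seq_len(k)) {
      d2[, j] <- rowSums(sweep(x, 2, centers[j, ])^2)
    }
    new_labels <- apply(d2, 1, which.min)
    # Stop once no assignment changes.
    if (all(new_labels == labels)) break
    labels <- new_labels
  }
  list(labels = labels, total_wssd = total_wssd(x, labels))
}

set.seed(1)
fit <- simple_kmeans(x, k = 3)
table(fit$labels)
fit$total_wssd
```

Because the starting labels are random, different seeds can end in different clusterings with different total WSSD values, which is the motivation for the random restarts discussed in 9.5.3.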
K-means consists of two major steps that attempt to minimize the sum of WSSDs over all the clusters (the total WSSD):
A) Center update (compute the center of each cluster)
B) Label update (reassign each data point to the cluster with the nearest center)
These two steps are repeated until the cluster assignments no longer change
- In the textbook example, the algorithm can terminate at iteration 4, since none of the assignments changed in the fourth iteration; both the centers and labels will remain the same from this point onward

9.5.3 Random Restarts
- K-means can get stuck in a bad solution (e.g. after an unlucky random initialization)
- This is unlike the classification and regression models seen previously
- In such cases, K-means cannot improve the clustering even though it looks relatively poor
- To solve this problem, when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the lowest final total WSSD

9.5.4 Choosing K
- To cluster data using K-means, we must pick the number of clusters, K
With no response variable, we cannot perform cross-validation with some measure of model prediction error, so how do we do this?
- The lack of these tools matters because if K is too small, multiple clusters get grouped together, and if K is too large, clusters get subdivided
- In both cases, we will potentially miss interesting structure in the data
In the penguin example, setting K < 3 causes the clustering to merge separate groups of data, giving a large total WSSD since the cluster center is not close to any of the data in the cluster
Setting K > 3 causes the clustering to subdivide groups of data, which still decreases the total WSSD, but only by a diminishing amount (it improves by smaller and smaller amounts)
- If we plot the total WSSD versus the number of clusters, we see that the decrease in total WSSD levels off (forms an elbow shape) when we reach roughly the right number of clusters

9.6
- To perform K-means clustering in R using a tidymodels workflow, we first load the following libraries:

    library(tidymodels)  # provides recipe() and workflow()
    library(tidyclust)

Recall: K-means clustering uses straight-line distance to decide which points are similar to each other, which means that the scale of each variable in the data influences which cluster each data point is assigned to
- Variables with a large scale have a much larger effect on cluster assignments than variables with a small scale
- The step_scale and step_center preprocessing steps address this problem → include these in a recipe that standardizes the data before clustering
- Standardization ensures each variable has a mean of 0 and a standard deviation of 1 prior to clustering

    kmeans_recipe <- recipe(~ ., data = penguins) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())
    kmeans_recipe

Now we need to indicate that we are performing K-means clustering. We use the num_clusters argument to specify the number of clusters (here we choose K = 3) and specify that we are using the "stats" engine

    kmeans_spec <- k_means(num_clusters = 3) |>
      set_engine("stats")
    kmeans_spec

To actually run the K-means clustering, we combine the recipe and model
specification in a workflow, and use the fit function
→ K-means uses a random initialization of assignments, but since we set the random seed earlier, the clustering will be reproducible

    kmeans_fit <- workflow() |>
      add_recipe(kmeans_recipe) |>
      add_model(kmeans_spec) |>
      fit(data = penguins)
    kmeans_fit

kmeans_fit contains a lot of information that can be used to visualize the clusters, pick K, and evaluate the total WSSD
Now, let's visualize the clusters as a coloured scatter plot. We first need to augment our original data frame with the cluster assignments, which can be done using augment from tidyclust

    clustered_data <- kmeans_fit |>
      augment(penguins)
    clustered_data
    ## # A tibble: 18 × 3
    ##    bill_length_mm flipper_length_mm .pred_cluster
    ##             <dbl>             <dbl> <fct>
    ##  1           39.2               196 Cluster_1
    ##  2           36.5               182 Cluster_1
    ##  3           34.5               187 Cluster_1
    ##  4           36.7               187 Cluster_1
    ##  5           38.1               181 Cluster_1
    ##  6           39.2               190 Cluster_1
    ##  7           36                 195 Cluster_1
    ##  8           37.8               193 Cluster_1
    ##  9           46.5               213 Cluster_2
    ## 10           46.1               215 Cluster_2
    ## 11           47.8               215 Cluster_2
    ## 12           45                 220 Cluster_2
    ## 13           49.1               212 Cluster_2
    ## 14           43.3               208 Cluster_2
    ## 15           46                 195 Cluster_3
    ## 16           46.7               195 Cluster_3
    ## 17           52.2               197 Cluster_3
    ## 18           46.8               189 Cluster_3

Note: The augment() function takes the original (unstandardized) data, in this case penguins, and adds a new column, .pred_cluster, which contains each row's cluster assignment from the trained kmeans_fit model. Importantly, even though kmeans_fit was trained using standardized values, augment does NOT return standardized data: it returns the original input data plus the cluster predictions.
So now that we have the cluster assignments included in the clustered_data tidy data frame, we can visualize them as shown in Figure 9.13. Note that we are plotting the un-standardized data here; if we for some reason wanted to visualize the standardized data from the recipe, we would need to use the bake function to obtain that first.
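For reference, the bake step mentioned in the note above might look like the following. This is a minimal sketch, assuming the recipes package (part of tidymodels) is installed; the data frame is rebuilt from the 18 observations listed in 9.4 so the example runs on its own, and the name penguins_standardized is only illustrative.

```r
library(recipes)  # part of tidymodels; provides recipe(), prep(), bake()

# The 18 penguin observations listed in section 9.4 of these notes.
penguins <- data.frame(
  bill_length_mm = c(39.2, 36.5, 34.5, 36.7, 38.1, 39.2, 36, 37.8, 46.5,
                     46.1, 47.8, 45, 49.1, 43.3, 46, 46.7, 52.2, 46.8),
  flipper_length_mm = c(196, 182, 187, 187, 181, 190, 195, 193, 213,
                        215, 215, 220, 212, 208, 195, 195, 197, 189)
)

# Same recipe as in the notes: scale, then center, every variable.
kmeans_recipe <- recipe(~ ., data = penguins) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# prep() estimates the scaling and centering values from the data;
# bake(new_data = NULL) returns the preprocessed training data.
penguins_standardized <- kmeans_recipe |>
  prep() |>
  bake(new_data = NULL)

penguins_standardized
```

The baked columns keep their original names (bill_length_mm, flipper_length_mm) even though the values are now standardized, so a plot of this data frame should label its axes accordingly.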
Plotting the clusters using the un-standardized data:

    cluster_plot <- ggplot(clustered_data,
                           aes(x = flipper_length_mm,
                               y = bill_length_mm,
                               color = .pred_cluster)) +
      geom_point(size = 2) +  # size belongs in geom_point(); ggplot() ignores it
      labs(x = "Flipper Length", y = "Bill Length", color = "Cluster") +
      scale_color_manual(values = c("steelblue", "darkorange", "goldenrod1")) +
      theme(text = element_text(size = 12))
    cluster_plot

- Now we need to select K by finding where the elbow occurs in the plot of total WSSD versus the number of clusters. We can obtain the total WSSD (tot.withinss) from our clustering with 3 clusters using the glance function:

    glance(kmeans_fit)
    ## # A tibble: 1 × 4
    ##   totss tot.withinss betweenss  iter
    ##   <dbl>        <dbl>     <dbl> <int>
    ## 1    34         4.47      29.5     2

To calculate the total WSSD for a variety of Ks, we create a data frame with a column named num_clusters whose rows contain each value of K we want to run K-means with (here, 1 to 9).

    penguin_clust_ks <- tibble(num_clusters = 1:9)
    penguin_clust_ks

Then we construct our model specification again, this time specifying that we want to tune the num_clusters parameter

    kmeans_spec <- k_means(num_clusters = tune()) |>
      set_engine("stats")
    kmeans_spec

We combine the recipe and specification in a workflow, then use tune_cluster to run K-means on each of the different settings of num_clusters. The grid argument controls which values of K we want to try (the values from 1 to 9 stored in the penguin_clust_ks data frame). Set the resamples argument to apparent(penguins) to tell K-means to run on the whole data set for each value of num_clusters. Finally, we collect the results using the collect_metrics function.

    kmeans_results <- workflow() |>
      add_recipe(kmeans_recipe) |>
      add_model(kmeans_spec) |>
      tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |>
      collect_metrics()
    kmeans_results
    ## # A tibble: 18 × 7
    ##    num_clusters .metric          .estimator   mean     n std_err .config
    ##           <int> <chr>            <chr>       <dbl> <int>   <dbl> <chr>
    ##  1            1 sse_total        standard   34         1      NA Preprocessor1_…
    ##  2            1 sse_within_total standard   34         1      NA Preprocessor1_…
    ##  3            2 sse_total        standard   34         1      NA Preprocessor1_…
    ##  4            2 sse_within_total standard   10.9       1      NA Preprocessor1_…
    ##  5            3 sse_total        standard   34         1      NA Preprocessor1_…
    ##  6            3 sse_within_total standard    4.47      1      NA Preprocessor1_…
    ##  7            4 sse_total        standard   34         1      NA Preprocessor1_…
    ##  8            4 sse_within_total standard    3.54      1      NA Preprocessor1_…
    ##  9            5 sse_total        standard   34         1      NA Preprocessor1_…
    ## 10            5 sse_within_total standard    2.23      1      NA Preprocessor1_…
    ## 11            6 sse_total        standard   34         1      NA Preprocessor1_…
    ## 12            6 sse_within_total standard    1.75      1      NA Preprocessor1_…
    ## 13            7 sse_total        standard   34         1      NA Preprocessor1_…
    ## 14            7 sse_within_total standard    2.06      1      NA Preprocessor1_…
    ## 15            8 sse_total        standard   34         1      NA Preprocessor1_…
    ## 16            8 sse_within_total standard    2.46      1      NA Preprocessor1_…
    ## 17            9 sse_total        standard   34         1      NA Preprocessor1_…
    ## 18            9 sse_within_total standard    0.906     1      NA Preprocessor1_…

The total WSSD results correspond to the mean column when the .metric variable is equal to sse_within_total.
We can obtain a tidy data frame with this information using filter, mutate, and select:

    kmeans_results <- kmeans_results |>
      filter(.metric == "sse_within_total") |>
      mutate(total_WSSD = mean) |>
      select(num_clusters, total_WSSD)
    kmeans_results
    ## # A tibble: 9 × 2
    ##   num_clusters total_WSSD
    ##          <int>      <dbl>
    ## 1            1     34
    ## 2            2     10.9
    ## 3            3      4.47
    ## 4            4      3.54
    ## 5            5      2.23
    ## 6            6      1.75
    ## 7            7      2.06
    ## 8            8      2.46
    ## 9            9      0.906

Now that we have total_WSSD and num_clusters, let's make a line plot and find the elbow

    elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) +
      geom_point() +
      geom_line() +
      xlab("K") +
      ylab("Total within-cluster sum of squares") +
      scale_x_continuous(breaks = 1:9) +
      theme(text = element_text(size = 12))
    elbow_plot

It looks like 3 clusters is the right choice
- Why is K = 8 worse than K = 7? (Shouldn't the total WSSD keep improving as K increases?)
- Recall that K-means can get stuck in a bad solution: here, with K = 8, it had an unlucky initialization and found a bad clustering
To avoid ending up with a bad clustering, we can try several different random initializations via the nstart argument in the model specification. Let's try 10 restarts:

    kmeans_spec <- k_means(num_clusters = tune()) |>
      set_engine("stats", nstart = 10)
    kmeans_spec
    ## K Means Cluster Specification (partition)
    ##
    ## Main Arguments:
    ##   num_clusters = tune()
    ##
    ## Engine-Specific Arguments:
    ##   nstart = 10
    ##
    ## Computational engine: stats

Final Code:

    kmeans_results <- workflow() |>
      add_recipe(kmeans_recipe) |>
      add_model(kmeans_spec) |>
      tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |>
      collect_metrics() |>
      filter(.metric == "sse_within_total") |>
      mutate(total_WSSD = mean) |>
      select(num_clusters, total_WSSD)

    elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) +
      geom_point() +
      geom_line() +
      xlab("K") +
      ylab("Total within-cluster sum of squares") +
      scale_x_continuous(breaks = 1:9) +
      theme(text = element_text(size = 12))
    elbow_plot

Plot: [elbow plot of total WSSD versus K, using nstart = 10]
Explanation:
- Rerunning the
same workflow with the new model specification means that K-means clustering will be performed nstart = 10 times for each value of K. The collect_metrics function then picks the best clustering of the 10 runs for each value of K and reports the results for that best clustering. The more times we perform K-means clustering, the more likely we are to find a good clustering (if one exists)
What value should be chosen for nstart? It depends on many factors:
- Size and characteristics of the data set
- How powerful the computer is
=> The larger the nstart value, the better from an analysis perspective, but there is a trade-off: doing many clusterings could take a long time, so the two need to be balanced
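To see the effect of restarts outside the tidymodels workflow, the comparison can be sketched with base R's kmeans function, which has its own nstart argument. This is only an illustrative sketch: it assumes that the "stats" engine ultimately relies on this function (the notes do not say so explicitly), uses scale() in place of the recipe, and rebuilds the data from the 18 observations listed in 9.4. The variable names are hypothetical.

```r
# The 18 penguin observations listed in section 9.4 of these notes.
penguins <- data.frame(
  bill_length_mm = c(39.2, 36.5, 34.5, 36.7, 38.1, 39.2, 36, 37.8, 46.5,
                     46.1, 47.8, 45, 49.1, 43.3, 46, 46.7, 52.2, 46.8),
  flipper_length_mm = c(196, 182, 187, 187, 181, 190, 195, 193, 213,
                        215, 215, 220, 212, 208, 195, 195, 197, 189)
)

# Standardize so each variable has mean 0 and standard deviation 1.
x <- scale(as.matrix(penguins))

set.seed(1)
ks <- 1:9

# Total WSSD (tot.withinss) for each K with a single random start ...
wssd_one_start <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 1)$tot.withinss
})

# ... and with 10 random starts, keeping the best of the 10 runs.
wssd_ten_starts <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

data.frame(num_clusters = ks, wssd_one_start, wssd_ten_starts)
```

With more starts, the total WSSD column is more likely to decrease steadily as K grows, at the cost of roughly nstart times as much computation per value of K.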