Twister Kmeans Project Report
Peng Chen, Yuan Gao

1 Dataflow of an automatic Twister Kmeans Program

[Dataflow diagram. The stages run in this order:]
Centroid generation: generate 10 sets of centroids.
Data generation: use a "map-only" operation (Map() … Map()) to generate the data points concurrently.
Map() … Map(): compute the distance from each data point to the centroid of each cluster, find the nearest center, and assign the point to that cluster center.
Reduce(): compute new centroids by calculating the center of the points in each cluster. Compute the difference between the new centroids and the previous ones. If it is less than the threshold, break the iteration; otherwise repeat the Map() and Reduce() stages until it is less than the threshold.
Map() … Map(): compute the objective function.
Choose the best centroids, those with the lowest value of the objective function.

2 Comparison Analysis

Below is the original kmeans twister dataflow:

[Dataflow diagram. The stages run in this order:]
Data generation: use a "map-only" operation (Map() … Map()) to generate the data points concurrently.
Map() … Map(): compute the distance from each data point to the centroid of each cluster, find the nearest center, and assign the point to that cluster center.
Reduce(): compute new centroids by calculating the center of the points in each cluster. Compute the difference between the new centroids and the previous ones. If it is less than the threshold, break the iteration; otherwise repeat until it is less than the threshold.

To implement an automatic Twister Kmeans program that runs with different initial centroids and keeps the best case, we need to find the minimal value of the objective function across the ten runs. Compared with the original kmeans, we therefore add a part that calculates the value of the objective function. Specifically, we use assg[i] to store the nearest centroid assigned to data point i in the most recent iteration, calculate the sum of the Euclidean distances between the data points and their assg[i], and store the result in an additional column of the centroid array. As a result, the map() we use after generating the data points differs from the original map().
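The modified map() and the matching reduce() described above can be sketched in plain Python. This is a minimal illustration, not the Twister API: the function names kmeans_map and kmeans_reduce, the row layout [coordinate sums, count, objective], and the use of an index list for assg are all assumptions made for the example. It accumulates the plain Euclidean distance, as described in this section; squaring each distance would give the within-cluster sum of squares instead.

```python
import math


def kmeans_map(points, centroids):
    """One hypothetical map task over its share of the data points.

    Returns (partial, assg). partial has one row per centroid laid out as
    [coordinate sums..., point count, objective contribution]; the last
    entry plays the role of the additional column of the centroid array.
    """
    k = len(centroids)
    dim = len(centroids[0])
    partial = [[0.0] * (dim + 2) for _ in range(k)]
    assg = []  # assg[i] = index of the nearest centroid for point i
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        j = min(range(k), key=dists.__getitem__)
        assg.append(j)
        for d in range(dim):
            partial[j][d] += p[d]
        partial[j][dim] += 1
        partial[j][dim + 1] += dists[j]  # distance to the assigned centroid
    return partial, assg


def kmeans_reduce(partials, old_centroids):
    """Combine the partial rows from all map tasks into new centroids."""
    k = len(old_centroids)
    dim = len(old_centroids[0])
    total = [[0.0] * (dim + 2) for _ in range(k)]
    for part in partials:
        for j in range(k):
            for d in range(dim + 2):
                total[j][d] += part[j][d]
    new_centroids = []
    for j in range(k):
        count = total[j][dim]
        if count > 0:
            new_centroids.append([total[j][d] / count for d in range(dim)])
        else:
            # An empty cluster keeps its previous centroid (an assumption).
            new_centroids.append(list(old_centroids[j]))
    objective = sum(total[j][dim + 1] for j in range(k))
    return new_centroids, objective
```

Each map task only ever sees its own share of the points, so the objective can be collected in the same pass that produces the partial sums, without a second scan of the data inside the iteration.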
After the main iteration for kmeans clustering, we add another round of map() to calculate the final value of the objective function. To run the whole procedure automatically, the program first runs a loop that generates 10 sets of centroids.

3 Questions and Answers

a. The sequential complexity per iteration is O(NK) for K centers and N points. What is the time complexity of each Map task?
Answer: Assuming we have p mappers, the time complexity of each map task is O(NK/p).

b. What is the time complexity of the Reduce task?
Answer: O(K)

c. What speedup would you expect when N is large for the Twister version?
Answer: Assuming we have p mappers, the speedup is
s = (O(NK) + O(K)) / (O(NK/p) + O(K))
  ≈ O(NK) / O(NK/p)   (when N is large relative to both K and p, the O(K) terms are negligible)
  = p

d. In your best solution with the lowest objective function value, could you explain or describe the reason?
Answer: The k-means algorithm we are using is a heuristic that converges to a local optimum. Each iteration decreases the within-cluster sum of squares (WCSS), which is our objective function, until the threshold is satisfied. However, the final result depends on the location of the initial centroids, because the location of the new centroids in every iteration depends on the location of the old ones. If we run the k-means algorithm on the same dataset with 10 different sets of initial centroids, we can therefore obtain up to 10 different results. We choose the one with the lowest objective function (WCSS) value, which is the best among the 10.
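The selection step in answer d can be illustrated with a small sequential sketch. It is not the Twister implementation: the names lloyd and best_of_restarts, the default threshold, the fixed seed, and the choice to draw the initial centroids from the data points themselves are all assumptions made for the example.

```python
import random


def sq_dist(p, c):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def lloyd(points, centroids, threshold=1e-6, max_iter=100):
    """Run k-means from the given initial centroids; return (centroids, wcss)."""
    centroids = [list(c) for c in centroids]
    dim = len(centroids[0])
    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in centroids]
        counts = [0] * len(centroids)
        for p in points:
            j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        # An empty cluster keeps its previous centroid.
        new = [[s / n for s in row] if n else old
               for row, n, old in zip(sums, counts, centroids)]
        # Largest squared movement of any centroid in this iteration.
        shift = max(sq_dist(a, b) for a, b in zip(new, centroids))
        centroids = new
        if shift < threshold:
            break
    wcss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, wcss


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means from several random initializations; keep the lowest WCSS."""
    rng = random.Random(seed)
    results = [lloyd(points, rng.sample(points, k)) for _ in range(restarts)]
    return min(results, key=lambda r: r[1])
```

A call such as best_of_restarts(points, k=2) returns the centroids and WCSS of the best run. Because each run can only reach a local optimum, more restarts can lower the best value found but do not guarantee the global optimum.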