
Twister Kmeans Project Report
Peng Chen, Yuan Gao
1 Dataflow of an automatic Twister Kmeans Program:
[Figure: dataflow diagram. Its stages, in order:]
(1) Centroid generation: generate 10 sets of centroids.
(2) Data generation: use a “map-only” operation (Map() … Map()) to generate the data points concurrently.
(3) Map() … Map(): compute the distance from each data point to the centroid of each cluster, find the nearest center, and assign the point to that cluster center.
(4) Reduce(): compute new centroids by calculating the center of the points in each cluster. Compute the difference between the new centroids and the previous ones. If it is less than the threshold, break the iteration.
    Stages (3) and (4) repeat until the difference is less than the threshold.
(5) Map() … Map(): compute the objective function.
(6) Choose the best centroids, i.e. the set with the lowest value of the objective function.
2 Comparison Analysis
Below is the original kmeans twister dataflow:
[Figure: dataflow diagram. Its stages, in order:]
(1) Data generation: use a “map-only” operation (Map() … Map()) to generate the data points concurrently.
(2) Map() … Map(): compute the distance from each data point to the centroid of each cluster, find the nearest center, and assign the point to that cluster center.
(3) Reduce(): compute new centroids by calculating the center of the points in each cluster. Compute the difference between the new centroids and the previous ones. If it is less than the threshold, break the iteration.
    Stages (2) and (3) repeat until the difference is less than the threshold.
To implement an automatic Twister Kmeans program that runs with different initial centroids and keeps
the best case, we need to find the minimum value of the objective function over the ten runs. Compared
with the original kmeans, we therefore add a part that calculates the value of the objective function.
Specifically, we use assg[i] to store the nearest centroid found for data point i in the previous pass,
calculate the sum of the Euclidean distances between the data points and assg[i], and store the result in
an additional column of the centroid array. As a result, the map() we use after generating the data points
differs from the original map(). After the main iteration for kmeans clustering, we add another round of
map() to calculate the final value of the objective function. To run the program automatically, we start
with a loop that generates 10 sets of centroids.
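The modified map and reduce steps can be sketched roughly as follows. This is a minimal illustration in plain Python, not the Twister (Java) code of the project: the names kmeans_map and kmeans_reduce are hypothetical, the objective is returned as a separate value rather than in an extra column of the centroid array, and squared Euclidean distance is assumed so that the value matches the WCSS discussed in Section 3.

```python
def kmeans_map(points, centroids):
    """One map task (hypothetical helper, not a Twister API call).

    Assigns each local point to its nearest centroid and returns the
    per-cluster coordinate sums, the per-cluster counts, and this
    partition's share of the objective (assumed: squared distances).
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    objective = 0.0
    for p in points:
        # Squared distance from this point to every centroid: O(K) work.
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        nearest = min(range(k), key=d2.__getitem__)
        objective += d2[nearest]
        counts[nearest] += 1
        for j in range(dim):
            sums[nearest][j] += p[j]
    return sums, counts, objective


def kmeans_reduce(partials, old_centroids):
    """Combine the map outputs into new centroids and the total objective."""
    k, dim = len(old_centroids), len(old_centroids[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    total_objective = 0.0
    for sums, counts, objective in partials:
        total_objective += objective
        for i in range(k):
            total_counts[i] += counts[i]
            for j in range(dim):
                total_sums[i][j] += sums[i][j]
    new_centroids = [
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        [s / total_counts[i] for s in total_sums[i]] if total_counts[i]
        else list(old_centroids[i])
        for i in range(k)
    ]
    return new_centroids, total_objective
```

Note that the objective returned here is measured against the centroids that were passed in, not the newly computed ones, which is one reason an extra map() round after convergence is useful for the final value.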
3 Questions and Answers
a. The sequential complexity per iteration is O(NK) for K centers and N points. What is the time
complexity of each Map Task?
Answer:
Assuming we have p mappers and the N points are split evenly among them, the time complexity of each map task is O(NK/p).
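As a rough illustration of where O(NK/p) comes from, the sketch below splits N points into p nearly equal partitions and counts the distance evaluations each map task would perform. The helper names split_points and distance_evaluations are hypothetical and exist only for this example.

```python
def split_points(points, p):
    """Split a list of points into p nearly equal contiguous partitions."""
    base, extra = divmod(len(points), p)
    parts, start = [], 0
    for i in range(p):
        # The first `extra` partitions receive one additional point.
        size = base + (1 if i < extra else 0)
        parts.append(points[start:start + size])
        start += size
    return parts


def distance_evaluations(partition, k):
    """Each point is compared against all k centroids once per iteration."""
    return len(partition) * k
```

With an even split, the busiest map task handles ceil(N/p) points, so its cost per iteration is about NK/p distance evaluations.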
b. What is the time complexity of the Reduce task?
Answer:
O(K)
c. What speedup would you expect for the Twister version when N is large?
Answer:
Assuming we have p mappers, the speedup is
S = ( O(NK) + O(K) ) / ( O(NK/p) + O(K) )
  ≈ O(NK) / O(NK/p)    (for large N the O(K) terms are negligible)
  = p
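The limit can be checked numerically with a small model. This sketch simply plugs numbers into the ratio above, treating every constant factor as 1 and ignoring communication cost; both are assumptions of the example, and modeled_speedup is a hypothetical name.

```python
def modeled_speedup(n, k, p):
    """Idealized speedup: (NK + K) / (NK/p + K), with unit constants."""
    sequential = n * k + k
    parallel = n * k / p + k
    return sequential / parallel
```

For instance, modeled_speedup(1_000_000, 10, 8) is within 0.01 of 8, while a very small N such as modeled_speedup(10, 10, 8) stays below 5 because the O(K) terms are no longer negligible.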
d. In your best solution with lowest objective function value, could you explain or describe the
reason?
Answer:
The k-means algorithm we are using is a heuristic algorithm that converges to a local optimum.
In each iteration it lowers (or at least does not raise) the within-cluster sum of squares (WCSS),
which is our objective function, until the threshold is satisfied. However, the final result
depends on the location of the initial centroids, because the location of the new centroids
in each iteration depends on the location of the old ones. So if we run the k-means algorithm on
the same dataset with 10 different sets of initial centroids, we can expect up to 10 different results.
We chose the one with the lowest objective function (WCSS) value, which is the best among the 10.
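The effect of the initial centroids can be reproduced with a small sequential sketch. It is plain Python rather than the Twister program, and the names wcss, lloyd and best_of_restarts, the seed, and the choice to draw initial centroids from the data points are all assumptions made for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wcss(points, centroids):
    """Within-cluster sum of squares for the nearest-centroid assignment."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def lloyd(points, centroids, threshold=1e-6, max_iter=100):
    """Plain k-means iteration; stops when the largest centroid shift is small."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[i].append(p)
        new = [
            # An empty cluster keeps its previous centroid.
            [sum(col) / len(cl) for col in zip(*cl)] if cl else list(c)
            for cl, c in zip(clusters, centroids)
        ]
        shift = max(math.dist(c, n) for c, n in zip(centroids, new))
        centroids = new
        if shift < threshold:
            break
    return centroids


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means from several random initializations; keep the lowest WCSS."""
    rng = random.Random(seed)
    results = []
    for _ in range(restarts):
        init = [list(p) for p in rng.sample(points, k)]
        final = lloyd(points, init)
        results.append((wcss(points, final), final))
    best = min(results, key=lambda r: r[0])
    return best, results
```

On the four points (0,0), (0,1), (10,0), (10,1) with k = 2, starting from (0,0) and (0,1) converges to centroids (5,0) and (5,1), a poor local optimum, whereas starting from one point on each side separates the left and right pairs. Keeping the run with the lowest WCSS guards against such unlucky starts.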