Large-scale Single-pass k-Means Clustering

Goals
– Cluster very large data sets
– Facilitate large nearest-neighbor search
– Allow a very large number of clusters
– Achieve good quality: low average distance to the nearest centroid on held-out data
– Based on Mahout Math
– Runs on a Hadoop (really MapR) cluster
– FAST: clusters tens of millions of points in minutes

Non-goals
– Use map-reduce (but it is there)
– Minimize the number of clusters
– Support metrics other than L2

Anti-goals
– Multiple passes over the original data
– Scaling as O(kn)

Why? K-nearest Neighbor with Super Fast k-means

What's that?
– Find the k nearest training examples and use the average value of their target variable
– This is easy ... but hard
  – easy because it is conceptually simple: there are no knobs to turn and no models to build
  – hard because of the stunning amount of math
  – also hard because we need the top 50,000 results
– The initial prototype was massively too slow
  – 3K queries x 200K examples took hours
  – we needed 20M x 25M in the same time

How We Did It
– 2-week hackathon with 6 developers from a customer bank
– Agile-ish development
– To avoid IP issues:
  – all code is Apache licensed (no ownership question)
  – all data is synthetic (no question of private data)
  – all development was done on individual machines, with hosting on GitHub
  – open is easier than closed (in this case)
– Goal: new open technology to facilitate new closed solutions
– Ambitious goal of ~1,000,000x speedup
  – well, really only 100-1000x after basic hygiene

What We Did
– Mechanism for extending Mahout Vectors
  – shared-memory matrix
  – FileBasedMatrix uses mmap to share very large dense matrices
– Searcher interface
  – DelegatingVector, WeightedVector, Centroid
  – ProjectionSearch, KmeansSearch, LshSearch, Brute
– Super-fast clustering
  – Kmeans, StreamingKmeans

Projection Search
– java.util.TreeSet!
[Figure: projection search illustrated]

How Many Projections?
[Figure: effect of the number of projections]

K-means Search
– Simple idea
  – pre-cluster the data
  – to find the nearest points, search the nearest clusters
– Recursive application
  – to search a cluster, use a Searcher!
[Figures: k-means search narrowing in on the clusters nearest a query point x]

But This Requires k-means!
– We need a new k-means algorithm to get speed
  – Hadoop is very slow at iterative map-reduce
  – maybe Pregel clones like Giraph would be better
  – or maybe not
– Streaming k-means is
  – one pass (through the original data)
  – very fast (20 μs per data point with threads)
  – very parallelizable

Basic Method
– Use a single pass of k-means with very many clusters
  – the output is a bad-ish clustering but a good surrogate for the data
– Use the weighted centroids from step 1 to do in-memory clustering
  – the output is a good clustering with fewer clusters

Algorithmic Details
  For each data point x_n:
    compute the distance ∂ to the nearest centroid
    sample u; if u > ∂/ß, add x_n to the nearest centroid, else create a new centroid
    if the number of centroids exceeds 10 log n:
      recursively cluster the centroids
      set ß = 1.5 ß if the number of centroids did not decrease
(two implementation sketches follow)
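The loop above compresses into a compact sketch. What follows is a minimal, single-threaded illustration, not Mahout's actual StreamingKmeans: the class and method names are invented, the nearest-centroid lookup is brute force (the real code plugs in a Searcher for exactly the reason the "Warning, Recursive Descent" slide gives below), and the collapse step is simplified to re-streaming the centroids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch of the streaming k-means step; not the Mahout API.
public class StreamingKMeansSketch {
    public static class Centroid {
        public final double[] center;
        public double weight;
        Centroid(double[] p, double w) { center = p.clone(); weight = w; }
        void add(double[] p, double w) {            // weighted running mean
            weight += w;
            for (int i = 0; i < center.length; i++)
                center[i] += (p[i] - center[i]) * w / weight;
        }
    }

    public final List<Centroid> centroids = new ArrayList<>();
    private final Random rand = new Random();
    private double beta = 1;                        // the distance scale ß
    private long n = 0;                             // points seen so far

    public void add(double[] point) { addWeighted(point, 1); }

    public void addWeighted(double[] point, double weight) {
        n++;
        assign(point, weight);
        // the 10 log n growth bound from the slide (offset by one to
        // avoid the degenerate n = 1 case)
        if (centroids.size() > 10 * Math.log(n + 1)) collapse();
    }

    // core step: far points probably found a new centroid, near points merge
    private void assign(double[] point, double weight) {
        Centroid nearest = null;
        double d = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {              // brute force here; a fast
            double dc = distance(c.center, point);  // Searcher accelerates this
            if (dc < d) { d = dc; nearest = c; }
        }
        if (nearest == null || rand.nextDouble() <= d / beta) {
            centroids.add(new Centroid(point, weight));
        } else {
            nearest.add(point, weight);
        }
    }

    // recursively cluster the centroids themselves; whenever their number
    // fails to shrink, grow ß so that new centroids become rarer
    private void collapse() {
        while (centroids.size() > 10 * Math.log(n + 1)) {
            List<Centroid> old = new ArrayList<>(centroids);
            centroids.clear();
            for (Centroid c : old) assign(c.center, c.weight);
            if (centroids.size() >= old.size()) beta *= 1.5;
        }
    }

    private static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
        return Math.sqrt(s);
    }
}
```

Note the role of ß: it is the scale that decides how far is "far enough" to deserve a new centroid, and it only ever grows, so the sketch gets coarser as it absorbs more data.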
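The "Projection Search" slide earlier is worth a sketch too, since a searcher like this is what speeds up the inner loop above. This is a hedged, minimal version with invented names: it keeps a single random projection in a java.util.TreeMap (the slide's TreeSet idea, keyed so we can map back to points, and silently colliding on equal keys), whereas the real ProjectionSearch works on Mahout Vectors and uses several projections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

// Illustrative one-projection search; not the Mahout ProjectionSearch API.
public class ProjectionSearchSketch {
    private final double[] u;                           // random direction
    private final TreeMap<Double, double[]> index = new TreeMap<>();

    public ProjectionSearchSketch(int dim, Random rand) {
        u = new double[dim];
        for (int i = 0; i < dim; i++) u[i] = rand.nextGaussian();
    }

    public void add(double[] p) { index.put(dot(u, p), p); }

    // gather the `probes` points whose projections bracket the query's on
    // each side, then return the exactly-nearest candidate (null if empty)
    public double[] search(double[] q, int probes) {
        double key = dot(u, q);
        List<double[]> candidates = new ArrayList<>();
        int taken = 0;
        for (double[] p : index.tailMap(key, true).values()) {
            candidates.add(p);
            if (++taken >= probes) break;
        }
        taken = 0;
        for (double[] p : index.headMap(key, false).descendingMap().values()) {
            candidates.add(p);
            if (++taken >= probes) break;
        }
        double best = Double.POSITIVE_INFINITY;
        double[] result = null;
        for (double[] p : candidates) {
            double d2 = 0;                              // squared distance is
            for (int i = 0; i < p.length; i++) {        // enough for comparison
                double t = p[i] - q[i]; d2 += t * t;
            }
            if (d2 < best) { best = d2; result = p; }
        }
        return result;
    }

    private static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```

Using several projections and taking the union of their candidates reduces the chance that two far-apart points happen to project close together on every line; that trade-off is the "How Many Projections?" question.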
How It Works
– The result is a large set of centroids
  – these give an approximation of the original distribution
  – we can cluster the centroids to get a close approximation of clustering the original data
  – or we can just use the result directly

Parallel Speedup?
[Figure: parallel speedup — time per point (μs) vs. number of threads (1-16), showing the non-threaded baseline, the threaded version, and a perfect-scaling reference]

Warning, Recursive Descent
– The inner loop requires finding the nearest centroid
– With lots of centroids, this is slow
– But wait, we have classes to accelerate that!
  – (let's not use the k-means searcher, though)
– Empirically, projection search beats 64-bit LSH by a bit

Moving to Scale
– The map-reduce implementation is nearly trivial (a dataflow sketch appears after the closing slide)
– Map: rough-cluster the input data; output ß and the weighted centroids
– Reduce:
  – a single reducer gets all centroids
  – if there are too many centroids, merge them using recursive clustering
  – optionally do the final clustering in memory
– A combiner is possible, but essentially never important

Contact
– tdunning@maprtech.com
– @ted_dunning
Slides and such: http://info.mapr.com/ted-mlconf.html
Hash tags: #mlconf #mahout #mapr
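Appendix: the mapper/reducer split from the "Moving to Scale" slide, shown as a plain-Java dataflow around the StreamingKMeansSketch class sketched earlier. This is an assumption-laden outline, not Mahout's Hadoop driver: TwoPhaseSketch, mapPhase, and reducePhase are invented names, and a real job would ship the centroids (and each mapper's final ß) through Hadoop writables rather than Java lists.

```java
import java.util.List;

// Illustrative two-phase dataflow; not the actual Mahout/Hadoop driver.
public class TwoPhaseSketch {
    // "Map": run the streaming sketch independently over each input shard
    // and keep only its weighted centroids.
    static List<StreamingKMeansSketch.Centroid> mapPhase(double[][] shard) {
        StreamingKMeansSketch sketch = new StreamingKMeansSketch();
        for (double[] p : shard) sketch.add(p);
        return sketch.centroids;
    }

    // "Reduce": a single reducer re-streams every mapper's centroids with
    // their weights intact; the sketch collapses them automatically if there
    // are too many, and the (now small) result can be clustered in memory.
    static List<StreamingKMeansSketch.Centroid> reducePhase(
            List<List<StreamingKMeansSketch.Centroid>> perMapper) {
        StreamingKMeansSketch merged = new StreamingKMeansSketch();
        for (List<StreamingKMeansSketch.Centroid> shard : perMapper)
            for (StreamingKMeansSketch.Centroid c : shard)
                merged.addWeighted(c.center, c.weight);
        return merged.centroids;
    }
}
```

Because the map output is just weighted centroids, the reduce step is the same algorithm run over far less data, which is why the slide calls the map-reduce implementation nearly trivial.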