Learning with Hadoop – A Case Study on MapReduce-Based Data Mining
Evan Xiang, HKUST

Outline
- Hadoop Basics
- Case Study
  - Word Count
  - Pairwise Similarity
  - PageRank
  - K-Means Clustering
  - Matrix Factorization
  - Clustering Coefficient
- Resource Entries to ML Labs
- Advanced Topics
- Q&A

Introduction to Hadoop
Hadoop MapReduce is a Java-based software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Hadoop Cluster Architecture
[Figure: a client talks to the job submission node, which runs the JobTracker, and to the HDFS master, which runs the NameNode; each slave node runs a TaskTracker and a DataNode. From Jimmy Lin's slides.]

Hadoop HDFS
[Figure: HDFS architecture.]

Hadoop Cluster Rack Awareness
[Figure: rack-aware placement of HDFS block replicas.]

Hadoop Development Cycle
1. Scp data to the cluster
2. Move data into HDFS
3. Develop code locally
4. Submit the MapReduce job (and go back to Step 3 as needed)
5. Move data out of HDFS
6. Scp data from the cluster
(From Jimmy Lin's slides)

Divide and Conquer
[Figure: the "work" is partitioned into w1, w2, w3; each partition is handled by a "worker" producing r1, r2, r3; the partial results are combined into the final "result". From Jimmy Lin's slides.]

High-Level MapReduce Pipeline
[Figure: the map, shuffle-and-sort, and reduce stages of the pipeline.]

Detailed Hadoop MapReduce Data Flow
[Figure: per-node data flow through input splits, mappers, partitioners, and reducers.]

Word Count with MapReduce
[Figure: a toy collection — Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat". The mappers emit (term, docid, count) tuples; shuffle and sort aggregates the values by key; the reducers sum the counts per term, e.g. "fish" occurs 2 + 2 = 4 times in total. From Jimmy Lin's slides.]

Calculating Document Pairwise Similarity
- Trivial solution: load each vector O(N) times and each term O(df_t²) times
- Goal: a scalable and efficient solution for large collections
(From Jimmy Lin's slides)

Better Solution
- Each term contributes only if it appears in both documents
- Load the weights for each term once
- Each term contributes O(df_t²) partial scores
(From Jimmy Lin's slides)

Decomposition
[Figure: the similarity formula split into a map part (each term contributes only if it appears in both documents) and a reduce part (load the weights for each term once; each term contributes O(df_t²) partial scores). From Jimmy Lin's slides.]

Standard Indexing
[Figure: (a) map — tokenize each doc; (b) shuffle — group values by term; (c) reduce — combine the postings for each term into a posting list. From Jimmy Lin's slides.]

Inverted Indexing with MapReduce
[Figure: the same toy collection; the mappers emit (term, docid, term frequency) postings, shuffle and sort aggregates them by term, and the reducers combine them into posting lists, e.g. fish → (doc 1, 2), (doc 2, 2). From Jimmy Lin's slides.]

Indexing (3-Doc Toy Collection)
[Figure: indexing a toy collection of three documents over the terms Clinton, Barack, Obama, and Cheney produces one posting list per term. From Jimmy Lin's slides.]

Pairwise Similarity
[Figure: (a) generate a pair for every two documents in a term's posting list, (b) group the pairs, (c) sum the partial scores per document pair. From Jimmy Lin's slides.]
How to deal with the long list?

PageRank
- PageRank: an information propagation model
- Intensive access of the neighborhood list

PageRank with MapReduce
[Figure: the mappers read each node's adjacency list (n1 → [n2, n4], n2 → [n3, n5], n3 → [n4], n4 → [n5], n5 → [n1, n2, n3]) and distribute the node's rank mass to its neighbors; the reducers sum the partial mass arriving at each node and re-emit the node together with its adjacency list. From Jimmy Lin's slides.]
How to maintain the graph structure?
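One way to answer the question above is to have the mapper re-emit each node's adjacency list alongside the rank mass it distributes. The following is a minimal plain-Java sketch of one PageRank iteration's map/reduce logic (not the Hadoop API; the class and method names are illustrative, and damping is omitted for brevity):

```java
import java.util.*;

// Plain-Java sketch of one PageRank MapReduce iteration: "map" distributes a
// node's rank mass evenly to its out-neighbors and keeps the adjacency list
// around (to maintain the graph structure); "reduce" sums the incoming mass
// per node.
public class PageRankStepSim {

    // One iteration without damping: newRank(n) = sum of rank(m)/outdeg(m)
    // over all in-neighbors m of n.
    public static Map<String, Double> step(Map<String, List<String>> graph,
                                           Map<String, Double> rank) {
        Map<String, Double> next = new TreeMap<>();
        for (String n : graph.keySet()) next.put(n, 0.0);  // structure kept
        for (Map.Entry<String, List<String>> e : graph.entrySet()) {
            double share = rank.get(e.getKey()) / e.getValue().size();
            for (String neighbor : e.getValue()) {
                next.merge(neighbor, share, Double::sum);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // The 5-node example graph from the slide.
        Map<String, List<String>> graph = new TreeMap<>();
        graph.put("n1", Arrays.asList("n2", "n4"));
        graph.put("n2", Arrays.asList("n3", "n5"));
        graph.put("n3", Arrays.asList("n4"));
        graph.put("n4", Arrays.asList("n5"));
        graph.put("n5", Arrays.asList("n1", "n2", "n3"));
        Map<String, Double> rank = new TreeMap<>();
        for (String n : graph.keySet()) rank.put(n, 0.2);  // uniform start
        System.out.println(step(graph, rank));
    }
}
```

Note that the total rank mass (1.0) is preserved by the iteration, since every node's mass is split among its out-neighbors.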
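The word-count pipeline shown earlier can likewise be sketched in plain Java. This is a simulation of the map → shuffle-and-sort → reduce logic only, not Hadoop's Mapper/Reducer API; the class name and helpers are illustrative:

```java
import java.util.*;

// Plain-Java sketch of the word-count MapReduce logic (no Hadoop runtime):
// map emits (term, 1) pairs, shuffle groups by term, reduce sums the counts.
public class WordCountSim {

    // "Map" phase: emit (term, 1) for every token in every document.
    static List<Map.Entry<String, Integer>> map(List<String> docs) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    emitted.add(new AbstractMap.SimpleEntry<>(token, 1));
                }
            }
        }
        return emitted;
    }

    // "Shuffle and sort" + "reduce": group values by key, then sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static Map<String, Integer> wordCount(List<String> docs) {
        return reduce(map(docs));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "one fish, two fish", "red fish, blue fish", "cat in the hat");
        System.out.println(wordCount(docs));  // "fish" sums to 2 + 2 = 4
    }
}
```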
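The generate-pairs / group-pairs / sum-pairs decomposition for pairwise similarity can also be sketched in plain Java. This sketch uses raw term frequencies as weights (real systems would use tf-idf) and an unnormalized dot product; the class and method names are illustrative:

```java
import java.util.*;

// Plain-Java sketch of the pairwise-similarity pattern (no Hadoop runtime):
// build an inverted index (term -> postings), then, for each posting list,
// generate a partial score for every document pair and sum the partials.
public class PairwiseSimSim {

    // Inverted indexing: term -> (docId -> term frequency).
    static Map<String, Map<Integer, Integer>> index(List<String> docs) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String t : docs.get(docId).toLowerCase().split("\\W+")) {
                if (t.isEmpty()) continue;
                postings.computeIfAbsent(t, k -> new TreeMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }
        return postings;
    }

    // Each term contributes O(df_t^2) partial scores, summed per doc pair
    // (an unnormalized dot-product similarity over term frequencies).
    public static Map<String, Integer> similarities(List<String> docs) {
        Map<String, Integer> sims = new TreeMap<>();
        for (Map<Integer, Integer> posting : index(docs).values()) {
            List<Integer> ids = new ArrayList<>(posting.keySet());
            for (int a = 0; a < ids.size(); a++) {
                for (int b = a + 1; b < ids.size(); b++) {
                    String pair = ids.get(a) + "," + ids.get(b);
                    int partial = posting.get(ids.get(a)) * posting.get(ids.get(b));
                    sims.merge(pair, partial, Integer::sum);
                }
            }
        }
        return sims;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "one fish, two fish", "red fish, blue fish", "cat in the hat");
        System.out.println(similarities(docs));  // docs 0 and 1 share "fish"
    }
}
```

The inner double loop over each posting list is exactly where the "long list" problem bites: a very frequent term produces a quadratic number of pairs.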
K-Means Clustering
[Figure: iterative refinement of K cluster centroids on a set of samples.]

K-Means Clustering with MapReduce
- Each mapper loads a set of data samples and assigns each sample to its nearest centroid
- Each mapper needs to keep a copy of the centroids
- Each reducer collects the samples assigned to one centroid and recomputes it
How to set the initial centroids is very important! Usually we set the centroids using Canopy Clustering. [McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]

Matrix Factorization for Link Prediction
In this task, we observe a sparse matrix X ∈ R^{m×n} with entries x_ij. Let R = {(i, j, r): r = x_ij, where x_ij ≠ 0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^{k×m} and an item factor matrix V ∈ R^{k×n}. The goal is to approximate the link matrix X by the product of the factor matrices, X ≈ UᵀV, which can be learnt by minimizing the regularized squared error over the observed links:

    min_{U,V} Σ_{(i,j)∈R} (x_ij − u_iᵀ v_j)² + λ(‖U‖²_F + ‖V‖²_F)

Solving Matrix Factorization via Alternating Least Squares
[Figure: given X and V, each user factor u_i is updated by solving a k×k linear system A u_i = b, where A and b are accumulated from the factor vectors of the items that user i has rated. Similarly, given X and U, we can alternately update V.]

MapReduce for ALS
- Stage 1: the mappers group the rating data in X and the features in V by item j; the reducers align the ratings and features for item j, and make a copy of V_j for each observed x_ij
- Stage 2: the mappers group the rating data (now paired with the item features) by user i; the reducers run standard ALS: calculate A and b, and update U_i

Clustering Coefficient
In graph mining, the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph); it is used to determine whether a graph is a small-world network. [D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440–442]
How to maintain the tier-2 (two-hop) neighbors?

Clustering Coefficient with MapReduce
[Figure: a two-stage MapReduce job; the second stage's reducers calculate the clustering coefficient.]
A BFS-based method needs three stages, but actually we only need two!
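The local clustering coefficient itself is easy to state in code. Below is a minimal plain-Java sketch of the per-vertex computation (the quantity the second-stage reducers produce); the class and method names are illustrative, and the graph is a simple in-memory adjacency map rather than a distributed one:

```java
import java.util.*;

// Plain-Java sketch of the local clustering coefficient: for a vertex with
// degree d, count the edges among its neighbors (this is what the two-hop,
// "tier-2" neighborhood information is needed for) and divide by d*(d-1)/2.
public class ClusterCoeffSim {

    public static double localCoefficient(Map<Integer, Set<Integer>> adj, int v) {
        List<Integer> neighbors = new ArrayList<>(adj.get(v));
        int d = neighbors.size();
        if (d < 2) return 0.0;  // coefficient undefined; use 0 by convention
        int links = 0;  // edges among the neighbors of v
        for (int i = 0; i < d; i++) {
            for (int j = i + 1; j < d; j++) {
                if (adj.get(neighbors.get(i)).contains(neighbors.get(j))) links++;
            }
        }
        return 2.0 * links / (d * (d - 1));
    }

    public static void main(String[] args) {
        // Toy undirected graph: a triangle 0-1-2 plus a pendant edge 0-3.
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        adj.put(0, new HashSet<>(Arrays.asList(1, 2, 3)));
        adj.put(1, new HashSet<>(Arrays.asList(0, 2)));
        adj.put(2, new HashSet<>(Arrays.asList(0, 1)));
        adj.put(3, new HashSet<>(Arrays.asList(0)));
        // Vertex 0 has 1 of 3 possible edges among its neighbors.
        System.out.println(localCoefficient(adj, 0));
    }
}
```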
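The K-Means map/reduce pattern described earlier can be sketched in a few lines of plain Java. This simulates one iteration only (assignment in the "map" step, centroid averaging in the "reduce" step), without the Hadoop runtime; the class name is illustrative:

```java
import java.util.*;

// Plain-Java sketch of one K-Means MapReduce iteration: the "map" step
// assigns every sample to its nearest centroid (each mapper holds a copy of
// the centroids), and the "reduce" step averages each cluster's samples to
// produce the new centroids.
public class KMeansStepSim {

    static double dist2(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }

    // One map + reduce pass: returns the recomputed centroids.
    public static double[][] step(double[][] samples, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        // "Map": emit (nearestCentroidId, sample).
        for (double[] s : samples) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (dist2(s, centroids[c]) < dist2(s, centroids[best])) best = c;
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) sums[best][i] += s[i];
        }

        // "Reduce": average the samples assigned to each centroid.
        double[][] updated = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < dim; i++) {
                updated[c][i] = counts[c] == 0 ? centroids[c][i]
                                               : sums[c][i] / counts[c];
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] samples = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] centroids = {{1, 1}, {9, 1}};
        System.out.println(Arrays.deepToString(step(samples, centroids)));
    }
}
```

In a real Hadoop job the new centroids are written to HDFS and broadcast back to every mapper for the next iteration, which is why each mapper must keep a full copy of the centroids.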
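The per-user ALS update that the Stage 2 reducers perform can also be sketched in plain Java. This assumes the standard regularized least-squares update (A = Σ_j v_j v_jᵀ + λI, b = Σ_j x_ij v_j over the items user i rated) and hand-rolls a small Gaussian-elimination solver; the class and method names are illustrative:

```java
import java.util.*;

// Plain-Java sketch of one ALS user update (the Stage 2 reducer logic): given
// the item factors v_j and ratings x_ij for the items user i rated, build
// A = sum_j v_j v_j^T + lambda*I and b = sum_j x_ij * v_j, then solve A u = b.
public class AlsUpdateSim {

    // Solve the k x k system A x = b by Gaussian elimination with pivoting.
    static double[] solve(double[][] A, double[] b) {
        int k = b.length;
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int r = col + 1; r < k; r++) {
                if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
            }
            double[] tmp = A[col]; A[col] = A[pivot]; A[pivot] = tmp;
            double tb = b[col]; b[col] = b[pivot]; b[pivot] = tb;
            for (int r = col + 1; r < k; r++) {
                double f = A[r][col] / A[col][col];
                for (int c = col; c < k; c++) A[r][c] -= f * A[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[k];
        for (int r = k - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < k; c++) s -= A[r][c] * x[c];
            x[r] = s / A[r][r];
        }
        return x;
    }

    // One user update: itemFactors[j] is v_j, ratings[j] is x_ij.
    public static double[] updateUser(double[][] itemFactors, double[] ratings,
                                      double lambda) {
        int k = itemFactors[0].length;
        double[][] A = new double[k][k];
        double[] b = new double[k];
        for (int j = 0; j < itemFactors.length; j++) {
            for (int p = 0; p < k; p++) {
                b[p] += ratings[j] * itemFactors[j][p];
                for (int q = 0; q < k; q++) {
                    A[p][q] += itemFactors[j][p] * itemFactors[j][q];
                }
            }
        }
        for (int p = 0; p < k; p++) A[p][p] += lambda;  // regularization
        return solve(A, b);
    }

    public static void main(String[] args) {
        // Two rated items with k = 2 factors; lambda = 0 gives an exact fit
        // here, so u = (3, 5).
        double[][] V = {{1, 0}, {0, 1}};
        double[] x = {3, 5};
        System.out.println(Arrays.toString(updateUser(V, x, 0.0)));
    }
}
```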
Resource Entries to ML Labs
- Mahout: Apache's scalable machine learning libraries
- Jimmy Lin's Lab: iSchool at the University of Maryland
- Jimeng Sun & Yan Rong's collections: IBM TJ Watson Research Center
- Edward Chang & Yi Wang: Google Beijing

Advanced Topics in Machine Learning with MapReduce
- Probabilistic graphical models
- Gradient-based optimization methods
- Graph mining
- Others…

Some Advanced Tips
- Design your algorithm in a divide-and-conquer manner
- Make your functional units loosely dependent
- Carefully manage your memory and disk storage
- Discussions…

Q&A
- Why not MPI? Hadoop is cheap in everything…D.P.T.H…
- What are the advantages of Hadoop? Scalability!
- How do you guarantee model equivalence? Guarantee equivalent/comparable function logic
- How can you beat a "large memory" solution? Clever use of sequential disk access