Learning with Hadoop – A case study on MapReduce-based Data Mining
Evan Xiang, HKUST
Outline
- Hadoop Basics
- Case Study
  - Word Count
  - Pairwise Similarity
  - PageRank
  - K-Means Clustering
  - Matrix Factorization
  - Cluster Coefficient
- Resource Entries to ML labs
- Advanced Topics
- Q&A
Introduction to Hadoop
- Hadoop Map/Reduce is
  - a Java-based software framework for easily writing applications
  - which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware
  - in a reliable, fault-tolerant manner.
Hadoop Cluster Architecture
[Diagram: a client talks to the job submission node, which runs the JobTracker, and to the HDFS master, which runs the NameNode; each slave node runs a TaskTracker and a DataNode.]
From Jimmy Lin's slides
Hadoop HDFS
[Figure]
Hadoop Cluster Rack Awareness
[Figure]
Hadoop Development Cycle
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
   4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
From Jimmy Lin's slides
Divide and Conquer
[Diagram: the "Work" is partitioned into w1, w2, w3; a "worker" processes each partition to produce r1, r2, r3; the partial results are combined into the final "Result".]
From Jimmy Lin's slides
High-level MapReduce pipeline
[Figure]
Detailed Hadoop MapReduce data flow
[Figure]
Word Count with MapReduce
Doc 1: "one fish, two fish"   Doc 2: "red fish, blue fish"   Doc 3: "cat in the hat"

Map (term -> (doc, count)):
  Doc 1: one -> (1, 1), two -> (1, 1), fish -> (1, 2)
  Doc 2: red -> (2, 1), blue -> (2, 1), fish -> (2, 2)
  Doc 3: cat -> (3, 1), hat -> (3, 1)

Shuffle and Sort: aggregate values by keys

Reduce (term -> total count):
  blue -> 1, cat -> 1, fish -> 4, hat -> 1, one -> 1, red -> 1, two -> 1
From Jimmy Lin's slides
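In code, this is the standard WordCount that ships with Hadoop (new org.apache.hadoop.mapreduce API); only the mapper and reducer are shown, job wiring omitted:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emit (term, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts that shuffle-and-sort grouped under each term.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}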
Calculating document pairwise similarity
- Trivial Solution
  - load each vector O(N) times
  - load each term O(df_t²) times
Goal: a scalable and efficient solution for large collections
From Jimmy Lin's slides
Better Solution
- Each term contributes only if it appears in both documents
- Load weights for each term once
- Each term contributes O(df_t²) partial scores
From Jimmy Lin's slides
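The quantity being computed (the slide's formula, restated; w_{t,d} is the weight of term t in document d):

  sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \cdot w_{t,d_j}

Only terms that appear in both documents contribute, which is exactly the structure an inverted index exposes.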
Decomposition
- Each term contributes only if it appears in both documents
- Load weights for each term once: the map phase computes the partial scores, one per co-occurring document pair
- The reduce phase sums the O(df_t²) partial scores for each document pair
From Jimmy Lin's slides
Standard Indexing
[Diagram: (a) Map: each doc is tokenized; (b) Shuffle: shuffling groups values by terms; (c) Reduce: the postings for each term are combined into a posting list.]
From Jimmy Lin's slides
Inverted Indexing with MapReduce
Doc 1: "one fish, two fish"   Doc 2: "red fish, blue fish"   Doc 3: "cat in the hat"

Map (term -> (doc, tf)):
  Doc 1: one -> (1, 1), two -> (1, 1), fish -> (1, 2)
  Doc 2: red -> (2, 1), blue -> (2, 1), fish -> (2, 2)
  Doc 3: cat -> (3, 1), hat -> (3, 1)

Shuffle and Sort: aggregate values by keys

Reduce (term -> posting list):
  blue -> (2, 1); cat -> (3, 1); fish -> (1, 2), (2, 2); hat -> (3, 1); one -> (1, 1); red -> (2, 1); two -> (1, 1)
From Jimmy Lin's slides
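A sketch of the indexing mapper and reducer, assuming the input is read with KeyValueTextInputFormat (docId key, document body value) and postings are kept as plain "docId:tf" strings; a real implementation would use a compact Writable for postings:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  // Mapper: count term frequencies within one doc, then emit
  // (term, "docId:tf") once per distinct term.
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text docId, Text body, Context context)
        throws IOException, InterruptedException {
      Map<String, Integer> tf = new HashMap<>();
      StringTokenizer itr = new StringTokenizer(body.toString());
      while (itr.hasMoreTokens()) {
        tf.merge(itr.nextToken(), 1, Integer::sum);
      }
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        context.write(new Text(e.getKey()),
                      new Text(docId + ":" + e.getValue()));
      }
    }
  }

  // Reducer: concatenate the postings grouped under each term
  // into that term's posting list.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text term, Iterable<Text> postings, Context context)
        throws IOException, InterruptedException {
      StringBuilder list = new StringBuilder();
      for (Text p : postings) {
        if (list.length() > 0) list.append(", ");
        list.append(p.toString());
      }
      context.write(term, new Text(list.toString()));
    }
  }
}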
Indexing (3-doc toy collection)
[Diagram: a toy collection of three documents over the terms Clinton, Barack, Cheney, and Obama is indexed into per-term posting lists with term frequencies, e.g. Clinton -> (d1, 2), (d2, 1), …]
From Jimmy Lin's slides
Pairwise Similarity
[Diagram: (a) Generate pairs: from each term's posting list, emit a partial score for every pair of documents that share the term; (b) Group pairs: the partial scores are grouped by document pair; (c) Sum pairs: the reducer sums the partial scores into the final similarity.]
How to deal with the long list?
From Jimmy Lin's slides
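A sketch of the pairs phase, assuming the previous job wrote each term's posting list as "docId:weight" entries; partial scores are emitted per document pair and summed in the reducer:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairwiseSimilarity {
  // Mapper: input is one term's posting list, e.g.
  //   "fish \t 1:2.0, 2:2.0"  (docId:weight pairs).
  // Emit a partial score for every pair of docs sharing the term.
  public static class PairMapper extends Mapper<Text, Text, Text, DoubleWritable> {
    @Override
    public void map(Text term, Text postings, Context context)
        throws IOException, InterruptedException {
      String[] entries = postings.toString().split(",\\s*");
      for (int a = 0; a < entries.length; a++) {
        String[] pa = entries[a].split(":");
        for (int b = a + 1; b < entries.length; b++) {
          String[] pb = entries[b].split(":");
          double partial = Double.parseDouble(pa[1]) * Double.parseDouble(pb[1]);
          context.write(new Text(pa[0] + "," + pb[0]), new DoubleWritable(partial));
        }
      }
    }
  }

  // Reducer: sum the partial scores grouped under each document pair.
  public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text pair, Iterable<DoubleWritable> partials, Context context)
        throws IOException, InterruptedException {
      double sim = 0.0;
      for (DoubleWritable p : partials) sim += p.get();
      context.write(pair, new DoubleWritable(sim));
    }
  }
}

One common answer to the long-list question is a df-cut: dropping the highest-df terms bounds the quadratic number of pairs a single posting list can generate, and such very frequent terms contribute little to similarity anyway.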
PageRank
- PageRank – an information propagation model
- Requires intensive access of the neighborhood list
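For reference, one standard formulation of the per-iteration update (not spelled out on the slide; alpha is the random-jump probability, L(m) the out-degree of m, M(n) the set of nodes linking to n):

  PR(n) = \frac{\alpha}{|V|} + (1 - \alpha) \sum_{m \in M(n)} \frac{PR(m)}{L(m)}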
PageRank with MapReduce
Adjacency lists: n1 [n2, n4]; n2 [n3, n5]; n3 [n4]; n4 [n5]; n5 [n1, n2, n3]

Map: each node divides its current PageRank mass over its out-links and emits a contribution to every neighbor, e.g. n1 sends mass to n2 and n4, and n5 sends mass to n1, n2, and n3.

Shuffle: contributions are grouped by destination node.

Reduce: each node sums its incoming contributions to obtain its new PageRank value: n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]

How to maintain the graph structure?
From Jimmy Lin's slides
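A sketch of one PageRank iteration, which also answers the question above: the mapper re-emits each node's adjacency list under its own key, so the reducer can reattach the graph structure to the new rank value. The input format ("rank|adjacency" values via KeyValueTextInputFormat), the hard-coded graph size N, and ignoring dangling-node mass are all simplifying assumptions:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {
  // Input line per node: "n1 \t 0.2|n2,n4" (current rank | adjacency list).
  public static class RankMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text node, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\\|", 2);
      double rank = Double.parseDouble(parts[0]);
      String[] neighbors = parts[1].isEmpty() ? new String[0] : parts[1].split(",");
      // Re-emit the adjacency list so the reducer can rebuild the graph.
      context.write(node, new Text("GRAPH|" + parts[1]));
      // Divide the rank mass evenly over the out-links.
      for (String m : neighbors) {
        context.write(new Text(m), new Text("MASS|" + (rank / neighbors.length)));
      }
    }
  }

  // Reducer: sum incoming mass, apply the random jump, reattach adjacency.
  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    private static final double ALPHA = 0.15; // random-jump probability
    private static final long N = 5;          // number of nodes (toy graph)

    @Override
    public void reduce(Text node, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double mass = 0.0;
      String adjacency = "";
      for (Text v : values) {
        String[] parts = v.toString().split("\\|", 2);
        if (parts[0].equals("GRAPH")) adjacency = parts[1];
        else mass += Double.parseDouble(parts[1]);
      }
      double rank = ALPHA / N + (1 - ALPHA) * mass;
      context.write(node, new Text(rank + "|" + adjacency));
    }
  }
}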
K-Means Clustering
K-Means Clustering with MapReduce
[Diagram: Mapper_i-1, Mapper_i, Mapper_i+1 each hold a partition of the data samples; Reducer_i-1, Reducer_i, Reducer_i+1 each recompute a subset of the centroids 1, 2, 3, 4.]
- Each mapper loads a set of data samples, and assigns each sample to the nearest centroid
- Each mapper needs to keep a copy of the centroids
How to set the initial centroids is very important! Usually we set the centroids using Canopy Clustering. [McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
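A sketch of one k-means iteration, assuming samples are CSV vectors and the current centroids fit in each mapper's memory (hard-coded in setup() here to keep the sketch self-contained; in practice they would be shipped with the job, e.g. via the distributed cache):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {
  // Mapper: assign each sample to the nearest centroid.
  public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context context) {
      // Load the current centroids; hard-coded toy values for this sketch.
      centroids = new double[][] { {0.0, 0.0}, {1.0, 1.0} };
    }

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      double[] x = new double[fields.length];
      for (int d = 0; d < fields.length; d++) x[d] = Double.parseDouble(fields[d]);

      // Find the nearest centroid by squared Euclidean distance.
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double dist = 0.0;
        for (int d = 0; d < x.length; d++) {
          double diff = x[d] - centroids[c][d];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      context.write(new IntWritable(best), value);
    }
  }

  // Reducer: recompute each centroid as the mean of its assigned samples.
  public static class UpdateReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable centroid, Iterable<Text> samples, Context context)
        throws IOException, InterruptedException {
      double[] sum = null;
      long n = 0;
      for (Text s : samples) {
        String[] fields = s.toString().split(",");
        if (sum == null) sum = new double[fields.length];
        for (int d = 0; d < fields.length; d++) sum[d] += Double.parseDouble(fields[d]);
        n++;
      }
      StringBuilder mean = new StringBuilder();
      for (int d = 0; d < sum.length; d++) {
        if (d > 0) mean.append(",");
        mean.append(sum[d] / n);
      }
      context.write(centroid, new Text(mean.toString()));
    }
  }
}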
Matrix Factorization for Link Prediction
- In this task, we observe a sparse matrix X ∈ R^{m×n} with entries x_ij. Let R = {(i,j,r) : r = x_ij, where x_ij ≠ 0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^{k×m} and an item factor matrix V ∈ R^{k×n}. The goal is to approximate the link matrix X by the product of the factor matrices U and V, which can be learned by minimizing the regularized squared loss:

  min_{U,V} Σ_{(i,j)∈R} (x_ij − u_i^T v_j)² + λ(‖U‖_F² + ‖V‖_F²)
Solving Matrix Factorization via Alternating Least Squares
- Given X and V, updating U:
[Diagram: for user i, the observed entries in row i of X (m×n) select columns of V (k×n); these are combined into a k×k matrix A and a k×1 vector b, and u_i is obtained by solving A u_i = b.]
- Similarly, given X and U, we can alternately update V
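In symbols, the quantities the diagram calls A and b (assuming the regularized objective above):

  A = \sum_{j : (i,j) \in R} v_j v_j^{\top} + \lambda I_k, \qquad b = \sum_{j : (i,j) \in R} x_{ij}\, v_j, \qquad u_i = A^{-1} b

Each u_i solve is an independent k×k linear system, which is what makes ALS easy to parallelize with MapReduce.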
MapReduce for ALS
Stage 1
- Mapper_i: group the rating data in X by item j, and group the features in V by item j
- Reducer_i: align the ratings and the features for item j, and make a copy of V_j for each observed x_ij
Stage 2
- Mapper_i: group the rating data in X (each x_ij now carrying its copy of V_j) by user i
- Reducer_i: standard ALS: calculate A and b, and update U_i
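A sketch of the Stage-2 reducer, assuming each incoming value packs one observed rating and its copy of V_j as "x|f1,f2,..."; A and b are accumulated and the small k×k system is solved with a naive Gaussian elimination:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Stage-2 reducer: accumulate A = sum v_j v_j^T + lambda*I and
// b = sum x_ij v_j over the user's observed ratings, then solve A u_i = b.
public class AlsUserUpdateReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
  private static final int K = 3;            // latent dimension (toy setting)
  private static final double LAMBDA = 0.05; // regularization weight

  @Override
  public void reduce(IntWritable user, Iterable<Text> ratingsWithFactors, Context context)
      throws IOException, InterruptedException {
    double[][] A = new double[K][K];
    double[] b = new double[K];
    for (int d = 0; d < K; d++) A[d][d] = LAMBDA;

    for (Text value : ratingsWithFactors) {
      String[] parts = value.toString().split("\\|", 2);
      double x = Double.parseDouble(parts[0]);
      String[] fs = parts[1].split(",");
      double[] v = new double[K];
      for (int d = 0; d < K; d++) v[d] = Double.parseDouble(fs[d]);
      for (int p = 0; p < K; p++) {
        b[p] += x * v[p];
        for (int q = 0; q < K; q++) A[p][q] += v[p] * v[q];
      }
    }

    double[] u = solve(A, b); // naive Gaussian elimination, fine for small K
    StringBuilder out = new StringBuilder();
    for (int d = 0; d < K; d++) {
      if (d > 0) out.append(",");
      out.append(u[d]);
    }
    context.write(user, new Text(out.toString()));
  }

  // Solve A x = b with partial pivoting; A is SPD here, so this is safe.
  private static double[] solve(double[][] A, double[] b) {
    int n = b.length;
    for (int col = 0; col < n; col++) {
      int pivot = col;
      for (int r = col + 1; r < n; r++)
        if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
      double[] tmp = A[col]; A[col] = A[pivot]; A[pivot] = tmp;
      double tb = b[col]; b[col] = b[pivot]; b[pivot] = tb;
      for (int r = col + 1; r < n; r++) {
        double f = A[r][col] / A[col][col];
        for (int c = col; c < n; c++) A[r][c] -= f * A[col][c];
        b[r] -= f * b[col];
      }
    }
    double[] x = new double[n];
    for (int r = n - 1; r >= 0; r--) {
      double s = b[r];
      for (int c = r + 1; c < n; c++) s -= A[r][c] * x[c];
      x[r] = s / A[r][r];
    }
    return x;
  }
}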
Cluster Coefficient
- In graph mining, the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph), and is used to determine whether a graph is a small-world network. [D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440–442]
How to maintain the Tier-2 neighbors?
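Written out, the local clustering coefficient of vertex i (standard Watts–Strogatz definition for an undirected graph, with N_i the neighbor set and k_i = |N_i|):

  C_i = \frac{2\,\bigl|\{(v_j, v_k) : v_j, v_k \in N_i,\; (v_j, v_k) \in E\}\bigr|}{k_i (k_i - 1)}

The numerator counts the edges among i's neighbors, i.e. the closed "Tier-2" pairs, which is why the job only needs to discover which neighbor pairs are actual edges.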
Cluster Coefficient with MapReduce
[Diagram: a two-stage MapReduce job; the final reducer calculates the cluster coefficient.]
A BFS-based method needs three stages, but actually we only need two!
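The slide does not spell the two stages out, so the following is only a plausible sketch of the idea, not the original code: Stage 1 emits every candidate neighbor pair together with edge markers, so a single shuffle discovers which pairs are closed; Stage 2 (summarized in the trailing comment) sums the closed pairs per vertex and divides by k(k-1)/2. All class names and record formats are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ClusterCoefficient {
  // Stage 1 mapper: input "v \t a,b,c" (adjacency list). Emit every
  // neighbor pair keyed canonically, tagged with the center v, and every
  // edge keyed the same way, tagged "$" (edge marker).
  public static class PairMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text vertex, Text adjacency, Context context)
        throws IOException, InterruptedException {
      String v = vertex.toString();
      String[] nbrs = adjacency.toString().split(",");
      for (String n : nbrs) {
        context.write(canon(v, n), new Text("$"));
      }
      for (int a = 0; a < nbrs.length; a++) {
        for (int b = a + 1; b < nbrs.length; b++) {
          context.write(canon(nbrs[a], nbrs[b]), new Text(v));
        }
      }
    }

    private static Text canon(String a, String b) {
      return new Text(a.compareTo(b) < 0 ? a + "," + b : b + "," + a);
    }
  }

  // Stage 1 reducer: if the pair is an actual edge ("$" seen), every tagged
  // center v has one closed neighbor pair; emit (v, 1) for Stage 2 to sum.
  public static class TriangleReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text pair, Iterable<Text> tags, Context context)
        throws IOException, InterruptedException {
      boolean isEdge = false;
      List<String> centers = new ArrayList<>();
      for (Text t : tags) {
        if (t.toString().equals("$")) isEdge = true;
        else centers.add(t.toString());
      }
      if (isEdge) {
        for (String v : centers) context.write(new Text(v), new Text("1"));
      }
    }
  }
  // Stage 2 then sums the closed pairs per vertex and divides by
  // k(k-1)/2 to obtain the local clustering coefficient.
}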
Resource Entries to ML labs
- Mahout: Apache's scalable machine learning libraries
- Jimmy Lin's Lab: iSchool at the University of Maryland
- Jimeng Sun & Yan Rong's Collections: IBM TJ Watson Research Center
- Edward Chang & Yi Wang: Google Beijing
Advanced Topics in Machine Learning with MapReduce
- Probabilistic graphical models
- Gradient-based optimization methods
- Graph mining
- Others…
Some Advanced Tips
- Design your algorithm in a divide-and-conquer manner
- Make your functional units loosely dependent
- Carefully manage your memory and disk storage
- Discussions…
Q&A
- Why not MPI?
  - Hadoop is cheap in everything… D.P.T.H…
- What are the advantages of Hadoop?
  - Scalability!
- How do you guarantee the model equivalence?
  - Guarantee equivalent/comparable function logics
- How can you beat a "large memory" solution?
  - Clever use of sequential disk access