Scalable Regression Tree Learning
on Hadoop using OpenPlanet
Wei Yin
Contributions
• We implement OpenPlanet, an open-source
implementation of the PLANET regression tree algorithm
using the Hadoop MapReduce framework.
• We tune and analyze the impact of two parameters, the HDFS
block size and the threshold value between the ExpandNode and
InMemoryWeka tasks of OpenPlanet, to improve the default
performance.
Motivation for large-scale Machine
Learning
• Models operate on large data sets
• Large number of forecasting models
• New data arrives constantly, requiring real-time
training
Regression Tree
A regression tree maps
features → target variable (prediction)
The learned model uses a binary tree structure
• Intuitive for domain users
to understand
• Shows the effect of each feature
• Each non-leaf node is a binary classifier
with a decision condition
• A decision on one numeric or categorical feature
sends a data point left or right in the tree
• Leaf nodes contain a regression function
or a single prediction value (see the sketch below)
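To make the structure concrete, here is a minimal, hypothetical Java sketch (illustrative names, not OpenPlanet's actual classes) of how a data point is routed left or right at each decision node until a leaf prediction is reached:

// Minimal sketch of a regression-tree node; fields and names are illustrative.
class TreeNode {
    int featureIndex;        // which feature the decision condition tests
    double splitValue;       // numeric threshold, e.g. "F1 < 27"
    TreeNode left, right;    // children; both null at a leaf
    double prediction;       // prediction value stored at a leaf

    boolean isLeaf() { return left == null && right == null; }

    // Route a data point down the tree until a leaf is reached.
    double predict(double[] features) {
        if (isLeaf()) return prediction;
        return features[featureIndex] < splitValue
                ? left.predict(features)
                : right.predict(features);
    }
}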
Google’s PLANET Algorithm
[Diagram: a Master node coordinating multiple worker nodes]
• Uses distributed worker nodes coordinated by a
master node to build the regression tree
OpenPlanet
• Give an introduction about OpenPlanet
• Introduce the differences between OpenPlanet and PLANET
• Give specific re-implementation details
[Diagram: OpenPlanet architecture: the Controller coordinates the Model File and the InitHistogram, ExpandNode and InMemoryWeka tasks; a threshold value (60,000) decides whether a node goes to ExpandNode or InMemoryWeka]
Controller
Controller {
    /* Read user-defined parameters, such as input file path, test data file,
       model output file, etc. */
    ReadParameters(arguments[]);

    /* Initialize 3 job sets (ExpandSet, InMemWekaSet, CompletionSet), each of
       which contains the nodes that need the corresponding processing */
    JobSetsInit(ExpandSet, InMemWekaSet, CompletionSet);

    /* Initialize a Model File instance containing a regression tree with a
       root node only */
    InitModelFile(modelfile);

    do {
        /* Populate each set from the current model file */
        populateSets(modelfile, ExpandSet, InMemWekaSet, CompletionSet);

        if (ExpandSet is not empty) {
            processing_nodes <- all nodes in ExpandSet;
            TaskRunner(InitHistogram, processing_nodes);
            CandidatePoints <- collect reducers' results;
            TaskRunner(ExpandNodes, processing_nodes);
            globalOptimalSplitPoint <- collect reducers' results;
        }
        if (InMemWekaSet is not empty) {
            processing_nodes <- all nodes in InMemWekaSet;
            TaskRunner(InMemWeka, processing_nodes);
        }
        UpdateTreeModel(results);

    /* Continue the loop while any set is still non-empty */
    } while (ExpandSet, InMemWekaSet or CompletionSet is not empty);

    Output(modelfile);
}
[Flowchart of the Controller loop: Start → Initialization → Populate Queues → while the queues are not empty: if MRExpandQueue is not empty, issue the MRInitial task and then the MRExpandNode task; if MRInMemQueue is not empty, issue the MR-InMemGrow task; then Update Model & Populate Queue and repeat; when all queues are empty → End]
ModelFile
ModelFile is an object that holds the regression model and supports the relevant functions, such as
adding a node, checking node status, etc.
[Diagram: a Model File instance: a regression tree model (root splitting on F1 < 27, with further splits such as F2 < 43, F3 ∈ {M,W,F}, F1 < 90, F4 < 16), leaves holding either a Weka model or a single prediction value (e.g. 95), plus functions such as UpdateFunction() and CurrentLeaveNode()]
Advantages:
• More convenient for updating the model and predicting the target value, compared to parsing an XML file.
• Loading and writing the model file is just serializing and de-serializing a Java object.
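A minimal, hypothetical Java sketch of that second point (assumed class names, not OpenPlanet's actual code): because the model file is a Serializable Java object, loading and writing it are plain object (de)serialization calls.

import java.io.*;

/* Illustrative only: a Serializable model file holding the regression tree,
   so that load/write are plain Java object (de)serialization. */
class ModelFile implements Serializable {

    // Same idea as the tree node sketched earlier, made Serializable here.
    static class TreeNode implements Serializable {
        int featureIndex;
        double splitValue;
        TreeNode left, right;
        double prediction;
    }

    TreeNode root = new TreeNode();

    // Writing the model file == serializing the Java object.
    void save(String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(this);
        }
    }

    // Loading the model file == de-serializing the Java object.
    static ModelFile load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (ModelFile) in.readObject();
        }
    }
}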
InitHistogram
• A pre-processing step that finds potential candidate split points for ExpandNodes
• Numerical features: find a small set of candidate points from the huge data set, at the expense
of a small loss of accuracy (e.g. feat1, feat2)
• Categorical features: all distinct values (e.g. feat3)
[MapReduce diagram for the node being expanded (e.g. node 3): input blocks go to Map tasks, which filter for data points belonging to node 3 and route them as key-value pairs (featureID, value) to reducers; one reducer handles Feat1 (numeric), another handles Feat2 (numeric) and Feat3 (categorical). For a numeric feature such as Feature 1 = {10,2,1,8,3,6,9,4,6,5,7}, sampling keeps the boundaries of an equal-depth histogram, computed with the Colt high-performance Java library (e.g. f1: 1,3,5,7,9; f2: 30,40,50,60,70); for a categorical feature all values are kept (e.g. f3: 1,2,3,4,5,6,7, the day-of-week categories)]
The ExpandNodes task then only needs to evaluate the points in this candidate set, without consulting
any other resources.
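As a rough illustration of the numeric-feature sampling, here is a hypothetical plain-Java sketch of computing equal-depth histogram boundaries as candidate split points. OpenPlanet uses the Colt library for this step; the sort-based version below is an assumption for illustration, not its actual code.

import java.util.Arrays;

public class EqualDepthBoundaries {

    // Boundaries of an equal-depth histogram: each bucket holds ~equal counts.
    static double[] boundaries(double[] values, int numBuckets) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] bounds = new double[numBuckets - 1];
        for (int i = 1; i < numBuckets; i++) {
            int idx = (int) ((long) i * sorted.length / numBuckets);
            bounds[i - 1] = sorted[idx];
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Feature 1 example from the slide above
        double[] feature1 = {10, 2, 1, 8, 3, 6, 9, 4, 6, 5, 7};
        System.out.println(Arrays.toString(boundaries(feature1, 5)));
    }
}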
ExpandNode
[MapReduce diagram for expanding node 3: Map tasks read the input blocks, filter for the data points belonging to node 3, and route them, together with the candidate points, to the reducers. Each reducer evaluates the candidate split points with the criterion

    min { |D_left| × Var(D_left) + |D_right| × Var(D_right) },  where D_right = D_total − D_left,

and emits its local optimal split point (e.g. sp1 with value 23, sp2 with value 26). The Controller collects these and picks the global optimal split point (e.g. sp1 = 23 in feature 2), then updates the expanding node, so node 3 becomes the decision "f2 < 23"]
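To make the criterion concrete, here is a small hypothetical Java sketch (assumed helper names, not OpenPlanet's reducer code) that scores a single candidate split point; the candidate with the lowest score is the local optimum a reducer reports.

public class SplitScore {

    // |D_left|*Var(D_left) + |D_right|*Var(D_right) for the split "feature < splitPoint"
    static double score(double[] featureValues, double[] targets, double splitPoint) {
        double sumL = 0, sumSqL = 0, sumR = 0, sumSqR = 0;
        long nL = 0, nR = 0;
        for (int i = 0; i < targets.length; i++) {
            double y = targets[i];
            if (featureValues[i] < splitPoint) { nL++; sumL += y; sumSqL += y * y; }
            else                               { nR++; sumR += y; sumSqR += y * y; }
        }
        return weightedVar(nL, sumL, sumSqL) + weightedVar(nR, sumR, sumSqR);
    }

    // |D| * Var(D) = sum(y^2) - (sum(y))^2 / |D|
    static double weightedVar(long n, double sum, double sumSq) {
        return n == 0 ? 0 : sumSq - (sum * sum) / n;
    }

    public static void main(String[] args) {
        double[] f2 = {10, 25, 30, 21, 40};       // hypothetical feature-2 values for node 3
        double[] y  = {1.0, 2.5, 3.0, 1.2, 4.1};  // hypothetical target values
        System.out.println(score(f2, y, 23));     // score for candidate split "f2 < 23"
    }
}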
MRInMemWeka
[MapReduce diagram for nodes 4 and 5: Map tasks read the input blocks, filter for the data points belonging to those nodes, and route them as (NodeID, Data Point) pairs. The reducer for each node then (1) collects the data points for that node and (2) calls Weka REPTree (or any other model, e.g. M5P) to build an in-memory model. Each reducer reports the location of its Weka model (for node 4 and node 5) back to the Controller, which updates the corresponding tree nodes]
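A hypothetical sketch of what one MRInMemWeka reducer does for a single node, using the Weka 3.7+ API (class names, attribute names and the output path are assumptions for illustration):

import java.util.ArrayList;
import weka.classifiers.trees.REPTree;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.SerializationHelper;

public class InMemWekaSketch {
    public static void main(String[] args) throws Exception {
        // 1. Collect the data points routed to this node into a Weka dataset.
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("f1"));
        attrs.add(new Attribute("f2"));
        attrs.add(new Attribute("target"));
        Instances data = new Instances("node4", attrs, 0);
        data.setClassIndex(data.numAttributes() - 1);     // numeric target => regression

        data.add(new DenseInstance(1.0, new double[]{27.0, 43.0, 95.0}));
        data.add(new DenseInstance(1.0, new double[]{90.0, 16.0, 80.0}));

        // 2. Call Weka REPTree (or any other model, e.g. M5P) to build the model.
        REPTree model = new REPTree();
        model.buildClassifier(data);

        // Persist the model; in OpenPlanet its location is reported back to the Controller.
        SerializationHelper.write("node4.model", model);
    }
}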
Distinctions between OpenPlanet and PLANET:
(1) Sampling MapReduce method: InitHistogram
(2) Broadcast (BC) function

BC_Key.Set(BC);
for (i : num_reducers) {
    BC_Key.partitionID = i;
    send(BC_Key, BC_info);
}

Partitioning:
ID of partition = key.hashCode() % numReduceTasks
(3) Hybrid model
[Diagram: the resulting tree is a hybrid, with decision nodes near the root and in-memory Weka models attached at the leaves]
Key.hashCode() {
    if (key_type == BC)
        return partitionID;
    else
        return key.hashCode;
}
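The same broadcast idea, shown as a hypothetical Hadoop Partitioner sketch (class name and key format are assumptions, not OpenPlanet's actual code): ordinary keys are hash-partitioned as usual, while a broadcast key carries an explicit partition ID, so a mapper can reach every reducer by emitting one broadcast key per partition.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class BroadcastPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        if (k.startsWith("BC:")) {
            // Broadcast key of the form "BC:<partitionID>": route to that reducer.
            int partitionId = Integer.parseInt(k.substring(3));
            return partitionId % numPartitions;
        }
        // Default hash partitioning for normal keys, e.g. (featureID, value).
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A mapper would then emit the broadcast payload once with each of the keys "BC:0" through "BC:<num_reducers-1>", matching the loop above.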
Performance Analysis and Tuning Method
[Charts: single-machine baseline training times for Weka, Matlab and OpenPlanet; parallel performance for OpenPlanet with default settings]
Questions:
1. On the 17-million-row data set, there is very little difference between the 2x8 case and the 8x8 case.
2. There is not much performance improvement, especially compared to the Weka baseline:
   * only a 1.58x speed-up for the 17M data set
   * no speed-up for small data sets (where Weka has no memory overhead)
Question 1: Why is performance similar between the 2x8 case and the 8x8 case?
[Charts: OpenPlanet time per stage (MRInitTotal, MRExTotal, MRInMemWeka) vs. iteration number for 2x8 and 8x8 cores with default settings, and mapper/reducer slot utilization (%) over running time (sec) for the same two configurations (machines M1 to M8)]
Answer for question 1:
1. In HDFS, the basic unit is the block (default 64 MB).
2. Each Map instance processes only one block at a time.
3. Therefore, with N blocks, only N Map instances can run in parallel.
For our problem:
1. Size of training data: 17 million tuples = 842 MB
2. Default block size = 64 MB
3. Number of blocks = 842 / 64 ≈ 13
4. For the 2x8 case, 13 Maps can run in parallel: utilization = 13/16 = 81%
5. For the 8x8 case, 13 Maps can run in parallel: utilization = 13/64 = 20%
6. In both cases only 13 Maps run in parallel, which explains the similar performance.
Solution: tune the block size so that the number of blocks ≈ the number of computing cores.
What if the number of blocks >> the number of computing cores?
That does not necessarily improve performance, because of the network bandwidth limitation.
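One way to apply this tuning is shown in the hypothetical Java sketch below (the file paths are illustrative, and the exact property name depends on the Hadoop version: older releases use dfs.block.size, newer ones dfs.blocksize): upload the training file to HDFS with a smaller block size so that the number of blocks roughly matches the number of map slots.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 16L * 1024 * 1024);   // 16 MB blocks for new files

        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("train_17M.csv");                 // local training data (illustrative)
        Path dst = new Path("/openplanet/train_17M.csv");     // HDFS destination (illustrative)
        fs.copyFromLocalFile(src, dst);                        // written with the 16 MB block size
    }
}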
Improvement:
1. We tune the block size to 16 MB.
2. This gives 842 / 16 ≈ 52 blocks.
3. Total running time = 4,300,457 sec, versus 5,712,154 sec for the original
version: a speed-up of 1.33x.
Optimized block size per data size and cluster configuration:

Nodes | 3.5M tuples / 170 MB | 17M tuples / 840 MB | 35M tuples / 1.7 GB
2x8   | 20 MB                | 80 MB               | 128 MB
4x8   | 8 MB                 | 32 MB               | 64 MB
8x8   | 4 MB                 | 16 MB               | 32 MB
[Charts: OpenPlanet time per stage using 8x8 cores with the optimized block size (MRInitTotal, MRExTotal, MRInMemWeka vs. iteration number); mapper slot utilization (%) for 8x8 cores with the optimized block size (machines M1 to M8); and total running time (sec) vs. training data size (3.5, 17.5 and 35.1 million rows) for 2x8, 4x8 and 8x8 cores with default, optimized and 16 MB block sizes]
Question 2: Why is there not much improvement, especially compared to the Weka baseline?
1. Weka works better when there is no memory overhead.
2. As observed from the chart below, the time spent in the two task types is unbalanced:
[Chart: OpenPlanet time per stage using 8x8 cores (optimized block size) vs. iteration number; the area under one stage's curve is small while the area under the other is large]
What if we balance those two areas while still avoiding memory overhead for Weka?
Solution:
Increase the threshold value that switches a node from the ExpandNodes task to InMemWeka.
By experiment, when the JVM for a reducer instance is 1 GB, the maximum threshold value is 2,000,000.
Performance Improvement
[Charts: OpenPlanet time per stage using 8x8 cores, with the optimized block size only vs. with the optimized block size & 2M threshold value; stage times (MRInitTotal, MRExTotal, MRInMemWeka) vs. iteration number]

1. Total running time = 1,835,430 sec vs 4,300,457 sec
2. The two areas are balanced
3. The number of iterations decreased
4. Speed-up = 4,300,457 / 1,835,430 = 2.34x
Average total speed-up on the 17M data set using 8x8 cores:
• vs. Weka: 4.93x
• vs. Matlab: 14.3x

Average accuracy (CV-RMSE):
Weka       10.09 %
Matlab     10.46 %
OpenPlanet 10.35 %
Summary:
• OpenPlanet is an open-source implementation of the PLANET regression
tree algorithm using the Hadoop MapReduce framework.
• We tune and analyze the impact of parameters such as the HDFS block size
and the threshold for the in-memory handoff to improve on OpenPlanet's
default performance.
Future work:
(1) Parallel execution of MRExpand and MRInMemWeka within each iteration
(2) Issuing multiple OpenPlanet instances for different usages, which increases slot utilization
(3) Optimal block size selection
(4) Real-time model training methods
(5) Moving to a Cloud platform and analyzing its performance