Scalable Regression Tree Learning on Hadoop using OpenPlanet
Wei Yin

Contributions
• We implement OpenPlanet, an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework.
• We tune and analyze the impact of two parameters, the HDFS block size and the hand-off threshold between the ExpandNode and InMemoryWeka tasks, to improve OpenPlanet's default performance.

Motivation for large-scale Machine Learning
• Models operate on large data sets
• Large numbers of forecasting models must be trained
• New data arrives constantly, creating a real-time training requirement

Regression Tree
A regression tree maps input features to a target variable (the prediction) using a binary tree structure.
• Intuitive for domain users to understand
• Shows the effect of each feature on the prediction
• Each non-leaf node is a binary decision on one numeric or categorical feature, routing a tuple to the left or right subtree
• Leaf nodes contain a regression function or a single prediction value

Google's PLANET Algorithm
• Uses distributed worker nodes coordinated by a master node to build the regression tree

OpenPlanet
• Introduce OpenPlanet
• Highlight the differences between OpenPlanet and PLANET
• Give specific re-implementation details

[Architecture: a Controller coordinates three MapReduce task types, InitHistogram, ExpandNode and InMemoryWeka, around a shared Model File; a threshold (default 60,000) decides when a node is handed from ExpandNode to InMemoryWeka.]

Controller

Controller {
    /* Read user-defined parameters, such as input file path, test data file, model output file, etc. */
    ReadParameters(arguments[]);

    /* Initialize three job sets -- ExpandSet, InMemWekaSet, CompletionSet -- each holding the nodes that need the corresponding processing */
    JobSetsInit(ExpandSet, InMemWekaSet, CompletionSet);

    /* Initialize a Model File instance containing a regression tree with a root node only */
    InitModelFile(modelfile);

    do {
        /* Populate each set from the current model file */
        populateSets(modelfile, ExpandSet, InMemWekaSet, CompletionSet);

        if (ExpandSet is not empty) {
            processing_nodes <- all nodes in ExpandSet;
            TaskRunner(InitHistogram, processing_nodes);
            CandidatePoints <- collect reducers' results();
            TaskRunner(ExpandNodes, processing_nodes);
            globalOptimalSplitPoints <- collect reducers' results();
        }
        if (InMemWekaSet is not empty) {
            processing_nodes <- all nodes in InMemWekaSet;
            TaskRunner(InMemWeka, processing_nodes);
        }
        UpdateTreeModel(results);
    } while (ExpandSet, InMemWekaSet or CompletionSet is not empty);   /* loop while any set is non-empty */

    Output(modelfile);
}

[Flowchart: Start → initialization → populate queues → while the queues are not empty: if MRExpandQueue is non-empty, issue the MRInitial and MRExpandNode tasks; if MRInMemQueue is non-empty, issue the MR-InMemGrow task; update the model and repopulate the queues → End.]

ModelFile
An object that holds the regression tree model and supports the relevant operations, such as adding a node or checking a node's status (e.g. Update(), CurrentLeafNodes()).

[Figure: example ModelFile instance, a regression tree with internal splits such as F1 < 27, F2 < 43, F3 ∈ {M,W,F}, F1 < 90 and F4 < 16, and leaves holding either a Weka model or a single prediction value (e.g. 95).]

Advantages:
• More convenient for updating the model and predicting target values than parsing an XML file.
• Loading and writing the model file is simply serializing and de-serializing a Java object.

InitHistogram
• A pre-processing step that finds the candidate split points used by ExpandNodes.
• Numerical features: find a small set of candidate points from the huge data set at the expense of a small loss in accuracy (e.g. feat1, feat2).
• Categorical features: every distinct value is a candidate (e.g. feat3).

MapReduce structure, for an input node (or subset) such as node 3:
• Filtering: mappers keep only the data points that belong to the node being expanded (e.g. node 3).
• Routing: mappers emit key-value pairs (featureID, value); each reducer handles a subset of the features (e.g. one reducer handles Feat1, another Feat2 and Feat3).
• Sampling: for a numerical feature, the reducer returns the boundaries of an equal-depth histogram as candidates, e.g. Feature 1 = {10,2,1,8,3,6,9,4,6,5,7} → {1,3,5,7,9}, or f2 → {30,40,50,60,70}. The histograms are computed with Colt, a high-performance Java library.
• For a categorical feature, all values are returned, e.g. f3 → {Monday ... Friday} (equivalently {1,...,7}).
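The equal-depth sampling for numerical features can be illustrated in a few lines of plain Java. This is only a sketch of the idea, not OpenPlanet's actual code (which uses Colt); the class and method names are hypothetical, and the exact boundaries returned depend on the bucketing convention (the slide's example lists {1, 3, 5, 7, 9} for Feature 1):

import java.util.Arrays;

// Sketch of the equal-depth (equal-frequency) histogram idea used by
// InitHistogram for numerical features: sort the observed values and take
// the boundaries of equal-count buckets as candidate split points.
public class EqualDepthCandidates {

    /** Returns numCandidates bucket-boundary values from the sorted data. */
    static double[] candidateSplits(double[] values, int numCandidates) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] candidates = new double[numCandidates];
        for (int i = 0; i < numCandidates; i++) {
            // Each candidate is the value at the i-th equal-frequency boundary.
            int idx = (int) Math.floor((double) i * sorted.length / numCandidates);
            candidates[i] = sorted[idx];
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Feature 1 sample from the slides.
        double[] feature1 = {10, 2, 1, 8, 3, 6, 9, 4, 6, 5, 7};
        System.out.println(Arrays.toString(candidateSplits(feature1, 5)));
    }
}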
ExpandNodes
The ExpandNodes task only needs to evaluate the points in the candidate set; it does not consult any other resource.

For an input node (or subset) such as node 3, the task picks the split that minimizes

    |D_left| × Var(D_left) + |D_right| × Var(D_right),   where D_right = D_total − D_left,

over all candidate split points.
• Filtering and routing: as in InitHistogram, mappers keep only the data points belonging to the node and route them by feature.
• Each reducer evaluates its candidate points and emits a local optimal split point (e.g. sp1 with value 23, sp2 with value 26).
• The Controller collects the local optima, chooses the global optimal split point (e.g. sp1 = 23 on feature 2), and updates the expanding node: node 3 becomes f2 < 23.
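As a sketch of the objective above, the following plain-Java evaluator scans one numerical feature's candidate points and returns the split with the lowest weighted variance. It is illustrative only; the record layout and class names (SplitEvaluator, Point) are assumptions, not OpenPlanet's reducer code:

import java.util.ArrayList;
import java.util.List;

// Evaluates candidate split points for one numerical feature by minimizing
//   |D_left| * Var(D_left) + |D_right| * Var(D_right).
public class SplitEvaluator {

    static class Point {
        double feature;   // value of the feature being split on
        double target;    // regression target
        Point(double f, double t) { feature = f; target = t; }
    }

    /** |D| * Var(D) for a population variance, i.e. the sum of squared deviations. */
    static double weightedVariance(List<Double> targets) {
        if (targets.isEmpty()) return 0.0;
        double mean = 0.0;
        for (double t : targets) mean += t;
        mean /= targets.size();
        double ss = 0.0;
        for (double t : targets) ss += (t - mean) * (t - mean);
        return ss;
    }

    /** Returns the candidate split value with the lowest impurity. */
    static double bestSplit(List<Point> data, double[] candidates) {
        double bestValue = candidates[0];
        double bestScore = Double.MAX_VALUE;
        for (double c : candidates) {
            List<Double> left = new ArrayList<Double>();
            List<Double> right = new ArrayList<Double>();
            for (Point p : data) {
                (p.feature < c ? left : right).add(p.target);
            }
            double score = weightedVariance(left) + weightedVariance(right);
            if (score < bestScore) {
                bestScore = score;
                bestValue = c;
            }
        }
        return bestValue;
    }

    public static void main(String[] args) {
        // Toy data: low feature values have targets near 95, high values near 41.
        List<Point> data = new ArrayList<Point>();
        double[][] sample = {{10, 95}, {20, 96}, {25, 40}, {30, 42}};
        for (double[] r : sample) data.add(new Point(r[0], r[1]));
        // Candidates as produced by InitHistogram for this feature; prints 23.0.
        System.out.println("best split at feature < " + bestSplit(data, new double[]{20, 23, 25}));
    }
}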
MRInMemWeka
For input nodes (or subsets) whose data fits in memory, e.g. nodes 4 and 5:
• Filtering and routing: mappers keep only the relevant data points and emit (nodeID, data point) pairs.
• Each reducer (1) collects the data points for one node (e.g. node 4) and (2) calls Weka's REPTree (or any other model, such as M5P) to build a model for that node.
• The Controller updates the corresponding tree nodes with the locations of the Weka models.

Distinctions between OpenPlanet and PLANET:
(1) The sampling MapReduce method, InitHistogram.
(2) A Broadcast (BC) function:

    BC_Key.set(BC);
    for (i = 0; i < numReducers; i++) {
        BC_Key.partitionID = i;
        send(BC_Key, BC_info);
    }

    /* Default partitioning: partition ID = key.hashCode() % numReduceTasks */
    Key.hashCode() {
        if (key_type == BC) return partitionID;
        else return the ordinary hash code;
    }

Because a broadcast key's hash code is its partitionID, sending one copy per partition delivers the broadcast information to every reducer.
(3) A hybrid model: the upper levels of the tree are built by ExpandNode splits, while the leaf subtrees are Weka models trained in memory.

Performance Analysis and Tuning
• Baseline for Weka, MATLAB and OpenPlanet on a single machine.
• Parallel performance of OpenPlanet with default settings.

Questions:
1. On the 17-million-tuple data set there is very little difference between the 2x8-core and the 8x8-core case.
2. The overall improvement is modest, especially compared with the Weka baseline: only a 1.58x speed-up on the 17M data set, and no speed-up on small data sets (where Weka has no memory overhead).

Question 1: why is the performance of the 2x8 and 8x8 cases so similar?

[Figures: OpenPlanet time per stage (MRInitTotal, MRExTotal, MRInMemWeka) per iteration, and mapper/reducer slot utilization per machine, for 2x8 and 8x8 cores with default settings.]

Answer to Question 1:
1. In HDFS, the basic unit is the block (64 MB by default).
2. Each map instance processes one block at a time.
3. Therefore, with N blocks, at most N map instances can run in parallel.

For our problem:
1. Size of the training data: 17 million tuples = 842 MB.
2. Default block size = 64 MB.
3. Number of blocks ≈ 842 / 64 ≈ 13.
4. In the 2x8 case, 13 maps run in parallel: utilization = 13/16 = 81%.
5. In the 8x8 case, 13 maps still run in parallel: utilization = 13/64 = 20%.
6. Both cases run only 13 maps in parallel, which explains the similar performance.

Solution: tune the block size so that the number of blocks ≈ the number of computing cores.
What if the number of blocks >> the number of cores? Performance does not necessarily improve, because network bandwidth becomes the limitation.

Improvement:
1. We set the block size to 16 MB.
2. This gives 842 / 16 ≈ 52 blocks.
3. Total running time = 4,300,457 sec, versus 5,712,154 sec for the original version: a 1.33x speed-up.

Optimized block size per cluster and data size:

Nodes   3.5M tuples / 170MB   17M tuples / 840MB   35M tuples / 1.7GB
2x8     20MB                  80MB                 128MB
4x8     8MB                   32MB                 64MB
8x8     4MB                   16MB                 32MB

[Figures: OpenPlanet time per stage and mapper slot utilization for 8x8 cores with the optimized block size; total running time vs. training data size (3.5, 17.5 and 35.1 million rows) for default, optimized and 16MB block sizes on 2x8, 4x8 and 8x8 cores.]
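The block-size tuning above amounts to choosing a per-file HDFS block size when the training data is loaded. Below is a minimal sketch of one way to do this with the standard Hadoop FileSystem API; the file names are hypothetical, the 16 MB value matches the 8x8 / 17M-tuple entry in the table, and this is not necessarily how the OpenPlanet experiments loaded their data:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Loads a training file into HDFS with an explicit per-file block size,
// so that (number of blocks) is close to (number of map slots).
public class LoadWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 16L * 1024 * 1024;   // 16 MB
        int bufferSize = 4096;
        short replication = 3;                // assumed cluster default

        InputStream in = new FileInputStream("train_17M.csv");   // hypothetical local file
        FSDataOutputStream out = fs.create(new Path("/openplanet/train_17M.csv"),
                true, bufferSize, replication, blockSize);

        // Copy and close both streams; an ~842 MB file ends up in ~52 blocks,
        // so up to 52 mappers can process it in parallel.
        IOUtils.copyBytes(in, out, bufferSize, true);
    }
}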
Question 2: why is the improvement over Weka so small?
1. Weka works well when there is no memory overhead.
2. In the per-stage chart for 8x8 cores with the optimized block size, the ExpandNode stages dominate each iteration, while the InMemWeka stage occupies only a small area.

What if we balance those two areas while still avoiding memory overhead in Weka?
Solution: increase the threshold value at which nodes are handed from the ExpandNodes task to InMemWeka. Experiments show that with a 1 GB JVM per reducer instance, the maximum threshold value is 2,000,000.

Performance Improvement

[Figures: OpenPlanet time per stage for 8x8 cores with the optimized block size only vs. with the optimized block size and the 2M threshold; the tuned run finishes in roughly 8 iterations instead of 21.]

1. Total running time = 1,835,430 sec vs. 4,300,457 sec.
2. The two areas are balanced.
3. The number of iterations decreases.
4. Speed-up = 4,300,457 / 1,835,430 = 2.34.

Average total speed-up on the 17M data set using 8x8 cores:
• over Weka: 4.93x
• over MATLAB: 14.3x

Average accuracy (CV-RMSE): Weka 10.09%, MATLAB 10.46%, OpenPlanet 10.35%.

Summary:
• OpenPlanet is an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework.
• We tune and analyze the impact of parameters such as the HDFS block size and the threshold for the in-memory hand-off to improve OpenPlanet's default performance.

Future work:
(1) Parallel execution of MRExpand and MRInMemWeka within each iteration.
(2) Running multiple OpenPlanet instances for different uses, which increases slot utilization.
(3) Determining the optimal block size.
(4) A real-time model training method.
(5) Moving to a cloud platform and analyzing performance there.