BALANCED DATA LAYOUT IN HADOOP
CPS 216
Kyungmin (Jason) Lee, Ke (Jessie) Xu, Weiping Zhang

Background
• How data is stored on HDFS affects Hadoop MapReduce performance
• Map phase: performance drops if input data must be fetched from a remote node across the network
• Imbalance during a MapReduce workflow (output of one job used as input to the next) makes the problem even worse
• Project goal: minimize the need to fetch Map input across the network by balancing input data across nodes

Previous Work
• Reactive solution – HDFS Rebalancer
  • Algorithm that rebalances the HDFS data layout based on storage utilization
  • Reacts to an already-existing imbalance; we would like a way to prevent it altogether
• Proactive solution – Round-Robin (RR) Block Placement Policy
  • On HDFS writes, choose the target node in round-robin fashion, so data is guaranteed to be balanced
  • Unnecessary writes across the network? Can we do better?

Balanced Block Placement Policy
• Write 'greedily' as long as the cluster is 'fairly balanced'
• 'Greedily' = prioritize target nodes by location: local node > node on the same rack > remote node
• 'Fairly balanced' = HDFS usage of every node falls within a specified range (windowSize)
• Algorithm (a standalone sketch of this selection logic appears just before the Conclusion):
  • Sort live nodes by HDFS space used; threshold = max – windowSize
  • 1st replica: write to the local node if it is below the threshold or if all nodes are above the threshold; otherwise write to the least-utilized node
  • 2nd replica: least-utilized node on a different rack (if possible) than the 1st replica
  • 3rd replica and beyond: least-utilized remaining node

Test Workloads
• 4-node cluster
• Default Policy (DP) vs. Balanced Policy (BP)
• 2 MapReduce jobs
  • Balanced Sort (each reducer produces approximately the same output size)
  • Skewed Sort (skewed reducer output sizes)
• 2 workloads
  • Single run, varying the number of reducers (1, 2, 4, 10, 12)
  • Cascaded workflow, 3 sorts in series, reducers = 10
• Other parameters
  • RF (replication factor) – 1 and 3
  • Speculative execution on and off (SE vs. NSE)
• Monitor balance via the standard deviation of data written per node; higher StdDev implies more imbalance (a sketch of this metric appears after the Conclusion)

Quick Demo on Amazon EC2

Balanced Sort Single Run
• DP very skewed for reducers < 4, as expected
• Otherwise both pretty balanced (as expected)

Skewed Sort Single Run
• DP significantly worse than BP
• RF3 shows better balance than RF1
• Disabling SE improves balance in BP

Balanced Sort Workflow

Skewed Sort Workflow

Performance
• No significant overhead/improvements observed

Speculative Execution
• Hadoop performance feature that runs the same task on 2 nodes concurrently, uses the output of whichever task completes first, and discards the other
• Usually occurs toward the end of a job, leading to unintended data imbalance under the balanced policy
• Turning off speculative execution improved data balance, but in practice we would like to keep the feature on for its performance boost
• Our policy is too greedy; it would be less affected if each node wrote approximately equally to all nodes, as round robin does
• Hybrid policy, where some nodes run round robin and some run the balanced policy? Tradeoff between balance and network traffic?

Future Considerations
• Current implementation assumes data will stay balanced throughout the cluster's lifetime
  • What if some nodes are down for a period of time and the data becomes imbalanced?
• Spread each job's output evenly, vs. spreading the overall data layout evenly
  • Requires additional knowledge of which job each write belongs to
• Effect of window size on balance/performance?
  • Unable to test due to insufficient funds
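The greedy selection rule from the Balanced Block Placement Policy slide can be summarized in a short standalone sketch. This is an illustrative model only, not the project's actual implementation: a real policy would plug into HDFS's block placement extension point, while the NodeInfo class, windowSize parameter, and chooseTargets method below are hypothetical names invented for the sketch.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified, standalone model of the balanced block placement logic described above.
// Not the HDFS API: NodeInfo, windowSize, and chooseTargets are names invented for this sketch.
public class BalancedPlacementSketch {

    // Minimal stand-in for a DataNode: name, rack, and HDFS space already used (bytes).
    static class NodeInfo {
        final String name;
        final String rack;
        long hdfsUsed;
        NodeInfo(String name, String rack, long hdfsUsed) {
            this.name = name;
            this.rack = rack;
            this.hdfsUsed = hdfsUsed;
        }
    }

    private final long windowSize; // allowed spread between most- and least-used nodes (bytes)

    BalancedPlacementSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    // Choose up to `replication` target nodes for one block written from `writer`.
    List<NodeInfo> chooseTargets(NodeInfo writer, List<NodeInfo> liveNodes, int replication) {
        // Sort live nodes by HDFS space used (ascending); threshold = max - windowSize.
        List<NodeInfo> sorted = new ArrayList<>(liveNodes);
        sorted.sort(Comparator.comparingLong((NodeInfo n) -> n.hdfsUsed));
        long max = sorted.get(sorted.size() - 1).hdfsUsed;
        long threshold = max - windowSize;
        boolean allAboveThreshold = sorted.get(0).hdfsUsed >= threshold; // cluster already fairly balanced

        List<NodeInfo> targets = new ArrayList<>();

        // 1st replica: local node if it is below the threshold, or if every node is above
        // the threshold; otherwise the least-utilized node.
        NodeInfo first = (writer.hdfsUsed < threshold || allAboveThreshold) ? writer : sorted.get(0);
        targets.add(first);

        // 2nd replica: least-utilized node on a different rack than the 1st, if possible.
        if (replication >= 2) {
            NodeInfo second = null;
            for (NodeInfo n : sorted) {
                if (n != first && !n.rack.equals(first.rack)) { second = n; break; }
            }
            if (second == null) { // single-rack cluster: fall back to any other node
                for (NodeInfo n : sorted) {
                    if (n != first) { second = n; break; }
                }
            }
            if (second != null) targets.add(second);
        }

        // 3rd replica and beyond: least-utilized remaining nodes.
        for (NodeInfo n : sorted) {
            if (targets.size() >= replication) break;
            if (!targets.contains(n)) targets.add(n);
        }
        return targets;
    }

    public static void main(String[] args) {
        // Toy 4-node, 2-rack cluster; dn2 (the writer) is far more utilized than the others.
        List<NodeInfo> nodes = List.of(
            new NodeInfo("dn1", "rackA", 10L << 30),
            new NodeInfo("dn2", "rackA", 40L << 30),
            new NodeInfo("dn3", "rackB", 12L << 30),
            new NodeInfo("dn4", "rackB", 11L << 30));
        BalancedPlacementSketch policy = new BalancedPlacementSketch(5L << 30); // 5 GB window
        for (NodeInfo t : policy.chooseTargets(nodes.get(1), nodes, 3)) {
            System.out.println(t.name); // prints dn1, dn4, dn3 -- the overloaded local node is skipped
        }
    }
}

Run against the toy cluster in main, the sketch skips the overloaded local node (dn2) and spreads the three replicas across the under-utilized nodes, which is the behavior the slides describe for an imbalanced cluster; once the spread falls within windowSize, writes become local again.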
Conclusion
• Implemented a new block placement policy that focuses on maintaining data balance while keeping writes local as much as possible
• Test data showed success at maintaining data balance
  • Greatest improvements with skewed outputs
• Performance was not affected – we would expect an improvement for skewed datasets given the reduction in network usage
  • Only tested on a small cluster with small datasets; should be more effective on large datasets
• Data balance weakened by speculative execution
  • In practice, our policy should be tweaked to get the best performance results
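For reference, the balance metric mentioned on the Test Workloads slide (standard deviation of data written per node) is simple to compute once per-node byte counts have been collected, e.g. from `hadoop dfsadmin -report` or the NameNode web UI. The sketch below is illustrative only; the BalanceMetric class and stdDev method are names invented here, and the usage numbers in main are made up.

import java.util.Map;

// Balance metric used in the experiments: standard deviation of bytes stored per node.
// Higher values indicate a more imbalanced data layout. Names here are invented for the sketch.
public class BalanceMetric {

    // Population standard deviation of the per-node byte counts.
    static double stdDev(Map<String, Long> bytesPerNode) {
        int n = bytesPerNode.size();
        if (n == 0) return 0.0;
        double mean = bytesPerNode.values().stream().mapToLong(Long::longValue).average().orElse(0.0);
        double sumSq = 0.0;
        for (long bytes : bytesPerNode.values()) {
            double d = bytes - mean;
            sumSq += d * d;
        }
        return Math.sqrt(sumSq / n);
    }

    public static void main(String[] args) {
        // Hypothetical per-node DFS usage after a run (bytes); dn4 holds far more data than the rest.
        Map<String, Long> usage = Map.of(
            "dn1", 10L << 30,
            "dn2", 12L << 30,
            "dn3", 11L << 30,
            "dn4", 40L << 30);
        System.out.printf("Imbalance (GB): %.2f%n", stdDev(usage) / (1L << 30));
    }
}

For the toy numbers in main, the metric comes out to roughly 12.6 GB, dominated by the single overloaded node; a well-balanced run would drive this value toward zero.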