BALANCED DATA
LAYOUT IN HADOOP
CPS 216
Kyungmin (Jason) Lee
Ke (Jessie) Xu
Weiping Zhang
Background
• How data is stored on HDFS affects Hadoop MapReduce
performance
• Map phase: performance decreases when input data must be
fetched from a remote node across the network
• Imbalance during a MapReduce workflow (output from
one job used as input to next) makes problem even worse
• Project goal: minimize the need to fetch Map input across
network by balancing input data across nodes
Previous Work
• Reactive Solution – HDFS Rebalancer
• Algorithm to rebalance data layout in HDFS based on storage
utilization
• Reacts to an already-existing data layout imbalance; we would like a
way to prevent the imbalance altogether
• Proactive Solution – RR Block Placement Policy
• On HDFS writes, choose target node in round robin fashion, so
data is guaranteed to be balanced
• Unnecessary writes across network? Can we do better?
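A minimal sketch of the round robin idea above, written in Python rather than Hadoop's Java. The class RoundRobinPlacer and its method choose_target are hypothetical names used only for illustration; they are not Hadoop APIs, and the node names are made up.

```python
from collections import Counter
from itertools import cycle


class RoundRobinPlacer:
    """Hypothetical illustration: hands out write targets in a fixed rotation."""

    def __init__(self, nodes):
        # cycle() repeats the node list forever, in order.
        self._ring = cycle(list(nodes))

    def choose_target(self):
        # The writer's own location is never consulted.
        return next(self._ring)


placer = RoundRobinPlacer(["n1", "n2", "n3", "n4"])
blocks_per_node = Counter(placer.choose_target() for _ in range(12))
print(dict(blocks_per_node))
```

With equally sized blocks every node ends up with the same count, but because the writer's location is ignored, most writes land on a node other than the writer. That is the network cost the last bullet asks about.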
Balanced Block Placement Policy
• Do writes ‘greedily’ as long as cluster is ‘fairly balanced’
• ‘Greedily’ = prioritize target nodes based on location
• Local node > node on rack > remote node
• ‘Fairly balanced’ = HDFS usage of all nodes falls within a specified
range (windowSize)
• Algorithm:
• Sort live nodes by HDFS space used; threshold = max used - windowSize
• 1st replica: write to local node if it is below threshold or if all nodes
are above threshold, otherwise write to least utilized node
• 2nd replica: least utilized node that is on different rack (if possible)
than 1st replica
• 3rd replica and beyond: least utilized remaining node
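The algorithm above can be sketched as follows. This is an illustrative Python model under stated assumptions, not the project's Hadoop implementation: Node, choose_targets, and the sample cluster are hypothetical; "below threshold" is read as strictly less than, "above" as greater than or equal; and a writer that is not itself a data node falls back to the least utilized node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rack: str
    used: int  # HDFS space used, in any consistent unit


def choose_targets(nodes, writer, replication, window_size):
    """Return target node names for one block, following the three replica rules."""
    by_usage = sorted(nodes, key=lambda n: n.used)   # least utilized first
    threshold = by_usage[-1].used - window_size      # max used - windowSize
    local = next((n for n in nodes if n.name == writer), None)
    # Assumption: "all nodes above threshold" means every node is >= threshold.
    all_above = by_usage[0].used >= threshold

    # 1st replica: stay local when that does not worsen the balance.
    if local is not None and (local.used < threshold or all_above):
        first = local
    else:
        first = by_usage[0]
    targets = [first]
    remaining = [n for n in by_usage if n is not first]

    # 2nd replica: least utilized node on a different rack, if one exists.
    if replication >= 2 and remaining:
        off_rack = [n for n in remaining if n.rack != first.rack]
        second = (off_rack or remaining)[0]
        targets.append(second)
        remaining.remove(second)

    # 3rd replica and beyond: least utilized remaining nodes.
    targets.extend(remaining[: max(0, replication - len(targets))])
    return [n.name for n in targets]


# Made-up 4-node, 2-rack cluster for illustration only.
cluster = [
    Node("n1", "r1", 900),
    Node("n2", "r1", 400),
    Node("n3", "r2", 850),
    Node("n4", "r2", 500),
]
print(choose_targets(cluster, "n1", 3, 200))  # n1 is over the threshold
print(choose_targets(cluster, "n4", 3, 200))  # n4 is under the threshold
print(choose_targets(cluster, "n1", 1, 600))  # window covers every node
```

In this sample the threshold for windowSize = 200 is 700, so a write from n1 (900 used) is redirected to the least utilized node, while a write from n4 (500 used) stays local. With windowSize = 600 every node is inside the window and the write from n1 stays local.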
Test Workloads
• 4-node cluster
• Default Policy (DP) vs. Balanced Policy (BP)
• 2 MapReduce Jobs
• Balanced Sort (each reducer approx. same output size)
• Skewed Sort (skewed reducer output sizes)
• 2 Workloads
• Single run, vary number of reducers (1, 2, 4, 10, 12)
• Cascaded workflow, 3 sorts in series, reducers = 10
• Other parameters
• RF (replication factor) – 1 and 3
• Speculative Execution on and off (SE vs NSE)
• Measure balance by the standard deviation of the amount of data
written to each node → higher StdDev implies more imbalance
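A small sketch of the balance metric described above. The slides do not say whether a population or sample standard deviation was used, so the population form (statistics.pstdev) is an assumption here. The function name write_imbalance and the per-node byte counts are made up for illustration and are not the project's measurements.

```python
from statistics import pstdev


def write_imbalance(bytes_written_per_node):
    """Population standard deviation of data written per node (0 = perfectly even)."""
    return pstdev(bytes_written_per_node)


# Hypothetical 4-node clusters that each received 1000 units in total.
balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]

print(write_imbalance(balanced))
print(write_imbalance(skewed))
```

Both layouts hold the same total amount of data; only the spread differs, so the skewed layout yields the larger value.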
Quick Demo on Amazon EC2
Balanced Sort Single Run
• DP very skewed for reducers < 4, as expected
• Otherwise both pretty balanced (as expected)
Skewed Sort Single Run
• DP significantly worse than BP
• RF3 shows better balance than RF1
• Disabling SE improves balance in BP
Balanced Sort Workflow
Skewed Sort Workflow
Performance
• No significant overhead/improvements observed
Speculative Execution
• Hadoop performance feature that runs same task on 2
nodes concurrently, uses data from task that completes
first and discards the other
• Usually occurs toward end of a job, leading to unintended
data imbalance in balanced policy
• Turning off speculative execution improved data balance,
but in practice would like to keep this feature on for
performance boost
• Our policy too greedy, less affected if a node writes
approximately equally to all nodes → round robin
• Hybrid policy, some nodes run round robin and some
nodes run balanced policy? Tradeoff between balance
and network traffic?
Future Considerations
• Current implementation assumes data will be balanced
throughout cluster’s lifetime
• What if some nodes are down for a period of time and
data becomes imbalanced?
• Data output per job should be spread evenly, vs. overall
data layout spread evenly
• Need additional knowledge of which job each write belongs to
• Effect of window size on balance/performance?
• Unable to test due to insufficient funds
Conclusion
• Implemented new block placement policy that focuses on
maintaining data balance while keeping writes local as
much as possible
• Test data showed success at maintaining data balance
• Greatest improvements with skewed outputs
• Performance not affected – would expect improvement for
skewed datasets given reduction in network usage
• Only tested on small cluster with small datasets
• Should be more effective on large datasets
• Performance weakened by speculative execution
• In practice should tweak our policy to get best performance results