Location-aware MapReduce in Virtual Cloud

2011 IEEE Computer Society
International Conference on Parallel Processing
Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen
Yang1,2, Weimin Zheng1
Reporter: Yu Chih Lin
Outline
 Introduction
 Background
 Model and New Strategy
 Implementation
 Experiment
 Conclusion
Introduction
 MapReduce is an important programming model for
• Processing large data sets
• Generating large data sets
 Commonly used in applications
• web indexing
• Data mining
• machine learning
Introduction
 Multi-core CPU supporting virtualization technology
• Run two or more virtual machines (VMs) simultaneously
• Share the host’s I/O resources among users
 MapReduce is set up on a distributed file system
• Google uses GFS
• Hadoop uses HDFS
Introduction
 Running MapReduce in a virtual environment raises three major problems
• Disk sharing results in unbalanced data distribution and unbalanced
workload
• I/O interference caused by data imbalance and load imbalance
• Disk sharing reduces the data redundancy
Introduction
 Purpose of this paper
• Abstract a model
• Define evaluation metrics
• Analyze the data pattern and task pattern
 For Hadoop
• propose a location-aware file block allocation strategy
Introduction
 Three main benefits of the proposed strategy
• MapReduce’s workload is more balanced
• Reduces I/O interference and improves HDFS’s performance
• Retains data’s redundancy
Background
 Traditional I/O interference comes in two kinds
• Disk interference –
when multiple processes try to access the same disk simultaneously
• Network interference –
mainly considers the latency and throughput
Background
 Two kinds of I/O virtualization
• KVM
• Paravirtualization
 Virtual machines share CPUs and memory well, but not I/O.
Background
 Virtualized Hadoop architecture
Model and New Strategy
 Build a generation model to analyze different allocation strategies
• Data pattern
• Task pattern
 To simplify the analysis, make four assumptions
Model and New Strategy
 All physical machines have the same I/O devices and host the same number of
virtual machines
 All the virtual machines are in local area network and the network topology
is flat
 Workload can be randomly assigned to any virtual machine without restriction
 All file blocks have the same size
Model and New Strategy
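 actualReplicaNum (a):
average number of distinct physical machines that hold a replica of each file block
 Ideal value is 3 (when the replica number is 3)
$$\mathit{actualReplicaNum} = \frac{1}{n} \sum_{i=0}^{p-1} \mathit{fileNum}_i$$
where fileNum_i is the number of unique file blocks on physical machine i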
Model and New Strategy
 maxBlockNum (b):
shows the maximum number of blocks in a physical machine
$$\mathit{maxBlockNum} = \max(\mathit{blockNum}_0, \ldots, \mathit{blockNum}_{p-1})$$
Model and New Strategy
 blockNumSigma (c):
shows the variation of the pattern
 Ideal value is 0
$$\mathit{blockNumSigma} = \sqrt{\frac{1}{p} \sum_{i=0}^{p-1} \left( \mathit{blockNum}_i - \frac{\mathit{replicaNum} \cdot n}{p} \right)^2}$$
Model and New Strategy
 maxAssignedNum (d):
shows the maximum number of tasks assigned to a physical machine
$$\mathit{maxAssignedNum} = \max(\mathit{assignedNum}_0, \ldots, \mathit{assignedNum}_{p-1})$$
Model and New Strategy
 assignedNumSigma (e):
reveals the load balance of the task pattern
$$\mathit{assignedNumSigma} = \sqrt{\frac{1}{p} \sum_{i=0}^{p-1} \left( \mathit{assignedNum}_i - \frac{n}{p} \right)^2}$$
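The five metrics can be computed directly from a placement. The following is a minimal sketch, assuming a placement is given as one list of physical machine indices per file block (one entry per replica) and the task pattern as a list of per-machine task counts. The function names and the toy input are illustrative, not from the paper, and the square root in the two sigma metrics follows the standard-deviation reading of the name.

```python
import math

def data_pattern_metrics(placement, p, replica_num=3):
    """placement[b] lists the physical machine of each replica of block b."""
    n = len(placement)
    file_num = [0] * p   # fileNum_i: unique blocks on machine i
    block_num = [0] * p  # blockNum_i: replicas stored on machine i
    for machines in placement:
        for m in machines:
            block_num[m] += 1
        for m in set(machines):
            file_num[m] += 1
    actual_replica_num = sum(file_num) / n
    max_block_num = max(block_num)
    mean = replica_num * n / p
    block_num_sigma = math.sqrt(sum((b - mean) ** 2 for b in block_num) / p)
    return actual_replica_num, max_block_num, block_num_sigma

def task_pattern_metrics(assigned_num, n):
    """assigned_num[i] is the number of tasks assigned to physical machine i."""
    p = len(assigned_num)
    mean = n / p
    sigma = math.sqrt(sum((a - mean) ** 2 for a in assigned_num) / p)
    return max(assigned_num), sigma

# Hypothetical toy input: 4 blocks, 4 physical machines, 3 replicas each.
# Block 1 has two replicas on machine 1 (two VMs sharing one disk).
placement = [[0, 1, 2], [1, 1, 3], [2, 3, 0], [3, 0, 1]]
a, b, c = data_pattern_metrics(placement, p=4)
d, e = task_pattern_metrics([1, 1, 1, 1], n=4)
```

In the toy input, the duplicated replica on machine 1 lowers actualReplicaNum below 3 and raises both maxBlockNum and blockNumSigma, which is the effect of disk sharing the metrics are meant to expose.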
Model and New Strategy
 A new allocation strategy
• Places the replicas of a file block on different physical machines
• Keeps the number of blocks on each physical machine balanced
 Present two intuitive ways
• Round-robin allocation
• Serpentine allocation
Model and New Strategy
 For example, take p = 8, n = 8 (p: physical machines, n: file blocks)
 An example of round-robin allocation
Model and New Strategy
 For example, take p = 8, n = 8 (p: physical machines, n: file blocks)
 An example of serpentine allocation
Model and New Strategy
 Evaluation metrics for data pattern
actualReplicaNum=3, maxBlockNum=3, blockNumSigma=0
 Enumerated average results for task patterns
• Round-robin allocation: maxAssignedNum = 2.2724, assignedNumSigma = 0.7943
• Serpentine allocation: maxAssignedNum = 2.2705, assignedNumSigma = 0.79323
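As a rough illustration of the data-pattern values for p = 8, n = 8, the sketch below hands out replicas to physical machines in one cyclic pass. This is an assumed reading of round-robin allocation. The exact layout in the slide's figure, and the visiting order used by serpentine allocation, are not reproduced here. The function name is hypothetical.

```python
def round_robin_placement(n, p, replica_num=3):
    """Assign the replicas of n blocks to p physical machines in cyclic order."""
    placement = []
    slot = 0
    for _ in range(n):
        # Consecutive slots give distinct machines as long as replica_num <= p.
        machines = [(slot + r) % p for r in range(replica_num)]
        placement.append(machines)
        slot += replica_num
    return placement

p, n = 8, 8
placement = round_robin_placement(n, p)

# blockNum_i: replicas stored on physical machine i.
block_num = [sum(ms.count(i) for ms in placement) for i in range(p)]
# Summing distinct machines per block equals summing fileNum_i over machines.
actual_replica_num = sum(len(set(ms)) for ms in placement) / n
max_block_num = max(block_num)
```

Under this assumption every machine ends up with exactly three blocks and every block sits on three distinct machines, which agrees with actualReplicaNum = 3, maxBlockNum = 3, and blockNumSigma = 0.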
Implementation
 Choose serpentine allocation
 Add the location information of virtual node into the network topology
 For example, when the physical machines are in one rack
• a location may be changed from /default-rack to /Phy0
 For example, when the physical machines are in several racks
• a location may be changed from /rack1 to /rack1/Phy0
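The two path rewrites can be sketched as a small helper. This is only an illustration of the naming scheme on the slide. The function name, the host table, and the way a VM is mapped to a physical machine index are assumptions, and how the paper feeds such paths into Hadoop is not stated here.

```python
def physical_location(rack_path, phys_index, default_rack="/default-rack"):
    """Return the topology path extended with a physical machine label."""
    label = f"/Phy{phys_index}"
    if rack_path == default_rack:
        # Single-rack case: the default rack is replaced by the label.
        return label
    # Multi-rack case: the label becomes one more level under the rack.
    return rack_path.rstrip("/") + label

# Hypothetical table: VM hostname -> (rack path, physical machine index).
VM_HOSTS = {"vm-a": ("/default-rack", 0), "vm-b": ("/rack1", 0)}

def resolve(hostname):
    rack, phys = VM_HOSTS[hostname]
    return physical_location(rack, phys)
```

With this table, `resolve("vm-a")` yields `/Phy0` and `resolve("vm-b")` yields `/rack1/Phy0`, matching the two examples.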
Implementation
 This mechanism is easy to apply to Hadoop
• It keeps compatibility with native Hadoop
• It uses a special label starting with “Phy”
• The label identifies the locations of virtual machines
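A sketch of how the label could be recognized, assuming the physical machine label is the last component of the topology path. The helper name is hypothetical. Paths without the label are left alone, which is one way to read the compatibility point.

```python
def physical_machine_label(location):
    """Return the trailing 'Phy...' component of a topology path, or None."""
    last = location.rstrip("/").rsplit("/", 1)[-1]
    return last if last.startswith("Phy") else None

def same_physical_machine(loc_a, loc_b):
    """True only when both paths carry a label and the full paths match."""
    label = physical_machine_label(loc_a)
    return label is not None and loc_a == loc_b
```

Two virtual nodes whose paths are both `/rack1/Phy0` are treated as sharing a physical machine, while plain `/rack1` paths are never treated that way.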
Implementation
 To maintain the block information for each virtual node
• In Hadoop’s NameNode, add a list of virtual nodes sorted by number of blocks
 On each update
• First, update the block number of the virtual node
• Second, update its position in the sorted list
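The two-step update can be sketched with a list kept in sorted order. This is an illustration in Python rather than the NameNode's actual code. The class and method names are hypothetical, and binary search via `bisect` is just one way to keep the list ordered.

```python
import bisect

class BlockCountIndex:
    """Virtual nodes kept in ascending order of stored block count."""

    def __init__(self, nodes):
        self.count = {node: 0 for node in nodes}
        self.ordered = sorted((0, node) for node in nodes)

    def add_blocks(self, node, delta=1):
        old = self.count[node]
        # Find the node's current entry by binary search and remove it.
        pos = bisect.bisect_left(self.ordered, (old, node))
        del self.ordered[pos]
        # Step 1: update the block number of the virtual node.
        new = old + delta
        self.count[node] = new
        # Step 2: reinsert it at its new position in the sorted list.
        bisect.insort(self.ordered, (new, node))

    def least_loaded(self):
        return self.ordered[0][1]

idx = BlockCountIndex(["vm0", "vm1", "vm2"])
idx.add_blocks("vm0")
idx.add_blocks("vm1", 2)
```

After the two updates the list reads `[(0, "vm2"), (1, "vm0"), (2, "vm1")]`, so the node with the fewest blocks is always at the front.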
Evaluation
 Simulation compares
• The new strategy (serpentine allocation) and Hadoop’s original strategy
 Parameters
• n = 256
• p = [8, 16, 32, 64, 128, 256]
• Sampling number: 1,000,000
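A sampling simulation of this kind can be sketched as below. Several points are assumptions and not taken from the paper: uniform random choice of distinct VMs stands in for a location-unaware placement (it is not Hadoop's actual default policy), each physical machine hosts 2 VMs, and only 50 samples are drawn to keep the run short. The resulting numbers are therefore not expected to match the reported ones.

```python
import random

def sample_actual_replica_num(n, p, vms_per_host, replica_num, rng):
    """One sample: each block goes to replica_num distinct VMs chosen uniformly."""
    total_vms = p * vms_per_host
    distinct_hosts = 0
    for _ in range(n):
        vms = rng.sample(range(total_vms), replica_num)
        # Assumption: VM v runs on physical machine v // vms_per_host.
        distinct_hosts += len({v // vms_per_host for v in vms})
    return distinct_hosts / n

def estimate(n, p, vms_per_host=2, replica_num=3, samples=50, seed=0):
    """Average actualReplicaNum over several random placements."""
    rng = random.Random(seed)
    runs = [sample_actual_replica_num(n, p, vms_per_host, replica_num, rng)
            for _ in range(samples)]
    return sum(runs) / samples

results = {p: estimate(n=256, p=p) for p in [8, 16, 32, 64, 128, 256]}
```

Under these assumptions the estimate stays below 3 for every p, because some replicas of a block land on VMs of the same physical machine, and it approaches 3 as p grows.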
Evaluation
 Comparison of maxBlockNum for Hadoop’s original strategy and the new
strategy, using sampling
Evaluation
 Comparison of actualReplicaNum for the original and new strategies
Evaluation
 Comparison of blockNumSigma for the original and new strategies
Evaluation
 Comparison of maxAssignedNum for the original and new strategies
Evaluation
 Comparison of assignedNumSigma for the original and new strategies
Experiment
 n = 224, p = 8
 Sampling number = 1,000,000
Metric                         Original    New
Average of actualReplicaNum    2.0657      3
Average of maxBlockNum         90.5798     84
Average of blockNumSigma       4.1722      0
Average of maxAssignedNum      33.7660     34.5946
Average of assignedNumSigma    3.6256      4.14939
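The value 84 for the new strategy is what a perfectly balanced placement gives. As a quick check, using the replica number 3 from the metric definitions and the mean term of blockNumSigma:

```latex
\frac{\mathit{replicaNum} \cdot n}{p} = \frac{3 \cdot 224}{8} = 84
```

When every physical machine holds exactly this many blocks, maxBlockNum equals the mean and blockNumSigma is 0.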
Experiment
 Experiment results of RandomWriter’s execution time
Red : SC off
Blue : SC on
Experiment
 Experiment results of TextSort’s execution time
Red : SC off
Blue : SC on
Experiment
 Experiment results of WordCount’s execution time
Red : SC off
Blue : SC on
Conclusion
 Address problems of data allocation and its impact on MapReduce system
 Build a model and evaluation metrics to evaluate the data and task pattern
 Propose a new strategy for file block allocation in Hadoop
 Simulation and real experiment results
• show that the new allocation strategy balances block placement and retains data redundancy