Location-aware MapReduce in Virtual Cloud

2011 IEEE Computer Society International Conference on Parallel Processing
Yifeng Geng, Shimin Chen, YongWei Wu, Ryan Wu, Guangwen Yang, Weimin Zheng
Reporter: Yu Chih Lin

Outline
• Introduction
• Background
• Model and New Strategy
• Implementation
• Experiment
• Conclusion

Introduction
• MapReduce is an important programming model for processing and generating large data sets
• Commonly used in applications
  • Web indexing
  • Data mining
  • Machine learning
• Multi-core CPUs support virtualization technology
  • Run two or more virtual machines (VMs) simultaneously
  • Share I/O resources among users
• MapReduce is set up on a distributed file system
  • Google uses GFS
  • Hadoop uses HDFS
• Running MapReduce in a virtual environment raises three major problems
  • Disk sharing results in unbalanced data distribution and unbalanced workload
  • Data imbalance and load imbalance cause I/O interference
  • Disk sharing reduces data redundancy
• Purpose of this paper
  • Abstract a model
  • Define evaluation metrics
  • Analyze the data pattern and task pattern
• For Hadoop
  • Propose a location-aware file block allocation strategy
• Three main benefits of the proposed strategy
  • MapReduce's workload is more balanced
  • I/O interference is reduced and HDFS performance improves
  • Data redundancy is retained

Background
• Two kinds of traditional I/O interference
  • Disk interference: occurs when multiple processes try to access the same disk simultaneously
  • Network interference: mainly concerns latency and throughput
• Two kinds of I/O virtualization
  • KVM
  • Paravirtualization
• Virtual machines share CPUs and memory well, but not I/O
• [Figure: Virtualized Hadoop architecture]

Model and New Strategy
• Build a generation model to analyze different allocation strategies
  • Data pattern
  • Task pattern
• To simplify the analysis, make four assumptions
  • All physical machines have the same I/O devices and host the same number of virtual machines
  • All virtual machines are in one local area network and the network topology is flat
  • Workload can be assigned randomly to any virtual machine, without limitation
  • All file blocks have the same size
• Notation: p is the number of physical machines, n is the number of file blocks
• actualReplicaNum (a): average number of unique file blocks in a physical machine
  • Ideal value is 3 (when the replica number is 3)

  $$\mathrm{actualReplicaNum} = \frac{1}{n} \sum_{i=0}^{p-1} \mathrm{fileNum}_i$$

• maxBlockNum (b): the maximum number of blocks in a physical machine

  $$\mathrm{maxBlockNum} = \max(\mathrm{blockNum}_0, \ldots, \mathrm{blockNum}_{p-1})$$

• blockNumSigma (c): the variation of the data pattern
  • Ideal value is 0

  $$\mathrm{blockNumSigma} = \sqrt{\frac{1}{p} \sum_{i=0}^{p-1} \left( \mathrm{blockNum}_i - \frac{\mathrm{replicaNum} \cdot n}{p} \right)^2}$$

• maxAssignedNum (d): the maximum number of tasks assigned to a physical machine

  $$\mathrm{maxAssignedNum} = \max(\mathrm{assignedNum}_0, \ldots, \mathrm{assignedNum}_{p-1})$$

• assignedNumSigma (e): reveals the load balance of the task pattern

  $$\mathrm{assignedNumSigma} = \sqrt{\frac{1}{p} \sum_{i=0}^{p-1} \left( \mathrm{assignedNum}_i - \frac{n}{p} \right)^2}$$

• A new allocation strategy
  • Place the replicas of a file block on different physical machines
  • Keep the number of blocks on each physical machine balanced
• Two intuitive ways to do this
  • Round-robin allocation
  • Serpentine allocation
• Example with p = 8, n = 8
  • [Figure: an example of round-robin allocation]
  • [Figure: an example of serpentine allocation]
• Evaluation metrics for the data pattern
  • actualReplicaNum = 3, maxBlockNum = 3, blockNumSigma = 0
• Enumeration average results for task patterns
  • Round-robin allocation: maxAssignedNum = 2.2724, assignedNumSigma = 0.7943
  • Serpentine allocation: maxAssignedNum = 2.2705, assignedNumSigma = 0.79323

Implementation
• Choose serpentine allocation
• Add the location information of each virtual node to the network topology
  • With one rack among the physical machines, a location may change from /default-rack to /Phy0
  • With several racks among the physical machines, a location may change from /rack1 to /rack1/Phy0
• This mechanism is easy to apply to Hadoop
  • It stays compatible with native Hadoop
  • It uses a special label starting with "Phy"
  • The label identifies the locations of virtual machines
• To maintain the block information for each virtual node
  • In Hadoop's NameNode, add a list sorted by the number of blocks
• On each update
  • First, update the block number of the virtual node
  • Second, update its position in the sorted list

Evaluation
• Simulation compares the new strategy (serpentine allocation) with Hadoop's original strategy
• Parameters
  • n = 256
  • p = 8, 16, 32, 64, 128, 256
  • Sampling number = 1,000,000
• [Figure: maxBlockNum, Hadoop's original strategy vs. the new strategy, using sampling]
• [Figure: actualReplicaNum, original vs. new strategy]
• [Figure: blockNumSigma, original vs. new strategy]
• [Figure: maxAssignedNum, original vs. new strategy]
• [Figure: assignedNumSigma, original vs. new strategy]

Experiment
• n = 224, p = 8, sampling number = 1,000,000

  Metric (average)        Original    New
  actualReplicaNum        2.0657      3
  maxBlockNum             90.5798     84
  blockNumSigma           4.1722      0
  maxAssignedNum          33.7660     34.5946
  assignedNumSigma        3.6256      4.14939

• [Figure: RandomWriter execution time. Red: SC off, Blue: SC on]
• [Figure: TextSort execution time. Red: SC off, Blue: SC on]
• [Figure: WordCount execution time. Red: SC off, Blue: SC on]

Conclusion
• Address the problems of data allocation and its impact on a MapReduce system
• Build a model and evaluation metrics to evaluate the data pattern and task pattern
• Propose a new strategy for file block allocation in Hadoop
• Simulation and real experiment results support the new allocation strategy
