Location-aware MapReduce in Virtual Cloud
2011 IEEE Computer Society International Conference on Parallel Processing
Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen Yang1,2, Weimin Zheng1
Reporter: Yu Chih Lin

Outline
• Introduction
• Background
• Model and New Strategy
• Implementation
• Experiment
• Conclusion

Introduction
MapReduce is an important programming model for
• Processing large data sets
• Generating large data sets
It is commonly used in applications such as
• Web indexing
• Data mining
• Machine learning

Multi-core CPUs support virtualization technology
• Two or more virtual machines (VMs) run simultaneously
• The VMs share the I/O resources among users
MapReduce is set up on a distributed file system
• Google uses GFS
• Hadoop uses HDFS

Running MapReduce in a virtual environment raises three major problems
• Disk sharing results in unbalanced data distribution and unbalanced workload
• Data imbalance and load imbalance cause I/O interference
• Disk sharing reduces data redundancy

Purpose of this paper
• Abstract a model
• Define evaluation metrics
• Analyze the data pattern and task pattern
• For Hadoop, propose a location-aware file block allocation strategy

Three main benefits of the proposed strategy
• MapReduce's workload is more balanced
• I/O interference is reduced and HDFS performance improves
• Data redundancy is retained

Background
Traditional I/O interference comes in two kinds
• Disk interference: multiple processes try to access the same disk simultaneously
• Network interference: mainly concerns latency and throughput
I/O virtualization comes in two kinds
• KVM
• Paravirtualization
Virtual machines share CPUs and memory well, but not I/O.
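Because co-located virtual machines share one physical disk, replicas placed on different virtual machines do not necessarily sit on different disks. The following is a minimal sketch of that effect, not the paper's model: it assumes replicas are placed on distinct virtual machines chosen uniformly at random, which is simpler than HDFS's real rack-aware policy. The function name and the parameters p (physical machines) and v (virtual machines per physical machine) are illustrative.

```python
from fractions import Fraction


def prob_all_replicas_on_distinct_hosts(p: int, v: int, replicas: int = 3) -> Fraction:
    """Probability that `replicas` distinct VMs, drawn uniformly at random from
    p physical machines with v VMs each, all sit on different physical machines.

    Assumption (not from the source): uniform random placement over VMs.
    """
    total_vms = p * v
    if replicas > total_vms:
        raise ValueError("more replicas than virtual machines")
    prob = Fraction(1)
    for k in range(replicas):
        # k physical machines are already used by earlier replicas, so
        # (p - k) * v VMs are still on unused machines, out of the
        # total_vms - k VMs that have not been picked yet.
        prob *= Fraction((p - k) * v, total_vms - k)
    return prob
```

With p = 8 and one virtual machine per physical machine the probability is 1. With two virtual machines per physical machine it drops to 4/5, so under this assumption one block in five has at least two replicas on the same physical disk.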
Background
[Figure: Virtualized Hadoop architecture]

Model and New Strategy
Build a generation model to analyze different allocation strategies
• Data pattern
• Task pattern
To simplify the analysis, make four assumptions
• Every physical machine has the same I/O devices and hosts the same number of virtual machines
• All virtual machines are in one local area network and the network topology is flat
• Workload can be assigned to any virtual machine at random, without restriction
• All file blocks have the same size

Notation: p is the number of physical machines, n is the number of file blocks, and replicaNum is the configured replica number.

actualReplicaNum (a): the number of unique file blocks on each physical machine, summed over all machines and divided by n, i.e. the average number of physical machines that actually hold a block
• Ideal value is 3 (when the replica number is 3)
\[ \mathit{actualReplicaNum} = \frac{1}{n} \sum_{i=0}^{p-1} \mathit{fileNum}_i \]

maxBlockNum (b): the maximum number of blocks on a physical machine
\[ \mathit{maxBlockNum} = \max(\mathit{blockNum}_0, \ldots, \mathit{blockNum}_{p-1}) \]

blockNumSigma (c): the variation of the data pattern
• Ideal value is 0
\[ \mathit{blockNumSigma} = \sqrt{\frac{1}{p} \sum_{i=0}^{p-1} \left( \mathit{blockNum}_i - \frac{\mathit{replicaNum} \cdot n}{p} \right)^2} \]

maxAssignedNum (d): the maximum number of tasks assigned to a physical machine
\[ \mathit{maxAssignedNum} = \max(\mathit{assignedNum}_0, \ldots, \mathit{assignedNum}_{p-1}) \]

assignedNumSigma (e): the load balance of the task pattern
\[ \mathit{assignedNumSigma} = \sqrt{\frac{1}{p} \sum_{i=0}^{p-1} \left( \mathit{assignedNum}_i - \frac{n}{p} \right)^2} \]

A new allocation strategy
• Places the replicas of a file block on different physical machines
• Keeps the number of blocks on each physical machine balanced
Two intuitive ways are presented
• Round-robin allocation
• Serpentine allocation

Example with p = 8, n = 8 (p: physical machines, n: file blocks)
[Figure: an example of round-robin allocation]
[Figure: an example of serpentine allocation]

Evaluation metrics for the data pattern
• actualReplicaNum = 3, maxBlockNum = 3, blockNumSigma = 0
Enumeration average results for task patterns
• Round-robin allocation: maxAssignedNum = 2.2724, assignedNumSigma = 0.7943
• Serpentine allocation: maxAssignedNum = 2.2705, assignedNumSigma = 0.79323

Implementation
Choose serpentine allocation
Add the location information of each virtual node to the network topology
• If the physical machines are in one rack, a location may change from /default-rack to /Phy0
• If the physical machines are in several racks, a location may change from /rack1 to /rack1/Phy0

This mechanism is easy to apply to Hadoop
• It stays compatible with native Hadoop
• A special label starting with "Phy" is added to the location
• The label identifies the location of each virtual machine

To maintain the block information for each virtual node
• The NameNode of Hadoop keeps a list sorted by the number of blocks
On each update
• First, update the block number of the virtual node
• Second, update its position in the sorted list

Evaluation
Simulation compares the new strategy (serpentine allocation) with Hadoop's original strategy
Parameters
• n = 256
• p = [8, 16, 32, 64, 128, 256]
• Sampling number: 1,000,000
[Figure: maxBlockNum, Hadoop's original strategy vs. the new strategy, using sampling]
[Figure: actualReplicaNum, original vs. new strategy]
[Figure: blockNumSigma, original vs. new strategy]
[Figure: maxAssignedNum, original vs. new strategy]
[Figure: assignedNumSigma, original vs. new strategy]

Experiment
n = 224, p = 8, sampling number = 1,000,000

Metric (average)       Original    New
actualReplicaNum       2.0657      3
maxBlockNum            90.5798     84
blockNumSigma          4.1722      0
maxAssignedNum         33.7660     34.5946
assignedNumSigma       3.6256      4.14939

[Figure: RandomWriter execution time. Red: SC off, blue: SC on]
[Figure: TextSort execution time. Red: SC off, blue: SC on]
[Figure: WordCount execution time. Red: SC off, blue: SC on]

Conclusion
• Address the problems of data allocation and its impact on the MapReduce system
• Build a model and evaluation metrics to evaluate the data pattern and task pattern
• Propose a new strategy for file block allocation in Hadoop
• Simulation and real experiment results indicate that the new allocation strategy is effective; in the simulation it keeps actualReplicaNum at 3 and blockNumSigma at 0
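The evaluation metrics from the Model and New Strategy slides can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the two sigma metrics are taken to be standard deviations around the ideal mean, and `round_robin_allocation` is a hypothetical placement rule (replica r of block b goes to physical machine (3b + r) mod p), since the slide's round-robin figure is not recoverable.

```python
from math import sqrt

REPLICA_NUM = 3  # replica number used throughout the slides


def round_robin_allocation(n, p, replica_num=REPLICA_NUM):
    """Hypothetical round-robin rule: placement[b] lists the physical
    machines holding the replicas of block b."""
    return [[(b * replica_num + r) % p for r in range(replica_num)]
            for b in range(n)]


def data_pattern_metrics(placement, p, replica_num=REPLICA_NUM):
    """Compute actualReplicaNum, maxBlockNum and blockNumSigma."""
    n = len(placement)
    block_num = [0] * p  # replicas stored on each physical machine
    file_num = [0] * p   # unique blocks stored on each physical machine
    for hosts in placement:
        for h in hosts:
            block_num[h] += 1
        for h in set(hosts):  # replicas sharing a machine count once
            file_num[h] += 1
    mean = replica_num * n / p  # ideal number of replicas per machine
    return {
        "actualReplicaNum": sum(file_num) / n,
        "maxBlockNum": max(block_num),
        "blockNumSigma": sqrt(sum((x - mean) ** 2 for x in block_num) / p),
    }


def task_pattern_metrics(assigned_num):
    """Compute maxAssignedNum and assignedNumSigma from the number of
    tasks assigned to each physical machine (one task per file block)."""
    p = len(assigned_num)
    n = sum(assigned_num)
    mean = n / p  # ideal number of tasks per machine
    return {
        "maxAssignedNum": max(assigned_num),
        "assignedNumSigma": sqrt(sum((x - mean) ** 2 for x in assigned_num) / p),
    }
```

For p = 8 and n = 8, `data_pattern_metrics(round_robin_allocation(8, 8), 8)` gives actualReplicaNum = 3, maxBlockNum = 3 and blockNumSigma = 0, matching the data-pattern values on the slide. A placement that puts all three replicas of a block on one physical machine gives actualReplicaNum = 1 for that block, which is the redundancy loss the strategy is meant to avoid.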