Subproject: Automated Large-Scale Data Partitioning and Integration
National University of Kaohsiung, Department of Electrical Engineering
賴智錦

Second-Quarter Results

What had been expected to do
• Understand how existing distributed languages and tools handle distributed operators and data reduction/assembly.
• Design algorithms for data partitioning and reassembly.
• Simulate the system on a single machine in C or a similar language.
• Refine the data partitioning and reassembly algorithms.

What were achieved
• Understood the execution model of MapReduce in the distributed tool Hadoop.
• Ran a single-machine system simulation in Java.

Any difficulties
• Suitable test cases are needed to drive the algorithm design.

Future tasks
• Strengthen our understanding of other distributed data partitioning/reassembly schemes.
• Gain a deeper understanding of MapReduce principles and applications.
• Try to develop cloud-computing programs that follow the MapReduce model.

Comments
• Whether the chosen test cases are representative enough to demonstrate the benefits of cloud computing.
• How the difficulties encountered during single-machine simulation were resolved.

Fig. 1. MapReduce data flow with a single reduce task
Fig. 2. MapReduce data flow with multiple reduce tasks

Environment
• VM image: hadoop_centos.vmx (Trend Micro)
• Machine name: hadoop
• Network: DHCP
• Virtual machine: VMware Player

Average Rating MapReduce Example
• Data set: Netflix Prize (17,770 files, 480,189 users)
• Goal: calculate the average movie rating per user
• Method: execute a MapReduce task over the dataset on a single node
• Source: http://archive.ics.uci.edu/ml/datasets/Netflix+Prize

Files in the example
• HadoopDriver.java: main program.
• UserRatingMapper.java: Mapper, emits each user's rating.
• AverageValueReducer.java: Reducer, computes the average rating per user.
• IntArrayWritable.java: shared Writable array type passed between map and reduce.

HadoopDriver.java

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HadoopDriver {
    public static void main(String[] args) {
        /* Require args to contain the paths */
        if (args.length != 1 && args.length != 2) {
            System.err.println("Error! 
Usage: \n" + "HadoopDriver <input dir> <output dir>\n" + "HadoopDriver <job.xml>");
            System.exit(1);
        }
        JobClient client = new JobClient();
        JobConf conf = null;
        if (args.length == 2) {
            conf = new JobConf(HadoopDriver.class);
            /* UserRatingMapper outputs (IntWritable, IntArrayWritable(Writable[2])) */
            conf.setMapOutputKeyClass(IntWritable.class);
            conf.setMapOutputValueClass(IntArrayWritable.class);
            /* AverageValueReducer outputs (IntWritable, FloatWritable) */
            conf.setOutputKeyClass(IntWritable.class);
            conf.setOutputValueClass(FloatWritable.class);
            /* Pull input and output Paths from the args */
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            /* Set the Mapper and Reducer classes. No combiner is set: a
             * combiner must be a Reducer whose input and output types both
             * match the map output types, which this job's classes do not
             * satisfy. */
            conf.setMapperClass(UserRatingMapper.class);
            conf.setReducerClass(AverageValueReducer.class);
            conf.set("mapred.child.java.opts", "-Xmx2048m");
        } else {
            conf = new JobConf(args[0]);
        }
        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

UserRatingMapper.java
• Mapper: emits each user's rating.

AverageValueReducer.java

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** This class reads a list of IntArrayWritables (rating, count pairs), and
 * emits the average value under the same key.
 *
 * Not much to it.
 * @author Daniel Jackson, Scott Griffin
 */
public class AverageValueReducer extends MapReduceBase
        implements Reducer<WritableComparable, Writable, WritableComparable, Writable> {

    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        int sum = 0, count = 0;
        IntArrayWritable ratingInput = null;
        Writable[] inputArray = null;
        while (values.hasNext()) {
            ratingInput = (IntArrayWritable) values.next();
            inputArray = ratingInput.get();
            sum += ((IntWritable) inputArray[0]).get();
            count += ((IntWritable) inputArray[1]).get();
        }
        /* Emit the average rating for this user. */
        output.collect(key, new FloatWritable(((float) sum) / count));
    }
}
/* Copyright (c) 2008 California Polytechnic State University
 * Licensed under the Creative Commons
 * Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0/
 */

IntArrayWritable.java

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

/* Must subclass ArrayWritable if it is to be the input to a Reduce function
 * because the valueClass is not written to the output. Wish there was
 * some documentation which said that...
 * @author Daniel
 */
public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable(Writable[] values) { super(IntWritable.class, values); }
    public IntArrayWritable() { super(IntWritable.class); }
    /* The valueClass arguments are ignored; the element type is always IntWritable. */
    public IntArrayWritable(Class valueClass, Writable[] values) { super(IntWritable.class, values); }
    public IntArrayWritable(Class valueClass) { super(IntWritable.class); }
    public IntArrayWritable(String[] strings) { super(strings); }
}
/* Copyright (c) 2008 California Polytechnic State University
 * Licensed under the Creative Commons
 * Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0/
 */

Netflix dataset (Mapper input)
Input: a movie-id header line ("movieId:"), then one "UserID, Rating, Date" record per line
7:
951709, 2, 2001-11-04
585247, 1, 2003-12-19
2625420, 2, 2004-06-03
2322468, 3, 2003-11-12
2056324, 2, 2002-11-10
1969230, 4, 2003-06-01

Emit: UserID, (Rating, Count)
951709, 2, 1
585247, 1, 1
2625420, 2, 1
2322468, 3, 1
2056324, 2, 1
1969230, 4, 1

Netflix dataset (Reducer output)
Input: UserID, list of (Rating, Count) pairs
951709: (2, 1), (4, 1), (3, 1), (1, 1)
585247: (1, 1), (2, 1), (3, 1)
2625420: (2, 1), (4, 1), (2, 1), (1, 1), (3, 1)
2322468: (3, 1), (4, 1), (3, 1), (4, 1)
2056324: (2, 1), (1, 1)
1969230: (4, 1), (2, 1)

Emit: UserID, (sum of ratings) / (sum of counts)
951709, 2.5
585247, 2
2625420, 2.4
2322468, 3.5
2056324, 1.5
1969230, 3

Netflix dataset: 500 files
• Total time: 18:17
• Total size: 54.9 MB

Netflix dataset: 1000 files
• Total time: 34:14
• Total size: 98.3 MB

Result: Netflix dataset 500, output file
Result: Netflix dataset 1000, output file
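The deck names UserRatingMapper.java but only the reducer source appears above. To make the end-to-end logic concrete, the following is a framework-free Java sketch of the same map/shuffle/reduce flow. The class RatingPipeline and its method names are illustrative, not project code, and the input line format is assumed from the Mapper-input slide (a "movieId:" header line, then "UserID, Rating, Date" records).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/* Framework-free sketch of the average-rating pipeline; names are
 * illustrative, not taken from the project source. */
public class RatingPipeline {

    /* Map step: parse "userId, rating, date" and emit (userId, rating, 1),
     * mirroring the (UserID, (Rating, Count)) pairs on the Mapper slide.
     * Movie-id header lines such as "7:" carry no rating and are skipped. */
    static int[] mapLine(String line) {
        if (line.endsWith(":")) return null;       // movie-id header line
        String[] parts = line.split(",");
        int userId = Integer.parseInt(parts[0].trim());
        int rating = Integer.parseInt(parts[1].trim());
        return new int[] { userId, rating, 1 };    // (key, rating, count)
    }

    /* Shuffle + reduce: group (rating, count) pairs by user, then emit
     * (sum of ratings) / (sum of counts), as AverageValueReducer does. */
    static Map<Integer, Float> run(List<String> lines) {
        Map<Integer, int[]> sums = new LinkedHashMap<>();  // userId -> {sum, count}
        for (String line : lines) {
            int[] kv = mapLine(line);
            if (kv == null) continue;
            int[] acc = sums.computeIfAbsent(kv[0], k -> new int[2]);
            acc[0] += kv[1];
            acc[1] += kv[2];
        }
        Map<Integer, Float> out = new LinkedHashMap<>();
        for (Map.Entry<Integer, int[]> e : sums.entrySet())
            out.put(e.getKey(), ((float) e.getValue()[0]) / e.getValue()[1]);
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("7:");
        lines.add("951709, 2, 2001-11-04");
        lines.add("951709, 4, 2002-01-01");
        lines.add("585247, 1, 2003-12-19");
        System.out.println(run(lines));  // {951709=3.0, 585247=1.0}
    }
}
```

Note that the sketch accumulates (sum, count) pairs rather than per-record averages: averages are not directly combinable, which is also why a Hadoop combiner for this job would have to emit partial (sum, count) pairs instead of reusing either of the job's existing classes.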