Automated Partitioning and Integration of Large-Scale Data

Subproject: Automated Partitioning and Integration of Large-Scale Data
National University of Kaohsiung, Department of Electrical Engineering
賴智錦
Second-Quarter Results

What was expected to be done
• Understand and become familiar with how existing distributed languages and tools handle distributed operators and data reduction/assembly.
• Design algorithms for data partitioning and reassembly.
• Simulate the system on a single machine in C or a similar language.
• Revise the data partitioning and reassembly algorithms.

What was achieved
• Understood the execution model of MapReduce in the distributed tool Hadoop.
• Simulated the system on a single machine in Java.

Any difficulties
• A suitable test example is needed in order to carry out the algorithm design.

Future tasks
• Strengthen the understanding of other distributed approaches to data partitioning/reassembly.
• Gain a deeper understanding of MapReduce principles and applications.
• Attempt to develop cloud-computing programs that follow the MapReduce model.

Comments
• Whether the characteristics of the test example are sufficient to demonstrate the benefits of cloud computing.
• How to resolve the difficulties encountered in the single-machine simulation.
Fig. 1. MapReduce data flow with a single reduce task
Fig. 2. MapReduce data flow with multiple reduce tasks
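With multiple reduce tasks (Fig. 2), each map output key is routed to exactly one reducer by a partition function. A minimal single-machine sketch in plain Java (no Hadoop dependency; the class name is ours, and `getPartition` mirrors the formula used by Hadoop's default HashPartitioner):

```java
public class HashPartitionDemo {
    // Same formula as Hadoop's default HashPartitioner:
    // identical keys always reach the same reduce task.
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        // Sample user IDs from the NetFlix example later in this report.
        for (int id : new int[] {951709, 585247, 2625420, 2322468}) {
            System.out.println(id + " -> reducer "
                + getPartition(Integer.valueOf(id), numReduceTasks));
        }
    }
}
```

Because `Integer.hashCode()` is the value itself, user 951709 lands on reducer 951709 mod 3 = 1; all (rating, 1) pairs for that user therefore meet at one reducer.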
Operating environment: hadoop_centos.vmx (Trend Micro)
Machine name: hadoop
Network environment: DHCP
Virtual machine: VMplayer
Average Rating MapReduce Example
• Data set (Netflix Prize, 17,770 files, 480,189 users)
• To calculate the average movie rating per user
• Execute a MapReduce task over the dataset on a single node
• Source: http://archive.ics.uci.edu/ml/datasets/Netflix+Prize
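The single-node run can also be simulated without Hadoop: a map step parses "userID, rating, date" lines into (userID, (rating, 1)) pairs, and a reduce step averages them per user. A minimal plain-Java sketch (the class name is ours; the sample lines are the movie-7 excerpt shown later in this report):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AverageRatingSimulation {
    /* Map + reduce in one pass on a single machine:
     * map: "userID, rating, date" -> (userID, (rating, 1))
     * reduce: per user, ratingSum / ratingCount */
    public static Map<Integer, Float> averageRatings(List<String> lines) {
        Map<Integer, int[]> acc = new HashMap<>(); // userID -> {ratingSum, count}
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.endsWith(":")) continue; // "movieID:" header
            String[] f = line.split(",");
            int user = Integer.parseInt(f[0].trim());
            int rating = Integer.parseInt(f[1].trim());
            int[] a = acc.computeIfAbsent(user, k -> new int[2]);
            a[0] += rating; // rating sum
            a[1] += 1;      // rating count
        }
        Map<Integer, Float> out = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : acc.entrySet()) {
            out.put(e.getKey(), (float) e.getValue()[0] / e.getValue()[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> movie7 = Arrays.asList(
            "7:",
            "951709, 2, 2001-11-04",
            "585247, 1, 2003-12-19",
            "2625420, 2, 2004-06-03",
            "2322468, 3, 2003-11-12",
            "2056324, 2, 2002-11-10",
            "1969230, 4, 2003-06-01");
        System.out.println(averageRatings(movie7));
    }
}
```

Each user in this excerpt rated movie 7 only once, so every average equals that single rating; across the full 17,770 files the (rating, 1) pairs for a user accumulate before the division.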
Average Rating MapReduce Example

HadoopDriver.java
• Main program: configures and submits the job.

UserRatingMapper.java
• Mapper: emits each user's rating.

AverageValueReducer.java
• Reducer: computes the average of the ratings collected for each user.

IntArrayWritable.java
• Declares the shared Writable-array value type.
HadoopDriver.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HadoopDriver {
    public static void main(String[] args) {
        /* Require args to contain the paths */
        if (args.length != 1 && args.length != 2) {
            System.err.println("Error! Usage:\n"
                + "HadoopDriver <input dir> <output dir>\n"
                + "HadoopDriver <job.xml>");
            System.exit(1);
        }
        JobClient client = new JobClient();
        JobConf conf = null;
        if (args.length == 2) {
            conf = new JobConf(HadoopDriver.class);
            /* UserRatingMapper outputs (IntWritable, IntArrayWritable(Writable[2])) */
            conf.setMapOutputKeyClass(IntWritable.class);
            conf.setMapOutputValueClass(IntArrayWritable.class);
            /* AverageValueReducer outputs (IntWritable, FloatWritable) */
            conf.setOutputKeyClass(IntWritable.class);
            conf.setOutputValueClass(FloatWritable.class);
            /* Pull input and output Paths from the args */
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            /* Set the Mapper and Reducer classes. No combiner is set: a
             * combiner must be a Reducer, and AverageValueReducer emits
             * floats, so it cannot be reused for partial aggregation here. */
            conf.setMapperClass(UserRatingMapper.class);
            conf.setReducerClass(AverageValueReducer.class);
            conf.set("mapred.child.java.opts", "-Xmx2048m");
        } else {
            conf = new JobConf(args[0]);
        }
        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
UserRatingMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
/* Reconstructed sketch: the original slide repeated the AverageValueReducer
 * source under this heading. This mapper is inferred from the dataset format
 * and the (UserID, (rating, 1)) emit format described later in this report:
 * it reads "userID, rating, date" lines and emits (userID, [rating, 1]). */
public class UserRatingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, IntArrayWritable> {
    public void map(LongWritable key, Text value,
            OutputCollector<IntWritable, IntArrayWritable> output,
            Reporter reporter) throws IOException {
        String line = value.toString().trim();
        /* Skip blank lines and the "movieID:" header line of each file */
        if (line.isEmpty() || line.endsWith(":")) return;
        String[] fields = line.split(",");
        int userId = Integer.parseInt(fields[0].trim());
        int rating = Integer.parseInt(fields[1].trim());
        output.collect(new IntWritable(userId),
            new IntArrayWritable(new Writable[] {
                new IntWritable(rating), new IntWritable(1) }));
    }
}
AverageValueReducer.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
/** This class reads a list of IntWritables, and emits the average value under
 * the same key.
 *
 * Not much to it.
 *
 * @author Daniel Jackson, Scott Griffin */
public class AverageValueReducer extends MapReduceBase
        implements Reducer<WritableComparable, Writable, WritableComparable, Writable> {
    public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
        int sum = 0, count = 0;
        IntArrayWritable ratingInput = null;
        Writable[] inputArray = null;
        while (values.hasNext()) {
            ratingInput = (IntArrayWritable) values.next();
            inputArray = ratingInput.get();
            sum += ((IntWritable) inputArray[0]).get();
            count += ((IntWritable) inputArray[1]).get();
        }
        output.collect(key, new FloatWritable(((float) sum) / count));
    }
}
/* Copyright © 2008 California Polytechnic State University
 * Licensed under the Creative Commons
 * Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0/
 */
IntArrayWritable.java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
/*
 * Must subclass ArrayWritable if it is to be the input to a Reduce function
 * because the valueClass is not written to the output. Wish there was
 * some documentation which said that...
 * @author Daniel
 */
public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable(Writable[] values) {
        super(IntWritable.class, values);
    }
    public IntArrayWritable() {
        super(IntWritable.class);
    }
    /* The valueClass parameter is ignored; elements are always IntWritable. */
    public IntArrayWritable(Class valueClass, Writable[] values) {
        super(IntWritable.class, values);
    }
    public IntArrayWritable(Class valueClass) {
        super(IntWritable.class);
    }
    public IntArrayWritable(String[] strings) {
        super(strings);
    }
}
/* Copyright © 2008 California Polytechnic State University
 * Licensed under the Creative Commons
 * Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0/
 */
NetFlix dataset (Input → Mapper)
Input (lines of "UserID, Rating value, Date" under a movie-ID header):
7:
951709, 2, 2001-11-04
585247, 1, 2003-12-19
2625420, 2, 2004-06-03
2322468, 3, 2003-11-12
2056324, 2, 2002-11-10
1969230, 4, 2003-06-01
Emit:
⟨UserID, (Rating value, Rating count)⟩
⟨951709, (2, 1)⟩
⟨585247, (1, 1)⟩
⟨2625420, (2, 1)⟩
⟨2322468, (3, 1)⟩
⟨2056324, (2, 1)⟩
⟨1969230, (4, 1)⟩
NetFlix dataset (Reducer → Output)
Input:
⟨UserID, [(Rating value, Rating count), …]⟩
⟨951709, [(2, 1), (4, 1), (3, 1), (1, 1)]⟩
⟨585247, [(1, 1), (2, 1), (3, 1)]⟩
⟨2625420, [(2, 1), (4, 1), (2, 1), (1, 1), (3, 1)]⟩
⟨2322468, [(3, 1), (4, 1), (3, 1), (4, 1)]⟩
⟨2056324, [(2, 1), (1, 1)]⟩
⟨1969230, [(4, 1), (2, 1)]⟩
Emit:
⟨UserID, rating sum / rating count⟩
⟨951709, 2.5⟩
⟨585247, 2⟩
⟨2625420, 2.4⟩
⟨2322468, 3.5⟩
⟨2056324, 1.5⟩
⟨1969230, 3⟩
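The averages above can be checked by mirroring the reducer's sum/count loop in plain Java (a hypothetical helper class, no Hadoop dependency): for user 951709 the pairs (2, 1), (4, 1), (3, 1), (1, 1) give 10 / 4 = 2.5.

```java
public class ReducerCheck {
    /* Mirrors AverageValueReducer: sum the (rating, count) pairs for one
     * user, then divide rating sum by rating count. */
    public static float average(int[][] pairs) {
        int sum = 0, count = 0;
        for (int[] p : pairs) {
            sum += p[0];
            count += p[1];
        }
        return (float) sum / count;
    }

    public static void main(String[] args) {
        // User 951709: (2,1), (4,1), (3,1), (1,1) -> 10 / 4
        System.out.println(average(new int[][] {{2, 1}, {4, 1}, {3, 1}, {1, 1}})); // prints 2.5
        // User 2625420: (2,1), (4,1), (2,1), (1,1), (3,1) -> 12 / 5
        System.out.println(average(new int[][] {{2, 1}, {4, 1}, {2, 1}, {1, 1}, {3, 1}})); // prints 2.4
    }
}
```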
NetFlix dataset, 500 files: total time = 18:17, total size = 54.9 MB
NetFlix dataset, 1000 files: total time = 34:14, total size = 98.3 MB
Result
NetFlix dataset: 500 Output File
Result
NetFlix dataset: 1000 Output File