Apache Bigtop

Week 9: Integration Testing, M/R Coding
Administration
• Yahoo field trip: how Hadoop components are used in a production environment. Attendees must be registered as working group members/B/C members
• MSFT Azure talk; volunteers wanted as tech leads to port Bigtop to Azure
• Roman’s Yahoo HUG presentation next week
• Move to ground floor next week?
• Machine Learning Solution Architect, 2/16
• List
Review from last time
• Hive/Pig/HBase data layer for integration tests
• HBase upgrade to 0.92
• JAVA_LIBRARY_PATH for the JVM to point to Hadoop's native .so libs
• hadoop classpath to print out the classpath for debugging
• HBase 0.92 guesses where Hadoop is installed using HADOOP_HOME
• /etc/hostname misconfigured on EC2
Bigtop Data Integration Layer: Hive, Pig, HBase
• Hive:
– Create a separate Java project
– Install Hive locally; verify you can run the command line: > show tables;
Hive Data Layer
• Import all the jars under hive-0.8.1/lib into Eclipse
Hive Notes
• Hive has 2 configurations, embedded and server.
• To start the server:
– Set HADOOP_HEAPSIZE to 1024 by copying hive-env.sh.template to hive-env.sh and uncommenting the HADOOP_HEAPSIZE setting.
– source ~/hive-0.8.1/conf/hive-env.sh
– Verify: echo $HADOOP_HEAPSIZE
Start Hive Server from the Command Line
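With Hive 0.8.x the standalone server is typically started with hive --service hiveserver; it listens on port 10000 by default, which is the port the JDBC URL below connects to.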
Hive Notes: Increase Heap Size
Hive Run JDBC Commands
• Like connecting to a MySQL/Oracle/MSFT database
• Create a Connection, PreparedStatement, ResultSet
• Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
• Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
• The driver class ships in the Hive jars imported earlier
Hive JDBC Prepared Statement
• The CREATE TABLE statement is different
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
stmt.executeQuery("drop table " + tableName);
ResultSet res = stmt.executeQuery("create table " + tableName
    + " (key int, value string) ROW FORMAT delimited fields terminated by '\t'");
Verification: server running and table printout in the Eclipse output
Hive Eclipse/Java Code
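This slide showed the finished program in Eclipse. As a reference, a minimal self-contained sketch assembling the snippets above, modeled on the stock Hive 0.8 JDBC client example (class name illustrative; error handling omitted):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // Register the Hive JDBC driver; the hive-0.8.1/lib jars must be on the classpath
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect to the standalone Hive server (default port 10000)
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        stmt.executeQuery("create table " + tableName
                + " (key int, value string) ROW FORMAT delimited fields terminated by '\t'");

        // Verify: print the table name back out
        ResultSet res = stmt.executeQuery("show tables '" + tableName + "'");
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        con.close();
    }
}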
Pig, uses Pig Util Class
• Util is not in pig-xxx.jar, only in the test package
• Local mode only; distributed mode not debugged
Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse"));
PigServer ps = new PigServer(ExecType.LOCAL);
ps.setBatchOn();
Pig Example
String first = " nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); ";
String second = "B = foreach nyse generate symbol, dividends;";
String third = " store B into 'nyse'; ";
Util.registerMultiLineQuery(ps, first + second + third);
ps.executeBatch();
ps.shutdown();
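Put together, the whole example might look like the sketch below (assumes the Pig 0.9.2 jar plus the test jar containing org.apache.pig.test.Util are on the classpath; paths are the ones from the slides):

import java.io.File;
import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.test.Util;   // test-only helper, not in pig-xxx.jar

public class PigLocalExample {
    public static void main(String[] args) throws IOException {
        // Clear any previous output; 'store' fails if the directory already exists
        Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse"));

        PigServer ps = new PigServer(ExecType.LOCAL);
        ps.setBatchOn();

        String first = " nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); ";
        String second = "B = foreach nyse generate symbol, dividends;";
        String third = " store B into 'nyse'; ";

        Util.registerMultiLineQuery(ps, first + second + third);
        ps.executeBatch();
        ps.shutdown();
    }
}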
Pig Example Output
12/02/11 14:07:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///
12/02/11 14:07:59 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
12/02/11 14:08:00 INFO rules.ColumnPruneVisitor: Columns pruned for nyse: $0, $2
12/02/11 14:08:01 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
12/02/11 14:08:01 INFO pigstats.ScriptState: Pig script settings are added to the job
12/02/11 14:08:01 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/02/11 14:08:02 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/02/11 14:08:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/02/11 14:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/02/11 14:08:02 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/02/11 14:08:02 INFO input.FileInputFormat: Total input paths to process : 1
12/02/11 14:08:02 INFO util.MapRedUtil: Total input paths to process : 1
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 0% complete
12/02/11 14:08:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/02/11 14:08:04 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/02/11 14:08:04 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0001
12/02/11 14:08:05 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/02/11 14:08:05 INFO mapred.LocalJobRunner:
12/02/11 14:08:05 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/02/11 14:08:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/dc/pig-0.9.2/nyse
12/02/11 14:08:07 INFO mapred.LocalJobRunner:
12/02/11 14:08:07 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/02/11 14:08:09 WARN pigstats.PigStatsUtil: Failed to get RunningJob for job job_local_0001
12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Detected Local mode. Stats reported below may be incomplete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Script Statistics:
HadoopVersion    PigVersion    UserId    StartedAt              FinishedAt             Features
0.20.205.0       0.9.2         dc        2012-02-11 14:08:01    2012-02-11 14:08:09    UNKNOWN

Success!

Job Stats (time in seconds):
JobId             Alias     Feature     Outputs
job_local_0001    B,nyse    MAP_ONLY    file:///Users/dc/pig-0.9.2/nyse,

Input(s):
Successfully read records from: "/Users/dc/programmingpig/data/NYSE_dividends"

Output(s):
Successfully stored records in: "file:///Users/dc/pig-0.9.2/nyse"

Job DAG:
job_local_0001

12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: Success!
M/R Pattern Design Review
• Why? A correctly designed M/R program that runs faster on a cluster than on an individual machine exercises all the components of an M/R cluster.
• Cluster experience with big data on AWS
• Important when migrating to production processes
• Design patterns in the sample HadoopExamples.jar:
• WordCount
• Word Count Aggregation
• MultiFileWordCount
M/R Design Pattern Review
• Word Count
From Lin/Dyer, Data-Intensive Text Processing with MapReduce
M/R Adding Array to Mapper Output
• WordCount design process
– Mapper(contents of file, tokenize, output) <Object, Text, Text, IntWritable>. Object = file descriptor, Text = file line, Text = word, IntWritable = 1.
– 2 steps to mapper design: 1) split up the input, then 2) output K,V to the reducer.
– First step: copy the mapper output K,V to the reducer. Reducer(collect mapper output) <Text, IntWritable, Text, IntWritable>. Second step: the final output form (see the reducer sketch below).
• Replace the IntWritable with an ArrayList
• Why?
From Lin/Dyer, Data-Intensive Text Processing with MapReduce
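The collect step from the design above, written as the familiar sum-reducer shape from the stock Hadoop examples (class name illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// <Text, IntWritable, Text, IntWritable>: (word, 1s) in, (word, total) out
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}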
Word Count Notes
• Remove ctors() from map(): construct the output Writables once, as fields, and reuse them across calls, as in the sketch below
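Concretely, in a WordCount mapper the output Text and IntWritable become fields constructed once per task instead of once per map() call (class name illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// <Object, Text, Text, IntWritable>: (file position, line) in, (word, 1) out
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);   // ctor outside map()
    private final Text word = new Text();                        // reused across calls

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}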
Hadoop Avg Coding Demo
• Create an AvgPair object, implements Writable
• Create ivars: sum, count, key
• Auto-generate methods for the ivars
• Implement the write and readFields methods
• Put the ctors outside map()
• Run using the M/R plugin
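A minimal sketch of such an AvgPair (the field types are an assumption; the key ivar from the slide is left out here since M/R already carries the key alongside the value):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Carries a running sum and count so the average can be computed at the
// end without losing information when partial results are combined.
public class AvgPair implements Writable {
    private long sum;
    private long count;

    public AvgPair() {}   // Hadoop needs the no-arg ctor for deserialization

    public void set(long sum, long count) { this.sum = sum; this.count = count; }
    public long getSum() { return sum; }
    public long getCount() { return count; }
    public double average() { return count == 0 ? 0.0 : (double) sum / count; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(sum);
        out.writeLong(count);
    }

    public void readFields(DataInput in) throws IOException {
        sum = in.readLong();
        count = in.readLong();
    }
}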
NXServer/NXClient
• Remote Desktop to EC2
• 2 options
– 1) Use the prepared AMI by Eric Hammond
– 2) Install NXServer yourself
Prepared AMI
• http://aws.amazon.com/amis/Europe/1950
• US East AMI ID: ami-caf615a3
• Ubuntu 9.04 Jaunty Desktop with NX Server Free Edition
• Update the repos to newer versions of Ubuntu
• Create a new user
Ubuntu Create new user
• The setup script is specific to this AMI
• > user-setup
Verify login from desktop
• Created user dc, password dc
Download NXPlayer and install it
• Create a new connection and enter the IP address
Login with username/password
Ubuntu Desktop
Installing NXServer
• Read the logs in /usr/NX/var/log/install
• If installed correctly, you should see the NX daemons running
Create user
Configure sshd
• sudo nano /etc/ssh/sshd_config
Verify ssh login
Same process as before with NXPlayer
• Enter the IP and username/password
Clone the instance store if you can't get the NXServer to work
• Problem: the EasyNXServer method uses an instance store. How do you clone it to an EBS volume?
• Create a blank volume; the default attach point is /dev/sdf
sudo mkfs.ext3 /dev/sdf
sudo mkdir /newvolume
sudo mount /dev/sdf /newvolume
rsync copy instance store to EBS
• Copy the instance store volume to EBS:
• rsync -aHxv / /newvolume
• Create further snapshots; create an AMI by specifying the kernel, etc.