Apache Bigtop Week 9
Integration Testing, M/R Coding

Administration
• Yahoo field trip: how Hadoop components are used in a production environment. Attendees have to be registered as working group members / B/C members.
• MSFT Azure talk; volunteers wanted as tech leads to port Bigtop to Azure.
• Roman's Yahoo HUG presentation next week.
• Move to the ground floor next week?
• Machine Learning Solution Architect, 2/16.

List Review from last time
• Hive/Pig/HBase data layer for integration tests.
• HBase upgrade to 0.92.
• JAVA_LIBRARY_PATH for the JVM to point at the .so native libs for Hadoop.
• hadoop classpath to print out the classpath for debugging.
• HBase 0.92 guesses where Hadoop is using HADOOP_HOME.
• /etc/hostname screwed up on EC2.

Bigtop Data Integration Layer: Hive, Pig, HBase

Hive
• Create a separate Java project.
• Install Hive locally and verify you can run the command line: > show tables;

Hive Data Layer
• Import all the jars under hive-0.8.1/lib into Eclipse.

Hive Notes
• Hive has two configurations, embedded and server.
• To start the server:
  – Set HADOOP_HEAPSIZE to 1024 by copying hive-env.sh.template to hive-env.sh and uncommenting the HADOOP_HEAPSIZE setting.
  – source ~/hive-0.8.1/conf/hive-env.sh
  – Verify: echo $HADOOP_HEAPSIZE
• (Screenshots: start the Hive server from the command line; Hive command-line server; increase heap size.)

Hive Run JDBC Commands
• Like connecting to a MySQL/Oracle/MSFT database.
• Create a Connection, PreparedStatement, ResultSet:
  Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
• The driver is in the Hive jar.

Hive JDBC Prepared Statement
• The create-table statement is different:
  Statement stmt = con.createStatement();
  String tableName = "testHiveDriverTable";
  stmt.executeQuery("drop table " + tableName);
  ResultSet res = stmt.executeQuery("create table " + tableName
      + " (key int, value string) ROW FORMAT delimited fields terminated by '\t'");
• Verification – server running and table printout. (Screenshots: Eclipse output; Hive Eclipse/Java code.)

Pig, uses Pig Util Class
• Util is not in the pig-x.y.z.jar, only in the test package.
• Local mode only; distributed mode not debugged.
  Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse"));
  PigServer ps = new PigServer(ExecType.LOCAL);
  ps.setBatchOn();

Pig Example
  String first = "nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' "
      + "as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);";
  String second = "B = foreach nyse generate symbol, dividends;";
  String third = "store B into 'nyse';";

  Util.registerMultiLineQuery(ps, first + second + third);
  ps.executeBatch();
  ps.shutdown();
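Because Util ships only with Pig's test sources, a plain pig-0.9.x dependency will not provide registerMultiLineQuery or deleteDirectory. As a point of comparison, here is a minimal sketch of the same local-mode flow using only the public PigServer API; the class name NyseDividendsLocal is made up for illustration, and the paths are the ones from the example above.

    import java.io.IOException;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class NyseDividendsLocal {
        public static void main(String[] args) throws IOException {
            // Local mode: reads and writes the local file system, no cluster needed.
            PigServer ps = new PigServer(ExecType.LOCAL);
            ps.setBatchOn();

            // Same three statements as above, registered one at a time.
            // Note: as with Util.deleteDirectory above, the 'nyse' output
            // directory must not already exist when the store runs.
            ps.registerQuery("nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' "
                    + "as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);");
            ps.registerQuery("B = foreach nyse generate symbol, dividends;");
            ps.registerQuery("store B into 'nyse';");

            ps.executeBatch();
            ps.shutdown();
        }
    }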
Pig Example Output
  12/02/11 14:07:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///
  12/02/11 14:07:59 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
  12/02/11 14:08:00 INFO rules.ColumnPruneVisitor: Columns pruned for nyse: $0, $2
  12/02/11 14:08:01 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
  12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
  12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
  12/02/11 14:08:01 INFO pigstats.ScriptState: Pig script settings are added to the job
  12/02/11 14:08:01 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
  12/02/11 14:08:02 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
  12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
  12/02/11 14:08:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  12/02/11 14:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  12/02/11 14:08:02 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
  12/02/11 14:08:02 INFO input.FileInputFormat: Total input paths to process : 1
  12/02/11 14:08:02 INFO util.MapRedUtil: Total input paths to process : 1
  12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 0% complete
  12/02/11 14:08:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
  12/02/11 14:08:04 INFO mapred.Task: Using ResourceCalculatorPlugin : null
  12/02/11 14:08:04 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0001
  12/02/11 14:08:05 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
  12/02/11 14:08:05 INFO mapred.LocalJobRunner:
  12/02/11 14:08:05 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
  12/02/11 14:08:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/dc/pig-0.9.2/nyse
  12/02/11 14:08:07 INFO mapred.LocalJobRunner:
  12/02/11 14:08:07 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
  12/02/11 14:08:09 WARN pigstats.PigStatsUtil: Failed to get RunningJob for job job_local_0001
  12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: 100% complete
  12/02/11 14:08:09 INFO pigstats.SimplePigStats: Detected Local mode. Stats reported below may be incomplete
  12/02/11 14:08:09 INFO pigstats.SimplePigStats: Script Statistics:

  HadoopVersion   PigVersion   UserId   StartedAt             FinishedAt            Features
  0.20.205.0      0.9.2        dc       2012-02-11 14:08:01   2012-02-11 14:08:09   UNKNOWN

  Success!

  Job Stats (time in seconds):
  JobId            Alias    Feature    Outputs
  job_local_0001   B,nyse   MAP_ONLY   file:///Users/dc/pig-0.9.2/nyse,

  Input(s):
  Successfully read records from: "/Users/dc/programmingpig/data/NYSE_dividends"

  Output(s):
  Successfully stored records in: "file:///Users/dc/pig-0.9.2/nyse"

  Job DAG:
  job_local_0001

  12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: Success!

M/R Pattern Design Review
• Why? A correctly designed M/R program that runs faster than a single machine exercises all the components of an M/R cluster.
• Cluster experience with big data on AWS.
• Important when migrating to production processes.

Design Patterns in the Sample HadoopExamples.jar
• WordCount
• Word Count Aggregation
• MultiFileWordCount

M/R Design Pattern Review
• Word Count (from Lin/Dyer, Data-Intensive Text Processing with MapReduce).
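For reference, a minimal word-count sketch against the org.apache.hadoop.mapreduce API. This is not the version shipped in HadoopExamples.jar, just an illustration of the <Object, Text, Text, IntWritable> mapper and <Text, IntWritable, Text, IntWritable> reducer signatures walked through in the next section; note that the reusable Text and IntWritable objects are created outside map(), per the Word Count Notes below.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {

        // Step 1: split each input line into tokens and emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();   // created once, reused across map() calls

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Step 2: sum the 1s for each word and emit (word, total).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }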
M/R Adding Array to Mapper Output
• WordCount design process:
  – Mapper(contents of file, tokenize, output) <Object, Text, Text, IntWritable>: Object = file descriptor, Text = file line, Text = word, IntWritable = 1. Two steps to mapper design: 1) split up the input, then 2) output the K,V pairs to the reducer.
  – First step: copy the Mapper output K,V to the reducer. Reducer(collect mapper output) <Text, IntWritable, Text, IntWritable>. Second step: the final output form.
• Replace the IntWritable with an ArrayList. Why? (From Lin/Dyer, Data-Intensive Text Processing with MapReduce.)

Word Count Notes
• Remove the ctors() from map().

Hadoop Avg Coding Demo (a sketch of the AvgPair class appears at the end of these notes)
• Create an AvgPair object that implements Writable.
• Create ivars: sum, count, key.
• Auto-generate methods for the ivars.
• Implement the write() and readFields() methods.
• Put the ctors outside map().
• Run using the M/R plugin.

NXServer/NXClient
• Remote desktop to EC2.
• 2 options:
  – 1) Use the prepared AMI by Eric Hammond.
  – 2) Install NXServer yourself.

Prepared AMI
• http://aws.amazon.com/amis/Europe/1950
• US East AMI ID: ami-caf615a3
• Ubuntu 9.04 Jaunty Desktop with NX Server Free Edition.
• Update the repos to newer versions of Ubuntu.
• Create a new user.

Ubuntu: Create new user
• Script for this AMI only: > user-setup

Verify login from desktop
• Created user dc, password dc.

Download NX player, install
• Create a new connection and enter the IP address.
• Log in with the username/password. (Screenshot: Ubuntu desktop.)

Installing NXServer
• Read the logs under /usr/NX/var/log/install.
• If installed correctly, you should see the daemons running.
• Create a user.
• Configure sshd: sudo nano /etc/ssh/sshd_config
• Verify ssh login.
• Same process as before with NX player: enter the IP and username/password.

Clone the instance store if you can't get NXServer to work
• Problem: the EasyNXServer method uses an instance store. How to clone it to an EBS volume?
• Create a blank volume; the default attach point is /dev/sdf:
  mkfs.ext3 /dev/sdf
  mkdir /newvolume
  sudo mount /dev/sdf /newvolume

rsync copy instance store to EBS
• Copy the instance-store volume to EBS:
  rsync -aHxv / /newvolume
• Create further snapshots, then create an AMI by specifying the kernel, etc.
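Referring back to the Hadoop Avg Coding Demo above: the following is a hedged sketch of what an AvgPair Writable along those lines might look like. Only the class name and the ivars (sum, count, key) come from the bullet list; the field types, method bodies, and average() helper are assumptions, not the code shown in class.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // Carries a running sum and count so a reducer (or combiner) can
    // compute an average without losing information.
    public class AvgPair implements Writable {
        private double sum;
        private long count;
        private String key;

        public AvgPair() {
            // No-arg ctor required by Hadoop's Writable serialization.
        }

        public AvgPair(String key, double sum, long count) {
            this.key = key;
            this.sum = sum;
            this.count = count;
        }

        // Auto-generated-style accessors for the ivars.
        public double getSum()   { return sum; }
        public long   getCount() { return count; }
        public String getKey()   { return key; }

        public void setSum(double sum)   { this.sum = sum; }
        public void setCount(long count) { this.count = count; }
        public void setKey(String key)   { this.key = key; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeDouble(sum);
            out.writeLong(count);
            out.writeUTF(key == null ? "" : key);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            sum = in.readDouble();
            count = in.readLong();
            key = in.readUTF();
        }

        public double average() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

Per the "Put the ctors outside map()" bullet, an instance would be created once as a mapper field and refilled with the setters on each map() call rather than constructed per record.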