Apache Bigtop Week 9
Integration Testing, M/R Coding

Administration
• Yahoo field trip: how Hadoop components are used in a production environment. Attendees have to be registered as working group members / B/C members.
• MSFT Azure talk; volunteers wanted as tech leads to port Bigtop to Azure.
• Roman's Yahoo HUG presentation next week.
• Move to the ground floor next week?
• Machine Learning Solution Architect, 2/16.

List Review from last time
• Hive/Pig/HBase data layer for integration tests.
• HBase upgrade to 0.92.
• JAVA_LIBRARY_PATH for the JVM to point at the .so native libs for Hadoop.
• hadoop classpath to print out the classpath for debugging.
• HBase 0.92 guesses where Hadoop is using HADOOP_HOME.
• /etc/hostname screwed up on EC2.

Bigtop Data Integration Layer: Hive, Pig, HBase

Hive
• Create a separate Java project.
• Install Hive locally and verify you can run the command line: > show tables;

Hive Data Layer
• Import all the jars under hive-0.8.1/lib into Eclipse.

Hive Notes
• Hive has two configurations, embedded and server.
• To start the server:
  – Set HADOOP_HEAPSIZE to 1024 by copying hive-env.sh.template to hive-env.sh and uncommenting the HADOOP_HEAPSIZE setting.
  – source ~/hive-0.8.1/conf/hive-env.sh
  – Verify: echo $HADOOP_HEAPSIZE
• (Screenshots: start the Hive server from the command line; Hive command-line server; increase heap size.)

Hive Run JDBC Commands
• Like connecting to a MySQL/Oracle/MSFT database.
• Create a Connection, PreparedStatement, ResultSet:
  Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
• The driver is in the Hive jar.

Hive JDBC Prepared Statement
• The create-table statement is different:
  Statement stmt = con.createStatement();
  String tableName = "testHiveDriverTable";
  stmt.executeQuery("drop table " + tableName);
  ResultSet res = stmt.executeQuery("create table " + tableName
      + " (key int, value string) ROW FORMAT delimited fields terminated by '\t'");
• Verification – server running and table printout. (Screenshots: Eclipse output; Hive Eclipse/Java code.)

Pig, uses Pig Util Class
• Util is not in the pig-x.y.z.jar, only in the test package.
• Local mode only; distributed mode not debugged.
  Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse"));
  PigServer ps = new PigServer(ExecType.LOCAL);
  ps.setBatchOn();

Pig Example
  String first = "nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' "
      + "as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);";
  String second = "B = foreach nyse generate symbol, dividends;";
  String third = "store B into 'nyse';";

  Util.registerMultiLineQuery(ps, first + second + third);
  ps.executeBatch();
  ps.shutdown();
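Because Util ships only with Pig's test sources, a plain pig-0.9.x dependency will not provide registerMultiLineQuery or deleteDirectory. As a point of comparison, here is a minimal sketch of the same local-mode flow using only the public PigServer API; the class name NyseDividendsLocal is made up for illustration, and the paths are the ones from the example above.

    import java.io.IOException;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class NyseDividendsLocal {
        public static void main(String[] args) throws IOException {
            // Local mode: reads and writes the local file system, no cluster needed.
            PigServer ps = new PigServer(ExecType.LOCAL);
            ps.setBatchOn();

            // Same three statements as above, registered one at a time.
            // Note: as with Util.deleteDirectory above, the 'nyse' output
            // directory must not already exist when the store runs.
            ps.registerQuery("nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' "
                    + "as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);");
            ps.registerQuery("B = foreach nyse generate symbol, dividends;");
            ps.registerQuery("store B into 'nyse';");

            ps.executeBatch();
            ps.shutdown();
        }
    }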
Pig Example Output
  12/02/11 14:07:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///
  12/02/11 14:07:59 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
  12/02/11 14:08:00 INFO rules.ColumnPruneVisitor: Columns pruned for nyse: $0, $2
  12/02/11 14:08:01 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
  12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
  12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
  12/02/11 14:08:01 INFO pigstats.ScriptState: Pig script settings are added to the job
  12/02/11 14:08:01 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
  12/02/11 14:08:02 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
  12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
  12/02/11 14:08:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  12/02/11 14:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
  12/02/11 14:08:02 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
  12/02/11 14:08:02 INFO input.FileInputFormat: Total input paths to process : 1
  12/02/11 14:08:02 INFO util.MapRedUtil: Total input paths to process : 1
  12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 0% complete
  12/02/11 14:08:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
  12/02/11 14:08:04 INFO mapred.Task: Using ResourceCalculatorPlugin : null
  12/02/11 14:08:04 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0001
  12/02/11 14:08:05 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
  12/02/11 14:08:05 INFO mapred.LocalJobRunner:
  12/02/11 14:08:05 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
  12/02/11 14:08:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/dc/pig-0.9.2/nyse
  12/02/11 14:08:07 INFO mapred.LocalJobRunner:
  12/02/11 14:08:07 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
  12/02/11 14:08:09 WARN pigstats.PigStatsUtil: Failed to get RunningJob for job job_local_0001
  12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: 100% complete
  12/02/11 14:08:09 INFO pigstats.SimplePigStats: Detected Local mode. Stats reported below may be incomplete
  12/02/11 14:08:09 INFO pigstats.SimplePigStats: Script Statistics:

  HadoopVersion   PigVersion   UserId   StartedAt             FinishedAt            Features
  0.20.205.0      0.9.2        dc       2012-02-11 14:08:01   2012-02-11 14:08:09   UNKNOWN

  Success!

  Job Stats (time in seconds):
  JobId            Alias    Feature    Outputs
  job_local_0001   B,nyse   MAP_ONLY   file:///Users/dc/pig-0.9.2/nyse,

  Input(s):
  Successfully read records from: "/Users/dc/programmingpig/data/NYSE_dividends"

  Output(s):
  Successfully stored records in: "file:///Users/dc/pig-0.9.2/nyse"

  Job DAG:
  job_local_0001

  12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: Success!

M/R Pattern Design Review
• Why? A correctly designed M/R program that runs faster than a single machine exercises all the components of an M/R cluster.
• Cluster experience with big data on AWS.
• Important when migrating to production processes.

Design Patterns in the Sample HadoopExamples.jar
• WordCount
• Word Count Aggregation
• MultiFileWordCount

M/R Design Pattern Review
• Word Count (from Lin/Dyer, Data-Intensive Text Processing with MapReduce).
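For reference, a minimal word-count sketch against the org.apache.hadoop.mapreduce API. This is not the version shipped in HadoopExamples.jar, just an illustration of the <Object, Text, Text, IntWritable> mapper and <Text, IntWritable, Text, IntWritable> reducer signatures walked through in the next section; note that the reusable Text and IntWritable objects are created outside map(), per the Word Count Notes below.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {

        // Step 1: split each input line into tokens and emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();   // created once, reused across map() calls

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Step 2: sum the 1s for each word and emit (word, total).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }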
M/R Adding Array to Mapper Output
• WordCount design process:
  – Mapper(contents of file, tokenize, output) <Object, Text, Text, IntWritable>: Object = file descriptor, Text = file line, Text = word, IntWritable = 1. Two steps to mapper design: 1) split up the input, then 2) output the K,V pairs to the reducer.
  – First step: copy the Mapper output K,V to the reducer. Reducer(collect mapper output) <Text, IntWritable, Text, IntWritable>. Second step: the final output form.
• Replace the IntWritable with an ArrayList. Why? (From Lin/Dyer, Data-Intensive Text Processing with MapReduce.)

Word Count Notes
• Remove the ctors() from map().

Hadoop Avg Coding Demo (a sketch of the AvgPair class appears at the end of these notes)
• Create an AvgPair object that implements Writable.
• Create ivars: sum, count, key.
• Auto-generate methods for the ivars.
• Implement the write() and readFields() methods.
• Put the ctors outside map().
• Run using the M/R plugin.

NXServer/NXClient
• Remote desktop to EC2.
• 2 options:
  – 1) Use the prepared AMI by Eric Hammond.
  – 2) Install NXServer yourself.

Prepared AMI
• http://aws.amazon.com/amis/Europe/1950
• US East AMI ID: ami-caf615a3
• Ubuntu 9.04 Jaunty Desktop with NX Server Free Edition.
• Update the repos to newer versions of Ubuntu.
• Create a new user.

Ubuntu: Create new user
• Script for this AMI only: > user-setup

Verify login from desktop
• Created user dc, password dc.

Download NX player, install
• Create a new connection and enter the IP address.
• Log in with the username/password. (Screenshot: Ubuntu desktop.)

Installing NXServer
• Read the logs under /usr/NX/var/log/install.
• If installed correctly, you should see the daemons running.
• Create a user.
• Configure sshd: sudo nano /etc/ssh/sshd_config
• Verify ssh login.
• Same process as before with NX player: enter the IP and username/password.

Clone the instance store if you can't get NXServer to work
• Problem: the EasyNXServer method uses an instance store. How to clone it to an EBS volume?
• Create a blank volume; the default attach point is /dev/sdf:
  mkfs.ext3 /dev/sdf
  mkdir /newvolume
  sudo mount /dev/sdf /newvolume

rsync copy instance store to EBS
• Copy the instance-store volume to EBS:
  rsync -aHxv / /newvolume
• Create further snapshots, then create an AMI by specifying the kernel, etc.
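Referring back to the Hadoop Avg Coding Demo above: the following is a hedged sketch of what an AvgPair Writable along those lines might look like. Only the class name and the ivars (sum, count, key) come from the bullet list; the field types, method bodies, and average() helper are assumptions, not the code shown in class.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // Carries a running sum and count so a reducer (or combiner) can
    // compute an average without losing information.
    public class AvgPair implements Writable {
        private double sum;
        private long count;
        private String key;

        public AvgPair() {
            // No-arg ctor required by Hadoop's Writable serialization.
        }

        public AvgPair(String key, double sum, long count) {
            this.key = key;
            this.sum = sum;
            this.count = count;
        }

        // Auto-generated-style accessors for the ivars.
        public double getSum()   { return sum; }
        public long   getCount() { return count; }
        public String getKey()   { return key; }

        public void setSum(double sum)   { this.sum = sum; }
        public void setCount(long count) { this.count = count; }
        public void setKey(String key)   { this.key = key; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeDouble(sum);
            out.writeLong(count);
            out.writeUTF(key == null ? "" : key);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            sum = in.readDouble();
            count = in.readLong();
            key = in.readUTF();
        }

        public double average() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

Per the "Put the ctors outside map()" bullet, an instance would be created once as a mapper field and refilled with the setters on each map() call rather than constructed per record.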