Large-Scale Data Processing / Cloud Computing: Lecture 2 – "Hello World" in Hadoop

Peng Bo (彭波)
School of Electronics Engineering and Computer Science, Peking University (北京大学信息科学技术学院)
7/3/2014
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin
University of Maryland
SEWMGroup
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
CodeLab1
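• Difficulties encountered
– Not familiar with Java!
– Setting up the development and runtime environment? (Eclipse, Hadoop)
– Compile errors in the code from the guide?
– Runtime errors?
– ...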
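• It seems the code given in the PDF doesn't work, while the code you get by clicking "source code here" does work... uh... but the results I got are different from the ones in the PDF...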
• The method setInputPath(Path) is undefined for the type JobConf (WordCount/src, WordCount.java, line 21)
I don't know what the cause is.
• It won't compile. Please help.
• FileInputPath cannot be resolved
FileOutputPath cannot be resolved
What is going on here?
• Exception in thread "main" java.io.IOException: Cannot run program "chmod": CreateProcess error=2, ?????????
This is the error I get when I run it.
Java Programming for
C/C++ Developers
Historical background
• The C programming language
– early 1970s
– UNIX
• The C++ programming language
– early 1980s
– object-oriented
– a wide variety of application programming
• The Java programming language
– early 1990s
– originally for consumer electronic devices
– enterprise application development
Java SDK
• Software Development Kit
– a group of command-line tools and packages
that you will need to write and run Java
programs
– base classes (Library)
Working with the SDK
• Factorial
– input: a value as a command-line argument
– output: factorial of that number OR exception
• Java Specification
– every Java source file must have exactly the same name as the public class defined inside it (plus the .java extension)
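The Factorial exercise could be sketched as follows. The slides do not show the code, so the class name FactorialDemo and the choice of exceptions are assumptions made for illustration; "exception" is read here as rejecting negative or non-numeric input.

```java
import java.math.BigInteger;

// Hypothetical class name. If it were declared public, the file would have to be FactorialDemo.java.
class FactorialDemo {

    // Computes n! iteratively; BigInteger avoids the overflow an int or long would hit.
    static BigInteger factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative: " + n);
        }
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java FactorialDemo <non-negative integer>");
            return;
        }
        // Integer.parseInt throws NumberFormatException for non-numeric input.
        int n = Integer.parseInt(args[0]);
        System.out.println(n + "! = " + factorial(n));
    }
}
```

Compiled with javac FactorialDemo.java and run as java FactorialDemo 5, this prints 5! = 120.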
Execution Environment
Primitive data types
• char
– 16 bits
– Unicode character set
– escape sequences
• Integer types
– signed
– exact size
• The floating-point types
– IEEE 754 floating-point values
• The boolean type
– true, false
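A small sketch of the properties listed for the primitive types. The class name PrimitiveSizes is hypothetical; the sizes and behaviours shown follow from the Java language rules rather than from the slides.

```java
// Hypothetical demo class illustrating the primitive types listed above.
class PrimitiveSizes {

    // Bit widths are fixed by the language, so they are the same on every platform.
    static String sizes() {
        return "byte=" + Byte.SIZE + " short=" + Short.SIZE + " char=" + Character.SIZE
                + " int=" + Integer.SIZE + " long=" + Long.SIZE;
    }

    // Signed integer arithmetic wraps around silently on overflow.
    static int wrapAround() {
        int max = Integer.MAX_VALUE;
        return max + 1;
    }

    // IEEE 754 doubles cannot represent 0.1 exactly, so this comparison is false.
    static boolean tenthsAddExactly() {
        return 0.1 + 0.2 == 0.3;
    }

    public static void main(String[] args) {
        char letter = 'A';      // a char is a 16-bit Unicode code unit
        char tab = '\t';        // escape sequence, as in C
        boolean ready = true;   // boolean is not interchangeable with int
        System.out.println(sizes());
        System.out.println((int) letter);
        System.out.println(wrapAround() == Integer.MIN_VALUE);
        System.out.println(tenthsAddExactly());
        System.out.println(ready && tab == 9);
    }
}
```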
Operators
• + is overloaded
• If you use the + operator with a String and another operand that is not a String, the other operand is converted into a String
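The conversion rule can be seen in a short sketch; the class and method names are hypothetical. Because + is evaluated left to right, the position of the first String operand decides where numeric addition stops and concatenation starts.

```java
// Hypothetical demo of the overloaded + operator.
class PlusOperatorDemo {

    // The int operand is converted to a String before concatenation.
    static String label(int score) {
        return "score: " + score;
    }

    // 1 + 2 is evaluated first as int addition, then "x" is appended.
    static String numbersFirst() {
        return 1 + 2 + "x";
    }

    // "x" + 1 is already a String, so 2 is appended as text.
    static String stringFirst() {
        return "x" + 1 + 2;
    }

    public static void main(String[] args) {
        System.out.println(label(42));       // score: 42
        System.out.println(numbersFirst());  // 3x
        System.out.println(stringFirst());   // x12
    }
}
```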
C/C++ functions versus Java methods
• In Java terminology, functions are called methods.
• Methods can only be declared as members of a class; you can't define a method outside of a Java class
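As an illustration, a free C function such as int square(int) has to live inside a class in Java. The names below are hypothetical; a static method is the closest equivalent of a C function, while an instance method works on the object it is called on.

```java
// Hypothetical utility class: a static method plays the role of a free C function.
class Geometry {
    static int square(int x) {
        return x * x;
    }
}

// Hypothetical class with an instance method that uses per-object state.
class CounterBox {
    private int count;

    int increment() {
        count++;
        return count;
    }
}

class MethodsDemo {
    public static void main(String[] args) {
        System.out.println(Geometry.square(4));   // called through the class name
        CounterBox box = new CounterBox();
        box.increment();
        System.out.println(box.increment());      // called through an object
    }
}
```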
Arrays
• arrays are objects, so they are created with the new operator
• scores.length gives the number of elements
• the bracket characters ([ ]) that are used to indicate arrays are bound to the array type, not the array name
• out-of-range access throws java.lang.ArrayIndexOutOfBoundsException
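A minimal sketch of these array rules. The class ArrayDemo and its methods are hypothetical; scores is reused as the variable name from the bullet above.

```java
// Hypothetical demo of array creation, length, and bounds checking.
class ArrayDemo {

    // Uses the length field instead of passing the size separately as in C.
    static int sum(int[] scores) {
        int total = 0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i];
        }
        return total;
    }

    // Every access is bounds-checked at run time.
    static boolean outOfBounds(int[] a, int index) {
        try {
            int ignored = a[index];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int[] scores = new int[3];   // the brackets belong to the type: int[]
        scores[0] = 90;
        scores[1] = 85;
        scores[2] = 70;
        System.out.println(sum(scores));
        System.out.println(outOfBounds(scores, 3));
    }
}
```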
Strings
• objects of the String class
• String objects are immutable
• identical string literals refer to the same String object
• String class has a rich interface
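The String properties above can be checked with a short sketch (class and method names are hypothetical). Note that == compares references, which is why equals() is the usual way to compare string contents.

```java
// Hypothetical demo of String identity and immutability.
class StringDemo {

    // Identical literals are shared, so both variables refer to one object.
    static boolean literalsShared() {
        String a = "hadoop";
        String b = "hadoop";
        return a == b;
    }

    // new String(...) always creates a separate object with equal contents.
    static boolean explicitCopyShared() {
        String a = "hadoop";
        String c = new String("hadoop");
        return a == c;
    }

    // toUpperCase returns a new String; the argument itself is never modified.
    static String shout(String s) {
        return s.toUpperCase();
    }

    public static void main(String[] args) {
        String word = "map";
        System.out.println(shout(word) + " " + word);
        System.out.println(literalsShared());
        System.out.println(explicitCopyShared());
    }
}
```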
The main() method
• a strict naming convention: public static void main(String[] args)
• first element in the array is the first argument, not the
name of the program.
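A brief sketch of how main() sees its arguments, using a hypothetical class ArgsDemo. Unlike argv[0] in C, args[0] is already the first real argument.

```java
// Hypothetical demo of command-line argument handling.
class ArgsDemo {

    static String describe(String[] args) {
        if (args.length == 0) {
            return "no arguments";
        }
        return "first argument: " + args[0] + ", count: " + args.length;
    }

    // The signature must be exactly public static void main(String[] args).
    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Running java ArgsDemo in out prints first argument: in, count: 2.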
Other differences
• Pointers:
– Java references are pointers to Java objects
– cannot be incremented or decremented
– no address-of operator
• Global variables
– no way to declare global variables (or methods)
• no struct, union, or typedef (enum was added in Java 5)
• Freely placed methods: no forward declarations are needed
• Garbage collection
– no malloc() and free()
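The reference semantics can be sketched as below. PointXY stands in for what would be a struct in C; all names are hypothetical. Objects are created with new and are reclaimed by the garbage collector once unreachable, so there is no matching free call.

```java
// Hypothetical class used where C code would use a struct.
class PointXY {
    int x;
    int y;

    PointXY(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

class ReferenceDemo {

    // Assigning a reference copies the "pointer", not the object.
    static int aliasing() {
        PointXY p = new PointXY(1, 2);
        PointXY q = p;   // p and q now refer to the same object
        q.x = 10;
        return p.x;
    }

    // The reference itself is passed by value: reassigning the parameter
    // does not change what the caller's variable refers to.
    static void reassign(PointXY p) {
        p = new PointXY(0, 0);
    }

    static int afterReassign() {
        PointXY p = new PointXY(1, 2);
        reassign(p);
        return p.x;
    }

    public static void main(String[] args) {
        System.out.println(aliasing());
        System.out.println(afterReassign());
    }
}
```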
Defining a Java class
• Each member must have its own public or
private modifier
• You don't use semicolons (;) after the closing braces in class and method definitions.
• The main() method is a member of the
class
• You call the constructor using the new
keyword
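The points above can be seen together in one small class. Account and its members are hypothetical names chosen for this sketch; the original slide's example class is not preserved in the text.

```java
// Hypothetical class showing per-member modifiers, a constructor, and main() as a member.
class Account {
    private String owner;        // each member carries its own modifier
    private long balanceCents;

    public Account(String owner, long balanceCents) {
        this.owner = owner;
        this.balanceCents = balanceCents;
    }

    public void deposit(long cents) {
        if (cents <= 0) {
            throw new IllegalArgumentException("deposit must be positive");
        }
        balanceCents += cents;
    }

    public long getBalanceCents() {
        return balanceCents;
    }

    public String getOwner() {
        return owner;
    }

    // main() is an ordinary static member of the class.
    public static void main(String[] args) {
        Account account = new Account("alice", 100);   // constructor called via new
        account.deposit(50);
        System.out.println(account.getOwner() + ": " + account.getBalanceCents());
    }
}   // no semicolon after the closing brace
```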
access modifiers
• public
• private
• protected
• package access
Inheritance
• extends
• super()
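A minimal sketch of extends and super(), with hypothetical Employee and Manager classes. The call to super(...) must be the first statement of the subclass constructor.

```java
// Hypothetical base class.
class Employee {
    private final String name;

    Employee(String name) {
        this.name = name;
    }

    String describe() {
        return "Employee " + name;
    }
}

// Hypothetical subclass: inherits describe() and adds its own state.
class Manager extends Employee {
    private final int reports;

    Manager(String name, int reports) {
        super(name);   // runs the Employee constructor first
        this.reports = reports;
    }

    int getReports() {
        return reports;
    }
}

class InheritanceDemo {
    public static void main(String[] args) {
        Manager m = new Manager("bo", 3);
        Employee e = m;   // a Manager can be used wherever an Employee is expected
        System.out.println(e.describe() + ", reports: " + m.getReports());
    }
}
```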
Overloading and overriding
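The slide body is not preserved here, so the following is only an illustrative sketch with hypothetical names. Overloading means several methods share a name but differ in parameter types (resolved at compile time); overriding means a subclass replaces an inherited method with the same signature (resolved at run time).

```java
// Hypothetical base class.
class Animal {
    String sound() {
        return "...";
    }
}

class Dog extends Animal {
    // Overriding: same name and parameters as Animal.sound().
    @Override
    String sound() {
        return "Woof";
    }
}

class Describer {
    // Overloading: same method name, different parameter types.
    static String describe(int value) {
        return "int " + value;
    }

    static String describe(String value) {
        return "String " + value;
    }

    static String describe(Animal animal) {
        // The overridden method of the actual object is called here.
        return "Animal says " + animal.sound();
    }
}

class OverloadOverrideDemo {
    public static void main(String[] args) {
        Animal pet = new Dog();
        System.out.println(Describer.describe(7));
        System.out.println(Describer.describe("x"));
        System.out.println(Describer.describe(pet));
    }
}
```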
The Object class
• All Java classes are ultimately subclasses
of class Object
• a centrally rooted class hierarchy
• usage
– toString()
– data structures defined to take objects of class Object can hold any Java object (vs. C++ templates)
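Both uses can be sketched with a hypothetical Pair class that stores plain Object references. The price of holding "any object" is a cast when reading a value back, which is the gap that generics later closed.

```java
// Hypothetical container that can hold any two Java objects.
class Pair {
    private final Object first;
    private final Object second;

    Pair(Object first, Object second) {
        this.first = first;
        this.second = second;
    }

    Object getFirst() {
        return first;
    }

    Object getSecond() {
        return second;
    }

    // Overrides Object.toString(), which string concatenation and println use.
    @Override
    public String toString() {
        return "(" + first + ", " + second + ")";
    }
}

class ObjectDemo {

    // The caller must know the real type and cast; a wrong guess fails at run time.
    static int secondAsInt(Pair pair) {
        return (Integer) pair.getSecond();
    }

    public static void main(String[] args) {
        Pair pair = new Pair("hadoop", 2);   // the int 2 is boxed into an Integer
        System.out.println(pair);
        System.out.println(secondAsInt(pair) + 1);
    }
}
```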
Interfaces
• All interfaces are implicitly abstract
• All members of an interface are implicitly
public
• All fields defined in an interface are
implicitly static and final
• A Java class can extend only one class,
but it can implement any number of
interfaces
• Best practice for polymorphism
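A short sketch of these interface rules; Greeter, Countable, and PoliteGreeter are hypothetical names. The comments mark which modifiers the compiler adds implicitly.

```java
// Hypothetical interface with one constant and one method.
interface Greeter {
    String PREFIX = "Hello, ";      // implicitly public static final
    String greet(String name);      // implicitly public abstract
}

// A second hypothetical interface.
interface Countable {
    int count();
}

// One class may implement any number of interfaces.
class PoliteGreeter implements Greeter, Countable {
    private int calls;

    // Implementations must be declared public, matching the implicit modifier.
    public String greet(String name) {
        calls++;
        return PREFIX + name;
    }

    public int count() {
        return calls;
    }
}

class InterfaceDemo {
    public static void main(String[] args) {
        Greeter greeter = new PoliteGreeter();   // program against the interface type
        System.out.println(greeter.greet("Hadoop"));
        System.out.println(((Countable) greeter).count());
    }
}
```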
more on objects
• Inner classes and inner interfaces
• Anonymous classes and objects
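Both constructs appear in the sketch below, which uses hypothetical names. An inner class instance is tied to an instance of the enclosing class; an anonymous class is declared and instantiated in a single expression, here to supply a java.util.Comparator.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical enclosing class.
class OuterDemo {
    private int counter = 0;

    // Inner class: can read and write the private fields of its enclosing object.
    class Ticker {
        int tick() {
            counter++;
            return counter;
        }
    }

    // Sorts a copy of the input by string length using an anonymous Comparator.
    static String[] sortByLength(String[] words) {
        String[] copy = Arrays.copyOf(words, words.length);
        Arrays.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        OuterDemo outer = new OuterDemo();
        OuterDemo.Ticker ticker = outer.new Ticker();
        ticker.tick();
        System.out.println(ticker.tick());
        System.out.println(Arrays.toString(sortByLength(new String[] {"ccc", "a", "bb"})));
    }
}
```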
Using the Library (Java API)
• In the Java API, classes are grouped into packages
• you have already been using a package that is imported by default, java.lang, whenever you call System.out.println()
• either write import java.util.ArrayList; or use the fully qualified name: java.util.ArrayList<xx> list = ....
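The two ways of naming a library class can be compared side by side. ImportDemo is a hypothetical class, and LinkedList is used for the second case only so that the two styles are easy to tell apart.

```java
import java.util.ArrayList;

// Hypothetical demo of imported versus fully qualified class names.
class ImportDemo {

    // ArrayList is usable by its short name because of the import above.
    static int viaImport() {
        ArrayList<String> list = new ArrayList<String>();
        list.add("map");
        list.add("reduce");
        return list.size();
    }

    // Without an import, the fully qualified name works anywhere.
    static int viaQualifiedName() {
        java.util.LinkedList<String> list = new java.util.LinkedList<String>();
        list.add("hdfs");
        return list.size();
    }

    public static void main(String[] args) {
        // System, String, and Math come from java.lang and need no import.
        System.out.println(Math.max(viaImport(), viaQualifiedName()));
    }
}
```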
Data Structures
• java.util.*
• Java generics
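A small sketch of java.util collections with generics, using hypothetical names. The type parameters let the compiler check element types, so no casts are needed when reading values back.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo of generic collections and a generic method.
class GenericsDemo {

    // Generic method: works for any element type that can compare to itself.
    // Assumes the list is not empty.
    static <T extends Comparable<T>> T largest(List<T> items) {
        T best = items.get(0);
        for (T item : items) {
            if (item.compareTo(best) > 0) {
                best = item;
            }
        }
        return best;
    }

    // Map from each word to its length; keys and values are type-checked.
    static Map<String, Integer> lengths(List<String> words) {
        Map<String, Integer> result = new HashMap<String, Integer>();
        for (String word : words) {
            result.put(word, word.length());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(largest(Arrays.asList(3, 9, 4)));
        System.out.println(lengths(Arrays.asList("map", "reduce")).get("reduce"));
    }
}
```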
Deploying your application
• A Java program is a bunch of classes.
• A JAR file is a Java Archive
– create a manifest.txt that states which class has the main() method
• Main-Class: MyApp
– use the jar tool to package all class files and manifest.txt
– $ jar -cvmf manifest.txt app.jar *.class
– $ java -jar app.jar
Package
• put your classes in packages
– java.util, java.net, java.text ....
• prefix your package name with your reversed domain name
• set up a matching directory structure
References
• Java Programming for C/C++ Developers
• Head First Java
"Hello World" in Hadoop
What is MapReduce?
• Programming model for expressing distributed
computations at a massive scale
• Execution framework for organizing and
performing such computations
• Open-source implementation called Hadoop
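To make the programming model concrete, here is a single-process word count sketched in plain Java. It does not use Hadoop at all: MiniMapReduce and its methods are hypothetical, and the "shuffle" is just an in-memory grouping by key. The example assumes Java 8 or later.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map, shuffle, and reduce phases.
class MiniMapReduce {

    // map: one input line -> a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // reduce: one word with all of its counts -> the total.
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // shuffle: group values by key; TreeMap keeps the keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>())
                        .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

In a real framework the map and reduce calls run on many machines and the grouping happens across the network; only the two user-supplied functions keep this shape.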
Brief History of Hadoop
• Hadoop was created by Doug Cutting, the creator of Apache Lucene and Nutch
• 2003, Google published the GFS paper
• 2004, Google published the MapReduce paper
• 2005, Nutch ported to MapReduce/HDFS
• 2006, Cutting joined Yahoo!
• January 2008, Hadoop became a top-level project at Apache
• February 2008, Hadoop ran on a 10,000-core cluster
Hadoop Release
New MapReduce API
• favors abstract classes over interfaces
• new API in org.apache.hadoop.mapreduce, old API in org.apache.hadoop.mapred
• new Context class
– takes over the roles of JobConf, OutputCollector, and Reporter
• new Job class
– takes over the role of JobClient
• reduce() method receives its values differently
– new: java.lang.Iterable, for (VALUEIN value : values) { ... }
– old: java.util.Iterator, hasNext(), next()
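The difference between the two iteration styles can be shown with plain JDK types. This is not Hadoop code: ValueIterationDemo is a hypothetical class, and Integer stands in for the Writable value types a real reducer would receive.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo contrasting Iterable (new API style) with Iterator (old API style).
class ValueIterationDemo {

    // New style: an Iterable can be used directly in a for-each loop.
    static int sumNewStyle(Iterable<Integer> values) {
        int sum = 0;
        for (Integer value : values) {
            sum += value;
        }
        return sum;
    }

    // Old style: an Iterator is advanced by hand with hasNext() and next().
    static int sumOldStyle(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(sumNewStyle(values));
        System.out.println(sumOldStyle(values.iterator()));
    }
}
```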
Hadoop Streaming & Pipes
• Streaming
– supports any programming language, even shell scripts
– uses standard input and output to
communicate with the map and reduce code
• Pipes
– C++ interface to Hadoop MapReduce
– uses sockets as the communication channel
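A Streaming mapper is just a program that reads lines from standard input and writes key/value lines to standard output. The sketch below is written in Java only to stay consistent with the rest of these notes (Streaming is more often used with scripting languages); the class name is hypothetical, and it relies on the Streaming default that the text before the first tab of an output line is the key. It assumes Java 8 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

// Hypothetical word-count mapper in the style expected by Hadoop Streaming.
class StreamingWordMapper {

    // Emits one "word<TAB>1" line for every whitespace-separated token.
    static void map(BufferedReader in, PrintWriter out) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.print(word + "\t1\n");
                    }
                }
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        map(new BufferedReader(new InputStreamReader(System.in)),
                new PrintWriter(System.out));
    }
}
```

A matching reducer would read the sorted word<TAB>1 lines from standard input and add up the counts for each run of identical keys.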
Hadoop Command
• docs in distribution
– api
– tutorial
• hadoop
– -conf xxx
Changping Cluster
• 28 nodes, each with 12 cores / 48 GB RAM / 10 TB disk
– Namenode/JobTracker server: changping11
– IP: 222.29.134.11
– HDFS port: 9000
– MapReduce port: 9001
How to use ChangpingCluster
• 1. Add a host name entry
– Windows: edit the file C:\WINDOWS\system32\drivers\etc\hosts
– Linux: edit /etc/hosts
Add the following line:
222.29.134.11 changping11
• Otherwise, running a job will report a name resolution error
• 2. Identity settings
• 1) Write all output files under the "/cs402/YourName" directory
– In code: FileOutputFormat.setOutputPath(conf, new Path("/cs402/YourName"));
• 2) In the Mapred Location settings, set hadoop.job.ugi = YourName, cs402
– The user name must match the name in the file path above
– The group name must be cs402
– Alternatively, set it directly in the driver program:
Configuration conf = new Configuration();
conf.set("hadoop.job.ugi", "YourName,cs402");
References
• Tom White, Hadoop: The Definitive Guide, 3rd ed., O'Reilly, May 2012.
Q&A