Large-Scale Data Processing / Cloud Computing
Lecture 2 – "Hello World" in Hadoop
Peng Bo (彭波)
School of Electronics Engineering and Computer Science, Peking University
7/3/2014
http://net.pku.edu.cn/~course/cs402/

Jimmy Lin, University of Maryland
SEWMGroup

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

CodeLab1
• Difficulties encountered
– Unfamiliar with Java!
– Setting up the development and runtime environment? (Eclipse, Hadoop)
– Compile errors in the code from the guide?
– Runtime errors?
– ...
• "The code given in the PDF doesn't seem to work, but the code you get by clicking "source code here" does... er... although the results I got differ from those in the PDF..."
• "The method setInputPath(Path) is undefined for the type JobConf WordCount/src WordCount.java line 21 1404272734726 310. I don't know why."
• "It won't compile. Help!"
• "FileInputPath cannot be resolved. FileOutputPath cannot be resolved. What is going on here?"
• "Exception in thread "main" java.io.IOException: Cannot run program "chmod": CreateProcess error=2, ????????? This is the error I get when I run it."

Java Programming for C/C++ Developers

Historical background
• The C programming language
– early 1970s
– UNIX
• The C++ programming language
– early 1980s
– object-oriented
– a wide variety of application programming
• The Java programming language
– early 1990s
– originally for consumer electronic devices
– enterprise application development

Java SDK
• Software Development Kit
– a group of command-line tools and packages that you will need to write and run Java programs
– base classes (library)

Working with the SDK
• Factorial
– input: a value as a command-line argument
– output: the factorial of that number, OR an exception
• Java Specification
– every Java source file must have exactly the same name as the public class defined in it (plus the .java extension)

Execution Environment

Primitive data types
• char
– 16 bits
– Unicode character set
– escape sequences
• The integer types
– signed
– exact size
• The floating-point types
– IEEE 754 floating-point values
• The boolean type
– true, false

Operators
• + is overloaded
• If you use the + operator with a String and another operand that is not a String, the
other operand is converted into a String

C/C++ functions versus Java methods
• In Java terminology, functions are called methods.
• Methods can only be declared as members of a class; you can't define a method outside of a Java class.

Arrays
• Arrays are objects, so they are created using the new operator
• scores.length gives the number of elements
• the bracket characters ([ ]) that indicate an array are bound to the array type, not the array name
• an out-of-range index throws java.lang.ArrayIndexOutOfBoundsException

Strings
• Strings are objects of the String class
• String objects are immutable
• identical string literals refer to the same String object
• the String class has a rich interface

The main() method
• a strict naming convention
• the first element in the array is the first argument, not the name of the program

Other differences
• Pointers:
– Java references are pointers to Java objects
– cannot be incremented or decremented
– no address-of operator
• Global variables
– no way to declare global variables (or methods)
• no struct, union, or typedef (enum was added in Java 5)
• Freely placed methods
– methods can be defined in any order within a class; no forward declarations are needed
• Garbage collection
– no malloc() and free()

Defining a Java class
• Access modifiers are applied to each member individually (there are no C++-style public: / private: sections)
• You don't use semicolons (;) after the closing braces of class and method definitions.
• The main() method is a member of the class
• You call the constructor using the new keyword

Access modifiers
• public
• private
• protected
• package access

Inheritance
• extends
• super()

Overloading and overriding

The Object class
• All Java classes are ultimately subclasses of class Object
• a centrally rooted class hierarchy
• usage
– toString()
– define data structures that take objects of class Object, so they can hold any Java object, vs.
C++ templates

Interfaces
• All interfaces are implicitly abstract
• All members of an interface are implicitly public
• All fields defined in an interface are implicitly static and final
• A Java class can extend only one class, but it can implement any number of interfaces
• Best practice for polymorphism

More on objects
• Inner classes and inner interfaces
• Anonymous classes and objects

Using the Library (Java API)
• In the Java API, classes are grouped into packages
• you have already been using classes from a package that is imported by default, java.lang, whenever you call System.out.println()
• import java.util.ArrayList; or java.util.ArrayList<xx> list = ....

Data Structures
• java.util.*
• Java generics

Deploying your application
• A Java program is a bunch of classes.
• A JAR file is a Java Archive
– create a manifest.txt that states which class has the main() method
• Main-Class: MyApp
– use the jar tool to package all class files and manifest.txt
– $jar -cvmf manifest.txt app.jar *.class
– $java -jar app.jar

Package
• put your classes in packages
– java.util, java.net, java.text ....
• preface your package name with your reversed domain name
• set up a matching directory structure

References
• Java Programming for C/C++ Developers
• Head First Java

"Hello World" in Hadoop

What is MapReduce?
• Programming model for expressing distributed computations at a massive scale
• Execution framework for organizing and performing such computations
• Open-source implementation called Hadoop

Brief History of Hadoop
• Hadoop was created by Doug Cutting, the creator of Apache Lucene/Nutch
• 2003, Google published GFS
• 2004, Google published MapReduce
• 2005, Nutch ported to MapReduce/HDFS
• 2006, Cutting joined Yahoo!
• January 2008, Hadoop became a top-level project at Apache
• February 2008, Hadoop ran on a 10,000-core cluster

Hadoop Release

New MapReduce API
• favors abstract classes over interfaces
• new API in org.apache.hadoop.mapreduce, old API in org.apache.hadoop.mapred
• new Context class
– unifies the roles of the old JobConf, OutputCollector, and Reporter
• new Job class
– takes over job control from the old JobClient
• reduce() method passes values
– new: java.lang.Iterable, for (VALUEIN value : values) { ... }
– old: java.util.Iterator, hasNext(), next()

Hadoop Streaming & Pipes
• Streaming
– supports any programming language, even shell scripts
– uses standard input and output to communicate with the map and reduce code
• Pipes
– C++ interface to Hadoop MapReduce
– uses sockets as the communication channel

Hadoop Command
• docs in distribution
– api
– tutorial
• hadoop
– -conf xxx

Changping Cluster
• 28 nodes, each with 12 cores / 48 GB RAM / 10 TB disk
– Namenode/JobTracker server: changping11
– ip: 222.29.134.11
– hdfs port: 9000
– mapreduce port: 9001

How to use ChangpingCluster
• 1. Add a host name entry
– Windows: edit the file C:\WINDOWS\system32\drivers\etc\hosts
– Linux: edit /etc/hosts
– add the following line:
  222.29.134.11 changping11
• Otherwise, running a job will report a name resolution error
• 2. Identity settings
• 1) Write all output files under the "/cs402/YourName" directory
• In code: FileOutputFormat.setOutputPath(conf, new Path("/cs402/YourName"));
• 2) In the Mapred Location settings, set hadoop.job.ugi = YourName,cs402
• The user name must match the name in the file path above
• The group name must be cs402
• Alternatively, set it directly in the driver program:
  Configuration conf = new Configuration();
  conf.set("hadoop.job.ugi", "YourName,cs402");

References
• Tom White, Hadoop: The Definitive Guide, O'Reilly, 3rd edition, May 2012.

Q&A
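
Appendix sketch: the "reduce() method passes values" point from the New MapReduce API slide can be illustrated without a Hadoop installation. The following is only a plain-Java sketch of the two iteration styles, not Hadoop code: the class name ValueIterationSketch and both method names are made up for illustration, and the integer list stands in for the values a reducer would receive for one key.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical class for illustration only; it does not use any Hadoop types.
public class ValueIterationSketch {

    // Old-API style: the values arrive as a java.util.Iterator,
    // so the loop is driven by hasNext() and next().
    static int sumWithIterator(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // New-API style: the values arrive as a java.lang.Iterable,
    // so the enhanced for loop can be used directly.
    static int sumWithIterable(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Stand-in for the counts a word-count reducer might see for one word.
        List<Integer> counts = Arrays.asList(1, 1, 1);
        System.out.println(sumWithIterator(counts.iterator())); // prints 3
        System.out.println(sumWithIterable(counts));            // prints 3
    }
}
```

Both methods compute the same sum; the only difference is how the caller hands over the values, which is the difference between the old and new reduce() signatures described on the slide.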