Jian Wang
Based on "Meet Hadoop! Open Source Grid Computing" by Devaraj Das, Yahoo! Inc. Bangalore & Apache Software Foundation

Why Hadoop: we need to process 10TB datasets
◦ On 1 node, scanning @ 50MB/s = 2.3 days
◦ On a 1000-node cluster, scanning @ 50MB/s = 3.3 min
◦ We need an efficient, reliable and usable framework
◦ Google File System (GFS) paper
◦ Google's MapReduce paper

Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
◦ Files are divided into large blocks (64MB) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure; the current replication factor is 3 (configurable)
◦ HDFS cannot be directly mounted by an existing operating system
◦ Once you use HDFS (put something in it), relative paths are resolved against /user/{your user id}; e.g. if your id is jwang30, your "home dir" is /user/jwang30

Master-Slave Architecture
Master (irkm-1)
◦ Namenode: manages the HDFS namespace and block locations
◦ Jobtracker: accepts MR jobs submitted by users, assigns Map and Reduce tasks to Tasktrackers, monitors task and Tasktracker status, and re-executes tasks upon failure
Slaves (irkm-1 to irkm-6)
◦ Datanodes: manage block storage and replication for HDFS
◦ Tasktrackers: run Map and Reduce tasks upon instruction from the Jobtracker, and manage storage and transmission of intermediate output

Hadoop is locally "installed" on each machine
◦ Version 0.19.2
◦ Installed location is /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop/${user.name} (configurable)

If this is the first time you use it, you need to format the namenode:
◦ Log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format

Most commands look similar:
◦ bin/hadoop <command> [options]
◦ If you just type bin/hadoop you get a list of all possible commands (including undocumented ones)

hadoop dfs options:
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]

Starting and stopping the cluster
◦ bin/start-all.sh starts all slave nodes and the master node
◦ bin/stop-all.sh stops all slave nodes and the master node
◦ Run jps to check daemon status

Getting started
◦ Log in to irkm-1
◦ rm -fr /tmp/hadoop/$userID
◦ cd /home/tmp/hadoop
◦ bin/hadoop dfs -ls
◦ bin/hadoop dfs -copyFromLocal example example
◦ After that: bin/hadoop dfs -ls (the example directory should now be listed)

Word count with Hadoop Streaming, using mapper.py and reducer.py (sketched at the end of this section)
◦ bin/hadoop dfs -ls
◦ bin/hadoop dfs -copyFromLocal example example
◦ bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py -input example -output java-output
◦ bin/hadoop dfs -cat java-output/part-00000
◦ bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local

Web interfaces
Hadoop job tracker
◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
Hadoop task tracker
◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
Hadoop dfs health check
◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp
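
The streaming example above references mapper.py without listing it. A minimal sketch of what a streaming word-count mapper typically looks like (the exact file used in class may differ):

#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper (sketch, not necessarily the class's exact file).
# Reads lines from stdin and emits one "<word><TAB>1" pair per word;
# Streaming treats the text before the first tab as the key.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t1' % word)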
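
Likewise, a minimal sketch of reducer.py under the same assumption. Streaming sorts the mapper output by key before it reaches the reducer, so all counts for a given word arrive on consecutive lines and can be summed in a single pass:

#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer (sketch, not necessarily the class's exact file).
# Input arrives sorted by key, so counts for the same word are adjacent;
# accumulate them and emit "<word><TAB><total>" whenever the key changes.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Because the streaming contract is just stdin/stdout, both scripts can be tested locally before submitting the job:
◦ cat some-local-file | python mapper.py | sort | python reducer.py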