Hadoop Setup: Prerequisites

• System: Mac OS / Linux / Cygwin on Windows
  Notice:
  1. Only Ubuntu will be supported by the TA. You may try other environments as a challenge.
  2. Cygwin on Windows is not recommended because of its instability and unforeseen bugs.
• Java Runtime Environment; Java 1.6.x is recommended.
• ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.

Hadoop Setup: Single-Node Setup (usually for debugging)

• Untar hadoop-*.**.*.tar.gz to your user path.
  About the version: the latest stable release, 1.0.1, is recommended.
• Edit the file conf/hadoop-env.sh to define at least JAVA_HOME as the root of your Java installation.
• Edit the following files to configure the basic properties:

  conf/core-site.xml:
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

  conf/hdfs-site.xml:
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

  conf/mapred-site.xml:
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
  </configuration>

Hadoop Setup: Cluster Setup (the only acceptable setup for the homework)

• Follow the same steps as the single-node setup.
• Set the dfs.name.dir and dfs.data.dir properties in hdfs-site.xml.
• Add the master's node name to conf/masters, and add all the slaves' node names to conf/slaves.
• Edit /etc/hosts on each node: add an IP / node-name entry for every node.
  For example, if your master's node name is ubuntu1 and its IP is 192.168.0.2, add the line "192.168.0.2 ubuntu1" to the file.
• Copy the Hadoop folder to the same path on all nodes.
  Notice: JAVA_HOME may not be set to the same value on every node.

Hadoop Setup: Execution

• Generate an ssh key pair so that no passphrase is asked for when the daemons start up:
  $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  $ ssh localhost
• Format a new distributed filesystem:
  $ bin/hadoop namenode -format
• Start the Hadoop daemons:
  $ bin/start-all.sh
• The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Hadoop Setup: Execution (continued)

• Copy the input files into the distributed filesystem:
  $ bin/hadoop fs -put conf input
• Run some of the examples provided:
  $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
• Examine the output files on the distributed filesystem:
  $ bin/hadoop fs -cat output/*
• When you're done, stop the daemons with:
  $ bin/stop-all.sh

Hadoop Setup: Details About Configuration Files

Hadoop configuration is driven by two types of important configuration files:
1. Read-only default configuration:
   src/core/core-default.xml, src/hdfs/hdfs-default.xml, src/mapred/mapred-default.xml, conf/mapred-queues.xml.template
2. Site-specific configuration:
   conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml, conf/mapred-queues.xml

Hadoop Setup: Details About Configuration Files (continued)

conf/core-site.xml:
• fs.default.name
  Value: URI of the NameNode.
  Notes: hdfs://hostname/

conf/hdfs-site.xml:
• dfs.name.dir
  Value: Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.
  Notes: If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
• dfs.data.dir
  Value: Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.
  Notes: If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
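To make the cluster-setup step above concrete, here is a minimal sketch of a conf/hdfs-site.xml that sets dfs.name.dir and dfs.data.dir. The directory paths and the replication factor of 2 are placeholder assumptions for illustration, not values prescribed by the course; adjust them to your own nodes.

  <!-- Sketch of conf/hdfs-site.xml for a small cluster.
       Paths and replication factor below are assumed placeholders. -->
  <configuration>
    <property>
      <name>dfs.name.dir</name>
      <!-- assumed local path on the NameNode for the namespace and edit logs -->
      <value>/home/hadoop/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <!-- assumed local path on each DataNode for block storage;
           a comma-delimited list spreads blocks over several disks -->
      <value>/home/hadoop/dfs/data</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <!-- assumed replication factor; should not exceed the number of DataNodes -->
      <value>2</value>
    </property>
  </configuration>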
Hadoop Setup: Details About Configuration Files (continued)

conf/mapred-site.xml (a hedged cluster example is sketched after the references at the end of this handout):
• mapred.job.tracker
  Value: Host or IP and port of the JobTracker.
  Notes: host:port pair.
• mapred.system.dir
  Value: Path on HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/.
  Notes: This is in the default filesystem (HDFS) and must be accessible from both the server and client machines.
• mapred.local.dir
  Value: Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written.
  Notes: Multiple paths help spread disk I/O.
• mapred.tasktracker.{map|reduce}.tasks.maximum
  Value: The maximum number of Map/Reduce tasks that are run simultaneously on a given TaskTracker, set individually for maps and reduces.
  Notes: Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware.
• dfs.hosts / dfs.hosts.exclude
  Value: List of permitted/excluded DataNodes.
  Notes: If necessary, use these files to control the list of allowable DataNodes.
• mapred.hosts / mapred.hosts.exclude
  Value: List of permitted/excluded TaskTrackers.
  Notes: If necessary, use these files to control the list of allowable TaskTrackers.
• mapred.queue.names
  Value: Comma-separated list of queues to which jobs can be submitted.
  Notes: The Map/Reduce system always supports at least one queue with the name "default"; hence, this parameter's value should always contain the string "default". Some job schedulers supported in Hadoop, such as the Capacity Scheduler, support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property mapred.job.queue.name in the job configuration. There may be a separate configuration file, managed by the scheduler, for configuring the properties of these queues; refer to the scheduler's documentation for details.

Hadoop Setup

You may get detailed information from:
• The official site: http://hadoop.apache.org
• Course slides & textbooks: http://www.cs.sjtu.edu.cn/~liwujun/course/mmds.html
• Michael G. Noll's blog (a good guide): http://www.michael-noll.com/

If you have good materials to share, please send them to the TA.
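As promised in the conf/mapred-site.xml section above, here is a minimal sketch of that file for a small cluster. The hostname ubuntu1 reuses the example master name from the cluster-setup section; the directory paths and task limits are assumed placeholders chosen for illustration, not values required by the course.

  <!-- Sketch of conf/mapred-site.xml for a small cluster.
       ubuntu1 is the example master node name from the cluster-setup section;
       paths and task limits are assumed placeholders. -->
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <!-- host:port of the JobTracker -->
      <value>ubuntu1:9001</value>
    </property>
    <property>
      <name>mapred.system.dir</name>
      <!-- assumed HDFS path for framework system files -->
      <value>/hadoop/mapred/system</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <!-- assumed local path for temporary Map/Reduce data -->
      <value>/home/hadoop/mapred/local</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <!-- assumed limit; tune to the cores of each TaskTracker -->
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <!-- assumed limit; tune to the cores of each TaskTracker -->
      <value>2</value>
    </property>
  </configuration>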