Hadoop Setup

Prerequisites:
• System: Mac OS / Linux / Cygwin on Windows
Notice:
1. Only Ubuntu environments will be supported by the TA. You may
try other environments as a challenge.
2. Cygwin on Windows is not recommended because of its
instability and unforeseen bugs.
• Java Runtime Environment; Java™ 1.6.x is recommended
• ssh must be installed and sshd must be running in order to use
the Hadoop scripts that manage remote Hadoop daemons (a quick
prerequisite check is sketched below).
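A quick sanity check of these prerequisites (a suggested step, not part of
the original slides) is to confirm the Java version and that sshd accepts
local connections:
$ java -version
$ ssh localhost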
Single Node Setup (usually for debugging)
• Untar hadoop-*.*.*.tar.gz to your user path
About version: the latest stable release, 1.0.1, is recommended.
• Edit the file conf/hadoop-env.sh to define at least JAVA_HOME as
the root of your Java installation
• Edit the following files to configure the site properties:
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Cluster Setup (the only acceptable setup for the homework)
• Same steps as single node setup
• Set the dfs.name.dir and dfs.data.dir properties in conf/hdfs-site.xml
(see the sketch after this list)
• Add the master's node name to conf/masters
Add all the slaves' node names to conf/slaves
• Edit /etc/hosts on each node: add an IP and node-name entry
for each node
Suppose your master's node name is ubuntu1 and its IP is
192.168.0.2; then add the line "192.168.0.2 ubuntu1" to the file
• Copy the Hadoop folder to the same path on all nodes
Notice: JAVA_HOME may not be the same on every node, so check
conf/hadoop-env.sh on each one
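A minimal sketch of what the extra cluster properties might look like in
conf/hdfs-site.xml; the paths /home/hadoop/dfs/name and /home/hadoop/dfs/data
are only example locations, so substitute directories that exist on your own
nodes:
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs/data</value>
</property>
These entries go inside the existing <configuration> element, alongside the
dfs.replication property from the single-node setup.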
Execution
• Generate an ssh key pair so that no passphrase is needed when
the daemons start up:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
• Format a new distributed filesystem:
$ bin/hadoop namenode -format
• Start the hadoop daemons:
$ bin/start-all.sh
• The hadoop daemon log output is written to the
${HADOOP_LOG_DIR} directory (defaults to
${HADOOP_HOME}/logs).
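One way to confirm that the daemons really started (a suggested check, not on
the original slides) is the JDK's jps tool (available if a full JDK is
installed), which on a single-node setup should list processes such as
NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:
$ jps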
Execution (continued)
• Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
• Run some of the examples provided:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
• Examine the output files:
• View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
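Alternatively (a standard option, though not shown on the original slide),
copy the output from the distributed filesystem to the local filesystem and
examine it there:
$ bin/hadoop fs -get output output
$ cat output/*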
• When you're done, stop the daemons with:
$ bin/stop-all.sh
Details About Configuration Files
Hadoop configuration is driven by two types of important
configuration files:
1. Read-only default configuration:
src/core/core-default.xml
src/hdfs/hdfs-default.xml
src/mapred/mapred-default.xml
conf/mapred-queues.xml.template.
2. Site-specific configuration:
conf/core-site.xml
conf/hdfs-site.xml
conf/mapred-site.xml
conf/mapred-queues.xml
Details About Configuration Files (continued)
conf/core-site.xml:

Parameter: fs.default.name
Value: URI of NameNode.
Notes: hdfs://hostname/

conf/hdfs-site.xml:

Parameter: dfs.name.dir
Value: Path on the local filesystem where the NameNode stores the
namespace and transactions logs persistently.
Notes: If this is a comma-delimited list of directories, then the name
table is replicated in all of the directories, for redundancy.

Parameter: dfs.data.dir
Value: Comma-separated list of paths on the local filesystem of a
DataNode where it should store its blocks.
Notes: If this is a comma-delimited list of directories, then data will
be stored in all named directories, typically on different devices.
Details About Configuration Files (continued)
conf/mapred-site.xml:

Parameter: mapred.job.tracker
Value: Host or IP and port of JobTracker.
Notes: host:port pair.

Parameter: mapred.system.dir
Value: Path on the HDFS where the Map/Reduce framework stores system
files, e.g. /hadoop/mapred/system/.
Notes: This is in the default filesystem (HDFS) and must be accessible
from both the server and client machines.

Parameter: mapred.local.dir
Value: Comma-separated list of paths on the local filesystem where
temporary Map/Reduce data is written.
Notes: Multiple paths help spread disk i/o.

Parameter: mapred.tasktracker.{map|reduce}.tasks.maximum
Value: The maximum number of Map/Reduce tasks which are run
simultaneously on a given TaskTracker, individually.
Notes: Defaults to 2 (2 maps and 2 reduces), but vary it depending on
your hardware.

Parameter: dfs.hosts / dfs.hosts.exclude
Value: List of permitted/excluded DataNodes.
Notes: If necessary, use these files to control the list of allowable
DataNodes.

Parameter: mapred.hosts / mapred.hosts.exclude
Value: List of permitted/excluded TaskTrackers.
Notes: If necessary, use these files to control the list of allowable
TaskTrackers.

Parameter: mapred.queue.names
Value: Comma-separated list of queues to which jobs can be submitted.
Notes: The Map/Reduce system always supports at least one queue named
"default"; hence, this parameter's value should always contain the
string "default". Some job schedulers supported in Hadoop, like the
Capacity Scheduler, support multiple queues. If such a scheduler is
being used, the list of configured queue names must be specified here.
Once queues are defined, users can submit jobs to a queue using the
property name mapred.job.queue.name in the job configuration. There
could be a separate configuration file for configuring properties of
these queues, managed by the scheduler. Refer to the documentation of
the scheduler for details.
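As an illustration of tuning the per-TaskTracker task limits (the values 4
and 2 below are only example figures for a hypothetical quad-core node, not
recommendations from the slides), the two maximums could be set in
conf/mapred-site.xml like this:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>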
You may get detailed information from
The official site:
http://hadoop.apache.org
Course slides & Textbooks:
http://www.cs.sjtu.edu.cn/~liwujun/course/mmds.html
Michael G. Noll's Blog (a good guide):
http://www.michael-noll.com/
If you have good materials to share, please send them to the TA.