
Hadoop Install and Basic Use Guide
Authors: Daniel Lust, Anthony Tallercio
Introduction
This guide will show you how to install Apache Hadoop in a Linux environment.
Hadoop allows applications to utilize thousands of nodes while exchanging thousands of
terabytes of data to complete a task. It is written in Java and is used by many large,
popular companies with many nodes, such as Facebook and Yahoo, but it can be run on
pretty much any hardware at any scale. This guide will go over the necessary
requirements for Hadoop, the full installation process, and how to execute a MapReduce job.
Your average Hadoop cluster consists of two major parts: a single master node
and multiple worker nodes. The master node runs four processes: the JobTracker,
TaskTracker, NameNode, and DataNode. A worker node, also known as a slave node,
can act as both a DataNode and TaskTracker or as just one of the two; in other words,
it can be a data-only or a compute-only worker node. JRE 1.6 or higher is required
on every machine to run Hadoop, and SSH access between all nodes in the cluster is
required for the startup and shutdown scripts to run properly.
Required:
- Multiple computers or VMs with Ubuntu
  o SSH installed
  o Sun Java 6
  o A stable version of Hadoop
    http://www.bizdirusa.com/mirrors/apache//hadoop/common/stable/
The following tutorial is based on a combination of Michael G. Noll's Running
Hadoop on Ubuntu Linux tutorials and the official Hadoop documentation.
It will successfully set up a single node. The next few steps require a
profile with administrator access.
If Ubuntu is a clean install, it will be necessary to set a root password:
$ sudo passwd root
[sudo] password for user:
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Install Sun Java 6 with the following commands in Terminal:
$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk
  o Accept the user agreement by scrolling down and pressing Enter on "OK"
$ sudo update-java-alternatives -s java-6-sun
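To confirm the switch took effect, you can check which Java is now active; it should report a 1.6.x Sun JVM:
$ java -version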
Install SSH with the following command in Terminal:
$ sudo apt-get install ssh
Download and unzip Hadoop in the desired location. To make it easier to find,
we'll be using the local folder at /usr/local/. The following commands extract the
Hadoop tar file and rename the extracted folder (its name will match the version you
downloaded) to hadoop in /usr/local:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.203.0rc1.tar.gz
$ sudo mv hadoop-0.20.203.0 hadoop
For security purposes we will create a dedicated Hadoop user account and a
tmp directory for Hadoop:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
  o This will create a new user named "hduser"
$ cd /usr/local/
$ sudo chown -R hduser:hadoop hadoop
$ cd
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
Depending on which version of Hadoop you're using, you may run into a known issue,
labeled HADOOP-7261, which describes:
“IPV6 addresses not handles currently in the common library methods. IPV6 can
return address as "0:0:0:0:0:0:port". Some utility methods such as
NetUtils#createSocketAddress(), NetUtils#normalizeHostName(),
NetUtils#getHostNameOfIp() to name a few, do not handle IPV6 address and
expect address to be of format host:port.
Until IPV6 is formally supported, I propose disabling IPV6 for junit tests to avoid
problems seen in HDFS-“
To disable IPv6, type the following command to open the file in gedit:
$ sudo gedit /etc/sysctl.conf
Enter the following lines at the end of the file, then save and close. This change
requires a reboot to take effect.
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
To check whether IPv6 is enabled or not, type in the terminal:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
0 (IPv6 enabled)
or
1 (IPv6 disabled)
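As an alternative (not part of the original steps), if you only want Hadoop itself to avoid IPv6 rather than disabling it system-wide, you can tell Hadoop's JVMs to prefer IPv4 by adding one line to conf/hadoop-env.sh; a minimal sketch:
# in /usr/local/hadoop/conf/hadoop-env.sh (alternative to the sysctl change above)
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true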
Update the bash startup file (here we use the system-wide /etc/bash.bashrc) with some Hadoop-related environment variables.
$ sudo gedit /etc/bash.bashrc
Enter the following information at the end of the file:
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
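After saving, open a new terminal or source the file so the new variables and aliases take effect; a quick check (assuming the paths above) that the PATH update worked:
$ source /etc/bash.bashrc
$ hadoop version
This should print the release of Hadoop you unpacked into /usr/local/hadoop.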
At this point we should be able to log into the user we created, "hduser", using:
$ su - hduser
  o This command can also be used to log into other usernames.
From here we need to generate an SSH key for hduser:
hduser@computerName:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@computerName
The key's randomart image is:
[...snipp...]
Enable ssh access to your local machine with:
hduser@computerName:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Log into ssh and accept public key:
hduser@computerName:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is
d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30
UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
Exit the SSH connection
hduser@computerName:~$ exit
Edit the following files:
hadoop-env.sh
core-site.xml
mapred-site.xml
hdfs-site.xml
hadoop-env.sh: We must change this file to export the correct JAVA_HOME.
hduser@computerName:~$ gedit /usr/local/hadoop/conf/hadoop-env.sh
Change
# The Java implementation to use. Required
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
To
# The Java implementation to use. Required
export JAVA_HOME=/usr/lib/jvm/java-6-sun
The following files will be re-edited later to accommodate multiple nodes. For now
we will focus on a single node.
core-site.xml
hduser@computerName:~$ gedit /usr/local/hadoop/conf/core-site.xml
<configuration>
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
mapred-site.xml
hduser@computerName:~$ gedit /usr/local/hadoop/conf/mapred-site.xml
<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
hdfs-site.xml
hduser@computerName:~$ gedit /usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is
created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
The first time we start our Hadoop cluster, it is necessary to format the Hadoop
filesystem. Like any format, this will erase all data currently in HDFS. HDFS is the
primary storage system used by Hadoop applications; it creates multiple replicas of
data blocks and distributes them to nodes throughout the cluster, which provides
reliable and very fast access to the data during computation.
The following command will format HDFS for our new cluster of computers:
hduser@computerName:~$ /usr/local/hadoop/bin/hadoop namenode -format
At this point we should be ready to start hadoop.
hduser@computerName:~$ /usr/local/hadoop/bin/start-all.sh
The output produced will display which SSH connections each process makes as it
starts. To see the currently running processes (there should be six), simply type:
hduser@computerName:~$ jps
2287 TaskTracker
1992 JobTracker
1121 DataNode
9912 SecondaryNameNode
91921 Jps
1812 NameNode
The number in front of each process is simply its PID and will vary from run to run;
it is not important here.
If you are not seeing all six processes listed above, it may be necessary to SSH into
localhost. Before you do so, stop all the processes. After SSHing into localhost,
start the processes again and check them with "jps". Jps should be the only process
listed after the "stop-all.sh" command.
hduser@computerName:~$ /usr/local/hadoop/bin/stop-all.sh
------- exits running processes -------
hduser@computerName:~$ ssh localhost
hduser@computerName:~$ /usr/local/hadoop/bin/start-all.sh
This completes the tutorial on how to set up a single node for Hadoop. If you wish to
add additional computers to the cluster, follow the above tutorial on each computer.
The overall objective is to set up at least two computers running Hadoop in order to
demonstrate its potential power.
Multi-node cluster
Once you have all of your computers set up according to the single-node guide, we can
configure them to work as one. The cluster will be set up so that one computer is
recognized as the MASTER, while all other computers are set up as SLAVEs.
Before we begin, make sure all Hadoop processes are stopped. The next step requires
logging into an account with admin access in order to edit system files. You must
also know the local IP address of each computer on the network that is set up for Hadoop.
Open a list of all known hosts on each computer.
hduser@computerName:~$ su - root * OR ANY ADMIN ACCOUNT *
$ sudo gedit /etc/hosts
Update the hosts file with the known IP address of each computer and a designated
name. After trial and error with several different configurations, I found it necessary
to include not only the Master/IP and Slave/IP entries but also each computer name/local
IP, AND TO DELETE THE localhost entry. This file must be the same on the MASTER and
ALL SLAVE computers.
If it is not set up properly, you can run into a "Too many fetch-failures" issue, where
the Reduce phase takes an extremely long time to complete. Each demonstration in
this tutorial should take no longer than 2 minutes 30 seconds.
More info on conflict
https://issues.apache.org/jira/browse/HADOOP-1930
Example of my host setup
# /etc/hosts
190.111.2.10 Master
190.111.2.10 ComputerName
190.111.2.11 Slave
190.111.2.11 ComputerName2
190.111.2.12 Slave2
190.111.2.12 ComputerName3
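A quick way (not in the original steps) to verify that the names resolve correctly is to ping each entry once from every machine, for example:
$ ping -c 1 Master
$ ping -c 1 Slave
$ ping -c 1 Slave2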
Once you have your hosts file set up, you will be able to SSH into each of the
computers by name. Log back into "hduser" and SSH into Master.
$ su - hduser
hduser@computerName:~$ ssh Master
---- accept the necessary public keys ----
Now we need to authorize the Master machine to log into each slave over SSH without
the redundancy of entering multiple passwords:
hduser@Master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@Slave
hduser@Master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@Slave2
For our own peace of mind, we will SSH into the Master and each of the slaves to check
connectivity; this also gives us an opportunity to accept the public keys and RSA key
fingerprints.
hduser@Master:~$ ssh Master
hduser@Master:~$ exit
hduser@Master:~$ ssh Slave
hduser@Slave:~$ exit
hduser@Master:~$ ssh Slave2
hduser@Slave2:~$ exit
hduser@Master:~$ exit
hduser@computerName:~$
Now that we've tested our SSH connections, it's time to configure our designated
MASTER computer. On the Master computer, enter the following command to open the
masters file.
hduser@computerName:~$ gedit /usr/local/hadoop/conf/masters
Delete the localhost entry and add "Master" to the file. Save and close. Next, open the slaves file:
hduser@computerName:~$ gedit /usr/local/hadoop/conf/slaves
In this file we will add the Master and all slaves:
Master
Slave
Slave2
(additional slaves if you have them)
Save and close.
(Example hosts, masters, and slaves file edits)
The next steps must be performed on all computers, MASTER AND SLAVES.
We will open the files we edited for the single-node setup and change them to a
multi-node configuration.
core-site.xml
In this file we will change the fs.default.name value from "hdfs://localhost:54310" to
"hdfs://Master:54310":
hduser@computerName:~$ gedit /usr/local/hadoop/conf/core-site.xml
<configuration>
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://Master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
mapred-site.xml
Like the previous file, the value here will also be changed from localhost to Master:
hduser@computerName:~$ gedit /usr/local/hadoop/conf/mapred-site.xml
<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>Master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
hdfs-site.xml
In this file we will change the dfs.replication value from 1 to however many computers
are set up to use Hadoop. For this tutorial I'm using 3 computers, or nodes.
hduser@computerName:~$ gedit /usr/local/hadoop/conf/hdfs-site.xml
<configuration>
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is
created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
Once you've updated all the files on the master and slaves, we need to format HDFS
again. First we will SSH into the master computer:
hduser@computerName:~$ ssh Master
hduser@Master:~$ /usr/local/hadoop/bin/hadoop namenode -format
Now it's time to run our multi-node cluster. Simply use the start command, then use
jps to make sure each process was started successfully.
hduser@Master:~$ /usr/local/hadoop/bin/start-all.sh
hduser@Master:~$ jps
All six processes should be running on the master. If everything started correctly,
the master computer will have used SSH to log into each slave computer and start two
to three processes there, depending on the task; if more are needed, Hadoop will
start them.
Type the following commands on the slave computers to check their running processes.
$ su - hduser
hduser@computerName2:~$ jps
1121 DataNode
1812 TaskTracker
91921 Jps
If you would like to stop everything, simply type "/usr/local/hadoop/bin/stop-all.sh".
This command will shut down all Hadoop processes on the master and the slaves.
The next step assumes that all computers are communicating correctly and all
necessary processes are running. It demonstrates a word-counting program executed
by Hadoop. There will be two programs written in Python, named mapper.py and
reducer.py. Copy the code below and save each file in the directory /usr/local/hadoop.
Save this code using any text editor as mapper.py in the directory /usr/local/hadoop:
#!/usr/bin/env python

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
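The streaming command used later in this tutorial executes these scripts directly, so they need the #!/usr/bin/env python line at the top and execute permission. Before involving Hadoop, you can also sanity-check the mapper locally; a quick illustrative test (the sample sentence is arbitrary):
hduser@Master:/usr/local/hadoop$ chmod +x mapper.py
hduser@Master:/usr/local/hadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py
foo     1
foo     1
quux    1
labs    1
foo     1
bar     1
quux    1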
Save this code in any text editor as reducer.py in the directory /usr/local/hadoop:
#!/usr/bin/env python

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
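Similarly, you can test the whole map/sort/reduce pipeline locally before running it on the cluster; in this sketch the sort command stands in for Hadoop's shuffle and sort phase:
hduser@Master:/usr/local/hadoop$ chmod +x reducer.py
hduser@Master:/usr/local/hadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar     1
foo     3
labs    1
quux    2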
Next we want to get our books. Create a folder named gutenberg in the directory /tmp.
In this folder, save the following eBooks in Plain Text UTF-8 format. Links are provided
below.
http://www.gutenberg.org/etext/20417
http://www.gutenberg.org/etext/5000
http://www.gutenberg.org/etext/4300
http://www.gutenberg.org/etext/132
http://www.gutenberg.org/etext/1661
http://www.gutenberg.org/etext/972
http://www.gutenberg.org/etext/19699
With all the processes running and the eBooks saved, we need to copy the books into
HDFS:
hduser@Master:~$ /usr/local/hadoop/bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
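Before launching the job, you can optionally confirm that the copy succeeded by listing the directory in HDFS; it should show one entry per eBook:
hduser@Master:~$ /usr/local/hadoop/bin/hadoop dfs -ls /user/hduser/gutenberg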
Finally it's time to run Hadoop on the eBooks to do a word count.
hduser@Master:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /usr/local/hadoop/mapper.py -mapper /usr/local/hadoop/mapper.py \
    -file /usr/local/hadoop/reducer.py -reducer /usr/local/hadoop/reducer.py \
    -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
This command runs Hadoop with the Python code we created; the terminal will report
map and reduce progress while the job runs.
During and after the process, you can keep track of all running and completed jobs
through the following links.
http://localhost:50030/ - Web UI for the MapReduce JobTracker
http://localhost:50060/ - Web UI for TaskTrackers
http://localhost:50070/ - Web UI for the HDFS NameNode
The MapReduce JobTracker page is a good way to confirm that all computers are
communicating with one another. Under the Nodes column it should show however many
computers were set up; in this tutorial, 3 nodes.
You can also display the names of the computers and statistics from the job.
Once the job is complete, you can print the result file from the terminal using
the command:
hduser@Master:~$ /usr/local/hadoop/bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000
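If you would rather browse the results in a normal text editor, one option (not part of the original steps) is to merge the HDFS output into a single local file, for example under /tmp; a sketch:
hduser@Master:~$ /usr/local/hadoop/bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output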
(Screenshot: example output of the word-count file, stored locally in /tmp/gutenberg-output)
To test the processing power of Hadoop, I ran the same job using a different
number of nodes.
With 1 node, the job completed in 1 minute 45 seconds.
With 2 nodes, the job completed in 1 minute 24 seconds.
With 3 nodes, the job completed in 1 minute.
END OF TUTORIAL
Conclusion
The above tutorials are everything you need to get started with Apache
Hadoop. While installing and running Hadoop, you may well run into issues, and some
troubleshooting will occasionally be necessary. You have probably also discovered
that getting past those problems and achieving the task at hand isn't too difficult
when you take the correct steps. Hadoop can be used on a very large scale, and
hopefully this guide gives you everything you need for a good start.
Resources
http://hadoop.apache.org/hdfs/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
http://en.wikipedia.org/wiki/Apache_Hadoop#Facebook
http://en.wikipedia.org/wiki/Apache_Hadoop#Yahoo.21
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
https://issues.apache.org/jira/browse/HADOOP-1930
http://hadoop.apache.org/common/docs/current/
http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html
http://hadoop.apache.org/common/docs/current/single_node_setup.html
http://hadoop.apache.org/common/docs/current/cluster_setup.html