Hadoop

Cloud Computing Virtualization Technology
-- Cloud computing data processing technology
-- Hadoop
-- MapReduce
賴智錦 / 詹奇峰
Department of Electrical Engineering, National University of Kaohsiung
2009/08/05
Cloud Computing Data Processing Technology

What is large data?
From the point of view of the infrastructure
required to do analytics, data comes in three
sizes:



Small data
Medium data
Large data
Source: http://blog.rgrossman.com/
Cloud Computing Data Processing Technology

Small data:


Small data fits into the memory of a single machine.
Example: a small dataset is the dataset for the Netflix
Prize. (The Netflix Prize seeks to substantially improve the accuracy of
predictions about how much someone is going to love a movie based on
their movie preferences.)


The Netflix Prize dataset consists of over 100 million
movie ratings from about 480 thousand randomly-chosen,
anonymous Netflix customers who rated over 17
thousand movie titles.
This dataset is just 2 GB of data and fits into the
memory of a laptop.
Source: http://blog.rgrossman.com/
Cloud Computing Data Processing Technology

Medium data:


Medium data fits into a single disk or disk array and
can be managed by a database.
It is becoming common today for companies to create
1 to 10 TB or larger data warehouses.
Source: http://blog.rgrossman.com/
Cloud Computing Data Processing Technology

Large data:



Large data is so large that it is challenging to manage
it in a database and instead specialized systems are
used.
Scientific experiments, such as the Large Hadron
Collider (LHC, the world's largest and highest-energy
particle accelerator), produce large datasets.
Log files produced by Google, Yahoo and Microsoft
and similar companies are also examples of large
datasets.
Source: http://blog.rgrossman.com/
Cloud Computing Data Processing Technology

Large data sources:


Most large datasets were produced by the
scientific and defense communities.
Two things have changed:


Large datasets are now being produced by a third
community: companies that provide internet
services, such as search, on-line advertising and
social media.
The ability to analyze these datasets is critical for
advertising systems that produce the bulk of the
revenue for these companies.
Source: http://blog.rgrossman.com/
Cloud Computing Data Processing Technology

Large data sources:

Two things have changed:


This provides a metric by which to measure the
effectiveness of analytic infrastructure and analytic
models.
Using this metric, Google settled upon analytic
infrastructure that is quite different from the grid-based
infrastructure generally used by the scientific community.
Source: http://blog.rgrossman.com/
Cloud Computing Data Processing Technology

What is a large data cloud?

A good working definition is that a large data cloud
provides


storage services and
compute services that are layered over the storage
services that scale to a data center and that have
the reliability associated with a data center.
Source: http://blog.rgrossman.com/
Cloud Computing Data Processing Technology

What are some of the options for working with
large data?


The most mature large data cloud application is the
open source Hadoop system, which consists of the
Hadoop Distributed File System (HDFS) and Hadoop’s
implementation of MapReduce.
An important advantage of Hadoop is that it has a very
robust community supporting it and there are a large
number of Hadoop projects, including Pig, which
provides simple database-like operations over data
managed by HDFS.
Source: http://blog.rgrossman.com/
Cloud Computing Data Processing Technology

The cloud grew out of parallel computing, but is better at data computation than the grid
-- Dr. 林誠謙, coordinator of Academia Sinica Grid Computing (ASGC)

Cloud computing grew out of parallel computing technology and does not depart from
the philosophy of grid computing, but cloud computing focuses more on data processing.

The small amount of data handled in each pass has led cloud computing to develop an
implementation approach different from that of grid computing.
-- Dr. 黃維誠, project leader, Enterprise and Project Management Group,
National Center for High-performance Computing
Cloud Computing Data Processing Technology

The tasks best suited to cloud computing are mostly those in which data is processed
very frequently, but the amount of data handled in each pass is small.
-- Dr. 黃維誠, project leader, Enterprise and Project Management Group,
National Center for High-performance Computing
Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2
Cloud Computing Data Processing Technology
Cloud computing vs. grid computing
- Main promoters -- Cloud: information service providers (e.g., Google, Yahoo, IBM, Amazon). Grid: academic institutions (e.g., CERN, Academia Sinica, the National Center for High-performance Computing).
- Degree of standardization -- Cloud: no standardization; each vendor adopts a different technical architecture. Grid: standardized protocols and trust mechanisms.
- Degree of open source -- Cloud: partially open source; the Hadoop framework is open source, but Google's GFS and its BigTable database system are not. Grid: fully open source.
- Domain scope -- Cloud: a single computing cluster within an enterprise's internal domain. Grid: can span enterprises and administrative domains.
- Supported hardware -- Cloud: personal computers of the same standard specification (e.g., x86 processors, hard disks, 4 GB of memory, Linux). Grid: can mix heterogeneous servers (different processors, operating systems, compiler versions, etc.).
- Data characteristics handled best -- Cloud: applications in which each computation involves little data (small enough to run on a single PC) but must be repeated a very large number of times. Grid: applications in which a single computation involves a large amount of data, e.g., analyzing multi-GB satellite signals.
Source: compiled by iThome, June 2008
Cloud Computing Data Processing Technology


Web-page search: each page to be matched is actually a small file and does not require
much processor power, so the computation for web search can be run on large numbers
of personal computers. It is much harder, however, to build grid computing out of
personal computers, because grid computing requires larger processing resources.
The practical difference is that cloud computing can combine large numbers of personal
computers to provide a service, whereas grid computing has to rely on high-performance
computers that can supply large amounts of computing resources.
-- Dr. 黃維誠, project leader, Enterprise and Project Management Group,
National Center for High-performance Computing
Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2
Cloud Computing Data Processing Technology

Cloud computing: a distributed computing technology put forward by Google that lets
developers easily build global-scale application services. Cloud computing technology
automatically manages communication, task allocation and distributed storage among
large numbers of standardized (non-heterogeneous) computers.

Grid computing: integrating heterogeneous servers across administrative domains on the
network, through standardized protocols and trust mechanisms, into computing cluster
systems that share computing and storage resources.

In-the-Cloud, or cloud service: the provider delivers services over the Internet; users
need only a browser to use them and do not have to understand how the provider's
servers operate.
Cloud Computing Data Processing Technology

The MapReduce model: a key technology Google applies in cloud computing that lets
developers write programs for processing very large amounts of data. A Map program
first splits the data into independent chunks that are distributed to a large number of
computers for processing; a Reduce program then aggregates the results into the output
the developer needs.

Hadoop: an open-source cloud computing framework developed in Java; it is an
implementation modeled on Google's cloud computing technology, but its distributed
file system differs from Google's. In 2006 Yahoo became the project's main contributor
and user.
Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2
Cloud Computing Data Processing Technology

Data processing: when there is a very large amount of data, how to split, compute and
merge it in parallel so that analysts can directly obtain summaries of that data.

Parallel data-analysis languages:

Google's Sawzall project and Yahoo's Pig project are both high-level languages for
processing large amounts of data in parallel.

Google's Sawzall is built on top of MapReduce, and Yahoo's Pig is built on top of
Hadoop (Hadoop being a clone of MapReduce), so the two are close relatives.
Hadoop: Why?

Need to process 100 TB datasets with multi-day jobs

On 1 node: scanning at 50 MB/s = 23 days
On a 1000-node cluster: scanning at 50 MB/s = 33 min

Need a framework for distribution: efficient, reliable, usable
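As a rough check of these figures (a back-of-the-envelope calculation, not part of the original slide):

100 TB / (50 MB/s) = 100,000,000 MB / (50 MB/s) = 2,000,000 s ≈ 23 days
2,000,000 s spread over 1,000 nodes = 2,000 s ≈ 33 min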
Hadoop: Where?

Batch data processing, not real-time / user facing
- Log processing
- Document analysis and indexing
- Web graphs and crawling

Highly parallel, data intensive, distributed applications
- Bandwidth to data is a constraint
- Number of CPUs is a constraint

Very large production deployments (GRID)
- Several clusters of 1000s of nodes
- LOTS of data (trillions of records, 100 TB+ data sets)
What is Hadoop?


The Apache Hadoop project develops open-source
software for reliable, scalable, distributed computing.
The project includes:



Core: provides the Hadoop Distributed Filesystem (HDFS)
and support for the MapReduce distributed computing
framework.
MapReduce: A distributed data processing model and
execution environment that runs on large clusters of
commodity machines.
Chukwa: a data collection system for managing large
distributed systems. Chukwa is built on top of the HDFS
and MapReduce framework and inherits Hadoop's
scalability and robustness.
What is Hadoop?




HBase: builds on Hadoop Core to provide a scalable,
distributed database.
Hive: a data warehouse infrastructure built on Hadoop Core
that provides data summarization, ad hoc querying and
analysis of datasets.
Pig: a high-level data-flow language and execution
framework for parallel computation. It is built on top of
Hadoop Core.
ZooKeeper: a highly available and reliable coordination
service. Distributed applications use ZooKeeper to store
and mediate updates for critical shared state.
Hadoop History






2004 - Initial versions of what is now Hadoop Distributed
File System and Map-Reduce implemented by Doug
Cutting & Mike Cafarella
December 2005 - Nutch ported to the new framework.
Hadoop runs reliably on 20 nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started
to support the standalone development of Map-Reduce
and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team
April 2006 - Sort benchmark run on 188 nodes in 47.9
hours
Hadoop History





May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
May 2006 - Sort benchmark run on 500 nodes in 42
hours (better hardware than April benchmark)
October 2006 - Research cluster reaches 600 Nodes
December 2006 - Sort times 20 nodes in 1.8 hrs, 100
nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8
hrs
April 2007 - Research clusters - 2 clusters of 1000 nodes
Source: http://hadoop.openfoundry.org/slides/Hadoop_OSDC_08.pdf
Hadoop Components

Hadoop Distributed Filesystem (HDFS)

is a distributed file system designed to run on commodity
hardware.

is highly fault-tolerant and is designed to be deployed on low-cost
hardware.

provides high throughput access to application data and is
suitable for applications that have large data sets.

relaxes a few POSIX requirements to enable streaming access to
file system data. (POSIX: Portable Operating System Interface
[for Unix])

was originally built as infrastructure for the Apache Nutch web
search engine project.

is part of the Apache Hadoop Core project.
Hadoop Components

Hadoop Distributed Filesystem (HDFS)
Source: http://hadoop.apache.org/core/
Hadoop Components

HDFS Assumptions and Goals

Hardware failure

Hardware failure is the norm rather than the exception.

An HDFS instance may consist of hundreds or thousands of
server machines, each storing part of the file system’s data.

The fact that there are a huge number of components, each with
a non-trivial probability of failure, means that some component
of HDFS is always non-functional.

Therefore, detection of faults and quick, automatic recovery
from them is a core architectural goal of HDFS.
Hadoop Components

HDFS Assumptions and Goals

Streaming Data Access

Applications that run on HDFS need streaming access to their
data sets.

They are not general purpose applications that typically run on
general purpose file systems.

HDFS is designed more for batch processing rather than
interactive use by users.

The emphasis is on high throughput of data access rather
than low latency of data access. POSIX imposes many hard
requirements that are not needed for applications that are
targeted for HDFS.
Hadoop Components

HDFS Assumptions and Goals

Large Data Sets

Applications that run on HDFS have large data sets.

A typical file in HDFS is gigabytes to terabytes in size. Thus,
HDFS is tuned to support large files.

It should provide high aggregate data bandwidth and scale to
hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.
Hadoop Components

HDFS Assumptions and Goals

Simple Coherency Model

HDFS applications need a write-once-read-many access
model for files.

A file once created, written, and closed need not be changed.
This assumption simplifies data coherency issues and enables
high throughput data access.

A Map/Reduce application or a web crawler application fits
perfectly with this model. There is a plan to support
appending-writes to files in the future.
Hadoop Components

HDFS Assumptions and Goals

"Moving Computation is Cheaper than Moving Data"

A computation requested by an application is much more
efficient if it is executed near the data it operates on. This is
especially true when the size of the data set is huge.

This minimizes network congestion and increases the overall
throughput of the system.

It is often better to migrate the computation closer to where
the data is located rather than moving the data to where the
application is running. HDFS provides interfaces for
applications to move themselves closer to where the data is
located.
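One concrete form of these interfaces is the block-location query in the HDFS Java client API, sketched below as a minimal illustration; the file path is an assumed example. Schedulers (such as Hadoop's MapReduce framework) use this kind of information to place tasks near their data.

// Sketch: asking HDFS where the blocks of a file live, so that computation
// can be scheduled on or near those hosts. The path is an illustrative example.
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/guest/input/large.log"));

    // One BlockLocation per block, each listing the DataNodes holding a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("block at offset " + block.getOffset()
          + " stored on hosts " + Arrays.toString(block.getHosts()));
    }
  }
}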
Hadoop Components

HDFS Assumptions and Goals

Portability Across Heterogeneous Hardware and
Software Platforms

HDFS has been designed to be easily portable from one
platform to another.

This facilitates widespread adoption of HDFS as a platform of
choice for a large set of applications.
Hadoop Components

HDFS: Namenode and Datanode

HDFS has a master/slave architecture



An HDFS cluster consists of a single NameNode, a master server
that manages the file system namespace and regulates access to
files by clients.
In addition, there are a number of DataNodes, usually one per
node in the cluster, which manage storage attached to the nodes
that they run on.
HDFS exposes a file system namespace and allows user data to
be stored in files. Internally, a file is split into one or more blocks
and these blocks are stored in a set of DataNodes.
Hadoop Components

HDFS: Namenode and Datanode

HDFS has a master/slave architecture



The NameNode executes file system namespace operations like
opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write
requests from the file system’s clients. The DataNodes also
perform block creation, deletion, and replication upon instruction
from the NameNode.
The existence of a single NameNode in a cluster greatly
simplifies the architecture of the system. The NameNode is the
arbitrator and repository for all HDFS metadata. The system is
designed in such a way that user data never flows through the
NameNode.
Hadoop Components

Hadoop Distributed Filesystem (HDFS)
Source: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
Hadoop Components

HDFS: The File System Namespace


HDFS supports a traditional hierarchical file organization. A user
or an application can create directories and store files inside
these directories.
The file system namespace hierarchy is similar to most other
existing file systems.



one can create and remove files, move a file from one directory to
another, or rename a file.
The NameNode maintains the file system namespace. Any
change to the file system namespace or its properties is recorded
by the NameNode.
An application can specify the number of replicas of a file that
should be maintained by HDFS. The number of copies of a file is
called the replication factor of that file. This information is stored
by the NameNode.
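To illustrate how an application drives these namespace operations, here is a minimal sketch using the Hadoop FileSystem Java API; the paths and the replication value are made-up examples, not taken from the slides.

// Sketch of basic HDFS namespace operations through the FileSystem API.
// The NameNode records the namespace changes; the DataNodes store the blocks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceDemo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from core-site.xml (e.g. hdfs://localhost:9000).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a directory and write a small file (illustrative paths).
    Path dir = new Path("/user/guest/demo");
    Path file = new Path(dir, "hello.txt");
    fs.mkdirs(dir);
    FSDataOutputStream out = fs.create(file);
    out.writeBytes("hello hdfs\n");
    out.close();

    // Change the replication factor for this one file.
    fs.setReplication(file, (short) 2);

    // Read it back and query its length from the NameNode's metadata.
    FSDataInputStream in = fs.open(file);
    System.out.println("first byte: " + in.read());
    in.close();
    System.out.println("length: " + fs.getFileStatus(file).getLen());

    // Rename and delete are pure namespace operations.
    fs.rename(file, new Path(dir, "hello-renamed.txt"));
    fs.delete(dir, true);  // recursive delete
  }
}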
Hadoop Components

Hadoop Distributed Processing Framework




Using MapReduce Metaphor
Map/Reduce is a software framework for easily writing
applications which process vast amounts of data in parallel
on large clusters of commodity hardware.
A simple programming model that applies to many large-scale computing problems
Hide messy details in the MapReduce runtime library:





Automatic parallelization
Load balancing
Network and disk transfer optimization
Handling of machine failures
Robustness
Hadoop Components



A Map/Reduce job usually splits the input data-set into
independent chunks which are processed by the map tasks in
a completely parallel manner.
The framework sorts the outputs of the maps, which are then
input to the reduce tasks. The framework takes care of
scheduling tasks, monitoring them and re-executing the failed
tasks.
The Map/Reduce framework consists of a single master
JobTracker and one slave TaskTracker per cluster-node.


The master is responsible for scheduling the jobs' component
tasks on the slaves, monitoring them and re-executing the failed
tasks.
The slaves execute the tasks as directed by the master.
Hadoop Components

Although the Hadoop framework is implemented in
Java™, Map/Reduce applications need not be
written in Java.


Hadoop Streaming is a utility which allows users to create
and run jobs with any executables (e.g. shell utilities) as the
mapper and/or the reducer.
Hadoop Pipes is a SWIG-compatible C++ API to
implement Map/Reduce applications (not JNI™ [Java
Native Interface] based).
Hadoop Components

MapReduce concepts

Definition:



Map function: take a set of (key, value) pairs and generate a
set of intermediate (key, value) pairs by applying some function
to all these pairs, e.g., (k1, v1) → list(k2, v2)
Reduce function: merge all pairs with the same key, applying a
reduction function to the values, e.g., (k2, list(v2)) → list(k3, v3)
Input and Output types of a Map/Reduce job:
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)





Read a lot of data
Map: extract something meaningful from each record
Shuffle and Sort
Reduce: aggregate, summarize, filter, or transform
Write the results
Hadoop Components

MapReduce concepts
[Figure: word-count data flow. Three input splits ("the quick brown fox", "the fox ate the mouse", "the small mouse") are processed in parallel by Map tasks into (word, 1) pairs, which are shuffled and sorted by key and then aggregated by Reduce tasks into per-word totals in the output.]
Hadoop Components

Consider the problem of counting the number of
occurrences of each word in a large collection of
documents:
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences ("1" in this example).
The reduce function sums together all the counts emitted for a particular word.
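The same word count can be expressed directly against Hadoop's Java API. The sketch below targets the Hadoop 0.20 MapReduce API (org.apache.hadoop.mapreduce); it is a minimal illustration rather than code from the slides, and details such as the class names and the use of the reducer as a combiner are our own choices.

// Word count as a Hadoop 0.20 MapReduce job (minimal sketch).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // reducer doubles as combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}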
Hadoop Components

MapReduce Execution Overview
1. The MapReduce library in the user
program first shards the input files into M
pieces of typically 16-64 megabytes (MB) per
piece. It then starts up many copies of the
program on a cluster of machines.
2. One of the copies of the program is
special: the master. The rest are workers that
are assigned work by the master. There are
M map tasks and R reduce tasks to assign.
The master picks idle workers and assigns
each one a map task or a reduce task.
3. A worker who is assigned a map task
reads the contents of the corresponding input
split. It parses key/value pairs out of the input
data and passes each pair to the user-defined map function. The intermediate
key/value pairs produced by the map
function are buffered in memory.
Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large
Clusters. Communications of the ACM, 51(1):107-113, 2008.
Hadoop Components

MapReduce Execution Overview
4. Periodically, the buffered pairs are written
to local disk, partitioned into R regions by the
partitioning function. The locations of these
buffered pairs on the local disk are passed
back to the master, who is responsible for
forwarding these locations to the reduce
workers.
5. When a reduce worker is notified by the
master about these locations, it uses remote
procedure calls to read the buffered data
from the local disks of the map workers.
When a reduce worker has read all
intermediate data, it sorts it by the
intermediate keys so that all occurrences of
the same key are grouped together. If the
amount of intermediate data is too large to fit
in memory, an external sort is used.
Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large
Clusters. Communications of the ACM, 51(1):107-113, 2008.
Hadoop Components

MapReduce Execution Overview
6. The reduce worker iterates over the sorted
intermediate data and for each unique
intermediate key encountered, it passes the
key and the corresponding set of
intermediate values to the user's reduce
function. The output of the reduce function is
appended to a final output file for this reduce
partition.
7. When all map tasks and reduce tasks
have been completed, the master wakes up
the user program. At this point, the
MapReduce call in the user program returns
back to the user code.
Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large
Clusters. Communications of the ACM, 51(1):107-113, 2008.
Hadoop Components

MapReduce Examples

Distributed Grep (global search regular expression and
print out the line):



The map function emits a line if it matches a given pattern.
The reduce function is an identity function that just copies the
supplied intermediate data to the output.
Count of URL Access Frequency:


The map function processes logs of web page requests and
outputs <URL, 1>.
The reduce function adds together all values for the same
URL and emits a <URL, total count> pair.
Hadoop Components

MapReduce Examples

Reverse Web-Link Graph:



The map function outputs <target, source> pairs for each link to a
target URL found in a page named "source".
The reduce function concatenates the list of all source URLs
associated with a given target URL and emits the pair: <target,
list(source)>.
Inverted Index:



The map function parses each document, and emits a sequence
of <word, document ID> pairs.
The reduce function accepts all pairs for a given word, sorts the
corresponding document IDs and emits a <word, list(document
ID)> pair.
The set of all output pairs forms a simple inverted index. It is easy
to augment this computation to keep track of word positions.
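As a sketch of how one of these patterns looks as Hadoop Java code, below is a possible mapper/reducer pair for the inverted index. It is illustrative only: deriving the document ID from the input file name (via the FileSplit of a file-based input format) and the comma-separated output format are our assumptions, and sorting of the document IDs is omitted.

// Inverted index (sketch): map emits (word, docId); reduce concatenates the
// document IDs for each word. Assumes a file-based input format such as
// TextInputFormat, so the input split can be cast to FileSplit.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  public static class IndexMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Use the input file name as the document ID (an illustrative choice).
      String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        context.write(new Text(itr.nextToken()), new Text(docId));
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      // Concatenate the document IDs seen for this word.
      StringBuilder list = new StringBuilder();
      for (Text id : docIds) {
        if (list.length() > 0) list.append(',');
        list.append(id.toString());
      }
      context.write(word, new Text(list.toString()));
    }
  }
}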
Hadoop Components

MapReduce Examples
 Term-Vector per Host: A term vector summarizes the
most important words that occur in a document or a set of
documents as a list of <word, frequency> pairs.


The map function emits a <hostname, term vector> pair for
each input document (where the hostname is extracted from
the URL of the document).
The reduce function is passed all per-document term vectors
for a given host. It adds these term vectors together, throwing
away infrequent terms, and then emits a final <hostname,
term vector> pair.
[Charts: MapReduce programs in Google's source tree, and new MapReduce programs per month.]
Source: http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
Who Uses Hadoop












Amazon/A9
Facebook
Google
IBM
Joost
Last.fm
New York Times
PowerSet (now Microsoft)
Quantcast
Veoh
Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy
Hadoop Resource






http://hadoop.apache.org
http://developer.yahoo.net/blogs/hadoop/
http://code.google.com/intl/zh-TW/edu/submissions/uwspr2007_clustercourse/listing.html
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
J. Dean and S. Ghemawat, "MapReduce: Simplified
Data Processing on Large Clusters," Communications of
the ACM, 51(1):107-113, 2008.
T. White, Hadoop: The Definitive Guide (MapReduce for
the Cloud), O'Reilly, 2009.
Hadoop Download

National Center for High-performance Computing (國網中心): http://ftp.twaren.net/Unix/Web/apache/hadoop/core/

HTTP

http://ftp.stut.edu.tw/var/ftp/pub/OpenSource/apache/hadoop/core/

http://ftp.twaren.net/Unix/Web/apache/hadoop/core/
http://ftp.mirror.tw/pub/apache/hadoop/core/
http://apache.cdpa.nsysu.edu.tw/hadoop/core/
http://ftp.tcc.edu.tw/pub/Apache/hadoop/core/
http://apache.ntu.edu.tw/hadoop/core/





FTP

ftp://ftp.stut.edu.tw/pub/OpenSource/apache/hadoop/core/
ftp://ftp.stu.edu.tw/Unix/Web/apache/hadoop/core/
ftp://ftp.twaren.net/Unix/Web/apache/hadoop/core/

ftp://apache.cdpa.nsysu.edu.tw/Unix/Web/apache/hadoop/core/


Hadoop Virtual Image
http://code.google.com/intl/zh-TW/edu/parallel/tools/hadoopvm/

Setting up a Hadoop cluster can be an all-day job. A
virtual machine image has been created with a preconfigured
single-node instance of Hadoop.
A virtual machine encapsulates one operating system within another.
(http://developer.yahoo.com/hadoop/tutorial/module3.html)
Hadoop Virtual Image
http://code.google.com/intl/zh-TW/edu/parallel/tools/hadoopvm/

While this doesn't have the power of a full cluster, it
does allow you to use the resources on your local
machine to explore the Hadoop platform.

The virtual machine image is designed to be used
with the free VMware Player.

Hadoop can be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs
in a separate Java process.
Setting Up the Image

The image is packaged as a directory archive. To begin setup,
unpack the image in the directory of your choice (you need at
least 10 GB; the disk image can grow to 20 GB).

The VMware image package contains:




image.vmx -- The VMware guest OS profile, a configuration file
that describes the virtual machine characteristics (virtual CPU(s),
amount of memory, etc.).
20GB.vmdk -- A VMware virtual disk used to store the content of
the virtual machine hard disk; this file grows as you store data on
the virtual image. It is configured to store up to 20GB.
The archive contains two other files, image.vmsd and nvram;
these are not critical for running the image but are created by the
VMware player on startup.
As you run the virtual machine, log files (vmware-x.log) will be
created.
Setting Up the Image

The system image is based on Ubuntu (version 7.04)
and contains a Java machine (Sun JRE 6 - DLJ License
v1.1) and the latest Hadoop distribution (0.13.0).

A new window will appear which will print a message
indicating the IP address allocated to the guest OS. This
is the IP address you will use to submit jobs from the
command line or the Eclipse environment.

The guest OS contains a running Hadoop infrastructure
which is configured with:


A GFS (HDFS) infrastructure using a single data node (no replication)
A single MapReduce worker
Setting Up the Image

The guest OS can be reached from the provided console or
via SSH using the IP address indicated above. Log into the
guest OS with:



guest log in: guest, guest password: guest
administrator log in: root, administrator password: root
Once the image is loaded, you can log in with the guest
account. Hadoop will be installed in the guest home
directory (/home/guest/hadoop). Three scripts are provided
for Hadoop maintenance purposes:



start-hadoop -- Starts file-system and MapReduce daemons.
stop-hadoop -- Stops all Hadoop daemons.
reset-hadoop -- Restarts a new Hadoop environment with an entirely
empty file system.
Hadoop 0.20 Install

Introduction

Setup environment

Install Hadoop
- Download and update software packages
- Install Java
- Install and configure SSH
- Install Hadoop

Hadoop example tests
Introduction

Ubuntu is a free operating system, a GNU/Linux distribution derived
from Debian. It is maintained by Canonical Ltd., founded by Mark
Shuttleworth, the first African entrepreneur to travel to space at his
own expense. The first release (Ubuntu 4.10 Warty Warthog) came out
in 2004 and was an immediate success; since 2005 it has been the most
popular GNU/Linux distribution. The latest version at the time of
writing is Ubuntu 9.04.

Developing for Hadoop requires a good deal of object-oriented syntax,
including inheritance and interface classes, and the correct classpath
must be imported.
Ubuntu Operating System

Minimum requirements:
- 300 MHz x86 processor
- 64 MB of RAM (the LiveCD needs 256 MB of RAM to run)
- 4 GB of disk space (including swap space)
- Graphics chip capable of 640x480 VGA
- CD-ROM drive or network card

Recommended requirements:
- 700 MHz x86 processor
- 384 MB of RAM
- 8 GB of disk space (including swap space)
- Graphics chip capable of 1024x768 VGA
- Sound card
- Internet connection
Setup environment

Live CD Ubuntu 9.04
sun-java-6
hadoop 0.20.0
Directory layout

User: cfong

User's home directory: /home/cfong

Project directory: /home/cfong/workspace

Hadoop directory: /opt/hadoop
Install Hadoop

There are many possible installation setups: Hadoop can be run under
VMware Player, or installed on an Ubuntu or CentOS operating system.
Here it is installed on the Live CD Ubuntu 9.04 operating system.

During installation, be clear about which directory each package is
installed into.
Update Package

$ sudo -i
# switch to the super user

$ sudo apt-get update
# update the package lists

$ sudo apt-get upgrade
# upgrade all installed packages
Download Package

Download hadoop-0.20.0.tar.gz and put it under /opt/:
http://apache.cdpa.nsysu.edu.tw/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

Download the Java SE Development Kit (JDK), JDK 6 Update 14 (jdk-6u10-docs.zip), and put it under /tmp/:
https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDSCDS_Developer-Site/en_US/-/USD/ViewProductDetailStart?ProductRef=jdk-6u10-docs-oth-JPR@CDSCDS_Developer
Install Java

Install the basic Java packages
$ sudo apt-get install java-common sun-java6-bin sun-java6-jdk

Install sun-java6-doc
$ sudo apt-get install sun-java6-doc
SSH install and setup

$ apt-get install ssh

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ ssh localhost
Install Hadoop

$ cd /opt

$ sudo tar -zxvf hadoop-0.20.0.tar.gz

$ sudo chown -R cfong:cfong /opt/hadoop-0.20.0

$ sudo ln -sf /opt/hadoop-0.20.0 /opt/hadoop
Environment Variables Setup

nano /opt/hadoop/conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-sun

export HADOOP_HOME=/opt/hadoop

export PATH=$PATH:/opt/hadoop/bin
Environment Variables Setup
nano /opt/hadoop/conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop/hadoop-${user.name}</value>
</property>
</configuration>
Environment Variables Setup
nano /opt/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Environment Variables Setup
nano /opt/hadoop/conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Start Hadoop

$ cd /opt/hadoop

$ source /opt/hadoop/conf/hadoop-env.sh

$ hadoop namenode -format

$ start-all.sh

$ hadoop fs -put conf input

$ hadoop fs -ls
Hadoop Examples

Example 1:
$ cd /opt/hadoop
$ bin/hadoop version
Hadoop 0.20.0
Subversion https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504
Compiled by ndaley on Thu Apr 9 05:18:40 UTC 2009
Hadoop Examples

Example 2: $ /opt/hadoop/bin/hadoop jar hadoop-0.20.0-examples.jar pi 4 10000
Number of Maps = 4
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
09/08/01 06:56:41 INFO mapred.FileInputFormat: Total input paths to process : 4
09/08/01 06:56:42 INFO mapred.JobClient: Running job: job_200908010505_0002
09/08/01 06:56:43 INFO mapred.JobClient: map 0% reduce 0%
09/08/01 06:56:53 INFO mapred.JobClient: map 50% reduce 0%
09/08/01 06:56:56 INFO mapred.JobClient: map 100% reduce 0%
09/08/01 06:57:05 INFO mapred.JobClient: map 100% reduce 100%
09/08/01 06:57:07 INFO mapred.JobClient: Job complete: job_200908010505_0002
09/08/01 06:57:07 INFO mapred.JobClient: Counters: 18
09/08/01 06:57:07 INFO mapred.JobClient: Job Counters
09/08/01 06:57:07 INFO mapred.JobClient: Launched reduce tasks=1
09/08/01 06:57:07 INFO mapred.JobClient: Launched map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient: Data-local map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient: FileSystemCounters
09/08/01 06:57:07 INFO mapred.JobClient: FILE_BYTES_READ=94
09/08/01 06:57:07 INFO mapred.JobClient: HDFS_BYTES_READ=472
09/08/01 06:57:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=334
09/08/01 06:57:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
09/08/01 06:57:07 INFO mapred.JobClient: Map-Reduce Framework
09/08/01 06:57:07 INFO mapred.JobClient: Reduce input groups=8
09/08/01 06:57:07 INFO mapred.JobClient: Combine output records=0
09/08/01 06:57:07 INFO mapred.JobClient: Map input records=4
09/08/01 06:57:07 INFO mapred.JobClient: Reduce shuffle bytes=112
09/08/01 06:57:07 INFO mapred.JobClient: Reduce output records=0
09/08/01 06:57:07 INFO mapred.JobClient: Spilled Records=16
09/08/01 06:57:07 INFO mapred.JobClient: Map output bytes=72
09/08/01 06:57:07 INFO mapred.JobClient: Map input bytes=96
09/08/01 06:57:07 INFO mapred.JobClient: Combine input records=0
09/08/01 06:57:07 INFO mapred.JobClient: Map output records=8
09/08/01 06:57:07 INFO mapred.JobClient: Reduce input records=8
Job Finished in 25.84 seconds
Estimated value of Pi is 3.14140000000000000000
Hadoop Examples

Example 3: $ /opt/hadoop/bin/start-all.sh
localhost: starting datanode, logging to
/opt/hadoop/logs/hadoop-root-datanode-cfong-desktop.out
localhost: starting secondarynamenode, logging to
/opt/hadoop/logs/hadoop-root-secondarynamenode-cfong-desktop.out
starting jobtracker, logging to
/opt/hadoop/logs/hadoop-root-jobtracker-cfong-desktop.out
localhost: starting tasktracker, logging to
/opt/hadoop/logs/hadoop-root-tasktracker-cfong-desktop.out
Hadoop Examples

Example 4: $ /opt/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Hadoop Examples
Example 5:
$ cd /opt/hadoop
$ jps

20911 JobTracker
20582 DataNode
27281 Jps
20792 SecondaryNameNode
21054 TaskTracker
20474 NameNode
Hadoop Examples

Example 6: $ sudo netstat -plten | grep java
(columns: Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, User, Inode, PID/Program name)
tcp6  0  0  :::50145         :::*  LISTEN  0  141655  20792/java
tcp6  0  0  :::45538         :::*  LISTEN  0  142200  20911/java
tcp6  0  0  :::50020         :::*  LISTEN  0  143573  20582/java
tcp6  0  0  127.0.0.1:9000   :::*  LISTEN  0  139970  20474/java
tcp6  0  0  127.0.0.1:9001   :::*  LISTEN  0  142203  20911/java
tcp6  0  0  :::50090         :::*  LISTEN  0  143534  20792/java
tcp6  0  0  :::53866         :::*  LISTEN  0  140629  20582/java
tcp6  0  0  :::50060         :::*  LISTEN  0  143527  21054/java
tcp6  0  0  127.0.0.1:37870  :::*  LISTEN  0  143559  21054/java
tcp6  0  0  :::50030         :::*  LISTEN  0  143441  20911/java
tcp6  0  0  :::50070         :::*  LISTEN  0  143141  20474/java
tcp6  0  0  :::50010         :::*  LISTEN  0  143336  20582/java
tcp6  0  0  :::50075         :::*  LISTEN  0  143536  20582/java
tcp6  0  0  :::50397         :::*  LISTEN  0  139967  20474/java
Download