Cloud Computing Virtualization Technology
Cloud Computing Data Processing Technology -- Hadoop -- MapReduce
賴智錦 / 詹奇峰
Department of Electrical Engineering, National University of Kaohsiung
2009/08/05

What is large data?
From the point of view of the infrastructure required to do analytics, data comes in three sizes: small data, medium data, and large data.

Small data:
Small data fits into the memory of a single machine. An example of a small dataset is the dataset for the Netflix Prize. (The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.) The Netflix Prize dataset consists of over 100 million movie ratings made by 480 thousand randomly chosen, anonymous Netflix customers who rated over 17 thousand movie titles. This dataset is just 2 GB of data and fits into the memory of a laptop.

Medium data:
Medium data fits onto a single disk or disk array and can be managed by a database. It is becoming common today for companies to create data warehouses of 1 to 10 TB or larger.

Large data:
Large data is so large that it is challenging to manage in a database, so specialized systems are used instead. Scientific experiments, such as the Large Hadron Collider (LHC, the world's largest and highest-energy particle accelerator), produce large datasets. The log files produced by Google, Yahoo, Microsoft, and similar companies are also examples of large datasets.

Large data sources:
Most large datasets used to be produced by the scientific and defense communities. Two things have changed:
- Large datasets are now also being produced by a third community: companies that provide internet services, such as search, on-line advertising, and social media. The ability to analyze these datasets is critical for the advertising systems that produce the bulk of the revenue for these companies.
- This provides a metric by which to measure the effectiveness of analytic infrastructure and analytic models. Using this metric, Google settled on analytic infrastructure that is quite different from the grid-based infrastructure generally used by the scientific community.

What is a large data cloud?
A good working definition is that a large data cloud provides storage services, and compute services layered over the storage services, that scale to a data center and have the reliability associated with a data center.

What are some of the options for working with large data?
The most mature large data cloud application is the open-source Hadoop system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop's implementation of MapReduce. An important advantage of Hadoop is that it has a very robust community supporting it, and there are a large number of related Hadoop projects, including Pig, which provides simple database-like operations over data managed by HDFS.
Source: http://blog.rgrossman.com/

The cloud grew out of parallel computing, but it is better at data-intensive computation than the grid.
-- Dr. 林誠謙, director, Academia Sinica Grid Computing (ASGC)

Cloud computing grew out of parallel computing technology and does not depart from the philosophy of grid computing, but it concentrates more on data processing. Because each individual job processes only a small amount of data, cloud computing has developed an implementation style different from that of grid computing. The tasks best suited to cloud computing are mostly those in which data is processed very frequently but each individual job handles only a small amount of data.
-- Dr. 黃維誠, project lead, Enterprise and Project Management Division, National Center for High-performance Computing (NCHC)
Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

Cloud computing vs. grid computing (source: compiled by iThome, June 2008)
- Main promoters. Cloud: IT providers such as Google, Yahoo, IBM, and Amazon. Grid: academic institutions such as CERN, Academia Sinica, and the National Center for High-performance Computing.
- Degree of standardization. Cloud: no standardization; each vendor uses a different technical architecture. Grid: standardized protocols and trust mechanisms.
- Degree of open source. Cloud: partially open source; the Hadoop framework is open source, but Google's GFS and the BigTable database system are not. Grid: fully open source.
- Domain restrictions. Cloud: a single computing cluster inside an enterprise's internal network domain. Grid: can span enterprises and administrative domains.
- Supported hardware. Cloud: personal computers of the same standard specification (e.g. x86 processors, hard disks, 4 GB of memory, Linux). Grid: can mix heterogeneous servers (different processors, operating systems, compiler versions, and so on).
- Data characteristics it handles best. Cloud: applications in which each individual computation involves little data (it can run on a single PC) but must be repeated a very large number of times. Grid: applications in which a single computation involves a large amount of data, for example analyzing a multi-gigabyte satellite signal.

Searching web pages: each web page to be matched is actually a small file and consumes little processor power, so a large number of personal computers is enough to run web-search computations. Building a grid out of personal computers is harder, because grid computing requires far more processing resources per job. The practical difference is that cloud computing can combine a large number of personal computers to provide a service, whereas grid computing relies on high-performance computers that can supply large amounts of computing resources.
-- Dr. 黃維誠, NCHC
Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

Cloud computing: a distributed computing technology promoted by Google that lets developers easily build globe-spanning application services. Cloud computing technology automatically manages communication, task allocation, and distributed storage across a large number of standardized (non-heterogeneous) computers.
Grid computing: integrating heterogeneous servers across administrative domains over the network, through standardized protocols and trust mechanisms, to build computing clusters that share compute and storage resources.
In-the-cloud services (cloud services): the provider delivers the service over the Internet; users only need a browser and do not need to know how the provider's servers work.
The MapReduce model: the key technology Google applies in its cloud, which lets developers write programs that process huge amounts of data. A Map program first splits the data into independent chunks and distributes them to a large number of computers; a Reduce program then consolidates the results and produces the output the developer needs.
Hadoop: an open-source cloud computing framework written in Java that implements Google's cloud computing techniques, although its distributed file system differs from Google's. In 2006, Yahoo became the project's main contributor and user.
Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

Data processing: when a massive amount of data is available, how to split, compute, and merge it in parallel so that the people processing it can obtain a summary of the data directly.
Parallel data-analysis languages: Google's Sawzall project and Yahoo's Pig project are both high-level languages for processing large amounts of data in parallel. Sawzall is built on top of MapReduce and Pig is built on top of Hadoop (Hadoop being a clone of MapReduce), so the two share essentially the same lineage.

Hadoop: Why?
There is a need to process 100 TB datasets with multi-day jobs, and a framework is needed to distribute the work efficiently, reliably, and usably. Scanning 100 TB at 50 MB/s on one node takes about 2 x 10^6 seconds, roughly 23 days; spread across a 1000-node cluster, the same scan takes about 2,000 seconds, roughly 33 minutes.

Hadoop: Where?
- Batch data processing, not real-time or user-facing.
- Highly parallel, data-intensive, distributed applications: log processing, document analysis and indexing, web graphs and crawling.
- Bandwidth to the data is a constraint; the number of CPUs is a constraint.
- Very large production deployments (GRID): several clusters of thousands of nodes, and lots of data (trillions of records, 100 TB+ datasets).

What is Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The project includes:
- Core: provides the Hadoop Distributed Filesystem (HDFS) and support for the MapReduce distributed computing framework.
- MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
- Chukwa: a data collection system for managing large distributed systems. Chukwa is built on top of HDFS and the MapReduce framework and inherits Hadoop's scalability and robustness.
- HBase: builds on Hadoop Core to provide a scalable, distributed database.
- Hive: a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad-hoc querying, and analysis of datasets.
- Pig: a high-level data-flow language and execution framework for parallel computation, built on top of Hadoop Core.
- ZooKeeper: a highly available and reliable coordination service. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.
Hadoop History
- 2004: Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella.
- December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
- January 2006: Doug Cutting joins Yahoo!.
- February 2006: The Apache Hadoop project is officially started to support the standalone development of MapReduce and HDFS.
- March 2006: Formation of the Yahoo! Hadoop team.
- April 2006: Sort benchmark run on 188 nodes in 47.9 hours.
- May 2006: Yahoo sets up a 300-node Hadoop research cluster.
- May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
- October 2006: The research cluster reaches 600 nodes.
- December 2006: Sort times of 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
- April 2007: Research clusters: two clusters of 1000 nodes.
Source: http://hadoop.openfoundry.org/slides/Hadoop_OSDC_08.pdf

Hadoop Components
Hadoop Distributed Filesystem (HDFS)
- is a distributed file system designed to run on commodity hardware.
- is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- provides high-throughput access to application data and is suitable for applications that have large data sets.
- relaxes a few POSIX requirements to enable streaming access to file system data (POSIX: Portable Operating System Interface [for Unix]).
- was originally built as infrastructure for the Apache Nutch web search engine project.
- is part of the Apache Hadoop Core project.
Source: http://hadoop.apache.org/core/

HDFS Assumptions and Goals
Hardware failure: Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components, each with a non-trivial probability of failure, means that some component of HDFS is always non-functional. Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.

Streaming data access: Applications that run on HDFS need streaming access to their data sets. They are not the general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users; the emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that are not needed for applications targeted at HDFS.

Large data sets: Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size. HDFS is therefore tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.

Simple coherency model: HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler fits this model perfectly. There is a plan to support appending writes to files in the future.
"Moving computation is cheaper than moving data": A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge, because it minimizes network congestion and increases the overall throughput of the system. It is often better to migrate the computation closer to where the data is located than to move the data to where the application is running, and HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability across heterogeneous hardware and software platforms: HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

HDFS: NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files; internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes.

The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients; they also perform block creation, deletion, and replication upon instruction from the NameNode. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system: the NameNode is the arbitrator and repository for all HDFS metadata, and the system is designed in such a way that user data never flows through the NameNode.
Source: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html

HDFS: The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. The NameNode maintains the file system namespace, and any change to the namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that HDFS should maintain; the number of copies of a file is called the replication factor of that file, and this information is stored by the NameNode.
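To make the client's view of this concrete, here is a minimal sketch (not from the original slides) that uses Hadoop's Java FileSystem API to create a file and read back its replication factor. The class name HdfsWriteOnce and the path /user/guest/demo.txt are illustrative only, and the sketch assumes a running HDFS whose core-site.xml is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnce {
      public static void main(String[] args) throws Exception {
        // Reads fs.default.name (e.g. hdfs://localhost:9000) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used only for illustration.
        Path file = new Path("/user/guest/demo.txt");

        // Write-once: create the file, write its contents, then close it.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello hdfs\n");
        out.close();

        // The replication factor is file metadata kept by the NameNode.
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println(file + " has replication factor " + replication);

        fs.close();
      }
    }

The client contacts the NameNode only for metadata (create, lookup, replication factor); the file bytes themselves stream to DataNodes, consistent with the "user data never flows through the NameNode" design described above.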
Hadoop Components
Hadoop Distributed Processing Framework Using the MapReduce Metaphor
MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware. It is a simple programming model that applies to many large-scale computing problems and hides the messy details in the MapReduce runtime library:
- automatic parallelization
- load balancing
- network and disk transfer optimization
- handling of machine failures
- robustness

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks; the slaves execute the tasks as directed by the master.

Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java. Hadoop Streaming is a utility that allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (it is not based on JNI, the Java Native Interface).

MapReduce Concepts
- Map function: takes a set of (key, value) pairs and generates a set of intermediate (key, value) pairs by applying some function to all of these pairs, e.g. (k1, v1) -> list(k2, v2).
- Reduce function: merges all pairs with the same key, applying a reduction function to the values, e.g. (k2, list(v2)) -> list(k3, v3).
- Input and output types of a MapReduce job: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output).
- Overall flow: read a lot of data; Map: extract something meaningful from each record; shuffle and sort; Reduce: aggregate, summarize, filter, or transform; write the results.

[Figure: word-count data flow. Three input lines ("the quick brown fox", "the fox ate the mouse", "the small mouse") pass through Map, Shuffle & Sort, and Reduce stages to produce per-word counts.]

Consider the problem of counting the number of occurrences of each word in a large collection of documents:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

The map function emits each word plus an associated count of occurrences ("1" in this example). The reduce function sums together all the counts emitted for a particular word. A complete Java version of this job is sketched below.
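As a concrete counterpart to the pseudocode above, the following is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapred API used by the Hadoop 0.20-era releases covered in these slides. The class name WordCount is illustrative; the input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

      // Mapper: for every input line, emit (word, 1) for each word in the line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reducer: sum all the counts emitted for a given word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // local pre-aggregation of map output
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

Packaged into a jar, it would be run with something like "bin/hadoop jar wordcount.jar WordCount input output", where input and output are HDFS paths.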
Hadoop Components
MapReduce Execution Overview
1. The MapReduce library in the user program first shards the input files into M pieces of typically 16-64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
Source: J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, 51(1):107-113, 2008.

MapReduce Examples
- Distributed grep (globally search for a regular expression and print the matching lines): The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
- Count of URL access frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
- Reverse web-link graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.
- Inverted index: The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
- Term vector per host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs.
The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host; it adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.

[Charts: the number of MapReduce programs in Google's source tree, and the number of new MapReduce programs per month. Source: http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf]

Who Uses Hadoop
Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, Powerset (now Microsoft), Quantcast, Veoh, Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy

Hadoop Resources
- http://hadoop.apache.org
- http://developer.yahoo.net/blogs/hadoop/
- http://code.google.com/intl/zh-TW/edu/submissions/uwspr2007_clustercourse/listing.html
- http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
- J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, 51(1):107-113, 2008.
- T. White, Hadoop: The Definitive Guide (MapReduce for the Cloud), O'Reilly, 2009.

Hadoop Download
NCHC (國網中心) mirror: http://ftp.twaren.net/Unix/Web/apache/hadoop/core/
HTTP mirrors:
- http://ftp.stut.edu.tw/var/ftp/pub/OpenSource/apache/hadoop/core/
- http://ftp.twaren.net/Unix/Web/apache/hadoop/core/
- http://ftp.mirror.tw/pub/apache/hadoop/core/
- http://apache.cdpa.nsysu.edu.tw/hadoop/core/
- http://ftp.tcc.edu.tw/pub/Apache/hadoop/core/
- http://apache.ntu.edu.tw/hadoop/core/
FTP mirrors:
- ftp://ftp.stut.edu.tw/pub/OpenSource/apache/hadoop/core/
- ftp://ftp.stu.edu.tw/Unix/Web/apache/hadoop/core/
- ftp://ftp.twaren.net/Unix/Web/apache/hadoop/core/
- ftp://apache.cdpa.nsysu.edu.tw/Unix/Web/apache/hadoop/core/

Hadoop Virtual Image
http://code.google.com/intl/zh-TW/edu/parallel/tools/hadoopvm/
Setting up a Hadoop cluster can be an all-day job. A virtual machine image has been created with a preconfigured single-node instance of Hadoop. A virtual machine encapsulates one operating system within another (http://developer.yahoo.com/hadoop/tutorial/module3.html). While this doesn't have the power of a full cluster, it does allow you to use the resources on your local machine to explore the Hadoop platform. The virtual machine image is designed to be used with the free VMware Player. Hadoop can be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

Setting Up the Image
The image is packaged as a directory archive. To begin setup, deflate the image in the directory of your choice (you need at least 10 GB; the disk image can grow to 20 GB). The VMware image package contains:
- image.vmx -- the VMware guest OS profile, a configuration file that describes the virtual machine characteristics (virtual CPU(s), amount of memory, etc.).
- 20GB.vmdk -- a VMware virtual disk used to store the contents of the virtual machine's hard disk; this file grows as you store data on the virtual image and is configured to store up to 20 GB.
The archive contains two other files, image.vmsd and nvram; these are not critical for running the image but are created by the VMware Player on startup. As you run the virtual machine, log files (vmware-x.log) will be created.
The system image is based on Ubuntu (version 7.04) and contains a Java machine (Sun JRE 6, DLJ License v1.1) and the latest Hadoop distribution (0.13.0).
A new window will appear and print a message indicating the IP address allocated to the guest OS. This is the IP address you will use to submit jobs from the command line or from the Eclipse environment. The guest OS contains a running Hadoop infrastructure configured with:
- an HDFS (GFS-style) infrastructure using a single data node (no replication)
- a single MapReduce worker

The guest OS can be reached from the provided console or via SSH using the IP address indicated above. Log into the guest OS with:
- guest login: guest, guest password: guest
- administrator login: root, administrator password: root
Once the image is loaded, you can log in with the guest account. Hadoop is installed in the guest home directory (/home/guest/hadoop). Three scripts are provided for Hadoop maintenance purposes:
- start-hadoop -- starts the file system and MapReduce daemons.
- stop-hadoop -- stops all Hadoop daemons.
- reset-hadoop -- restarts a new Hadoop environment with an entirely empty file system.

Hadoop 0.20 Install
Outline: introduction; installation environment; installing Hadoop (downloading and updating packages, installing Java, setting up SSH, installing Hadoop); Hadoop example tests.

Introduction
Ubuntu is a free operating system, a GNU/Linux distribution derived from Debian. It is maintained by Canonical Ltd., the company founded by Mark Shuttleworth, the African entrepreneur who was the first to pay his own way into space. The first version (Ubuntu 4.10 Warty Warthog) was released in 2004 and was an immediate hit; from 2005 until now it has been the most popular GNU/Linux distribution, and the latest version is Ubuntu 9.04. Developing for Hadoop involves a good deal of object-oriented syntax, including inheritance and interface classes, and requires importing the correct classpath.

Ubuntu Operating System requirements
Minimum: a 300 MHz x86 processor; 64 MB of RAM (the LiveCD needs 256 MB of RAM to run); 4 GB of disk space (including the swap partition); a graphics chip capable of 640x480 VGA; a CD-ROM drive or network card.
Recommended: a 700 MHz x86 processor; 384 MB of RAM; 8 GB of disk space (including the swap partition); a graphics chip capable of 1024x768; a sound card; an Internet connection.

Installation environment
- Live CD Ubuntu 9.04
- sun-java-6
- hadoop 0.20.0
Directory layout:
- user: cfong
- user's home directory: /home/cfong
- project directory: /home/cfong/workspace
- hadoop directory: /opt/hadoop

Install Hadoop
There are several ways to install: Hadoop can be run under VMware Player, or installed on an Ubuntu or CentOS operating system. Here it is installed on the Live CD Ubuntu 9.04 operating system. During installation, keep track of the directory into which each package is installed.

Update packages
$ sudo -i                # switch to the super user
$ sudo apt-get update    # update the package lists
$ sudo apt-get upgrade   # upgrade all installed packages

Download packages
Download hadoop-0.20.0.tar.gz and place it under /opt/:
http://apache.cdpa.nsysu.edu.tw/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz
Download the Java SE Development Kit (JDK), JDK 6 Update 14 (jdk-6u10-docs.zip), and place it under /tmp/:
https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/ViewProductDetailStart?ProductRef=jdk-6u10-docs-oth-JPR@CDS-CDS_Developer

Install Java
Install the basic Java packages:
$ sudo apt-get install java-common sun-java6-bin sun-java6-jdk
Install sun-java6-doc:
$ sudo apt-get install sun-java6-doc

SSH setup
$ apt-get install ssh
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

Install Hadoop
$ cd /opt
$ sudo tar -zxvf hadoop-0.20.0.tar.gz
$ sudo chown -R cfong:cfong /opt/hadoop-0.20.0
$ sudo ln -sf /opt/hadoop-0.20.0 /opt/hadoop

Environment variable setup
nano /opt/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:/opt/hadoop/bin

nano /opt/hadoop/conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop/hadoop-${user.name}</value>
  </property>
</configuration>

nano /opt/hadoop/conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
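Before continuing with the MapReduce configuration below, the two files above can be sanity-checked from Java. This is a minimal sketch, not part of the original installation steps; the class name ConfCheck is illustrative, and it assumes /opt/hadoop/conf is on the classpath so that the site files can be found.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ConfCheck {
      public static void main(String[] args) throws Exception {
        // core-site.xml is loaded automatically from the classpath;
        // hdfs-site.xml is added explicitly here to be safe.
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");

        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));

        // Connects to the filesystem named by fs.default.name (hdfs://localhost:9000 above).
        FileSystem fs = FileSystem.get(conf);
        System.out.println("connected to " + fs.getUri());
        fs.close();
      }
    }

If the printed URI is the local file system (file:///) rather than hdfs://localhost:9000, the conf directory is not on the classpath.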
Environment variable setup
nano /opt/hadoop/conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Starting Hadoop
$ cd /opt/hadoop
$ source /opt/hadoop/conf/hadoop-env.sh
$ hadoop namenode -format
$ start-all.sh
$ hadoop fs -put conf input
$ hadoop fs -ls

Hadoop Examples
Example 1:
$ cd /opt/hadoop
$ bin/hadoop version
Hadoop 0.20.0
Subversion https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504
Compiled by ndaley on Thu Apr 9 05:18:40 UTC 2009
Compiled by hadoopqa on Thu May 15 07:22:55 UTC 2008

Example 2:
$ /opt/hadoop/bin/hadoop jar hadoop-0.20.0-examples.jar pi 4 10000
Number of Maps = 4
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
09/08/01 06:56:41 INFO mapred.FileInputFormat: Total input paths to process : 4
09/08/01 06:56:42 INFO mapred.JobClient: Running job: job_200908010505_0002
09/08/01 06:56:43 INFO mapred.JobClient: map 0% reduce 0%
09/08/01 06:56:53 INFO mapred.JobClient: map 50% reduce 0%
09/08/01 06:56:56 INFO mapred.JobClient: map 100% reduce 0%
09/08/01 06:57:05 INFO mapred.JobClient: map 100% reduce 100%
09/08/01 06:57:07 INFO mapred.JobClient: Job complete: job_200908010505_0002
09/08/01 06:57:07 INFO mapred.JobClient: Counters: 18
09/08/01 06:57:07 INFO mapred.JobClient: Job Counters
09/08/01 06:57:07 INFO mapred.JobClient: Launched reduce tasks=1
09/08/01 06:57:07 INFO mapred.JobClient: Launched map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient: Data-local map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient: FileSystemCounters
09/08/01 06:57:07 INFO mapred.JobClient: FILE_BYTES_READ=94
09/08/01 06:57:07 INFO mapred.JobClient: HDFS_BYTES_READ=472
09/08/01 06:57:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=334
09/08/01 06:57:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
09/08/01 06:57:07 INFO mapred.JobClient: Map-Reduce Framework
09/08/01 06:57:07 INFO mapred.JobClient: Reduce input groups=8
09/08/01 06:57:07 INFO mapred.JobClient: Combine output records=0
09/08/01 06:57:07 INFO mapred.JobClient: Map input records=4
09/08/01 06:57:07 INFO mapred.JobClient: Reduce shuffle bytes=112
09/08/01 06:57:07 INFO mapred.JobClient: Reduce output records=0
09/08/01 06:57:07 INFO mapred.JobClient: Spilled Records=16
09/08/01 06:57:07 INFO mapred.JobClient: Map output bytes=72
09/08/01 06:57:07 INFO mapred.JobClient: Map input bytes=96
09/08/01 06:57:07 INFO mapred.JobClient: Combine input records=0
09/08/01 06:57:07 INFO mapred.JobClient: Map output records=8
09/08/01 06:57:07 INFO mapred.JobClient: Reduce input records=8
Job Finished in 25.84 seconds
Estimated value of Pi is 3.14140000000000000000

Example 3:
$ /opt/hadoop/bin/start-all.sh
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-cfong-desktop.out
localhost: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-cfong-desktop.out
starting jobtracker, logging to /opt/hadoop/logs/hadoop-root-jobtracker-cfong-desktop.out
localhost: starting tasktracker, logging to /opt/hadoop/logs/hadoop-root-tasktracker-cfong-desktop.out

Example 4:
$ /opt/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Example 5:
$ cd /opt/hadoop
$ jps
20911 JobTracker
20582 DataNode
27281 Jps
20792 SecondaryNameNode
21054 TaskTracker
20474 NameNode

Example 6:
$ sudo netstat -plten | grep java
tcp6  0  0  :::50145         :::*  LISTEN  0  141655  20792/java
tcp6  0  0  :::45538         :::*  LISTEN  0  142200  20911/java
tcp6  0  0  :::50020         :::*  LISTEN  0  143573  20582/java
tcp6  0  0  127.0.0.1:9000   :::*  LISTEN  0  139970  20474/java
tcp6  0  0  127.0.0.1:9001   :::*  LISTEN  0  142203  20911/java
tcp6  0  0  :::50090         :::*  LISTEN  0  143534  20792/java
tcp6  0  0  :::53866         :::*  LISTEN  0  140629  20582/java
tcp6  0  0  :::50060         :::*  LISTEN  0  143527  21054/java
tcp6  0  0  127.0.0.1:37870  :::*  LISTEN  0  143559  21054/java
tcp6  0  0  :::50030         :::*  LISTEN  0  143441  20911/java
tcp6  0  0  :::50070         :::*  LISTEN  0  143141  20474/java
tcp6  0  0  :::50010         :::*  LISTEN  0  143336  20582/java
tcp6  0  0  :::50075         :::*  LISTEN  0  143536  20582/java
tcp6  0  0  :::50397         :::*  LISTEN  0  139967  20474/java
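As a complement to jps and netstat, the running cluster can also be queried programmatically. This is a minimal sketch, not from the original slides; the class name ClusterCheck is illustrative, and it assumes /opt/hadoop/conf (with the mapred-site.xml shown earlier) is on the classpath so that mapred.job.tracker points at localhost:9001.

    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ClusterCheck {
      public static void main(String[] args) throws Exception {
        // JobConf picks up mapred.job.tracker from mapred-site.xml on the classpath.
        JobClient client = new JobClient(new JobConf());
        ClusterStatus status = client.getClusterStatus();
        System.out.println("task trackers      : " + status.getTaskTrackers());
        System.out.println("max map task slots : " + status.getMaxMapTasks());
        System.out.println("max reduce slots   : " + status.getMaxReduceTasks());
      }
    }

A single-node setup like the one above should report one task tracker, matching the single TaskTracker process listed by jps.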