Parallel Collection of Live Data Using Hadoop

Presenter: 葉瑞群
Date: 2012/06/04
Source: IEEE Transactions on Knowledge and Data Engineering

Outline
1. INTRODUCTION
2. BACKGROUND
3. IMPLEMENTATION
4. EXPERIMENTS AND RESULTS
5. CONCLUSIONS AND FUTURE WORK

1. INTRODUCTION

Hadoop is a software framework written in Java. It was originally built by Doug Cutting, who later worked at Yahoo!, to support distribution for the Nutch search engine project. It was designed for running applications over very large data sets (>> 1 TB) on large clusters of commodity hardware. Hadoop is a distributed system that relies on sort/merge processing techniques.

Hadoop is already used by a surprisingly wide array of companies, including Yahoo!, Facebook and Twitter, helping them distribute their computing jobs efficiently among servers in their own data centers or in public computing services operated by other companies such as Amazon.

The aim of our research is to use Hadoop effectively for collecting live data. We explain how combining Hadoop with crawling techniques can maximize the efficiency of data processing.

2. BACKGROUND

This section presents three selected applications in which Hadoop can be applied, leading to increased performance:
1. Domain Appraisal Tool (DAT)
2. OpenBet
3. Brute Force Cryptanalysis

3. IMPLEMENTATION

Domain Appraisal Tool (DAT). We should be able to locate domains that were sold within specific time intervals. For each of these domains, we should be able to find the name, the sale price, the broker of the sale, the date of the sale, the number of Google results, the global Alexa ranking, and finally the importance of the domain name according to Google's PageRank service. We therefore created several web crawlers and programmed them to collect transaction prices from different websites (e.g. namebio.com); a minimal crawler sketch is given after this section.

OpenBet. The difficulty we had to deal with was the large amount of money moving through betting exchange games that evolve simultaneously. Parallel crawling of the data using Hadoop proved indispensable: after adopting Hadoop, both the accuracy and the performance of OpenBet's main prediction model improved considerably.

Brute Force Cryptanalysis. A dictionary attack cannot be applied here; going through all possible x-digit plaintexts would require thousands of days using only one CPU. However, since a brute force attack can be easily parallelized, we used Hadoop to speed up the recovery process (see the mapper sketch below).

Testing environment. For testing purposes, we created a cluster of four commodity dual-core PCs:
1. The cluster provides 8 processing cores (4 dual-core CPUs).
2. Each machine runs Ubuntu 9.10 Linux as its operating system and Hadoop 0.20.1.
3. The machines use dual-core x86 processors at 1.8–2 GHz, with 1–2 GB of memory per machine.
4. Each machine has a 2 GB swap file and a 1 Gb Ethernet network connection.
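On dual-core nodes with 1–2 GB of memory, the number of concurrent tasks per node is set in each TaskTracker's mapred-site.xml. The following is a plausible, illustrative configuration for Hadoop 0.20.1, not the settings actually used in the experiments; the host name "master" and the exact slot counts are assumptions.

<?xml version="1.0"?>
<!-- mapred-site.xml: illustrative values for a dual-core node with 1-2 GB RAM -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>  <!-- "master" is a placeholder host name -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>  <!-- one map slot per core -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>  <!-- keep memory pressure low on small nodes -->
  </property>
</configuration>

The daemons read these files at start-up, so changing the slot counts requires restarting the TaskTrackers.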
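For readers unfamiliar with the map → sort/merge → reduce flow that Hadoop applies to collected data, the canonical word-count job is a minimal, self-contained example of a job written against the Hadoop 0.20 mapreduce API. This is the standard tutorial pattern, not code from the implementations above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: the framework has already sort/merged the pairs by key,
  // so each call receives one word with all of its counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // 0.20-era constructor
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is submitted with "hadoop jar wordcount.jar WordCount <input> <output>"; input and output are HDFS paths.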
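As a rough illustration of the crawler programming described for the DAT, the following single-page fetcher extracts (domain, price) pairs with a regular expression. This is a minimal sketch using only the Java standard library; the row pattern and the tab-separated output format are assumptions, since the actual markup of the crawled pages (e.g. on namebio.com) is not described here.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainSaleCrawler {

  // Hypothetical row layout: "<td>example.com</td><td>$1,500</td>".
  private static final Pattern SALE_ROW =
      Pattern.compile("<td>([\\w.-]+\\.[a-z]+)</td>\\s*<td>\\$([\\d,]+)</td>");

  public static void main(String[] args) throws Exception {
    // Fetch one listing page; the URL is passed on the command line.
    URLConnection conn = new URL(args[0]).openConnection();
    conn.setRequestProperty("User-Agent", "DAT-crawler-sketch");
    BufferedReader in =
        new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    StringBuilder page = new StringBuilder();
    for (String line; (line = in.readLine()) != null; ) {
      page.append(line).append('\n');
    }
    in.close();

    // Emit tab-separated (domain, price) records, ready to serve as
    // input for a subsequent MapReduce aggregation job.
    Matcher m = SALE_ROW.matcher(page);
    while (m.find()) {
      System.out.println(m.group(1) + "\t" + m.group(2).replace(",", ""));
    }
  }
}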
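The brute-force parallelization can be sketched as a map-only Hadoop job: the input is a list of candidate ranges (one "start end" line per split), and each mapper hashes every candidate in its range, emitting any match against a target digest read from the job configuration. MD5, digit-only plaintexts, and the configuration key "target.digest" are assumptions made for brevity, not details of the actual implementation.

import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BruteForceMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  public void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    // Each input line carries one search range: "start end".
    String[] bounds = value.toString().trim().split("\\s+");
    long start = Long.parseLong(bounds[0]);
    long end = Long.parseLong(bounds[1]);

    // "target.digest" is a hypothetical configuration key holding the
    // hex MD5 digest whose plaintext we want to recover.
    String target = context.getConfiguration().get("target.digest");

    MessageDigest md5;
    try {
      md5 = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IOException(e);
    }

    // Exhaustively hash every numeric candidate in this mapper's range.
    for (long candidate = start; candidate <= end; candidate++) {
      byte[] digest = md5.digest(Long.toString(candidate).getBytes("UTF-8"));
      String hex = String.format("%032x", new BigInteger(1, digest));
      if (hex.equals(target)) {
        context.write(new Text("FOUND"), new Text(Long.toString(candidate)));
      }
    }
  }
}

A driver would submit this as a map-only job by setting job.setNumReduceTasks(0), so each range is searched independently on a different core.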
4. EXPERIMENTS AND RESULTS

(The slides of this section presented the experimental results as figures.)

5. CONCLUSIONS AND FUTURE WORK

In this paper we studied the applicability of Hadoop and proposed three applications: the Domain Appraisal Tool, OpenBet, and Brute Force Cryptanalysis. We also presented experiments on the proposed implementations. As we observed, Hadoop is very efficient when running in fully distributed mode; however, to achieve optimal results and take advantage of Hadoop's scalability, it is necessary to use a large cluster of computers.

END