Parallel Collection of Live Data Using
Hadoop
Presenter: 葉瑞群
Date: 2012/06/04
Source: IEEE Transactions on Knowledge and Data Engineering
Outline
• 1. INTRODUCTION
• 2. BACKGROUND
• 3. IMPLEMENTATION
• 4. EXPERIMENTS AND RESULTS
• 5. CONCLUSIONS AND FUTURE WORK
1. INTRODUCTION(1/3)
• Hadoop is a software framework written in Java. It was
originally built by Doug Cutting, a former Yahoo! developer, to
support distribution for the Nutch search engine project.
• It was designed for running applications that process very
large data sets (>>1 TB) on clusters built from commodity
hardware. Hadoop is a distributed system that relies on
sort/merge processing techniques.
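The sort/merge style can be made concrete with the canonical word-count example, given here as a minimal sketch (not taken from the paper): mappers emit (word, 1) pairs, the framework sorts and merges them by key between the map and reduce phases, and reducers sum each merged group.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal word-count sketch illustrating Hadoop's sort/merge processing:
// map emits (word, 1); the framework sorts/merges by key; reduce sums.
public class WordCount {
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        ctx.write(new Text(tok.nextToken()), ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      ctx.write(word, new IntWritable(sum));
    }
  }
}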
1. INTRODUCTION(2/3)
• Hadoop is already being used by a surprising array of
companies, including Yahoo!, Facebook and Twitter, helping
them to distribute their computing jobs effectively among
servers in their own data centers or on public computing
services operated by other companies such as Amazon.
1. INTRODUCTION(3/3)
• The aim of our research is to use Hadoop effectively for
collecting live data. We explain how combining Hadoop with
crawling techniques can maximize the efficiency of data
processing.
2. BACKGROUND(1/3)
2. BACKGROUND(2/3)
2. BACKGROUND(3/3)
• This subsection describes three selected applications in
which Hadoop can be applied to increase performance:
• 1. Domain Appraisal Tool (DAT)
• 2. OpenBet
• 3. Brute Force Cryptanalysis
3. IMPLEMENTATION(1/4)
• We should be able to locate domains that have been sold
within specific time intervals. For each of these domains, we
should be able to find the name, the sale price, the broker of
the sale, the date of the sale, the number of Google results,
the global ranking in Alexa, and finally the importance of the
domain name according to Google's PageRank service.
• Thus, we had to create several web crawlers and program
them to collect transaction prices from different websites
(e.g. namebio.com).
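The paper does not show the crawler code itself; the following is a minimal sketch of how such a crawl can be parallelized as a Hadoop map task, assuming the job input is a plain text file with one record-page URL per line (the class name SalesCrawlMapper and this input layout are our assumptions, not the authors'). Because Hadoop splits the input file across map tasks, the fetches run in parallel over the cluster.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical crawl mapper: each input line is the URL of a page listing
// domain-sale records (e.g. on namebio.com). Hadoop assigns a slice of the
// URL list to each map task, so pages are fetched in parallel.
public class SalesCrawlMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String url = line.toString().trim();
    if (url.isEmpty()) {
      return;
    }
    StringBuilder page = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
      String row;
      while ((row = in.readLine()) != null) {
        page.append(row).append('\n');
      }
    }
    // Emit the raw page keyed by URL; a reducer (not shown) would parse out
    // the domain name, sale price, broker and date from the HTML.
    ctx.write(new Text(url), new Text(page.toString()));
  }
}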
3. IMPLEMENTATION(2/4)
• The difficulty we had to deal with was the large amount of
money in betting exchange games evolving simultaneously.
Parallel crawling of the data using Hadoop proved to be
indispensable. After adopting Hadoop, the accuracy and the
performance of OpenBet's main prediction model improved
considerably.
3. IMPLEMENTATION(3/4)
• A dictionary attack cannot be applied here, and thousands of
days would be required to go through all the possible x-digit
plaintexts using only one CPU. However, since a brute force
attack can be easily parallelized, we used Hadoop to speed up
the recovery process.
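As an illustration of how such an attack parallelizes, here is a minimal sketch (our construction, not the authors' code) for a 6-digit numeric keyspace: each input line carries one leading digit, so the keyspace splits into ten independent map tasks, each enumerating the remaining five digits. The MD5 target hash below is a stand-in for whatever cipher the paper attacked.

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical brute-force mapper: each input line is one leading digit
// ("0".."9"); the mapper tries every 5-digit suffix, so the 6-digit
// keyspace is covered by ten independent, parallel map tasks.
public class BruteForceMapper extends Mapper<LongWritable, Text, Text, Text> {
  // Stand-in target: md5("123456").
  private static final String TARGET_HASH = "e10adc3949ba59abbe56e057f20f883e";

  @Override
  protected void map(LongWritable key, Text prefix, Context ctx)
      throws IOException, InterruptedException {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (int i = 0; i < 100000; i++) {
        String candidate = prefix.toString().trim() + String.format("%05d", i);
        byte[] digest = md5.digest(candidate.getBytes("UTF-8"));
        if (toHex(digest).equals(TARGET_HASH)) {
          // Report the recovered plaintext through the job output.
          ctx.write(new Text("recovered"), new Text(candidate));
        }
      }
    } catch (NoSuchAlgorithmException e) {
      throw new IOException(e);
    }
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}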
3. IMPLEMENTATION(4/4)
• For testing purposes, we created a cluster of 4 commodity
dual-core PCs. Our testing environment includes the following:
1. The cluster consists of 4 machines (8 CPU cores in total).
2. Each machine runs Ubuntu 9.10 Linux as its operating
system and Hadoop 0.20.1.
3. Machines have dual-core x86 processors at 1.8–2 GHz, with
1–2 GB of memory per machine.
4. Each machine has a 2 GB swap file and a 1 Gbit/s Ethernet
network connection.
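The slides do not show how jobs were submitted to this cluster; a minimal Hadoop 0.20-era driver along these lines would wire the earlier crawl mapper into a map-only job (the hostname master and the argument paths are placeholders; in practice the two addresses are kept in conf/core-site.xml and conf/mapred-site.xml on every node):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: submits the crawl job to a small fully distributed
// Hadoop 0.20.x cluster like the 4-machine one described above.
public class CrawlJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://master:9000");  // NameNode (placeholder host)
    conf.set("mapred.job.tracker", "master:9001");      // JobTracker (placeholder host)

    Job job = new Job(conf, "parallel-crawl");
    job.setJarByClass(CrawlJobDriver.class);
    job.setMapperClass(SalesCrawlMapper.class);
    job.setNumReduceTasks(0);                 // map-only: just fetch pages
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // file of URLs
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}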
4. EXPERIMENTS AND RESULTS(1/4)
4. EXPERIMENTS AND RESULTS(2/4)
4. EXPERIMENTS AND RESULTS(3/4)
4. EXPERIMENTS AND RESULTS(4/4)
5. CONCLUSIONS AND FUTURE WORK(1/1)
• In this paper we studied the applicability of Hadoop and
proposed three applications (Domain Appraisal Tool, OpenBet
and Brute Force Cryptanalysis). We also presented
experiments on the proposed implementations. As we
observed, Hadoop is very efficient when running in fully
distributed mode; however, in order to achieve optimal results
and take advantage of Hadoop's scalability, it is necessary to
use a large cluster of computers.
END