HADOOP - ShareCourse

+
100062108 李智宇、
100062116 林威宏、
100062220 施閔耀
+
Outline
• Introduction
• Architecture of Hadoop
• HDFS
• MapReduce
• Comparison
• Why Hadoop
• Conclusion
+
What is Hadoop?
• Open-source software framework
• Processes and stores big data
• Easy to use and implement, economical, flexible
• Runs on lots of nodes (servers)
• Written in Java
• Free license
• Created by Doug Cutting and Mike Cafarella in 2005
+
Advantages of Interpreted Language
• Cross-platform (e.g. Windows, Ubuntu, Mac OS X)
• Smaller executable program size
• Easier to modify during both development and execution
+
Architecture of Hadoop
+
Hadoop in Enterprise
The Dell representation of the Hadoop ecosystem.
+
Hadoop in Enterprise
+
Who is using Hadoop?
More than half of the Fortune 50 were using Hadoop by 2013.
+
HDFS
• Hadoop Distributed File System
• Client: the user
• Name node: manages and stores metadata and the namespace of files
• Data node: stores files
• Each data node sends its status to the name node periodically
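
The periodic status report can be pictured as a heartbeat table kept by the name node. The sketch below is a toy in-memory model and not Hadoop's actual API: the class name `NameNode`, its method names, and the 10-second timeout are all assumptions made for illustration.

```python
class NameNode:
    """Toy model of a name node tracking data-node status reports (hypothetical, not Hadoop code)."""

    def __init__(self, timeout_seconds=10.0):
        # timeout_seconds is an assumed value chosen for illustration only
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # data node id -> time of its last status report

    def receive_heartbeat(self, node_id, now):
        """Record that node_id reported its status at time `now` (in seconds)."""
        self.last_seen[node_id] = now

    def live_nodes(self, now):
        """Return ids of nodes whose last report is within the timeout."""
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )

    def dead_nodes(self, now):
        """Return ids of nodes that have been silent for longer than the timeout."""
        return sorted(set(self.last_seen) - set(self.live_nodes(now)))


name_node = NameNode(timeout_seconds=10.0)
name_node.receive_heartbeat("dn1", now=0.0)
name_node.receive_heartbeat("dn2", now=0.0)
name_node.receive_heartbeat("dn1", now=8.0)  # dn1 keeps reporting, dn2 goes silent
print(name_node.live_nodes(now=12.0))
print(name_node.dead_nodes(now=12.0))
```

A node that stops reporting simply ages out of the live set, which is what lets the name node notice a failure without being told about it.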
+
HDFS: Writing data in HDFS
• Each file is divided into blocks (64 MB or 128 MB each), and each block has three copies on different data nodes.
• The client asks the name node for a list of data nodes sorted by distance and sends the file to the nearest one; that data node then forwards the file to the remaining nodes.
• When the operation above is done, the data nodes send “done” to the name node.
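
The write path above can be sketched in a few lines. This is a simplified model under stated assumptions, not the real HDFS client: the function names `count_blocks` and `write_block`, the `(node_id, distance)` pairs, and the dictionary standing in for data-node storage are all hypothetical.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, one of the two block sizes named above
REPLICATION = 3                # three copies per block


def count_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file of file_size bytes."""
    return math.ceil(file_size / block_size)


def write_block(block_id, nodes_by_distance, storage, replication=REPLICATION):
    """Store block_id on the nearest nodes, passing it along node to node.

    nodes_by_distance: list of (node_id, distance) pairs, as a name node might return.
    storage: dict node_id -> set of block ids (stands in for the data nodes' disks).
    Returns the nodes that now hold the block, in forwarding order.
    """
    ordered = [node for node, _ in sorted(nodes_by_distance, key=lambda pair: pair[1])]
    pipeline = ordered[:replication]
    for node in pipeline:  # client -> nearest node -> next node -> ...
        storage.setdefault(node, set()).add(block_id)
    return pipeline


print(count_blocks(200 * 1024 * 1024))  # a 200 MB file needs four 64 MB blocks

storage = {}
pipeline = write_block(
    "blk_0", [("dn3", 4), ("dn1", 1), ("dn2", 2), ("dn4", 7)], storage
)
print(pipeline)
```

The last block of a file is usually only partly full, which is why the block count is rounded up.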
+
HDFS: Reading data in HDFS
• The client sends the filename to the name node, which returns a list of the file's blocks with their data nodes sorted by distance.
• The client uses the list to get the file from the data nodes.
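
As a rough illustration of the read path, the sketch below models the name node's lookup and the client's fetch with plain dictionaries. The names `locate_blocks`, `read_file`, `block_map`, and `data_nodes` are hypothetical and do not correspond to the Hadoop API.

```python
def locate_blocks(filename, block_map):
    """Name-node side: list each block of the file with its replica nodes, nearest first.

    block_map: dict filename -> list of (block_id, [(node_id, distance), ...]).
    """
    return [
        (block_id, [node for node, _ in sorted(replicas, key=lambda pair: pair[1])])
        for block_id, replicas in block_map[filename]
    ]


def read_file(filename, block_map, data_nodes):
    """Client side: fetch each block from its nearest data node and join the pieces.

    data_nodes: dict node_id -> dict block_id -> bytes.
    """
    pieces = []
    for block_id, nodes in locate_blocks(filename, block_map):
        nearest = nodes[0]
        pieces.append(data_nodes[nearest][block_id])
    return b"".join(pieces)


block_map = {
    "log.txt": [
        ("blk_0", [("dn2", 5), ("dn1", 1)]),
        ("blk_1", [("dn2", 2), ("dn3", 9)]),
    ]
}
data_nodes = {
    "dn1": {"blk_0": b"hello "},
    "dn2": {"blk_0": b"hello ", "blk_1": b"world"},
    "dn3": {"blk_1": b"world"},
}
print(read_file("log.txt", block_map, data_nodes))
```

Note that the name node only hands out locations; the file contents travel directly between the client and the data nodes.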
+
HDFS: failure
• Node failure
• Communication failure
• Data corruption
+
HDFS: handle failure
• Handling a write failure: the name node skips any data node that does not return an ACK.
• Handling a read failure: recall that when reading a file, the client gets a list of the data nodes containing the file, so it can turn to another node on the list if one fails.
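
Read failover amounts to walking the list of data nodes until one answers. The sketch below assumes a caller-supplied `fetch` function that raises `ConnectionError` for an unreachable node; that failure signal, like the names `read_block_with_failover` and `demo_fetch`, is an assumption for illustration.

```python
def read_block_with_failover(block_id, nodes, fetch):
    """Try each data node in order and return (node_id, data) from the first that answers.

    fetch(node_id, block_id) returns bytes, or raises ConnectionError when the
    node is unreachable (an assumed failure signal for this sketch).
    """
    for node_id in nodes:
        try:
            return node_id, fetch(node_id, block_id)
        except ConnectionError:
            continue  # this copy is unavailable, move on to the next node
    raise RuntimeError(f"no live copy of {block_id}")


def demo_fetch(node_id, block_id):
    """Stand-in for a network read: dn1 is down, every other node serves the block."""
    if node_id == "dn1":
        raise ConnectionError("dn1 unreachable")
    return b"payload"


served_by, data = read_block_with_failover("blk_0", ["dn1", "dn2", "dn3"], demo_fetch)
print(served_by, data)
```

If every node on the list fails, the sketch raises an error instead of returning partial data.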
+
HDFS: handle failure
• The name node handles a node failure: it finds out which data the failed node held, copies that data from other data nodes, and restores it on other data nodes.
• Note that HDFS can't guarantee that at least one copy of the data is alive.
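
The recovery step can be sketched as bookkeeping over a block-to-nodes map. This is a simplified model, not the name node's real algorithm: the function name `re_replicate`, its arguments, and the first-available placement rule are assumptions.

```python
def re_replicate(block_locations, failed_node, live_nodes, replication=3):
    """Return (updated block -> nodes map, lost blocks) after failed_node is lost.

    block_locations: dict block_id -> list of node ids holding a copy.
    live_nodes: nodes still available; it must not include failed_node.
    For each block the failed node held, a live node without a copy is chosen
    to receive one from a surviving copy. Blocks with no surviving copy are lost.
    """
    updated, lost = {}, []
    for block_id, nodes in block_locations.items():
        survivors = [n for n in nodes if n != failed_node]
        if not survivors:
            lost.append(block_id)  # every copy lived on the failed node
            continue
        candidates = [n for n in live_nodes if n not in survivors]
        while len(survivors) < replication and candidates:
            survivors.append(candidates.pop(0))
        updated[block_id] = survivors
    return updated, lost


locations = {
    "blk_0": ["dn1", "dn2", "dn3"],
    "blk_1": ["dn2", "dn3", "dn4"],
    "blk_2": ["dn1"],
}
updated, lost = re_replicate(locations, "dn1", ["dn2", "dn3", "dn4", "dn5"])
print(updated)
print(lost)
```

The `lost` list shows why no guarantee is possible: a block whose only copies sat on failed nodes has nothing left to copy from.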
+
MapReduce
• Similar to divide-and-conquer
• First, use “Map” to divide tasks
• Second, use “Shuffle” to “transfer the data from the mapper nodes to a reducer’s node and decompress if needed.”
• Third, use “Reduce” to “execute the user-defined reduce function to produce the final output data.”
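
The three phases can be illustrated with a word count, the usual introductory MapReduce example. The sketch below runs in a single process and does not use the Hadoop API; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample input are made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_outputs):
    """Shuffle: gather the values from all mappers and group them by key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: apply the user-defined function (here, sum) to each key's values."""
    return {key: sum(values) for key, values in grouped.items()}


splits = ["big data big cluster", "data node name node"]
mapped = [map_phase(split) for split in splits]  # one map task per split
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

In a real cluster each split would be mapped on a different node and the shuffle would move data over the network; here the list `mapped` plays that role.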
+
MapReduce-Map
Figure 2: Execution of a map task showing the ...
+
MapReduce-shuffle
Figure 1: Execution of a MapReduce job.
+
MapReduce-Reduce
+
MapReduce
+
Comparison
+
Comparison
+
Why Hadoop?
technically
Comparison of Grep Task Result with Vertica and DBMS-X
+
Why Hadoop?
technically
• Simple structure vs. optimization
• Transaction time not minimized
• Lower performance with the same number of nodes
• No compelling reason to choose Hadoop
+
Why Hadoop?
commercially
+
Why Hadoop?
commercially
• Cheap (buy more servers to beat a DBMS)
• Flexible (both in design and deployment)
• Easier to design
• Easier to scale up
• Can be combined with other systems to achieve better performance
+
Conclusion
• Hadoop is much easier for users to implement and more economical
• MapReduce advocates should study the techniques used in parallel DBMSs
• Hybrid systems are also popular
• With improved performance, we believe Hadoop will lead the trend of big data computing
+
Reference

http://hadoop.apache.org/

http://www.runpc.com.tw/content/cloud_content.aspx?id=105318

http://en.wikipedia.org/wiki/Apache_Hadoo

https://www.facebookbrand.com/

http://assets.fontsinuse.com/static/use-media-items/15/14246/full2048x768/522903b7/Yahoo_Logo.png

http://wiki.apache.org/hadoop/PoweredBy

http://semiaccurate.com/assets/uploads/2011/09/Amazon-logo.jpg

http://www.conceptcupboard.com/blog/wpcontent/uploads/2013/09/google.jpg
+
Reference

http://datashieldcorp.com/files/2013/11/adobe-LOGO-2.jpg

http://upload.wikimedia.org/wikipedia/commons/7/77/The_New_York_Times_logo.png

http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf

http://hadoop.intel.com/pdfs/IntelDistributionReferenceArchitecture.pdf

http://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDQQFjAB&url=http%3A%2F%2Fwww.classcloud.org%2Fcloud%2Frawattachment%2Fwiki%2FHinet100402%2F02.HadoopOverview.pdf&ei=IE2XUtLfBMfxiAea_oHQCA&usg=AFQjCNFoIXxLJrOnoul4cKJpQ8v3_kuTYg
+
Reference

http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Hadoop-Deployment-Comparison-Study.pdf

https://www.google.com.tw/url?sa=t&rct=j&q&esrc=s&source=web&cd=1&ved=0CCkQFjAA&url=http%3A%2F%2Fwww.psgtech.edu%2Fyrgcc%2Fattach%2FMAP%2520REDUCE%2520PROGRAMMING.ppt&ei=7lGXUtvCJsy5iAfWtYH4Bw&usg=AFQjCNGWRKJLaltvbvORULZV6_Te2y74g&sig2=Ba77ihsV1SEqcNeEFkRzfg

https://www.cs.duke.edu/starfish/files/hadoop-models.pdf

http://dotnetmis91.blogspot.tw/2010/04/hdfs-hadoopmapreduce.html

http://wiki.apache.org/hadoop/HDFS

http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html
+
Reference

http://en.wikipedia.org/wiki/Interpreted_language

A Comparison of Approaches to Large-Scale Data Analysis by Sam Madden

http://www.cc.ntu.edu.tw/chinese/epaper/0011/20091220_1106.htm

http://web.cs.wpi.edu/~cs561/s12/Lectures/6/Hadoop.pdf

http://www.mobilemartin.com/mobile/show-me-the-mobilemoney.jpg