+ 100062108 李智宇、 100062116 林威宏、 100062220 施閔耀

+ Outline
Introduction
Architecture of Hadoop
HDFS
MapReduce
Comparison
Why Hadoop
Conclusion

+ What is Hadoop?
Open-source software framework
Processes and stores big data
Easy to use and implement, economical, flexible
Runs on lots of nodes (servers)
Written in Java
Free license
Created by Doug Cutting and Mike Cafarella in 2005

+ Advantages of Interpreted Language
Cross-platform (e.g. Windows, Ubuntu, Mac OS X)
Smaller executable program size
Easier to modify during both development and execution

+ Architecture of Hadoop

+ Hadoop in Enterprise
The Dell representation of the Hadoop ecosystem.

+ Who is using Hadoop?
By 2013, more than half of the Fortune 50 used Hadoop.

+ HDFS
Hadoop Distributed File System
Client: the user
Name node: manages and stores metadata and the namespace of files
Data node: stores files
Each data node periodically sends its status to the name node.

+ HDFS: Writing data in HDFS
Each file is divided into blocks (64 MB or 128 MB in size), and each block has three copies on different data nodes.
The client asks the name node for a list of data nodes sorted by distance and sends the file to the nearest one; that data node then forwards the file to the remaining nodes.
When the operation above is done, the data node reports "done" to the name node.

+ HDFS: Reading data in HDFS
The client sends the filename to the name node, which returns a list of the file's blocks sorted by distance.
The client uses the list to fetch the file from the data nodes.
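The write and read paths on the two HDFS slides above can be sketched as a toy in-memory model. This is only an illustration, not the Hadoop API: the class names `DataNode` and `NameNode`, the integer `distance` field, and the functions `write_file` and `read_file` are all hypothetical. The sketch also simplifies replica placement by always using the three nearest data nodes, whereas a real cluster chooses replicas with more care.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, one of the block sizes named on the slide
REPLICAS = 3                   # three copies of each block


@dataclass
class DataNode:
    name: str
    distance: int  # hypothetical network distance from the client
    blocks: dict = field(default_factory=dict)  # block id -> bytes


@dataclass
class NameNode:
    data_nodes: list
    # filename -> list of (block id, names of the data nodes holding it)
    block_map: dict = field(default_factory=dict)

    def nodes_by_distance(self):
        return sorted(self.data_nodes, key=lambda node: node.distance)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def write_file(name_node, filename, data, block_size=BLOCK_SIZE):
    placements = []
    for index, block in enumerate(split_into_blocks(data, block_size)):
        block_id = f"{filename}#{index}"
        # The client asks the name node for data nodes sorted by distance.
        targets = name_node.nodes_by_distance()[:REPLICAS]
        # The nearest node receives the block and forwards it to the rest;
        # the forwarding chain is modelled here as a simple loop.
        for node in targets:
            node.blocks[block_id] = block
        placements.append((block_id, [node.name for node in targets]))
    # Stands in for the "done" report to the name node.
    name_node.block_map[filename] = placements


def read_file(name_node, filename):
    by_name = {node.name: node for node in name_node.data_nodes}
    parts = []
    for block_id, holders in name_node.block_map[filename]:
        # Fetch each block from the nearest data node that holds it.
        nearest = min((by_name[h] for h in holders), key=lambda n: n.distance)
        parts.append(nearest.blocks[block_id])
    return b"".join(parts)
```

With four data nodes and a small `block_size`, `write_file` leaves every block on the three nearest nodes, and `read_file` reassembles the original bytes from the nearest replicas.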
+ HDFS: failure
Node failure
Communication failure
Data corruption

+ HDFS: handle failure
Handling a write failure: the name node skips any data node that does not return an ACK.
Handling a read failure: recall that when reading a file, the client gets a list of the data nodes that contain the file, so it can turn to another node on the list.
Handling a node failure: the name node works out which data the failed node held, copies that data from the other data nodes that hold it, and restores it to another data node.
Note that HDFS cannot guarantee that at least one copy of the data is alive.

+ MapReduce
Similar to divide-and-conquer
First, use "Map" to divide tasks.
Second, use "Shuffle" to "transfer the data from the mapper nodes to a reducer's node and decompress if needed."
Third, use "Reduce" to "execute the user-defined reduce function to produce the final output data."

+ MapReduce-Map
[Figure 2: Execution of a map task]

+ MapReduce-shuffle
[Figure 1: Execution of a MapReduce job.]

+ MapReduce-Reduce

+ Comparison

+ Why Hadoop? Technically
[Figure: Comparison of Grep task results with Vertica and DBMS-X]
Simple structure vs. optimization
Transaction time not minimized
Lower performance with the same number of nodes
No compelling reason to choose Hadoop
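The Grep task named in the technical comparison can be expressed with the Map, Shuffle, and Reduce phases described earlier. The following is a minimal single-process sketch, not Hadoop code: the function names `map_grep`, `shuffle`, `reduce_count`, and `run_grep` are hypothetical, and counting matching lines per file is an assumed variant of the task chosen so that the Reduce step has something to do. In a real job the map and reduce calls would run on different nodes.

```python
from collections import defaultdict


def map_grep(filename, line, pattern):
    """Map: emit (filename, 1) for every line that contains the pattern."""
    if pattern in line:
        yield filename, 1


def shuffle(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(key, values):
    """Reduce: sum the match counts for one key."""
    return key, sum(values)


def run_grep(records, pattern):
    """Run the three phases in order over (filename, line) records."""
    intermediate = []
    for filename, line in records:
        intermediate.extend(map_grep(filename, line, pattern))
    grouped = shuffle(intermediate)
    return dict(reduce_count(key, values) for key, values in grouped.items())
```

Calling `run_grep` on a list of `(filename, line)` pairs returns a dictionary that maps each file with at least one match to its number of matching lines.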
+ Why Hadoop? Commercially
Cheap (buy more servers to beat a DBMS)
Flexible (both in design and deployment)
  Easier to design
  Easier to scale up
Can be combined with other systems to achieve better performance

+ Conclusion
Hadoop is much easier for users to implement and more economical.
MapReduce advocates should study the techniques used in parallel DBMSs.
Hybrid systems are also popular.
As its performance improves, we believe Hadoop will lead the trend in big data computing.

+ Reference
http://hadoop.apache.org/
http://www.runpc.com.tw/content/cloud_content.aspx?id=105318
http://en.wikipedia.org/wiki/Apache_Hadoo
https://www.facebookbrand.com/
http://assets.fontsinuse.com/static/use-media-items/15/14246/full2048x768/522903b7/Yahoo_Logo.png
http://wiki.apache.org/hadoop/PoweredBy
http://semiaccurate.com/assets/uploads/2011/09/Amazon-logo.jpg
http://www.conceptcupboard.com/blog/wpcontent/uploads/2013/09/google.jpg
http://datashieldcorp.com/files/2013/11/adobe-LOGO-2.jpg
http://upload.wikimedia.org/wikipedia/commons/7/77/The_New_York_Times_logo.png
http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf
http://hadoop.intel.com/pdfs/IntelDistributionReferenceArchitecture.pdf
http://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDQQFjAB&url=http%3A%2F%2Fwww.classcloud.org%2Fcloud%2Frawattachment%2Fwiki%2FHinet100402%2F02.HadoopOverview.pdf&ei=IE2XUtLfBMfxiAea_oHQCA&usg=AFQjCNFoIXxLJrOnoul4cKJpQ8v3_kuTYg
http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Hadoop-Deployment-Comparison-Study.pdf
https://www.google.com.tw/url?sa=t&rct=j&q&esrc=s&source=web&cd=1&ved=0CCkQFjAA&url=http%3A%2F%2Fwww.psgtech.edu%2Fyrgcc%2Fattach%2FMAP%2520REDUCE%2520PROGRAMMING.ppt&ei=7lGXUtvCJsy5iAfWtYH4Bw&usg=AFQjCNGWRKJLaltvbvORULZV6_Te2y74g&sig2=Ba77ihsV1SEqcNeEFkRzfg
https://www.cs.duke.edu/starfish/files/hadoop-models.pdf
http://dotnetmis91.blogspot.tw/2010/04/hdfs-hadoopmapreduce.html
http://wiki.apache.org/hadoop/HDFS
http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html
http://en.wikipedia.org/wiki/Interpreted_language
A Comparison of Approaches to Large-Scale Data Analysis by Sam Madden
http://www.cc.ntu.edu.tw/chinese/epaper/0011/20091220_1106.htm
http://web.cs.wpi.edu/~cs561/s12/Lectures/6/Hadoop.pdf
http://www.mobilemartin.com/mobile/show-me-the-mobilemoney.jpg