Massive Data Processing / Cloud Computing
Lecture 1: Introduction to MapReduce
闫宏飞, School of Electronics Engineering and Computer Science, Peking University
7/9/2013
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin, University of Maryland, SEWMGroup
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

What is this course about?
• Data-intensive information processing
• Large-data (“web-scale”) problems
• Focus on MapReduce programming
• An entry-level course

What is MapReduce?
• A programming model for expressing distributed computations at massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop

Why Large Data?

How much data?
• Google processes 20 PB a day (2008)
• The Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s LHC will generate 15 PB a year
“640K ought to be enough for anybody.”

Happening everywhere!
• Molecular biology (cancer): microarray chips
• Network traffic (spam): fiber optics, microprocessors (300M/day)
• Simulations (Millennium): 1B particles
• Particle events (LHC): particle colliders (1M/sec)

Photos: Maximilien Brice, © CERN

No data like more data!
s/knowledge/data/g;
How do we get here if we’re not Google?
(Banko and Brill, ACL 2001; Brants et al., EMNLP 2007)

Example: information extraction
• Answering factoid questions: pattern matching on the Web works amazingly well
Who shot Abraham Lincoln?
X shot Abraham Lincoln
• Learning relations: start with seed instances, search for patterns on the Web, then use the patterns to find more instances
Wolfgang Amadeus Mozart (1756 - 1791) → Birthday-of(Mozart, 1756)
Einstein was born in 1879 → Birthday-of(Einstein, 1879)
Patterns learned: “PERSON (DATE –”, “PERSON was born in DATE”
(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; …)

Example: Scene Completion
Hays and Efros (CMU), “Scene Completion Using Millions of Photographs,” SIGGRAPH 2007
• Image database grouped by semantic content: 30 different Flickr.com groups, 2.3M images total (396 GB)
• Select candidate images most suitable for filling the hole: classify images with the gist scene detector [Torralba], color similarity, local context matching
• Computation: index images offline; 50 min. scene matching, 20 min. local matching, 4 min. compositing; reduced to 5 minutes total by using 5 machines
• Extension: Flickr.com has over 500 million images …

More Data More Gains?
• CNNIC statistics on Internet development in China: as of the end of June 2010, China had 420 million Internet users, with penetration up to 31.8%. Mobile Internet users were the main driver of overall growth, adding 43.34 million in half a year to reach 277 million, an 18.6% increase. Notably, commercial use of the Internet grew rapidly: online shoppers nationwide reached 140 million, and half-year user growth rates for online payment, online shopping, and online banking were all around 30%, far exceeding other categories of applications.

2009 statistics for China’s press and publishing industry
• In 2009, 238,868 book titles were published (145,475 first editions; 93,393 reprints and new editions), with a total print run of 3.788 billion copies and 31.246 billion printed sheets, equivalent to 734,000 tons of paper (including 141 million sheets, or 3,300 tons, for appendices). The total list price was 56.727 billion yuan (including 473 million yuan for appendices). Compared with the previous year, titles grew 8.86% (first editions 11.24%; reprints and new editions 5.36%), total copies grew 4.53%, printed sheets grew 4.61%, and total list price grew 8.94%.

Did you know?
• “We are currently preparing our students for jobs that don’t yet exist …”
• “It is estimated that a week’s worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century.”
• “The amount of new technical information is doubling every 2 years.”
• “So what does IT ALL MEAN?”
• “We are living in exponential times.”

Two Different Views
• Jennifer Widom, a “thrower-awayer”: “Throwing things away and paying the cost of recovering them when needed is far cheaper than maintaining them.”
• Gordon Bell, MyLifeBits: “trying to live an efficient life so that one has time to work and be with one’s family”

Information Overloading
• One reason we fail to apply what we learn is information overload:
– Of information we encounter only once, we usually remember just a small fraction.
– We should learn fewer things in depth rather than many things shallowly.
– The key to mastering something is spaced repetition.
– Once people truly and thoroughly master their work, they become more creative, even capable of working wonders.

What is Cloud Computing?
The best thing since sliced bread?
• Before clouds: grids, vector supercomputers, …
• Cloud computing means many different things: large-data processing, a rebranding of Web 2.0, utility computing, everything as a service

Rebranding of Web 2.0
• Rich, interactive web applications; clouds refer to the servers that run them; AJAX as the de facto standard (for better or worse). Examples: Facebook, YouTube, Gmail, …
• “The network is the computer,” take two: user data is stored “in the clouds”; rise of the netbook, smartphones, etc.; the browser is the OS

Source: Wikipedia (Electricity meter)

Utility Computing
• What? Computing resources as a metered service (“pay as you go”); ability to dynamically provision virtual machines
• Why? Cost: capital vs. operating expenses; scalability: “infinite” capacity; elasticity: scale up or down on demand
• Does it make sense? Benefits to cloud users; business case for cloud providers
“I think there is a world market for about five computers.”

Everything as a Service
• Utility computing = Infrastructure as a Service (IaaS): why buy machines when you can rent cycles?
Examples: Amazon EC2, Rackspace
• Platform as a Service (PaaS): give me a nice API and take care of the maintenance, upgrades, … Example: Google App Engine
• Software as a Service (SaaS): just run it for me! Examples: Gmail, Salesforce

Utility Computing
• “Pay as you go” is like plugging into a wall socket: you get the same voltage as Microsoft does, you just use less and pay less. The goal of utility computing is to offer computing resources with the same service model: a user can draw on the computing resources of a Fortune 500 company, but use less and pay less. This is an important aspect of cloud computing.

Platform as a Service (PaaS)
• For developing web applications and services, PaaS provides a complete Internet-based integrated environment covering development, testing, deployment, operation, and maintenance. In particular, it offers a multi-tenant architecture from the start: users need not handle concurrent multi-user access themselves; the platform takes care of it, including concurrency management, scalability, failure recovery, and security.

Software as a Service (SaaS)
• A model of software deployment whereby a provider licenses an application to customers for use as a service on demand.

Who cares?
• Ready-made large-data problems: lots of user-generated content, and even more user-behavior data. Examples: Facebook friend suggestions, Google ad placement. Business intelligence: gather everything in a data warehouse and run analytics to generate insight.
• Utility computing: provision Hadoop clusters on demand in the cloud; lower barrier to entry for tackling large-data problems; commoditization and democratization of large-data capabilities

Story around Hadoop

Google-IBM Cloud Computing Initiative
• In early October 2007, Google and IBM signed an agreement with six universities to provide courses and support services for developing software on large distributed computing systems, helping students and researchers gain experience building web-scale applications. The program centers on teaching the MapReduce algorithm and the Hadoop file system. The two companies will each invest US$20–25 million to provide the computer hardware, software, and related services needed by professors and students doing computer science research.

Cloud Computing Initiative
• Google and IBM teamed up on a cloud computing initiative for universities (2007)
– Providing several hundred computers, with access through the Internet to test parallel programming projects
• The idea for the program came from Google senior software engineer Christophe Bisciglia
– Google Code University

The Information Factories
• Googleplex (pre-2008)
– Servers number 450,000, according to the lowest estimate
– 200 petabytes of hard disk storage
– Four petabytes of RAM
– To handle the current load of 100 million queries a day, the
input-output bandwidth must be in the neighborhood of 3 petabits per second.

Google Infrastructure
• 2003: “The Google File System,” in SOSP ’03. Bolton Landing, NY, USA: ACM Press, 2003.
• 2004: “MapReduce: Simplified Data Processing on Large Clusters,” in OSDI ’04.
• 2006: “Bigtable: A Distributed Storage System for Structured Data” (awarded Best Paper), in OSDI ’06.
• …

Hadoop Project
Doug Cutting

History of Hadoop
• 2004 - Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella
• December 2005 - Nutch ported to the new framework; Hadoop runs reliably on 20 nodes
• January 2006 - Doug Cutting joins Yahoo!
• February 2006 - Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS
• March 2006 - Formation of the Yahoo! Hadoop team
• April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
• May 2006 - Yahoo! sets up a Hadoop research cluster (300 nodes)
• May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
• October 2006 - Research cluster reaches 600 nodes
• December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
• January 2007 - Research cluster reaches 900 nodes
• April 2007 - Research clusters: 2 clusters of 1,000 nodes
• April 2008 - Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes
• October 2008 - Loading 10 terabytes of data per day onto research clusters
• March 2009 - 17 clusters with a total of 24,000 nodes
• April 2009 - Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes)

Google Code University
• 2008 seminar: Mass Data Processing Technology on Large Scale Clusters, Tsinghua University (Aaron Kimball)

Startup: Cloudera
• Cloudera is pushing a commercial distribution of Hadoop
Mike Olson, Christophe Bisciglia, Doug Cutting, Aaron Kimball, Tom White

Course Administrivia

Textbooks
• [Tom] Tom White, Hadoop: The Definitive Guide, 3rd edition, O’Reilly, 2012.5.
• [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2013.1.

Schedule (tentative and subject to change without notice)
• 7/9 - Introduction to MapReduce (ppt): Why large data? Cloud computing; story about Hadoop. Reading: [Lin] Ch1: Introduction; [Tom] Ch1: Meet Hadoop
• 7/11 - MapReduce Basics (ppt): How do we scale up? MapReduce; HDFS. Reading: [Lin] Ch2: MapReduce Basics; [Tom] Ch6: How MapReduce Works; [GFS & MapReduce papers]
• 7/16 - MapReduce Algorithm Design (ppt): MapReduce program development; basic MapReduce algorithm design and design patterns. Reading: [Tom] Ch5: Developing a MapReduce Application; [Lin] Ch3: MapReduce Algorithm Design
• 7/18 - Text Retrieval (ppt): Introduction to information retrieval; inverted index on MapReduce; retrieval problems. Reading: [Lin] Ch4: Inverted Indexing for Text Retrieval
• 7/23 - Graph Algorithms (ppt): Graph algorithms and MapReduce; parallel breadth-first search; PageRank. Reading: [Lin] Ch5: Graph Algorithms
• 7/25 - Flat Clustering / Canopy Clustering (ppt): What is clustering; applications of clustering in information retrieval; the K-means algorithm; evaluation of clustering; how many clusters; canopy clustering. Reading: [IIR] Ch16

Recap
• Why large data?
• Cloud computing
• Story about Hadoop
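The MapReduce model recapped above (a programming model plus an execution framework) can be illustrated with a minimal single-process sketch of the canonical word-count job. This is plain Python standing in for a real Hadoop job; the function names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative, not Hadoop APIs, and the shuffle here is just an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map: emit an intermediate (key, value) pair, here (word, 1),
    # for every word in one input record.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: combine all values that share one key.
    return (word, sum(counts))

def run_mapreduce(documents):
    # Map phase over all input records.
    pairs = [kv for doc in documents for kv in map_fn(doc)]
    # Shuffle/sort phase: group intermediate pairs by key; in a real
    # framework this is done in parallel across the cluster.
    pairs.sort(key=itemgetter(0))
    return dict(
        reduce_fn(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

result = run_mapreduce(["the quick brown fox", "the lazy dog"])
# e.g. result["the"] == 2
```

The point of the model is that `map_fn` and `reduce_fn` contain all the application logic, while the framework owns partitioning, scheduling, and fault tolerance, which is what lets the same two functions scale from one machine to thousands.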