Massive Data Processing / Cloud Computing
Lecture 1: Introduction to MapReduce
闫宏飞, School of Electronics Engineering and Computer Science, Peking University
7/9/2013
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin, University of Maryland, SEWMGroup
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

What is this course about?
• Data-intensive information processing
• Large-data (“web-scale”) problems
• Focus on MapReduce programming
• An entry-level course

What is MapReduce?
• A programming model for expressing distributed computations at massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop

Why Large Data?

How much data?
• Google processes 20 PB a day (2008)
• The Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s LHC will generate 15 PB a year
“640K ought to be enough for anybody.”

Happening everywhere!
• Molecular biology (cancer): microarray chips
• Network traffic (spam): fiber optics, microprocessors (300M/day)
• Simulations (Millennium): 1B particles
• Particle events (LHC): particle colliders (1M/sec)

Photos: Maximilien Brice, © CERN

No data like more data!
s/knowledge/data/g;
How do we get here if we’re not Google?
(Banko and Brill, ACL 2001; Brants et al., EMNLP 2007)

Example: information extraction
• Answering factoid questions: pattern matching on the Web works amazingly well
Who shot Abraham Lincoln?
X shot Abraham Lincoln
• Learning relations: start with seed instances, search for patterns on the Web, then use the patterns to find more instances
Wolfgang Amadeus Mozart (1756 - 1791) → Birthday-of(Mozart, 1756)
Einstein was born in 1879 → Birthday-of(Einstein, 1879)
Patterns learned: “PERSON (DATE –”, “PERSON was born in DATE”
(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; …)

Example: Scene Completion
Hays and Efros (CMU), “Scene Completion Using Millions of Photographs,” SIGGRAPH 2007
• Image database grouped by semantic content: 30 different Flickr.com groups, 2.3M images total (396 GB)
• Select candidate images most suitable for filling the hole: classify images with the gist scene detector [Torralba], color similarity, local context matching
• Computation: index images offline; 50 min. scene matching, 20 min. local matching, 4 min. compositing; reduced to 5 minutes total by using 5 machines
• Extension: Flickr.com has over 500 million images …

More Data More Gains?
• CNNIC statistics on Internet development in China: as of the end of June 2010, China had 420 million Internet users, with penetration up to 31.8%. Mobile Internet users were the main driver of overall growth, adding 43.34 million in half a year to reach 277 million, an 18.6% increase. Notably, commercial use of the Internet grew rapidly: online shoppers nationwide reached 140 million, and half-year user growth rates for online payment, online shopping, and online banking were all around 30%, far exceeding other categories of applications.

2009 statistics for China’s press and publishing industry
• In 2009, 238,868 book titles were published (145,475 first editions; 93,393 reprints and new editions), with a total print run of 3.788 billion copies and 31.246 billion printed sheets, equivalent to 734,000 tons of paper (including 141 million sheets, or 3,300 tons, for appendices). The total list price was 56.727 billion yuan (including 473 million yuan for appendices). Compared with the previous year, titles grew 8.86% (first editions 11.24%; reprints and new editions 5.36%), total copies grew 4.53%, printed sheets grew 4.61%, and total list price grew 8.94%.

Did you know?
• “We are currently preparing our students for jobs that don’t yet exist …”
• “It is estimated that a week’s worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century.”
• “The amount of new technical information is doubling every 2 years.”
• “So what does IT ALL MEAN?”
• “We are living in exponential times.”

Two Different Views
• Jennifer Widom, a “thrower-awayer”: “Throwing things away and paying the cost of recovering them when needed is far cheaper than maintaining them.”
• Gordon Bell, MyLifeBits: “trying to live an efficient life so that one has time to work and be with one’s family”

Information Overloading
• One reason we fail to apply what we learn is information overload:
– Of information we encounter only once, we usually remember just a small fraction.
– We should learn fewer things in depth rather than many things shallowly.
– The key to mastering something is spaced repetition.
– Once people truly and thoroughly master their work, they become more creative, even capable of working wonders.

What is Cloud Computing?
The best thing since sliced bread?
• Before clouds: grids, vector supercomputers, …
• Cloud computing means many different things: large-data processing, a rebranding of Web 2.0, utility computing, everything as a service

Rebranding of Web 2.0
• Rich, interactive web applications; clouds refer to the servers that run them; AJAX as the de facto standard (for better or worse). Examples: Facebook, YouTube, Gmail, …
• “The network is the computer,” take two: user data is stored “in the clouds”; rise of the netbook, smartphones, etc.; the browser is the OS

Source: Wikipedia (Electricity meter)

Utility Computing
• What? Computing resources as a metered service (“pay as you go”); ability to dynamically provision virtual machines
• Why? Cost: capital vs. operating expenses; scalability: “infinite” capacity; elasticity: scale up or down on demand
• Does it make sense? Benefits to cloud users; business case for cloud providers
“I think there is a world market for about five computers.”

Everything as a Service
• Utility computing = Infrastructure as a Service (IaaS): why buy machines when you can rent cycles?
Examples: Amazon EC2, Rackspace
• Platform as a Service (PaaS): give me a nice API and take care of the maintenance, upgrades, … Example: Google App Engine
• Software as a Service (SaaS): just run it for me! Examples: Gmail, Salesforce

Utility Computing
• “Pay as you go” is like plugging into a wall socket: you get the same voltage as Microsoft does, you just use less and pay less. The goal of utility computing is to offer computing resources with the same service model: a user can draw on the computing resources of a Fortune 500 company, but use less and pay less. This is an important aspect of cloud computing.

Platform as a Service (PaaS)
• For developing web applications and services, PaaS provides a complete Internet-based integrated environment covering development, testing, deployment, operation, and maintenance. In particular, it offers a multi-tenant architecture from the start: users need not handle concurrent multi-user access themselves; the platform takes care of it, including concurrency management, scalability, failure recovery, and security.

Software as a Service (SaaS)
• A model of software deployment whereby a provider licenses an application to customers for use as a service on demand.

Who cares?
• Ready-made large-data problems: lots of user-generated content, and even more user-behavior data. Examples: Facebook friend suggestions, Google ad placement. Business intelligence: gather everything in a data warehouse and run analytics to generate insight.
• Utility computing: provision Hadoop clusters on demand in the cloud; lower barrier to entry for tackling large-data problems; commoditization and democratization of large-data capabilities

Story around Hadoop

Google-IBM Cloud Computing Initiative
• In early October 2007, Google and IBM signed an agreement with six universities to provide courses and support services for developing software on large distributed computing systems, helping students and researchers gain experience building web-scale applications. The program centers on teaching the MapReduce algorithm and the Hadoop file system. The two companies will each invest US$20–25 million to provide the computer hardware, software, and related services needed by professors and students doing computer science research.

Cloud Computing Initiative
• Google and IBM teamed up on a cloud computing initiative for universities (2007)
– Providing several hundred computers, with access through the Internet to test parallel programming projects
• The idea for the program came from Google senior software engineer Christophe Bisciglia
– Google Code University

The Information Factories
• Googleplex (pre-2008)
– Servers number 450,000, according to the lowest estimate
– 200 petabytes of hard disk storage
– Four petabytes of RAM
– To handle the current load of 100 million queries a day, the
input-output bandwidth must be in the neighborhood of 3 petabits per second.

Google Infrastructure
• 2003: “The Google File System,” in SOSP ’03. Bolton Landing, NY, USA: ACM Press, 2003.
• 2004: “MapReduce: Simplified Data Processing on Large Clusters,” in OSDI ’04.
• 2006: “Bigtable: A Distributed Storage System for Structured Data” (awarded Best Paper), in OSDI ’06.
• …

Hadoop Project
Doug Cutting

History of Hadoop
• 2004 - Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella
• December 2005 - Nutch ported to the new framework; Hadoop runs reliably on 20 nodes
• January 2006 - Doug Cutting joins Yahoo!
• February 2006 - Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS
• March 2006 - Formation of the Yahoo! Hadoop team
• April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
• May 2006 - Yahoo! sets up a Hadoop research cluster (300 nodes)
• May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
• October 2006 - Research cluster reaches 600 nodes
• December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
• January 2007 - Research cluster reaches 900 nodes
• April 2007 - Research clusters: 2 clusters of 1,000 nodes
• April 2008 - Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes
• October 2008 - Loading 10 terabytes of data per day onto research clusters
• March 2009 - 17 clusters with a total of 24,000 nodes
• April 2009 - Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes)

Google Code University
• 2008 seminar: Mass Data Processing Technology on Large Scale Clusters, Tsinghua University (Aaron Kimball)

Startup: Cloudera
• Cloudera is pushing a commercial distribution of Hadoop
Mike Olson, Christophe Bisciglia, Doug Cutting, Aaron Kimball, Tom White

Course Administrivia

Textbooks
• [Tom] Tom White, Hadoop: The Definitive Guide, 3rd edition, O’Reilly, 2012.5.
• [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2013.1.

Schedule (tentative and subject to change without notice)
• 7/9 - Introduction to MapReduce (ppt): Why large data? Cloud computing; story about Hadoop. Reading: [Lin] Ch1: Introduction; [Tom] Ch1: Meet Hadoop
• 7/11 - MapReduce Basics (ppt): How do we scale up? MapReduce; HDFS. Reading: [Lin] Ch2: MapReduce Basics; [Tom] Ch6: How MapReduce Works; [GFS & MapReduce papers]
• 7/16 - MapReduce Algorithm Design (ppt): MapReduce program development; basic MapReduce algorithm design and design patterns. Reading: [Tom] Ch5: Developing a MapReduce Application; [Lin] Ch3: MapReduce Algorithm Design
• 7/18 - Text Retrieval (ppt): Introduction to information retrieval; inverted index on MapReduce; retrieval problems. Reading: [Lin] Ch4: Inverted Indexing for Text Retrieval
• 7/23 - Graph Algorithms (ppt): Graph algorithms and MapReduce; parallel breadth-first search; PageRank. Reading: [Lin] Ch5: Graph Algorithms
• 7/25 - Flat Clustering / Canopy Clustering (ppt): What is clustering; applications of clustering in information retrieval; the K-means algorithm; evaluation of clustering; how many clusters; canopy clustering. Reading: [IIR] Ch16

Recap
• Why large data?
• Cloud computing
• Story about Hadoop
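The MapReduce model recapped above (a programming model plus an execution framework) can be illustrated with a minimal single-process sketch of the canonical word-count job. This is plain Python standing in for a real Hadoop job; the function names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative, not Hadoop APIs, and the shuffle here is just an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map: emit an intermediate (key, value) pair, here (word, 1),
    # for every word in one input record.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: combine all values that share one key.
    return (word, sum(counts))

def run_mapreduce(documents):
    # Map phase over all input records.
    pairs = [kv for doc in documents for kv in map_fn(doc)]
    # Shuffle/sort phase: group intermediate pairs by key; in a real
    # framework this is done in parallel across the cluster.
    pairs.sort(key=itemgetter(0))
    return dict(
        reduce_fn(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

result = run_mapreduce(["the quick brown fox", "the lazy dog"])
# e.g. result["the"] == 2
```

The point of the model is that `map_fn` and `reduce_fn` contain all the application logic, while the framework owns partitioning, scheduling, and fault tolerance, which is what lets the same two functions scale from one machine to thousands.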