MapReduce Theory and Practice
http://net.pku.edu.cn/~course/cs402/2010/
彭波 (Peng Bo), pb@net.pku.edu.cn
Peking University, School of Electronics Engineering and Computer Science (北京大学信息科学技术学院)
7/15/2010

Last Course Review

Quiz: What are they?
1. Data
   1. Bit
   2. Byte
2. Data types
3. Information

Data
The term data refers to groups of information that represent the qualitative or quantitative attributes of a variable or set of variables.
Data (plural of "datum", which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables.
Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.
Raw data are unprocessed numbers, characters, images, or other outputs from devices that collect information by converting physical quantities into symbols.

Bit
A bit is a single digit in the binary system and the smallest unit of information. "Bit" is short for "binary digit".
Suppose an event occurs as either A or B, each with probability 0.5. One bit is then enough to indicate which of A or B occurred.
For example, a bit can represent:
  a simple positive or negative sign
  a switch with two states (such as a light switch)
  a transistor being on or off
  the presence or absence of voltage on a wire
  an abstract logical yes or no

Byte
A byte represents eight bits. It is the usual unit for measuring computer information, regardless of the type of data being stored.
The word is a deliberate respelling of "bite", coined by Werner Buchholz at IBM in 1956; it is not an abbreviation of "Binary Term", as is sometimes claimed.

History of "Information"
Latin origin: a representation implanted in the mind -> idea
Language and coding: hide information in messages and then decode them, for example Morse code.
Mathematics: in his work on channel transmission, Shannon defined the amount of information in a message as the negative base-2 logarithm of the probability with which it occurs in the source, measured in bits.
Logic and linguistics: the communication-oriented sense of information involves semantic meaning and knowledge.
Society: information as something that is contained in the message used to inform. "Information is the tennis ball of communication."

How much data?
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's LHC will generate 15 PB a year (??)
"640K ought to be enough for anybody."
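Returning to Shannon's definition from the History of "Information" slide: the information content of a message with probability p is -log2(p) bits. The snippet below is a minimal sketch of that formula only; the function name `self_information` is a hypothetical name chosen for illustration and does not come from the slides.

```python
import math


def self_information(p: float) -> float:
    """Information content, in bits, of an outcome with probability p.

    Hypothetical helper for illustration; implements I(p) = -log2(p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)


# An event that is A or B with probability 0.5 each carries one bit.
fair_choice_bits = self_information(0.5)

# One of 256 equally likely values (a byte) carries eight bits.
byte_bits = self_information(1 / 256)

print(fair_choice_bits, byte_bits)
```

This ties the Bit and Byte slides to the Shannon definition: halving the probability of an outcome adds one bit of information.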
"We are living in exponential times"

Information Overload
Political theorist Neil Postman spoke to the German Informatics Society in 1990, claiming that we are informing ourselves to death.
He argued that the development of computer technology is not as positive as it has been heralded to be. With our focus on technology, we are forfeiting our humanity. We are drowning in information that contains empty promises of improving our lives. (Postman 1990)

How do we cope with information overload?

What does this have to do with me?!
What would you do with 1,000 PCs, or even 100,000 PCs?

Cloud is coming…
Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law.
"Data Center is a Computer"
  Parallelism everywhere
  Massive
  Scalable
  Reliable
  Resource Management
  Data Management
  Programming Model & Tools

What is MapReduce?
A programming model for parallel/distributed computing.
[Figure labels: input, split, shuffle, output]

Word Frequencies in Web Pages
Input: one document per record.
The user implements the map function, whose input is
  key = document URL
  value = document contents
map outputs (potentially many) key/value pairs: for every occurrence of a word in the document, it emits one record <word, "1">.

Example continued:
The MapReduce runtime system (library) collects all records that share the same key (shuffle/sort).
The user implements the reduce function, which sums the values belonging to one key.
reduce outputs <key, sum>.

Homework Reading Checklist
What is the title?
What is the main point of view?
What had the most impact on you?

Introduction to Distributed System Design
How many times does "physicist" occur in this document?
Tell me something about Remote Procedure Calls.
Tell me something about the types of failures that can occur in a distributed system.

Introduction to Parallel Programming and MapReduce
The MASTER/WORKER technique
Approximating pi
MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.

End
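The word-frequency example from the slides above can be sketched in a few lines of single-process Python. This is an illustration under stated assumptions, not the Google MapReduce library or Hadoop: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, the in-memory dictionary stands in for the distributed shuffle/sort, and the example.com URLs are made-up inputs.

```python
from collections import defaultdict


def map_fn(url, contents):
    """Map: key = document URL, value = document contents.

    Emits one <word, "1"> record per word occurrence, as on the slide.
    """
    for word in contents.split():
        yield word, "1"


def reduce_fn(word, values):
    """Reduce: sum all values collected for one key and emit <key, sum>."""
    return word, sum(int(v) for v in values)


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # the "shuffle" step
    # Visiting keys in sorted order mimics the "sort" part of shuffle/sort.
    return [reducer(k, groups[k]) for k in sorted(groups)]


# Hypothetical input: one document per record.
docs = [
    ("http://example.com/a", "the quick fox"),
    ("http://example.com/b", "the lazy dog the end"),
]

counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(counts)
```

In a real deployment the map calls run on many machines and the grouping happens over the network; the user still writes only the two small functions, which is the point of the abstraction.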