Institute of Network and Information Systems, Peking University

MapReduce
Theory and Practice
http://net.pku.edu.cn/~course/cs402/2010/
Peng Bo
pb@net.pku.edu.cn
School of Electronics Engineering and Computer Science, Peking University
7/15/2010
Last Course Review
Quiz
What are they?
1. Data
   1. Bit
   2. Byte
2. Data types
3. Information
Data




The term data refers to groups of
information that represent the
qualitative or quantitative attributes of
a variable or set of variables.
Data (plural of "datum", which is
seldom used) are typically the results of
measurements and can be the basis of
graphs, images, or observations of a set
of variables.
Data are often viewed as the lowest
level of abstraction from which
information and knowledge are derived.
Raw data refers to a collection of
numbers, characters, images or other
outputs from devices that collect
information to convert physical
quantities into symbols, that are
unprocessed.
Bit
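A bit (short for "binary digit") is a single
digit in the binary system and the smallest
unit of information.
Suppose an event occurs as either A or B, and
A and B are equally likely, each with
probability 0.5. Then one bit can represent
which of A or B occurred. For example: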







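A bit can be used to represent:
a simple positive or negative sign
a switch with two states (such as a light switch)
a transistor being on or off
the presence or absence of voltage on a wire
an abstract logical yes or no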
Byte

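A byte is a group of eight bits.
The name is often glossed as short for
"binary term", although it was coined as a
deliberate respelling of "bite".
It is the usual unit for measuring computer
information, regardless of the type of data
being stored.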
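As a small illustration of the byte as a group of eight bits, the Python sketch below shows how many distinct values one byte can hold and how a value maps to its bit pattern. The helper name `to_bits` is made up for this example and is not from the slides.

```python
BITS_PER_BYTE = 8
distinct_values = 2 ** BITS_PER_BYTE  # number of patterns one byte can hold


def to_bits(value: int) -> str:
    """Return the 8-bit binary pattern of a single byte value (0..255)."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return format(value, "08b")


print(distinct_values)   # 256
print(to_bits(65))       # 01000001 (the ASCII code for 'A')
print(len(b"data"))      # 4: size is counted in bytes, whatever the content means
```

The last line reflects the point above: the byte count measures storage, independent of how the stored data is interpreted.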
History of “Information”





Latin origin: a representation implanted in the mind -> idea
Language and coding: hide information in messages and then decode them, e.g. Morse code
Mathematics: in his work on channel transmission, Shannon defined the amount of information in a message as the negative log2 of its probability of occurrence in the source, measured in "bits"
Logic and linguistics: the communication-oriented sense of information involves semantic meaning and knowledge
Society: information as something that is contained in the message used to inform. "Information is the tennis ball of communication"
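The Shannon item in the list above can be made concrete with a few lines of Python. This is a minimal sketch of the standard self-information formula, I(m) = -log2 p(m); the function name `self_information` is chosen here for illustration.

```python
import math


def self_information(p: float) -> float:
    """Information content, in bits, of a message that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


print(self_information(0.5))    # 1.0: a fair A-or-B event carries exactly one bit
print(self_information(0.125))  # 3.0: rarer messages carry more information
```

The first call ties back to the definition of a bit: an event with two equally likely outcomes carries one bit of information.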
How much data?





Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day
(4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s LHC will generate 15 PB a year (??)
640K ought to be
enough for
anybody.
“We are living in exponential times”
Information Overload

Political theorist Neil Postman spoke to the
German Informatics Society in 1990, claiming
that we are informing ourselves to death. He
argued that the development of computer
technology is not as positive as it has been
heralded to be. With our focus on technology,
we are forfeiting our humanity. We are drowning
in information that contains empty promises of
improving our lives. (Postman 1990).
How do we cope with information overload?
What does this have to do with ME?!

What would you want to do with 1,000 PCs, or even
100,000 PCs?
Cloud is coming…
Google alone has 450,000
systems running across 20
datacenters, and Microsoft's
Windows Live team is doubling
the number of servers it uses
every 14 months, which is faster
than Moore's Law
“Data Center is a Computer”
Parallelism everywhere
Massive, Scalable, Reliable
Resource Management
Data Management
Programming Model & Tools
What’s MapReduce?

Parallel/Distributed Computing Programming
Model
[Figure: MapReduce data flow: input split, shuffle, output]
Word Frequencies in Web pages


Input: one document per record
The user implements the map function, whose input is
key = document URL
value = document contents
map outputs (potentially many) key/value pairs.
For each word occurrence in the document, it emits one record <word, “1”>
Example continued:


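The MapReduce runtime system (library) collects all records with
the same key together (shuffle/sort)
The user implements the reduce function, which computes over the values for one key
Sum the values
Reduce outputs <key, sum>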
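The word-frequency example can be sketched end to end in plain Python. This is a single-process illustration under stated assumptions, not a real MapReduce runtime: the `shuffle` function stands in for what the library does, and the names `map_fn`, `reduce_fn`, `word_count` and the sample documents are all made up for this sketch.

```python
from collections import defaultdict


def map_fn(url, contents):
    """Map: emit one <word, "1"> record per word occurrence."""
    for word in contents.split():
        yield word, "1"


def shuffle(records):
    """Stand-in for the runtime: group all values that share a key, sorted by key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, values):
    """Reduce: sum the counts for one word and emit <key, sum>."""
    return word, sum(int(v) for v in values)


def word_count(documents):
    """Run map over every document, shuffle, then reduce each key."""
    mapped = [kv for url, text in documents.items() for kv in map_fn(url, text)]
    return dict(reduce_fn(key, values) for key, values in shuffle(mapped))


docs = {  # hypothetical input: one document per record
    "http://example.com/a": "to be or not to be",
    "http://example.com/b": "to do",
}
print(word_count(docs))  # {'be': 2, 'do': 1, 'not': 1, 'or': 1, 'to': 3}
```

Only `map_fn` and `reduce_fn` correspond to code the user writes; in a real system the grouping step runs across many machines.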

Homework Reading
Checklist



What’s the title?
What’s the main point of view?
What had the most impact on you?
Introduction to Distributed System
Design



How many times does “physicist” occur in this
document?
Tell me something about Remote Procedure
Calls
Tell me something about the types of failures
that can occur in a distributed system
Introduction to Parallel Programming
and MapReduce

MASTER/WORKER technique


approximating pi
MapReduce is an abstraction that allows Google
engineers to perform simple computations while
hiding the details of parallelization, data
distribution, load balancing and fault tolerance.
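The master/worker technique and the pi approximation named on this slide can be combined in a short Python sketch. It assumes a Monte Carlo estimate (random points in a unit square, counting those inside the quarter circle), which may differ in detail from the reading's version. Threads stand in for worker machines here, and the function names and default sizes are illustrative choices.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def worker(task):
    """Worker: count random points in the unit square that land inside the quarter circle."""
    seed, samples = task
    rng = random.Random(seed)  # private generator, so workers share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_workers=4, samples_per_worker=50_000):
    """Master: split the work into tasks, hand them out, and combine the partial counts."""
    tasks = [(seed, samples_per_worker) for seed in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        total_hits = sum(pool.map(worker, tasks))
    # quarter-circle area / unit-square area = pi / 4
    return 4.0 * total_hits / (num_workers * samples_per_worker)


print(estimate_pi())
```

Each task is independent and the master only sums the results, which is why this kind of problem parallelizes so easily. In CPython the threads will not speed up the loop; they only illustrate the division of roles.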
End