大规模数据处理/云计算 Introduction

advertisement
大规模数据处理/云计算
Lecture 1 Introduction to MapReduce
闫宏飞
北京大学信息科学技术学院
7/9/2013
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin
University of Maryland
SEWMGroup
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
What is this course about?
•
•
•
•
Data-intensive information processing
Large-data (“web-scale”) problems
Focus on MapReduce programming
An entry-level course~
2
What is MapReduce?
• Programming model for expressing distributed
computations at a massive scale
• Execution framework for organizing and
performing such computations
• Open-source implementation called Hadoop
3
Why Large Data?
How much data?

Google processes 20 PB a day (2008)

Wayback Machine has 3 PB + 100 TB/month (3/2009)

Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

eBay has 6.5 PB of user data + 50 TB/day (5/2009)

CERN’s LHC will generate 15 PB a year
640K ought to be
enough for anybody.
5
6
Happening everywhere!
microarray chips
Molecular biology
(cancer)
fiber optics
microprocessors
Network traffic (spam)
300M/day
Simulations
(Millennium)
particle colliders
Particle events (LHC)
1B
7
1M/sec
8
Maximilien Brice, © CERN
9
Maximilien Brice, © CERN
10
Maximilien Brice, © CERN
11
Maximilien Brice, © CERN
No data like more data!
s/knowledge/data/g;
How do we get here if we’re not Google?
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
12
Example: information extraction

Answering factoid questions


Pattern matching on the Web
Works amazingly well
Who shot Abraham Lincoln?  X shot Abraham Lincoln

Learning relations



Start with seed instances
Search for patterns on the Web
Using patterns to find more instances
Wolfgang Amadeus Mozart (1756 - 1791)
Einstein was born in 1879
Birthday-of(Mozart, 1756)
Birthday-of(Einstein, 1879)
PERSON (DATE –
PERSON was born in DATE
(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; … )
13
Example: Scene Completion
Hays, Efros (CMU), “Scene Completion Using
Millions of Photographs” SIGGRAPH, 2007

Image Database Grouped by
Semantic Content





30 different Flickr.com groups
2.3 M images total (396 GB).

Select Candidate Images Most
Suitable for Filling Hole



Classify images with gist scene
detector [Torralba]
Color similarity
Local context matching
Computation


Index images offline
50 min. scene matching, 20
min. local matching, 4 min.
compositing
Reduces to 5 minutes total by
using 5 machines
Extension

Flickr.com has over 500 million
14
images …
14
More Data More Gains?
• CNNIC中国互联网络发展状况统计
截至 2010年6月底,我国网民规模达4.2亿人,互
联网普及率持续上升增至31.8%。手机网民成为
拉动中国总体网民规模攀升的主要动力,半年内
新增 4334万,达到2.77亿人,增幅为18.6%。值
得关注的是,互联网商务化程度迅速提高,全国
网络购物用户达到1.4亿,网上支付、网络购物和
网上银 行半年用户增长率均在30%左右,远远超
过其他类网络应用。
15
2009年全国新闻出版业基本情况
• 2009年:出版书籍238868种(初版145475种,
重版、重印93393种),总印数37. 88亿册(张),
总印张312.46亿印张,折合用纸量73.4万吨(包
括附录用纸1.41亿印张,折合用纸量0.33万吨),
定价总金额567.27亿 元(包括附录定价总金额
4.73亿元)。与上年相比种数增长8.86%(初版
增长11.24%,重版、重印增长5.36%),总印数
增长4.53%,总印 张增长4.61%,定价总金额增
长8.94%。
16
Did you know?
17
Did you know?
• “We are currently preparing our students for
jobs that don’t yet exist …”
• “It is estimated that a week’s worth of the New
York Times contains more information than a
person was likely to come across in a lifetime in
the 18th century”
• “The amount of new technical information is
doubling every 2 years”
• “So what does IT ALL MEAN?”
18
“We are living in exponential times “
19
Two Different Views
•
a “thrower-awayer”
Jennifer Widom
• MyLifeBits
Gordon Bell
“丢弃,必要时再找回来的代价
要比维护它们要小得多”
“trying to live an efficient life
so that one has time to work
and be with one’s family. “
20
Information Overloading
• 不能学以致用的原因之一:
信息超载
– 对于那些只接触过一次的信息,
我们通常只能记住其中一小部
分。
– 我们应该少而精而非多而浅地
去学习。
– 要想掌握某件事,关键在于间
隔性重复。
– 一旦真正透彻地掌握了自己的
工作,人们就会变得更有创造
性,甚至能够创造奇迹。
21
What is Cloud Computing?
The best thing since sliced bread?

Before clouds…




Grids
Vector supercomputers
…
Cloud computing means many different things:




Large-data processing
Rebranding of web 2.0
Utility computing
Everything as a service
23
Rebranding of web 2.0

Rich, interactive web applications




Clouds refer to the servers that run them
AJAX as the de facto standard (for better or worse)
Examples: Facebook, YouTube, Gmail, …
“The network is the computer”: take two



User data is stored “in the clouds”
Rise of the netbook, smartphones, etc.
Browser is the OS
24
Source: Wikipedia (Electricity meter)
Utility Computing

What?



Why?




Computing resources as a metered service (“pay as you go”)
Ability to dynamically provision virtual machines
Cost: capital vs. operating expenses
Scalability: “infinite” capacity
Elasticity: scale up or down on demand
Does it make sense?


Benefits to cloud users
Business case for cloud providers
I think there is a
world market for
about five
computers.
26
Everything as a Service

Utility computing = Infrastructure as a Service (IaaS)



Platform as a Service (PaaS)



Why buy machines when you can rent cycles?
Examples: Amazon’s EC2, Rackspace
Give me nice API and take care of the maintenance, upgrades, …
Example: Google App Engine
Software as a Service (SaaS)


Just run it for me!
Example: Gmail, Salesforce
27
Utility Computing

“pay-as-you-go” 好比让用户把电源插头插在墙上,你得到
的电压和Microsoft得到的一样,只是你用得少,pay less;
utility computing的目标就是让计算资源也具有这样的服务
能力,用户可以使用500强公司所拥有的计算资源,只是
use less pay less。这是cloud computing的一个重要方面
28
Platform as a Service (PaaS)

对于开发Web Application和Services,PaaS提供了一整套
基于Internet的,从开发,测试,部署,运营到维护的全方
位的集成环境。特别它从一开始就具备了Multi-tenant
architecture,用户不需要考虑多用户并发的问题,而由
platform来解决,包括并发管理,扩展性,失效恢复,安全
。
29
Software as a Service (SaaS)

a model of software deployment whereby a provider
licenses an application to customers for use as a service
on demand.
30
Who cares?

Ready-made large-data problems





Lots of user-generated content
Even more user behavior data
Examples: Facebook friend suggestions, Google ad placement
Business intelligence: gather everything in a data warehouse and
run analytics to generate insight
Utility computing



Provision Hadoop clusters on-demand in the cloud
Lower barrier to entry for tackling large-data problem
Commoditization and democratization of large-data capabilities
31
Story around Hadoop
Google-IBM Cloud Computing Initiative
• 2007年10月初,Google和IBM联
合与6所大学签署协议,提供在大
型分布式计算系统上开发软件的
课程和支持服务,帮助学生和研
究人员获得开发网络级应用软件
的经验。这个项目的主要内容是
传授MapReduce算法和Hadoop
文件系统。两家公司将各自出资
2000~2500万美元,为从事计算
机科学研究的教授和学生提供所
需的电脑软硬件和相关服务。
33
Cloud Computing Initiative
• Google and IBM team on cloud
computing initiative for
universities(2007)
– provide several hundred computers
– access through the Internet to test
parallel programming projects
• The idea for the program from
Google senior software engineer
Christophe Bisciglia
– Google Code University
34
The Information Factories
• Googleplex(pre-2008)
– servers number 450,000,
according to the lowest
estimate
– 200 petabytes of hard disk
storage
– four petabytes of RAM
– To handle the current load of
100 million queries a day,
– input-output bandwidth must be
in the neighborhood of 3
petabits per second
35
Google Infrastructure
• 2003 "The Google file system," in sosp. Bolton
Landing, NY, USA: ACM Press, 2003.
• 2004 "MapReduce: Simplified Data Processing
on Large Clusters," in osdi, 2004,
• 2006 "Bigtable: A Distributed Storage System
for Structured Data (Awarded Best Paper!)," in
osdi, 2006
• …….
36
Hadoop Project
Doug Cutting
37
History of Hadoop
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
2004 - Initial versions of what is now Hadoop Distributed File System and MapReduce implemented by Doug Cutting & Mike Cafarella
December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20
nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project official started to support the standalone
development of Map-Reduce and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team
May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April
benchmark)
October 2006 - Research cluster reaches 600 Nodes
December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in
5.2 hrs, 900 nodes in 7.8
January 2007 - Research cluster reaches 900 node
April 2007 - Research clusters - 2 clusters of 1000 nodes
April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
October 2008: Loading 10 terabytes of data per day onto research clusters.
March 2009: 17 clusters with a total of 24,000 nodes.
April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes)
38 nodes).
and the 100-terabyte sort in 173 minutes (on 3,400
Google Code University
• 2008, Seminar: Mass Data
Processing Technology on
Large Scale Clusters, Tsinghua
University
Aaron Kimball
39
Startup: Cloudera
• Cloudera is pushing a
commercial distribution for
Hadoop
Mike Olson Christophe Bisciglia
Doug Cutting
Aaron Kimball
Tom White
40
Course Administrivia
TextBooks
• [Tom] Tom White, Hadoop: The Definitive Guide,
O'Reilly, 3rd, 2012.5.
• [Lin] Jimmy Lin and Chris Dyer, Data-Intensive
Text Processing with MapReduce, 2013.1.
42
This schedule is tentative and subject to change without notice
ID
Topics
Contents
Reading
7/9
Introduction to
MapReduce
(ppt)
Why large data?
Cloud Computing
Story about Hadoop
[Lin]Ch1:Introduction
[Tom]Ch1:Meet Hadoop
7/11
MapReduce
Basics (ppt)
How do we scale up?
MapReduce
HDFS
[Lin]Ch2:Mapreduce Basics
[Tom]Ch6:How mapreduce works
[GFS&MapReduce Paper]
MapReduce Program Develop
Basic MapReduce algorithm design and
design patterns
[Tom]Ch5:Developing a MapReduce
Application
[Lin]Ch3:MapReduce algorithm design
Introduction to Information Retrieval
Inverted Index on MapReduce
Retrieval Problems
[Lin]Ch4:Inverted Indexing for Text
Retrieval
Graph Algorithm and Mapreduce
Parallel Breadth-First-Search
PageRank
[Lin]Ch5:Graph Algorithms
What is clustering
Applications of clustering in information
retrieval
K-means algorithm
Evaluation of clustering
How many clusters
Canopy clustering
[IIR] Ch16
7/16
7/18
7/23
7/25
MapReduce
Algorithm
Design(ppt)
Text retrieval
(ppt)
Graph
Algorithm (ppt
)
Flat Clustering
(ppt)
Canopy
Clustering
(ppt)
43
Recap
• Why large data?
• Cloud computing
• Story about Hadoop
44
Download