Lecture 18

Cloud Computing and
MapReduce
Used slides from the RAD Lab at UC Berkeley about the cloud (http://abovetheclouds.cs.berkeley.edu/) and Jimmy Lin’s slides (http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html) (licensed under the Creative Commons Attribution 3.0 License)
Cloud computing
• What is the “cloud”?
  – Many answers. Easier to explain with examples:
    • Gmail is in the cloud
    • Amazon (AWS) EC2 and S3 are the cloud
    • Google AppEngine is the cloud
    • Windows Azure is the cloud
    • SimpleDB is in the cloud
    • The “network” (cloud) is the computer
Cloud Computing
What about Wikipedia?
“Cloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a utility (like the electricity grid) over a network (typically the Internet).”
Cloud properties
• Cloud offers:
  – Scalability: scale out vs scale up (also scale back)
  – Reliability (hopefully!)
  – Availability (24x7)
  – Elasticity: pay-as-you-go depending on your demand
  – Multi-tenancy
more
• Scalability means that you (can) have effectively infinite resources and can handle an unlimited number of users
• Multi-tenancy enables sharing of resources and costs across a large pool of users. Lower cost, higher utilization… but other issues: e.g. security.
• Elasticity: you can add or remove compute nodes and the end user is not affected / sees the improvement quickly.
• Utility computing (similar to the electrical grid)
CLOUD COMPUTING ECONOMICS AND ELASTICITY
Cloud Application Demand
• Many cloud applications have cyclical demand curves
  – Daily, weekly, monthly, …
• Workload spikes are more frequent and significant
  – Death of Michael Jackson: 22% of tweets, 20% of Wikipedia traffic; Google thought they were under attack
  – Obama inauguration day: 5x increase in tweets
[Figure: demand (resources) vs. time, showing a cyclical demand curve]
How do you pick a capacity level?
Economics of Cloud Users
• Pay by use instead of provisioning for peak
• Recall: a data center (DC) costs >$150M and takes 24+ months to design and build
[Figure: resources vs. time for a static data center (fixed capacity above fluctuating demand, leaving unused resources) and a data center in the cloud (capacity tracks demand)]
Economics of Cloud Users
• Risk of over-provisioning: underutilization
• Huge sunk cost in infrastructure
[Figure: static data center with fixed capacity above fluctuating demand over time (days 1–3); the gap between capacity and demand is unused resources]
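A rough illustration with assumed numbers: a service whose demand peaks at 500 servers but averages 150 must buy and power 500 servers in a static data center, so on average 70% of that capacity sits idle; paying per server-hour in the cloud, it pays roughly for the 150-server average instead while still covering the peaks.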
Utility Computing Arrives
• Amazon Elastic Compute Cloud (EC2)
• “Compute unit” rental: $0.085–0.68/hour (reduced from $0.10–0.80)
  – 1 CU ≈ 1.0–1.2 GHz 2007 AMD Opteron/Intel Xeon core

  Instance type (Northern VA cluster)   Price/hour        Platform  Units  Memory   Disk
  Small                                 $0.10 → $0.085    32-bit    1      1.7GB    160GB
  Large                                 $0.40 → $0.35     64-bit    4      7.5GB    850GB – 2 spindles
  X Large                               $0.80 → $0.68     64-bit    8      15GB     1690GB – 4 spindles
  High CPU Med                          $0.20 → $0.17     64-bit    5      1.7GB    350GB
  High CPU Large                        $0.80 → $0.68     64-bit    20     7GB      1690GB
  High Mem X Large                      $0.50             64-bit    6.5    17.1GB   1690GB
  High Mem XXL                          $1.20             64-bit    13     34.2GB   1690GB
  High Mem XXXL                         $2.40             64-bit    26     68.4GB   1690GB

• No up-front cost, no contract, no minimum
• Billing rounded to the nearest hour (also regional and spot pricing)
• New paradigm(!) for deploying services? HPC?
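For example, at the prices above, ten Small instances running for three hours cost 10 × 3 × $0.085 ≈ $2.55, with nothing more owed once they are shut down.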
Utility Storage Arrives
• Amazon S3 and Elastic Block Storage offer
low-cost, contract-less storage
Cloud Computing Infrastructure
• Computation model: MapReduce*
• Storage model: HDFS*
• Other computation models: HPC/Grid Computing
• Network structure
*Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under the Creative Commons Attribution 3.0 License)
Cloud Computing Computation Models
• Finding the right level of abstraction
  – von Neumann architecture vs cloud environment
• Hide system-level details from the developers
  – No more race conditions, lock contention, etc.
• Separating the what from the how
  – Developer specifies the computation that needs to be performed
  – Execution framework (“runtime”) handles actual execution
“Big Ideas”
• Scale “out”, not “up”
– Limits of SMP and large shared-memory machines
• Idempotent operations
– Simplifies redo in the presence of failures
• Move processing to the data
– Cluster has limited bandwidth
• Process data sequentially, avoid random access
– Seeks are expensive, disk throughput is reasonable
• Seamless scalability for ordinary programmers
  – From the mythical man-month to the tradable machine-hour
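For the “seeks are expensive” point, a rough worked example (assuming ~10 ms per seek and ~100 MB/s sequential throughput, typical of disks of that era): reading 100 MB sequentially takes about 1 second, while fetching the same 100 MB as 10,000 random 10 KB reads spends 10,000 × 10 ms = 100 seconds on seeks alone.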
Typical Large-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
Key idea: provide a functional abstraction for these two operations (the per-record extraction and the aggregation) – MapReduce
(Dean and Ghemawat, OSDI 2004)
MapReduce
• Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
– All values with the same key are sent to the
same reducer
• The execution framework handles
everything else…
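A minimal single-machine sketch of this contract (plain Python, not the actual distributed runtime): map_fn may emit any number of (key, value) pairs per record, the framework groups all values by key, and reduce_fn sees each key once with its full list of values.

    # Minimal single-machine sketch of the MapReduce contract (not Hadoop code).
    from collections import defaultdict

    def run_mapreduce(records, map_fn, reduce_fn):
        # Map phase: each (k, v) record may emit any number of (k', v') pairs.
        groups = defaultdict(list)
        for k, v in records:
            for k2, v2 in map_fn(k, v):
                groups[k2].append(v2)
        # "Shuffle and sort": all values with the same key end up together,
        # and keys are handed to reduce in sorted order.
        output = []
        for k2 in sorted(groups):
            output.extend(reduce_fn(k2, groups[k2]))
        return output

In a real deployment the grouping happens across many machines, but the programmer still only writes the two functions.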
MapReduce
[Figure: four mappers turn inputs (k1,v1)…(k6,v6) into intermediate pairs (a,1)(b,2), (c,3)(c,6), (a,5)(c,2), (b,7)(c,8); shuffle and sort aggregates values by key into a:[1,5], b:[2,7], c:[2,3,6,8]; three reducers then emit the final pairs (r1,s1), (r2,s2), (r3,s3)]
MapReduce
• Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
– All values with the same key are sent to the
same reducer
• The execution framework handles
everything else…
What’s “everything else”?
MapReduce “Runtime”
• Handles scheduling
– Assigns workers to map and reduce tasks
• Handles “data distribution”
– Moves processes to data
• Handles synchronization
– Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
– Detects worker failures and automatically restarts
• Handles speculative execution
– Detects “slow” workers and re-executes work
• Everything happens on top of a distributed FS
(later)
Sounds simple, but many challenges!
MapReduce
• Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
– All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite…usually, programmers also specify:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod R
– Divides up key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
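A sketch of what these two extra hooks can look like (plain Python with assumed names, not Hadoop’s actual API):

    def partition(key, num_partitions):
        # Typical default: hash the key modulo the number of reducers,
        # so every occurrence of a key lands on the same reducer.
        return hash(key) % num_partitions

    def combine(key, values):
        # "Mini-reducer" run on the map side: for word count it pre-sums one
        # mapper's counts, shrinking the data shuffled over the network.
        # Only safe when the reduce is commutative and associative.
        yield (key, sum(values))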
[Figure: the same pipeline with combine and partition steps – the mappers emit (a,1)(b,2), (c,3)(c,6), (a,5)(c,2), (b,7)(c,8); a combiner merges the second mapper’s (c,3) and (c,6) into (c,9); partitioners divide the key space across reducers; shuffle and sort yields a:[1,5], b:[2,7], c:[2,9,8]; three reducers emit (r1,s1), (r2,s2), (r3,s3)]
Two more details…
• Barrier between map and reduce phases
– But we can begin copying intermediate data
earlier
• Keys arrive at each reducer in sorted order
– No enforced ordering across reducers
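For example, with two reducers a hash partitioner might send “apple” and “cherry” to one reducer and “banana” to the other; each reducer sees its own keys in sorted order, but concatenating the two output files does not give one globally sorted result.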
MapReduce Overall Architecture
[Figure: adapted from (Dean and Ghemawat, OSDI 2004). (1) The user program submits the job to the master, which (2) schedules map and reduce tasks onto workers. Map-phase workers (3) read the input file splits (split 0–4) and (4) write intermediate files to their local disks; reduce-phase workers (5) remotely read that intermediate data and (6) write the output files (file 0, file 1).]
“Hello World” Example: Word Count

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
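The same program as a small runnable Python sketch (single machine, with a toy driver standing in for the framework; a real job would run on Hadoop):

    from collections import defaultdict

    def wc_map(docid, text):
        # Emit (word, 1) for every word in the document.
        for word in text.split():
            yield (word, 1)

    def wc_reduce(term, values):
        # Sum all the 1s emitted for this term.
        return (term, sum(values))

    def word_count(docs):
        # Toy driver: group intermediate values by key, then reduce each key.
        groups = defaultdict(list)
        for docid, text in docs.items():
            for word, one in wc_map(docid, text):
                groups[word].append(one)
        return [wc_reduce(term, groups[term]) for term in sorted(groups)]

    print(word_count({"d1": "to be or not to be"}))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]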
MapReduce can refer to…
• The programming model
• The execution framework (aka “runtime”)
• The specific implementation
Usage is usually clear from context!
MapReduce Implementations
• Google has a proprietary implementation in
C++
– Bindings in Java, Python
• Hadoop is an open-source implementation in
Java
– Development led by Yahoo, used in production
– Now an Apache project
– Rapidly expanding software ecosystem, but still
lots of room for improvement
• Lots of custom research implementations
Cloud Computing Storage, or how do we get data to the workers?
[Figure: compute nodes reading data over the network from shared storage (NAS / SAN)]
What’s the problem here?
Distributed File System
• Don’t move data to workers… move workers to the
data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data local
• Why?
– Network bisection bandwidth is limited
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer
– GFS (Google File System) for Google’s MapReduce
– HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
• Choose commodity hardware over “exotic”
hardware
– Scale “out”, not “up”
• High component failure rates
– Inexpensive commodity components fail all the time
• “Modest” number of huge files
– Multi-gigabyte files are common, if not encouraged
• Files are write-once, mostly appended to
– Perhaps concurrently
• Large streaming reads over random access
– High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
• Files stored as chunks
– Fixed size (64MB)
• Reliability through replication
– Each chunk replicated across 3+ chunkservers
• Single master to coordinate access, keep metadata
– Simple centralized management
• No data caching
– Little benefit due to large datasets, streaming reads
• Simplify the API
  – Push some of the issues onto the client (e.g., data layout)

HDFS = GFS clone (same basic ideas implemented in Java)
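A toy sketch of the first two design decisions (fixed-size chunks, 3x replication) in plain Python; the round-robin placement is a naive stand-in, since real GFS/HDFS replica placement is rack-aware:

    import itertools

    CHUNK_SIZE = 64 * 1024 * 1024   # fixed-size 64MB chunks
    REPLICAS = 3                    # each chunk stored on 3+ chunkservers

    def place_chunks(file_size_bytes, chunkservers):
        # Split the file into fixed-size chunks and pick REPLICAS servers per chunk.
        num_chunks = -(-file_size_bytes // CHUNK_SIZE)   # ceiling division
        rotation = itertools.cycle(chunkservers)
        return {chunk_id: [next(rotation) for _ in range(REPLICAS)]
                for chunk_id in range(num_chunks)}

    # A 200MB file on five chunkservers -> 4 chunks, 3 replicas each.
    print(place_chunks(200 * 1024 * 1024, ["cs1", "cs2", "cs3", "cs4", "cs5"]))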
From GFS to HDFS
• Terminology differences:
– GFS master = Hadoop namenode
– GFS chunkservers = Hadoop datanodes
• Functional differences:
– No file appends in HDFS (planned feature)
– HDFS performance is (likely) slower
HDFS Architecture
[Figure: adapted from (Ghemawat et al., SOSP 2003). The application uses an HDFS client, which sends (file name, block id) requests to the HDFS namenode; the namenode keeps the file namespace (e.g., /foo/bar → block 3df2) and answers with (block id, block location). The client then asks an HDFS datanode for (block id, byte range) and receives block data. The namenode also exchanges instructions and datanode state with the datanodes, each of which stores blocks in its local Linux file system.]
Namenode Responsibilities
• Managing the file system namespace:
  – Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
• Coordinating file operations:
– Directs clients to datanodes for reads and writes
– No data is moved through the namenode
• Maintaining overall health:
– Periodic communication with the datanodes
– Block re-replication and rebalancing
– Garbage collection
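A toy picture of the metadata the namenode keeps in memory (illustrative Python, not HDFS code; /foo/bar and block 3df2 echo the architecture figure above, while the second block id and the datanode names are made up):

    # File namespace: path -> ordered list of block ids.
    namespace = {"/foo/bar": ["blk_3df2", "blk_91ac"]}

    # Block map: block id -> datanodes currently holding a replica.
    block_locations = {
        "blk_3df2": ["datanode1", "datanode4", "datanode7"],
        "blk_91ac": ["datanode2", "datanode4", "datanode9"],
    }

    def lookup(path):
        # What a client read request gets back: block ids plus where to fetch
        # them. The data itself never flows through the namenode.
        return [(blk, block_locations[blk]) for blk in namespace[path]]

    print(lookup("/foo/bar"))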
Putting everything together…
[Figure: the namenode runs the namenode daemon, a separate job submission node runs the jobtracker, and each slave node runs a tasktracker plus a datanode daemon on top of its local Linux file system]
MapReduce/GFS Summary
• Simple, but powerful programming model
• Scales to handle petabyte+ workloads
– Google: six hours and two minutes to sort 1PB (10
trillion 100-byte records) on 4,000 computers
– Yahoo!: 16.25 hours to sort 1PB on 3,800
computers
• Incremental performance improvement with
more nodes
• Seamlessly handles failures, but possibly with
performance penalties