MapReduce Programming in Clusters

Introduction to MapReduce
ECE7610
The Age of Big-Data
• The big-data age
  – Facebook collects 500 terabytes a day (2011)
  – Google collects 20,000 PB a day (2011)
• Data is an important asset to any organization
  – Finance companies, insurance companies, internet companies
• We need new algorithms, data structures, and programming models
What to do? (Word Count)
• Consider a large data collection and count the occurrences of the different words
  {web, weed, green, sun, moon, land, part, web, green, …}
[Figure: sequential design, in which Main drives one WordCounter (parse(), count()) over the whole DataCollection to produce the ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1]
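A minimal sequential sketch of this design in plain Java; the class and method names mirror the figure, but the code itself is illustrative rather than taken from the course materials.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sequential word count: one WordCounter parses and counts the whole collection.
public class WordCounter {

    // parse() and count() rolled into one pass over the data collection
    public static Map<String, Integer> count(Iterable<String> dataCollection) {
        Map<String, Integer> resultTable = new HashMap<>();
        for (String record : dataCollection) {
            for (String word : record.split("\\s+")) {     // parse()
                resultTable.merge(word, 1, Integer::sum);  // count()
            }
        }
        return resultTable;
    }

    public static void main(String[] args) {
        List<String> data = List.of("web weed green sun moon land part web green");
        System.out.println(count(data));  // {web=2, green=2, weed=1, ...}
    }
}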
What to do? (Word Count)
• Multi-thread
  – Lock on shared data
[Figure: multi-threaded design: Main spawns 1..* Thread/WordCounter workers (parse(), count()) over the shared DataCollection; all of them update one shared ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1)]
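A sketch of the multi-threaded variant, assuming the collection is pre-split into per-thread slices; the shared ResultTable is the contended data that the slide's "lock on shared data" refers to, protected here with a ConcurrentHashMap (an explicit synchronized block would work equally well).

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Multi-threaded word count: threads share one result table,
// so updates must be synchronized (here via ConcurrentHashMap).
public class ThreadedWordCounter {

    public static Map<String, Integer> count(List<List<String>> slices) throws InterruptedException {
        Map<String, Integer> resultTable = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[slices.size()];
        for (int i = 0; i < workers.length; i++) {
            List<String> slice = slices.get(i);
            workers[i] = new Thread(() -> {
                for (String record : slice)
                    for (String word : record.split("\\s+"))
                        resultTable.merge(word, 1, Integer::sum);  // atomic per key
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        return resultTable;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(count(List.of(
                List.of("web weed green sun"),
                List.of("moon land part web green"))));
    }
}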
What to do? (Word Count)
• A single machine cannot serve all the data: you need a special distributed (file) system
  – Large number of commodity hardware disks: say, 1,000 disks of 1 TB each
  – Critical aspects: fault tolerance, replication, load balancing, monitoring
• Exploit the parallelism afforded by splitting the parsing and the counting
• Provision and locate computing at the data locations
What to do? (Word Count)
• Separate data, separate counters (see the pipeline sketch below)
[Figure: parallel design: the data collection is split into pieces; 1..* Parser threads turn each piece into a WordList of <KEY, VALUE> pairs (web, weed, green, sun, moon, land, part, web, green, …), and 1..* Counter threads aggregate the pairs into the ResultTable]
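The "separate parsers, separate counters" structure can be sketched as a producer/consumer pipeline: parser threads emit the words onto a queue (the KEY; the VALUE is implicitly 1) and a single consumer aggregates them into the ResultTable. The queue, sentinel, and thread layout below are illustrative choices, not taken from the slides.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Parsers produce words; the main thread acts as the counter and builds the ResultTable.
public class ParserCounterPipeline {
    private static final String POISON = "\0END\0";   // sentinel marking the end of one parser's output

    public static void main(String[] args) throws InterruptedException {
        List<String> splits = List.of("web weed green sun", "moon land part web green");
        BlockingQueue<String> wordQueue = new LinkedBlockingQueue<>();

        // One parser thread per split: parse records into words and push them onto the queue.
        List<Thread> parsers = splits.stream().map(split -> new Thread(() -> {
            for (String word : split.split("\\s+")) wordQueue.add(word);
            wordQueue.add(POISON);                     // tell the counter this parser is done
        })).toList();
        parsers.forEach(Thread::start);

        // Aggregate the (word, 1) pairs into the result table until every parser has finished.
        Map<String, Integer> resultTable = new HashMap<>();
        int finished = 0;
        while (finished < splits.size()) {
            String word = wordQueue.take();
            if (word.equals(POISON)) finished++;
            else resultTable.merge(word, 1, Integer::sum);
        }
        for (Thread t : parsers) t.join();
        System.out.println(resultTable);               // e.g. {web=2, green=2, weed=1, ...}
    }
}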
It is not easy to parallelize…
• Different programming models
  – Message passing vs. shared memory
• Fundamental issues
  – Scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …
• Architectural issues
  – Flynn's taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, cache coherence, …
• Different programming constructs
  – Mutexes, condition variables, barriers, …
  – Masters/slaves, producers/consumers, work queues, …
• Common problems
  – Livelock, deadlock, data starvation, priority inversion, …
  – Dining philosophers, sleeping barbers, cigarette smokers, …
Actually, a programmer's nightmare…
MapReduce: Automates it for you
• An important distributed parallel programming paradigm for large-scale applications.
• Has become one of the core technologies powering big IT companies such as Google, IBM, Yahoo, and Facebook.
• The framework runs on a cluster of machines, automatically partitions jobs into a number of small tasks, and processes them in parallel.
• Features: fairness, task data locality, fault tolerance.
MapReduce
• MAP: input data → <key, value> pairs
[Figure: the data collection is split (split 1, split 2, …, split n) to supply multiple processors; each split is handled by a Map task that emits <KEY, VALUE> pairs such as <web, 1>, <weed, 1>, <green, 1>, <sun, 1>, <moon, 1>, <land, 1>, <part, 1>, …]
MapReduce
• MAP: input data → <key, value> pairs
• REDUCE: <key, value> pairs → <result>
[Figure: the data collection is split (split 1, split 2, …, split n) to supply multiple processors; each split feeds a Map task, and the Map outputs are routed to Reduce tasks that produce the final results]
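To make the two signatures concrete, here is a framework-free word-count sketch in plain Java (all names are illustrative): map turns one input record into <key, value> pairs, and reduce collapses all the values collected for one key into a result.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual MapReduce word count, with no framework involved.
public class WordCountConcept {

    // MAP: one input record -> a list of <key, value> pairs
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.split("\\s+"))
            pairs.add(new SimpleEntry<>(word, 1));   // emit <word, 1>
        return pairs;
    }

    // REDUCE: <key, all values for that key> -> result
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // The "shuffle" in miniature: group map outputs by key, then reduce each group.
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<String, Integer> kv : map("web weed green web sun"))
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        groups.forEach((key, values) -> System.out.println(key + " -> " + reduce(key, values)));
    }
}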
Large scale data splits
[Figure: each split is parsed by a Map task into <key, 1> pairs, hashed on the key, and routed to one of the Reducers (say, Count); credit: C. Xu @ Wayne State]
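The hash step in the figure decides which reducer a key is routed to. A minimal sketch of that routing, mirroring what Hadoop's default HashPartitioner does (the key set and reducer count below are just examples):

// Route a key to a reducer: hash the key, take it modulo the number of reducers.
public class HashRouting {

    static int partition(String key, int numReducers) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"web", "weed", "green", "sun"})
            System.out.println(key + " -> reducer " + partition(key, 3));
    }
}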
How to store the data?
[Figure: compute nodes]
• What's the problem here?
Distributed File System
• Don't move data to workers… move workers to the data!
  – Store data on the local disks of nodes in the cluster
  – Start up the workers on the node that has the data local
• Why?
  – Not enough RAM to hold all the data in memory
  – The network is the bottleneck; disk throughput is good
• A distributed file system is the answer
  – GFS (Google File System)
  – HDFS for Hadoop
GFS/HDFS Design
• Commodity hardware over "exotic" hardware
  – High component failure rates
• Files stored as chunks
  – Fixed size (64 MB)
• Reliability through replication
  – Each chunk replicated across 3+ chunkservers
• A single master coordinates access and keeps metadata
  – Simple centralized management
• No data caching
  – Little benefit due to large data sets, streaming reads
• Simplified API
  – Push some of the issues onto the client
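A small sketch of what "files stored as chunks, replicated 3+ times" looks like from the client side, using the Hadoop 1.x FileSystem API. The path and property values are illustrative, and in practice the block size and replication factor usually come from the cluster configuration rather than from client code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The client just writes a file; HDFS splits it into fixed-size blocks
// and replicates each block across several DataNodes (chunkservers).
// Assumes the cluster configuration (fs.default.name) is on the classpath.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                               // 3 replicas per block
        conf.set("dfs.block.size", String.valueOf(64L * 1024 * 1024));  // 64 MB blocks (Hadoop 1.x key)
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/data/collection.txt"));
        out.writeBytes("web weed green sun moon land part web green\n");
        out.close();
    }
}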
GFS/HDFS
[Figure: GFS/HDFS architecture]
MapReduce Data Locality
• Master scheduling policy
  – Asks HDFS for the locations of replicas of the input file blocks
  – Map tasks are typically split into 64 MB pieces (== the GFS block size)
  – Locality levels: node locality / rack locality / off-rack
  – Map tasks are scheduled as close to their input data as possible (see the sketch below)
• Effect
  – Thousands of machines read input at local disk speed. Without this, rack switches limit the read rate and network bandwidth becomes the bottleneck.
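A sketch of the locality preference in the master's scheduling policy: given a free node, prefer a pending task whose input block sits on that node, then one in the same rack, then anything. The Task record and replica lists are made up for illustration; they are not the framework's data structures.

import java.util.List;
import java.util.Optional;

// Pick a map task for a free node: node-local first, then rack-local, then off-rack.
public class LocalityAwareChooser {
    record Task(String id, List<String> replicaNodes, List<String> replicaRacks) {}

    static Optional<Task> choose(List<Task> pending, String node, String rack) {
        return pending.stream().filter(t -> t.replicaNodes().contains(node)).findFirst()        // node locality
            .or(() -> pending.stream().filter(t -> t.replicaRacks().contains(rack)).findFirst()) // rack locality
            .or(() -> pending.stream().findFirst());                                             // off-rack
    }

    public static void main(String[] args) {
        List<Task> pending = List.of(
                new Task("m1", List.of("nodeA", "nodeB"), List.of("rack1")),
                new Task("m2", List.of("nodeC"), List.of("rack2")));
        System.out.println(choose(pending, "nodeC", "rack2").map(Task::id).orElse("none")); // m2
    }
}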
MapReduce Fault-tolerance
• Reactive way
  – Worker failure
    • Heartbeat: workers are periodically pinged by the master
      – No response = failed worker (see the sketch below)
    • If the processor of a worker fails, the tasks of that worker are reassigned to another worker.
  – Master failure
    • The master writes periodic checkpoints
    • Another master can be started from the last checkpointed state
    • If the master eventually dies, the job is aborted
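A minimal sketch of the heartbeat check described above (the timeout value and data structures are illustrative): the master records the last heartbeat time per worker, and a worker that stays silent past the timeout is declared failed, making its tasks eligible for reassignment.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Master-side view of worker liveness: no heartbeat within the timeout = failed worker.
public class HeartbeatMonitor {
    private static final long TIMEOUT_MS = 10_000;               // illustrative timeout
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public void onHeartbeat(String workerId) {                   // called when a ping is answered
        lastHeartbeat.put(workerId, System.currentTimeMillis());
    }

    public List<String> findFailedWorkers() {                    // tasks of these workers get reassigned
        long now = System.currentTimeMillis();
        List<String> failed = new ArrayList<>();
        lastHeartbeat.forEach((worker, last) -> {
            if (now - last > TIMEOUT_MS) failed.add(worker);
        });
        return failed;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor();
        monitor.onHeartbeat("worker-1");
        System.out.println(monitor.findFailedWorkers());          // [] while worker-1 is fresh
    }
}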
MapReduce Fault-tolerance
• Proactive way (speculative execution)
  – The problem of "stragglers" (slow workers)
    • Other jobs consuming resources on the machine
    • Bad disks with soft errors transfer data very slowly
    • Weird things: processor caches disabled (!!)
  – When the computation is almost done, reschedule in-progress tasks
  – Whenever either the primary or the backup execution finishes, mark the task as completed
MapReduce Scheduling
• Fair Sharing
  – Conducts fair scheduling, using a greedy method to maintain data locality
• Delay
  – Uses the delay scheduling algorithm to achieve good data locality by slightly relaxing the fairness restriction
• LATE (Longest Approximate Time to End)
  – Improves the performance of MapReduce applications in heterogeneous environments, such as virtualized environments, through accurate speculative execution
• Capacity
  – Introduced by Yahoo; supports multiple queues for shared users and guarantees each queue a fraction of the capacity of the cluster
MapReduce Cloud Service
• Providing MapReduce frameworks as a service in clouds has become an attractive usage model for enterprises.
• A MapReduce cloud service allows users to cost-effectively access a large amount of computing resources without creating their own cluster.
• Users are able to adjust the scale of their MapReduce clusters in response to changes in the resource demand of their applications.
Amazon Elastic MR
[Figure: you and your Hadoop cluster running on EC2]
0. Allocate a Hadoop cluster (EC2)
1. Scp data to the cluster
2. Move data into HDFS
3. Develop code locally
4. Submit the MapReduce job
   4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from the cluster
7. Clean up!
New Challenges
• Interference between co-hosted VMs
  – Slows the job down by 1.5-7 times
• The locality-preserving policy is no longer effective
  – Loses more than 20% locality (depends)
• Need a specifically designed scheduler for virtual MapReduce clusters
  – Interference-aware
  – Locality-aware
MapReduce Programming
• Hadoop implementation of MapReduce in Java (version 1.0.4)
• WordCount example: hadoop-1.0.4/src/examples/org/apache/hadoop/examples/WordCount.java
Map
• Implement your own map class extending the Mapper class
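The slide refers to the mapper in Hadoop's bundled WordCount example; the sketch below follows that example's shape for the Hadoop 1.x (org.apache.hadoop.mapreduce) API. In the distribution it is a nested static class of WordCount; it is shown as a top-level class here, and details may differ slightly from the shipped file.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: for every word in the input line, emit the pair <word, 1>.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit <word, 1>
        }
    }
}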
Reduce
• Implement your own reducer class extending the Reducer class
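The matching reducer, again following the shape of the example's IntSumReducer (shown as a top-level class; treat details as approximate): it sums all the 1s collected for a word.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce: sum all the counts that arrived for one word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // emit <word, total count>
    }
}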
Main()
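The driver wires the two classes into a Job, in the style of the Hadoop 1.0.x example (the `new Job(conf, ...)` constructor is the 1.x idiom; later releases use Job.getInstance). This is a sketch rather than a verbatim copy of the shipped file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// Driver: configure the job, point it at the mapper/reducer, and run it.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}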
Demo