From Lisp to MapReduce and GFS

Comp6611 Course Lecture
Big data applications
Yang PENG
Network and System Lab
CSE, HKUST
Monday, March 11, 2013
ypengab@cse.ust.hk
Material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google
Distributed Computing Seminar, 2007 (licensed under the Creative Commons Attribution 3.0 License)
Today's Topics

• MapReduce
  • Background information/overview
  • Map and Reduce -------- from a programmer's perspective
  • Architecture and workflow -------- a global overview
• Improvement
  • Virtues and defects
  • Spark
MapReduce Background

• Before MapReduce, large-scale data processing was difficult:
  • Managing parallelization and distribution
    • Resource scheduling and load-balancing
  • Data storage and distribution
    • Distributed file system: "Moving computation is cheaper than moving data."
  • Application development is tedious and hard to debug
  • Fault/crash tolerance
  • Scalability
Does the "Divide and Conquer" paradigm still work in big data?

[Diagram: the input is partitioned into pieces w1, w2, w3; each piece is assigned to a Worker; the workers' partial results are combined into the final Result.]
Programming Model

• Opportunity: design a software abstraction that undertakes the divide and conquer and reduces programmers' workload for
  • resource management
  • task scheduling
  • distributed synchronization and communication
• Functional programming, which has a long history, provides some higher-order functions to support divide and conquer.
  • Map: do something to everything in a list
  • Fold: combine results of a list in some way

[Diagram: an Application runs on top of the Abstraction, which spans many Computers.]
Map

• Map is a higher-order function
• How map works:
  • The function is applied to every element in a list
  • The result is a new list

[Diagram: the function f is applied independently to each element of the input list, producing a new list.]
Fold

• Fold is also a higher-order function
• How fold works:
  • The accumulator is set to an initial value
  • The function is applied to a list element and the accumulator
  • The result is stored in the accumulator
  • This is repeated for every item in the list
  • The result is the final value in the accumulator

[Diagram: starting from the initial value, f is applied in turn to each list element and the accumulator, producing the final value.]
Map/Fold in Action

• Simple map example:
  (map (lambda (x) (* x x)) '(1 2 3 4 5)) → '(1 4 9 16 25)

• Fold examples:
  (fold + 0 '(1 2 3 4 5)) → 15
  (fold * 1 '(1 2 3 4 5)) → 120

• Sum of squares:
  (define (sum-of-squares v)
    (fold + 0 (map (lambda (x) (* x x)) v)))
  (sum-of-squares '(1 2 3 4 5)) → 55
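The same idea carries over directly to Scala, the language used in the Spark examples later in this deck. A minimal sketch of the equivalents (map and foldLeft stand in for the Scheme map/fold above):

  // Map: apply a function to every element of a list, producing a new list.
  val squares = List(1, 2, 3, 4, 5).map(x => x * x)      // List(1, 4, 9, 16, 25)

  // Fold: combine list elements with an accumulator, starting from an initial value.
  val sum     = List(1, 2, 3, 4, 5).foldLeft(0)(_ + _)   // 15
  val product = List(1, 2, 3, 4, 5).foldLeft(1)(_ * _)   // 120

  // Sum of squares: compose map and fold.
  def sumOfSquares(v: List[Int]): Int =
    v.map(x => x * x).foldLeft(0)(_ + _)                 // sumOfSquares(List(1, 2, 3, 4, 5)) == 55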
MapReduce

• Programmers specify two functions:
  map    (k1, v1)       → list(k2, v2)
  reduce (k2, list(v2)) → list(v2)

• An implementation of WordCount (it's just divide and conquer!):

  function map(String name, String document):
    // K1 name: document name
    // V1 document: document contents
    for each word w in document:
      emit (w, 1)

  function reduce(String word, Iterator partialCounts):
    // K2 word: a word
    // list(V2) partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
      sum += ParseInt(pc)
    emit (word, sum)
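To make the data flow concrete, here is a minimal single-machine sketch in Scala (hypothetical names; the real framework distributes these steps across workers) of what happens around these two functions: map is applied to every input record, the intermediate pairs are grouped by key, and reduce is applied to each group.

  object WordCountSketch {
    // map(k1, v1) -> list(k2, v2): emit one (word, 1) pair per word in the document.
    def mapFn(name: String, document: String): Seq[(String, Int)] =
      document.split("\\s+").filter(_.nonEmpty).toSeq.map(w => (w, 1))

    // reduce(k2, list(v2)): sum the partial counts for one word.
    def reduceFn(word: String, partialCounts: Seq[Int]): (String, Int) =
      (word, partialCounts.sum)

    def main(args: Array[String]): Unit = {
      val documents = Seq("doc1" -> "the quick brown fox", "doc2" -> "the lazy dog")

      // Map phase: apply mapFn to every input record.
      val intermediate = documents.flatMap { case (name, text) => mapFn(name, text) }
      // Barrier: aggregate intermediate values by key (what the shuffle does).
      val grouped = intermediate.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }
      // Reduce phase: apply reduceFn to each key and its list of values.
      grouped.map { case (word, counts) => reduceFn(word, counts) }.foreach(println)
    }
  }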
[Diagram: MapReduce data flow. The Data Store supplies initial key-value pairs to several map tasks; each map emits intermediate values for keys k1, k2, k3; a barrier aggregates values by key; reduce tasks then produce the final values for k1, k2, and k3.]
Behind the scenes…
Programming interface

• input reader
• Map function
• partition function
  • The partition function is given the key and the number of reducers and returns the index of the desired reducer.
  • For load balancing, e.g. a hash function (see the sketch after this slide)
• compare function
  • The compare function is used to sort the intermediate output.
  • Ordering guarantee
• Reduce function
• output writer
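A minimal sketch of these two hooks (names and signatures are illustrative, not Hadoop's actual interfaces): a load-balancing partition function can hash the key modulo the number of reducers, and a compare function imposes the order in which keys reach each reducer.

  // Every pair with the same key maps to the same reducer index in [0, numReducers).
  def partition(key: String, numReducers: Int): Int =
    (key.hashCode & Int.MaxValue) % numReducers

  // Used to sort intermediate keys, giving the per-reducer ordering guarantee.
  def compare(a: String, b: String): Int =
    a.compareTo(b)

For example, partition("hello", 24) always returns the same index, so all ("hello", 1) pairs from every mapper meet at one reducer.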
Output of a Hadoop job (progress)

ypeng@vm115:~/hadoop-0.20.2$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/hduser/wordcount/15G-enwiki-input /user/hduser/wordcount/15G-enwiki-output
13/01/16 07:00:48 INFO input.FileInputFormat: Total input paths to process : 1
13/01/16 07:00:49 INFO mapred.JobClient: Running job: job_201301160607_0003
13/01/16 07:00:50 INFO mapred.JobClient: map 0% reduce 0%
.........................
13/01/16 07:01:50 INFO mapred.JobClient: map 18% reduce 0%
13/01/16 07:01:52 INFO mapred.JobClient: map 19% reduce 0%
13/01/16 07:02:06 INFO mapred.JobClient: map 20% reduce 0%
13/01/16 07:02:08 INFO mapred.JobClient: map 20% reduce 1%
13/01/16 07:02:10 INFO mapred.JobClient: map 20% reduce 2%
.........................
13/01/16 07:06:41 INFO mapred.JobClient: map 99% reduce 32%
13/01/16 07:06:47 INFO mapred.JobClient: map 100% reduce 33%
13/01/16 07:06:55 INFO mapred.JobClient: map 100% reduce 39%
.........................
13/01/16 07:07:21 INFO mapred.JobClient: map 100% reduce 99%
13/01/16 07:07:31 INFO mapred.JobClient: map 100% reduce 100%
13/01/16 07:07:43 INFO mapred.JobClient: Job complete: job_201301160607_0003
(continued on the next slide)
Counters in a Hadoop job (summary of the job's counters)

13/01/16 07:07:43 INFO mapred.JobClient: Counters: 18
13/01/16 07:07:43 INFO mapred.JobClient:   Job Counters
13/01/16 07:07:43 INFO mapred.JobClient:     Launched reduce tasks=24
13/01/16 07:07:43 INFO mapred.JobClient:     Rack-local map tasks=17
13/01/16 07:07:43 INFO mapred.JobClient:     Launched map tasks=249
13/01/16 07:07:43 INFO mapred.JobClient:     Data-local map tasks=203
13/01/16 07:07:43 INFO mapred.JobClient:   FileSystemCounters
13/01/16 07:07:43 INFO mapred.JobClient:     FILE_BYTES_READ=12023025990
13/01/16 07:07:43 INFO mapred.JobClient:     HDFS_BYTES_READ=15492905740
13/01/16 07:07:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=14330761040
13/01/16 07:07:43 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=752814339
13/01/16 07:07:43 INFO mapred.JobClient:   Map-Reduce Framework
13/01/16 07:07:43 INFO mapred.JobClient:     Reduce input groups=39698527
13/01/16 07:07:43 INFO mapred.JobClient:     Combine output records=508662829
13/01/16 07:07:43 INFO mapred.JobClient:     Map input records=279422018
13/01/16 07:07:43 INFO mapred.JobClient:     Reduce shuffle bytes=2647359503
13/01/16 07:07:43 INFO mapred.JobClient:     Reduce output records=39698527
13/01/16 07:07:43 INFO mapred.JobClient:     Spilled Records=828280813
13/01/16 07:07:43 INFO mapred.JobClient:     Map output bytes=24932976267
13/01/16 07:07:43 INFO mapred.JobClient:     Combine input records=2813475352
13/01/16 07:07:43 INFO mapred.JobClient:     Map output records=2376465967
13/01/16 07:07:43 INFO mapred.JobClient:     Reduce input records=71653444
Master in MapReduce

• Resource Management
  • Maintains the current resource usage of each Worker (CPU, RAM, used & free disk space, etc.)
  • Examines worker failure periodically.
• Task Scheduling
  • "Moving computation is cheaper than moving data."
  • Map and reduce tasks are assigned to idle Workers.
  • Tasks on failed workers will be re-scheduled.
  • When a job is close to its end, it launches backup tasks.
• Counter
  • provides interactive job progress.
  • stores the occurrences of various events.
  • is helpful for performance tuning.
Data-oriented Map scheduling

[Diagram: the input file is divided into splits 1-5, whose replicas are spread across Workers 1-3 (Rack 1) and Workers 4-6 (Rack 2), each rack behind its own switch. The Master launches each map task on a worker that stores, or sits close to, a replica of its split:
  Launch map 1 on Worker 3
  Launch map 2 on Worker 4
  Launch map 3 on Worker 1
  Launch map 4 on Worker 2
  Launch map 5 on Worker 5]
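A minimal sketch of the preference order implied by the diagram (hypothetical names; the real Master keeps much richer state): prefer a worker that holds the split locally, then a worker in the same rack, then any idle worker.

  // Hypothetical model of locality-aware map scheduling.
  case class Worker(id: Int, rack: Int, localSplits: Set[Int])

  def chooseWorker(split: Int, splitRack: Int, idleWorkers: Seq[Worker]): Option[Worker] =
    idleWorkers.find(_.localSplits.contains(split))     // data-local: the split is on the worker's own disk
      .orElse(idleWorkers.find(_.rack == splitRack))    // rack-local: the copy crosses only the rack switch
      .orElse(idleWorkers.headOption)                   // otherwise: any idle worker (non-local read)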
Data flow in MapReduce jobs

[Diagram: a Mapper reads its split (local, rack-local, or non-local), writes output into an in-memory circular buffer, applies the Combiner, and spills to disk; the spills are merged into intermediate files on disk. Each Reducer fetches its partitions from this Mapper and from other Mappers, merges them as spills on disk, and other Reducers do the same.]
Map internal

• The map phase reads the task's input split from GFS, parses it into records (key/value pairs), and applies the map function to each record.
• After the map function has been applied to each record, the commit phase registers the final output with the Master, which will tell the reduce tasks the location of the map output.
Reduce internal

• The shuffle phase fetches the reduce task's input data.
• The sort phase groups records with the same key together.
• The reduce phase applies the user-defined reduce function to each key and its corresponding list of values.
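A minimal sketch of these three phases in Scala (hypothetical names, and assuming the fetched data fits in memory): sorting brings equal keys together, so one linear scan can hand each group to the reduce function.

  // Shuffle output: the (key, value) pairs fetched from all mappers for this reduce task.
  def reduceSide[K: Ordering, V, R](fetched: Seq[(K, V)],
                                    reduceFn: (K, Seq[V]) => R): Seq[R] = {
    val sorted = fetched.sortBy(_._1)                   // sort phase: equal keys become adjacent
    val grouped = sorted.foldRight(List.empty[(K, List[V])]) {
      case ((k, v), (gk, gvs) :: rest) if k == gk => (gk, v :: gvs) :: rest
      case ((k, v), groups)                       => (k, List(v)) :: groups
    }
    grouped.map { case (k, vs) => reduceFn(k, vs) }     // reduce phase: one call per distinct key
  }

  // reduceSide(Seq(("a", 1), ("b", 1), ("a", 1)), (w: String, cs: Seq[Int]) => (w, cs.sum))
  //   == List(("a", 2), ("b", 1))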
Backup Tasks

• There are barriers in a MapReduce job.
  • No reduce function executes until all maps finish.
  • The job cannot complete until all reduces finish.

[Diagram: all Map tasks must finish before the Reduce tasks run, and all Reduce tasks must finish before the job is complete.]

• The execution time of a job will be severely lengthened if a task is blocked.
• The Master schedules backup/speculative tasks for unfinished tasks when the job is close to its end.
• A job can take 44% longer if backup tasks are disabled.
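A minimal sketch of a speculative-execution heuristic in this spirit (the threshold and the "duplicate the slowest two" rule are illustrative choices, not the actual MapReduce or Hadoop policy): once most tasks have finished, duplicate the slowest still-running tasks and keep whichever copy finishes first.

  // Hypothetical task progress records; progress is a fraction in [0, 1].
  case class Task(id: Int, progress: Double, finished: Boolean)

  def pickBackupCandidates(tasks: Seq[Task]): Seq[Task] = {
    val doneFraction = tasks.count(_.finished).toDouble / tasks.size
    if (doneFraction < 0.95) Seq.empty          // only speculate when the job is close to its end
    else tasks.filterNot(_.finished)
              .sortBy(_.progress)               // the least-progressed running tasks are the stragglers
              .take(2)                          // launch duplicates; the first copy to finish wins
  }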
Virtues and defects of MR

Virtues
• Towards large-scale data
• Programming friendly
• Implicit parallelism
• Data locality
• Fault/crash tolerance
• Scalability
• Open-source with a good ecosystem [1]

Defects
• Bad for iterative ML algorithms
• Not sure

[1] http://docs.hortonworks.com/CURRENT/index.htm#About_Hortonworks_Data_Platform/Understanding_Hadoop_Ecosystem.htm
Network traffic in MapReduce

1. A Map task may read its split from a remote ChunkServer.
2. Reduce tasks copy the output of Map tasks.
3. Reduce output is written to GFS.

[Diagram: the three network transfers above, annotated on the job's data flow.]
Disk R/W in MapReduce

1. A ChunkServer reads a local block when a remote split is fetched.
2. Intermediate results are spilled to disk.
3. The copied partition is written to the reducer's local disk.
4. The final output is written to the local ChunkServer.
5. The final output is written to a remote ChunkServer (replica).

[Diagram: the five disk reads/writes above, annotated on the job's data flow.]
Iterative MapReduce

Performing a graph algorithm using MapReduce: each iteration runs as a separate MapReduce job.
Motivation of Spark

• Iterative algorithms (machine learning, graphs)
• Interactive data mining tools (R, Excel, Python)
Programming Model

• Fine-grained
  • The computed outputs of every iteration are distributed and stored to stable storage.
• Coarse-grained
  • Only log the transformations used to build a dataset (i.e., its lineage).
• Resilient distributed datasets (RDDs)
  • Immutable, partitioned collections of objects
  • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
  • Can be cached for efficient reuse
• Actions on RDDs
  • count, reduce, collect, save, …
Spark Operations

Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey,
  flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
  collect, reduce, count, save, lookupKey
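A minimal word-count sketch chaining a few of these operations (assuming a SparkContext named sc; the paths are placeholders, and "save" in the table corresponds to saveAsTextFile in the actual API). Transformations only record lineage; nothing executes until the action at the end.

  val lines  = sc.textFile("hdfs://...")          // base RDD of lines
  val counts = lines.flatMap(_.split("\\s+"))     // transformation: split lines into words
                    .map(word => (word, 1))       // transformation: (word, 1) pairs
                    .reduceByKey(_ + _)           // transformation: sum the counts per word
  counts.collect().foreach(println)               // action: ships the results to the driver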
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.

  lines = spark.textFile("hdfs://...")            // base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()

  cachedMsgs.filter(_.contains("foo")).count      // action
  cachedMsgs.filter(_.contains("bar")).count
  . . .

[Diagram: the Driver ships tasks to the Workers; each Worker reads its block of the file (Block 1-3), keeps the filtered messages in its cache (Cache 1-3), and returns results to the Driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:
  messages = textFile(...).filter(_.startsWith("ERROR"))
                          .map(_.split('\t')(2))

[Diagram: lineage graph HDFS File → filter → Filtered RDD → map → Mapped RDD.]
Example: Logistic Regression

Goal: find the best line separating two sets of points.

[Diagram: a random initial line is iteratively adjusted toward the target separating line.]
Example: Logistic Regression

  val data = spark.textFile(...).map(readPoint).cache()   // keep "data" in memory across iterations
  var w = Vector.random(D)                                 // random initial parameter vector

  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }
  println("Final w: " + w)
Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1 to 30) for Hadoop and Spark. Hadoop takes about 127 s per iteration; Spark takes 174 s for the first iteration and about 6 s for each further iteration.]
Spark Programming Interface (e.g. PageRank)

Representing RDDs

Spark Scheduler

• Dryad-like DAGs
• Pipelines functions within a stage
• Cache-aware work reuse & locality
• Partitioning-aware to avoid shuffles

[Diagram: a job DAG split into stages; shaded boxes mark cached data partitions.]
Behavior with Not Enough RAM

[Chart: iteration time (s) vs. % of the working set in memory. Cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s.]
Fault Recovery Results

[Chart: iteration time (s) for iterations 1-10, comparing no failure against a failure in the 6th iteration. Typical iterations take about 56-59 s; the iteration affected by the failure rises to about 81 s while lost RDD partitions are recomputed from lineage, after which times return to normal.]
Conclusion

• Both MapReduce and Spark are excellent big data frameworks: scalable, fault-tolerant, and programming-friendly.
• In particular, Spark provides a more efficient approach for iterative computing jobs.
QUESTIONS?