Spark - Apache Mesos

Spark
A framework for iterative and
interactive cluster computing
Matei Zaharia, Mosharaf Chowdhury,
Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley
Outline
Background: Nexus project
Spark goals
Programming model
Example jobs
Implementation
Interactive Spark
Nexus Background
Rapid innovation in cluster computing frameworks
[Figure: logos of recent cluster computing frameworks, e.g. Apache Hama, Pig, Pregel, Dryad]
Problem
Rapid innovation in cluster computing frameworks
No single framework optimal for all applications
Want to run multiple frameworks in a single cluster
» …to maximize utilization
» …to share data between frameworks
» …to isolate workloads
Solution
Nexus is an “operating system” for the cluster
over which diverse frameworks can run
» Nexus multiplexes resources between frameworks
» Frameworks control job execution
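As a rough illustration of that split (a hypothetical sketch, not Nexus's actual API), a framework-side scheduler might only have to answer "what do you want to run on the resources you are offered?", while Nexus decides who gets offered what:

  // Hypothetical sketch only: Nexus multiplexes resources across frameworks;
  // each framework's own scheduler decides what to launch on them.
  case class TaskDescription(taskId: Int, command: String)

  trait FrameworkScheduler {
    // Called when Nexus offers resources on a slave; return a task to launch, or None to decline.
    def resourceOffer(slaveId: String, cpus: Int, memMb: Int): Option[TaskDescription]
    // Called when a launched task finishes, so the framework can schedule follow-up work.
    def taskFinished(taskId: Int): Unit
  }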
Nexus Architecture
[Diagram: two Hadoop jobs (v19 and v20) and an MPI job. Each framework's scheduler (Hadoop v19, Hadoop v20, MPI) registers with the Nexus master; Nexus slaves host the frameworks' executors and run their tasks.]
Nexus Status
Prototype in 7000 lines of C++
Ported frameworks:
» Hadoop (900 line patch)
» MPI (160 line wrapper scripts)
New frameworks:
» Spark, Scala framework for iterative jobs & more
» Apache+haproxy, elastic web server farm (200 lines)
Outline
Background: Nexus project
Spark goals
Programming model
Example jobs
Implementation
Interactive Spark
Spark Goals
Support iterative jobs
» Machine learning researchers in our lab identified this
as a workload that Hadoop doesn’t perform well on
Experiment with programmability
» Leverage Scala to integrate cleanly into programs
» Support interactive use from Scala interpreter
Retain MapReduce’s fine-grained fault-tolerance
Programming Model
Distributed datasets
» HDFS files, “parallelized” Scala collections
» Can be transformed with map and filter
» Can be cached across parallel operations
Parallel operations
» Foreach, reduce, collect
Shared variables
» Accumulators (add-only)
» Broadcast variables (read-only)
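A minimal sketch tying these pieces together (illustrative only; interesting and loadLookupTable are hypothetical helpers, while hdfsTextFile, parallelize, cache, accumulator, broadcast, foreach, and value follow the usage shown in the examples later in this deck):

  // Distributed datasets
  val lines = spark.hdfsTextFile("hdfs://...")                  // dataset backed by an HDFS file
  val nums  = spark.parallelize(1 to 10000, numSlices)          // "parallelized" Scala collection
  val good  = lines.filter(interesting).cache()                 // transformed, cached across operations

  // Shared variables
  val count = spark.accumulator(0)                              // add-only
  val table = spark.broadcast(loadLookupTable())                // read-only, shipped once

  // Parallel operation
  good.foreach(line => if (table.value.contains(line)) count += 1)
  println("Matches: " + count.value)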
Example 1:
Logistic Regression
Logistic Regression
Goal: find best line separating two sets of points
[Figure: two classes of points, the random initial line, and the target separating line]
Serial Version
val data = readData(...)
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- data) {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Final w: " + w)
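Written as an equation (my reading of the loop above; the slides give only the code, and no explicit step size appears), each iteration performs the batch gradient update

  w \leftarrow w - \sum_{p \in \text{data}} \left( \frac{1}{1 + e^{-y_p (w \cdot x_p)}} - 1 \right) y_p \, x_p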
Spark Version
val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  for (p <- data) {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Final w: " + w)
Spark Version
val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  data.foreach(p => {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  })
  w -= gradient.value
}
println("Final w: " + w)
Functional Programming Version
val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  w -= data.map(p => {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    scale * p.x
  }).reduce(_+_)
}
println("Final w: " + w)
Job Execution
[Diagram: Spark execution. The master updates the parameter and sends it to slaves 1-4; each slave holds a slice (R1-R4) of the big dataset and sends aggregated results back to the master.]
Job Execution
[Diagram: the same job in Hadoop / Dryad vs. Spark. In Hadoop / Dryad, every iteration launches a new set of map tasks (Map 1-8) and reduces, with the master aggregating results and updating the parameter each time; in Spark, the slaves keep the cached slices R1-R4 across iterations, so only the parameter and the aggregated results move.]
Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for each further iteration.]
Example 2:
Alternating Least Squares
Collaborative Filtering
Predict movie ratings for a set of users based on
their past ratings
[Figure: the ratings matrix R (movies × users), with a handful of known ratings (values 1-5) and many unknown entries shown as "?"]
Matrix Factorization
Model R as product of user and movie matrices
A and B of dimensions U×K and M×K
[Diagram: R factored as the product of A and Bᵀ]
Problem: given subset of R, optimize A and B
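Stated as an optimization problem (my formulation, not from the slides; regularization omitted), with a_i and b_j denoting rows of A and B, the goal is to minimize the squared error over the observed entries:

  \min_{A,B} \sum_{(i,j)\ \text{observed in } R} \left( R_{ij} - a_i^{\top} b_j \right)^2

Fixing one factor turns this into a set of independent least-squares problems in the other, which is exactly what each half-step of the algorithm on the next slide solves.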
Alternating Least Squares
Algorithm
Start with random A and B
Repeat:
1. Fixing B, optimize A to minimize error on scores in R
2. Fixing A, optimize B to minimize error on scores in R
Serial ALS
val R = readRatingsMatrix(...)
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))
for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}
Naïve Spark ALS
val R = readRatingsMatrix(...)
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))
for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
       .map(i => updateUser(i, B, R))
       .collect()
  B = spark.parallelize(0 until M, numSlices)
       .map(i => updateMovie(i, A, R))
       .collect()
}
Problem: R is re-sent to all nodes in each parallel operation
Efficient Spark ALS
val R = spark.broadcast(readRatingsMatrix(...))
var A = (0 until U).map(i => Vector.random(K))
var B = (0 until M).map(i => Vector.random(K))
for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
       .map(i => updateUser(i, B, R.value))
       .collect()
  B = spark.parallelize(0 until M, numSlices)
       .map(i => updateMovie(i, A, R.value))
       .collect()
}
Solution: mark R as a broadcast variable
ALS Performance
[Chart: iteration duration (s) for the first iteration vs. subsequent iterations, across cluster configurations from 4 cores (1 node) to 60 cores (8 nodes).]
Subseq. Iteration Breakdown
[Chart: time within a subsequent iteration (s), broken down into computation and broadcast, for 4 cores (1 node) to 60 cores (8 nodes).]
36% of each iteration is spent on broadcast
Outline
Background: Nexus project
Spark goals
Programming model
Example jobs
Implementation
Interactive Spark
Architecture
Driver program connects to Nexus and schedules tasks
Workers run tasks, report results and variable updates
Data shared with HDFS/NFS
No communication between workers for now
[Diagram: the driver sends user code and broadcast variables to the workers via Nexus, ships tasks and collects results; each worker keeps a local cache and shares data with the driver through HDFS.]
Distributed Datasets
Each distributed dataset object maintains a lineage that
is used to rebuild slices that are lost / fall out of cache
Ex:
  errors = textFile("log").filter(_.contains("error"))
                          .map(_.split('\t')(1))
                          .cache()
[Diagram: lineage chain HdfsFile (path: hdfs://…) → FilteredFile (func: contains(...)) → MappedFile (func: split(…)) → CachedFile; getIterator(slice) serves a slice from the local cache, recomputing it from HDFS through the chain when it is missing.]
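A toy sketch of the lineage idea (illustrative only; these are not Spark's actual classes, and the HDFS reader is faked with a local file): each dataset object remembers its parent and how to recompute one slice, so a lost or evicted slice can be rebuilt by walking the chain.

  trait Dataset[T] {
    def getIterator(slice: Int): Iterator[T]   // recompute (or fetch) one slice on demand
  }

  class HdfsTextFile(path: String) extends Dataset[String] {
    // Stand-in for a real HDFS slice reader
    def getIterator(slice: Int): Iterator[String] =
      scala.io.Source.fromFile(path + "/part-" + slice).getLines()
  }

  class FilteredDataset[T](parent: Dataset[T], f: T => Boolean) extends Dataset[T] {
    def getIterator(slice: Int): Iterator[T] = parent.getIterator(slice).filter(f)
  }

  class MappedDataset[T, U](parent: Dataset[T], f: T => U) extends Dataset[U] {
    def getIterator(slice: Int): Iterator[U] = parent.getIterator(slice).map(f)
  }

  class CachedDataset[T](parent: Dataset[T]) extends Dataset[T] {
    private val cache = scala.collection.mutable.Map[Int, Seq[T]]()
    // Serve from the local cache; on a miss, recompute from the parent lineage
    def getIterator(slice: Int): Iterator[T] =
      cache.getOrElseUpdate(slice, parent.getIterator(slice).toSeq).iterator
  }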
Language Integration
Scala closures are Serializable objects
» Serialize on driver, load & run on workers
Not quite enough
» Nested closures may reference entire outer scope
» May pull in non-Serializable variables not used inside
» Solution: bytecode analysis + reflection
Shared variables
» Accumulators: serialized form contains ID
» Broadcast vars: serialized form is path to HDFS file
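A small illustration of the capture problem described above (hypothetical example, not code from Spark): referencing a field inside a closure captures the enclosing object, which can drag non-Serializable state along with it.

  class Driver {
    val logFile   = new java.io.FileWriter("driver.log")   // not Serializable
    val threshold = 10

    def run(data: Seq[Int]): Seq[Int] =
      // The closure only needs `threshold`, but because `threshold` is a field,
      // naive serialization would try to ship `this` -- including `logFile` --
      // to the workers. The bytecode analysis mentioned above avoids that.
      data.map(x => x + threshold)
  }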
Interactive Spark
Modified Scala interpreter to allow Spark to be
used interactively from the command line
Required two changes:
» Modified wrapper code generation so that each “line”
typed has references to objects for its dependencies
» Placed generated classes in a distributed filesystem
Enables in-memory exploration of big data
Demo
Conclusions
Spark provides two abstractions that enable
iterative jobs and interactive use:
1. Distributed datasets with controllable persistence,
supporting fault-tolerant parallel operations
2. Shared variables for efficient broadcast and imperative
style programming
Language integration achieved using Scala
features + some amount of hacking
All this is surprisingly little code (~1600 lines)
Related Work
DryadLINQ
» SQL-like queries integrated in C# programs
» Build queries through operations on lazy datasets
» Cannot have a dataset persist across queries
» No concept of shared variables for broadcast, etc.
Pig & Hive
» Query languages that can call into Java/Python/etc UDFs
» No support for caching a dataset across queries
OpenMP
» Compiler extension for parallel loops in C++
» Annotate variables as read-only or accumulator above loop
» Cluster version exists, but not fault-tolerant
Future Work
Open-source Spark and Nexus
» Probably this summer
» Very interested in getting users!
Understand which classes of algorithms we can
handle and how to extend Spark for others
Build higher-level interfaces on top of
interactive Spark (e.g. R, SQL)
Questions?