Spark – An Introduction

Berkeley Data Analytics Stack
Prof. Harold Liu
15 December 2014
Data Processing Goals
• Low latency (interactive) queries on historical data: enable faster decisions
  – E.g., identify why a site is slow and fix it
• Low latency queries on live data (streaming): enable decisions on real-time data
  – E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 s)
• Sophisticated data processing: enable "better" decisions
  – E.g., anomaly detection, trend analysis
Today’s Open Analytics Stack…
• …mostly focused on large on-disk datasets: great for batch but slow
[Stack diagram: Application → Data Processing → Storage → Infrastructure]
Goals
• One stack to rule them all: batch, interactive, and streaming
✓ Easy to combine batch, streaming, and interactive computations
✓ Easy to develop sophisticated algorithms
✓ Compatible with existing open source ecosystem (Hadoop/HDFS)
Support Interactive and Streaming Comp.
• Aggressive use of memory
• Why?
  1. Memory transfer rates >> disk or SSDs
  2. Many datasets already fit into memory
     – Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
     – E.g., 1 TB = 1 billion records @ 1 KB each
  3. Memory density (still) grows with Moore's law
     – RAM/SSD hybrid memories on the horizon
[Figure: a high-end datacenter node — 16 cores; 128–512 GB RAM @ 40–60 GB/s; 1–4 TB SSD @ 14 GB/s (×4 disks); 10–30 TB HDD @ 0.2–1 GB/s (×10 disks); 10 Gbps network]
Support Interactive and Streaming Comp.
• Increase parallelism
• Why?
  – Reduce work per node → improve latency
• Techniques:
  – Low-latency parallel scheduler that achieves high locality
  – Optimized parallel communication patterns (e.g., shuffle, broadcast)
  – Efficient recovery from failures and straggler mitigation
[Figure: the same query returns its result in time T on few nodes, and in Tnew (< T) with more parallelism]
Support Interactive and Streaming Comp.
• Trade between result accuracy and response time
• Why?
  – In-memory processing does not guarantee interactive query processing
    • E.g., ~10s of seconds just to scan 512 GB of RAM!
    • The gap between memory capacity and transfer rate is increasing
• Challenges:
  – accurately estimate error and running time for…
  – …arbitrary computations
[Figure: per node, memory capacity (128–512 GB) doubles every 18 months, while memory bandwidth (40–60 GB/s) doubles every 36 months]
Berkeley Data Analytics Stack (BDAS)
• New apps: AMP-Genomics, Carat, …
[Stack layers: Application → Data Processing → Data Management → Resource Management → Infrastructure]
• Data Processing: in-memory processing; trade between time, quality, and cost
• Data Management: efficient data sharing across frameworks
• Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
Berkeley AMPLab
• "Launched" January 2011: 6-year plan
  – 8 CS faculty
  – ~40 students
  – 3 software engineers
• Organized for collaboration around Algorithms, Machines, and People
Berkeley AMPLab
• Funding:
  – XData, CISE Expedition Grant
  – industrial founding sponsors
  – 18 other sponsors
• Goal: next generation of analytics data stack for industry & research:
  – Berkeley Data Analytics Stack (BDAS)
  – release as open source
Berkeley Data Analytics Stack (BDAS)
• Existing stack components…
[Stack diagram — Data Processing: HIVE, Pig, HBase, Storm, MPI, …, Hadoop; Data Mgmnt.: HDFS; Resource Mgmnt.: (not yet filled)]
Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with the existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
[Stack diagram — Data Processing: HIVE, Pig, HBase, Storm, MPI, …, Hadoop; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
Spark
• In-memory framework for interactive and iterative computations
  – Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction
• Scala interface, Java and Python APIs
[Stack diagram — Data Processing: HIVE, Pig, Spark, Storm, MPI, …, Hadoop; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
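To make this concrete, a minimal sketch of the cache-once, query-many pattern the slide describes (the HDFS path is hypothetical; sc is an already-created SparkContext):

val logs = sc.textFile("hdfs://.../logs").cache()       // marked for in-memory reuse
val errors = logs.filter(_.contains("ERROR")).count()   // first action loads and caches
val warnings = logs.filter(_.contains("WARN")).count()  // later passes run at memory speed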
Spark Streaming [Alpha Release]
• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark → unifies batch, interactive, and streaming computations!
[Stack diagram — Data Processing: Spark Streaming atop Spark, plus HIVE, Pig, Storm, MPI, …, Hadoop; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
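A minimal sketch of the model in the Scala API of that era — a one-second micro-batch word count over a socket source (host and port are hypothetical):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

// One-second micro-batches; master URL and app name as in the era's API.
val ssc = new StreamingContext("local[2]", "WordCount", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)    // hypothetical source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                         // runs on every batch
ssc.start()
ssc.awaitTermination()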
Shark → Spark SQL
• HIVE over Spark: SQL-like interface (supports Hive 0.9)
  – up to 100x faster for in-memory data, and 5–10x for on-disk data
  – in tests on a cluster of hundreds of nodes at …
[Stack diagram — Data Processing: Spark Streaming and Shark atop Spark, plus HIVE, Pig, Storm, MPI, …, Hadoop; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
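As a sketch of the direction the title names, here is the Spark SQL API that later replaced Shark (the logs table and its schema are hypothetical; CACHE TABLE is real Spark SQL syntax):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
sqlContext.sql("CACHE TABLE logs")    // keep the (hypothetical) table in memory
val errors = sqlContext.sql(
  "SELECT status, COUNT(*) FROM logs WHERE status >= 500 GROUP BY status")
errors.collect().foreach(println)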
Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Support for Spark and Hadoop
[Stack diagram — Data Processing: Spark Streaming and Shark atop Spark, plus HIVE, Pig, Storm, MPI, …, Hadoop; Data Mgmnt.: Tachyon over HDFS; Resource Mgmnt.: Mesos]
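Because Tachyon exposes the HDFS interface, Spark can read from it by URI scheme alone; a sketch (master host, port, and path are hypothetical):

// The same textFile call used for HDFS; only the URI scheme changes.
val rdd = sc.textFile("tachyon://master:19998/logs/part-00000")
rdd.count()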
BlinkDB
• Large-scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook
[Stack diagram — Data Processing: Spark Streaming, BlinkDB, and Shark atop Spark, plus HIVE, Pig, Storm, MPI, …, Hadoop; Data Mgmnt.: Tachyon over HDFS; Resource Mgmnt.: Mesos]
SparkGraph
• GraphLab API and toolkits on top of Spark
• Fault tolerance by leveraging Spark
[Stack diagram — Data Processing: Spark Streaming, SparkGraph, BlinkDB, and Shark atop Spark, plus HIVE, Pig, Storm, MPI, …, Hadoop; Data Mgmnt.: Tachyon over HDFS; Resource Mgmnt.: Mesos]
MLlib
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts
[Stack diagram — Data Processing: Spark Streaming, SparkGraph, MLbase, BlinkDB, and Shark atop Spark, plus HIVE, Pig, Storm, MPI, …, Hadoop; Data Mgmnt.: Tachyon over HDFS; Resource Mgmnt.: Mesos]
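A minimal MLlib sketch in the spirit of these goals — k-means over cached feature vectors (the input path and file format are hypothetical):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs://.../points.txt")              // hypothetical input
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                                                     // iterative algorithm: caching pays off
val model = KMeans.train(points, 10, 20)                       // k = 10, 20 iterations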
Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible
[Diagram annotations — SparkGraph: GraphLab API; Shark: Hive interface and shell; Tachyon: HDFS API; Mesos: compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos]
Compatible with Open Source Ecosystem
• Use existing interfaces whenever possible
[Diagram annotations — Spark Streaming: accepts inputs from Kafka, Flume, Twitter, TCP sockets, …; Shark: supports the Hive API; Spark/Hadoop layer: supports the HDFS API, S3 API, and Hive metadata]
Summary
• Support interactive and streaming computations
  – In-memory, fault-tolerant storage abstraction, low-latency scheduling, …
• Easy to combine batch, streaming, and interactive computations
  – The Spark execution engine supports all computation models
  [Diagram: Spark at the center of batch, interactive, and streaming]
• Easy to develop sophisticated algorithms
  – Scala interface, APIs for Java, Python, Hive QL, …
  – New frameworks targeted at graph-based and ML algorithms
• Compatible with the existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality software
  – Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
Spark
In-Memory Cluster Computing for
Iterative and Interactive Applications
UC Berkeley
Background
• Commodity clusters have become an important computing
platform for a variety of applications
– In industry: search, machine translation, ad targeting, …
– In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce
power many of these apps
• Theme of this work: provide similarly powerful abstractions
for a broader class of applications
Motivation
Current popular programming models for
clusters transform data flowing from stable
storage to stable storage
e.g., MapReduce:
[Diagram: Input → Map tasks → Reduce tasks → Output]
Motivation
• Acyclic data flow is a powerful abstraction, but is
not efficient for applications that repeatedly reuse a
working set of data:
– Iterative algorithms (many in machine learning)
– Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to
efficiently support these apps
Spark Goal
• Provide distributed memory abstractions for clusters to
support apps with working sets
• Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability
Solution: augment data flow model with
“resilient distributed datasets” (RDDs)
Programming Model
• Resilient distributed datasets (RDDs)
– Immutable collections partitioned across cluster that
can be rebuilt if a partition is lost
– Created by transforming data in stable storage using
data flow operators (map, filter, group-by, …)
– Can be cached across parallel operations
• Parallel operations on RDDs
– Reduce, collect, count, save, …
• Restricted shared variables (a sketch follows this list)
  – Accumulators, broadcast variables
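A minimal sketch of the restricted shared variables named above, with spark as the context object (as in the next slide's code; the stop-word set and path are hypothetical):

val words = spark.textFile("hdfs://...").flatMap(_.split(" "))
val stop = spark.broadcast(Set("the", "a", "of"))   // read-only copy shipped to workers
val skipped = spark.accumulator(0)                  // workers add to it; the driver reads it

val kept = words.filter { w =>
  val keep = !stop.value.contains(w)
  if (!keep) skipped += 1                           // count the filtered-out words
  keep
}.count()
println("kept " + kept + ", skipped " + skipped.value)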
Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                   // cached RDD

cachedMsgs.filter(_.contains("foo")).count      // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker scans its input block (Block 1–3), keeps the filtered messages in memory (Cache 1–3), and returns results to the driver]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
RDDs in More Detail
• An RDD is an immutable, partitioned, logical collection
of records
– Need not be materialized, but rather contains
information to rebuild a dataset from stable storage
• Partitioning can be based on a key in each record, using hash or range partitioning (see the sketch below)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse
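A short sketch of key-based partitioning using the real HashPartitioner class (the example pairs are made up):

import org.apache.spark.SparkContext._            // pair-RDD operations (pre-1.3 style)
import org.apache.spark.HashPartitioner

val pairs = spark.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val parted = pairs.partitionBy(new HashPartitioner(4)).cache()  // co-locates equal keys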
RDD Operations
Transformations (define a new RDD):
  map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Parallel operations / actions (return a result to the driver):
  reduce, collect, count, save, lookupKey, …
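A short sketch contrasting the two columns above — transformations are lazy and define new RDDs, while actions trigger a job (the values are made up):

import org.apache.spark.SparkContext._    // pair-RDD operations (pre-1.3 style)

val nums = spark.parallelize(Seq(1, 2, 3, 4, 5))
val pairs = nums.map(n => (n % 2, n))     // transformation: lazy, defines a new RDD
val grouped = pairs.groupByKey()          // transformation: still nothing has run
val total = nums.reduce(_ + _)            // action: runs a job, returns 15 to the driver
val rows = grouped.collect()              // action: materializes the grouped data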
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to
reconstruct lost partitions
• e.g.:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
[Lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD]
Example 1: Logistic Regression
• Goal: find best line separating two sets of points
[Figure: two point classes in the plane; a random initial line is iteratively adjusted toward the target separator]
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
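The map/reduce pair above is exactly the batch gradient step of logistic regression: the map expression is each point's gradient contribution, and summed by the reduce, each iteration performs

$$ w \;\leftarrow\; w \;-\; \sum_{p \in \text{data}} \left( \frac{1}{1 + e^{-y_p (w \cdot x_p)}} - 1 \right) y_p \, x_p $$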
Logistic Regression Performance
[Chart: Hadoop — 127 s per iteration; Spark — 174 s for the first iteration (loading and caching the data), 6 s per further iteration]
Example 2: MapReduce
• MapReduce data flow can be expressed using RDD
transformations
res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map { case (key, vals) => myReduceFunc(key, vals) }

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map { case (key, value) => myReduceFunc(key, value) }
Other Spark Applications
• Twitter spam classification (Justin Ma)
• EM alg. for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating Least Squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)
Conclusion
• By making distributed datasets a first-class primitive,
Spark provides a simple, efficient programming model for
stateful data analytics
• RDDs provide:
– Lineage info for fault recovery and debugging
– Adjustable in-memory caching
– Locality-aware parallel operations
• We plan to make Spark the basis of a suite of batch
and interactive data analysis tools
Download