Berkeley Data Analytics Stack
Prof. Harold Liu
15 December 2014

Data Processing Goals
• Low latency (interactive) queries on historical data: enable faster decisions
  – E.g., identify why a site is slow and fix it
• Low latency queries on live data (streaming): enable decisions on real-time data
  – E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 s)
• Sophisticated data processing: enable "better" decisions
  – E.g., anomaly detection, trend analysis

Today's Open Analytics Stack…
• …is mostly focused on large on-disk datasets: great for batch, but slow
[Diagram: generic stack of Application / Data Processing / Storage / Infrastructure layers]

Goals
• One stack to rule them all: batch, interactive, and streaming
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with the existing open source ecosystem (Hadoop/HDFS)

Support Interactive and Streaming Computations
• Aggressive use of memory
• Why?
  1. Memory transfer rates >> disk or SSDs
  2. Many datasets already fit into memory
     • Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
     • E.g., 1 TB = 1 billion records @ 1 KB each
  3. Memory density (still) grows with Moore's law
     • RAM/SSD hybrid memories are on the horizon
[Diagram: high-end datacenter node — 16 cores, 128-512 GB RAM @ 40-60 GB/s, 10-30 TB disk @ 0.2-1 GB/s (x10 disks), 1-4 TB SSD @ 1-4 GB/s (x4 SSDs), 10 Gbps network]

Support Interactive and Streaming Computations
• Increase parallelism
• Why?
  – Reduce work per node to improve latency (result returned in Tnew < T)
• Techniques:
  – Low-latency parallel scheduler that achieves high locality
  – Optimized parallel communication patterns (e.g., shuffle, broadcast)
  – Efficient recovery from failures and straggler mitigation

Support Interactive and Streaming Computations
• Trade result accuracy for response time
• Why?
  – In-memory processing does not guarantee interactive query processing
    • E.g., ~10 s just to scan 512 GB of RAM!
    • The gap between memory capacity and transfer rate keeps growing: capacity doubles every 18 months, bandwidth only every 36 months
    • (A back-of-envelope scan-time sketch follows the Mesos slide below)
• Challenges:
  – Accurately estimate error and running time for arbitrary computations

Berkeley Data Analytics Stack (BDAS)
• New apps: AMP-Genomics, Carat, …
• Data Processing layer: in-memory processing; trade between time, quality, and cost
• Data Management layer: efficient data sharing across frameworks
• Resource Management layer: share infrastructure across frameworks (multi-programming for datacenters)

Berkeley AMPLab [Algorithms, Machines, People]
• "Launched" January 2011: 6-year plan
  – 8 CS faculty
  – ~40 students
  – 3 software engineers
• Organized for collaboration

Berkeley AMPLab
• Funding:
  – XData, CISE Expedition Grant
  – Industrial founding sponsors
  – 18 other sponsors, including …
• Goal: next-generation analytics data stack for industry & research:
  – Berkeley Data Analytics Stack (BDAS)
  – Released as open source

Berkeley Data Analytics Stack (BDAS)
• Existing stack components…
[Diagram: HIVE, Pig, HBase, Storm, MPI, … in the data processing layer; Hadoop HDFS in the data management layer; a resource management layer underneath]

Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with the existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
[Diagram: Mesos as the resource management layer under HIVE, Pig, HBase, Storm, MPI, Hadoop, and HDFS]
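Aside: Scan-Time Arithmetic (illustrative)
The "~10 s to scan 512 GB of RAM" claim above is simple arithmetic. Below is a minimal Scala sketch, assuming the node parameters from the hardware slide (512 GB of resident data, ~50 GB/s aggregate memory bandwidth); the object name and the 1% sampling figure are illustrative, not from the deck.

    // Back-of-envelope check of the scan-time argument.
    // Assumed numbers come from the "high-end datacenter node" slide.
    object ScanTime {
      def secondsToScan(bytes: Double, bytesPerSec: Double): Double =
        bytes / bytesPerSec

      def main(args: Array[String]): Unit = {
        val ram       = 512e9 // 512 GB of in-memory data
        val bandwidth = 50e9  // midpoint of the quoted 40-60 GB/s

        // Full in-memory scan: ~10 s, i.e., not interactive.
        println(f"Full scan:      ${secondsToScan(ram, bandwidth)}%.1f s")

        // Scanning a 1% sample trades a bounded error for ~100x lower latency,
        // which is the motivation for BlinkDB-style approximate queries.
        println(f"1%% sample scan: ${secondsToScan(ram * 0.01, bandwidth)}%.2f s")
      }
    }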
Spark
• In-memory framework for interactive and iterative computations
  – Resilient Distributed Dataset (RDD): a fault-tolerant, in-memory storage abstraction
• Scala interface; Java and Python APIs
[Stack diagram: Spark joins HIVE, Pig, Storm, MPI, … in the data processing layer, over Hadoop HDFS and Mesos]

Spark Streaming [alpha release]
• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark: unifies batch, interactive, and streaming computations!
[Stack diagram: Spark Streaming added on top of Spark]

Shark (Spark SQL)
• Hive over Spark: SQL-like interface (supports Hive 0.9)
  – Up to 100x faster for in-memory data, and 5-10x for on-disk data
  – In tests on a cluster of hundreds of nodes
[Stack diagram: Shark added on top of Spark]

Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Supports Spark and Hadoop
[Stack diagram: Tachyon added to the data management layer, alongside HDFS]

BlinkDB
• Large-scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook
[Stack diagram: BlinkDB added on top of Shark and Spark]

SparkGraph
• GraphLab API and toolkits on top of Spark
• Fault tolerance by leveraging Spark
[Stack diagram: SparkGraph added on top of Spark]

MLlib
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts
[Stack diagram: MLbase added on top of Spark]

Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible
[Stack diagram annotations: SparkGraph exposes the GraphLab API; Shark exposes the Hive interface and shell; Tachyon is API-compatible with HDFS; Mesos provides a compatibility layer so Hadoop, Storm, MPI, etc. can run over it]

Compatible with Open Source Ecosystem
• Use existing interfaces whenever possible
[Stack diagram annotations: Spark Streaming accepts inputs from Kafka, Flume, Twitter, TCP sockets, …; Shark supports the Hive API; Tachyon supports the HDFS API, the S3 API, and Hive metadata]

Summary
• Support interactive and streaming computations
  – In-memory, fault-tolerant storage abstraction; low-latency scheduling; …
• Easy to combine batch, streaming, and interactive computations
  – The Spark execution engine supports all three computation models
• Easy to develop sophisticated algorithms
  – Scala interface; APIs for Java, Python, Hive QL, …
  – New frameworks targeted at graph-based and ML algorithms
• Compatible with the existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality software
  – Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
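Aside: Spark Streaming Word Count (illustrative sketch)
To make the "one stack" point in the Summary concrete: a minimal Spark Streaming word count written against the public StreamingContext API. The application name, host, and port are placeholders; this sketch is not taken from the deck.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount")
        val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

        // DStream of text lines read from a TCP socket (placeholder host/port).
        val lines = ssc.socketTextStream("localhost", 9999)

        // The same map/reduce-style operators as batch Spark, applied per micro-batch.
        val counts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because DStream operations are executed as RDD operations on each micro-batch, the same coding style carries over between batch, interactive, and streaming use.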
Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
UC Berkeley

Background
• Commodity clusters have become an important computing platform for a variety of applications
  – In industry: search, machine translation, ad targeting, …
  – In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce power many of these apps
• Theme of this work: provide similarly powerful abstractions for a broader class of applications

Motivation
• Current popular programming models for clusters transform data flowing from stable storage to stable storage
• E.g., MapReduce:
[Diagram: Input → Map tasks → Reduce tasks → Output]

Motivation
• Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
  – Iterative algorithms (many in machine learning)
  – Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to efficiently support these apps

Spark Goal
• Provide distributed memory abstractions for clusters to support apps with working sets
• Retain the attractive properties of MapReduce:
  – Fault tolerance (for crashes & stragglers)
  – Data locality
  – Scalability
• Solution: augment the data flow model with "resilient distributed datasets" (RDDs)

Programming Model
• Resilient distributed datasets (RDDs)
  – Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
  – Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
  – Can be cached across parallel operations
• Parallel operations on RDDs
  – Reduce, collect, count, save, …
• Restricted shared variables
  – Accumulators, broadcast variables

Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")            // base RDD
    errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()                   // cached RDD

    cachedMsgs.filter(_.contains("foo")).count      // parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Diagram: the driver ships tasks to workers; each worker caches its block of messages (Cache 1-3) and returns results]
• Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

RDDs in More Detail
• An RDD is an immutable, partitioned, logical collection of records
  – It need not be materialized; rather, it contains the information needed to rebuild the dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse

RDD Operations
• Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
• Parallel operations / actions (return a result to the driver): reduce, collect, count, save, lookupKey, …

RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions
• E.g.:

    cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
                              .cache()

  Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
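Aside: Runnable Log-Mining Sketch
A self-contained version of the log-mining pipeline whose lineage is shown above; a sketch only. The slides' "spark" handle is written here as an explicit SparkContext, and the HDFS path and application name are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

        // Same lineage as above: textFile -> filter -> map -> cache
        val lines      = sc.textFile("hdfs://namenode:8020/logs/*") // placeholder path
        val errors     = lines.filter(_.startsWith("ERROR"))
        val messages   = errors.map(_.split('\t')(2))
        val cachedMsgs = messages.cache()

        // Each action reuses the cached partitions; if a partition is lost,
        // Spark recomputes it from the lineage instead of replicating the data.
        println(cachedMsgs.filter(_.contains("foo")).count())
        println(cachedMsgs.filter(_.contains("bar")).count())

        sc.stop()
      }
    }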
Example 1: Logistic Regression
• Goal: find the best line separating two sets of points
[Diagram: a random initial line converging toward the target separator]

Logistic Regression Code

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)

    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Final w: " + w)

Logistic Regression Performance
[Chart: 127 s per iteration for the on-disk (Hadoop) implementation vs. Spark's 174 s for the first iteration and 6 s for each further iteration]

Example 2: MapReduce
• A MapReduce data flow can be expressed using RDD transformations:

    res = data.flatMap(rec => myMapFunc(rec))
              .groupByKey()
              .map((key, vals) => myReduceFunc(key, vals))

• Or, with combiners:

    res = data.flatMap(rec => myMapFunc(rec))
              .reduceByKey(myCombiner)
              .map((key, val) => myReduceFunc(key, val))

• (A concrete word-count instance of this pattern is sketched after the Conclusion)

Example 3: Other Spark Applications
• Twitter spam classification (Justin Ma)
• EM algorithm for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating least squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)

Conclusion
• By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics
• RDDs provide:
  – Lineage information for fault recovery and debugging
  – Adjustable in-memory caching
  – Locality-aware parallel operations
• We plan to make Spark the basis of a suite of batch and interactive data analysis tools
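Aside: Word Count via RDD Transformations (sketch)
The word-count instance referenced in Example 2: a hypothetical, minimal example of expressing a MapReduce job with RDD transformations. The input path and application name are placeholders; this is not code from the deck.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        val data = sc.textFile("hdfs://namenode:8020/input/*") // placeholder path

        // "Map" phase: emit (word, 1) pairs.
        // "Reduce with combiner" phase: reduceByKey sums counts per key,
        // combining locally on each node before the shuffle.
        val counts = data.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }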