Berkeley Data Analytics Stack
Prof. Harold Liu
15 December 2014

Data Processing Goals
• Low latency (interactive) queries on historical data: enable faster decisions
  – E.g., identify why a site is slow and fix it
• Low latency queries on live data (streaming): enable decisions on real-time data
  – E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 s)
• Sophisticated data processing: enable "better" decisions
  – E.g., anomaly detection, trend analysis

Today's Open Analytics Stack…
• …is mostly focused on large on-disk datasets: great for batch, but slow
[Diagram: generic stack of Application / Data Processing / Storage / Infrastructure layers]

Goals
• One stack to rule them all: batch, interactive, and streaming
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with the existing open source ecosystem (Hadoop/HDFS)

Support Interactive and Streaming Computations
• Aggressive use of memory
• Why?
  1. Memory transfer rates >> disk or SSDs
  2. Many datasets already fit into memory
     • Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
     • E.g., 1 TB = 1 billion records @ 1 KB each
  3. Memory density (still) grows with Moore's law
     • RAM/SSD hybrid memories are on the horizon
[Diagram: high-end datacenter node — 16 cores, 128-512 GB RAM @ 40-60 GB/s, 10-30 TB disk @ 0.2-1 GB/s (x10 disks), 1-4 TB SSD @ 1-4 GB/s (x4 SSDs), 10 Gbps network]

Support Interactive and Streaming Computations
• Increase parallelism
• Why?
  – Reduce work per node to improve latency (result returned in Tnew < T)
• Techniques:
  – Low-latency parallel scheduler that achieves high locality
  – Optimized parallel communication patterns (e.g., shuffle, broadcast)
  – Efficient recovery from failures and straggler mitigation

Support Interactive and Streaming Computations
• Trade result accuracy for response time
• Why?
  – In-memory processing does not guarantee interactive query processing
    • E.g., ~10 s just to scan 512 GB of RAM!
    • The gap between memory capacity and transfer rate keeps growing: capacity doubles every 18 months, bandwidth only every 36 months
    • (A back-of-envelope scan-time sketch follows the Mesos slide below)
• Challenges:
  – Accurately estimate error and running time for arbitrary computations

Berkeley Data Analytics Stack (BDAS)
• New apps: AMP-Genomics, Carat, …
• Data Processing layer: in-memory processing; trade between time, quality, and cost
• Data Management layer: efficient data sharing across frameworks
• Resource Management layer: share infrastructure across frameworks (multi-programming for datacenters)

Berkeley AMPLab [Algorithms, Machines, People]
• "Launched" January 2011: 6-year plan
  – 8 CS faculty
  – ~40 students
  – 3 software engineers
• Organized for collaboration

Berkeley AMPLab
• Funding:
  – XData, CISE Expedition Grant
  – Industrial founding sponsors
  – 18 other sponsors, including …
• Goal: next-generation analytics data stack for industry & research:
  – Berkeley Data Analytics Stack (BDAS)
  – Released as open source

Berkeley Data Analytics Stack (BDAS)
• Existing stack components…
[Diagram: HIVE, Pig, HBase, Storm, MPI, … in the data processing layer; Hadoop HDFS in the data management layer; a resource management layer underneath]

Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with the existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
[Diagram: Mesos as the resource management layer under HIVE, Pig, HBase, Storm, MPI, Hadoop, and HDFS]
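Aside: Scan-Time Arithmetic (illustrative)
The "~10 s to scan 512 GB of RAM" claim above is simple arithmetic. Below is a minimal Scala sketch, assuming the node parameters from the hardware slide (512 GB of resident data, ~50 GB/s aggregate memory bandwidth); the object name and the 1% sampling figure are illustrative, not from the deck.

    // Back-of-envelope check of the scan-time argument.
    // Assumed numbers come from the "high-end datacenter node" slide.
    object ScanTime {
      def secondsToScan(bytes: Double, bytesPerSec: Double): Double =
        bytes / bytesPerSec

      def main(args: Array[String]): Unit = {
        val ram       = 512e9 // 512 GB of in-memory data
        val bandwidth = 50e9  // midpoint of the quoted 40-60 GB/s

        // Full in-memory scan: ~10 s, i.e., not interactive.
        println(f"Full scan:      ${secondsToScan(ram, bandwidth)}%.1f s")

        // Scanning a 1% sample trades a bounded error for ~100x lower latency,
        // which is the motivation for BlinkDB-style approximate queries.
        println(f"1%% sample scan: ${secondsToScan(ram * 0.01, bandwidth)}%.2f s")
      }
    }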
Spark
• In-memory framework for interactive and iterative computations
  – Resilient Distributed Dataset (RDD): a fault-tolerant, in-memory storage abstraction
• Scala interface; Java and Python APIs
[Stack diagram: Spark joins HIVE, Pig, Storm, MPI, … in the data processing layer, over Hadoop HDFS and Mesos]

Spark Streaming [alpha release]
• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark: unifies batch, interactive, and streaming computations!
[Stack diagram: Spark Streaming added on top of Spark]

Shark (Spark SQL)
• Hive over Spark: SQL-like interface (supports Hive 0.9)
  – Up to 100x faster for in-memory data, and 5-10x for on-disk data
  – In tests on a cluster of hundreds of nodes
[Stack diagram: Shark added on top of Spark]

Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Supports Spark and Hadoop
[Stack diagram: Tachyon added to the data management layer, alongside HDFS]

BlinkDB
• Large-scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook
[Stack diagram: BlinkDB added on top of Shark and Spark]

SparkGraph
• GraphLab API and toolkits on top of Spark
• Fault tolerance by leveraging Spark
[Stack diagram: SparkGraph added on top of Spark]

MLlib
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts
[Stack diagram: MLbase added on top of Spark]

Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible
[Stack diagram annotations: SparkGraph exposes the GraphLab API; Shark exposes the Hive interface and shell; Tachyon is API-compatible with HDFS; Mesos provides a compatibility layer so Hadoop, Storm, MPI, etc. can run over it]

Compatible with Open Source Ecosystem
• Use existing interfaces whenever possible
[Stack diagram annotations: Spark Streaming accepts inputs from Kafka, Flume, Twitter, TCP sockets, …; Shark supports the Hive API; Tachyon supports the HDFS API, the S3 API, and Hive metadata]

Summary
• Support interactive and streaming computations
  – In-memory, fault-tolerant storage abstraction; low-latency scheduling; …
• Easy to combine batch, streaming, and interactive computations
  – The Spark execution engine supports all three computation models
• Easy to develop sophisticated algorithms
  – Scala interface; APIs for Java, Python, Hive QL, …
  – New frameworks targeted at graph-based and ML algorithms
• Compatible with the existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality software
  – Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
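Aside: Spark Streaming Word Count (illustrative sketch)
To make the "one stack" point in the Summary concrete: a minimal Spark Streaming word count written against the public StreamingContext API. The application name, host, and port are placeholders; this sketch is not taken from the deck.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount")
        val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

        // DStream of text lines read from a TCP socket (placeholder host/port).
        val lines = ssc.socketTextStream("localhost", 9999)

        // The same map/reduce-style operators as batch Spark, applied per micro-batch.
        val counts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because DStream operations are executed as RDD operations on each micro-batch, the same coding style carries over between batch, interactive, and streaming use.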
Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
UC Berkeley

Background
• Commodity clusters have become an important computing platform for a variety of applications
  – In industry: search, machine translation, ad targeting, …
  – In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce power many of these apps
• Theme of this work: provide similarly powerful abstractions for a broader class of applications

Motivation
• Current popular programming models for clusters transform data flowing from stable storage to stable storage
• E.g., MapReduce:
[Diagram: Input → Map tasks → Reduce tasks → Output]

Motivation
• Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
  – Iterative algorithms (many in machine learning)
  – Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to efficiently support these apps

Spark Goal
• Provide distributed memory abstractions for clusters to support apps with working sets
• Retain the attractive properties of MapReduce:
  – Fault tolerance (for crashes & stragglers)
  – Data locality
  – Scalability
• Solution: augment the data flow model with "resilient distributed datasets" (RDDs)

Programming Model
• Resilient distributed datasets (RDDs)
  – Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
  – Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
  – Can be cached across parallel operations
• Parallel operations on RDDs
  – Reduce, collect, count, save, …
• Restricted shared variables
  – Accumulators, broadcast variables

Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")            // base RDD
    errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()                   // cached RDD

    cachedMsgs.filter(_.contains("foo")).count      // parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Diagram: the driver ships tasks to workers; each worker caches its block of messages (Cache 1-3) and returns results]
• Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

RDDs in More Detail
• An RDD is an immutable, partitioned, logical collection of records
  – It need not be materialized; rather, it contains the information needed to rebuild the dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse

RDD Operations
• Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
• Parallel operations / actions (return a result to the driver): reduce, collect, count, save, lookupKey, …

RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions
• E.g.:

    cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
                              .cache()

  Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
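Aside: Runnable Log-Mining Sketch
A self-contained version of the log-mining pipeline whose lineage is shown above; a sketch only. The slides' "spark" handle is written here as an explicit SparkContext, and the HDFS path and application name are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

        // Same lineage as above: textFile -> filter -> map -> cache
        val lines      = sc.textFile("hdfs://namenode:8020/logs/*") // placeholder path
        val errors     = lines.filter(_.startsWith("ERROR"))
        val messages   = errors.map(_.split('\t')(2))
        val cachedMsgs = messages.cache()

        // Each action reuses the cached partitions; if a partition is lost,
        // Spark recomputes it from the lineage instead of replicating the data.
        println(cachedMsgs.filter(_.contains("foo")).count())
        println(cachedMsgs.filter(_.contains("bar")).count())

        sc.stop()
      }
    }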
Example 1: Logistic Regression
• Goal: find the best line separating two sets of points
[Diagram: a random initial line converging toward the target separator]

Logistic Regression Code

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)

    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Final w: " + w)

Logistic Regression Performance
[Chart: 127 s per iteration for the on-disk (Hadoop) implementation vs. Spark's 174 s for the first iteration and 6 s for each further iteration]

Example 2: MapReduce
• A MapReduce data flow can be expressed using RDD transformations:

    res = data.flatMap(rec => myMapFunc(rec))
              .groupByKey()
              .map((key, vals) => myReduceFunc(key, vals))

• Or, with combiners:

    res = data.flatMap(rec => myMapFunc(rec))
              .reduceByKey(myCombiner)
              .map((key, val) => myReduceFunc(key, val))

• (A concrete word-count instance of this pattern is sketched after the Conclusion)

Example 3: Other Spark Applications
• Twitter spam classification (Justin Ma)
• EM algorithm for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating least squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)

Conclusion
• By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics
• RDDs provide:
  – Lineage information for fault recovery and debugging
  – Adjustable in-memory caching
  – Locality-aware parallel operations
• We plan to make Spark the basis of a suite of batch and interactive data analysis tools
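Aside: Word Count via RDD Transformations (sketch)
The word-count instance referenced in Example 2: a hypothetical, minimal example of expressing a MapReduce job with RDD transformations. The input path and application name are placeholders; this is not code from the deck.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        val data = sc.textFile("hdfs://namenode:8020/input/*") // placeholder path

        // "Map" phase: emit (word, 1) pairs.
        // "Reduce with combiner" phase: reduceByKey sums counts per key,
        // combining locally on each node before the shuffle.
        val counts = data.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }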