Berkeley Data Analysis Stack (BDAS)
Mesos, Spark, Shark, Spark Streaming

Current Data Analysis Open Stack
• Layers: Application → Data Processing → Storage → Infrastructure
• Characteristics:
  – Batch processing on on-disk data.
  – Not very efficient for "interactive" and "streaming" computations.

Goal: Berkeley Data Analytics Stack (BDAS)
• New apps: AMP-Genomics, Carat, …
• Application / Data Processing: in-memory processing; trade off between time, quality, and cost.
• Data Management / Storage: efficient data sharing across frameworks.
• Resource / Infrastructure Management: share infrastructure across frameworks (multi-programming for datacenters).

BDAS Components: Mesos

Mesos
• A platform for sharing commodity clusters between diverse computing frameworks.
  B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," tech report, UC Berkeley, 2010.

Mesos
• Uses "resource offers" to publish available resources to frameworks.
• Has to deal with framework-specific constraints (without knowing the specific constraints), e.g., data locality.
• Allows the framework scheduler to reject an offer if its constraints are not met.

Mesos: Other Issues
• Resource allocation strategies: pluggable (a fair-sharing plugin is implemented); revocation.
• Isolation: existing OS isolation techniques (Linux Containers).
• Fault tolerance:
  – Master: standby master nodes coordinated via ZooKeeper.
  – Slaves: task/slave failures are reported to the framework, which handles them.
  – Framework scheduler failure: replicate the scheduler.
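To make the offer/reject interaction concrete, here is a minimal Scala sketch of the control flow a framework scheduler might implement. Offer, TaskSpec, and LocalityScheduler are hypothetical stand-ins, not the real Mesos API; they only illustrate how a framework can enforce a data-locality constraint that Mesos itself never sees, simply by declining unsuitable offers.

  // Hypothetical types standing in for Mesos offers/tasks; NOT the real API.
  case class Offer(id: String, host: String, cpus: Double, memMB: Int)
  case class TaskSpec(inputHosts: Set[String], cpus: Double, memMB: Int)

  class LocalityScheduler(private var pending: List[TaskSpec]) {
    /** Invoked (hypothetically) by the master with a batch of resource offers.
      * Returns the (offer, task) pairs to launch; all other offers are
      * implicitly declined, so the master can re-offer them to other frameworks. */
    def resourceOffers(offers: Seq[Offer]): Seq[(Offer, TaskSpec)] = {
      val launched = scala.collection.mutable.Buffer.empty[(Offer, TaskSpec)]
      for (offer <- offers) {
        pending.find { t =>
          t.inputHosts.contains(offer.host) &&   // data-locality constraint
          t.cpus <= offer.cpus && t.memMB <= offer.memMB
        } match {
          case Some(task) =>
            pending = pending.filterNot(_ eq task) // accept: schedule the task
            launched += ((offer, task))
          case None => ()                          // reject: constraint not met
        }
      }
      launched.toSeq
    }
  }

The point of the design is visible in the return value: the scheduler never tells the master *why* it declined an offer, so Mesos stays framework-agnostic while each framework keeps full control over placement.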
Spark
• Current popular programming models for clusters transform data flowing from stable storage to stable storage.
• Spark: in-memory cluster computing for iterative and interactive applications.
• Assumptions:
  – The same working data set is used across iterations or across a number of interactive queries.
  – Commodity cluster.
  – Local data partitions fit in memory.

Spark
• Acyclic data flows are a powerful abstraction,
• but they are not efficient for iterative/interactive applications that repeatedly use the same working data set.
• Solution: augment the data flow model with in-memory "resilient distributed datasets" (RDDs).

RDDs
• An RDD is an immutable, partitioned, logical collection of records.
  – It need not be materialized; it carries the information needed to rebuild the dataset from stable storage (lazy loading and lineage).
  – A lost partition can be rebuilt ("transform once, read many").
• Partitioning can be based on a key in each record (using hash or range partitioning).
• Created by transforming data in stable storage using data flow operators (map, filter, group-by, …).
• Can be cached for future reuse.

Generality of RDDs
• Claim: Spark's combination of data flow with RDDs unifies many proposed cluster programming models.
  – General data flow models: MapReduce, Dryad, SQL.
  – Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MapReduce), Continuous Bulk Processing.
• Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

Programming Model
• Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
• Parallel operations (return a result to the driver): reduce, collect, count, save, lookupKey, …

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

  lines = spark.textFile("hdfs://...")            // base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()                   // cached RDD

  cachedMsgs.filter(_.contains("foo")).count      // parallel operation
  cachedMsgs.filter(_.contains("bar")).count

The driver ships tasks to workers; each worker reads its block of the input and caches its partition of the messages. Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions. Example:

  cachedMsgs = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))
                            .cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD

Benefits of the RDD Model
• Consistency is easy due to immutability.
• Inexpensive fault tolerance (log the lineage rather than replicating/checkpointing the data).
• Locality-aware scheduling of tasks on partitions.
• Despite being restricted, the model seems applicable to a broad variety of applications.

Example: Logistic Regression
• Goal: find the best line separating two sets of points, starting from a random initial line and iterating toward the target.

Logistic Regression Code

  val data = spark.textFile(...).map(readPoint).cache()
  var w = Vector.random(D)
  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }
  println("Final w: " + w)

Logistic Regression Performance
• On-disk (Hadoop) baseline: 127 s per iteration.
• Spark: first iteration 174 s; further iterations 6 s each (the cached working set is served from memory).

PageRank: Scala Implementation

  val links = ... // RDD of (url, neighbors) pairs
  var ranks = ... // RDD of (url, rank) pairs
  for (i <- 1 to ITERATIONS) {
    val contribs = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }
    ranks = contribs.reduceByKey(_ + _)
                    .mapValues(0.15 + 0.85 * _)
  }
  ranks.saveAsTextFile(...)
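To complement the slide snippets above, here is a minimal, self-contained sketch of the same programming model against the standard Spark Scala API. The input path and master URL are placeholders. It shows the core behavior the slides describe: transformations are lazy and only record lineage, the first parallel operation triggers computation, and cache() keeps the result in memory for later operations.

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal RDD sketch: lazy transformations, then actions over a cached RDD.
  object RddSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

      val lines  = sc.textFile("hdfs://...")                // base RDD (lazy)
      val errors = lines.filter(_.startsWith("ERROR"))      // transformation (lazy)
      val words  = errors.flatMap(_.split("\\s+"))          // transformation (lazy)
      val counts = words.map(w => (w, 1))
                        .reduceByKey(_ + _)                 // still lazy
                        .cache()                            // mark for in-memory reuse

      println(counts.count())         // action 1: runs the lineage, fills the cache
      counts.take(5).foreach(println) // action 2: served from the cached RDD
      sc.stop()
    }
  }

If a worker holding cached partitions fails, the second action simply re-runs the recorded lineage (textFile → filter → flatMap → map → reduceByKey) for the lost partitions, which is the fault-tolerance behavior described above.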
Spark Summary
• Fast, expressive cluster computing system compatible with Apache Hadoop.
  – Works with any Hadoop-supported storage system (HDFS, S3, Avro, …).
• Improves efficiency (up to 100× faster) through:
  – in-memory computing primitives;
  – general computation graphs.
• Improves usability (often 2-10× less code) through:
  – rich APIs in Java, Scala, and Python;
  – an interactive shell.

Spark Streaming
• Framework for large-scale stream processing.
  – Scales to 100s of nodes.
  – Can achieve second-scale latencies.
  – Integrates with Spark's batch and interactive processing.
  – Provides a simple batch-like API for implementing complex algorithms.
  – Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

Requirements
• Scalable to large clusters.
• Second-scale latencies.
• Simple programming model.
• Integrated with batch & interactive processing.

Stateful Stream Processing
• Traditional streaming systems have an event-driven, record-at-a-time processing model:
  – each node has mutable state;
  – for each input record, a node updates its state and sends new records downstream.
• State is lost if a node dies! Making stateful stream processing fault-tolerant is challenging.

Discretized Stream Processing
• Run a streaming computation as a series of very small, deterministic batch jobs:
  – chop up the live data stream into batches of X seconds;
  – Spark treats each batch of data as an RDD and processes it using RDD operations;
  – the processed results of the RDD operations are returned in batches.

Example 1 – Get hashtags from Twitter
(A self-contained, runnable variant of this pipeline is sketched after the Key Concepts list below.)

  val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

• DStream: a sequence of RDDs representing a stream of data, here fed by the Twitter Streaming API. Each batch (@ t, @ t+1, @ t+2, …) is stored in memory as an RDD (immutable, distributed).

  val hashTags = tweets.flatMap(status => getTags(status))

• Transformation: modify the data in one DStream to create another DStream. New RDDs (e.g., [#cat, #dog, …]) are created for every batch; flatMap is applied batch by batch.

  hashTags.saveAsHadoopFiles("hdfs://...")

• Output operation: push data to external storage; every batch is saved to HDFS.

Fault Tolerance
• RDDs remember the sequence of operations that created them from the original fault-tolerant input data.
• Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant.
• Data lost due to a worker failure can be recomputed: lost partitions of the hashTags RDD are rebuilt on other workers by re-applying flatMap to the replicated tweets RDD.

Key Concepts
• DStream – sequence of RDDs representing a stream of data.
  – Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets.
• Transformations – modify data from one DStream to another.
  – Standard RDD operations: map, countByValue, reduce, join, …
  – Stateful operations: window, countByValueAndWindow, …
• Output operations – send data to an external entity.
  – saveAsHadoopFiles – saves to HDFS.
  – foreach – do anything with each batch of results.
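The sketch below ties the key concepts together end to end using the standard Spark Streaming Scala API. It substitutes a TCP socket source for the Twitter source shown on the slides (the twitterStream call lives in an external module), getTags is a hypothetical tag extractor, and the host, port, and checkpoint path are placeholders; the checkpoint directory is required by the windowed stateful operation.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Minimal DStream sketch: source -> transformation -> stateful window -> output.
  object HashTagSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("hashtag-sketch").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(1)) // chop stream into 1 s batches
      ssc.checkpoint("hdfs://.../checkpoints")          // required by windowed state ops

      // Socket source standing in for the Twitter stream; one message per line (assumed).
      val lines = ssc.socketTextStream("localhost", 9999)

      // Hypothetical tag extractor: words beginning with '#'.
      def getTags(status: String): Seq[String] =
        status.split("\\s+").filter(_.startsWith("#")).toSeq

      val hashTags = lines.flatMap(getTags)             // transformation: new RDDs per batch
      // Stateful window op: counts per tag over the last 60 s, updated every 10 s.
      val counts = hashTags.countByValueAndWindow(Seconds(60), Seconds(10))
      counts.print()                                    // output op: print a few results per batch

      ssc.start()            // start the receiver and the batch scheduler
      ssc.awaitTermination() // run until stopped
    }
  }

Because each 1-second batch is an ordinary RDD, the window count inherits RDD fault tolerance: a lost partition is recomputed from the replicated input batches via the recorded lineage, rather than from per-node mutable state.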
Performance: Comparison with Storm and S4
• Higher per-node throughput than Storm:
  – Spark Streaming: 670k records/second/node.
  – Storm: 115k records/second/node.
  – Apache S4: 7.5k records/second/node.
• [Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for the Grep and WordCount benchmarks; Spark is well above Storm in both.]