Comp6611 Course Lecture: Big Data Applications
Yang PENG, Network and System Lab, CSE, HKUST
Monday, March 11, 2013   ypengab@cse.ust.hk
Material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under the Creative Commons Attribution 3.0 License)

Today's Topics
• MapReduce
  - Background information / overview
  - Map and Reduce: from a programmer's perspective
  - Architecture and workflow: a global overview
  - Improvement
  - Virtues and defects
• Spark

Background
Before MapReduce, large-scale data processing was difficult:
• Managing parallelization and distribution
• Data storage and distribution
• Application development that is tedious and hard to debug
• Resource scheduling and load balancing
• Distributed file systems: "Moving computation is cheaper than moving data."
• Fault/crash tolerance
• Scalability
Does the "divide and conquer" paradigm still work for big data?
[Figure: the input is partitioned into w1, w2, w3 and handed to three workers; the partial results are then combined into the final result.]

Programming Model
Opportunity: design a software abstraction that undertakes the divide and conquer and reduces programmers' workload for
• resource management
• task scheduling
• distributed synchronization and communication
Functional programming, which has a long history, provides higher-order functions that support divide and conquer:
• Map: do something to everything in a list
• Fold: combine the results of a list in some way
[Figure: applications sit on top of the abstraction, which runs across many computers.]

Map
Map is a higher-order function. How map works:
• The function is applied to every element in a list
• The result is a new list
[Figure: f applied independently to each element of the input list.]

Fold
Fold is also a higher-order function. How fold works:
• The accumulator is set to an initial value
• The function is applied to a list element and the accumulator
• The result is stored in the accumulator
• This is repeated for every item in the list
• The result is the final value in the accumulator
[Figure: f folded over the list from the initial value to the final value.]

Map/Fold in Action
Simple map example:
  (map (lambda (x) (* x x)) '(1 2 3 4 5))
  → '(1 4 9 16 25)
Fold examples:
  (fold + 0 '(1 2 3 4 5))   → 15
  (fold * 1 '(1 2 3 4 5))   → 120
Sum of squares:
  (define (sum-of-squares v)
    (fold + 0 (map (lambda (x) (* x x)) v)))
  (sum-of-squares '(1 2 3 4 5))   → 55

MapReduce
Programmers specify two functions:
  map (k1, v1) → list(k2, v2)
  reduce (k2, list(v2)) → list(v2)
An implementation of WordCount -- it's just divide and conquer!
  function map(String name, String document):
    // K1 name: document name
    // V1 document: document contents
    for each word w in document:
      emit(w, 1)

  function reduce(String word, Iterator partialCounts):
    // K2 word: a word
    // list(V2) partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
      sum += ParseInt(pc)
    emit(word, sum)
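To make the WordCount pseudocode above concrete, here is a minimal local sketch written with plain Scala collections rather than Hadoop; the object name WordCountSketch and the sample documents are illustrative only.

  // A single-machine sketch of WordCount: a "map" phase that emits (word, 1)
  // pairs, a grouping step that plays the role of the barrier, and a "reduce"
  // phase that sums the partial counts for each word.
  object WordCountSketch {
    def mapPhase(documents: Seq[String]): Seq[(String, Int)] =
      documents.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(w => (w, 1))

    def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
      pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }

    def main(args: Array[String]): Unit = {
      val documents = Seq("the quick brown fox", "the lazy dog jumps over the fox")
      println(reducePhase(mapPhase(documents)))   // e.g. Map(the -> 3, fox -> 2, ...)
    }
  }

On a cluster, the same two functions run in parallel over many input splits, and the grouping step becomes the shuffle between map and reduce tasks described in the following slides.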
Behind the scenes…
[Figure: the MapReduce data flow. Initial key/value pairs are read from the data store and handed to the map tasks; each map task emits intermediate (k1, values…), (k2, values…), (k3, values…) pairs; a barrier aggregates the values by key; the reduce tasks then produce the final values for k1, k2, and k3.]

Programming interface
• input reader
• Map function
• partition function: given a key and the number of reducers, it returns the index of the desired reducer; chosen for load balance, e.g. a hash function
• compare function: used to sort the map output, which provides the ordering guarantee on reduce input
• Reduce function
• output writer
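As a concrete illustration of the partition function described above, here is a minimal sketch of a hash-based partitioner; it mirrors the common "hash of the key modulo the number of reducers" scheme, and the class name and the choice of 24 reducers are illustrative only.

  // Given an intermediate key and the number of reduce tasks, return the
  // index of the reducer that should receive that key.
  class HashPartitionSketch(numReducers: Int) {
    def partition(key: String): Int =
      // mask off the sign bit so the index always falls in [0, numReducers)
      (key.hashCode & Integer.MAX_VALUE) % numReducers
  }

  object HashPartitionSketch {
    def main(args: Array[String]): Unit = {
      val p = new HashPartitionSketch(24)   // e.g. a job with 24 reduce tasks
      for (w <- Seq("the", "quick", "brown", "fox"))
        println(s"$w -> reducer ${p.partition(w)}")
    }
  }

Because every map task applies the same function, all occurrences of a given key end up at the same reducer, while a reasonable hash spreads keys roughly evenly for load balance.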
Output of a Hadoop job
ypeng@vm115:~/hadoop-0.20.2$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/hduser/wordcount/15G-enwiki-input /user/hduser/wordcount/15G-enwiki-output
13/01/16 07:00:48 INFO input.FileInputFormat: Total input paths to process : 1
13/01/16 07:00:49 INFO mapred.JobClient: Running job: job_201301160607_0003
13/01/16 07:00:50 INFO mapred.JobClient: map 0% reduce 0%
.........................
13/01/16 07:01:50 INFO mapred.JobClient: map 18% reduce 0%
13/01/16 07:01:52 INFO mapred.JobClient: map 19% reduce 0%
13/01/16 07:02:06 INFO mapred.JobClient: map 20% reduce 0%
13/01/16 07:02:08 INFO mapred.JobClient: map 20% reduce 1%
13/01/16 07:02:10 INFO mapred.JobClient: map 20% reduce 2%
.........................
13/01/16 07:06:41 INFO mapred.JobClient: map 99% reduce 32%
13/01/16 07:06:47 INFO mapred.JobClient: map 100% reduce 33%
13/01/16 07:06:55 INFO mapred.JobClient: map 100% reduce 39%
.........................
13/01/16 07:07:21 INFO mapred.JobClient: map 100% reduce 99%
13/01/16 07:07:31 INFO mapred.JobClient: map 100% reduce 100%
13/01/16 07:07:43 INFO mapred.JobClient: Job complete: job_201301160607_0003
The console output reports the job's progress; the counters printed at completion follow on the next slide.

Counters in a Hadoop job
13/01/16 07:07:43 INFO mapred.JobClient: Counters: 18
13/01/16 07:07:43 INFO mapred.JobClient:   Job Counters
13/01/16 07:07:43 INFO mapred.JobClient:     Launched reduce tasks=24
13/01/16 07:07:43 INFO mapred.JobClient:     Rack-local map tasks=17
13/01/16 07:07:43 INFO mapred.JobClient:     Launched map tasks=249
13/01/16 07:07:43 INFO mapred.JobClient:     Data-local map tasks=203
13/01/16 07:07:43 INFO mapred.JobClient:   FileSystemCounters
13/01/16 07:07:43 INFO mapred.JobClient:     FILE_BYTES_READ=12023025990
13/01/16 07:07:43 INFO mapred.JobClient:     HDFS_BYTES_READ=15492905740
13/01/16 07:07:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=14330761040
13/01/16 07:07:43 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=752814339
13/01/16 07:07:43 INFO mapred.JobClient:   Map-Reduce Framework
13/01/16 07:07:43 INFO mapred.JobClient:     Reduce input groups=39698527
13/01/16 07:07:43 INFO mapred.JobClient:     Combine output records=508662829
13/01/16 07:07:43 INFO mapred.JobClient:     Map input records=279422018
13/01/16 07:07:43 INFO mapred.JobClient:     Reduce shuffle bytes=2647359503
13/01/16 07:07:43 INFO mapred.JobClient:     Reduce output records=39698527
13/01/16 07:07:43 INFO mapred.JobClient:     Spilled Records=828280813
13/01/16 07:07:43 INFO mapred.JobClient:     Map output bytes=24932976267
13/01/16 07:07:43 INFO mapred.JobClient:     Combine input records=2813475352
13/01/16 07:07:43 INFO mapred.JobClient:     Map output records=2376465967
13/01/16 07:07:43 INFO mapred.JobClient:     Reduce input records=71653444
A summary of the counters collected for the job.

Master in MapReduce
Resource management
• Maintains the current resource usage of each Worker (CPU, RAM, used and free disk space, etc.)
• Checks each Worker for failure periodically
Task scheduling
• "Moving computation is cheaper than moving data."
• Map and reduce tasks are assigned to idle Workers
• Tasks on failed Workers are re-scheduled
• When the job is close to the end, the Master launches backup tasks
Counters
• provide interactive job progress
• store the occurrences of various events
• are helpful for performance tuning

Data-oriented Map scheduling
[Figure: five input splits are replicated across workers on two racks. The Master launches map 1 on Worker 3, map 2 on Worker 4, map 3 on Worker 1, map 4 on Worker 2, and map 5 on Worker 5, so that each map task runs on or near a worker that stores its input split.]
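The data-locality preference behind this scheduling can be sketched as follows: when a worker asks for work, first look for a pending split stored on that worker, then for one on the same rack, and only then hand out a non-local split. The names here (Split, LocalityScheduler, the worker and rack labels) are illustrative, not the actual Hadoop/MapReduce implementation.

  // One pending map task per input split; hosts are the workers storing a replica.
  case class Split(id: Int, hosts: Set[String], rack: String)

  class LocalityScheduler(private var pending: List[Split]) {
    // Pick a split for the requesting worker: data-local, then rack-local, then any.
    def assign(worker: String, rack: String): Option[Split] = {
      val chosen = pending.find(_.hosts.contains(worker))   // data-local
        .orElse(pending.find(_.rack == rack))                // rack-local
        .orElse(pending.headOption)                          // non-local fallback
      chosen.foreach(s => pending = pending.filterNot(_ == s))
      chosen
    }
  }

  object LocalityScheduler {
    def main(args: Array[String]): Unit = {
      val scheduler = new LocalityScheduler(List(
        Split(1, Set("worker3"), "rack1"),
        Split(2, Set("worker4"), "rack2"),
        Split(3, Set("worker1"), "rack1")))
      println(scheduler.assign("worker3", "rack1"))   // Some(Split(1,...)): data-local
      println(scheduler.assign("worker6", "rack2"))   // Some(Split(2,...)): rack-local
      println(scheduler.assign("worker5", "rack2"))   // Some(Split(3,...)): non-local
    }
  }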
Data flow in MapReduce jobs
[Figure: each Mapper reads a local, rack-local, or non-local split; map output goes to an in-memory circular buffer, is spilled to disk, optionally passed through a Combiner, and merged into intermediate files on disk; each Reducer then fetches its partition from every Mapper.]

Map internal
The map phase reads the task's input split from GFS, parses it into records (key/value pairs), and applies the map function to each record. After the map function has been applied to every record, the commit phase registers the final output with the Master, which will tell the reduce tasks the location of the map output.

Reduce internal
The shuffle phase fetches the reduce task's input data. The sort phase groups records with the same key together. The reduce phase applies the user-defined reduce function to each key and the corresponding list of values.

Backup Tasks
There are barriers in a MapReduce job:
• No reduce function executes until all maps finish.
• The job cannot complete until all reduces finish.
[Figure: all map tasks must finish before the reduce tasks run, and all reduce tasks must finish before the job completes.]
The execution time of a job is severely lengthened if a single task is blocked, so when the job is close to the end the Master schedules backup (speculative) tasks for the unfinished ones. A job can take 44% longer if backup tasks are disabled.

Virtues and defects of MR
Virtues
• Designed for large-scale data
• Programming friendly
• Implicit parallelism
• Data locality
• Fault/crash tolerance
• Scalability
• Open source with a good ecosystem [1]
Defects
• Bad for iterative ML algorithms
• Not sure what else
[1] http://docs.hortonworks.com/CURRENT/index.htm#About_Hortonworks_Data_Platform/Understanding_Hadoop_Ecosystem.htm

Network traffic in MapReduce
1. A map task may read its split from a remote ChunkServer.
2. Reduce tasks copy the output of the map tasks.
3. The reduce output is written to GFS.
[Figure: the three kinds of network traffic marked on the MapReduce data flow.]

Disk R/W in MapReduce
1. A ChunkServer reads a local block when a remote split is fetched.
2. Map tasks spill intermediate results to disk.
3. Reduce tasks write the copied partitions to local disk.
4. The reduce output is written to the local ChunkServer.
5. The reduce output is also written to remote ChunkServers (replication).
[Figure: the five kinds of disk reads and writes marked on the MapReduce data flow.]

Iterative MapReduce
Performing a graph algorithm using MapReduce: each iteration runs as a separate MapReduce job, so intermediate results are written to and re-read from stable storage between iterations.

Motivation of Spark
• Iterative algorithms (machine learning, graphs)
• Interactive data mining tools (R, Excel, Python)

Programming Model
• Fine-grained: the computed output of every iteration is distributed and stored in stable storage.
• Coarse-grained: only the transformations used to build a dataset are logged, i.e. its lineage.

Resilient Distributed Datasets (RDDs)
• Immutable, partitioned collections of objects
• Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
• Can be cached for efficient reuse
Actions on RDDs: count, reduce, collect, save, …

Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
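To make the split between transformations and actions concrete, here is a minimal runnable sketch against the standard Spark API (assuming Spark is on the classpath and a local master is used; the object name and sample data are illustrative). Transformations only define a new RDD; nothing executes until an action asks for a result.

  import org.apache.spark.{SparkConf, SparkContext}

  object RddBasicsSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-basics"))

      val nums    = sc.parallelize(1 to 1000000)     // base RDD from a local collection
      val squares = nums.map(x => x.toLong * x)      // transformation: lazily defines a new RDD
      val evens   = squares.filter(_ % 2 == 0)       // transformation: still nothing has run
      evens.cache()                                  // mark the RDD for in-memory reuse

      println(evens.count())                         // action: triggers the computation
      println(evens.reduce(_ + _))                   // action: reuses the cached partitions

      sc.stop()
    }
  }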
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.
  lines = spark.textFile("hdfs://...")               // base RDD
  errors = lines.filter(_.startsWith("ERROR"))       // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()
  cachedMsgs.filter(_.contains("foo")).count         // action
  cachedMsgs.filter(_.contains("bar")).count
[Figure: the driver ships tasks to the workers; each worker reads its block of the file, keeps the filtered messages in its cache, and returns results to the driver.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions.
Ex:
  messages = textFile(...).filter(_.startsWith("ERROR"))
                          .map(_.split('\t')(2))
[Figure: lineage graph -- HDFS file → filter (func = _.contains(...)) → filtered RDD → map (func = _.split(...)) → mapped RDD.]

Example: Logistic Regression
Goal: find the best line separating two sets of points.
[Figure: the separating line moves from a random initial line toward the target over successive iterations.]
  val data = spark.textFile(...).map(readPoint).cache()   // keep "data" in memory
  var w = Vector.random(D)
  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }
  println("Final w: " + w)

Logistic Regression Performance
[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for each further iteration.]

Spark Programming Interface (e.g. PageRank)
Representing RDDs
[Figures for these two slides are omitted.]

Spark Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Cache-aware work reuse and locality
• Partitioning-aware to avoid shuffles
[Figure: a DAG of stages; shaded boxes mark cached data partitions.]

Behavior with Not Enough RAM
[Figure: iteration time (s) vs. the fraction of the working set that fits in memory: 68.8 s with the cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, and 11.5 s when fully cached.]

Fault Recovery Results
[Figure: iteration time (s) over 10 iterations with and without a failure. Iterations normally take roughly 57-59 s; with a failure in the 6th iteration, that iteration rises to about 81 s while the lost partitions are rebuilt from lineage, after which times return to normal.]

Conclusion
Both MapReduce and Spark are excellent big data systems: scalable, fault tolerant, and programming friendly. Spark, in particular, provides a more effective approach for iterative computing jobs.

QUESTIONS?