Distributed Cluster Computing Platforms

Outline
◦ What is the purpose of Data Intensive Super Computing (DISC)?
◦ MapReduce
◦ Pregel
◦ Dryad
◦ Spark/Shark
◦ Distributed Graph Computing

Why DISC
DISC stands for Data Intensive Super Computing.
A lot of applications:
◦ scientific data, web search engines, social networks
◦ economics, GIS
New data are continuously generated, and people want to understand them. Big-data analysis is now considered a very important method for scientific research.

What features are required of a platform for DISC?
◦ Application specific: it is very difficult, or even impossible, to construct one system that fits all workloads. One example is the POSIX-compatible file system. Each system should be re-configured or even re-designed for a specific application; think about the motivation for building the Google File System for the Google search engine.
◦ Programmer-friendly interfaces: the application programmer should not have to deal with the infrastructure, such as machines and networks.
◦ Fault tolerant: the platform should handle faulty components automatically, without any special treatment from the application.
◦ Scalability: the platform should run on top of at least thousands of machines and harness the power of all the components. Load balance should be achieved by the platform instead of the application itself.
Try to keep these four features in mind during the introduction of the concrete platforms below.

Google MapReduce
◦ Programming Model
◦ Implementation
◦ Refinements
◦ Evaluation
◦ Conclusion

Motivation: large-scale data processing
Process lots of data to produce other derived data.
◦ Input: crawled documents, web request logs, etc.
◦ Output: inverted indices, web page graph structure, top queries in a day, etc.
◦ Want to use hundreds or thousands of CPUs,
◦ but want to focus only on the functionality.
MapReduce hides the messy details in a library:
◦ Parallelization
◦ Data distribution
◦ Fault tolerance
◦ Load balancing

Motivation: large-scale data processing
◦ Want to process lots of data (> 1 TB)
◦ Want to parallelize across hundreds/thousands of CPUs
◦ Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data."
From: http://googlesystem.blogspot.com/2006/09/how-muchdata-does-google-store.html

MapReduce
◦ Automatic parallelization & distribution
◦ Fault-tolerant
◦ Provides status and monitoring tools
◦ Clean abstraction for programmers

Programming Model
Borrows from functional programming. Users implement an interface of two functions:
◦ map(in_key, in_value) -> list of (out_key, intermediate_value)
◦ reduce(out_key, list of intermediate_value) -> list of out_value

map
Records from the data source (lines of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g. (filename, line). map() produces one or more intermediate values along with an output key from the input.

reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list. reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).

Architecture
[Figure: dataflow diagram. Input key/value pairs from the data stores are fed to parallel map tasks, which emit (key, values...) groups that are routed to the reduce tasks.]
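To make this dataflow concrete, here is a minimal single-machine sketch in Python of the map, group-by-key, reduce pipeline described above. The function names (map_reduce, word_count_map, word_count_reduce) are illustrative only; this is not Google's C++ API.

from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy, single-process illustration of the MapReduce dataflow:
    a map phase, a barrier that groups intermediate values by key,
    then a reduce phase."""
    # Map phase: each input record yields (key, value) pairs.
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    # The barrier/shuffle has happened implicitly: values are grouped by key.
    # Reduce phase: one call per distinct output key.
    results = {}
    for out_key, values in intermediate.items():
        results[out_key] = reduce_fn(out_key, values)
    return results

# Word count in the style of the pseudo-code on the next slides.
def word_count_map(filename, contents):
    for word in contents.split():
        yield word, 1

def word_count_reduce(word, counts):
    return sum(counts)

if __name__ == "__main__":
    pages = [("page1", "the weather is good"),
             ("page2", "today is good"),
             ("page3", "good weather is good")]
    print(map_reduce(pages, word_count_map, word_count_reduce))
    # {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}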
== Barrier ==: aggregates intermediate values by output key.
[Figure: each (key, intermediate values) group is handed to a reduce task, which produces the final values for that key.]

Parallelism
◦ map() functions run in parallel, creating different intermediate values from different input data sets.
◦ reduce() functions also run in parallel, each working on a different output key.
◦ All values are processed independently.
◦ Bottleneck: the reduce phase can't start until the map phase is completely finished.

Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Example vs. Actual Source Code
◦ The example is written in pseudo-code.
◦ The actual implementation is in C++, using a MapReduce library; bindings for Python and Java exist via interfaces.
◦ True code is somewhat more involved (it defines how the input key/values are divided up and accessed, etc.).

Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good
Map output
◦ Worker 1: (the 1), (weather 1), (is 1), (good 1)
◦ Worker 2: (today 1), (is 1), (good 1)
◦ Worker 3: (good 1), (weather 1), (is 1), (good 1)
Reduce input
◦ Worker 1: (the 1)
◦ Worker 2: (is 1), (is 1), (is 1)
◦ Worker 3: (weather 1), (weather 1)
◦ Worker 4: (today 1)
◦ Worker 5: (good 1), (good 1), (good 1), (good 1)
Reduce output
◦ Worker 1: (the 1)
◦ Worker 2: (is 3)
◦ Worker 3: (weather 2)
◦ Worker 4: (today 1)
◦ Worker 5: (good 4)

Some Other Real Examples
◦ Term frequencies through the whole Web repository
◦ Count of URL access frequency
◦ Reverse web-link graph

Implementation Overview
Typical cluster:
◦ 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
◦ Limited bisection bandwidth
◦ Storage on local IDE disks
◦ GFS: a distributed file system manages the data (SOSP '03)
◦ Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
The implementation is a C++ library linked into user programs.

[Figures: architecture, execution, and parallel execution diagrams.]

Task Granularity and Pipelining
Fine-granularity tasks: many more map tasks than machines.
◦ Minimizes time for fault recovery
◦ Can pipeline shuffling with map execution
◦ Better dynamic load balancing
Often 200,000 map tasks and 5,000 reduce tasks are used with 2,000 machines.

Locality
Effect: thousands of machines read input at local-disk speed.
◦ The master divvies up tasks based on the location of the data: it asks GFS for the locations of the replicas of the input file blocks and tries to run map() tasks on the same machine as the physical file data, or at least on the same rack.
◦ map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks.
◦ Without this, rack switches limit the read rate.

Fault Tolerance
◦ The master detects worker failures:
  ◦ re-executes completed & in-progress map() tasks
  ◦ re-executes in-progress reduce() tasks
◦ The master notices that particular input key/values cause crashes in map(), and skips those values on re-execution.
  ◦ Effect: can work around bugs in third-party libraries!
Fault Tolerance
On worker failure:
◦ Detect failure via periodic heartbeats
◦ Re-execute completed and in-progress map tasks
◦ Re-execute in-progress reduce tasks
◦ Task completion is committed through the master
On master failure:
◦ Could be handled, but isn't yet (master failure is unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine.

Optimizations
No reduce can start until the map phase is complete:
◦ A single slow disk controller can rate-limit the whole process.
Slow workers significantly lengthen completion time:
◦ other jobs consuming resources on the machine
◦ bad disks with soft errors that transfer data very slowly
◦ weird things: processor caches disabled (!!)
The master redundantly executes "slow-moving" map tasks and uses the result of whichever copy finishes first (the first one "wins").
Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?

Optimizations
"Combiner" functions can run on the same machine as a mapper. This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth.
Under what conditions is it sound to use a combiner?

Refinements
◦ Sorting guarantees within each reduce partition
◦ Compression of intermediate data
◦ Combiner: useful for saving network bandwidth
◦ Local execution for debugging/testing
◦ User-defined counters

Performance
Tests run on a cluster of 1800 machines:
◦ 4 GB of memory
◦ Dual-processor 2 GHz Xeons with Hyper-Threading
◦ Dual 160 GB IDE disks
◦ Gigabit Ethernet per machine
◦ Bisection bandwidth approximately 100 Gbps
Two benchmarks:
◦ MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
◦ MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)

MR_Grep
Locality optimization helps:
◦ 1800 machines read 1 TB of data at a peak of ~31 GB/s
◦ Without it, rack switches would limit the read rate to 10 GB/s
Startup overhead is significant for short jobs.

MR_Sort
◦ Backup tasks reduce job completion time significantly.
◦ The system deals well with failures.
[Figure: completion-time charts for the normal run, a run with no backup tasks, and a run with 200 processes killed.]

More and more MapReduce
[Figure: number of MapReduce programs in the Google source tree over time.]
Example uses: distributed grep, distributed sort, web link-graph reversal, term vectors per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation.

Real MapReduce: Rewrite of the Production Indexing System
Rewrote Google's production indexing system using MapReduce:
◦ Set of 10, 14, 17, 21, 24 MapReduce operations
◦ The new code is simpler and easier to understand
◦ MapReduce takes care of failures and slow machines
◦ Easy to make indexing faster by adding more machines

MapReduce Conclusions
◦ MapReduce has proven to be a useful abstraction.
◦ It greatly simplifies large-scale computations at Google.
◦ The functional programming paradigm can be applied to large-scale applications.
◦ Fun to use: focus on the problem, let the library deal with the messy details.

MapReduce Programs
Sorting, searching, indexing, classification, TF-IDF, breadth-first search / SSSP, PageRank, clustering.

MapReduce for PageRank

PageRank: Random Walks Over The Web
If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page? The PageRank of a page captures this notion.
◦ More "popular" or "worthwhile" pages get a higher rank.

PageRank: Visually
[Figure: a small link graph among www.cnn.com, en.wikipedia.org and www.nytimes.com.]

PageRank: Formula
Given page A, and pages T1 through Tn linking to A, PageRank is defined as:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
C(P) is the cardinality (out-degree) of page P; d is the damping ("random URL") factor.
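As a concrete reading of the formula, here is a small sketch that computes PR(A) for one page from the current ranks and out-degrees of the pages linking to it. The function and dictionary names are illustrative assumptions, not code from the slides.

def pagerank_of(incoming_pages, rank, out_degree, d=0.85):
    """PR(A) = (1-d) + d * sum(PR(T)/C(T) for each page T linking to A).

    incoming_pages: pages T1..Tn that link to A
    rank[T]:        current PageRank PR(T)
    out_degree[T]:  number of outgoing links C(T)
    d:              damping factor (usually 0.85)
    """
    return (1 - d) + d * sum(rank[t] / out_degree[t] for t in incoming_pages)

# Example: A is linked to by T1 (rank 1.0, 2 out-links) and T2 (rank 0.5, 1 out-link).
print(pagerank_of(["T1", "T2"], {"T1": 1.0, "T2": 0.5}, {"T1": 2, "T2": 1}))
# 0.15 + 0.85 * (0.5 + 0.5) = 1.0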
PageRank: Intuition
The calculation is iterative: PR_{i+1} is based on PR_i.
◦ Each page distributes its PR_i to all pages it links to.
◦ Linkees add up their awarded rank fragments to find their PR_{i+1}.
◦ d is a tunable parameter (usually 0.85) encapsulating the "random jump factor".
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

PageRank: First Implementation
◦ Create two tables, 'current' and 'next', holding the PageRank for each page.
◦ Seed 'current' with the initial PR values.
◦ Iterate over all pages in the graph, distributing PR from 'current' into 'next' of the linkees.
◦ current := next; next := fresh_table();
◦ Go back to the iteration step, or end if converged.

Distribution of the Algorithm
Key insights allowing parallelization:
◦ The 'next' table depends on 'current', but not on any other rows of 'next'.
◦ Individual rows of the adjacency matrix can be processed in parallel.
◦ Sparse matrix rows are relatively small.

Distribution of the Algorithm
Consequences of these insights:
◦ We can map each row of 'current' to a list of PageRank "fragments" to assign to linkees.
◦ These fragments can be reduced into a single PageRank value for a page by summing.
◦ The graph representation can be even more compact: since each element is simply 0 or 1, only transmit the column numbers where it is 1.
Map step: break the PageRank into fragments to distribute to link targets.
Reduce step: add the fragments together into the next PageRank.
Iterate for the next step...

Phase 1: Parse HTML
◦ The map task takes (URL, page content) pairs and maps them to (URL, (PR_init, list-of-urls)).
  ◦ PR_init is the "seed" PageRank for the URL.
  ◦ list-of-urls contains all pages pointed to by the URL.
◦ The reduce task is just the identity function.

Phase 2: PageRank Distribution
The map task takes (URL, (cur_rank, url_list)):
◦ For each u in url_list, emit (u, cur_rank/|url_list|).
◦ Emit (URL, url_list) to carry the points-to list along through the iterations.
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Phase 2: PageRank Distribution
The reduce task gets (URL, url_list) and many (URL, val) values:
◦ Sum the vals and fix up with d.
◦ Emit (URL, (new_rank, url_list)).
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Finishing up...
A non-parallelizable component determines whether convergence has been achieved (a fixed number of iterations? comparison of key values?).
◦ If so, write out the PageRank lists. Done!
◦ Otherwise, feed the output of Phase 2 into another Phase 2 iteration.
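A minimal sketch of the Phase 2 map and reduce functions just described, in the style of the earlier word-count sketch. The record layout, the emit-by-yield convention, and the constant d are assumptions for illustration, not code from the slides.

d = 0.85  # damping factor

def phase2_map(url, value):
    """value is (cur_rank, url_list) from the previous iteration."""
    cur_rank, url_list = value
    for u in url_list:
        yield u, cur_rank / len(url_list)   # one rank fragment per linkee
    yield url, url_list                     # carry the points-to list along

def phase2_reduce(url, values):
    """values mixes rank fragments (floats) with one url_list (a list)."""
    url_list, total = [], 0.0
    for v in values:
        if isinstance(v, list):
            url_list = v
        else:
            total += v
    new_rank = (1 - d) + d * total          # fix up with the damping factor
    return new_rank, url_list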
PageRank Conclusions
◦ MapReduce isn't the greatest at iterated computation, but it still helps run the "heavy lifting".
◦ The key element in parallelization is the independence of the PageRank computations in a given step.
◦ Parallelization requires thinking about the minimum data partitions to transmit (e.g., compact representations of graph rows).
◦ Even the implementation shown today doesn't actually scale to the whole Internet, but it works for intermediate-sized graphs.
So, do you think that MapReduce is suitable for PageRank? (Homework: give concrete reasons why and why not.)

Dryad
◦ Design
◦ Implementation
◦ Policies as Plug-ins
◦ Building on Dryad

Design Space
[Figure: design-space chart spanning Internet vs. private data center and latency vs. throughput; Dryad targets throughput-oriented, data-parallel computing in a private data center, as opposed to shared-memory or latency-sensitive systems.]

Data Partitioning
[Figure: the data set is partitioned across machines because it does not fit in the RAM of any single machine.]

2-D Piping
◦ Unix pipes are 1-D: grep | sed | sort | awk | perl
◦ Dryad is 2-D: grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50 (each stage replicated across many machines)

Dryad = Execution Layer
A job (application) is to Dryad and the cluster what a pipeline is to the shell and a single machine:
job : Dryad : cluster ≈ pipeline : shell : machine

Dryad Design and Implementation

Virtualized 2-D Pipelines
[Figure sequence: the job is a 2-D DAG, runs across multiple machines, and is virtualized: the logical graph is independent of the number of physical machines.]

Dryad Job Structure
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
[Figure: input files feed a DAG of stages (grep, sed, sort, awk, perl); the vertices are processes, the edges are channels, and the final stage writes the output files.]

Channels
A channel is a finite stream of items. Implementations include:
◦ distributed filesystem files (persistent)
◦ SMB/NTFS files (temporary)
◦ TCP pipes (inter-machine)
◦ memory FIFOs (intra-machine)

Architecture
[Figure: the job manager (JM) and name server (NS) form the control plane; the data plane moves data between vertices over files, TCP, and FIFOs; a process daemon (PD) runs on each cluster machine.]

Staging
1. Build the job graph.
2. Send the .exe (JM code and vertex code) to the cluster.
3. Start the JM.
4. Query the cluster services for resources.
5. Generate the graph.
6. Initialize the vertices.
7. Serialize the vertices to the process daemons.
8. Monitor vertex execution.

Fault Tolerance
[Figure: failed vertices are re-executed from their inputs.]

Dryad Policies and Resource Management

Policy Managers
[Figure: each stage has a stage manager, each stage-to-stage connection (e.g., R to X) has a connection manager (the R-X manager), and all managers plug into the job manager.]

Duplicate Execution Manager
[Figure: when a vertex is slow, a duplicate vertex is started alongside the completed ones; duplication policy = f(running times, data volumes).]

Aggregation Manager
[Figure: a static aggregation stage is rewritten dynamically into rack-level aggregation trees, keyed by rack number.]

Data Distribution (Group By)
[Figure: m source vertices connect to n destination vertices through an m x n distribution.]

Range-Distribution Manager
[Figure: a static [0-100) range partition is refined dynamically by sampling a histogram, e.g. into [0-30) and [30-100).]

Goal: Declarative Programming
[Figure: a statically specified graph is rewritten dynamically at run time.]
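To recap the job-structure idea from the implementation slides, here is a small illustrative sketch of a job like grep^1000 | sed^500 | ... described as a DAG of stages, vertices and channels. The Python classes and the i-mod-n wiring rule are hypothetical; this is not the actual Dryad graph-building API.

from dataclasses import dataclass, field

@dataclass
class Vertex:
    program: str          # e.g. "grep", "sed"
    index: int            # replica number within its stage

@dataclass
class Stage:
    name: str
    vertices: list = field(default_factory=list)

def make_stage(program, count):
    """A stage is 'program' replicated 'count' times (grep^1000, sed^500, ...)."""
    return Stage(program, [Vertex(program, i) for i in range(count)])

def connect(src, dst, kind="file"):
    """Create channels between two stages; 'kind' could be a file, TCP pipe,
    or in-memory FIFO. Here we simply wire source vertex i to destination
    vertex i mod n and return (producer, consumer, kind) edges."""
    return [(s, dst.vertices[i % len(dst.vertices)], kind)
            for i, s in enumerate(src.vertices)]

# grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
stages = [make_stage(p, n) for p, n in
          [("grep", 1000), ("sed", 500), ("sort", 1000), ("awk", 500), ("perl", 50)]]
channels = []
for a, b in zip(stages, stages[1:]):
    channels.extend(connect(a, b))
print(len(channels), "channels across", sum(len(s.vertices) for s in stages), "vertices")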
Building on Dryad

Software Stack
[Figure: layered software stack. Applications (machine learning, legacy code such as sed/awk/grep/Perl, SQL queries, C# vector programs) run on DryadLINQ, a distributed shell, PSQL, SSIS and SQL Server; these run on Dryad; Dryad runs on the distributed filesystem (CIFS/NTFS) and the cluster services (job queueing, monitoring) of a Windows Server cluster.]

SkyServer Query 18
select distinct P.ObjID into results
from photoPrimary U, neighbors N, photoPrimary L
where U.ObjID = N.ObjID
  and L.ObjID = N.NeighborObjID
  and P.ObjID < L.ObjID
  and abs((U.u-U.g)-(L.u-L.g)) < 0.05
  and abs((U.g-U.r)-(L.g-L.r)) < 0.05
  and abs((U.r-U.i)-(L.r-L.i)) < 0.05
  and abs((U.i-U.z)-(L.i-L.z)) < 0.05
[Figure: the corresponding Dryad query plan.]

SkyServer Q18 Performance
[Figure: speed-up (times) vs. number of computers for Dryad in-memory, Dryad two-pass, and SQL Server 2005.]

DryadLINQ
◦ Declarative programming
◦ Integration with Visual Studio
◦ Integration with .NET
◦ Type safety
◦ Automatic serialization
◦ Job graph optimizations (static and dynamic)
◦ Conciseness

LINQ
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

DryadLINQ = LINQ + Dryad
The same LINQ query is compiled into vertex code and a query plan (a Dryad job) that runs over the partitioned data collection.
[Figure: the collection partitions flow through the C# vertices generated from the query to produce the results.]

Sort & Map-Reduce in DryadLINQ
[Figure: a distributed sort is built by sampling the data, computing range boundaries (e.g., [0-30) and [30-100)), redistributing, and sorting each range.]

PLINQ
public static IEnumerable<TSource> DryadSort<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    bool isDescending)
{
    return source.AsParallel().OrderBy(keySelector, comparer);
}

Machine Learning in DryadLINQ
[Figure: data analysis and machine learning are built on a large-vector library, which is built on DryadLINQ, which runs on Dryad.]

Very Large Vector Library
PartitionedVector<T>: a vector of items of type T partitioned across machines. Scalar<T>: a single value of type T.

Operations on Large Vectors
◦ Map 1: apply f: T -> U elementwise; f preserves the partitioning.
◦ Map 2 (pairwise): apply f: (T, U) -> V elementwise to two vectors.
◦ Map 3 (vector-scalar): apply f: (T, U) -> V to a vector and a scalar.
◦ Reduce (fold): combine the elements of a vector of U into a single U with a combining function f.

Linear Algebra
Vectors in R^n and R^m and matrices in R^{m x n} are built from these large-vector primitives.

Linear Regression
Data: (x_t, y_t) with x_t in R^n, y_t in R^m, for t in {1, ..., n}.
Find A in R^{m x n} such that A x_t ≈ y_t.

Analytic Solution
A = (Σ_t y_t x_t^T) (Σ_t x_t x_t^T)^{-1}
[Figure: the map step computes the outer products x_t x_t^T and y_t x_t^T on each partition X[i], Y[i]; the reduce step sums them, inverts the first sum, and multiplies to obtain A.]

Linear Regression Code
A = (Σ_t y_t x_t^T) (Σ_t x_t x_t^T)^{-1}

Vectors x = input(0), y = input(1);
Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();
Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));

Expectation Maximization (Gaussians)
◦ About 160 lines of code
◦ [Figure: clustering results, 3 iterations shown.]
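As a concrete single-machine reading of the analytic solution above, here is a small NumPy sketch that computes A in map/reduce style over partitioned data. The partition layout and function name are illustrative; this is not the actual DryadLINQ large-vector API.

import numpy as np

def linear_regression(x_parts, y_parts):
    """A = (sum_t y_t x_t^T) (sum_t x_t x_t^T)^{-1}, computed map/reduce style.

    x_parts / y_parts: lists of partitions; each partition is a list of vectors.
    """
    # Map: per-partition sums of outer products.
    xx_parts = [sum(np.outer(x, x) for x in xs) for xs in x_parts]
    yx_parts = [sum(np.outer(y, x) for x, y in zip(xs, ys))
                for xs, ys in zip(x_parts, y_parts)]
    # Reduce: sum the partial sums, then invert and multiply.
    xxs = sum(xx_parts)
    yxs = sum(yx_parts)
    return yxs @ np.linalg.inv(xxs)

# Tiny example with two partitions of 2-D inputs and 1-D outputs (true A = [2, 3]).
x_parts = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])],
           [np.array([1.0, 1.0])]]
y_parts = [[np.array([2.0]), np.array([3.0])],
           [np.array([5.0])]]
print(linear_regression(x_parts, y_parts))   # approximately [[2. 3.]]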
Conclusions (Dryad)
◦ Dryad is a distributed execution environment.
◦ Application-independent (semantics-oblivious).
◦ Supports a rich software ecosystem:
  ◦ relational algebra
  ◦ map-reduce
  ◦ LINQ
  ◦ etc.
◦ DryadLINQ = a Dryad provider for LINQ.
This is only the beginning!

Some other systems you should know about for big-data processing
◦ Hadoop: HDFS and MapReduce (open-source versions of GFS and MapReduce)
◦ Hive / Pig / Sawzall (query-language processing)
◦ Spark / Shark (efficient use of cluster memory; support for iterative MapReduce programs)

Thank you! Any questions?

Pregel (backup slides)
◦ Introduction
◦ Computation Model
◦ Writing a Pregel Program
◦ Pregel System Implementation
◦ Experiments
◦ Conclusion

Introduction (1/2)
[Figure. Source: SIGMETRICS '09 Tutorial – MapReduce: The Programming Model and Practice, by Jerry Zhao.]

Introduction (2/2)
Many practical computing problems concern large graphs.
◦ Large graph data: the web graph, transportation routes, citation relationships, social networks.
◦ Graph algorithms: PageRank, shortest paths, connected components, clustering techniques.
MapReduce is ill-suited for graph processing:
◦ Many iterations are needed for parallel graph processing.
◦ Materialization of intermediate results at every MapReduce iteration harms performance.

Single Source Shortest Path (SSSP)
Problem: find the shortest path from a source node to all target nodes.
Solution on a single-processor machine: Dijkstra's algorithm.

Example: SSSP – Dijkstra's Algorithm
[Figure sequence: Dijkstra's algorithm run step by step on a small weighted graph, settling one vertex at a time until all shortest distances are known.]

Single Source Shortest Path (SSSP)
Problem: find the shortest path from a source node to all target nodes.
Solutions:
◦ Single-processor machine: Dijkstra's algorithm.
◦ MapReduce/Pregel: parallel breadth-first search (BFS).

MapReduce Execution Overview
[Figure.]

Example: SSSP – Parallel BFS in MapReduce
The example graph (source A), shown as an adjacency matrix in the figure and, more compactly, as an adjacency list:
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)

Map input: <node ID, <dist, adjacency list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Map output: <dest node ID, dist>
<B, 10> <D, 5> (from A)
<C, inf> <D, inf> (from B)
<E, inf> (from C)
<B, inf> <C, inf> <E, inf> (from D)
<A, inf> <C, inf> (from E)
plus each node's own <node ID, <dist, adjacency list>> record; all of it flushed to local disk!

Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>> <A, inf>
<B, <inf, <(C, 1), (D, 2)>>> <B, 10> <B, inf>
<C, <inf, <(E, 4)>>> <C, inf> <C, inf> <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, inf>
<E, <inf, <(A, 7), (C, 6)>>> <E, inf> <E, inf>

Reduce output: <node ID, <dist, adjacency list>> = map input for the next iteration (flushed to DFS!)
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Map output for the second iteration: <dest node ID, dist>
<B, 10> <D, 5> (from A)
<C, 11> <D, 12> (from B)
<E, inf> (from C)
<B, 8> <C, 14> <E, 7> (from D)
<A, inf> <C, inf> (from E)
again flushed to local disk!

Reduce input (second iteration): <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>> <A, inf>
<B, <10, <(C, 1), (D, 2)>>> <B, 10> <B, 8>
<C, <inf, <(E, 4)>>> <C, 11> <C, 14> <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>> <D, 5> <D, 12>
<E, <inf, <(A, 7), (C, 6)>>> <E, inf> <E, 7>

Reduce output (second iteration): <node ID, <dist, adjacency list>> = map input for the next iteration (flushed to DFS!)
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
... the rest omitted ...
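The map and reduce functions driving these iterations can be sketched as follows, in the style of the earlier word-count sketch. INF and the record layout are assumptions for illustration, not code from the slides.

INF = float("inf")

def bfs_map(node, value):
    """value is (dist, adj), where adj is a list of (neighbor, edge_weight)."""
    dist, adj = value
    yield node, (dist, adj)                 # carry the graph structure along
    for neighbor, weight in adj:
        yield neighbor, dist + weight       # candidate distance via this node

def bfs_reduce(node, values):
    """values mixes candidate distances (numbers) with one (dist, adj) record."""
    best, adj = INF, []
    for v in values:
        if isinstance(v, tuple):
            dist, adj = v
            best = min(best, dist)
        else:
            best = min(best, v)
    return best, adj

Each MapReduce job advances the frontier by one hop; a driver re-runs the job until no distance changes, which is exactly the repeated materialization of the whole graph state that makes MapReduce awkward for graph algorithms.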
Computation Model (1/3)
Input, then supersteps (a sequence of iterations), then output.

Computation Model (2/3)
"Think like a vertex."
Inspired by Valiant's Bulk Synchronous Parallel model (1990).
Source: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel

Computation Model (3/3)
Superstep: the vertices compute in parallel. Each vertex:
◦ receives messages sent in the previous superstep
◦ executes the same user-defined function
◦ modifies its value or that of its outgoing edges
◦ sends messages to other vertices (to be received in the next superstep)
◦ may mutate the topology of the graph
◦ votes to halt if it has no further work to do
Termination condition:
◦ all vertices are simultaneously inactive, and
◦ there are no messages in transit.

Example: SSSP – Parallel BFS in Pregel
[Figure sequence: the same weighted graph as before; in each superstep, vertices whose distance improved send tentative distances along their out-edges, receiving vertices keep the minimum, and the computation halts when no distances change (final distances: A=0, B=8, C=9, D=5, E=7).]

Differences from MapReduce
Graph algorithms can be written as a series of chained MapReduce invocations, but:
◦ Pregel keeps vertices & edges on the machine that performs the computation and uses network transfers only for messages.
◦ MapReduce passes the entire state of the graph from one stage to the next and needs to coordinate the steps of a chained MapReduce.

C++ API
Writing a Pregel program: subclass the predefined Vertex class and override its Compute() method, which reads the incoming messages and may send outgoing messages.

Example: Vertex Class for SSSP
[Figure: the SSSP vertex class from the Pregel paper.]
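The example in the paper is C++; the following is a Python-style sketch of the same idea, with a toy single-machine superstep loop so it can actually run. The class, the compute/vote-to-halt conventions, and the driver are assumptions mirroring the Pregel model, not the real API.

INF = float("inf")

class ShortestPathVertex:
    """Sketch of a Pregel-style vertex; method names mirror, but are not, the real C++ API."""
    def __init__(self, vid, out_edges, is_source):
        self.id, self.out_edges, self.is_source = vid, out_edges, is_source
        self.value = INF
        self.outbox = []          # (target, message) pairs produced this superstep
        self.active = True

    def compute(self, messages):
        min_dist = 0 if self.is_source else min(messages, default=INF)
        if min_dist < self.value:
            self.value = min_dist
            for target, weight in self.out_edges:
                self.outbox.append((target, min_dist + weight))
        self.active = False       # vote to halt; an incoming message re-activates the vertex

def run_supersteps(vertices):
    """Toy single-machine superstep loop (the real system runs workers in parallel)."""
    inbox = {v: [] for v in vertices}
    while True:
        # Deliver last superstep's messages and run compute() where needed.
        for vid, vertex in vertices.items():
            if inbox[vid] or vertex.active:
                vertex.compute(inbox[vid])
        # Collect new messages; stop when none are in transit.
        new_inbox = {v: [] for v in vertices}
        sent = False
        for vertex in vertices.values():
            for target, msg in vertex.outbox:
                new_inbox[target].append(msg)
                sent = True
            vertex.outbox = []
        inbox = new_inbox
        if not sent:
            return {vid: v.value for vid, v in vertices.items()}

# The graph from the worked example, with A as the source.
graph = {"A": [("B", 10), ("D", 5)], "B": [("C", 1), ("D", 2)], "C": [("E", 4)],
         "D": [("B", 3), ("C", 9), ("E", 2)], "E": [("A", 7), ("C", 6)]}
vertices = {v: ShortestPathVertex(v, edges, v == "A") for v, edges in graph.items()}
print(run_supersteps(vertices))   # {'A': 0, 'B': 8, 'C': 9, 'D': 5, 'E': 7}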
System Architecture
The Pregel system also uses the master/worker model.
◦ Master:
  ◦ maintains the workers
  ◦ recovers faults of workers
  ◦ provides a Web-UI monitoring tool for job progress
◦ Worker:
  ◦ processes its task
  ◦ communicates with the other workers
Persistent data is stored as files on a distributed storage system (such as GFS or Bigtable); temporary data is stored on local disk.

Execution of a Pregel Program
1. Many copies of the program begin executing on a cluster of machines.
2. The master assigns a partition of the input to each worker.
  ◦ Each worker loads its vertices and marks them as active.
3. The master instructs each worker to perform a superstep.
  ◦ Each worker loops through its active vertices and computes for each vertex.
  ◦ Messages are sent asynchronously, but are delivered before the end of the superstep.
  ◦ This step is repeated as long as any vertices are active or any messages are in transit.
4. After the computation halts, the master may instruct each worker to save its portion of the graph.

Fault Tolerance
◦ Checkpointing: the master periodically instructs the workers to save the state of their partitions to persistent storage (e.g., vertex values, edge values, incoming messages).
◦ Failure detection: using regular "ping" messages.
◦ Recovery: the master reassigns graph partitions to the currently available workers; the workers all reload their partition state from the most recent available checkpoint.

Experiments
Environment:
◦ H/W: a cluster of 300 multicore commodity PCs
◦ Data: binary trees and log-normal random graphs (general graphs)
Naive SSSP implementation:
◦ the weight of all edges = 1
◦ no checkpointing

Experiments
[Figures: SSSP runtime on a 1-billion-vertex binary tree while varying the number of worker tasks; on binary trees of varying sizes with 800 worker tasks; and on random graphs of varying sizes with 800 worker tasks.]

Conclusion & Future Work
Pregel is a scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms.
Future work:
◦ Relaxing the synchronicity of the model: not waiting for slower workers at inter-superstep barriers.
◦ Assigning vertices to machines to minimize inter-machine communication.
◦ Handling dense graphs in which most vertices send messages to most other vertices.