BIG DATA ALGORITHMS

GOOGLE TREND
Big data: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...

IS THERE ANYTHING FUNDAMENTALLY NEW?
• Massive data vs. big data
• The 3 V's:
– Volume
– Velocity
– Variety

BIG DATA ECOSYSTEM
BIG DATA APPLICATIONS

BIG DATA ALGORITHMS
• Parallel algorithms (c. 1980)
• External memory algorithms (c. 1988)
• Data stream algorithms (c. 1999)
• Distributed algorithms (c. 2006)

COMPUTATIONAL MODELS FOR BIG DATA
"All models are wrong, but some are useful." (George E. P. Box)

WHAT'S THE BOTTLENECK?
• CPU speed is approaching its limit
– Does it matter? We have moved from CPU-intensive to data-intensive computing
– Algorithms have to be near-linear, linear, or even sub-linear!
• Data movement, i.e., communication, is the bottleneck!

Random Access Machine Model
• The standard theoretical model of computation (RAM):
– Unlimited memory
– Uniform access cost
• This simple model was crucial for the success of the computer industry

Hierarchical Memory
• Modern machines have a complicated memory hierarchy (L1, L2, RAM, disk)
– Levels get larger and slower further away from the CPU
– Data is moved between levels in large blocks

Slow I/O
• Disk access is ~10^6 times slower than main-memory access
[Figure: disk platter with tracks, magnetic surface, and read/write arm]
"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
– Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 KB)
– It is important to store and access data so as to take advantage of blocks (locality)

Scalability Problems
• Most programs are developed in the RAM model
– They still run on large datasets, because the OS moves blocks as needed
• Modern OSes use sophisticated paging and prefetching strategies
– But if the program makes scattered accesses, even a good OS cannot take advantage of block transfers
[Figure: running time vs. data size — the curve explodes once the data outgrows main memory]
⇒ Scalability problems!
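The gap is easy to see even without a disk. Below is a minimal Python sketch, not from the slides, that simulates the memory hierarchy with a small LRU block cache and counts block transfers for a sequential versus a scattered scan of the same N items (B, M_BLOCKS, and the LRU policy are illustrative stand-ins):

    import random
    from collections import OrderedDict

    B = 64                       # items per block (illustrative)
    M_BLOCKS = 1024              # blocks that fit in "main memory" (illustrative)

    def count_ios(accesses):
        # Count block transfers under a simple LRU cache of M_BLOCKS blocks.
        cache, ios = OrderedDict(), 0
        for i in accesses:
            blk = i // B
            if blk in cache:
                cache.move_to_end(blk)       # cache hit: no transfer
            else:
                ios += 1                     # cache miss: one block I/O
                cache[blk] = True
                if len(cache) > M_BLOCKS:
                    cache.popitem(last=False)
        return ios

    N = 1_000_000
    print("sequential:", count_ios(range(N)))              # ~ N/B transfers
    print("scattered: ", count_ios(random.sample(range(N), N)))  # ~ N transfers

The sequential scan moves roughly N/B blocks, while the scattered scan moves close to N blocks — exactly the B-factor that the external memory model, introduced next, makes explicit.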
External Memory Model
[Figure: CPU and main memory of size M, connected to a disk D; data moves between them in blocks of B items]
• N = # of items in the problem instance
• B = # of items per disk block
• M = # of items that fit in main memory
• I/O = # of blocks moved between memory and disk
• CPU time is ignored
• A successful model, used extensively in the massive-data-algorithms and database communities

Fundamental Bounds
                  Internal          External
• Scanning:       Θ(N)              Θ(N/B)
• Sorting:        Θ(N log N)        Θ((N/B) log_{M/B}(N/B))
• Permuting:      Θ(N)              Θ(min{N, (N/B) log_{M/B}(N/B)})
• Searching:      Θ(log₂ N)        Θ(log_B N)
• Note:
– Linear I/O: O(N/B)
– Permuting is not linear
– The permuting and sorting bounds are equal in all practical cases
– The B factor is VERY important: (N/B) log_{M/B}(N/B) ≪ N

Queues and Stacks
• Queue: maintain one push block and one pop block in main memory
⇒ O(1/B) I/Os per operation (amortized)
• Stack: maintain one push/pop block in main memory
⇒ O(1/B) I/Os per operation (amortized)

Sorting
• Merge sort:
– Create N/M memory-sized sorted lists (runs)
– Repeatedly merge runs together, Θ(M/B) at a time:
  N/M → (N/M)/(M/B) → (N/M)/(M/B)² → … → 1
– O(log_{M/B}(N/M)) phases, using O(N/B) I/Os each
⇒ O((N/B) log_{M/B}(N/B)) I/Os in total
• Up to Θ(M/B) sorted lists (queues) can be merged in O(N/B) I/Os:
– One block per run fits in main memory (M/B blocks)
– The M/B head elements are kept in a heap in main memory (see the sketch below)
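A toy Python sketch of the two phases, with in-memory lists standing in for disk-resident runs (M and fanout are illustrative stand-ins for the memory size and the Θ(M/B) merge arity):

    import heapq
    from itertools import islice

    def external_merge_sort(stream, M=1024, fanout=8):
        # Phase 1: read M items at a time, sort in memory -> sorted runs.
        runs, it = [], iter(stream)
        while True:
            run = sorted(islice(it, M))
            if not run:
                break
            runs.append(run)        # on a real system, each run is a file on disk
        # Phase 2: repeatedly merge 'fanout' (= Theta(M/B)) runs at a time;
        # heapq.merge keeps the head element of each run in a heap.
        while len(runs) > 1:
            runs = [list(heapq.merge(*runs[i:i + fanout]))
                    for i in range(0, len(runs), fanout)]
        return runs[0] if runs else []

Each merge pass reads and writes every item once (O(N/B) I/Os on disk), and there are O(log_{fanout}(N/M)) passes, matching the bound above.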
Toy Experiment: Permuting
• Problem:
– Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8 (each element knows its correct position)
– Output: store them on disk in the right order
• Internal-memory solution:
– Just scan the original sequence and move every element into its right place!
– O(N) time, O(N) I/Os
• External-memory solution:
– Use sorting
– O(N log N) time, but only O((N/B) log_{M/B}(N/B)) I/Os — far fewer than N in practice

Searching in External Memory
• Store N elements in a data structure such that: given a query element x, we can find x or its predecessor I/O-efficiently

B-trees
• BFS-blocking of a balanced search tree naturally corresponds to a tree with fan-out Θ(B)
• B-trees are balanced by allowing node degrees to vary
– Rebalancing is performed by splitting and merging nodes

(a,b)-tree
• T is an (a,b)-tree (a ≥ 2 and b ≥ 2a − 1) if:
– All leaves are on the same level and contain between a and b elements
– Except for the root, all nodes have degree between a and b
– The root has degree between 2 and b
• An (a,b)-tree uses linear space and has height O(log_a N)
• Choosing a, b = Θ(B), so that each node/leaf is stored in one disk block:
– O(N/B) space and O(log_B N) query
[Figure: a (2,4)-tree]

(a,b)-Tree Insert
• Search for the leaf v and insert the element there; then,
WHILE v has b + 1 elements/children:
– Split v into nodes v′ and v″ with ⌈(b+1)/2⌉ ≤ b and ⌊(b+1)/2⌋ ≥ a elements each
– Insert the element (reference) into parent(v) (making a new root if necessary)
– v = parent(v)
• An insert touches O(log_a N) nodes (a toy code sketch follows the B-tree summary below)

(a,b)-Tree Delete
• Search for the leaf v and delete the element from it; then,
WHILE v has a − 1 elements/children:
– Fuse v with a sibling v′: move the children of v′ to v and delete the element (reference) from parent(v) (deleting the root if necessary)
– If v now has > b (and ≤ a + b − 1 < 2b) children, split v
– v = parent(v)
• A delete touches O(log_a N) nodes

(a,b)-tree Properties
• Every update can cause O(log_a N) rebalancing operations
• But if b > 2a, the amortized number of rebalancing operations per update is O(1)
– Why is the slack needed? In a (2,3)-tree (the minimal case b = 2a − 1), alternating an insert and a delete at the right boundary can trigger a full root-to-leaf chain of splits and fuses on every single operation

Summary/Conclusion: B-tree
• B-trees: (a,b)-trees with a, b = Θ(B)
– O(N/B) space
– O(log_B N) query
– O(log_B N) update
• B-trees with all elements in the leaves are sometimes called B+-trees
– By now, "B-tree" and "B+-tree" are used as synonyms
• Construction in O((N/B) log_{M/B}(N/B)) I/Os:
– Sort the elements and construct the leaves
– Then build the tree level by level, bottom-up
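To make the insert/split mechanics concrete, here is a toy Python sketch of (a,b)-tree insertion — a (2,4)-tree over integer keys, with keys stored in the leaves. The class layout is mine, not the slides'; deletion/fusing is omitted, and a real B-tree would take a, b = Θ(B):

    import bisect

    A, B_ = 2, 4                        # (a, b) = (2, 4); a is enforced by delete/fuse (not shown)

    class Node:
        def __init__(self, leaf, items):
            self.leaf = leaf            # leaf: items are sorted keys
            self.items = items          # internal: items are child Nodes
        def min_key(self):
            return self.items[0] if self.leaf else self.items[0].min_key()

    def insert(root, key):
        overflow = _insert(root, key)
        if overflow:                    # the root itself split: tree grows one level
            root = Node(False, [root, overflow])
        return root

    def _insert(v, key):
        if v.leaf:
            bisect.insort(v.items, key)
        else:
            i = 0                       # route to the rightmost child with min_key <= key
            while i + 1 < len(v.items) and v.items[i + 1].min_key() <= key:
                i += 1
            overflow = _insert(v.items[i], key)
            if overflow:
                v.items.insert(i + 1, overflow)
        if len(v.items) > B_:           # b+1 items: split into ceil((b+1)/2) and
            mid = (len(v.items) + 1) // 2   # floor((b+1)/2) items, both >= a
            v.items, rest = v.items[:mid], v.items[mid:]
            return Node(v.leaf, rest)   # overflow node is handed back to the parent
        return None

    root = Node(True, [])
    for k in [5, 1, 9, 3, 7, 2, 8, 6, 4]:
        root = insert(root, k)

Splits propagate bottom-up exactly as in the WHILE loop above: each level either absorbs the overflow node or splits and passes one up, touching O(log_a N) nodes.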
Basic Structures: I/O-Efficient Priority Queue

Internal Priority Queues
• Operations:
– Required: Insert, DeleteMax, Max
– Optional: Delete, Update
• Implementation: binary tree / heap
[Figure, shown over several slides: a binary max-heap; Insert(30) appends the new key at the bottom and bubbles it up; DeleteMax removes the root, moves the last leaf to the root, and sifts it down]

How to Make the Heap I/O-Efficient?
• I/O Technique 1: make it many-way
• I/O Technique 2: buffering!

External Heap
• The heap has fan-out Θ(M/B); each node consists of Θ(M/B) blocks (the last block may not be half-full)
• An insert buffer is kept in main memory
• Heap property: all elements in a child are smaller than those in its parent
• Insert:
– New elements go into the insert buffer in memory
– When the buffer is full, empty it into the root; if a node overflows, restore the heap property by a bottom-up chain of sift-ups (swapping elements between a node and its parent)
• DeleteMax:
– Delete the maximum from the root
– When the root falls below half-full, refill it by merging in the top blocks of its children; refills may cascade down the tree

External Heap: I/O Analysis
• What is the I/O cost of a sequence of N mixed insert / deletemax operations? (The analysis in the paper is too complicated; here is a simpler way to see it.)
• Height of the heap: Θ(log_{M/B}(N/B))
• Insertions:
– Wait until the insert buffer is full (it has then served at least Ω(M) inserts)
– Then do one (occasionally two) bottom-up chains of sift-ups
* Cost: O((M/B)·log_{M/B}(N/B))
* Amortized cost per insert: O((1/B)·log_{M/B}(N/B))
• DeleteMax:
– Wait until the root is below half-full (it has then served at least Ω(M) deletemaxes)
– Then do one, two, sometimes a lot of refills...
– Plus one sift-up: that part is easy
• Cost of all the refills:
– Needs a global argument; idea: trace individual elements
– Total amount of "work" is O(N·log_{M/B}(N/B)), where one unit of work moves one element up or down one level
* Refills do positive work; sift-ups do both positive and negative work
* |work by refills| + |positive work by sift-ups| − |negative work by sift-ups| = O(N·log_{M/B}(N/B))
* But |positive work by sift-ups| > |negative work by sift-ups|
* So |work done by refills| = O(N·log_{M/B}(N/B))
• Each refill spends Θ(M/B) I/Os and does Θ(M) work (the merges are charged the same way)
⇒ Total # of I/Os for all refills: O((N·log_{M/B}(N/B)/M)·(M/B)) = O((N/B)·log_{M/B}(N/B))
• Amortized I/O per operation: O((1/B)·log_{M/B}(N/B)) = sort(N)/N
• This gives another way of sorting in sort(N) I/Os

External Heap: In Practice
• Know the scale of your problem!
– Suppose M = 512 MB and B = 256 KB; then two levels of the heap can already support M·(M/B) = 1024 GB = 1 TB of data
– With the insert buffer in memory, the amortized I/O per insert or delete-max is O(1/B)

Recap: Basic General I/O Techniques
(1) Make it many-way: merge sort
(2) Buffering: external heap (priority queue)
(3) Reduce the problem to sort + priority queue

Pointer Dereferencing
• "Almost every problem in computer science can be solved by another level of indirection." (attributed to David Wheeler)
[Figure: a pointer array P[i] = 5, 3, 7, 3, 6, 4, 8 referencing scattered positions of a data array D[i]]
• Dereferencing each pointer naively needs many random I/Os
• How do we get the values I/O-efficiently? Output (i, data) pairs:
– Sort the pointer array by pointer value, producing a list of (i, P[i]) pairs sorted by P[i]
– Scan both arrays in parallel, producing the (i, data) pairs
– Sort the list back by i if needed
• Total I/O: sort(N)

Time-Forward Processing
[Figure: a small circuit stored in topological order; values 2, 7, 1, 9 flow forward along its edges]
• Scan the sequence in order, maintaining an (I/O-efficient) priority queue; for each cell:
– DeleteMin from the PQ while the minimum key matches the cell, obtaining the incoming values
– Compute the outgoing value
– For each outgoing edge, Insert (destination address, value) into the PQ, with the destination as the key
• Total I/O: sort(N)
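A minimal in-memory Python sketch of the technique. The input format and the local function (a sum of incoming values plus a vertex weight) are illustrative assumptions; in external memory the heapq below would be the I/O-efficient priority queue, giving sort(N) I/Os in total:

    import heapq

    def time_forward(vertices):
        # vertices: list of (vid, weight, out_neighbors) in topological order,
        # with vid increasing along that order (the "time" axis).
        pq, result = [], {}
        for vid, weight, out in vertices:
            incoming = 0
            while pq and pq[0][0] == vid:       # DeleteMin while it matches this cell
                _, val = heapq.heappop(pq)
                incoming += val
            value = weight + incoming           # local computation at the cell
            result[vid] = value
            for w in out:                       # forward the value to the future
                heapq.heappush(pq, (w, value))  # keyed by destination id
        return result

    dag = [(1, 2, [3]), (2, 7, [3, 4]), (3, 1, [4]), (4, 9, [])]
    print(time_forward(dag))                    # {1: 2, 2: 7, 3: 10, 4: 26}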
Application: Maximal Independent Set
• Given an undirected graph G = (V, E) stored on disk
– A list of (vertex-id, vertex-id) pairs representing the edges
• An independent set is a set I of vertices such that no two vertices in I are adjacent
• I is maximal if adding any other vertex to I makes it no longer independent
– Note: finding a maximum independent set is NP-hard!
• Internal memory:
– Add vertices one by one until no more vertices can be added
– Time: O(|V| + |E|)

I/O-Efficient Maximal Independent Set
[Figure: a 7-vertex example graph, processed in vertex-id order]
• Direct every edge from its lower vertex id to its higher vertex id
• Sort all edges by source
• Now we have a time-forward processing problem: a vertex joins I iff no smaller-id neighbor has sent it a message saying "I am in I"
• Total I/O: sort(N), where N = |V| + |E|

ROADMAP — Big Data Algorithms: external memory ✓; next up, data stream algorithms

Problem One: Missing Card
• I take one card from a deck of 52 and pass the rest to you. Given only a (very basic) calculator and bad memory, how can you find the missing card with just one pass over the 51 cards?
– (Hint: keep a running sum; the missing card is the full deck's total minus what you saw.)
• What if there are two missing cards?
– (Hint: also keep, e.g., a running sum of squares, and solve the two resulting equations.)

A data stream algorithm …
• Makes one pass over the input data
• Uses a small amount of memory (much smaller than the input data)
• Computes something

Why do we need streaming algorithms?
• Networking:
– You often get to see the data only once
– You don't want to store the entire data
• Databases:
– Data is stored on disk, and sequential scans are much faster
• Data stream algorithms have been a very active research area for the past 15 years
• Problems considered today: missing card, reservoir sampling, majority, heavy hitters

Reservoir Sampling [Waterman '??; Vitter '85]
• Maintain a uniform sample (without replacement) of size s from a stream of n items
– Every subset of size s has equal probability of being the sample
• When the i-th item arrives:
– With probability s/i, use it to replace an item in the current sample, chosen uniformly at random
– With probability 1 − s/i, throw it away

Reservoir Sampling: Correctness Proof
• (Sketch) By induction on i, every one of the first i items is in the sample with probability s/i: the i-th item is kept with probability s/i, and an item already in the sample (probability s/(i−1) by induction) survives the i-th step with probability 1 − (s/i)·(1/s) = (i−1)/i, giving s/(i−1) · (i−1)/i = s/i
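A direct transcription into Python (randomness from the standard library; the eviction victim is chosen uniformly, exactly as in the proof above):

    import random

    def reservoir_sample(stream, s):
        sample = []
        for i, item in enumerate(stream, start=1):
            if i <= s:
                sample.append(item)                 # first s items fill the reservoir
            elif random.random() < s / i:           # keep the i-th item w.p. s/i
                sample[random.randrange(s)] = item  # evict a uniform victim
        return sample

    print(reservoir_sample(range(1000), 10))        # 10 uniform items, one pass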
Problem two: Majority
• Given a sequence of items, find the majority item if there is one
– Example: AABCDBAABBAAAAAACCCDABAAA → answer: A
• Trivial if we have O(n) memory
• Can you do it with O(1) memory and two passes?
– First pass: find the only possible candidate
– Second pass: count its frequency and verify that it is > n/2
• How about one pass?
– Unfortunately, no

Problem three: Heavy hitters
• Problem: find all items with count > φn, for some 0 < φ < 1
• Relaxation:
– If an item has count > φn, it must be reported, together with an estimated count with (absolute) error < εn
– If an item has count < (φ − ε)n, it cannot be reported
– For items in between, we don't care
• In fact, we will solve the most difficult case, φ = ε
• Applications: frequent IP addresses, data mining, ...

Heavy hitters: the Misra-Gries (MG) algorithm [MG'82]
• Finds up to k items that occur more than a 1/k fraction of the time in a stream
– Estimates their frequencies with additive error ≤ N/(k+1)
• Keep up to k candidate items, each with a counter. For each item in the stream:
– If the item is already monitored, increase its counter
– Else, if < k items are monitored, add the new item with count 1
– Else, decrease all counts by 1
[Figure, shown over several slides: the algorithm run with k = 5 on a small stream]

Streaming MG analysis (N = total input size)
• The error in any estimated count is at most N/(k+1):
– Each estimated count is a lower bound on the true count
– Each decrement is spread over (k+1) items: the new one and the k monitored ones
– So it is equivalent to deleting (k+1) distinct items from the stream
– Hence there are at most N/(k+1) decrement operations
– So at most N/(k+1) copies of any item can have been "deleted"
– Therefore the estimated counts have at most this much error

How about deletions?
• Any deterministic algorithm needs Ω(n) space
– Why? (Intuitively, after deletions the summary may be asked about any surviving item, so it must essentially remember the whole multiset)
– In fact, Las Vegas randomization doesn't help
• We will design a randomized algorithm that works with high probability:
– For any item x, we can estimate its actual count within error εn with probability 1 − δ, for any small constant δ

The Count-Min sketch [Cormode, Muthukrishnan 2003]
• A Count-Min (CM) sketch with parameters (ε, δ) is a two-dimensional array of counters, count[1..d][1..w], of width w = ⌈2/ε⌉ and depth d = ⌈log₂(1/δ)⌉; every entry is initially zero
• d hash functions h₁, …, h_d : {1, …, u} → {1, …, w} are chosen uniformly at random from a 2-universal family
– For example, choose a prime p > u and random a_j, b_j for j = 1, …, d, and define h_j(x) = ((a_j·x + b_j) mod p) mod w
– Property: for any x ≠ y, Pr[h_j(x) = h_j(y)] ≤ 1/w

Updating the sketch
• When item x arrives: for each 1 ≤ j ≤ d, set count[j, h_j(x)] ← count[j, h_j(x)] + 1
• When item x is deleted, do the same with −1 instead of +1

Estimating the count of x
• â_x = min_j count[j, h_j(x)]
• Theorem 1: a_x ≤ â_x (a_x = actual count), and Pr[â_x > a_x + εn] ≤ δ

Proof
• Introduce indicator variables: I_{x,y,j} = 1 if x ≠ y and h_j(x) = h_j(y), and 0 otherwise
– E[I_{x,y,j}] = Pr[h_j(x) = h_j(y)] ≤ 1/w = ε/2
• Define I_{x,j} = Σ_y a_y · I_{x,y,j} (the excess in row j's counter for x)
• By construction, count[j, h_j(x)] = a_x + I_{x,j} ≥ a_x, so â_x ≥ a_x
• For the other direction: E[I_{x,j}] = Σ_y a_y · E[I_{x,y,j}] ≤ εn/2
• Pr[â_x > a_x + εn] = Pr[∀j: count[j, h_j(x)] > a_x + εn]
  = Pr[∀j: I_{x,j} > εn] ≤ Pr[∀j: I_{x,j} > 2·E[I_{x,j}]] ≤ 2^{−d} ≤ δ
– by the Markov inequality, Pr[X ≥ t] ≤ E[X]/t for X ≥ 0, applied independently to each of the d rows
• So the Count-Min sketch has size O((1/ε) log(1/δ)) counters ∎
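A compact Python sketch following the definitions above (w = ⌈2/ε⌉, d = ⌈log₂(1/δ)⌉, h_j(x) = ((a_j·x + b_j) mod p) mod w; the Mersenne prime p = 2^61 − 1 is an illustrative choice of prime larger than the universe):

    import math
    import random

    class CountMin:
        def __init__(self, eps, delta):
            self.w = math.ceil(2 / eps)
            self.d = math.ceil(math.log2(1 / delta))
            self.p = 2**61 - 1                      # prime > universe size u
            self.hashes = [(random.randrange(1, self.p), random.randrange(self.p))
                           for _ in range(self.d)]  # (a_j, b_j) per row
            self.count = [[0] * self.w for _ in range(self.d)]

        def _h(self, j, x):
            a, b = self.hashes[j]
            return ((a * x + b) % self.p) % self.w

        def update(self, x, c=1):                   # c = -1 handles deletions
            for j in range(self.d):
                self.count[j][self._h(j, x)] += c

        def estimate(self, x):
            return min(self.count[j][self._h(j, x)] for j in range(self.d))

    cm = CountMin(eps=0.01, delta=0.01)
    for x in [7, 7, 7, 42]:
        cm.update(x)
    print(cm.estimate(7), cm.estimate(42))          # 3 1 (each within +eps*n)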
ROADMAP — Big Data Algorithms: external memory ✓, data streams ✓; next up, distributed algorithms

Distributed Systems: Performance vs. Programmability
• MPI sits at the performance end; MapReduce at the programmability end
• MapReduce provides:
– Automatic parallelization & distribution
– Fault tolerance
– Scalability
– ... at the price of a restricted programming model

Map/Reduce
• map(key, val) is run on each item in the input set
– emits new-key / new-val pairs
• reduce(key, vals) is run for each unique key emitted by map()
– emits the final output

Count words in docs
• Input consists of (url, contents) pairs
• map(key=url, val=contents):
– For each word w in contents, emit (w, "1")
• reduce(key=word, values=uniq_counts):
– Sum all the "1"s in the values list
– Emit the result (word, sum)

Count, Illustrated
• Input: "see bob throw", "see spot run"
• map emits: (see, 1), (bob, 1), (throw, 1), (see, 1), (spot, 1), (run, 1)
• reduce emits: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
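A self-contained Python sketch of word count in this style; the sequential run_mapreduce driver is a stand-in for the real parallel, distributed runtime (its grouping dict plays the role of the shuffle phase):

    from collections import defaultdict

    def map_fn(url, contents):
        for w in contents.split():
            yield (w, 1)                       # emit (word, "1") for every word

    def reduce_fn(word, counts):
        yield (word, sum(counts))              # sum all the 1s for this word

    def run_mapreduce(docs, map_fn, reduce_fn):
        groups = defaultdict(list)             # the "shuffle": group values by key
        for url, contents in docs.items():
            for k, v in map_fn(url, contents):
                groups[k].append(v)
        out = {}
        for k, vs in groups.items():
            for key, val in reduce_fn(k, vs):
                out[key] = val
        return out

    print(run_mapreduce({"d1": "see bob throw", "d2": "see spot run"},
                        map_fn, reduce_fn))
    # {'see': 2, 'bob': 1, 'throw': 1, 'spot': 1, 'run': 1}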
Reverse Web-Link Graph
• Map: for each URL linking to target, output (target, source) pairs
• Reduce: concatenate the list of all source URLs; output (target, list(source)) pairs

Inverted Index
• Map: for each (url, doc) pair, emit (keyword, url) for each keyword in doc
• Reduce: for each keyword, output (keyword, list of urls)

Model is Widely Applicable
• MapReduce programs in the Google source tree; example uses:
– distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, ...

Implementation Overview
• Typical cluster:
– 100s/1000s of multicore x86 machines, 4 GB of memory each
– One- or two-level tree-shaped switched network with 100 Gbps of aggregate bandwidth at the root
– Storage on local IDE disks
– GFS: a distributed file system manages the data
– Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
• The implementation is a C++ library linked into user programs

Execution: How is this distributed?
1. Partition the input key/value pairs into chunks and run map() tasks in parallel
2. After all map()s are complete, consolidate all emitted values for each unique emitted key
3. Then partition the space of output map keys and run reduce() in parallel
• If a map() or reduce() task fails, re-execute it!

Job Processing
[Figure: a JobTracker coordinating TaskTrackers 0-5 on a "grep" job]
1. The client submits the "grep" job, indicating code and input files
2. The JobTracker breaks the input file into k chunks and assigns work to the TaskTrackers
3. After map(), the TaskTrackers exchange map output to build the reduce() keyspace
4. The JobTracker breaks the reduce() keyspace into m chunks and assigns work
5. The reduce() output may go to NDFS
[Figure: execution overview and parallel execution timelines]

Task Granularity & Pipelining
• Fine-granularity tasks: many more map tasks than machines
– Minimizes the time for fault recovery
– Shuffling can be pipelined with map execution
– Better dynamic load balancing

Computational Model for MapReduce?
• Complicated, many factors, still no consensus:
– Communication (total vs. maximum)
– Space
– Time vs. work (parallelism)
– # of rounds

MapReduce: Pros and Cons
• Pros:
– Simple programming model
– Excellent scalability for one-round batch jobs
– Fault tolerance
• Cons:
– Programming model too simple
– Poor performance for iterative jobs

Google Pregel for Graph Data
• Master/worker model
– Each worker is assigned a subset of the graph's vertices
• Vertex-centric model. Each vertex has:
– An arbitrary "value" that can be get/set
– A list of messages sent to it
– A list of outgoing edges (edges have a value too)
– A binary state (active/inactive)

The Pregel model
• Bulk Synchronous Parallel model (Valiant, 1990): synchronous iterations of asynchronous computation
• The master initiates each iteration (called a "superstep")
• At every superstep:
– Workers asynchronously execute a user function on all of their vertices
– A vertex can receive the messages sent to it in the last superstep
– A vertex can send messages to other vertices, to be received in the next superstep
– A vertex can modify its value, modify the values of its edges, or change the topology of the graph (add/remove vertices or edges)
– A vertex can "vote to halt"
• Execution stops when all vertices have voted to halt and no vertex has messages
– A vote to halt is trumped by a non-empty message queue

[Figure: a graph's vertices partitioned across Workers 1-3 (illustration: stochastix.files.wordpress.com)]

Loading the graph input
• The master assigns a section of the input to each worker
• Vertex "ownership" is determined by hash(v) mod N, where N is the number of partitions
– Recall each worker is assigned one or more partitions
– The user can modify this to exploit data locality
• Each worker reads its section of the input:
– It stores the vertices belonging to it
– It sends the other vertices to the appropriate worker
• The input is stored on something like GFS; section assignments are determined by data locality

Simple example: max propagation
    old_val := val
    for each message m:
        if m > val then val := m
    if old_val == val then
        vote_to_halt
    else
        for each neighbor v:
            send_message(v, val)

Combiners
• Sometimes a vertex only cares about a summary of the messages sent to it (e.g., the max in the previous example)
• Combiners allow for this (examples: min, max, sum, avg)
– Messages are combined locally and remotely, reducing bandwidth overhead
– User-defined; not enabled by default
[Figure: combiners on Workers 1-3 collapsing messages destined for one vertex]

Aggregators
• Compute aggregate statistics from vertex-reported values
• During a superstep, each worker aggregates values from its vertices to form a partially aggregated value
• At the end of a superstep, the partially aggregated values from each worker are combined in a tree structure, and the global aggregate is sent to the master
– The tree structure allows this process itself to be parallelized

Fault Tolerance (1/2)
• At the start of a superstep, the master tells the workers to save their state to persistent storage: vertex values, edge values, incoming messages
– The master saves the aggregator values (if any)
• This isn't necessarily done at every superstep — that could be very costly
– The authors determine the checkpoint frequency using a mean-time-to-failure model

Fault Tolerance (2/2)
• When the master detects one or more worker failures:
– All workers revert to the last checkpoint and continue from there
– That's a lot of repeated work! But at least it's better than redoing the whole thing.

Example 1: PageRank
    PR(u) = 0.15 × (1/N) + 0.85 × Σ_{v→u} PR(v) / outdegree(v)

Example 2: Single Source Shortest Paths
• At each superstep, a vertex with current tentative distance d_v:
– receives messages d₀, d₁, … (candidate distances through its in-neighbors)
– if min(d₀, d₁, …) < d_v, it updates its minimum distance from s and sends (new d_v) + w_e along each outgoing edge e
– else, it votes to halt
• After execution, each vertex's value is its minimum distance from s

Example 2: SSSP Combiner
• Each vertex is interested only in the minimum of its messages
• Might as well use a combiner! (see the sketch below)
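A toy Python sketch of the SSSP superstep with its min-combiner. The Vertex layout and the single-machine driver are illustrative assumptions, not Pregel's actual C++ API; real Pregel distributes the vertices across workers, and here the source is seeded with a 0 message in superstep 0:

    import math

    class Vertex:
        def __init__(self, out_edges):             # out_edges: list of (dst_id, weight)
            self.value = math.inf                  # tentative distance from s
            self.out_edges = out_edges

    def sssp_compute(v, messages, outbox):
        best = min(messages, default=math.inf)
        if best < v.value:                         # improved distance received
            v.value = best
            for dst, w in v.out_edges:             # relax all outgoing edges
                outbox.setdefault(dst, []).append(v.value + w)
        # else: the vertex votes to halt; with no incoming messages it stays inactive

    def pregel_sssp(graph, source):
        inbox = {source: [0]}                      # seed the source in superstep 0
        while inbox:                               # run until no messages remain
            outbox = {}
            for vid, msgs in inbox.items():
                sssp_compute(graph[vid], [min(msgs)], outbox)  # min() = the combiner
            inbox = outbox
        return {vid: v.value for vid, v in graph.items()}

    g = {1: Vertex([(2, 4), (3, 1)]), 2: Vertex([(4, 1)]),
         3: Vertex([(2, 1), (4, 5)]), 4: Vertex([])}
    print(pregel_sssp(g, 1))                       # {1: 0, 2: 2, 3: 1, 4: 3}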
Computational Model for Pregel?
• # of supersteps: L, the maximum number of edges on any shortest path from the source
• Communication: O(E·L) (can it be brought down, say toward O(E log V)?)
• Better algorithms are known, but they are harder to implement

Conclusions
• Algorithm design faces new constraints and challenges in the big data era
• Resources other than time may be the main consideration
– Data movement cost is often the primary concern
• Algorithmic ideas are often independent of technological improvements

Thank you!