Connectivity: A Semi-External Algorithm
Analysis:
• Scan the vertex set to load the vertices into main memory
• Scan the edge set to carry out the algorithm
• O(scan(|V| + |E|)) I/Os
Theorem: If |V| ≤ M, the connected components of a graph can be computed in O(scan(|V| + |E|)) I/Os.

Connectivity: The General Case
Idea [Chiang et al 1995]:
• If |V| ≤ M
– Use the semi-external algorithm
• If |V| > M
– Identify simple connected subgraphs of G
– Contract these subgraphs to obtain a graph G' = (V', E') with |V'| ≤ c|V|, c < 1
– Recursively compute the connected components of G'
– Obtain the labelling of the connected components of G from the labelling of the components of G'

[Figure: an example graph whose vertices carry component labels 1 and 2; the selected subgraphs A–E are contracted into single vertices, and the contracted graph inherits the component labels.]

Connectivity: The General Case
Main steps:
• Find smallest neighbors
• Compute the connected components of the graph H induced by the selected edges
• Contract each component into a single vertex
• Call the procedure recursively
• Copy the label of every vertex v ∈ G' to all vertices in G represented by v

Finding Smallest Neighbors
To find the smallest neighbor w(v) of every vertex v:
• Scan the edges and replace each undirected edge {u,v} with the directed edges (u,v) and (v,u)
• Sort the directed edges lexicographically; this produces adjacency lists
• Scan the adjacency list of v and return as w(v) the first vertex in the list
This takes overall O(sort(|E|)) I/Os.
To produce the edge set of the (undirected) graph H, sort and scan the edges {v, w(v)} to remove duplicates. This takes another O(sort(|V|)) I/Os.

Computing Connected Components of H
We cannot use the same algorithm recursively (we have not reduced the vertex set yet). Instead, exploit the following property.
Lemma: Graph H is a forest.
Proof: Assume not. Then H must contain a cycle x_0, x_1, …, x_k = x_0. Since there are no duplicate edges, k ≥ 3. Since each vertex v has at most one incident edge {v, w(v)} in H, w.l.o.g. x_{i+1} = w(x_i) for 0 ≤ i < k. Then the existence of the edge {x_{i-1}, x_i} implies that x_{i-1} > x_{i+1}. Similarly, x_{k-1} > x_1.
If k is even: x_0 > x_2 > … > x_k = x_0 yields a contradiction.
If k is odd: x_0 > x_2 > … > x_{k-1} > x_1 > x_3 > … > x_k = x_0 yields a contradiction.

Exploit the Property that H is a Forest
Apply an Euler tour to H in order to transform each tree into a list. Now compute the connected components using ideas from list ranking:
• Find a large independent set I of H and remove the vertices in I from H
• Recursively find the connected components of the smaller graph
• Reintegrate the vertices in I (assign the component label of a neighbor)
This takes O(sort(|H|)) = O(sort(|V|)) I/Os.

Recursive Calls
Every connected component of H has size at least 2, so |V'| ≤ |V|/2 and there are O(log(|V|/M)) recursive calls.
Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) · log(|V|/M)) I/Os.

Improved Connectivity via BFS
• BFS in O(|V| + sort(|E|)) I/Os [Munagala & Ranade 99]; BFS can be used to identify connected components
• When |V| = |E|/B, this algorithm takes O(sort(|E|)) I/Os
• Use the same algorithm as before, but stop the recursion earlier, when the number of vertices has been reduced to |E|/B (after log(|V|B/|E|) recursive calls)
• At this point, apply BFS rather than semi-external connectivity
Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) · log(|V|B/|E|)) I/Os.

Minimum Spanning Tree (MST)
We can push the same ideas to work for MSTs:
Theorem: An MST of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) · log(|V|/M)) I/Os.
Theorem: An MST of a graph G = (V,E) can be found in O(sort(|V|) + sort(|E|) · log(|V|B/|E|)) I/Os.
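To make the recursive structure of the contraction-based connectivity algorithm above concrete, here is a minimal in-memory sketch. It is illustrative only: the real algorithm realizes every step with sorts and scans of the edge list, and computes the components of the forest H via Euler tour and list ranking; here plain vectors and union-find stand in for those passes, and vertices are simply 0..n-1.

#include <algorithm>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

// Returns a component label for every vertex 0..n-1.
std::vector<int> connectedComponents(int n, std::vector<std::pair<int,int>> edges) {
    std::vector<int> label(n);
    std::iota(label.begin(), label.end(), 0);
    if (n <= 1 || edges.empty()) return label;

    // Step 1: smallest neighbour w(v) of every vertex (a sort + scan pass externally).
    std::vector<int> w(n, -1);
    for (auto [u, v] : edges) {
        if (w[u] == -1 || v < w[u]) w[u] = v;
        if (w[v] == -1 || u < w[v]) w[v] = u;
    }

    // Step 2: components of the forest H induced by the edges {v, w(v)}.
    // (The slides use Euler tour + list ranking; plain union-find is enough in memory.)
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    auto find = [&](int x) { while (parent[x] != x) x = parent[x] = parent[parent[x]]; return x; };
    for (int v = 0; v < n; ++v)
        if (w[v] != -1) parent[find(v)] = find(w[v]);

    // Step 3: contract each component of H into a single vertex of G' = (V', E').
    std::map<int, int> newId;
    for (int v = 0; v < n; ++v) newId.emplace(find(v), (int)newId.size());
    std::vector<std::pair<int,int>> contracted;
    for (auto [u, v] : edges) {
        int a = newId[find(u)], b = newId[find(v)];
        if (a != b) contracted.emplace_back(std::min(a, b), std::max(a, b));
        // duplicate edges would be removed by a sort + scan in the external version
    }

    // Step 4: recurse on the contracted graph G' (the non-isolated vertices at least halve).
    std::vector<int> sub = connectedComponents((int)newId.size(), contracted);

    // Step 5: copy the label of each contracted vertex back to the vertices it represents.
    for (int v = 0; v < n; ++v) label[v] = sub[newId[find(v)]];
    return label;
}

The function name and representation are assumptions made for the sketch; the point is only the shape of the recursion: select edges, build the forest H, contract, recurse, and propagate labels back.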
Three Techniques for Graph Algorithms
• Time-forward processing:
– Express graph problems as evaluation problems on DAGs
• Graph contraction:
– Reduce the size of G while maintaining the properties of interest
– Solve the problem recursively on the compressed graph
– Construct the solution for G from the solution for the compressed graph
• Bootstrapping:
– Switch to a generally less efficient algorithm as soon as (part of) the input is small enough

Cache Oblivious Algorithms

Typical Cache Configuration
[Figure: a typical memory hierarchy, with small fast caches close to the processor and larger, slower levels below.]

Cache Oblivious Model
Introduced by Frigo, Leiserson, Prokop & Ramachandran [FLPR99, Pro99]. Its principal idea is simple: design external-memory algorithms without knowing B and M (the internal details of the hierarchical memory). But this simple idea has several surprisingly powerful consequences.

Consequences of Cache Obliviousness
If a cache-oblivious algorithm performs well between two levels of the memory hierarchy, then it must automatically work well between any two adjacent levels of the memory hierarchy.
Self-tuning: a cache-oblivious algorithm should work well on all machines without modification (still subject to some tuning, e.g., where to trim the base case of the recursion).
In contrast to the external-memory model, algorithms in the cache-oblivious model cannot explicitly manage the cache.

Assumptions of the Cache Oblivious Model
How can we design algorithms that minimize the number of block transfers if we do not know the page-replacement strategy? An adversarial page-replacement strategy could always evict the next block that will be accessed…
The cache-oblivious model assumes an ideal cache: page replacement is optimal, and the cache is fully associative.

Assumptions of the Cache Oblivious Model
Optimal page replacement: the page-replacement strategy knows the future and always evicts the page that will be accessed farthest in the future.
Real-world caches do not know the future, and employ more realistic page-replacement strategies such as evicting the least-recently-used block (LRU) or evicting the oldest block (FIFO).

Assumptions of the Cache Oblivious Model
Full associativity: any block can be stored anywhere in the cache.
In contrast, most caches have limited associativity: each block belongs to a cluster, and at most some small constant c of blocks from a common cluster can be stored in the cache at once. Typical real-world caches are either direct mapped (c = 1) or 2-way associative (c = 2). Some caches have more associativity (4-way or 8-way), but the constant c is certainly limited.

Justification of the Ideal Cache
Frigo et al. [FLPR99, Pro99] justify the ideal-cache model by a collection of reductions that modify an ideal-cache algorithm to operate on a more realistic cache model. The running time of the algorithm degrades somewhat, but in most cases by only a constant factor. We outline the major steps, without going into the details of the proofs.

Justification of the Ideal Cache
Replacement strategy: the first reduction removes the optimal (omniscient) replacement strategy that uses information about future requests.
Lemma [FLPR99]: If an algorithm makes T memory transfers on a cache of size M/2 with optimal replacement, then it makes at most 2T memory transfers on a cache of size M with LRU or FIFO replacement (and the same block size B).
I.e., LRU and FIFO do just as well as optimal replacement, up to constant factors in the number of memory transfers and in the wastage of the cache. This competitiveness property of LRU and FIFO goes back to a 1985 paper of Sleator and Tarjan.
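The lemma can be checked empirically with a small simulation (not from the slides): count the block transfers of LRU with M cache blocks against the optimal offline strategy (Belady's rule: evict the block needed farthest in the future) with M/2 blocks, on the same access trace. The trace, cache size, and block identifiers below are arbitrary assumptions for the experiment.

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <list>
#include <unordered_map>
#include <vector>

std::size_t lruMisses(const std::vector<int>& trace, std::size_t capacity) {
    std::list<int> order;                                   // most recently used block first
    std::unordered_map<int, std::list<int>::iterator> where;
    std::size_t misses = 0;
    for (int b : trace) {
        auto it = where.find(b);
        if (it != where.end()) {
            order.erase(it->second);                        // hit: move block to the front
        } else {
            ++misses;                                       // miss: evict the LRU block if full
            if (order.size() == capacity) { where.erase(order.back()); order.pop_back(); }
        }
        order.push_front(b);
        where[b] = order.begin();
    }
    return misses;
}

std::size_t optMisses(const std::vector<int>& trace, std::size_t capacity) {
    const std::size_t never = trace.size();
    std::vector<std::size_t> next(trace.size());            // next position this block is used again
    std::unordered_map<int, std::size_t> seen;
    for (std::size_t i = trace.size(); i-- > 0; ) {
        auto it = seen.find(trace[i]);
        next[i] = (it == seen.end()) ? never : it->second;
        seen[trace[i]] = i;
    }
    std::unordered_map<int, std::size_t> cache;             // block -> position of its next use
    std::size_t misses = 0;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        if (!cache.count(trace[i])) {
            ++misses;
            if (cache.size() == capacity)                   // evict the block needed farthest away
                cache.erase(std::max_element(cache.begin(), cache.end(),
                    [](const auto& a, const auto& b) { return a.second < b.second; }));
        }
        cache[trace[i]] = next[i];
    }
    return misses;
}

int main() {
    std::vector<int> trace;
    for (int i = 0; i < 100000; ++i) trace.push_back(std::rand() % 300);
    std::size_t M = 64;
    std::cout << "LRU, cache M:   " << lruMisses(trace, M)     << " misses\n"
              << "OPT, cache M/2: " << optMisses(trace, M / 2) << " misses\n";
    // Per the lemma, the first count should be at most twice the second.
    return 0;
}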
Another Assumption: Tall Cache
It is commonly assumed that the cache is taller than it is wide, i.e., the number of blocks, M/B, is larger than the size of each block, B: M = Ω(B²).
This is particularly important in the more sophisticated cache-oblivious algorithms: it ensures that the cache provides a polynomially large "buffer" for guessing the block size slightly wrong. It is also commonly assumed in external-memory algorithms.

Ideal Cache Oblivious Model
Focus on two levels: level 1 has size M; level 2 transfers blocks of size B. The algorithm designer does not need to know the parameters M and B explicitly.
Sometimes the tall-cache assumption M = Ω(B²) is used; it is usually true in practice.

(Easy) Cache Oblivious Algorithms
Scanning N elements stored in a contiguous segment of memory costs at most ⌈N/B⌉ + 1 memory transfers. Reversing an array is the same as scanning.

Matrix Transposition
for (i = 0; i < N; i++)
  for (j = i+1; j < N; j++)
    swap(A[i][j], A[j][i])
How many cache misses? O(N²) in the worst case. How to improve this? Recursion (divide & conquer) may be helpful.

Cache Oblivious Matrix Transposition
[Figure: the current submatrix, spanning columns x to x+dx and rows y to y+dy, is split at xmid = x + dx/2 and each half is handled recursively.] Which problem must be solved recursively?

Cache Oblivious Matrix Transposition
O(N²/B) cache misses.

Rough Experiments
Athlon 1 GHz, 512 MB RAM, Linux. [Figure: running times of the naive and recursive transposition.]

Stop Recursion Earlier
Stop the recursion when the problem size becomes less than a certain block size and use a simple for-loop implementation inside the block. Using different block sizes seems to have little effect on the running time.

Why Divide & Conquer Works
Divide & conquer repeatedly refines the problem size. Eventually the problem will fit in cache (size ≤ M), and later it will fit in a single block (size ≤ B).
For a divide & conquer recursion dominated by the leaf costs, the algorithm will usually use within a constant factor of the optimal number of memory transfers. If the divide and merge steps can be done using few memory transfers, then the divide & conquer approach is efficient even when the cost is not dominated by the leaves.

Divide & Conquer OK: Selection
Median and selection: find the k-th item in an unsorted sequence.
Classical (internal memory) algorithm [Blum et al]: (1) partition the sequence into groups of 5, (2) compute the median of each group, (3) recursively find the median of these medians, (4) partition the sequence around it, and (5) recurse on the side containing the k-th item.
Recurrence on the running time T(N): T(N) = T(N/5) + T(7N/10) + O(N) = O(N).

Cache Oblivious Implementation
Step 1 is conceptual; do nothing.
Step 2 in two parallel scans: one reads the array 5 items at a time, the other writes the new array of computed medians. Assuming M ≥ 2B, that is O(1 + N/B) memory transfers.
Step 3 is a recursive call of size N/5.
Step 4 in three parallel scans: one reads the array, the other two write the partitioned arrays. Again, the parallel scans use O(1 + N/B) memory transfers (M ≥ 3B).
Step 5 is a recursive call of size at most 7N/10.
Recurrence on the number of memory transfers T(N): T(N) = T(N/5) + T(7N/10) + O(1 + N/B).
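A minimal in-memory sketch of steps 1–5, assuming integer keys. The "parallel scans" are ordinary loops here, and vector copies stand in for the output arrays that the external version would write sequentially; the point is only that every pass over the data is a scan.

#include <algorithm>
#include <vector>

// Returns the k-th smallest element of a (0-based k).
int select(std::vector<int> a, std::size_t k) {
    if (a.size() <= 5) {                        // base case: fits in a constant number of blocks
        std::sort(a.begin(), a.end());
        return a[k];
    }
    // Step 1 (conceptual) + Step 2: scan groups of 5 and write out their medians.
    std::vector<int> medians;
    for (std::size_t i = 0; i < a.size(); i += 5) {
        std::size_t j = std::min(i + 5, a.size());
        std::vector<int> g(a.begin() + i, a.begin() + j);
        std::sort(g.begin(), g.end());
        medians.push_back(g[g.size() / 2]);
    }
    // Step 3: recursive call of size N/5 to find the median of the medians.
    int pivot = select(medians, medians.size() / 2);

    // Step 4: partition the array around the pivot (three output scans).
    std::vector<int> less, equal, greater;
    for (int x : a) {
        if (x < pivot) less.push_back(x);
        else if (x == pivot) equal.push_back(x);
        else greater.push_back(x);
    }
    // Step 5: recursive call on the side containing the k-th element (size at most ~7N/10).
    if (k < less.size()) return select(less, k);
    if (k < less.size() + equal.size()) return pivot;
    return select(greater, k - less.size() - equal.size());
}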
Failed Attempt at the Analysis
Recurrence on the number of memory transfers T(N): T(N) = T(N/5) + T(7N/10) + O(1 + N/B).
We wish to prove O(1 + N/B) memory transfers. If T(O(1)) = O(1), each leaf incurs a constant number of memory transfers. How many leaves does the recurrence tree have?
Let L(N) be the total number of leaves: L(N) = L(N/5) + L(7N/10). If L(N) = N^c, then (1/5)^c + (7/10)^c = 1, i.e., c ≈ 0.8397803.
But then T(N) is Ω(N^c), which is still larger than O(1 + N/B) when B ≤ N ≤ B·N^c, i.e., B ≤ N ≤ B^{1/(1-c)} ≈ B^{6.24}.

Refined Analysis
Recurrence on the number of memory transfers T(N): T(N) = T(N/5) + T(7N/10) + O(1 + N/B).
Luckily, we can use a base case stronger than T(O(1)) = O(1), namely T(O(B)) = O(1) (once the problem fits into O(1) blocks, all 5 steps incur only a constant number of memory transfers).
Stop the recursion at O(B): then there are only (N/B)^c leaves in the recursion tree, which cost only O((N/B)^c) = o(N/B) memory transfers. Thus the cost per level decreases geometrically from the root, so the total cost is the cost of the root: O(1 + N/B).

Cache Oblivious Implementation
Theorem: The worst-case linear-time median algorithm, implemented with appropriate scans, uses O(1 + N/B) memory transfers, provided M ≥ 3B.
The key part of the analysis is to identify the relevant base case, so that the "overhead term" does not dominate the cost for problem sizes that are small relative to the cache. Other than the new base case, the analysis is the same as for the classic (internal memory) algorithm.

Divide & Conquer KO: Binary Search
Binary search has the recurrence T(N) = T(N/2) + O(1).
The cost of the leaves balances with the cost of the root: the cost of every level is the same, so we pay an extra log N factor.
One could hope to reduce the log N factor in a blocked setting by using the stronger base case T(O(B)) = O(1). However, the stronger base case does not help much: it only reduces the number of levels in the recursion tree by an additive Θ(log B). In this case the solution to the recurrence becomes T(N) = log N − Θ(log B).
We will see later how to get O(log_B N) with a different layout than the sorted one.

Matrix Multiplication
We wish to compute C = A · B. For the sake of simplicity, consider square matrices whose dimensions are powers of two (this is w.l.o.g.).
Trivial algorithm: for each c_{ij}, scan in parallel row i of A and column j of B. Ideally, A is stored in row-major and B in column-major order. Then each element of C requires O(1 + N/B) memory transfers, if M ≥ 3B. The cost could only be smaller if M were large enough to store a previously visited row or column. If M ≥ N, the relevant row of A is remembered for an entire row of C. But for a column of B to be remembered, M ≥ N², in which case the entire problem fits in cache.
Theorem: Assume A is stored in row-major and B in column-major order. Then trivial matrix multiplication uses O(N² + N³/B) memory transfers if 3B ≤ M < N², and O(1 + N²/B) memory transfers if M ≥ 3N².

Matrix Multiplication
The point of the theorem is that, even with the ideal storage order of A and B, the trivial algorithm still requires O(N³/B) memory transfers unless the entire problem fits in cache.
We can do better, and achieve O(N²/B + N³/(B√M)). In external memory, this bound was first achieved by Hong and Kung [HK81]. The cache-oblivious solution uses the same idea as the external-memory solution: block matrices.

Matrix Multiplication
We can write C = A · B as a divide-and-conquer recursion using block-matrix notation:
C11 = A11·B11 + A12·B21    C12 = A11·B12 + A12·B22
C21 = A21·B11 + A22·B21    C22 = A21·B12 + A22·B22
This way, we reduce an N × N multiplication problem down to eight (N/2) × (N/2) multiplication subproblems, plus four (N/2) × (N/2) addition subproblems (which can be solved by a single scan in O(1 + N²/B) memory transfers). Thus we get the recurrence T(N) = 8 T(N/2) + O(1 + N²/B).

Matrix Layout
To make small matrix blocks fit into blocks or main memory, the matrix is not stored in row-major or column-major order, but rather in a recursive layout. Each matrix A is laid out so that each of the blocks A11, A12, A21, A22 occupies a consecutive segment of memory, and these four segments are stored together in an arbitrary order.
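A sketch of the eight-way recursion above, assuming N is a power of two and matrices stored as plain row-major arrays addressed through offset/stride arithmetic. The true cache-oblivious bound additionally relies on the recursive block layout just described; ordinary row-major indexing is used here only to keep the index arithmetic readable, and the four block additions are folded into += accumulation at the leaves.

#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;             // row-major, N x N

void multiplyRec(const Matrix& A, const Matrix& B, Matrix& C,
                 std::size_t n, std::size_t stride,
                 std::size_t ai, std::size_t bi, std::size_t ci) {
    if (n == 1) {                               // base case: scalar multiply-add
        C[ci] += A[ai] * B[bi];
        return;
    }
    std::size_t h = n / 2;
    // Linear index of quadrant (r, c) of the n x n block starting at 'base'.
    auto q = [&](std::size_t base, std::size_t r, std::size_t c) {
        return base + r * h * stride + c * h;
    };
    // C_ij += A_ik * B_kj for all quadrant indices: eight (n/2) x (n/2) multiplications.
    for (std::size_t i = 0; i < 2; ++i)
        for (std::size_t j = 0; j < 2; ++j)
            for (std::size_t k = 0; k < 2; ++k)
                multiplyRec(A, B, C, h, stride, q(ai, i, k), q(bi, k, j), q(ci, i, j));
}

void multiply(const Matrix& A, const Matrix& B, Matrix& C, std::size_t N) {
    multiplyRec(A, B, C, N, N, 0, 0, 0);        // C must be zero-initialised by the caller
}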
Base Case
The base case becomes trickier, as both B and M are relevant. Certainly, T(O(√B)) = O(1), because an O(√B) × O(√B) submatrix fits in a constant number of blocks. But this base case turns out to be irrelevant. More interesting is T(c√M) = O(M/B), where the constant c is chosen so that three c√M × c√M submatrices fit in cache, and hence each block is read or written at most once.

Analysis
The recurrence is T(N) = 8 T(N/2) + O(1 + N²/B), with the stronger base case T(c√M) = O(M/B).
At level i of the recurrence tree there are 8^i nodes, the matrix dimension is N/2^i, and the total cost of the level is 8^i · O(N²/(2^{2i} B)) = 2^i · O(N²/B).
The recursion stops when N/2^i = c√M, i.e., at depth L = O(log(N/√M)).
The total cost is Σ_{i=0}^{L} 2^i · O(N²/B) = (2^{L+1} − 1) · O(N²/B) = O(N²/B) + O(N³/(B√M)).
(That is the divide/merge cost at the root plus the total leaf cost.) The divide/merge cost at the root of the recursion tree is O(N²/B). The two costs balance when N = Θ(√M), when the depth of the tree is O(1).

Matrix Multiplication
Trivial: O(N³/B). Cache-oblivious: O(N²/B + N³/(B√M)).
[Figure: experimental running times, trivial vs. blocked vs. cache-oblivious multiplication.]

Static Searching

Cache Oblivious Searching
Divide and conquer on the tree layout (as in the van Emde Boas O(log log U) priority queue):
• Split the tree at the middle level, resulting in one top subtree and ≈ √N bottom subtrees, each of size ≈ √N
• Recursively lay out the top subtree followed by the bottom subtrees

Cache Oblivious Searching
If the height is not a power of 2, each split rounds so that the bottom subtrees have heights that are powers of 2.

CO Searching
• Recursively split the tree (cutting at the middle level) until every recursive subtree has size at most B (or small enough to fit into a cache line)
• Each recursive subtree is stored in an interval of memory of size at most B, so it occupies at most two blocks
• Each recursive subtree except the topmost has the same height
• Since trees are cut at the middle level in each step, this height may be as small as (log B)/2, for a subtree of size Θ(√B), but no smaller

CO Searching: O(log_B N) cache misses
• A search visits the nodes along a root-to-leaf path of length log N, visiting a sequence of recursive subtrees along the way
• All but the first recursive subtree have height at least (log B)/2, so the number of visited recursive subtrees is ≤ 1 + 2(log N)/(log B) = 1 + 2 log_B N
• Each recursive subtree may incur up to two memory transfers, for a total of ≤ 2 + 4 log_B N memory transfers
• Faster than trivial search by (log₂ N)/(4 log_B N) = (log₂ B)/4
• (log₂ B)/2 is more realistic (each recursive subtree in a single block)
• For disk blocks of 1024 elements, expect a speedup of ≈ 5 (or ≈ 2.5)

Experiments on CO Searching
[Figure: search times with 256-byte tree nodes.]

Resilient Algorithms and Data Structures

Memory Errors
Memory error: one or multiple bits are read differently from how they were last written. Many possible causes:
• electrical or magnetic interference (cosmic rays)
• hardware problems (a bit permanently damaged)
• corruption in the data path between memories and processing units
Errors in DRAM devices have been a concern for a long time [May & Woods 79, Ziegler et al 79, Chen & Hsiao 84, Normand 96, O'Gorman et al 96, Mukherjee et al 05, …]

Memory Errors
Soft errors: randomly corrupt bits, but do not leave any physical damage (e.g., cosmic rays).
Hard errors: corrupt bits in a repeatable manner because of a physical defect (e.g., stuck bits), i.e., hardware problems.

Error Correcting Codes (ECC)
Error correcting codes (ECC) allow detection and correction of one or multiple bit errors. Typical ECC is SECDED (i.e., single error correct, double error detect); Chip-Kill can correct up to 4 adjacent bits at once.
ECC has several overheads in terms of performance (33%), size (20%) and money (10%). ECC memory chips are mostly used in memory systems for server machines rather than for client computers.
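To make the SECDED distinction concrete, here is a toy extended Hamming (8,4) codec: it corrects any single flipped bit and detects (but cannot correct) any double flip. This is only an illustration of the principle; real ECC DRAM applies the same idea to much wider words and is engineered quite differently.

#include <cstdint>

// Encode 4 data bits d3..d0 into 8 bits: positions 1, 2, 4 hold Hamming parity,
// positions 3, 5, 6, 7 hold the data, and position 0 holds an overall parity bit.
uint8_t secdedEncode(uint8_t d) {
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;                       // covers positions 3, 5, 7
    uint8_t p2 = d0 ^ d2 ^ d3;                       // covers positions 3, 6, 7
    uint8_t p4 = d1 ^ d2 ^ d3;                       // covers positions 5, 6, 7
    uint8_t c = (uint8_t)((p1 << 1) | (p2 << 2) | (d0 << 3) | (p4 << 4)
                        | (d1 << 5) | (d2 << 6) | (d3 << 7));
    uint8_t overall = 0;
    for (int i = 1; i < 8; ++i) overall ^= (c >> i) & 1;
    return (uint8_t)(c | overall);                   // overall parity in bit 0
}

enum class EccResult { Ok, Corrected, DoubleError };

// Decode: writes the (possibly corrected) data bits to *d and reports what happened.
EccResult secdedDecode(uint8_t c, uint8_t* d) {
    auto bit = [&](int i) { return (c >> i) & 1; };
    uint8_t s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    uint8_t s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    uint8_t s4 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s4 << 2));  // position of a single error
    uint8_t parity = 0;
    for (int i = 0; i < 8; ++i) parity ^= bit(i);              // parity over all 8 bits
    EccResult r = EccResult::Ok;
    if (parity == 1) {                               // odd number of flips: a single error
        c ^= (syndrome != 0) ? (uint8_t)(1u << syndrome) : (uint8_t)1u;
        r = EccResult::Corrected;
    } else if (syndrome != 0) {
        r = EccResult::DoubleError;                  // even number of flips, inconsistent syndrome
    }
    *d = (uint8_t)(((c >> 3) & 1) | (((c >> 5) & 1) << 1)
                 | (((c >> 6) & 1) << 2) | (((c >> 7) & 1) << 3));
    return r;
}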
Impact of Memory Errors
The consequence of a memory error is system dependent:
1. Correctable errors: fixed by ECC
2. Uncorrectable errors:
2.1. Detected: explicit failure (e.g., a machine reboot)
2.2. Undetected:
2.2.1. Induced failure (e.g., a kernel panic)
2.2.2. Unnoticed (but the application is corrupted: e.g., segmentation fault, file not found, file not readable, …)

How Common are Memory Errors?
[Figures: field-study measurements of memory error rates.]
[Schroeder et al 2009]: experiments over 2.5 years (Jan 06 – Jun 08) on the Google fleet (on the order of 10^4 machines, with ECC memory). Memory errors are NOT rare events!

Memory Errors
Not all machines (clients) have ECC memory chips. The increased demand for larger capacities at low cost just makes the problem more serious (large clusters of inexpensive memories). We need reliable computation in the presence of memory faults.

Memory Errors
Other scenarios in which memory errors have an impact (and seem to be modeled in an adversarial setting):
• Memory errors can cause security vulnerabilities:
– Fault-based cryptanalysis [Boneh et al 97, Xu et al 01, Bloemer & Seifert 03]
– Attacking Java Virtual Machines [Govindavajhala & Appel 03]
– Breaking smart cards [Skorobogatov & Anderson 02, Bar-El et al 06]
• Avionics and space electronic systems: the amount of cosmic rays increases with altitude (soft errors)

Memory Errors in Space
[Figures: examples of memory errors in avionics and space systems.]

Recap on Memory Errors
"I'm thinking of getting back into crime, Luigi. Legitimate business is too corrupt…"
1. Memory errors can be harmful: uncorrectable memory errors cause some catastrophic event (reboot, kernel panic, data corruption, …)

A Small Example
Classical algorithms may not be correct in the presence of (even very few) memory errors.
An example: merging two ordered lists. Take A = 80 2 3 … 10 (the first key, originally 1, has been corrupted to 80) and B = 11 12 … 20. A standard merge outputs 11 12 … 20 80 2 3 … 10: a single corrupted key displaces Θ(n) correct keys and produces Θ(n²) inversions in the output.
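The example can be reproduced in a few lines. The sketch below (with arbitrary small inputs) runs an ordinary two-pointer merge that is oblivious to the corruption and then counts how many correct keys end up out of order; with one corrupted key it reports Θ(n) displaced correct keys.

#include <cstddef>
#include <iostream>
#include <vector>

// Naive two-pointer merge, oblivious to any corruption.
std::vector<int> merge2(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size())
        out.push_back(a[i] <= b[j] ? a[i++] : b[j++]);
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}

int main() {
    const int n = 10;
    std::vector<int> A, B;
    for (int i = 1; i <= n; ++i) A.push_back(i);          // 1..10
    for (int i = n + 1; i <= 2 * n; ++i) B.push_back(i);  // 11..20
    A[0] = 80;                                            // a single corrupted key
    std::vector<int> out = merge2(A, B);

    // Count correct (uncorrupted) keys that have a smaller correct key after them.
    int displaced = 0;
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (out[i] == 80) continue;
        for (std::size_t j = i + 1; j < out.size(); ++j)
            if (out[j] != 80 && out[j] < out[i]) { ++displaced; break; }
    }
    std::cout << "correct keys out of order: " << displaced << "\n";   // prints n
    return 0;
}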
Recap on Memory Errors
"I know my PIN number: it's my name I can't remember…"
2. Memory errors are NOT rare: even a small cluster of computers with a few GB per node can experience one bit error every few minutes.

Memory Errors
In the field study, Google researchers observed mean error rates of 2,000 – 6,000 errors per GB per year (25,000 – 75,000 FIT/Mbit).
Mean time between failures by memory size: 512 MB: 2.92 hours; 1 GB: 1.46 hours; 16 GB: 5.48 minutes; 64 GB: 1.37 minutes; 1 TB: 5.13 seconds.

Recap on Memory Errors
3. ECC may not be available (or may not be enough): no ECC in inexpensive memories; ECC does not guarantee complete fault coverage; it is expensive; the system halts upon detection of uncorrectable errors; service disruption; etc.

Impact of Memory Errors
[Figure: impact of memory errors.]

Resilient Algorithms and Data Structures
Make sure that the algorithms and data structures we design are capable of dealing with memory errors.
Resilient algorithms and data structures: capable of tolerating memory errors on data (even throughout their execution) without sacrificing correctness, performance and storage space.

Faulty-Memory Model [Finocchi, I. 04]
• Memory fault = the correct data stored in a memory location gets altered (destructive faults), at any time
• Faults can appear in any memory location, simultaneously
• We wish to produce correct output on uncorrupted data (in an adversarial model)
• Assumptions:
– Only O(1) words of reliable memory (safe memory)
– Corrupted values are indistinguishable from correct ones
• Even recursion may be problematic in this model.

Terminology
d = upper bound known on the number of memory errors (may be a function of n)
a = actual number of memory errors (that happen during a specific execution)
Note: typically a ≤ d. All the algorithms / data structures described here need to know d in advance.

Other Faulty Models
The design of fault-tolerant algorithms has received attention for 50+ years.
Liar Model [Ulam 77, Renyi 76, …]: comparison questions are answered by a possibly lying adversary; one can exploit query replication strategies.
Fault-tolerant sorting networks [Assaf & Upfal 91, Yao & Yao 85, …]: comparators can be faulty; exploit substantial data replication using fault-free data replicators.
Parallel computations [Huang et al 84, Chlebus et al 94, …]: faults on parallel/distributed architectures; PRAM or DMM simulations (rely on fault-detection mechanisms).

Other Faulty Models
Memory Checkers [Blum et al 93, Blum et al 95, …]: programs are not reliable objects; self-testing and self-correction; essentially error detection and error correction mechanisms.
Robustness in Computational Geometry [Schirra 00, …]: faults come from unreliable computation (geometric precision) rather than from memory errors.
Noisy / Unreliable Computation [Braverman & Mossel 08]: faults (with given probability) come from unreliable primitives (e.g., comparisons) rather than from memory errors.

Outline of the Talk
1. Motivation and Model
2. Resilient Algorithms: Sorting and Searching
3. Resilient Data Structures: Priority Queues, Dictionaries
4. (Ongoing) Experimental Results
5. Conclusions and Open Problems

Resilient Sorting
We are given a set of n keys that need to be sorted. The value of some keys may get arbitrarily corrupted, and we cannot tell which keys are faithful and which are corrupted.
Q1. Can we sort the correct values efficiently in the presence of memory errors?
Q2. How many memory errors can we tolerate in the worst case if we wish to maintain optimal time and space?

Terminology
• Faithful key = never corrupted
• Faulty key = corrupted
• Faithfully ordered sequence = ordered except for the corrupted keys. Example: 1 2 3 4 5 6 80 7 8 9 10 is faithfully ordered (80 is a faulty key).
• Resilient sorting algorithm = produces a faithfully ordered sequence (i.e., we wish to sort correctly all the uncorrupted keys)

Trivially Resilient
A resilient variable consists of (2d+1) copies x_1, x_2, …, x_{2d+1} of a standard variable x. The value of a resilient variable is given by the majority of its copies:
• it cannot be corrupted by faults
• it can be computed in linear time and constant space [Boyer Moore 91]
Trivially-resilient algorithms and data structures have Θ(d) multiplicative overheads in terms of time and space.
Note: trivial resilience does more than ECC (SECDED, Chip-Kill, …).

Trivially Resilient Sorting
Can trivially sort in O(d n log n) time in the presence of d memory errors; i.e., an O(n log n)-time sorting algorithm able to tolerate only O(1) memory errors.
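A sketch of the trivially resilient variable just described: 2d+1 copies written together and read back with Boyer–Moore majority voting, so the value survives up to d corrupted copies. The struct name and interface are illustrative, not from the papers.

#include <cstddef>
#include <vector>

struct ResilientInt {
    std::vector<int> copies;                     // 2d+1 replicas of the value

    explicit ResilientInt(std::size_t d, int value = 0) : copies(2 * d + 1, value) {}

    void write(int value) {                      // rewrite every copy
        for (int& c : copies) c = value;
    }

    int read() const {                           // Boyer-Moore majority vote: linear time, O(1) space
        int candidate = copies[0];
        std::size_t count = 0;
        for (int c : copies) {
            if (count == 0) { candidate = c; count = 1; }
            else if (c == candidate) ++count;
            else --count;
        }
        return candidate;                        // the majority value, since at most d copies are faulty
    }
};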
Resilient Sorting
Upper bound [Finocchi, Grandoni, I. 05]: a comparison-based sorting algorithm that takes O(n log n + d²) time in the presence of d memory errors; i.e., an O(n log n) sorting algorithm able to tolerate up to O((n log n)^{1/2}) memory errors.
Lower bound [Finocchi, I. 04]: any comparison-based resilient O(n log n) sorting algorithm can tolerate the corruption of at most O((n log n)^{1/2}) keys.

Resilient Sorting (cont.)
Integer sorting [Finocchi, Grandoni, I. 05]: a randomized integer sorting algorithm that takes O(n + d²) time in the presence of d memory errors; i.e., an O(n) randomized integer sorting algorithm able to tolerate up to O(n^{1/2}) memory errors.

Resilient Binary Search
Example: a binary search for 5 in the faithfully ordered sequence 1 2 80 3 4 5 10 7 8 9 13 20 26 (the keys 80 and 10 are corrupted) can answer search(5) = false, even though the correct key 5 is present, because the corrupted keys misdirect the probes.
We wish to get correct answers at least on correct keys: search(s) either finds a key equal to s, or determines that no correct key is equal to s. If only faulty keys are equal to s, the answer is uninteresting (we cannot hope for a trustworthy answer).

Trivially Resilient Binary Search
Can search in O(d log n) time in the presence of d memory errors.

Resilient Searching
Upper bounds:
• Randomized algorithm with O(log n + d) expected time [Finocchi, Grandoni, I. 05]
• Deterministic algorithm with O(log n + d) time [Brodal et al. 07]
Lower bounds:
• Ω(log n + d) lower bound (deterministic) [Finocchi, I. 04]
• Ω(log n + d) lower bound on the expected time [Finocchi, Grandoni, I. 05]

Outline of the Talk
1. Motivation and Model
2. Resilient Algorithms: Sorting and Searching
3. Resilient Data Structures: Priority Queues, Dictionaries
4. (Ongoing) Experimental Results
5. Conclusions and Open Problems

Resilient Data Structures
Data structures are more vulnerable to memory errors than algorithms: algorithms are affected by errors during their execution, data structures are affected by errors over their entire lifetime.

Resilient Priority Queues
Maintain a set of elements under insert and deletemin: insert adds an element; deletemin deletes and returns either the minimum uncorrupted value or a corrupted value. This is consistent with resilient sorting.

Resilient Priority Queues
Upper bound: both insert and deletemin can be implemented in O(log n + d) time [Jorgensen et al. 07] (based on cache-oblivious priority queues).
Lower bound: a resilient priority queue with n > d elements must use Ω(log n + d) comparisons to answer an insert followed by a deletemin [Jorgensen et al. 07].

Resilient Dictionaries
Maintain a set of elements under insert, delete and search: insert and delete as usual, search as in resilient searching: search(s) either finds a key equal to s, or determines that no correct key is equal to s. Again, consistent with resilient sorting.

Resilient Dictionaries
A randomized resilient dictionary implements each operation in O(log n + d) time [Brodal et al. 07]. A more complicated deterministic resilient dictionary also implements each operation in O(log n + d) time [Brodal et al. 07].

Resilient Dictionaries
These are pointer-based data structures, and faults on pointers are likely to be more problematic than faults on keys. The randomized resilient dictionaries of Brodal et al. are built on top of traditional (non-resilient) dictionaries; our implementation is built on top of AVL trees.

Outline of the Talk
1. Motivation and Model
2. Resilient Algorithms: Sorting and Searching
3. Resilient Data Structures: Priority Queues, Dictionaries
4. (Ongoing) Experimental Results
5. Conclusions and Open Problems

Experimental Framework
Algorithm / data structure: non-resilient O(f(n)); trivially resilient O(d · f(n)); resilient O(f(n) + g(d)).
Resilient sorting from [Ferraro-Petrillo et al. 09]; resilient dictionaries from [Ferraro-Petrillo et al. 10].
We implemented resilient binary search and heaps. The implementations of resilient sorting and dictionaries are more engineered than those of resilient binary search and heaps.

Experimental Platform
• 2 CPUs Intel Quad-Core Xeon E5520 @ 2.26 GHz
• L1 cache 256 KB, L2 cache 1 MB, L3 cache 8 MB
• 48 GB RAM
• Scientific Linux release with Linux kernel 2.6.18-164
• gcc 4.1.2, optimization flag -O3

Fault Injection
This talk: only random faults. Preliminary experiments (not shown here): error rates depend on memory usage and time. The algorithm / data structure and the fault injection are implemented as separate threads (run on different CPUs).

Resiliency: Why Should We Care?
What is the impact of memory errors? We analyze the impact of errors on mergesort, priority queues and dictionaries using a common framework (sorting), and attempt to measure error propagation: estimate how far the output sequence is from being sorted (because of memory errors).
Heapsort is implemented on an array. For coherence, in AVLSort we do not induce faults on pointers; faults on AVL pointers are measured in a separate experiment.

Error Propagation
• k-unordered sequence = faithfully ordered except for k (correct) keys. Example: 1 80 2 3 4 9 5 7 8 6 10 is 2-unordered (80 is faulty; the correct keys 9 and 6 are out of order).
• k-unordered sorting algorithm = produces a k-unordered sequence, i.e., it faithfully sorts all but k correct keys
• Resilient = 0-unordered, i.e., it faithfully sorts all correct keys

The Importance of Being Resilient
[Plot, n = 5,000,000; x-axis: a.] 0.01% (random) errors in the input → 0.13% errors in the output; 0.02% (random) errors in the input → 0.22% errors in the output.

The Importance of Being Resilient
[Plot, n = 5,000,000; x-axis: a.] 0.01% (random) errors in the input → 0.40% errors in the output; 0.02% (random) errors in the input → 0.47% errors in the output.

The Importance of Being Resilient
[Plot, n = 5,000,000; x-axis: a.] 0.01% (random) errors in the input → 68.20% errors in the output; 0.02% (random) errors in the input → 79.62% errors in the output.

The Importance of Being Resilient
[Plot; x-axis: a.]

Error Amplification
Mergesort: 0.002–0.02% (random) errors in the input → 24.50–79.51% errors in the output!
AVLsort: 0.002–0.02% (random) errors in the input → 0.39–0.47% errors in the output.
Heapsort: 0.002–0.02% (random) errors in the input → 0.01–0.22% errors in the output.
They all show some error amplification; the large variations are likely to depend on data organization.
Note: those are errors on keys. Errors on pointers are more dramatic for pointer-based data structures.

The Importance of Being Resilient
AVL with n = 5,000,000; a errors on the memory used (keys, parent pointers, pointers, etc.); 100,000 searches. Around a searches fail: on average, we are able to complete only about 100,000/a searches before crashing. [Plot; x-axis: a.]

Isn't Trivial Resiliency Enough?
Memory errors are a problem. Do we need to tackle them with new algorithms / data structures? Aren't simple-minded approaches enough?

Isn't Trivial Resiliency Enough?
[Plot: d = 1024.]

Isn't Trivial Resiliency Enough?
[Plot: d = 1024, 100,000 random searches.]

Isn't Trivial Resiliency Enough?
[Plot: d = 512, 100,000 random ops.]

Isn't Trivial Resiliency Enough?
[Plot: d = 1024, 100,000 random ops, no errors on pointers.]

Isn't Trivial Resiliency Enough?
All experiments for 10^5 ≤ n ≤ 5·10^5, d = 1024, unless specified otherwise.
Mergesort: trivially resilient is about 100–200X slower than non-resilient.
Binary search: trivially resilient is about 200–300X slower than non-resilient.
Dictionaries: trivially resilient AVL is about 300X slower than non-resilient.
Heaps: trivially resilient is about 1000X slower than non-resilient (d = 512) [the deletemins are not random and are slow].

Performance of Resilient Algorithms
Memory errors are a problem, and trivial approaches produce slow algorithms / data structures. We need non-trivial (hopefully fast) approaches. How fast can resilient algorithms / data structures be?

Performance of Resilient Algorithms
[Plots: a = d = 1024.]

Performance of Resilient Algorithms
[Plots: a = d = 1024, 100,000 random searches.]

Performance of Resilient Algorithms
[Plots: a = d = 512, 100,000 random ops.]

Performance of Resilient Algorithms
[Plots: a = d = 1024, 100,000 random ops.]

Performance of Resiliency
All experiments for 10^5 ≤ n ≤ 5·10^5, a = d = 1024, unless specified otherwise.
Mergesort: resilient mergesort is about 1.5–2X slower than non-resilient mergesort [trivially resilient mergesort is about 100–200X slower].
Binary search: resilient binary search is about 60–80X slower than non-resilient binary search [trivially resilient is about 200–300X slower].
Heaps: resilient heaps are about 20X slower than non-resilient heaps (a = d = 512) [trivially resilient heaps are about 1000X slower].
Dictionaries: resilient AVL is about 10–20X slower than non-resilient AVL [trivially resilient AVL is about 300X slower].

Larger Data Sets
How well does the performance of resilient algorithms / data structures scale to larger data sets? Previous experiments: 10^5 ≤ n ≤ 5·10^5. New experiments with n = 5·10^6 (no trivially resilient versions).

Larger Data Sets
[Plots: n = 5,000,000; x-axis: a.]

Larger Data Sets
[Plots: 100,000 random searches on n = 5,000,000 elements; log₂ n ≈ 22; x-axis: a.]

Larger Data Sets
[Plots: 100,000 random ops on a heap with n = 5,000,000; log₂ n ≈ 22; x-axis: a.]

Larger Data Sets
[Plots: 100,000 random ops on an AVL with n = 5,000,000; log₂ n ≈ 22; x-axis: a.]

Larger Data Sets
All experiments for n = 5·10^6.
Mergesort [was 1.5–2X for 10^5 ≤ n ≤ 5·10^5]: resilient mergesort is 1.6–2.3X slower (requires ≤ 0.04% more space).
Binary search [was 60–80X for 10^5 ≤ n ≤ 5·10^5]: resilient search is 100–1000X slower (requires ≤ 0.08% more space).
Heaps [was 20X for 10^5 ≤ n ≤ 5·10^5]: resilient heap is 100–1000X slower (requires 100X more space).
Dictionaries [was 10–20X for 10^5 ≤ n ≤ 5·10^5]: resilient AVL is 6.9–14.6X slower (requires about 1/3 of the space).

Sensitivity to d
How critical is the choice of d?
Underestimating d (a > d) compromises resiliency; overestimating d (a << d) gives some performance degradation.

Performance Degradation
a = 32, but the algorithm overestimates d = 1024:
Mergesort: resilient mergesort improves by 9.7% in time and degrades by 0.04% in space.
Binary search: resilient search degrades to 9.8X in time and by 0.08% in space.
Heaps: resilient heap degrades to 13.1X in time and by 59.28% in space.
Dictionaries: resilient AVL degrades by 49.71% in time.

Robustness
Resilient mergesort and dictionaries appear more robust than resilient search and heaps; i.e., resilient mergesort and dictionaries scale better with n and are less sensitive to d (so less vulnerable to bad estimates of d). How much of this is due to the fact that their implementations are more engineered?

Outline of the Talk
1. Motivation and Model
2. Resilient Algorithms: Sorting and Searching
3. Resilient Data Structures: Priority Queues, Dictionaries
4. (Ongoing) Experimental Results
5. Conclusions and Open Problems

Concluding Remarks
• Need for reliable computation in the presence of memory errors
• Investigated basic algorithms and data structures in the faulty-memory model: we do not wish to detect / correct errors, only to produce correct output on correct data
• Tight upper and lower bounds in this model
• After the first tests, resilient implementations of algorithms and data structures look promising

Future Work and Open Problems
• Lower bounds for resilient integer sorting?
• Full repertoire for resilient priority queues (delete, decreasekey, increasekey)?
• Resilient graph algorithms?
• Resilient algorithms oblivious to d?
• Better faulty-memory models?
• More (faster) implementations, engineering and experimental analysis?

Thank You!
"My memory's terrible these days…"

Questions & Answers

Euler Tour
Given a tree T, and a distinguished vertex r of T, an Euler tour of T is a traversal of T that starts and ends at r and traverses every edge exactly twice, once in each direction.
[Figure: a tree rooted at r with its Euler tour.]

Euler Tour
Formally, every undirected edge {u,v} in T is replaced by two directed edges (u,v) and (v,u). The tour starts with an edge (r,w). For every vertex v in T with incoming edges e_1, e_2, …, e_k and outgoing edges e'_1, e'_2, …, e'_k, numbered so that e_i and e'_i have the same endpoints, edge e_i is succeeded by edge e'_{(i mod k)+1} in the Euler tour.

Euler Tour
If we wish to compute the Euler tour as a list (say, because we want to apply list ranking), we can do that in O(sort(N)) I/Os.
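An in-memory sketch of an Euler tour: every undirected tree edge {u,v} becomes the two directed edges (u,v) and (v,u), and a DFS from the root emits them in a valid tour order (down edges when descending, return edges when backtracking). Note that the external-memory construction does not traverse the tree at all: it derives the successor of each incoming edge directly from the sorted adjacency lists, as in the successor rule above. Vertex numbering 0..n-1 and the function name are assumptions for the sketch.

#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::pair<int,int>> eulerTour(int n, int root,
                                          const std::vector<std::pair<int,int>>& treeEdges) {
    std::vector<std::vector<int>> adj(n);
    for (auto [u, v] : treeEdges) { adj[u].push_back(v); adj[v].push_back(u); }

    std::vector<std::pair<int,int>> tour;        // 2(n-1) directed edges, starts and ends at root
    std::vector<std::size_t> next(n, 0);         // next neighbour of each vertex to explore
    std::vector<int> parent(n, -1), stack{root};
    while (!stack.empty()) {
        int v = stack.back();
        if (next[v] < adj[v].size()) {
            int w = adj[v][next[v]++];
            if (w == parent[v]) continue;        // skip the edge back to the parent
            tour.push_back({v, w});              // descend along (v, w)
            parent[w] = v;
            stack.push_back(w);
        } else {
            stack.pop_back();
            if (!stack.empty()) tour.push_back({v, parent[v]});   // return edge (v, parent(v))
        }
    }
    return tour;
}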