I/O-Algorithms Lars Arge Spring 2012 March 6, 2012 I/O-algorithms I/O-Model D Block I/O M • Parameters N = # elements in problem instance B = # elements that fits in disk block M = # elements that fits in main memory T = # output size in searching problem • We often assume that M>B2 P Lars Arge • I/O: Movement of block between memory and disk 2 I/O-algorithms Fundamental Bounds Internal N External • Sorting: • Permuting N log N N B N log M B NB min{ N , NB log M B • Searching: log 2 N log B N • Scanning: Lars Arge N B N } B 3 I/O-algorithms Fundamental Data Structures • B-trees: Node degree (B) queries in O(log B N T B) – Rebalancing using split/fuse updates in O(log B N ) • Weight-balanced B-tress: Weight rather than degree constraint Ω(w(v)) updates below v between rebalancing operations on v • Persistent B-trees: – Update in current version in O(log B N ) – Search in all previous versions in O(log B N T B) • Buffer trees – Batching of operations to obtain O( B1 log M B NB ) bounds O( NB log M NB ) construction algorithms B Lars Arge 4 I/O-algorithms Last time: Interval management • Maintain N intervals with unique endpoints dynamically such that stabbing query with point x can be answered efficiently x • Static solution: Persistent B-tree – Linear space and O(log B N T B) query • Dynamic solution: External interval tree – O(log B N ) update Lars Arge 5 I/O-algorithms Internal Interval Tree • Base tree on endpoints – “slab” Xv associated with each node v • Interval stored in highest node v where it contains midpoint of Xv • Intervals Iv associated with v stored in – Left slab list sorted by left endpoint (search tree) – Right slab list sorted by right endpoint (search tree) Linear space and O(log N) update Lars Arge 6 I/O-algorithms Internal Interval Tree x • Query with x on left side of midpoint of Xroot – Search left slab list left-right until finding non-stabbed interval – Recurse in left child O(log N+T) query bound Lars Arge 7 I/O-algorithms Externalizing Interval Tree • Natural idea: – Block tree – Use B-tree for slab lists • Number of stabbed intervals in large slab list may be small (or zero) – We can be forced to do I/O in each of O(log N) nodes Lars Arge 8 I/O-algorithms Externalizing Interval Tree ( B ) multislab • Idea: – Decrease fan-out to ( B ) height remains O(log B N ) – ( B ) slabs define (B) multislabs – Interval stored in two slab lists (as before) and one multislab list – Intervals in small multislab lists collected in underflow structure – Query answered in v by looking at 2 slab lists and not O(log N) Lars Arge 9 I/O-algorithms External Interval Tree • Linear space, O(log B N T B) query, O(log B N ) update • General solution techniques: – Filtering: Charge part of query cost to output – Bootstrapping: * Use O(B2) size structure in each internal node * Constructed using persistence * Dynamic using global rebuilding – Weight-balanced B-tree: Split/fuse in amortized O(1) Lars Arge 10 I/O-algorithms Last time: Three-Sided Range Queries • Interval management: “1.5 dimensional” search x1 x2 (x1,x2) (x,x) x • More general 2d problem: Dynamic 3-sidede range searching – Maintain set of points in plane such that given query (q1, q2, q3), all points q3 (x,y) with q1 x q2 and y q3 can be found efficiently q1 q2 • Linear space, O(logB N+T/B) query static solution using persistence – Dynamic: External priority search tree Lars Arge 11 I/O-algorithms Internal Priority Search Tree 9 16.20 4 5,6 16 19,9 5 9,4 1 1,2 1 4 4,1 5 13 13,3 9 13 19 20,3 16 19 20 • Base tree on x-coordinates with nodes augmented with points • Heap on y-coordinates – Decreasing y values on root-leaf path – (x,y) on path from root to leaf holding x – If v holds point then parent(v) holds point Linear space and O(log N) update Lars Arge 12 I/O-algorithms Internal Priority Search Tree 9 16.20 4 4 5,6 19 4 16 19,9 5 9,4 1 1,2 1 4 4,1 5 13 13,3 9 13 19 20,3 16 19 20 • Query with (q1, q2, q3) starting at root v: – Report point in v if satisfying query – Visit both children of v if point reported – Always visit child(s) of v on path(s) to q1 and q2 O(log N+T) query Lars Arge 13 I/O-algorithms Externalizing Priority Search Tree 9 16.20 4 5,6 16 19,9 5 9,4 1 1,2 1 4 4,1 5 13 13,3 9 13 19 20,3 16 19 20 • Natural idea: Block tree • Problem: – O(log B N ) I/Os to follow paths to to q1 and q2 – But O(T) I/Os may be used to visit other nodes (“overshooting”) O(log B N T ) query Lars Arge 14 I/O-algorithms Externalizing Priority Search Tree 9 16.20 4 5,6 16 19,9 5 9,4 1 1,2 1 4 4,1 5 13 13,3 9 13 19 20,3 16 19 20 • Solution idea: – Store B points in each node * O(B2) points stored in each supernode * B output points can pay for “overshooting” – Bootstrapping: * Store O(B2) points in each supernode in static structure Lars Arge 15 I/O-algorithms Summary/Conclusion: Priority Search Tree • We have now discussed structures for special cases of twodimensional range searching – Space: O(N/B) q q3 T – Query: O(log B N B) – Updates: O(log B N ) q1 q q2 • Cannot be obtained for general (4-sided) 2d range searching: logB N ) space – O(logcB N ) query requires Ω( NB log log B BN – O ( NB ) space requires ( N B ) query q4 q3 q1 Lars Arge q2 16 I/O-algorithms External Range Tree • Base tree: Weight balanced tree with branching parameter Θ(log B N ) and leaf parameter B on x-coordinates logB N O(log logB N N ) O( log log ) height N B B • Points below each node stored in 4 linear space secondary structures: – “Right” priority search tree (log B N ) – “Left” priority search tree – B-tree on y-coordinates – Interval (priority search) tree logB N O( NB log log ) N space B Lars Arge B 17 I/O-algorithms External Range Tree • Secondary interval tree: – Connect points in each slab in y-order – Project obtained segments in y-axis (log B N ) – Intervals stored in interval tree * Interval augmented with pointer to corresponding points in ycoordinate B-tree in corresponding child node Lars Arge 18 I/O-algorithms External Range Tree • Query with (q1, q2, q3 , q4) answered in top node with q1 and q2 in different slabs v1 and v2 • Points in slab v1 – Found with 3-sided query in v1 (log B N ) using right priority search tree • Points in slab v2 – Found with 3-sided query in v2 using left priority search tree v1 v2 • Points in slabs between v1 and v2 – Answer stabbing query with q3 using interval tree first point above q3 in each of the O(log B N ) slabs – Find points using y-coordinate B-tree in O(log B N ) slabs Lars Arge 19 I/O-algorithms External Range Tree • Query analysis: – O(log B N ) I/Os to find relevant node – O(log B N T B) I/Os to answer two 3-sided queries – O(log B N logB N B ) O(log B N ) I/Os to query interval tree – O(log B N T B) I/Os to traverse O(log B N ) B-trees O(log B N T B) I/Os (log B N ) v1 Lars Arge v2 20 I/O-algorithms External Range Tree • Insert: – Insert x-coordinate in weight-balanced B-tree * Split of v can be performed in O(w(v) log B w(v)) I/Os log2B N O( ) I/Os logB logB N logB N ) nodes on one – Update secondary structures in all O( log log B BN root-leaf path (log B N ) * Update priority search trees * Update interval tree * Update B-tree log2B N O( log log N ) I/Os B B • Delete: v1 v2 – Similar and using global rebuilding Lars Arge 21 I/O-algorithms Summary: External Range Tree log N B • 2d range searching in O( NB log log ) space B BN – O(log B N T B) I/O query log2B N – O( log log N ) I/O update B B • Optimal among O(log B N T B) query structures q4 q3 q1 Lars Arge q2 22 I/O-algorithms kdB-tree • kd-tree: – Recursive subdivision of point-set into two half using vertical/horizontal line – Horizontal line on even levels, vertical on uneven levels – One point in each leaf Linear space and logarithmic height Lars Arge 23 I/O-algorithms kd-Tree: Query • Query – Recursively visit nodes corresponding to regions intersecting query – Report point in trees/nodes completely contained in query • Query analysis – Horizontal line intersect Q(N) = 2+2Q(N/4) = O( N ) regions – Query covers T regions O( N T ) I/Os worst-case Lars Arge 24 I/O-algorithms kdB-tree • kdB-tree: – Stop subdivision when leaf contains between B/2 and B points – BFS-blocking of internal nodes • Query as before – Analysis as before but each region now contains Θ(B) points O( N B T B) I/O query Lars Arge 25 I/O-algorithms Construction of kdB-tree • Simple O( NB log 2 NB ) algorithm – Find median of y-coordinates (construct root) – Distribute point based on median – Recursively build subtrees – Construct BFS-blocking top-down • Idea in improved O( NB log M B NB ) algorithm – Construct log M B levels at a time using O(N/B) I/Os Lars Arge 26 I/O-algorithms Construction of kdB-tree • Sort N points by x- and by y-coordinates using O( NB log M B NB ) I/Os • Building log M B levels ( M B nodes) in O(N/B) I/Os: 1. Construct M B by M B grid with N M B points in each slab 2. Count number of points in each grid cell and store in memory 3. Find slab s with median x-coordinate 4. Scan slab s to find median x-coordinate and construct node 5. Split slab containing median x-coordinate and update counts 6. Recurse on each side of median x-coordinate using grid (step 3) Grid grows to M B M B M B ( M B ) during algorithm Each node constructed in O( N /( M B B)) I/Os Lars Arge 27 I/O-algorithms kdB-tree • kdB-tree: – Linear space – Query in O( N B T B) I/Os – Construction in O( NB log M / B NB ) I/Os – Point search in O(log B N ) I/Os • Dynamic? – Deletions relatively easily in O(log 2B N ) I/Os (partial rebuilding) Lars Arge 28 I/O-algorithms kdB-tree Insertion using Logarithmic Method • Partition pointset S into subsets S0, S1, … Slog N, |Si| = 2i or |Si| = 0 • Build kdB-tree Di on Si .................................. 2 0 2 log N 1 2 2 2 log N • Query: Query each Di O( 2 B Ti B ) O( N B T B) i 0 i 1 j i • Insert: Find first empty Di and construct Di out of 1 j 0 2 2 elements in S0,S1, … Si-1 i – O( 2B log M / B 2B ) I/Os O( B1 log M / B NB ) per moved point – Point moved O(log N) times 2 O( B1 log M / B NB log N ) O(log B N ) I/Os amortized i Lars Arge i 29 I/O-algorithms kdB-tree Insertion and Deletion • Insert: Use logarithmic method ignoring deletes • Delete: Simply delete point p from relevant Di – i can be calculated based on # insertions since p was inserted – # insertions calculated by storing insertion number of each point in separate B-tree .................................. extra update cost O(log B N ) 2 0 2 1 2 2 2 log N • To maintain O(log N) structures Di – Perform global rebuild after every Θ(N) updates O( B1 log M / B NB ) O(log B N ) extra update cost Lars Arge 30 I/O-algorithms Summary: kdB-tree • 2d range searching in O(N/B) space – Query in O( N B T B) I/Os – Construction in O( NB log M / B NB ) I/Os – Updates in O(log 2B N ) I/Os q4 q3 • Optimal query among linear space structures Lars Arge q1 q2 31 I/O-algorithms O-Tree Structure • O-tree: – B-tree on ( N B log B N ) vertical slabs – B-tree on ( N B log B N ) horizontal slabs in each vertical slab 2 – kdB-tree on ( N N 2 ) ( B log B N ) points in each leaf ( log N ) B B N N B B log B N log 2B N B log 2B N Lars Arge 32 I/O-algorithms O-Tree Query • Perform rangesearch with q1 and q2 in vertical B-tree – Query all kdB-trees in leaves of two horizontal B-trees with xinterval intersected but not spanned by query – Perform rangesearch with q3 and q4 in horizontal B-trees with xinterval spanned by query * Query all kdB-trees with range intersected by query N N B B log B N log 2B N B log 2B N Lars Arge 33 I/O-algorithms O-Tree Query Analysis • Vertical B-tree query: O(log B ( N B log B N )) O( N B ) • Query of all kdB-trees in leaves of two horizontal B-trees: O( N B log B N ) O( B log 2B N B ) O( TB ) O( N B TB ) • Query O ( N B log B N ) horizontal B-trees: O( N B log B N ) O(log B ( N B log B N )) O( N B ) • Query 2 O( N B log B N ) kdB-trees not completely in query 2 O( N B log B N ) O( B log 2B N B ) O( TB ) O( N B TB ) • Query in kdB-trees completely contained in query: O( TB ) O( N B TB ) I/Os Lars Arge 34 I/O-algorithms O-Tree Update • Insert: – Search in vertical B-tree: O(log B N ) I/Os – Search in horizontal B-tree: O(log B N ) I/Os – Insert in kdB-tree: O(log2B ( B log 2B N )) O(log B N ) I/Os • Use global rebuilding when structures grow too big/small – B-trees not contain ( N B log B N ) elements – kdB-trees not contain ( B log 2B N ) elements O(log B N ) I/Os • Deletes can be handled in O(log B N ) I/Os similarly Lars Arge 35 I/O-algorithms Summary: O-Tree • 2d range searching in linear space – O( N B TB ) I/O query – O(log B N ) I/O update • Optimal among structures using linear space q4 q3 q1 q2 • Can easily be extended to work in d-dimensions 1 1 d T N O (( ) ) with optimal query bound B B Lars Arge 36 I/O-algorithms Summary/Conclusion: 3 and 4-sided Queries • 3-sided 2d range searching: External priority search tree – O(log B N T B) query, O ( NB ) space, O(log B N ) update q4 q3 q3 q1 q2 q1 q2 • General (4-sided) 2d range searching: logB N N T O (log N ) O ( ) space, – External2 range tree: B B query, B logB logB N logB N O( log log ) update B BN – O-tree: O ( N B T B ) query, O ( NB ) space, O(log B N ) update Lars Arge 37 I/O-algorithms Summary/Conclusion: Tools and Techniques • Tools: – B-trees – Persistent B-trees – Buffer trees – Logarithmic method – Weight-balanced B-trees – Global rebuilding • Techniques: – Bootstrapping – Filtering (x,x) q3 q1 q4 q3 q1 Lars Arge q2 q2 38 I/O-algorithms References • External Memory Geometric Data Structures Lecture notes by Lars Arge. – Section 8-9 Lars Arge 39