Slides (AG)

I/O Efficient Algorithms

Problem
• Data is often too massive to fit in internal memory
• The I/O communication between internal memory (fast) and external memory (slower) can be a major performance bottleneck

Goal
Design algorithms and data structures for external memory that exploit locality and parallelism in order to reduce I/O costs

Fundamental I/O operations
• Scanning
• Sorting
• Searching
• Outputting

Bounds

Operation   I/O bound, D = 1                           I/O bound, general D ≥ 1
Scan(N)     Θ(N/B) = Θ(n)                              Θ(N/(DB)) = Θ(n/D)
Sort(N)     Θ((N/B) log_{M/B}(N/B)) = Θ(n log_m n)     Θ((N/(DB)) log_{M/B}(N/B)) = Θ((n/D) log_m n)
Search(N)   Θ(log_B N)                                 Θ(log_{DB} N)
Output(Z)   Θ(max{1, Z/B}) = Θ(max{1, z})              Θ(max{1, Z/(DB)}) = Θ(max{1, z/D})

N = problem size (in units of data items)
M = internal memory size (in units of data items)
B = block transfer size (in units of data items)
D = number of independent disk drives
Z = number of items of an answer
n = N/B, m = M/B, z = Z/B (the same quantities in units of blocks)

Types of problems
Batched – Scan and Sort
Online – Search and Output

External Hashing for Online Dictionary Search
Insert O(1)
Delete O(1)
Lookup O(Output(Z))

Statically allocated tables
• Most commonly/traditionally used
• Can handle only a fixed range of N
• Goal is to develop dynamic external memory structures that can easily handle different sizes of data

Extendible Hashing
R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong
• assume that the size K of the range of the hash function is sufficiently large
• the directory consists of an array of 2^d pointers for a given d ≥ 0 (d is the global depth)
• each item is assigned to the table location corresponding to the d least significant bits of its hash address
• d is set to the smallest value for which each table location has at most B items assigned to it
• each table location contains a pointer to a block where its items are stored
• a lookup takes two I/Os: one to access the directory and one to access the block storing the item (only one I/O if the directory fits in internal memory)

Improving Storage Utilization
• a table location may hold fewer than B items, so several table locations can share the same disk block for storing their items
• a table location shares a disk block with all the other table locations having the same k least significant bits in their address
• k is chosen to be as small as possible so that the pooled items fit into a single disk block
• each disk block has its own local depth k

Inserting New Items
• when a new item is inserted and its disk block overflows, the global depth d and the block's local depth k are recalculated so that the invariants on d and k once again hold
• this is done by splitting the block that overflows and redistributing its items
• if the global depth d must be incremented by 1, the directory doubles in size (this is how the hash table is able to adapt to the growing N)
• pointers in the new directory are set to the appropriate disk blocks
• the disk blocks themselves do not need to be changed during doubling, except for the one block where the overflow has occurred

Inserting New Items contd.
• let hash_d be the hash function corresponding to the d least significant bits of hash: hash_d(x) = hash(x) mod 2^d
• initially a single disk block is created to store the data items, and all the slots in the directory are initialized to point to that block
• the local depth k of the block is set to 0
• when a new item with key value x is inserted, it is stored in the disk block pointed to by directory slot hash_d(x)
• if as a result block b overflows, then b is split into two blocks, the original block b and a new block b', and its items are redistributed based upon the (b.k + 1)st least significant bit of hash(x) (b.k = b's local depth)
• b.k is incremented by 1 and that value is also stored in b'.k
• if b or b' is still overflowing, the splitting is repeated and the local depths are incremented until overflow no longer occurs

Inserting New Items contd.
• after all splits are done, if b.k ≤ d, we just update those directory pointers originally pointing to b that need to be changed
• if b.k > d, then the directory is not large enough to accommodate hash addresses with b.k bits, so we repeatedly double the directory size and increment the global depth d by 1 until d = b.k
• once again:
  - pointers in the new directory are initialized to point to the appropriate disk blocks
  - the disk blocks do not need to be modified during doubling, except for the block that overflows

Deleting Items
• deletion is handled very similarly to insertion
• when two blocks with the same local depth k contain items whose hash addresses share the same k-1 least significant bits and can fit together into a single block, then their items are merged into a single block with a decremented value of k
• the combined size of the blocks being merged must be sufficiently less than B to prevent immediate splitting after a subsequent insertion
• the directory shrinks by half and the global depth d is decremented by 1 when all the local depths are less than the current value of d

Some Numbers
• the expected number of disk blocks required to store the data items is n/ln 2, therefore the blocks tend to be about 69% full (ln 2 ≈ 0.69)
• at least Ω(n/B) blocks are needed to store the directory
• P. Flajolet showed that on average the directory uses Θ(N^{1/B} n/B) = Θ(N^{1+1/B}/B^2) blocks, which can be superlinear in N asymptotically
• for practical values of N and B, the N^{1/B} term is a small constant, typically less than 2, and the directory size is within a constant factor of the optimum

So...
• the resulting directory is equivalent to the leaves of a perfectly balanced tree, in which the search path for each item is determined by its hash address, except that hashing allows the leaves of the tree to be accessed directly in a single I/O
• therefore any item can be retrieved in a total of two I/Os
• if the directory fits in internal memory, only one I/O is needed

The End
Jeff Vitter's survey paper: http://www.cs.duke.edu/~jsv/Papers/catalog/node38.html
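
The insertion and lookup rules from the slides can be tried out with a small in-memory sketch. It is an illustration under stated assumptions, not the scheme from the original paper: the names Block and ExtendibleHash are hypothetical, Python's built-in hash stands in for the hash function, a dict stands in for a disk block, and the directory is doubled during a split rather than after all splits are done (the final directory should be the same). Deletion and merging are left out.

```python
class Block:
    """Stand-in for a disk block: holds at most B items and a local depth k."""
    def __init__(self, k):
        self.k = k
        self.items = {}  # key -> value


class ExtendibleHash:
    """In-memory sketch of extendible hashing (hypothetical class name)."""
    def __init__(self, B=4):
        self.B = B                    # block capacity in items
        self.d = 0                    # global depth
        self.directory = [Block(0)]   # 2**d slots, initially one shared block

    def _slot(self, key):
        # hash_d(x) = hash(x) mod 2^d, i.e. the d least significant bits
        return hash(key) % (1 << self.d)

    def lookup(self, key):
        # one directory access plus one block access
        return self.directory[self._slot(key)].items.get(key)

    def insert(self, key, value):
        b = self.directory[self._slot(key)]
        b.items[key] = value
        # Split until the block holding the new key no longer overflows.
        # Assumes no more than B keys share one full hash value; otherwise
        # this loop would never terminate.
        while len(b.items) > self.B:
            self._split(b)
            b = self.directory[self._slot(key)]

    def _split(self, b):
        if b.k == self.d:
            # Directory too small: double it. Slot i + 2^d starts out
            # pointing to the same block as slot i.
            self.directory = self.directory + self.directory
            self.d += 1
        bit = 1 << b.k                # the (b.k + 1)st least significant bit
        b.k += 1
        new = Block(b.k)              # b' gets the same local depth as b
        for key in [x for x in b.items if hash(x) & bit]:
            new.items[key] = b.items.pop(key)
        # Redirect the slots that pointed to b and have that bit set.
        for i, blk in enumerate(self.directory):
            if blk is b and i & bit:
                self.directory[i] = new


# Usage: small non-negative ints hash to themselves in CPython, so the
# low-order bits of the keys decide where each item is stored.
table = ExtendibleHash(B=4)
for x in range(100):
    table.insert(x, str(x))
```

Scanning the whole directory in _split keeps the sketch short; an external-memory implementation would update only the affected slots.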
