Slides (AG)

I/O Efficient Algorithms
Problem
•Data is often too massive to fit in the internal
memory
•The I/O communication between internal memory
(fast) and external memory (slower) can be a
major performance bottleneck
Goal
Design algorithms and data structures for external
memory to exploit locality and parallelism in order to
reduce I/O costs
Fundamental I/O operations
• Scanning
• Sorting
• Searching
• Outputting
Bounds
Operation    I/O bound, D = 1                           I/O bound, general D ≥ 1
Scan(N)      Θ(N/B) = Θ(n)                              Θ(N/(DB)) = Θ(n/D)
Sort(N)      Θ((N/B) log_{M/B}(N/B)) = Θ(n log_m n)     Θ((N/(DB)) log_{M/B}(N/B)) = Θ((n/D) log_m n)
Search(N)    Θ(log_B N)                                 Θ(log_{DB} N)
Output(Z)    Θ(max{1, Z/B}) = Θ(max{1, z})              Θ(max{1, Z/(DB)}) = Θ(max{1, z/D})
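N = problem size (in units of data items)
M = internal memory size (in units of data items)
B = block transfer size (in units of data items)
D = number of independent disk drives
Z = number of items of an answer
n = N/B, m = M/B, z = Z/B (the same quantities in units of blocks)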
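To get a feel for the bounds, the sketch below plugs concrete values into the four formulas. It is only an illustration: the function names are made up for this example, constant factors hidden by Θ are ignored, and the sample sizes (powers of two) are assumptions, not figures from the slides.

```python
import math

def scan_ios(N, B, D=1):
    """Scan(N) ~ N / (D*B) block transfers."""
    return N / (D * B)

def sort_ios(N, M, B, D=1):
    """Sort(N) ~ (n/D) * log_m(n), with n = N/B and m = M/B."""
    n, m = N / B, M / B
    return (n / D) * (math.log2(n) / math.log2(m))

def search_ios(N, B, D=1):
    """Search(N) ~ log_{DB}(N) block transfers."""
    return math.log2(N) / math.log2(D * B)

def output_ios(Z, B, D=1):
    """Output(Z) ~ max(1, Z / (D*B)) block transfers."""
    return max(1, Z / (D * B))

# Assumed sample sizes: 2^30 items, 2^20 items of memory, 2^10 items per block.
N, M, B = 2**30, 2**20, 2**10
print(scan_ios(N, B), sort_ios(N, M, B), search_ios(N, B), output_ios(5, B))
```

With these sizes sorting costs only about twice as many I/Os as scanning, because log_m n = 2.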
Types of problems
Batched – Scan and Sort
Online – Search and Output
External Hashing for Online Dictionary Search
Insert O(1)
Delete O(1)
Lookup O(Output(Z))
Statically allocated tables
• Most commonly/traditionally used
• Can handle only a fixed range of N
• Goal is to develop dynamic external memory
structures that can easily handle different
sizes of data
Extendible Hashing
R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong
• assume that the size K of the range of the hash function is sufficiently
large
• directory consists of an array of 2^d pointers for a given d ≥ 0 (d is the
global depth)
• each item is assigned to the table location corresponding to the d least
significant bits of its hash address
• d is set to the smallest value for which each table location has at most
B items assigned to it
• each table location contains a pointer to a block where its items are
stored
• a lookup takes two I/Os: one to access the directory and one to access
the block storing the item (only one I/O if the directory fits in
internal memory)
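The lookup path described above can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: the names directory_slot and lookup, the use of Python dicts as stand-ins for disk blocks, and the sample data are all assumptions made for the example.

```python
def directory_slot(hash_address, d):
    """Index of the directory slot: the d least significant bits."""
    return hash_address & ((1 << d) - 1)

def lookup(directory, blocks, d, key, hash_fn):
    # I/O 1: read the directory entry (free if the directory is in memory).
    block_id = directory[directory_slot(hash_fn(key), d)]
    # I/O 2: read the block that stores the item.
    return blocks[block_id].get(key)

# Assumed example: global depth d = 2, so the directory has 2^2 = 4 slots.
# Slots 0 and 2 point to the same block, which is allowed.
directory = [0, 1, 0, 2]
blocks = {0: {4: "a", 6: "b"}, 1: {1: "c"}, 2: {3: "d"}}
identity = lambda x: x  # hypothetical hash function for integer keys

print(lookup(directory, blocks, 2, 6, identity))
```

Key 6 has low bits 10, so it lands in slot 2 and is found in block 0.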
Improving Storage Utilization
• a table location may hold fewer than B items, so several table
locations can share the same disk block for storing their items
• a table location shares a disk block with all the other table locations
having the same k least significant bits in their address
• k is chosen to be as small as possible so that the pooled items fit into a
single disk block
• each disk block has its own local depth
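As a small illustration of the sharing rule, the sketch below lists which directory slots point to a block with local depth k when the global depth is d. The function name and the sample values are assumptions for this example only.

```python
def slots_sharing_block(low_bits, k, d):
    """All directory slots (d-bit indices) whose k least significant
    bits equal low_bits. A block with local depth k is shared by
    exactly 2^(d - k) slots."""
    return [low_bits + (i << k) for i in range(1 << (d - k))]

# Assumed example: global depth 3, block with local depth 1 and low bit 1.
print(slots_sharing_block(0b1, 1, 3))
```

When k = d the block is referenced by a single slot, which is the case where an overflow forces the directory to double.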
Inserting New Items
• when a new item is inserted, and its disk block overflows, the global
depth d and the block's local depth k are recalculated so that the
invariants on d and k once again hold
• this is done by splitting the block that overflows and redistributing its
items
• each time the global depth d is incremented by 1, the directory doubles
in size (this is how the hash table is able to adapt to a growing N)
• pointers in the new directory are set to the appropriate disk blocks
• the disk blocks themselves do not need to be changed during doubling,
except for the one block where the overflow has occurred
Inserting New Items contd.
• let hash_d be the hash function corresponding to the d least significant
bits of hash (hash_d(x) = hash(x) mod 2^d)
• initially a single disk block is created to store the data items, and all
the slots in the directory are initialized to point to the block
• the local depth k of the block is set to 0
• when a new item with key value x is inserted, it is stored in the disk
block pointed to by directory slot hash_d(x)
• if block b overflows as a result, then b is split into two blocks, the
original block b and a new block b’, and its items are redistributed
based upon the (b.k + 1)st least significant bit of hash(x) (b.k = b’s
local depth)
• b.k is incremented by 1 and that value is also stored in b’.k
• if a block is still overflowing, it is split again and the local depths are
incremented until overflow no longer occurs
Inserting New Items contd.
• after all splits are done, if b.k ≤ d, we just update those directory
pointers originally pointing to b that need to be changed
• if b.k > d then the directory is not large enough to accommodate hash
addresses with b.k bits, so we repeatedly double the directory
size and increment the global depth d by 1 until d = b.k
• once again:
- pointers in the new directory are initialized to point to the
appropriate disk blocks
- the disk blocks do not need to be modified during doubling,
except for the block that overflows
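The insertion procedure can be sketched in memory as follows. This is a simplified illustration under stated assumptions, not the original authors' code: the class names Block and ExtendibleHash are hypothetical, Python dicts stand in for disk blocks, and the directory is doubled one step at a time as each split requires it, rather than after all splits are done (the final state is the same). It also assumes that no more than B keys share an identical hash address, otherwise splitting would never terminate.

```python
class Block:
    def __init__(self, k):
        self.k = k        # local depth
        self.items = {}   # key -> value, at most B entries when not overflowing

class ExtendibleHash:
    def __init__(self, B, hash_fn=hash):
        self.B = B
        self.hash_fn = hash_fn
        self.d = 0                     # global depth
        self.directory = [Block(0)]    # 2^d slots, initially one block with k = 0

    def _slot(self, key):
        return self.hash_fn(key) & ((1 << self.d) - 1)   # hash_d(key)

    def lookup(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def insert(self, key, value):
        block = self.directory[self._slot(key)]
        block.items[key] = value
        while len(block.items) > self.B:       # split until no overflow
            self._split(block)
            block = self.directory[self._slot(key)]

    def _split(self, b):
        if b.k == self.d:
            # Directory too small: double it. Slot i and slot i + 2^d share
            # their d low bits, so the copy points to the same blocks.
            self.directory = self.directory + self.directory
            self.d += 1
        bit = 1 << b.k                 # the (b.k + 1)st least significant bit
        b.k += 1
        new = Block(b.k)
        for key in list(b.items):      # redistribute the items of b
            if self.hash_fn(key) & bit:
                new.items[key] = b.items.pop(key)
        for i, blk in enumerate(self.directory):
            if blk is b and (i & bit): # update only pointers that must change
                self.directory[i] = new

table = ExtendibleHash(B=2, hash_fn=lambda x: x)   # identity hash, assumed
for key in range(8):
    table.insert(key, key * 10)
print(table.d, len(table.directory), table.lookup(3))
```

Only the overflowing block is rewritten during a split; all other blocks are untouched, and doubling merely copies pointers.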
Deleting Items
• deletion is handled very similarly to insertion
• when two blocks with the same local depth k contain items whose
hash addresses share the same k-1 least significant bits and can
fit together into a single block, then their items are merged into
a single block with a decremented value of k
• the combined size of the blocks being merged must be sufficiently less
than B to prevent immediate splitting after a subsequent
insertion
• the directory shrinks by half and the global depth d is decremented by
1, when all the local depths are less than the current value of d
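The merge and shrink conditions can be written down as small predicates. The sketch below is illustrative only: the function names are hypothetical, and the fill_limit parameter is an assumed way to express "sufficiently less than B", since the slides do not give a specific threshold.

```python
def buddy_low_bits(low_bits, k):
    """Low k bits of the sibling block: same k-1 low bits,
    opposite k-th least significant bit."""
    return low_bits ^ (1 << (k - 1))

def can_merge(size_a, size_b, k_a, k_b, B, fill_limit=0.75):
    """Two sibling blocks may merge if they have the same local depth
    and their combined size stays safely below B (fill_limit is an
    assumed threshold, not a value from the slides)."""
    return k_a == k_b and k_a > 0 and size_a + size_b <= fill_limit * B

def can_halve_directory(local_depths, d):
    """The directory can shrink when every local depth is below d."""
    return d > 0 and all(k < d for k in local_depths)

print(buddy_low_bits(0b101, 3), can_merge(1, 1, 2, 2, B=4),
      can_halve_directory([1, 1], d=2))
```

Keeping the threshold below B leaves room for a few insertions before the merged block has to split again.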
Some Numbers
• the expected number of disk blocks required to store the data items
is about n / ln 2 (with n = N/B), therefore the blocks tend to be about 69% full
• at least Ω(n/B) blocks are needed to store the directory
• P. Flajolet showed that on the average the directory uses
Θ(N^(1/B) n/B) = Θ(N^(1+1/B) / B^2) blocks, which can be superlinear in N
asymptotically
• for practical values of N and B, the N^(1/B) term is a small constant,
typically less than 2, and the directory size is within a constant
factor of the optimum
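A quick numerical check of these figures, as a sketch: the function names are invented for this example, constant factors in the Θ bound are ignored, and the sample values N = 10^9 and B = 100 are assumptions chosen only to show the order of magnitude.

```python
import math

def expected_data_blocks(N, B):
    """About n / ln 2 blocks, with n = N/B."""
    return (N / B) / math.log(2)

def expected_utilization():
    """Blocks are about ln 2 (roughly 69%) full on average."""
    return math.log(2)

def directory_factor(N, B):
    """The N^(1/B) term in the directory size bound."""
    return N ** (1.0 / B)

def directory_blocks(N, B):
    """N^(1+1/B) / B^2, ignoring constant factors."""
    return N ** (1.0 + 1.0 / B) / B ** 2

N, B = 10**9, 100   # assumed sample values
print(round(expected_utilization(), 3), round(directory_factor(N, B), 2))
```

For these values the N^(1/B) factor is about 1.23, consistent with the claim that it is a small constant in practice.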
So...
• the resulting directory is equivalent to the leaves of a perfectly
balanced tree, in which the search path for each item is
determined by its hash address, except that hashing allows the
leaves of the tree to be accessed directly in a single I/O
• therefore any item can be retrieved in a total of two I/Os
• if the directory fits in internal memory, only one
I/O is needed
The End
Jeff Vitter's survey paper:
http://www.cs.duke.edu/~jsv/Papers/catalog/node38.html