Quick Review of Apr 10 material • B+-Tree File Organization

Quick Review of Apr 10 material
• B+-Tree File Organization
– similar to B+-tree index
– leaf nodes store records, not pointers to records stored in an
original file
– leaf and interior nodes are different
• B-trees
– search key values appear only once
– pointer to record/bucket for that search key value always stored
with the search key itself, even in interior nodes
• Hashing Overview
– Hash functions
• ideally uniform, random, easy to compute
Today
• Overflow
• Hash file performance
• Hash indices
• Dynamic Hashing (Extendable Hashing)
• Note: HW#3 due next class (April 17)
• HW #4: due Thursday April 24 (9 days from now)
– Questions: 12.11, 12.12, 12.13, 12.16
Overflow
• Overflow occurs when a record cannot be inserted into a bucket
because the bucket is full.
• Overflow can occur for the following reasons:
– too many records (not enough buckets)
– poor hash function
– skewed data:
• multiple records might have the same search key
• multiple search keys might be assigned the same bucket
Overflow (2)
• Overflow is usually handled by one of two methods
– chaining: a bucket becomes a chain of blocks, formed by attaching
overflow buckets to it in a linked list
– double hashing: use a second hash function to find another
(hopefully non-full) bucket
– in theory we could also use the next bucket that has space; this is often
called open hashing or linear probing, and is often used to
construct symbol tables for compilers
• useful where deletion does not occur
• deletion is very awkward with linear probing, so it isn’t useful in most
database applications
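A minimal sketch of overflow chaining, assuming Python's built-in hash() as the hash function and integer keys. The class and method names (Bucket, ChainedHashFile, lookup) are hypothetical and used only for illustration.

```python
class Bucket:
    """One fixed-capacity block; `overflow` links to the next block in the chain."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.overflow = None


class ChainedHashFile:
    def __init__(self, num_buckets, capacity):
        self.capacity = capacity
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _home(self, key):
        # Assumption: built-in hash() stands in for the file's hash function.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        block = self._home(key)
        # Walk the chain until a block with free space is found,
        # appending a new overflow block at the end if needed.
        while len(block.records) >= block.capacity:
            if block.overflow is None:
                block.overflow = Bucket(self.capacity)
            block = block.overflow
        block.records.append((key, value))

    def lookup(self, key):
        """Return (matching values, number of blocks read)."""
        block, blocks_read, found = self._home(key), 0, []
        while block is not None:
            blocks_read += 1
            found.extend(v for k, v in block.records if k == key)
            block = block.overflow
        return found, blocks_read
```

This lookup reads the whole chain so that it can return every record with the given key; a search that stops at the first match would read fewer blocks on average.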
Hashed File Performance Metrics
• An important performance measure is the loading factor:
loading factor = (number of records) / (B × f)
where B is the number of buckets and
f is the number of records that will fit in a single bucket
• When the loading factor gets too high (the file becomes too full),
double the number of buckets and rehash
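The formula and the doubling rule can be sketched as follows. The 0.9 threshold and the function names are assumptions chosen for illustration; the notes do not fix a particular threshold.

```python
def loading_factor(num_records, num_buckets, records_per_bucket):
    """(number of records) / (B * f)."""
    return num_records / (num_buckets * records_per_bucket)


def buckets_after_growth(num_records, num_buckets, records_per_bucket,
                         threshold=0.9):
    """Double the bucket count until the loading factor is at or below
    `threshold` (an assumed value). Rehashing itself is not shown."""
    while loading_factor(num_records, num_buckets, records_per_bucket) > threshold:
        num_buckets *= 2
    return num_buckets
```

For example, 1900 records in 100 buckets of 10 records each give a loading factor of 1.9. Doubling once still leaves 0.95, so a second doubling to 400 buckets is needed.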
Hashed File Performance
(Assume that the hash table is in main memory)
• Successful search: best case 1 block; worst case every
chained bucket; average case half of worst case
• Unsuccessful search: always hits every chained bucket
(best case, worst case, average case)
• With loading factor around 90% and a good hashing
function, average is about 1.2 blocks
• Advantage of hashing: very fast for exact queries
• Disadvantage: records are not sorted in any order. As a
result, it is effectively impossible to do range queries
Hash Indices
• Hashing can be used for index-structure creation as well as
for file organization
• A hash index organizes the search keys (and their record
pointers) into a hash file structure
• Strictly speaking, a hash index is always a secondary index
– if the primary file was stored using the same hash function, an
additional, separate primary hash index would be unnecessary
– We use the term hash index to refer both to secondary hash indices
and to files organized using hashing file structures
Example of a Hash Index
Hash index into file account, on search key account-number.
Hash function computes the sum of the digits in the account number, modulo 7.
Bucket size is 2.
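The hash function in this example can be sketched as below. The function name and the sample account numbers (such as "A-217") are assumptions for illustration; non-digit characters are simply ignored.

```python
def digit_sum_hash(account_number, num_buckets=7):
    """Sum of the decimal digits in the account number, modulo num_buckets."""
    digits = [int(ch) for ch in account_number if ch in "0123456789"]
    return sum(digits) % num_buckets


# Illustrative account numbers (hypothetical values):
for acct in ["A-217", "A-101", "A-110", "A-215"]:
    print(acct, "-> bucket", digit_sum_hash(acct))
```

Note that "A-101" and "A-110" land in the same bucket, since they have the same digit sum. With a bucket size of 2, a third such account number would overflow that bucket.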
Static Hashing
• We’ve been discussing static hashing: the hash function maps search-key
values to a fixed set of buckets. This has some disadvantages:
– databases grow with time. Once buckets start to overflow, performance
will degrade
– if we attempt to anticipate some future file size and allocate sufficient
buckets for that expected size when we build the database initially, we will
waste lots of space
– if the database ever shrinks, space will be wasted
– periodic reorganization avoids these problems, but is very expensive
• By using techniques that allow us to modify the number of buckets
dynamically (“dynamic hashing”) we can avoid these problems
– Good for databases that grow and shrink in size
– Allows the hash function to be modified dynamically
Dynamic Hashing
• One form of dynamic hashing is extendable hashing
– the hash function generates values over a large range, typically b-bit
integers, with b being something like 32
– At any given moment, only a prefix of the hash value is used to index
into a table of bucket addresses
– With the prefix length at a given moment being i, with 0 <= i <= 32, the bucket
address table size is 2^i
– The value of i grows and shrinks as the size of the database grows and shrinks
– Multiple entries in the bucket address table may point to the same bucket
– Thus the actual number of buckets is at most 2^i
– the number of buckets also changes dynamically due to coalescing and
splitting of buckets
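A small sketch of how a prefix of the hash value indexes the bucket address table, assuming 32-bit hash values. The names HASH_BITS and table_index are hypothetical.

```python
HASH_BITS = 32  # b: width of the hash value in bits (assumed to be 32)


def table_index(hash_value, prefix_len):
    """Return the first `prefix_len` high-order bits of a HASH_BITS-bit value.

    The result ranges over 0 .. 2**prefix_len - 1, so it can index a
    bucket address table with 2**prefix_len entries.
    """
    return hash_value >> (HASH_BITS - prefix_len)


x = 0b1011 << 28            # a 32-bit value whose top four bits are 1011
print(table_index(x, 0))    # prefix length 0: the table has a single entry
print(table_index(x, 2))    # first two bits, 10
print(table_index(x, 3))    # first three bits, 101
```

Growing the prefix by one bit doubles the table, and each old entry corresponds to two adjacent new ones.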
General Extendable Hash Structure
Use of Extendable Hash Structure
• Each bucket j stores a value i_j; all the entries that point to the same
bucket have the same values on the first i_j bits
• To locate the bucket containing search key K_j:
– compute H(K_j) = X
– Use the first i high-order bits of X as a displacement into the bucket
address table and follow the pointer to the appropriate bucket
• To insert a record with search-key value K_j
– follow the lookup procedure to locate the bucket, say j
– if there is room in bucket j, insert the record
– Otherwise the bucket must be split and the insertion reattempted
• in some cases we use overflow buckets instead (as explained shortly)
Splitting in Extendable Hash Structure
To split a bucket j when inserting a record with search-key value K_j
• if i > i_j (more than one pointer into bucket j)
– allocate a new bucket z
– set i_j and i_z to the old value of i_j incremented by one
– update the bucket address table (change the second half of the set of
entries pointing to j so that they now point to z)
– remove all the entries in j and rehash them so that they fall in either j or z
– reattempt the insert of K_j. If the bucket is still full, repeat the above.
Splitting in Extendable Hash Structure (2)
To split a bucket j when inserting a record with search-key value K_j
• if i = i_j (only one pointer into bucket j)
– increment i and double the size of the bucket address table
– replace each entry in the bucket address table with two entries that point to
the same bucket
– recompute the new bucket address table entry for K_j
– now i > i_j, so use the first case described earlier
• When inserting a value, if the bucket is still full after several splits
(that is, i reaches some preset value b), give up and create an overflow
bucket rather than splitting the bucket address table further
– how might this occur?
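The lookup, insertion, and both splitting cases can be put together in a compact sketch. It rests on several assumptions not in the notes: zlib.crc32 stands in for the hash function, buckets are in-memory lists, and an over-full list stands in for an overflow bucket once the preset depth limit (max_depth here) is reached. All class and parameter names are hypothetical.

```python
import zlib

HASH_BITS = 32  # b in the notes: the hash function yields b-bit integers


def crc_hash(key):
    """Stand-in 32-bit hash (an assumption; the notes do not name one)."""
    return zlib.crc32(str(key).encode("utf-8"))


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # i_j: prefix bits shared by this bucket
        self.records = []


class ExtendableHash:
    def __init__(self, capacity=2, max_depth=8, hash_fn=crc_hash):
        self.capacity = capacity      # records per bucket
        self.max_depth = max_depth    # preset limit on i (must be <= HASH_BITS)
        self.hash_fn = hash_fn
        self.global_depth = 0         # i: prefix bits used by the table
        self.table = [Bucket(0)]      # bucket address table, 2**i entries

    def _index(self, key):
        # The first i high-order bits of the hash value.
        return self.hash_fn(key) >> (HASH_BITS - self.global_depth)

    def lookup(self, key):
        bucket = self.table[self._index(key)]
        return [v for k, v in bucket.records if k == key]

    def insert(self, key, value):
        while True:
            bucket = self.table[self._index(key)]
            if len(bucket.records) < self.capacity:
                bucket.records.append((key, value))
                return
            if bucket.local_depth == self.max_depth:
                # Give up splitting. Letting the list grow past capacity
                # stands in for chaining an overflow bucket.
                bucket.records.append((key, value))
                return
            if bucket.local_depth == self.global_depth:
                # Case i = i_j: double the table; each old entry becomes
                # two adjacent entries pointing to the same bucket.
                self.table = [b for b in self.table for _ in range(2)]
                self.global_depth += 1
            self._split(bucket)  # case i > i_j

    def _split(self, bucket):
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # Entries pointing to `bucket` are contiguous; repoint the second half.
        entries = [n for n, b in enumerate(self.table) if b is bucket]
        for n in entries[len(entries) // 2:]:
            self.table[n] = new_bucket
        # Rehash the old records; each lands in `bucket` or `new_bucket`.
        old_records, bucket.records = bucket.records, []
        for k, v in old_records:
            self.table[self._index(k)].records.append((k, v))
```

One way the overflow case arises: records that share the same search-key value hash to the same bits, so no amount of splitting separates them. Inserting more such records than fit in one bucket drives i up to the limit, after which the sketch falls back to overflow.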
Deletion in Extendable Hash Structure
To delete a key value K_j
• locate it in its bucket and remove it
• the bucket itself can be removed if it becomes empty (with appropriate
updates to the bucket address table)
• coalescing of buckets is possible
– can only coalesce with a “buddy” bucket having the same value of i_j and
the same (i_j − 1)-bit prefix, if such a bucket exists
• decreasing bucket address table size is also possible
– very expensive
– should only be done if the number of buckets becomes much smaller than
the size of the table
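A sketch of the buddy test used for coalescing, assuming each bucket is identified by its i_j-bit prefix stored as an integer. The function names are hypothetical, and the extra requirement that the merged records fit in one bucket is an assumption of this sketch rather than something stated in the notes.

```python
def buddy_prefix(prefix, local_depth):
    """Prefix of the buddy bucket: same first (local_depth - 1) bits,
    last bit flipped. A bucket with local depth 0 has no buddy."""
    if local_depth == 0:
        return None
    return prefix ^ 1


def can_coalesce(depth_a, depth_b, count_a, count_b, capacity):
    """Buddies may merge only if their local depths match and (an assumption
    of this sketch) the combined records fit in a single bucket."""
    return depth_a == depth_b and count_a + count_b <= capacity
```

For example, a bucket with the 3-bit prefix 101 has the buddy prefix 100. If the two merge, the resulting bucket has a local depth of 2 and prefix 10.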
Extendable Hash Structure Example
Hash function on branch name
Initial hash table (empty)
Extendable Hash Structure Example (2)
Hash structure after insertion of one Brighton and two Downtown records
Extendable Hash Structure Example (3)
Hash structure after insertion of Mianus record
Extendable Hash Structure Example (4)
Hash structure after insertion of three Perryridge records
Extendable Hash Structure Example (5)
Hash structure after insertion of Redwood and Round Hill records
Extendable Hashing vs. Other Hashing
• Benefits of extendable hashing:
– hash performance doesn’t degrade with growth of file
– minimal space overhead
• Disadvantages of extendable hashing
– extra level of indirection (bucket address table) to find desired record
– bucket address table may itself become very big (larger than memory)
• need a tree structure to locate desired record in the structure!
– Changing size of bucket address table is an expensive operation
• Linear hashing is an alternative mechanism which avoids these
disadvantages at the possible cost of more bucket overflows
Comparison: Ordered Indexing vs. Hashing
• Each scheme has advantages for some operations and
situations. To choose wisely between different schemes
we need to consider:
– cost of periodic reorganization
– relative frequency of insertions and deletions
– is it desirable to optimize average access time at the expense of
worst-case access time?
– What types of queries do we expect?
• Hashing is generally better at retrieving records for a specific key
value
• Ordered indices are better for range queries
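The contrast between the two query types can be illustrated in memory, with a Python dict standing in for a hash index and a sorted list with bisect standing in for an ordered index. The branch names come from the earlier example; the values attached to them are made up for illustration.

```python
import bisect

# Hypothetical data: branch name -> some attribute value.
records = {
    "Brighton": 750, "Downtown": 500, "Mianus": 700,
    "Perryridge": 400, "Redwood": 700, "Round Hill": 350,
}

hash_index = dict(records)       # exact-match lookups by key
sorted_keys = sorted(records)    # ordered index over the same keys


def exact_query(key):
    """Hash-style lookup: one probe, independent of key order."""
    return hash_index.get(key)


def range_query(lo, hi):
    """Ordered lookup: binary search for both ends, then one contiguous scan."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]
```

Answering range_query("D", "P") from the dict alone would require examining every key, since hashing scatters adjacent keys across unrelated buckets.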