Quick Review of material covered Apr 8

advertisement
Quick Review of material covered Apr 8
• B+-Tree Overview and some definitions
–
–
–
–
balanced tree
multi-level
reorganizes itself on insertion and deletion
built so each node fits on a single disk page
• Examined mechanics of B+-tree Insertion and Deletion
– looked at several examples
• We’ll finish up B+-trees with two more concepts:
– B+-tree File Organization
– B-tree index files
B+-tree File Organization
• B+-Tree Indices solve the problem of index file degradation. The
original data file will still degrade upon a stream of insert/delete
operations.
• Solve data-file degradation by using a B+-tree file organization
• Leaf nodes in a B+-tree file organization store records, not pointers
into a separate original datafile
– since records are larger than pointers, the maximum number of recrods
that can be stored in a leaf node is less than the number of pointers in a
non-leaf node
– leaf nodes must still be maintained at least half full
– insert and delete are handled in the same was as insert and delete for
entries in a B+-tree index
B+-tree File Organization Example
• Records are much bigger than pointers, so good space usage is
important
• To improve space usage, involve more sibling nodes in redistribution
during splits and merges (to avoid split/merge when possible)
– involving one sibling guarantees 50% space use
– involving two guarantees at least 2/3 space use, etc.
B-tree Index Files
• B-trees are similar to B+-trees, but search-key values
appear only once in the index (eliminates redundant
storage of key values)
– search keys in non-leaf nodes don’t appear in the leaf nodes, so an
additional pointer field for each search key in a non-leaf node must
be stored to point to the bucket or record for that key value
– leaf nodes look like B+-tree leaf nodes:
(P1, K1, P2, K2, …, Pn)
– non-leaf nodes look like so:
(P1, B1, K1, P2, B2, K2, …, Pn)
where the Bi are pointers to buckets or file records.
B-tree Index File Example
B-tree
and
B+-tree
B-tree Index Files (cont.)
• Advantages of B-tree Indices (vs. B+-trees)
– May use less tree nodes than a B+-tree on the same data
– Sometimes possible to find a specific key value before reaching a
leaf node
• Disadvantages of B-tree Indices
– Only a small fraction of key values are found early
– Non-leaf nodes are larger, so fanout is reduced, and B-trees may be
slightly taller than B+-trees on the same data
– Insertion and deletion are more complicated than on B+-trees
– Implementation is more difficult than B+-trees
• In general, advantages don’t outweigh disadvantages
Hashing
• We’ve examined Ordered Indices (design based upon
sorting or ordering search key values); the other type of
major indexing technique is Hashing
• Underlying concept is very simple:
– observation: small files don’t require indices or complicated search
methods
– use some clever method, based upon the search key, to split a large
file into a lot of little buckets
– each bucket is sufficiently small
– use the same method to find the bucket for a given search key
Hashing Basics
– A bucket is a unit of storage containing one or more records
(typically a bucket is one disk block in size)
– In a hash file organization we find the bucket for a record directly
from its search-key value using a hash function
– A hash function is a function that maps from the set of all searchkey values K to the set of all bucket addresses B
– The hash function is used to locate records for access, insertion,
and deletion
– Records with different search-key values may be mapped to the
same bucket
• the entire bucket must be searched to find a record
• buckets are designed to be small, so this task is usually not onerous
Hashed File Example
– So we:
• divide the set of disk blocks that make up the file into buckets
• devise a hash function that maps each key value into a bucket
V: set of key values
B: number of buckets
H: hashing function
H: V--> (0, 1, 2, 3, …, B-1)
Example: V= 9 digit SS#; B=1000; H= key modulo 1000
Hash Functions
• To search/insert/delete/modify a key do:
– compute H(k) to get the bucket number
– search sequentially in the bucket (heap organization within each
bucket)
• Choosing H: almost any function that generates “random”
numbers in the range [0, B-1]
– try to distribute the keys evenly into the B buckets
– one rule of thumb when using MOD -- use a prime number
Hash Functions (2)
• Collision is when two or more key values go to the same
bucket
– too many collisions increases search time and degrades
performance
– no or few collisions means that each bucket has only one (or very
few) key(s)
• Worst-case hash functions map all search keys to the same
bucket
Hash Functions (3)
• Ideal hash functions are uniform
– each bucket is assigned the same number of search-key values
from the set of all possible values
• Ideal hash functions are random
– each bucket has approximately the same number of records
assigned to it irrespective of the actual distribution of search-key
values in the file
• Finding a good hash function is not always easy
Examples of Hash Functions
• Given 26 buckets and a string-valued search key, consider the
following possible hash functions:
–
–
–
–
–
Hash based upon the first letter of the string
Hash based upon the last letter of the string
Hash based upon the middle letter of the string
Hash based upon the most common letter in the string
Hash based upon the “average” letter in the string: the sum of the letters
(using A=0, B=1, etc) divided by the number of letters
– Hash based upon the length of the string (modulo 26)
• Typical hash functions perform computation on the internal binary
representation of the search key
– example: searching on a string value, hash based upon the binary sum of
the characters in the string, modulo the number of buckets
Download