- xPrEsS123

advertisement
CST203-2 Database Management Systems
Lecture 7
 Disadvantages on index structure:

We must access an index structure to locate data, or must use
binary search, and that results in more I/O operations

Space taken by index structure
 Hashing allows us to avoid accessing an index
structure
Hash File Organization
 We obtain the address of the disk block containing a
desired record directly by computing a function on the
search-key value of the record
 Bucket is a unit of storage that can store one or more
records
 A bucket is typically a disk block, but could be chosen
to be smaller or larger than a disk block.
 A hash function h is a function from K to B.
h – hash function
Ki - search key
Bi = h(Ki)
 If h(K5) = h(K7)

bucket h(K ) contains records with search-key values K and
records with search-key values K
5
5
7
Deletion
 Equally straightforward
 If the search-key value of the record to be deleted is Ki,
we compute h(Ki), then search the corresponding
bucket for that record, and delete the record from the
bucket
Hash Functions
 The worst possible hash function maps all search-key
values to the same bucket
 An ideal hash function distributes the stored keys
uniformly across all the buckets, so that every bucket
has the same number of records
 We want to choose a hash function that assigns search-
key values to buckets
 The distribution is uniform.

The hash function assigns each bucket the same number of
search-key values from the set of all possible search-key values
 The distribution is random


Each bucket will have nearly the same number of values
assigned to it, regardless of the actual distribution of searchkey values.
The hash value will not be correlated to any externally visible
ordering on the search-key values, such as alphabetic ordering
or ordering by the length of the search keys
Example of Hash File Organization
(Cont.)
Hash file organization of account file, using branch-name as key
 Assume…….. There are 26 buckets
 hash function maps names beginning with the ith letter of the alphabet to the ith
bucket
 Simple but no uniform distribution
 Assume……… there are 10 buckets,
 The binary representation of the ith character is assumed to be the integer i.
 The hash function returns the sum of the binary representations of the
characters modulo 10
 E.g. h(Perryridge) = 5 h(Round Hill) = 3 h(Brighton) = 3
int h = 0;
for (int i = 0; i < n; i++) {
h = h + s.charAt(i);
}
h = h % 10;
Example of Hash File Organization
Hash file
organization of
account file, using
branch-name as
key
Handling of Bucket Overflows
 We have assumed that, when a record is inserted, the
bucket to which it is mapped has space to store the record
 If the bucket does not have enough space, a bucket
overflow is said to occur
 Reasons:


Insufficient buckets.
Skew
1. Multiple records may have the same search key.
2. The chosen hash function may result in non-uniform distribution of
search keys.
 We handle bucket overflow by using overflow buckets.
 If a record must be inserted into a bucket b, and b is already
full,
 the system provides an overflow bucket for b, and inserts the
record into the overflow bucket.
 If the overflow bucket is also full, the system provides another
overflow bucket, and so on
 All the overflow buckets of a given bucket are chained
together in a linked list(overflow chaining)
`
Hash Indices
 A hash index organizes the search keys, with their
associated pointers, into a hash file structure
 We apply a hash function on a search key to identify a
bucket, and store the key and its associated pointers in
the bucket (or in overflow buckets)
 The term hash index is used to denote hash file
structures as well as hash indices
 Hash indices are only secondary index structures
 But, hash file organization provides the same direct
access to records that indexing provides, we pretend
that a file organized by hashing also has a primary
hash index on it
Dynamic Hashing
 Most databases grow larger over time
 For static hashing,



Choose a hash function based on the current file size. This option
will result in performance degradation as the database grows
Choose a hash function based on the anticipated size of the file at
some point in the future. Although performance degradation is
avoided, a significant amount of space may be wasted initially
Periodically reorganize the hash structure in response to file growth.
Such a reorganization involves choosing a new hash function,
recomputing the hash function on every record in the file, and
generating new bucket assignments. This reorganization is a
massive, time-consuming operation. Furthermore, it is necessary to
forbid access to the file during reorganization.
Download