CST203-2 Database Management Systems Lecture 7 Disadvantages on index structure: We must access an index structure to locate data, or must use binary search, and that results in more I/O operations Space taken by index structure Hashing allows us to avoid accessing an index structure Hash File Organization We obtain the address of the disk block containing a desired record directly by computing a function on the search-key value of the record Bucket is a unit of storage that can store one or more records A bucket is typically a disk block, but could be chosen to be smaller or larger than a disk block. A hash function h is a function from K to B. h – hash function Ki - search key Bi = h(Ki) If h(K5) = h(K7) bucket h(K ) contains records with search-key values K and records with search-key values K 5 5 7 Deletion Equally straightforward If the search-key value of the record to be deleted is Ki, we compute h(Ki), then search the corresponding bucket for that record, and delete the record from the bucket Hash Functions The worst possible hash function maps all search-key values to the same bucket An ideal hash function distributes the stored keys uniformly across all the buckets, so that every bucket has the same number of records We want to choose a hash function that assigns search- key values to buckets The distribution is uniform. The hash function assigns each bucket the same number of search-key values from the set of all possible search-key values The distribution is random Each bucket will have nearly the same number of values assigned to it, regardless of the actual distribution of searchkey values. The hash value will not be correlated to any externally visible ordering on the search-key values, such as alphabetic ordering or ordering by the length of the search keys Example of Hash File Organization (Cont.) Hash file organization of account file, using branch-name as key Assume…….. There are 26 buckets hash function maps names beginning with the ith letter of the alphabet to the ith bucket Simple but no uniform distribution Assume……… there are 10 buckets, The binary representation of the ith character is assumed to be the integer i. The hash function returns the sum of the binary representations of the characters modulo 10 E.g. h(Perryridge) = 5 h(Round Hill) = 3 h(Brighton) = 3 int h = 0; for (int i = 0; i < n; i++) { h = h + s.charAt(i); } h = h % 10; Example of Hash File Organization Hash file organization of account file, using branch-name as key Handling of Bucket Overflows We have assumed that, when a record is inserted, the bucket to which it is mapped has space to store the record If the bucket does not have enough space, a bucket overflow is said to occur Reasons: Insufficient buckets. Skew 1. Multiple records may have the same search key. 2. The chosen hash function may result in non-uniform distribution of search keys. We handle bucket overflow by using overflow buckets. If a record must be inserted into a bucket b, and b is already full, the system provides an overflow bucket for b, and inserts the record into the overflow bucket. If the overflow bucket is also full, the system provides another overflow bucket, and so on All the overflow buckets of a given bucket are chained together in a linked list(overflow chaining) ` Hash Indices A hash index organizes the search keys, with their associated pointers, into a hash file structure We apply a hash function on a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets) The term hash index is used to denote hash file structures as well as hash indices Hash indices are only secondary index structures But, hash file organization provides the same direct access to records that indexing provides, we pretend that a file organized by hashing also has a primary hash index on it Dynamic Hashing Most databases grow larger over time For static hashing, Choose a hash function based on the current file size. This option will result in performance degradation as the database grows Choose a hash function based on the anticipated size of the file at some point in the future. Although performance degradation is avoided, a significant amount of space may be wasted initially Periodically reorganize the hash structure in response to file growth. Such a reorganization involves choosing a new hash function, recomputing the hash function on every record in the file, and generating new bucket assignments. This reorganization is a massive, time-consuming operation. Furthermore, it is necessary to forbid access to the file during reorganization.