File Structures by Folk, Zoellick and Riccardi
Chap 12. Extendible Hashing
Seoul National University, School of Computer Science and Engineering
Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB)
Prof. Hyoung-Joo Kim (김형주)
File Structures SNU-OOPSLA Lab.

Chapter Objectives
- Describe the problem solved by extendible hashing and related approaches
- Explain how extendible hashing works; show how it combines tries with conventional, static hashing
- Use the buffer, file, and index classes of previous chapters to implement extendible hashing, including deletion
- Review studies of extendible hashing performance
- Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets

Contents
- 12.1 Introduction
- 12.2 How extendible hashing works
- 12.3 Implementation
- 12.4 Deletion
- 12.5 Extendible hashing performance
- 12.6 Alternative approaches

12.1 Introduction
- Dynamic files
  - undergo a lot of growth
- Static hashing
  - described in chapter 11 (direct hashing)
  - typically worse than a B-tree for dynamic files
  - eventually requires file reorganization
- Extendible hashing
  - hashing for dynamic files
  - Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)

Overview (1)
- Direct access (hashing) files have a static size, so they are not suitable for files whose size is unknown in advance
- A dynamic file structure is desired that retains fast retrieval by primary key and also expands and contracts as the number of records in the file fluctuates (without reorganizing the whole file)
- Similar motivation:
  - Indexed-sequential file ==> B-tree
  - Hashing ==> Extendible hashing

Overview (2): Extendible Hashing
Lookup path:
  primary key
  -> hashing function H(key)
  -> extract the first d bits
  -> directory index
  -> table look-up
  -> file pointer

12.2 How Extendible Hashing Works
- The idea comes from tries (radix searching)
- The branching factor of the tree is equal to the number of alternative symbols in each position of the key
- e.g., a radix-26 trie for able, abrahms, adams, anderson, andrews, baird
- Use the first n characters for branching
[Figure: radix-26 trie. The root branches on a and b. Under a: b (then l -> able, r -> abrahms), d -> adams, n (then d, then e -> anderson, r -> andrews). Under b: baird.]

Extendible Hashing
- H maps keys to a fixed address space whose size is the largest prime less than a power of 2 (65521 < 2^16)
- File pointers point to blocks of records known as buckets. An entire bucket is read by one physical data transfer, and buckets may be added to or removed from the file dynamically
- The first d bits of H(key) are used as an index into a directory array containing 2^d entries, which usually resides in primary memory
- The value d, the directory size (2^d), and the number of buckets change automatically as the file expands and contracts

Extendible Hashing Example
Directory with d = 3 and 4 buckets (d' is the number of bits a bucket actually uses):
  Directory entries    Bucket   d'   Keys held
  000, 001, 010, 011   B0       1    H(key) = 0..
  100                  B100     3    H(key) = 100..
  101                  B101     3    H(key) = 101..
  110, 111             B11      2    H(key) = 11..

Turning the trie into a directory
Using a trie for extendible hashing:
(1) Use a radix-2 trie
  - Keys in A: beginning with 0
  - Keys in B: beginning with 10
  - Keys in C: beginning with 11
(2) Retrieve from secondary storage the buckets containing keys, instead of individual keys
[Figure: radix-2 trie. Root: 0 -> A, 1 -> inner node; inner node: 0 -> B, 1 -> C.]

Representation of Trie (1)
- A tree is not preferable (the directory is not big); use a flattened array
  1. Make a complete (full) binary tree
  2. Collapse it into the directory structure
[Figure: the trie extended to a full binary tree, then collapsed into the directory 00 -> A, 01 -> A, 10 -> B, 11 -> C.]

Representation of Trie (2)
- The directory is a complete binary tree stored as an array
- Directory entry: a pointer to the associated bucket
- Given an address beginning with the bits 10, directory entry 10 (binary), that is entry 2 (decimal), gives the pointer to the bucket
- Hashing is introduced to obtain a uniform distribution of addresses
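The flattening step above (complete binary tree collapsed into an array) can be sketched in C++. This is a minimal illustration, not the book's code: `Leaf`, `BuildDirectory`, and the use of a `char` as a bucket reference are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one trie leaf: the bit prefix that leads
// to it (e.g. "10") and a label standing in for the bucket address.
struct Leaf {
    std::string prefix;  // bits consumed on the way to this leaf
    char bucket;         // stand-in for a file pointer to a bucket
};

// Collapse a binary trie, given as its leaves, into a directory of
// 2^depth cells. A leaf reached after k bits owns 2^(depth - k)
// consecutive cells, all pointing to the same bucket.
std::vector<char> BuildDirectory(const std::vector<Leaf>& leaves, int depth) {
    std::vector<char> dir(1u << depth, '?');
    for (const Leaf& leaf : leaves) {
        int k = static_cast<int>(leaf.prefix.size());
        assert(k <= depth);
        unsigned first = 0;
        for (char bit : leaf.prefix) first = (first << 1) | (bit == '1');
        first <<= (depth - k);               // pad the prefix with zeros
        unsigned count = 1u << (depth - k);  // cells shared by this leaf
        for (unsigned i = 0; i < count; ++i) dir[first + i] = leaf.bucket;
    }
    return dir;
}
```

Calling `BuildDirectory` with the leaves 0 -> A, 10 -> B, 11 -> C and depth 2 yields the four cells A, A, B, C, which is the directory shown in Representation of Trie (1).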
Retrieve a record
Steps in retrieving a record with a given key:
  1. Find H(given key)
  2. Extract the first d bits of H(given key)
  3. Use this value as an index into the directory to find a pointer
  4. Use this pointer to read a bucket into primary memory
  5. Locate the desired record within the bucket (scan)

Expansion & Contraction (1)
- A pair of adjacent buckets with the same value of d', which share a common value of the first d'-1 bits of H(key), can be combined if the average load is below 50%, so that all records fit into one bucket
- File contraction is the reverse of expansion; the directory can be compacted and d decremented whenever all pairs of pointers have the same values

Expansion & Contraction (2)
Bucket B0 overflows, then splits into B00 and B01 (d = 3):
  Directory entries    Bucket   d'   Keys held
  000, 001             B00      2    H(key) = 00..
  010, 011             B01      2    H(key) = 01..
  100                  B100     3    H(key) = 100..
  101                  B101     3    H(key) = 101..
  110, 111             B11      2    H(key) = 11..

Expansion & Contraction (3)
Bucket B100 overflows, d increases to 4:
  Directory entries         Bucket   d'   Keys held
  0000, 0001, 0010, 0011    B00      2    H(key) = 00..
  0100, 0101, 0110, 0111    B01      2    H(key) = 01..
  1000                      B1000    4    H(key) = 1000..
  1001                      B1001    4    H(key) = 1001..
  1010, 1011                B101     3    H(key) = 101..
  1100, 1101, 1110, 1111    B11      2    H(key) = 11..

Splitting to Handle Overflow (1)
When overflow occurs, e.g. 1) overflow of bucket A:
- Split A into A and D
- The split uses an additional, previously unused address bit
- No need to expand the directory
  Before: 00 -> A, 01 -> A, 10 -> B, 11 -> C
  After:  00 -> A, 01 -> D, 10 -> B, 11 -> C

Splitting to Handle Overflow (2)
e.g. 2) overflow of bucket B (starting from 00 -> A, 01 -> A, 10 -> B, 11 -> C):
- There are no additional unused bits, so the directory must be expanded
  1. Divide B using 3 bits of the hash address
  2. Make a complete (full) binary tree
  3. Collapse it into the directory structure

Result of overflow of bucket B
[Figure in three panels: 1. the trie after B splits into B and D; 2. the complete binary tree; 3. the resulting directory.]
  Directory entries      Bucket
  000, 001, 010, 011     A
  100                    B
  101                    D
  110, 111               C

Creating Address
- Function Hash(key): fold-and-add hashing algorithm
- Do not MOD the hash value by the address space, since no fixed address space exists
- Output from the hash function for a number of keys:
  bill        0000 0011 0110 1100
  lee         0000 0100 0010 1000
  pauline     0000 1111 0110 0101
  alan        0100 1100 1010 0010
  julie       0010 1110 0000 1001
  mike        0000 0111 0100 1101
  elizabeth   0010 1100 0110 1010
  mark        0000 1010 0000 0111

    #include <string.h>

    int Hash (char * key)
    {
        int sum = 0;
        int len = strlen(key);
        if (len % 2 == 1) len ++; // make len even; the terminating '\0' pads the last pair
        for (int j = 0; j < len; j += 2) // fold two characters at a time
            sum = (sum + 100 * key[j] + key[j+1]) % 19937;
        return sum;
    }

Figure 12.7 Function Hash (key) returns an integer hash value for key for a 15 bit

    int MakeAddress (char * key, int depth)
    {
        int retval = 0;
        int hashVal = Hash(key);
        // reverse the bits: the low-order bits of the hash become
        // the high-order bits of the address
        for (int j = 0; j < depth; j++)
        {
            retval = retval << 1;
            int lowbit = hashVal & 1;
            retval = retval | lowbit;
            hashVal = hashVal >> 1;
        }
        return retval;
    }

Figure 12.9 Function MakeAddress(key,depth)

    class Bucket: protected TextIndex
    {
    protected:
        Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
        int Insert (char * key, int recAddr);
        int Remove (char * key);
        Bucket * Split ();
        int NewRange (int & newStart, int & newEnd);
        int Redistribute (Bucket & newBucket);
        int FindBuddy ();
        int TryCombine ();
        int Combine (Bucket * buddy, int buddyIndex);
        int Depth;
        Directory & Dir;
        int BucketAddr;
        friend class Directory;
        friend class BucketBuffer;
    };

Figure 12.10 Main members of class Bucket
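To see how retrieval, bucket splitting, and directory doubling fit together, here is a small in-memory sketch. It reuses `Hash` and `MakeAddress` from Figures 12.7 and 12.9 (with `const char*` parameters), but `ToyBucket`, `ToyDirectory`, `kMaxKeys`, and `kMaxDepth` are hypothetical stand-ins invented for this example; the book's `Bucket` and `Directory` classes work on files and are not reproduced here. As a simplifying assumption, a bucket that already uses all 15 address bits simply grows instead of splitting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <vector>

int Hash(const char* key) {  // fold-and-add, as in Figure 12.7
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // odd length: '\0' pads the last pair
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

int MakeAddress(const char* key, int depth) {  // as in Figure 12.9
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {  // reverse the low `depth` bits
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}

struct ToyBucket {  // hypothetical in-memory bucket
    int depth = 0;  // d': number of address bits this bucket uses
    std::vector<std::string> keys;
};

struct ToyDirectory {  // hypothetical in-memory directory
    static constexpr std::size_t kMaxKeys = 2;  // tiny, to force splits
    static constexpr int kMaxDepth = 15;        // Hash() yields 15 bits
    int depth = 0;                              // d
    std::vector<std::shared_ptr<ToyBucket>> cells{std::make_shared<ToyBucket>()};

    ToyBucket& Find(const std::string& key) {
        return *cells[MakeAddress(key.c_str(), depth)];
    }
    bool Search(const std::string& key) {
        for (const std::string& k : Find(key).keys)
            if (k == key) return true;
        return false;
    }
    void DoubleSize() {  // old cell i becomes cells 2i and 2i+1
        std::vector<std::shared_ptr<ToyBucket>> bigger;
        for (const auto& cell : cells) {
            bigger.push_back(cell);
            bigger.push_back(cell);
        }
        cells.swap(bigger);
        ++depth;
    }
    void Split(const std::string& key) {
        std::shared_ptr<ToyBucket> old = cells[MakeAddress(key.c_str(), depth)];
        auto fresh = std::make_shared<ToyBucket>();
        fresh->depth = ++old->depth;
        int shift = depth - old->depth;  // position of the newly used bit
        for (std::size_t i = 0; i < cells.size(); ++i)
            if (cells[i] == old && ((i >> shift) & 1)) cells[i] = fresh;
        std::vector<std::string> moved;
        moved.swap(old->keys);  // redistribute keys over both buckets
        for (const std::string& k : moved) Find(k).keys.push_back(k);
    }
    void Insert(const std::string& key) {
        if (Search(key)) return;
        ToyBucket& b = Find(key);
        if (b.keys.size() < kMaxKeys || b.depth == kMaxDepth) {
            b.keys.push_back(key);
            return;
        }
        if (b.depth == depth) DoubleSize();  // no unused bits left
        Split(key);
        Insert(key);  // retry; the target bucket may need to split again
    }
};
```

With this `Hash`, `Hash("bill")` is 876 (binary 0000 0011 0110 1100) and `Hash("lee")` is 1064, matching the first two rows of the table above. `MakeAddress("bill", 4)` is 3, because the low four bits 1100 are reversed to 0011.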
    class Directory
    {
    public:
        Directory (…);
        ~Directory ();
        int Open (…);
        int Create (…);
        int Close ();
        int Insert (…);
        int Delete (…);
        int Search (…);
    protected:
        int DoubleSize ();
        int Collapse ();
        int InsertBucket (…);
        int Find (…);
        int StoreBucket (…);
        int LoadBucket (…);
        …
    };

Figure 12.11 Definition of class Directory

12.4 Deletion
- When to combine buckets
  - Buddy buckets: buckets that are siblings at the leaf level of the tree ("buddy" means something like "friend")
  - e.g., B and D in the "Result of overflow of bucket B" figure are buddy buckets
- Examine the directory to see if we can make changes there
  - Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory

Buddy Bucket
- Given a bucket with an address uvwxy, where u, v, w, x, and y each have the value 0 or 1, the buddy bucket, if it exists, has the address uvwxz, such that z = y XOR 1
- If enough keys are deleted, the contents of buddy buckets can be combined into a single bucket

Collapsing the Directory
- Collapse condition
  - If the directory is a single cell, downsizing is impossible
  - If there is a pair of directory cells that do not both point to the same bucket, collapsing is impossible
- Allocating space
  - Allocate half the size of the original
  - Copy the bucket reference shared by each cell pair to a single cell in the new directory

12.5 Extendible Hashing Performance
- Time: O(1)
  - If the directory can be kept in RAM: a single access
  - Otherwise: two accesses are necessary
- Space utilization of the buckets
  - r (# of records), b (block size), N (# of blocks)
  - Utilization = r / (bN)
  - Average utilization ==> 0.69
- Space utilization for the directory
  - How large a directory should we expect to have, given an expected number of keys?
  - Expected value for the directory size by Flajolet (1983):
    Estimated directory size = (3.92 / b) * r^(1 + 1/b)
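The two formulas on this slide are easy to evaluate directly. The sketch below assumes the reading (3.92 / b) * r^(1 + 1/b) for Flajolet's estimate; the function names `EstimatedDirectorySize` and `Utilization` are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cmath>

// Flajolet's estimate of the directory size (number of cells) for
// r records and buckets holding b records each:
//   size ~= (3.92 / b) * r^(1 + 1/b)
double EstimatedDirectorySize(double r, double b) {
    return (3.92 / b) * std::pow(r, 1.0 + 1.0 / b);
}

// Bucket space utilization: r records stored in N buckets of size b.
double Utilization(double r, double b, double n) {
    return r / (b * n);
}
```

For example, with r = 1024 and b = 10 the exponent is 1.1, so r^(1.1) = 2^11 = 2048 and the estimate is 0.392 * 2048 = 802.816 cells. Larger buckets shrink the directory quickly, because b appears both in the divisor and in the exponent.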
Space utilization for buckets
- Periodic and fluctuating
  - With uniformly distributed addresses, all the buckets tend to fill up at the same time -> split at the same time
  - As the buckets fill up: 90%
  - After a concentrated series of splits: 50%
- r: # of records, b: block size, N: # of blocks
  - N ~= r / (b ln 2)
  - Utilization = r / (bN) ~= ln 2 = 0.69
  - Average utilization of 69%
- B-tree space utilization
  - Normal B-tree: 67%
  - B-tree with redistribution on insertion: 85%

12.6 Alternative Approaches (1): Dynamic Hashing
- Similar to extendible hashing
  - Uses a directory to track bucket addresses
  - Extends the directory through the use of tries
- Starts with a hash function that covers an address space of a fixed size
- When overflow occurs, the bucket splits, forming the leaves of a trie that grows down from the original address node

Alternative Approaches (2): Dynamic Hashing
- Two kinds of nodes
  - External node: references a data bucket
  - Internal node: points to two child index nodes
  - When a node's bucket splits and the node gains children, it changes from an external node to an internal node
- Two hash functions
  - Apply the first hash function to reach a node in the original address space
  - If an external node is found: the search is completed
  - If an internal node is found: apply the second hash function
[Figure: growth of the index in dynamic hashing. (a) Original address space with nodes 1, 2, 3, 4. (b) Node 4 splits into 40 and 41. (c) Node 2 splits into 20 and 21, and node 41 splits into 410 and 411.]

Dynamic Hashing vs. Extendible Hashing (1)
- Overflow handling
  - Both schemes extend the hash function locally, as a binary search trie
  - Both schemes use a directory structure
    - Dynamic hashing: a linked structure
    - Extendible hashing: a perfect tree expressible as an array
- Space utilization
  - The same for both schemes (space utilization: 69%)
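The linked index of dynamic hashing can be pictured with a short sketch. It is only an illustration under stated assumptions: `IndexNode`, `SplitNode`, and `FindBucket` are hypothetical names, buckets are plain integers, and the second hash function is modelled as an integer whose bits are consumed low bit first, one per trie level (the slides do not say how its bits are used).

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical index node: an external node refers to a data bucket,
// an internal node points to two child index nodes.
struct IndexNode {
    int bucket = -1;                      // meaningful for external nodes only
    std::unique_ptr<IndexNode> child[2];  // both set for internal nodes
    bool IsExternal() const { return !child[0]; }
};

// Split an external node: it becomes internal, and its two new external
// children refer to the old bucket and to a newly allocated one.
void SplitNode(IndexNode& node, int newBucket) {
    assert(node.IsExternal());
    node.child[0] = std::make_unique<IndexNode>();
    node.child[1] = std::make_unique<IndexNode>();
    node.child[0]->bucket = node.bucket;
    node.child[1]->bucket = newBucket;
    node.bucket = -1;
}

// h1 selects a node in the original (fixed) address space. If that node
// is internal, successive bits of h2 steer the walk down the trie.
int FindBucket(const std::vector<IndexNode>& roots, unsigned h1, unsigned h2) {
    const IndexNode* node = &roots[h1 % roots.size()];
    while (!node->IsExternal()) {
        node = node->child[h2 & 1].get();
        h2 >>= 1;
    }
    return node->bucket;
}
```

Each step of the walk follows a pointer, which is why this linked directory can cost more than one page fault, while the array directory of extendible hashing is indexed in a single step.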
Dynamic Hashing vs. Extendible Hashing (2)
- Growth of directory
  - Dynamic hashing: slower, more gradual growth
  - Extendible hashing: extends the directory by doubling it
- Actual size of an index node
  - A node in dynamic hashing is larger than a directory cell in extendible hashing (because of pointers)
- Page faults
  - Dynamic hashing: more than one page fault (with a linked structure for the directory)
  - Extendible hashing: a single page fault

Alternative Approaches (3): Linear Hashing
- Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory
- The actual address space is extended one bucket at a time as buckets overflow
- Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands
- No directory: avoids the additional seek resulting from an additional layer
- Uses more bits of the hashed value
  - h_d(k): depth-d hashing function (using function make_address)

The growth of address space in linear hashing
[Figure in five panels: (a) four buckets a, b, c, d at addresses 00, 01, 10, 11; (b) bucket A is added at 100; (c) bucket B is added at 101; (d) bucket C is added at 110; (e) bucket D is added at 111. w, x, and y are overflow buckets.]

Alternative Approaches (5): Approaches to Controlling Splitting
- Postpone splitting: increases space utilization
  - B-tree: redistribution rather than splitting
  - Hashing: placing records in chains of overflow buckets to postpone splitting
- Triggering event for splitting
  - Linear hashing
    - Every time any bucket overflows
    - The bucket that is split is not necessarily the overflowing bucket
  - Litwin (1980): overall load factor of the file
    - Below 2 seeks, 75% ~ 80% storage utilization
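The address computation behind linear hashing, described on the Linear Hashing slide, can be sketched as follows. This is a simplified illustration: `HashDepth` and `LinearAddress` are hypothetical names, and h_d(k) is taken to be the low d bits of the hash value rather than the output of make_address, so that splitting bucket 00 at depth 2 produces buckets 000 and 100 as in the figure.

```cpp
#include <cassert>

// h_d(k): keep the low d bits of the hash value (an assumption of this
// sketch; the slides build h_d with make_address).
unsigned HashDepth(unsigned hashValue, int d) {
    return hashValue & ((1u << d) - 1);
}

// Address of a key when the file is at depth d and buckets
// 0 .. next-1 have already been split in the current round.
unsigned LinearAddress(unsigned hashValue, int d, unsigned next) {
    unsigned addr = HashDepth(hashValue, d);
    if (addr < next)  // this bucket was already split: use one more bit
        addr = HashDepth(hashValue, d + 1);
    return addr;
}
```

With d = 2 and one bucket already split (next = 1), a hash value ending in 100 goes to bucket 4 (the new bucket A), a value ending in 000 stays in bucket 0, and a value ending in 01 still goes to bucket 1 because that bucket has not been split yet. No directory is consulted at any point; only d and next need to be remembered.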
Alternative Approaches (5): Approaches to Controlling Splitting
- Postpone splitting for extendible hashing
  - Use chained overflow buckets
  - Avoid doubling the directory space
  - 1.1 seeks, 76% ~ 81% storage utilization

Let's Review!
- 12.1 Introduction
- 12.2 How extendible hashing works
- 12.3 Implementation
- 12.4 Deletion
- 12.5 Extendible hashing performance
- 12.6 Alternative approaches