Database Management
7th lecture

Reminder
• Disk and RAM
• RAID levels
• Disk space management
• Buffering
• Heap files
• Page formats
• Record formats

Today
• System catalogue
• Hash-based indexing
  – Static
  – Extendible
  – Linear
• Time cost of operations

System catalogue
• A special table
• Indexes
  – Type of the data structure and search key
• Tables
  – Name, filename, file structure (e.g. heap)
  – Attribute names and types
  – Integrity constraints
  – Index names
• Views
  – Name and definition
• Statistics, permissions, buffer size, etc.

Attr_Cat(attr_name, rel_name, type, position)

attr_name | rel_name      | type    | position
----------|---------------|---------|---------
attr_name | Attribute_Cat | string  | 1
rel_name  | Attribute_Cat | string  | 2
type      | Attribute_Cat | string  | 3
position  | Attribute_Cat | integer | 4
sid       | Students      | string  | 1
name      | Students      | string  | 2
login     | Students      | string  | 3
age       | Students      | integer | 4
gpa       | Students      | real    | 5
fid       | Faculty       | string  | 1
fname     | Faculty       | string  | 2
sal       | Faculty       | real    | 3

Hash-based indexing

Basic idea
• An index for every search key
• A hash function ( f ) maps the search key ( K ) to a memory address ( A ): A = f ( K )
• Ideally bijective: the key is the address

Hashing
• Ideal for joining tables
• Supports only equality checks
• Many variants

Static hashing
• File ~ collection of buckets
• Bucket: one primary page plus overflow pages
• The file has N buckets: 0..N-1
• Data entries can be
  – Data records with key k
  – <k, record id of a record with key k>
  – <k, list of record ids of records with key k>
• To identify the bucket, the hash function h is applied
• Within the bucket, a sequential search is applied
• Insertion: h is used to find the proper bucket
  – If there is not enough space, an overflow page is chained to the bucket
• Deletion: h is used to locate the bucket
  – If the deleted record was the last one on its page, the page is removed
• Bucket number: h ( value ) mod N
• h ( value ) = a * value + b, where a and b are constants
• Primary pages are stored sequentially on disk
• If the file grows a lot
  – Long overflow chains develop
  – Searching gets worse
  – Create a new file with more buckets!
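The static scheme above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation; the constants a, b, N and the tiny page capacity are arbitrary choices made only to force an overflow chain:

```python
# Static hashing sketch: N fixed buckets, each a primary page plus overflow pages.
N = 8               # number of buckets (illustrative)
A, B_CONST = 3, 7   # constants of h(value) = (a * value + b) mod N
PAGE_CAPACITY = 2   # data entries per page (tiny, to force overflow chains)

def h(value):
    return (A * value + B_CONST) % N

# Each bucket is a list of pages; pages[0] is the primary page,
# the rest form the overflow chain.
buckets = [[[]] for _ in range(N)]

def insert(k):
    pages = buckets[h(k)]
    if len(pages[-1]) == PAGE_CAPACITY:   # last page full ->
        pages.append([])                  # extend the overflow chain
    pages[-1].append(k)

def search(k):
    # Hash to the bucket, then scan its primary and overflow pages sequentially.
    return any(k in page for page in buckets[h(k)])

for k in [5, 13, 21, 29, 4]:
    insert(k)
print(search(21), search(99))  # -> True False
```

Because 5, 13, 21 and 29 all hash to the same bucket, the fourth key already needs an overflow page: exactly the degradation the slides warn about when the file grows.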
• If the file shrinks a lot
  – A lot of space is wasted
  – Merge buckets!

Solution
• Ideally
  – 80% of the buckets is used
  – no overflow
• Periodically rehash the file
  – Takes time
  – The index cannot be used during rehashing
• Use dynamic hashing
  – Extendible hashing
  – Linear hashing

Extendible hashing
• Like static hashing
• If a new entry is to be inserted into a full bucket
  – Use a directory of pointers to buckets; only the directory is doubled, not the buckets
  – Split only the overflowed bucket
• Example: inserting 20* doubles the directory; inserting 9* then splits bucket B
• If a bucket gets empty
  – Merging buckets is also possible
  – Not always done
  – Decrease the local depth

Storage
• Typical: 100 MB file, 100 bytes/data entry, page size 4 KB
• 1,000,000 data entries, 25,000 elements in the directory
• High chance that the directory fits in memory: speed = speed of static hashing
• Otherwise twice as slow
• Collision: entries with the same hash value (overflow pages are needed)

Linear hashing
• Family of hash functions: h0, h1, …
• Each function's range is twice that of its predecessor
• E.g. hi(value) = h(value) mod (2^i * N)
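The hash-function family above can be sketched directly; the base function h and the sample value are illustrative choices, while N = 32 matches the lecture's example:

```python
N = 32              # initial number of buckets, as in the lecture example

def h(value):       # base hash function (identity here, for illustration only)
    return value

def h_i(i, value):
    # h_i(value) = h(value) mod (2^i * N): each h_i maps into a range
    # twice as large as that of h_{i-1}.
    return h(value) % (2 ** i * N)

print(h_i(0, 43))   # 43 mod 32 -> 11
print(h_i(1, 43))   # 43 mod 64 -> 43
```

Note how h1 "refines" h0: the two possible h1 images of a key that h0 sends to bucket 11 are 11 and 11 + 32, which is exactly what redistributing a split bucket relies on.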
• d0: number of bits needed to represent the N bucket numbers
• di = d0 + i
• Example: N = 32, d0 = 5, h1 is h mod (2 * 32), d1 = 6

Basic idea
• Rounds of splitting
• The number of the current round is Level
• Only hLevel and hLevel+1 are in use
• At any given point within a round we have
  – buckets already split
  – buckets yet to be split
  – buckets created by splits in this round

Searching
• hLevel is applied
  – If it leads to an unsplit bucket, we look there
  – If it leads to a split bucket, we apply hLevel+1 to decide which bucket holds our data
• Insertion may need an overflow page
• If the overflow chain gets long, a split is triggered

Example
• Level = 0: round number
• NLevel = N * 2^Level: number of buckets at the beginning of the Level-th round (N0 = N)
• If a split is triggered, the current (Next) bucket is split and redistributed by hLevel+1
• The new bucket goes to the end of the buckets
• Next is incremented by 1
• Search: apply hLevel, and if the resulting bucket number is before Next, apply hLevel+1
• Continue: insert 43*, 37*, 29*, 22*, 66*, 34*, and 50*

Deletion
• If the last bucket is empty, it can be removed
• Merging can also be triggered for non-empty buckets
• Merging can end a round: empty buckets are removed, Level is decremented, Next = NLevel/2 - 1

Comparison
• Linear hashing can be stored like extendible hashing
• Its hash-function family is similar to extendible hashing's mechanism (hi → hi+1 ~ doubling the directory)
• Extendible hashing: fewer splits and higher bucket occupancy
• Linear hashing
  – Avoids the directory structure
  – Primary pages are stored consecutively: quicker equality selection
  – A skewed distribution results in almost empty buckets
• If a directory structure is added to linear hashing: one directory element per bucket
  – Overflow pages are handled easily
  – Overhead of a directory level
  – Costly for large, uniformly distributed files
  – Improves space occupancy

File organizations

Cost model
• To analyse the (time) cost of the DB operations
• B: number of data pages
• R: records/page
• D: time of reading/writing a page, D = 15 ms (dominant)
• C: time of processing a record, C = 100 ns
• H: time of hashing, H = 100 ns
• Reduced calculation: just the I/O time
• 3 basic file organizations:
  – Heap files
  – Sorted files
  – Hashed files

File operations
• Scan
• Search with equality selection (=)
• Search with range selection (>, <)
• Insert
• Delete

Heap files
• Scan the file: B ( D + RC )
• Search with equality selection:
  – One result: on average B ( D + RC ) / 2
  – Several results: search the entire file, B ( D + RC )
• Search with range selection: B ( D + RC )
• Insert: fetch the last page, add the record, write back: 2D + C
• Delete: find the record, delete it, write the page: cost of searching + C + D

Sorted files
• Scan: B ( D + RC )
• Search with equality selection:
  – One result: D log2 B + C log2 R
  – Several results: D log2 B + C log2 R + cost of reading the matches
• Search with range selection: D log2 B + C log2 R + cost of reading the matches
• Insert: find the place, insert, move the rest, write the pages: cost of searching + B ( D + RC ) on average
• Delete: find the record, delete it, move the rest, write the pages: cost of searching + B ( D + RC )

Hashed files
• Assumptions: no overflow pages, 80% occupancy of buckets
• Scan the file: 1.25 * B ( D + RC )
• Search with equality selection: on average H + D + RC/2
• Search with range selection: 1.25 * B ( D + RC )
• Insert: locate the page, add the record, write back: cost of searching + D + C
• Delete: find the record, delete it, write the page: cost of searching + C + D

Summary
• Heap file: storage +, modifying +, searching –
• Sorted file: searching +, modifying –
• Hashed file: modifying +, range selection – –, storage –

Type   | Scan   | Eq. search | Range search        | Insert      | Delete
-------|--------|------------|---------------------|-------------|------------
Heap   | BD     | BD/2       | BD                  | 2D          | Search + D
Sorted | BD     | D log2 B   | D log2 B + #matches | Search + BD | Search + BD
Hashed | 1.25BD | D          | 1.25BD              | 2D          | Search + D

Thank you for your attention!
• The book is uploaded:
• R. Ramakrishnan, J. Gehrke: Database Management Systems, 2nd edition
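As a closing check, the I/O-only entries of the cost summary can be evaluated numerically. D = 15 ms comes from the cost model; B and the number of matching pages are illustrative values, and search costs are simplified to the table's I/O terms:

```python
import math

B = 1000        # number of data pages (illustrative)
D = 0.015       # time to read/write one page: 15 ms
matches = 10    # pages of matching records for a range search (illustrative)

# I/O-only costs, in seconds, following the summary table.
heap = {"scan": B * D, "eq": B * D / 2, "range": B * D,
        "insert": 2 * D, "delete": B * D / 2 + D}
sorted_file = {"scan": B * D, "eq": D * math.log2(B),
               "range": D * math.log2(B) + matches * D,
               "insert": D * math.log2(B) + B * D,
               "delete": D * math.log2(B) + B * D}
hashed = {"scan": 1.25 * B * D, "eq": D, "range": 1.25 * B * D,
          "insert": 2 * D, "delete": 2 * D}

for name, costs in [("Heap", heap), ("Sorted", sorted_file), ("Hashed", hashed)]:
    print(name, {op: round(t, 3) for op, t in costs.items()})
```

Even for a modest 1000-page file, an equality search drops from 7.5 s (heap) to about 0.15 s (sorted) and 15 ms (hashed), while the hashed file pays for this with a 25% larger scan cost.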