Records and Files
Storage Technology: Topic 3
Introduction to Database Systems

Record Formats: Fixed Length

[Figure: fields F1 to F4 with lengths L1 to L4, stored contiguously from base address B; for example, F3 starts at address B + L1 + L2.]

– Information about field types is stored in the system catalogs.
– Direct access to the i’th field.

Record Formats: Variable Length

Two alternative formats (# fields is fixed):

[Figure, first format: fields delimited by special symbols. A field count (4) is followed by F1 $ F2 $ F3 $ F4 $.]
[Figure, second format: an array of field offsets is followed by F1 F2 F3 F4.]

The second format offers direct access to the i’th field and efficient storage of nulls (a special "don’t know" value), at the cost of a small directory overhead.

Files of Records

A page or block is OK when doing I/O, but higher levels of the DBMS operate on records, and files of records.
FILE: a collection of pages, each containing a collection of records. Must support:
– insert/delete/modify record
– read a particular record (specified using record id)
– scan all records (possibly with some conditions on the records to be retrieved)

Page Formats: Fixed Length Records

[Figure: two layouts. PACKED: slots 1 to N stored contiguously, followed by free space and N, the number of records. UNPACKED, BITMAP: slots 1 to M, with a bitmap holding one bit per slot (M ... 3 2 1) and M, the number of slots.]

Record id = <page id, slot #>. In the first alternative, moving records for free space management changes the rid; this may not be acceptable.

Page Formats: Variable Length Records

[Figure: page i holding records with Rid = (i,1), Rid = (i,2), ..., Rid = (i,N); a slot directory with entries labeled N ... 2 1 (sample values 20, 16, 24) and a pointer to the start of free space.]

Records can be moved on the page without changing the rid; so, this format is attractive for fixed-length records too.

Unordered (Heap) Files

The simplest file structure contains records in no particular order.
As the file grows and shrinks, disk pages are allocated and de-allocated.
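To make the slotted page format for variable-length records more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the layout of any particular DBMS: the class name `SlottedPage`, its methods, and the 0-based slot numbers are all hypothetical, and the slot directory is kept as a Python list rather than being packed into the page bytes. The point it tries to show is that compaction moves records and updates slot offsets, while every rid = (page id, slot #) stays valid.

```python
class SlottedPage:
    """Toy slotted page. Assumption: the slot directory lives in a Python list,
    not in the page bytes, and its size is not charged against free space."""

    def __init__(self, page_id, size=128):
        self.page_id = page_id
        self.data = bytearray(size)  # record storage area
        self.free_start = 0          # pointer to start of free space
        self.slots = []              # slot directory: (offset, length), or None if empty

    def insert(self, record):
        """Append the record bytes and return its rid = (page id, slot #)."""
        if self.free_start + len(record) > len(self.data):
            raise ValueError("not enough contiguous free space on page")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        # Reuse an empty slot if there is one, so the directory does not grow.
        for slot_no, entry in enumerate(self.slots):
            if entry is None:
                self.slots[slot_no] = (offset, len(record))
                return (self.page_id, slot_no)
        self.slots.append((offset, len(record)))
        return (self.page_id, len(self.slots) - 1)

    def read(self, rid):
        """Look up a record through the slot directory."""
        page_id, slot_no = rid
        if page_id != self.page_id or self.slots[slot_no] is None:
            raise KeyError(rid)
        offset, length = self.slots[slot_no]
        return bytes(self.data[offset:offset + length])

    def delete(self, rid):
        """Mark the slot empty; the bytes are reclaimed by compact()."""
        self.slots[rid[1]] = None

    def compact(self):
        """Slide live records to the front of the page.
        Only the offsets in the slot directory change; rids stay the same."""
        new_data = bytearray(len(self.data))
        pos = 0
        for slot_no, entry in enumerate(self.slots):
            if entry is None:
                continue
            offset, length = entry
            new_data[pos:pos + length] = self.data[offset:offset + length]
            self.slots[slot_no] = (pos, length)
            pos += length
        self.data = new_data
        self.free_start = pos
```

A caller that holds a rid from `insert` can keep using it with `read` after `delete` and `compact` have been called for other records, which is the property the packed fixed-length layout lacks.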
To support record-level operations, we must:
– keep track of the pages in a file
– keep track of free space on pages
– keep track of the records on a page
There are many alternatives for keeping track of this.

Heap File Implemented as a List

[Figure: a header page linked to two lists of data pages, one of full pages and one of pages with free space.]

Each page contains 2 "pointers" plus data.

Heap File Using a Page Directory

[Figure: a header page starting a directory whose entries refer to data pages 1 to N.]

The entry for a page can include the number of free bytes on the page.
The directory is a collection of pages; a linked list implementation is just one alternative.

Indexes

A heap file allows us to retrieve records:
– by specifying the rid, or
– by scanning all records sequentially
Sometimes, we want to retrieve records by specifying the values in one or more fields, e.g.,
– Find all students in the "CS" department
– Find all students with a gpa > 3
Indexes enable us to answer value-based (associative) queries efficiently.

Alternative File Organizations

Many alternatives exist, each ideal for some situations and not so good in others:
– Heap files: Suitable when typical access is a file scan retrieving all records.
– Sorted files: Best if records must be retrieved in some order, or only a "range" of records is needed.
– Hashed files: Good for equality selections.
    File is a collection of buckets. Bucket = primary page plus zero or more overflow pages.
    Hashing function h: h(r) = bucket in which record r belongs. h looks at only some of the fields of r, called the search fields.
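The hashed file description above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a reference implementation: the class name `HashedFile`, the parameters `num_buckets` and `page_capacity`, the dictionary-based records, and the byte-sum hash are all made up for illustration. Each bucket is modeled as a list of pages, where the first page is the primary page and any further pages are overflow pages.

```python
class HashedFile:
    """Toy hashed file: a fixed number of buckets, each a chain of pages."""

    def __init__(self, search_field, num_buckets=4, page_capacity=2):
        self.search_field = search_field    # the field that h looks at
        self.num_buckets = num_buckets
        self.page_capacity = page_capacity  # records per page (R in the cost model)
        # buckets[b][0] is the primary page; buckets[b][1:] are overflow pages.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def bucket_of(self, key):
        # Hypothetical hash: sum of the key's bytes, modulo the bucket count.
        # (Python's built-in hash() of a str varies between runs, so it is avoided.)
        return sum(str(key).encode("utf-8")) % self.num_buckets

    def h(self, record):
        """h(r) = bucket in which record r belongs, based on the search field only."""
        return self.bucket_of(record[self.search_field])

    def insert(self, record):
        pages = self.buckets[self.h(record)]
        for page in pages:
            if len(page) < self.page_capacity:
                page.append(record)
                return
        pages.append([record])  # every page is full: allocate an overflow page

    def search(self, key):
        """Equality selection: only the pages of one bucket are examined."""
        pages = self.buckets[self.bucket_of(key)]
        return [r for page in pages for r in page if r[self.search_field] == key]


students = HashedFile(search_field="dept")
for name, dept in [("Ann", "CS"), ("Bo", "CS"), ("Cy", "EE"), ("Di", "CS")]:
    students.insert({"name": name, "dept": dept})
print(students.search("CS"))
```

An equality selection touches a single bucket, which is why hashed files suit such queries; a range selection gets no help from h and would have to visit every bucket.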
Cost Model for Analysis

Ignore CPU costs, for simplicity:
– B: The number of data pages
– R: Number of records per page
– D: (Average) time to read or write a disk page
– Measuring the number of page I/O’s ignores gains of sequential I/O; thus, even I/O cost is only approximated.
– Average-case analysis; based on several simplistic assumptions.
Good enough to show the overall trends!

Cost of Operations

                  Heap File     Sorted File                             Hashed File
Scan all recs     BD            BD                                      1.25 BD
Equality Search   0.5 BD        D log2 B                                D
Range Search      BD            D (log2 B + # of pages with matches)    1.25 BD
Insert            2D            Search + BD                             2D
Delete            Search + D    Search + BD                             2D

Several assumptions underlie these (rough) estimates!

Assumptions

Single record insert and delete.
Heap Files:
– Equality selection on key; exactly one match.
– Insert always at end of file.
Sorted Files:
– Files compacted after deletions.
– Selections on sort field(s).
Hashed Files:
– No overflow buckets, 80% page occupancy.

Summary

Variable length record format with field offset directory offers support for direct access to the i’th field and null values.
Slotted page format supports variable length records and allows records to move on the page.
File layer keeps track of pages in a file, and supports the abstraction of a collection of records.
– Linked list or directory data structure
– Sorted and hashed files for query processing
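For readers who want to plug numbers into the Cost of Operations table, the following Python sketch evaluates each entry for given B and D. It is only a calculator for the rough estimates in the table and adds no new analysis. The function name `operation_costs` and the parameter `matching_pages` are hypothetical, and one interpretation is assumed: the "Search" term in the Insert and Delete rows is taken to be the equality search cost of the same file organization.

```python
import math


def operation_costs(B, D, matching_pages=1):
    """Evaluate the cost table for B data pages and D time units per page I/O.

    matching_pages is the '# of pages with matches' term for a range search
    on a sorted file. Assumption: 'Search' means the equality search cost of
    the same file organization.
    """
    heap_search = 0.5 * B * D        # on average, half the pages are scanned
    sorted_search = D * math.log2(B) # binary search over the pages
    return {
        "scan": {"heap": B * D, "sorted": B * D, "hashed": 1.25 * B * D},
        "equality": {"heap": heap_search, "sorted": sorted_search, "hashed": D},
        "range": {
            "heap": B * D,
            "sorted": D * (math.log2(B) + matching_pages),
            "hashed": 1.25 * B * D,
        },
        "insert": {"heap": 2 * D, "sorted": sorted_search + B * D, "hashed": 2 * D},
        "delete": {
            "heap": heap_search + D,
            "sorted": sorted_search + B * D,
            "hashed": 2 * D,
        },
    }


# Example with made-up values: 1024 pages, 1 time unit per page I/O.
costs = operation_costs(B=1024, D=1, matching_pages=3)
for operation, by_file in costs.items():
    print(operation, by_file)
```

The 1.25 factor for hashed files follows from the 80% page occupancy assumption: storing the same records takes 1/0.8 = 1.25 times as many pages.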