CS4432: Database Systems II
Record and Page Formats
Chapter 12
CS 4432

Overview
• Data Items
• Records
• Blocks
• Files
• Memory

What are the data items we want to store?
• a salary
• a name
• a picture
What we have available: bytes (8 bits each)

To represent:
• Integer (short): 2 bytes
  e.g., 35 is 00000000 00100011

To represent:
• Boolean
  e.g., TRUE = 1111 1111, FALSE = 0000 0000
• Enumeration types
  e.g., RED = 1, BLUE = 2, GREEN = 3, YELLOW = 4, …
Can we use less than 1 byte/code? Yes, but only if desperate...

To represent:
• Characters
  Various coding schemes suggested (ASCII)
  Example: A: 1000001   a: 1100001   5: 0110101   LF: 0001010

To represent: String of characters
– Null terminated, e.g., c a t followed by a null byte
– Length given, e.g., 3 c a t
– Fixed length, e.g., in Oracle, define the string length: name CHAR(20)

Key Points
• Fixed length items
• Variable length items
  - usually length given at beginning
• Type of an item:
  - tells us how to interpret (plus size if fixed)

Overview: Data Items, Records, Blocks, Files, Memory

Record
- Collection of related data items (called FIELDS)
E.g., Employee record:
  name          CHAR(20),
  salary        NUMBER,
  date-of-hire  DATE,
  ...

Types of records:
• Main choices:
  – FIXED vs VARIABLE FORMAT
  – FIXED vs VARIABLE LENGTH

Fixed format
A SCHEMA contains information such as:
- # fields (attributes)
- type of each field (length)
- order of attributes in record
- meaning of each field (domain)
- constraints (primary key, etc.)
The schema is not associated with each record.

Example: fixed format & fixed length
Schema: Employee record
(1) E.id, 2 byte integer
(2) E.name, 10 char.
(3) Dept, 2 byte code
Records:
  55 | s m i t h | 02
  83 | j o n e s | 01
We can simply concatenate fields.

Variable format
• What:
  – Not all fields are included in the record,
  – and/or, fields possibly in different orders.
• Then:
  – Record itself must contain format, i.e., it is “self-describing”

Why Variable Format?
• “sparse” records
• repeating fields
• evolving formats

Example: variable format and length
  2 | 5 I 46 | 4 S 4 F O R D
- 2: # fields
- 5: code identifying field as E#
- I: integer type (the value 46 follows)
- 4: code for Ename
- S: string type
- 4: length of str. (the characters F O R D follow)
Field name codes could also be strings, i.e., TAGS

• EXAMPLE: variable format record with repeating fields
  e.g., Employee has one or more children
  3 | E_name: Fred | Child: Sally | Child: Tom
• Do repeating fields always require variable format and length?

Repeating fields with fixed format & length
• Then allocate maximum number of repeating fields
• If not used, set to null
Example: a person and her hobbies.
  Mary | Sailing | Chess | - (unused slot, null)

Many variants between fixed - variable format:
Example 1: Include record type in record
  5 | 27 | ....
- 5: record type, tells me what to expect (i.e., points to schema)
- 27: record length

Record header - data at beginning that describes record
May contain:
- pointer to schema (record type)
- length of record
- time stamp (create time, mod. time)
- other stuff (e.g., ROW-ID in Oracle)

Example 2: Variant between FIXED/VAR format
• Hybrid format: one part is fixed, other is variable
E.g.: All employees have E#, name, dept; and other fields vary.
  25 | Smith | Toy | 2 | Hobby:chess | retired
- 2: # of var fields

Also, many variations in internal organization of record
Just to show one:
(a) Each field preceded by the length of the field:
    3 | 10 F1 | 5 F2 | 12 F3
(b) Header holds total size and field offsets:
    3 | 32 | 5 | 15 | 20 | F1 | F2 | F3
    header entries occupy positions 0-4; total size is 32; F1 starts at offset 5, F2 at 15, F3 at 20

Question:
We have seen examples for:
* Fixed format and length records
* Variable format and length records
(a) Does fixed format and variable length make sense?
(b) Does variable format and fixed length make sense?

Next: Data Items, Records, Blocks, Files, Memory

Goal: placing records into blocks
  records -> blocks -> ...
a file
assume fixed length blocks
assume a single file (for now)

Options for storing records in blocks:
(1) separating records
(2) spanned vs. unspanned
(3) mixed record types – clustering
(4) split records
(5) sequencing
(6) indirection

(1) Separating records
  Block: R1 | R2 | R3
(a) no need to separate if fixed size records.
(b) or, use special marker
(c) or, give record lengths (or offsets)
    - within each record
    - in block header

(2) Spanned vs. Unspanned
• Unspanned: records within one block
  block 1: R1 | R2      block 2: R3 | R4 | R5   ...
• Spanned: records wrap across 2 blocks
  block 1: R1 | R2 | R3 (a)      block 2: R3 (b) | R4 | R5 | R6 | R7 (a)   ...

Spanned vs. unspanned:
• Unspanned is much simpler, but may waste space…
• Spanned essential if record size > block size

Example
10^6 records, each of size 2,050 bytes (fixed)
block size = 4096 bytes
  block 1: R1 (2050 bytes) | wasted (2046 bytes)
  block 2: R2 (2050 bytes) | wasted (2046 bytes)
• Utilization ≈ 50% -> about ½ of space is wasted

(3) Mixed versus uniform record types
• Mixed - records of different types (e.g., EMPLOYEE, DEPT) allowed in same block
  e.g., a block: EMP e1 | DEPT d1 | DEPT d2

Why do we want to mix?
• Answer: CLUSTERING
  Records that are frequently accessed together should be placed into the same block
• Problems
  - Creates variable length records in block
  - Aim to avoid duplicates (how to cluster?)
  - Insert/deletes are harder

Example: Clustering
Q1: select C_NAME, C_CITY, AMOUNT, …
    from DEPOSIT, CUSTOMER
    where DEPOSIT.C_NAME = CUSTOMER.C_NAME
a block layout:
  CUSTOMER,NAME=SMITH
  DEPOSIT,NAME=SMITH
  DEPOSIT,NAME=SMITH
  CUSTOMER,NAME=JONES
Question: Good idea or bad idea?

• If Q1 is frequent, with a join on the customer and deposit relations, then clustering is good
• But if instead Q2 is frequent, with:
  Q2: SELECT * FROM CUSTOMER
  then clustering is counter-productive

Compromise: No mixing, but keep related records in same cylinder ...
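The utilization example above (10^6 records of 2,050 bytes in 4,096-byte blocks) can be checked with a few lines of arithmetic. This is a minimal sketch, assuming block and record headers are ignored, as on the slide; the function names are hypothetical, not from the textbook.

```python
def unspanned_utilization(record_size: int, block_size: int) -> float:
    """Fraction of each block holding record bytes when records may not span blocks."""
    if record_size > block_size:
        raise ValueError("unspanned storage requires record_size <= block_size")
    records_per_block = block_size // record_size
    return records_per_block * record_size / block_size


def blocks_needed(num_records: int, record_size: int,
                  block_size: int, spanned: bool) -> int:
    """Number of blocks for num_records fixed-size records (headers ignored)."""
    if spanned:
        total_bytes = num_records * record_size
        return -(-total_bytes // block_size)      # ceiling division
    records_per_block = block_size // record_size
    if records_per_block == 0:
        raise ValueError("record larger than block: spanning is essential")
    return -(-num_records // records_per_block)   # ceiling division


if __name__ == "__main__":
    print(f"utilization: {unspanned_utilization(2050, 4096):.1%}")
    print("unspanned blocks:", blocks_needed(10**6, 2050, 4096, spanned=False))
    print("spanned blocks:  ", blocks_needed(10**6, 2050, 4096, spanned=True))
```

With these numbers, unspanned storage needs one block per record, while spanned storage needs roughly half as many blocks, at the price of records that straddle block boundaries.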
So Far: Storing records in blocks
(1) Separating records
(2) Spanned vs. Unspanned
(3) Mixed record types - Clustering
(4) Split records
(5) Sequencing
(6) Indirection

(4) Split records
Typically for hybrid format:
• Fixed part in one block
• Variable part in another block

  Block with fixed recs.:    R1 (a) | R2 (a)
  Block with variable recs.: R1 (b) | R2 (b) | R2 (c)

(5) Sequencing
• Ordering records in file (and block) by some key value
  – Sequential file (sequenced file)
• Why sequencing?
  – Typically to make it possible to efficiently read records in order

Sequencing Options
(a) Next record physically contiguous
    ... | R1 | Next (R1) | ...
(b) Linked
    R1 -> Next (R1)
What about INSERT/DELETE?

Sequencing Options
(c) Overflow area
    Records in sequence: header | R1 | R2 | R3 | R4 | R5
    Overflow area:       R2.1 | R1.3 | R4.7

(6) Indirection Addressing
• How does one refer to records? (a pointer to record Rx)
• Problem: Records can be on disk or in (virtual) memory. Need common address, but have different physical locations.
Many options, ranging from Physical to Indirect

Purely Physical Addressing
E.g., Record Address (ID) =
  Device ID
  Cylinder #
  Track #
  Block #
  Offset in block
(the components that identify the block together form the Block ID)

Fully Indirect Addressing
Solution: Record ID (Oracle: ROWID) as global address, maintain a map table.
  Map Table: Rec ID | Physical addr.
             r      | a
(rec ID r maps to address a)

Tradeoff
• Flexibility to move records (for deletions, insertions): favors Indirect
• Cost of indirection (lookup): favors Physical
What to do: options in between?

Ex. #1: Indirection in block
  A block: Block Header | Free space | R3 | R4 | R1 | R2

Ex. #2: Use logical block #’s understood by file system instead of direct disk access
  REC ID = File ID | Block # | Record # or Offset
  File System Map: (File ID, Block #) -> Physical Block ID

Recap: Storing records in blocks
(1) Separating records
(2) Spanned vs. Unspanned
(3) Mixed record types - Clustering
(4) Split records
(5) Sequencing
(6) Indirection

Other Topics in Chapter 12
(1) Insertion/Deletion
(2) Buffer Management
(3) Comparison of Schemes

Deletion
  Block: ... | Rx | ...

Options:
(a) Delete and immediately reclaim space
(b) Mark deleted
    – May need chain of deleted records (for re-use)
    – Need a way to mark:
      • special characters
      • delete field
      • in map

As usual, many tradeoffs...
• How expensive is it to move a valid record to free space for immediate reclaim?
• How much space is wasted?
  – e.g., deleted records, delete fields, free space chains,...

Concern with deletions
Dangling pointers: R1 -> ?
Note: If pointers point to physical locations (rather than ROWIDs), storing new data in the deleted space corrupts data.

Solution #1: Do not worry

Solution #2: Tombstones
E.g., Leave “MARK” in map or old location
• Physical IDs
  A block: tombstone (this space never re-used) | rest of old record (this space can be re-used)

Solution #2: Tombstones
E.g., Leave “MARK” in map or old location
• Logical IDs
  map: ID 7788 | LOC (marked)
  Never reuse ID 7788 nor space in map...

Solution #3 (?):
• Place record ID within every record
• When you follow a pointer, check if it leads to correct record
  pointer “to 3-77” -> record with rec-id: 3-77
Does this work??? If space reused, won’t new record have same ID?

Insert
Easy case: Records fixed length/not in sequence
  - Insert new record at end of file
  - or, in deleted slot
A little harder: If records are variable size, not as easy
  - may not be able to reuse space – fragmentation
Hard case: records in sequence
  - If free space “close by”, not too bad...
  - Or, use overflow idea...
  - Or worst case, reorganize file ...
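The "indirection in block" and tombstone ideas above can be combined in a small model. This is a rough sketch under simplifying assumptions: the block is modeled as a Python list of slots rather than raw bytes, and `SlottedBlock` and `TOMBSTONE` are hypothetical names chosen for illustration.

```python
from typing import List, Optional

TOMBSTONE = object()  # hypothetical marker left in a slot after deletion


class SlottedBlock:
    """Block whose header (slot directory) gives in-block indirection.

    External pointers hold (block, slot #). Deleting a record frees its
    bytes for re-use but leaves a tombstone in the slot, so the slot
    number itself is never handed out again.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity      # bytes available for record data
        self.used = 0                 # bytes currently occupied by records
        self.slots: List[object] = []  # slot # -> record bytes or TOMBSTONE

    def insert(self, record: bytes) -> int:
        if self.used + len(record) > self.capacity:
            raise ValueError("not enough free space in block")
        self.slots.append(record)     # always a fresh slot number
        self.used += len(record)
        return len(self.slots) - 1

    def delete(self, slot: int) -> None:
        record = self.slots[slot]
        if record is TOMBSTONE:
            raise KeyError(f"slot {slot} already deleted")
        self.used -= len(record)      # record space can be re-used
        self.slots[slot] = TOMBSTONE  # slot entry is never re-used

    def get(self, slot: int) -> Optional[bytes]:
        record = self.slots[slot]
        return None if record is TOMBSTONE else record


if __name__ == "__main__":
    block = SlottedBlock(capacity=16)
    first = block.insert(b"smith")
    block.delete(first)
    second = block.insert(b"jones")
    print(block.get(first), block.get(second))
```

A dangling pointer to a deleted slot resolves to "no record" instead of silently reaching whatever new record now occupies the freed bytes, which is the point of the tombstone.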
Interesting problems:
• How much free space to leave in each block, track, cylinder?
• How often do I reorganize file + overflow?

Buffer Management
• DB features needed
• Why LRU may be bad
• Pinned blocks
• Forced output
• Double buffering
• Swizzling
Read Textbook!

Pointer Swizzling
Issue: If records (objects) contain pointers to other objects, translate locations when loading objects into memory.
  Disk:   block 1, block 2 (holds Rec A)
  Memory: block 1, block 2 (holds copy of Rec A)

One Option: Translation Table
  DB Addr | Mem Addr
  Rec-A   | Rec-A-inMem
Solution: Insert fields that represent pointers into map table. Translate pointers as needed.

Another Option: In memory pointers - need “type” bit
  The type bit tells whether a pointer leads to disk or to memory (M).

Swizzling Issues
• Must ‘unswizzle’
• Updating/writing of records

Swizzling Options
• Automatic
• On-demand
• No swizzling / program control

Comparison
• There are 1,000,001 ways to organize my data on disk… Which is right for me?

Issues:
• Flexibility
• Space Utilization
• Complexity
• Performance

To evaluate a given strategy, compute following parameters:
-> space used for expected data
   - on average
-> expected time to:
   - fetch record given key
   - fetch record with next key
   - insert/delete/update record
   - read complete file
   - reorganize file (maybe sort)
-> usage patterns / workload:
   - how many/which user queries/updates

NEXT: Chapter 13 in book
How to find a record quickly, given a key
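Looking back at the pointer swizzling slides, the translation-table option with on-demand swizzling might be sketched as follows. This is an illustrative model only: the "disk" is a plain dictionary, and `Record`, `TranslationTable`, `follow`, and `unswizzle` are hypothetical names, not an API from the textbook. The `isinstance` check plays the role of the "type" bit that distinguishes a disk address from a memory pointer.

```python
from typing import Dict, Optional, Tuple, Union


class Record:
    def __init__(self, db_addr: str, payload: str,
                 ref: Optional[Union[str, "Record"]] = None) -> None:
        self.db_addr = db_addr  # address of this record in the database
        self.payload = payload
        self.ref = ref          # DB address (str) or swizzled in-memory Record


class TranslationTable:
    """Maps DB addresses to in-memory records and swizzles on demand."""

    def __init__(self, disk: Dict[str, Tuple[str, Optional[str]]]) -> None:
        self.disk = disk                         # DB addr -> (payload, ref DB addr)
        self.in_memory: Dict[str, Record] = {}   # DB addr -> loaded Record

    def load(self, db_addr: str) -> Record:
        rec = self.in_memory.get(db_addr)
        if rec is None:
            payload, ref = self.disk[db_addr]
            rec = Record(db_addr, payload, ref)  # pointer stays unswizzled
            self.in_memory[db_addr] = rec
        return rec

    def follow(self, rec: Record) -> Optional[Record]:
        """Follow rec.ref, swizzling it the first time it is used."""
        if rec.ref is None:
            return None
        if not isinstance(rec.ref, Record):      # still a DB address
            rec.ref = self.load(rec.ref)
        return rec.ref

    def unswizzle(self, rec: Record) -> None:
        """Turn the pointer back into a DB address before writing rec out."""
        if isinstance(rec.ref, Record):
            rec.ref = rec.ref.db_addr


if __name__ == "__main__":
    table = TranslationTable({"Rec-A": ("alpha", "Rec-B"), "Rec-B": ("beta", None)})
    a = table.load("Rec-A")
    print(table.follow(a).payload)
    table.unswizzle(a)
    print(a.ref)
```

The first `follow` pays for a table lookup; later calls use the in-memory reference directly, and `unswizzle` restores the DB address so the record can be written back safely.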