Structured Files

Course schedule (sessions at 9:00, 11:00, 13:30, 15:30, 18:00)
  Aug. 2: Intro & terminology; Reliability; Fault tolerance; Transaction models; Reception
  Aug. 3 - Aug. 6 (the assignment of topics to days and slots is not recoverable from the slide): TP mons & ORBs; Logging & res. Mgr.; Files & Buffer Mgr.; Structured files; Locking theory; Res. Mgr. & Trans. Mgr.; COM+; Access paths; Locking techniques; CICS & Internet; TP; CORBA/EJB + TP; Groupware; Queueing; Advanced Trans. Mgr.; Replication; Performance & TPC; Workflow; Cyberbricks; Party; FREE

Chapter 19

What The Record Manager Does
  Storage allocation: store tuples in file blocks.
  Tuple addressing: give each tuple an identifier (id) and provide fast access via that id.
  Enumeration: fast enumeration of all of a relation's tuples.
  Content addressing: give fast access via attribute values.
  Maintenance: update/delete a tuple and its access paths.
  Protection: support for security: encryption or tuple-granularity access control.

Jim Gray, Andreas Reuter, Transaction Processing - Concepts and Techniques, WICS August 2 - 6, 1999

Outline
  Representing values
  Representing records
  Storing records in pages and across pages
  Organizing records (entry, relative, key, hash)
  Examples of fix/log/log logic.

Record Allocation in a Page
  Recall: File is a collection of fixed-length pages (blocks).
  File and buffer managers map files to disc/RAM.
  [Figure: a block in its slot on disk. Labels: block head, page (page head, page body, page dir), block trailer.]

Page Declares
  typedef struct {                 /* global page numbers */
      FILENO      fileno;          /* file where the page lives */
      uint        pageno;          /* page number within the file */
  } PAGEID, *PAGEIDP;

  typedef struct {
      PAGEID      thatsme;         /* identifies the page */
      PAGE_TYPE   page_type;       /* see description above */
      OBJID       object_id;       /* internal id of the relation, index, etc. */
      LSN         safe_up_to;      /* page LSN for the WAL protocol */
      PAGEID      previous;        /* often pages are members of doubly */
      PAGEID      next;            /* linked lists */
      PAGE_STATE  status;          /* valid, in-doubt, copy of something, etc. */
      int         no_entries;      /* # entries in page dir (see below) */
      int         unused;          /* free bytes not in freespace */
      int         freespace;       /* # contiguous free bytes for data */
      char        stuff[];         /* will grow */
  } PAGE_HEADER, *PAGE_PTR;

Different uses of pages
  Data: homogeneous record storage
  Cluster: like Data except many different record types
  Index (access path): hashed or B-tree
  Free-space bitmap: describes status of 4,000 other pages.
  Directory: meta-data about this or other files

Page Directory: Points to Records on Page
  [Figure: a page with its page header, five tuples in the page body, and page directory entries 5 4 3 2 1. Arrows: "Tuples are inserted in this direction" and "Page directory grows in this direction".]
  Record id is: File, Page, Directory_offset

Accessing a Record
  Read by TID:
    Lock record shared
    Locate page
    Get semaphore shared
    Follow directory offset
    Copy tuple
    Give semaphore
  Insert by TID:
    Lock record exclusive
    Locate page
    Get semaphore exclusive
    Find space
    Insert
    Log insert (tid, new value)
    Update page LSN, header, directory
    Give semaphore

Accessing a Record
  Delete by TID:
    Lock record exclusive
    Locate page
    Get semaphore exclusive
    Add record to free space
    Log delete (tid, old value)
    Update page LSN, header, directory
    Give semaphore
  Update by TID: much like delete-&-insert

Finding Space for Insert / Update
  If tuple fits in the page's contiguous free space: easy.
  If tuple fits in the page's total free space: reorganize (compress) the page. Physiological logging makes this cheap.
  If tuple does not fit, then:
    Leave forwarding address on page.
    Optionally leave record prefix on page.
    Segment record among several pages.

Finding space within a file
  Free space table: summarizes status of many pages (an 8KB table page holds 64K bits, one per page, so it covers about 500MB of 8KB data pages).
  Good for clustered & contiguous allocation.
  [Figure: free space table, with page labels p1, p2, ... and entries f2, f3, ...]
  [Figure label: free space directories.]
  Bitmap should be transaction protected.
  If transaction aborts, page is freed again.
  Alternatively, treat bitmap as a hint. Rebuild periodically.

Finding space within a file
  Free space cursor/list
  [Figure: the file catalog holds empty_page_anchor (start of the chain of empty pages) and point_of_insert (page for next insert).]
  Chain should be transaction protected.
  Else: rebuild at restart; do not trust pointers (free page may be allocated).

Tuple Allocation - I
  The first strategy maintains a pointer to the "current block for insert" (CBI). When that block fills up, an empty block is requested from a system service, which then becomes the new "current block for insert".
  [Figure: CBI ("where next?") advancing step by step to the block at the head of the list of empty blocks, and so on.]
  This is the sequential insert strategy.
  Questions:
    What happens when the pointer arrives at the last block?
    How do we reclaim space freed by deleted tuples?

Incremental Space Expansion - I
  When the list of empty blocks is exhausted, there are two options to find space for new tuples. Let us assume the following configuration:
  [Figure: CBI pointing into the set of allocated blocks.]
  The first option is to let the CBI pointer circulate over the set of allocated blocks, assuming that space is released by deleted tuples. And so on.
  This works as long as enough space is freed up by deleted tuples. If there are only few gaps, finding space for a new tuple can become very expensive, because many blocks have to be probed sequentially.
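
The circulating search just described can be sketched in C. This is only an illustration under stated assumptions, not code from the lecture: each block is reduced to a count of free bytes, and the names `find_block_circular`, `free_bytes`, and `NBLOCKS` are hypothetical. The `probes` counter makes the cost visible when gaps are rare.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 8   /* hypothetical number of allocated blocks */

/* Circulate from the current block for insert (CBI) over all allocated
 * blocks. Returns the index of the first block with at least `need` free
 * bytes, or -1 if one full circle finds none. *probes receives the number
 * of blocks inspected. */
int find_block_circular(const size_t free_bytes[NBLOCKS], int cbi,
                        size_t need, int *probes)
{
    assert(cbi >= 0 && cbi < NBLOCKS);
    *probes = 0;
    for (int i = 0; i < NBLOCKS; i++) {
        int b = (cbi + i) % NBLOCKS;    /* wrap around past the last block */
        (*probes)++;
        if (free_bytes[b] >= need)
            return b;                   /* this block becomes the new CBI */
    }
    return -1;                          /* no gap is large enough */
}
```

With a single gap in block 2 and the CBI at block 5, the search inspects six blocks before it succeeds; with no suitable gap it inspects all of them and fails.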
  The need to probe blocks that are completely filled can be avoided by maintaining an array of bits that contains one bit per block, indicating whether the block is full.
  [Figure: example bit array 0 0 1 0 1.]

Naming Tuples (records)
  Relative byte address: file, offset in file.
    OK for insert-then-read-only DBs.
    Record can't easily grow.
    Deleted space not easily reclaimed.
  Tuple Identifier: file, page, index. The design shown below.
  [Figure: TIDs of the form (nodeid, fileid, pageno, dir_index) addressing "this tuple", both directly and via a pseudo-TID.]
  Main disadvantage: expensive reorganization (fixing overflows).

Implementing Database Keys
  Address record via directory.
  Address has an ID to allow for invalidation. ID never reused.
  Pointer can be swizzled.
  Popular with network & OO DBs.
  [Figure: the database key of "this tuple" (nodeid K, fileid A, record seq. no. 7, id 11) is looked up in the database key translation table for file A at node K, which yields a page id (7446) and an index into the page directory; the directory offset locates the tuple.]

Naming Tuples via Primary Key
  {Entry Sequenced, Relative}: primary key is physical addr
  {Hash, B-tree}: primary key is content (primary key)
  Primary Key is an alternative to DBkey.
  B-tree clusters related data.
  Problems:
    B-tree access is slower than Hash.
    Hash & B-tree keys are not fixed length, but neither is node.db_key.
  Benefit:
    Key can grow to LARGE databases.
    Good for distributed/partitioned data.
  It's religious.
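
To make the tuple identifier (file, page, index) scheme from the "Naming Tuples (records)" slide concrete, here is a minimal C sketch of the directory indirection. It is an assumption-laden illustration rather than the authors' implementation: the directory is a separate array instead of growing from the end of the page, locking, semaphores, and logging are left out, and all names (`Page`, `DirEntry`, `page_insert`, `read_by_tid`, `page_move`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BODY 256   /* hypothetical body size, small for illustration */
#define MAX_SLOTS 8     /* hypothetical directory capacity */

typedef struct { uint32_t fileid, pageno; uint16_t dir_index; } TID;
typedef struct { uint16_t offset, length; } DirEntry;

typedef struct {
    uint16_t no_entries;        /* # entries in the page directory */
    uint16_t next_free;         /* start of contiguous free space  */
    DirEntry dir[MAX_SLOTS];
    char     body[PAGE_BODY];
} Page;

/* Append a record to the page; returns its directory index or -1. */
int page_insert(Page *p, const void *rec, uint16_t len)
{
    if (len == 0 || p->no_entries == MAX_SLOTS ||
        p->next_free + len > PAGE_BODY)
        return -1;
    int idx = p->no_entries++;
    p->dir[idx].offset = p->next_free;
    p->dir[idx].length = len;
    memcpy(p->body + p->next_free, rec, len);
    p->next_free = (uint16_t)(p->next_free + len);
    return idx;
}

/* Read by TID: `p` is the page already located from (fileid, pageno).
 * Returns the tuple length, or -1 if the TID is invalid or buf too small. */
int read_by_tid(const Page *p, TID tid, void *buf, size_t buflen)
{
    if (tid.dir_index >= p->no_entries)
        return -1;
    const DirEntry *e = &p->dir[tid.dir_index];
    if (e->length > buflen)
        return -1;
    memcpy(buf, p->body + e->offset, e->length);
    return e->length;
}

/* Move a tuple inside the page body, as a reorganization would. The
 * caller must pick a free target area. Only the directory entry changes,
 * so the TID stays valid. */
void page_move(Page *p, uint16_t dir_index, uint16_t new_offset)
{
    assert(dir_index < p->no_entries);
    DirEntry *e = &p->dir[dir_index];
    assert(new_offset + e->length <= PAGE_BODY);
    memmove(p->body + new_offset, p->body + e->offset, e->length);
    e->offset = new_offset;
}
```

The design point is that a TID names a directory entry, not a byte position, so in-page reorganization never invalidates identifiers held elsewhere.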

Datatype Representation
  [Figure: the three representations E, P, and F and the mappings between them.]
  $m_{EP}$: value input from the user
  $m_{PE}$: value output to the user
  $m_{PF}$: modification through application program
  $m_{FP}$: SELECTing values into application program
  $m_{EF}$: input through interactive SQL
  $m_{FE}$: interactive query results
  E: External representation: ASCII, ISO Latin1, Unicode, ...
  P: Programming language representation
    many: PL/1, Cobol, C, all have different VARCHAR
    many type mismatches between P and F: interval, datetime, user, ...
  F: File representation: "native" types (e.g.: null values, ...).
  Lots of mapping functions.
  It would be great if $F^{-1}(F(x)) = x$ for these functions, but....
  Called the impedance mismatch between DB and PL.

Datatype Representations
  [Figure: E, P, F.]
  P = F: implies a special language (all other languages are 2nd class).
  E = F: use characters for everything. Problem: E changes from country to country! (all other languages are 2nd class)
  No easy way out of this.
  Unicode will help most of us and make E = F more attractive.

Representing Records
  [Figure: meta data (relations; attributes; attribute description: type, field length, offset) drives tuple addressing into the physical tuple (attr.1 ... attr.5).]

Representing Records
  struct relations {
      Uint    relation_no;              /* internal id for the relation */
      char *  owner;                    /* user id of the creator */
      long    creation_date;            /* date when it was created */
      PAGENO  current_point_of_insert;  /* free space done via */
      PAGENO  empty_page_anchor;        /* free space cursor method */
      Uint    no_of_attributes;         /* # attributes in relation */
      Uint    no_of_fixed_atts;         /* # fixed-length attributes */
      Uint    no_of_var_atts;           /* # variable-length attributes */
      struct attributes * p_attr;       /* pointer to the attributes array */
  };

  struct attributes {                   /* element of the attributes array */
      char *  attribute_name;           /* external name of the attribute */
      Uint    attribute_position;       /* index of the field in the tuple (1,2,...) */
      char    attribute_type;           /* this encodes the SQL type definition */
      Boolean var_length;               /* is it a variable-length field? */
      Boolean nulls_allowed;            /* can field assume NULL value? */
      char *  default_value;            /* value assumed if none stored in tuple */
      Uint    field_length;             /* maximum length of field */
      int     accumulated_offset;       /* explained later */
      Uint    significant_digits;       /* for data type FIXED */
      char *  encryption_key;           /* if the value is encrypted */
      char *  rest;                     /* further information on the attribute */
  };

Representing Records
  Generic header (rid, tid, #fields): general prefix to all tuple representations
    relation-id
    tuple-id
    number of fields in the tuple, or actual tuple length
  All fixed length encoding (fat records, fast-simple code, max < page path length)
  Variable fields have length (short records, slow code)
  Type-length-value (simple slow code, easy reorg)
  Fixed + ptrs to variables (compact, fast code)
  [Figure: example layouts of a tuple with fields F1 ... F6 under each encoding, showing tuple length, number of fields, and per-field lengths.]

Representing Records (Reuter Recommends)
  [Figure: tuple layout consisting of relation identifier, tuple identifier, number of fixed length fields, number of variable length fields, a pointer array with offsets for the variable length fields, the fixed length fields (F1, F3, F4, F5 with lengths 3, 4, 2, 4), and the variable length fields (F2, F6).]

Some Details
  Representing null values:
    missing field
    special value
    extra field
    bitmap
  Representing keys:
    Efficient comparison is important.
    Store "conditioned" key so that a simple byte-compare works.
    Flip integer sign (so negative sorts low).
    Flip float so exponent first, mantissa second, flipped signs.
    Compress varchars. MANY refinements. Want an order-preserving compression.

Fat Records (Longer Than a Page)
  Record must fit on page.
  Long fields segregated to separate page: may be good in some cases (Multi-media DBs)
  Overflow page chains
  Segment record across pages
  [Figure label: long field.]

Obese records (Longer Than 10 Pages)
  If record is super-large, then may want to index into it quickly.
  "Obvious" design is standard tree.
  Record is root of tree.
  Grow levels when one fills.
  Allows blob growth, update, ...

Non-Normalized Relations
  [Figure: three ways to store a record C together with its subordinate records S1 ... Sn: a) chaining, b) clustering, c) mini-directory.]

Structured File Definition
  [Figure: taxonomy of file organizations. A file is unstructured or structured; structured files are associative (keyed, hash) or non-associative, i.e. system sequenced (entry sequenced, relative); the figure also shows a "clustered" node.]

File Layouts
  Unstructured: a sequence of bytes
  Structured, Entry Sequenced:
    Records inserted at end
    Records cannot grow
    Key is RBA (relative byte address)
  Relative:
    Fixed size record slots
    Records limited by that size
    Key is relative record number

Associative File Types
  Hashed:
    Records addressed by key field(s)
    Bucket has list of records
    Overflow to other buckets or to overflow pages
  Key Sequenced:
    Records addressed by key field(s)
    Records in sorted order
    Either sorting or B-tree or ...
  [Figure: key-sequenced pages holding As, Bs, ..., Ys, Zs.]

Parameters at Create
  Database
  Record type (fields)
  Key
  Organization {Entry Sequenced, Relative, Hashed, Key Sequenced}
  Block size (page size)
  Extent size (storage area)
  Partitioning (among discs or nodes) by key.
  Attributes:
    access control
    allocation and archive strategy
    transactional
    lifetime, zero on free, and on and on ....

Parameters at Create
  "Secondary" indices.
  Primary key is .... (e.g. customer number).
  Secondary key is social security number.
  Non-unique secondary key is Last_Name, First_name.
  Secondary indices can be {unique or not} and {hashed or Key Sequenced}.
  An index is like a table; the fields of the index are: secondary key, primary key.
  So can define index on any kind of base table.
  [Figure label: Base Table.]

Secondary Index Example
  Base table is key-sequenced on CustomerNumber.
  Index table is key-sequenced on Name-CustomerNumber.
  Index can be a replica of the base table in another order.
  Transaction recovery and locking keep them consistent.
  Tuple management system:
    Maintains indices (insert, update, delete)
    Navigates to base table via secondary index as one request.

What happens when you open a relation?
  Many files get opened.
  Read directory (catalog)
  Partitions, Indices
  Access module: open (filename, .....)
  Tuple oriented file system: read file descriptor; do security checking; return file descriptor
  Access module, continued:
    if there are other partitions: open partitions
    if there are indices: open indices
    access the file

Once OPEN, Application can SCAN the relation
  Scan is a row & column subset:
    SELECT <column list> FROM <table> WHERE <predicate>
  With a specified start/stop key:
    AND <key> BETWEEN <low> AND <high>
  In a specified order (supported by a secondary index):
    ASCENDING | DESCENDING
  A locking protocol:
    {Serializable | Repeatable Read | Committed Read | Uncommitted Read | Skip Uncommitted | …}
    TIMEOUT <seconds>

SCAN States
  [Figure: the tuples in the scan, represented by their key values K1 K2 K3 K4 K5 ··· Kn, with the scan position marked for each scan state: Before K, At K, After K. The fourth state, Null, means the scan is closed.]

SCAN States: How they change
  On error, scan state does not change.
  On open, scan is {before | after} the {first | last} set element if scan is {ascending | descending}.
  On fetch next: if {not end of set | at end of set} scan is {at next | before first | after last} element.
  On insert, scan is at element.
  On delete, scan is at the missing element.

SCAN States: How they change
  On update: scan position is not affected.
  If tuple moves (because ordering attributes affected), scan key position is unchanged.
  [Figure: tuples in the scan K1 K2 K3 K4 K5 ··· Kn (represented by their key values), the scan direction, and an update that moves tuple K3.]
  Scan is "at" key K3 after the delete, even if the record moves.
  Can create Halloween problem (give everybody a 10% raise).
  But scan enumerates entire set.
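
The open and fetch-next rules from the "SCAN States: How they change" slide can be sketched as a small state machine. This is a hedged illustration only: the scan runs over an in-memory array of integer keys, the set is assumed non-empty, insert, delete, and error handling are not modeled, and the names `Scan`, `ScanPos`, `scan_open`, `scan_fetch_next`, and `scan_close` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* The four states from the slide: before K, at K, after K, null (closed). */
typedef enum { SCAN_BEFORE, SCAN_AT, SCAN_AFTER, SCAN_NULL } ScanPos;

typedef struct {
    const int *keys;      /* tuples in the scan, sorted ascending by key */
    int        n;         /* number of tuples */
    bool       ascending; /* scan direction */
    ScanPos    pos;       /* position relative to keys[idx] */
    int        idx;       /* the K the scan is before, at, or after */
} Scan;

/* On open: before the first element (ascending) or after the last
 * element (descending). */
void scan_open(Scan *s, const int *keys, int n, bool ascending)
{
    assert(n > 0);
    s->keys = keys;
    s->n = n;
    s->ascending = ascending;
    s->pos = ascending ? SCAN_BEFORE : SCAN_AFTER;
    s->idx = ascending ? 0 : n - 1;
}

/* On fetch next: move to the next element in scan direction; at the end
 * of the set the scan ends up after the last (ascending) or before the
 * first (descending) element. Returns true if a key was fetched. */
bool scan_fetch_next(Scan *s, int *key_out)
{
    if (s->pos == SCAN_NULL)
        return false;                       /* scan is closed */
    int step = s->ascending ? 1 : -1;
    ScanPos facing = s->ascending ? SCAN_BEFORE : SCAN_AFTER;
    /* If the scan already faces keys[idx], that element is next. */
    int target = (s->pos == facing) ? s->idx : s->idx + step;
    if (target >= 0 && target < s->n) {
        s->pos = SCAN_AT;
        s->idx = target;
        *key_out = s->keys[target];
        return true;
    }
    s->pos = s->ascending ? SCAN_AFTER : SCAN_BEFORE;
    s->idx = s->ascending ? s->n - 1 : 0;
    return false;
}

void scan_close(Scan *s) { s->pos = SCAN_NULL; }
```

Keeping the position as a (state, key) pair rather than a bare index is what lets a scan stay well defined when the element it is "at" disappears.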

SCAN Data structure
  enum SCAN_STATE { TOP, ON, BOTTOM, BETWEEN, NIL };   /* the 5 scan states */
  enum ISOLATION { UNCOMMITTED_READ, ..., SERIALIZABLE, READ_PAST, BOUNCE };

  typedef struct {
      Uint        scanid;       /* handle for scan; returned by open_scan */
      TRID        owner;        /* which transaction uses the scan */
      FILE *      fileid;       /* handle of file the scan is defined on */
      char *      scan_key;     /* specification of scan key attribute(s) */
      char *      start_key;    /* lower bound of scan range */
      char *      stop_key;     /* upper bound of scan range */
      char *      filter;       /* qualifying predicate for all tuples in scan */
      ISOLATION   isol_degree;  /* locking policy for tuples accessed */
      SCAN_STATE  scan_state;   /* state of scan pointer */
      char        scan_key[ ];  /* scan key the scan is before, at, or after */
  } SCANCB, * SCANCBP;

Entry Sequenced File Insert
  fix page descriptor page
  find eof page
  fix eof data page
  if no space in page <see next slide for transaction to advance page>
  unfix descriptor page
  add record to page (updating on-page directory)
  generate log record (new value) and update page LSN
  compute lock name of record (based on TID)
  get lock on record
  unfix data page
  To make this work, MUST be assured lock is available. Otherwise page sem can (undetected) deadlock with lock wait.
  So, UNDO of entry-sequence insert does not free the space, it just invalidates the record.

Entry Sequenced File Insert: If EOF page or File is Full
  Top level transaction: begin new transaction (will not abort if insert aborts) to extend file EOF page.
  (leaves insert transaction)
  unfix directory page
  if file full, panic()
  start a top-level transaction
    fix the directory
    advance the page eof, updating directory and freespace
    log the changes
    fix the data page
    format it
    log the change
    unfix the directory and data page
    commit the transaction & resume insert transaction
  fix directory, fix eof, check to see that there is room for the record.

Entry Sequenced Operations
  Read by RBA:
    get record lock (node, file, RBA) shared
    if {timeout, deadlock, error} return error;
    Fix page
    if record valid, copy to buffer
    Unfix page
    Return record or null
  Delete by RBA:
    get record lock (node, file, RBA) exclusive
    if {timeout, deadlock, error} return error;
    Fix page
    Mark record invalid
    Generate log record
    Update page LSN
    Unfix page
  Note: both must test that RBA <= EOF.
  Update, ReadNext, ... are similar.

Relative Files
  Records fit in fixed-length slots.
  Operation on slots.
  Separate transactions extend the file EOF (allocate and format pages).
  [Figure: a page with page header, record slots (some marked "Empty Slot"), and a page directory holding the record lengths: ... 10 88 18 0 62 82 100 75.]

Relative Files
  {Read | Insert | Update | Delete} by key are all easy.
  Insert "near" key works by:
    Plan A: look at page, then look at neighbor pages (left, right, left, right, ...)
    Plan B: allocate overflow page for base page
    Plan C: look in free-space bit-map or byte (%full) map.

Key Sequenced or Hashed Files
  Key sequenced is subject of next chapter.
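
For the relative files described above, addressing a slot and Plan A's neighbor search reduce to a little arithmetic. The following C sketch is illustrative only: it assumes zero-based relative record numbers, a fixed number of slots per page, and a home page derived from the record number. The names `slot_of` and `probe_order` are hypothetical.

```c
#include <assert.h>

typedef struct { long page; int slot; } SlotAddr;

/* Map a relative record number to its page and slot, assuming every
 * page holds `slots_per_page` fixed-length slots. */
SlotAddr slot_of(long recno, int slots_per_page)
{
    assert(recno >= 0 && slots_per_page > 0);
    SlotAddr a = { recno / slots_per_page, (int)(recno % slots_per_page) };
    return a;
}

/* Plan A for insert "near" a key: list the pages to inspect, starting
 * with the home page and then alternating left, right, left, right, ...
 * up to `max_dist` pages away. Pages outside [0, npages) are skipped.
 * Returns the number of entries written to order[] (at most `cap`). */
int probe_order(long home, long npages, int max_dist, long order[], int cap)
{
    assert(home >= 0 && home < npages);
    int n = 0;
    if (n < cap)
        order[n++] = home;
    for (long d = 1; d <= max_dist; d++) {
        if (home - d >= 0 && n < cap)
            order[n++] = home - d;          /* left neighbor  */
        if (home + d < npages && n < cap)
            order[n++] = home + d;          /* right neighbor */
    }
    return n;
}
```

A caller would walk `order[]` and take the first page with a free slot, falling back to Plan B or Plan C when none of the neighbors has room.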

File Clustering
  Different record types kept in same page/file.
  For example: master and detail records of an invoice. Detail records are always accessed if master is.
  Situation:
    Master key: InvoiceNo
    Detail key: InvoiceNo (Foreign Key References Master) + SequenceNo
  Technique:
    Hash or Key Sequence Master on InvoiceNo
    Hash or Key Sequence Detail on InvoiceNo+SequenceNo in same table.

Clustering different record types in a page
  [Figure: a page holding master records (keys 10, 20, 33), each followed by its detail records (10 0 ... 10 4; 20 0, 20 1; 33 0 ... 33 2).]
  One disc request gets the entire order.
  Concept works for any storage hierarchy.
  Is natural for Hierarchical database systems.

Summary
  Representing values
  Representing records
  Storing records in pages and across pages
  Organizing records (entry, relative, key, hash)
  Examples of fix/log/log logic.
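
As a closing illustration of the File Clustering technique (master keyed on InvoiceNo, detail keyed on InvoiceNo+SequenceNo, both in the same table), the sketch below shows a composite-key ordering that places each master record directly before its detail records. It is an assumption, not the lecture's code: the record-kind tag and the names `ClusterKey`, `cluster_cmp`, and `cluster_sort` are hypothetical, and an in-memory `qsort` stands in for key-sequenced storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { MASTER = 0, DETAIL = 1 } RecKind;

typedef struct {
    int     invoice_no;   /* master key, and foreign key of the detail */
    RecKind kind;         /* master sorts before its details */
    int     sequence_no;  /* meaningful for DETAIL records only */
} ClusterKey;

/* Order by InvoiceNo first, then master before detail, then SequenceNo. */
int cluster_cmp(const void *pa, const void *pb)
{
    const ClusterKey *a = pa, *b = pb;
    if (a->invoice_no != b->invoice_no)
        return a->invoice_no < b->invoice_no ? -1 : 1;
    if (a->kind != b->kind)
        return a->kind < b->kind ? -1 : 1;
    if (a->kind == DETAIL && a->sequence_no != b->sequence_no)
        return a->sequence_no < b->sequence_no ? -1 : 1;
    return 0;
}

/* Arrange keys in clustered order, as a key-sequenced file would. */
void cluster_sort(ClusterKey *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    qsort(keys, n, sizeof keys[0], cluster_cmp);
}
```

Because records that sort next to each other land on the same or adjacent pages, one request for invoice 10 finds the master and its details together.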