Structured Files 

[Course schedule, Aug. 2 - 6, with sessions at 9:00, 11:00, 13:30, 15:30, and 18:00; this lecture is the "Structured files" session.]
Chapter 19

What The Record Manager Does
Storage allocation: store tuples in file blocks.
Tuple addressing: give each tuple an identifier and provide fast access via that id.
Enumeration: fast enumeration of all of a relation's tuples.
Content addressing: give fast access via attribute values.
Maintenance: update/delete a tuple and its access paths.
Protection: support for security, e.g. encryption or tuple-granularity access control.
Jim Gray, Andreas Reuter
Transaction Processing - Concepts and Techniques
WICS August 2 - 6, 1999

Outline
Representing values
Representing records
Storing records in pages and across pages
Organizing records (entry, relative, key, hash)
Examples of fix/log/log logic.


Record Allocation in a Page
Recall:
File is a collection of fixed-length pages (blocks).
File and buffer managers map files to disc/RAM
[Figure: a block (the slot on disk) consists of a block head, the page, and a block trailer; the page consists of a page head, the page body, and the page dir.]

Page Declares
typedef struct
{ FILENO      fileno;      /* file where the page lives                 */
  uint        pageno;      /* page number within the file               */
} PAGEID, *PAGEIDP;        /* global page numbers                       */

typedef struct
{ PAGEID      thatsme;     /* identifies the page                       */
  PAGE_TYPE   page_type;   /* see description above                     */
  OBJID       object_id;   /* internal id of the relation,index,etc.    */
  LSN         safe_up_to;  /* page LSN for the WAL - protocol           */
  PAGEID      previous;    /* often pages are members of doubly         */
  PAGEID      next;        /* linked lists                              */
  PAGE_STATE  status;      /* valid,in-doubt,copy of something,etc      */
  int         no_entries;  /* # entries in page dir (see below)         */
  int         unused;      /* free bytes not in freespace               */
  int         freespace;   /* # contiguous free bytes for data          */
  char        stuff[];     /* will grow                                 */
} PAGE_HEADER, * PAGE_PTR;

Different uses of pages
Data: Homogeneous record storage
Cluster: like Data except many different record types

Index (access path): hashed or B-tree
Free-space bitmap: describes status of 4,000 other
pages.
Directory: meta-data about this or other files
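Page Directory: Points to Records on Page
[Figure: page layout. The page header is followed by the tuples (1st to 5th); the page directory (entries 1 to 5) sits at the other end of the page and points to the tuples.]
Tuples are inserted in this direction (from the page header toward the directory).
Page directory grows in this direction (from the end of the page toward the tuples).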
Record id is: File, Page, Directory_offset
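The page directory idea can be sketched in a few lines of C. This is a minimal illustration, not the book's code: the `Page` and `DirEntry` types, the 256-byte page size, and the function names are assumptions made for the example. Tuples grow upward from the start of the page body, directory entries grow downward from its end, and a tuple is addressed by its directory index.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BODY_SIZE 252                 /* hypothetical: 256-byte page, 4-byte header */

typedef struct {
    uint16_t no_entries;              /* # entries in the page directory */
    uint16_t free_start;              /* first free byte after the last tuple */
    uint8_t  body[BODY_SIZE];         /* tuples grow up, directory grows down */
} Page;

typedef struct { uint16_t offset, length; } DirEntry;

static void page_init(Page *p) { memset(p, 0, sizeof *p); }

/* Fetch directory entry idx; entry 0 is the last DirEntry-sized chunk of the body. */
static DirEntry get_entry(const Page *p, unsigned idx) {
    DirEntry e;
    memcpy(&e, p->body + BODY_SIZE - (idx + 1) * sizeof e, sizeof e);
    return e;
}

/* Insert a tuple; returns its directory index, or -1 if it does not fit. */
static int page_insert(Page *p, const void *tuple, uint16_t len) {
    assert(p != NULL && tuple != NULL);
    size_t dir_bytes = (p->no_entries + 1u) * sizeof(DirEntry);
    if ((size_t)p->free_start + len + dir_bytes > BODY_SIZE) return -1;
    DirEntry e = { p->free_start, len };
    memcpy(p->body + p->free_start, tuple, len);            /* tuple area */
    memcpy(p->body + BODY_SIZE - dir_bytes, &e, sizeof e);  /* new directory entry */
    p->free_start = (uint16_t)(p->free_start + len);
    return p->no_entries++;
}

/* Read by directory index: follow the entry, copy the tuple; returns length or -1. */
static int page_read(const Page *p, unsigned idx, void *out, size_t cap) {
    if (idx >= p->no_entries) return -1;
    DirEntry e = get_entry(p, idx);
    if (e.length > cap) return -1;
    memcpy(out, p->body + e.offset, e.length);
    return e.length;
}
```

Because callers only ever hold the directory index, the page can move tuples around internally (for example when compressing free space) without invalidating record ids.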
Accessing a Record
Read by TID:
  Lock record shared
  Locate page
  Get semaphore shared
  Follow directory offset
  Copy tuple
  Give semaphore
Insert by TID:
  Lock record exclusive
  Locate page
  Get semaphore exclusive
  Find space
  Insert
  Log insert (tid, new value)
  Update page LSN, header, directory
  Give semaphore
Accessing a Record
Delete by TID:
  Lock record exclusive
  Locate page
  Get semaphore exclusive
  Add record to free space
  Log delete (tid, old value)
  Update page LSN, header, directory
  Give semaphore
Update by TID:
  much like delete-&-insert

Finding Space for Insert / Update
If tuple fits in the page's contiguous free space: easy.
If tuple fits in the page's total free space, but not contiguously: reorganize (compress) the page.
Physiological logging makes this cheap.
If tuple does not fit then:
leave forwarding address on page.
Optionally leave record prefix on page.
Segment record among several pages.
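The three cases can be written as a small decision function. The sketch below is only an illustration: the `Placement` enum and `choose_placement` are hypothetical names, the two inputs correspond to the `freespace` and `unused` fields of the page header shown earlier, and the space needed for a new directory entry is ignored for simplicity.

```c
#include <assert.h>

typedef enum {
    FITS_CONTIGUOUS,        /* easy case: append into contiguous free space */
    FITS_AFTER_COMPACTION,  /* enough total free bytes: compress the page first */
    NEEDS_OVERFLOW          /* forwarding address, prefix, or segmentation */
} Placement;

/* freespace: contiguous free bytes; unused: free bytes scattered elsewhere on the page */
static Placement choose_placement(int freespace, int unused, int tuple_len) {
    assert(freespace >= 0 && unused >= 0 && tuple_len > 0);
    if (tuple_len <= freespace) return FITS_CONTIGUOUS;
    if (tuple_len <= freespace + unused) return FITS_AFTER_COMPACTION;
    return NEEDS_OVERFLOW;
}
```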

Finding space within a file
Free space table:
Summarizes status of many pages
(one 8KB bitmap page => 64K bits => describes about 500MB of 8KB data pages)
Good for clustered & contiguous allocation

[Figure: free space directories. Directory pages (p1, p2, ...) each summarize the free-space status (f2, f3, ...) of a run of data pages.]
Bitmap should be transaction protected.
If transaction aborts, page is freed again.
Alternatively, treat bitmap as a hint.
Rebuild periodically.
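The capacity arithmetic and the lookup can be sketched as follows, assuming one bit per data page (1 = in use) and 8KB pages. The `FreeMap` type and the `map_*` function names are made up for this example; transaction protection of the bitmap is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES        8192u
#define BITS_PER_MAP_PAGE (PAGE_BYTES * 8u)   /* 65,536 data pages per bitmap page */

typedef struct { uint8_t bits[PAGE_BYTES]; } FreeMap;   /* bit = 1: page in use */

/* Bytes of data pages described by one bitmap page: 64K pages * 8KB = 512MB. */
static uint64_t bytes_described(void) {
    return (uint64_t)BITS_PER_MAP_PAGE * PAGE_BYTES;
}

static int map_test(const FreeMap *m, uint32_t page) {
    assert(page < BITS_PER_MAP_PAGE);
    return (m->bits[page / 8] >> (page % 8)) & 1;
}

static void map_set(FreeMap *m, uint32_t page, int used) {
    assert(page < BITS_PER_MAP_PAGE);
    if (used) m->bits[page / 8] |= (uint8_t)(1u << (page % 8));
    else      m->bits[page / 8] &= (uint8_t)~(1u << (page % 8));
}

/* First free page at or after start, or -1 if none is free. */
static long map_find_free(const FreeMap *m, uint32_t start) {
    for (uint32_t p = start; p < BITS_PER_MAP_PAGE; p++)
        if (!map_test(m, p)) return (long)p;
    return -1;
}
```

Searching from a start position supports clustered and contiguous allocation: a caller can ask for the first free page near the pages it already owns.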
Finding space within a file
Free space cursor/list
[Figure: the file catalog holds empty_page_anchor, which heads the chain of empty pages, and point_of_insert, which names the page for the next insert.]
Chain should be transaction protected
Else: rebuild at restart
do not trust pointers
(free page may be allocated).
Tuple Allocation - I
The first strategy maintains a pointer to the “current block for insert” (CBI). When that
block fills up, an empty block is requested from a system service, which then becomes the
new “current block for insert”.

[Figure: successive snapshots of the file. Each time the block at CBI fills up, CBI moves to the block taken from the head of the list of empty blocks.]
And so on. This is the sequential insert strategy.
Questions: What happens when the pointer arrives at the last block? How do we reclaim
space freed by deleted tuples?
Incremental Space Expansion - I
When the list of empty blocks is exhausted, there are two options to find space for
new tuples. Let us assume the following configuration:

[Figure: a file whose blocks are all allocated, with the CBI pointer at one of them.]
The first option is to let the CBI pointer circulate over the set of allocated blocks,
assuming that space is released by deleted tuples.
And so on. This works as long as enough space is freed up by deleted tuples. If there
are only a few gaps, finding space for a new tuple can become very expensive, because
many blocks have to be probed sequentially.
The need to probe blocks that are completely filled can be avoided by maintaining
an array of bits that contains one bit per block indicating whether a block is full:
0 0 1 0 1
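A minimal sketch of the circulating CBI pointer with a "block is full" array might look like this. The function name is hypothetical, and for readability the sketch uses one byte per block rather than a packed bit array.

```c
#include <assert.h>
#include <stddef.h>

/* full[i] != 0 means block i is completely filled.
 * Starting at the current block for insert (cbi), circulate once over all
 * allocated blocks and return the first one that is not full, or -1 if
 * every block is full (the file then has to be extended). */
static int next_block_for_insert(const unsigned char *full, size_t nblocks, size_t cbi) {
    assert(full != NULL && nblocks > 0 && cbi < nblocks);
    for (size_t step = 0; step < nblocks; step++) {
        size_t b = (cbi + step) % nblocks;   /* wrap around at the last block */
        if (!full[b]) return (int)b;
    }
    return -1;
}
```

Only the in-memory array is consulted, so completely filled blocks are skipped without reading them.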
Naming Tuples (records)
Relative byte address:
file, offset in file: OK for insert-then-read-only DBs
record can't easily grow.
deleted space not easily reclaimed.
Tuple Identifier
file, page, index: The design shown below.
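[Figure: a TID (nodeid, fileid, pageno 7446, dir_index 3) addresses "this tuple" through the page directory of page 7446. After an overflow, the directory entry on page 7446 holds a pseudo-TID that forwards to page 5127, where the tuple now lives.]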
Main disadvantage: expensive reorganization (fixing overflows)
Implementing Database Keys
Address record via directory.
Address has an ID to allow for invalidation.
ID never reused.
Pointer can be swizzled.
Popular with network & OO DBs.
[Figure: the database key of "this tuple" (nodeid K, fileid A, record seq. no. 7) selects an entry in the database key translation table for file A at node K; that entry gives the page id (7446) and the index into the page directory, whose offset locates the tuple.]
Naming Tuples via Primary Key
{Entry Sequenced, Relative}: primary key is physical addr
{Hash, B-tree}: primary key is content (primary key)
Primary Key an alternative to DBkey
B-tree clusters related data
Problems:
B-tree access is slower than Hash.
Hash & B-tree keys not fixed length
but neither is node.db_key
Benefit:
key can grow to LARGE databases
Good for distributed/partitioned data
It’s religious.

Datatype Representation
m_EP: value input from the user
m_PE: value output to the user
m_PF: modification through application program
m_FP: SELECTing values into application program
m_EF: input through interactive SQL
m_FE: interactive query results
E: External representation: ASCII, ISO Latin1, Unicode,...
P: Programming language representation
many: PL/1, Cobol, C, all have different VARCHAR
many type mismatches between P and F: interval, datetime, user,...
F: File representation: "native" types (e.g.: null values, ....).
Lots of mapping functions.
It would be great if F^{-1}(F(x)) = x for these functions, but....
Called the impedance mismatch between DB and PL

Datatype Representations
[Figure: the three representations E, P, and F.]
P = F: Implies a special language
(all other languages are 2nd class)
E = F: Use characters for everything.
Problem: E changes from country to country!
(all other languages are 2nd class)
No easy way out of this.
Unicode will help most of us and make E = F more attractive
Representing Records
[Figure: the meta data (relations, attributes, and per attribute a description with type, field length, and offset) drives tuple addressing and describes the physical tuple (attr.1 ... attr.5).]
Representing Records
struct relations
{ Uint     relation_no;              /* internal id for the relation               */
  char *   owner;                    /* user id of the creator                     */
  long     creation_date;            /* date when it was created                   */
  PAGENO   current_point_of_insert;  /* free space done via                        */
  PAGENO   empty_page_anchor;        /* free space cursor method                   */
  Uint     no_of_attributes;         /* # attributes in relation                   */
  Uint     no_of_fixed_atts;         /* # fixed-length attributes                  */
  Uint     no_of_var_atts;           /* # variable-length attributes               */
  struct attributes * p_attr;        /* pointer to the attributes array            */
};
struct attributes                    /* attributes array                           */
{ char *   attribute_name;           /* external name of the attribute             */
  Uint     attribute_position;       /* index of the field in the tuple (1,2,...)  */
  char     attribute_type;           /* this encodes the SQL - type definition     */
  Boolean  var_length;               /* is it variable_length field ?              */
  Boolean  nulls_allowed;            /* can field assume NULL value ?              */
  char *   default_value;            /* value assumed if none stored in tuple      */
  Uint     field_length;             /* maximum length of field                    */
  int      accumulated_offset;       /* explained later                            */
  Uint     significant_digits;       /* for data type FIXED                        */
  char *   encryption_key;           /* if the value encrypted                     */
  char *   rest;                     /* further information on the attribute       */
};

Representing Records
Generic header (rid, tid, #fields):
general prefix to all tuple representations
(relation-id, tuple-id, number of fields in the tuple or actual tuple length)
all fixed length encoding
(fat records, fast-simple code
max < page path length)
variable fields have length
(short records, slow code)
type-length-value
(simple slow code, easy reorg)
fixed + ptrs to variables.
(compact, fast code)
[Figure: one example tuple with six fields F1 ... F6 shown under each encoding; F1, F3, F4, and F5 have fixed lengths 3, 4, 2, and 4, while F2 and F6 are variable with actual lengths m and n.]

Representing Records
(Reuter Recommends)
[Figure: the recommended layout. A prefix with relation identifier, tuple identifier, number of fixed length fields, and number of variable length fields; the fixed length fields F1 (3), F3 (4), F4 (2), F5 (4); a pointer array with offsets for the variable length fields; and the variable length fields F2 and F6.]
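The "fixed + pointers to variables" layout can be illustrated with a small encoder. This is a sketch under assumptions, not the layout from the slides: the generic header is omitted, offsets are 16-bit and relative to the start of the record, one extra offset marks the end of the record, and `rec_build` / `rec_var` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout:
 *   [fixed part: fixed_len bytes][n_var + 1 uint16 offsets][variable data]
 * offsets[i] is the start of variable field i; offsets[n_var] is the record length. */

/* Build a record into rec; returns its length, or 0 if it does not fit. */
static size_t rec_build(uint8_t *rec, size_t cap, const void *fixed, uint16_t fixed_len,
                        const char *const *vars, uint16_t n_var) {
    size_t ptrs = fixed_len;                                /* offset array starts here */
    size_t pos  = ptrs + (n_var + 1u) * sizeof(uint16_t);   /* variable data starts here */
    if (pos > cap || pos > UINT16_MAX) return 0;
    memcpy(rec, fixed, fixed_len);
    for (uint16_t i = 0; i < n_var; i++) {
        size_t len = strlen(vars[i]);
        if (pos + len > cap || pos + len > UINT16_MAX) return 0;
        uint16_t off = (uint16_t)pos;
        memcpy(rec + ptrs + i * sizeof off, &off, sizeof off);
        memcpy(rec + pos, vars[i], len);
        pos += len;
    }
    uint16_t end = (uint16_t)pos;
    memcpy(rec + ptrs + n_var * sizeof end, &end, sizeof end);
    return pos;
}

/* Locate variable field i: sets *out to its first byte and returns its length. */
static size_t rec_var(const uint8_t *rec, uint16_t fixed_len, uint16_t i, const uint8_t **out) {
    uint16_t start, end;
    assert(rec != NULL && out != NULL);
    memcpy(&start, rec + fixed_len + i * sizeof start, sizeof start);
    memcpy(&end, rec + fixed_len + (i + 1u) * sizeof end, sizeof end);
    *out = rec + start;
    return (size_t)(end - start);
}
```

Fixed fields are reached with a constant offset taken from the catalog, and any variable field is reached with two offset reads, which is why this form is both compact and fast.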
Some Details
Representing null values:
missing field
special value
extra field
bitmap
Representing keys
efficient comparison is important
store "conditioned" key so simple byte-compare.
Flip integer sign (so negative sorts low)
Flip float so exponent first, mantissa second, flipped
signs
Compress varchars.
MANY refinements.
Want an order-preserving compression.
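As an illustration of a "conditioned" key, the sketch below encodes a signed 32-bit integer so that a plain byte comparison gives numeric order: the sign bit is flipped and the bytes are stored most significant first. The function names are made up for this example, and the float and varchar cases are not covered.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Condition a signed integer key: flip the sign bit (so negatives sort low)
 * and store big-endian (so the most significant byte is compared first). */
static void condition_int32(int32_t v, uint8_t out[4]) {
    uint32_t u = (uint32_t)v ^ 0x80000000u;
    out[0] = (uint8_t)(u >> 24);
    out[1] = (uint8_t)(u >> 16);
    out[2] = (uint8_t)(u >> 8);
    out[3] = (uint8_t)u;
}

/* Compare two integers only through their conditioned byte strings. */
static int compare_conditioned(int32_t a, int32_t b) {
    uint8_t ka[4], kb[4];
    condition_int32(a, ka);
    condition_int32(b, kb);
    return memcmp(ka, kb, sizeof ka);
}
```

With keys stored this way, the access path code needs a single byte-compare routine regardless of the key's SQL type.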

Fat Records (Longer Than a Page)
Record must fit on page.
Long fields segregated to separate page: may be good in
some cases (Multi-media DBs)
Overflow page chains
Segment record across pages
[Figure: a record whose long field is stored on a separate page.]

Obese records (Longer Than 10 Pages)
If record is super-large, then may want to index into it quickly.
"Obvious" design is standard tree.

Record is root of tree.
Grow levels when one fills.
Allows blob growth, update,...
Non-Normalized Relations
[Figure: three ways to store a record C together with its repeating sub-records S1 ... Sn: a) chaining, b) clustering, c) mini-directory.]
Structured File Definition
[Figure: taxonomy of file organizations, with the labels: file; unstructured; structured; associative (keyed, hash); non-associative (entry sequenced, relative); system sequenced; clustered.]
File Layouts
Unstructured:
a sequence of bytes
Structured, Entry Sequenced:
Records inserted at end
Records cannot grow
key is RBA (relative byte address)
Relative:
fixed size record slots
records limited by that size
key is relative record number
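For a relative file, the relative record number maps to a page and slot by simple arithmetic. The sketch below assumes record numbers start at 0 and that each page holds a header followed by equally sized slots; `SlotAddr`, `locate_slot`, and the parameters are hypothetical, and space for an on-page directory is ignored.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t pageno;        /* page within the file */
    uint32_t slot;          /* slot within the page */
    uint32_t byte_in_page;  /* byte offset of the slot within the page */
} SlotAddr;

/* Map a relative record number to its fixed-size slot. */
static SlotAddr locate_slot(uint32_t recno, uint32_t page_size,
                            uint32_t header_size, uint32_t slot_size) {
    assert(slot_size > 0 && page_size > header_size);
    uint32_t per_page = (page_size - header_size) / slot_size;
    assert(per_page > 0);
    SlotAddr a;
    a.pageno = recno / per_page;
    a.slot = recno % per_page;
    a.byte_in_page = header_size + a.slot * slot_size;
    return a;
}
```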
Associative File Types
Hashed:
Records addressed by key field(s)
bucket has list of records
overflow to other buckets or to overflow pages.
Key Sequenced:
Records addressed by key field(s)
Records in sorted order.
either sorting or b-tree or...
[Figure: a key-sequenced file with its records in key order: As, Bs, ..., Ys, Zs.]
Parameters at Create
Database
Record type (fields)
Key
Organization { Entry Sequenced, Relative, Hashed, Key Sequenced }
Block size (page size)
Extent size (storage area)
Partitioning (among discs or nodes) by key.
Attributes: access control
allocation and archive strategy
transactional
lifetime, zero on free, and on and on ....

Parameters at Create
"Secondary" indices.
Primary key is....(e.g. customer number).
Secondary key is social security number
Non-Unique secondary key is Last_Name, First_name
Secondary indices can be {unique or not} and {hashed or Key Sequenced}
index is like a table.
fields of index are:
secondary key, primary key
So can define index on any kind of base table
[Figure: an index whose entries refer to records of the Base Table.]
Secondary Index Example
Base table is key-sequenced on CustomerNumber.
Index table is key-sequenced on Name-CustomerNumber.
Index can be a replica of the base table in another order.
Transaction recovery and locking keep them consistent.
Tuple management system
Maintains indices (insert, update, delete)
Navigates to base table via secondary index as one request.

What happens when you open a relation?
Many files get opened.
Read directory (catalog)
Partitions,
Indices
Access module:
  open (filename,.....)
  access the file
Tuple oriented file system:
  read file descriptor
  do security checking
  return file descriptor
  if there are other partitions:
    open partitions
  if there are indices:
    open indices

Once OPEN,
Application can SCAN the relation
Scan is a row & column subset
SELECT <column list>
FROM <table>
WHERE <predicate>

With a specified start/stop key
AND <key> BETWEEN <low> AND <high>
In a specified order (supported by a secondary index)
ASCENDING | DESCENDING
A locking protocol {Serializable | Repeatable Read | Committed Read |
Uncommitted Read | Skip Uncommitted |…}
TIMEOUT <seconds>
SCAN States
[Figure: the scan states shown against the tuples in the scan, K1 K2 K3 K4 K5 ... Kn (represented by their key values). The scan can be Before a key K, At a key K, After a key K, or Null (scan closed).]

SCAN States: How they change
On error, scan state does not change.
On open,
scan is {before | after} the {first | last} set element
if scan is {ascending | descending}
On fetch next:
if {not end of set | at end of set}
scan is {at next | before first | after last } element
On insert
scan is at element
On delete
scan is at the missing element

SCAN States: How they change
On update: scan position is not affected.
if tuple moves (because ordering attributes affected)
scan key position is unchanged
[Figure: tuples in the scan K1 K2 K3 K4 K5 ... Kn (represented by their key values) with the scan direction marked; an update moves the tuple with key K3 to another position in the sequence.]
Scan is "at" key K3 after the delete, even if the record moves.
Can create Halloween problem (give everybody a 10% raise)
But scan enumerates entire set.
SCAN Data structure
enum SCAN_STATE { TOP, ON, BOTTOM, BETWEEN, NIL };    /* the 5 scan states                        */
enum ISOLATION { UNCOMMITTED_READ,..., SERIALIZABLE, READ_PAST, BOUNCE };
typedef struct
{ Uint             scanid;         /* handle for scan; returned by open_scan   */
  TRID             owner;          /* which transaction uses the scan          */
  FILE *           fileid;         /* handle of file the scan is defined on    */
  char *           scan_key;       /* specification of scan key attribute(s)   */
  char *           start_key;      /* lower bound of scan range                */
  char *           stop_key;       /* upper bound of scan range                */
  char *           filter;         /* qualifying predicate for all tuples in scan */
  enum ISOLATION   isol_degree;    /* locking policy for tuples accessed       */
  enum SCAN_STATE  scan_state;     /* state of scan pointer                    */
  char             current_key[ ]; /* scan key the scan is before, at, or after */
} SCANCB, * SCANCBP;
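The state changes for an ascending scan can be sketched as a tiny state machine over an in-memory array of sorted keys. This is only an illustration: the `Scan` type and the `scan_*` functions are hypothetical, the state names follow the `SCAN_STATE` enum above (TOP = before the first element, ON = at an element, BOTTOM = after the last, NIL = closed), and inserts, deletes (the BETWEEN state), locking, and descending scans are not modeled.

```c
#include <assert.h>
#include <stddef.h>

typedef enum SCAN_STATE { TOP, ON, BOTTOM, BETWEEN, NIL } SCAN_STATE;

typedef struct {
    const int *keys;       /* tuples in the scan, represented by their key values */
    size_t     n;
    size_t     pos;        /* meaningful only while state == ON */
    SCAN_STATE state;
} Scan;

/* On open, an ascending scan is before the first set element. */
static void scan_open(Scan *s, const int *keys, size_t n) {
    assert(s != NULL);
    s->keys = keys; s->n = n; s->pos = 0; s->state = TOP;
}

/* Fetch next: returns 1 and stores the key, 0 at end of set (scan is then
 * after the last element), or -1 on error, leaving the scan state unchanged. */
static int scan_next(Scan *s, int *key) {
    size_t next;
    switch (s->state) {
    case TOP:    next = 0; break;
    case ON:     next = s->pos + 1; break;
    case BOTTOM: return 0;
    default:     return -1;          /* NIL (closed) or BETWEEN (not modeled) */
    }
    if (next >= s->n) { s->state = BOTTOM; return 0; }
    s->pos = next;
    s->state = ON;
    *key = s->keys[next];
    return 1;
}

static void scan_close(Scan *s) { s->state = NIL; }
```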
Entry Sequenced File Insert
fix page descriptor page
find eof page
fix eof data page
if no space in page
< see next slide for transaction to advance page>
unfix descriptor page
add record to page (updating on-page directory)
generate log record (new value) and update page LSN.
compute lock name of record (based on TID).
get lock on record
unfix data page.
To make this work, MUST be assured lock is available.
Otherwise page semaphore can (undetected) deadlock with lock wait.
So, UNDO of entry-sequence insert does not free the space,
it just invalidates the record.

Entry Sequenced File Insert
If EOF page or File is Full
Begin new top level transaction to extend file EOF page
(leaves insert transaction; will not abort if insert aborts).
unfix directory page
if file full, panic()
start a top-level transaction
fix the directory
advance the page eof updating directory and freespace
log the changes
fix the data page
format it
log the change
unfix the directory and data page
commit the transaction & resume insert transaction
fix directory, fix eof, check to see that there is room for the record.
Entry Sequenced Operations
Read by RBA.
  get record lock (node, file, RBA) shared
  if {timeout, deadlock, error} return error;
  Fix page
  if record valid copy to buffer
  Unfix page
  Return record or null
Delete by RBA.
  get record lock (node, file, RBA) exclusive
  if {timeout, deadlock, error} return error;
  Fix page
  Mark record invalid
  Generate log record
  Update page lsn
  Unfix page.
Note: both must test that RBA <= EOF. Update, ReadNext, ... are similar.
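Leaving out locking, fixing, and logging, the data-structure side of these operations can be sketched on an in-memory file. Everything here is an assumption made for the example: the `EsFile` type, the 1KB capacity, and a record format of one valid byte, one length byte, and the payload. Insert appends at EOF and returns the RBA; delete only marks the record invalid, so space is never reclaimed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FILE_CAP 1024u

typedef struct {
    uint8_t  data[FILE_CAP];
    uint32_t eof;                      /* RBA of the first unused byte */
} EsFile;

/* Append a record at EOF; returns its RBA, or -1 if the file is full. */
static long es_insert(EsFile *f, const void *rec, uint8_t len) {
    assert(f != NULL && rec != NULL);
    if (f->eof + 2u + len > FILE_CAP) return -1;
    uint32_t rba = f->eof;
    f->data[rba] = 1;                  /* valid flag */
    f->data[rba + 1] = len;
    memcpy(f->data + rba + 2, rec, len);
    f->eof += 2u + len;
    return (long)rba;
}

/* Read by RBA: returns payload length, or -1 if RBA is past EOF or the record is invalid. */
static int es_read(const EsFile *f, uint32_t rba, void *out, size_t cap) {
    if (rba >= f->eof || !f->data[rba]) return -1;
    uint8_t len = f->data[rba + 1];
    if (len > cap) return -1;
    memcpy(out, f->data + rba + 2, len);
    return len;
}

/* Delete by RBA: mark the record invalid; EOF does not move. */
static int es_delete(EsFile *f, uint32_t rba) {
    if (rba >= f->eof || !f->data[rba]) return -1;
    f->data[rba] = 0;
    return 0;
}
```

The RBA is assumed to come from an earlier insert; an arbitrary byte address inside the file would not point at a record header.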

Relative Files
Records fit in fixed-length slots
Operation on slots.
Separate transactions extend the file EOF
(allocate and format pages)

[Figure: a page of a relative file: page header, fixed-length slots (two of them empty), and a page directory holding the record lengths (... 10 88 18 0 62 82 100 75).]
Relative Files
{Read | Insert | Update | Delete} by key are all easy
Insert "near" key works by:
Plan A:
look at page
Look at neighbor pages (left, right, left, right,...)
Plan B:
allocate overflow page for base page
Plan C:
Look in free-space bit-map or byte (%full) map.

Key Sequenced or Hashed Files
Key sequenced is subject of next chapter.

File Clustering
Different record types kept in same page/file
For example:
Master and detail records of an invoice.
Detail records always accessed if master is.

Situation:
Master key: InvoiceNo
Detail key: InvoiceNo (Foreign Key References Master) + SequenceNo
Technique:
Hash or Key Sequence Master on InvoiceNo
Hash or Key Sequence Detail on InvoiceNo+SequenceNo in same table.
Clustering different record types in a page
[Figure: one page holding master record 10 followed by its detail records 10 0 ... 10 4, master 20 with details 20 0 and 20 1, and master 33 with details 33 0 ... 33 2.]
One disc request gets the entire order.
Concept works for any storage hierarchy
Is natural for Hierarchical database systems.
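One way to get this physical order from a key-sequenced file is to give master and detail records a common composite key. The sketch below is an assumption for illustration, not the slides' design: `ClusterKey` and `MASTER_SEQ` are hypothetical, and the master record is given a sequence number of -1 so that it sorts immediately before its own details.

```c
#include <assert.h>
#include <stdlib.h>

#define MASTER_SEQ (-1)                 /* hypothetical marker for the master record */

typedef struct {
    int invoice_no;                     /* shared leading key field */
    int seq_no;                         /* MASTER_SEQ for master, 0,1,2,... for details */
} ClusterKey;

/* Order by InvoiceNo, then SequenceNo: each master precedes its details. */
static int cluster_cmp(const void *x, const void *y) {
    const ClusterKey *a = x, *b = y;
    if (a->invoice_no != b->invoice_no) return a->invoice_no < b->invoice_no ? -1 : 1;
    if (a->seq_no != b->seq_no) return a->seq_no < b->seq_no ? -1 : 1;
    return 0;
}

static void cluster_sort(ClusterKey *keys, size_t n) {
    assert(keys != NULL || n == 0);
    qsort(keys, n, sizeof *keys, cluster_cmp);
}
```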
Summary

Representing values
Representing records
Storing records in pages and across pages
Organizing records (entry, relative, key, hash)
Examples of fix/log/log logic.