Principles of Database
Management Systems
Pekka Kilpeläinen
(after Stanford CS245 slide originals by
Hector Garcia-Molina, Jeff Ullman and
Jennifer Widom)
DBMS 2001
Notes 3: Representing Data
1
Topic for today
• How to represent data on disk and in
memory
"Executive Summary":
• Principles are rather simple, but there
are lots of variations in the details
DBMS 2001
Notes 3: Representing Data
2
Principles of Data Layout
• Attributes of relational tuples (or objects)
represented by sequences of bytes called
fields
• Fields grouped together into records
– representation of tuples or objects
• Records stored in blocks
• File: collection of blocks that forms a relation
(or the extent of an object class)
DBMS 2001
Notes 3: Representing Data
3
Overview
Data Items
here
Records
Blocks
Files
Memory
DBMS 2001
Notes 3: Representing Data
4
What are the data items we want to store?
• a salary, a name
• a date, a picture
•…
What we have available: Bytes
8
bits
DBMS 2001
Notes 3: Representing Data
5
To represent:
• Integer (short): 2 bytes ( -32000…+32000)
e.g., 35 is
00000000
00100011
• Integer (long): 4 bytes ( -2x109…+ 2x109)
• Real, floating point (SQL FLOAT)
4 or 8 bytes
• arithmetic interpretation by hardware
DBMS 2001
Notes 3: Representing Data
6
To represent:
• Characters
various coding schemes suggested,
most popular is ASCII
Example (8 bit ASCII):
A:
a:
5:
LF:
DBMS 2001
01000001
01100001
00110101
00001010
Notes 3: Representing Data
7
To represent:
• Boolean
e.g., TRUE
FALSE
1111 1111
0000 0000
• Application specific
e.g., RED 1 GREEN 2
BLUE 3 YELLOW 4 …
Can we use less than 1 byte/code?
Yes, but only if desperate...
DBMS 2001
Notes 3: Representing Data
8
To represent:
• Dates, e.g.:
–
–
–
–
Integer: # days since Jan 1, 1900
8 chars: YYYYMMDD
7 chars: YYYYDDD
10 chars: YYYY-MM-DD
(SQL2)
(not YYMMDD! Why?)
• Time, e.g.
– Integer: seconds since midnight
– chars: HH:MM:SS[.FF…]
(SQL2)
DBMS 2001
Notes 3: Representing Data
9
To represent:
• String of characters
– Null terminated
e.g.,
c a
– Length given
e.g.,
3 c
t
a
t
- Fixed length
DBMS 2001
Notes 3: Representing Data
10
To represent:
• Bag of bits
Length
DBMS 2001
Bits
Notes 3: Representing Data
11
Key Point
• Fixed length items
• Variable length items
- usually length given at beginning
DBMS 2001
Notes 3: Representing Data
12
Also
• Type of an item: Tells us how to
interpret
(plus size if fixed)
– normally indicated in the record schema
(see in a moment)
DBMS 2001
Notes 3: Representing Data
13
Record - Collection of related fields
E.g.: Employee record:
name field,
salary field,
date-of-hire field, ...
DBMS 2001
Notes 3: Representing Data
14
Types of records:
• Main choices:
– FIXED vs VARIABLE FORMAT
– FIXED vs VARIABLE LENGTH
DBMS 2001
Notes 3: Representing Data
15
Fixed format
A SCHEMA (not record) contains
following information
- number of fields
- type of each field
- order in record
- meaning of each field
DBMS 2001
Notes 3: Representing Data
16
Example: fixed format and length
Employee record
(1) E#, 2 byte integer
(2) E.name, 10 char.
(3) Dept, 2 byte code
46 F o r d
02
83 J o n e s
01
DBMS 2001
Notes 3: Representing Data
Schema
Records
17
Variable format
• Record itself contains format;
“Self Describing”
DBMS 2001
Notes 3: Representing Data
18
Example: variable format and length
46 4 S 4
F o r d
Code for Ename
String type
Length of str.
# Fields
Code identifying
field as E#
Integer type
2 5 I
Field name codes could also be strings, i.e. tags
( XML as a data interchange format)
DBMS 2001
Notes 3: Representing Data
19
Variable format useful for:
• “sparse” records
– e.g, patient records with thousands of
possible tests
• repeating fields
• information integration from
heterogeneous sources
DBMS 2001
Notes 3: Representing Data
20
• EXAMPLE: var format record with
repeating fields
Employee one or more children
3
DBMS 2001
E_name: Fred
Child: Sally Child: Tom
Notes 3: Representing Data
21
Note: Repeating fields does not imply
- variable format, nor
- variable size
Fred
Sally
Tom
--
• Key is to allocate maximum number of
repeating fields (if not used NULL)
DBMS 2001
Notes 3: Representing Data
22
Many variants between
fixed and variable format:
Ex. #1: Include record type in record
5
27
....
record type
record length
tells me what
to expect
(i.e. points to schema)
DBMS 2001
Notes 3: Representing Data
23
Record header - data at beginning
that describes record
May contain:
- indication of record type
- record length
- time stamp
- other stuff ...
DBMS 2001
Notes 3: Representing Data
24
Ex #2 of variant between FIXED/VAR format
• Hybrid format
– one part is fixed, other variable
E.g.: All employees have E#, name, dept;
Other fields vary.
25 Smith Toy 2 Hobby:chess retired
# of var
fields
DBMS 2001
Notes 3: Representing Data
25
Also, many variations in internal
organization of record
Just to show one:
*
3 10
*
5
F1
length of field
*
12
F2
F3
total size
3 32 5 15 20
0
1
2
3
4
F1
5
F2
15
F3
20
32
offsets
DBMS 2001
Notes 3: Representing Data
26
Next: placing records into blocks
assume fixed
length blocks
blocks
...
a file
DBMS 2001
assume a single relation (for now)
Notes 3: Representing Data
27
Issues in storing records in blocks:
(1)
(2)
(3)
(4)
(5)
(6)
separating records
spanned vs. unspanned
mixed record types – clustering
split records
sequencing
addressing records
DBMS 2001
Notes 3: Representing Data
28
(1) Separating records
Block
R1
R2
R3
(a) fixed size recs. -> no need to separate
(b) special marker
(c) give record lengths (or offsets)
- within each record
- in block header (see later)
DBMS 2001
Notes 3: Representing Data
29
(2) Spanned vs. Unspanned
• Unspanned: records are within one block
block 1
R1
block 2
R2
R3
...
R4 R5
• Spanned: records span block boundaries
block 1
R1
DBMS 2001
R2
block 2
R3
(a)
R3
R4
(b)
Notes 3: Representing Data
R5
R6
R7
(a)
...
30
With spanned records:
R1
DBMS 2001
R3
(a)
R2
R3
R4
(b)
R5
R7
R6 (a)
need indication
need indication
of partial record
“pointer” to rest
of continuation
(+ from where?)
Notes 3: Representing Data
31
Spanned vs. unspanned:
• Unspanned is much simpler, but may
waste space…
• Spanned necessary if
record size > block size
(e.g., fields containing large "BLOB"s
for, say, MPEG video clips)
DBMS 2001
Notes 3: Representing Data
32
Example (of unspanned records)
106 records
each of size 2,050 bytes (fixed)
block size = 4096 bytes
block 1
block 2
R1
2050 bytes
R2
wasted 2046
2050 bytes wasted 2046
• Space used about 4 x 109 B,
about half wasted
DBMS 2001
Notes 3: Representing Data
33
(3) Mixed record types
• Mixed - records of different types
(e.g. DEPT, EMPLOYEE)
allowed in same block
e.g., a block:
Dep
DBMS 2001
d1
Emp
e1
Emp
e2
Notes 3: Representing Data
34
Why do we want to mix?
Answer: CLUSTERING
Records that are frequently
accessed together should be
in the same block
DBMS 2001
Notes 3: Representing Data
35
Example
Q1: select DEPT.Name, EMP.Name, …
from DEPT, EMP
where DEPT. Name =
EMP.DeptName
a block
DEPT,Name=Toy, ...
EMP,DeptName=Toy, ...
EMP,DeptName=Toy, ...
…
DBMS 2001
Notes 3: Representing Data
36
• If Q1 frequent, clustering is good
• But consider Q2:
SELECT *
FROM DEPT
If Q2 is frequent, clustering is counterproductive
DBMS 2001
Notes 3: Representing Data
37
(4) Split records
Fixed part in
one block
Typically for
hybrid format
Variable part in
another block
DBMS 2001
Notes 3: Representing Data
38
Block with fixed recs.
R1 (a)
R2 (a)
Blocks with variable recs.
R1 (b)
R2 (b)
R2 (c)
DBMS 2001
Notes 3: Representing Data
39
(5) Sequencing
• Ordering records in file (and block) by
some key value
Sequential file ( sequenced)
DBMS 2001
Notes 3: Representing Data
40
Why sequencing?
Typically to make it possible to efficiently
read records in order
(e.g., to do a merge-join — discussed later)
DBMS 2001
Notes 3: Representing Data
41
Sequencing Options
(a) Next record physically contiguous
R1
Next (R1)
...
(b) Linked
R1
DBMS 2001
Next (R1)
Notes 3: Representing Data
42
Sequencing Options
(c)
Overflow area
Records
in sequence
DBMS 2001
header
R1
R2
R3
R4
R5
Notes 3: Representing Data
R2.1
R1.3
R4.7
43
(6) Addressing records
• How does one refer to records?
Rx
Many options:
Physical
DBMS 2001
Indirect
Notes 3: Representing Data
44
Purely Physical
Device ID
Cylinder #
Block ID
Track #
Block #
Offset in block
E.g., Record
Address
or ID
=
DBMS 2001
Notes 3: Representing Data
45
Fully Indirect
E.g., Record ID is arbitrary byte string
(say, an object ID)
map table
rec ID
r
Rec ID Physical
addr.
address
a
DBMS 2001
Notes 3: Representing Data
46
Tradeoff
Flexibility
Cost
to move records
of indirection
(for deletions, insertions)
DBMS 2001
Notes 3: Representing Data
47
Physical
Indirect
Many options
in between …
DBMS 2001
Notes 3: Representing Data
48
Ex #1 Indirection in block
Header
A block:
Free space
R4
R3
R2
DBMS 2001
R1
Notes 3: Representing Data
49
Block header - data at beginning that
describes block
May contain:
- File ID (or RELATION or DB ID); the ID of this block
- Record directory; Pointer to free space
- Type of block (e.g. contains records of type 4;
is overflow, …)
- Pointer to other blocks “like it”
(say, if part of an index structure)
- Timestamp ...
DBMS 2001
Notes 3: Representing Data
50
Issues in storing records in blocks
(1)
(2)
(3)
(4)
(5)
(6)
Separating records
Spanned vs. Unspanned
Mixed record types - Clustering
Split records
Sequencing
Addressing records
DBMS 2001
here
Notes 3: Representing Data
51
Other Topics
(1) Insertion/Deletion
(2) Pointer Swizzling
(3) Comparison of Schemes
DBMS 2001
Notes 3: Representing Data
52
(1) Deletion
Block
Rx
DBMS 2001
Notes 3: Representing Data
53
Options:
(a) Immediately reclaim space
(b) Mark deleted
– May need chain of deleted records
(for re-use)
• If there are pointers to the record,
either need to set them to NULL or
indicate the record space reclaimed
DBMS 2001
Notes 3: Representing Data
54
As usual, many tradeoffs...
• How expensive is to move valid record
to free space for immediate reclaim?
• How much space is wasted?
– e.g., deleted records, delete fields, free
space chains,...
DBMS 2001
Notes 3: Representing Data
55
Concern with deletions
Dangling pointers
R1
DBMS 2001
?
Notes 3: Representing Data
56
A solution: Tombstones
(a) Leave “MARK” in old location
• with physical
IDs:
A block
This space
never re-used
DBMS 2001
This space can
be re-used
Notes 3: Representing Data
57
A Solution : Tombstones
(b) Leave “MARK” in map table
• with logical
Ids:
Map table
ID
LOC
Never reuse
ID 7788 nor
space in map...
7788
DBMS 2001
Notes 3: Representing Data
58
(1) Insert
Easy case: records not in sequence
Insert new record at end of file,
or in any block with space for it
DBMS 2001
Notes 3: Representing Data
59
Insert
Hard case: records in sequence
If free space “close by”, not too bad…
- can "slide" records to
preceeding/suceeding blocks
- pointers to moved records require
forwarding addresses
Or use additional blocks as an
overflow area
DBMS 2001
Notes 3: Representing Data
60
Interesting problems:
• How much free space to leave in each
block, track, cylinder?
• How often to reorganize file + overflow?
DBMS 2001
Notes 3: Representing Data
61
(2) Pointer Swizzling
• Two kinds of record addresses:
• DB address
– physical location on secondary storage
• Memory address
– record location when loaded into (main or
virtual) memory
• Which one to use, and when?
DBMS 2001
Notes 3: Representing Data
62
Pointer Swizzling
Memory
Disk
block 1
block 1
block 2
block 2
DBMS 2001
Rec A
Rec A
Notes 3: Representing Data
63
One Option:
Translation
Table
DB Addr Mem Addr
Rec-A
Rec-A-inMem
• For all records copied to memory
DBMS 2001
Notes 3: Representing Data
64
Another Option:
In memory pointers - need “type” bit
to disk
M
DBMS 2001
to memory
Notes 3: Representing Data
65
Swizzling
• Automatic ("eager")
• On-demand ("lazy")
• No swizzling / program control
DBMS 2001
Notes 3: Representing Data
66
Comparison
• There are about 10,000,000 ways to
organize my data on disk…
Which one is right for me?
DBMS 2001
Notes 3: Representing Data
67
Issues:
Flexibility
Space Utilization
Complexity
Performance
DBMS 2001
Notes 3: Representing Data
68
To evaluate a given strategy,
estimate following parameters:
-> space used for expected data
-> expected time to
-
DBMS 2001
fetch record given its key
fetch record with next key
insert/delete/update record
scan all records in the file
reorganize file
Notes 3: Representing Data
69
Example
How would you design Megatron 3000
storage system? (for a relational DB, low end)
– Variable length records?
– Spanned?
– What data types?
– Fixed format?
– Record IDs?
– Sequencing?
– How to handle deletions?
DBMS 2001
Notes 3: Representing Data
70
Summary
• How to lay out data on disk
Data Items
Records
Blocks
Files
Memory
DBMS
DBMS 2001
Notes 3: Representing Data
71
Next
How to find a record quickly,
given a key
DBMS 2001
Notes 3: Representing Data
72