Data Storage and Disk Access
Memory hierarchy
 Hard disks

 Architecture
 Processing requests
 Writing to disk

Hard disk reliability and efficiency
 RAID



Solid State Drives
Buffer management
Data storage
[Figure: memory hierarchy – the DBMS uses cache, main memory (virtual memory), disk (file system), and tertiary storage]

Primary memory: volatile
 Main memory
 Cache

Secondary memory: non-volatile
 Solid State Drive (SSD)
 Magnetic Disk (Hard Disk Drive, HDD)

Tertiary memory: non-volatile
 CD/DVD
 Tape - sequential access
 Usually used as backup or for long-term storage

(Moving from primary to tertiary storage, cost per byte decreases, but so does speed.)

Speed
 Main memory is much faster than secondary memory
▪ 10 – 100 nanoseconds (0.00001 to 0.0001 milliseconds) to move data in main memory
▪ 10 milliseconds to read a block from an HDD
▪ 0.1 milliseconds to read a block from a SSD

Cost
 Main memory is around 100 times more expensive
than secondary memory
 SSDs are more expensive than HDDs

System limitations
 On a 64-bit system at most 2^64 bytes can be directly referenced
▪ In practice the installed main memory is far smaller, and many databases are larger than the memory available

Volatility
 Data must be maintained between program
executions which requires non-volatile memory
▪ Nonvolatile storage retains its contents when the device is
turned off, or if there is a power failure
 Main memory is volatile, secondary storage is not

Database data is usually stored on disks
 A database will often be too large to be retained in main
memory
 When a query is processed data will need to be retrieved
from storage

Data is stored in disk blocks
 Also referred to simply as blocks or, in OS terms, pages
 A contiguous sequence of bytes, and
 The unit in which data is written to, and read from, disk
 Block size is typically between 4 and 16 kilobytes

A hard disk consists of a number of platters
 A platter can store data on one or both of its surfaces and is therefore referred to as
▪ Single-sided or double-sided

Surfaces are composed of concentric rings called tracks
 The set of all tracks with the same diameter is called a cylinder

Sectors are arcs of a track
 And are typically 4 kilobytes in size
 Block size is set when the disk is initialized, usually a small
multiple of the sector size (hence 4 to 16 kilobytes)
[Figure: platters, surfaces, tracks, and cylinders; platter and surface counts taken from a Western Digital Caviar Black 1 TB hard drive]


Data is transferred to or from a surface by a disk
head
There is one disk head for each surface
 These disk heads are moved as a unit (called a disk
head array)
▪ Therefore all the heads are in identical positions with respect
to their surfaces
 To read or write a block a disk head must be
positioned over it

Only one disk head can read or write at a time
 The disk constantly spins – at around 7,200 rpm

[Figure: the disk head array moves in and out over the platters to position the heads over a track]

Disk drives are controlled by a processor
called a disk controller which
 Controls the actuator that moves the head
assembly
 Selects sectors and determines when the disk has
rotated to a sector
 Transfers data between the disk and main
memory
 Some controllers buffer data from tracks in the
expectation that the data will be required
To read or write a block: the disk constantly spins (7,200 rpm for the Western Digital Caviar Black 1 TB drive again), the head pivots over the desired track, and the desired block is read as it passes underneath the head.

The disk head is moved in or out to the track
 This seek time is typically ≈ 10 milliseconds
▪ WD Caviar Black 1 TB: 8.9 ms

Wait until the block rotates under the disk head
 This rotational delay is typically ≈ 4 milliseconds
▪ WD Caviar Black 1 TB: 4.2 ms

The data on the block is transferred to memory
 This transfer time is the time it takes for the block to
completely rotate past the disk head
▪ Typically less than 1 millisecond

The seek time and rotational delay depend on
 Where the disk head is before the request,
 Which track is being requested, and
 How far the disk has to rotate

The transfer time depends on the request size
 The transfer time (in ms) for one block equals
▪ (60,000 / disk rpm) / blocks per track
 The transfer time (in ms) for an entire track equals
▪ (60,000 / disk rpm)
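
These formulas translate directly into a short calculation. A minimal sketch, assuming an illustrative 7,200 rpm drive with 256 blocks per track (both figures are assumptions chosen for the example, not taken from the notes):

```python
# Transfer-time formulas from above; the rpm and blocks-per-track values
# used in the example are illustrative assumptions.
def transfer_time_block_ms(rpm: float, blocks_per_track: int) -> float:
    """Time (in ms) for one block to rotate past the disk head."""
    return (60_000 / rpm) / blocks_per_track

def transfer_time_track_ms(rpm: float) -> float:
    """Time (in ms) for a full rotation, i.e. an entire track."""
    return 60_000 / rpm

print(transfer_time_block_ms(7_200, 256))  # ~0.033 ms per block
print(transfer_time_track_ms(7_200))       # ~8.3 ms per track
```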

Typical access time for a block on a hard disk
 ≈ 15 milliseconds

Typical access time for a main memory frame
 ≈ 60 nanoseconds

What’s the difference?
 1 millisecond = 1,000,000 nanoseconds
 60 ns = 0.000060 ms

Accessing a hard drive is around 250,000
times slower than accessing main memory

Disk latency (access time) has three components
 seek time + rotational delay + transfer time
 The overall access time can be shortened by reducing, or even eliminating, the seek time and rotational delay

Related data should be stored in close proximity
 Accessing two records in adjacent blocks on a track
▪ Seek the desired track, rotate to first block, and transfer two blocks
= 10 + 4 + 2*1 = 16ms
 Accessing two records on different tracks
▪ Seek the desired track, rotate to the block, and transfer the block,
then repeat = (10 + 4 + 1)*2 = 30ms

What does it mean to say that related data should be
stored close to each other?
 The term close refers not to physical proximity but to how the
access time is affected

In order of closeness:
 Same block
 Adjacent blocks on the same track
 Same track
 Same cylinder, but different surfaces
 Adjacent cylinders
 …


Is 2, or 3, "closer" to 1?
 2 is in the adjacent track, and is clearly physically closer, but the disk head must be moved to access it
 3 is in the same cylinder (on a different surface), so the disk head does not have to be moved
 Which is why 3 is closer to 1

A fair algorithm would take a first-come, first-served approach
 Insert requests in a queue and process them in the order in which they are received

The requests, numbered in order of arrival, are for cylinders 2,000 (1st), 6,000 (2nd), 14,000 (3rd), 4,000 (4th), 16,000 (5th) and 10,000 (6th).

Cylinder    Received (ms)   Complete (ms)   Cylinders moved   Total moved
 2,000            0               5              2,000            2,000
 6,000            0              14              4,000            6,000
14,000            0              27              8,000           14,000
 4,000           10              43             10,000           24,000
16,000           20              60             12,000           36,000
10,000           30              72              6,000           42,000

The elevator algorithm usually performs
better than FIFO
 Requests are buffered and the disk head moves in one direction, processing requests as it reaches their cylinders
 The arm then reverses direction

The same six requests as before are made.

Cylinder    Received (ms)   Complete (ms)   Cylinders moved   Total moved
 2,000            0               5              2,000            2,000
 6,000            0              14              4,000            6,000
14,000            0              27              8,000           14,000
16,000           20              35              2,000           16,000
10,000           30              46              6,000           22,000
 4,000           10              58              6,000           28,000

The elevator algorithm gives much better
performance than FIFO on average
 And is a relatively fair algorithm

The elevator algorithm is not optimal
 The shortest-seek first algorithm is closer to optimal
but can result in a high variance in response time
▪ And may even result in starvation for distant requests
 In some cases the elevator algorithm can perform
worse than FIFO
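
The sweep-and-reverse idea can be sketched in a few lines. This is a simplified illustration, not any particular controller's scheduler: request arrival times are ignored, so all six cylinders from the earlier example are treated as pending when the sweep starts (which is why the order differs from the table, where requests 4 and 6 arrive after the head has already passed their cylinders).

```python
# Simplified elevator (SCAN) sketch: sweep outward serving pending requests in
# cylinder order, then reverse.  Arrival times are deliberately ignored here.
def elevator_order(start_cylinder, pending_cylinders):
    """Return the order in which pending cylinder requests would be served."""
    up = sorted(c for c in pending_cylinders if c >= start_cylinder)
    down = sorted((c for c in pending_cylinders if c < start_cylinder), reverse=True)
    return up + down   # outward sweep, then the reverse sweep

print(elevator_order(0, [2_000, 6_000, 14_000, 4_000, 16_000, 10_000]))
# [2000, 4000, 6000, 10000, 14000, 16000]
```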

To modify an existing record (on a disk) the
following steps must be taken
 Read the record
 Modify the record in main memory
 Write the modified record back to disk

It is important to remember that the smallest
unit of transfer to / from a disk is a block
 A single disk block usually contains many records
[Figure: the read-modify-write cycle
 Read one block into main memory – among other records it contains "Landis#winner#Phonak#..."
 Modify the desired record in main memory – it becomes "Landis#disq.#none#..."
 Write the modified block back to disk]
Consider creating a new record
 The user enters the data for the record
▪ Through some application interface
 The record is created in main memory
 And then written to disk

Does this process require a read-modify-write cycle?
 YES!
 Because the block must first be read into main memory; otherwise the existing records in the disk block would be overwritten

Intermittent failure
 Multiple attempts are required to read or write a sector

Media decay
 A bit or a number of bits are permanently corrupted and it
is impossible to read a sector

Write failure
 A sector cannot be written to or retrieved
▪ Often caused by a power failure during a write

Disk crash
 The entire disk becomes unreadable

An intermittent failure may result in incorrect
data being read by the disk controller
 Such incorrect data can be detected by a checksum

Each sector contains additional bits whose
values are based on the data bits in the sector
 A simple single-bit checksum is to maintain an even
parity on the sector
▪ If there is an odd number of 1s the parity is odd
▪ If there is an even number of 1s the parity is even

Assume that there are seven data bits and a single
checksum bit
 Data bits 0111011 – parity is odd
▪ Checksum bit is set to 1 so that the overall parity is even
Using a single checksum bit allows errors of only one
bit to be detected reliably
 Several checksum bits can be maintained to reduce
the chance of failing to notice an error

 e.g. maintain 8 checksum bits, one for each bit position in
the data bytes
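
The per-position parity scheme can be computed with a single XOR pass over the sector. A minimal sketch (the two-byte sector is just an illustrative value):

```python
# One parity bit per bit position: XOR-ing all the data bytes together yields
# a byte whose i-th bit is the even-parity check bit for bit position i.
def positional_parity(sector: bytes) -> int:
    parity = 0
    for byte in sector:
        parity ^= byte
    return parity

sector = bytes([0b0111011, 0b1010101])
check = positional_parity(sector)

# Flipping any single bit in the sector changes the corresponding parity bit,
# so the error is detected:
corrupted = bytes([0b0111011 ^ 0b0000100, 0b1010101])
assert positional_parity(corrupted) != check
```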
Checksums can detect errors but can't correct them
 Stable storage can be implemented on a disk to allow
errors to be corrected

 Sectors are paired, with each pair representing a single logical sector
 The two copies in a pair are usually referred to as Left (L) and Right (R)
▪ Errors in either copy are detected using checksums

Stable storage can cope with media failures and write
failures

For writing, write the value of some sector X
into XL
 Check that the value is correct (using checksums)
 If the value is not correct after a given number of
attempts then assume that the sector has failed
▪ A spare sector should be substituted for XL
 Repeat the process for XR

For reading, XL and XR are read from in turn
until a correct value is returned
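
The write and read protocols above can be sketched as follows. The helpers write_sector, read_sector, checksum_ok and substitute_spare_sector are hypothetical stand-ins for the disk controller, and the retry limit is an assumption:

```python
MAX_ATTEMPTS = 3  # assumed retry limit before declaring a copy failed

def stable_write(value, sector_x):
    """Write to the left copy, verify it, then repeat for the right copy."""
    for copy in ("L", "R"):
        for _ in range(MAX_ATTEMPTS):
            write_sector(sector_x, copy, value)
            if checksum_ok(read_sector(sector_x, copy)):
                break
        else:
            substitute_spare_sector(sector_x, copy)  # the copy is assumed to have failed
            write_sector(sector_x, copy, value)

def stable_read(sector_x):
    """Read XL then XR until a copy passes its checksum."""
    for copy in ("L", "R"):
        data = read_sector(sector_x, copy)
        if checksum_ok(data):
            return data
    raise IOError("both copies of the sector are bad")
```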

Hard disks act as bottlenecks for processing
 DB data is stored on disks, and must be fetched
into main memory to be processed, and
 Disk access is considerably slower than main
memory processing

There are also reliability issues with disks
 Disks contain mechanical components that are
more prone to failure than electronic components

One solution is to use multiple disks

Multiple disks
 Each disk contains multiple platters
 Disks can be read in parallel, and
 Different disks can read from different cylinders
▪ e.g. the first disk can access data from cylinder 6,000 while the second disk is accessing data from cylinder 11,000

Single disk
 Multiple platters, but the disk heads are always over the same cylinder
Using multiple disks to store data improves efficiency
as the disks can be read in parallel
 To satisfy a request the physical disks and disk blocks
that the data resides on must be identified

 The data may be on a single disk, or it may be split over
multiple disks

The way in which data is distributed over the disks
affects the cost of accessing it
 In the same way that related data should be stored close
to each other on a single disk

A disk array gives the user the abstraction of a single,
large, disk
 When an I/O request is issued the physical disk blocks to
be retrieved have to be identified
 How the data is distributed over the disks in the array
affects how many disks are involved in an I/O request

Data is divided into partitions called striping units
 The striping unit is usually either a block or a bit
 Striping units are distributed over the disks using a round
robin algorithm
Notional file – the data is divided into striping units of a given size, numbered 1, 2, 3, …, 24, …

The striping units are distributed across a RAID system in a round-robin fashion:
 disk 1: 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, 65, …
 disk 2: 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, …
 disk 3: 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, …
 disk 4: 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, …
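
The round-robin placement is a simple modular calculation. A minimal sketch (units and disks are numbered from 1, matching the figure):

```python
# Striping unit i goes to disk ((i - 1) % D) + 1, at position (i - 1) // D on
# that disk, when units are distributed round robin over D disks.
def place_unit(i: int, num_disks: int) -> tuple[int, int]:
    """Return (disk number, position on that disk) for striping unit i."""
    return ((i - 1) % num_disks) + 1, (i - 1) // num_disks

assert place_unit(1, 4) == (1, 0)    # unit 1 is the first unit on disk 1
assert place_unit(13, 4) == (1, 3)   # unit 13 is the fourth unit on disk 1
assert place_unit(6, 4) == (2, 1)    # unit 6 is the second unit on disk 2
```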
The size of the striping unit has an impact on the behaviour of the system.
Assume that a file is to be distributed across a four-disk RAID system using block striping and that, purely for the sake of illustration, the block size is just one byte!

Notional file – the numbers represent a sequence of individual bits in the file: 1, 2, 3, …, 96, …

Distributing these bits across a 4-disk RAID system using BLOCK striping:
 Disk 1: bits 1–8 (block 1), 33–40, 65–72, …
 Disk 2: bits 9–16 (block 2), 41–48, 73–80, …
 Disk 3: bits 17–24 (block 3), 49–56, 81–88, …
 Disk 4: bits 25–32, 57–64, 89–96, …
Here is the same file distributed across a four-disk RAID system, this time using bit striping, and again remember that, purely for the sake of illustration, the block size is just one byte!

Distributing the bits across a 4-disk RAID system using BIT striping:
 Disk 1: bits 1, 5, 9, 13, 17, 21, …, 93, …
 Disk 2: bits 2, 6, 10, 14, 18, 22, …, 94, …
 Disk 3: bits 3, 7, 11, 15, 19, 23, …, 95, …
 Disk 4: bits 4, 8, 12, 16, 20, 24, …, 96, …

With bit striping each one-byte block (block 1, block 2, …) is spread across all four disks.

Assume that a disk array consists of D disks
 Data is distributed across the disks using data striping

How does it perform compared to a single disk?
 To answer this question we must specify the kinds of
requests that will be made
▪ Random read – reading multiple, unrelated records
▪ Random write
▪ Sequential read – reading a number of records (such as one
file or table), stored on more than D blocks
▪ Sequential write


Use all D disks to improve efficiency, and distribute data
using block striping
Random read performance
 Very good – up to D different records can be read at once
▪ Depending on which disks the records reside on


Random write performance – same as read performance
Sequential read performance
 Very good – as related data are distributed over all D disks
performance is D times faster than a single disk


Sequential write performance – same as read performance
But what about reliability …

Hard disks contain mechanical components and are less
reliable than other, purely electronic, components
 Increasing the number of hard disks decreases reliability,
reducing the mean-time-to-failure (MTTF)
▪ The MTTF of a hard disk is ≈ 50,000 hours, or 5.7 years

In a disk array the overall MTTF decreases
 Because the number of disks is greater
 MTTF of a 100 disk array is 21 days – (50,000/100) / 24
▪ This assumes that failures occur independently and
▪ The failure probability does not change over time

Reliability is improved by storing redundant data


Reliability of a disk array can be improved by storing
redundant data
If a disk fails the redundant data can be used to reconstruct
the data lost on the failed disk
 The data can either be stored on a separate check disk or
 Distributed uniformly over all the disks

Redundant data is typically stored using one of two methods
 Mirroring, where each disk is duplicated
 A parity scheme, where sufficient redundant data is maintained
to recreate the data in any one disk
 Other redundancy schemes provide greater reliability
For each bit on the data disks there is a parity bit on a check disk

 If the sum of the data disks' bits is even the parity bit is set to zero
 If the sum of the bits is odd the parity bit is set to one
The data on any one failed disk can be recreated bit by bit

[Figure: a 4-data-disk system showing individual bit values, with a 5th check disk containing the parity of the corresponding bits on the four data disks]

Reading
 The parity scheme does not affect reading

Writing
 A naïve approach would be to calculate the new
value of the parity bit from all the data disks
 A better approach is to compare the old and new
values of the disk that is written to
▪ And change the value of a parity bit if the corresponding
bits have changed
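
The better approach amounts to XOR-ing the old parity with the change in the rewritten block, so none of the other data disks need to be read. A small sketch using 4-bit patterns as stand-ins for blocks:

```python
# new parity = old parity XOR (old block XOR new block): only the parity bits
# whose data bits changed are flipped.
def updated_parity(old_parity: int, old_block: int, new_block: int) -> int:
    return old_parity ^ (old_block ^ new_block)

data = [0b1101, 0b0110, 0b1011, 0b0001]           # four data disks (illustrative values)
parity = data[0] ^ data[1] ^ data[2] ^ data[3]    # the naive, read-everything calculation

new_block = 0b0010                                # rewrite the third data disk
fast = updated_parity(parity, data[2], new_block)
data[2] = new_block
assert fast == data[0] ^ data[1] ^ data[2] ^ data[3]   # matches the full recalculation
```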

A RAID system consists of several disks organized to
increase performance and improve reliability
 Performance is improved through data striping
 Reliability is improved through redundancy
 RAID stands for Redundant Arrays of Independent Disks

There are several RAID schemes or levels
 The levels differ in terms of their
▪ Read and write performance,
▪ Reliability, and
▪ Cost
All D disks are used to improve efficiency, and data is
distributed using block striping
 No redundant information is kept
 Read and write performance is very good
 But, reliability is poor

 Unless data is regularly backed up a RAID 0 system should
only be used when the data is not important

A RAID 0 system is the cheapest of all RAID levels
 As there are no disks used for storing redundant data
An identical copy is kept of each disk in the system, hence the
term mirroring
 Read performance is similar to a single disk

 No data striping, but parallel reads of the duplicate disks can be made
which improves random read performance

Write performance is worse than a single disk as the duplicate disk
has to be written to
 Writes to the original and mirror should not be performed
simultaneously in case there is a global system failure
 But write performance is superior to most other RAID levels

Very reliable but costly
 With D data disks, a level 1 RAID system has 2D disks


Sometimes referred to as RAID level 10, combines both
striping and mirroring
Very good read performance
 Similar to RAID level 0
 2D times the speed of a single disk for sequential reads
 Up to 2D times the speed of a single disk for random reads
 Allows parallel reads of blocks that, conceptually, reside on the
same disk

Poor write performance
 Similar to RAID level 1

Very reliable but the most expensive RAID level

Writing data is the Achilles
heel of RAID systems
 Data and check disks should not
be written to simultaneously
 Parity information may have to
be read before check disks can
be written to

In many RAID systems
writing is less efficient than
with a single disk!

Sequential writes, or random writes in a RAID system using
bit striping:
 Write to all D data disks, using a read-modify-write cycle
 Calculate the parity information from the written data
 Write to the check disk(s)
▪ A read-modify-write cycle is not required

Random writes in a system using block striping:
 Write to the data disk using a read-modify-write cycle
 Read the check disk(s), and calculate the new parity data
 Write to the check disk(s)
A RAID system with D disks can read data up to D times
faster than a single disk system
 For sequential reads there is no performance difference
between bit striping and block striping
 Block striping is more efficient for random reads

 With bit striping all D disks have to be read to recreate a single
record (and block) of the data file
 With block striping, a complete record is stored on one disk, so
only one disk is required to satisfy a single random read

Write performance is similar except that it is also affected by
the parity scheme

Level 2 does not use the standard parity scheme
 Uses a scheme that allows the failed disk to be identified
increasing the number of disks required
 However the failed disk can be detected by the disk
controller so this is unnecessary
 Can tolerate the loss of a single disk

Level 3 is Bit Interleaved Parity
 The striping unit is a single bit
 Random read and write performance is poor as all disks
have to be accessed for each request
 Can tolerate the loss of a single disk
Uses block striping to distribute data over disks
Uses one redundant disk containing parity data
 The ith block on the redundant disk contains
parity checks for the ith blocks of all data disks
 Good sequential read performance
 D times single disk speed
 Very good random read performance
 Disks can be read independently, up to D times
single disk speed


When data is written the affected block and the
redundant disk must both be written to
 To calculate the new value of the redundant disk

 Read the old value of the changed block
 Read the corresponding redundant disk block
 Write the new data block
 Recalculate the block of the redundant disk

To recalculate the redundant data consider the
changes in the bit pattern of the written data block

Cost is moderate
 Only one check disk is required


The system can tolerate the loss of one drive
Write performance is poor for random writes
 Where different data disks are written independently
 For each such write a write to the redundant disk is also
required

Performance can be improved by distributing
the redundant data across all disks – RAID level 5


The dedicated check disk in RAID level 4 tends to act as
a bottleneck for random writes
RAID level 5 does not have a dedicated check disk but
distributes the parity data across all disks
 This removes the bottleneck thus increasing the performance of
random writes
 Sequential write performance is similar to level 4


Cost is moderate, with the same effective space
utilization as level 4
The system can tolerate the loss of one drive

RAID levels 4 and 5 can only cope with single
disk crashes
 Therefore if multiple disks crash at the same time (or
before a failed disk can be replaced) data will be lost

RAID level 6 allows systems to deal with multiple
disk crashes
 These systems use more sophisticated error
correcting codes
 One of the simpler error correcting codes is the
Hamming Code

Consider a system with seven disks which can be
identified with numbers from 1 to 7
 Four of the disks are data disks, disks 1 to 4
 Three of the disks are redundant disks, disks 5 to 7

Each of the three check disks contain parity data
for three of the four data disks
 Disk 5 contains parity data for disks 1, 2 and 3
 Disk 6 contains parity data for disks 1, 2 and 4
 Disk 7 contains parity data for disks 1, 3 and 4
Disk             1   2   3   4   |   5       6       7
                 1   1   1   0   |   1       0       0
                 1   1   0   1   |   0       1       0
                 1   0   1   1   |   0       0       1
                 ----- Data -----|   1,2,3   1,2,4   1,3,4
                                     (Redundant Data: parity of the listed data disks)

Reads are performed as normal
 Only the data disks are used

Writes are performed in a similar way to RAID
level 4
 Except that multiple redundant disks may be
involved
 Cost is high, as more check disks are required


If one disk fails use the parity data to restore the
failed disk like level 4
If two disks fail then both disks can be rebuilt
using three of the other disks, e.g.
 If disks 1 and 2 fail
▪ Rebuild disk 1 using disks 3, 4 and 7
▪ Rebuild disk 2 using disks 1, 3 and 5
 If disks 3 and 5 fail
▪ Rebuild disk 3 using disks 1, 4 and 7
▪ Rebuild disk 5 using disks 1, 2 and 3
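
The rebuild steps can be checked with the bit values from the table above. A small sketch using the same seven-disk grouping (disk 5 covers 1, 2 and 3; disk 6 covers 1, 2 and 4; disk 7 covers 1, 3 and 4):

```python
def xor_columns(*disks):
    """Bitwise parity of corresponding positions on several disks."""
    return [sum(bits) % 2 for bits in zip(*disks)]

disks = {                      # bit columns taken from the table above
    1: [1, 1, 1], 2: [1, 1, 0], 3: [1, 0, 1], 4: [0, 1, 1],   # data disks
    5: [1, 0, 0], 6: [0, 1, 0], 7: [0, 0, 1],                 # check disks
}

# Suppose disks 1 and 2 fail:
rebuilt_1 = xor_columns(disks[3], disks[4], disks[7])   # group {1,3,4}: disk 1 = 3 xor 4 xor 7
rebuilt_2 = xor_columns(rebuilt_1, disks[3], disks[5])  # group {1,2,3}: disk 2 = 1 xor 3 xor 5
assert rebuilt_1 == disks[1] and rebuilt_2 == disks[2]
```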

In real-life RAID systems the disk array is partitioned
into reliability groups
 A reliability group consists of a set of data disks and a set
of check disks
 The number of check disks depends on the reliability level
that is selected

Consider a RAID system with 100 data disks and 10
check disks
 i.e. 10 reliability groups
 The MTTF is increased from 21 days to 250 years!







Level 0 improves performance at the lowest cost but
does not improve reliability
Level 1+0 is better than level 1 and has the best write
performance
Levels 2 and 4 are always inferior to 3 and 5
Level 3 is good for large transfer requests of several
contiguous blocks but bad for many small requests of a
single disk block
Level 5 is a good general-purpose solution
Level 6 is appropriate if higher reliability is required
In practice the choice is usually between 0, 1, and 5


The table that follows compares RAID levels using RAID level
0 as a baseline
Comparisons of RAID systems vary depending on the metric
used and how those metrics are measured
 The three primary metrics are reliability, performance and cost
 These can be measured in I/Os per second, bytes per second,
response time and so on

The comparison uses throughput per dollar for systems of
equivalent file capacity
 File capacity is the amount of information that can be stored on
the system, which excludes redundant data
          Random    Random           Sequential   Sequential   Storage
          Read      Write            Read         Write        Efficiency
RAID 0    1         1                1            1            1
RAID 1    1         ½                1            ½            ½
RAID 3    1/G       1/G              (G-1)/G      (G-1)/G      (G-1)/G
RAID 4    (G-1)/G   max(1/G, ¼)      (G-1)/G      (G-1)/G      (G-1)/G
RAID 5    1         max(1/G, ¼)      1            (G-1)/G      (G-1)/G
RAID 6    1         max(1/G, 1/6)    1            (G-2)/G      (G-2)/G


G refers to the number of disks in a reliability group (both
data disks and check disks)
RAID levels 10 and 2 not shown

Solid State Drives (SSDs) use NAND flash memory and do not
contain moving parts like an HDD
 Accessing an SSD does not require seek time or rotational latency and they
are therefore considerably faster
 Flash memory is non-volatile memory that is used by smart-phones, mp3
players and thumb (or USB) drives

NAND flash gets its name from the NAND (negated AND) logic
gate, which its cell architecture resembles
 NAND flash architecture is only able to read and write data one page at a time

There are two types of SSD
 Multi-level cell (MLC)
 Single-level cell (SLC)

MLC cells can store multiple different charge
levels
 And therefore more than one bit
▪ With four charge levels a cell can store 2 bits
 Multiple threshold voltages makes reading more
complex but allows more data to be stored per
cell

MLC SSDs are cheaper than SLC SSDs
 However write performance is worse
 And their lifetimes are shorter

SLC cells can only store a single charge level
 They are therefore on or off, and can contain only
one bit

SLC drives are less complex
 They are more reliable and have a lower error rate
 They are faster since it is easier to read or write a
single charge value

SLC drives are more expensive
 And typically used for enterprise rather than
home use


Reads are much faster than HDDs since there
are no moving parts
Writes are also faster than HDDs
 However flash memory must be erased before it is
written, and entire blocks must be erased
▪ Referred to as write amplification

The performance increase is greatest for
random reads
[Figure: DBMS architecture – query evaluation, files and access methods, buffer manager, and disk space manager layered above the database on disk, alongside the transaction and lock manager and the recovery manager]
When an SQL command is evaluated a request may
be made for a DB record
 Such a request is passed to the buffer manager
 If the record is not stored in the (main memory)
buffer the page must be fetched from disk
 The disk space manager provides routines for
allocating, de-allocating, reading and writing pages


The disk space manager (DSM) keeps track of available disk
space
 The lowest level of the DBMS architecture

Supports the allocation and de-allocation of disk pages
 Pages are abstract units of storage, mapped to disk blocks
 Reading and writing to a page is performed in one disk I/O
Sequences of pages are allocated to a contiguous sequence
of blocks to increase access speed
 The DSM hides the underlying details of storage

 Allowing higher level processes to consider the data to be a
collection of pages

A DB increases and decreases in size over time
 In addition to mapping pages to blocks the DSM has to
record which disk blocks are in use
 As time goes on, gaps in sequences of allocated blocks
appear

Free blocks need to be recorded so that they can be
allocated in the future, using either
 A linked list, the head points to the first free block
 A bitmap, each bit corresponds to a single block
▪ Allows for fast identification, and therefore allocation, of
contiguous areas of free space
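
A free-space bitmap and contiguous allocation can be sketched briefly. This is an illustrative structure, not any particular DBMS's implementation:

```python
# Bit i is True when block i is in use; a run of False bits is a contiguous
# area of free space that can be allocated in one go.
class FreeSpaceBitmap:
    def __init__(self, num_blocks: int):
        self.used = [False] * num_blocks

    def allocate_contiguous(self, n: int) -> int:
        """Mark a run of n free blocks as used and return its first block."""
        run_start, run_len = 0, 0
        for i, in_use in enumerate(self.used):
            run_len = 0 if in_use else run_len + 1
            if run_len == 1:
                run_start = i
            if run_len == n:
                self.used[run_start:run_start + n] = [True] * n
                return run_start
        raise RuntimeError("no contiguous run of that size is free")

    def free(self, start: int, n: int) -> None:
        self.used[start:start + n] = [False] * n
```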

An OS can also be used to manage space on a disk
 Typically an OS abstracts a file as a sequence of bytes

While it is possible to build a DSM using the OS, many
DBMSs perform their own disk management
 This makes the DBMS more portable across platforms
 Using the OS may impose technical limitations such
as maximum file size
 In addition, OS files cannot typically be stored on
separate disks, which may be necessary in a DBMS

Attributes, or fields, must be organized within records
 Information that is common to all records of a particular
type is stored in the system catalog
▪ Including the number and type of fields

Records of a single table can vary from each other
 In addition to differences in data (obviously)
 Different records may contain a different number of fields,
or
 Fields of varying length

INTEGER, represented by two or four bytes
FLOAT, represented by four or eight bytes

CHAR(n), fixed length character strings of n bytes

 Unused characters are filled with a pad character
▪ e.g. if a CHAR(5) stored "elm" it would be stored as "elm" followed by two pad characters

VARCHAR(n), character strings of varying lengths
 Stored as arrays of n+1 bytes
▪ i.e. even though a VARCHAR's contents can vary, n+1 bytes are
dedicated to them
 The length of a VARCHAR is stored in the first byte, or
 Its end is specified by a null character
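
The two VARCHAR layouts can be illustrated directly; both use n+1 bytes regardless of the stored string (the zero pad byte is an assumption made for the example):

```python
def varchar_length_prefixed(s: str, n: int) -> bytes:
    """First byte holds the length; the remaining n bytes hold the characters."""
    data = s.encode("ascii")
    return bytes([len(data)]) + data + b"\x00" * (n - len(data))

def varchar_null_terminated(s: str, n: int) -> bytes:
    """The string is followed by a null character that marks its end."""
    data = s.encode("ascii")
    return data + b"\x00" * (n + 1 - len(data))

assert len(varchar_length_prefixed("elm", 5)) == 6   # always n + 1 bytes
assert varchar_length_prefixed("elm", 5)[0] == 3     # stored length
assert varchar_null_terminated("elm", 5)[3] == 0     # terminator right after "elm"
```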

Fields are a fixed length, and the number of
fields is fixed
 Fields may then be stored consecutively
 And given the address of a record, the address of
a particular field can be found
▪ By referring to the field size in the system catalog

It is common to begin all fields at a multiple
of 4 or 8 bytes

Consider an employee record:


{name CHAR(30), address VARCHAR(255), salary FLOAT}
Fields can be found by looking up the field size in the schema
and performing an offset calculation
[Figure: record layout – a header (pointer to schema, record length, timestamp) occupies bytes 0–11, name starts at offset 12, address at offset 44, and salary at offset 300; the record is 308 bytes long]

In the relational model each record contains the same
number of fields
 However fields may be of variable length

If a record contains both fixed and variable length fields,
store the fixed length fields first
 The fixed length fields are easy to locate

To store variable length fields include additional information
in the record header
 The length of the record
 Pointers to the beginning of each variable length field
 A pointer to the end of the record


Consider an employee record with name, and salary being
fixed length and address being variable length
The pointer to the first variable length field may be omitted
other header information
record length
address pointer
end of record
name
header
salary
address

Records in a relational DB have the same
number of fields
 But it is possible to have repeating fields
 For example a many to many relationship in a record
that represents an object
 References to other objects will have to be stored
▪ The references (or pointers) to other objects suggest that
different records will have different lengths

There are three alternatives for recording such
data

Store the entire record in one block
 Maintain a pointer to the first reference

Store the fixed length portion in one location, and the
variable length portion in another
 The header contains a pointer to the variable length
portion (the references to other objects), and
 The number of such objects

Store a fixed length record with a fixed number of
occurrences of the repeating fields, and
 A pointer to (and count of) any additional occurrences

There are many advantages of keeping
records (and therefore fields) fixed length
 More efficient for search
 Lower overhead (the header contains less data)
 Easier to move records around

The main advantage of using variable length
fields is that it can save space
 This can result in fewer disk I/O operations

Modifying a variable field in a record may make it larger
 Later fields in the same record have to be moved, and
 Other records may also have to be moved

When a variable field is modified the record’s size may
increase to the extent that it no longer fits on the page
 The record must then be moved to another page, but
 A "forwarding address" has to be maintained on the old page,
so that external references to the rid are still valid

A record may grow larger than the page size
 The record must then be broken into sections and connected by
pointers

Forwarding addresses may need to be maintained
 When a record grows too large, or
 When records are maintained in order (clustered)

When maintaining ordered data
 Provide a forwarding address if a record has to be moved
to a new page to maintain the ordering
 Delete records by inserting a NULL value, or tombstone
pointer in the header
 The record slot can be re-used when another record is
inserted
[Figure: pages of ordered records, identified by their primary keys (initially 1, 3, 8); inserting 6, 17 and 21 causes records to be moved to maintain the ordering, and deleting 3 then inserting 5 shows a tombstoned slot being re-used]

There are other data types that require special treatment in terms of record storage
 Pointers, and reference variables
 Large objects such as text, images, video, sound etc.

If a record represents an object, the object may contain
pointers to or addresses of some other object
 Such pointers need to be managed by a DBMS

A data item may have two addresses
 Database address on disk, usually 8 bytes
 Memory address in main memory, usually 4 bytes
When an item is on the disk (i.e. secondary storage) its
database address must be used
 And when an item is in the buffer pool it can be referred to
by either its database or memory address

 It is more efficient to use the memory address

Database addresses of items in main memory should be
translated to their current memory addresses
 To avoid unnecessary disk I/O
 It is possible to create a translation table that maps database
addresses to memory addresses

However when using such a table addresses may have to be
repeatedly translated
 Whenever a pointer of a record in main memory is accessed the
translation table must be used

Pointer swizzling is used to avoid repeated translation table
look-up

Whenever a block is moved from secondary to main memory
pointers in that block may be swizzled
 i.e. translated from the database to the memory address

A pointer in main memory consists of
 A bit that indicates whether the pointer is a database or a
memory address, and
 The memory address (4 bytes) or database address (8 bytes) as
appropriate
▪ Space is always reserved for the database address

There are several strategies to decide when a pointer should
be swizzled
[Figure: a block read from disk into memory; its pointers to records already in memory (Block 1) are swizzled, while pointers to blocks still on disk (Block 2) remain unswizzled]

When a new block is brought into main memory, pointers related
to that block may be swizzled
 The block may contain pointers to records in the same block, or other
blocks, and
 Pointers in records in other blocks, already in main memory, may point
to records in the newly copied block

There are four main swizzling strategies
 Automatic swizzling
 Swizzling on demand
 No swizzling – i.e. just use the translation table
 Programmer controlled swizzling – when access patterns are known


Enter the address of the block and its records into the
translation table
Enter the address of any pointers in the records in the block
into the translation table
 If such an address is already in the table, swizzle the pointer
giving it the appropriate memory address
 If the address is not already in the table, copy its block into
memory and swizzle the pointer

This ensures that all pointers in the new block are swizzled
when the block is loaded, which may save time
 However, it is possible that some of the pointers may not be
followed, hence time spent swizzling them is wasted
Enter the address of the block and its records into the
translation table
 Leave all pointers in the block unswizzled
 When an unswizzled pointer is followed, look up the address
in the translation table

 If the address is in the table, swizzle the pointer
 If the address is not in the table, copy the appropriate block into
main memory, and swizzle the pointer

Unlike automatic swizzling, this strategy does not result in
unnecessary swizzling
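
Swizzling on demand can be sketched as a check made each time a pointer is followed. The Pointer class, translation table and load_block routine below are hypothetical illustrations of the idea, not a real DBMS interface:

```python
class Pointer:
    def __init__(self, db_address):
        self.swizzled = False        # the flag bit: database or memory address?
        self.address = db_address

def follow(ptr, translation_table, load_block):
    """Return the memory address of the target, swizzling the pointer on first use."""
    if ptr.swizzled:
        return ptr.address                         # already a memory address
    if ptr.address not in translation_table:
        load_block(ptr.address)                    # bring the block in; this fills the table
    ptr.address = translation_table[ptr.address]   # replace database address with memory address
    ptr.swizzled = True
    return ptr.address
```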

When a block is written to disk its pointers must
first be unswizzled
 That is, the pointers to memory addresses must be
replaced by the appropriate database addresses

The translation table can be searched (by
memory address) to find the database address
 This is potentially time consuming
 The translation table should therefore be indexed to
allow efficient lookup of both memory and database
addresses

Pointer swizzling may result in blocks being pinned
 A block is pinned if it cannot safely be written back to disk

A block that is pointed to by a swizzled pointer should be
pinned
 Otherwise, the pointer can no longer be followed to the block at
the specified memory address

If a block is unpinned pointers to it must be unswizzled
 The translation table must also include the memory addresses
of pointers that refer to an entry
▪ As a linked list attached to an entry in the translation table, or
▪ As a (pointer to a) linked list in the record's pointer field

How are large data objects stored in records?
 Video clips, or sound files or the text from a book

LOB data types store and manipulate large blocks of unstructured
data
 Tables can contain multiple LOB columns
 The maximum size of a LOB is large
▪ At least 8 terabytes in Oracle 10g
 LOB data must be processed by application programs

LOB data is stored as either binary or character data
 BLOB – unstructured binary data
 CLOB, NCLOB – character data
 BFILE – unstructured binary data in OS files

LOB's have to be stored on a sequence of blocks
 Ideally the blocks should be contiguous for efficient
retrieval, but
 It is possible to store the LOB on a linked list of blocks
▪ Where each block contains a pointer to the next block


If fast retrieval of LOBs is required they can be
striped across multiple disks for parallel access
It may be necessary to provide an index to a LOB
 For example indexing by seconds for a movie to allow
a client to request small portions of the movie


Records are organized on pages
Pages can be thought of as a collection of slots, each
of which contains a single record
 A record can be identified by its record id (rid)
▪ The rid is the {page ID, slot number} pair

Before considering different organizations for
managing slots it is important to know if
 Records are fixed length or
 Variable length

There are two organizations based on how records are
deleted





In the first organization
 Records are stored consecutively in slots
 When a record is deleted the last record on the page is moved to its location
 Records are found by an offset calculation
 All empty space is at the bottom of the page
 But the rid includes the slot number
▪ As records are moved, external references become invalid

[Figure: a page with slots 1 to N packed at the top, free space below, and the number of records, N, stored at the end of the page]

In the second organization the page header contains a bitmap
 Each bit represents a single slot
 A slot's bit is turned off when the slot is empty
 New records are inserted in empty slots
 A record's slot number doesn't change

[Figure: a page with slots 1 to M, and a footer holding the number of slots, M, and a bitmap showing slot occupancy]

With variable length records a page cannot be divided into
fixed length slots
 If a new record is larger than the slot it cannot be inserted
 If a new record is smaller it wastes space
 To avoid wasting space, records must be moved so that all the
free space is contiguous without changing the rids

One solution is to maintain a directory of page slots at the
end of each page which contains
 A pointer (an offset value) to each record and
 The length of each record

Pointers are offsets to records
 Moving a record on the page has no impact on its rid
▪ Its pointer changes but its slot number does not
 A pointer to the start of the free space is required

Records are deleted by setting the offset to -1
 New records can be inserted in vacant slots

Pages should be periodically reorganized to remove gaps
The directory "grows" into the free space

[Figure: a page holding records of length 24, 16 and 20; the slot directory at the end of the page stores an (offset, length) entry per slot, the number of slots N, and a pointer to the start of the free space]
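
The slot directory can be sketched as a list of (offset, length) entries plus a free-space pointer. This is an illustrative page structure, not a particular system's layout (and it omits the periodic reorganization of free space):

```python
class SlottedPage:
    def __init__(self, size: int):
        self.data = bytearray(size)
        self.slots = []          # directory: one [offset, length] entry per slot
        self.free_start = 0      # pointer to the start of the free space

    def insert(self, record: bytes) -> int:
        """Store a record and return its slot number (the second half of the rid)."""
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        for slot_no, slot in enumerate(self.slots):
            if slot[0] == -1:                        # re-use a vacant slot
                self.slots[slot_no] = [offset, len(record)]
                return slot_no
        self.slots.append([offset, len(record)])
        return len(self.slots) - 1

    def delete(self, slot_no: int) -> None:
        self.slots[slot_no][0] = -1                  # rid stays stable; slot is now vacant

    def read(self, slot_no: int) -> bytes:
        offset, length = self.slots[slot_no]
        return bytes(self.data[offset:offset + length])
```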


A page can be considered as a collection of
records
Pages containing related records are
organized into collections, or files
 One file usually represents a single table

One file may span several pages
 It is therefore necessary to be able to access all of
the pages that make up a file

The basic file structure is a heap file

Heap files are not ordered in any way
 But they do guarantee that all of the records in a file can be retrieved
by repeatedly requesting the next record
 Each record in a file has a unique record ID (rid)
 And each page in the file is the same size

Heap files support the following operations:
 Creating and destroying files
 Inserting and deleting records
 Scanning all the records in the file

To support these operations it is necessary to:
 Keep track of the pages in the file
 Keep track of which of those pages contain free space

Maintain the heap file as a pair of doubly linked lists of
pages
 One list for pages with free space and
 One list for pages that are full
 The DBMS can record the first page in the list in a table
with one entry for each file

If records are of variable length most pages will end
up in the list of pages with free space
 It may be necessary to search several pages on the free
space list to find one with enough free space

Maintain the heap file as a directory of pages
 Each directory entry identifies a page (or a
sequence of pages) in the heap file

The entries are kept in data page order and each entry
records, for its page:
 Whether or not the page is full, or
 The amount of free space
▪ If the amount of free space is recorded there is no need
to visit a page to determine if it contains enough space

The buffer manager is responsible for bringing pages from
disk to main memory as required
 Main memory is partitioned into a collection of pages called the
buffer pool
 Main memory pages are referred to as frames
 Other processes must tell the buffer manager if a page is no
longer required and whether or not it has been modified

A DB may be many times larger than the buffer pool
 Accessing the entire DB (or performing queries that require
joins) can easily fill up the buffer pool
▪ When the buffer pool is full, the buffer manager must decide which
pages to replace by following a replacement policy
[Figure: programs request pages through the buffer manager, which brings disk pages from the database into free frames of the buffer pool in main memory]

Buffer pool frames are the same size as disk pages
 The buffer manager records two pieces of information for each frame
▪ dirty bit – on if the page has been modified
▪ pin-count – the number of times the page has been requested but not released

If the page is already in the buffer pool
 Increment the frame's pin-count (called pinning)

Otherwise
 Choose a frame to replace (using the policy)
▪ A frame is only chosen for replacement if its pin-count is zero
▪ If there is no frame with a pin-count of zero the transaction must
either wait or be aborted
▪ If the chosen frame is dirty write it to the disk
 Read requested page into replacement frame and set its pin-
count to 1

Return the address of the frame
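
The request and release logic can be sketched as follows. The frame objects, replacement policy and read_page/write_page routines are assumed interfaces used only for illustration:

```python
def request_page(page_id, buffer_pool, policy, read_page, write_page):
    """buffer_pool maps page ids to frames; returns the frame holding the page."""
    if page_id in buffer_pool:                     # already buffered: pin it again
        frame = buffer_pool[page_id]
        frame.pin_count += 1
        return frame

    frame = policy.choose_victim(buffer_pool)      # must have pin_count == 0
    if frame is None:
        raise RuntimeError("no unpinned frame: wait or abort the transaction")
    if frame.dirty:
        write_page(frame.page_id, frame.data)      # flush the old contents first
    del buffer_pool[frame.page_id]

    frame.page_id, frame.data = page_id, read_page(page_id)
    frame.pin_count, frame.dirty = 1, False
    buffer_pool[page_id] = frame
    return frame

def release_page(frame, modified: bool):
    frame.pin_count -= 1                           # unpinning
    frame.dirty = frame.dirty or modified
```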

When a process releases (de-allocates) a page its
pin-count is reduced, known as unpinning
 The program indicates if the page has been modified, if so
the buffer manager sets the dirty bit to on

Processes for requesting and releasing pages are
affected by concurrency and crash recovery policies
 These will be discussed at a later date

The policy used to replace frames can affect the
efficiency of database operations
 Ideally a frame should not be replaced if it will be
needed again in the near future

Least Recently Used (LRU) replacement policy
 Assumes that frames that haven't been used recently
are no longer required
 Uses a queue to keep track of frames with pin-count
of zero
 Replaces the frame at the front of the queue
 Requires main memory space for the queue

A variant of the LRU policy with less overhead – the clock replacement policy
 Instead of a queue the policy requires one bit per frame,
and a single variable, called current
 Assume that the frames are numbered from 0 to B-1
▪ Where B is the number of frames

Each frame has an associated referenced bit
 The referenced bit is initially set to off, and is
 Set to on when the frame's pin-count reaches zero

current is initially set to 0, and is used to indicate the
next frame to be considered for replacement

Consider the current frame for replacement
 If pin-count > 0, increment current
 If pin-count = 0 and the referenced bit is on
▪ Switch referenced to off and increment current
 If pin-count = 0 and the referenced bit is off
▪ Replace the frame
 When current moves past the last frame (B-1) it wraps around to 0
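
The clock sweep can be sketched in a few lines; frame objects with pin_count and referenced attributes are assumed for illustration:

```python
def clock_choose_victim(frames, current=0):
    """Sweep from frame `current`, giving unpinned frames one second chance."""
    for _ in range(2 * len(frames)):               # at most two full sweeps
        frame = frames[current]
        if frame.pin_count == 0:
            if frame.referenced:
                frame.referenced = False           # second chance; keep sweeping
            else:
                return current                     # unpinned and not recently referenced
        current = (current + 1) % len(frames)      # wrap around past the last frame
    return None                                    # every frame is pinned
```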

Only replace frames with pin-counts of zero
 Frames with a pin-count of zero are only replaced
after all older candidates are replaced


LRU and clock replacement are fair schemes
They are not always the best strategies for a DB system
 It is common for some DB operations to require repeated
sequential scans of data (e.g. Cartesian products, joins)
 With LRU such operations may result in sequential flooding

An alternative is the Most Recently Used policy
 This prevents sequential flooding but is generally poor

Most systems use some variant of LRU
 Some systems will identify certain operations, and apply MRU
for those operations


Assume that a process requests repeated sequential scans of a file, and
that the buffer pool has ten frames.

If the file has nine pages (p1 … p9):
 Read page 1 first, then page 2, … then page 9
 All the pages are now in the buffer, so when the next scan of the file is
requested no further disk access is required!

If the file has eleven pages (p1 … p11):
 Read pages 1 to 10 first; page 11 is still to be read
 Using LRU, replace the least recently used frame, which contains p1, with p11
 The first scan is now complete; start the second scan by reading p1 from the file
 Replace the LRU frame (which now contains p2) with p1
 Continue the scan by reading p2, which evicts p3, and so on
 Each scan of the file requires that every page is read from the disk!
 In this case LRU is the WORST possible replacement policy!

There are similarities between OS virtual
memory and DBMS buffer management
 Both have the goal of accessing more data than
will fit in main memory
 Both bring pages from disk to main memory as
needed and replace unneeded pages

A DBMS requires its own buffer management
 To increase the efficiency of database operations
 To control when a page is written to disk

A DBMS can often predict patterns in the way in which pages
are referenced
 Most page references are generated by processes such as query
processing with known patterns of page accesses
 Knowledge of these patterns allows for a better choice of pages
to replace and
 Allows prefetching of pages, where the page requests can be
anticipated and performed before they are requested
 A DBMS requires the ability to force a page to disk
 To ensure that the page is updated on a disk
 This is necessary to implement crash recovery protocols where the
order in which pages are written is critical

Some DBMS buffer managers are able to predict page
requests
 And fetch pages into the buffer before they are requested
 The pages are then available in the buffer pool as soon as they
are requested, and
 If the pages to be prefetched are contiguous, the retrieval will
be faster than if they had been retrieved individually
 If the pages are not contiguous, retrieval may still be faster as
access to them can be efficiently scheduled

The disadvantage of prefetching (aka double-buffering) is
that it requires extra main memory buffers

Organizing data by cylinders
 Related data should be stored "close to" each other

Using a RAID system to improve efficiency or reliability
 Multiple disks and striping improves efficiency
 Mirroring or redundancy improves reliability

Scheduling requests using the elevator algorithm
 Reduces disk access time for random reads and writes
 Most effective when there are many requests waiting

Prefetching (or double-buffering) data in large chunks
 Speeds up access when needed blocks can be predicted but
requires more main memory buffers