Storage and File Structure
 Overview of Physical Storage Media
 Redundant Arrays of Independent Disks (RAID)
 Buffer Management
 File Organization
©Silberschatz, Korth and Sudarshan
Properties of Physical Storage Media
Types of storage media differ in terms of:

Speed of data access

Cost per unit of data

Reliability:
 Data loss on power failure or system (software) crash
 Physical failure of the storage device
=> What is the value of data (e.g., salon example)?
Classification of Storage Media
Storage can be classified as:

Volatile:
 Content is lost when power is switched off
 Includes primary storage (cache, main-memory)

Non-volatile:
 Contents persist even when power is switched off
 Secondary and tertiary storage, plus battery-backed up RAM
Storage Hierarchy, Cont.
Storage can also be classified as:

Primary Storage:
 Fastest media
 Volatile
 Includes cache and main memory

Secondary Storage:
 Moderately fast access time
 Non-volatile
 Includes flash memory and magnetic disks
 Also called on-line storage

Tertiary Storage:
 Slow access time
 Non-volatile
 Includes magnetic tape and optical storage
 Also called off-line storage
Primary Storage

Cache:
 Volatile
 Fastest and most costly form of storage
 Managed by the computer system hardware
 Proprietary

Main memory (also known as RAM):
 Volatile
 Fast access (10s to 100s of nanoseconds; 1 nanosecond = 10⁻⁹ seconds)
 Generally too small (or too expensive) to store the entire database
=> The above comment notwithstanding, main memory databases do
exist, especially for clusters.
Physical Storage Media, Cont.

Flash memory:
 Non-volatile
 Widely used in embedded devices such as digital cameras
 SD cards, USB drives, and solid-state drives
 Data can be written at a location only once, but location can be erased and written to again
 Can support only a limited number of write/erase cycles
 Reads are roughly as fast as main memory
 Writes are slow (a few microseconds); erase is slower
 Cost per unit of storage roughly similar to main memory
Physical Storage Media, Cont.

Magnetic disk:
 Non-volatile
 • Survives power failures and system (software) crashes
 • Physical failure can destroy data, but such failures are rare
 Primary medium for database storage
 Typically stores the entire database
 • Capacities of individual disks are currently in the 100s of GBs to TBs
 • Much larger capacity than main or flash memory
 Direct/random access, unlike magnetic tape, which is sequential access.
Physical Storage Media, Cont.

Optical storage:
 Non-volatile
 Data is read optically from a spinning disk using a laser
 CD-ROM (640 MB) and DVD (4.7 to 17 GB) are the most popular forms
 Write-once, read-many (WORM) optical disks used for archival storage (CD-R and DVD-R)
 Multiple write versions available (CD-RW, DVD-RW, and DVD-RAM)
 Reads and writes are slower than with magnetic disk
Physical Storage Media, Cont.

Tape storage:
 Non-volatile
 Data is accessed sequentially, consequently known as sequential-access
 Much slower than a disk
 Very high capacity (40 to 300 GB tapes available)
 Storage costs are much cheaper than for a disk, but high quality drives can be very expensive
 Used mainly for backup (recover from disk failure), archive and transfer of very large amounts of data
Physical Storage Media, Cont.

Juke-boxes:
 For storing massive amounts of data
 • Hundreds of terabytes (1 terabyte = 10¹² bytes), petabytes (1 petabyte = 10¹⁵ bytes), or exabytes (1 exabyte = 10¹⁸ bytes)
 Large number of removable disks or tapes
 A mechanism for automatic loading/unloading of disks or tapes
Storage Hierarchy
[Diagram: the storage hierarchy – media toward the top are faster, more expensive per byte, and volatile; media toward the bottom are slower, cheaper, and non-volatile.]
Magnetic Hard Disk Mechanism
NOTE: Diagram is only a simplification of actual disk drives
Magnetic Disks

Disk assembly consists of:
 A single spindle that spins continually (typically at 7,500 or 10,000 RPM)
 Multiple disk platters (typically 2 to 5)

Surface of platter divided into circular tracks:
 Over 16,000 tracks per platter on typical hard disks

Each track is divided into sectors:
 A sector is the smallest unit of data that can be read or written
 Typically 512 bytes
 Typical sectors per track: 200 (on inner tracks) to 400 (on outer tracks)

Head-disk assemblies:
 One head per platter, mounted on a common arm, each very close to its platter.
 Reads or writes magnetically encoded information

“Cylinder i” consists of the ith track of all the platters
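
To see how the geometry numbers combine, here is a back-of-the-envelope capacity calculation in Python (all figures assumed, drawn from the ranges on this slide; with these dated textbook numbers the result is far below the 100s-of-GB drives mentioned earlier):

    # Back-of-the-envelope platter-assembly capacity (illustrative numbers only).
    platters = 4                    # "typically 2 to 5", one head per platter
    tracks_per_platter = 16_000     # "over 16,000 tracks per platter"
    avg_sectors_per_track = 300     # 200 (inner tracks) to 400 (outer tracks)
    sector_bytes = 512              # smallest readable/writable unit

    capacity = platters * tracks_per_platter * avg_sectors_per_track * sector_bytes
    print(f"{capacity / 10**9:.1f} GB")   # ~9.8 GB; real drives pack far more per track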
Magnetic Disks, Cont.

Disks are the primary performance bottleneck in a database system, in part
because of the need for physical movement.

To read/write a sector:
 disk arm swings to position the head on the right track – seek time (4-10 ms)
 sector rotates under the read/write head – rotational latency (4-11 ms)
 data is read/written as the sector passes under the head – transfer rate (25-100 MB/s)

Access time - The time it takes from when a read or write request is issued to
when data transfer begins, i.e., seek time + rotational latency.
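
To make the access-time formula concrete, a small worked example in Python (seek time and RPM assumed, within the ranges above):

    # Access time = seek time + rotational latency; data transfer then begins.
    rpm = 7500
    full_rotation_ms = 60_000 / rpm                    # 8 ms per revolution
    avg_rotational_latency_ms = full_rotation_ms / 2   # half a revolution on average: 4 ms
    avg_seek_ms = 7                                    # assumed, within the 4-10 ms range
    print(avg_seek_ms + avg_rotational_latency_ms)     # 11.0 ms before transfer starts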
Performance Measures of Disks, Cont.

Multiple disks may share an interface controller, so the rate at which the interface
controller can handle data is also important
 ATA-5: 66 MB/second, SCSI-3: 40 MB/s, Fiber Channel: 256 MB/s

Mean time to failure (MTTF) - The average time the disk is expected to run
continuously without any failure.
 Typically 3 to 5 years
 Quoted failure rates for new disks are quite low – MTTFs of 30,000 to 1,200,000 hours
 An MTTF of 1,200,000 hours for a new disk means that, given 1,000 relatively new disks, on average one will fail every 1,200 hours
 MTTF decreases as the disk ages
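
The 1,200-hour figure is just the population failure rate implied by the MTTF; a one-line check in Python:

    # With N independent disks, the expected time between failures in the pool is MTTF / N.
    print(1_200_000 / 1_000)   # 1200.0 hours, i.e., roughly one failure every 50 days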
Magnetic Disks (Cont.)

The term “controller” is used (primarily) in two ways, both in the book and other
literature; the book does not distinguish between the two very well.

Disk controller: (use #1)
 Packaged within the disk
 Accepts high-level commands to read or write a sector
 Initiates actions such as moving the disk arm to the right track and actually reading or writing the data
 Computes and attaches checksums to each sector for reliability
 Performs re-mapping of bad sectors
Disk Subsystem
[Diagram: disks attached to the system bus through an interface controller.]

Interface controller: (use #2)
 Multiple disks are typically connected to the system bus through an interface controller (i.e., a host adapter)
 Many functions (checksum computation, bad-sector re-mapping) are often carried out by individual disk controllers; this reduces load on the interface controller
Disk Subsystem

The distribution of work between the disk controller and interface controller depends
on the interface standard.

Disk interface standard families:
 ATA (AT adaptor)/IDE
 SCSI (Small Computer System Interconnect)
 Fibre Channel, etc.
 Several variants of each standard (different speeds and capabilities)

One computer can have many interface controllers, of the same or different types.
=> Like disks, interface controllers are also a performance bottleneck.
Techniques for Optimization of Disk-Block Access

Several techniques are employed to minimize the negative effects of disk and
controller bottlenecks

File organization:
 Organize data on disk based on how it will be accessed
• Store related information (e.g., data in the same file) on the same or nearby cylinders.
 A disk may get fragmented over time:
• As data is inserted and deleted from disk, free “slots” on disk tend to become scattered
• Data items (e.g., files) are then broken up and scattered over the disk
• Sequential access to fragmented data results in increased disk arm movement
 Most DBMSs have utilities to de-fragment the file system and, more generally, manipulate the file
organization of a database.
Techniques for Optimization of Disk-Block Access, Cont.

Block size adjustment:
 A block is a contiguous sequence of sectors from a single track
 A DBMS transfers data between disk and main memory in blocks
 Sizes range from 512 bytes to several kilobytes
 • Smaller blocks: more transfers from disk
 • Larger blocks: may waste space due to partially filled blocks
 • Typical block sizes range from 4 to 16 kilobytes
 Block size can be adjusted to accommodate workload

Disk-arm-scheduling algorithms:
 Order pending accesses so that disk arm movement is minimized
 One example is the “Elevator algorithm”
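
A minimal sketch of the elevator idea in Python (illustrative only; real scheduling happens in the OS and disk controller): pending requests are served in the direction the arm is already moving, and then the sweep reverses.

    def elevator_order(pending, head, moving_up=True):
        """Order pending cylinder requests to minimize arm movement (elevator/SCAN)."""
        above = sorted(c for c in pending if c >= head)
        below = sorted((c for c in pending if c < head), reverse=True)
        return above + below if moving_up else below + above

    # Head at cylinder 53, arm moving toward higher-numbered cylinders:
    print(elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head=53))
    # [65, 67, 98, 122, 124, 183, 37, 14]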
Techniques for Optimization of Disk-Block Access, Cont.

Log disk:
 A disk devoted to the transaction log
 Writes to a log disk are very fast since no seeks are required
 Often provided with battery backup

Nonvolatile write buffers/RAM:
 Battery backed up RAM or flash memory
 Typically associated with a disk or storage device
 Blocks to be written are first written to the non-volatile RAM buffer
 Disk controller writes data to disk whenever the disk has no other requests
 Allows processes to continue without waiting on write operations
 Data is safe even in the event of power failure
 Allows writes to be reordered to minimize disk arm movement
RAID

Redundant Arrays of Independent Disks (RAID):
 Disk organization techniques that exploit a collection of disks
 Increase speed, reliability, or both
 Collection appears as a single disk to the system
 Originally “inexpensive” disks

Ironically, using multiple disks increases the risk of failure (not data loss):
 A system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF
of 1000 hours (approximately 41 days)
Improvement of Reliability via Redundancy

RAID improves on reliability by using two techniques:
 Parity Information
 Mirroring

Parity Information (basic):
 Makes use of the exclusive-or (XOR) operator:
 • 1 ⊕ 1 ⊕ 0 ⊕ 1 = 1
 • 0 ⊕ 1 ⊕ 1 ⊕ 0 = 0
 Enables the detection of single-bit errors
 Could be applied to a collection of disks
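
A minimal sketch in Python of parity over a set of disks (block contents invented): the parity block is the XOR of the data blocks, and any one lost block can be rebuilt by XOR-ing everything that survives.

    def xor_blocks(blocks):
        """XOR a list of equal-length blocks, byte by byte."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    disks = [b"\x0b\x10", b"\xa4\x33", b"\x5c\x01"]   # three data "disks"
    parity = xor_blocks(disks)

    # Disk 1 fails: its contents are the XOR of the survivors and the parity.
    assert xor_blocks([disks[0], disks[2], parity]) == disks[1]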
Improvement of Reliability via Redundancy, Cont.

Mirroring:
 For every disk, keep a duplicate copy
 Every write is carried out on both disks (in parallel)
 If one disk in a pair fails, data still available
 Different reads can take place from either disk and, in particular, from both disks at the same time

Data loss would occur only if 1) a disk fails, and 2) its mirror disk fails before the
first is repaired:
 Probability of both events is very small
 If the MTTF is 100,000 hours and the mean time to repair is 10 hours, then the mean time to data loss is approximately 57,000 years for a mirrored pair of disks
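
The 57,000-year figure follows from the standard independent-failure approximation, mean time to data loss ≈ MTTF² / (2 × mean time to repair); checking the arithmetic in Python:

    mttf_hours = 100_000
    mttr_hours = 10
    mtdl_hours = mttf_hours ** 2 / (2 * mttr_hours)   # 5.0e8 hours
    print(mtdl_hours / (24 * 365))                    # ~57,000 years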
Improvement in Performance via Parallelism

RAID improves performance mainly through the use of parallelism.

Two main goals of parallelism in a disk system:
 Parallelize large accesses to reduce response time (transfer rate).
 Load balance multiple small accesses to increase throughput (I/O rate).

Parallelism is achieved primarily through striping, but also through mirroring.

Striping can be done at varying levels:
 bit-level
 block-level (variable size)
Improvement in Performance via Parallelism

Bit-level striping:
 Abstractly, the data to be stored can be thought of as a list of bytes.
 Suppose there are eight disks, and write bit i of each byte to disk i.
 In theory, each access can read data at eight times the rate of a single disk (reduces transfer time,
and hence improves response time).
 Each byte read or written ties up all 8 disks (reduces throughput).
 Not commonly used, at least as described here.

Block-level striping:
 Abstractly, the data to be stored can be thought of as a list of blocks, numbered 0..(k-1).
 Suppose there are n disks, numbered 0..(n-1).
 Store block i of a file on disk (i mod n).
 Requests for different blocks can run in parallel if the blocks reside on different disks (improves
throughput).
 A request for a long sequence of blocks can use all disks in parallel (improves response time).
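
A tiny sketch in Python of the block-to-disk mapping just described (helper name invented):

    def place_block(i, n_disks):
        """Block i of the file goes to disk (i mod n), as that disk's (i div n)-th block."""
        return i % n_disks, i // n_disks

    # With 4 disks, blocks 0..7 land on disks 0, 1, 2, 3, 0, 1, 2, 3:
    print([place_block(i, 4)[0] for i in range(8)])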
RAID Levels

Different RAID organizations, or RAID “levels,” have differing cost, performance
and reliability characteristics.

Levels vary by source and vendor:
 Standard levels: 0, 1, 2, 3, 4, 5, 6
 Nested levels: 0+1, 1+0, 5+0, 5+1
 Several other, non-standard, vendor-specific levels exist as well
 Our book: 0, 1, 2, 3, 4, 5

Performance and reliability are not linear in the level #.

It is helpful to compare each level to every other level, and to the single disk
option.
RAID Levels
 RAID Level 0:
 Block-level striping.
 Used in high-performance applications where data loss is not critical.
 RAID Level 1:
 Mirrored disks with block striping.
 Offers the best write performance, according to the authors.
 Sometimes called 0+1, 1+0, 01 or 10.
[Diagram: RAID 1 – striped disks, each with a mirrored copy (C).]
RAID Levels, Cont.
 RAID Level 2:
 Memory-Style Error-Correcting-Codes (ECC) with bit-level striping.
[Diagram: RAID 2 – bit-striped data disks plus three ECC disks (P).]
 Parity information is used to detect, locate and correct errors.
 Outdated - Disks contain embedded functionality to detect and locate sector
errors, so only a single parity disk is needed (for correction).
 Also, disks don’t read/write individual bits.
RAID Levels, Cont.
 RAID Level 3:
 Bit-level striping, with parity.
[Diagram: RAID 3 – bit-striped data disks with a single parity disk (P).]
 Individual disks report errors.
 To recover data in a damaged disk, compute XOR of bits from other disks
(including parity bit disk).
 Faster response time than with a single disk, but fewer I/Os per second since
every disk has to participate in every I/O.
 When writing data, corresponding parity bits must also be computed and written
to a parity bit disk.
 Subsumes Level 2 (provides all its benefits, at lower cost).
RAID Levels (Cont.)
 RAID Level 4:
 Block-level striping, with parity.
[Diagram: RAID 4 – block-striped data disks with a single parity disk (P).]
 Individual disks report errors.
 To recover data in a damaged disk, compute XOR of bits from other disks
(including parity bit disk).
 When writing a data block, the corresponding block of parity bits must also be recomputed and written to the parity disk
 • For a single block write – use the old parity block, the old value of the block, and the new value of the block (2 block reads + 2 block writes); see the sketch below
 • For a large sequential write – compute the parity directly from the new values of the blocks covered by the parity block
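
A minimal sketch in Python of the single-block case (block contents invented): the new parity is old parity XOR old data XOR new data, so only the target disk and the parity disk are touched.

    def new_parity(old_parity, old_block, new_block):
        # new parity = old parity XOR old block XOR new block, byte by byte
        return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))

    # 0b1010 XOR 0b0110 XOR 0b0011 = 0b1111
    print(bin(new_parity(b"\x0a", b"\x06", b"\x03")[0]))   # 0b1111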
RAID Levels (Cont.)
 RAID Level 4: (Cont.)
 Provides higher I/O rates for independent block reads than Level 3
 Provides higher transfer rates for large, multi-block reads & writes compared
to a single disk.
 Parity block is a bottleneck for independent block writes.
RAID Levels (Cont.)
 RAID Level 5:
 Block-level striping with distributed parity; partitions data and parity among all
N + 1 disks, rather than storing data in N disks and parity in 1 disk.
 For example, with 5 disks, parity block for ith set of N blocks is stored on disk
(i mod 5), with the data blocks stored on the other 4 disks.
[Diagram: RAID 5 – parity blocks (P) rotated across all five disks.]
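
A tiny sketch in Python of the rotating-parity placement above (indexing convention assumed):

    def raid5_layout(stripe, n_disks=5):
        """For stripe i, parity sits on disk (i mod n); data blocks fill the rest."""
        parity_disk = stripe % n_disks
        return parity_disk, [d for d in range(n_disks) if d != parity_disk]

    for stripe in range(5):
        print(stripe, raid5_layout(stripe))   # parity rotates through disks 0..4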
RAID Levels (Cont.)
 RAID Level 5 (Cont.)
 Provides the same benefits as level 4, but minimizes the parity-disk bottleneck.
 Higher I/O rates than Level 4.
 RAID Level 6:
 P+Q Redundancy scheme; similar to Level 5, but uses additional information
to guard against multiple disk failures.
 Better reliability than Level 5 at a higher cost; not used as widely.
[Diagram: RAID 6 – two parity blocks (P) per stripe, distributed across all disks.]
Choice of RAID Level
 The “best” level for any particular application is not always clear.
 Factors in choosing RAID level:
 Monetary cost of disks, controllers or RAID storage devices
 Reliability requirements
 Performance requirements:
 • Throughput vs. response time
 • Short vs. long I/Os
 • Reads vs. writes
 • Performance during failure and while rebuilding a failed disk
Choice of RAID Level, Cont.
The book's analysis…
 RAID 0 is used only when (data) reliability is not important.
 Levels 2 and 4 are never used, since they are subsumed by 3 and 5.
 Level 3 is typically not used, since bit-level striping forces single-block reads to access all disks, wasting disk arm movement.
 Sometimes vendors advertise a version of level 3 using byte-level striping.
 Level 6 is rarely used since levels 1 and 5 offer adequate safety for
almost all applications.
 So competition is between 1 and 5 only.
Choice of RAID Level (Cont.)
 Level 1 provides much better write performance than level 5:
 Level 5 requires at least 2 block reads and 2 block writes to write a single block
 Level 1 only requires 2 block writes
 Level 1 has higher storage cost than level 5, however:
 I/O requirements have increased greatly, e.g., for Web servers.
 When enough disks have been bought to satisfy the required I/O rate, they often have spare storage capacity…
 In that case there is often no extra monetary cost for Level 1!
 Level 5 is preferred for applications with low update rate,
and large amounts of data.
 Level 1 is preferred for all other applications.
Other RAID Issues
 Hardware vs. software RAID.
 Local storage buffers.
 Hot-swappable disks.
 Spare components: disks, fans, power supplies, controllers.
 Battery backup.
Storage Access

A DBMS will typically have several files allocated to it for storage, which are
(usually) formatted and managed by the DBMS.
 Each file typically maps to either a disk, disk partition or storage device.

Each file is partitioned into blocks (a.k.a. pages).
 A block consists of one or more contiguous sectors.
 A block is the smallest unit of DBMS storage allocation & transfer.

Each block is partitioned into records.

Each record is partitioned into fields.
Buffer Management

The DBMS will transfer blocks of data between RAM and Disk, in a manner similar
to an operating system.

The DBMS seeks to minimize the number of block transfers between the disk and
memory.

The portion of main memory available to store copies of disk blocks is called the
buffer (a.k.a. the cache).

The DBMS subsystem responsible for allocating buffer space in main memory is
called the buffer manager.
Buffer Manager Algorithm

Programs call the buffer manager when they need a (disk) block.

If a requested block is already in the buffer then the requesting program is given
the address of the block in main memory.

If the block is not in the buffer, then the buffer manager:
 Allocates buffer space for the block, throwing out another block if required.
 Writes the thrown-out block to disk if it has been modified since it was last retrieved.
 Reads the requested block from disk into the buffer once space is available.
 Passes the address of the block in main memory to the requester.
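
A minimal sketch in Python of this algorithm with LRU eviction (all names invented; a real buffer manager also handles pinning, latching, and concurrency):

    from collections import OrderedDict

    class BufferManager:
        """Toy buffer manager: fetch blocks, evict LRU, write back dirty blocks."""
        def __init__(self, disk, capacity):
            self.disk = disk                    # block_id -> bytes (stands in for real I/O)
            self.capacity = capacity
            self.frames = OrderedDict()         # block_id -> [data, dirty_flag]

        def get_block(self, block_id):
            if block_id in self.frames:                    # hit: return in-memory copy
                self.frames.move_to_end(block_id)
                return self.frames[block_id][0]
            if len(self.frames) >= self.capacity:          # miss with full buffer: evict LRU
                victim_id, (data, dirty) = self.frames.popitem(last=False)
                if dirty:                                  # write back only if modified
                    self.disk[victim_id] = data
            self.frames[block_id] = [self.disk[block_id], False]   # read the block in
            return self.frames[block_id][0]

    disk = {i: f"block-{i}".encode() for i in range(10)}
    bm = BufferManager(disk, capacity=3)
    for b in (0, 1, 2, 0, 3):      # block 1 is the LRU victim when block 3 arrives
        bm.get_block(b)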
Buffer-Replacement Policies

How is a block selected for replacement?

Most operating systems use Least Recently Used (LRU).

In a DBMS, LRU can be a bad strategy for certain access patterns.

Queries have well-defined access patterns (such as sequential scans), and so a
DBMS can predict future references (an OS cannot).
 Mixed strategies are common.
 Statistical information can be exploited.
 Well-known access patterns can be exploited.

Well-known access patterns:
 The data dictionary is frequently accessed, so keep those blocks in the buffer.
 Index blocks are used frequently, so keep them in the buffer.
Buffer-Replacement Policies, Cont.
Some terminology...

A memory block that has been loaded into the buffer and is not allowed to be
replaced is said to be pinned.

At any given time a buffer block is in one of the following states:
 Unused (free) – does not contain the copy of a block from disk.
 Used, but not pinned – contains the copy of a block from disk, which is available for replacement.
 Pinned – contains the copy of a block from disk, but which is not available for replacement.
Buffer-Replacement Policies, Cont.

Page replacement strategies:
 Most recently used (MRU) – The moment a disk block in the buffer becomes unpinned, it becomes the
most recently used block.
 Least recently used (LRU) – Of all unpinned disk blocks in the buffer, the one referenced least recently.
Buffer-Replacement Policies, Cont.

Recall the borrower and customer relations:
borrower = (customer-name, loan-number)
customer = (customer-name, customer-street, customer-city)

Consider a query that performs a join of borrower and customer:
select customer-name, loan-number, customer-street, customer-city
from borrower, customer
where borrower[customer-name] = customer[customer-name];

Suppose that borrower consists of blocks b1, b2, and b3, that customer consists of
blocks c1, c2, c3, c4, and c5, and that the buffer can fit five (5) blocks total.
Buffer-Replacement Policies, Cont.
 Now consider the following query plan for the query:
Use a nested-loop join algorithm:

    for each tuple b of borrower do                // scan borrower sequentially
        for each tuple c of customer do            // scan customer sequentially
            if b[customer-name] = c[customer-name] then begin
                x[customer-name] := b[customer-name];
                x[loan-number] := b[loan-number];
                x[customer-street] := c[customer-street];
                x[customer-city] := c[customer-city];
                include x in the result of the query;
            end if;
        end for;
    end for;

Use LRU for block replacement:
 − Read in a block of borrower, and keep it (i.e., pin it) in the buffer until the last tuple in that block has been processed.
 − Once all of the tuples from that block have been processed, unpin it.
 − If a customer block is to be brought into the buffer and another block must be moved out to make room, move out the least recently used block (borrower or customer).
Buffer-Replacement Policies, Cont.

Applying the LRU replacement algorithm results in the following for the first two
tuples of b1:
    Buffer (b1 pinned)    Block needed    Block evicted (LRU)
    b1 c1 c2 c3 c4        c5              c1
    b1 c2 c3 c4 c5        c1              c2
    b1 c3 c4 c5 c1        c2              c3
    b1 c4 c5 c1 c2        c3              c4
    b1 c5 c1 c2 c3        c4              c5
    b1 c1 c2 c3 c4        c5              c1

(c1 through c4 are initially read into empty buffer frames; replacements begin once the buffer is full.)

Notice that each time a customer block is read into the buffer, the block replaced is
the next block required!

This results in a total of six (6) page replacements during the processing of the first
two borrower tuples.
Buffer-Replacement Policies, Cont.

Suppose an MRU page replacement strategy is used:
 Read in a block of borrower, and keep it, i.e., pin it, in the buffer until the last tuple in that block has been
processed.
 Once all of the tuples from that block have been processed, unpin it.
 If a customer block is to be brought into the buffer, and another block must be moved out in order to
make room, then move out the most recently used block (borrower, or customer).
Buffer-Replacement Policies, Cont.

Applying the MRU replacement algorithm results in the following for the first two
tuples of b1:
    Buffer (b1 pinned)    Block needed    Result
    b1 c1 c2 c3 c4        c5              evict c4 (MRU)
    b1 c1 c2 c3 c5        c1, c2, c3      hits, no replacement
    b1 c1 c2 c3 c5        c4              evict c3 (MRU)
    b1 c1 c2 c4 c5        c5              hit, no replacement
This results in a total of two (2) page replacements for processing the first two
borrower tuples.
Storage Mechanisms

Traditional “low-level” storage DBMS issues:
 Avoiding large numbers of block reads - algorithmically, the bar has been raised
 Records crossing block boundaries
 Dangling pointers
 Disk fragmentation
 Allocating free space

Storage “mechanisms,” which provide partial solutions:
 Record packing
 Free/reserved space lists
 Reserved space
 Record splitting
 Mixed block formatting
 Slotted-page structure
Storage Mechanisms

Recall:
 A database is stored as a collection of files.
 A file consists of a sequence of blocks.
 A block consists of a sequence of records.
 A record consists of a sequence of fields.

Assumptions:
 Record size may be fixed or variable.
 Record size is usually small relative to the block size, but might be larger.
 Each file may have records of one particular type, or different files may be used for different relations.
 Sometimes tuples need to be sorted, other times not.

Which assumptions apply determines which techniques are appropriate.
Record Packing

Record “Packing” - basic approach:
 Pack records as tightly as possible.

Works best for fixed-size records:
 Store record i, where i ≥ 1, starting from byte n × (i − 1), where n is the record size.
 Record access is simple, given i.
 Add each new record to the end.

Options for deletion of record i:
 Move records i + 1, …, n to i, …, n − 1 (shift)
 Move record n to i
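
A minimal sketch in Python of fixed-size record packing (record size assumed): offset arithmetic for access, plus the move-the-last-record deletion option.

    RECORD_SIZE = 16   # n, assumed

    def read_record(buf, i):
        """Record i (i >= 1) starts at byte n * (i - 1)."""
        start = RECORD_SIZE * (i - 1)
        return bytes(buf[start:start + RECORD_SIZE])

    def delete_record(buf, i, n_records):
        """Delete record i by moving record n into its slot (no shifting)."""
        buf[RECORD_SIZE * (i - 1):RECORD_SIZE * i] = read_record(buf, n_records)
        del buf[RECORD_SIZE * (n_records - 1):]   # drop the now-duplicate last slot
        return n_records - 1

    buf = bytearray(b"".join(f"rec{i}".encode().ljust(RECORD_SIZE) for i in (1, 2, 3)))
    delete_record(buf, 1, 3)        # record 3 now occupies record 1's slot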
Record Packing

For mixed or variable-sized records:
 Attach an end-of-record delimiter (e.g., ⊥) to the end of each record.
 Pack the records as tightly as possible.

Issues:
 Deletion, insertion and record growth result in 1) lots of record movement or 2) fragmentation.
 Record movement potentially results in dangling pointers.
Free Lists

Free lists – connect free space (or used space) into a linked list.

Advantages:
 Insertion & deletion can be done with few block I/Os (assuming tuples aren’t sorted).
 Records don’t have to move; eliminates dangling pointers.

Free lists are used by virtually all DBMSs at the file level.
Reserved Space

Reserved space – use the maximum required space for each record:
 Can be used for repeating fields.
 Also with varying length attributes, e.g., a name field.

Issues:
 Could waste significant space.
Record Splitting

Record Splitting:
 A single variable-length record is stored as collection of smaller records, linked with pointers.
 Can be used even if the maximum record length is not known.
 Once again, pointer chains can be used for free and occupied records.

Disadvantages:
 Records are more likely to cross block boundaries
 Space is wasted in all records except the first in a chain
Mixed Block Formatting

Mixed Block Formatting – format blocks differently (typically in different files):
 Anchor block
 Overflow block

Records are likely to cross block boundaries.

 Frequently used for very large attributes such as BLOBs, text, etc.
Slotted Page Structure

A Slotted Page (i.e., block) is formatted with three sections:
 Header
 Free-space
 Records

Header contents:
 Number of record entries
 Pointer to the end of free space in the block
 Location and size of each record

Key points:
 External pointers refer to header entries rather than directly to records.
 Records can be moved within a page to eliminate empty space; header must be updated.
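
A minimal sketch in Python of a slotted page (layout and field widths assumed): the header grows from the front, records from the back, and the header entry is the record's stable address.

    import struct

    PAGE_SIZE = 4096
    HDR = struct.Struct("<HH")     # per-page header: record count, end of free space
    SLOT = struct.Struct("<HH")    # per-record entry: offset, size

    def new_page():
        page = bytearray(PAGE_SIZE)
        HDR.pack_into(page, 0, 0, PAGE_SIZE)   # 0 records; free space ends at page end
        return page

    def insert(page, record):
        count, free_end = HDR.unpack_from(page, 0)
        start = free_end - len(record)          # records grow from the back
        page[start:free_end] = record
        SLOT.pack_into(page, HDR.size + count * SLOT.size, start, len(record))
        HDR.pack_into(page, 0, count + 1, start)
        return count                            # slot number: what external pointers reference

    def fetch(page, slot):
        off, size = SLOT.unpack_from(page, HDR.size + slot * SLOT.size)
        return bytes(page[off:off + size])

    page = new_page()
    s = insert(page, b"a record")
    assert fetch(page, s) == b"a record"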
File Organization

More sophisticated techniques are typically used in combination with the preceding storage
mechanisms:
 Hashing
 Sequential/sorted
 Clustering

These techniques can improve query performance substantially.
Sequential File Organization

Records in a file are ordered by a search-key.

Good for point queries and range queries.

Maintaining physical sequential order is difficult with insertions & deletions.
Sequential File Organization

Use pointers to keep track of logical record order:

Could be used in conjunction with a free-list.
Sequential File Organization

Insertion – locate the position where the record is to be inserted.
 If there is free space there then insert directly.
 If no free space, insert the record in another location off the free-list.
 In either case, pointer chain must be updated.

Deletion – maintain pointer chains in the obvious way.

Main issue – the file may degenerate over time; reorganize the file from time to time to restore sequential order.
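
A minimal sketch in Python of pointer-chain insertion (structures invented): records stay where they are physically, while the chain preserves search-key order.

    class Rec:
        def __init__(self, key):
            self.key = key
            self.next = None   # pointer to the next record in search-key order

    def insert(head, rec):
        """Splice rec into the chain in key order; returns the (possibly new) head."""
        if head is None or rec.key < head.key:
            rec.next = head
            return rec
        cur = head
        while cur.next is not None and cur.next.key < rec.key:
            cur = cur.next
        rec.next, cur.next = cur.next, rec   # splice in; no data movement
        return head

    head = None
    for k in (30, 10, 20):     # arrival order differs from key order
        head = insert(head, Rec(k))
    # chain now reads 10 -> 20 -> 30 regardless of physical placement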
Clustering File Organization

A clustered file organization stores several relations in one file.

Clustered organization of customer and depositor:

[Diagram: customer and depositor records with the same customer-name stored near each other in one file.]

Good for some queries, e.g., those involving:
 the join depositor ⋈ customer
 an individual customer and their accounts

Bad for others, e.g., queries involving only customer.
Large Objects – A Special Case

Large objects:
 text documents
 images
 computer-aided designs
 audio and video data

Most DBMS vendors provide supporting types:
 binary large objects (blobs)
 character large objects (clobs)
 text, image, etc.

Large objects have high demands:
 Typically stored in a contiguous sequence of bytes when brought into memory.
 If an object is bigger than a block, contiguous blocks in the buffer must be allocated for it.
 Record splitting is commonly used.

Might be preferable to disallow direct access to data using SQL, and only allow
access through a 3rd party file system API.
End of Chapter
Data Dictionary Storage
 The data dictionary (a.k.a. system catalog) stores metadata:
 Information about relations
 • names of relations
 • names and types of attributes
 • names and definitions of views
 • integrity constraints
 Physical file organization information
 • How the relation is stored (sequential, hash, etc.)
 • Physical location of the relation
 – operating system file name, or
 – disk addresses of blocks containing the records
 Statistical and descriptive data
 • number of tuples in each relation
 User and accounting information, including passwords
 Information about indices (Chapter 12)
Data Dictionary Storage (Cont.)

Catalog structure can use either:
 specialized data structures designed for efficient access
 a set of relations, with existing system features used to ensure efficient access
The latter alternative is the standard.

A possible catalog representation:
Relation-metadata = (relation-name, number-of-attributes, storage-organization, location)
Attribute-metadata = (attribute-name, relation-name, domain-type, position, length)
User-metadata = (user-name, encrypted-password, group)
Index-metadata = (index-name, relation-name, index-type, index-attributes)
View-metadata = (view-name, definition)