Overview of Storage and Indexing

Chapter 9, Disks and Files


- The Storage Hierarchy
- Disks
  - Mechanics
  - Performance
  - RAID
- Disk Space Management
- Buffer Management
- Files of Records
  - Format of a Heap File
  - Format of a Data Page
  - Format of Records
3/24/2016
PSU’s CS 587
1
Learning objectives
- Given disk parameters, compute storage needs and read times.
- Given a reminder about what each level means, be able to derive any figures on the RAID performance slide.
- Describe the pros and cons of alternative structures for files, pages and records.
A (Very) Simple Hardware Model
[Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge; a memory bus links the bridge to main memory; an I/O bus connects the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.]
Storage Options
Level               Capacity           Access Time      Cost
Registers           1k-2k bytes        1 Tc             way expensive
Caches              10s-1000s KBytes   2-20 Tc          $10 / MByte
Main Memory         GBytes             300-1000 Tc      $0.03 / MB (eBay)
Hard Disk / Flash   100s GBytes        10 ms = 30M Tc   $0.10 / GB (eBay)
Tape                infinite           forever          way cheap

(Tc = processor clock tick.)

[Figure: relative memory bandwidth improvement vs. relative latency improvement (log-log scale, 1 to 10,000) for Processor, Network, Memory, and Disk; the reference line marks latency improvement = bandwidth improvement. Bandwidth improves much faster than latency.]
Memory “Hierarchy”
From upper level (faster, smaller) to lower level (slower, larger):

- Registers: 1k-2k bytes, 1 Tc, way expensive.
  Staging unit: instruction operands (1-8 bytes), managed by prog./compiler.
- Cache (SDRAM; may be multiple levels!): 10s-1000s KBytes, 2-20 Tc, $10 / MByte.
  Staging unit: blocks (8-128 bytes), managed by the cache cntl.
- Memory (DRAM): GBytes, 300-1000 Tc, $0.03 / MB (eBay).
  Staging unit: pages (4K+ bytes), managed by the OS.
- Disk: 100s GBytes, 10 ms = 30M Tc, $0.10 / GB (eBay).
  Staging unit: files (GBytes), managed by the user/operator.
- Tape: infinite capacity, takes forever, way cheap.
Why Does “Hierarchy” Work?

- Locality: programs access a relatively small portion of the address space at any instant of time.
- Two different types:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
9.1 The Memory Hierarchy

Typical storage hierarchy as used by an RDBMS:

- Primary storage: main memory (RAM) for currently used data.
- Secondary storage: disk, flash memory for the main database.
  - http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf
  - What are other reasons besides cost to use disk?
- Tertiary storage: tapes, DVDs for archiving older versions of the data.
- Other factors:
  - Caches at every level
  - Controllers, protocols
  - Network connections
What is FLASH Memory, Anyway?

- Floating-gate transistor
  - Presence of charge => "0"
  - Erased electrically or with UV (EPROM)
- Performance
  - Reads like DRAM (~ns)
  - Writes like disk (~ms); a write is a complex operation
Components of a Disk
[Figure: disk components: spindle, platters, disk head, arm assembly, arm movement, tracks, sectors.]

- The platters are always spinning (say, 120 rps).
- Only one head reads/writes at any one time.
- To read a record:
  - position the arm (seek)
  - engage the head
  - wait for the data to spin by
  - read (transfer data)
More terminology
- Each track is made up of fixed-size sectors.
- Page size is a multiple of sector size.
- A platter typically has data on both surfaces.
- All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!).
Disks Technology Background









                 CDC Wren I, 1983   Seagate 373453, 2003           Improvement
Rotation speed   3600 RPM           15,000 RPM                     (4X)
Capacity         0.03 GBytes        73.4 GBytes                    (2500X)
Tracks/Inch      800                64,000                         (80X)
Bits/Inch        9,550              533,000                        (60X)
Platters         Three 5.25"        Four 2.5" (3.5" form factor)
Bandwidth        0.6 MBytes/sec     86 MBytes/sec                  (140X)
Latency          48.3 ms            5.7 ms                         (8X)
Cache            none               8 MBytes
Typical Disk Drive Statistics (2008)
Sector size: 512 bytes

Seek time:
  Average: 4-10 ms
  Track to track: 0.6-1.0 ms

Average rotational delay: 3 to 5 ms
  (rotational speed 10,000 RPM to 5,400 RPM)

Transfer time (sustained data rate): 0.1-0.3 msec per 8K page, or 25-75 MB/second

Density: 12-18 GB/in2
Disk Capacity

- Capacity: the maximum number of bits that can be stored.
  - Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes.
- Capacity is determined by:
  - Recording density (bits/in): the number of bits that can be squeezed into a 1-inch segment of a track.
  - Track density (tracks/in): the number of tracks that can be squeezed into a 1-inch radial segment.
  - Areal density (bits/in2): the product of recording and track density.
- Modern disks partition tracks into disjoint subsets called recording zones.
  - Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
  - Each zone has a different number of sectors/track.
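Capacity follows directly from these densities and the drive geometry: it is just the product of the factors. A minimal sketch in Python (the geometry numbers here are hypothetical, chosen for illustration):

```python
def disk_capacity(bytes_per_sector, avg_sectors_per_track,
                  tracks_per_surface, surfaces_per_platter, platters):
    """Capacity in bytes = product of the geometry factors."""
    return (bytes_per_sector * avg_sectors_per_track *
            tracks_per_surface * surfaces_per_platter * platters)

# Hypothetical drive: 512-byte sectors, 400 sectors/track on average,
# 100,000 tracks/surface, 2 surfaces/platter, 4 platters.
cap = disk_capacity(512, 400, 100_000, 2, 4)
print(cap / 10**9, "GB")   # 163.84 GB (1 GB = 10^9 bytes, as above)
```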
Cost of Accessing Data on Disk

- Time to access (read/write) a disk block:
  - Taccess = Tavg seek + Tavg rotation + Tavg transfer
  - Seek time: moving the arm to position the disk head on the track.
  - Rotational delay: waiting for the block to rotate under the head.
    - Half a rotation, on average.
  - Transfer time: actually moving data to/from the disk surface.
- Key to lower I/O cost: reduce seek/rotation delays!
  - No way to avoid transfer time...
- The textbook measures query cost by the NUMBER of page I/Os.
  - Implies all I/Os have the same cost, and that CPU time is free.
    - This is a common simplification.
  - Real DBMSs (in the optimizer) would consider sequential vs. random disk reads, because sequential reads are much faster, and would count CPU time.
Disk Parameters Practice

- A 2-platter disk rotates at 7,200 rpm. Each track contains 256KB.
  - How many cylinders are required to store an 8-Gigabyte file?
  - What is the average rotational delay, in milliseconds?
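One way to check your answers, assuming 2 recording surfaces per platter and 1 GB = 2^30 bytes (both assumptions, not stated on the slide):

```python
# 2 platters, assumed 2 surfaces each -> 4 tracks per cylinder.
surfaces = 2 * 2
cylinder_bytes = surfaces * 256 * 1024    # 1 MB per cylinder
file_bytes = 8 * 2**30                    # 8 GB file
cylinders = file_bytes // cylinder_bytes
print(cylinders)                          # 8192 cylinders

# Average rotational delay is half a rotation.
avg_rot_ms = (60_000 / 7200) / 2
print(round(avg_rot_ms, 2))               # 4.17 ms
```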
Disk Access Time Example

- Given:
  - Rotational rate = 7,200 RPM
  - Average seek time = 9 ms
  - Avg # sectors/track = 400
- Derived:
  - Tavg rotation = 1/2 x (60 secs / 7,200 RPM) x 1000 ms/sec = 4 ms
  - Tavg transfer = (60 secs / 7,200 RPM) x (1/400 tracks per sector) x 1000 ms/sec = 0.02 ms
  - Taccess = 9 ms + 4 ms + 0.02 ms
- Important points:
  - Access time is dominated by seek time and rotational latency.
  - The first bit in a sector is the most expensive; the rest are free.
  - SRAM access time is about 4 ns/doubleword, DRAM about 60 ns.
    - Disk is about 40,000 times slower than SRAM,
    - and 2,500 times slower than DRAM.
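The derivation above can be scripted directly (note the slide rounds the 4.17 ms rotational delay to 4 ms):

```python
rpm = 7200
avg_seek_ms = 9.0
sectors_per_track = 400

ms_per_rotation = 60_000 / rpm                    # ~8.33 ms per full rotation
t_rotation = ms_per_rotation / 2                  # ~4.17 ms on average
t_transfer = ms_per_rotation / sectors_per_track  # ~0.02 ms per sector
t_access = avg_seek_ms + t_rotation + t_transfer
print(round(t_access, 2))                         # ~13.19 ms
```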
So, How far away is the data?
Measured in clock ticks, and translated into travel distances (Jim Gray's analogy):

Storage                Clock Ticks   As if the data were...
Registers              1             in My Head (1 min)
On Chip Cache          2             in This Room
On Board Cache         10            on This Campus (10 min)
Memory                 100           in Sacramento (1.5 hr)
Disk                   10^6          on Pluto (2 Years)
Tape / Optical Robot   10^9          in Andromeda (2,000 Years)
From http://research.microsoft.com/~gray/papers/AlphaSortSigmod.doc
Block, page and record sizes
- Block: according to the text, the smallest unit of I/O.
- Page: often used in place of block.
- "Typical" record size: commonly hundreds, sometimes thousands of bytes.
  - Unlike the toy records in textbooks.
- "Typical" page size: 4K, 8K.
Effect of page size on read time


- Suppose rotational delay is 4 ms, average seek time 6 ms, transfer speed .5 msec/8K.
- This graph shows the time required to read 1 Gig of data for different page sizes.

[Figure: time to read 1 GB (minutes, 0-25) vs. page size (multiples of 8K, 1-16); the read time falls steeply as the page size grows.]
Why the difference?



- What accounts for the difference, in times to read one Gigabyte, on the previous graph?
- Assume: rotational delay 4 ms, average seek time 6 ms, transfer speed .5 msec/8K.
- Transfer time:
  - (2^30 / 2^13 8K blocks) x (.5 msec/8K) = 66 secs ~= one minute
- How many reads?
  - Page size 8K: there are 2^30 / 2^13 = 2^17 = 128K reads.
  - Page size 64K: there are 1/8th that many reads = 16K reads.
- Time taken by rotational delays and seeks:
  - Each read requires a rotational delay and a seek, totaling 10 msec.
  - 8K: (128K reads) x (10 msec/read) = 1,311 secs ~= 22 minutes
  - 64K: 1/8 of that, or 164 secs ~= 3 minutes
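The curve on the graph can be reproduced with a small model, using the same assumptions (4 ms rotation, 6 ms seek, 0.5 ms transfer per 8K, one seek plus one rotational delay charged per page read):

```python
def read_time_minutes(page_size_8k_multiples, total_bytes=2**30,
                      seek_ms=6.0, rot_ms=4.0, xfer_ms_per_8k=0.5):
    """Total time to read total_bytes using the given page size."""
    page_bytes = page_size_8k_multiples * 8 * 1024
    n_reads = total_bytes / page_bytes
    transfer = (total_bytes / (8 * 1024)) * xfer_ms_per_8k   # fixed cost
    overhead = n_reads * (seek_ms + rot_ms)                  # per-read cost
    return (transfer + overhead) / 60_000

print(round(read_time_minutes(1), 1))   # ~22.9 min with 8K pages
print(round(read_time_minutes(8), 1))   # ~3.8 min with 64K pages
```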
Moral of the Story


- As page size increases, read (and write) time reduces to transfer time, a big savings.
- So why not use a huge page size?
  - Wastes memory space if you don't need all that is read.
  - Wastes read time if you don't need all that is read.
- What applications could use a large page size?
  - Those that sequentially access data.
- The problem with a small page size is that pages get scattered across the disk. Turn the page....
Faster I/O, even with a small page size

- Even if the page size is small, you can achieve fast I/O by storing a file's data as follows:
  - consecutive pages on the same track, followed by
  - consecutive tracks on the same cylinder, followed by
  - consecutive cylinders adjacent to each other.
- The first two incur no seek time or rotational delay; the seek for the third is only one track.
- What is saved with this storage pattern?
- How is this storage pattern obtained?
  - Disk defragmenter and its relatives/predecessors.
    - Also places frequently used files near the spindle.
- When data is in this storage pattern, the application can do sequential I/O.
  - Otherwise it must do random I/O.
More Hardware Issues

- Disk Controllers
  - Interface from disks to the bus.
  - Checksums, remapping bad sectors, driver mgt, etc.
- Interface protocols and MB-per-second xfer rates:
  - IDE/EIDE/ATA/PATA, SATA: 133
  - SCSI: 640
    - BUT for a single device, SCSI is inferior.
  - Faster network technologies such as Fibre Channel.
- Storage Area Networks (SANs)
  - Disk farm networked to servers.
  - Servers can be heterogeneous: a primary advantage.
  - Centralized management.
Dependability
- Module reliability = measure of continuous service accomplishment (or time to failure). Two metrics:
  1. Mean Time To Failure (MTTF) measures reliability.
     - Failures In Time (FIT) = 1/MTTF, the rate of failures.
     - Traditionally reported as failures per billion hours of operation.
  2. Mean Time To Repair (MTTR) measures service interruption.
- Mean Time Between Failures (MTBF) = MTTF + MTTR.
- Module availability measures service as alternating between the two states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9).
  - Module availability = MTTF / (MTTF + MTTR)
Example calculating reliability


- If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules.
- Example: calculate FIT and MTTF for
  - 10 disks (1M hour MTTF per disk),
  - 1 disk controller (0.5M hour MTTF),
  - and 1 power supply (0.2M hour MTTF).
Example calculating reliability

- Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):

  FailureRate = 10 x (1/1,000,000) + 1/500,000 + 1/200,000
              = (10 + 2 + 5) / 1,000,000
              = 17 / 1,000,000
              = 17,000 FIT

  MTTF = 1,000,000,000 / 17,000
       ~= 59,000 hours
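The same computation, scripted (failure rates add for independent components with exponential lifetimes):

```python
# 10 disks, 1 controller, 1 power supply, with their MTTFs in hours.
mttf_hours = [1_000_000] * 10 + [500_000] + [200_000]
failure_rate = sum(1 / m for m in mttf_hours)   # failures per hour
fit = failure_rate * 1e9                        # failures per 10^9 hours
mttf_system = 1e9 / fit                         # system MTTF in hours
print(round(fit))            # 17000 FIT
print(round(mttf_system))    # ~58824 hours (the slide rounds to 59,000)
```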
9.2 RAID [587]

- Disk Array: an arrangement of several disks that gives the abstraction of a single, large disk.
- Goals: increase performance and reliability.
- Two main techniques:
  - Data striping: data is partitioned; the size of a partition is called the striping unit. Partitions are distributed over several disks.
  - Redundancy: more disks => more failures. Redundant information allows reconstruction of data if a disk fails.
Data Striping
- CPUs go fast, disks don't. How can disks keep up?
- CPUs do work in parallel. Can disks?
- Answer: partition data across D disks (see next slide).
- If the partition unit is a page:
  - A single page I/O request is no faster.
  - Multiple I/O requests can run at aggregated bandwidth.
- The number of pages in a partition unit is called the depth of the partition.
- Contrary to the text, partition units of a bit are almost never used and partition units of a byte are rare.
Data Striping (RAID Level 0)
Blocks are assigned round-robin across the D disks:

Disk 0   Disk 1   Disk 2   ...   Disk D-1
0        1        2        ...   D-1
D        D+1      D+2      ...   2D-1
2D       2D+1     2D+2     ...   3D-1
...      ...      ...      ...   ...
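The round-robin layout maps a logical block number to a (disk, stripe row) pair with simple modular arithmetic:

```python
def raid0_location(block, num_disks):
    """RAID 0: logical block b lives on disk b mod D, at stripe row b // D."""
    return (block % num_disks, block // num_disks)

# With D = 4 disks:
print(raid0_location(0, 4))    # (0, 0)
print(raid0_location(5, 4))    # (1, 1)
print(raid0_location(11, 4))   # (3, 2)
```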
Redundancy
- Striping is seductive, but remember reliability!
  - MTTF of a disk is about 6 years.
  - If we stripe over 24 disks, what is the MTTF?
- Solution: redundancy.
  - Parity: corrects single failures.
  - Other codes: detect where the failure is, and correct multiple failures.
    - But failure location is provided by the controller.
    - Redundancy may require more than one check bit.
- Redundancy makes writes slower. Why?
RAID Levels
- Standardized by SNIA (www.snia.org); implementations vary in practice.
- For each level, decide (assume a single user):
  - Number of disks required to hold D disks of data.
  - Speedup s (compared to 1 disk) for S/R (Sequential/Random) R/W (Reads/Writes):
    - Random: each I/O is one block.
    - Sequential: each I/O is one stripe.
  - Number of disks/blocks that can fail without data loss.
- Level 0: block striped, no redundancy.
  - Picture is 2 slides back.
JBOD, RAID Level 1
- JBOD: Just a Bunch of Disks. Each disk is an independent volume, numbered from block 0; there is no striping:

Disk 0   Disk 1   Disk 2   ...   Disk D-1
0        0        0        ...   0
1        1        1        ...   1
2        2        2        ...   2
3        3        3        ...   3
...      ...      ...      ...   ...

- Level 1: Mirrored (two identical JBODs - no striping).
RAID Level 0+1: Stripe + Mirror
Stripe across D disks as in Level 0, and keep an identical mirrored copy on D more disks:

Disk 0     Disk 1     Disk 2     ...   Disk D-1
0          1          2          ...   D-1
D          D+1        D+2        ...   2D-1
2D         2D+1       2D+2       ...   3D-1
...        ...        ...        ...   ...

Disk D     Disk D+1   Disk D+2   ...   Disk 2D-1   (mirror copy)
0          1          2          ...   D-1
D          D+1        D+2        ...   2D-1
2D         2D+1       2D+2       ...   3D-1
...        ...        ...        ...   ...
RAID Level 4
- Block-Interleaved Parity (not common)
  - One check disk, uses one bit of parity.
  - How to tell if there is a failure, or which disk failed?
  - Read-modify-write.
  - Disk D is a bottleneck.

Disk 0   Disk 1   Disk 2   ...   Disk D-1   Disk D
0        1        2        ...   D-1        P
D        D+1      D+2      ...   2D-1       P
2D       2D+1     2D+2     ...   3D-1       P
...      ...      ...      ...   ...        P
RAID Level 5

- Level 5: Block-Interleaved Distributed Parity. The parity block rotates across the disks:

Disk 0   Disk 1   ...   Disk D-2   Disk D-1   Disk D
0        1        ...   D-2        D-1        P
D        D+1      ...   2D-2       P          2D-1
2D       2D+1     ...   P          3D-2       3D-1
...      ...      ...   ...        ...        ...

- Level 6: like 5, but 2 parity bits/disks.
  - Can survive the loss of 2 disks/blocks.
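Parity here is just XOR across the data blocks in a stripe, so any single lost block can be rebuilt by XOR-ing the survivors with the parity block. A sketch:

```python
from functools import reduce

def parity(blocks):
    """XOR the blocks byte-wise to form a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

stripe = [b"\x0f\x00", b"\xf0\x01", b"\xaa\x02"]
p = parity(stripe)

# Disk 1 fails: rebuild its block from the surviving data blocks plus parity.
rebuilt = parity([stripe[0], stripe[2], p])
print(rebuilt == stripe[1])   # True
```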
Notation on the next slide

- #Disks
  - Number of disks required to hold D disks' worth of data using this RAID level.
- Read/write speedup of blocks in a single file:
  - SR: Sequential Read
  - RR: Random Read
  - SW: Sequential Write
  - RW: Random Write
- Failure Tolerance
  - How many disks can fail without loss of data.
- Internal data:
  - s = blocks transferred in the time it takes to transfer one block of data from one disk.
  - These numbers are theoretical!
    - YMMV... and vary significantly!
RAID Performance
Level   #Disks   SR speedup   RR speedup     SW speedup   RW speedup      Failure Tolerance
0       D        s = D        1 <= s <= D    s = D        1 <= s <= D     0
1       2D       s = 2        s = 2          s = 1**      s = 1**         D*
0+1     2D       s = 2D       2 <= s <= 2D   s = D**      1 <= s <= D**   D*
5       D+1      s = D        1 <= s <= D    varies       varies          1

* If no two failed disks are copies of each other.
** Note: can't write both mirrors at once. Why?
Small Writes on Levels 4 and 5
- Levels 4 and 5 require a read-modify-write cycle for all writes, since the parity block must be read and modified.
- On small writes this can be very expensive.
- This is another justification for log-based file systems (see your OS course).
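Because parity is the XOR of the stripe, a small write need not touch the other data disks: read the old data block and old parity, then the new parity is old parity XOR old data XOR new data (2 reads + 2 writes total). A sketch:

```python
def update_parity(old_parity, old_block, new_block):
    """Small write on RAID 4/5: new parity = old parity ^ old data ^ new data."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))

d0, d1, d2 = b"\x01", b"\x02", b"\x04"
par = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))   # full-stripe parity

new_d1 = b"\x0f"
par2 = update_parity(par, d1, new_d1)

# Same result as recomputing parity over the whole stripe:
print(par2 == bytes(a ^ b ^ c for a, b, c in zip(d0, new_d1, d2)))  # True
```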
Which RAID Level is best?

- If data loss is not a problem: Level 0.
- If storage cost is not a problem: Level 0+1.
- Else: Level 5.
- Software support:
  - Linux: 0, 1, 4, 5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html)
  - Windows: 0, 1, 5 (http://www.techimo.com/articles/index.pl?photo=149)
9.3, 9.4.1: Covered earlier
9.4.2 DBMS vs. OS File System

- The OS does disk space & buffer mgmt: why not let the OS manage these tasks? [715]
- Differences in OS support: portability issues.
- Some limitations, e.g., files can't span disks.
- Buffer management in a DBMS requires the ability to:
  - pin a page in the buffer pool, and force a page to disk (important for implementing CC & recovery),
  - adjust the replacement policy, and pre-fetch pages based on access patterns in typical DB operations.
    - Sometimes MRU is the best replacement policy: for example, for a scan or a loop that does not fit.
9.5 Files of Records

- Page or block is OK when doing I/O, but higher levels of the DBMS operate on records, and files of records.
- FILE: a collection of pages, each containing a collection of records. Must support:
  - insert/delete/modify record
  - read a particular record (specified using record id)
  - scan all records (possibly with some conditions on the records to be retrieved)
9.5.1 Unordered (Heap) Files

- Simplest file structure: contains records in no particular order.
- As the file grows and shrinks, disk pages are allocated and de-allocated.
- To support record-level operations, we must:
  - keep track of the pages in a file
  - keep track of free space on pages
  - keep track of the records on a page
- There are at least two alternatives for keeping track of heap files.
Heap File Implemented as a List

[Figure: a header page anchors two doubly linked lists of data pages: one of full pages, one of pages with free space.]

- The header page id and the heap file name must be stored someplace.
- Each page contains 2 "pointers" plus data.
Heap File Using a Page Directory

[Figure: a directory (itself a collection of pages, starting from a header page) holds one entry per data page, for data pages 1 through N.]

- The entry for a page can include the number of free bytes on the page.
- The directory is a collection of pages; a linked list implementation is just one alternative.
- Much smaller than a linked list of all heap file pages!
Comparing Heap File Implementations

- Assume:
  - 100 directory entries per page.
  - U full pages, E pages with free space.
  - D directory pages. Then D = (U+E)/100.
  - Note that D is two orders of magnitude less than U or E.
- Cost to find a page with enough free space:
  - List: E/2. Directory: (D/2) + 1.
- Cost to move a page from Full to Free (e.g., when a record is deleted):
  - List: 3. Directory: 1.
- Can you think of some other operations?
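Plugging in concrete numbers shows how lopsided the comparison is. A quick check with hypothetical file sizes:

```python
def free_space_search_cost(full_pages, free_pages, entries_per_dir_page=100):
    """Average page I/Os to find a page with free space: list vs. directory."""
    d = -(-(full_pages + free_pages) // entries_per_dir_page)  # ceiling division
    return {"list": free_pages / 2, "directory": d / 2 + 1}

# Hypothetical heap file: 40,000 full pages, 10,000 pages with free space.
print(free_space_search_cost(40_000, 10_000))
# {'list': 5000.0, 'directory': 251.0}
```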
9.6 Page Formats: Fixed Length Records

[Figure: two page layouts. PACKED: records occupy slots 1..N contiguously, free space follows, and the page ends with N, the number of records. UNPACKED, BITMAP: slots 1..M may be empty; a bitmap (e.g., 1 ... 0 1 1) records which slots are occupied, and the page ends with M, the number of slots.]
Packed vs Unpacked Page Formats

- Record ID (RID, TID) = (page#, slot#), in all page formats.
  - Note that indexes are filled with RIDs.
  - Data entries in alternatives 2 and 3 are (key, RID, ...).
- Packed:
  - stores more records;
  - RIDs change when a record is deleted.
    - This may not be acceptable.
- Unpacked:
  - RID does not change;
  - less data movement when deleting.
Page Formats: Variable Length Records

[Figure: slotted page i. Records are stored at the front of the page; a slot directory at the other end holds one (offset, length) entry per record, a slot count N, and a pointer to the start of free space. Rid = (i, slot#): e.g., (i,1), (i,2), (i,N).]
Slotted Page Format


- The intergalactic standard, used for fixed-length records also.
- How to deal with free space fragmentation?
  - Pack records, lazily.
  - Note that RIDs don't change.
- How are updates handled that expand the size of a record?
  - A forwarding flag points to the new location.
- See:
  - http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html
  - postgresql-8.3.1\src\include\storage\bufpage.h
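A minimal in-memory sketch of the slotted page idea (simplified: the slot directory is a Python list rather than bytes at the end of the page, and freed space is only reclaimed lazily, as above):

```python
class SlottedPage:
    """Toy slotted page: records packed at the front; the slot directory
    holds (offset, length). A deleted slot is marked (None, 0), so the
    RIDs of the other records never change."""
    def __init__(self, size=8192):
        self.data = bytearray(size)
        self.free = 0            # pointer to start of free space
        self.slots = []          # slot directory: (offset, length)

    def insert(self, record: bytes) -> int:
        off = self.free
        self.data[off:off + len(record)] = record
        self.free += len(record)
        self.slots.append((off, len(record)))
        return len(self.slots) - 1          # slot number (part of the RID)

    def read(self, slot: int) -> bytes:
        off, length = self.slots[slot]
        return bytes(self.data[off:off + length])

    def delete(self, slot: int):
        self.slots[slot] = (None, 0)        # space reclaimed lazily

page = SlottedPage()
rid_a = page.insert(b"alice")
rid_b = page.insert(b"bob")
page.delete(rid_a)
print(page.read(rid_b))   # b'bob' -- bob's RID survives alice's deletion
```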
9.7 Record Formats: Fixed Length

[Figure: a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4. With base address B, the address of F3 is B + L1 + L2.]

- Information about field types is the same for all records in a file; it is stored in the system catalogs.
- Finding the i'th field does not require a scan of the record.
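Since the field lengths come from the catalog, field addresses are just prefix sums of the lengths. A sketch (the lengths below are hypothetical):

```python
from itertools import accumulate

def field_offsets(lengths):
    """Offset of field i = sum of the lengths of the fields before it."""
    return [0] + list(accumulate(lengths))[:-1]

# Hypothetical catalog entry: four fixed-length fields with lengths L1..L4.
L = [4, 8, 2, 16]
print(field_offsets(L))   # [0, 4, 12, 14] -- F3 is at B + L1 + L2 = B + 12
```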
Record Formats: Variable Length

- Two alternative formats (# fields is fixed):

[Figure: (1) Fields delimited by special symbols: a field count (4), then F1 $ F2 $ F3 $ F4 $. (2) Array of field offsets: a header of offsets pointing to the start of each field F1..F4.]

- The second offers direct access to the i'th field and efficient storage of nulls (a special "don't know" value), with small directory overhead.
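The offset-array format can be sketched as follows. This is one possible encoding, not the textbook's exact layout: a header of 4-byte end offsets, with a null field marked by repeating the previous offset (zero length):

```python
import struct

def pack_record(fields):
    """Offset-array format: header of end offsets, then the field bytes.
    A None field repeats the previous offset (zero length = null)."""
    body, ends = b"", []
    for f in fields:
        if f is not None:
            body += f
        ends.append(len(body))
    return struct.pack(f"<{len(ends)}I", *ends) + body

def get_field(record, num_fields, i):
    """Direct access to field i: no scan of the earlier fields needed."""
    ends = struct.unpack_from(f"<{num_fields}I", record)
    header = 4 * num_fields
    start = ends[i - 1] if i > 0 else 0
    return record[header + start: header + ends[i]] or None

rec = pack_record([b"joe", None, b"portland"])
print(get_field(rec, 3, 0))   # b'joe'
print(get_field(rec, 3, 1))   # None (null field)
print(get_field(rec, 3, 2))   # b'portland'
```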