Chapter 9, Disks and Files

Outline: The Storage Hierarchy • Disks: Mechanics, Performance • RAID • Disk Space Management • Buffer Management • Files of Records • Format of a Heap File • Format of a Data Page • Format of Records

Learning Objectives
• Given disk parameters, compute storage needs and read times
• Given a reminder about what each level means, be able to derive any figure on the RAID performance slide
• Describe the pros and cons of alternative structures for files, pages and records

A (Very) Simple Hardware Model
[Figure: the CPU chip (register file, ALU) connects through its bus interface to the system bus; an I/O bridge joins the system bus to the memory bus (main memory) and to the I/O bus, which carries the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.]

Storage Options

  Level              Capacity           Access Time      Cost
  Registers          1K-2K bytes        1 Tc             Way expensive
  Caches             10s-1000s of KB    2-20 Tc          $10/MByte
  Main Memory        GBytes             300-1000 Tc      $0.03/MB (eBay)
  Hard Disk / Flash  100s of GBytes     10 ms = 30M Tc   $0.10/GB (eBay)
  Tape               Infinite           Forever          Way cheap

(Tc = one processor clock tick.)
[Chart: relative bandwidth improvement vs. relative latency improvement for processors, networks, memory and disks; in every case latency improves far more slowly than bandwidth.]

Memory "Hierarchy"
The same levels, annotated with the staging transfer size between adjacent levels and who manages the movement:
• Registers <-> Cache: instructions and operands; 1-8 bytes; program/compiler
• Cache (SRAM, may be multiple levels) <-> Memory (DRAM): blocks; 8-128 bytes; cache controller
• Memory <-> Disk: pages; 4K+ bytes; OS
• Disk <-> Tape: files; GBytes; user/operator
Upper levels are faster; lower levels are larger.

Why Does the "Hierarchy" Work?
Locality: a program accesses a relatively small portion of the address space at any instant of time. Two different types:
• Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
• Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

9.1 The Memory Hierarchy
Typical storage hierarchy as used by an RDBMS:
• Primary storage: main memory (RAM) for currently used data
• Secondary storage: disk and flash memory for the main database
  (See http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf. What are other reasons besides cost to use disk?)
• Tertiary storage: tapes and DVDs for archiving older versions of the data
Other factors: caches at every level; controllers and protocols; network connections.

What is FLASH Memory, Anyway?
• A floating-gate transistor; presence of charge => "0"
• Erased electrically, or with UV light (EPROM)
• Performance: reads like DRAM (~ns), writes like disk (~ms); a write is a complex operation

Components of a Disk
• Platters are always spinning (say, 120 rps)
• The arm assembly moves the disk heads in and out; one head reads/writes at any one time
• To read a record: position the arm (seek), engage the head, wait for the data to spin by, then read (transfer the data)
[Figure: spindle, platters, tracks, sectors, and the disk heads on the moving arm assembly.]

More Terminology
• Each track is made up of fixed-size sectors; page size is a multiple of sector size
• A platter typically has data on both surfaces
• All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!)
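To tie the terminology together, here is a minimal sketch of how sectors, tracks, surfaces and cylinders add up to capacity. Every geometry number below is an assumption invented for illustration (real drives use recording zones, so sectors per track is not actually constant):

```python
# Hypothetical disk geometry -- all values are assumptions for illustration.
SECTOR_BYTES       = 512       # bytes per sector
SECTORS_PER_TRACK  = 400       # pretend it's constant (real zones vary)
TRACKS_PER_SURFACE = 50_000
PLATTERS           = 2
SURFACES           = 2 * PLATTERS         # data on both sides of each platter

track_bytes    = SECTOR_BYTES * SECTORS_PER_TRACK
cylinder_bytes = track_bytes * SURFACES   # same track position on every surface
disk_bytes     = cylinder_bytes * TRACKS_PER_SURFACE

print(f"track:    {track_bytes / 2**10:.0f} KB")     # 200 KB
print(f"cylinder: {cylinder_bytes / 2**10:.0f} KB")  # 800 KB
print(f"disk:     {disk_bytes / 2**30:.1f} GB")      # ~38 GB
```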
Disk Technology Background

                   CDC Wren I, 1983   Seagate 373453, 2003               (improvement)
  Rotation speed   3600 RPM           15,000 RPM                         (4X)
  Capacity         0.03 GBytes        73.4 GBytes                        (2500X)
  Tracks/inch      800                64,000                             (80X)
  Bits/inch        9,550              533,000                            (60X)
  Platters         three 5.25"        four 2.5" (in 3.5" form factor)
  Bandwidth        0.6 MBytes/sec     86 MBytes/sec                      (140X)
  Latency          48.3 ms            5.7 ms                             (8X)
  Cache            none               8 MBytes

Typical Disk Drive Statistics (2008)
• Sector size: 512 bytes
• Seek time: average 4-10 ms; track-to-track 0.6-1.0 ms
• Average rotational delay: 3 to 5 ms (rotational speeds 10,000 RPM down to 5,400 RPM)
• Transfer time: sustained data rate of 0.3 down to 0.1 ms per 8K page, i.e., 25-75 MB/second
• Density: 12-18 GB/in²

Disk Capacity
Capacity: the maximum number of bits that can be stored, expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes. Capacity is determined by:
• Recording density (bits/in): the number of bits that can be squeezed into a 1-inch segment of a track
• Track density (tracks/in): the number of tracks that can be squeezed into a 1-inch radial segment
• Areal density (bits/in²): the product of recording density and track density
Modern disks partition tracks into disjoint subsets called recording zones:
• Each track in a zone has the same number of sectors, determined by the circumference of the innermost track
• Each zone has a different number of sectors/track

Cost of Accessing Data on Disk
Time to access (read/write) a disk block:
  Taccess = Tavg seek + Tavg rotation + Tavg transfer
• Seek time: moving the arm to position the disk head on the track
• Rotational delay: waiting for the block to rotate under the head (half a rotation, on average)
• Transfer time: actually moving the data to/from the disk surface
The key to lower I/O cost is to reduce seek/rotation delays; there is no way to avoid transfer time.
The textbook measures query cost by the NUMBER of page I/Os. This implies that all I/Os have the same cost and that CPU time is free. That is a common simplification; a real DBMS (in its optimizer) would distinguish sequential from random disk reads, because sequential reads are much faster, and would count CPU time.

Disk Parameters Practice
A 2-platter disk rotates at 7,200 RPM. Each track contains 256KB.
• How many cylinders are required to store an 8-gigabyte file?
• What is the average rotational delay, in milliseconds?

Disk Access Time Example
Given: rotational rate 7,200 RPM; average seek time 9 ms; average 400 sectors/track.
Derived:
  Tavg rotation = 1/2 × (60 s / 7200 rotations) × 1000 ms/s ≈ 4 ms
  Tavg transfer = (60 s / 7200 rotations) × (1/400 of a track) × 1000 ms/s ≈ 0.02 ms
  Taccess = 9 ms + 4 ms + 0.02 ms
Important points:
• Access time is dominated by seek time and rotational latency
• The first bit in a sector is the most expensive; the rest are free
• SRAM access time is about 4 ns per doubleword, DRAM about 60 ns: disk is about 40,000 times slower than SRAM and 2,500 times slower than DRAM

So, How Far Away Is the Data?

  Location             Clock ticks   As if it were in...   Travel time
  Registers            1             My head               1 min
  On-chip cache        2             This room
  On-board cache       10            This campus           10 min
  Memory               100           Sacramento            1.5 hr
  Disk                 10^6          Pluto                 2 years
  Tape/optical robot   10^9          Andromeda             2,000 years

From http://research.microsoft.com/~gray/papers/AlphaSortSigmod.doc
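Here is a worked version of the practice questions and the access-time example above, as a minimal sketch. It assumes, as the practice slide implies, that a cylinder spans all four surfaces of the 2-platter disk:

```python
# --- Disk Parameters Practice (answers under the stated assumptions) ---
rpm         = 7200
track_bytes = 256 * 2**10        # 256 KB per track
surfaces    = 2 * 2              # 2 platters, data on both surfaces
file_bytes  = 8 * 2**30          # 8 GB file

cylinder_bytes = track_bytes * surfaces              # 1 MB per cylinder
print(file_bytes // cylinder_bytes)                  # 8192 cylinders
print(f"{0.5 * 60_000 / rpm:.2f} ms")                # 4.17 ms avg rotational delay

# --- Disk Access Time Example ---
avg_seek_ms       = 9.0
sectors_per_track = 400
t_rotation = 0.5 * 60_000 / rpm                      # ~4 ms: half a rotation
t_transfer = (60_000 / rpm) / sectors_per_track      # ~0.02 ms: one sector passes by
t_access   = avg_seek_ms + t_rotation + t_transfer
print(f"Taccess = {t_access:.2f} ms")                # ~13.2 ms, dominated by seek + rotation
```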
Block, Page and Record Sizes
• Block: according to the text, the smallest unit of I/O
• Page: often used in place of "block"
• "Typical" record size: commonly hundreds, sometimes thousands of bytes (unlike the toy records in textbooks)
• "Typical" page size: 4K or 8K

Effect of Page Size on Read Time
Suppose the rotational delay is 4 ms, the average seek time is 6 ms, and the transfer speed is 0.5 ms per 8K.
[Graph: minutes required to read 1 GB of data versus page size, for page sizes of 1 to 16 multiples of 8K; the time falls steeply as the page size grows.]

Why the Difference?
What accounts for the difference in times to read one gigabyte on the previous graph? Assume rotational delay 4 ms, average seek time 6 ms, transfer speed 0.5 ms per 8K. (This arithmetic is scripted in the sketch at the end of this subsection.)
• Transfer time: (2^30 / 2^13 8K blocks) × (0.5 ms per 8K) = 66 secs, about one minute, regardless of page size
• How many reads? With 8K pages there are 2^30 / 2^13 = 2^17 = 128K reads; with 64K pages there are 1/8 as many, 16K reads
• Time taken by rotational delays and seeks: each read requires a rotational delay and a seek, totaling 10 ms
  - 8K: (128K reads) × (10 ms/read) = 1,311 secs, about 22 minutes
  - 64K: 1/8 of that, or 164 secs, about 3 minutes

Moral of the Story
As page size increases, read (and write) time falls toward pure transfer time, a big savings. So why not use a huge page size?
• It wastes memory space if you don't need all that is read
• It wastes read time if you don't need all that is read
What applications could use a large page size? Those that access data sequentially. The problem with a small page size is that pages get scattered across the disk. Turn the page....

Faster I/O, Even with a Small Page Size
Even if the page size is small, you can achieve fast I/O by storing a file's data as follows:
• Consecutive pages on the same track, followed by
• Consecutive tracks on the same cylinder, followed by
• Consecutive cylinders adjacent to each other
The first two incur no seek time or rotational delay; the seek for the third is only one track. What is saved with this storage pattern? How is it obtained? By the disk defragmenter and its relatives and predecessors, which also place frequently used files near the spindle. When data is in this storage pattern the application can do sequential I/O; otherwise it must do random I/O.
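The "Why the Difference?" arithmetic above is easy to script. This minimal sketch uses the same assumed parameters (4 ms rotation, 6 ms seek, 0.5 ms per 8K transferred) and reproduces the graph's endpoints:

```python
# Time to read 1 GB as a function of page size, per the slides' parameters.
ROT_MS, SEEK_MS = 4.0, 6.0         # per-read overhead: 10 ms total
XFER_MS_PER_8K  = 0.5
TOTAL_BYTES     = 2**30            # 1 GB

for pages_of_8k in (1, 8):         # page size 8 KB vs. 64 KB
    page_bytes = pages_of_8k * 8 * 2**10
    n_reads    = TOTAL_BYTES // page_bytes
    overhead_s = n_reads * (ROT_MS + SEEK_MS) / 1000
    transfer_s = (TOTAL_BYTES / (8 * 2**10)) * XFER_MS_PER_8K / 1000  # same for all sizes
    print(f"{page_bytes // 2**10:>3} KB pages: {n_reads:>6} reads, "
          f"overhead {overhead_s:5.0f} s, transfer {transfer_s:.0f} s, "
          f"total {(overhead_s + transfer_s) / 60:.1f} min")
# 8 KB pages: 131072 reads, ~1311 s of overhead -> ~23 min total
# 64 KB pages: 16384 reads,  ~164 s of overhead -> ~4 min total
```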
More Hardware Issues
• Disk controllers: the interface from the disks to the bus; they handle checksums, remapping of bad sectors, driver management, etc.
• Interface protocols and their transfer rates in MB/second: IDE/EIDE/ATA/PATA, SATA (133), SCSI (640); BUT for a single device, SCSI is inferior. There are also faster network technologies such as Fibre Channel.
• Storage Area Networks (SANs): a disk farm networked to servers. The servers can be heterogeneous (a primary advantage), and management is centralized.

Dependability
Module reliability = a measure of continuous service accomplishment (equivalently, of time to failure). Two metrics:
1. Mean Time To Failure (MTTF) measures reliability. Failures In Time (FIT) = 1/MTTF, the rate of failures, is traditionally reported as failures per billion hours of operation.
2. Mean Time To Repair (MTTR) measures service interruption. Mean Time Between Failures (MTBF) = MTTF + MTTR.
Module availability measures service as it alternates between the two states of accomplishment and interruption; it is a number between 0 and 1, e.g. 0.9:
  Module availability = MTTF / (MTTF + MTTR)

Example: Calculating Reliability
If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules.
Example: calculate the FIT and MTTF for 10 disks (1M-hour MTTF per disk), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF):

  FailureRate = 10 × (1/1,000,000) + 1/500,000 + 1/200,000
              = (10 + 2 + 5) / 1,000,000
              = 17 / 1,000,000 per hour
              = 17,000 FIT
  MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours

(A code version of this calculation appears after the RAID Level 4 slide below.)

9.2 RAID
Disk array: an arrangement of several disks that gives the abstraction of a single, large disk. The goals are to increase performance and reliability. Two main techniques:
• Data striping: the data is partitioned; the size of a partition is called the striping unit. Partitions are distributed over several disks.
• Redundancy: more disks mean more failures. Redundant information allows reconstruction of the data if a disk fails.

Data Striping
• CPUs go fast, disks don't. How can disks keep up? CPUs do work in parallel; can disks?
• Answer: partition the data across D disks (see the next slide).
• If the partition unit is a page, a single page I/O request is no faster, but multiple I/O requests can run at their aggregated bandwidth.
• The number of pages in a partition unit is called the depth of the partition.
• Contrary to the text, partition units of a bit are almost never used, and partition units of a byte are rare.

Data Striping (RAID Level 0)
Block i is stored on disk i mod D:

  Disk 0   Disk 1   Disk 2   ...   Disk D-1
  0        1        2        ...   D-1
  D        D+1      D+2      ...   2D-1
  2D       2D+1     2D+2     ...   3D-1
  ...      ...      ...      ...   ...

Redundancy
• Striping is seductive, but remember reliability! The MTTF of a disk is about 6 years. If we stripe over 24 disks, what is the MTTF?
• The solution is redundancy:
  - Parity corrects single failures. Parity alone cannot locate a failure, but the failure location is provided by the controller.
  - Other codes detect where the failure is and correct multiple failures; such redundancy may require more than one check bit.
• Redundancy makes writes slower. Why?

RAID Levels
• Standardized by SNIA (www.snia.org), though they vary in practice.
• For each level, decide (assume a single user):
  - The number of disks required to hold D disks of data
  - The speedup s (compared to 1 disk) for S/R (sequential/random) R/W (reads/writes); random: each I/O is one block; sequential: each I/O is one stripe
  - The number of disks/blocks that can fail without data loss
• Level 0: block-striped, no redundancy; the picture is 2 slides back.

JBOD, RAID Level 1
• JBOD: Just a Bunch of Disks. Each disk is independent, holding its own blocks 0, 1, 2, 3, ...; there is no striping and no redundancy.
• Level 1: mirrored (two identical JBODs; no striping).

RAID Level 0+1: Stripe + Mirror
A striped array on disks 0 through D-1, plus an identical mirror of the stripe on disks D through 2D-1:

  Disk 0   Disk 1   ...   Disk D-1  |  Disk D   Disk D+1   ...   Disk 2D-1
  0        1        ...   D-1       |  0        1          ...   D-1
  D        D+1      ...   2D-1      |  D        D+1        ...   2D-1
  2D       2D+1     ...   3D-1      |  2D       2D+1       ...   3D-1

RAID Level 4
Block-interleaved parity (not common):
• One check disk, Disk D, holding one parity block per stripe (one parity bit per bit position)
• How to tell whether there is a failure, or which disk failed?
• Every write becomes a read-modify-write (see the parity sketch below)
• Disk D is a bottleneck

  Disk 0   Disk 1   Disk 2   ...   Disk D-1   Disk D
  0        1        2        ...   D-1        P
  D        D+1      D+2      ...   2D-1       P
  2D       2D+1     2D+2     ...   3D-1       P
  ...      ...      ...      ...   ...        P
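As promised above, a code version of the reliability example. It is a direct transcription of the slide's arithmetic (exponential lifetimes, so component failure rates simply add):

```python
# (count, MTTF in hours) for each component class, from the example above.
components = [
    (10, 1_000_000),   # 10 disks, 1M-hour MTTF each
    (1,    500_000),   # 1 disk controller
    (1,    200_000),   # 1 power supply
]
failure_rate = sum(n / mttf for n, mttf in components)  # failures per hour
fit          = failure_rate * 1e9                       # failures per 10^9 hours
system_mttf  = 1e9 / fit                                # hours
print(f"{fit:,.0f} FIT, MTTF = {system_mttf:,.0f} hours")  # 17,000 FIT, ~59,000 hours
```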
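The read-modify-write cycle on the Level 4 slide is just XOR algebra: new_parity = old_parity XOR old_data XOR new_data, so a small write touches only the data block and the parity block. A minimal sketch with toy in-memory "blocks" rather than disks:

```python
import functools, operator

def parity(blocks):
    """Bytewise XOR of all the data blocks in a stripe."""
    return bytes(functools.reduce(operator.xor, col) for col in zip(*blocks))

stripe = [bytes([1] * 4), bytes([2] * 4), bytes([3] * 4)]   # 3 toy data blocks
p = parity(stripe)

# Small write of block 1: read old data + old parity, write new data + new parity.
old, new = stripe[1], bytes([9] * 4)
p = bytes(po ^ o ^ n for po, o, n in zip(p, old, new))      # incremental parity update
stripe[1] = new
assert p == parity(stripe)       # incremental update matches a full recompute

# Recovery: a lost block is the XOR of the surviving blocks and the parity block.
assert parity([stripe[0], stripe[1], p]) == stripe[2]
```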
RAID Level 5
Block-interleaved distributed parity: like Level 4, but the parity block rotates among the disks, so no single disk is a parity bottleneck:

  Disk 0   Disk 1   ...   Disk D-2   Disk D-1   Disk D
  0        1        ...   D-2        D-1        P
  D        D+1      ...   2D-2       P          2D-1
  2D       2D+1     ...   P          3D-2       3D-1
  ...      ...      ...   ...        ...        ...

Level 6: like Level 5, but with 2 parity blocks per stripe; it can survive the loss of 2 disks/blocks.

Notation on the Next Slide
• #Disks: the number of disks required to hold D disks' worth of data using this RAID level
• Read/write speedups of blocks in a single file: SR = sequential read, RR = random read, SW = sequential write, RW = random write
• Failure tolerance: how many disks can fail without loss of data
• s = the number of blocks transferred in the time it takes to transfer one block of data from one disk
• These numbers are theoretical! YMMV... and can vary significantly!

RAID Performance

  Level   #Disks   SR speedup   RR speedup    SW speedup   RW speedup    Failure Tolerance
  0       D        s = D        1 ≤ s ≤ D     s = D        1 ≤ s ≤ D     0
  1       2D       s = 2        s = 2         s = 1**      s = 1**       D*
  0+1     2D       s = 2D       2 ≤ s ≤ 2D    s = D**      1 ≤ s ≤ D**   D*
  5       D+1      s = D        1 ≤ s ≤ D     s = D        Varies        1

  * If no two failed disks are copies of each other
  ** Note: can't write both mirrors at once. Why?

Small Writes on Levels 4 and 5
Levels 4 and 5 require a read-modify-write cycle for every write, since the parity block must be read and modified. On small writes this can be very expensive. This is another justification for log-based file systems (see your OS course).

Which RAID Level Is Best?
• If data loss is not a problem: Level 0
• If storage cost is not a problem: Level 0+1
• Else: Level 5
Software support:
• Linux: 0, 1, 4, 5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html)
• Windows: 0, 1, 5 (http://www.techimo.com/articles/index.pl?photo=149)

9.3 and 9.4.1: covered earlier.

9.4.2 DBMS vs. OS File System
The OS already does disk space and buffer management: why not let the OS manage these tasks?
• Differences in OS support create portability issues
• There are limitations, e.g., files can't span disks
• Buffer management in a DBMS requires the ability to: pin a page in the buffer pool; force a page to disk (important for implementing concurrency control and recovery); adjust the replacement policy; and pre-fetch pages based on the access patterns of typical DB operations
  - Sometimes MRU is the best replacement policy: for example, for a scan or a loop that does not fit in the buffer pool

9.5 Files of Records
A page or block is fine as a unit of I/O, but higher levels of the DBMS operate on records and files of records.
FILE: a collection of pages, each containing a collection of records. A file must support:
• insert/delete/modify a record
• read a particular record (specified using its record id)
• scan all records (possibly with some condition on the records to be retrieved)

9.5.1 Unordered (Heap) Files
The simplest file structure contains records in no particular order. As the file grows and shrinks, disk pages are allocated and de-allocated. To support record-level operations, we must:
• keep track of the pages in the file
• keep track of the free space on pages
• keep track of the records on each page
There are at least two alternatives for keeping track of heap files.

Heap File Implemented as a List
The header page points to two doubly linked lists of data pages: the full pages and the pages with free space. The header page id and the heap file name must be stored someplace. Each page contains 2 `pointers' plus data. (A minimal sketch of this organization follows.)
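The sketch promised above: a minimal in-memory model of the two-list organization. The class and method names are invented for illustration; a real heap file links pages by page id on disk, not by Python references:

```python
class Page:
    def __init__(self, pid, capacity=4):
        self.pid, self.capacity = pid, capacity
        self.records = []                        # stand-in for the page's data area
    def is_full(self):
        return len(self.records) >= self.capacity

class HeapFile:
    """The header page's job: track the full list and the free-space list."""
    def __init__(self):
        self.pages_with_space = []               # 'pages with free space' list
        self.full_pages = []                     # 'full pages' list
        self.next_pid = 0

    def insert(self, record):
        if not self.pages_with_space:            # no free space: allocate a page
            self.pages_with_space.append(Page(self.next_pid))
            self.next_pid += 1
        page = self.pages_with_space[0]
        page.records.append(record)
        if page.is_full():                       # unlink from free list, link to full list
            self.full_pages.append(self.pages_with_space.pop(0))
        return (page.pid, len(page.records) - 1) # rid = (page#, slot#)
```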
Heap File Using a Page Directory
A directory holds one entry per data page; the entry for a page can include the number of free bytes on the page. The directory is itself a collection of pages (a linked list implementation is just one alternative), and it is much smaller than a linked list of all the heap file's pages!

  Header Page -> DIRECTORY -> Data Page 1, Data Page 2, ..., Data Page N

Comparing Heap File Implementations
Assume 100 directory entries per page, U full pages, E pages with free space, and D directory pages. Then D = (U+E)/100; note that D is two orders of magnitude less than U or E.
• Cost to find a page with enough free space: list, E/2; directory, (D/2) + 1
• Cost to move a page from full to free (e.g., when a record is deleted): list, 3; directory, 1
Can you think of some other operations?

9.6 Page Formats: Fixed-Length Records
Two layouts, with the bookkeeping stored in the page footer:
• PACKED: slots 1..N are stored contiguously, followed by free space; the footer stores N, the number of records.
• UNPACKED with BITMAP: M slots, any of which may be empty; the footer stores a bitmap showing which slots hold records, plus M, the number of slots.

Packed vs. Unpacked Page Formats
• Record ID (RID, or TID) = (page#, slot#), in all page formats. Note that indexes are filled with RIDs: data entries in alternatives 2 and 3 are (key, RID, ...).
• Packed stores more records, but RIDs change when a record is deleted; this may not be acceptable.
• Unpacked: the RID does not change, and there is less data movement when deleting.

Page Formats: Variable-Length Records
A slot directory grows from the end of the page. For each record it stores the record's offset within the page and its length; it also stores the number of slots and a pointer to the start of the free space. A record in slot N of page i has Rid = (i, N).

Slotted Page Format
• The intergalactic standard, used for fixed-length records as well.
• How to deal with free-space fragmentation? Pack the records, lazily; note that the RIDs don't change. (See the sketch at the end of these notes.)
• How are updates handled that expand the size of a record? A forwarding flag points to the new location.
• See http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html and postgresql-8.3.1\src\include\storage\bufpage.h

9.7 Record Formats: Fixed Length
Fields F1, F2, F3, F4 have lengths L1, L2, L3, L4, the same for all records in a file; information about the field types is stored in the system catalogs. If B is the record's base address, then field 3, for example, begins at address B + L1 + L2, so finding the i'th field does not require a scan of the record.

Record Formats: Variable Length
Two alternative formats (the number of fields is fixed):
• A field count, with the fields delimited by special symbols (e.g., '$')
• An array of field offsets at the front of the record
The second offers direct access to the i'th field and efficient storage of nulls (the special "don't know" value), for a small directory overhead. (A sketch of this format follows the slotted-page sketch below.)
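A minimal sketch of a slotted page. This is illustrative only; the names and layout are invented, not PostgreSQL's actual bufpage.h layout. It demonstrates the key property: slot numbers, and therefore RIDs, survive both deletion and lazy compaction:

```python
class SlottedPage:
    def __init__(self):
        self.data  = bytearray()   # record bytes grow from the start of the page
        self.slots = []            # slot i -> (offset, length), or None if deleted

    def insert(self, rec: bytes) -> int:
        """Append the record's bytes; return the slot number (the rid's slot#)."""
        self.slots.append((len(self.data), len(rec)))
        self.data += rec
        return len(self.slots) - 1

    def get(self, slot: int) -> bytes:
        off, ln = self.slots[slot]
        return bytes(self.data[off:off + ln])

    def delete(self, slot: int):
        self.slots[slot] = None    # the slot stays, so other records' rids don't change

    def compact(self):
        """Lazily squeeze out holes; offsets move, but slot numbers (rids) don't."""
        packed = bytearray()
        for i, entry in enumerate(self.slots):
            if entry is not None:
                off, ln = entry
                self.slots[i] = (len(packed), ln)
                packed += self.data[off:off + ln]
        self.data = packed
```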
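And a sketch of the "array of field offsets" record format. The encoding is an illustrative assumption, not any particular system's: a field count, then each field's end offset, then the field bytes. Storing end offsets gives direct access to field i and lets a null field occupy zero bytes:

```python
import struct

def encode(fields):
    """fields: list of bytes or None (null). Header = end offset of each field."""
    body, ends = b"", []
    for f in fields:
        body += f or b""           # a null contributes no bytes...
        ends.append(len(body))     # ...so its end offset equals its start offset
    header = struct.pack(f"<{len(ends)}I", *ends)
    return struct.pack("<I", len(fields)) + header + body

def field(rec, i):
    """Direct access to field i: two header reads, no scan of earlier fields."""
    (n,) = struct.unpack_from("<I", rec, 0)
    ends  = struct.unpack_from(f"<{n}I", rec, 4)
    start = ends[i - 1] if i > 0 else 0
    if ends[i] == start:
        return None                # this toy treats any zero-length field as null
    base = 4 + 4 * n               # header size: count + n offsets
    return rec[base + start : base + ends[i]]

r = encode([b"bob", None, b"cs587"])
assert field(r, 0) == b"bob" and field(r, 1) is None and field(r, 2) == b"cs587"
```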