Storage and File Structure
Overview of Physical Storage Media
Redundant Arrays of Independent Disks (RAID)
Buffer Management
File Organization
©Silberschatz, Korth and Sudarshan

Properties of Physical Storage Media
Types of storage media differ in terms of:
Speed of data access
Cost per unit of data
Reliability:
• Data loss on power failure or system (software) crash
• Physical failure of the storage device
=> What is the value of the data (e.g., the salon example)?

Classification of Storage Media
Storage can be classified as:
Volatile: contents are lost when power is switched off
• Includes primary storage (cache, main memory)
Non-volatile: contents persist even when power is switched off
• Secondary and tertiary storage, plus battery-backed RAM

Storage Hierarchy, Cont.
Storage can also be classified as:
Primary storage:
• Fastest media
• Volatile
• Includes cache and main memory
Secondary storage:
• Moderately fast access time
• Non-volatile
• Includes flash memory and magnetic disks
• Also called on-line storage
Tertiary storage:
• Slow access time
• Non-volatile
• Includes magnetic tape and optical storage
• Also called off-line storage

Primary Storage
Cache:
• Volatile
• Fastest and most costly form of storage
• Managed by the computer system hardware
• Proprietary
Main memory (also known as RAM):
• Volatile
• Fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
• Generally too small (or too expensive) to store the entire database
=> The above comment notwithstanding, main-memory databases do exist, especially on clusters.

Physical Storage Media, Cont.
Flash memory:
• Non-volatile
• Widely used in embedded devices such as digital cameras, SD cards, USB drives, and solid-state drives
• Data can be written at a location only once, but the location can be erased and written again
• Supports only a limited number of write/erase cycles
• Reads are roughly as fast as main memory
• Writes are slow (a few microseconds); erases are slower
• Cost per unit of storage roughly similar to main memory

Physical Storage Media, Cont.
Magnetic disk:
• Non-volatile: survives power failures and system (software) crashes
• Physical failure can destroy data, but such failures are rare
• Primary medium for database storage; typically stores the entire database
• Capacities of individual disks currently run to 100s of GBs or TBs, much larger than main or flash memory
• Direct/random access, unlike magnetic tape, which is sequential access

Physical Storage Media, Cont.
Optical storage:
• Non-volatile
• Data is read optically from a spinning disk using a laser
• CD-ROM (640 MB) and DVD (4.7 to 17 GB) are the most popular forms
• Write-once, read-many (WORM) optical disks are used for archival storage (CD-R and DVD-R)
• Multiple-write versions are also available (CD-RW, DVD-RW, and DVD-RAM)
• Reads and writes are slower than with magnetic disk

Physical Storage Media, Cont.
Tape storage:
• Non-volatile
• Data is accessed sequentially; consequently known as sequential-access storage
• Much slower than disk
• Very high capacity (40 to 300 GB tapes available)
• Storage costs are much cheaper than for disk, but high-quality drives can be very expensive
• Used mainly for backup (to recover from disk failure), archival, and transfer of very large amounts of data
Physical Storage Media, Cont.
Juke-boxes: for storing massive amounts of data
• Hundreds of terabytes (1 terabyte = 10^12 bytes), petabytes (1 petabyte = 10^15 bytes), or exabytes (1 exabyte = 10^18 bytes)
• A large number of removable disks or tapes
• A mechanism for automatic loading/unloading of disks or tapes

Storage Hierarchy
[Figure: the storage hierarchy; speed, cost, and volatility increase toward the top of the hierarchy.]

Magnetic Hard Disk Mechanism
[Diagram of a hard disk drive. NOTE: the diagram is only a simplification of actual disk drives.]

Magnetic Disks
Disk assembly consists of:
• A single spindle that spins continually (typically at 7,200 or 10,000 RPM)
• Multiple disk platters (typically 2 to 5)
Surface of each platter is divided into circular tracks:
• Over 16,000 tracks per platter on typical hard disks
Each track is divided into sectors:
• A sector is the smallest unit of data that can be read or written
• Typically 512 bytes
• Typical sectors per track: 200 (on inner tracks) to 400 (on outer tracks)
Head-disk assemblies:
• One head per platter, mounted on a common arm, each flying very close to its platter
• Reads or writes magnetically encoded information
"Cylinder i" consists of the ith track of all the platters

Magnetic Disks, Cont.
Disks are the primary performance bottleneck in a database system, in part because of the need for physical movement. To read or write a sector:
• the disk arm swings to position the head on the right track – seek time (4-10 ms)
• the sector rotates under the read/write head – rotational latency (4-11 ms)
• data is read/written as the sector passes under the head – transfer rate (25-100 MB/s)
Access time – the time from when a read or write request is issued until the data transfer begins, i.e., seek time + rotational latency (see the worked estimate below).

Performance Measures of Disks, Cont.
Multiple disks may share an interface controller, so the rate at which the interface controller can handle data also matters:
• ATA-5: 66 MB/s, SCSI-3: 40 MB/s, Fibre Channel: 256 MB/s
Mean time to failure (MTTF) – the average time the disk is expected to run continuously without any failure:
• Typically 3 to 5 years
• Quoted MTTFs of new disks are quite high: 30,000 to 1,200,000 hours
• An MTTF of 1,200,000 hours for a new disk means that, given 1,000 relatively new disks, on average one will fail every 1,200 hours
• MTTF decreases as the disk ages

Magnetic Disks, Cont.
The term "controller" is used (primarily) in two ways, both in the book and in other literature; the book does not distinguish between the two very well.
Disk controller (use #1):
• Packaged within the disk
• Accepts high-level commands to read or write a sector
• Initiates actions such as moving the disk arm to the right track and actually reading or writing the data
• Computes and attaches checksums to each sector for reliability
• Performs re-mapping of bad sectors

Disk Subsystem
Interface controller (use #2):
• Multiple disks are typically connected to the system bus through an interface controller (i.e., a host adapter)
• Many functions (checksums, bad-sector re-mapping) are often carried out by the individual disk controllers, reducing the load on the interface controller
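To make the performance figures above concrete, here is a minimal back-of-the-envelope estimate of the time to read one 4 KB block. The specific numbers (8 ms seek, 7,200 RPM, 50 MB/s transfer) are illustrative assumptions drawn from the ranges quoted above, not vendor data:

```python
# Back-of-the-envelope disk access time for one 4 KB block read.
# All figures are illustrative assumptions within the ranges quoted above.

seek_ms = 8.0                                   # average seek time (4-10 ms range)
rpm = 7200                                      # spindle speed
rotational_latency_ms = 0.5 * (60_000 / rpm)    # on average, half a revolution
transfer_mb_per_s = 50.0                        # sustained rate (25-100 MB/s range)

block_kb = 4
transfer_ms = block_kb / 1024 / transfer_mb_per_s * 1000

access_ms = seek_ms + rotational_latency_ms     # time until transfer begins
total_ms = access_ms + transfer_ms

print(f"rotational latency: {rotational_latency_ms:.2f} ms")   # ~4.17 ms
print(f"access time:        {access_ms:.2f} ms")               # ~12.17 ms
print(f"total for 4 KB:     {total_ms:.2f} ms")                # transfer adds ~0.08 ms
```

The point of the exercise: seek time and rotational latency dominate, while the transfer itself is nearly free. This is why the optimizations on the following slides focus on reducing disk-arm movement rather than raw bandwidth.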
Disk Subsystem
The distribution of work between the disk controller and the interface controller depends on the interface standard. Disk interface standard families:
• ATA (AT Attachment)/IDE
• SCSI (Small Computer System Interface)
• Fibre Channel, etc.
Several variants of each standard exist (different speeds and capabilities). One computer can have many interface controllers, of the same or different types.
=> Like disks, interface controllers are also a performance bottleneck.

Techniques for Optimization of Disk-Block Access
Several techniques are employed to minimize the negative effects of disk and controller bottlenecks.
File organization: organize data on disk based on how it will be accessed
• Store related information (e.g., data in the same file) on the same or nearby cylinders
• A disk may become fragmented over time:
  - As data is inserted into and deleted from the disk, free "slots" on disk tend to become scattered
  - Data items (e.g., files) are then broken up and scattered over the disk
  - Sequential access to fragmented data results in increased disk-arm movement
• Most DBMSs have utilities to de-fragment the file system and, more generally, to manipulate the file organization of a database

Techniques for Optimization of Disk-Block Access
Block-size adjustment:
• A block is a contiguous sequence of sectors from a single track
• A DBMS transfers data between disk and main memory in blocks
• Sizes range from 512 bytes to several kilobytes
  - Smaller blocks: more transfers from disk
  - Larger blocks: may waste space due to partially filled blocks
  - Typical block sizes range from 4 to 16 kilobytes
• Block size can be adjusted to suit the workload
Disk-arm-scheduling algorithms: order pending accesses so that disk-arm movement is minimized
• One example is the "elevator algorithm" (a sketch appears below)

Techniques for Optimization of Disk-Block Access, Cont.
Log disk: a disk devoted to the transaction log
• Writes to a log disk are very fast since no seeks are required
• Often provided with battery backup
Non-volatile write buffers/RAM: battery-backed RAM or flash memory
• Typically associated with a disk or storage device
• Blocks to be written are first written to the non-volatile RAM buffer
• The disk controller writes the data to disk whenever the disk has no other requests
• Allows processes to continue without waiting for write operations to complete
• Data is safe even in the event of power failure
• Allows writes to be reordered to minimize disk-arm movement
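A minimal sketch of the elevator idea mentioned above: the arm sweeps in one direction, serving every pending cylinder request it passes, then reverses. The function name and the cylinder numbers in the example are made up for illustration:

```python
# Elevator disk-arm scheduling: sweep in one direction, serving every
# pending request passed on the way, then reverse direction.
# Cylinder numbers below are made-up illustrative values.

def elevator(pending, head, direction=+1):
    """Return the order in which pending cylinder requests are served."""
    order = []
    remaining = sorted(pending)
    while remaining:
        if direction > 0:
            batch = [c for c in remaining if c >= head]        # sweep upward
        else:
            batch = [c for c in remaining if c <= head][::-1]  # sweep downward
        if not batch:                 # nothing left this way: reverse
            direction = -direction
            continue
        order.extend(batch)
        head = batch[-1]
        remaining = [c for c in remaining if c not in batch]
        direction = -direction
    return order

print(elevator([98, 183, 37, 122, 14, 124, 65, 67], head=53))
# [65, 67, 98, 122, 124, 183, 37, 14]  -- one upward sweep, then downward
```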
RAID
Redundant Arrays of Independent Disks (RAID): disk organization techniques that exploit a collection of disks
• Speed, reliability, or both are increased
• The collection appears as a single disk to the system
• Originally "inexpensive" disks
Ironically, using multiple disks increases the risk of a failure (though not necessarily of data loss): a system with 100 disks, each with an MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1,000 hours (approximately 41 days).

Improvement of Reliability via Redundancy
RAID improves reliability by using two techniques: parity information and mirroring.
Parity information (basic):
• Makes use of the exclusive-or (XOR) operator: 1 ⊕ 1 ⊕ 0 ⊕ 1 = 1; 0 ⊕ 1 ⊕ 1 ⊕ 0 = 0
• Enables the detection of single-bit errors
• Can be applied across a collection of disks

Improvement of Reliability via Redundancy
Mirroring:
• For every disk, keep a duplicate copy
• Every write is carried out on both disks (in parallel)
• If one disk in a pair fails, the data is still available
• Different reads can take place from either disk and, in particular, from both disks at the same time
Data loss occurs only if 1) a disk fails, and 2) its mirror disk fails before the first is repaired:
• The probability of both events occurring is very small
• If the MTTF is 100,000 hours and the mean time to repair is 10 hours, the mean time to data loss for a mirrored pair is 100,000^2 / (2 x 10) hours, approximately 57,000 years

Improvement in Performance via Parallelism
RAID improves performance mainly through parallelism. Two main goals of parallelism in a disk system:
• Parallelize large accesses to reduce response time (transfer rate)
• Load-balance multiple small accesses to increase throughput (I/O rate)
Parallelism is achieved primarily through striping, but also through mirroring. Striping can be done at varying granularities:
• bit level
• block level (variable size)

Improvement in Performance via Parallelism
Bit-level striping:
• Abstractly, the data to be stored can be thought of as a list of bytes
• Suppose there are eight disks; write bit i of each byte to disk i
• In theory, each access can read data at eight times the rate of a single disk (reduces transfer time, and hence improves response time)
• Each byte read or written ties up all 8 disks (reduces throughput)
• Not commonly used, at least as described
Block-level striping:
• Abstractly, the data to be stored can be thought of as a list of blocks, numbered 0..(k-1)
• Suppose there are n disks, numbered 0..(n-1); store block i of a file on disk (i mod n)
• Requests for different blocks can run in parallel if the blocks reside on different disks (improves throughput)
• A request for a long sequence of blocks can use all disks in parallel (improves response time)
• (A sketch combining striping with parity appears below)

RAID Levels
Different RAID organizations, or RAID "levels," have differing cost, performance, and reliability characteristics. Levels vary by source and vendor:
• Standard levels: 0, 1, 2, 3, 4, 5, 6
• Nested levels: 0+1, 1+0, 5+0, 5+1
• Several other, non-standard, vendor-specific levels exist as well
• Our book covers levels 0 through 5
Performance and reliability are not linear in the level number. It is helpful to compare each level to every other level, and to the single-disk option.

RAID Levels
RAID Level 0: block-level striping.
• Used in high-performance applications where data loss is not critical.
RAID Level 1: mirrored disks with block striping.
• Offers the best write performance, according to the authors.
• Sometimes called 0+1, 1+0, 01, or 10.
[Diagram: striped data disks, each with a mirror copy (C).]
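The following sketch ties together block-level striping and XOR parity from the slides above: data blocks are spread across n disks, a parity block is kept, and the content of a lost disk can be reconstructed by XOR-ing the survivors with the parity. The 4-disk setup, names, and sample data are illustrative assumptions:

```python
# Block-level striping plus XOR parity (the building blocks of RAID 3-5).
# The 4-data-disk layout and the sample blocks are illustrative only.

from functools import reduce

N_DISKS = 4

def disk_for_block(i: int) -> int:
    """Block i of a file lives on disk (i mod n) under block-level striping."""
    return i % N_DISKS

def parity(blocks: list) -> bytes:
    """Bytewise XOR of one block from each disk."""
    return bytes(reduce(lambda a, b: a ^ b, byte_group)
                 for byte_group in zip(*blocks))

def reconstruct(surviving: list, parity_block: bytes) -> bytes:
    """A lost block is the XOR of the parity block and all surviving blocks."""
    return parity(surviving + [parity_block])

stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # one block per disk
p = parity(stripe)

# Simulate losing disk 2 and rebuilding its block from the survivors:
lost = stripe[2]
rebuilt = reconstruct(stripe[:2] + stripe[3:], p)
assert rebuilt == lost
print("disk 2 rebuilt:", rebuilt)   # b'CCCC'
```

The design point: because XOR is its own inverse, any single disk's contents are recoverable from the remaining disks plus the parity disk, at the cost of only one extra disk.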
RAID Levels, Cont.
RAID Level 2: memory-style error-correcting codes (ECC) with bit-level striping.
• Parity information is used to detect, locate, and correct errors.
• Outdated: disks contain embedded functionality to detect and locate sector errors, so only a single parity disk is needed (for correction). Also, disks don't read/write individual bits.
[Diagram: data disks plus multiple parity disks (P).]

RAID Levels, Cont.
RAID Level 3: bit-level striping, with a single parity disk.
• Individual disks report errors.
• To recover the data on a damaged disk, compute the XOR of the bits from the other disks (including the parity disk).
• Faster response time than a single disk, but fewer I/Os per second, since every disk has to participate in every I/O.
• When writing data, the corresponding parity bits must also be computed and written to the parity disk.
• Subsumes Level 2 (provides all its benefits at lower cost).
[Diagram: data disks plus a single parity disk (P).]

RAID Levels (Cont.)
RAID Level 4: block-level striping, with a single parity disk.
• Individual disks report errors.
• To recover the data on a damaged disk, compute the XOR of the blocks from the other disks (including the parity disk).
• When writing a data block, the corresponding block of parity bits must also be recomputed and written to the parity disk:
  - For a single block write: use the old parity block, the old value of the block, and the new value of the block (2 block reads + 2 block writes; see the sketch below).
  - For a large sequential block write: compute the parity directly from the new values of the blocks covered by the parity block.

RAID Levels (Cont.)
RAID Level 4 (cont.):
• Provides higher I/O rates for independent block reads than Level 3.
• Provides higher transfer rates for large, multi-block reads and writes than a single disk.
• The parity disk is a bottleneck for independent block writes.

RAID Levels (Cont.)
RAID Level 5: block-level striping with distributed parity; partitions data and parity among all N + 1 disks, rather than storing data on N disks and parity on 1 disk.
• For example, with 5 disks, the parity block for the ith set of N blocks is stored on disk (i mod 5), with the data blocks stored on the other 4 disks.
[Diagram: parity blocks (P) distributed across all five disks.]

RAID Levels (Cont.)
RAID Level 5 (cont.):
• Provides the same benefits as Level 4, but minimizes the parity-disk bottleneck.
• Higher I/O rates than Level 4.
RAID Level 6: P+Q redundancy scheme; similar to Level 5, but stores additional redundant information to guard against multiple disk failures.
• Better reliability than Level 5, at a higher cost; not as widely used.
[Diagram: two parity blocks (P) per stripe, distributed across the disks.]

Choice of RAID Level
The "best" level for any particular application is not always clear. Factors in choosing a RAID level:
• Monetary cost of disks, controllers, or RAID storage devices
• Reliability requirements
• Performance requirements:
  - Throughput vs. response time
  - Short vs. long I/Os
  - Reads vs. writes
  - During failure, and while rebuilding a failed disk

Choice of RAID Level, Cont.
The book's analysis:
• RAID 0 is used only when (data) reliability is not important.
• Levels 2 and 4 are never used, since they are subsumed by 3 and 5.
• Level 3 is typically not used, since bit-level striping forces single-block reads to access all disks, wasting disk-arm movement. (Vendors sometimes advertise a version of Level 3 that uses byte-level striping.)
• Level 6 is rarely used, since Levels 1 and 5 offer adequate safety for almost all applications.
• So the competition is between Levels 1 and 5 only.
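A minimal sketch of the single-block write rule quoted above for Levels 4 and 5: the new parity can be computed from the old parity, the old data block, and the new data block, without touching the other data disks. The helper names are made up:

```python
# RAID 4/5 small-write parity update:
#   new_parity = old_parity XOR old_block XOR new_block
# This costs 2 block reads (old data, old parity) + 2 block writes,
# regardless of how many data disks share the parity block.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_parity: bytes, old_block: bytes,
                       new_block: bytes) -> bytes:
    return xor_blocks(xor_blocks(old_parity, old_block), new_block)

# Sanity check against recomputing parity from scratch:
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
old_parity = xor_blocks(xor_blocks(d0, d1), d2)

new_d1 = b"XXXX"
fast = small_write_parity(old_parity, d1, new_d1)
slow = xor_blocks(xor_blocks(d0, new_d1), d2)
assert fast == slow
```

This is exactly why the next slide charges Level 5 "at least 2 block reads and 2 block writes" per single-block write.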
Choice of RAID Level (Cont.)
Level 1 provides much better write performance than Level 5:
• Level 5 requires at least 2 block reads and 2 block writes to write a single block.
• Level 1 requires only 2 block writes.
Level 1 has a higher storage cost than Level 5; however:
• I/O requirements have increased greatly, e.g., for Web servers.
• When enough disks have been bought to satisfy the required I/O rate, they often have spare storage capacity…
• In that case there is often no extra monetary cost for Level 1!
Level 5 is preferred for applications with a low update rate and large amounts of data. Level 1 is preferred for all other applications.

Other RAID Issues
• Hardware vs. software RAID
• Local storage buffers
• Hot-swappable disks
• Spare components: disks, fans, power supplies, controllers
• Battery backup

Storage Access
A DBMS will typically have several files allocated to it for storage, which are (usually) formatted and managed by the DBMS.
• Each file typically maps to a disk, a disk partition, or a storage device.
• Each file is partitioned into blocks (a.k.a. pages). A block consists of one or more contiguous sectors, and is the smallest unit of DBMS storage allocation and transfer.
• Each block is partitioned into records.
• Each record is partitioned into fields.

Buffer Management
The DBMS transfers blocks of data between RAM and disk, in a manner similar to an operating system, and seeks to minimize the number of block transfers between the disk and memory.
• The portion of main memory available to store copies of disk blocks is called the buffer (a.k.a. the cache).
• The DBMS subsystem responsible for allocating buffer space in main memory is called the buffer manager.

Buffer Manager Algorithm
Programs call the buffer manager when they need a (disk) block.
• If the requested block is already in the buffer, the requesting program is given the address of the block in main memory.
• If the block is not in the buffer, the buffer manager:
  - Allocates buffer space for the block, throwing out another block if required.
  - Writes the thrown-out block to disk if it has been modified since it was last retrieved.
  - Reads the requested block from disk into the buffer once space is available.
  - Passes the address of the block in main memory to the requester.
(A sketch of this algorithm appears below.)

Buffer-Replacement Policies
How is a block selected for replacement?
• Most operating systems use Least Recently Used (LRU).
• In a DBMS, LRU can be a bad strategy for certain access patterns.
• Queries have well-defined access patterns (such as sequential scans), so a DBMS can predict future references (an OS cannot).
• Mixed strategies are common; statistical information and well-known access patterns can be exploited.
Well-known access patterns:
• The data dictionary is frequently accessed, so keep its blocks in the buffer.
• Index blocks are used frequently, so keep them in the buffer.

Buffer-Replacement Policies, Cont.
Some terminology… A block that has been loaded into the buffer and is not allowed to be replaced is said to be pinned. At any given time, a buffer block is in one of the following states:
• Unused (free) – does not contain a copy of a disk block.
• Used, but not pinned – contains a copy of a disk block, which is available for replacement.
• Pinned – contains a copy of a disk block that is not available for replacement.
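A minimal sketch of the buffer manager algorithm above, with pinning and write-back of modified victims. The class and method names, and the use of LRU as the victim-selection policy, are illustrative assumptions, not the book's interface:

```python
# Minimal buffer manager: pin/unpin, write-back of dirty victims,
# LRU victim selection among unpinned blocks.

class BufferManager:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk              # dict-like: block_id -> bytes
        self.frames = {}              # block_id -> bytearray (in-memory copy)
        self.pinned = set()
        self.dirty = set()
        self.use_order = []           # least recently used first

    def _touch(self, block_id):
        if block_id in self.use_order:
            self.use_order.remove(block_id)
        self.use_order.append(block_id)

    def _pick_victim(self):
        for candidate in self.use_order:          # LRU among unpinned blocks
            if candidate not in self.pinned:
                return candidate
        raise RuntimeError("all buffer blocks are pinned")

    def pin(self, block_id):
        """Return the in-memory copy of block_id, reading it in if needed."""
        if block_id not in self.frames:
            if len(self.frames) >= self.capacity:
                victim = self._pick_victim()
                if victim in self.dirty:          # write back modified block
                    self.disk[victim] = bytes(self.frames[victim])
                    self.dirty.discard(victim)
                del self.frames[victim]
                self.use_order.remove(victim)
            self.frames[block_id] = bytearray(self.disk[block_id])
        self._touch(block_id)
        self.pinned.add(block_id)
        return self.frames[block_id]

    def unpin(self, block_id, modified=False):
        self.pinned.discard(block_id)
        if modified:
            self.dirty.add(block_id)
```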
Buffer-Replacement Policies, Cont.
Page-replacement strategies:
• Most recently used (MRU) – the moment a disk block in the buffer becomes unpinned, it becomes the most recently used block.
• Least recently used (LRU) – of all unpinned disk blocks in the buffer, replace the one referenced least recently.

Buffer-Replacement Policies, Cont.
Recall the borrower and customer relations:
borrower = (customer-name, loan-number)
customer = (customer-name, customer-street, customer-city)
Consider a query that performs a join of borrower and customer:
select customer-name, loan-number, street, city
from borrower b, customer c
where b[customer-name] = c[customer-name];
Suppose that borrower consists of blocks b1, b2, and b3, that customer consists of blocks c1, c2, c3, c4, and c5, and that the buffer can hold five (5) blocks in total.

Buffer-Replacement Policies, Cont.
Now consider the following query plan for the query, using a nested-loop join algorithm:
for each tuple b of borrower do            // scan borrower sequentially
  for each tuple c of customer do          // scan customer sequentially
    if b[customer-name] = c[customer-name] then begin
      x[customer-name] := b[customer-name];
      x[loan-number] := b[loan-number];
      x[street] := c[street];
      x[city] := c[city];
      include x in the result of the query;
    end if;
  end for;
end for;
Use LRU for block replacement:
− Read in a block of borrower and keep it, i.e., pin it, in the buffer until the last tuple in that block has been processed.
− Once all of the tuples from that block have been processed, unpin it.
− If a customer block is to be brought into the buffer and another block must be moved out to make room, move out the least recently used block (borrower or customer).

Buffer-Replacement Policies, Cont.
Applying the LRU replacement algorithm gives the following buffer contents for the first two tuples of b1 (b1 stays pinned throughout; one state per customer-block access):
pass 1: [b1 c1] → [b1 c1 c2] → [b1 c1 c2 c3] → [b1 c1 c2 c3 c4] → c5 evicts c1 → [b1 c5 c2 c3 c4]
pass 2: c1 evicts c2 → [b1 c5 c1 c3 c4] → c2 evicts c3 → [b1 c5 c1 c2 c4] → c3 evicts c4 → [b1 c5 c1 c2 c3] → c4 evicts c5 → [b1 c4 c1 c2 c3] → c5 evicts c1 → [b1 c4 c5 c2 c3]
Notice that each time a customer block is read into the buffer, the block replaced is the next block required! This results in a total of six (6) page replacements during the processing of the first two borrower tuples.

Buffer-Replacement Policies, Cont.
Suppose an MRU page-replacement strategy is used instead:
• Read in a block of borrower and keep it, i.e., pin it, in the buffer until the last tuple in that block has been processed.
• Once all of the tuples from that block have been processed, unpin it.
• If a customer block is to be brought into the buffer and another block must be moved out to make room, move out the most recently used block (borrower or customer).

Buffer-Replacement Policies, Cont.
Applying the MRU replacement algorithm gives the following for the first two tuples of b1:
pass 1: [b1 c1] → [b1 c1 c2] → [b1 c1 c2 c3] → [b1 c1 c2 c3 c4] → c5 evicts c4 (the MRU block) → [b1 c1 c2 c3 c5]
pass 2: c1, c2, c3 are already in the buffer → c4 evicts c3 (now the MRU block) → [b1 c1 c2 c4 c5] → c5 is already in the buffer
This results in a total of two (2) page replacements for processing the first two borrower tuples.
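The two traces above can be checked mechanically. Here is a small simulation of the nested-loop access pattern (b1 pinned, customer scanned once per borrower tuple) under both policies; it reproduces the counts of 6 and 2 replacements. The function name is made up for illustration:

```python
# Simulate LRU vs. MRU for the join example: b1 is pinned, and the
# customer blocks c1..c5 are scanned once per borrower tuple.
# Buffer capacity 5 => 4 slots remain for customer blocks.

def count_replacements(policy, passes=2, slots=4,
                       blocks=("c1", "c2", "c3", "c4", "c5")):
    buffer, recency, replacements = set(), [], 0   # recency: least recent first
    for _ in range(passes):                        # one pass per borrower tuple
        for b in blocks:
            if b in buffer:                        # hit: just update recency
                recency.remove(b)
            else:
                if len(buffer) >= slots:           # miss with a full buffer
                    victim = recency[0] if policy == "LRU" else recency[-1]
                    buffer.discard(victim)
                    recency.remove(victim)
                    replacements += 1
                buffer.add(b)
            recency.append(b)                      # b is now most recently used
    return replacements

print(count_replacements("LRU"))   # 6
print(count_replacements("MRU"))   # 2
```

The simulation makes the slide's point concrete: for a cyclic scan that is slightly larger than the buffer, LRU always evicts exactly the block needed next, while MRU preserves most of the cycle.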
Storage Mechanisms
Traditional "low-level" DBMS storage issues:
• Avoiding large numbers of block reads – algorithmically, the bar has been raised
• Records crossing block boundaries
• Dangling pointers
• Disk fragmentation
• Allocating free space
Storage "mechanisms," which provide partial solutions:
• Record packing
• Free/reserved-space lists
• Reserved space
• Record splitting
• Mixed block formatting
• Slotted-page structure

Storage Mechanisms
Recall the assumptions:
• A database is stored as a collection of files.
• A file consists of a sequence of blocks.
• A block consists of a sequence of records.
• A record consists of a sequence of fields.
Record size may be fixed or variable. Record size is usually small relative to the block size, but might be larger. Each file may have records of one particular type, or different files may be used for different relations. Sometimes tuples need to be sorted, other times not. Which assumptions apply determines which techniques are appropriate.

Record Packing
Record "packing" – basic approach: pack records as tightly as possible. Works best for fixed-size records:
• Store record i, where i >= 1, starting at byte n(i − 1), where n is the record size.
• Record access is simple, given i.
• Add each new record at the end.
• Deletion of record i, options:
  - Move records i + 1, …, n to i, …, n − 1 (shift)
  - Move record n to i

Record Packing
For mixed or variable-sized records:
• Attach an end-of-record character (⊥) to the end of each record.
• Pack the records as tightly as possible.
Issues:
• Deletion, insertion, and record growth result in 1) lots of record movement or 2) fragmentation.
• Record movement potentially results in dangling pointers.

Free Lists
Free lists – connect the free space (or the used space) into a linked list.
Advantages:
• Insertion and deletion can be done with few block I/Os (assuming the tuples aren't sorted).
• Records don't have to move, which eliminates dangling pointers.
Free lists are used by virtually all DBMSs at the file level.

Reserved Space
Reserved space – use the maximum required space for each record:
• Can be used for repeating fields.
• Also for varying-length attributes, e.g., a name field.
Issue: could waste significant space.

Record Splitting
Record splitting: a single variable-length record is stored as a collection of smaller records, linked with pointers.
• Can be used even if the maximum record length is not known.
• Once again, could be used with pointers for free and occupied records.
Disadvantages:
• Records are more likely to cross block boundaries.
• Space is wasted in all records except the first in a chain.

Mixed Block Formatting
Mixed block formatting – format blocks differently (typically in different files):
• Anchor block
• Overflow block
Records are likely to cross block boundaries. Frequently used for very large attributes such as BLOBs, text, etc.

Slotted Page Structure
A slotted page (i.e., block) is formatted with three sections: header, free space, and records.
Header contents:
• Number of record entries
• Pointer to the end of free space in the block
• Location and size of each record
Key points:
• External pointers refer to header entries rather than directly to records.
• Records can be moved within a page to eliminate empty space; only the header must be updated.
(A sketch appears below.)
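A minimal sketch of the slotted-page idea just described: the header (slot array) grows from the front, records grow from the back, and external references use slot numbers so records can move freely within the page. The byte layout and names are illustrative assumptions:

```python
# Slotted page sketch: slot entries (location, size) at the front, records
# packed from the back of the block. External pointers are (block, slot)
# pairs, so compaction can move records without creating dangling pointers.

class SlottedPage:
    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []          # slot i -> (offset, length); header info
        self.free_end = size     # records occupy data[free_end:]

    def insert(self, record: bytes) -> int:
        """Store a record and return its slot number."""
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1

    def get(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int):
        self.slots[slot] = None          # slot entry stays; space reclaimable

    def compact(self):
        """Slide live records to the end of the page; update header only."""
        live = [(i, self.get(i)) for i, s in enumerate(self.slots) if s]
        self.free_end = len(self.data)
        for i, record in live:
            self.free_end -= len(record)
            self.data[self.free_end:self.free_end + len(record)] = record
            self.slots[i] = (self.free_end, len(record))

page = SlottedPage()
s = page.insert(b"Hayes|A-102")
page.insert(b"Jones|A-217")
page.delete(s)
page.compact()
print(page.get(1))   # b'Jones|A-217': slot number survives compaction
```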
File Organization
More sophisticated techniques are typically used in combination with the preceding storage mechanisms:
• Hashing
• Sequential/sorted
• Clustering
These techniques can improve query performance substantially.

Sequential File Organization
Records in the file are ordered by a search key.
• Good for point queries and range queries.
• Maintaining physical sequential order is difficult under insertions and deletions.

Sequential File Organization
Use pointers to keep track of the logical record order:
• Could be used in conjunction with a free list.

Sequential File Organization
Insertion – locate the position where the record is to be inserted:
• If there is free space there, insert directly.
• If there is no free space, insert the record in another location taken off the free list.
• In either case, the pointer chain must be updated.
Deletion – maintain the pointer chains in the obvious way.
Main issue: the file may degenerate over time; reorganize the file from time to time to restore sequential order.

Clustering File Organization
A clustered file organization stores several relations in one file.
• A clustered organization of customer and depositor is good for some queries, e.g., a join of depositor and customer, or queries involving an individual customer and their accounts.
• It is bad for others, e.g., queries involving only customer.

Large Objects – A Special Case
Large objects: text documents, images, computer-aided designs, audio and video data.
Most DBMS vendors provide supporting types:
• binary large objects (BLOBs)
• character large objects (CLOBs)
• text, image, etc.
Large objects place high demands on storage:
• Typically stored in a contiguous sequence of bytes when brought into memory.
• If an object is bigger than a block, contiguous blocks in the buffer must be allocated for it.
• Record splitting is commonly used.
• It might be preferable to disallow direct access to the data through SQL, and to allow access only through a third-party file-system API.

End of Chapter

Data Dictionary Storage
The data dictionary (a.k.a. the system catalog) stores metadata:
• Information about relations:
  - names of relations
  - names and types of attributes
  - names and definitions of views
  - integrity constraints
• Physical file organization information:
  - how the relation is stored (sequential, hash, etc.)
  - physical location of the relation: operating-system file name, or disk addresses of the blocks containing the records
• Statistical and descriptive data:
  - number of tuples in each relation
• User and accounting information, including passwords
• Information about indices (Chapter 12)

Data Dictionary Storage (Cont.)
The catalog structure can use either:
• specialized data structures designed for efficient access, or
• a set of relations, with existing system features used to ensure efficient access.
The latter alternative is the standard. A possible catalog representation:
Relation-metadata = (relation-name, number-of-attributes, storage-organization, location)
Attribute-metadata = (attribute-name, relation-name, domain-type, position, length)
User-metadata = (user-name, encrypted-password, group)
Index-metadata = (index-name, relation-name, index-type, index-attributes)
View-metadata = (view-name, definition)
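To illustrate the catalog-as-relations idea, here is a minimal sketch that stores the Relation-metadata and Attribute-metadata "relations" as plain tuples and answers a typical catalog lookup. The sample rows, file paths, and function name are invented for illustration:

```python
# The catalog itself stored as relations: each metadata "table" is just a
# collection of tuples, queried like any other relation. Rows are invented.

from typing import NamedTuple

class RelationMetadata(NamedTuple):
    relation_name: str
    number_of_attributes: int
    storage_organization: str
    location: str

class AttributeMetadata(NamedTuple):
    attribute_name: str
    relation_name: str
    domain_type: str
    position: int
    length: int

relation_metadata = [
    RelationMetadata("customer", 3, "sequential", "/data/customer.dat"),
    RelationMetadata("borrower", 2, "sequential", "/data/borrower.dat"),
]

attribute_metadata = [
    AttributeMetadata("customer-name", "customer", "char", 1, 20),
    AttributeMetadata("customer-street", "customer", "char", 2, 30),
    AttributeMetadata("customer-city", "customer", "char", 3, 20),
]

def attributes_of(relation: str):
    """Catalog lookup: the attributes of a relation, in declaration order."""
    return sorted((a for a in attribute_metadata if a.relation_name == relation),
                  key=lambda a: a.position)

for a in attributes_of("customer"):
    print(a.attribute_name, a.domain_type, a.length)
```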