Databases
Physical Database and Index Structures
He Tan, hetan@ida.liu.se, IISLAB, IDA
September 2008

[Figure: the real world is modeled in databases; the DBMS handles queries and answers, processing of queries and updates, and access to stored data in the physical database.]

What you will learn
• How data is stored on disk
• How to speed up access to data by using indexes

Storage hierarchy
• CPU
• Primary storage (fast, small, expensive, volatile, accessible by CPU)
  • Cache memory
  • Main memory
• Secondary storage (slow, big, cheap, permanent, inaccessible by CPU)
  • Disk

Disk
[Figure: disk with spindle, disk rotation, arm, actuator, read/write head, track, block and cylinder.]
• A track is divided into equal-sized blocks during disk formatting (or initialization).
• A block is the unit of transfer of data between disk and main memory.
  • Read = copy a block from disk to a buffer in primary memory.
  • Write = the opposite way.
• R/W time = seek time + rotational delay + block transfer time.
  (Figure annotations: locate block, actuator movement; 1-2 msec., 9-60 msec.)

Block transfer
• A block is the unit of transfer of data between secondary memory and primary memory.
• One disk operation is approximately 1 000 to 100 000 times more expensive than one memory operation.
• Operations compared: processor instruction, memory fetch, disk fetch.

Physical database
• How is the database stored on disk?
  • How records are placed on disk
  • How records are placed in files
  • How file blocks are placed in disk blocks
• Data is stored in files.
Files and records
• A file (table) is a sequence of records.
• A record (tuple) is a collection of field values.
• Each field (attribute) has an elementary data type.
• A file contains records of the same or different length.
• A file contains records of different length when:
  • Records have the same type, but fields have variable size.
  • Records have the same type, but fields have multiple values.
  • Records have the same type, but fields are optional.
  • Records have different types.

File and record storage
• Records are allocated to blocks.
• Suppose that
  • r is the number of records in the file
  • B is the size in bytes of the block
  • R is the size in bytes of the record (fixed-length)
• Blocking factor: $bfr = \lfloor B / R \rfloor$
• Blocks needed: $b = \lceil r / bfr \rceil$

Spanned/Unspanned records
• Wasted space per block = $B - bfr \cdot R$.
• Solution: spanned records.
[Figure: unspanned vs. spanned layout of records 1 to 5 in blocks i and i+1. In the unspanned layout each block ends with wasted space; in the spanned layout a record that does not fit continues in the next block via a pointer p.]

File and record storage
• File with 1 000 000 records (r)
• Block size (B) of 4 096 bytes
• Record size (R) is 100 bytes (fixed-length, unspanned)
• Blocking factor (bfr) is $\lfloor B / R \rfloor = \lfloor 4096 / 100 \rfloor = 40$
• Blocks needed (b) is $\lceil r / bfr \rceil = \lceil 1000000 / 40 \rceil = 25000$

Operation times: processor instruction $10^{-9}$ s, memory fetch $10^{-8}$ s, disk fetch $10^{-3}$ s.

File organization
• How are records placed in the file?
  • Heap files
  • Sorted files
  • Hash files

Heap files
• Records are added to the end of the file. Hence:
  • Cheap record addition.
  • Expensive record retrieval, removal and update, since they imply linear search:
    • Average case: $b / 2$ block accesses.
    • Worst case: $b$ block accesses (if the record doesn't exist or several exist).
• Moreover, record removal implies waste of space. So, periodic reorganization.
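The blocking-factor arithmetic and the heap-file search costs above can be checked with a few lines of Python. This is only a sketch for fixed-length, unspanned records; the function names are made up for illustration and do not come from the slides.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Whole records that fit in one block (unspanned): floor(B / R)."""
    return block_size // record_size


def blocks_needed(num_records: int, bfr: int) -> int:
    """Blocks required for all records: ceil(r / bfr), using integer arithmetic."""
    return (num_records + bfr - 1) // bfr


def wasted_bytes_per_block(block_size: int, record_size: int) -> int:
    """Unused bytes at the end of each block: B - bfr * R."""
    return block_size - blocking_factor(block_size, record_size) * record_size


# Values from the slide example: B = 4096 bytes, R = 100 bytes, r = 1 000 000 records.
B, R, r = 4096, 100, 1_000_000
bfr = blocking_factor(B, R)
b = blocks_needed(r, bfr)

print(bfr, b, wasted_bytes_per_block(B, R))  # 40 25000 96

# Heap file, linear search: average case b / 2 block accesses, worst case b.
print(b // 2, b)  # 12500 25000
```

With a disk fetch in the millisecond range, the worst-case scan of 25 000 blocks is what motivates the sorted and hashed organizations.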
Sorted files
• Records are ordered according to some field. Hence:
  • Cheap record retrieval based on the ordering field:
    • All the records: access blocks sequentially.
    • Next record: probably in the same block.
    • Random record: binary search, so the worst case implies $\log_2 b$ block accesses.
  • Expensive record addition and deletion:
    • Addition: a temporary unordered file (overflow file) + periodic reorganization.
    • Deletion: deletion markers + periodic reorganization.
• Is record updating cheap or expensive? Consider:
  • The search condition to locate the record.
  • The field to be modified.

Hash files
• The hash function is applied to the hash field and returns the position of the record in the array of records, e.g. position = field mod n.
• Collision: different field values hash to the same position. Solutions:
  • Open addressing: check subsequent positions until an unused position is found.
  • Multiple hashing: use a second hash function.
  • Chaining: put the record in the overflow area and link it.

From file blocks to disk blocks
• Contiguous allocation
• Linked allocation
• Linked clusters allocation
• Indexed allocation

What you will learn
• How data is stored on disk
• How to speed up access to data by using indexes

Indexing
• One or more index files
• An index stores the physical addresses of records.
• Single-level ordered indexes

Primary Index
• The data file is physically ordered on a key field on the disk. All data records are stored sequentially on the disk.
• The index is an ordered file with two fields, an indexing field and a block pointer.
  • The indexing field is specified on the key field on which the data file is ordered.

Single-level ordered index types: primary index, clustering index, secondary index.

Primary Index
[Figure: primary index whose entries point to data blocks; the first record of each block is the block anchor.]
Example query: select * from EMPLOYEE where NAME = 'Wood, Donald';

Blocks needed for primary index?
• File with 1 000 000 records (r)
• Block size (B) of 4 096 bytes
• Record size (R) is 100 bytes (fixed, unspanned)
• Index entry size (Ri) is 10 bytes (fixed, unspanned)

Clustering Index
• The data file is physically ordered on a nonkey field on the disk. All data records are stored sequentially on the disk.
• The index is an ordered file with two fields, an indexing field and a block pointer.
  • The indexing field is specified on the nonkey field on which the data file is ordered.

Clustering index
[Figure: clustering index whose entries point to data blocks.]
Example query: select * from EMPLOYEE where DEPT = '5';

Blocks needed for clustering index
• File with 1 000 000 records
• Block size (B) of 4 096 bytes
• Record size is 100 bytes (fixed, unspanned)
• Index entry size is 10 bytes (unspanned)
• 100 records for each value on average

Secondary Index
• The index is an ordered file with two fields, an indexing field and a block pointer.
• The data file is not physically ordered on the indexing field.
  • The indexing field is specified on a non-ordering field of the data file.

[Figure: secondary index on a nonordering key field.]

Secondary Index Example
• File with 1 000 000 records (r)
• Block size (B) of 4 096 bytes
• Record size (R) is 100 bytes
• Each index entry is 10 bytes

[Figure: secondary index on a nonordering nonkey field.]

Secondary Index Example
• File with 1 000 000 records (r)
• Block size (B) of 4 096 bytes
• Record size (R) is 100 bytes
• Each index entry (Ri) is then 10 bytes
• On average, 100 records have identical values
• Option 2/3

Terminology
• Dense index: all records in the file occur in the index.
• Sparse (nondense) index: only some of the records occur in the index.
• Is a primary index, a clustering index, or a secondary index dense or sparse?

Summary: Main index types
• Primary index
  • Index on the key field
  • File ordered by the key field
• Clustering index
  • Index on a nonkey field
  • File ordered by that nonkey field
• Secondary index
  • Index on a field that is not used for file ordering

Summary: Properties of indexes
[Table]

Example
Assume a file with 5 000 students. On average, 50 students have the same family name. Each record is 200 bytes. Family name takes 20 bytes and personal number 10 bytes.
The block size is 4096 bytes and a block pointer is 5 bytes.

The file is sorted on personal number:
• An index on personal number
• An index on family name

The file is sorted on family name:
• An index on personal number
• An index on family name
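The block counts asked for in the index examples can be estimated with the same blocking-factor arithmetic used for data files. The sketch below is one possible way to work them out, not the official solution. It assumes a primary index has one entry per data block, a clustering index has one entry per distinct value, a secondary index has one entry per record (dense, with no extra level of indirection, which is only one of the possible options for a nonkey field), and that an index entry for the student file is the indexed field plus one block pointer. All function and variable names are hypothetical.

```python
def blocks_needed(num_items: int, item_size: int, block_size: int) -> int:
    """Blocks needed for fixed-size, unspanned items: ceil(n / floor(B / size))."""
    per_block = block_size // item_size
    return (num_items + per_block - 1) // per_block


B = 4096  # block size in bytes

# Large file from the slides: 1 000 000 records of 100 bytes, 10-byte index entries.
data_blocks = blocks_needed(1_000_000, 100, B)           # 25000
primary = blocks_needed(data_blocks, 10, B)               # one entry per data block
clustering = blocks_needed(1_000_000 // 100, 10, B)       # one entry per distinct value
secondary_key = blocks_needed(1_000_000, 10, B)           # one entry per record
print(primary, clustering, secondary_key)  # 62 25 2445

# Student file: 5 000 records of 200 bytes, block pointer 5 bytes.
# Assumed entry sizes: personal number 10 + 5 bytes, family name 20 + 5 bytes.
student_blocks = blocks_needed(5_000, 200, B)             # 250
pnr_entry, name_entry = 10 + 5, 20 + 5

# File sorted on personal number.
primary_pnr = blocks_needed(student_blocks, pnr_entry, B)
secondary_name = blocks_needed(5_000, name_entry, B)      # dense, assumption
print(primary_pnr, secondary_name)  # 1 31

# File sorted on family name.
secondary_pnr = blocks_needed(5_000, pnr_entry, B)        # dense, key field
clustering_name = blocks_needed(5_000 // 50, name_entry, B)
print(secondary_pnr, clustering_name)  # 19 1
```

Under these assumptions the sparse indexes (primary, clustering) are far smaller than the dense secondary indexes, which matches the dense versus sparse distinction in the terminology slide.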