He Tan hetan@ida.liu.se IISLAB IDA Databases Physical Database and Index Structures Real world Model Databases DBMS He Tan Queries Answers Processing of queries and updates Access to stored data Physical database september 2007 He Tan hetan@ida.liu.se IISLAB IDA september 2007 1 1 He Tan hetan@ida.liu.se IISLAB IDA What is this about? • Physical file storage on hard drive • How data is stored on disk • Why is disk storage expensive? • How to choose a suitable index • How to speed up accessing of data by using indexes • Calculate the size of an index 3 3 He Tan hetan@ida.liu.se IISLAB IDA september 2007 He Tan hetan@ida.liu.se IISLAB IDA Storage hierarchy • 5 • Cache memory • Main memory Primary storage • Disk Secondary storage 4 4 CPU september 2007 What you will learn Disk memory • byte, capacity • track, cylinder • block (fast, small, expensive, volatile, accessible by CPU) (slow, big, cheap, permanent, inaccessible by CPU) Important because it effects query efficiency. 5 september 2007 6 6 1 He Tan hetan@ida.liu.se IISLAB IDA He Tan hetan@ida.liu.se IISLAB IDA block track Disk • actuator armread/write head spindle disk rotation A track is divided into equal-sized blocks during disk formatting (or initialization). • cylinder of tracks Block is the unit of transfer of data between disk and main memory, e.g. Read = copy block from disk to buffer in main memory. Write = the opposite way. R/w time = seek time + rotational delay + block transfer time. locate block 1-2 msec. 9-60 msec. actuator movement september 2007 7 7 He Tan hetan@ida.liu.se IISLAB IDA september 2007 He Tan hetan@ida.liu.se IISLAB IDA Block transfer 8 8 A block is the unit of transfer of data between disk memory and main memory Physical database • How data is stored on disk ? Approximately 1000 to 100000 times more expensive to perform one disk operation than one memory operation september 2007 • Processor instruction • Memory fetch • Disk fetch 10 −9 s 10 −8 s 10 −3 s 9 9 He Tan hetan@ida.liu.se IISLAB IDA Data stored in files. • A file is a sequence of records (table) • A record is a collection of field values (tuple) • Each field has an elementary data type • A collection of field names and their correponding data types constitues a record type 10 10 He Tan hetan@ida.liu.se IISLAB IDA Files and records • september 2007 Files and records • The records in a file are often of the same type. • A file contains records of the same or different length. • A file contains records of the different length : Records have same type, but fields have variable size. Records have same type, but fields have multiple values. Records have same type, but fields are optional. Records have different types. september 2007 11 11 september 2007 12 12 2 He Tan hetan@ida.liu.se IISLAB IDA He Tan hetan@ida.liu.se IISLAB IDA File and record storage • Records are allocated to blocks • Suppose that r is the number of records in the file B is the size in bytes of the block R is the size in bytes of the record (fixed-length) september 2007 13 13 He Tan hetan@ida.liu.se IISLAB IDA • What is the space wasted per block ? 14 File and record storage Block size (B) of 4096 bytes • Record size (R) is 100 bytes (fixed-length, unspanned) • Blocking factor (bf) is • Blocks needed (b) is block i wasted record 1 record 2 block i+1 wasted record 3 record 5 block i+1 record 1 record 3 record 2 record 4 record 3 p 15 september 2007 Contiguous allocation • Indexed allocation. File organization How records are placed in the file? Linked allocation Linked clusters allocation. 16 16 He Tan hetan@ida.liu.se IISLAB IDA From file blocks to disk blocks • B 4096 R = 100 = 40 r 1000000 bf = 40 = 25000 record 5 expensive sequential access but cheap record addition. 17 • Solution: Spanned records. cheap sequential access but expensive record addition. september 2007 r b = bfr File with 1 000 000 records (r) 15 • Blocks needed: • block i • • • Spanned Spanned He Tan hetan@ida.liu.se IISLAB IDA B bfr = R Wasted space per block = B – bfr * R. Unspanned Unspanned september 2007 Blocking factor: 14 He Tan hetan@ida.liu.se IISLAB IDA Spanned/Unspanned records • september 2007 • 17 september 2007 18 • Heap files. • Sorted files. • Hash files. 18 3 He Tan hetan@ida.liu.se IISLAB IDA He Tan hetan@ida.liu.se IISLAB IDA Heap files • Records are added to the end of the file. Hence, Sorted files • Cheap record addition. Expensive record retrieval, removal and update, since they imply linear search: b Cheap ordered record retrieval: • All the records: access blocks sequentially. • Next record: probably in the same block. • Random record: binary search, then worst case implies • Average case: 2 block accesses. • Worst case: b block accesses (if it doesn't exist or several exist). • september 2007 (i mod bfr) th record in the block 19 19 He Tan hetan@ida.liu.se IISLAB IDA september 2007 september 2007 21 21 He Tan hetan@ida.liu.se IISLAB IDA 23 Indexing • an extra structure that is used to make the search of records faster • How to choose a suitable index • Calculate the size of an index The hash function applies to the hash field and returns the position of the record in the array of records. E.g. position = field mod r september 2007 20 20 He Tan hetan@ida.liu.se IISLAB IDA Hash files • Is record updating cheap or expensive ? The search condition to locate the record. The field to be modified. What is the disk block and record of the i-th file record? i bfr block accesses. • Addition: a temporary unordered file (overflow file) + periodic reorganization. • Deletion: deletion markers + periodic reorganization. Heap file, fix-length records, unspanned blocks, and contiguous allocation. block log2 b Expensive record addition and deletion. Moreover, record removal implies waste of space. So, periodic reorganization. • Records ordered according to some field. Hence, is based one or more index files, • stores the physical addresses of records. 22 22 He Tan hetan@ida.liu.se IISLAB IDA Indexing • september 2007 Primary Index • 23 september 2007 24 An ordered file with two fields, an indexing field and a block pointer. • The data file is physically ordered on a key feild on the disk. • All data records are stored sequentially on a disk. 24 4 He Tan hetan@ida.liu.se IISLAB IDA He Tan hetan@ida.liu.se IISLAB IDA Primary Index •There is one index entry for each data block •Each index entry contains the ordering key value of the first record in the data block plus a pointer to the block. september 2007 25 25 He Tan hetan@ida.liu.se IISLAB IDA september 2007 Blocks needed for primary index? • File with 1000 000 records (r) • Block size (B) of 4096 bytes • Record size (R) 100 bytes (fixed, unspanned) • Index size (R) is 10 bytes (fixed , unspanned) • Blocks for the file 25000 26 26 Clustering Index Clustering Index • An ordered file with two fields, an indexing field and a block pointer. 1 (Andersson, Anders) • The data file is physically ordered on a nonkey field on the disk. 2 (Andersson, Anders) • Store all data records sequentially on a disk 1 (Andersson, Anders) 1 Block 1 3 (Andersson, Nils) 2 3 4 4 (Andersson, Sven) 5 Block 2 4 (Bengtsson, Anders) 4 (Davidsson, Nils) Block 3 5 (Larsson, Anders) september 2007 27 27 He Tan hetan@ida.liu.se IISLAB IDA Clustering Index 1 (Andersson, Anders) block anchoring 1 (Andersson, Anders) 2 (Andersson, Anders) Block 1 Block 2 1 2 3 (Andersson, Nils) Block 3 Blocks needed for clustering index • File with 1000 000 records • Block size (B) of 4096 • Record size 100 bytes (fix, unspanned) • Index size is 10 bytes (unspanned) • Assume: 100 records of each value in average (block anchoring) 3 4 5 4 (Andersson, Sven) 4 (Bengtsson, Anders) 4 (Davidsson, Nils) 5 (Larsson, Anders) Block 4 Block 5 Block 6 september 2007 30 30 5 He Tan hetan@ida.liu.se IISLAB IDA He Tan hetan@ida.liu.se IISLAB IDA Terminology • Dense index: All records in the file occurs in the index • Sparse (nondense) index: Only some of the records occur in the index. Problems with Primary & Clustering Indexes • As with any ordered file, addition and deletion becomes very expensive. We might have to move records in the data file. When the data file is changed, the anchor items might change. • Are a primary index/a clustering index dense or sparse? Use some kind of unordered overflow structure september 2007 31 31 He Tan hetan@ida.liu.se IISLAB IDA Secondary Index • An ordered file with two fields, an indexing field and a block pointer. • The data do not have to be physically ordered on the indexing field september 2007 32 32 Secondary Index A dense secondary index on a nonordering key field september 2007 33 33 He Tan hetan@ida.liu.se IISLAB IDA Secondary Index Example • File with 1000 000 elements (r) • Block size (B) of 4096 • Record size (R) 100 • Suppose that the indexing field is 6 bytes • A block pointer is 4 bytes • Each index entry, R, is then 10 bytes Secondary Index A dense secondary index on a nonordering nonkey field september 2007 35 35 6 He Tan hetan@ida.liu.se IISLAB IDA september 2007 He Tan hetan@ida.liu.se IISLAB IDA Secondary Index Example • • File with 1000 000 elements (r) • Block size (B) of 4096 • Record size (R) 100 • the indexing field is 6 bytes, block pointer is 4 bytes • Each index entry, R, is then 10 bytes • 100 values that are identical (average) • Option 2/ 3 • • Secondary index Index on a field that is not used for file ordering. 37 september 2007 Summary: Properties of indexes 38 38 He Tan hetan@ida.liu.se IISLAB IDA Example Number of first level index entries Dense or nondense Block anchoring Primary Number of blocks in data file Nondense Yes Assume a file with 5 000 students. There is in average 50 students with the same family name. Each record is 200 bytes. Family name takes 20 bytes and personal number 10. The block size is 4096 and a block pointer is 5 bytes. Clustering Number of distinct index field values Nondense Yes/No The file is sorted on personal number. Secondary (unique) Number of records in data file Dense No Secondary (nonunique) Number of records or number of distinct index field values Dense No 39 39 He Tan hetan@ida.liu.se IISLAB IDA Clustering index Index a nonkey field File ordered by that nonkey field on the data file september 2007 Primary index Index the key field File ordered by the key field 37 He Tan hetan@ida.liu.se IISLAB IDA Summary: Main index types september 2007 40 • An index on personal number (files sorted on personal number) • An index on family name 40 Example Assume a file with 5 000 students. There is in average 50 students with the same family name. Each record is 200 bytes. Family name takes 20 bytes and personal number 10. The block size is 4096 and a block pointer is 5 bytes The file is sorted on family name september 2007 41 • An index on personal number (files sorted on personal number) • An index on family name 41 7