CS346:
Advanced Databases
Graham Cormode
G.Cormode@warwick.ac.uk
Storage, Files
and Indexing
Outline
Part 1:
 Disk properties and file storage
 File organizations: ordered, unordered, and hashed
 Storage topics: RAID and Storage area networks
 Chapter: “Disk Storage, Basic File Structures and Hashing” in
Elmasri and Navathe
Part 2: Indexes
CS346 Advanced Databases
Why?
 Important to understand how high-level abstractions
(databases) map down to low-level concepts (disks, files)
– Get a sense of the scale of the quantities involved
(seek times, overhead of inefficient solutions)
– Appreciate the difference that smart solutions can bring
– Understand where the bottlenecks lie
 Give a “bottom-up” perspective on data management
– See the whole picture starting from the low-level
– Demystify some aspects that can seem opaque
(B-trees, hashing, file organization)
– Apply to many areas of computer science (OS, algorithms…)
The Memory Hierarchy
[Figure: memory hierarchy diagram, including flash storage]
Data on Disks
 Databases ultimately rely on non-volatile disk storage
– Data typically does not fit in (volatile) memory
 Physical properties of disks affect performance of the DBMS
– Need to understand some basics of disks
 A few exceptions to disk-based databases:
– Some real-time applications use “in-memory databases”
– Some legacy/massive applications use tape storage as well
 Different tradeoffs with flash-based storage
– Much faster to read, but limits on number of erase (rewrite) cycles
– No major difference between random access and linear scan
– “Flash databases” are a niche, but growing area
Rotating Disk: 5000 – 10000RPM
 Sector size: 0.5KB – 4KB, basic unit of data transfer from disk
 Seek time: move read head into position, currently ~4ms
– Includes rotational delay: wait for sector to come under read head
– Random access: 1/0.004 * 4KB = 1MB/second: quite slow
 Track-to-track move, currently ~0.4ms: 10 times faster
– Sustained read/write rate: 100MB/second (caching can improve)
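The random-access figure can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch using the approximate numbers quoted on the slide; the constant and function names are made up for illustration.

```python
# Approximate figures taken from the slide (illustrative, not measured).
SEEK_TIME_S = 0.004                    # ~4 ms per random access, incl. rotational delay
SECTOR_BYTES = 4 * 1024                # 4KB transferred per access
SEQUENTIAL_BYTES_PER_S = 100 * 10**6   # ~100MB/second sustained


def random_access_rate(seek_time_s: float, bytes_per_access: int) -> float:
    """Bytes per second when every access pays a full seek."""
    return bytes_per_access / seek_time_s


random_rate = random_access_rate(SEEK_TIME_S, SECTOR_BYTES)
speedup = SEQUENTIAL_BYTES_PER_S / random_rate

print(f"random access: {random_rate / 1e6:.2f} MB/s")
print(f"sequential is roughly {speedup:.0f}x faster")
```

With these inputs the random-access rate comes out at about 1MB/second, so sequential reads are faster by a factor of roughly 100.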
Disk properties: the fundamental contrast
 Random access is slow, sequential access is fast
– By factors of up to 100s
– Want to design storage of data to avoid or minimize random
access and make data access as fast as possible
 Buffering can help in multithreaded systems:
– Work on other processes while waiting for data to arrive
– Double buffering: maintain two buffers of data
work on current buffer of data, while other buffer fills from disk
– Maximizes parallel utilization, but doesn’t make my thread faster
Records: the basic unit of the database
 Databases fundamentally composed of records
– Each record describes an object with a number of fields
 Fields have a type (integer, float, string, time, compound…)
– Fixed or variable length
 Need to know when one field ends
and the next begins
– Field length codes
– Field separators (special characters)
 Leads to variable length records
– How to effectively search through data with variable length records?
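As an illustration of field length codes, here is a minimal sketch that stores each field as a 2-byte length followed by its bytes. The 2-byte big-endian prefix and the function names are assumptions chosen for the example, not a format prescribed by the course or by any particular DBMS.

```python
import struct
from typing import List


def encode_record(fields: List[bytes]) -> bytes:
    """Pack fields as <2-byte length><payload> pairs (field length codes)."""
    out = bytearray()
    for field in fields:
        out += struct.pack(">H", len(field))  # big-endian unsigned 16-bit length
        out += field
    return bytes(out)


def decode_record(data: bytes) -> List[bytes]:
    """Walk the record, using each length code to find where the field ends."""
    fields = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        fields.append(data[pos:pos + length])
        pos += length
    return fields


record = [b"Smith", b"123456789", b"Coventry"]
encoded = encode_record(record)
decoded = decode_record(encoded)
```

Because the record length now depends on its contents, finding the k-th record in a block means walking the length codes rather than jumping to a fixed offset.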
Records and Blocks
 Records get stored on disks organized into blocks
 Small records: pack an integer number into each block
– Leaves some space left over in blocks
– Blocking Factor: (average) number of records per block
 Large records: may not be effective to leave slack
– Records may span across multiple blocks (spanned organization)
– May use a pointer at end of block to point to next block
Files
 A sequence of records is stored as a file
– Either using OS file system support, or handled by DBMS
 Database requires support for various file operations:
– Open file, return new file handle
– Scan for the next record that satisfies a search condition
– Read the next record from disk into memory
– Delete the current record and (eventually) update file on disk
– Modify the current record and (eventually) update file on disk
– Insert a new record at the current location
– Close the file, flush any buffers and postponed operations
 Need suitable file layout and indices to allow fast scan operation
File organization: unordered
 Just dump the records on disk
in no particular order
 Insert is very efficient:
just add to last block
 Scan is very inefficient:
need to do a linear search
– Read half the file on average
 Delete could be inefficient:
– Read whole file, write it back with deleted record omitted
– Instead, just “mark” record as deleted
– Periodically remove marked records
File organization: ordered
 Keep records ordered on
some (key) attribute
 Can scan through records
in that order very easily
 Can search for a value
(or range of values)
by binary search
– Binary search: log2 b seeks to find desired record out of b blocks
– Linear search: b/2 seeks on average to find record
 Insertion is rather more expensive and complex to do well
– Keep recent records in “overflow buffer” for periodic merge
 If modifying the key field, treat as a deletion and an insertion
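The log2 b bound can be seen by counting block reads in a small simulation. This is a sketch only: the in-memory list of lists stands in for disk blocks, and the layout (3000 blocks of 10 consecutive integer keys) is a made-up example.

```python
from typing import List, Optional, Tuple


def find_block(blocks: List[List[int]], key: int) -> Tuple[Optional[int], int]:
    """Binary search over sorted blocks.

    Returns (block index holding key, or None if absent, number of blocks read).
    """
    lo, hi = 0, len(blocks) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]   # one simulated block read (one seek)
        reads += 1
        if key < block[0]:
            hi = mid - 1
        elif key > block[-1]:
            lo = mid + 1
        else:
            return (mid if key in block else None), reads
    return None, reads


# Hypothetical ordered file: 3000 blocks, 10 consecutive keys per block.
blocks = [list(range(i * 10, i * 10 + 10)) for i in range(3000)]
index, reads = find_block(blocks, 12345)
```

For 3000 blocks the search never reads more than 12 blocks, against 1500 on average for a linear scan.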
File organization: hashed
 Use hashing to ensure
records with same key
are grouped together
 Arrange file blocks into
M equal sized buckets
– Often, 1 block = 1 bucket
 Apply hash function to key field to determine its bucket
 Usual hash table concerns emerge
– Need to deal with collisions, e.g. by open addressing, or chaining
– Deletions also get messy, depending on collision method used
External hashing
 Don’t store records directly in buckets, store pointers to records
– Pointers are small, fit more in a block
– “All problems in computer science can be solved by another level
of indirection” – David Wheeler
External hashing: issues
 Aim for 70-90% occupancy of the hash table
– Not too much wastage, not too many collisions
 Hash function should spread records evenly across buckets
– If very skewed distribution, we lose benefits of hashing
 Still costly if access to records ordered by key is required
– And doesn’t help with accessing records not by key
 Main disadvantage: hard to adjust if number of records grows
– Need to resize the hash table
 What if too many records hash to the same bucket?
– Can handle extra records by “chaining” to overflow buckets
Hashing: Overflow buckets
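A minimal sketch of a hashed file with chained overflow buckets might look like the following. The bucket capacity, the number of buckets M, and the class names are all hypothetical values picked to keep the example small; integer keys are used so that Python's built-in hash is predictable.

```python
BUCKET_CAPACITY = 2   # records per bucket (one block); hypothetical value
NUM_BUCKETS = 4       # M in the slides; hypothetical value


class Bucket:
    def __init__(self):
        self.records = []     # (key, value) pairs held in this block
        self.overflow = None  # pointer to an overflow bucket, if any


class HashedFile:
    def __init__(self, num_buckets=NUM_BUCKETS):
        self.buckets = [Bucket() for _ in range(num_buckets)]

    def _home(self, key):
        """Home bucket chosen by hashing the key field."""
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._home(key)
        # Walk the chain until a bucket with free space is found.
        while len(bucket.records) >= BUCKET_CAPACITY:
            if bucket.overflow is None:
                bucket.overflow = Bucket()
            bucket = bucket.overflow
        bucket.records.append((key, value))

    def search(self, key):
        """Return (value or None, number of blocks read)."""
        bucket = self._home(key)
        blocks_read = 0
        while bucket is not None:
            blocks_read += 1
            for k, v in bucket.records:
                if k == key:
                    return v, blocks_read
            bucket = bucket.overflow
        return None, blocks_read


f = HashedFile()
for k in (0, 4, 8, 12, 16):   # all five keys collide in bucket 0
    f.insert(k, str(k))
```

Here every key lands in the same home bucket, so finding key 16 costs three block reads: a small picture of why a skewed hash function loses the benefit of hashing.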
Extendible hashing
 Hashing scheme that allows the hash table to grow and shrink
– Avoid wasted space and avoid excessive collisions
 Makes use of a directory of bucket addresses
– Directory size is a power of two, 2^d
– So can double or halve the directory size as needed
– The first d bits of the hash value are used to index into the directory
 Directory entries point to disk blocks storing records
– Contiguous directory entries can point to same disk block
– Disk blocks can have a local value of d, d’
 Insertions into a block may cause it to overflow and split in two
– The directory is then updated accordingly
 Extendible hashing example
– Some values of d’ less than global d
Extendible Hashing: Updating d
 If a bucket becomes full, may need to increase d ← d + 1
– Double the size of the directory
 Similarly, if all buckets have local d’ < d, can decrease d ← d – 1
– Halve the size of the directory
 Other adaptive hashing variants exist
– Dynamic hashing: binary tree directory
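The growth half of extendible hashing (bucket split plus directory doubling) can be sketched as below. This is an illustrative toy under stated assumptions: an 8-bit hash taken from the low bits of an integer key, a bucket capacity of 2, and invented class and method names. Shrinking the directory is not covered.

```python
HASH_BITS = 8          # hypothetical: width of the hash value in bits
BUCKET_CAPACITY = 2    # hypothetical: records per disk block


def hash_bits(key: int) -> int:
    """Toy hash: the low HASH_BITS bits of the key (an assumption for the demo)."""
    return key & ((1 << HASH_BITS) - 1)


class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth   # d' in the slides
        self.keys = []


class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0            # d in the slides
        self.directory = [Bucket(0)]     # always 2^d entries

    def _index(self, key: int) -> int:
        # The first (most significant) d bits of the hash select the entry.
        return hash_bits(key) >> (HASH_BITS - self.global_depth)

    def contains(self, key: int) -> bool:
        return key in self.directory[self._index(key)].keys

    def insert(self, key: int) -> None:
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.keys) < BUCKET_CAPACITY:
                bucket.keys.append(key)
                return
            if bucket.local_depth == HASH_BITS:
                raise OverflowError("bucket full and no hash bits left to split on")
            self._split(bucket)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory: old entry i becomes entries 2i and 2i+1.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        # Redistribute keys on the bit that now distinguishes the two buckets.
        shift = HASH_BITS - bucket.local_depth
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            target = sibling if (hash_bits(k) >> shift) & 1 else bucket
            target.keys.append(k)
        # Repoint directory entries whose distinguishing bit is 1.
        for i, b in enumerate(self.directory):
            if b is bucket:
                prefix = i >> (self.global_depth - bucket.local_depth)
                if prefix & 1:
                    self.directory[i] = sibling


table = ExtendibleHash()
for k in (0, 128, 64, 192, 32, 16, 255, 1):
    table.insert(k)
```

After these inserts some buckets have a local depth smaller than the global depth, so several contiguous directory entries share one bucket, as in the slide's example.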
RAID disk technology
 RAID originally a way to combine multiple cheap disks for reliability
– “Redundant Array of Inexpensive Disks” (1980s)
 Now general purpose approach to providing reliability
– “Redundant Array of Independent Disks”
– Sets of different levels of replication
 RAID 0: spread data over multiple disks (striping)
– Increases throughput, but increases risk of data loss
Important RAID levels
 RAID 1: duplication of data across multiple disks (mirroring)
– Data copied to 2 (or more) disks
– Disk reliability measured in “mean time between failures” (MTBF)
– Typical MTBF is 100K hours – 1M hours (roughly 11 years up to ~ 1 century)
– Chance of both disks failing at same time is small
– So enough time to recover a copy
 RAID 5: block level striping and parity coding spread over disks
– Parity coding: allows recovery of 1 missing disk
[Figure: parity example with data bits 1 0 1 1 0 and parity bit 1]
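Parity coding can be sketched with byte-wise XOR: the parity block is the XOR of the data blocks, and any one missing block is the XOR of everything that survives. The block contents below are arbitrary made-up bytes, and real RAID 5 additionally rotates the parity block across disks, which this sketch ignores.

```python
from functools import reduce
from typing import List


def parity_block(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"\x0b\x10", b"\x22\x01", b"\xf0\x0f"]   # three hypothetical data disks
parity = parity_block(data)

# Disk 1 fails: rebuild its block from the survivors plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = parity_block(survivors)
```

The same function serves for both encoding and recovery because XOR is its own inverse.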
RAID levels
 RAID 6: Reed-Solomon coding allows multiple disk losses
 Other RAID levels (2, 3, 4) not in common usage
Storage Area Networks
 Storage Area Networks: virtual disks
– Disks attached to “headless” server
– Easy to configure, low maintenance overhead
 Many advantages to SANs:
– Flexible configuration: hot-swap new disks in/out
– Can be physically remote from other network elements
 Provided on fast (fibre-based) network
– Separate storage for server configuration, OS updates etc.
Outline
Part 2
 Indexes: primary and secondary
 Multilevel indexes and B-trees
 Chapter: “Indexing Structures for Files” in Elmasri and Navathe
Indexing for Files
 Chapter: “Indexing Structures for Files” in Elmasri and Navathe
– Move focus from how file is stored on disk to how file is accessed /
indexed by the DBMS
 Index: an auxiliary file that makes it faster to find certain records
– An index is usually for one field of the record (e.g. index by name)
– Can have multiple indexes, each for different fields
 A basic form of an index is a sorted list of pointers
– <field value, pointer to record>, ordered by field value
– “An access path” for the indexed field
Indexes as access paths
 Indexes usually take up much less space than the original file
– Each index entry is much smaller than the full record
– Just need a field value, and a pointer (few bytes)
 Efficient to look up matching records
– Binary search on the index, then follow pointer
 The index may be dense or sparse
– Dense index: contains an entry for every possible search value
– Sparse index: contains entries only for some search values
 Can have an index on the field that the file is sorted on! Why?
– Can be faster to search via index than do binary search on file
Primary Index
 A primary index applies when the file is ordered by a key field
 A sparse index: one entry for each block of the data file
– An index entry for the first record in the block (the block anchor)
– Can be many fewer entries in index than in the data file
 Straightforward to search for a record
– Use the index to find the block that the record should be in
– Retrieve the block and see if the record is there
 Insertion and deletion of records in the main file is a pain!
– Almost all the pointers change!
 Some standard tricks to mitigate the pain
– Buffer updates in an “overflow” file and check against this
– Linked list of overflow records for each block as needed
– Mark records as deleted, and only purge periodically
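The two-step search (find the block via the anchors, then look inside it) can be sketched with Python's bisect module. The data blocks and key values are invented for the example, and in-memory lists stand in for the index file and data file.

```python
from bisect import bisect_right

# Hypothetical ordered data file: each block holds sorted keys.
data_blocks = [[2, 5, 8], [12, 15, 21], [24, 29, 35], [39, 44, 48]]

# Sparse primary index: one anchor (first key) per block, in block order.
anchors = [block[0] for block in data_blocks]


def lookup(key: int):
    """Return (block number or None, whether the key was found)."""
    pos = bisect_right(anchors, key) - 1   # last anchor <= key
    if pos < 0:
        return None, False                 # smaller than every anchor
    return pos, key in data_blocks[pos]    # one data block read


hit = lookup(15)
miss = lookup(13)
```

Note that the index alone cannot say whether key 13 exists; it only identifies block 1 as the single block where it could be.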
Indexing Example
 Example: Given a data file EMPLOYEE(NAME, SSN, ADDRESS,
JOB, SAL, ... )
 Suppose that:
– record size R=100 bytes (fixed size)
– block size B=1024 bytes
– file size r=30000 records
 Blocking factor Bfr = ⌊B / R⌋ = ⌊1024 / 100⌋ = 10 records/block
 Number of file blocks b = ⌈r / Bfr⌉ = ⌈30000 / 10⌉ = 3000 blocks
Indexing Example
 For an index on the SSN field, assume the field size VSSN=9 bytes
and the record pointer size PR=6 bytes.
Then:
– index entry size RI = (VSSN + PR) = (9 + 6) = 15 bytes
– index blocking factor BfrI = ⌊B / RI⌋ = ⌊1024 / 15⌋ = 68 entries/block
– number of index blocks bI = ⌈b / BfrI⌉ = ⌈3000 / 68⌉ = 45 blocks
– binary search needs ⌈log2(bI)⌉ = ⌈log2(45)⌉ = 6 block accesses
 [In practice, likely that these 45 blocks would end up in cache]
 This is compared to an average linear search cost of:
– (b/2) = 3000/2 = 1500 block accesses
 If the file records are ordered, the binary search cost would be:
– ⌈log2 b⌉ = ⌈log2 3000⌉ = 12 block accesses
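The arithmetic of this example can be reproduced in a few lines. The variable names mirror the slide's symbols; the index is assumed to be a sparse primary index with one entry per data block.

```python
import math

R = 100        # record size in bytes
B = 1024       # block size in bytes
r = 30000      # number of records in the file
V_SSN = 9      # size of the indexed field in bytes
P_R = 6        # size of a record pointer in bytes

bfr = B // R                              # blocking factor of the data file
b = math.ceil(r / bfr)                    # number of data blocks
R_I = V_SSN + P_R                         # index entry size
bfr_I = B // R_I                          # index blocking factor
b_I = math.ceil(b / bfr_I)                # index blocks: one entry per data block
index_search = math.ceil(math.log2(b_I))  # binary search on the index
linear_search = b // 2                    # average linear scan of the data file
file_binary_search = math.ceil(math.log2(b))

print(bfr, b, R_I, bfr_I, b_I, index_search, linear_search, file_binary_search)
```

Changing B or R here is a quick way to see how sensitive the index size is to the block size and record size.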
Clustering Index
 Clustering index applies when data is ordered on a non-key field
– The field on which data is ordered is called the clustering field
– The data file is described as a clustered file
– Clustering index is sorted list of <field value, pointer> pairs
 Why make a distinction between clustering and primary index?
– Field values can appear in many consecutive records
– Only one entry in index for each distinct field value
 No point having multiple entries
– Index points to first data block containing the matching value
 Same issues with insertion and deletion as for primary index
 Cluster index where
each distinct value is
allocated a whole disk
block
 Linked list if more than
one block is needed
Secondary indexes
 Secondary indexes provide a secondary means of access to data
– For when some primary access already exists (e.g. index on key)
 A secondary index is on some other field(s)
– Either other candidate key fields which are unique for every record
– Or non-key field with duplicate values
 Secondary index is an ordered file of <field value, pointer> pairs
– Pointer can be to a file block, or record within a file
– A dense index: must be one pointer per record
 Many secondary indexes can be created for a file
– Allowing access based on different fields
– By contrast, there can be only one primary index
 Secondary index
with block pointers
 Unique data values
so structure is simple
Secondary index example
 Same set up as previous example:
– r=30000 records of size R=100 bytes, block size B = 1024 bytes
 File is stored in 3000 blocks as worked out before
 Search for a record based on a field of V = 9 bytes
– Linear search would read 1500 blocks on average
 Secondary index on target attribute (9 + 6) = 15 bytes/record
– Blocking factor for index is ⌊1024/15⌋ = 68 entries per block
– Need ⌈30000/68⌉ = 442 blocks to store the (dense) index
– Binary search on index takes ⌈log2 442⌉ = 9 block accesses
– Slightly more than the primary index (why?)
Secondary index for non-key, non-ordering
 Secondary index for a non-key non-ordering field
– I.e. a field that has duplicate values in many records
– Several possible approaches
1. Include duplicate index entries for the same field value (dense)
2. Have variable length entries in the index: a list of pointers to all
blocks containing the target value
3. Use an extra level of indirection: fixed length index entries point
to list of pointers, arranged as list of disk blocks
 Option 3 is most commonly used
– All options are painful when data file is subject to insert/deletes
 “Option 3”
secondary index
Single Level Indexing Summary
 Primary index: on the field that the data is sorted by
– Allows faster access than searching the file directly
 Secondary index: on any field(s) in the data
– Can have multiple secondary indexes
– Typically dense
 All indexes require extra effort to maintain if the data is
subject to frequent updates (insert/delete operations)
Multilevel Indexing
 The indexes described so far miss a trick: they do binary search
– But we can read a block of k index records at a time
– Can do a k-way split instead of a 2-way split
– Improves cost from log2 N to logk N
 Another way to look at it: if index is large, build index on index…
– Original index is first level index, then there is second level index
– Can repeat, creating third level index, fourth level index…
– Until top level of index fits into one disk block
– For all realistic file sizes, a constant number of levels is needed
 Apply this idea to any index type (primary, secondary, cluster)
– Assume first level index has fixed length, distinct valued entries
Two-level index
Example
 Convert previous example into a multilevel index
– Blocking factor for indexes remains 68
– 442 blocks of first level index
– Second level index: ⌈442/68⌉ = 7 blocks
– Third level index fits in 1 block: stop here!
 Hence, need three levels of index: three accesses to find
(pointer to) target record
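The level-by-level calculation generalizes to a short loop: keep dividing by the index blocking factor until a level fits in one block. The function name and signature are made up for this sketch.

```python
import math
from typing import List


def index_levels(first_level_entries: int, fan_out: int) -> List[int]:
    """Blocks needed at each index level, bottom-up, until one block remains."""
    levels = []
    entries = first_level_entries
    while True:
        blocks = math.ceil(entries / fan_out)
        levels.append(blocks)
        if blocks <= 1:
            return levels
        entries = blocks   # the next level has one entry per block below it


# Dense secondary index from the example: 30000 entries, 68 entries per block.
levels = index_levels(30000, 68)
print(levels, "->", len(levels), "block accesses to reach the record pointer")
```

Even a file a thousand times larger adds only one or two levels at this fan-out, which is why a constant number of levels suffices in practice.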
Dynamic multilevel indexes
 Can we modify our storage of indices to make handling
inserts/deletes less painful?
 Use tree-structure to directly access data
– Keep some space in file blocks to reduce cost of updates
 Use the language of trees to describe the structure:
Search trees
 A search tree: a tree where each node contains at most p-1
search values and p pointers as P1, K1, P2, K2, … Kq-1, Pq, q ≤ p
– The values are in order: K1 < K2 < … < Kq-1
– Each pointer Pi points to a subtree so that Ki-1 < X ≤ Ki for all keys X in
the subtree
 Rules allow efficient search for any key value
– Search within the only subtree it can be in at each level
Search tree example
 Leaf-level entries have the full record
 Insertion is easier: we can add a new block without having to
rewrite the rest of the tree
 If tree is unbalanced (some very deep paths), searches are long
– Try to prevent this with rules that stop the tree getting unbalanced
– Perform occasional rebalancing, or use “self-balancing” trees
B-trees and B+-trees
 B-trees add the constraint that the tree should be balanced
– The root to leaf path should be about the same length for all leaves
– Avoid wasted space: each node should be between half full and full
 B+-tree is a slight modification of B-tree that is now the standard
– B-trees: allow pointers to data at all levels of the tree
– B+-tree: pointers to data only at the leaf level
– B+-tree slightly simpler (fewer cases to handle with updates)
 The trees can be used for (primary, secondary) multi-level indexes
– Updates to data can be reflected in tree easily
 These trees are widely used in file systems and database systems
– File systems: NTFS [Windows], NSS, XFS, JFS – for directory entries
– DBMSs: IBM DB2, Informix, MS SQL Server, Oracle, SQLite
B+-tree
 Internal nodes: P1, K1, P2, K2, … Kq-1, Pq, where p/2 < q ≤ p
 Leaf nodes: K1, Pr1, K2, Pr2, … Kq-1, Prq-1, Pnext, p/2 < q ≤ p
– Ki, Pri: Pri points to the record with value Ki
– Pnext points to the next leaf node in the tree (for linear access)
B+-tree: Search
 Search on a B+-tree is fairly straightforward
– Start at root block
– While not at a leaf block
 Determine between which values in the block the key falls
 Follow the relevant pointer to the new block
– Search current leaf block for desired value
– If found, follow pointer to retrieve record
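The search steps above can be sketched as follows. The Node class, the tiny hand-built tree, and the string "record pointers" are all hypothetical; the child-selection rule assumes the convention Ki-1 < X ≤ Ki for subtree i.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys, children=None, pointers=None):
        self.keys = keys
        self.children = children   # internal nodes: len(keys) + 1 subtrees
        self.pointers = pointers   # leaf nodes: one record pointer per key

    @property
    def is_leaf(self):
        return self.children is None


def search(root, key):
    """Return (record pointer or None, number of blocks read)."""
    node = root
    blocks_read = 1
    while not node.is_leaf:
        # bisect_left picks child i such that K[i-1] < key <= K[i].
        node = node.children[bisect_left(node.keys, key)]
        blocks_read += 1
    pos = bisect_left(node.keys, key)
    if pos < len(node.keys) and node.keys[pos] == key:
        return node.pointers[pos], blocks_read
    return None, blocks_read


# Hand-built two-level tree, for illustration only.
leaves = [
    Node([1, 3, 5], pointers=["r1", "r3", "r5"]),
    Node([6, 7, 8], pointers=["r6", "r7", "r8"]),
    Node([9, 12], pointers=["r9", "r12"]),
]
root = Node([5, 8], children=leaves)

found = search(root, 7)
missing = search(root, 10)
```

Every search reads exactly one block per level, whether or not the key is present, because data pointers live only in the leaves.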
B+-tree: insertion
 As with many tree algorithms, insertion is based on search
– Start by searching for where the record should be
– If room in the leaf block, insert a pointer to the new record
– Else, split the leaf block into two, and insert the pointer
 Now there are two leaf blocks: need to update parent
– Similar process to update parent: may need to split parent
– May propagate back to root
 Note that we do not explicitly attempt to keep tree balanced
– The condition p/2 < q ≤ p ensures that it can’t be too unbalanced
 Algorithms fans: condition ensures height is O(log n) for n keys
– Worst case time for {insert, delete, search} is O(log n)
B+-tree: deletion
 Essentially the inverse of insertion
– Find the record to delete from the B+-tree
– Remove the pointer and if block is still large enough, halt
– Else, try to redistribute: move entries from sibling block
– If can’t redistribute, merge the two siblings
– Then delete one pointer from parent and recurse up tree
Summary
 Disk properties and file storage
 File organizations: ordered, unordered, and hashed
 Storage topics: RAID and Storage area networks
 Indexes: primary and secondary
 Multilevel indexes and B-trees
 Chapter: “Disk Storage, Basic File Structures and Hashing” in
Elmasri and Navathe
 Chapter: “Indexing Structures for Files” in Elmasri and Navathe