Reading and Review
Chapter 12: Indexing and Hashing
– 12.1 Basic Concepts
– 12.2 Ordered Indices
– 12.3 B+-tree Index Files
• lots of stuff
– 12.4 B-tree Index Files
– 12.5 Static Hashing
• lots of stuff
– 12.6 Dynamic Hashing (Extendable Hashing)
• lots of stuff
– 12.7 Comparison of Ordered Indexing and Hashing
– 12.9 Multiple-Key Access
• Multiple-key indices, Grid files, Bitmap Indices
– 12.10 Summary
Reading and Review
Chapter 13: Query Processing
– 13.1 Overview
– 13.2 Measures of Query Cost
– 13.3 Selection Operation
– 13.4 Sorting (Sort-Merge Algorithm)
– 13.5 Join Operation
• Nested-loop Join (regular, block, indexed)
• Merge Join
• Hash Join
• Complex Joins
– 13.6 Other Operations
– 13.7 Evaluation of Expressions
– 13.8 Summary
Reading and Review
Chapter 14: Query Optimization
– 14.1 Overview
– 14.2 Estimating Statistics of Expression Results
• Catalog information
• Selection size estimation
• Join size estimation
• Size estimation for other operations
• Estimation of number of distinct values
– 14.3 Transformation of Relational Expressions
• Equivalence rules and examples
• Join ordering, enumeration of equivalent expressions
– 14.4 Choice of Evaluation Plans
– from 14.4 on, ignore any section with a “**” at the end of the section title
– 14.6 Summary
Reading and Review
Chapter 15: Transaction Management
– 15.1 Transaction Concept
• ACID properties
– 15.2 Transaction State
• state diagram of a transaction
– (15.3 Implementation of Atomicity and Durability)
• not important -- we covered this material better in section 17.4
– 15.4 Concurrent Execution and Scheduling
– 15.5 Serializability (Conflict Serializability)
• (ignore 15.5.2 View Serializability)
– 15.6 Recoverability
– (15.7 Isolation -- not important, see 16.1 on locking instead)
– 15.9 Testing for Serializability (Precedence Graphs)
– 15.10 Summary
Reading and Review
Chapters 16 and 17
– 16.1: Lock-Based Protocols
• granting locks
• deadlocks
• two-phase locking protocol
• ignore 16.1.4 and 16.1.5
– 16.6.3 Deadlock Detection and Recovery
• Wait-for Graph
– 17.1, 17.2.1 -- skim these sections. Material should be familiar to you as
background, but I won’t be testing on definitions or memorization of it
– 17.4: Log-Based Recovery
• logs, redo/undo, basic concepts
• checkpoints
Indexing and Hashing: Motivation
• Query response speed is a major issue in database design
• Some queries only need to access a very small proportion of the
records in a database
– “Find all accounts at the Perryridge bank branch”
– “Find the balance of account number A-101”
– “Find all accounts owned by Zephraim Cochrane”
• Checking every single record for the queries above is very inefficient
and slow.
• To allow fast access for those sorts of queries, we create additional
structures that we associate with files: indices (index files).
Basic Concepts
• Indexing methods are used to speed up access to desired data
– e.g. Author card catalog in a library
• Search key -- an attribute or set of attributes used to look up records in
a file. This use of the word key differs from that used before in class.
• An index file consists of records (index entries) of the form:
(search-key, pointer)
where the pointer is a link to a location in the original file
• Index files are typically much smaller than the original file
• Two basic types of index:
– ordered indices: search keys are stored in sorted order
– hash indices: search keys are distributed uniformly across “buckets” using
a “hash function”
Index Evaluation Metrics
• We will look at a number of techniques for both ordered indexing and
hashing. No one technique is best in all circumstances -- each has its
advantages and disadvantages. The factors that can be used to evaluate
different indices are:
• Access types supported efficiently.
– Finding records with a specified attribute value
– Finding records with attribute values within a specified range of values
• Access (retrieval) time. Finding a single desired tuple in the file.
• Insertion time
• Deletion time
• Update time
• Space overhead for the index
Index Evaluation Metrics (2)
• Speed issues are
– Access time
– Insertion time
– Deletion time
– Update time
• Access is the operation that occurs the most frequently, because it is
also used for insert, delete, update
• For insert, delete, and update operations we must consider not only the
time for the operation itself (inserting a new record into the file) but
also any time required to update the index structure to reflect changes.
• We will often want to have more than one index for a file
– e.g., card catalogs for author, subject, and title in a library
Ordered Indices
– An ordered index stores the values of the search keys in sorted
order
– records in the original file may themselves be stored in some
sorted order (or not)
– the original file may have several indices, on different search keys
– when there are multiple indices, a primary index is an index whose
search key also determines the sort order of the original file.
Primary indices are also called clustering indices
– secondary indices are indices whose search key specifies an order
different from the sequential order of the file. Also called non-clustering indices
– an index-sequential file is an ordered sequential file with a
primary index.
Dense and Sparse Indices
• A dense index is where an index record appears for every
search-key value in the file
• A sparse index contains index records for only some
search-key values
– applicable when records are sequentially ordered on the search key
– to locate a record with search-key value K we must:
• find index record with largest search-key value <=K
• search file sequentially starting at the location pointed to
• stop (fail) when we hit a record with search-key value >K
– less space and less maintenance overhead for insert, delete
– generally slower than dense index for locating records
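For concreteness, a minimal Python sketch of this sparse-index lookup; the list-based file and index layout here is my own illustration, not the text's structure:

    # sparse index: one (first key, position) entry per block of the sorted file
    from bisect import bisect_right

    def sparse_lookup(index_keys, index_positions, file_records, K):
        # find the index record with the largest search-key value <= K
        i = bisect_right(index_keys, K) - 1
        if i < 0:
            return None                      # K is smaller than every indexed key
        pos = index_positions[i]
        # search the file sequentially starting at the location pointed to
        while pos < len(file_records) and file_records[pos][0] <= K:
            if file_records[pos][0] == K:
                return file_records[pos]
            pos += 1
        return None                          # hit a record with search key > K: fail

    file_records = [(5, "a"), (10, "b"), (15, "c"), (20, "d")]   # sorted on the key
    index_keys, index_positions = [5, 15], [0, 2]                # one entry per block
    print(sparse_lookup(index_keys, index_positions, file_records, 10))   # (10, 'b')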
Example of a Dense Index
Example of a Sparse Index
Problems with Index-Sequential Files
• Retrieve: search until the key value or a larger key value is found
– individual key access: BAD
– scan the file in order of the key: GOOD
• Insert is hard -- all higher key records must be shifted to place the new
record
• Delete may leave holes
• Update is equivalent to a combined delete and insert
– updating a search key field may cause the combined disadvantages of an
insert and a delete by shifting the record’s location in the sorted file
• differential files are often used to hold the recent updates until the
database can be reorganized off-line
Multi-level Index
• If the primary index does not fit in memory, access becomes expensive
(disk reads cost)
• To reduce the number of disk accesses to index records, we treat the
primary index kept on disk as a sequential file and construct a sparse
index on it
• If the outer index is still too large, we can create another level of
index to index it, and so on
• Indices at all levels must be updated on insertion or deletion from the
file
Index Update: Deletion
• When deletions occur in the primary file, the index will
sometimes need to change
• If the deleted record was the only record in the file with its
particular search-key value, the search-key is deleted from
the index also
• Single-level index deletion:
– dense indices -- deletion of search-key is similar to record deletion
– sparse indices -- if an entry for the search key exists in the index, it
is replaced by the next search key value in the original file (taken
in search-key order)
– Multi-level index deletion is a simple extension of the above
Index Update: Insertion
• As with deletions, when insertions occur in the primary
file, the index will sometimes need to change
• Single-level index insertion:
– first perform a lookup using the search-key value appearing in the
record to be inserted
– dense indices -- if the search-key value does not appear in the
index, insert it
– sparse indices -- depends upon the design of the sparse index. If
the index is designed to store an entry for each block of the file,
then no change is necessary to the index unless the insertion
creates a new block (if the old block overflows). If that happens,
the first search-key value in the new block is inserted in the index
– multi-level insertion is a simple extension of the above
Secondary Index Motivation
• Good database design is to have an index to handle all
common (frequently requested) queries.
• Queries based upon values of the primary key can use the
primary index
• Queries based upon values of other attributes will require
other (secondary) indices.
Secondary Indices vs. Primary Indices
• Secondary indices must be dense, with an index entry for
every search-key value, and a pointer to every record in the
file. A primary index may be dense or sparse. (why?)
• A secondary index on a candidate key looks just like a
dense primary index, except that the records pointed to by
successive values of the index are not sequential
• when a primary index is not on a candidate key it suffices
if the index points to the first record with a particular value
for the search key, as subsequent records can be found with
a sequential scan from that point
Secondary Indices vs. Primary Indices
(2)
• When the search key of a secondary index is not a
candidate key (i.e., there may be more than one tuple in the
relation with a given search-key value) it isn’t enough to
just point to the first record with that value -- other records
with that value may be scattered throughout the file. A
secondary index must contain pointers to all the records
– so an index record points to a bucket that contains pointers to every
record in the file with that particular search-key value
Secondary Index on balance
Primary and Secondary Indices
• As mentioned earlier:
– secondary indices must be dense
– Indices offer substantial benefits when searching for records
– When a file is modified, every index on the file must be updated
• this overhead imposes limits on the number of indices
– relatively static files will reasonably permit more indices
– relatively dynamic files make index maintenance quite expensive
• Sequential scan using primary index is efficient; sequential
scan on secondary indices is very expensive
– each record access may fetch a new block from disk (oy!)
Disadvantages of Index-Sequential Files
• The main disadvantage of the index-sequential file
organization is that performance degrades as the file
grows, both for index lookups and for sequential scans
through the data. Although this degradation can be
remedied through file reorganization, frequent reorgs are
expensive and therefore undesirable.
• Next we’ll start examining index structures that maintain
their efficiency despite insertion and deletion of data: the
B+-tree (section 12.3)
B+- Tree Index Files
• Main disadvantage of ISAM files is that performance
degrades as the file grows, creating many overflow blocks
and the need for periodic reorganization of the entire file
• B+- trees are an alternative to indexed-sequential files
– used for both primary and secondary indexing
– B+- trees are a multi-level index
• B+- tree index files automatically reorganize themselves
with small local changes on insertion and deletion.
– No reorg of entire file is required to maintain performance
– disadvantages: extra insertion, deletion, and space overhead
– advantages outweigh disadvantages. B+-trees are used extensively
B+- Tree Index Files (2)
Definition: A B+-tree of order n has:
• All leaves at the same level
• balanced tree (“B” in the name stands for “balanced”)
• logarithmic performance
• root has between 1 and n-1 keys
• all other nodes have between n/2 and n-1 keys (>= 50% space utilization)
• we construct the tree with order n such that one node corresponds to
one disk block I/O (in other words, each disk page read brings up one
full tree node).
B+- Tree Index Files (3)
A B+-tree is a rooted tree satisfying the following properties:
• All paths from root to leaf are the same length
• Search for an index value takes time according to the
height of the tree (whether successful or unsuccessful)
B+- Tree Node Structure
• The B+-tree is constructed so that each node (when full)
fits on a single disk page
– parameters:
B: size of a block in bytes (e.g., 4096)
K: size of the key in bytes (e.g., 8)
P: size of a pointer in bytes (e.g., 4)
– internal node must have n such that:
(n-1)*K + n*P <= B
n<= (B+K)/(K+P)
– with the example values above, this becomes
n <= (4096+8)/(8+4) = 4104/12
n <= 342
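A quick check of that arithmetic in Python, using the example values above:

    # 4096-byte blocks, 8-byte keys, 4-byte pointers
    B, K, P = 4096, 8, 4
    n = (B + K) // (K + P)              # largest n with (n-1)*K + n*P <= B
    print(n)                            # 342
    assert (n - 1) * K + n * P <= B     # 341*8 + 342*4 = 4096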
B+- Tree Node Structure (2)
• Typical B+-tree Node
Ki are the search-key values
Pi are the pointers to children (for non-leaf nodes) or
pointers to records or buckets of records (for leaf nodes)
• the search keys in a node are ordered:
K1 < K2 < K3 < … < Kn-1
Non-Leaf Nodes in B+-Trees
• Non-leaf nodes form a multi-level sparse index on the leaf
nodes. For a non-leaf node with n pointers:
– all the search keys in the subtree to which P1 points are less than
K1
– For 2 <= i <= n-1, all the search keys in the subtree to which Pi
points have values greater than or equal to Ki-1 and less than Ki (and
the subtree to which Pn points has values greater than or equal to Kn-1)
Leaf Nodes in B+-Trees
• As mentioned last class, primary indices may be sparse
indices. So B+-trees constructed on a primary key (that is,
where the search key order corresponds to the sort order of
the original file) can have the pointers of their leaf nodes
point to an appropriate position in the original file that
represents the first occurrence of that key value.
• Secondary indices must be dense indices. B+-trees
constructed as a secondary index must have the pointers of
their leaf nodes point to a bucket storing all locations
where a given search key value occur; this set of buckets is
often called an occurrence file
Example of a B+-tree
• B+-tree for the account file (n=3)
Another Example of a B+-tree
• B+-tree for the account file (n=5)
• Leaf nodes must have between 2 and 4 values ((n-1)/2 and (n-1), with n=5)
• Non-leaf nodes other than the root must have between 3 and 5 children
(n/2 and n, with n=5)
• Root must have at least 2 children
Observations about B+-trees
• Since the inter-node connections are done by pointers,
“logically” close blocks need not be “physically” close
• The non-leaf levels of the B+-tree form a hierarchy of
sparse indices
• The B+-tree contains a relatively small number of levels
(logarithmic in the size of the main file), thus searches can
be conducted efficiently
• Insertions and deletions to the main file can be handled
efficiently, as the index can be restructured in logarithmic
time (as we shall examine later in class)
Queries on B+-trees
• Find all records with a search-key value of k
– start with the root node (assume it has m pointers)
• examine the node for the smallest search-key value > k
• if we find such a value, say at Kj , follow the pointer Pj to its child
node
• if no such value exists, then k >= Km-1, so follow Pm
– if the node reached is not a leaf node, repeat the procedure above
and follow the corresponding pointer
– eventually we reach a leaf node. If we find a matching key value
(our search value k = Ki for some i) then we follow Pi to the
desired record or bucket. If we find no matching value, the search
is unsuccessful and we are done.
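A minimal Python sketch of this lookup procedure; the node layout (sorted keys plus one more child pointer than keys in internal nodes, parallel key/pointer lists in leaves) is an illustrative assumption, not the book's exact structure:

    from bisect import bisect_right

    class Node:
        def __init__(self, leaf=False):
            self.leaf = leaf
            self.keys = []        # K1 < K2 < ... (up to n-1 of them)
            self.pointers = []    # children (internal) or record/bucket pointers (leaf)

    def bplus_lookup(root, k):
        node = root
        while not node.leaf:
            # follow the pointer Pj, where Kj is the smallest key value > k
            # (or the last pointer Pm if k >= every key in the node)
            node = node.pointers[bisect_right(node.keys, k)]
        if k in node.keys:
            return node.pointers[node.keys.index(k)]   # record or occurrence bucket
        return None                                    # unsuccessful search

    # tiny two-level example: root -> two leaves
    leaf1 = Node(leaf=True); leaf1.keys = [10, 20]; leaf1.pointers = ["rec10", "rec20"]
    leaf2 = Node(leaf=True); leaf2.keys = [30, 40]; leaf2.pointers = ["rec30", "rec40"]
    root = Node(); root.keys = [30]; root.pointers = [leaf1, leaf2]
    print(bplus_lookup(root, 20), bplus_lookup(root, 35))   # rec20 None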
Queries on B+-trees (2)
• Processing a query traces a path from the root node to a leaf node
• If there are K search-key values in the file, the path is no longer than
⌈log_⌈n/2⌉(K)⌉
• A node is generally the same size as a disk block, typically 4 kilobytes,
and n is typically around 100 (40 bytes per index entry)
• With 1 million search key values and n=100, at most ⌈log_50(1,000,000)⌉
= 4 nodes are accessed in a lookup
• In a balanced binary tree with 1 million search key values, around 20
nodes are accessed in a lookup
– the difference is significant since every node access might need a disk I/O,
costing around 20 milliseconds
Insertion on B+-trees
• Find the leaf node in which the search-key value would
appear
• If the search key value is already present, add the record to
the main file and (if necessary) add a pointer to the record
to the appropriate occurrence file bucket
• If the search-key value is not there, add the record to the
main file as above (including creating a new occurrence
file bucket if necessary). Then:
– if there is room in the leaf node, insert (key-value, pointer) in the
leaf node
– otherwise, overflow. Split the leaf node (along with the new entry)
Insertion on B+-trees (2)
• Splitting a node:
– take the n (search-key-value, pointer) pairs, including the one
being inserted, in sorted order. Place half in the original node, and
the other half in a new node.
– Let the new node be p, and let k be the least key value in p. Insert
(k, p) in the parent of the node being split.
– If the parent becomes full by this new insertion, split it as
described above, and propagate the split as far up as necessary
• The splitting of nodes proceeds upwards until a node that is
not full is found. In the worst case the root node may be
split, increasing the height of the tree by 1.
Insertion on B+-trees Example
Deletion on B+-trees
• Find the record to be deleted, and remove it from the main file and the
bucket (if necessary)
• If there is no occurrence-file bucket, or if the deletion caused the
bucket to become empty, then delete (key-value, pointer) from the B+-tree leaf node
• If the leaf-node now has too few entries, underflow has occurred. If
the active leaf-node has a sibling with few enough entries that the
combined entries can fit in a single node, then
– combine all the entries of both nodes in a single one
– delete the (K,P) pair pointing to the deleted node from the parent. Follow
this procedure recursively if the parent node underflows.
Deletion on B+-trees (2)
• Otherwise, if no sibling node is small enough to combine
with the active node without causing overflow, then:
– Redistribute the pointers between the active node and the sibling so
that both of them have sufficient pointers to avoid underflow
– Update the corresponding search key value in the parent node
– No deletion occurs in the parent node, so no further recursion is
necessary in this case.
• Deletions may cascade upwards until a node with n/2 or
more pointers is found. If the root node has only one
pointer after deletion, it is removed and the sole child
becomes the root (reducing the height of the tree by 1)
Deletion on B+-trees Example 1
Deletion on B+-trees Example 2
Deletion on B+-trees Example 3
B+-tree File Organization
• B+-Tree Indices solve the problem of index file degradation. The
original data file will still degrade upon a stream of insert/delete
operations.
• Solve data-file degradation by using a B+-tree file organization
• Leaf nodes in a B+-tree file organization store records, not pointers
into a separate original datafile
– since records are larger than pointers, the maximum number of records
that can be stored in a leaf node is less than the number of pointers in a
non-leaf node
– leaf nodes must still be maintained at least half full
– insert and delete are handled in the same way as insert and delete for
entries in a B+-tree index
B+-tree File Organization Example
• Records are much bigger than pointers, so good space usage is
important
• To improve space usage, involve more sibling nodes in redistribution
during splits and merges (to avoid split/merge when possible)
– involving one sibling guarantees 50% space use
– involving two guarantees at least 2/3 space use, etc.
B-tree Index Files
• B-trees are similar to B+-trees, but search-key values
appear only once in the index (eliminates redundant
storage of key values)
– search keys in non-leaf nodes don’t appear in the leaf nodes, so an
additional pointer field for each search key in a non-leaf node must
be stored to point to the bucket or record for that key value
– leaf nodes look like B+-tree leaf nodes:
(P1, K1, P2, K2, …, Pn)
– non-leaf nodes look like so:
(P1, B1, K1, P2, B2, K2, …, Pn)
where the Bi are pointers to buckets or file records.
B-tree Index File Example
B-tree
and
B+-tree
B-tree Index Files (cont.)
• Advantages of B-tree Indices (vs. B+-trees)
– May use less tree nodes than a B+-tree on the same data
– Sometimes possible to find a specific key value before reaching a
leaf node
• Disadvantages of B-tree Indices
– Only a small fraction of key values are found early
– Non-leaf nodes are larger, so fanout is reduced, and B-trees may be
slightly taller than B+-trees on the same data
– Insertion and deletion are more complicated than on B+-trees
– Implementation is more difficult than B+-trees
• In general, advantages don’t outweigh disadvantages
Hashing
• We’ve examined Ordered Indices (design based upon
sorting or ordering search key values); the other type of
major indexing technique is Hashing
• Underlying concept is very simple:
– observation: small files don’t require indices or complicated search
methods
– use some clever method, based upon the search key, to split a large
file into a lot of little buckets
– each bucket is sufficiently small
– use the same method to find the bucket for a given search key
Hashing Basics
– A bucket is a unit of storage containing one or more records
(typically a bucket is one disk block in size)
– In a hash file organization we find the bucket for a record directly
from its search-key value using a hash function
– A hash function is a function that maps from the set of all search-key values K to the set of all bucket addresses B
– The hash function is used to locate records for access, insertion,
and deletion
– Records with different search-key values may be mapped to the
same bucket
• the entire bucket must be searched to find a record
• buckets are designed to be small, so this task is usually not onerous
Hashed File Example
– So we:
• divide the set of disk blocks that make up the file into buckets
• devise a hash function that maps each key value into a bucket
V: set of key values
B: number of buckets
H: hashing function
H: V --> {0, 1, 2, 3, …, B-1}
Example: V = 9-digit SS#; B = 1000; H = key modulo 1000
Hash Functions
• To search/insert/delete/modify a key do:
– compute H(k) to get the bucket number
– search sequentially in the bucket (heap organization within each
bucket)
• Choosing H: almost any function that generates “random”
numbers in the range [0, B-1]
– try to distribute the keys evenly into the B buckets
– one rule of thumb when using MOD -- use a prime number
Hash Functions (2)
• Collision is when two or more key values go to the same
bucket
– too many collisions increases search time and degrades
performance
– no or few collisions means that each bucket has only one (or very
few) key(s)
• Worst-case hash functions map all search keys to the same
bucket
Hash Functions (3)
• Ideal hash functions are uniform
– each bucket is assigned the same number of search-key values
from the set of all possible values
• Ideal hash functions are random
– each bucket has approximately the same number of records
assigned to it irrespective of the actual distribution of search-key
values in the file
• Finding a good hash function is not always easy
Examples of Hash Functions
• Given 26 buckets and a string-valued search key, consider the
following possible hash functions:
– Hash based upon the first letter of the string
– Hash based upon the last letter of the string
– Hash based upon the middle letter of the string
– Hash based upon the most common letter in the string
– Hash based upon the “average” letter in the string: the sum of the letters
(using A=0, B=1, etc) divided by the number of letters
– Hash based upon the length of the string (modulo 26)
• Typical hash functions perform computation on the internal binary
representation of the search key
– example: searching on a string value, hash based upon the binary sum of
the characters in the string, modulo the number of buckets
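Two of these candidate hash functions, sketched in Python just to make the idea concrete (the function names are mine):

    def sum_of_letters_hash(key: str, num_buckets: int = 26) -> int:
        # the "typical" scheme: compute on the internal (byte) representation,
        # summing the characters modulo the number of buckets
        return sum(key.encode("utf-8")) % num_buckets

    def first_letter_hash(key: str, num_buckets: int = 26) -> int:
        # the "first letter" scheme; distributes real names far less evenly
        return (ord(key[0].upper()) - ord("A")) % num_buckets

    print(sum_of_letters_hash("Perryridge"), first_letter_hash("Perryridge"))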
Overflow
• Overflow is when an insertion into a bucket can’t occur
because it is full.
• Overflow can occur for the following reasons:
– too many records (not enough buckets)
– poor hash function
– skewed data:
• multiple records might have the same search key
• multiple search keys might be assigned the same bucket
Overflow (2)
• Overflow is handled by one of two methods
– chaining of multiple blocks in a bucket, by attaching a number of
overflow buckets together in a linked list
– double hashing: use a second hash function to find another
(hopefully non-full) bucket
– in theory we could use the next bucket that had space; this is often
called open hashing or linear probing. This is often used to
construct symbol tables for compilers
• useful where deletion does not occur
• deletion is very awkward with linear probing, so it isn’t useful in most
database applications
Hashed File Performance Metrics
• An important performance measure is the loading factor
(number of records)/(B*f)
B is the number of buckets
f is the number of records that will fit in a single bucket
• when the loading factor gets too high (the file becomes too full),
double the number of buckets and rehash
Hashed File Performance
• (Assume that the hash table is in main memory)
• Successful search: best case 1 block; worst case every chained bucket;
average case half of worst case
• Unsuccessful search: always hits every chained bucket (best case, worst
case, average case)
• With a loading factor around 90% and a good hashing function, the
average is about 1.2 blocks
• Advantage of hashing: very fast for exact queries
• Disadvantage: records are not sorted in any order. As a result, it is
effectively impossible to do range queries
Hash Indices
• Hashing can be used for index-structure creation as well as
for file organization
• A hash index organizes the search keys (and their record
pointers) into a hash file structure
• strictly speaking, a hash index is always a secondary index
– if the primary file was stored using the same hash function, an
additional, separate primary hash index would be unnecessary
– We use the term hash index to refer both to secondary hash indices
and to files organized using hashing file structures
Example of a Hash Index
Hash index into file
account, on search key
account-number;
Hash function computes
sum of digits in account
number modulo 7.
Bucket size is 2
Static Hashing
• We’ve been discussing static hashing: the hash function maps search-key values to a fixed set of buckets. This has some disadvantages:
– databases grow with time. Once buckets start to overflow, performance
will degrade
– if we attempt to anticipate some future file size and allocate sufficient
buckets for that expected size when we build the database initially, we will
waste lots of space
– if the database ever shrinks, space will be wasted
– periodic reorganization avoids these problems, but is very expensive
• By using techniques that allow us to modify the number of buckets
dynamically (“dynamic hashing”) we can avoid these problems
– Good for databases that grow and shrink in size
– Allows the hash function to be modified dynamically
Dynamic Hashing
• One form of dynamic hashing is extendable hashing
– hash function generates values over a large range -- typically b-bit
integers, with b being something like 32
– At any given moment, only a prefix of the hash function is used to index
into a table of bucket addresses
– With the prefix at a given moment being j, with 0 <= j <= 32, the bucket
address table size is 2^j
– Value of j grows and shrinks as the size of the database grows and shrinks
– Multiple entries in the bucket address table may point to a bucket
– Thus the actual number of buckets is <= 2^j
– the number of buckets also changes dynamically due to coalescing and
splitting of buckets
General Extendable Hash Structure
Use of Extendable Hash Structure
• Each bucket j stores a value ij; all the entries that point to the same
bucket have the same values on the first ij bits
• To locate the bucket containing search key Kj;
– compute H(Kj) = X
– Use the first i high order bits of X as a displacement into the bucket
address table and follow the pointer to the appropriate bucket
• To insert a record with search-key value Kj
– follow lookup procedure to locate the bucket, say j
– if there is room in bucket j, insert the record
– Otherwise the bucket must be split and insertion reattempted
• in some cases we use overflow buckets instead (as explained shortly)
Splitting in Extendable Hash Structure
To split a bucket j when inserting a record with search-key value Kj
• if i > ij (more than one pointer into bucket j)
– allocate a new bucket z
– set ij and iz to the old value ij incremented by one
– update the bucket address table (change the second half of the set of
entries pointing to j so that they now point to z)
– remove all the entries in j and rehash them so that they either fall in z or j
– reattempt the insert (Kj). If the bucket is still full, repeat the above.
Splitting in Extendable Hash Structure
(2)
To split a bucket j when inserting a record with search-key value Kj
• if i = ij (only one pointer into bucket j)
– increment i and double the size of the bucket address table
– replace each entry in the bucket address table with two entries that point to
the same bucket
– recompute new bucket address table entry for Kj
– now i> ij so use the first case described earlier
• When inserting a value, if the bucket is still full after several splits
(that is, i reaches some preset value b), give up and create an overflow
bucket rather than splitting the bucket entry table further
– how might this occur?
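A compact Python sketch of extendable hashing as described in the last few slides. It follows the slides' convention of indexing the bucket address table by the first (high-order) i bits of a 32-bit hash value; the class and method names are mine, and a production version would also fall back to overflow buckets once i reaches the preset limit b:

    HASH_BITS = 32

    class Bucket:
        def __init__(self, local_depth):
            self.local_depth = local_depth          # i_j in the slides
            self.items = {}                         # search-key value -> record

    class ExtendableHashFile:
        def __init__(self, bucket_capacity=2):
            self.capacity = bucket_capacity
            self.global_depth = 0                   # i in the slides
            self.table = [Bucket(0)]                # bucket address table, size 2^i

        def _hash(self, key):
            return hash(key) & ((1 << HASH_BITS) - 1)

        def _index(self, key):
            if self.global_depth == 0:
                return 0
            return self._hash(key) >> (HASH_BITS - self.global_depth)

        def lookup(self, key):
            return self.table[self._index(key)].items.get(key)

        def insert(self, key, record):
            while True:
                bucket = self.table[self._index(key)]
                if key in bucket.items or len(bucket.items) < self.capacity:
                    bucket.items[key] = record
                    return
                self._split(bucket)                 # overflow: split and retry

        def _split(self, bucket):
            if bucket.local_depth == self.global_depth:
                # only one table entry points to this bucket: double the table
                self.table = [b for b in self.table for _ in (0, 1)]
                self.global_depth += 1
            new_depth = bucket.local_depth + 1
            b0, b1 = Bucket(new_depth), Bucket(new_depth)
            for k, v in bucket.items.items():       # rehash the old bucket's entries
                bit = (self._hash(k) >> (HASH_BITS - new_depth)) & 1
                (b1 if bit else b0).items[k] = v
            for idx, b in enumerate(self.table):    # repoint the table entries
                if b is bucket:
                    bit = (idx >> (self.global_depth - new_depth)) & 1
                    self.table[idx] = b1 if bit else b0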
Deletion in Extendable Hash Structure
To delete a key value Kj
• locate it in its bucket and remove it
• the bucket itself can be removed if it becomes empty (with appropriate
updates to the bucket address table)
• coalescing of buckets is possible
– can only coalesce with a “buddy” bucket having the same value of ij and
the same ij - 1 prefix, if one such bucket exists
• decreasing bucket address table size is also possible
– very expensive
– should only be done if the number of buckets becomes much smaller than
the size of the table
Extendable Hash Structure Example
Hash function
on branch name
Initial hash table
(empty)
Extendable Hash Structure Example (2)
Hash structure after insertion of one Brighton and two Downtown records
Extendable Hash Structure Example (3)
Hash structure after insertion of Mianus record
Extendable Hash Structure Example (4)
Hash structure after insertion of three Perryridge records
Extendable Hash Structure Example (5)
Hash structure after insertion of Redwood and Round Hill records
Extendable Hashing vs. Other Hashing
• Benefits of extendable hashing:
– hash performance doesn’t degrade with growth of file
– minimal space overhead
• Disadvantages of extendable hashing
– extra level of indirection (bucket address table) to find desired record
– bucket address table may itself become very big (larger than memory)
• need a tree structure to locate desired record in the structure!
– Changing size of bucket address table is an expensive operation
• Linear hashing is an alternative mechanism which avoids these
disadvantages at the possible cost of more bucket overflows
Comparison:
Ordered Indexing vs. Hashing
• Each scheme has advantages for some operations and
situations. To choose wisely between different schemes
we need to consider:
– cost of periodic reorganization
– relative frequency of insertions and deletions
– is it desirable to optimize average access time at the expense of
worst-case access time?
– What types of queries do we expect?
• Hashing is generally better at retrieving records for a specific key
value
• Ordered indices are better for range queries
Index Definition in SQL
• Create an index
create index <index-name> on <relation-name> (<attribute-list>)
e.g.,
create index br-index on branch(branch-name)
• Use create unique index to indirectly specify and enforce the
condition that the search key is a candidate key
– not really required if SQL unique integrity constraint is supported
• To drop an index
drop index <index-name>
Multiple-Key Access
• With some queries we can use multiple indices
• Example:
select account-number
from account
where branch-name=“Perryridge” and balance=1000
• Possible strategies for processing this query using indices on single
attributes:
– use index on balance to find accounts with balance=1000, then test them
individually to see if branch-name=“Perryridge”
– use index on branch-name to find accounts with branch-name=“Perryridge”,
then test them individually to see if balance=1000
– use branch-name index to find pointers to all records of the Perryridge
branch, and use balance index similarly, then take intersection of both sets
of pointers
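A Python sketch of the third strategy: probe both single-attribute indices and intersect the sets of record pointers. The index contents here are made up purely for illustration:

    branch_index  = {"Perryridge": {101, 102, 215, 305}}    # branch-name -> record ids
    balance_index = {1000: {102, 217, 305, 404}}            # balance     -> record ids

    candidates = branch_index["Perryridge"] & balance_index[1000]
    print(sorted(candidates))    # [102, 305] -- only these records are fetched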
Multiple-Key Access (2)
• With some queries using a single-attribute index is unnecessarily
expensive
– with methods (1) and (2) from the earlier slide, we might have the index
we use return a very large set, even though the final result is quite small
– Even with method (3) (use both indices and then find the intersection) we
might have both indices return a large set, which will make for a lot of
unneeded work if the final result (the intersection) is small
• An alternative strategy is to create and use an index on more than one
attribute -- in this example, an index on (branch-name, balance) (both
attributes)
Indices on Multiple Attributes
• Suppose we have an ordered index on the combined search-key
(branch-name, balance)
• Examining the earlier query with the clause
where branch-name=“Perryridge” and balance=1000
Our new index will fetch only records that satisfy both conditions -- much
more efficient than trying to answer the query with separate
single-valued indices
• We can also handle range queries like:
where branch-name=“Perryridge” and balance<1000
• But our index will not work for
where branch-name<“Perryridge” and balance=1000
Multi-Attribute Indexing
Example: EMP(eno, ename, age, sal)
lots of ways to handle this problem:
• separate indices: lots of false hits
• combined index based on a composite key
– key = sal*100 + age
– search for 30<sal<40 translates into 3000<key<4000 (easy)
– search for 60<age<80 difficult
• Grid files (up next)
• R-tree (B-tree generalization)
• Quad-trees, K-d trees, etc...
Grid Files
• Structure used to speed processing of general multiple search-key
queries involving one or more comparison operators
• The grid file has:
– a single grid array
– array has number of dimensions equal to the number of search-key
attributes
– one linear scale for each search-key attribute
• Multiple cells of the grid array can point to the same bucket
• To find the bucket for a search-key value, locate the row and column
of the cell using the linear scales to get the grid location, then follow
the pointer in that grid location to the bucket
Example Grid File for account
Queries on a Grid File
• A grid file on two attributes A and B can handle queries of all the
following forms with reasonable efficiency:
– (a1 < A < a2)
– (b1 < B < b2)
– (a1 < A < a2 and b1 < B < b2)
• For example, to answer (a1 < A < a2 and b1 < B < b2), use the linear
scales to find corresponding candidate grid array cells, and look up all
the buckets pointed-to from those cells
Grid Files (cont)
• During insertion, if a bucket becomes full, a new bucket
can be created if more than one cell points to it
– idea similar to extendable hashing, but in multiple dimensions
– if only one cell points to the bucket, either an overflow bucket
must be created or the grid array size must be increased
• Linear scales can be chosen to uniformly distribute records
across buckets
– not necessary to have scale uniform across the domain -- if records
are distributed in some other pattern, the linear scale can mirror it
(as shown in the example earlier)
Grid Files (end)
• Periodic re-organization to increase grid array size helps
with good performance
– reorganization can be very expensive, though
• Space overhead of the grid array can be high
• R-trees (chapter 23) are an alternative
Bitmap Indices
• Bitmap indices are a special type of index designed for efficient
queries on multiple keys
• Records in a relation are assumed to be numbered sequentially from 0
– given a number n the objective is for it to be easy to retrieve record n
– very easy to achieve if we’re looking at fixed-length records
• Bitmap indices are applicable on attributes that take on a relatively
small number of distinct values
– e.g. gender, country, state, hockey team
– or an arbitrary mapping of a wider spectrum of values into a small number
of categories (e.g. income level: divide income into a small number of
levels such as 0-9999, 10K-19999, 20K-49999, 50K and greater)
• A bitmap is simply an array of bits
Bitmap Indices (cont)
• In the simplest form a bitmap index on an attribute has a bitmap for
each value of the attribute
– bitmap has as many bits as there are records
– in a bitmap for value v, the bit for a record is 1 if the record has the value
v for the attribute, and 0 otherwise
Bitmap Indices (cont)
• Bitmap indices are useful for queries on multiple attributes
– not particularly useful for single-attribute queries
• Queries are answered using bitmap operations
– intersection (and)
– union (or)
– complementation (not)
• Each operation takes two bitmaps of the same size and applies the
operation on corresponding bits to get the result bitmap
– e.g. 100110 and 110011 = 100010
100110 or 110011 = 110111
not 100110 = 011001
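A minimal bitmap-index sketch matching the description above. Each attribute value gets one bitmap; Python integers stand in for the bit arrays, so the AND / OR / NOT operations are the &, | and ~ operators. All names and data are illustrative:

    class BitmapIndex:
        def __init__(self):
            self.maps = {}                            # attribute value -> bitmap

        def add(self, value, record_no):
            self.maps[value] = self.maps.get(value, 0) | (1 << record_no)

        def bitmap(self, value):
            return self.maps.get(value, 0)

    gender, level = BitmapIndex(), BitmapIndex()
    rows = [("m", "L1"), ("f", "L2"), ("f", "L1"), ("m", "L4"), ("f", "L2")]
    for rec_no, (g, lvl) in enumerate(rows):
        gender.add(g, rec_no)
        level.add(lvl, rec_no)

    # "gender = f AND level = L2": intersect the two bitmaps
    result = gender.bitmap("f") & level.bitmap("L2")
    print([i for i in range(len(rows)) if (result >> i) & 1])    # [1, 4]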
Bitmap Indices (cont)
• Every 1 bit in the result marks a desired tuple
– can compute its location (since records are numbered sequentially
and are all of the same size) and retrieve the records
– counting number of tuples in the result (SQL count aggregate) is
even faster
• Bitmap indices are generally very small compared to
relation size
– e.g. if record is 100 bytes, space for a single bitmap on all tuples of
the relation takes 1/800 of the size of the relation itself
– if bitmap allows for 8 distinct values, bitmap is only 1% of the size
of the whole relation
Bitmap Indices (cont)
• Deletion needs to be handled properly
– can’t just store “zero” values at deleted locations (why?)
– need existence bitmap to note if the record at location X is valid or
not
• existence bitmap necessary for complementation
not(A=v): (not bitmap-A-v) and ExistenceBitmap
• Should keep bitmaps for all values, even “null”
– to correctly handle SQL null semantics for not (A=v) must
compute:
not(bitmap-A-Null) and (not bitmap-A-v) and ExistenceBitmap
Efficient Implementation of
Bitmap Operations
• Bitmaps are packed into words; a single word-AND is a basic CPU
instruction that computes the AND of 32 or 64 bits at once
– e.g. one-million-bit maps can be anded with just 31,250
instructions
• Counting number of 1s can be done fast by a trick:
– use each byte to index into a precomputed array of 256 elements
each storing the count of 1s in the binary representation
– add up the retrieved counts
– can use pairs of bytes to speed up further in a similar way, but at a
higher memory cost (64K values in precomputed array for 2 byte)
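The byte-lookup counting trick, sketched in Python: precompute the number of 1 bits in every possible byte value, then sum table lookups over the packed bitmap:

    ONES_IN_BYTE = [bin(b).count("1") for b in range(256)]   # 256-entry table

    def count_ones(bitmap: bytes) -> int:
        return sum(ONES_IN_BYTE[b] for b in bitmap)

    print(count_ones(bytes([0b10011010, 0b11111111])))        # 4 + 8 = 12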
Bitmaps and B+-trees
• Bitmaps can be used instead of Tuple-ID lists at leaf levels
of B+-trees, for values that have a large number of
matching records
– assuming a tuple-id is 64 bits, this becomes worthwhile if >1/64 of
the records have a given value
• This technique merges benefits of bitmap and B+-tree
indices
– useful if some values are uncommon, and some are quite common
– why not useful if there are only a small number of different values
total (16, 20, something less than 64)?
• That covers all of chapter 12.
Query Processing
• SQL is good for humans, but not as an internal (machine)
representation of how to calculate a result
• Processing an SQL (or other) query requires these steps:
– parsing and translation
• turning the query into a useful internal representation in the extended
relational algebra
– optimization
• manipulating the relational algebra query into the most efficient form
(one that gets results the fastest)
– evaluation
• actually computing the results of the query
Query Processing Diagram
Query Processing Steps
1. parsing and translation
– details of parsing are covered in other places (texts and courses on
compilers). We’ve already covered SQL and relational algebra;
translating between the two should be relatively familiar ground
2. optimization
– This is the meat of chapter 13. How to figure out which plan,
among many, is the best way to execute a query
3. evaluation
– actually computing the results of the query is mostly mechanical
(doesn’t require much cleverness) once a good plan is in place.
Query Processing Example
• Initial query:
select balance
from account
where balance < 2500
• Two different relational algebra expressions could represent this query:
– sel balance<2500(Pro balance(account))
– Pro balance( sel balance<2500(account))
• which choice is better? It depends upon metadata (data about the data)
and what indices are available for use on these operations.
Query Processing Metadata
• Cost parameters (some are easy to maintain; some are very hard - this is statistical info maintained in the system’s catalog)
– n(r ): number of tuples in relation r
– b(r ): number of disk blocks containing tuples of relation r
– s(r ): average size of a tuple of relation r
– f(r ): blocking factor of r: how many tuples fit in a disk block
– V(A,r): number of distinct values of attribute A in r. (V(A,r)=n(r ) if
A is a candidate key)
– SC(A,r): average selectivity cardinality factor for attribute A of r.
Equivalent to n(r )/V(A,r). (1 if A is a key)
– min(A,r): minimum value of attribute A in r
– max(A,r): maximum value of attribute A in r
Query Processing Metadata (2)
• Cost parameters are used in two important computations:
– I/O cost of an operation
– the size of the result
• In the following examination we’ll find it useful to
differentiate three important operations:
– Selection (search) for equality (R.A1=c)
– Selection (search) for inequality (R.A1>c) (range queries)
– Projection on attribute A1
Selection for Equality (no indices)
• Selection (search) for equality (R.A1=c)
– cost (sequential search on a sorted relation) =
b(r )/2                          (average, unsuccessful)
b(r )/2 + SC(A1,r) - 1           (average, successful)
– cost (binary search on a sorted relation) =
log2 b(r )                       (average, unsuccessful)
log2 b(r ) + SC(A1,r) - 1        (average, successful)
– size of the result: n(select(R.A1=c)) = SC(A1,r) = n(r )/V(A1,r)
Selection for Inequality (no indices)
• Selection (search) for inequality (R.A1>c)
– cost (file unsorted) =
b(r )
– cost (file sorted on A1) =
b(r )/2 + b(r )/2 (if we assume that half the tuples qualify)
b(r ) in general
(regardless of the number of tuples that qualify. Why?)
– size of the result =
depends upon the query; unpredictable
Projection on A1
• Projection on attribute A1
– cost =
b(r )
– size of the result n(Pro(R,A1)) =
V(A1,r)
Selection (Indexed Scan) for Equality
Primary Index on key:
cost = (height+1)         unsuccessful
cost = (height+1) + 1     successful
Primary (clustering) Index on non-key:
cost = (height+1) + SC(A1,r)/f(r )
(all tuples with the same value are clustered)
Secondary Index:
cost = (height+1) + SC(A1,r)
(tuples with the same value are scattered)
Selection (Indexed Scan) for Inequality
Primary Index on key: search for the first value and then pick tuples >= value
cost = (height+1) + 1 + size of the result (in disk pages)
     = height + 2 + n(r ) * (max(A,r) - c)/(max(A,r) - min(A,r)) / f(r )
Primary (clustering) Index on non-key:
cost as above (all tuples with the same value are clustered)
Secondary (non-clustering) Index:
cost = (height+1) + B-treeLeaves/2 + size of result (in tuples)
     = height + 1 + B-treeLeaves/2 + n(r ) * (max(A,r) - c)/(max(A,r) - min(A,r))
Complex Selections
• Conjunction (select where theta1 and theta2)
(s1 = # of tuples satisfying selection condition theta1)
combined SC = (s1/n(r )) * (s2/n(r )) = s1*s2/n(r )^2
assuming independence of predicates
• Disjunction (select where theta1 or theta2)
combined SC = 1 - (1 - s1/n(r )) * (1 - s2/n(r ))
            = s1/n(r ) + s2/n(r ) - s1*s2/n(r )^2
• Negation (select where not theta1)
n(not theta1) = n(r ) - n(theta1)
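The same arithmetic written out in Python (assuming independent predicates; s1 and s2 are the per-predicate matching-tuple counts and the numbers are made up):

    def conjunction_size(s1, s2, n):
        return n * (s1 / n) * (s2 / n)

    def disjunction_size(s1, s2, n):
        return n * (1 - (1 - s1 / n) * (1 - s2 / n))

    n = 10000
    print(round(conjunction_size(2000, 500, n)))    # 100 tuples expected
    print(round(disjunction_size(2000, 500, n)))    # 2400 tuples expected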
Complex Selections with Indices
GOAL: apply the most restrictive condition first, and use multiple indices
in combination, to reduce the intermediate results as early as possible
• Why? No index will be available on intermediate results!
• Conjunctive selection using one index B:
– select using B and then apply remaining predicates on intermediate results
• Conjunctive selection using a composite key index (R.A1, R.A2):
– create a composite key or range from the query values and search directly
(range search on the first attribute (MSB of the composite key) only)
• Conjunctive selection using two indices B1 and B2:
– search each separately and intersect the tuple identifiers (TIDs)
• Disjunctive selection using two indices B1 and B2:
– search each separately and union the tuple identifiers (TIDs)
Sorting?
• What has sorting to do with query processing?
– SQL queries can specify the output be sorted
– several relational operations (such as joins) can be implemented
very efficiently if the input data is sorted first
– as a result, query processing is often concerned with sorting
temporary (intermediate) and final results
– creating a secondary index on the active relation (“logical” sorting)
isn’t sufficient -- sequential scans through the data on secondary
indices are very inefficient. We often need to sort the data
physically into order
Sorting
• We differentiate two types of sorting:
– internal sorting: the entire relation fits in memory
– external sorting: the relation is too large to fit in memory
• Internal sorting can use any of a large range of well-established sorting algorithms (e.g., Quicksort)
• In databases, the most commonly used method for external
sorting is the sort-merge algorithm. (based upon
Mergesort)
Sort-merge Algorithm
• create runs phase.
– Load in M consecutive blocks of the relation (M is number of blocks that
will fit easily in main memory)
– Use some internal sorting algorithm to sort the tuples in those M blocks
– Write the sorted run to disk
– Continue with the next M blocks, etcetera, until finished
• merge runs phase (assuming that the number of runs, N, is less than M)
– load the first block of each run into memory
– grab the first tuple (lowest value) from all the runs and write it to an
output buffer page
– when the last tuple of a block is read, grab a new block from that run
– when the output buffer page is full, write it to disk and start a new one
– continue until all buffer pages are empty
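A toy in-memory model of the two phases in Python: the "relation" is a list of tuples, runs of M*block_size tuples are sorted individually, and the runs are then merged with a heap (one cursor per run). Purely illustrative; a real external sort reads and writes disk blocks:

    import heapq

    def sort_merge(tuples, M, block_size):
        run_len = M * block_size
        runs = [sorted(tuples[i:i + run_len])             # create-runs phase
                for i in range(0, len(tuples), run_len)]
        return list(heapq.merge(*runs))                   # merge-runs phase (assumes N <= M)

    print(sort_merge([5, 1, 9, 3, 7, 2, 8, 6, 4], M=2, block_size=2))
    # [1, 2, 3, 4, 5, 6, 7, 8, 9]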
Sort-merge Algorithm (2)
• Merge-runs phase (N>M)
– operate on M runs at a time, creating runs of length M^2 blocks, and continue in
multiple passes of the Merge operation
• Cost of sorting: b(r ) is the number of blocks occupied by relation r
– runs phase does one read, one write on each block of r: cost 2b(r )
– total number of runs (N): b(r )/M
– number of passes in the merge operation: 1 if N < M;
otherwise ceil(log_(M-1)(b(r )/M))
– during each pass in the merge phase we read the whole relation and write
it all out again: cost 2b(r ) per pass
– total cost of the sort is therefore 2b(r ) * (ceil(log_(M-1)(b(r )/M)) + 1)
– if only one merge pass is required (N<M) the cost is 4b(r );
– if M>b(r ) then there is only one run (internal sorting) and the cost is b(r )
Join Operation
• Join is perhaps the most important operation combining
two relations
• Algorithms computing the join efficiently are crucial to the
optimization phase of query processing
• We will examine a number of algorithms for computing
joins
• An important metric for estimating efficiency is the size of
the result: as mentioned last class, the best approach for
complex (multi-relation) queries is to cut down the size of
the intermediate results as quickly as possible.
Join Operation: Size Estimation
• 0 <= size <= n(r ) * n(s)
(between 0 and size of cartesian product)
• If A = R ∩ S is a key of R, then
size <= n(s)
• If A = R ∩ S is a key of R and a foreign key of S, then
size = n(s)
• If A = R ∩ S is not a key, then each value of A in R appears no more
than n(s)/V(A,s) times in S, so n(r ) tuples of R produce:
size <= n(r ) * n(s)/V(A,s)
symmetrically,
size <= n(s) *n(r)/V(A,r)
if the two values are different we use:
size <=min{n(s)*n(r)/V(A,r), n(r)*n(s)/V(A,s)}
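The non-key estimate written out in Python (the statistics values are made up):

    def join_size_estimate(n_r, n_s, v_a_r, v_a_s):
        return min(n_s * n_r / v_a_r, n_r * n_s / v_a_s)

    # e.g. 10,000 r tuples, 5,000 s tuples, 2,500 distinct A values in r, 100 in s
    print(join_size_estimate(10000, 5000, 2500, 100))    # 20000.0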
Join Methods: Nested Loop
• Simplest algorithm to compute a join: nested for loops
– requires no indices
• tuple-oriented:
    for each tuple t1 in r do begin
        for each tuple t2 in s do begin
            join t1, t2 and append the result to the output
• block-oriented:
    for each block b1 in r do begin
        for each block b2 in s do begin
            join b1, b2 and append the result to the output
• reverse inner loop
– as above, but we alternate counting up and down in the inner loop. Why?
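A Python sketch of the block-oriented version above; "blocks" are modeled as lists of tuples, the join condition is equality on the named attribute positions, and the data is made up:

    def block_nested_loop_join(r_blocks, s_blocks, r_attr, s_attr):
        out = []
        for br in r_blocks:                     # outer relation, one block at a time
            for bs in s_blocks:                 # inner relation rescanned per outer block
                for t1 in br:
                    for t2 in bs:
                        if t1[r_attr] == t2[s_attr]:
                            out.append(t1 + t2)
        return out

    r = [[("Perryridge", "A-101"), ("Downtown", "A-110")]]          # 1 block of r
    s = [[("A-101", 500)], [("A-110", 600), ("A-215", 700)]]        # 2 blocks of s
    print(block_nested_loop_join(r, s, r_attr=1, s_attr=0))
    # [('Perryridge', 'A-101', 'A-101', 500), ('Downtown', 'A-110', 'A-110', 600)]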
Cost of Nested Loop Join
• Cost depends upon the number of buffers and the replacement strategy
– pin 1 block from the outer relation, k for the inner, and use LRU
    cost: b(r ) + b(r )*b(s)                       (assuming b(s) > k)
– pin 1 block from the outer relation, k for the inner, and use MRU
    cost: b(r ) + b(s) + (b(s) - (k-1))*(b(r ) - 1)
        = b(r )*(2-k) + k - 1 + b(r )*b(s)         (assuming b(s) > k)
– pin k blocks from the outer relation, 1 for the inner
    • read k blocks from the outer                 (cost k)
    • for each block of s, join 1 x k blocks       (cost b(s))
    • repeat with the next k blocks of r until done  (repeated b(r )/k times)
    cost: (k + b(s)) * b(r )/k = b(r ) + b(r )*b(s)/k
– which relation should be the outer one?
Join Methods: Sort-Merge Join
• Two phases:
– Sort both relations on the join attributes
– Merge the sorted relations
sort R on joining attribute
sort S on joining attribute
merge (sorted-R, sorted-S)
• cost with k buffers:
– b(r)(2 log(b(r)/k) + 1) to sort R
– b(s)(2 log(b(s)/k) + 1) to sort S
– b(r) + b(s) to merge
– total: b(r)(2 log(b(r)/k) + 1) + b(s)(2 log(b(s)/k) + 1) + b(r) + b(s)
Join Methods: Hash Join
• Two phases:
– Hash both relations into hashed partitions
– Bucket-wise join: join tuples of the same partitions only
Hash R on joining attribute into H(R) buckets
Hash S on joining attribute into H(S) buckets
nested-loop join of corresponding buckets
• cost (assuming pairwise buckets fit in the buffers)
– 2b(r ) to hash R (read and write)
– 2b(s) to hash S (as above)
– b(r ) + b(s) to merge
– total: 3(b(r ) + b(s))
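A Python sketch of the two phases (bucket counts and data layout are illustrative; a real hash join partitions to disk and then builds/probes in memory):

    from collections import defaultdict

    def hash_join(r, s, r_attr, s_attr, num_buckets=8):
        r_parts, s_parts = defaultdict(list), defaultdict(list)
        for t in r:                                           # partition phase
            r_parts[hash(t[r_attr]) % num_buckets].append(t)
        for t in s:
            s_parts[hash(t[s_attr]) % num_buckets].append(t)
        out = []
        for b in r_parts:                                     # bucket-wise join phase
            for t1 in r_parts[b]:
                for t2 in s_parts[b]:
                    if t1[r_attr] == t2[s_attr]:
                        out.append(t1 + t2)
        return out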
Join Methods: Indexed Join
• Inner relation has an index (clustering or not):
    for each block b(r ) in R do begin
        for each tuple t in b(r ) do begin
            search the index on S with the value t.A of the joining
            attribute and join with the resulting tuples of S
• cost = b(r ) + n(r ) * cost(select(S.A=c))
where cost(select(S.A=c)) is as described before for indexed selection
– What if R is sorted on A? (hint: use V(A,r) in the above)
3-way Join
Suppose we want to compute R(A,B) |X| S(B,C) |X| T(C,D)
• 1st method: pairwise.
– First compute temp(A,B,C) = R(A,B) |X| S(B,C)
    cost: b(r ) + b(r )*b(s)
    size of temp: b(temp) = (n(r )*n(s)/V(B,s)) / f(r+s)
– then compute result = temp(A,B,C) |X| T(C,D)
    cost: b(t) + b(t)*b(temp)
• 2nd method: scan S and do simultaneous selections on R and T
– cost = b(s) + b(s)* (b(r ) + b(t))
– if R and T are indexed we could do the selections through the indices
cost = b(s) + n(s ) *[ cost(select(R.B=S.B)) + cost(select(T.C=S.C))]
Transformation of Relational
Expressions
• (Section 14.3)
• Two relational algebra expressions are equivalent if they
generate the same set of tuples on every legal database
instance.
– Important for optimization
– Allows one relational expression to be replaced by another without
affecting the results
– We can choose among equivalent expressions according to which
are lower cost (size or speed depending upon our needs)
– An equivalence rule says that expression A is equivalent to B; we
may replace A with B and vice versa in the optimizer.
Equivalence Rules (1)
In the discussion that follows:
Ex represents a relational algebra expression
θy represents a predicate
Lz represents a particular list of attributes
σ and Π are selection and projection as usual
• cascade of selections: a conjunction of selections can be deconstructed
into a series of them
    σθ1∧θ2(E) = σθ1(σθ2(E))
• selections are commutative
    σθ1(σθ2(E)) = σθ2(σθ1(E))
Equivalence Rules (2)
• cascade of projections: only the final projection in a series of
projections is necessary
    ΠL1(ΠL2(…(ΠLn(E))…)) = ΠL1(E)
• selections can be combined with Cartesian products and theta joins
    σθ(E1 × E2) = E1 |X|θ E2
    σθ1(E1 |X|θ2 E2) = E1 |X|θ1∧θ2 E2
• joins and theta-joins are commutative (if you ignore the order of
attributes, which can be corrected by an appropriate projection)
    E1 |X| E2 = E2 |X| E1
Equivalence Rules (3)
• joins and cross product are associative
    (E1 |X|θ1 E2) |X|θ2∧θ3 E3 = E1 |X|θ1∧θ3 (E2 |X|θ2 E3)
(where θ2 involves attributes from only E2 and E3.)
    (E1 |X| E2) |X| E3 = E1 |X| (E2 |X| E3)
    (E1 × E2) × E3 = E1 × (E2 × E3)
(the above two equivalences are useful special cases of the first one)
Equivalence Rules (4)
• selection distributes over join in two cases:
– if all the attributes in θ1 appear only in one expression (say E1)
    σθ1(E1 |X| E2) = (σθ1(E1)) |X| E2
– if all the attributes in θ1 appear only in E1 and all the attributes in θ2
appear only in E2
    σθ1∧θ2(E1 |X| E2) = (σθ1(E1)) |X| (σθ2(E2))
(note that the first case is a special case of the second)
Equivalence Rules (5)
• projection distributes over join. Given L1 and L2 being attributes of E1
and E2 respectively:
– if θ involves attributes entirely from L1 ∪ L2 then
    ΠL1∪L2(E1 |X|θ E2) = (ΠL1(E1)) |X|θ (ΠL2(E2))
– if θ also involves attribute lists L3 and L4 (both not in L1 ∪ L2) from E1 and E2
respectively
    ΠL1∪L2(E1 |X|θ E2) = ΠL1∪L2((ΠL1∪L3(E1)) |X|θ (ΠL2∪L4(E2)))
Equivalence Rules (6)
• union and intersection are commutative (difference is not)
    E1 ∪ E2 = E2 ∪ E1
    E1 ∩ E2 = E2 ∩ E1
• union and intersection are associative (difference is not)
    (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
    (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
• selection distributes over union, intersection, and set difference
    σθ(E1 ∪ E2) = σθ(E1) ∪ σθ(E2)
    σθ(E1 ∩ E2) = σθ(E1) ∩ σθ(E2)
    σθ(E1 ∩ E2) = σθ(E1) ∩ E2
    σθ(E1 - E2) = σθ(E1) - σθ(E2)
    σθ(E1 - E2) = σθ(E1) - E2
Chapter 15:
Transactions
• A transaction is a single logical unit of database work -for example, a transfer of funds from checking to savings
account. It is a set of operations delimited by statements of
the form begin transaction and end transaction
• To ensure database integrity, we require that transactions
have the ACID properties
– Atomic: all or nothing gets done.
– Consistent: preserves the consistency of the database
– Isolated: unaware of any other concurrent transactions (as if there
were none)
– Durable: after completion the results are permanent even through
system crashes
ACID Transactions
• Transactions access data using two operations:
– read (X)
– write (X)
• ACID property violations:
– Consistency is the responsibility of the programmer
– System crash half-way through: atomicity issues
– Another transaction modifies the data half-way through: isolation
violation
– example: transfer $50 from account A to account B
T: read(A); A:=A-50; write (A); read (B); B:=B+50; write (B)
Transaction States
• Five possible states for a
transaction:
– active (executing)
– partially committed (after last
statement’s execution)
– failed (can no longer proceed)
– committed (successful
completion)
– aborted (after the transaction has
been rolled back)
Transaction States
• Committed or Aborted transactions are called terminated
• Aborted transactions may be
– restarted as a new transaction
– killed if it is clear that it will fail again
• Rollbacks
– can be requested by the transaction itself (go back to a given
execution state)
– some actions can’t be rolled back (e.g., a printed message, or an
ATM cash withdrawal)
Concurrent Execution
• Transaction processing systems allow multiple transactions
to run concurrently
– improves throughput and resource utilization (I/O on transaction
A can be done in parallel with CPU processing on transaction B)
– reduced waiting time (short transactions need not wait for long
ones operating on other parts of the database)
– however, this can cause problems with the “I” (Isolation) property
of ACID
Serializability
• Scheduling transactions so that
they are serializable (equivalent
to some serial schedule of the
same transactions) ensures
database consistency (and the
“I” property of ACID)
• serial schedule is equivalent to
having the transactions execute
one at a time
• non-serial or interleaved
schedule permits concurrent
execution
Serial Schedule
• To the right T1 and T2 are on a
serial schedule: T2 begins after
T1 finishes. No problems.
Interleaved Schedule
• This is an example of an
interleaved concurrent schedule
that raises a number of database
consistency concerns.
Serial and Interleaved Schedules
(figure: the serial schedule and the interleaved schedule from the previous two slides, shown side by side)
Another Interleaved Schedule
The previous serial and interleaved
schedules ensured database
consistency (A+B before
execution = A+B after
execution)
The interleaved schedule on the
right is only slightly different,
but does not ensure a consistent
result.
– assume A=1000 and B=2000
before (sum = 3000)
– after execution, A=950 and
B=2150 (sum = 3100)
Inconsistent Transaction Schedules
• So what caused the problem? What makes one concurrent schedule
consistent, and another one inconsistent?
– Operations on data within a transaction are not relevant, as they are run on
copies of the data residing in local buffers of the transaction
– For scheduling purposes, the only significant operations are read and
write
• Given two transactions, Ti and Tj, both attempting to access data item
Q:
– if Ti and Tj are both executing read(Q) statements, order does not matter
– if Ti is doing write(Q) and Tj read(Q), then order does matter
– same if Tj is writing and Ti reading
– if both Ti and Tj are executing write(Q), then order might matter if there
are any subsequent operations in Ti or Tj accessing Q
Transaction Conflicts
• Two operations on the same data item by different
transactions are said to be in conflict if at least one of the
operations is a write
• If two consecutive operations of different transactions in a
schedule S are not in conflict, then we can swap the two to
produce another schedule S’ that is conflict equivalent
with S
• A schedule S is serializable if it is conflict equivalent (after
some series of swaps) to a serial schedule.
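A tiny sketch of the conflict test just defined; operations are modelled here as (transaction, action, item) triples, which is an illustrative representation rather than anything from the text.

```python
# Two operations conflict if they come from different transactions, touch the
# same data item, and at least one of them is a write.

def in_conflict(op1, op2):
    t1, action1, item1 = op1
    t2, action2, item2 = op2
    return (
        t1 != t2                              # different transactions
        and item1 == item2                    # same data item
        and "write" in (action1, action2)     # at least one is a write
    )

# read/write(B) in T0 does not conflict with read/write(A) in T1:
print(in_conflict(("T0", "write", "B"), ("T1", "read", "A")))   # False
# write(A) in T0 does conflict with read(A) in T1:
print(in_conflict(("T0", "write", "A"), ("T1", "read", "A")))   # True
```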
Transaction Conflict Example
Example: read/write(B) in T0 do not conflict with read/write(A) in T1
Interleaved schedule:
T0              T1
read(A)
write(A)
                read(A)
                write(A)
read(B)
write(B)
                read(B)
                write(B)
Swapping T0's read(B)/write(B) with T1's read(A)/write(A) (none of these pairs
conflict) produces the conflict-equivalent serial schedule:
T0              T1
read(A)
write(A)
read(B)
write(B)
                read(A)
                write(A)
                read(B)
                write(B)
Serializability Testing (15.9) and
Precedence Graphs
• So we need a simple method to test a schedule S and
discover whether it is serializable.
• Simple method involves constructing a directed graph
called a Precedence Graph from S
• Construct a precedence graph as follows:
– a vertex labelled Ti for every transaction in S
– an edge from Ti to Tj if any of these three conditions holds:
• Ti executes write(Q) before Tj executes read(Q)
• Ti executes read(Q) before Tj executes write(Q)
• Ti executes write(Q) before Tj executes write(Q)
– if the graph has a cycle, S is not serializable
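A sketch of the construction above in code: build the precedence-graph edges from a schedule and test for a cycle with a depth-first search. The schedule representation (a list of (transaction, action, item) triples) is illustrative.

```python
# Build a precedence graph from a schedule and test it for a cycle.

def precedence_graph(schedule):
    edges = set()
    for i, (ti, act_i, q_i) in enumerate(schedule):
        for tj, act_j, q_j in schedule[i + 1:]:
            if ti != tj and q_i == q_j and "write" in (act_i, act_j):
                edges.add((ti, tj))      # Ti's conflicting op precedes Tj's
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(u):                        # depth-first search for a back edge
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

# T1 reads A before T2 writes A, and T2 reads B before T1 writes B:
# the two edges form a cycle, so the schedule is not serializable.
s = [("T1", "read", "A"), ("T2", "write", "A"),
     ("T2", "read", "B"), ("T1", "write", "B")]
edges = precedence_graph(s)
print(edges)              # {('T1', 'T2'), ('T2', 'T1')}
print(has_cycle(edges))   # True
```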
Precedence Graph Example 1
• Compute a precedence graph
for schedule B (right)
• three vertices (T1, T2, T3)
• edge from Ti to Tj if
– Ti writes Q before Tj reads Q
– Ti reads Q before Tj writes Q
– Ti writes Q before Tj writes Q
Precedence Graph Example 2
• Slightly more complicated
example
• Compute a precedence graph
for schedule A (right)
• three vertices (T1, T2, T3)
• edge from Ti to Tj if
– Ti writes Q before Tj reads Q
– Ti reads Q before Tj writes Q
– Ti writes Q before Tj writes Q
Concurrency Control
• So now we can recognize when a schedule is serializable. In practice,
it is often difficult and inefficient to determine a schedule in advance,
much less examine it for serializability.
• Lock-based protocols are a common mechanism used to prevent transaction
conflicts on the fly (i.e., without knowing what operations are coming
later)
• Basic concept is simple: to prevent transaction conflict (two
transactions working on the same data item with at least one of them
writing), we implement a lock system -- a transaction may only access
an item if it holds the lock on that item.
Lock-based Protocols
• We recognize two modes of locks:
– shared: if Ti has a shared-mode (“S”) lock on data item Q then Ti
may read, but not write, Q.
– exclusive: if Ti has an exclusive-mode (“X”) lock on Q then Ti can
both read and write Q.
• Transactions must be granted a lock before accessing data
– A concurrency-control manager handles granting of locks
– Multiple S locks are permitted on a single data item, but only one
X lock
– this allows multiple reads (which don’t create serializability
conflicts) but prevents any R/W, W/R, or W/W interactions (which
create conflicts)
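A small sketch of the S/X compatibility rule above: a requested lock can be granted only if it is compatible with every lock currently held on the item by other transactions. The table and function names are illustrative.

```python
# Lock compatibility for shared (S) and exclusive (X) modes.

COMPATIBLE = {
    ("S", "S"): True,    # many readers may share an item
    ("S", "X"): False,   # a reader blocks a writer ...
    ("X", "S"): False,   # ... and a writer blocks readers
    ("X", "X"): False,   # only one writer at a time
}

def can_grant(requested, held_modes):
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))   # True: a third shared lock is fine
print(can_grant("X", ["S"]))        # False: the writer must wait
print(can_grant("S", []))           # True: no locks held at all
```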
Lock-based Protocols
• Transactions must request a lock before accessing a data
item; they may release the lock at any time when they no
longer need access
• If the concurrency-control manager does not grant a
requested lock, the transaction must wait until the data
item becomes available later on.
• Unfortunately, this can lead to a situation called a deadlock
– suppose T1 holds a lock on item R and requests a lock on Q, but
transaction T2 holds an exclusive lock on Q. So T1 waits. Then
T2 gets to where it requests a lock on R (still held by waiting T1).
Now both transactions are waiting for each other. Deadlock.
Deadlocks
• To detect a deadlock situation we use a wait-for graph
– one node for each transaction
– directed edge Ti --> Tj if Ti is waiting for a resource locked by Tj
• a cycle in the wait-for graph implies a deadlock.
– The system checks periodically for deadlocks
– If a deadlock exists, one of the nodes in the cycle must be aborted
• 95% of deadlocks are between two transactions
• deadlocks are a necessary evil
– preferable to allowing the database to become inconsistent
– deadlocks can be rolled back; inconsistent data is much worse
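A sketch of the wait-for graph for the T1/T2 example on the previous slide: T1 waits for Q (held by T2) while T2 waits for R (held by T1). The graph representation is illustrative.

```python
# edge Ti -> Tj means Ti is waiting for a resource locked by Tj
wait_for = {"T1": {"T2"}, "T2": {"T1"}}

def deadlocked(graph):
    # a cycle in the wait-for graph means deadlock
    def reaches(start, target, seen=()):
        return any(v == target or (v not in seen and reaches(v, target, seen + (v,)))
                   for v in graph.get(start, ()))
    return any(reaches(t, t) for t in graph)

print(deadlocked(wait_for))   # True: one of T1/T2 must be aborted as the victim
```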
Two-phase Locking Protocol
• A locking protocol is a set of rules for placing and
releasing locks
– a protocol restricts the number of possible schedules
– lock-based protocols are pessimistic
• Two-phase locking is (by far) the most common protocol
– growing phase: a transaction may only obtain locks (never release
any of its locks)
– shrinking phase: a transaction may only release locks (never obtain
any new locks)
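A sketch of the two-phase rule itself (not a full lock manager): once a transaction has released any lock, it may not acquire new ones. Class and method names are illustrative.

```python
class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False          # becomes True at the first release

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: lock({item!r}) requested in shrinking phase")
        self.held.add(item)             # growing phase: acquiring is allowed

    def unlock(self, item):
        self.shrinking = True           # the growing phase is over for good
        self.held.discard(item)

t = TwoPhaseTransaction("T1")
t.lock("A")
t.lock("B")          # still growing
t.unlock("A")        # first release: shrinking phase begins
try:
    t.lock("C")      # violates two-phase locking
except RuntimeError as err:
    print(err)       # T1: lock('C') requested in shrinking phase
```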
Two-phase with Lock Conversion
• Two-phase with lock conversion:
– S can be upgraded to X during the growing phase
– X can be downgraded to S during the shrinking phase (this only
works if the transaction has already written any changed data value
with an X lock, of course)
– The idea is that during the growing phase, instead of holding an X lock
on an item it does not need to write yet, a transaction holds an S lock on
it (allowing other transactions to keep reading the old value for longer)
and upgrades to X only when it is ready to modify the value.
– Similarly in the shrink phase, once a transaction downgrades an X
lock, other transactions can begin reading the new value earlier.
Variants on Two-phase Locking
• Strict two-phase locking
– additionally requires that all X locks are held until commit time
– prevents any other transactions from seeing uncommitted data
– viewing uncommitted data can lead to cascading rollbacks
• Rigorous two-phase locking
– requires that ALL locks are held until commit time
Database Recovery
• Computers may crash, stall, lock
• We require that transactions are Durable (the D in ACID)
– A completed transaction makes a permanent change to the
database that will not be lost
– How do we ensure durability, given the possibility of computer
crash, hard disk failure, power surges, software bugs locking the
system, or all the myriad bad things that can happen?
– How do we recover from failure (i.e., get back to a consistent state
that includes all recent changes)?
• Most widely used system is log-based recovery
Backups
• Regular backups take a snapshot of the database status at a
particular moment in time
– used to restore data in case of catastrophic failure
– expensive operation (writing out the whole database)
• usually done no more than once a week, over the weekend when the
system usage is low
– smaller daily backups store only records that have been modified
since the last weekly backup; done overnight
– backups allow us to recover the database to a fairly recent
consistent state (yesterday’s), but are far too expensive to be used
to save running database modifications
– How do we ensure transaction (D) durability?
Log-Based Recovery
• We store a record of recent modifications; a log.
– Log is a sequence of log records, recording all update activities in
the database. A log record records a single database write. It has
these fields:
• transaction identifier: what transaction performed the write
• data-item identifier: unique ID of the data item (typically the location
on disk)
• old value (what was overwritten)
• new value (value after the write)
– Log is a write-ahead log -- log records are written before the
database updates its records
Log-Based Recovery (2)
• Other log records:
– <T-ID, start-time> -- transaction becomes active
– <T-ID, D-ID, V-old, V-new> -- transaction makes a write
– <T-ID, commit-time> -- transaction commits
– <T-ID, abort-time> -- transaction aborts
• Log contains a complete record of all database activity
since the last backup
• Logs must reside on stable storage
– Assume each log record is written to the end of the log on stable
storage as soon as it is created.
Log-Based Recovery (3)
• Recovery operation uses two primitives:
– redo: reapply the logged update.
• Write V-new into D-ID
– undo: reverse the logged update
• Write V-old into D-ID
– both primitives ignore the current state of the data item -- they
don’t bother to read the value first.
– Applying a primitive to the same data item multiple times is equivalent to
applying only the last one (the primitives are idempotent) -- no harm as long
as we apply them in the correct order, even if the correct result has already
been written to stable storage
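A sketch of the two primitives, applied to update records of the form <T-ID, D-ID, V-old, V-new>; the dict and tuple layout are illustrative, not the book's concrete data structures.

```python
db = {"A": 950, "B": 2000}       # made-up state found on disk after a crash

def redo(record, db):
    t_id, d_id, v_old, v_new = record
    db[d_id] = v_new             # reapply the update, ignoring the current value

def undo(record, db):
    t_id, d_id, v_old, v_new = record
    db[d_id] = v_old             # reverse the update, ignoring the current value

rec = ("T0", "A", 1000, 950)     # T0 wrote A: 1000 -> 950
redo(rec, db)                    # idempotent: A ends up at 950 whether or not
redo(rec, db)                    # the write had already reached the disk
print(db["A"])                   # 950
undo(rec, db)
print(db["A"])                   # 1000
```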
Checkpoints
• When a system failure occurs, we examine the log to
determine which transactions need to be redone, and
which need to be undone.
• In theory we need to search the entire log
– time consuming
– most of the transactions in the log have already written their output
to stable storage. It won’t hurt the database to redo their results,
but every unnecessary redo wastes time.
• To reduce this overhead database systems introduce
checkpoints.
Log-Based Recovery with Checkpoints
• So we have a crash and need to recover. What do we do?
• Three passes through the log between the checkpoint and
the failure
– go forward from the checkpoint to the failure to create the redo and
undo lists
• redo everything that committed before the failure
• undo everything that failed to commit before the failure
– go backward from failure to checkpoint doing the undos in order
– go forward from checkpoint to failure doing the redos in sequence
– expensive -- three sequential scans of the active log
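A sketch of the three passes over a list of log records; for simplicity the scan starts at the beginning of the list rather than at a checkpoint, and the record layouts ("start", "update", "commit" tuples) are illustrative.

```python
def recover(log, db):
    # pass 1 (forward): decide which transactions committed and which did not
    committed = {r[1] for r in log if r[0] == "commit"}
    started = {r[1] for r in log if r[0] == "start"}
    to_undo = started - committed

    # pass 2 (backward): undo the updates of uncommitted transactions
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] in to_undo:
            _, t, item, v_old, v_new = rec
            db[item] = v_old

    # pass 3 (forward): redo the updates of committed transactions
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, t, item, v_old, v_new = rec
            db[item] = v_new

db = {"A": 900, "B": 2000}                                   # made-up on-disk state
log = [("start", "T1"), ("update", "T1", "A", 1000, 950), ("commit", "T1"),
       ("start", "T2"), ("update", "T2", "B", 2000, 2100)]   # T2 never committed
recover(log, db)
print(db)   # {'A': 950, 'B': 2000}: T1 redone, T2 undone
```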
Recovery Example
• First pass: T2, T4 commit; T3, T5 are uncommitted
• Second pass: undo T5, then T3
• Third pass: redo T4, then T2
Almost Final Stuff on Checkpoints
• Checkpointing usually speeds up recovery
– log prior to checkpoint can be archived
– without checkpoint the log may be very long; three sequential
passes through it could be very expensive
• During checkpointing:
– stop accepting new transactions and wait until all active
transactions commit
– flush the log to stable storage
– flush all dirty disk pages in the buffer to disk
– mark the stable-storage log record with a <checkpoint> marker
Final Stuff on Checkpoints
• Better checkpointing
– don’t wait for active transactions to finish, but don’t let them make
updates to the buffers or the update log during checkpointing
– make the checkpoint log record so that it includes a list L of active
transactions
<checkpoint, L>
– on recovery we need to go further back through previous
checkpoints to find all the changes of all transactions listed in L so
we can undo or redo them
– an even more elaborate scheme (called fuzzy checkpointing) allows
transactions to keep making updates while the checkpoint is being taken
Deferred vs. Immediate Modification
• Immediate Database Modification
– basically what we’ve been discussing so far -- uncommitted
transactions may write values to disk during their execution
• Deferred Database Modification
– no writes to the database before transaction is partially committed
(i.e., after the execution of its last statement)
– since no uncommitted transaction writes are in the log, there is no
need for the undo pass on recovery.