Quick Review of Apr 15 material

advertisement
Quick Review of Apr 15 material
• Overflow
– definition, why it happens
– solutions: chaining, double hashing
• Hash file performance
– loading factor
– search speed
• Hash Indices (as separate from file organization)
• Static vs. Dynamic Hashing
• Extendable Hashing
– with a detailed example of how it works
Today
• HW#3 due today
• HW #4: due Thursday April 24 (next week)
– Questions: 12.11, 12.12, 12.13, 12.16
• Today:
– indices on multiple attributes
– grid files
– bitmap indices
Index Definition in SQL
• Create an index
create index <index-name> on <relation-name> (<attribute-list>)
e.g.,
create index br-index on branch(branch-name)
• Use create unique index to indirectly specify and enforce the
condition that the search key is a candidate key
– not really required if SQL unique integrity constraint is supported
• To drop an index
drop index <index-name>
Multiple-Key Access
• With some queries we can use multiple indices
• Example:
select account-number
from account
where branch-name=“Perryridge” and balance=1000
• Possible strategies for processing this query using indices on single
attributes:
– use index on balance to find accounts with balances =1000, then test them
individually to see if branch-name=“Perryridge”
– use index on branch-name to find accounts with branchname=“Perryridge”, then test them individually to see if balances =1000
– use branch-name index to find pointers to all records of the Perryridge
branch, and use balance index similarly, then take intersection of both sets
of pointers
Multiple-Key Access (2)
• With some queries using a single-attribute index is unnecessarily
expensive
– with methods (1) and (2) from the earlier slide, we might have the index
we use return a very large set, even though the final result is quite small
– Even with method (3) (use both indices and then find the intersection) we
might have both indices return a large set, which will make for a lot of
unneeded work if the final result (the intersection) is small
• An alternative strategy is to create and use an index on more than one
attribute -- in this example, an index on (branch-name, balance) (both
attributes)
Indices on Multiple Attributes
• Suppose we have an ordered index on the combined search-key
(branch-name, balance)
• Examining the earlier query with the clause
where branch-name=“Perryridge” and balance=1000
Our new index will fetch only records that satisfy both conditions -much more efficient than trying to answer the query with separate
single-valued indices
• We can also handle range queries like:
where branch-name=“Perryridge” and balance<1000
• But our index will not work for
where branch-name<“Perryridge” and balance=1000
Multi-Attribute Indexing
Example: EMP(eno, ename, age, sal)
lots of ways to handle this problem
•
•
separate indices: lots of false hits
combined index based on
composite key
– key= sal*100 + age
– search for 30<sal<40 translates
into 3000<key<4000 (easy)
– search for 60<age<80 difficult
•
•
•
Grid files (up next)
R-tree (B-tree generalization)
Quad-trees, K-d trees, etc...
Grid Files
• Structure used to speed processing of general multiple search-key
queries involving one or more comparison operators
• The grid file has:
– a single grid array
– array has number of dimensions equal to the number of search-key
attributes
– one linear scale for each search-key attribute
• Multiple cells of the grid array can point to the same bucket
• To find the bucket for a search-key value, locate the row and column
of the cell using the linear scales to get the grid location, then follow
the pointer in that grid location to the bucket
Example Grid File for account
Queries on a Grid File
• A grid file on two attributes A and B can handle queries of all the
following forms with reasonable efficiency:
– (a1 < A < a2)
– (b1 < B < b2)
– (a1 < A < a2 and b1 < B < b2)
• For example, to answer (a1 < A < a2 and b1 < B < b2), use the linear
scales to find corresponding candidate grid array cells, and look up all
the buckets pointed-to from those cells
Grid Files (cont)
• During insertion, if a bucket becomes full, a new bucket
can be created if more than one cell points to it
– idea similar to extendable hashing, but in multiple dimensions
– if only one cell points to the bucket, either an overflow bucket
must be created or the grid array size must be increased
• Linear scales can be chosen to uniformly distribute records
across buckets
– not necessary to have scale uniform across the domain -- if records
are distributed in some other pattern, the linear scale can mirror it
(as shown in the example earlier)
Grid Files (end)
• Periodic re-organization to increase grid array size helps
with good performance
– reorganization can be very expensive, though
• Space overhead of the grid array can be high
• R-trees (chapter 23) are an alternative
Bitmap Indices
• Bitmap indices are a special type of index designed for efficient
queries on multiple keys
• Records in a relation are assumed to be numbered sequentially from 0
– given a number n the objective is for it to be easy to retrieve record n
– very easy to achieve if we’re looking at fixed-length records
• Bitmap indices are applicable on attributes that take on a relatively
small number of distinct values
– e.g. gender, country, state, hockey team
– or an arbitrary mapping of a wider spectrum of values into a small number
of categories (e.g. income level: divide income into a small number of
levels such as 0-9999, 10K-19999, 20K-49999, 50K and greater)
• A bitmap is simply an array of bits
Bitmap Indices (cont)
• In the simplest form a bitmap index on an attribute has a bitmap for
each value of the attribute
– bitmap has as many bits as there are records
– in a bitmap for value v, the bit for a record is 1 if the record has the value
v for the attribute, and 0 otherwise
Bitmap Indices (cont)
• Bitmap indices are useful for queries on multiple attributes
– not particularly useful for single-attribute queries
• Queries are answered using bitmap operations
– intersection (and)
– union (or)
– complementation (not)
• Each operation takes two bitmaps of the same size and applies the
operation on corresponding bits to get the result bitmap
– e.g. 100110 and 110011 = 100010
100110 or 110011 = 110111
not 100110 = 011001
Bitmap Indices (cont)
• Every 1 bit in the result marks a desired tuple
– can compute its location (since records are numbered sequentially
and are all of the same size) and retrieve the records
– counting number of tuples in the result (SQL count aggregate) is
even faster
• Bitmap indices are generally very small compared to
relation size
– e.g. if record is 100 bytes, space for a single bitmap on all tuples of
the relation takes 1/800 of the size of the relation itself
– if bitmap allows for 8 distinct values, bitmap is only 1% of the size
of the whole relation
Bitmap Indices (cont)
• Deletion needs to be handled properly
– can’t just store “zero” values at deleted locations (why?)
– need existence bitmap to note if the record at location X is valid or
not
• existence bitmap necessary for complementation
not(A=v): (not bitmap-A-v) and ExistenceBitmap
• Should keep bitmaps for all values, even “null”
– to correctly handle SQL null semantics for not (A=v) must
compute:
not(bitmap-A-Null) and (not bitmap-A-v) and ExistenceBitmap
Efficient Implementation of
Bitmap Operations
• Bitmaps are packed into words; a single word and is a
basic CPU instruction that computes the and of 32 or 64
bits at once
– e.g. one-million-bit maps can be anded with just 31,250
instructions
• Counting number of 1s can be done fast by a trick:
– use each byte to index into a precomputed array of 256 elements
each storing the count of 1s in the binary representation
– add up the retrieved counts
– can use pairs of bytes to speed up further in a similar way, but at a
higher memory cost (64K values in precomputed array for 2 byte)
Bitmaps and B+-trees
• Bitmaps can be used instead of Tuple-ID lists at leaf levels
of B+-trees, for values that have a large number of
matching records
– assuming a tuple-id is 64 bits, this becomes worthwhile if >1/64 of
the records have a given value
• This technique merges benefits of bitmap and B+-tree
indices
– useful if some values are uncommon, and some are quite common
– why not useful if there are only a small number of different values
total (16, 20, something less than 64)?
• That covers all of chapter 12.
Download