COP 5725 Advanced Database Systems Spring 2016, Assignment 1

Instructor: Peixiang Zhao
TA: Esra Akbas
Due date: Monday, 02/29/2016, during class
Problem 1
[10 points]
A patient record consists of the following information:
1. Fixed-length fields: date-of-birth, SSN, and patient ID;
2. Variable-length fields: name, address and patient history. Note pointers are
maintained within the record;
3. Repeating fields: a series of cholesterol tests, each of which requires a (fixed-length) date plus an integer result for the test.
Draw the layout of patient records if
1. [5 points] The repeating tests are kept within the record itself;
2. [5 points] The tests are stored on a separate block, with pointers to them in the
record.
Problem 2
[10 points]
Suppose we have n pointers that need to be swizzled, and swizzling one pointer will
take time t on average. Suppose that if we swizzle all pointers automatically, we can
perform the swizzling in half the time it would take to swizzle each separately. If
the probability that a pointer in main memory will be followed at least once is p, for
what values of p is it more efficient to swizzle automatically than on demand?
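
One way to set up the comparison is to write both costs as functions of n, t, and p. The sketch below is only an illustration under a simple assumed model: on-demand swizzling costs exactly t for each pointer that is followed and nothing for pointers that are never followed. The function names are hypothetical and not part of the assignment.

```python
def swizzle_costs(n, t, p):
    """Return (automatic_cost, on_demand_cost) under the assumed model."""
    # Automatic: all n pointers are swizzled in bulk at half the per-pointer time.
    automatic = n * t / 2
    # On demand (assumption): only the expected p * n followed pointers
    # are swizzled, each at the full per-pointer time t.
    on_demand = p * n * t
    return automatic, on_demand


def automatic_is_cheaper(n, t, p):
    """True when bulk swizzling costs strictly less than on-demand swizzling."""
    automatic, on_demand = swizzle_costs(n, t, p)
    return automatic < on_demand
```

Evaluating `automatic_is_cheaper` for several values of p shows where the two strategies cross over; the argument asked for in the problem should justify that crossover point algebraically.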
Problem 3
[10 points]
Consider a perfectly balanced kd-tree index over two dimensions
(e.g., salary and age). For a query in which only one of the two dimensions is specified (e.g.,
age = 35), prove that we wind up looking at about √n of the n leaves of the kd-tree
index to answer the query.
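
The claim can be checked numerically before proving it. The sketch below counts the leaves reached by such a query; it assumes that the tree has 2^depth leaves, that even levels split on the queried dimension (so only one child is followed), and that odd levels split on the other dimension (so both children are followed). The function name and the level convention are illustrative assumptions.

```python
def leaves_visited(depth, level=0):
    """Count leaves reached by a one-dimension query on a perfectly
    balanced two-dimensional kd-tree with 2**depth leaves."""
    if level == depth:
        return 1  # reached a leaf
    if level % 2 == 0:
        # Split on the queried dimension: the query value picks one child.
        return leaves_visited(depth, level + 1)
    # Split on the unspecified dimension: both children must be explored.
    return 2 * leaves_visited(depth, level + 1)
```

For example, `leaves_visited(10)` can be compared with the square root of 2^10. This is only a sanity check; the proof still has to argue why half of the levels double the number of paths.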
Model   Speed   RAM    Hard disk
1001    2.66    1024   250
1002    2.10    512    250
1003    1.42    512    80
1004    2.80    1024   250
1005    3.20    512    250
1006    3.20    1024   320
1007    2.20    1024   200
1008    2.20    2048   250
1009    2.00    1024   250
1010    2.80    2048   300
1011    1.86    2048   160
1012    2.80    1024   160
Table 1: Some PC’s and their characteristics
Problem 4
[10 points]
Place all the data of Table 1 into a kd-tree. Assume two records can fit in one block.
At each level, pick a separating value that divides the data as evenly as possible. For
an order of the splitting attributes choose:
1. [5 points] Speed, then RAM, alternating;
2. [5 points] Speed, then RAM, then hard-disk, alternating.
Problem 5
[10 points]
For the data of Table 1, show the bitmap indexes for the attributes:
1. [4 points] Speed;
2. [3 points] RAM;
3. [3 points] Hard disk.
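
As a reminder of what a bitmap index looks like, here is a minimal sketch that builds one bit-vector per distinct value of a column. The column `ages` is a hypothetical example, not data from Table 1, and the function name is an assumption made for illustration.

```python
def bitmap_index(values):
    """Map each distinct value to a bit-vector string with one bit per record.

    Bit i of the vector for value v is 1 exactly when record i has value v.
    """
    index = {}
    for v in sorted(set(values)):
        index[v] = "".join("1" if x == v else "0" for x in values)
    return index


# Hypothetical column with five records (not taken from Table 1).
ages = [30, 30, 40, 50, 40]
print(bitmap_index(ages))
# {30: '11000', 40: '00101', 50: '00010'}
```

The same construction applies to each attribute in the problem, with one bit-vector of length 12 per distinct value.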
Problem 6
[20 points]
Recall that when we assume that data in a relation R is accessed one
block at a time from disk, we write B(R) to denote the number of blocks necessary to hold all of the tuples of R. Consider two relations, R1(A, B) and R2(B, C), with
B(R1) = 100 and B(R2) = 500.
1. [5 points] If the memory buffer can hold 21 blocks (M = 21), what is the cost
of joining R1 and R2 using a block nested-loop join?
2. [5 points] If we wanted to join R1 and R2 using a block nested-loop join and
limit the cost to 1,100, what is the smallest value M can be?
3. [5 points] What is the cost of joining R1 and R2 using a simple sort-merge join?
4. [5 points] What is the cost of joining R1 and R2 using a hash-based join?
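
The following sketch collects commonly used textbook I/O cost models for the three join algorithms. These are assumptions, not necessarily the exact models expected in this course: the cost of writing the output is ignored, the block nested-loop join reads the outer relation in chunks of M − 1 blocks, the simple sort-merge join fully sorts each relation in two phases before merging, and the hash join partitions both relations once. The function names are hypothetical.

```python
import math


def block_nested_loop_cost(b_outer, b_inner, m):
    """Read the outer relation once in chunks of m - 1 blocks and
    scan the inner relation once per chunk."""
    chunks = math.ceil(b_outer / (m - 1))
    return b_outer + chunks * b_inner


def simple_sort_merge_cost(b_r, b_s):
    """Two-phase sort of each relation (4 I/Os per block: read and write,
    twice) plus one merging pass that reads each block once."""
    return 5 * (b_r + b_s)


def hash_join_cost(b_r, b_s):
    """Partition both relations (read and write each block), then read
    every block once more while joining matching partitions."""
    return 3 * (b_r + b_s)
```

With hypothetical sizes of 40 and 200 blocks and M = 11, `block_nested_loop_cost(40, 200, 11)` evaluates to 40 + 4 × 200 = 840. Choosing the smaller relation as the outer one keeps the number of chunks, and hence the number of inner scans, low.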
Problem 7
[10 points]
Consider the nested-loop join R ⋈ S for two relations R and S. If the larger relation,
R, is unclustered and S is clustered, provide an improved nested-loop join algorithm
that works better than T(R)B(S)/(M − 1).
Problem 8
[10 points]
Consider two relations R(x, y) and S(y, z) with B(R) = 1,000, B(S) = 500, and
M = 101. Assume that attribute y of relation R has two distinct values (y1 and y2)
and the values are evenly distributed in R. Similarly, attribute y of relation S has
the same two values (y1 and y2) and the values are evenly distributed in S. Suppose
that initially neither relation is sorted by attribute y. Compute the total number
(on average) of disk I/Os needed for the sort-merge join algorithm
to compute R ⋈ S.
Problem 9
[10 points]
Suppose B(R) = 10,000 and T(R) = 500,000. Let there be an index on R.a and let
V(R, a) = k for some number k. Give the cost of the range query σ_{C ≤ a AND a ≤ D}(R)
as a function of k under the following circumstances. You may assume that C and
D are constants such that k/10 of the values are in the range. You may neglect the
disk I/Os needed to access the index itself.
1. [4 points] The index is clustering;
2. [3 points] The index is not clustering;
3. [3 points] R is clustered, and the index is not used.