COP 5725 Advanced Database Systems
Spring 2016, Assignment 1
Instructor: Peixiang Zhao
TA: Esra Akbas
Due date: Monday, 02/29/2016, during class

Problem 1 [10 points]
A patient record consists of the following information:
  1. Fixed-length fields: date-of-birth, SSN, and patient ID;
  2. Variable-length fields: name, address, and patient history. Note that pointers are maintained within the record;
  3. Repeating fields: a series of cholesterol tests, each of which requires a (fixed-length) date plus an integer result for the test.
Draw the layout of patient records if
  1. [5 points] The repeating tests are kept within the record itself;
  2. [5 points] The tests are stored on a separate block, with pointers to them in the record.

Problem 2 [10 points]
Suppose we have n pointers that need to be swizzled, and swizzling one pointer takes time t on average. Suppose that if we swizzle all pointers automatically, we can perform the swizzling in half the time it would take to swizzle each separately. If the probability that a pointer in main memory will be followed at least once is p, for what values of p is it more efficient to swizzle automatically than on demand?

Problem 3 [10 points]
Consider a perfectly balanced kd-tree index over two dimensions (e.g., salary and age). For a query in which only one of the two dimensions is specified (e.g., age = 35), prove that we wind up looking at about $\sqrt{n}$ of the n leaves of the kd-tree index to answer the query.

  Model   Speed   RAM    Hard disk
  1001    2.66    1024   250
  1002    2.10     512   250
  1003    1.42     512    80
  1004    2.80    1024   250
  1005    3.20     512   250
  1006    3.20    1024   320
  1007    2.20    1024   200
  1008    2.20    2048   250
  1009    2.00    1024   250
  1010    2.80    2048   300
  1011    1.86    2048   160
  1012    2.80    1024   160

  Table 1: Some PCs and their characteristics

Problem 4 [10 points]
Place all the data of Table 1 into a kd-tree. Assume two records can fit in one block.
At each level, pick a separating value that divides the data as evenly as possible. For the order of the splitting attributes, choose:
  1. [5 points] Speed, then RAM, alternating;
  2. [5 points] Speed, then RAM, then hard disk, alternating.

Problem 5 [10 points]
For the data of Table 1, show the bitmap indexes for the attributes:
  1. [4 points] Speed;
  2. [3 points] RAM;
  3. [3 points] Hard disk.

Problem 6 [20 points]
Recall that when we assume data in a relation R is accessed one block at a time from disk, we use B(R) to denote the number of blocks needed to hold all of the tuples of R. Consider two relations, R1(A, B) and R2(B, C), with B(R1) = 100 and B(R2) = 500.
  1. [5 points] If the memory buffer can hold 21 blocks (M = 21), what is the cost of joining R1 and R2 using a block nested-loop join?
  2. [5 points] If we wanted to join R1 and R2 using a block nested-loop join and limit the cost to 1,100, what is the smallest value M can be?
  3. [5 points] What is the cost of joining R1 and R2 using a simple sort-merge join?
  4. [5 points] What is the cost of joining R1 and R2 using a hash-based join?

Problem 7 [10 points]
Consider the nested-loop join $R \bowtie S$ of two relations R and S. If the larger relation, R, is unclustered and S is clustered, provide an improved nested-loop join algorithm that performs better than T(R)B(S)/(M - 1).

Problem 8 [10 points]
Consider two relations R(x, y) and S(y, z) with B(R) = 1,000, B(S) = 500, and M = 101. Assume that attribute y of relation R has two distinct values (y1 and y2) and that the values are evenly distributed in R. Similarly, attribute y of relation S has the same two values (y1 and y2), and the values are evenly distributed in S. Suppose that initially neither relation is sorted by attribute y. Compute the total number (on average) of disk I/Os needed by the sort-merge join algorithm to compute $R \bowtie S$.
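The join problems above all count disk I/Os in terms of block counts and the buffer size M. As a rough illustration of that style of cost model, the sketch below evaluates a commonly used textbook estimate for a block nested-loop join: the outer relation is loaded M - 1 blocks at a time, and the inner relation is scanned once per chunk. The function name `bnl_join_cost` and the sample numbers are made up for illustration (they are not the values from the problems), and the course may use a slightly different formula.

```python
import math


def bnl_join_cost(b_outer: int, b_inner: int, m: int) -> int:
    """Estimated disk I/Os for a block nested-loop join.

    Assumed model (an illustration, not necessarily the course's formula):
    the outer relation is read in chunks of m - 1 blocks, and the inner
    relation is scanned in full once for every chunk.
    """
    if m < 2:
        raise ValueError("need at least 2 buffer blocks")
    chunks = math.ceil(b_outer / (m - 1))  # number of passes over the inner relation
    return b_outer + chunks * b_inner      # read outer once, inner once per chunk


# Hypothetical sizes: 20-block outer relation, 50-block inner relation, 11 buffers.
print(bnl_join_cost(b_outer=20, b_inner=50, m=11))
```

Swapping the roles of the two relations changes the estimate, which is why the choice of outer relation matters in this model.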

Problem 9 [10 points]
Suppose B(R) = 10,000 and T(R) = 500,000. Let there be an index on R.a and let V(R, a) = k for some number k. Give the cost of the range query $\sigma_{(C \le a) \text{ AND } (a \le D)}(R)$ as a function of k under the following circumstances. You may assume that C and D are constants such that k/10 of the values are in the range. You may neglect the disk I/O's needed to access the index itself.
  1. [4 points] The index is clustering;
  2. [3 points] The index is not clustering;
  3. [3 points] R is clustered, and the index is not used.
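The statistics B(R), T(R), and V(R, a) that parameterize Problem 9 can be illustrated on a tiny made-up relation. The sketch below is only meant to pin down the notation: the helper `relation_stats`, the sample rows, and the block capacity are all hypothetical, and the relation is assumed to be packed densely into blocks.

```python
import math


def relation_stats(rows, attr_index, tuples_per_block):
    """Return (B, T, V) for a list of tuples.

    T = number of tuples, B = number of blocks when tuples are packed
    densely (an assumption), V = number of distinct values of one attribute.
    """
    t = len(rows)
    b = math.ceil(t / tuples_per_block)
    v = len({row[attr_index] for row in rows})
    return b, t, v


# Hypothetical relation R(a, b) with five tuples, two tuples per block.
rows = [(1, "x"), (2, "y"), (2, "z"), (3, "x"), (3, "y")]
b, t, v = relation_stats(rows, attr_index=0, tuples_per_block=2)
print(b, t, v)  # 3 5 3
```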