COP 5725 Advanced Database Systems Fall 2021, Assignment 1 Instructor: Peixiang Zhao TA: Haopeng Zhang Due date: Friday, 11/05/2021, via Canvas Problem 1 SQL [10 points] We define a database for FSU libraries as follow, Books: (bookID INTEGER, title TEXT, genre TEXT, lib INTEGER REFERENCES Libraries, PRIMARY KEY (bookID)); Libraries: Checkouts: (book, day)). (libraryID INTEGER, name TEXT, PRIMARY KEY (libraryID)); (book INTEGER REFERENCES Books, day DATETIME, PRIMARY KEY 1. [3 points] Write a SQL query that returns the bookID and genre of each book that has ever been checked out; 2. [3 points] Write a SQL query that returns the bookID of the book that has been checked out the most times, and the corresponding checked-out count. We assume that each book was checked out a unique number of times; 3. [4 points] Write a SQL query that finds the name of all of the pairs of libraries that have books with matching titles. In the output, report the names of both libraries (the first name is alphabetically less than the second) and the matching titles of the books. Problem 2 Storage [10 points] A patient reord consists of the following information: 1. Fixed-length fields: date-of-birth, SSN, and patient ID; 2. Variable-length fields: name, address and patient history. Note pointers are maintained within the record; 3. Repeating fields: a series of cholesterol tests, each of which requires a (fixedlength) date plus an integer result for the test. COP 5725: Advanced Database Systems (a) A = [1, 2] Fall 2021 (b) A = [1, 2, 3] Figure 1: Two examples of kd-tree for A = [1, 2] and A = [1, 2, 3], respectively Draw the layout of patient records if 1. [5 points] The repeating tests are kept within the record itself; 2. [5 points] The tests are stored on a separate block, with pointers to them in the record. Problem 3 kd-Tree [10 points] Given a kd-tree index that is perfectly balanced, the index concerns two dimensions (e.g., salary and age). For a query only one of√the two dimensions is specified (e.g., age = 35), prove we wind up looking at about n out of the n leaves from the kd-tree index to answer the query. Problem 4 kd-Tree [25 points] In the class, we learned that kd-tree (k-dimensional tree) is primarily for multidimensional search. We consider in this question a degraded kd-tree that is used to index one-dimensional data (where k = 1). First of all, given an array A of n values A = [a1 , a2 , . . . , an ]. We first build a kd-tree T for A, where leaves of T store the data points of A, and the internal nodes of T store the splitting values (i.e. medians) to guide the search. Let v denote the value stored at each split node t. The left subtree of t contains all data points smaller than or equal to v, and the right subtree of t contains all the data points strictly greater than v. The splitting value v is the median of data points at t: If there are an even number x of data points at t, the median is the data point at rank x/2; Otherwise, the median is the data point at rank ⌈x/2⌉. See Figure 1 for some toy examples. 1. [10 points] Given A = [63, 60, 110, 23, 81, 38, 50, 10, 5, 71, 30, 100, 90, 20], construct and draw the kd-tree for A. What is the time complexity of kd-tree construction for a one-dimensional array A of size n? 2. [15 points] Given a range query [l, r], where l ≤ r, write an efficient algorithm to report all the data points x in the one-dimensional kd-tree with n data points, where l ≤ x ≤ r. What is the time complexity of the algorithm? Assignment 1 Page 2 COP 5725: Advanced Database Systems Model 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 Speed 2.66 2.10 1.42 2.80 3.20 3.20 2.20 2.20 2.00 2.80 1.86 2.80 RAM 1024 512 512 1024 512 1024 1024 2048 1024 2048 2048 1024 Fall 2021 Hard disk 250 250 80 250 250 320 200 250 250 300 160 160 Table 1: Some PC’s and their characteristics Problem 5 Quad-Tree [10 points] Place all the data of Table 1 into a quad tree with dimensions speed and ram. Assume that the range for speed is 1.00 to 5.00, and for ram it is 500 to 3, 500. No leaf of the quad tree should have more than two points (Hint: you may chose the format of Figure 14.43 in the text book). Problem 6 Query Processing [15 points] Recall that when we make the assumption that data in a relation R is accessed one block at a time from disk, then we say B(R) to denote the number of blocks necessary to hold all of the tuples of R. Consider two relations, R1 (A, B) and R2 (B, C). B(R1 ) = 100 and B(R2 ) = 500 1. [5 points] If the memory buffer can hold 21 blocks (M = 21), what is the cost of joining R1 and R2 using a block nested-loop join? 2. [5 points] If we wanted to join R1 and R2 using a block nested-loop join and limit the cost to 1, 100, what is the smallest value M can be? 3. [3 points] What is the cost of joining R1 and R2 using a simple sort-merge join? 4. [2 points] What is the cost of joining R1 and R2 using a hash-based join? Problem 7 Query Processing [10 points] Consider the nested loop join R ▷◁ S for two relations R and S, if the larger relation, R, is unclustered, and S is clustered, provide an improved nested loop join algorithm that works better than T (R)B(S)/(M − 1). Assignment 1 Page 3 COP 5725: Advanced Database Systems Problem 8 Fall 2021 Query Processing [10 points] Suppose B(R) = 10, 000 and T (R) = 500, 000. Let there be an index on R.a and let V (R, a) = k for some number k. Give the cost of the range query σ(C≤a)AND(a≤D) (R) as a function of k under the following circumstances. You may assume that C and D are constants such that k/10 of the values are in the range. You may neglect the disk I/O’s needed to access the index itself. 1. [4 points] The index is clustered; 2. [3 points] The index is not clustered; 3. [3 points] R is clustered, and the index is not used. Assignment 1 Page 4