Uploaded by DBMS Fall 2021

HW1

advertisement
COP 5725 Advanced Database Systems
Fall 2021, Assignment 1
Instructor: Peixiang Zhao
TA: Haopeng Zhang
Due date: Friday, 11/05/2021, via Canvas
Problem 1
SQL [10 points]
We define a database for FSU libraries as follow,
Books: (bookID INTEGER, title TEXT, genre TEXT, lib INTEGER REFERENCES
Libraries, PRIMARY KEY (bookID));
Libraries:
Checkouts:
(book, day)).
(libraryID INTEGER, name TEXT, PRIMARY KEY (libraryID));
(book INTEGER REFERENCES Books, day DATETIME, PRIMARY KEY
1. [3 points] Write a SQL query that returns the bookID and genre of each book
that has ever been checked out;
2. [3 points] Write a SQL query that returns the bookID of the book that has
been checked out the most times, and the corresponding checked-out count.
We assume that each book was checked out a unique number of times;
3. [4 points] Write a SQL query that finds the name of all of the pairs of libraries
that have books with matching titles. In the output, report the names of both
libraries (the first name is alphabetically less than the second) and the matching
titles of the books.
Problem 2
Storage [10 points]
A patient reord consists of the following information:
1. Fixed-length fields: date-of-birth, SSN, and patient ID;
2. Variable-length fields: name, address and patient history. Note pointers are
maintained within the record;
3. Repeating fields: a series of cholesterol tests, each of which requires a (fixedlength) date plus an integer result for the test.
COP 5725: Advanced Database Systems
(a) A = [1, 2]
Fall 2021
(b) A = [1, 2, 3]
Figure 1: Two examples of kd-tree for A = [1, 2] and A = [1, 2, 3], respectively
Draw the layout of patient records if
1. [5 points] The repeating tests are kept within the record itself;
2. [5 points] The tests are stored on a separate block, with pointers to them in the
record.
Problem 3
kd-Tree [10 points]
Given a kd-tree index that is perfectly balanced, the index concerns two dimensions
(e.g., salary and age). For a query only one of√the two dimensions is specified (e.g.,
age = 35), prove we wind up looking at about n out of the n leaves from the kd-tree
index to answer the query.
Problem 4
kd-Tree [25 points]
In the class, we learned that kd-tree (k-dimensional tree) is primarily for multidimensional search. We consider in this question a degraded kd-tree that is used
to index one-dimensional data (where k = 1). First of all, given an array A of n
values A = [a1 , a2 , . . . , an ]. We first build a kd-tree T for A, where leaves of T store
the data points of A, and the internal nodes of T store the splitting values (i.e. medians) to guide the search. Let v denote the value stored at each split node t. The
left subtree of t contains all data points smaller than or equal to v, and the right
subtree of t contains all the data points strictly greater than v. The splitting value v
is the median of data points at t: If there are an even number x of data points at t,
the median is the data point at rank x/2; Otherwise, the median is the data point at
rank ⌈x/2⌉. See Figure 1 for some toy examples.
1. [10 points] Given A = [63, 60, 110, 23, 81, 38, 50, 10, 5, 71, 30, 100, 90, 20], construct and draw the kd-tree for A. What is the time complexity of kd-tree
construction for a one-dimensional array A of size n?
2. [15 points] Given a range query [l, r], where l ≤ r, write an efficient algorithm to
report all the data points x in the one-dimensional kd-tree with n data points,
where l ≤ x ≤ r. What is the time complexity of the algorithm?
Assignment 1
Page 2
COP 5725: Advanced Database Systems
Model
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
Speed
2.66
2.10
1.42
2.80
3.20
3.20
2.20
2.20
2.00
2.80
1.86
2.80
RAM
1024
512
512
1024
512
1024
1024
2048
1024
2048
2048
1024
Fall 2021
Hard disk
250
250
80
250
250
320
200
250
250
300
160
160
Table 1: Some PC’s and their characteristics
Problem 5
Quad-Tree [10 points]
Place all the data of Table 1 into a quad tree with dimensions speed and ram. Assume
that the range for speed is 1.00 to 5.00, and for ram it is 500 to 3, 500. No leaf of
the quad tree should have more than two points (Hint: you may chose the format of
Figure 14.43 in the text book).
Problem 6
Query Processing [15 points]
Recall that when we make the assumption that data in a relation R is accessed one
block at a time from disk, then we say B(R) to denote the number of blocks necessary to hold all of the tuples of R. Consider two relations, R1 (A, B) and R2 (B, C).
B(R1 ) = 100 and B(R2 ) = 500
1. [5 points] If the memory buffer can hold 21 blocks (M = 21), what is the cost
of joining R1 and R2 using a block nested-loop join?
2. [5 points] If we wanted to join R1 and R2 using a block nested-loop join and
limit the cost to 1, 100, what is the smallest value M can be?
3. [3 points] What is the cost of joining R1 and R2 using a simple sort-merge join?
4. [2 points] What is the cost of joining R1 and R2 using a hash-based join?
Problem 7
Query Processing [10 points]
Consider the nested loop join R ▷◁ S for two relations R and S, if the larger relation,
R, is unclustered, and S is clustered, provide an improved nested loop join algorithm
that works better than T (R)B(S)/(M − 1).
Assignment 1
Page 3
COP 5725: Advanced Database Systems
Problem 8
Fall 2021
Query Processing [10 points]
Suppose B(R) = 10, 000 and T (R) = 500, 000. Let there be an index on R.a and let
V (R, a) = k for some number k. Give the cost of the range query σ(C≤a)AND(a≤D) (R)
as a function of k under the following circumstances. You may assume that C and
D are constants such that k/10 of the values are in the range. You may neglect the
disk I/O’s needed to access the index itself.
1. [4 points] The index is clustered;
2. [3 points] The index is not clustered;
3. [3 points] R is clustered, and the index is not used.
Assignment 1
Page 4
Download