CS4320 Fall 2017 Final Exam

CS4320 Exam
December 7th, 2017 (150 minutes working time)

Name: __________________________________

Cornell NETID: _________

I understand and will adhere to the Cornell Code of Academic Integrity.

---------------------------------------------------------------
Signature

Maximum number of points possible: 120. This exam counts for 30% of your overall grade. Questions vary in difficulty. Do not get stuck on one question. In all problems, whenever you think a problem is underspecified, make assumptions and clearly state them. Good luck!

You have 150 minutes working time for this exam.

Part A) SQL Queries. (20 points)

Consider the database created by the following SQL commands. This database, used as an example throughout the lecture, stores information on sailors, boats, and reservations for boats by sailors.

    CREATE TABLE sailors(sid INTEGER PRIMARY KEY, name VARCHAR(100));
    CREATE TABLE boats(bid INTEGER PRIMARY KEY, color VARCHAR(5));
    CREATE TABLE reserves(sid INTEGER, bid INTEGER,
        PRIMARY KEY(sid, bid),
        FOREIGN KEY (sid) REFERENCES sailors(sid),
        FOREIGN KEY (bid) REFERENCES boats(bid));

In the following, we ask you to find equivalent reformulations for given SQL queries (i.e., to find another SQL query that produces exactly the same result for each possible database instance that is consistent with all constraints implied by the schema definition).

A.1) Reformulate the following SQL query without using the AND keyword: (5 points)

    SELECT * FROM boats WHERE (bid <> 1 AND bid <> 2);

    SELECT * FROM boats WHERE NOT (bid = 1 OR bid = 2);

A.2) Reformulate the following SQL query without using the sailors table in any FROM clause: (5 points)

    SELECT S.sid FROM sailors S
    WHERE EXISTS (SELECT * FROM reserves WHERE sid = S.sid);

    SELECT DISTINCT sid FROM reserves;

A.3) Reformulate the following SQL query without using the ISNULL keyword: (5 points)

    SELECT S.sid FROM sailors S LEFT OUTER JOIN reserves R ON (S.sid = R.sid)
    WHERE R.sid ISNULL;

    SELECT S.sid FROM sailors S
    WHERE NOT EXISTS (SELECT * FROM reserves R WHERE R.sid = S.sid);

A.4) Reformulate the following SQL query without using the HAVING keyword: (5 points)

    SELECT sid AS sailor, COUNT(*) AS count FROM reserves
    GROUP BY sid HAVING COUNT(*) > 2;

    SELECT * FROM (SELECT sid AS sailor, COUNT(*) AS count FROM reserves
                   GROUP BY sid) AS S
    WHERE count > 2;

Part B) Relational Operators. (15 points)

The following questions refer to two relations, R and S. R contains 100,000 tuples with 1,000 tuples per disc page. S contains 50,000 tuples with 50 tuples per disc page.

B.1) We join R and S by an index nested loops join. A suitable index is defined on S; we assume for simplicity that each index access requires exactly one disc page read. Calculate the number of disc accesses for the join (do not consider cost for writing out the join result). (5 points)

We need to read the relation without index, R, which costs 100,000 / 1,000 = 100 disc reads. Then, for each tuple in R, we access the index once: 100,000 disc reads. In total, we have 100,100 disc reads.

B.2) We join R and S by a block nested loops join. Assume that 102 buffer pages are available. Choose the outer relation (and block size) leading to minimal join cost and calculate the number of disc reads required for the join (do not count cost for writing out the join result). (5 points)

Reserving one buffer page for the output and one as input buffer for the inner relation, 100 buffer pages remain to store blocks of the outer operand. We choose the smaller relation, R with 100,000 / 1,000 = 100 pages versus S with 50,000 / 50 = 1,000 pages, as outer operand. Then, the join cost is 100 + 1 * 1,000 = 1,100 disc reads.
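The arithmetic in B.1 and B.2 can be reproduced with a short Python sketch. The function names are hypothetical, and the cost formulas assume exactly what the answers above state: one page read per index probe, and two buffer pages reserved for output and for the inner input.

```python
import math

# Figures taken from the problem statement.
R_TUPLES, R_TUPLES_PER_PAGE = 100_000, 1_000
S_TUPLES, S_TUPLES_PER_PAGE = 50_000, 50


def pages(tuples, tuples_per_page):
    """Number of disc pages occupied by a relation."""
    return math.ceil(tuples / tuples_per_page)


def index_nested_loops_cost(outer_tuples, outer_pages, reads_per_probe=1):
    """Scan the outer relation once, then probe the index once per outer tuple."""
    return outer_pages + outer_tuples * reads_per_probe


def block_nested_loops_cost(outer_pages, inner_pages, buffer_pages):
    """Read the outer once; scan the inner once per block of the outer."""
    block_size = buffer_pages - 2  # one page for output, one for inner input
    num_blocks = math.ceil(outer_pages / block_size)
    return outer_pages + num_blocks * inner_pages


r_pages = pages(R_TUPLES, R_TUPLES_PER_PAGE)  # 100 pages
s_pages = pages(S_TUPLES, S_TUPLES_PER_PAGE)  # 1,000 pages

inl_cost = index_nested_loops_cost(R_TUPLES, r_pages)      # B.1
bnl_r_outer = block_nested_loops_cost(r_pages, s_pages, 102)  # B.2, R as outer
bnl_s_outer = block_nested_loops_cost(s_pages, r_pages, 102)  # alternative, S as outer

print(inl_cost, bnl_r_outer, bnl_s_outer)
```

With S as the outer relation, the same formula gives 1,000 + 10 * 100 = 2,000 disc reads, which is why R is the better choice of outer operand.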

B.3) We join R and S by the hash join seen in the lecture (we assume that only one partitioning pass is necessary). Calculate the number of times that a hash function is evaluated in order to perform the join. (5 points)

During the partitioning phase, a hash function is evaluated for each input tuple from both relations. During the next phase, we use a second hash function on each input tuple to match tuples in the same partition. In total, we have 2 * (100,000 + 50,000) = 300,000 evaluations.

Part C) Schema Design and Normalization. (20 points)

C.1) Draw an ER diagram capturing the following scenario. There are two types of entities, employees and stores. Employees are characterized by two attributes, their name and their employee ID (the employee ID is unique). Stores are characterized by their store ID (which is unique). Employees may manage other employees and each employee has at most one manager. Employees work at stores and each employee is assigned to at least one store. Make sure that all entity types, relationships, attributes, and constraints implied by the description are also represented in your ER diagram. (10 points)

[ER diagram: entity Employee with attributes ID and name; relationship Manages on Employee with roles Manager and Subordinate; relationship Works at between Employee and Store; entity Store with attribute store_id.]

C.2) Consider a relation schema R with seven attributes: ABCDEFG. Attribute A is a key of the relation. The following functional dependencies hold in addition: BC → A, DE → F, and B → G. Decompose R (via lossless-join decomposition) by the method seen in the lecture until Boyce-Codd Normal Form (BCNF) is reached. Justify for each single decomposition step why it is required (by pointing out why the current schema is not in BCNF yet). Justify for the final result why it is in BCNF. (10 points)

ABCDEFG is not in BCNF due to B → G, as B is not a key (and the dependency is not trivial either). Decompose ABCDEFG into two relations: ABCDEF and BG.
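The BCNF test behind this step can be sketched in Python as an attribute-closure computation. This is a minimal sketch, not the lecture's procedure: the helper names are hypothetical, and it only tests the dependencies listed in the problem (plus A → BCDEFG, which encodes that A is a key) rather than every implied dependency.

```python
def closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(schema, fds):
    """List the given dependencies that violate BCNF within schema.

    Simplification: only the explicitly listed dependencies are tested.
    """
    schema = set(schema)
    violations = []
    for lhs, rhs in fds:
        if not set(lhs) <= schema:
            continue  # left-hand side is not fully contained in this relation
        rhs_here = (set(rhs) & schema) - set(lhs)
        if not rhs_here:
            continue  # trivial, or the right-hand side lies outside this relation
        if not schema <= closure(lhs, fds):
            violations.append((lhs, "".join(sorted(rhs_here))))
    return violations


# Dependencies from the problem; "A is a key" is encoded as A -> BCDEFG.
FDS = [("A", "BCDEFG"), ("BC", "A"), ("DE", "F"), ("B", "G")]

print(bcnf_violations("ABCDEFG", FDS))  # B -> G and DE -> F both violate BCNF
print(bcnf_violations("ABCDEF", FDS))   # DE -> F still violates BCNF
print(bcnf_violations("BG", FDS))       # no violation: B is a key of BG
```

Since the closure of B is only {B, G}, B is not a superkey of ABCDEFG, which is the justification given for the split.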

ABCDEF is not in BCNF due to DE → F, as DE is not a key (and the dependency is not trivial either). Decompose ABCDEF into ABCDE and DEF.

The result (i.e., relations ABCDE, DEF, and BG) is in BCNF since BC, DE, and B are keys in their respective relations.

Part D) Concurrency Control. (25 points)

In the following, we ask you to write out schedules with certain properties. Use the notation seen in the lecture (i.e., WT(A) means transaction number T writes object A, RT(A) means transaction T reads object A, CT means transaction T commits, AT means transaction T aborts).

D.1) Propose a schedule involving two transactions that is not conflict-serializable. (4 points)

W1(A) R2(A) W1(A)

D.2) Propose a schedule involving two transactions that exposes the unrepeatable read anomaly. (4 points)

R1(A) W2(A) C2 R1(A)

D.3) Transform the following schedule into an equivalent serial schedule (i.e., a serial schedule containing the same operations that has the same conflict graph): (5 points)

R3(A) R2(A) R3(C) W3(A) R1(C) W2(B) W1(B) W1(C)

R2(A) W2(B) R3(A) R3(C) W3(A) R1(C) W1(B) W1(C)

D.4) Name one advantage of conservative two-phase locking compared to non-conservative two-phase locking (i.e., name one reason why conservative two-phase locking may lead to better performance). (4 points)

Conservative two-phase locking avoids deadlocks, which can be costly.

D.5) What is the “wait-die” policy? Explain in less than five sentences its purpose and how it works (i.e., explain what happens in different cases). (4 points)

This policy is used for avoiding deadlocks. If a transaction with higher priority requests a lock held by a transaction with lower priority, then the former transaction waits. If a transaction with lower priority requests a lock held by a transaction with higher priority, then the former transaction aborts.
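The two cases of the wait-die rule in D.5 can be written as a small decision function. This is a sketch under one assumption not stated in the answer: priority is derived from start timestamps, with a smaller timestamp (older transaction) meaning higher priority. The names `Action` and `wait_die` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    GRANT = "grant"
    WAIT = "wait"
    ABORT = "abort"


def wait_die(requester_ts, holder_ts):
    """Decide what a lock requester does under the wait-die policy.

    Assumption: a smaller timestamp means an older transaction with
    higher priority, and the two transactions have distinct timestamps.
    holder_ts is None when nobody holds the lock.
    """
    if holder_ts is None:
        return Action.GRANT   # lock is free
    if requester_ts < holder_ts:
        return Action.WAIT    # higher-priority requester waits
    return Action.ABORT       # lower-priority requester "dies"


print(wait_die(1, 2))     # older transaction requests lock held by younger
print(wait_die(2, 1))     # younger transaction requests lock held by older
print(wait_die(5, None))  # lock is free
```

Because a transaction only ever waits for one with lower priority, no cycle of waiting transactions can form, which is how the policy rules out deadlocks.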

D.6) During the validation of a transaction Tj in optimistic concurrency control, we consider an earlier transaction Ti that finished before Tj started its write phase. Under which condition on Ti and Tj will Tj need to be aborted? (4 points)

We need to abort Tj if the write set of Ti (i.e., set of objects written by Ti) overlaps with the read set of Tj (i.e., set of objects read by Tj).

Part E) Logging and Recovery with ARIES. (20 points)

Consider the following (simplified) log entries:

    LSN   Log entry
    0     T2 Updates P2
    5     T1 Updates P3
    10    begin_checkpoint
    15    end_checkpoint
    20    T1 Abort
    25    T3 Updates P7
    30    CLR: Undo T1 LSN 5
    35    T3 Updates P2
    40    T3 Commit

We assume that those are the last log entries before a system crash. The ARIES algorithm is used for recovery, starting from the checkpoint shown. At the checkpoint, the dirty page table contains only page P2 with recLSN=0 (i.e., time at which page became dirty) and the transaction table contains transaction T2 with lastLSN=0 (i.e., last log entry by transaction) and T1 with lastLSN=5. Both transactions are active (i.e., neither committed nor aborted) at the checkpoint.

E.1) The log entries do not give any information on when data pages are written back to hard disc (e.g., due to “page stealing”). Based on all available information, point out one page that must have been written back to hard disc and justify in at most two sentences. (5 points)

Page P3 does not appear in the dirty page table at the checkpoint despite the update at LSN 5. Hence, it was written back to disc (and taken out of the dirty page table) before the checkpoint.

E.2) Fill in the following table, representing the state of the dirty page table after the analysis phase is completed (you may use less than the available number of rows). (5 points)

    Page   recLSN
    P2     0
    P7     25
    P3     30

E.3) Which compensation log records are written during the undo phase?
Specify those entries in the format used above (i.e., CLR: Undo TX LSN Y for transaction X and log entry number Y). (5 points)

CLR: Undo T2 LSN 0

E.4) The ARIES algorithm requires writing parts of the log to stable storage under certain conditions (write-ahead logging). Which log entries in the log above must have caused such a “log flush”? Justify in at most two sentences. (5 points)

LSN 40 causes a log flush due to the transaction commit. (P3 is written back to disc between LSN 5 and 15, which also causes a log flush.)

Part F) Distributed DBMS, MapReduce. (20 points)

F.1) We join two relations located at two different sites, connected via a network. Explain in less than four sentences why a bloom-join might be faster than an approach based on a semi-join. (5 points)

The semi-join approach ships a projection on the join column from one site to the other, while the bloom-join only ships a bit vector. The bit vector is typically smaller and therefore faster to send.

F.2) A distributed DBMS stores N replicas of a data set, R designates the number of replicas accessed for each read operation, and W the number of replicas that need to be updated for a successful write. What inequality on N, R, and W must hold to guarantee strong consistency? (5 points)

We must have R + W > N.

F.3) What is the two-phase commit protocol? Explain briefly, in at most three sentences, what happens in each of the two phases. (5 points)

It’s a consensus protocol for distributed transaction processing. In the first phase (voting phase), the coordinator queries the subordinates to find out whether they are able to commit the current transaction. In the second phase (termination phase), the coordinator sends the final decision (whether to commit or to abort) to all subordinates.

F.4) What are “stragglers”? Describe, in at most three sentences, one strategy by which typical MapReduce implementations try to minimize their negative impact.
(5 points)

Stragglers are map or reduce tasks that take unusually long. To limit their impact, the MapReduce framework typically creates multiple task instances (“backup tasks”) and proceeds as soon as the first task instance completes.

CS4320 Final Exam

This page will be used for grading your exam. Do not write anything on this page.

    SECTION                    QUESTION                SCORE    SECTION TOTAL
    Part A (Max: 20 points)    A.1 (Max: 5 points)
                               A.2 (Max: 5 points)
                               A.3 (Max: 5 points)
                               A.4 (Max: 5 points)
    Part B (Max: 15 points)    B.1 (Max: 5 points)
                               B.2 (Max: 5 points)
                               B.3 (Max: 5 points)
    Part C (Max: 20 points)    C.1 (Max: 10 points)
                               C.2 (Max: 10 points)
    Part D (Max: 25 points)    D.1 (Max: 4 points)
                               D.2 (Max: 4 points)
                               D.3 (Max: 5 points)
                               D.4 (Max: 4 points)
                               D.5 (Max: 4 points)
                               D.6 (Max: 4 points)
    Part E (Max: 20 points)    E.1 (Max: 5 points)
                               E.2 (Max: 5 points)
                               E.3 (Max: 5 points)
                               E.4 (Max: 5 points)
    Part F (Max: 20 points)    F.1 (Max: 5 points)
                               F.2 (Max: 5 points)
                               F.3 (Max: 5 points)
                               F.4 (Max: 5 points)
    Total (Max: 120 points)