German University in Cairo
Faculty of Media Engineering and Technology

Exam Solution
CSEN 604: Databases II (MET – CSEN)
Spring 2015 Semester
Dr. Wael Abouelsaadat

Date: May 23rd, 2015
Duration: 3 hours

Do not turn this page until you have received the signal to start. In the meantime, read the instructions below carefully.

This exam consists of 6 questions (numbered 1 to 6) on 14 pages (including this one and an aid sheet on the last page), printed on one side of the paper. When you receive the signal to start, please make sure that your copy of the examination is complete. Answer each question directly on the examination paper, in the space provided, and use the reverse side of the page for rough work. If you need more space for one of your solutions, use the reverse side of the page and indicate clearly the part of your work that should be marked.

1. _________________ / 25 (Index Structures)
2. _________________ / 12 (Result Size Estimation)
3. _________________ / 16 (I/O Cost Estimation)
4. _________________ / 16 (Concurrency Control)
5. _________________ / 20 (Logs & Recovery)
6. _________________ / 11 (SQL Transaction Modes)
TOTAL _________________ / 100

Question 1. Index Structures [25 marks total]

a) [4 marks] Suppose we store a relation R(x,y) in a grid file. Both attributes have a range of values from 0 to 1000. The partitions of this grid file happen to be uniformly spaced: for x there are partitions every 20 units, at 20, 40, 60, and so on, while for y the partitions are every 50 units, at 50, 100, 150, and so on. How many buckets do we have to examine to answer the range query:

SELECT * FROM R WHERE 310 < x AND x < 400 AND 520 < y AND y < 730;

Solution & marking: 25 (the query touches 5 x-intervals, from 300 to 400, and 5 y-intervals, from 500 to 750; 5 × 5 = 25) [Either gets it or none]

b) [6 marks] Find any/all violations of the B+ tree structure in the following diagram. Circle each bad node and give a brief explanation of each error.
Assume the order of the tree is 4 (n=4; 4 keys, 5 pointers).

Solution & marking:
- [1 mark] Interior node 10: below the minimum key requirement.
- [1 mark] Interior node 13,20: key 20 duplicated from root.
- [1 mark] Leaf node 20,21: both keys are not less than root.
- [1 mark] Leaf node 22,23: key 22 not less than parent key 22.
- [2 marks] Leaf nodes 8,9 and 6,7: swapped positions; leaf key 6 is less than parent key 8.

c) [15 marks] Consider the extensible hashing index shown below. To insert an entry, its value is converted to binary and the n rightmost (least significant) bits are used. For example, 36 is binary 100100 while 51 is binary 110011.

i. [2 marks] Is it possible to identify the last entry that was inserted into the index? If yes, specify the entry.

Solution & marking: [2 marks] No, it could be any one of the data entries in the index. [Justification is not required since I did not ask to justify why not] We can always find a sequence of insertions and deletions with a particular key value, among the key values shown in the index, as the last insertion.

ii. [2 marks] Which entry is guaranteed not to be the last one inserted?

Solution & marking: 10 [Either gets it or none]. [Justification not required since I did not ask to justify]

iii. [2 marks] Suppose you are told that there have been no deletions from this index so far. Which buckets were last split?

Solution: The last insertion which caused a split cannot be in Bucket C. Buckets B and C, or C and D, could have made a possible bucket-split combination, but the total number of data entries in these combinations is 4, and the absence of deletions demands a sum of at least 5 data entries for such combinations. Buckets B and D can form a possible bucket-split combination because they have a total of 6 data entries between themselves. So do A and E. But for B and D to be split images, the starting global depth should have been 1. If the starting global depth is 2, then the last insertion causing a split would be in A or E.
[Either gets it or none]. [Justification not required since I did not ask to justify]

iv. [3 marks] Show the index after inserting an entry with hash value 68 (1000100).

Solution: [In case of error, partial marks should be given]

v. [3 marks] Show the index after inserting entries with hash values 17 (10001) and 69 (1000101) into the original index.

Solution: [In case of error, partial marks should be given]

vi. [3 marks] Show the index after deleting the entry with hash value 10 from the original index. Is a merge of buckets triggered by this deletion? If not, explain why.

Solution: [In case of error, partial marks should be given]

Question 2. Result Size Estimation [12 marks total]

Consider the following tables:
- table student with attributes ID, name, major, credits
- table course with attributes title, instructor, credits
- table registered with attributes student and course
- Attribute student of relation registered is a foreign key to attribute ID of relation student.
- Attribute course of relation registered is a foreign key to attribute title of relation course.

Given are the following statistics:
T(student) = 30,000
V(student, ID) = 30,000
V(student, name) = 29,500
V(student, major) = 20
T(course) = 80
V(course, title) = 80
V(course, instructor) = 50
V(course, credits) = 6
T(registered) = 10,000
V(registered, student) = 3,000
V(registered, course) = 30
V(student, credits) = 32

The min and max values for some of the columns are:
min(course, credits) = 0
max(course, credits) = 36
min(student, credits) = 0
max(student, credits) = 36

a) [2 marks] Estimate the number of result tuples for the query
q = σ_{major=CS}(student)

Solution:

b) [3 marks] Estimate the number of result tuples for the following query of ORed terms
q = σ_{major=CS ∨ major=Bio}(student)

Solution:

c) [3 marks] Estimate the number of result tuples for the following query with ANDed terms
q = σ_{credits≥32 ∧ credits≤34}(student)
Your solution must take into consideration the given min and max values for credits.
Solution:

d) [4 marks] Estimate the number of result tuples for the following join query
q = student ⋈_{ID=student} registered ⋈_{course=title} course

Solution:

Question 3. I/O Cost Estimation [16 marks total]

a) [9 marks] Consider two relations R and S with B(R) = 3,500 and B(S) = 2,300. You have M = 101 memory pages available. Compute the number of I/O operations for each join method below.

i) [3 marks] Block nested-loop join

Solution:

ii) [3 marks] Merge-join (inputs not sorted)

Solution: [We will also accept multiplying by 5, as in the lecture slides, as a correct answer, since these are two alternative estimation techniques]

iii) [3 marks] Hash-join

Solution:

b) [7 marks] Assume you have a database with the following relations:
- Customers(CustID, Name, Age, Gender), stored on 100,000 disk pages (aka blocks)
- Purchases(CustID, Product, Date, Location, Amount), stored on 2,000,000 disk pages
- SalesCalls(CustID, Salesperson, Date, Result), stored on 300,000 disk pages

From the database query log, you have observed the following query mix on this database:
- 10% queries selecting on Customers.CustID
- 30% queries selecting on Customers.Name
- 35% queries selecting on Purchases.Product
- 10% queries selecting on SalesCalls.SalesPerson
- 15% queries selecting on SalesCalls.Date

You want to create indices over these relations to speed up queries, and you have enough resources to build these indices. You may assume that an index allows you to retrieve the answer to a query at significantly less cost than a table scan. You can also assume that the savings obtained by building an index on an attribute are proportional to the number of pages in the relation multiplied by the number of queries using that index.

Which two attributes are best to build indices on (because they will achieve the best performance enhancement)? Justify your answer.

Solution: Purchases.Product and SalesCalls.Date. Purchases is the largest relation and Purchases.Product is the most queried attribute.
For the second index, we have to choose between Customers.Name and SalesCalls.Date. Of these two, SalesCalls.Date is more useful, since the savings obtained by building an index on an attribute are proportional to the number of pages in the corresponding relation multiplied by the number of queries using that index: 0.15 × 300,000 = 45,000 > 0.30 × 100,000 = 30,000. [each 3.5; 1.5 marks for choice and 2 marks for justification]

Question 4. Concurrency Control [16 marks total]

a) [3 marks] Consider a database that is read-only (i.e., no transactions change any data in the database). Suppose serializability needs to be supported. Place a check mark in front of each correct statement:
__T__ No locking is necessary.
_____ Only read locks are necessary and they need to be held until end of transaction.
_____ Only read locks are necessary but they can be released as soon as the read is complete.
_____ Both read and write locks are necessary and locking must be done in two phases.
_____ None of the above.

b) [4 marks] In the schedule given below, the label Ri(X) indicates a read of element X by transaction Ti, and Wi(X) indicates a write of element X by transaction Ti. Draw the precedence graph for the schedule below. Is the schedule conflict-serializable? If so, what is the order of the three transactions if run serially?

R2(A) R1(C) R2(B) R3(B) W2(B) R1(A) R3(C) W3(C) W1(A)

Solution: The schedule is not conflict-serializable because the precedence graph has a cycle.

c) [1 mark] In the case of 3 transactions T1, T2, T3, list all possible serial schedules.

Solution:
T1,T2,T3
T2,T3,T1
T1,T3,T2
T3,T1,T2
T2,T1,T3
T3,T2,T1

d) [3 marks] Justify why running any one of the serial schedules in your answer to c) is valid despite the fact that each schedule might result in a different database state.

Solution: Each serial schedule will leave the database in a new consistent state. It does not matter which transaction is executed first; as long as the schedule is serial, the result is acceptable.
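The precedence-graph test used in part b) can be sketched in Python as follows. This is a minimal illustration, not part of the marking scheme: the function names `parse`, `precedence_edges` and `has_cycle` are hypothetical, and the sketch assumes the standard conflict rule (two actions of different transactions on the same element conflict if at least one of them is a write).

```python
import re
from itertools import combinations

# Schedule from Question 4 b), written exactly as in the exam text.
SCHEDULE = "R2(A) R1(C) R2(B) R3(B) W2(B) R1(A) R3(C) W3(C) W1(A)"


def parse(schedule):
    """Turn 'R2(A) W1(B)' into [('R', 2, 'A'), ('W', 1, 'B')]."""
    return [(op, int(txn), item)
            for op, txn, item in re.findall(r"([RW])(\d+)\((\w+)\)", schedule)]


def precedence_edges(actions):
    """Return the set of edges (Ti, Tj): an action of Ti precedes a
    conflicting action of Tj in the schedule."""
    edges = set()
    # combinations() keeps schedule order, so the first action is the earlier one.
    for (op1, t1, x1), (op2, t2, x2) in combinations(actions, 2):
        if t1 != t2 and x1 == x2 and "W" in (op1, op2):
            edges.add((t1, t2))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph given by edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:      # back edge: node is on the current path
            return True
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


if __name__ == "__main__":
    edges = precedence_edges(parse(SCHEDULE))
    print("edges:", sorted(edges))
    print("conflict-serializable:", not has_cycle(edges))
```

For the schedule above, the sketch finds the edges T2→T1 (on A), T3→T2 (on B) and T1→T3 (on C), which form the cycle reported in the solution to part b).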
e) [2 marks] What is two-phase locking?

Solution: A transaction must obtain all locks for all resources it needs before releasing any lock.

f) [3 marks] Describe an example of two transactions, each with a sequence of read/write steps, running concurrently, where using locking but not two-phase locking produces an inconsistent database state. Specify values for the records you are reading/writing to demonstrate your solution.

Solution:

Question 5. Logs & Recovery [20 marks total]

a) [6 marks] Undo Logging
Consider the following sequence of UNDO log records with a non-quiescent checkpoint:

<START S> <S,A,60> <COMMIT S> <START T> <T,A,10> <START U> <U,B,20> <T,C,30> <START V> <U,D,40> <START CKPT(T,U,V)> <V,F,70> <COMMIT U> <T,E,50> <COMMIT T> <V,B,80> <COMMIT V>

i. [1 mark] When is the <END CKPT> record written?

Solution: <END CKPT> will appear immediately after or before <COMMIT V>

ii. [5 marks] For each possible point at which a crash could occur, how far back in the log must we look to find all possible incomplete transactions?

Solution:

b) [7 marks] Redo Logging
Consider the following set of redo log records. Explain what happens to both disk and log in case a failure occurs and the last log record to appear on disk is:

<START A> <A, X, 4> <A, Y, 2> <START B> <A, Z, 3> <START C> <B, M, 100> <B, N, 50> <C, L, 20> <COMMIT B> <START D> <COMMIT C> <D, O, 12> <D, P, 13> <COMMIT D> <COMMIT A> <START E> <E, Q, 85> <E, R, 32> <COMMIT E>

i. [1 mark] <COMMIT C>

Solution:

ii. [3 marks] <START E>

Solution:

iii.
[3 marks] <B, N, 50>

Solution:

c) [7 marks] Undo/Redo Logging
Consider the following set of undo/redo log records. Explain what happens to both disk and log in case a failure occurs and the last log record to appear on disk is:

<START A> <A, X, 4, 41> <A, Y, 2, 21> <START B> <A, Z, 3, 31> <START C> <B, M, 100, 101> <B, N, 50, 51> <C, L, 20, 21> <COMMIT B> <START D> <COMMIT C> <D, O, 12, 11> <D, P, 13, 15> <COMMIT D> <COMMIT A> <START E> <E, Q, 85, 81> <E, R, 32, 31> <COMMIT E>

i. [1 mark] <COMMIT C>

Solution:

ii. [3 marks] <START E>

Solution:

iii. [3 marks] <B, N, 50, 51>

Solution:

Question 6. SQL Transaction Modes [11 marks total]

In this question, you are going to provide an example which includes 2 transactions to show how each transaction isolation level works. Your example must be different from the one in the lecture slides. You will show the result of running the two transactions at the same time (i.e. concurrently) in the same isolation level.

Note: below is the example from the lecture slides. Students are not allowed to use this one. Each student should come up with his/her own example solution.

T1:
(max) SELECT MAX(price) FROM Sells WHERE bakery = 'Joe''s Bakery';
(min) SELECT MIN(price) FROM Sells WHERE bakery = 'Joe''s Bakery';

T2:
(del) DELETE FROM Sells WHERE bakery = 'Joe''s Bakery';
(ins) INSERT INTO Sells VALUES ('Joe''s Bakery', 'French', 3.50);

a) [1 mark] Draw a table with column names and some data inside that you will use to answer the rest of this question:

b) [2 marks] Write two SQL transactions doing operations on the table you defined in a). One of the two transactions must include update and insert SQL statements. The content of your transactions must be relevant to this question and not just arbitrary SQL.

Solution:
Transaction 1 SQL statements:
Transaction 2 SQL statements:

c) [2 marks] For the SERIALIZABLE isolation level, what is the result of running T1 and T2 both at the same time in that level?
Show the order of execution of the statements in T1 and T2.

d) [2 marks] For the REPEATABLE READ isolation level, what is the result of running T1 and T2 both at the same time in that level? Show the order of execution of the statements in T1 and T2.

e) [2 marks] For the READ COMMITTED isolation level, what is the result of running T1 and T2 both at the same time in that level? Show the order of execution of the statements in T1 and T2.

f) [2 marks] For the READ UNCOMMITTED isolation level, what is the result of running T1 and T2 both at the same time in that level? Show the order of execution of the statements in T1 and T2.

End of Exam