COP 5725 Advanced Database Systems
Spring 2016, Assignment 2
Instructor: Peixiang Zhao
TA: Esra Akbas
Due date: Monday, 04/18/2016, during class

Problem 1 [10 points]
Rewrite the following relational expression

    πL (R(a, b, c) ⋈ S(b, c, d, e))

by pushing the projection operator, π, down as far as it can go, if L is:
1. [5 points] b + c → x, c + d → y;
2. [5 points] a, b, a + d → z.

Problem 2 [10 points]
Consider the following query that joins four relations A, B, C, and D, i.e., A ⋈ B ⋈ C ⋈ D.
1. [7 points] How many different orders are there for processing A ⋈ B ⋈ C ⋈ D? Note: to simplify, assume that join orders are symmetric, i.e., A ⋈ B is equivalent to B ⋈ A. For instance, we consider ((A ⋈ B) ⋈ C) ⋈ D and D ⋈ (C ⋈ (B ⋈ A)) to be the same order.
2. [3 points] How many join orders are left-deep?

Problem 3 [15 points]
Below are the statistics of four relations W, X, Y, and Z:

    W(a, b)           X(b, c)           Y(c, d)           Z(d, e)
    T(W) = 100        T(X) = 200        T(Y) = 300        T(Z) = 400
    V(W, a) = 20      V(X, b) = 50      V(Y, c) = 50      V(Z, d) = 40
    V(W, b) = 60      V(X, c) = 100     V(Y, d) = 50      V(Z, e) = 100

Estimate the sizes of the relations that result from the following expressions:
1. [5 points] W ⋈ X ⋈ Y ⋈ Z;
2. [5 points] σc=20 (Y);
3. [5 points] σa=1 AND b>2 (W).

Problem 4 [20 points]
For the relations of Problem 3, give the dynamic programming table entries that evaluate all possible join orders, allowing all tree shapes. What is the best choice for each join order?

Problem 5 [15 points]
The Apriori algorithm makes use of prior knowledge of subset support properties.
1. [5 points] Given a frequent itemset l and a subset s of l, prove that the confidence of the rule s′ → (l − s′) cannot be more than the confidence of s → (l − s), where s′ is a subset of s;
2. [10 points] A partitioning variation of Apriori subdivides the transactions of a database D into n nonoverlapping partitions.
Prove that any itemset that is frequent in D must be frequent in at least one partition of D.

Problem 6 [30 points]
The Apriori algorithm uses a candidate-generation and frequency-counting strategy for frequent itemset mining. Candidate itemsets of size (k + 1) are created by joining pairs of frequent itemsets of size k. A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the transaction database shown in Table 1 with minsup = 30%, i.e., any itemset occurring in fewer than 3 transactions is considered infrequent.

Table 1: A Sample of Market Basket Transactions

    Transaction ID    Items Bought
    1                 {a, b, d, e}
    2                 {b, c, d}
    3                 {a, b, d, e}
    4                 {a, c, d, e}
    5                 {b, c, d, e}
    6                 {b, d, e}
    7                 {c, d}
    8                 {a, b, c}
    9                 {a, d, e}
    10                {b, d}

1. [12 points] Draw an itemset lattice representing the transaction database in Table 1. Label each node in the lattice with one of the following letters:
• N: if the itemset is not considered to be a candidate itemset by the Apriori algorithm;
• F: if the itemset is frequent;
• I: if the candidate itemset is infrequent after support counting.
2. [2 points] What is the percentage of frequent itemsets (w.r.t. all itemsets in the lattice)?
3. [2 points] What is the pruning ratio of the Apriori algorithm on this database? (The pruning ratio is defined as the percentage of itemsets not considered to be candidates.)
4. [2 points] What is the false alarm rate? (The false alarm rate is the percentage of candidate itemsets that are found to be infrequent after performing support counting.)
5. [12 points] Redraw the itemset lattice representing the transaction database in Table 1. Label each node with the following letter(s):
• M: if the node is a maximal frequent itemset;
• C: if it is a closed frequent itemset;
• N: if it is frequent but neither maximal nor closed;
• I: if it is infrequent.
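As a sanity check while working Problem 6, the supports over the ten transactions of Table 1 can be brute-forced. The sketch below (illustrative only; the variable names are my own, and the 30% threshold over 10 transactions corresponds to a support count of 3) counts supports of all itemsets over {a, b, c, d, e} and identifies the frequent and maximal frequent ones:

```python
from itertools import combinations

# Transactions from Table 1, one set per transaction ID 1..10
TRANSACTIONS = [
    set("abde"), set("bcd"), set("abde"), set("acde"), set("bcde"),
    set("bde"), set("cd"), set("abc"), set("ade"), set("bd"),
]
MINSUP = 3  # minsup = 30% of 10 transactions

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in TRANSACTIONS)

# All non-empty itemsets over {a, b, c, d, e}, split by the support threshold
ITEMS = "abcde"
all_itemsets = [set(c) for k in range(1, 6) for c in combinations(ITEMS, k)]
frequent = [s for s in all_itemsets if support(s) >= MINSUP]

# A frequent itemset is maximal if it has no proper frequent superset
maximal = [s for s in frequent if not any(s < f for f in frequent)]
```

For example, `support({'d'})` is 9 and `support({'b', 'd', 'e'})` is 4, so {b, d, e} is frequent; the `frequent` and `maximal` lists can then be checked against the lattice labels of parts 1 and 5.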
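For Problems 2 and 4, the space of join trees can also be enumerated programmatically. The sketch below (illustrative only, not part of the assignment) counts unordered binary join trees over a set of relations, treating R ⋈ S and S ⋈ R as the same tree, by fixing the first relation on the left side of each split so no {left, right} partition is counted twice; the result can be compared against the closed form (2n − 3)!! for n relations:

```python
def count_join_trees(rels):
    """Count unordered binary join trees over the given relations,
    treating R ⋈ S and S ⋈ R as the same tree (commutativity)."""
    if len(rels) == 1:
        return 1
    first, rest = rels[0], rels[1:]
    total = 0
    # Enumerate unordered {left, right} splits: `first` always goes left,
    # and each bitmask decides which of the remaining relations join it.
    for mask in range(2 ** len(rest)):
        left = [first] + [r for i, r in enumerate(rest) if mask >> i & 1]
        right = [r for i, r in enumerate(rest) if not (mask >> i & 1)]
        if right:  # both sides of a join must be non-empty
            total += count_join_trees(left) * count_join_trees(right)
    return total
```

Restricting the enumeration so that the right side is always a single relation would count only the left-deep trees, which is part 2 of Problem 2.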