Dr. Christoph F. Eick
Draft Graded Homework 2 COSC 6340 Spring 2004
<Easter> Egg </Easter>

Due: problems 5a, 5c, 7, and 8 are due We., April 14, 11p (electronic submission); problem 5b is due Sa., April 17, 11p (electronic submission); all other problems are due Sa., May 1, 11p (electronic submission; please submit hardcopies of your solutions on May 3 or 4!!)

5) Clustering [25] Graded
a) Suppose the task is to cluster the following seven points (1, 1), (1, 3), (4, 5), (5, 5), (2, 1), (7, 7), (6, 3) into 2 clusters (k=2), assuming Manhattan distance (d((x1,x2),(y1,y2)) = |x1-y1| + |x2-y2|), using the K-means and the K-medoid clustering algorithms. Assume that K-means initially assigns (1, 1), (1, 3), and (7, 7) to Cluster1 and the other 4 points to Cluster2, and that K-medoid chooses points (5, 5) and (7, 7) as its initial medoids. How do those clusters/medoids change in each iteration of applying the two algorithms? You can write a small program for this homework if helpful. Explain how you derived your answer! Which algorithm do you believe is faster in solving this problem? [9]
b) Now assume you have to add a clustering operator that supports K-means clustering to the Oracle Database System. Give a sketch of a system design that adds clustering capabilities to Oracle. Also discuss the key data structures and algorithms that are used in your system architecture. Limit your answer to less than two pages! [12]
c) What are the key ideas and key features of the BIRCH clustering algorithm (what properties make it scalable)? Limit your answer to 5 sentences! [3]

6) Similarity Assessment Ungraded
Assume the following relation Students(ssn, age, gpa, avg_class_rank), which contains the students that were admitted into our undergraduate program in the year 2000, is given. You can assume that
age is an integer; the maximum age is 50, the minimum age is 20, the average age is 28, and the mean absolute deviation is 10.
gpa denotes the UH COSC gpa; the average gpa is 2.8 and the mean absolute deviation is 0.6; the maximum gpa is 4.0 and the minimum gpa is 0.
avg_class_rank has 5 values (4=top 5%, 3=top 15%, 2=top 25%, 1=top half, 0=bottom half).
a) Define a student similarity (or distance) measure that treats gpa and avg_class_rank as being of major importance, and age as being of minor importance. [7]
b) Using your (dis)similarity measure, compute the (dis)similarity for the following 2 students [2]:
1. (111111111, 25, 2.8, 2)
2. (222222222, 24, 3.7, 3)

7) Association Rule Mining [8] Graded
a) Assume you have to apply the APRIORI algorithm, with a minimum support of 40% (4 out of 10), to the following set of 10 transactions that involve purchases of items A, B, C, D, E, F, G.
T1={A, C, D}       T6={A, C, D, E, F}
T2={A, D, F}       T7={A, B, D, F}
T3={D, E, F}       T8={A, B, C, D, F}
T4={A, B, D, F}    T9={A, B, C, E}
T5={A, F}          T10={A, D, E}
Describe how Apriori’s Large Item Set Generation algorithm works for the example. List which candidate item sets are generated in each pass, and which remain in the candidate item set after pruning (use the notation of the Han book). [6]
b) Assuming the minimum confidence is 75%, give 2 rules (of your own choice) that would be generated by an association rule mining algorithm. [1]

8) Multi-Relational Data Mining [3] Graded
What are the goals and objectives of multi-relational data mining? Limit your answer to 5 sentences!

9) Implementation of Joins and Physical Database Design Ungraded
Assume two relations R1(A, B, C) and R2(A, D, E) are given; R1 and R2 are both stored as unordered files; R1 contains 1000000 (1 million) tuples and R2 contains 500000 (half a million) tuples. Attributes A, B, C, D, and E need 4 bytes of storage each, and blocks have a size of 4096 bytes. A is the primary key of both R1 and R2, but only very few A-values occur in both R1 and R2.
Moreover, we assume that static hashing is used to implement index structures, and that index pointers require 4 bytes of storage; furthermore, you can assume that index blocks are 80% full and that there are no overflow pages. Moreover, the database system only supports the block nested loops join (only 3 blocks of buffer are available) and the index nested loops join. What index structures would you create to speed up the following 2 queries?
Q1: Select B, E from R1, R2 where R1.A=R2.A; returns 100 answers
Q2: Select B from R1, R2 where R1.A=R2.A and D=12; returns 2 answers (assume there are 20000 tuples in R2 with D=12)
Describe which index structures you would create (justify your design!), and compute the cost of executing Q1 and Q2 for your chosen design. Also give the query evaluation plan you assume the database system should use to implement query Q1.

11) Query Optimization [10] Graded
(Reading the Chaudhuri article might help!!)
a) Assume three relations R1(A,B,C), R2(A,D,E), and R3(A,F) and the following SQL query are given:
SELECT A,C,E
FROM R1, R2, R3
WHERE R1.A=R2.A and R2.A=R3.A and D=12 and B>12 and F>14
Moreover, there is a hash index available for attribute D, and another hash index is available for attribute B. Give two “reasonable”, quite different query execution plans (Chaudhuri calls those physical operator trees) that implement the above query. [4]
b) Another critical problem in query optimization is the propagation of statistical information. Explain what this problem is, using a query plan you generated for subproblem a) as an example. [3]
c) Most query optimizers only consider linear plans. What, in your opinion, is the reason for that? [3]

12) XML documents and XML DTD [11] Graded
Take the University E/R Diagram http://www.cs.uh.edu/~ceick/6340/ER-NFL.ppt and define an “equivalent” XML DTD (if you prefer to use XML Schema for this problem, you are allowed to do so) that is suitable for exchanging information in the university world.
Also submit an XML document that is at least 40 lines long and that is valid with respect to the DTD you defined. Also report whether there were particular difficulties in mapping the E/R diagram. Also list all the constraints of the University E/R Diagram that could not be expressed in the DTD you generated.

13) Semantic Web [4] Graded
What is W3C’s vision concerning the Semantic Web? What are W3C’s most important initiatives concerning the Semantic Web? Limit your answers to 7-10 sentences! Reading http://www.w3.org/2001/sw/Activity might help in answering these questions.
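
Note on problem 5a: since a small program is permitted there, the following is a minimal Python sketch of the Manhattan distance as defined in that problem, together with a helper that assigns a point to its nearest center. The names `manhattan` and `nearest_center` are hypothetical and not part of the assignment, and the sketch deliberately does not carry out the K-means or K-medoid iterations.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def manhattan(p: Point, q: Point) -> float:
    """Manhattan distance d((x1,x2),(y1,y2)) = |x1-y1| + |x2-y2| from problem 5a."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def nearest_center(p: Point, centers: Sequence[Point]) -> int:
    """Return the index of the center closest to p.

    Assumption: ties are broken in favor of the lowest index; the
    assignment does not specify a tie-breaking rule.
    """
    return min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))


if __name__ == "__main__":
    # Distance between two of the seven points listed in problem 5a.
    print(manhattan((1, 1), (4, 5)))
    # Which of the two initial medoids is closer to the point (6, 3)?
    print(nearest_center((6, 3), [(5, 5), (7, 7)]))
```

Both algorithms repeat an assignment step like `nearest_center` after each update of the centers (or medoids); how the centers are updated is the part the problem asks you to work out.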