Assignment2

Dr. Christoph F. Eick
Draft Graded Homework2 COSC 6340
Spring 2004
Due: problems 5a, 5c, 7, and 8 are due We., April 14, 11p (electronic submission);
problem 5b is due Sa., April 17, 11p (electronic submission); all other problems are
due Sa., May 1, 11p (electronic submission; please submit hardcopies of your solutions
on May 3 or 4!!)
5) Clustering [25] Graded
a) Suppose the task is to cluster the following seven points (1, 1), (1, 3), (4, 5), (5,
5), (2, 1), (7, 7), (6, 3) into 2 clusters (k=2) assuming Manhattan Distance
(d((x1,x2),(y1,y2))=|x1-y1| + |x2-y2|) using the K-means and the K-medoid
clustering algorithms. Assume that K-means initially assigns (1, 1), (1, 3) and
(7, 7) to Cluster1 and the other 4 points to Cluster2, and that K-medoid
chooses points (5, 5) and (7, 7) as its initial medoids. How do those
clusters/medoids change in each iteration of applying the two algorithms? You
can write a small program for this homework if helpful. Explain how you derived
your answer! Which algorithm do you believe is faster in solving this problem? [9]
b) Now assume you have to add a clustering operator that supports K-means
clustering to the Oracle Database System. Give a sketch of a system design that
adds clustering capabilities to Oracle. Also discuss the key data structures and
algorithms that are used in your system architecture. Limit your answer to less
than two pages! [12]
c) What are the key ideas and key features of the BIRCH clustering algorithm
(what properties make it scalable)? Limit your answer to 5 sentences! [3]
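Part a) notes that a small program may help. The following is a minimal Python sketch of the K-means half only, under stated assumptions: centroids are computed as coordinate-wise means, points are assigned to the nearest centroid under Manhattan distance, and no cluster ever becomes empty. The names POINTS, manhattan, centroid, and kmeans are illustrative and not part of the assignment; K-medoid is not covered here.

```python
# The seven points from part a), in the order given in the assignment.
POINTS = [(1, 1), (1, 3), (4, 5), (5, 5), (2, 1), (7, 7), (6, 3)]


def manhattan(p, q):
    """Manhattan distance d((x1,x2),(y1,y2)) = |x1-y1| + |x2-y2|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster (assumption: mean, not median)."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)


def kmeans(clusters):
    """Repeat assign/update until the clusters stop changing.

    Returns the list of clusterings, starting with the initial one.
    Clusters are lists of points kept in POINTS order so they can be compared.
    """
    history = [clusters]
    while True:
        centers = [centroid(c) for c in clusters]
        new = [[] for _ in centers]
        for p in POINTS:
            nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
            new[nearest].append(p)
        if new == clusters:
            return history
        history.append(new)
        clusters = new


# Initial assignment stated in part a).
initial = [[(1, 1), (1, 3), (7, 7)], [(4, 5), (5, 5), (2, 1), (6, 3)]]
history = kmeans(initial)
for step, clustering in enumerate(history):
    print(step, clustering)
```

Printing the history shows each intermediate clustering, which is the trace part a) asks you to explain by hand.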
6) Similarity Assessment Ungraded
Assume the following relation Students(ssn, age, gpa, avg_class_rank) is given; it
contains the students that were admitted in the year 2000 into our undergraduate
program. You can assume that
- age is an integer; the maximum age is 50, the minimum age is 20, the average
  age is 28, and the mean absolute deviation is 10.
- gpa denotes the UH COSC gpa; the average gpa is 2.8 and the mean absolute
  deviation is 0.6; the maximum gpa is 4.0 and the minimum gpa is 0.
- avg_class_rank has 5 values (4=top-5%, 3=top-15%, 2=top-25%, 1=top half,
  0=bottom half)
a) Define a student similarity (or distance) measure that treats gpa and
avg_class_rank as being of major importance, and age as being of minor
importance. [7]
b) Using your (dis)similarity measure, compute the (dis)similarity for the
following 2 students [2]:
1. (111111111, 25, 2.8, 2)
2. (222222222, 24, 3.7, 3)
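One possible shape for such a measure, shown only as a hedged sketch and not as the expected answer: scale the gpa and age differences by their mean absolute deviations, scale the rank difference by its range (4), and combine them with weights. The weights (1.0, 1.0, 0.2) and all names below are assumptions chosen for illustration.

```python
# Statistics taken from the problem statement.
MAD_AGE = 10.0   # mean absolute deviation of age
MAD_GPA = 0.6    # mean absolute deviation of gpa
MAX_RANK = 4     # avg_class_rank ranges over 0..4

# Hypothetical weights: gpa and rank are "major", age is "minor".
WEIGHTS = {"gpa": 1.0, "rank": 1.0, "age": 0.2}


def student_distance(s1, s2):
    """Weighted distance between two (ssn, age, gpa, avg_class_rank) tuples.

    The ssn is an identifier and is ignored. The difference of two
    standardized values equals the raw difference divided by the
    mean absolute deviation, because the mean cancels out.
    """
    _, age1, gpa1, rank1 = s1
    _, age2, gpa2, rank2 = s2
    d_gpa = abs(gpa1 - gpa2) / MAD_GPA
    d_rank = abs(rank1 - rank2) / MAX_RANK
    d_age = abs(age1 - age2) / MAD_AGE
    total = (WEIGHTS["gpa"] * d_gpa
             + WEIGHTS["rank"] * d_rank
             + WEIGHTS["age"] * d_age)
    return total / sum(WEIGHTS.values())


student1 = (111111111, 25, 2.8, 2)
student2 = (222222222, 24, 3.7, 3)
print(round(student_distance(student1, student2), 4))
```

Other designs are equally defensible, for example scaling every attribute to [0, 1] by its range or converting the distance into a similarity; the point of the sketch is only that the attributes need a common scale before they are weighted.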
7) Association Rule Mining [8] Graded
a) Assume you have to apply the APRIORI algorithm assuming that the minimum
support is 40% (4 out of 10) to the following set of 10 transactions that involve
purchases of items A, B, C, D, E, F, G.
T1={A, C, D}
T2={A, D, F}
T3={D, E, F}
T4={A, B, D, F}
T5={A, F}
T6={A, C, D, E, F}
T7={A, B, D, F}
T8={A, B, C, D, F}
T9={A, B, C, E}
T10={A, D, E}
Describe how Apriori’s Large Item Set Generation algorithm works for the example.
List what candidate item sets will be generated in each pass, and which remain in the
candidate item set after pruning (use notations of the Han book) [6]
b) Assuming minimum confidence is 75%, give 2 rules (of your own choice) that
would be generated by an association rule mining algorithm. [1]
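The level-wise idea behind part a) can be sketched in Python as follows. This is a simplified illustration, not the Han book's exact pseudocode: the join step here unions any two frequent (k-1)-item sets that yield a k-item set, and the prune step then drops candidates that have an infrequent (k-1)-subset. The names TRANSACTIONS and apriori are illustrative. Item G never occurs in a transaction, so it never becomes a candidate in this sketch.

```python
from itertools import combinations

# T1..T10 from the problem statement.
TRANSACTIONS = [
    set("ACD"), set("ADF"), set("DEF"), set("ABDF"), set("AF"),
    set("ACDEF"), set("ABDF"), set("ABCDF"), set("ABCE"), set("ADE"),
]


def apriori(transactions, min_count):
    """Return {frequent item set: support count}, found level by level."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count pass: one scan over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step (simplified): combine frequent sets into k-item sets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


frequent = apriori(TRANSACTIONS, min_count=4)
for itemset, count in sorted(frequent.items(), key=lambda e: (len(e[0]), sorted(e[0]))):
    print("".join(sorted(itemset)), count)
```

Rule confidence for part b) follows from these counts: the confidence of X -> Y is support(X and Y together) divided by support(X).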
8) Multi-Relational Data Mining [3] Graded
What are the goals and objectives of multi-relational data mining? Limit your answer to 5 sentences!
9) Implementation of Joins and Physical Database Design Ungraded
Assume two relations R1(A, B, C) and R2(A, D, E) are given; R1 and R2 are both stored
as unordered files; R1 contains 1000000 (1 million) tuples and R2 contains 500000
(half a million) tuples. Attributes A, B, C, D, and E need 4 bytes of storage each, and
blocks have a size of 4096 bytes. A is the primary key of both R1 and R2, but only very
few A-values occur in both R1 and R2. Moreover, we assume that static hashing is used
to implement index structures, and that index pointers require 4 bytes of storage;
furthermore, you can assume that index blocks are 80% full and that there are no
overflow pages. Moreover, the database system only supports the block nested loops
join (only 3 blocks of buffer are available) and the index nested loops join. What index
structures would you create to speed up the following 2 queries?
Q1: Select B, E
    from R1, R2
    where R1.A=R2.A;
    returns 100 answers
Q2: Select B
    from R1, R2
    where R1.A=R2.A
    and D=12;
    returns 2 answers (assume there are 20000 tuples in R2 with D=12)
Describe which index structure you would create (justify your design!), and compute the
cost for executing Q1 and Q2 for your chosen design. Also give the query evaluation plan
you assume the database system should use to implement query Q1.
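As a baseline for the cost comparison, the block counts and the cost of a block nested loops join without any index can be worked out as in the sketch below. It rests on simplifying assumptions that are not stated in the problem: records do not span blocks, block headers are ignored, one buffer is reserved for the inner relation and one for output, and the cost of writing the result is not counted. The function names are illustrative.

```python
import math

BLOCK_SIZE = 4096   # bytes per block
ATTR_SIZE = 4       # bytes per attribute


def num_blocks(n_tuples, n_attrs):
    """Blocks needed for an unordered file with unspanned fixed-size records."""
    tuples_per_block = BLOCK_SIZE // (n_attrs * ATTR_SIZE)
    return math.ceil(n_tuples / tuples_per_block)


def block_nested_loops_cost(b_outer, b_inner, buffers=3):
    """Block accesses: read the outer once, and scan the inner once per
    chunk of (buffers - 2) outer blocks."""
    chunks = math.ceil(b_outer / (buffers - 2))
    return b_outer + chunks * b_inner


b_r1 = num_blocks(1_000_000, 3)   # R1(A, B, C)
b_r2 = num_blocks(500_000, 3)     # R2(A, D, E)
print(b_r1, b_r2)
print(block_nested_loops_cost(b_r2, b_r1))  # R2 as outer relation
print(block_nested_loops_cost(b_r1, b_r2))  # R1 as outer relation
```

With only 3 buffers, choosing the smaller relation as the outer one changes the total only slightly, which is why an index-based plan is worth comparing against this baseline.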
11) Query Optimization [10] Graded (Reading the Chaudhuri article might
help!!)
a) Assume three relations R1(A,B,C), R2(A,D,E), and R3(A,F) and the following SQL
query are given:
SELECT A,C,E
FROM R1, R2, R3
WHERE R1.A=R2.A and R2.A=R3.A and D=12 and B>12
and F>14
Moreover, there is a hash index available for attribute D and another hash index is
available for attribute B. Give two “reasonable”, quite different query execution plans
(Chaudhuri calls those physical operator trees) that implement the above query [4]:
b) Another critical problem in query optimization is the propagation of statistical
information. Explain what this problem is by using a query plan you generated for
sub-problem a) as an example. [3]
c) Most query optimizers only consider linear plans. What, in your opinion, is the reason
for that? [3]
12) XML documents and XML DTD [11] Graded
Take the University E/R Diagram http://www.cs.uh.edu/~ceick/6340/ER-NFL.ppt and
define an “equivalent” XML DTD (if you prefer to use XML schema for this problem
you are allowed to do so) that is suitable to exchange information in the university world.
Also submit an XML document that is at least 40 lines long and valid with respect to the
DTD you defined. Also report if there were particular difficulties in mapping the E/R
diagram. Also list all the constraints of the University E/R Diagram that could not be
expressed in the DTD you generated.
13) Semantic Web [4] Graded
What is W3C’s vision concerning the Semantic Web? What are W3C’s most important
initiatives concerning the Semantic Web? Limit your answers to 7-10 sentences! Reading
http://www.w3.org/2001/sw/Activity might help in answering these questions.