Uploaded by shatrughnacatholic

Solution to Assignment1

1.8 Based on your observation, describe another possible kind of knowledge that needs to be
discovered by data mining methods but has not been listed in this chapter. Does it require a mining
methodology that is quite different from those outlined in this chapter?
Answer: There is no standard answer to this question; an answer can be judged on the novelty
and quality of the proposal. For example, one may propose partial periodicity as a new kind of
knowledge, where a pattern is partially periodic if only some offsets within a given time
period of a time series show repeating behavior.
1.10 Describe two challenges to data mining regarding performance issues.
Answer: One challenge to data mining regarding performance issues is the efficiency and scalability of
data mining algorithms. Data mining algorithms must be efficient and scalable in order to effectively
extract information from large amounts of data in databases within predictable and acceptable running
times. Another challenge is the parallel, distributed, and incremental processing of data mining
algorithms. The need for parallel and distributed data mining algorithms has been brought about by
the huge size of many databases, the wide distribution of data, and the computational complexity of
some data mining methods. Due to the high cost of some data mining processes, incremental data
mining algorithms incorporate database updates without the need to mine the entire data again from
scratch.
2.4 Suppose that a data warehouse for Big-University consists of the following four dimensions:
student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest
conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade
measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the
average grade for the given combination.
a. Draw a snowflake schema diagram for the data warehouse.
Answer: A snowflake schema is shown in the figure below.
b. Starting with the base cuboid [student, course, semester, instructor], what specific OLAP
operations (e.g., roll-up from semester to year) should one perform in order to list the average
grade of CS courses for each Big-University student?
Answer:
Roll-up on course from course_id to department.
Roll-up on student from student_id to university.
Dice on course and student with department = "CS" and university = "Big-University".
Drill-down on student from university to student_name.
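The net effect of these operations can be sketched in a few lines of Python. This is only an illustration: the rows, the names in them, and the function avg_cs_grade_per_student are hypothetical, and the sketch assumes that semester and instructor are aggregated away so that one average remains per student.

```python
from collections import defaultdict

# Hypothetical base-cuboid rows, with course already rolled up to department:
# (student_name, department, semester, instructor, grade)
rows = [
    ("Ann", "CS", "Fall", "Lee", 4.0),
    ("Ann", "CS", "Spring", "Kim", 3.0),
    ("Ann", "Math", "Fall", "Roy", 2.0),
    ("Bob", "CS", "Fall", "Lee", 3.5),
    ("Bob", "EE", "Spring", "Ng", 4.0),
]

def avg_cs_grade_per_student(rows, department="CS"):
    """Keep only one department (the dice), then average grades per student."""
    totals = defaultdict(lambda: [0.0, 0])  # student -> [grade sum, grade count]
    for student, dept, _semester, _instructor, grade in rows:
        if dept == department:
            totals[student][0] += grade
            totals[student][1] += 1
    return {student: total / n for student, (total, n) in totals.items()}

result = avg_cs_grade_per_student(rows)
print(result)
```

The sketch computes the final listing directly from base rows; an OLAP engine would instead reach the same cells by navigating precomputed cuboids.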
c. If each dimension has five levels (including all), such as student < major < status < university < all,
how many cuboids will this cube contain (including the base and apex cuboids)?
Answer: Each dimension has L_i = 5 - 1 = 4 levels, not counting all, and there are n = 4 dimensions.
So this cube will contain (L_i + 1)^n = 5^4 = 625 cuboids.
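The count above is the product of (L_i + 1) over all dimensions. A minimal sketch of that product follows; the function name total_cuboids is chosen here for illustration only.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Product of (L_i + 1), where L_i excludes the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Four dimensions, each with 5 levels including 'all', so L_i = 5 - 1 = 4.
levels_excluding_all = [5 - 1] * 4
n_cuboids = total_cuboids(levels_excluding_all)
print(n_cuboids)  # 625
```

With one level per dimension (L_i = 1), the same function gives 2^n, the size of the simple lattice used in exercise 2.12.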
2.7 Regarding the computation of measures in a data cube:
a. Enumerate three categories of measures, based on the kind of aggregate functions used in
computing a data cube.
Answer: The three categories of measures are distributive, algebraic, and holistic.
b. For a data cube with three dimensions time, location, and product, which category does the
function variance belong to? Describe how to compute it if the cube is partitioned into many chunks.
Answer: Variance is an algebraic aggregate function. If the cube is partitioned into many
chunks, the variance can be computed as follows: read in the chunks one by one, keeping track
of the accumulated (1) number of tuples N, (2) sum of x_i^2, and (3) sum of x_i. After reading all the
chunks, compute the average of the x_i's as the sum of x_i divided by the total number of tuples. Use
the formula shown in the hint to get the variance; equivalently, the variance is the sum of x_i^2
divided by N, minus the square of the average.
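The chunk-by-chunk accumulation can be sketched as follows. This is an illustration under stated assumptions: it computes the population variance as (sum of x_i^2)/N minus the squared mean, the chunk contents are made-up numbers, and at least one tuple is assumed to exist.

```python
def chunked_variance(chunks):
    """Population variance from three running totals kept across chunks."""
    count = 0        # (1) number of tuples seen so far
    total_sq = 0.0   # (2) running sum of x_i squared
    total = 0.0      # (3) running sum of x_i
    for chunk in chunks:
        for x in chunk:
            count += 1
            total_sq += x * x
            total += x
    mean = total / count
    return total_sq / count - mean * mean

# Hypothetical measure values split across three chunks.
chunks = [[2.0, 4.0], [4.0, 4.0, 5.0], [5.0, 7.0, 9.0]]
variance = chunked_variance(chunks)
print(variance)  # 4.0
```

Only three numbers are carried from chunk to chunk, which is what makes variance algebraic rather than holistic.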
c. Suppose the function is "top 10 sales." Discuss how to efficiently compute this measure in a data cube.
Answer: For each cuboid, use 10 units to register the top 10 sales found so far. Read the data in each
cuboid once. If the sales amount in a tuple is greater than the smallest one in the list, insert the new
sales amount into the list and discard the smallest one. The computation of a higher-level cuboid can
be performed similarly by propagating the top-10 cells of its corresponding lower-level cuboids.
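One way to realize this single-pass scan is with a bounded min-heap, sketched below. The use of heapq, the function names top_k_sales and merge_top_k, and the sample sales figures are all assumptions made for illustration; the answer above does not prescribe a data structure.

```python
import heapq

def top_k_sales(sales, k=10):
    """Scan the sales once, keeping the k largest amounts in a min-heap."""
    heap = []
    for amount in sales:
        if len(heap) < k:
            heapq.heappush(heap, amount)
        elif amount > heap[0]:
            # Larger than the smallest kept value: swap it in.
            heapq.heapreplace(heap, amount)
    return sorted(heap, reverse=True)

def merge_top_k(child_lists, k=10):
    """Top k for a higher-level cell, from the top-k lists of its lower-level cells."""
    return heapq.nlargest(k, (a for amounts in child_lists for a in amounts))

# Hypothetical sales amounts for two lower-level cells.
cell_a = top_k_sales(range(1, 51))
cell_b = top_k_sales(range(30, 101, 5))
parent = merge_top_k([cell_a, cell_b])
print(parent)
```

Merging works because any value in the parent's top 10 must already be in the top 10 of the lower-level cell it came from.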
2.12 Suppose that a base cuboid has three dimensions A, B, C, with the following number of cells: |A| =
1,000,000, |B|=100, and |C|=1000. Suppose that each dimension is evenly partitioned into 10 portions
for chunking.
a. Assuming each dimension has only one level, draw the complete lattice of the cube.
b. If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if
the cube is dense?
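Answer: The size of each cuboid in bytes is its number of cells times 4.
0-D Cuboid (apex)
All: 1 * 4 = 4
1-D Cuboids
A: 1,000,000 * 4 = 4,000,000
B: 100 * 4 = 400
C: 1,000 * 4 = 4,000
2-D Cuboids
AB: 1,000,000 * 100 * 4 = 400,000,000
AC: 1,000,000 * 1,000 * 4 = 4,000,000,000
BC: 100 * 1,000 * 4 = 400,000
3-D Cuboid (base)
ABC: 1,000,000 * 100 * 1,000 * 4 = 400,000,000,000
Total: 4 + 4,004,400 + 4,400,400,000 + 400,000,000,000 = 404,404,404,404 bytes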
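The per-cuboid sizes can be checked by enumerating every subset of the dimensions. The following is a small sketch; the names cardinality, BYTES_PER_CELL, and cuboid_sizes are illustrative, and the cube is assumed to be fully dense as the question states.

```python
from itertools import combinations
from math import prod

cardinality = {"A": 1_000_000, "B": 100, "C": 1_000}
BYTES_PER_CELL = 4

def cuboid_sizes(cardinality, bytes_per_cell):
    """Size in bytes of every cuboid of a dense cube, keyed by its dimensions."""
    sizes = {}
    dims = sorted(cardinality)
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            name = "".join(group) or "all"  # the empty subset is the apex cuboid
            sizes[name] = prod(cardinality[d] for d in group) * bytes_per_cell
    return sizes

sizes = cuboid_sizes(cardinality, BYTES_PER_CELL)
total_bytes = sum(sizes.values())
for name, size in sizes.items():
    print(f"{name}: {size:,} bytes")
print(f"total: {total_bytes:,} bytes")
```

The total equals (1,000,000 + 1) * (100 + 1) * (1,000 + 1) * 4 bytes, since each dimension contributes either one of its values or all.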
c. State the order for computing the chunks in the cube that requires the least amount of space,
and compute the total amount of main memory space required for computing the 2-D planes.
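Answer: The order that requires the least space scans the chunks with B varying fastest, then C,
then A, that is, with the dimensions sorted by ascending cardinality. Each dimension is split into
10 portions, so a chunk spans 100,000 cells along A, 10 along B, and 100 along C. The total amount
of main memory space required for computing the 2-D planes is
1,000 x 100 (for the entire BC plane) + 100 x 100,000 (for one row of the AB plane) +
100,000 x 100 (for one chunk of the AC plane) = 100,000 + 10,000,000 + 10,000,000 =
20,100,000 cells = 80,400,000 bytes at 4 bytes per cell.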
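The memory figure can be reproduced with a short calculation. This sketch assumes exactly three dimensions and the usual multiway array aggregation rule that dimensions are scanned in ascending order of cardinality; the function name min_memory_cells is hypothetical.

```python
def min_memory_cells(cardinality, partitions):
    """Scan order and cells held in memory for the 2-D planes of a 3-D cube."""
    # Scan order: smallest dimension varies fastest, largest varies slowest.
    order = sorted(cardinality, key=cardinality.get)
    small, mid, large = order
    chunk = {d: cardinality[d] // partitions for d in order}
    whole_plane = cardinality[small] * cardinality[mid]  # entire BC plane
    one_row = cardinality[small] * chunk[large]          # one row of the AB plane
    one_chunk = chunk[mid] * chunk[large]                # one chunk of the AC plane
    return order, whole_plane + one_row + one_chunk

cardinality = {"A": 1_000_000, "B": 100, "C": 1_000}
order, cells = min_memory_cells(cardinality, partitions=10)
memory_bytes = cells * 4  # 4 bytes per cell, as in part b
print(order, cells, memory_bytes)
```

Keeping the smallest plane whole and only a slice of the two larger planes is what makes this order the cheapest.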