22/11/2019 Solution to Assignment1

1.8 Based on your observation, describe another possible kind of knowledge that needs to be discovered by data mining methods but has not been listed in this chapter. Does it require a mining methodology that is quite different from those outlined in this chapter?

Answer: There is no standard answer for this question; one can judge the quality of an answer by the freshness and quality of the proposal. For example, one may propose partial periodicity as a new kind of knowledge, where a pattern is partially periodic if only some offsets of a certain time period in a time series demonstrate repeating behavior.

1.10 Describe two challenges to data mining regarding performance issues.

Answer: One challenge to data mining regarding performance issues is the efficiency and scalability of data mining algorithms. Data mining algorithms must be efficient and scalable in order to effectively extract information from large amounts of data in databases within predictable and acceptable running times.

Another challenge is the parallel, distributed, and incremental processing of data mining algorithms. The need for parallel and distributed data mining algorithms has been brought about by the huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods. Because some data mining processes are costly, there is also a need for incremental data mining algorithms, which incorporate database updates without mining the entire data again from scratch.

2.4 Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures: count and avg_grade. At the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student.
At higher conceptual levels, avg_grade stores the average grade for the given combination.

a. Draw a snowflake schema diagram for the data warehouse.

Answer: A snowflake schema is shown in the figure below.

b. Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big-University student?

Answer:
Roll up on course from course_id to department.
Roll up on student from student_id to university.
Dice on course and student with department = "CS" and university = "big-university".
Drill down on student from university to student_name.

c. If each dimension has five levels (including all), such as student < major < status < university < all, how many cuboids will this cube contain (including the base and apex cuboids)?

Answer: Each dimension has L_i = 5 - 1 = 4 levels excluding all, and there are N = 4 dimensions. So this cube will contain (L_i + 1)^N = 5^4 = 625 cuboids.

https://cs.nyu.edu/courses/spring03/G22.3033-015/solution1.htm

2.7 Regarding the computation of measures in a data cube:

a. Enumerate three categories of measures, based on the kind of aggregate functions used in computing a data cube.

Answer: The three categories of measures are distributive, algebraic, and holistic.

b. For a data cube with three dimensions time, location, and product, which category does the function variance belong to? Describe how to compute it if the cube is partitioned into many chunks.

Answer: Variance is an algebraic aggregate function. If the cube is partitioned into many chunks, the variance can be computed as follows: read in the chunks one by one, keeping track of the accumulated (1) number of tuples, (2) sum of x_i^2, and (3) sum of x_i. After reading all the chunks, compute the average of the x_i's as the sum of x_i divided by the total number of tuples. Use the formula shown in the hint to get the variance.

c. Suppose the function is "top 10 sales."
Discuss how to efficiently compute this measure in a data cube.

Answer: For each cuboid, use 10 storage units to register the top 10 sales found so far. Read the data in each cuboid once. If the sales amount in a tuple is greater than the smallest one in the list, insert the new sales amount into the list and discard the smallest one. A higher-level cuboid can be computed similarly by propagating the top-10 cells of its corresponding lower-level cuboids.

2.12 Suppose that a base cuboid has three dimensions A, B, C, with the following number of cells: |A| = 1,000,000, |B| = 100, and |C| = 1,000. Suppose that each dimension is evenly partitioned into 10 portions for chunking.

a. Assuming each dimension has only one level, draw the complete lattice of the cube.

b. If each cube cell stores one measure with 4 bytes, what is the total size of the computed cube if the cube is dense?

Answer (number of cells * 4 bytes):

Apex cuboid
All: 1 * 4 = 4

1-D cuboids
A: 1,000,000 * 4 = 4,000,000
B: 100 * 4 = 400
C: 1,000 * 4 = 4,000

2-D cuboids
AB: 1,000,000 * 100 * 4 = 400,000,000
AC: 1,000,000 * 1,000 * 4 = 4,000,000,000
BC: 100 * 1,000 * 4 = 400,000

Base cuboid
ABC: 1,000,000 * 100 * 1,000 * 4 = 400,000,000,000

Total (in bytes): 404,404,404,404

c. State the order for computing the chunks in the cube that requires the least amount of space, and compute the total amount of main memory space required for computing the 2-D planes.

Answer: The total amount of main memory space required for computing the 2-D planes is 1,000 x 100 (for the entire BC plane) + 100 x 100,000 (for one row of the AB plane) + 100,000 x 100 (for one chunk of the AC plane) = 100,000 + 10,000,000 + 10,000,000 = 20,100,000 cells = 80,400,000 bytes.
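
The arithmetic in 2.12 b and c can be checked with a short script. The following is a minimal Python sketch, not part of the original solution: the function names cuboid_sizes and plane_memory_cells are made up for illustration, and the second function assumes exactly three dimensions and a chunk scan order that runs along the smallest dimension first (B, then C, then A). That order is an assumption, but it is consistent with the figures in the answer to 2.12 c (entire BC plane, one row of AB, one chunk of AC).

```python
from itertools import combinations
from math import prod

# Dimension cardinalities and cell size taken from exercise 2.12.
CARDINALITY = {"A": 1_000_000, "B": 100, "C": 1_000}
BYTES_PER_CELL = 4
PARTITIONS = 10  # each dimension is split into 10 equal portions


def cuboid_sizes(cardinality, bytes_per_cell):
    """Return {cuboid name: size in bytes} for every cuboid of a dense cube."""
    dims = sorted(cardinality)
    sizes = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            name = "".join(group) or "All"  # the empty group is the apex cuboid
            cells = prod(cardinality[d] for d in group)  # empty product is 1
            sizes[name] = cells * bytes_per_cell
    return sizes


def plane_memory_cells(cardinality, partitions):
    """Cells kept in memory for the 2-D planes (three dimensions only).

    Assumed scan order: smallest dimension varies fastest, largest slowest.
    """
    fast, mid, slow = sorted(cardinality, key=cardinality.get)  # B, C, A here
    chunk = {d: cardinality[d] // partitions for d in cardinality}
    whole_plane = cardinality[fast] * cardinality[mid]  # entire BC plane
    one_row = cardinality[fast] * chunk[slow]           # one row of the AB plane
    one_chunk = chunk[mid] * chunk[slow]                # one chunk of the AC plane
    return whole_plane + one_row + one_chunk


sizes = cuboid_sizes(CARDINALITY, BYTES_PER_CELL)
total_bytes = sum(sizes.values())
memory_cells = plane_memory_cells(CARDINALITY, PARTITIONS)
memory_bytes = memory_cells * BYTES_PER_CELL
print(total_bytes, memory_cells, memory_bytes)
```

Run as written, the script reproduces the totals stated above: 404,404,404,404 bytes for the dense cube, and 20,100,000 cells (80,400,000 bytes) for the 2-D planes.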